PULP: A Parallel Ultra Low Power platform
for next generation IoT Applications
Davide Rossi1
Francesco Conti1, Andrea Marongiu1,2, Antonio Pullini2, Igor Loi1, Michael Gautschi2,
Giuseppe Tagliavini1, Alessandro Capotondi1, Philippe Flatresse3, Luca Benini1,2
1DEI-UNIBO, 2IIS-ETHZ, 3STMicroelectronics
How efficient do we need to be?
10^12 op/J = 1 pJ/op = 1 GOPS/mW [RuchIBM11]
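The three figures above are the same target expressed in different units:

\[
10^{12}\ \mathrm{op/J}
 \;=\; 1\ \mathrm{op/pJ}
 \;=\; \frac{10^{9}\ \mathrm{op/s}}{10^{-3}\ \mathrm{W}}
 \;=\; 1\ \mathrm{GOPS/mW}
\]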
System View
Sense → Analyze and Classify → Transmit
Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT front-ends; low-rate (periodic) data
Analyze and classify: controller (e.g. Cortex-M, ~25 MOPS) with L2 memory and IOs; application demand spans 1 to 12000 MOPS within a 1-10 mW budget
Transmit: short-range, high-BW link for data; long-range, low-BW link for SW updates and commands; 100 µW - 2 mW
Battery + harvesting powered, a few mW power envelope; Idle: ~1 µW, Active: ~50 mW
Near-Sensor Processing
Application (kernel)                 Input bandwidth  Computational demand  Output bandwidth  Compression factor
Image (tracking) [Lagorce2014]       80 Kbps          1.34 GOPS             0.16 Kbps         500x
Voice/Sound (speech) [VoiceControl]  256 Kbps         100 MOPS              0.02 Kbps         12800x
Inertial (Kalman) [Nilsson2014]      2.4 Kbps         7.7 MOPS              0.02 Kbps         120x
Biometrics (SVM) [Benatti2014]       16 Kbps          150 MOPS              0.08 Kbps         200x
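The compression factor in the last column is simply the ratio of input to output bandwidth; for the tracking row, for example:

\[
\mathrm{compression} \;=\; \frac{\mathrm{input\ BW}}{\mathrm{output\ BW}}
 \;=\; \frac{80\ \mathrm{Kbps}}{0.16\ \mathrm{Kbps}} \;=\; 500\times
\]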
Extremely compact output (single index, alarm, signature)
The computational power of ULP controllers is not enough
These are parallel workloads
PULP:
pJ/op Parallel ULP computing
pJ/op is traditionally the target of ASIC + Controllers
Scalable: to many-core + heterogeneity
Best-in-class LP silicon technology
Programmable: OpenMP, OpenCL, OpenVX
Open: software & HW (processor & hardware IPs, compiler infrastructure, virtualization layer, programming model)
From ULP computing to parallel + heterogeneous ULP computing
1 mW - 10 mW active power
Near-Threshold Multiprocessing
Minimum Energy Operation
[Plot: energy/cycle (nJ) vs. logic Vcc / memory Vcc (V), 32nm CMOS, 25°C; total, dynamic and leakage energy per cycle. The minimum-energy point near threshold gives ~4.7x lower energy/cycle than nominal voltage.] [VivekDeDATE13]
Near-Threshold Computing (NTC):
1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
3. Aggressively manage idle power (switching, leakage)
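A first-order model (a textbook approximation, not taken from the plot) of why a minimum-energy point appears near threshold: dynamic energy per cycle falls quadratically with VDD, while the cycle time, and hence the leakage energy integrated over a cycle, grows steeply as VDD approaches Vth, so their sum is U-shaped.

\[
E_{\mathrm{cycle}}(V_{DD}) \;=\;
 \underbrace{C_{\mathrm{eff}}\,V_{DD}^{2}}_{\text{dynamic}}
 \;+\;
 \underbrace{I_{\mathrm{leak}}(V_{DD})\;V_{DD}\;T_{\mathrm{cycle}}(V_{DD})}_{\text{leakage}}
\]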
The best Processor
Single-issue, in-order cores are the most energy efficient [AziziISCA10]
Put more than one core, plus shared memory, to fill the cluster area
Building PULP
SIMD + MIMD + sequential execution on a cluster of 2..16 cores (PE0 .. PEN-1)
Private per-core instruction caches (I$0 .. I$N-1) feeding 4-stage, in-order OpenRISC cores
Single-cycle, shared, multi-banked L1 data memory (TCDM, banks MB0 .. MBM-1) behind a low-latency interconnect, with a demux towards peripherals and external memory
GPU-like shared memory: low-overhead data sharing
Tightly coupled DMA for double buffering (see the sketch below)
Near-threshold but parallel: maximum energy efficiency when active, plus strong power management for (partial) idleness
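As a rough illustration of the double-buffering pattern the cluster is built around, the sketch below overlaps the DMA fetch of the next tile with computation on the current one. dma_job_t, dma_copy_async(), dma_wait() and compute_tile() are hypothetical placeholders, not the actual PULP runtime API.

#include <stddef.h>

/* Hypothetical DMA primitives standing in for the PULP runtime API. */
typedef int dma_job_t;
dma_job_t dma_copy_async(void *dst, const void *src, size_t bytes);
void      dma_wait(dma_job_t job);
void      compute_tile(int *tile);      /* user kernel, works out of L1 TCDM */

#define TILE_WORDS 256
static int l1_buf[2][TILE_WORDS];       /* two tiles resident in L1 TCDM */

void process_tiles(const int *l2_src, int *l2_dst, int n_tiles)
{
    int cur = 0;
    dma_job_t in = dma_copy_async(l1_buf[cur], l2_src, sizeof l1_buf[0]);

    for (int t = 0; t < n_tiles; t++) {
        int nxt = cur ^ 1;
        dma_job_t in_next = 0;

        /* Overlap: start fetching tile t+1 while tile t is processed. */
        if (t + 1 < n_tiles)
            in_next = dma_copy_async(l1_buf[nxt],
                                     l2_src + (size_t)(t + 1) * TILE_WORDS,
                                     sizeof l1_buf[0]);

        dma_wait(in);                    /* tile t has landed in L1       */
        compute_tile(l1_buf[cur]);       /* compute entirely out of L1    */
        dma_wait(dma_copy_async(l2_dst + (size_t)t * TILE_WORDS,
                                l1_buf[cur], sizeof l1_buf[0]));

        in  = in_next;
        cur = nxt;
    }
}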
OR1ON: Extended OpenRISC Core
4-stage, in-order OpenRISC pipeline, IPC ~ 1
DSP extensions:
  Hardware loops: eliminate branching overhead
  LD/ST with post-increment: enhanced vector indexing
  Small vector support (SIMD): 2x 16-bit or 4x 8-bit operations
  Unaligned memory accesses: to better exploit SIMD
Up to 5x performance improvement and 3x reduction of energy!
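As an illustration, a kernel like the 16-bit dot product below is what these extensions target: on the extended core the toolchain can map it onto a zero-overhead hardware loop, post-incrementing loads and 2x 16-bit SIMD multiply-accumulates, while on a plain core it compiles to ordinary scalar code. The code itself is plain C; the 5x/3x figures above are the slide's, not this example's.

#include <stdint.h>

/* 16-bit dot product: the kind of inner loop the DSP extensions target. */
int32_t dot16(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)               /* candidate hardware loop     */
        acc += (int32_t)a[i] * (int32_t)b[i]; /* candidate 2x16-bit SIMD MAC */
    return acc;
}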
Silicon Implementation
Technology For ULP
Low VDDMIN
Low cost
High ION/IOFF at low VDD
Knobs for variability management
Knobs for power management
UTBB FD-SOI provides good features for ULP design:
Good behavior at low voltage
Body bias for power and variability management
Body biasing with
UTBB FD-SOI technology
RVT transistors (conventional well) and LVT transistors (flip well) each have a body-bias window limited to about Vdd/2 + 300 mV
RVT: regular threshold voltage; LVT: low threshold voltage
FBB: forward body bias; RBB: reverse body bias
Poly biasing allows trading performance vs. leakage at design time
RVT transistors: low leakage + flexible power management (FBB + RBB)
Near Threshold + Body Biasing
Combined
FBB vs. frequency: +2.5x @ 0.5V; RBB vs. leakage: -10x @ 0.5V (RVT transistors)
State retentive (no need for dedicated state-retentive registers and memories)
Ultra-fast transitions (tens of ns, depending on the n-well area to bias)
Low area overhead for isolation (3 µm spacing for deep n-well isolation)
Thin grids for voltage distribution (small transient current for well polarization)
Simple circuits for on-chip VBB generation (e.g. charge pump)
But even with aggressive RBB, leakage is not zero!
Body Biasing for
Variability Management
Process variation: frequency spread from 25 MHz down to 7 MHz (3σ) @ 0.5V
Temperature and thermal inversion: ~100x leakage variation @ 0.5V between -40°C and 120°C (RVT transistors)
FBB/RBB: frequency and leakage compensation with body-bias values from ±0.2V up to the full -1.8V to +0.5V range
ULP memory implementation:
latch-based SCM
Standard 6T SRAMs: high VDDMIN, a bottleneck for energy efficiency
Near-threshold SRAMs (8T): lower VDDMIN, but 25%-50% area/timing overhead, high active energy, low technology portability
Standard cell memories (SCM): wide supply voltage range, 2x-4x lower read/write energy (256x32 6T SRAM vs. SCM), easy technology portability; controlled P&R mitigates the area overhead
Architectural Technology Awareness
Exploiting body biasing
The cluster is partitioned into separate clock-gating and body-bias regions
Body bias multiplexers (BBMUXes) control the well voltages of each region
Each region can be active (FBB) or idle (deep RBB: low leakage!)
State-Retentive + Low Leakage + Fast transitions
Power Management:
Hardware Synchronization
Core shut-down sequence:
1) Disable instruction fetch
2) Wait for outstanding transactions
3) Clock gating
4) Reverse body biasing
Private, per-core port into the HW synchronizer (PE0..PE3): single-cycle latency, no contention
Barrier cost reduced by 15x
GOALS:
Reduce parallelization overhead
Accelerate common OpenMP and OpenCL patterns, e.g. task creation (see the sketch below)
Automatically manage shut-down of idle cores
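A minimal sketch of the kind of barrier-heavy, fine-grained OpenMP pattern the hardware synchronizer is meant to make cheap. This is standard OpenMP; mapping the implicit barriers onto the HW barrier, and gating cores that arrive early, is assumed to be done by the runtime.

#include <omp.h>

/* Each relaxation step ends in the implicit barrier of the omp-for,
 * which the runtime can implement on the single-cycle HW barrier
 * instead of a software spin-wait.                                  */
void relax(float *a, float *b, int n, int steps)
{
    #pragma omp parallel num_threads(4)
    for (int s = 0; s < steps; s++) {
        #pragma omp for                  /* implicit barrier at loop end */
        for (int i = 1; i < n - 1; i++)
            b[i] = 0.5f * (a[i - 1] + a[i + 1]);

        #pragma omp single               /* one thread swaps the buffers */
        { float *t = a; a = b; b = t; }  /* implicit barrier here too    */
    }
}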
Power Management:
External Events
Programming sequence (see the sketch below):
1) Set the event mask
2) Program the transfer
3) Trigger the transfer
4) Shut down the cores
48 maskable events: general purpose, DMA, timers, peripherals (SPI, I2C, GPIO, ...)
The event unit sits next to the HW synchronizer and the cores (PE0..PE3), and is reachable from the peripheral interconnect (SPI, L2 memory)
GOAL: automatically manage shut-down of cores during data transfers
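A minimal sketch of this programming sequence from a core's point of view. eu_set_mask(), eu_wait_event(), dma_ext_to_l2(), dma_start() and the EVT_DMA_DONE bit are hypothetical names standing in for the real event-unit and DMA interface.

#include <stdint.h>

/* Hypothetical event-unit / DMA driver calls, for illustration only. */
void eu_set_mask(uint32_t mask);
void eu_wait_event(void);                 /* clock-gates the core until an
                                             unmasked event arrives        */
void dma_ext_to_l2(void *l2_dst, uint32_t ext_addr, uint32_t bytes);
void dma_start(void);

#define EVT_DMA_DONE (1u << 0)            /* illustrative event bit */

void fetch_sensor_frame(void *l2_dst, uint32_t spi_addr, uint32_t bytes)
{
    eu_set_mask(EVT_DMA_DONE);               /* 1) set the event mask         */
    dma_ext_to_l2(l2_dst, spi_addr, bytes);  /* 2) program the transfer       */
    dma_start();                             /* 3) trigger it                 */
    eu_wait_event();                         /* 4) core sleeps until DMA done */
}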
Heterogeneous Memory Architecture
+ Management
Shared I$ (banks I$B0 .. I$Bk) to recover the SCM area overhead; private L0 buffers reduce the pressure on the shared I$
Heterogeneous L1 TCDM: interleaved SCM banks (SCM0 .. SCMM-1) plus private SRAM banks (SRAM0 .. SRAMM-1)
Reconfigurable pipeline stages to cope with SRAM degradation at low VDD
MMU (logical-to-physical address map): 1) interleaved/private addresses (see the sketch below), 2) shutdown of SRAM banks
SCM on the I$ and part of the TCDM to widen the operating range
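As an illustration of the interleaved mapping, the helpers below spread consecutive 32-bit words across TCDM banks, so cores streaming through an array mostly hit different banks. The bank count and bit layout are made-up examples, not the actual PULP memory map.

#include <stdint.h>

#define NUM_BANKS 8                        /* illustrative, not the real value */

/* Word-level interleaving: consecutive words land in consecutive banks. */
static inline uint32_t bank_of(uint32_t byte_addr)
{
    return (byte_addr >> 2) % NUM_BANKS;   /* which bank holds this word   */
}

static inline uint32_t word_in_bank(uint32_t byte_addr)
{
    return (byte_addr >> 2) / NUM_BANKS;   /* word offset within that bank */
}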
The PULP Family
PULPv1
CHIP FEATURES (tester chip)
Technology:      28nm FD-SOI (RVT)
Chip area:       3 mm2
# Cores:         4x OpenRISC
I$:              4x 1 kB (private)
TCDM:            16 kB
L2:              16 kB
BB regions:      yes
VDD range:       0.45V - 1.2V
VBB range:       -1.8V - +0.9V
Perf. range:     1 MOPS - 1.9 GOPS
Power range:     100 µW - 127 mW*
Peak efficiency: 60 GOPS/W @ 0.5V*
*Does not include IOs
Measured Results
Maximum energy efficiency: 60 GOPS/W @ 0.5V + 0.5V FBB
1.8 GOPS peak; ~10 mW @ 100 MHz, 0.75V
Low leakage (< 2%)
Wide operating range
Peak GOPS/W competitive with best-in-class near-threshold (16-bit) ULP microcontrollers, plus more than 100x peak GOPS!
PULPv2
= PULPv1 + 2 DVFS regions (SoC + CLUSTER) + Event Unit + Peripherals
PULPv3
= PULPv2 + Extended cores + HW Synch + Shared Cache + HWCE + Shared IOs
PULPs Summary
Feature               PULPv1                     PULPv2                     PULPv3
# of cores            4                          4                          4
L2 memory             16 kB                      64 kB                      128 kB
TCDM                  16 kB SRAM                 32 kB SRAM + 8 kB SCM      32 kB SRAM + 16 kB SCM
Reconf. pipe stages   no                         yes                        yes
I$                    4 kB SRAM, private         4 kB SCM, private          4 kB SCM, shared
Body bias regions     yes                        yes                        yes
DVFS                  no                         yes                        yes
I/O connectivity      JTAG                       full                       full, multiplexed
Extended processor    no                         no                         yes
Event unit            no                         yes                        yes + HW synchro
Debug unit            no                         no                         yes
Status                silicon proven             post tape-out              pre tape-out
Technology            FD-SOI 28nm, conv. well    FD-SOI 28nm, flip well     FD-SOI 28nm, conv. well
Voltage range         0.45V - 1.2V               0.3V - 1.2V                0.5V - 0.7V
BB range              -1.8V - +0.9V              0.0V - +1.8V               -1.8V - +0.9V
Max freq.             475 MHz                    1 GHz                      200 MHz
Max perf.*            1.9 GOPS                   4 GOPS                     1.8 GOPS
Peak en. eff.*        60 GOPS/W                  135 GOPS/W                 385 GOPS/W
*equivalent 32-bit RISC operations
Breaking the GOPS/mW wall
Recovering more silicon efficiency [GOPS/W]: general-purpose computing (CPU), throughput computing (GPGPU) and ULP parallel computing are SW and mixed solutions around the 1 GOPS/mW wall; hard-wired HW IPs sit more than 100x beyond it, the accelerator gap
Closing the accelerator efficiency gap with agile customization
Fractal Heterogeneity
Fixed-function accelerators have limited reuse: how to limit their proliferation?
Learn to Accelerate
Brain-inspired systems (deep convolutional networks) are high performers in many tasks over many domains
Image recognition [RussakovskyIMAGENET2014]: CNN 93.4% accuracy vs. human 85% (untrained) / 94.9% (trained)
Speech recognition [HannunARXIV2014]
Flexible acceleration: learned CNN weights are the program
PULP CNN Performance
Average performance and energy efficiency on a 32x16 CNN frame
Performance: 8 GOPS (61x); Energy efficiency: 6500 GOPS/W (47x)
PULPv3 architecture; corner: tt28, 25°C, VDD = 0.5V, FBB = 0.5V
Thanks for your attention!!!
www-micrel.deis.unibo.it/pulp-project
References
[RuchIBM11] P. Ruch et al., "Toward five-dimensional scaling: How density improves efficiency in future computers," IBM Journal of Research and Development, vol. 55, no. 5, pp. 1-13, 2011.
[AziziISCA10] O. Azizi et al., "Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA 2010), pp. 26-36, June 19-23, 2010.
[Nilsson2014] J.-O. Nilsson et al., "Foot-mounted inertial navigation made easy," 2014 International Conference on Indoor Positioning and Indoor Navigation, 27-30 October 2014.
[Benatti2014] S. Benatti et al., "EMG-based hand gesture recognition with flexible analog front end," IEEE Biomedical Circuits and Systems Conference (BioCAS), pp. 57-60, Oct. 2014.
[Lagorce2014] Lagorce et al., "Asynchronous Event-Based Multikernel Algorithm for High-Speed Visual Features Tracking," IEEE Trans. Neural Netw. Learn. Syst., Sep. 2014.
[VoiceControl] TrulyHandsfree Voice Control, available: http://www.sensory.com/wpcontent/uploads/80-0342-A.pdf
[VivekDeDATE13] V. De, "Near-Threshold Voltage design in nanoscale CMOS," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013.
[DoganICSDPTMO2011] A. Y. Dogan et al., "Power/performance exploration of single-core and multi-core processor approaches for biomedical signal processing," Integrated Circuit and System Design, Power and Timing Modeling, Optimization, and Simulation, pp. 102-111, 2011.
[RussakovskyIMAGENET2014] O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, 2014.
[HannunARXIV2014] A. Hannun et al., "Deep Speech: Scaling up end-to-end speech recognition," arXiv, 2014.
How Big is the IoT?
How much energy to process (1 op. per Byte) one BB?
Microcontrollers Landscape
*not exhaustive
Parallel NTC
[Plot: core power (mW) vs. workload (MOPS); sub-Vth operation covers low workloads, near-Vth operation covers high workloads. [DoganICSDPTMO2011]]

Target workload   1-core energy efficiency   4-core energy efficiency   Ratio
[MOPS]            (ideal) [MOPS/mW]          (ideal) [MOPS/mW]
100               43                         55                         1.3x
200               33                         50                         1.5x
400               18                         43                         2.4x
*Measured on our first prototype
Parallel NTC + Race to Halt
[Power vs. time for a low, duty-cycled workload: single-core @ max frequency (e.g. 200 MHz) vs. multi-core @ max frequency. Core power and system power are drawn over the active period; the multi-core run ideally spends the same core energy as the single-core one, but finishes its active period sooner and enters deep sleep, saving the system energy of the remaining time.]
Going faster allows integrating the system power over a shorter period
The main constraint here is the power envelope
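A first-order sketch of this argument, with illustrative symbols rather than figures from the slide: for a fixed workload W and N cores with ideal speed-up, the compute energy stays roughly constant while the system power is integrated over an active period that shrinks as 1/N, and the rest of the period is spent in deep sleep.

\[
E \;\approx\;
 P_{\mathrm{cores}}\,t_{\mathrm{active}}
 \;+\; P_{\mathrm{sys}}\,t_{\mathrm{active}}
 \;+\; P_{\mathrm{sleep}}\,(T - t_{\mathrm{active}}),
\qquad
t_{\mathrm{active}} \;\approx\; \frac{W}{N\cdot \mathrm{perf}_{\mathrm{core}}}
\]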
Back to SRAMs
Voltage limit of SRAMs: performance rapidly degrades at low voltage (2x @ 0.5V), and SRAM VDDMIN is higher than that of logic (and SCM)