Lecture 12
Processor
Introduction
• Contents:
– ASIP Introduction
• Special function unit
• Integration in SoC
– Commercially available tools
• Design flow
– Exercise session about Tensilica Xtensa Xplorer
• Introduction to Xtensa Xplorer
• Exercise session
• Goals:
– Knowing the pros and cons of ASIPs
– Learning the general concepts of ASIPs
– Being able to design and analyze ASIPs using Xtensa Xplorer
Michael Gautschi 2
Integrated Systems Laboratory
• Problem:
– GPP is too slow for some applications
– Power consumption is too high for an embedded system
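[Figure: energy efficiency/speed (low to high) plotted against flexibility (low to high). From highest efficiency and lowest flexibility to lowest efficiency and highest flexibility: ASICs, ASIPs, DSP, GPP.]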
ASIP Introduction
Processor Customization
• ASIPs are customized processors with hardwired functional units in the data path.
• Custom instructions are executed on a special function unit (SFU)
– Compiler maps application source code to the extended instruction set
– New instructions can be used as intrinsics in C code
Example:
• Modulo operation:
– c=a%b
• Square root:
– c = a + sqrt(b)
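The two operations above could be exposed to C code as intrinsics. The following is only a sketch: the names sfu_mod and sfu_add_sqrt are hypothetical, integer operands are assumed, and the bodies are plain C emulations of what a special function unit might compute in a single instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of an integer square root unit (floor of sqrt(x)).
   The bit-by-bit scheme resembles what a hardware unit could do,
   but this is an assumption, not a description of a specific SFU. */
static uint32_t isqrt32(uint32_t x)
{
    uint32_t res = 0;
    uint32_t bit = 1u << 30;          /* highest power of four in 32 bits */

    while (bit > x)
        bit >>= 2;

    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}

/* Hypothetical intrinsic for c = a % b. */
static inline uint32_t sfu_mod(uint32_t a, uint32_t b)
{
    assert(b != 0);                   /* modulo by zero is undefined in C */
    return a % b;
}

/* Hypothetical intrinsic for c = a + sqrt(b), fused into one operation. */
static inline uint32_t sfu_add_sqrt(uint32_t a, uint32_t b)
{
    return a + isqrt32(b);
}
```

In application code the calls read like ordinary functions, for example c = sfu_add_sqrt(a, b); on a customized processor the compiler would be expected to replace each call with the corresponding custom instruction.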
Example:
• Accumulator
– a += b[i] * c[i]
– a is stored in the state register (can be larger than 32 bit)
1. initialize state to 0
2. Compute MAC
3. Read final state
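The three steps above can be sketched in C as follows. This is a software model under stated assumptions: the state register is emulated by a 64-bit variable (the real width is a design choice), and mac_init, mac_step and mac_read are hypothetical intrinsic names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulated state register of the MAC unit; wider than the 32-bit operands. */
static int64_t mac_state;

/* 1. Initialize state to 0. */
static inline void mac_init(void)
{
    mac_state = 0;
}

/* 2. Compute MAC: state += b * c, with the product formed in 64 bits. */
static inline void mac_step(int32_t b, int32_t c)
{
    mac_state += (int64_t)b * (int64_t)c;
}

/* 3. Read final state. */
static inline int64_t mac_read(void)
{
    return mac_state;
}

/* a += b[i] * c[i] over n elements, using the three steps. */
int64_t dot_product(const int32_t *b, const int32_t *c, size_t n)
{
    assert(b != NULL && c != NULL);

    mac_init();
    for (size_t i = 0; i < n; i++)
        mac_step(b[i], c[i]);
    return mac_read();
}
```

Keeping the running sum in a dedicated state register means the loop body needs no general-purpose register for the accumulator, and intermediate sums that exceed 32 bits are not truncated.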
Example:
• Multiple accumulators
– a += b[i] * c[i]
– d += f[j] * g[j]
=> Store d and a in SPR with higher precision
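A possible software model of two independent accumulators is sketched below. It assumes that SPR stands for special-purpose register and that each accumulator is selected by an index; the names spr_clear, spr_mac, spr_read and dual_dot are hypothetical, and the 64-bit width is only one possible choice of "higher precision".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Indices of the two emulated special-purpose registers. */
enum { ACC_A = 0, ACC_D = 1, NUM_ACC = 2 };

static int64_t spr[NUM_ACC];

static inline void spr_clear(int idx)
{
    assert(idx >= 0 && idx < NUM_ACC);
    spr[idx] = 0;
}

/* spr[idx] += x * y, with the product formed in 64 bits. */
static inline void spr_mac(int idx, int32_t x, int32_t y)
{
    assert(idx >= 0 && idx < NUM_ACC);
    spr[idx] += (int64_t)x * (int64_t)y;
}

static inline int64_t spr_read(int idx)
{
    assert(idx >= 0 && idx < NUM_ACC);
    return spr[idx];
}

/* Computes a += b[i] * c[i] and d += f[j] * g[j] in one interleaved loop,
   so both accumulators are live at the same time. */
void dual_dot(const int32_t *b, const int32_t *c, size_t n_i,
              const int32_t *f, const int32_t *g, size_t n_j,
              int64_t *a_out, int64_t *d_out)
{
    size_t n_max = (n_i > n_j) ? n_i : n_j;

    spr_clear(ACC_A);
    spr_clear(ACC_D);
    for (size_t k = 0; k < n_max; k++) {
        if (k < n_i)
            spr_mac(ACC_A, b[k], c[k]);
        if (k < n_j)
            spr_mac(ACC_D, f[k], g[k]);
    }
    *a_out = spr_read(ACC_A);
    *d_out = spr_read(ACC_D);
}
```

With a single state register the two sums would have to be computed one after the other or spilled between iterations; separate registers let both kernels share the same MAC hardware without interfering.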
[1] Caro et al., “High-Performance Special Function Unit for Programmable 3-D Graphics Processors”, TCAS 2009
[2] Oberman et al., “A High-Performance Area-Efficient Multifunction Interpolator”, ARITH 2005
[3] Gautschi et al., “A 65nm CMOS 6.4-to-29.2 pJ/FLOP@ 0.8 V shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster”, ISSCC 2016
• Usage in software:
– Program in standard C with the use of intrinsics
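One common way to keep such code portable is to hide the intrinsic behind a macro with a plain C fallback. The sketch below is an illustration only: the macro USE_SFU_INTRINSICS, the header sfu_intrinsics.h and the intrinsic sfu_add_sat are all hypothetical, and saturating addition is chosen merely as an example operation.

```c
#include <assert.h>
#include <stdint.h>

#ifdef USE_SFU_INTRINSICS
/* Hypothetical: header and intrinsic would be generated by the ASIP toolchain. */
#include "sfu_intrinsics.h"
#define ADD_SAT(a, b) sfu_add_sat((a), (b))
#else
/* Plain C fallback with the same behavior, usable on any host compiler. */
static inline int32_t add_sat_sw(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;

    if (s > INT32_MAX)
        return INT32_MAX;
    if (s < INT32_MIN)
        return INT32_MIN;
    return (int32_t)s;
}
#define ADD_SAT(a, b) add_sat_sw((a), (b))
#endif

/* Application code stays standard C and only uses the macro. */
int32_t sum_saturating(const int32_t *x, int n)
{
    assert(x != 0 && n >= 0);

    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = ADD_SAT(acc, x[i]);
    return acc;
}
```

The same source can then be compiled and tested on a workstation with the fallback, and rebuilt for the customized processor with the macro defined.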
[Figure: data path of a simple processor]
Reconfigurable ASIPs
• Replace SFU by a configurable array of simple functions
• FPGA-like approach
– Efficient when reconfiguration is not needed very often
Source: Berekovic, TU-Braunschweig 2010
• LISA (Synopsys)
• Xtensa (Tensilica/Cadence)
• OptimoDE (ARM)
• Codasip
[Figure: design flow. Recoverable labels: Application, Choose template, Create Processor, Modify/customize, ISS, Compile, Profile application, Hardware, Software]
Where is it used?
• AMD Radeon R9-290
– High-end graphics card with TrueAudio
– TrueAudio is a co-processor based on the HiFi core architecture of Cadence
– AMD supports audio processing via TrueAudio in all its new chips (architecture generation GCN 1.1)
• Processor configuration:
– 5-stage integer pipeline
– 32 general-purpose registers, 32 bits each
– 16x16-bit multiplier
– 2 KB I$, D$
[1] Semester Thesis by A. Traber, S. Stucki, 2014, A Unified Multiplier Based Hardware Architecture for Elliptic Curve Cryptography
Results: Performance
• Comparison to HW architecture [2]:
– Coprocessor with 16-bit datapath requires ~1’850’000 cycles (12 kGE)
– Factor 3.3 slower
[2] M. Gautschi, M. Mühlberghuber et al., SIRIOUS: A tightly coupled ECC Coprocessor for the OpenRISC
• Conclusion:
– Flexible architecture fully integrated in an application processor
– 7.5x speedup at very low cost and short design time (< 1 week)
– Only 2.2 kGE hardware overhead (datapath of co-processor: 12 kGE)
Summary
Q&A