Thesis Lenart
Thomas Lenart
Lund 2008
The Department of Electrical and Information Technology
Lund University
Box 118, S-221 00 LUND
SWEDEN
Science may set limits to knowledge,
but should not set limits to imagination
Abstract
Preface
Acknowledgment
List of Acronyms
1 Introduction
1.1 Challenges in Digital Hardware Design
1.1.1 Towards Parallel Architectures
1.1.2 Application-Specific vs. Reconfigurable Architectures
1.2 Contributions and Thesis Outline
2.4 Memory and Storage
2.4.1 Data Caching
2.4.2 Data Streaming
2.5 Hardware Design Flow
2.5.1 Architectural Exploration
2.5.2 Virtual Platforms
2.5.3 Instruction Set Simulators
Bibliography
Appendix
A The Scenic Shell
1 Launching Scenic
1.1 Command Syntax
1.2 Number Representations
1.3 Environment Variables
1.4 System Variables
2 Simulation
2.1 Starting a Simulation
2.2 Running a Simulation
3 Simulation Modules
3.1 Module Library
3.2 Module Hierarchy
3.3 Module Commands
3.4 Module Parameters
3.5 Module Variables
3.6 Module Debug
4 System Specification
4.1 Module Instantiation
4.2 Module Binding
4.3 Module Configuration
This thesis summarizes my academic work in the digital ASIC group at the
department of Electrical and Information Technology. The main contributions
to the thesis are derived from the following publications:
Thomas Lenart, Henrik Svensson, and Viktor Öwall, “Modeling and Ex-
ploration of a Reconfigurable Architecture for Digital Holographic Imag-
ing,” in Proceedings of IEEE International Symposium on Circuits and
Systems, Seattle, USA, May 2008.
Henrik Svensson, Thomas Lenart, and Viktor Öwall, “Modelling and Ex-
ploration of a Reconfigurable Array using SystemC TLM,” in Proceedings
of Reconfigurable Architectures Workshop, Miami, Florida, USA, April
2008.
Thomas Lenart and Viktor Öwall, “Architectures for Dynamic Data Scal-
ing in 2/4/8K Pipeline FFT Cores,” IEEE Transactions on Very Large
Scale Integration Systems, vol. 14, no. 11, November 2006, pp. 1286–
1290.
Thomas Lenart, Viktor Öwall, Mats Gustafsson, Mikael Sebesta, and
Peter Egelberg, “Accelerating Signal Processing Algorithms in Digital
Holography using an FPGA Platform,” in Proceedings of IEEE Inter-
national Conference on Field-Programmable Technology, Tokyo, Japan,
December 2003, pp. 387–390.
Thomas Lenart and Viktor Öwall, “A 2048 Complex Point FFT Processor
using a Novel Data Scaling Approach,” in Proceedings of IEEE Interna-
tional Symposium on Circuits and Systems, vol. 4, Bangkok, Thailand,
May 2003, pp. 45–48.
The following paper concerning education has also been published, but is not
considered part of this thesis:
Hugo Hedberg, Thomas Lenart, Henrik Svensson, Peter Nilsson and Vik-
tor Öwall, “Teaching Digital HW-Design by Implementing a Complete
MP3 Decoder,” in Proceedings of Microelectronic Systems Education,
Anaheim, California, USA, June 2003, pp. 31–32.
Acknowledgment
I have many people to thank for a memorable time during my PhD studies. It
has been the most challenging and fruitful experience in my life, and given me
the opportunity to work with something that I really enjoy!
I would first of all like to extend my sincere gratitude to my supervisor,
associate professor Viktor Öwall. He has not only been guiding, reading, and
commenting on my work over the years, but also given me the opportunity and
encouragement to work abroad during my PhD studies and gain experience
from industrial research. I also would like to sincerely thank associate pro-
fessor Mats Gustafsson, who has been supervising parts of this work, for his
profound theoretical knowledge and continuous support during this time.
I would like to thank my colleagues and friends at the Department of Electrical
and Information Technology (formerly Electroscience) for an eventful and
enjoyable time. It has been a pleasure to work with you all. I will always remember
our social events and interesting discussions on any topic. My gratitude goes to
Henrik Svensson, Hugo Hedberg, Fredrik Kristensen, Matthias Kamuf, Joachim
Neves Rodrigues, and Hongtu Jiang for friendship and support over the years,
and to my new colleagues Johan Löfgren and Deepak Dasalukunte for accom-
panying me during my last year of PhD studies. I especially would like to
thank Henrik Svensson and Joachim Neves Rodrigues for reading and com-
menting parts of the thesis, which was highly appreciated. The administrative
and practical support at the department has been more than excellent, and
naturally I would like to thank Pia Bruhn, Elsbieta Szybicka, Erik Jonsson,
Stefan Molund, Bengt Bengtsson, and Lars Hedenstjerna for all their help.
I am so grateful that my family has always been there for me, and encouraged
my interest in electronics from an early age. At that time, my curiosity and
research work consisted of dividing all kinds of electronic devices into smaller
pieces, i.e. components, which unfortunately were never returned to
their original state. This experience has been extremely valuable for
me, though the focus has slightly shifted from system separation to system
integration.
My dear fiancée Yiran, I am so grateful for your love and patient support.
During my PhD studies I visited many countries, and I am so glad and fortunate
that I got the chance to meet you in the most unexpected place. Thank you
for being who you are, for always giving me new perspectives and angles of life,
and for inspiring me to new challenges.
The initial part of this work began as a multi-disciplinary project on
digital holography, which later became Phase Holographic Imaging AB. I have
had the opportunity to take part in this project for a long time, and it has
been a valuable experience in many ways. From our weekly meetings in the
department library, to the first prototype of the holographic microscope, what
an exciting time! I would like to thank all the people who have been involved
in this work, especially Mikael Sebesta for our valuable discussions since the
very beginning, and for your enduring enthusiasm to take this project forward.
In 2004 and 2006, I was invited for internships at Xilinx in San José, California,
for 3 and 9 months, respectively. I gained a lot of experience and knowledge
during that time, and I would like to thank Dr. Ivo Bolsens for giving me this
opportunity. I would like to extend my gratitude to the colleagues and friends
in the Xilinx Research Labs, especially my supervisors Dr. Adam Donlin and
Dr. Jörn Janneck for their encouragement, support, and valuable discussions.
I have a lot of good memories from the visit, and I would like to thank my local
and international friends for all the unforgettable adventures we experienced
together in the US.
This work has been funded by the Competence Center for Circuit Design
(CCCD), and sponsored with hardware equipment from the Xilinx University
Program (XUP).
Thomas Lenart
List of Acronyms
DVB Digital Video Broadcasting
EDA Electronic Design Automation
EMIF External Memory Interface
ESL Electronic System Level
FFT Fast Fourier Transform
FIFO First In, First Out
FIR Finite Impulse Response
FPGA Field Programmable Gate Array
FSM Finite State Machine
GPP General-Purpose Processor
GPU Graphics Processing Unit
GUI Graphical User Interface
HDL Hardware Description Language
IFFT Inverse Fast Fourier Transform
IP Intellectual Property
ISS Instruction Set Simulator
LSB Least Significant Bit
LUT Lookup Table
MAC Multiply-Accumulate
MC Memory Cell
MPMC Multi-Port Memory Controller
MSB Most Significant Bit
NoC Network-on-Chip
OFDM Orthogonal Frequency Division Multiplexing
OSCI Open SystemC Initiative
PC Processing Cell
PE Processing Element
PIM Processor-In-Memory
RAM Random-Access Memory
RC Resource Cell
RGB Red-Green-Blue
ROM Read-Only Memory
RPA Reconfigurable Processing Array
RTL Register Transfer Level
SAR Synthetic Aperture Radar
SCENIC SystemC Environment with Interactive Control
SCV SystemC Verification (library)
SDF Single-path Delay Feedback
SDRAM Synchronous DRAM
SNR Signal-to-Noise Ratio
SOC System-On-Chip
SQNR Signal-to-Quantization-Noise Ratio
SRAM Static Random-Access Memory
SRF Stream Register File
TLM Transaction Level Modeling
VGA Video Graphics Array
VHDL Very High-Speed Integrated Circuit HDL
VLIW Very Long Instruction Word
XML eXtensible Markup Language
List of Definitions and Mathematical Operators
Chapter 1
Introduction
Pdyn = α CL VDD² fclk ,
where α is the switching activity and CL is the load capacitance [2]. However,
the supply voltage also affects the propagation time tp in logic gates as
VDD
tp ∝ ,
(VDD − VT )β
where VT is the threshold voltage and β is a technology specific parameter.
Hence, lowering the supply voltage means that the system becomes slower and
counteracts the initial goal. This is the situation for most sequential processing
units, e.g. microprocessors, where the processing power heavily depends on the
system clock frequency.
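As a numerical illustration of this trade-off, the two relations can be evaluated together. This is only a sketch: the threshold voltage VT = 0.3 V and β = 1.3 below are assumed example values, not process data from the thesis, and the constants α, CL, and fclk are folded into one factor.

```python
def p_dyn(vdd, alpha_cl_fclk=1.0):
    """Dynamic power Pdyn = alpha * CL * VDD^2 * fclk (constants folded)."""
    return alpha_cl_fclk * vdd ** 2

def t_prop(vdd, vt=0.3, beta=1.3):
    """Propagation time tp proportional to VDD / (VDD - VT)^beta
    (vt and beta are assumed example values)."""
    return vdd / (vdd - vt) ** beta

# Scaling VDD from 1.2 V down to 0.8 V:
power_ratio = p_dyn(0.8) / p_dyn(1.2)    # ~0.44: power drops quadratically
delay_ratio = t_prop(0.8) / t_prop(1.2)  # > 1: but the gates become slower
```

The quadratic power saving is thus paid for with longer gate delays, which is the counteracting effect described above.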
The current trend in the computer industry is to address this situation us-
ing multiple parallel processing units, which shows the beginning of a paradigm
shift towards parallel hardware architectures [3]. However, conventional computer
programs are described as sequentially executed instructions and cannot
easily be adapted to a multi-processor environment, which limits the potential
speed-up to the amount of parallelism that can be found in the
software. The speedup S from using N parallel processing units is calculated
using Amdahl’s law as
S = 1 / ((1 − p) + p/N) ,
where p is the fraction of the sequential program that can be parallelized [4].
Assuming that 50% of the sequential code can be executed on parallel proces-
sors, the speedup will still be limited to a modest factor of 2, no matter how
many parallel processors are used. Parallel architectures have a promising
future, but will require new design approaches and programming methodologies
to enable high system utilization.
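The limit quoted above follows directly from Amdahl's law; a minimal sketch:

```python
def amdahl_speedup(p, n):
    """Speedup S = 1 / ((1 - p) + p / N) for parallelizable fraction p
    and N processing units."""
    return 1.0 / ((1.0 - p) + p / n)

# With p = 0.5, the speedup saturates at 2 regardless of processor count:
s_two = amdahl_speedup(0.5, 2)         # ~1.33
s_large_n = amdahl_speedup(0.5, 10**6) # ~2.0
```

Once the sequential fraction (1 − p) dominates, adding further processors yields rapidly diminishing returns.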
Thomas Lenart, Viktor Öwall, Mats Gustafsson, Mikael Sebesta, and Peter
Egelberg, “Accelerating Signal Processing Algorithms in Digital Holography
using an FPGA Platform,” in Proceedings of IEEE International Conference
on Field-Programmable Technology, Tokyo, Japan, December 2003, pp. 387–
390.
Mats Gustafsson, Mikael Sebesta, Bengt Bengtsson, Sven-Göran Pettersson,
Peter Egelberg, and Thomas Lenart, “High Resolution Digital Transmission
Microscopy - a Fourier Holography Approach,” in Optics and Lasers in Engi-
neering, vol. 41, issue 3, March 2004, pp. 553–563.
model generators and architectural generators are also proposed to create cus-
tom processor, memory, and system architectures that facilitate design explo-
ration. The Scenic exploration tool is used in Part IV to evaluate dynamically
reconfigurable architectures.
Publications:
Henrik Svensson, Thomas Lenart, and Viktor Öwall, “Modelling and Explo-
ration of a Reconfigurable Array using SystemC TLM,” in Proceedings of Re-
configurable Architectures Workshop, Miami, Florida, USA, April 2008.
Thomas Lenart, Henrik Svensson, and Viktor Öwall, “A Hybrid Interconnect
Network-on-Chip and a Transaction Level Modeling approach for Reconfig-
urable Computing,” in Proceedings of IEEE International Symposium on Elec-
tronic Design, Test and Applications, Hong Kong, China, January 2008, pp.
398–404.
[Diagram: architecture trade-off between flexibility and efficiency/performance, ranging from general-purpose (GPP) over special-purpose (ASIP, DSP, GPU) and reconfigurable architectures (CGRA) to application-specific circuits (ASIC). This work targets the reconfigurable middle ground.]
well data is re-used through caching, and the memory access pattern to non-
cacheable data [14]. Hence, system performance is a balance between efficient
processing and efficient memory management.
Constructing digital systems requires not only a set of hardware building
blocks, but also a methodology and design flow to generate a functional circuit
from a high-level specification. In addition, the current trends towards
high-level modeling and system-level exploration require more advanced design
steps to enable software and hardware co-development from a common virtual
platform [15]. For complex system design, exploration tools require the use of
abstraction levels to trade modeling accuracy for simulation speed [16], which
is further discussed in Part III.
Figure 2.2: (a) A DSP architecture with two separate register files and
multiple functional units on each data path. (b) A stream processor with
a high-performance register file connected to an array of ALU clusters.
Each cluster contains multiple functional units and local register files.
The stream controller handles data transfers to external memory.
Figure 2.3: (a) Overview of the FPGA architecture with embedded memory
(in gray) and GPP macro-blocks. The FPGA fabric, containing logic
blocks and switchboxes, is magnified. (b) An example of a coarse-grained
reconfigurable architecture, with an array of processing elements
(ALU) and a routing network.
The FPGA design flow starts with a specification, either from a hardware
description language (HDL) or from a more high-level description. HDL code
is synthesized to gate level and the functionality is mapped onto the logic
blocks in the FPGA. The final step is to route the signals over the interconnect
network and to generate a device specific configuration file (bit-file). This file
contains information on how to configure logic blocks, interconnect network,
and IO-pads inside the FPGA. The configuration file is static and cannot
alter the functionality during run-time, but some FPGAs support run-time
reconfiguration (RTR) and the possibility to download partial configuration
files while the device is operating [36]. However, the reconfiguration time is in
the range of milliseconds since the FPGA is configured at bit-level.
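A rough illustration of why bit-level configuration lands in the millisecond range; the bitstream size and configuration-port parameters below are assumed for the example, not taken from any device datasheet:

```python
def reconfig_time_ms(bitstream_bits, port_bits=32, clk_hz=100e6):
    """Time (ms) to load a configuration through a port_bits-wide
    interface clocked at clk_hz (both assumed example values)."""
    return bitstream_bits / (port_bits * clk_hz) * 1e3

# A hypothetical 20 Mbit full bitstream through a 32-bit, 100 MHz port:
t_full = reconfig_time_ms(20e6)   # 6.25 ms
```

Partial bitstreams shrink this proportionally, which is why partial run-time reconfiguration is attractive.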
[Diagram for Figure 2.4: SDRAM organization with banks 0..N, row and column address latches and decoders, memory array, sense amplifiers, row buffer, and data I/O.]
Figure 2.4: Modern SDRAM is divided into banks, rows, and columns.
Accessing consecutive elements results in a high memory bandwidth,
which must be considered to achieve high performance.
Specification - The initial design decisions are outlined during the spec-
ification phase. This is usually in the form of block diagrams and hierar-
chical schematics, design constraints, and design requirements.
[Diagram for Figure 2.5: Specification → Behavioral exploration (ESL) with a virtual prototype for software development → Structural simulation (RTL) of HDL and IP models → Logic synthesis, placement, and routing against a cell library (IP) → GDSII / bit-file.]
Figure 2.5: Design flow for system hardware development, starting from
behavioral modeling, followed by hardware implementation, and finally
physical placement and routing.
The design flow is an iterative process, where changes to one step may
require the designer to go back to a previous step in the flow. The virtual
hardware prototype is one such example since it performs design exploration
based on parameters given by the initial specification. Therefore, architectural
problems discovered during virtual prototyping may require the specification to
be revised. Furthermore, the virtual prototype is not only a reference design for
the structural RTL generation, but also a simulation and verification platform
for software development. Changing the RTL implementation requires that the
virtual platform is updated correspondingly, to avoid inconsistency between the
embedded software and the final hardware.
[Diagram for Figure 2.6: abstraction levels from algorithm (Matlab/C) via transaction-level and cycle-callable modeling (ESL) down to timing-accurate modeling (RTL); simulation speed decreases and simulation accuracy increases towards RTL.]
Figure 2.6: Abstraction levels in the ESL design space, ranging from
abstract transaction-level modeling to pin- and timing-accurate modeling.
Cycle accurate - A cycle accurate (CA) model uses the same model-
ing accuracy at clock cycle level as a conventional RTL implementation.
Cycle accurate models are usually also pin-accurate.
void process() {
    ...
    x = acc1.read();
    acc2.write(x);
    acc2.start();
    ...
}

[Diagram: a CPU master connected to accelerators Acc #1 and Acc #2.]
during the RTL design phase to simulate, for example, the complex timing
constraints in memory models.
However, controllability and visibility come at the price of longer execution
times. Hence, hardware-based prototyping using FPGA platforms is
still an important part of ASIC verification, but will be introduced later in the
design flow.
Given this background, the virtual platform and ISS concepts are further dis-
cussed in Part III, which proposes a design environment and a set of scalable
simulation models to evaluate reconfigurable architectures.
Chapter 3
Digital Holographic Imaging
In 1947, the Hungarian scientist Dennis Gabor developed a method and theory
to photographically create a three-dimensional recording of a scene, commonly
known as holography [51]. For his work on holographic methods, Dennis Gabor
was awarded the Nobel Prize in Physics in 1971. A holographic setup is
shown in Figure 3.1, and is based on a coherent light source. The interference
pattern between light from a reference wave and reflected light from an ob-
ject illuminated with the same light source is captured on a photographic film
(holographic plate). Interference between two wave fronts cancels or amplifies
the light in each point on the holographic film. This is called constructive and
destructive interference, respectively.
A recorded hologram has certain properties that distinguish it from a con-
ventional photograph. In a normal camera, the intensity (amplitude) of the
light is captured and the developed photograph is directly visible. The
photographic film in a holographic setup captures the interference, or phase
difference, between two waves [52]. Hence, both amplitude and phase information
are stored in the hologram. By illuminating the developed photographic film
with the same reference light as used during the recording phase, the original
image is reconstructed and appears three-dimensional.
The use of photographic film in conventional holography makes it a time-
consuming, expensive, and inflexible process. In digital holography, the pho-
tographic film is replaced by a high-resolution digital image sensor to capture
the interference pattern. The interference pattern is recorded in a similar way
as in conventional holography, but the reconstruction process is completely
different. Reconstruction requires the use of a signal processing algorithm to
Figure 3.1: The reflected light from the scene and the reference light
creates an interference pattern on a photographic film.
transform the digital recordings into visible images [1, 53]. The advantage over
conventional holography is that it only takes a fraction of a second to capture
and digitally store the image, instead of developing a photographic film. The
downside is that current sensor resolution is in the range 300-500 pixels/mm,
whereas the photographic film contains 3000 lines/mm.
A holographic setup with a digital sensor is illustrated in Figure 3.2. The
light source is a 633 nm Helium-Neon laser, which is divided into separate ref-
erence and object lights using a beam splitter. The object light passes through
the object and creates an interference pattern with the reference light on the
digital image sensor. The interference pattern (hologram) is used to digitally
reconstruct the original object image.
Components of the setup in Figure 3.2:
1. Mirror.
2. Polarization beam-splitter cube.
3/6. Half-wave plates.
4/5. Polarizers.
7. Mirror.
8. Iris diaphragm.
9. Object.
10. Beam splitter cube.
11. GRIN lens.
12. Digital image sensor.
sures the light amplitude (energy), but the phase shift still carries important
information. In the beginning of the 1930s, the Dutch physicist Frits Zernike
invented a technique to enhance the image contrast in microscopy, known as
phase contrast, for which he received the Nobel Prize in Physics in 1953. The
invention is a technique to convert the phase shift in a transparent specimen
into amplitude changes, in other words making invisible images visible [56].
However, the generated image does not provide the true phase of the light,
only improved contrast as illustrated in Figure 3.3(b).
An alternative to phase contrast microscopy is to instead increase the cells'
contrast. A common approach is to stain the cells using various dyes, such
as Trypan blue or Eosin. However, staining is not only a cumbersome and
time-consuming procedure, but could also be harmful to the cells. Therefore,
experiments to study growth, viability, and other cell characteristics over time
require multiple sets of cell samples. This prevents the study of individual cells
over time, and is instead based on the assumption that each cell sample has
similar growth-rate and properties. In addition, the staining itself can generate
artifacts that affect the result. In Figure 3.4(a), staining with Trypan blue has
been used to identify dead cells.
In contrast, digital holography provides a non-invasive method and does
not require any special cell preparation techniques [57]. The following sections
discuss the main advantages of digital holography in the field of microscopy,
namely the true phase-shift, the software autofocus, and the possibility to gen-
erate three-dimensional images.
Figure 3.4: To the left a dead cell and to the right a living cell. (a) Using
Trypan blue staining in a phase contrast microscope to identify dead
cells. Dead cells absorb the dye from the surrounding fluid. (b) Using
phase information in non-invasive digital holography to identify dead
cells. (c) Phase-shift when parallel light passes through a cell containing
regions with different refractive indices.
Figure 3.6: (a) Reference image ψr . (b) Object image ψo . (c) Holo-
gram or interference image ψh . (d) Close-up of interference fringes in a
holographic image.
ing either the object or the reference beam, the reference and object light are
captured respectively. An example of captured holographic images is shown in
Figure 3.6 together with a close-up on the recorded interference fringes in the
hologram.
The images captured on the digital image sensor need to be processed using
a signal processing algorithm, which reconstructs the original and visible image.
The reference and object light are first subtracted from the hologram, and the
remaining component is referred to as the modified interference pattern ψ
where k = 2π/λ is the angular wave number for a wavelength λ, and the
reference field is assumed to be spherical for simplicity [60]. By specifying
a point of origin, r ′ represents the vector to any point in the object plane
and r represents the vector to any point in the sensor plane, as illustrated in
Figure 3.7. The three-dimensional vectors can be divided into a z component
and an orthogonal two-dimensional vector ρ representing the x and y positions
as r = ρ + zẑ. The distance z ′ specifies the location of the image plane to be
reconstructed, whereas z is the distance to the sensor. The integral in (3.2)
can be expressed as a convolution
Ψ(ρ′ ) = ψ1 ∗ G, (3.3)
where
ψ1(ρ) = ψ(ρ) e^(−ik√(|z−zr|² + |ρ−ρr|²))
and
G(ρ) = e^(ik√(|z−z′|² + |ρ|²)) .
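The FFT evaluation rests on the convolution theorem: the DFT turns circular convolution into pointwise multiplication. A pure-Python 1-D sketch of this property (the thesis itself uses a padded two-dimensional FFT of ψ1 and G; the small vectors below are arbitrary test data):

```python
import cmath

def dft(x, inverse=False):
    """Naive DFT/IDFT, O(n^2), for illustration only."""
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(s * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [v / n for v in out] if inverse else out

def circular_convolution(a, b):
    n = len(a)
    return [sum(a[k] * b[(j - k) % n] for k in range(n)) for j in range(n)]

a = [1, 2, 3, 0]
b = [4, 0, 1, 0]
direct = circular_convolution(a, b)
via_dft = dft([x * y for x, y in zip(dft(a), dft(b))], inverse=True)
# direct and via_dft agree up to floating-point rounding
```

With a fast transform, this reduces the O(n²) convolution to O(n log n), which is what makes the reconstruction computationally tractable.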
The discrete version of the integral with an equidistant grid, equal to the sen-
sor pixel size (∆x, ∆y), generates a discrete convolution of (3.3) that can be
evaluated with the FFT [61] as
The size of the two-dimensional FFT needs to be at least the sum of the sensor
size and the object size in each dimension. Higher resolution is achieved by
shifting the coordinates a fraction of a pixel size and combining the partial
[Diagram for Figure 3.7: point of origin O with the vector r to the sensor plane and r′ to the object plane; for z ≫ λ, |r − r′| ≈ r − r · r′/r, or in the paraxial form (z − z′) + ((x − x′)² + (y − y′)²)/(2(z − z′)).]
to study cells over time. In biomedical imaging, this information can also
be used to study cell trajectories.
Properties                          Rayleigh-Sommerfeld   Fresnel/Fraunhofer approximation
Algorithmic complexity∗             2N² + 1               1
Software run-time∗∗ (s)             ≈ 82                  ≈ 1.1
Required throughput (samples/s)     15.3 G                210 M
∗ Number of 2D-FFTs required. The original algorithm requires refinement with N = 6.
∗∗ Based on the FFTW 2D-FFT [67] on a Pentium-4 2.0 GHz.
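The table entries are mutually consistent under the 25 fps real-time rate and the 2048-point two-dimensional FFT used elsewhere in the text. A quick check; interpreting one 2-D FFT as 2·NFFT² processed samples per frame (row plus column pass) is an assumption taken from the throughput relation fclk × Tcc = 2NFFT² × frate quoted later:

```python
N_REFINE = 6      # refinement factor N from the table footnote
N_FFT = 2048
FRAME_RATE = 25   # fps, real-time assumption from the text

ffts_rs = 2 * N_REFINE ** 2 + 1                    # 73 2D-FFTs per frame
# One 2-D FFT processes 2 * N_FFT^2 samples per frame (assumed row + column pass):
throughput_fresnel = 2 * N_FFT ** 2 * FRAME_RATE   # ~210 Msamples/s
throughput_rs = ffts_rs * throughput_fresnel       # ~15.3 Gsamples/s
```

The 73× ratio between the two columns also matches the ≈82 s versus ≈1.1 s software run-times.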
For the current application, a 1.3 Mpixel CMOS sensor with a video frame rate
of 15 fps has been selected. This is a trade-off to be able to produce high quality
images at close to video speed. However, it is likely that future image sensors
would enable combinations of higher resolution with higher frame rate.
[Plots for Figure 3.9: the horizontal axes show Tcc (samples/cc), ranging from more sequential to more parallel architectures; (a) required clock frequency, (b) normalized area.]
Figure 3.9: Feasible design space points (grey) that comply with the
system requirements. Area is estimated based on similar designs.
in Table 3.1, a real-time video frame rate of 25 fps is assumed. The table shows
the run-time in software and the required throughput to meet the real-time
constraints.
where FFT size NFFT and frame rate frate are given by the specification. The
choice of hardware architecture and the selected system clock frequency control
the throughput, which results in the following relations
fclk × Tcc = 2 NFFT² × frate ,
A ∝ Tcc ,
where fclk is the clock frequency, and Tcc is the scaled throughput in sam-
ples per clock cycle. It is assumed that the required area (complexity) A is
proportional to the throughput. An attempt to double the throughput for the
40 CHAPTER 3. DIGITAL HOLOGRAPHIC IMAGING
FFT algorithm, for given operational conditions such as frequency and supply
voltage, would result in twice the amount of computational resources.
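These relations can be swept directly; a small sketch using the FFT size and frame rate from the specification in the text:

```python
N_FFT = 2048
F_RATE = 25   # fps

def required_fclk_mhz(t_cc):
    """Required clock frequency from fclk * Tcc = 2 * N_FFT^2 * f_rate."""
    return 2 * N_FFT ** 2 * F_RATE / t_cc / 1e6

# From time-multiplexed (small Tcc, high fclk) to parallel (large Tcc, low fclk):
fclk_points = {t: required_fclk_mhz(t) for t in (0.5, 1.0, 2.0)}
# Tcc = 1 sample/cc needs roughly 210 MHz; doubling Tcc halves fclk,
# while the area A grows in proportion to Tcc.
```

This is exactly the trade-off visualized in Figure 3.9, with the clock-frequency ceiling bounding Tcc from below and the area budget bounding it from above.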
The equations enable design space exploration based on the variable param-
eters. Samples per clock cycle relate to architectural selection, while frequency
corresponds to operational conditions. Figure 3.9 shows points in the design
space that comply with the specification. Both graphs use the horizontal axis
to represent the architectural scaling, ranging from time-multiplexed (folded)
to more parallel architectures. Figure 3.9(a) shows the clock frequency required
for a certain architecture. Since the clock frequency has an upper feasible limit,
it puts a lower bound on Tcc . Figure 3.9(b) shows the area requirement, which
is assumed to grow linearly with the architectural complexity, and puts an up-
per bound on Tcc . Hence, Tcc is constrained by system clock frequency and the
hardware area requirement. The graphs can be used to select a suitable and
feasible architecture, where Tcc = 1 seems to be a reasonable candidate. Based
on this initial analysis, the architectural decisions are further discussed in
Part I.
Part I
Abstract
A hardware acceleration platform for image reconstruction in digital holo-
graphic imaging is proposed. The hardware accelerator executes a computa-
tionally demanding reconstruction algorithm which transforms an interference
pattern captured on a digital image sensor into visible images. The focus of this
work is to maximize computational efficiency and to minimize the external
memory transfer overhead, as well as the required internal buffering. We present
an efficient processing datapath with a fast transpose unit and an interleaved
memory storage scheme. The proposed architecture results in a speedup by
a factor of 3 compared with the traditional column/row approach for calculating
the two-dimensional FFT. Memory sharing between the computational units
reduces the on-chip memory requirements by over 50%. The custom hardware
accelerator, extended with a microprocessor and a memory controller,
has been implemented on a custom designed FPGA platform and integrated
in a holographic microscope to reconstruct images. The proposed architec-
ture targeting a 0.13 µm CMOS standard cell library achieves real-time image
reconstruction with over 30 frames per second.
1 Introduction
In digital holography, the photographic film used in conventional holography
is replaced by a digital image sensor to obtain a digital hologram, as shown
in Figure 1(a). A reconstruction algorithm processes the images captured by
the digital sensor to create a visible image, which in this work is implemented
using a hardware acceleration platform, as illustrated in Figure 1(b).
A hardware accelerator for image reconstruction in digital holographic imag-
ing is presented. The hardware accelerator, referred to as Xstream, executes
a computationally demanding reconstruction algorithm based on a 2048 × 2048
point two-dimensional FFT, which requires a substantial amount of signal pro-
cessing. It will be shown that a general-purpose computer is not capable of
meeting the real-time constraints, hence a custom solution is presented.
The following section presents how to reduce the complexity of the recon-
struction algorithm, as presented in Chapter 3.2.1, and how it can be adapted
for hardware implementation. System requirements and hardware estimates
are discussed in Section 2, and Section 3 presents the proposed hardware archi-
tecture and the internal functional units. In Section 4, the hardware accelerator
is integrated into an embedded system, which is followed by simulation results
and proposed optimizations in Section 5. Results and comparisons are pre-
sented in Section 6 and finally conclusions are drawn in Section 7.
[Figure 1: (a) The holographic setup: the laser beam is split, and the object light passes through a lens and the object, producing the reference, object, and hologram images ψr, ψo, and ψh. (b) The captured images ψr, ψo, ψh are processed by the hardware acceleration platform.]
The first exponential term in (2) can be removed since it only affects the object
location in the reconstructed image. Instead, the coordinates of the object loca-
tion can be modified after reconstruction. The image reconstruction algorithm
with reduced complexity requires only a single FFT as
Ψ(ρ′) ≈ F(ψ(ρ) e^(−ikzz′/r)) ,   (3)
2 System Requirements
As discussed in Chapter 3.3, the performance of the two-dimensional FFT will
limit the number of frames that can be reconstructed per second. The selected
digital image sensor has a resolution of 1312 × 1032 pixels, with a pixel pitch
of 6 × 6 µm, and a precision of 8 bits per pixel. The hardware accelerator is
designed to satisfy the following specification:
frate = 25 fps
NFFT = 2048 points
Tcc = 1 sample/cc
[Figure 3: Dataflow of the reconstruction: (a) the combine unit merges ψ(ρ) with the phase factor α, followed by a 1-D FFT with bit-reverse and spectrum shift; (b) a second 1-D FFT pass with bit-reverse, spectrum shift, and max |Ψ| detection, streaming through external memory; (c) post-processing of |Ψ| and ∠Ψ.]
3 Proposed Architecture
Figure 3 shows the dataflow for the reconstruction algorithm and additional
processing to obtain the result image. Images recorded with the digital sensor
are transferred and stored in external memory. The processing is divided into
three sequential steps, each streaming data from external memory. Storing
images or intermediate data in on-chip memory is not feasible due to the size of the data sets.
[Figure 4: The Xstream datapath: the combine unit, CORDIC, complex multiplier (CMUL), 1-D FFT, and buffer units stream data between DMA interfaces, with address generation units (AGU) and a configurable dataflow protocol on each side.]
Figure 5: (a) The three captured images and the phase factor. (b) Ex-
ample of how the captured images are stored interleaved in external
memory. The buffer in the combine unit contains 4 groups holding 8
pixels each, requiring a total buffer size of 32 pixels. Grey indicates the
amount of data copied to the input buffer in a single burst transfer.
where B is the size of the block of pixels to read from each image. Hence,
the internal buffer in the combine unit must be able to store 4B pixels. Using
this scheme, both storing captured images and reading images from memory
can be performed with single burst transfers. Figure 6 shows how the reorder
operation depends on the burst size Nburst , a parameter that also defines the
block size B and the required internal buffering. For a short burst length, it is
more efficient to store the images separately since the write operation can be
[Figure 6: Clock cycles per pixel as a function of the read burst length Nburst (= buffer size), comparing separated images (read and write), interleaved images (read and write), writing images to memory using interleaving, and reading interleaved images from memory.]
burst oriented. When the burst length increases above 8, interleaving becomes a suitable alternative but requires more buffer space. A good trade-off between speed and memory is found where the interleaving curve flattens out, which suggests a burst length between 32 and 64. Parameter selection is further
discussed in Section 5.
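The interleaved layout of Figure 5 can be sketched as a simple address mapping. The sketch below is my own illustration (the function name and the fixed order of the four streams are assumptions): pixel i of stream s is placed so that each group of B pixels from one stream is contiguous, which is what makes single-burst reads and writes possible.

```python
# Sketch of the interleaved memory layout from Figure 5 (assumed layout):
# four streams (psi_h, psi_o, psi_r, alpha) are stored in groups of B pixels,
# so that B consecutive pixels of one stream occupy one contiguous burst.
B = 8            # block (burst) size in pixels
N_STREAMS = 4    # psi_h, psi_o, psi_r and the phase factor alpha

def interleaved_addr(stream, i, B=B):
    """Address of pixel i of the given stream (0..3) in the interleaved layout."""
    group, offset = divmod(i, B)
    return group * N_STREAMS * B + stream * B + offset

# B consecutive pixels of one stream map to B consecutive addresses,
# i.e. one burst transfer per block:
addrs = [interleaved_addr(1, i) for i in range(B)]
print(addrs)  # [8, 9, 10, 11, 12, 13, 14, 15]
```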
[Figure 7: (a) A pipeline FFT built from MR-2 (modified radix-2) butterfly stages with delay feedback buffers of length 8, 4, 2, and 1, transforming the linear input sequence xlin (0, 1, 2, 3, . . .) into the bit-reversed output Xbitrev (0, 8, 4, 12, . . .). (b) An MR-2² stage, in which the −j rotation is a trivial operation.]
Transpose unit - Considering the large size of the data matrix, on-chip storage during processing is not realistic, and transfers to external memory are required. The access pattern for a transpose operation is normally reading rows and writing columns, or vice versa. To avoid column-wise memory access, a transpose operation can be broken down into a set of smaller transpose operations
Figure 8: The transpose logic uses the DMA controllers, the AGU units
and the internal buffer to transpose large matrices. In the figure, a
32 × 32 matrix is transposed by individually transposing and relocating
4 × 4 macro blocks. Each macro block contains an 8 × 8 matrix.
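The macro-block scheme in Figure 8 can be illustrated in software. The sketch below is an illustrative model, not the RTL: it reads each M × M macro block row-wise (one burst per row), transposes it in the internal buffer, and writes it back to the mirrored block position, so that only row-wise (burst-friendly) memory accesses occur.

```python
# Illustrative model of the block transpose from Figure 8 (not the RTL):
# a 32 x 32 matrix is transposed via 4 x 4 macro blocks of 8 x 8 elements.
N = 32   # matrix dimension (NFFT in the text)
M = 8    # macro block dimension

src = [[y * N + x for x in range(N)] for y in range(N)]
dst = [[0] * N for _ in range(N)]

for by in range(N // M):          # macro block row index (y in the text)
    for bx in range(N // M):      # macro block column index (x in the text)
        # read one M x M block row-wise, transpose it in the buffer,
        # and write it to the exchanged block position (bx, by) -> (by, bx)
        for r in range(M):
            for c in range(M):
                dst[bx * M + c][by * M + r] = src[by * M + r][bx * M + c]

# the result equals a full element-wise transpose
assert all(dst[y][x] == src[x][y] for y in range(N) for x in range(N))
```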
ADDR_{x,y} = M(x + y N_{FFT}),   x, y = 0, 1, . . . , M − 1.
Input and output AGUs generate base addresses in reverse order by exchanging
index counters x, y to relocate the macro blocks. Figure 9 shows a simulation of
the row-column FFT and the row-transpose-row FFT. For the latter approach, the transpose operation is required, and it is also shown separately in the graph. When the burst length is short, the transpose overhead dominates the computation time. As the burst length increases, the row-transpose-row FFT improves the overall performance. The graph representing row-transpose-row flattens out when the burst length is between 16 and 32 words, which is a good trade-off between speed and memory.
[Figure 9: Clock cycles per element as a function of the burst length Nburst, comparing the row/column two-dimensional FFT, the row/transpose/row two-dimensional FFT, and the transpose operation alone.]
4 System Design
This section presents the integration of the Xstream accelerator into an em-
bedded system and a prototype of a holographic microscope. The system cap-
tures, reconstructs and presents holographic images. The architecture is based
on the Xstream accelerator, extended with an embedded SPARC compat-
ible microprocessor [77] and a memory controller for connecting to external
memory. Only a single memory interface is supported in the prototype, but
the Xstream accelerator supports streaming from multiple memories. Two
additional interface blocks provide functionality for capturing images from an
external sensor device and for presenting reconstructed images on a monitor. The
complete system is shown in Figure 10.
[Figure 10: The complete system: a SPARC-compatible (LEON) processor, the memory controller, the Xstream accelerator, and DMA-based sensor and VGA interfaces connected to the AHB bus, with UART, I²C, IRQ, and I/O peripherals on the APB bus behind a bridge.]
Figure 11: (a) The DMA interface connects the internal dataflow protocol
to the external bus using a buffer and an AMBA bus interface. (b)
The DMA interface can access a two-dimensional matrix inside a two-
dimensional memory array, where skip is the distance to the next row.
words on the write side. An example is when calculating the vector magnitude
of the resulting image and then rescaling the value into an 8-bit pixel. Pixels
are processed individually, but combined into 32-bit words (groups of 4 pixels)
in the buffer before being transferred to memory to reduce the transfer size. The
buffer also has the ability to reorder the output data in various ways, which is
further explained in Section 5.1.
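As an illustration of this packing step (my own sketch; the rescale and saturation details are assumptions, not taken from the implementation), four 8-bit pixels can be packed into one 32-bit word before the burst transfer:

```python
# Sketch of the output packing stage (assumed details): each magnitude
# is rescaled and clipped to an 8-bit pixel, and groups of 4 pixels are
# packed into one 32-bit word, pixel 0 in the least significant byte.
def to_pixel(mag, scale):
    """Rescale a magnitude to an 8-bit pixel value with saturation."""
    return min(255, max(0, int(mag * scale)))

def pack_word(pixels):
    """Pack 4 pixels (8 bits each) into one 32-bit word, LSB first."""
    assert len(pixels) == 4
    word = 0
    for i, p in enumerate(pixels):
        word |= (p & 0xFF) << (8 * i)
    return word

pixels = [to_pixel(m, scale=0.5) for m in (2.0, 4.0, 6.0, 600.0)]
print(pixels)                  # [1, 2, 3, 255]  (last value saturates)
print(hex(pack_word(pixels)))  # 0xff030201
```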
ical window, as shown in Figure 12. From an application point of view, the
behavior of the system is the same.
Table 1: Required buffering inside each processing block for the original and op-
timized pipeline. The addressing mode for each buffer depends on the function
that the block is evaluating.
address bits, which requires the addressing modes to be flexible with a dynamic
address wordlength. Since bit-reverse and FFT spectrum shift are often used in conjunction, this addressing mode can be optimized. The spectrum shift inverts
the MSB, as shown in Figure 13(d), and the location of the MSB depends on
the transform size. However, in bit-reversed addressing the MSB is actually the
LSB, and the LSB location is always constant. By reordering the operations,
the cost for moving the spectrum is a single inverter.
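This identity is easy to verify in software. Since reversing the bits of N/2 (the MSB weight) gives 1, shifting the spectrum before bit-reversal is the same as inverting the LSB after bit-reversal: bitrev(i XOR N/2) = bitrev(i) XOR 1. The sketch below (my own check, with an assumed helper name) confirms this for a 32-point address space:

```python
# Verify that an FFT spectrum shift (invert the MSB, i.e. XOR with N/2)
# combined with bit-reversed addressing reduces to inverting the LSB:
# bitrev(i ^ N//2) == bitrev(i) ^ 1, so a single inverter suffices.
def bitrev(i, bits):
    """Reverse the lowest `bits` bits of i."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (i & 1)
        i >>= 1
    return out

N, bits = 32, 5
for i in range(N):
    assert bitrev(i ^ (N // 2), bits) == bitrev(i, bits) ^ 1
print("spectrum shift in bit-reversed addressing = one LSB inverter")
```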
• The reorder operation in the image combine block is moved to the DMA
input buffer. The only modification required is that the DMA input
buffer must support both linear and interleaved addressing mode.
[Figure 13: Address bit mappings for the supported addressing modes, where for example the bit-reverse mode reverses the address bits a4 . . . a0 to a0 . . . a4 and the spectrum shift mode in (d) inverts the MSB a4.]
• The output DMA reads data in linear mode directly from the main buffer unit.
Figure 14 shows the memory requirements before and after optimization and
how it depends on Nburst. The delay feedback memory in the FFT cannot be
shared and is not included in the memory simulation. For short burst lengths,
the memory requirements are reduced from ≈ 3 K words to ≈ 2 K words. The
memory requirement then increases with the burst length for the unoptimized
design, but stays constant for the optimized design up to Nburst = 32. After
this point, the memory requirements for both architectures rapidly increase.
[Figure 14: Memory requirements in words as a function of the burst length Nburst for the unoptimized and the optimized pipeline.]
and that Nburst is an integer power of 2. When Nburst² exceeds NFFT, internal memory requirements rapidly increase, which leads to a high area cost according to Figure 14 and a relatively low performance improvement according to Figure 6 and Figure 9. Another condition is that the image read operation should generate one set of pixels per clock cycle, or at least close to this value, to supply the FFT with input data at full speed to balance the throughput. Selecting Nburst = 32 satisfies these conditions and results in:
• The combine unit requires 2.8 cc/element to store and read the image
data from external memory in interleaved mode. This is a speed-up factor
of approximately 2 compared to storing images separately in memory, as
shown in Figure 6.
• The combine unit is capable of constantly supplying the FFT unit with
data. Hence, the total system speed-up factor is the same as for the
two-dimensional FFT.
6 Results and Comparisons
Table 2: Equivalent gates (NAND2) and memory cells (RAM and ROM) after
synthesis to a 0.13 µm standard cell library. Slice count and block RAMs are
presented for the FPGA design.
• The optimized pipeline requires less than 50% of the memory compared to the original pipeline, reduced from over 4 Kbit down to 2 Kbit as shown in Figure 14.
Table 3: Comparison between related work and the proposed architecture. Per-
formance is presented as fps/MHz, normalized to 1.0 for the proposed FPGA
design.
required block RAMs are presented, where the Xstream accelerator occupies
approximately 65% of the FPGA resources. The largest part of the design is
the 2048-point pipeline FFT, which is approximately 73% of the Xstream ac-
celerator. The FPGA design runs at a clock frequency of 24 MHz, limited by
the embedded processor, while the design synthesized for a 0.13µm cell library
is capable of running up to 398 MHz. A floorplan of the Xstream accelerator
is shown in Figure 16, with a core area of 1500 × 1400 µm2 containing 352 K
equivalent gates.
In related work on two-dimensional FFT implementations, the problem of memory organization is rarely mentioned. Instead, a high bandwidth from memory with uniform access time is assumed (SRAM). However, for computing a large multi-dimensional FFT, the memory properties and data organization must be taken into account, as discussed in this work. Table 3 shows a comparison between the proposed architecture, a modern desktop computer, and related work. In [79], an ASIC design of a 512-point two-dimensional FFT connected to a single memory interface is presented. [80] presents an
FPGA implementation with variable transform size storing data in four sep-
arate memory banks. To compare the processing efficiency between different
architectures, a performance metric is defined as (fps / MHz) and normalized
to 1.0 for the proposed FPGA implementation. The frame rate is estimated for
a transform size of 2048 × 2048 points. The table shows the proposed architec-
ture to be highly efficient, resulting in real-time image reconstruction with over
30 fps for the proposed ASIC design. The reason for increased efficiency when
targeting ASIC is that the DMA transfers with fixed bandwidth requirements,
such as the sensor and VGA interfaces from Figure 10, will have less impact
on the total available bandwidth as the system clock frequency increases.
Figure 16: Final layout of the Xstream accelerator using a 0.13 µm cell
library. The core size is 1500 × 1400 µm2 .
7 Conclusion
A hardware acceleration platform for image processing in digital holography
has been presented. The hardware accelerator contains an efficient datapath
for calculating FFT and other required operations. A fast transpose unit is
proposed that significantly improves the computation time for a two-dimensional FFT, by a factor of 3 compared with the traditional row/column approach. To cope with the increased bandwidth and to balance the throughput of the computational units, a fast reorder unit is proposed to store captured images and read data in an interleaved fashion. This results in a speedup of 2 compared with accessing separately stored images in memory. It is also shown how to reduce the memory requirement in a pipelined design by over 50% by sharing buffers between modules. The design has been
synthesized and integrated in an FPGA-based system for digital holography.
The same architecture targeting a 0.13 µm CMOS standard cell library achieves
real-time image reconstruction with over 30 frames per second.
Part II
Abstract
Dynamic data scaling in pipeline FFTs is suitable when implementing large
size FFTs in applications such as DVB and digital holographic imaging. In
a pipeline FFT, data is continuously streaming and must hence be scaled
without stalling the dataflow. We propose a hybrid floating-point scheme
with tailored exponent datapath, and a co-optimized architecture between hy-
brid floating-point and block floating-point to reduce memory requirements for
two-dimensional signal processing. The presented co-optimization generates
a higher SQNR and requires less memory than for instance convergent block
floating-point. A 2048 point pipeline FFT has been fabricated in a standard
CMOS process from AMI Semiconductor [9], and an FPGA prototype integrat-
ing a two-dimensional FFT core in a larger design shows that the architecture
is suitable for image reconstruction in digital holographic imaging.
Based on: T. Lenart and V. Öwall, “Architectures for dynamic data scaling in
2/4/8K pipeline FFT cores,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ISSN 1063-8210, vol. 14, no. 11, Nov. 2006.
and: T. Lenart and V. Öwall, “A 2048 Complex Point FFT Processor using a Novel
Data Scaling Approach,” in Proceedings of IEEE International Symposium on Circuits
and Systems, vol. 4, Bangkok, Thailand, May 2003, pp. 45–48.
1 Introduction
The discrete Fourier transform is a commonly used operation in digital signal
processing, where typical applications are linear filtering, correlation, and spec-
trum analysis [68]. The Fourier transform is also found in modern communi-
cation systems using digital modulation techniques, including wireless network
standards such as 802.11a [81] and 802.11g [82], as well as in audio and video
broadcasting using DAB and DVB.
The DFT is defined as

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn},   0 \le k < N,   (1)

where

W_N = e^{-i2\pi/N}.   (2)
Evaluating (1) requires N MAC operations for each transformed value in X, or
N 2 operations for the complete DFT. Changing transform size significantly af-
fects computation time, e.g. calculating a 1024-point Fourier transform requires
three orders of magnitude more work than a 32-point DFT.
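The N² growth is easy to see in a direct implementation. The sketch below (illustrative, not an optimized kernel) evaluates (1) with N complex multiply-accumulates per output value, and checks the "three orders of magnitude" claim: (1024/32)² = 1024.

```python
import cmath

# Direct evaluation of the DFT in (1): N MAC operations per output value,
# N*N operations for the complete transform.
def dft(x):
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)          # W_N from (2)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]

# Operation count ratio between a 1024-point and a 32-point DFT:
ratio = 1024**2 / 32**2
print(ratio)  # 1024.0 -> about three orders of magnitude

# quick sanity check: the DFT of an impulse is flat
X = dft([1, 0, 0, 0])
print([round(abs(v), 6) for v in X])  # [1.0, 1.0, 1.0, 1.0]
```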
A more efficient way to compute the DFT is to use the fast Fourier transform
(FFT) [83]. The FFT is a decomposition of an N -point DFT into successively
smaller DFT transforms. The concept of breaking down the original problem
into smaller sub-problems is known as a divide-and-conquer approach. The
original sequence can for example be divided into N = r1 · r2 · ... · rq where each
r is a prime. For practical reasons, the r values are often chosen equal, creating
a more regular structure. As a result, the DFT size is restricted to N = rq ,
where r is called radix or decomposition factor. Most decompositions are based
on a radix value of 2, 4 or even 8 [84]. Consider the following decomposition of
(1), known as radix-2
X(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn}
     = \sum_{n=0}^{N/2-1} x(2n) W_N^{k(2n)} + \sum_{n=0}^{N/2-1} x(2n+1) W_N^{k(2n+1)}
     = \underbrace{\sum_{n=0}^{N/2-1} x_{even}(n) W_{N/2}^{kn}}_{DFT_{N/2}(x_{even})} + W_N^k \underbrace{\sum_{n=0}^{N/2-1} x_{odd}(n) W_{N/2}^{kn}}_{DFT_{N/2}(x_{odd})}.   (3)
The original N -point DFT has been divided into two N/2 DFTs, a procedure
that can be repeated over again on the smaller transforms. The complexity is
thus reduced from O(N 2 ) to O(N log2 N ). The decomposition in (3) is called
decimation-in-time (DIT), since the input x(n) is decimated with a factor of 2
when divided into an even and odd sequence. Combining the result from each
transform requires a scaling and add operation. Another common approach
is known as decimation-in-frequency (DIF), splitting the input sequence into
x1 = {x(0), x(1), ..., x(N/2 − 1)} and x2 = {x(N/2), x(N/2 + 1), ..., x(N − 1)}.
The summation now yields

X(k) = \sum_{n=0}^{N/2-1} x(n) W_N^{kn} + \sum_{n=N/2}^{N-1} x(n) W_N^{kn}   (4)
     = \sum_{n=0}^{N/2-1} x_1(n) W_N^{kn} + \underbrace{W_N^{kN/2}}_{(-1)^k} \sum_{n=0}^{N/2-1} x_2(n) W_N^{kn},
where W_N^{kN/2} can be extracted from the summation since it only depends on the value of k, and is expressed as (−1)^k. This expression divides, or decimates, X(k) into two groups depending on whether (−1)^k is positive or negative. That is, one equation calculates the even values and one calculates the odd values as in
X(2k) = \sum_{n=0}^{N/2-1} \left[ x_1(n) + x_2(n) \right] W_{N/2}^{kn}   (5)
      = DFT_{N/2}(x_1(n) + x_2(n))

and

X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x_1(n) - x_2(n) \right] W_N^n W_{N/2}^{kn}   (6)
        = DFT_{N/2}((x_1(n) - x_2(n)) W_N^n).
(5) calculates the sum of two sequences, while (6) calculates the difference and
then scales the result. This kind of operation, adding and subtracting the same
two values, is commonly referred to as a butterfly due to its butterfly-like shape in
the flow graph, shown in Figure 1(a). Sometimes, scaling is also considered to
be a part of the butterfly operation. The flow graph in Figure 1(b) represents
[Figure 1: (a) The butterfly operation, computing x1 + x2 and (x1 − x2)WN. (b) Flow graph of an 8-point radix-2 DIF FFT, where the first stage of butterflies and twiddle factors W8^0 . . . W8^3 feeds two N/2-point DFTs, and the output X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7) appears in bit-reversed order.]
the computations from (5) and (6), where each decomposition step requires
N/2 butterfly operations.
In Figure 1(b), the output sequence from the FFT appears scrambled. The
binary output index is bit-reversed, i.e. the most significant bits (MSB) have
changed place with the least significant bits (LSB), e.g. 11001 becomes 10011.
To unscramble the sequence, bit-reversed indexing is required.
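The DIF decomposition in (5) and (6) and the bit-reversed output order can be demonstrated with a short model (my own sketch of the textbook algorithm, not the hardware): each stage applies N/2 butterflies, and each result ends up at the bit-reversed index.

```python
import cmath

def fft_dif(x):
    """Radix-2 DIF FFT per (5) and (6); output is in bit-reversed order."""
    x, N = list(x), len(x)
    size = N
    while size > 1:
        half = size // 2
        for start in range(0, N, size):
            for n in range(half):
                a, b = x[start + n], x[start + n + half]
                x[start + n] = a + b                                          # (5)
                x[start + n + half] = (a - b) * cmath.exp(-2j * cmath.pi * n / size)  # (6)
        size = half
    return x

def bitrev(i, bits):
    out = 0
    for _ in range(bits):
        out, i = (out << 1) | (i & 1), i >> 1
    return out

# compare against the DFT definition (1) for an 8-point example
x = [complex(n, 0) for n in range(8)]
ref = [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 8) for n in range(8))
       for k in range(8)]
y = fft_dif(x)
assert all(abs(y[bitrev(k, 3)] - ref[k]) < 1e-9 for k in range(8))
print("DIF output matches the DFT after bit-reversed reordering")
```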
[Figure 2: Butterfly flow graphs for higher-radix decompositions, (a) and (b), where the rotations by ±j are trivial operations.]
Figure 3: (a) Flow graph of a 4-point radix-2 FFT (b) Mapping of the
4-point FFT using two radix-2 butterfly units with delay feedback mem-
ories, where the number represents the FIFO depth.
2 Proposed Architectures
Demands are currently increasing towards larger and multi-dimensional transforms for use in synthetic aperture radar (SAR) and scientific computing, including biomedical imaging, seismic analysis, and radio astronomy. Larger transforms require more processing on each data sample, which increases the total quantization noise. This can be avoided by gradually increasing the
wordlength inside the pipeline, but will also increase memory requirements
as well as the critical path in arithmetic components. For large size FFTs,
dynamic scaling is therefore a suitable trade-off between arithmetic complexity
and memory requirements. The following architectures have been evaluated
and compared with related work:
(A) A hybrid floating-point pipeline with fixed-point input and tailored expo-
nent datapath for one-dimensional (1D) FFT computation.
[Figure: Number formats and pipeline structure for the hybrid floating-point architecture: a value occupies 2M + E bits (two M-bit mantissas and a shared E-bit exponent), twiddle factors use T bits, and the delay feedback buffers between the butterfly stages shrink as N, N/2, N/4.]
2.3 Co-Optimization
In this section a co-optimized architecture that combines hybrid floating-point
and BFP is proposed. By extending the hybrid floating-point architecture
with small intermediate buffers, the size of the delay feedback memory can be
reduced. Figure 7(a-c) show dynamic data scaling for hybrid floating-point,
CBFP, and the proposed co-optimization architecture. Figure 7(c) is a com-
bined architecture with an intermediate buffer to apply block scaling on D
elements, which reduces the storage space for exponents in the delay feedback.
[Figure 7: Dynamic data scaling architectures: (a) hybrid floating-point with 2M + E bits per value in the delay feedback, (b) CBFP with intermediate buffers, and (c) the proposed co-optimization, where an intermediate buffer of length D reduces the exponent storage in the delay feedback to N/D block exponents.]
D(\alpha_i) = 2^{\alpha_i},   0 \le \alpha_i \le i + 1.
The total bits required for supporting dynamic scaling is the sum of exponent bits in the delay feedback unit and the total size of the intermediate buffer. This can be expressed as

Mem_i = \underbrace{E_i \left\lfloor \frac{\gamma N_i}{D(\alpha_i)} \right\rfloor}_{\text{delay feedback}} + \underbrace{L(D(\alpha_i) - 1)}_{\text{buffer}},   (8)

where

\gamma = \begin{cases} 1 & \text{Radix-2} \\ 3/2 & \text{Radix-2}^2 \end{cases}

and

L = \begin{cases} 2M + E_i & i = i_{max} \\ 2(M + T) & 0 \le i < i_{max}. \end{cases}
[Figure 8: Memory requirements as a function of the intermediate buffer length D, comparing CBFP, hybrid floating-point, and the proposed co-optimization.]
For radix-2² butterflies, (8) is only defined for odd values of i. This is compensated by a scale factor γ = 3/2 to include both delay feedback units in the radix-2² butterfly, as shown in Figure 4. The buffer input wordlength L differs
between initial and internal butterflies. For every butterfly stage, αi is chosen
to minimize (8). For example, an 8192 point FFT using a hybrid floating-point
format of 2 × 10 + 4 bits requires 16 Kb of memory in the initial butterfly for
storing exponents, as shown in Figure 8. The number of memory elements for
supporting dynamic scaling can be reduced to only 1256 bits by selecting a
block size of 32, hence removing over 90% of the storage space for exponents.
The hardware overhead is a counter to keep track of when to update the block
exponent in the delay feedback, similar to the exponent control logic required
in CBFP implementations. Thus the proposed co-optimization architecture
supports hybrid floating-point on the input port at very low hardware cost.
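The 8192-point example can be reproduced directly from (8). The sketch below is my own evaluation under assumed parameters (M = 10, E = 4, a radix-2 delay feedback of N = 4096 elements with γ = 1, and L = 2M + E = 24 bits into the buffer): D = 1 gives the 16 Kb exponent storage, and D = 32 gives 1256 bits, a reduction of over 90%.

```python
# Evaluate (8) for the initial butterfly of an 8192-point FFT with a
# 2 x 10 + 4 bit hybrid floating-point format. Assumed parameters:
# gamma = 1 (a single radix-2 delay feedback of N = 4096 elements),
# E = 4 exponent bits, L = 2M + E = 24 bits into the intermediate buffer.
E, M = 4, 10
N = 4096
L = 2 * M + E

def mem_bits(D):
    """Memory for dynamic scaling: block exponents + intermediate buffer, (8)."""
    return E * (N // D) + L * (D - 1)

print(mem_bits(1))    # 16384 bits (16 Kb): one exponent per element
print(mem_bits(32))   # 1256 bits: block size 32
print(1 - mem_bits(32) / mem_bits(1))  # > 0.9, i.e. over 90% saved
```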
[Figure 9: A bidirectional pipeline FFT with delay feedback buffers of length 8, 4, 2, and 1, where multiplexers (0/1) at each MR-2 stage select the dataflow direction: forward from x to X, or reverse from Y to y.]
Since the input and output format is the same, this architecture then becomes
suitable for 2D FFT computation.
3 Architectural Extensions
The architectures described in this paper have been extended with support for
bidirectional processing, which is important for the intended application and
also in many general applications. A pipeline FFT can support a bidirectional
dataflow if all internal butterfly stages have the same wordlength. The advantage of a bidirectional pipeline is that input data can be supplied either in
linear or bit-reversed sample order by changing the dataflow direction. One
application for the bidirectional pipeline is to exchange the FFT/IFFT struc-
ture using reordering buffers in an OFDM transceiver to minimize the required
buffering for inserting and removing the cyclic suffix, proposed in [76]. OFDM
implementations based on CBFP have also been proposed in [74], but these
solutions only operate in one direction since input and output format differ.
Another application for a bidirectional pipeline is to evaluate 1D and 2D convo-
lutions. Since the forward transform generates data in bit-reversed order, the
architecture is more efficient if the inverse transform supports a bit-reversed
input sequence as shown in Figure 9. Both input and output from the convo-
lution are in linear sample order, hence no reorder buffers are required. The
hardware requirement for a bidirectional pipeline is limited to multiplexers on
the inputs of each butterfly and on each complex multiplier. Each unit requires
26 two-input muxes for internal 2 × 11 + 4 format, which is negligible compared
to the size of an FFT stage.
4 Simulations
A simulation tool has been designed to evaluate different FFT architectures in
terms of precision, dynamic range, memory requirements, and estimated chip
size based on architectural descriptions. The user can specify the number of
bits for representing mantissa M , exponents E, twiddle factors T , FFT size
(NFFT ), rounding type and simulation stimuli. To make a fair comparison
with related work, all architectures have been described and simulated in the
developed tool.
First, we compare the proposed architectures with CBFP in terms of mem-
ory requirements and signal quality. In addition to the lower memory require-
ments, we will show how the co-optimized architecture produces a higher SQNR
than CBFP. Secondly, we will compare the fabricated design with related work
in terms of chip size and data throughput.
Table 1 shows a comparison of memory distribution between delay feed-
back units and intermediate buffers. 1D architectures have fixed-point input,
whereas 2D architectures support hybrid floating-point input. The table shows
that the intermediate buffers used in CBFP consume a large amount of memory,
which favors the co-optimized architecture for 1D processing. For 2D
processing, the co-optimized architecture also has lower memory requirements
than hybrid floating-point due to the buffer optimization. Figures 10 and 11
present simulation results for the 1D architectures in Table 1. Figure 10 is a
simulation to compare SQNR when changing energy level in the input signal.
In this case, the variations only affect CBFPlow since scaling is applied later
in the pipeline. Figure 11 shows the result when applying signals with a large
crest factor, i.e. the ratio between peak and mean value of the input. In this
case, both CBFP implementations are strongly affected due to the large block
size in the beginning of the pipeline. Signal statistics have minor impact on the
hybrid floating-point architecture since every value is scaled individually. The
SQNR for the co-optimized solution is located between hybrid floating-point
and CBFP since it uses a relatively small block size.
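The SQNR metric used in these comparisons can be stated compactly. The sketch below is my own definition of the measurement (the function name and test values are illustrative), comparing a scaled or quantized signal against the full-precision reference:

```python
import math

def sqnr_db(ref, test):
    """Signal-to-quantization-noise ratio in dB between a reference
    signal and its quantized counterpart."""
    signal = sum(abs(r) ** 2 for r in ref)
    noise = sum(abs(r - t) ** 2 for r, t in zip(ref, test))
    return 10 * math.log10(signal / noise)

# a 1% amplitude error on a unit signal gives 40 dB
print(sqnr_db([1.0, 0.0], [0.99, 0.0]))  # ~40.0
```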
Figure 10: Decreasing the energy in a random value input signal affects only the architectures where scaling is not applied in the initial stage. Signal level = 1 means utilizing the full dynamic range.
Figure 11: Decreasing the energy in a random value input signal with
peak values utilizing the full dynamic range. This affects all block scal-
ing architectures, and the SQNR depends on the block size. The co-
optimized architecture performs better than convergent block floating-
point, since it has a smaller block size through the pipeline.
does not support scaling and is not directly comparable in terms of precision, since SQNR depends on the input signal. The wordlength increases gradually in the pipeline to minimize the quantization noise, but this increases the memory requirements and, more importantly, the wordlength in arithmetic components and therefore also the chip area.
The proposed architectures have low hardware requirements and produce
high SQNR using dynamic data scaling. They can easily be adapted to 2D sig-
nal processing, in contrast to architectures without data scaling or using CBFP.
The pipeline implementation results in a high throughput by continuous data
streaming, which is shown as peak performance of 1D transforms in Table 2.
5 VLSI Implementation
A 2048 complex point pipeline FFT core using hybrid floating-point and based
on the radix-22 decimation-in-frequency algorithm [71] has been designed, fab-
ricated, and verified. This section presents internal building blocks and mea-
surements on the fabricated ASIC prototype.
The butterfly units calculate the sum and the difference between the input
sequence and the output sequence from the delay feedback. Output from the
butterfly connects to the complex multiplier, and data is finally normalized
and sent to the next FFT stage. The implementation of the delay feedbacks
is a main consideration. For shorter delay sequences, serially connected flip-
flops are used as delay elements. As the number of delay elements increases,
this approach is no longer area- and power-efficient. One solution is to use SRAM; to continuously supply the computational units with data, one read and one write operation must then be performed in every clock cycle. A dual-port memory allows simultaneous read and write operations, but is larger and consumes more energy per memory access than single-port memories. Instead, two single-port memories, alternating between read and write each clock cycle, could be used. This approach can be further simplified by using one single-port memory with double wordlength to hold two consecutive values in a single location, alternating between reading two values in one cycle and writing two values in the next cycle. The latter approach has been used for delay feedbacks exceeding a length of eight values. An area comparison can be found in [9].
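The double-wordlength scheme can be modeled in a few lines. The sketch below is my own behavioral model (not the RTL): one single-port memory of D/2 locations, each holding two consecutive values, alternates between reading a pair in one cycle and writing a pair in the next, and behaves as a delay line of length D.

```python
# Behavioral model (not the RTL) of a delay feedback of length D built
# from ONE single-port memory with double wordlength: D/2 locations of
# two values each, alternating read (even cycles) and write (odd cycles).
class DelayFeedback:
    def __init__(self, D):
        assert D % 2 == 0 and D >= 2
        self.mem = [(0, 0)] * (D // 2)  # each location holds 2 values
        self.addr = 0
        self.cycle = 0
        self.in_hold = 0   # input waiting to be paired with the next one
        self.out_hold = 0  # second value of the pair read last cycle

    def step(self, x):
        if self.cycle % 2 == 0:
            # read cycle: fetch two stored values, output the first
            a, b = self.mem[self.addr]
            out, self.out_hold, self.in_hold = a, b, x
        else:
            # write cycle: store the two buffered inputs, output the second
            self.mem[self.addr] = (self.in_hold, x)
            out = self.out_hold
            self.addr = (self.addr + 1) % len(self.mem)
        self.cycle += 1
        return out

# behaves as a delay line of length D = 4
fb = DelayFeedback(4)
out = [fb.step(x) for x in range(12)]
print(out)  # [0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
```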
A 2048 point FFT chip based on architecture (A) has been fabricated in
a 0.35 µm 5ML CMOS process from AMI Semiconductor, and is shown in
Figure 12. The core size is 2632 × 2881 µm2 connected to 58 I/O pads and
26 power pads. The implementation requires 11 delay feedback buffers, one
for each butterfly unit. Seven on-chip RAMs are used as delay buffers (ap-
proximately 49 K bits), while the four smallest buffers are implemented using
flip-flops. Twiddle factors are stored in three ROMs containing approximately
47 K bits. The memories can be seen along the sides of the chip. The number
of equivalent gates (2-input NAND) is 45900 for combinatorial area and 78300
for non-combinatorial area (including memories). The power consumption of
the core was measured at 526 mW when running at 50 MHz and using a supply
voltage of 2.7 V. The pipeline architecture produces one output value each clock
cycle, or 37K transforms per second running at maximum clock frequency. The
2D FFT architecture (B) has been implemented on FPGA in [92].
Figure 12: Chip photo of the 2048 complex point FFT core fabricated in
a 0.35 µm 5ML CMOS process. The core size is 2632 × 2881 µm2 .
6 Conclusion
Dynamic data scaling architectures for pipeline FFTs have been proposed for
both 1D and 2D applications. Based on hybrid floating-point, a high-precision
pipeline with low memory and arithmetic requirements has been constructed. A
co-optimization between hybrid floating-point and block floating-point has been
proposed, reducing the memory requirement further by adding small interme-
diate buffers. A 2048 complex point pipeline FFT core has been implemented
and fabricated in a 0.35 µm 5ML CMOS process, based on the presented scaling architecture, with a throughput of 1 complex point/cc. The bidirectional
pipeline FFT core, intended for image reconstruction in digital holography, has
also been integrated on a custom designed FPGA platform to create a complete
hardware accelerator for digital holographic imaging.
Table 2: Comparison between proposed architectures and related work. The values based on simulated data are
highlighted by grey fields.
Proposed (A) Proposed (B) Proposed (C) Lin [90] Bidet [73] Wang [91]
Architecture Pipeline (1D) Pipeline (2D) Pipeline (2D) Parallel Pipeline Pipeline
Dynamic scaling Hybrid FP Hybrid FP Co-optimized BFP D = 64 CBFP No
Technology (µm) 0.35 Virtex-E Virtex-E 0.18 0.5 0.35
Max Freq. (MHz) 76 50 50 56 22 16
Input wordlength 2 × 10 2 × 10 + 4 2 × 10 + 4 2 × 10 2 × 10 2×8
Internal wordlength 2 × 11 + (0 . . . 4) 2 × 11 + 4 2 × 11 + 4 2 × 11 + 4 2 × 12 + 4 2 × (19 . . . 34)
Transform size 2K 1/2/4/8K 2K 2K 8K 8K 2/8K
SQNR (dB) 45.3 44.0 45.3 44.3 41.2 42.4
Memory (bits) 49K 196K 53.9K 50.4K 185K 350K 213K
Norm. Area (mm²)¹ 7.58 ≈ 16 18.3 49 33.75
1D Transform/s 37109 9277 24414 24414 3905 2686 7812/1953
¹ Area normalized to 0.35 µm technology.
Part III
Abstract
System-level simulation and exploration tools are required to rapidly evaluate
system performance early in the design phase. The use of virtual platforms
enables hardware modeling as well as early software development. An explo-
ration framework (Scenic) is proposed, which is based on OSCI SystemC and
consists of a design exploration environment and a set of customizable simula-
tion models. The exploration environment extends the SystemC library with
features to construct and configure simulation models using an XML descrip-
tion, and to control and extract performance data using run-time reflection.
A set of generic simulation models has been developed and annotated
with performance monitors for interactive run-time access. The Scenic
framework is developed to enable design exploration and performance analysis
of reconfigurable architectures and embedded systems.
1 Introduction
Due to the continuous increase in design complexity, system-level exploration
tools and methodologies are required to rapidly evaluate system behavior and
performance. An important aspect for efficient design exploration is the design
methodology, which involves the construction and configuration of the system
to be simulated, and the controllability and observability of simulation models.
SystemC has proven to be a powerful system-level modeling language, mainly
for exploring complex system architectures such as System-on-Chip (SOC),
Network-on-Chip (NoC) [93], multi-processor systems [94], and run-time re-
configurable platforms [95]. The Open SystemC Initiative (OSCI) maintains
an open-source simulator for SystemC, which is a C++ library containing
routines and macros to simulate concurrent processes using HDL-like
semantics [96]. Systems are constructed from SystemC modules, which are connected
to form a design hierarchy. SystemC modules encapsulate processes, which
describe behavior, and communicate through ports and channels with other
SystemC modules. The advantages of SystemC, besides the well-known
C++ syntax, include modeling at different abstraction levels, simplified
hardware/software co-simulation, and high simulation performance compared to
traditional HDL. The abstraction levels range from cycle accurate (CA) to
transaction level modeling (TLM), where abstract models trade modeling ac-
curacy for a higher simulation speed [42].
SystemC supports observability by tracing signals and transactions using
the SystemC verification library (SCV), but only through trace-files with
limited features. Logging to trace-files is time-consuming and requires
post-processing of extracted simulation data. Another drawback is that the OSCI
simulation kernel does not support real-time control to allow users to start and
stop the simulation interactively. Issues related to system construction, system
configuration, and controllability are not addressed.
To cover the most important aspects on efficient design exploration, a Sys-
temC Environment with Interactive Control (Scenic) is proposed. Scenic is
based on OSCI SystemC 2.2 and extends the SystemC library with functional-
ity to construct and configure simulations from eXtensible Markup Language
(XML), possibilities to interact with simulation models during run-time, and
the ability to control the SystemC simulation kernel using micro-step simula-
tion. Scenic extends OSCI SystemC without modifying the core library, hence
proposing a non-intrusive exploration approach. A command shell is provided
to handle user interaction and a connection to a graphical user interface. In
addition, a library of customizable simulation models is developed, which con-
tains commonly used building blocks for modeling embedded systems. The
models are used in Part IV to simulate and evaluate reconfigurable architectures.

[Figure: overview of the Scenic framework, with user modules, a GUI, the
module library, model generators, and architectural generators.]
2 Related Work
Performance analysis is an important part of design exploration, and is based
on extracted performance data from a simulation. Extraction of performance
data requires either the use of trace files, for post-processing of simulation data,
or run-time access to performance data inside simulation models. The former
approach is not suitable for interactive performance exploration due to the lack
of observability during simulation, and also has a negative impact on simulation
performance. The latter approach is a methodology referred to as data intro-
spection. Data introspection is the ability to access run-time information using
a reflection mechanism. The mechanism enables either structural reflection, to
expose the design hierarchy, or run-time reflection to extract performance data
and statistics for performance analysis.
General frameworks to automatically reflect run-time information in soft-
ware are presented in [97] and [47]. While this approach works well for
standard data types, performance data is highly model-specific and thus cannot
be automatically annotated and reflected. Dust is another approach to reflect
structural and run-time information [98]. Dust is a Java visualization front-
end, which captures run-time information about data transactions using SCV.
However, performance data is not captured, and has to be evaluated and ana-
lyzed from recorded transactions. Design structure and recorded transactions
are stored in XML format and visualized in a structural view and a message
sequence chart, respectively. Frameworks have also been presented for mod-
eling at application level, where models are annotated with performance data
and trace files are used for performance analysis [15]. However, due to the use
of more abstract simulation models, these environments are more suitable for
software evaluation than hardware exploration.
The use of structural reflection has been proposed in [99] and [100] to
generate a visual representation of the simulated system. This can be useful
to ensure structural correctness and provide greater understanding about the
design hierarchy, but does not provide additional design information to enable
performance analysis. In contrast, it is argued that structural reflection should
instead be used to translate the design hierarchy into a user-specified format
to enable automatic code generation.
Related work only covers a part of the functionality required for efficient
design exploration. This work proposes an exploration environment, Scenic,
that supports simulation construction using XML format, simulation interac-
tion using run-time reflection, and code generation using structural reflection.
[Figure: the Scenic shell communicates with the Scenic core over a TCP/IP
interface using XML; simulation modules (SCI_MODULE) export parameters,
variables, and events, and the core provides access variables, filtered
simulation events, and debug messages.]
Built-in commands are executed from the Scenic shell command line in-
terface by typing the command name followed by a parameter list. Built-in
commands include, for example, setting environment variables, evaluating
basic arithmetic and binary operations, and generating random numbers. If the name
of a SystemC module is given instead of a Scenic shell command, then the
command is forwarded and evaluated by that module’s custom access method.
The interface enables fast scripting to setup different scenarios in order to eval-
uate system architectures. The Scenic shell commands are presented in detail
in Appendix A.
Simulation Control
OSCI SystemC currently offers no way for the user to interactively
control the simulation kernel, and simulation execution is instead described
directly in the source code. Hence, changes to the simulation require the
source code to be recompiled, which is an undesirable way to control the
simulation kernel. In contrast, Scenic addresses the issue of
simulation control by introducing micro-step simulation. Micro-step simula-
tion allows the user to interactively pause the simulation, to modify or view
internal state, and then resume execution. A micro-step is a user-defined dura-
tion of simulation time, during which the SystemC kernel simulates in blocking
mode. The micro-steps are repeated until the total simulation time is com-
pleted. From a user-perspective, the simulation appears to be non-blocking
and fully interactive. The performance penalty for using micro-steps is evalu-
ated to be negligible down to clock cycle resolution.
From the Scenic shell, simulations are controlled using either blocking or
non-blocking run commands. Non-blocking commands are useful for interac-
tive simulations, whereas blocking commands are useful for scripting. A stop
command halts a non-blocking simulation at the beginning of the next micro-
step.
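The control flow described above can be sketched in plain C++ (hypothetical names; in the actual Scenic implementation the blocking call wraps the SystemC kernel, e.g. sc_start):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

// Sketch of micro-step simulation control (hypothetical names).
class MicroStepController {
public:
    MicroStepController(std::function<void(long)> run_blocking, long step_ns)
        : run_blocking_(run_blocking), step_ns_(step_ns) {}

    // Advance the simulation by total_ns, one micro-step at a time.
    // Returns the simulated time covered before a stop request took effect.
    long run(long total_ns) {
        long elapsed = 0;
        while (elapsed < total_ns && !stop_requested_) {
            long quantum = std::min(step_ns_, total_ns - elapsed);
            run_blocking_(quantum);  // kernel simulates in blocking mode
            elapsed += quantum;      // user commands are handled here
        }
        stop_requested_ = false;
        return elapsed;
    }

    // Issued by a 'stop' command; takes effect at the next micro-step.
    void request_stop() { stop_requested_ = true; }

private:
    std::function<void(long)> run_blocking_;
    long step_ns_;
    bool stop_requested_ = false;
};
```

Between quanta the shell can process user commands, which is why the simulation appears non-blocking and fully interactive from the user's perspective.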
[Figure: simulation construction from an XML design specification: modules
from the module library are instantiated, bound, and configured.]
Access Variables
Member variables inside an sci_module are exported to the Scenic shell as
configurable and observable access variables, which reflect each variable's
data type, data size, and value during simulation. Access variables enable
dynamic configuration (controllability) and extraction of performance data and
statistics during simulation (observability). Hence, the user can configure mod-
els for specific simulation scenarios, and then observe the corresponding simu-
lation effects.
Performance data and statistics that are represented by native and trivial
data types can be directly reflected using an access variable. An example
is a memory model that contains a counter to represent the total number
of memory accesses. However, more complex operations are required when
performance values are data or time dependent. This requires functional code
to evaluate and calculate a performance value, and is supported in Scenic by
using variable callbacks. When an access variable is requested, the associated
callback function is evaluated to assign a new value to the member variable.
This new value is then reflected in the Scenic shell. The callback functionality
can also be used for the reverse operation of assigning values that affect the
model functionality or configuration. An example is a variable that represents
a memory size, which requires the memory array to be resized (reallocated)
when the variable change. The following code is a trivial example on how to
export member variables and how to implement a custom callback function:
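A plain-C++ sketch of the mechanism (hypothetical names such as export_var; the actual sci_module interface is not reproduced here) with the size, average, and data variables discussed in the text:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Sketch of exported access variables with optional callbacks.
class AccessVariables {
public:
    // Export a member variable; the optional callback runs on every
    // access through the shell interface.
    void export_var(const std::string& name, int* var,
                    std::function<void()> on_access = nullptr) {
        vars_[name] = Entry{var, on_access};
    }

    int get(const std::string& name) {           // observe (shell read)
        Entry& e = vars_.at(name);
        if (e.callback) e.callback();            // refresh derived values
        return *e.value;
    }

    void set(const std::string& name, int v) {   // configure (shell write)
        Entry& e = vars_.at(name);
        *e.value = v;
        if (e.callback) e.callback();            // react to the new value
    }

private:
    struct Entry { int* value; std::function<void()> callback; };
    std::map<std::string, Entry> vars_;
};

// A memory model exporting 'size' (with a reallocating callback),
// 'accesses' (a trivially reflected counter), and 'average' (computed
// on demand from the data array).
class MemoryModel {
public:
    MemoryModel() {
        shell.export_var("size", &size, [this] { data.resize(size); });
        shell.export_var("accesses", &accesses);
        shell.export_var("average", &average, [this] {
            long sum = 0;
            for (int v : data) sum += v;
            average = data.empty()
                ? 0 : static_cast<int>(sum / static_cast<long>(data.size()));
        });
    }
    AccessVariables shell;
    int size = 0;
    int accesses = 0;
    int average = 0;
    std::vector<int> data;
};
```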
Modifying the value of the size variable triggers the callback function to
be executed and the memory to be reallocated. The variable average is also
evaluated on demand, while the data vector is constant and reflects the current
simulation value. Registered variables are then accessed by name from the Scenic shell.
[Figure: simulation run-time with logging disabled, with a conventional
mutex-based method (16.8), and with the proposed micro-steps (0.6).]
Simulation Events
During simulations, it is valuable to automatically receive information from
simulation models regarding simulation status. This information could be in-
formative messages on when to read out generated performance data, or warn-
ing messages to enable the user to trace simulation problems at an exact time
instant. Instead of printing debug messages in the Scenic shell, it is desirable
to be able to automatically halt the simulation once user-defined events occur.
An extension to the access variables provides the ability to combine data
logging with conditional simulation events. Hence, dynamic simulation events
are proposed, which can be configured to execute Scenic commands when
triggered. In this way, the simulation models can notify the simulator of specific
events or conditions on which to observe, reconfigure, or halt the simulation.
A simulation event is an sci_variable that is assigned a user-specified
Scenic shell command. Simulation events are created dynamically during run-
time from the Scenic shell to notify the simulator when a boolean condition
associated with an access variable is satisfied. The condition is evaluated on
data inside the history buffer, hence repeated assignments between clock cycles
are effectively avoided.
The internal call graph for creating a simulation event is shown in Figure 8.
An access variable is first configured with a periodic logging interval on
which to evaluate a boolean condition. A boolean condition is created from the
Scenic shell, which registers itself with the access variable and creates a new
simulation event. This simulation event is configured by the user to evaluate
a Scenic shell command when triggered. Every time the access variable is
logged, the condition is evaluated against the assigned boolean expression. If
the condition is true, the associated event is triggered. The event executes
the user command, which in this case is set to halt the simulation. Hence,
the simulation events operate in a similar way as software assertions, but are
dynamically created during simulation time.
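A minimal plain-C++ sketch of this mechanism (hypothetical names; the actual sci_variable implementation is richer) might look like:

```cpp
#include <cstddef>
#include <deque>
#include <functional>

// Sketch of a dynamic simulation event: a variable is logged at a
// periodic interval into a bounded history buffer; a boolean condition
// is evaluated on each sample and, when satisfied, triggers a
// user-assigned command (in Scenic, a shell command such as halting).
class SimulationEvent {
public:
    SimulationEvent(std::function<bool(int)> condition,
                    std::function<void()> command)
        : condition_(condition), command_(command) {}

    // Called once per logging interval with the variable's current value,
    // so repeated assignments between log points are effectively ignored.
    void log(int value) {
        history_.push_back(value);
        if (history_.size() > 16) history_.pop_front();  // bounded history
        if (!triggered_ && condition_(value)) {
            triggered_ = true;
            command_();
        }
    }

    bool triggered() const { return triggered_; }
    std::size_t samples() const { return history_.size(); }

private:
    std::deque<int> history_;
    std::function<bool(int)> condition_;
    std::function<void()> command_;
    bool triggered_ = false;
};
```

As in a software assertion, the condition and command are supplied at run-time, not compiled into the model.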
[Figure 9: (a) an SCI_PROCESSOR connected through instruction (I) and data
(D) adapters to a bus; (b) an SCI_MEMORY with accelerator configuration.]
Processor Registers
The register bank is user-defined, thus each generated processor has a different
register configuration. However, a register holding the program counter is
automatically created to control the processor execution flow.
There are two types of processor registers: internal general-purpose regis-
ters and external i/o port registers. General-purpose registers hold data values
for local processing, while i/o port registers are bi-directional interfaces for ex-
ternal communication. i/o port registers are useful for connecting external
co-processors or hardware accelerators to a generated processor. Each uni-
directional link uses flow control to prevent the processor from accessing an
empty input link or a full output link. Similar port registers are found in many
processor architectures, for example the IBM PowerPC [103].
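The flow-controlled links can be sketched as bounded queues whose read and write operations must be guarded (a simplified plain-C++ sketch with hypothetical names):

```cpp
#include <cstddef>
#include <queue>

// Sketch of a flow-controlled unidirectional link between an i/o port
// register and an external unit. A read from an empty link or a write
// to a full link must stall, modeled here by boolean return values.
class PortLink {
public:
    explicit PortLink(std::size_t depth) : depth_(depth) {}

    bool can_write() const { return fifo_.size() < depth_; }
    bool can_read()  const { return !fifo_.empty(); }

    bool write(int v) {                 // producer side
        if (!can_write()) return false; // full: processor must stall
        fifo_.push(v);
        return true;
    }

    bool read(int& v) {                 // consumer side
        if (!can_read()) return false;  // empty: processor must stall
        v = fifo_.front();
        fifo_.pop();
        return true;
    }

private:
    std::size_t depth_;
    std::queue<int> fifo_;
};
```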
Processor Instructions
The instruction set is also user-defined, to emulate arbitrary processor archi-
tectures. An instruction is constructed from the sci_instruction class, and
is based on an instruction template that describes operand fields and bitfields.
The operand fields specify the number of destination registers (D), source reg-
isters (S), and immediate values (I). The bitfields specify the number of bits
to represent opcode (OP), operand fields, and instruction flags (F). Figure 10
illustrates how a generic instruction format is used to represent different in-
struction templates, based on {OP, D, S, I, F}. An instruction template has
the following format:
class TypeB : public sci_instruction { /* instruction template */
public:
TypeB(string nm) : sci_instruction(nm, /* constructor */
"D=1,S=1,I=1", /* operand fields */
"opcode=6,dst=5,src=5,imm=16,flags=0") /* bitfields */
{ }
};
In this example the opcode uses 6 bits, the source and destination registers
use 5 bits each, the immediate value uses 16 bits, and the instruction flags are not
used. Templates are reused to group similar instructions, i.e. those represented
with the same memory layout. For example, arithmetic instructions between
registers are based on one template, while arithmetic instructions involving
immediate values are based on another template.
A user-defined instruction inherits properties from an instruction template,
and extends it with a name and an implementation. The name is used to
reference the instruction from the assembly source code, and the implementation
specifies the instruction's functionality. An execute method describes the
functionality by reading processor registers, performing a specific operation, and
writing processor registers accordingly. An instruction can also control the pro-
gram execution flow by modifying the program counter, and provides a method
to specify the instruction delay for more accurate modeling.
The execute method initiates a read from zero or more registers (i/o or general-
purpose) and a write to zero or more registers (i/o or general-purpose). The
instruction is only allowed to execute if all input values are available, and the
instruction can only finish execution if all output values can be successfully
written to result registers. Hence, an instruction performs blocking access to
i/o port registers and to memory interfaces.
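As a plain-C++ sketch of these semantics (hypothetical names, not the actual sci_instruction API; blocking i/o access and operand decoding are omitted):

```cpp
#include <cstdint>
#include <vector>

// Minimal processor state: registers plus the program counter.
struct Registers {
    std::vector<int32_t> r;   // general-purpose registers
    uint32_t pc = 0;          // program counter controls execution flow
};

// An instruction based on a TypeB-like template: D=1, S=1, I=1.
struct AddImm {
    unsigned dst = 0, src = 0;
    int32_t  imm = 0;
    unsigned delay = 1;       // instruction delay for accurate modeling

    void execute(Registers& regs) const {
        regs.r[dst] = regs.r[src] + imm;  // read source, write destination
        regs.pc += 1;                     // default flow: next instruction
    }
};

// An instruction may instead control the flow by modifying the pc.
struct Branch {
    uint32_t target = 0;
    void execute(Registers& regs) const { regs.pc = target; }
};
```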
Memory Interfaces
The processor provides an interface to access external memory or to connect the
processor to a system bus. External memory access is supported using virtual
function calls, for which the user supplies a suitable adapter implementation.
For the processor to execute properly, functionality to load data from memory,
store data in memory, and fetch instructions from memory is required. The
implementation of the virtual memory access methods is illustrated in Figure 9(a),
and is performed by the adapter unit. The adapter translates the function calls
to bus accesses, and returns requested data to the processor.
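The adapter mechanism follows the classic virtual-interface pattern, sketched here in plain C++ with hypothetical names:

```cpp
#include <cstdint>
#include <map>

// The processor calls virtual load/store/fetch methods; an adapter
// implementation translates them into bus transactions.
struct MemoryInterface {
    virtual ~MemoryInterface() = default;
    virtual uint32_t load(uint32_t addr) = 0;
    virtual void store(uint32_t addr, uint32_t data) = 0;
    virtual uint32_t fetch(uint32_t pc) { return load(pc); } // instr. fetch
};

// A trivial stand-in for a system bus: an address-to-data map.
struct Bus { std::map<uint32_t, uint32_t> mem; };

// The adapter translates the processor's function calls to bus accesses
// and returns the requested data.
class BusAdapter : public MemoryInterface {
public:
    explicit BusAdapter(Bus& bus) : bus_(bus) {}
    uint32_t load(uint32_t addr) override { return bus_.mem[addr]; }
    void store(uint32_t addr, uint32_t data) override { bus_.mem[addr] = data; }
private:
    Bus& bus_;
};
```

Because the processor only sees the abstract interface, the same processor model can be attached to different buses or memories by swapping the adapter.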
Figure 12: The memory models are accessed using either the simulation
ports or through the untimed interface. The untimed interface enables
access to the memory contents from the Scenic shell.
• Tile Generator - The tile generator uses a tile template file to create
a static array of resource cells, presented in Part IV, which are generic
containers for any type of simulation models. When constructing larger
arrays, the tile is used as the basic building block, as illustrated in Fig-
ure 13(a), and extended to arrays of arbitrary size. The resource cells are
configured with a functional unit specified by the tile template, and can
be either a processing cell or a memory cell.
• Topology Generator - The topology generator creates the local in-
terconnects between resource cells, and supports mesh, torus, ring, and
user-defined topologies. Figure 13(b) illustrates resource cells connected
in a mesh topology. The local interconnects provide a high bandwidth
between neighboring resource cells.
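As a sketch of what a topology generator computes, the following plain-C++ function (hypothetical, not the Scenic implementation) enumerates the local interconnects of a W × H mesh:

```cpp
#include <utility>
#include <vector>

// One link per pair of horizontally or vertically adjacent resource
// cells, with cells indexed row by row.
using Link = std::pair<int, int>;   // indices of two connected cells

std::vector<Link> mesh_links(int w, int h) {
    std::vector<Link> links;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int c = y * w + x;
            if (x + 1 < w) links.push_back(Link(c, c + 1)); // east neighbor
            if (y + 1 < h) links.push_back(Link(c, c + w)); // south neighbor
        }
    return links;
}
```

A torus topology would additionally wrap the links at the array edges, and a ring is the 1 × N special case; for a 4 × 4 array the mesh yields 24 local links.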
5 Conclusion
SystemC enables rapid system development and high simulation performance.
The proposed exploration environment, Scenic, is a SystemC environment
with interactive control that addresses the issues of controllability and
observability of SystemC models. It extends the SystemC functionality to
enable system construction and configuration from XML, and provides access
to simulation and performance data using run-time reflection. Scenic pro-
vides advanced scripting capabilities, which allow rapid design exploration and
performance analysis in complex designs. In addition, a library of model gen-
erators and architectural generators is proposed. Model generators are used to
construct designs based on customized processing and memory elements, and
architectural generators provide capabilities to construct complex systems, such
as reconfigurable architectures and Network-on-Chip designs.
Part IV
Abstract
Reconfigurable hardware architectures are emerging as a suitable and feasi-
ble approach to achieve high performance combined with flexibility and pro-
grammability. While conventional fine-grained architectures are capable of
bit-level reconfiguration, recent work focuses on medium-grained and coarse-
grained architectures that result in higher performance using word-level data
processing. In this work, a coarse-grained dynamically reconfigurable architec-
ture is proposed. The system is constructed from an array of processing and
memory cells, which communicate using local interconnects and a hierarchical
routing network. Architectures are evaluated using the Scenic exploration en-
vironment and simulation models, and implemented VHDL modules have been
synthesized for a 0.13 µm cell library. A reconfigurable architecture of size 4×4
has a core area of 2.48 mm² and runs at up to 325 MHz. It is shown that mapping
of a 256-point FFT yields 18 times higher throughput than commercial
embedded DSPs.
1 Introduction
Platforms based on reconfigurable architectures combine high performance pro-
cessing with flexibility and programmability [104, 105]. A reconfigurable ar-
chitecture enables re-use in multiple design projects to allow rapid hardware
development. This is an important aspect for developing consumer electronics,
which are continuously required to include and support more functionality.
A dynamically reconfigurable architecture (DRA) can be reconfigured dur-
ing run-time to adapt to the current operational and processing conditions.
Using reconfigurable hardware platforms, radio transceivers dynamically adapt
to radio protocols used by surrounding networks [106], whereas digital cameras
adapt to the currently selected image or video compression format. Reconfig-
urable architectures provide numerous additional advantages over traditional
application-specific hardware accelerators, such as resource sharing to provide
more functionality than there is physical hardware. Hence, currently inactive
functional units do not occupy any physical resources, which are instead
dynamically configured during run-time. Another advantage is that a reconfig-
urable architecture may enable mapping of future functionality without addi-
tional hardware or manufacturing costs, which could also extend the lifetime
of the platform.
In this part, a DRA is proposed and modeled using the Scenic exploration
framework and simulation models presented in Part III. By evaluating the
platform at system level, multiple design aspects are considered during per-
formance analysis. The proposed design flow for constructing, modeling, and
implementing a DRA is presented in Figure 1. System construction is based
on the Scenic architectural generators, which use the model library to cus-
tomize simulation components. Related work is discussed in Section 2, and the
proposed architecture is presented in Section 3. In Section 4, the Scenic ex-
ploration environment is used for system-level integration, and the platform is
modeled and evaluated for application mapping in Section 5. Automated
application mapping is not part of the presented work, but is a natural
extension. In Section 6, the exploration models are translated to VHDL,
synthesized, and compared against existing architectures.
2 Related Work
A range of reconfigurable architectures have been proposed for a variety of
application domains [33]. Presented architectures differ in granularity, pro-
cessing and memory organization, communication strategy, and programming
methodology. For example, the GARP project presents a generic MIPS proces-
sor with reconfigurable co-processors [107]. PipeRench is a programmable dat-
apath of virtualized hardware units that is programmed through self-managed
configurations [108]. Other proposed architectures are the MIT RAW pro-
cessor array [109], the REMARC array of 16-bit nano processors [110], the
Cimaera reconfigurable functional unit [111], the RICA instruction cell [112],
weakly programmable processor arrays (WPPA) [113, 114], and architectures
optimized for multimedia applications [115]. A medium-grained architecture is
presented in [116], using a multi-level interconnection network similar to this
work.
Examples of commercial dynamically reconfigurable architectures include
the field programmable object array (FPOA) from MathStar [117], which is an
array of 16-bit objects that contain local program and data memory. The adap-
tive computing machine (ACM) from QuickSilver Technology [118] is a 32-bit
array of nodes that each contain an algorithmic core supporting arithmetic,
bit-manipulation, general-purpose computing, or external memory access. The
XPP platform from PACT Technologies is constructed from 24-bit processing
array elements (PAE) and communication is based on self-synchronizing data
flows [119, 120]. The Montium tile processor is a programmable architecture
where each tile contains five processing units, each with a reconfigurable in-
struction set, connected to ten parallel memories [106].
A desirable programming approach for the architectures above is software-
centric, where applications are described using a high-level design language.
3 Proposed Architecture
Based on recently proposed architectures, it is evident that coarse-grained de-
signs are becoming increasingly complex, with heterogeneous processing units
comprising a range of application-tuned and general-purpose processors. As a
consequence, efficient and high-performance memories for internal and external
data streaming are required, to supply the computational units with data.
However, increasing complexity is not a feasible approach for an embedded
communication network. In fact, it has lately been argued that interconnect
networks should only be sufficiently complex to be able to fully utilize the
computational power of the processing units [122]. For example, the mesh-
based network-on-chip (NoC) structure has been widely researched in both
industry and academia, but suffers from inefficient global communication due to
multi-path routing and long network latencies. Another drawback is the large
number of network routers required to construct a mesh. As a consequence,
star-based and tree-based networks are being considered [123], as well as a
range of hybrid network topologies [124].
This work proposes reconfigurable architectures based on the following
statements:
• Coarse-grained architectures result in better performance/area trade-off
than fine-grained and medium-grained architectures for the current ap-
plication domain. Flavors of coarse-grained processing elements are pro-
posed to efficiently map different applications.
• Streaming applications require more than a traditional load/store archi-
tecture. Hence, a combination of a RISC architecture and a streaming
architecture is proposed. Furthermore, the instruction set of each pro-
cessor is customized to extend the application domain.
• Multi-level communication is required to combine high bandwidth with
flexibility at a reasonable hardware cost [116]. A separate local and global
[Figure: processing cell with local ports L0–Lx, a global port G, a program
counter, registers R0–Rx, an inner loop counter (ILC), and a program memory.]
• Dual ALU - A conventional ALU takes two input operands and produces
a single result value. In contrast, the DSP and MAC processors include
two separate ALUs to produce two values in a single instruction. This
is useful when computing a radix-2 butterfly or when moving two data
values in parallel. The MAC processor uses a parallel move instruction
to split and join 16-bit internal registers and 32-bit i/o registers.
• Separable ALU - Each 32-bit ALU data path can be separated into two
independent 16-bit fields, where arithmetic operations are applied to both
fields in parallel. This is useful when operating on complex valued data,
represented as a 2 × 16-bit value. Hence, complex values can be added
and subtracted in a single instruction.
• Inner loop counter - A special set of registers are used to reduce control
overhead in compute-intensive inner loops. The inner loop counter (ILC)
register is loaded using a special instruction that stores the next program
counter address. Each instruction contains a flag that indicates end-of-
loop, which updates the ILC register and reloads the previously stored
program counter.
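The separable ALU addition can be sketched in plain C++ (a simplified model; in hardware the two fields share one adder with the carry chain cut at bit 16):

```cpp
#include <cstdint>

// Separable 32-bit addition: the data path splits into two independent
// 16-bit fields (no carry between them), so a complex value packed as
// two 16-bit components in one 32-bit word is added in one operation.
uint32_t add_2x16(uint32_t a, uint32_t b) {
    uint16_t lo = static_cast<uint16_t>(a + b);                 // low field
    uint16_t hi = static_cast<uint16_t>((a >> 16) + (b >> 16)); // high field
    return (static_cast<uint32_t>(hi) << 16) | lo;
}
```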
A stream transfer in fifo mode is configured with a source port, a destination
port, and a memory area that is reserved for each stream transfer, which is
indicated with a base address
and a high address. In fifo mode, the reserved memory area operates as a
circular buffer, and the controller unit handles address pointers to the current
read and write locations. Input data is received from the source port and placed
at the current write location inside the memory array. At the same time, the
destination port receives the data stored in the memory array at the current
read location.
In a similar fashion, a stream transfer in ram mode is configured with an
address port, a data port, and an allocated memory area. The controller unit
is triggered when the address port receives either a read or a write request,
that specifies a memory address and a transfer length. If the address port
receives a read request, data will start streaming from the memory array to
the configured data port. Consequently, if the address port receives a write
request, data is fetched from the configured data port and stored in the memory
array.
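A plain-C++ sketch of the fifo-mode controller (hypothetical names; port handshaking and flow control are omitted):

```cpp
#include <cstddef>
#include <vector>

// The area between a base address and a high address operates as a
// circular buffer; the controller maintains the read and write locations.
class StreamFifo {
public:
    StreamFifo(std::vector<int>& mem, std::size_t base, std::size_t high)
        : mem_(mem), base_(base), high_(high), rd_(base), wr_(base) {}

    // Data received from the source port goes to the write location.
    void push(int v) {
        mem_[wr_] = v;
        wr_ = (wr_ == high_) ? base_ : wr_ + 1;  // wrap inside the area
    }

    // The destination port receives the data at the read location.
    int pop() {
        int v = mem_[rd_];
        rd_ = (rd_ == high_) ? base_ : rd_ + 1;
        return v;
    }

private:
    std::vector<int>& mem_;
    std::size_t base_, high_, rd_, wr_;
};
```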
Figure 4: (a) Network router constructed from a decision unit, a routing
structure, and packet queues for each output port. (b) Architectural op-
tions for routing structure implementation. (c) Local flit format (white)
and extension for global routing (grey).
Network Packets
The routers forward network packets over the global interconnects. A network
packet is a carrier of data and control information from a source to a destination
cell, or between resource cells and an external memory. A data stream is
a set of network packets, and for streaming data each individual packet is
sent as a network flow control digit (flit). A flit is an atomic element that is
transferred as a single word on the network, as illustrated in Figure 4(c). A
flit consists of a 32-bit payload and a 2-bit payload type identifier to indicate
if the flit contains data, a read request, or a write request. For global routing,
unique identification numbers are required, as discussed in the next section.
An additional 2-bit network type identifier indicates if the packet carries data,
configuration, or control information. Data packets have the same size as a
flit, and contain a payload to be either processed or stored in a resource cell.
Configuration packets contain a functional description of how resource cells
should be configured, as further discussed in Section 4.2. Configurations
are transferred with a header specifying the local target address and the payload
size, to be able to handle partial configurations. Control packets are used to
notify the host processor of the current processing status, and are reserved to
exchange flow control information between resource cells.
Network Routing
Each resource cell allocates one or more network identifiers (ID), which are
integer numbers that uniquely identify the cell, as shown in Figure 5(a).
The identification field is represented with ⌈log2(W × H + IDext)⌉ bits,
where IDext is the number of network IDs allocated for external communication.
A static routing table is stored inside the router and used to direct traffic
over the network. At design-time, network IDs and routing tables are recur-
sively assigned by traversing the global network from the top router. Recursive
assignment ensures that each entry in the routing table for a router Ri,l,
where i is the router index number and l is the router's hierarchical level as
defined in Figure 5(c), is a contiguous range of network IDs, as illustrated in
Figure 5(b). Hence, network ID ranges are represented with a base address
and a high address. The network connectivity C is defined by the function

    C(Ri,l, Rm,n) = 1 if there is a link Ri,l → Rm,n, and 0 otherwise,

where Ri,l ≠ Rm,n. Based on the network connectivity, the routing
table Λ for a router Ri,l is defined as
Λ(Ri,l ) = {λ(Ri,l ), κ(Ri,l )}, (1)
where λ(Ri,l ) is the set of network IDs to reachable routers from Ri,l to lower
hierarchical levels, and κ(Ri,l ) is the set of network IDs from Ri,l to reachable
routers on the same hierarchical level as
λ(Ri,l ) = {λ(Rj,l−1 ) : C(Ri,l , Rj,l−1 ) = 1, l > 0}
Figure 6: (a) Enhancing the router capacity when the hierarchical level
increases. (b) Enhancing network capacity by connecting routers at the
same hierarchical level.
and
κ(Ri,l ) = {λ(Rj,l ) : C(Ri,l , Rj,l ) = 1}.
At the lowest hierarchical level, where l = 0, the reachable nodes in λ(Ri,0 ) are
the set of network IDs reserved by the connected resource cells. At this
level, λ is always a contiguous range of network IDs.
A link from a router Ri,l to a router Rj,l+1 is referred to as an uplink. Any
packet received by router R is forwarded to the uplink router if the packet's
network ID is not found in the routing table Λ(R). A router may only have a
single uplink port; otherwise, the communication path could become
non-deterministic.
In the Scenic environment, routing tables can be extracted for any router,
and contain the information below for top router R0,1 and sub-router R0,0 from
Figure 5(b):
Network Capacity
When the size of a DRA increases, the global communication network is likely
to handle more traffic, which requires network enhancements. A solution to im-
prove the communication bandwidth is to increase the network capacity in the
communication links, as shown in Figure 6(a). Since routers on a higher hierar-
chical level could become potential bottlenecks to the system, these routers and
router links are candidates for network link capacity scaling: a single link
between two routers is replaced by parallel links to improve the network
capacity. A drawback is increased complexity, since a more advanced router
decision unit is required to avoid packet reordering. Otherwise, if packets
from the same stream are divided onto different parallel links, individual
packets might arrive out of order at the destination.
Another way to improve the communication bandwidth is to insert addi-
tional network paths to avoid routing congestion in higher level routers, referred
to as network balancing. Figure 6(b) shows an example where all Ri,1 routers
are connected to lower the network traffic through the top router. Additional
links may be inserted between routers as long as the routing table in each
network router is deterministic. When a network link is created between two
routers, the destination router's reachable IDs (λ) are inserted in the routing
table (Λ) of the source router. Links that do not satisfy the conditions above
are not guaranteed to generate deterministic routing tables automatically, but
could still represent a valid configuration.
4. SYSTEM-LEVEL INTEGRATION 123
External Communication
The network routers are used for connecting external devices to the DRA,
as illustrated in Figure 7(a). When binding an external device to a router, a
unique network ID is assigned and the routing tables are automatically updated
to support the new configuration. Examples of external devices are memories
for data streaming, as illustrated in Figure 7(b), or an interface to receive con-
figuration data from an embedded GPP. Hence, data streaming to and from
external memories requires the use of the global network. Details about exter-
nal communication and how to enable efficient memory streaming are further
discussed in Section 4.1.
4 System-Level Integration
Additional system components are required to dynamically configure resource
cells, and to transfer data between resource cells and external memories. There-
fore, the proposed DRA is integrated into an embedded system containing a
general-purpose processor, a multi-port memory controller (MPMC), and a
proposed stream memory controller (SMC) to efficiently supply the DRA with
data. The GPP and the MPMC are based on the Scenic processor and mem-
ory generators, while the SMC is custom designed. The system architecture
is shown in Figure 8, where the top network router is connected to the SMC
to receive streaming data from an external memory. The top network router
is also connected to a bridge, which allows the GPP to transmit configuration
data over the system bus.
The MPMC contains multiple ports to access a shared external memory,
where one port is connected to the system bus and one port to the SMC. For
design exploration, the MPMC can be configured to emulate different memory
types by changing the operating frequency, wordlength, data-rate, and internal
parameters for controlling the memory timing. This is useful when optimizing
and tuning the memory system, as presented in Section 5.2.
with the stream transfer. The transfer direction is either read or write, and
the number of words to transfer is indicated by size. A shape describes how data
is accessed inside the memory area, and consists of three parameters: stride,
span, and skip [127]. Stride is the memory distance to the next stream element,
and span is the number of elements to transfer. Skip indicates the distance to
the next start address, which restarts the stride and span counters. The use
of programmable memory shapes is a flexible and efficient method to avoid
address generation overhead in functional units.
Additional control bits enable resource cells to share buffers in an external
memory. The reference field (ref) contains a pointer to an inactivated stream
descriptor that shares the same memory area. When the current transfer com-
pletes, the associated stream is activated and allowed to access the memory.
Hence, the memory area is used by two stream transfers, but the access to the
memory is mutually exclusive. Alternatively, the descriptors are associated with
different memory regions, and the data pointers are automatically interchanged
when both stream transfers complete, which enables double-buffering. An
example is shown in Figure 9, where stream descriptors 0 and 2 are configured
to perform an 8 × 8 matrix transpose operation. Data is written column-wise,
and then read row-wise once the write transaction completes.
Figure 11: Accepted traffic T and latency L versus injection rate r for
localization factors α = 0 to 0.8: the original network, the network with
increased link capacity from Figure 6(a), and the network with balanced
traffic load from Figure 6(b).
to neighboring cells and to the global network. Packets are annotated with
statistical information to monitor when each packet was produced, injected,
received, and consumed. Each packet also contains information about the
number of network hops from the source node to the destination node, while
the traffic generators monitor the number of received packets and the transport
latency.
Since communication is both local and global, a localization factor α is defined
as the ratio between the amount of local and global communication, where
α = 1 corresponds to pure local communication and α = 0 corresponds to fully
global communication.
The traffic generators inject packets into the network according to the
Bernoulli process, which is a commonly used injection process to character-
ize networks [128]. The injection rate, r, is the number of packets per clock
cycle and per resource cell injected into the local or global network. The ac-
cepted traffic is the network throughput T , which is measured in packets per
clock cycle and per resource cell. Ideally, accepted traffic should increase lin-
early with the injection rate. However, due to traffic congestion in the global
network, the amount of accepted traffic will saturate at a certain level. The
average transport latency is defined as L = (Σ Li )/N , where Li is the transport
latency for packet i and N is the total number of consumed packets. Transport
latency is measured as the number of clock cycles between a successful injection
into the network and consumption at the final destination.
The network performance depends on both the local and global network. As
discussed in Section 3.4, the global network may be enhanced by either improv-
ing the link capacities or by inserting additional network links between routers.
Figure 11 shows the simulation results from three network scenarios based on
an 8 × 8 resource array. The first simulation is the original network configura-
tion shown in Figure 2(a). The second and third scenarios are the enhanced
routing networks shown in Figure 6(a-b). As shown, network enhancements
generate a higher communication bandwidth with reduced latency. Perfor-
mance is measured as the combined local and global communication, i.e. the
overall performance, while latency is measured for global communication. The
routers used in this simulation can manage up to two data transactions per
clock cycle, but only for transactions that do not access the same physical
ports.
The simulations in Figure 11 indicate how much traffic can be injected into
the global network before saturation. Assuming an application with 80% local
communication, i.e. α = 0.8, the networks can manage an injection rate of
around 0.3, 0.8, and 0.8, respectively. This illustrates the need for capacity
enhancements in the global network. Since both enhancement techniques have
similar performance, the network in Figure 6(b) is a better candidate since
it is less complex and easier to scale with existing router models. However,
5. EXPLORATION AND ANALYSIS 129
Figure 12: The impact on network performance (accepted traffic T ) for
different router output queue depths Q, for the original network, increased
link capacity, balanced network, and both enhancements combined, when
α = 0. The curves start to flatten out around Q = 8.
output stream, i.e. from the DRA to memory, can be managed in different
ways, either as a locally controlled or as a globally controlled memory transfer.
Local control means that each resource cell transfers data autonomously, while
global control implies stream management by the SMC.
To evaluate the memory stream interface, two architectural mappings and
five memory scenarios are evaluated. The first mapping is when an applica-
tion executes on a single processing cell, and the second mapping is when the
data processing is shared and divided onto two processing cells, as shown in
Figure 13(a-b). This illustrates that throughput is not only a matter of
parallelism, but also a balance between shared system resources such as the
global network and the memory interface.
A processing cell is configured with an application to post-process images
in digital holography, as illustrated in Part I Figure 3(c). The post-processing
step generates a superposed image between magnitude and phase data after
image reconstruction, to enhance the visual perception. A configuration is
downloaded to combine the RGB-colors of two input streams, and write the
result to an output stream. Four clock cycles are required to process one pixel,
resulting in a throughput of 1/4 samples per clock cycle. The DRA and the
external memory are assumed to operate at the same clock frequency, hence
3/4 of the theoretical memory bandwidth is required for three streams. The
following Scenic script constructs a DRA platform, downloads two bitmap
images through the MPMC, and inserts stream descriptors in the SMC to
transfer data to and from the DRA:
xml load -file "system_4x4.xml" % load XML design
sim % create simulation platform
xml config -name "post-process" % load XML configuration
MPMC rd -addr 0x300000 -size $IMSIZE -file "result.bmp" % save result image
set TRANSFERS [ra.cell* get -var transfers] % extract transfers
set SIM_CC [SMC get -var last_transfer] % extract clock cycles
set UTILIZATION [eval [eval sum $TRANSFERS] / $SIM_CC] % calculate utilization
The system is evaluated for different transfer lengths, which is the size in words
of a transfer between the external memory and the processing cell. The transfer
length is important for a realistic case study, since external memory is burst
oriented with high initial latency to access the first word in a transfer. Conse-
quently, the simulation in Figure 14 plot (a) shows improved throughput when
the transfer length increases.
Figure 13: (a) Application mapping to a single processing cell, using two
read and one write stream to external memory. (b) Dividing the compu-
tations onto two processing cells, which requires two sets of i/o streams
and double the bandwidth to external memory.
Figure 14: Relative throughput (%) and number of row accesses (K) versus
transfer length (words) for five memory scenarios: (a) single PC with 32-bit
memory, (b) dual PC with 32-bit memory, (c) dual PC with 32-bit memory
(reorder), (d) dual PC with 32-bit DDR memory (reorder), and (e) dual PC
with 64-bit DDR memory (reorder).
memory are shown in Figure 14 plot (d-e), and present dramatically increased
data throughput, as expected. This illustrates that system performance is
a combination of many design parameters.
(a)
<SCENIC>
<CONFIG name="FIR-filter">
<MODULE name="pc2x2">
<PARAM name="SRC" value="fir.asm">
<PARAM name="ADDR" value="0x1">
<PARAM name="PID" value="%G">
<PARAM name="PIC" value="%L2">
<PARAM name="PIF" value="%L3">
<PARAM name="POF" value="%L3">
<PARAM name="POD" value="%L1">
<PARAM name="FIR_ORDER" value="36">
</MODULE>
...

(b)
.restart
addi %LACC,%R0,0
addi %HACC,%R0,0
addi %R1,$PIF,0
addi $POF,$PID,0
ilc $FIR_ORDER
.repeat
dmov $POF,%R1,$PIF,$PIF
mul{al} $PIC,%R1
jmov $POD,%HACC,%LACC
bri .restart
Figure 16: Application mapping (FIR filter) that allocates one processing
cell and two memory cells (grey). (a) The XML configuration specifies
a source file and parameters for code generation, and from which ports
data are streaming. (b) A generic assembly program that uses the XML
parameters as port references.
• FIR Filter - The time-multiplexed and pipeline FIR filters are mapped
to the DRA using MAC processors. The time-multiplexed design requires
one MAC unit and two memory cells, one for coefficients and one operat-
ing as a circular buffer for data values. The inner loop counter is used to
efficiently iterate over data values and coefficients, which are multiplied
pair-wise and accumulated. When an iteration completes, the least re-
cent value in the circular buffer is discarded and replaced with the value
on the input port. At the same time, the result from accumulation is
written to the output port. The time-multiplexed FIR implementation
is illustrated in Figure 17(a), and the corresponding mapping to resource
cells is shown in Figure 17(b). In contrast, the pipeline design requires
one MAC unit for each filter coefficient, which are serially connected to
form a pipeline. Each unit multiplies the input value with the coefficient,
adds the partial sum from the preceding stage, and forwards the data
value and result to the next stage.
• Radix-2 FFT - The time-multiplexed and pipeline FFT cores are mapped
to the DRA using DSP and CORDIC processors. DSPs are used for the
butterfly operations, while CORDIC units emulate complex multiplica-
tion using vector rotation. Delay feedback units are connected to each
DSP butterfly, which are implemented using memory cells operating in
FIFO mode. An example of an FFT stage is illustrated in Figure 17(c),
and the corresponding mapping to resource cells is shown in Figure 17(d).
For the time-multiplexed design, data is streamed through the DSP and
CORDIC units n times for a 2n -point FFT, where the butterfly size
changes for every iteration. The pipeline radix-2 design is constructed
from n DSP and n − 1 CORDIC units, which significantly increases the
throughput but also requires more hardware resources.
• Matrix Transpose - A matrix transpose operation is mapped to illus-
trate that the DSP processors may alternatively be used as address gen-
eration units (AGU). A fast transpose operation requires two DSP units
to generate read and write addresses, and the data is double-buffered in-
side a memory cell. While one buffer is being filled linearly, the other is
drained using an addressing mode to transpose the data. When both op-
erations finish, the DSP units switch buffers to transpose the next block.
Table 2: Application mapping results in terms of reconfiguration time, throughput, and resource requirements.
6 Hardware Prototyping
The Scenic models have been translated to VHDL, and verified using the same
configuration as during design exploration and analysis. Currently available
VHDL models are the DSP and MAC processors presented in Section 3.2, the
memory cell presented in Section 3.3, and the router presented in Section 3.4.
VHDL implementations have been individually synthesized to explore different
parameter settings, and integrated to construct arrays of size 4 × 4 and 8 × 8.
The results from logic synthesis in a 0.13 µm technology are presented in Ta-
ble 3, where each design has been synthesized with the following configuration
parameters:
The table also presents the maximum frequency and the memory storage
space inside each hardware unit. The router overhead illustrates how large
a part of the system resources is spent on global routing. Synthesis results
show how configuration parameters affect the area, frequency, and required
storage space. The DSP and MAC units are in the same area range, but
the memory cell is slightly larger. When constructing an array of cells, it
is important to choose cells that have similar area requirements. Hence, a
memory cell with Mmc = 256 is a trade-off between area and memory space,
since it is comparable in size with the processing cells. For routers, it can be
seen that the output queue depth is associated with a large hardware cost. To
avoid overhead from routing resources, it is important to minimize the queue
depth. Hence, a router with Q = 4 has been chosen as a trade-off between area
and network performance. As an example, the floorplan and layout of a 4 × 4
array, with 8 processing cells and 8 memory cells, are shown in Figure 18 and
19, respectively. The floorplan size is 1660 × 1660 µm2 (90% core utilization).
Table 3: Synthesis results for the processor, memory, router, and array. The
results are based on a 0.13 µm cell library. Mpc = 64 for all processing cells.
are estimated based on how many resource cells are allocated for each
application. Table 4 also includes a recently proposed medium-grained archi-
tecture for mapping a 256-point FFT [116]. Finally, commercial CPU and DSP
processors are presented to compare with general-purpose and special-purpose
architectures.
Compared with an application-specific solution, a time-multiplexed version
of the FFT can be mapped to the proposed DRA at an even lower cost, but
with the penalty of reduced performance. In contrast, the proposed pipeline
FFT generates the same throughput as the application-specific solution, but
with four times higher cost. For this application, this is the price of
flexibility for a reconfigurable architecture.
The area requirement for the proposed DRA of size 4 × 4 is 2.48 mm2 , as
shown in Table 3. As a comparison, the PowerPC 405F6 embedded processor
has a core area of 4.53 mm2 (with caches) in the same implementation technol-
ogy [103], while the Pentium 4 processor requires 305 mm2 . Area numbers for
the TMS320VC55 DSP are not available.
8 Conclusion
Modeling and implementation of a dynamically reconfigurable architecture has
been presented. The reconfigurable architecture is based on an array of pro-
cessing and memory cells, communicating using local interconnects and a hier-
archical network. The Scenic exploration environment and models have been
used to evaluate the architecture and to emulate application mapping. Various
network, memory, and application scenarios have been evaluated using Scenic,
to facilitate system tuning during the design phase. A 4 × 4 array of processing
cells, memory cells, and routers has been implemented in VHDL and synthe-
sized for a 0.13 µm cell library. The design has a core size of 2.48 mm2 and is
capable of operating up to 325 MHz. It is shown that mapping a 256-point
FFT generates 18 times higher throughput than a traditional DSP solution.
8. CONCLUSION 139
Bibliography
[9] T. Lenart and V. Öwall, “A 2048 complex point FFT processor using a
novel data scaling approach,” in Proceedings of IEEE International Sym-
posium on Circuits and Systems, Bangkok, Thailand, May 25-28 2003,
pp. 45–48.
[10] Y. Chen, Y.-C. Tsao, Y.-W. Lin, C.-H. Lin, and C.-Y. Lee, “An Indexed-
Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications,”
IEEE Transactions on Circuits and Systems—Part II: Express Briefs,
vol. 55, no. 2, pp. 146–150, Feb. 2008.
[11] Computer Systems Laboratory, University of Campinas, “The ArchC
Architecture Description Language,” http://archc.sourceforge.net.
[12] M. J. Flynn, “Area - Time - Power and Design Effort: The Basic Trade-
offs in Application Specific Systems,” in Proceedings of IEEE 16th Inter-
national Conference on Application-specific Systems, Architectures and
Processors, Samos, Greece, July 23-25 2005, pp. 3–6.
[13] R. Tessier and W. Burleson, “Reconfigurable Computing for Digital Sig-
nal Processing: A Survey,” The Journal of VLSI Signal Processing,
vol. 28, no. 1-2, pp. 7–27, 2001.
[14] S.A. McKee et al., “Smarter Memory: Improving Bandwidth for
Streamed References,” Computer, vol. 31, no. 7, pp. 54–63, July 1998.
[15] S. Mahadevan, M. Storgaard, J. Madsen, and K. Virk, “ARTS: A System-
level Framework for Modeling MPSoC Components and Analysis of their
Causality,” in Proceedings of 13th IEEE International Symposium on
Modeling, Analysis, and Simulation of Computer and Telecommunication
Systems, Sept. 27-29 2005.
[16] G. Beltrame, D. Sciuto, and C. Silvano, “Multi-Accuracy Power
and Performance Transaction-Level Modeling,” IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 26,
no. 10, pp. 1830–1842, 2007.
[17] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “A Scalable Synthesis
Methodology for Application-Specific Processors,” IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, vol. 14, no. 11, pp. 1175–
1188, 2006.
[18] D. Talla, L. K. John, V. Lapinskii, and B. L. Evans, “Evaluating Signal
Processing and Multimedia Applications on SIMD, VLIW and Super-
scalar Architectures,” in Proceedings of IEEE International Conference
on Computer Design, Austin, Texas, USA, Sept. 17-20 2000, pp. 163–172.
[20] K. Keutzer, S. Malik, and R. Newton, “From ASIC to ASIP: The Next
Design Discontinuity,” in Proceedings of IEEE International Conference
on Computer Design, Freiburg, Germany, Sept. 16-18 2002, pp. 84–90.
[23] Tensilica, “Configurable and Standard Processor Cores for SOC Design,”
http://www.tensilica.com.
[24] K. K. Parhi, VLSI Digital Signal Processing Systems. 605 Third Avenue,
New York 10158: John Wiley and Sons, 1999.
[41] D. C. Black and J. Donovan, SystemC: From the Ground Up. New York:
Springer, 2005.
[42] A. Donlin, “Transaction Level Modeling: Flows and Use Models,” in Pro-
ceedings of IEEE International Conference on Hardware/Software Code-
sign and System Synthesis, Stockholm, Sweden, Sept. 8-10 2004, pp. 75–
80.
[51] D. Gabor, “A New Microscopic Principle,” Nature, vol. 161, pp. 777–778,
1948.
[52] W. E. Kock, Lasers and Holography. 180 Varick Street, New York 10014:
Dover Publications Inc., 1981.
[56] F. Zernike, “Phase Contrast, A new Method for the Microscopic Ob-
servation of Transparent Objects,” Physica, vol. 9, no. 7, pp. 686–698,
1942.
[66] D. Litwiller, “CCD vs. CMOS: Facts and Fiction,” in Photonics Spectra,
2001.
[68] E. O. Brigham, The Fast Fourier Transform and its Applications. Upper
Saddle River, New Jersey: Prentice-Hall, 1988.
[74] C.D. Toso et al., “0.5-µm CMOS Circuits for Demodulation and Decoding
of an OFDM-Based Digital TV Signal Conforming to the European DVB-
T Standard,” IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp.
1781–1792, Nov. 1998.
[83] J. Cooley and J. Tukey, “An Algorithm for Machine Calculation of Com-
plex Fourier Series,” Mathematics of Computation, vol. 19, pp. 297–301,
Apr. 1965.
[84] K. Zhong, H. He, and G. Zhu, “An Ultra High-speed FFT Processor,” in
International Symposium on Signals, Circuits and Systems, Iasi, Roma-
nia, July 10-11 2003, pp. 37–40.
[85] J. G. Proakis and D. G. Manolakis, Digital Signal Processing - Principles,
algorithms, and applications. Upper Saddle River, New Jersey: Prentice-
Hall, 1996.
[86] S. He, “Concurrent VLSI Architectures for DFT Computing and Algo-
rithms for Multi-output Logic Decomposition,” Ph.D. dissertation, Lund
University, Department of Applied Electronics, 1995.
[87] M. Gustafsson et al., “High resolution digital transmission microscopy -
a Fourier holography approach,” Optics and Lasers in Engineering, vol. 41,
no. 3, pp. 553–563, Mar. 2004.
[88] ETSI, “Digital Video Broadcasting (DVB); Framing Structure, Channel
Coding and Modulation for Digital Terrestrial Television,” ETSI EN 300
744 v1.4.1, 2001.
[89] K. Kalliojärvi and J. Astola, “Roundoff Errors in Block-Floating-Point
Systems,” IEEE Transactions on Signal Processing, vol. 44, no. 4, pp.
783–790, Apr. 1996.
[90] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, “A Dynamic Scaling FFT Processor
for DVB-T Applications,” IEEE Journal of Solid-State Circuits, vol. 39,
no. 11, pp. 2005–2013, Nov. 2004.
[91] C.-C. Wang, J.-M. Huang, and H.-C. Cheng, “A 2K/8K Mode Small-area
FFT Processor for OFDM Demodulation of DVB-T Receivers,” IEEE
Transactions on Consumer Electronics, vol. 51, no. 1, pp. 28–32, Feb.
2005.
[92] T. Lenart et al., “Accelerating signal processing algorithms in digital
holography using an FPGA platform,” in Proceedings of IEEE Inter-
national Conference on Field Programmable Technology, Tokyo, Japan,
Dec. 15-17 2003, pp. 387–390.
[93] X. Ningyi et al., “A SystemC-based NoC Simulation Framework support-
ing Heterogeneous Communicators,” in Proceedings IEEE International
Conference on ASIC, Shanghai, China, Sept. 24-27 2005, pp. 1032–1035.
[94] L. Yu, S. Abdi, and D. D. Gajski, “Transaction Level Platform Modeling
in SystemC for Multi-Processor Designs,” UC Irvine, Tech. Rep., Jan.
2007. [Online]. Available: http://www.gigascale.org/pubs/988.html
[95] A.V. Brito et al., “Modelling and Simulation of Dynamic and Partially
Reconfigurable Systems using SystemC,” in Proceedings of IEEE Com-
puter Society Annual Symposium on VLSI, Porto Alegre, Brazil, Mar. 9-
11 2007, pp. 35–40.
[96] Open SystemC Initiative (OSCI), “OSCI SystemC 2.2 Open-source Li-
brary,” http://www.systemc.org.
[101] E. Gansner et al., “Dot user guide - Drawing graphs with dot,”
http://www.graphviz.org/Documentation/dotguide.pdf.
[102] The Spirit Consortium, “Enabling Innovative IP Re-use and Design Au-
tomation,” http://www.spiritconsortium.org.
[114] ——, “Hardware Cost Analysis for Weakly Programmable Processor Ar-
rays,” in Proceedings of International Symposium on System-on-Chip,
Tampere, Finland, Nov. 13-16 2006, pp. 1–4.
Appendix A
The Scenic Shell
Scenic is an extension to the OSCI SystemC library and enables rapid sys-
tem prototyping and interactive simulations. This manual presents the built-in
Scenic shell commands, how to start and run a simulation, and how to in-
teract with simulation models and access information during the simulation
phase. An overview of the Scenic functionality is presented in Figure 1. A
module library contains simulation modules that can be instantiated, bound,
and configured from an XML description. The Scenic shell provides access
to internal variables during simulation and enables extraction of performance
data from simulation models.
1 Launching Scenic
Scenic is started using the command scenic.exe from the Windows command
prompt, or using the command ./scenic from a Cygwin or Linux shell.
scenic.exe <file> [-ip] [-port] [-exec] [-exit]
Figure 1: (a) A simulation is created from an XML description format
using a library of simulation modules. (b) The modules can communi-
cate with the Scenic shell to access member variables, notify simulation
events, and filter debug messages.
[0 s]> set I 42
[0 s]> set S "Hello World"
Both environment and system variables can be listed either by using the set
command without any parameters or by typing list env.
[1 ms]> set
system variable(s) :
_DELTACOUNT : "276635"
_ELABDONE : "TRUE"
_MICROSTEP : "25"
_RESOLUTION : "1 ns"
_SIMTIME : "1000000"
_TIME : "5225"
environment variable(s) :
I : "42"
S : "Hello World"
MSG : "Hello World 42"
2 Simulation
Scenic modeling is divided into two phases: a system specification phase and
a simulation phase. System specification means describing the hierarchical mod-
ules, the static module configuration, and the module interconnects. The sys-
tem specification phase ends when the command sim is executed, which
launches the SystemC simulator and constructs the simulation models (elab-
oration). After elaboration, simulation modules cannot be instantiated, but
dynamic configuration can be sent to the simulation modules.
3. SIMULATION MODULES 163
After loading the XML system descriptions, the command sim launches the
SystemC simulator, instantiates and configures the library modules, and binds
the module ports. If the optional argument nostart is set to true, the
simulator stays in the specification phase until the simulation is actually run.
[1 ms]> step 25 us
[1 ms]> runb 1 ms
[2 ms]> run 10 ms
[2 ms]> stop
Simulation halted at 3125 us
[3125 us]>
3 Simulation Modules
Simulation modules are described in SystemC and use Scenic macros to regis-
ter the module in a module library, read configuration data during instantiation,
export internal member variables, register simulation events, and generate de-
bug messages and trace data.
The simulation modules in the module library can be listed using the com-
mand list lib, which will show the module name and the number of currently
instantiated components of each type.
library module(s) :
GenericProcessor_v1_00 [1 instance(s)]
mpmc_v1_00 [1 instance(s)]
spb2mpmc_v1_00 [1 instance(s)]
spb_v1_00 [1 instance(s)]
instantiated module(s) :
Adapter [spb2mpmc_v1_00]
BUS [spb_v1_00]
CPU [GenericProcessor_v1_00]
CPU.program [sci_memory]
MEM [mpmc_v1_00]
_memory [access object]
_scheduler [sci_module]
Scenic contains base classes for frequently used objects, such as sci_memory
and sci_processor, which provide additional module-specific functionality.
For example, the memory class implements functions to read and write memory
contents as shown below. In the same way, user modules can implement custom
functions to respond to requests from the Scenic shell.
Each registered variable can be accessed and modified using the get and set
commands. The example below illustrated how the program counter (PC) in
a process or set to instruction 4. After simulating two clock cycles, the PC
reaches the value 6. For vector variables, the value is specified as a string of
values.
[10 ns]> CPU set -var "%PC" -value 4
[10 ns]> runb 20 ns
[30 ns]> CPU get -var "%PC"
[ 6 ]
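One way to picture the export mechanism behind get and set is a table that
maps exported names to pointers at member variables. This is a minimal,
hypothetical C++ sketch; VariableRegistry and its methods are invented for
illustration and are not Scenic's API:

```cpp
#include <map>
#include <string>

// Hypothetical sketch: a module exports named member variables so that a
// shell can read and write them, as with "CPU set -var %PC -value 4".
class VariableRegistry {
public:
    // Register a member variable under an exported name.
    void export_var(const std::string& name, int* ptr) { vars_[name] = ptr; }

    // Write a new value; returns false for unknown names.
    bool set(const std::string& name, int value) {
        auto it = vars_.find(name);
        if (it == vars_.end()) return false;
        *it->second = value;
        return true;
    }

    // Read the current value; returns false for unknown names.
    bool get(const std::string& name, int& value) const {
        auto it = vars_.find(name);
        if (it == vars_.end()) return false;
        value = *it->second;
        return true;
    }

private:
    std::map<std::string, int*> vars_;
};
```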
Each variable can be periodically logged to study trends over time. The com-
mand log configures the history buffer depth and the logging time interval.
To disable logging, the time interval is set to 0. The read command is used
to access data in the history buffer, where time and vector range are optional
parameters. Time/value pairs can also be acquired by setting the timed flag
to true.
[0 us]> CPU log -var "%PC" -depth 10 -every "10 ns"
[0 us]> runb 1 us
[1 us]> CPU read -var "%PC"
[ 4 ; 5 ; 6 ; 7 ; 8 ; 4 ; 5 ; 6 ; 7 ; 8 ]
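The -depth option above bounds the history: once the buffer is full, the oldest
sample is discarded for each new one. A hypothetical C++ sketch of such a
bounded history buffer (names invented for illustration):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sketch of the periodic logging described above: a bounded
// buffer keeping the most recent 'depth' samples of a logged variable.
class HistoryBuffer {
public:
    explicit HistoryBuffer(std::size_t depth) : depth_(depth) {}

    // Called once per logging interval with the variable's current value.
    void sample(int value) {
        if (buf_.size() == depth_) buf_.pop_front();  // drop the oldest sample
        buf_.push_back(value);
    }

    // Equivalent of the shell's "read" command: return the buffered history.
    std::vector<int> read() const { return {buf_.begin(), buf_.end()}; }

private:
    std::size_t depth_;
    std::deque<int> buf_;
};
```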
The debug level of the simulation modules can be changed either by the global
command debug, which configures all simulation modules, or individually by
changing the variable with the set command. The following commands set the
debug level for all modules to warning, and then set the debug level for the
processor to message. During simulation, the processor reports the executed
instructions using a message macro, which prints the information on the screen.
[0 s]> debug -level WARNING
[0 s]> CPU set -var _debug_level -value MESSAGE
[0 s]> runb 50 ns
[0 ns] <CPU> In reset
[10 ns] <CPU> processing instruction @0
[20 ns] <CPU> processing instruction @1
[30 ns] <CPU> processing instruction @2
[40 ns] <CPU> processing instruction @3
[50 ns]>
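The filtering shown above amounts to an ordered severity comparison: a report
is emitted only when its severity reaches the module's configured level. A
hypothetical C++ sketch (DebugSink and the enum values are illustrative, not
Scenic's macros):

```cpp
#include <string>
#include <vector>

// Hypothetical severity ordering: MESSAGE is the most verbose level.
enum class DebugLevel { MESSAGE = 0, WARNING = 1, ERROR = 2 };

// Sketch of per-module debug filtering: a report is kept only if its
// severity is at or above the module's configured level.
class DebugSink {
public:
    explicit DebugSink(DebugLevel level) : level_(level) {}

    void set_level(DebugLevel level) { level_ = level; }

    void report(DebugLevel severity, const std::string& text) {
        if (severity >= level_) log_.push_back(text);
    }

    const std::vector<std::string>& log() const { return log_; }

private:
    DebugLevel level_;
    std::vector<std::string> log_;
};
```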
4 System Specification
The system description is based on the eXtensible Markup Language (XML)
and is divided into module instantiation, binding, and configuration. Instanti-
ation and binding are handled during the specification phase using the command
xml load, while configurations can be loaded during the simulation phase using
xml config. Multiple XML files can be loaded and will be merged into a
single system during elaboration.
xml load -file system.xml % load system description
xml load -file config.xml % load configurations
sim % build the simulation
runb 1 ns % start simulation phase
xml config -name "config_processor" % configure module
All parameters specified in the bind tag are sent to the module's bind function,
for example the parameter if, while src and dst represent the modules to be bound.
<CONFIG name="config_processor">
<MODULE name="CPU">
<PARAMETER name="C_PROGRAM" type="STRING" value="demo.asm"/>
<PARAMETER name="C_ADDR" type="UINT" value="0"/>
</MODULE>
</CONFIG>
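A module receiving such a CONFIG block has to interpret each PARAMETER
according to its declared type. The following C++ sketch of typed parameter
storage is hypothetical (ModuleConfig is invented for illustration; a real
implementation would also need an XML parser):

```cpp
#include <map>
#include <string>

// A parameter as it appears in the XML: a type tag and a raw value string.
struct Parameter {
    std::string type;   // e.g. "STRING" or "UINT"
    std::string value;  // raw attribute value
};

// Hypothetical typed configuration store, mirroring the CONFIG/PARAMETER
// structure shown above.
class ModuleConfig {
public:
    void add(const std::string& name, const std::string& type,
             const std::string& value) {
        params_[name] = Parameter{type, value};
    }

    std::string get_string(const std::string& name) const {
        return params_.at(name).value;
    }

    unsigned get_uint(const std::string& name) const {
        return static_cast<unsigned>(std::stoul(params_.at(name).value));
    }

private:
    std::map<std::string, Parameter> params_;
};
```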
Command Description

codegen [-format] [-top] [-proj] [-dir]
    run the code generator specified by format on a hierarchical simulation
    module top.
debug [-level]
    set the global debug level.
echo [text]
    print text message.
eval [A] [op] [B]
    evaluate a mathematical or logical operation.
exec [-file] [-repeat] [-fork]
    execute a script file with Scenic commands repeat times. fork executes
    the script in a new context.
exit
    exit the program.
foreach [-var] [-iterator] [-command]
    execute the command for every value in the set var.
function(parameters)
    function declaration with parameter list.
list [mod,gen,lib,env,arch]
    list modules, generators, library, environment variables, or architectures.
random [-min] [-max] [-size] [-round] [-seed]
    generate a sequence of size random numbers in the range min to max.
return [value]
    return a value from a function call.
run [time] [unit]
    run the simulation for time time units. Valid time units are
    [fs,ps,ns,us,ms,s].
runb [time] [unit]
    blocking version of run.
set [var] [value]
    assign or list environment variables.
sim [-arch] [-nostart]
    launch the SystemC simulator and create the system from an architectural
    description or from loaded XML.
step [time] [unit]
    run a single micro-step / set the micro-step value.
stop
    halt a running simulation at the next micro-step.
system [command]
    execute a system shell command.
unset [var]
    remove the environment variable var.
xml [clear,load,config,view]
    load an XML system description or configure simulation modules from XML.

Global modules Description

memory [map,rd,wr]
    Scenic module that manages the global memory map.
scheduler [active]
    Scenic module that manages access to variable logging.
Appendix B
DSP/MAC Processor Architecture
The DSP and MAC processors, presented in Part IV, are based on the same
architectural description, but with minor differences in the instruction set. The
DSP processor uses a 32-bit ALU and supports real- and complex-valued radix-2
butterfly operations. The MAC unit is based on a 16-bit ALU with multiplication
support, and implements instructions to join and split data transactions between
the external ports and the internal registers.
Table 2 presents the instruction set for the VHDL implementation, while
the Scenic models support additional and configurable functionality. Figure 1
presents a detailed description of the processor architecture. It is based on a
three-stage pipeline for instruction decoding, execution, and write-back. The
program is stored internally in the processing cell (PGM), and the main
controller handles configuration management and the control/status register.
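The butterfly and the join/split operations mentioned above can be sketched in
a few lines of C++. This is an illustrative model of the semantics only (function
names are invented), not the VHDL or Scenic implementation:

```cpp
#include <cstdint>
#include <utility>

// Radix-2 butterfly of the DSP processor (BTF instruction):
// D0 := S0 + S1, D1 := S0 - S1.
std::pair<int32_t, int32_t> btf(int32_t s0, int32_t s1) {
    return {s0 + s1, s0 - s1};
}

// Join two 16-bit values into one 32-bit word (JMOV instruction):
// HI(D0) := S0, LO(D0) := S1.
uint32_t jmov(uint16_t s0, uint16_t s1) {
    return (static_cast<uint32_t>(s0) << 16) | s1;
}

// Split a 32-bit word into its 16-bit halves (SMOV instruction):
// D0 := HI(S0), D1 := LO(S0).
std::pair<uint16_t, uint16_t> smov(uint32_t s0) {
    return {static_cast<uint16_t>(s0 >> 16),
            static_cast<uint16_t>(s0 & 0xFFFF)};
}
```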
Table 2: Instruction set for the 32-bit DSP processor and the 16-bit MAC processor.
Type A instructions (bit fields 31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0)
Type B instructions (bit fields 31-26 | 25-21 | 20-16 | 15-0)

ADD  D0,S0,S1     000001 D0 S0 S1 {srwcl}    D0 := S0 + S1
SUB  D0,S0,S1     000010 D0 S0 S1 {srwcl}    D0 := S0 - S1
DMOV D0,D1,S0,S1  000111 D0 D1 S0 S1 {srwl}  D0 := S0 ; D1 := S1
ADDI D0,S0,Imm    100001 D0 S0 Imm           D0 := S0 + sxt(Imm)
SUBI D0,S0,Imm    100010 D0 S0 Imm           D0 := S0 - sxt(Imm)
BEQI S0,Imm       100011 S0 Imm              PC := PC + sxt(Imm) if S0 = 0
BNEI S0,Imm       100100 S0 Imm              PC := PC + sxt(Imm) if S0 ≠ 0
BLTI S0,Imm       100101 S0 Imm              PC := PC + sxt(Imm) if S0 < 0
BLEI S0,Imm       100110 S0 Imm              PC := PC + sxt(Imm) if S0 ≤ 0
BGTI S0,Imm       100111 S0 Imm              PC := PC + sxt(Imm) if S0 > 0
BGEI S0,Imm       101000 S0 Imm              PC := PC + sxt(Imm) if S0 ≥ 0

Special instructions
NOP               000000                     No operation
BRI  Imm          101001 Imm                 PC := PC + sxt(Imm)
END  Imm          101010 Imm                 End execution with code = Imm
ILC  Imm          101011 Imm                 ILP := PC + 1 ; ILC := Imm
GID  Imm          101100 Imm                 Global port destination ID = Imm

32-bit DSP specific
BTF  D0,D1,S0,S1  000011 D0 D1 S0 S1 {srwcl} D0 := S0 + S1 ; D1 := S0 - S1

16-bit MAC specific
SMOV D0,D1,S0     000101 D0 D1 S0 {srwl}     D0 := HI(S0) ; D1 := LO(S0)
JMOV D0,S0,S1     000110 D0 S0 S1 {srwl}     HI(D0) := S0 ; LO(D0) := S1
MUL  S0,S1        000100 S0 S1 {srwal}       HACC := HI(S0 * S1) ; LACC := LO(S0 * S1)
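The Type A and Type B bit fields in Table 2 can be decoded with plain shifts
and masks. The following C++ sketch is illustrative (the struct and function
names are invented); it assumes the field layout given in the table, with the
Type B immediate sign-extended as in the sxt(Imm) semantics:

```cpp
#include <cstdint>

// Type A: opcode(31-26), D0(25-21), S0(20-16), S1(15-11), flags(5-0).
struct TypeA { uint32_t opcode, d0, s0, s1, flags; };

// Type B: opcode(31-26), D0(25-21), S0(20-16), Imm(15-0) sign-extended.
struct TypeB { uint32_t opcode, d0, s0; int32_t imm; };

TypeA decode_type_a(uint32_t w) {
    return { w >> 26, (w >> 21) & 0x1F, (w >> 16) & 0x1F,
             (w >> 11) & 0x1F, w & 0x3F };
}

TypeB decode_type_b(uint32_t w) {
    return { w >> 26, (w >> 21) & 0x1F, (w >> 16) & 0x1F,
             static_cast<int32_t>(static_cast<int16_t>(w & 0xFFFF)) }; // sxt
}
```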
[Figure 1: Detailed architecture of the DSP/MAC processor. The diagram shows
the three-stage pipeline (instruction decode, execute, write-back) with program
memory (PGM), a main control FSM with control/status register (MSR), branch
and flow control, operand forwarding, a 16/32-bit ALU with co-ALU and
accumulator registers (HACC/LACC), and the local (L0-L7 TX/RX) and global
(G0 TX/RX) port interfaces.]