
MSc Electronic and Computer Engineering

Digital Design

45 Hardware Acceleration of DSP


If we have a task to be carried out within an embedded system, then programming it
in software and running it on a microprocessor or a microcontroller is often the best
approach. Microprocessor hardware gives an excellent balance between computing
power and monetary cost, and software gives very flexible solutions that can be easily
bug-fixed and upgraded. However, if we have a task that is computationally
burdensome and needs to be carried out in real time, then software may be far too
slow. In recent years, this type of problem has become increasingly important, and it is
especially prominent in the smartphone and tablet markets, which require elaborate
digital signal processing (DSP) for their cameras and for their face and voice
recognition systems. In this lecture we will look at the basic principles of how we
would solve a DSP problem by constructing a hardware accelerator, and we will see
how this approach compares with software.

45.1 An example DSP problem


In general, DSP problems take a stream of data items x and modify them in some way
to produce a revised version z. The modification is accomplished by taking batches of
x and multiply-accumulating with a series of coefficients y:

z = Σᵢ xᵢyᵢ

The x could, for example, be pixel data in an image, audio samples of music or voice,
or neuron activations in a pattern recognition neural network. The modification could,
for example, be picking out edges in an image or boosting particular frequencies in
audio. As an example, let’s consider an edge detection image processing problem:

This works by splitting the source image into many small rectangular regions called
windows, and then passing them through the computation

z = Σᵢ xᵢyᵢ

Different operations (detect vertical edges, detect horizontal edges, vary contrast
sensitivity, suppress noise, etc.) correspond to different choices of values for the
coefficients y.
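To make this concrete, here is a small C snippet giving two plausible coefficient sets for a 3 × 3 window. These are the standard Sobel kernels, offered as a familiar example of what the y values might look like; they are not necessarily the coefficients used in this module's figures.

/* Two illustrative 3x3 coefficient sets y (standard Sobel kernels).
   Multiply-accumulating a window with one of these, z = sum(x*y),
   gives a large |z| where the image intensity changes sharply. */
const int y_vertical_edges[3][3] = {      /* responds to vertical edges */
    { -1, 0, +1 },
    { -2, 0, +2 },
    { -1, 0, +1 },
};
const int y_horizontal_edges[3][3] = {    /* responds to horizontal edges */
    { -1, -2, -1 },
    {  0,  0,  0 },
    { +1, +2, +1 },
};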

If we solve this in software, the program entails a sequence of operations:
 Step 1 Set z=0
 Step 2 Multiply x0 and y0
 Step 3 Add sub-total to z
 Step 4 Multiply x1 and y1
 Step 5 Add sub-total to z
 Step 6 Multiply x2 and y2
 Step 7 Add sub-total to z
 Step 8 Multiply x3 and y3
 Step 9 Add sub-total to z
Each of these steps corresponds to reading the x and y values from memory into the
processor, doing the arithmetic calculations on the processor’s arithmetic logic unit
(ALU), then writing the results back to memory. All of these steps are carried out one
after the other, in serial, and this can take a long time. To keep this example simple
and small enough to fit on the page, we have assumed that a window is only 4 pixels.
In practice a window would normally be 9 or 25 pixels in size, and the sequence of
operations would become very long.
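As a minimal sketch, the whole sequence above collapses to a short C loop; the 16-bit data types and the function name are illustrative assumptions:

#include <stdint.h>

/* Serial software version of the 4-pixel multiply-accumulate. Each loop
   iteration performs one multiply step and one add step (Steps 2-9). */
int32_t window_mac(const int16_t x[4], const int16_t y[4])
{
    int32_t z = 0;                    /* Step 1: set z = 0 */
    for (int i = 0; i < 4; i++)
        z += (int32_t)x[i] * y[i];    /* multiply x[i] by y[i], add to z */
    return z;
}

For a 9- or 25-pixel window only the loop bound changes, but the number of serial steps the processor executes grows in proportion.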

If we build a piece of hardware customized to solve the problem, we get this:

We apply all of the x, y data for one window in one go, and it is all processed in
parallel by an array of multipliers. This gives us a substantial speed up, but
requires a lot of hardware to build. The more data we can process in parallel, the
bigger the speed up but the larger the amount of hardware resource needed. So, for
example, if our window size increased to 25 pixels the speed advantage and the
hardware resource requirements would become much larger.

Because modern production processes can miniaturise transistors at moderate cost,
providing the large array of multiplier hardware may not be too challenging. But the
input/output bandwidth required to ensure that large arrays of x and y data turn up
at the inputs simultaneously can be more problematic.

45.2 Trading off speed against hardware resource


Problems involving very large amounts of data can be challenging to implement in
fully parallel hardware. When building custom hardware, it is common to include
some degree of serialism by building hardware that has high parallelism, but requires
several passes to get all of one data window through. For our example problem, it
would look like this:

We now require two passes to complete the processing of one window. Initially we
would apply a reset to zero the output register. Then in the first pass we would use the
Select input to set the multiplexers to solve x0y0+x2y2. Once this result has been
received in the output register, we start a second pass where we set the multiplexers to
solve x1y1+x3y3 which is added to the result of the first pass to give our overall result.
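The schedule can be captured in a short behavioural C model; this is a sketch under assumed 16-bit data, in which the loop variable plays the role of the Select signal:

#include <stdint.h>

/* Behavioural sketch of the two-pass datapath: two physical multipliers
   are shared across the four inputs via multiplexers driven by Select,
   and partial results accumulate in the output register. */
int32_t two_pass_mac(const int16_t x[4], const int16_t y[4])
{
    int32_t acc = 0;                        /* Reset zeroes the output register */
    for (int select = 0; select < 2; select++) {
        /* Pass 1 (select = 0) routes x0,y0 and x2,y2 to the multipliers;
           pass 2 (select = 1) routes x1,y1 and x3,y3. */
        int32_t p0 = (int32_t)x[select]     * y[select];
        int32_t p1 = (int32_t)x[select + 2] * y[select + 2];
        acc += p0 + p1;                     /* adder + output register update */
    }
    return acc;
}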

We can take this process as far as we like. So, for example, this would be a fully serial
solution that takes four passes to produce its result:

Note that we have had to introduce control signals (Reset, Select) and drive them
with an appropriate sequence of activations and specific timing in order to get the
design as a whole to work correctly. These signals would be generated by a finite
state machine, which constitutes the control path of the design.
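As an illustration of what such a control path does, here is a C sketch of an FSM for the fully serial design. The state names and signal encoding are invented for illustration, and a real control path would be written in an HDL rather than C:

/* Sketch of a control path for the fully serial (four-pass) design: a
   small FSM that asserts Reset for one cycle, then steps Select through
   the four passes. Called once per clock cycle. */
typedef enum { S_RESET, S_PASS, S_DONE } state_t;

typedef struct { int reset; int select; } ctrl_signals_t;

ctrl_signals_t control_step(state_t *state, int *pass)
{
    ctrl_signals_t out = { 0, 0 };
    switch (*state) {
    case S_RESET:                /* zero the output register */
        out.reset = 1;
        *pass = 0;
        *state = S_PASS;
        break;
    case S_PASS:                 /* route x[pass], y[pass] through the datapath */
        out.select = *pass;
        if (++*pass == 4)
            *state = S_DONE;
        break;
    case S_DONE:                 /* result complete; hold until restarted */
        break;
    }
    return out;
}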

45.3 Control path and data path


In general, a hardware design looks like this:

[Figure: block diagram. A control path (a finite state machine) drives control
signals into a data path; state-sensing signals feed back from the data path to
the control path. Data flows from the system inputs, through the data path, to
the system outputs.]

The data flows along a datapath where it is transformed or modified in some way.
Datapaths consist of circuits like adders, multipliers, etc. (which transform the data)
and multiplexers, etc. (which can switch data flow between different units). The
operation of the datapath is governed by control signals, which switch parts of the
datapath on or off, or alter the routing of data through the datapath. The control
signals are generated in the appropriate sequence by the control path, which is a finite
state machine (or group of interacting FSMs). The control path may modify the
control signals in response to information about the state of the datapath.

If the number of operations that the system needs to carry out greatly exceeds what
can fit onto the chip at one time, then complicated sequencing must be carried out,
and the control design can be quite intricate. By contrast, if the number of operations
that the system performs is small enough for all of them to fit on the chip in one go,
then the sequencing is easy, and the control path will be simple, or maybe even
non-existent.

45.4 The speed of a datapath: throughput and latency


Our datapath designs will have a register at their input and a register at their output.

A set of inputs starts in the input register, then moves through the multipliers and
adders to produce a result that is read into the output register. In order for this
arrangement to work, the clock driving the registers needs to be slow enough that a
batch of data that arrives in the input register on one clock edge has enough time to
flow through the multipliers and adders before the next clock edge causes the result to
be read into the output register. So the clock period needs to be at least as long as the
worst case delay between the register stages.

As an example, suppose we have the following delay times for our components:
 Multiplier delay: 2 ns
 Adder delay: 1 ns
All paths from the input register to the output register pass through one multiplier and
two adders. So the worst case delay between the two registers is 4 ns. That means that
our clock period must be at least 4 ns, so our clock frequency must be 250 MHz or
lower.
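Written as an explicit constraint (with t_mult and t_add standing for the component delays above):

$$T_{clock} \ge t_{mult} + 2\,t_{add} = 2\,\text{ns} + 2 \times 1\,\text{ns} = 4\,\text{ns}, \qquad f_{clock} = \frac{1}{T_{clock}} \le \frac{1}{4\,\text{ns}} = 250\,\text{MHz}$$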

There are two important measures of speed:


 Latency is the length of time that elapses between the inputs becoming valid, and
the corresponding output becoming valid. The way the circuit works, we apply the
inputs on one clock cycle, and the output appears on the following clock cycle, so the
latency is just one clock cycle (4 ns in the above example). The faster we can run
the clock, the lower the latency.
 Throughput is the rate at which we put new inputs into our circuit (or equivalently
the rate at which we get new outputs out of the circuit). So, for example, in the
above circuit we apply a set of inputs and then 4 ns later we apply a new set of
inputs. So the throughput of this circuit is 1 / 4 ns = 250 million data items per
second.

45.5 Timing diagrams


Drawing registers as boxes on our diagrams becomes cumbersome for systems with
complicated timing, so a standard convention is to draw a register as a dashed line:

The way that we read this timing diagram is that a data item (for example the window
data for our image) hops from one dashed line to the next on each clock cycle.

45.6 Pipelining
We can increase the throughput by using a technique called pipelining. We insert an
additional register stage after the multipliers:

This now means that the worst case delay between register stages is 2 ns, and we can
double the clock frequency to 500 MHz. Each individual data window will now take 2
clock cycles (i.e. 4 ns) to traverse the data path, just as before. However, we can insert
a new window into the data path every 2 ns so our throughput doubles to 500 million
data items per second.
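A cycle-by-cycle behavioural sketch of the two-stage pipe may help make this concrete. The mid-pipeline register is modelled as a struct; all names and types here are illustrative:

#include <stdint.h>

#define N 4   /* window size in this example */

/* Mid-pipeline register: the N products latched after the multiplier stage. */
typedef struct { int32_t p[N]; int valid; } stage_reg;

/* Process n_windows windows. On each cycle, stage 2 (the adders) sums the
   products latched on the previous cycle while stage 1 (the multipliers)
   works on the next window, so one result emerges per cycle once the pipe
   is full. */
void run_pipeline(const int16_t x[][N], const int16_t y[N],
                  int32_t z[], int n_windows)
{
    stage_reg r = { {0}, 0 };
    for (int cycle = 0; cycle <= n_windows; cycle++) {
        if (r.valid) {                         /* stage 2: adder tree */
            int32_t sum = 0;
            for (int i = 0; i < N; i++)
                sum += r.p[i];
            z[cycle - 1] = sum;                /* result for the previous window */
        }
        if (cycle < n_windows) {               /* stage 1: multipliers */
            for (int i = 0; i < N; i++)
                r.p[i] = (int32_t)x[cycle][i] * y[i];
            r.valid = 1;
        } else {
            r.valid = 0;                       /* pipe drains on the last cycle */
        }
    }
}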

You should now know...

 Why custom hardware can be faster than software solutions


 How we can trade off parallel and serial processing to get the balance of speed,
resource cost and I/O cost that we want
 How to read a timing diagram
 How pipelining can be used to increase throughput

46 Integrating a Hardware Accelerator into a Computer System
It would be rare for a pure hardware solution to be satisfactory for a complex task.
There are almost always some parts of the task that require the flexibility of software.
So we normally find ourselves needing to interface our custom hardware to a
microprocessor-based computer system. This is not trivial, because software and
hardware see the world in different ways. In a computer system, all variables and I/O
devices correspond to a particular location on a memory map. Hardware usually
operates on a streaming basis: it just expects a new data item to turn up at its input on
each new clock cycle. In this lecture we will look at how we interface custom
hardware to a computer system.

46.1 Computer organisation


The conceptual picture of a computer architecture is as follows:

[Figure: bus diagram. The processor, the memory and other devices are all connected
to three shared buses: the address bus, the data bus and the control bus.]

To multiply two numbers together (z = x × y) the processor must first fetch the
instruction from memory by issuing the instruction address, and reading the returned
instruction from the data bus. Then it must decode the instruction to figure out what
operands it needs. The variables x and y will correspond to particular locations on the
memory map. The processor will issue the address of x and read back its value on the
data bus. Then it will do the same for y. Once it has collected the operands, it
performs the multiplication operation. Then it issues the address of z and writes a
value of z into memory via the data bus. The sequence of operations is:
1. Instruction fetch
2. Instruction decode and operand fetch
3. Execute
4. Write back

In order to maximise the throughput of instructions, it is common to use pipelining.


We start a fetch-execute cycle for one instruction, which will take several clock cycles
to complete. Then we launch subsequent instructions on each successive clock cycle:

              Cycle 1      Cycle 2      Cycle 3      Cycle 4       Cycle 5
Instruction 1 Fetch 1      Decode 1     Execute 1    Write back 1
Instruction 2              Fetch 2      Decode 2     Execute 2     Write back 2
Instruction 3                           Fetch 3      Decode 3      Execute 3
Instruction 4                                        Fetch 4       Decode 4

This pipelined structure can only keep moving successfully if each instruction moves
one stage to the right on each clock cycle. If we have an instruction that is many
words long, then it will take many clock cycles to fetch and the pipeline won’t work
correctly. If we have an instruction that takes many cycles to find its operands, or to
execute its operation on those operands, then the pipeline won’t work correctly.

When a design team sets out to design a microprocessor, the most fundamental
decision they take is what instruction set the processor should use. Historically, the
designers of instruction sets for processors were allowed to do whatever they liked.
This resulted in typical processors being able to process a very wide range of
instructions, each of different length, taking different numbers of clock cycles to
complete, and having very variable formats. This style of doing things is known as
CISC (Complex Instruction Set Computer). Up until the mid-1980s almost all
computers were designed in this way. Many of the instruction types (in particular the
complex memory addressing modes) used by CISC processors were impossible to
pipeline efficiently. Most modern processors are based on an approach called RISC
(Reduced Instruction Set Computer). This way of designing microprocessors is based
on ensuring that all instructions are exactly one word long and take exactly one
clock cycle to execute. This is achieved by imposing severe restrictions on what
instructions are allowed to do, and in particular by restricting the memory addressing
modes that they are allowed to use.

Microprocessor designs vary from extremely high speed and capable through to very
basic. Differentiating factors include:
 Capability of instruction execution units (e.g. do they have hardware support for
floating point?)
 Number of execution pipelines that operate simultaneously. A processor that has
multiple parallel execution pipelines is called superscalar
 Number of pipeline stages. Microprocessors with very high clock frequencies
normally have a large number of pipeline stages.
 Features to boost clock speed when computational load is high and reduce clock
speed when there is danger of overheating
 Data bus/address bus width

46.2 Hardware-software interfacing


We interface custom hardware to the busses of the computer system by incorporating
an interface that listens to the address bus and associates addresses with the input and
output registers of our custom hardware.

The interface has associated control signals to indicate both to the hardware and the
software when data is valid.

In order to allow suppliers and consumers of designs to exchange and trade
conveniently, these interfaces are standardised. The most important standard is AXI,
the Advanced eXtensible Interface for on-chip communication. AXI interfaces are
allocated addresses on the memory map of the computer, and allow the computer to
treat the input and output registers of the custom hardware as if they were normal
variables in the running program.
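Here is a minimal bare-metal sketch of what this looks like from the software side. The base address and register map are invented for illustration; real values would come from the SoC's documentation, and under an operating system the region would be mapped (e.g. with mmap) rather than accessed through a fixed pointer.

#include <stdint.h>

#define ACCEL_BASE 0x43C00000u               /* hypothetical base address   */

enum {                                       /* hypothetical register map   */
    REG_X0 = 0, REG_X1, REG_X2, REG_X3,      /* input registers x0..x3      */
    REG_CTRL,                                /* control: bit 0 = start      */
    REG_STATUS,                              /* status: bit 0 = result valid */
    REG_Z                                    /* output register             */
};

static volatile uint32_t * const accel = (volatile uint32_t *)ACCEL_BASE;

uint32_t accel_mac(const uint32_t x[4])
{
    for (int i = 0; i < 4; i++)
        accel[REG_X0 + i] = x[i];    /* inputs look like ordinary variables */
    accel[REG_CTRL] = 1u;            /* tell the hardware the data is valid */
    while ((accel[REG_STATUS] & 1u) == 0)
        ;                            /* poll until the hardware signals done */
    return accel[REG_Z];             /* read the result back */
}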

There are three different types of AXI interface that can be used:
 AXI (often called “AXI full”, to distinguish from other types) supports burst mode
transfers. This means that we only need to send one address and one set of control
signals at the start of a transfer. The control signals indicate how many data words
are to be transferred (up to a maximum of 256 for a single transaction). Once the
address and control signals have been sent, the data words are transferred on
consecutive clock cycles with no further exchange of address or control
information.
 AXI Lite has a simpler control interface than AXI full, but can only transfer one
data word at a time, which makes it inefficient for large transfers. So, for example,
if we had 20 data words in consecutive memory locations to transfer, AXI Lite
would need to send 20 different addresses, whereas AXI full would just need to
transfer the single base address for the 20 words.
 AXI stream does not use addresses: it just pushes data in at one end and harvests it
at the other.

You should now know...

 The basic ideas of the AXI interface.

47 System-on-Chip
In this unit we will look briefly at commercial system-on-chip systems. These create a
single integrated circuit from intellectual property sourced from a variety of providers.
This is the standard approach used in smartphone and tablet systems.

47.1 Commercial System-on-Chip (SoC) systems


Historically, hardware systems have typically used a different chip for each different
major function, with the chips being assembled onto a printed circuit board for
interconnection. Nowadays it is becoming common for many different functions to be
integrated onto a single chip to form a system-on-chip design. This trend is most fully
developed in the smartphone market where the main SoC will combine a CPU, a
graphical processing unit, a display controller, a 4G/5G modem, a video codec,
several levels of cache, a main memory controller and many other functions into a
single chip. It would be rare for the producer of the chip to wish to design all of these
different functional units themselves, so many (maybe all) of these units are bought in
as cores from a variety of companies. In order to ensure that cores from different
companies can easily be mixed in one design, designers make their cores conform to a
standard interface structure, timing and set of control signals. The most important
vendor of intellectual property for system-on-chip is ARM. As well as producing the
well-known ARM RISC microprocessor cores and a range of powerful graphical
processing unit cores for smartphones, they also invented the standard bus (AMBA –
Advanced Microcontroller Bus Architecture) and interface system (AXI – Advanced
eXtensible Interface), which are used by most modern SoCs.

Producers of smartphones and tablets can purchase a design for a standard processor
from ARM (usually as a Verilog IP core). Alternatively, companies can purchase an
“Architectural license” which gives them the right to design their own processors
using the ARM instruction set and standard interfaces. Architectural licenses are used
by most of the high-end producers, e.g. Apple and Qualcomm.

The ARM processor originally came to prominence because its extreme simplicity of
construction gave it a very low power dissipation, which is a top priority factor in the
smartphone market where poor battery lifetime would limit the appeal of a new
product. As time has gone on, the full range of advanced features (e.g. deep
superscalar pipelines) has been added to many versions to produce processors that can
rival the Intel and AMD processors that are commonly used for desktop and laptop
PCs. However, low power dissipation remains crucial. In order to manage this trade-
off, smartphone SoCs usually organise their processor cores into heterogeneous
clusters. A cluster will consist of some cores (usually 2 or 4) that are high speed/high
battery drain, using a
high clock frequency and many advanced architectural features, and other cores
(usually 4) that are low speed/low battery drain and use a medium clock frequency.

To illustrate these issues, consider an example system on chip: the Apple A12X
Bionic, used in the iPad Pro (a labelled die photograph is available at the link
below). There are many diverse functions
combined onto the single chip. The largest features on the chip are the CPU and GPU
clusters. “Vortex” is Apple’s name for its performance microprocessor cores. These
are designed by Apple using an architectural license from ARM. They are 7-way
superscalar and run at a clock frequency of up to 2.5 GHz to give very strong
performance but poor battery drain. “Tempest” is Apple’s name for its efficiency
cores, which are 3-way superscalar and run at a maximum clock frequency of 1.6 GHz.
These efficiency cores have a battery drain which is 10 times lower than that of the
performance cores.

https://en.wikichip.org/wiki/apple/ax/a12x

The other large units on the die are the graphics processing unit (GPU) which is
responsible for high performance graphics, the image signal processor (ISP) which is
responsible for optimising the pictures and video taken by the camera, and the NPU
(neural processing unit) responsible for accelerating pattern recognition tasks such as
recognising and labelling the various objects in an image. Historically, SoC has been
used for smartphones and tablets, application areas which are less performance-
oriented than laptop or desktop computers. However, recently Apple has started to
deploy a variant of the Bionic SoC in its laptop range.

An alternative to ARM which may become significant in the future is RISC-V. Like
ARM, this is a RISC microprocessor architecture that can be targeted to a wide
variety of different
price/speed/battery lifetime trade-offs. Unlike ARM, RISC-V can be designed by
anyone without the need to purchase a license to use the instruction set.

47.2 Soft System on Chip


The SoCs used in smartphones are produced as ASICs because of their huge
production volumes. For lower production volumes FPGAs can be used as a
programmable platform that can combine purchased cores with a company’s own
custom designs to produce complex systems without high production costs. Such
systems are called soft system on chip, as the hardware function is not fixed and is
programmed into the design by a configuration bit stream.

In principle we can build any hardware that we want in the general-purpose logic
fabric on an FPGA. In practice, this can be inefficient and slow, and so it is rare for an
FPGA to consist solely of general-purpose reconfigurable logic fabric. FPGA
manufacturers will normally include some commonly used subsystems manufactured
directly into the silicon of the FPGA in order to be as efficient and as fast as possible.
These typically include high speed multiply/add units, efficient memory units and
high speed transceivers. Some FPGAs will also include powerful microprocessors
embedded within the FPGA silicon.

Intellectual property cores that are acquired as a configuration bit stream to construct
a subsystem in the general-purpose resources of an FPGA are called soft cores.
Subsystems which are manufactured directly into the FPGA silicon are called hard
cores.

You should now know...


The meaning and significance of the following:
 IP core
 System-on-chip
 Hard core and soft core

Index
45 Hardware Acceleration of DSP
45.1 An example DSP problem
45.2 Trading off speed against hardware resource
45.3 Control path and data path
45.4 The speed of a datapath: throughput and latency
45.5 Timing diagrams
45.6 Pipelining

46 Integrating a Hardware Accelerator into a Computer System
46.1 Computer organisation
46.2 Hardware-software interfacing

47 System-on-Chip
47.1 Commercial System-on-Chip (SoC) systems
47.2 Soft System on Chip
