
Hardware-as-Code Part IV: Embedded RAM

hackster.io/sthibault/hardware-as-code-part-iv-embedded-ram-fc0545

Scott Thibault

Things used in this project

Hardware components

UPduino × 1

Software apps and online services

PlatformIO IDE

Story

As I've mentioned a number of times already during this series, eliminating the central-memory bottleneck is one of the performance benefits of building custom hardware vs. a CPU approach. That does not mean that FPGA designs never include memory; they are often used together with an external memory. However, the topic of this installment is another kind of memory that is key to performance: embedded memory. Let's start with some example code and then talk more about embedded memory.

If you are new to this series, you may want to go back to Hardware-as-Code Part I.

Software update
One last thing before we get into it: you will need to update the UPduino HLS toolchain to the latest version before building the examples below. To do the update, simply enter the following pio command (if you're using Visual Studio Code, open a command prompt by clicking the PlatformIO icon (ant) on the left and then selecting Platform Core CLI under Miscellaneous):

> pio platform update upduino_hls

This should install the latest release (0.2.1 at the time of writing).

The simple perceptron


Last time, I said that the example in Part III involves the same type of computation as neural networks. The following example implements a single-layer neural network, called the perceptron, and as you can see, the code is very similar.

As always, this code is also available from the git repository (https://github.com/sathibault/hac-examples).
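The listing itself isn't reproduced here, so here is a minimal sketch of what the dotproduct function might look like. The in_array and as_array definitions below are hypothetical CPU-only stand-ins so the sketch compiles on its own; the real types ship with the UPduino HLS headers and their exact signatures may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the framework's array types; the real
// in_array/as_array come from the UPduino HLS headers and may differ.
template <typename T, size_t MAX> struct in_array {
    const T *data;   // the caller's data
    size_t count;    // actual number of elements (<= MAX)
};

template <size_t MAX, typename T>
in_array<T, MAX> as_array(const T *data, size_t count) {
    return in_array<T, MAX>{data, count};
}

// A single perceptron: weighted sum of the four iris features plus a bias.
int32_t dotproduct(in_array<int16_t, 4> features,
                   in_array<int16_t, 4> weights, int16_t bias) {
    int32_t sum = bias;
    for (size_t i = 0; i < features.count; i++)
        sum += (int32_t)features.data[i] * weights.data[i];
    return sum;
}
```

On the CPU side, the call site wraps a raw array with something like `as_array<4>(features, 4)` so that both the maximum size (from the type) and the actual element count travel with the data.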

This example performs classification just like the poly-classify example of part III.
However, this time we are classifying flowers from the popular iris dataset. There are four
data values associated with each flower: sepal length, sepal width, petal length and petal
width. In this case, we are classifying the flower type (setosa or not).

When you run the example, you should get the following output:

Ex 0: -822
Ex 1: -837
Ex 2: 94
Ex 3: 777

This shows the output of the perceptron for four example flowers. Output values less than 0 are predicted to be type setosa, and values over 0 not setosa. The actual flower types of the four examples are 1) setosa, 2) setosa, 3) versicolour, and 4) virginica. So, this lines up exactly with the perceptron predictions above.

Arrays and embedded memory

The main difference between the example from last time and the dotproduct function
here is the use of arrays. The Upduino HLS tool supports arrays like we've used here, but
you are probably wondering about the in_array<int16_t,4> type on line 5 and the
as_array function on line 27. Sometimes, perfectly normal C/C++ written for CPUs
simply does not provide enough information to generate equivalent hardware. That is the
case for arrays. Arrays are often passed around C/C++ programs as a simple pointer to
some space in central memory where the array is located.

In the case of hardware synthesis, arrays actually get mapped to individual, dedicated
memories embedded in the hardware. Yes, you read that correctly! The FPGA contains
many small memory blocks called embedded memory or block RAM. Each array variable
will have dedicated memory blocks that hold that variable's data.

In order to generate a memory for an array variable, we need to know what size it is, and that's the point of the in_array<int16_t,4> type. This type indicates that the parameter is an input array, the element type is int16_t, and the maximum number of elements in the array is 4. That enables the hardware generator to allocate the correct number of memory blocks to receive the data for the features parameter.

On line 27, the as_array function is used to construct the correct type to match the
in_array parameter, but also to indicate how much data is actually in the array (as
opposed to the maximum which we know from the parameter type). Since this call
executes on the CPU and the function executes on the FPGA, the data must be sent from
the CPU to the FPGA. Consequently, we need to know exactly how much data is to be
sent.

This is one of the cases where we have to follow certain design patterns in order to target hardware generation. Unfortunately, the exact mechanism may vary from one toolchain to the next. The good news is that there are not a lot of cases like this, and the principle is always the same, even if the syntax varies. For arrays, the principle is that we need to provide some size information, and we do that using certain types or functions provided by the framework we are using.

A multi-class perceptron
The iris dataset actually contains flowers of three different types: setosa, versicolour, and virginica. The perceptron model can be extended to multiple classes by simply using one perceptron per class; the one with the highest output is the predicted class. The following example implements a multi-class perceptron for identifying the three iris flower types.
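Again, the listing isn't reproduced here, so here is a sketch of the shape of the function, with hypothetical CPU-only stand-ins for the framework's in_array, out_array, and resize (the real definitions come from the UPduino HLS headers and may differ in detail):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the framework's array types.
template <typename T, size_t MAX> struct in_array {
    const T *data;
    size_t count;
};
template <typename T, size_t MAX> struct out_array {
    T data[MAX];
    size_t count = 0;
    void resize(size_t n) { count = n; }  // elements actually sent back
};
template <size_t MAX, typename T>
in_array<T, MAX> as_array(const T *d, size_t n) { return {d, n}; }

// Three perceptrons sharing one flattened 3x4 coefficient array.
void classify(in_array<int16_t, 4> features, in_array<int16_t, 12> coef,
              in_array<int16_t, 3> bias, out_array<int32_t, 3> &scores) {
    scores.resize(3);  // all three class scores go back to the CPU
    for (size_t c = 0; c < 3; c++) {
        int32_t sum = bias.data[c];
        for (size_t i = 0; i < 4; i++)
            sum += (int32_t)coef.data[c * 4 + i] * features.data[i];
        scores.data[c] = sum;
    }
}
```

The CPU side then picks the index of the largest of the three scores as the predicted class.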

The output should look like:

> .\program
Ex 0 = 0 (3694, -3018, -5298)
Ex 1 = 0 (2474, -1038, -6052)
Ex 2 = 1 (-944, 1292, 204)
Ex 3 = 2 (-2392, 138, 5364)

As you can see, the maximum value of each row matches the expected class for that
example.

This example introduces another array parameter type, out_array (line 5), and a corresponding resize method (line 12). This type serves the same purpose as in_array, but represents an output array parameter. So again, it indicates to the compiler how much space to reserve for this array, i.e. the maximum number of elements it will have.

In addition to the maximum array size, since the output data needs to be returned to the CPU from the FPGA, the system must know how much data is actually in the array. The resize method is used to indicate how many elements are actually used and need to be sent back to the caller. Neither of these examples uses an array that is both input and output, but that is also available via the inout_array template type, used in the same manner.

Another thing you might be wondering is why isn't coef a two-dimensional array?
That's only because the Upduino HLS toolchain only supports one-dimensional arrays.
Most HLS toolchains do support multi-dimensional arrays, but I don't find this limitation
particularly annoying. The index calculation is easy; it's just row-index * row-length +
column-index. Having to write it out helps you be aware of the arithmetic resources
required, and most of the time it can be done more efficiently if you think about it (hint:
using an index variable and just 2 additions, no multiplications).
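One way to realize that hint (a sketch, not necessarily what the repository's code does) is to keep a running index into the flattened array, so that each access costs an increment rather than a multiply:

```cpp
#include <cstddef>
#include <cstdint>

// Flattened 3x4 multiply-accumulate using a running index k, so no
// c*4 + i multiplication is needed on each access.
void classify3(const int16_t features[4], const int16_t coef[12],
               const int16_t bias[3], int32_t scores[3]) {
    size_t k = 0;  // running position in the flattened coef array
    for (size_t c = 0; c < 3; c++) {
        int32_t sum = bias[c];
        for (size_t i = 0; i < 4; i++) {
            sum += (int32_t)coef[k] * features[i];
            k++;  // an addition replaces the index multiplication
        }
        scores[c] = sum;
    }
}
```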

Performance
There are two main points I want to make about these examples. The first is how embedded memory breaks the memory bottleneck. Since each array has its own memory, they are completely independent and can be accessed in parallel! Furthermore, the embedded memory of many FPGAs has two ports, which means you can actually access two values per cycle per array. That's a lot of parallelism. So what's the catch? Well again, we are limited by space.

There are only a fixed number of memory blocks per chip. Memory blocks are described by the number of bits they can store. The FPGA on the UPduino board has 30 block RAMs, each with 4096 bits of storage. To calculate the number of blocks an array will need, multiply the number of elements by the number of bits per element, then divide by the block RAM size and round up.
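That calculation fits in a one-liner. The 4096-bit block size below matches the UPduino's FPGA; other parts differ, so check your datasheet:

```cpp
#include <cstddef>

// Block RAM size on the UPduino's FPGA; other FPGAs differ.
constexpr size_t BRAM_BITS = 4096;

// Blocks needed for an array: ceil(total bits / bits per block).
constexpr size_t blocks_needed(size_t elements, size_t bits_per_element) {
    return (elements * bits_per_element + BRAM_BITS - 1) / BRAM_BITS;
}
```

For example, a 600-element int16_t array needs 600 × 16 = 9600 bits, which rounds up to 3 blocks.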

On the UPduino, the total number of bits is 30 x 4096 = 120Kb = 15KB. That's really
limited, but this FPGA is one of the smallest available. Large FPGAs can have many Mb of
block RAM. Side note: most hardware documentation will give memory numbers in bits,
written with a lowercase b (e.g. Kb), as opposed to bytes, usually written with a capital B
(e.g. KB).

The second point is that loops help us make trade-offs between space and speed. In the last installment, everything was done in parallel and took up multiple adders and multipliers. In this installment we've used loops, which take more execution time, but only use one multiplier and one adder. It's not all or nothing; an intermediate option is to partially unroll loops to balance the speed vs. space trade-off.
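As a sketch of what partial unrolling looks like in source form (whether a given toolchain unrolls from rewritten source like this or via a directive is toolchain-specific; this just shows the shape of the idea):

```cpp
#include <cstddef>
#include <cstdint>

// Dot product over 4 elements, partially unrolled by a factor of 2: each
// iteration does two multiply-accumulates, so the loop runs half as many
// times at the cost of a second multiplier in hardware.
int32_t dot4_unroll2(const int16_t x[4], const int16_t w[4], int16_t bias) {
    int32_t sum = bias;
    for (size_t i = 0; i < 4; i += 2)
        sum += (int32_t)x[i] * w[i] + (int32_t)x[i + 1] * w[i + 1];
    return sum;
}
```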

Here's a challenge for you. The multi-class perceptron example above requires 3 x 4
iterations which are executed sequentially (it is possible to execute iterations in parallel
but that is an advanced topic that will not be covered until later). How could you rewrite
this example to take advantage of the embedded memory's parallelism and compute the 3
outputs in parallel?

Next steps
Up to this point, we've only talked about space at a pretty high level. Next time, I'll cover
in more depth what's inside the FPGA, how space is measured, how much is available, and
how to find out exactly how much a particular function is using.

Continue to Part V: Inside the FPGA

Connect
Please follow me to stay up-to-date as I release new installments. There is also a Discord server (public chat platform) for any comments, questions, or discussion you might have at https://discord.gg/3sA7FHayGH

Code

Hardware-as-Code Examples

sathibault / hac-examples
Latest commit to the master branch on 3-5-2022

Credits

Scott Thibault

5 projects • 11 followers
Doctorate in programming languages + experience in FPGA, design automation,
embedded systems, and machine learning.
Contact
