Image Processing On System On Chip FPGA Devices Using LabVIEW
LICENSE THESIS
2018
FACULTY OF AUTOMATION AND COMPUTER SCIENCE
COMPUTER SCIENCE DEPARTMENT
Graduate: ________________________________
Supervisor: ________________________________
3.3.3. An image processing system for driver assistance ................................. 25
3.4. Existing Image Processing Implementations in Hardware and their Comparison ................. 25
3.4.1. FPGA-based implementations of image processing algorithms and systems ................. 25
3.4.2. Performance comparison of FPGA, GPU and CPU in image processing ................. 26
3.5. SoC Image Processing ................................................................................... 27
3.5.1. Image Processing Towards a System on Chip........................................ 27
3.5.2. A Survey of Systems-on-Chip Solutions for Smart Cameras................. 28
3.5.3. FPGA implementation of a license plate recognition SoC using automatically generated streaming accelerators ................. 28
3.6. Other usages of FPGA and SoC devices ....................................................... 28
3.6.1. SoC-FPGA implementation of the sparse fast Fourier transform algorithm ................. 29
3.6.2. A fully-digital real-time SoC FPGA based phase noise analyzer with cross-correlation ................. 29
3.7. Other bibliographical research ....................................................................... 29
Chapter 4. Analysis and Theoretical Foundation ............................... 31
4.1. Overall Architecture ...................................................................................... 31
4.1.1. High-level architecture ........................................................................... 31
4.1.2. System on Chip overview ....................................................................... 32
4.1.3. Offloading work to the FPGA ................................................................ 33
4.2. Image Acquisition .......................................................................................... 34
4.2.1. Acquisition device .................................................................................. 35
4.2.2. Image capturing ...................................................................................... 36
4.3. Image and Data Transfer ............................................................................... 37
4.3.1. Digital image representation ................................................................... 37
4.3.2. Data decomposition and streaming......................................................... 38
4.4. Processing ...................................................................................................... 39
4.5. Display ........................................................................................................... 40
4.6. Possible hardware configuration.................................................................... 40
4.6.1. SoC vendors ............................................................................................ 40
4.6.2. SoCs in academical embedded devices .................................................. 41
Chapter 5. Detailed Design and Implementation ................................ 43
5.1. Ecosystem and Development Environment ................................................... 43
5.1.1. Development environment – LabVIEW ................................................. 43
5.1.2. NI myRIO hardware and software specifications................................... 45
5.2. System Architecture....................................................................................... 46
5.2.1. The system as a LabVIEW project ......................................................... 47
5.2.2. „Main” VIs and top-level view ............................................................... 49
5.3. Image Acquisition .......................................................................................... 52
5.3.1. Camera session ....................................................................................... 52
5.3.2. Image initialization ................................................................................. 53
5.3.3. Image capturing ...................................................................................... 53
5.4. Image Transfer using DMA FIFO Channels ................................................. 54
5.4.1. Ways of transferring data between the FPGA and the host device ........ 54
5.4.2. DMA FIFO implementation ................................................................... 55
5.5. Image Processing on the FPGA ..................................................................... 57
5.5.1. General structure..................................................................................... 57
5.5.2. Storing the image in a local memory ...................................................... 58
5.5.3. Applying a convolution kernel ............................................................... 59
5.5.4. Synchronization ...................................................................................... 60
5.5.5. Improving the FPGA code and preliminary results ................................ 61
5.6. FPGA Resource summary ............................................................................. 63
Chapter 6. Testing and Validation ....................................................... 65
6.1. Technological Motivation .............................................................................. 65
6.2. System Performance ...................................................................................... 65
6.2.1. Different versions of the LabVIEW SoC implementation ..................... 65
6.2.2. Comparison with other implementations ................................................ 67
6.3. System Scalability ......................................................................................... 68
Chapter 7. User’s manual ...................................................................... 69
7.1. Requirements ................................................................................................. 69
7.1.1. Hardware ................................................................................................ 69
7.1.2. Software .................................................................................................. 69
7.2. User’s Manual ................................................................................................ 69
7.2.1. Setting up the development environment ............................................... 69
7.2.2. Building the LabVIEW project............................................................... 69
7.2.3. Deploying and running the project ......................................................... 70
7.2.4. Validating results .................................................................................... 70
Chapter 8. Conclusions .......................................................................... 71
8.1. Result Analysis and Achievements ............................................................... 71
8.2. Future Work ................................................................................................... 71
8.2.1. Using the AXI standard for inter-SoC communication .......................... 72
8.2.2. Interfacing the acquisition device directly with the FPGA .................... 72
Bibliography ........................................................................................... 73
Appendix 1 – Acknowledgements ......................................................... 77
Appendix 2 – Table of Figures .............................................................. 78
Appendix 3 – Source Code .................................................................... 80
Chapter 1
• Mean value and standard deviation of (the intensity levels of) an image
• Geometrical features of binary objects – area, center of mass, perimeter,
aspect ratio, etc.
• Histogram calculation
• Labelling or clustering an image
• Edge/border detection
In many cases these algorithms are used as part of a bigger, more complex
algorithm.
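As a simple illustration of the first item in the list above, the C sketch below computes the mean and standard deviation of the intensity levels of an 8-bit grayscale image. The 256 ∗ 256 resolution and the function name are assumptions made only for this example.

#include <math.h>
#include <stdint.h>

#define WIDTH  256
#define HEIGHT 256

/* Computes the mean and standard deviation of the intensity levels
 * of an 8-bit grayscale image stored row by row. */
void intensity_stats(const uint8_t img[HEIGHT][WIDTH],
                     double *mean, double *std_dev)
{
    double sum = 0.0, sum_sq = 0.0;
    const double n = (double)WIDTH * HEIGHT;

    for (int r = 0; r < HEIGHT; r++) {
        for (int c = 0; c < WIDTH; c++) {
            double v = (double)img[r][c];
            sum += v;
            sum_sq += v * v;
        }
    }
    *mean = sum / n;
    /* Var(X) = E[X^2] - E[X]^2 */
    *std_dev = sqrt(sum_sq / n - (*mean) * (*mean));
}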
algorithms – in our case this means using special hardware devices. These possible
hardware devices are presented in the next part.
Chapter 2
Note – Throughout the next chapters, we will extensively use the concept of
“real-time” image processing. There is no exact definition for what real-time means,
but it is generally accepted that a real-time image processing system should be able to
process around 30 images per second [8].
From Table 2.1 we can deduce that a balanced choice would be the usage of a
system on chip device, having both an FPGA and a microcontroller. The reasoning is
that we can separate the system into two different components: a smaller, time-critical
portion of the application can be developed on the FPGA, while the rest of the
application can be deployed to the microcontroller, which is much easier to program.
This way the flexibility is not as high as with a CPU-only solution, but we gain considerable performance improvements over choosing a CPU.
From now on, we will mostly concentrate on system on chip devices, however
we will still mention other types of hardware, especially in the Bibliographic Research
chapter. Also, in the Conclusions chapter, we will present a comparison of the same
algorithm implemented on different hardware.
1 Based on the author's own experience
Functional requirements
Functional requirements define WHAT our system must do. These can be
broken down into a set of steps that must be performed by the system to be able to
successfully process images. These requirements are enumerated in a logical order in
the following part. If any of these requirements is missing, the system cannot be considered an image processing system. The initial requirements are also specified in Figure 2.1 as a series of tasks that must be performed by the system.
2.2.1.3. Process the image using already known algorithms and generate a transformed image or relevant data
This is one of the most important steps of the system. All previous steps are "preparation" stages and all following steps are "finalization" stages. In this stage we already have the image stored in the FPGA in some form.
To process the image, we first need a well-known (and frequently used) image processing algorithm, so that we can easily compare our results – in terms of processing speed – with existing results. After selecting one or more representative algorithms, these must be implemented on the FPGA.
It depends on the algorithm whether the input image can be overwritten by the resulting image or whether it must be saved to a different location on the FPGA. A third option is to directly transfer the result as the image is processed – if this is the case, this step and the next step can be merged into one.
2 Abbreviation of microcontroller or μ-controller
Non-functional requirements
Non-functional requirements define HOW our system must behave while
performing the functional requirements. These are enumerated below:
2.2.2.1. Deployment
Once the system is ready to be deployed from a development computer (in the form of an executable or bitfile3), it should be easy to connect to the system on chip target device and start the application. This means that we should also be able to remotely start the execution of the SoC application, with minimal user interaction.
2.2.2.3. Speed/performance
We must not forget that our goal in experimenting with image processing algorithms on FPGA-based SoC devices is to increase the performance of embedded image processing systems. Therefore, one of the most important requirements is related to speed and performance.
We are mostly interested in the time it takes to perform the steps defined in the
Functional requirements part – i.e. to acquire, process and present the resulting image.
The execution time of this process will also define the frequency of the image processing application or, in our terms, the frames that can be processed in a second (FPS).
3 A stream of bits used to configure the FPGA
We will try to design, implement and optimize the system to reach high FPS values, comparable to today's processing frequencies, which are above 30 FPS [8].
Other objectives
Throughout the next chapters we will design and implement a system, keeping
in mind the functional and non-functional requirements. We will need to be able to
measure the performance of our system. We must also implement the same algorithms
on several different types of hardware to be able to compare our results.
We expect to achieve greater performance using our proposed solution than with already existing solutions. However, if these expectations are not met (while the requirements are still fulfilled), we do not consider our project a failure. The conclusion of our project in that case will simply be that it is not feasible to use system on chip devices with FPGAs for image processing. We will, however, try to avoid this result as much as possible.
Chapter 3
Computers are now much faster, and the field of computer vision has grown exponentially. Today we can easily find IP algorithms almost anywhere: in transportation, defense, social media, geography, space research, and the list could continue. In the following part we will exemplify some of these fields.
4 Widely used in hospitals for medical diagnostics and is considered safer than using X-rays
In the previous example 𝐻 is both additive and homogeneous. This can have
significant importance in the following chapters, when we discuss performance.
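For reference, a compact formal statement of these two properties (the operator $H$, images $f$ and $g$ and scalar $a$ are generic symbols used here for illustration): $H$ is additive if $H(f + g) = H(f) + H(g)$ and homogeneous if $H(a f) = a\,H(f)$; an operator satisfying both is linear.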
Some other mathematical operations that are presented are listed below:
• Arithmetic operations – e.g. addition, subtraction, multiplication or
division
• Set and logical operations – e.g. the difference of two images
• Logical operations – e.g. inverting (negating) a black and white image
• Spatial operations – applied on a single or several neighboring pixels, or
on the entire image at once, e.g. kernel5 or transformation operations
5 To be detailed/explained in the following chapters
3.2.1.1. Linearity
In section 3.1.6, we have identified several mathematical operations that can be
used to process images. We have seen that most algorithms are linear, having a
complexity of 𝑂(𝑛), with a small constant factor. These algorithms are usually a good
choice to be parallelized or to be moved to more performant hardware, because linear
algorithms usually scale well, resulting in a good speedup.
If the algorithms are more complex, also having higher computational complexity, it will be hard to scale them. Such an algorithm, having for example polynomial or exponential complexity (e.g. $O(n^2)$ or $O(e^n)$, where $n$ is directly proportional to the resolution of the image), might not even fit on the FPGA, because of the limitations imposed by the hardware.
Based on these considerations, we will implement linear algorithms with a complexity of $O(n)$ and a low constant factor.
6 From this book, at least
If more steps are performed on the same image, we might even have to keep two copies of the image. This is very often the case with motion detection algorithms or algorithms that require two consecutive images captured from an acquisition device. Using these algorithms will result in increased memory usage.
Based on the last three sections, we will select one or several representative
algorithms that we will describe and finally implement in the Detailed Design and
Implementation chapter.
The algorithm and the pseudocode are specified in the next parts, as well as
properties and examples of the algorithm.
The above formula is applied to each pixel of the input image, except for the border of the image; therefore, applying this algorithm implies "scanning" the image, as also presented in Figure 3.4.
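To make this scanning explicit, the following C sketch applies a generic 3 ∗ 3 kernel to every non-border pixel of an input image. The image size, the divisor used for normalization and the function name are assumptions made only for this illustration.

#include <stdint.h>

#define ROWS 256
#define COLS 256

/* Applies a 3x3 convolution kernel to every non-border pixel of `in`
 * and writes the result to `out`. `divisor` normalizes the kernel sum
 * (e.g. 16 for the Gaussian kernel). */
void convolve3x3(const uint8_t in[ROWS][COLS], uint8_t out[ROWS][COLS],
                 const int k[3][3], int divisor)
{
    for (int r = 1; r < ROWS - 1; r++) {
        for (int c = 1; c < COLS - 1; c++) {
            int acc = 0;
            for (int i = -1; i <= 1; i++)
                for (int j = -1; j <= 1; j++)
                    acc += k[i + 1][j + 1] * in[r + i][c + j];
            acc /= divisor;
            /* saturate the result to the 8-bit range */
            if (acc < 0)   acc = 0;
            if (acc > 255) acc = 255;
            out[r][c] = (uint8_t)acc;
        }
    }
}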
Because each pixel from the input image influences only a small number of
output pixels (9 in the case of our convolution kernels), we can implement an “in-
memory” image processing algorithm. That means that we do not have to make a copy
of the original image to generate the result image. Instead, we can use the same memory
location, thus we overwrite the input image with the output image.
3.2.2.3. Examples
Figure 3.6 shows the result of applying a Gaussian kernel and a Sobel kernel on
a color input image.
Figure 3.6 Example of applying the Sobel filters (2nd image) and the Gaussian
blur (3rd image) on a color image (1st image), from [11]
The kernels are defined as follows:
Gaussian: $\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$
Vertical Sobel filter: $\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$, Horizontal Sobel filter: $\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$
extremely difficult and requires a large skillset from the developer. To program an
FPGA, usually hardware description languages (HDL) are used, such as VHDL7 or
Verilog. Defining the personality of the FPGAs using these languages can be hard and
very complex. Therefore, HLS8 tools are used, which provide a layer of abstraction between the low-level HDL and the abstract model of the system. Despite the effort invested in these tools, they are still not performant enough and do not generate sufficiently good HDL code.
In response to these limitations, authors of the mentioned paper used CAPH to
describe the system that shall be implemented on the FPGA. The CAPH code is
compiled into highly optimized VHDL code, which is then synthesized and finally
programmed on the FPGA.
The authors exemplified the usage of this language on the "implementation of a real-time image processing application on an FPGA embedded in a smart camera architecture" [12]. In conclusion, this approach is well suited for architectures such as smart cameras.
7 Very High Speed Integrated Circuit Hardware Description Language
8 High-Level Synthesis
synthesis tools and creating intellectual property cores (IP9) that implement specific
algorithms, such as the AdaBoost10 classifier algorithm.
9 Not to be confused with the abbreviation of image processing!
10 Adaptive Boosting, as defined by Wikipedia, is a machine learning meta-algorithm mostly used in image processing classification
show how these algorithms are well suited for FPGA. The paper also states that good
results can still be achieved after increasing the image size, if the memory constraints
of the device are met.
Paper [18] specifies that the integrated chips most suited for image processing are ASICs, DSP (Digital Signal Processor) chips and FPGAs. In this paper, an FPGA-based application designed for image preprocessing is presented. The authors proposed and implemented a fast median filtering algorithm on an FPGA, which resulted in reduced cost and higher performance than a similar implementation on conventional hardware. Results show that this approach can also be used for real-time image processing.
An example of an FPGA-based embedded vision system is presented in [19]. The authors stress the major parallelism implemented in the system and give examples of algorithms that benefit from this hardware. The chosen hardware solution is both fast and cost-effective. The authors could reach a processing frequency of over 100 FPS, compared to the 50 FPS of the same algorithm implemented on a serial processor. In the Future Work section, it is also proposed to use "System-on-a-Programmable-Chip (SOPC)" technology – we simply call this system on chip in our book.
The authors in [20] combine the already presented FPGA-based approaches with digital signal processing (DSP) hardware to achieve a highly parallel and reconfigurable system intended for fast computer vision applications. A host-
independent architecture was designed which allows “dealing with high-level real-time
image processing routines”.
Figure 3.7 Performance of the k-means clustering algorithm, from [21] (Fig. 8.
of the original paper)
11 Complementary metal–oxide–semiconductor, a technology for constructing integrated circuits
12 Charge-coupled device
systems, its quality is not as good as what the CCD offers. Therefore, images are noisier, have less contrast and are blurrier.
To address these issues and also provide high performance, the paper proposes to process (or preprocess) the images close to the acquisition device. This is done by
using a “retina” – a small computer vision system that mainly focuses on sensing and
initial processing on the same small device.
The authors implement several filtering algorithms (e.g. smoothing or the Sobel filter) on the retina, which is basically a system on chip device with an integrated camera module. The success of this project shows how well-suited image processing algorithms are for on-chip processing.
Chapter 4
High-level architecture
Most image processing systems are similar to data acquisition (DAQ) and control systems, at least from a high-level view. These can be characterized as having three distinct procedures: acquisition, processing and control. We can adapt this structure to our needs, as described in Figure 4.1 – most of the systems presented in Chapter 3 also implement this structure in some way.
Figure 4.1 – Image acquisition → Image processing → Display
13 Unified Modeling Language (http://www.uml.org/)
The first step is acquiring the image and the last one is displaying it. These might
not seem important, especially displaying the image, however because we would like
to visualize the result of our system from real-time data, we need these steps too. The
analysis of these steps, as well as the image processing part, will be detailed in the
following sections.
Because our focus will be on the "Image Processing" part of the diagram, which in fact will be implemented on the system on chip device, it is necessary to break down the design into smaller logical components that we will be able to map to specific parts of our hardware. Figure 4.2 describes the logical components of the system, by splitting the second part of the diagram.
Figure 4.2 – Image acquisition → µC / FPGA (image processing) → Display
14 FPGAs are often referred to as programmable logic devices
define the way the processor interacts with the FPGA. We can free the processor from
the responsibility of processing the image, by executing the processing part on the
FPGA. Unfortunately, this induces overhead, because the images must be transferred
between the two devices even if physically they are on the same chip. This behavior is
visualized in Figure 4.4.
Figure 4.4 – Read image → Transfer → Apply IP algorithm → Transfer → Write image
The following parts will explain the operating principles of the different
components described here, starting from acquisition and data transfer, as well as the
image processing part and finally the visualization of the image.
Acquisition device
In a more realistic embedded device – e.g. one used in an ECU15 of an
autonomous vehicle – the acquisition device would be directly connected to the FPGA.
This would not require the images to be transferred from the processor to the FPGA,
instead the FPGA would directly access the image. This would eliminate much of the
communication overhead and the latency of the application would be lower. Smart
cameras are an example for this behavior, where the acquisition device is much closer
to the processor – or they are on the same physical unit.
Unfortunately, we cannot use such industrial, high-performance cameras. The first reason is that such a camera is not available for this project; the second is that even if we had a camera intended for embedded image processing, we would have to implement an interface to it, which is beyond the scope of our project. Instead, we will look for alternatives that may or may not generate extra overhead, but which are accessible to us. These are discussed in the next paragraphs.
We have identified two relatively simple methods of reproducing a high-
performance embedded camera:
• USB webcam
• Persistent storage device
The possible usage of these devices is detailed below.
15 Electronic Control Unit – mostly used in vehicles
operating system, this storage can be accessed over the operating system’s file system
– this storage is also where the program and user data files are stored.
Instead of using a pluggable camera, we can gather images from a different source and save them to the device's permanent storage. When the image processing application is executed, we can load the images into memory by simply reading the contents of a file. This is much simpler than using a camera, because all the functionality of opening the file and reading its contents into main memory is handled by the operating system (if one exists).
The performance of this solution would be much better than the previous one, because reading from a file is considered to be much faster than reading from a peripheral device, even if both are considered input-output devices.
The only downside of using this approach is that we cannot test our system with
live data. Instead, pre-defined images will be used (we can reuse the same image several
times).
Image capturing
In section 4.2.1 we defined several ways that we can use to acquire images.
There is still a necessity to also explicitly define what “capturing” an image represents.
We can take the example of a webcam that acquires several images every
second. These images will not be automatically processed. In fact, they will not even
necessarily be available to our image processing system. Most webcams stream images
at a given rate and always keep the most recent image in a buffer. We must keep in
mind that only the most recent image is kept in the buffer and the rest of the (previous)
images are discarded.
If we want to process an image, we must first capture it, i.e. copy it from the buffer to a location in main memory that is accessible from the image processing application. Thus, capturing is defined as the action of accessing one particular image to be processed from the stream of incoming images.
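The C sketch below illustrates this idea of snapping the most recent frame from a single-slot buffer. The buffer layout, the function names and the use of a mutex are assumptions made for the sake of the example and do not describe the actual driver of any particular webcam.

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define IMG_BYTES (256 * 256)

/* Single-slot buffer that always holds the most recent frame. */
static uint8_t latest_frame[IMG_BYTES];
static pthread_mutex_t frame_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by the acquisition driver ~30 times per second:
 * the previous frame is simply overwritten (discarded). */
void driver_push_frame(const uint8_t *frame)
{
    pthread_mutex_lock(&frame_lock);
    memcpy(latest_frame, frame, IMG_BYTES);
    pthread_mutex_unlock(&frame_lock);
}

/* "Capturing" (snap): copy the most recent frame into memory owned
 * by the image processing application. */
void snap(uint8_t *dest)
{
    pthread_mutex_lock(&frame_lock);
    memcpy(dest, latest_frame, IMG_BYTES);
    pthread_mutex_unlock(&frame_lock);
}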
we want the input image to “freeze”. All future iterations of the algorithm will use the
last image that was captured before switching to offline mode.
To be able to start the application in offline mode, we must make sure that an
image is already available in the memory – this can be done by executing a “fake”
capturing operation before the application starts.
the acquisition device or replacing it is usually not possible, so we have to use the
images that are given to us.
However, in the next parts of the system, we might want to use several types of
image processing algorithms. These might require other types of images. We do not
want to restrict our system to only be able to use the image format provided by the input
devices, so we might have to first apply a simple transformation to the captured images.
We can suppose that all images are represented in memory as a matrix, having
the size (𝑀, 𝑁), which defines the resolution – see Chapter 1.2 for more details. For
color images, each pixel is represented by three values, for grayscale images, however
only one value is required. Each such (pixel) value can also be represented with
different precision – i.e. each value can be represented by one or several bytes in
memory.
We can already see that there are several parameters used to define the "type" of the image. These can all be different for the capturing and the image processing part. To summarize, these parameters are enumerated below:
• Resolution, e.g. 256 ∗ 256
• Samples/pixel – number of values required to represent one pixel, e.g.
1 sample for grayscale and 3 samples for color images
• Sample depth (bitness) – size of one sample, e.g. 8 bits/sample or 32
bits/sample, for very high-quality images
In the ideal case, these parameters are the same for the acquisition device and for the image processing algorithm. For our research project, we do not have the resources to choose between several acquisition devices, therefore the parameters of the captured image are fixed. To solve this mismatch, we have to transform the images, for example by changing the resolution (scale operation) or changing the representation from color to grayscale.
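As an example of such a transformation, the C sketch below converts an interleaved color image (3 samples/pixel) to a grayscale image (1 sample/pixel) using the common luma approximation Y ≈ 0.299R + 0.587G + 0.114B; these weights are a standard approximation and are not prescribed anywhere in this work.

#include <stdint.h>

/* Converts an interleaved RGB image (3 samples/pixel, 8 bits/sample)
 * to a single-sample grayscale image. */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, int width, int height)
{
    for (int i = 0; i < width * height; i++) {
        const uint8_t r = rgb[3 * i + 0];
        const uint8_t g = rgb[3 * i + 1];
        const uint8_t b = rgb[3 * i + 2];
        /* integer approximation of the luma weights, scaled by 256 */
        gray[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
    }
}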
These operations can introduce significant delays and they can decrease the
performance of the overall system. Unfortunately, we do not have a proper workaround
for this issue. The only thing that we can do is to individually measure the execution
time of these operations and subtract them from the overall execution time. This way
we can estimate the performance of a system in which the type of the captured image
and the processed image match.
16 First In, First Out
To read and write data from and to a channel or bus, buffers have to be allocated
on both sides of the links. This way the processor (or FPGA) knows from where to send
the outgoing data or where to write the incoming data.
In some cases, we can directly use the main memory instead of a dedicated
buffer. This is called direct memory access, or DMA. When DMA is available, it can
become easier to access large amounts of data. Also, in some cases, the memory used
for DMA transfer can be a shared memory (shared between the FPGA and processor),
which increases performance even more.
Links between the two components are usually serial, so we have to send the
image as a stream of data. If more than one physical link is available to send the image,
we might consider using several links in the same direction, by transferring the image
in parallel. This could reduce the transfer time, but the image will have to be
decomposed – and at the receiving end it will have to be reconstructed. This also implies
extra synchronization but could still improve the overall algorithm.
Fortunately, decomposing an image is fairly simple, because it is easy to split a matrix into several equal parts. For example, we can split an image into two parts by sending the first $M/2$ rows over the first link and the remaining rows over the second link ($M$ represents the number of rows). We can also decompose by columns, by sub-matrices or in other, more sophisticated ways.
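A minimal C sketch of this row-based decomposition follows. The link API is hypothetical (here only a placeholder stub) and merely stands in for whatever transfer primitive is available; the two transfers can actually overlap only if that primitive is asynchronous.

#include <stdint.h>

#define M 256   /* number of rows    */
#define N 256   /* number of columns */

/* Hypothetical transfer primitive: sends `count` bytes over link `id`.
 * Placeholder stub so the sketch compiles; the real implementation would
 * enqueue the data on a physical link or DMA channel. */
static void send_over_link(int id, const uint8_t *data, int count)
{
    (void)id; (void)data; (void)count;
}

/* Decomposes the image by rows: the first M/2 rows go over link 0,
 * the remaining rows over link 1. */
void send_image_two_links(const uint8_t img[M][N])
{
    const int half = M / 2;
    send_over_link(0, &img[0][0],    half * N);
    send_over_link(1, &img[half][0], (M - half) * N);
}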
Sending data is usually done with the aid of a communication protocol. Most protocols, however, add supplementary control data, which generates communication overhead. To reduce this overhead, it is recommended to send large chunks of data at once (instead of sending an image pixel-by-pixel). We must also pay attention not to send too large amounts of data at once, because of the limitations of the communication channels used, or because the memory buffers might overflow.
4.4. Processing
So far, we have prepared mostly every aspect that we need to finally be able to
process the images. In this stage the image is in the specified format and is already
accessible by the FPGA.
The algorithm that we will use is the convolution filter that was already presented in section 3.2.2. The kernel used for the convolution should be parametrized, i.e. we should be able to easily change it if we want to test several kernels.
Initially we want to use the Gaussian blur, but several other kernels can also be used. In fact, the kernel type used is not relevant at all – it should just be easy to verify the correctness of the algorithm. For example, after applying the Gaussian kernel, the output image should be smoother and blurrier. Also, in some cases we will have to apply the division operator – this will be detailed in the implementation part.
The presented examples all use 3 ∗ 3 kernels. We will also start the development
using this size. However, once we have a stable product (that possibly meets the
objectives), we will start gradually increasing the kernel size. This is required, because
we want to measure the scalability of the system. Applying a larger kernel is also
important, because we have to test the system in computationally more intensive
scenarios. Another reason why it is important to try to use large kernels is that it might not be worth using an FPGA for a really small kernel: the communication overhead would be too high compared to the speedup of the image processing part itself.
4.5. Display
The embedded image processing device that we are reproducing would
normally not be able to present images in a visual way. Therefore, displaying the
resulting image of the algorithm (or displaying a computed value that was generated by
the image processing algorithm) has only debugging and validation purposes. We
would like to be able to manually verify the correctness of the system: for example, if
we apply a smoothing kernel in a convolution filter algorithm, we expect the resulting
image to be “smoother”.
Adding a visual representation of the output can greatly aid the development
process and it is also much easier to demo the application.
Displaying the image does not have to be done on the system on chip device itself. In fact, this is usually not even possible, because we would require special display devices. Instead, we can send the result of the algorithm over a network to a different device (e.g. a general-purpose PC) and display the image there.
We should keep in mind that displaying the image should not have a major
impact on our system. If, however, displaying the image generates a large overhead, we
should be able to turn off this feature when measuring the performance of the system.
SoC vendors
Xilinx offers three types of SoCs, depending on the customer’s needs and the
complexity of the application. These categories are: Cost-optimized, Mid-range and
High-end. The cost-optimized category, represented by the Zynq-7000 SoC device,
mainly targets the education and research industry and also users who need a fast
prototyping board that is within budget and speeds up the development process (and
time to market). These boards are shipped with single- or dual-core ARM processors.
Xilinx also offers a well-known development tool for programming the FPGA of the
SoC, called Vivado. For programming the software-defined part, usually Eclipse is used
as a free and open-source IDE19 [31].
Since the merger with Altera, Intel has also released several SoC devices (such as the Intel Stratix® 10 SoC, exemplified in Figure 4.5). These are less known in our academic environment but can be an interesting alternative to the Zynq chips. Intel also offers a variety of development tools compatible with their devices, but we are not going to detail them here [32].
17 From Wikipedia: Microsemi Corporation was a California-based provider of semiconductor and system solutions for aerospace & defense, communications, data center and industrial markets
18 From Wikipedia: ON Semiconductor is a global semiconductor supplier company and was ranked among the 500 largest US corporations by total revenue for their respective fiscal years
19 Integrated Development Environment
20 Small, integrated, single-board computers
21 High-Definition Multimedia Interface
22 Video Graphics Array
23 Light-emitting diodes
Chapter 5
Figure 5.2 Snippet of a VI’s block diagram that computes 𝑓(𝑥) = 𝑔𝑎𝑖𝑛 ∗ 𝑥 + 𝑜𝑓𝑓𝑠𝑒𝑡,
on each element of an array (x)
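For readers less familiar with graphical code, a rough textual C equivalent of the data flow in Figure 5.2 would be the following; the exact data types used in the VI are assumed here.

/* Rough C equivalent of the VI in Figure 5.2: apply
 * f(x) = gain * x + offset to each element of an array. */
void scale_and_offset(const double *x, double *y, int n,
                      double gain, double offset)
{
    for (int i = 0; i < n; i++)
        y[i] = gain * x[i] + offset;
}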
One of the great advantages of using LabVIEW is that the same VI can be used
to specify software functionality and to specify the behavior of the FPGA (with some
constraints and limitations). The above VI can be ran on different computers, having
different operating systems, if these support the LabVIEW runtime engine. On the other
hand, if this VI is used for FPGA development, its contents will be first translated to
corresponding VHDL code and will be automatically synthesized for the specified
FPGA hardware (using the same tools that we would use in Vivado) – therefore,
LabVIEW can also be used as a High-Level Synthesis tool for FPGA development.
The following part presents the myRIO in detail and introduces the LabVIEW
concepts that will be used throughout the development of the system – in general,
24 Laboratory Virtual Instrument Engineering Workbench
LabVIEW will be presented in some detail; however, the reader is expected to have a basic understanding of graphical data-flow code.
5.1.2.1. HW
As specified in the NI myRIO-1900 user guide and specification [34], the
myRIO-1900 is a “portable reconfigurable I/O (RIO) device that students can use to
design control, robotics, and mechatronics systems”. The hardware block diagram is
specified in Figure 5.3. We can see the clear separation between the processor and the
FPGA, even if these are on the same chip. The embedded device also includes several
peripherals, such as buttons (including a reset button), USB host and device ports,
LEDs, DD325 and nonvolatile memories, as well as a variety of I/O ports.
The USB host port supports most web cameras that are UVC compliant (USB
Video Device Class protocol), as well as machine vision cameras that conform to the
USB3 Vision standard and are USB 2.0 backward compatible [34]. We will use the
USB host port to connect a webcam for acquiring images.
The SoC used in the myRIO comes equipped with a dual-core 32-bit ARM processor with a maximal frequency of 667 MHz. The device has 256 MB of RAM and 512 MB of nonvolatile memory. Both USB ports follow the USB 2.0 Hi-Speed specification. The device has to be powered from an external DC voltage source of between 6 and 16 V.
25 Double data rate type three
5.1.2.2. SW
The most important aspect of the myRIO software environment is that it has a
Linux-based real-time operating system developed by National Instruments, called “NI
Linux Real-Time” or RTOS. Therefore, we have access to a large Linux ecosystem and
also real-time functionality – this means that we can write deterministic code with strict
timing constraints. The RTOS is fully compatible with the NI ecosystem, so we do not
have to worry about compatibility issues when developing the application.
However, some software components have to be installed in addition to the default configuration; these are all provided by NI. We will include a full list of the required hardware and software stack in the User's manual chapter.
and the host (the development environment) is done over a network using the IP
protocol.
As you can see from Figure 5.4, we have specified two myRIO targets, out of
which the second one is a simulated one. In the following part we will detail why this
is very important in the development process.
computers. The compile worker uses the Xilinx compilation tools, which are installed
locally (the compilation tool includes Vivado).
The steps performed to generate the bitfile from the LabVIEW FPGA VI are
taken from [35] and are also shown below (note that “compile worker” refers to the
Vivado application that was installed with the Xilinx compilation tools):
1. Generation of intermediate files – LabVIEW converts the FPGA VI
into intermediate files (HDL code) to send to the compile server;
2. Queuing – The compile server queues jobs and sends the intermediate
files to the compile worker for compiling;
3. HDL compilation, analysis, and synthesis – The compile worker
transforms intermediate files (HDL code) into digital logic elements.
4. Mapping – The compile worker divides the application logic between
the physical building blocks on the FPGA;
5. Placing and routing – The compile worker assigns the logic to physical
building blocks on the FPGA and routes the connections between the
logic blocks to meet the space or timing constraints of the compilation;
6. Generating programming file – The compile worker creates binary
data that LabVIEW saves inside a bitfile;
7. Creating bitfile – LabVIEW saves the bitfile in a subdirectory of the
project directory and can download and/or run the application on the
FPGA VI.
As one can probably imagine, performing the steps mentioned above can be a very long process requiring large amounts of memory. In early stages of our development, several compilations failed due to insufficient memory, extremely long compile times (several days) or because timing and resource constraints on the FPGA were not met. In later stages of development, most of our VIs were optimized, allowing compilation times below 20 minutes.
Once the steps needed to compile an FPGA VI are successfully completed, the
bitfile can be deployed on the target device. The VIs that are going to be executed by
the myRIO's processor must also be deployed. Therefore, we need to connect the host computer (the development PC that contains the LabVIEW project and the compiled application) to the myRIO via a USB cable. When both devices are configured properly, a LAN26 is created over the USB connection and IP addresses are assigned to the host and target devices. We can then open a connection to the target by specifying the IP address of the target device in the LabVIEW project. Once the connection is made, VIs, bitfiles and other deployment items can be transferred from the host to the target.
C/C++ application or the “public static void Main()” method in the C# language. This
VI must be placed under the specific target in the LabVIEW project – as a reminder,
we will only work with VIs in the context of a project.
The first method is to simply run the VI, as we would run it under "My Computer". The deployment (and compilation, if needed) will start shortly, and once all deployment items are transferred, the main VI is executed remotely on the target. The contents of the front panel will, however, still be updated on the host by an automatic mechanism that polls the target device to acquire the latest front panel values. This induces some communication overhead for the target device, but it is unnoticeable for front panels that contain small amounts of data. In this mode, it is also possible to remotely debug the block diagram or its sub-VIs (obviously at the cost of some performance degradation).
The second choice for starting up the system is to create a new "Real-Time Application" build specification in the project and set the main VI as the build specification's startup VI. As its name suggests, a build specification can be built, resulting in a folder that contains all the compiled items, dependencies and deployment items needed by the application. Therefore, in this mode everything is "precompiled", which saves some time. On the other hand, starting the VI is somewhat less intuitive, because we have to set the build specification as the default startup item for our target device. Once the device is restarted, it will automatically start executing the main VI.
We will mostly use the first approach because it implicitly lets us visualize the
front panel of the VI, which helps us in debugging and also lets us manually (visually)
verify the correctness of our image processing algorithms, by displaying the contents
of the processed image. We also created a build specification, but this is mostly intended for a "releasable" product and does not suit the requirements of a research and development project. If our solution were to be offered to the market, creating a real-time executable or shared object would probably be the most appropriate way.
Figure 5.6 Template VI for myRIO development using custom FPGA personality
We can notice that the previous example is very similar to a Data Acquisition
(DAQ) and Control application, where we first initialize the system and then
continuously read, process and write data (in a while loop). In the following part, we
will present the three main parts from Figure 5.6.
A. Initialization
In the initialization part, we introduced a Conditional Disable Structure, that is
similar to preprocessor directives in C/C++. The structure has two different behaviors
(implemented in two different subdiagrams): when executed on the host development
PC (having a Windows OS), we open an FPGA reference to a VI that is placed under
the Simulated FPGA target – this allows us to automatically execute the application in
a simulated FPGA environment on the host computer; when the Conditional Disable
Structure is executed on the target device, having a Linux operating system, we load a
reference in the default way, to the actual FPGA, so we have a real, production
environment. The condition of the structure that determines which subdiagram to be
executed is a string that is interpreted before compiling the VI and is written below:
• “OS==Linux” – when evaluated to true, we open a reference to the real
FPGA
• “OS==Win” – when evaluated to true, we will simulate the behavior of
the FPGA
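A loose C analogy of this compile-time selection is sketched below; the macro and function names are purely illustrative and are not defined by NI.

#include <stdio.h>

/* Loose C analogy of the Conditional Disable Structure: the branch is
 * selected at compile time, so only one subdiagram ends up in the build. */
static void open_fpga_reference(void)
{
#ifdef OS_LINUX
    /* deployed on the myRIO target: use a reference to the real FPGA */
    printf("Opening reference to the real FPGA\n");
#else
    /* running on the Windows development PC: simulate the FPGA target */
    printf("Opening reference to the simulated FPGA target\n");
#endif
}

int main(void)
{
    open_fpga_reference();
    return 0;
}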
The initialization section will also be responsible for opening a connection to
the image acquisition device, as well as to create any other references or executing setup
instructions – these will be presented in detail when we start detailing the different
components of the system, in the next subchapters.
B. Processing
The data processing part is responsible for most of the work done by the
application. In this part we continuously execute a loop that acquires images, transfers
them to the FPGA and then transfers the resulting image (or image attribute) back to
the processor. These are the main responsibilities, which are also visible from the
previous figure and Figure 5.7. These tasks will be presented in more detail in the
following subchapters.
Besides the main responsibilities, there are several other tasks that must be
performed in the main loop. We have to update the indicators that present the acquired
input image and the resulting output. Because the data that populates these indicators
is coming from the target device and we want to display them on the host, a large
amount of data has to be transferred between the target and host over the LAN. Although the provided USB link is capable of transferring hundreds of Mb of information per second, the latency and computational overhead on both devices are significant. Therefore, we placed a boolean control on the front panel, which lets the user deactivate the display of the images on the front panel.
To measure the performance of the application, we compute the elapsed time between two iterations of the processing loop. This is done by reading a millisecond counter: we subtract the counter value read in the previous iteration from the value read in the current iteration. The difference represents the elapsed time between the iterations (in milliseconds), also known as the execution time of one iteration (including all the communication and additional overhead). To measure the frame rate, or FPS, of the application, we compute the inverse of the elapsed time:
$FPS = \dfrac{1000}{current\ time - previous\ time}\ [Hz]$
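The same computation, written as a small C sketch (the counter values are assumed to come from a millisecond tick counter read once per iteration):

#include <stdint.h>

/* Returns the FPS for the current iteration, given the millisecond
 * counter values read in the previous and the current iteration. */
double compute_fps(uint32_t previous_ms, uint32_t current_ms)
{
    uint32_t elapsed_ms = current_ms - previous_ms;
    if (elapsed_ms == 0)
        return 0.0;               /* avoid division by zero */
    return 1000.0 / (double)elapsed_ms;
}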
C. Finalization
The finalization phase is the simplest and has the responsibility of closing any references that were previously created. Skipping this phase could put the myRIO in a faulty state – even if the currently executing VI is closed after finishing execution, the LabVIEW process is not closed; therefore, unwanted items can remain in the process's memory. In this phase we also display any errors that occurred during execution.
Camera session
We use the NI Vision Acquisition Software to create a session to the camera
and enable capturing images from it. These steps are presented in Figure 5.8.
this environment. In the real scenario, the webcam connected to the myRIO board will be used.
We are using a Logitech QuickCam Pro 5000, which provides color images with a resolution of 640 ∗ 480 pixels at a frequency of approximately 30 FPS.
Image initialization
We will have to declare and initialize the input and output images in LabVIEW.
This is presented below, in Figure 5.9.
Figure 5.9 Declare and initialize the input and output images
We will place the captured data from the webcam into the input image, while
the output image will contain the result of the processed image. This way we can display
both images, so that it is possible to visually compare the initial image with its
transformation.
Both subVIs in the figure allocate an 8-bit grayscale image with a resolution of 256 ∗ 256 pixels. We have chosen the resolution to be a power of two so that memory addressing is easier, however other values are also acceptable. To enable execution without a camera, the input image can be populated with data from the myRIO's permanent storage. This can be very helpful when measuring performance, because we can disable the data acquisition part, which generates a large overhead (one that would not be present in the case of a camera embedded in the myRIO).
Image capturing
We have placed a boolean control on the front panel of the main VI, called
“Live". When this boolean is set to true, we acquire the most current frame that the
camera session provides – this is called the Snap operation.
We can recall from the previous parts that the camera provides a new image 30 times a second, that is, roughly every 33 milliseconds. If less than 33 ms has elapsed between iterations, then most probably we will snap the same image consecutively. On the other hand, if the frequency of the main loop is lower than the frequency of the acquisition device (30 Hz or 30 FPS), then some images provided by the acquisition device might be lost/skipped. This is not an issue, but it is good to keep these concepts in mind.
Ways of transferring data between the FPGA and the host device
National Instruments defines three ways of transferring data between the FPGA and the host device – in our case the myRIO FPGA and the myRIO LabVIEW application process. According to [36], these are the following:
• Programmatic Front Panel Communication
• Direct Memory Access (DMA)
• User-Defined I/O Variables
Using the front panel can work for small sets of data and has a low call overhead,
however at the cost of higher CPU usage. This method is mostly used to pass
configuration data, report status from the FPGA or transfer single-point data. It is
however not recommended to pass large amounts of data because of the low throughput.
Another downside of this approach is that the user has to implement a synchronization
mechanism – e.g. to pause execution if data is not yet available or resume it when data
transfer can be initiated.
For transferring large sets of data, it is recommended to use DMA. This approach has a much higher throughput and also a lower call overhead. Another advantage is the built-in synchronization mechanism. The DMA communication method is based on a FIFO: two buffers have to be allocated, one on each endpoint of the transfer. Sending data from device A to B means that we read the content of A's buffer and place it in the DMA FIFO channel. The NI DMA Engine will place the data in B's buffer, which can then be read by B. An example of such a communication is provided in Figure 5.11, where we transfer data from the FPGA to the host.
Figure 5.11 Illustration of DMA FIFO transfer from the FPGA to the host, from [37]
Using User-Defined I/O Variables is similar to the first option but has a lower host CPU usage and provides automatic synchronization. The throughput of this method is, however, much lower than that of the FIFO-based method.
It becomes obvious at this point that the best way to transfer the image arrays between the two components of the SoC is to use Direct Memory Access.
The common configuration options are the element size, which is set to 1 byte, representing an unsigned character (an 8-bit grayscale value), and the number of elements to be read or written at a time, which is set to 1. The differences will be detailed separately in the next part.
Secondly, we have to allocate a much larger buffer on the host side. This is required because the FPGA writes to the FIFO much faster than the host can read from it. As a result, we have allocated buffers large enough to hold an entire image (32,767 elements per buffer). The large buffer does not affect the performance of the host, because the myRIO's microcontroller has a much larger memory capacity than the FPGA. This approach is also faster, because the FPGA can send a processed image even if the host is not yet ready to receive it.
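A C sketch of the host-side read pattern is shown below. The FIFO read function is hypothetical (here only a placeholder stub) and merely stands in for the LabVIEW FIFO Read call; the chunk size is an arbitrary example value chosen to avoid per-pixel call overhead.

#include <stdint.h>
#include <string.h>

#define IMG_PIXELS (256 * 256)
#define CHUNK       4096          /* elements requested per read call */

/* Hypothetical stand-in for the host-side DMA FIFO read: copies `count`
 * bytes into `dest`. Placeholder stub so the sketch compiles. */
static void read_dma_fifo(uint8_t *dest, int count)
{
    memset(dest, 0, (size_t)count);
}

/* Reassembles one processed image from the FPGA-to-host FIFO by reading
 * it in chunks instead of pixel by pixel. */
void receive_image(uint8_t *image)
{
    int received = 0;
    while (received < IMG_PIXELS) {
        int n = IMG_PIXELS - received;
        if (n > CHUNK)
            n = CHUNK;
        read_dma_fifo(image + received, n);
        received += n;
    }
}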
General structure
As with most signal processing applications, we can structure the FPGA image processing into three parts: acquisition, processing and acting/producing an output. In section 5.4 we have already defined how to read and write the image. All we have to do is specify how we store the image locally and how the IP algorithm is implemented.
To increase the performance of the FPGA, we split the three identified parts in
three independent loops. This allows LabVIEW to generate VHDL code that is more
performant. Figure 5.13 shows the independent loops that can be executed in parallel.
The figure is just a template – in the actual implementation the “#TODO” comments
are replaced with actual LabVIEW code.
but a large amount of important FPGA resources would be lost. Therefore, we will use
the Block RAM implementation.
The BRAM will be configured with dual-port read access, so that two values can be read in the same clock cycle. To increase performance even more, we set the read latency to the maximal value of 3 cycles. This means that the BRAM access is internally pipelined, and several clock cycles are needed to read a value from the memory. This introduces a minor delay but also allows higher clock rates. Because it is possible to read and write the memory in the same clock cycle, we also specify that arbitration should always happen when multiple writers try to access the memory, or when several readers try to read from it. This can introduce a minor degradation in execution speed, however it is required for the correctness of the algorithms.
If an algorithm does not require the whole image to be in the memory at the
same time, we don’t even need to save the image to a local memory. To exemplify this,
we can calculate the mean of the pixel intensity values while the image is still being
received from the incoming FIFO channel. We consider that most real-world scenarios
of image processing algorithms do not have this advantage, so we will not consider this
improvement – even if it could be applied to the convolution filter in some way.
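As a sketch of this streaming computation, the following C code accumulates the mean intensity as the pixels arrive; the FIFO read function is hypothetical (a placeholder stub standing in for the incoming DMA FIFO channel).

#include <stdint.h>

#define IMG_PIXELS (256 * 256)

/* Placeholder – the real implementation would read one element from the
 * incoming DMA FIFO channel. */
static uint8_t read_fifo_element(void) { return 128; }

/* Computes the mean intensity while the image is streamed in,
 * without ever storing the image in a local memory. */
double streaming_mean(void)
{
    uint32_t sum = 0;             /* 255 * 65536 still fits in 32 bits */
    for (int i = 0; i < IMG_PIXELS; i++)
        sum += read_fifo_element();
    return (double)sum / IMG_PIXELS;
}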
we scale the values by $2^{-4}$, which in fact results in four right-shift operations (if the most significant bit is on the left side).
We will use the Gaussian kernel and an edge detection kernel, as seen below:
Gaussian: $\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$, Edge detect: $\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}$
We also make sure that the values are within the allowed range of [0, 255] – if
not, the values are saturated. This can result in some loss of data. The best method to
deal with this is to use the histogram equalization algorithm. However, this is beyond
the scope of our hardware-level implementation.
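The per-pixel arithmetic described above can be summarized by the following C sketch, which mirrors what the FPGA logic computes for one output pixel; the function signature and the window argument are assumptions made only for this illustration.

#include <stdint.h>

/* Per-pixel arithmetic for a 3x3 kernel: weighted sum, scaling by 2^-shift
 * (>> 4 for the Gaussian kernel), then saturation to the [0, 255] range. */
uint8_t apply_kernel_at(const uint8_t window[3][3], const int16_t k[3][3],
                        int shift /* 4 for the Gaussian kernel, 0 for edge detect */)
{
    int32_t acc = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            acc += (int32_t)k[i][j] * window[i][j];

    acc >>= shift;                /* scale by 2^-shift */

    if (acc < 0)   acc = 0;       /* saturate: negative results clip to 0 */
    if (acc > 255) acc = 255;     /* overflowing results clip to 255      */
    return (uint8_t)acc;
}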
Finally, once the simple kernels are implemented, we can extend the
implementation to larger kernels. This will not be detailed here, because the method is
very similar – we just have to increase the kernel size and include the new kernel
elements in the computation. Also, the borders of the image where the convolution is not applied will be thicker.
Synchronization
The three main loops that we defined (get data and save to memory, process image and write result) have dependencies because they share the same memory. If the processing loop starts to process the image before the required data is available, we can generate incorrect results. Similarly, if we send the image before it is processed, the resulting image might be corrupted27.
Figure 5.14 The three loops of the FPGA VI with and without synchronization
^27 Like our government
5.5.4.2. Implementation
In the initial state, we manually generate an “Image sent” occurrence – this
marks the starting point of the FPGA execution. The Read image loop starts
executing and reads the elements coming from the FIFOs until an image is fully read.
Once the image has arrived, we generate an “Image received” occurrence, which
triggers the execution of the second loop. In the meantime, the first loop is blocked
because it waits for another “Image sent” occurrence.
Once the second loop finishes processing the image, it is blocked again and
generates the “Image processed” occurrence, which unblocks the third loop
that sends the image over the DMA FIFO channels back to the host. We can see that
in this way we have “serialized” the execution of the three tasks.
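To illustrate the handshake, the sketch below models the three loops in C, with POSIX semaphores standing in for the LabVIEW occurrences. The names, and the point where the next “Image sent” is generated, are assumptions made only for this illustration; the real design uses occurrences inside the FPGA VI.

#include <pthread.h>
#include <semaphore.h>

/* Semaphores standing in for the LabVIEW occurrences. */
static sem_t image_sent, image_received, image_processed;

static void *read_loop(void *arg) {
    for (;;) {
        sem_wait(&image_sent);       /* wait for "Image sent" */
        /* read the image from the FIFO into the local memory */
        sem_post(&image_received);   /* generate "Image received" */
    }
    return NULL;
}

static void *process_loop(void *arg) {
    for (;;) {
        sem_wait(&image_received);   /* wait for "Image received" */
        /* process the image stored in the local memory */
        sem_post(&image_processed);  /* generate "Image processed" */
    }
    return NULL;
}

static void *write_loop(void *arg) {
    for (;;) {
        sem_wait(&image_processed);  /* wait for "Image processed" */
        /* send the processed image back over the DMA FIFO */
        sem_post(&image_sent);       /* allow the next image to be read */
    }
    return NULL;
}

int main(void) {
    sem_init(&image_sent, 0, 1);     /* the initial "Image sent", as in the text */
    sem_init(&image_received, 0, 0);
    sem_init(&image_processed, 0, 0);
    pthread_t t[3];
    pthread_create(&t[0], NULL, read_loop, NULL);
    pthread_create(&t[1], NULL, process_loop, NULL);
    pthread_create(&t[2], NULL, write_loop, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Here the semaphores enforce the same strict serialization as the occurrences in the initial FPGA implementation.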
The first improvement we can make is to enable partial parallelism between
the tasks, as already mentioned. This is very similar to the concept of
pipelining, which we will use frequently in the improvement part of the
implementation.
5.5.5.2. Pipelining
The performance of the FPGA is highly dependent on the frequency of its clock.
The base clock of the myRIO FPGA is set to 40 MHz, but this value can be raised
to 80, 120, 160 or 200 MHz. The initial implementation did not allow frequencies
greater than 40 MHz; therefore, we had to optimize the design.
The maximal frequency of the FPGA is closely related to the propagation delay of the
implemented circuits and is determined by the longest path generated by the
FPGA compilation tools. As an example, if the propagation delay of the longest path is
0.01 microseconds (10^-8 seconds), then the maximal frequency will be 100 MHz (10^8
Hz).
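Written as a formula, the maximal clock frequency is simply the reciprocal of the longest propagation delay; with the example values above:

f_{max} = \frac{1}{t_{pd,\,max}} = \frac{1}{10^{-8}\ \text{s}} = 10^{8}\ \text{Hz} = 100\ \text{MHz}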
To reduce the maximal propagation delay, we first identified the longest
paths using the log generated by the Xilinx compilation tool. To reduce the delay, we
must break a long “execution path” down into smaller ones – in FPGA development we
can achieve this by pipelining.
Therefore, we use pipelining almost everywhere in the design (and even on the
processor in some cases – because the CPU has two cores, only one pipeline stage is
worth implementing).
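As a rough software model (an illustration written for this text, not code generated by LabVIEW), the sketch below splits a multiply-add into two pipeline stages; the registers between the stages are exactly what breaks a long combinational path into shorter ones, allowing a higher clock rate at the cost of one extra cycle of latency.

#include <stdint.h>

/* Two-stage pipeline model of the operation result = a * b + c. */
typedef struct {
    int32_t mul_reg;   /* pipeline register: product computed in stage 1 */
    int16_t c_reg;     /* pipeline register: matching addend, delayed one cycle */
    int32_t result;    /* output of stage 2 */
} mac_pipeline_t;

/* One clock tick: both stages work in parallel on different data items. */
static void mac_clock(mac_pipeline_t *p, int16_t a, int16_t b, int16_t c)
{
    p->result  = p->mul_reg + p->c_reg;  /* stage 2: operands from the previous cycle */
    p->mul_reg = (int32_t)a * b;         /* stage 1: operands of the current cycle */
    p->c_reg   = c;
}

In hardware both assignments happen on the same clock edge; the sequential order above only models reading the old register values before they are overwritten.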
^28 Dynamic-Link Library (DLL)
^29 Shared Object – similar to a DLL, but used in a Linux environment
^30 Manually/visually validated
host, adding a duplicate FIFO channel and sending large chunks of data at
once to reduce the communication overhead
3. Version 3 – Removed error handling on the processor (after testing the
implementation in detail) and optimized the execution mode (e.g., by
disabling debugging); on the FPGA side, we improved performance by using the
smallest possible numeric representations for the variables
4. Version 4 – Pipelined the operations between the FIFO and the BRAM, added
multi-cycle BRAM read operations and implemented multi-stage division
and multiplication operations (with several pipeline stages) – these
improvements shortened the longest path in the design and allowed clock
speeds of up to 160 MHz
5. Last version – not yet complete, but preliminary results show a great
increase in performance; in this stage we execute the three
loops of the FPGA VI in parallel, similarly to having one large pipelined solution
(Figure: FPS of Version 2, Version 3, Version 4 and the last version, measured in the FPS None, FPS Display, FPS Live and FPS Live & Display configurations.)
(Figure: FPS measured in the FPS None, FPS Display, FPS Live and FPS Live & Display configurations.)
The question is whether it is worth using our solution when the problem size
increases. Therefore, we increased the kernel size to 5 ∗ 5 and 7 ∗ 7 (we also
implemented the 15 ∗ 15 version, but it did not fit on the FPGA). Figure 6.3 shows the
performance comparison of the C and LabVIEW implementations when using different
kernel sizes.
(Chart: FPS plotted against kernel sizes from 1 to 15.)
Figure 6.3 Comparing the performance of the LabVIEW SoC implementation with the
C implementation executed on the CPU. *Performance for the 15 ∗ 15 kernel was estimated
We have seen from the previous figure that if the problem is large enough (e.g.
the size of the image or of the kernel is increased), the SoC-based
implementation outperforms the highly optimized CPU-based one. Note that these results
on the SoC were achieved while keeping the frequency of the FPGA constant (at 160
MHz). Because of the limited FPGA resources, it was very hard to reach this frequency
– recompiling the same LabVIEW specification that led to these results might not even
succeed. The estimated result for the 15 ∗ 15 kernel would only be achievable on a larger
FPGA that can meet both the timing and the resource constraints.
We have to keep in mind that the FPGA-based approach scales well only
as long as the FPGA resource constraints are met.
Hardware
Our system has the following hardware requirements:
o NI myRIO-1900
o Host computer (development PC)
o Generic USB webcam
o USB cable to connect the myRIO to the PC
o Power source for the myRIO
Software
The following software must be installed on the Windows development PC:
• LabVIEW 2017 Development Edition, including the following
modules:
o LabVIEW Real-time module
o LabVIEW FPGA module
o myRIO add-on
o LabVIEW Vision Development module
• LabVIEW 2017 FPGA Module Xilinx Compilation Tool for Vivado
2015.4
On the myRIO, we must also install the necessary software (besides the software
packages that are shipped automatically):
• NI Vision RT
• LabVIEW Real-Time
• NI IMAQdx (image acquisition drivers for the webcam)
^31 Domain Name Server
Validating results
Once the application is running, the user can visualize the output of the system
on the front panel. The FPS indicator shows the performance of the system.
The image on the right should appear “blurrier” than the one on the left, because
it is the result of applying the Gaussian convolution kernel to the image on the left.
Chapter 8. Conclusions
Over the past several months, we have identified some of the requirements of real-
time image processing systems and decided to implement a System on Chip-based
software and hardware solution. In contrast to many existing implementations,
we proposed a new development approach that uses the LabVIEW graphical, data-
flow programming language to specify the behavior of both the processor and the FPGA.
Bibliography
https://www.altera.com/en_US/pdfs/literature/ab/ab1_soc_fpga.pdf.
[Accessed 6 June 2018].
[31] Xilinx Inc., "SoCs, MPSoCs & RFSoCs," 2018. [Online].
Available: https://www.xilinx.com/products/silicon-devices/soc.html.
[Accessed 1 July 2018].
[32] Intel Corporation, "SoCs Overview," 2018. [Online]. Available:
https://www.altera.com/products/soc/overview.html. [Accessed 1 July
2018].
[33] E. Vansteenkiste, New FPGA Design Tools and Architectures,
2016.
[34] National Instruments, "myRIO-1900 User Guide and
Specifications," 16 May 2016. [Online]. Available:
http://www.ni.com/pdf/manuals/376047c.pdf. [Accessed 17 March
2018].
[35] National Instruments, "Understanding the LabVIEW FPGA
Compile System (FPGA Module)," March 2017. [Online]. Available:
http://zone.ni.com/reference/en-XX/help/371599N-
01/lvfpgaconcepts/compiling_fpga_vis/. [Accessed July 2018].
[36] National Instruments, "Transferring Data between the FPGA and
Host (FPGA Module)," 2017. [Online]. Available:
http://zone.ni.com/reference/en-XX/help/371599N-
01/lvfpgaconcepts/fpga_data_transfer_overview/. [Accessed 16 February
2018].
[37] National Instruments, "How DMA Transfers Work (FPGA
Module)," 2017. [Online]. Available: http://zone.ni.com/reference/en-
XX/help/371599N-01/lvfpgaconcepts/fpga_dma_how_it_works/.
[Accessed 18 June 2018].
[38] Wikipedia, "Field-programmable gate array," [Online]. Available:
http://en.wikipedia.org/wiki/Field-programmable_gate_array. [Accessed
15 March 2018].
[39] Wikipedia, "Internet of Things," [Online]. Available:
http://en.wikipedia.org/wiki/Internet_of_things. [Accessed 15 March 2018].
[40] National Instruments, "National Instruments: Test, Measurement,
and Embedded Systems - National Instruments," [Online]. Available:
www.ni.com/en-us.html. [Accessed 15 March 2018].
[41] National Instruments, "National Instruments: Test, Measurement,
and Embedded Systems - National Instruments," [Online]. Available:
http://www.ni.com/en-us/shop/labview.html. [Accessed 15 March 2018].
[42] Viewpoint Systems Inc, "LabVIEW FPGA: Features, Benefits &
Drawbacks | Viewpoint Systems," Viewpoint Systems, [Online].
Available: https://www.viewpointusa.com/IE/ar/labview-fpga-the-good-
the-bad-and-the-ugly/. [Accessed 18 March 2018].
[43] Viewpoint Systems Inc, "LabVIEW FPGA: Features, Benefits &
Drawbacks | Viewpoint Systems," Viewpoint Systems, [Online].
Available: https://www.viewpointusa.com/IE/ar/labview-fpga-the-good-
the-bad-and-the-ugly/. [Accessed 18 March 2018].
Several sources cited in this section were distributed under the GNU
Free Documentation License. Therefore, reusing or distributing this document must also
comply with the GNU Free Documentation License and the GNU General Public License,
which are available at https://fsf.org.
Appendices
Appendix 1 – Acknowledgements
The hardware (myRIO) and software (LabVIEW) components required for this
project were provided by National Instruments® Romania, headquartered in Cluj-
Napoca, Romania (with its corporate headquarters in Austin, TX, USA).
I want to thank the Romanian team for the opportunity and for the support they
provided in making the implementation of this project possible.
The complete source code can be viewed online at the following GitHub page:
https://github.com/gergo13/SystemOnChip-ImageProcessing-myRIO