Design and Development of an FPGA-based Distributed Computing Processing Platform
2011
Recommended Citation
Su, Juliana, "Design and Development of an FPGA-based Distributed Computing Processing Platform" (2011). Master’s Theses. 38.
https://digitalcommons.bucknell.edu/masters_theses/38
I, Juliana Su, do grant permission for my thesis to be copied.
Acknowledgements
To my family, Mom, Dad, and Tina, thank you for all of your support and encouragement
over the years. You have always believed in me, no matter what, and for that, I am so
grateful.
To my advisor, Professor Thompson, thank you for this wonderful opportunity. I still
remember our very first meeting, something required as part of the first ELEC 101 homework
assignment. You asked me about what I wanted to do after graduation. As a junior, with
graduation being some two years away, I told you I was not sure (maybe grad school?). Well,
here we are, almost four years later, with me just over a week away from graduating with
a master’s and a few months away from starting a PhD program; I think your question has
been answered. I cannot thank you enough for your guidance, advice, support, and patience
over the past few years.
To the members of my thesis committee, Professors Hass and Nepal, thank you for your
thoughtful comments, suggestions, and advice on this work. I appreciate your help, not just
with my research, but with my other academic endeavors, as well. Thanks for teaching me
the ins and outs of digital design and writing recommendations for me.
Contents

1 Introduction
1.1 FPGAs
1.2 Distributed Computing
1.3 Problem Overview
1.4 Contributions
1.5 Thesis Organization

2 Problem Statement
2.1 Challenges of Designing with FPGAs
2.1.1 Development Time
2.1.2 Hardware/Software Partitioning
2.1.3 FPGA Fabric
2.2 Design Requirements
2.3 Summary

3 Related Work
3.1 Reconfigurable Computing Systems
3.1.1 Splash/Splash 2
3.1.2 PRISM
3.1.3 SLAAC
3.1.3.1 Tower of Power
3.1.3.2 Adaptive Computing Systems (ACS) Application Programming Interface (API)
3.1.4 Baylor University Cluster
3.1.5 The Reconfigurable Computing Cluster (RCC) Project
3.2 Types of Applications
3.2.1 Digital Signal Processing
3.2.2 Bioinformatics
3.2.3 Cryptography
3.3 Summary

4.4 Communication
4.4.1 Messages
4.4.1.1 Message Components
4.4.1.2 Types of Messages
4.4.1.3 Exchanging Messages
4.5 Summary

5.7 Performance Bottleneck
5.7.1 Hardware Core
5.7.2 File I/O
5.7.3 Network Transmission
5.7.4 Analysis
5.8 Summary

6 Conclusion
6.1 Summary
6.2 Future Work

Bibliography

A Software Framework
A.1 wrapper.c
A.2 wrapper.h

D.2 3DES MATLAB Code
D.3 DESDecrypt MATLAB Code

List of Tables

List of Figures
Abstract
This thesis presents two frameworks, a software framework and a hardware core manager
framework, which together can be used to develop a processing platform using a distributed
system of field-programmable gate array (FPGA) boards. The software framework provides
users with the ability to easily develop applications that exploit the processing power of
FPGAs while the hardware core manager framework gives users the ability to configure and
interact with multiple FPGA boards and/or hardware cores. This thesis describes the design
and development of these frameworks and analyzes the performance of a system that was
constructed using the frameworks. The performance analysis included measuring the effect
of incorporating additional hardware components into the system and comparing the system
to a software-only implementation. This work draws conclusions based on the
results of the performance analysis and offers suggestions for future work.
Chapter 1
Introduction
Recently, there has been growing interest in high-performance computing using FPGAs due
to numerous application areas experiencing an increased demand in processing capability.
Research conducted over the past twenty years has demonstrated that hardware acceleration
using FPGAs yields considerable performance improvements for certain application
areas, such as bioinformatics, digital signal processing, cryptography, and network packet
processing.
One method of exploiting the processing power of FPGAs is to cluster them together and
distribute computations among them. The concept of dividing up a problem into smaller tasks
and distributing these tasks among separate processing elements is called distributed
computing. By using FPGAs in a distributed computing environment, performance is increased
through parallelization.
1.1 FPGAs
The basic architecture of an FPGA, shown in Figure 1.1, consists of configurable logic blocks
(CLBs), routing channels, and input/output blocks. Each CLB, which is embedded in a
general routing structure, consists of look-up tables and flip-flops that can be configured
to perform either combinational or sequential logic. CLBs are surrounded by input/output
blocks (IOBs) for interfacing with external devices. The general routing structure allows for
arbitrary wiring, so designers can connect the logic elements however necessary. Designs can
be implemented on an FPGA using a hardware description language (HDL), such as Verilog
or VHDL, or a schematic.
Figure 1.1: Block diagram of an FPGA architecture [1].
In general, the amount of exploitable parallelism is the key factor in determining the
suitability of an application for FPGAs. FPGAs can only outperform modern processors by
exploiting huge amounts of parallelism.
1.3 Problem Overview
With FPGAs, the benefits of high performance and reconfigurability come at the cost of
added complexity. When designing for FPGAs, one challenge that designers face is the
need to work with both hardware and software components, each of which possesses its own
design methods. Using FPGAs in a distributed computing environment adds another level
of complexity since it introduces the element of networking. For the user, determining how
to configure and interact with hardware, software, and networking components in order to
develop an application for an FPGA can be a tedious and time-consuming task.
1.4 Contributions
1.5 Thesis Organization
Chapter 2 defines the problem statement for this work. Chapter 3 explores related work.
Chapter 4 describes the design and implementation of the system. Chapter 5 presents the
results of our performance analysis. Chapter 6 summarizes this work and
discusses potential future work.
Chapter 2
Problem Statement
Unlike general-purpose computing systems, which separate the design of hardware and
software, embedded systems involve the simultaneous design of hardware and software. The
challenge of creating a system which clusters FPGAs in a distributed computing environment
requires designers to be knowledgeable in hardware, software, and networking concepts.
Exploiting huge amounts of parallelism using FPGAs is no trivial matter. This can be
attributed to three aspects of FPGA design: 1) lengthy development time; 2) complex
hardware/software partitioning; and 3) limited FPGA fabric size.
One of the major difficulties of FPGA design lies in the manner that designers must approach
a problem. Designing for FPGAs involves simultaneously using multiple resources that are
spread across a chip to achieve a massive amount of parallelism. Software programming, in
contrast, is generally aimed at exploiting a microprocessor’s ability to sequentially execute
instructions. Humans naturally think in a sequential manner, so translating a design into
parallel logic takes significantly more time than sequential programming. In addition to
the increased time it takes to implement designs in an HDL, there is a steep learning curve
associated with learning to use FPGA development tools. These tools are often
vendor-specific, depending on the FPGA being used. In summary, the increased time it takes to
design and implement parallel logic translates into longer development times.
Finding the right balance between the flexibility of software and speed of hardware while
satisfying design requirements, such as performance, area, designer effort, etc., is a challenge
associated with designing for FPGAs. In a process called hardware/software partitioning,
shown in Figure 2.1, it is the responsibility of the designer to decide how best to divide
an application between a microprocessor component (“software”) and one or more custom
coprocessors (“hardware”). The task of partitioning is particularly difficult due to the fact
that there are often many ways to partition a design.
Figure 2.1: Diagram of hardware/software partitioning [2].
As the related work described in Chapter 3 implies, the task of designing a processing
platform that incorporates FPGAs is not a novel idea. What makes our particular problem
unique from previous research, however, is the combination of design requirements imposed
upon it. First, we require the system to be flexible. The system must not be
application-specific since we want to have the ability to incorporate any type of hardware core into the
system. Second, we require the system to be scalable. This means that the user should have
the ability to incorporate multiple boards and multiple cores into the system. In our case,
specifically, the ability to support multiple cores is crucial since we only have three FPGA
boards available for use. Our system will be very small-scale; however, we will design for
potential larger-scale configurations.
2.3 Summary
This chapter defined the problem statement of this work. Summarizing our design goals, we
aim to develop a system that decreases development time while being flexible and scalable.
The next chapter provides an overview of related work.
Chapter 3
Related Work
The following review of related work covers reconfigurable computing systems, as well as
types of FPGA applications. The purpose of reviewing reconfigurable computing systems is
to provide an overview of what has already been built and how these existing systems do or
do not address portions of the problem statement. A review of applications was conducted
to understand what types of application have been applied to FPGAs. These applications
provide an understanding of how FPGAs were utilized to run these applications. This
information was taken into account when our system was designed.
The majority of related reconfigurable computing systems take the form of computing
clusters in which multiple FPGA boards were either networked together or used as accelerators
on PCs to achieve increased computational performance.
3.1.1 Splash/Splash 2
Splash and Splash 2 [2] are special-purpose parallel processors that use FPGAs as processing
elements. Splash is a reconfigurable linear logic array of XC3000-series Xilinx FPGAs that
uses a Peripheral Component Interconnect bus to interface with a host system. Splash
2 consists of two rows of eight Xilinx XC4010 FPGAs, each with a small local memory
attached. Both systems are scalable since multiple FPGA boards can be incorporated into
a system and multiple systems can be connected together to form even larger systems.
3.1.2 PRISM
Unlike the large-scale Splash systems, PRISM [3] is a small-scale proof-of-concept system,
which augments individual general-purpose core processors with FPGAs. In the system, an
FPGA, which serves as a coprocessor, is configured to execute the smaller, more frequently
executed sections of a program while the general-purpose core processor is responsible for
executing the less frequently accessed sections.
PRISM addresses the issue of hardware/software partitioning by assigning the most
frequently executed portions of code to the FPGA and the less frequently executed portions of
the code to the general-purpose processor. However, the issue with this generalized method
of partitioning a program is that it may not improve performance for all types of
applications. The most frequently executed portions of a program may actually be better suited
for a general-purpose processor. In order for the system to execute a program more
efficiently, the FPGA must execute its computations faster than a general-purpose processor.
Otherwise, if a general-purpose processor can execute the same computations faster than the
FPGA, there is no benefit to using the FPGA over the general-purpose processor.
3.1.3 SLAAC
The objective of the Systems Level Applications of Adaptive Computing (SLAAC) project
[4] is to define an open and scalable heterogeneous distributed adaptive computing systems
architecture standard. Part of SLAAC’s mission includes creating a commercial off-the-shelf (COTS) reference platform
implementation. One such platform, the SLAAC Research Reference Platform (RRP), is
defined as a high-speed network cluster of desktop PCs where each PC is enhanced with a
reconfigurable accelerator, such as an FPGA board.
An example of an existing RRP is Virginia Tech’s Tower of Power (ToP) [4, 5], a system
comprised of sixteen Pentium II PCs that are individually equipped with a WildForce board
and connected together using an Ethernet and a Myrinet network. A diagram of the ToP
is shown in Figure 3.1. In total, the platform contains 80 Xilinx XC4062XL FPGAs and
memory banks. As part of the SLAAC project, researchers at Virginia Tech designed and
developed a common Application Programming Interface (API) which supports application
development for the SLAAC system.
Figure 3.1: Diagram of the Tower of Power [6].
In [5] and [6], the ACS API is described as a development environment for applications on
heterogeneous distributed FPGA boards. Its purpose is to provide developers with a
simple API for controlling a complex distributed system of interconnected adaptive computing
boards. Some applications of the API include HokieGene, a system that implements an
enhanced version of a genetic-search algorithm [7], and FIR filters [8].
The ACS API programming model describes a system as a collection of hosts, nodes, and
channels. Hosts are user application-level processes that allocate and control nodes, which
are hardware resources (typically FPGA accelerator boards). Hosts and nodes are connected
together by channels, which allow data to flow throughout the system. This data flow is initiated
by system creation, streaming data, and memory access routines. System creation allocates
nodes and sets up channels. Streaming data functions allow the host application to insert
streaming data into and receive streaming data from the system. Memory access functions
include data transfer operations that read, write, and copy memory, as well as generate
interrupt signals at nodes.
The ACS API addresses the FPGA development time issue since it: 1) is independent of a
specific FPGA architecture; 2) allows a developer to control multiple boards with a single
program; and 3) configures the system’s communication network. All these characteristics of
the ACS API lessen the time it takes for developers to implement an FPGA-based system.
While the ACS API’s programming model and functions have characteristics desirable in
our system, the configuration of the cluster it was used for is not ideal for our purposes. ToP
is a large-scale configuration that simply incorporates too many PCs and too many FPGAs.
Figure 3.2: Block diagram of the Baylor University cluster [9].
This cluster addresses scalability, one of our design goals, in that FPGA boards can be added
or removed from the system. Results included configurations ranging from one board to 15
boards, indicating that this design works for both very small-scale and larger-scale
configurations. However, while this cluster addresses the finite FPGA fabric issue by utilizing more
than one FPGA board, it does not take advantage of all the available FPGA space on each
board since each board only contained one 3DES hardware core.
The RCC project at the University of North Carolina at Charlotte is a multi-institution,
multi-disciplinary project currently investigating the use of Xilinx ML-410 development boards to
build cost-effective petascale computers [10, 11]. As of 2008, the project has constructed
a prototype cluster consisting of 64 Xilinx Virtex-4 (V4P60) ML-410 Development boards.
Figure 3.3 shows a block diagram of the RCC cluster.
Figure 3.3: Block diagram of the RCC cluster [10].
While petascale computing is outside the scope of this work, we do find the configuration
of the RCC cluster to be ideal. It is similar to the design of the Baylor University cluster
in that it consists of a host PC and multiple FPGA board nodes connected together via a
network switch. Despite the fact that these two clusters incorporate a large number of FPGA
boards, they do not use a large number of PCs, as in the ToP system, which is ideal in terms
of resources. Like the Baylor University cluster, the RCC cluster addresses scalability since
FPGA nodes can be added or removed from the system.
In terms of applications often designed for FPGAs, digital signal processing, bioinformatics,
and cryptography are typical target applications due to inherent characteristics that are
conducive to exploiting parallelism. This section provides a few examples of viable target
applications that were considered for our performance analysis. Since our system needs to be
flexible and not application-specific, this review revealed what variable application features,
such as different types and sizes of input data, need to be accommodated.
3.2.1 Digital Signal Processing
One of the earliest reconfigurable computing applications was signal processing. Digital
signal processing (DSP) is concerned with the processing of signals that have been converted
into a digital format. DSP technology is used in a variety of areas, such as speech
processing, image processing, audio processing, information systems, control systems, and
instrumentation.
DSP applications possess characteristics that increase the amount of parallelism that can
be exploited by reconfigurable computing. First, the operations in many DSP functions
follow rather regular schedules. This predictability reduces the amount of control logic
needed in the design and allows hardware to be customized to extract a significant amount
of parallelism. Second, many DSP applications use small word data widths. Requiring less
hardware and less routing, smaller word widths allow more hardware units to fit on a chip
and result in higher clock rates. Finally, fixed coefficients or constants are often used in DSP
computations. Hardware customized for a given coefficient or constant uses less area and
processes operations more quickly.
Discrete Fourier Transforms (DFTs) are typically computed using the Fast Fourier Transform
(FFT), an algorithm that efficiently computes a DFT by recursively dividing a DFT into
smaller DFTs. In [12], a speed-up of 23 times over a Sparc-10 workstation was achieved for
an FFT algorithm implementation on Splash-2, an FPGA-based array processor. Also, in
[13], an FPGA implementation of a radix-4 FFT achieved speedup factors of 9.4, 10, and 12.5
over a TMS320C5x DSP processor for FFTs of length 64, 256, and 1024 points, respectively.
3.2.2 Bioinformatics
One of the main focuses of bioinformatics, an interdisciplinary field which combines biology,
computer science, and information technology, is analyzing and interpreting nucleotide and
amino acid sequences, protein domains, and protein structures. The process of analyzing
and interpreting biological data sets is known as computational biology.
Computational biology works with incredibly large sets of data. For example, in 2008,
the National Center for Biotechnology Information’s GenBank had more than 98 million
sequences on record [14]. In addition, many of the algorithms applied to these large data
sets are computationally intensive. The complexity of comparison algorithms, for instance,
is quadratic with respect to sequence length. The fact that the number of sequences on
record continues to grow exponentially every year along with the fact that computational
biology algorithms are computationally intensive makes the area an ideal candidate for
high-performance computing.
3.2.3 Cryptography
A key element in achieving computer system security is the use of cryptography, the science
of encrypting and decrypting data. Since cryptographic algorithms are computationally
intensive, easily pipelined, and frequently implemented with hardware components, such as
shift registers and permutation networks, they are an ideal application for reconfigurable
computing. Hardware implementations are well-suited for cryptographic algorithms since
they provide high performance and significant resistance to attacks. In particular, modular
arithmetic, which is an important element of cryptography, is more efficiently implemented
in hardware than with fixed-width microprocessor arithmetic logic units.
The RSA algorithm is a public-key encryption scheme that was developed by Ron Rivest,
Adi Shamir, and Len Adleman. The security of RSA lies in the difficulty of factoring large numbers,
a problem which becomes exponentially more difficult to solve as the size of the numbers
increases. A fast and efficient factoring algorithm for large numbers has yet to be discovered.
The best factoring algorithm runs in sub-exponential time, which is greater than polynomial
time, but less than exponential time.
In [17], the 0.8 millisecond decryption time for an FPGA-implemented RSA was about 11
times faster than the 9.1 millisecond decryption time for a 512-bit software implementation
on a 150MHz Alpha. Also, the fastest 1024-bit software implementation of 43.3 milliseconds
running on a PPro200 based PC was shown to be about 14 times slower than their best
result of 3.1 milliseconds.
3.3 Summary
Chapter 4
The design goal of the software and hardware core manager frameworks is to provide users
with an API to develop applications for, as well as build and control, a complex distributed
system consisting of a host PC and multiple FPGA boards. Such a system is made up of the following seven components:
• A host PC
• A software framework
• A software application
• FPGA boards
• Hardware core managers
• Hardware cores
• A network switch
Together these components form a distributed system of multiple FPGA boards that
parallelizes computations. A graphical representation of the system is illustrated in Figure 4.1.
In terms of operation, the software framework, which resides on a host PC, provides the user
with a library of functions for interacting with the FPGA boards. Functionalities provided
by the software framework can be grouped into the following categories: initialization, core
information set up, reading data, formatting data, sending data, and receiving data. Using
these functions, the user has the tools to develop a software application that uses the FPGA
boards to process data.
When input data read by the software application is sent out to the FPGA boards, it is,
more specifically, being sent to the hardware core managers running on the FPGA boards.
The hardware core manager takes the form of a software application that runs on top of the
soft core processor implemented on the FPGA. The framework for the hardware core managers
includes a set of initialization, core information set up, read data, format data, send data,
and receive data functionalities that are similar to, but independent from, the functionalities
defined for the software framework. The hardware core manager serves as the middleman
between the user’s application and the hardware cores, logic blocks that perform a specific
computation. This is due to the fact that the hardware core manager has the ability to
access the hardware cores directly, writing data sent by the user application to the hardware
core and then reading the result computed by the hardware core. This result is returned to
the software application, where it is sorted and either written to a file or used by the user in
some other manner.
The host PC and FPGA boards are connected together via a network switch to allow for
data communication between the software application and hardware core managers over
a network. Ethernet and lwIP, a lightweight implementation of the TCP/IP protocol suite, facilitate
network communication.
The software framework takes the form of an API software library that provides a set of
data structures and functions that users can use to build an application that sends data to
and receives data from the FPGA boards.
This section describes the components that make up the software framework and provides a
description of their functionalities and responsibilities. For more details on the software
framework, see Appendix A, which provides the software framework source code.
Figure 4.2 illustrates the flow of data through the software application created using the
software framework.
Figure 4.2: Data flow through the software application.
4.2.1 Conventions
• One thread per read socket
4.2.2 Initialization
Before an application may begin processing data, communication to the FPGA boards must
be established and several data structures must be created and initialized.
4.2.2.1 Communication
The framework provides a setUpSocket() function, which is responsible for creating both a
read and write socket for each IP address associated with an FPGA board, constructing the
FPGA board address structures, and connecting to the FPGA boards. Two separate sockets
must be used for sending and receiving due to the fact that the Sockets API of the provided
implementation of lwIP is not thread-safe, meaning it is not possible for two different threads
to operate on the same lwIP socket without data being corrupted.
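As a sketch of this two-socket arrangement, one write connection and one read connection can be opened per board with standard POSIX socket calls; the IP address, port numbers, and helper function below are illustrative assumptions, not values taken from the framework.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open one TCP connection to an FPGA board (sketch; port numbers are assumptions). */
static int connect_to_board(const char *ip, uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    /* One write socket and one read socket per board, so the sending thread and
       the receiving thread never operate on the same lwIP socket. */
    int write_fd = connect_to_board("192.168.1.10", 5001);
    int read_fd  = connect_to_board("192.168.1.10", 5002);
    if (write_fd < 0 || read_fd < 0) {
        fprintf(stderr, "connection to board failed\n");
        return 1;
    }
    /* ... hand write_fd to the sending threads and read_fd to a receive thread ... */
    close(write_fd);
    close(read_fd);
    return 0;
}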
The software framework provides users with two types of data queues: input data queues and
data results queues. The input data queue is a POSIX message queue structure while the
data results queue is a linked list structure. For every different input data file, there is one
input data queue. It is assumed that the input data file contains data only intended for one
core type. Therefore, an equivalent statement is that there is one input data queue for each
different core type in the system. Similarly, there is one data results queue for each different
core type in the system. The input data queue is initialized with the setUpQueue() function.
This function creates the message queues by generating message keys and then initializes the
input data queue structure. A separate function called setUpRList() constructs the data
results queue.
Finally, there is the setUpCoreInfo() function, which requests, collects, and organizes
information concerning the number and types of cores available for use in the system.
setUpCoreInfo() asks an FPGA board to report the number and types of cores it houses.
To be aware of the locations of all available hardware cores, the software application must
make one call to setUpCoreInfo() for every FPGA board present in the system. After
collecting core information from every FPGA board in the system, the software application
may begin sending data requests to the FPGA boards.
Processing data involves reading input data, formatting data into messages, sending
messages, and receiving results.
In terms of input data, the system currently only supports input data read from a file. The
readFile() function is responsible for reading input data from a file and placing the data
into the appropriate input data queue. In the system, there is one input data queue assigned
to each different core type present in the system. For example, if the system contained 3DES
and FIR filter cores, then there would be two input data queues created, one for 3DES and
one for FIR.
A mapToQueue() function ensures that data is placed into the right queue. Data is pulled
from the input data queue by a core-specific thread. Each core-specific thread is
associated with a particular hardware core in the system. Since the software framework is not
application-specific, it is the responsibility of the user to develop these core-specific functions.
These core-specific functions are associated with a particular core type in the user-defined
coreMapEntry data structure. An example of a core-specific function for 3DES is provided in
Appendix A.1 and can be used as a guide to creating a core-specific function for another type
of core. The core-specific function is responsible for forming and sending the data request
message. The process of setting the message type, core type, and job ID portions of the data
request message does not vary from core type to core type. The process of sending the data
request message out to a particular FPGA board also does not vary from core type to core
type. What does vary, however, is the method of constructing the input data portion of the
message. For example, 3DES requires five pieces of input data: key 1, key 2, key 3, function
select, and data to be processed. These inputs must be gathered from whatever source the
user chooses. In our case, all the keys were read in from a user-defined file, function select
was assigned from a command-line argument, and the data to be processed was grabbed
from an input data file. These pieces of data then had to be concatenated into a collection
of bytes and attached to the message type, core type, and job ID to form a data request
message. Once the data request message is properly formed, it is sent to a particular
FPGA board through a write socket.
As previously mentioned, there is one read socket for every FPGA board in the system.
A thread executing the recvResults() function operates on each of these read sockets.
recvResults() receives a data response message through the socket and checks the core
type of the message. Based on the core type, recvResults() places the message into the
appropriate data results queue using the addResults() function. The addResults()
function places results into a data results queue in order using the job ID. Once all results have
been received, the user may either use the writeResults() function to write the data to
an output file or use the removeResults() function to remove and access results from the
queue.
The hardware core manager is a software application that controls the data flow into and
out of the hardware cores. It runs on the MicroBlaze soft core processor and is responsible
for collecting information regarding what types of cores are available for use and sharing this
information with the host PC when the information is requested. Another responsibility of
the hardware core manager is to route data to and from the hardware cores. Source code
for the hardware core manager can be found in Appendix C.
Figure 4.3 shows the flow of data through the hardware core manager.
4.3.1 Conventions
4.3.2 Operating System
The hardware core manager runs on top of Xilkernel, a lightweight embedded kernel that Xilinx
provides for its processor cores. Xilkernel provides such features as threading and message queues.
4.3.3 Initialization
When started, the hardware core manager must first set up its network information. This
requires reading the state of the DIP switches present on the FPGA board. The number
corresponding to the state of the switches represents the last decimal digit of the IP address
to be assigned to the FPGA board. This procedure allows the IP addresses of different
FPGA boards to be set by simply changing the DIP switches. It is important, however, that
these IP addresses correspond to the IP addresses used by the software application on the
host PC.
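A rough sketch of this step is shown below, assuming the Xilinx GPIO driver and the lwIP address macros; the GPIO device ID name and the 192.168.1.x subnet are assumptions, not values from the thesis.

#include "xgpio.h"
#include "xparameters.h"
#include "lwip/ip_addr.h"

/* Read the DIP switches and use their value as the last decimal octet of the
   board's IP address. XPAR_DIP_SWITCHES_8BIT_DEVICE_ID and the 192.168.1.x
   subnet are assumptions made for this sketch. */
static void set_board_ip(struct ip_addr *ipaddr)
{
    XGpio dip;
    XGpio_Initialize(&dip, XPAR_DIP_SWITCHES_8BIT_DEVICE_ID);
    u32 last_octet = XGpio_DiscreteRead(&dip, 1) & 0xFF;  /* channel 1 holds the switch state */
    IP4_ADDR(ipaddr, 192, 168, 1, last_octet);
}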
Once the network information has been configured, the hardware core manager must establish
a connection to the host PC. This is done by creating two TCP/IP stream sockets, one socket
for reading and one socket for writing. Each of these sockets is assigned a different known port
number. The hardware core manager then connects to the host PC. Once communication
to the host PC has been established, the hardware core manager may begin to receive and
process requests, a process that is depicted by the flowchart in Figure 4.4.
Figure 4.4: Flowchart depicting the receive request process.
A core request is always the first message that the hardware core manager receives from
the host PC. When a core request is received, the hardware core manager sets up its core
information, then sends this information to the host PC.
For every different hardware core type present on the FPGA, an input data queue must be
created. Therefore, the convention is that there is one input data queue per core type.
Each FPGA board has a Compact Flash card inserted into its SysACE controller. The Flash
card contains a text file which describes each core to be used in the system configuration.
An example of the contents of this file is as follows:
3DES,0xcea00000,36,8
3DES,0xcea20000,36,8
!
Each line of the file contains the following information: core type, core base address, core
input data size, and core output data size. In this example, the first line indicates a 3DES
core type with a base address of 0xcea00000, a total input data size of 36 bytes, and an
output data size of 8 bytes. A stop character, in this case ‘!’, is placed on the last line of the
file so that it is known when the end of the file has been reached.
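A parsing routine for this file might look like the following sketch; the file name argument and the core_info structure are assumptions, while the line format and stop character follow the description above.

#include <stdio.h>

struct core_info {
    char type[16];            /* core type string, e.g. "3DES"   */
    unsigned long base_addr;  /* core base address               */
    unsigned in_bytes;        /* total input data size in bytes  */
    unsigned out_bytes;       /* output data size in bytes       */
};

int load_core_list(const char *path, struct core_info *cores, int max)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int n = 0;
    char line[128];
    while (n < max && fgets(line, sizeof(line), f)) {
        if (line[0] == '!')   /* stop character marks the end of the list */
            break;
        if (sscanf(line, "%15[^,],%lx,%u,%u", cores[n].type, &cores[n].base_addr,
                   &cores[n].in_bytes, &cores[n].out_bytes) == 4)
            n++;
    }
    fclose(f);
    return n;                 /* number of cores described in the file */
}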
The process of setting up the core list involves reading the text file containing the core
information and parsing it into the necessary data structures. For each core that is set up, a
thread is created. This thread is assigned specifically to this core and is therefore responsible
for reading/writing data to/from this core. The core information is sent to the host PC so
that the software application is aware of the types and locations of the cores present in the
system.
When the hardware core manager receives a data request, it places the message into the
input data queue corresponding to the core type of the message. The core thread created
in the core list setup procedure checks this queue for new data. When a new piece of data is
available, it removes the data and provides the input data to the appropriate core processing
function. This function writes the data to the hardware core and reads the data from the
hardware core. The result computed by the hardware core is formed into a data response
and sent back to the host PC.
4.4 Communication
This section describes the methods with which the software application communicates with
the hardware core manager.
4.4.1 Messages
Data is passed between the host PC and FPGA boards as streams of bytes since TCP/IP is
a byte-stream oriented communication protocol. Based on this characteristic of TCP/IP, we
define a message to be a variable-length packet of bytes that the host PC and FPGA boards
use to exchange information.
4.4.1.1 Message Components
Each message, depending on its type, can consist of one or more of the following pieces of
information, as illustrated in Figure 4.5: 1) message type; 2) number of cores; 3) core type;
4) job identifier (job ID); 5) input data; and/or 6) output data.
Figure 4.5: Message components and their sizes. Message Type: 1 byte; # of Cores: 4 bytes; Core Type: m bytes; Job ID: 4 bytes; Input Data: n bytes; Output Data: x bytes.
The message type is a one-byte character that enables the hardware core manager to identify
and distinguish incoming messages. Our system defines two message types, core
and data. The core request message type is represented by the character, “c”, while the data
request message type is represented by the character, “d”. Since message types are defined
in a header file (wrapper.h in Appendix A.2), it is possible for the user to define new message
types or use a different set of characters to identify messages by simply editing the header
file.
Number of cores represents the number of cores present on a particular FPGA board. The
system currently defines this component to be four bytes in length, specifically using a C int
data type. The use of a C int data type to store the value for the number of cores means
that the maximum number that can be used to represent the number of cores is 2^31 − 1.
Considering the fact that our system could only handle a maximum of seven hardware cores,
four bytes is more than enough bytes to store the value for the number of cores in the system.
The core type is a byte array consisting of a user-defined variable-length character string
associated with a particular type of core. For our implementation, we defined the core type
to be eight bytes long, allowing cores to be identified by a string consisting of a maximum of
eight characters. As with the message type, the core type is a flexible message component
in that its length can be changed by editing the wrapper.h header file.
The job ID is a number representing the order in which the output data corresponding to the
message must be arranged in the data results buffer. The length of the job ID is user-defined,
but must be large enough to store the maximum number of data requests that will be
sent. For our system, four bytes in the form of a C uint32_t data type were used to hold
the largest possible job ID, meaning the system can support a maximum of 2^32 − 1 data
requests.
Input data is a byte array containing data supplied to the hardware core. The length of the
input data is dependent on the specifications of the hardware core the data is intended for.
For example, input data for the 3DES core consists of 36 bytes that include key 1 (8 bytes),
key 2 (8 bytes), key 3 (8 bytes), function select (4 bytes), and data to be processed (8 bytes).
In the case where there is more than one input, as in 3DES, all input data are concatenated
together when incorporated into a message. The input data are later parsed by a function
specific to the core that the data is intended for. For a system consisting of a heterogeneous
set of hardware cores, the length of the input data section of the message must be equal to
the length of the largest possible input data.
Output data is a byte array holding the result computed by the hardware core. Like the
input data, its length is dependent on the specifications of the hardware core the data is
intended for. For example, the output data of 3DES is 8 bytes long. Again, for a system
consisting of a heterogeneous set of hardware cores, the length of the output data section
must be equal to the length of the largest possible output data.
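As an illustration of how these components combine, the following sketch packs a 3DES data request into a flat byte buffer; the helper name is an assumption, the field order follows the description above, and byte-order handling is omitted.

#include <stdint.h>
#include <string.h>

#define CORE_TYPE_LEN  8   /* core type: character string padded to 8 bytes          */
#define DES3_INPUT_LEN 36  /* key 1 + key 2 + key 3 + function select + data block    */

/* Pack a 3DES data request; buf must hold at least 1 + 8 + 4 + 36 = 49 bytes. */
size_t pack_data_request(uint8_t *buf, uint32_t job_id, const uint8_t *input)
{
    size_t off = 0;
    buf[off++] = 'd';                                  /* message type: data request   */
    memcpy(buf + off, "3DES\0\0\0\0", CORE_TYPE_LEN);  /* core type, padded to 8 bytes */
    off += CORE_TYPE_LEN;
    memcpy(buf + off, &job_id, sizeof(job_id));        /* 4-byte job ID                */
    off += sizeof(job_id);
    memcpy(buf + off, input, DES3_INPUT_LEN);          /* concatenated input data      */
    off += DES3_INPUT_LEN;
    return off;                                        /* total message length         */
}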
The various message components described in the previous section are used to form the four
possible types of messages that can be exchanged between the host PC and the FPGA boards:
1) core information request (core request); 2) core information response (core response); 3)
data request; and 4) data response. Requests are messages sent from the host PC to an FPGA
board while responses are messages sent from an FPGA board to the host PC. Figure 4.6
shows the format of the different message types. The composition and usage of each of the
messages are described in the next section.
Figure 4.6: Message formats.
Core Request: Message Type
Core Response: # of Cores, Core Type, Core Type, Core Type, ...
Data Request: Message Type, Core Type, Job ID, Input Data
Figure 4.7 illustrates the exchanging of messages between the host PC (software application)
and FPGA board (hardware core manager) over time.
Figure 4.7: Exchanging of messages between the host PC and FPGA board over time.
The first message in a series of exchanges is a core information request, a message that only
consists of a core request message type. For our system, this is simply the character, “c”.
The core information request is only sent once as part of the system initialization of the
software application running on the host PC. To be able to distribute input data properly,
the application must be aware of what types of cores are present in the system and where
(which FPGA board) each core is located. Therefore, before any data can be sent out to an
FPGA board, the application must first send a core information request out to each of the
FPGA boards.
The hardware core manager checks the message type portion of every message it receives and
processes the message based on the message type it sees. If it identifies a core information
request, it sends the host PC a core information response. In this case, it forms a message
that includes the number of cores it has, followed by the core types of these cores.
Once the host PC has received the core information response sent by the FPGA board, it
may begin sending data requests. A data request message consists of: 1) a message type
representing a data request; 2) the core type that the data processing request is for; 3) the
job ID of the request; and 4) input data. After the FPGA board has determined that the
message it has received is a data request, it checks the core type portion of the message to
determine which core type input data queue it should place the input data into. This queue
is being constantly checked for new messages by the corresponding core threads. When a
message is available, a core thread removes the message from the queue and provides the
input data from the message to the hardware core function. This hardware core function is
user-defined since each hardware core has potentially different structures/requirements, such
as number of inputs, size of input data, size of output data, etc. The hardware core function
must parse the input data properly and write it to the hardware core. The hardware core
function then reads the result from the hardware core. The result is then concatenated with
the core type and job ID to form the data response, which is sent back to the host PC.
On the host PC side, a single recvResults() thread per FPGA board retrieves data
responses and places them into a data results queue. After all results have been received and
sorted by job ID, the user may do whatever they see fit with the data. If the user wishes to write
the data to a file, a function to do so is provided. In this case, a writeResults() thread
removes data from the results queue and writes them to an output file. Whether the user
chooses to write the data to a file or not, the main application thread should not exit until
all core message queues are removed and all read and write sockets are closed.
4.5 Summary
This chapter described the design and operation of the software and hardware core
manager frameworks. The next chapter discusses the experimental setup and the results of the
performance analysis.
Chapter 5
5.1 Overview
In addition to designing and constructing the software and hardware core manager
frameworks, our goal was to compare the performance of our system design (a hardware/software
configuration) against a software-only solution. Throughput, as defined in Equation 5.1, was
the metric used to evaluate system performance.
Throughput = Bits / Second    (5.1)
Bits represents input data file size (total amount of data to be processed) and second
represents application runtime.
5.2 Experimental Setup
In Section 4.1, the system was described as consisting of seven components: a host PC,
software framework, software application, FPGA boards, hardware core managers, hardware
cores, and a network switch. Since the software framework and hardware core managers
reflect our unique contributions to this research, they were the only components discussed
in depth in Chapter 4. This section, however, will describe the components that were not
specifically created for this research, but instead configured for use in our system.
This section describes the host PC, FPGA board, network, and input data file components
of the system. The hardware core used in the system is discussed in Section 5.3.2.
5.2.1.1 Host PC
The host PC is a Dell OptiPlex GX280 desktop PC running Ubuntu 10.04 LTS (Lucid Lynx).
Virtex-5 LXT [18] platform FPGAs were used for this project. The FPGAs contain one or
more MicroBlaze processors, Ethernet MACs, block RAM, and large arrays of reconfigurable
logic, among many other features. The MicroBlaze is a high-performance 32-bit RISC
Harvard-architecture soft processor core designed with embedded systems applications in mind
[19]. The configurable logic blocks on the FPGA fabric can be used to implement a variety
of logic, including 18 × 18 multipliers, adders, and accumulators.
5.2.1.3 Hardware Base System
The hardware base system is the starting foundation for any system configuration
implemented on the FPGA. This configuration may be customized to include different hardware
cores. Figure 5.1 is a screenshot of the hardware base system’s System Assembly view in
Xilinx Platform Studio. This window displays the various components included in the
design. This particular screenshot is for a system that has been customized to include 3DES
hardware cores. The hardware base system includes the following components:
• MicroBlaze processor
• MicroBlaze Debug Module (MDM)
• Processor Local Bus (PLB)
• Local Memory Bus (LMB)
• Universal Asynchronous Receiver/Transmitter (UART)
• Local Link Tri-Mode Ethernet Media Access Controller (LL TEMAC)
• Block RAM (BRAM)
• Double Data Rate Synchronous Dynamic Random Access Memory (DDR2 SDRAM)
• DIP Switches
• System ACE Interface Controller (SysACE)
• Interrupt controller
• Timer
• LMB BRAM controller
MicroBlaze The MicroBlaze is a soft 32-bit processor core. On the Virtex 5 FPGA, more
than one MicroBlaze may be implemented. However, for our purposes, only one MicroBlaze
was included in the design. For our test setup, the MicroBlaze was configured to run at
125 MHz.
MDM The MDM enables JTAG-based debugging for the MicroBlaze processor.
PLB The PLB is the system bus that connects the processor to high-speed peripherals.
Of the various available CoreConnect buses, it offers the highest bandwidth (a 128-bit
data bus and a 32-bit address bus).
LMB The LMB is a 32-bit bus that is designed to have one master and one slave, the
MicroBlaze and a memory controller, respectively.
UART The UART core is a low-speed communication core that allows data to be
transmitted serially via an RS-232 cable. Since a PC can be easily interfaced to a UART, it is very
useful for debugging purposes.
LL TEMAC The LL TEMAC is a soft Ethernet core that enables data to be transmitted
and received over a network.
5.2.1.4 Network
In order for the host PC to be able to communicate with the FPGA boards, Ethernet
(100 Mbps) and lwIP, a lightweight version of TCP/IP, were used.
5.2.1.5 Input Data Files
The input data files used for testing contained random data and were created using the Unix
command, dd. For example, to generate 1 MB of random data, the following command was
used:
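dd if=/dev/urandom of=1MB.log bs=1024 count=1024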
if=/dev/urandom specifies that data should be read from the file /dev/urandom, a special
file that, when read, returns random bytes. of=1MB.log indicates that the data should be
written to a file named 1MB.log. bs=1024 is the block size of the data to be written (in
bytes) and count=1024 is the number of blocks to write.
Given that the system was comprised of various different components, several tools were
used to build and debug the system. For portions of the system built on top of the FPGAs,
including the hardware core manager framework and the hardware cores, the Xilinx Design
Suite was used. This set of tools includes the Xilinx Embedded Development Kit, Integrated
Software Environment, Platform Studio, Software Development Kit, and ChipScope. For
the software framework, the freely-available Eclipse Integrated Development Environment
[22] was used for C program development. Finally, for the networking portion of the system,
Wireshark [23], a free open-source network protocol analyzer, was used for both network
troubleshooting and observing network traffic.
5.3 Test Application
5.3.1 3DES
3DES is an encryption algorithm that applies the DES algorithm to a block of data three
times. 3DES uses three 64-bit keys, denoted as k1 for key 1, k2 for key 2, and k3 for key 3.
The algorithm for 3DES encryption is given in Equation 5.2. To encrypt a 64-bit block of
data, a DES encryption with k1 is first applied to the data block. Next, a DES decryption
is applied with k2 to the result of the previous encryption. A final DES encryption with k3
is then applied to produce the final 3DES-encrypted data block.
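Written with E_k and D_k denoting DES encryption and decryption under key k, and P and C denoting the plaintext and ciphertext blocks, the encryption step is

C = E_k3(D_k2(E_k1(P)))    (5.2)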
For decryption, the algorithm shown in Equation 5.3 is used. To decrypt a 64-bit block of
data, a DES decryption with k3 is first applied to the encrypted data block. Next, a DES
encryption with k2 is applied to the result of the previous decryption. Finally, this result is
then DES decrypted with k1 to obtain the decrypted data block.
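In the same notation, the decryption step is

P = D_k1(E_k2(D_k3(C)))    (5.3)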
3DES was chosen as an application since it possesses characteristics that we can use to
evaluate the flexibility design requirement. First, 3DES requires more than one input. Second,
3DES operates on data blocks that are larger than 32 bits. Third, the input data provided
to 3DES could contain any sort of combination of bytes.
Choosing 3DES as an application ended up affecting how message passing was implemented.
Since the data to which 3DES is applied can be any sequence of bytes, this eliminated
the possibility of using a delimiter-based method for reading and parsing data.
Since the focus of this work was not to design and build hardware cores, but instead to build
and analyze a system using hardware cores, a pre-built hardware core was used. The 3DES
core chosen for inclusion in our system was from OpenCores, an open-source repository of
hardware components. As of April 2011, the site hosts 800 different IP-blocks [24]. The
project page for the particular 3DES core used in our system can be found at [25].
To read and write data to/from the hardware core with the hardware core manager, a custom
data structure for 3DES was created. This data structure allowed for easier access to the
software-accessible slave registers that are used to read/write data to/from the hardware
core, since the address offsets of the different slave registers are taken care of by the definition
of the custom data structure. Code Segment 1 shows the 3DES data structure.
typedef struct {
long key1_in_A;
long key1_in_B;
long key2_in_A;
long key2_in_B;
long key3_in_A;
long key3_in_B;
long function_select;
long data_in_A;
long data_in_B;
long data_out_A;
long data_out_B;
} tripleDES;
After initializing a pointer to the hardware core, data can be written to and read from
the hardware core using pointers. An abbreviated example of using pointers to access the
hardware core is shown in Code Segment 2. The full code can be found in Appendix C.
volatile tripleDES *hw_core = (tripleDES *)(baseAddr);
long key1_in_A, key1_in_B;          /* inputs parsed from the data request message */
long data_out_A, data_out_B;        /* result read back from the core              */
hw_core->key1_in_A = key1_in_A;     /* write inputs to the slave registers         */
hw_core->key1_in_B = key1_in_B;
/* ... key 2, key 3, function select, and input data are written the same way ... */
data_out_A = hw_core->data_out_A;   /* read the computed result                    */
data_out_B = hw_core->data_out_B;
5.3.2.2 Verification
While only the runtimes for 3DES encryption were recorded, decryption was still used to
verify that 3DES had been applied correctly to a file. The following lines show how we used
hexdump to verify that the data had been processed correctly:
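# Illustrative commands; the file names are assumptions, not taken from the thesis.
hexdump decrypted.log > decrypted.hex
hexdump original.log > original.hex
diff original.hex decrypted.hex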
Since outputs were in binary format, after a file was encrypted, then decrypted, its contents
were written in hexadecimal format to another file using the hexdump command. The
hexdump of the output file was then compared to the hexdump file of the original input file
using the diff command. If diff did not produce any output, then the file was successfully
encrypted and decrypted.
In order to observe what effect varying the number and configuration of hardware processing
elements (hardware cores and FPGA boards) has on the system, the following test scenarios
were used:
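• Single core, single board
• Multiple cores, single board
• Single core per board, multiple boards
• Multiple cores per board, multiple boards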
Tests were run a total of ten times, after which the processing times were averaged and the
throughput calculated. Recorded processing times reflect times for an application conducting
3DES encryption, starting from system initialization and ending once all data has been
returned to the software application. These times were generated using the Unix time
command. Variables that were changed between tests include: input data file size (total
amount of data to be processed), number of cores, and number of boards.
In addition to these test scenarios, a software implementation of 3DES was also tested in
order to compare the performances of a hardware/software solution versus a software-only
solution.
5.4.1 Single Core, Single Board Configuration
The single core, single board configuration serves as the base case for the performance
analysis. The calculated throughputs of configurations which incorporated additional hardware
are compared to this particular configuration in order to determine whether a gain in
performance was achieved or not. Table 5.1 indicates that the baseline throughput is 128
bytes/second for a 1 kB input data file size and 135 bytes/second for a 10 kB input data file size.
Table 5.1: Single core, single board throughput results.

Number      Average Processing Time (seconds)      Throughput (Bytes/Second)
of Cores          1 kB          10 kB                  1 kB          10 kB
   1               8.0           76.0                   128           135
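For example, the 1 kB result corresponds to 1,024 bytes processed in an average of 8.0 seconds, or 1024/8.0 = 128 bytes/second.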
Following the single core, single board configuration, we performed a series of tests to observe
what effect incorporating additional hardware into the system would have on performance.
Additional hardware can be incorporated into the system in the form of additional hardware
cores or additional FPGA boards. Our first approach involved integrating more hardware
cores.
For the multiple cores, single board configuration, a total of six separate test runs were
conducted. The first test was a two core configuration. Each subsequent test added another
core to the board until a maximum of seven cores were present in the system. Only seven
cores could be configured into the system due to the fact that there is a limit to how many
peripherals can be attached to the PLB bus. As defined in the data sheet for the PLB,
the maximum allowable number of PLB slaves is 16 [26]. After
subtracting the number of required peripherals that are attached to the PLB bus, such as
those listed in Section 5.2.1.3, only room for seven additional peripherals is available.
Before conducting the tests for this configuration, we hypothesized that each additional
hardware core would increase performance by some given amount since adding more cores
would increase the number of processing elements in the system; however, the results proved
this hypothesis to be incorrect. Table 5.2, which combines the results for the single core,
single board and multiple cores, single board configurations, reveals that adding additional
cores to the system did not result in any gain in performance. In fact, throughput remained
at roughly the same rate for all configurations, with the percentage change from integrating an additional core ranging from 0 percent to at most ±2.2 percent. This small change is reflected in Figure 5.2, which graphically depicts the data from Table 5.2.
Figure 5.2: Graph of number of cores versus throughput (bytes/second) for 1 kB and 10 kB file sizes (one to seven cores, single board).
After collecting data for all the multiple cores, single board configurations, we next tested
what effect incorporating additional hardware into the system would have on performance
using our second approach of incorporating additional boards. For this data set, tests were
conducted up to three boards since we only had three boards at our disposal.
In contrast to all previously-tested configurations, the single core per board, multiple board
configuration showed increased performance when additional hardware was incorporated
into the system. Table 5.3 shows that adding additional boards to the system increased
throughput to roughly 1.5 to 2 times that of the single board configuration.
It should be noted that seeing improved performance with the single core per board, multiple
boards configuration is dependent on thread scheduling. Initial test runs for a three board
configuration revealed that a single core thread was starving the two other core threads,
causing data to only be sent to a single board. Since the two other boards in the configuration
were not receiving any data, they were not utilized, and the system emulated a single core,
single board configuration. To fix the starvation issue and ensure that all boards were
utilized, sched_yield() was incorporated into the code. sched_yield() forced threads to
give up control of the CPU once they were finished sending data to a particular board.
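As a rough illustration of this fix, the following sketch shows the shape of a per-board sending loop with sched_yield() added; it is a simplification, with send_one_job standing in for the framework's actual queue handling and sendMsg() call.

#include <sched.h>

/* Simplified per-board sender loop: after each block of work is sent to this
 * thread's board, the thread yields the CPU so that the sender threads for the
 * other boards are not starved. */
static void send_jobs_to_board(int num_jobs, void (*send_one_job)(int))
{
    int i;
    for (i = 0; i < num_jobs; i++) {
        send_one_job(i);   /* format and transmit one data request */
        sched_yield();     /* give up the CPU so the other board threads can run */
    }
}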
Table 5.3: Single core per board, multiple boards throughput results.
Figure 5.3: Graph of number of boards versus throughput (one core per board).
Based on the results of the multiple cores per board, single board configuration, the con-
clusion was that multiple cores per board, multiple boards configurations would not yield
increased performance. In fact, the multiple cores per board, single board configuration re-
sults suggested that multiple cores per board, multiple boards configurations would perform
similarly to a single core per board, multiple boards configuration. Although it was deemed unnecessary to iterate through all the possible multiple cores per board, multiple boards configurations, since they were not expected to generate any significant new data, two tests were conducted to verify the theory that such configurations would produce similar results to a single core per board, multiple boards configuration.
These two tests included a test for the median number of cores per board (four cores per
board for a total of 12 cores in the system), as well as the maximum number of cores per
board (seven cores per board for a total of 21 cores in the system). By testing the median and maximum number of cores per board, we can reasonably assume that the throughputs of the untested configurations would not vary significantly from these measured values.
As shown in Table 5.4, having four or seven cores per board in a three board configuration for
a total of 12 and 21 cores in the system, respectively, did not noticeably impact throughput.
At most, a -2.9 percent change in throughput was observed in going from a single core per board to seven cores per board. This data supports our theory that multiple cores per board,
multiple boards configurations would not yield increased performance.
Table 5.4: Multiple cores per board, multiple boards configuration throughput results com-
parison: one core per board vs. four cores per board vs. seven cores per board.
To evaluate the performance of the various FPGA board and hardware core configurations,
a software-only implementation of 3DES was written using MATLAB for comparison.
The MATLAB code for 3DES was created with the help of code from [27]. Only MATLAB
source code for DES encryption was provided, so code for DES decryption was created by
modifying the DES encryption code. DES encryption and decryption were then used to
form the 3DES algorithm. Source code for the MATLAB implementation is provided in
Appendix D.
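For reference, the standard 3DES construction applies DES in an encrypt-decrypt-encrypt sequence, computing the ciphertext C from the plaintext P as C = E_k3(D_k2(E_k1(P))); decryption applies the inverse operations in reverse order. This is why the DES encryption and decryption routines together are sufficient to form 3DES.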
Recorded processing times reflect times for 3DES encryption; however, 3DES decryption was
used to verify the output files. The output was verified using the “Compare Selected Files”
feature in MATLAB. The profiler feature was used to record runtime. The same input data
files and input keys used for the FPGA-based configurations were used for the MATLAB
implementation. Similarly, tests were run a total of ten times.
Results showed that the MATLAB implementation of 3DES achieved a better throughput
than the best platform configuration. Even the best throughput of the single core per board,
multiple boards configuration of just under 400 bytes/second, shown in Table 5.3, could not
compare to the performance of the MATLAB implementation, whose throughput leveled off
at around 600 bytes/second, as shown in Table 5.5.
Looking over the data collected for all the various system configurations, we can draw some
preliminary conclusions. First, of the two approaches (additional cores and additional boards) used to evaluate the effect of adding hardware to the system, only the addition of more boards, with a single core on each board, demonstrated an improvement in performance. However, even the best single core per board, multiple boards configuration could not achieve the
same or greater performance than the MATLAB software implementation. Since hardware
typically achieves higher performance than software, particularly in the case of 3DES [9], the
fact that the software implementation of 3DES performed better than our hardware/software
solution using FPGAs suggests that a performance bottleneck exists within the system. In
the next section, we explore possible sources of the bottleneck.
Based on how data flows through the system, we can identify five potential locations for the
bottleneck:
• Hardware core
• File I/O
• Network transmission
• Software framework
• Hardware framework
We provide quantitative data on the hardware core, file I/O, and network transmission since
we have the means to isolate those components and observe their contribution to the aver-
age processing time. For the software and hardware frameworks, we offer theories on the
possibility of an existing bottleneck.
5.7.1 Hardware Core
To observe the impact of the 3DES computation on overall performance, we remove the
core from the system and replace it with a dummy core. The structure of the dummy core
resembles that of a 3DES core in that it takes in the same number and size of input data,
as well as output data. The difference between the cores is that the dummy core does not perform the 3DES computation; instead, it simply assigns the input data to the output. By using a dummy core, the time it takes for the 3DES computation to execute in hardware is removed from the overall processing time.
Table 5.6 provides the 1 kB input data file and 10 kB input data file average processing
times and throughputs for the dummy core and 3DES core. Comparing these values, we
notice that the difference between the throughputs is small: only a 0.7 percent change in throughput was observed between the dummy core and 3DES core in the case of a 10 kB
input data file size while no change was observed with the 1 kB input data file size. Based
on these values, we can eliminate the hardware core as the source of the bottleneck.
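As a rough sketch of what such a pass-through function might look like, the following mirrors the core-specific function interface used in Appendix C.3 (with uint32_t standing in for Xuint32); it is an illustration, not the exact dummy core code used in the tests.

#include <stdint.h>
#include <string.h>

#define DATA_SIZE 8   /* one 64-bit block, as defined in wrapper.h */

/* Pass-through "dummy core" function: same interface as the core-specific
 * 3DES function, but the input block is simply copied to the output instead
 * of being written to and read back from the hardware core. */
static void dummyFunction(uint8_t *dataIn, uint8_t *dataOut, uint32_t baseAddr)
{
    (void) baseAddr;                      /* the hardware core is never touched */
    memcpy(dataOut, dataIn, DATA_SIZE);   /* output = input */
}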
5.7.2 File I/O
After determining that the 3DES hardware core was not the source of the bottleneck, we
looked to see if file I/O was limiting performance. To remove the time spent on file I/O from the measurement, gettimeofday() was used to measure elapsed time, starting once all input data had been read into an input data buffer and ending when all data had been received back from the FPGA boards.
As was the case with the hardware core, no significant differences in performance were
observed; a -7.4 percentage change was calculated between the throughput without file I/O
and the throughput with file I/O for the 1 kB input data file size while only a 0.7 percentage
change was calculated with the 10 kB input data file size. This data suggests that the
bottleneck is not a result of file I/O.
File    Average Processing Time (Seconds)    Throughput (Bytes/Second)       Percentage Change of
Size    No File I/O     With File I/O        No File I/O    With File I/O    Throughput (%)
1 kB    7.6             8.2                  135            125              -7.4
10 kB   74.5            74.4                 137            138               0.7

Table 5.7: Processing times excluding time for file I/O (1 board, 7 cores per board configuration).
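A minimal sketch of this kind of elapsed-time measurement with gettimeofday() is shown below; the instrumentation points are illustrative rather than the exact locations used in the framework.

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);   /* all input data is buffered at this point */
    /* ... send data requests and receive all data responses here ... */
    gettimeofday(&end, NULL);     /* all data has been returned by the boards */

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1e6;
    printf("Elapsed time (excluding file I/O): %.3f seconds\n", elapsed);
    return 0;
}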
5.7.3 Network Transmission
Finally, to estimate the time that data spends traveling through the network, we refer back
to Section 4.4, which describes the messages that are exchanged between the host PC and
FPGA boards.
While there are four types of messages that are exchanged, the bulk of the messages are data
requests and data responses. Therefore, we generalize the amount of data that flows through
the network by only considering data requests and data responses. While core requests and
core responses do contribute to network traffic, since there is only one core request and one
core response required per FPGA board, the contribution, in comparison to that of the total
number of data requests and responses, is rather small.
A data request for 3DES requires core type (8 bytes), job ID (4 bytes), and input data (36 bytes) components, plus a message header (4 bytes) that stores the total length of the message so that the number of bytes to be read is known, amounting to a total of 52 bytes. A data response for 3DES requires core type (8 bytes), job ID (4 bytes), and output data (8 bytes) components, plus the same 4-byte message header, for a total of 24 bytes. Since
the system uses Ethernet and TCP/IP, an additional 54 bytes (20 bytes for a typical IP
header, 20 bytes for TCP header, and 14 bytes for Ethernet header) must be added to the
data request and response. In total, a data request requires 106 bytes and a data response
requires 78 bytes. Summing these up, every 8 bytes of input data to be processed by a 3DES
hardware core requires 184 bytes to be transmitted over the network.
We can calculate the total amount of bytes transmitted over the network using Equation 5.4,
where n represents the total number of bytes transmitted over the network, f equals the input
data file size in bytes, d is the input data size in bytes, q is the bytes required for a data
request, and r is the bytes required for a data response.
n = (f / d) × (q + r)    (5.4)
Using a 10 kB input data file as an example, we can calculate the total number of bytes
transmitted over the network as follows:
(10240 bytes / 8 bytes) × (106 bytes + 78 bytes) = 235520 bytes
Since our Ethernet configuration can transmit at 100 Mbps, we can estimate the total time that data spends on the network as:

(235520 bytes × 8 bits/byte) / 100 Mbps ≈ 18.8 ms
Comparing 18.8 ms to the processing times provided in Tables 5.2 and 5.3, which show the
average processing times for a 10 kB input data file ranging from 26.5 to 74.2 seconds, reveals
that the time data spends traveling through the network amounts to only a small portion of
the overall processing time. Applying this same method of measuring network transmission
time to the 1 kB input data file case, we come to the conclusion that the bottleneck is not
related to the time that data spends traveling through the network.
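The same back-of-the-envelope estimate can be expressed as a few lines of C, using the per-request and per-response sizes derived above:

#include <stdio.h>

int main(void)
{
    const double f = 10240.0;  /* input data file size in bytes (10 kB example) */
    const double d = 8.0;      /* input data block size in bytes per request */
    const double q = 106.0;    /* total bytes per data request, including all headers */
    const double r = 78.0;     /* total bytes per data response, including all headers */

    double n    = (f / d) * (q + r);            /* Equation 5.4: total bytes on the network */
    double t_ms = (n * 8.0) / 100e6 * 1000.0;   /* time on a 100 Mbps link, in milliseconds */

    printf("bytes transmitted: %.0f, transmission time: %.1f ms\n", n, t_ms);
    return 0;
}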
5.7.4 Analysis
Of the five potential causes of a performance bottleneck, quantitative data was collected
for the hardware core, file I/O, and network transmission. This data indicated that the
bottleneck is not caused by the hardware core, file I/O, or network transmission, suggesting
that the problem exists in either the software or hardware core manager frameworks.
When evaluating the effect of incorporating additional hardware into the system on per-
formance, we observed improved performance in the single core per board, multiple boards
configuration, but not in the multiple cores, single board configuration. Based on this obser-
vation, we theorize that a bottleneck exists in the hardware core manager framework as a
result of the system convention of there only being one input data queue per core type.
Queues must be protected from being accessed by multiple threads at the same time in order
to ensure that data is read and written properly. As a result of this protection scheme, only
one thread is allowed to access the queue at a time; therefore, there is a lack of parallelism
since other threads must wait their turn to pull data from the queue. We speculate that the
performance gain in the single core per board, multiple boards configuration was due to the
fact that, by adding additional boards to the system, additional hardware core managers were
made available for processing. The presence of additional hardware core managers meant
that there were more input data queues present in the system. Since each of these input data
queues was on a separate board and independent of each other, it would, in theory, be possible
for each of these queues to be operated on at the same time, thus achieving parallelism. In
the next chapter, we suggest potential methods of eliminating the performance bottleneck
that can be explored as future work.
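As a rough illustration of this serialization, the sketch below shows a mutex-protected queue shared by several worker threads; it is a simplification of the framework's message-queue handling, not the actual code.

#include <pthread.h>

/* One shared input data queue per core type: every worker thread must take the
 * same mutex before pulling a job, so dequeue operations are serialized even
 * when several hardware cores are idle.  The mutex is assumed to have been set
 * up with pthread_mutex_init() (or PTHREAD_MUTEX_INITIALIZER). */
typedef struct {
    pthread_mutex_t mutex;
    int jobs[256];
    int count;
} input_queue;

static int dequeue_job(input_queue *q, int *job)
{
    int ok;
    pthread_mutex_lock(&q->mutex);     /* only one thread may touch the queue at a time */
    ok = (q->count > 0);
    if (ok)
        *job = q->jobs[--q->count];
    pthread_mutex_unlock(&q->mutex);
    return ok;
}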
5.8 Summary
This chapter discussed the results gathered from the performance analysis of the system. The
next chapter summarizes this research and concludes with a discussion of potential future
work.
Chapter 6
Conclusion
6.1 Summary
This work involved designing and developing two frameworks: a software framework that provides users with an API for developing a software application that interfaces with multiple FPGA boards, and a hardware core manager framework that gives users the ability to configure and interact with multiple FPGA boards and/or hardware cores. We demonstrated that the system is
flexible, in that it could accommodate various application requirements, such as multiple
inputs and inputs larger than 32 bits. We also showed that the system is scalable, in that
it could accommodate multiple hardware cores and FPGA boards. Using an application
developed using the frameworks, we performed an analysis of various system configurations
to observe the effects of incorporating additional hardware components (FPGA boards and
hardware cores) on performance. While the results of our single core per board, multiple
board configuration test scenario showed an increase in performance, the results of the other
test configurations and the software implementation revealed that a performance bottleneck
exists in the system. With a series of tests meant to probe the system for the bottleneck, we
were able to eliminate the hardware core, file I/O, and network transmission as sources of
the bottleneck. For the two remaining possible bottleneck locations, the software framework
and the hardware core manager framework, we offered theories on what could potentially
be causing the bottleneck. Suggestions for fixing the potential bottleneck in these areas are
described in the following section, which discusses future work.
6.2 Future Work
One area of potential future research is exploring methods to improve the performance of
multiple core configurations. The results presented in Chapter 5 revealed that there was no
performance gain when multiple cores were added to the system. In Section 5.7.4, we offered
a theory that the one input data queue per core type system convention, in combination
with the need to protect input data queues from multiple access, caused there to be a lack
of parallelism in the hardware core manager framework. Noting that the single core per
board, multiple boards configuration achieved improved performance, exploring methods of
incorporating multiple hardware core managers on a single board appears to be a viable
path to increasing parallelism in the hardware core manager framework. This solution could
potentially be achieved by configuring an FPGA to support more than one MicroBlaze
processor, thereby allowing more than one hardware core manager to operate on the FPGA.
Another option to increase parallelism would be to redesign the hardware core manager as
a hardware component.
In addition to exploring methods of increasing parallelism, more work could be done in terms
of application development using the frameworks. This work only covered one application,
3DES. Future work could involve developing applications of a different category, such as
digital signal processing or bioinformatics.
Bibliography
[1] D. M. Harris and S. L. Harris, Digital Design and Computer Architecture. Elsevier,
2007.
[2] S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-
Based Computation. Morgan Kaufmann, 2007.
[3] L. Agarwal, M. Wazlowski, and S. Ghosh, “An asynchronous approach to efficient exe-
cution of programs on adaptive architectures utilizing FPGAs,” in IEEE Workshop on
FPGAs for Custom Computing Machines, 1994, pp. 101–110.
[7] K. Puttegowda, W. Worek, N. Pappas, A. Dandapani, P. Athanas, and A. Dickerman,
“A run-time reconfigurable system for gene-sequence searching,” in VLSID ’03: Pro-
ceedings of the 16th International Conference on VLSI Design. Washington, DC, USA:
IEEE Computer Society, 2003, p. 561.
[14] National Center for Biotechnology Information GenBank Statistics, “GenBank
Growth.” [Online]. Available: http://slaac.east.isi.edu/
[19] ——, “Microblaze processor reference guide,” 2008. [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf
[25] “3DES (Triple DES) / DES (VHDL) :: Overview.” [Online]. Available: http://opencores.org/project,3des_vhdl
[26] “Processor Local Bus (PLB) v4.6 (v1.04a).” [Online]. Available: http://www.xilinx.com/support/documentation/ip_documentation/ds531.pdf
Appendix A
Software Framework
A.1 wrapper.c
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include ”wrapper.h”
// printf mutex
pthread mutex t stdOutMutex = PTHREAD MUTEX INITIALIZER;
// socket mutex
pthread mutex t sendMutex = PTHREAD MUTEX INITIALIZER;
// thread ID mutex
pthread mutex t threadIDMutex = PTHREAD MUTEX INITIALIZER;
// thread IDs
pthread t threadID[NUM BOARDS];
int threadIDCount = 0;
int fileSize ;
int coreCount = 0;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: setupSocket
Parameter Definition:
− fpgaIP: IP address of FPGA board
− fpgaPort: known port
int sock;
struct sockaddr in fpgaAddr;
memset(&fpgaAddr, 0, sizeof(fpgaAddr));
fpgaAddr.sin family = AF INET;
fpgaAddr.sin addr.s addr = inet addr(fpgaIP);
fpgaAddr.sin port = htons(fpgaPort);
return sock;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: setUpQueues
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void setUpQueues() {
int queueIndex;
strcpy(coreQueue[queueIndex].coreType, coreMap[queueIndex].coreType);
// Message key needs to be unique− if you add NUM CORE TYPES to the
// index to create the key value for the result queues, you will
// guarantee that you won’t get the same key value
msgidOut = msgget((key t) queueIndex, (0666 | IPC CREAT));
}
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: setUpCoreInfo
Parameter Definition:
− writeSock: write socket
− readSock: read socket
− cores: list of cores in system
− numCores: number of cores in system
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int setUpCoreInfo(int writeSock, int readSock, ipCore ∗cores, int numCores) {
int bytesRcvd;
int coresRcvd = 0;
coresRcvd = ntohl(coresRcvd);
coreInfoPtr += sizeof(int);
coreCount++;
return coresRcvd;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: readFile
Parameter Definition:
− arg: pointer to FileThrdInfo structure
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void ∗readFile(void ∗arg) {
FileThrdInfo ∗fileData ;
fileData = (FileThrdInfo ∗) arg;
FILE ∗ inFilePtr;
msgStruct msg;
// Open file
if (( inFilePtr = fopen(fileData−>fileName, ”rb”)) == NULL) {
fprintf ( stderr , ”Error opening file %s \n\r”, fileData−>fileName);
removeAllQueues();
exit (1);
}
int j = 0;
int bytesRead;
int leftoverBytes ;
int numBytesToRead;
leftoverBytes = fileSize % DATA SIZE;
numBytesToRead = fileSize − leftoverBytes;
// Set job ID
// For the message queue, the msg type must be a positive number,
// so increment numReceived before using it as the job id
pthread mutex lock(&((fileData−>queue)−>numReceivedMutex));
(( fileData−>queue)−>numReceived)++;
memcpy(&(msg.msgCount), &((fileData−>queue)−>numReceived), JOB ID SIZE);
pthread mutex unlock(&((fileData−>queue)−>numReceivedMutex));
// Place data into message queue
if ((msgsnd((fileData−>queue)−>msgid, &msg,
(sizeof(msg) − sizeof(long)), 0)) == −1) {
fprintf ( stderr , ”read file msgsnd error! \n\r”);
removeAllQueues();
exit (1);
}
j++;
if ( leftoverBytes > 0) {
msgPtr += bytesRead;
memset(msgPtr, 0, (DATA SIZE − leftoverBytes));
// Terminate thread
pthread exit(NULL);
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: tripleDES
Parameter Definition:
− arg: pointer to ThrdInfo structure
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
// Core−specific function (one per core type)
void ∗tripleDES(void ∗arg) {
// Grab input data from the queue (make sure to have a mutex)
// Parse the input string ( this will be different for each string )
// Format the request −> make sure to include the core type and job ID
// There should be a separate job ID counter for each queue that
// will need to be protected by a mutex.
my data = (ThrdInfo ∗) arg;
msgStruct msgBuffer;
FILE ∗ keysFile;
int outBytes = 0;
// Get keys
while (fscanf(keysFile , ”%2x”, keysPtr) == 1) {
keysPtr++;
}
int leftoverBytes ;
int numBytesToRead;
leftoverBytes = fileSize % DATA SIZE;
numBytesToRead = fileSize − leftoverBytes;
int numJobsProcessed = 0;
msgPtr = msg;
// Message type
memcpy(msgPtr, msgType, MSG TYPE SIZE);
msgPtr += MSG TYPE SIZE;
// Core type
memcpy(msgPtr, (my data−>core)−>coreType, CORE TYPE SIZE);
msgPtr += CORE TYPE SIZE;
// Job ID
memcpy(msgPtr, &(msgBuffer.msgCount), JOB ID SIZE);
msgPtr += JOB ID SIZE;
// Keys
memcpy(msgPtr, keys, TOTAL KEY SIZE);
msgPtr += TOTAL KEY SIZE;
// Function select
memcpy(msgPtr, &functionSelect, FUNCT SELECT SIZE);
msgPtr += FUNCT SELECT SIZE;
// Input data
memcpy(msgPtr, msgBuffer.msgBuffer, DATA SIZE);
// Terminate thread
pthread exit(NULL);
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: matchToMap
Parameter Definition:
− inCoreType: core type
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int matchToMap(char ∗inCoreType) {
// Need to compare the core type with the core types in the core map entries.
// If there is a match, then this is a valid core type.
//
// If there is a mismatch, then this core type has not been defined with a
// core map entry. The user will need to make an entry, in this case.
int i ;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: mapToQueue
Parameter Definition:
− inCoreType: core type
− queueList: input data queue
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int mapToQueue(char ∗inCoreType, queueInfo queueList[]) {
// If the core type matches the core type of the queue, then place that
// result in that queue.
//
// If there is a mismatch, then something went wrong.
//
int i ;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: recvResults
Parameter Definition:
− arg: pointer to socket to receive results from
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void ∗recvResults(void ∗arg) {
int ∗sockPtr = (int ∗) arg;
int sock = ∗sockPtr;
int bytesRcvd = 0;
int rListIndex;
int check;
msgStruct msg;
Result r ;
if (numResults == numToRecv) {
break;
}
} else {
fprintf ( stderr ,
”Error: Received message core type does not match a queue core type! \n\r”);
exit (1);
}
int i = 0;
for ( i = 0; i < NUM BOARDS; i++) {
pthread cancel(threadID[i ]);
}
// Terminate thread
pthread exit(NULL);
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: writeResults
Parameter Definition:
− arg: pointer to WFileThrdInfo data structure
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void ∗writeResults(void ∗arg) {
WFileThrdInfo ∗fThrdInfo;
fThrdInfo = (WFileThrdInfo ∗) arg;
msgStruct msg;
FILE ∗ outFilePtr;
int leftoverBytes ;
int numBytesToRead;
leftoverBytes = fileSize % DATA SIZE;
numBytesToRead = fileSize − leftoverBytes;
int numResultsWritten = 0;
Result r ;
removeResult(&r, fThrdInfo−>rList);
fwrite(&r.data, DATA SIZE, 1, outFilePtr);
numResultsWritten++;
}
// Close file
fclose (outFilePtr);
// Terminate thread
pthread exit(NULL);
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: removeAllQueues
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void removeAllQueues() {
int msgqIndex = 0;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: recvMsg
Parameter Definition:
− sock: socket to read message from
− inMsg: message to read
− maxMsgSize: length of message to receive
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int recvMsg(int sock, uint8 t ∗inMsg, int maxMsgSize) {
int msgHeaderSize = sizeof(int);
int msgTotalSize = 0;
int bytesRcvd = 0;
return bytesRcvd;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: sendMsg
Parameter Definition:
− sock: socket to sent message through
− outMsg: message to write
− msgSize: length of message to write
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int sendMsg(int sock, uint8 t ∗outMsg, int msgSize) {
int bytesSent = 0;
int msgHeaderSize = sizeof(int);
int msgTotalSize = htonl(msgSize);
uint8 t msgBuffer[msgBufferSize];
msgBufferPtr += msgHeaderSize;
return bytesSent;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: readn
Parameter Definition:
− sock: socket to read data from
− inMsg: message to read from socket
− numBytesToRead: number of bytes to read
Adapted from:
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
numBytesLeft = numBytesToRead;
numBytesRead = 0; // Call recv again
} else {
return −1;
}
} else if (numBytesRead == 0) {
printf (”No more bytes... \n\r”);
break; // No more bytes
}
numBytesLeft −= numBytesRead;
inMsgPtr += numBytesRead;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: writen
Parameter Definition:
− sock: socket to write data to
− outMsg: message to write to socket
− numBytesToWrite: number of bytes to write
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int writen(int sock, void ∗outMsg, int numBytesToWrite) {
int numBytesLeft;
int numBytesWritten;
void ∗outMsgPtr = outMsg;
numBytesLeft = numBytesToWrite;
numBytesLeft −= numBytesWritten;
outMsgPtr += numBytesWritten;
}
return numBytesToWrite;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Adapted from:
C Primer Plus by Stephen Prata
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: mapToResultList
Usage: place output data into the appropriate data results queue
Parameter Definition:
− inCoreType: core type
− rl: data results queue
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int i ;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
copyToNode and copyToItem function definitions
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
static void copyToNode(Result result, Node ∗pn);
static void copyToItem(Node ∗pn, Result ∗result);
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: copyToNode
Parameter Definition:
− result : result to copy
− pn: node to copy result to
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
static void copyToNode(Result result, Node ∗pn) {
pn−>result = result;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: copyToItem
Parameter Definition:
− pn: node to copy result to
− result : result to copy
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
static void copyToItem(Node ∗pn, Result ∗result) {
∗ result = pn−>result;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: initResultList
Parameter Definition:
− rl: data results queue
− inCoreType: core type
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void initResultList(ResultList ∗rl , char ∗inCoreType) {
memcpy(rl−>coreType, inCoreType, CORE TYPE SIZE);
pthread mutex init(&(rl−>mutex), NULL);
rl −>front = rl−>rear = NULL;
rl −>numItems = 0;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: resultListIsEmpty
Parameter Definition:
− rl: data results queue
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
bool resultListIsEmpty(const ResultList ∗rl) {
return rl−>numItems == 0;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: resultListItemCount
Parameter Definition:
− rl: data result queue
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int resultListItemCount(const ResultList ∗rl) {
return rl−>numItems;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: addResult
Parameter Definition:
− r: result to add
− rl: data results queue to add result to
Return value: true ( if successful )
false ( if unsuccessful)
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
bool addResult(Result ∗r, ResultList ∗rl ) {
Node ∗pnew;
Result temp;
if (pnew == NULL) {
fprintf ( stderr , ”Unable to allocate memory!\n”);
exit (1);
}
memcpy(&temp, r, sizeof(Result));
copyToNode(temp, pnew);
pnew−>next = NULL;
if (resultListIsEmpty(rl )) {
rl −>front = pnew; //Item goes to front
} else if ((pnew−>result).seqNum < (rl−>front−>result).seqNum) {
pnew−>next = rl−>front;
rl −>front = pnew; // Item goes to front
} else {
current = rl−>front−>next;
last = rl−>front;
int keepGoing = 1;
// Keep going through list until sequence number is < next number in list
while (keepGoing && current != NULL) {
if ((pnew−>result).seqNum > ((current−>result).seqNum)) {
last = current;
current = current−>next;
} else {
keepGoing = 0;
}
}
return true;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: removeResult
Usage: removes results from the front of the data results queue
Parameter Definition:
− r: result to remove
− rl: data results queue to remove result from
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
bool removeResult(Result ∗result, ResultList ∗rl ) {
Node ∗pt;
if (resultListIsEmpty(rl ))
return false;
copyToItem(rl−>front, result);
pt = rl−>front;
rl −>front = rl−>front−>next;
free (pt );
rl −>numItems−−;
if ( rl −>numItems == 0)
rl −>rear = NULL;
return true;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: emptyResults
Parameter Definition:
− rl: data results queue to empty
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void emptyResults(ResultList ∗rl) {
Result dummy;
while (!resultListIsEmpty(rl))
removeResult(&dummy, rl);
}
A.2 wrapper.h
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <stdbool.h>
#include <sys/time.h>
#include <sys/msg.h>
#include <sched.h>
#define KEY SIZE 8 // Input data
#define NUM KEYS 3 // Input data
#define TOTAL KEY SIZE (KEY SIZE ∗ 3) // Input data
#define FUNCT SELECT SIZE 4 // Input data
#define DATA SIZE 8 // Input data
#define RESULT SIZE (CORE TYPE SIZE + JOB ID SIZE + DATA SIZE) // Output data
#define MAX MSG SIZE (MSG TYPE SIZE + CORE TYPE SIZE + JOB ID SIZE + \
(KEY SIZE ∗ NUM KEYS) + FUNCT SELECT SIZE + DATA SIZE)
//
// ipCore
//
// Represents a hardware core
//
typedef struct {
char coreType[CORE TYPE SIZE]; // core type; i.e. multiplier
int location ; // location aka socket descriptor of board it ’s located on
} ipCore;
//
// board
//
// Represents an FPGA board
//
typedef struct {
char ipAddr[IP ADDR SIZE]; //IP address of board
int sock; // write socket
} board;
//
// coreMapEntry
//
// Represents a core map entry
//
// User must use this to associate core types with their core−specific functions
//
typedef struct {
char coreType[CORE TYPE SIZE]; // Core type, i.e. ”tripleDES, ”FIR”...
char fileName[MAX FILE NAME SIZE]; // Input data file associated with core type
void (∗funct)(char ∗dataIn, char∗ dataOut); // Function pointer
} coreMapEntry;
//
// queueInfo
//
// Represents an input data queue
//
typedef struct {
char coreType[CORE TYPE SIZE]; // core type
int msgid; // message id of queue
int numReceived; // number of messages received
pthread mutex t numReceivedMutex; // mutex to protect numReceived
int numProcessed; // number of messages processed
pthread mutex t numProcessedMutex; // mutex to protect numProcessed
} queueInfo;
//
// Result
//
// Represents a result from a hardware core
//
typedef struct {
uint8 t data[DATA SIZE]; // result from hardware core
int seqNum; // job ID
} Result;
//
// Node
//
// Represents a node in the data results queue
//
typedef struct node {
Result result ; // data result
struct node ∗next; // pointer to next result
} Node;
//
// ResultList
//
// Represents a data results queue
//
typedef struct {
char coreType[CORE TYPE SIZE]; //core type
pthread mutex t mutex; // mutex
Node ∗front; //pointer to the front of the queue
Node ∗rear; //pointer to the rear of the queue
int numItems; //number of items in the queue
} ResultList ;
//
// ThrdInfo
//
// Information needed by a core−specific thread
//
typedef struct {
ipCore ∗core; // core associated with thread
queueInfo ∗queue; // queue associated with thread
} ThrdInfo;
//
// FileThrdInfo
//
// Information needed by a read file thread
//
typedef struct {
char fileName[MAX FILE NAME SIZE]; // file to read from
queueInfo ∗queue; // queue to put data from file into
} FileThrdInfo;
//
// WFileThrdInfo
//
// Information needed by a write results file thread
//
typedef struct {
char fileName[MAX FILE NAME SIZE]; // file to write to
ResultList ∗rList ; // data results queue to pull data from
} WFileThrdInfo;
//
// msgStructure
//
// Represents a message
//
typedef struct {
long msgCount; // job ID
uint8 t msgBuffer[MAX MSG SIZE]; // message buffer
} msgStruct;
//
// EXTERNAL VARIABLES
//
extern ResultList rList[NUM CORE TYPES];
extern coreMapEntry coreMap[NUM CORE TYPES];
extern queueInfo coreQueue[NUM CORE TYPES];
extern coreMapEntry coreMap[NUM CORE TYPES];
extern msgStruct msgStructure;
extern int fileSize ;
extern int functionSelect;
extern pthread t threadID[NUM BOARDS];
extern int threadCount;
extern struct timeval start, end;
extern int coreCount;
//
// EXTERNAL FUNCTIONS
//
// Communication−related
extern int setupSocket(char ∗fpgaIP, unsigned short fpgaPort);
extern int setUpCoreInfo(int writeSock, int readSock, ipCore ∗cores,
int numCores);
extern int recvMsg(int sock, uint8 t ∗inMsg, int maxMsgSize);
extern int sendMsg(int sock, uint8 t ∗outMsg, int sizeOfMsg);
extern int readn(int sock, void ∗inMsg, int numBytesToRead);
extern int writen(int sock, void ∗outMsg, int numBytesToWrite);
// Queue−related
extern bool resultListIsEmpty(const ResultList ∗rl);
extern int resultListItemCount(const ResultList ∗rl);
extern bool addResult(Result ∗r, ResultList ∗rl);
extern bool removeResult(Result ∗result, ResultList ∗rl );
extern void emptyResults(ResultList ∗rl);
extern int mapToResultList(char ∗inCoreType, ResultList rl[]);
extern void initResultList(ResultList ∗rl , char ∗inCoreType);
extern void setUpQueues();
extern void removeAllQueues();
// Data formatting−related
extern void ∗writeResults(void ∗arg);
extern void ∗readFile(void ∗arg);
extern void ∗recvResults(void ∗arg);
// Mapping functions
extern int matchToMap(char ∗inCoreType);
extern int mapToQueue(char ∗inCoreType, queueInfo queueList[]);
// Core−specific (3DES)
extern void ∗tripleDES(void ∗arg);
// Other
extern ipCore ∗getCore(char ∗coreType, ipCore cores[]);
Appendix B
B.1 FPGA.c
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <errno.h>
#include ”wrapper.h”
// File names
FILE ∗inFilePtr; // input data file
FILE ∗outFilePtr; // output data file
FILE ∗keysFile;
// Core map
coreMapEntry coreMap[NUM CORE TYPES] = { { ”3DES”, ”3DES.txt”,
(void ∗) tripleDES } };
// THREADS
//
// One thread per input data file to read input data into the queues
// One thread per result data queue to read result data to file
pthread t readFileThrds[NUM CORE TYPES];
pthread t writeResultThrds[NUM CORE TYPES];
// Thread information
FileThrdInfo readFileThrdInfo[sizeof(coreMap)];
WFileThrdInfo writeFileThrdInfo[sizeof(coreMap)];
// Message structure
msgStruct msgStructure = { 1 };
// MESSAGE COMPONENTS
// Actual message
uint8 t mesg[MAX MSG SIZE];
uint8 t ∗msgPtr = mesg;
// MAIN
int main(int argc, char ∗argv[]) {
// COMMAND−LINE ARGUMENTS
// input data file name
char ∗inFileName = argv[5];
// Core setup
int coresSetUp = 0;
int totalCoresSetUp = 0;
// Boards
board ∗boards;
boards = (board ∗) malloc(numBoards ∗ sizeof(board));
// Cores
ipCore ∗cores;
cores = (ipCore ∗) malloc(numCores ∗ sizeof(ipCore));
// MUTEXES
// THREADS
// Thread ID
//
// Used to kill threads in multiple board configurations , so that the
// application can exit
pthread t ∗ threadID;
threadID = (pthread t ∗) malloc(numBoards ∗ sizeof(pthread t));
// Thread information
ThrdInfo ∗inThrdInfo;
inThrdInfo = (ThrdInfo ∗) malloc(numCores ∗ sizeof(ThrdInfo));
// Check that the given core type is valid (of the right length)
if ( strlen (inCoreType) > CORE TYPE SIZE) {
fprintf (
stderr ,
”ERROR: %s has too many characters. Max number of characters is %d!!! \n\r”,
inCoreType, CORE TYPE SIZE);
exit (1);
}
// Check that the given core type is defined in the core map
int i ;
for ( i = 0; i < sizeof(coreMap); i++) {
if (strcmp(coreMap[i].coreType, inCoreType) == 0) {
strcpy(coreMap[i].fileName, inFileName);
strcpy(coreQueue[i].coreType, coreMap[i].coreType);
initResultList ( rList , coreMap[i].coreType);
break;
}
// ERROR: Core type does not match a core type in the core map!
fprintf ( stderr ,
”ERROR: %s is NOT defined in the program core map!!! \n\r”,
inCoreType);
exit (1);
}
i = 0;
totalCoresSetUp += coresSetUp;
i++;
}
// Close file
fclose (inFilePtr );
int index;
index = mapToQueue(coreMap[i].coreType, coreQueue);
int j ;
// Does this core have an input data queue?
if ((index2 = mapToQueue(cores[j].coreType, coreQueue)) < 0) {
// ERROR!
printf (”index2: %d \n\r”, index2);
printf (”cores type: %s \n\r”, cores[j ]. coreType);
printf (
”ERROR: This core type does not exist in the queue list!!! \n\r”);
}
int index;
index = mapToResultList(coreMap[i].coreType, rList);
// Clean up!
for ( i = 0; i < numBoards; i++) {
close (boards[i ]. sock);
return 0;
#!/bin/sh
echo −n ”Enter core type: ”
read −e CORETYPE
echo −n ”Enter total number of boards in the system: ”
read −e NUMBOARDS
echo −n ”Enter total number of cores in the system: ”
read −e NUMCORES
echo −n ”Enter input filename: ”
read −e INFILENAME
# UNCOMMENT TO VERIFY 3DES USING DECRYPTION
#echo −n ”Enter input file hexdump file: ”
#read −e INFILEHEXDUMP
OUTFILENAME=outFile.txt
# UNCOMMENT TO VERIFY 3DES USING DECRYPTION
#OUTFILENAME2=outFile2.txt
#OUTFILENAME2HEXDUMP=outFile2HexDump.txt
for i in 1 2 3 4 5 6 7 8 9 10
do
time ./FPGA $CORETYPE $NUMBOARDS $NUMCORES 1 $INFILENAME $OUTFILENAME
# UNCOMMENT TO VERIFY 3DES USING DECRYPTION
# ./FPGA $CORETYPE $NUMBOARDS $NUMCORES 0 $OUTFILENAME $OUTFILENAME2
# hexdump $OUTFILENAME2 > $OUTFILENAME2HEXDUMP
# diff $OUTFILENAME2HEXDUMP $INFILEHEXDUMP
done
CC = gcc
HDRS = wrapper.h
OBJS = wrapper.o
CFLAGS = −Wall −g −lpthread
EXECS = FPGA
all : $(EXECS)
clean :
/bin/rm −f $(OBJS) $(EXECS) core∗ ∗˜ semantic.cache
Appendix C
C.1 main.c
/∗
∗ Copyright (c) 2008 Xilinx, Inc. All rights reserved .
∗
∗ Xilinx, Inc.
∗ XILINX IS PROVIDING THIS DESIGN, CODE, OR INFORMATION ”AS IS” AS A
∗ COURTESY TO YOU. BY PROVIDING THIS DESIGN, CODE, OR INFORMATION AS
∗ ONE POSSIBLE IMPLEMENTATION OF THIS FEATURE, APPLICATION OR
∗ STANDARD, XILINX IS MAKING NO REPRESENTATION THAT THIS IMPLEMENTATION
∗ IS FREE FROM ANY CLAIMS OF INFRINGEMENT, AND YOU ARE RESPONSIBLE
∗ FOR OBTAINING ANY RIGHTS YOU MAY REQUIRE FOR YOUR IMPLEMENTATION.
∗ XILINX EXPRESSLY DISCLAIMS ANY WARRANTY WHATSOEVER WITH RESPECT TO
∗ THE ADEQUACY OF THE IMPLEMENTATION, INCLUDING BUT NOT LIMITED TO
∗ ANY WARRANTIES OR REPRESENTATIONS THAT THIS IMPLEMENTATION IS FREE
∗ FROM CLAIMS OF INFRINGEMENT, IMPLIED WARRANTIES OF MERCHANTABILITY
∗ AND FITNESS FOR A PARTICULAR PURPOSE.
∗
∗/
//
// NOTE TO USER:
//
// Portions of this code were adapted from the lwip demo software application from
// the Xilinx EDK Standard IP Design with Pcores Addition reference design,
// which can be found at: http://www.xilinx.com/univ/xupv5−lx110t−bsb.htm.
//
#include ”xmk.h” // Must be first header file listed in order to use Xilkernel
#include <stdio.h>
#include ”xenv standalone.h”
#include ”xparameters.h”
#include ”netif/xadapter.h”
#include ”memory map.h”
#include ”wrapper.h”
#include ”xgpio.h”
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: print ip
Parameter Definition:
− msg: message to print
− ip: IP address to print
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void print ip(char ∗msg, struct ip addr ∗ip) {
print(msg);
xil printf (”%d.%d.%d.%d\n\r”, ip4 addr1(ip), ip4 addr2(ip), ip4 addr3(ip),
ip4 addr4(ip ));
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Parameter Definition:
− ip: IP address
− mask: netmask address
− gw: gateway address
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void print ip settings (struct ip addr ∗ip, struct ip addr ∗mask,
struct ip addr ∗gw) {
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: main
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int main() {
#ifdef MICROBLAZE
microblaze init icache range (0, XPAR MICROBLAZE 0 CACHE BYTE SIZE);
microblaze init dcache range (0, XPAR MICROBLAZE 0 DCACHE BYTE SIZE);
microblaze enable exceptions ();
#endif
// Enable caches
XCACHE ENABLE ICACHE();
XCACHE ENABLE DCACHE();
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: startNetwork
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int startNetwork() {
struct netif ∗ netif ;
struct ip addr ipaddr, netmask, gw;
return −1;
}
return 0;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int main thread() {
// Any thread using lwIP should be created using sys thread new
xil printf (”Starting network... \n\r”);
sys thread new(startNetwork, NULL, DEFAULT THREAD PRIO);
C.2 HCM.c
/∗
∗ Copyright (c) 2008 Xilinx, Inc. All rights reserved .
∗
∗ Xilinx, Inc.
∗ XILINX IS PROVIDING THIS DESIGN, CODE, OR INFORMATION ”AS IS” AS A
∗ COURTESY TO YOU. BY PROVIDING THIS DESIGN, CODE, OR INFORMATION AS
∗ ONE POSSIBLE IMPLEMENTATION OF THIS FEATURE, APPLICATION OR
∗ STANDARD, XILINX IS MAKING NO REPRESENTATION THAT THIS IMPLEMENTATION
∗ IS FREE FROM ANY CLAIMS OF INFRINGEMENT, AND YOU ARE RESPONSIBLE
∗ FOR OBTAINING ANY RIGHTS YOU MAY REQUIRE FOR YOUR IMPLEMENTATION.
∗ XILINX EXPRESSLY DISCLAIMS ANY WARRANTY WHATSOEVER WITH RESPECT TO
∗ THE ADEQUACY OF THE IMPLEMENTATION, INCLUDING BUT NOT LIMITED TO
∗ ANY WARRANTIES OR REPRESENTATIONS THAT THIS IMPLEMENTATION IS FREE
∗ FROM CLAIMS OF INFRINGEMENT, IMPLIED WARRANTIES OF MERCHANTABILITY
∗ AND FITNESS FOR A PARTICULAR PURPOSE.
∗
∗/
//
// NOTE TO USER:
//
// Portions of this code were adapted from the lwip demo software application from
// the Xilinx EDK Standard IP Design with Pcores Addition reference design,
// which can be found at: http://www.xilinx.com/univ/xupv5−lx110t−bsb.htm.
//
#include ”xmk.h” // Must be first header file listed in order to use Xilkernel
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include ”lwip/inet.h”
#include ”lwip/sockets.h”
#include ”lwipopts.h”
#include ”xparameters.h”
#include ”xbasic types.h”
#include ”xstatus.h”
#include ”wrapper.h”
#include ”sys/timer.h”
int doneProcessing = 0;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: checkPeripheral
Parameter Definition:
− core: hardware core
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void checkPeripheral(ipCore core) {
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void print echo app header() {
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: processCoreRequest
Parameter Definition:
− sock: socket to send core information to
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void processCoreRequest(int sock) {
coreInfoBufferPtr += sizeof(int);
// Lock the socket , so that outgoing data does not get corrupted
pthread mutex lock(&sockMutex);
pthread mutex unlock(&sockMutex);
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: processDataRequest
Parameter Definition:
− inMsg: data request message
− inMsgSize: length of data request message
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void processDataRequest(uint8 t ∗inMsg, int inMsgSize) {
int i ;
// Temporary buffer
uint8 t tmpBuffer[inMsgSize];
uint8 t ∗tmpBufferPtr = tmpBuffer;
// Message structure
msgStruct msg = { 1 };
memset(tmpBuffer, 0, sizeof(tmpBuffer));
memset(coreType, 0, sizeof(coreType));
// No match!
xil printf (”ERROR: data queue for this core type does not exist! \n\r”);
}
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: processRequest
Parameter Definition:
− read sd: socket to read requests from
− write sd: socket to write responses to
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void processRequest(int read sd, int write sd) {
// Request message
uint8 t inMsg[MAX MSG SIZE];
uint8 t ∗inMsgPtr = inMsg;
int bytesRcvd;
char msgType;
int i = 0;
inMsgPtr = inMsg;
// Is it a core request?
if (msgType == CORE MSG TYPE) {
// Is it a data request?
} else if (msgType == DATA MSG TYPE) {
// Skip over the message type portion of the data request message
inMsgPtr += MSG TYPE SIZE;
} else {
xil printf (”Invalid request!”);
}
// Close read and write sockets
close (read sd);
close (write sd );
// Re−initialize threadIDCount
threadIDCount = 0;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: recvRequests
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void recvRequests() {
size = sizeof(remote);
size2 = sizeof(remote2);
while (1) {
read sd = lwip accept(sock, (struct sockaddr ∗) &remote, &size);
write sd = lwip accept(sock2, (struct sockaddr ∗) &remote2, &size2);
processRequest(read sd, write sd );
}
}
C.3 wrapper.c
#include ”xmk.h” // Must be first header file listed
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include ”lwip/inet.h”
#include ”lwip/sockets.h”
#include ”lwipopts.h”
#include ”wrapper.h”
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
VARIABLE DECLARATIONS
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
const char NEWLINE = ’\n’;
const char STOP = ’!’;
// MUTEXES
pthread mutex t uartMutex = PTHREAD MUTEX INITIALIZER;
pthread mutex t totalSentMutex = PTHREAD MUTEX INITIALIZER;
pthread mutex t threadIDMutex = PTHREAD MUTEX INITIALIZER;
pthread mutex t sockMutex;
int doneProcressing;
int threadIDCount = 0;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: matchToMap
Parameter Definition:
− inCoreType: input core type
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int i ;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: mapToQueue
Parameter Definition:
− inCoreType: input core type
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int i ;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: setUpQueues
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
void setUpQueues() {
int queueIndex;
memset(queueList[queueIndex].coreType, ’\0’,
sizeof(queueList[queueIndex].coreType));
// Use the core types defined in the core map to label the queues
strcpy(queueList[queueIndex].coreType, entry[queueIndex].coreType);
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: processData
host PC.
Parameter Definition:
− arg: Pointer to a threadInfo data structure
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
msgStruct msg = { 1 };
int bytesSent;
uint8 t tmpData[tmpDataSize];
uint8 t ∗tmpDataPtr = tmpData;
uint8 t inData[inDataSize];
uint8 t resultData[resultDataSize ];
uint8 t outData[outDataSize];
uint8 t ∗outDataPtr = outData;
uint8 t coreType[CORE TYPE SIZE];
uint32 t jobID;
// Function pointer to the function used to process the data for this
// thread’s core type
void (∗functPtr)(uint8 t ∗dataIn, uint8 t ∗dataOut, Xuint32 baseAddr) =
tInfo .functPtr;
while (1) {
// Reset bytesSent
bytesSent = 0;
// Reset pointers
tmpDataPtr = tmpData;
outDataPtr = outData;
memcpy(tmpData, msg.msgBuffer, tmpDataSize);
// Copy the core type and job ID into the data response message
memcpy(outDataPtr, msg.msgBuffer, (CORE TYPE SIZE + JOB ID SIZE));
pthread exit(NULL);
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: setUpCoreList
Parameter Definitions:
− fileName: name of the file located on the Flash card
that contains the board’s core information
− sd: socket that data will be sent to
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int stop = 0;
int coreIndex = 0;
int i = 0;
ipCore inCore;
// Open the file on the CF card that contains the core information
if ((fd = sysace fopen(fileName, ”r”)) == 0) {
xil printf (”Cannot open input file: %s \r\n”, fileName);
exit (1);
}
memset(&inCore, 0, sizeof(inCore));
// Index of the core map entry that matches this core type
index2 = matchToMap(inCore.type);
// Pointer to the function that processes data for this core type
tInfo [ i ]. functPtr = entry[index2].functPtr;
tInfo [ i ]. outputSize = cores[coreIndex].outputSize;
coreIndex++;
i++;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: mapCoreToFunct
Parameter Definitions:
− inCoreType: input core type
− inFunct: pointer to a user−defined function that
processes data for a given core type
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: tripleDESFunction
Parameter Definitions:
− dataIn: input data
− dataOut: result of the tripleDES computation
− baseAddr: baseAddr of the tripleDES core
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
// Input data
uint8 t inputData[TRIPLEDES DATA SIZE];
uint8 t ∗inputDataPtr = inputData;
// Output data
uint8 t resultData[DATA SIZE];
uint8 t ∗resultDataPtr = resultData;
// Copy the last 32 bits of input data
memcpy(&data in B, inputDataPtr, DATA SIZE / 2);
// Function select
hw core−>function select = funct select;
// Key 1
hw core−>key1 in A = key1 in A;
hw core−>key1 in B = key1 in B;
// Key 2
hw core−>key2 in A = key2 in A;
hw core−>key2 in B = key2 in B;
// Key 3
hw core−>key3 in A = key3 in A;
hw core−>key3 in B = key3 in B;
// Input data
hw core−>data in A = data in A;
hw core−>data in B = data in B;
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: recvMsg
Parameter Definitions:
− sock: socket to read messages from
− inMsg: incoming message buffer
− maxMsgSize: maximum number of bytes to read
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int recvMsg(int sock, uint8 t ∗inMsg, int maxMsgSize) {
int msgHeaderSize = sizeof(int); // Use 4 bytes to store the message size
int msgTotalSize = 0; // Message size (in bytes)
int bytesRcvd = 0; // Number of bytes received
!= msgHeaderSize) {
pthread mutex lock(&uartMutex);
xil printf (”Error: expected to receive %d bytes, received %d! \n\r”,
msgHeaderSize, bytesRcvd);
pthread mutex unlock(&uartMutex);
return −1;
}
return bytesRcvd;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: sendMsg
Parameter Definitions:
− sock: socket to send messages through
− outMsg: outgoing message buffer
− msgSize: number of bytes to send
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int sendMsg(int sock, uint8 t ∗outMsg, int msgSize) {
int msgHeaderSize = sizeof(int); // Use 4 bytes to store the message size
int bytesSent = 0;
// Copy the outgoing message
memcpy(msgBufferPtr, outMsg, msgSize);
return bytesSent;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: readn
Parameter Definitions:
− sock: socket to read bytes from
− inMsg: incoming message buffer
− numBytesToRead: number of bytes to read
−−
Adapted from:
Unix Network Programming − The Sockets Networking API
Volume 1, Third Edition
by W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff
(Page 89)
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
numBytesLeft = numBytesToRead;
}
numBytesLeft −= numBytesRead;
inMsgPtr += numBytesRead;
}
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Function: writen
Parameter Definitions:
− sock: socket to write bytes to
− outMsg: outgoing message buffer
− numBytesToWrite: number of bytes to write
−−
Adapted from:
Unix Network Programming − The Sockets Networking API
Volume 1, Third Edition
by W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff
(Page 89)
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
int writen(int sock, void ∗outMsg, int numBytesToWrite) {
int numBytesLeft;
int numBytesWritten;
void ∗outMsgPtr = outMsg;
numBytesLeft = numBytesToWrite;
numBytesLeft −= numBytesWritten;
outMsgPtr += numBytesWritten;
}
return numBytesToWrite;
}
C.4 wrapper.h
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <pthread.h>
#include <errno.h>
#include <stdint.h>
#include <semaphore.h>
#include ”lwip/inet.h”
#include ”lwipopts.h”
#include ”xbasic types.h”
#include ”sys/msg.h”
#include ”sys/ipc.h”
#include ”sys/timer.h”
#include ”sys/process.h”
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
MACROS
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
#define NUM BOARDS 1
#define NUM CORES 3 // Must manually change this value when configuration is changed
#define CORE TYPE CHARS 4 // Number of characters in the core type string
#define IP ADDR CHARS 15 // Number of characters in the IP address string (###.###.###.###)
#define NUM CORE TYPES 1 // Number of different core types in system
/∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
STRUCTURES
∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗/
//
// Core information
//
// Represents a hardware core
//
typedef struct {
char type[CORE TYPE SIZE]; // core type; i.e. multiplier
Xuint32 baseAddr; // base address
int inputSize; // input size (in bytes)
int outputSize; // output size (in bytes)
} ipCore;
//
// Message structure
//
// Represents a message that gets placed into an input data queue
//
typedef struct {
long msgCount; // Message number
uint8 t msgBuffer[MAX MSG SIZE]; // Message buffer
} msgStruct;
//
// Core map entry
//
// A data structure used by the user to assign a core−specific function
// to a particular core type
//
typedef struct {
char coreType[CORE TYPE SIZE]; // Core type, i.e. ”tripleDES, ”FIR”...
void (∗functPtr)(uint8 t ∗dataIn, uint8 t ∗dataOut, Xuint32 baseAddr); // Pointer to function
} coreMapEntry;
// associated with the core type
//
// Queue information
//
// Represents an input data queue
//
typedef struct {
char coreType[CORE_TYPE_SIZE]; // Core type
int msgid; // Message id
} queueInfo;
//
// Thread information
//
// Represents a core-specific thread
//
typedef struct {
int msgid; // Input data queue message id
Xuint32 baseAddr; // Hardware core base address
int sock; // Socket to send data to
int inputSize; // Core input data size (in bytes)
int outputSize; // Core output data size (in bytes)
void (*functPtr)(uint8_t *dataIn, uint8_t *dataOut, Xuint32 baseAddr); // Pointer to core-specific function
} threadInfo;
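Each threadInfo entry bundles everything a core-specific worker thread needs: its input data queue, the core's base address, the outgoing socket, the data sizes, and the driver function. The sketch below shows one way such an entry could be handed to a worker thread; coreThread() and startCoreThreads() are hypothetical names, and the thesis's own makeThread() presumably fills this role.

// Illustrative only: passing a threadInfo entry to a worker thread.
// coreThread() and startCoreThreads() are hypothetical names.
void *coreThread(void *arg) {
    threadInfo *info = (threadInfo *)arg;   // Per-core parameters
    // ... read from the queue identified by info->msgid and call info->functPtr ...
    return NULL;
}

void startCoreThreads(void) {
    static pthread_t workers[NUM_CORES];
    for (int i = 0; i < NUM_CORES; i++) {
        pthread_create(&workers[i], NULL, coreThread, &tInfo[i]);
    }
}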
/**********************************************************
EXTERNAL VARIABLES
**********************************************************/
// Character definitions
extern const char STOP;
extern const char NEWLINE;
// Core information
extern ipCore cores[NUM_CORES];
// Queue information
extern queueInfo queueList[NUM_CORE_TYPES];
// Thread information
extern threadInfo tInfo[NUM_CORES];
// Mutexes
extern pthread_mutex_t uartMutex;
extern pthread_mutex_t sockMutex;
extern pthread_mutex_t threadIDMutex;
// Thread ID
extern pid_t threadID[NUM_CORES];
extern int threadIDCount;
/**********************************************************
EXTERNAL FUNCTIONS
**********************************************************/
// Initialization functions
extern void makeQueues();
extern void makeThread();
extern void setUpCoreList(const char *fileName, int sd);
extern void setUpQueues();
extern int setupSocket(char *serverIP, unsigned short serverPort);
// Mapping functions
extern int mapToMap(char *inCoreType);
extern int matchToQueue(char *inCoreType);
/**********************************************************
CORE-SPECIFIC INFORMATION
**********************************************************/
//
// tripleDES
//
// Stores all the necessary input and output data for a 3DES hardware core
//
typedef struct {
long key1_in_A; // First 32 bits of key 1
long key1_in_B; // Last 32 bits of key 1
long key2_in_A; // First 32 bits of key 2
long key2_in_B; // Last 32 bits of key 2
long key3_in_A; // First 32 bits of key 3
long key3_in_B; // Last 32 bits of key 3
long function_select; // 0xFFFFFFFF for encryption; 0 for decryption
long data_in_A; // First 32 bits of input data
long data_in_B; // Last 32 bits of input data
long data_out_A; // First 32 bits of output data
long data_out_B; // Last 32 bits of output data
} tripleDES;
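Each 64-bit key and data word is exposed to software as a pair of 32-bit fields. As a hedged illustration of the split implied by the field comments (taking "first 32 bits" to mean the most significant word), a 64-bit block could be divided into its _A and _B halves as follows; packBlock() is a hypothetical helper, not part of the thesis sources.

// Illustrative only: splitting a 64-bit block into the _A/_B field pair,
// following the "first 32 bits" / "last 32 bits" comments above.
// packBlock() is a hypothetical helper.
void packBlock(uint64_t block, long *outA, long *outB) {
    *outA = (long)(block >> 32);           // First (most significant) 32 bits
    *outB = (long)(block & 0xFFFFFFFFUL);  // Last (least significant) 32 bits
}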
/**********************************************************************
 * Header file that translates existing constants in xparameters.h
 * into constants that are used by the software applications.
 * Note: xparameters.h must be included before this file.
 **********************************************************************/
#define IIC_2_BASE_ADDRESS XPAR_XPS_IIC_2_BASEADDR
//#define PPC440
#ifdef PPC440
#define DDR_BASEADDR XPAR_DDR2_SDRAM_MEM_BASEADDR
#define CPU_CORE_FREQUENCY XPAR_CPU_PPC440_CORE_CLOCK_FREQ_HZ
#else
#define DDR_BASEADDR XPAR_DDR2_SDRAM_MPMC_BASEADDR
#define CPU_CORE_FREQUENCY XPAR_MICROBLAZE_CORE_CLOCK_FREQ_HZ
#endif
//#define PPC440CACHE
#ifdef PPC440CACHE
#define PPC440_ICACHE 0xC0000000
#define PPC440_DCACHE 0xC0000000
#endif
#endif
Appendix D
MATLAB Implementation
results = zeros(1,numTestRuns);
for i=1:numTestRuns
profile on;
TripleDES(inFileName, k1, k2, k3, 1, outFileName);
p = profile('info');
disp([p.FunctionTable(p.FunctionHistory(2,1)).TotalTime]);
profile off
results(i) = [p.FunctionTable(p.FunctionHistory(2,1)).TotalTime];
fprintf(fid, 'Test Run #%d: %10.3f seconds\n', i, results(i));
end
average = sum(results)/numTestRuns
fclose(fid);
D.2 3DES MATLAB Code
The source code for the helper functions InitPerm(), InvInitPerm(), DESRoundKeys(), SBox(),
DESRoundKeyFunction(), and DES() can be found at [27].
% Array of 32 zeros, for the case when there isn't an even number of 32-bit
% data chunks
extraData = 0;
% Open file
fid = fopen(inFileName);
% Close file
fclose(fid);
fileInfo = dir(inFileName);
fileInfoBits = fileInfo.bytes*8;
extraBits = mod(fileInfoBits, 32);
% Can we evenly divide all the 32-bit chunks of data into pairs to form 64-bit chunks of data?
if (mod(amountOfData, 2) == 0) % Divisible by 2
counter = amountOfData;
else % Not divisible by 2
data = [data; extraData]; % Add 32 zeros
counter = amountOfData + 1; % Increment amount of data since we added data
end
% Vector to hold output data
outData = [];
for i=1:2:counter
% 3DES
if (inFunct == 1)
% Encryption
outEncrypt = DES(DESDecrypt(DES(inMsg, key1), key2), key3);
outData = [outData outEncrypt];
elseif (inFunct == 0)
% Decryption
outDecrypt = DESDecrypt(DES(DESDecrypt(inMsg, key3), key2), key1);
outData = [outData outDecrypt];
else
% Invalid value for function
fprintf('ERROR! Invalid value for function! Valid values: 1 for encryption, 0 for decryption.\n\r');
end
end
y = outData;
D.3 DESDecrypt MATLAB Code
function C = DESDecrypt(P,Key)
% C = DESDecrypt(P,Key)
% Inputs: P = 64 bit (ciphertext) vector P, Key = a 64 bit vector
% that serves as an admissible DES key.
% Output: C = the 64 bit (plaintext) vector that corresponds to the
% output of the DES decryption algorithm.
% If the key is not admissible (i.e., if the parity check bits do not
% satisfy the required properties) the program will produce an error message.
% We generate the 16 round keys; the ith row of the following matrix is the
% ith round key (of 48 bits)
RoundKeys = DESRoundKeys(Key);
C = InvInitPerm([R L]);