
SSC07-XII-7

Deriving FPGA Based Custom Soft-Core Microprocessors for Mission Planning Algorithms
Aravind Dasu
Electrical and Computer Engineering
Utah State University
4120 Old Main Hill, Logan, UT 84322; (435) 797-2830
[email protected]

Jonathan Phillips
Electrical and Computer Engineering
Utah State University
4120 Old Main Hill, Logan, UT 84322; (435) 757-8341
[email protected]

ABSTRACT
Autonomous dynamic event scheduling using Iterative Repair techniques is an essential component of successful
space missions, as it enables spacecraft to adaptively schedule tasks in a dynamic, real-time environment. Event
rescheduling is a compute-intensive process. Typical applications involve scheduling hundreds of events that share
tens or hundreds of resources. We are developing a set of tools for automating the derivation of application-specific
processors (ASIPs) from ANSI C source code that perform this scheduling in an efficient manner. The tools will
produce VHDL code targeted for a Xilinx Virtex 4 FPGA (Field Programmable Gate Array). Features of FPGAs, including large processing bandwidth, embedded ASICs, and block RAMs, are exploited to optimize the design.

Iterative Repair problems are generally solved using Simulated Annealing, which works by gradually improving an
initial solution over thousands of iterations. We propose an FPGA-based architectural framework derived from
ANSI C function-level blocks for accelerating these computations by optimizing the process of (1) generating a new
solution, (2) evaluating the solution, and (3) determining whether the new solution should be accepted. Each step is
implemented in VHDL through data- and control-flow analysis of the source C code. We discuss an architecture
template for automated processor design.

INTRODUCTION

Field Programmable Gate Arrays (FPGAs) are becoming increasingly popular as a platform of choice for spacecraft computer systems. FPGA-based designs are highly cost-effective compared to Application-Specific Integrated Circuits (ASICs), and provide more computing power and efficiency than standard microprocessors. Current and planned NASA missions that utilize FPGA technology include MARTE (Mars Astrobiology Research and Technology Experiment) and the Discovery and New Frontier programs.1,2 However, the complexity of designing even reasonably efficient micro-architectures on commodity FPGA devices is daunting for engineers outside the realm of VLSI design.

A methodology for automatic derivation of FPGA-based application-specific processors for use in the mission planning and event scheduling computations performed by satellites and deep-space probes will mitigate this steep barrier and facilitate their adoption by a larger audience who do not have skills in VLSI design. This event scheduling is currently done using CASPER and ASPEN.3 Through our methodology, custom ASIPs on FPGAs can quickly be designed which exploit the features of the scheduling algorithms and maximize the efficiency of the system.

RELATED WORK

Our methodology leverages concepts from several different research areas, including hardware implementations of heuristic search techniques, the design of application-specific instruction processors (ASIPs), and methods for performing design space exploration for FPGA-based processors. Recent advances in each of these fields are discussed in this section.

Iterative repair utilizes a combinatorial search heuristic, such as a genetic algorithm (GA), simulated annealing (SA), or a stochastic beam search (SBS), to arrive at a solution. In theory, implementing these combinatorial

Dasu 1 21st Annual AIAA/USU


Conference on Small Satellites
search algorithms in hardware could significantly speed
up the search process. Large amounts of parallelism
and pipelining can be extracted from GA and SBS,
since deriving a new generation is largely only a
function of the previous generation.

FPGA-based GAs and SBS have been implemented for the purposes of blind signal separation, filter design, function interpolation, and speech recognition.4,5,6,7 As long as the solution length is kept reasonably small, this technique, in which entire solutions are passed between pipelined modules, works well. Iterative repair problems, however, are complex enough that a solution can be hundreds of bytes in length.

Design space exploration in the context of FPGA-based architectures is a powerful tool. Exploring a design space is, in essence, searching the combinatorial space of all possible hardware architectures that can support a given function. The goal is to identify the architecture that yields the best tradeoff between conflicting goals, such as minimizing required FPGA resources while maximizing system throughput. The design space is generally very large, thus demanding a search heuristic such as simulated annealing or a genetic algorithm to arrive at a solution within a reasonable amount of time. An FPGA design space can be searched at many levels, from the low-level specification of individual look-up tables to high-level complex modules.

An overview of the different types of processors that are typically considered in a design space search is provided by Mehta.8 Reduced Instruction Set (RISC), Complex Instruction Set (CISC), VLIW (Very Long Instruction Word), dataflow, and tagged-token architectures are all commonly utilized. A design space explorer is generally restricted to one flavor of processor in order to put an upper bound on the time needed to search the design space. Trying to search across all possible architectures is considered to be an intractable problem.

Miramond provides a good description of performing design space exploration for a reconfigurable processor.9 Important elements to be considered in the design space include the allocation of computational, control, and memory resources, along with the scheduling of operations onto these resources. Exploration can occur in both parallelization (spatial optimization) and pipelining (temporal optimization). Simulated annealing is employed as the search heuristic. Over thousands of iterations of the simulated annealing algorithm, the throughput of the algorithm gradually improves.

Figure 1: Pseudocode for the Simulated Annealing algorithm. The main loop consists of five steps.

IMPLEMENTATION

In order to develop an automated tool to derive a micro-architecture from a C program describing applications within the class of iterative repair based scheduling algorithms, similar to that shown in fig. 1, we are taking the approach of first defining and prototyping an application-oriented architecture framework. This framework will then be used to guide the tool to analyze the C program and determine the specifics of the different control, memory, and computation modules that make up the application-specific processor. The general hardware framework consists of an architecture that is conducive to the execution of the simulated annealing algorithm as employed by Iterative Repair. Based upon the framework shown in fig. 1, a tool flow is derived for the design of iterative repair processors. This tool flow is shown in fig. 2. Source C code for an Iterative Repair problem is first passed through GCC to obtain an intermediate .cfg format. This is then passed through an Intermediate Format Generator to produce

custom Control-Data flow graphs. The custom CDFGs are then partitioned by function to Design Space Explorers for the different pipeline stages. The Design Space Explorers take the Intermediate Format code, a stage-specific architecture template, and a constraint file, and produce an architecture for each pipeline stage. In this paper, we specifically discuss the custom intermediate code and the templates that have been derived for each stage.

Figure 2: High-level diagram showing tool flow from C source code to application-specific architecture. Red text indicates portions discussed in this paper.

In a simulated annealing/iterative repair algorithm, solutions are represented as a string of start times for events numbered 0 to n-1, for a problem consisting of n events that need to be scheduled. Lists of available resources and of the resources needed by each event are also provided. A generic framework macro-architecture for such algorithms is shown in fig. 3.

The architecture is composed of a five-stage pipeline coupled with six memory banks. A global controller coordinates execution and data exchange between the units. As this is a pipelined architecture, it can only operate as fast as the slowest stage. Design Space Exploration techniques must be employed in the more complex stages to minimize the latency. Each of these stages is discussed in detail in this section.

Memory Design

The architecture consists of 6 memory banks, derived from Xilinx FPGA block RAMs. A 1024-word (32-bit word) memory bank, for example, consumes 4 BRAMs. Each memory bank holds a solution and the score of the solution, and provides some space for temporary data storage. At a given point in time, one memory bank is associated with each of the five processing stages in the pipeline. The sixth memory block holds the best solution found so far. The main controller determines how memory blocks are associated with the different processing stages. Details on the manner in which memory banks are managed are discussed in the section on the main controller.

Copy Processor

As shown in fig. 1, the main loop of the simulated annealing algorithm begins by making a copy of the best solution. This copy is then altered to generate a new solution that could potentially replace the best solution. In the architecture shown in fig. 3, the Copy Processor performs this function.

Assuming the length of the solution is known, the contents of the solution in the "best-solution" memory bank are copied, word by word, into the memory bank currently associated with the Copy Processor. There is no need to accelerate the copy process, as this pipeline stage is guaranteed to complete in n+1 clock cycles for a solution length of n. Other stages are much more compute-intensive. As can be seen from fig. 4, the copy processor is merely a controller to facilitate data transfers. The "step" signal comes from the main controller, indicating that a new pipeline step has begun. The copy controller consists of a counter that generates addresses, produces a "done" signal when all data has been copied, and also controls the write-enable line on the destination memory bank. The source and destination addresses are identical, because the data locations in each memory bank are identical.

Figure 4: The Copy Processor. Data is copied word by word from the source memory bank to the destination memory bank.
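A behavioral C model of one bank and the copy stage may help fix the interface (the 512/511-word split of the 1024-word bank and the field names are illustrative assumptions, not the tool's layout):

```c
#include <stdint.h>

#define BANK_WORDS 1024   /* a 1024 x 32-bit bank consumes 4 BRAMs */

/* One of the six banks: a solution, its score, and scratch space.
   The split of the 1024 words between fields is assumed here. */
typedef struct {
    int32_t solution[512];                 /* start times            */
    int32_t score;                         /* score of that solution */
    int32_t scratch[BANK_WORDS - 512 - 1]; /* temporary data         */
} bank_t;

/* The copy stage: one word moves per cycle from the best-solution
   bank to the bank assigned to the Copy Processor.  The same counter
   drives source address, destination address, and the destination
   write enable, so n words complete in n + 1 cycles. */
static int copy_stage(const bank_t *src, bank_t *dst, int n)
{
    int cycles = 1;  /* final cycle raises "done" */
    for (int addr = 0; addr < n; addr++, cycles++)
        dst->solution[addr] = src->solution[addr];
    return cycles;   /* = n + 1, as stated in the text */
}
```

The returned cycle count reflects the n+1-cycle guarantee stated above.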

Figure 3: Top-level architecture depiction for a pipelined Iterative Repair processor. Black lines represent data buses and red lines signify control signals.

Alter Processor

The second stage in the Iterative Repair pipeline is the Alter Processor. One event is selected at random from the solution string. The start time of this event is

changed to a random value smaller than the maximum latency. This stage, shown in fig. 5, could be accelerated by introducing an additional random number generator and an additional divider, allowing for maximum concurrency. But this is not necessary, as a 15-cycle integer divider allows this stage to terminate in 21 clock cycles, regardless of the size of the solution string. As solutions generally consist of hundreds of events, even the simple Copy Processor will have a greater latency than the Alter Processor. The alter controller is based on a counter that starts when the "step" signal is received from the Main Controller, control logic to enable register writing on the "address" and "data" registers on the proper clock cycles, and a "done" signal to indicate that the stage has completed.

Figure 5: The Alter Processor. A random number generator is used to modify the incoming solution.

Evaluate Processor

The Evaluate Processor is by far the most complex of all the pipeline stages in the Iterative Repair architecture. This processor's job is to compute a numerical score for a potential solution. The score of a solution to the Iterative Repair problem consists of 3 components. A penalty is incurred for the total clock cycles consumed by the schedule. A second penalty is assessed for double-booking a resource on a given clock cycle. Thirdly, a penalty is assigned for dependency violations, which occur when event "b" depends upon the results of event "a", but event "b" is scheduled before event "a".

Fig. 6 shows an intermediate output of the tool as it works upon the Evaluate Processor. Fig. 6 is a control-data flow graph depicting basic blocks, data dependencies, control dependencies, and data operations for the evaluate function described above. Each of the evaluation components described above is implemented as an individual pipelined processor. Because the three components of the score can be computed independently, all three processors can run in parallel, thus saving substantial clock cycles. The first sub-processor, termed the Dependency Graph Violation Processor, or DGVP, is shown in fig. 7. The processor is a 4-stage pipeline. In the first and second stages, an adjacency matrix is used to index the solution memory and determine when parent/child pairs of events are scheduled. The third and fourth stages determine the magnitude of the penalty, if any, to be incurred because the child event is scheduled before the parent event terminates. This penalty has a magnitude in order to encourage offending parent/child pairs to gradually move toward each other, thus decreasing the penalty over several iterations and causing the schedule to become more optimized.
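The DGVP's penalty computation can be sketched in C as follows, assuming unit-duration events and a penalty equal to the size of the parent/child overlap (the paper does not give the exact formula, so both are assumptions):

```c
#define N_EVENTS 100

/* dep[i][j] != 0 means event j depends on (is a child of) event i;
   this plays the role of the DGVP's adjacency matrix. */
static int dependency_penalty(unsigned char dep[N_EVENTS][N_EVENTS],
                              const int start[N_EVENTS])
{
    int penalty = 0;
    for (int parent = 0; parent < N_EVENTS; parent++)
        for (int child = 0; child < N_EVENTS; child++)
            if (dep[parent][child]) {
                /* With unit-duration events the parent finishes at
                   start[parent] + 1.  Penalize by the overlap, so
                   offending pairs drift apart over iterations. */
                int overlap = (start[parent] + 1) - start[child];
                if (overlap > 0)
                    penalty += overlap;
            }
    return penalty;
}
```

Because the penalty is graded rather than fixed, each accepted repair that shrinks the overlap lowers the score, which is what lets offending pairs converge gradually.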

Figure 6: Control-Data Flow Graph of the Evaluate function. Information contained in this graph can be used to create an optimal application-specific processor.

Figure 7: Dependency Graph Violation Processor architecture. This four-stage pipelined processor computes all dependency graph violations for a given schedule.

The second sub-processor, shown in fig. 8, is the Total Schedule Length Processor. Its job is simply to compute the total length of the schedule from beginning to end. This 2-stage processor looks through all events one by one, updating the earliest and latest times seen so far. Upon conclusion, the difference between the earliest and latest times is the schedule length.

Figure 8: The Total Schedule Length Processor architecture.

The third sub-processor internal to the Evaluate Processor is the Resource Over-Utilization Processor. This processor, depicted in fig. 9, is responsible for checking for resource over-utilization on every resource for every time step. This processor is actually two different pipelined processors. The first populates a timing matrix, which is a two-dimensional matrix that keeps track of the utilization of every resource for every time step. This matrix is populated by going through the events one by one and determining when each is scheduled and what resource each uses. The timing matrix is then passed on to the second processor, in which the utilization of each resource at each time step is compared to the total number of available resources of that type. When over-usage occurs, the amount of over-usage is added to the existing penalty.

Figure 9: The Resource Over-utilization Processor architecture. A timing matrix is first populated and then compared against the available resources.

All three sub-processors have "done" signals. When all three have completed their tasks, the three penalty values are summed to give the total score for the given schedule of events. This score is stored in the associated main memory bank. The Timing Matrix must be cleared on each iteration. To avoid using clock cycles on this clearing operation, the Timing Matrix is implemented as a double memory (sometimes called a ping-pong buffer). On a given iteration, one block is used for computations while the other is being cleared.

Unlike the other stages in the pipelined architecture, the size and speed of the Evaluate Processor are not fixed values, but rather are dependent upon the size and complexity of the list of events to be scheduled. The results section contains an example consisting of a well-connected graph of 100 events.

Accept Processor

Figure 10: The Accept Processor. The new solution is always accepted if it is better. If worse, it is accepted with a computed probability.

The Accept Processor's job is to determine whether to accept the current solution as the new best solution. If the current solution is better than the best solution, the current solution is accepted unconditionally. According to the Simulated Annealing algorithm, a solution that is worse than the best solution can also be accepted with a
computed probability, defined in (1).

    p = e^(E/T),  E = Scur - Sbest    (1)

Scur and Sbest are the current and best scores, respectively, and T represents temperature. This probability is a function of both the temperature and the difference between the scores of the current and best solutions. When the temperature is high, suboptimal solutions are more likely to be accepted. This feature allows the algorithm to escape from local minima as it searches the solution space and zero in on the true optimal solution.

An architecture that supports this computation is shown in fig. 10. The best score and the current score are read from their respective memory banks. The temperature is provided by the Main Controller. The random number generator (RNG) is a simple 15-bit tapped shift register. The exponential block is a BRAM-based lookup table. The I-to-F block is an integer-to-float converter.

Adjust Temperature Processor

The Adjust Temperature Processor is a simple but critical stage in the pipelined processor. The temperature is used to compute the probability of acceptance in the Accept Processor and by the Main Controller to determine when the algorithm should complete. The architecture for the Adjust Temperature Processor is shown in fig. 11. The current temperature is stored in a register. When the "step" signal is received, the temperature is multiplied by the constant "cooling rate", which is typically a value such as 0.9999. This cooling rate allows the temperature to decrease slowly and geometrically, allowing for the discovery of better solutions.

Figure 11: Adjust Temperature Processor. The temperature is reduced geometrically each time this processing stage runs.

Main Controller

The main controller keeps track of the memory block that is associated with each processing stage. Upon the completion of a pipeline period, the main controller must determine how to reassign the memory blocks to the different stages, keeping track of which one holds the best solution and which one can be recycled and assigned to the Copy Processor. The main controller also performs global synchronization. As shown in fig. 3, the main controller receives a "done" signal from each of the pipeline stages. When all stages have completed, the main controller sends out a "step" signal to each processor, indicating that they can proceed. The main controller also monitors the temperature and halts the system when the algorithm is complete.

RESULTS

Now that the architecture has been discussed at length, a specific example of how the architecture is employed is given. The event graph depicted in fig. 12 is complex enough to provide an interesting problem to solve. The problem consists of 100 events. Each event uses one type of resource. There are four types of resources. The resource type associated with each event is designated by the color of the event node in fig. 12. Each event also takes one time step to complete. Additional input parameters are four of each type of resource, a maximum schedule length of 32 time steps, an initial temperature of 10,000, a cooling rate of 0.9999, and a termination threshold of 0.0001. This means that the schedule cannot exceed 32 time steps, the simulated annealing temperature starts at 10,000 and is decreased geometrically by a factor of 0.9999 on each iteration, and the program terminates when the temperature falls below 0.0001. As a result, the improvement loop will run 184,198 times.

Figure 12: Event graph consisting of 100 events, each of which uses one of four resource types, as designated by color. Edges denote dependencies.
The FPGA resources needed to solve this scheduling problem are shown in Table 1. Each of the six memory banks uses 4 BRAM blocks, thus the 24 blocks used by the Memory Module. The problem contains 99 dependency edges. The Dependency Graph Violation Processor (DGVP) in the Evaluate Processor needs to look at all 99 edges, plus 3 cycles for the pipeline delay, giving a total of 102 cycles. The Total Schedule Length Processor (TSLP) needs to look at all 100 events, plus 1 cycle for pipeline delays, yielding 101 cycles. The Resource Over-utilization Processor (ROP) needs to look at every event to populate the Timing Matrix, which means 100 cycles plus 2 for pipeline draining, totaling 102 cycles. It also needs to look at every element in the Timing Matrix, with dimensions of 32 time steps maximum latency by 4 resource types, plus 3 cycles of pipeline draining, resulting in 131 cycles. This means the Resource Over-utilization Processor has a total latency of 233 cycles. As this is the most costly of the three sub-processors in the Evaluate Processor, the total latency of the Evaluate Processor is 233 cycles plus 2 for the final summations, resulting in a 235-cycle latency.

Table 1: Architecture Results

Module                         Slice Count   DSP48 Units   BRAMs   Latency (cycles)   Max. Freq. (MHz)
Main Controller                        193             0       0                  3                472
Copy Processor                          21             0       0                101                307
Alter Processor                        390             0       0                 21                238
Eval. Processor                        393             0       5                235                310
  DGVP                                 104             0       1                102                347
  TSLP                                  74             0       0                101                352
  ROP                                  215             0       4                233                310
Accept Processor                     1,424             0       1                 54                197
Adjust Temperature Processor           186             4       0                 12                445
Memory Module                        1,612             0      24                N/A                136
Complete Processor                   4,712             4      35                235                136

This design will easily fit on a Xilinx Virtex-4 SX35 device, which consists of 15,360 slices, 192 DSP48 units, and 192 BRAM blocks. It should be noted that the design assumes 32-bit single-precision floating-point arithmetic and 16-bit integer arithmetic. The stage latency of the pipelined processor is 235 clock cycles, with a maximum clock frequency of 136 MHz. At this speed, the entire Iterative Repair algorithm, consisting of 184,198 iterations, can execute in just over 43 million clock cycles, or a wall-clock time of 318 ms. As shown in Table 2, this is a speedup of more than 500 times when compared to a PowerPC, without a floating-point coprocessor, running comparable code at 100 MHz.

Table 2: Performance Comparison

Processing Platform                        Clock Frequency   Clock Cycles   Wall Clock Time   Speedup
Xilinx Virtex-4 embedded PowerPC core              100 MHz      1.87x10^10           187.1 s       N/A
AMD Athlon 64                                     2.61 GHz       3.7x10^10          14.265 s     13.11
Xilinx Virtex-4 Iterative Repair circuit           136 MHz       4.3x10^7            318 ms     588.4

The reasons for the massive speed-up of the custom implementation when compared to traditional linear processors are three-fold. First, the custom circuit employs a five-stage macro pipeline. This allows five different solutions to be at different stages of processing simultaneously, rather than only one solution being managed at a time, as in traditional processors. Second, the most complex of the processing stages, the evaluate function, has been parallelized in the custom implementation to drastically decrease the latency of the pipeline. Once again, in a conventional processor, no such parallelization can occur. Third, in a conventional processor, up to 50 percent or more of the computation cycles can be consumed by load and store instructions, especially in Intel-type architectures, which have few internal registers. Because of the application-specific nature of the custom approach, no unneeded load/store cycles are consumed.

FUTURE WORK

The architecture described in this paper is only an initial stage in our vision of hardware acceleration of the Iterative Repair algorithm. Once the processing platform outlined here has been completed, a future step would be to further exploit the parallel nature of the iterative repair algorithm. Simulated Annealing is a sequential algorithm that can be pipelined, but not parallelized. However, similar heuristic search techniques exist that are much more conducive to parallelization. Stochastic Beam Search is one of these. It is almost identical to Simulated Annealing, but a set of "best" solutions is maintained, rather than a single solution. The pseudocode for the Stochastic Beam Search is shown in fig. 13. A modified version of the Stochastic Beam Search could better utilize FPGA

resources if significant space is left by the traditional Simulated Annealing algorithm.

Figure 13: The Stochastic Beam Search algorithm is similar to Simulated Annealing.

Finally, the tool will be designed to accept additional optimization constraints, such as using Triple Modular Redundancy (TMR) or other techniques for implementing fault tolerance, along with power optimization strategies.

REFERENCES

1. Winterholler, A., M. Roman, D. Miller, J. Krause, and T. Hunt, "Automated core sample handling for future Mars drill missions," in 8th International Symposium on Artificial Intelligence, Robotics and Automation in Space, Germany, 2005.

2. "New Space Communications Capabilities Available for NASA's Discovery and New Frontier Programs," in NASA Technology Discovery/New Frontier Roadmap, 2006.

3. Knight, S., G. Rabideau, S. Chien, B. Engelhardt, and R. Sherwood, "Casper: space exploration through continuous planning," IEEE Intelligent Systems, vol. 16, pp. 70-75, 2001.

4. Emam, H., M. A. Ashour, H. Fekry, and A. M. Wahdan, "Introducing an FPGA based genetic algorithms in the applications of blind signals separation," in Proceedings of the 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications, 2003, pp. 123-127.

5. Hamid, M. S., and S. Marshall, "FPGA realisation of the genetic algorithm for the design of grey-scale soft morphological filters," in International Conference on Visual Information Engineering (VIE 2003), 2003, pp. 141-144.

6. Mostafa, H. E., A. I. Khadragi, and Y. Y. Hanafi, "Hardware implementation of genetic algorithm on FPGA," in Proceedings of the Twenty-First National Radio Science Conference (NRSC 2004), 2004, pp. C9-1-9.

7. Anantharaman, T. and R. Bisiani, "A hardware accelerator for speech recognition algorithms," in Proceedings of the 13th Annual International Symposium on Computer Architecture, Tokyo, Japan, 1986, pp. 216-223.

8. Mehta, G., R. R. Hoare, J. Stander, and A. K. Jones, "Design space exploration for low-power reconfigurable fabrics," in Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006.

9. Miramond, B. and J. M. Delosme, "Design space exploration for dynamically reconfigurable architectures," in Design, Automation and Test in Europe (DATE 2005), 2005, pp. 366-371.