A Dynamic Instruction Set Computer
0-8186-7086-X/95 $04.00 © 1995 IEEE
Authorized licensed use limited to: ULAKBIM UASL - IZMIR YUKSEK TEKNOLOJI ENSTITUSU. Downloaded on October 19,2024 at 09:06:26 UTC from IEEE Xplore. Restrictions apply.
Although few partially reconfigurable systems have actually been implemented, several have been proposed, such as hardware multi-tasking [10], a multi-phase serial communication algorithm [11], a data acquisition system [4], and a self-reconfiguring processor [8]. In addition, caching logic to increase hardware efficiency in standard digital systems has been proposed using partially reconfigurable FPGAs [15].

DISC uses partial configuration to implement custom-instruction caching. Instruction modules are implemented as partial configurations and individually configured on DISC as demanded by the application program. Before initiating execution of a custom-instruction, DISC queries the FPGA for the presence of the custom-instruction configuration. If the custom-instruction is on the FPGA, execution is initiated. Otherwise, program execution pauses while the custom-instruction is configured on the FPGA.

As a typical program executes, custom-instructions are configured onto the FPGA until all available hardware is consumed. When all hardware is used by the custom-instructions, new custom-instruction modules may not be configured on the FPGA until enough existing hardware is removed. By replacing the oldest custom-instruction modules on the FPGA with newer modules, the FPGA serves as a cache of the most recently used custom-instruction modules.

2.1 Example
The following assembly language source code exemplifies the use of partial configuration on DISC:

    begin:
        ;instruction INSTA operates on
        ;memory location mem1
        INSTA mem1
        INSTA mem2
        ;instruction INSTB operates on
        ;mem3 and mem2
        INSTB mem3,mem2
        ;"loopback" label defined
    loopback:
        INSTC mem3
        ;instruction CMP compares
        ;mem1 with mem3
        CMP mem1,mem3
        ;instruction JNE jumps
        ;to loopback if not equal
        JNE loopback
    continue:
        INSTD mem3
        INSTB mem2
        INSTE mem3
    end:

Once each instruction in the previous program (INSTA, INSTB, INSTC, CMP, JNE, INSTD, and INSTE) has been designed as an independent partial configuration, the source code representing the program is loaded into DISC and the processor begins execution. The sequencing of instructions on a small FPGA may execute and configure as follows:

    Operation         Instruction
    Configure INSTA   Configure INSTA on FPGA
    Execute INSTA     Execute first INSTA
    Execute INSTA     Execute second INSTA
    Configure INSTB   Configure INSTB on FPGA
    Execute INSTB     Execute first INSTB
    Configure INSTC   Configure INSTC on FPGA
    Execute INSTC     Execute first INSTC
    Execute CMP       Execute CMP (always available)
    Execute JNE       Execute JNE (always available)
    (continue looping to INSTC until JNE fails)
    Remove INSTA      FPGA full, remove oldest module
    Configure INSTD   Configure INSTD
    Execute INSTD     Execute INSTD
    Execute INSTB     Execute second INSTB
    Remove INSTC      FPGA full, remove oldest module
    Configure INSTE   Configure INSTE
    Execute INSTE     Execute INSTE

In the previous example, it is assumed that the first five instructions (INSTA, INSTB, INSTC, CMP, and JNE) consume all available space on a single FPGA. Partially configuring the FPGA allows two additional instructions (INSTD and INSTE) to execute on an otherwise full FPGA.

2.2 Advantages
Partial configuration provides a number of advantages for DISC over conventional configuration methods. First, idle instruction modules can be removed to make room for other usable modules. The ability to replace instruction modules in the system at run-time allows the implementation of an instruction set much larger than is possible on a single one-time configured FPGA.

Second, configuration time is substantially reduced. Although the DISC FPGA could be completely configured every time a new instruction is needed, configuration overhead can be dramatically reduced by configuring only the requested instruction. Reducing the size of hardware to configure significantly reduces the configuration bit-stream. Configuration bit-stream reductions for DISC instruction modules fall between [...] and [...] of a complete FPGA configuration. With a significantly smaller bit-stream, the corresponding configuration time is reduced. In an environment of run-time configuration, reducing the configuration time will limit the reconfiguration overhead.

Third, system state can be saved on the FPGA during configuration. Conventional configuration techniques prevent the preservation of system state during configuration by destroying the contents of all flip-flops. Implementing DISC with conventional configuration methods would require the saving and restoring of system state (program counter, register values, etc.) every time a configuration occurs. To prevent the time-consuming process of saving and restoring state, DISC implements a global controller that remains on the FPGA at all times.

In summary, partial configuration allows DISC to implement an essentially infinite instruction set in hardware with limited configuration and state-preserving overhead.

3 Relocatable Hardware
The ability to partially configure custom-instruction modules allows DISC to implement an important strategy: relocatable hardware. Relocatable hardware, implemented only in partially configurable FPGAs, provides the ability to relocate or make placement decisions of partial configurations at run-time. Although not essential for a general purpose processor, it is used on DISC to substantially improve run-time hardware utilization.

Sub-modules in traditional digital systems require a single fixed location in hardware because of strict global and local physical constraints. Because sub-modules in traditional systems are not paged in and out of hardware, a fixed location does not pose any problems and global optimizations can be made on the static circuitry to improve hardware utilization. In a run-time partially reconfigurable system, however, fixed locations for partial configurations can pose serious performance problems.

If DISC modules are designed for a single physical location, instructions in the library will inevitably overlap each other on the hardware. Two overlapping instructions can never operate properly on the FPGA at the same time. If two overlapping instructions are used frequently together in an application program, the configuration overhead needed to replace the instructions quickly becomes the system bottleneck. DISC removes these problems by designing each custom-instruction module for multiple locations on the FPGA.

The flexibility of multiple locations for DISC custom-instructions significantly improves run-time utilization. Instruction modules are initially configured on the FPGA as close together as possible to avoid wasted hardware between modules. Once the hardware space is full, additional instruction modules are placed in locations where older unneeded instruction modules currently lie. Relocatable hardware allows run-time constraints and conditions to dictate instruction module placement for optimal hardware utilization.

Relocatable hardware is implemented by designing custom-instruction modules around a firmly defined global context. A global context provides physical placement positions and a communication network necessary for these modules to operate correctly. The global context partitions the available hardware into an array of potential placement locations for the relocatable instruction modules. The communication network is provided at each placement location to ensure adequate communication between the global controller and the instruction modules at any location.

In order to design instruction modules that fit within the global context, all instruction modules must be physically independent from each other. The physical layout of any instruction module must have no effect on the physical layout or placement of any other module in the library.

4 Linear Hardware Space
DISC implements relocatable hardware in the form of a linear hardware model. As the name suggests, the model is based on a linear, one-dimensional hardware space. The two-dimensional grid of configurable logic cells is organized as an array of rows: location is specified by vertical location and module size is specified by module height (in rows).

The global context for the linear hardware model consists of a uniform communication network and a global controller. The communication network is constructed by running each global signal vertically across the die and spreading the global signals across the width of the die parallel to each other (see Figure 1).

Figure 1: Linear Hardware Space.

The communication network provides access to global resources for all instruction modules and performs intermodule communication. The global controller specifies the communication protocol, controls global resources (such as I/O and global state) and monitors circuit execution. The global controller and the communication network remain in the same location throughout application execution to preserve the global context.

To gain access to all global signals, sub-modules within a linear hardware space are designed horizontally, across the width of the FPGA. The modules lie perpendicular to the global communication signals for full access to all global signals regardless of their vertical placement (see Figure 2). Although all sub-modules must span the entire width of the FPGA, each module may consume an arbitrary amount of hardware by varying its height.
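As an illustration of the linear hardware model, the following Python sketch treats the FPGA as a one-dimensional array of rows and places each module at the lowest row offset where its height fits. It is a minimal sketch under stated assumptions, not the DISC placement code: the class `LinearSpace`, the first-fit policy, the module names and heights, and the eight-row space are all hypothetical choices made for the example.

```python
class LinearSpace:
    """One-dimensional hardware space: a module occupies a run of whole rows."""

    def __init__(self, rows):
        # rows[i] holds the name of the module occupying row i, or None.
        self.rows = [None] * rows

    def find_free(self, height):
        """Return the lowest row offset with `height` free rows, or None."""
        run = 0
        for row, owner in enumerate(self.rows):
            run = run + 1 if owner is None else 0
            if run == height:
                return row - height + 1
        return None

    def place(self, name, height):
        """Place a module at the first location that fits; return its offset."""
        offset = self.find_free(height)
        if offset is None:
            return None
        for row in range(offset, offset + height):
            self.rows[row] = name
        return offset

    def remove(self, name):
        """Free every row held by the module `name`."""
        self.rows = [None if owner == name else owner for owner in self.rows]


# Hypothetical modules and heights, chosen only to exercise the model.
space = LinearSpace(rows=8)
print(space.place("ADD", 2))     # 0
print(space.place("SHIFT", 1))   # 2
print(space.place("MULT", 4))    # 3
print(space.place("CMPR", 3))    # None: only one free row remains
space.remove("ADD")
space.remove("SHIFT")
print(space.place("CMPR", 3))    # 0: fits in the rows that were freed
```

Because a module is described only by its height, it can be placed at whatever row offset happens to be free, which is the property that relocatable hardware relies on.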
[Figure labels: Width of FPGA; Add, Subtract, Multiply, AND; Custom Module 1, Custom Module 2; a+b-c*d, Edge Detection, FFT]

and global state. The global controller consumes ten complete rows (approximately 1/6 of the chip), leaving 46 rows available for custom-instruction modules. The physical layout of the global controller, estimated at 1007 gates, along with the communication network is seen in Figure 4.

Figure 4: DISC Global Controller Layout.

Figure 5: DISC Global Controller Architecture.
• Data Register Feedback: provides new values for Data Register (8 bits),
• Memory Address: allows address generation control by custom-instructions (16 bits),
• Memory Data: allows bi-directional access of memory data by custom-instructions (8 bits),
• Status Signals: provides control capability for custom-instructions (4 bits),
• Instruction Register: provides opcode of current instruction (8 bits).
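For reference, the widths of the global signals listed above can be collected in a small Python table. This is only an illustrative sketch: the dictionary `WIDTHS`, its key names, and the helper `mask` are hypothetical and model nothing beyond the bit widths given in the list.

```python
# Widths, in bits, of the global signals in the list above.
# The key names are hypothetical; only the widths come from the text.
WIDTHS = {
    "data_register_feedback": 8,
    "memory_address": 16,
    "memory_data": 8,
    "status": 4,
    "instruction_register": 8,
}


def mask(signal, value):
    """Truncate `value` to the width of the named global signal."""
    return value & ((1 << WIDTHS[signal]) - 1)


print(sum(WIDTHS.values()))                  # 44 bits across the five signals
print(hex(mask("memory_address", 0x12345)))  # 0x2345
print(mask("status", 0b10110))               # 6
```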
The global controller is also responsible for sequencing through the instruction cycles for the custom-instruction modules. The following instruction cycles are implemented by the global controller:

• Instruction Fetch (IF),
• Operand Fetch (OF),
• Halt Processor (HP),
• Custom Cycle (CC),
• Instruction Execution (EX).

The IF cycle stores the current program memory byte into the instruction register and increments the program counter. The OF cycle stores the current program byte into the address register and also increments the program counter. The HP cycle causes all processor resources to remain idle and is used during configuration. The CC cycle is used by complex custom-instruction modules for adding additional cycles and has no effect on global resources. The EX cycle loads the data register with the contents of the data register feedback path.

Each instruction in the library operates in one of two possible instruction cycle sequences: standard and custom. The standard instruction sequence follows a simple three-cycle execution: IF, OF, and EX. Any instruction that completes its computation or function in a single clock cycle, such as basic arithmetic and logic operations, will operate with this sequence.

The custom-instruction sequence offers additional cycles for complex custom-instructions. The custom sequence begins with the following two cycles: IF followed by OF. The sequence then varies by inserting as many CC cycles as necessary to complete a complex application-specific operation. The custom-instruction sequence completes with the EX instruction cycle. The custom-instruction module has complete control over the number of CC cycles needed for a particular function. Some instructions add as few as one cycle, while others require thousands of cycles for a single operation. Figure 6 displays the two instruction sequences.

Figure 6: DISC Instruction Sequences.

The global control unit contains a number of default instructions necessary for controlling global resources. These instructions are used for sequencing, status control, and memory transfer and include the following:

• set carry: sets carry bit in status register,
• clear carry: clears carry bit in status register,
• store data register: store data register in memory,
• load data register: load data register from memory,
• conditional jump: jump with carry not set.

Each of these instructions follows the standard instruction sequence of three cycles. These instructions, coupled with the custom-instruction library designed for a particular application, provide the complete instruction set of the processor. An application can implement an instruction set of any size by paging instruction modules in a demand-driven manner from the instruction library.

5.2 Custom-instruction Modules
Custom-instruction modules vary in size and complexity, but each is designed to fit within the global context described above. Specifically, each module contains a decode and a data-path unit. Complex modules contain additional control structures.

The decode unit assigns a specific opcode to the custom instruction and is responsible for acknowledging its presence to the global controller. The decode unit compares the contents of the IR for a match against its own opcode during the OF cycle. On a positive match the module signals the global controller that the hardware is present and instruction sequencing continues.

The data-path is responsible for providing the proper connections to the global communication network and adhering to the established communication protocol. Instruction modules not executing refrain from sending any signals on the communication channel to prevent the corruption of other operating instructions. The data-path unit provides a new value for the data register during the EX stage. Most instructions perform their function by modifying the DR.

Several custom-instruction modules of varying size have been implemented on DISC. These vary from a simple single row shifter to a complex edge-detection module of 34 rows. Table 1 shows the current instructions available for DISC. The circuit layout for the Adder/Subtracter module is seen in Figure 7.

6 System Operation
The DISC processor was implemented on a PC-ISA custom board made exclusively for the study. The board includes static bus interface circuitry, two CLAy31 FPGAs, and memory. A configuration controller is implemented on the first FPGA to monitor
Table 1: Sample Custom Instruction Modules. (Surviving row: Comparator, 3, 155.)

Upon receiving a request for an instruction module, the host evaluates the current state of the DISC FPGA hardware and chooses a physical location for the requested module. The physical location is chosen based on available FPGA resources and the existence of idle instruction modules. If possible, the instruction module is loaded in an FPGA location not currently occupied by any other instruction module. If no empty hardware locations are available, a simple least-recently-used (LRU) algorithm is used to remove idle hardware. The host modifies the bit-stream of the requested hardware module to reflect the placement changes. The hardware module is then configured on the DISC platform by sending the new configuration to the system. Figure 9 provides a simplified flow chart of DISC instruction execution.
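The replacement behaviour described above can be illustrated with a short Python sketch that replays the dynamic instruction trace of the Section 2.1 example against an LRU cache of module slots. Several details are assumptions introduced for the example and are not taken from the DISC host software: the function `run`, three equally sized slots for custom modules, CMP and JNE being resident in the global controller, and a loop body that executes exactly once.

```python
from collections import OrderedDict

# Assumed to be part of the global controller, so never configured or removed.
ALWAYS_AVAILABLE = {"CMP", "JNE"}


def run(trace, slots):
    """Replay an instruction trace against an LRU cache of `slots` modules.

    Returns the ordered list of configure, remove and execute operations.
    """
    resident = OrderedDict()  # module name -> None, least recently used first
    log = []
    for inst in trace:
        if inst not in ALWAYS_AVAILABLE:
            if inst in resident:
                resident.move_to_end(inst)       # hit: now the most recent
            else:
                if len(resident) >= slots:       # miss on a full FPGA
                    victim, _ = resident.popitem(last=False)
                    log.append(f"Remove {victim}")
                resident[inst] = None
                log.append(f"Configure {inst}")
        log.append(f"Execute {inst}")
    return log


# Dynamic trace of the Section 2.1 program with the loop body run once.
trace = ["INSTA", "INSTA", "INSTB", "INSTC", "CMP", "JNE",
         "INSTD", "INSTB", "INSTE"]
log = run(trace, slots=3)
for line in log:
    print(line)
```

Under these assumptions the log reproduces the sequence listed in Section 2.1: INSTA is removed to make room for INSTD, and INSTC rather than INSTB is removed for INSTE because INSTB was used more recently. A first-in-first-out policy would have removed INSTB instead.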
Figure 7: DISC Adder/Subtracter Custom Module Layout.
time. The following application example will demonstrate this tradeoff.

7 Application Example
A simple image mean filter was developed as both a sequence of general purpose instructions and as an application specific hardware module to demonstrate the performance improvements gained by tailoring the hardware to the application. Both demonstrations calculate the mean value of each pixel in an image, g(x, y), by obtaining an average over a 3x3 neighborhood as follows:

$$g(x, y) = \frac{1}{9} \sum_{i=-1}^{1} \sum_{j=-1}^{1} f(x+i,\, y+j)$$

simple instructions used in the general purpose approach.

The MEAN instruction module calculates the average of a 3x3 neighborhood through the use of a sliding window as seen in Figure 11. Each numbered element of the sliding window represents a pixel register in the custom module. Instead of loading the entire window from memory at each pixel, register values are shifted to represent a sliding window (see Figure 12). Only registers 3, 6, and 9 are loaded at each new pixel.
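The sliding window can be sketched in Python and checked against a direct 3x3 average. This is a software illustration under stated assumptions, not a description of the MEAN module's circuitry: the function names are hypothetical, the nine registers are assumed to be numbered row by row (so that registers 3, 6, and 9 form the right-hand column), the division by nine is assumed to truncate, and border pixels are simply left at zero.

```python
def mean_naive(img):
    """3x3 mean of every interior pixel, reading all nine pixels each time."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = sum(img[y + j][x + i]
                        for j in (-1, 0, 1) for i in (-1, 0, 1))
            out[y][x] = total // 9
    return out


def mean_sliding(img):
    """Same result, but only three pixels are loaded per window position."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    loads = 0
    for y in range(1, h - 1):
        # r[0..8] stand for window registers 1..9 (assumed row-major).
        r = [0] * 9
        for x in range(w):
            # Shift the window one pixel: 2->1, 3->2, 5->4, 6->5, 8->7, 9->8.
            r[0], r[1] = r[1], r[2]
            r[3], r[4] = r[4], r[5]
            r[6], r[7] = r[7], r[8]
            # Load only registers 3, 6 and 9 from memory.
            r[2], r[5], r[8] = img[y - 1][x], img[y][x], img[y + 1][x]
            loads += 3
            if x >= 2:  # window is full and centred on column x - 1
                out[y][x - 1] = sum(r) // 9
    return out, loads


# Small synthetic test image (hypothetical data).
img = [[(3 * x + 5 * y) % 17 for x in range(6)] for y in range(5)]
out, loads = mean_sliding(img)
print(out == mean_naive(img))
print(loads)
```

For this 6x5 image the sliding version performs 54 pixel loads, whereas reloading the whole window at each of the 12 interior pixels would take 108.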
Figure 13: Test Image Filtered Through MEAN Custom Instruction.

7.3 Configuration Overhead
Because the cost of reconfiguring the application-specific instruction module is so high, configuration overhead must be considered when comparing the two approaches. The 31 row MEAN instruction requires an additional 140 kcycles for configuration, raising the total cycle count to 197 kcycles. The MEAN configuration overhead represents 71% of the total operating time. If device configuration speeds are maximized, this configuration overhead is reduced to 16% of the total operating time.

The extra four modules needed for the general purpose approach require only 36 kcycles for configuration. This represents less than 1% of the total operating time. When the high cost of configuration is included in the total operating time, the MEAN filter custom instruction provides a 23 times speedup over the general purpose approach (see Table 2).

                General Purpose    Application Specific
    Rows        8                  31

Although the techniques of partial configuration, relocatable hardware, and the linear hardware model were implemented as a general purpose processor, they offer similar advantages to other digital architectures. They may enhance the usefulness of FPGA co-processors by providing demand-driven computation. In addition, these techniques may allow FPGA based computing machines to operate in more dynamic environments such as multi-tasking operating systems. Any digital architecture that could benefit from demand-driven hardware may find these techniques useful.

References
[1] Algotronix, Edinburgh, UK. CAL1024 Preliminary Data Sheet, 1988.
[2] P. M. Athanas and H. F. Silverman. Processor reconfiguration through instruction-set metamorphosis. Computer, 26(3):11-18, March 1993.
[3] Atmel, San Jose, CA. Configurable Logic: Design & Application Book, 1993-1994.
[4] R. Camerota and J. Rosenberg. Data acquisition systems using Cache Logic FPGAs. In Configurable Logic: Design & Application Book, pages 7.15-7.18. Atmel, San Jose, CA, 1993-1994.
[5] J. Davidson. FPGA implementation of a reconfigurable microprocessor. In Proceedings of the IEEE 1993 Custom Integrated Circuits Conference, pages 3.2.1-3.2.4, 1993.
[6] J. G. Eldredge and B. L. Hutchings. Density enhancement of a neural network using FPGAs and run-time reconfiguration. In D. A. Buell and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 180-188, Napa, CA, April 1994.
[11] P. Lysaght and J. Dunlop. Dynamic reconfiguration of FPGAs. In W. Moore and W. Luk, editors, More FPGAs: Proceedings of the 1993 International workshop on field-programmable logic and applications, pages 82-94, Oxford, England, September 1993.
[12] S. Monaghan and C. P. Cowen. Reconfigurable multi-bit processor for DSP applications in statistical physics. In D. A. Buell and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 103-110, Napa, CA, April 1993.
[13] National Semiconductor. Configurable Logic Array (CLAY) Data Sheet, December 1993.
[14] T. G. Rauscher and A. K. Agrawala. Dynamic problem-oriented redefinition of computer architecture via microprogramming. IEEE Transactions on Computers, C-27(11):1006-1014, November 1978.
[15] J. Rosenberg. Implementing Cache Logic(TM) with FPGAs. In Configurable Logic: Design & Application Book, pages 7.11-7.14. Atmel, San Jose, CA, 1993-1994.
[16] M. J. Wirthlin, B. L. Hutchings, and K. L. Gilson. The Nano Processor: A low resource reconfigurable processor. In D. A. Buell and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 23-30, Napa, CA, April 1994.
[17] A. Wolfe and J. P. Shen. Flexible processors: a promising application-specific processor design approach. In Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21, pages 30-39, San Diego, CA, November 1988.