In-house developed 32-bit Digital Signal Processor for Strategic Applications


Essy Samuel, Sheba Elizabeth D, Priya P, Vinod P, Gopalakrishnan T, Sreekumar S
Vikram Sarabhai Space Centre, Indian Space Research Organization
Thiruananthapuram, India
2023 International Conference on Power, Instrumentation, Control and Computing (PICC) | 979-8-3503-3446-3/23/$31.00 ©2023 IEEE | DOI: 10.1109/PICC57976.2023.10142742


Abstract: Avionics systems rely on digital signal processing algorithms for applications such as Navigation, Guidance and Control, and Image and Data Processing. Digital Signal Processors (DSPs) are typically used to implement these algorithms, which require intense iterative computations to be performed on large data sets. DSPs generally provide special addressing modes, dedicated hardware blocks and enhanced parallelism to meet the stringent timing requirements of these computations. Commercial DSPs are available in various ranges of capability and speed; however, their use in strategic applications is governed by their availability and reliability criteria. Usage of commercial DSPs is made difficult by long lead times, obsolescence, import restrictions and cost. In this work, we present the architecture and design of a scalable 32-bit floating point DSP, with an in-house developed architecture, instruction set, assembly language and software toolset. The DSP provides high performance in terms of fast execution, low overhead and low interrupt latency, making it suitable for time critical DSP applications. The DSP also implements special instructions, called algorithmic instructions, which provide hardware acceleration for selected computation intensive operations. The performance of the DSP in executing commonly used signal processing algorithms compares well with commercial DSPs currently used in avionics systems. The design is currently implemented as a core in a Field Programmable Gate Array (FPGA), making use of the internal memory blocks of the FPGA to hold program and data. Different versions of the DSP have been flown in various Satellite and Launch Vehicle avionics systems.

Keywords: Hardware acceleration, DSP design, DSP architecture, instruction set

I. INTRODUCTION

Avionics systems rely on Digital Signal Processing (DSP) algorithms for applications like Navigation, Guidance and Control, as well as Image Processing and Data Processing. DSP algorithms, such as Fast Fourier Transform (FFT), Matrix Multiplication and Matrix Inverse, typically require intense iterative computations that operate on large sets of data. Real time applications place stringent timing requirements on these computations. A direct hardware implementation of the computations enables fast execution of these algorithms, but at the cost of flexibility. A software implementation provides flexibility in the design at a reduced execution speed. Hence, for such applications, a hardware-software co-design architecture is required to avail the advantages of both.

Modern Digital Signal Processors (DSPs) offer architectural enhancements like parallel processing using multiple hardware blocks, Very Long Instruction Word (VLIW), Superscalar architecture or multi core designs to meet the timing requirements of signal processing applications. Hardware acceleration is also used, where the computations are implemented in hardware for fast execution. Such hardware accelerators are usually implemented as Co-processors to the DSP [4], performing intense computations independently and exchanging data with the DSP. Application Specific Instruction Set Processors (ASIPs) are also designed with custom instructions for such computation intensive tasks [1]. Application specific hardware accelerators have been proposed for commonly used signal processing applications like Fast Fourier Transform (FFT), Matrix computations [2,5] and Video Processing [3].

Electronic components used in avionics systems for strategic applications require sustained availability for long durations. Commercial DSPs, though available in various performance grades, have low availability, long lead times, fast obsolescence and import restrictions, which limit their use in such applications. Application Specific Integrated Circuits (ASICs) can be used to overcome these disadvantages; however, developing ASICs is a costly and time consuming process and is hence not suitable to absorb the changing requirements of an application. Open source processor architectures, like SPARC [7], ARM [8] etc., are also widely available, with the architecture and instruction set fully defined and the implementation aspects left to the user. The advantage of this scheme is that the software development tools are available and the time involved in the architecture and instruction set design process can be reduced. However, being an open source architecture, the security of such a design may be a matter of concern in strategic applications.

In this work, the design of an in-house developed 32-bit Floating Point DSP is described, which provides performance comparable with commercial DSPs, while providing the advantages of design security, scalability and availability. The Harvard architecture based DSP, with multiple buses for Program and Data, implements a three stage pipeline: fetch & decode, read & execute, and write back. The Complex Instruction Set Computer (CISC) with a rich instruction set comprises instructions for general purpose arithmetic as well as special instructions, called algorithmic instructions, which provide hardware acceleration for specific computation intensive applications. These instructions efficiently make use of the available memory
buses of the DSP for wide data access from memory, and share the available addressing options, registers and execution units, like Multiply and Accumulate (MAC) units, of the DSP. The DSP also supports advanced branching features like zero overhead loops and fast context switching, thereby providing low branching overhead and interrupt latency. The performance of the system, in terms of code size and execution speed, is compared with that of the Analog Devices ADSP21060 processor for commonly used DSP applications, and a vast improvement is observed.

The paper is structured as follows. Section II provides a brief description of the architectural features of the DSP core along with the instruction set. The advanced program execution features available in the DSP are explained in Section III. Design and implementation of algorithmic instructions, with an example application, are outlined in Section IV. The performance of the DSP, in terms of execution speed and code size, for some commonly used applications, is analysed in Section V and the conclusion is presented in Section VI.

II. DSP ARCHITECTURE AND INSTRUCTION SET

The overall architecture of the DSP is shown in Fig. 1. The constituent elements are described below.

Fig. 1. Architecture of the DSP

A. Memory Organization
The proposed DSP implements a modified Harvard Architecture with separate Program and Data Memory. The Program Memory is organized as two banks, each bank holding locations addressed with either even or odd addresses. The Data Memory is also organized as two banks, with one bank dedicated for use by the processor and the other bank shared by the processor and any external logic that needs to share data with the processor. The DSP also communicates with the external logic through input and output ports. The DSP has four 64-bit Program Memory buses which can be used for reading instructions and constants stored in Program Memory. Data Memory is accessed through four 64-bit buses, supporting two read and two write operations. When the external logic requires a memory access, it provides a bus request to the processor and the processor grants the bus to the logic after its current access.

B. Registers and Addressing
The DSP has four general purpose registers (for arithmetic operations) and four index registers (for data addressing), and supports direct, immediate, register and register indirect addressing modes. The modifier, base and length registers are used to specify the modification and circular buffer parameters. Special addressing schemes like automatic post modification, modulo addressing and bit reversed addressing are provided to enable access of large data arrays for signal processing applications.

C. Control Logic
The control logic controls the program flow of the DSP. A loop counter is provided to load the iteration count for executing software loops as well as for algorithmic instructions. A hardware loop stack is used to store the loop parameters to implement zero overhead loops. A hardware register stack is implemented for automatic context saving during subroutine calls and interrupts. The software for the DSP is loaded to the Electrically Erasable Programmable Read Only Memory (EEPROM), from where it is loaded to internal Random Access Memory (RAM) at power ON. The software is executed from the RAM.

D. Instruction Set
The instruction set consists of over 100 operations supporting all general purpose arithmetic and logical computation requirements. Arithmetic, logical, shift and rotate, multiply, data transfer, field operations and branch instructions are executed in a single cycle. Algorithmic instructions are executed as many times as specified in the iteration count. Iterative floating point computations (like trigonometric function computations) are carried out in multiple cycles; however, these instructions are capable of executing in parallel with the subsequent single cycle instructions as long as those instructions do not require the result of the multi-cycle computation.

Instructions can be either single or two word. The instruction set is near orthogonal, with most instructions supporting all addressing modes. Up to three memory operands can be directly specified in the instruction, either as direct addresses or register indirect addresses, enabling computations directly from memory and eliminating the need for data transfer to and from registers as in a load-store architecture. All instructions can be conditionally executed, i.e., the instruction will be executed only if the condition in the condition field evaluates to true. Signed and unsigned integers, single and double precision floating point (IEEE 754 standard) data types are supported by the DSP.

E. Integrated Software Development Environment
An Integrated Software Development Environment (ISDE) is designed, in Visual C#, to enable design, development, testing and debugging of software for the in-house developed DSP. The ISDE comprises a code editor, Assembler, Disassembler, Simulator and Log Window. The Assembler converts the assembly language code to machine
code, while the disassembler performs the reverse translation. The instruction set simulator mimics the operation of the DSP in software, with the contents of the internal registers and selected memory locations displayed, so as to allow code development, testing, debugging and profiling before the hardware is realized. An object code verification tool is available, which verifies the disassembled machine code against the original assembly code. A compiler is also being developed to facilitate code development in a high level language (C language).

III. ADVANCED BRANCHING FEATURES
The DSP supports zero overhead loops and branching with automatic context switching and context saving features. The DSP implements a loop stack which facilitates zero overhead loops. When a loop instruction is encountered, the loop count and condition are checked and, in case the loop count is zero or the condition is false, the loop is automatically bypassed. There is a provision to execute subroutines as loops, by specifying the loop count along with the subroutine call instruction. The subroutine will be executed the specified number of times before returning to the calling point. A register stack is also implemented which allows automatic push-pop of registers specified in the instruction. During a subroutine call, the registers specified in the instruction are automatically pushed to the stack and at subroutine return, the registers are automatically popped from the stack. The branching overhead is further reduced by dispensing with the explicit return instruction. The last instruction of the subroutine is automatically identified and a return action is taken along with popping of the specified registers from the register stack. The last instruction of the subroutine can be any type of instruction: single cycle, multi-cycle, algorithmic or even another subroutine call or loop. The resulting improvement in code size and execution time can be seen from the example in Table 1, where a conventional code is compared with the implementation in the in-house developed DSP.

Table 1: Comparison of branching

Conventional code:
    load count=3;
    check if condition is true;
    if not, jump to label3;
    label1:
    call label2;
    decrement count;
    check if count is zero;
    if not zero, jump to label1;
    label3:
    other instructions….
    label2:
    push reg1;
    push reg2;
    push reg3;
    load mem1 to reg1;
    load mem2 to reg2;
    reg3=reg1+reg2;
    store reg3 to mem3;
    pop reg3;
    pop reg2;
    pop reg1;
    return;

Code in DSP:
    if <cond> loop (label2,label2,3,reg1,reg2,reg3);
    label3:
    other instructions….
    label2:
    mem3=mem1+mem2;

IV. ALGORITHMIC INSTRUCTION
Algorithmic instructions implement custom, instruction specific hardware to accelerate the computation of specific functions. This scheme is particularly suited for applications where the same computation is to be performed on large data sets residing in memory. The set of operations to be performed on a single data point is implemented in hardware. The hardware implementation is fully combinational, making use of the processor registers to hold any registered values. The processor implements four Multiply and Accumulate (MAC) units which can be used for implementing the computations required in the algorithmic instruction. The operations are performed on a set of data points by iteratively executing the same algorithmic instruction. The number of iterations required is specified in the loop counter.

The algorithmic instruction is held in the instruction pipeline until the loop count decrements to zero, after which the next instruction will be fetched. However, if an interrupt is received, the algorithmic instruction will be halted and the instructions from the interrupt service routine (ISR) will be executed. After the execution of the ISR, the algorithmic instruction is fetched again and the execution resumes from the iteration count at which it was interrupted. When an algorithmic instruction is present in the pipeline, all four buses of Program Memory and four buses of Data Memory are available for the execution of the instruction. The memory locations can be accessed through register indirect addressing mode using any of the index registers of the processor. Automatic post modification, circular buffer addressing and bit reversed addressing schemes are also available for these instructions. Any combination of the processor registers can be updated by the algorithmic instruction.

A. Design of Algorithmic Instruction
Any computation intensive application can be broken down into a sequence of operations, with any iterative set of operations amenable for implementation as an algorithmic instruction. The steps involved in the design of an algorithmic instruction, shown in Fig. 2, are described below:

Fig. 2. Steps involved in design of Algorithmic Instruction (flowchart: break down operation into multiple computations; identify computations in one iteration; work out memory requirements; work out register and MAC requirements; implement application specific hardware; add the instruction to the Algorithmic Instruction Core)

- The iterative parts in a computation are selected to be implemented as an algorithmic instruction. Multiple instructions can be identified, which when operated sequentially, perform a particular computation.
- The data organization is worked out considering the memory access requirements and available buses of Program and Data Memory. The number of read and write requirements are assessed.
- The instruction code is assigned to the instruction such that details like the type of computation required, as well as the number of
reads and writes to Program and Data Memory are easily identified on decoding.
- For each instruction, the set of operations to be performed in a single iteration is identified and implemented in hardware using only combinational elements.
- The number of instruction cycles for each iteration is computed.
- The register update requirements are worked out depending on the address modifications and requirements for registered values in the operations.

B. Example Application: Median filtering
Median filtering is usually used in image processing to remove the effects of salt and pepper noise. For each pixel value, a 3x3 sub image is constructed with the pixel at the middle and the value of the pixel is replaced by the median of the nine elements in the sub image, as shown in Fig. 3. One median computation requires reading 9 values and sorting them in order to compute the middle value.

The algorithmic instruction for median filtering computes two median values in one iteration. The instruction scans through the image, accessing multiple wide data buses of memory and using the processor registers such that two adjacent median values are computed on-the-fly. This instruction is to be iterated n²/2 times for median filtering of an n x n image.

A00 A01 A02 A03 A04
A10 A11 A12 A13 A14
A20 A21 A22 A23 A24
A30 A31 A32 A33 A34

Fig. 3. Computation of Median using sub image

V. PERFORMANCE
The in-house developed DSP provides various architectural advantages compared to conventional DSPs. The instruction set allows up to three memory operands to be specified in an instruction, thereby enabling computations with operands directly read from memory, as compared to a load-store architecture where the operands are to be loaded to registers for any computation. This reduces the code size as well as execution time.

Compared to a micro-coded architecture, where multiple cycles are required for executing a computation, the in-house developed DSP provides single cycle execution for most instructions. For instructions where multiple cycles are required for execution, parallelism is built in such that all subsequent instructions which do not require the result of the multi-cycle operation can proceed with their execution, with the multi-cycle computation being performed in the background.

Multiple wide memory buses are available, which are efficiently made use of in algorithmic instructions to provide hardware acceleration for computation intensive operations on large data sets. The advanced branching features, described in Section III, reduce the overhead during subroutine calls and interrupts.

The DSP core, along with algorithmic instructions for various signal processing applications like FFT, Filtering, Matrix Operations and Median Filtering, is designed and implemented in SmartFusion2 FPGA (M2S090). Library routines are written to perform the computations mentioned above using the algorithmic instructions identified for them. Different inputs are emulated and the outputs are verified.

The execution and code size parameters of these routines are compared with a software implementation using the library routines of the ADSP21060 Processor [6], which is commonly used in avionics applications. The performance metrics, tabulated in Table 2, indicate that the in-house developed DSP provides a significant gain in code size as well as execution time compared to a full software implementation using ADSP21060. The overall logic utilisation for the DSP core in SmartFusion2 FPGA (M2S090) is 45%.

Table 2: Comparison of DSP performance (DSP core vs ADSP21060)

Function                               Lines of Code            No. of Cycles
                                       DSP core   ADSP21060     DSP core   ADSP21060
512 point FFT                          16         158           4697       9764
15x15 Matrix Multiplication            3          11            1775       3911
10 tap FIR Filtering                   2          7             12         18
Median filtering of 128x128 RGB image  16         40            8208       917504

VI. CONCLUSION
Signal Processing algorithms used in various applications require the use of a Digital Signal Processor (DSP). Commercial DSPs are prone to obsolescence and have limited availability in the higher grades required by strategic applications. The architecture, design and features of an in-house developed Digital Signal Processor are presented in this paper. The DSP is suitable for implementing signal processing algorithms in strategic applications and provides a performance comparable with commercially available high end DSPs. The DSP implements a custom architecture, instruction set and assembly language, with supporting software tools for design, development, testing and debugging of software. The DSP provides advanced branching features and architectural support for hardware acceleration of computationally intensive operations through use of algorithmic instructions. A comparison of the performance of the DSP with a conventionally used DSP (ADSP21060) shows that the in-house developed DSP is superior in terms of code size and execution time. The DSP core is designed in Verilog HDL and currently implemented in SmartFusion2 FPGA. The design is scalable and customisable; new algorithmic instructions can be designed for the DSP as per the requirements of any application.

REFERENCES
[1] X. Guan, Hai Lin and Yunsi Fei, "Design of an application-specific instruction set processor for high-throughput and scalable FFT," 2009 Design, Automation & Test in Europe Conference & Exhibition, 2009, pp. 1302-1307, doi: 10.1109/DATE.2009.5090866.
[2] Z. Liu, K. Dickson and J. V. McCanny, "CORDIC based application specific instruction set processor for QRD/SVD," The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, pp. 1456-1460 Vol.2, doi: 10.1109/ACSSC.2003.1292227.
[3] S. D. Kim, C. J. Hyun and M. H. Sunwoo, "VSIP: Implementation of Video Specific Instruction-set Processor," APCCAS 2006 - 2006 IEEE Asia Pacific Conference on Circuits and Systems, 2006, pp. 1075-1078, doi: 10.1109/APCCAS.2006.342307.
[4] M. Ali, M. von Ameln and D. Goehringer, "Vector Processing Unit: A RISC-V based SIMD Co-processor for Embedded Processing," 2021 24th Euromicro Conference on Digital System Design (DSD), 2021, pp. 30-34, doi: 10.1109/DSD53832.2021.00014.
[5] T. Gaurav, A. Bhatt and R. Parekh, "Design and Implementation of low power RISC V ISA based coprocessor design for Matrix multiplication," 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC), 2021, pp. 189-195, doi: 10.1109/ICESC51422.2021.9532933.
[6] "ADSP-21000 Family Application Handbook Volume I", Analog Devices Inc., 1994.
[7] David L. Weaver, Tom Germond, The SPARC architecture manual version 9, PTR Prentice Hall.
[8] ARM architecture reference manual, ARM Limited.
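
As an illustration of the median filtering scheme of Section IV-B, the following is a minimal software sketch in Python, not the hardware implementation. It models one iteration of the algorithmic instruction as the computation of two adjacent medians, so that an n x n image needs n²/2 iterations. The function names (`window`, `median_pair`, `median_filter`) are hypothetical, n is assumed to be even, and the border handling (edge pixels replicated) is an assumption, since the paper does not state how image borders are treated.

```python
from statistics import median


def window(img, r, c):
    """Return the nine values of the 3x3 sub image centred on (r, c).

    Assumption: pixels outside the image are replaced by the nearest
    edge pixel; the paper does not specify the border behaviour.
    """
    n = len(img)

    def clamp(i):
        return min(max(i, 0), n - 1)

    return [img[clamp(r + dr)][clamp(c + dc)]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def median_pair(img, r, c):
    """Model one iteration: medians for pixels (r, c) and (r, c + 1)."""
    return median(window(img, r, c)), median(window(img, r, c + 1))


def median_filter(img):
    """Filter an n x n image (n even); return (filtered image, iterations)."""
    n = len(img)
    out = [[0] * n for _ in range(n)]
    iterations = 0
    for r in range(n):
        for c in range(0, n, 2):  # two adjacent pixels per iteration
            out[r][c], out[r][c + 1] = median_pair(img, r, c)
            iterations += 1
    return out, iterations
```

Calling `median_filter` on a 4 x 4 image performs 4²/2 = 8 iterations; by the same formula, a single 128 x 128 plane would need 128²/2 = 8192 iterations. The sketch sorts each window independently, whereas the instruction described in Section IV-B shares memory reads between the two adjacent windows through the wide data buses and processor registers.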
