Design and Implementation of A SHARC Digital Signal Processor Core in Verilog HDL
field and process it.
Barrel shifter circulates data bits in a synchronous manner. The Barrel Shifter is used for scaling operations such as:

Prescaling an input data memory operand or the accumulator value before an ALU operation.

Figure 5. The MAC System Architecture

As evident from the Figure, this architecture is based on an Array Multiplier, an Adder, a Mux, and an Output hold register.

Mode 1:
MAC / MUL = 0
Mode 2:
In this mode it performs multiplication and accumulation of the input data, i.e. it is in MAC mode.

3.3.1 Array Multiplier

Due to today's VLSI technology, the array (or parallel) multiplier has become increasingly economical and popular. Some multiplier basics, notations and conventions are described here.

Generation of partial products is the first step of the multiplication operation. These partial products need to be added together to generate the final product of the multiplication. Figure 6 shows the process of partial product generation in the multiplication of two unsigned numbers a and b. Adding all partial products generates the final product. So the process of multiplication consists of generating partial products and adding them together.

For partial product generation, the multiplicand is multiplied by each multiplier bit weighted by 2^i, for all i. ANDing all bits of the multiplicand with the ith bit of the multiplier generates pp_i (the partial product for the ith bit), which is then shifted left. Due to its concurrency, only one level of delay is introduced, independent of the multiplication word length. Hence this stage needs no high-speed algorithms for the generation of partial products.
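The partial-product scheme described above can be sketched in software. The following is a minimal behavioural model, not the paper's Verilog: the function names and the default 4-bit word length are assumptions chosen to match the 4-bit example of Figure 6.

```python
def partial_products(a, b, n=4):
    """Return the n partial products of two unsigned n-bit numbers.

    pp_i is formed by ANDing every bit of the multiplicand `a` with
    bit i of the multiplier `b`, then shifting the result left by i.
    (Function name and the default n=4 are illustrative assumptions.)
    """
    word_mask = (1 << n) - 1
    pps = []
    for i in range(n):
        b_i = (b >> i) & 1                 # ith multiplier bit
        gate = word_mask if b_i else 0     # b_i replicated across the word
        pps.append(((a & word_mask) & gate) << i)
    return pps


def multiply(a, b, n=4):
    """Final product: the sum of all partial products."""
    return sum(partial_products(a, b, n))
```

In hardware all AND gates operate concurrently, which is why the generation step costs a single gate delay regardless of word length; the loop here only enumerates the rows.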
Figure 6. Multiplication of two 4-bit numbers.

An N x M array multiplier architecture consists of three well-defined major sections performing different operations:

- Generation of the partial products.
- Reduction of the partial products into two layers.
- And addition of the two layers into a final product.

Figure 7 visually demonstrates the three major sections for performing array multiplication.

... visual method of reduction in Figure 9.

In the above Figure, each of the three dots symbolizes a partial product bit. A FA (Full Adder) reduces these to two bits, where one keeps the weight of the column (sum) and the other has twice that weight (carry). This type of reduction is known as 3-to-2 reduction or carry-save addition. A group of two dots is passed through a HA (Half Adder), which again produces two bits, so it can be seen that this stage (Level n+1) does not yield any reduction. The rightmost diagram has 1 dot, which is carried down without any action.
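The 3-to-2 reduction described above can be illustrated at word level. This is a sketch under stated assumptions: it applies a full adder bitwise across three whole rows at once instead of dot by dot, it omits the half-adder step, and the function names are hypothetical. It shows only that carry-save reduction preserves the sum while collapsing the rows to two layers.

```python
def full_adder_3to2(x, y, z):
    """Bitwise full adder across three rows (3-to-2 compression).

    Returns (sum_row, carry_row) with x + y + z == sum_row + carry_row.
    The carry row is shifted left because each carry bit has twice
    the weight of the column that produced it.
    """
    sum_row = x ^ y ^ z
    carry_row = ((x & y) | (x & z) | (y & z)) << 1
    return sum_row, carry_row


def reduce_to_two_layers(rows):
    """Apply 3-to-2 compression level by level until two rows remain."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        # Compress every complete group of three rows.
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(full_adder_3to2(rows[i], rows[i + 1], rows[i + 2]))
        # Rows left over (one or two) are carried down without any action.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    return rows


# Example rows: the partial products of 11 x 5 for 4-bit operands.
layers = reduce_to_two_layers([11, 0, 44, 0])
product = sum(layers)  # the final addition of the two layers
```

No carry propagates inside a reduction level, so every level costs one full-adder delay; only the final two-layer addition needs a carry-propagating adder.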
The speed of a signal processing or communication system ASIC depends heavily on these functional units. Adders are in the critical path of many other arithmetic operations like multiplication, scaling, add-compare-select, and division.

Two basic adder architectures are studied for implementation:

- Ripple Carry Adder.
- Carry Lookahead Adder.

... makes sure that the signals arrive at their destined place to ensure the efficient operation of the architecture. The Controller Unit includes three sub-modules:

- Program Counter (PC)
- Instruction Decoder (ID)
- Control Unit (CU)

3.4.1 Program Counter

The Program Counter is a 6-bit counter whose output is connected to the Program Memory. It generates a 6-bit count value, which is used to address the Program Memory, which stores the instructions. It can select 64 memory locations. The Program Counter usually holds the address of the instruction that follows the one currently executing. It addresses a 64 x 11-bit Program Memory.

3.4.2 Instruction Decoder

The instruction decoder decodes the 11-bit instruction. It also generates control signals for the memory. The instructions contain the following information:

- Opcode.
- Address.
- Read control.
- Write control.
- CM address.
- CM read control.

The 5-bit opcode is further decoded by the control unit. This module separates the 5 bits of the opcode and assigns them to the control register of the appropriate functional unit. These control registers are connected to the function-select pins of the three functional units.

This opcode contains the following information:

- Defines the function to be performed.
- Selects a specific functional unit for that operation.
- Specifies the source of data to that functional unit.

3.5.1 Data Memory

Data Memory is used to store only data coming from either an external source or the functional units' output registers. It can store 64 words of 8 bits each. The address bus is 6 bits wide, so it can address all 64 memory locations. The memory is designed with separate read and write address buses, enabling a read and a write at different locations of the memory within one clock cycle. Write instructions must be executed first to write the data into the Data Memory.

3.5.2 Coefficient Memory

Coefficient Memory is used to store only filter coefficients, and twiddle factors in the case of FFTs. It can store 16 words of 8 bits each. The address bus is 4 bits wide, so it can address all 16 memory locations. The memory is designed with only a single address bus and a read control signal. Before simulation the memory is initialized with a memory initialization file, which contains the coefficients needed for a particular operation.

3.5.3 Program Memory

Program Memory is used to store only instructions, which are first given as input by the user during simulation. It can store 64 words of 11 bits each. The address bus is 6 bits wide, so it can address all 64 memory locations. Write instructions must be executed first to write the data into the Data Memory.
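The three memories of Section 3.5 can be summarised with a small behavioural model. This is a sketch, not the paper's Verilog: the class and method names are hypothetical, a Python list stands in for the memory initialization file, and reading before writing within a cycle is an assumption (the text only states that the two accesses target different locations).

```python
class DataMemory:
    """64 x 8-bit memory with separate read and write address buses."""
    DEPTH, WIDTH = 64, 8

    def __init__(self):
        self.cells = [0] * self.DEPTH

    def clock(self, rd_addr=None, wr_addr=None, wr_data=0):
        """One clock cycle: an optional read and an optional write.

        Assumption: the read sees the contents from before this
        cycle's write.
        """
        out = None
        if rd_addr is not None:
            out = self.cells[rd_addr % self.DEPTH]        # 6-bit address
        if wr_addr is not None:
            self.cells[wr_addr % self.DEPTH] = wr_data & ((1 << self.WIDTH) - 1)
        return out


class CoefficientMemory:
    """16 x 8-bit memory, single address bus plus a read control."""
    DEPTH, WIDTH = 16, 8

    def __init__(self, init_words=()):
        # init_words stands in for the memory initialization file.
        mask = (1 << self.WIDTH) - 1
        words = [w & mask for w in init_words][: self.DEPTH]
        self.cells = words + [0] * (self.DEPTH - len(words))

    def read(self, addr, rd_en=True):
        return self.cells[addr % self.DEPTH] if rd_en else None  # 4-bit address


class ProgramMemory:
    """64 x 11-bit instruction store addressed by the 6-bit Program Counter."""
    DEPTH, WIDTH = 64, 11

    def __init__(self):
        self.cells = [0] * self.DEPTH

    def write(self, addr, instruction):
        self.cells[addr % self.DEPTH] = instruction & ((1 << self.WIDTH) - 1)

    def fetch(self, pc):
        return self.cells[pc % self.DEPTH]
```

The separate read and write addresses of the Data Memory are what allow a result to be stored while the next operand is fetched in the same cycle.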
The proposed architecture is tested for FFT implementation, a digital signal processing algorithm used for calculating the DFT. A 4-point radix-2 algorithm is implemented for this architecture as shown in Figure 10.

Figure 10. Flow graph for a 4-point radix-2 algorithm.

W_4^0 and W_4^1 are the twiddle factors and are stored in the CM prior to the processing. For convenience of understanding, we have supposed their values to be 1 and 2 respectively. The algorithm uses four butterflies, each requiring 11 instructions, so a 4-point DFT needs 44 instructions to be computed. Four more have been added to output the four outputs from the DM, making a total of 48 instructions. These 48 instructions need 53 clock cycles to execute, complying with the pipeline specifications.

The four data input samples come from the external environment (in our case given as input by the simulation waveform) and are stored in the memory. An efficient program (a set of EEDSP001 instructions, EEDSP001 being the proposed DSP architecture) is written as shown in Table A, so that data samples are fetched in bit-reversed order. Since each write instruction takes 2 cycles to execute, 8 operations take 11 cycles, as three writes are included.

Table A. Instructions for the computation of one butterfly.

5. PIPELINING

The proposed DSP design is a fully pipelined architecture that has five stages. The pipelining in the design can be shown as follows:

[Pipeline diagram; recovered labels: Fetch, Coeff. Addr. Registers, CAR]
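The 4-point radix-2 test described in the FFT discussion above can be cross-checked in software. This is a minimal sketch under stated assumptions: it uses the true twiddle factors W_4^0 = 1 and W_4^1 = -j instead of the placeholder values 1 and 2 that the paper stores in the CM for convenience, it models the arithmetic only (not the EEDSP001 instruction stream or its cycle counts), and all function names are hypothetical.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft4_radix2(x):
    """4-point radix-2 decimation-in-time FFT built from four butterflies."""
    assert len(x) == 4
    w0 = 1 + 0j   # W_4^0
    w1 = -1j      # W_4^1 = exp(-j*pi/2)
    # Fetch the inputs in bit-reversed order: x[0], x[2], x[1], x[3].
    a = [x[bit_reverse(i, 2)] for i in range(4)]
    # Stage 1: two butterflies, both using W_4^0.
    s = [a[0] + w0 * a[1], a[0] - w0 * a[1],
         a[2] + w0 * a[3], a[2] - w0 * a[3]]
    # Stage 2: two butterflies, using W_4^0 and W_4^1.
    return [s[0] + w0 * s[2], s[1] + w1 * s[3],
            s[0] - w0 * s[2], s[1] - w1 * s[3]]


def dft(x):
    """Direct DFT, used only as a reference for the comparison."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]
```

Comparing `fft4_radix2` against `dft` on the same samples confirms that fetching the inputs in bit-reversed order yields the outputs in natural order.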
6. INTEGRATION OF COMPONENTS

To achieve the design of the Digital Signal Processor, the modules were implemented according to the plan set. The Digital Signal Processor was divided into several components (top-down approach). The functionality that each module must satisfy to communicate with other modules was defined at that stage with great caution. This helped us in the bottom-up integration of the modules into the final Digital Signal Processor circuitry. The individual modules were combined to form the Digital Signal Processor core.

Critical Path Timing: 10.91 ns
3 2 3 i MHz
Number of CLBs
7765
Additional Gate Count for IOBs

9. CONCLUSION

DSPs are an answer to the intense need for high-speed, intensive processing technologies that are both cheap and easy to use. DSPs are finding favor not only in computer systems, but also in consumer electronics products such as cellular phones.
The CPU and the Data path are designed to comprehend
the details of DSP architecture. The author develops the
logic, architecture and interface.