Introduction to the TMS320C6x VLIW DSP: Prof. Brian L. Evans
The TMS320C6x VLIW DSP
Prof. Brian L. Evans
in collaboration with
Niranjan Damera-Venkata and
Magesh Valliappan
Embedded Signal Processing Laboratory
The University of Texas at Austin
Austin, TX 78712-1084
http://signal.ece.utexas.edu/
Outline
• Instruction set architecture
• Vector dot product example
• Pipelining
• Vector dot product example revisited
• Comparisons with other processors
• Conclusion
Instruction Set Architecture
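[Figure: Simplified architecture. On-chip program memory and data memory (RAM or cache) connect over internal address and data buses to the CPU. The CPU has two data paths: functional units .D1, .M1, .L1, .S1 with registers A0-A15, and functional units .D2, .M2, .L2, .S2 with registers B0-B15, plus control registers. Also on the internal buses: external memory interface (synchronous and asynchronous), DMA, serial port, host port, boot load, timers, and power down.]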
Instruction Set Architecture
• Two parallel data paths with 32-bit RISC units
  - Data unit: 32-bit address calculations (modulo, linear)
  - Multiplier unit: 16 bit x 16 bit with 32-bit result
  - Logical unit: 40-bit (saturation) arithmetic & compares
  - Shifter unit: 32-bit integer ALU and 40-bit shifter
  - Conditionally executed based on registers A1-2 & B0-2
  - Work with two 16-bit halfwords packed into 32 bits
Functional Units
• .M multiplication unit
  - 16 bit x 16 bit signed/unsigned packed/unpacked
• .L arithmetic logic unit
  - Comparisons and logic operations (and, or, and xor)
  - Saturation arithmetic and absolute value
• .S shifter unit
  - Bit manipulation (set, get, shift, rotate) and branching
  - Addition and packed addition
• .D data unit
  - Load/store to memory
  - Addition and pointer arithmetic
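Two of the operations listed above, saturation arithmetic and packed addition, can be sketched in portable C. This is only an illustration of the arithmetic, not the processor's instruction semantics; the function names `sat_add32` and `packed_add16` are hypothetical.

```c
#include <stdint.h>

/* Saturating 32-bit add: clamp to the int32_t range instead of wrapping. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;  /* widen so the sum cannot overflow */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Packed add: treat each 32-bit word as two independent 16-bit halfwords.
 * A carry out of the lower halfword must not spill into the upper one. */
uint32_t packed_add16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;              /* low halves, carry discarded */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* high halves, wraps at 16 bits */
    return hi | lo;
}
```

For example, `packed_add16(0x0001FFFF, 0x00010001)` gives `0x00020000`: the lower halfword wraps to zero without disturbing the upper halfword.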
Restrictions on Register Accesses
• Each functional unit has read/write ports
  - Data path 1 (2) units read/write A (B) registers
  - Data path 2 (1) can read one A (B) register per cycle
• 40-bit words stored in adjacent even/odd registers
  - Used in extended precision accumulation
  - One 40-bit result can be written per cycle
  - A 40-bit read cannot occur in same cycle as 40-bit write
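The 40-bit accumulation held in a register pair can be modeled in C as follows. This is a sketch under a stated assumption: the even register holds the low 32 bits and the odd register holds the upper 8 bits (the slide only says the registers are adjacent). The type `reg_pair` and all function names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: even register = low 32 bits, odd register = upper 8 bits. */
typedef struct {
    uint32_t even;
    uint32_t odd;
} reg_pair;

/* Reduce a value to 40 bits and sign-extend from bit 39. */
int64_t sign_extend40(int64_t v)
{
    uint64_t u = (uint64_t)v & 0xFFFFFFFFFFULL;
    if (u & 0x8000000000ULL)
        return (int64_t)u - (int64_t)0x10000000000LL;
    return (int64_t)u;
}

/* Split a 40-bit value across the register pair. */
reg_pair pack40(int64_t v)
{
    reg_pair p;
    p.even = (uint32_t)((uint64_t)v & 0xFFFFFFFFu);
    p.odd  = (uint32_t)(((uint64_t)v >> 32) & 0xFFu);
    return p;
}

/* Rebuild the signed 40-bit value from the register pair. */
int64_t unpack40(reg_pair p)
{
    uint64_t u = ((uint64_t)(p.odd & 0xFFu) << 32) | p.even;
    return sign_extend40((int64_t)u);
}

/* Accumulate a 32-bit term into a 40-bit accumulator (wraps at 40 bits). */
int64_t acc40_add(int64_t acc, int32_t term)
{
    return sign_extend40(acc + (int64_t)term);
}
```

The eight guard bits let a sum of 32-bit products grow well past the 32-bit range before it wraps.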
Disadvantages
• No hardware looping or bit-reversed addressing
  - Must emulate in software
• 40-bit accumulation incurs performance penalty
• No status register: must emulate status bits other than saturation bit (.L unit)
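As an illustration of emulating bit-reversed addressing in software, the index reversal used for FFT data reordering can be written as a short loop. This is a generic sketch, not code from the slides; `bit_reverse` is a hypothetical name, and `bits` is the number of index bits (log2 of the FFT length).

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of `index`.
 * For an 8-point FFT (bits = 3): 1 (001) maps to 4 (100). */
uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);  /* move the LSB into the result */
        index >>= 1;
    }
    return reversed;
}
```

Each call costs a handful of instructions per index bit, which is the kind of overhead that hardware bit-reversed addressing avoids.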
TMS320C62x Fixed-Point Processors
Processor   MHz   MIPS   Data (kbits)   Program (kbits)   Price   Applications
Unit price is for 100 - 999 units. N/a means not in production until 4Q99.
In volumes of 10,000, the 200 MHz C6201 is $96 per unit.
Example: Vector Dot Product
• A vector dot product is common in filtering
$$ Y = \sum_{n=1}^{N} a(n)\, x(n) $$
• Store a(n) and x(n) in arrays of N elements
• C6x peak performance: 8 RISC instructions/cycle
  - Peak RISC instructions per sample: 300,000 for speech; 54,421 for audio; and 290 for luminance NTSC video
  - Generally requires hand coding for peak performance
• First dot product example will not be optimized
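The per-sample instruction budgets follow from dividing peak instruction throughput by the sampling rate. The sketch below assumes a 300 MHz clock (2400 million instructions per second at 8 per cycle), 8 kHz speech, and 44.1 kHz audio. These values are inferred from the slide's numbers rather than stated on it, and the function names are hypothetical. The video sampling rate is not given, so that figure is not reproduced.

```c
#include <stdint.h>

/* Peak instruction rate: clock frequency times instructions issued per cycle. */
uint64_t peak_instructions_per_second(uint64_t clock_hz, uint64_t instr_per_cycle)
{
    return clock_hz * instr_per_cycle;
}

/* Instruction budget per input sample (integer division truncates). */
uint64_t instructions_per_sample(uint64_t instr_per_second, uint64_t sample_rate_hz)
{
    return instr_per_second / sample_rate_hz;
}
```

With the assumed 300 MHz clock, `instructions_per_sample(peak_instructions_per_second(300000000, 8), 8000)` gives 300,000, and a 44,100 Hz rate gives 54,421, matching the speech and audio figures.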
Example: Vector Dot Product
• Prologue
  - Initialize pointers: A5 for a(n), A6 for x(n), and A7 for Y
  - Move the number of times to loop (N) into A2
  - Set accumulator (A4) to zero
• Inner loop
  - Put a(n) into A0 and x(n) into A1
  - Multiply a(n) and x(n)
  - Accumulate multiplication result into A4
  - Decrement loop counter (A2)
  - Continue inner loop if counter is not zero
• Epilogue
  - Store the result into Y

Reg   Meaning
A0    a(n)
A1    x(n)
A2    N - n
A3    a(n) x(n)
A4    Y
A5    &a
A6    &x
A7    &Y
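The prologue, inner loop, and epilogue map directly onto a C loop. The sketch below mirrors the register roles from the table in comments. It assumes 16-bit input samples and a 32-bit accumulator, and returns the result instead of storing it through a pointer; `dot_product` is a hypothetical name.

```c
#include <stdint.h>

/* Unoptimized dot product, one multiply-accumulate per loop pass. */
int32_t dot_product(const int16_t *a, const int16_t *x, int n)
{
    int32_t y = 0;      /* A4: accumulator, cleared in the prologue */
    int count = n;      /* A2: loop counter */

    if (count <= 0)
        return 0;       /* the count-down loop below assumes at least one pass */

    do {
        int16_t a_n = *a++;                 /* A0: load a(n), advance pointer A5 */
        int16_t x_n = *x++;                 /* A1: load x(n), advance pointer A6 */
        int32_t product = (int32_t)a_n * x_n;  /* A3: a(n) x(n) */
        y += product;                       /* accumulate into A4 */
        count--;                            /* decrement A2 */
    } while (count != 0);                   /* branch back while A2 != 0 */

    return y;           /* epilogue: the caller stores the result into Y */
}
```

The count-down loop with a test against zero matches the conditional branch form available on the processor, where a register is tested for being nonzero.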
Example: Vector Dot Product
[Figure: vector dot product using the A data path only. Coefficients a(n) and data x(n) reside in memory. Register usage: A0 a(n), A1 x(n), A2 N - n, A3 a(n) x(n), A4 Y, A5 &a, A6 &x, A7 &Y.]
• Move constant (MVK: MoVe Konstant)
  - MVK .S 40,A2 ; A2 = 40
  - Lower 16 bits of A2 are loaded
• Conditional branch
  - [condition] B .S loop
  - [A2] means to execute the instruction if A2 != 0
  - Only A1, A2, B0, B1, and B2 can be used as condition registers
• Loading registers
  - LDH .D *A5, A0 ; loads half-word into A0 from memory
• Registers may be used as pointers (*A1++)
Pipelining
• CPU operations
  - Fetch instruction from memory (DSP program memory)
  - Decode instruction
  - Execute instruction including reading data values
• Overlap operations to increase performance
  - Pipeline CPU operations to increase clock speed over a sequential implementation
  - Separate parallel functional units
  - Peripheral interfaces for I/O do not burden CPU
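The benefit of overlapping fetch, decode, and execute can be estimated with a simple cycle count. The sketch below is an idealized model: it assumes each stage takes one cycle and that the pipeline never stalls. The function names are hypothetical.

```c
#include <stdint.h>

/* Sequential execution: every instruction passes through all stages
 * before the next instruction starts. */
uint64_t sequential_cycles(uint64_t instructions, uint64_t stages)
{
    return instructions * stages;
}

/* Ideal pipeline: the first instruction takes `stages` cycles to complete,
 * then one instruction completes every cycle after that. */
uint64_t pipelined_cycles(uint64_t instructions, uint64_t stages)
{
    if (instructions == 0)
        return 0;
    return stages + (instructions - 1);
}
```

For 100 instructions and 3 stages, the sequential model takes 300 cycles and the ideal pipeline takes 102, so throughput approaches one instruction per cycle as the instruction count grows.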
Pipelining
[Figure: instruction timing diagrams. Sequential (Motorola 56000) with stages Fetch, Decode, Read, Execute; a second diagram with stages Fetch, Decode, Execute.]
• Hardware instruction scheduling
TMS320C6x Pipeline
• Dispatches instructions in packets
Program Fetch (F)
[Figure: program fetch from C6x memory in four phases labeled FG, FS, FW, and FR.]
Decode Stage (D)
• Decode stage consists of two phases
  - Dispatch instruction to functional unit (DP)
  - Instruction decoded at functional unit (DC)
[Figure: C6x memory and fetch phases FG, FS, FW, and FR, followed by decode phases DP and DC.]
Execute Stage (E)
Type   Description   Instructions   Delay slots
LDx    Load          3              4
B      Branch        1              5
Execute stage (E)
Execute phase   Description
Vector Dot Product with Pipeline Effects
Multiplication has a delay of 1 cycle
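The effect of delay slots on the unoptimized loop can be estimated with a small in-order model. The sketch assumes one instruction issues per cycle, that a result becomes usable after the instruction's delay slots have passed, and that the delays are 4 for loads, 1 for multiplies, and 5 for branches. It is a simplified illustration, not a cycle-accurate simulator, and all names are hypothetical.

```c
#include <stddef.h>

#define MAX_DEPS 2

typedef struct {
    const char *name;
    int delay_slots;     /* cycles after issue before the result is usable */
    int deps[MAX_DEPS];  /* indices of producer instructions, -1 if unused */
} instr;

/* Issue instructions in order, one per cycle, waiting (as NOPs would)
 * until every producer's result is available. Fills issue[] with the
 * issue cycle of each instruction and returns the last issue cycle. */
int schedule_in_order(const instr *prog, size_t n, int *issue)
{
    for (size_t i = 0; i < n; i++) {
        int cycle = (i == 0) ? 0 : issue[i - 1] + 1;
        for (int d = 0; d < MAX_DEPS; d++) {
            int p = prog[i].deps[d];
            if (p >= 0) {
                int ready = issue[p] + prog[p].delay_slots + 1;
                if (ready > cycle)
                    cycle = ready;
            }
        }
        issue[i] = cycle;
    }
    return n ? issue[n - 1] : -1;
}

/* Cycles per pass of the unoptimized loop, including branch delay slots. */
int loop_iteration_cycles(void)
{
    const instr loop[] = {
        {"LDH", 4, {-1, -1}},  /* load a(n) */
        {"LDH", 4, {-1, -1}},  /* load x(n) */
        {"MPY", 1, {0, 1}},    /* needs both loads */
        {"ADD", 0, {2, -1}},   /* needs the product */
        {"SUB", 0, {-1, -1}},  /* decrement counter */
        {"B",   5, {4, -1}},   /* condition needs the new counter */
    };
    int issue[6];
    int last = schedule_in_order(loop, 6, issue);
    return last + loop[5].delay_slots + 1;  /* next pass starts after the branch delay */
}
```

Under these assumptions each pass takes 16 cycles for 6 useful instructions, which is why filling the delay slots with useful work matters.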
[Figure sequence: pipeline diagrams tracing the fetch packet MVK, LDH, LDH, MPY, ADD, SUB, B, STH through the fetch (F), DP, DC, and E1 to E6 stages, one step per diagram. In the last diagram, MVK is marked Done.]
Optimized Vector Dot Product
        ; clear A4 and B4, and initialize pointers A5, B6, and A7
        MVK   .S1   40,A2      ; A2 = 40 (loop counter)
loop    LDW   .D1   *A5++,A0   ; load a(n) and a(n+1)
        LDW   .D2   *B6++,B1   ; load x(n) and x(n+1)
        MPY   .M1X  A0,B1,A3   ; A3 = a(n) * x(n)
        MPYH  .M2X  A0,B1,B3   ; B3 = a(n+1) * x(n+1)
        ADD   .L1   A3,A4,A4   ; Yeven = Yeven + A3
        ADD   .L2   B3,B4,B4   ; Yodd = Yodd + B3
        SUB   .S1   A2,1,A2    ; decrement loop counter
[A2]    B     .S2   loop       ; if A2 != 0, then branch
        ADD   .L1X  A4,B4,A4   ; Y = Yodd + Yeven
        STH   .D1   A4,*A7     ; *A7 = Y
Retime summation
-- compute odd/even indexed terms at same time
-- utilize all eight functional units in the loop
-- put the sequential instructions in parallel
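The retimed summation can be modeled in C: each 32-bit load brings in two 16-bit samples, and the even- and odd-indexed products accumulate separately before a final add. The sketch assumes the lower halfword holds the even-indexed sample, and relies on the common two's complement behavior when narrowing to `int16_t`. The names `pack16` and `dot_product_packed` are hypothetical.

```c
#include <stdint.h>

/* Pack two 16-bit samples into one 32-bit word: `lo` in bits 0-15, `hi` in bits 16-31. */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Dot product over packed data, two multiply-accumulates per loop pass. */
int32_t dot_product_packed(const uint32_t *a, const uint32_t *x, int words)
{
    int32_t y_even = 0;  /* A4: sum of even-indexed products */
    int32_t y_odd = 0;   /* B4: sum of odd-indexed products */

    for (int i = 0; i < words; i++) {
        uint32_t aw = a[i];  /* one load brings in a(n) and a(n+1) */
        uint32_t xw = x[i];  /* one load brings in x(n) and x(n+1) */

        /* lower halfwords, as MPY does */
        y_even += (int32_t)(int16_t)(aw & 0xFFFFu) * (int16_t)(xw & 0xFFFFu);
        /* upper halfwords, as MPYH does */
        y_odd += (int32_t)(int16_t)(aw >> 16) * (int16_t)(xw >> 16);
    }
    return y_even + y_odd;  /* Y = Yodd + Yeven */
}
```

Because the two accumulators are independent, the two multiply-accumulate chains have no data dependence on each other, which is what allows the two data paths to work in parallel.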
27
TMS320C6x vs. Pentium MMX
Processor   Peak MIPS   BDTImarks   ISR latency   Power   Unit price   Area   Volume
TMS320C62x vs. StarCore S140
Feature                            C62x       S140
Functional units                   8          16
  multipliers                      2          4
  adders                           6          4
  other                            --         8
Instructions/cycle                 8          6 + branch
  RISC instructions *              8          11
  conditionals                     8          2
Instruction width (bits)           256        128
Total instructions                 48         180
Number of registers                32         51
Register size (bits)               32         40
Accumulation precision (bits) **   32 or 40   40
Pipeline depth (cycles)            7-11       5
* Does not count equivalent RISC operations for modulo addressing
** On the C62x, there is a performance penalty for 40-bit accumulation
Conclusion
• Web resources
  - comp.dsp newsgroup: FAQ www.bdti.com/faq/dsp_faq.html
  - embedded processors and systems: www.eg3.com
  - on-line courses and DSP boards: www.techonline.com
• References
  - R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, “Evaluating MMX Technology Using DSP and Multimedia Applications,” Proc. IEEE Sym. Microarchitecture, pp. 37-46, 1998. http://www.ece.utexas.edu/~ravib/mmxdsp/
  - B. L. Evans, “EE379K-17 Real-Time DSP Laboratory,” UT Austin. http://www.ece.utexas.edu/~bevans/courses/realtime/
  - B. L. Evans, “EE382C Embedded Software Systems,” UT Austin. http://www.ece.utexas.edu/~bevans/courses/ee382c/