INTRODUCTION TO THE TMS320C6x VLIW DSP
Prof. Brian L. Evans
in collaboration with
Niranjan Damera-Venkata and Magesh Valliappan
Embedded Signal Processing Laboratory
The University of Texas at Austin
Austin, TX 78712-1084
http://signal.ece.utexas.edu/
[Figure labels: Accumulator architecture; Memory-register architecture; Load-store architecture]
Outline
• Instruction set architecture
• Vector dot product example
• Pipelining
• Vector dot product example revisited
• Comparisons with other processors
• Conclusion
Instruction Set Architecture
[Figure: Simplified architecture. Program RAM and data RAM (or cache) connect over internal address and data buses to the CPU, DMA, serial port, host port, boot load, timers, power down logic, and external memory (sync and async). The CPU contains register files A0-A15 and B0-B15, functional units .D1/.D2, .M1/.M2, .L1/.L2, and .S1/.S2, and control registers.]
C62x: fixed point
C67x: floating point
Instruction Set Architecture
• Address 8/16/32-bit data, plus 64-bit data on the C67x
• Load-store RISC architecture with 2 data paths
  - 16 32-bit registers per data path (A0-15 and B0-15)
  - 48 instructions (C62x) and 79 instructions (C67x)
• Two parallel data paths with 32-bit RISC units
  - Data unit: 32-bit address calculations (modulo, linear)
  - Multiplier unit: 16 bit x 16 bit with 32-bit result
  - Logical unit: 40-bit (saturation) arithmetic & compares
  - Shifter unit: 32-bit integer ALU and 40-bit shifter
  - Conditionally executed based on registers A1-2 & B0-2
  - Work with two 16-bit halfwords packed into 32 bits
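Packing two 16-bit halfwords into one 32-bit word, and multiplying the halves separately, can be sketched in Python. This is only an illustrative model, not C6x code: the helper names (`pack_halfwords`, `unpack_halfwords`, `mpy_low`, `mpy_high`) are hypothetical and only loosely mirror a 16 bit x 16 bit multiply of the lower or upper halfwords of two registers.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack_halfwords(high, low):
    """Pack two signed 16-bit values into one 32-bit word (high in bits 31-16)."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def unpack_halfwords(word):
    """Return the (high, low) signed 16-bit halves of a 32-bit word."""
    return to_signed16(word >> 16), to_signed16(word)


def mpy_low(word_a, word_b):
    """16 x 16 -> 32-bit product of the lower halfwords (hypothetical helper)."""
    return unpack_halfwords(word_a)[1] * unpack_halfwords(word_b)[1]


def mpy_high(word_a, word_b):
    """16 x 16 -> 32-bit product of the upper halfwords (hypothetical helper)."""
    return unpack_halfwords(word_a)[0] * unpack_halfwords(word_b)[0]
```

For example, `pack_halfwords(-3, 7)` and `pack_halfwords(5, -2)` hold four samples in two words, and `mpy_low` and `mpy_high` each produce one product from the same pair of words.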
Functional Units
• .M multiplication unit
  - 16 bit x 16 bit signed/unsigned packed/unpacked
• .L arithmetic logic unit
  - Comparisons and logic operations (and, or, and xor)
  - Saturation arithmetic and absolute value
• .S shifter unit
  - Bit manipulation (set, get, shift, rotate) and branching
  - Addition and packed addition
• .D data unit
  - Load/store to memory
  - Addition and pointer arithmetic
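The difference between saturation arithmetic and ordinary wraparound arithmetic can be illustrated with a small Python model. This is a sketch assuming signed 32-bit operands; the function names are hypothetical and do not correspond to C6x mnemonics.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def saturating_add(a, b):
    """Add two signed 32-bit values, clamping to the representable range."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def wrapping_add(a, b):
    """Add two signed 32-bit values with two's-complement wraparound."""
    total = (a + b) & 0xFFFFFFFF
    return total - 2**32 if total & 0x80000000 else total
```

Adding 1 to the largest positive value stays at the maximum with `saturating_add`, while `wrapping_add` flips to the most negative value, which is why saturation is usually preferred for signal samples.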
Restrictions on Register Accesses
• Each functional unit has read/write ports
  - Data path 1 (2) units read/write A (B) registers
  - Data path 2 (1) can read one A (B) register per cycle
• 40-bit words stored in adjacent even/odd registers
  - Used in extended precision accumulation
  - One 40-bit result can be written per cycle
  - A 40-bit read cannot occur in same cycle as 40-bit write
• Two simultaneous memory accesses cannot use registers of same register file as address pointers
• No more than four reads per register per cycle
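Why 40-bit words help in extended precision accumulation can be seen in a short Python model. It is a sketch under simple assumptions: the accumulator wraps on overflow, and the helper names `wrap_signed` and `accumulate` are hypothetical.

```python
def wrap_signed(value, bits):
    """Wrap an integer into a signed two's-complement range of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def accumulate(products, bits):
    """Sum products in an accumulator that is `bits` wide and wraps on overflow."""
    acc = 0
    for product in products:
        acc = wrap_signed(acc + product, bits)
    return acc
```

With 300 copies of the largest 16 bit x 16 bit product (32767 * 32767), a 32-bit accumulator overflows after only a few terms, while the eight extra bits of a 40-bit accumulator hold the exact sum.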
Disadvantages
• No acceleration for variable length decoding
  - 50% of computation for MPEG-2 decoding on C6x in C
• Deep pipeline
  - If a branch is in the pipeline, interrupts are disabled: avoid branches by using conditional execution
  - No hardware protection against pipeline hazards: programmer and software tools must guard against it
• No hardware looping or bit-reversed addressing
  - Must emulate in software
• 40-bit accumulation incurs performance penalty
• No status register: must emulate status bits other than saturation bit (.L unit)
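Since bit-reversed addressing must be emulated in software, the index calculation itself is worth spelling out. The following Python sketch shows one straightforward way to reverse the bits of an index; the function name `bit_reverse` is hypothetical, and a real C6x implementation would more likely use a lookup table or hand-tuned assembly.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index (as used for FFT reordering)."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)  # move the lowest bit of index into result
        index >>= 1
    return result
```

For an 8-point FFT (3 address bits), `[bit_reverse(i, 3) for i in range(8)]` gives the familiar ordering 0, 4, 2, 6, 1, 5, 3, 7.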
TMS320C62x Fixed-Point Processors
Processor  MHz  MIPS  Data (kbits)  Program (kbits)  Price  Applications
C6211      150  1200  32            32               $25
           167  1336  (512 kbit L2 cache)
C6201      167  1336  512           512              $152   EVM board
           200  1600                                 $159
C6202      200  1600  1000          2000             $167
           250  2000                                 $184
C6203      250  2000  4000          3000             n/a    3G basestations
           300  2400                                 n/a    modem banks
Unit price is for 100 - 999 units. N/a means not in production until 4Q99.
In volumes of 10,000, the 200 MHz C6201 is $96 per unit.
For more information: http://www.ti.com/sc/c62xdsps/
Example: Vector Dot Product
• A vector dot product is common in filtering
$$Y = \sum_{n=1}^{N} a(n)\, x(n)$$
• Store a(n) and x(n) into an array of N elements
• C6x peak performance: 8 RISC instructions/cycle
  - Peak RISC instructions per sample: 300,000 for speech; 54,421 for audio; and 290 for luminance NTSC video
  - Generally requires hand coding for peak performance
• First dot product example will not be optimized
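The peak figures for speech and audio can be reproduced with a short calculation. The sketch below assumes a 300 MHz part issuing 8 instructions per cycle (2400 MIPS), a speech sampling rate of 8 kHz, and an audio sampling rate of 44.1 kHz. These rates are assumptions chosen because they match the quoted numbers; the video figure is not reproduced because its sampling rate is not stated.

```python
def peak_instructions_per_sample(mips, sample_rate_hz):
    """Peak RISC instructions available per input sample, rounded down."""
    return (mips * 1_000_000) // sample_rate_hz


PEAK_MIPS = 300 * 8  # assumed: 300 MHz clock x 8 RISC instructions per cycle
speech = peak_instructions_per_sample(PEAK_MIPS, 8_000)   # assumed 8 kHz speech
audio = peak_instructions_per_sample(PEAK_MIPS, 44_100)   # assumed 44.1 kHz audio
```

Under these assumptions `speech` is 300000 and `audio` is 54421.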
Example: Vector Dot Product
• Prologue
  - Initialize pointers: A5 for a(n), A6 for x(n), and A7 for Y
  - Move the number of times to loop (N) into A2
  - Set accumulator (A4) to zero
• Inner loop
  - Put a(n) into A0 and x(n) into A1
  - Multiply a(n) and x(n)
  - Accumulate multiplication result into A4
  - Decrement loop counter (A2)
  - Continue inner loop if counter is not zero
• Epilogue
  - Store the result into Y

Reg  Meaning
A0   a(n)
A1   x(n)
A2   N - n
A3   a(n) x(n)
A4   Y
A5   &a
A6   &x
A7   &Y
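The prologue, inner loop, and epilogue can be modeled in Python with variables named after the registers. This is only a behavioral sketch: array indices stand in for the pointers, the function name `dot_product` is hypothetical, and, like the loop it models, it assumes N is at least 1 because the counter is tested after the loop body.

```python
def dot_product(a, x):
    """Behavioral model of the dot product loop using the slide's register names."""
    # Prologue: A5 and A6 are indices standing in for pointers &a and &x
    A5, A6 = 0, 0
    A2 = len(a)          # loop counter N (must be >= 1)
    A4 = 0               # accumulator Y
    # Inner loop
    while True:
        A0 = a[A5]       # load a(n), post-increment pointer
        A5 += 1
        A1 = x[A6]       # load x(n), post-increment pointer
        A6 += 1
        A3 = A0 * A1     # multiply
        A4 = A4 + A3     # accumulate
        A2 = A2 - 1      # decrement loop counter
        if A2 == 0:      # continue only while the counter is not zero
            break
    # Epilogue: return Y in place of storing it to memory
    return A4
```

For example, `dot_product([1, 2, 3], [4, 5, 6])` returns 32.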
Example: Vector Dot Product
[Figure: memory arrays "Coefficients a(n)" and "Data x(n)"]
Registers: A0 = a(n), A1 = x(n), A2 = N - n, A3 = a(n) x(n), A4 = Y, A5 = &a, A6 = &x, A7 = &Y
Using A data path only
; clear A4 and initialize pointers A5, A6, and A7
MVK .S1 40,A2 ; A2 = 40 (loop counter)
loop LDH .D1 *A5++,A0 ; A0 = a(n)
LDH .D1 *A6++,A1 ; A1 = x(n)
MPY .M1 A0,A1,A3 ; A3 = a(n) * x(n)
ADD .L1 A3,A4,A4 ; Y = Y + A3
SUB .L1 A2,1,A2 ; decrement loop counter
[A2] B .S1 loop ; if A2 != 0, then branch
STH .D1 A4,*A7 ; *A7 = Y
Example: Vector Dot Product
• MoVe Konstant (MVK)
  - MVK .S 40,A2 ; A2 = 40
  - Loads a 16-bit constant into the lower 16 bits of A2 (the value is sign-extended to 32 bits)
• Conditional branch
  - [condition] B .S loop
  - [A2] means to execute the instruction if A2 != 0
  - Only A1, A2, B0, B1, and B2 can be used
• Loading registers
  - LDH .D *A5, A0 ; Loads half-word into A0 from memory
• Registers may be used as pointers (*A1++)
Pipelining
• CPU operations
  - Fetch instruction from memory (DSP program memory)
  - Decode instruction
  - Execute instruction including reading data values
• Overlap operations to increase performance
  - Pipeline CPU operations to increase clock speed over a sequential implementation
  - Separate parallel functional units
  - Peripheral interfaces for I/O do not burden CPU
Pipelining
[Figure: instruction timing diagrams for four pipeline styles]
• Sequential (Motorola 56000): Fetch, Decode, Read, Execute
• Pipelined (most conventional DSP processors): Fetch, Decode, Read, Execute
• Superscalar (Pentium, MIPS): Fetch, Decode, Read, Execute
• Superpipelined (TMS320C6x): Fetch, Decode, Execute
Managing Pipelines
• compiler or programmer (TMS320C6x)
• pipeline interlocking in processor (TMS320C30)
• hardware instruction scheduling
TMS320C6x Pipeline
• One instruction cycle every clock cycle
• Deep pipeline
  - 7-11 stages in C62x: fetch 4, decode 2, execute 1-5
  - 7-16 stages in C67x: fetch 4, decode 2, execute 1-10
  - If a branch is in the pipeline, interrupts are disabled
  - Avoid branches by using conditional execution
• No hardware protection against pipeline hazards
  - Compiler and assembler must prevent pipeline hazards
• Dispatches instructions in packets
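The quoted pipeline depths follow directly from adding the fetch, decode, and execute phase counts. A minimal Python check of that arithmetic (the function name `pipeline_depth` is hypothetical):

```python
def pipeline_depth(fetch, decode, execute_min, execute_max):
    """Return (shallowest, deepest) total pipeline depth in stages."""
    base = fetch + decode
    return base + execute_min, base + execute_max


c62x_depth = pipeline_depth(4, 2, 1, 5)    # fetch 4, decode 2, execute 1-5
c67x_depth = pipeline_depth(4, 2, 1, 10)   # fetch 4, decode 2, execute 1-10
```

This gives (7, 11) for the C62x and (7, 16) for the C67x, matching the ranges above.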
Program Fetch (F)
• Program fetching consists of 4 phases
  - generate fetch address (FG)
  - send address to memory (FS)
  - wait for data ready (FW)
  - read opcode (FR)
• Fetch packet consists of 8 32-bit instructions
[Figure: C6x and memory exchanging the FG, FS, FW, and FR phases]
Decode Stage (D)
• Decode stage consists of two phases
  - dispatch instruction to functional unit (DP)
  - instruction decoded at functional unit (DC)
[Figure: C6x and memory with fetch phases FG, FS, FW, FR followed by DP and DC]
Execute Stage (E)
Type  Description   # Instr  Delay
ISC   Single cycle  38       0
IMPY  Multiply      2        1
LDx   Load          3        4
B     Branch        1        5
Execute stage (E)
Execute Phase  Description
E1             ISC instructions completed
E2             IMPY instructions completed
E3
E4
E5             Load value into register
E6             Branch to destination complete
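The delay values and the completing execute phases are related by a simple rule: an instruction type with a delay of d completes in phase E(d+1). A small Python sketch of that mapping, using the type names from the tables (the function name `completing_phase` is hypothetical):

```python
# Delay for each instruction type, as listed in the execute stage table
DELAY_SLOTS = {"ISC": 0, "IMPY": 1, "LDx": 4, "B": 5}


def completing_phase(instruction_type):
    """Execute phase in which the instruction's result becomes available."""
    return f"E{DELAY_SLOTS[instruction_type] + 1}"
```

For example, `completing_phase("LDx")` returns "E5", which is why a loaded value cannot be used until four cycles after the load issues.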
Vector Dot Product with Pipeline Effects
; clear A4 and initialize pointers A5, A6, and A7
MVK .S1 40,A2 ; A2 = 40 (loop counter)
loop LDH .D1 *A5++,A0 ; A0 = a(n)
LDH .D1 *A6++,A1 ; A1 = x(n)
MPY .M1 A0,A1,A3 ; A3 = a(n) * x(n)
ADD .L1 A3,A4,A4 ; Y = Y + A3
SUB .L1 A2,1,A2 ; decrement loop counter
[A2] B .S1 loop ; if A2 != 0, then branch
STH .D1 A4,*A7 ; *A7 = Y
Multiplication has a delay of 1 cycle
Load has a pipeline delay of four cycles
[Figure sequence: pipeline occupancy for the instructions MVK, LDH, LDH, MPY, ADD, SUB, B, STH across the stages F, DP, DC, E1, E2, E3, E4, E5, E6]
Step                            Time (t)        Annotation
Fetch packet                    4 clock cycles  (F1-4)
Dispatch                        5 clock cycles  F(2-5)
Decode                          6 clock cycles  F(2-5)
Execute (E1)                    7 clock cycles  F(2-5)
Execute (MVK done, LDH in E1)   8 clock cycles  F(2-5); MVK done
Vector Dot Product with Pipeline Effects
; clear A4 and initialize pointers A5, A6, and A7
MVK .S1 40,A2 ; A2 = 40 (loop counter)
loop LDH .D1 *A5++,A0 ; A0 = a(n)
LDH .D1 *A6++,A1 ; A1 = x(n)
NOP 4
MPY .M1 A0,A1,A3 ; A3 = a(n) * x(n)
NOP
ADD .L1 A3,A4,A4 ; Y = Y + A3
SUB .L1 A2,1,A2 ; decrement loop counter
[A2] B .S1 loop ; if A2 != 0, then branch
NOP 5
STH .D1 A4,*A7 ; *A7 = Y
Assembler will automatically insert NOP instructions
Assembler can also make sequential code parallel
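How the NOP counts in this listing arise from the delay values can be sketched with a toy scheduler in Python. It is a simplified model, not a description of the TI tools: it issues one instruction per cycle, tracks only the named source and destination registers (pointer post-increments are ignored), and always fills the branch delay with NOPs. The names `insert_nops` and `LOOP_BODY` are hypothetical.

```python
# Delay slots per mnemonic: load 4, multiply 1, branch 5, single-cycle 0
DELAY_SLOTS = {"MVK": 0, "LDH": 4, "MPY": 1, "ADD": 0, "SUB": 0, "B": 5, "STH": 0}

# (mnemonic, source registers, destination register) for the loop body
LOOP_BODY = [
    ("LDH", ["A5"], "A0"),
    ("LDH", ["A6"], "A1"),
    ("MPY", ["A0", "A1"], "A3"),
    ("ADD", ["A3", "A4"], "A4"),
    ("SUB", ["A2"], "A2"),
    ("B", ["A2"], None),
]


def insert_nops(program):
    """Return the mnemonics with 'NOP n' entries covering every delay."""
    ready = {}   # register -> first cycle in which its new value can be read
    cycle = 0
    out = []
    for mnemonic, sources, dest in program:
        start = max([cycle] + [ready.get(reg, 0) for reg in sources])
        if start > cycle:
            out.append(f"NOP {start - cycle}")   # wait for operands
        out.append(mnemonic)
        cycle = start + 1
        if dest is not None:
            ready[dest] = start + 1 + DELAY_SLOTS[mnemonic]
        if mnemonic == "B":
            out.append(f"NOP {DELAY_SLOTS['B']}")  # branch delay slots
            cycle += DELAY_SLOTS["B"]
    return out
```

Calling `insert_nops(LOOP_BODY)` yields LDH, LDH, NOP 4, MPY, NOP 1, ADD, SUB, B, NOP 5, the same pattern as the listing (a bare NOP is one cycle), for 16 cycles per loop iteration.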
Optimized Vector Dot Product
; clear A4 and B4, and initialize pointers A5, B6, and A7
MVK .S1 20,A2 ; A2 = 20 (loop counter: N = 40, two samples per pass)
loop LDW .D1 *A5++,A0 ; load a(n) and a(n+1)
LDW .D2 *B6++,B1 ; load x(n) and x(n+1)
MPY .M1X A0,B1,A3 ; A3 = a(n) * x(n)
MPYH .M2X A0,B1,B3 ; B3 = a(n+1) * x(n+1)
ADD .L1 A3,A4,A4 ; Yeven = Yeven + A3
ADD .L2 B3,B4,B4 ; Yodd = Yodd + B3
SUB .S1 A2,1,A2 ; decrement loop counter
[A2] B .S2 loop ; if A2 != 0, then branch
ADD .L1X A4,B4,A4 ; Y = Yodd + Yeven
STH .D1 A4,*A7 ; *A7 = Y
Retime summation
-- compute odd/even indexed terms at same time
-- utilize all eight functional units in the loop
-- put the sequential instructions in parallel
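The retimed summation can be checked with a small Python model that keeps separate even and odd partial sums and combines them at the end. It is a behavioral sketch only: the function name `dot_product_retimed` is hypothetical, the vector length is assumed to be even, and nothing here models the parallel issue of instructions.

```python
def dot_product_retimed(a, x):
    """Dot product with separate even- and odd-indexed partial sums."""
    y_even = 0   # accumulates a(n) * x(n) for even n
    y_odd = 0    # accumulates a(n+1) * x(n+1) for odd n+1
    for n in range(0, len(a), 2):   # two samples per pass, so N/2 passes
        y_even += a[n] * x[n]
        y_odd += a[n + 1] * x[n + 1]
    return y_even + y_odd           # combine the partial sums once at the end
```

Because the two partial sums are independent, they can be updated at the same time on the two data paths; the result equals the ordinary dot product, for example 70 for `[1, 2, 3, 4]` and `[5, 6, 7, 8]`.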
TMS320C6x vs. Pentium MMX
Processor        Peak MIPS  BDTImarks  ISR latency  Power   Unit Price  Area         Volume
Pentium MMX 233  466        49         1.14 µs      4.25 W  $213        5.5” x 2.5”  8.789 in³
Pentium MMX 266  532        56         1.00 µs      4.85 W  $348        5.5” x 2.5”  8.789 in³
C62x 150 MHz     1200       74         0.12 µs      1.45 W  $25         1.3” x 1.3”  0.118 in³
C62x 200 MHz     1600       99         0.09 µs      1.94 W  $96         1.3” x 1.3”  0.118 in³
BDTImarks: Berkeley Design Technology Inc. DSP benchmark results (larger means better) http://www.bdti.com/bdtimark/results.htm
http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/processors.html
TMS320C62x vs. StarCore S140
Feature                            C62x      S140
Functional units                   8         16
  multipliers                      2         4
  adders                           6         4
  other                            --        8
Instructions/cycle                 8         6 + branch
RISC instructions *                8         11
  conditionals                     8         2
Instruction width (bits)           256       128
Total instructions                 48        180
Number of registers                32        51
Register size (bits)               32        40
Accumulation precision (bits) **   32 or 40  40
Pipeline depth (cycle)             7-11      5
* Does not count equivalent RISC operations for modulo addressing
** On the C62x, there is a performance penalty for 40-bit accumulation
Conclusion
• Conventional digital signal processors
  - High performance vs. power consumption/cost/volume
  - Excel at one-dimensional processing
  - Have instructions tailored to specific applications
• TMS320C6x VLIW DSP
  - High performance vs. cost/volume
  - Excel at multidimensional signal processing
  - A maximum of 8 RISC instructions per cycle
Conclusion
• Web resources
  - comp.dsp newsgroup: FAQ www.bdti.com/faq/dsp_faq.html
  - embedded processors and systems: www.eg3.com
  - on-line courses and DSP boards: www.techonline.com
• References
  - R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, “Evaluating MMX Technology Using DSP and Multimedia Applications,” Proc. IEEE Sym. Microarchitecture, pp. 37-46, 1998. http://www.ece.utexas.edu/~ravib/mmxdsp/
  - B. L. Evans, “EE379K-17 Real-Time DSP Laboratory,” UT Austin. http://www.ece.utexas.edu/~bevans/courses/realtime/
  - B. L. Evans, “EE382C Embedded Software Systems,” UT Austin. http://www.ece.utexas.edu/~bevans/courses/ee382c/