
INTRODUCTION TO THE TMS320C6x VLIW DSP

Prof. Brian L. Evans
in collaboration with
Niranjan Damera-Venkata and
Magesh Valliappan

Embedded Signal Processing Laboratory
The University of Texas at Austin
Austin, TX 78712-1084

http://signal.ece.utexas.edu/

[Figure: accumulator architecture, memory-register architecture, and load-store architecture]

Outline

• Instruction set architecture

• Vector dot product example

• Pipelining

• Vector dot product example revisited

• Comparisons with other processors

• Conclusion

Instruction Set Architecture

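[Figure: Simplified architecture. The CPU holds two data paths, registers A0-A15 with units .L1, .S1, .M1, .D1 and registers B0-B15 with units .L2, .S2, .M2, .D2, plus control registers. Internal buses (address and data) connect the CPU to program RAM, data RAM or cache, and the peripherals: DMA, serial port, host port, boot load, timers, power down, and external memory (sync and async).]

C62x: fixed point
C67x: floating point
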
Instruction Set Architecture

• Address 8/16/32 bit data + 64 bit data on C67x

• Load-store RISC architecture with 2 data paths
  - 16 32-bit registers per data path (A0-15 and B0-15)
  - 48 instructions (C62x) and 79 instructions (C67x)

• Two parallel data paths with 32-bit RISC units
  - Data unit: 32-bit address calculations (modulo, linear)
  - Multiplier unit: 16 bit x 16 bit with 32-bit result
  - Logical unit: 40-bit (saturation) arithmetic & compares
  - Shifter unit: 32-bit integer ALU and 40-bit shifter
  - Conditionally executed based on registers A1-2 & B0-2
  - Work with two 16-bit halfwords packed into 32 bits

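The packed-halfword idea can be sketched with a small Python model (not C6x code). It assumes that MPY multiplies the signed lower halfwords of two 32-bit words and MPYH the signed upper halfwords, which is how the optimized dot product later in these slides uses them. The helper names `to_signed16`, `pack`, `mpy`, and `mpyh` are hypothetical.

```python
def to_signed16(v):
    """Interpret the low 16 bits of v as a signed two's-complement value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def pack(hi, lo):
    """Pack two 16-bit halfwords into one 32-bit word (hi in bits 31-16)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def mpy(w1, w2):
    """Model of MPY: signed product of the lower halfwords."""
    return to_signed16(w1) * to_signed16(w2)

def mpyh(w1, w2):
    """Model of MPYH: signed product of the upper halfwords."""
    return to_signed16(w1 >> 16) * to_signed16(w2 >> 16)

a = pack(3, -2)   # upper halfword 3, lower halfword -2
x = pack(5, 7)    # upper halfword 5, lower halfword 7
print(mpy(a, x), mpyh(a, x))
```

One 32-bit load therefore delivers two operands, and the two multipliers can each consume one half.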
Functional Units

• .M multiplication unit
  - 16 bit x 16 bit signed/unsigned packed/unpacked
• .L arithmetic logic unit
  - Comparisons and logic operations (and, or, and xor)
  - Saturation arithmetic and absolute value
• .S shifter unit
  - Bit manipulation (set, get, shift, rotate) and branching
  - Addition and packed addition
• .D data unit
  - Load/store to memory
  - Addition and pointer arithmetic

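As a rough illustration of what saturation arithmetic means, the Python sketch below contrasts a 32-bit add that clamps at the limits with one that wraps around. It is a generic model of saturation, not a cycle-accurate description of the .L unit, and the names `sadd32` and `wrap_add32` are hypothetical.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def sadd32(a, b):
    """Add two signed 32-bit values, clamping to the limits on overflow."""
    return max(INT32_MIN, min(INT32_MAX, a + b))

def wrap_add32(a, b):
    """Ordinary two's-complement add that wraps around on overflow."""
    s = (a + b) & 0xFFFFFFFF
    return s - 2**32 if s & 0x80000000 else s

print(sadd32(INT32_MAX, 1), wrap_add32(INT32_MAX, 1))
```

Clamping keeps an overflowing signal at full scale instead of flipping its sign, which is usually the less harmful failure in signal processing.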
Restrictions on Register Accesses

• Each function unit has read/write ports
  - Data path 1 (2) units read/write A (B) registers
  - Data path 2 (1) can read one A (B) register per cycle

• 40 bit words stored in adjacent even/odd registers
  - Used in extended precision accumulation
  - One 40-bit result can be written per cycle
  - A 40-bit read cannot occur in same cycle as 40-bit write

• Two simultaneous memory accesses cannot use registers of same register file as address pointers
• No more than four reads per register per cycle

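Why extended precision accumulation matters can be checked with a little arithmetic. The Python sketch below assumes a worst case of 40 products of two 16-bit values, each equal to (-32768) x (-32768); the count 40 is borrowed from the loop counter used in the dot product example, and `fits_signed` is a hypothetical helper.

```python
def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement number of that width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

N = 40                                  # number of accumulated products (assumed)
worst_product = (-32768) * (-32768)     # largest 16 x 16 bit product, 2**30
total = N * worst_product

print(fits_signed(total, 32), fits_signed(total, 40))
```

The 8 extra bits of a 40-bit accumulator act as guard bits, so the worst-case sum still fits even though it overflows 32 bits.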
Disadvantages

• No acceleration for variable length decoding
  - 50% of computation for MPEG-2 decoding on C6x in C
• Deep pipeline
  - If a branch is in the pipeline, interrupts are disabled: avoid branches by using conditional execution
  - No hardware protection against pipeline hazards: programmer and software tools must guard against them
• No hardware looping or bit-reversed addressing
  - Must emulate in software
• 40-bit accumulation incurs performance penalty
• No status register: must emulate status bits other than saturation bit (.L unit)

TMS320C62x Fixed-Point Processors
Processor   MHz   MIPS   Data      Program   Price   Applications
                         (kbits)   (kbits)

C6211       150   1200     32        32      $25
            167   1336                               (512 kbit L2 cache)
C6201       167   1336    512       512      $152    EVM board
            200   1600                       $159
C6202       200   1600   1000      2000      $167
            250   2000                       $184
C6203       250   2000   4000      3000      n/a     3G basestations
            300   2400                       n/a     modem banks

Unit price is for 100 - 999 units. N/a means not in production until 4Q99.
In volumes of 10,000, the 200 MHz C6201 is $96 per unit.

For more information: http://www.ti.com/sc/c62xdsps/

Example: Vector Dot Product

• A vector dot product is common in filtering

$$Y = \sum_{n=1}^{N} a(n) \, x(n)$$

• Store a(n) and x(n) into an array of N elements
• C6x peak performance: 8 RISC instructions/cycle
  - Peak RISC instructions per sample: 300,000 for speech; 54,421 for audio; and 290 for luminance NTSC video
  - Generally requires hand coding for peak performance

• First dot product example will not be optimized

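The per-sample instruction budgets for speech and audio can be reproduced with one division. The Python sketch below assumes a 300 MHz clock with 8 instructions per cycle (2400 MIPS) and sampling rates of 8 kHz for speech and 44.1 kHz for audio. The slides do not state these rates; they are assumptions chosen because they match the quoted figures. The function name is hypothetical.

```python
def instructions_per_sample(clock_hz, instr_per_cycle, sample_rate_hz):
    """Peak number of whole RISC instructions available between two samples."""
    return (clock_hz * instr_per_cycle) // sample_rate_hz

speech = instructions_per_sample(300_000_000, 8, 8_000)    # assumed 8 kHz speech
audio = instructions_per_sample(300_000_000, 8, 44_100)    # assumed 44.1 kHz audio
print(speech, audio)
```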
Example: Vector Dot Product

• Prologue
  - Initialize pointers: A5 for a(n), A6 for x(n), and A7 for Y
  - Move the number of times to loop (N) into A2
  - Set accumulator (A4) to zero
• Inner loop
  - Put a(n) into A0 and x(n) into A1
  - Multiply a(n) and x(n)
  - Accumulate multiplication result into A4
  - Decrement loop counter (A2)
  - Continue inner loop if counter is not zero
• Epilogue
  - Store the result into Y

Reg   Meaning
A0    a(n)
A1    x(n)
A2    N - n
A3    a(n) x(n)
A4    Y
A5    &a
A6    &x
A7    &Y

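The prologue, inner loop, and epilogue steps can be traced in a short Python model that uses the register names from the table as variable names. It is only a sketch of the control flow: list indices stand in for the pointers, and the loop is written so the body runs before the counter is tested, which assumes N >= 1.

```python
def dot_product(a, x):
    """Follow the prologue / inner loop / epilogue steps with named registers."""
    assert len(a) == len(x) and len(a) >= 1
    # Prologue
    A5 = 0          # index standing in for the pointer &a
    A6 = 0          # index standing in for the pointer &x
    A2 = len(a)     # loop counter N
    A4 = 0          # accumulator
    # Inner loop
    while True:
        A0 = a[A5]; A5 += 1     # put a(n) into A0
        A1 = x[A6]; A6 += 1     # put x(n) into A1
        A3 = A0 * A1            # multiply
        A4 = A4 + A3            # accumulate
        A2 = A2 - 1             # decrement loop counter
        if A2 == 0:             # continue only while the counter is not zero
            break
    # Epilogue
    Y = A4
    return Y

print(dot_product([1, 2, 3], [4, 5, 6]))
```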
Example: Vector Dot Product
[Figure: memory blocks holding the coefficients a(n) and the data x(n), alongside the register assignments A0 = a(n), A1 = x(n), A2 = N - n, A3 = a(n) x(n), A4 = Y, A5 = &a, A6 = &x, A7 = &Y]

Using A data path only

; clear A4 and initialize pointers A5, A6, and A7
           MVK  .S1  40,A2      ; A2 = 40 (loop counter)
loop       LDH  .D1  *A5++,A0   ; A0 = a(n)
           LDH  .D1  *A6++,A1   ; A1 = x(n)
           MPY  .M1  A0,A1,A3   ; A3 = a(n) * x(n)
           ADD  .L1  A3,A4,A4   ; Y = Y + A3
           SUB  .L1  A2,1,A2    ; decrement loop counter
      [A2] B    .S1  loop       ; if A2 != 0, then branch
           STH  .D1  A4,*A7     ; *A7 = Y

Example: Vector Dot Product

• MVK: MoVe Konstant
  - MVK .S 40,A2 ; A2 = 40
  - Lower 16 bits of A2 are loaded
• Conditional branch
  - [condition] B .S loop
  - [A2] means to execute the instruction if A2 != 0
  - Only A1, A2, B0, B1, and B2 can be used as condition registers
• Loading registers
  - LDH .D *A5, A0 ; loads half-word into A0 from memory
• Registers may be used as pointers (*A1++)

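The combination of a half-word load and a post-incremented pointer, as in `LDH .D1 *A5++,A0`, can be modeled in Python as below. The sketch assumes signed 16-bit data stored little-endian in a byte buffer and a pointer that advances by one half-word (2 bytes) per load. `ldh_postinc` and `memory` are hypothetical names.

```python
import struct

def ldh_postinc(memory, ptr):
    """Model of LDH *ptr++: load a signed half-word, then advance ptr by 2 bytes."""
    (value,) = struct.unpack_from("<h", memory, ptr)
    return value, ptr + 2

memory = struct.pack("<3h", 100, -200, 300)   # three half-words in memory
A5 = 0                                        # pointer (byte offset into memory)
A0, A5 = ldh_postinc(memory, A5)
A1, A5 = ldh_postinc(memory, A5)
print(A0, A1, A5)
```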
Pipelining

• CPU operations
  - Fetch instruction from memory (DSP program memory)
  - Decode instruction
  - Execute instruction including reading data values

• Overlap operations to increase performance
  - Pipeline CPU operations to increase clock speed over a sequential implementation
  - Separate parallel functional units
  - Peripheral interfaces for I/O do not burden CPU

Pipelining
[Figure: timing diagrams showing how instruction stages overlap in four processor organizations]

Sequential (Motorola 56000): Fetch, Decode, Read, Execute
Pipelined (most conventional DSP processors): Fetch, Decode, Read, Execute
Superscalar (Pentium, MIPS): Fetch, Decode, Read, Execute
Superpipelined (TMS320C6x): Fetch, Decode, Execute

Managing Pipelines
• compiler or programmer (TMS320C6x)
• pipeline interlocking in processor (TMS320C30)
• hardware instruction scheduling

TMS320C6x Pipeline

• One instruction cycle every clock cycle

• Deep pipeline
  - 7-11 stages in C62x: fetch 4, decode 2, execute 1-5
  - 7-16 stages in C67x: fetch 4, decode 2, execute 1-10
  - If a branch is in the pipeline, interrupts are disabled
  - Avoid branches by using conditional execution
• No hardware protection against pipeline hazards
  - Compiler and assembler must prevent pipeline hazards

• Dispatches instructions in packets

Program Fetch (F)

• Program fetching consists of 4 phases
  - generate fetch address (FG)
  - send address to memory (FS)
  - wait for data ready (FW)
  - read opcode (FR)
• Fetch packet consists of 8 32-bit instructions

[Figure: the C6x and memory, with arrows labeled by the fetch phases FG, FS, FW, and FR]

Decode Stage (D)

• Decode stage consists of two phases
  - dispatch instruction to functional unit (DP)
  - instruction decoded at functional unit (DC)

[Figure: the C6x and memory, with the fetch phases FG, FS, FW, and FR followed by the decode phases DP and DC]

Execute Stage (E)

Type   Description    # Instr   Delay

ISC    Single cycle   38        0
IMPY   Multiply       2         1
LDx    Load           3         4
B      Branch         1         5

Execute stage (E)

Execute Phase   Description

E1              ISC instructions completed
E2              IMPY instructions completed
E3
E4
E5              Load value into register
E6              Branch to destination complete

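The delay figures for the four instruction types translate into a simple rule for when a result may be used. The Python sketch below encodes the delays and assumes that an instruction issued in cycle c with d delay slots has its result available to an instruction issued in cycle c + d + 1. The names `DELAY_SLOTS` and `result_ready_cycle` are hypothetical.

```python
# Delay slots per instruction type: single cycle, multiply, load, branch
DELAY_SLOTS = {"ISC": 0, "IMPY": 1, "LDx": 4, "B": 5}

def result_ready_cycle(issue_cycle, instr_type):
    """First cycle in which a dependent instruction may issue."""
    return issue_cycle + DELAY_SLOTS[instr_type] + 1

for kind in DELAY_SLOTS:
    print(kind, result_ready_cycle(0, kind))
```

A load issued in cycle 0 can therefore feed a multiply no earlier than cycle 5, which is why a dependent multiply needs padding or other useful work in between.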
Vector Dot Product with Pipeline Effects

; clear A4 and initialize pointers A5, A6, and A7
           MVK  .S1  40,A2      ; A2 = 40 (loop counter)
loop       LDH  .D1  *A5++,A0   ; A0 = a(n)
           LDH  .D1  *A6++,A1   ; A1 = x(n)
           MPY  .M1  A0,A1,A3   ; A3 = a(n) * x(n)
           ADD  .L1  A3,A4,A4   ; Y = Y + A3
           SUB  .L1  A2,1,A2    ; decrement loop counter
      [A2] B    .S1  loop       ; if A2 != 0, then branch
           STH  .D1  A4,*A7     ; *A7 = Y

Multiplication has a delay of 1 cycle

Load has a pipeline delay of four cycles

Fetch packet

[Figure: pipeline chart with columns F, DP, DC, E1, E2, E3, E4, E5, E6. The fetch packet MVK, LDH, LDH, MPY, ADD, SUB, B, STH is in the fetch phases (F1-4).]

Time (t) = 4 clock cycles

Dispatch

[Figure: same chart one cycle later; the next fetch is marked F(2-5).]

Time (t) = 5 clock cycles

Decode

[Figure: same chart one cycle later; the next fetch is marked F(2-5).]

Time (t) = 6 clock cycles

Execute (E1)

[Figure: same chart one cycle later; the next fetch is marked F(2-5).]

Time (t) = 7 clock cycles

Execute (MVK done, LDH in E1)

[Figure: same chart one cycle later; MVK is marked Done and the next fetch is marked F(2-5).]

Time (t) = 8 clock cycles

Vector Dot Product with Pipeline Effects

; clear A4 and initialize pointers A5, A6, and A7
           MVK  .S1  40,A2      ; A2 = 40 (loop counter)
loop       LDH  .D1  *A5++,A0   ; A0 = a(n)
           LDH  .D1  *A6++,A1   ; A1 = x(n)
           NOP  4
           MPY  .M1  A0,A1,A3   ; A3 = a(n) * x(n)
           NOP
           ADD  .L1  A3,A4,A4   ; Y = Y + A3
           SUB  .L1  A2,1,A2    ; decrement loop counter
      [A2] B    .S1  loop       ; if A2 != 0, then branch
           NOP  5
           STH  .D1  A4,*A7     ; *A7 = Y

Assembler will automatically insert NOP instructions

Assembler can also make sequential code parallel

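The NOP counts in this listing follow from the delay slots of the load, multiply, and branch instructions. The Python sketch below is a toy in-order scheduler, not the TI tool chain: it assumes one instruction issues per cycle, that a result is readable d + 1 cycles after issue for d delay slots, and that the branch delay slots are padded with NOPs. The table `LOOP` and the function `schedule` are hypothetical.

```python
# (mnemonic, destination register, source registers, delay slots)
LOOP = [
    ("LDH", "A0", ["A5"], 4),
    ("LDH", "A1", ["A6"], 4),
    ("MPY", "A3", ["A0", "A1"], 1),
    ("ADD", "A4", ["A3", "A4"], 0),
    ("SUB", "A2", ["A2"], 0),
    ("B", None, ["A2"], 5),
]

def schedule(instrs):
    """Return (listing with NOPs inserted, cycles per loop iteration)."""
    ready = {}      # register -> first cycle in which its new value can be read
    cycle = 0       # next free issue cycle
    listing = []
    for name, dest, srcs, delay in instrs:
        issue = max([cycle] + [ready.get(r, 0) for r in srcs])
        if issue > cycle:
            listing.append(f"NOP {issue - cycle}")   # wait for operands
        listing.append(name)
        if dest is not None:
            ready[dest] = issue + delay + 1
        cycle = issue + 1
        if name == "B":
            listing.append(f"NOP {delay}")           # pad the branch delay slots
            cycle += delay
    return listing, cycle

listing, cycles = schedule(LOOP)
print(listing, cycles)
```

Under these assumptions each pass through the unoptimized loop takes 16 cycles for one multiply-accumulate, which is the motivation for filling the delay slots with useful work.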
Optimized Vector Dot Product
; clear A4 and B4 and initialize pointers A5, B6, and A7
           MVK   .S1   40,A2      ; A2 = 40 (loop counter)
loop       LDW   .D1   *A5++,A0   ; load a(n) and a(n+1)
           LDW   .D2   *B6++,B1   ; load x(n) and x(n+1)
           MPY   .M1X  A0,B1,A3   ; A3 = a(n) * x(n)
           MPYH  .M2X  A0,B1,B3   ; B3 = a(n+1) * x(n+1)
           ADD   .L1   A3,A4,A4   ; Yeven = Yeven + A3
           ADD   .L2   B3,B4,B4   ; Yodd = Yodd + B3
           SUB   .S1   A2,1,A2    ; decrement loop counter
      [A2] B     .S2   loop       ; if A2 != 0, then branch
           ADD   .L1X  A4,B4,A4   ; Y = Yodd + Yeven
           STH   .D1   A4,*A7     ; *A7 = Y

Retime summation
-- compute odd/even indexed terms at same time
-- utilize all eight functional units in the loop
-- put the sequential instructions in parallel

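Retiming the summation relies on the fact that the sum can be split into even-indexed and odd-indexed partial sums without changing the result. The Python sketch below checks that idea on plain lists. It assumes an even number of elements, and `dot_product_retimed` is a hypothetical name.

```python
def dot_product_retimed(a, x):
    """Accumulate even- and odd-indexed terms separately, then combine them."""
    assert len(a) == len(x) and len(a) % 2 == 0
    y_even = 0
    y_odd = 0
    for n in range(0, len(a), 2):
        y_even += a[n] * x[n]            # product of the first element of the pair
        y_odd += a[n + 1] * x[n + 1]     # product of the second element of the pair
    return y_even + y_odd                # final combining add after the loop

a = [1, 2, 3, 4]
x = [5, 6, 7, 8]
print(dot_product_retimed(a, x))
```

Because the two partial sums do not depend on each other, they can be computed on separate data paths in the same cycle.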
TMS320C6x vs. Pentium MMX
Processor         Peak   BDTI-   ISR       Power    Unit    Area          Volume
                  MIPS   marks   latency            Price

Pentium MMX 233    466    49     1.14 µs   4.25 W   $213    5.5" x 2.5"   8.789 in^3
Pentium MMX 266    532    56     1.00 µs   4.85 W   $348    5.5" x 2.5"   8.789 in^3
C62x 150 MHz      1200    74     0.12 µs   1.45 W   $25     1.3" x 1.3"   0.118 in^3
C62x 200 MHz      1600    99     0.09 µs   1.94 W   $96     1.3" x 1.3"   0.118 in^3

BDTImarks: Berkeley Design Technology Inc. DSP benchmark results (larger means better)
http://www.bdti.com/bdtimark/results.htm
http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/processors.html

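One way to read this comparison is to normalize the benchmark score by power and by price. The Python sketch below does that with the BDTImark, power, and unit price figures for the four processors compared here. The ratios are a derived illustration, not numbers from the slides, and `ROWS` and `efficiency` are hypothetical names.

```python
# (processor, BDTImarks, power in watts, unit price in dollars)
ROWS = [
    ("Pentium MMX 233", 49, 4.25, 213),
    ("Pentium MMX 266", 56, 4.85, 348),
    ("C62x 150 MHz", 74, 1.45, 25),
    ("C62x 200 MHz", 99, 1.94, 96),
]

def efficiency(rows):
    """Return {processor: (BDTImarks per watt, BDTImarks per dollar)}."""
    return {name: (marks / watts, marks / price)
            for name, marks, watts, price in rows}

for name, (per_watt, per_dollar) in efficiency(ROWS).items():
    print(f"{name:16s} {per_watt:6.1f} marks/W {per_dollar:6.2f} marks/$")
```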
TMS320C62x vs. StarCore S140
Feature                              C62x       S140

Functional Units                     8          16
  multipliers                        2          4
  adders                             6          4
  other                              --         8
Instructions/cycle                   8          6 + branch
RISC instructions *                  8          11
  conditionals                       8          2
Instruction width (bits)             256        128
Total instructions                   48         180
Number of registers                  32         51
Register size (bits)                 32         40
Accumulation precision (bits) **     32 or 40   40
Pipeline depth (cycle)               7-11       5

* Does not count equivalent RISC operations for modulo addressing
** On the C62x, there is a performance penalty for 40-bit accumulation

Conclusion

• Conventional digital signal processors
  - High performance vs. power consumption/cost/volume
  - Excel at one-dimensional processing
  - Have instructions tailored to specific applications

• TMS320C6x VLIW DSP
  - High performance vs. cost/volume
  - Excel at multidimensional signal processing
  - A maximum of 8 RISC instructions per cycle

Conclusion

• Web resources
  - comp.dsp newsgroup: FAQ www.bdti.com/faq/dsp_faq.html
  - embedded processors and systems: www.eg3.com
  - on-line courses and DSP boards: www.techonline.com
• References
  - R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, "Evaluating MMX Technology Using DSP and Multimedia Applications," Proc. IEEE Sym. Microarchitecture, pp. 37-46, 1998. http://www.ece.utexas.edu/~ravib/mmxdsp/
  - B. L. Evans, "EE379K-17 Real-Time DSP Laboratory," UT Austin. http://www.ece.utexas.edu/~bevans/courses/realtime/
  - B. L. Evans, "EE382C Embedded Software Systems," UT Austin. http://www.ece.utexas.edu/~bevans/courses/ee382c/
