High Level Synthesis II: ECE 3401 Digital Systems Design

The document discusses High-Level Synthesis (HLS) in digital systems design, highlighting the challenges of traditional hardware design such as increased complexity and shorter design cycles. It outlines the HLS workflow, including scheduling and binding, and details various techniques like loop unrolling, pipelining, and array partitioning to optimize performance. The document emphasizes the importance of directives in guiding HLS optimizations to achieve results comparable to hand-coded RTL implementations.

Lecture 20

High Level Synthesis II

ECE 3401
Digital Systems Design

1
Traditional Hardware Design

• Hand coded RTL
  – Understand the problem space
    • E.g., Sort
  – Decide on a solution
    • E.g., Radix Sort
  – Implementation
    • Code RTL
    • Validate/Verify
    • Manually tune to meet spec
• Result: a single design point in power, performance, and area

2
The Problem is Getting Worse
• More heterogeneity requires more hardware
– Less re-use, too expensive to hand code

• Shorter design cycles


– Rushed designs mean specs keep changing
– Focus on correctness, let tool handle performance

• Can’t spend months tuning every pipeline in the sea of design implementations

3
High-Level Synthesis workflow

Design Source (C, C++, SystemC) + Technology Library + User Directives
    → Scheduling → Binding → RTL (Verilog, VHDL, SystemC)

4
High-Level Synthesis workflow

Design Source (C, C++, SystemC) + Technology Library + User Directives
    → Scheduling → Binding → RTL (Verilog, VHDL, SystemC)

• Scheduling: determines which cycle each operation happens in

5
High-Level Synthesis workflow

Design Source (C, C++, SystemC) + Technology Library + User Directives
    → Scheduling → Binding → RTL (Verilog, VHDL, SystemC)

• Scheduling: determines which cycle each operation happens in
• Binding: maps operations onto instantiated hardware

6
Typical C/C++ Synthesizable Subset
• Data types:
– Primitive types: (u)char, (u)short, (u)int, (u)long, float,
double
– Arbitrary precision integer or fixed-point types
– Composite types: array, struct, class
– Template types: template<>
– Statically determinable pointers

• No support for dynamic memory allocation
• No support for recursive function calls
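As a hedged illustration of this subset (plain C++ standing in for tool-specific arbitrary-precision types; `scale_add` is a made-up example name, not from any tool's API), a synthesizable function uses fixed-size arrays, statically determinable bounds, and no dynamic allocation or recursion:

```cpp
#include <cstdint>

// Illustrative sketch only: a function written within a typical
// HLS-synthesizable subset. Fixed-size arrays, compile-time trip
// count, no malloc/new, no recursion.
template <int N>
void scale_add(const int16_t in[N], int16_t out[N], int16_t k) {
    for (int i = 0; i < N; i++) {
        // Every memory access has a statically analyzable address
        out[i] = static_cast<int16_t>(in[i] * k + i);
    }
}
```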

7
Typical C/C++ Constructs for HW Mapping

8
Function Hierarchy
• Each function is usually translated into an HDL module
  – Functions may be inlined to dissolve their hierarchy

9
Function Arguments
• Function arguments become ports on the HDL
blocks

• Inputs/Outputs enable synchronous data exchange

10
Design Space Explorations with HLS
• Directives guide HLS optimizations
– Resource allocation and implementation
– Memory partitioning
– Loop unrolling
– Loop pipelining

• ~30 unique directives
  – User can provide as much detail as desired
  – Can achieve performance on the order of handwritten RTL
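As a hedged sketch of what such directives look like in source, the pragma spellings below follow Vitis-HLS-style conventions and vary by vendor; `top`, `a`, and `b` are illustrative names:

```cpp
// Illustrative only: directive style of Vitis-HLS-like tools.
// Exact pragma names and options differ between vendors; a plain
// C++ compiler simply ignores the unknown pragmas.
void top(const int a[64], int b[64]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4  // memory partitioning
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1    // loop pipelining: new iteration every cycle
#pragma HLS UNROLL factor=4  // loop unrolling
        b[i] = a[i] + 1;
    }
}
```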

11
Expressions
• HLS generates datapath circuits mostly from
expressions
– Timing constraints influence the degree of
registering

12
Resources

• Allocation directive constrains resources
  – Operations: e.g., number of adders instantiated in HDL
  – Can save a lot of area (and power)

With one adder:  CYCLE 1: add1;  CYCLE 2: add1  (two cycles for two additions)
With two adders: CYCLE 1: add1, add2  (one cycle)
13
Array partitioning
• HLS implements an array in C as a memory block in HDL
  – Read & write array → RAM
  – Constant array → ROM
• Typically, each memory module supports a limited number of read/write ports (up to 2)

void TOP () {
    int A[N];
    for (i = 0; i < N; i++)
        A[i] = A[i] + i;
}
14
Array partitioning
• An array can be partitioned and implemented with multiple RAMs
  – Extreme case: completely partitioned into individual elements that map to discrete registers

void TOP (int x, ..) {
    int A[N];
    for (i = 0; i < N; i++)
        A[i+x] = A[i] + i;   // index depends on runtime x: can this still be partitioned?
}

• An array can be partitioned and mapped to multiple RAMs
• Multiple arrays can be merged and mapped to one RAM

15
Array partitioning directives
• Split arrays to improve memory bandwidth

16
Block array partitioning
Array1[N] → Array1a[N/2], Array1b[N/2]  (factor=2)

• Block partitioning creates smaller arrays from consecutive blocks of the original array
• Splits the array into equal blocks, where the number of blocks is the integer defined by the factor= argument
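A software model of this mapping can make it concrete (a hedged sketch for N=8, factor=2; the helper names are made up): element i of the original array lands in the first sub-array if i < N/2, else in the second, at local index i - N/2.

```cpp
#include <array>

// Hedged software model of block partitioning, N = 8, factor = 2.
// The two sub-arrays hold consecutive halves of the original array.
std::array<int, 4> Array1a{}, Array1b{};

void block_write(int i, int v) {
    if (i < 4) Array1a[i] = v;      // first consecutive block: 0..3
    else       Array1b[i - 4] = v;  // second consecutive block: 4..7
}

int block_read(int i) {
    return (i < 4) ? Array1a[i] : Array1b[i - 4];
}
```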

17
Cyclic array partitioning
Array1[N] → Array1a[N/2], Array1b[N/2]  (factor=2)

• Cyclic partitioning creates smaller arrays by interleaving elements from the original array
• The array is partitioned cyclically by putting one element into each new array before coming back to the first array, repeating the cycle until the array is fully partitioned
  – For example, if factor=2: element 0 goes to the first new array, element 1 to the second, element 2 back to the first, and so on
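The interleaved mapping can likewise be modeled in software (a hedged sketch for N=8, factor=2, mirroring the block-partitioning model; names are illustrative): element i goes to sub-array i % 2 at local index i / 2.

```cpp
#include <array>

// Hedged software model of cyclic partitioning, N = 8, factor = 2.
// Even-indexed elements interleave into one sub-array, odd into the other.
std::array<int, 4> CyclicA{}, CyclicB{};

void cyclic_write(int i, int v) {
    if (i % 2 == 0) CyclicA[i / 2] = v;  // elements 0, 2, 4, 6
    else            CyclicB[i / 2] = v;  // elements 1, 3, 5, 7
}

int cyclic_read(int i) {
    return (i % 2 == 0) ? CyclicA[i / 2] : CyclicB[i / 2];
}
```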
18
Complete array partitioning
Array1[N] → Array1a[1], Array1b[1], Array1c[1], Array1d[1], ...

• Decomposes the array into individual elements
• For a one-dimensional array, this corresponds to resolving a memory into individual registers
19
Loops
• By default, loops are rolled
• Each loop iteration corresponds to a “sequence” of
states (more generally, an FSM)
• This state sequence will be repeated multiple times
based on the loop trip count (or loop bound)

void TOP () {
    ..
    for (i = 0; i < N; i++)
        sum += A[i];
}

FSM: state S1 loads A[i] (LD), state S2 adds it into sum (+); the sequence repeats each iteration

20
Loop Unrolling
void TOP () {
    ..
    for (i = 0; i < N; i++)
        sum += A[i];
}

After unrolling with a factor of 4:

void TOP () {
    ..
    for (i = 0; i < N/4; i++) {
        sum += A[4*i];
        sum += A[4*i+1];
        sum += A[4*i+2];
        sum += A[4*i+3];
    }
}

• Exploits parallelism between loop iterations to achieve shorter latency or higher throughput
• Creates multiple copies of the loop body and adjusts the loop iteration counter accordingly
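The transformation can be checked in plain C++ (a hedged sketch assuming N is divisible by 4, as the factor-of-4 unroll requires; the function names are made up):

```cpp
// Rolled and manually 4x-unrolled versions of the accumulation loop.
int sum_rolled(const int* A, int N) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += A[i];
    return sum;
}

int sum_unrolled4(const int* A, int N) {
    int sum = 0;
    for (int i = 0; i < N / 4; i++) {
        sum += A[4 * i];      // four copies of the loop body;
        sum += A[4 * i + 1];  // the counter now advances once
        sum += A[4 * i + 2];  // per group of four elements
        sum += A[4 * i + 3];
    }
    return sum;
}
```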

21
Loop unrolling
(+) Decreased loop control overhead
(+) Increased parallelism for scheduling
(–) Increased operation count, which may negatively
impact area, timing, and power

Unroll factor of 4:
  Rolled: one adder, sum += A[i] per state; takes 4 cycles when N=4
  Unrolled: an adder tree sums A[0..3] into sum; takes 1 cycle!

22
Pipelining
4 consecutive operations: Z = F(X, Y) = SqRoot(X² + Y²)

If each step takes 1T, then one calculation takes 3T; four take 12T.

Pipelined into three stages:
  Stage 1: X², Y²  →  Stage 2: X² + Y²  →  Stage 3: SqRoot  →  Z

Assuming ideally that each stage takes 1T:
• What will be the latency (time to produce the first result)?
• What will be the throughput (pipeline rate in the steady state)?

23
Pipelining -- Timing

The four operations flow through the three stages in 6T total; Speedup = 12T / 6T = 2

For n operations: total time = 3T + (n-1)T  (latency for the first result, then one result per T)

Speedup = (3T × n) / (3T + (n-1)T) = 3n / (n+2)
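The speedup expression can be sanity-checked numerically (a minimal sketch; the stage count k = 3 and the function name are chosen for illustration):

```cpp
// Speedup of a k-stage pipeline over sequential execution for n
// operations, assuming every stage takes one time unit T:
//   sequential time: k*n, pipelined time: k + (n - 1)
double pipeline_speedup(int k, int n) {
    return static_cast<double>(k * n) / (k + (n - 1));
}
```

As n grows, the speedup approaches the stage count k.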
24
Loop pipelining
Loop_tag : for ( II = 1; II < 3; II++ ) {
    op_Read;     // RD
    op_Compute;  // CMP
    op_Write;    // WR
}

25
Loop pipelining
Without Pipelining

Loop_tag : for ( II = 1; II < 3; II++ ) {
    op_Read;     // RD
    op_Compute;  // CMP
    op_Write;    // WR
}

RD CMP WR | RD CMP WR

Throughput = 3 cycles
Latency = 3 cycles
Loop Latency = 6 cycles

26
Loop pipelining
Without Pipelining vs. With Pipelining

Loop_tag : for ( II = 1; II < 3; II++ ) {
    op_Read;     // RD
    op_Compute;  // CMP
    op_Write;    // WR
}

Without pipelining: RD CMP WR | RD CMP WR
  Throughput = 3 cycles; Latency = 3 cycles; Loop Latency = 6 cycles

With pipelining (the second iteration starts one cycle in):
  RD CMP WR
     RD CMP WR
  Throughput = 1 cycle; Latency = 3 cycles; Loop Latency = 4 cycles

• Loop pipelining allows one iteration to begin processing before the previous iteration is complete
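The loop-latency numbers follow a simple pattern (a hedged sketch; `depth` is the latency of one iteration, `trips` the trip count, and `ii` the iteration interval):

```cpp
// Latency of a pipelined loop: the first iteration takes `depth`
// cycles, and each later iteration starts `ii` cycles after the
// previous one.
int pipelined_loop_latency(int depth, int trips, int ii) {
    return depth + (trips - 1) * ii;
}
```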

27
Loop pipelining
• Iteration Interval (II)
– Cycles loop must wait before starting next iteration

• II = 1 cannot always be implemented
  – The same memory port cannot be read twice in the same cycle
  – Similar effect with other resource limitations

28
Loop pipelining
• Given sufficient arrays and resources, II = 1 can be implemented

29
Matrix Vector Multiplication Example
• A: input matrix
• x: input vector
• y: output vector
• y[i] = ∑ A[i][j]*x[j]

// N = 8
void MV(int A[N][N], int x[N], int y[N]) {
int i, j;
int acc;
for (i = 0; i < N; i++) {
acc = 0;
for (j = 0; j < N; j++) {
acc += A[i][j] * x[j];
}
y[i] = acc;
}
}
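Because HLS starts from C source, the kernel can be validated in software before synthesis; a minimal check (the input values here are made up for illustration):

```cpp
// Software check of the MV kernel from the slide (N = 8): run the
// C source with known inputs before handing it to the HLS tool.
constexpr int N = 8;

void MV(int A[N][N], int x[N], int y[N]) {
    for (int i = 0; i < N; i++) {
        int acc = 0;
        for (int j = 0; j < N; j++) {
            acc += A[i][j] * x[j];  // y[i] = sum over j of A[i][j]*x[j]
        }
        y[i] = acc;
    }
}
```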
30
Baseline Datapath
Datapath: A[i][0 to 7] and x[0 to 7] → multiplier → adder/accumulator (acc) → y[i]

// N = 8
for (i = 0; i < N; i++) {
    acc = 0;
    for (j = 0; j < N; j++) {
        acc += A[i][j] * x[j];
    }
    y[i] = acc;
}

• Latency of load = 2 cycles, Mult = 1 cycle, Add+Acc = 1 cycle, Store = 1 cycle
• Inner loop latency = (2+1+1) × 8 = 32 cycles
  – 2 LD, 1 *, 1 +
• Outer loop latency = (32+2) × 8 = 272 cycles
  – 1 ST, 1 Acc

31
Loop Unrolling
Datapath: A[i][0 to 7] and x[0 to 7] → multiplier → adder/accumulator (acc) → y[i]

// N = 8
for (i = 0; i < N; i++) {
    acc = 0;
    for (j = 0; j < N; j++) {
        acc += A[i][j] * x[j];
    }
    y[i] = acc;
}

• Unrolling the inner loop (j) expects to load 8 elements of A and x every cycle
• A single RAM only has 1 read port!!

32
Partition Arrays using complete

A[i][0 to 7] → A[i][0], A[i][1], ..., A[i][7]
x[0 to 7] → x[0], x[1], ..., x[7]

• Array partitioning breaks one array into smaller portions and implements it with multiple RAM modules
  – Options: block, cyclic, or complete
33
Loop Unrolling
Datapath: A[i][0]×x[0], A[i][1]×x[1], ..., A[i][7]×x[7] → adder tree → y[i]

• Latency of load = 2 cycles, Mult = 1 cycle, Add = 1 cycle, Store = 1 cycle
• Unrolled inner loop latency = (2+1+1+1+1+1) × 1 = 7 cycles
  – 2 LD, 1 *, 3 levels of adder tree, 1 ST
• Outer loop latency = 7 × 8 = 56 cycles
• 4.86x faster but requires 8x more RAMs and multipliers and 7x more adders!!
34
Loop Pipelining
Cycle: 1 2 3 4 5 6 7 8 9 10 11 ...

Each iteration (7 cycles): LD A[i][:] and LD x[:] → × → adder tree → ST y[i]
  i=0 starts at cycle 1, i=1 at cycle 3, i=2 at cycle 5, ...

• Iteration interval II = 2
  – A RAM port cannot be read twice at the same time, so the loop must wait 2 cycles before starting the next iteration
• Pipelined latency = 7 + (7 × 2) = 21 cycles
• Pipelining is 2.7x faster than loop unrolling only
35
