High Level Synthesis II: ECE 3401 Digital Systems Design
Traditional Hardware Design
The Problem is Getting Worse
• More heterogeneity requires more hardware
– Less re-use, too expensive to hand code
High-Level Synthesis workflow
[Figure: HLS flow through scheduling and binding, guided by user directives, producing RTL (Verilog, VHDL, SystemC)]
Typical C/C++ Synthesizable Subset
• Data types:
– Primitive types: (u)char, (u)short, (u)int, (u)long, float,
double
– Arbitrary precision integer or fixed-point types
– Composite types: array, struct, class
– Template types: template<>
– Statically determinable pointers
Typical C/C++ Constructs for HW Mapping
Function Hierarchy
• Each function is usually translated into an HDL
module
– Functions may be inlined to dissolve their hierarchy
Function Arguments
• Function arguments become ports on the HDL
blocks
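As a rough illustration of the function hierarchy and argument mapping described above, the sketch below uses plain C. The names `mac`, `scale_sum`, and `LEN` are hypothetical, and the port mapping in the comments describes what an HLS tool would typically generate, not the output of any specific tool.

```c
#define LEN 4  /* hypothetical array length */

/* Helper function: would typically become its own HDL module,
 * unless the tool inlines it into the caller. */
static int mac(int acc, int a, int b) {
    return acc + a * b;
}

/* Top-level function: each argument would typically become a port.
 *   in    -> memory interface (array argument)
 *   scale -> scalar input port
 *   out   -> output port (pointer written once)
 */
void scale_sum(const int in[LEN], int scale, int *out) {
    int acc = 0;
    for (int i = 0; i < LEN; i++) {
        acc = mac(acc, in[i], scale);
    }
    *out = acc;
}
```

Calling `scale_sum` on {1, 2, 3, 4} with `scale = 3` writes 30 to `out`.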
Design Space Explorations with HLS
• Directives guide HLS optimizations
– Resource allocation and implementation
– Memory partitioning
– Loop unrolling
– Loop pipelining
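A minimal sketch of where such directives usually sit in the source. The directive spellings are modeled on Xilinx-style `#pragma HLS` annotations and are an assumption: exact names and options vary by tool and version. They are kept inside comments so the code compiles unchanged with an ordinary C compiler; the function `dot` is hypothetical.

```c
#define N 8

int dot(const int a[N], const int b[N]) {
    /* Memory partitioning (assumed syntax):
     *   #pragma HLS array_partition variable=a cyclic factor=2
     *   #pragma HLS array_partition variable=b cyclic factor=2
     */
    int acc = 0;
    for (int i = 0; i < N; i++) {
        /* Loop pipelining (assumed syntax): #pragma HLS pipeline II=1
         * Loop unrolling  (assumed syntax): #pragma HLS unroll factor=2
         */
        acc += a[i] * b[i];
    }
    return acc;
}
```

The C behavior is the same with or without the directives. They only steer how the tool schedules the loop and maps the arrays to memories.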
Expressions
• HLS generates datapath circuits mostly from
expressions
– Timing constraints influence the degree of
registering
Resources
Array partitioning
• HLS implements an array in C as a memory block in
HDL
– Read & write array → RAM
– Constant array → ROM
• Typically, each memory module supports a
limited number of read/write ports (up to 2)
Array partitioning
• An array can be partitioned and implemented with multiple
RAMs
– Extreme case: completely partitioned into individual elements
that map to discrete registers
Array partitioning directives
• Split arrays to improve memory bandwidth
Block array partitioning
• Array1[N] is split into contiguous halves: Array1a[N/2] holds the first N/2 elements and Array1b[N/2] holds the last N/2
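Cyclic array partitioning
• Array1[N] is interleaved across Array1a[N/2] and Array1b[N/2]: even-indexed elements go to one, odd-indexed elements to the other
Complete array partitioning
• Array1[N] is split into single-element arrays Array1a[1], Array1b[1], Array1c[1], Array1d[1], …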
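The difference between block and cyclic partitioning comes down to how an element index maps to a memory bank and an offset within it. The sketch below shows one way to express that mapping. The names `Loc`, `block_map`, and `cyclic_map` are hypothetical, and the block version assumes the array length is a multiple of the bank count.

```c
/* Location of an element after partitioning: which bank, and where in it. */
typedef struct {
    int bank;
    int offset;
} Loc;

/* Block partitioning: consecutive indices stay in the same bank.
 * Assumes n is a multiple of banks. */
Loc block_map(int idx, int n, int banks) {
    int size = n / banks; /* elements per bank */
    Loc l = { idx / size, idx % size };
    return l;
}

/* Cyclic partitioning: consecutive indices rotate through the banks. */
Loc cyclic_map(int idx, int banks) {
    Loc l = { idx % banks, idx / banks };
    return l;
}
```

With N = 8 and two banks, indices 4 and 5 land in the same bank under block partitioning but in different banks under cyclic partitioning, which is why cyclic partitioning tends to help loops that read neighboring elements in the same cycle.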
void TOP() {
    ..
    for (i = 0; i < N; i++)
        sum += A[i];
}
[Figure: TOP datapath with states S1 and S2: load (LD) A[i], then add (+) it to sum]
Loop Unrolling
Original loop:
void TOP() {
    ..
    for (i = 0; i < N; i++)
        sum += A[i];
}
Unrolled by a factor of 4 (assumes N is a multiple of 4):
void TOP() {
    ..
    for (i = 0; i < N/4; i++) {
        sum += A[4*i];
        sum += A[4*i+1];
        sum += A[4*i+2];
        sum += A[4*i+3];
    }
}
Loop unrolling
(+) Decreased loop control overhead
(+) Increased parallelism for scheduling
(–) Increased operation count, which may negatively
impact area, timing, and power
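[Figure: TOP datapath before and after unrolling by a factor of 4. Rolled: a single adder accumulates A[i] into sum and takes 4 cycles when N = 4. Unrolled: several adders combine A[0] to A[3] with sum and take 1 cycle.]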
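To make the transformation concrete, here is a sketch of the rolled loop next to a manually unrolled version. The function names are hypothetical. The cleanup loop for lengths that are not a multiple of 4 is an addition for completeness and is not on the slide; an HLS tool would normally generate the unrolled form from a directive rather than from hand-written code.

```c
/* Rolled loop: one addition per iteration. */
int sum_rolled(const int *A, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}

/* Unrolled by 4: four loads and a small adder tree per iteration. */
int sum_unrolled4(const int *A, int n) {
    int sum = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        /* The two inner sums are independent and can run in parallel. */
        sum += (A[i] + A[i + 1]) + (A[i + 2] + A[i + 3]);
    }
    /* Cleanup for the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        sum += A[i];
    }
    return sum;
}
```

Both functions return the same result. The unrolled body exposes more independent additions to the scheduler, at the cost of more adders and more simultaneous reads of A.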
Pipelining
• $Z = F(X, Y) = \sqrt{X^2 + Y^2}$, evaluated for 4 consecutive operations
• If each step takes 1T then one calculation takes 3T, four take 12T
• Pipelined into three stages: Stage 1 squares X and Y, Stage 2 computes $X^2 + Y^2$, Stage 3 takes the square root to produce Z
Pipelining -- Timing
[Figure: pipeline timing diagram spanning six clock periods of length T]
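The timing can be checked with a small calculation. This sketch assumes an ideal pipeline in which every stage takes exactly 1T and a new input is accepted every T with no stalls; the function names are hypothetical.

```c
/* Time, in units of T, for n calculations of k one-T steps each,
 * when each calculation must finish before the next one starts. */
int unpipelined_time(int k, int n) {
    return k * n;
}

/* Time, in units of T, for n calculations through a k-stage pipeline:
 * k periods to fill the pipeline, then one result per period. */
int pipelined_time(int k, int n) {
    return n > 0 ? k + (n - 1) : 0;
}
```

For the three-step example, four calculations take 12T without pipelining and 6T with it, since 3 + (4 - 1) = 6.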
Loop pipelining
Without Pipelining
Loop_tag: for (i = 1; i < 3; i++) {
    op_Read;      // RD
    op_Compute;   // CMP
    op_Write;     // WR
}
Schedule: RD CMP WR RD CMP WR
Throughput = 3 cycles
Latency = 3 cycles
Loop pipelining
Loop_tag: for (i = 1; i < 3; i++) {
    op_Read;      // RD
    op_Compute;   // CMP
    op_Write;     // WR
}
[Figure: schedules of the loop without pipelining and with pipelining]
Loop pipelining
• Initiation Interval (II)
– Number of cycles the loop must wait before starting the next iteration
• II = 1 cannot always be implemented
– A single memory port cannot serve two reads in the same cycle
– Other resource limitations have a similar effect
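The port limitation can be expressed as a simple lower bound on II. The sketch below uses the usual resource-constrained bound, the ceiling of accesses per iteration divided by available ports, together with the standard cycle count for a pipelined loop. The function names are hypothetical, and real tools also account for data dependences, which this sketch ignores.

```c
/* Lower bound on II imposed by one memory: each iteration needs
 * accesses_per_iter accesses and the memory offers `ports` per cycle. */
int min_ii(int accesses_per_iter, int ports) {
    return (accesses_per_iter + ports - 1) / ports; /* ceiling division */
}

/* Total cycles for a pipelined loop: the first iteration takes `depth`
 * cycles, and each later iteration starts `ii` cycles after the previous. */
int pipelined_cycles(int depth, int ii, int trips) {
    return trips > 0 ? depth + ii * (trips - 1) : 0;
}
```

Two reads per iteration from a single-port memory give a bound of II = 2, while a dual-port memory or a partitioned array brings the bound back to II = 1. For a 3-cycle loop body with 2 iterations, II = 1 finishes in 4 cycles, compared with 6 cycles when II = 3.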
Loop pipelining
• Given sufficiently partitioned arrays and enough resources, II = 1 can be
implemented
Matrix Vector Multiplication Example
• A: input matrix
• x: input vector
• y: output vector
• $y[i] = \sum_{j=0}^{N-1} A[i][j] \cdot x[j]$
#define N 8
void MV(int A[N][N], int x[N], int y[N]) {
    int i, j;
    int acc;
    for (i = 0; i < N; i++) {
        acc = 0;                      // reset the accumulator for each row
        for (j = 0; j < N; j++) {
            acc += A[i][j] * x[j];    // multiply-accumulate along row i
        }
        y[i] = acc;
    }
}
Baseline Datapath
[Figure: A[i][0 to 7] and x[0 to 7] feed a multiplier (X); the product is added (+) into acc, which is written to y[i]]
Loop Unrolling
[Figure: datapath after loop unrolling, with A[i][0 to 7], x[0 to 7], multiplier (X), adder (+), acc, and y[i]]
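Partition arrays using complete partitioning
[Figure: per-iteration schedule, shown for i = 2 with other iterations elided (...): LD A[2][:] and LD x[:], multiply (x), adder tree (+ Tree), ST y[2]]
• Pipelining is 2.7x faster than loop unrolling only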
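A sketch of how the optimized design might be written, restructured so that the C code mirrors the datapath in the figure: eight parallel multiplies followed by an adder tree. The function name `MV_opt` is hypothetical, and the directives are described in comments only, since their exact syntax depends on the tool.

```c
#define N 8

void MV_opt(int A[N][N], int x[N], int y[N]) {
    /* Assumed directives: completely partition x, and A along its
     * second dimension, so all 8 operands of a row are readable at once. */
    for (int i = 0; i < N; i++) {
        /* Assumed directive: pipeline this loop, one row per iteration. */
        int p[N];
        for (int j = 0; j < N; j++) {
            /* Assumed directive: fully unroll, giving 8 multipliers. */
            p[j] = A[i][j] * x[j];
        }
        /* Adder tree: three levels instead of a chain of 8 additions. */
        y[i] = ((p[0] + p[1]) + (p[2] + p[3])) +
               ((p[4] + p[5]) + (p[6] + p[7]));
    }
}
```

The result matches the baseline MV for integer inputs, because integer addition is associative and reordering the sum into a tree does not change it.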