HLS Tips and Tricks
Presented By
Frédéric Rivoallon
Marketing Product Manager
October 2018
[Diagram: HLS design flow – C function with AXI interfaces (interface synthesis), optimized libraries, then Vivado synthesis and place & route; software-programmable FPGA SoCs become available]
˃ Datatype optimization
  Customized data types (adjusted to requirements)
˃ Compute-bound or memory-bound? It depends on the algorithm – examples follow
˃ The default interface for C arrays (BRAM) can be changed to "FIFO" via a single-line pragma (a.k.a. directive)…
  [Diagram: two-stage pipeline "in -> * -> + -> out"; intermediate arrays a, b, c mapped to BRAM by default, or to FIFO channels]
˃ Control logic – an FSM sequences the loop body

    void F (...) {
      ...
      add: for (i=0;i<=3;i++) {
      #pragma HLS PIPELINE
        op_READ;
        op_COMPUTE;
        op_WRITE;
      }
      ...
    }

By default, each iteration executes READ, COMPUTE and WRITE sequentially: loop latency = 12 and throughput = 3 (one result every 3 clock cycles). With #pragma HLS PIPELINE the stages of successive iterations overlap: loop latency = 6 and throughput = 1.
˃ In the "Directives" pane, select loop label "add", right-click and select unroll…
˃ Example: fully unrolled loop (parallel execution but more area)
    void fir(data_t x, coef_t c[N], acc_t *y) { ... }
˃ High-performance execution when array elements are available in parallel; otherwise there is no benefit from unrolling this loop…
  [Diagram: adder tree summing a[0]..a[3] into b]
QUIZ: Which other pragmas might be useful?
˃ Rewind option for PIPELINE allows the next loop execution to start as soon as possible
  Removes inter-loop gaps

    loop: for(i=1;i<N;i++) {
    #pragma HLS PIPELINE rewind
      op_Read;
      op_Compute;
      op_Write;
    }

Without rewind, the next loop invocation starts only after the previous one has finished (the pipeline drains: RD0..RDN, CMP, WR0..WRN, then the next invocation begins). With rewind, the next invocation reads immediately while the previous one is still completing.
˃ See the user guide for more information (including the "flush" option)
˃ Example: the code implies three reads from a RAM, which prevents full throughput (a block RAM has only two ports)
˃ ARRAY_PARTITION splits one C array [0, 1, 2, …, N-3, N-2, N-1] into several RTL arrays:
  block, factor of 2: [0, 1, …, N/2-1] and [N/2, …, N-2, N-1]
  cyclic, factor of 2: [0, 2, …, N-2] and [1, 3, …, N-1]
  complete: individual elements (registers)
˃ The FIFO channel with DATAFLOW avoids storing frames between tasks
  Default: RAM interfaces and RAM channels; exclusive, full-function execution
  Pipelined: instruction parallelism (per pixel)
  Dataflow: stream (FIFO) interfaces and FIFO channels; task parallelism
  [Diagram: vecIn[10] -> func1 -> FIFO channel -> func2 -> vecOut[10]]
Note: Apply the "inline off" pragma to small functions so that they show as a level of hierarchy…
© Copyright 2018 Xilinx
Dataflow Example
˃ DATAFLOW allows concurrent execution of two (or more) functions

    void top(int vecIn[10], int vecOut[10]) {
    #pragma HLS DATAFLOW
      int tmp[10];
      func1(vecIn,tmp);
      func2(tmp,vecOut);
    }

    void func1(int f1In[10], int f1Out[10]) {
    #pragma HLS INLINE off
    #pragma HLS PIPELINE
      for(int i=0; i<10; i++) {
        f1Out[i] = f1In[i] * 10;
      }
    }

˃ Vector I/O are modeled as coming from/to a RAM
˃ The code on the left has an II of 5, i.e. a vector size of 10 and 2 elements per cycle
  The input vector is "BRAM" by default, so only 2 reads are possible in one cycle, hence II is 5
Viewing waveforms in C/RTL cosimulation:
1 Run C/RTL Cosimulation: Vivado Simulator (or Auto)
2 Select Dump Trace: "all" or "port"
3 Click OK
4 Click the Open Wave Viewer icon (*)
5 Pre-grouped signals:
  • Block-level I/O: ap_start, ap_done, ap_idle, ap_ready
  • C inputs
  • C outputs
6 Select a function to add its signals to the waveforms
Note: Apply the "inline off" pragma to small functions so that they remain a level of hierarchy in HLS…
(*): 2018.2: visible when Dataflow is applied, all traces are dumped, the Vivado simulator is used, and waveform debug is checked
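The GUI steps above can also be scripted in batch mode; `cosim_design -trace_level` is the Tcl equivalent of the Dump Trace setting. The project and solution names below are placeholders; run it with `vivado_hls -f script.tcl`.

```tcl
# Hypothetical script.tcl -- batch equivalent of the GUI cosimulation flow
open_project proj
open_solution solution1
csim_design
csynth_design
# dump "all" traces with the Vivado simulator (Dump Trace setting in the GUI)
cosim_design -rtl verilog -trace_level all
```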
Example application markets:
  Automotive: infotainment, driver assistance
  Consumer: 3D television, eReaders
Throughput Optimizations…
˃ Apply task- and instruction-level parallelism