HLS Tips and Tricks

The document provides tips and tricks for using Vivado HLS, focusing on coding techniques, design steps, and performance optimization strategies. It emphasizes the importance of methodologies such as pipelining, loop unrolling, and dataflow for enhancing system performance in FPGA designs. The presentation also discusses the growing acceptance of Vivado HLS in the industry, particularly for deep learning applications on FPGAs.

Vivado HLS – Tips and Tricks

Presented By

Frédéric Rivoallon
Marketing Product Manager
October 2018

© Copyright 2018 Xilinx


Vivado HLS

˃ Abstracted C-based descriptions
Algorithms in C / C++ / OpenCL
Coding techniques: micro-architecture, RAM adaptation, data type optimization
Design steps: C simulation, C synthesis, co-simulation

˃ Higher productivity
Concise code
Automated RTL verification (co-sim)
Interface synthesis (AXI I/F)
Optimized libraries (open-source Vivado HLS Library)
Fast C simulation

Flow: C/C++ → Vivado HLS → RTL IP → IP Integrator (platform awareness, AXI-4 interface integration) → Vivado synthesis, P&R


Vivado HLS Acceptance Grows…

˃ 5,000+ papers since 2014!
˃ High demand for deep learning accelerators on FPGAs
˃ Software-programmable FPGA SoCs become available

[Graph: HLS-related publications per year, 2000–2017 (peak value shown: 1,370). Based on graph from Cornell University.]
Factors for Overall System Performance

˃ Platform (fixed performance…)
Off-chip memory (e.g. DDR), data links (e.g. PCIe)
Connectivity IPs, typically Xilinx IPs

˃ Compute customization (malleable performance… data processing in RTL or HLS)
Micro-architecture, parallelism, operators

˃ Memory adaptation
On-chip memory, shift registers, piping

˃ Datatype optimization
Customized data types (adjusted to requirement)


Identify the Performance Challenge

˃ Compute-bound or memory-bound?

˃ What kind of parallelism is required?

Algorithm examples: Cornell University Rosetta benchmarks – http://www.csl.cornell.edu/~zhiruz/pdfs/rosetta-fpga2018.pdf
Proceed Methodologically

Adjusting C code and pragmas to find the “right” micro-architecture is a major design step…

˃ 5 steps to design closure – UG1197 (Chapter 4), the UltraFast High-Level Productivity Design Methodology Guide (Design Hub):

• Define interfaces and data packing
• Define loop trip counts
• Pipeline and Dataflow (compute instruction- and task-level parallelism)
• Partition memories and ports
• Remove false dependencies
• Optionally recover resources through sharing
• Fine-tune operator sharing and constraints
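As a hedged illustration of how those steps typically surface in code (the function name, sizes, and pragma placement below are assumptions for the sketch, not taken from UG1197):

```c
#include <assert.h>

#define SZ 16

/* Illustrative skeleton only: each pragma maps to one methodology step.
   A standard C compiler ignores the HLS pragmas, so the function can be
   checked in plain C simulation first. */
void top_sketch(int in[SZ], int out[SZ]) {
#pragma HLS INTERFACE ap_fifo port=in                     /* define interfaces  */
#pragma HLS INTERFACE ap_fifo port=out
    int buf[SZ];
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=2  /* partition memories */
    stage: for (int i = 0; i < SZ; i++) {                 /* known trip count   */
#pragma HLS PIPELINE                                      /* instruction-level parallelism */
        buf[i] = in[i] * 3;
        out[i] = buf[i] + 1;
    }
}
```

Removing false dependencies (DEPENDENCE pragma) and tuning sharing (ALLOCATION pragma) would follow the same pattern once a bottleneck is identified.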


Interface Synthesis

˃ Simple code quickly becomes a “real” circuit
HLS provides block-level I/O and interface pragmas to customize the circuit

void F(int in[20], int out[20])
{
  int a, b, c, x, y;
  for (int i = 0; i < 20; i++) {
    x = in[i];
    y = a*x + b + c;
    out[i] = y;
  }
}

[Diagram: the loop body becomes a multiply-add datapath (in → ×a → +b → +c → out) driven by a control FSM; the array ports can be implemented as BRAM or FIFO interfaces.]

˃ The default interface for C arrays (BRAM) can be changed to “FIFO” via a single-line pragma (a.k.a. directive)…

HLS Adapts Logic to the Design Interface
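A minimal sketch of that single-line change (the coefficient values are illustrative assumptions — a, b, c were uninitialized in the slide's fragment):

```c
#include <assert.h>

/* Same loop as above, with INTERFACE pragmas asking Vivado HLS to
   stream the arrays (ap_fifo) instead of mapping them to BRAM ports.
   A standard C compiler simply ignores the pragmas, so functional
   behavior is unchanged. */
void F(int in[20], int out[20]) {
#pragma HLS INTERFACE ap_fifo port=in
#pragma HLS INTERFACE ap_fifo port=out
    int a = 3, b = 5, c = 7;           /* illustrative constants */
    for (int i = 0; i < 20; i++) {
        int x = in[i];
        int y = a * x + b + c;         /* multiply-add datapath  */
        out[i] = y;
    }
}
```

The same directive can also be kept out of the source and applied from a Tcl script instead, which is what the GUI's “Directives” pane generates.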
Apply Instruction Level Parallelism with PIPELINE

˃ PIPELINE applies to loops or functions
Instructs HLS to process variables continuously
Initiation Interval (II): number of clock cycles before the function can accept new inputs

void F (...) {
  ...
  add: for (i=0;i<=3;i++) {
#pragma HLS PIPELINE
    op_READ;
    op_COMPUTE;
    op_WRITE;
  }
  ...
}

Default: READ, COMPUTE and WRITE execute sequentially in each iteration; throughput = 3, loop latency = 12.
With PIPELINE: iterations overlap so a new READ starts every cycle; throughput = 1, loop latency = 6.

Loop pipelining example

˃ Allows loops or functions to process inputs continuously
Improves throughput (II gets lower)
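The slide's READ/COMPUTE/WRITE loop made concrete, as a hedged sketch (the operation bodies are illustrative stand-ins; the II numbers come from the HLS report, not from C execution):

```c
#include <assert.h>

#define N 4

/* With PIPELINE, HLS overlaps iterations so a new input is read every
   cycle (II = 1); the C-level result is identical either way. */
void run_add_loop(const int in[N], int out[N]) {
    add: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE
        int x = in[i];        /* op_READ    */
        int y = x * x + 1;    /* op_COMPUTE */
        out[i] = y;           /* op_WRITE   */
    }
}
```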


Loop Unrolling

˃ Unroll forces the parallel execution of the instructions in the loop

void F (...) {
  ...
  add: for (i=0;i<=3;i++) {
    b = a[i] + b;
  ...

Default: 4 cycles (iterations 0, 1, 2, 3 execute sequentially).
Unrolled: 1 cycle – a[0]…a[3] feed an adder tree into b (parallel execution, but more area).
Note: a tight timing constraint could lead to a latency different than 1 clock cycle.

In the “Directives” pane, select loop label “add”, right-click and select unroll…

High-performance execution when array elements are available in parallel… otherwise no benefit from unrolling this loop.

Example: fully unrolled loop.
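The same directive can be written as a pragma instead of being applied through the Directives pane — a minimal sketch (function name and array size are illustrative):

```c
#include <assert.h>

/* Fully unrolling lets HLS schedule the four additions as an adder
   tree, provided all elements of a[] are accessible in parallel
   (e.g. after array partitioning). Ignored by a plain C compiler. */
int accumulate(const int a[4]) {
    int b = 0;
    add: for (int i = 0; i <= 3; i++) {
#pragma HLS UNROLL
        b = a[i] + b;
    }
    return b;
}
```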


PIPELINE and Automatic Loop Unrolling

˃ PIPELINE automatically unrolls loops…
(Initiation Interval (II): number of clock cycles before the function can accept new inputs)

void fir(data_t x, coef_t c[N], acc_t *y) {
#pragma HLS PIPELINE
  static data_t shift_x[N];
  acc_t acc;
  data_t data;

  acc = 0;
  for (int i = N-1; i >= 0; i--) {
    if (i == 0) {
      shift_x[0] = x;
      data = x;
    } else {
      shift_x[i] = shift_x[i-1];
      data = shift_x[i];
    }
    acc += data * c[i];
  }
  *y = acc;
}

QUIZ: Which other pragmas might be useful?
a) “interface ap_stable” for the coefficients
b) “array partition” for shift_x
c) “expression_balance” to control the adder tree
d) All of the above

Answer: d)
• ap_stable helps reduce logic for “c” if the coefficients are expected to be constant
• Array partitioning the shifter then ensures all “x” can be accessed in parallel
• Expression balance to preserve the inherent multiply-add cascade chain implied in the C code (longer latency but more efficient once mapped onto DSP blocks)
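A quick C-simulation sanity check for the FIR above — a sketch assuming plain `int` stand-ins for the `data_t`/`coef_t`/`acc_t` types (a real design would use `ap_int`/`ap_fixed`):

```c
#include <assert.h>

#define N 4
typedef int data_t;   /* illustrative stand-ins for HLS arbitrary-precision types */
typedef int coef_t;
typedef int acc_t;

/* The slide's FIR: a static shift register streams samples through. */
void fir(data_t x, coef_t c[N], acc_t *y) {
#pragma HLS PIPELINE
    static data_t shift_x[N];
    acc_t acc = 0;
    data_t data;
    for (int i = N - 1; i >= 0; i--) {
        if (i == 0) { shift_x[0] = x; data = x; }
        else        { shift_x[i] = shift_x[i - 1]; data = shift_x[i]; }
        acc += data * c[i];
    }
    *y = acc;
}

/* Testbench: feed an impulse; the outputs replay the coefficients,
   the classic FIR sanity check used during C simulation. */
void run_impulse(acc_t out[N]) {
    coef_t c[N] = {1, 2, 3, 4};
    data_t impulse[N] = {1, 0, 0, 0};
    for (int k = 0; k < N; k++)
        fir(impulse[k], c, &out[k]);
}
```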


Removing Inter-Loop Bubbles

˃ The rewind option of PIPELINE lets the next loop execution start as soon as possible
Removes inter-loop gaps

loop: for(i=1;i<N;i++) {
#pragma HLS PIPELINE rewind
  op_Read;
  op_Compute;
  op_Write;
}

Without rewind, the next loop invocation starts only after the previous one has finished (RD0 … WRN complete, then the next RD0 begins). With rewind, the next invocation reads immediately, overlapping the tail of the previous run.

˃ See the user guide for more information (including the “flush” option)


C Arrays

˃ C arrays describe memories…
Vivado HLS default memory model assumes 2-port BRAMs

˃ Default number of memory ports defined by…
How elements of the array are accessed
The target throughput (a.k.a. initiation interval, also referred to as II)

void foo (...) {
  ...
  SUM_LOOP: for(i=2;i<N;++i) {
    sum += mem[i] + mem[i-1] + mem[i-2];
    ...
  }
}

Example: the code implies three reads from a RAM per iteration, which prevents full throughput on a 2-port BRAM.
See UG902 (Chap. 3 – Array Accesses and Performance) to get full throughput on this example.

˃ Arrays can be reshaped and/or partitioned to remove bottlenecks
Changes to array layout do not require changes to the original code
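One way to remove that bottleneck, sketched with an explicit pragma (the partition factor and array size are illustrative assumptions; UG902 also discusses shift-register rewrites for this pattern):

```c
#include <assert.h>

#define N 16

/* Cyclic partitioning by 3 places mem[i], mem[i-1] and mem[i-2] in
   three different physical RAMs, so the three reads per iteration no
   longer compete for the two ports of a single BRAM. The C code and
   its result are unchanged. */
int sum3(const int mem[N]) {
#pragma HLS ARRAY_PARTITION variable=mem cyclic factor=3
    int sum = 0;
    SUM_LOOP: for (int i = 2; i < N; ++i) {
#pragma HLS PIPELINE
        sum += mem[i] + mem[i - 1] + mem[i - 2];
    }
    return sum;
}
```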


Partition, Reshape Your C Arrays

˃ Partitioning splits an array into independent arrays
Arrays can be partitioned on any of their dimensions for better throughput

Example (C array of N elements → RTL arrays):
• block, factor of 2: elements 0 … N/2-1 in one RTL array, N/2 … N-1 in another
• cyclic, factor of 2: even elements (0, 2, …, N-2) in one RTL array, odd elements (1, 3, …, N-1) in another
• complete: individual elements

˃ Reshaping combines array elements into wider containers
Different arrays into a single physical memory

New RTL memories are automatically generated without changes to C code
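A hedged sketch of reshaping (the function, sizes, and factor are illustrative assumptions): packing pairs of elements into one wider word means a single physical access returns both operands of each addition.

```c
#include <assert.h>

#define N 8

/* ARRAY_RESHAPE (cyclic, factor 2) packs buf[2k] and buf[2k+1] into
   one wider memory word, so each iteration needs only one physical
   read. The C view of buf is unchanged; a plain compiler ignores
   the pragmas. */
int pairwise_sum(const int buf[N]) {
#pragma HLS ARRAY_RESHAPE variable=buf cyclic factor=2
    int sum = 0;
    for (int i = 0; i < N; i += 2) {
#pragma HLS PIPELINE
        sum += buf[i] + buf[i + 1];  /* both halves come from one access */
    }
    return sum;
}
```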


Dataflow Pragma – Task Level Parallelism

˃ By default a C function producing data for another is fully executed first

// This memory can be a FIFO during optimization
rgb_pixel inter_pix[MAX_HEIGHT][MAX_WIDTH];

// Primary processing functions
sepia_filter(in_pix, inter_pix);
sobel_filter(inter_pix, out_pix2);

Default: the Sepia filter finishes all writes to inter_pix[N]… then the Sobel filter starts accessing inter_pix[N].

˃ Dataflow allows Sobel to start as soon as data is ready
Functions operate concurrently and continuously, processing samples [0] [1] [2] [3] … as they are produced
The interval (hence throughput) is improved
With a ping-pong channel, the buffer has to be filled before it is consumed

˃ Dataflow creates memory channels
Created between loops or functions to store data samples
“Ping-pong” channel (RAM) holds all the data
“FIFO” channel for sequential access, no need to store all the data
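A compilable reduction of the producer/consumer pattern above — the filter bodies are illustrative stand-ins (the real sepia/sobel kernels are not shown in the slide):

```c
#include <assert.h>

#define W 8

/* Under DATAFLOW, HLS turns tmp into a channel (ping-pong RAM or
   FIFO), so the second stage starts consuming as soon as the first
   produces data, instead of waiting for it to finish. INLINE off
   keeps each stage as a distinct task. */
static void stage1(const int in[W], int out[W]) {   /* stand-in "sepia" */
#pragma HLS INLINE off
    for (int i = 0; i < W; i++) out[i] = in[i] * 2;
}

static void stage2(const int in[W], int out[W]) {   /* stand-in "sobel" */
#pragma HLS INLINE off
    for (int i = 0; i < W; i++) out[i] = in[i] + 1;
}

void top_df(const int in[W], int out[W]) {
#pragma HLS DATAFLOW
    int tmp[W];   /* becomes the channel between the two tasks */
    stage1(in, tmp);
    stage2(tmp, out);
}
```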
Video Applications and DATAFLOW

˃ The FIFO channel with DATAFLOW avoids storing frames between tasks

Default: exclusive full-function execution – Sepia then Sobel, with RAM frame buffers at the interfaces and between the filters.
Pipelined: instruction parallelism per pixel within each filter; still RAM frame buffers between the filters.
Dataflow: task parallelism – the filters stream pixels through FIFO channels end to end.

              Default       Pipelined    Dataflow
BRAM          2792          2790         24
FF            891           1136         883
LUT           2315          2114         1606
Interval (II) 128,744,588   4,150,224    2,076,613


Dataflow Hardware Implementation

˃ HLS inserts a “channel” between the functions
vecIn[10] → func1 → channel → func2 → vecOut[10]

˃ Channel implementation
Ping-pong buffer ‒ RAM buffers
FIFO ‒ sequential access

˃ Vivado implementation (RTL view)
func1 and func2 appear as separate RTL blocks connected by the channel

Note: Apply the “inline off” pragma to small functions so that they show as a level of hierarchy…
Dataflow Example

˃ DATAFLOW allows concurrent execution of two (or more) functions

void top(int vecIn[10], int vecOut[10]) {
#pragma HLS DATAFLOW
  int tmp[10];

  func1(vecIn, tmp);
  func2(tmp, vecOut);
}

void func1(int f1In[10], int f1Out[10]) {
#pragma HLS INLINE off
#pragma HLS PIPELINE
  for(int i=0; i<10; i++) {
    f1Out[i] = f1In[i] * 10;
  }
}

void func2(int f2In[10], int f2Out[10]) {
#pragma HLS INLINE off
#pragma HLS PIPELINE
  for(int i=0; i<10; i++) {
    f2Out[i] = f2In[i] + 2;
  }
}

˃ Vector I/O is modeled as coming from/going to a RAM

˃ The code above has an II of 5
i.e. vector size of 10 and 2 elements per cycle: the input vector is “BRAM” by default, so only 2 reads in one cycle, hence II is 5
Review the top-level and function II in the DATAFLOW viewer


Analyzing Dataflow Results

˃ View simulation waveforms after RTL cosimulation
Open Wave Viewer toolbar button
Top-level signals in the waveform view, pre-grouped into useful bundles

1. Run C/RTL Cosimulation with the Vivado Simulator (or Auto)
2. Select Dump Trace: “all” or “port”
3. Click OK
4. Click the Open Wave Viewer icon
5. Pre-grouped signals: block-level I/O (ap_start, ap_done, ap_idle, ap_ready), C inputs, C outputs
6. Select a function to add its signals to the waveforms

Note: Apply the “inline off” pragma to small functions so that they remain a level of hierarchy in HLS…


Analyze Simulation Waveforms

˃ New Dataflow waveform viewer(*)
Shows task-level parallelism
Confirms optimizations took place
(Co-Simulation Waveforms in v2018.2)

˃ HLS Schedule Viewer
Shows operator timing and clock margin
Shows data dependencies
Cross-probing from operations to source code
(HLS Schedule Viewer in v2018.2)

(*) 2018.2: visible when Dataflow is applied, all traces dumped, using the Vivado simulator and checking waveform debug


Target Markets for HLS

Aerospace and Defense: radar, sonar; signals intelligence
Communications: LTE MIMO receiver; advanced wireless antenna positioning
Industrial, Scientific, Medical: ultrasound systems; motor controllers
Audio, Video, Broadcast: 3D cameras; video transport
Automotive: infotainment; driver assistance
Consumer: 3D television; eReaders
Test & Measurement: communications instruments; semiconductor ATE
Computing & Storage: high-performance computing; database acceleration


Vivado HLS Resources

˃ Vivado HLS is included in all Vivado HLx Editions (free in WebPACK)

˃ Videos on xilinx.com and YouTube

˃ DocNav: tutorials, user guides, application notes, videos, etc.

˃ Application notes on xilinx.com (also linked from the hub)

˃ Code examples within the tool itself and on GitHub

˃ Instructor-led training


Summary

Performance Boosters for HLS…


˃ Compute customization, memory adaptation, datatype optimization

Throughput Optimizations…
˃ Apply task and instruction level parallelism

Vivado HLS is not just C synthesis…


˃ It’s C simulation, automated RTL simulation, interface synthesis, waveform analysis
