Vivado HLS: Improving Performance
This material exempt per Department of Commerce license exception TSU © Copyright 2013 Xilinx
Outline
Adding Directives
Improving Latency
– Manipulating Loops
Improving Throughput
Performance Bottleneck
Summary
Adding Directives
[Figure: Vivado HLS GUI — select “General” under the solution settings to “Add” or “Remove” configuration settings]
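Directives can be applied through the GUI (stored in the solution's directives.tcl) or embedded directly in the source as pragmas. A minimal sketch of the pragma form, using a hypothetical loop (the slides that follow show the older `#pragma AP` spelling; `#pragma HLS` is the documented Vivado HLS form):

void foo(int in[8], int out[8]) {
 // Directive embedded in the source: pipeline this loop
 L1: for (int i = 0; i < 8; i++) {
#pragma HLS PIPELINE
  out[i] = in[i] + 1;
 }
}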
Design Latency
– The latency of the design is the number of cycles it takes to output the result
• In this example the latency is 10 cycles
Design Throughput
– The throughput of the design is the number of cycles between new inputs
• By default (no concurrency) this is the same as the latency
• The next start/read occurs when the current transaction ends
Functions
– Vivado HLS will seek to minimize latency by allowing functions to operate in parallel
• As shown on the previous slide
Loops
– Vivado HLS will not schedule loops to operate in parallel by default
• Dataflow optimization must be used or the loops must be unrolled
Review: Loops
– Loops can be unrolled if their indices are statically determinable at elaboration time
• Not when the number of iterations is variable
– Unrolled loops are likely to result in more hardware resources and higher area
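A minimal sketch of unrolling in source, using a hypothetical loop (`#pragma HLS UNROLL` is the documented pragma form; the GUI applies the equivalent directive):

void accumulate(int a[4], int *sum) {
 int acc = 0;
 // Fixed bound (4), known at elaboration time, so the loop can be
 // fully unrolled into four parallel copies of the body
 L1: for (int i = 0; i < 4; i++) {
#pragma HLS UNROLL
  acc += a[i];
 }
 *sum = acc;
}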
Loop Flattening
– Nested loops can be flattened into a single loop, removing the cycles spent moving between the loop levels

Before flattening (4 x 4 iterations):
L2: for (i=3;i>=0;i--) {
 L3: for (j=3;j>=0;j--) {
  [loop body l3]
 }
}

After flattening (x16):
L2: for (k=15;k>=0;k--) {
 [loop body l3]
}

Loop Merging
– Consecutive loops can be merged into a single loop, removing further loop transitions

Before merging (L2/L3 already flattened into L2):
L2: for (k=15;k>=0;k--) {
 [loop body l3]
}
L4: for (i=3;i>=0;i--) {
 [loop body l4]
}

After merging (x16, body l4 guarded so it still executes only 4 times):
for (k=15;k>=0;k--) {
 [loop body l3]
 if (cond4)
  [loop body l4]
}

– The figure compares the cost of the loop transitions: 36 transitions for the original code versus 18 after flattening and merging
– If the loop bounds are all variables, they must have the same value
– If the loop bounds are constants, the maximum constant value is used as the bound of the merged loop
• As in the previous example, where the maximum loop bound becomes 16 (implied by L3 being flattened into L2 before the merge)
– Loops with both variable bounds and constant bounds cannot be merged
– The code between the loops to be merged cannot have side effects
• Multiple executions of this code should generate the same results
• A=B is OK, A=B+1 is not (see the sketch after this list)
– Reads from a FIFO or FIFO interface must always be in sequence
• A FIFO read in one loop will not be a problem
• FIFO reads in multiple loops may become out of sequence
• This prevents the loops from being merged
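A minimal sketch of the side-effect rule, using hypothetical variables (`#pragma HLS LOOP_MERGE` is the documented pragma form; the GUI applies the equivalent directive):

void merge_example(int B, int d[8], int e[8]) {
#pragma HLS LOOP_MERGE
 int t;
 L1: for (int i = 0; i < 8; i++)
  d[i] = d[i] + 1;
 // Code between the loops: re-executing "t = B" gives the same result,
 // so L1 and L2 can still be merged around it. Code whose repeated
 // execution changes the result would prevent the merge.
 t = B;
 L2: for (int j = 0; j < 8; j++)
  e[j] = e[j] * t;
}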
Loop Reports
[Figure: Vivado HLS synthesis report showing the latency of each loop]
Constraints
– Vivado HLS accepts constraints for latency (a sketch follows after this list)
Loop Optimizations
– Latency can be improved by minimizing the number of loop boundaries
• Rolled loops (the default) enforce sharing at the expense of latency
• Entering and exiting a loop costs clock cycles
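A minimal sketch of a latency constraint, using a hypothetical loop (`#pragma HLS LATENCY` is the documented pragma form; treat the exact placement and options here as assumptions):

void scale3(int in[8], int out[8]) {
 // Ask the scheduler to complete the loop body within 10 cycles;
 // Vivado HLS reports when the constraint cannot be met
 L1: for (int i = 0; i < 8; i++) {
#pragma HLS LATENCY max=10
  out[i] = in[i] * 3;
 }
}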
Dataflow Optimization
– Can be used at the top-level function
– Allows blocks of code to operate concurrently
• The blocks can be functions or loops
• Dataflow allows loops to operate concurrently
– It places channels between the blocks to maintain the data rate
• For arrays the channels will include memory elements to buffer the samples
• For scalars the channel is a register with handshakes
Dataflow optimization therefore has an area overhead
– Additional memory blocks are added to the design
– The timing diagram on the previous page should have a memory access delay between the blocks
• Not shown to keep the explanation of the principle clear
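A minimal sketch of dataflow at the top level, using hypothetical producer/consumer functions (`#pragma HLS DATAFLOW` is the documented pragma form):

void produce(int in[32], int tmp[32]) {
 for (int i = 0; i < 32; i++) tmp[i] = in[i] + 1;
}
void consume(int tmp[32], int out[32]) {
 for (int i = 0; i < 32; i++) out[i] = tmp[i] * 2;
}
void top(int in[32], int out[32]) {
#pragma HLS DATAFLOW
 int tmp[32];
 // produce and consume overlap across transactions; the tmp array
 // becomes a ping-pong buffer channel between the two blocks
 produce(in, tmp);
 consume(tmp, out);
}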
Dataflow Optimization Commands
– Dataflow optimization is “coarse grain” pipelining at the function and loop level
– It increases concurrency between functions and loops
– It only works on functions or loops at the top level of the hierarchy
• It cannot be used in sub-functions
Function Pipelining
void foo(...) {
 op_Read;    // RD
 op_Compute; // CMP
 op_Write;   // WR
}
Without pipelining
– There are 3 clock cycles before operation RD can occur again
• Throughput = 3 cycles
– There are 3 cycles before the 1st output is written
• Latency = 3 cycles
With pipelining
– The latency is the same
• Latency = 3 cycles
– The throughput is better: a new RD can start every cycle
• Fewer cycles between starts, higher throughput
Loop Pipelining
Loop: for(i=1;i<3;i++) {
 op_Read;    // RD
 op_Compute; // CMP
 op_Write;   // WR
}
Without pipelining
– There are 3 clock cycles before operation RD can occur again
• Throughput = 3 cycles
– There are 3 cycles before the 1st output is written
• Latency = 3 cycles
– For both iterations of the loop
• Loop Latency = 6 cycles
With pipelining
– The latency is the same
• Latency = 3 cycles
– The throughput is better
• Fewer cycles, higher throughput
– The latency for all iterations, the loop latency, has been improved
• Loop Latency = 4 cycles
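A sketch of the pipelined loop in source, with a concrete body standing in for op_Read/op_Compute/op_Write (`#pragma HLS PIPELINE` is the documented pragma form; the deck's slides show the older `#pragma AP` spelling):

void foo(int in[2], int out[2]) {
 Loop: for (int i = 0; i < 2; i++) {
#pragma HLS PIPELINE
  int x = in[i];  // RD: a new iteration can start every cycle
  int y = x * 3;  // CMP
  out[i] = y;     // WR
 }
}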
Pipelining and Function/Loop Hierarchy
Vivado HLS will attempt to unroll all loops nested below a PIPELINE directive
– It may not succeed for various reasons and/or may lead to unacceptable area
• Loops with variable bounds cannot be unrolled
• Unrolling multi-level loop nests may create a lot of hardware
– Pipelining the inner-most loop will give the best performance for the area
• Or the next one (or two) out, if the inner-most loop is modest and fixed
e.g. a convolution algorithm
• Outer loops will keep the inner pipeline fed
Pipeline the inner-most loop L2:
void foo(in1[ ][ ], in2[ ][ ], …) {
 …
 L1: for(i=1;i<N;i++) {
  L2: for(j=0;j<M;j++) {
#pragma AP PIPELINE
   out[i][j] = in1[i][j] + in2[i][j];
  }
 }
}

Pipeline loop L1 (L2 is unrolled):
void foo(in1[ ][ ], in2[ ][ ], …) {
 …
 L1: for(i=1;i<N;i++) {
#pragma AP PIPELINE
  L2: for(j=0;j<M;j++) {
   out[i][j] = in1[i][j] + in2[i][j];
  }
 }
}

Pipeline the function (L1 and L2 are unrolled):
void foo(in1[ ][ ], in2[ ][ ], …) {
#pragma AP PIPELINE
 …
 L1: for(i=1;i<N;i++) {
  L2: for(j=0;j<M;j++) {
   out[i][j] = in1[i][j] + in2[i][j];
  }
 }
}
Pipelining Commands
[Figure: clock waveforms illustrating pipeline flush behavior]
With Flush
– When no new input reads are performed, values already in the pipeline are flushed out
Array Partitioning
Array Dimensions
– my_array[10][6][4] has three dimensions: dimension 1 (size 10), dimension 2 (size 6), and dimension 3 (size 4)
– Dimension 0 means all dimensions
Examples
– my_array[10][6][4] partition dimension 3
• Gives 4 arrays: my_array_0[10][6], my_array_1[10][6], my_array_2[10][6], my_array_3[10][6]
– my_array[10][6][4] partition dimension 1
• Gives 10 arrays: my_array_0[6][4] through my_array_9[6][4]
– my_array[10][6][4] partition dimension 0
• Gives 10x6x4 = 240 individual registers
Partitioned arrays:
• On the Interface
− This means separate ports
• Internally
− Separate buses & wires
− Separate control logic, which may be more complex, slower, and increase latency
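A minimal sketch of a partitioning directive, using a hypothetical array (`#pragma HLS ARRAY_PARTITION` is the documented pragma form; the GUI applies the equivalent directive):

void sum_rows(int in[10][6][4], int out[10]) {
 int buf[10][6][4];
 // Split dimension 1 (size 10) into 10 separate [6][4] memories
 // (buf_0 ... buf_9), so all 10 rows can be accessed in the same cycle
#pragma HLS ARRAY_PARTITION variable=buf complete dim=1
 for (int i = 0; i < 10; i++) {
  int acc = 0;
  for (int j = 0; j < 6; j++)
   for (int k = 0; k < 4; k++) {
    buf[i][j][k] = in[i][j][k];
    acc += buf[i][j][k];
   }
  out[i] = acc;
 }
}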
Data Packing
• Grouped structure
− The first element in the struct becomes the LSB
− The last struct element becomes the MSB
− Arrays are partitioned completely
• On the Interface
− This means a single port
• Internally
− A single bus
− May result in simplified control logic, and faster, lower-latency designs
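A minimal sketch of packing a struct into one wide word, using a hypothetical pixel struct (`#pragma HLS DATA_PACK` is the Vivado HLS pragma form; treat the exact options as assumptions):

typedef struct {
 char r; // first element -> LSB of the packed word
 char g;
 char b; // last element -> MSB
} pixel_t;

void copy_pixels(pixel_t in[16], pixel_t out[16]) {
#pragma HLS DATA_PACK variable=in
#pragma HLS DATA_PACK variable=out
 // Each pixel_t is now a single 24-bit word: one port, one bus
 for (int i = 0; i < 16; i++)
  out[i] = in[i];
}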
Summary
Optimizing Performance
– Latency optimization
• Specify latency directives
• Unroll loops
• Merge and Flatten loops to reduce loop transition overheads
– Throughput optimization
• Perform Dataflow optimization at the top-level
• Pipeline individual functions and/or loops
• Pipeline the entire function: beware, lots of operations means lots to schedule, and it is not always possible
– Array Optimizations
• Focus on bottlenecks often caused by memory and port accesses
• Removing bottlenecks improves latency and throughput
• Use Array Partitioning, Reshaping, and Data Packing directives to achieve the required throughput
Objectives
The design is a YUV filter typically used in video processing. It consists of three functions: rgb2yuv, yuv_scale, and yuv2rgb
– Each of these functions iterates over the entire source image, requiring a single source pixel to produce a pixel in the result image
– The scale function simply applies individual scale factors, supplied through top-level arguments
Procedure
– Create a Vivado HLS project by executing a script from the Vivado HLS command prompt
– Open the created project in the Vivado HLS GUI and analyze it
– Apply the TRIPCOUNT directive using a pragma
– Apply the PIPELINE directive, generate a solution, and analyze the output
– Apply the DATAFLOW directive to improve performance
– Export and implement the design
In this lab you learned that even though this design could not be pipelined at the top level, a strategy of pipelining the individual loops and then using dataflow optimization to make the functions operate in parallel was able to achieve the same high throughput, processing one pixel per clock. When the DATAFLOW directive is applied, default memory buffers (of ping-pong type) are automatically inserted between the functions. Because the design used only sequential (streaming) data accesses, the costly memory buffers associated with dataflow optimization could be replaced with simple 2-element FIFOs using the dataflow configuration command.
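A minimal sketch of the lab's resulting top-level structure. The function names rgb2yuv, yuv_scale, and yuv2rgb are from the lab; the signatures, image size, pixel type, and loop bodies are assumptions standing in for the lab's actual color math:

#define PIXELS (64*64)       // assumed image size, not from the lab
typedef unsigned char pix_t; // assumed pixel type

void rgb2yuv(pix_t in[PIXELS], pix_t out[PIXELS]) {
 RGB2YUV: for (int i = 0; i < PIXELS; i++) {
#pragma HLS PIPELINE
  out[i] = in[i];                  // placeholder for RGB-to-YUV math
 }
}

void yuv_scale(pix_t in[PIXELS], pix_t out[PIXELS], int scale) {
 SCALE: for (int i = 0; i < PIXELS; i++) {
#pragma HLS PIPELINE
  out[i] = (pix_t)(in[i] * scale); // placeholder for per-channel scaling
 }
}

void yuv2rgb(pix_t in[PIXELS], pix_t out[PIXELS]) {
 YUV2RGB: for (int i = 0; i < PIXELS; i++) {
#pragma HLS PIPELINE
  out[i] = in[i];                  // placeholder for YUV-to-RGB math
 }
}

void yuv_filter(pix_t in[PIXELS], pix_t out[PIXELS], int scale) {
#pragma HLS DATAFLOW
 pix_t yuv[PIXELS], scaled[PIXELS];
 // Each pipelined stage streams one pixel per clock; with only
 // sequential accesses, the dataflow configuration can replace the
 // default ping-pong buffers on yuv/scaled with 2-element FIFOs
 rgb2yuv(in, yuv);
 yuv_scale(yuv, scaled, scale);
 yuv2rgb(scaled, out);
}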