Vivado HLS: Improving Performance
This material exempt per Department of Commerce license exception TSU © Copyright 2013 Xilinx
Outline
Adding Directives
Improving Latency
– Manipulating Loops
Improving Throughput
Performance Bottleneck
Summary
Adding Directives
[Figure: Vivado HLS GUI — select “General” under the solution settings to “Add” or “Remove” configuration settings]
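Directives can be applied through the GUI (stored in the solution's directives.tcl) or embedded directly in the source as pragmas. A minimal sketch of the pragma form, using a hypothetical loop (the slides that follow show the older `#pragma AP` spelling; `#pragma HLS` is the documented Vivado HLS form):

void foo(int in[8], int out[8]) {
 // Directive embedded in the source: pipeline this loop
 L1: for (int i = 0; i < 8; i++) {
#pragma HLS PIPELINE
  out[i] = in[i] + 1;
 }
}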
Design Latency
– The latency of the design is the number of cycles it takes to output the result
• In this example the latency is 10 cycles
Design Throughput
– The throughput of the design is the number of cycles between new inputs
• By default (no concurrency) this is the same as the latency
• The next start/read occurs when the current transaction ends
Functions
– Vivado HLS will seek to minimize latency by allowing functions to operate in parallel
• As shown on the previous slide
Loops
– Vivado HLS will not schedule loops to operate in parallel by default
• Dataflow optimization must be used or the loops must be unrolled
Review: Loops
– Loops can be unrolled if their indices are statically determinable at elaboration time
• Not when the number of iterations is variable
– Unrolled loops are likely to result in more hardware resources and higher area
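A minimal sketch of unrolling in source, using a hypothetical loop (`#pragma HLS UNROLL` is the documented pragma form; the GUI applies the equivalent directive):

void accumulate(int a[4], int *sum) {
 int acc = 0;
 // Fixed bound (4), known at elaboration time, so the loop can be
 // fully unrolled into four parallel copies of the body
 L1: for (int i = 0; i < 4; i++) {
#pragma HLS UNROLL
  acc += a[i];
 }
 *sum = acc;
}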
Loop Flattening
– Nested loops can be flattened into a single loop, removing the cycles spent moving between the loop levels

Before flattening (4 x 4 iterations):
L2: for (i=3;i>=0;i--) {
 L3: for (j=3;j>=0;j--) {
  [loop body l3]
 }
}

After flattening (x16):
L2: for (k=15;k>=0;k--) {
 [loop body l3]
}

Loop Merging
– Consecutive loops can be merged into a single loop, removing further loop transitions

Before merging (L2/L3 already flattened into L2):
L2: for (k=15;k>=0;k--) {
 [loop body l3]
}
L4: for (i=3;i>=0;i--) {
 [loop body l4]
}

After merging (x16, body l4 guarded so it still executes only 4 times):
for (k=15;k>=0;k--) {
 [loop body l3]
 if (cond4)
  [loop body l4]
}

– The figure compares the cost of the loop transitions: 36 transitions for the original code versus 18 after flattening and merging
– If the loop bounds are all variables, they must have the same value
– If the loop bounds are constants, the maximum constant value is used as the bound of the merged loop
• As in the previous example, where the maximum loop bound becomes 16 (implied by L3 being flattened into L2 before the merge)
– Loops with both variable bounds and constant bounds cannot be merged
– The code between the loops to be merged cannot have side effects
• Multiple executions of this code should generate the same results
• A=B is OK, A=B+1 is not (see the sketch after this list)
– Reads from a FIFO or FIFO interface must always be in sequence
• A FIFO read in one loop will not be a problem
• FIFO reads in multiple loops may become out of sequence
• This prevents the loops from being merged
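A minimal sketch of the side-effect rule, using hypothetical variables (`#pragma HLS LOOP_MERGE` is the documented pragma form; the GUI applies the equivalent directive):

void merge_example(int B, int d[8], int e[8]) {
#pragma HLS LOOP_MERGE
 int t;
 L1: for (int i = 0; i < 8; i++)
  d[i] = d[i] + 1;
 // Code between the loops: re-executing "t = B" gives the same result,
 // so L1 and L2 can still be merged around it. Code whose repeated
 // execution changes the result would prevent the merge.
 t = B;
 L2: for (int j = 0; j < 8; j++)
  e[j] = e[j] * t;
}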
Loop Reports
[Figure: Vivado HLS synthesis report showing the latency of each loop]
Constraints
– Vivado HLS accepts constraints for latency (a sketch follows after this list)
Loop Optimizations
– Latency can be improved by minimizing the number of loop boundaries
• Rolled loops (the default) enforce sharing at the expense of latency
• Entering and exiting a loop costs clock cycles
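A minimal sketch of a latency constraint, using a hypothetical loop (`#pragma HLS LATENCY` is the documented pragma form; treat the exact placement and options here as assumptions):

void scale3(int in[8], int out[8]) {
 // Ask the scheduler to complete the loop body within 10 cycles;
 // Vivado HLS reports when the constraint cannot be met
 L1: for (int i = 0; i < 8; i++) {
#pragma HLS LATENCY max=10
  out[i] = in[i] * 3;
 }
}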
Dataflow Optimization
– Can be used at the top-level function
– Allows blocks of code to operate concurrently
• The blocks can be functions or loops
• Dataflow allows loops to operate concurrently
– It places channels between the blocks to maintain the data rate
• For arrays the channels will include memory elements to buffer the samples
• For scalars the channel is a register with handshakes
Dataflow optimization therefore has an area overhead
– Additional memory blocks are added to the design
– The timing diagram on the previous page should have a memory access delay between the blocks
• Not shown to keep the explanation of the principle clear
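A minimal sketch of dataflow at the top level, using hypothetical producer/consumer functions (`#pragma HLS DATAFLOW` is the documented pragma form):

void produce(int in[32], int tmp[32]) {
 for (int i = 0; i < 32; i++) tmp[i] = in[i] + 1;
}
void consume(int tmp[32], int out[32]) {
 for (int i = 0; i < 32; i++) out[i] = tmp[i] * 2;
}
void top(int in[32], int out[32]) {
#pragma HLS DATAFLOW
 int tmp[32];
 // produce and consume overlap across transactions; the tmp array
 // becomes a ping-pong buffer channel between the two blocks
 produce(in, tmp);
 consume(tmp, out);
}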
Dataflow Optimization Commands
– Dataflow optimization is “coarse grain” pipelining at the function and loop level
– It increases concurrency between functions and loops
– It only works on functions or loops at the top level of the hierarchy
• It cannot be used in sub-functions
Function Pipelining
void foo(...) {
 op_Read;    // RD
 op_Compute; // CMP
 op_Write;   // WR
}
Without pipelining
– There are 3 clock cycles before operation RD can occur again
• Throughput = 3 cycles
– There are 3 cycles before the 1st output is written
• Latency = 3 cycles
With pipelining
– The latency is the same
• Latency = 3 cycles
– The throughput is better: a new RD can start every cycle
• Fewer cycles between starts, higher throughput
Loop Pipelining
Loop: for(i=1;i<3;i++) {
 op_Read;    // RD
 op_Compute; // CMP
 op_Write;   // WR
}
Without pipelining
– There are 3 clock cycles before operation RD can occur again
• Throughput = 3 cycles
– There are 3 cycles before the 1st output is written
• Latency = 3 cycles
– For both iterations of the loop
• Loop Latency = 6 cycles
With pipelining
– The latency is the same
• Latency = 3 cycles
– The throughput is better
• Fewer cycles, higher throughput
– The latency for all iterations, the loop latency, has been improved
• Loop Latency = 4 cycles
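A sketch of the pipelined loop in source, with a concrete body standing in for op_Read/op_Compute/op_Write (`#pragma HLS PIPELINE` is the documented pragma form; the deck's slides show the older `#pragma AP` spelling):

void foo(int in[2], int out[2]) {
 Loop: for (int i = 0; i < 2; i++) {
#pragma HLS PIPELINE
  int x = in[i];  // RD: a new iteration can start every cycle
  int y = x * 3;  // CMP
  out[i] = y;     // WR
 }
}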
Pipelining and Function/Loop Hierarchy
Vivado HLS will attempt to unroll all loops nested below a PIPELINE directive
– It may not succeed for various reasons and/or may lead to unacceptable area
• Loops with variable bounds cannot be unrolled
• Unrolling multi-level loop nests may create a lot of hardware
– Pipelining the inner-most loop will give the best performance for the area
• Or the next one (or two) out, if the inner-most loop is modest and fixed
e.g. a convolution algorithm
• Outer loops will keep the inner pipeline fed
Pipeline the inner-most loop L2:
void foo(in1[ ][ ], in2[ ][ ], …) {
 …
 L1: for(i=1;i<N;i++) {
  L2: for(j=0;j<M;j++) {
#pragma AP PIPELINE
   out[i][j] = in1[i][j] + in2[i][j];
  }
 }
}

Pipeline loop L1 (L2 is unrolled):
void foo(in1[ ][ ], in2[ ][ ], …) {
 …
 L1: for(i=1;i<N;i++) {
#pragma AP PIPELINE
  L2: for(j=0;j<M;j++) {
   out[i][j] = in1[i][j] + in2[i][j];
  }
 }
}

Pipeline the function (L1 and L2 are unrolled):
void foo(in1[ ][ ], in2[ ][ ], …) {
#pragma AP PIPELINE
 …
 L1: for(i=1;i<N;i++) {
  L2: for(j=0;j<M;j++) {
   out[i][j] = in1[i][j] + in2[i][j];
  }
 }
}
Pipelining Commands
[Figure: clock waveforms illustrating pipeline flush behavior]
With Flush
– When no new input reads are performed, values already in the pipeline are flushed out
Array Partitioning
Array Dimensions
– my_array[10][6][4] has three dimensions: dimension 1 (size 10), dimension 2 (size 6), and dimension 3 (size 4)
– Dimension 0 means all dimensions
Examples
– my_array[10][6][4] partition dimension 3
• Gives 4 arrays: my_array_0[10][6], my_array_1[10][6], my_array_2[10][6], my_array_3[10][6]
– my_array[10][6][4] partition dimension 1
• Gives 10 arrays: my_array_0[6][4] through my_array_9[6][4]
– my_array[10][6][4] partition dimension 0
• Gives 10x6x4 = 240 individual registers
Partitioned arrays:
• On the Interface
− This means separate ports
• Internally
− Separate buses & wires
− Separate control logic, which may be more complex, slower, and increase latency
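A minimal sketch of a partitioning directive, using a hypothetical array (`#pragma HLS ARRAY_PARTITION` is the documented pragma form; the GUI applies the equivalent directive):

void sum_rows(int in[10][6][4], int out[10]) {
 int buf[10][6][4];
 // Split dimension 1 (size 10) into 10 separate [6][4] memories
 // (buf_0 ... buf_9), so all 10 rows can be accessed in the same cycle
#pragma HLS ARRAY_PARTITION variable=buf complete dim=1
 for (int i = 0; i < 10; i++) {
  int acc = 0;
  for (int j = 0; j < 6; j++)
   for (int k = 0; k < 4; k++) {
    buf[i][j][k] = in[i][j][k];
    acc += buf[i][j][k];
   }
  out[i] = acc;
 }
}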
Data Packing
• Grouped structure
− The first element in the struct becomes the LSB
− The last struct element becomes the MSB
− Arrays are partitioned completely
• On the Interface
− This means a single port
• Internally
− A single bus
− May result in simplified control logic, and faster, lower-latency designs
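A minimal sketch of packing a struct into one wide word, using a hypothetical pixel struct (`#pragma HLS DATA_PACK` is the Vivado HLS pragma form; treat the exact options as assumptions):

typedef struct {
 char r; // first element -> LSB of the packed word
 char g;
 char b; // last element -> MSB
} pixel_t;

void copy_pixels(pixel_t in[16], pixel_t out[16]) {
#pragma HLS DATA_PACK variable=in
#pragma HLS DATA_PACK variable=out
 // Each pixel_t is now a single 24-bit word: one port, one bus
 for (int i = 0; i < 16; i++)
  out[i] = in[i];
}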
Summary
Optimizing Performance
– Latency optimization
• Specify latency directives
• Unroll loops
• Merge and Flatten loops to reduce loop transition overheads
– Throughput optimization
• Perform Dataflow optimization at the top-level
• Pipeline individual functions and/or loops
• Pipeline the entire function: beware, lots of operations means lots to schedule, and it is not always possible
– Array Optimizations
• Focus on bottlenecks often caused by memory and port accesses
• Removing bottlenecks improves latency and throughput
• Use Array Partitioning, Reshaping, and Data Packing directives to achieve the required throughput
Objectives
The design is a YUV filter typically used in video processing. It consists of three functions: rgb2yuv, yuv_scale, and yuv2rgb
– Each of these functions iterates over the entire source image, requiring a single source pixel to produce a pixel in the result image
– The scale function simply applies individual scale factors, supplied through top-level arguments
Procedure
– Create a Vivado HLS project by executing a script from the Vivado HLS command prompt
– Open the created project in the Vivado HLS GUI and analyze it
– Apply the TRIPCOUNT directive using a pragma
– Apply the PIPELINE directive, generate a solution, and analyze the output
– Apply the DATAFLOW directive to improve performance
– Export and implement the design
In this lab you learned that even though this design could not be pipelined at the top level, a strategy of pipelining the individual loops and then using dataflow optimization to make the functions operate in parallel was able to achieve the same high throughput, processing one pixel per clock. When the DATAFLOW directive is applied, default memory buffers (of ping-pong type) are automatically inserted between the functions. Because the design used only sequential (streaming) data accesses, the costly memory buffers associated with dataflow optimization could be replaced with simple 2-element FIFOs using the dataflow configuration command.
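A minimal sketch of the lab's resulting top-level structure. The function names rgb2yuv, yuv_scale, and yuv2rgb are from the lab; the signatures, image size, pixel type, and loop bodies are assumptions standing in for the lab's actual color math:

#define PIXELS (64*64)       // assumed image size, not from the lab
typedef unsigned char pix_t; // assumed pixel type

void rgb2yuv(pix_t in[PIXELS], pix_t out[PIXELS]) {
 RGB2YUV: for (int i = 0; i < PIXELS; i++) {
#pragma HLS PIPELINE
  out[i] = in[i];                  // placeholder for RGB-to-YUV math
 }
}

void yuv_scale(pix_t in[PIXELS], pix_t out[PIXELS], int scale) {
 SCALE: for (int i = 0; i < PIXELS; i++) {
#pragma HLS PIPELINE
  out[i] = (pix_t)(in[i] * scale); // placeholder for per-channel scaling
 }
}

void yuv2rgb(pix_t in[PIXELS], pix_t out[PIXELS]) {
 YUV2RGB: for (int i = 0; i < PIXELS; i++) {
#pragma HLS PIPELINE
  out[i] = in[i];                  // placeholder for YUV-to-RGB math
 }
}

void yuv_filter(pix_t in[PIXELS], pix_t out[PIXELS], int scale) {
#pragma HLS DATAFLOW
 pix_t yuv[PIXELS], scaled[PIXELS];
 // Each pipelined stage streams one pixel per clock; with only
 // sequential accesses, the dataflow configuration can replace the
 // default ping-pong buffers on yuv/scaled with 2-element FIFOs
 rgb2yuv(in, yuv);
 yuv_scale(yuv, scaled, scale);
 yuv2rgb(scaled, out);
}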