High Level Synthesis II: ECE 3401 Digital Systems Design

The document discusses High-Level Synthesis (HLS) in digital systems design, highlighting the challenges of traditional hardware design such as increased complexity and shorter design cycles. It outlines the HLS workflow, including scheduling and binding, and details various techniques like loop unrolling, pipelining, and array partitioning to optimize performance. The document emphasizes the importance of directives in guiding HLS optimizations to achieve results comparable to hand-coded RTL implementations.

Lecture 20

High Level Synthesis II

ECE 3401
Digital Systems Design

1
Traditional Hardware Design

• Hand coded RTL
  – Understand the problem space
    • E.g., Sort
  – Decide on a solution
    • E.g., Radix Sort
  – Implementation
    • Code RTL
    • Validate/Verify
    • Manually tune to meet spec
• Result: a single design point in power, performance, and area

2
The Problem is Getting Worse
• More heterogeneity requires more hardware
– Less re-use, too expensive to hand code

• Shorter design cycles


– Rushed designs mean specs keep changing
– Focus on correctness, let tool handle performance

• Can’t spend months tuning every pipeline in the sea of design implementations

3
High-Level Synthesis workflow

Design Source (C, C++, SystemC) + Technology Library + User Directives
    → Scheduling → Binding → RTL (Verilog, VHDL, SystemC)

4
High-Level Synthesis workflow

Design Source (C, C++, SystemC) + Technology Library + User Directives
    → Scheduling → Binding → RTL (Verilog, VHDL, SystemC)

• Scheduling: determines which cycle each operation happens in

5
High-Level Synthesis workflow

Design Source (C, C++, SystemC) + Technology Library + User Directives
    → Scheduling → Binding → RTL (Verilog, VHDL, SystemC)

• Scheduling: determines which cycle each operation happens in
• Binding: maps operations onto instantiated hardware

6
Typical C/C++ Synthesizable Subset
• Data types:
– Primitive types: (u)char, (u)short, (u)int, (u)long, float,
double
– Arbitrary precision integer or fixed-point types
– Composite types: array, struct, class
– Template types: template<>
– Statically determinable pointers

• No support for dynamic memory allocation
• No support for recursive function calls
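As a hedged illustration of this subset (plain C++ standing in for tool-specific arbitrary-precision types; `scale_add` is a made-up example name, not from any tool's API), a synthesizable function uses fixed-size arrays, statically determinable bounds, and no dynamic allocation or recursion:

```cpp
#include <cstdint>

// Illustrative sketch only: a function written within a typical
// HLS-synthesizable subset. Fixed-size arrays, compile-time trip
// count, no malloc/new, no recursion.
template <int N>
void scale_add(const int16_t in[N], int16_t out[N], int16_t k) {
    for (int i = 0; i < N; i++) {
        // Every memory access has a statically analyzable address
        out[i] = static_cast<int16_t>(in[i] * k + i);
    }
}
```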

7
Typical C/C++ Constructs for HW Mapping

8
Function Hierarchy
• Each function is usually translated into an HDL module
  – Functions may be inlined to dissolve their hierarchy

9
Function Arguments
• Function arguments become ports on the HDL
blocks

• Inputs/Outputs enable synchronous data exchange

10
Design Space Explorations with HLS
• Directives guide HLS optimizations
– Resource allocation and implementation
– Memory partitioning
– Loop unrolling
– Loop pipelining

• ~30 unique directives
  – User can provide as much detail as desired
  – Can achieve performance on the order of handwritten RTL
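As a hedged sketch of what such directives look like in source, the pragma spellings below follow Vitis-HLS-style conventions and vary by vendor; `top`, `a`, and `b` are illustrative names:

```cpp
// Illustrative only: directive style of Vitis-HLS-like tools.
// Exact pragma names and options differ between vendors; a plain
// C++ compiler simply ignores the unknown pragmas.
void top(const int a[64], int b[64]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4  // memory partitioning
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1    // loop pipelining: new iteration every cycle
#pragma HLS UNROLL factor=4  // loop unrolling
        b[i] = a[i] + 1;
    }
}
```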

11
Expressions
• HLS generates datapath circuits mostly from
expressions
– Timing constraints influence the degree of
registering

12
Resources

• Allocation directive constrains resources
  – Operations: e.g., number of adders instantiated in HDL
  – Can save a lot of area (and power)

With one adder:  CYCLE 1: add1;  CYCLE 2: add1  (two cycles for two additions)
With two adders: CYCLE 1: add1, add2  (one cycle)
13
Array partitioning
• HLS implements an array in C as a memory block in HDL
  – Read & write array → RAM
  – Constant array → ROM
• Typically, each memory module supports a limited number of read/write ports (up to 2)

void TOP () {
    int A[N];
    for (i = 0; i < N; i++)
        A[i] = A[i] + i;
}
14
Array partitioning
• An array can be partitioned and implemented with multiple RAMs
  – Extreme case: completely partitioned into individual elements that map to discrete registers

void TOP (int x, ..) {
    int A[N];
    for (i = 0; i < N; i++)
        A[i+x] = A[i] + i;   // index depends on runtime x: can this still be partitioned?
}

• An array can be partitioned and mapped to multiple RAMs
• Multiple arrays can be merged and mapped to one RAM

15
Array partitioning directives
• Split arrays to improve memory bandwidth

16
Block array partitioning
Array1[N] → Array1a[N/2], Array1b[N/2]  (factor=2)

• Block partitioning creates smaller arrays from consecutive blocks of the original array
• Splits the array into equal blocks, where the number of blocks is the integer defined by the factor= argument
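A software model of this mapping can make it concrete (a hedged sketch for N=8, factor=2; the helper names are made up): element i of the original array lands in the first sub-array if i < N/2, else in the second, at local index i - N/2.

```cpp
#include <array>

// Hedged software model of block partitioning, N = 8, factor = 2.
// The two sub-arrays hold consecutive halves of the original array.
std::array<int, 4> Array1a{}, Array1b{};

void block_write(int i, int v) {
    if (i < 4) Array1a[i] = v;      // first consecutive block: 0..3
    else       Array1b[i - 4] = v;  // second consecutive block: 4..7
}

int block_read(int i) {
    return (i < 4) ? Array1a[i] : Array1b[i - 4];
}
```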

17
Cyclic array partitioning
Array1[N] → Array1a[N/2], Array1b[N/2]  (factor=2)

• Cyclic partitioning creates smaller arrays by interleaving elements from the original array
• The array is partitioned cyclically by putting one element into each new array before coming back to the first array, repeating the cycle until the array is fully partitioned
  – For example, if factor=2: element 0 goes to the first new array, element 1 to the second, element 2 back to the first, and so on
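The interleaved mapping can likewise be modeled in software (a hedged sketch for N=8, factor=2, mirroring the block-partitioning model; names are illustrative): element i goes to sub-array i % 2 at local index i / 2.

```cpp
#include <array>

// Hedged software model of cyclic partitioning, N = 8, factor = 2.
// Even-indexed elements interleave into one sub-array, odd into the other.
std::array<int, 4> CyclicA{}, CyclicB{};

void cyclic_write(int i, int v) {
    if (i % 2 == 0) CyclicA[i / 2] = v;  // elements 0, 2, 4, 6
    else            CyclicB[i / 2] = v;  // elements 1, 3, 5, 7
}

int cyclic_read(int i) {
    return (i % 2 == 0) ? CyclicA[i / 2] : CyclicB[i / 2];
}
```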
18
Complete array partitioning
Array1[N] → Array1a[1], Array1b[1], Array1c[1], Array1d[1], ...

• Decomposes the array into individual elements
• For a one-dimensional array, this corresponds to resolving a memory into individual registers
19
Loops
• By default, loops are rolled
• Each loop iteration corresponds to a “sequence” of
states (more generally, an FSM)
• This state sequence will be repeated multiple times
based on the loop trip count (or loop bound)

void TOP () {
    ..
    for (i = 0; i < N; i++)
        sum += A[i];
}

FSM: state S1 loads A[i] (LD), state S2 adds it into sum (+); the sequence repeats each iteration

20
Loop Unrolling
void TOP () {
    ..
    for (i = 0; i < N; i++)
        sum += A[i];
}

After unrolling with a factor of 4:

void TOP () {
    ..
    for (i = 0; i < N/4; i++) {
        sum += A[4*i];
        sum += A[4*i+1];
        sum += A[4*i+2];
        sum += A[4*i+3];
    }
}

• Exploits parallelism between loop iterations to achieve shorter latency or higher throughput
• Creates multiple copies of the loop body and adjusts the loop iteration counter accordingly
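The transformation can be checked in plain C++ (a hedged sketch assuming N is divisible by 4, as the factor-of-4 unroll requires; the function names are made up):

```cpp
// Rolled and manually 4x-unrolled versions of the accumulation loop.
int sum_rolled(const int* A, int N) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += A[i];
    return sum;
}

int sum_unrolled4(const int* A, int N) {
    int sum = 0;
    for (int i = 0; i < N / 4; i++) {
        sum += A[4 * i];      // four copies of the loop body;
        sum += A[4 * i + 1];  // the counter now advances once
        sum += A[4 * i + 2];  // per group of four elements
        sum += A[4 * i + 3];
    }
    return sum;
}
```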

21
Loop unrolling
(+) Decreased loop control overhead
(+) Increased parallelism for scheduling
(–) Increased operation count, which may negatively
impact area, timing, and power

Unroll factor of 4:
  Rolled: one adder, sum += A[i] per state; takes 4 cycles when N=4
  Unrolled: an adder tree sums A[0..3] into sum; takes 1 cycle!

22
Pipelining
4 consecutive operations: Z = F(X, Y) = SqRoot(X² + Y²)

If each step takes 1T, then one calculation takes 3T; four take 12T.

Pipelined into three stages:
  Stage 1: X², Y²  →  Stage 2: X² + Y²  →  Stage 3: SqRoot  →  Z

Assuming ideally that each stage takes 1T:
• What will be the latency (time to produce the first result)?
• What will be the throughput (pipeline rate in the steady state)?

23
Pipelining -- Timing

The four operations flow through the three stages in 6T total; Speedup = 12T / 6T = 2

For n operations: total time = 3T + (n-1)T  (latency for the first result, then one result per T)

Speedup = (3T × n) / (3T + (n-1)T) = 3n / (n+2)
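The speedup expression can be sanity-checked numerically (a minimal sketch; the stage count k = 3 and the function name are chosen for illustration):

```cpp
// Speedup of a k-stage pipeline over sequential execution for n
// operations, assuming every stage takes one time unit T:
//   sequential time: k*n, pipelined time: k + (n - 1)
double pipeline_speedup(int k, int n) {
    return static_cast<double>(k * n) / (k + (n - 1));
}
```

As n grows, the speedup approaches the stage count k.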
24
Loop pipelining
Loop_tag : for ( II = 1; II < 3; II++ ) {
    op_Read;     // RD
    op_Compute;  // CMP
    op_Write;    // WR
}

25
Loop pipelining
Without Pipelining

Loop_tag : for ( II = 1; II < 3; II++ ) {
    op_Read;     // RD
    op_Compute;  // CMP
    op_Write;    // WR
}

RD CMP WR | RD CMP WR

Throughput = 3 cycles
Latency = 3 cycles
Loop Latency = 6 cycles

26
Loop pipelining
Without Pipelining vs. With Pipelining

Loop_tag : for ( II = 1; II < 3; II++ ) {
    op_Read;     // RD
    op_Compute;  // CMP
    op_Write;    // WR
}

Without pipelining: RD CMP WR | RD CMP WR
  Throughput = 3 cycles; Latency = 3 cycles; Loop Latency = 6 cycles

With pipelining (the second iteration starts one cycle in):
  RD CMP WR
     RD CMP WR
  Throughput = 1 cycle; Latency = 3 cycles; Loop Latency = 4 cycles

• Loop pipelining allows one iteration to begin processing before the previous iteration is complete
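The loop-latency numbers follow a simple pattern (a hedged sketch; `depth` is the latency of one iteration, `trips` the trip count, and `ii` the iteration interval):

```cpp
// Latency of a pipelined loop: the first iteration takes `depth`
// cycles, and each later iteration starts `ii` cycles after the
// previous one.
int pipelined_loop_latency(int depth, int trips, int ii) {
    return depth + (trips - 1) * ii;
}
```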

27
Loop pipelining
• Iteration Interval (II)
– Cycles loop must wait before starting next iteration

• II = 1 cannot always be implemented
  – The same memory port cannot be read twice in the same cycle
  – Similar effect with other resource limitations

28
Loop pipelining
• Given sufficient arrays and resources, II = 1 can be implemented

29
Matrix Vector Multiplication Example
• A: input matrix
• x: input vector
• y: output vector
• y[i] = ∑ A[i][j]*x[j]

// N = 8
void MV(int A[N][N], int x[N], int y[N]) {
int i, j;
int acc;
for (i = 0; i < N; i++) {
acc = 0;
for (j = 0; j < N; j++) {
acc += A[i][j] * x[j];
}
y[i] = acc;
}
}
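Because HLS starts from C source, the kernel can be validated in software before synthesis; a minimal check (the input values here are made up for illustration):

```cpp
// Software check of the MV kernel from the slide (N = 8): run the
// C source with known inputs before handing it to the HLS tool.
constexpr int N = 8;

void MV(int A[N][N], int x[N], int y[N]) {
    for (int i = 0; i < N; i++) {
        int acc = 0;
        for (int j = 0; j < N; j++) {
            acc += A[i][j] * x[j];  // y[i] = sum over j of A[i][j]*x[j]
        }
        y[i] = acc;
    }
}
```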
30
Baseline Datapath
Datapath: A[i][0 to 7] and x[0 to 7] → multiplier → adder/accumulator (acc) → y[i]

// N = 8
for (i = 0; i < N; i++) {
    acc = 0;
    for (j = 0; j < N; j++) {
        acc += A[i][j] * x[j];
    }
    y[i] = acc;
}

• Latency of load = 2 cycles, Mult = 1 cycle, Add+Acc = 1 cycle, Store = 1 cycle
• Inner loop latency = (2+1+1) × 8 = 32 cycles
  – 2 LD, 1 *, 1 +
• Outer loop latency = (32+2) × 8 = 272 cycles
  – 1 ST, 1 Acc

31
Loop Unrolling
Datapath: A[i][0 to 7] and x[0 to 7] → multiplier → adder/accumulator (acc) → y[i]

// N = 8
for (i = 0; i < N; i++) {
    acc = 0;
    for (j = 0; j < N; j++) {
        acc += A[i][j] * x[j];
    }
    y[i] = acc;
}

• Unrolling the inner loop (j) expects to load 8 elements of A and x every cycle
• A single RAM only has 1 read port!!

32
Partition Arrays using complete

A[i][0 to 7] → A[i][0], A[i][1], ..., A[i][7]
x[0 to 7] → x[0], x[1], ..., x[7]

• Array partitioning breaks one array into smaller portions and implements it with multiple RAM modules
  – Options: block, cyclic, or complete
33
Loop Unrolling
Datapath: A[i][0]×x[0], A[i][1]×x[1], ..., A[i][7]×x[7] → adder tree → y[i]

• Latency of load = 2 cycles, Mult = 1 cycle, Add = 1 cycle, Store = 1 cycle
• Unrolled inner loop latency = (2+1+1+1+1+1) × 1 = 7 cycles
  – 2 LD, 1 *, 3 levels of adder tree, 1 ST
• Outer loop latency = 7 × 8 = 56 cycles
• 4.86x faster but requires 8x more RAMs and multipliers and 7x more adders!!
34
Loop Pipelining
Cycle: 1 2 3 4 5 6 7 8 9 10 11 ...

Each iteration (7 cycles): LD A[i][:] and LD x[:] → × → adder tree → ST y[i]
  i=0 starts at cycle 1, i=1 at cycle 3, i=2 at cycle 5, ...

• Iteration interval II = 2
  – A RAM port cannot be read twice at the same time, so the loop must wait 2 cycles before starting the next iteration
• Pipelined latency = 7 + (7 × 2) = 21 cycles
• Pipelining is 2.7x faster than loop unrolling only
35
