Lecture 30
CS G553
High-Level Synthesis for Reconfigurable Devices
(Behavioral Synthesis): Spatial and Temporal Partitioning
High-Level Synthesis
General vs. RCS High-Level Synthesis
Example:
x = ((a * b) * (c * d)) + ((c * d) - (e - f))
y = ((c * d) - (e - f)) - ((e - f) + (g - h))
Operators: add, mul, sub
High-Level Synthesis
Fundamental differences in RCS:
2.
o In general HLS:
• The application is specified using a structure that encapsulates a datapath and a control part.
• The synthesis process allocates resources to operators at different times according to a computed schedule.
• The control part is synthesized.
o In RCS:
• Hardware modules implemented as datapaths normally compete for execution on the chip.
• A processor controls the selection of the hardware modules by means of reconfiguration.
• The same processor is also in charge of activating the resources in the corresponding hardware accelerators.
Partitioning
Partitioning - Motivation
Spatial Partitioning
Spatial partitioning - Problem
Spatial partitioning - Timing - Block replication
Temporal Partitioning
Challenging
Why?
Configuration
Configuration:
o Given a reconfigurable processing unit H and
o a set of tasks T = {t1, ...., tn} available as cores C = {c1, ...., cn},
o we define the configuration ζi of the RPU at time si to be the set of
cores ζi = {ci1, ..., cik} ⊆ C running on H at time si.
For each task ti, the library provides a core (module) ci:
o a hard, soft, or firm module.
Schedule
Schedule:
o is a function ς : V → Z+, where ς(vi) denotes the starting time
of the node vi that implements a task ti.
Feasible Schedule:
o ς is feasible if, ∀ eij = (vi, vj) ∈ E:
ς(tj) ≥ ς(ti) + T(ti) + tij
• eij defines a data dependency between tasks ti and tj,
• tij is the latency of the edge eij,
• T(ti) is the time it takes the node vi to complete execution.
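The feasibility condition above can be checked edge by edge. The following is a minimal sketch, assuming the schedule, execution times, and edge latencies are given as plain dictionaries; the function name `is_feasible_schedule` and the three-task graph are hypothetical and only illustrate the inequality ς(tj) ≥ ς(ti) + T(ti) + tij.

```python
def is_feasible_schedule(start, exec_time, edge_latency):
    """Check start[j] >= start[i] + exec_time[i] + latency for every edge (i, j)."""
    return all(
        start[j] >= start[i] + exec_time[i] + latency
        for (i, j), latency in edge_latency.items()
    )


# Hypothetical data-flow graph: t1 -> t2 and t1 -> t3.
exec_time = {"t1": 2, "t2": 3, "t3": 1}              # T(ti)
edge_latency = {("t1", "t2"): 1, ("t1", "t3"): 0}    # tij

good = {"t1": 0, "t2": 3, "t3": 2}   # every successor waits long enough
bad = {"t1": 0, "t2": 2, "t3": 2}    # t2 starts before t1's result arrives

print(is_feasible_schedule(good, exec_time, edge_latency))
print(is_feasible_schedule(bad, exec_time, edge_latency))
```

In the `bad` schedule, t2 would have to start no earlier than 0 + 2 + 1 = 3, so starting at time 2 violates the constraint.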
Ordering Relation
Partition
Partition:
o A partition P of the graph G = (V,E) is its division into some
disjoint subsets P1, ..., Pm such that
⋃k=1,…,m Pk = V
Feasible Partition:
o A partition is feasible with respect to a reconfigurable
device H with area a(H) and pin count p(H) if:
o ∀ Pk ∈ P: a(Pk) = ∑vi∈Pk ai ≤ a(H)
o ∑eij∈E wij ≤ p(H)
• for eij = crossing edges
Crossing edge:
o an edge that connects one component in a partition with
another component outside the partition.
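The two feasibility conditions can be sketched as a small checker. This is an illustration under stated assumptions, not a reference implementation: wij is taken to be the width (number of signals) of edge eij, the pin constraint is applied to each block separately, and the names `crossing_edges` and `is_feasible_partition` as well as the four-node example are hypothetical.

```python
def crossing_edges(block, edges):
    """Edges with exactly one endpoint inside the block."""
    return [(u, v) for (u, v) in edges if (u in block) != (v in block)]


def is_feasible_partition(blocks, area, edges, width, device_area, device_pins):
    """Check disjoint cover of V, then area and pin limits for every block."""
    nodes = [v for block in blocks for v in block]
    # The blocks must be disjoint and their union must equal V.
    if len(nodes) != len(set(nodes)) or set(nodes) != set(area):
        return False
    for block in blocks:
        if sum(area[v] for v in block) > device_area:      # a(Pk) <= a(H)
            return False
        pins = sum(width[e] for e in crossing_edges(block, edges))
        if pins > device_pins:                             # crossing widths <= p(H)
            return False
    return True


# Hypothetical graph: v1 -> v3, v2 -> v3, v3 -> v4, each edge 8 signals wide.
area = {"v1": 40, "v2": 30, "v3": 50, "v4": 20}
edges = [("v1", "v3"), ("v2", "v3"), ("v3", "v4")]
width = {e: 8 for e in edges}

blocks = [{"v1", "v2"}, {"v3", "v4"}]
print(is_feasible_partition(blocks, area, edges, width, device_area=80, device_pins=16))
```

With these numbers both blocks occupy 70 area units and each has two crossing edges (16 signals), so the partition just fits a device with a(H) = 80 and p(H) = 16.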
Run Time
Ordering Relation
Ordered partitions:
o A partitioning P is ordered if an ordering relation ≤ exists on P.
Temporal Partitioning
Temporal partitioning:
o Given a DFG G = (V,E) and a reconfigurable device H, a temporal
partitioning of G on H is an ordered partitioning P of G with respect
to H.
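As a rough illustration of the "ordered" requirement, the sketch below assumes the ordering relation means that no DFG edge points from a later partition back to an earlier one. That reading is an assumption made here, and the function name `respects_order` and the example graph are hypothetical. Feasibility with respect to the area and pins of H would have to be checked separately.

```python
def respects_order(ordered_blocks, edges):
    """True if every edge (u, v) goes from a block to the same or a later block."""
    index = {v: k for k, block in enumerate(ordered_blocks) for v in block}
    return all(index[u] <= index[v] for (u, v) in edges)


# Hypothetical DFG: a -> b, b -> c, a -> c.
edges = [("a", "b"), ("b", "c"), ("a", "c")]

valid = [{"a"}, {"b", "c"}]      # a runs first, its consumers run later
invalid = [{"b"}, {"a", "c"}]    # b would run before its producer a

print(respects_order(valid, edges))
print(respects_order(invalid, edges))
```

Under this reading, the blocks can be configured one after another in list order, because every value is produced no later than the configuration that consumes it.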
Temporal Partitioning
Cycle
Temporal partitioning
Goal:
o Computation and scheduling of a Configuration graph
A configuration graph is a graph in which:
o Nodes are partitions or bitstreams
o Edges reflect the precedence constraints in a given DFG
[Figure: Configuration graph with partitions P1 to P5]
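A configuration graph can be derived mechanically from a DFG and a node-to-partition assignment. The sketch below is one possible way to do it, assuming the assignment is given as a dictionary; `configuration_graph`, `configuration_order`, and the example data are hypothetical names and values. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to obtain one admissible configuration sequence.

```python
from graphlib import TopologicalSorter


def configuration_graph(block_of, edges):
    """Edges (Pi, Pj) between different partitions, induced by the DFG edges."""
    return {
        (block_of[u], block_of[v])
        for (u, v) in edges
        if block_of[u] != block_of[v]
    }


def configuration_order(block_of, edges):
    """One ordering of the partitions that respects the precedence constraints."""
    predecessors = {p: set() for p in block_of.values()}
    for src, dst in configuration_graph(block_of, edges):
        predecessors[dst].add(src)
    # static_order() raises graphlib.CycleError if the partitions depend on each other cyclically.
    return list(TopologicalSorter(predecessors).static_order())


# Hypothetical DFG with four nodes assigned to three partitions.
block_of = {"v1": "P1", "v2": "P1", "v3": "P2", "v4": "P3"}
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v3"), ("v3", "v4")]

print(configuration_graph(block_of, edges))
print(configuration_order(block_of, edges))
```

Edges inside a partition (here v1 -> v2) disappear, while edges between partitions become the precedence edges of the configuration graph.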
Temporal partitioning
• Formal Definition:
[Figure: configuration graph with partitions P1 to P5]
Temporal partitioning
Whenever a new partition is downloaded, the partition that was running is destroyed.
o Communication through inter-configuration registers (or communication memory)
• May sit in main memory
• May sit at the boundary of the device to hold the input and output values
o Configuration sequence is controlled by the host processor
[Figure: configuration graph with partitions P1 to P5; processor and FPGA connected by a bus, with inter-configuration IO registers around the FPGA block]
Communication Memory Synthesis?
Device’s register mapping into the processor address spaces
Temporal partitioning
Objectives for optimization:
1. Number of interconnections: very important, since minimizing it reduces:
➢ The amount of exchanged data
➢ The amount of memory for temporarily storing the data
2. Number of produced blocks (partitions)
➢ Fewer blocks mean fewer reconfigurations (total time?)
3. Overall computation delay depends on
➢ the partition run time
➢ the processor used for reconfiguration
➢ speed of data exchange
4. Similarity between consecutive partitions (for partial reconfiguration)
5. Overall amount of wasted resources on the chip.
➢ When components with shorter run-times are placed in the same partition with components with longer run-times, those with the shorter run-times remain idle for a longer period of time.
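The last objective can be made concrete with a small calculation. The sketch below rests on two simplifying assumptions that are not stated in the slides: a partition runs as long as its slowest component, and the waste of a component is its area multiplied by the time it sits idle. The function name `wasted_resources` and the area and run-time figures are hypothetical.

```python
def wasted_resources(block, area, run_time):
    """Sum of area * idle time over the components of one partition."""
    # Assumption: the partition occupies the device until its slowest component finishes.
    block_time = max(run_time[v] for v in block)
    return sum(area[v] * (block_time - run_time[v]) for v in block)


# Hypothetical components: one slow multiplier, two fast units.
area = {"mul": 100, "add": 20, "sub": 20}
run_time = {"mul": 10, "add": 2, "sub": 2}

mixed = {"mul", "add", "sub"}            # fast units idle for 8 time units each
split = [{"mul"}, {"add", "sub"}]        # similar run-times grouped together

print(wasted_resources(mixed, area, run_time))
print(sum(wasted_resources(b, area, run_time) for b in split))
```

Grouping components with similar run-times removes the idle area in this example, at the price of one more partition and therefore one more reconfiguration.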
Wasted Resources
Communication Overhead
[Figure: example graph with 10 nodes]
Connectivity = 0.24
Communication Overhead
o Low quality means that the algorithm performs poorly
[Figure: 10-node example graph, Connectivity = 0.24]
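The connectivity value can be reproduced with a short calculation. The sketch below assumes that connectivity is the ratio of existing edges to the maximum possible number of edges, 2|E| / (|V|² − |V|), and that the quality of a partitioning is the average connectivity of its blocks. Both definitions are assumptions made for this illustration, and the edge list is hypothetical, since the edges of the slide's graph are not recoverable here.

```python
def connectivity(nodes, edges):
    """2|E| / (|V|^2 - |V|), counting only edges with both endpoints in `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0  # a single node has no possible edges
    inside = [(u, v) for (u, v) in edges if u in nodes and v in nodes]
    return 2 * len(inside) / (n * n - n)


def quality(blocks, edges):
    """Average connectivity over the blocks of a partitioning (assumed definition)."""
    return sum(connectivity(block, edges) for block in blocks) / len(blocks)


# Hypothetical graph with 10 nodes and 11 edges.
nodes = set(range(1, 11))
edges = [(1, 3), (2, 3), (3, 4), (3, 6), (4, 5), (6, 5),
         (7, 6), (8, 7), (9, 7), (7, 10), (5, 10)]

print(round(connectivity(nodes, edges), 2))
print(quality([{1, 2, 3, 4}, {5, 6, 7}, {8, 9, 10}], edges))
```

Under the assumed definition, any 10-node graph with 11 edges has a connectivity of 22/90, which rounds to 0.24. A partitioning that keeps tightly connected nodes in the same block scores a higher average connectivity and needs fewer crossing edges.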
The End
Questions ?