
Reconfigurable Computing

CS G553

Dr. A. Amalin Prince


BITS - Pilani K K Birla Goa Campus
Department of Electrical and Electronics Engineering

Lecture –30
High-Level Synthesis for Reconfigurable Devices
(Behavioral Synthesis): Spatial and Temporal Partitioning

CS G553 2
High-Level Synthesis

 Fundamental differences in RCS:

1.
o General: binding simply maps operators to resources, and scheduling decides which operator owns a resource at a given time.
o RCS: the architectural resources are created on the reconfigurable device according to the resource types needed by the operators mapped at a given time.
o Uniform resources:
• → Any task can be implemented on a given part of the device (provided the available resources are sufficient).

CS G553 3
General vs. RCS High-Level Synthesis

 Example (the slide shows the corresponding DFG with mul, add, and sub nodes):
x = ((a × b) + (c × d)) + ((c × d) − (e − f))
y = ((c × d) − (e − f)) − ((e − f) + (g − h))

• Assumptions on a "resource-fixed" device:

➢ A multiplication needs 100 basic resource units.
➢ The adder and the subtractor need 50 units each.
➢ Allocation selects one instance of each resource type.
− → Two subtractors cannot be used in the first level.
➢ The adder cannot be used in the first step
− due to a data dependency.
➢ Minimum execution time: 4 steps

CS G553 4
General vs. RCS High-Level Synthesis

Assumptions on a reconfigurable device


o A multiplication needs 100 LUTs.
o Adder/subtractor need 50 LUTs each.
o Total available amount of resources: 200 LUTs.
o The two subtractors can be assigned in the first step.
o Minimum execution time: 3 steps

CS G553 5
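The contrast between the two slides above can be sketched with a small greedy list scheduler. This is only a sketch: the node names, the no-chaining execution model, and the resource counts below are illustrative assumptions, so absolute step counts depend on the model; the point is the relative result, namely that the richer allocation enabled by reconfiguration shortens the schedule.

```python
def schedule(ops, deps, avail):
    """Greedy step-by-step (ASAP list) schedule.
    ops:   {name: resource type}
    deps:  {name: set of predecessor names}
    avail: {resource type: instances usable per step}"""
    done, steps = set(), 0
    while len(done) < len(ops):
        ready = [n for n in ops if n not in done and deps[n] <= done]
        used, fired = {}, []
        for n in ready:
            t = ops[n]
            if used.get(t, 0) < avail[t]:
                used[t] = used.get(t, 0) + 1
                fired.append(n)
        if not fired:
            raise RuntimeError("deadlock: check deps for cycles")
        done |= set(fired)
        steps += 1
    return steps

# DFG of the slide's expressions:
#   x = ((a*b) + (c*d)) + ((c*d) - (e-f))
#   y = ((c*d) - (e-f)) - ((e-f) + (g-h))
ops = {"ab": "mul", "cd": "mul", "ef": "sub", "gh": "sub",
       "s1": "add",  # (a*b) + (c*d)
       "s2": "sub",  # (c*d) - (e-f)
       "s3": "add",  # (e-f) + (g-h)
       "x": "add", "y": "sub"}
deps = {"ab": set(), "cd": set(), "ef": set(), "gh": set(),
        "s1": {"ab", "cd"}, "s2": {"cd", "ef"}, "s3": {"ef", "gh"},
        "x": {"s1", "s2"}, "y": {"s2", "s3"}}

# One instance of each unit (the "resource fixed" allocation):
fixed = schedule(ops, deps, {"mul": 1, "add": 1, "sub": 1})
# A richer allocation, as configurations on the 200-LUT device permit:
rcs = schedule(ops, deps, {"mul": 2, "add": 2, "sub": 2})
print(fixed, rcs)  # the reconfigurable allocation finishes in 3 steps
```

With two subtractors available, both e−f and g−h fire in the first step, exactly the situation the 200-LUT device enables.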
High-Level Synthesis
 Fundamental differences in RCS:
2.
o In general HLS:
• The application is specified as a structure that encapsulates a datapath and a control part.
• The synthesis process allocates resources to operators at different times according to a computed schedule.
• The control part is synthesized.

o In RCS:
• Hardware modules implemented as datapaths compete for execution on the chip.
• A processor controls the selection of hardware modules by means of reconfiguration.
• The same processor also activates the resources in the corresponding hardware accelerators.

Which category does your project fall under?

CS G553 6
Partitioning

CS G553 7
Partitioning - Motivation

 A design is often too big to be implemented on a single FPGA.
 Possible solutions:
o Spatial partitioning: the design is partitioned across many FPGAs. Each partition block is implemented in one single FPGA, and all the FPGAs operate simultaneously.
o Temporal partitioning: the design is partitioned into blocks, each of which is executed on one FPGA at a given time.

CS G553 8
Spatial Partitioning

CS G553 9
Spatial partitioning - Problem

 Partitioning constraints: each FPGA is characterized by:
o its size, i.e., the number of LUTs and FFs available
o its terminals, i.e., the number of I/O pins available on the device
o A partition is valid iff, for every block B produced by the partition:
• S(B) ≤ S(device), where S(X) = size of X
• T(B) ≤ T(device), where T(X) = number of terminals of X

CS G553 10
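The validity test above is mechanical. A minimal sketch, where the sizes and terminal counts are made-up illustrative values:

```python
def is_valid_block(block, device):
    """A block B fits a device iff S(B) <= S(device) and T(B) <= T(device)."""
    return (block["size"] <= device["size"] and
            block["terminals"] <= device["terminals"])

device = {"size": 4000, "terminals": 120}   # e.g. 4000 LUTs, 120 I/O pins
b1 = {"size": 3500, "terminals": 96}        # fits
b2 = {"size": 3500, "terminals": 150}       # too many terminals
print(is_valid_block(b1, device))  # True
print(is_valid_block(b2, device))  # False
```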
Spatial partitioning - Problem

 Objectives: the following objectives are possible:
o Minimize the number of cut nets
o Minimize the number of produced blocks
o Minimize the delay
 A difficult problem, since the constraints are not always compatible.
 Solution approaches:
o Heuristics for automatic partitioning
o Manual intervention

CS G553 11
Spatial Partitioning – Timing – Block Replication

CS G553 17
Temporal Partitioning

CS G553 18
Temporal Partitioning

o Resources on the device are not allocated to a single operator but to a set of operators that must be placed on, and later removed from, the device at the same time.
• The application must therefore be partitioned into sets of operators.
o The partitions are then successively implemented on the device at different times.

Temporal partitioning is challenging. Why?

CS G553 19
CS G553 20
Configuration

 Configuration:
o Given a reconfigurable processing unit H and
o a set of tasks T = {t1, ..., tn} available as cores C = {c1, ..., cn},
o we define the configuration ζi of the RPU at time si to be the set of cores ζi = {ci1, ..., cik} ⊆ C running on H at time si.
 A core (module) ci exists for each ti in the library:
o as a hard, soft, or firm module.

CS G553 21
Schedule

 Schedule:
o a function ς : V → Z+, where ς(vi) denotes the starting time of the node vi that implements task ti.
 Feasible schedule:
o ς is feasible if, ∀ eij = (vi, vj) ∈ E:
ς(tj) ≥ ς(ti) + T(ti) + tij
• eij defines a data dependency between tasks ti and tj,
• tij is the latency of the edge eij,
• T(ti) is the time it takes the node vi to complete execution.

CS G553 22
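The feasibility condition can be checked edge by edge. A sketch with hypothetical tasks t1 and t2 (names and times are assumptions for illustration):

```python
def is_feasible(sigma, T, edges):
    """sigma: task -> start time; T: task -> execution time;
    edges: (ti, tj, latency). Feasible iff, for every edge,
    sigma[tj] >= sigma[ti] + T(ti) + t_ij."""
    return all(sigma[j] >= sigma[i] + T[i] + lat for i, j, lat in edges)

# t1 -> t2 with edge latency 1; t1 runs for 3 time units:
T = {"t1": 3, "t2": 2}
edges = [("t1", "t2", 1)]
print(is_feasible({"t1": 0, "t2": 4}, T, edges))  # True:  4 >= 0 + 3 + 1
print(is_feasible({"t1": 0, "t2": 3}, T, edges))  # False: 3 <  0 + 3 + 1
```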
Ordering Relation

 Ordering relation ≤ among the nodes of G:

o vi ≤ vj ⇔ ∀ schedule ς, ς(vi) ≤ ς(vj).

• ≤ is a partial ordering, as it is not defined for all pairs of nodes in G.

CS G553 23
Partition

 Partition:
o A partition P of the graph G = (V,E) is its division into disjoint subsets P1, ..., Pm such that
∪k=1,…,m Pk = V

 Feasible partition:
o A partition is feasible with respect to a reconfigurable device H with area a(H) and pin count p(H) if:
o ∀ Pk ∈ P: a(Pk) = ∑vi∈Pk ai ≤ a(H)
o ∑eij∈E wij ≤ p(H)
• for crossing edges eij
 Crossing edge:
o an edge that connects a component inside a partition block with a component outside of it.

CS G553 24
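A sketch of this feasibility test, reading the pin constraint as applying to the crossing edges of each block (the node areas, edge weights, and limits below are illustrative assumptions):

```python
def is_feasible_partition(parts, area, edges, a_H, p_H):
    """parts: list of node sets; area: node -> area; edges: (u, v, weight).
    Feasible iff every block fits within the device area a(H) and the
    total weight of each block's crossing edges stays within p(H)."""
    block_of = {v: i for i, blk in enumerate(parts) for v in blk}
    for i, blk in enumerate(parts):
        if sum(area[v] for v in blk) > a_H:
            return False
        # a crossing edge has exactly one endpoint inside block i
        crossing = sum(w for u, v, w in edges
                       if (block_of[u] == i) != (block_of[v] == i))
        if crossing > p_H:
            return False
    return True

area = {"a": 30, "b": 40, "c": 50}
parts = [{"a", "b"}, {"c"}]
edges = [("a", "c", 4), ("b", "c", 3)]       # both edges cross the cut
print(is_feasible_partition(parts, area, edges, a_H=100, p_H=8))  # True
print(is_feasible_partition(parts, area, edges, a_H=100, p_H=6))  # False
```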
Run Time

 Run time r(Pi) of a partition:

o the maximum time from the input of the data to the output of the result.

CS G553 25
Ordering Relation

 Ordering relation for partitions:

o Pi ≤ Pj ⇔ ∀ vi ∈ Pi, ∀ vj ∈ Pj:
• either vi ≤ vj
• or vi and vj are not in relation.

 Ordered partitions:
o A partitioning P is ordered ⇔ an ordering relation ≤ exists on P.

o If P is ordered, then for any pair of partitions, one can always be implemented after the other with respect to any scheduling relation.

CS G553 26
Temporal Partitioning

 Temporal partitioning:
o Given a DFG G = (V,E) and a reconfigurable device H, a temporal
partitioning of G on H is an ordered partitioning P of G with respect
to H.

CS G553 27
Temporal Partitioning

o Cycles are not allowed in the DFG.
• Otherwise, the resulting partition may not be schedulable on the device.

CS G553 28
Temporal partitioning

 Goal:
o Computation and scheduling of a Configuration graph
 A configuration graph is a graph in which:
o Nodes are partitions or bitstreams
o Edges reflect the precedence constraints in a given DFG

Figure: a configuration graph with partitions P1–P5.

CS G553 29
Temporal partitioning

• Formal definition:
➢ Given a DFG G = (V,E) and a temporal partitioning P = {P1, ..., Pn} of G, we define the configuration graph of G relative to P, denoted Γ(G/P) = (P, EP), in which the nodes are the partitions in P. An edge eP = (Pi, Pj) ∈ EP ⇔ ∃ e = (vi, vj) ∈ E with vi ∈ Pi and vj ∈ Pj.
• Configuration:
➢ For a given partitioning P, each node Pi ∈ P has an associated configuration ζi, which is the implementation of Pi on the given device H.

CS G553 30
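The construction of Γ(G/P) follows directly from the definition above. A minimal sketch (node names are illustrative; intra-partition edges are dropped, since only inter-partition precedence matters here):

```python
def configuration_graph(edges, parts):
    """edges: DFG edges (vi, vj); parts: ordered list of partition node sets.
    Returns the edge set of Gamma(G/P): partitions Pi and Pj are connected
    iff some DFG edge runs from a node in Pi to a node in Pj."""
    block_of = {v: i for i, blk in enumerate(parts) for v in blk}
    return {(block_of[u], block_of[v]) for u, v in edges
            if block_of[u] != block_of[v]}

# a -> b, b -> c, a -> c, partitioned as P0 = {a}, P1 = {b, c}:
print(configuration_graph([("a", "b"), ("b", "c"), ("a", "c")],
                          [{"a"}, {"b", "c"}]))  # {(0, 1)}
```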
Temporal partitioning
 Whenever a new partition is downloaded, the partition that was running is destroyed.
o Communication is done through inter-configuration registers (or a communication memory):
• They may sit in main memory.
• They may sit at the boundary of the device to hold the input and output values.
o The configuration sequence is controlled by the host processor.

Figure: an FPGA whose IO registers connect via a bus to a processor and a memory block; the device's registers are mapped into the processor's address space. (Communication memory synthesis?)

CS G553 31
Temporal partitioning

 Steps (for partitions Pi and Pj with Pi ≤ Pj):

1. The configuration for Pi is first downloaded into the device.
2. Pi executes.
3. Pi copies all the data it needs to send to other partitions into the communication memory.
4. The device is reconfigured to implement the partition Pj.
5. Pj accesses the communication memory and collects the data.

CS G553 32
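The five steps above can be simulated with the communication memory as a plain dictionary. This is a toy sketch: the partition behaviors are stand-in functions, not real configurations, and all names are assumptions for illustration.

```python
comm_mem = {}  # inter-configuration registers / communication memory

def run_partition(compute, reads, writes):
    """Pretend-download a configuration, execute it, copy results out."""
    inputs = {k: comm_mem[k] for k in reads}          # step 5: collect inputs
    results = compute(inputs)                          # step 2: execute
    comm_mem.update({k: results[k] for k in writes})   # step 3: copy out

# Pi produces u; the device is then "reconfigured" and Pj consumes u.
comm_mem["x0"] = 7
run_partition(lambda d: {"u": d["x0"] * 2}, reads=["x0"], writes=["u"])
run_partition(lambda d: {"y": d["u"] + 1}, reads=["u"], writes=["y"])
print(comm_mem["y"])  # 15
```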
Temporal partitioning
 Objectives for optimization:
1. Number of interconnections: very important, since minimizing it reduces:
➢ the amount of exchanged data
➢ the amount of memory for temporarily storing the data
2. Number of produced blocks (partitions):
➢ reduces the number of reconfigurations (and hence the total time)
3. Overall computation delay, which depends on:
➢ the partition run times
➢ the processor used for reconfiguration
➢ the speed of data exchange
4. Similarity between consecutive partitions (for partial reconfiguration)
5. Overall amount of wasted resources on the chip:
➢ When components with shorter run times are placed in the same partition as components with longer run times, the shorter-running components remain idle for a long period of time.

CS G553 33
Wasted Resources

 Wasted resource wr(vi) of a node vi:
o the unused area occupied by the node vi during the computation of a partition:
wr(vi) = (t(Pi) − T(ti)) × ai
t(Pi): run time of partition Pi
T(ti): run time of the component vi
ai: area of vi
 Wasted resource wr(Pi) of a partition Pi = {vi1, ..., vin}:
wr(Pi) = ∑j=1,…,n wr(vij)
 Wasted resource of a partitioning P:
wr(P) = ∑j=1,…,k wr(Pj)

CS G553 34
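These three formulas compose directly. A sketch, taking t(Pi) as the longest node run time in the partition (an assumption; the actual partition run time could also include communication overhead), with illustrative node values:

```python
def wasted(partitions, t_node, area):
    """partitions: list of node lists; t_node: node -> run time;
    area: node -> area. Implements wr(v) = (t(P) - T(v)) * a(v),
    summed per partition and over the whole partitioning."""
    total = 0
    for P in partitions:
        t_P = max(t_node[v] for v in P)             # partition run time
        total += sum((t_P - t_node[v]) * area[v] for v in P)
    return total

# Two nodes share a partition: the faster one (v2) idles for 3 time units,
# wasting 3 * 40 = 120 area-time units.
print(wasted([["v1", "v2"]], {"v1": 5, "v2": 2}, {"v1": 100, "v2": 40}))  # 120
```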
Communication Overhead

Communication cost is modelled via graph connectivity.

Connectivity of a graph G = (V,E):
con(G) = 2·|E| / (|V|² − |V|)
o |V|² − |V| is the number of ordered node pairs, so con(G) is the fraction of possible edges that are present.

Figure: a 10-node example graph with connectivity = 0.24

CS G553 35
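The connectivity formula is a one-liner. A sketch with a complete 3-node graph as the example (the slide's 10-node graph is not reproduced here):

```python
def connectivity(n_nodes, edges):
    """con(G) = 2|E| / (|V|^2 - |V|): fraction of possible edges present."""
    return 2 * len(edges) / (n_nodes * n_nodes - n_nodes)

# A triangle contains all 3 possible edges, so its connectivity is 1.0:
print(connectivity(3, [(1, 2), (2, 3), (1, 3)]))  # 1.0
```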
Communication Overhead

Quality of a partitioning P = {P1, ..., Pn}:

o Average connectivity over P:
Q(P) = (1/n) ∑i=1,…,n con(Pi)

o High quality means the algorithm performs well; low quality means it performs poorly.

Figure: the 10-node graph (connectivity = 0.24) and two partitionings of it, with quality 0.25 and 0.45.

CS G553 36
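Q(P) averages the per-block connectivities. A sketch, with each block given as a (node count, edge list) pair; the two small blocks below are illustrative, not the slide's example:

```python
def quality(partitions):
    """Q(P): average connectivity over the blocks of P.
    partitions: list of (n_nodes, edge_list) pairs, one per block."""
    cons = [2 * len(e) / (n * n - n) for n, e in partitions]
    return sum(cons) / len(cons)

# Two blocks: a triangle (con = 1.0) and a 3-node path (con = 2/3):
print(quality([(3, [(1, 2), (2, 3), (1, 3)]),
               (3, [(4, 5), (5, 6)])]))  # (1.0 + 2/3) / 2
```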
Communication Overhead

• Communication overhead is minimized by minimizing the weighted sum of crossing edges among the partitions. This
− → minimizes the size of the communication memory and
− → minimizes the communication time.
• Heuristic:
➢ Place highly connected components in the same partition (high-quality partitioning).

CS G553 37
The End

 Questions ?

 Thank you for your attention

CS G553 38
