
RapidWright Tutorials (3) .Odt

The document discusses pre-implemented design flows in RapidWright including high performance and rapid prototyping flows. It explains that high performance flows reuse pre-implemented modules from a cache to optimize designs while meeting timing constraints, and rapid prototyping flows automatically stitch, place, and route blocks for faster implementation. It also provides an example of using the RWRouter timing-driven router to route an unrouted design and analyzing the output routing statistics.

Uploaded by

Srijeet Guha
Copyright
© All Rights Reserved

RapidWright Tutorials

Pre-implemented Design Flows


Reference: https://www.rapidwright.io/docs/PreImplemented_Module_Flow.html
Note: Most of the classes related to pre-implemented design flows are in the ipi package.
To source Vivado: source /home/guha/Thesis/Vivado/Vivado/2022.2/settings64.sh
Note: To run Java commands that need the compiled binaries, use: java -cp bin:jars/* <filename without src>

A design is usually implemented in one of two flows:

• High performance: pre-existing libraries and optimised solutions are reused to build faster, better-optimised designs (requires user intervention).

• Rapid prototyping: the design is implemented as it is, which saves implementation time (done automatically by the flow).

The entire pre-implemented design flow (high performance and rapid prototyping) can be understood from this diagram:
[Figure: pre-implemented design flow diagram]
To use RapidWright with Vivado, we first need to source the RapidWright libraries in the Vivado Tcl command line. This is done by running source ${::env(RAPIDWRIGHT_PATH)}/tcl/rapidwright.tcl in the Vivado Tcl interpreter. Once the design is ready, it is first pre-implemented out-of-context. This generates the IPI Design mentioned in the above diagram. The flow goes as follows:

• First the entire design is implemented on the Physical netlist. This is basically the ISI design of
the netlist [This is not the real implementation as it does not have information about the
physical netlist]

• Sometimes the entire design is forcibly constrained within a pblock (which may span multiple tiles or even FSRs). This prevents Vivado from spreading the design out and keeps it within a particular region, so that the rest of the FPGA can be used for other designs.

• Each block is also stored separately in the Vivado IP Cache so that the blocks can be reused later if required. RapidWright stores the blocks in a similar cache, also called the IP Cache, which builds upon the Vivado one.

• After all the blocks have been implemented out of context, the block stitcher is invoked in RapidWright to connect the blocks together into the entire design. This is done by the BlockStitcher class in the ipi package of RW.

• To use the blocks already present in the above-mentioned cache, we can use the BlockCreator.java class in the ipi package.

• After the stitching is over, we go on to block placement and routing on the physical netlist.

The implementation (either high-performance or rapid prototyping) can be started using the rapid_compile_ipi command.

High Performance Flow


RapidWright stores the modules, which allows designers to develop high-performance designs by restructuring the pre-implemented designs and optimising them. A high-performance design is created by:

• Restructuring the implemented design using pre-implemented modules stored in the cache. This lets us use high-quality designs, thus reducing latency. We also check that the latency of the design is within the required limits. (IPI Design Parser)

• Implementing the design with the new modules (mapping, placing and routing). This is done by the user, who defines a custom implementation process in an implementation guide file. This gives the user custom placement methods. (Block Placer)

• Using the automated flow to stitch, place and route the custom blocks and modules. (Block Stitcher; refer to the above diagram.)

Implementation Guide File: This can be written by the user, passed to RW and used to change mapping, stitching and routing strategies. The file has a .igf extension. The format of the igf file is:
PART <part_name>
BLOCK <ip_cache_id> <# of implementations> <# of instances in the design> <# of clocks used in this block>
  IMPL <implementation index> [# of sub implementation entries] <Pblock range>
    [SUB_IMPL <sub implementation index> '<Tcl command returning a subset of cells in the module>' <pblock range>]
    ...
  ...
  INST <instance name> <implementation index to apply> <lower left corner site to place implementation on fabric>
  ...
  CLOCK <clock name> <clock period constraint (ns)> <BUFGCE site (to use for skew estimation)>
  ...
END_BLOCK
...
END_BLOCKS
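
As a minimal sketch of how this layout can be read, the following Python snippet parses igf-style text into dictionaries. It only illustrates the format above and is not RapidWright's ImplGuide parser: the function parse_igf and every value in SAMPLE (block id, instance names, sites, pblock ranges) are made up for illustration, and the counts on the BLOCK line are simply ignored.

```python
import shlex


def parse_igf(text):
    """Parse igf-style text (layout as shown above) into plain dictionaries."""
    guide = {"part": None, "blocks": []}
    block = None
    for raw in text.splitlines():
        # shlex keeps the single-quoted Tcl command of SUB_IMPL as one token.
        tokens = shlex.split(raw)
        if not tokens:
            continue
        key, args = tokens[0], tokens[1:]
        if key == "PART":
            guide["part"] = args[0]
        elif key == "BLOCK":
            # The three counts after the cache id are not needed for this sketch.
            block = {"cache_id": args[0], "impls": [], "insts": [], "clocks": []}
        elif key == "IMPL":
            # The sub-implementation count is optional: assume it is present
            # only when the second field is a plain integer.
            has_count = len(args) > 1 and args[1].isdigit()
            block["impls"].append({
                "index": int(args[0]),
                "pblock": " ".join(args[2:] if has_count else args[1:]),
                "sub_impls": [],
            })
        elif key == "SUB_IMPL":
            block["impls"][-1]["sub_impls"].append({
                "index": int(args[0]),
                "tcl": args[1],
                "pblock": " ".join(args[2:]),
            })
        elif key == "INST":
            block["insts"].append(
                {"name": args[0], "impl": int(args[1]), "site": args[2]})
        elif key == "CLOCK":
            block["clocks"].append(
                {"name": args[0], "period_ns": float(args[1]), "bufgce": args[2]})
        elif key == "END_BLOCK":
            guide["blocks"].append(block)
            block = None
        elif key == "END_BLOCKS":
            break
    return guide


# Hypothetical file contents, for illustration only.
SAMPLE = """\
PART xcvu3p-ffvc1517-2-e
BLOCK example_block_id 1 2 1
  IMPL 0 1 SLICE_X0Y0:SLICE_X3Y4
    SUB_IMPL 0 '[get_cells sub/*]' SLICE_X0Y0:SLICE_X1Y4
  INST inst_a 0 SLICE_X10Y60
  INST inst_b 0 SLICE_X10Y120
  CLOCK clk 2.5 BUFGCE_X0Y2
END_BLOCK
END_BLOCKS
"""

guide = parse_igf(SAMPLE)
print(guide["part"], len(guide["blocks"]), "block(s)")
```

Inside RapidWright itself the file should be read through the ImplGuide functions named in this section rather than with a hand-written parser.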

The functions to import and export an igf file for implementation are:

com.xilinx.rapidwright.design.blocks.ImplGuide.readImplGuide(String fileName) and
com.xilinx.rapidwright.design.blocks.ImplGuide.writeImplGuide(String fileName)

The implementation guide file must contain the following parts:
Block: Each entry in the Vivado IP cache is a block. The cache holds multiple pre-saved blocks which can be used to optimise our designs.
Impl and Sub-impl: The cache also stores the implementations and sub-implementations of each block, so that they can be used directly when the block is in use. A sub-implementation is an implementation of a sub-part of the design.
Inst: Each block has uniquely named instances which can be used in our design. Each instance needs to be stitched, placed and routed together with the other instances of the design.
Clock: This describes a clock input of the block (clock name, period constraint and BUFGCE site).

Rapid Prototyping Flow


If the implementation guide file is not present, the implementation automatically falls back to rapid prototyping. The steps for rapid prototyping are as follows:

• Automatic Pblock Generator: First, pblocks are generated from the blocks. The design blocks package has the classes responsible for this.

• Block Placer: The implemented blocks are then placed on the FPGA fabric. This is done by the class BlockPlacer2 of the placer.blockplacer package.

• Router: Finally, the router routes the design, using the route class.

Tutorial 1: RWRoute Timing-driven Routing


Reference: https://www.rapidwright.io/docs/RWRoute_timing_driven_routing.html

This tutorial provides a dcp file which has been mapped (placed) on the FPGA, which means it has the physical and logical netlist and constraints, but no routing information. We need to use RWRoute to route the design in timing-driven mode, produce the resulting dcp file with routing information, and confirm the result using Vivado.

• To download the unrouted design: wget http://www.rapidwright.io/docs/_downloads/gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp

• To route the design using RWRoute (in the default timing-driven mode): ./gradlew run --args="RWRoute gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp gnl_2_4_7_3.0_gnl_3500_03_7_80_80_routed.dcp"

• Note that the RapidWright documentation (routing section) uses a different command. We cannot use it here since the code has not been compiled yet; we would first have to build the classes with javac and then run them.

• Example output can be seen in the reference.

• The main routing code is in src/com/xilinx/rapidwright/rwroute/RWRoute.java.

Analysis of the Output


The output has three main parts.
The first part, RWRoute, gives the statistics of the RWRoute run. This includes the time required to load the dcp file, that is, to read and parse the EDIF logical netlist and its routing, and the mapped physical netlist.

For this we need to understand how routing is done.


Reference: https://sites.lafayette.edu/cadapps/main-page/pathfinder-fpga-routing-algorithm/
Reference: https://www.youtube.com/watch?v=_0T7JOIftc0

Here, a graph of nodes is built first. Each node represents either a pin/port that needs to be connected (a source or a sink) or a routing resource (routing cells, switch boxes, crossbars etc.). The router needs to find a way to connect the required sources and sinks using the routing resources.
Sometimes a source and a sink can be connected directly without any routing resources in between; this is called a direct connection. Sometimes the connection between a source and a sink has to go through one or more routing resources; this is called an indirect connection.

The most common routing algorithm is the PathFinder routing algorithm, which assigns a cost to each node and then checks whether a particular routing node is overused. If it is, the cost of that node is raised, so that in the next routing iteration some of the congestion moves from that node to other nodes.
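
The negotiation idea can be sketched in a few lines of Python. This is a simplified illustration of PathFinder-style negotiated congestion, not the RWRoute implementation: the graph, the net names, the node capacity of one and the cost constants are all assumptions chosen to keep the example small.

```python
import heapq
from collections import Counter

CAPACITY = 1  # assumption: every node can carry exactly one net


def cheapest_path(graph, src, dst, node_cost):
    """Dijkstra search; a path costs the sum of the costs of the nodes it enters."""
    heap = [(0.0, src, [src])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for nxt in graph.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (cost + node_cost(nxt), nxt, path + [nxt]))
    return None


def pathfinder(graph, nets, max_iters=20, pres_fac=0.5, hist_step=0.5):
    """Re-route all nets each iteration until no node is shared."""
    history = Counter()  # accumulated congestion cost per node
    for iteration in range(1, max_iters + 1):
        occupancy = Counter()
        routes = {}
        for name, (src, dst) in nets.items():
            def node_cost(n):
                # Present congestion: how far over capacity n would be if used.
                overuse = max(0, occupancy[n] + 1 - CAPACITY)
                return (1.0 + history[n]) * (1.0 + pres_fac * overuse)

            path = cheapest_path(graph, src, dst, node_cost)
            if path is None:
                raise RuntimeError("net %s is unroutable" % name)
            routes[name] = path
            occupancy.update(path)
        overused = [n for n, used in occupancy.items() if used > CAPACITY]
        if not overused:
            return routes, iteration
        for n in overused:
            history[n] += hist_step * (occupancy[n] - CAPACITY)
        pres_fac *= 2  # sharing gets more expensive every iteration
    raise RuntimeError("no legal routing found")


# Two nets whose shortest paths both go through the shared node "m".
graph = {
    "s1": ["m", "a"], "s2": ["m", "b"],
    "m": ["t1", "t2"],
    "a": ["a2"], "a2": ["t1"],
    "b": ["b2"], "b2": ["t2"],
}
nets = {"n1": ("s1", "t1"), "n2": ("s2", "t2")}

routes, iterations = pathfinder(graph, nets)
print(iterations, routes)
```

In the first iteration both nets use m; the history cost added to m then pushes the second net onto its longer detour in the second iteration, which is the congestion removal described above.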

The second part, Route Design and Statistics, gives statistics about the netlist that has been routed. This includes:
• The number of pins/ports that need to be connected (here it is 1763 nodes).
• RRG (Routing Resource Graph): a graph whose nodes are the pins and ports that need to be connected according to the design, together with the switches, routing cells and other elements of the FPGA fabric that help route them.
• Generated RRG Nodes: the nodes in the RRG that still need to be routed.
• Routed Connections: the number of connections made in that iteration.
• Nodes with Overlaps: routing nodes where the connections have some degree of congestion.
• Other statistics such as the total wirelength and the time it took to route the design.

The third part, the Timing Report section, gives timing statistics of the routed design. This section includes:
• Critical path delay, slack and timing guarantees.
• Exact delays between every routing wire in the net and the resource routed. This includes the delays occurring in the logic, in the net and in intra-site connections. It also lists the netlist resources in that particular net.
• The time it took to write the final output dcp file, including the EDIF file, physical netlist, constraints, routing information etc.

Analysis of the Code


The code can be found in src/com/xilinx/rapidwright/rwroute/RWRoute.java. We can start from the main function.
• The most important function to consider is route().
• createRuntimeTracker() helps measure the time taken by each process.
• routeGlobalClkNets(): Routes the global clock nets in the design to provide the clock to all clock regions. This includes assigning a root in the clock region and building the routing and distribution interconnects in the clock region.
• routeStaticNets(): Routes the static nets such as VCC and GND. Note that these connections do not change with the routing algorithm, and there is no scope for reducing congestion in this routing.
• routeIndirectConnections(): Builds the indirect connections using the routing algorithm mentioned above.
• routeDirectConnections(): Routes the direct connections.
• postRouteProcess(): Finishes the routing process.
• setPIPNets(): Sets the PIPs of the design.

Analysis of Vivado Output


After running the following command in the Vivado Tcl window: open_checkpoint /home/guha/Thesis/DesignCheckpoints/gnl_2_4_7_3.0_gnl_3500_03_7_80_80_routed.dcp
we can see the placed and routed design on the FPGA. We can also see the routing paths, the FSRs, tiles, sites and BELs. The IPI design (logical netlist) as well as the physical netlist can be seen in Vivado.

Also notice that the design is constrained within a pblock. This prevents the design from spreading throughout the FPGA fabric and keeps it constrained within a region.

To get information about the routing of the design, we can use the command report_route_status in the Tcl window of Vivado. This dumps the Vivado routing report:
Design Route Status
                                               :      # nets :
   ------------------------------------------- : ----------- :
   # of logical nets.......................... :        4937 :
       # of nets not needing routing.......... :        1082 :
           # of internally routed nets........ :         932 :
           # of implicitly routed ports....... :         150 :
       # of routable nets..................... :        3855 :
           # of fully routed nets............. :        3855 :
       # of nets with routing errors.......... :           0 :
   ------------------------------------------- : ----------- :
This confirms the routing via RWRoute.
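
The same conclusion can be checked mechanically. A small Python sketch, assuming the report text has the layout shown above (parse_route_status is a hypothetical helper written for this note, not a Vivado or RapidWright function):

```python
import re

# The report lines as printed above.
REPORT = """\
# of logical nets.......................... : 4937 :
# of nets not needing routing.......... : 1082 :
# of internally routed nets........ : 932 :
# of implicitly routed ports....... : 150 :
# of routable nets..................... : 3855 :
# of fully routed nets............. : 3855 :
# of nets with routing errors.......... : 0 :
"""


def parse_route_status(text):
    """Return {label: count} for every '# of <label>.... : <count> :' line."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*# of (.+?)\.+\s*:\s*(\d+)\s*:", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_route_status(REPORT)

# The design is completely routed when every routable net is fully routed
# and no net has routing errors.
fully_routed = (counts["routable nets"] == counts["fully routed nets"]
                and counts["nets with routing errors"] == 0)
print("fully routed:", fully_routed)
```

The numbers are also consistent with each other: 932 + 150 = 1082 nets need no routing, and 1082 + 3855 = 4937 logical nets in total.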

To get the timing report of the routed design (which includes the critical path and its delay), we can use the command report_timing in the Tcl console. This shows the critical path, its delay, and the delay of each route and LUT along it.

Tutorial 2: RWRoute Wirelength-driven Routing


This tutorial routes pre-mapped designs using the wirelength-driven method. Here the routing algorithm tries to reduce the wirelength used for routing, which may in turn increase the critical path delay.

After downloading the pre-mapped design, we can start routing with RWRoute using the command:
./gradlew run --args="RWRoute gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp gnl_2_4_7_3.0_gnl_3500_03_7_80_80_routed.dcp --nonTimingDriven"
Note that most of the command is the same as before; we only add the --nonTimingDriven argument at the end.

Analysis of the Output


The output of the routing is very similar to the previous one. Some important differences:
• The wirelength needed in this case is less than in timing-driven mode.
• The critical path delay is not reported, as the routing was not done with timing in mind.

Analysis of the Code


The flow of execution is mostly the same. However, since the router object knows the type of routing to be done, the routing is wirelength-driven.

Analysis of Vivado Output


This is also the same as in the above example, and we can run the above commands to get the outputs.

Tutorial 3: RWRoute Partial Routing


This tutorial shows how to handle a dcp file that had been routed before and in which the placement of some BELs was then changed, so that the design needs to be rerouted. One way is to rip up the entire routing and re-route it, but that would waste resources; instead we can use partial routing.
The partially-routed dcp file can be downloaded using the command: wget http://www.rapidwright.io/docs/_downloads/picoblaze_partial.dcp

We can start the partial routing using the following command: ./gradlew run --args="RWRoute picoblaze_partial.dcp picoblaze_partial_routed.dcp --partialRouting --nonTimingDriven"

Note that the routing command above has two key features: it performs partial routing, and it performs non-timing-driven routing.

Error Encountered
The above execution led to an error. The error mentioned is:
Exception in thread "main" java.lang.RuntimeException: ERROR: No mapped LCB to SitePinInst IN RAMB36_X10Y25.CLKAU
at com.xilinx.rapidwright.rwroute.GlobalSignalRouting.getLCBPinMappings(GlobalSignalRouting.java:257)
at com.xilinx.rapidwright.rwroute.GlobalSignalRouting.symmetricClkRouting(GlobalSignalRouting.java:200)
at com.xilinx.rapidwright.rwroute.RWRoute.routeGlobalClkNets(RWRoute.java:354)
at com.xilinx.rapidwright.rwroute.RWRoute.route(RWRoute.java:584)
at com.xilinx.rapidwright.rwroute.RWRoute.routeDesign(RWRoute.java:1756)
at com.xilinx.rapidwright.rwroute.RWRoute.routeDesign(RWRoute.java:1741)
at com.xilinx.rapidwright.rwroute.RWRoute.routeDesignWithUserDefinedArguments(RWRoute.java:1732)
at com.xilinx.rapidwright.rwroute.RWRoute.main(RWRoute.java:1780)
at com.xilinx.rapidwright.MainEntrypoint.main(MainEntrypoint.java:209)

The error occurs while routing the clock of the design. Specifically, no leaf clock buffer (LCB) is mapped to the site pin RAMB36_X10Y25.CLKAU, which belongs to a block RAM site. We need to check whether the same error also occurs with partial routing in Vivado.

Analysis of Vivado Output


However, we can still open the dcp design in Vivado, where we can see the schematic as well as the physical design.

The design that has been loaded is only partially routed. Vivado shows in green the nets that have been routed. It shows in red the nets that have not been routed but that it expects to be routed according to the netlist in the EDIF file (which is also part of the dcp file). To highlight the unrouted nets, we can use the command: highlight_objects -color red [get_nets * -filter {ROUTE_STATUS == UNROUTED}]
This highlights the unrouted nets in red.

To get information about the routing of the design, we can use the command report_route_status in the Tcl window of Vivado. This dumps the Vivado routing report.

Tutorial 4: RapidWright Report Timing


Reference: https://docs.xilinx.com/r/en-US/ug906-vivado-design-analysis/Report-Timing
Reference: https://www.rapidwright.io/docs/ReportTimingExample.html

Report timing allows RW or Vivado to report the timing delay of the entire design (critical path delay) or the delay of a particular path. The Xilinx page referenced above explains how to select the start points, the end points and the path, from which Vivado calculates the delay of the path. The start points, end points and path can be selected at any hierarchical level, be it leaf cells, sites, tiles or even clock regions. RW can similarly report delays (in this tutorial only the critical path delay is reported).
The attached paper also describes a novel method for calculating the timing of a particular path/design in a way that is less memory- and computation-intensive.

This allows us to check if the design meets timing constraints or not.

To download the routed dcp file, we can use the command: wget http://www.rapidwright.io/docs/_downloads/microblaze4.dcp

Command to generate the critical path timing report in RW: ./gradlew run --args="ReportTimingExample microblaze4.dcp"
Note that timing is reported via the ReportTimingExample class.

Analysis of Code
The code for report timing can be found in src/com/xilinx/rapidwright/examples/ReportTimingExample.java.
Note that this code is not part of the RapidWright core; it is an example script that reads a design that has been placed and routed. The script does the following:
• Builds the timing graph of the design.
• Finds the critical path in the timing graph.
• Finds the delay of that path.

Most of the functions called are for measuring the path delays. The most important functions are:
• Design.readCheckpoint(): Reads the dcp file given in the args and returns the corresponding design object.
• TimingManager(design): Class constructor responsible for extracting the timing graph, nets and device of the design.
• getTimingGraph(): Returns the timing graph of the design created in the TimingManager constructor.
• getMaxDelayPath(): Finds and returns the most critical path of the graph [return datatype: GraphPath].
• getWeight(): Returns the delay of the most critical path.
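
What "finding the most critical path" means can be illustrated with a small, library-independent Python sketch: the critical path is the longest-delay path through an acyclic timing graph. The node names and delays (in ps) below are invented for illustration, and max_delay_path is a hypothetical helper, not the RapidWright method of the similar name.

```python
def max_delay_path(edges):
    """edges maps (driver, load) -> delay in ps; the graph must be acyclic.

    Returns (total_delay, [nodes along the slowest path]).
    """
    preds = {}
    for (u, v), delay in edges.items():
        preds.setdefault(u, [])
        preds.setdefault(v, []).append((u, delay))

    best = {}  # node -> (worst arrival time, predecessor on that path)

    def arrival(node):
        # Worst-case arrival time at a node is the maximum over its fan-in.
        if node not in best:
            candidates = [(arrival(p)[0] + d, p) for p, d in preds[node]]
            best[node] = max(candidates, key=lambda c: c[0], default=(0, None))
        return best[node]

    end = max(preds, key=lambda n: arrival(n)[0])
    total = best[end][0]
    path = []
    node = end
    while node is not None:  # walk back from the endpoint to the start point
        path.append(node)
        node = best[node][1]
    return total, path[::-1]


# Hypothetical timing graph: one launching flip-flop, two LUTs, two capturing pins.
edges = {
    ("FF_a/Q", "LUT1"): 120,
    ("LUT1", "LUT2"): 210,
    ("FF_a/Q", "LUT2"): 150,
    ("LUT2", "FF_b/D"): 95,
    ("LUT1", "FF_c/D"): 60,
}

delay, path = max_delay_path(edges)
print(delay, "ps:", " -> ".join(path))
```

Here the slowest path is FF_a/Q, LUT1, LUT2, FF_b/D with 120 + 210 + 95 = 425 ps, which plays the role of the value that getWeight() reports for the real design.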

Analysis of Output
The output mostly contains the timing report of the design, essentially the critical path. It also includes the time it took to read the dcp design and other runtimes. The critical path is reported as the part of the timing graph that has the largest delay.

Analysis of Vivado Output


We can again use the report_timing command as above to get the timing report, the critical path delay and the critical path itself.

Tutorial 5: Create Placed and Routed DCP to Cross SLR


Reference: https://docs.xilinx.com/r/2021.2-English/ug949-vivado-design-methodology/SSI-Pinout-Considerations (for SLRs)
Reference: https://docs.xilinx.com/r/2021.2-English/ug949-vivado-design-methodology/XPIO-PL-Interface-Techniques-for-Timing (for Laguna sites)
Reference: https://www.rapidwright.io/docs/SLR_Crosser_DCP_Creator_Tutorial.html
In Stacked Silicon Interconnect (SSI) technology, SLRs (Super Logic Regions) are device slices stacked one above the other in the same device. Each SLR has a similar hierarchy of resources such as clock regions, tiles, sites and BELs. The device also has a master SLR which holds the configuration logic, programming logic etc. Super Long Lines (SLLs) connect different SLRs and allow communication between them. All signal transfers between SLRs happen through these SLLs.

Since these SLLs are slow in carrying information, the logic is usually divided such that minimum communication is required between the SLRs. This means that BRAMs and DSPs should not be shared among SLRs. Special attention is also given to preventing data flow from crossing SLR boundaries multiple times.
To speed up communication between the SLRs, the hardware provides TX-RX registers at the SLR boundaries (in the Laguna sites). These are called Laguna registers. The TX Laguna registers of a particular SLR drive the RX Laguna registers of another SLR, so any change made in the TX register is immediately reflected in the RX register in the other SLR. This facilitates communication between SLRs. The registers have to be placed in the Laguna sites explicitly. This increases the speed and consistency of communication between the regions.

The architecture of Laguna registers and SLLs is as follows. Say SLR0 needs to send information to SLR2. Then SLR0 will have Laguna TX registers and SLR2 will have Laguna RX registers, which are driven by the TX registers. In these register pairs, each flip-flop pair is connected via a dedicated SLL.
Though this architecture makes things faster, there are still delays and inefficiencies. The example application in RW, com.xilinx.rapidwright.examples.SLRCrosserGenerator, addresses this problem.

The application places and routes a design that communicates between SLRs using Laguna flip-flop pairs and SLLs.

Commands
To learn more about the options of SLRCrosserGenerator, we can run the command: java -cp bin:jars/* com.xilinx.rapidwright.examples.SLRCrosserGenerator -h
This command prints the help information and the available options.

We can run the application using: java -cp bin:jars/* com.xilinx.rapidwright.examples.SLRCrosserGenerator
This generates the dcp file of a design that communicates between SLR0 (FSR: X0Y5) and SLR1 (FSR: X0Y4). The design places Laguna TX and RX registers in the Laguna sites of these FSRs, and the flip-flop pairs are connected via Super Long Lines (SLLs). The generated dcp file is placed in the RapidWright directory itself under the name slr_crosser.dcp. This dcp file can be exported to Vivado, opened and analysed. Along with the dcp file, the application also reports the time it took to generate the dcp file.

Vivado Analysis
To analyse the design in Vivado, we must first export it to the other server and then run: vivado slr_crosser.dcp. This opens the physical netlist of the design. The following are the observations:
• The Laguna sites are used to place flip-flops.
• The communication between the SLRs is bidirectional.
• All the Laguna sites are located in the LAG tiles.
• Every Laguna site essentially has a pair of flip-flops.
• We have placed two lines of Laguna tiles, since one line acts as TX and the other as RX (it is a bidirectional transmission line).

Tutorial 7: RapidWright PipelineGenerator Example


Reference: https://community.arm.com/support-forums/f/architectures-and-processors-forum/6630/can-anyone-tell-me-the-difference-between-pipelined-bus-and-depipelined-bus-and-i-have-uploaded-two-screen-shot-of-both-so-what-does-that-arrow-from-mclk-to-a-31-0-indicates
Reference: https://www.rapidwright.io/docs/PipelineGeneratorExample.html
This example generates a dcp file containing an implemented (placed and routed) design of a pipelined bus. In a pipelined bus, the address lines do not carry the address of the data currently on the data lines: the address and the data arrive on their respective lines at different times. This gives the storage device time to decode the address and store/retrieve the data.

As seen in the later tutorials, each tile has a few LUTs, a few flip-flops and a carry chain. This design uses the flip-flops in the tiles to build a pipelined bus. The application can be found in RapidWright/com/xilinx/rapidwright/examples/PipelineGenerator.java.

The commands for the dcp file generation are:


• java -cp bin:jars/* com.xilinx.rapidwright.examples.PipelineGenerator: This generates the dcp file and prints the time taken by the various stages of the generation.
• We can also add more options to the above command. The options are:
==============================================================================
== Pipeline Generator ==
==============================================================================
This RapidWright program creates an example pipelined bus as a placed and routed DCP.
See the RapidWright documentation for more information.

Option  Description
------  -----------
-?, -h  Print Help
-c      [String: Clk net name] (default: clk)
-d      [String: Design Name] (default: pipeline)
-l      [Integer: distance] (default: 10)
-m      [Integer: depth] (default: 3)
-n      [Integer: width] (default: 10)
-o      [String: Output DCP File Name] (default: pipeline.dcp)
-p      [String: Ultrascale/UltraScale+ Part Name] (default: xcvu3p-ffvc1517-2-e)
-s      [String: Lower left slice to be used for pipeline] (default: SLICE_X42Y70)
-v      [Boolean: Print verbose output] (default: true)
-x      [Double: Clk period constraint (ns)] (default: 1.291)

• The above command generates pipeline.dcp, which can be opened in Vivado to see the design.

Tutorial 8: RapidWright PipelineGeneratorWithRouting Example


This tutorial uses the report timing model of RapidWright to give timing details of the routing of the pipelined bus design generated in the above tutorial. The application is located in src/com/xilinx/rapidwright/examples/PipelineGeneratorWithRouting.java. It not only generates a routed pipelined bus in the form of a dcp file which can be opened in Vivado, but also generates the timing report of the same design using the report timing model mentioned in Tutorial 4.

To run the application we can run: java -cp bin:jars/* com.xilinx.rapidwright.examples.PipelineGeneratorWithRouting
This raises an exception:
==============================================================================
== PipelineGeneratorWithRouting ==
==============================================================================
Init: 1.379s
DistanceX:4
DistanceY:16

Exception in thread "main" java.lang.NullPointerException
at com.xilinx.rapidwright.timing.TimingGroup.<init>(TimingGroup.java:96)
at com.xilinx.rapidwright.examples.PipelineGeneratorWithRouting.findRoute(PipelineGeneratorWithRouting.java:422)
at com.xilinx.rapidwright.examples.PipelineGeneratorWithRouting.createPipeline(PipelineGeneratorWithRouting.java:395)
at com.xilinx.rapidwright.examples.PipelineGeneratorWithRouting.main(PipelineGeneratorWithRouting.java:885)

Tutorial 9: Pre-implemented Modules - Part I


Sometimes it is better to place a particular circuit design on the FPGA manually in order to optimise the circuit in terms of area. This also involves reusing pre-existing optimised circuitry from other designs in our own designs. If left to Vivado, the optimisations may not be effective.

This tutorial teaches us how to place a design in a manual, customised fashion and then reuse it in our own circuit later.

Here, we first take the design of a lightweight 8-bit processor (picoblaze) and constrain it in a pblock. We then make the design reusable. For this we need to open the given picoblaze_synth.dcp design in Vivado. This design only has the logical netlist, produced out of context (OOC). Therefore, we can see the IPI Design but not its mapping/routing on the FPGA fabric.

Design Utilization Analysis


To be done in Vivado
The resources utilised by the design can be seen from Reports->Report Utilization. This gives the exact number of LUTs, carry chains, clocks, BRAMs and flip-flops (also called registers here) and the percentage utilisation of these resources. It also shows how much of the FPGA resources each module of the design consumes.

As seen in the utilisation report, the resources used are:


• LUTs: 115
• Flipflops: 117
• BRAMs: 1
• Carry Blocks: 8
Given the metrics of each tile,
• Each tile has 1 site
• Each site has 8 LUTs, 1 carry chain and 16 flip-flops.
• There is 1 common BRAM for every 5 tiles in a column.

The LUTs are the limiting resource: 115 LUTs at 8 LUTs per site need 15 sites, so we know that a minimum of 15 tiles is required to fit the entire design. Also, one BRAM is 5 tiles long.
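
The estimate of 15 tiles can be reproduced with a short calculation. The sketch below assumes that each site provides 8 LUTs, 16 flip-flops and one carry chain, and it ignores the BRAM, which occupies its own column:

```python
from math import ceil

# Resources used by the design, from the utilisation report above.
used = {"LUT": 115, "FF": 117, "CARRY": 8}

# Assumed capacity of one site (one site per tile).
per_site = {"LUT": 8, "FF": 16, "CARRY": 1}

# Sites needed for each resource type, rounded up to whole sites.
sites_needed = {res: ceil(used[res] / per_site[res]) for res in used}

# The most demanding resource sets the lower bound on the pblock size.
min_sites = max(sites_needed.values())
print(sites_needed, "->", min_sites, "sites minimum")
```

This is only a lower bound: it assumes that every LUT and flip-flop of every site can actually be used, which the placer rarely achieves.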

To create a pblock and constrain the entire design to it, we can run (in the Vivado Tcl window):
• create_pblock pblock_1: This creates a pblock named pblock_1. It has not been placed yet.

• resize_pblock pblock_1 -add {SLICE_X27Y60:SLICE_X29Y64 RAMB18_X2Y24:RAMB18_X2Y25 RAMB36_X2Y12:RAMB36_X2Y12}: This places the pblock on the FPGA fabric. Analysis of the pblock placed on the FPGA gives us this configuration:

Some points regarding the above pblock:


◦ The pink outline is the pblock.
◦ The tiles in the pblock run from CLEM_X18Y60 (bottom-left) to CLEL_R_X19Y64 (top-right). This includes all the tiles within that rectangle.
◦ Each tile has one site. The site in tile CLEM_X18Y60 is SLICE_X27Y60 and the site in tile CLEL_R_X19Y64 is SLICE_X29Y64. These correspond to the sites given in the above command.
◦ The pblock contains 3 columns of sites/tiles, each consisting of 5 similar tiles.
◦ The pblock also includes tile BRAM_X19Y60. This is a BRAM tile with 3 BRAM sites in it. The tile is as long as 5 conventional tiles. It includes the sites RAMB36_X2Y12, RAMB18_X2Y24 and RAMB18_X2Y25.
◦ The above contents of the pblock explain the arguments of the above command.

• add_cells_to_pblock pblock_1 -top: This actually helps choose the pblock for placing the
designs. This places the leafs of the design in the pblock. After the place_design, command, we can then place
the design on the fabric. See references: https://docs.xilinx.com/r/2021.2-English/ug835-vivado-tcl-
commands/add_cells_to_pblock

• set_property CONTAIN_ROUTING 1 [get_pblocks pblock_1]: This forces Vivado to keep the routing of any
design implemented in the pblock confined within the pblock. This allows the pblock to be reused
wherever possible. It also allows the pblock to be moved to any place required.
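The relation between the slice rows and the BRAM site names in the resize_pblock command can be sketched in Python. This is only an illustration: the index rule used here (RAMB36 Y = slice Y / 5, RAMB18 Y = 2 x RAMB36 Y and the next one) is an assumption inferred from the single example in these notes, and the helper names are hypothetical, not part of Vivado or RapidWright.

```python
CLB_ROWS_PER_BRAM_TILE = 5  # from the notes: a BRAM tile is as tall as 5 CLB tiles


def row_index(site_name):
    """Extract the Y index from a site name such as 'SLICE_X27Y60'."""
    return int(site_name.rsplit("Y", 1)[1])


def bram_ranges(bram_x, slice_lo, slice_hi):
    """Return (RAMB18 range, RAMB36 range) covering the given slice rows.

    ASSUMPTION: RAMB36 Y = slice Y // 5 and RAMB18 Y = 2 * RAMB36 Y (and +1),
    inferred only from the example pblock in these notes.
    """
    y_lo, y_hi = row_index(slice_lo), row_index(slice_hi)
    if y_lo % CLB_ROWS_PER_BRAM_TILE or (y_hi + 1) % CLB_ROWS_PER_BRAM_TILE:
        raise ValueError("slice rows must cover whole BRAM tiles")
    b36_lo = y_lo // CLB_ROWS_PER_BRAM_TILE
    b36_hi = (y_hi + 1) // CLB_ROWS_PER_BRAM_TILE - 1
    ramb18 = "RAMB18_X%dY%d:RAMB18_X%dY%d" % (bram_x, 2 * b36_lo, bram_x, 2 * b36_hi + 1)
    ramb36 = "RAMB36_X%dY%d:RAMB36_X%dY%d" % (bram_x, b36_lo, bram_x, b36_hi)
    return ramb18, ramb36


def resize_pblock_command(pblock, slice_lo, slice_hi, bram_x):
    """Build the Tcl command text; bram_x is the X index of the BRAM column."""
    ramb18, ramb36 = bram_ranges(bram_x, slice_lo, slice_hi)
    return "resize_pblock %s -add {%s:%s %s %s}" % (pblock, slice_lo, slice_hi, ramb18, ramb36)


print(resize_pblock_command("pblock_1", "SLICE_X27Y60", "SLICE_X29Y64", 2))
```

For the pblock in these notes, this reproduces the ranges RAMB18_X2Y24:RAMB18_X2Y25 and RAMB36_X2Y12:RAMB36_X2Y12. The site names should still be checked in the Vivado device view before use.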

Now we need to add a clock source to the pblock. We have chosen a 400 MHz clock (time period = 2.5 ns) to meet our
timing constraints. To add the clock, we can run this in the Vivado Tcl prompt:

• create_clock -period 2.5 -name clk -waveform {0.000 1.25} [get_ports clk]: The format for the command is:
create_clock -period <arg> [-name <arg>] [-waveform <args>] [-add] [-quiet] [-verbose] [<objects>].
See the reference for the command:
https://docs.xilinx.com/r/2021.2-English/ug835-vivado-tcl-commands/create_clock. The [get_ports clk] at
the end selects the design port named clk, and the clock is defined on that port.

• set_property HD.CLK_SRC BUFGCTRL_X0Y2 [get_ports clk]: This tells Vivado that the clk port is
driven from the clock buffer at site BUFGCTRL_X0Y2.
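The arithmetic behind the clock constraint (period from frequency, falling edge at half the period) can be written as a small Python helper. This is a minimal sketch that only builds the command text; the function name is hypothetical and a 50% duty cycle is assumed.

```python
def create_clock_command(freq_mhz, name="clk", port="clk"):
    """Build a create_clock command for a clock with a 50% duty cycle (assumed)."""
    period_ns = 1000.0 / freq_mhz      # 400 MHz -> 2.5 ns
    fall_ns = period_ns / 2.0          # falling edge at half the period
    return "create_clock -period %.3f -name %s -waveform {0.000 %.3f} [get_ports %s]" % (
        period_ns, name, fall_ns, port)


print(create_clock_command(400))
```

For 400 MHz this gives a period of 2.500 ns and a waveform of {0.000 1.250}, matching the values used above.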

If we try to place the design now, placement fails because the design is too congested. To
resize the pblock we use:
resize_pblock pblock_1 -add {SLICE_X26Y60:SLICE_X29Y64 RAMB18_X2Y24:RAMB18_X2Y25 RAMB36_X2Y12:RAMB36_X2Y12} -locs keep_all
The above command is documented here:
https://docs.xilinx.com/r/en-US/ug835-vivado-tcl-commands/resize_pblock. We can also
resize from the GUI by stretching the pblock. After some adjustments, we will be able to place the design.

The command to place the design is: place_design. This gives the following output:
Architectural Pattern Analysis
Now we need to find all the places in the device fabric where this pblock can be placed, which adds to
the transportability of the design. Note that Xilinx architectures are column based, so all resources (tiles,
sites, BELs etc.) in the same column have the same type and all interconnects are replicated.

So we need all the tile combinations which have 4 CLB tiles and 1 BRAM tile (the resized pblock spans
SLICE_X26 to SLICE_X29). This analysis of the entire device fabric to get the required column patterns
can be done in RapidWright. The following commands help us do so:
• java -cp bin:jars/* com.xilinx.rapidwright.util.RapidWright: This starts the RapidWright interpreter.
• device = Device.getDevice("xcvu3p-ffvc1517-2-i"): This loads the mentioned device and stores its entire
floorplan in the device object.
• colMap = TileColumnPattern.genColumnPatternMap(device): This stores the tile column patterns of the entire
device in the map colMap. The keys of the map are patterns of tile types (a single type or a sequence of types),
and the values are the column numbers where the corresponding pattern occurs. Note that the Xilinx fabric has
the same tile type throughout a column. For example, if the map stores [BRAM: 75, 97, 137, 193, 268, 331, 340,
396, 471, 534, 571, 594], then columns 75, 97, etc. have tiles of BRAM type. Note that these numbers are actual
column numbers and not the X indices of the tiles.
• Now we can filter colMap to get the column numbers of a particular tile type or a pattern of tile
types. Say we want all the column numbers of the BRAM tile type; we can use the command: filtered =
list(filter(lambda e: TileTypeEnum.BRAM in e.getKey() and e.getKey().size() == 1, colMap.entrySet())):
This gives a list of map entries, filtered from the map on the basis of the BRAM tile type, which hold the column numbers.
• print filtered: This prints the above list of tile columns.

Similarly, we can use the above map to find all the column numbers where the left corner of the pblock can start,
i.e. all column positions where our pblock can be successfully placed. For our pblock we need 4 CLB tile columns
and 1 BRAM tile column in consecutive locations. We can filter this pattern from colMap using:

• filtered = list(filter(lambda e: TileTypeEnum.BRAM in e.getKey() and not TileTypeEnum.DSP in e.getKey() and e.getKey().size() == 5, colMap.entrySet())):
This filters colMap to return the entries whose pattern contains a BRAM, contains no DSP and is 5 tiles long.
So one tile is a BRAM and the other 4 tiles are non-DSP tiles.

• filtered.sort(key=lambda x: x.getValue().size(), reverse=True): This sorts the list by the number
of columns in which each such combination occurs, largest first.

• from pprint import pprint, followed by pprint(filtered): This prints the columns.

So we get a two-dimensional array: for each 5-tile combination containing a BRAM, the list of its column numbers.

The tutorial mentions that we will be using these column numbers for placing the pblocks:

[CLEM, CLEL_R, BRAM, CLEL_R, CLEM]=[94, 134, 265, 328, 337, 468, 531, 568]
[CLEL_R, CLEL_R, BRAM, CLEL_R, CLEM]=[70, 188, 391]
[CLEL_R, CLEM_R, CLEL_R, BRAM, CLEL_R]=[588]
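The filter and sort steps above can be tried without RapidWright on a plain Python dictionary. This is a sketch only: the dictionary stands in for colMap, tile types are plain strings instead of TileTypeEnum values, and the DSP entry with column 999 is a made-up example added to show the exclusion.

```python
# Stand-in for colMap: pattern (tuple of tile type names) -> list of column numbers.
col_map = {
    ("BRAM",): [75, 97, 137, 193, 268, 331, 340, 396, 471, 534, 571, 594],
    ("CLEM", "CLEL_R", "BRAM", "CLEL_R", "CLEM"): [94, 134, 265, 328, 337, 468, 531, 568],
    ("CLEL_R", "CLEL_R", "BRAM", "CLEL_R", "CLEM"): [70, 188, 391],
    ("CLEL_R", "CLEM_R", "CLEL_R", "BRAM", "CLEL_R"): [588],
    ("CLEM", "DSP", "BRAM", "CLEL_R", "CLEM"): [999],  # hypothetical entry, not from the device
}

# Keep 5-tile patterns that contain a BRAM and no DSP.
candidates = [
    (pattern, columns)
    for pattern, columns in col_map.items()
    if "BRAM" in pattern and "DSP" not in pattern and len(pattern) == 5
]

# Most frequent pattern first, as in filtered.sort(..., reverse=True).
candidates.sort(key=lambda entry: len(entry[1]), reverse=True)

for pattern, columns in candidates:
    print(list(pattern), "=", columns)
```

The first entry printed is the pattern with the most starting columns, which is the one used for the pblock selection in the next section.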

Pblock Selections
We now know the columns where we can have the left corner of the pblock. We now need to select the row where we
can keep the top corner. Having selected this position, we can place the pblock there. However, not all
rows in the above columns are available for placing the pblock, because there are some irregularities in the FPGA
fabric. These irregularities include:

• Laguna tiles: The FSRs which are close to the lower and upper boundaries of the SLR have Laguna
tiles. These tiles have flip-flop pairs (Tx and Rx registers) which are used for bidirectional routing between SLRs
using Super Long Lines (SLLs).

• All FSRs on the left and right boundaries have routing lines to communicate with the other side of the SLR.
This means that lines go from the leftmost tiles to the rightmost tiles, forming a closed loop. The regions
near these boundaries have special tiles, so we cannot place our designs there.

• Even if we set CONTAIN_ROUTING to 1 while making pblocks, some parts of the routing still extend
outside the pblock area, especially clock routing, which has a single root in the
entire FSR. This routing also needs to be taken care of while placing the designs. These are called used_stubs.
If they cross clock regions, they can delay the entire design.

We can use Vivado to select correct positions to place our pblocks. This keeps the selection of pblocks
simple.

Say we use this combination of tiles with its column numbers: [CLEM, CLEL_R, BRAM, CLEL_R,
CLEM]=[94, 134, 265, 328, 337, 468, 531, 568]. The first column with this tile pattern is 94.

We can get the exact tile name using device.getTile(1,94). This returns the first tile in column 94, which is
CLEM_X9Y299.

If we know the name of a tile, we can select that tile on the FPGA diagram using the Tcl command: select_objects
[get_tiles CLEM_X9Y299]. This selects the particular tile on the FPGA fabric.

Once we have selected the tile, we can drag and select a pblock of a particular size. We can now build a pblock of
20 slices and 1 BRAM.

Also note that the FSRs are 60 tiles long (in the Y direction). So the corresponding tile in the next FSR is CLEM_X9Y239.
Using the above command, we can select a tile in the next FSR and then draw a pblock there. Subtracting
60 again, we get to a new FSR.
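Stepping from one FSR to the next by subtracting 60 from the Y index can be sketched as a small Python helper that also prints the matching select_objects commands. The function name is hypothetical, and the sketch assumes that a tile of the same type exists at each computed row, which should be checked in Vivado.

```python
import re

FSR_HEIGHT = 60  # from the notes: FSRs are 60 tiles long in the Y direction


def tiles_in_lower_fsrs(tile_name, count):
    """Return tile_name and the tiles 60, 120, ... rows below it (at most count names)."""
    match = re.match(r"^(.*_X\d+Y)(\d+)$", tile_name)
    if match is None:
        raise ValueError("unexpected tile name: %s" % tile_name)
    prefix, y = match.group(1), int(match.group(2))
    names = []
    for i in range(count):
        new_y = y - i * FSR_HEIGHT
        if new_y < 0:  # stop at the bottom of the device
            break
        names.append("%s%d" % (prefix, new_y))
    return names


for name in tiles_in_lower_fsrs("CLEM_X9Y299", 3):
    print("select_objects [get_tiles %s]" % name)
```

Starting from CLEM_X9Y299, this yields CLEM_X9Y239 and then CLEM_X9Y179, one tile for each of the three pblocks.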

We can now select three pblocks. We can (using the GUI) make a pblock and find its ranges using the command:
get_property GRID_RANGES [get_selected_objects] (select the pblock before running the command).
Tutorial 10: Pre-implemented Modules - Part II
Reference: https://www.rapidwright.io/docs/PreImplemented_Modules_Part_II.html

Implementation Optimization
Now that we have created 3 pblocks and have 3 areas where we can place the picoblaze design, it would be useful to
understand which of the pblocks gives the best performance (i.e. performs within the timing
constraints of the design). For this we can use the PerformanceExplorer tool in RapidWright.

To know more about the command we can run java -cp bin:jars/* com.xilinx.rapidwright.util.PerformanceExplorer
-h: This prints the help and the options of the command. The command places and routes the same dcp design in
all the pblocks in multiple ways and checks the number of paths which meet the timing constraints and whether the design
achieved timing closure. These statistics are then output. The command accepts several options, such as the dcp to be
tested, the regions where it needs to be placed (here pblocks), place and route directives and clock uncertainties.

The command used here to get the performance report is: java com.xilinx.rapidwright.util.PerformanceExplorer -c clk -i picoblaze_synth.dcp -t 2.85 -b picoblaze_pblocks.txt
• -c clk: the name of the clock in the design.
• -i picoblaze_synth.dcp: the name of the dcp file to be used for performance analysis.
• -t 2.85: the target clock period in ns. This means all timing paths must have a delay of at most 2.85 ns.
• -b picoblaze_pblocks.txt: the file listing the areas (pblocks) where the design must be placed and tested.

The above command takes a long time to run, but it gives the Worst Negative Slack vs pblock implementation for
multiple clock uncertainties.
Worst Negative Slack (WNS): the slack of the worst timing path, measured against the timing constraint of 2.85 ns.
For example, in the plot the delay of 7 nets is about 0.05 ns less than the timing constraint of 2.85 ns,
about 4 nets take 0.05 ns less than 2.85 ns, and so on.

If the WNS is positive, all paths finish within the clock period and the design achieves timing closure. If the WNS is
negative, the worst path takes more time than the timing constraint allows, so we need to improve the design.
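The meaning of slack and WNS can be illustrated with a few lines of Python. This is a simplified sketch: it treats slack as the clock period minus the path delay and ignores setup time, clock skew and clock uncertainty, which a real timing report includes. The path delays are made-up values, not results from the picoblaze design.

```python
def slacks(path_delays_ns, period_ns):
    """Slack of each path: positive means the path meets the constraint."""
    return [round(period_ns - delay, 3) for delay in path_delays_ns]


def worst_negative_slack(path_delays_ns, period_ns):
    """WNS is the smallest slack over all paths (the slowest path)."""
    return min(slacks(path_delays_ns, period_ns))


period = 2.85                       # target period used in these notes
passing = [2.80, 2.60, 2.10]        # hypothetical path delays in ns
failing = [2.80, 2.60, 2.91]        # hypothetical, one path is too slow

print(worst_negative_slack(passing, period))   # positive: timing is met
print(worst_negative_slack(failing, period))   # negative: timing is violated
```

With these made-up delays, the first set has a positive WNS and the second a negative one, because a single path of 2.91 ns exceeds the 2.85 ns period.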

The results show that the best implementation in terms of performance is pblock1.

Building the Overlay


Now we have a method to create pblocks. So, in separate dcp files, we create the 3 pblocks and implement the design in
each of them. Now we need a script to stitch them together into one single dcp file and route them to form one design. This
is done by a script provided in RW, src/com/xilinx/rapidwright/examples/PicoBlazeArray.java. This class
takes the designs implemented in the dcp files and stitches them together.

Ideally we want to create a design with columns of picoblaze processors whose inputs and outputs are
connected in a particular fashion, as in the reference. The above program not only stitches the dcp files of the three
placed pblocks but also replicates them along the columns, making an array of such
processors.
The format of the command is: java -cp bin:jars/* com.xilinx.rapidwright.examples.PicoBlazeArray <pblock dcp directory> <part> <output_dcp> [--no_hand_placer]

The command is: java -cp bin:jars/* com.xilinx.rapidwright.examples.PicoBlazeArray ./picoblaze xcvu3p-ffvc1517-2-i picoblaze_array.dcp:
This command is run from the RW directory. xcvu3p-ffvc1517-2-i is the name of
the device and picoblaze_array.dcp is the name of the final dcp file after all stitching.

This opens the hand placer, which allows us to move the picoblaze instances as required. It also prints the time
taken by each task. Finally we have a dcp file named picoblaze_array.dcp (in the RW directory), as mentioned
before. When this design is opened, we can see this design:

The picoblaze instance has been copied 396 times.


On closer analysis we can see that:
• The point of convergence of all the wires is the clock source mentioned before.
• There are 11 columns of the picoblaze processor design, which have been connected together.
• Each column has 36 picoblaze processors.
• Each processor uses the usual 20 tiles and 1 BRAM.
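The instance count follows directly from the array dimensions, and the same numbers give the total resources taken by the array. A short Python check, using only the figures from these notes (one slice site per tile):

```python
COLUMNS = 11              # columns of picoblaze processors
PER_COLUMN = 36           # processors in each column
TILES_PER_INSTANCE = 20   # CLB tiles (one slice site each) per processor
BRAMS_PER_INSTANCE = 1    # BRAM tiles per processor

instances = COLUMNS * PER_COLUMN
total_tiles = instances * TILES_PER_INSTANCE
total_brams = instances * BRAMS_PER_INSTANCE

print(instances, total_tiles, total_brams)
```

This agrees with the 396 copies reported for the array.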
As seen, the routing is shown in red, so we need to complete the routing. This can be done using the following Tcl commands:
• update_clock_routing: The initial routing used a global clock. Now the clock is moved to the
clock roots and supplied from there. (Reference: https://docs.xilinx.com/r/en-US/ug835-vivado-tcl-commands/update_clock_routing)
• route_design: This routes the entire design with a global clock.
• report_timing_summary -delay_type min_max -report_unconstrained -check_timing_verbose -max_paths 10 -input_pins -routable_nets -name timing_1:
This gives the timing reports.

Tutorial 11: Create and Use an SLR Bridge


This tutorial explains how we can use Time Division Multiplexing (TDM) with Laguna tiles and SLLs for inter-SLR
communication using fewer SLL lines. SLL lines are usually in high demand, and a number of applications may
require them for inter-SLR routing. RapidWright proposes a design where TDM is used with SLLs to reduce the
number of SLLs by 4X. The circuit for the same is shown here:
Here, 4 input signals are fed into a series of flip-flops and then a mux. The mux selects one of the 4 signals and
transmits it to the flip-flop pairs in the Laguna sites, which are connected using SLLs.

The signals and the input flip-flops work at the 1X clock, whereas the transmission over the SLLs and the selection of
signals using the mux are done using the 4X clock. Thus the time of transmission of signals over the SLLs is divided
into 4 parts.

The RW documentation mentions that the clock can run at frequencies greater than 750 MHz.
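The 4:1 multiplexing idea can be modelled in a few lines of Python. This is a purely behavioural sketch under simplifying assumptions: it ignores the register stages and their latency, it is not the RapidWright circuit itself, and all function names are hypothetical.

```python
RATIO = 4  # 4 signals share one SLL, so the mux and SLL run at 4X the signal clock


def tdm_send(groups):
    """Serialize groups of 4 values (one group per 1X clock cycle) onto one wire.

    Each group takes 4 cycles of the 4X clock, one value per fast cycle.
    """
    stream = []
    for group in groups:
        if len(group) != RATIO:
            raise ValueError("each group must hold %d values" % RATIO)
        for select in range(RATIO):   # mux select counts 0..3 on the 4X clock
            stream.append(group[select])
    return stream


def tdm_receive(stream):
    """Rebuild the groups of 4 values on the receiving SLR."""
    return [tuple(stream[i:i + RATIO]) for i in range(0, len(stream), RATIO)]


def sll_wires_needed(num_signals):
    """Number of SLLs needed when every 4 signals share one wire."""
    return (num_signals + RATIO - 1) // RATIO


sent = [(1, 0, 1, 1), (0, 0, 1, 0)]   # two 1X cycles of 4 one-bit signals
wire = tdm_send(sent)
print(wire)
print(tdm_receive(wire) == sent)
print(sll_wires_needed(32))
```

The wire carries 8 values for two slow cycles, the receiver recovers the original groups, and 32 signals would need 8 wires instead of 32, which is the 4X reduction described above.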

Commands
The application which generates the above SLR bridge is present in:
src/com/xilinx/rapidwright/examples/SLRCrosserGenerator.java

To run the application and generate the dcp file for the above bridge, we run the command with the required options.
To list the command and its options, run: java -cp bin:jars/* com.xilinx.rapidwright.examples.SLRCrosserGenerator -h.
This prints all the options that can be used to customise the SLR bridge that will be generated.

The command to generate the SLR bridge as mentioned in the tutorial is: java -cp bin:jars/*
com.xilinx.rapidwright.examples.SLRCrosserGenerator -l LAGUNA_X20Y120 -b BUFGCE_X1Y80 -w 32 -o
slr_crosser_vu7p_32.dcp -p xcvu7p-flva2104-2-i: The options used here are:
• -l LAGUNA_X20Y120: The Laguna site from which the placement of the SLR bridge
starts.
• -b BUFGCE_X1Y80: The clock buffer site used for generating the clock.
• -w 32: The width of SLL lines to be occupied. So the value of n in the above design is 32/4 = 8.
• -o slr_crosser_vu7p_32.dcp: The name of the output dcp file.
• -p xcvu7p-flva2104-2-i: The part number of the device.
The above command generates a dcp file with the above name, which is then opened in Vivado.
We can also add a clock to the above design to make a fully functional bridge.

However, the above design is only the SLR crossing; we also need a TDM circuit to go along with it.
For that, a design has been provided with the tutorial. It contains the implementation of the TDM circuit, but it has
not been placed and routed.

In that design, the entire TDM circuit is present but not the SLR crossing circuit. The SLR crossing circuit
appears as a black box named crossing. The SLR crossing circuit itself is present in the design
generated by SLRCrosserGenerator.java above. Now we need to merge the designs. This can be done with the
following command in the Tcl window after opening the synth32 design.

read_checkpoint -cell crossing slr_crosser_vu7p_32.dcp


This loads the provided dcp file into the cell crossing. Now we can see the final netlist with the TDM circuit and the SLR
crossing circuit.

Now we need to create pblocks, then place and route the design. This can be done using the commands:
• create_pblock pblock_top: This creates a pblock named pblock_top.
• add_cells_to_pblock pblock_top [get_cells [list T_top]] -clear_locs: This adds the cell T_top to the pblock.
• resize_pblock [get_pblocks pblock_top] -add {CLOCKREGION_X5Y5:CLOCKREGION_X5Y5}:
This resizes the pblock by adding the mentioned clock region.

• create_pblock pblock_bot
• add_cells_to_pblock pblock_bot [get_cells [list T_bot]] -clear_locs
• resize_pblock [get_pblocks pblock_bot] -add {CLOCKREGION_X5Y4:CLOCKREGION_X5Y4}

After this we can place and route the design.


Then we can run timing reports to see the timing of the design.
