Sjsnug03 Lahti Final
Gregg D. Lahti
Corrent Corporation
[email protected]
Tim Schneider
Synopsys Corporation
[email protected]
Gopal Varshney
Corrent Corporation
[email protected]
ABSTRACT
It's three days before your tapeout deadline and an RTL change from above gets dropped into
your design. The RTL and gate-level simulation regressions need to be re-run across the
compute ranch. You're now under the gun to get it completed to make the tapeout or you may be
the next gateslinger in the manager's layoff sights at high noon. Are you sure you're getting the
most performance out of your VCS simulation tool?
The huge RTL and gate-level simulation burden for verifying multi-million gate designs can
make or break a project deadline. This paper discusses methods, tips, and tricks that proved
successful at Corrent for getting the most performance out of your VCS simulations, so you too
can have the fastest VCS simulation in the West.
1.0 Introduction
VCS is a fast Verilog simulator provided you know the right switches and code your Verilog to
make use of the built-in performance. VCS has been continually enhanced over the years, and
the myriad of command-line switches can make the usage a bit unmanageable at times. VCS is
generally quite fast out of the box, but the compiler solves for the “general” case. Using
knowledge from the designer or verification engineer, VCS can be coaxed into more efficiently
simulating Verilog code. The right command line switches thrown for the right reasons can make
all the difference in the performance of your simulation runs.
VCS should be viewed as having two different operating models: debugging and regression.
Debugging options within VCS add more visibility into the design at the cost of slowing the
simulation down, and VCS will show its true speedy colors without the debugging overhead. So
which one do you use?
The last thing an engineer wants is to wait around for a simulation to finish. It's a great
excuse to get coffee, but productivity slides as the simulations and gate counts get exponentially
larger. When you want VCS to run fast, here's what not to do.
1. Test Benches: The Dark Side of IP Reuse, Gregg D. Lahti, San Jose SNUG 2000 paper.
• -I
• -PP
• +cli
• +acc+2
• missing -Mupdate
• -P [library]
Compile scripts often get written quickly to bring a simulation up and functional, but are never
proofed for options that are rarely, if ever, needed. A cursory check of the compile and run-time
command-line flags can optimize your simulation without affecting how it is used.
Make sure that your pli.tab file (the dispatch table for PLI calls) is optimized. The key item to
keep in mind is to ensure that you are not enabling more PLI visibility and capability than is
really needed. The pli.tab file optimization is covered later in this paper, and a whole section
in the VCS reference manual discusses it in detail.
VCS unnecessarily schedules the q assignment one delay element past the clock transition,
causing the simulation to run slower. Removing the #1 or #0 delay assignments in your code can
increase simulation performance by anywhere from 0 to 200%, with an average increase of 30-50% [2].
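For example (a sketch with hypothetical signal names, not code from the design above), the difference is a single token:

```verilog
// Slower: the #1 schedules q one extra time unit past the clock edge.
always @(posedge clk)
  q <= #1 d;

// Faster: same synthesized behavior, no extra event scheduling.
always @(posedge clk)
  q <= d;
```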
In Figure 2.0, an initial block should have been used instead of an always block. VCS may or
may not optimize this, and simulations that worked for days may suddenly stop working and
develop infinite-loop conditions if this type of coding is used. Instead, change the always to
an initial.
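The failing pattern is one-time setup code written inside an always block (a sketch with a hypothetical signal name): with no event control, the always block restarts forever at time zero.

```verilog
// Looping: no event control, so this block re-executes endlessly at time 0.
always begin
  enable = 1'b0;
end

// Correct: an initial block runs the one-time setup exactly once.
initial begin
  enable = 1'b0;
end
```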
2. Verilog Nonblocking Assignments With Delays, Myths and Mysteries, Cliff Cummings, Boston SNUG 2002 paper.
Either of these coding styles is acceptable, as VCS collapses the logic for performance.
Now that we’ve discussed some of the pitfalls that slow VCS, let’s look at some optimization
techniques that will help the simulation speed and performance. These optimizations are geared
for regression runs, where debugging information isn’t required. These are the easy
optimizations to make since it usually just requires a cursory review of the compile and run
scripts used in the simulation environment.
Another option is the +2state flag, which forces the simulation to run in two states instead of
four. This flag has implications for coding style and is of limited use for the most part, but it
can be worth a try if you don't require all the Verilog states. It is also useful for testing
functionality in your design without worrying about getting out of reset due to X propagation.
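As a hypothetical sketch of why this helps: in four-state simulation an unreset register powers up X and arithmetic on X stays X, so nothing runs until reset completes; under +2state the same register powers up 0.

```verilog
// Four-state: count powers up 4'bxxxx and (x + 1) stays x forever
// without a reset.  Under +2state it powers up 4'b0000 and counts.
module free_counter (input clk, output reg [3:0] count);
  always @(posedge clk)
    count <= count + 4'd1;
endmodule
```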
4.3 In-Line C
If possible, re-code your PLI routines in the in-line C (DirectC) format. This does limit your
code to VCS only, but hey, this is a VCS paper! ☺ The added ease of use and performance may be
worth it.
So how do you know if all of these command-line switches and coding techniques really work?
The answer is found in the output of the +prof command line option. VCS will happily tell you
where it is spending time in your Verilog code and provide enough statistical data to choke a
horse. Here’s how to get through the gory details.
Previous versions of VCS required a separate step to profile code execution using gprof. As of
VCS 5.2, profiling is a very easy procedure now that a profiling option has been included [3].
3. VCS 5.0 release notes.
===========================================================================
TOP LEVEL VIEW
===========================================================================
TYPE %Totaltime
---------------------------------------------------------------------------
PLI 0.23
VCD 0.99
KERNEL 7.76
DESIGN 91.02
---------------------------------------------------------------------------
===========================================================================
MODULE VIEW
===========================================================================
Module(index) %Totaltime No of Instances Definition
---------------------------------------------------------------------------
delaychain (1) 67.75 56 ../top/rtl/delaychain.v:15.
dll_delay_line (2) 7.16 2 ../rtl/ddrctlr/rtl/dll_delay_line.v:21.
ckrst (3) 2.47 1 ../top/rtl/ckrst.v:13.
INVDL (4) 1.25 8431 /projects/clibs/umc/0.15vst/
tapeoutkit/stdcell/UMCL15U210T2_2.2/Verilog_simulation_models/INVDL.v:32.
hurricane_tb (5) 1.23 1 ../tb/hurricane_tb.v:31.
pdisp (6) 1.16 8 ../rtl/hurricane/rtl/pdisp.v:33.
dll_mux (7) 0.96 1720 ../rtl/ddrctlr/rtl/dll_mux.v:21.
dll_buf (8) 0.56 1732 ../rtl/ddrctlr/rtl/dll_buf.v:21.
spsram_1536x32 (9) 0.54 8 /projects/clibs/rams_nobist_M4one/
0.15vst_2.0/Verilog_fix/spsram_1536x32.v:8.
---------------------------------------------------------------------------
In Figure 4.0, the Top Level View section shows the percentage of total time each piece of the
simulation consumed. In our case, the design itself was executing 91% of the time, while VCS
kernel operations accounted for 7.76%. This quick summary is useful for seeing whether the PLI is
occupying a large percentage of simulation time compared to the actual design. The overall
simulation time is quite large. Upon closer inspection, the -I flag and +acc+2 flag were set in
the compile script. These timesinks gobble huge amounts of valuable VCS CPU time, so they were
the first things removed from the compile script.
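When many regressions emit these reports, the Top Level View can be screened automatically. Here is a small sketch; the 5% PLI threshold is our own rule of thumb rather than a VCS default, and the sample report text is abbreviated from Figure 4.0:

```shell
# Extract the PLI share from a +prof Top Level View and warn when it
# is high.  The here-document stands in for a real profile report.
pli_check=$(awk '/TOP LEVEL VIEW/ {in_top = 1}
                 /MODULE VIEW/    {in_top = 0}
                 in_top && $1 == "PLI" {
                     if ($2 + 0 > 5) print "warning: PLI share is " $2 "%"
                     else            print "PLI share ok: " $2 "%"
                 }' <<'EOF'
TOP LEVEL VIEW
TYPE %Totaltime
PLI 0.23
VCD 0.99
KERNEL 7.76
DESIGN 91.02
MODULE VIEW
EOF
)
echo "$pli_check"
```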
The Module View shows the percentage of total simulation time spent in each module of the
design. This section of the report shows that VCS spent 67% of its time in one particular area,
the top/rtl/delaychain.v file. That's a very large share of the run for a single file.
input sigin;
output [120:0] sigout;
wire [120:0] sigout;
endmodule
===========================================================================
TOP LEVEL VIEW
===========================================================================
TYPE %Totaltime
---------------------------------------------------------------------------
PLI 1.65
VCD 0.02
KERNEL 11.86
DESIGN 86.46
---------------------------------------------------------------------------
===========================================================================
MODULE VIEW
===========================================================================
Module(index) %Totaltime No of Instances Definition
---------------------------------------------------------------------------
delaychain (1) 17.24 56 ../top/rtl/delaychain.v:15.
hurricane_tb (2) 8.43 1 ../tb/hurricane_tb_nodump.v:31.
dll_mux (3) 5.06 1720 ../rtl/ddrctlr/rtl/dll_mux.v:21.
INVDL (4) 4.12 8431 /projects/clibs/umc/0.15vst/
tapeoutkit/stdcell/UMCL15U210T2_2.2/Verilog_simulation_models/INVDL.v:32.
pdisp (5) 3.83 8 ../rtl/hurricane/rtl/pdisp.v:33.
dll_buf (6) 3.10 1732 ../rtl/ddrctlr/rtl/dll_buf.v:21.
rctl (7) 2.35 8 ../rtl/hurricane/rtl/rctl.v:32.
dll_delay_element (8) 2.24 1720 ../rtl/ddrctlr/rtl/
dll_delay_element.v:20.
xaux_regs (9) 1.60 8 ../rtl/hurricane/rtl/xaux_regs.v:249.
tdc_cdb (10) 1.00 1 ../rtl/tdc/rtl/tdc_cdb.v:16.
dpsram_b_64x64 (11) 0.98 2 /projects/clibs/rams_nobist_M4one/
0.15vst_2.0/Verilog_fix/dpsram_b_64x64.v:28.
cr_int (12) 0.98 8 ../rtl/hurricane/rtl/cr_int.v:287.
dpsram_b_160x64 (13) 0.97 1 /projects/clibs/rams_nobist_M4one/
0.15vst_2.0/Verilog_fix/dpsram_b_160x64.v:28.
saob (14) 0.93 8 ../rtl/saob/rtl/saob.v:162.
mt46v16m16 (15) 0.88 10 ../vmodels/mt46v16m16.v:48.
Notice that the simulation time drastically decreased from 976 seconds down to 149.6 seconds!
The rtl/ddrctlr/rtl/dll_delay_line.v item has moved down the CPU-hog list since
+nospecify was added, so it appears we have cleaned up that area. However, the
delaychain item is still at the top of the CPU-hog list. Since we know that this is a gate-level
workaround coded in RTL, we can optimize the code further.
A better approach would be to remove the delay chain entirely, using `ifdef options so that it
exists for gate-level simulation only. Again, this is a Verilog coding-style issue that has a
drastic effect on speed. Here is the modified delaychain.v code:
module delaychain (sigin, sigout);
input sigin;
output [120:0] sigout;
wire [120:0] sigout;
`ifdef SYNTH_DELAYCHAIN
`else
assign sigout = { sigin, {60{!sigin, sigin}} };
`endif
endmodule
Figure 8.0 Improved Delay Chain Verilog Code with IFDEF Construct
Now the delay chain is completely removed from the RTL simulation and VCS can optimize the
run. Here’s the profile output with recoding:
===========================================================================
TOP LEVEL VIEW
===========================================================================
TYPE %Totaltime
---------------------------------------------------------------------------
PLI 1.15
VCD 0.02
KERNEL 13.34
DESIGN 85.50
---------------------------------------------------------------------------
===========================================================================
MODULE VIEW
===========================================================================
Module(index) %Totaltime No of Instances Definition
---------------------------------------------------------------------------
hurricane_tb (1) 6.79 1 ../tb/hurricane_tb_nodump.v:31.
dll_mux (2) 5.90 1720 ../rtl/ddrctlr/rtl/dll_mux.v:21.
pdisp (3) 4.92 8 ../rtl/hurricane/rtl/pdisp.v:33.
rctl (4) 3.76 8 ../rtl/hurricane/rtl/rctl.v:32.
dll_buf (5) 3.66 1732 ../rtl/ddrctlr/rtl/dll_buf.v:21.
xaux_regs (6) 2.93 8 ../rtl/hurricane/rtl/xaux_regs.v:249.
dll_delay_element (7) 2.63 1720 ../rtl/ddrctlr/rtl/
dll_delay_element.v:20.
delaychain (8) 2.08 56 ../top/rtl/delaychain.v:15.
tdc_cdb (9) 1.67 1 ../rtl/tdc/rtl/tdc_cdb.v:16.
spsram_1536x32 (10) 1.32 8 /projects/clibs/rams_nobist_M4one/
0.15vst_2.0/verilog_fix/spsram_1536x32.v:8.
The delaychain.v file has dropped from the top of the list of heavy hitters. Note that the overall
simulation time has decreased to 124.4 seconds, saving us even more valuable simulation time.
The * means: add the PLI acc hook for every signal in the entire design, consuming valuable
resources and simulation time. Be prudent and determine whether you really need all that
visibility into your simulation!
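As an illustration (the task name is hypothetical, and the exact acc specification syntax should be checked against the VCS reference manual), the first tab entry below grants read/write access hooks across every signal in the design, while the second restricts access to the instances where the system task is actually called:

```
$my_monitor call=my_monitor_call acc+=rw:*
$my_monitor call=my_monitor_call acc+=rw:%TASK
```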
Many engineers opt to use a debugging and waveform-viewing tool other than the built-in
Virsim application. One of the leading tools, Debussy, ships a pli.tab file to use with the
application so that VCS can map in the PLI routines for simulation debugging and waveform
dumping. Unfortunately, the pli.tab file Debussy ships is unoptimized.
The problem with the standard Debussy PLI file is that some of the defined items enable all of
the PLI operations when they can be optimized down to just a task-based call. The items that can
be optimized are marked above with a ♦ symbol. A modified version of the same file improved
the performance of our debugging simulations by about 15% in runtime, just by changing the
%* denotations to %TASK.
At the time of writing this paper, Debussy also supports a Direct Kernel Interface to VCS, which
is more efficient than using the PLI. If possible, switch to the new Direct Kernel Interface model.
In our test case, the +vcs+learn+pli option boiled the accesses actually used down into the
pli_learn.tab file, with only a small number of PLI routine options specified. We incorporated
this new pli_learn.tab file by using the +applylearn flag on a subsequent compile, and about
a 10% speed improvement was realized in simulation runs. Keep in mind that if the PLI interface
changes or a new PLI function is added, the +vcs+learn+pli option needs to be rerun or your
simulation may quit working.
Cleaning up your simulation with the profiling analysis & compile/run script optimizations will
do wonders for simulation speed improvement. In test case #1, we had a baseline, un-optimized
compile and run script that looked as follows:
#!/usr/local/bin/perl -w
$plat = `uname -s`;
chomp $plat;
$ENV{PLI_LIB} = "/projects/Verilog_pli/lib/${plat}";
$vcs_cmd = "vcs -notice +acc+2 -I -Mupdate ";
$vcs_cmd .= "-P $ENV{DEBUSSY_PLI}/debussy.tab ";
$vcs_cmd .= "$ENV{DEBUSSY_PLI}/pli.a ";
$vcs_cmd .= "-P ./pli.tab ./pli.c +incdir+../tb ";
$vcs_cmd .= "+define+ARC_MYSRAM +define+BEHAVE ";
$vcs_cmd .= "+define+PLL_resolution_1ps ";
$vcs_cmd .= "-P /projects/Verilog_pli/lib/${plat}/get_plusarg.tab ";
$vcs_cmd .= "-P /projects/Verilog_pli/lib/${plat}/value.tab ";
$vcs_cmd .= "/projects/Verilog_pli/lib/${plat}/libvalue.so ";
$vcs_cmd .= "+define+ARC_MYSRAM +define+PCI66 ";
$vcs_cmd .= "-P /projects/Verilog_pli/lib/${plat}/fileio.tab ";
$vcs_cmd .= " $ENV{PLI_LIB}/libget_plusarg.so ";
$vcs_cmd .= "$ENV{PLI_LIB}/libfileio.so ../lib/timescale.v ";
$vcs_cmd .= "+define+PLX -Mdir=../bin/csrc ";
$vcs_cmd .= "-CC \"-I $ENV{VCS_HOME}/include/ -DMUNIX -DUNIX\" ";
$vcs_cmd .= "/projects/hurricane/bin/onlineChecker.a ";
$vcs_cmd .= "-P /projects/Verilog_pli/lib/SunOS/rascalint.tab ";
$vcs_cmd .= "-larc-neutral-pli-Verilog -lrascalint ";
$vcs_cmd .= "-f filelist.v $copts -o simv ";
print "VCS Command : \n\n\n$vcs_cmd\n";
system("$vcs_cmd | tee compile.log");
A: Baseline compile and run script with the delay chains enabled
B: Removal of instantiated gate-level delay chains
C: Removing +acc+2, -I, and –PP
D: Removing compile-in of Debussy PLI and other debugging PLIs which
weren’t used for regression simulations
E: With +nospecify compile switch
In columns B-E, each item builds upon the previous setting, so column D includes the items from
B and C plus its own. All times are measured in CPU seconds and were executed on a single Linux
P4 1.7GHz machine with 1.5 GB of system memory.
Test Name A B C D E
ahb_cfg_test1 30.770 7.340 7.360 6.650 5.940
ahb_cfg_test2 13.130 4.370 4.380 4.170 3.740
ahb_cfg_test_incr 186.830 33.860 33.780 29.530 25.740
ahb_cfg_wr_rd 67.610 13.980 13.940 12.400 10.980
ahb_ddr_bw 115.140 22.310 22.250 18.970 16.680
ahb_ddr_test1 61.570 13.000 12.940 11.260 9.890
ahb_ddr_test_incr 103.870 21.420 21.270 18.510 16.320
ahb_ddr_wr_rd 100.100 20.520 20.340 17.400 15.400
ahb_memctl_test1 217.370 34.560 34.330 29.610 25.220
ahb_memctl_test_sdram 131.330 23.970 23.900 21.170 18.420
gmi_mission_test1 416.290 76.550 76.510 67.790 59.320
gmi_mission_test2 416.880 76.840 76.940 67.890 59.560
gmi_mission_test3 416.340 76.850 76.630 67.670 59.470
gmi_pause_test1 76.250 15.730 15.700 13.920 12.560
gmi_ser_test 97.630 18.230 18.180 16.160 14.140
Speed Increase over (A) - 533% 534% 608% 693%
Note that each item incrementally improves on the original time. Getting rid of the delay chain
dramatically optimized the performance: just cleaning up the delay chains that were hogging
valuable CPU time resulted in over a 5X speed improvement! Let's look at the same results with
the optimized Verilog code (B) as the baseline, shown in Figure 16.0, to isolate the improvement
from the command-line switches alone.
Pruning the compile and run scripts gained an average of 30% simulation speed performance!
Gate simulations are useful for verifying that the final product functions as intended with full
timing. They are very useful for verifying that initialization and reset of the design are
correct, that no hidden multi-cycle paths were missed during static timing analysis, and that
timing is clean at process/voltage/temperature corners. Gate simulations are very large and slow,
as they incorporate back-annotated layout timing data along with library timing models into the
chip model. The last 20-million-gate project at Corrent had gate sims taking 8 hours to compile
and over 24 hours to run for a single simulation, occupying a full 5 GB of system memory. Many
engineers limit how many gate simulations are done purely because of resources and schedule.
This is also a good area for a speed optimization.
• +memopt
• +timopt
For example: +timopt+100ns.
This option specifies that the shortest clock period is 100ns. In the hurricane testcase, +timopt
was able to automatically optimize 32% of the sequential elements in the design. A
configuration file is also written out which enables the user to manually specify other potential
sequential elements for further optimizations.
If you find that you are using too much memory even when utilizing the +memopt option, you
could be reaching the per-process memory limits imposed by the system on the compilation
process. Another switch to try is the +memopt+2 compile-time option, which spawns a second
process for some of the memory optimizations. The compilation log file will contain entries if
+memopt+2 optimizations occurred, so be sure to check the log file after compilation.
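A quick way to check is to scan the log (a sketch; the exact message wording varies by VCS version, so a loose search key is used here):

```shell
# Scan the compile log for +memopt+2 activity; report when nothing
# matches instead of failing silently.
grep -i "memopt" compile.log 2>/dev/null || echo "no memopt entries found"
```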
On our Linux compute farm there is a small group of Linux machines that contain 4GB of
physical memory. Out of the box, generic Linux imposes a 3GB process-size limitation. Editing
the /usr/src/linux-2.4/include/asm-i386/page.h file and changing 0xC0000000 to 0xEC000000
bumps this up to a 3.7 GB process size. This was highly useful for those compiles that did not
crest the top of physical memory but were large enough to go past the 3GB process-size limit.
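Before editing kernel headers, it is worth confirming what the box currently allows; these are generic Linux commands, not VCS-specific ones:

```shell
# Per-process virtual memory cap in kB ("unlimited" if none is set).
ulimit -v
# Total physical memory on the machine.
grep MemTotal /proc/meminfo
```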
input d,cp,cd;
output q,qn;
reg q;
assign qn=~q;
always@(posedge cp or negedge cd)
q = #5 cd ? d : 1'b0;
endmodule
input d,cp,cd;
output q,qn;
reg q;
assign qn=~q;
always@(posedge cp or negedge cd)
if(cd == 0)
q <= 1'b0;
else
q <= d;
endmodule
Another thing to scan for in libraries is cells or Verilog modules that are nearly, but not
quite, identical. VCS can use an optimization (in concert with +rad) called 'vectorization':
when Verilog modules look alike, VCS can take advantage of that fact. This was more of an issue
in older (5.X) versions of VCS; however, when coupled with +rad, these module similarities can
still have an impact on the design and where time is spent.
module lfsr_leaf1(d,clk,reset,q);
input d,clk,reset;
output q;
wire q;
wire q_unused;
EN en_1(d,q7,xout);
FD2 fd2_1(xout,clk,reset,q1,q1b),
fd2_2(q1,clk,reset,q2,q2b),
fd2_3(q2,clk,reset,q3,q3b),
fd2_4(q3,clk,reset,q4,q4b),
fd2_5(q4,clk,reset,q5,q5b),
fd2_6(q5,clk,reset,q6,q6b),
fd2_7(q6,clk,reset,q7,q7b),
fd2_8(q7,clk,reset,q8,q8b),
fd2_9(q8,clk,reset,q9,q9b),
fd2_10(q9,clk,reset,q_unused,q10b);
assign q = ~q10b;
endmodule
Figure 19.0 ‘lfsr_leaf1’ code
module lfsr_leaf2(d,clk,reset,q);
input d,clk,reset;
output q;
EN en_1(d,q7,xout);
FD2 fd2_1(xout,clk,reset,q1,q1b),
fd2_2(q1,clk,reset,q2,q2b),
fd2_3(q2,clk,reset,q3,q3b),
fd2_4(q3,clk,reset,q4,q4b),
fd2_5(q4,clk,reset,q5,q5b),
fd2_6(q5,clk,reset,q6,q6b),
fd2_7(q6,clk,reset,q7,q7b),
fd2_8(q7,clk,reset,q8,q8b),
fd2_9(q8,clk,reset,q9,q9b),
fd2_10(q9,clk,reset,q,q10b);
endmodule
By coding the two leaf modules identically, less time is spent doing port evaluations in the VCS
kernel and more time is spent evaluating the design (approximately 2% of a 25-second run).
While this example contains little code, consider saving nearly 50 minutes of a 24-hour run, and
multiply that savings across the many, many modules of a large SoC design!
The Synopsys RTL rule checking tool “LEDA” includes rulesets for not only synthesis, but for
VCS performance optimizations as well. This can be used to screen libraries and Verilog code
for performance as well as potential race conditions.
8.0 Summary
In summary, the heavy performance items to prune from compile and run scripts are:
• -I
• +cli
Clean your Verilog code and libraries of any #0 or #1 delays, and proof your code for style
problems. Profiling your simulation with the +prof option gives you a heads-up on the time hogs.
Using +timopt can increase gate-level simulation performance, and +memopt may give you that
extra bit of headroom to compile gate simulations. Clean your PLI .tab files by correcting the
* options, and when possible use the +vcs+learn+pli option to optimize the PLI calls.
Straight out of the box, VCS is a powerful simulator, but it needs to be tuned to get the most
performance within your constraints. Not all optimization techniques are obvious, but with this
paper and a little bit of trial and error, your simulations can run more quickly and
efficiently. Our large designs at Corrent yielded up to a 6X performance improvement by
following these tips!
The authors would like to thank Mark Warren, Synopsys Corporation, for his review of and
comments on this paper, and for the fruit basket that fueled the ambition to write it. This
paper, along with other SNUG papers the author has written, can be found at
http://gateslinger.com/chiphead.htm.
9.0 References
Test Benches: The Dark Side of IP Reuse, Gregg D. Lahti, San Jose SNUG 2000 paper.
http://gateslinger.com/chiphead.htm or
http://www.synopsys.com/news/pubs/snug/snug00/lahti_final.pdf
Verilog Nonblocking Assignments With Delays, Myths and Mysteries, Cliff Cummings,
Boston SNUG 2002 paper. http://www.sunburst-design.com/papers/
ESNUG posts: 380 item 11, 383 item 9, 387 item 16. http://deepchip.com/esnug.html
Solvnet: http://solvnet.synopsys.com