DVCon Europe 2015 TA5 1 Paper
Abstract— Long simulation run times are a bottleneck in the verification process. This article presents a variety of methods for combating this performance issue: utilizing different tools, such as SystemVerilog properties; understanding the design as a system; and making changes at different levels of implementation, such as the line/block and macro levels. These methods generally speed up the full regression or, in some situations, at least the debug run. The article also presents several tips on how to analyze performance issues.
Keywords— performance; efficiency; simulation; simulator; acceleration; time; runtime; turn-around; system; analysis; test; verification; coding style
I. INTRODUCTION
Long simulation run times are a bottleneck in the verification process. A lengthy delay between the start of a
simulation run and the availability of simulation results has several implications:
Long turn-around times make code development (design and verification) and the debug process slow and clumsy. Because of the long run times, some scenarios become infeasible to verify on a simulator and must be verified on faster platforms, such as an FPGA or an emulator, which have their own weaknesses. Engineers must make frequent context switches, which can reduce efficiency and lead to mistakes.
Coding style has a significant effect on simulation run times. It is therefore imperative that code writers examine their code by asking not only "does the code produce the desired output?" but also "is the code efficient, and if not, what can be done to improve it?"
Previous papers have examined the effect on simulator performance of implementing the exact same functionality with different coding styles [1], or have emphasized the different coding-style approach that is needed when traditional verification environments migrate from Verilog to SystemVerilog and adopt the widely used UVM methodology [2].
While this paper presents some optimization methods based on the abilities of SystemVerilog, it also adds another layer of performance optimization that comes from understanding the design as a system, as well as optimizations made specifically for debug runs.
This article also presents several tips on how to analyze performance issues.
There are two types of code modifications that can accelerate simulations: changes at the line/block level (which we will call the micro level) and changes at the module or component level (which we will refer to as the macro level).
1) Synchronous example:
Consider code that counts the transactions on a bus with a VALID/READY handshake:
always @(posedge clk)
begin
  if ((VALID == 1) && (READY == 1))
  begin
    count++;
  end
end
The above code is the most intuitive way to implement the counting, since this is how it would have been written in the design. Notice, however, that while the clock is toggling and no transactions are present on the bus, the if condition is unnecessarily checked over and over again.
Now consider the following adjusted code:
initial begin
  forever
  begin
    wait ((VALID == 1) && (READY == 1))
      count++;
    @(posedge clk);
  end
end
You can see that this code functionally counts the same thing, but much more efficiently with respect to the number of calculations needed. This implementation exploits the fact that, in a system, a bus is idle for a significant percentage of the time, allowing us to achieve a performance optimization.
The exact speedup of this change cannot be calculated with a simple formula, since it depends on several factors, such as the effort a particular simulator on a given machine spends executing count++ relative to the other statements, and the ratio between idle cycles and cycles carrying actual transactions. Nevertheless, to give an idea of the speedup potential: with a transaction-to-idle ratio of 1:3, the "each cycle" code took 47% more time to execute than the alternative code¹.
2) Asynchronous example:
The following is taken from actual code found inside an IP; it is BFM code of an internal PHY. For this example, the code has been edited to use only eight phases; the original code included 128 phases.
¹ All of the measurements in this article have been taken from code running on Questasim 10.4.
wire IN0 = IN;
wire #(25) IN1 = IN0;
wire #(25) IN2 = IN1;
wire #(25) IN3 = IN2;
wire #(25) IN4 = IN3;
wire #(25) IN5 = IN4;
wire #(25) IN6 = IN5;
wire #(25) IN7 = IN6;
always @(*)
begin
case (DELAY_SEL)
4'd0 : OUT = IN0 ;
4'd1 : OUT = IN1 ;
4'd2 : OUT = IN2 ;
4'd3 : OUT = IN3 ;
4'd4 : OUT = IN4 ;
4'd5 : OUT = IN5 ;
4'd6 : OUT = IN6 ;
4'd7 : OUT = IN7 ;
endcase
end
Examining this code carefully shows that for each change of IN, the always block is invoked eight times. This is due to the cascading changes of the INx signals: IN0 changes at time "t" and invokes the always block, which processes the case logic; then IN1 changes at "t+25" and invokes the always block again, and so on, until IN7 invokes it at "t+175". Remember that the code originally supported 128 phases, so each change caused 128 invocations. The case itself was composed of 128 options, and this module was instantiated on every bit of the PHY's 128-bit bus!
This resulted in a complexity magnitude of ~M·N² (where M is the number of bits in the bus, and N is the number of phases); for M = N = 128, that is on the order of two million case-branch evaluations per input transition.
Now, consider this adjusted code (sketched here with transport-delay nonblocking assignments, one way to obtain the reduction discussed below: the cascaded INx wires are removed, and the selected delay is applied directly):
always @(IN)
begin
  case (DELAY_SEL)
    4'd0 : OUT <= IN ;
    4'd1 : OUT <= #(25)  IN ;
    4'd2 : OUT <= #(50)  IN ;
    4'd3 : OUT <= #(75)  IN ;
    4'd4 : OUT <= #(100) IN ;
    4'd5 : OUT <= #(125) IN ;
    4'd6 : OUT <= #(150) IN ;
    4'd7 : OUT <= #(175) IN ;
  endcase
end
Now each change of IN invokes the always block exactly once, and the delayed output value is simply scheduled for the proper time.
Based on the system assumption that the delay configuration is not changed simultaneously with the module's functional data flow, we have reduced the code complexity to M·N. In fact, if our simulation does not care about the "analog delay" on the bus, we can simply write OUT = IN and reduce the complexity to M only.
To emphasize the importance of being efficiency-aware: this simple code change alone, which reduced the calculation complexity to M·N, accelerated some full-chip tests (an SoC of ~40M gates) by a factor of two!
The straightforward implementation, looping over the array to clear it on every reset, seems fine, but it can actually be optimized as well.
This is a large array, and looping over one million entries takes a long time. Fortunately, this time can be saved during the initial reset of the chip (before the memory is filled) by masking the first reset negedge, as the array is already filled with zeros. Beyond that, however, a different approach can be applied: using an associative array instead of a fixed array enables the array to be nullified with one command, instead of with a loop.
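The two approaches can be sketched as follows (a sketch only; the array and signal names are hypothetical, not the original IP code):

```systemverilog
logic [31:0] mem_fixed [0:1048576-1]; // fixed-size array: one million entries
logic [31:0] mem_assoc [int];         // associative array: holds only written entries

// Original style: clear the memory entry by entry on every reset assertion
always @(negedge rst_n)
  for (int i = 0; i < 1048576; i++)
    mem_fixed[i] <= '0;

// Adjusted style: one built-in method call empties the whole array;
// entries that were never written read back as the default value anyway
always @(negedge rst_n)
  mem_assoc.delete();
```

The associative array also consumes memory only for the entries actually written, which is a further saving for sparsely used memories.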
Even with a relatively small memory of 256 entries, the associative-array implementation is 10 times more efficient.
C. Time tracking:
Often we want to execute code after some amount of time has passed since a previous event. There are two options for keeping track of the elapsed time. The first option is to use explicit delays with the '#' operator. This approach has one major drawback: what if the clock's frequency is not known in advance, or can be different in a future project? Therefore, a counter of clock cycles is usually used instead:
int cycleDelay=100;
event startDelay; //triggered by some logic
initial
begin
@( startDelay);
repeat (cycleDelay) @(posedge clk);
$display("%t %0d count", $realtime, cycleDelay); //just some action
end
However, this is a very wasteful way to track time, since code is executed on every clock cycle. Instead, consider the following code, which uses the '#' operator in a more sophisticated way:
int cycleDelay=100;
realtime samplePosedge1, samplePosedge2, period;
event startDelay; //triggered by some logic
initial
begin
@(posedge clk);
samplePosedge1 = $realtime;
@(posedge clk);
samplePosedge2 = $realtime;
period = samplePosedge2 - samplePosedge1;
-> startDelay;
end
initial
begin
@(startDelay);
#( period * ( cycleDelay - 2 ) ); //since 2 clocks were "wasted" on the sampling
$display("%t %0d * delay", $realtime, cycleDelay); //just some action
end
Using this delay method consumes only a tiny fraction of the simulator resources consumed by the cycle-count method. Note, however, that it assumes the clock period remains constant for the duration of the delay.
generate if ( ! CLKDIV_DIGRF_SIMPLIFIED_MODEL )
begin :package_model //simulation will take longer time
clkdiv_digrf u_pll_clkdiv
(
.CKOUT_624M (CKOUT_624M),
.CKOUT_499M (CKOUT_499M),
.CKOUT_416M (CKOUT_416M),
.CKOUT_312M (CKOUT_312M),
.CLKIN (CLKIN)
);
end
else
begin : simplified_model
clkdiv_digrf_simplified u_pll_clkdiv
(
.CKOUT_624M (CKOUT_624M),
.CKOUT_499M (CKOUT_499M),
.CKOUT_416M (CKOUT_416M),
.CKOUT_312M (CKOUT_312M),
.CLKIN (CLKIN)
);
end
endgenerate
Please note that each begin-end pair is named differently, so the module path will differ depending on the parameter value. This is relevant when probing into that module: the probing path should also depend on the parameter value.
If the code is not in the design, engineers can use the generate-if method as well, or simply add a parameter that interacts directly with the testbench component to disable it. Alternatively, if the code is a class type, the parameter may be used to prevent the component's creation.
For example, in the Universal Verification Methodology (UVM), the configuration object of an agent should hold variables indicating whether to create subscribers for the monitor, and even whether to enable the monitoring activity of the monitor itself.
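Such a configuration flag might be used as follows (a sketch assuming typical UVM agent code; the class and field names are hypothetical):

```systemverilog
import uvm_pkg::*;
`include "uvm_macros.svh"

class my_agent extends uvm_agent;
  my_cfg        cfg; // configuration object, set by the test
  my_monitor    mon;
  my_subscriber sub;

  `uvm_component_utils(my_agent)

  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction

  function void build_phase(uvm_phase phase);
    super.build_phase(phase);
    if (!uvm_config_db#(my_cfg)::get(this, "", "cfg", cfg))
      `uvm_fatal("CFG", "no configuration object")
    // Skip the monitor (and everything downstream of it) when the test
    // does not need it; uncreated components cost nothing at run time.
    if (cfg.enable_monitor)
      mon = my_monitor::type_id::create("mon", this);
    if (cfg.enable_monitor && cfg.enable_subscriber)
      sub = my_subscriber::type_id::create("sub", this);
  endfunction
endclass
```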
By using these types of methods we have managed, with minimal effort, to speed up a SoC (10M gates)
environment by a factor of 10.
B. System Modes
In some cases, leaving some modules in a reset state, or with no clock, is a valid design mode. Even when it is
not a valid system mode, if the test(s) are not affected by it, forces can be used to override the normal behavior.
Again, this can be controlled by a parameter.
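A sketch of the force approach (the parameter name and hierarchy path are hypothetical):

```systemverilog
// Hold an uninteresting block in reset for the whole test, under control
// of a testbench parameter, so its internal logic never toggles and the
// simulator never spends effort evaluating it.
parameter bit DISABLE_BLOCK_X = 0;

initial
begin
  if (DISABLE_BLOCK_X)
    force tb_top.u_soc.u_block_x.rst_n = 1'b0; // kept in reset, no activity
end
```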
In a design with a complex clock scheme, engineers may try to find the clock ratios best suited to each type of test. If a test depends on the cores, it may help to increase the core clock frequency; if a test depends on DMA activity, the core frequency can be reduced while the core is idle. It is good practice to choose default ratios that suit most of the tests and adjust only specific ones.
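One way to make the ratios adjustable per test is to expose the clock periods as parameters of the testbench clock generators (a sketch; the names and default values are illustrative):

```systemverilog
`timescale 1ns/1ps

module tb_clocks #(
  // Defaults chosen for the common case; individual tests override them.
  parameter realtime CORE_CLK_PERIOD = 2ns,
  parameter realtime DMA_CLK_PERIOD  = 10ns
) (
  output logic core_clk,
  output logic dma_clk
);
  initial begin
    core_clk = 0;
    dma_clk  = 0;
  end
  always #(CORE_CLK_PERIOD / 2) core_clk = ~core_clk;
  always #(DMA_CLK_PERIOD / 2)  dma_clk  = ~dma_clk;
endmodule
```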
Because excess activity in a design costs both simulator calculation resources and, in silicon, chip power consumption, analyzing it can sometimes expose design bugs in the early stages of the project.
A. Simulation Phases
When using analysis tools, it is best to perform a different analysis for each stage of the simulation (i.e., for different time frames). These stages include the out-of-reset phase, the configuration phase, and the run phase (which can be further subdivided). Using small time frames produces a more accurate analysis per simulation stage, since different parts of the design are active at different stages and therefore consume different simulator resources. Conversely, examining the simulation run globally makes it harder to analyze the design for anomalies.
B. Acceleration Measurements
After finding the critical components in the code that affect simulation time, and the right solutions for them, it is recommended to measure the benefit of those optimizations. Here are some tips regarding those measurements:
When measuring, know exactly what is being measured. For example, when measuring simulation run time, do not include the time required to load the simulator software or the time taken to load the design code into the simulator. These times may be important, and may be optimized as well, but they are irrelevant for this type of calculation.
V. CONCLUSION
Slow simulations are not necessarily decreed by fate. Engineers as well as managers should pay attention to
the importance of coding efficiently for simulation as well as the different ways to analyze simulations and tackle
simulation bottlenecks.
REFERENCES
[1] Clifford E. Cummings, "Verilog Coding Styles for Improved Simulation Efficiency," ICU, 1997.
[2] Frank Kampf, Justin Sprague, and Adam Sherer, "Yikes! Why Is My SystemVerilog Testbench So Slooooow?," DVCon, 2012.