DT3 2022 en
Summary
The design of a digital system begins with the definition of its requirements, and ends with the
implementation, which traditionally consists of printed circuit boards that hold integrated circuits.
Today, the entire system is often implemented as a single integrated circuit, in which case we speak of
system-on-a-chip (SoC) design. The SoC can be implemented either as a specially designed ASIC circuit,
or by using an FPGA circuit that contains all the programmable resources required in the SoC. This
chapter gives an overview of the SoC design process.
Due to the complexity of digital systems, electronic design automation (EDA) methods play a key role in
their design. The transformation of design information from one level of abstraction to another takes
place by means of computer programs, either manually (the designer performs the transformation
interactively) or automatically (the computer program performs the transformation independently
within the constraints given by the designer). In order for a designer to be able to use EDA software
correctly and efficiently, he or she must know how the programs work and what type of information the
programs process and produce. In other words, the designer needs to know the properties of different
types of models. Otherwise, the use of design tools may be ineffective or may produce erroneous
results.
Models of digital systems can be classified according to their level of abstraction, or description
accuracy, in many ways. In the following, a classification is used which is particularly well suited to
describe a typical hardware design process based on the use of ASICs. The levels correspond to
models of the digital circuit design at various stages of the design process, and are the following:
● electronic system level (ESL),
● register-transfer level (RTL),
● gate level (structural logic level),
● circuit level, and
● physical level.
As the design progresses, lower-level models are created based on the upper-level models to implement
the functions described in the higher-level models. This is continued until a physical level is reached at
which point the description of the system is sufficiently detailed for manufacturing. Models generated at
different stages of design are also used for design verification tasks such as simulation or timing analysis.
Verification ensures that a new model corresponds with the function and properties of the model on
which its design was based.
Electronic System Level
A system-level model describes the required function of the entire system. Because it serves as
the specification, it is usually also presented in a format that can be executed on a computer.
System-level modeling tools include, for example, high-level programming and hardware description
languages (HDL), transfer functions (e.g., Matlab models), and graphical state and data flow diagrams. A
commonly used abstraction for a system-level model is a set of communicating, concurrent processes,
each describing sequential operation. One standardized description language for digital systems using
this approach is SystemC. The following model presents a description of the correlator circuit (practically
an FIR filter) in SystemC, consisting of a module declaration (a C++ class declaration implemented with
SystemC macros) and a "run" method describing the function of the module. The model does not differ
significantly from a standard C++ model, except that its operation is synchronized to the clock signal
with wait function calls.
#include <systemc.h>

SC_MODULE(SAMPLE_DESIGN) {
public:
  sc_in_clk CLK;
  sc_in<bool> RST_N;
  sc_in<sc_fixed<N_TEMPLATE_BITS, N_TEMPLATE_BITS, SC_RND, SC_SAT> > TEMPLATE_IN;
  sc_in<bool> DATA_MODE_IN;
  sc_in<sc_fixed<N_DATA_BITS, N_DATA_BITS, SC_RND, SC_SAT> > DATA_IN;
  sc_in<bool> DATA_EN_IN;
  sc_out<sc_fixed<N_DATA_BITS, N_DATA_BITS, SC_RND, SC_SAT> > DATA_OUT;
  sc_out<bool> DATA_EN_OUT;

  void run();

  SC_CTOR(SAMPLE_DESIGN) {
    SC_CTHREAD(run, CLK.pos());
    async_reset_signal_is(RST_N, false);
  }
};
void SAMPLE_DESIGN::run()
{
  bool data_mode;

  // RESET SECTION
  wait();
  data_mode = false;

  while (1)
  {
    BEGIN_PROCESSING: data_mode = DATA_MODE_IN.read();
    if (DATA_EN_IN.read() == true)
    {
      if (data_mode == false)
      {
        TEMPLATE_SHIFT_LOOP: for (int i = CORRELATION_LENGTH-1; i > 0; i = i-1)
        {
          TEMPLATE_SRG[i] = TEMPLATE_SRG[i-1];
        }
        TEMPLATE_SRG[0] = TEMPLATE_IN.read();
      }
      ACCUMULATOR = 0;
      DATA_EN_OUT.write(true);
      DATA_OUT.write(ACCUMULATOR >> N_DATA_BITS);
      OUTPUT_EN_HIGH: wait();
    }
    DATA_EN_OUT.write(false);
    END_PROCESSING: wait();
  }
}
A key problem in system-level design is partitioning, which is the division of the system's functions into
parts implemented using different component technologies and implementation styles. This includes the
partitioning of the system into, for example, software and hardware, as well as the partitioning of
hardware into parts to be implemented with different technologies (analog/digital technology,
FPGA/ASIC, etc.) so that the result is optimal in terms of cost, function and performance. The
comparison of different partitioning options can be done by comparing the system-level models created
for them.
Example of SW/HW partitioning. A SoC shall provide support for an audio application that
includes a digital filter algorithm that contains 128 additions and multiplications. The sampling
frequency of the filter is 48 kHz. The circuit has a clock frequency of 26 MHz and uses an
ARM7TDMI processor core as its central processing unit. Is it advisable to implement the filter
in software on the ARM, or should a separate hardware-based filter block be designed for
the circuit?
The circuit must perform 48,000 * 128 = 6,144,000 additions and multiplications per second.
According to the ARM7 processor documentation, an addition is performed in one clock cycle,
but a multiplication can take five clock cycles in the worst case. In total, the execution of the
filter algorithm would therefore require up to 6,144,000 * (1 + 5) = 36.9 million clock cycles per
second. That is far more than the 26 million cycles available, so a software-based implementation of
the filter isn’t possible. A separate hardware accelerator block can compute the algorithm in
principle in one clock cycle if it is implemented using 128 combinational logic multipliers and
adders.
In principle, the function of a system can be described as accurate algorithms only after
software-hardware partitioning, as the implementation of the SW/HW interface and other interfaces
has a considerable effect on the model. Today, however, communication between different parts of
systems is usually modeled using transaction-level modeling (TLM), in which the details of data transfer
protocols, often irrelevant at this stage, are represented using simplified models, such as function calls
(e.g. read() and write()), instead of register transfers synchronized to clock signals and enabled by control
signals. After that, the operation of the entire system model can be studied by simulating it even before
choosing its high-level architecture.
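As an illustration of the idea, the sketch below shows a transaction-level bus model written in
SystemVerilog; the interface, task, and signal names are invented, and the sketch does not follow the
SystemC TLM standard. The point is that a whole bus transfer is abstracted into a single read() or
write() call instead of cycle-accurate signal activity.

// Hypothetical transaction-level bus model: one task call corresponds to one bus transfer.
interface bus_if;
  logic [31:0] mem [0:255];                // simplified memory model behind the bus

  task automatic write(input logic [7:0] addr, input logic [31:0] data);
    mem[addr] = data;                      // no clocking or handshake signals modeled
  endtask

  task automatic read(input logic [7:0] addr, output logic [31:0] data);
    data = mem[addr];
  endtask
endinterface

module tlm_example;
  bus_if bus();                            // instantiate the transaction-level bus model
  logic [31:0] rdata;

  initial begin
    bus.write(8'h10, 32'hCAFE_F00D);       // the caller sees only read()/write() calls
    bus.read (8'h10, rdata);
    $display("read back %h", rdata);
  end
endmodule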
Register-Transfer Level
In a system-level model, the function of each part of the system is presented as a single behavioral
description. It shows in the form of an algorithm which computations or other operations the part in
question performs without specifying which components, or resources, are used for the computations.
In those parts of the system that have been chosen to be implemented in hardware, the implementation
of the algorithm is a digital circuit that consists of combinational logic computing units (e.g., multipliers or
ALUs), data storage units (registers, register banks, memory blocks), communication channels and a
finite-state machine that controls the operation of the circuit. The computing units and data storage
units are called the circuit's data path, and the control unit the control path. The data and control paths
form the architecture of the digital circuit. This kind of description of the logic architecture of a digital
circuit is called a register-transfer level (RTL) model. The RTL model is most commonly designed without
automated design tools by creating a block diagram of the circuit architecture, but the use of high-level
synthesis (HLS) programs that automatically form an RTL model from a C or C++ description is growing
(albeit slowly).
The RTL model describes the operation of a digital circuit as data transfers between data storage units at
specific times. The model defines a set of memory locations (registers), the communication between
memory locations by control steps (clock cycles) and the next control step to be performed. Data
transfer transactions are either direct storage operations or they may involve arithmetic or logical
operations. These operations take place in combinational logic blocks. Typical combinational operations
are addition, subtraction and multiplication, comparisons, code conversions and identifications (Boolean
functions), and multiplexing. The control unit determines the schedule for these operations. In practice,
performing a function means allowing a particular register to be loaded at a certain time. The control
unit also controls the operation of the combinational logic parts. For example, it may determine whether
a combinational logic block performs a computation or only a data transfer.
The RTL model is usually described using a hardware description language (VHDL, SystemVerilog). Data
storage units (registers) are described by variables (or in VHDL usually by signals) and computational
units (combinational logic) by expressions consisting of arithmetic and logical operations and conditional
constructs of programming languages. RTL languages provide powerful tools for describing data
processing in combinational logic, but in other respects, writing RTL code is a disciplined task due to the
synthesizability requirement. Virtually all RTL models consist of combinational and sequential logic
processes similar to the ones shown below, in which the designer only needs to fill in the section
describing the operation of the combinational logic.
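The sketch below shows what such process templates typically look like in SystemVerilog; the signal
names are generic placeholders rather than code from any particular design.

// Generic sequential process template: only the reset values and the
// next-state assignments change from one design to another.
always_ff @(posedge clk or negedge rst_n)
begin
  if (rst_n == '0)
    state_r <= '0;              // reset value of the register
  else
    state_r <= next_state;      // next-state value computed by combinational logic
end

// Generic combinational process template: the designer fills in the
// expressions that compute the outputs from the inputs and registers.
always_comb
begin
  next_state = state_r;         // default assignment avoids unintended latches
  if (enable_in == '1)
    next_state = state_r + 1;   // example of a next-state expression
end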
Unlike a system-level model, an RTL model is both "bit-accurate" and "clock-cycle-accurate," meaning
that all signals on its external interfaces and their timing are defined with the accuracy with which they
operate in the finished circuit. Within a model, the description may not be bit-accurate, as models can
use scalar types (such as integers) or enumeration types, the number of bits of which is not fixed until
the gate-level model is created. Starting with an RTL model, a synthesis program can be used to first
create the functional blocks and the Boolean logic equations for the operations described as expressions
in the blocks, and then to synthesize an optimized gate-level model for the circuit. The following code
sample shows the SystemVerilog RTL model, which consists of one sequential block (always_ff) and one
combinational logic block (always_comb).
module srg9
  (input  logic       clk,
   input  logic       rst_n,
   input  logic       shift_in,
   input  logic       init_in,
   input  logic       data_in,
   output logic [8:0] data_out,
   output logic       check_out);

  logic [8:0] srg9_r;                  // shift register (declaration assumed; lost in the
                                       // page break of the original document)

  assign data_out = srg9_r;            // assumed connection of the register to data_out

  always_ff @(posedge clk or negedge rst_n)
  begin
    if (rst_n == '0)
      srg9_r <= '0;                    // assumed asynchronous reset value
    else
      begin
        if (init_in == '1)
          srg9_r <= 9'b110000001;
        else if (shift_in == '1)
          begin
            srg9_r[8:1] <= srg9_r[7:0];
            srg9_r[0]   <= data_in;
          end
      end
  end

  always_comb
  begin
    if (srg9_r[8] == '1 && srg9_r[7:0] == 8'b10000001)
      check_out = '1;
    else
      check_out = '0;
  end

endmodule
Gate Level
A gate-level model of a logic circuit describes the circuit as interconnected physical logic gates and
flip-flops that can be, for example, components from an ASIC manufacturer's component library. The
gate-level description is a pure structural description. The traditional representation for a gate-level
model is a schematic diagram, but in practice a structural HDL description, called a netlist, is used, as
digital circuits usually contain too many components for schematic representation to be viable. The
gate-level model can be used as input to a logic simulator, timing analysis program, or placement and
routing program.
The following code sample shows a Verilog-language gate-level netlist file synthesized from the RTL
model of the srg9 component. It contains only component instantiation statements that refer to the
circuit manufacturer's library components, such as DFF_X1 and NAND3_X1.
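Because the synthesized netlist itself is not reproduced here, the sketch below shows what such a file
typically looks like. The instance names, net names, and cell pin names are illustrative only; only the
cell names DFF_X1 and NAND3_X1 come from the text above.

// Illustrative gate-level netlist sketch (instance, net, and pin names are assumed).
module srg9 (clk, rst_n, shift_in, init_in, data_in, data_out, check_out);
  input clk, rst_n, shift_in, init_in, data_in;
  output [8:0] data_out;
  output check_out;
  wire n1, n2;

  // Each statement instantiates one library cell and connects it by pin name.
  NAND3_X1 U1 (.A1(data_out[8]), .A2(data_out[7]), .A3(n1), .ZN(n2));
  DFF_X1   R0 (.D(n2), .CK(clk), .Q(data_out[0]), .QN(n1));
  // ... the rest of the netlist consists of similar instantiation statements
endmodule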
Describing both logical states and timing is more accurate in gate-level simulation than at the
register-transfer level. RTL simulation most commonly requires only the logic states 0 and 1, and an
undefined state (U in VHDL, 'X in SystemVerilog) to describe a situation where a signal is uninitialized. Gate-level models
must also take into account gate output impedance information, as there may be structures in the
circuit where gate outputs are connected together (three-state buses). For this reason, the concept of
strength is added to the model of logic states of signals in simulators. The most often used strength
values are
● forcing, which means that a node is connected directly to the supply voltage or the ground, and
is able to charge or discharge the node infinitely fast (such as the normal logic gate output),
● non-forcing or weak (in VHDL, 'H' or 'L'), which means that the node is connected to the supply
voltage or ground through a resistor, and is capable of charging or discharging the node at a
finite rate if the node is not controlled by a stronger output, or
● high impedance, which means that the node is not connected to the supply voltage or the
ground at all.
The logical state of the node can be solved according to the following rules:
● if all driving signals are in the same logical state, that state is maintained and its strength is the
strength value of the strongest signal,
● if the signals have different values, the strongest signal value is dominant,
● if the signals have a different logical state and the same strength, the value is unknown.
In addition, it should be noted that connecting two "strong" signals to the same node is always an error
(short circuit).
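The sketch below illustrates these rules with a hypothetical three-state bus that has two drivers; the
module and signal names are invented.

// Two drivers on one net: each drives either data or high impedance ('z).
module tristate_bus_example;
  wire  bus;                            // resolved net; the simulator applies the strength rules
  logic en_a, en_b, data_a, data_b;

  assign bus = en_a ? data_a : 1'bz;    // driver A
  assign bus = en_b ? data_b : 1'bz;    // driver B

  initial begin
    en_a = 1; data_a = 1; en_b = 0; data_b = 0;
    #1 $display("bus = %b", bus);       // only A drives: bus = 1
    en_b = 1;                           // both drive with equal (forcing) strength
    #1 $display("bus = %b", bus);       // conflicting strong drivers: bus = x
  end
endmodule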
Gate-level models are also more accurate in timing than RTL models, in which functions are represented
as delayless. In the RTL model, all state changes occur at the (rising) edge of the clock signal, but
in the gate-level model only after the propagation delay of each component. For this reason, gate-level
models reveal hazards (glitches) caused by logic gate delays, and unknown states describing metastable
states caused by violations of flip-flop timing requirements.
The value of the delay in gate-level models consists of the gate's internal propagation delay and the
delay due to switching. The value of the internal delay can be determined relatively accurately on the
basis of the transistor level structure of the gate. The delay due to switching is caused by the time it
takes to charge or discharge the parasitic load capacitance at the gate's output. This load consists of the
capacitance of the wiring and of the inputs of the gates in the gate's fanout. Before routing of the design,
this delay can be estimated in the simulation from the number of gates in the fanout. After the design
has been routed, the parasitic capacitance values can be calculated from the wire lengths and the input
capacitances of the load gates.
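As a rough first-order model (the symbols below are generic, not taken from any particular cell library),
the total gate delay can therefore be written as

    t_pd ≈ t_internal + R_drive * C_load,   where   C_load = C_wire + sum of the input capacitances of the fanout gates.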
In a conventional design flow, a gate-level model is generated from the RTL model almost without
exception automatically using a synthesis program. The gate-level model can thus in principle be
assumed to be error-free. In practice, this is not always the case. The RTL code used for the synthesis
may have contained errors that do not cause problems that can be observed in ideal RTL simulation. The
most typical error is latch components produced in synthesis from an incorrect HDL description of a
combinational logic function, which may cause timing errors (violations of latch setup or hold times).
These can only be detected in the gate-level simulation. Another issue that can cause differences
between RTL and gate-level models is the ambiguities included in the RTL code, such as don't care states
of logic functions. A third potential source of errors in gate-level simulation is combinational
logic hazards that may cause erroneous effects in signals intended to be edge-sensitive. Because of
these potential errors, a gate-level model must always be carefully verified, even if the RTL model has
already been found to be error-free.
Circuit Level
A circuit-level model describes the structure of a circuit as a network of transistors, resistors, capacitors
and inductors. In logic design the circuit level is often called the switch level, as the transistors are
viewed mainly as electronic switches. Whereas a logic-level model deals with the logical states of the
signals as a function of time, the circuit-level model considers currents and voltages associated with the
signals.
In digital circuit design, circuit-level design is rarely done, but circuit-level models are often used in
design verification tasks, such as simulation, timing analysis, and power consumption estimation. Since
these models are incorporated into library components, it is not necessary for a designer using a design
program or simulator to know them very accurately. If the designer wants to create a custom logic
component, it must be designed from transistors, and its operation must be analyzed with a circuit-level
simulator. This gives an idea of, for example, the propagation delays of the component, i.e. the rise and
fall times of the outputs as a function of the transition times of its input signals and the load on the
outputs.
The figure below illustrates the difference between the gate-level and circuit-level models in timing
analysis. In the circuit-level model, conductors can be modeled, for example, as distributed transmission
lines, in which case their delays (rise times) can be calculated more accurately than in the gate-level
model.
Physical Level
A physical model contains the information needed to fabricate a circuit. The physical model of an
integrated circuit is its layout design. It consists of a large number of polygons that define the patterns in
the exposure masks used in the photolithographic manufacturing process. The model is thus geometric
and no longer contains a logical or even electronic description of the function of the circuit. In the case
of an FPGA circuit, the physical model is a circuit configuration file that contains information to be
loaded into the circuit's configuration memory.
The layout of an integrated circuit is designed with a layout editor program. To support the design, many
kinds of utility programs are used to verify the correctness of the circuit design. Design rule check (DRC)
programs can be used to verify that the drawn circuit patterns comply with the rules set for the
minimum line widths and distances in the manufacturing process. With a netlist extraction program, a
transistor-level netlist and a simulation model can be formed from the geometric circuit patterns. A
layout-versus-schematic (LVS) comparison program can be used to ensure that the transistor-level
circuit diagram and the designed circuit patterns are identical in their connections.
The design of digital circuits today is based on a so-called standard cell principle, in which the layout
patterns are formed automatically based on a gate-level model by placing the circuit patterns of the
logic components in rows in a "circuit floorplan" and wiring them together. For this reason, digital design
usually does not require the full-custom design programs mentioned above. An automatic placement
and routing program is used instead.
The figure below shows a small part of the layout patterning of a circuit implemented with standard cell
technology for two wiring layers. The green boxes shown below represent the diffusion layers of
standard cells (in-silicon transistors whose layout patterns are not shown in the figure). The red and blue
lines represent the wires to be created for the second and third metal layers (the 1st metal layer is used
inside the standard cells). The small red and light green squares represent vias punched through the
insulation layers.
Physical level models are used in the design of digital circuits mainly for design verification. The lengths
and widths of the wires can be calculated only from the finished layout patterns, and the delays due to
the parasitic capacitances and resistances of the wiring can be estimated from this information. These
delay values can be back-annotated to the values of the timing parameters of the components in a
gate-level model for use in post-layout simulation or static timing analysis. In general, circuit
manufacturers require that the timing of a circuit be verified using these exact delay values before the
circuit is manufactured.
Level | Content of the model | External interface | Data types / values | Timing
RTL | Data transfer or processing between registers in clock cycles | Bit-accurate, timed by clocks | Bit vectors or types whose bit count can be determined at compile time | Synchronized with clock edges
Circuit level | Transistor-level circuit diagram | Bit vectors, voltage and current | Voltage and current signals | Continuous
Physical level | Layout patterns (IC) or configuration file (FPGA) | I/O cell and pin locations (IC) or I/O cell configuration information (FPGA) | Geometric locations and dimensions (IC), manufacturer-dependent file format (FPGA) | No
In practice, tasks related to the design process are either design tasks that create new, lower-level
models based on higher-level models, or verification tasks that ensure that the created models function
according to the requirements specification or higher-level models. The amount of work required for
verification is usually much larger than that required for design. Many analysis tasks are also performed
at all levels to evaluate the results of the previous design phase, or to generate information that can be
used to make decisions with respect to the next design phase.
The design of large digital circuits requires, and generates, a huge amount of data, the management
which is important for the success of the entire design process. For this reason, the design flow is usually
implemented using "scripts" that control the operation of EDA programs. Almost all EDA programs
support the use of TCL (tool command language) for issuing program commands. The use of TCL scripts
has the advantage that, in repetitive tasks, the designer does not have to remember which files to
load into the program, which commands to give and in what order, and which files the
program must finally save.
The following figure shows the design flow used in the Digital Techniques 3 course, and the programs
and scripts required for it, as well as the information they use and produce. Design information is shown
as gray symbols. Design tasks that transform information from a higher level of abstraction to a lower level
are marked with blue symbols. The dark blue arrows represent the tasks that the designer must perform
manually. Light blue arrows describe the tasks performed automatically by design programs. Red
symbols and arrows indicate the design verification flow. The EDA program and script used are marked next
to the automatically performed design and verification tasks.
Components of a System-on-a-Chip
The illustration below shows a typical (but simple) system-on-a-chip architecture. It includes the
following components:
● A processor core (CPU) that runs the system's operating system and application programs
● A system bus that connects the processor via an external memory controller (MEM IF) to
external memory components and to other functional parts of the system
● Special hardware blocks (HW1, HW2, ...) that implement application functions that cannot be
implemented as software in the CPU because of performance or power consumption
requirements.
SoC bus solutions are usually "multi-level". Functional parts of the circuit that need to be able to transfer
a lot of data to the system's main memory (external SDRAM) are placed on a high-performance bus (HP
SYSTEM BUS). Such components are usually equipped with a master interface, which means that they
can independently transfer data directly to or from memory (DMA, direct memory access). Thus, the
central processing unit does not have to read the data from the hardware block first and then write it to
memory or vice versa, as the block can do it itself. This enhances the performance of the circuit as it
frees the CPU from executing data transfer routines. The CPU usually includes an internal cache memory
for program code and data, so it does not need to use the bus continuously, allowing the bus to be made
available to other masters from time to time. A multi-master bus needs a bus arbiter to operate, which
decides which master is allowed to use the bus at any given time. The master blocks connected to the
bus also include a slave interface that the CPU uses to control them.
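As a simple illustration of the arbiter concept (a sketch only, with invented signal names, not a model
of any real bus standard), a fixed-priority arbiter could look like this:

// Minimal fixed-priority bus arbiter sketch: requests from three masters,
// exactly one grant active at a time, master 0 has the highest priority.
module simple_arbiter
  (input  logic       clk,
   input  logic       rst_n,
   input  logic [2:0] req,      // bus requests from three masters
   output logic [2:0] grant);   // one-hot grant back to the masters

  always_ff @(posedge clk or negedge rst_n)
  begin
    if (rst_n == 1'b0)
      grant <= '0;
    else if (req[0]) grant <= 3'b001;   // highest-priority master
    else if (req[1]) grant <= 3'b010;
    else if (req[2]) grant <= 3'b100;
    else             grant <= '0;       // bus idle
  end
endmodule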
SoCs also contain a large number of blocks with lesser data transfer needs. Such blocks are not
placed on the "main bus" of the circuit, which uses a clock frequency of hundreds of megahertz and
often a large data-bus word width. The "slow" blocks are placed on the peripheral bus (LP
PERIPHERAL BUS), which can use a small word width (e.g. 16 or 32 bits) and a low clock frequency. This
reduces the silicon area of the circuit and its power consumption. Typically, components of the
"basic infrastructure" of a computer, such as a boot ROM (memory that contains code executed by the
processor after a reset), timers, an interrupt controller, and general-purpose interfaces (GPIOs), are
placed on the peripheral bus. External serial interfaces (UART, I2C, etc.) are also typically placed on the
peripheral bus. The peripheral blocks contain only a slave interface. The peripheral bus is connected to
the main bus with a bridge component (BRIDGE), which converts the main bus protocol to the peripheral
bus protocol. The bridge includes a slave interface on the main bus side and a master interface on the
secondary bus side.
The buses of modern system circuits are very complex in structure and protocol, so their design is a
demanding task. It is further complicated by the fact that different levels of buses often use different
clock frequencies, resulting in the need for synchronization between different parts of the circuit.
However, the design of the blocks to be placed on the buses is streamlined by the standard bus solution,
as the blocks can be designed independently of the actual SoC design. In addition, they can be verified
against the standard bus model, so that they will work with high confidence when integrated into the
complete SoC.
The design of an IP block to be placed on the bus includes the design of the bus interface and the
associated register bank. In addition to this, the actual application logic of the block must, of course, be
designed. At its simplest, it can consist of a few registers and counters (e.g., an I2C interface); at its most
complex, it can be a hardware accelerator with tens of millions of gates (e.g., a video codec, a 3D graphics
accelerator, or a 4G LTE baseband processor).
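As an illustration of what the register-bank part of such a block might look like, the sketch below shows
a minimal memory-mapped register bank with a generic write/read interface; it does not follow any
particular bus standard, and all names are invented.

// Minimal memory-mapped register bank sketch (generic interface, not a real bus protocol).
module regbank
  (input  logic        clk,
   input  logic        rst_n,
   input  logic        wr_en,       // write strobe from the bus interface
   input  logic  [1:0] addr,        // register select
   input  logic [31:0] wdata,
   output logic [31:0] rdata,
   output logic [31:0] ctrl_out);   // control register value to the application logic

  logic [31:0] ctrl_r, status_r;

  always_ff @(posedge clk or negedge rst_n)
    if (rst_n == 1'b0)
      ctrl_r <= '0;
    else if (wr_en && addr == 2'd0)
      ctrl_r <= wdata;               // CPU writes the control register

  assign status_r = 32'h0000_0001;   // placeholder status from the application logic
  assign ctrl_out = ctrl_r;

  always_comb                        // read multiplexer
    case (addr)
      2'd0:    rdata = ctrl_r;
      2'd1:    rdata = status_r;
      default: rdata = '0;
    endcase
endmodule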
In addition to hardware design, IP block design involves a considerable amount of software design. To
make a block usable, a device driver must be designed for the operating system used in the final
product. The device driver implements the functions that make it possible to read and write the IP
block's register bank through the device interface of the operating system. For example, in the Linux
operating system, peripherals appear as files in the file system, and can be handled like files with
operating system system calls (such as Linux's open, close, read, write). The device driver must convert
these system calls into reads and writes to the actual memory addresses located in the address range of
the register bank of the IP block. If a hardware block is designed as an IP product that is to be sold to
many SoC manufacturers, device drivers and higher-level program libraries will have to be designed for
many operating systems.
In the case of a complex hardware block, the block must also be accompanied by a higher-level
application program interface (API) that facilitates the use of the block in the design of application
programs. For example, in the case of a graphics accelerator, the API may include library functions that
implement the functions of a graphics programming standard (e.g., OpenGL) by using the hardware
block to accelerate their performance.
Thus, even an individual IP block's "deliverables" to a SoC contain a lot of design information, regardless
of whether the customer is the company's own SoC design project or an external party. A summary of
this information is shown in the figure below.
The hardware design itself (HARDWARE IP in the figure) contains the RTL code of the block and the
synthesis and other EDA tool scripts using which the RTL design can be implemented in an optimal way.
The RTL code requires a testbench to allow the customer or user to verify that the design works.
A SoC contains dozens or even hundreds of IP blocks that are connected to its buses. For each block, the
data shown in the figure above is needed to compile the RTL model, synthesis script, test bench etc. for
the whole circuit. Creating, managing, and verifying all of this data is challenging, but the platform-based
design method with standardized interfaces facilitates the use of data originating from different sources.
A standard (IP-XACT) has also been developed for the management of this kind of design information.
SoCs require software, such as device drivers and application programs or program libraries, to operate
IP blocks. Their design and verification requirements must be considered in circuit design, as software
development and testing cannot wait for the completion of the circuit design. For this reason, SoC
design must be carried out so that software can be tested with a realistic hardware model at the earliest
possible stage. This requirement also affects the design of IP blocks.
The following figure illustrates the SoC design flow. It is divided into the design of the SoC platform itself
and the design of individual IP blocks.
simulated in an HDL simulator. However, this is slow, which is why hardware emulation is
commonly used for verification. The RTL model is implemented in an emulator, which can be
FPGA or special processor based, after which the model can be executed much faster than in a
simulator. It is therefore possible to test the software with real hardware.
3. Physical design, in which the RTL code of the entire circuit is synthesized and the layout for the
circuit is designed. Physical design includes many demanding design and verification tasks, such
as testability design, power management design, and timing analysis.
The design of IP blocks can be done separately from the design of the SoC itself if their interfaces are
based on standardized interface solutions and protocols. Once all the "deliverables" have been designed
for an IP block, as described above, it can be used in all SoCs that use the same bus standards. The main
phases of the IP block design are:
1. In the first step, a system-level model operating on the TLM principle is created for the IP block.
This is needed to verify the software on a virtual platform, but it also serves as a "golden
model" of the IP block itself, i.e., a specification that can be simulated.
2. In the next step, an RTL implementation of the IP block is designed. This can only be done when
a bus standard has been selected for the SoC platform. The IP block is verified in its own
testbench, in which the bus master and other SoC components are modeled as needed. The IP
block simulation environment is most commonly implemented using the Universal Verification
Methodology (UVM), which is based on the SystemVerilog language. Many other verification
methods are also used for IP blocks' RTL verification, such as assertion-based verification and
formal verification.
3. Implementation of a physical prototype of an IP block. In principle, it is possible to physically
design an IP block separately from the SoC design by placing the IP block's layout model into the
SoC's layout model. However, it is more common to incorporate the RTL code of an IP block into
the RTL code of a SoC, and perform physical design for the entire circuit at once. However, in
connection with the design of an IP block, it is necessary to carry out all the steps of physical
design, such as logic synthesis and layout design, to check the feasibility of the block and to
obtain information about its physical properties.
Summary
Digital system design is a demanding task that involves simultaneously solving many different
problems related to functionality, performance, power consumption, design time, manufacturing cost,
and testability. Systems are often implemented as single integrated circuits, which is why all these
problems must be solved within a single design project, as no changes can be made to the design once it
is completed.
SoC design covers the levels of abstraction from the system level to the physical level. For this reason,
the various stages of design and verification are always performed in companies by designers or design
teams specializing in different stages. Some design steps, such as the physical design of the circuit, may
also be outsourced to outside service companies.
The design of complex SoCs is possible thanks to the extensive use of standards. This means two things.
Interfaces between circuit blocks are based on standardized bus solutions, allowing IP blocks to be
designed independently so that they can be easily connected to any SoC platform using the same bus
standard. In addition to this, the representation of design information is also standardized, which allows
information to be easily transferred between levels of abstraction or design teams during a design
project.
The majority of SoC design consists of the design of IP blocks and the verification of their operation. This
mainly means RTL design and verification. The course Digital Techniques 3 provides an in-depth
introduction to the methods and tools used for that. In addition to this, the course covers IP blocks'
physical design, minimization and analysis of power consumption, and testability design. Although these
design steps are ultimately performed only in the context of the physical design of the entire SoC, the IP
blocks' properties must be estimated and their feasibility ensured already in their design phase.
RTL Design and Verification Flow
Phases of the RTL Design and Verification Flow
Specification Phase
General Requirements Specification of a Design
Specification of the Functional Requirements of a Design
Interface Specification
Functional Properties Specification
Design Phase
Naming Conventions for Design Information
RTL Architecture Design
Registers and Signals of an RTL Design
Combinational and Sequential Blocks
Module Hierarchy
RTL Assertions
Verification Plan Design
Verification Methods
Functional Verification
Formal Verification
Contents of the Verification Plan
Testbench
Test program
Assertions for Functional Properties
Coverage Model
Coding Phase
Common Definitions of the Design
RTL Coding
Structural RTL Coding
Functional RTL Coding
Common Constructs Used in RTL VHDL and SystemVerilog Code
Writing Functional RTL Code Based on the RTL Architecture Specification
Verification Plan Coding
Testbench Coding
Test Program Coding
Assertions Coding
Coverage Model Coding
Verification Phase
RTL Simulation
Formal Verification
Coverage Analysis
Summary
In practice, the design contains other information, such as a description of the communication protocols
and data representations used by the inputs and outputs, but this type of information is already
implicitly given by the information listed above, so the list can be said to be an adequate description of an
RTL design. The contents of a design are therefore not difficult to understand from a
theoretical point of view. The difficulty of designing a digital circuit is not so much due to the nature of
the properties of the circuits but rather to their large number, with the result that a lot of simple human
errors are inevitably made in their design. For this reason, the aim is to organize the design process in
such a way that the chances of making mistakes are kept to a minimum and that the chances of
detecting them using automated tools are maximized. This chapter introduces the phases of such an RTL
design and verification flow and the methods used in it. In the later stages of the course Digital
Techniques 3, the design phases and methods presented in this chapter will be studied in more detail.
Phases of the RTL Design and Verification Flow
Each company designing digital circuits has its own design and verification flow that defines the order
and content of the different phases of work. This flow is constantly evolving due to, for example,
changes in the nature of design, the accumulation of experience and the development of EDA tools. In
this section, the flow shown in the following figure is used to illustrate the content of the design and
verification flow, covering the most important phases of a modern design and verification flow. The flow
is divided into four phases: specification, design, coding, and verification. In reality, the flow would
certainly be more diverse, and the order of some of its phases could be different. The contents of the
figure are explained in the following sections.
Specification Phase
The specification phase includes tasks that are performed before the actual design begins. Initially, a
general specification of the requirements of the design is made, which may be, for example, a
techno-economic study on the feasibility of the circuit or IP block and the economic conditions for its
implementation. If, based on the general requirements specification, a decision is made to implement
the design, a functional requirements specification is created for it, which accurately describes the
required function of the circuit or IP block.
The memory controller converts the write and read transactions performed by the bus controller on
the APB to the corresponding transactions on the memory bus. Part 3 of the figure shows, by way of
example, one basic write access, and one read access involving two wait states (in the figure,
indicated by the notation W on the signal pready_out). To reduce power consumption, the memory
controller should keep the state of the SRAM address bus addr_out and of the APB read data bus
prdata_out unchanged when no bus transactions are targeted at the memory controller. Examples of
this function are shown with purple arrows in the figure. Only one 16-bit memory location is accessed
at a time (the LB and UB inputs of the SRAM are active each time the memory is used).
Name | Direction | Width | Type | Description
clk | input | 1 | Logic | Rising-edge-sensitive clock signal input that clocks all flip-flops. This clock corresponds to the PCLK clock defined in the AMBA APB Specification.
rst_n | input | 1 | Logic | Active-low, asynchronous reset signal input that resets all flip-flops.
psel_in | input | 1 | Logic | AMBA APB PSEL signal. See the AMBA APB documentation.
Below are a few examples of functional requirements specifications of the sramctrl design presented
above. The table uses the name feature for these to distinguish them from the SystemVerilog property
statement, which is used to formally describe these features later.
Feature | Description
f-wctrl | The number of wait states that the sramctrl inserts in the next APB transaction is defined as a binary number present at input wctrl_in. The wctrl_in input is assumed to remain stable throughout APB transactions.
f-addr | The APB address in paddr_in[17:0] is placed on the SRAM address bus addr_out at the beginning of the APB SETUP phase when psel_in == '1. This address shall be kept at addr_out until a new APB read or write access to the sramctrl occurs, as indicated by psel_in == '1, and a new address is placed in addr_out.
Design Phase
The design phase is divided into RTL architecture design and verification plan design, which are carried out concurrently.
They are made on the basis of a functional requirements specification.
The RTL architecture could in principle be described by listing its registers and defining the next-state
expressions of their data inputs, but this method is not very practical. RTL designers prefer to use
various combinational and sequential logic blocks as building blocks for design. The representation of
the design should also be based on the same principle. The structure of the RTL code created on the
basis of the design will then also match the architecture designer's intention.
An RTL architecture can be described in textual format, and a textual presentation is usually necessary,
but it is often supplemented with block diagrams, truth tables, state charts, ASM graphs, and other
representations that clarify the textual documentation.
An RTL architecture usually contains separate combinational logic blocks that do not merely act as the
next-state encoding logic for one register. The outputs of such blocks can be described in RTL designs by
signals, for which the number of bits and the data type must also be defined. Connections made
with plain wires can also be represented in RTL architecture with signals. The signals are implemented in
RTL code in the same way as the registers. The coding practices for their names vary more than in the
case of registers.
Below is an example of defining signals that represent registers (kind = sequential) and signals that
represent the outputs of combinational logic.
Name | Kind | Width | Type | Reset | Description
addr_r | sequential | 18 | logic | 0 | Address register that holds the SRAM address read from paddr_in[17:0].
addr_ns | combinational | 18 | logic | - | Next-state value of addr_r that also drives addr_out.
wdata_r | sequential | 16 | logic | 0 | Write-data register that holds data to be placed on data_inout.
wdata_ns | combinational | 16 | logic | - | Next-state value of wdata_r that also drives data_inout.
The definition of a combinational logic block must include the following information:
1. name of the block that shall be used as an identifier of the block in RTL code
2. names of the block's inputs, that is, the registers and signals whose values are used to compute
the value for the block's outputs
3. names of the block's outputs, that is, the signals for which a value is computed in the block
4. the values assigned to the block's outputs, which can be given as text, as RTL- or pseudocode, as
a truth table, or by other means.
The definition of a sequential logic block must include the following information:
1. name of the block that shall be used as an identifier of the block in RTL code
2. name and behavior of the clock and reset signals of the block
3. names of the block's inputs, that is, registers and signals whose values are used to compute the
next-state value of the block's registers
4. the next-state values assigned to the block's registers, which can be given as text, as RTL- or
pseudocode, as a truth table, or by other means.
Feature Label: r-regs
Function: sequential
Block Name: registers
Inputs: next_state, addr_ns, wdata_ns, rdata_ns, wctr_ns, we_ns
Outputs: state_r, addr_r, wdata_r, rdata_r, wctr_r, we_r
Functional Description: Sequential process that resets all registers asynchronously if rst_n == '0, or assigns to them their respective next-state values on the rising edge of the clock.
Clock: clk
Reset: rst_n
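Written in SystemVerilog, the r-regs definition above translates to roughly the following sequential
process (a sketch; the declarations and widths of the registers and next-state signals are defined
elsewhere in the design):

// Sketch of the "registers" sequential block defined by r-regs.
always_ff @(posedge clk or negedge rst_n)
begin : registers
  if (rst_n == 1'b0)
    begin                       // asynchronous reset of all registers
      state_r <= '0;
      addr_r  <= '0;
      wdata_r <= '0;
      rdata_r <= '0;
      wctr_r  <= '0;
      we_r    <= '0;
    end
  else
    begin                       // load next-state values on the rising clock edge
      state_r <= next_state;
      addr_r  <= addr_ns;
      wdata_r <= wdata_ns;
      rdata_r <= rdata_ns;
      wctr_r  <= wctr_ns;
      we_r    <= we_ns;
    end
end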
In both of the above cases, the 4th item in the list is the most critical, as the functional part of the
block's RTL code, as well as the assertions used to verify the block, are written based on it. For this
reason, the description of the function must be as unambiguous as possible. The textual description can
be supplemented with pseudocode, RTL block diagrams, truth tables, state charts or tables, or ASM
diagrams. The function meant by the designer is usually easiest to describe verbally, but using a different
representation alongside it improves the reliability of the document because, for example, the likelihood
of making the same mistake in both versions is lower. Below is an example of using an RTL block diagram
to clarify the function of the wctr_logic combinational logic block mentioned above.
Defining the operation of combinational logic of state machines is one of the most laborious tasks
involved in describing the RTL architecture. The most recommended way to do this is to use an ASM
diagram, as it can be "translated" directly and unambiguously into RTL code. Alternatively, a state table
can be used, which also defines the activity unambiguously. However, it is more difficult to write
comprehensible code based on a state table.
The following is an example of defining the operation of the combinational logic part of a state machine
with an ASM diagram and a state table.
Feature Label: r-fsm-logic
Function: combinational
Block Name: fsm_logic
Inputs: psel_in, penable_in, pwrite_in, wctr_full, state_r
Outputs: pready_out, ce_out, we_ns, oe_out, ub_out, lb_out, wctr_inc, addr_load, rdata_load, wdata_sel, data_en, next_state
Functional Description: Control state machine logic of the sramctrl. The functions are described in a separate ASM chart and state table.
Clock: -
Reset: -
Below are images of the ASM diagram and state table mentioned in the specification shown above.
Although they present the same information, the creation of the state table is clearly more error-prone,
as the structure of the table itself does not in any way reflect its function. On the other hand, it is easy to
see the general operating principle of the state machine from the ASM diagram.
ASM Diagram
State Table
Module Hierarchy
If an RTL design is large, it is usually divided into smaller parts. The complete design is put together by
connecting these parts to each other. Both the top-level model of the design and its subdesigns are
represented as separate modules. Subdesigns can be further divided into smaller parts, so that there can
be several levels in the module hierarchy. Each module corresponds to a module statement in
SystemVerilog and the entity statement in VHDL.
The module partitioning of the design has no effect on its function. However, it is important for the
management of design data and work. A subdesign, implemented as a separate module, can be easily
reused later in other designs. A large entity divided into submodules, in turn, can easily be assigned to
many designers. The amount of design data contained in one module also has an effect on the run-times
of EDA programs. In order to achieve these advantages, the following principle should be followed in
module partitioning:
● each module of the design shall be either a hierarchical or a functional module,
● hierarchical modules shall contain other modules, but no descriptions of combinational or
sequential logic,
● functional modules shall not contain other modules, but only descriptions of combinational or
sequential logic blocks.
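The sketch below illustrates these principles with two invented modules: a functional module that
contains only logic, and a hierarchical module that only instantiates and connects it.

// Functional module: contains only logic, no submodule instances.
module counter8
  (input  logic       clk, rst_n, en_in,
   output logic [7:0] count_out);
  always_ff @(posedge clk or negedge rst_n)
    if (rst_n == 1'b0)      count_out <= '0;
    else if (en_in == 1'b1) count_out <= count_out + 1;
endmodule

// Hierarchical module: contains only instances and their interconnections.
module timer_top
  (input  logic       clk, rst_n, en_in,
   output logic [7:0] count_out);
  logic [7:0] count;

  counter8 u_counter (.clk(clk), .rst_n(rst_n), .en_in(en_in), .count_out(count));
  assign count_out = count;
endmodule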
RTL Assertions
The description of the function of the combinational and sequential logic blocks of the RTL architecture
serves as a guide for the RTL code designer. To make it possible for the coder to make sure that the code
works as required, the architectural properties of the RTL design are described as assertions that are
implemented in verification code as SystemVerilog property and assert, cover, or assume statements.
Property statements can be used to describe sequences and cause-and-effect relationships of logical
states of signals (e.g., "if A is true, and then B is true, C must be true for the next two clock cycles").
assert statements can be used to check whether a property described by a property statement holds in
the design, and a cover statement can be used to track how often it is realized. The assume statement
typically describes the properties that are used to validate the data to be fed into the design. In this text,
the term assertion generally refers to, for example, a combination of a property and an assert
statement.
The assertions describe all the functions defined for each combinational and sequential logic block. These
statements thus formally describe exactly what kind of functionality the RTL designer has intended. The
realization of the assertions can then be verified in the simulation, making coding errors easy to detect.
Assertions that describe the properties of an RTL architectural design are called whitebox assertions
because they can refer to internal registers and signals in the design. Blackbox assertions, discussed later,
can only refer to the design's inputs and outputs.
The table below shows two properties meant to be described by property statements. They are based
on the RTL architecture property that was given the label r-wctr_logic in the example presented
above. The first one describes the increment function of the counter, and the other the (synchronous)
reset function. The Assertion and Cover columns define the names of the assert and cover directives
that activate these property statements. The following "encoding" has been used to designate these: in
the prefix ar_ "a" denotes an assert directive, and "r" an assertion describing a property of the RTL
architecture. Similarly, in the prefix cr_ "c" denotes a cover directive. This naming convention seeks to
ensure that all assertions of a certain type can be referenced, for example, with wildcards in design
programs (e.g., a hypothetical command disable ar_*).
Property | Feature | Functional Description | Assertion | Cover | Clock | Reset
wctr_increment | r-wctr_logic | If wctr_inc == '1, then wctr_ns must be equal to wctr_r + 1. | ar_wctr_increment | cr_wctr_increment | clk | rst_n
wctr_reset | r-wctr_logic | If wctr_inc == '0, then wctr_ns must be equal to 0. | ar_wctr_reset | cr_wctr_reset | clk | rst_n
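Coded in SystemVerilog, these two whitebox properties and their directives could look roughly like this
(a sketch; the surrounding checker or module context is omitted):

// Sketch of the wctr_increment and wctr_reset properties and their directives.
property p_wctr_increment;
  @(posedge clk) disable iff (!rst_n)
    wctr_inc == 1'b1 |-> wctr_ns == wctr_r + 1;
endproperty

property p_wctr_reset;
  @(posedge clk) disable iff (!rst_n)
    wctr_inc == 1'b0 |-> wctr_ns == '0;
endproperty

ar_wctr_increment: assert property (p_wctr_increment);
cr_wctr_increment: cover  property (p_wctr_increment);
ar_wctr_reset:     assert property (p_wctr_reset);
cr_wctr_reset:     cover  property (p_wctr_reset);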
Verification Plan Design
The other concurrent task in the design phase that follows the creation of the functional requirements
specification of the design is the design of a verification plan. Its purpose is to define the means that can
be used to verify if the design complies with the requirements specifications or if an error has been
made in RTL architecture design or coding that causes a required property to not hold. This verification
is done by executing the RTL code in a simulator, or often also by formally verifying that the RTL code
implements the properties described in the requirements specifications, which are coded as assertions.
The verification plan describes how the verification task shall be done with these tools. As a design task,
creating a verification plan is often a more difficult or at least a larger problem than designing an RTL
architecture.
Verification Methods
As the verification plan is based on the verification methods used to implement it, it is appropriate to
provide a brief overview of them.
Functional Verification
In functional verification, the RTL code of the design is executed in a simulator by feeding data from a
test program to its inputs while capturing the values of its outputs and possibly also its internal signals,
and then comparing them with the expected, known values. This comparison can in principle be made in
two ways. The simplest way is to compute the correct results off-line using a separate reference model,
so that a simulation results file can be compared with the file produced by the reference model. Another
way is to execute the reference model in the simulator together with the RTL code of the design being
verified, so that the results can be compared already during the simulation. In the latter case, it can be
easier to locate errors, in addition to which new tests added to the test program will also be performed
automatically with the reference model. If the design is not very complex, instead of comparing it to a
reference model, the data produced by the design can be verified in the same test program that feeds
test data to the design.
In addition to comparison with the reference data, the function of an RTL design can also be verified by
using assertions. In assertion-based verification, the properties described in the functional requirements
specification of the design and the architectural properties presented in the RTL architecture
specification are coded as SystemVerilog or VHDL assertions. The assertions are executed in a simulation
concurrently with the RTL model so that they check the fulfillment of the required properties on every
clock cycle. Thus, the verification of the correctness of the design is not based on a comparison of the
data produced by the design with known-to-be-good data, but on the monitoring of the fulfillment of
the properties.
The verification result obtained by simulation is only reliable if the RTL model has been simulated
sufficiently comprehensively. For this reason, coverage analysis is an important part of design
verification. Coverage refers to how well the test data fed to a design in simulation activates different
parts of the design. It can be measured "mechanically" during the simulation as code coverage. This
means that the simulator keeps a record of which lines of code were executed in the simulation or which
signals changed state, for instance. Such an assessment of coverage is easy because it happens
automatically during the simulation. However, code coverage is not a good enough measure of
coverage, as it does not tell you much about the functions of the design activated during the simulation.
For example, if a design includes two state machines that should change state at the same time when a
particular event occurs, code coverage alone cannot detect this, even though it can tell that both state
changes occurred at some point during the simulation (because those lines of code were executed).
Tracking of events like the one described is known as functional coverage measurement. Functional
coverage can be assessed using assertions by tracking which assertions were fulfilled during the
simulation. Since each statement corresponds to one of the properties described in the requirements
specification of the design, in this way it is found out which properties became covered during the
simulation. The SystemVerilog language also includes a covergroup statement that can be used to build
coverage models that are more complex than assertions.
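For example, a covergroup can record which combinations of values two state registers took during the
simulation; the sketch below is illustrative and its names are invented.

// Illustrative functional coverage model: sample two state registers on every
// clock edge and track which state combinations were exercised together.
covergroup cg_fsm_pair @(posedge clk);
  cp_a:     coverpoint state_a_r;       // states visited by the first FSM
  cp_b:     coverpoint state_b_r;       // states visited by the second FSM
  cross_ab: cross cp_a, cp_b;           // which combinations occurred at the same time
endgroup

cg_fsm_pair cg_inst = new();            // instantiate the coverage model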
Formal Verification
Formal verification is also based on the use of assertions. The idea of formal verification is that the
verification program aims to find a so-called counterexample that shows that a property of the design
does not hold in some situations. For example, if the design is a state machine that should always visit a
state A at least once in every 10 clock cycles, the verification program will be able to prove this property
to be false if it finds legitimate test stimuli for the design that do not make the state machine enter
state A within the required time. Thus, the design is not simulated, but the verification program itself
tries to create test stimuli that make the design work incorrectly. If no such counterexample is found,
the design is deemed to have been proven to be flawless. The main advantage of formal verification
compared to simulation is that its results do not depend on the coverage of test stimuli. Its downside is
the high memory requirement, which is why it is not suitable for verifying very large designs in one
piece.
Coverage can also be measured in formal verification. In this context, it can mean, for example, that the
tool reports how much of the design code was activated during the proving of its properties.
Formal methods also include some other static verification methods used to verify more specific
properties. These include programs that can find RTL coding errors related to e.g. synthesizability,
structural connectivity checking programs, and synchronization structure checking programs.
In the figure, the block DUT contains the design to be verified. The other blocks, as well as the testbench itself,
are parts to be defined in the verification plan. The test program, the results analyzer, and the reference
model are the previously mentioned components necessary for generating test data and verifying the results.
The "CHECKER" block represents the verification modules used to monitor the design, which the
testbench installs inside the design during the simulation. These modules typically contain the assertions
and code needed to measure coverage. The blocks shown in green are design-specific. Of the blocks
shown in gray, the "BFM IP" (bus functional model) represents a functional bus model, which may be,
for example, a SystemVerilog interface model of the processor bus, with which the test program
communicates with the "pin-level" interface of the design. The "VERIFICATION IP" block represents the
functional models of external components needed for the simulation. These can be, for example,
memory circuit simulation models. The gray blocks are often verification IP (VIP) blocks that have been
reused from previous projects or outsourced.
Module-level test benches are usually coded on a case-by-case basis by creating a testbench module
(module statement) and instantiating a test program (program statement), bus models (interface
object), possibly a reference model as its own module, and installing checker modules inside the design
(bind statement). Entire IP blocks are usually verified in a UVM standard-based testbench, which is
assembled using object-oriented programming principles from ready-made testbench components
defined in the UVM class library.
The image below shows the structure of the testbench of the sramctrl design. In this case, there is no
reference model used. A SystemVerilog interface model of the APB bus is used as the bus functional
model. The memory circuit is modeled with a simulation model of a commercial SRAM circuit of which
an instance is created in the testbench module.
Test program
The testbench only determines the framework in which the tests can be executed. A more important
and difficult step in terms of verification is the design of the test program itself. It defines the data that
is fed to the design to verify its correct functionality with sufficient coverage.
The test program performs one or more tests. Tests can be classified as directed and randomized tests.
In a directed test, the test data is selected so that it will activate a particular property in the design. The
result of the test can be checked by reading the data produced by the design from its outputs, or by
checking whether the assertions describing the tested property were fulfilled. A randomized test usually
does not focus on a particular property at a time. Instead, a large amount of test data is created for the
design that it could, in principle, receive under normal use. The test data is randomized with constraints,
which means that it is random, but values or sequences that would not actually be possible have been
removed. The SystemVerilog language has constructs that can be used to constrain random stimuli in
many different ways. The results of random tests can be checked with assertions.
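As an illustration of such constraints (the class and its members are hypothetical, not taken from the example design), constrained random stimulus could be described as follows:

class apb_write_stimulus;
  rand bit [17:0] addr;            // memory address to write
  rand bit [15:0] data;            // data to write
  rand int unsigned wait_states;   // number of wait states requested

  // remove values that could not occur in real use
  constraint c_wait_states { wait_states inside {[0:7]}; }
endclass

// Usage, e.g. inside the test program:
//   apb_write_stimulus stim = new();
//   if (!stim.randomize()) $error("randomization failed");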
Below is a description of the tests performed by the sramctrl design's test program. It describes how the
tests work, the nature of the test data, and how the test results are checked. In this case, the check is
based on checks placed in the test program, which compare the data written to and read from the
memory. In addition, all assertions are checked during the simulation.
Below is an example of assertions based on the functional requirements specification. The property
wctrl_stable is defined to be of type assume because it constrains when the data fed to wctrl_in can be
changed. The Features column refers to the features described in the functional requirements definition.

Property: wctrl_stable
Features: f-wctrl
Description: If psel_in == '1 on the rising edge of clk, then wctrl_in must have been stable on that clock cycle.
Assume: mf_wctrl_stable
Clock: clk, Reset: rst_n

Property: addr_set
Features: f-addr
Description: If (psel_in == '1 && penable_in == '0) in the current clock cycle, then addr_out must be equal to paddr_in[17:0].
Assert: af_addr_set, Cover: cf_addr_set
Clock: clk, Reset: rst_n
Coverage Model
Coverage model design aims to define the properties whose fulfillment is to be measured during the
simulation, and in particular how the measurement is carried out. Coverage can be measured by
monitoring the fulfillment of properties in simulation with SystemVerilog cover statements. More
complex coverage models can be built with covergroup statements. Examples of use cases of
covergroups are tracking how many of the addresses in the memory space were written and read, or
how many of all possible instruction codes were exercised when simulating a processor circuit. Measuring these
kinds of things is especially important when using randomized test stimuli. In addition to the choice of
these measurement methods, an essential part of the design of the coverage model is to define a
coverage computation method and a coverage goal (%) for each gauge.
Covergroup: cg_apb_writes
Instance: cg_apb_writes_inst
Sampled variables: paddr_in[17:0], wctrl_in
Sampling event: on the last clock edge of the ACCESS phase of every APB write transaction, i.e. when psel_in, pwrite_in, penable_in and pready_out are all '1.
Description: Collects coverage on paddr_in[17:0] and wctrl_in, and their cross coverage, i.e. on the addresses and wait state counts used on APB write transactions. One coverpoint 'addresses' for paddr_in[17:0] with a bin for every possible address, one coverpoint 'wait_states' for wctrl_in with a bin for values 0 to 7, and one cross coverage point for the coverpoints 'addresses' and 'wait_states'.
Coverage is measured in a simulator program, so this part of the verification plan is often prepared in a
format used by the simulator. Below is an example of the XML representation used in the QuestaSim
simulator, shown in tabular form. In the Link column are listed those cover and covergroup statements,
and the coverage metrics maintained by the simulator itself that are to be included in the coverage
model. Assertions are selected using wildcard selection (e.g., af_*), so that they do not all have to be named
individually. The Weight column sets a weight value used to compute total coverage for each individual
gauge. The coverage goal is defined in the Goal column.
In the package file of the design example, a parameter and the data type of the state memory of the
control state machine are defined as shown below. This information is referenced in several code files.
package sramctrl_pkg;
localparam int MEMORY_SIZE = 262144;
typedef enum logic [1:0] { SETUP = 2'b01, ACCESS = 2'b10 } state_t;
endpackage
Like the C language, SystemVerilog also supports the inclusion of a file into another file during analysis
with the compiler's include directive (`include). Include files typically contain compiler preprocessor
macros that can be used to simplify the writing of frequently used "code snippets" for instance when
creating assertions. As an example, the definition below creates a macro xcheck:
`define xcheck(name) X_``name``: assert property ( @(posedge clk) disable iff (rst_n !== '1) !$isunknown( name ))
`xcheck(paddr_in);
Without the use of the macro, the assert statement below would have had to be written in the code file.
X_paddr_in: assert property ( @(posedge clk) disable iff (rst_n !== '1)
!$isunknown( paddr_in ));
If dozens or hundreds of such checks had to be written, there would be a considerable benefit to using a
macro.
It is generally not advisable to define constants in SystemVerilog code with define-macros, as the
parameter or localparam definitions placed in packages are better in many respects.
RTL Coding
RTL code implements the designed RTL architecture, usually in either VHDL or SystemVerilog. Since the
description of the architecture should be a complete description of the RTL structure and function of the
circuit, the coding can be considered only as a translation from one representation to another. The RTL
code must be synthesizable, so it must be written in accordance with all the rules associated with that.
The coding style for code written for synthesis has been standardized by the IEEE1.
The parts of code related to the structural description can be written directly on the basis of the
interface and signal lists presented in the functional requirements specification. Below is the beginning
of the code of the module statement of the sramctrl design.
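The listing itself is not included in this extract; a sketch of the module header, with port names and widths taken from the testbench shown later in this chapter (the port directions are assumptions based on the names), could be:

module sramctrl (
  input  logic        clk,
  input  logic        rst_n,
  // APB interface
  input  logic [31:0] paddr_in,
  input  logic        psel_in,
  input  logic        penable_in,
  input  logic        pwrite_in,
  input  logic [15:0] pwdata_in,
  output logic [15:0] prdata_out,
  output logic        pready_out,
  // wait-state configuration
  input  logic [2:0]  wctrl_in,
  // SRAM interface
  output logic [17:0] addr_out,
  inout  wire  [15:0] data_inout,
  output logic        ce_out,
  output logic        oe_out,
  output logic        we_out,
  output logic        ub_out,
  output logic        lb_out
);
  // RTL code of the controller goes here
endmodule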
1
"IEC/IEEE International Standard - VHDL Register Transfer Level (RTL) Synthesis," in IEC 62050 Ed. 1 (IEEE Std
1076.6-2004) , vol., no., pp.1-128, 31 July 2005, doi: 10.1109/IEEESTD.2005.8894298.
"IEC/IEEE International Standard - Verilog(R) Register Transfer Level Synthesis," in IEC 62142-2005 First edition
2005-06 IEEE Std 1364.1 , vol., no., pp.1-116, 18 Dec. 2002, doi: 10.1109/IEEESTD.2002.8894283.
Functional RTL Coding
To describe the function of the RTL architecture, concurrent process constructs of hardware description
languages are used, process statements in VHDL and always procedures in SystemVerilog. In
SystemVerilog models written for synthesis, however, combinational logic should be modeled with
always_comb, and sequential logic with always_ff procedures.
The following figure shows the structure of a combinational logic process in VHDL and SystemVerilog
languages. At the beginning of the VHDL process a sensitivity list must be used to define the start
condition of the process. In the example of the figure, the sensitivity list is represented by the reserved
word "all" according to the syntax allowed by the VHDL-2008 standard. It makes the process sensitive to
all signals that are read in the process. The code that describes the function of combinational logic must
be written in the region indicated by the blue background. The use of the label ("my_mux") is optional,
but it is recommended because it links the code to the RTL design.
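The figure is not reproduced in this extract; on the SystemVerilog side, the combinational-process template described above amounts to the following sketch:

always_comb
begin : my_mux
  // code describing the function of the combinational logic goes here
end : my_mux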
The following figure shows the structure of the sequential logic process in VHDL and SystemVerilog. In
both cases, a sensitivity list is required for the process, showing the process' clock and asynchronous
reset signal, and in the case of SystemVerilog, also their polarity. The function of the process is described
by an if-else statement, the if part of which describes the reset function of the registers, and the else
part the function of the next-state encoding logic. The RTL coder should fill in the sections shown in
yellow and blue in the process template.
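That figure is likewise not reproduced here; a SystemVerilog sketch of the sequential-process template, assuming an active-low asynchronous reset as used elsewhere in this chapter, is:

always_ff @(posedge clk or negedge rst_n)
begin : my_regs
  if (rst_n == '0) begin
    // yellow region: reset values of the registers
  end
  else begin
    // blue region: next-state logic of the registers
  end
end : my_regs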
Common Constructs Used in RTL VHDL and SystemVerilog Code
The following table provides a comparison of the most common VHDL and SystemVerilog constructs
needed in RTL coding.
Library declarations
VHDL:
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
SystemVerilog:
Not required. All relevant types are built-in in SystemVerilog.

Concurrent signal assignment (outside a process) / continuous signal assignment (outside always*)
VHDL:
DATA_OUT <= '1';
S1(0) <= '1'; -- set element
S1 <= (others => '1'); -- set all
S1(3 downto 0) <= "1010"; -- set slice
SystemVerilog:
assign DATA_OUT = '1;
assign S1[0] = '1; // set element
assign S1 = '1; // set all
assign S1[3:0] = 4'b1010; // set slice

Signal declaration
VHDL:
signal S1, S2, S3 : std_logic_vector(7 downto 0);
SystemVerilog:
logic [7:0] S1, S2, S3;
Block: r-addr-mux
Type: comb
Label: addr_mux
Inputs: addr_r, paddr_in, addr_load
Output: addr_ns
Description: Input mux of the address register addr_r that selects the next state addr_ns. If addr_load == '1, the next state is paddr_in[17:0], else it is addr_r.
always_comb
begin : addr_mux
if(addr_load)
addr_ns = paddr_in;
else
addr_ns = addr_r;
end : addr_mux
In the case of a testbench used to verify a single and fairly simple module, its structure is usually simple
as well. The typical components are:
● the testbench module (module statement)
○ processes to create a clock and a reset signal(s)
○ instantiation of the design to be verified (DUT)
○ optional instantiation of the reference design (REF) and an analysis process for
comparing the output of the DUT and the REF models
○ instantiation of a test program
○ optional creation of interface objects and connecting them with the design
○ optional instantiation of checker modules inside design modules with bind statements
● test program (program statement)
○ optional creation of a clocking block
○ initial procedure in which the inputs of the DUT are driven and its outputs are read
● checker module(s) (module statement)
○ property statements
○ assert, cover, and assume statements
○ covergroup statements
● bind file to instantiate checker modules inside DUT modules (bind statement)
● optional config file for configuring the module hierarchy (config statement)
Entire IP blocks, which usually have one or more standard bus interfaces, as well as testbenches of
entire SoCs are generally coded according to the UVM standard. In such cases, the test program
(program statement) and possibly the reference model and the results analyzer, if any, are omitted from
the configuration described above. Instead, an initial procedure containing a call to the UVM run_test
subprogram is added to the testbench module. The rest of the UVM testbench is coded following the
object-oriented paradigm by deriving the required classes from the basic classes of testbench
components defined in the UVM class library. At the beginning of the simulation, the run_test call
creates objects from the classes defined by the code and runs the test using them. The UVM method is
discussed in more detail in a separate chapter.
Below is the testbench module for the example design. It first creates a bus interface model that is
connected to the design with assign-statements. The design, test program, and memory circuit
simulation model are then instantiated. The bind statement that instantiates a checker module inside
the design is included from a different file using the include directive, as the same bind-file is also used
for formal verification. Finally, clock and reset signals are generated.
import sramctrl_pkg::*;
import apb_pkg::*;
module sramctrl_tb;
logic clk;
logic rst_n;
logic [31:0] paddr_in;
logic [15:0] prdata_out;
logic [15:0] pwdata_in;
logic psel_in;
logic penable_in;
logic pwrite_in;
logic pready_out;
logic [2:0] wctrl_in;
logic [17:0] addr_out;
wire [15:0] data_inout;
logic ce_out;
logic oe_out;
logic we_out;
logic ub_out;
logic lb_out;
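// The apb bus interface instance, the sramctrl DUT, the test program, the SRAM
// simulation model, and the include of the bind file described in the text above
// are instantiated here; they are omitted from this excerpt.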
always
begin : clk_gen
if (!clk) clk = '1; else clk = '0;
#10ns;
end : clk_gen
initial
begin : reset_gen
rst_n = '0;
@(negedge clk);
rst_n = '1;
end : reset_gen
endmodule
The sramctrl_svabind.svh file contains only the line of code shown below. The file extension "svh" is
commonly used in the names of SystemVerilog include files.
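The line itself is missing from this extract; following the bind-file convention used for the ctrdiv example at the end of this chapter, it is presumably a bind statement of roughly this form (module and instance names are assumptions):

bind sramctrl sramctrl_svamod i_sramctrl_svamod (.*);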
Below is part of the example design's test program. The program communicates with the design using
the reset, write, and read subroutines included in the APB bus interface model. They produce the
waveforms of the APB bus signals based on their parameters in synchronization with the clock signal.
This makes the test program very simple, as the code related to bus handling can be omitted from it. By
changing the interface definition, the same test program could also be reused if the bus of the design
were changed to another type.
initial
begin
logic [31:0] paddr;
logic [17:0] addr;
logic [15:0] wdata, rdata;
logic fail;
wctrl_in = '0;
apb.reset;
$info("T1");
wdata = $random;
rdata = ~wdata;
fail = '0;
paddr = i; // i is presumably the loop variable of an address loop omitted from this excerpt
apb.write(paddr, wdata, fail);
apb.read(paddr, rdata, fail);
test_T1: assert(wdata == rdata)
else $error("T1: Wrong data");
end
Assertions Coding
Assertions are usually coded into separate checker modules, which are instantiated in simulation and
formal verification inside the design modules using a bind statement. Assertions consist of property
statements and assert, cover, or assume statements. Below is a functional property of the example
design, and the property statement written based on it. The property statement is activated by the
assert and cover directives.
Property: wdata_set
Features: f-wdata
Description: If (psel_in == '1 && penable_in == '0 && pwrite_in == '1) on the rising edge of clk, then data_inout must be the same as pwdata_in was at that time until the end of the APB write transaction.
Assert: af_wdata_set, Cover: cf_wdata_set
Clock: clk, Reset: rst_n
property wdata_set;
logic [15:0] wdata;
@(posedge clk) disable iff (rst_n == '0)
((psel_in && !penable_in && pwrite_in), wdata = pwdata_in) |->
((data_inout == wdata) throughout (!pready_out [*0:8] ##1 pready_out));
endproperty
covergroup cg_apb_writes @(posedge clk iff (psel_in && pwrite_in && penable_in &&
pready_out));
addresses: coverpoint paddr_in
{ bins addr[] = {[0:$]}; }
endgroup
RTL Simulation
Before RTL simulation, the design and testbench code is first compiled and optimized. The purpose of
optimization is to speed up the simulation. Optimization reduces the ability to monitor the internal
activity of a design, as it may remove the hierarchy from the design or prevent the values of internal
signals or variables from being monitored. In general, optimization is used when the code is already
assumed to be relatively error-free.
At the beginning of the simulation, the analyzed and optimized model is read into the simulator's
memory, after which the test program is started. The simulation results are stored as specified in the
testbench code or in the simulator settings. At the same time, the simulator automatically stores the
coverage data it collects for both code coverage and the coverage model coded in the testbench.
Based on the figure, the coverage measured with covergroup constructs was less than 100%. The reason
for this can be found by examining the bin values of the covergroups. The image below shows that both
coverpoints of the covergroup cg_apb_reads got the score 100%, but their cross coverage only 12%.
Thus, the test program generated all possible values for the addresses and the number of wait states,
but not all possible combinations of these. Coverage would be improved if the randomized test T2 were
run longer than the 1000 times specified in the code.
Formal Verification
Formal verification is a static verification method in which a verification program reads the RTL code of a
design and the assertions written for it, and tries to prove the assertions false one at a time by
developing test stimuli that bring the design into a state that contradicts a property. A formal
verification program can update the same coverage database as the simulator. This can be an
advantage, for example, when random test stimuli fail to cover some feature of the design, but the
formal verification program succeeds in proving that the property is always fulfilled.
Below is a screenshot of the Questa PropertyCheck program's results screen, which lists assert
statements (P) that have been proven to always hold, and cover statements that have been proven to
be covered (C).
Coverage Analysis
To monitor coverage, some sort of "scoreboard" is needed, the implementation of which depends on
the simulator used. For example, it could be an XML table listing the coverage metrics to track (e.g.,
cover and covergroup instances) and setting coverage goals for them. This information allows the
simulator to read the data being tracked from its coverage database. The figure below shows a summary
of the results of the coverage analysis provided by the QuestaSim simulator.
Summary
This chapter has provided an overview of the RTL design and verification flow. Its purpose has been to
help understand the status and purpose of different design and verification tasks in the design flow. The
figure below shows the time distribution of the work phases in the design and verification process
discussed in the chapter. Each step produces a set of documents or source code files. Only when they
are all ready can the functionality of the design be completely verified.
In the figure above, the acquisition of RTL and verification IP has been added to the design phase as an
additional task. This refers to externally sourced or reused RTL designs or testbench components.
Assessing their usability and quality may require a lot of work.
Concurrent Assertions
Creating Concurrent Assertions
Sequences
Sequence Declarations
Named Sequences
Operations on Sequences
Describing Properties with Property Statements
Defining Clocking
Formal Arguments of Property Statements
Operations on Properties
Use of System Functions in Property Statements
Use of Variables in Property Statements
Writing Code for Assertions
Use of Macros
Use of Auxiliary Code
Summary of the Creation of Concurrent Assertions
Summary
A property here means a logical expression computed from the values of the ports or internal variables
of a design, or a sequence of logical expressions lasting several clock cycles.
The statements listed above can be used as either immediate or concurrent assertions. Immediate
assertions are placed inside the procedural code and are executed in the normal way when that
statement is activated in the simulation. Concurrent assertions are separate "processes" that run
concurrently with the model being verified. They therefore constantly monitor the design to be verified.
Immediate Assertions
Immediate assertions can be either assert, assume, or cover statements, and they are placed in a
procedural block of code where they act like if statements. They thus compute a value of a logical
conditional expression whose true or false value determines the outcome of the assertion.
The next assert statement checks that the value of the variable address is in the allowed range and
prints an appropriate message based on the result of the check.
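The statement itself is not reproduced in this extract; an immediate assertion of the kind described could look like this (the signal name address and the constant MAX_ADDRESS are assumptions):

addr_check: assert (address <= MAX_ADDRESS)
              $info("address within the allowed range");
            else
              $error("address out of range");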
As the outcome of checks such as this example is generally positive, it is usually not necessary to report
it. For this reason, the first branch of the action part of the assert statement is usually left blank in the
following way.
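With the pass branch left empty, the same check shrinks to:

addr_check: assert (address <= MAX_ADDRESS)
            else
              $error("address out of range");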
If the action part is completely missing, the verification program calls the $error system function.
The immediate assume statement works in simulation like an assert statement. In formal verification, a
verification program (which does not use test stimuli) can use the information provided by the assume
statement to delimit the value range of the design's inputs or variables.
The immediate cover statement states that the fulfillment of its conditional expression is the
information to be tracked in assessing the coverage of the simulation. During execution, the simulator
collects information about how often the condition of the cover statement was computed and how
many times it was fulfilled.
Reporting the results of these checks can be done using, for example, the $display system function, but
it is preferable to use the following system functions that include the notification of a severity level:
● $fatal (worst run-time error)
● $error (run-time error)
● $warning (run-time warning)
● $info (notification that has no severity level)
When using these reporting functions, the simulator can report the event, for example, as a marker in a
waveform display of the user interface, making the event easy to notice.
Concurrent Assertions
Creating Concurrent Assertions
If an assert, assume, cover or restrict statement is placed outside procedural blocks in a module, a
concurrent assertion is created that can be used to monitor activity in the design continuously, as
opposed to immediate assertions, whose conditional expressions are evaluated only when the
statement is executed in the procedure. Concurrent statements are usually synchronized with the clock
signal, and the values of their conditional statements are computed on each clock cycle. The values of
the variables used by concurrent assertions are read at the beginning of the simulation cycle, before the
execution of the code describing the function of the design to be verified itself. The values of the
assertions, in turn, are computed at the end of the simulation cycle, after all activity of the design for
that time step has completed.
Concurrent assertions check the fulfillment of some property of a design. The property is described by a
property statement, which at its simplest can be included directly in the assertion statement, for example:
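The statement is not included in this extract; based on the description that follows, it is presumably:

assert property ( @(posedge clk) disable iff (rst_n == '0) !$isunknown(data_bus) );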
According to this assertion, none of the bits in the variable data_bus may be in an unknown state on
the rising edge of the clock. The check is not performed if the reset input rst_n is in state '0.
The conditions of concurrent assertions are usually described by separate named property statements,
which are checked with assert, assume, cover, or restrict statements. The same property definition can
be referenced simultaneously from, for example, assert and cover statements. The following example
uses a property specification to build a check to verify that the variables load_enable1 and load_enable2
are never in state '1 at the same time.
property load_enable_check;
@(posedge clk) disable iff (rst_n == '0)
!(( load_enable1 == '1) && (load_enable2 == '1));
endproperty
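The assert directive that activates the check is not shown in this extract; based on the identifier mentioned below, it is presumably:

load_enable_error: assert property (load_enable_check)
                   else $error("load_enable1 and load_enable2 active at the same time");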
In this example, the load_enable_check property is defined to always be checked on the rising edge of
the clock signal clk, but only if the value of the variable rst_n is not '0. This prevents possible "false
alarms" during circuit reset. The check is activated with an assert statement identified as
load_enable_error. The use of identifiers makes it easier to track the outcome of assertions in the
simulation.
The figure below shows the result of tracking the assertion load_enable_error in the Mentor Graphics
QuestaSim simulator's waveform display. By observing the waveforms of the signals alone, the
erroneous operation would easily go unnoticed, but the false value of the assert statement is easily
detectable as a red triangle. Descriptive naming of the assert statement ("load_enable_error") facilitates
the interpretation of results when there are many statements to follow. In the figure, the green triangles
represent the accepted assert check, and the red ones the rejected one. The blue squares indicate the
start time of the check. The cursor is over the accepted statement, with "PASS" in the column showing
the values.
Sequences
In the example above, the property statement checked the fulfillment of a simple logical condition. One
must, however, also be able to use property statements to describe more complex properties that occur
over time, as it is often necessary, in addition to checking the allowed value combinations of variables
(Boolean functions), to track their values over several successive clock cycles. A typical such situation is
a handshake in inter-component communication, in which the transmitting component activates a
request signal when placing the information on a communications bus, and the receiver acknowledges
receipt of the information with another signal within an agreed time. This condition cannot be
described only as a Boolean function of the request and acknowledgment signals, because they occur at
different times.
Sequence Declarations
Properties spanning several clock cycles can be defined by first describing sequences of Boolean
functions of variables, and then defining what the relationship between these sequences is. The
sequences can be defined by concatenating the Boolean expressions using the clock cycle delay operator
##, for example, as follows:
a ##1 b ##1 a
This sequence has the value true, if the variable a is true on the first clock cycle, after which the variable
b is true on the next cycle, and variable a again after one clock cycle. The sequence is evaluated during
simulation so that when the simulator detects that the first condition of the sequence (a == '1) is
fulfilled, it starts an "execution thread" that follows whether the following conditions are fulfilled after
the clock period delays defined for them. In the example above, the expressions consist of a single
variable, but they could also be any logical expressions.
Sequence expressions can be concatenated not only with delays of a constant number of clock cycles,
but also with delays of varying or even indefinite length. The sequencing operators of the sequences are
presented in the table below.
● ## constant : delay of a constant number of clock cycles. Examples: ##1 (one clock cycle delay), ##2 (two clock cycles of delay).
● ##[const1:const2] : delay that may be anything within the limits specified by const1 and const2. Examples: ##[1:3] (1, 2 or 3 clock cycle delay), ##[2:10] (2 to 10 clock cycle delay).
● ##[constant:$] : delay that is greater than or equal to constant (unbounded). Example: ##[1:$] (a delay that can be from one clock cycle to the end of the simulation, or unlimited in formal verification).
A sequence definition can be included in a property statement just like a logical expression. The
following property definition is true if its argument gets the one-bit values 101010 on consecutive clock
cycles. When used in a cover statement, it counts how many times the bit string 101010 occurred in the
variable pulse_input during the simulation. In this example, a formal argument x is defined for the
property statement. The value of the actual argument pulse_input is given using name-based mapping
with the notation .x(pulse_input) in the cover statement.
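The property and cover statements of this example are not included in this extract; a sketch consistent with the description (all names except pulse_input are assumptions) would be:

property three_pulses(x);
  @(posedge clk) x ##1 !x ##1 x ##1 !x ##1 x ##1 !x;
endproperty

pulses_101010: cover property ( three_pulses(.x(pulse_input)) );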
Named Sequences
Long sequence expressions are unclear and difficult to use repeatedly. For this reason, sequences can be
named, in which case the sequences can be referred to by name in property statements or other
sequences. In the example below, a sequence of the three pulses is first defined by a sequence
statement, after which the name of the sequence is used in the property statement, which has become
much clearer. In this case, the difference is not big, but if the same sequence of three pulses were
tracked in several property statements, the use of the name would make it easier to write and especially
maintain them.
sequence three_pulses;
x ##1 !x ##1 x ##1 !x ##1 x ##1 !x;
endsequence
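The property statement that uses the named sequence is not shown in this extract; it could be as simple as the following sketch, where x is a signal visible in the enclosing scope:

property three_pulses_check;
  @(posedge clk) disable iff (rst_n == '0)
  three_pulses;
endproperty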
Formal arguments can also be defined for sequence statements, allowing them to be used flexibly in
different contexts.
Operations on Sequences
Sequences can be subjected to operations, such as a logical AND or OR, whose operators are and and or
in this context. The and operation is satisfied if both of its argument sequences are true. The sequences
must start at the same time, but do not have to be the same length. If the length of the sequences must
be the same, the intersect operation must be used. The or operation is satisfied if one of the sequences
is true.
Example. This example illustrates the interpretation of the and operation. Assume a sequence has
been defined as:
sequence enabled_a;
en_a and (a ##1 !a ##1 a);
endsequence
This sequence matches the data in which the variable en_a == '1 and at the same time a sequence
begins in which the variable a gets the values 101. This can take place, for example, as shown in the
figure below. Note that the and operation is not a temporal logical AND function of the bits of the
sequences, but an AND function of the sequences en_a and (a ##1 !a ##1 a) themselves, which holds if
both sequences that started at the same time (yellow triangles) are detected (green dark-edged
triangles). The sequence en_a is always detected immediately because it has only one part. Matching
of the sequence (a ##1 !a ##1 a) takes three cycles.
Other operations defined for sequences are first_match, within, and throughout.
● first_match matches with the first (shortest) possible sequence of several alternatives (if the
sequence has parts of varying lengths).
● the within operator recognizes a shorter sequence (seq1) that is inside another (seq2). It is in
practice an abbreviation for: (1[*0:$] ##1 seq1 ##1 1[*0:$]) intersect seq2.
● The throughout operator has a format "expression throughout sequence", and it evaluates to
true if "expression" is true for the duration of the entire "sequence". The throughout operation
is an abbreviation for: (expression) [*0:$] intersect sequence.
Long sequences can be simplified by using repeat operations. The most commonly used repetition is the
consecutive repetition operation. It is described by [*constant_or_range], which is placed after
the sequence element to be repeated. Repetition can be applied to a longer sequence by using
parentheses. The sequence can be defined as repeating a certain constant number of times (or not at all
if the constant is zero), or a variable number of times. The following are examples of repetition
operations and their interpretations.
a [*2]   is equivalent to   a ##1 a
With repetitions and logical operations on sequences, it is possible to define very complex sequences
briefly. However, the syntax of sequence expressions is cryptic, so erroneous definitions can be easily
created. For this reason, it is a good idea to test the sequences carefully in a simulator before using
them for verification.
Example. The following is a graphical illustration of sequence matching. Let the following sequence of
signals start and resp be defined:
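The definition itself is missing from this extract; judging from the four expanded forms listed below, it is presumably:

start ##[1:2] resp [*1:2]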
A verbal description of the sequence could be as follows: when the signal start is in state '1, the signal
resp goes to state '1 after one or two clock cycles and stays there for one or two clock cycles. This definition is equal to
the or operation between these four sequences:
start ##1 resp
or start ##1 resp ##1 resp
or start ##2 resp
or start ##2 resp ##1 resp
The following figure illustrates the sequence matching events. Whenever the beginning of a sequence
is detected (start = '1, yellow triangles), a "thread" is started that monitors the variables in the
sequence. The execution of the thread is continued as long as the values of the variables are as
defined in the sequence (green triangles). If a value that differs from that specified in the sequence
(red triangles) is detected, thread execution is stopped. If the values observed so far match some of
the allowed forms of the sequence under observation, the sequence is declared to be identified on
this basis (cases 1, 2 and 5), i.e. to be logically true. Thread execution is also stopped if the longest
possible sequence is identified (dark-edged green triangles). The empty green triangles in the figure
indicate the values that have been accepted into the sequence because it does not specify the values
required for the signals for that clock period.
The last resp pulse is longer than the 1-2 clock cycles specified in the sequence, but the pulse is
accepted into the sequence after two clock cycles, as the sequence does not specify that the signal
should return to zero. A long pulse also causes the acceptance of the sequence starting from the 5th
start pulse. The start pulse does not have to be one clock cycle long either. That, too, should have
been defined more precisely if only a pulse with the length of one cycle had been allowed. The
following sequence definition addresses these shortcomings:
start ##1 ((!resp [*0:1] ##1 resp [*1:2] ##1 !resp) and (!start [*1:4]))
In this version, the resp signal should initially be in state '0 for 0-1 clock cycles after the start pulse,
then in state '1 for 1-2 cycles and then again in state '0. At the same time, according to the and
operation, the start signal must be in state '0 for 1-4 clock cycles.
Goto repetition is another possible form of repetition operation. It is described by the expression [->
constant_or_range], and it defines a finite number of repetitions of a Boolean expression such that there
may be clock cycles in between the repetitions during which the Boolean
expression is not satisfied. On the last repetition, the expression must match. For example:
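The example is not reproduced in this extract; a sequence matching the description below would be:

a ##1 b [->2:4] ##1 c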
This sequence matches if a is true on the first clock cycle, and c on the last, and in addition b at least 2
and at most 4 times between them, however, so that the last match of b occurs on the second to last
cycle of the whole sequence (before c matches). Thus, expression b must match 2-4 times, but not
necessarily in consecutive clock cycles.
The nonconsecutive repetition is a variation of the goto repetition in which the repeated expression
does not need to match on the last clock cycle of the sequence; clock cycles in which the expression
does not match may still follow the last match before the rest of the sequence continues.
Nonconsecutive repetition is defined by the expression [= constant_or_range].
Example. This example illustrates the operation of different types of repetitions with the following
sequences:
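The sequences compared in the figure are not included in this extract; based on the description below, they presumably apply the three repetition forms to b between the a and c pulses, roughly as follows:

a ##1 b [*3] ##1 c     // consecutive repetition
a ##1 b [->3] ##1 c    // goto repetition
a ##1 b [=3] ##1 c     // nonconsecutive repetition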
It is worth paying attention to how the values b gets in between the a and c pulses affect the
sequences in cases 1, 2, and 3. In case 1, b == 1 is repeated three times, so all three sequences
match. In case 2, b is repeated three times, but not on consecutive clock cycles, so the consecutive
repetition does not match. Also in case 3 there is a gap between the 1-states of b, and in addition
b still has time to get the value 0 before the c pulse. In this case, only the lowest sequence,
described by nonconsecutive repetition, matches.
The definition of the clock signal can be omitted if a common default clock is defined for the property
statements, e.g. in the module in which the statements are placed. This is done as follows:
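The code of the example is not included in this extract; a default clocking declaration of the kind described could be written as follows (the block name is an assumption):

default clocking sva_clk @(posedge clk);
endclocking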
This method is useful if there are many property statements to write, and especially if the property
definition is placed in a separate module as described below, making it easy to limit the default clocking
to only the property statements in that module.
In the three_pulses example, the values of the arguments were given according to their names, but
they can also be bound to formal arguments in the order in which they are defined in the property
statement.
In the following example, the load_enable_check property has been changed to a general check,
mutex_check, by rewriting it using formal arguments. In the load_enable_error assertion, the values of
the arguments are given in the order in which they have been defined.
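The rewritten property and the assertion are not shown in this extract; a sketch consistent with the description (the formal argument names are assumptions) would be:

property mutex_check(sig1, sig2);
  @(posedge clk) disable iff (rst_n == '0)
  !(sig1 && sig2);
endproperty

load_enable_error: assert property ( mutex_check(load_enable1, load_enable2) );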
By defining formal arguments for property statements, they become universal, allowing them to be used
in many designs. In this way, it is also possible to create verification libraries that contain ready-made
solutions to common verification problems, such as verifying the operation of various standard buses. If
no type is specified for the argument, it is determined by the actual argument. For reasons of generality,
the type definition is often omitted in cases where the property can be checked regardless of the type.
Operations on Properties
Properties can also be subject to operations with which it is possible to create a new property from one
or more properties. In the following examples, the term property_expr denotes a property defined, for
instance, with a property statement, which can, therefore, be a Boolean expression, or a sequence of
Boolean expressions spanning a number of clock cycles.
The most apparent operations are the logical operations: negation, disjunction, and conjunction (NOT,
OR, AND), which evaluate to true if the corresponding logic function of the properties evaluates to true:

not property_expr
property_expr or property_expr
property_expr and property_expr

In addition, there are two conditional forms:

if ( expr ) property_expr
if ( expr ) property_expr1 else property_expr2

Of these conditional forms, the first evaluates to true if the expression expr evaluates to false or if the
property property_expr evaluates to true. The property property_expr therefore needs to hold only if expr is true.
The latter format is true if expr is true and property_expr1 is true, or if expr is false and property_expr2
is true.
Implication in this context means a property that must be true if a predefined sequence is first detected.
Implication can be used to check that if certain signals have initially assumed certain values over a
period of time, this sequence will be followed by some expected events in the future.
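The general form is not shown in this extract; SystemVerilog has an overlapping and a non-overlapping implication operator:

sequence_expr |-> property_expr  // overlapping: the consequent is evaluated from the last clock cycle of the antecedent match
sequence_expr |=> property_expr  // non-overlapping: the consequent is evaluated from the following clock cycle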
The sequence sequence_expr is known as the antecedent and the property_expr as the consequent. The
implication expression is interpreted as follows:
● if the sequence described by the antecedent is not detected, the value of the whole implication
is true.
● if the sequence described by the antecedent is detected, and the property evaluated from the
end of the antecedent sequence evaluates to true, the value of the whole implication is true.
Example. Suppose you want to verify the synchronous reset function of a 3-bit synchronous counter
counter_r. It is correct if the counter's state is zero on the next rising edge of the clock after the reset
signal was srst == '1 on the previous rising edge of the clock. With the SystemVerilog property
statement, this can be described using the implication as follows:
property sync_reset_check;
@(posedge clk) disable iff (rst_n == '0)
(srst == '1) |=> (counter_r == 3'b000);
endproperty
The start-resp example discussed in connection with the sequences, in which the start pulse had to be
followed by the resp pulse, can be described more sensibly with an implication than with a sequence
alone, as this is a typical cause-and-effect chain of events. In this case, the antecedent part of the
implication is the 1-state of the start variable, and the consequent part then consists of the expected
waveforms of the start and resp variables. Below is a property statement written with this principle. It
can now also be used in an assert statement to ensure that the sequence is always repeated correctly.
The sequence alone could only be used in the context of a cover check to count how often the sequence
is detected in the simulation. When used with an assert statement, a check written using a sequence
alone would produce an error message when the sequence is not detected (that is, often).
property start_resp_check;
@(posedge clk) disable iff (rst_n == '0)
start |=> (!resp [*0:1] ##1 resp [*1:2] ##1 !resp) and (!start [*1:4]);
endproperty
In addition to the logical, if-else, and implication operations, the SystemVerilog standard (clause 16.12)
defines a number of other property operations that are used less frequently. Property statements can
also use system functions that sample the values of expressions on clock edges; the most common ones
are listed in the table below.
Function Explanation
$rose(expr) Returns true if the least significant bit of the expression value changed to 1
from the previous clock cycle to the current one. Otherwise, returns the
value false.
$fell(expr) Returns true if the least significant bit of the expression value changed to 0
from the previous clock cycle to the current one. Otherwise, returns the
value false.
$stable(expr) Returns true if the value of the expression did not change from the
previous clock cycle to the current one. Otherwise, returns the value false.
$changed(expr) Returns true if the value of the expression changed from the previous
clock cycle to the current one. Otherwise, returns the value false.
$past(expr) Returns the value of an expression in a previous clock cycle. The default
cycle is the previous clock cycle, but the clock cycle can be selected with an
optional second argument, e.g. $past(x, 2).
Example. According to one bus specification, the controller directs a read or write operation to a
responder device by first setting the selection signal PSEL to state 1 and then the signal enabling the
function PENABLE in the next clock cycle also to state 1. This operation can be checked using the
$rose function as follows:
property PSEL_PENABLE_check;
@(posedge PCLK) disable iff (rst_n == '0)
($rose(PSEL) && !PENABLE) |=> (PSEL && PENABLE);
endproperty
Function Explanation
$onehot0(expression) Returns true if at most one bit of the expression value is '1.
$countones(expression) Returns the number of bits in state '1 in the expression's value. The
value of the expression must be a bit vector.
$isunknown(expression) Returns true if any bit of the expression's value is in an unknown ('x) or
high-impedance ('z) state.
Of these, the $isunknown function is an almost indispensable verification aid, as unknown states resulting
from an incomplete testbench or a missing register reset, for example, can cause unpleasant
surprises at some point in the design project. In the following example, the unknown states of all signals
in the bus are checked by creating one bit vector from these in the assert statement, and passing it as an
argument to the property. Note that the property argument of a property statement does not have a
predefined type, so it is determined by the type of the actual argument.
property unknown_check(x);
@(posedge clk) disable iff (rst_n == '0)
!($isunknown(x));
endproperty
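The assert statement of the example is not included in this extract; with the APB-side signals of the example design concatenated into one vector, it could look like this (the identifier is an assumption):

no_unknowns: assert property (
  unknown_check( {psel_in, penable_in, pwrite_in, paddr_in, pwdata_in, wctrl_in} ) );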
property xymatch;
logic [2:0] tmp;
@(posedge clk) disable iff (rst_n == '0)
($rose(x), tmp = datax) |=> (!x && !y) [* 1:3] ##1 (y && (datay == tmp));
endproperty
A local variable declared inside a property is assigned a value alongside a Boolean expression, separated
from it by a comma. The variable gets the value on the clock cycle on which that expression is evaluated.
The value of the variable is referenced in the normal way. Thus, local variables can be used to capture
values on different clock cycles of the sequence for use in subsequent comparisons.
In checks like these, the name of the signal is the only thing that distinguishes one assert statement from
other, otherwise identical statements. The "efficiency" of the code is thus poor. To avoid unnecessary coding, designers often use
compiler macros in such situations. A macro is a construct that the compiler's preprocessor expands in
the code before the code is analyzed. In SystemVerilog syntax, a macro, called xcheck here, that solves
the problem described above can be defined as:
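The definition is not repeated at this point of the extract; it is the same xcheck macro shown earlier in this chapter:

`define xcheck(name) X_``name``: assert property ( @(posedge clk) disable iff (rst_n !== '1) !$isunknown( name ))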
The xcheck macro has a parameter called name. Whenever the compiler detects the macro name and
its argument in the code, it places the body of the macro definition in the code in place of the macro
reference and replaces each occurrence of the parameter name with the argument given in the
reference. Using the macro, the checking code is abbreviated as follows:
`include "mymacros.svh"
`xcheck(mysignal1);
`xcheck(mysignal2);
// and so on
Here it is assumed that the macros are defined in the file "mymacros.svh".
The example contains two property statements that describe, as implications, write events to a memory
location on the AMBA APB bus whose address is defined by the constant CFG_REG. The antecedent part
of the implication activates the check if a write event (and-function of the first four terms) to the
address CFG_REG is detected on the bus, and if the data bus PWDATA has a certain code (constant CFG1
or CFG2) as the data to be written. Dozens of such almost identical properties may be needed to check a
bus interface.
property cfg1(x);
@(posedge clk) disable iff (rst_n == '0)
(PSEL && PENABLE && READY && PWRITE && (PADDR == CFG_REG) && (PDATA == CFG1))
|=> // check that things related to CFG1 happen
endproperty
property cfg2(x);
@(posedge clk) disable iff (rst_n == '0)
(PSEL && PENABLE && READY && PWRITE && (PADDR == CFG_REG) && (PDATA == CFG2))
|=> // check that things related to CFG2 happen
endproperty
The simulator has to evaluate the antecedent parts of the implications of the example on each clock
cycle. Their common factor PSEL && PENABLE && READY && PWRITE && (PADDR == CFG_REG) is thus
computed twice in this example, which slows down the simulation. The slow-down could be
considerable if there were a lot of similar property statements, and if they were also complex. In such
situations, auxiliary code is often used to compute the value of a logical expression outside the property
statement and store it in a variable, which is then used in the property statements in place of the
original expression. The external computation can be implemented, for example, with the always
process as shown below.
logic cfg_write;
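// A sketch of the auxiliary process; the expression is the common factor quoted above:
always_comb
  cfg_write = PSEL && PENABLE && READY && PWRITE && (PADDR == CFG_REG);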
Thanks to the auxiliary code, the property statement becomes much simpler, in addition to which it
works faster in simulation and formal verification.
property cfg1(x);
@(posedge clk) disable iff (rst_n == '0)
(cfg_write && (PDATA == CFG1))
|=> // check that things related to CFG1 happen
endproperty
There are situations where it is not possible to describe an activity with property statements. Many
overlapping sequences are such. In this case, auxiliary code must be used.
The following example attempts to describe a protocol in which a req pulse should be followed within
1-3 clock cycles by an ack pulse. A new req pulse can come before the previous pulse has been
acknowledged with an ack pulse, but each req pulse must be acknowledged with its own ack pulse. The
functionality seems to be easily described by implication:
property reqack1;
@(posedge clk) disable iff (rst_n == '0)
req |=> ##[0:3] ack;
endproperty
However, the simulation works as in the figure below. Each req pulse starts its own evaluation thread,
and since the second req pulse is seen before the first ack pulse, the threads run in parallel for a while.
Because both now detect the same ack pulse, they both get the value true. However, the operation was
not in accordance with the requirements, as the latter req pulse should have been acknowledged with
its own ack pulse. It is impossible to describe this operation without auxiliary code1.
The problem is solved by writing an auxiliary process that counts ack and req pulses:
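The auxiliary process itself is not included in this extract; based on the counter names rc and ac used below, it could look roughly like this:

int unsigned rc, ac;   // number of req and ack pulses seen so far

always_ff @(posedge clk or negedge rst_n)
  if (rst_n == '0) begin
    rc <= 0;
    ac <= 0;
  end
  else begin
    if (req) rc <= rc + 1;
    if (ack) ac <= ac + 1;
  end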
After that, in the property statement, the sequence number of the req pulse that triggered the thread
can be captured from the variable rc in the antecedent part of the implication and compared with the
sequence number of each ack pulse in the consequent part of the implication:
property reqack2;
int unsigned c;
1
Or so they say. However, you can always try.
This property works better. At the time of the first ack pulse, the value 1 of the c variable of the second
thread does not match the current value 0 of the ac variable, so that ack pulse does not satisfy the property
for that thread. The second thread finally gets the value false in this example because a matching ack
pulse is not seen in time.
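The complete reqack2 property is not shown in this extract; assuming the auxiliary counters rc and ac sketched above, it could be written roughly as:

property reqack2;
  int unsigned c;
  @(posedge clk) disable iff (rst_n == '0)
  (req, c = rc) |=> ##[0:3] (ack && (ac == c));
endproperty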
Example. Assume that a digital filter with data input data_in, an input filter_start that starts the filter,
and an output filter_done that tells when the filter algorithm has finished computing has to be
designed. The value of the data input must not change during the computation, as the filter must be
able to use the value of the input during the computation so that it does not have to be stored in a
register inside the filter.
The property statement below describes the protocol used by the filter. On the consequent side of
the implication, a throughout operation has been used, the left-side expression of which must be true
for the entire duration of its right-side sequence. In this case, the condition $stable (data_in) must be
true as long as filter_done == '0.
property data_stable_check;
@(posedge clk) disable iff (rst_n == '0)
(filter_done && $rose(filter_start)) |=>
$stable(data_in) throughout (( !filter_done [*1:$]) ##1 filter_done);
endproperty
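The assume directive itself is not shown in this extract; following the naming convention of the verification plan, it is presumably:

mf_data_stable: assume property (data_stable_check);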
The property is checked with an assume statement, which in simulation works like an assert
statement. In formal verification, however, the assume statement is interpreted to describe what kind
of data can be expected to be fed into the design. On the simulator waveform display, the evaluation
of the assume statement looks like the figure below. After the third filter_start pulse, the data input
changes state at the wrong time, which is detected by the assume statement.
Assertions describing the requirements of the RTL architecture are, of course, very case-specific. Typical
checks include:
● checking of allowed states for variables (especially detection of unknown states)
● checking the allowed values of variables (e.g. values of registers and counters, selection signals
of multiplexers, addresses of internal memories and register banks, values of variables used as
indexes in table variable references)
● flip-flop reset checks
● checking the realization of predefined state sequences (e.g. a state machine must always return
to the initial state within a certain predefined number of clock cycles, or the realization of a
certain state sequence)
● checking the function of status signals (e.g. arithmetic block overflow bit, counter output
decoders, full/empty signals of FIFO registers or memories, etc.)
● checking the function of bus arbitration signals (e.g. one and only one three-state output enable
signal active)
Assertions describing the requirements of the RTL architecture are characterized by the fact that they
describe the same functions that are (later) implemented as RTL code. Therefore, the same work is done
in a way twice. However, the assertions always describe one required property, for example, whether
the value of the up-down counter increases when up-counting is activated. In contrast, in RTL code, a
single process usually describes many separate functions, for example, all the counting modes of an
up-down counter, which is often just what causes human coding errors. For this reason, one should not
"pack" all of the functions realized by the RTL block implementing a property into one assertion. In such
cases, the assertion would only duplicate the code of the RTL block, with the risk that the same error
would slip into both the assertion and the RTL code.
Example. Assume that we have to design a 4-bit synchronous binary counter ctr4_r, which counts up
when its control input ctr4_inc is in state 1. The counter is used in a design where it must be
periodically reset synchronously with the signal ctr4_clr during counting.
The counter's count-up function is described by the following property, which states that when
counting is enabled and synchronous reset is not enabled, the next state of the counter (|=>
implication) must be equal to its value on the previous clock cycle plus one (computed with the $past
function), represented with 4 bits (4'() type conversion):
property ctr4_inc_check;
@(posedge clk) disable iff (rst_n == '0)
(!ctr4_clr && ctr4_inc) |=> (ctr4_r == 4'($past(ctr4_r) + 1));
endproperty
property ctr4_clr_check;
@(posedge clk) disable iff (rst_n == '0)
ctr4_clr |=> (ctr4_r == 4'b0000);
endproperty
These properties are set as requirements with the following concurrent assert statements:
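The assert statements are not included in this extract; following the naming convention used later in the chapter, they are presumably:

ctr4_inc_error: assert property (ctr4_inc_check);
ctr4_clr_error: assert property (ctr4_clr_check);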
Assume that the designer, after getting to the coding phase, writes the following model for the counter:
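The model is not reproduced in this extract; based on the error described below (the ctr4_clr and ctr4_inc branches are not mutually exclusive because a single else is missing), it is presumably of roughly this form:

always_ff @(posedge clk or negedge rst_n)
begin : ctr4
  if (rst_n == '0)
    ctr4_r <= '0;
  else begin
    if (ctr4_clr)
      ctr4_r <= '0;
    if (ctr4_inc)              // the missing "else" makes the two branches non-exclusive
      ctr4_r <= 4'(ctr4_r + 1);
  end
end : ctr4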
By examining the code, we find that the operation it describes does not quite meet the requirements,
as the controls ctr4_clr and ctr4_inc are not mutually exclusive. Thus, the counter counts up even if
the signal ctr4_clr is in state '1, and as a result it enters the wrong state whenever ctr4_clr == '1 and
the current state of the counter is not 4'b1111 (whose next state, 4'b0000, happens to be correct for
the reset function). In the simulation, it would be difficult to detect the error caused by the absence of
the single "else" word without the alarm generated by the assertion, if the synchronous reset occurred
infrequently.
Example. A synchronous 4-bit binary counter ctrdiv_r with two counting modes selectable by the
input ctrdiv_mode has to be designed. When ctrdiv_mode == '0, the counter counts up to 9 and
returns to zero. When ctrdiv_mode == '1, the counter counts up to 13 and returns to zero. The
counter measures time intervals of a certain length.
The property statements describing the properties of the counter become a bit more complex than in
the previous example, because the counter now does not reach the full value of 4'b1111. Therefore,
the comparison of its value with the previous value must be done using the mode-dependent modulus
operation %. In the mode where the counting period is 9 clock periods, the property statement
describing the incrementing operation of the counter may be, for example, as follows. The $stable
function before the implication limits this property to describe a situation where the mode has not
just changed.
property ctrdiv9_inc_check;
@(posedge clk) disable iff (rst_n == '0)
($stable(ctrdiv_mode) && ctrdiv_mode == '0) |=>
(ctrdiv_r == 4'( ($past(ctrdiv_r) + 1)%10 ) );
endproperty
If the counter mode has just changed from 13-cycle mode to 9-cycle mode, the counter value may be
greater than 9. In this situation, the next state of the counter should be zero:
property ctrdiv9_jump_check;
@(posedge clk) disable iff (rst_n == '0)
(!$stable(ctrdiv_mode) && ctrdiv_mode == '0 && ctrdiv_r >= 9) |=>
(ctrdiv_r == 4'b0000);
endproperty
The 13-cycle mode of the counter can be described with a single property statement, because despite
a change of mode, the counter can simply continue to its next state, as its previous state was always <= 9.
property ctrdiv13_inc_check;
@(posedge clk) disable iff (rst_n == '0)
(ctrdiv_mode == '1) |=> (ctrdiv_r == 4'( ($past(ctrdiv_r) + 1)%14 ) );
endproperty
Again, the SystemVerilog coder of the RTL model in this example is no luckier than in the previous
one, because a small error again slips into the model:
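The model is not reproduced at this point of the extract; based on the description below and on the code fragment shown at the end of the chapter, the counting process is presumably of roughly this form:

always_ff @(posedge clk or negedge rst_n)
begin : ctrdiv
  if (rst_n == '0)
    ctrdiv_r <= '0;
  else begin
    if (ctrdiv_r != (ctrdiv_mode ? 4'd13 : 4'd9))   // bug: != should be <
      ctrdiv_r <= 4'(ctrdiv_r + 1);
    else
      ctrdiv_r <= '0;
  end
end : ctrdiv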
This counter counts up towards a final value of 9 or 13, after which it resets itself. The final value is
detected by an inequality operator (!=), which has a simple gate-level implementation. Indeed, the
counter generally works correctly except when its mode changes from 13-cycle mode to 9-cycle
mode. If the value of the counter before the change is greater than 9, it always counts up to 15, and
only then rolls over back to zero and continues from there to 9. The counting cycle therefore becomes
too long. The correct way would have been to perform the comparison with a "smaller than" operator <.
Detecting the error in simulation would again require considerable attention, as it would occur
infrequently if the mode were not changed very often. In a long simulation, however, the error would
occur before long and would trigger the ctrdiv9_jump_error assertion, as in the simulation image
below. After the counter enters the wrong sequence, the ctrdiv9_inc_error assertion also reports
errors for some time.
These two examples have aimed to show that with a few simple assertions it is easy to find coding
errors whose detection by examining simulation results would be time-consuming and unreliable.
`include "ctrdiv_assertions.svh"
The above methods are easy to use, but not always viable. If the code of the assertions is to be excluded
from the simulation, the include statement must be deleted or commented out. Modifying the RTL code
file may then necessitate recompiling the entire design and rerunning all regression simulations for it.
If the module to be verified is written in VHDL, it is not even possible to add a SystemVerilog include
statement to it.
If changing the source code file of the design is out of the question, you can use the bind statement of
SystemVerilog, which installs a module inside another module "from the outside". In this case, a separate
module is created for the assertions, and they are written inside this module. An instance of this
verification module is then created inside the module to be verified by writing a bind statement, for
example, in the testbench. If you want to exclude the assertions from simulation, you can remove the bind
statement from the testbench without changing the files that contain the source code of the design.
The bind statement can also be placed entirely outside the modules.
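The general form of the bind statement is sketched below (the names are placeholders, not actual identifiers):

bind module checker-module checker-instance (port names);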
where "module" is the name of the module to be verified, "checker- module" is the name of the module
containing the assertions, "checker-instance" is the name of the verification module instance to be
created, and "port names" is a list of verification module ports to which the ports or internal variables of
the module to be verified are connected. Normal connection list syntax of instantiations is used here.
Default connections can also be defined using the .* notation if the names of the ports and variables of
the module to be verified are the same as the names of the ports of the verification module). This form
of the bind statement binds the verification module to all instances in the module to be verified in the
design. There is a form of the bind statement that can be used to bind only to certain instances of the
module to be verified.
The following code example includes the module definition of the ctrdiv counter shown above and the
definition of the ctrdiv_svamod module containing its verification code. The ports of the ctrdiv_svamod
module are all inputs, as in verification modules in general, and they have the same names as the inputs
of the ctrdiv module and the variable ctrdiv_r, which is also referenced in the property statements. The
property and assert statements themselves can be included in the verification module as is, since they
"see" signals with exactly the same names as before.
// The next-state process of the counter (the clocking and reset structure is a reconstruction;
// the mode test and the != comparisons follow the description in the text):
always_ff @(posedge clk or negedge rst_n)
  begin
    if (rst_n == '0)
      ctrdiv_r <= '0;
    else
      begin
        if (ctrdiv_mode == '0)
          begin
            if (ctrdiv_r != 9)
              ctrdiv_r <= ctrdiv_r + 1;
            else
              ctrdiv_r <= '0;
          end
        else
          begin
            if (ctrdiv_r != 13)
              ctrdiv_r <= ctrdiv_r + 1;
            else
              ctrdiv_r <= '0;
          end
      end
  end
The bind statement that places the ctrdiv_svamod module inside the module ctrdiv can be written into
the testbench:
module ctrdiv_tb;
logic clk = 0;
logic rst_n;
logic ctrdiv_mode;
logic [3:0] ctrdiv_out;
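// Bind statement (a reconstructed sketch): creates instance ctrdiv_check_1 of ctrdiv_svamod
// inside every instance of ctrdiv; .* connects same-named signals.
bind ctrdiv ctrdiv_svamod ctrdiv_check_1 (.*);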
/*
Rest of testbench code removed.
*/
endmodule
When the simulator builds the simulation model starting from the testbench module ctrdiv_tb, the
instance ctrdiv_check_1 of the module ctrdiv_svamod is created inside the counter module, and all
the signals required for verification are connected to the verification module. On the simulator's
schematic display, the situation appears as shown below. In this way, a module that monitors the signals
of the design under verification has been created inside it without changing the design itself.
Once the code describing the assertions has been installed in the RTL model, no special simulation steps
are required to use the assertions. For the simulator, assertions describe processes to be executed the
same way as, for example, clock-signal-sensitive always procedures. If the simulator detects the start
condition of the sequence described in the property statement, it creates a "thread" that follows the
sequence to its end and reports the result of the assert statement, or updates the coverage database in
the case of a cover statement.
In designing and debugging assertions, a simulator is an essential tool, as even slightly more complex
sequences usually produce surprising results on the first try. On the other hand, modeling events that
initially seem simple often proves more difficult than expected, when all possible cases are considered,
and not just the most obvious ones. For these reasons, testing assertions usually produces a lot of false
alarms at first. However, the simulators provide good support for interpreting the decisions made by the
assertion evaluation threads. Below is the QuestaSim simulator's representation of the evaluation of the
data_stable_check property of the filter example above. In this case, it can be seen that the assert statement
failed because of the false value returned by the $stable(data_in) function call.
The principle of formal verification based on the use of assertions is quite simple: the verification
program interprets each assert statement as a statement that tells a fact about the circuit, and then
tries to find a “counterexample” that shows that the fact does not hold in some situation. This is done so
that the program sets the signal that is the subject of the assertion (e.g., the output of the register) in a
state that does not fulfill the assertion. It then, by "stepping back" the logical model of the circuit, tries
to find a sequence of input signals of the circuit that causes the circuit to traverse into a state where the
assert statement is violated². If such an initial state and a "path" are found, the assert statement has
been proven to be false, and the circuit can be concluded to contain an error (see the figure below for
the result of static verification of the ctrdiv circuit). The initial state should be one that can be reached
from the reset state of the circuit.
The image below shows the result of the formal verification of the ctrdiv counter discussed above on the
Questa PropertyCheck program's display. The waveform display shows the waveforms of the
counterexample found by the program for the statement ctrdiv9_inc_error. When looking for a
counterexample, the verification program has to examine all possible states of the circuit from its
current state backwards. In the image, the Radius column tells you how large a "radius", that is, how
many clock cycles back the program had searched when it found the counterexample. The farther the
program has to go back, the more it needs working memory. In practice, for this reason, the program
cannot always perform an exhaustive search that either proves the statement to be true (no
counterexample exists) or false (a counterexample is found). In such a situation, the proof of the
assertions is said to be inconclusive.
² The program can also start from the reset state or another user-defined state, and try to find a possible
route to an illegal state.
The work of a formal verification program can be made easier by constraining the values of the design's
inputs and variables. Usually, these cannot take all possible values anyway. If values that do not actually occur
are excluded, there will be fewer options for the program to explore. At the same time, the program is
prevented from providing impossible counterexamples. SystemVerilog's assume-statements can be used
for this purpose. In simulation, they are interpreted in the same way as assert statements, but in formal
verification they can be used to tell the program the allowed values of variables and input ports. For
example, the definition below would tell the formal verification program that it should use only 32
different addresses (instead of 2³² addresses) according to the apb_slave_range property, because the
designer knows that the bus controller can only feed this device addresses that are within the range
delimited by the property statement. If the address were anything else, there would be a design error
in the bus controller.
property apb_slave_range;
(paddr >= 32'h80000000 && paddr <= 32'h8000001F);
endproperty
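A minimal sketch of how this property could be handed to the tool as an assumption (the label name and the clocking are assumptions):

assume_apb_slave_range: assume property (@(posedge clk) apb_slave_range);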
Summary
Assertions can improve the quality of design verification by describing the functional and RTL
requirements of a design in a way that allows for automatic verification of the fulfillment of the
requirements. Thus, the detection of errors does not depend on the attention of, for example, the
person responsible for checking the simulation results. Although simulation results can be checked
automatically by comparing them with reference data, simulating a design without assertions does not
provide as accurate information about where and when the error occurred.
Verification coverage can be improved by verifying assertions formally. However, a limitation of formal
verification is that it requires a large amount of memory, which is why it is better suited for "block-level"
verification.
This chapter has covered only the most commonly used features and constructs contained in
SystemVerilog that are suitable for assertion-based verification. A more comprehensive view of this can
be obtained, for example, in the SystemVerilog language standard.
Functional Coverage
Methods of Measuring Functional Coverage
Point Coverage
Cross Coverage
Transition Coverage
Measuring Functional Coverage with Covergroups
Covergroup Creation and Operation
Bin Creation
Measuring Cross Coverage
Measuring Transition Coverage
Customization of the Sampling Method
Other Properties of Covergroups
Summary
Coverage is a measure for how comprehensively a design has been verified. There is no universally
applicable definition or measurement method for coverage. Consequently, these must be defined in the
design's verification plan. This is done by first determining which features of the digital circuit design are
to be verified, and then how the measurement of the verification coverage of each feature is done
during the verification and what goals are set for the measurement results.
The two most important and most commonly measured types of coverage are code coverage and
functional coverage. Code coverage is a measure of coverage automatically computed by the simulator
program based on the execution of the RTL code. Its computation therefore requires no design-specific
information. Functional coverage, on the other hand, is a completely design-specific measure of
how comprehensively design features are exercised in simulation. Functional coverage measurement is
accomplished by adding code to the testbench for measuring coverage in a manner appropriate for each
design property.
Code Coverage
Code coverage is a structural measure of simulation coverage, as it shows which parts of the
design's source code were executed during simulation. It does not, therefore, provide any information
related to the function of the particular design, but it is nevertheless a useful tool for verification. If the
design contains code that is not executed in the simulation, either the design has a bug or the test
stimuli used in the simulation are incomplete. Code coverage can be measured in many different ways,
and some of the ways are simulator-specific. The following are some common ways used to measure
code coverage.
Statement coverage
Statement coverage, sometimes also referred to as line coverage, is the simplest form of code coverage.
Statement coverage shows how many times source code statements (basic code structures that end
with a semicolon) were executed during the simulation.
The example below illustrates the computation of statement, branch, and condition coverage. The code
shown in the figure is executed twice using the test stimuli shown in red text on the right. Statement
100 is executed twice and statement 102 once, so both are covered. The if condition on line 101 gets
both true and false values, so it too is covered. However, the term cond_b never makes the condition
expression true on its own (that is, when cond_a == 0), so the condition coverage of that term is zero.
From this it can be concluded that the signal cond_b never enables the assignment of the variable tmp,
which suggests that there is either an error in the design or that the signal is simply unnecessary.
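Since the original figure is not reproduced here, the following sketch gives an idea of the kind of code and line numbering the description refers to (the statements and signal names are assumptions):

always_comb
  begin
    tmp = '0;                    // "line 100": executed on every evaluation
    if (cond_a || cond_b)        // "line 101": the condition takes both true and false values
      tmp = data_in;             // "line 102": executed only when the condition is true
  end
// Condition coverage asks whether each term alone made the condition true.
// If cond_b is '1 only when cond_a is also '1, the condition coverage of cond_b remains zero.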
Toggle Coverage
Toggle coverage is used to monitor state changes of the bits in the design's ports and variables from 0 to
1 and from 1 to 0. In this way, for example, registers to which data is never loaded, address bus bits that
are never used, and many other flaws can be detected. If the value of a bit in a variable or port does not
change at all in a long simulation, it may be an indication of an error.
State Coverage
State coverage measures whether a state machine visits all of its states during simulation and whether all
transitions between states are exercised. It relies on the ability of simulator programs to extract state
machines presented in a certain way from the RTL code.
Functional Coverage
Measurement of functional coverage is always based on a case-by-case solution and cannot be
performed automatically with the capabilities built into a simulator program. In practice, this means that
code must be added to the simulation model for measuring coverage. What is measured from a design
naturally depends on the design. In general, the aim is to measure things such as whether each feature
of the design is activated in the simulation, and whether certain features are activated at the same time
or in a certain order. The following example illustrates some ways to measure coverage in
such situations.
Point Coverage
Many kinds of coverage goals can be set for the verification of the example device. In all cases, it is
usually desired to monitor whether the operation of each individual property was verified. In this case,
the goal would be to ensure that the device was set to produce both cylinders and cubes, and that all
color settings were used. The coverage measurement is done by computing during verification for each
"coverpoint", FORM and COLOR, how many times they got each possible value. These values are then
compared to a predefined goal value, and if the goal was reached, that setting value is considered
covered. The ratio of the covered values to all values is the coverage of that coverpoint.
Below are shown the accumulated value counts (in green boxes) for the coverpoints FORM and COLOR
in the simulation presented in the figure above. The cylinder was selected three times and the cube once.
The goal was one for both. The coverage of coverpoint FORM was thus 100%. The result is the same for
COLOR, as each color was tried at least once.
Cross Coverage
For verification, you also want to know if the device can print objects of a specific shape and color. For
reliable results, the device must be controlled with all possible combinations of FORM and COLOR
settings during verification. The simultaneous occurrence of these properties is measured by cross
coverage. The figure below illustrates the measurement of cross coverage in the case of the printer
example. Here we compute the number of times the FORM and COLOR coverpoint values occurred
simultaneously in the test data. Because the red and blue cubes were never selected, the cross coverage
is only 67%.
Transition Coverage
The third often measured type of functional coverage is transition coverage. It measures how the
coverpoint values changed during verification. In the case of the 3D printer, this could mean, for
example, what consecutive values the COLOR setting received during verification. The figure below
illustrates this. It can be seen that the test sequence presented above activated only three color
changes, so in this respect the coverage was very low, 33%.
The following sample shows a simple covergroup definition consisting of a clocking event and a
coverpoint. In this case, the coverpoint is an enum-type variable state_r that represents the state
register of a state machine. The variable declaration is presented above the covergroup definition.
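A sketch of what this definition might look like (the covergroup name, the clocking on clk, and the enum values other than BUSY and ERROR are assumptions):

typedef enum {IDLE, BUSY, DONE, ERROR} state_t;
state_t state_r;

covergroup cf_fsm_state @(posedge clk);
  coverpoint state_r;   // one bin is created automatically for each enum value
endgroup

cf_fsm_state cf_fsm_state_inst = new();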
The covergroup construct works as follows: Based on the definition, the simulator creates a "bin" for
every possible value of the coverpoint, the variable state_r, and saves in these bins how many times the
values were detected during clocking events, in this case on the rising edges of clk. If the value of a bin is
larger than zero (by default) after simulation, the bin is classified as covered. The coverage of the entire
covergroup instance is the number of bins covered divided by the number of all bins. The method of
coverage computation can be changed with option definitions, which are not discussed here.
The following figure shows the results reported by the QuestaSim simulator for the cf_fsm_state_inst
covergroup instance. It can be seen that the state machine remained in BUSY state most of the time and
did not visit the ERROR state at all, so the coverage was 75%.
Bin Creation
It is often necessary to define explicitly the bins that are used to collect coverage. This is done by adding
bins definitions to the coverpoint expression to list the values or ranges of values covered by each bin.
The following code sample provides an example of this. It has a coverpoint of the variable addr, which
represents the 3-bit address input of a register bank. The register bank's write enable input is we. The
purpose of the covergroup is to measure how many of the registers have been written during the
simulation. Since the write occurs when we == '1 on the rising edge of the clock signal clk, this has been
added to the clocking event as an iff condition. The bins definitions create two bins, left and right. The
value of bin left is incremented if a write hits an address in the range [0,3], and the value of bin right is
incremented if a write hits one of the addresses in the range [4,7].
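A sketch of this covergroup (the covergroup name is an assumption; the signal names and bins follow the text):

covergroup cg_regbank_write @(posedge clk iff (we == '1));
  coverpoint addr
  {
    bins left  = {[0:3]};   // one bin for addresses 0-3
    bins right = {[4:7]};   // one bin for addresses 4-7
  }
endgroup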
The simulation result shows that the coverage of both bins is 100%, even though only 2 and 3 write
events were recorded in them. This is because each bin covers an entire address range, and a single hit
anywhere in the range is enough to mark that bin as covered.
In the following example, an entry [] has been added after the bin names in the bins definitions, which
means that the simulator creates a separate bin for each of the values listed in the bins definition. Thus,
four left bins and four right bins are created, as sketched below.
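In code, the change could look like this:

coverpoint addr
{
  bins left[]  = {[0:3]};   // four bins: left[0] ... left[3]
  bins right[] = {[4:7]};   // four bins: right[4] ... right[7]
}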
In this case, the coverage of the entire covergroup instance will be 37.5%, as only three addresses were
written to during the simulation.
logic [1:0] x, y;
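A sketch of the cross-coverage covergroup the figure refers to (the coverpoint labels and clocking are assumptions; the instance name comes from the text):

covergroup cg_x_y @(posedge clk);
  x_cp : coverpoint x;
  y_cp : coverpoint y;
  x_y  : cross x_cp, y_cp;   // 16 cross bins for the 4 x 4 value combinations
endgroup

cg_x_y cg_x_y_inst = new();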
The figure below shows the operation of the cg_x_y_inst covergroup in the simulation. Both x and y
received all possible values in the simulation, so their coverage is 100%. Cross coverage instead was only
81.2%, as the cases (2,1), (1,3) and (3,3) did not occur in the simulation. Cross coverage therefore gave
a clearly less flattering picture of the total coverage than the coverage measured for each variable separately.
The following is a covergroup type that measures in different ways the coverage of the 3D printer
settings presented earlier in this chapter. The coverpoint color_tests contains three examples of how
transition coverage can be defined. Here "bins all" creates a bin for every possible color change,
"bins L6" creates a bin for the color sequence used in the example figure presented earlier, and
"bins reds" creates a bin for the case where the RED setting is used four times in a row.
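A sketch of the transition-coverage coverpoint described above (the color names other than RED, the covergroup name, and the exact sequences are assumptions):

typedef enum {RED, GREEN, BLUE} color_t;
color_t color;

covergroup cg_printer @(posedge clk);
  color_tests : coverpoint color
  {
    bins all[] = ([RED:BLUE] => [RED:BLUE]);   // a bin for every possible color change
    bins L6    = (RED => GREEN => BLUE);       // one specific color sequence from the example figure
    bins reds  = (RED [* 4]);                  // RED selected four times in a row
  }
endgroup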
The figure below shows the representation of this covergroup instance in the QuestaSim simulator.
As stated earlier, covergroup is in practice a user-defined class. This class has a built-in member function
"sample", which takes a sample of the coverpoints when the clock condition is met. The sample function
can also be called directly from a property statement or procedural code, for example. In this way, it is
possible to define more complex sampling conditions.
In the covergroup class cg_sf shown below, the sample function has been redefined by replacing the
clocking event with the "with function" definition. In this example, the sample function takes a sample
of a four-bit value, which is assigned to the parameter "data" when the sample function is called. This
parameter is defined as a coverpoint.
covergroup cg_sf with function sample(logic [3:0] data);   // clocking event replaced by a sample function
  coverpoint data;
endgroup
cg_sf cg_sf_inst = new();
property reqack;
@(posedge clk) disable iff (rst_n == '0)
$rose(req) |=> !ack [* 5] ##1 ($rose(ack), cg_sf_inst.sample(tx));
endproperty
The easiest way to generate random data is to use SystemVerilog's $urandom system function, which
can be used to generate 32-bit random numbers, for example:
initial
  begin
    addr_in = $urandom;
    data_in = $urandom;
  end
Completely random test data is rarely useful, as the data should also make sense for the design being
verified. If, in the example above, addr_in were used to drive a 32-bit memory bus that may only be
accessed as 32-bit words, the example code would produce illegal addresses. For valid addresses, the
two least significant bits should always be zero, so the test program code should be corrected as follows:
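A sketch of one possible correction (masking the two least significant address bits to zero):

initial
  begin
    addr_in = $urandom & ~32'h3;   // force the two LSBs to zero (word-aligned address)
    data_in = $urandom;
  end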
Such correction of test data makes the code unclear and difficult to maintain, especially if there are
many constraints on the test data. Therefore, the SystemVerilog language defines a better way to create
constrained random test stimuli. This method is based on the use of classes to describe the type of test
data. Classes familiar from object-oriented programming are implemented in SystemVerilog so that their
member variables can be defined as random with the rand keyword, and new random values can be
computed for them using the class's randomize member function. In addition, the values obtained by
the variables can be constrained with the constraint statement.
The following sample declares the class ahb_bus_tx, which describes the test data of the previous
example: a 32-bit bus address addr and 32-bit bus data data. In this example, the address is additionally
constrained to always be less than 32'h8c000000.
class ahb_bus_tx;
  rand bit [31:0] addr, data;
  constraint ahb_addr { addr < 32'h8c000000; addr % 4 == 0; }
endclass
Class member variables are declared with a rand keyword, which gives them random values when the
randomize function of an object instantiated from this class is called. The class also includes the
constraint ahb_addr, which specifies that the random value of the variable addr must be less than
32'h8c000000 and divisible by 4. The constraint is thus included in the definition of the data type, so if it
is necessary to change the constraint, it can be done in the class definition without changing the test
program itself.
The class ahb_bus_tx can now be used as in the following example. The test program first creates a new
object of the class with the new function and sets it to be the value of the variable tx. The test program
then calls the randomize function of the object tx, which gives the random variables tx.addr and tx.data
of tx random values. These values are assigned to variables data_in and addr_in. The example also
checks the value returned by the randomize function. If randomization fails, for instance, due to
conflicting constraints, randomize returns 0. Even though there is no risk of error in this case, in
accordance with good programming practice, the function return value is checked.
initial
  begin
    ahb_bus_tx tx;
    tx = new();
    if (tx.randomize() == 1)
      begin
        data_in = tx.data;
        addr_in = tx.addr;
      end
    else
      $error("Randomization of ahb_bus_tx failed.");
  end
The example above gives an idea of how constrained random data can be created. SystemVerilog has a
vast array of ways to write constraint statements, in addition to which data generation can be controlled
in conjunction with each randomize function call. You can learn more of these features from the
SystemVerilog standard.
Today, random test stimuli are usually created in the verification components of testbenches
implemented according to the UVM standard.
Summary
Coverage analysis is an important part of the functional verification of system-on-a-chip designs. When
using random test stimuli, it is indispensable. It can also guide the development of directed tests, as
coverage analysis can point out those parts of the design that have not yet been comprehensively
verified.
The most important means for analyzing coverage are the code coverage measurements performed
automatically by the simulator, and the functional coverage measurements performed in the testbench
according to the requirements of the design. The most important programming constructs for measuring
functional coverage are the SystemVerilog cover directive and the covergroup construct.
UVM classes
UVM Components
UVM Test (uvm_test)
UVM environment (uvm_env)
UVM Agent (uvm_agent)
UVM scoreboard (uvm_scoreboard)
UVM Sequencer (uvm_sequencer)
UVM Driver (uvm_driver)
UVM Monitor (uvm_monitor)
UVM subscriber (uvm_subscriber)
Classes for Creating UVM Sequences
UVM Sequence Item (uvm_sequence_item)
UVM Sequence (uvm_sequence)
Other UVM Classes
Configuration Objects
UVM Configuration Database
UVM Factory
Summary
The aim of digital circuit design verification is to ensure that all the functions of the circuit have been
implemented in accordance with the requirements. In practice, this means that the function of all parts
of the circuit must be verified in all possible use cases. Functional verification is performed by feeding
into the circuit data that activates specific functions and monitoring the results produced by the circuit.
The creation of the input data and the analysis of the results take place in a testbench developed
"around" the circuit model. Because a large number of different tests are required to verify each
functional part of the circuit, the amount of code in the testbench is usually larger than in the design
itself.
To ensure that testbench design does not slow down system-on-a-chip circuit development, the EDA
industry has developed solutions that streamline design work and enable testbench reuse. Based on the
proprietary solutions of different vendors, a SystemVerilog-based verification methodology called
Universal Verification Methodology (UVM) has been developed. The methodology is supported by all of
the major EDA simulator vendors. UVM defines a testbench architecture that consists of
standardized components, which is especially suitable for verifying standard-bus-based parts of SoCs at
different stages of design work, from block-level to integration-phase verification. UVM defines a class
library that utilizes the object-oriented programming features of the SystemVerilog language. The library
contains ready-made class definitions for UVM verification components. By using and modifying these
classes, the designer can create verification components that meet the requirements of a particular
application, and assemble a testbench from them. Thus, the design of the testbench can always follow
the same standard formula, which speeds up the verification process and allows the use of verification
components used in previous designs or purchased from outside.
The UVM test bench consists of a standard module hierarchy (1) and a class hierarchy that implements
the UVM components (2). The top-level module in the module hierarchy is a standard SystemVerilog
language testbench module that instantiates the top-level module of the design to be verified, and thus
the entire design.
In a conventional verification method, the testbench module includes at least one initial procedure or
program block that contains a test program that generates test stimuli for the design to be verified. After
elaboration of the model, the simulator starts the test program, after which the simulation continues
until the test program has produced all the stimuli defined therein. Typically, all test stimuli are
generated and the results produced by the design are detected at times determined by the clock signal
generated in the testbench.
In UVM, test stimuli are not generated in the initial or always procedures of the testbench module or the
program block called from it. Instead, the initial procedure calls the UVM library subprogram
run_test, which creates the components of the designer-defined UVM class hierarchy. The components
are in practice instances of UVM classes. The components generate test stimuli and receive and analyze
the results produced by the design to be verified. The operation of the UVM components and the
communication between them is not timed, so it is not synchronized to the clock signal, for example. The
components communicate at transaction level by calling each other's delayless member functions, or
methods. The synchronization of the operation of the untimed UVM components with the timed module
hierarchy takes place in the interface between them. For this reason, the creation of a UVM testbench is
more reminiscent of conventional object-oriented programming than of the creation of a SystemVerilog
description of an RTL design, except for the design of interface components. Object-oriented
programming plays a key role in the design of UVM testbenches, as it is based entirely on the reuse of
the basic classes of the UVM class library.
The UVM test bench in the figure shows the most commonly used UVM component types:
2. UVM test (base class: uvm_test), which creates the UVM test environment, and executes the
test.
3. UVM environment (uvm_env), which creates the top-level components of the test environment,
such as agents and scoreboards.
4. A UVM agent (uvm_agent) that contains the components needed to verify a single interface in a
design to be verified. There can be several agents.
5. UVM driver (uvm_driver), which converts the transaction-based data processed by the UVM
components into bit- and clock-cycle-accurate signals for the interfaces of the design to be
verified.
6. UVM sequencer (uvm_sequencer), which inputs test data to the driver.
7. UVM sequences (uvm_sequence) that generate test data.
8. UVM monitor (uvm_monitor) which converts the signals of the interfaces of the design to be
verified into a transaction-based format used by the UVM components.
9. UVM subscriber (uvm_subscriber), which in this example analyzes the data generated by the
design to be verified.
The component hierarchy of a UVM testbench is created "top down", which in the figure above means
that the outermost components are created first. Each component then creates the components it
contains. When this build phase is complete, a bottom-up connect phase follows, where the components
are connected as required by the test bench architecture. This is followed by a run phase, in which the
UVM components begin to generate test stimuli and communicate with each other. All UVM
components have build_phase, connect_phase, and run_phase methods, which the UVM environment
calls during simulation in a specific order. The designer defines how the components work in the
different steps by writing the code for the methods mentioned above.
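A minimal sketch of a UVM component that implements these phase methods (the class name and messages are assumptions):

import uvm_pkg::*;
`include "uvm_macros.svh"

class my_checker extends uvm_component;
  `uvm_component_utils(my_checker)

  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction

  // Build phase: create child components top-down
  function void build_phase(uvm_phase phase);
    super.build_phase(phase);
    `uvm_info("BUILD", "create child components here", UVM_LOW)
  endfunction

  // Connect phase: connect TLM ports bottom-up
  function void connect_phase(uvm_phase phase);
    super.connect_phase(phase);
    `uvm_info("CONNECT", "connect TLM ports here", UVM_LOW)
  endfunction

  // Run phase: generate stimuli and check results
  task run_phase(uvm_phase phase);
    `uvm_info("RUN", "generate stimulus and check results here", UVM_LOW)
  endtask
endclass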
Transactions
A transaction is an object that contains the information needed for communication between UVM
components. For example, communication between a computer's CPU and memory could be modeled
by the transaction tx shown in the figure below, which defines the address, data, and transaction type
(read/write) required by the memory access. Thus, the exact signal-level (CS, OE, etc.) operation of the
memory bus is not shown. TLM transactions are objects whose class definition is derived from the
uvm_sequence_item base class by adding member variables in which the information contained in the
transaction is stored. In the example of the figure, there are three variables: addr, data, and rw. The
figure also shows the class declaration of such a transaction. In order to understand TLM
communication, it is not necessary at this stage to pay attention to UVM-specific parts of the code,
which are shown in gray text.
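A sketch of such a transaction class (the member variables come from the figure description; the UVM boilerplate shown is a typical minimum, not necessarily identical to the original figure):

import uvm_pkg::*;
`include "uvm_macros.svh"

class tx extends uvm_sequence_item;
  `uvm_object_utils(tx)

  bit [31:0] addr;   // memory address
  bit [31:0] data;   // data to be read or written
  bit        rw;     // transaction type: 0 = read, 1 = write

  function new(string name = "tx");
    super.new(name);
  endfunction
endclass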
The producer, consumer, put_port, and put_export are all objects created from UVM base classes, and in
addition, put_port is a member variable of the producer object, and put_export is a member variable of
the consumer object. The connection between the components is created as shown at the top of the
figure by calling the connect function of the put_port object and giving it the export object as an argument.
In UVM, port-class objects are the initiating parties in data transfers.
¹ Only the version known as TLM-1 is discussed here.
Consider this example using the code shown below. The SystemVerilog class declarations shown below
build the block diagram shown in the figure above using the UVM classes. At this stage, again, it is
necessary to pay attention only to the code relevant to the understanding of the TLM principle, which is
highlighted in the figure.
1. The code introduces three classes: producer, consumer, and tlm_agent. The components
producer and consumer are created inside the component tlm_agent.
2. Therefore, two variables are introduced into the class tlm_agent, m_producer of type producer
and m_consumer of type consumer.
3. For the components producer and consumer, the member variables for their TLM ports have yet
to be introduced. The port name of the component producer is put_port and it is of type
uvm_blocking_put_port, whose parameter value is set to the type of transactions handled by
the port, which is the tx described above. The port of the component consumer is called
put_export and it is of type uvm_blocking_put_imp.
4. The actual objects are created and their memory addresses, or "handles", are assigned to the
corresponding variables in the build_phase functions. There is no need to pay attention to the
object creation mechanism at this stage.
5. The connection between the ports of the components producer and consumer is created in the
connect_phase function of the top-level component tlm_agent. The UVM hierarchy under the
tlm_agent component is then complete. It is undoubtedly more difficult to perceive from code
than from the figure, but its principle is consistent and the same for all UVM models.
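A sketch of the tlm_agent component described in steps 1-5 above (the producer and consumer classes are assumed to be declared and factory-registered elsewhere):

class tlm_agent extends uvm_component;
  `uvm_component_utils(tlm_agent)

  producer m_producer;
  consumer m_consumer;

  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction

  // Step 4: create the child components
  function void build_phase(uvm_phase phase);
    super.build_phase(phase);
    m_producer = producer::type_id::create("m_producer", this);
    m_consumer = consumer::type_id::create("m_consumer", this);
  endfunction

  // Step 5: connect the producer's put_port to the consumer's put_export
  function void connect_phase(uvm_phase phase);
    super.connect_phase(phase);
    m_producer.put_port.connect(m_consumer.put_export);
  endfunction
endclass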
For the sake of clarity, the data transfer code has been removed from the code shown above. It is
presented below. The data transfer is performed so that the port class defines a method that initiates
the data transfer, and the component containing the export connector defines the method that
implements it. In practice, both the port connector and the component that contains the export
connector have a method with the same name, which in the example of the figure and code is put. In the
code, the put_port of the component producer inherits the put method from its UVM base class, so it is
not necessary to present it in the code. On the other hand, the component consumer has to have a
user-defined put method, as it must process the received information in an application-specific way.
The implementation of the data transfer is shown below. On the left is the run_phase method of the
producer class, which is executed in the run_phase phase of the UVM simulation. In this example,
run_phase contains a small test program consisting of a for loop that is executed three times. Each
time the loop is executed, a new transaction object of class tx is created in the variable t (step 1) and its
member variables are given values. The program then prints out the values of the variables on the
screen. Finally, the transaction is written to port put_port by calling its put method (step 2).
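Sketches of the two methods described above; they are excerpts of the producer and consumer classes, and the field values and message formatting are assumptions:

// In the producer class: the run_phase test program
task run_phase(uvm_phase phase);
  tx t;
  for (int i = 0; i < 3; i++)
    begin
      t = tx::type_id::create("t");      // step 1: create a new transaction object
      t.addr = i;
      t.data = i * 10;
      t.rw   = 1;
      `uvm_info("PRODUCER", $sformatf("put: addr=%0d data=%0d rw=%0d", t.addr, t.data, t.rw), UVM_LOW)
      put_port.put(t);                   // step 2: write the transaction to the port
    end
endtask

// In the consumer class: the implementation of the put method (step 3)
task put(tx t);
  `uvm_info("CONSUMER", $sformatf("got: addr=%0d data=%0d rw=%0d", t.addr, t.data, t.rw), UVM_LOW)
endtask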
The data transfer takes place in such a way that when the put method of a port object is called with tx as
argument, the UVM environment calls the put method of the component (consumer) that contains the
export object at the other end of the connection with the same argument value. The put method of the
component consumer is shown in the figure on the right as item 3. The consumer component thus
receives the transaction tx as the argument value of its own put method. This mechanism is built into
UVM classes, so the designer only needs to connect the port and export connectors of the components
together. In addition, the component containing the export connector must specify an implementation
for the put/get method. In principle, the producer object could directly call the put method of the
consumer object, but then the connection between these components would be hard-coded in the
producer class, which would make the component design-specific.
The put method of the consumer component of the example prints out the information it receives on
the screen the same way as the producer. The following information would be printed in the simulator's
console window as the result of the simulation of this model:
As you can see, the selection is large, but as with programming in general, most tasks can be solved by
using only a small subset of all available features. The following is an overview of the most important
port classes.
In the example above, the relationship of put and export objects is in accordance with the direction of
data transfer. They can also be connected in the opposite direction, in which case the "downstream" port
requests a transaction from an upstream export. The figure below illustrates this situation. In this case,
the name of the communication functions is "get".
The put and get mechanisms described above can be used to execute testbench data processing
transactions. They are characterized by the fact that the put and get function calls are blocking. This
means that the data transfer takes place (and the function call returns) only when the counterparty has
executed it (e.g. only when the data is available or has been stored). In this way, the data transmission
can be synchronized, for example, with the operation of the design that is being simulated. Non-blocking
connections that execute data transfers immediately can also be created by using the nonblocking
classes from the table presented above.
In addition to the one-to-one connection type, monitoring and analysis of the data generated during the
simulation requires a port type that can be left unconnected if it is not needed in the testbench, or that
can be connected to multiple export objects. For such needs, UVM defines the analysis port class. An
arbitrary number (or none) of export objects can be connected to an analysis port. The analysis port has
a "write" method that works so that when a transaction is written with a write call to the analysis port,
the simulator goes through a list of export objects connected to the port, and passes the transaction to
them by calling the write method of each of them separately. These function calls are non-blocking, i.e.
they are always executed immediately.
Connecting two TLM ports at the same hierarchical level is a simple operation, described superficially
above. In the following, it will be reviewed in more detail in the case of the example in the figure below.
Here, the put port is connected to the corresponding imp-export.
Below is a part of the class declaration of vc1, which creates a port (member variable) of type
uvm_blocking_put_port, named B. The variable declaration gives the transaction type handled by the port
as a parameter, which here is my_tx. The port object itself is created in the build_phase function of class
vc1 using SystemVerilog's new function (class constructor), which creates an object from the class and
initializes it. The new function must be given as arguments the name of the variable as a string and a
pointer to the UVM component in which the port is created, using the SystemVerilog keyword "this".
Initializing an object with a new function in this way is a requirement of UVM.
uvm_blocking_put_port #(my_tx) B;
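A sketch of creating the port object in the build_phase of vc1, as described above:

function void build_phase(uvm_phase phase);
  super.build_phase(phase);
  B = new("B", this);   // the string name and the host component are required arguments
endfunction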
The other end of the connection is created in component vc2 by defining a variable of type
uvm_blocking_put_imp named C. The variable declaration gets as parameters the type of transaction the
port handles and, in addition, the name of the host component class.
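The corresponding declaration in vc2 could look like this:

uvm_blocking_put_imp #(my_tx, vc2) C;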
The connection is established by calling the connect function on port B. The connection is made in the
connect_phase method of the UVM component that contains the components to be connected. In the
code snippet below, the variables m_vc1 and m_vc2 are assumed to be member variables of this host
component that contain the handles of the components to be connected.
m_vc1.B.connect(m_vc2.C);
Thus, creating a connection does not require a lot of code, but the following things must be taken into
account:
● When declaring port, export or imp variables, you must give the name of the transaction class
as a parameter, and sometimes also other parameters. The required parameters can be found in
the documentation of the UVM classes. Entering invalid parameters usually results in an error
during compilation or simulation.
● You must remember to initialize port, export, or imp variables with the new function in the
build_phase function. Without initialization, the variables do not point to the memory area
allocated to the object in the simulator's memory, resulting in an error in the simulation.
If the components to be connected are not inside the same component, a connection that traverses
hierarchical boundaries must be created. The figure below shows such a situation. Here, we want to
connect the put port A of the component producer to an imp export called D in the component
consumer. producer is inside component vc1, and consumer inside component vc2. A direct connection
cannot be made.
To create a connection, a new port B must be created on the periphery of the component vc1, to which
port A of the producer component is connected. Once port B has been created, the connection from A to
B can be made in the connect_phase function of component vc1 as follows, as vc1 "sees" both ports:
producer.A.connect(B);
To create a connection between vc1 and vc2, an export-type connector named here C must be created
on the periphery of component vc2. For the class of this connector, we select a class from the "export"
column of the table above. These classes differ from the imp classes in that they do not implement data
transfer with a put, get, or other function, but only pass the data forward. The connection is made in the
connect_phase function of the component containing components vc1 and vc2. Assuming that the
handle of component vc1 is stored in the variable m_vc1 and the handle of component vc2 in the
variable m_vc2, the connection is created as follows:
m_vc1.B.connect(m_vc2.C);
The last jump from the periphery of component vc2 to the D-connector of component consumer is made
in the connect_phase function of vc2 as follows:
C.connect(consumer.D);
Connector D should be of the imp type here. The data transfer thus follows the route shown below and
is implemented by writing a put method for the consumer component.
All connections between the components of a hierarchical UVM test bench can be implemented the
same way as described above. The following table summarizes the TLM connections for UVM.
UVM classes
The figure below shows part of the UVM class hierarchy. All of the classes are derived from the
uvm_object base class, which implements a large number of basic functions, a few of which are shown in
the figure. All classes derived from the uvm_object class inherit its properties. The uvm_report_object
class adds reporting properties to the base class. The uvm_component class derived from the reporting
class, in turn, is the basic class of all UVM components. The uvm_sequence_item displayed on the other
branch of the hierarchy diagram is the class for describing TLM transactions, and the uvm_sequence
derived from it is the class for generating sequences. The figure shows only a part of UVM’s overall class
hierarchy, but these classes are most commonly needed on testbenches. TLM connector classes are not
shown in this figure.
UVM Components
UVM components are the building blocks of the testbench. Each UVM component class has a specific
purpose in the testbench, and its class declaration includes methods and variables that support this
purpose. Some component classes, such as the UVM sequencer, contain a wealth of ready-made
functions so that they can be used as such in testbenches. Some classes, on the other hand, are primarily
intended to bring together in a single component certain types of user-defined functions. For example,
the UVM scoreboard is one such class. In many cases, it would be possible to implement the functions of
a testbench without using the specialized UVM classes, but their use is still recommended because they
document nicely the intended purpose of different parts of the testbench. In this way, the
comprehensibility and maintainability of the code is improved. At the same time, the testbench will
benefit from possible future extensions to the UVM standard.
The following sections introduce the main UVM components. The name of the corresponding UVM base
class is shown in parentheses.
In addition to the UVM monitor presented below, the driver is the only UVM component that contains
information about the bus interface signals of the circuit under test and their timing. If the interface is
changed, the driver must also be changed, but otherwise the UVM testbench can remain as it is.
Typically, the driver uses a SystemVerilog virtual interface description to drive the bus, eliminating the
need to code signal-level functions. A virtual interface description is a "pointer variable" whose value can
be set to an actual interface object. By changing the value of the variable to another actual interface, the
same driver can be set to control the signals of a different physical interface.
The uvm_push_driver class differs from the uvm_driver class in that it does not actively request
transactions from the sequencer, but passively receives them from a push-type sequencer.
In addition, sequences may be hierarchical so that a single sequence can run subsequences. UVM
sequences must be associated with a specific UVM sequencer.
The first argument of the set and get functions defines the context of the UVM hierarchy, i.e. the
top-level component under which in the component hierarchy the information stored is available. The
second argument is the name of a component instance, which can contain "wildcards", under the cntxt
component to which the set and get calls are to be further restricted. The third argument is a search key
(string) and the last argument is the variable or value that contains the information stored behind this
key. The type of the data to be stored and read must also be specified; it is represented by the <type>
entry above, and is given as a class parameter of the function call with the #() notation.
Below is a simple example of storing information in a database. Here, the context of the setting is not
defined (it is "null"), and its target is delimited by an *, so the information can be accessed with the get
function everywhere the UVM hierarchy. The search key is "BASE_ADDRESS" and the value to be stored
is 32'h8c000000. The data type is an integer, i.e. int.
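A minimal sketch of what this set call might look like:

uvm_config_db #(int)::set(null, "*", "BASE_ADDRESS", 32'h8c000000);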
The data is read from the database into the variable with the get function:
int base_address;
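// A sketch of the get call, assuming it is made inside a UVM component (hence the "this" context):
if (!uvm_config_db #(int)::get(this, "", "BASE_ADDRESS", base_address))
  `uvm_error("CONFIG", "BASE_ADDRESS not found in the configuration database")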
In general, it is wise to limit the context of the database accesses and/or define the target instance to
avoid conflicts caused by using the same search key in different parts of the testbench. In the following
example, it is assumed that the UVM environment contains two agents, m_agent1 and m_agent2. The
agents have been created from the same class, so they are completely identical. We now want to
configure them so that each gets a different value for the constant "BASE_ADDRESS". This information
can be stored in the configuration database in the build_phase function of the UVM environment as
follows:
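A sketch of what these set calls might look like (the base address values are assumptions):

uvm_config_db #(int)::set(this, "m_agent1", "BASE_ADDRESS", 32'h80000000);
uvm_config_db #(int)::set(this, "m_agent2", "BASE_ADDRESS", 32'h80010000);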
The context "this" here refers to the UVM environment, and "m_agent1" and "m_agent2" are the names
of the agent instances.
In the agent class, the value of the BASE_ADDRESS constant can be fetched from the database as follows:
During the agents' code execution, "this" refers to the agent instance "m_agent1" of the environment
component for the first agent and to the "m_agent2" instance for the second. Although the executable
code is the same, it returns a different "BASE_ADDRESS" constant value in different agent contexts. In
this way, it was not necessary to define two search keys with different names. The second argument to
the get function is an empty string because the search takes place inside the agent, and therefore we
want the search to target the agent itself and not the components below it.
UVM Factory
UVM is designed to be as flexible as possible. The wide use of parameterization is one example of this.
One of the implementation principles of UVM is that the simulation environment can be dynamically
modified also at run-time, for example so that a component of one class can be replaced with a
component of another class. This functionality is implemented in the UVM factory class (uvm_factory).
As with the configuration database, you do not have to create objects of this class in user code; the
code can simply use the functions it provides.
The operating idea of the UVM factory is that instead of calling the normal new function of
SystemVerilog, new UVM components are created by "ordering" them from the factory using the create
function it provides. It is possible for the user to change the factory settings during the simulation so that
one requested component type is automatically replaced by another. This factory override feature is
probably not needed very often, but UVM test benches are usually coded so that it can be used when
needed.
Considering factory requirements in the code requires the adoption of two coding practices. First, the
uvm_component_utils macro must be added to the UVM component class declarations, as shown below,
to tell the factory the name of the new component class. Using a macro hides the code needed to
register a component with the factory from the designer, so he or she does not need to know how the
factory is implemented. You have to remember to add this macro to every class declaration.
Another requirement of the factory is that the components must be created in a rather complex way as
shown in the following example.
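A minimal sketch of these two practices (the class name apb_agent is an assumption):

class apb_agent extends uvm_agent;
  `uvm_component_utils(apb_agent)       // 1. register the class with the factory

  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction
endclass

// 2. in the build_phase of the parent component, order the instance from the factory
// instead of calling new directly:
apb_agent m_agent;
m_agent = apb_agent::type_id::create("m_agent", this);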
The interface definition is not just a named group of signals, as it can also contain executable task
subroutines that can be used to control the interface. The model below contains four subroutines. Of
these, "reset" resets the interface signals. The "write" subroutine creates an APB protocol compliant
write transaction with the interface signals, using the address and data given as arguments. The
subroutine "read" performs a read operation, and returns its results in its "data" argument. Both
subroutines return in the argument "fail" information about whether the operation failed. The
subroutine "monitor" reads the state of the interface signals, and returns in its tx_ok argument
information on whether a bus write or read occurred on that particular edge of the clock signal. In
addition, it returns the state of the data and address buses. Using these subroutines, the UVM testbench
designer can create events for the bus and monitor its state in principle even without knowing the
protocol or all the signals. This feature is useful for coding a testbench that operates at the transaction
level.
task reset;
pwrite <= '0;
paddr <= 0;
pwdata <= 0;
penable <= '0;
psel <= '0;
@(posedge rst_n);
endtask;
////////////////////////////////////////////////////////////////////
// 1. APB SETUP Phase
////////////////////////////////////////////////////////////////////
@(posedge clk);
////////////////////////////////////////////////////////////////////
// 2. APB ACCESS Phase
////////////////////////////////////////////////////////////////////
@(posedge clk);
endtask
///////////////////////////////////////////////////////////////////
// 1. APB SETUP Phase
///////////////////////////////////////////////////////////////////
@(posedge clk);
///////////////////////////////////////////////////////////////////
// 2. APB ACCESS Phase
///////////////////////////////////////////////////////////////////
@(posedge clk);
endtask
task monitor (output logic [31:0] addr, output logic [31:0] data,
              output logic write_mode, output logic tx_ok);   // task header reconstructed; argument types assumed
@(posedge clk);
addr = paddr;
if (penable == '1 && pready == '1)
begin
if (pwrite == '1)
begin
data = pwdata;
write_mode = '1;
end
else
begin
data = prdata;
write_mode = '0;
end
tx_ok = '1;
end
else
tx_ok = '0;
endtask
endinterface
apbslave DUT
(.CLK (clk),
.RST_N (rst_n),
.PSEL (dut_if.psel),
.PENABLE(dut_if.penable),
.PWRITE (dut_if.pwrite),
.PADDR (dut_if.paddr ),
.PWDATA (dut_if.pwdata ),
.PRDATA (dut_if.prdata ),
.PREADY (dut_if.pready ),
.PSLVERR(dut_if.pslverr)
);
Once the interface description has in this way been connected to the design to be verified, the interface is
provided as configuration information to the UVM testbench, after which the test can be started. In the
example shown below, the interface description is stored in the UVM configuration database, from which
the UVM components can fetch it in the build phase with the search key "apb_if_config".
initial
  begin
    // store the interface for the UVM components (interface type name apb_if assumed), then start the test
    uvm_config_db #(virtual apb_if)::set(null, "*", "apb_if_config", dut_if);
    run_test("apb_test");
  end
In this case, the new member variables are the 32-bit addr and data, which describe the bus address and
data, and a 1-bit write_mode, which indicates whether the transaction is a read (0) or a write (1). The
variable fail describes whether the data transfer was successful or not. The variables addr, data, and
write_mode are defined as random variables, which means that random values can be generated for
them by calling the randomize method that is available in all classes in SystemVerilog. The constraint
statements delimit the random values to a certain range.
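A sketch of how the class might begin (the exact address constraint is an assumption); the do_copy excerpt below belongs to the same class and closes it:

class apb_transaction extends uvm_sequence_item;
  `uvm_object_utils(apb_transaction)

  rand bit [31:0] addr;
  rand bit [31:0] data;
  rand bit        write_mode;   // 0 = read, 1 = write
  bit             fail;

  constraint addr_c { addr < 32'h00001000; }   // assumed address range

  function new(string name = "apb_transaction");
    super.new(name);
  endfunction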
function void do_copy(uvm_object rhs);
  apb_transaction rhs_;
  if (!$cast(rhs_, rhs))
    `uvm_fatal("DO_COPY", "cast to apb_transaction failed")
  super.do_copy(rhs);
  addr = rhs_.addr;
  data = rhs_.data;
  write_mode = rhs_.write_mode;
  fail = rhs_.fail;
endfunction
endclass
The apb_transaction class redefines two methods. Of these, the new method used to create the object
only calls the corresponding method in the base class. The do_copy function specifies a method for
copying the contents of an apb_transaction object. Similarly, a function could be defined for
comparisons (do_compare), for example, but this is not needed in this example.
Before introducing the sequence class, a class used to configure the sequence is defined. Using an object
created from this apb_sequence_config class, the top-level UVM test component can pass the sequence
length to the sequence. In the case of a small test bench, a `define macro could be used to define the
same information more easily, but the use of a configuration object is shown here for the sake of
example.
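A sketch of what this configuration class might look like (the default value follows the sequence code shown later):

class apb_sequence_config extends uvm_object;
  `uvm_object_utils(apb_sequence_config)

  int apb_test_cycles = 100;   // number of test cycles to be run by the sequence

  function new(string name = "apb_sequence_config");
    super.new(name);
  endfunction
endclass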
The class apb_sequence is created from the uvm_sequence class by setting its parameter value to be the
transaction type apb_transaction. The operation of the sequence is described in the task-type method
"body" (a task can consume simulation time, but a function can not). At the beginning, information
about the length of the sequence is read from the configuration path database. The repeat loop is then
executed the number of times indicated by this information. A random number between 1 and 3 is
created in the loop, and a corresponding number of write events is created based on it. After this, 1-3
read events are created in the same way.
The lines most relevant to the operation of a UVM sequence are the following. The first of them creates an
object write_tx from the apb_transaction class, whose member variables are then randomized. The
address is then set to be divisible by four (because the data is 32 bits wide, one word occupies four bytes), and
the transaction is defined to be of the write type. Next, write_tx is stored in the dynamic queue
write_queue. The two task calls start_item and finish_item, required in all sequences, hand the
transaction over to the sequencer and wait until the driver has finished processing it. Read transactions are
created in the same way, but their address is set based on a write transaction selected at random from
the write_queue queue. This way, a read always targets an address to which data has already been
written.
task body;
apb_sequence_config seq_config;
int test_cycles = 100;
int queue_index;
apb_transaction write_queue[$];
if (uvm_config_db #(apb_sequence_config)::
get(null, get_full_name(), "apb_sequence_config", seq_config))
begin
test_cycles = seq_config.apb_test_cycles;
`uvm_info("",$sformatf("apb_sequence configured run for %d test cycles", test_cycles),
UVM_NONE);
end
repeat(test_cycles)
begin
apb_transaction write_tx;
apb_transaction read_tx;
int unsigned random_value;
random_value = $urandom_range(1,3);
for (int i=0; i < random_value; ++i)
begin
write_tx = apb_transaction::type_id::create("write_tx");
assert( write_tx.randomize() );
write_tx.addr &= 32'hFFFFFFFC;
write_tx.write_mode = '1;
write_tx.fail = 0;
write_queue.push_back(write_tx);
start_item(write_tx);
finish_item(write_tx);
end
random_value = $urandom_range(1,3);
for (int i=0; i < random_value; ++i)
begin
read_tx = apb_transaction::type_id::create("read_tx");
assert( read_tx.randomize() );
queue_index = $urandom_range( 0, (write_queue.size()-1) );
read_tx.addr = write_queue[queue_index].addr;
read_tx.write_mode = '0;
read_tx.fail = '0;
start_item(read_tx);
finish_item(read_tx);
end
end
endtask
endclass
Most of the sequence's code is application dependent, but it is always built using the same basic
structure: first an object of a transaction class is created, then the values of its member variables are set
or randomized, and finally the start_item and finish_item tasks are called. The operation of the
sequence can be controlled by configuration information. Conventional configuration means, such as
parameter definitions, can also be used, but they require all the code to be recompiled.
In the code of the sequence, it is worth noting that it constantly creates transaction objects, but they are
not destroyed in the sequence. This is done because the sequencer transmits to the driver the
transaction's handle, i.e. in principle an address of the memory area allocated to it. The sequence cannot
be sure how long this address will be used elsewhere in the testbench, which is why it cannot free it.
However, SystemVerilog has built-in garbage collection, which takes care that memory areas that are
no longer referenced are automatically freed.
The figure below illustrates the interaction between the driver and the interface object's task
subroutines for three transactions (1. read, 2. write, 3. read). The upper part of the figure shows the
transactions received by the driver from the sequencer. The bottom part of the figure shows the
waveforms of the APB bus signals generated from these transactions. The red waveforms represent the
signals generated by the driver using interface subroutines, and the blue the signals generated by the
design to be verified. The clock signal shown in gray is generated in the normal way in the top-level
module of the testbench, e.g. by an always process, and is transmitted to the driver via the apb_if
interface.
The following figure shows an apb_driver class derived from the uvm_driver base class. The most
important of the member variables is m_apb_if, which stores a reference to an apb_if-type virtual
interface object that the driver uses to control the ports in the design to be verified. The other member
variables are used in the example to count various events.
The driver's class declaration is listed below. The driver receives the interface definition of the bus as
configuration information, which it stores in its member variable m_apb_if in the connect_phase
function. In the code, m_apb_if is defined as a virtual interface, which means that it is in effect a pointer
to the actual interface. If the bus interface were changed, no changes would be required to be made in
the driver. It would be sufficient to just configure it to use the new bus interface model. In addition to
the interface, some variables have been defined in the driver class for the collection of statistics.
The operation of the driver is described in the run_phase method. At the beginning, the reset subroutine
of the apb_if interface is called. The execution of the driver's code stops there until the end of the reset
state. An eternal loop is then started, in which the driver processes the transactions it has requested
from the sequencer with the get_next_item method of its seq_item_port port. The driver then calls
either the write or read subroutine of the interface, depending on the type of transaction. When the
subroutine call returns, the driver updates the values of the variables it uses to count different types of
events. Finally, the driver notifies the sequencer that the processing of the transaction has ended by
calling the item_done method of its seq_item_port. The apb_driver class also defines the report_phase method, which
reports the statistical data collected during the simulation after the simulation has finished.
int m_writes_to_dut = 0;
int m_reads_from_dut = 0;
int m_writes_to_other = 0;
int m_reads_from_other = 0;
int m_failed_writes_to_dut = 0;
int m_failed_reads_from_dut = 0;
int m_failed_writes_to_other = 0;
int m_failed_reads_from_other = 0;
task run_phase(uvm_phase phase);
  m_apb_if.reset();
  forever
  begin
    apb_transaction tx;
    int wait_counter;
    seq_item_port.get_next_item(tx);
    if (tx.write_mode == '1)
    begin
      m_apb_if.write(tx.addr, tx.data, tx.fail);
      seq_item_port.item_done();
    end
    else
    begin
      m_apb_if.read(tx.addr, tx.data, tx.fail);
      seq_item_port.item_done();
    end
    // (the statistics counters declared above are updated here based on the
    // transaction type and the value of tx.fail)
  end
endtask
endclass
The figure below illustrates the synchronization of the operation of the body method of the sequence
and the run_phase method of the driver in the sequencer. The function calls made by the sequence and
the driver into the sequencer always stop to wait for the other party to terminate the data transfer with
its own corresponding call (e.g. start_item only returns when the get_next_item call occurs).
Contrary to the model presented above, the interface signals are sometimes controlled directly in the
driver and not in the task subroutines of the interface as in this example. However, this has the
consequence that the driver becomes bus protocol specific.
The operation of the monitor can be thought of as the opposite of the driver's. The monitor uses the same
interface bus model as the driver, but the monitor only reads the states of bus signals and tries to detect
read and write events. For this, it uses the interface's "monitor" subroutine, which returns the value
tx_ok == '1 if the APB bus master has set the psel and penable signals and the slave the pready signal to
state '1. The subroutine also returns the state of the bus address and data signals and the type of the
event. Upon detecting a read or write event, the monitor writes the transaction to its analysis port, from
which analyzer components can read it. This write is non-blocking and executes immediately, unlike the
handshake with which the sequencer passes a transaction to the driver.
The initialization steps of the apb_monitor component correspond to those discussed in connection with
the driver, with the exception that the monitor must create an analysis port. The seq_item_port of the
driver, on the other hand, was already included in the definition of the uvm_driver base class, as there
can only be one. The monitor's run_phase method initially creates a single apb_transaction object, tx,
which it uses throughout its operation to collect the data for the event it detects. Each time an event is
detected, it creates a copy of the tx object using the clone method of the class. To this end, a do_copy
function was defined in the apb_transaction class at the beginning of the example, because the
uvm_sequence_item base class cannot copy user-defined member variables. The copied transaction is written to the
analysis port. A copy was created in case a component receiving the transaction changes the values of
the transaction's member variables.
function new(string name, uvm_component parent);
  super.new(name, parent);
  analysis_port = new("analysis_port", this);
endfunction

task run_phase(uvm_phase phase);
  apb_transaction tx;
  apb_transaction tx_clone;
  logic tx_ok;
  tx = apb_transaction::type_id::create("tx");
  forever
  begin
    m_apb_if.monitor(tx_ok, tx.addr, tx.data, tx.write_mode);
    if (tx_ok == '1)
    begin
      $cast(tx_clone, tx.clone());
      analysis_port.write(tx_clone);
    end
  end
endtask
endclass
A diagram of the apb_analyzer class is shown below. The class definition is derived from the
uvm_subscriber class, from which apb_analyzer inherits the export-type connector analysis_export.
Two covergroup variables are created in the analyzer to track how many of the APB register addresses
the randomly generated transactions actually access during the simulation. Covergroup objects are created in the
new method of the class. The items to be tracked are defined in a coverpoint definition. In this case, the
bits [6:2] of the member variable addr of the transaction received by the analyzer are selected as the
target. The APB slave uses these bits to address its register bank. A separate covergroup definition is
created for write and read transactions by adding a condition to their coverpoint definitions that
depends on the type of transaction (write/read).
At the end of the monitor run_phase method described in the previous section, the monitor sends the
transaction to the analyzers by calling the write method of its analysis port. Components of the
uvm_subscriber class have a predefined analysis_export connector to which the monitor's analysis port
is connected, but a corresponding write method must still be defined for the data transfer. In the example
below, the write method only calls the sample method for covergroup objects, which updates the
coverage estimate. The results of coverage analysis are reported in the report_phase method. In a
real-life case, the write method could implement a register bank model, which could be used for
comparison to ensure that the register bank inside the design under test is working properly.
apb_transaction tx;
covergroup dut_write_coverage;
  write_cov: coverpoint tx.addr[6:2] iff (tx.write_mode == '1);
endgroup
covergroup dut_read_coverage;
  read_cov: coverpoint tx.addr[6:2] iff (tx.write_mode == '0);
endgroup

function void write(apb_transaction t);
  tx = t;
  dut_write_coverage.sample();
  dut_read_coverage.sample();
endfunction
endclass
The following diagram and code sample show the apb_agent class, which is derived from the uvm_agent
class.
// apb_agent: member variables for the agent's subcomponents
apb_driver    m_driver;
apb_monitor   m_monitor;
apb_analyzer  m_analyzer;
apb_sequencer m_sequencer;
endclass

// apb_env: contains the agent
apb_agent m_agent;
endclass

// apb_test: creates the environment and passes the sequence length to the
// sequence as configuration information
apb_env m_env;

function void build_phase(uvm_phase phase);
  super.build_phase(phase);
  seq_config.apb_test_cycles = APB_TEST_CYCLES;
  uvm_config_db #(apb_sequence_config)::set(this, "*",
      "apb_sequence_config", seq_config);
endfunction
endclass
The model above creates a UVM environment and executes a single sequence. In real design projects, a
large number of different tests are required. A different sequence is required for each test, and the UVM
environment may also be different in different test cases. In practice, it is advisable to first create one
UVM test class for the UVM testbench, which defines the common characteristics of all tests. From this
class, a separate class can then be derived for each different test. In these classes, the build_phase
function is defined so that the UVM environment component of that test is given configuration
information that allows the environment to create exactly the components needed for that test.
Similarly, the run_phase functions can be defined so that a different sequence is started in each test. By
doing this, different tests can be executed as separate simulation runs. For example, with the QuestaSim
simulator, a UVM test can be selected directly on the command line by passing the test class name with the
+UVM_TESTNAME argument (e.g. vsim my_tb +UVM_TESTNAME=my_test1). Here, my_tb is the name of the
testbench module and my_test1 and others are the names of the UVM test classes.
Summary
The first impression of the UVM testbench can be quite confusing: there are many different classes, and
in addition, many UVM-specific constructs and macros are needed in the class definitions that make the
code difficult to understand. However, many of the different constructs appear in the same form in all
declarations, so the designer does not need to understand their meaning profoundly in practice in order
to be able to create testbenches. In addition, a novice designer usually only has to deal with the design
of a single agent, which is a fairly straightforward task. In practice, the work consists mainly of the
implementation of the verification plan with UVM sequences and analyzer components.
In principle, the design and implementation of the UVM testbench could proceed as follows:
1. Select the interfaces of the design that you want to connect to the UVM testbench and develop
or obtain SystemVerilog interface descriptions for them.
2. Create SystemVerilog interface instances in the testbench module, connect them to the design,
and store them in the uvm_config_db database in the testbench module's initial procedure.
3. Design a block diagram of the UVM test bench defining the required UVM components and the
TLM interfaces between them (connector types and transaction data types)
4. Write class declarations for transactions derived from the base class uvm_sequence_item
5. If you want to pass configuration information to UVM components with objects derived from the
uvm_object class, write the declarations of these classes
6. Write class declarations for non-hierarchical UVM components by deriving them from UVM base
classes (uvm_driver, uvm_monitor, uvm_sequencer)
7. Write class declarations for hierarchical UVM components (uvm_agent, uvm_env, uvm_test) and
add member variables for the UVM components they contain
8. Add member variables for TLM connectors to all UVM component classes
9. Write a build_phase function for all UVM component classes, starting with the top-level
uvm_test component and ending with the lowest-level non-hierarchical components. The
build_phase function must contain at least the following parts:
a. Call to the build_phase function of the base class (super)
b. If the component needs configuration information, reading the configuration object
from the uvm_config_db database into a variable with the correct type
c. If configuration information is required to create lower-level components: creation,
initialization, and saving of configuration objects to the uvm_config_db database
d. Creating the component's own TLM connectors with the new function and setting the
value of the corresponding member variables
e. Creating lower-level UVM components using the create method of their factory classes
and storing the created objects in the corresponding member variables
10. Write a connect_phase function for all UVM component classes that must contain at least the
following parts:
a. Call to the connect_phase function of the base class (super)
b. Creation of all required port-to-port, port-to-export, and export-to-export connections
inside the component.
11. For each UVM component that has an imp-type TLM port, write the method required by that
port type (put, get, write, etc.) that receives and processes the transaction.
12. Add a virtual interface member variable to each driver and monitor class, and read the interface
definition from the uvm_config_db database in the build_phase function and set it as the value
of the member variable.
13. Write the code for the run_phase methods of drivers and monitors.
14. Create a UVM sequence by deriving it from the class uvm_sequence, and write the test program
into its body method.
15. If the testbench has analyzer components, write the code for their run_phase methods.
16. If you want to print out data from the testbench, write report_phase methods for the classes
where it is needed.
17. Write the run_phase method of the UVM test with at least the following parts:
a. Declaration of a local variable of the UVM sequence type, and creation of a sequence object in it
using the create method of the sequence class.
b. Execution of the function call phase.raise_objection(this)
c. Starting a sequence with its start method by passing a sequence object as an argument
(often using a hierarchical path to the variable in which the handle of the sequence is
stored)
d. Execution of the function call phase.drop_objection(this)
18. Add a run_test subroutine call to the initial procedure of the testbench module. This can be
done in two ways:
a. run_test ("UVM test component name")
b. run_test (), in which case the name of the UVM test component must be given to the
simulator in its user interface or on the command line in a simulator-specific way
SystemC is a method used for high-level modeling of digital systems. In practice, it is a C++ class library
that defines the means needed to describe, connect, and simulate the operation of digital system
components. A SystemC model can be translated into an executable computer program with a standard
C++ compiler by linking it with a freely available SystemC library. Often, however, SystemC models are
executed like hardware description language models in a simulator, allowing them to be simulated
alongside SystemVerilog or VHDL models.
In order to use SystemC, the designer must be familiar with the basics of object-oriented programming
with the C++ language and the classes, data types, and modeling principles defined in the SystemC
standard. For designers familiar with register-transfer level hardware description languages, SystemC
includes macros that make it easy to create component models, or modules, and describe their
interfaces. Using these macros, the designer only needs to learn how to write functional code in C++.
The SystemC standard defines many constructs for describing system-level operations using concurrent,
communicating processes. These constructs can be used to describe communication between processes
using abstract channel models, and process synchronization through events. In the following, only the
constructs required to describe synchronous, clocked systems are discussed, as these are the constructs
that digital hardware designers need in order to create descriptions suitable for high-level synthesis.
The figure below shows the internal functional structure of an executable SystemC model. The model
consists of the SystemC library, which contains the class definitions of the SystemC objects and the
simulator kernel with which the user-designed model is executed. The SystemC library is linked to user
models like a normal C++ library (e.g. with the GNU g++ compiler using the -lsystemc option).
An executable simulation model (see figure above) consists of modules that are objects instantiated from
SystemC class library's sc_module base class. Models can be hierarchical so that the modules can contain
other modules that are represented as member variables of the higher-level modules. The operation of a
module is described by processes, which are member functions of the modules that have been
registered as processes with the SystemC simulator kernel. Modules and processes communicate using
channels (C in the figure below), which are also objects based on SystemC classes. The modules are
connected to the channels via ports (P). The model to be simulated and its verification environment (test
bench) are created in the same way using modules as in SystemVerilog and VHDL. However, the top level
of the executable SystemC model is the sc_main function, which is called first from the simulator or from
some other operating system level process. This function, in turn, creates the top-level module and then
starts the simulation by calling the sc_start function. The designer's task is to create the parts of the
simulation model shown on the orange background in the figure, that is, the sc_main function and the
SystemC modules.
The simulation of the SystemC model is based on the event-driven simulation method commonly used in
digital simulators. This means that simulation models create events by writing values to ports or
channels (signals), as a result of which models of objects "sensitive" to these events, acting as readers of
the communication channels, are started either in the next simulation time step (after a delta delay) or
at a later simulation time (if a delay was specified for the write).
In a SystemC model, the functions to be simulated are described in the member functions of the
modules, which are registered as executable processes in the constructor function of the module class.
The SystemC simulator kernel starts processes by calling those functions. Modeling the concurrent
operation of processes is based on cooperative multitasking. This means that a process that is running
passes control back to the simulator kernel by calling its wait method. This stops the execution of the
process either for a time specified in the wait call or until a specified event occurs. The most typical
example of the latter is that the process waits for the next rising edge of the clock signal.
All objects instantiated from the SystemC base classes (module sc_module, ports sc_in and sc_out, or
channels such as sc_signal or sc_fifo) automatically "register" under the simulator kernel, and will then
function according to the simulation mechanism described above. This distinguishes SystemC models
from standard C++ programs. Normal C++ objects work in SystemC models according to the C++
execution rules, and therefore for example communication between processes cannot be implemented
by using conventional variables if the communication is to take place according to the exact execution
order specified by the simulator kernel.
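As a minimal illustration of this point (a sketch written for this text, with invented module and signal names), communication between two clocked processes should go through a channel such as sc_signal rather than through an ordinary member variable:

#include <systemc.h>

SC_MODULE(comm_example) {
  sc_in_clk clk;

  sc_signal<bool> flag_sig;  // channel: writes create events, readers see the
                             // new value only after a delta delay
  bool flag_var;             // plain C++ variable: no events, the value seen by
                             // another process depends on execution order

  void producer() {
    while (true) {
      wait();                // next rising edge of clk
      flag_sig.write(true);  // well-defined communication through a channel
      flag_var = true;       // order-dependent, not recommended between processes
    }
  }

  void consumer() {
    while (true) {
      wait();
      bool from_channel  = flag_sig.read();  // deterministic, kernel-ordered value
      bool from_variable = flag_var;         // may or may not be updated yet
      (void)from_channel; (void)from_variable;
    }
  }

  SC_CTOR(comm_example) {
    flag_var = false;
    SC_CTHREAD(producer, clk.pos());
    SC_CTHREAD(consumer, clk.pos());
  }
};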
A digital component model is created by defining a new C++ class for it. This is called a module in
SystemC. The member variables of the class define the input and output ports, internal signals and
registers of the component, and its member functions, or methods, describe the operation of a
component. The module description is usually divided into two files, of which the header file (ending in
.h) contains the class declaration and the implementation file (usually ending in .cpp or .cc) the code that
implements the methods of the class.
The new class is declared in this file using the macro SC_MODULE, which in practice only expands to
"struct fir : sc_module", i.e. it creates a new class, called fir, based on the sc_module class defined in the
systemc.h file.
The module fir first declares member variables sc_in_clk, sc_in and sc_out that represent a clock input, a
data input and a data output port, respectively. The data types of these ports are defined as class
template parameters inside the <> characters. For example, sc_in <sc_int <N_DATA_BITS>> declares an
input port whose data type is a SystemC signed integer sc_int whose bit-count is N_DATA_BITS (a
constant defined in the fir_defs.h file).
Next, the model declares variables data_r and data_out_r that represent the filter's registers, and a
prototype for a member function run of type void, which means that the function does not return a
value.
The constructor macro SC_CTOR initializes a fir object and registers it with the SystemC simulator kernel.
The macro SC_CTHREAD informs the kernel that the function run is of type SC_CTHREAD (a clocked
thread), which is sensitive to the rising edges of the clock signal clk. The constructor also defines a
synchronous reset signal for the model.
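The header listing itself is not reproduced above, so the following sketch shows what fir.h could look like, based on the ports, registers, and constructor described in this section and on the testbench code later in the chapter; the exact port names and types are assumptions.

#ifndef FIR_H
#define FIR_H

#include <systemc.h>
#include "fir_defs.h"

SC_MODULE(fir) {
public:
  // Ports (names and types inferred from the surrounding text)
  sc_in_clk clk;
  sc_in<bool> rst_n;
  sc_in<sc_int<N_DATA_BITS> > data_in;
  sc_in<bool> data_en_in;
  sc_in<sc_int<N_COEFF_BITS> > coeff_in[N_TAPS];
  sc_out<sc_int<N_DATA_BITS> > data_out;
  sc_out<bool> data_en_out;

  // Registers of the filter
  sc_int<N_DATA_BITS> data_r[N_TAPS];
  sc_int<N_DATA_BITS> data_out_r;

  // Prototype of the process describing the module's operation
  void run();

  SC_CTOR(fir)
  {
    SC_CTHREAD(run, clk.pos());     // run is a clocked thread sensitive to clk
    reset_signal_is(rst_n, false);  // active-low synchronous reset
  }
};

#endif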
Module implementation
There are three different types of functions for describing processes in SystemC: SC_METHOD,
SC_THREAD and SC_CTHREAD. SC_METHOD processes are functions that are executed from beginning to
end when their start condition is realized. The SC_THREAD and SC_CTHREAD processes, on the other
hand, can synchronize their execution to an event with wait statements. Of these, SC_CTHREAD is
sensitive to only one event, usually the rising edge of the clock signal. The operation of the fir module is
described by an SC_CTHREAD type run function, whose implementation is shown below.
The SC_CTHREAD process is divided into two parts by the first wait statement. The part preceding this
statement is called the reset region, which is executed when the reset condition defined for the module
in the SC_CTOR constructor is realized and this function is called by the simulator kernel. The part of the
process following the first wait statement consists of one eternal loop, which must contain at least one
wait statement. This loop therefore describes the behavior of the module on each clock cycle. A function
of type SC_CTHREAD returns only when a reset occurs, in which case its execution starts from the
beginning. Code describing functions that are executed only once can be placed between the wait
statement terminating the reset part and the eternal loop.
The eternal loop of the run function shown above contains two wait statements. The first one is inside a
while loop, in which the model waits for the input data_en_in to be set to state 1. When this condition is
met, the filter algorithm is executed, and the result is written to data_out_r and the output data_out,
and the output data_en_out is set to state 1. The second wait statement of the loop is executed on the
next clock edge, after which data_en_out is set to 0, and the next iteration of the eternal loop begins.
Notice that input and output ports are read and written using their read and write methods.
In the filter model, the loops are named data_r_SHIFT_LOOP and ACCU_LOOP. This is not necessary, but
it is useful if the code is to be used for synthesis, as it makes it easy to refer to loops in the synthesis
program settings.
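The implementation of run is shown in the original material as a figure; the sketch below reconstructs it from the description above. The bit widths of the intermediate results and the details of the filter arithmetic are simplified assumptions.

#include "fir.h"

void fir::run()
{
  // Reset region
  for (int i = 0; i < N_TAPS; ++i)
    data_r[i] = 0;
  data_out_r = 0;
  data_out.write(0);
  data_en_out.write(0);
  wait();

  // Eternal loop: behavior of the module on each clock cycle
  while (true)
  {
    while (data_en_in.read() != 1)        // first wait statement: wait for an input sample
      wait();

    data_r_SHIFT_LOOP:
    for (int i = N_TAPS - 1; i > 0; --i)  // shift the sample delay line
      data_r[i] = data_r[i - 1];
    data_r[0] = data_in.read();

    sc_int<N_DATA_BITS> accu = 0;
    ACCU_LOOP:
    for (int i = 0; i < N_TAPS; ++i)      // multiply-accumulate
      accu += data_r[i] * coeff_in[i].read();

    data_out_r = accu;
    data_out.write(data_out_r);
    data_en_out.write(1);
    wait();                               // second wait statement

    data_en_out.write(0);
  }
}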
The simulation environment in the figure consists of the module to be verified (fir), a testbench¹ module
(fir_tb) for creating and analyzing test data, and a top-level module (fir_top) that creates the instances of
the module to be verified and its testbench and connects them to each other (with signals that have the
_sig suffix in the figure). The function of all these modules can be described using the same principles as
in the case of the fir module.
¹ In this example, testbench means, in accordance with the SystemC User Guide, a module that generates test stimuli
and checks the simulation results. The designation differs from, for example, SystemVerilog terminology, where a
testbench usually means the top-level module that instantiates a separate test program (program block).
The top-level of the SystemC simulation model is the sc_main function, which corresponds to the main
function of C-language programs. This function creates an object of the top-level module class (here
fir_top), which then builds the simulation model of the lower-level modules in its SC_CTOR constructor.
After creating the simulation model, sc_main starts the simulation by calling the sc_start function.
Simulation continues until one of the models (usually the testbench) finishes it by calling the sc_stop
function.
#ifndef FIR_TB_H
#define FIR_TB_H
#include <systemc.h>
#include "fir_defs.h"
SC_MODULE(fir_tb) {
public:
sc_in_clk clk;
sc_in<bool> rst_n;
sc_int<N_DATA_BITS> data_r[N_TAPS];
sc_int<N_DATA_BITS> data_out_r;
void run();
SC_CTOR(fir_tb)
{
SC_CTHREAD(run, clk.pos());
reset_signal_is(rst_n, false);
}
};
#endif
The code for the run process of the testbench is shown below. Its reset part sets the initial values for the
filter coefficients. The main loop is executed 1000 times, and within it, values are assigned to the data_in
and data_en_in inputs of the filter, and its outputs are read. After having executed the loop, the
testbench stops the simulation by calling the sc_stop function.
#include "fir_tb.h"
void fir_tb::run()
{
int sample_counter = 0;
sc_int<N_DATA_BITS> input_data = 1;
// RESET SECTION
data_in.write(0);
data_en_in.write(0);
for (int i = 0; i < N_TAPS; ++i)
coeff_in[i] = ( i == N_TAPS/2 ? (1<<(N_COEFF_BITS-2)) : (i%2 == 0 ? 1 : -1));
wait();
// MAIN LOOP: executed 1000 times
while (sample_counter < 1000)
{
  data_in.write(input_data);
  data_en_in.write(1);
  wait();
  data_en_in.write(0);
  while (data_en_out.read() != 1)
    wait();
  cout << sample_counter << " " << input_data << " " << data_out.read() << endl;
  ++sample_counter;
}
sc_stop();
}
Top-level Model
The task of the top-level model of the simulation environment is to connect the design to be verified and
the testbench prepared for it, and also to generate clock and reset signals. The top-level module of the
fir design is shown below.
At the beginning of the top-level module, a member variable clk of the type sc_clock is declared, which
defines the clock signal of the simulation. The sc_signal type variables define the signals that connect the
reset signal (rst_n) to the modules, and the ports of modules fir and fir_tb to each other. Next, the
variables fir_instance and fir_tb_instance of the classes fir and fir_tb are declared to create the instances of
these modules. Another way to create these instances would be to declare the variables as pointer-type,
and to create the instances dynamically with the C++ new operator in the fir_top class' SC_CTOR
constructor. Before the definition of the constructor, a prototype for the function called reset_thread is
declared. This function is used to generate the waveform of the reset signal rst_n in the beginning of the
simulation.
The class constructor contains more functionality than in the earlier cases. Before the functional code,
the member variables clk, fir_instance and fir_tb_instance are initialized (orange area in the figure). The
initialization of the clk variable with clk("clk", 20, SC_NS, 0.5) defines a clock signal with a period of 20 ns
and a duty-cycle of 50%. Initialization of sub-module instances fir_instance("fir_instance") and
fir_tb_instance("fir_tb_instance") gives names to these instances, as required by the SystemC standard.
Text-format names could also be registered for the other variables (rst_n, data_in, etc.), which might be
helpful in debugging the code, but this is not necessary.
In the executable part of the constructor, the reset_thread function is defined as a process of type
SC_THREAD, whose execution can be synchronized to the rising edges of the clock signal clk. The
dont_initialize function prevents the reset_thread function from running during the simulation
initialization period. The other statements in the executable part connect the internal signals of the
fir_top module to the ports of the fir and fir_tb module instances. Notice that each element of
array-type port coeff_in must be bound to signals separately.
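Since the constructor code itself is shown only in the figure, the sketch below illustrates what its initializer list and executable part might look like; the signal names with the _sig suffix and the binding order are assumptions based on the description above.

SC_CTOR(fir_top)
  : clk("clk", 20, SC_NS, 0.5),          // 20 ns clock period, 50% duty cycle
    fir_instance("fir_instance"),
    fir_tb_instance("fir_tb_instance")
{
  SC_THREAD(reset_thread);               // reset generator
  sensitive << clk.pos();
  dont_initialize();

  // Connect the design under verification to the internal signals
  fir_instance.clk(clk);
  fir_instance.rst_n(rst_n);
  fir_instance.data_in(data_in_sig);
  fir_instance.data_en_in(data_en_in_sig);
  fir_instance.data_out(data_out_sig);
  fir_instance.data_en_out(data_en_out_sig);
  for (int i = 0; i < N_TAPS; ++i)       // array ports are bound element by element
    fir_instance.coeff_in[i](coeff_in_sig[i]);

  // Connect the testbench to the same signals
  fir_tb_instance.clk(clk);
  fir_tb_instance.rst_n(rst_n);
  fir_tb_instance.data_in(data_in_sig);
  fir_tb_instance.data_en_in(data_en_in_sig);
  fir_tb_instance.data_out(data_out_sig);
  fir_tb_instance.data_en_out(data_en_out_sig);
  for (int i = 0; i < N_TAPS; ++i)
    fir_tb_instance.coeff_in[i](coeff_in_sig[i]);
}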
The file fir_top.cpp of the top-level module is simple, as its process only controls the model's reset
signal. The reset_thread function sets the active-low reset signal to state 0 for five clock cycles, after
which its execution ends.
#include "fir_top.h"
void fir_top::reset_thread()
{
rst_n.write(false);
wait(5);
rst_n.write(true);
}
Main Program
The name of the main program of SystemC models is sc_main. In its simplest form, sc_main creates an
instance of the top-level module and starts the simulation. It is a good idea to create a top-level instance
dynamically with a new operator so that it is not allocated memory from the stack segment of the
operating system process, which usually has a fixed size. A hierarchical model may require a lot of
memory, so placing it in the stack may be unwise. The new operator allocates memory space for the
object from the process's data segment, which can be resized during execution.
include "fir_top.h"
sc_start();
delete fir_top_instance;
return 0;
}
Model Execution
Once all the source code files have been created, they can be compiled with a C++ compiler, and linked
with the SystemC library to form an executable program that can then be run. In the following example,
the environment variables SC_INC and SC_LIB indicate the location of the SystemC header and library
files in the file system. The numbers printed after the program is run are the information printed by the
program's testbench.
> g++ -o fir_exe -I $SC_INC -L $SC_LIB fir.cpp fir_tb.cpp fir_top.cpp fir_sc_main.cpp -lsystemc
> ./fir_exe
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 6 0
7 0 -1
8 0 3
9 0 -1
sc_bv, sc_lv          Arbitrary-length sc_bit and sc_logic vector types.
                      Example: sc_bv<128> busA;
sc_fixed, sc_ufixed,  Template classes for signed and unsigned fixed-point numbers. The type is
sc_fix, sc_ufix       defined in the form sc_fixed<WL, IWL, q_mode, o_mode, n_bits> or
                      sc_ufixed<WL, IWL, q_mode, o_mode, n_bits>.
                      Examples: sc_fixed<8,4, SC_RND, SC_SAT> A;  // rounding and saturation
                                sc_fixed<3,2, SC_TRN> y;          // truncation
The following table shows the C++ language operators defined for these data types.
sc_int and sc_uint
Bit Operations ~ & | ^ >> <<
Arithmetic + - * / %
Assignments = += -= *= /= %= &= |= ^=
Equality == !=
Comparison < <= > >=
Incrementation ++
Decrementation --
Bit Select [x]
Range Select range()
Concatenation (,)
sc_bit and sc_logic
Bit Operations &(and) |(or) ^(xor) ~(not)
Assignments = &= |= ^=
Equality == !=
sc_bv and sc_lv
Bit Operations ~ & | ^ << >>
Assignments = &= |= ^=
Equality == !=
Bit Select [x]
Range Select range()
Concatenation (,)
Reduction and_reduce() or_reduce() xor_reduce()
sc_fixed and sc_ufixed
Bit Operations ~ & ^ |
Arithmetic * / + - << >> ++ --
Equality == !=
Comparison < <= > >=
Assignments = *= /= += -= <<= >>= &= ^= |=
Bit Select [x]
Range Select range()
In order to make type conversions, conversion methods have been defined for data types, which allow
different types of data to be handled quite flexibly. For example, an integer can be converted to a bit
vector directly:
uint8 = 127;
bv8 = uint8; // = 01111111
Conversion functions can be called directly if necessary. In the following example, the to_uint and to_int
functions of the bit vector type sc_bv are called, depending on whether the vector is to be interpreted as
an unsigned or a signed binary number. A list of conversion functions for different data types is provided
in the SystemC standard.
bv8 = "10000000";
uint8 = bv8.to_uint (); // = 128
int8 = bv8.to_int (); // = -128
When defining variables of type sc_fixed or sc_ufixed, the number of bits and the precision of the data
represented by the variable are defined at the same time. This is done by giving the type definition as
C++ template parameters the number of bits and the number of integer bits, i.e. the number of bits to
the left of the radix point. The figure below illustrates the representation of the approximate value of pi
as two 16-bit sc_fixed numbers, of which the integer part of the number x is represented with 8 bits, and
that of the number y with 4 bits. Notice that SystemC fixed-point numbers can be assigned values in
floating-point literal format without first converting them to binary format.
sc_fixed<16,8> x;
x = 3.141357421875; // x saa arvon 00000011.001001000011
// 3.140625
sc_fixed<16,4> y;
y = 3.141357421875; // y saa arvon 0011.001001000011
521406S, Digital Techniques 3
// 3.141357421875
Arithmetic operations on fixed-point numbers are always performed with the precision defined by the
programmer, and with automatic "point alignment", so describing arithmetic functions is as effortless
as in conventional programming. The following figure shows the representation of result of the
multiplication x * y in the variable z, the type of which is defined so that the result always fits in z.
sc_fixed<32,12> z;
z = x * y; // z saa arvon 000000001001.11011101101001101100
// 9.865825653076171875
The following example shows what happens if the value of z, of type sc_fixed<32,12>, is assigned directly as the
value of y, of type sc_fixed<16,4>. The result is negative, which is clearly wrong. This is because the
longer number is, by default, merely truncated at both ends. This kind of behavior is generally not
desirable, and for this reason it is possible to specify, for variables of type sc_fixed and sc_ufixed,
exactly how the number must be shortened.
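The assignment can be sketched as follows (the resulting bit pattern and value follow from simply cutting the 32-bit number down to the 16-bit format, assuming the default modes SC_TRN and SC_WRAP):

// Continuing the example above: z = 9.865825653076171875
y = z; // the upper integer bits and the lower fraction bits are dropped, so
       // y gets the bits 1001.110111011010, i.e. about -6.1343, which is clearly wrong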
If the number to be stored as the value of a variable needs to be shortened from the right side of the
point, we talk about quantization (decrease in precision), and if it has to be shortened from the left side
of the point, we talk about overflow. Quantization and overflow handling are defined by two additional
template parameters.
In the following example, the behavior of the variable w in overflow situations is determined by the
constant SC_SAT (saturation) defined in the SystemC header file. It means that in the case of overflows,
the value w is set to the maximum or minimum value of the data type, depending on the sign of the
value to be set. In the example shown below, w is saturated to the largest possible value
7.999755859375. The template parameter SC_RND, in turn, selects the quantization mode, which means
rounding to the nearest representable value. In the example, no rounding occurs because the result was saturated.
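A sketch of this example (the variable name w and its declaration follow the description above):

sc_fixed<16,4, SC_RND, SC_SAT> w;
w = z; // overflow is handled by saturating: w gets the largest representable
       // value 7.999755859375 instead of wrapping around to a negative number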
If the quantization and overflow parameters are not included in a type definition, the quantization mode
defaults to SC_TRN (truncate) and the overflow mode SC_WRAP (wrap), which corresponds to the
"rolling over" behavior of binary counters. The SystemC standard defines many different quantization
and overflow modes.
Type conversions with most SystemC types are easy also in the case of fixed-point numbers, as the
example below shows:
sc_ufixed<16,8> pi;
sc_uint<8> uint8;
pi = 3.14159265359;
uint8 = pi; // = 3
For some reason, however, the sc_fixed and sc_ufixed types cannot be converted directly to bit vectors.
Instead, one must resort to the use of a temporary sc_uint-type variable (Method 1 below) or an
awkward type conversion (Method 2).
sc_ufixed<16,8> pi;
sc_uint<16> uint16;
sc_bv<16> bv16;
// Method 1
uint16 = pi.range();
bv16 = uint16;
// Method 2
bv16 = static_cast< sc_dt::sc_uint<16> >(pi.range());
When C or C++ code is used as input for architectural synthesis, assumptions about the code execution
platform cannot be made because the processor executing the code does not exist but, on the contrary,
will be generated from the code. This difference imposes many limitations on the use of C or C++
constructs, as not all functions that can be described in computer programs can be implemented in
synthesis.
The most obvious limitations caused by the synthesizability requirement concern constructs related to
memory accesses and management. For example, the use of pointer variables (int *x) is usually only
possible with limitations, e.g. when pointing to an array whose size is defined in the source code.
Because the "memory" that stores the values of the variables in a synthesis model will be implemented
as separate flip-flops, registers, or SRAM blocks instead of a linear memory space, many pointer
arithmetic operations useful in computer programming are not possible. The use of dynamic memory
allocation commands (new operation or malloc library function) is not possible at all, as a circuit
synthesized and manufactured based on the source code cannot allocate new registers or memory
resources. It is also not possible to use recursion, a basic technique of computer programming in which
a function calls itself, since the number of function calls must be known at compile time. Function
subroutines describe combinational or sequential logic and cannot be created dynamically in a
manufactured circuit.
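These limitations can be illustrated with a small fragment written for this text (the function names are invented):

#include <cstdlib>

const int N = 8;

int sum_buffer(int *p)            // usually acceptable: the pointer refers to an
{                                 // array whose size is fixed in the source code
  int sum = 0;
  for (int i = 0; i < N; ++i)
    sum += p[i];
  return sum;
}

int factorial(int n)              // not synthesizable: recursion, the number of
{                                 // calls is not known at compile time
  return (n <= 1) ? 1 : n * factorial(n - 1);
}

void not_synthesizable()
{
  int *a = new int[N];            // not synthesizable: dynamic memory allocation
  int *b = (int *)malloc(N * sizeof(int));
  free(b);
  delete[] a;
}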
For the reasons described above, computer programs written in C or C++ can rarely be synthesized into
hardware without considerable modifications. In addition to the limitations imposed by different target
technologies, the different requirements of synthesis programs from different manufacturers also make
it difficult to apply high-level synthesis, as standards for modeling methods suitable for synthesis do not
yet exist or have not been implemented in synthesis tools.
The designer can choose to describe the progress of the computation either completely or
partially by clock cycles by adding wait statements to the algorithm, but the scheduling of operations in
clock cycles can also be left entirely to the synthesis program.
The figure below shows an RTL architecture that consists of two registers a_r and b_r and a multiplexer
connected to their outputs. The registers are loaded from inputs a_in and b_in when the input load_in is
true. The input select_in can be used to connect the state of the register a_r or b_r to the output q_out.
The SystemC model below describes the operation of the circuit. The registers are described in the
SC_METHOD process seq, and the multiplexer in the SC_METHOD process comb. In the module
constructor, the process seq is defined by the statement "sensitive << clk.pos() << rst_n.neg()" to start
when the clock input clk rises or when the reset input rst_n falls. Similarly, the combinational logic
process is defined by the statement "sensitive << a_r << b_r << select_in" to start when one of its inputs
a_r, b_r, or select_in changes state. In the implementation of the process seq, an if statement is used to
check whether the reset input rst_n is in the active state, and if it is not, the else branch performs the
action activated by the rising edge of the clock. The principle is the same as in the VHDL and
SystemVerilog processes describing sequential logic. The combinational logic process comb consists only
of an if-else statement that describes the multiplexer. The registers are represented by sc_signal-type
variables a_r and b_r. The dont_initialize function call prevents the processes from being executed once
automatically at the start of the simulation, which would otherwise happen even if the start conditions
specified for them had not occurred. Without this automatic initialization, the circuit does not initialize
until the reset input is activated.
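The model itself appears in the original material as a figure; the sketch below reconstructs it from the description above. The module name, data widths, and any port names not mentioned in the text are assumptions.

#include <systemc.h>

SC_MODULE(regmux) {
public:
  sc_in_clk    clk;
  sc_in<bool>  rst_n;
  sc_in<bool>  load_in;
  sc_in<bool>  select_in;
  sc_in<int>   a_in;
  sc_in<int>   b_in;
  sc_out<int>  q_out;

  sc_signal<int> a_r, b_r;   // the registers are represented by sc_signal variables

  void seq()                 // sequential logic: the two registers
  {
    if (rst_n.read() == false) {     // reset input active
      a_r.write(0);
      b_r.write(0);
    }
    else {                           // otherwise the start was a rising clock edge
      if (load_in.read()) {
        a_r.write(a_in.read());
        b_r.write(b_in.read());
      }
    }
  }

  void comb()                // combinational logic: the output multiplexer
  {
    if (select_in.read())
      q_out.write(a_r.read());
    else
      q_out.write(b_r.read());
  }

  SC_CTOR(regmux)
  {
    SC_METHOD(seq);
    sensitive << clk.pos() << rst_n.neg();
    dont_initialize();

    SC_METHOD(comb);
    sensitive << a_r << b_r << select_in;
    dont_initialize();
  }
};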
In addition to the SC_METHOD process, the SC_CTHREAD process can also be used to describe
sequential logic by placing only one wait statement inside its execution loop.
Most simulators have similar solutions for simulating RTL code in a SystemC testbench. Below is such a
module definition for the fir module presented above.
#include <systemc.h>
#include "fir_defs.h"
When the RTL model is simulated, the instantiation of the SystemC module must be replaced by the
instantiation of the foreign module. In the example case, the initialization of the variable fir_instance in
the fir_top.h file should be changed, for example, like this:
#ifdef SYSTEMC_SIM
fir_instance("fir_instance"),
#endif
#ifdef RTL_SIM
fir_instance("fir_instance", "work.fir"),
#endif
In this example, the macro RTL_SIM is defined for RTL simulation, as a result of which the constructor of
the sc_foreign_module class is given the name of the component's RTL model in the simulator's library.
In addition, instead of the fir module, the file containing the foreign module must be compiled.
Verification of the SystemC model in a SystemVerilog or VHDL testbench is also usually possible in
simulators. This may be necessary if only one or a few modules of the design otherwise modeled in
these languages are implemented in SystemC. If the port types in the SystemC module are selected so
that they have equivalent types in SystemVerilog or VHDL, the SystemC model can be used as a
component in the same way as SystemVerilog or VHDL components. Data types that can be
unambiguously converted to logical signals (sc_bit, bool, sc_bv, sc_int, sc_fixed, etc.) meet these
requirements.
Loop Transformations
Array Transformations
Data Flow Optimizations
Interface Synthesis
Interface Types
Interface Synthesis for SystemC Designs
Interface Synthesis for C++ Designs
Summary
The starting point for designing a digital circuit is often a data processing task described as an algorithm.
The first step in the design is to define the register-transfer level (RTL) architecture of the circuit. It defines the
circuit's registers and the combinational logic parts that produce the registers' next-state values. The
RTL architecture is then described in VHDL or SystemVerilog. This description can be simulated to verify
its function, and it can be implemented as a gate-level model by using logic synthesis. This chapter
introduces topics related to the design of the RTL architecture of a circuit, and in particular the
automatic synthesis of the RTL architecture.
One way to classify RTL architectures is based on the form in which they process data. In parallel data
architectures, operations are applied to values, and values are stored in registers in bit-parallel format.
For example, the addition of integers and the storage of the result in a register can then take place in
one clock cycle. In a serial architecture, values are processed bit by bit, in which case, for example, the
duration of the addition of two numbers in clock cycles depends on the number of bits in the numbers.
The figure below illustrates this difference. Figure a) shows a serial adder containing only a full adder
and a D-flip-flop that stores the carry bit. The addends are brought to the inputs a_in and b_in bit by bit,
the least significant bit first. Figure b) shows a 4-bit parallel adder where four full adders are required. As
the number of bits in the addends increases, the difference in the component counts of the solutions
increases in the same proportion, as does the difference in performance.
Due to the poor performance of serial architectures and the difficulty of designing them, they are rarely
used. From here on only parallel architectures are discussed in this chapter.
Another way to classify architectures is based on how centralized or decentralized control is used in
them. Centralized control means that the architecture of the circuit includes combinational logic blocks
that perform basic arithmetic and logical operations, registers, and multiplexers with which the desired
data can be fed to the above-mentioned components. All these structural blocks of the data processing
part of the circuit, i.e. the datapath, are controlled centrally by a finite-state machine. Such a finite-state
machine-with-datapath, or FSMD, architecture is flexible because the computational algorithm it
performs can be changed by changing only the control state machine. Thanks to multiplexing,
combinational logic resources can be efficiently shared between the algorithm's operations, at the
expense of performance. The figure below shows a very general-purpose FSMD architecture whose
datapath consists of a register bank and an arithmetic-logic unit. If the fixed state machine in this
architecture is replaced by a solution in which codewords fetched from a memory are used to control
the combinational logic of the state machine, a programmable instruction-set processor is formed.
In a distributed-control-based architecture, the components of the circuit's datapath are not controlled
completely centrally. Such an architecture consists of relatively independent processing elements,
between which communication connections are established, instead of multiplexed buses, by fixed
connections that are determined by the data flow of the algorithm. A distributed architecture can have a
very high-performance, but at the same time it is completely case-specific. The figure below shows an
architecture that performs the multiplication of matrices in a systolic¹ manner.
¹ Data moves through all the elements in pulses, like blood in the circulatory system.
Problems related to the design of distributed architectures are always case-specific and not all
algorithms are suitable or easily adaptable to be implemented as such. For example, the implementation
of interfaces requires special solutions that can supply data to the processing elements fast enough. The
remainder of this chapter will focus on the FSMD architecture.
The properties of the scheduled algorithm are described by the concepts of latency and throughput.
Latency refers to the time taken by the algorithm for computation, that is, in practice, how many clock
cycles it takes to compute a result. Throughput tells how many results per unit time a scheduled
algorithm can produce. Instead of throughput, it is often easier to use the term throughput period,
which is the inverse of throughput. The throughput period tells the time interval between successive
results produced by the scheduled algorithm, so it has the same unit as the latency, for example,
seconds or the number of clock cycles.
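As an illustration with invented numbers: if a scheduled algorithm needs 10 clock cycles to compute one result but, thanks to pipelining, a new computation can be started every 4 clock cycles, then with a 100 MHz clock (10 ns period)

latency = 10 cycles × 10 ns = 100 ns
throughput period = 4 cycles × 10 ns = 40 ns
throughput = 1 / 40 ns = 25 million results per second.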
As an example, consider scheduling the operation
R4 = R1 + R2 + R3,
where R1, R2, R3 and R4 are registers. Conventional two-input combinational adders with a maximum
propagation delay of 7 ns are available for the implementation of the additions. Without latency and
resource constraints, the architecture will be as shown below. Here, operation R1 + R2 is selected to be
performed first. The maximum delay of the critical path of the circuit can for simplicity be assumed to be
at least 7 ns + 7 ns = 14 ns, taking into account the delays of the registers, and the clock frequency of the
circuit is determined accordingly. Because the algorithm is executed in a single clock cycle, the latency of
the architecture is determined only by the delay of the critical path in combinational logic. The latency
of such an architecture designed without resource constraints is as low as possible. Because each
operation has its own combinational logic block, the number of components required by the
architecture is large.
Next, suppose that only one adder is available. The additions must then be scheduled to be performed
in successive clock cycles. The result of the first addition must therefore be stored in a register so that it
can be used as the second operand of the second addition. The architecture is shown in the figure
below. Two multiplexers must be added to the architecture (delay = 2 ns) to select the numbers to be
added, and a control unit is needed to control the multiplexers (a 2-state state machine not shown in
the figure). In this case, the length of the clock period must be greater than 9 ns, for example 10 ns, in
which case the latency of the architecture becomes 20 ns. Assuming that the total number of
components in the two multiplexers and the control state machine is less than that of the second adder,
increasing the latency resulted in a cost saving thanks to the resource sharing it allowed.
The first architectural option can be made to work on the pipeline principle by adding pipeline registers
as shown in the figure below. Thanks to them, the result of the first addition can be stored in the
register R5, whereby new information can be read into the input registers R1, R2 and R3, even though
the computation of the algorithm R1 + R2 + R3 is still in progress. The previous state of the register R3 is
stored in the register R6, so that the right-hand adder can add numbers that are in the same phase in
the pipeline. Pipeline registers increase the latency of the architecture by one clock cycle. At the same
time, the delay of the combination logic is halved, so the circuit can be clocked at about twice the clock
frequency compared to the first version of the architecture.
In the pipelined circuit, successive executions of the algorithm partially overlap in time, as shown in the
figure below. For this reason, the performance of the architecture is double that of the dual adder
architecture operating without pipeline registers.
The findings based on this simple example can be summarized as the following principles:
● By using more resources, the latency of the architecture can be reduced because more
operations can be scheduled for execution as soon as they are ready to be executed due to the
data dependencies of the algorithm.
● Increasing latency by adding clock cycles allows resources to be shared, that is, one resource to
be used to perform multiple operations if they are scheduled to be executed on different clock
cycles. However, resource sharing requires multiplexers and possibly additional registers, which
reduces its benefits.
● The performance of the architecture can be improved by applying the pipeline principle to those
parts of the algorithm where it is possible within the data dependencies of the algorithm.
In addition to trade-off between resource use and performance, the design of an RTL architecture often
needs to consider the impact of the architecture on power consumption. At its simplest, this can mean
that the choice of components that implement arithmetic and logical operations must take into account
not only their delay and area, but also their power consumption. Components that consume less power
are often slower, so this affects the scheduling of the algorithm.
Minimizing power consumption may require a comprehensive modification of the circuit's architecture.
Today, it is customary to select the operating voltage for different parts of a circuit according to the
performance requirements of the algorithms the parts implement. Lowering the operating voltage
reduces power consumption, but slows down the operation of components. Through this, the choice of
operating voltage also affects the scheduling of the algorithms and the design of the RTL architecture.
The figure below illustrates the optimization of power consumption through supply voltage selection.
The block diagram on the left shows a circuit already discussed earlier. Here, it is assumed that its
operating voltage (VDD) is 1V and its normalized power consumption is 1. The figure on the right shows
an identical circuit, which, however, uses a supply voltage of 0.8 V. This is assumed to increase the
delays of its components by about 50%, so that the circuit can no longer operate at the original clock
frequency. To compensate for the slowdown, the architecture is "doubled" so that the registers of both
halves are loaded only on every other clock cycle. The latency and throughput period of both halves
have doubled, but the whole circuit works as before, because the multiplexer placed at the output can
be used to select the result from both halves on every other clock cycle. The active capacitance of the
circuit, which changes state on each clock cycle, has thus not, in principle, changed much. Because the
effect of supply voltage on dynamic power consumption is quadratic², optimizing the supply voltage
together with the architecture significantly reduces power consumption.
The number of available architecture options can be estimated by comparing the data rate to the
selected clock frequency. In many applications, a digital circuit processes the data at a regular rate: for
example, in an audio application, the rate is determined by the sampling rate (e.g., 48 kHz) or in video
applications, the frame rate (e.g., 25 frames per second) or the bit rate of the video data stream (e.g., 8
Mbps). In such cases, the algorithm must be scheduled so that the performance achieved meets the
requirements of the application.

² Dynamic power consumption can be estimated with the formula P = p_SWITCH × f_CLK × C_LOAD × V_DD².
In the example, the capacitance C_LOAD doubles, but the probability of state changes p_SWITCH per clock
cycle is halved, so the decrease of the operating voltage from 1.0 V to 0.8 V reduces the normalized power
consumption to 0.64.
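The figure given in the footnote can be checked directly from the formula: for the doubled architecture

P_new / P_old = (0.5 / 1) × (2 / 1) × (0.8 V / 1.0 V)² = 0.5 × 2 × 0.64 = 0.64,

i.e. the same data rate is sustained with roughly two thirds of the original dynamic power.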
Let’s examine the architectural alternatives for the hardware implementation of the filter algorithm
shown in the figure below in a few different applications. Assume that the clock frequency is 25 MHz
and the filter length is N = 512.
Initially, assume that the application data rate is 48,000 samples per second. In this case, 520 clock
cycles are available per sample, so the algorithm can be computed using a single multiplier-accumulator
(MAC) block consisting of a multiplier, adder and register. In addition, registers are needed to store N
data samples. The delay of the combinational logic shown in the figure must be less than 1/25 MHz = 40
ns. A large number of other architectural alternatives using 2, 3,… 512 multipliers would be available to
perform the computations. This is due to a large difference in data rate and clock frequency. In this case,
however, the use of more than one multiplier is not useful because even with one it is possible to meet
the requirements of the application.
Let us now consider the case where the data rate is 8 million samples per second. In this case, only three
clock cycles are available for executing the filter algorithm. A minimum of 79 multipliers would be
needed to complete all computations. There would thus clearly be fewer architectural options.
In the event that the data rate is the same as the clock frequency, there is only one architectural
alternative. 256 multipliers and 255 adders are needed, as there is no time to use shared combinational
logic blocks to perform two or more calculations. The circuit architecture can in principle be
implemented directly on the basis of the data flow graph of the algorithm by replacing the delay
elements with registers and the arithmetic operations with the corresponding combinational logic
components. A control unit is not needed as data is loaded into all registers on each clock cycle and
there are no multiplexers that require control. The figure below illustrates the principle of such an
architecture. The small yellow rectangles represent pipeline registers added to the architecture so that
delay of the long adder chain does not become too large. The pipeline registers added to the inputs of
multipliers ensure that the operation of the algorithm is not changed as a result of pipelining the adder
chain. If implemented according to the principle shown in the figure, a large number of pipeline registers
would be needed in the architecture.
Based on this example, it can be concluded that there are the most opportunities, and the greatest need,
for optimizing the RTL architecture when the data rate of the algorithm is clearly lower than the clock frequency of the
circuit. In such a case, there are many different scheduling options for operations, and the problem of
optimizing the architecture becomes complex and difficult. If the difference between the data rate and
the clock frequency of the system is very large, as in the first case of the example, it may be easier to
implement the algorithm as a program of an instruction set processor instead of custom hardware. If, on
the other hand, the data rate of the application is the same or almost the same as the clock frequency,
the algorithm itself defines the required RTL architecture. In such a case, in optimizing performance,
attention must be paid to optimizing the algorithm itself, and on the other hand to optimizing the logic
of the circuit.
Performance-Constrained Design
In performance-constrained applications, an absolute requirement, such as a maximum allowable
latency or throughput-period length, is set for the performance of the algorithm. Such applications
include, for example, digital signal processing tasks based on a fixed sampling rate, where data
processing must be performed between two samples. The goal is to design a circuit that meets the
performance requirement and at the same time is as resource-efficient as possible.
Resource-Constrained Design
In resource-constrained applications, the amount of hardware resources used to implement the algorithm is
limited, so the operations of the algorithm must be scheduled onto clock cycles in such a way that the
algorithm can be executed with the predetermined resources. The resource constraint is usually caused by
cost constraints. For example, many consumer products will only attract demand if their price is low
enough. Minimizing manufacturing costs in circuit design for such products can therefore be a primary
optimization goal. The aim of the design is to develop an architecture that is as high-performing as
possible, and whose resource needs remain within the set limits.
Power-Constrained Design
Power consumption can also be an important factor influencing architectural-level design. As discussed
earlier in this chapter, taking power consumption into account can have a major impact on the basic
style of the RTL architecture used. In power-constrained applications, a general power management
solution is usually selected first, for example the use of different supply voltages in different parts of the
circuit, and then the performance- or resource-constrained architecture optimization is done within the
power management solution.
Timing Optimization
Circuit timing is optimized in conjunction with combinational logic optimization when synthesizing a
gate-level model from an RTL model. However, the design of the RTL architecture determines the
combinational parts whose timing is optimized in logic synthesis. If the combinational logic is made too
deep in architecture design by placing too many combinational logic components on a critical path
between registers, the logic synthesis program may not be able to make the logic fast enough.
In such a case, the architecture design must be redone. For this reason, it is necessary to estimate
continually, already during the design of the RTL architecture, how much timing margin (SLACK in the figure
below) is left in the clock period. This is not an easy task, because in addition to the delays of the
computational components, the delays of, for example, the multiplexers used for resource sharing must also
be taken into account.
The example is the FIR filter algorithm, which is computationally simple but can be implemented with
many different RTL architectures with different performance and complexity characteristics. Therefore,
it is well suited to illustrate the key issues in architecture design. The operation of the FIR filter is defined
by the following formula:
data_out(t) = Σ_{i=0}^{NTAPS−1} data_in(t−i) · coeff_in(i)
where data_in(t) is the value of the t-th input sample, data_out(t) is the value of the t-th output sample,
and the values coeff_in(i) are the coefficients of the filter. In practice, data_out(t) is obtained by
computing the weighted sum of NTAPS most recent input values using the filter coefficients as weights.
Thus, the algorithm requires (at least) the storage of NTAPS values, NTAPS multiplications, and NTAPS-1
additions.
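As an illustration, the computation defined by the formula could be sketched in untimed C++ roughly as follows. The function and variable names are assumptions; only NTAPS, data_in, coeff_in and the weighted-sum structure come from the formula above.

// Untimed C++ sketch of one FIR output computation (illustrative only).
// x[] is the delay line holding the NTAPS most recent input samples.
long long fir_step(int data_in, int x[], const int coeff_in[], int NTAPS)
{
    for (int i = NTAPS - 1; i > 0; i--)    // shift the delay line
        x[i] = x[i - 1];
    x[0] = data_in;                        // newest sample, data_in(t - 0)

    long long acc = 0;                     // full-precision accumulator
    for (int i = 0; i < NTAPS; i++)
        acc += (long long)x[i] * coeff_in[i];
    return acc;                            // data_out(t) before truncation
}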
Next, an RTL architecture implementing an FIR filter algorithm is presented. The architecture is designed
with the following requirements specifications:
Input and output data are represented as 24-bit signed integers encoded as 2's complement
numbers using a symmetric number range (code 100 ... 000 is not used). The filter coefficients
are 24-bit fixed-point constants with the radix point to the right of the most significant bit
(decimal range [-0.99999988079071044921875, 0.99999988079071044921875]). The
computations must be performed with full precision using only one combinational logic multiplier
and one adder. The result is truncated to 24-bit, ignoring fractional bits. Data input and output
are based on the valid-ready handshake principle, and the design must have an output register
that stores the result until the next result is completed. The clock frequency of the circuit is 100
MHz, i.e. its clock period is 10 ns.
The general principle of operation of the filter circuit is shown below. The operation has three main
states. Initially, the filter waits for the input valid_in to go to state '1' while keeping the ready_out output at
state '1'. When the condition valid_in == '1' is detected, the circuit executes the filter algorithm. It then enters
a state where it sets the valid_out output to state '1' and waits for the acknowledgment signal ready_in
to rise to state '1'. After that, a new iteration of the main loop is started.
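A minimal SystemC-style sketch of this main loop is given below. The compute_filter placeholder and the exact waiting style are assumptions; only the signal names and the three-state structure come from the description above.

// Sketch of the main loop (body of a clocked SystemC thread).
while (true) {
    ready_out.write(true);                     // wait for input data
    do { wait(); } while (!valid_in.read());
    ready_out.write(false);

    compute_filter();                          // placeholder: execute the filter algorithm

    valid_out.write(true);                     // present the result
    do { wait(); } while (!ready_in.read());   // wait for the acknowledgment
    valid_out.write(false);
}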
The algorithm includes 5 multiplications, 4 additions and a saturation operation, for a total of 10
operations. The number of operations gives an overview of the complexity of the algorithm. If the
algorithm were implemented as an instruction set processor program, the performance of the
implementation could even be estimated based on the number of operations. When designing an RTL
architecture, the complexity of the algorithm gives an idea of the difficulty of the design task.
Each operation must be allocated a resource in the architecture implementing the algorithm. The
following table summarizes the operations and the combinational logic resources required to perform
them. The parameter values following the resource type describe the number of input and output bits of
the component. The Delay column contains the smallest possible achievable delay for the available
resource, which has been estimated by synthesizing the operation in question with a logic synthesis
program.
mul0, mul1, mul2, mul3, mul4    mul #(24, 24, 47)    2.0 ns
The algorithm is computed using the last five values of the input data_in, so the architecture requires
five 24-bit storage locations. The design specification required that the result be stored in a register,
which requires one 24-bit storage location. In addition to these, data generated on each clock cycle that
is used in one of the following cycles must be stored. The following table summarizes the data storage
requirements for the algorithm. In the column describing the format of the data, s<I.F> denotes a signed
(s) binary number with I bits to the left of the radix point (integer part) and F bits to the right side of the
point (fractional part).
operations. However, this does not make sense because a single 49-bit adder can be used to compute all
the additions.
The table below shows the resource allocation and usage, decided clock cycle by clock cycle based on a
scheduled data flow graph. Because there is only one instance of each resource type in this example,
binding operations to resources can only be done in one way. If there were, for example, two multipliers,
it would have to be decided which operation is performed with which multiplier on each clock cycle.
NAME TYPE C0 C1 C2 C3 C4
Registers or SRAMs can also be allocated for data storage based on the scheduled data flow graph. In this
example, only registers are used. The use of SRAMs should be considered already in scheduling, as they
can only be the target of 1-2 read or write operations per clock cycle.
The table below shows the registers reserved for the FIR algorithm. Registers DREG0 - DREG4 and OREG
are allocated on the basis that the circuit must continuously store the last five values of the data_in
input and also the result of the previous execution (output of the "sat" block). Examining the scheduled
data flow graph, we find that only the values mul0, add0, add1, and add2 should be stored in a register,
as they are carried from one clock cycle to another. As each value is only needed for one clock cycle, and
never simultaneously, one common register, the 50-bit ACC, can be allocated for these. The yellow cells
in the table indicate the clock cycles on which that value is loaded into the register. The green cell
indicates the clock cycle on which the register would in principle be available for storing new data.
NAME TYPE C0 C1 C2 C3 C4
In this example, the selection of a register for each value to be stored was easy to do because most of
the data that required storing remained the same throughout the execution. In principle, however, all
the 194 bits reserved for storage were freely available here on each clock cycle, so the information could
have been bound to the registers in other ways, too.
Allocation of Multiplexers
After the allocation of computing resources and registers, the RTL architecture has the components
needed for processing and storing data. These components also act as sources of information that must
next be connected to the inputs of the components. If an input of a component receives data from only
one source, the data can be transferred using a direct connection. If data from different sources has to
be fed to the input of the component on different clock cycles, a multiplexer must be placed in front of
the input. The multiplexer's selection input is controlled by the architecture's control unit.
In the example design, a multiplexer must be added to both inputs of the multiplier component MUL
and to the data input of the ACC register. The binding of operations to resources and values to registers
discussed above affects the number of multiplexers required, so multiplexing costs must be considered
already when making binding decisions.
NAME TYPE C0 C1 C2 C3 C4
Data samples are stored in a shift register formed of registers DREG0 to DREG4. The computation is
performed by selecting one data sample at a time from the input data_in or the registers DREG1 through
DREG4 and one coefficient from the inputs coeff_in [0] through coeff_in [4] using the multiplexers
MUXA and MUXB, multiplying them in the multiplier MUL and storing or adding the multiplication result
with the adder ADD to register ACC. This computation is activated by driving the control signals at the
bottom of the figure from the control unit. After the computation has completed, the result rounded
and saturated to 24 bits is stored in the output register OREG.
When sharing resources between different operations and registers between different values, it is often
necessary to match the bit-count of the data to the bit-count of the shared component. In the figure, for
example, the multiplier's 47-bit output signal mul_v must be converted to a 50-bit signal mul_se by
extending the sign bit. The finishing of an architecture can involve a lot of such small, meticulous
adaptations.
sequence in which it changes the values of the selection inputs of the multiplexers and thus allows the
results of the multiplications to be added to the register ACC. After these execution states of the filter
algorithm, the output register is loaded and the state machine enters state 101, where it waits for the
ready_in input to rise. The state machine then returns to the initial state 000, where it sets the
ready_out output to state '1. The values of the control signals in different states are shown in the table
below. The signal ms[6:0] consists of the selection addresses of the multiplexers and the signal re[6:0] of
the load-enable signals of the registers.
The table below shows the area of the components of the architecture designed above, expressed as
NAND2 equivalent gate-counts. It can be seen from the table that the multiplier accounts for 50% of the
total area, so sharing it between different operations was very profitable. In contrast, the size difference
between the adder and the multiplexer is not large, so the usefulness of sharing an adder usually needs
to be considered more carefully.
COMPONENT GATE-COUNT %
ACC 325 7.7
ADD 219 5.2
CONTROL 50 1.2
DREG0 156 3.7
DREG1 157 3.7
DREG2 156 3.7
DREG3 156 3.7
DREG4 156 3.7
MUL 2193 51.9
MUXA 159 3.8
MUXB 166 3.9
MUXC 87 2.1
OREG 177 4.2
SAT 64 1.5
TOTAL 4221 100
The figure below shows the distribution of the critical path delay of the circuit between the different
components in the case where the circuit is optimized for a clock period of 10 ns. The multiplier plays a
major part here as well, but the relatively large delay of the multiplexer is also worth noting. This must be
taken into account alongside the area when deciding on resource sharing.
Binding of Stored Values to Registers — Decide which value is stored in which register on each clock cycle.
Inputs: scheduled data flow graph, register delays, multiplexer delays and areas.
Control Unit Creation — Create a control state machine that produces the control signals needed by the
datapath on each clock cycle. Inputs: scheduled data flow graph, datapath RTL architecture description.
The operation of the circuit is described for HLS programs as a processing loop that is executed
continuously. In C and C++ models, the processing loop is not presented in the code. Instead, the
top-level function of the model is assumed to be called continuously from such a loop, like the fir
function in the figure below. In SystemC models, the processing loop is specified by the designer in the
model's code, for example, as an "eternal" while loop.
In C and C++ models, external interfaces are specified as the parameters of the top-level function of the
model. Since these programming languages do not include the concept of time, the operation of the
interfaces cannot be described in more detail. In SystemC models, the protocols used by the interfaces
can be described with clock cycle accuracy.
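The difference can be sketched as follows; the fir function and its parameters are illustrative assumptions, not the code of the original figure.

// C/C++ style: the top-level function describes one pass of the processing
// loop, and the HLS tool assumes it is called again for each new sample.
int fir(int data_in, const int coeff[5])
{
    static int x[5];                         // delay line kept between calls
    for (int i = 4; i > 0; i--) x[i] = x[i - 1];
    x[0] = data_in;
    int acc = 0;
    for (int i = 0; i < 5; i++) acc += x[i] * coeff[i];
    return acc;
}

// SystemC style: the designer writes the "eternal" processing loop explicitly
// inside a clocked process.
void run()
{
    while (true) {
        // read inputs, compute, write outputs
        wait();                              // advance to the next clock cycle
    }
}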
Synthesis Flow
The synthesis flow used by an HLS program naturally varies from program to program, but usually the
main steps shown in the figure below can be identified. The source code is initially compiled and
optimized to form an internal representation for the synthesis program. This step is not significantly
different from normal C++ compilation. In the next step, the external interfaces of the design are
synthesized, as the operating principle of the interfaces has a significant effect especially on the
scheduling of the algorithm that takes place in the synthesis of the microarchitecture. Prior to
microarchitecture synthesis, the designer has the opportunity to modify the synthesis program's internal
representation of the design, i.e. the control and data flow graph, to improve the synthesis results. In
microarchitecture synthesis, the operations of the algorithm's control and data flow graph are scheduled
in clock cycles, computational and storage resources are allocated for them, and the scheduled
operations and the data generated in the computation are bound to the allocated resources. This is the
most important optimization step of the HLS, and its implementation depends on the overall
optimization goal and the program used. In the last step, the HLS program writes out the RTL code of the
design for use with simulation and logic synthesis programs.
The internal representation for algorithms in HLS programs is the control and data flow graph (CDFG).
Because this graph is generated from program code by a C++ compiler, it can contain control structures
such as conditional blocks and loops in addition to arithmetic and logical operations. The following figure
shows some program code and the CDFG formed from it. In it, the conditional statements of the
algorithm are presented as multiplexers. Because of the while loop, the graph also contains an edge that goes backwards in time.
The inclusion of control functions in the data flow graphs reveals the "implied" combinational logic
included in the algorithm. It can be seen from the figure that the if statement requires adding a
multiplexer in front of the variables that are assigned to inside it. If the algorithm has a lot of nested if
statements and many variable assignments in them, a large number of cascaded multiplexers are
generated, which increases the delay caused by them. This may be difficult to notice in the program
code. Typically, HLS programs include the ability to view the CDFG in graphical form. This feature is
useful in algorithm design and optimization.
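As a simple illustration (hypothetical code, not the code of the figure), each variable assigned inside an if statement implies a multiplexer whose select input is driven by the condition:

if (a > b) {
    y = a - b;
    z = z + 1;
}
else {
    y = b - a;     // z is not assigned here, so its old value is fed back
}
// Conceptual hardware view generated in the CDFG:
//   y = (a > b) ? (a - b) : (b - a);
//   z = (a > b) ? (z + 1) : z;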
Optimizations performed already by the C++ compiler can have a big impact on the synthesis results.
One such optimization is common subexpression elimination, which modifies the data flow graph of the
algorithm so that the same expression does not need to be computed twice. This makes it easier to
write code as it allows code to be written in a way that better describes the intent of the designer than
code in which common expressions have been eliminated "manually." Below is an example of the
elimination of a common expression and its effect on the data flow graph.
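A hypothetical illustration of the transformation (not the original figure): the sub-expression a * b is written twice but computed only once after the elimination.

// As written by the designer:
x = a * b + c;
y = a * b - d;

// After common subexpression elimination (done automatically by the compiler):
t = a * b;        // computed once; one multiplication in the data flow graph
x = t + c;
y = t - d;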
Expression balancing, shown in the figure below, is an important optimization in hardware design. The
C++ compiler may optimize loops whose iteration count is represented in the code as a constant by
creating copies of the code in the loop body so that the value of the loop variable is substituted in the
replicated code. This way the repetitive updating of the loop variable's value is avoided. In the example
in the figure, the chain of additions created by unrolling the loop is problematic because of its large
delay. However, the compiler can optimize it by changing the evaluation order of the operations so that
the data flow graph does not become so deep. If the code contains an expression with many arithmetic
operations, their execution order can also be controlled by using parentheses in the code.
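For example (an illustrative sketch, not the code of the figure), unrolling a four-term accumulation produces a chain of three dependent additions; expression balancing regroups them into a tree of depth two, and the same grouping could be forced with parentheses:

// Chain produced by unrolling the loop: three additions in series (depth 3).
sum = ((d[0] + d[1]) + d[2]) + d[3];

// After expression balancing: the same additions arranged as a tree (depth 2).
sum = (d[0] + d[1]) + (d[2] + d[3]);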
Constant propagation is also one of the basic optimizations in compilers. In it, the compiler substitutes
constants in expressions, and simplifies expressions if possible. For example, the following code
simplifies to the form Z = 1 when X is first substituted in the expression of Y, and X and Y are then
substituted in the expression of Z.
X = 21;
Y = 8 - X / 3;
Z = Y * (63 / X - 2);
Constant propagation thus simplifies expressions that are overly complex for some reason. In HLS tools,
constant propagation also has a more hardware-oriented form; one example of this is shown in the figure below.
Compiler optimizations such as those described above give the coder a certain freedom in how carefully
the hardware that the code will produce must be considered while writing it. However, compiler
optimizations are HLS tool specific, so the user should become familiar with the features of the
program being used.
Scheduling of Operations
In operation scheduling, the synthesis program decides on which clock cycle the operations of the
algorithm are performed. The length of the clock period is determined before this. The figure below
shows one possible schedule for the CDFG of the algorithm presented above. Because the graph
contains a loop, orange symbols representing the states of the control state machine have been drawn
in it for clarity. The operations inside the blocks following each state symbol are executed in that state.
In scheduling, the most common optimization target is the number of resources, with the number of
clock cycles being fixed, or the number of clock cycles, with the number of resources being fixed. In
addition, the optimization must ensure that the total delay of successive combinational logic operations
executed in a clock cycle (TC* in the figure) is not too large. To this end, the program must rapidly generate
estimates of the delays in the components needed to carry out the operations. This also applies to the
multiplexers required to implement the control functions. Estimating their delays is not easy as they can
be very irregular in structure.
The designer controls the scheduling by placing constraints on latency, number of resources, and
pipelining for the synthesis program. The effect of these constraints is discussed below, using the data
flow graphs of the expression Z = A * B + C * D + E as an example.
The latency constraint determines how many clock cycles the synthesis program can use for the
schedule of the operations in the CDFG. Typical latency constraints are:
● latency ≤ LMAX, where the latency must not exceed LMAX
● LMIN ≤ latency ≤ LMAX, where the latency must be between a certain minimum and maximum
value
● latency = L, where the latency must always be exactly L
In a latency-constrained schedule, the secondary optimization goal is usually the area: the synthesis
program must always satisfy the latency constraint while at the same time trying to minimize the amount
of resources it uses. In the example below, the latency constraint is 2 clock cycles.
This can be achieved with the resources shown on the right in the figure.
A resource constraint can be used to set an upper limit on the resources used in the RTL architecture.
This can be done in many ways. In practice, it is usually done by setting an upper limit on how many
instances of the component type that perform a particular operation can be used in the architecture.
Such a constraint can be easily converted to a constraint on the scheduling of operations. In general, it is
customary to limit the number of operations that require a lot of logic gates, such as multiplications and
divisions. In the following example, the number of multipliers is limited to one. For this reason, the
algorithm must be scheduled for three clock cycles and multiplexers must be added to it.
The two examples presented above illustrate a basic scheduling situation where changing the latency
can affect the number of resources (or vice versa). The number of options increases if the synthesis
program is allowed to apply the pipeline principle in scheduling. In that case, the next iteration of the
algorithm is started before the previous one has finished.
The constraint controlling pipelining in high-level synthesis is called the initiation interval (II). It
determines how many clock cycles the scheduler waits after the start of the execution of the algorithm
before the next execution is started. In the figure below, II = 1, and the algorithm is scheduled for three
clock cycles. Scheduling and resource allocation must then take into account all three simultaneous
executions I1, I2 and I3. For this reason, more computing resources and registers are needed than in the
three-clock-cycle-latency architecture described above.
Depending on the characteristics of the synthesis program, different combinations of latency, resource,
and pipeline constraints can be formed, which makes it easy to explore the architecture alternatives
required by different kinds of applications. For example, in an application where it is important to
produce results at regular intervals (e.g., 30 frames per second), but the time it takes to produce one
result is not very significant (e.g., video decoding delay of 0.5 seconds), the latency constraint can be set
to be long and the pipeline initiation interval to the required throughput period. Due to the long latency,
the number of components in the architecture does not necessarily increase significantly, even if the
pipeline principle is used.
In addition to optimizing the architecture, an HLS program can find out whether it is at all possible to
implement the algorithm within the performance or cost constraints of a particular application. To this
end, the design can be scheduled using special scheduling algorithms, such as "as-soon-as-possible"
scheduling (ASAP). In the ASAP scheduling, operations are executed as soon as possible regardless of the
use of resources, that is, immediately when the data they need is available. In this way, it is possible to
quickly find out what is the minimum latency with which the algorithm can be implemented. The most
resource-efficient architecture, in turn, is obtained by scheduling the algorithm without any latency
constraint. In this case, clock cycles are added to the schedule without restrictions so that resources can
be shared as often as possible. From this result it can be deduced whether it is possible to implement
the algorithm as a circuit with a sufficiently low price. If the actual requirements of the design fall
between the results achieved in these two extreme ways, it can be concluded that the design is feasible.
An optimal implementation can then be sought by using more precise scheduling constraints.
In the figure below, one multiplier and one adder have been allocated for the multiplications and
additions of clock cycles 1 and 2. Fast versions of these have been chosen because the multiplication and
addition are in series within the clock period, and their combined delay is therefore large. The lone
addition that is done with a smaller number of bits in the third clock cycle is implemented with a
separate, slower adder.
Resource allocation is not a trivial operation: even if two operations are scheduled in different clock
cycles to allow resource sharing, the increase in delay due to multiplexing may require the allocation of
faster but larger resources, making it more advantageous to allocate two separate, smaller and slower
resources. Since this analysis must be done between all operations of the same type, the problem grows
rapidly as the number of operations grows large.
The allocation of registers is based on the analysis of the "lifetime" of the data. Values created in one
clock cycle and used in subsequent clock cycles must be stored in registers at the end of the cycle on
which they were created. This applies both to values read from inputs and to values generated as a
result of operations performed on that clock cycle. There is a binding problem also with register
allocation. In this case, it must be decided which value is to be stored in which register if several
registers can be selected. The goal is again to minimize the number of multiplexers.
In the case of the figure below, the value mul1 (24 bits) and the value add1 (25 bits) for a total of 49 bits
are transferred from clock cycle 1 to clock cycle 2. At the end of cycle 2, the value sub1 (26 bits) must be
stored. The values mul1 and add1 do not need to be stored at the end of cycle 2, so the 49 bits reserved
at the end of cycle 1 can be freely used to store the value of sub1. In the figure, two registers are
allocated as storage resources, of which REG26 contains 26 bits and REG24 24 bits. The values add1 and
sub1 are bound to register REG26, and the value mul1 to register REG24. In practice, the synthesis
program would reserve only 49 register bits and perform the sharing of the registers between the values
at the bit level. For the sake of clarity, two separate register components are allocated in the example.
Let’s look at how a good and poor binding of operations and values affects the architecture by using the
example shown in the figure below. In the scheduled data flow graph, the input ports a_in and b_in are read on
two clock cycles. The values in the ports may be different on different clock cycles. These values are
multiplied by the values of the variables r1 through r4. The bottom of the figure shows the allocation of
resources and registers. On the first clock cycle, the binding of operations to resources and values to
registers can be done freely. On the second clock cycle, binding can be made in two ways. The way
presented in green text is better. In it, the multiplication mul3 is bound to the multiplier M1 and the
multiplication mul4 to the multiplier M2. Therefore, input a_in can be hardwired to multiplier M1 and
input b_in to multiplier M2. Correspondingly, the value mul3 is bound to the register REG1 and the value
mul4 to the register REG2. Each register thus receives information from the same multiplier on each
clock cycle. Multiplexers are only needed in the second input of the multipliers. This binding produces
the architecture shown in the figure on a green background. If the bindings on the second clock cycle
were made in the other way (red text), multiplexers should be added to each input of the multipliers, as
well as to the inputs of the registers, according to the architectural diagram shown on the red
background.
Loop Transformations
Loops are handled differently in high-level synthesis than in register-transfer level synthesis. In RTL
synthesis, loops are completely removed during compilation by duplicating the code contained in the
loop before it is compiled into combinational logic functions. This process is called loop unrolling. Once
unrolled, all the computations included in the loop can be performed in parallel and in principle in one
clock cycle. The example below illustrates the principle of unrolling a loop. In practice, however, the
unrolled loop is not presented as program code as below, but the unrolled form is created in the CDFG
used internally by the program.
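A sketch of the idea is given below with an assumed five-iteration multiply-accumulate loop (the actual loop of the figure may differ):

// Original loop: one multiplication and one addition per iteration.
for (int i = 0; i < 5; i++)
    acc = acc + d[i] * c[i];

// Fully unrolled form: all operations become visible in the data flow graph.
// (If acc is known to be 0 initially, the first addition can be optimized
// away, leaving 9 operations.)
acc = acc + d[0] * c[0];
acc = acc + d[1] * c[1];
acc = acc + d[2] * c[2];
acc = acc + d[3] * c[3];
acc = acc + d[4] * c[4];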
In high-level synthesis, loops are treated differently. Most HLS tools let the designer decide how the
loops are translated into hardware. In general, the default handling for loops is that they are not
unrolled and only one copy of the hardware that implements their body is created. One iteration of a
loop always requires at least one clock cycle, or more if the computations cannot be scheduled in one
cycle. The CDFGs of the loop shown above and its unrolled version are shown below (original loop on
the left, unrolled on the right).
By comparing the graphs, some conclusions can be drawn about the effects of loop unrolling:
● The number of different scheduling options becomes manifold when a loop is unrolled. In this
example, the algorithm can be implemented in 1 to 9 clock cycles after unrolling the loop,
depending on how many operations are placed in one clock cycle. In the case of a not-unrolled
loop, the algorithm can be executed in either 5 or 10 clock cycles, depending on whether the
two operations contained in the loop are performed in one or two clock cycles.
● The complexity of the scheduling problem increases, as the data flow graph to be scheduled
after unrolling the loop contains 9 operations instead of 2 in the not-unrolled version. This
increases the running time of the synthesis program because the program now has to go
through a much larger set of different options. For this reason, the use of loop unrolling must be
carefully considered, especially if the number of iterations of the loop or the number of
operations it contains are large.
Loops can also be partially unrolled, for example as shown below. Here, the number of executions of
the loop is halved and the body of the loop is changed so that on each iteration the operations of
two iterations of the original loop are executed. By applying partial unrolling of a loop it is possible to
curb the increase in complexity of the scheduling problem and at the same time benefit from the
additional scheduling freedom that unrolling provides.
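A sketch of partial unrolling by a factor of two (illustrative code, not the original figure):

// Original loop: 10 iterations, one addition per iteration.
for (int i = 0; i < 10; i++)
    acc = acc + d[i];

// Partially unrolled by a factor of two: 5 iterations, each performing
// the work of two iterations of the original loop.
for (int i = 0; i < 10; i += 2)
    acc = acc + d[i] + d[i + 1];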
Another transformation that can be applied to loops is loop pipelining. In it, the loop is first completely
unrolled and the resulting data flow graph is then scheduled using the pipeline principle by starting a
new execution of operations before the execution of the entire data flow graph has completed. The
following figure shows the data flow graph of the loop
In this example, the data flow graph has been scheduled with latency 6 and initiation interval 2 as
indicated in the figure. The bottom of the figure shows the execution times of the different operations
for the first three iterations. In the clock cycles marked with an asterisk, the architectural resource
requirement is greatest, three multipliers and two adders. Two terms describing the operation of
pipeline architectures, stage and phase, are also marked in the figure. The number of stages tells how
many executions are running at one time. The number of phases tells how many clock cycles one stage
lasts. In the example of the figure, the pipeline principle has been applied to the for loop. Most
commonly, it is applied to the top-level processing loop of the algorithm, whereby the entire design
operates on the pipeline principle.
It is not always possible to make a loop work on the pipeline principle. The following are the two most
common cases in which the pipeline principle cannot be used.
Data dependencies between different iterations of a loop can prevent pipelining with a certain initiation
interval. In the figure below, a value within the for loop is computed for the variable X, which depends
on the previous value of X. If an attempt is made to pipeline the loop with initiation interval 1, in the 1st
stage of the pipeline it is necessary to refer to the value of X, which is computed only in the 2nd stage of
the pipeline, i.e. in practice only on the next clock cycle. An attempt to pipeline such a loop generates an
error message in the synthesis program. However, if the initiation interval is increased to 2, pipelining is
possible.
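The situation can be sketched as follows (hypothetical code; only the loop-carried variable X corresponds to the description above):

// Each iteration needs the value of X produced by the previous iteration.
for (int i = 0; i < N; i++) {
    X = X + a[i] * b[i];     // loop-carried dependency through X
    y[i] = X;
}
// With initiation interval II = 1 the next iteration would need X one clock
// cycle before it has been computed; with II = 2 pipelining is possible.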
Writes to design outputs or memory locations can cause a conflict that prevents the pipeline from
functioning correctly. Below is an example of this. The figure shows a data flow graph of a pipeline
implemented with latency 3 and initiation interval 1. For clarity, the graph is drawn to show the three
executions of the loop side-by-side. On the 1st clock cycle of execution, the design sets the output
ready_out to 0, and raises it again to 1 in clock cycle 3. This would work well if the iterations of the loop
were performed in succession. In a pipelined implementation, they overlap in time. Therefore, when the
3rd stage of the pipeline writes the value 1 to the ready_out port, its 1st stage simultaneously writes the
value 0 to it. This port conflict prevents the use of the pipeline principle in this case.
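A sketch of such a conflicting loop body (illustrative; only the ready_out behaviour follows the description above):

// Loop pipelined with latency 3 and initiation interval 1.
while (true) {
    ready_out.write(0);      // clock cycle 1 of an iteration
    wait();
    process_sample();        // clock cycle 2 (placeholder for the computation)
    wait();
    ready_out.write(1);      // clock cycle 3
    wait();
}
// With II = 1, the 3rd stage of one iteration writes 1 to ready_out while
// the 1st stage of a later iteration writes 0 to it in the same clock cycle.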
Loop pipelining is a very effective transformation that can often significantly improve the performance
of an architecture. However, its effective use requires that its requirements be taken into account
already in the design of the algorithm and the I/O protocol it uses.
Array Transformations
Array data types are used to describe data that has to be referenced with indices. This is a very common
need in high-level synthesis, where the synthesis models are often signal processing algorithms that
process multidimensional data. Such algorithms can be effectively described using array variables, the
contents of which are used within loops, so that loop variables are used to compute the values of the
array indices.
The most straightforward way to implement arrays in HLS is to treat array elements as separate
variables and implement each variable as a register. This is sometimes called array flattening or
scalarization in HLS tools. Depending on the tool, the default action may be to implement arrays with
memory resources, or alternatively, the tool may make a choice between memory and register
implementation automatically based on the size of the array. However, the user can always make a
decision on a case-by-case basis using the tool directives.
The figure below illustrates the flattening of an array. The array is broken down into variables and each
variable is mapped to a separate register. Each register can be written to and read from on every clock
cycle, so there are no restrictions on scheduling the array accesses. During the allocation and binding of
registers, some registers may be optimized out if it is possible to share a physical register among the
elements of the array on different clock cycles. Thus, the final number of registers in an optimized
design may be less than the number of elements in the source code array.
Arrays are commonly used within loops, and the way loops are handled has a large effect on hardware complexity
when using array flattening. If the loop containing array accesses is unrolled completely, all array reads
and writes are based on constant indices after unrolling, as shown in the example below. The flattened
register-based solution shown in the previous figure can then be used to implement the array by adding
only the wires necessary to implement the data transfers described in the code.
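For example (an assumed data-shifting loop, chosen to match the memory example discussed later), full unrolling turns every access into a constant-index access:

// Loop as written: the indices depend on the loop variable i.
int d[6];
for (int i = 0; i < 5; i++)
    d[i] = d[i + 1];

// After full unrolling, every access uses a constant index, so each element
// maps directly onto its own register and only wiring is needed:
d[0] = d[1];
d[1] = d[2];
d[2] = d[3];
d[3] = d[4];
d[4] = d[5];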
If the loop is not unrolled, the situation is different, because the read and write indices are then not
constant. Instead, in the general case, they are computed in some way from the index variable of the
loop. For this reason, address decoding and multiplexing logic must be added to the inputs and outputs
of the registers to select the correct register as the target of the write or read operation. The array is
implemented as a register bank with an RTL structure of the kind shown below. Note that in the case of
large arrays, the area and delay of the multiplexing logic can be large. The synthesis run time can also
increase significantly if the synthesis tool needs to generate area and timing estimates for the
multiplexing logic during synthesis.
Static RAM is an alternative technology for implementing arrays. FPGAs contain a wealth of SRAM
resources, making it easy to map arrays to memories in HLS programs aimed for FPGA design. If the
target of synthesis is an integrated circuit, an SRAM macrocell must be modeled for the HLS program
before it can be used in synthesis.
Mapping array variables to SRAM memory blocks has a significant effect on scheduling. If a standard
single-port SRAM block is available, only one read or write operation can be performed per clock cycle.
For this reason, only one array element can be accessed at a time, as shown in the figure below. The
figure shows the execution of the above loop in the case where the array d is implemented as a
memory. Copying data from one memory location to another would take 10 clock cycles. If dual-port
SRAMs were available, there would be slightly more scheduling options. However, the operation is
always slower than in the case of an array that has been flattened into registers, where the execution of
the loop could be implemented in one clock cycle by connecting the registers as a shift register.
Arrays are often accessed inside loops as shown in the code sample below. If the array is mapped into
memory, the effects of this must be taken into account when choosing how to handle the loops. If the
loop is not unrolled, the register in which the value of the loop variable is stored can be used directly as
the memory address. If the loop is unrolled, the circuit's control state machine must generate a memory
address in each state with its output decoder. The address is selected with a multiplexer, and the value
of its selection input must be decoded from the state of the control state machine. In the case of a large
array, the increase in area caused by unrolling the loop can be significant. However, unrolling may not
improve performance, because the memory can still be read or written only once per clock cycle.
The simplest way to map arrays to memory is to allocate a separate SRAM block for each array, as
shown below. This can be a good solution in FPGA design, where both large block RAMs and distributed
memories that can be assembled from small SRAM lookup tables of programmable logic blocks are
available. In ASIC design, the situation is different, as each SRAM block must first be created and
modeled for the HLS program, and in the layout design phase, they must be placed and routed in part
manually. A large number of small memories significantly increases that kind of design work.
The advantage of using separate memory blocks is that each memory can be accessed simultaneously,
independently of each other. If there is no need for this, two or more arrays can be mapped to the same
memory. In the figure below, arrays a[4] and b[4] are mapped to one SRAM block in two different ways:
on the left the so-called horizontal mapping method is used where the elements of the array are placed
in successive memory addresses. On the right the vertical mapping method is used where the elements
of the arrays with the same index are placed in the same memory location in consecutive bit positions.
Instead of two blocks of memory, only one is now needed, but at the expense of performance. When
using the horizontal method, it is possible to read or write only an element of one array (a or b) at a
time. In the vertical case, it is possible to read or write elements with the same index in both arrays at a
time. In horizontal mapping, unused bits may remain in memory if the bit-counts of the elements of the
arrays mapped to it are not the same (shaded bits in the figure). The choice of the appropriate mapping
method depends on how the algorithm addresses the arrays and the data types used.
If an array stored in memory is a scheduling bottleneck despite the fact that a separate memory is
allocated for the array, the efficiency of accessing the array can be improved by partitioning it before it
is mapped to memory. This means that instead of mapping the array to memory as such, as in figure a)
below, it is first partitioned into sections, after which a separate memory is allocated for each section.
Figures b) and c) show two alternative ways of doing this. Figure b) shows a direct partitioning. In this
example, the array is divided into four sections, each with its own memory. The indices of the array are
mapped to the addresses of the memories in ascending order, that is, in this case, the indices 0 and 1 in
the addresses 0 and 1 of the memory 1, the indices 2 and 3 in the addresses 0 and 1 of the memory 2,
and so on. In Figure c) the array is partitioned cyclically. In this method, index 0 of the array is mapped to
address 0 of memory 1, index 1 to address 0 of memory 2, index 2 to address 0 of memory 3, etc. Cyclic
partitioning is suitable for use, for example, when accessing an array from inside a partially unrolled loop
(see example above). Array accesses will then use successive addresses (in the example i and i + 1),
which in the cyclic partitioning are located in different memories and are thus accessible at the same
time.
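A sketch of such an access pattern (hypothetical code): in a loop partially unrolled by a factor of two, the accesses d[i] and d[i + 1] always fall into different memories of a cyclic partitioning, so they can be performed in the same clock cycle.

// Partially unrolled loop: indices i and i + 1 are used in the same iteration.
for (int i = 0; i < 8; i += 2) {
    sum = sum + d[i];        // in memory 1, 2, 3 or 4 (cyclic partitioning)
    sum = sum + d[i + 1];    // always in a different memory than d[i]
}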
A cluster of operations commonly used in signal processing algorithms is the multiplier accumulator
(MAC) operation:
c + a * b
It is often beneficial to realize this cluster of operations with dedicated components so that its adder or
multiplier are not shared with other operations. The component library of the HLS program can also
contain optimized MAC components that can be used to implement such an operation group. The
implementation of the operation group can be controlled by presenting it in the source code as a
function subroutine, and by preventing the sharing of resources between the operations in the
subroutine and main program. This makes the operations presented as a subroutine practically a new
component, so that the group of operations contained in it can be treated as one operation in the
algorithm scheduling.
The example below illustrates the use of a function subroutine to group operations. In the "INLINED"
column, the MAC operation acc + c[i]*tmp is presented as part of the two-line main program. This
produces the data flow graph shown below the code, which is shown in the figure as scheduled in two
clock cycles. Multiplications mul1 and mul2 can be implemented with a shared multiplier. In the
NON-INLINED column, the MAC operation is described in the muladd function and has been replaced by
a function call in the main program. The function call can be treated as a single operation in the data
flow graph of the main program.
INLINED NON-INLINED
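The two versions could look roughly as follows (a sketch; only the expression acc + c[i]*tmp and the function name muladd come from the description above, the rest is assumed):

// INLINED: the MAC is written directly in the main code; its multiplication
// (mul2) can share a multiplier with the other multiplication (mul1).
tmp = a[i] * b[i];             // mul1 (assumed source of tmp)
acc = acc + c[i] * tmp;        // mul2 + addition: the MAC cluster

// NON-INLINED: the MAC is described in a function of its own. If inlining is
// prevented, the call is treated as a single operation in the data flow graph
// and its adder and multiplier are not shared with other operations.
static int muladd(int acc, int x, int y)
{
    return acc + x * y;        // candidate for a dedicated MAC component
}

// ...and in the main code:
tmp = a[i] * b[i];             // mul1
acc = muladd(acc, c[i], tmp);  // treated as one operation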
If the HLS model code is originally written using function subroutines, they can be automatically inlined
into the main program in HLS programs. In many programs, inlining is done automatically unless the
designer prevents it. For this reason, especially those groups of operations that you may want to
implement with separate "hard macros" should be represented as function subroutines in the code. This
will then allow you to study whether the inlining of the function produces better results than its
implementation as a separate, shareable component.
Most HLS programs have the capability to automatically identify operation clusters and generate custom
components for them. This feature can be useful if the source code is not written in such a way that the
operation groups are described in it as function subroutines. The figure below illustrates the principle of
automatic clustering.
The principle of conventional synthesis is shown on the left: the data flow graph is scheduled using delay
and area estimates of pre-characterized components, after which resources are allocated and bound.
The operations are executed using the corresponding library component. In automatic clustering, the
synthesis program looks for similar groups of operations in the data flow graph, and creates a new
custom component for each frequently recurring group. A delay and area estimate is then created for
this component, for example, by synthesizing it with a logic synthesis program. All operation clusters
compatible with this new component can then be replaced by a single operation, which can significantly
simplify the data flow graph. The number of multiplexers required may also be reduced because
multiplexers are only needed at the inputs of the new component. Without clustering, they could be
needed in the inputs of each basic component.
Interface Synthesis
Interface Types
Digital circuits contain two types of interfaces: block-level interfaces, and port-level interfaces. A block
here refers to, for example, an entity described as a finite-state machine with datapath architecture, of
which there can be several in one design. Ports refer to the inputs and outputs of such a block.
Block-level interfaces refer to inputs that start ("start" in the figure shown below) the operation of a block
or allow it to continue ("continue"), or outputs that indicate that the block is ready to process data (ready) or has
completed a data processing task (done). Block-level interfaces are usually connected inside the block
to its control state machine, and outside the block to other blocks or external interfaces to the circuit.
The port-level interfaces contain the data inputs and outputs of the circuit, as well as the associated
handshake signals. The handshake signals indicate when the data is ready on the port (valid) and when it
has been read from the port (ready). Port-level interfaces can also be more complex bus, memory, or
FIFO buffer interfaces.
In the following example, a code block delimited by red braces describes the operation of an interface
protocol. The block is named PROTOCOL_REGION to make it easy to apply directives on it with HLS tool
commands. If a block is defined as a protocol region, its operations are distributed to clock cycles based
on wait statements as shown in the figure.
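A sketch of such a protocol region inside a SystemC thread is shown below (illustrative; the way the region is named and marked for the tool is tool-specific):

{   // PROTOCOL_REGION: scheduled exactly as written, one clock cycle per wait()
    ready_out.write(true);
    do { wait(); } while (!valid_in.read());   // wait until valid data is present
    int sample = data_in.read();               // read the data
    ready_out.write(false);
    wait();
}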
If protocol code such as the one above is included directly in the same function subroutine as the data
processing algorithm, the function can become complex, and in addition, it is not easy to change the
interface protocol used by it. For this reason, interfaces are often described in SystemC models as
separate classes that contain both the SystemC ports and the protocol code that handles them. Such
classes are called metaports. Metaport objects are created in SystemC modules like normal ports,
after which metaports can be accessed in the processing function code using the methods they provide.
This way of describing interfaces is called modular I/O in HLS programs.
The following example illustrates the differences between a conventional "pin-level" interface and a
modular interface. The left part of the figure describes the protocol code of the design DUT, which
implements the valid-ready protocol. The protocol code waits until port vld_in is in state 1, at which
point it reads the value of input data_in and sets the port ready_out to state 0 for the duration of the
data processing. The ports are SystemC sc_in and sc_out ports, and they are handled by using their read
and write methods. The right part of the figure shows an implementation based on a metaport. In it,
SystemC ports are created inside a separate class vld_rdy_port, and the protocol code is implemented in
the member function "get" of this class. The DUT module definition of the design introduces a variable
"input" of the type vld_rdy_port (1). The processing function calls the "get" function of the "input"
metaport when it wants to read data from port data_in according to the protocol it uses (2). The
protocol can now be changed simply by changing the type of the "input" variable.
The code example on the next page contains a simple but complete definition of a modular interface. It
consists of three classes parameterized according to the data type:
● mio_vld_rdy_in, which describes an input port,
● mio_vld_rdy_out, which describes an output port, and
● mio_vld_rdy, which describes the signal between modules, i.e. a channel.
The member variables "vld", "rdy", and "data" of the port classes are normal SystemC ports, with the _in
and _out suffixes omitted in this example. The first member function "bind" is used to bind the ports
contained in this metaport class to signals of the corresponding class "C" given as an argument. The
operator "()" is overloaded (redefined) so that it calls the bind function. Binding a metaport to a
signal can now be done with the () operator using the same syntax as for conventional SystemC ports:
module.port(signal). The "reset" function sets initial values for the SystemC ports contained in the
metaport. The "get" and "put" functions define the read and write operations. The operator T () defines
a type cast operation and operator = an assignment operation.
The channel class mio_vld_rdy contains three sc_signal type variables that can be used to establish a
connection between mio_vld_rdy_in and mio_vld_rdy_out type metaports. The typedef definitions
contained in the class give names "in" and "out" to the data types of the metaport classes. This makes it
possible to refer to metaport classes "through" the channel class as follows, eliminating the need to use
three different data types in the code:
mio_vld_rdy<int>::in my_input;   // declaration of a metaport variable my_input with data type int
The channel connecting the metaports could be declared in a top-level module like this:
mio_vld_rdy<int> my_channel;
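A minimal sketch of such a modular interface is given below. It assumes a simple valid-ready handshake; the exact timing and coding details vary between HLS tool libraries.

#include <systemc.h>

template <class T>
class mio_vld_rdy_in {                   // input metaport
public:
  sc_in<bool>  vld;
  sc_out<bool> rdy;
  sc_in<T>     data;

  template <class C> void bind(C &c) { vld(c.vld); rdy(c.rdy); data(c.data); }
  template <class C> void operator()(C &c) { bind(c); }   // module.port(signal) syntax

  void reset() { rdy.write(false); }

  T get() {                              // blocking read with the valid-ready protocol
    rdy.write(true);
    do { wait(); } while (!vld.read());
    T tmp = data.read();
    rdy.write(false);
    return tmp;
  }
  operator T() { return get(); }         // type cast operator: x = port;
};

template <class T>
class mio_vld_rdy_out {                  // output metaport
public:
  sc_out<bool> vld;
  sc_in<bool>  rdy;
  sc_out<T>    data;

  template <class C> void bind(C &c) { vld(c.vld); rdy(c.rdy); data(c.data); }
  template <class C> void operator()(C &c) { bind(c); }

  void reset() { vld.write(false); }

  void put(const T &v) {                 // blocking write with the valid-ready protocol
    data.write(v);
    vld.write(true);
    do { wait(); } while (!rdy.read());
    vld.write(false);
  }
  void operator=(const T &v) { put(v); } // assignment operator: port = x;
};

template <class T>
class mio_vld_rdy {                      // channel: the signals between two metaports
public:
  sc_signal<bool> vld;
  sc_signal<bool> rdy;
  sc_signal<T>    data;

  typedef mio_vld_rdy_in<T>  in;         // allows mio_vld_rdy<T>::in declarations
  typedef mio_vld_rdy_out<T> out;        // allows mio_vld_rdy<T>::out declarations
};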
The implementation of a modular interface as described above has one limitation that should be taken
into account in simulation: If a module has two metaports, both of which call the wait function, they are
executed sequentially in the simulation. This means that, in the following example, the metaports would
be read on different clock cycles:
input1_data = data1_in.get();
input2_data = data2_in.get();
However, a synthesis program can usually implement metaports in parallel if the designer so wishes. The
synthesis programs' own modular interface libraries may also contain genuinely parallel
implementations for metaports.
The hardware hierarchy of a C++ model is deduced from its function call hierarchy by first selecting a
function representing the top-level module ("top" in the figure), and then treating the functions called
from it as lower-level hardware block models ("block") as shown below. The connections between the
blocks are described with variables (tmp in the figure).
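A sketch of this structure (illustrative; only the names top, fun1, fun2 and the connecting variable tmp follow the description above, and the loop bodies are placeholders):

void fun1(const int in[8], int tmp[8])   // becomes one hardware block
{
    for (int i = 0; i < 8; i++) tmp[i] = in[i] + 1;   // placeholder computation
}

void fun2(const int tmp[8], int out[8])  // becomes another hardware block
{
    for (int i = 0; i < 8; i++) out[i] = 2 * tmp[i];  // placeholder computation
}

void top(const int in[8], int out[8])    // top-level module
{
    int tmp[8];                          // connection between the two blocks
    fun1(in, tmp);
    fun2(tmp, out);
}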
This principle of hierarchical modeling is easy to understand, but you should also be aware of its
limitations. The most obvious drawback is that the code is executed in the simulation like normal
program code, so that the "blocks" fun1 and fun2 are executed sequentially and at the same rate, and
not in parallel like modules in SystemC or RTL models. It is therefore difficult to describe a design where
the blocks use different sample rates. These kinds of shortcomings can be alleviated with HLS program
specific data types, such as FIFO-buffered channel classes.
The direction of port-level interfaces in C++ models can be deduced from the type of function
parameters and from how they are referenced in the function's code:
● scalar parameters are inputs because their values can only be read;
● pointer and reference type parameters are either inputs if they are merely read in code or
outputs if they are written to;
● the value returned by the function is an output.
int func(int a, int *b, int *c, int &d)
{
    int tmp;
    tmp = a + *b;
    *c = tmp;
    d = 2 * tmp;
    return 3 * tmp;
}

a      Input (scalar)
b      Input (pointer that is only read)
c      Output (pointer that is written)
d      Output (reference that is written)
func   Output (return value)
Based on the type of C++ function parameters it is therefore possible to infer the direction of the
implemented hardware block ports, but not the protocol they use if it is anything other than just a direct
connection. Because it is impossible to describe the protocols in the code, they are added in the
synthesis phase by selecting a library component that implements an interface protocol for each port. It
is thus easy to implement and change the interfaces, but the disadvantage is that the implementation of
the design then depends on the HLS program used.
When an interface component is added to a C++ model, its properties should be considered in the
scheduling of the algorithm's operations. Thus, it is not a mere instantiation of a component, as
functionality must also be added to the control unit of the FSMD architecture to synchronize the
operation of the interface component and the FSMD architecture as required by the protocol of the
interface component. The following example illustrates this.
Assume that a function fun(data_t &data_in,...) is synthesized with the Catapult HLS program. The
first parameter data_in acts as an input to the model. Assume further that we decide to implement the
input with a two-way valid-ready handshake protocol. The component in the Catapult I/O library that implements this protocol is called ccs_in_wait, and it is instantiated with a tool command during interface synthesis.
When the design is synthesized, the HLS program creates the interface component shown in green in
the figure below, as well as a control state machine for it (WAIT CONTROLLER). The control state
machine is connected in the control unit of the FSMD architecture to a STALLER block, which is
responsible for stopping the FSMD processor when an I/O port is waiting for data read or write events to
occur.
The test bench for C++ models is the main program, which calls the top-level function that represents
the design. Once an RTL model with the added interface components has been synthesized from C++
code, the same test bench can no longer be used as such to simulate the RTL design, as the interfaces in
the design may be completely different from the original C++ model. For this reason, in addition to the
RTL model, the HLS program must also create a "wrapper" that contains code that converts the data
generated and received by the test bench into the format required by the port access protocols of the
RTL model. RTL verification is therefore also based on an HLS program-specific solution in C++-based
synthesis. When using the SystemC language, only the simulator-specific instantiation of the "foreign"
module is required.
Summary
The goal of RTL architecture design is to create a design, consisting of combinational logic blocks, multiplexers, registers, and memories, that meets the given latency, throughput, and resource requirements. The architecture can often be designed based on solutions that are well known for a
particular application. For example, high data rate digital filters are often designed this way. The
designer can then achieve a near-optimal result by applying basic architectural design techniques, such
as resource sharing or pipelining, to previous solutions.
If the design starts completely from scratch, the basic style of the architecture is first chosen, and
the best possible solution is sought within it. This chapter dealt with the design of a finite-state machine
with datapath architecture. The key steps in it are algorithm control and data flow analysis, operation
scheduling, computing resource and register allocation and binding. By applying these methods, even
"manual" RTL design can yield good solutions in principle, but a comprehensive comparison of the
different alternatives in a normal product development schedule is usually not possible. Using a
high-level synthesis program, key optimization tasks in architectural design can be performed
automatically, allowing the designer to quickly find an architecture that meets the specific requirements.
Effective use of a high-level synthesis program requires that the user understands the operation of the
key optimization methods used by the program and how they are affected by design features such as
algorithm complexity, I/O protocols or resource-specific constraints. In order to obtain a good synthesis
result, it is almost invariably necessary to modify the control and data flow graph of the algorithm with
transformations, so knowing their usage patterns and impact is important in the use of HLS programs.
Summary
Manufacturing testing of electronics products aims to ensure that each device delivered to the customer
works as required. Testing is not intended to reveal design flaws, as the device in production can be
assumed to be properly designed. Instead, the aim is to detect faults introduced into the device or
circuit during manufacturing. In the manufacturing process of electronic products, testing is performed
after each significant step. In the case of integrated circuits, testing takes place even before they are
installed on the circuit board of the end-product, and in fact before they are packaged.
Due to the complexity of digital circuits, issues related to their manufacturing testing must be considered
already in the design of the circuit. This means that when designing a circuit, efforts must be made to
use solutions that facilitate (or at least do not complicate) manufacturing testing. In addition, the
testability of circuits and devices is improved by adding logic structures to the circuits to facilitate testing.
This is to reduce the time and thus the cost of testing an individual circuit or device. Design methods that
aim to improve testability are known as design-for-testability (DFT) methods.
In the integrated circuit manufacturing process, some circuits get faults that make the circuits unusable.
Such faults include, for example, short circuits from signals to ground or to the supply voltage, unintended connections between signals, or excessive delays within the circuit. These faults have to be detected through the external connections of the circuit, which is difficult, as there can be billions of possible fault points but only tens, hundreds, or thousands of external observation points.
The test coverage requirements in integrated circuit manufacturing are high. Assuming that a final
product (electronic device) contains 100 components and the product's defect rate, i.e. the proportion of
defective devices among the devices delivered to end users, must not exceed 1%, an individual
component's defect rate can only be 1%/100 = 0.01%, that is, 100 parts per million (100 ppm). If we
further assume that an integrated circuit intended for the product is manufactured in a process whose yield is 90%, and that its production volume is 10 million units, there will be one million defective units in the manufactured lot. Of these, at most "component's defect rate" × "production volume" = 0.01% × 10,000,000 = 1,000 defective chips may end up in end products. Thus, 999,000 of the one million defective components must be detected in manufacturing testing. The test coverage must therefore be 999,000/1,000,000 = 99.9%; virtually all defective ICs must be detected by their test.
In addition to the quality of the tests, the production testing of integrated circuits involves the
requirement that the tests be carried out quickly, as the test equipment is very expensive. For this
reason, the number of test patterns required in the test should be as small as possible. A test pattern
refers to the bit pattern applied to the circuit inputs and test points one at a time1. Design for testability
aims to improve the testability of a circuit so that all faults can be detected with as little test data as
possible, and that test data can be fed into the circuit and test results read out as quickly as possible.
1 The term "test vector" is also often used.
Because functional testing alone is not sufficient, the generation of test patterns used in ASIC testing is always based on some kind of fault model. This means that when the tests are created, the circuit is assumed to contain certain types of faults, and the tests try to reveal their effect on the function of the circuit. Examples of such fault models are:
● stuck-at fault model, in which the faulty nodes of the circuit are assumed to be permanently tied
to either a 0 or 1 state,
● bridging fault model, in which two nodes of the circuit are assumed to be connected together, or
● transistor stuck-open or stuck-short fault models, in which the channel of a MOS transistor is assumed to be permanently non-conducting or permanently conducting.
By far the most widely used fault model in generating test patterns is the stuck-at fault model.
The following figure shows a simple circuit modeled with a stuck-at fault at the component output (a)
and the component input (b). In case a) the output of the upper AND gate is stuck in state 0, from which
it follows that the signal controlled by the output causes a wrong state for the circuit's output X and for
the second input of the lower AND gate. In case b) the second input of the lower AND gate is stuck in
state 0, so the 1-state of the output of the previous gate is not passed to this input (this is highlighted by
showing the wire as broken). The output of the lower AND gate is thus in the wrong state compared to a
good circuit.
The stuck-at fault model may seem simplistic at first, as integrated circuits can have many other types of
faults in manufacturing than just stuck-at faults. However, this fault model has proven to be very useful
in practice. The reason for this is that if a set of test patterns can be developed that detects all stuck-at
faults, a large proportion of other faults will also be detected with the same test patterns.
The principle of stuck-at fault testing is that the assumed fault location is driven to the state opposite to the one caused by the fault in order to detect whether there is a fault at that point. Thus, if a test is sought for an input of a NAND gate being stuck at zero (SA-0), that input is driven to the '1' state. The other inputs of the gate must then be driven to such a state that the point
under test has an effect on the output of the gate. The following figure shows three stuck-at faults
located at the input or output of the NAND gate. The correct test pattern for detecting the faults is
marked with a box, as is the value of the output of the circuit in the event of a fault in the circuit.
The following figure shows a slightly more complex circuit in which an SA-0 fault is located at point X. To
detect this fault, point X must be driven to state '1', which is done by driving the AND gate inputs to state
1. Circuit input A must therefore be driven to state '0' and C to state '1'. However, the test result cannot
be read directly from point X as it is internal to the circuit. To observe the result, the circuit must be
brought to a state where the state of its output D depends only on the state of point X. Point Y must
therefore be set to '0'. This is possible because the upper input of the OR gate connected to Y is in state
'0', and the state of the lower input (B) has not been fixed yet. By setting it to '0' as well, the state of Y
becomes '0', and the state of D now only depends on the state of the fault point X. This procedure is
known as path sensitizing. The test pattern ABC for signal X SA-0 fault is thus "001".
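The example can be checked with a small C++ model. The circuit structure below is an assumption consistent with the description above (X = (NOT A) AND C, Y = A OR B, D = X OR Y):

#include <cstdio>

bool good_D(bool A, bool B, bool C) {
    bool X = !A && C;   // AND gate with inputs NOT A and C
    bool Y = A || B;    // OR gate producing Y
    return X || Y;      // circuit output D
}

bool faulty_D(bool A, bool B, bool C) {
    bool X = false;     // SA-0 fault at point X
    bool Y = A || B;
    return X || Y;
}

int main() {
    // Test pattern ABC = "001" drives X to '1' and Y to '0', so D depends only on X.
    printf("good D = %d, faulty D = %d\n", good_D(0, 0, 1), faulty_D(0, 0, 1));
    // Prints: good D = 1, faulty D = 0 -> the SA-0 fault at X is detected at output D.
}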
A circuit with N internal signals can contain 2N different stuck-at faults (an SA-0 and an SA-1 fault for each signal). At first glance, therefore, it seems that as many test patterns are needed. However, the number of patterns can be reduced by eliminating equivalent, redundant, and dominated faults.
Two faults are equivalent if their effect on the circuit is exactly the same and if they can be detected by
the same test. An example of this is a NAND gate whose input's SA-0 fault is equivalent to the output's
SA-1 fault. When generating tests, only one of the equivalent faults has to be considered.
Redundant faults are faults that have no effect on the logic function of the circuit. The figure below
shows an example of such a fault. The SA-0 fault on signal C is activated when ABC = "111". In this case,
however, the output D is in the '1' state due to the "11" state of the signals AB, so that the fault has no
effect on the logic function of the circuit. It is therefore not necessary to develop test patterns for such
faults. Redundant faults are also called untestable faults.
The number of test patterns can be further reduced based on fault domination. Fault F1 is said to
dominate fault F2 if all tests that detect F1 also detect F2, but only some of the tests that detect F2
detect F1. In this case, a test is only required to detect F1. The reduction in the number of test patterns
performed in this way is called compression of the test patterns. Using fault model-based test generation
and test pattern compression, the number of test patterns can be made significantly smaller compared
to functional testing.
Generating test patterns for a complex combinational logic circuit is difficult because fault point control
and result path sensitization are highly interdependent. As described above, tests can in principle be
generated without the help of computer programs, but even with a small number of components, the
task becomes too difficult. For this reason, test patterns are generated automatically using computer
programs. These automatic test pattern generation (ATPG) programs seek to find a test for all possible
faults of interest in the circuit. The operating principle of the programs ranges from directing the circuit
inputs with pseudo-random values to complex search algorithms.
The poor controllability and observability of sequential logic make the generation of test patterns for it difficult. For this reason, test structures are usually embedded in ASICs that make it possible to partition the circuits, in test mode, into blocks that contain only combinational logic or only flip-flops. These methods are discussed later in this chapter.
Stuck-at and bridging fault models describe static faults in the circuit, the occurrence of which does not
depend on the clock frequency of the circuit. Delay faults, in turn, are faults that occur only when the
circuit is clocked at its normal clock frequency, which is generally much higher than the clock frequency
used in the test. The malfunction is caused by an exceptionally large delay of a single gate or an entire
delay path.
Delay fault testing is based on the scan path structures described later in this chapter. Each test consists of two patterns. The first sets the circuit to a known initial state, in which the gates along the path to be examined are "sensitized" so that a state change at the beginning of the path can propagate through the entire path. The second test pattern causes this state change to happen. Once the pattern has been applied, the circuit's response is stored in the scan path with a precisely timed clock pulse. The result read from the scan path indicates whether there are delay faults on that signal path.
When assessing fault coverage, the faults modeled in the circuit are divided into untestable and testable faults. Fault coverage refers to the percentage of all modeled faults that are detected by the test patterns. Generally, however, the concept of test coverage is preferred; it refers to the percentage of testable faults that are detected by the test. Test coverage is of greater practical importance because untestable faults have no effect on circuit function. The following summary gives examples of untestable and testable faults.
Untestable faults are faults that, due to the structure of the logic, do not cause a difference between a faulty and a good circuit. Such faults include:
● faults in unused nodes, for example in the unconnected complementary outputs of flip-flops,
● faults in nodes permanently connected to the same state as the fault (for example, an SA-0 fault at a gate input permanently connected to ground),
● faults at points where fault propagation has been prevented (e.g. an SA-0 fault at the input of an AND gate whose second input is permanently in the 0-state), and
● redundant faults, i.e. faults for which no test can be developed but that do not affect circuit function.
Testable faults are faults that cannot be proven to be untestable. They can be divided into e.g. the following groups:
● detectable faults, i.e. faults for which a test pattern can be generated with a response different from the good circuit response,
● ATPG-untestable faults, for which the ATPG program cannot generate a test and which the program cannot prove untestable. Such faults are due to the constraints imposed on the ATPG program; for example, the designer may have permanently fixed a particular signal in the 0 or 1 state.
● undetected faults, i.e. faults that cannot be shown to be untestable or ATPG-untestable. Undetected faults are divided into non-controllable (for which a test pattern cannot be formed that forces the fault point to the desired state) and non-observable (for which a test pattern cannot be generated that allows the fault point's state to propagate to the observation point).
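Expressed as formulas, the two coverage measures defined above are:
fault coverage = detected faults / all modeled faults
test coverage = detected faults / (all modeled faults − untestable faults)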
Undetected faults occur at locations that, due to the structure of the logic, cannot be forced into the state required by the test, or from which the state of the fault location cannot be propagated to the circuit output. In the first case, the controllability of the circuit is said to be poor, while in the latter case, the poor observability of the circuit prevents a test pattern from being found. Poor controllability and observability are caused by too "deep" combinational logic, and can therefore be influenced by the design of the circuit architecture. Design-for-test techniques can also improve controllability and observability.
If the test coverage of the test patterns is low even after a long ATPG run, the ATPG program's reports can be used to find out which parts of the circuit have undetected faults. This can be tricky, as the ATPG program uses a gate-level model of the circuit, which may no longer retain the original module hierarchy of the design.
The controllability and observability of a logic circuit can be improved with scan path structures that allow the circuit to be divided into purely combinational parts and easily testable sequential parts (in practice, shift registers). The following figure illustrates the use of a scan path to improve the controllability and observability of the circuit from the previous example. This is done by breaking the feedback loop in the circuit with the scan path during the test.
A scan path is created by modifying the registers (some or all) in the circuit so that they can be used as
shift registers during the test. In addition, the shift registers are connected in series so that a scan path
that runs through the entire circuit is created. Through this path, serial test data can be fed into the
circuit's flip-flops during the test, so that the inputs of all combinational logic blocks can be controlled
directly. The test is performed by first setting the circuit into the test mode, in which the test data is
serially loaded into its flip-flops (register A in the figure) via the scan path. The circuit is then placed in its
normal operating mode and clocked once. The flip-flops of the circuit (and thus the registers A and B in
the figure) then store the state of the outputs of the combinational logic blocks. The circuit is then
placed back in the test mode, and the states of the flip-flops are shifted out serially via the scan path. In
this way, the inputs of the adder can be controlled and its outputs can be observed directly.
To add a scan path into a circuit, the structure of the flip-flops must be changed so that they can operate
in the required two modes. The multiplexed D-flip-flop with two data inputs, shown below, is most
commonly used for this. One of the inputs (DATA) is used in the normal operating mode of the circuit,
and the other (SCAN_IN) when daisy-chaining the flip-flops to form a shift register. The SCAN_EN input
can be used to select the operating mode. ASIC libraries contain ready-made scan flip-flops that contain
a D-flip-flop and a multiplexer. Logic synthesis programs can build a scan path from these flip-flops
automatically, so designing scan paths is easy. However, the prerequisite is that the circuit is completely synchronous. Asynchronous structures, such as internal clocks and resets, prevent the creation of a scan path.
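As an illustration, a behavioral SystemC sketch of such a multiplexed scan flip-flop is shown below; it is not an actual library cell, only a model of its function:

#include <systemc.h>

SC_MODULE(SCAN_DFF) {
    sc_in<bool>  clk;
    sc_in<bool>  DATA;     // normal data input
    sc_in<bool>  SCAN_IN;  // serial input used when the flip-flops form a shift register
    sc_in<bool>  SCAN_EN;  // operating mode select: 1 = scan shift, 0 = normal operation
    sc_out<bool> Q;

    void tick() {
        // On every rising clock edge, store either the scan data or the normal data
        Q.write(SCAN_EN.read() ? SCAN_IN.read() : DATA.read());
    }

    SC_CTOR(SCAN_DFF) {
        SC_METHOD(tick);
        sensitive << clk.pos();
        dont_initialize();
    }
};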
The following figure shows how the scan path (green) is built inside the circuit. The blue clouds represent
the combinational logic that is functional in the normal operating mode of the circuit. They are bypassed
when test data is shifted through the scan path.
The test procedure based on the use of the scan path is as follows for one test pattern:
1. The circuit is put into test mode (by setting the SCAN_EN input to state 1).
2. The circuit is clocked so many times that the entire scan path is serially filled with a new test
pattern from the circuit's test data input.
3. The circuit is brought from test mode to normal mode (SCAN_EN = 0).
4. The circuit is clocked once so that all the flip-flops in the circuit store the state of their normal
data input.
5. The circuit is put back into test mode (SCAN_EN = 1).
6. The circuit is clocked so many times that the data stored in the flip-flops can be read out of the circuit serially through its test data output.
7. The test result read out of the circuit is compared with the expected result given by a good
circuit.
This procedure is repeated for all test patterns. If there are N flip-flops in the scan path, it takes N + 1 clock cycles to perform one test. (While the test results are being shifted out, the next test pattern can be shifted in at the same time, so a single test does not take 2·N + 1 clock cycles.) Testing can be sped up by creating multiple scan paths in the circuit, allowing test data to be transferred to and from the circuit faster. However, the test equipment sets limits on the number of scan paths.
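As a rough, illustrative calculation (the numbers are assumptions): with P test patterns and N flip-flops in a single scan path, shifting the next pattern in while the previous result is shifted out gives a total test length of about P · (N + 1) + N clock cycles. For N = 1000 flip-flops and P = 500 patterns this is roughly 500 000 cycles; splitting the flip-flops into four parallel scan paths of 250 flip-flops each shortens the shift operations and brings the total down to roughly 126 000 cycles.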
The use of a scan path requires that one input be added to the circuit to enable the test mode
(SCAN_EN). In addition, one input and output is required for each scan path (TEST_DATA_IN and
TEST_DATA_OUT). However, the normal inputs of the circuit can be used as scan path inputs, and normal
outputs with an added multiplexer controlled by the SCAN_EN signal can be used as scan path outputs.
Generally, all flip-flops in the circuit are included in the scan paths. This has the advantage that all combinational logic parts of the circuit can then be easily tested from the scan paths. This makes it easier
to reach a high test coverage with a small number of patterns. For example, the 64-bit counter shown at
the beginning of this chapter can be tested with a total of 14 test patterns when a scan path is used. The
disadvantage of the method is the increase of the area of the circuit due to the added test structures,
and the increase of the delay of combinational logic due to the multiplexed flip-flops.
In the partial scan technique, only part of the circuit's flip-flops are included in the scan path. This is to
reduce the effect of test structures on circuit area and delay (the scan path can be omitted from
timing-critical points). This is usually done at the expense of test coverage and test time. The flip-flops
are selected for the scan path based on testability analysis of different parts of the circuit, which is
performed using a test synthesis program.
Built-In Self-Test
The test methods presented above are based on the use of scan path structures and on automatic generation of test patterns for combinational logic based on the stuck-at fault model. They can be used to test logic gates and flip-flops. Testing macrocells, such as memory blocks, is less straightforward because their test patterns cannot be generated based on the stuck-at fault model. Therefore, more test patterns are needed than in testing a combinational logic block of a similar size. However, it would take too long to shift
in a large set of test patterns through the scan path from outside the chip, which is why testing memory
blocks is usually done by using built-in self-test (BIST).
The basic architecture of a self-testing block includes three parts: a test pattern generator, a response
analyzer, and a BIST controller. With the circuit in normal mode, the test pattern generator and signature
analyzer are bypassed. In test mode, the control section starts a test pattern generator that feeds test
data to the block under test. The signature analyzer stores the block's response and transfers the result
to a results register in the control section when the test has finished. The result register is part of the
circuit's scan path.
The simplest way to generate test patterns is with a counter that steps through all possible value combinations of the inputs of the block under test. A binary counter, however, requires a relatively large area, which is why test patterns are usually formed with a linear feedback shift register (LFSR), which has simpler next-state logic. An LFSR produces pseudo-random sequences. So-called maximum-length LFSRs go through all possible states (except the all-zero state), so they can be used instead of a binary counter.
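As an illustration, a minimal C++ model of a 4-bit maximum-length LFSR is shown below; the feedback polynomial x⁴ + x³ + 1 is chosen here only as an example. Starting from any non-zero seed, it steps through all 15 non-zero states in a pseudo-random order:

#include <cstdint>
#include <cstdio>

uint8_t lfsr_step(uint8_t state) {
    uint8_t feedback = ((state >> 3) ^ (state >> 2)) & 1u;  // XOR of the tap bits
    return (uint8_t)(((state << 1) | feedback) & 0x0Fu);    // shift left, insert feedback bit
}

int main() {
    uint8_t state = 0x1;            // any non-zero seed
    for (int i = 0; i < 15; ++i) {
        printf("%X\n", state);      // 15 distinct pseudo-random test patterns
        state = lfsr_step(state);
    }
}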
The result of the test could in principle be checked by comparing it with the response of a properly
functioning circuit stored in ROM, but this would be impractical due to the large amount of memory
required for it. For this reason, the response of the block under test is compressed in a circuit structure
called a compactor. The compactor computes a checksum (signature) from the output data produced by
the block to be tested. This signature can then be compared with the signature of a good circuit. Since it
is highly unlikely that both a good and a faulty circuit would produce the same signature, faulty and good
circuits can this way be distinguished from each other. The checksums can also be calculated with an
LFSR-type circuit whose inputs are the output signals of the block under test.
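The compaction principle can likewise be sketched in C++ (the register width and feedback taps are arbitrary choices for illustration): each response word of the block under test is folded into an LFSR-like signature register, and only the final signature is compared with that of a good circuit.

#include <cstdint>

// One compaction step: advance the 16-bit LFSR and XOR in the next response word.
uint16_t compact(uint16_t signature, uint16_t response_word) {
    uint16_t feedback = ((signature >> 15) ^ (signature >> 13) ^
                         (signature >> 12) ^ (signature >> 10)) & 1u;
    return (uint16_t)(((signature << 1) | feedback) ^ response_word);
}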
The RAM and ROM macrocells used in ASIC design are created with module generator programs that can
also create "BIST collars" for the blocks, i.e. the logic required for testing. They can also be created with a
separate synthesis program. Test starting and other control functions must be designed as part of other
testability design, for example by connecting the BIST controller to the scan path.
Self-testing can also be used to test combinational logic instead of the normal stuck-at fault model based
testing. In such a case, the combinational logic blocks to be tested are equipped with BIST structures as
described above. The advantage of this method is that the tests can be performed at the normal clock
frequency of the circuit, as the test equipment does not have to feed test data to the circuit from the
outside. It is sufficient for the test equipment to start the test and read the result. Another advantage of
this method is that long scan paths are not needed, which can save area.
Traditionally, circuit boards have been tested by placing probes at strategic points on the board. Due to
the introduction of surface mount components and multilayer circuit boards, this technique is no longer
applicable, as it is often impossible to place the probe at the desired location on the circuit board. For
this reason, board level testing must be taken into account already in the design of integrated circuits.
The scan path based techniques described above can also be applied to board level testability design. For
circuit board testing, a flip-flop is added to the I/O cells of each integrated circuit on the board so that
these flip-flops form a scan path around the circuit's core logic (so-called boundary scan structure, red
arrow in the figure). In the normal operating mode, these flip-flops are bypassed.
The boundary scan path located at the interfaces of the chips can be used to locate faults in the signals
between components as shown in the figure above. For example, if we want to test the connection A
between chips 1 and 2 of the figure, a test pattern is initially placed in the boundary scan path of chip 1.
The flip-flops of the scan path of chip 2 are then set to a state where they read their values from the
external connections of the chip. When the flip-flops on chip 2 are now clocked, the state of the chip 1
flip-flop that drives signal A should be stored in the chip 2 flip-flop driven by signal A. If the test is
repeated twice, with values 0 and 1, the scan path of circuit 2 is likely to store an incorrect value in at
least one of the cases if wire A is broken. The fault can be detected by reading the contents of the
boundary scan path out of chip 2.
The boundary scan paths of the chips on a circuit board can be connected in series, in which case only
two connectors at the edge of the circuit board are required for the scan path (in addition to the test
mode selection signal). Testing of the internal operation of integrated circuits, for example in
maintenance situations, can also be performed via the circuit board's scan path.
Surrounding integrated circuits with scan paths only makes sense if all the chips on the circuit board
contain a similar structure. This applies to both ASICs and standard components. For this reason, the
boundary scan path architecture of the integrated circuits has been standardized by the IEEE (IEEE std.
1149.1). The standard is commonly known as JTAG. The protocol and structure defined in the JTAG
standard is more complex than with a conventional scan path. The figure below shows the basic
architecture of the JTAG test logic.
The JTAG standard defines four test signals for the logic circuit:
● test data input (TDI),
● test data output (TDO)
● test clock (TCK), and
● test mode select (TMS).
The test signals, the control state machine embedded in the circuit and the instruction register form a
test access port (TAP) through which the test logic is controlled. In addition, the JTAG architecture
includes a number of data registers:
● a mandatory boundary scan register (BSR),
● a mandatory 1-bit bypass register that can be used to bypass the circuit's scan path so that test
data only passes quickly through the circuit,
● an optional device identification register, which stores a 32-bit code that identifies the type,
version, and manufacturer of the circuit,
● user-defined optional data registers, such as self-test structures or scan paths that can be
controlled through the test access port.
The registers consist of a shift register and a shadow latch that is loaded in parallel. The data is always
transferred first to the shift register and then loaded from there to the shadow register. The operation of
the JTAG logic is controlled by instructions, which are loaded into the instruction register via the TDI
input. Each instruction selects one data register between the TDI and TDO pins and controls the
operation of the data registers. The state of status signals indicating the state of the circuit can also be
loaded in parallel into the instruction register.
The standard defines four mandatory instructions, implying that the instruction register must have at
least two bits. The mandatory instructions are:
● BYPASS, which connects the bypass register between the test input and output,
● SAMPLE/PRELOAD, which selects the boundary scan register. With this command, the state of
the circuit's I/O signals is read into the boundary scan register in parallel, after which the
contents of the scan path are shifted out through the TDO output while new data is read from
the TDI input into the scan path,
● EXTEST, which selects the scan path register. This command can be used to read data from the
circuit board (ASIC input pins) to the boundary scan register,
● IDCODE, which selects an identifier register, allowing the circuit board components to be
identified and located.
In addition to the above, the JTAG standard defines the following optional commands: INTEST (BSR
register selected for internal test), RUNBIST (selects a user-defined BIST register that starts a self-test and
stores the result), and USERCODE (selects a user-defined identifier register).
The tests are controlled by a 16-state state machine in the test access port, which is controlled by the
TMS signal and changes state on the rising edge of the TCK clock. A number of control signals are
decoded from the state machine for the registers. The state chart of the finite state machine is shown
below.
When the circuit is operating normally, the state machine is in the Test-Logic-Reset state. The state chart
shows that the test access port controller can be brought from any state to a Test-Logic-Reset state by
holding the TMS signal in state 1 for at least five clock cycles. In this way, the test logic is brought to a
known state, for example after switching on the operating voltages. To exit the Test-Logic-Reset mode,
the TMS input is set to 0.
The other parts of the state machine are the Run-Test/Idle test state and the state sequences used to
write and read data registers and the instruction register. In both of these sequences, the data can be
parallel-loaded to the selected register (capture), data can be shifted through the register (shift), and the
shadow latch of the register can be updated by transferring the data from the shift register in parallel to
the shadow latch (update). The Pause states can be used if the equipment controlling the test has to fetch more test data in the middle of a shift operation.
As an example of performing a test, consider a situation in which data is read from a circuit board to the
boundary scan path. The code of the corresponding instruction (EXTEST) must first be written into the
instruction register. To do this, the state machine must be brought to the Shift-IR state, in which it should
remain for so many clock cycles that the entire instruction word gets into the instruction register. After
that the state machine is stepped to state Exit1-IR and further to state Update-IR, where the instruction
word is transferred to the shadow latch of the instruction register. Several control signals are decoded
from its contents for the boundary scan and data registers. Next, the Capture-DR state is entered, where
data from the circuit board is read in parallel to the scan path register. From this state it is possible to go
to the state Shift-DR, where the data in the boundary scan path is serially transferred out via the TDO
output.
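The amount of serial control involved can be illustrated with a small sketch that lists the TMS values (one per TCK cycle) needed for this sequence; the instruction and data register lengths used here are assumptions, not values from the text:

#include <vector>

std::vector<int> extest_tms_sequence(int ir_bits = 4, int dr_bits = 8) {
    std::vector<int> tms;
    tms.push_back(0);                      // Test-Logic-Reset -> Run-Test/Idle
    tms.insert(tms.end(), {1, 1, 0, 0});   // -> Select-DR -> Select-IR -> Capture-IR -> Shift-IR
    for (int i = 0; i < ir_bits - 1; ++i)  // shift the instruction word in via TDI,
        tms.push_back(0);                  // staying in the Shift-IR state
    tms.push_back(1);                      // last instruction bit -> Exit1-IR
    tms.insert(tms.end(), {1, 1, 0, 0});   // -> Update-IR -> Select-DR -> Capture-DR -> Shift-DR
    for (int i = 0; i < dr_bits - 1; ++i)  // shift the captured boundary scan data out via TDO
        tms.push_back(0);
    tms.push_back(1);                      // last data bit -> Exit1-DR
    tms.push_back(1);                      // -> Update-DR
    tms.push_back(0);                      // -> Run-Test/Idle
    return tms;
}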
From this example it can be concluded that the control of the JTAG test access port requires a lot of serial
control information. In practice, the control is always handled by using a computer, for example by
connecting the JTAG test access port on the circuit board to the computer's USB bus, where it is available
as a serial port. The test access port can then be controlled from a test program written for the
computer.
The operating principle of the P1500 wrapper (the standardized test wrapper for IP blocks, IEEE Std 1500) is the same as that of the JTAG boundary scan path. Its mandatory interfaces are a serial input (WSI) and output (WSO), as well as a wrapper serial control (WSC) input. The WSC input contains 16 signals that are the same as those produced by the control state machine of the JTAG test access port. For this reason, P1500 wrappers work seamlessly with the JTAG test port on SoCs. In addition to the instruction (WIR) and bypass (WBY) registers, the P1500 wrapper includes a wrapper boundary scan path register (WBR) and an optional number of data registers (WDR). Registers within the IP block, known as core data registers (WCDR), can also be connected to the wrapper.
3 The term "core" is used in the standard instead of "block".
The operating principle of the boundary scan wrapper of an IP block is basically the same as that of the
JTAG boundary scan path, but its purpose is different. In the case of an IP block, its main purpose is to
"isolate" the IP block so that it can be tested independently in manufacturing testing. To describe the
operation of a core wrapper, a description language CTL (core test language, IEEE Std 1450.6) has been
defined. A CTL model of a wrapped core can be used with EDA tools to automatically connect the test
wrapper of the IP block to the test structures of the SoC. For example, wrapped IP blocks can be
daisy-chained and controlled via the circuit's JTAG test access port. The blocks can then be tested
independently by bringing the test data to their boundary scan registers and from there on to the inside
of the block to scan paths and other test structures.
Another problem with testing SoCs is that due to their large size, a lot of test data is needed. Because the
number of inputs and outputs of the circuit is very small compared to the amount of test data, it takes a
long time to transfer all the data in and out of the circuit as such, which increases the testing cost. This
problem can be solved by compressing the test data and feeding it in compressed form to the circuit
under test. A decompressor block must be placed inside the circuit to decompress the test data and feed
it into the circuit's scan paths. To the other end of the scan paths, a compressor block must be added to
compress the test results data. This method increases the area of the circuit somewhat, but if the cost of
the area-increase is smaller than the savings obtained from shortening the test time, the method is
viable. The decompressor and compressor blocks can be created with a test synthesis program.
Test structures are created for the circuit in the context of RTL and logic synthesis. In the synthesis, the
multiplexed D-flip-flop is selected as the flip-flop type, whereby the synthesis program implements the
registers using this flip-flop type. It is then easy to add one or more scan paths to the circuit. Also, the
logic for the JTAG test bus and BIST structures can be added at this stage using the synthesis program.
When the logic synthesis program adds test structures to the circuit, it creates a description file ("DFT
model" in the figure) that contains information about the test structures of the circuit. An automatic test
pattern generator program uses this information to generate test patterns for combinational logic.
In SoC design, IP block-specific test structures and their P1500 wrappers are first created, which are then
coupled to the top-level JTAG test logic. When creating IP block wrappers and a JTAG test access port,
one must define their "instruction set" that is needed to perform the tests designed for the circuit. This
step may require a considerable amount of design work. In practice, all test structures are defined by
writing a TCL script using the commands of the synthesis program, which adds test structures to the
circuit during RTL synthesis.
When designing the circuit layout, some testability-related measures can be taken. The placement and
routing program can be allowed to reorder the scan path flip-flops to make them easier to place and
route. Because the scan path is, in effect, a shift register, it is sensitive to hold time violations caused by
clock skew. This can also be taken into account in layout design.
Once the final gate-level model of the circuit has been created, it can be used to create the test patterns
for the circuit with an ATPG program. It is possible for the designer to select the fault model to be used,
and to create test patterns based on it. Generating test patterns is computationally intensive, so it is
possible that the ATPG program will not achieve sufficient test coverage in the time available. In such a
case, the structure of the circuit must be changed so that it is more controllable and observable. Poor
architectural design can therefore cause unpleasant surprises even at the very end of the design project.
The circuit may work properly, but if it cannot be tested, it cannot be put into production.
In addition to architectural problems, design errors in the RTL model can also complicate testability
design and test pattern generation. If the circuit has internally generated clocks, asynchronous resets,
feedback combinational logic, or latch components in the combinational logic blocks, this usually results
in difficulties in, for example, creating scan paths or generating test patterns. These shortcomings of the
RTL model are often the result of HDL coding errors. They do not necessarily affect the logical
functioning of the circuit, and may therefore go unnoticed in the earlier stages of the design project.
Summary
Testability design that facilitates integrated circuit manufacturing testing is an integral part of circuit
design, but in the case of digital circuits it is largely automated. However, it is important for a logic
designer who typically works with IP block design to know the factors that affect testability in both
architectural design and HDL coding of the RTL model. The most important of these factors are the
avoidance of too deep combinational logic, and the use of a synchronous design principle based on a
single clock signal.
SoC testability design is in principle a demanding task, but since the test solution is usually based on the
JTAG and P1500 standards, it can be broken down into more manageable parts that can be combined to
form a working chip-level solution. The design is based on the use of fully automated synthesis
programs.
Power Management
Sources of Power Consumption in Digital Circuits
Power Consumption Caused by Signal State Changes
Internal Power Consumption of Components
Power Consumption Due to Leakage Currents
Summary
Power consumption has become an important consideration in the design of digital circuits. There are
two reasons for this. First, a large proportion of devices that use digital circuits are mobile and therefore
operate on battery power. To ensure a long operating time, the power consumption of the devices must
be optimized to be as low as possible. The other reason lies in semiconductor technology. In processes using very fine line widths, the share of transistor leakage currents in the total current consumption of circuits has grown large, while the number of transistors that can fit on a circuit has continued to grow in accordance with "Moore's Law". This would quickly lead to an unsustainable increase in power
consumption, unless at the same time design technologies and methods are introduced that can reduce
power consumption. This chapter discusses techniques for reducing power consumption at different
stages of the design process.
Sources of Power Consumption in Digital Circuits
The power consumption of digital circuits consists of dynamic power consumption, which results from the operation of the circuit, and static power consumption, which does not depend on the operation of the circuit but is present whenever the operating voltage is switched on. Dynamic power consumption is the result of signal state changes within the circuit and at its interfaces. Static power consumption is caused by leakage currents.
𝑃𝑆𝑊𝐼𝑇𝐶𝐻 = 𝑝𝑆𝑊𝐼𝑇𝐶𝐻 · 𝑓𝐶𝐿𝐾 · 𝐶𝐿𝑂𝐴𝐷 · 𝑈𝐷𝐷²
The power consumption due to signal switching is caused by the charging and discharging of the load capacitance 𝐶𝐿𝑂𝐴𝐷 to the voltage 𝑈𝐷𝐷 at the clock frequency 𝑓𝐶𝐿𝐾, causing the current 𝐼𝑆𝑊𝐼𝑇𝐶𝐻 = 𝑓𝐶𝐿𝐾 · 𝐶𝐿𝑂𝐴𝐷 · 𝑈𝐷𝐷. The power consumption is obtained from the formula 𝑃𝑆𝑊𝐼𝑇𝐶𝐻 = 𝑈𝐷𝐷 · 𝐼𝑆𝑊𝐼𝑇𝐶𝐻. This expression must still be multiplied by the coefficient 𝑝𝑆𝑊𝐼𝑇𝐶𝐻, the average state-change probability of the nodes of the circuit, since not every node of the circuit changes its state in each clock cycle.
It should be noted that 𝑝𝑆𝑊𝐼𝑇𝐶𝐻 is a statistical factor describing the characteristics of a circuit, and it is often difficult to estimate, as the number of state changes in one clock cycle depends not only on the circuit topology but also on the data it processes. However, the number of state changes can be counted for each signal separately during simulation, in which case there is no need to estimate the probability.
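As an illustrative numerical example (the values here are assumptions, not taken from the text): with 𝑝𝑆𝑊𝐼𝑇𝐶𝐻 = 0.2, 𝑓𝐶𝐿𝐾 = 100 MHz, a total switched capacitance 𝐶𝐿𝑂𝐴𝐷 = 1 nF and 𝑈𝐷𝐷 = 1.0 V, the formula gives 𝑃𝑆𝑊𝐼𝑇𝐶𝐻 = 0.2 · 10⁸ 1/s · 10⁻⁹ F · (1.0 V)² = 20 mW.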
Internal Power Consumption of Components
The internal power consumption of CMOS components is due to the component's internal short-circuit
current between the supply voltage and ground. In principle, the PMOS and NMOS transistors that
function as switches are never in a conductive state at the same time, so the internal, operational power
consumption should be zero. In practice, however, the transistors are in a conductive state for a short
period when the state changes, which causes an electric current between the supply voltage and ground.
The internal power consumption of a CMOS component can be calculated with a formula of the form shown below.
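A commonly used approximation of this short-circuit power, written with the terms defined below (an illustrative form; the exact expression and its constant factors vary between sources), is: internal power ≈ 𝑝𝑆𝑊𝐼𝑇𝐶𝐻 · 𝑓𝐶𝐿𝐾 · 𝑈𝐷𝐷 · 𝐼𝑃𝐸𝐴𝐾 · 𝑡𝑆𝑊𝐼𝑇𝐶𝐻.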
Here, the term 𝑡𝑆𝑊𝐼𝑇𝐶𝐻 describes the period during which both transistors conduct at the same time, and the term 𝐼𝑃𝐸𝐴𝐾 describes the current observed during this time. Compared to the power consumption
caused by state changes, the internal power consumption of the components is small if the circuit is
designed so that state changes take place quickly.
The two main leakage current mechanisms of CMOS circuits are the weak inversion state leakage current
and gate-oxide leakage. Of these, the former is dominant in modern technologies, but the share of the
latter is growing.
Weak inversion leakage current (also known as subthreshold leakage) is the current that flows from the drain to the source of a MOS transistor when the channel of the transistor is turned off (vertical red arrows in the figure). The magnitude of this current depends exponentially on the threshold voltage (VT) of the transistor, so that as the threshold voltage decreases, the leakage current increases rapidly. As the line widths have decreased, the operating voltages (VDD) of CMOS processes have also decreased. The on-state current of the transistor, and thus the speed of the logic, is proportional to the difference between the operating voltage and the threshold voltage, VDD - VT, so the threshold voltage has also had to be lowered to maintain performance. The result is an increase in the subthreshold leakage current.
Due to the increase in leakage currents, the design of circuits today requires a choice between
performance and static power consumption. If high (GHz grade) clock frequencies are to be used, a low
VT technology must be used, resulting in high static power consumption of the circuit. Although the
leakage current of a single transistor is on the nanoampere scale, in circuits containing billions of
transistors, the total leakage current can become large. For this reason, circuits designed for mobile
devices have to use high VT manufacturing processes where the manufactured circuits "leak" less but
operate more slowly.
Dynamic power consumption can be reduced at the RTL design stage by making sure that combinational logic blocks do not change state unnecessarily when the output values of the blocks are not used for anything. The following figure illustrates this situation.
In the figure, the blue combinational logic block is connected to an internal, busy data bus. However, the
8-bit counter controlling this part of the circuit only allows the combinational logic block's outputs to be
stored in register REG8 when the counter state is > 127. Half of the information computed by logic is thus
left unused, and half of the energy it uses is wasted. A more energy-efficient solution would have been to place AND gates controlled by the MSB of the counter in front of the combinational block, so that the counter could prevent the data bus state from affecting the combinational block when the counter's state is <128. Alternatively, register REG8 could have been moved in front of the combinational logic block, and the operation of the circuit changed as required.
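As an illustration, the AND-gate (operand isolation) alternative could be modeled in C++ as follows; the helper function and the signal names are hypothetical:

#include <cstdint>

// Hypothetical stand-in for the blue combinational logic block of the figure.
uint8_t combinational_block(uint8_t x) { return (uint8_t)(3u * x + 1u); }

uint8_t isolated_result(uint8_t bus_data, uint8_t counter) {
    bool enable = (counter & 0x80u) != 0;            // MSB of the 8-bit counter (state > 127)
    uint8_t gated = enable ? bus_data : (uint8_t)0;  // AND-gate isolation of the operand
    // When enable == 0 the block's inputs stay constant, so it does not switch needlessly.
    return combinational_block(gated);
}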
RTL synthesis programs capable of optimizing power consumption may be able to correct some problem areas such as those described above. However, such problems are easy to avoid with careful design.
The figure below shows a structure that is repeated countless times in synchronous circuits. The data path performing the data processing contains registers that the control state machine can load on certain clock cycles. Loading is implemented in synchronous circuits with a multiplexer that connects to the data inputs of the register's flip-flops either the new value computed by the data path or, when loading is not enabled, the current state of the flip-flops. In logic synthesis, this structure is usually implemented with enabled D-flip-flops. If loading is infrequent, the register is clocked unnecessarily most of the time.
Unnecessary clocking of registers can be prevented by gating their clock signal with an AND gate on
those clock cycles where the registers are not loaded. This can be done already at the design stage of the
circuit's RTL model, but it is usually done using an automated synthesis program. The synthesis program
is told the minimum number of flip-flops for the registers to be gated, on the basis of which it identifies
suitable targets for gating (it is not useful to gate the clock of small registers). Automatic gating logic
generation has the advantage that the RTL model itself does not need to include a description of gating
logic, which may not be needed in all situations (e.g., if the design is implemented with an FPGA).
However, if the design includes a large entity that is not used for a certain amount of time, it is a good
idea to implement its clock gating in the RTL model.
The following figure shows the gating principle commonly used by synthesis programs. Using an AND gate alone is not a good solution, as hazards in the clock-enabling logic may generate extra pulses in the clock signal. In the schematic shown below, the level-sensitive latch passes the enable signal only during the 0 state of the clock, when the glitches caused by hazards have had time to settle.
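The principle can be sketched in SystemC as shown below. This is only an illustration of the latch-plus-AND structure described above; in practice the cell comes from the technology library and is inserted by the synthesis program, not written into the RTL model:

#include <systemc.h>

SC_MODULE(CLOCK_GATE) {
    sc_in<bool>  clk, enable;
    sc_out<bool> gated_clk;
    bool latched_en;

    void latch() {                                    // level-sensitive latch:
        if (!clk.read()) latched_en = enable.read();  // transparent while clk == 0
    }
    void gate() {                                     // AND of the latched enable and the clock
        gated_clk.write(latched_en && clk.read());
    }

    SC_CTOR(CLOCK_GATE) : latched_en(false) {
        SC_METHOD(latch); sensitive << clk << enable;
        SC_METHOD(gate);  sensitive << clk;
    }
};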
Gating of the clock signal increases the latency and skew of the circuit's clock signal, which must be taken into account in timing design and verification. Because gating is done before the clock tree is synthesized during layout design, estimates must be used to define the timing constraints of the gating logic, which requires special care. In addition, logic placed in the clock path can complicate circuit testability design, as flip-flops whose clock signal is gated cannot be connected directly to the scan path. This is because the state of the gating logic depends on the state of the circuit's flip-flops, which is random during testing. For this reason, the gating logic must be disabled during testing, for which purpose separate logic must be added to the circuit when the scan path is created during synthesis.
Because of the many risks associated with clock gating, the circuit’s RTL model must be designed to be
fully synchronous, after which gating can be implemented using automated programs that generate
gating logic that takes into account the requirements described above. So don't code the clock gating
in your HDL model!
The following figure illustrates the use of voltage scaling to minimize power consumption. It is desired to reduce the power consumption of the circuit (gray box) at the top of the figure by lowering its operating voltage from 1.3 volts to 1.1 volts. In the 90 nm CMOS technology, the resulting increase in delays reduces the maximum clock frequency of the circuit by about 30%. At the bottom of the figure, the circuit is implemented in such a way that its logic is doubled, so that both halves can be clocked at half the clock frequency compared to the original; this halved clock frequency is easily achieved even at the reduced voltage. The amount of load capacitance more than doubles when the effect of control logic and multiplexing is taken into account. However, the effect of the reduced clock frequency and operating voltage is greater, so the total power consumption decreases from 1.7 to 1.2.
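As a rough check of these figures (assuming, for simplicity, that the switched capacitance exactly doubles): the power ratio is approximately (0.5 · 𝑓𝐶𝐿𝐾 · 2 · 𝐶𝐿𝑂𝐴𝐷 · (1.1 V)²) / (𝑓𝐶𝐿𝐾 · 𝐶𝐿𝑂𝐴𝐷 · (1.3 V)²) = 1.21 / 1.69 ≈ 0.72, which is consistent with the stated reduction from 1.7 to 1.2.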
Lowering the operating voltage also reduces the power consumption of the circuit due to static, leakage
currents. For this reason, it is applied in almost all major digital circuits today. Limiting the growth in
power consumption has also changed the design principle of microprocessors by increasing the amount
of logic used instead of increasing the clock frequency. It is also customary to select the lowest possible
operating voltage for the different parts of the system-on-a-chip circuits, according to the performance
needs of the parts. In the case of processor cores, dynamic operating voltage scaling can also be used, in
which case the voltage level and the clock frequency of the processor are changed according to the
computational needs of the active application (a.k.a. dynamic voltage and frequency scaling, DVFS). The
use of several different operating voltages on a chip introduces additional design requirements, because
when transferring information from one power domain to another, the logical levels of the signals must be
matched.
Operating voltage scaling has already been discussed above as a means of curbing dynamic power
consumption. An important additional way to minimize static power consumption is to turn off the
power supply completely (power gating) from those parts of the circuit that are not in use. Examples include the hardwired video codecs of mobile application processor circuits or various radio modems, which have large gate counts. In relation to the total operating time of the device (24/7), they are often unused most of the time, so it does not make sense to keep them powered up. If the flip-flops and memories (or
part of them) of the powered-down part of the circuit are to maintain their state during a power
shut-down, they must be implemented using special state-retention structures.
The following figure illustrates the use of special structures required to implement power management
in SoCs. The circuit contains three power domains (POWER DOMAIN 1 - 3) and uses two different
operating voltages (VDD1 and VDD2).
The blocks "CPU" and "Camera'' use the voltage VDD1 and each contain a separate switch (S3 and S5)
with which the voltage of each block can be switched off separately. However, the block CPU contains a
smaller block (POWER DOMAIN 1.1), which gets a separate retention voltage from the "main switch" S1
of the circuit. This voltage RET V1 keeps block 1.1 running even if its master block is turned off. This small
block could be, for example, an interrupt handling block that wakes up the processor core in the power
saving mode when the user of the device presses a button. The block also includes a corresponding
function for the situation where there is voltage in the host block, but part of the functions of block 1.1.
have to be shut down.
The "Camera" block can also be switched off with switch S5. The external connections of this block (to
the CPU block) must be equipped with isolation structures that prevent its outputs from drifting into an
indeterminate state when the block is powered off. The "DSP" block can be turned off just like the
Camera block, but it uses a different voltage than the other blocks. For this reason, in addition to the
isolation structures, its interface signals require level shifters.
If the performance requirements of the application are such that high threshold voltage technology
cannot be used, it is possible to implement the circuit in a manufacturing process that allows the use of
transistors with different threshold voltages. The circuit is designed in the normal way, but logic synthesis
uses a "multi-VT" component library that contains versions of logic gates consisting not only of transistors
of different sizes, but also of transistors with different threshold voltages. Timing optimization uses fast but leaky low-VT gates on the critical paths of the circuit, and slower but less leaky high-VT gates elsewhere. In this way, the static power consumption of the circuit can be
minimized without sacrificing performance. The disadvantage of this technology is the higher than usual
manufacturing cost, as more processing steps are required to manufacture multi-VT circuits.
Impact of power consumption management on design flow
The “traditional” design flow of digital devices consists of steps in which the logical model is gradually
refined as it moves toward the final structural description. The content of the various design stages is
fairly unambiguously defined, e.g. with timing and resource usage constraints. The success of each
design phase can be verified by comparing the result with the logical model on which it was based.
Operating voltage scaling techniques complicate the established design flow, as the various design steps
can no longer be implemented as logical transformations alone. If a different power management
strategy is chosen for different parts of the digital circuit (operating voltage level, dynamic scaling of voltage
and clock frequency, power gating, power-off modes, clock gating, etc.), the requirements of the different
design stages can no longer be described as logical or timing requirements. The automated logic and
layout synthesis flow must include information about the power intent choices made by the designer so
that design programs can take these into account. This information is expressed using the Unified Power
Format (UPF), defined in the IEEE 1801-2009 standard and supported by logic and layout synthesis
programs.
Power Consumption Estimation
Assessing the power consumption of a digital circuit is an important part of its design, as a reliable
understanding of power consumption must be obtained as early as possible. This also applies to the
individual functional parts of a large system circuit, as they are usually assigned a "power budget" within
which the designer must stay. The designer must therefore have an understanding of the performance
(timing margin), the area (number of gates), and the power consumption of the block she or he is
designing.
The estimation of the average dynamic power consumption of a digital circuit or its functional part is
quite straightforward at the stage when the gate-level structure of the circuit has been formed by
synthesis and its layout has been created. In this case, all information is available except for the
parameter describing the probability of state changes. This can also be estimated based on the circuit
architecture and simulations, after which an estimate of the average power consumption of the circuit
can be calculated. Even if the activity estimate is not accurate, this approach quickly yields a reliable
estimate of the theoretical maximum power consumption of the circuit.
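A back-of-the-envelope version of this estimate uses the familiar relation P ≈ α·C·VDD²·f, where α is the average switching activity per clock cycle. The following plain C++ sketch (illustrative numbers only, not measured data) applies it to the total switched capacitance of a block and also computes the α = 1 upper bound mentioned above.

#include <iostream>

// Average dynamic power of a CMOS block: P = alpha * C * Vdd^2 * f,
// where alpha is the average switching activity per clock cycle.
// All values below are illustrative assumptions.
int main() {
    double c_total_f = 2.0e-9;   // total switched node capacitance: 2 nF
    double vdd_v     = 0.9;      // supply voltage
    double f_hz      = 500e6;    // clock frequency
    double alpha     = 0.15;     // activity factor estimated from simulation

    double p_avg = alpha * c_total_f * vdd_v * vdd_v * f_hz;  // average power
    double p_max = 1.0   * c_total_f * vdd_v * vdd_v * f_hz;  // alpha = 1 bound

    std::cout << "average dynamic power: " << p_avg * 1e3 << " mW\n";
    std::cout << "theoretical maximum:   " << p_max * 1e3 << " mW\n";
}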
Knowing the average power consumption is usually not enough, as it is often necessary to also know
how the power consumption varies during the operation of the circuit, and how large and when it is at
its maximum. This is important, for example, for power distribution network design. Information on
power consumption variations is obtained by simulating a gate-level model of the circuit with realistic
test stimuli, and by storing the times of state changes of the nets within the circuit during the simulation
in a file. After the simulation, this information can be combined with the capacitance information
reported by the layout design program, whereby the instantaneous power consumption of the circuit as
a function of time can be calculated. The result is a graph of the power consumption of the circuit
("power waveform"), an example of which is given below.
The figure shows the instantaneous power consumption of the circuit over a few clock cycles. In the
beginning, there are hardly any state changes in the circuit's combinational logic (reset may be on, or the
circuit inputs have constant values), so that the circuit consumes power only at the rising and falling
edges of the clock signal. During the last three clock cycles, new information is loaded into the flip-flops
of the registers, which is why there are now also state changes in the combinational logic driven by the
registers.
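In very simplified form, the calculation described above could look like the following plain C++ sketch: toggle times recorded during gate-level simulation are combined with per-net capacitances from the layout tool, and the switching energy is accumulated per clock cycle to form the power waveform. The net names, capacitances and toggle times are made up for illustration; real flows read this data from VCD/SAIF files and parasitic extraction reports.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// One recorded state change of a net during gate-level simulation.
struct Toggle { double time_ns; std::string net; };

int main() {
    // Per-net capacitances as reported by the layout tool (illustrative).
    std::map<std::string, double> cap_f = {
        {"clk", 50e-15}, {"n1", 10e-15}, {"n2", 25e-15}
    };
    // Toggle times recorded during simulation (illustrative).
    std::vector<Toggle> toggles = {
        {0.1, "clk"}, {2.1, "clk"}, {2.3, "n1"}, {2.4, "n2"}, {4.1, "clk"}
    };

    double vdd = 0.9, t_clk_ns = 2.0, t_end_ns = 6.0;
    std::vector<double> energy_j(static_cast<size_t>(t_end_ns / t_clk_ns), 0.0);

    // Each toggle dissipates roughly 1/2 * C * Vdd^2; accumulate per clock cycle.
    for (const auto& t : toggles) {
        size_t bin = static_cast<size_t>(t.time_ns / t_clk_ns);
        energy_j[bin] += 0.5 * cap_f[t.net] * vdd * vdd;
    }

    // Average power within each cycle = energy / cycle time ("power waveform").
    for (size_t i = 0; i < energy_j.size(); ++i)
        std::cout << "cycle " << i << ": "
                  << energy_j[i] / (t_clk_ns * 1e-9) * 1e6 << " uW\n";
}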
Summary
The most important decisions affecting power management are made in the context of defining and
designing the architecture of the circuit. In general, the aim is to use the lowest possible clock frequency
and operating voltage, with the result that special attention must be paid to the performance of the
architecture. Once the architecture has been selected, it must be implemented in such a way that the
number of state changes within the circuit is kept to a minimum. The design of the architecture must
also take into account the methods used in the synthesis, such as automatic clock gating, in order to
make efficient use of them. During design, the "evolution" of the power consumption of the circuit must
be constantly estimated, as it is as crucial to the end result as the correct operation and area.
Physical Prototype Design of a
Digital IP Block
Summary
Physical design is here defined as the design phase where the information needed to manufacture the
circuit is created. In the case of integrated circuits, this means a description of the geometry of the
transistors and wires of the circuit. When using prefabricated configurable FPGA circuits, the generation
of the information needed to configure the circuit can be considered the physical design.
Integrated Circuit Manufacturing
Integrated circuits are manufactured by creating semiconductor components, transistors in the case of
logic circuits, on silicon wafers by diffusing or implanting impurities into the silicon crystal. From the
components thus formed in the silicon, circuit structures are created by connecting them together with
polysilicon and metal conductors grown on the surface of the silicon.
The manufacturing process is based on optical lithography, and it consists of repeating the following
process steps:
● The silicon wafer is coated with a photosensitive material,
● The wafer is exposed to ultraviolet light through a mask that contains patterns which pass the
light through,
● The photosensitive coating is removed with acid from those regions that were exposed in the
previous step,
● The exposed regions are treated in the desired manner, by diffusing or implanting impurities into
the silicon through them, or by growing conductors or insulating layers on them.
In this process, diffusion and implantation areas can be made on the silicon wafer, or insulating layers,
conductors, and connections ("vias") through the various layers can be grown on it.
The manufacturing process as a whole involves dozens or hundreds of steps and takes place in the
semiconductor manufacturer’s factory. The processed wafer typically contains dozens or hundreds of
individual circuits that are cut into separate silicon chips. These, in turn, are placed in packages, and
their connection pads are bonded to the pins of the packages. In order to detect manufacturing defects,
the circuits are tested several times, first on the wafer and finally after packaging.
Processed silicon wafers cost a few thousand euros, so the price of one chip depends on how many chips
are obtained from the wafer. Assuming a circuit size of 50 mm², more than 1000 circuits can be obtained
from a wafer with a diameter of 300 mm. The price of one chip is therefore a few euros. Thus, the manufacturing
cost of integrated circuits is not high. However, the non-recurring engineering costs (NRE) of the
manufacturing process, especially the mask fabrication costs, are very high. As a result, the production of
integrated circuits is only profitable in large quantities, where NRE costs of up to millions of euros can be
shared among a large number of components.
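The arithmetic behind these figures can be checked with the commonly used dies-per-wafer approximation; the plain C++ sketch below uses the 50 mm² die and 300 mm wafer mentioned above, together with an assumed processed-wafer price of 4000 euros.

#include <cmath>
#include <iostream>

// Classic gross dies-per-wafer approximation:
//   dies = pi*(d/2)^2 / A  -  pi*d / sqrt(2*A)
// (the second term accounts for incomplete dies at the wafer edge).
int main() {
    const double pi = 3.14159265358979;
    double d_mm  = 300.0;             // wafer diameter
    double a_mm2 = 50.0;              // die area
    double wafer_price_eur = 4000.0;  // assumed processed-wafer price

    double dies = pi * (d_mm / 2) * (d_mm / 2) / a_mm2
                - pi * d_mm / std::sqrt(2.0 * a_mm2);

    std::cout << "gross dies per wafer: " << static_cast<int>(dies) << '\n';
    std::cout << "raw cost per die:     " << wafer_price_eur / dies << " EUR\n";
}

With these numbers the result is roughly 1300 dies per wafer and about three euros per die, consistent with the estimate given in the text.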
In integrated circuit production, the circuit designer must create the information needed to make the
exposure masks. This information includes a description of the patterning of the exposure mask required
at each process step. Physical design means the design of this mask pattern layout. It is also the
responsibility of the designer to create the information needed to test the circuit, and to select the
package and pin order of the circuit.
Digital Integrated Circuit Layout Design
The purpose of circuit layout design is thus to design the exposure masks needed to make the circuit. At
its most basic, this is done by drawing the pattern "by hand" using a computer program designed for this
purpose. This is the prevailing practice in the design of analog circuits, where the designer must be able
to control the properties (dimensions) of the individual semiconductor components. In this full custom
design style, the designer draws a pattern corresponding to each process step (material layer) by
following the design rules defined by the circuit manufacturer, which specify e.g. minimum line widths and
distances between components. For the design of digital circuits, this method is far too laborious.
The following figure shows a standard NAND2 cell (in CMOS technology). The mask layers corresponding
to the different process steps are shown in the figure in different colors. In addition to the diffusion
layers, the standard cell contains a circuit pattern of the polysilicon layer (gates of the MOS transistors)
and the lowest metal layer (the connections between the transistors). The upper metal layers remain
free to be used to form connections between standard cells, i.e., the design's signal nets. The width of standard
cells varies, but their height is always the same. In this way, the supply voltage lines of the cells in the
row and the N-wells can be connected directly to each other. If a row is not completely filled during cell
placement, a filler cell of suitable width is placed in the gap; it contains only the N-well area and
metallization for the supply voltage rails.
Standard Cell Layout Design
The following sequence shows the steps for creating a standard cell layout. This process is referred to as
placement and routing (P&R). The layout is created from a netlist of the components used in the design,
produced by logic synthesis.
First, the shape and size of the P&R area in which the components are to be placed are determined. In
addition to the total area of the components, the size of the area is also affected by the cell density used
in the design of the circuit layout. In this context, density refers to how much of the area of the P&R
block is covered by the standard cells of the design. Density can range from 0% to 100% and is a
parameter selected by the designer. A density of 100% produces the smallest surface area, but a design
placed on such a small area is not usually feasible. The number of wires on each routing layer is limited,
as the wires must have a minimum width and a minimum spacing determined by the manufacturing
process. For this reason, the cell density is usually well below 100%. Filler cells are placed in the vacant
areas.
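As a simple numeric illustration, the required P&R block area follows directly from the total standard cell area and the chosen density (utilization); the plain C++ sketch below uses assumed figures only.

#include <cmath>
#include <iostream>

// P&R block sizing from total cell area and target cell density (utilization).
int main() {
    double cell_area_um2 = 250000.0; // total area of placed standard cells (assumed)
    double density       = 0.70;     // designer-chosen utilization, e.g. 70 %

    double block_area_um2 = cell_area_um2 / density;
    double side_um = std::sqrt(block_area_um2); // if the block is roughly square

    std::cout << "block area: " << block_area_um2 << " um^2"
              << " (~" << side_um << " um x " << side_um << " um)\n";
    std::cout << "filler-cell area: "
              << block_area_um2 - cell_area_um2 << " um^2\n";
}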
Once the size and shape of the floorplan of the P&R area has been decided, a power supply network is
designed for it by creating supply voltage rings around it. If the area is large, operating voltage stripes
(a.k.a. straps, shown vertically below) can also be created on top of it. After this, the supply voltage rails
are created for the standard cell rows (horizontal in the figure). After these steps, the standard cells can
be placed in the rows. Looking at Figure 5, it can be seen that the standard cells are mirrored on every
other row. Thanks to mirroring, the cells of the adjacent rows can use common operating voltage rails, so
that it is not necessary to draw both the operating voltage and the ground rail for each row. In the last
step, connections between standard cells are created using the higher metal layers.
Automatic placement and routing of standard cell designs requires a lot of computer resources. The P&R
program must continuously optimize the timing of the circuit at all stages, as the placement and routing
of the components must not impair the timing optimized in logic synthesis.
The layout of a digital circuit could in principle be implemented as a single P&R block, but this is
impractical if the design is large. For this reason, the layout of modern SoCs is usually assembled from
separate blocks, the internal layout of which is designed as described above. The circuit's floorplan is
then formed as shown below. In it, the logical entities are implemented as separate P&R blocks, which
are placed in the floorplan so that the circuit becomes as small as possible. Each block is
handled separately by the placement and routing program, so that the optimization problem it sees
remains reasonable. Finally, the blocks are routed together. In addition, a power supply network must be
implemented for all blocks.
An SoC can also contain blocks whose layout has been designed on a full custom basis. Such blocks
include analog blocks as well as SRAM and ROM memory blocks. The layout of memory blocks is created
using a separate module generator program. This program is designed to produce the layout for a
memory block according to the design rules of a particular semiconductor manufacturer. The most
common of the analog blocks is a phase-locked loop based clock generator, which generates a high (> 100 MHz)
clock frequency used inside the circuit from the lower frequency produced by the clock oscillator placed
on the circuit board. Generating a frequency of hundreds of megahertz on a circuit board would be
impractical due to power consumption and electromagnetic interference.
In addition to the placement and routing of the circuit's core area, also the external connections must be
designed. For each input, output and bidirectional connection, an I/O cell must be selected according to
the voltage levels of the logic standard used. The electrical properties, such as drive strength, of the
I/O cell must also meet the requirements of the corresponding signal. The I/O cells include a bonding
pad from which the circuit is connected by a bonding wire to the package pin, as well as electronics for
signal buffering and protection of the interface against electrostatic discharges (ESD). The entity formed
by the I/O cells is called a pad frame or ring. Power supply must also be arranged by including a sufficient
number of power and ground inputs in the ring.
Clock Tree Synthesis
In placement and routing, buffering for the circuit's clock signal is usually added. This process is called
clock tree synthesis. It could in principle be done already in the context of logic synthesis, but that
would not yield the best possible result. In clock tree synthesis, a balanced tree-like structure is built
from buffer gates or inverters to buffer the clock signal. The aim is to form a buffer tree with the smallest
possible clock skew between the branches. Since the magnitudes of the wiring delays are not known
until the layout design stage, it is more useful to synthesize the clock tree in connection with it. In
practice, this occurs after the placement of standard cells, but before routing.
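In the simplest terms, the skew that clock tree synthesis tries to minimize is the spread of the insertion delays from the clock input to the individual flip-flop clock pins; the plain C++ sketch below (made-up delay values) computes the skew and latency the way a CTS report would summarize them.

#include <algorithm>
#include <iostream>
#include <vector>

// Insertion delay = delay from the clock input through the buffer tree to one
// flip-flop clock pin. Skew = max - min of these; latency = the maximum.
int main() {
    std::vector<double> insertion_delay_ns = { // illustrative values
        0.42, 0.45, 0.44, 0.47, 0.43
    };

    auto [min_it, max_it] = std::minmax_element(insertion_delay_ns.begin(),
                                                insertion_delay_ns.end());
    double skew = *max_it - *min_it;

    std::cout << "clock latency: " << *max_it << " ns\n";
    std::cout << "clock skew:    " << skew    << " ns\n";
}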
The timing of the finished circuit can be studied with timing simulation and static timing analysis by
combining the RC delays calculated from the layout by the layout design program with the models used
by the analysis programs. Wiring delays reported by the synthesis program are already available for
post-logic synthesis analyses, but they are based only on the statistical wiring model used by the
program. In such a model, wiring delays are estimated based on, for example, the size of the P&R block.
In reality, however, there may be individual clearly longer-than-average wires that may cause violations
of timing requirements. By using wiring delays calculated from the layout in the analysis, such problems
can be detected. Wiring capacitances calculated from the layout can also be used to estimate the
circuit's power consumption.
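A common first-order way to turn the extracted wire resistances and capacitances into a delay figure is the Elmore model; the plain C++ sketch below (illustrative per-segment values, not extracted data) applies it to a wire modeled as a chain of RC segments driving a gate input.

#include <iostream>
#include <vector>

// Elmore delay of a wire modeled as a chain of RC segments: each segment's
// resistance is multiplied by the total capacitance downstream of it
// (the segment's own capacitance is lumped at its output node).
struct Segment { double r_ohm; double c_f; };

int main() {
    std::vector<Segment> wire = {            // illustrative values
        {20.0, 2e-15}, {20.0, 2e-15}, {20.0, 2e-15}, {20.0, 2e-15}
    };
    double load_c_f = 5e-15;                 // receiving gate input capacitance

    double delay_s = 0.0;
    double downstream_c = load_c_f;
    // Walk from the far end toward the driver, accumulating downstream capacitance.
    for (auto it = wire.rbegin(); it != wire.rend(); ++it) {
        downstream_c += it->c_f;
        delay_s += it->r_ohm * downstream_c;
    }
    std::cout << "Elmore wire delay: " << delay_s * 1e12 << " ps\n";
}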
Timing analyses after layout design are usually performed using two or more "corner points"
(multi-corner analysis). "Corner point" refers to the combination of factors influencing the logic delays
used in the analysis. These factors are usually a parameter describing the variation in the quality of the
manufacturing process (P), the operating voltage of the circuit (V) and the operating temperature of the
circuit (T). The parameter P describes how large the differences between the delays of individual
manufactured circuits can be, and is based on information measured by the circuit manufacturer. The variation in
the operating voltage of the circuit (deviation during operation from the design), in turn, has an inverse
effect on the logic delays. Temperature variation has an effect on intra-circuit delays. The corner point
used in the timing analysis is a certain PVT combination. Typically, a critical path analysis (setup analysis)
in which the maximum delay from flip-flop to flip-flop is estimated is done with maximum delays, with
the corner point used being as "slow" as possible (a slow manufactured sample operated at reduced
voltage and high temperature). In turn, a "fast" corner point is used to detect the risk of hold time violations
caused by clock skew (hold analysis), as the flip-flop hold time requirement is most likely to be violated by
short delay paths where the flip-flop output is directly connected to another flip-flop input. In the latest
technologies, delays do not vary monotonically as a function of the PVT parameters, so timing analysis
may need to be performed using numerous corner points.
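For a single register-to-register path, the checks reduce to two simple inequalities; the plain C++ sketch below (illustrative delay figures and a simplified pessimistic skew convention) evaluates the setup slack at a slow corner and the hold slack at a fast corner, in the spirit of a multi-corner analysis. A negative slack indicates a violation.

#include <iostream>

// Per-corner timing figures for one register-to-register path (illustrative).
struct Corner {
    const char* name;
    double t_clk2q_ns;   // flip-flop clock-to-output delay
    double t_comb_ns;    // combinational path delay at this corner
    double t_setup_ns;   // receiving flip-flop setup time
    double t_hold_ns;    // receiving flip-flop hold time
    double skew_ns;      // clock skew between launching and capturing flip-flop
};

int main() {
    double t_clk_ns = 2.0; // clock period

    Corner slow {"slow (P slow, V low, T high)", 0.25, 1.50, 0.10, 0.05, 0.10};
    Corner fast {"fast (P fast, V high, T low)", 0.08, 0.05, 0.10, 0.05, 0.10};

    // Setup check at the slow corner: data must arrive one setup time before
    // the next capturing clock edge (skew taken pessimistically).
    double setup_slack = t_clk_ns
                       - (slow.t_clk2q_ns + slow.t_comb_ns + slow.t_setup_ns)
                       - slow.skew_ns;

    // Hold check at the fast corner: data must stay stable for the hold time
    // after the capturing edge (skew taken pessimistically).
    double hold_slack = (fast.t_clk2q_ns + fast.t_comb_ns)
                      - (fast.t_hold_ns + fast.skew_ns);

    std::cout << "setup slack (" << slow.name << "): " << setup_slack << " ns\n";
    std::cout << "hold slack  (" << fast.name << "): " << hold_slack  << " ns\n";
}

With these numbers the setup check passes with a small margin, while the short path fails the hold check, illustrating why fast corners are used for hold analysis.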
In addition to variations in wiring lengths, the timing characteristics of a circuit can be altered by
crosstalk, which can only be estimated when the physical locations of the signal wires and the distances
between them are known. For this reason, signal integrity analysis can be done reliably only after layout
design, as part of a static timing analysis.
Post-layout analyses can also take into account changes in circuit timing characteristics due to other
non-idealities, such as operating voltage fluctuations. The resistance of the supply rails causes a voltage
drop (IR drop) that is greater the farther a cell is from the voltage source. A decrease in voltage slows
down the operation of the components and thus increases the delays within the circuit.
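The effect is Ohm's law applied cumulatively along the rail; the plain C++ sketch below (made-up currents and segment resistances) accumulates the voltage drop seen by cells tapping the rail at increasing distance from the supply pad.

#include <iostream>
#include <vector>

// IR drop along a supply rail modeled as series resistances between taps.
// Each tap draws some current; segments closer to the pad carry the sum of
// all downstream currents. Values are illustrative only.
int main() {
    double r_segment_ohm = 0.05;                       // resistance between taps
    std::vector<double> tap_current_a = {0.02, 0.03, 0.01, 0.04};

    double downstream_a = 0.0;
    std::vector<double> segment_current_a(tap_current_a.size());
    for (int i = static_cast<int>(tap_current_a.size()) - 1; i >= 0; --i) {
        downstream_a += tap_current_a[i];
        segment_current_a[i] = downstream_a;           // current in segment i
    }

    double drop_v = 0.0;
    for (size_t i = 0; i < tap_current_a.size(); ++i) {
        drop_v += segment_current_a[i] * r_segment_ohm;
        std::cout << "tap " << i << ": IR drop = " << drop_v * 1e3 << " mV\n";
    }
}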
Physical Design in the ASIC Design Flow
The figure below shows the position of the placement and routing program in the ASIC design flow.
Layout design requires as input a netlist generated by a logic synthesis program, which is usually
presented as a Verilog file. In addition, the information provided by the synthesis program about the
timing constraints defined for the circuit is needed, since the layout must be formed with the same
constraints as the gate-level description. For layout design, the designer must also specify at this stage at
least the type and location of the I/O cells so that they can be placed in the correct locations. If a clock
tree is created in connection with a layout, its properties must be defined. These features include e.g.
maximum allowable clock skew and latency (delay from clock input to flip-flops). In connection with
layout design, the structure of the scan paths created for the circuit can also be optimized, so it may also
be necessary to specify their structure.
To create a layout, a standard cell library is required, which contains the layout patterns of the logic
components used in the gate-level netlist given as input. The cell library is obtained from the
same supplier as the library used in the logic synthesis, and in practice it is the same product.
Commercial logic and standard cell libraries are available from several suppliers for the manufacturing
processes of major semiconductor manufacturers. These libraries are always process-specific.
In the figure above, an arrow is drawn from the layout design output file back to the logic synthesis
program. This means that the results of layout design (or more generally, the floorplan of the circuit) are
often utilized in logic synthesis, especially when designing large circuits with millions of gates. In such
circuits, the share of wiring delays in the total logic delay is so large that realistic delay values calculated
from the layout must be used in the logic timing optimization. For this reason, synthesis and layout
design programs can often be used together so that the synthesis program can do "preliminary" layout
or floor plan design during optimization, and use the information thus obtained to analyze timing. The
same information can then be used as control information (constraints) for the layout design program to
form the final layout.
If there are a lot of connections in the design or in some parts of it, the cell density may have to be reduced
to make routing possible. By creating an experimental layout during the design process, an estimate of
the design's routability can be formed. If congestion is observed in some parts of the layout, the design's
RTL architecture can be changed to reduce the concentration of wiring in the specific region. Typical
structures that cause wiring congestion are multiplexers with a very large number (hundreds) of inputs.
By correcting such problems, a higher cell density can be used in layout design.
Summary
The design of a digital integrated circuit layout is a challenging task that involves numerous difficult
optimization tasks. In this chapter, the goal was to give only an overview of the main stages of design to
the extent that a digital designer needs to make a prototype layout, for example. From this perspective,
the primary goal of layout design is to produce information about circuit timing, power consumption,
and feasibility that can be used to improve the design in the RTL design phase. This helps to ensure the
success of the final layout design of the circuit, a task that is usually performed by specialized design
teams or service providers.