Core Design and SOC Integration
CORE DESIGN
Authorized licensed use limited to: University of Illinois. Downloaded on September 9, 2009 at 23:30 from IEEE Xplore. Restrictions apply.
OCTOBER–DECEMBER 1997
Figure 1. Partitioned RAMDAC core integrated with customer logic. Unlabeled blocks are SRAMs and register arrays.

Figure 2. PowerPC-core-based design. Unlabeled blocks are a PLL, RAMs, and register arrays.

blocks that designers could use either separately or as an integrated solution. We implemented the video DAC, PLL, and high-speed SRAM as individual hard cores in the range of 10,000 to 30,000 cells each. We implemented the remaining digital logic as a soft-core netlist (RAMDAC) that included the other hard cores as components and was the delivery vehicle for the integrated solution.

Figure 1 shows an example of the successful use of the RAMDAC integrated solution. Other chip designs, requiring only the PLL, video DAC, or SRAM, could use these cores as stand-alone functions without sacrificing valuable silicon to unused palette DAC functions.

Another example of multiple cores derived from a standard product chip is the PowerPC core product line (based on the PowerPC 40X chip series). We divided the PowerPC microcontroller chip into a hard core and several soft cores. The timing-critical CPU became a hard core; peripheral functions such as the DMA controller, external bus interface unit (EBIU), timers, and serial port unit (SPU) became soft cores.

The first chip to use these PowerPC cores contained the 401 CPU hard core and the SPU soft core (Figure 2). It did not use the off-chip memory interface core (EBIU). The application called for the Rambus high-speed memory interface, provided as a separate, mixed analog-digital hard core. Because we had partitioned the 40X PowerPC function into multiple individual cores, the customer could select and use only the functions required for the design.

Interfacing with on-chip buses. Designers can use standard on-chip buses to eliminate the suboptimal glue logic required to integrate one or more cores with customer logic or with each other. Successfully used in the world of printed circuit board design for several years, this method is extendable to SOC designs. Standard buses ease integration of peripherals and independent design of user modules by providing a standard interface and communication protocol. Core logic and customer logic designed to a consistent protocol can quickly interconnect without requiring additional glue logic gates.

As in board design, the latency, bandwidth, and interface compatibility trade-offs among blocks of differing traffic characteristics dictate the need for a hierarchy of buses. For the PowerPC cores, IBM devised a dual on-chip bus architecture (Figure 3): The processor local bus (PLB) serves high-speed devices, and the on-chip peripheral bus (OPB) serves lower-speed peripheral devices. A separate core provides arbiter logic for each bus, and a bridge core transfers data between the two buses to further enhance usability.

In addition to the PowerPC peripheral cores, many other cores interface to the PLB/OPB bus structures. These include a UART (universal asynchronous receiver/transmitter), a time division multiplexer, a universal serial bus, an Ethernet, an HDLC (high-level data link control), a MAL (memory access layer), an IIC (inter-integrated circuit) serial bus interface, and an IEEE 1284 parallel port unit.

Figure 4 represents the architecture of a design using the
Figure 5. PCI chiplets. Blocks shown: PCI engine, configuration registers, master DMA read server, master DMA write server, master FIFO read server, master FIFO write server, slave write port, slave read port, and slave delayed read port.

the FIFO servers was approximately 35% smaller than a single hard core containing all nine chiplets. The PCI function using the DMA servers was over 40% smaller.

Firm cores. Using firm cores increases design flexibility and reduces layout problems. A core provider who requires IP protection for functions without critical timing requirements that drive a fixed hard-core layout should implement such functions as firm cores. Firm cores provide abstracted core views to the designer; the vendor replaces the firm cores with the gate-level netlist during chip layout. By allowing core size, aspect ratio, and pin locations to be altered during layout, this method facilitates an optimal chip layout. It alleviates tiling problems and reduces the unused silicon space that sometimes occurs when several large cores reside on the same chip.

SOC design methodology
Figure 6 illustrates an ASIC design methodology proven on very large (500,000 to 2 million gates), high-performance (over 100 MHz) designs. IBM ASIC customers used this methodology to create the designs described in this article. The most significant factor in achieving success on these large, complex designs is the sign-off criteria. We base design sign-off on static timing analysis and DFT compliance with fully automatic test pattern generation rather than exhaustive delay simulation and functional manufacturing test vectors. We enforce DFT rules through test structure verification (TSV) software provided by the ASIC vendor.

We run the TSV against the design at both the release-to-layout and release-to-manufacturing checkpoints. Similarly, we use a “golden” static timing analysis tool (Einstimer) and library for timing sign-off at these same checkpoints. We perform additional technology-specific manufacturability checks using an IBM-developed set of checking routines.

Functional verification of the design is left entirely to the designer. IBM does not resimulate ASIC designs before or after layout. We provide an updated postlayout design netlist to the ASIC customer for functional verification, with an SDF (standard delay format) file on request, to support delay simulation. While many customers still rely on gate-level simulation for functional verification after layout, we recommend the use of formal verification tools for this task.

We have functionally verified pre- to postlayout versions of designs exceeding 800,000 gates with a 24-hour turnaround using the DesignVerifyer tool from Chrysalis. With the advent of chips containing complex embedded cores, verification through gate-level simulation has become even less practical, and formal verification methods have become more essential.

Our timing-driven layout system handles both hierarchical and flat designs. Timing assertions (description of expected arrival times, clock cycles, false paths, and so forth) for static sign-off generate timing targets for the place-and-route system. To close the final postlayout chip timing, we use a series of in-place optimization tools for drive strength optimization, buffer insertion, clock tree placement, and scan chain reordering.

Figure 6 also shows how we have updated the basic ASIC methodology with core-specific deliverables. Although the basic flow and sign-off points remain the same, the design kit has changed substantially. In addition to the base library of ANDs, NANDs, latches, and SRAMs, the ASIC vendor now supplies large pieces of the customer design in the form of soft-core netlists and black box models for firm- and hard-core functions. This requires the creation, delivery, and support of several new core-specific models. New tools and design techniques are also needed to address the verification requirements of complex cores (such as embedded controllers) that now reside on ASIC silicon.

Simulation
The customer receives soft-core functions as gate-level netlists. These netlists are mapped to the same ASIC library used by the customer logic. Simulation of the core netlist, therefore, requires no unique support beyond the design kit provided for customer logic simulation at the gate level.

Simulation models for hard and firm cores fall into two major categories: full-function models (FFMs) and bus function models (BFMs). Each hard- or firm-core macro requires an FFM. An FFM, derived from the core design source, accurately models the core hardware’s behavior. The design source may be Verilog or VHDL and may be register-transfer level, gate level, or a mixture of the two. Because these core macros require IP protection, the design source must be en-
[Figure: design-flow diagram; caption not recovered. Block labels: full-function models; hardware-software cosimulation; high-level simulation; core instruction set simulator; synthesis, timing, test, floor planning; ASIC library component models; timing assertions; release to layout; timing-driven layout; static timing analysis; SDF, RC, capacitance values; postlayout technology checks; release to manufacturing; automatic test pattern generation.]

SmartModel (Swift) interface. This interface is supported by most event-based simulators on the market and requires uniquely compiled versions on a platform basis only (Sun, HP, IBM RS/6000), not a simulator-specific basis.

We supply bus function models for processor cores and on-chip bus structures. A BFM is not derived from the design source; it is based on the processor’s bus specification document and is written in VHDL or Verilog. A translator can create the alternate form. In either case, the customer receives unencrypted HDL source for simulation. In contrast to the FFM, which accurately represents the entire core function, a BFM drives simulation with the core’s bus response without modeling the internal implementation. Because it represents only a subset of core function, a BFM generally simulates faster than an FFM and is useful in the early stages of SOC design.
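
As a rough illustration of the BFM idea, the following Python sketch turns read and write commands into cycle-by-cycle bus activity without modeling any processor internals. It is only a sketch under stated assumptions: the signal names (`request`, `read_not_write`, `address`, `write_data`), the two-cycle transfer, and the `SimpleMemorySlave` stand-in are hypothetical and do not represent the PLB, the OPB, or any IBM-supplied model, which the article says are written in VHDL or Verilog.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cycle:
    """Signal values driven on the hypothetical bus during one clock cycle."""
    request: int
    read_not_write: int
    address: int
    write_data: Optional[int]


class SimpleMemorySlave:
    """Stand-in for the logic under test: a zero-wait-state memory (assumed)."""

    def __init__(self) -> None:
        self.mem: Dict[int, int] = {}

    def respond(self, cycle: Cycle) -> Optional[int]:
        if not cycle.request:
            return None                      # idle cycle, nothing to do
        if cycle.read_not_write:
            return self.mem.get(cycle.address, 0)
        self.mem[cycle.address] = cycle.write_data
        return None


class BusFunctionModel:
    """Translates bus commands into cycle-level signal sequences.

    Only the bus interface is modeled; there is no pipeline, register file,
    or cache, which is why such a model can run faster than a full-function one.
    """

    def __init__(self, slave: SimpleMemorySlave) -> None:
        self.slave = slave
        self.trace: List[Cycle] = []         # record of every driven cycle

    def _drive(self, cycle: Cycle) -> Optional[int]:
        self.trace.append(cycle)
        return self.slave.respond(cycle)

    def write(self, address: int, data: int) -> None:
        self._drive(Cycle(1, 0, address, data))   # address/data phase
        self._drive(Cycle(0, 0, 0, None))         # idle turnaround cycle

    def read(self, address: int) -> Optional[int]:
        data = self._drive(Cycle(1, 1, address, None))
        self._drive(Cycle(0, 0, 0, None))
        return data


bfm = BusFunctionModel(SimpleMemorySlave())
bfm.write(0x100, 0xAB)
value = bfm.read(0x100)
```

A test bench built this way exercises the surrounding logic through the same command interface that a cosimulation environment could later drive from an instruction set simulator.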
terization must explicitly remove the don’t_touch annotation and subsequently accepts ownership of the core’s function and timing.

Hard and firm cores are modeled as black box library elements in synthesis and pass through the synthesis process unchanged. The PowerPC CPU core is an exception to this rule. A synthesizable logic macro called the test mode matrix (TMM), required to support functional testing of the CPU, accompanies the PowerPC black box model. The customer synthesizes this logic to elements in the target ASIC library and can optimize it for both area and performance.

Timing
The customer performs timing analysis on soft cores using the timing models for the ASIC library elements. Timing assertions that specify false or don’t-care paths in the design come with the soft-core netlist. The customer incorporates these assertions into the chip-level assertions used for timing sign-off.

The basic criterion for a soft core is that it meet performance requirements using the standard timing-driven layout system. Therefore, IBM provides core-specific wire-load models or area constraints on an exception basis only. The customers who designed the chips shown in Figures 2 and 4 integrated the soft cores with their custom logic and performed chip-level timing analysis. IBM placed and routed these chips, without soft-core region constraints, using the timing-driven layout approach. In contrast, the RAMDAC soft core in Figure 1 required region constraints and early floor planning, in addition to timing-driven layout, to meet the 220-MHz performance target.

For hard and firm cores, core developers must create black box timing models or timing abstracts to protect the intellectual content. The timing model must contain values for all pin-to-pin paths as well as all appropriate timing checks such as setup, hold, and minimum pulse width. Although the timing model is an abstracted representation (that is, it does not contain detailed design data), it must be accurate enough for static timing sign-off. Because of core designs’ complexity and time-to-market requirements, an automated method of creating these models is necessary to eliminate human error.

We create the hard-core timing models for static timing sign-off using the IBM Einstimer tool’s design abstraction process. We extended this capability, originally developed to support efficient timing of hierarchical designs, to cores. We read a detailed design netlist, RC (resistance-capacitance) and capacitance files from layout, and the design timing assertions into Einstimer. The tool generates a pin-to-pin timing abstraction for the design and writes out the information in Delay Calculation Language (DCL).

We do not use the detailed netlist and the layout RC and capacitance files to generate firm-core timing rules because the firm-core layout varies with each implementation. Instead, we use the timing assertions to generate fixed pin-to-pin delays that exactly match the delays published in the core’s functional specification. We capture these delays, with variable temperature and voltage factors, and the appropriate timing checks in DCL statements. We compile DCL statements for both hard and firm cores into a non-human-readable executable form, which is provided to the customer for static timing analysis. From the abstracted DCL timing model, an IBM program called gensyn uses Einstimer to create timing information in the Synopsys synthesis model and core timing wrappers for simulation back-annotation.

Testability
Soft and firm cores are designed to meet the same DFT requirements as the customer design. All soft and firm cores must pass through the TSV sign-off tool without generating errors or warnings. All core scan chains and test clocks must be connected correctly, and untestable faults are not allowed. An edge clock at the core boundary drives the core. A clock splitter element in the design splits the edge clock into the required master/slave clocks. The customer connects the edge clock, scan clocks, and scan chain inputs and outputs in the customer logic to the corresponding pins at the core boundary. Once the core is fully integrated into the customer design, the customer uses ASIC library component test models to check it again for DFT compliance at the chip level.

We take either an integrated or an isolated approach to testing hard-core macros, depending on how the core was designed. If the hard-core design complies with the DFT requirements, we can test the core with the same test methods used in the standard ASIC flow (Figure 7). IBM includes the full gate-level core model and automatically generates test patterns for the customer and core logic concurrently. To maintain IP protection for hard and firm cores, we send an encrypted model to the customer. A special cloaking feature in the sign-off TSV tool prevents core models from being viewed via the graphical interface. We used the integrated core test method on the customer designs containing the PCI core chiplets described earlier.

In isolated testing, we test the customer logic separately from the core using a complex set of multiplexing and gating logic, called an isolation matrix, or test mode matrix (Figure 8). Control signals put the core into a series of modes. The nontest or functional mode allows communication between the core and customer logic. In this mode, the isolation matrix logic is transparent to the core’s functional use. In core test modes, the core is accessible via the chip I/Os, and the customer logic is fenced off and stable. We can apply functional patterns to the core using the chip
[Figure: block diagram; caption not recovered. Block labels: user-defined logic (two blocks), core test isolation matrix, Core 1, Core 2, I/O.]
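
The isolation matrix behavior described under Testability (transparent in functional mode; core reachable from the chip I/Os with the customer logic fenced off in core test mode) can be sketched for a single signal path as below. This is an illustrative Python model, not the actual matrix logic: the two-mode enumeration, the fence value of 0, and the function and parameter names are all assumptions made for the example.

```python
from enum import Enum
from typing import Optional, Tuple


class Mode(Enum):
    FUNCTIONAL = 0   # nontest mode: core and customer logic talk directly
    CORE_TEST = 1    # core driven and observed through the chip I/Os


FENCE_VALUE = 0      # assumed constant that holds the customer logic stable


def isolation_matrix(
    mode: Mode, from_logic: int, from_core: int, from_chip_io: int
) -> Tuple[int, int, Optional[int]]:
    """Route one signal path; returns (to_core, to_logic, to_chip_io)."""
    if mode is Mode.FUNCTIONAL:
        # Matrix is transparent: logic drives the core and sees its output.
        return from_logic, from_core, None
    # Core test: chip I/O drives the core, the core output is observable
    # at the chip I/O, and the customer logic sees a fixed fence value.
    return from_chip_io, FENCE_VALUE, from_core


functional = isolation_matrix(Mode.FUNCTIONAL, from_logic=1, from_core=0, from_chip_io=1)
core_test = isolation_matrix(Mode.CORE_TEST, from_logic=1, from_core=1, from_chip_io=0)
```

In a real design the same selection would be replicated per core pin and controlled by dedicated mode signals; the sketch only shows the multiplexing decision for one path.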
the third level of metal ranges from approximately 15% in the PCI chiplets to zero in the Rambus. In the 0.35-µm ASIC product, this leaves a minimum of one and a maximum of two wiring planes for routing over these cores. The limited porosity is caused primarily by the densely packed underlying core logic. Other core areas may complete the block wiring to prevent routing noise over the analog circuitry (in the Rambus) and instruction and data caches (in the 401 CPU). This causes highly congested wiring in other chip areas. It may be necessary to reserve area around the cores to help alleviate wiring congestion in those regions.

Certain hard cores, particularly those with analog functions, have additional characteristics affecting placement. The Rambus, for example, is a high-speed analog/digital memory controller with unique test and high-performance I/O cell requirements. The I/O cells must occupy predetermined locations on the die to allow access by wafer-level test equipment. The Rambus core requires wiring several sensitive high-current analog signals to the chip’s I/O pads. Performance requirements on the signal nets between the core and the I/O pads and the need to prevent coupling to other noisy wires limit placement of the core: It must be placed in a predefined area directly adjacent to the test-specific I/O cells.

The chip shown in Figure 2 required both the 401 and Rambus cores to fit on a small die. Because the Rambus had to occupy a location in the middle of one side of the chip, the 401 was forced into the corner on the opposite side. This placement created a long narrow area between the two cores that caused significant chip-level routing challenges on the four-level metal design. The design was highly populated (using over 80% of the available silicon) and contained logic that needed to communicate across the core to other on-chip logic. Because routing across the core was available in only the vertical direction, we had to create additional reserve areas around the cores to allow for horizontal wires. Detailed floor planning and wiring congestion analysis highlighted these routing problems early in the design process. Subsequent versions of the 401 CPU contain wiring channels through the core on the third wiring level to help alleviate the chip-level routing problem.

Large cores sometimes require additional peripheral circuitry such as test logic, which must be arranged around the core. When test signals are multiplexed with functional signals around the chip’s edge, the floor plan must account for the additional wire demand.

Logic and layout optimization of hard-core contents translates into area and performance improvements within the core. Inefficiencies at the chip level will partially offset these gains unless the cores and logic on the chip are arranged to avoid timing and wiring congestion problems.

With a good floor plan, laying out a chip with cores is much easier than laying out a chip without cores. For example, large static RAMs have been available as cores for many years, and these enable the layout engineer to add large amounts of memory to a chip without difficulty. Likewise, well-designed cores solve timing, clock skew, and congestion problems for high-performance circuits without additional layout effort on core contents.

Hardware-software cosimulation
Simulation and verification can quickly become the bottleneck in the design of large, complex core-based systems. A range of simulation models of varying accuracy, including BFMs and FFMs, address specific needs at various stages of the ASIC hardware design process. But none fulfills the needs of the software designer developing code for the embedded processor. Using an instruction set simulator (ISS) with an instruction set architecture (ISA) model has been a traditional method of software designers for code debug and execution time analysis. In stand-alone mode, an ISS runs several orders of magnitude faster than an FFM, executing an average of 100,000 instructions per second (IPS) versus the FFM’s 5 to 20 IPS. An ISS also gives the software developer visibility into the internal registers of the processor as it executes instructions and provides breakpoint and single-step functions for controlling the execution stream.

Because the processor core is embedded on the same piece of silicon as the customer-designed ASIC logic, testing the interaction of processor and ASIC gates is critical. In most cases, the processor must correctly execute a stream of initialization code before it can begin to interact with the surrounding ASIC logic. We can debug this hardware-dependent software by translating processor instructions into a memory image and running an ASIC simulation with the FFM fetching instructions from the memory model. This method’s productivity is constrained by the FFM’s performance and the limited visibility this model provides into the processor’s internals. As a result, many vendors are creating cosimulation products that link a processor ISS with an ASIC HDL simulator by using the processor BFM. These products execute processor instructions faster and concurrently simulate the activity of the remaining logic.

IBM used a prototype cosimulation system developed for the PowerPC 401 during the design of several core-based chips. The system consists of a 401 ISA model running in the PowerPC Virtual Simulator (PVS), linked with the VHDL simulator from Model Tech, Inc. (MTI), executing on an RS/6000 workstation. The ISA model does not include the concept of functional pins and uses the BFM to model output pin behavior. The BFM accepts bus commands such as read and write and translates them into bus transactions that model the interface’s signal sequencing and timing.

The ISA model in PVS executes instructions at a relative-