
Intel® Nehalem Processor Core Made FPGA Synthesizable

Graham Schelle1 , Jamison Collins1 , Ethan Schuchman1 , Perry Wang1 , Xiang Zou1
Gautham Chinya1 , Ralf Plate2 , Thorsten Mattner2 , Franz Olbrich2 , Per Hammarlund3
Ronak Singhal3 , Jim Brayton4 , Sebastian Steibl2 , Hong Wang1
Microarchitecture Research Lab, Intel Labs, Intel Corporation1
Intel Germany Research Center, Intel Labs, Intel Corporation2
Central Architecture and Planning, Intel Architecture Group, Intel Corporation3
Microprocessor and Graphics Development, Intel Architecture Group, Intel Corporation4
Contact: [email protected]

ABSTRACT
We present an FPGA-synthesizable version of the Intel Nehalem processor core, synthesized, partitioned, and mapped to a multi-FPGA emulation system consisting of Xilinx Virtex-4 and Virtex-5 FPGAs. To our knowledge, this is the first time a modern state-of-the-art x86 design with an out-of-order microarchitecture has been made FPGA synthesizable and capable of high-speed cycle-accurate emulation. Unlike the Intel Atom core, which was made FPGA synthesizable on a single Xilinx Virtex-5 in a previous endeavor, the Nehalem core is a more complex design with aggressive clock-gating, double-phase latch RAMs, and RTL constructs that have no true equivalent in FPGA architectures. Despite these challenges, we are successful in making the RTL synthesizable with only 5% RTL code modifications, partitioning the design across five FPGAs, and emulating the core at 520 KHz. The synthesizable Nehalem core is able to boot Linux and execute standard x86 workloads with all architectural features enabled.

Categories and Subject Descriptors
C.1.0 [Processor Architectures]: General

General Terms
Design, Measurement, Performance

Keywords
Intel Nehalem, FPGA, emulator, synthesizable core

1. INTRODUCTION
Intel Nehalem [4, 8, 14] is the latest microarchitecture design and the foundation of the Intel Core i7 and Core i5 processor series. Like its predecessor (the Intel Core microarchitecture), the Intel Nehalem microarchitecture continues to focus on improving how the processor uses the available clock cycles and power, rather than just pushing up clock speeds and energy needs. Its goal is to do more within the same, or even a reduced, power envelope. Accordingly, the Intel Nehalem microarchitecture can process up to four instructions per clock cycle on a sustained basis, compared to the three or fewer instructions per clock cycle processed by other processors. In addition, Intel Nehalem incorporates a few essential performance and power management innovations geared towards optimizing the individual cores and the overall multi-core microarchitecture to increase single-thread and multi-thread performance.

In addition to backward compatibility with the rich Intel Architecture legacy, Intel Nehalem sports several salient new features: (1) Intel Turbo Boost Technology, which enables judicious dynamic management of cores, threads, cache, interfaces, and power; (2) Intel Hyper-Threading Technology, which in combination with Intel Turbo Boost Technology can deliver better performance by dynamically adapting to the workload, automatically taking advantage of available headroom to increase processor frequency and maximize clock cycles on active cores; and (3) Intel SSE4 instruction set extensions that center on enhancing XML, string, and text processing performance.

In this paper, we share our experience and present the methodology used to make the Intel Nehalem processor core FPGA synthesizable. The emulated Nehalem processor core is partitioned across multiple FPGAs and can boot standard off-the-shelf x86 OSes, including Linux, and run x86 workloads at 520 KHz. Compared to the Intel Atom core that we previously made FPGA synthesizable, the Nehalem core is much more challenging due to the microarchitectural complexity and sheer size of the design. The key contributions of this paper are:

• We present our methodology to synthesize and judiciously partition the fully featured Nehalem RTL design to an emulator with multiple Virtex-4 [18] and Virtex-5 [19] FPGAs.

• We demonstrate a systematic and scalable cycle-by-
cycle verification methodology to ensure the functional
and timing correctness of the synthesizable design.

The remainder of the paper is organized as follows. Section 2 reviews related work and provides background information on the Intel Nehalem processor core and the multi-FPGA emulator platform. Section 3 elaborates on our experience in making the Nehalem core RTL FPGA synthesizable and introduces our verification methodology. Section 4 describes how we partition the Nehalem core design across multiple FPGAs and provide the memory interface between the core and the DDR memory. Section 5 evaluates the functionality and performance of the synthesized Nehalem core in comparison with the Intel Atom core on the same emulator platform. Section 6 concludes the paper.
Figure 1: Intel Nehalem Microarchitecture [3].
2. BACKGROUND

2.1 Related Work
The Intel Nehalem processor is representative of a general trend in designing modern high-performance, energy-efficient CPUs. Beyond the traditional incremental microarchitectural enhancements at the processor pipeline level, state-of-the-art CPU designs tend to incorporate a variety of new technologies through a high degree of integration. Resembling a system-on-a-chip (SoC), a modern CPU usually embodies very sophisticated on-chip resource management and complex interactions amongst many building blocks. These building blocks work in concert to deliver the architecturally visible features in energy-efficient ways.

The increase in design complexity inevitably impacts the pace of silicon development. In particular, pre-silicon RTL validation has long been a vital yet time-consuming phase of microprocessor development. Due to the heavy reliance on software-based RTL simulation tools, and despite the rich test environments, validation throughput is usually limited by simulation speed, which is in the range of single-digit hertz. Such speeds are prohibitive for long-running system-level tests as processor designs keep increasing in size and complexity. One alternative has typically been emulation. Emulation, however, requires expensive hardware and software tools, and their cost usually scales up when the size of the design reaches the level of a modern CPU. Consequently, the speedups achieved on large emulation platforms actually decrease as designs grow larger, potentially negating the benefits of emulation.

Thanks to Moore's Law, the capacity and speed of modern FPGAs have continued to improve. As more productive EDA tools become available for FPGAs, FPGAs have become well suited for implementing complex processor designs with their high-density logic and embedded components. Should a CPU design be made FPGA synthesizable throughout product development as part of the pre-silicon validation process, it would bring the significant benefit of exercising the design with many more simulation cycles. For example, booting an off-the-shelf operating system can require execution of 100 million to 1 billion instructions. It is impossible to run such a long instruction sequence within a reasonable amount of time with today's RTL simulators; however, it takes under an hour with FPGA-synthesizable designs. With additional optimizations in FPGA mapping and partitioning, FPGA-synthesizable designs can usually be emulated at speeds of tens of MHz, thus making it possible for them to serve as software development vehicles before silicon becomes available [10].

While FPGA emulation has long been employed in SoC design and prototyping (SPARC [6], MIPS [7], ARM [9]), it is only recently that FPGA synthesis has been taken into account for modern PC-compatible CPU designs. In [11] and [17], the Pentium processor and the Atom processor were made synthesizable to a single FPGA. However, compared to Atom, the Nehalem core requires roughly 4x more FPGA capacity. Due to this size increase, multi-FPGA partitioning must be employed for the Nehalem core, requiring time multiplexing of wires [2] between FPGAs and various partitioning tools [1] and techniques [5, 12].

2.2 The Intel Nehalem Processor
The Intel Nehalem processor is a 64-bit multithreaded processor with an aggressive out-of-order pipeline. Nehalem includes a 32KB L1 data cache, a 32KB L1 instruction cache and a shared 256KB L2 cache. Figure 1 shows the layout and clusters making up the Nehalem core. Four clusters make up the Nehalem core:

FE   Frontend to fetch bytes, decode instructions
EXE  Floating point and integer execution
OOO  Out-of-order resource allocation
MEU  Memory execution unit (load/store handling)

The FE cluster fetches instructions, decodes the x86 instructions into an internal micro-op (uop) stream, and queues those uops for execution downstream. Branch prediction is performed in FE as well. The OOO cluster takes uops and schedules their execution to the various reservation stations. The EXE cluster holds all the ALUs and floating-point execution units, with highly optimized and specialized circuitry to complete these computations. The MEU handles all loads and stores from the shared L2 cache, known as the MLC (midlevel cache). The MEU is the sole interface to the Uncore. It also contains many other miscellaneous components of the Nehalem processor, including the interrupt controller and the TAP (test access port).

The Nehalem Uncore, also shown in Figure 1, connects the cores to each other, holds the last-level cache, and contains an on-die memory controller. In this paper, our focus is on the four core clusters of the Nehalem; the Uncore cluster is outside the scope of this paper.

Figure 2: One MCEMU Board with Five FPGAs.

Figure 3: One MCEMU System with Nine Boards.

Table 1: MCEMU board FPGAs.
Name  FPGA            LUTs     BRAMs
U1    Virtex-4 FX140  126,336  552
U2    Virtex-5 LX330  207,360  576
U3    Virtex-4 LX200  178,176  336
U4    Virtex-5 LX330  207,360  576
U5    Virtex-4 LX200  178,176  336

2.3 The Many-Core Emulation System
The Many-Core Emulation System (MCEMU) is the emulation platform we targeted for this work. MCEMU is an FPGA emulation platform developed at Intel [13]. An MCEMU system consists of a series of identical rackable custom boards, each holding five FPGAs. Table 1 lists the name, type, and key resources of each of the five FPGAs, while Figures 2 and 3 show a single board and a full rackable system, respectively. To expand capacity beyond five FPGAs, multiple boards are interfaced together using the Xilinx RocketIO high-speed serial transceivers connected by external cabling.

Within a single MCEMU board, board traces wire input pins on one FPGA to output pins of another, leading to a fixed number of physical wires between each FPGA pair. While the number of physical wires connecting two FPGAs is fixed and small, an arbitrarily large number of logical signals can be sent across the physical wires by time division multiplexing (TDM), using muxes at the sending FPGA and demuxes at the receiving FPGA. A greater ratio of logical signals to physical signals requires more time steps for TDM, and thus lowers the emulated frequency.

Because of the varying resources among the FPGAs and the fixed physical traces on the boards, not all FPGAs have direct access to all board-level features. For example, each board contains a 1GB DDR DIMM accessible by FPGA U1 and a 1GB DDR DIMM accessible by U5. Access to these DIMMs by logic within another FPGA needs to be routed through U1 or U5 to reach the appropriate DIMM. Similarly, only U1 contains the RocketIO transceivers that interface over the cabling. Therefore, signals destined for another board must pass through the U1 FPGAs on both the sending and receiving boards.

In addition, the number of physical pins interconnecting pairs of FPGAs is neither uniform nor symmetric. The MCEMU synthesis flow includes a sophisticated interconnect generation tool that, when given a set of interconnected netlist modules, generates and configures the TDM multiplex and demultiplex logic to properly connect the modules over the appropriate physical interconnects (intraboard traces and interboard cables). In the MCEMU flow, partitioning a large netlist into multiple modules (each of a suitable size for one FPGA) can be done either manually or with varying levels of automation through partitioning tools.

Like most FPGA-synthesizable designs, the choice of the emulator platform can affect the particular strategy to partition the design and interface it to the memory system. In the Atom synthesis project, the platform was a single-FPGA emulator that fits in a Pentium CPU socket. It was necessary to build a bridge between the Atom processor core and the Pentium front-side bus so as to allow the emulated Atom core to communicate with the memory and I/O resources on the motherboard. Similarly, with the MCEMU platform, which has on-board DDR memory, we also need to build a bridge between the Nehalem core and a DDR controller so that the emulated CPU core can boot from the OS image and execute code, all resident in the DDR memory. The original OS image and workload can be updated by the host CPU board on the MCEMU.

When a design is ready to be run on the MCEMU, it is loaded onto the FPGA boards by a control/debug host board that sits alongside the FPGA boards. The host board is a full x86 computer with a hard disk and network interface and runs Linux. A Linux application running on the host board can program and reset the MCEMU FPGAs, write to control registers inside the FPGAs, read and write the MCEMU DRAM DIMMs, and control the emulation clock, all over the shared cPCI bus. As we show in Section 4.4, this built-in hardware/software interface can be a powerful tool for system bring-up and verification.
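The TDM scheme described above can be pictured as a simple serializer on the sending FPGA and a matching deserializer on the receiver, both stepping through time slots in lockstep on the fast interconnect clock. The following Verilog sketch is illustrative only; the module and signal names are our own assumptions and not the output of the MCEMU interconnect generation tool.

// Illustrative sketch only: time-division multiplexing of RATIO logical
// signals over one shared physical wire.
module tdm_send #(parameter RATIO = 10) (
  input  wire             iclk,      // fast interconnect clock
  input  wire [RATIO-1:0] sig_in,    // logical signals, stable per emulated cycle
  output reg              phys_wire, // single shared physical wire
  output reg              emu_tick   // pulses once per emulated cycle
);
  integer slot = 0;
  always @(posedge iclk) begin
    phys_wire <= sig_in[slot];            // send one logical bit per time slot
    emu_tick  <= (slot == RATIO-1);       // last slot closes the emulated cycle
    slot      <= (slot == RATIO-1) ? 0 : slot + 1;
  end
endmodule

module tdm_recv #(parameter RATIO = 10) (
  input  wire             iclk,
  input  wire             phys_wire,
  output reg [RATIO-1:0]  sig_out    // reassembled logical signals
);
  integer slot = 0;
  always @(posedge iclk) begin
    sig_out[slot] <= phys_wire;           // capture the bit for this slot
    slot          <= (slot == RATIO-1) ? 0 : slot + 1;
  end
endmodule

With RATIO = 10, one emulated cycle costs ten interconnect cycles, which matches the observation above that a 10:1 TDM ratio caps the emulation frequency at one tenth of the physical interconnect speed.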

Figure 4: Toolflow: From Nehalem RTL to FPGA Bitstreams.

3. SYNTHESIS
Nehalem is designed in SystemVerilog, with wrapper Verilog code. Finding a tool that can parse the design is therefore of primary importance. There are actually only a few FPGA frontend synthesis tools that can parse SystemVerilog. While many tools do support a subset of SystemVerilog, in our experience there are usually certain features of the language that either cause the tools to report syntax errors, die silently, or synthesize netlists incorrectly.

Even though Synopsys ceased its development and support of DC-FPGA [15] in 2005, it is the only tool that can correctly parse all SystemVerilog features used in Nehalem's RTL. Since DC-FPGA supports these features, we are able to minimize the code changes necessary to build netlists. The fewer modifications we must make to the Nehalem codebase, the less prone we are to introducing bugs.

Ultimately, however, some source code modifications were still necessary due to deficiencies in this tool. DC-FPGA works well enough for creating netlists from the SystemVerilog. However, as a discontinued tool, DC-FPGA is sometimes prone to creating erroneous circuits from the given RTL. This was observed in the synthesized Atom core, and was observed again in the Nehalem synthesis with new SystemVerilog constructs. For example, a certain XOR macro within the Nehalem execution cluster is not mapped correctly to the Xilinx FPGA architecture. We are able to discover these bugs as described later in Section 3.4, and replace the code snippet with a Xilinx macro block. Additionally, some modules that did not synthesize correctly were synthesized using Synopsys Synplify Pro [16] when the code is syntactically similar to Verilog. EDA tools are slowly being patched to handle SystemVerilog constructs, and we predict that SystemVerilog will be fully supported in the near future.

The entire synthesis flow is shown in Figure 4. Once the netlist is produced by DC-FPGA, the complete Nehalem netlist is partitioned using Auspy's ACE compiler v5.0, and the final bitstreams are generated using Xilinx ISE 10.1.03. This section will focus on preparing the Nehalem codebase to pass through DC-FPGA synthesis.

3.1 Clock Gating
Modern processor designs all use gated clocks to save power. With latch-based designs tolerant to clock skew, a gated clock is quite effective in driving a low-power clock tree through the design. In fact, this gated clock can hierarchically pass down into the processor subsystems, each subsystem having its own additional enable signal, providing designers the ability to clock gate the circuit at many levels.

For FPGA synthesis, we need to separate the enable from the original clock. FPGA architectures rely on a low-skew global clock tree to drive the flip-flops within FPGA slices. Most EDA vendors provide clock-gating removal synthesis that can do this separation automatically. This separation of the clock and enable signals allows the free-running global clock to travel along the FPGA's dedicated low-skew clock tree while the enable signal travels over standard routing resources.

Every bit-wide clock is turned into a struct consisting of the global clock and its enable signal. The Nehalem RTL codebase consists of various macros that transform the clock as it is passed through the hierarchy of modules. Most transforms simply add another enable signal (e.g. powerdown or test port access enables) or invert the clock for double-phase latch RAMs. The clock inversion macro is the only macro that modifies the global clock, while all other macros affect the enable signal. To handle the clock inversion macro, we pass into all modules a global inverted clock that is driven from a DCM, rather than actually inverting a clock signal in logic. There are 400% more unique clock macros within the Nehalem codebase than in the Atom codebase. Our methodology of separating the clock and the enable is completely portable between the two synthesizable processors.
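The transformation can be illustrated with a minimal SystemVerilog sketch of our own (the struct name and fields are assumptions, not the actual Nehalem clock macros): a register driven by a gated clock becomes a register on the free-running global clock with a clock-enable that FPGA tools map onto the flip-flop's CE pin.

// Illustrative only: gated-clock register vs. clock-enable register.
typedef struct packed {
  logic clk; // free-running global clock (from the FPGA clock tree)
  logic en;  // accumulated enable (gating conditions ANDed down the hierarchy)
} gclk_t;

// ASIC-style: the clock itself is gated (a poor fit for FPGA clock trees).
module reg_gated_clk (input logic gated_clk, input logic [7:0] d,
                      output logic [7:0] q);
  always_ff @(posedge gated_clk) q <= d;
endmodule

// FPGA-friendly: same behavior, but the clock is never touched by logic;
// the enable rides on ordinary routing and maps to the flip-flop CE pin.
module reg_clk_enable (input gclk_t c, input logic [7:0] d,
                       output logic [7:0] q);
  always_ff @(posedge c.clk)
    if (c.en) q <= d;
endmodule

A downstream macro that adds, say, a powerdown condition would AND it into the en field rather than gating clk, which is how the enable accumulates through the module hierarchy.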
3.2 Latch Conversions
Due to the inability of DC-FPGA to correctly map transparent latches to the Xilinx primitives, we use two approaches to latch conversion to ensure correct generation of netlists. The first approach is to directly instantiate Xilinx FPGA latch primitives (LDEs), forcing all latches to be black boxes during synthesis. However, the high count of latches in the Nehalem codebase makes applying this technique to all latches impossible: the backend place and route tools simply cannot meet the timing constraints when handling so many latches in the resulting netlist. The second approach is to convert latches to edge-triggered flip-flops, which is possible when the data and enable arrive one phase earlier than the clock edge.
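The second approach can be sketched as follows; this is our own illustrative code, not the Nehalem latch macro itself, and it is valid only under the timing condition just stated.

// Illustrative only: a transparent latch and its edge-triggered replacement.
// The replacement is valid only when d and en are stable one phase before
// the rising clock edge (i.e. they never change with the edge itself).
module latch_orig (input logic clk, en, d, output logic q);
  always_latch
    if (clk & en) q <= d;   // transparent while the clock is high and enabled
endmodule

module latch_as_ff (input logic clk, en, d, output logic q);
  always_ff @(posedge clk)
    if (en) q <= d;         // samples the same value the open latch would pass
endmodule

Whether this replacement is safe is decided per latch instance, as described next.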

Most often this conversion is possible; however, in some instances the data or the enable signal arrives late. Since the latches in Nehalem are instantiated in a macro form, we can detect this race condition while running the simulation and determine whether the data or the enable is changing with the clock edge. If this behavior is seen, an edge-triggered flip-flop conversion is incorrect and data will not propagate correctly through the latch. Therefore, we adopt a combination of both approaches. That is, for those latches with the input data or the enable signals ready before the leading clock edge, we convert those latches to flip-flops. For the remaining latches, we instantiate them directly as latch primitives.

Interestingly, the Nehalem core also has some latch structures that do not have an equivalent FPGA macro or clock-and-enable structure. For example, a LATCH_P macro is an inverted latch. By DeMorgan's theorem, the equivalent circuit is:

LATCH   == CLK & ENABLE
LATCH_P == ~LATCH
LATCH_P == ~(CLK & ENABLE)
LATCH_P == ~CLK || ~ENABLE

The latch can essentially be opened by the inverted clock or the inverted enable signal. The latch can no longer be consistently opened by the clock, since the inverted enable signal may change in either phase of the clock. In order to faithfully comply with the latching behavior of the original RTL, our solution is to use a clock2x180 to produce a positive-edged 2x clock that can capture data on each phase of the clock. This solution is feasible for most latches in the system, but leads to tight timing constraints; therefore it is used sparingly.

Figure 5: RAM Replacements for Nehalem Clusters.

3.3 RAM Replacements
Latch and flip-flop RAM structures are used heavily within the Nehalem RTL. The latch RAMs are extremely power efficient, are tolerant to clock skew, allow time borrowing, and are amenable to clock-gated circuits. The flip-flop RAMs can be mapped to optimized cells in the ASIC backend flow, whereas the behavioral model is written in standard edge-triggered SystemVerilog code. From looking at the RAM instantiations, it is clear that the memories we end up replacing vary in size and complexity across a range of parameters:

• Size. Memory structures within the Nehalem core range from several kilobytes down to a few bits. Small memories map better to distributed memory structures (a granularity of 1b x 16 or 1b x 64 per reconfigurable logic block on Xilinx Virtex-4 and Virtex-5 FPGAs), while larger memories map best to Xilinx block RAMs (a granularity of 18Kb per RAM).

• Read and write ports. RAM structures found within Nehalem range from simple 1-read, 1-write port FIFOs to highly complex banked register files. FPGA memories natively handle one shared read/write port (distributed memory) and up to two independent read/write ports (block RAMs).

• Reset and set behavior. The RAM structures in Nehalem have various flash reset, flash copy, and multiple-entry read behaviors. Xilinx FPGAs have the connectivity available to connect these heavily interconnected and multiplexed structures. However, we want to take the effort to emulate more complex RAMs whenever possible, with the goal of better FPGA resource utilization.

As an example of RAMs that exhibit the behaviors described above, the out-of-order cluster holds many complex RAM structures. This cluster has many highly ported memory structures implementing the reorder buffer, varying in size from a single bit to hundreds of bits per entry. The reorder buffer allows instructions to complete in any order but only affect machine state in program order. In an out-of-order machine, many instructions are in flight speculatively, are waiting for loads or stores to complete, or are being used in various ALU operations. Instructions completing within the reorder buffer have to update the other reorder buffers as quickly as possible to keep instructions retiring quickly. The RAM structures that hold the reorder buffer therefore consist of highly ported memories. These RAMs also have flash reset behaviors, since a reorder buffer entry can be invalidated on a mispredicted branch.

We are able to emulate the complex read, write, and invalidate behaviors using the following techniques (a sketch combining them appears after this list):

• Entry Invalidations. We keep a separate valid bit per RAM entry held in a flip-flop array. We set this bit on a write and reset it on an invalidate. On a read operation, we check this bit to determine whether to output the RAM contents or a reset value.

• Multiple Write Ports. We can use two write ports per BRAM structure. For memories with more write ports, like those found in the reorder buffer, we keep multiple copies of the RAM structure, each having two logical write ports attached to the physical write ports. The most up-to-date write location is kept in another flip-flop array that is used to multiplex a read operation from the logical RAM structures.

• Multiple Read Ports. We can also have two read ports per BRAM structure. We emulate higher read-port counts by again duplicating the BRAM structures and driving each copy's write port with the same logical write, so that each duplicated BRAM can serve additional read ports.
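Below is a minimal Verilog sketch of the replication idea for the write-port case, combined with per-entry valid bits and a flash invalidate. It is our own illustration under simplified assumptions (one read port, two write ports built from two single-write-port copies plus a "last writer" table); the real replacements are generated per RAM shape, and all names here are ours.

// Illustrative only: a 2-write / 1-read memory emulated with two
// single-write-port RAM copies, a last-writer table, and per-entry
// valid bits with flash invalidate.
module mem_2w1r #(parameter DEPTH = 64, WIDTH = 16, AW = 6) (
  input  wire             clk,
  input  wire             we0, we1,
  input  wire [AW-1:0]    waddr0, waddr1,
  input  wire [WIDTH-1:0] wdata0, wdata1,
  input  wire             inval,              // flash-invalidate all entries
  input  wire [AW-1:0]    raddr,
  output wire [WIDTH-1:0] rdata
);
  reg [WIDTH-1:0] bank0 [0:DEPTH-1];  // copy written by port 0
  reg [WIDTH-1:0] bank1 [0:DEPTH-1];  // copy written by port 1
  reg [DEPTH-1:0] last_is_1;          // which copy holds the newest data
  reg [DEPTH-1:0] valid;              // per-entry valid bits (flip-flop array)

  always @(posedge clk) begin
    if (we0) begin bank0[waddr0] <= wdata0; last_is_1[waddr0] <= 1'b0; end
    if (we1) begin bank1[waddr1] <= wdata1; last_is_1[waddr1] <= 1'b1; end
    if (inval)                                   // reorder-buffer style flash reset
      valid <= {DEPTH{1'b0}};
    else begin
      if (we0) valid[waddr0] <= 1'b1;
      if (we1) valid[waddr1] <= 1'b1;
    end
  end

  // Read: pick the copy written last; invalid entries read as a reset value.
  assign rdata = !valid[raddr] ? {WIDTH{1'b0}} :
                 (last_is_1[raddr] ? bank1[raddr] : bank0[raddr]);
endmodule

For higher port counts the same pattern repeats: more copies for additional write ports, and full duplication of the structure for additional read ports, which is why the highly ported reorder-buffer RAMs are so expensive to emulate.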

FPGA architectures are not well suited to mapping these latch structures with complex behaviors. However, with our clock-gating methodology, we are able to separate out the read and write clocks from the enable signals and convert all of these RAMs to flip-flop implementations. Once we are able to confirm that all RAMs can be converted to a flip-flop implementation, we can translate the largest ones to either distributed or block RAM implementations. Whenever these memory replacements are done using explicit Xilinx memory instantiations, the new instantiations are black-boxed throughout the DC-FPGA synthesis step. Later in the FPGA design flow, the generated memory netlists targeting FPGA architectures can be dropped in for the final FPGA bitstream assembly.

We observed 300 instances of these latch RAMs within the Nehalem codebase and were able to convert them all to flip-flop RAMs, distributed memory RAMs, or BRAMs. Figure 5 shows the breakdown of how these RAMs were converted for each cluster. This number of RAMs is 8x the number of RAMs seen in the Atom codebase. The synthesizable Atom core also had only RAM structures with low read and write port counts, whereas in Nehalem, RAMs with extremely high write/read port counts were observed in several instances. Again within Nehalem, the out-of-order cluster proves to hold a high count of RAM instantiations. The frontend cluster holds complex branch prediction RAMs, is multithreaded, and can decode multiple instructions in parallel. For this reason, FE holds a high count of smaller RAMs with complex behavior.

3.4 Verification Methodology
With all these changes to the Nehalem codebase, special care has to be taken that our code does not break the functionality of the various clusters. The Nehalem model comes with a rich regression test environment with a large number of both full-chip and unit-level regression tests. These tests check not only for successful completion of the test case, but additionally instrument the code within the RTL, monitor produced output files, and poke and peek signals throughout the design.

Unexpectedly, due to the nature of the RTL changes necessary for FPGA synthesis, such as converting the RAMs and converting 1-bit clock signals to clock structures, these regression tests frequently fail to execute unmodified due to non-existent or renamed signals and clocks that are no longer accessible in bit operations. Full-chip regressions are less invasive and more likely to continue working with minimal modifications, but take a significant amount of time to execute (on average 6 hours). Further, the full-chip regressions also interact with the Uncore code, which shares some macros and modules with our converted clusters, leading to naming mismatches. Given that most FPGA-related RTL changes are highly localized and only require changes to individual files, we used the following methodology for validating such changes, which yields a rapid turnaround time on simulation execution and can be employed without requiring any changes to existing regression tests.

1. Modify the original Nehalem RTL to log all input and output values on every phase for a given target of interest (a full cluster or a smaller RTL component).

2. Execute an existing full-chip regression test to generate the signal trace of interest.

3. Modify the FPGA-synthesizable RTL to feed the logged inputs into the target of interest on each simulated phase, and check that the produced outputs match those logged to the trace. Comment out all other RTL (e.g. other clusters) to speed up compilation and simulation time.

4. Simulate the reduced Nehalem design to test the correctness of the FPGA-synthesizable RTL changes.

Additionally, we track every latch replacement's input and output signals. Within a simulation, a latch macro will report if its input data is not being latched correctly by our ported code. This is easy enough to do by keeping the original code in place and comparing outputs. With this fine-grained verification in place, we can quickly spot a bug and replace that latch macro with a Xilinx native latch. Using too many latches is bad for timing, but the Xilinx tools can handle a few of them; moreover, we are running the FPGA at a relatively low clock rate, so the tools can place and time some latches.

We have made extensive use of this strategy in this project. Doing so significantly reduces the time to verify a particular RTL change (e.g. one minute to recompile the EXE cluster compared to 10 minutes for the full model, and three minutes to simulate a simple test on EXE compared to one hour for the full chip), but it also gives a more rigorous validation, as any deviation from the baseline behavior, even changes which might not cause a particular test to fail, will be detected. We have written scripts to automatically insert the necessary trace generation and trace consumption code (steps 1 and 3 above), so no manual RTL changes are necessary to employ this methodology. This methodology was not used for the FPGA-synthesizable Atom core; with a small codebase, the Atom core can run full simulations within minutes, compared to Nehalem taking one hour for short system-level tests. Therefore, this strategy is extremely beneficial for large circuits and scales extremely well as the design grows.

Individual modules can also be synthesized and tested on-FPGA with a similar methodology, in order to validate that the synthesis flow and tools have produced the correct output. Inputs to the targeted module can be driven either by specialized software control or by an embedded ROM. The MCEMU hardware and software platform provide a powerful logic analyzer and injector capability which allows signals on individual FPGAs to be read or written under software control. Each clock phase, the inputs to the synthesized logic block are read from the associated trace file and provided to the corresponding logic block via this signal injection mechanism, and the outputs which had been generated on the prior clock phase are read out and checked against the signal trace to identify any deviation from the expected behavior. We can typically synthesize a bitfile for testing an individual Nehalem RTL file in approximately 15 minutes, significantly faster than the time necessary to synthesize a full cluster or the full design. Additionally, by ensuring each module was tested using this methodology in addition to simulation testing, a handful of bugs due to tool misconfiguration and incorrect synthesis output were caught at an early stage, allowing fixes and workarounds to be applied quickly.
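Conceptually, steps 1-4 wrap the target in a record-and-replay harness. The SystemVerilog sketch below is our own simplified illustration of that idea; the file format, signal widths, and the exe_cluster stand-in are assumptions, not the generated Nehalem scripts.

// Illustrative only: replay recorded per-phase inputs into a unit under test
// and compare its outputs against the recorded golden outputs.

// Stand-in for the real unit under test (the actual cluster RTL would go here).
module exe_cluster (input logic clk, input logic [127:0] in_vec,
                    output logic [63:0] out_vec);
  always_ff @(posedge clk) out_vec <= in_vec[63:0] ^ in_vec[127:64]; // dummy logic
endmodule

module trace_replay_tb;
  logic clk = 0;
  always #5 clk = ~clk;                  // one "phase" per clock edge here

  logic [127:0] in_vec;                  // assumed packed input bus of the target
  logic [63:0]  out_vec, exp_out;        // target outputs and expected values

  exe_cluster dut (.clk(clk), .in_vec(in_vec), .out_vec(out_vec));

  integer tf, rc, errors = 0;
  initial begin
    tf = $fopen("exe_trace.txt", "r");   // trace produced by the unmodified RTL
    if (tf == 0) $fatal(1, "missing trace file");
    while (!$feof(tf)) begin
      rc = $fscanf(tf, "%h %h\n", in_vec, exp_out);  // inputs + golden outputs
      if (rc != 2) break;
      @(posedge clk); #1;                // let the DUT evaluate this phase
      if (out_vec !== exp_out) begin
        errors++;
        $display("mismatch @%0t: got %h expected %h", $time, out_vec, exp_out);
      end
    end
    $display("replay done, %0d mismatches", errors);
    $finish;
  end
endmodule

The same kind of trace can later drive the synthesized block on the emulator through the MCEMU signal injection interface, which is how tool-induced netlist bugs were separated from RTL porting bugs.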

Figure 6: FPGA Resource Assessment of the Nehalem Clusters.

4. NEHALEM EMULATION IN MCEMU
Once our code changes have been verified through simulation and we have a body of RTL ready for MCEMU, the netlists are taken and partitioned across the five FPGAs on one MCEMU board.

4.1 Initial Sizing of the Nehalem Clusters
With the changes made to the clocking structure, the RAMs converted, and the code verified, the various code bodies are combined. Preliminary synthesis results for the ported Nehalem code are gathered before partitioning the RTL across the five FPGAs. The clusters' FPGA utilization is shown in Figure 6. The synthesis tool targets all the clusters for the Virtex-5 FPGAs on the MCEMU platform, but of course some of the RTL will have to be targeted to Virtex-4 architectures. As can be seen, the out-of-order cluster clearly cannot fit on a single FPGA. Due to the high connectivity of the reorder buffers and the number of buffers themselves, the out-of-order cluster requires further partitioning.

4.2 Memory Interface

Figure 7: Nehalem Core to Memory Interface. (a) Original Design (b) Memory Translator.

The standard Nehalem core connects to the Uncore through the MLC (midlevel cache). The Uncore is cut away at its interface to the MLC, as shown in Figure 7. This cut is necessary, as the FPGAs on the emulation platform cannot fit the 256KB midlevel cache or its associated logic. Once that logic is cut out, however, the necessary interfaces are created to our custom memory controller, which can communicate with the onboard DRAM. Any other signals that are driven by the Uncore must be emulated correctly, such as clock synchronization and reset signals. This cut is similar to the synthesizable Atom core, where the cut occurred at the L2 cache. The cut is chosen to maximize original functionality, while cutting out the larger lower-level caches and complex Uncore interfaces that are not inherently part of the processing pipeline.

This translation step is not trivial, and the emulated interfaces are briefly described below; a simplified sketch of the load/store bridge follows the list.

• Memory Loads / Stores. For memory operations in our emulated codebase, the Nehalem core is forced to allow only one outstanding memory request, which is translated to communicate with a standard DDR controller. Additionally, the bridge must respond correctly to data accesses that correspond to the cache coherency protocol. There are multiple memory request types, due to locks, read-for-ownership, and self-snoops, that are handled as well.

• CRAB (Control Register Access Bus) Reads/Writes. The CRAB bus is a distributed register file that allows control registers to be read and written by communicating over a ring. This ring has stops within the MLC and Uncore that are emulated.

• APIC (Advanced Programmable Interrupt Controller) Access. The timer interrupt is the only functionality that must be emulated correctly. This timer is used for operating system functionality and must behave correctly. As the Uncore can run at a different clock speed than the on-die cores, the Uncore clock synchronization signals are emulated as well. The functionality of the APIC timer can be verified in short tests.
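As referenced above, the core-to-DDR bridge can be thought of as a small state machine that serializes requests, accepting one and blocking the core until the DDR controller completes it. The following Verilog sketch is purely conceptual and heavily simplified; the handshake and every signal name are our assumptions, not the actual MCEMU memory translator, and the coherency-related request types are omitted.

// Conceptual sketch only: serialize core memory requests (one outstanding
// at a time) toward a generic DDR controller interface.
module mem_bridge (
  input  wire        clk, rst,
  // simplified core-side request (assumed handshake)
  input  wire        req_valid,
  input  wire        req_is_write,
  input  wire [35:0] req_addr,
  output reg         req_ready,
  // simplified DDR-controller side (assumed handshake)
  output reg         ddr_cmd_valid,
  output reg         ddr_cmd_write,
  output reg  [35:0] ddr_cmd_addr,
  input  wire        ddr_done
);
  localparam IDLE = 1'b0, BUSY = 1'b1;
  reg state;

  always @(posedge clk) begin
    if (rst) begin
      state <= IDLE; req_ready <= 1'b1; ddr_cmd_valid <= 1'b0;
    end else case (state)
      IDLE: if (req_valid) begin            // accept exactly one request
        ddr_cmd_valid <= 1'b1;
        ddr_cmd_write <= req_is_write;
        ddr_cmd_addr  <= req_addr;
        req_ready     <= 1'b0;              // block further requests
        state         <= BUSY;
      end
      BUSY: begin
        ddr_cmd_valid <= 1'b0;
        if (ddr_done) begin                 // DDR controller finished
          req_ready <= 1'b1;
          state     <= IDLE;
        end
      end
    endcase
  end
endmodule

The real bridge additionally distinguishes locks, read-for-ownership, and self-snoop request types and answers the coherency-related accesses, which this sketch deliberately omits.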

4.3 Multi-FPGA Partitioning on MCEMU

Table 2: Logical Connectivity and TDM Ratio between FPGAs.
      U1   U2   U3   U4   U5
U1    -    18   18   -    -
U2    24   -    18   21   -
U3    18   18   -    18   -
U4    -    21   21   -    24
U5    -    -    -    18   -

Once the code changes are verified through simulation and a body of RTL is ready for MCEMU, the full design is partitioned into modules targeting individual FPGAs on the MCEMU.

Before describing how Nehalem is mapped to the MCEMU emulator, it is important to understand a fundamental difference between single- and multi-FPGA emulation. Whereas in a single-FPGA emulator it is the critical path timing reported by the place and route tools that sets the emulation speed, in a partitioned multi-FPGA design it is the critical path of the logical signals between FPGAs that sets the emulation speed. For a logical signal between two FPGAs, say U1 and U3, this critical path depends on the degree of TDM sharing of the physical wires. If the TDM ratio is 10 logical signals per physical wire, the maximum emulation frequency is 1/10th the speed of the physical interconnect. In addition, if a logical signal must hop from U1 to U2 and then to U3 in one emulation cycle, assuming both hops have a TDM ratio of 10, the resulting frequency is cut by another factor of 2, because the logical signal must traverse both hops within one emulated cycle. Clearly, the frequency limit due to logical signal transmission quickly dominates the PAR frequency limit of the FPGAs. As such, attaining a high emulation frequency becomes an exercise in mapping the logic across FPGAs in a way that minimizes the number of logical signals between FPGAs, minimizes the number of hops between the source and destination of logical signals, and distributes logical signals to best balance TDM ratios. In other words, a good partitioning of the emulated logic should be found so that the partitioned topology most closely matches the emulator topology.

As shown in Figure 6, the cluster utilizations suggest that, neglecting OOO's high LUT utilization, a cluster-level partitioning would map very naturally to a single MCEMU board and minimize the number of logical signals traversing the on-board interconnect (i.e. logic within a cluster is more tightly connected than logic in two different clusters). Because FE and OOO are the largest clusters, it is clear to map them to our larger Virtex-5 FPGAs and to map EXE and MEU to the smaller Virtex-4 FPGAs. As mentioned in Section 2.3, the number of physical wires between pairs of FPGAs is not uniform. The pair U2,U4 and the pair U3,U5 have many more connecting wires than the other pairs of FPGAs. To best balance TDM ratios, the more highly connected clusters are placed on the highly connected FPGA pairs. Analysis of the cluster-level connections shows the highest coupling between the pair EXE, OOO and the pair MEU, FE. This gives us the potential initial mappings of (FE→U2, MEU→U3, OOO→U4, EXE→U5) or (OOO→U2, EXE→U3, FE→U4, MEU→U5). In the end, high block RAM utilization by the auxiliary emulation logic on U5 (U5 is the central communication node that dispatches control/debug messages to the other FPGAs) restricts us to mapping the lower block-RAM-utilizing EXE to U5, which selects the former mapping (FE→U2, MEU→U3, OOO→U4, EXE→U5). As mentioned above, the OOO cluster is still too large for U4 (a Virtex-5 LX330). Here the resulting split occurs within the OOO at its sub-partition hierarchy. The OOO subclusters on U4 consist of the Reservation Station (RS), Register Alias Table (RAT), and Allocation (ALLOC), while the OOO's other subcluster, the ReOrder Buffer (ROB), resides on U1 [3].

The Auspy ACE partitioning software is used to restructure the top-level netlist using the given Nehalem netlist partitions. Because this methodology keeps the natural cluster-level partitioning, ACE's ability to automatically find good partitions is not used. The tool is still critically important, though, as it allows us to pull lower-hierarchy structures (e.g. the ROB) up to top-level entities. Without such a tool, this restructuring would be error-prone and tedious. In addition to using ACE to emit netlists into multiple partitions, it can be used to route some internal signals to specific FPGAs. In particular, the memory controller signals are routed from the MEU (U3) to the DRAM interface on U1, and internal architectural state (instruction pointer and architectural registers) and memory access signals are routed to U5, where auxiliary emulation logic records cycle-by-cycle traces of these signals into the other onboard DRAM module. After partitioning, the MCEMU interconnect configuration tool (see Section 2.3) runs to multiplex the interconnecting logical signals over the available physical wires. The end result of the partitioning and interconnect generation is a synthesizable fabric with the connectivity matrix and TDM ratios shown in Table 2. Empty entries show where no physical direct connection exists, though a logical connection may still occur by using two or more adjoining physical direct connections. As shown in Table 2, the generated interconnect has a TDM critical path of 24 on the path U2→U1 and the path U4→U5. These large TDM ratios are a direct result of the high number of logical signals passing between those FPGAs. The U4→U5 connection carries the signals from the OOO cluster to the EXE unit, which, as described above, are very tightly coupled clusters. Interestingly, the U2→U1 connection is actually not dominated by the FE unit talking to the DRAM interface or to the OOO cluster, but is instead heavily utilized by signals passing through U2 from the OOO sub-clusters.

From this TDM data, the potential emulation frequency can be calculated. The physical interconnect is able to run at a 10ns period. It therefore requires 240ns (24 TDM cycles) for all logical signals to complete one hop, and the emulation period could be 240ns. Because there are paths that need to cross two FPGA hops within a single emulation cycle, we need to exercise these 24 TDM cycles at least twice within every emulated cycle. This sets the maximum emulation period to 480ns. If we could guarantee that all signals crossing the interconnect fabric only change at the emulation clock edge, then 480ns would be the final emulation period. This, however, is not the case, due to the design having registers clocked on clk2x and various level-sensitive latches. With this added clock and the existing latches, there is a possibility that logical signals crossing the interconnect need to change before every edge of clk2x. To allow for this possibility (and maintain logical equivalence to the unpartitioned design) we need to allow all logical signals to complete the two hops within each phase of clk2x. This means that the actual emulation period (clk1x) needs to be 4 x 480ns = 1920ns, which results in an emulation clock frequency of approximately 520 KHz.

Interestingly, this partitioning step can be a one-time cost, barring any single FPGA running out of logic resources. Once the partitioning step is done, black-boxed Nehalem clusters can be synthesized to EDIF files and quickly linked into the bitstream generation step to create a new revision of the FPGA-synthesizable design. This ability to drop in newly synthesized clusters allows us to turn around a new design within the time it takes to synthesize and place and route a single FPGA (as opposed to synthesizing all the clusters, running a partitioning step across the entire circuit, and placing and routing the individual FPGAs).

We take the five resulting netlists (each netlist includes the emulated cluster wrapped with the generated auxiliary interconnect logic) and push them through the typical Xilinx backend flow of ngdbuild, map, and par.

Table 3: FPGA Utilization after Xilinx Map.
                  LUTs (%)   BLOCK RAM (%)
U1-ROB            81         62
U2-FE             87         83
U3-MEU            75         75
U4-RAT/Alloc/RS   89         0
U5-EXE            89         55

Table 3 shows the resulting post-map resource utilization. All FPGAs are heavily loaded, but still meet timing because only the high-speed interconnect wrapper must run at 100MHz, while the emulated Nehalem clusters are relaxed to meet only 520KHz. With the set of bitfiles complete, we use the MCEMU control/debug application (Section 2.3) to load the bitfiles and write memory images into the DRAM DIMM accessed by U1. Similarly, we use the MCEMU control/debug application to pull trace data out of the DRAM DIMM accessed by U5.

4.4 Nehalem Verification on MCEMU
As described in Section 3.4, individual modules and even entire clusters can be separately synthesized and tested on FPGA by feeding a trace of inputs to a synthesized chunk of logic. With slight modifications, this technique can be extended to verify that the fabric which multiplexes the signals between the different clusters also operates correctly. An initial partition of the Nehalem core is produced such that each FPGA holds a portion of the core logic, but without any connectivity between FPGAs. That is, each "island" of logic is fully standalone, with the inputs to each FPGA controlled entirely through software which reads inputs from a trace and validates the outputs against it. This verifies that, in isolation, the complete design produces behavior that matches 100% with simulation.

Following this, in a piecewise fashion, connectivity between the different clusters is progressively established, allowing signals to flow between the different FPGAs rather than being read from a trace file. Ultimately, only the inputs to the Nehalem core top level (e.g. clock, reset) are read from the trace. Note that even in this configuration the outputs produced at each FPGA can still be read out and compared against the expected output recorded in the trace.

In fact, this technique can be (and was) applied even before the complete Nehalem core was FPGA-synthesizable, allowing the remaining portions of the design to be tested. Logic which cannot currently be synthesized to an FPGA, due to, for example, capacity restrictions, can be freely removed from the design. Its functionality is provided, instead, by the MCEMU software, which injects the outputs of the missing logic by reading from a separately captured trace. This technique was used both in the early stages of this work, when accommodating the extremely large OOO cluster (which exceeds the capacity of a single Virtex-5 FPGA, as shown in Figure 6), and in the later stages when the memory interface described in Section 4.2 was being written.

The correctness of the Nehalem core can be verified through these trace-based methods for core reset and simple workloads such as computing a Fibonacci number, but this mechanism is too slow for more involved workloads. Therefore, an EIP (instruction pointer) and register value tracing mechanism was also added. The original Nehalem RTL already has instrumentation code capable of generating a full log of retired instruction addresses and architected register values, which, used in conjunction with a separate x86 functional model, validates the correctness of the RTL in simulation. This capability can be added to the synthesized core by simply removing the ifdefs which guard this code and routing these signals to the dedicated MCEMU trace collection hardware described in Section 2.3. This tracing mechanism proved invaluable in fixing those final bugs in the memory interface which were not identified through simulation.

Figure 8: Screenshot of Synthesizable Nehalem Core Booting up Linux.

5. RESULTS
Our FPGA-synthesizable version of the Intel Nehalem core correctly preserves all instruction set and microarchitectural features implemented in the original Nehalem core. These include the complete microcode ROM, full-capacity L1 instruction and data caches, SSE4, Intel64, Intel Virtualization Technology, and advanced power modes such as C6. As expected, the synthesizable Nehalem core is capable of executing the rich variety of legacy x86 software.

Figure 8 shows a screenshot of the Nehalem core booting a version of Linux on the MCEMU. A simple program is shown to execute the CPUID instruction and display the result, revealing that the emulated CPU is indeed a Nehalem core.

As an example illustrating the microarchitectural difference between two families of Intel Architecture designs, Figure 9 shows a performance comparison of the out-of-order Nehalem core and the in-order Atom core, both synthesized to the same MCEMU platform and running at the same frequency. The five benchmarks represent, respectively from left to right, different optimizations of a hand-optimized compute-intensive Mandelbrot fractal computation using (1) x87 single precision, (2) SSE3 single precision, (3) x87 double precision, and (4) SSE3 double precision, plus (5) an integer workload designed to stress a processor's out-of-order capabilities. In all cases these results show significant performance advantages for the Nehalem core, with speedups ranging from 1.8x to 3.9x over the Atom processor.

Figure 9: Nehalem vs. Atom: Performance Comparison of Microbenchmarks.

6. CONCLUSIONS
In this paper we have presented our experience in making the Intel Nehalem processor core FPGA synthesizable and partitioning the design across a multi-FPGA emulator platform. We also present the methodology for taking various complex constructs and mapping them efficiently to FPGA resources. The debugging methodology, employed seamlessly from RTL simulation through bitfile emulation, proves to be vital in ensuring high productivity. To our knowledge, this is the first time that a full-featured state-of-the-art out-of-order x86 processor design has been successfully made emulation ready on commodity FPGAs using existing EDA tools. Together with our previous work to make the Intel Atom core FPGA synthesizable, this latest milestone with the FPGA-synthesizable Nehalem provides yet another cost-effective approach to improve the efficiency and the productivity of design exploration and validation for future x86 architectural extensions and microarchitectural optimizations.

7. ACKNOWLEDGMENTS
We would like to thank Joe Schutz, Steve Pawlowski, Justin Rattner, Glenn Hinton, Rani Borkar, Shekhar Borkar, Jim Held, Jag Keshava, Belliappa Kuttanna, Chris Weaver, Elinora Yoeli, Pat Stolt and Ketan Paranjape for the productive collaboration, guidance and support throughout the project.

In addition, we thank the anonymous reviewers whose valuable feedback has helped the authors greatly improve the quality of this paper.

8. REFERENCES
[1] Auspy. ACE Compiler. http://www.auspy.com/.
[2] J. Babb, R. Tessier, M. Dahl, S. Hanono, D. Hoki, and A. Agarwal. Logic Emulation with Virtual Wires. IEEE Transactions on Computer Aided Design, 16:609-626, 1997.
[3] B. Bentley. Simulation-driven Verification. Design Automation Summer School, 2009.
[4] J. Casazza. First the Tick, Now the Tock: Intel Microarchitecture (Nehalem). Intel Corporation, 2009.
[5] W.-J. Fang and A. C.-H. Wu. Multiway FPGA Partitioning by Fully Exploiting Design Hierarchy. ACM Trans. Des. Autom. Electron. Syst., 5(1):34-50, 2000.
[6] J. Gaisler. A Portable and Fault-Tolerant Microprocessor Based on the SPARC V8 Architecture. In Proceedings of the International Conference on Dependable Systems and Networks, 2002.
[7] M. Gschwind, V. Salapura, and D. Maurer. FPGA Prototyping of a RISC Processor Core for Embedded Applications. IEEE Transactions on VLSI Systems, 9(2), April 2001.
[8] Intel Core i7-800 and i5-700 Desktop Processor Series. download.intel.com/design/processor/datashts/322164.pdf, 2009.
[9] D. Jagger. ARM Architecture and Systems. IEEE Micro, 17, July/August 1997.
[10] H. Krupnova. Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 2, pages 1236-1241, Feb. 2004.
[11] S. L. Lu, P. Yiannacouras, R. Kassa, M. Konow, and T. Suh. An FPGA-based Pentium in A Complete Desktop System. In International Symposium on Field Programmable Gate Arrays, 2007.
[12] W.-K. Mak and D. F. Wong. On Optimal Board-Level Routing for FPGA-based Logic Emulation. In DAC '95: Proceedings of the 32nd Annual ACM/IEEE Design Automation Conference, pages 552-556, New York, NY, USA, 1995. ACM.
[13] T. Mattner and F. Olbrich. FPGA Based Tera-Scale IA Prototyping System. In The 3rd Workshop on Architectural Research Prototyping, 2008.
[14] Intel Microarchitecture, Codenamed Nehalem. www.intel.com/technology/architecture-silicon/next-gen/, 2009.
[15] Synopsys. DC-FPGA. www.synopsys.com/products/dcFPGA.
[16] Synopsys FPGA Synthesis Reference Manual. Synopsys, December 2005.
[17] P. H. Wang, J. D. Collins, C. T. Weaver, B. Kuttanna, S. Salamian, G. N. Chinya, E. Schuchman, O. Schilling, T. Doil, S. Steibl, and H. Wang. Intel Atom Processor Core Made FPGA-Synthesizable. In FPGA '09: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 209-218, New York, NY, USA, 2009. ACM.
[18] Virtex-4 User Guide, v2.3. Xilinx, August 2007.
[19] Virtex-5 FPGA User Guide, v3.3. Xilinx, February 2008.
