Intel Nehalem Processor Core Made FPGA Synthesizable
Graham Schelle1, Jamison Collins1, Ethan Schuchman1, Perry Wang1, Xiang Zou1,
Gautham Chinya1, Ralf Plate2, Thorsten Mattner2, Franz Olbrich2, Per Hammarlund3,
Ronak Singhal3, Jim Brayton4, Sebastian Steibl2, Hong Wang1
Microarchitecture Research Lab, Intel Labs, Intel Corporation1
Intel Germany Research Center, Intel Labs, Intel Corporation2
Central Architecture and Planning, Intel Architecture Group, Intel Corporation3
Microprocessor and Graphics Development, Intel Architecture Group, Intel Corporation4
Contact: [email protected]
ABSTRACT

We present an FPGA-synthesizable version of the Intel Nehalem processor core, synthesized, partitioned and mapped to a multi-FPGA emulation system consisting of Xilinx Virtex-4 and Virtex-5 FPGAs. To our knowledge, this is the first time a modern state-of-the-art x86 design with an out-of-order micro-architecture has been made FPGA synthesizable and capable of high-speed cycle-accurate emulation. Unlike the Intel Atom core, which was made FPGA synthesizable on a single Xilinx Virtex-5 in a previous endeavor, the Nehalem core is a more complex design with aggressive clock-gating, double-phase latch RAMs, and RTL constructs that have no true equivalent in FPGA architectures. Despite these challenges, we are successful in making the RTL synthesizable with only 5% RTL code modifications, partitioning the design across five FPGAs, and emulating the core at 520 KHz. The synthesizable Nehalem core is able to boot Linux and execute standard x86 workloads with all architectural features enabled.

Categories and Subject Descriptors
C.1.0 [Processor Architectures]: General

General Terms
Design, Measurement, Performance

Keywords
Intel Nehalem, FPGA, emulator, synthesizable core

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA'10, February 21–23, 2010, Monterey, California, USA. Copyright 2010 ACM 978-1-60558-911-4/10/02 ...$10.00.

1. INTRODUCTION

Intel Nehalem [4, 8, 14] is the latest microarchitecture design and the foundation of the Intel Core i7 and Core i5 processor series. Like its predecessor (the Intel Core microarchitecture), the Intel Nehalem microarchitecture continues to focus on improvements in how the processor uses available clock cycles and power, rather than just pushing up ever higher clock speeds and energy needs. Its goal is to do more in the same power envelope, or even a reduced one. In turn, the Intel Nehalem microarchitecture includes the ability to process up to four instructions per clock cycle on a sustained basis, compared to just three instructions per clock cycle or fewer processed by other processors. In addition, Intel Nehalem incorporates a few essential performance and power management innovations geared towards optimizations of the individual cores and the overall multi-core microarchitecture to increase single-thread and multi-thread performance.

In addition to backward compatibility with the rich Intel Architecture legacy, Intel Nehalem sports several salient new features: (1) Intel Turbo Boost Technology, which enables judicious dynamic management of cores, threads, cache, interfaces and power; (2) Intel Hyper-Threading Technology, which in combination with Intel Turbo Boost Technology can deliver better performance by dynamically adapting to the workload, automatically taking advantage of available headroom to increase processor frequency and maximize clock cycles on active cores; and (3) Intel SSE4 instruction set extensions that center on enhancing XML, string and text processing performance.

In this paper, we share our experience and present the methodology used to make the Intel Nehalem processor core FPGA synthesizable. The emulated Nehalem processor core is partitioned across multiple FPGAs and can boot standard off-the-shelf x86 OSes, including Linux, and run x86 workloads at 520 KHz. Compared to the Intel Atom core that we previously made FPGA synthesizable, the Nehalem core is much more challenging due to the microarchitectural complexity and sheer size of the design. The key contributions of this paper are:

• We present our methodology to synthesize and judiciously partition the fully featured Nehalem RTL design to an emulator with multiple Virtex-4 [18] and Virtex-5 [19] FPGAs.
• We demonstrate a systematic and scalable cycle-by-cycle verification methodology to ensure the functional and timing correctness of the synthesizable design.
Figure 2: One MCEMU Board with Five FPGAs.

Figure 3: One MCEMU System with Nine Boards.

Table 1: MCEMU board FPGAs.
  Name   FPGA             LUTs      BRAMs
  U1     Virtex-4 FX140   126,336   552
  U2     Virtex-5 LX330   207,360   576
  U3     Virtex-4 LX200   178,176   336
  U4     Virtex-5 LX330   207,360   576
  U5     Virtex-4 LX200   178,176   336

the four core clusters in the Nehalem. The Uncore cluster is outside the scope of this paper.

2.3 The Many-Core Emulation System

The Many-Core Emulation System (MCEMU) is the emulation platform we targeted for this work. MCEMU is an FPGA emulation platform developed at Intel [13]. An MCEMU system consists of a series of identical rackable custom boards, each holding five FPGAs. Table 1 lists the name, type, and key resources of each of the five FPGAs, while Figures 2 and 3 show a single board and a full rackable system, respectively. To expand capacity beyond five FPGAs, multiple boards are interfaced together using the Xilinx RocketIO high-speed serial transceivers connected by external cabling.

Within a single MCEMU board, board traces wire input pins on one FPGA to output pins of another, leading to a fixed number of physical wires between each FPGA pair. While the number of physical wires connecting two FPGAs is fixed and small, an arbitrarily large number of logical signals can be sent across the physical wires by time division multiplexing (TDM), using muxes at the sending FPGA and demuxes at the receiving FPGA. A greater ratio of logical to physical signals requires more time steps for TDM, and thus lowers the emulated frequency.
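To make the TDM scheme concrete, here is a minimal sketch of what such mux/demux logic could look like. It is illustrative only: the module name (tdm_link), the widths (chosen so that the TDM ratio is 24, echoing the ratios in Table 2 later), and the single shared step counter are our assumptions, not the logic the MCEMU interconnect tool actually generates.

  module tdm_link #(
    parameter int LOGICAL  = 240,  // logical signals to transport between two FPGAs
    parameter int PHYSICAL = 10,   // physical board traces available for the link
    parameter int STEPS    = LOGICAL / PHYSICAL   // TDM ratio (24 with these values)
  ) (
    input  logic                phys_clk,    // fast interconnect clock (10 ns period)
    input  logic                step_rst,    // restarts the schedule each emulated cycle
    input  logic [LOGICAL-1:0]  tx_logical,  // sampled outputs of the sending FPGA
    output logic [PHYSICAL-1:0] phys_wires,  // the narrow physical link
    output logic [LOGICAL-1:0]  rx_logical   // reassembled inputs at the receiving FPGA
  );
    logic [$clog2(STEPS)-1:0] step = '0;

    // Sender: place one PHYSICAL-wide slice of the logical vector on the wires per step.
    assign phys_wires = tx_logical[step*PHYSICAL +: PHYSICAL];

    // Receiver: capture the slice back into its position, then advance the schedule.
    always_ff @(posedge phys_clk) begin
      rx_logical[step*PHYSICAL +: PHYSICAL] <= phys_wires;
      step <= (step_rst || step == STEPS-1) ? '0 : step + 1'b1;
    end
  endmodule

In the real flow, the interconnect generation tool additionally handles the unequal wire counts between FPGA pairs and the forwarding of signals across multiple hops.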
Because of varying resources among the FPGAs and fixed physical traces on the boards, not all FPGAs have direct access to all board-level features. For example, each board contains a 1GB DDR DIMM accessible by FPGA U1 and a 1GB DDR DIMM accessible by U5. Access to these DIMMs by logic within another FPGA would need to be routed through U1 or U5 to reach the appropriate DIMM. Similarly, only U1 contains the RocketIO transceivers that interface over the cabling. Therefore, signals destined for another board must pass through the U1 FPGAs on both the sending and the receiving board.

In addition, the number of physical pins interconnecting pairs of FPGAs is neither uniform nor symmetric. The MCEMU synthesis flow includes a sophisticated interconnect generation tool that, when given a set of interconnected netlist modules, generates and configures the TDM multiplex and demultiplex logic to properly connect the modules over the appropriate physical interconnects (intraboard traces and interboard cables). In the MCEMU flow, partitioning a large netlist into multiple modules (each a suitable size for one FPGA) can be done either manually or with varying levels of automation through partitioning tools.

Like most FPGA synthesizable designs, the choice of the emulator platform can affect the particular strategy to partition the design and interface it to the memory system. In the Atom synthesis project, the platform was a single-FPGA emulator that fits in a Pentium CPU socket. It was necessary to build a bridge between the Atom processor core and the Pentium front-side bus so as to allow the emulated Atom core to communicate with the memory and I/O resources on the motherboard. Similarly, with the MCEMU platform, which has on-board DDR memory, we also need to build a bridge between the Nehalem core and a DDR controller so that the emulated CPU core can boot from the OS image and execute code, all resident in the DDR memory. The original OS image and workload can be updated by the host CPU board on the MCEMU.

When a design is ready to be run on the MCEMU, it is loaded on the FPGA boards by a control/debug host board that sits alongside the FPGA boards. The host board is a full x86 computer with hard disk and network interface and runs Linux. A Linux application running on the host board can program and reset the MCEMU FPGAs, write to control registers inside the FPGA, read and write the MCEMU DRAM DIMMs, and control the emulation clock, all over the shared cPCI bus. As we show in Section 4.4, this built-in hardware/software interface can be a powerful tool for system bring-up and verification.
some instances, the data or the enable signal arrives late. Since the latches in Nehalem are instantiated in a macro form, we can detect this race condition while running the simulation and determine whether the data or the enable is changing with the clock edge. If this behavior is seen, an edge-triggered flip-flop conversion is incorrect and data will not propagate correctly through the latch. Therefore, we adopt a combination of both approaches. That is, for those latches with the input data or the enable signals ready before the leading clock edge, we convert those latches to flip-flops. For

Figure 5: Nehalem RAM Replacements per Cluster (bar chart; y-axis: Total RAMs; legend: RAM, Flip-Flop).
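A minimal sketch of this latch-to-flip-flop conversion is shown below, assuming an illustrative macro interface (latch_macro, FPGA_SYNTH) rather than the actual Nehalem macro; the `ifdef simply contrasts the original level-sensitive behavior with the edge-triggered FPGA version.

  module latch_macro (
    input  logic clk,   // phase clock that opens the latch
    input  logic en,    // latch enable
    input  logic d,
    output logic q
  );
  `ifdef FPGA_SYNTH
    // FPGA version: valid only when d and en are stable before the leading
    // clock edge, so the flop captures the value the transparent latch
    // would have held at the end of the phase.
    always_ff @(posedge clk)
      if (en) q <= d;
  `else
    // Original behavior: level-sensitive transparent latch.
    always_latch
      if (clk && en) q <= d;
  `endif
  endmodule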
and write clocks from the enable signals and convert these RAMs all to flip-flop implementations. Once we are able to confirm all RAMs can be converted to a flip-flop implementation, we can translate the largest ones to either distributed or Block RAM implementations. Whenever these memory replacements are done using explicit Xilinx memory instantiations, the new instantiations are black-boxed throughout the DC-FPGA synthesis step. Then, later in the FPGA design flow, the generated memory netlists targeting FPGA architectures can be dropped in for the final FPGA bitstream assembly.

We observed 300 instances of these latch RAMs within the Nehalem code base and were able to convert them all to flip-flop RAMs, distributed memory RAMs, or BRAMs. Figure 5 shows the breakdown of how these RAMs were converted for each cluster. This number of RAMs is 8x the number of RAMs seen in the Atom codebase. The synthesizable Atom core also had only RAM structures with low read and write port counts, whereas in Nehalem, RAMs with extremely high write/read port counts were observed in several instances. Again, within Nehalem, the out-of-order cluster proves to hold a high count of RAM instantiations. The frontend cluster holds complex branch prediction RAMs, is multithreaded, and can decode multiple instructions in parallel. For this reason, FE holds a high count of smaller RAMs with complex behavior.
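As a sketch of what such a replacement can look like, assuming illustrative names and sizes rather than an actual Nehalem RAM macro, the following rewrites a latch RAM as a synchronously written register array; coded this way, the Xilinx tools can map it to distributed RAM, or to Block RAM once the read path is registered.

  module ram_replacement #(
    parameter int WIDTH = 64,
    parameter int DEPTH = 128
  ) (
    input  logic                     wr_clk,   // write clock derived from the original enable
    input  logic                     wr_en,
    input  logic [$clog2(DEPTH)-1:0] wr_addr,
    input  logic [WIDTH-1:0]         wr_data,
    input  logic [$clog2(DEPTH)-1:0] rd_addr,
    output logic [WIDTH-1:0]         rd_data
  );
    logic [WIDTH-1:0] mem [DEPTH];

    // Flip-flop / distributed-RAM style: synchronous write, combinational read,
    // mirroring what the original latch RAM presented to the surrounding logic.
    always_ff @(posedge wr_clk)
      if (wr_en) mem[wr_addr] <= wr_data;

    assign rd_data = mem[rd_addr];
  endmodule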
3.4 Verification Methodology

With all these changes to the Nehalem codebase, special care has to be taken that our code does not break the functionality of the various clusters. The Nehalem model comes with a rich regression test environment containing a large number of both full-chip and unit-level regression tests. These tests check not only for successful completion of the test case, but additionally instrument the code within the RTL, monitor produced output files, and poke and peek signals throughout the design.

Unexpectedly, due to the nature of the RTL changes necessary for FPGA synthesis, such as converting the RAMs and converting 1-bit clock signals to clock structures, these regression tests frequently fail to execute unmodified, due to non-existent or renamed signals and clocks that are no longer accessible in bit operations. Full-chip regressions are less invasive and more likely to continue working with minimal modifications, but take a significant amount of time to execute (on average 6 hours). Further, the full-chip regressions also interact with the Uncore code, which shares some macros and modules with our converted clusters, leading to naming mismatches. Given that most FPGA-related RTL changes are highly localized and only require changes to individual files, we used the following methodology for validating such changes, which yields a rapid turnaround time on simulation execution and can be employed without requiring any changes to existing regression tests (a sketch of the resulting trace-replay check is given at the end of this section):

1. Modify the original Nehalem RTL to log all input and output values on every phase for a given target of interest (a full cluster or smaller RTL component).

2. Execute an existing full-chip regression test to generate the signal trace of interest.

3. Modify the FPGA-synthesizable RTL to feed the logged inputs into the target of interest on each simulated phase, and check that the produced outputs match those logged to the trace. Comment out all other RTL (e.g. other clusters) to speed compilation and simulation time.

4. Simulate the reduced Nehalem design to test the correctness of the FPGA-synthesizable RTL changes.

Additionally, we track every latch replacement's input and output signals. Within a simulation, a latch macro will report if its input data is not being latched correctly by our ported code. This is easy enough to do by keeping the original code in place and comparing outputs. With this fine-grained verification in place, we can quickly see a bug and replace that latch macro with a Xilinx native latch. It is bad for timing to use too many latches, but the Xilinx tools can handle a few of them; we are also running the FPGA at a relatively low clock rate, so the tools can handle placing and timing some latches.

We have made extensive use of this strategy in this project. Doing so significantly reduces the time to verify a particular RTL change (e.g. one minute to recompile the EXE cluster compared to 10 minutes for the full model, and three minutes to simulate a simple test on EXE compared to one hour for the full chip), but it also gives a more rigorous validation, as any deviation from the baseline behavior, even changes which might not cause a particular test to fail, will be detected. We have written scripts to automatically insert the necessary trace generation and trace consumption code (steps 1 and 3 above), and no manual RTL changes are necessary to employ this methodology. This methodology was not used for the FPGA synthesizable Atom core; with its small codebase, the Atom core can run full simulations within minutes, compared to Nehalem taking one hour for short system-level tests. Therefore, this strategy is extremely beneficial for large circuits and scales extremely well as the design grows.

Additionally, individual modules can be synthesized and tested on-FPGA with a similar methodology, in order to validate that the synthesis flow and tools have produced the correct output. Inputs to the targeted module can be driven either by specialized software control or by an embedded ROM. The MCEMU hardware and software platform provides a powerful logic analyzer and injector capability which allows signals on individual FPGAs to be read or written under software control. Each clock phase, the inputs to the synthesized logic block are read from the associated trace file and provided to the corresponding logic block via this signal injection mechanism, and the outputs generated on the prior clock phase are read out and checked against the signal trace to identify any deviation from the expected behavior. We can typically synthesize a bitfile for testing an individual Nehalem RTL file in approximately 15 minutes, significantly faster than the time necessary to synthesize a full cluster or the full design. Additionally, by ensuring each module was tested using this methodology in addition to
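A minimal sketch of the trace-replay check described in steps 1–4 is shown below, assuming an illustrative trace format (one line of hex inputs and expected outputs per phase) and a hypothetical exe_cluster_wrapper bundling the cluster ports; it is not the scripts' actual generated code.

  module trace_replay_tb;
    localparam int IN_W  = 512;
    localparam int OUT_W = 256;

    logic             phase_clk = 1'b0;
    logic [IN_W-1:0]  in_vec;
    logic [OUT_W-1:0] out_vec, expect_vec;
    int               fd, phase = 0;

    // Device under test: the FPGA-synthesizable cluster, ports bundled for brevity.
    exe_cluster_wrapper dut (.clk(phase_clk), .in_vec(in_vec), .out_vec(out_vec));

    always #5 phase_clk = ~phase_clk;          // one clock tick per logged phase

    initial begin
      fd = $fopen("exe_cluster.trace", "r");   // inputs/outputs logged by the original RTL
      while ($fscanf(fd, "%h %h", in_vec, expect_vec) == 2) begin
        @(negedge phase_clk);                  // drive the inputs, let the phase evaluate
        if (out_vec !== expect_vec)
          $error("phase %0d: output mismatch (got %h, expected %h)",
                 phase, out_vec, expect_vec);
        phase++;
      end
      $fclose(fd);
      $finish;
    end
  endmodule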
Figure 6: FPGA Resource Assessment of the Nehalem Clusters. (Bar chart of FPGA utilization relative to a Virtex-5 LX330 for the FE, EXE, MEU, and OOO clusters and the full NHM core, reporting % LUTs, % Registers, and % BRAM.)

[Figure: two block diagrams, (a) and (b), of the NHM core (I-CACHE, D-CACHE), APIC, MLC, L2 $, and UNCORE LLC $; panel (b) adds a memory translator and 2 GB DDR.]
Table 2: Logical Connectivity and TDM Ratio between FPGAs.
        U1   U2   U3   U4   U5
  U1    -    18   18   -    -
  U2    24   -    18   21   -
  U3    18   18   -    18   -
  U4    -    21   21   -    24
  U5    -    -    -    18   -

U3 in one emulation cycle, assuming both hops have a TDM ratio of 10, the resulting frequency is cut by another factor of 2 because the logical signal must traverse both hops within one emulated cycle. Clearly, the frequency limit due to logical signal transmission quickly dominates the PAR frequency limit of the FPGA. As such, attaining a high emulation frequency becomes an exercise in mapping the logic across FPGAs in a way that minimizes the number of logical signals between FPGAs, minimizes the number of hops between the source and destination of logical signals, and distributes logical signals to best balance TDM ratios. In other words, a good partitioning of emulated logic should be found so that the partitioned topology most closely matches the emulator topology.

As shown in Figure 6, the cluster utilizations suggest that, neglecting OOO's high LUT utilization, a cluster-level partitioning would map very naturally to a single MCEMU board and minimize the number of logical signals traversing the on-board interconnect (i.e. logic within a cluster is more tightly connected than logic in two different clusters). Because FE and OOO are the largest clusters, it is clear to map them to our larger Virtex-5 FPGAs and map EXE and MEU to the smaller Virtex-4 FPGAs. As mentioned in Section 2.3, the number of physical wires between pairs of FPGAs is not uniform. The pair U2,U4 and the pair U3,U5 have many more connecting wires than the other pairs of FPGAs. To best balance TDM ratios, more highly connected clusters are placed in the highly connected FPGA pairs. Analysis of the cluster-level connections shows the highest coupling between the pair EXE, OOO and the pair MEU, FE. This gives us the potential initial mappings of (FE→U2, MEU→U3, OOO→U4, EXE→U5) or (OOO→U2, EXE→U3, FE→U4, MEU→U5). In the end, a high Block RAM utilization by auxiliary emulation logic on U5 (U5 is the central communication node that dispatches control/debug messages to other FPGAs) restricts us to mapping EXE, with its lower Block RAM utilization, to U5 and selects the former mapping above (FE→U2, MEU→U3, OOO→U4, EXE→U5). As mentioned above, the OOO cluster is still too large for U4 (V5330). Here the resulting split occurs within the OOO at its sub-partition hierarchy. The OOO subclusters on U4 consist of the Reservation Station (RS), Register Alias Table (RAT), and Allocation (ALLOC), while the OOO's other subcluster, the ReOrder Buffer (ROB), resides on U1 [3].

The Auspy ACE partitioning software is used to restructure the top-level netlist using the given Nehalem netlist partitions. Because this methodology keeps the natural cluster-level partitioning, ACE's ability to automatically find good partitions is not used. The tool is still critically important, though, as it allows us to pull lower-hierarchy structures (e.g. the ROB) up to top-level entities. Without such a tool, this restructuring would be error-prone and tedious. In addition to using ACE to emit netlists for the multiple partitions, it can be used to route some internal signals to specific FPGAs. In particular, the memory controller signals are routed from the MEU (U3) to the DRAM interface on U1, and internal architectural state (instruction pointer and architectural registers) and memory access signals are routed to U5, where auxiliary emulation logic records cycle-by-cycle traces of these signals into the other onboard DRAM module. After partitioning, the MCEMU interconnect configuration tool (see Section 2.3) runs to multiplex the interconnecting logical signals over the available physical wires. The end result of the partitioning and interconnect generation is a synthesizable fabric with the connectivity matrix and TDM ratios shown in Table 2. Empty entries show where no physical direct connection exists, though a logical connection may still occur by using two or more adjoining physical direct connections. As shown in Table 2, the generated interconnect has a TDM critical path of 24 in the path from U2→U1 and the path U4→U5. These large TDM ratios are a direct result of the high number of logical signals passing between those FPGAs. The U4→U5 connection carries the signals from the OOO cluster to the EXE unit, which, as described above, are very tightly coupled clusters. Interestingly, the connection U2→U1 is actually not dominated by the FE unit talking to the DRAM interface or to the OOO cluster, but is instead heavily utilized by signals passing through U2 from the OOO sub-clusters.

From this TDM data, the potential emulation frequency can be calculated. The physical interconnect is able to run at a 10ns period. It therefore requires 240ns (24 TDM cycles) for all logical signals to complete one hop, and the emulation period could be 240ns. Because there are paths that need to cross two FPGA hops within a single emulation cycle, we need to run the 24 TDM cycles for these paths at least twice within every emulated cycle. This sets the maximum emulation period to 480ns. If we could guarantee that all signals crossing the interconnect fabric could only change at the emulation clock edge, then 480ns would be the final emulation period. This, however, is not the case, due to the design having registers clocked on clk2x and various level-sensitive latches. With this added clock and the existing latches, there is a possibility that logical signals crossing the interconnect need to change before any edge of clk2x. To allow for this possibility (and maintain logical equivalence to the unpartitioned design) we need to allow all logical signals to complete the two hops within each phase of clk2x. This means that the actual emulation period (clk1x) needs to be 4 x 480ns. This results in an emulation clock frequency of 520 KHz.
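Written out, the period derivation above is:

\[
T_{\text{hop}} = 24 \times 10\,\text{ns} = 240\,\text{ns}, \qquad
T_{\text{phase(clk2x)}} \ge 2\,T_{\text{hop}} = 480\,\text{ns},
\]
\[
T_{\text{clk1x}} = 4 \times 480\,\text{ns} = 1920\,\text{ns}
\quad\Longrightarrow\quad
f_{\text{emul}} = \frac{1}{1920\,\text{ns}} \approx 520\,\text{KHz}.
\]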
Interestingly, this partitioning step can be a one-time cost, barring any single FPGA running out of logic resources. Once the partition step is done, blackboxed Nehalem clusters can be synthesized to EDIF files and quickly linked into the bitstream generation step to create a new revision of the FPGA-synthesizable design. This ability to drop in newly synthesized clusters allows us to turn around a new design within the time it takes to synthesize and place-and-route a single FPGA (i.e. as opposed to synthesizing all the clusters, running a partition step across the entire circuit, and placing and routing the individual FPGAs).

We take the five resulting netlists (each netlist includes the emulated cluster wrapped with the generated auxiliary interconnect logic) and push them through the typical Xil-
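Each of these per-FPGA netlists has roughly the shape sketched below; this is a minimal illustration in which the module and port names (fpga_u5_top, tdm_endpoint, exe_cluster) and widths are our assumptions, not the actual generated wrapper.

  module fpga_u5_top (
    input  logic       phys_clk,          // fast interconnect clock
    input  logic       emul_clk,          // stepped emulation clock
    input  logic [9:0] phys_in_from_u4,   // physical board traces to the neighboring FPGA
    output logic [9:0] phys_out_to_u4
  );
    logic [239:0] rx_logical;   // demultiplexed inputs destined for this cluster
    logic [239:0] tx_logical;   // cluster outputs to be multiplexed back out

    // Generated auxiliary interconnect logic: TDM mux/demux and hop forwarding.
    tdm_endpoint #(.LOGICAL(240), .PHYSICAL(10)) u_link (
      .phys_clk   (phys_clk),
      .phys_in    (phys_in_from_u4),
      .phys_out   (phys_out_to_u4),
      .rx_logical (rx_logical),
      .tx_logical (tx_logical)
    );

    // Blackboxed Nehalem cluster: only the port list is visible here; the
    // separately synthesized EDIF netlist is dropped in at bitstream generation.
    exe_cluster u_exe (
      .clk     (emul_clk),
      .in_vec  (rx_logical),
      .out_vec (tx_logical)
    );
  endmodule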
Table 3: FPGA Utilization after Xilinx Map.
                     LUTs (%)   BLOCK RAM (%)
  U1-ROB                81           62
  U2-FE                 87           83
  U3-MEU                75           75
  U4-RAT/Alloc/RS       89            0
  U5-EXE                89           55
[3] B. Bentley. Simulation-driven Verification. Design Automation Summer School, 2009.
[4] J. Casazza. First the Tick, Now the Tock: Intel Microarchitecture (Nehalem). Intel Corporation, 2009.
[5] W.-J. Fang and A. C.-H. Wu. Multiway FPGA Partitioning by Fully Exploiting Design Hierarchy.