POWER9 Processor Architecture

Variants of Power9 silicon target the scale-out and scale-up markets. With a new core microarchitecture, Power9 serves the computing needs of the cognitive era and provides a platform for accelerated computing.
Workload-Optimized Design
The Power9 processor has been engineered to target four emerging domains:

- emerging analytics, AI, and cognitive computing;
- technical and high-performance computing (HPC);
- cloud and hyperscale datacenters; and
- enterprise computing.

These four domains have many distinct requirements and highly diverse workloads. The Power9 design architecture and family of chips were designed to address the requirements of all these categories of workloads. The basic Power9 building block is a modular design that can support multiple targeted implementations and create a family of processors with state-of-the-art accelerated computing capabilities. Table 1 summarizes the key requirements of emerging domains and select features of the Power9 processor suited for each domain.

Figure 1. The Power9 chip diagram shows the number of cores, L2 and L3 cache regions, and interconnects.
The front-end pipeline presents speculative in-order instructions to the mapping, sequencing, and dispatch units. It ensures orderly completion of the real execution path and manages the flow of instructions from the front end to the back end.

Advanced branch prediction improves the front-end efficiency and single-thread performance, resulting in a significant reduction in wasted instruction execution cycles due to branch mispredictions. The Power9 processor improves both direction prediction and target address prediction techniques to handle hard-to-predict branches.

MARCH/APRIL 2017 41
HOT CHIPS

Table 1. Key requirements of emerging domains and select Power9 features (surviving excerpt).
Cloud and hyperscale datacenters: power, packaging, and cost optimizations for a range of platforms; superior virtualization features of security, power management, quality of service, and interrupts; state-of-the-art I/O technology for network and storage performance.

42 IEEE MICRO

Glossary
Here, we list some terms relating to the IBM Power8 and Power9 processors.

Power8
AGEN: Address generation
B: Branch
CA: Cache access
DCD: Decode
DISP: Dispatch to issuing resource
ED: Early decode
F: Floating point
FMT: Data formatting
GF: Dispatch group determination
GXFR: Group transfer
IC: Instruction cache access
IF: Instruction fetch
IFB: Instruction fetch buffer
IXFR: Instruction transfer
LS: Load store
MAP: Register mapping
MRG: Microcode selection and merge
RES: Branch resolution
X: Fixed point
V: Vector/float pipe

Power9
ALU: Arithmetic logic unit
BRD: Address broadcast
D: Decode
PD: Predispatch
VS: Vector scalar
XFER: Transfer

The Power9 processor introduces new features to proactively avoid hazards in the load store unit (LSU) and improve the LSU's execution efficiency. Local pipe control of load and store operations enables better avoidance of hazards and reduces hazard disruption with local recycling of instructions. New lock management control improves the performance of both contested and uncontested locks.

Execution Slice Microarchitecture
The Power9 core microarchitecture offers improved efficiency and is better aligned with workload requirements. To support future computing needs and enable the core microarchitecture to scale, the revamped core uses a modular execution slice architecture.

Essential to the modular execution slice architecture are the symmetric data-type execution engines. A symmetric data-type execution engine, called a slice, acts as a 64-bit computational building block and is coupled with a 64-bit load-store building block. Each computational slice supports a range of data types, including fixed-point, floating-point, and 128-bit single-instruction, multiple-data (SIMD) execution. This architecture enables the seamless interchange of different data types between operations. It also enables higher performance for a diverse range of workloads by providing shared computational resources and a shared datapath for all data types. This increases pipeline utilization of the execution resources and enables efficient management of instruction flow and dataflow through the machine.

Two execution slices are combined to form a super-slice, which enables 128-bit computation for both fixed-point and floating-point operations. This design enables applications with scalar instructions to achieve greater performance, because each slice can handle various instruction types per scalar operation independently; one super-slice can handle either two scalar operations or one vector operation, providing robust performance for both scalar and vector codes. Two such super-slices combine to form a four-way simultaneous multithreading (SMT4) core. Four such super-slices form an SMT8 core.
Figure 4. Modular execution slice diagram showing the slice and super-slice view of SMT4 and SMT8 Power9 core configurations.
Figure 5. Power9 SMT4 core. The detailed core block diagram shows all the key components of the Power9 core.
The SMT8 core provides double the execution resources and computational capabilities of the SMT4 core.

The Power9 SMT4 core includes a 32-Kbyte, eight-way instruction cache. The instruction fetch unit can fetch up to eight instructions per cycle into the instruction buffer, with a highly optimized branch predictor to support speculative fetching of instructions. The enhanced instruction prefetcher fetches instruction lines speculatively to reduce instruction-cache miss occurrences. The decode unit can decode up to six instructions per cycle. The dispatch unit can dispatch up to six instruction operations per cycle to the back-end execution slices. The instructions are tracked using an instruction completion table that can track up to 256 operations out of order per SMT4 core. The Power9 core can issue a maximum of nine instruction operations to the back-end execution slices.

The four execution slices in an SMT4 core can each issue any 64-bit computational/load/store operation and an address generation (AGEN) operation. The computational operations can be either fixed point or floating point (single or double precision). When issuing 128-bit operations, two execution slices are used. The Power9 core provides a separate branch slice for handling branch instructions. The execution pipes are commonly called vector and scalar unit (VSU) pipes because there is no differentiation in the execution engine for handling different data-type operations. Therefore, a single Power9 SMT4 core can handle up to four scalar 64-bit operations, including loads and stores, or two vector 128-bit operations, along with four load and store AGEN operations and one branch instruction every cycle. This makes the Power9 execution unit capable of handling any type of mixed-operation requirements with high utilization of the execution engines.

Figure 6. SMT4 and SMT8 core architecture shows the differences in fetch width, issue width, and number of slices available in each configuration.

The VSU pipe of the Power9 SMT4 core can handle

- four arithmetic logic unit simple fixed-point operations,
- four floating-point or fixed-point multiply operations and complex 64-bit fixed-point operations,
- two 128-bit permute operations,
- two 128-bit quadword fixed-point operations,
- one 128-bit quadword floating-point operation,
- one decimal floating-point operation, and
- crypto operations.

The Power9 contains four double-precision floating-point units, one per slice. Each of these units is optimized for fully pipelined double-precision multiply-add functionality.
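The per-cycle issue capability quoted above can be expressed as a small feasibility check. This is a hypothetical sketch built only from the limits in the text, not a cycle-accurate model: four 64-bit slices, four AGEN ports, and one branch slice, with a 128-bit vector operation pairing two slices.

```python
# Hypothetical per-cycle issue check for one Power9 SMT4 core (illustration only).

SLICES = 4          # four 64-bit execution slices
AGEN_PORTS = 4      # four load/store address-generation operations per cycle
BRANCH_SLICES = 1   # one branch instruction per cycle

def issuable(scalar64=0, vector128=0, agen=0, branch=0):
    """True if the requested operation mix fits in one cycle's issue capability."""
    slices_used = scalar64 + 2 * vector128   # each vector op occupies two slices
    return (slices_used <= SLICES
            and agen <= AGEN_PORTS
            and branch <= BRANCH_SLICES)

print(issuable(scalar64=4, agen=4, branch=1))   # True: full scalar mix
print(issuable(vector128=2, agen=4, branch=1))  # True: full vector mix
print(issuable(scalar64=2, vector128=2))        # False: would need six slices
```

The last case illustrates the trade-off in the text: scalar and vector operations draw from the same four slices, so four scalar ops or two vector ops saturate the core in a given cycle.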
Figure 7. The SMT8 cache shows the 120-Mbyte L3 shared nonuniform cache architecture (NUCA) cache, high-throughput interconnects, and on-chip fabric.
In addition, each unit can perform the floating-point divide and square root instructions. The Power9 VSU implements the Vector Scalar Extension architecture, specifying two-way double-precision or four-way single-precision operations per cycle using one 128-bit super-slice.

LSU slices can handle up to four doubleword loads or stores. Each SMT4 core has a private 32-Kbyte, eight-way data cache that is accessed by the four LSU execution slices. The L2 and L3 cache regions are shared by two SMT4 cores.

SMT4 and SMT8 Core Architecture
As Figure 6 shows, the Power9 processor offers different core variants to address different market requirements. The SMT4 core has two 128-bit super-slices, and the SMT8 core has four 128-bit super-slices.

The instruction fetch, decode, and issue capabilities of the SMT8 core are twice those of the SMT4 core. This enables the core to scale enough to support eight hardware threads and maintain good throughput for all of them. The SMT8 core can also power-gate half of the core execution resources when only one thread is active per core.

The SMT8 core provides a large shared resource pool to individual partitions, providing for efficient large partition management and enabling seamless partition mobility from Power8 processor servers. The SMT4 core provides increased resource management granularity to the hypervisor for finer-grained control of core computational resources.

Data Capacity and Throughput
In this section, we discuss the Power9 cache and throughput.

Cache
The Power9 processor has a 512-Kbyte, private, eight-way set-associative L2 cache per SMT8 core. This cache also functions as a privately shared cache between two SMT4 cores. The Power9 processor has a total of 120 Mbytes of L3 NUCA cache.

As Figure 7 shows, the eDRAM-based L3 cache comprises 12 regions of 10 Mbytes each, one per SMT8 core. The L3 cache is 20-way associative, optimized for up to eight threads sharing the same cache region. Each of these L3 regions acts as a fast local cache for its SMT8 core and is accessed by other cores in the processor using the internal fabric interconnect. The 10-Mbyte cache region acts as a shared cache between two SMT4 cores. This cache topology lets each L3 cache congruence class on the chip support up to 240 ways concurrently.

The cache capacity supports massively parallel computation and also enables highly efficient heterogeneous interaction. The Power9 L3 cache implements new replacement policies utilizing reuse patterns and data-type awareness to improve its efficiency for data-intensive workloads.
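The L3 organization described above can be cross-checked with quick arithmetic. This is an illustrative calculation, not a cache model: 12 regions of 10 Mbytes give the 120-Mbyte total, and 12 regions of 20 ways give the 240-way chip-wide congruence class.

```python
# Cross-checking the quoted L3 NUCA figures (simple arithmetic, illustration only).

regions = 12             # one 10-Mbyte eDRAM L3 region per SMT8 core
region_mbytes = 10
ways_per_region = 20     # each region is 20-way set associative

total_l3 = regions * region_mbytes           # total L3 per chip, in Mbytes
ways_per_class = regions * ways_per_region   # ways per congruence class chip-wide

print(total_l3)        # 120
print(ways_per_class)  # 240
```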
Figure 8. Two variants of the Power9 core's memory design architecture: direct-attach memory and buffered memory. (a) Scale-out variant. (b) Scale-up variant.
The Power9 processor also implements an adaptive prefetch mechanism with coordination between the processor cores, caches, and memory controllers. This mechanism optimizes prefetch aggressiveness in cases in which the consumption or utilization of prefetched data is low or the available memory bandwidth is limited.

The Power9 core and chip architecture enable large SMP scaling and heterogeneous computing using accelerators and attached devices. A key component of this design is a high-throughput on-chip interconnect fabric composed of separate command and data switching interconnects. The on-chip data switching infrastructure is built from a 2D topology of switch segments logically arranged in an 8 × 12 (96-element) structure. Each element can transfer 32 bytes at a 2.4-GHz clock rate, yielding 76.8 GBps per element. The 96 elements provide an aggregate 7.3 TBps for on-chip data switching. This enables each of the processor cores to move data in and out of the core at a rate of 256 GBps while simultaneously exchanging data with memory, attached devices, and accelerators.

Memory Subsystem
The Power9 system design supports both scale-out and scale-up domains. Although the Power9 core microarchitecture remains the same for both domains, the memory design architecture is tailored to two variants to suit these domains.

The first variant, shown in Figure 8a, is targeted toward the scale-out domain. It has direct-attach double data rate type four (DDR4) memory. Each DDR4 unit is self-contained and consists of four independent ports that connect to DIMM slots. This unit is replicated twice on the Power9 processor to provide a maximum of eight ports, supporting up to 120 GBps of sustained memory bandwidth with a maximum memory capacity of up to 4 Tbytes per socket. These low-latency access ports support 64-byte or 128-byte adaptive reads from memory.

The second variant, shown in Figure 8b, is targeted toward the scale-up domain. It has a buffered memory architecture. A socket supports eight such buffered memory channels, which can deliver a sustained bandwidth of 230 GBps, with a maximum memory capacity of up to 8 Tbytes per socket. This design variant also provides superior reliability, availability, and serviceability (RAS) capabilities with chip-kill and lane-sparing support. It is also compatible with Power8 system memory.2

Processor Family
The Power9 processor comes as a processor family that has four implementations to address different market segments. This is achieved by the Power9 processor's modular and scalable design. The execution slice microarchitecture forms the basic building block for all four targeted implementations.

The scale-out design comes in two variants. Both of these variants have similar SMP scaling support and a similar memory subsystem.
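The interconnect and memory figures in this section can be cross-checked with back-of-the-envelope arithmetic. This is an illustration only; the dictionary field names are ad hoc labels for the quoted numbers.

```python
# Back-of-the-envelope check of the quoted fabric and memory bandwidth figures.

elements = 8 * 12                  # 2D switch topology: 96 elements
bytes_per_transfer = 32            # bytes moved per element per cycle
clock_ghz = 2.4

per_element_gbps = bytes_per_transfer * clock_ghz      # 76.8 GBps per element
aggregate_tbps = elements * per_element_gbps / 1000.0  # ~7.37 TBps (quoted as 7.3)

# Sustained bandwidth and maximum capacity of the two memory variants.
scale_out = {"ports": 8, "gbps": 120, "max_tbytes": 4}    # direct-attach DDR4
scale_up = {"channels": 8, "gbps": 230, "max_tbytes": 8}  # buffered memory

print(per_element_gbps)                                # 76.8
print(round(scale_up["gbps"] / scale_out["gbps"], 2))  # 1.92
```

The buffered scale-up variant thus delivers roughly 1.9 times the sustained memory bandwidth and twice the capacity of the direct-attach scale-out variant.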
Figure 9. Power9 family of processors showing different targeted implementations for the two-socket and multisocket design.
The scale-out variant with the direct-attach DDR4 memory supports one or two sockets. Two different implementations of the scale-out domain target two different ecosystems. The 24-SMT4-core implementation is targeted toward Linux ecosystem needs, and the 12-SMT8-core implementation is targeted toward a PowerVM-based ecosystem.

The scale-up design also comes in both SMT4 and SMT8 variants. These variants are optimized to support larger SMP connectivity through 96 lanes of 25G Link, along with eight buffered memory channels on each socket to support a large memory bandwidth.

Figure 9 shows all four targeted implementations of the Power9 family of processors.

Power ISA Support for Emerging Markets
The Power9 design philosophy addresses the needs of various emerging market domains. The Power instruction set architecture (ISA) version 3.0, which Power9 implements, also supports various emerging workload segments. Power9 has focused ISA support for the cognitive computing, cloud, HPC, and enterprise solutions market domains. The key Power ISA 3.0 enhancements fall into the following categories:3

- Broader data-type support. The Power9 processor provides native 128-bit quadword-precision floating-point arithmetic that is IEEE compliant.
- Support for emerging algorithms. The Power9 processor implements a single-instruction random number generator that is certified by the National Institute of Standards and Technology. Atomic memory operations are supported for near-memory computation and include logical, arithmetic, max, min, and compare operations. Atomic operations are issued by a processor core thread but are executed at the memory controller, enabling optimization of high-scale writing in data-centric applications.
- Cloud optimization. To optimize for cloud environments, the Power9 processor has an interrupt architecture that automates interrupt routing to partitions to boost the performance of virtualization.
- Enhanced accelerator virtualization. Both on-chip and off-chip accelerators can be addressed by user programs with virtual memory addresses. This reduces overhead and latency for communicating with the accelerators. On-chip accelerators have been expanded to include two 842 and one Gzip compression accelerator, as well as two AES/SHA accelerators.
- Energy and frequency management. Power9 power management supports the concept of workload-optimized frequency (WOF). Using this mechanism, the chip's performance can be pushed to the socket's thermal and current limits.

Figure 10. Relative performance over a Power8 baseline for commercial, integer, floating-point, scripting, graph analytics, and business intelligence workloads.
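The near-memory atomic operations described above can be sketched functionally. This is a hypothetical Python model, with illustrative class and method names, whose only point is the semantics: the read-modify-write is applied at the memory controller rather than pulling the line into a core cache.

```python
# Hypothetical sketch of Power ISA 3.0-style near-memory atomics
# (illustration only; names and the word-size model are assumptions).

class MemoryController:
    """Models a controller that executes atomics shipped from a core thread."""

    def __init__(self):
        self.mem = {}   # address -> value, default 0

    def atomic(self, addr, op, operand):
        """Apply a read-modify-write at the controller; return the old value."""
        old = self.mem.get(addr, 0)
        results = {
            "add": old + operand,          # arithmetic
            "and": old & operand,          # logical
            "or":  old | operand,
            "max": max(old, operand),      # max/min
            "min": min(old, operand),
        }
        self.mem[addr] = results[op]
        return old

mc = MemoryController()
mc.atomic(0x1000, "add", 5)
mc.atomic(0x1000, "max", 3)
print(mc.mem[0x1000])   # 5: add 5, then max(5, 3) leaves 5
```

Because the update happens where the data lives, many threads can hammer the same counter without bouncing a cache line between cores, which is the "high-scale writing" optimization the text describes.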
Figure 11. Power9 processor integration bandwidth with high-performing GPUs or accelerators (PCIe Gen3 x16, PCIe Gen4 x16, Power8 with NVLink 1.0, and Power9 with 25G link attach).
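The attach-bandwidth ratios in Figure 11 can be tabulated relative to a PCIe Gen3 x16 baseline. The ~16-GBps absolute baseline and the choice of 9x as a representative point in the quoted 7-10x range are assumptions for illustration only.

```python
# Relative accelerator-attach bandwidths, normalized to PCIe Gen3 x16.
# The 16-GBps baseline is an assumed figure for scale (illustration only).

GEN3_X16_GBPS = 16

relative = {
    "PCIe Gen3 x16": 1,         # baseline
    "PCIe Gen4 x16": 2,         # twice the Gen3 bandwidth
    "NVLink 1.0 (Power8)": 5,   # five times more bandwidth
    "25G link (Power9)": 9,     # quoted as 7-10x; midpoint shown
}

for link, factor in relative.items():
    print(f"{link}: ~{factor * GEN3_X16_GBPS} GBps")
```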
… and the host CPU. The Power9 processor supports up to 48 lanes of 25G links for accelerator attach.

Figure 11 shows the performance capability of the Power9 processor to integrate high-performing GPUs or accelerators. Traditionally, GPUs are attached to PCIe slots. In comparison with a PCIe Gen3-based GPU attach, the PCIe Gen4-based GPU attach provides two times the bandwidth. With the support of NVLink 1.0 in Power8 with NVLink,7 the capacity has grown to five times more bandwidth. The 25G Link on Power9 provides 7 to 10 times more bandwidth in comparison with the PCIe Gen3-based design, which attaches GPUs and other accelerators.

The Power9 processor supports seamless CPU-to-accelerator interactions with its ability to provide coherent memory sharing. This significantly reduces software and hardware overhead due to data interactions between the CPU and accelerators (including GPUs). Accelerator-attached devices are also supported by enhanced virtual address translation capabilities on the Power9 die, further reducing CPU and accelerator interaction latencies.

Both NVLink 2.0 and the OpenCAPI interface provide an efficient programming model with much less programming complexity to accelerate complex analytics and cognitive-based applications. The combination of seamless data sharing, low latency, and high-bandwidth communication between the CPU and an accelerator enables application of heterogeneous computing to a new class of applications.

The IBM Power9 processor architecture is designed to suit a wide range of platform optimizations with a family of processors designed for both scale-out and scale-up applications. Power9 introduces a new core microarchitecture and an enhanced cache and chip architecture to support high bandwidth, computational scale, and data capacity. Architectural innovations provide for enhanced virtualization capabilities targeting key market segments, including acceleration and the cloud. A state-of-the-art I/O subsystem, engineered to be open, enables the Power9 processor to support a wide range of externally attached devices with high bandwidth, low latency, and tight coupling for the next generation of heterogeneous and accelerated computing applications.

References
1. B. Sinharoy et al., "IBM Power8 Processor Core Microarchitecture," IBM J. Research and Development, vol. 59, no. 1, 2015, pp. 2:1-2:21.
2. W.J. Starke et al., "The Cache and Memory Subsystems of the IBM Power8 Processor," IBM J. Research and Development, vol. 59, no. 1, 2015, pp. 3:1-3:13.
3. "IBM Power ISA Version 3.0—OpenPower," May 2016; http://openpowerfoundation.org/?resource_lib=power-isa-version-3-0.
4. "PCIe Gen4 Specification," 2016; http://pcisig.com.
5. "OpenCAPI," 2017; http://opencapi.org/technical.
6. J. Stuecheli et al., "CAPI: A Coherent Accelerator Processor Interface," IBM J. Research and Development, vol. 59, no. 1, 2015, pp. 7:1-7:7.
7. S. Gupta, "What Is NVLink? And How Will It Make the World's Fastest Computers Possible?" blog, 14 Nov. 2014; http://blogs.nvidia.com/blog/2014/11/14/what-is-nvlink.

Satish Kumar Sadasivam is a senior engineer at IBM. His work involves next-generation Power microarchitecture concept design and performance evaluation. Sadasivam received an MS in computer science from the Madras Institute of Technology. He is an IBM master inventor. Contact him at [email protected].

Brian W. Thompto is a senior technical staff member for IBM's Power Systems processor team and a lead architect for Power9 and future Power processors. Thompto received a BS in electrical engineering and computer science from the University of Wisconsin. He is an IBM master inventor. Contact him at [email protected].

Ron Kalla is the chief engineer for IBM Power9. His work has included processors for IBM S/370, M68000, iSeries, and pSeries machines, as well as post-silicon hardware bring-up and verification. He is an IBM master inventor. Contact him at [email protected].

William J. Starke is an IBM Distinguished Engineer and chief architect for the Power processor storage hierarchy. He is responsible for shaping the processor cache hierarchy, symmetric multiprocessor (SMP) interconnect, cache coherence, memory and I/O controllers, accelerators, and logical system structures for Power systems. Starke received a BS in computer science from Michigan Technological University. Contact him at [email protected].