
IBM POWER9 Processor Architecture

The IBM Power9 processor has an enhanced core and chip architecture optimized for emerging workloads with superior thread performance and higher throughput to support next-generation computing. Multiple variants of silicon target the scale-out and scale-up markets. With a new core microarchitecture design, along with an innovative I/O fabric to support accelerated computing requirements, the Power9 processor meets the diverse computing needs of the cognitive era and provides a platform for accelerated computing.

Satish Kumar Sadasivam, Brian W. Thompto, Ron Kalla, William J. Starke
IBM

The Power9 processor is fabricated in 14-nm fin field-effect transistor (FinFET) process technology and contains 8 billion transistors. It uses a 17-metal-layer stack technology and continues to exploit embedded DRAM (eDRAM) for its L3 cache design. The Power9 processor (see Figure 1) can have up to 24 cores on a single chip, with each core supporting up to four hardware threads. The Power9 processor also comes in a 12-core variant, which supports up to eight hardware threads, built on the same basic building block of the Power9 core microarchitecture.

The Power9 core delivers strong thread performance without compromising high socket throughput. This is achieved by an efficient pipeline with a new core microarchitecture and a highly efficient memory subsystem, which comes in two variants to cater to the needs of the scale-out and scale-up domains. Scale-out systems support up to two nodes and are suitable for distributed cloud computing environments. The scale-up systems support large-scale symmetric multiprocessor (SMP) configurations with huge memory capacity and bandwidth for enterprise requirements.

The cores of the Power9 processor are supported and fed by a 120-Mbyte L3 cache based on nonuniform cache architecture (NUCA). The L3 cache is designed as 12 regions of a 20-way associative cache with advanced replacement policies. The L3 cache system is, in turn, fed by an on-chip fabric that delivers up to 7 Tbytes per second (TBps) of on-chip bandwidth.

The Power9 chip includes an I/O subsystem with 48 PCI Express (PCIe) Gen4 lanes, enabling heterogeneous computing. Two interfaces of high-bandwidth signaling technology enable large SMP and accelerator computation, as shown in Figure 2. A 16-GBps interface provides support for connecting neighboring socket(s) of an SMP system. A 25-GBps interface supports a wide range of external attach capabilities for the design of heterogeneous computing systems, as well as providing support for SMP connections in SMP systems.

Published by the IEEE Computer Society  0272-1732/17/$33.00 © 2017 IEEE
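The L3 geometry quoted above can be cross-checked with a few lines of arithmetic. The values below are simply the figures stated in the text (12 regions of 10 Mbytes, 20-way associativity); the chip-wide ways-per-congruence-class number follows from them:

```python
# Sanity-check the stated Power9 L3 cache geometry (values from the text).
regions = 12            # one 10-Mbyte NUCA region per SMT8 core
region_mbytes = 10
assoc_ways = 20         # each region is 20-way set associative

total_l3_mbytes = regions * region_mbytes
chipwide_ways = regions * assoc_ways   # ways per congruence class, chip-wide

assert total_l3_mbytes == 120   # the 120-Mbyte L3 quoted in the article
assert chipwide_ways == 240     # matches the 240-way figure quoted later
```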
Key features, targeted at the cloud and virtualization domains, include a new interrupt architecture, quality-of-service assists, hardware-enforced trusted execution environments, and workload-optimized operating frequency.

Figure 1. The Power9 chip diagram shows the number of cores, L2 and L3 cache regions, and interconnects.

Workload-Optimized Design
The Power9 processor has been engineered to target four emerging domains:

- emerging analytics, AI, and cognitive computing;
- technical and high-performance computing (HPC);
- cloud and hyperscale datacenters; and
- enterprise computing.

These four domains have many distinct requirements and highly diverse workloads. The Power9 design architecture and family of chips were designed to address the requirements of all these categories of workloads. The basic Power9 building block is a modular design that can support multiple targeted implementations and create a family of processors with state-of-the-art accelerated computing capabilities.

Table 1 summarizes the key requirements of emerging domains and select features of the Power9 processor suited for each domain.
Figure 2. Power9 processor integration with external accelerating devices supporting PCIe Gen4, CAPI 2.0, NVLink 2.0, and OpenCAPI.

Core Pipeline
The microprocessor's pipeline structure is subdivided into a front-end pipeline and several different execution unit pipelines. The front-end pipeline presents speculative in-order instructions to the mapping, sequencing, and dispatch units. It ensures orderly completion of the real execution path, discarding any other potential speculative results associated with mispredicted paths. The execution unit pipelines allow out-of-order issuing of both speculative and nonspeculative operations. The execution unit pipelines are composed of execution slices and progress independently from the front-end pipeline and from one another.

As Figure 3 shows, the Power9 core microarchitecture has a reduced pipeline length.
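The front-end/back-end relationship described above (in-order presentation, out-of-order issue, ordered completion) can be illustrated with a toy scheduler. This is a conceptual sketch only; the operation names and dependence lists are invented for illustration and are not Power9's actual issue logic:

```python
# Toy model: ops enter in program order; an op may issue once all of its
# dependencies have completed; results become visible one "cycle" later.
pending = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"]}  # op -> deps
completed, issue_order = set(), []

while pending:
    ready = [op for op, deps in pending.items()
             if all(d in completed for d in deps)]
    for op in ready:
        del pending[op]
        issue_order.append(op)
    completed.update(ready)  # results become visible next cycle

# "c" issues ahead of the older "b", which is still waiting on "a":
assert issue_order == ["a", "c", "b", "d"]
```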
MARCH/APRIL 2017
HOT CHIPS

Table 1. Key requirements of emerging domains.

Emerging analytics, AI, and cognitive computing:
- New core for stronger thread performance
- Delivers twice the computational resources per socket
- Built for acceleration and OpenPower-solution enablement

Technical and high-performance computing:
- High-bandwidth interface for GPU attach
- Advanced GPU/CPU interaction and memory sharing
- High-bandwidth direct-attach memory

Cloud and hyperscale datacenters:
- Power, packaging, and cost optimizations for a range of platforms
- Superior virtualization features of security, power management, quality of service, and interrupt
- State-of-the-art I/O technology for network and storage performance

Enterprise computing:
- Large, flat, scale-up systems
- Buffered memory for maximum capacity
- Leading reliability, availability, and serviceability (RAS)
- Improved caching
The latency from fetch to computation is reduced by five cycles compared with the IBM Power8 design.1 The latency from fetch to retirement is also reduced, for example, by eight cycles for floating-point operations. (Descriptions of the various pipeline stages are provided in the "Glossary" sidebar.) These improvements are enabled by microarchitectural changes while continuing to support similar cycle-time design constraints compared with IBM Power8. Central to these improvements is the adoption of an execution slice microarchitecture. Power9 also removes the instruction grouping technique, one of the basic building blocks of earlier-generation IBM Power cores, from the front end of the core pipeline, enabling individual instruction allocation and retirement. Support for more robust operations within the sequencing and execution pipelines improves overall instruction efficiency and results in a reduction in dynamic instruction cracking at decode.

Figure 3. Pipeline diagram comparing Power8 and Power9 processor pipeline stages, also highlighting the reduction in the number of stages in the Power9 core.

Variants of the core support completion of up to 128 (SMT4 core) or 256 (SMT8 core) instructions in every cycle. As a result, the core can free up the out-of-order resources quickly, which in turn supports a faster
IEEE MICRO
flow of instructions from the front end to the back end.

Advanced branch prediction improves the front-end efficiency and single-thread performance, resulting in a significant reduction in wasted instruction execution cycles due to branch mispredictions. The Power9 processor improves both direction prediction and target address prediction techniques to handle hard-to-predict branches.

The Power9 processor introduces new features to proactively avoid hazards in the load store unit (LSU) and improve the LSU's execution efficiency. Local pipe control of load and store operations enables better avoidance of hazards and reduces hazard disruption with local recycling of instructions. New lock management control improves the performance of both contested and uncontested locks.

Glossary
Here, we list some terms relating to the IBM Power8 and Power9 processors.

Power8: AGEN (address generation); B (branch); CA (cache access); DCD (decode); DISP (dispatch to issuing resource); ED (early decode); F (floating point); FMT (data formatting); GF (dispatch group determination); GXFR (group transfer); IC (instruction cache access); IF (instruction fetch); IFB (instruction fetch buffer); IXFR (instruction transfer); LS (load store); MAP (register mapping); MRG (microcode selection and merge); RES (branch resolution); X (fixed point); V (vector/float pipe).

Power9: ALU (arithmetic logic unit); BRD (address broadcast); D (decode); PD (predispatch); VS (vector scalar); XFER (transfer).

Execution Slice Microarchitecture
The Power9 core microarchitecture offers improved efficiency and is better aligned with workload requirements. To support future computing needs and enable the core microarchitecture to scale, the revamped core microarchitecture uses a modular execution slice architecture.

Essential to the modular execution slice architecture are the symmetric data-type execution engines. A symmetric data-type execution engine, called a slice, acts as a 64-bit computational building block and is coupled with a 64-bit load-store building block. Each computational slice supports a range of data types, including fixed-point, floating-point, and 128-bit single-instruction, multiple-data (SIMD) execution. This architecture enables the seamless interchange of different data types between operations. It also enables higher performance for a diverse range of workloads by providing shared computational resources and a shared datapath for all data types. This enables the core to have increased pipeline utilization of the execution resources and efficient management of instruction and dataflow through the machine.

Two execution slices are combined to form a super-slice, which enables 128-bit computation for both fixed-point and floating-point computations. This design enables applications with scalar instructions to achieve greater performance, because each slice can handle various instruction types per scalar operation independently; one super-slice can handle either two scalar or one vector operation, providing robust performance for both scalar and vector codes. Two such super-slices combine to form a four-way simultaneous multithreading (SMT4) core. Four such super-slices form an SMT8 core.
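The slice, super-slice, and core composition just described can be summarized numerically. This is a reading aid with invented variable names, not a hardware model:

```python
# Power9 slice hierarchy as described in the text.
SLICE_BITS = 64                     # one slice: 64-bit compute + 64-bit LSU
SLICES_PER_SUPER = 2                # two slices form a 128-bit super-slice

super_slice_bits = SLICES_PER_SUPER * SLICE_BITS
assert super_slice_bits == 128      # 128-bit fixed- and floating-point math

smt4_slices = 2 * SLICES_PER_SUPER  # SMT4 core: two super-slices
smt8_slices = 4 * SLICES_PER_SUPER  # SMT8 core: four super-slices
assert (smt4_slices, smt8_slices) == (4, 8)
```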
HOT CHIPS

Figure 4. Modular execution slice diagram showing the slice and super-slice view of SMT4 and SMT8 Power9 core configurations. (The monolithic VSU/FXU/LSU organization of the Power8 SMT8 core is contrasted with the Power9 SMT8 core's four 128-bit super-slices and the Power9 SMT4 core's two.)

Figure 5. Power9 SMT4 core. The detailed core block diagram shows all the key components of the Power9 core.
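For readers without the image, Figure 5's block structure can be transcribed as a small data structure. The grouping below is reconstructed from the figure's labels and should be treated as approximate:

```python
# Approximate transcription of the Figure 5 block diagram (SMT4 core).
smt4_core = {
    "front_end": ["predecode", "L1 instruction cache", "instruction buffer",
                  "branch prediction", "decode/crack",
                  "dispatch: allocate/rename", "completion table"],
    "branch_slice": ["BRU"],
    "execution_slices": 4,            # slices 0-3, paired into super-slices
    "per_slice": ["ALU", "AGEN", "XS", "FP MUL", "ST-D", "L1 data cache"],
    "per_super_slice": ["PM", "QFX", "DIV"],   # shared 128-bit resources
    "core_shared": ["QP/DFU", "crypto", "LRQ", "SRQ"],
}
assert smt4_core["execution_slices"] == 4
```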

Figure 4 shows several modular execution slices, which are the basic building blocks of the Power9 core microarchitecture.

Core Computational Capabilities
Figure 5 shows a more detailed view of a Power9 SMT4 core, along with its core
computational capabilities. The SMT8 core provides double the execution resources of the SMT4 core.

The Power9 SMT4 core includes a 32-Kbyte, eight-way instruction cache. The instruction fetch unit can fetch up to eight instructions per cycle into the instruction buffer, with a highly optimized branch predictor to support speculative fetching of instructions. The enhanced instruction prefetcher fetches instruction lines speculatively to reduce instruction-cache miss occurrences. The decode unit can decode up to six instructions per cycle. The dispatch unit can dispatch up to six instruction operations per cycle to the back-end execution slices. The instructions are tracked using an instruction completion table that can track up to 256 operations out of order per SMT4 core. The Power9 core can issue a maximum of nine instruction operations per cycle to the back-end execution slices.

The four execution slices in an SMT4 core can each issue any 64-bit computational/load/store operation and an address generation (AGEN) operation. The computational operations can be either fixed point or floating point (single or double precision). When issuing 128-bit operations, two execution slices are used. The Power9 core provides a separate branch slice for handling branch instructions. The execution pipes are commonly called vector and scalar unit (VSU) pipes because there is no differentiation in the execution engine for handling different data-type operations. Therefore, a single Power9 SMT4 core can handle up to four scalar 64-bit operations, including loads and stores, or two vector 128-bit operations, along with four load and store AGEN operations and one branch instruction every cycle. This makes the Power9 execution unit capable of handling any type of mixed-operation requirements with high utilization of the execution engines.

The VSU pipe of the Power9 SMT4 core can handle

- four arithmetic logic unit simple fixed-point operations,
- four floating-point or fixed-point multiply operations and complex 64-bit fixed-point operations,
- two 128-bit permute operations,
- two 128-bit quadword fixed-point operations,
- one 128-bit quadword floating-point operation,
- one decimal floating-point operation, and
- crypto operations.

The Power9 contains four double-precision floating-point units, one per slice. Each of these units is optimized for fully pipelined double-precision multiply-add functionality.

Figure 6. SMT4 and SMT8 core architecture shows the differences in fetch width, issue width, and number of slices available in each configuration. (SMT4: 8-instruction fetch, 6-instruction decode/dispatch, four 64-bit VSU/LSU slices; SMT8: 16-instruction fetch, 12-instruction decode/dispatch, eight slices.)
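The per-cycle widths summarized in Figure 6 can be tabulated. The SMT4 numbers are those stated in the text; the SMT8 values follow from the stated doubling of resources (the SMT8 issue width is inferred from that doubling, not given explicitly):

```python
# Per-cycle pipeline widths of the two Power9 core variants (from the text).
smt4 = {"fetch": 8, "decode": 6, "dispatch": 6, "issue": 9, "slices": 4}

# The SMT8 core doubles the SMT4 core's resources.
smt8 = {key: 2 * width for key, width in smt4.items()}

assert smt8["fetch"] == 16      # matches the 16i fetch width in Figure 6
assert smt8["decode"] == 12     # matches the 12i decode width in Figure 6
assert smt8["slices"] == 8      # eight 64-bit VSU/LSU slices
```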
HOT CHIPS

Figure 7. The SMT8 cache shows the 120-Mbyte L3 shared nonuniform cache architecture (NUCA) cache, high-throughput interconnects, and on-chip fabric. (Twelve 10-Mbyte eDRAM L3 regions feed a 7-TBps on-chip fabric at 256 GBps each; the chip uses 17 layers of metal, and SMP, PCIe, CAPI, NVLink 2, OpenCAPI, and DDR interfaces connect memory, IBM and partner devices, and Nvidia GPUs.)
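The headline bandwidth numbers in Figure 7 and elsewhere in the article can be cross-checked from the parameters the article quotes: 96 fabric switch segments moving 32 bytes per 2.4-GHz clock, 48 PCIe Gen4 lanes at 16 GT/s, and 48 lanes of 25G Link. The arithmetic below is a sanity check using raw link rates (encoding overhead ignored), not an official specification:

```python
# On-chip fabric: 8 x 12 grid of switch segments, 32 bytes per 2.4-GHz clock.
elements = 8 * 12
per_element_gbps = 32 * 2.4                 # 76.8 GBps per segment
fabric_tbps = elements * per_element_gbps / 1000.0
assert abs(per_element_gbps - 76.8) < 1e-6
assert abs(fabric_tbps - 7.37) < 0.01       # quoted as 7.3 TBps aggregate

# PCIe Gen4: 16 GT/s per lane -> 2 GBps raw per lane per direction.
pcie_duplex_gbps = 48 * (16 / 8) * 2        # 48 lanes, both directions
assert pcie_duplex_gbps == 192.0            # matches the stated 192 GBps

# 25G Link: 25 Gbps per lane per direction over 48 lanes.
link25_bidir_gbps = 48 * (25 / 8) * 2
assert link25_bidir_gbps == 300.0           # matches the stated 300 GBps
```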

In addition, each unit can perform floating-point divide and square-root instructions. The Power9 VSU implements the Vector Scalar Extension architecture, specifying two-way double-precision or four-way single-precision operations per cycle using one 128-bit super-slice.

LSU slices can handle up to four doubleword loads or stores. Each SMT4 core has a private 32-Kbyte, eight-way data cache that is accessed by the four LSU execution slices. The L2 and L3 cache regions are shared by two SMT4 cores.

SMT4 and SMT8 Core Architecture
As Figure 6 shows, the Power9 processor offers different core variants to address different market requirements. The SMT4 core has two 128-bit super-slices, and the SMT8 core has four 128-bit super-slices.

The instruction fetch, decode, and issue capabilities of the SMT8 core are twice those of the SMT4 core. This enables the core to scale enough to support eight hardware threads and maintain good throughput for all of them. The SMT8 core can also power-gate half of the core execution resources when only one thread is active per core.

The SMT8 core provides a large shared resource pool to individual partitions, providing for efficient large-partition management and enabling seamless partition mobility from Power8 processor servers. The SMT4 core provides the hypervisor with increased resource-management granularity over core computational resources.

Data Capacity and Throughput
In this section, we discuss the Power9 cache and throughput.

Cache
The Power9 processor has a 512-Kbyte, private, eight-way set-associative L2 cache per SMT8 core. This cache also functions as a privately shared cache between two SMT4 cores. The Power9 processor has a total of 120 Mbytes of L3 NUCA cache.

As Figure 7 shows, the eDRAM-based L3 cache comprises 12 regions of 10 Mbytes each, one per SMT8 core. The L3 cache is 20-way associative, optimized for up to eight threads sharing the same cache region. Each L3 region acts as a fast local cache for its SMT8 core and is accessed by other cores in the processor using the internal fabric interconnect. The 10-Mbyte cache region acts as a shared cache between two SMT4 cores. This cache topology lets each L3 cache congruence class on the chip support up to 240 ways concurrently.

The cache capacity supports massively parallel computation and also enables highly efficient heterogeneous interaction. The Power9 L3 cache implements new replacement policies utilizing reuse patterns and data-type awareness to improve its efficiency for data-intensive workloads. The Power9 processor also implements an
Figure 8. Two variants of the Power9 core's memory design architecture: direct-attach memory and buffered memory. (a) Scale-out variant. (b) Scale-up variant.

adaptive prefetch mechanism with coordination between the processor cores, caches, and memory controllers. This mechanism optimizes prefetch aggressiveness in cases in which the consumption or utilization of prefetched data is low or the available memory bandwidth is limited.

The Power9 core and chip architecture enable large SMP scaling and heterogeneous computing using accelerators and attached devices. A key component of this design is a high-throughput on-chip interconnect fabric composed of separate command and data switching interconnects. The on-chip data switching infrastructure is built from a 2D topology of switch segments logically arranged in an 8 × 12 (96-element) structure. Each element can transfer 32 bytes at a 2.4-GHz clock rate, yielding 76.8 GBps per element. The 96 elements provide an aggregate 7.3 TBps for on-chip data switching. This enables each of the processor cores to move data in and out of the core at a rate of 256 GBps while simultaneously exchanging data with memory, attached devices, and accelerators.

Memory Subsystem
The Power9 system design supports both scale-out and scale-up domains. Although the Power9 core microarchitecture remains the same for both domains, the memory design architecture is tailored into two variants to suit these domains.

The first variant, shown in Figure 8a, is targeted toward the scale-out domain. It has direct-attach double data rate type four (DDR4) memory. Each DDR4 unit is self-contained and consists of four independent ports that connect to DIMM slots. This unit is replicated twice on the Power9 processor to provide a maximum of eight ports, supporting up to 120 GBps of sustained memory bandwidth with a maximum memory capacity of up to 4 Tbytes per socket. These low-latency access ports support 64-byte or 128-byte adaptive reads from the memory.

The second variant, shown in Figure 8b, is targeted toward the scale-up domain. It has a buffered memory architecture. A socket supports eight such buffered memory channels, which can deliver a sustained bandwidth of 230 GBps, with a maximum memory capacity of up to 8 Tbytes per socket. This design variant also provides superior reliability, availability, and serviceability (RAS) capabilities with chipkill and lane-sparing support. It is also compatible with Power8 system memory.2

Processor Family
The Power9 processor comes as a processor family that has four implementations to address different market segments. This is achieved by the Power9 processor's modular and scalable design. The execution slice microarchitecture forms the basic building block for all four targeted implementations.

The scale-out design comes in two variants. Both of these variants have similar SMP scaling support and a similar memory subsystem. The scale-out variant with the
HOT CHIPS

Figure 9. Power9 family of processors showing different targeted implementations for the two-socket and multisocket design. (Each design pairs a 24-SMT4-core chip, optimized for the Linux ecosystem, with a 12-SMT8-core chip for PowerVM ecosystem continuity. The scale-out, two-socket-optimized parts use direct DDR4 memory attach with up to eight ports in a commodity packaging form factor; the scale-up, multisocket-optimized parts use eight buffered (Centaur) memory channels and additional 25G-link lanes, 96 in total. All provide PCIe 4.0, CAPI 2.0, NVLink, OpenCAPI, and 16G SMP interfaces.)

direct-attach DDR4 memory supports one or two sockets. Two different implementations of the scale-out domain target two different ecosystems: the 24-SMT4-core implementation is targeted toward Linux ecosystem needs, and the 12-SMT8-core implementation is targeted toward a PowerVM-based ecosystem.

The scale-up design also comes in both SMT4 and SMT8 variants. These variants are optimized to support larger SMP connectivity through 96 lanes of 25G Link, along with eight buffered memory channels on each socket to support a large memory bandwidth.

Figure 9 shows all four targeted implementations of the Power9 family of processors.

Power ISA Support for Emerging Markets
The Power9 design philosophy addresses the needs of various emerging market domains. The Power instruction set architecture (ISA) version 3.0, which Power9 implements, also supports various emerging workload segments. Power9 has focused ISA support for the cognitive computing, cloud, HPC, and enterprise solutions market domains. The key Power ISA 3.0 enhancements fall into the following categories:3

- Broader data-type support. The Power9 processor provides native 128-bit quadword-precision floating-point arithmetic that is IEEE compliant.
- Support for emerging algorithms. The Power9 processor implements a single-instruction random number generator that is certified by the National Institute of Standards and Technology. Atomic memory operations are supported for near-memory computation and include logical, arithmetic, max, min, and compare operations. Atomic operations are issued by a processor core thread but are executed at the memory controller, enabling optimization of high-scale writing in data-centric applications.
- Cloud optimization. To optimize for cloud environments, the Power9 processor has an interrupt architecture that automates interrupt routing to partitions to boost the performance of virtualization.
- Enhanced accelerator virtualization. Both on-chip and off-chip accelerators
can be addressed by user programs with virtual memory addresses. This reduces overhead and latency for communicating with the accelerators. On-chip accelerators have been expanded to include two 842 compression accelerators and one Gzip compression accelerator, as well as two AES/SHA accelerators.
- Energy and frequency management. Power9 power management supports the concept of workload-optimized frequency (WOF). Using this mechanism, the chip's performance can be pushed to the socket's thermal and current limits to maximize performance. The power-management firmware components can interlock to exploit powered-off cores to increase the frequency of the operating cores. When enabled, the WOF mechanism can also exploit cores that are running but drawing less power due to lower performance states (P-states) or lighter workloads.

These key ISA enhancements enable the Power9 processor to deliver better performance for emerging workloads. In the rest of this article, we will discuss the performance of the Power9 processor compared with its previous-generation Power8 processor.

Figure 10. Power9 performance improvements. The figure shows improvements over the Power8 processor for a spectrum of workloads at constant frequency. (Relative gains over the Power8 baseline are shown for commercial, integer, floating-point, scripting, graph analytics, and business intelligence workloads.)

Performance
Figure 10 shows the socket-level performance improvements over the Power8 processor across a broad range of workloads. The improvements are shown for the scale-out configuration with similar bandwidth constraints and at constant frequency to compare against the previous-generation Power systems. Emerging workloads such as scripting languages, graph analytics, and business intelligence show close to twice the improvement over the Power8 processor. Integer and floating-point arithmetic performance improved by well above one and a half times. Commercial workloads, which primarily cover the enterprise domain, gained between one-and-a-half and two times compared to the Power8 processor.

Heterogeneous Computing Architecture
A primary objective of the Power9 processor is to provide seamless integration of the processor with external accelerators and configurable devices to make it a truly heterogeneous computing platform. This is accomplished by two key aspects of the design.

First, the Power9 processor provides PCIe Gen4 with 48-lane connectivity to support PCIe devices.4 This PCIe Gen4 connectivity provides 192-GBps duplex bandwidth. Over these high-bandwidth interfaces, Power9 supports IBM Coherent Accelerator Processor Interface (CAPI)5 2.0 connectivity to attach ASIC and field-programmable gate array (FPGA) devices on any of the Power9-based socket configurations. Power9 CAPI 2.0 connectivity provides four times more bandwidth than its Power8 predecessor does. This enables the Power9 processor to seamlessly integrate with any external accelerating devices, including ASIC and FPGA devices, with coherent memory support with the host memory using CAPI 2.0 (see Figure 2).

The next key design aspect is the 25G Link, which delivers up to 300 GBps of bidirectional bandwidth to connected accelerators over 48 lanes. This link provides support for Nvidia NVLink 2.0 connectivity and supports OpenCAPI,6 a low-latency and high-bandwidth coherently attached device interface with open standards. Nvidia's next-generation GPUs are designed to work with NVLink 2.0 to provide high performance along with seamless integration of the GPU
..............................................................................................................................................................................................
HOT CHIPS

[Figure 11 chart data: PCIe Gen3 x16 attach = 1x GPU bandwidth; PCIe Gen4 x16 = 2x; Power8 with NVLink 1.0 = 5x; Power9 with 25G link = 7-10x]
Figure 11. Power9 processor integration bandwidth with high-performing GPU or accelerators.
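The relative bandwidths in Figure 11 can be roughly reconstructed from the lane counts in the text; note that the PCIe Gen3 x16 baseline of about 15.75 GBps per direction is an assumption taken from the PCIe specification (8 GT/s with 128b/130b encoding), not from this article.

```python
# Rough reconstruction of Figure 11's relative GPU-attach bandwidths (per direction).
pcie_gen3_x16 = 15.75              # GBps baseline (1x); assumed from the PCIe Gen3 spec
pcie_gen4_x16 = 2 * pcie_gen3_x16  # Gen4 doubles the per-lane rate (2x)
link_25g = 48 * 25 / 8             # 48 lanes at 25 Gbps -> 150 GBps

for name, bw in [("PCIe Gen3 x16", pcie_gen3_x16),
                 ("PCIe Gen4 x16", pcie_gen4_x16),
                 ("Power9 25G Link", link_25g)]:
    print(f"{name}: {bw:.2f} GBps ({bw / pcie_gen3_x16:.1f}x)")
```

Under this baseline the 25G Link comes out at roughly 9.5 times PCIe Gen3, which lands inside the 7 to 10 times range quoted in the text.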

and the host CPU. The Power9 processor supports up to 48 lanes of 25G links for accelerator attach.

Figure 11 shows the capability of the Power9 processor to integrate high-performing GPUs or accelerators. Traditionally, GPUs are attached through PCIe slots. Compared with a PCIe Gen3-based GPU attach, the PCIe Gen4-based attach provides two times the bandwidth. With the support of NVLink 1.0 in the Power8 with NVLink processor,7 bandwidth grew to five times that baseline. The 25G Link on Power9 provides 7 to 10 times more bandwidth than the PCIe Gen3-based design used to attach GPUs and other accelerators.

The Power9 processor supports seamless CPU-to-accelerator interactions through its ability to provide coherent memory sharing. This significantly reduces the software and hardware overhead of data interactions between the CPU and accelerators (including GPUs). Accelerator-attached devices are also supported by enhanced virtual address translation capabilities on the Power9 die, further reducing CPU and accelerator interaction latencies.

Both NVLink 2.0 and the OpenCAPI interface provide an efficient programming model with much less programming complexity to accelerate complex analytics and cognitive applications. The combination of seamless data sharing, low latency, and high-bandwidth communication between the CPU and an accelerator enables the application of heterogeneous computing to a new class of applications.

The IBM Power9 processor architecture is designed to suit a wide range of platform optimizations with a family of processors designed for both scale-out and scale-up applications. Power9 introduces a new core microarchitecture and an enhanced cache and chip architecture to support high bandwidth, computational scale, and data capacity. Architectural innovations provide enhanced virtualization capabilities targeting key market segments, including acceleration and the cloud. A state-of-the-art I/O subsystem, engineered to be open, enables the Power9 processor to support a wide range of externally attached devices with high bandwidth, low latency, and tight coupling for the next generation of heterogeneous and accelerated computing applications. MICRO

References
1. B. Sinharoy et al., "IBM Power8 Processor Core Microarchitecture," IBM J. Research and Development, vol. 59, no. 1, 2015, pp. 2:1-2:21.
2. W.J. Starke et al., "The Cache and Memory Subsystems of the IBM Power8 Processor," IBM J. Research and Development, vol. 59, no. 1, 2015, pp. 3:1-3:13.
3. "IBM Power ISA Version 3.0—OpenPower," May 2016; http://openpowerfoundation.org/?resource_lib=power-isa-version-3-0.
4. "PCIe Gen4 Specification," 2016; http://pcisig.com.
5. "OpenCAPI," 2017; http://opencapi.org/technical.
6. J. Stuecheli et al., "CAPI: A Coherent Accelerator Processor Interface," IBM J. Research and Development, vol. 59, no. 1, 2015, pp. 7:1-7:7.
7. S. Gupta, "What Is NVLink? And How Will It Make the World's Fastest Computers
Possible?" blog, 14 Nov. 2014; http://blogs.nvidia.com/blog/2014/11/14/what-is-nvlink.

Satish Kumar Sadasivam is a senior engineer at IBM. His work involves next-generation Power microarchitecture concept design and performance evaluation. Sadasivam received an MS in computer science from the Madras Institute of Technology. He is an IBM master inventor. Contact him at [email protected].

Brian W. Thompto is a senior technical staff member for IBM's Power Systems processor team and a lead architect for Power9 and future Power processors. Thompto received a BS in electrical engineering and computer science from the University of Wisconsin. He is an IBM master inventor. Contact him at [email protected].

Ron Kalla is the chief engineer for IBM Power9. His work has included processors for IBM S/370, M68000, iSeries, and pSeries machines, as well as post-silicon hardware bring-up and verification. He is an IBM master inventor. Contact him at [email protected].

William J. Starke is an IBM Distinguished Engineer and chief architect for the Power processor storage hierarchy. He is responsible for shaping the processor cache hierarchy, symmetric multiprocessor (SMP) interconnect, cache coherence, memory and I/O controllers, accelerators, and logical system structures for Power systems. Starke received a BS in computer science from Michigan Technological University. Contact him at [email protected].
