Skylake Architecture

Dr.-Ing. Michael Klemm
Senior Application Engineer
Developer Relations Division
Intel Architecture, Graphics and Software
Notices and Disclaimers
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel
representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the
OEM or retailer. No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult
other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit
http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and
provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and
uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata
are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced
data are accurate.

Intel, the Intel logo, Intel Optane and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

* Other names and brands may be claimed as the property of others. © 2017 Intel Corporation.

Agenda

• Intel® Xeon® Scalable Processor Overview


• Skylake-SP CPU Architecture

Intel® Xeon® Scalable Processors

[Roadmap] 2016 → 2017 → 2018:

• Intel® Xeon® Processor E7 (Brickland Platform) – targeted at mission-critical applications that value a scale-up system with leadership memory capacity and advanced RAS (E7 v3, E7 v4)
• Intel® Xeon® Processor E5 (Grantley-EP Platform) – targeted at a wide variety of applications that value a balanced system with leadership performance/watt/$ (E5 v3, E5-2600 v4, E5-4600 v4 for 4S)
• Purley Platform (from 2017) – converged platform with the innovative Skylake-SP microarchitecture, spanning Skylake and then Cascade Lake, and introducing the Intel® Xeon® Platinum, Gold, Silver, and Bronze tiers
Intel® Xeon® Scalable Processor Feature Overview

[Platform diagram] Two Skylake-SP CPUs (Socket P), each with 3x16 PCIe* Gen3 lanes, DDR4-2666 memory, and an optional 1x100Gb Intel® Omni-Path (OPA) fabric, connected by 2 or 3 Intel® UPI links; one CPU attaches via DMI to the Lewisburg PCH (Intel® QAT, ME, IE, 4x10GbE NIC, USB3, SATA3, high-speed PCIe3 IO, BMC, TPM).

Feature – Details:
• Scalability – 2S, 4S, 8S, and >8S (with node controller support)
• CPU TDP – 70W–205W
• Chipset – Intel® C620 Series (code name Lewisburg)
• Networking – Intel® Omni-Path Fabric (integrated or discrete); 4x10GbE (integrated w/ chipset); 100G/40G/25G discrete options
• Compression and Crypto Acceleration – Intel® QuickAssist Technology supporting 100Gb/s comp/decomp/crypto and 100K RSA2K public key operations
• Storage – Integrated QuickData Technology, VMD, and NTB; Intel® Optane™ SSD, Intel® 3D-NAND NVMe & SATA SSD
• Security – CPU enhancements (MBE, PPK, MPX); Intel® Platform Trust Technology; Intel® Key Protection Technology
• Manageability – Innovation Engine (IE); Intel® Node Manager; Intel® Datacenter Manager

BMC: Baseboard Management Controller; PCH: Intel® Platform Controller Hub; IE: Innovation Engine; Intel® OPA: Intel® Omni-Path Architecture; Intel® QAT: Intel® QuickAssist Technology; ME: Manageability Engine; NIC: Network Interface Controller; VMD: Volume Management Device; NTB: Non-Transparent Bridge
Platform Topologies
[Topology diagrams] 2S configurations (2S-2UPI and 2S-3UPI shown), 4S configurations (4S-2UPI and 4S-3UPI shown), and an 8S configuration; each socket (SKL) provides 3x16 PCIe* lanes and an optional 1x100G Intel® OP Fabric, with Lewisburg PCHs (LBG) attached via DMI x4.

Intel® Xeon® Scalable Processors support configurations ranging from 2S-2UPI to 8S.
Intel® Xeon® Scalable Processor
Re-architected from the Ground Up
• Skylake core microarchitecture, with data center specific enhancements
• Intel® AVX-512 with 32 DP flops per core
• Data center optimized cache hierarchy – 1MB L2 per core, non-inclusive L3
• New mesh interconnect architecture
• Enhanced memory subsystem
• Modular IO with integrated devices
• New Intel® Ultra Path Interconnect (Intel® UPI)
• Intel® Speed Shift Technology
• Security & virtualization enhancements (MBE, PPK, MPX)
• Optional integrated Intel® Omni-Path Fabric (Intel® OPA)

Feature comparison – Intel® Xeon® Processor E5-2600 v4 vs. Intel® Xeon® Scalable Processor:
• Cores per socket: up to 22 vs. up to 28
• Threads per socket: up to 44 vs. up to 56
• Last-level cache (LLC): up to 55 MB vs. up to 38.5 MB (non-inclusive)
• QPI/UPI speed: 2x QPI channels @ 9.6 GT/s vs. up to 3x UPI @ 10.4 GT/s
• PCIe* lanes / controllers / speed: 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s) vs. 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
• Memory population: 4 channels of up to 3 RDIMMs, LRDIMMs, or 3DS LRDIMMs vs. 6 channels of up to 2 RDIMMs, LRDIMMs, or 3DS LRDIMMs
• Max memory speed: up to 2400 vs. up to 2666
• TDP: 55W–145W vs. 70W–205W

[Socket block diagram] Skylake-SP socket with 6 channels of DDR4, 2 or 3 UPI links, 48 lanes of PCIe* 3.0, DMI3, cores with shared L3, and an optional Omni-Path HFI.
Skylake Core Microarchitecture Enhancements

[Block diagram] Haswell/Broadwell microarchitecture: branch prediction unit, 32KB L1 I$, pre-decode, instruction queue, decoders and μop cache feeding the allocate/rename/retire stage; an out-of-order scheduler dispatching to ports 0/1/5/6 (ALU, shift, LEA, FMA, JMP) and ports 2/3/4/7 (load/STA, store data); 32KB L1 D$ with fill buffers and the L2 cache.

Broadwell uArch vs. Skylake uArch:
• Out-of-order window: 192 vs. 224
• In-flight loads + stores: 72 + 42 vs. 72 + 56
• Scheduler entries: 60 vs. 97
• Registers (integer + FP): 168 + 168 vs. 180 + 168
• Allocation queue: 56 vs. 64/thread
• L1D bandwidth (B/cycle, load + store): 64 + 32 vs. 128 + 64
• L2 unified TLB: 4K+2M: 1024 vs. 4K+2M: 1536 plus 1G: 16

• Larger and improved branch predictor, higher-throughput decoder, larger window to extract ILP
• Improved scheduler and execution engine, improved throughput and latency of divide/sqrt
• More load/store bandwidth, deeper load/store buffers, improved prefetcher
• Data center specific enhancements: Intel® AVX-512 with 2 FMAs per core, larger 1MB MLC

About 10% performance improvement per core on integer applications at the same frequency
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

• 512-bit wide vectors
• 32 operand registers
• 8 64-bit mask registers
• Embedded broadcast
• Embedded rounding

Peak FLOPs per cycle per core (Microarchitecture – Instruction Set – SP FLOPs/cycle – DP FLOPs/cycle):
• Skylake – Intel® AVX-512 & FMA – 64 – 32
• Haswell / Broadwell – Intel AVX2 & FMA – 32 – 16
• Sandy Bridge – Intel AVX (256b) – 16 – 8
• Nehalem – SSE (128b) – 8 – 4

Intel AVX-512 instruction types:
• AVX-512-F – AVX-512 Foundation instructions
• AVX-512-VL – Vector Length orthogonality: ability to operate on sub-512-bit vector sizes
• AVX-512-BW – 512-bit byte/word support
• AVX-512-DQ – Additional D/Q/SP/DP instructions (converts, transcendental support, etc.)
• AVX-512-CD – Conflict Detect: used in vectorizing loops with potential address conflicts
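The FLOPs-per-cycle figures above follow directly from the vector width and the dual FMA units described on the Skylake-SP core slide; as a worked check (not part of the original slide):

\[ \text{DP FLOPs/cycle} = 2\ \text{FMA units} \times \frac{512\ \text{bit}}{64\ \text{bit/element}} \times 2\ (\text{mul}+\text{add}) = 32, \qquad \text{SP FLOPs/cycle} = 2 \times 16 \times 2 = 64 \]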

Powerful instruction set for data-parallel computation
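As a concrete illustration (not from the slides), a minimal AVX-512 kernel written with compiler intrinsics; it assumes a compiler with AVX-512F support (e.g., -march=skylake-avx512 or -xCORE-AVX512), 64-byte-aligned arrays whose length is a multiple of 8, and an illustrative function name:

#include <immintrin.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i], 8 doubles per iteration using AVX-512F.
   Assumes n is a multiple of 8 and x, y are 64-byte aligned. */
void daxpy_avx512(size_t n, double a, const double *x, double *y)
{
    __m512d va = _mm512_set1_pd(a);             /* broadcast scalar a   */
    for (size_t i = 0; i < n; i += 8) {
        __m512d vx = _mm512_load_pd(&x[i]);     /* load 8 doubles       */
        __m512d vy = _mm512_load_pd(&y[i]);
        vy = _mm512_fmadd_pd(va, vx, vy);       /* fused multiply-add   */
        _mm512_store_pd(&y[i], vy);             /* store 8 results      */
    }
}

Each loop iteration issues one 512-bit FMA, so with both FMA units busy a Skylake-SP core can retire two such operations per cycle.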


Skylake-SP Core
Skylake-SP core builds on Skylake core with features architected for data center usage
• Intel® AVX-512 implemented with Port 0/1 fused to a single 512b execution unit

• Port 5 is extended to full 512b to add second FMA outside of Skylake core

• L1-D load and store bandwidth doubled to allow up to 2x64B load and 1x64B store

• Additional 768KB of L2 cache added outside of Skylake core

[Block diagram] Skylake core execution ports 0/1/5/6 (INT: ALU, shift, LEA, JMP; VEC: ALU, shift, shuffle, FMA, DIV), with the extended 512-bit AVX unit on port 5 and the extended L2 cache sitting outside the original Skylake core.

Skylake-SP Core: Optimized for Data Center Workloads
Frequency Behavior While Running Intel® AVX Code

• Cores running non-AVX, Intel® AVX2 light/heavy, and Intel® AVX-512 light/heavy code have different turbo frequency limits
• The frequency of each core is determined independently, based on workload demand

Code type → all-core frequency limit:
• SSE / non-AVX → Non-AVX All Core Turbo
• AVX2-Light (without FP & int-mul) → Non-AVX All Core Turbo
• AVX2-Heavy (FP & int-mul) → AVX2 All Core Turbo
• AVX512-Light (without FP & int-mul) → AVX2 All Core Turbo
• AVX512-Heavy (FP & int-mul) → AVX512 All Core Turbo

[Chart] Per-core frequency for a mixed workload, with some cores running non-AVX, some AVX2, and some AVX-512 code, shown against the Non-AVX, AVX2, and AVX512 turbo and base frequency levels.
Performance and Efficiency with Intel® AVX-512
[Charts] LINPACK performance on a Xeon Platinum 8180 system with different instruction-set builds:
• GFLOPs: SSE4.2 669, AVX 1178, AVX2 2034, AVX-512 3259
• System power: roughly constant at 760–791 W across all four runs
• Core frequency: drops from 3.1 GHz (SSE4.2) to 2.1 GHz (AVX-512)
• GFLOPs/Watt (normalized to SSE4.2): 1.00, 1.74, 2.92, 4.83
• GFLOPs/GHz (normalized to SSE4.2): 1.00, 1.95, 3.77, 7.19
Intel® AVX-512 delivers significant performance and efficiency gains
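As a rough sanity check of the 3259 GFLOPs result (an estimate, not from the slides; it assumes a two-socket Xeon Platinum 8180 system running at the 2.1 GHz AVX-512 all-core frequency shown above):

\[ P_{\text{peak}} = 2\ \text{sockets} \times 28\ \text{cores} \times 32\ \tfrac{\text{DP FLOPs}}{\text{cycle}} \times 2.1\ \text{GHz} \approx 3763\ \text{GFLOPs}, \qquad \frac{3259}{3763} \approx 87\%\ \text{of peak} \]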


Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC. Software and workloads used in performance
tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.

New Mesh Interconnect Architecture
[Die diagrams] Broadwell-EX 24-core die vs. Skylake-SP 28-core die. Broadwell-EX uses a ring interconnect: core tiles with 2.5MB LLC slices and CBo/SAD, two QPI links, PCI-E x16/x16/x8/x4 (ESI), R3QPI, R2PCI, UBox, PCU, IIO, CBDMA, and two Home Agents with DDR4 memory controllers. Skylake-SP uses a mesh: a grid of SKX core tiles, each with a CHA/SF/LLC slice, plus 2x UPI x20, 1x UPI x20 (on package), 3x PCIe* x16, DMI x4, CBDMA, and two 3-channel DDR4 memory controllers on opposite edges of the die.

CHA – Caching and Home Agent; SF – Snoop Filter; LLC – Last Level Cache; SKX Core – Skylake Server Core; UPI – Intel® UltraPath Interconnect

Mesh Improves Scalability with Higher Bandwidth and Reduced Latencies
High and Low Core Count Die Configurations
HCC (up to 18 cores) LCC (up to 10 Cores)
[Die diagrams] HCC and LCC die layouts. Both provide 2x UPI x20 @ 10.4 GT/s, three PCIe* x16 ports (each configurable as 1x16/2x8/4x4 @ 8 GT/s), x4 DMI, CBDMA, and two memory controllers with 3x DDR4-2667 channels each; the dies differ in the size of the mesh of core + CHA/SF/LLC tiles.

CHA – Caching and Home Agent; SF – Snoop Filter; LLC – Last Level Cache; Core – Skylake-SP Core; UPI – Intel® UltraPath Interconnect
Distributed Caching and Home Agent (CHA)
[Die diagram] Skylake-SP die with CHA/SF/LLC and core tiles, two 3-channel DDR4-2667 memory controllers, 2x UPI x20 @ 10.4 GT/s, 3x PCIe* x16, DMI x4, and CBDMA.

• Intel® UPI caching and home agents are distributed with each LLC bank
• The prior generation had a small number of QPI home agents
• Distributed CHA benefits:
  • Eliminates large tracker structures at the memory controllers, allowing more requests in flight and processing them concurrently
  • Reduces traffic on the mesh by eliminating home agent to LLC interaction
  • Reduces latency by launching snoops earlier and obviates the need for different snoop modes

Distributed CHA architecture sustains higher bandwidth and lowers latency
Re-Architected L2 & L3 Cache Hierarchy
[Diagram] Previous architectures: shared, inclusive L3 of 2.5MB/core above a small 256KB private L2 per core. Skylake-SP architecture: shared, non-inclusive L3 of 1.375MB/core above a 1MB private L2 per core.

• On-chip cache balance shifted from shared-distributed (prior architectures) to private-local (Skylake architecture):
  • Shared-distributed – the shared-distributed L3 is the primary cache
  • Private-local – the private L2 becomes the primary cache, with the shared L3 used as an overflow cache
• Shared L3 changed from inclusive to non-inclusive:
  • Inclusive (prior architectures) – L3 has copies of all lines in L2
  • Non-inclusive (Skylake architecture) – lines in L2 may not exist in L3

Skylake-SP cache hierarchy architected specifically for the data center use case
Inclusive vs Non-Inclusive L3
[Diagram] Inclusive L3 (prior architectures: 256KB L2, 2.5MB L3 per core) vs. non-inclusive L3 (Skylake-SP architecture: 1MB L2, 1.375MB L3 per core), with numbered data flows:

1. Memory reads fill directly into the L2, no longer into both the L2 and L3
2. When an L2 line needs to be removed, both modified and unmodified lines are written back to the L3
3. Data shared across cores is copied into the L3 for servicing future L2 misses

Cache hierarchy architected and optimized for data center use cases:
• Virtualized use cases get a larger private L2 cache, free from interference
• Multithreaded workloads can operate on larger data per thread (due to the increased L2 size) and reduce uncore activity
Cache Performance
[Chart] CPU cache latency in ns (lower is better), Broadwell-EP vs. Skylake-SP:
• L1 cache: 1.1 ns on both
• L2 cache: 3.3 ns vs. 3.9 ns – Skylake-SP L2 latency has increased by 2 cycles for a 4x larger L2
• L3 cache (avg): roughly 18–19.5 ns – Skylake-SP achieves good L3 cache latency even with the larger core count
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC, and platform with Intel® Xeon® E5-2699 v4, Turbo
enabled, without COD, 4x32GB DDR4-2400, RHEL 7.0. Cache latency measurements were done using Intel® Memory Latency Checker (MLC) tool.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific
computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you
in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information
visit http://www.intel.com/performance. Copyright © 2017, Intel Corporation.
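The latencies above were measured with the Intel® MLC tool; for a rough idea of how such measurements work, here is a minimal pointer-chasing sketch (not the MLC tool; the buffer size and iteration count are illustrative). Varying the buffer size from a few KB to tens of MB walks the latency curve through L1, L2, L3, and memory:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (1 << 20)          /* 1 MiB: should fit Skylake-SP's 1MB L2 */
#define NODES    (BUF_SIZE / sizeof(void *))
#define ITERS    (100 * 1000 * 1000UL)

int main(void)
{
    void **chain = malloc(BUF_SIZE);
    size_t *perm = malloc(NODES * sizeof(size_t));
    for (size_t i = 0; i < NODES; i++) perm[i] = i;
    /* Fisher-Yates shuffle so the chain defeats the hardware prefetchers. */
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    /* Link the buffer into one random cycle of pointers. */
    for (size_t i = 0; i < NODES; i++)
        chain[perm[i]] = &chain[perm[(i + 1) % NODES]];

    void **p = &chain[perm[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < ITERS; i++)
        p = (void **)*p;            /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%p avg load-to-use latency: %.2f ns\n", (void *)p, ns / ITERS);
    free(chain);
    free(perm);
    return 0;
}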

Memory Subsystem
[Die diagram] Skylake-SP die with two memory controllers (3x DDR4-2666 channels each), CHA/SF/LLC and core tiles, 2x UPI x20 @ 10.4 GT/s, 3x PCIe* x16 (1x16/2x8/4x4 @ 8 GT/s), DMI x4, and CBDMA.

• 2 memory controllers with 3 channels each – a total of 6 memory channels
• DDR4 up to 2666, 2 DIMMs per channel
• Support for RDIMM, LRDIMM, and 3DS-LRDIMM
• 1.5TB max memory capacity per socket (2 DPC with 128GB DIMMs)
• >60% increase in memory bandwidth per socket compared to the Intel® Xeon® processor E5 v4
• Supports XPT prefetch and D2C/D2K to reduce LLC miss latency
• Introduces a new memory device failure detection and recovery scheme with Adaptive Double Device Data Correction (ADDDC)

Significant memory bandwidth and capacity improvements
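A back-of-the-envelope check of the >60% bandwidth claim (not from the slides): peak DDR4 bandwidth is channels × transfer rate × 8 bytes per transfer, so

\[ B_{\text{Skylake-SP}} = 6 \times 2666\,\text{MT/s} \times 8\,\text{B} \approx 128\ \text{GB/s per socket}, \quad B_{\text{E5 v4}} = 4 \times 2400 \times 8 \approx 77\ \text{GB/s}, \quad \frac{128}{77} \approx 1.66 \]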
Sub-NUMA Cluster (SNC)
• The prior generation supported Cluster-On-Die (COD)
• SNC provides similar localization benefits as COD, without some of its downsides:
  • Only one UPI caching agent is required, even in 2-SNC mode
  • Latency for memory accesses in the remote cluster is smaller – no UPI flow
  • LLC capacity is utilized more efficiently in 2-cluster mode – no duplication of lines in the LLC

[Die diagram] Skylake-SP die partitioned into SNC Domain 0 and SNC Domain 1, each with its own memory controller and half of the CHA/SF/LLC and core tiles.
Sub-NUMA Clusters – 2 SNC Example
SNC partitions the LLC banks and associates them with a memory controller to localize LLC miss traffic

• LLC miss latency to the local cluster is smaller
• Mesh traffic is localized, reducing uncore power and sustaining higher bandwidth

[Diagrams] LLC miss flows without SNC, with a local SNC access, and with a remote SNC access, showing the numbered hops from the requesting core through the LLC bank to the memory controller.
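In 2-SNC mode each cluster appears to the operating system as a separate NUMA node, so standard NUMA APIs apply; a minimal sketch using libnuma (the node number, buffer size, and the mapping of SNC domains to node IDs are illustrative and platform-dependent):

#include <numa.h>      /* link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    /* With SNC2 enabled, a 2-socket system typically exposes 4 NUMA nodes.
       Pin this thread to node 0 (one SNC domain) and allocate memory there,
       so LLC misses stay within the local cluster. */
    int node = 0;                      /* illustrative SNC domain        */
    size_t size = 256UL << 20;         /* 256 MiB working set            */

    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return EXIT_FAILURE;
    }
    double *buf = numa_alloc_onnode(size, node);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", node);
        return EXIT_FAILURE;
    }

    memset(buf, 0, size);              /* touch pages so they are placed locally */
    printf("running on and allocating from NUMA node %d (%d nodes present)\n",
           node, numa_max_node() + 1);

    numa_free(buf, size);
    return EXIT_SUCCESS;
}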
Memory Performance
Bandwidth-Latency Profile

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1/SNC2, 6x32GB DDR4-2400/2666 per CPU, 1 DPC, and platform with E5-2699 v4,
Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark
and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to
http://www.intel.com/performance

Memory Performance
Core to Memory Latency
[Chart] Core-to-memory latency in ns (lower is better) for NUMA-local, min/max remote NUMA, and min/max UMA accesses, comparing Intel® Xeon® E5-2699 v4 (DDR4-2400, in Dir+OSB, Home Snoop, and COD modes) with Intel® Xeon® Platinum 8180 (DDR4-2666, with and without SNC2).

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, 6x32GB DDR4-2666, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-
2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. For more complete information visit http://www.intel.com/performance.

Intel® Ultra Path Interconnect (Intel® UPI)
• Intel® Ultra Path Interconnect (Intel® UPI), replacing Intel® QPI
• Faster link with improved bandwidth for a balanced system design
• Improved messaging efficiency per packet
• 3 UPI option for 2 socket – additional inter-socket bandwidth for non-NUMA optimized use-cases
[Charts]
• Data rate: QPI 9.6 GT/s → UPI 10.4 GT/s
• Data efficiency (per wire): 4% to 21% improvement
• Idle power: the L0p state reduces idle power substantially relative to L0 (values of roughly 50% and 75% shown for UPI and QPI)
Intel® UPI enables system scalability with higher inter-socket bandwidth


Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, 6x32GB DDR4-2666, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-
2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. For more complete information visit http://www.intel.com/performance.

Skylake-SP Architecture Summary
New Architectural Innovations for Data Center

• Up to 60% increase in compute density with Intel® AVX-512


• Improved performance and scalability with Mesh on-chip interconnect
• L2 and L3 cache hierarchy optimized for data center workloads
• Improved memory subsystem with up to 60% higher memory bandwidth
• Faster and more efficient Intel® UPI interconnect for improved scalability
• Improved integrated IO with up to 50% higher aggregate IO bandwidth
• Increased protection against kernel tampering and user data corruption
• Core, cache, memory and IO improvements for increased virtual machine performance
• Enhanced power management and RAS capability for improved utilization of resources

Skylake-SP Performance
[Chart] 2-socket Skylake-SP performance relative to Intel® Xeon® E5-2699 v4, for Xeon Platinum 8176 (28C, 165W) and Xeon Platinum 8180 (28C, 205W), on SPECint*_rate_base2006, SPECfp*_rate_base2006, STREAM-Triad, and LINPACK; relative gains range from 1.41x to 2.25x.

• Skylake-SP CPUs provide significant performance upside compared to the prior generation
• 165W Skylake-SP CPUs provide more than 40% gain in performance
• 205W Skylake-SP CPUs provide an additional boost to core-bound workloads

Source as of June 2017: Intel internal measurements with Xeon Platinum 8180 and 8176, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC, and platform with E5-
2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to
any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including
the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
