Skylake Architecture
Michael Klemm
Senior Application Engineer
Developer Relations Division
Intel Architecture, Graphics and Software
Notices and Disclaimers
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel
representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the
OEM or retailer. No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult
other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit
http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and
provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and
uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata
are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced
data are accurate.
Intel, the Intel logo, Intel Optane and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
* Other names and brands may be claimed as the property of others. © 2017 Intel Corporation.
Intel® Xeon® Scalable Processors
Platform Topologies
[Figure: platform topologies for 2S configurations (2S-2UPI & 2S-3UPI shown), 4S configurations, and an 8S configuration. Labels: SKL sockets, LBG, DMI x4, 3x16 PCIe*, 1x100G Intel® OP Fabric.]
Intel® Xeon® Scalable Processor
Re-architected from the Ground Up
• Skylake core microarchitecture, with data center specific enhancements
• Intel® AVX-512 with 32 DP flops per core
• Data center optimized cache hierarchy: 1MB L2 per core, non-inclusive L3
• New mesh interconnect architecture
• Enhanced memory subsystem
• Modular IO with integrated devices
• New Intel® Ultra Path Interconnect (Intel® UPI)
• Intel® Speed Shift Technology
• Security & Virtualization enhancements (MBE, PPK, MPX)
• Optional Integrated Intel® Omni-Path Fabric (Intel® OPA)
Features                 Intel® Xeon® Processor E5-2600 v4    Intel® Xeon® Scalable Processor
QPI/UPI Speed (GT/s)     2x QPI channels @ 9.6 GT/s           Up to 3x UPI @ 10.4 GT/s
[Other table rows not recoverable; surviving cell: 6 Channels DDR4]
Haswell/Broadwell Microarchitecture
Skylake Core Microarchitecture Enhancements
[Figure: Skylake core block diagram. Front end: 32KB L1 I$, pre-decode, instruction queue, decoders, branch prediction unit, μop cache, μop queue. In-order allocate/rename/retire with load, store, and reorder buffers. Out-of-order scheduler issuing to ports 0, 1, 5, 6 (ALU, shift, shuffle, INT DIV) and ports 4, 2, 3, 7 (Store Data, Load/STA, Load/STA, STA). Memory: 32KB L1 D$, fill buffers, 1MB L2$.]
                              Broadwell uArch    Skylake uArch
Out-of-order window           192                224
In-flight loads + stores      72 + 42            72 + 56
Scheduler entries             60                 97
Registers (integer + FP)      168 + 168          180 + 168
L2 unified TLB                4K+2M: 1024        4K+2M: 1536, 1G: 16
• Larger and improved branch predictor, higher throughput decoder, larger window to extract ILP
• Improved scheduler and execution engine, improved throughput and latency of divide/sqrt
• More load/store bandwidth, deeper load/store buffers, improved prefetcher
• Data center specific enhancements: Intel® AVX-512 with 2 FMAs per core, larger 1MB MLC
About 10% performance improvement per core on integer applications at the same frequency
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
[Table: Microarchitecture, Instruction Set, SP FLOPs / cycle, DP FLOPs / cycle; rows not recoverable]
• 512-bit wide vectors
• Port 5 is extended to full 512b to add second FMA outside of Skylake core
• L1-D load and store bandwidth doubled to allow up to 2x64B load and 1x64B store
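The per-core peak quoted in this deck (32 DP flops per core, with 2 FMAs per core) follows from the vector width. The lines below are a minimal worked calculation rather than anything from the original slides: the function name `peak_flops_per_cycle` is made up for illustration, and the sketch assumes each FMA lane counts as two floating-point operations (one multiply, one add) per cycle.

```python
def peak_flops_per_cycle(vector_bits: int, element_bits: int, fma_units: int) -> int:
    """Peak floating-point operations per core per cycle.

    Assumption: each FMA lane retires one multiply and one add (2 flops) per cycle.
    """
    lanes = vector_bits // element_bits      # elements per vector register
    return lanes * 2 * fma_units


# Values from the slides: 512-bit vectors, 2 FMA units per core.
dp_flops = peak_flops_per_cycle(512, 64, fma_units=2)   # double precision
sp_flops = peak_flops_per_cycle(512, 32, fma_units=2)   # single precision

# L1-D bandwidth per cycle, from "up to 2x64B load and 1x64B store".
l1_load_bytes_per_cycle = 2 * 64
l1_store_bytes_per_cycle = 1 * 64

print(f"DP flops/cycle/core: {dp_flops}")
print(f"SP flops/cycle/core: {sp_flops}")
print(f"L1-D bytes/cycle: {l1_load_bytes_per_cycle} load + {l1_store_bytes_per_cycle} store")
```

Multiplying the per-cycle figure by the core count and the sustained AVX-512 frequency gives a theoretical peak for a socket; measured results such as LINPACK land below that bound.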
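[Figure: Skylake core with FMA and ALU units on the vector (VEC) ports, JMP and MUL units, and the L2 cache.]
[Figure: core frequency ranges by code type. Non-AVX, AVX2, and AVX512 code each run between their own base frequency (Non-AVX_Base, AVX2_Base, AVX512_Base) and turbo limit (for example AVX512_Turbo), based on workload demand.]
Code Type                                   All Core Frequency Limit
SSE, AVX2-Light (without FP & int-mul)      Non-AVX All Core Turbo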
Performance and Efficiency with Intel® AVX-512
[Figure: LINPACK performance by instruction set. GFLOPs/Watt and GFLOPs/GHz are normalized to SSE4.2.]
Instruction set   GFLOPs   Power (W)   Frequency (GHz)   GFLOPs/Watt   GFLOPs/GHz
SSE4.2            669      760         3.1               1.00          1.00
AVX               1178     768         2.8               1.74          1.95
AVX2              2034     791         2.5               2.92          3.77
AVX512            3259     767         2.1               4.83          7.19
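The normalized efficiency values on this slide can be re-derived from the raw GFLOPs, power, and frequency readings. The snippet below is a small sketch of that arithmetic; the dictionary layout and the helper name `normalized` are illustrative choices, and the pairing of each reading with its instruction set is inferred from the chart labels.

```python
# (GFLOPs, system power in W, core frequency in GHz) per instruction set,
# as read from the LINPACK chart on this slide.
results = {
    "SSE4.2": (669, 760, 3.1),
    "AVX": (1178, 768, 2.8),
    "AVX2": (2034, 791, 2.5),
    "AVX512": (3259, 767, 2.1),
}


def normalized(metric):
    """Apply `metric` to every row and scale so that SSE4.2 equals 1.00."""
    base = metric(*results["SSE4.2"])
    return {isa: round(metric(*row) / base, 2) for isa, row in results.items()}


gflops_per_watt = normalized(lambda gflops, watts, ghz: gflops / watts)
gflops_per_ghz = normalized(lambda gflops, watts, ghz: gflops / ghz)

for isa in results:
    print(f"{isa:7s} GFLOPs/Watt {gflops_per_watt[isa]:.2f}  GFLOPs/GHz {gflops_per_ghz[isa]:.2f}")
```

The computed ratios match the 1.00, 1.74, 2.92, 4.83 and 1.00, 1.95, 3.77, 7.19 series shown in the two normalized charts: system power stays nearly flat while throughput grows, even though the core frequency drops from 3.1 to 2.1 GHz.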
New Mesh Interconnect Architecture
[Figure: die comparison. Broadwell EX 24-core die: ring interconnect linking cores, CBo/SAD agents with 2.5MB LLC slices, QPI links, PCIe (x16, x16, x8, x4 ESI), and two home agents with DDR memory controllers. Skylake-SP 28-core die: mesh of SKX Core tiles, each with CHA/SF/LLC; 2x UPI x20 plus 1x UPI x20, PCIe* x16 blocks, DMI x4, CBDMA, on-package PCIe x16, and two DDR4 memory controllers with 3x DDR4 2667 each. Smaller Skylake-SP mesh dies are also shown.]
CHA: Caching and Home Agent; SF: Snoop Filter; LLC: Last Level Cache; SKX Core: Skylake Server Core; UPI: Intel® UltraPath Interconnect
Distributed Caching and Home Agent (CHA)
• Intel® UPI caching and home agents are distributed with each LLC bank
• Prior generation had a small number of QPI home agents
• Distributed CHA benefits
• Eliminates large tracker structures at memory controllers, allowing more requests in flight
[Figure: Skylake-SP die with 2x UPI x20 @ 10.4GT/s, three 1x16/2x8/4x4 PCIe @ 8GT/s blocks, x4 DMI, CBDMA, CHA/SF/LLC per core, and two memory controllers with 3x DDR4 2667 each.]
[Figure: private L2 per core, 256KB in the prior generation and 1MB in Skylake-SP.]
Skylake-SP cache hierarchy architected specifically for Data center use case
Inclusive vs Non-Inclusive L3
[Figure: inclusive L3 (prior architectures, 256kB L2) next to non-inclusive L3 (Skylake-SP architecture, 1MB L2), with data flows numbered to match the list below.]
1. Memory reads fill directly to the L2, no longer to both the L2 and L3
2. When an L2 line needs to be removed, both modified and unmodified lines are written back
3. Data shared across cores are copied into the L3 for servicing future L2 misses
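The first two fill and eviction rules can be tried out in a toy model. The class below is only an illustrative sketch, not Intel's implementation: `ToyHierarchy`, its LRU replacement, the single core, and the tiny capacities are all assumptions made for the example, and rule 3 (sharing across cores) is not modelled.

```python
from collections import OrderedDict


class ToyHierarchy:
    """Single-core L2 + L3 model with LRU replacement (illustrative only)."""

    def __init__(self, l2_lines: int, l3_lines: int, inclusive: bool):
        self.l2 = OrderedDict()   # line number -> True, oldest entry first
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines
        self.inclusive = inclusive

    def _insert_l3(self, line: int) -> None:
        self.l3[line] = True
        self.l3.move_to_end(line)
        if len(self.l3) > self.l3_lines:
            victim, _ = self.l3.popitem(last=False)
            if self.inclusive:
                self.l2.pop(victim, None)     # inclusive L3 back-invalidates the L2

    def _insert_l2(self, line: int) -> None:
        self.l2[line] = True
        self.l2.move_to_end(line)
        if len(self.l2) > self.l2_lines:
            victim, _ = self.l2.popitem(last=False)
            if not self.inclusive:
                self._insert_l3(victim)       # rule 2: evicted L2 lines go to the L3

    def read(self, line: int) -> str:
        if line in self.l2:
            self.l2.move_to_end(line)
            return "L2 hit"
        if line in self.l3:
            self.l3.move_to_end(line)
            self._insert_l2(line)
            return "L3 hit"
        if self.inclusive:
            self._insert_l3(line)             # prior architectures: fill L3 and L2
        self._insert_l2(line)                 # rule 1: memory reads fill the L2
        return "memory"

    def cached_lines(self) -> set:
        return set(self.l2) | set(self.l3)


def stream(hierarchy: ToyHierarchy, count: int) -> None:
    """Read lines 0..count-1 once each."""
    for line in range(count):
        hierarchy.read(line)


inclusive = ToyHierarchy(l2_lines=4, l3_lines=8, inclusive=True)
non_inclusive = ToyHierarchy(l2_lines=4, l3_lines=8, inclusive=False)
stream(inclusive, 12)
stream(non_inclusive, 12)
print("distinct lines on chip, inclusive:", len(inclusive.cached_lines()))
print("distinct lines on chip, non-inclusive:", len(non_inclusive.cached_lines()))
```

In this toy run the inclusive hierarchy holds 8 distinct lines, because every L2 line is duplicated in the L3, while the non-inclusive one holds 12, since the L2 and L3 contents do not overlap. That is the capacity argument behind pairing a larger private L2 with a non-inclusive L3.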
Cache Performance
CPU cache latency in ns (lower is better):
                   Broadwell-EP    Skylake-SP
L1 cache           1.1             1.1
L2 cache           3.3             3.9
L3 cache (avg)     18              19.5
• Skylake-SP L2 cache latency has increased by 2 cycles for a 4x larger L2
• Skylake-SP achieves good L3 cache latency even with larger core count
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC, and platform with Intel® Xeon® E5-2699 v4, Turbo
enabled, without COD, 4x32GB DDR4-2400, RHEL 7.0. Cache latency measurements were done using Intel® Memory Latency Checker (MLC) tool.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific
computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you
in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information
visit http://www.intel.com/performance. Copyright © 2017, Intel Corporation.
Memory Subsystem
[Figure: Skylake-SP die with 2x UPI x20 @ 10.4GT/s, three 1x16/2x8/4x4 PCIe @ 8GT/s blocks, x4 DMI, CBDMA, CHA/SF/LLC per core, and two memory controllers with 3x DDR4-2666 each.]
• 2 Memory Controllers, 3 channels each, for a total of 6 memory channels
• DDR4 up to 2666, 2 DIMMs per channel
• Support for RDIMM, LRDIMM, and 3DS-LRDIMM
• 1.5TB Max Memory Capacity per Socket (2 DPC with 128GB DIMMs)
• >60% increase in Memory BW per Socket compared to Intel® Xeon® processor E5 v4
• Supports XPT prefetch and D2C/D2K to reduce LLC miss latency
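The capacity and bandwidth bullets on this slide can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a measurement: it assumes a 64-bit (8-byte) data bus per DDR4 channel, and it takes the Intel® Xeon® processor E5 v4 baseline as 4 channels of DDR4-2400, which is an assumption drawn from the test configurations listed elsewhere in this deck. The helper name `peak_bandwidth_gbs` is illustrative.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth per socket in GB/s (10^9 bytes/s)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


skylake_sp = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)
xeon_e5_v4 = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=2400)  # assumed baseline
increase = skylake_sp / xeon_e5_v4 - 1

# Capacity: 6 channels x 2 DIMMs per channel x 128GB DIMMs.
capacity_tb = 6 * 2 * 128 / 1024

print(f"Skylake-SP peak: {skylake_sp:.1f} GB/s")
print(f"E5 v4 peak:      {xeon_e5_v4:.1f} GB/s")
print(f"Increase:        {increase:.0%}")
print(f"Max capacity:    {capacity_tb} TB per socket")
```

Under these assumptions the theoretical peak rises from about 76.8 GB/s to about 128 GB/s per socket, which is consistent with the ">60%" bullet, and the capacity works out to the quoted 1.5TB.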
• SNC provides similar localization benefits as COD, without some of its downsides
• Latency for memory accesses in remote cluster is smaller, no UPI flow
[Figure: Skylake-SP die split into two clusters of CHA/SF/LLC and core tiles, each cluster next to its own memory controller with 3x DDR4 2667.]
Sub-NUMA Clusters – 2 SNC Example
SNC partitions the LLC banks and associates them with a memory controller to localize LLC miss traffic
[Figure: core tiles grouped into two sub-NUMA clusters.]
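The localization idea can be pictured with a toy address mapping. Everything in this sketch is hypothetical: the real address-to-bank hash is not described in this deck, so the contiguous split of memory between clusters, the modulo "hash", and the names `home_cluster` and `llc_bank` are assumptions chosen only to show that a line homed in one cluster is cached in an LLC bank of the same cluster.

```python
LINE_BYTES = 64  # cache line size


def home_cluster(addr: int, memory_bytes: int, clusters: int = 2) -> int:
    """Toy rule: memory is split into equal contiguous ranges, one per cluster."""
    if not 0 <= addr < memory_bytes:
        raise ValueError("address outside the modelled memory range")
    return addr * clusters // memory_bytes


def llc_bank(addr: int, memory_bytes: int, banks_per_cluster: int,
             clusters: int = 2) -> tuple:
    """Return (cluster, global bank index) for a physical address.

    The bank is chosen only among the banks of the address's home cluster,
    so an LLC miss is served by the memory controller of that same cluster.
    """
    cluster = home_cluster(addr, memory_bytes, clusters)
    line = addr // LINE_BYTES
    local_bank = line % banks_per_cluster          # hypothetical stand-in hash
    return cluster, cluster * banks_per_cluster + local_bank


MEMORY = 1 << 30            # 1 GiB of modelled memory
BANKS_PER_CLUSTER = 14      # 28 LLC banks split across 2 clusters

for addr in (0, 64, MEMORY // 2, MEMORY - 64):
    cluster, bank = llc_bank(addr, MEMORY, BANKS_PER_CLUSTER)
    print(f"address {addr:#011x} -> cluster {cluster}, LLC bank {bank}")
```

With this mapping, software that allocates memory in its own cluster (for example through NUMA-aware allocation) keeps both its LLC lookups and its LLC misses inside that cluster.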
Memory Performance
Bandwidth-Latency Profile
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1/SNC2, 6x32GB DDR4-2400/2666 per CPU, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0.
Memory Performance
Core to Memory Latency
[Figure: core-to-memory latency in ns (lower is better), axis from 60 to 220 ns. Series: NUMA - Local, NUMA - Min Remote, NUMA - Max Remote, UMA - Min, UMA - Max. Configurations: Intel® Xeon® E5-2699 v4, DDR4-2400, Dir+OSB; Intel® Xeon® E5-2699 v4, DDR4-2400, Home Snp; Intel® Xeon® E5-2699 v4, DDR4-2400, COD; Intel® Xeon® Platinum 8180, DDR4-2666; Intel® Xeon® Platinum 8180, DDR4-2666, SNC2.]
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, 6x32GB DDR4-2666, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0.
Intel® Ultra Path Interconnect (Intel® UPI)
• Intel® Ultra Path Interconnect (Intel® UPI), replacing Intel® QPI
• Faster link with improved bandwidth for a balanced system design
• Improved messaging efficiency per packet
• 3 UPI option for 2 socket – additional inter-socket bandwidth for non-NUMA optimized use-cases
[Figure: Intel® UPI compared with Intel® QPI. Data Rate: 9.6 GT/s vs. 10.4 GT/s. Data Efficiency: 4% to 21%. Idle Power: 75% and 50%.]
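The data rates on this slide translate into link bandwidth as follows. This is a rough worked example, not a measured result: it assumes each QPI or UPI link carries 2 bytes of data per transfer in each direction, and the helper name `link_bandwidth_gbs` is illustrative.

```python
def link_bandwidth_gbs(giga_transfers_per_s: float, bytes_per_transfer: int = 2) -> float:
    """Peak bandwidth of one link in one direction, in GB/s.

    Assumption: 2 bytes of data per transfer per direction.
    """
    return giga_transfers_per_s * bytes_per_transfer


qpi_link = link_bandwidth_gbs(9.6)     # prior generation
upi_link = link_bandwidth_gbs(10.4)    # Skylake-SP

# Two-socket inter-socket bandwidth per direction.
qpi_2s = 2 * qpi_link                  # 2x QPI channels
upi_2s_2_links = 2 * upi_link          # 2S-2UPI
upi_2s_3_links = 3 * upi_link          # 2S-3UPI option

print(f"QPI link: {qpi_link:.1f} GB/s, UPI link: {upi_link:.1f} GB/s per direction")
print(f"2S QPI (2 links): {qpi_2s:.1f} GB/s per direction")
print(f"2S UPI (2 links): {upi_2s_2_links:.1f} GB/s per direction")
print(f"2S UPI (3 links): {upi_2s_3_links:.1f} GB/s per direction")
print(f"Data rate gain per link: {10.4 / 9.6 - 1:.1%}")
```

The raw data rate gain per link is modest, about 8%, so most of the extra inter-socket bandwidth in the 3 UPI option comes from the third link.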
Skylake-SP Architecture Summary
New Architectural Innovations for Data Center
Skylake-SP Performance
2-Socket Skylake-SP Performance over Intel® Xeon® E5-2699 v4
[Figure: relative performance of Xeon Platinum 8176 (28C, 165W) and Xeon Platinum 8180 (28C, 205W). Bar labels: 2.25, 1.98, 1.64, 1.64, 1.64, 1.53, 1.53, 1.41; benchmark names not recoverable.]
• Skylake-SP CPUs provide significant performance upside compared to prior generation
• 165W Skylake-SP CPUs provide more than 40% gain on performance
Source as of June 2017: Intel internal measurements with Xeon Platinum 8180 and 8176, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0.