Skylake Architecture

Dr.-Ing. Michael Klemm
Senior Application Engineer
Developer Relations Division
Intel Architecture, Graphics and Software
Notices and Disclaimers
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel
representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the
OEM or retailer. No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult
other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit
http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and
provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and
uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata
are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced
data are accurate.

Intel, the Intel logo, Intel Optane and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Agenda

• Intel® Xeon® Scalable Processor Overview

• Skylake-SP CPU Architecture

Intel® Xeon® Scalable Processors

Roadmap, 2016 to 2018:

• Intel® Xeon® Processor E7 (Brickland Platform: E7 v3, E7 v4; figure label: 18 cores): targeted at mission critical applications that value a scale-up system with leadership memory capacity and advanced RAS
• Intel® Xeon® Processor E5 (Grantley-EP Platform: E5 v3 to E5-4600 v4 (4S), and E5 v3 to E5-2600 v4): targeted at a wide variety of applications that value a balanced system with leadership performance/watt/$
• Purley Platform (Skylake, then Cascade Lake): Intel® Xeon® PLATINUM, Intel Xeon GOLD, Intel Xeon SILVER, Intel Xeon BRONZE

Converged platform with innovative Skylake-SP microarchitecture

Intel® Xeon® Scalable Processor Feature Overview

Feature | Details
Socket | Socket P
Scalability | 2S, 4S, 8S, and >8S (with node controller support)
CPU TDP | 70W – 205W
Chipset | Intel® C620 Series (code name Lewisburg)
Networking | Intel® Omni-Path Fabric (integrated or discrete); 4x10GbE (integrated w/ chipset); 100G/40G/25G discrete options
Compression and Crypto Acceleration | Intel® QuickAssist Technology to support 100Gb/s comp/decomp/crypto; 100K RSA2K public key
Storage | Integrated QuickData Technology, VMD, and NTB; Intel® Optane™ SSD, Intel® 3D-NAND NVMe & SATA SSD
Security | CPU enhancements (MBE, PPK, MPX); Manageability Engine; Intel® Platform Trust Technology; Intel® Key Protection Technology
Manageability | Innovation Engine (IE); Intel® Node Manager; Intel® Datacenter Manager

[Figure: 2S platform block diagram. Two Skylake-SP CPUs linked by 2 or 3 Intel® UPI; each CPU has DDR4 2666 memory, 3x16 PCIe* Gen3, and 1x 100Gb OPA fabric. DMI connects to the Lewisburg PCH (Intel® QAT, ME, IE, 4x10GbE NIC, high speed IO: USB3, PCIe3, SATA3), with BMC, TPM, firmware, GPIO, eSPI/LPC, SPI, and 10GbE attached. CPU, OPA, and memory VRs are shown.]

BMC: Baseboard Management Controller; PCH: Intel® Platform Controller Hub; IE: Innovation Engine; Intel® OPA: Intel® Omni-Path Architecture; Intel QAT: Intel® QuickAssist Technology; ME: Manageability Engine; NIC: Network Interface Controller; VMD: Volume Management Device; NTB: Non-Transparent Bridge

Platform Topologies

[Figure: platform topology diagrams of SKL (Skylake-SP) sockets connected by Intel® UPI, with LBG (Lewisburg) PCH attached over DMI x4, 3x16 PCIe* per socket, and optional 1x100G Intel® OP Fabric.]
• 2S Configurations (2S-2UPI & 2S-3UPI shown)
• 4S Configurations (4S-2UPI & 4S-3UPI shown)
• 8S Configuration

Intel® Xeon® Scalable Processor supports configurations ranging from 2S-2UPI to 8S

Intel® Xeon® Scalable Processor
Re-architected from the Ground Up
• Skylake core microarchitecture, with data center specific enhancements
• Intel® AVX-512 with 32 DP flops per cycle per core
• Data center optimized cache hierarchy – 1MB L2 per core, non-inclusive L3
• New mesh interconnect architecture
• Enhanced memory subsystem
• Modular IO with integrated devices
• New Intel® Ultra Path Interconnect (Intel® UPI)
• Intel® Speed Shift Technology
• Security & Virtualization enhancements (MBE, PPK, MPX)
• Optional Integrated Intel® Omni-Path Fabric (Intel® OPA)

Features | Intel® Xeon® Processor E5-2600 v4 | Intel® Xeon® Scalable Processor
Cores Per Socket | Up to 22 | Up to 28
Threads Per Socket | Up to 44 threads | Up to 56 threads
Last-level Cache (LLC) | Up to 55 MB | Up to 38.5 MB (non-inclusive)
QPI/UPI Speed (GT/s) | 2x QPI channels @ 9.6 GT/s | Up to 3x UPI @ 10.4 GT/s
PCIe* Lanes / Controllers / Speed (GT/s) | 40 / 10 / PCIe* 3.0 (2.5, 5, 8 GT/s) | 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Memory Population | 4 channels of up to 3 RDIMMs, LRDIMMs, or 3DS LRDIMMs | 6 channels of up to 2 RDIMMs, LRDIMMs, or 3DS LRDIMMs
Max Memory Speed | Up to 2400 | Up to 2666
TDP (W) | 55W-145W | 70W-205W

[Figure: socket block diagram with cores and shared L3, 6 channels DDR4, 2 or 3 UPI links, Omni-Path HFI, 48 lanes PCIe* 3.0, and DMI3.]

Skylake Core Microarchitecture Enhancements

[Figure: core block diagram, also labeled "Haswell/Broadwell Microarchitecture". Front end: 32KB L1 I$, pre decode, Inst Q, decoders, branch prediction unit, μop cache, and μop queue, with the paths into the μop queue labeled 5 and 6. In-order and OOO: load buffer, store buffer, reorder buffer, allocate/rename/retire, scheduler. Execution ports: Ports 0, 1, 5, 6 (INT: ALU, Shift, LEA, MUL, JMP; VEC: FMA, ALU, Shift, DIV, Shuffle), Port 4 (store data), Ports 2 and 3 (Load/STA), Port 7 (STA). Memory: memory control, fill buffers, 32KB L1 D$, 1MB L2$.]

Parameter | Broadwell uArch | Skylake uArch
Out-of-order Window | 192 | 224
In-flight Loads + Stores | 72 + 42 | 72 + 56
Scheduler Entries | 60 | 97
Registers – Integer + FP | 168 + 168 | 180 + 168
Allocation Queue | 56 | 64/thread
L1D BW (B/Cyc) – Load + Store | 64 + 32 | 128 + 64
L2 Unified TLB | 4K+2M: 1024 | 4K+2M: 1536, 1G: 16

• Larger and improved branch predictor, higher throughput decoder, larger window to extract ILP
• Improved scheduler and execution engine, improved throughput and latency of divide/sqrt
• More load/store bandwidth, deeper load/store buffers, improved prefetcher
• Data center specific enhancements: Intel® AVX-512 with 2 FMAs per core, larger 1MB MLC

About 10% performance improvement per core on integer applications at same frequency
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

• 512-bit wide vectors
• 32 operand registers
• 8 64b mask registers
• Embedded broadcast
• Embedded rounding

Microarchitecture | Instruction Set | SP FLOPs / cycle | DP FLOPs / cycle
Skylake | Intel® AVX-512 & FMA | 64 | 32
Haswell / Broadwell | Intel AVX2 & FMA | 32 | 16
Sandybridge | Intel AVX (256b) | 16 | 8
Nehalem | SSE (128b) | 8 | 4

Intel AVX-512 Instruction Types
AVX-512-F | AVX-512 Foundation Instructions
AVX-512-VL | Vector Length Orthogonality: ability to operate on sub-512 vector sizes
AVX-512-BW | 512-bit Byte/Word support
AVX-512-DQ | Additional D/Q/SP/DP instructions (converts, transcendental support, etc.)
AVX-512-CD | Conflict Detect: used in vectorizing loops with potential address conflicts

Powerful instruction set for data-parallel computation
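
The FLOPs / cycle figures in the table above can be reproduced with simple arithmetic. The sketch below is only an illustration: it assumes two vector execution units per core in every generation, counts an FMA as 2 flops per lane, and, for the pre-FMA generations, assumes one add unit plus one multiply unit (that last point is not stated on the slide). The names `flops_per_cycle`, `CORES`, and `peak_table` are hypothetical.

```python
def flops_per_cycle(vector_bits, element_bits, units, flops_per_lane):
    """Peak floating-point operations per cycle for one core."""
    lanes = vector_bits // element_bits
    return lanes * units * flops_per_lane


# name -> (vector width in bits, execution units, flops per lane per unit)
# An FMA counts as 2 flops per lane. The pre-FMA rows assume one add unit
# plus one multiply unit, which is an assumption made for this sketch.
CORES = {
    "Skylake": (512, 2, 2),
    "Haswell / Broadwell": (256, 2, 2),
    "Sandybridge": (256, 2, 1),
    "Nehalem": (128, 2, 1),
}


def peak_table():
    """Return {name: (SP flops/cycle, DP flops/cycle)}."""
    return {
        name: (
            flops_per_cycle(bits, 32, units, per_lane),
            flops_per_cycle(bits, 64, units, per_lane),
        )
        for name, (bits, units, per_lane) in CORES.items()
    }


if __name__ == "__main__":
    for name, (sp, dp) in peak_table().items():
        print(f"{name}: SP {sp}, DP {dp}")
```

Under these assumptions the Skylake row works out to 8 DP lanes x 2 FMA units x 2 flops = 32 DP flops per cycle.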

Skylake-SP Core
Skylake-SP core builds on Skylake core with features architected for data center usage
• Intel® AVX-512 implemented with Port 0/1 fused to a single 512b execution unit

• Port 5 is extended to full 512b to add second FMA outside of Skylake core

• L1-D load and store bandwidth doubled to allow up to 2x64B load and 1x64B store

• Additional 768KB of L2 cache added outside of Skylake core

[Figure: Skylake-SP core block diagram. The Skylake core's Ports 0, 1, 5, and 6 (INT: ALU, Shift, LEA, MUL, JMP; VEC: FMA, ALU, Shift, Shuffle, DIV) are shown with two additions outside the core: extended AVX on Port 5 and extended L2 cache.]

Skylake-SP Core: Optimized for Data center Workloads

Frequency Behavior While Running Intel® AVX Code

• Cores running non-AVX, Intel® AVX2 light/heavy, and Intel® AVX-512 light/heavy code have different turbo frequency limits
• Frequency of each core is determined independently based on workload demand

Code Type | All Core Frequency Limit
SSE | Non-AVX All Core Turbo
AVX2-Light (without FP & int-mul) | Non-AVX All Core Turbo
AVX2-Heavy (FP & int-mul) | AVX2 All Core Turbo
AVX512-Light (without FP & int-mul) | AVX2 All Core Turbo
AVX512-Heavy (FP & int-mul) | AVX512 All Core Turbo

[Figure: per-core frequency chart for mixed workloads. The frequency axis shows the Non-AVX_Turbo, AVX2_Turbo, AVX512_Turbo, Non-AVX_Base, AVX2_Base, and AVX512_Base levels. Legend: AVX512 = cores using AVX-512, AVX2 = cores using AVX2, Non-AVX = cores not using AVX.]
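
The mapping from code type to all-core frequency limit on this slide can be written as a small lookup. This is a minimal sketch of the table only, not of how the hardware decides: the `License` enum, the `frequency_limit` function, and the string labels for the instruction sets are hypothetical names chosen for illustration, and "heavy" here means the code uses FP or integer-multiply instructions, as on the slide.

```python
from enum import Enum


class License(Enum):
    """All-core frequency limits named on the slide."""
    NON_AVX = "Non-AVX All Core Turbo"
    AVX2 = "AVX2 All Core Turbo"
    AVX512 = "AVX512 All Core Turbo"


def frequency_limit(isa: str, heavy: bool) -> License:
    """Map a code type to its all-core frequency limit.

    isa   -- "sse", "avx2", or "avx512" (labels assumed for this sketch)
    heavy -- True if the code uses FP or int-mul instructions
    """
    if isa == "sse":
        return License.NON_AVX
    if isa == "avx2":
        return License.AVX2 if heavy else License.NON_AVX
    if isa == "avx512":
        return License.AVX512 if heavy else License.AVX2
    raise ValueError(f"unknown instruction set: {isa!r}")


if __name__ == "__main__":
    for isa in ("sse", "avx2", "avx512"):
        for heavy in (False, True):
            print(isa, "heavy" if heavy else "light", "->",
                  frequency_limit(isa, heavy).value)
```

Note how light code of each vector width falls into the limit of the next narrower class, so AVX-512 code without FP or int-mul runs at the AVX2 all-core turbo limit.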

Performance and Efficiency with Intel® AVX-512

LINPACK Performance

Metric | SSE4.2 | AVX | AVX2 | AVX512
GFLOPs | 669 | 1178 | 2034 | 3259
System Power (W) | 760 | 768 | 791 | 767
Core Frequency (GHz) | 3.1 | 2.8 | 2.5 | 2.1
GFLOPs / Watt (normalized to SSE4.2) | 1.00 | 1.74 | 2.92 | 4.83
GFLOPs / GHz (normalized to SSE4.2) | 1.00 | 1.95 | 3.77 | 7.19

Intel® AVX-512 delivers significant performance and efficiency gains

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC. Software and workloads used in performance
tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
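
The two normalized efficiency charts on this slide follow directly from the LINPACK GFLOPs, system power, and core frequency values shown with them. As a quick check, the sketch below recomputes both ratios and normalizes them to the SSE4.2 result; the variable and function names are illustrative only.

```python
# Values read from the LINPACK chart on this slide.
ISA = ["SSE4.2", "AVX", "AVX2", "AVX512"]
GFLOPS = [669, 1178, 2034, 3259]
POWER_W = [760, 768, 791, 767]
FREQ_GHZ = [3.1, 2.8, 2.5, 2.1]


def normalized(numerators, denominators):
    """Divide element-wise, then normalize to the first entry (SSE4.2)."""
    ratios = [n / d for n, d in zip(numerators, denominators)]
    return [round(r / ratios[0], 2) for r in ratios]


per_watt = normalized(GFLOPS, POWER_W)   # GFLOPs / Watt
per_ghz = normalized(GFLOPS, FREQ_GHZ)   # GFLOPs / GHz

if __name__ == "__main__":
    for name, w, g in zip(ISA, per_watt, per_ghz):
        print(f"{name}: {w:.2f}x GFLOPs/Watt, {g:.2f}x GFLOPs/GHz")
```

The recomputed values match the chart labels: system power stays nearly flat while throughput rises, so almost all of the GFLOPs gain shows up as efficiency, even though the core frequency drops from 3.1 GHz to 2.1 GHz.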

New Mesh Interconnect Architecture
[Figure: die diagrams of the two interconnects.
Broadwell EX 24-core die: ring-based layout with QPI links, PCI-E x16/x16/x8/x4 (ESI), R3QPI, R2PCI, CB DMA, UBox, PCU, IOAPIC, QPI agent, and IIO at the top; cores, each paired with a 2.5MB LLC cache slice, CBo, and SAD, attached over IDI/QPII; home agents and DDR memory controllers at the bottom.
Skylake-SP 28-core die: mesh of tiles, each pairing a CHA/SF/LLC with an SKX core; 2x UPI x20, 1x UPI x20, PCIe* x16 blocks, on-package PCIe x16, DMI x4, and CBDMA along the top; two memory controllers (MC), each with DDR4 channels, on the left and right sides.]

CHA – Caching and Home Agent; SF – Snoop Filter; LLC – Last Level Cache; SKX Core – Skylake Server Core; UPI – Intel® UltraPath Interconnect

Mesh Improves Scalability with Higher Bandwidth and Reduced Latencies

Content Under Embargo Until 1:00 PM PST June 15, 2017
High and Low Core Count Die Configurations
[Figure: die layouts for HCC (up to 18 cores) and LCC (up to 10 cores). Each die has 2x UPI x20 @ 10.4GT/s, three PCIe* x16 blocks (1x16/2x8/4x4 @ 8GT/s), x4 DMI, CBDMA, a mesh of CHA/SF/LLC + core tiles, and two memory controllers (MC) with 3x DDR4 2667 each.]

CHA – Caching and Home Agent; SF – Snoop Filter; LLC – Last Level Cache; Core – Skylake-SP Core; UPI – Intel® UltraPath Interconnect

Distributed Caching and Home Agent (CHA)
• Intel® UPI caching and home agents are distributed with each LLC bank
  • Prior generation had a small number of QPI home agents
• Distributed CHA benefits
  • Eliminates large tracker structures at memory controllers, allowing more requests to be in flight and processed concurrently
  • Reduces traffic on mesh by eliminating home agent to LLC interaction
  • Reduces latency by launching snoops earlier and obviates the need for different snoop modes

[Figure: HCC die diagram with a CHA/SF/LLC block at every core tile, 2x UPI x20 @ 10.4GT/s, three PCIe* x16 blocks, x4 DMI, CBDMA, and two memory controllers (MC) with 3x DDR4 2667 each.]

Distributed CHA architecture sustains higher bandwidth and lowers latency

Re-Architected L2 & L3 Cache Hierarchy

Level | Previous Architectures | Skylake-SP Architecture
Shared L3 | 2.5MB/core (inclusive) | 1.375MB/core (non-inclusive)
L2 (private, per core) | 256KB | 1MB

• On-chip cache balance shifted from shared-distributed (prior architectures) to private-local (Skylake architecture):
  • Shared-distributed → shared-distributed L3 is primary cache
  • Private-local → private L2 becomes primary cache with shared L3 used as overflow cache
• Shared L3 changed from inclusive to non-inclusive:
  • Inclusive (prior architectures) → L3 has copies of all lines in L2
  • Non-inclusive (Skylake architecture) → lines in L2 may not exist in L3

Skylake-SP cache hierarchy architected specifically for Data center use case
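
The per-core cache sizes on this slide tie back to the last-level cache totals quoted earlier in the deck (up to 55 MB for 22 cores, up to 38.5 MB for 28 cores). The sketch below redoes that arithmetic and also compares the amount of distinct data each hierarchy can hold per core. The second function is a simplification made for illustration: it treats an inclusive L3 as holding a copy of everything in L2, and a non-inclusive L3 as holding no duplicates at all, which is an upper bound rather than a measured figure. Function names are hypothetical.

```python
def llc_total_mb(cores, l3_per_core_mb):
    """Total shared L3 capacity for a die with the given core count."""
    return cores * l3_per_core_mb


def unique_capacity_per_core_mb(l2_mb, l3_mb, inclusive):
    """Distinct data per core that L2 + L3 can hold.

    Inclusive: every L2 line is duplicated in L3, so only L3 counts.
    Non-inclusive: upper bound, assuming no line sits in both levels.
    """
    return l3_mb if inclusive else l2_mb + l3_mb


if __name__ == "__main__":
    print("E5-2600 v4 LLC:", llc_total_mb(22, 2.5), "MB")
    print("Skylake-SP LLC:", llc_total_mb(28, 1.375), "MB")
    print("Prior, unique per core:",
          unique_capacity_per_core_mb(0.25, 2.5, inclusive=True), "MB")
    print("Skylake-SP, unique per core (upper bound):",
          unique_capacity_per_core_mb(1.0, 1.375, inclusive=False), "MB")
```

Under these assumptions the usable on-die capacity per core stays close (2.5 MB versus up to 2.375 MB), while the share of it that is private to the core grows from 256KB to 1MB.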
Inclusive vs Non-Inclusive L3

[Figure: fill and eviction paths (numbered 1 to 3) between memory, L3, and L2 for the inclusive L3 of prior architectures (256kB L2, 2.5 MB L3) and the non-inclusive L3 of the Skylake-SP architecture (1MB L2, 1.375 MB L3).]

1. Memory reads fill directly to the L2, no longer to both the L2 and L3
2. When an L2 line needs to be removed, both modified and unmodified lines are written back
3. Data shared across cores are copied into the L3 for servicing future L2 misses

Cache hierarchy architected and optimized for data center use cases:
• Virtualized use cases get larger private L2 cache free from interference
• Multithreaded workloads can operate on larger data per thread (due to increased L2 size) and reduce uncore activity
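
Steps 1 and 2 of the non-inclusive flow can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of the actual hardware: it models a single core, uses plain LRU replacement, moves a line out of L3 when it is read back into L2, and does not model step 3 (shared data copied into L3) or write-back to memory. The class `ToyNonInclusiveHierarchy` and all of its members are hypothetical.

```python
from collections import OrderedDict


class ToyNonInclusiveHierarchy:
    """Toy single-core model: a private L2 whose evictions fill the L3."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()  # address -> dirty flag, oldest first
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def read(self, addr):
        """Return where the line was found: 'L2', 'L3', or 'memory'."""
        if addr in self.l2:
            self.l2.move_to_end(addr)      # refresh LRU position
            return "L2"
        if addr in self.l3:
            dirty = self.l3.pop(addr)      # toy choice: move the line up
            self._fill_l2(addr, dirty)
            return "L3"
        self._fill_l2(addr, False)         # step 1: fill L2 only, not L3
        return "memory"

    def _fill_l2(self, addr, dirty):
        if len(self.l2) >= self.l2_lines:
            victim, victim_dirty = self.l2.popitem(last=False)
            # Step 2: modified and unmodified victims both go to L3.
            self._fill_l3(victim, victim_dirty)
        self.l2[addr] = dirty

    def _fill_l3(self, addr, dirty):
        if len(self.l3) >= self.l3_lines:
            self.l3.popitem(last=False)    # toy: drop the oldest L3 line
        self.l3[addr] = dirty


if __name__ == "__main__":
    h = ToyNonInclusiveHierarchy(l2_lines=2, l3_lines=4)
    for a in (0, 1, 2, 0):
        print(f"read({a}) served from {h.read(a)}")
```

In the usage loop, the first read of address 0 comes from memory and lands only in L2; reading two more lines evicts it into L3, from where the final read is served.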

Cache Performance

CPU cache latency (ns), lower is better:

Cache | Broadwell-EP | Skylake-SP
L1 cache | 1.1 | 1.1
L2 cache | 3.3 | 3.9
L3 cache (avg) | 18 | 19.5

• Skylake-SP L2 cache latency has increased by 2 cycles for a 4x larger L2
• Skylake-SP achieves good L3 cache latency even with larger core count

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC, and platform with Intel® Xeon® E5-2699 v4, Turbo
enabled, without COD, 4x32GB DDR4-2400, RHEL 7.0. Cache latency measurements were done using Intel® Memory Latency Checker (MLC) tool.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific
computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you
in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information
visit http://www.intel.com/performance. Copyright © 2017, Intel Corporation.

Memory Subsystem
• 2 Memory Controllers, 3 channels each → total of 6 memory channels
  • DDR4 up to 2666, 2 DIMMs per channel
  • Support for RDIMM, LRDIMM, and 3DS-LRDIMM
  • 1.5TB Max Memory Capacity per Socket (2 DPC with 128GB DIMMs)
  • >60% increase in Memory BW per Socket compared to Intel® Xeon® processor E5 v4
• Supports XPT prefetch and D2C/D2K to reduce LLC miss latency
• Introduces a new memory device failure detection and recovery scheme with Adaptive Double Device Data Correction (ADDDC)

[Figure: HCC die diagram highlighting the two memory controllers (MC), each with 3x DDR4-2666 channels.]

Significant memory bandwidth and capacity improvements
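
The capacity and bandwidth claims on this slide can be checked with back-of-the-envelope arithmetic. The sketch below assumes each DDR4 channel moves 8 bytes per transfer (a 64-bit data path, which the slide does not state) and uses the E5 v4 figures from the earlier comparison table (4 channels, up to 2400). These are theoretical peaks, not measured bandwidth, and the function names are illustrative.

```python
BYTES_PER_TRANSFER = 8  # assumption: 64-bit data path per DDR4 channel


def peak_bandwidth_gbs(channels, mega_transfers_per_s):
    """Theoretical peak memory bandwidth per socket in GB/s."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000.0


def max_capacity_gb(channels, dimms_per_channel, dimm_gb):
    """Maximum memory capacity per socket in GB."""
    return channels * dimms_per_channel * dimm_gb


skylake_sp = peak_bandwidth_gbs(6, 2666)   # 6 channels of DDR4-2666
xeon_e5_v4 = peak_bandwidth_gbs(4, 2400)   # 4 channels of DDR4-2400
increase = skylake_sp / xeon_e5_v4 - 1.0

if __name__ == "__main__":
    print(f"Skylake-SP peak: {skylake_sp:.1f} GB/s")
    print(f"E5 v4 peak:      {xeon_e5_v4:.1f} GB/s")
    print(f"Increase:        {increase:.1%}")
    print(f"Max capacity:    {max_capacity_gb(6, 2, 128)} GB")
```

On these assumptions the peak rises from 76.8 GB/s to about 128 GB/s per socket, roughly a 67% increase, consistent with the ">60%" on the slide, and 6 channels x 2 DPC x 128GB gives 1536 GB, which is the 1.5TB figure.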

Sub-NUMA Cluster (SNC)
Prior generation supported Cluster-On-Die (COD)

SNC provides similar localization benefits as COD, without some of its downsides
• Only one UPI caching agent required even in 2-SNC mode
• Latency for memory accesses in remote cluster is smaller, no UPI flow
• LLC capacity is utilized more efficiently in 2-cluster mode, no duplication of lines in LLC

[Figure: 28-core die diagram split into SNC Domain 0 and SNC Domain 1, each containing one memory controller (MC, 3x DDR4 2667) and half of the CHA/SF/LLC + core tiles.]

Sub-NUMA Clusters – 2 SNC Example
SNC partitions the LLC banks and associates them with memory controller to localize LLC miss traffic

• LLC miss latency to local cluster is smaller

• Mesh traffic is localized, reducing uncore power and sustaining higher BW

[Figure: three die diagrams labeled "Without SNC", "Local SNC Access", and "Remote SNC Access", each with numbered markers 1 to 3 on the requesting core, the LLC bank, and the memory controller involved in an LLC miss.]

Memory Performance
Bandwidth-Latency Profile

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1/SNC2, 6x32GB DDR4-2400/2666 per CPU, 1 DPC, and platform with E5-2699 v4,
Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark
and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to
http://www.intel.com/performance

Memory Performance
Core to Memory Latency
[Figure: bar chart of core to memory latency in ns (lower is better), y-axis from 60 to 220. Series: NUMA - Local, NUMA - Min Remote, NUMA - Max Remote, UMA - Min, UMA - Max. Configurations: Intel® Xeon® E5-2699 v4, DDR4-2400, Dir+OSB; Intel® Xeon® E5-2699 v4, DDR4-2400, Home Snp; Intel® Xeon® E5-2699 v4, DDR4-2400, COD; Intel® Xeon® Platinum 8180, DDR4-2666; Intel® Xeon® Platinum 8180, DDR4-2666, SNC2.]

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, 6x32GB DDR4-2666, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400,
RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. For more complete information visit http://www.intel.com/performance.

Intel® Ultra Path Interconnect (Intel® UPI)
• Intel® Ultra Path Interconnect (Intel® UPI), replacing Intel® QPI
• Faster link with improved bandwidth for a balanced system design
• Improved messaging efficiency per packet
• 3 UPI option for 2 socket – additional inter-socket bandwidth for non-NUMA optimized use-cases
[Figure: three QPI vs. UPI comparisons. Data Rate: QPI 9.6 GT/s, UPI 10.4 GT/s. Data Efficiency (per wire): 4% to 21%. Idle Power: L0, L0p 75% (QPI), L0p 50% (UPI).]

Intel® UPI enables system scalability with higher inter-socket bandwidth
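
The raw link-rate gain from QPI to UPI is easy to quantify from the transfer rates on this slide and the link counts given earlier in the deck (2x QPI @ 9.6 GT/s versus up to 3x UPI @ 10.4 GT/s). The sketch below compares transfer rates only; it assumes the same link width for both and ignores the per-packet efficiency improvement, so it is a rough lower-level view rather than a bandwidth measurement. The function name is illustrative.

```python
def aggregate_rate_ratio(new_links, new_gts, old_links, old_gts):
    """Ratio of aggregate transfer rates (links x GT/s), new over old."""
    return (new_links * new_gts) / (old_links * old_gts)


per_link = aggregate_rate_ratio(1, 10.4, 1, 9.6)    # one UPI vs one QPI link
per_socket = aggregate_rate_ratio(3, 10.4, 2, 9.6)  # 3x UPI vs 2x QPI

if __name__ == "__main__":
    print(f"Per link:   {per_link:.3f}x")
    print(f"Per socket: {per_socket:.3f}x")
```

Most of the aggregate gain in this simple view comes from the third link rather than from the faster signaling, which fits the slide's point that the 3 UPI option targets inter-socket bandwidth.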

Skylake-SP Architecture Summary
New Architectural Innovations for Data Center

• Up to 60% increase in compute density with Intel® AVX-512

• Improved performance and scalability with Mesh on-chip interconnect
• L2 and L3 cache hierarchy optimized for data center workloads
• Improved memory subsystem with up to 60% higher memory bandwidth
• Faster and more efficient Intel® UPI interconnect for improved scalability
• Improved integrated IO with up to 50% higher aggregate IO bandwidth
• Increased protection against kernel tampering and user data corruption
• Core, cache, memory and IO improvements for increased virtual machine performance
• Enhanced power management and RAS capability for improved utilization of resources

Skylake-SP Performance

[Figure: bar chart "2 Socket Skylake-SP Performance over Intel® Xeon® E5-2699 v4", showing relative performance of Xeon Platinum 8176 (28C, 165W) and Xeon Platinum 8180 (28C, 205W) on SPECint*_rate_base2006, SPECfp*_rate_base2006, STREAM - Triad, and LINPACK. Bar labels range from 1.41 to 2.25 (1.41, 1.53, 1.53, 1.64, 1.64, 1.64, 1.98, 2.25).]

• Skylake-SP CPUs provide significant performance upside compared to prior generation
• 165W Skylake-SP CPUs provide more than 40% gain on performance
• 205W Skylake-SP CPUs provide additional boost to core bound workloads

Source as of June 2017: Intel internal measurements with Xeon Platinum 8180 and 8176, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC, and platform with
E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to
any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including
the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

No ratings yet
Register Transfer and Computer Design Basics
27 pages
NetApp AFF A-Series Datasheet
No ratings yet
NetApp AFF A-Series Datasheet
6 pages
Intel® Management Engine (ME) Firmware Update Procedure
No ratings yet
Intel® Management Engine (ME) Firmware Update Procedure
3 pages
Power9 Performance Best Practices 0
No ratings yet
Power9 Performance Best Practices 0
2 pages
ENDURA JD35Q Datasheet - PDF
No ratings yet
ENDURA JD35Q Datasheet - PDF
4 pages
Enterkomputer
No ratings yet
Enterkomputer
27 pages
VNX Series Performanc E: Competitive Comparison
No ratings yet
VNX Series Performanc E: Competitive Comparison
4 pages
We Now Accept Credit Cards: 6/4/2013 Intel DC D2500Hn 1.86Ghz Sodimm, Ddr3
No ratings yet
We Now Accept Credit Cards: 6/4/2013 Intel DC D2500Hn 1.86Ghz Sodimm, Ddr3
6 pages
Broadcom Test Questions
No ratings yet
Broadcom Test Questions
3 pages
489 6GbsSAS 12Gbs PerfTuningGuide
No ratings yet
489 6GbsSAS 12Gbs PerfTuningGuide
97 pages
15Th Generation of Poweredge Servers: Quick Reference Guide
No ratings yet
15Th Generation of Poweredge Servers: Quick Reference Guide
3 pages
PCI SIG Arch Overview
No ratings yet
PCI SIG Arch Overview
37 pages
PCS White Paper
No ratings yet
PCS White Paper
14 pages


• Intel® Xeon® Scalable Processor Overview

[Figure: platform roadmap, 2016 to 2018; surviving labels: Intel® Xeon® Processor E7, Brickland Platform, Purley Platform, capacity and advanced RAS, Intel® Xeon® PLATINUM]

Converged platform with innovative Skylake-SP microarchitecture

[Figure: multi-socket topologies (4S-2UPI & 4S-3UPI shown); surviving labels: SKL, DMI, 3x16 PCIe*, DDR4, Core, 48 Lanes, DMI3]

Intel® Xeon® Scalable Processor supports

Feature             Previous generation    Intel® Xeon® Scalable Processor
Cores Per Socket    Up to 22               Up to 28
Max Memory Speed    Up to 2400             Up to 2666
TDP (W)             55W-145W               70W-205W

• 32 operand registers

• 8 64b mask registers

Microarchitecture      Instruction set         SP FLOPs/cycle    DP FLOPs/cycle
Skylake                Intel® AVX-512 & FMA    64                32
Haswell / Broadwell    Intel AVX2 & FMA        32                16

Powerful instruction set for data-parallel computation
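
The per-cycle figures listed for Skylake (64 and 32) and Haswell / Broadwell (32 and 16) can be reproduced with simple arithmetic: vector lanes times two floating-point operations per FMA times the number of FMA units. The sketch below is only an illustration. The function name and the default of two FMA units per core are assumptions not stated on the slide, and some processor models may provide fewer AVX-512 FMA units.

```python
def peak_flops_per_cycle(vector_bits, element_bits, fma_units=2):
    """Peak floating-point operations per cycle for one core.

    fma_units=2 is an assumption for illustration; it is not stated
    in the slide text.
    """
    lanes = vector_bits // element_bits   # elements per vector register
    flops_per_fma = 2                     # one multiply plus one add
    return lanes * flops_per_fma * fma_units


for name, bits in (("Intel AVX-512 & FMA", 512), ("Intel AVX2 & FMA", 256)):
    sp = peak_flops_per_cycle(bits, 32)   # single precision, 32-bit elements
    dp = peak_flops_per_cycle(bits, 64)   # double precision, 64-bit elements
    print(f"{name}: SP {sp} FLOPs/cycle, DP {dp} FLOPs/cycle")
```

With the assumed two FMA units, doubling the vector width from 256 to 512 bits doubles both the single-precision and double-precision figures, which matches the numbers on the slide.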

• Additional 768KB of L2 cache added outside of Skylake core

[Figure: Skylake-SP core block diagram; surviving labels: Extended, ALU, Shift, LEA, Shuffle]

Skylake-SP Core: Optimized for Data Center Workloads

• Cores running non-AVX, Intel® AVX2 light/heavy, and Mixed Workloads

AVX2-Heavy (FP & int-mul) Cores

Intel® AVX-512 delivers significant performance and efficiency gains

[Figure: mesh of CHA/SF/LLC tiles]

Mesh Improves Scalability with Higher Bandwidth and Reduced Latencies

[Figure: die layout with Core and CHA/SF/LLC tiles; surviving labels: 2x UPI x 20, PCIe* x 16 (three links), DDR 4]

[...] and processes them concurrently

Distributed CHA architecture sustains higher bandwidth and lowers latency

Cache hierarchy architected and

Significant memory bandwidth and capacity improvements
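
To put the listed maximum memory speeds (2400 and 2666) in context, peak theoretical DDR4 bandwidth is the transfer rate times the 8-byte (64-bit) data bus width times the number of channels. The sketch below is a rough illustration only. The function name is hypothetical, the speeds are read as mega-transfers per second, and the channel count is left as an input because it does not appear in the text above.

```python
def peak_bandwidth_gb_per_s(mega_transfers_per_s, channels, bus_bytes=8):
    """Peak theoretical DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    channels is a caller-supplied assumption; the slide text does not
    state how many memory channels each socket has.
    """
    bytes_per_s = mega_transfers_per_s * 1e6 * bus_bytes * channels
    return bytes_per_s / 1e9


# Per-channel comparison of the two listed memory speeds.
for speed in (2400, 2666):
    print(f"DDR4-{speed}: {peak_bandwidth_gb_per_s(speed, channels=1):.1f} GB/s per channel")
```

Multiplying the per-channel result by the channel count of a given system gives its per-socket peak; sustained bandwidth measured in practice is lower than this theoretical value.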

• Only one UPI caching agent required even in 2-SNC mode

• LLC capacity is utilized more efficiently in 2-cluster mode, no duplication of lines in LLC

• LLC miss latency to local cluster is smaller

• Mesh traffic is localized, reducing uncore power and sustaining higher BW

[Figure: die split into SNC Domain 0 and SNC Domain 1, each with Core and CHA/SF/LLC tiles]

[Chart: comparison of Without SNC, Local SNC Access, and Remote SNC Access]

[Table: QPI vs. UPI comparison; surviving entries: (per wire), L0, L0p]

Intel® UPI enables system scalability with higher inter-socket bandwidth

• Up to 60% increase in compute density with Intel® AVX-512

205W Skylake-SP CPUs provide
