Intel Performance & Efficiency Architecture
Rama Malladi
Intel, Bangalore
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO
LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS
INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel
logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding
the specific instruction sets covered by this notice.
Notice revision #20110804
Optimization Notice Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Intel Technical Computing & HPC Solutions Portfolio
• Intel® True Scale Fabric
• Intel® Omni-Path Architecture (2015)
• NVM
• Intel® Ethernet Networking
• Intel® Cluster Ready
• SiPh
Agenda
Parallel & Vector Execution
Problem:
Economical operation frequency of (CMOS) transistors is limited.
No free lunch anymore!
Solution:
More transistors allow more gates/logic in the same die area and
power envelope, improving parallelism.
recap
Roofline Analysis?
How to determine if we got the best / peak performance?
• Run GEMM?
• LINPACK?
• STREAM bandwidth tests?
• Latency benchmarks?
• …
• Get theoretical peak possible for the code?
Roofline Analysis
“paper-pen” exercise
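The paper-and-pen estimate can be sketched in a few lines of C. This is a minimal sketch of the usual roofline bound, assuming attainable performance is the smaller of peak compute and arithmetic intensity times peak memory bandwidth. The function and parameter names are illustrative and not from the slides, and the machine numbers must come from the reader's own system.

```c
#include <assert.h>

/* Roofline bound in GFLOP/s: the smaller of the compute roof and the
 * memory roof (arithmetic intensity * peak bandwidth).
 * All names here are illustrative assumptions. */
double roofline_gflops(double peak_gflops, double peak_bw_gbs,
                       double flops_per_byte)
{
    assert(peak_gflops >= 0.0 && peak_bw_gbs >= 0.0 && flops_per_byte >= 0.0);
    double memory_roof = flops_per_byte * peak_bw_gbs;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Arithmetic intensity of a double-precision triad a[i] = b[i] + s * c[i]:
 * 2 flops per element and three 8-byte accesses (two loads, one store),
 * ignoring any write-allocate traffic. */
double triad_intensity(void)
{
    return 2.0 / (3.0 * sizeof(double));
}
```

For a hypothetical machine with a 1000 GFLOP/s compute peak and 100 GB/s of memory bandwidth, a kernel at 0.5 flop/byte is capped at 50 GFLOP/s, while a kernel at 20 flop/byte is capped by the compute peak.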
Agenda
History
Intel® 64 and IA-32 Architecture
• 1978: 8086 and 8088 processors
• 1982: Intel® 286 processor
• 1985: Intel386™ processor
• 1989: Intel486™ processor
• 1993: Intel® Pentium®
• 1995-1999: P6 family:
  − Intel Pentium Pro processor
  − Intel Pentium II [Xeon] processor
  − Intel Pentium III [Xeon] processor
  − Intel Celeron® processor
• 2000-2006: Intel® Pentium® 4 processor
• 2005: Intel® Pentium® processor Extreme Edition
• 2001-2007: Intel® Xeon® processor
• 2003-2006: Intel® Pentium® M processor
• 2006/2007: Intel® Core™ Duo and Intel® Core™ Solo processors
• 2006/2007: Intel® Core™2 processor
• 2008: Intel® Atom™ processor and Intel® Core™ i7 processor family
• 2010: Intel® Core™ processor family
• 2011: Second generation Intel® Core™ processor family
• 2012: Third generation Intel® Core™ processor family
• 2013: Fourth generation Intel® Core™ processor family
• 2013: Intel® Atom™ processor family based on Silvermont microarchitecture
Processor Architecture Basics
Pipeline
Computation of instructions requires several stages:
[Figure: pipeline. Front End: Fetch, Decode. Back End: Execute, Load or Store, Commit. The stages work on Instructions, Registers and Memory.]
1. Fetch:
Read instruction (bytes) from memory
2. Decode:
Translate instruction (bytes) to microarchitecture
3. Execute:
Perform the operation with a functional unit
4. Memory:
Load (read) or store (write) data, if required
5. Commit:
Retire instruction and update micro-architectural state
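The benefit of overlapping these stages can be worked out with simple arithmetic. The sketch below assumes an idealized pipeline in which every stage takes one cycle and every stall costs exactly one extra cycle. The function names are illustrative and real cores are far less regular.

```c
#include <assert.h>

/* Cycles when each instruction passes through all stages before the next
 * one starts (no overlap). Assumes one cycle per stage. */
unsigned long cycles_unpipelined(unsigned long n_instr, unsigned long n_stages)
{
    assert(n_stages > 0);
    return n_instr * n_stages;
}

/* Cycles for an idealized pipeline: n_stages cycles until the first
 * instruction commits, then one commit per cycle, plus any stall cycles. */
unsigned long cycles_pipelined(unsigned long n_instr, unsigned long n_stages,
                               unsigned long stall_cycles)
{
    assert(n_stages > 0);
    if (n_instr == 0)
        return 0;
    return n_stages + (n_instr - 1) + stall_cycles;
}
```

With 100 instructions and the 5 stages listed above, the unpipelined count is 500 cycles and the ideal pipelined count is 104. Adding 20 stall cycles raises the latter to 124.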
Processor Architecture Basics
Reducing Impact of Pipeline Stalls
Impact of pipeline stalls can be reduced by:
• Branch prediction
• Superscalarity + multiple issue fetch & decode
• Out of Order execution
• Cache
• Non-temporal stores
• Prefetching
• Line fill buffers
• Load/Store buffers
• Alignment
• Simultaneous Multithreading (SMT)
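Of the items above, alignment is the easiest to show in code. The sketch below allocates a buffer that starts on a cache-line boundary, so that no element access at the start of the buffer straddles two lines. It assumes a 64-byte line and a C11 library that provides aligned_alloc. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_BYTES 64 /* assumed line size */

/* C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so round the request up to whole cache lines. */
static size_t round_up_to_line(size_t n_bytes)
{
    return (n_bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}

/* Allocate count doubles starting on a cache-line boundary.
 * Returns NULL on failure; release the buffer with free(). */
double *alloc_aligned_doubles(size_t count)
{
    assert(count > 0);
    return aligned_alloc(CACHE_LINE_BYTES, round_up_to_line(count * sizeof(double)));
}

/* Non-zero if p sits on a cache-line boundary. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_BYTES) == 0;
}
```

A caller would check the returned pointer for NULL, use the buffer, and release it with free().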
Processor Architecture Basics
Cache-Hierarchy & Cache-Line
Cache-hierarchy:
• For data and instructions
• Usually inclusive caches
• Cache-misses & cache-hits
Cache-line:
• Always full 64 byte block
[Figure: instructions such as "2: mov 0x5678, %rcx" and "3: mov %rcx, 0x1234" pass through Fetch, Decode, Execute, Load or Store and Commit. Their accesses to 0x1234 and 0x5678 hit or miss in the L1 caches (I and D), the L2 cache and the L3 cache before reaching memory.]
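Because data moves in whole 64-byte lines, it helps to know which line an address falls in and how many lines an access covers. The following is a minimal sketch with illustrative function names. It uses the two addresses from the slide, 0x1234 and 0x5678, as examples.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u

/* Address of the first byte of the cache line that contains addr. */
uint64_t line_base(uint64_t addr)
{
    return addr & ~(uint64_t)(CACHE_LINE_BYTES - 1);
}

/* Number of distinct cache lines covered by [addr, addr + n_bytes). */
uint64_t lines_touched(uint64_t addr, uint64_t n_bytes)
{
    if (n_bytes == 0)
        return 0;
    assert(addr + n_bytes - 1 >= addr); /* range must not wrap around */
    uint64_t first = addr / CACHE_LINE_BYTES;
    uint64_t last = (addr + n_bytes - 1) / CACHE_LINE_BYTES;
    return last - first + 1;
}
```

Address 0x1234 lies in the line starting at 0x1200 and 0x5678 in the line starting at 0x5640. An 8-byte access at 0x1234 stays inside one line, while the same access at 0x123C crosses into a second line.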
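[Figure: core pipeline. BPU and BTB feed Decode & Dispatch; ROB and load/store (L/S) buffers feed the scheduler, which issues to eight ports:
• Ports 0, 1, 5: Exe [Int] [FP]
• Port 6: Exe [Int]
• Port 4: Store
• Ports 2, 3: Load & Store Address
• Port 7: Store Address
The memory ports connect through the L1 data cache (D) and the line fill buffers (LFB) to the L2 cache.]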
Processor Architecture Basics
Core vs. Uncore
• Core:
  Processor core’s logic:
  − Execution units
  − Core caches (L1/L2)
  − Buffers & registers
  − …
• Uncore:
  All outside a processor core:
  − Memory controller/channels (MC) and Intel® QuickPath Interconnect (QPI)
  − L3 cache shared by all cores
  − Type of memory
  − Power management and clocking
  − Optionally: Integrated graphics
[Figure: processor with cores 1 to n; the uncore holds the L3 cache, the MC attached to DDR3 or DDR4, QPI, clock & power, and graphics.]
Random Access Latency
AVX-512 Designed for HPC
• Promotions of many AVX and AVX2 instructions to AVX-512
− 32-bit and 64-bit floating-point instructions from AVX
− Scalar and 512-bit
− 32-bit and 64-bit integer instructions from AVX2
• Many new instructions to speed up HPC workloads
  − Quadword integer arithmetic: including gather/scatter with D/Qword indices
  − Math support: IEEE division and square root; DP transcendental primitives; new transcendental support instructions
  − New permutation primitives: two source shuffles; Compress & Expand
  − Bit manipulation: vector rotate; universal ternary logical operation; new mask instructions
Gather & Scatter
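for (i = 0; i < N; i++)
{
    B[R[i]] = A[Q[i]];
}
• D/Q/SP/DP element types
• D/Q indices
• Instruction can partially execute
• k-reg mask used as completion mask
[Figure: elements of A are gathered through index array Q and scattered into B through index array R.]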
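The completion-mask behaviour can be modelled in plain scalar C. The sketch below is an illustration only and does not use the actual AVX-512 instructions or intrinsics. It assumes sixteen 32-bit lanes, processes lanes in index order as a simplification, and uses a hypothetical max_loads parameter to imitate a gather that stops part-way. Each completed lane clears its mask bit, so re-running with the returned mask finishes the remaining lanes.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* sixteen 32-bit elements fill one 512-bit register */

/* Scalar model of a masked gather. For every lane whose mask bit is set,
 * load base[idx[lane]] into dst[lane] and clear that bit. At most max_loads
 * lanes are completed per call, which imitates partial execution.
 * Returns the remaining mask; zero means the gather is complete. */
uint16_t gather_model(float dst[LANES], const float *base,
                      const int32_t idx[LANES], uint16_t mask, int max_loads)
{
    assert(max_loads >= 0);
    for (int lane = 0; lane < LANES && max_loads > 0; lane++) {
        uint16_t bit = (uint16_t)(1u << lane);
        if (mask & bit) {
            dst[lane] = base[idx[lane]];
            mask &= (uint16_t)~bit; /* lane done: clear its completion bit */
            max_loads--;
        }
    }
    return mask;
}
```

Calling the function again with the returned mask resumes where the previous call stopped, which is the role the k-register plays as a completion mask.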
Agenda
Petascale Today
Intel® Cluster Studio XE
Intel® Cluster Studio XE 2018
Intel® Parallel Studio XE
Leading Supporter of Industry Standards
• C++, C, Fortran, OpenMP, MPI, TBB
Tools to Create
• Highly optimizing Compilers
• High Performance Libraries
• Designed for Performance and Analysis
More Efficient Software Development
• Programmer Productivity
• Preservation of Investment
Common model method
• Scaling Consistency
• Multicore to Many-core
• Same programming models, programming languages, tools
Tools for Analysis
• Threading Analysis & Debug
• MPI Analysis & Debug
• Vector Analysis & Debug
Compiler Support
Intel Compilers
GNU Compilers
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Intel Math Kernel Library
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Small Code Example – Writing Code
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
MPI Library…
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Scaling tests of NPB – Pure MPI
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Scaling tests of NPB – OpenMP
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Scaling of HYDRO Code
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Scaling of GROMACS
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Compiler, Runtime Flags: GROMACS
Compiler Flags
Runtime Settings
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Normalized Results
[Figure: bar chart of relative performance, axis from 0 to 6. Bar labels and values:]

Baseline
  2s Xeon® E5-2697 v4                      1

Benchmarks
  Linpack                                  1.56
  DGEMM                                    1.60
  SGEMM                                    1.70
  HPCG                                     2.18
  STREAM Triad                             3.89

Trinity Benchmarks
  Trinity - SNAP                           0.96
  Trinity - GTC                            1.11
  Trinity - miniDFT                        1.20
  Trinity - MILC                           1.35
  Trinity - AMG                            1.38
  Trinity - UMT                            1.56
  Trinity - miniFE                         2.94
  Trinity - miniGhost                      3.34

Life
  GROMACS                                  1.22
  NAMD apoa1                               1.23
  RELION                                   1.32
  NAMD stmv                                1.36
  Coarse-Grain Water Simulation…           1.43
  ROME/SML                                 2.36
  Amber 2w49                               2.50
  Amber Rubisco                            2.67

Weather and Climate
  IFS PAPS14                               1.16
  MPAS-O                                   1.23
  MASNUM Wave                              1.41
  POP                                      1.41
  HOMME                                    1.53
  DMI HIROMB-BOOS                          1.70
  WRF CONUS 12KM                           1.71
  GNAQPMS                                  1.77
  NIM                                      2.00
  NEMO                                     2.10

Physics / Geophysics /
  iso3DFD                                  1.71
  PETSc                                    1.80
  Soft Sphere                              1.81
  VLPL-S                                   2.00
  QphiX Wilson DSLASH                      2.13
  CloverLeaf 3D                            2.34
  CloverLeaf 2D                            2.37
  QphiX CG                                 2.45

  VASP                                     1.82
  CP2K                                     2.54

  GE Tacoma                                1.23
  HiFUN                                    1.35
  OpenFOAM MotorBike 4M Cells              1.38
  OpenLB                                   1.50
  OpenFOAM MotorBike 20M Cells             1.63
  OpenFOAM DrivAer car 10M Cells :…        1.67
  OpenFOAM MotorBike 11M Cells             1.71

  …                                        1.31
  BlackScholes DP                          3.36

For more complete information visit: http://www.intel.com/performance
Over 40 applications optimized for Intel® Xeon Phi™ processor family are
Intel® Xeon Phi™ processor 7250 relative performance, normalized to a baseline of 1 for a 2-socket Intel® Xeon® processor E5-2697 v4.