Intel Architecture:
Programming, Performance, & Efficiency

Rama Malladi
Intel, Bangalore

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO
LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS
INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel
logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding
the specific instruction sets covered by this notice.
Notice revision #20110804

Optimization Notice Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Intel Technical Computing & HPC Solutions Portfolio

[Figure: portfolio overview with the columns Compute Processing, Systems & Boards, Network & Fabric, I/O & Storage, and Software & Services. Items shown include Intel® True Scale Fabric, Intel® Omni-Path Architecture, NVM, Intel® Ethernet Networking, Intel® Cluster Ready, SiPh, and a "2015" label.]

Agenda

• The problem statement!


• Intel® Xeon® Processor Architecture Basics
• Intel® Xeon Phi™ (Knights Landing)
• Software Tools for Developers
• Summary

Parallel & Vector Execution
Problem:
Economical operation frequency of (CMOS) transistors is limited.
→ No free lunch anymore!

Solution:
More transistors allow more gates/logic in the same die space and
power envelope, improving parallelism:

• Thread level parallelism (TLP):
Multi- and many-core
• Data level parallelism (DLP):
Wider vectors (SIMD)
• Instruction level parallelism (ILP):
Microarchitecture improvements, e.g.
threading, superscalarity, …
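As a rough illustration of how thread-level and data-level parallelism can be requested from C, the sketch below applies the OpenMP `parallel for simd` construct to a simple loop. The function name `scaled_add` is made up for this example. It assumes a compiler with OpenMP 4.0 or later enabled; without OpenMP the pragma is ignored and the loop simply runs serially.

```c
#include <stddef.h>

/* Hypothetical helper: a[i] = b[i] + d * c[i] for i in [0, n).
 * "parallel for" distributes iterations across threads (TLP);
 * "simd" asks the compiler to vectorize each thread's chunk (DLP).
 * ILP is left to the hardware and the compiler's scheduling. */
void scaled_add(float *a, const float *b, const float *c, float d, size_t n)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + d * c[i];
    }
}
```

Whether the loop actually vectorizes or scales across threads depends on the compiler, its options, and the data size, so this should be read as a sketch rather than a tuned kernel.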

recap

Roofline Analysis?
How to determine if we got the best / peak performance?
• Run GEMM?
• LINPACK?
• STREAM bandwidth tests?
• Latency benchmarks?
• …
• Get theoretical peak possible for the code?

Roofline Analysis
“paper-pen” exercise

float *A, *B, *C, d;
int n;  /* number of elements */

for (int i = 0; i < n; i++)
{
    A[i] = B[i] + d * C[i];
}

The above code on Intel Xeon Phi is bound by:


• Compute?
• Bandwidth?
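One way to work through the paper-and-pen exercise: each iteration performs 2 floating-point operations (one multiply, one add) and moves at least 12 bytes (load `B[i]`, load `C[i]`, store `A[i]`, 4 bytes each; 16 bytes if the store also reads the line first). That gives an arithmetic intensity of roughly 0.17 flop/byte. The sketch below applies the usual roofline bound, min(peak, intensity × bandwidth). The function names are made up here, and the peak and bandwidth values are inputs the reader has to supply for their own machine.

```c
/* Roofline bound in GFLOP/s: min(peak, intensity * bandwidth).
 * peak_gflops and bw_gbytes_per_s describe the machine and are
 * assumptions supplied by the caller, not values from the slides. */
double roofline_gflops(double peak_gflops, double bw_gbytes_per_s,
                       double flops_per_iter, double bytes_per_iter)
{
    double intensity = flops_per_iter / bytes_per_iter;  /* flop/byte */
    double memory_roof = intensity * bw_gbytes_per_s;    /* GFLOP/s */
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Returns 1 if the memory roof is the lower (limiting) one, else 0. */
int is_bandwidth_bound(double peak_gflops, double bw_gbytes_per_s,
                       double flops_per_iter, double bytes_per_iter)
{
    double memory_roof = (flops_per_iter / bytes_per_iter) * bw_gbytes_per_s;
    return memory_roof < peak_gflops;
}
```

With purely illustrative placeholder inputs of 1000 GFLOP/s peak and 120 GB/s bandwidth, `roofline_gflops(1000.0, 120.0, 2.0, 12.0)` gives about 20 GFLOP/s, so under those assumptions the loop is bandwidth-bound. The same conclusion holds for any machine whose peak-to-bandwidth ratio exceeds about 0.17 flop/byte.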

History
Intel® 64 and IA-32 Architecture
• 1978: 8086 and 8088 processors
• 1982: Intel® 286 processor
• 1985: Intel386™ processor
• 1989: Intel486™ processor
• 1993: Intel® Pentium®
• 1995-1999: P6 family:
− Intel Pentium Pro processor
− Intel Pentium II [Xeon] processor
− Intel Pentium III [Xeon] processor
− Intel Celeron® processor
• 2000-2006: Intel® Pentium® 4 processor
• 2005: Intel® Pentium® processor Extreme Edition
• 2001-2007: Intel® Xeon® processor
• 2003-2006: Intel® Pentium® M processor
• 2006/2007: Intel® Core™ Duo and Intel® Core™ Solo processors
• 2006/2007: Intel® Core™2 processor
• 2008: Intel® Atom™ processor and Intel® Core™ i7 processor family
• 2010: Intel® Core™ processor family
• 2011: Second generation Intel® Core™ processor family
• 2012: Third generation Intel® Core™ processor family
• 2013: Fourth generation Intel® Core™ processor family
• 2013: Intel® Atom™ processor family based on Silvermont microarchitecture

Processor Architecture Basics
Pipeline
Computation of instructions requires several stages:
Front End: Fetch → Decode
Back End: Execute → Load or Store → Commit
[Figure: instructions enter at Fetch; the back-end stages access registers and memory.]
1. Fetch:
Read instruction (bytes) from memory
2. Decode:
Translate instruction (bytes) into micro-operations for the microarchitecture
3. Execute:
Perform the operation with a functional unit
4. Memory:
Load (read) or store (write) data, if required
5. Commit:
Retire instruction and update micro-architectural state
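To see why pipelining pays off, a deliberately idealized model helps: if every stage takes one cycle and a new instruction enters each cycle, n instructions need stages + n − 1 cycles instead of stages × n. The functions below are a sketch of that textbook model only. The names and the single `stall_cycles` parameter are assumptions for illustration and do not describe the behavior of any particular Intel core.

```c
/* Cycles for n instructions on an idealized in-order pipeline:
 * the first instruction needs `stages` cycles, every further one
 * completes one cycle later, and stalls are added on top. */
unsigned long pipelined_cycles(unsigned long stages, unsigned long n,
                               unsigned long stall_cycles)
{
    if (n == 0) {
        return 0;
    }
    return stages + (n - 1) + stall_cycles;
}

/* Cycles if each instruction had to finish all stages before the
 * next one could start (no overlap at all). */
unsigned long unpipelined_cycles(unsigned long stages, unsigned long n)
{
    return stages * n;
}
```

For the five stages listed above, `pipelined_cycles(5, 100, 0)` is 104 while `unpipelined_cycles(5, 100)` is 500; every stall cycle eats directly into that advantage.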
Processor Architecture Basics
Reducing Impact of Pipeline Stalls
Impact of pipeline stalls can be reduced by:
• Branch prediction
• Superscalarity + multiple issue fetch & decode
• Out of Order execution
• Cache
• Non-temporal stores
• Prefetching
• Line fill buffers
• Load/Store buffers
• Alignment
• Simultaneous Multithreading (SMT)

→ Characteristics of the architecture that might require user action!
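Alignment is one item in the list where user action is straightforward. The sketch below allocates a float array on a 64-byte boundary using the C11 `aligned_alloc` function. The helper names are made up for this example, and it assumes a C11 standard library (some toolchains, such as MSVC, do not provide `aligned_alloc`).

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_BYTES 64  /* assumed cache-line size in bytes */

/* Allocate n floats starting on an ALIGN_BYTES boundary.
 * aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. Release the memory with free(). */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + ALIGN_BYTES - 1) / ALIGN_BYTES * ALIGN_BYTES;
    if (rounded == 0) {
        rounded = ALIGN_BYTES;
    }
    return aligned_alloc(ALIGN_BYTES, rounded);
}

/* Returns 1 if p sits on an ALIGN_BYTES boundary, else 0. */
int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t)p % ALIGN_BYTES) == 0;
}
```

Aligned data lets the compiler use aligned vector loads and stores and avoids accesses that straddle two cache lines; how much that matters depends on the processor generation.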

Processor Architecture Basics
Cache-Hierarchy & Cache-Line

Example instruction stream:
1: mov 0x1234, %rbx
2: mov 0x5678, %rcx
3: mov %rcx, 0x1234
4: …
5: …
6: …

Cache-hierarchy:
• For data and instructions
• Usually inclusive caches
• Races for resources
• Can improve access speed
• Cache-misses & cache-hits

Cache-line:
• Always full 64 byte block
• Minimal granularity of every load/store
• Modifications invalidate entire cache-line (dirty bit)

[Figure: the instruction stream passes through the pipeline (Fetch, Decode, Execute, Load or Store, Commit). Instructions and the data at 0x1234 and 0x5678 are looked up in the L1 caches (I and D), the L2 cache, the L3 cache, and memory, with cache-misses and cache-hits marked along the way.]
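The 64-byte granularity can be made concrete with a little address arithmetic. The sketch below, using made-up helper names, computes which cache line an address belongs to and how many lines an access touches, assuming the 64-byte line size stated above.

```c
#include <stdint.h>

#define LINE_BYTES 64u  /* cache-line size from the slide */

/* Start address of the cache line containing addr. */
uint64_t line_base(uint64_t addr)
{
    return addr & ~(uint64_t)(LINE_BYTES - 1);
}

/* Number of cache lines touched by an access of `size` bytes
 * starting at addr (size must be greater than zero). */
uint64_t lines_touched(uint64_t addr, uint64_t size)
{
    uint64_t first = addr / LINE_BYTES;
    uint64_t last = (addr + size - 1) / LINE_BYTES;
    return last - first + 1;
}
```

For the addresses in the example stream, `line_base(0x1234)` is `0x1200` and `line_base(0x5678)` is `0x5640`. An 8-byte access at `0x1234` stays within one line, whereas the same access at `0x123C` straddles two.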
Processor Architecture Basics
Example: 4th Generation Intel® Core™
From Intel® 64 and IA-32 Architectures Optimization Reference Manual:
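[Figure: core block diagram.
Front end: Fetch with instruction cache (I), Pre-Decode, I-Queue, Decode & Dispatch, supported by BPU and BTB.
Out-of-order engine: ROB and L/S buffers feeding the scheduler (Schedule).
Scheduler ports:
• Ports 0, 1, 5: Exe [Int] [FP]
• Port 6: Exe [Int]
• Port 4: Store
• Ports 2, 3: Load & Store Address
• Port 7: Store Address
Memory side: data cache (D), LFB, L2 Cache.]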
Processor Architecture Basics
Core vs. Uncore
• Core:
Processor core’s logic:
− Execution units
− Core caches (L1/L2)
− Buffers & registers
− …

• Uncore:
Everything outside a processor core:
− Memory controller/channels (MC) and Intel® QuickPath Interconnect (QPI)
− L3 cache shared by all cores
− Type of memory
− Power management and clocking
− Optionally: Integrated graphics

→ Only the uncore differs within the same processor family!

[Figure: processor containing Core 1, Core 2, …, Core n, and the uncore with L3 cache, MC (to DDR3 or DDR4), QPI, power & clock, and graphics.]


Random Access Latency

AVX-512 Designed for HPC
• Promotions of many AVX and AVX2 instructions to AVX-512
− 32-bit and 64-bit floating-point instructions from AVX
− Scalar and 512-bit
− 32-bit and 64-bit integer instructions from AVX2
• Many new instructions to speed up HPC workloads

Quadword integer arithmetic:
− Including gather/scatter with D/Qword indices

Math support:
− IEEE division and square root
− DP transcendental primitives
− New transcendental support instructions

New permutation primitives:
− Two source shuffles
− Compress & Expand

Bit manipulation:
− Vector rotate
− Universal ternary logical operation
− New mask instructions
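The "universal ternary logical operation" is easiest to understand through a scalar model: an 8-bit immediate acts as a truth table for any Boolean function of three inputs. The sketch below models this on 64-bit words. The function name is made up, and the mapping of operands to truth-table index bits (first operand as the most significant bit) is my reading of the AVX-512 `VPTERNLOG` description, so it should be checked against the instruction reference before relying on it.

```c
#include <stdint.h>

/* Scalar model of a ternary logic operation on 64-bit words.
 * At each bit position, the bits of a, b, c form a 3-bit index
 * (a is the most significant bit), and the result bit is taken
 * from bit [index] of the 8-bit truth table imm8. */
uint64_t ternary_logic(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
    uint64_t result = 0;
    for (int bit = 0; bit < 64; bit++) {
        unsigned idx = (unsigned)((((a >> bit) & 1u) << 2) |
                                  (((b >> bit) & 1u) << 1) |
                                  ((c >> bit) & 1u));
        result |= (uint64_t)((imm8 >> idx) & 1u) << bit;
    }
    return result;
}
```

Under this convention, `0x96` yields a three-way XOR, `0x80` a three-way AND, and `0xCA` the bitwise select `a ? b : c`, each of which would otherwise need two or more two-input operations.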

Gather & Scatter
for (i = 0; i < N; i++)
{
    B[R[i]] = A[Q[i]];
}

• D/Q/SP/DP element types
• D/Q indices
• Instruction can partially execute
• k-reg Mask used as completion mask

VMOVDQU64 zmm1, Q[rsi]
VMOVDQU64 zmm2, R[rsi]
VPGATHERQQ zmm0 {k2}, [rax+zmm1*8]
VPSCATTERQQ [rax+zmm2*8] {k3}, zmm0

[Figure: elements of A selected through index array Q and written into B through index array R.]
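The completion-mask behavior can be sketched in scalar C. In the model below, each gather or scatter processes the lanes whose mask bit is set and clears the bit as the lane completes, so an all-zero mask on return means the operation finished. All function names are made up for illustration; the code is a behavioral sketch with 64-bit elements and indices, not a statement of how the hardware sequences its memory accesses.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* eight 64-bit elements fill one 512-bit register */

/* Gather: for each lane with its mask bit set, load base[idx[lane]]
 * into dst[lane] and clear the bit. Lanes with a clear bit are untouched. */
void gather_q(int64_t dst[LANES], const int64_t *base,
              const int64_t *idx, uint8_t *mask)
{
    for (int lane = 0; lane < LANES; lane++) {
        if (*mask & (1u << lane)) {
            dst[lane] = base[idx[lane]];
            *mask &= (uint8_t)~(1u << lane);
        }
    }
}

/* Scatter: for each lane with its mask bit set, store src[lane]
 * to base[idx[lane]] and clear the bit. */
void scatter_q(int64_t *base, const int64_t *idx,
               const int64_t src[LANES], uint8_t *mask)
{
    for (int lane = 0; lane < LANES; lane++) {
        if (*mask & (1u << lane)) {
            base[idx[lane]] = src[lane];
            *mask &= (uint8_t)~(1u << lane);
        }
    }
}

/* B[R[i]] = A[Q[i]] for i in [0, n), LANES elements per step.
 * The final partial step enables only the remaining lanes. */
void permute_copy(int64_t *B, const int64_t *R,
                  const int64_t *A, const int64_t *Q, size_t n)
{
    for (size_t i = 0; i < n; i += LANES) {
        size_t rem = n - i;
        uint8_t live = rem >= LANES ? 0xFF : (uint8_t)((1u << rem) - 1u);
        uint8_t k2 = live, k3 = live;
        int64_t tmp[LANES];
        gather_q(tmp, A, Q + i, &k2);
        scatter_q(B, R + i, tmp, &k3);
    }
}
```

Because completed lanes clear their mask bits, a gather or scatter that is interrupted partway can be restarted with the same mask register and will only redo the lanes that remain.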

Petascale Today
Intel® Cluster Studio XE

Intel® Cluster Studio XE 2018

More Efficient Software Development
Programmer Productivity
Preservation of Investment

Leading Supporter of Industry Standards:
• C++, C, Fortran, OpenMP, MPI, TBB

Intel® Parallel Studio XE Tools to Create:
• Highly optimizing Compilers
• High Performance Libraries
• Designed for Performance and Analysis

Intel Common model method:
• Scaling Consistency
• Multicore to Many-core: same programming models, programming languages, tools

Intel® Parallel Studio XE Tools for Analysis:
• Threading Analysis & Debug
• MPI Analysis & Debug
• Vector Analysis & Debug

Compiler Support

Intel Compilers

GNU Compilers

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Intel Math Kernel Library

Intel MKL library

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Small Code Example – Writing Code

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
MPI Library…

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Scaling tests of NPB – Pure MPI

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Scaling tests of NPB – OpenMP

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Scaling of HYDRO Code

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Scaling of GROMACS

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Compiler, Runtime Flags: GROMACS
Compiler Flags

Runtime Settings

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Normalized Results

Over 40 applications optimized for Intel® Xeon Phi™ processor family are
available, with up to 6.48X (1.99X average) performance improvement¹ (UPDATED)

[Bar chart, horizontal axis 0 to 6: Intel® Xeon Phi™ processor 7250 relative performance, normalized to the baseline (1) of a 2 socket Intel® Xeon® processor E5-2697 v4.]

Baseline
• 2s Xeon® E5-2697 v4: 1

Benchmarks
• Linpack: 1.56
• DGEMM: 1.60
• SGEMM: 1.70
• HPCG: 2.18
• STREAM Triad: 3.89

Trinity Benchmarks
• Trinity - SNAP: 0.96
• Trinity - GTC: 1.11
• Trinity - miniDFT: 1.20
• Trinity - MILC: 1.35
• Trinity - AMG: 1.38
• Trinity - UMT: 1.56
• Trinity - miniFE: 2.94
• Trinity - miniGhost: 3.34

Life Science
• GROMACS: 1.22
• NAMD apoa1: 1.23
• RELION: 1.32
• NAMD stmv: 1.36
• Coarse-Grain Water Simulation…: 1.43
• Amber Nucleosome: 1.75
• Amber Polio Virus: 1.83
• ROME/SML: 2.36
• Amber 2w49: 2.50
• Amber Rubisco: 2.67

Weather and Climate
• IFS PAPS14: 1.16
• MPAS-O: 1.23
• MASNUM Wave: 1.41
• POP: 1.41
• HOMME: 1.53
• DMI HIROMB-BOOS: 1.70
• WRF CONUS 12KM: 1.71
• GNAQPMS: 1.77
• NIM: 2.00
• NEMO: 2.10

Physics / Geophysics / Energy
• SPECFEM_3DGLOBE 6 nodes - 55K…: 1.28
• SPECFEM_3DGLOBE 25 nodes -…: 1.37
• SeisSol - M7.2 1992 Landers…: 1.45
• SeisSol - Mount Merapi LTS MR2: 1.49
• Open LB: 1.50
• SeisSol - Mount Merapi GTS: 1.59
• MILC: 1.69
• iso3DFD: 1.71
• PETSc: 1.80
• Soft Sphere: 1.81
• VLPL-S: 2.00
• QphiX Wilson DSLASH: 2.13
• CloverLeaf 3D: 2.34
• CloverLeaf 2D: 2.37
• QphiX CG: 2.45
• YASK - iso3DFD: 2.50
• YASK - AWP-ODC: 2.80

Material Science
• Quantum ESPRESSO: 1.17
• Berkeley GW: 1.38
• PWMAT - GaAs160: 1.58
• PWMAT - GaAs64: 1.58
• VASP: 1.82
• CP2K: 2.54

Manufacturing
• GE Tacoma: 1.23
• HiFUN: 1.35
• OpenFOAM MotorBike 4M Cells: 1.38
• OpenLB: 1.50
• OpenFOAM MotorBike 20M Cells: 1.63
• OpenFOAM DrivAer car 10M Cells :…: 1.67
• OpenFOAM MotorBike 11M Cells: 1.71
• OpenFOAM DrivAer car 10M Cells :…: 1.72
• NASA Overflow: 1.77
• TACC LB3D: 3.65

Financial Services
• Binomial Options: 1.27
• BlackScholes SP: 1.31
• BlackScholes DP: 3.36
• BAW American Options…: 4.03
• Monte Carlo European Options SP: 4.25
• Monte Carlo European Options DP: 4.65
• BAW American Options…: 6.48

1 – As demonstrated by respective proof points in this presentation.
For more complete information visit: http://www.intel.com/performance
Thank you!
