Intel® Processor Architecture

January 2013

Software & Services Group, Developer Products Division


Copyright © 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Agenda
• Overview of Intel® processor architecture
• Intel x86 ISA (instruction set architecture)
• Micro-architecture of processor core
• Uncore structure
• Additional processor features
– Hyper-threading
– Turbo mode
• Summary

Intel® Processor Segments Today
Architecture | Target Platforms | ISA | Specific Features
Intel® Atom™ Architecture | phone, tablet, netbook, low-power server | x86 up to SSSE3, 32 and 64 bit | optimized for low power, in-order
Intel® Core™ Architecture | mainstream notebook, desktop, server | x86 up to Intel® AVX, 32 and 64 bit | flexible feature set covering all needs
Intel® Itanium® Architecture | high end server | IA64, x86 by emulation | RAS, large address space
Intel® MIC Architecture | accelerator for HPC | x86 and Intel® MIC Instruction Set | 60+ cores, optimized for Floating-Point performance

Itanium® 9500 (Poulson)
New Itanium Processor
• Compatible with Itanium® 9300 processors (Tukwila)
• New micro-architecture with 8 cores
• 54 MB on-die cache
• Improved RAS and power management capabilities
• Doubles execution width from 6 to 12 instructions/cycle
• 32nm process technology
• Launched in November 2012

[Diagram: Poulson processor with 8 cores around a 32MB shared last level cache, memory controllers and link controllers; 4 Intel Scalable Memory Interface (SMI) links; 4 full + 2 half width Intel QuickPath Interconnect links]

Compatibility provides protection for today’s Itanium® investment
Intel® Xeon Phi™
Former Code Name “Knights Corner”

Intel® Xeon Phi™: the first product implementation of the Intel® Many Integrated Core Architecture (Intel® MIC)

X86: From Smartphones to …
Motorola RAZR* i
• Launched September 2012
• RAZR i is the first smartphone that can achieve
speeds of 2.0 GHz

Processor: Intel® Atom™ Z2460

X86: … to Supercomputers
LRZ SuperMUC System
• Installed summer 2012
– Most powerful x86-architecture based computer
– #6 on Top500 list
– More than 150,000 cores
• Processor:
– Intel® Xeon® E5-2680 (“Sandy Bridge”)
– Intel® Xeon® E7-4870 (“Westmere”)

Intel Tick-Tock Roadmap for Mainstream x86 Architecture since 2006

Codename | Generation | Cadence | What is new | Process | Year | New instructions
Merom | Intel® Core™ Micro Architecture | TOCK | Micro architecture | 65nm | 2006 | SSSE3
Penryn | Intel® Core™ Micro Architecture | TICK | Process technology | 45nm | 2007 | SSE4.1
Nehalem | Micro Architecture codename “Nehalem” | TOCK | Micro architecture | 45nm | 2008 | SSE4.2
Westmere | Micro Architecture codename “Nehalem” | TICK | Process technology | 32nm | 2009 | AES
Sandy Bridge | 2nd Generation Intel® Core™ Micro Architecture | TOCK | Micro architecture | 32nm | 2011 | AVX
Ivy Bridge | 3rd Generation Intel® Core™ Micro Architecture | TICK | Process technology | 22nm | 2012 | 7 new instructions

TICK + TOCK = SHRINK + INNOVATE
To be continued ...

Codename | Generation | Cadence | What is new | Process | Year
Haswell | 4th Generation Intel® Core™ Micro Architecture | TOCK | Micro architecture | 22nm | 2013
Broadwell | TBD | TICK | Process technology | 14nm | >= 2014
TBD | TBD | TOCK | Micro architecture | 14nm | ???
TBD | TBD | TICK | Process technology | 10nm | ???
TBD | TBD | TOCK | Micro architecture | 10nm | ???
TBD | TBD | TICK | Process technology | 7nm | ???

New instructions in Haswell: AVX2

TICK + TOCK = SHRINK + INNOVATE
Register State for Intel® Pentium® 3 Processor (1998)

Register set | Width | Registers | Count | Usage
IA32-INT Registers | 32 bit | eax … edi | Fourteen 32-bit registers | Scalar data & addresses; direct access to regs
MMX Technology / IA-FP Registers | 80 / 64 bit | st0 … st7 / mm0 … mm7 | Eight 80/64-bit registers | Hold data only; direct access to MM0..MM7; no MMX™ Technology / FP interoperability
SSE Registers | 128 bit | xmm0 … xmm7 | Eight 128-bit registers | Hold data only: 4 x single FP numbers, 2 x double FP numbers, 128-bit packed integers

SSE Vector Types

Intel® SSE
• 4x single precision FP

Intel® SSE2
• 2x double precision FP
• 16x 8 bit integer
• 8x 16 bit integer
• 4x 32 bit integer
• 2x 64 bit integer
• plain 128 bit

AVX Vector Types

Intel® AVX
• 8x single precision FP
• 4x double precision FP

Intel® AVX2 (Future)
• 32x 8 bit integer
• 16x 16 bit integer
• 8x 32 bit integer
• 4x 64 bit integer
• plain 256 bit
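
The element counts on the SSE and AVX slides follow from dividing the register width by the element width. A minimal sketch in C; the helper names `simd_lanes` and `print_lane_table` are made up for this illustration and are not part of any Intel API:

```c
#include <stdio.h>

/* Number of elements ("lanes") that fit into one SIMD register.
 * Hypothetical helper, for illustration only. */
unsigned simd_lanes(unsigned register_bits, unsigned element_bits)
{
    return register_bits / element_bits;
}

/* Print lane counts for 128-bit (SSE) and 256-bit (AVX) registers. */
void print_lane_table(void)
{
    const unsigned widths[] = { 128, 256 };
    const unsigned elems[]  = { 8, 16, 32, 64 };

    for (int w = 0; w < 2; w++)
        for (int e = 0; e < 4; e++)
            printf("%u-bit register, %2u-bit elements: %2u lanes\n",
                   widths[w], elems[e], simd_lanes(widths[w], elems[e]));
}
```

For example, `simd_lanes(128, 32)` gives the 4 single precision values of SSE, and `simd_lanes(256, 64)` gives the 4 double precision values of AVX.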

X86 ISA: The Instruction Set

• The instruction set for the x86 architecture has been extended numerous times since the set supported by the 8086 processor
– See http://en.wikipedia.org/wiki/X86_instruction_listings for an excellent overview
• Today, the “base” instruction set (“IA32 ISA”) is the one supported by the first 32-bit processor, the 80386
• Multiple “smaller” extensions were added before SSE (1998 / Intel® Pentium® 3), such as
– MMX (64-bit SIMD using the x87 FP registers)
– Conditional move
– Atomic exchange

New Instructions in Haswell (2013)

Group | Description | Count *
AVX2: SIMD integer instructions promoted to 256 bits | Adding vector integer operations to 256-bit | 170 / 124 (all three AVX2 groups together)
AVX2: Gather | Load elements from vector of indices; vectorization enabler | (included above)
AVX2: Shuffling / data rearrangement | Blend, element shift and permute instructions | (included above)
FMA | Fused Multiply-Add operation forms (FMA-3) | 96 / 60
Bit manipulation and cryptography | Improving performance of bit stream manipulation and decode, large integer arithmetic and hashes | 15 / 15
TSX = RTM + HLE | Transactional Memory | 4 / 4
Others | MOVBE: Load and Store of Big Endian forms; INVPCID: Invalidate processor context ID | 2 / 2

* Total instructions / different mnemonics

HSW Improvements for Threading
Sample Code Computing PI by Windows Threads
#include <windows.h>
#include <stdio.h>

#define NUM_THREADS 2

HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
double step;
double global_sum = 0.0;

/* Thread function: sums every NUM_THREADS-th rectangle, starting at *arg. */
DWORD WINAPI Pi(LPVOID arg)
{
    int i, start;
    double x, sum = 0.0;

    start = *(int *) arg;
    for (i = start; i <= num_steps; i = i + NUM_THREADS) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    /* Only the final update of the shared sum is protected by the lock. */
    EnterCriticalSection(&hUpdateMutex);
    global_sum += sum;
    LeaveCriticalSection(&hUpdateMutex);
    return 0;
}

int main(void)
{
    double pi; int i;
    DWORD threadID;
    int threadArg[NUM_THREADS];

    for (i = 0; i < NUM_THREADS; i++)
        threadArg[i] = i + 1;
    step = 1.0 / (double) num_steps;   /* set once, before the threads start */
    InitializeCriticalSection(&hUpdateMutex);

    for (i = 0; i < NUM_THREADS; i++) {
        thread_handles[i] = CreateThread(0, 0, Pi,
                                         &threadArg[i], 0, &threadID);
    }
    WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);

    pi = global_sum * step;
    printf(" pi is %f \n", pi);
    DeleteCriticalSection(&hUpdateMutex);
    return 0;
}

Locks can be key bottleneck – even in case there is no conflict


Intel® Transactional Synchronization Extensions (Intel® TSX)
Intel® TSX = HLE + RTM
HLE (Hardware Lock Elision) is a hint inserted in front of a LOCK operation to indicate a region is a candidate for lock elision
• XACQUIRE (0xF2) and XRELEASE (0xF3) prefixes
• Don’t actually acquire lock, but execute region speculatively
• Hardware buffers loads and stores, checkpoints registers
• Hardware attempts to commit atomically without locks
• If the region cannot commit without locks, restart and execute non-speculatively

RTM (Restricted Transactional Memory) is three new instructions (XBEGIN, XEND, XABORT)
• Similar operation as HLE (except no locks, new ISA)
• If cannot commit atomically, go to handler indicated by XBEGIN
• Provides software additional capabilities over HLE
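
The RTM flow described above (try the region as a transaction, fall back to a handler if it cannot commit) can be sketched in C with the `_xbegin`, `_xend` and `_xabort` intrinsics from `<immintrin.h>`. This is only a sketch under stated assumptions: the spin lock, the shared `counter` and the function names are hypothetical, the transactional path is compiled only when the compiler defines `__RTM__` (for example with an RTM-enabling compiler option), and it must only be executed on a processor that reports RTM support.

```c
#include <stdatomic.h>
#ifdef __RTM__
#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#endif

/* Hypothetical shared state for this sketch. */
static atomic_int fallback_lock = 0;   /* 0 = free, 1 = held */
static long counter = 0;

static void lock_acquire(void)
{
    int expected = 0;
    /* Spin until the lock changes from free (0) to held (1). */
    while (!atomic_compare_exchange_weak(&fallback_lock, &expected, 1))
        expected = 0;
}

static void lock_release(void)
{
    atomic_store(&fallback_lock, 0);
}

/* Add to the shared counter, transactionally if possible. */
void add_to_counter(long value)
{
#ifdef __RTM__
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Reading the lock puts it into the transaction's read set, so a
         * thread that takes the lock later aborts this transaction. */
        if (atomic_load(&fallback_lock) != 0)
            _xabort(0xff);
        counter += value;
        _xend();              /* try to commit atomically */
        return;
    }
    /* An abort resumes here, after _xbegin: this is the handler path. */
#endif
    lock_acquire();           /* non-speculative fallback */
    counter += value;
    lock_release();
}

long read_counter(void)
{
    return counter;
}
```

Without RTM support at compile time the function simply takes the lock, which mirrors the "execute non-speculatively" fallback on the slide.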

Core™ 2 Architecture (Merom)

[Block diagram]
• Front End: Instruction Fetch and Pre Decode (ITLB, 32kB Instruction Cache), Instruction Queue, Decode (4), Rename/Allocate
• Out-Of-Order Execution Engine: Retirement Unit (ReOrder Buffer) (4), Reservation Station (6), Execution Units
• Memory: DTLB, 32kB Data Cache, 2/4/6 MB 2nd Level Cache, Front-Side Bus

NHM/SNB: Enhanced Processor Core

[Block diagram]
• Front End: Instruction Fetch and Pre Decode (ITLB, 32kB Instruction Cache), Instruction Queue, Decode (4)
• Execution Engine: Rename/Allocate, Retirement Unit (ReOrder Buffer) (4), Reservation Station (6), Execution Units
• Memory: DTLB, 2nd Level TLB, 32kB Data Cache, 256kB 2nd Level Cache (MLC - Mid Level Cache)
• Uncore: L3 and beyond

Peak FP Performance per Core & Cycle

Microarchitecture | Single Precision | Double Precision | Comment
Nehalem | 8 | 4 | By SSE; MULT and ADD can start each cycle
Sandy Bridge | 16 | 8 | AVX doubles all due to twice the vector length
Haswell | 32 | 16 | 2 FMA instructions can start each cycle, doubling performance compared to SNB

For a 2-socket Haswell server system with 16 cores in total running at 3 GHz, this will sum up to 1.5 teraflops SP FP peak performance (0.77 for DP)
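
The table values and the system estimate can be reproduced with a few lines of C. This is a sketch of the arithmetic only; the function names are illustrative, and the estimate assumes 16 cores in total across both sockets, which is what the 1.5 teraflops figure implies.

```c
/* FLOPs started per core and cycle:
 * SIMD lanes x floating-point operations per lane and cycle.
 * ops_per_lane is 2 for one MULT plus one ADD per cycle,
 * and 4 for two FMA instructions per cycle (each FMA = 2 operations). */
int flops_per_cycle(int vector_bits, int element_bits, int ops_per_lane)
{
    return (vector_bits / element_bits) * ops_per_lane;
}

/* Peak GFLOPS of a system: cores x clock in GHz x FLOPs per cycle. */
double peak_gflops(int total_cores, double ghz, int flops_per_core_cycle)
{
    return total_cores * ghz * flops_per_core_cycle;
}
```

With these helpers, `flops_per_cycle(128, 32, 2)` is the Nehalem single precision value 8, `flops_per_cycle(256, 32, 4)` is the Haswell value 32, and `peak_gflops(16, 3.0, 32)` gives 1536 GFLOPS, roughly 1.5 teraflops.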

Potential future options and features subject to change without notice.
Common Core, Modular Uncore
• Common “core”
– Same core for server, desktop, mobile
– Incremental improvements to u-arch of current Core architecture
– Common target for SW & optimization
– Common feature set
• Segment differentiation in the “Uncore”
– # of cores
– # of QPI links
– Size of L3 cache
– # IMC channels
– Frequency DDR3
– Integrated graphics (GT)
– …

[Diagram: cores 1 to 4 (Core) on top of the Uncore with Last Level Cache, IMC, two QPI links, power (Pwr) and clock (Clk); DRAM attached to the IMC]

Product | # of Cores | L3$ Size | Memory Controller | QPI Links | Graphic
Desktop i5 / Desktop i3 | 2 | 4MB | 2xDDR3 | N/A | Yes
Desktop i7 NHM | 4 | 8MB | 3xDDR3 | 1 x 4.8 | Yes
Desktop i7 SNB | 6 | 8MB | 3xDDR3 | 1 x 6.4 | Yes
XEON E5-2600 | 2x8 | 20MB | 4xDDR3 | 2 x 8.0 | No
XEON E7-8870 | 4x10 | 30MB | 3xDDR3 | 4 x 6.4 | No

Level 3 Cache
• New 3rd level cache
– Also called LLC – Last Level Cache
• Shared across all cores of processor (socket)
• Size
– NHM: 2MB/core (EX up to 3.0)
– SNB: 2.5 MB/core (today) …
• Latency (in cycles):
– NHM: >=35
– SNB: 25-31
• Inclusive property
– Cache line residing in L1/L2 must also be present in 3rd level cache

[Diagram: several cores, each with its own L1 caches and L2 cache, sharing one L3 cache]

QuickPath Interconnect

• Nehalem introduces new QuickPath Interconnect (QPI)
• High bandwidth, low latency point to point interconnect
• 4.8/6.4/8.0 GT/sec
– E.g. 6.4 GT/sec -> 12.8 GB/sec each direction
• Highly scalable for systems with varying # of sockets

[Diagram: two Nehalem EP CPUs, each with local memory, connected to each other and to an IOH; a second example with four CPUs, each with local memory, and two IOHs]
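
The example on this slide (6.4 GT/sec giving 12.8 GB/sec in each direction) corresponds to 2 bytes of data per transfer. A small C sketch of that conversion; the function and constant names are illustrative:

```c
/* Data bytes moved per transfer in one direction.
 * Taken from the slide's example: 12.8 GB/sec / 6.4 GT/sec = 2 bytes. */
#define QPI_BYTES_PER_TRANSFER 2.0

/* Bandwidth in GB/sec per direction for a link speed given in GT/sec. */
double qpi_gbytes_per_direction(double gigatransfers_per_sec)
{
    return gigatransfers_per_sec * QPI_BYTES_PER_TRANSFER;
}
```

Applied to the three link speeds listed above, this gives 9.6, 12.8 and 16.0 GB/sec per direction.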

Remote Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
– CPU0 requests data from CPU1; request sent over QPI to CPU1
– CPU1’s IMC makes request to its DRAM
– CPU1 snoops internal caches
– Data returned to CPU0 over QPI
• Remote memory latency a function of having a low latency
interconnect
– Typical numbers: Local access 60ns, remote access 90ns

[Diagram: DRAM attached to CPU0, CPU0 connected to CPU1 by QPI, DRAM attached to CPU1]
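
With the typical numbers above (60 ns local, 90 ns remote), the average memory latency a thread sees depends on how many of its accesses stay on the local socket. A rough C sketch, assuming a simple weighted average and ignoring caches and contention; the names are illustrative:

```c
#define LOCAL_LATENCY_NS  60.0   /* typical local access, from the slide  */
#define REMOTE_LATENCY_NS 90.0   /* typical remote access, from the slide */

/* Average DRAM latency in ns when local_fraction (0.0 to 1.0)
 * of all accesses hit memory attached to the local socket. */
double average_latency_ns(double local_fraction)
{
    return local_fraction * LOCAL_LATENCY_NS
         + (1.0 - local_fraction) * REMOTE_LATENCY_NS;
}
```

An even split between local and remote accesses (`average_latency_ns(0.5)`) gives 75 ns.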

Non-NUMA (UMA) Mode
• Addresses interleaved across memory nodes by cache line
– Some systems also support page size granularity
• Accesses may or may not have to cross QPI link

[Diagram: the System Memory Map is interleaved by the memory controller across Socket 0 Memory and Socket 1 Memory, each with three DDR3 channels]

UMA lacks tuning for peak performance but in general delivers good performance without any additional tuning effort
NUMA Mode
• Non-Uniform Memory Access (NUMA)
• Addresses not interleaved across memory nodes by cache line.
• Each CPU has direct access to contiguous block of memory.

[Diagram: the System Memory Map is split by the memory controller into one contiguous block for Socket 0 Memory and one for Socket 1 Memory, each with three DDR3 channels]

Combined with thread affinity (“pinning”), NUMA mode enables potential for peak performance, but it can degrade performance if this is not taken care of
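
The difference between interleaving by cache line (UMA mode) and one contiguous block per CPU (NUMA mode) can be shown as a mapping from physical address to memory node. The C sketch below is a simplified model, not the actual address decoder of any processor; the 64-byte cache line size and the function names are assumptions of this example.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* assumed cache line size */

/* UMA mode: consecutive cache lines alternate between the memory nodes. */
unsigned node_uma(uint64_t address, unsigned nodes)
{
    return (unsigned)((address / CACHE_LINE_BYTES) % nodes);
}

/* NUMA mode: each node owns one contiguous block of bytes_per_node bytes. */
unsigned node_numa(uint64_t address, uint64_t bytes_per_node)
{
    return (unsigned)(address / bytes_per_node);
}
```

In this model, a thread walking linearly through an array touches both nodes alternately in UMA mode, while in NUMA mode it stays on one node until it crosses the block boundary, which is why thread pinning and data placement matter there.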
Uncore Architecture: Sandy Bridge
Significant Bandwidth Increases over Prior Generation
• PCIe BW: ~300% (IIO with Gen3 x16, Gen3 x16 and Gen3 x8)
• Socket to Socket BW: ~250% (two QPI links)
• Cache BW: ~800%
– Cache BW automatically scales with core frequency
• On-Die Interconnect BW: ~900%
• DDR3 BW: ~200% (MC with four DDR3 channels)

[Diagram: eight cores (C) around a 20MB cache, with IIO, QPI and memory controller (MC) attached]

SNB: Scalable Ring On-die Interconnect
• Ring-based interconnect between Cores, Graphics, Last Level Cache (LLC) and System Agent domain
• Composed of 4 rings
– 32 Byte Data ring, Request ring, Acknowledge ring and Snoop ring
– Fully pipelined at core frequency; bandwidth, latency scale with cores
• Access on ring always picks the shortest path – minimize latency
• Distributed arbitration, sophisticated ring protocol to handle coherency, ordering, and core interface
• Scalable to servers with large number of processors

[Diagram: ring connecting the System Agent (DMI, PCI Express*, IMC, Display), four Core/LLC pairs and Graphics]
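
"Always picks the shortest path" on a bidirectional ring means taking whichever direction needs fewer hops. A minimal C sketch of that choice; the stop numbering and the function name are assumptions of this example, and the real ring protocol involves far more (arbitration, coherency, ordering):

```c
/* Hops between two ring stops when traffic may travel in either direction.
 * Stops are numbered 0 .. stops-1 around the ring; from and to must be
 * smaller than stops. */
unsigned ring_hops(unsigned from, unsigned to, unsigned stops)
{
    unsigned clockwise = (to + stops - from) % stops;
    unsigned counter_clockwise = stops - clockwise;

    return clockwise < counter_clockwise ? clockwise : counter_clockwise;
}
```

On a ring with 6 stops, going from stop 0 to stop 5 takes 1 hop instead of 5, and the worst case is 3 hops, half the ring.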

Intel® Turbo Boost Improvements

Dynamic Adaption in Sandy Bridge
• After idle periods, the system accumulates “energy budget” and can accommodate high power/performance for a few seconds
• In Steady State conditions the power stabilizes on TDP
• Use accumulated energy budget to enhance user experience

[Figure: power over time. Thermal budget builds up during idle periods (sleep or low power); in C0 (Turbo), “Next Gen Turbo Boost” raises power above “TDP” before it settles at TDP. Axes: Power, Time]
Simultaneous Multi-Threading (SMT)
“Intel Hyper-Threading – HT”
• Run 2 threads at the very same time per core
• Available on Nehalem (and successors) as well as Intel® Atom™ Architecture
• Take advantage of 4-wide execution engine
– Keep it fed with multiple threads
– Hide latency of a single thread
• Most power efficient performance feature
– Very low die area cost
– Can provide significant performance benefit depending on application
– Much more efficient than adding an entire core
• Nehalem advantages
– Larger caches
– Massive memory BW

[Figure: execution unit occupancy over time (proc. cycles), w/o SMT compared with SMT. Note: Each box represents a processor execution unit]

SMT Performance Chart NHM

[Bar chart: Performance Gain SMT enabled vs disabled on Intel® Core™ i7, vertical axis 0% to 40%. Benchmarks: Floating Point, 3dsMax*, Integer, Cinebench* 10, POV-Ray* 3.7 beta 25, 3DMark* Vantage* CPU. Bar labels: 34%, 29%, 16%, 13%, 10%, 7%]

Floating Point is based on SPECfp_rate_base2006* estimate
Integer is based on SPECint_rate_base2006* estimate

SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation.
For more information on SPEC benchmarks, see: http://www.spec.org

Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/
Memory Bandwidth and Performance
Sample Estimations

Platform | Memory bandwidth per core | GFLOPs (DP) per core | FLOPS per DP data move to get peak
NHM: 32 GB/s per socket, 4 cores | 8.00 GB/s = (3ch x 1333 x 8bytes)/4 | 12 = 4 x 3GHz | 12.0
WSM: 32 GB/s per socket, 6 cores | 5.33 GB/s = (3ch x 1333 x 8bytes)/6 | 9.6 = 4 x 2.4GHz | 14.4
SNB: 51 GB/s per socket, 8 cores, SSE | 6.40 GB/s = (4ch x 1600 x 8bytes)/8 | 9.6 = 4 x 2.4GHz | 12.0
SNB: 51 GB/s per socket, 8 cores, AVX | 6.40 GB/s = (4ch x 1600 x 8bytes)/8 | 19.2 = 8 x 2.4GHz | 24.0
Itanium 2 “Montecito”, dual core | 5.40 GB/s = (0.677GHz x 16bytes)/2 | 6.4 = 4 x 1.6GHz | 9.5

Tuning for memory bandwidth remains key challenge!
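
The estimates in this table follow three formulas: bandwidth per core = channels x transfer rate x 8 bytes / cores, GFLOPs per core = DP FLOPs per cycle x clock, and FLOPS per DP data move = GFLOPs divided by the number of 8-byte values delivered per second. A C sketch of the calculation; the struct and function names are illustrative, and the inputs are the ones printed in the table:

```c
/* Inputs for one platform row of the table. */
typedef struct {
    int    channels;            /* memory channels per socket       */
    double mega_transfers;      /* transfer rate per channel, MT/s  */
    int    cores;               /* cores per socket                 */
    int    dp_flops_per_cycle;  /* peak DP FLOPs per core and cycle */
    double ghz;                 /* core clock in GHz                */
} platform;

/* Memory bandwidth per core in GB/s (8 bytes per transfer and channel). */
double gbytes_per_core(const platform *p)
{
    return p->channels * p->mega_transfers * 8.0 / 1000.0 / p->cores;
}

/* Peak double precision GFLOPs per core. */
double gflops_per_core(const platform *p)
{
    return p->dp_flops_per_cycle * p->ghz;
}

/* FLOPs that must be performed per 8-byte value loaded to reach peak. */
double flops_per_dp_move(const platform *p)
{
    return gflops_per_core(p) / (gbytes_per_core(p) / 8.0);
}
```

For the SNB AVX row, a `platform` of `{4, 1600.0, 8, 8, 2.4}` gives 6.4 GB/s per core, 19.2 GFLOPs and 24 FLOPS per double precision value moved.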


References
• Intel Software Development and Optimization manual
• Session from Intel Developer Forum on processor architecture – www.intel.com/idf
• Michael E. Thomadakis, Texas A&M University, “The Architecture of the Nehalem Processor …”
• Agner Fog, “The microarchitecture of Intel, AMD and VIA CPUs …”
• Wikipedia
– x86
– x86 assembly language

Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO
LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS
INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the
approximate performance of Intel products as measured by those tests. Any difference in system hardware or
software design or configuration may affect actual performance. Buyers should consult other sources of information
to evaluate the performance of systems or components they are considering purchasing. For more information on
performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2012. Intel Corporation.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are
not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.

Notice revision #20110804

Intel® Atom™ Processor: Block Diagram

[Block diagram]
• Front-End Cluster: Branch Prediction Unit, MS, Instruction Cache, Inst. TLB, per-thread Prefetch Buffers, 2-wide ILD, XLAT / FL (two), per-thread Instruction Queues
• FP / SIMD execution cluster: per-thread FP Register File, Shuffle, ALUs, SIMD multiplier, FP adder, FP multiplier, FP move, FP ROM, FP divider, FP store
• Integer Execution Cluster: per-thread Integer Register File, ALUs, Shifter, JEU
• Memory Execution Cluster: two AGUs, DL1 prefetcher, Data TLBs, Data Cache, PMH, Fill + Write combining buffers
• Bus Cluster: L2 Cache, BIU, APIC, FSB
• Fault / Retire


Intel® Xeon Phi™ Overview

[Diagram: many multi-threaded, wide SIMD processor cores, each with I$ and D$ and a coherent L2 cache, connected to shared memory controllers with GDDR memory and to a PCIe x16 interface]

Standard IA Shared Memory Programming

Future options subject to change without notice.



Intel® Xeon Phi™ Microarchitecture Overview

[Diagram: cores, each with an L2 cache and a tag directory (TD), arranged around an on-die interconnect together with GDDR memory controllers (GDDR MC) and the PCIe client logic]

TD: Tag Directory
L2: L2-Cache
MC: Memory Controller

For illustration only.

Interleaved Memory Access

[Diagram: ring of cores, each with an L2 cache and a tag directory (TD), with the GDDR memory controllers (GDDR MC) interleaved between the cores around the ring]

For illustration only.

Intel® Xeon Phi™ Core

Intel® Xeon Phi™ co-processor core:
• Scalar pipeline derived from the dual-issue Pentium processor
• Short execution pipeline
• Fully coherent cache structure
• Significant modern enhancements, such as multi-threading, 64-bit extensions, and sophisticated pre-fetching
• 4 execution threads per core
• Separate register sets per thread
• 32KB instruction cache and 32KB data cache for each core

Enhanced instruction set with:
• Over 100 new instructions
• Wide vector processing operations, incl. gather/scatter and masking
• Some specialized scalar instructions
• 3-operand, 16-wide vector processing unit (VPU)
• VPU executes integer, SP-float, and DP-float instructions
• Supports IEEE 754 2008 for floating point arithmetic

Interprocessor Network: 1024 bits wide, bi-directional (512 bits in each direction)

[Diagram: Instruction Decode feeding a Scalar Unit and a Vector Unit with their Scalar and Vector Registers; L1 I-Cache & D-Cache; 512K L2 Cache Local Subset; Interprocessor Network]


Intel® Xeon Phi™ Processor Core

[Block diagram: four thread instruction pointers (T0 IP to T3 IP); L1 TLB and 32KB Code Cache; 16B/Cycle fetch (2 IPC); 4 threads, in-order Decode with uCode; Pipe 0 and Pipe 1; VPU RF, X87 RF and Scalar RF; VPU with 512b SIMD, X87, ALU 0 and ALU 1; L1 TLB and 32KB Data Cache; L2 TLB, L2 Control, HWP, TLB Miss Handler and 512KB L2 Cache; connection to the On-Die Interconnect. Code cache, data cache and TLB misses are routed to the L2]

For illustration only.

Vector/SIMD High Computational Density

[Diagram, left (Core): Instruction Decode; Scalar Unit and Vector Unit; Scalar Register and Vector Register; L1 I-Cache & D-Cache; 512K L2 Cache Local Subset; Interprocessor Network]
[Diagram, right (Vector/SIMD Unit): Mask Registers; 16-wide Vector ALU; Replicate and Reorder; Vector Registers; Numeric Convert (two); L1 Data Cache]


VPU Block Diagram

[Block diagram. Vector/SIMD Part (VPU): 8x 16b Vmask; 32x 512b Vreg per thread (T0 to T3); Data Convert / Broadcast; Data Swizzle; multiply (*) and add (+) units; 512b data paths; 4 cycles. Scalar Part: Scalar Register, Scalar Units. Memory side: MEMORY, L2, L1]

Intel® Xeon Phi™ Coprocessor: Programming Model

It’s a Supercomputer on a chip

Restricted Architectures (Custom HW Acceleration: GPU, ASIC, FPGA)
• Run restricted code

Intel® Xeon Phi™ Coprocessor
• Operate as a compute node
• Run a full OS
• Run MPI
• Run OpenMP*
• Run x86 code
• Run offloaded code

Restrictive architectures limit the ability for applications to use arbitrary nested parallelism, function calls and threading models
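
As an illustration of the "Run OpenMP*" and "Run x86 code" points, the sketch below computes pi with the same midpoint-rule integration as the Windows Threads sample earlier in this deck, written as ordinary C with a standard OpenMP reduction. The function name `compute_pi` is illustrative. The pragma is standard OpenMP; a compiler without OpenMP enabled ignores it and runs the loop serially. How such code is built for and started on a coprocessor is not shown and depends on the tool chain.

```c
/* Midpoint-rule integration of 4/(1+x*x) over [0,1], which equals pi.
 * The reduction clause gives every thread a private partial sum and
 * combines them at the end, so no explicit lock is needed. */
double compute_pi(long num_steps)
{
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= num_steps; i++) {
        double x = (i - 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}
```

Calling `compute_pi(100000)` uses the same number of steps as the threaded sample.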

Well, it is an SMP-on-a-chip running Linux*

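Intel® Xeon Phi™ Environment

[Diagram. Physical View: Xeon host and MIC coprocessor, each with its own memory (MEM). Logical Views: NATIVE (autonomous: the coprocessor appears as a networked Linux node reachable via IP, SSH, FTP, NFS, ...) and OFFLOAD (heterogeneous: host and coprocessor with their memories, attached to the network)]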

Flexible Execution Models
Optimized Performance for different Workloads

[Diagram: a single source code base passes through compilers, libraries and runtime systems. Across a spectrum from serial and moderately parallel code to highly parallel code, main() and the results are placed on Xeon®, on Xeon Phi™, or on both]

Execution models shown:
• Multicore Only
• Multicore Hosted with Many-Core Offload
• Symmetric
• Many-Core Only

Flexible Execution Models
Optimized Performance for different Usage Models

[Diagram: combinations of Xeon® and Xeon Phi™ connected by MPI or by directives]

Usage models shown:
• NATIVE ONLY
• OFFLOAD
• CO-WORKER
• SYMMETRIC
