
Intel® Processor

Architecture

January 2013

Software & Services Group, Developer Products Division

Copyright © 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Agenda
•Overview Intel® processor architecture
•Intel x86 ISA (instruction set architecture)
•Micro-architecture of processor core
•Uncore structure
•Additional processor features
– Hyper-threading
– Turbo mode
•Summary

Intel® Processor Segments Today
• Intel® ATOM™ Architecture
– Target platforms: phone, tablet, netbook, low-power server
– ISA: x86 up to SSSE-3, 32 and 64 bit
– Specific features: optimized for low power, in-order
• Intel® Core™ Architecture
– Target platforms: mainstream notebook, desktop, server
– ISA: x86 up to Intel® AVX, 32 and 64 bit
– Specific features: flexible feature set covering all needs
• Intel® Itanium® Architecture
– Target platforms: high end server
– ISA: IA64, x86 by emulation
– Specific features: RAS, large address space
• Intel® MIC Architecture
– Target platforms: accelerator for HPC
– ISA: x86 and Intel® MIC Instruction Set
– Specific features: 60+ cores, optimized for Floating-Point performance
Itanium® 9500 (Poulson)
New Itanium Processor
• Compatible with Itanium® 9300 processors (Tukwila)
• New micro-architecture with 8 cores
• 54 MB on-die cache
• Improved RAS and power management capabilities
• Doubles execution width from 6 to 12 instructions/cycle
• 32nm process technology
• Launched in November 2012
Figure: Poulson processor block diagram with 8 cores, 32MB shared last level cache, memory controllers and link controllers, 4 Intel Scalable Memory Interface (SMI) links, and 4 full + 2 half width Intel QuickPath Interconnect links
Compatibility provides protection for today’s Itanium® investment
Intel® XEON™ Phi
Former Code Name “Knights Corner”
Intel® XEON™ Phi - The first product implementation of the Intel® Many Integrated Core Architecture (Intel® MIC)
X86: From Smartphones to …
Motorola RAZR* i
• Launched September 2012
• RAZR i is the first smartphone that can achieve
speeds of 2.0 GHz

• Processor: Intel® ATOM™ Z2460
X86: … to Supercomputers
LRZ SuperMUC System
• Installed summer 2012
– Most powerful x86-architecture based computer
– #6 on Top500 list
– More than 150000 cores
• Processor:
– Intel® Xeon® E5-2680 (“Sandy Bridge”)
– Intel® Xeon® E7-4870 (“Westmere”)
Intel Tick-Tock Roadmap
for Mainstream x86 Architecture since 2006
• Intel® Core™ Micro Architecture
– Merom (2006, 65nm): TOCK, new micro architecture, adds SSSE-3
– Penryn (2007, 45nm): TICK, new process technology, adds SSE4.1
• Intel® Core™ Micro Architecture Codename “Nehalem”
– Nehalem (2008, 45nm): TOCK, new micro architecture, adds SSE4.2
– Westmere (2009, 32nm): TICK, new process technology, adds AES
• 2nd Generation Intel® Core™ Micro Architecture
– Sandy Bridge (2011, 32nm): TOCK, new micro architecture, adds AVX
• 3rd Generation Intel® Core™ Micro Architecture
– Ivy Bridge (2012, 22nm): TICK, new process technology, adds 7 new instructions
TICK + TOCK = SHRINK + INNOVATE
To be continued ...
• 4th Generation Intel® Core™ Micro Architecture
– Haswell (2013, 22nm): TOCK, new micro architecture, adds AVX-2
• Generation name TBD
– Broadwell (>= 2014, 14nm): TICK, new process technology
– TBD (???, 14nm): TOCK, new micro architecture
– TBD (???, 10nm): TICK, new process technology
– TBD (???, 10nm): TOCK, new micro architecture
– TBD (???, 7nm): TICK, new process technology
TICK + TOCK = SHRINK + INNOVATE
Register State for Intel® Pentium® III Processor (1999)
• IA32-INT Registers (32 bit): eax … edi
– Fourteen 32-bit registers
– Scalar data & addresses
– Direct access to regs
• MMX Technology / IA-FP Registers (80 bit for st0 … st7, 64 bit for mm0 … mm7)
– Eight 80/64-bit registers
– Hold data only
– Direct access to MM0..MM7
– No MMX™ Technology / FP interoperability
• SSE Registers (128 bit): xmm0 … xmm7
– Eight 128-bit registers
– Hold data only: 4 x single FP numbers, 2 x double FP numbers, 128-bit packed integers
SSE Vector Types
• Intel® SSE
– 4x single precision FP
• Intel® SSE2
– 2x double precision FP
– 16x 8 bit integer
– 8x 16 bit integer
– 4x 32 bit integer
– 2x 64 bit integer
– plain 128 bit
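
The lane counts listed for the SSE vector types follow from dividing the 128-bit register width by the element width. The following is a minimal C sketch of that arithmetic plus one packed 32-bit integer addition. It assumes a compiler that ships `<emmintrin.h>` on x86; the `__SSE2__` guard, the scalar fallback, and the helper names `lanes` and `add4_i32` are illustrative additions, not part of the original slides.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi32, ... */
#endif

/* Number of elements that fit into one vector register. */
static int lanes(int register_bits, int element_bits)
{
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}

/* Add four 32-bit integers at once (one packed add when SSE2 is available). */
static void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load  */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    for (int i = 0; i < 4; i++)   /* portable fallback, same result */
        out[i] = a[i] + b[i];
#endif
}
```

Calling `lanes(128, 32)` gives the "4x 32 bit integer" entry, and `add4_i32` processes exactly that many elements per call.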

AVX Vector Types
• Intel® AVX
– 8x single precision FP
– 4x double precision FP
• Intel® AVX2 (Future)
– 32x 8 bit integer
– 16x 16 bit integer
– 8x 32 bit integer
– 4x 64 bit integer
– plain 256 bit
X86 ISA: The Instruction Set
• The instruction set for the x86 architecture has been extended numerous times since the set supported by the 8086 processor
– See http://en.wikipedia.org/wiki/X86_instruction_listings for an excellent overview
• Today, the “base” instruction set (“IA32 ISA”) is the one supported by the first 32-bit processor, the 80386
• Several smaller extensions were added before SSE (1999 / Intel® Pentium® III), such as
– MMX (64 bit SIMD using the x87 FP registers)
– Conditional move
– Atomic exchange
New Instructions in Haswell (2013)
Count is given as total instructions / different mnemonics
• AVX2 (count: 170 / 124)
– SIMD Integer Instructions promoted to 256 bits: adding vector integer operations to 256-bit
– Gather: load elements from vector of indices, vectorization enabler
– Shuffling / Data Rearrangement: blend, element shift and permute instructions
• FMA (count: 96 / 60)
– Fused Multiply-Add operation forms (FMA-3)
• Bit Manipulation and Cryptography (count: 15 / 15)
– Improving performance of bit stream manipulation and decode, large integer arithmetic and hashes
• TSX=RTM+HLE (count: 4 / 4)
– Transactional Memory
• Others (count: 2 / 2)
– MOVBE: Load and Store of Big Endian forms
– INVPCID: Invalidate processor context ID
HSW Improvements for Threading
Sample Code Computing PI by Windows Threads

#include <windows.h>
#include <stdio.h>

#define NUM_THREADS 2

CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
double step;
double global_sum = 0.0;

/* Each thread integrates every NUM_THREADS-th step, starting at its own offset. */
DWORD WINAPI Pi(LPVOID arg)
{
    int i, start;
    double x, sum = 0.0;

    start = *(int *) arg;
    for (i = start; i <= num_steps; i = i + NUM_THREADS) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    /* The lock is taken once per thread to publish the partial sum. */
    EnterCriticalSection(&hUpdateMutex);
    global_sum += sum;
    LeaveCriticalSection(&hUpdateMutex);
    return 0;
}

int main(void)
{
    HANDLE thread_handles[NUM_THREADS];
    DWORD threadID;
    int threadArg[NUM_THREADS];
    double pi;
    int i;

    for (i = 0; i < NUM_THREADS; i++)
        threadArg[i] = i + 1;

    /* Set step before any thread starts, so no thread writes shared state unlocked. */
    step = 1.0 / (double) num_steps;
    InitializeCriticalSection(&hUpdateMutex);

    for (i = 0; i < NUM_THREADS; i++) {
        thread_handles[i] = CreateThread(0, 0, Pi, &threadArg[i], 0, &threadID);
    }
    WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);

    pi = global_sum * step;
    printf(" pi is %f \n", pi);
    DeleteCriticalSection(&hUpdateMutex);
    return 0;
}

Locks can be a key bottleneck, even when there is no conflict

Copyright© 2012, Intel Corporation. All rights reserved. Partially Intel Confidential Information.
*Other brands and names are the property of their respective owners.
Intel® Transactional Synchronization Extensions (Intel® TSX)
Intel® TSX = HLE + RTM
HLE (Hardware Lock Elision) is a hint inserted in front of a LOCK operation to indicate a region is a candidate for lock elision
• XACQUIRE (0xF2) and XRELEASE (0xF3) prefixes
• Don’t actually acquire the lock, but execute the region speculatively
• Hardware buffers loads and stores, checkpoints registers
• Hardware attempts to commit atomically without locks
• If it cannot commit without locks, restart and execute non-speculatively
RTM (Restricted Transactional Memory) is three new instructions (XBEGIN, XEND, XABORT)
• Similar operation as HLE (except no locks, new ISA)
• If it cannot commit atomically, go to the handler indicated by XBEGIN
• Provides software additional capabilities over HLE
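
The RTM flow described above (try a transaction, fall back to a handler if it cannot commit) can be sketched in C with the `_xbegin`, `_xend` and `_xabort` compiler intrinsics. This is a hedged sketch, not Intel's reference code: the spin lock, the retry budget `MAX_RETRIES`, and the names `add_to_sum` and `rtm_supported` are assumptions made for illustration. The transactional path is compiled only when the compiler defines `__RTM__` (for example GCC or Clang with `-mrtm`) and is taken only if CPUID reports RTM; otherwise the code always uses the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#if defined(__RTM__)
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang) */
#include <immintrin.h>  /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#endif

#define MAX_RETRIES 3            /* hypothetical retry budget */

static atomic_int lock_word;     /* 0 = free, 1 = held (fallback spin lock) */
static double global_sum;

static void lock_acquire(void)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock_word, &expected, 1))
        expected = 0;            /* a failed CAS overwrote it; reset and retry */
}

static void lock_release(void)
{
    assert(atomic_load(&lock_word) == 1);
    atomic_store(&lock_word, 0);
}

#if defined(__RTM__)
/* CPUID leaf 7, sub-leaf 0, EBX bit 11 reports RTM support. */
static int rtm_supported(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ebx >> 11) & 1u);
}
#endif

/* Add a partial sum to the shared total, eliding the lock when possible. */
static void add_to_sum(double value)
{
#if defined(__RTM__)
    if (rtm_supported()) {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            if (_xbegin() == _XBEGIN_STARTED) {
                /* Reading the lock adds it to the read set, so a thread
                   that takes the lock aborts this transaction. */
                if (atomic_load(&lock_word) != 0)
                    _xabort(0xff);
                global_sum += value;
                _xend();         /* commit atomically */
                return;
            }
            /* Aborted: execution resumes here with an abort status. */
        }
    }
#endif
    /* Fallback handler: the non-speculative, locked path. */
    lock_acquire();
    global_sum += value;
    lock_release();
}
```

In the Pi sample, each thread would call `add_to_sum(sum)` instead of entering the critical section; the result is the same on either path.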

Core™ 2 Architecture (Merom)
Figure: core block diagram
• Front End: instruction fetch and pre-decode, 32kB instruction cache, ITLB, instruction queue, decode (4 wide), rename/allocate
• Out-Of-Order Execution Engine: retirement unit (ReOrder Buffer, 4 wide), reservation station, 6 execution units
• Memory: DTLB, 32kB data cache, 2/4/6 MB 2nd level cache, Front-Side Bus
NHM/SNB: Enhanced Processor Core
Figure: core block diagram
• Front End: instruction fetch and pre-decode, 32kB instruction cache, ITLB, instruction queue, decode (4 wide), rename/allocate
• Execution Engine: retirement unit (ReOrder Buffer, 4 wide), reservation station, 6 execution units
• Memory: DTLB, 2nd level TLB, 32kB data cache, 256kB 2nd level cache (MLC - Mid Level Cache)
• Uncore: L3 and beyond
Peak FP Performance per Core & Cycle
• Nehalem: 8 single precision, 4 double precision
– By SSE; MULT and ADD can start each cycle
• Sandy Bridge: 16 single precision, 8 double precision
– AVX doubles both due to twice the vector length
• Haswell: 32 single precision, 16 double precision
– 2 FMA instructions can start each cycle, doubling performance compared to SNB
For a 2-socket Haswell server system with 16 cores in total running at 3 GHz, this sums up to 1.5 teraflops SP FP peak performance (0.77 for DP)
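
The system-level figure is the product of core count, FLOPs per core per cycle, and clock frequency. A minimal C sketch of that arithmetic follows; the function name `peak_gflops` is hypothetical, and the example reads the server as having 16 cores in total across both sockets, which is the reading that reproduces the quoted 1.5 and 0.77 teraflops.

```c
#include <assert.h>

/* Peak GFLOP/s = total cores x FLOPs per core per cycle x clock in GHz. */
static double peak_gflops(int total_cores, int flops_per_cycle, double ghz)
{
    assert(total_cores > 0 && flops_per_cycle > 0 && ghz > 0.0);
    return total_cores * flops_per_cycle * ghz;
}
```

With the Haswell per-cycle values, `peak_gflops(16, 32, 3.0)` yields 1536 GFLOP/s (about 1.5 teraflops) for single precision and `peak_gflops(16, 16, 3.0)` yields 768 GFLOP/s (about 0.77 teraflops) for double precision.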

Potential future options and features subject to change without notice.
Common Core, Modular Uncore
• Common “core”
– Same core for server, desktop, mobile
– Incremental improvements to u-arch of current Core architecture
– Common target for SW & optimization
– Common feature set
• Segment differentiation in the “Uncore”
– # of cores
– # of QPI links
– Size of L3 cache
– # IMC channels
– Frequency DDR3
– Integrated graphics (GT)
– …
Figure: four cores (Core) on top of the Uncore, which holds the last level cache, IMC (to DRAM), QPI links, and power and clock control

Segment                  # of Cores  L3$ Size  Memory Controller  QPI Links  Graphic
Desktop i5 / Desktop i3  2           4MB       2xDDR3             N/A        Yes
Desktop i7 NHM           4           8MB       3xDDR3             1 x 4.8    Yes
Desktop i7 SNB           6           8MB       3xDDR3             1 x 6.4    Yes
XEON E5-2600             2x8         20MB      4xDDR3             2 x 8.0    No
XEON E7-8870             4x10        30MB      3xDDR3             4 x 6.4    No
Level 3 Cache
• New 3rd level cache
– Also called LLC – Last Level Cache
• Shared across all cores of processor (socket)
• Size
– NHM: 2MB/core (EX up to 3.0)
– SNB: 2.5 MB/core (today) …
• Latency (cycles):
– NHM: >=35
– SNB: 25-31
• Inclusive property
– A cache line residing in L1/L2 must also be present in the 3rd level cache
Figure: three cores, each with private L1 caches and an L2 cache, sharing one L3 cache
QuickPath Interconnect
• Nehalem introduces new QuickPath Interconnect (QPI)
• High bandwidth, low latency point to point interconnect
• 4.8/6.4/8.0 GT/sec
– E.g. 6.4 GT/sec -> 12.8 GB/sec each direction
• Highly scalable for systems with varying # of sockets
Figure: two Nehalem EP sockets with local memory connected through an IOH; four CPUs with local memory and two IOHs
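
The slide's example, 6.4 GT/sec giving 12.8 GB/sec in each direction, implies 2 bytes of payload per transfer. The C sketch below applies that factor to the other listed speeds; the constant and function names are hypothetical, and the 2-byte factor is taken from the slide's own example rather than from a link specification.

```c
#include <assert.h>

/* Payload per transfer implied by "6.4 GT/sec -> 12.8 GB/sec". */
#define QPI_BYTES_PER_TRANSFER 2.0

/* Bandwidth in GB/sec for one direction of one QPI link. */
static double qpi_gbytes_per_sec(double gtransfers_per_sec)
{
    assert(gtransfers_per_sec > 0.0);
    return gtransfers_per_sec * QPI_BYTES_PER_TRANSFER;
}
```

Under this assumption, the 4.8 and 8.0 GT/sec variants deliver 9.6 and 16 GB/sec per direction.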

Remote Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
– CPU0 requests data from CPU1; request sent over QPI to CPU1
– CPU1’s IMC makes request to its DRAM
– CPU1 snoops internal caches
– Data returned to CPU0 over QPI
• Remote memory latency is a function of having a low latency interconnect
– Typical numbers: Local access 60ns, remote access 90ns
Figure: DRAM attached to CPU0, CPU0 linked to CPU1 by QPI, DRAM attached to CPU1
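
To see what the local and remote numbers mean for an application, the average memory latency can be estimated as a weighted mix of the two. This is a simplified sketch that ignores caching and contention; the function name `avg_latency_ns` and the idea of a fixed remote-access fraction are assumptions for illustration.

```c
#include <assert.h>

/* Average latency when a given fraction of accesses goes to remote memory. */
static double avg_latency_ns(double remote_fraction,
                             double local_ns, double remote_ns)
{
    assert(remote_fraction >= 0.0 && remote_fraction <= 1.0);
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns;
}
```

With the typical numbers from the slide, `avg_latency_ns(0.5, 60.0, 90.0)` is 75 ns: a thread whose data is spread evenly over both sockets pays 25% more latency than one whose data is entirely local.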

Non-NUMA (UMA) Mode
• Addresses interleaved across memory nodes by cache line
– Some systems also support page size granularity
• Accesses may or may not have to cross QPI link
Figure: system memory map interleaved across Socket 0 memory and Socket 1 memory (three DDR3 channels per memory controller)
UMA lacks tuning for peak performance but in general delivers good performance without any additional tuning effort
NUMA Mode
• Non-Uniform Memory Access (NUMA)
• Addresses not interleaved across memory nodes by cache line
• Each CPU has direct access to contiguous block of memory
Figure: system memory map split into one contiguous block for Socket 0 memory and one for Socket 1 memory (three DDR3 channels per memory controller)
Combined with thread affinity (“pinning”), NUMA mode enables potential for peak performance, but it can degrade performance if not taken care of
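
The difference between cache-line interleaving (non-NUMA mode) and contiguous per-socket blocks (NUMA mode) comes down to how a physical address is mapped to a memory node. The C sketch below shows both mappings under simplifying assumptions: a 64-byte cache line, equally sized nodes, and hypothetical function names; real memory maps are configured by firmware and can be more involved.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* assumed cache-line size */

/* Interleaved mapping: consecutive cache lines rotate across the nodes. */
static unsigned node_interleaved(uint64_t addr, unsigned num_nodes)
{
    assert(num_nodes > 0);
    return (unsigned)((addr / CACHE_LINE_BYTES) % num_nodes);
}

/* NUMA mapping: each node owns one contiguous block of addresses. */
static unsigned node_contiguous(uint64_t addr, uint64_t bytes_per_node)
{
    assert(bytes_per_node > 0);
    return (unsigned)(addr / bytes_per_node);
}
```

With interleaving, a linear walk through an array touches both sockets no matter where the thread runs. With the contiguous mapping, a pinned thread that stays inside its own node's block never crosses the QPI link, which is where the peak-performance potential comes from.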
Uncore Architecture: Sandy Bridge
Significant Bandwidth Increases over Prior Generation
• PCIe BW: ~300%
• Socket to Socket BW: ~250%
• Cache BW: ~800%
– Cache BW automatically scales with core frequency
• DDR3 BW: ~200%
• On-Die Interconnect BW: ~900%
Figure: uncore block diagram with 8 cores (C) around a 20MB cache, an IIO with PCIe Gen3 x16, x16 and x8 links, two QPI links, and a memory controller (MC) with DDR3 channels
SNB: Scalable Ring On-die Interconnect
• Ring-based interconnect between Cores, Graphics, Last Level Cache (LLC) and System Agent domain
• Composed of 4 rings
– 32 Byte Data ring, Request ring, Acknowledge ring and Snoop ring
– Fully pipelined at core frequency: bandwidth and latency scale with cores
• Access on ring always picks the shortest path to minimize latency
• Distributed arbitration, sophisticated ring protocol to handle coherency, ordering, and core interface
• Scalable to servers with large number of processors
Figure: ring connecting the System Agent (IMC, Display, DMI, PCI Express*), four Core/LLC pairs, and Graphics
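
Picking the shortest path on a bidirectional ring means comparing the hop count in both directions. The following C sketch shows that choice for a ring with a given number of stops; the function name `ring_hops` and the numbering of stops are hypothetical and do not describe the actual ring protocol.

```c
#include <assert.h>

/* Hops between two ring stops when traffic may travel in either direction. */
static unsigned ring_hops(unsigned from, unsigned to, unsigned stops)
{
    assert(stops > 0 && from < stops && to < stops);
    unsigned clockwise = (to + stops - from) % stops;
    unsigned counter   = stops - clockwise;   /* the other way around */
    return clockwise < counter ? clockwise : counter;
}
```

On a ring with 6 stops, the worst case is 3 hops instead of the 5 that a one-directional ring would need.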

Intel® Turbo Boost Improvements

Dynamic Adaptation in Sandy Bridge
• Build up thermal budget during idle periods (sleep or low power)
• After idle periods, the system accumulates “energy budget” and can accommodate high power/performance for a few seconds (C0 Turbo, “Next Gen Turbo Boost”)
• Use accumulated energy budget to enhance user experience
• In steady state conditions the power stabilizes on TDP
Figure: power over time, rising from sleep or low power to the turbo level above “TDP” and then settling at TDP
Simultaneous Multi-Threading (SMT)
“Intel Hyper-Threading – HT”
• Run 2 threads at the very same time per core
• Available on Nehalem (and successors) as well as Intel® ATOM Architecture
• Take advantage of 4-wide execution engine
– Keep it fed with multiple threads
– Hide latency of a single thread
• Most power efficient performance feature
– Very low die area cost
– Can provide significant performance benefit depending on application
– Much more efficient than adding an entire core
• Nehalem advantages
– Larger caches
– Massive memory BW
Figure: execution unit occupancy over time (proc. cycles), without SMT vs. with SMT. Note: each box represents a processor execution unit
SMT Performance Chart NHM
Figure: bar chart of performance gain, SMT enabled vs disabled, on Intel® Core™ i7 for Floating Point, 3dsMax*, Integer, Cinebench* 10, POV-Ray* 3.7 beta 25 and 3DMark* Vantage* CPU; the reported gains range from 7% to 34%
Floating Point is based on SPECfp_rate_base2006* estimate
Integer is based on SPECint_rate_base2006* estimate

SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation.
For more information on SPEC benchmarks, see: http://www.spec.org

Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/
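Memory Bandwidth and Performance
Sample Estimations
• NHM: 32GB/s per socket, 4 cores
– Memory bandwidth per core: 8.00GB/s = (3ch x 1333 x 8bytes)/4
– GFLOPs (DP) per core: 12 = 4 x 3GHz
– FLOPS per DP data move to get peak: 12.0
• WSM: 32GB/s per socket, 6 cores
– Memory bandwidth per core: 5.33GB/s = (3ch x 1333 x 8bytes)/6
– GFLOPs (DP) per core: 9.6 = 4 x 2.4GHz
– FLOPS per DP data move to get peak: 14.4
• SNB: 51GB/s per socket, 8 cores, SSE
– Memory bandwidth per core: 6.40GB/s = (4ch x 1600 x 8bytes)/8
– GFLOPs (DP) per core: 9.6 = 4 x 2.4GHz
– FLOPS per DP data move to get peak: 12.0
• SNB: 51GB/s per socket, 8 cores, AVX
– Memory bandwidth per core: 6.40GB/s = (4ch x 1600 x 8bytes)/8
– GFLOPs (DP) per core: 19.2 = 8 x 2.4GHz
– FLOPS per DP data move to get peak: 24.0
• Itanium 2 “Montecito”: dual core
– Memory bandwidth per core: 5.40GB/s = (0.677GHz x 16bytes)/2
– GFLOPs (DP) per core: 6.4 = 4 x 1.6GHz
– FLOPS per DP data move to get peak: 9.5

Tuning for memory bandwidth remains a key challenge!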
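
The last column of these estimates can be reproduced by dividing a core's peak DP rate by the number of doubles it can fetch per second. The C sketch below shows the calculation; the function names are hypothetical and the 8 bytes per double-precision value is the only constant assumed beyond the slide's figures.

```c
#include <assert.h>

#define BYTES_PER_DOUBLE 8.0

/* Memory bandwidth available to each core, in GB/s. */
static double gb_per_core(double socket_gb_per_sec, int cores)
{
    assert(cores > 0);
    return socket_gb_per_sec / cores;
}

/* FLOPs that must be executed per double moved from memory to reach peak. */
static double flops_per_double(double gflops_per_core,
                               double gb_per_sec_per_core)
{
    assert(gb_per_sec_per_core > 0.0);
    return gflops_per_core / (gb_per_sec_per_core / BYTES_PER_DOUBLE);
}
```

For NHM, `gb_per_core(32.0, 4)` gives 8 GB/s and `flops_per_double(12.0, 8.0)` gives 12: a kernel has to perform 12 floating-point operations on every double it loads to keep the core at peak, and the AVX figures for SNB double that requirement to 24.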

References
• Intel Software Development and Optimization manual
• Session from Intel Developer Forum on processor architecture – www.intel.com/idf
• Michael E. Thomadakis, Texas A&M University, “The Architecture of the Nehalem Processor …”
• Agner Fog, “The microarchitecture of Intel, AMD and VIA CPUs …”
• Wikipedia
– x86
– x86 assembly language
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO
LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS
INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the
approximate performance of Intel products as measured by those tests. Any difference in system hardware or
software design or configuration may affect actual performance. Buyers should consult other sources of information
to evaluate the performance of systems or components they are considering purchasing. For more information on
performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2012. Intel Corporation.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are
not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.

Notice revision #20110804

Intel® ATOM Processor: Block Diagram
Figure: processor block diagram
• Front-End Cluster: branch prediction unit, per-thread prefetch buffers, instruction cache, instruction TLB, 2-wide ILD, per-thread instruction queues, XLAT/FL, MS
• Per-thread FP register file and per-thread integer register file
• Integer Execution Cluster: ALUs, shifter, JEU
• FP / SIMD execution cluster: shuffle, FP adder, SIMD multiplier, FP multiplier, FP divider, FP move, FP ROM, FP store
• Memory Execution Cluster: AGUs, DL1 prefetcher, data cache, data TLBs, PMH, fill + write combining buffers
• Bus Cluster: L2 cache, BIU, APIC, FSB
• Fault / Retire

Intel® Xeon Phi™ Overview
Figure: processor with many multi-threaded wide SIMD cores, each with instruction and data caches (I$, D$), coherent L2 caches, shared memory controllers with GDDR memory, and a PCIe x16 interface
Standard IA Shared Memory Programming
Future options subject to change without notice.

Intel® Xeon Phi™ Microarchitecture Overview
Figure (for illustration only): cores with their L2 caches and tag directories, GDDR memory controllers, and PCIe client logic
TD: Tag Directory
L2: L2-Cache
MC: Memory Controller

Interleaved Memory Access
Figure (for illustration only): cores with their L2 caches and tag directories (TD), with GDDR memory controllers (MC) distributed between them

Intel® Xeon Phi™ Core
Intel® Xeon Phi™ co-processor core:
• Scalar pipeline derived from the dual-issue Pentium processor
• Short execution pipeline
• Fully coherent cache structure
• Significant modern enhancements
– such as multi-threading, 64-bit extensions, and sophisticated pre-fetching
• 4 execution threads per core
• Separate register sets per thread
• 32KB instruction cache and 32KB data cache for each core
Enhanced instruction set with:
• Over 100 new instructions
• Wide vector processing operations, incl. gather/scatter and masking
• Some specialized scalar instructions
• 3-operand, 16-wide vector processing unit (VPU)
• VPU executes integer, SP-float, and DP-float instructions
• Supports IEEE 754 2008 for floating point arithmetic
Interprocessor Network: 1024 bits wide, bi-directional (512 bits in each direction)
Figure: core block diagram with instruction decode, scalar unit and vector unit, scalar and vector registers, L1 I-cache & D-cache, 512K L2 cache local subset, and interprocessor network
Future options subject to change without notice.

Intel® Xeon Phi™ Processor Core
Figure (for illustration only): processor core pipeline
• 4 threads, in-order, with one instruction pointer per thread (T0 IP to T3 IP)
• L1 TLB and 32KB code cache, 16B/cycle (2 IPC)
• Decode and uCode, feeding Pipe 0 and Pipe 1
• VPU RF, X87 RF and Scalar RF; VPU (512b SIMD), X87, ALU 0 and ALU 1
• L1 TLB and 32KB data cache
• 512KB L2 cache with L2 control, L2 TLB, HWP and TLB miss handler, connected to the on-die interconnect

Vector/SIMD High Computational Density
Figure: core (instruction decode, scalar unit and vector unit, scalar and vector registers, L1 I-cache & D-cache, 512K L2 cache local subset, interprocessor network) next to the vector/SIMD unit (mask registers, 16-wide vector ALU, replicate, reorder, vector registers, numeric convert stages, L1 data cache)
Future options subject to change without notice.

VPU Block Diagram
Figure: vector/SIMD part (VPU) with 8x 16b Vmask registers, 32x 512b Vreg registers, multiply and add units, data convert / broadcast and data swizzle stages, and 512b paths to L1 and L2 memory, alongside the scalar part (scalar register, scalar units)

Restricted Architectures
• GPU, ASIC, FPGA
• Run restricted code
• Custom HW Acceleration
It’s a Supercomputer on a chip
• Operate as a compute node
• Run a full OS
• Run MPI
• Run OpenMP*
• Run x86 code
• Run offloaded code
• Intel® Xeon Phi™ Coprocessor
Restrictive architectures limit the ability for applications to use arbitrary nested parallelism, function calls and threading models

Figure: physical view (Xeon and MIC, each with its own memory, attached to the network) and logical views: NATIVE (autonomous; Linux, IP, SSH, FTP, NFS, ...), OFFLOAD (heterogeneous) and NETWORK
Flexible Execution Models
Optimized Performance for different Workloads
• Single source code, built with compilers, libraries and runtime systems
• Ranges from serial and moderately parallel code to highly parallel code
• Execution models, each running MAIN() and producing RESULTS on Xeon®, Xeon Phi™, or both:
– Multicore Only
– Multicore Hosted with Many-Core Offload
– Symmetric
– Many-Core Only
Figure: Xeon® and Xeon Phi™ pairings using MPI and directives, labeled NATIVE ONLY, OFFLOAD, CO-WORKER and SYMMETRIC

No ratings yet
notes_co_unit4
12 pages
Unit I
No ratings yet
Unit I
10 pages
Difference Betweew Celeron and Pentium
No ratings yet
Difference Betweew Celeron and Pentium
3 pages
CPE 14 Reviewer Module 1 2
No ratings yet
CPE 14 Reviewer Module 1 2
6 pages
Microprocessor (Report)
No ratings yet
Microprocessor (Report)
4 pages
Processor Architecture
No ratings yet
Processor Architecture
52 pages
Arsitektur Mikroprosessor 32 Bit
No ratings yet
Arsitektur Mikroprosessor 32 Bit
40 pages
Introduction On Intel and AMD
No ratings yet
Introduction On Intel and AMD
21 pages
Intel Architecture: 2.1. Brief History of The Ia-32 Architecture
No ratings yet
Intel Architecture: 2.1. Brief History of The Ia-32 Architecture
19 pages
History of Microprocessors
No ratings yet
History of Microprocessors
54 pages
Chip Time Line
No ratings yet
Chip Time Line
5 pages
Lecture 1 - Architecture of Microprocessor
No ratings yet
Lecture 1 - Architecture of Microprocessor
20 pages
Chapt 02
No ratings yet
Chapt 02
71 pages
03 3 Machine Basics
No ratings yet
03 3 Machine Basics
53 pages
1 Chapter-1
No ratings yet
1 Chapter-1
45 pages
Microp
No ratings yet
Microp
5 pages
Advanced Micro Devices (AMD) : Gabay, Stephen Elly F. Coeprof 17
No ratings yet
Advanced Micro Devices (AMD) : Gabay, Stephen Elly F. Coeprof 17
5 pages
History of Microprocessor Generations
No ratings yet
History of Microprocessor Generations
15 pages
Unit-III Telugu Ack
No ratings yet
Unit-III Telugu Ack
46 pages
Mobile Pentium Processor Reference
No ratings yet
Mobile Pentium Processor Reference
18 pages
ECE391 - Ch1 - Basics of Computer Systems
No ratings yet
ECE391 - Ch1 - Basics of Computer Systems
21 pages
Cpu
No ratings yet
Cpu
10 pages
Unit 1 - Microprocessor and Microcontroller
No ratings yet
Unit 1 - Microprocessor and Microcontroller
18 pages
Arm vs x86
From Everand
Arm vs x86
Mei Gates
No ratings yet
PC Hardware Explained
From Everand
PC Hardware Explained
V. Subhash
No ratings yet
TLX RFD User Installation Instructions PDF
No ratings yet
TLX RFD User Installation Instructions PDF
4 pages
ScopeImage 9.0 User Manual
No ratings yet
ScopeImage 9.0 User Manual
40 pages
Mathcad - Prob - 09 - 02
No ratings yet
Mathcad - Prob - 09 - 02
3 pages
Team 7
No ratings yet
Team 7
10 pages
Data Eng With Python Internal
No ratings yet
Data Eng With Python Internal
2 pages
Interview Questions On Salesforce
No ratings yet
Interview Questions On Salesforce
2 pages
Kinghitter Post Driver Series 2 Rear Mount
No ratings yet
Kinghitter Post Driver Series 2 Rear Mount
2 pages
8 21 The Diagram Shows Four Cells.: (Turn Over
No ratings yet
8 21 The Diagram Shows Four Cells.: (Turn Over
44 pages
Problem Set 6
No ratings yet
Problem Set 6
6 pages
Chapter 5
100% (1)
Chapter 5
13 pages
HSAD Assignment
No ratings yet
HSAD Assignment
2 pages
Elementary Grammar Worksheets
100% (3)
Elementary Grammar Worksheets
37 pages
Chemistry MCQ
No ratings yet
Chemistry MCQ
3 pages
Successfully Tested Types of Banknote Handling Machine - Customer-Operated Machines
No ratings yet
Successfully Tested Types of Banknote Handling Machine - Customer-Operated Machines
35 pages
XS-4222 4100XPC Assembly Procedure
100% (4)
XS-4222 4100XPC Assembly Procedure
8 pages
Atomic Theory Science Presentation Colorful 3D Style - 20240609 - 160039 - 0000
No ratings yet
Atomic Theory Science Presentation Colorful 3D Style - 20240609 - 160039 - 0000
25 pages
Installation Information Series PAVC 33/38/65/100: Variable Displacement Piston Pumps
No ratings yet
Installation Information Series PAVC 33/38/65/100: Variable Displacement Piston Pumps
9 pages
Types of Ijarah
No ratings yet
Types of Ijarah
2 pages
CSC3200 Lecture 1
No ratings yet
CSC3200 Lecture 1
18 pages
Casting Workshop
No ratings yet
Casting Workshop
5 pages
FDDSFDSFDSGFFDGFDGDFGFFDG
No ratings yet
FDDSFDSFDSGFFDGFDGDFGFFDG
15 pages
LT2161
No ratings yet
LT2161
15 pages
Ecosystem
No ratings yet
Ecosystem
8 pages
Series de Fourier - Rajendra PDF
100% (1)
Series de Fourier - Rajendra PDF
131 pages
Myp Biology Egg Cell Lab
No ratings yet
Myp Biology Egg Cell Lab
3 pages
MECHANICSimnida (1)
No ratings yet
MECHANICSimnida (1)
31 pages
Download full Multiscale and Multiphysics Flow Simulations of Using the Boltzmann Equation Applications to Porous Media and MEMS Jun Li ebook all chapters
No ratings yet
Download full Multiscale and Multiphysics Flow Simulations of Using the Boltzmann Equation Applications to Porous Media and MEMS Jun Li ebook all chapters
38 pages
Interface Description RS DataExport US (En)
No ratings yet
Interface Description RS DataExport US (En)
24 pages