
Unit 1

Modern processors
Objectives of Chapter 1

A high-level overview of the architecture of modern cache-based microprocessors
Introduction of important concepts, which will be useful for writing efficient code later
Discussion of inherent performance limitations
“Stored-program computer”
The “stored-program computer” concept

Instructions (produced by a compiler) and data are stored in memory
Instructions are read and executed by a control unit
An arithmetic/logic unit “does the work” which is coded in the instructions
The speed of memory determines how fast instructions and data can be fed to the control and arithmetic units; this limits performance
I/O facilities enable interaction with users
CPU

The CPU (central processing unit) is the “brain” of a computer.
The CPU incorporates control and arithmetic units (and many other components), together with appropriate interfaces to memory and I/O.
The CPU has a “clock” which, at each clock cycle, synchronizes the logic units within the CPU to process instructions.
Cache-based microprocessor
Important hardware components

Arithmetic units for floating-point (FP) and integer (INT) operations
Registers hold operands to be accessed by instructions
Load (LD) and store (ST) units handle instructions that transfer data to and from registers
Instructions are sorted into several queues, waiting to be executed (probably not in the order they were issued)
Caches hold data and instructions to be (re-)used
Pipelined functional units

By subdividing complex operations into simple components that can be executed using different functional units, it is possible to increase instruction throughput, i.e., the number of instructions executed per clock cycle.
This is the most elementary example of instruction-level parallelism (ILP).
Optimally pipelined execution leads to a throughput of one instruction per cycle per pipeline.
Pipelining

Pipelining in microprocessors follows the same principle as assembly lines in manufacturing: workers (functional units) are highly skilled and specialized for a single task.
Each worker executes the same step, over and over again, on different objects.
If it takes m different steps to finish the product, m products are continuously worked on, in different stages of completion.
If all tasks take the same amount of time and all workers are continuously busy, eventually (after the initial m steps) one product will be finished per time step.
A simple example of no pipelining

For simplicity, let us suppose every instruction has five stages, each taking one cycle.
The following picture shows the situation without pipelining:
The situation with instruction pipelining
More about pipelining

Complex operations such as loading and storing data or performing floating-point arithmetic cannot be executed in a single cycle, so the simple “fetch–decode–execute” pipeline is used, in which each stage can operate independently of the others.
These still complex tasks are usually broken down even further.
The benefit of elementary subtasks is the potential for a higher clock rate, as the functional units are kept simple.
Example of “vector product”

Arrays of floating-point values: A, B, C

for (i=0; i<N; i++)
    A[i] = B[i] * C[i];

Suppose a floating-point multiplication is decomposed into five subtasks. (A floating-point value is (sign) × mantissa × 2^exponent.)
1. separation of mantissa and exponent of B[i] and C[i]
2. multiply mantissas of B[i] and C[i] (recall that a mantissa is a binary fraction with non-zero leading bit)
3. add exponents of B[i] and C[i]
4. normalize result
5. insert sign
Depiction of a pipeline
Simple mathematical model for pipelining

An m-stage pipeline has a latency (or depth) of m cycles. The wind-up and wind-down periods are both m − 1 cycles.
For a pipeline of depth m, executing N independent operations takes N + m − 1 cycles. The speedup versus “no pipelining” is

    T_seq / T_pipe = N·m / (N + m − 1)

The throughput, i.e., the average number of operations finished per cycle, can be calculated as

    N / T_pipe = 1 / (1 + (m − 1)/N)
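As a worked example with assumed numbers: for a pipeline of depth m = 5 and N = 100 independent operations, T_pipe = 100 + 5 − 1 = 104 cycles, so the speedup over sequential execution is 500/104 ≈ 4.8 and the throughput is 100/104 ≈ 0.96 operations per cycle. For a short loop with N = 5, the throughput drops to 5/9 ≈ 0.56, which shows why pipelines need long streams of independent operations.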
Example of pipeline throughput
Pipeline bubbles

Very complex calculations (like floating-point division or special math functions) tend to have very long latencies, and are only pipelined to a small degree or not at all. In such cases, stalling the instruction stream becomes inevitable, leading to so-called “pipeline bubbles”.
Avoiding such complex functions, if possible, is a useful technique for code optimization (to be discussed in Chapter 2).
Superscalarity
Goal: To produce more than one “result” per cycle.

Multiple instructions are fetched and decoded concurrently
Address and other integer calculations are performed in multiple integer (add, mult, shift, mask) units
Multiple floating-point pipelines run in parallel
Caches are fast enough to sustain more than one load or store operation per cycle

Superscalarity is a special form of parallel execution, and a variant of ILP.
Out-of-order execution and compiler optimization must work together to fully exploit superscalarity.
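A minimal sketch of code that exposes this kind of parallelism (an assumed example, not from the slides): splitting a summation over two independent accumulators gives a superscalar, out-of-order core two addition chains it can schedule into separate pipelines.

/* Assumed illustration: two independent accumulators expose ILP
   that a superscalar core with multiple FP pipelines can exploit. */
double sum_unrolled(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 2 <= n; i += 2) {
        s0 += a[i];      /* this addition chain ...        */
        s1 += a[i + 1];  /* ... is independent of this one */
    }
    for (; i < n; i++)   /* remainder for odd n */
        s0 += a[i];
    return s0 + s1;
}

Note that this reorders the floating-point additions, so a compiler will only apply such a transformation on its own when it is allowed to relax floating-point semantics.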
SIMD

The SIMD (single-instruction-multiple-data) concept became widely known with the first vector supercomputers in the 1970s.
Modern cache-based processors have instruction set extensions for both integer and floating-point operations. They allow the concurrent execution of arithmetic operations on “wide” registers, each holding multiple numerical values.
Example of SIMD
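As a hedged illustration (SSE is just one example of such an instruction set extension, and the function name vmul is made up): the vector-product loop from earlier can be written with compiler intrinsics so that four single-precision multiplications are executed by one instruction on a 128-bit register.

#include <xmmintrin.h>   /* SSE intrinsics, one example of a SIMD extension */

/* A[i] = B[i] * C[i], processed four single-precision values at a time. */
void vmul(float *A, const float *B, const float *C, int N)
{
    int i;
    for (i = 0; i + 4 <= N; i += 4) {
        __m128 b = _mm_loadu_ps(&B[i]);           /* load 4 elements of B */
        __m128 c = _mm_loadu_ps(&C[i]);           /* load 4 elements of C */
        _mm_storeu_ps(&A[i], _mm_mul_ps(b, c));   /* 4 multiplies in one instruction */
    }
    for (; i < N; i++)                            /* scalar remainder loop */
        A[i] = B[i] * C[i];
}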
Memory hierarchy

Data can be stored in a computer system in many different ways.
The CPU has a set of registers, which can be accessed without delay.
In addition, there are several levels of cache, holding copies of recently used data items.
Main memory of a computer is much slower than the caches.
Depiction of memory hierarchy
Cache

Caches are low-capacity, high-speed memories that are commonly integrated on the CPU die.

L1 (level 1) data cache
L1 instruction cache
L2 and L3 unified caches

The purpose of caches is to reduce the impact of main memory's low bandwidth and high latency.
Cache hit and miss

Whenever the CPU issues a read request (“load”) for transferring a data item to a register, the L1 data cache is checked. If the wanted data item is found in L1, this is called a cache hit; otherwise a cache miss occurs.
In case of a cache miss in L1, data must be fetched from upper cache levels or, in the worst case, from main memory.
Cache eviction

If a data item needs to be loaded into a cache whose entries are all occupied, one of the existing entries has to be evicted by a hardware-implemented algorithm (typically following a least-recently-used strategy) to make space.
Temporal locality
Curves of performance gain
Cache lines

The content of a cache is organized as cache lines. (A cache line has space for multiple data items.) This reduces the latency penalty for streaming, where large amounts of data are loaded into the CPU, modified, and written back without the potential of reuse “in time”.
All data transfers between caches and main memory happen on the cache line level.
If a code has good spatial locality, that is, the probability of successive accesses to neighboring items is high, the latency problem can be significantly reduced.
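A small sketch of spatial locality (an assumed example): in C, a 2D array is stored row by row, so traversing it with the row index in the outer loop uses every element of each loaded cache line, whereas the column-wise traversal touches a new cache line on almost every access.

#define N 1024

/* Good spatial locality: the inner loop walks consecutive addresses,
   so each cache line brought in is fully used. */
void touch_rowwise(double a[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += 1.0;
}

/* Poor spatial locality: stride-N accesses hit a different cache line
   almost every iteration, so most of each loaded line is wasted. */
void touch_columnwise(double a[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += 1.0;
}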
Cache mapping

If a line of data items from main memory, to be loaded into cache, can be freely placed in any unoccupied cache line, this is called fully associative mapping.
Unfortunately, it is hard to build large, fast, fully associative caches because of the large bookkeeping overhead.
At the other extreme, a direct-mapped cache, in which a line of data items can be placed only in one prescribed cache line, runs the risk of low cache utilization.
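As a worked example with assumed sizes: a direct-mapped cache of 32 KiB with 64-byte cache lines has 512 lines, and a memory address maps to line (address / 64) mod 512. Any two addresses that are a multiple of 32 KiB apart therefore compete for the same line and evict each other, which is exactly the low-utilization risk mentioned above.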
Direct-mapped cache
m-way associative
The problem of “first cache miss”

Although exploiting spatial locality and cache lines can improve cache efficiency, there is still the problem of latency on the first miss.
Prefetch

Prefetching supplies the cache with data ahead of the actual requirements of an application code.
Typically, a hardware prefetcher can detect regular access patterns and tries to read the needed data ahead of time.
To completely hide the cache miss latency, the memory subsystem must be able to sustain a certain number of outstanding prefetch operations.
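Prefetching can also be requested from software. Below is a minimal sketch using GCC's __builtin_prefetch; the prefetch distance PF_DIST is an assumption and would have to be tuned per machine. In practice a hardware prefetcher already detects a regular stream like this one, so explicit prefetching mainly pays off for irregular access patterns.

#define PF_DIST 64   /* assumed prefetch distance in elements (a few cache lines) */

/* Dot product with explicit software prefetches: data for later
   iterations is requested early so the transfer overlaps with the
   computation on the current elements. */
double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            __builtin_prefetch(&a[i + PF_DIST]);   /* GCC built-in, read prefetch */
            __builtin_prefetch(&b[i + PF_DIST]);
        }
        s += a[i] * b[i];
    }
    return s;
}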
How many outstanding prefetch operations needed?
Prefetch helps to overlap computation with data transfer
Multithreaded processors

All modern microprocessors are heavily pipelined. However, there are often frequent “pipeline bubbles”, caused for example by:

dependencies
memory latencies
insufficient loop length
branch mispredictions

The consequence is that a large part of the execution resources sits idle (wasted resources).
Multithreading

Hyper-threading or simultaneous multithreading (SMT) capabilities are thus built into modern processors.

Multiple architectural states of a CPU core
An architectural state comprises all data, status and control registers
However, resources such as arithmetic units, caches, queues, and memory interfaces are not duplicated

One CPU core “appears” to be composed of several cores (also called logical processors). Multiple instruction streams, or threads, can be executed in parallel.
SMT

All threads share the same execution resources, so it is sometimes possible to fill pipeline bubbles that arise due to stalls in one thread. SMT may thus enhance instruction throughput (instructions executed per cycle).
Whether the concept of SMT pays off is code-dependent and hardware-dependent!
Performance metrics

Theoretically, the components of a CPU core can operate at some maximum speed called peak performance.
Whether this limit can be reached for a specific application code depends on many factors (one of the key topics of Chapter 3).

Performance metrics:

The rate at which the floating-point units generate results for multiply and add operations is measured in floating-point operations per second (Flops/sec).
The most important data paths are those to and from the caches and main memory. The performance of these paths, called bandwidth, is quantified in GBytes/sec.
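As a hedged worked example with assumed numbers: a core running at 3 GHz with two FP pipelines, each completing a 4-wide SIMD fused multiply-add (2 Flops per element) every cycle, has a peak performance of 3 × 10⁹ × 2 × 4 × 2 = 48 GFlops/sec, and a chip with 8 such cores peaks at 384 GFlops/sec. A memory bandwidth of, say, 50 GBytes/sec delivers only about 6 × 10⁹ double-precision operands per second; this gap is why the data paths matter so much.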
Multicore processors

A higher clock frequency allows a CPU to execute instructions faster. However, increasing the clock frequency can have a serious impact on power dissipation.
On the other hand, reducing the clock frequency allows placing more than one CPU core on the same CPU die (or, more generally, in the same package) while keeping the same power envelope.
More about multicore CPU

The caches on the different levels can be private or shared. Sharing a cache enables superfast communication between the cores. An opposite effect of sharing can be cache bandwidth bottlenecks.
Example of a 4-core CPU

Intel Nehalem (actually a very old CPU)


Challenges with using multicore CPUs

In order to use all the resources belonging to the multiple cores, parallel programming must be adopted. (This will be one of the topics for later lectures.)
The memory bandwidth available per core can be a challenge. So programming techniques for memory traffic reduction will be even more important!
Vector processors

A very important processor architecture for HPC in the past.
However, some of the concepts and techniques related to vectorization are still used today.
Basic ideas

Instructions operate on vector registers that can hold a large number of arguments
The width of a vector register is called the vector length Lv
MULT and ADD pipelines are multitrack
One or several load, store, or combined load/store pipes are connected directly to main memory

The paradigm of SIMD
Block diagram of vector processor
SIMD for A = B + C
Target calculation

for (s=0; s<N; s++)
    A[s] = B[s] + C[s];

A vectorization-capable compiler will automatically translate this into the following pseudocode (L denotes the vector length):

for (s=0; s<N; s+=L)
{
    int E = min(N-1, s+L-1);
    vload  V1(0:L-1) = B(s:E);
    vload  V2(0:L-1) = C(s:E);
    vadd   V3(0:L-1) = V1(0:L-1) + V2(0:L-1);
    vstore A(s:E)    = V3(0:L-1);
}
Vectorization

Writing a program so that the compiler can generate effective SIMD vector instructions is called vectorization.
Sometimes this requires reformulating the code or inserting directives to help the compiler identify SIMD parallelism.
If a code cannot be vectorized, it makes no sense to use a vector computer!
A prerequisite for vectorization is true data independence across the iterations of a loop. (Forward references are allowed, but not backward references.)
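A small illustration (assumed example): the first loop below reads a value written in the previous iteration (a backward reference), so its iterations are not independent and it cannot be vectorized; the second loop only contains a forward reference and remains a valid SIMD candidate.

/* Not vectorizable: a[i] depends on a[i-1] produced one iteration earlier. */
void scale_recursive(double *a, double s, int N)
{
    for (int i = 1; i < N; i++)
        a[i] = s * a[i-1];
}

/* Vectorizable: only a forward reference; the values read are never
   modified by an earlier iteration of the same loop. */
void scale_shifted(double *a, double s, int N)
{
    for (int i = 0; i < N - 1; i++)
        a[i] = s * a[i+1];
}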
Branches in vectorized loops (example 1)

for (i=0; i<N; i++) {
    if (y[i] < 0.)
        x[i] = s*y[i];
    else
        x[i] = y[i]*y[i];
}
Mask registers

First, a vector of boolean values is generated by the logic pipeline. Then both branches are executed for all loop indices. Finally, the boolean vector is used to choose the correct results.
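A hedged sketch of this idea in terms of x86 SSE4.1 intrinsics (an illustration only; classic vector processors use dedicated mask registers rather than blend instructions, and the remainder loop is omitted): both branches of example 1 are evaluated for four elements at a time, and the comparison mask selects the correct result per element.

#include <smmintrin.h>   /* SSE4.1: provides _mm_blendv_ps */

/* x[i] = (y[i] < 0) ? s*y[i] : y[i]*y[i], four elements per step. */
void branch_vectorized(float *x, const float *y, float s, int N)
{
    __m128 vs = _mm_set1_ps(s);
    for (int i = 0; i + 4 <= N; i += 4) {
        __m128 vy   = _mm_loadu_ps(&y[i]);
        __m128 mask = _mm_cmplt_ps(vy, _mm_setzero_ps()); /* boolean vector: y[i] < 0 */
        __m128 b1   = _mm_mul_ps(vs, vy);                 /* "if" branch              */
        __m128 b2   = _mm_mul_ps(vy, vy);                 /* "else" branch            */
        _mm_storeu_ps(&x[i], _mm_blendv_ps(b2, b1, mask)); /* select per element      */
    }
}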
Branches in vectorized loops (example 2)

for (i=0; i<N; i++) {
    if (y[i] > 0.)
        x[i] = sqrt(y[i]);
}

There is only the if branch (no else branch). Also, the sqrt calculation is expensive. Executing it for all loop indices can be a huge waste of resources. What should be done in such a case?
The gather/scatter method
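A hedged sketch of the gather/scatter idea applied to example 2 (the scratch arrays and their handling are assumptions; the original figure may differ in detail): the elements that actually need the expensive sqrt are gathered into a contiguous temporary, a dense and vectorizable loop works only on those elements, and the results are scattered back.

#include <math.h>

/* Compute x[i] = sqrt(y[i]) only where y[i] > 0.
   idx and tmp are caller-provided scratch space of at least N entries. */
void sparse_sqrt(double *x, const double *y, int *idx, double *tmp, int N)
{
    int k = 0;
    for (int i = 0; i < N; i++)        /* gather: copy the elements that need work */
        if (y[i] > 0.) {
            idx[k] = i;
            tmp[k] = y[i];
            k++;
        }

    for (int j = 0; j < k; j++)        /* dense, contiguous loop: a good SIMD candidate */
        tmp[j] = sqrt(tmp[j]);

    for (int j = 0; j < k; j++)        /* scatter the results back */
        x[idx[j]] = tmp[j];
}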
