Pipelining, Introduction To Parallel Processing and Operating System
PIPELINING
Greater performance can be achieved by taking advantage of
improvements in technology, such as faster circuitry.
In addition, organizational enhancements to the processor can
improve performance.
We have already seen some examples of this, such as the use of
multiple registers rather than a single accumulator, and the use
of a cache memory.
Another approach is instruction pipelining.
PIPELINING
In a pipeline, new inputs are accepted at one end before previously
accepted inputs appear as outputs at the other end.
Similarly, the processing of an instruction can be broken into a number of stages.
TWO-STAGE PIPELINE
The simplest arrangement divides instruction processing into two stages:
fetch instruction and execute instruction. While one instruction is being
executed, the next instruction is fetched, so the two stages overlap.
PROBLEMS WITH TWO-STAGE PIPELINE
The execution time will generally be longer than the
fetch time. Thus, the fetch stage may have to wait for
some time before it can empty its buffer.
A conditional branch instruction makes the address of
the next instruction to be fetched unknown. Thus, the
fetch stage must wait until it receives the next instruction
address from the execute stage. The execute stage may
then have to wait while the next instruction is fetched.
SIX-STAGE PIPELINE
To gain further speedup, the pipeline must have more stages.
Fetch instruction (FI): Read the next expected instruction into a
buffer.
Decode instruction (DI): Determine the opcode and the operand
specifiers.
Calculate operands (CO): Calculate the effective address of each
source operand. This may involve displacement, register indirect,
indirect, or other forms of address calculation.
Fetch operands (FO): Fetch each operand from memory.
Operands in registers need not be fetched.
Execute instruction (EI): Perform the indicated operation and
store the result, if any, in the specified destination operand
location.
Write operand (WO): Store the result in memory.
SIX-STAGE PIPELINE
With this decomposition, the various stages will be of
more nearly equal duration.
For the sake of illustration, let us assume equal duration.
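As a rough illustration, here is a minimal Python sketch (not from the slides; the function names are invented) of an ideal pipeline in which every stage takes one time unit and nothing ever stalls:

```python
# Minimal sketch: an ideal k-stage instruction pipeline where every
# stage takes exactly one time unit and no stalls ever occur.

STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]  # the six stages above

def total_time(n_instructions, n_stages):
    # The first instruction needs n_stages time units; after that the
    # pipeline completes one instruction per time unit.
    return n_stages + (n_instructions - 1)

def timetable(n_instructions, stages=STAGES):
    # Instruction i (0-based) occupies stage j at time unit i + j + 1.
    return [{s: i + j + 1 for j, s in enumerate(stages)}
            for i in range(n_instructions)]

print(total_time(9, 6))  # 14 time units, versus 9 * 6 = 54 unpipelined
print(total_time(9, 2))  # 10 time units for a two-stage pipeline
print(timetable(2)[1])   # {'FI': 2, 'DI': 3, ..., 'WO': 7}
```

Nine instructions that would take 54 time units unpipelined complete in 14; the same formula explains why the two-stage pipeline of the earlier slides gives at best a factor-of-two speedup.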
SIX-STAGE PIPELINE
Factors that limit performance
If the six stages are not of equal duration, there will be some
waiting involved at various pipeline stages
Conditional branch instructions, which invalidate several instruction
fetches
Interrupts
BRANCH PENALTY
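The slide's timing diagram is not reproduced here; as a hedged sketch of the arithmetic it illustrates, assume the branch outcome is known only at the end of the EI stage, so everything fetched behind a taken branch must be flushed:

```python
# Sketch of the branch penalty in a 6-stage pipeline, assuming a
# conditional branch is resolved only at the end of stage 5 (EI).

N_STAGES = 6
RESOLVE_STAGE = 5  # EI

def cycles(n_instructions, n_taken_branches=0):
    """Total time units, adding a flush penalty per taken branch."""
    base = N_STAGES + (n_instructions - 1)
    # While the branch moves from FI to EI, RESOLVE_STAGE - 1 younger
    # instructions enter the pipeline and must be discarded.
    penalty_per_branch = RESOLVE_STAGE - 1
    return base + penalty_per_branch * n_taken_branches

print(cycles(9))     # 14: ideal, no branches
print(cycles(9, 1))  # 18: one taken branch costs 4 extra time units
```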
THE LOGIC NEEDED FOR A SIX-STAGE INSTRUCTION PIPELINE
DEALING WITH BRANCHES
A variety of approaches have been taken for dealing with
conditional branches:
Multiple streams
Prefetch branch target
Loop buffer
Branch prediction
Delayed branch
MULTIPLE STREAMS
A simple pipeline suffers a penalty for a branch
instruction because it must choose one of two
instructions to fetch next and may make the wrong
choice. A brute-force approach is to replicate the initial
portions of the pipeline and allow the pipeline to fetch
both instructions, making use of two streams.
There are two problems with this approach:
With multiple pipelines there are contention delays for access
to the registers and to memory.
Additional branch instructions may enter the pipeline (either
stream) before the original branch decision is resolved. Each
such instruction needs an additional stream.
PREFETCH BRANCH TARGET
When a conditional branch is recognized, the target of
the branch is prefetched, in addition to the instruction
following the branch.
This target is then saved until the branch instruction is
executed. If the branch is taken, the target has already
been prefetched.
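A minimal Python sketch of this policy (illustrative; the fetch function and the addresses are invented):

```python
def fetch(address):
    return f"instr@{address}"   # stand-in for a memory fetch

def on_conditional_branch(branch_addr, target_addr):
    # Prefetch both the fall-through instruction and the branch target,
    # and keep the target on hand until the branch is executed.
    fall_through = fetch(branch_addr + 1)
    target = fetch(target_addr)
    return fall_through, target

def resolve(taken, fall_through, target):
    # If the branch is taken, the target has already been prefetched.
    return target if taken else fall_through

ft, tgt = on_conditional_branch(100, 160)
print(resolve(True, ft, tgt))   # instr@160, with no extra fetch delay
```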
LOOP BUFFER
A loop buffer is a small, very-high-speed memory
maintained by the instruction fetch stage of the pipeline
and containing the n most recently fetched instructions, in
sequence.
If a branch is to be taken, the hardware first checks
whether the branch target is within the buffer. If so, the
next instruction is fetched from the buffer.
Benefits:
Reduces memory accesses
Useful for the rather common occurrence of IF–THEN and IF–THEN–ELSE sequences
Well suited to dealing with loops, or iterations; hence the name loop buffer
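A toy Python sketch of the loop-buffer check (the buffer size and address scheme are assumptions for illustration):

```python
from collections import deque

class LoopBuffer:
    """Toy loop buffer: holds the n most recently fetched
    instructions, in sequence, keyed by their addresses."""

    def __init__(self, n=8):
        self.buffer = deque(maxlen=n)   # (address, instruction) pairs

    def record_fetch(self, address, instruction):
        self.buffer.append((address, instruction))

    def lookup(self, target_address):
        """On a taken branch, check the buffer before going to memory."""
        for address, instruction in self.buffer:
            if address == target_address:
                return instruction      # hit: no memory access needed
        return None                     # miss: fetch from memory instead

buf = LoopBuffer(n=4)
for addr in (100, 104, 108, 112):
    buf.record_fetch(addr, f"instr@{addr}")
print(buf.lookup(104))  # hit: a backward branch into the buffered loop body
print(buf.lookup(200))  # miss: target lies outside the buffer
```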
BRANCH PREDICTION (1)
Predict never taken
Assume that the jump will not happen
Always fetch the next sequential instruction
Used by the Motorola 68020 and the VAX-11/780
The VAX will not prefetch after a branch if a page fault would
result (an O/S versus CPU-design trade-off)
Predict always taken
Assume that the jump will happen
Always fetch the target instruction
BRANCH PREDICTION (2)
Predict by Opcode
Some instructions are more likely to result in a jump than
others
Can get up to 75% success
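The three static strategies can be compared on a made-up branch trace; the trace and the set of "likely taken" opcodes below are invented inputs, so the accuracies are only illustrative:

```python
# Toy comparison of three static prediction strategies.
# Each trace entry is (opcode, actually_taken); both are invented.

TRACE = [("BEQ", True), ("BNE", False), ("LOOP", True),
         ("LOOP", True), ("BEQ", False), ("LOOP", True)]

LIKELY_TAKEN = {"LOOP"}  # assumption: loop-closing branches usually jump

def accuracy(predict):
    hits = sum(predict(op) == taken for op, taken in TRACE)
    return hits / len(TRACE)

print(accuracy(lambda op: False))               # predict never taken: 0.33
print(accuracy(lambda op: True))                # predict always taken: 0.67
print(accuracy(lambda op: op in LIKELY_TAKEN))  # predict by opcode: 0.83
```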
BRANCH PREDICTION (3)
Delayed Branch
Do not take the jump until you have to
Rearrange instructions so that useful work fills the slot after the
branch (a sketch follows)
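A toy Python executor for a one-slot delayed branch (the instruction encoding is invented): the instruction immediately after a branch always executes, so the compiler tries to put useful work there instead of a NOP:

```python
# Toy executor for a one-slot delayed branch: the instruction right
# after a branch (the delay slot) always runs before control moves.
# Assumes no program ends with a branch.

def run(program):
    executed, pc = [], 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "BRANCH":
            executed.append(program[pc + 1])  # delay slot always executes
            pc = arg                          # then control transfers
        else:
            executed.append((op, arg))
            pc += 1
    return executed

# A naive schedule wastes the delay slot on a NOP ...
naive = [("ADD", "r1"), ("BRANCH", 4), ("NOP", None),
         ("SUB", "r2"), ("HALT", None)]
# ... while the rearranged schedule does useful work in the slot.
scheduled = [("BRANCH", 3), ("ADD", "r1"), ("SUB", "r2"), ("HALT", None)]

print(run(naive))      # executes ADD, NOP, HALT (one wasted slot)
print(run(scheduled))  # executes ADD, HALT (same work, no wasted slot)
```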
BRANCH PREDICTION FLOWCHART
PARALLEL PROCESSING
A traditional way to increase system performance is
to use multiple processors that can execute in parallel to
support a given workload.
The two most common multiple-processor
organizations are symmetric multiprocessors (SMPs) and
clusters.
More recently, nonuniform memory access (NUMA)
systems have been introduced commercially.
Symmetric multiprocessors
An SMP consists of multiple similar processors within
the same computer, interconnected by a bus or some sort
of switching arrangement.
Chip multiprocessing:
When more than one processor is implemented on a single chip,
the configuration is referred to as chip multiprocessing.
Multithreaded processor:
A related design scheme is to replicate some of the
components of a single processor so that the processor can
execute multiple threads concurrently; this is known as a
multithreaded processor.
Cluster:
A cluster is a group of interconnected, whole computers
working together as a unified computing resource that can
create the illusion of being one machine.
The term whole computer means a system that can run on
its own, apart from the cluster.
Non-uniform memory access
A NUMA system is a shared-memory multiprocessor in
which the access time from a given processor to a word
in memory varies with the location of the memory word.
MULTIPLE PROCESSOR ORGANIZATION
SINGLE INSTRUCTION, SINGLE DATA STREAM - SISD
A single processor executes a single instruction stream
Data is stored in a single memory
Uniprocessors fall into this category
SINGLE INSTRUCTION, MULTIPLE DATA STREAM - SIMD
A single machine instruction controls the simultaneous execution
of a number of processing elements on a lockstep basis
Each processing element has an associated data memory
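As a loose software analogy (NumPy, not from the slides), a single vector operation applies the same instruction to many data elements at once, which is the spirit of SIMD lockstep execution:

```python
import numpy as np

# One operation, many data elements: each position behaves like a
# processing element applying the same instruction to its own datum.
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(a + b)  # [11 22 33 44] -- a single vector add over four lanes
```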
MULTIPLE INSTRUCTION, SINGLE DATA STREAM - MISD
A sequence of data is transmitted to a set of processors
Each processor executes a different instruction sequence
MULTIPLE INSTRUCTION, MULTIPLE DATA STREAM- MIMD
A set of processors simultaneously executes different instruction
sequences on different sets of data
SMPs, clusters, and NUMA systems fall into this category (a loose
software analogy follows)
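As a loose software analogy (Python multiprocessing, not from the slides), independent processes running different functions over different data approximate the MIMD model:

```python
from multiprocessing import Process

# Two "processors", different instruction sequences, different data:
# the essence of MIMD, approximated with OS processes.
def summer(data):
    print("sum:", sum(data))

def scaler(data):
    print("scaled:", [x * 2 for x in data])

if __name__ == "__main__":
    p1 = Process(target=summer, args=([1, 2, 3],))
    p2 = Process(target=scaler, args=([4, 5, 6],))
    p1.start(); p2.start()
    p1.join(); p2.join()
```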
TAXONOMY OF PARALLEL PROCESSOR ARCHITECTURES
OPERATING SYSTEM
Operating Systems Definition: The set of software
products that jointly controls the system resources and
processes using these resources on a computer.
An operating system (OS) is a software program that
manages the hardware and software resources of a
computer.
Layers and Views of a Computer System:
Operating System Services:
Program creation
Program execution
Access to I/O devices
Controlled access to files
System access
Error detection and response
Accounting
Kernel:
An operating system generally consists of two parts:
• Kernel space (kernel mode) and
• User space (user mode).
The kernel is the smallest and most central component of an
operating system.
Its services include managing memory and devices, and providing
an interface through which software applications use system
resources.
Additional services, such as protection of programs and
multitasking, may be included depending on the architecture of
the operating system.
There are three broad categories of kernel models
available, namely:
I. Monolithic kernel
II. Microkernel
III. Exokernel
Why Operating Systems?
Operating systems provide a layer of abstraction
between the programmer and the machine
It screens the complexities of the computer from the
programmer
It allows the computer to be treated as a virtual machine
Operating system types:
Real-time:
A real-time operating system is a multitasking operating
system that aims at executing real-time applications.
Distributed:
A distributed operating system manages a group of
independent computers and makes them appear to be a
single computer.
RELATIVE PROCESSING POWER
Determined by:
Number of cores
Clock speed (in GHz)
Size of cache
Intel technologies such as Turbo Boost and Hyper-Threading
NUMBER OF CORES
The more cores there are, the more tasks (known as
threads) can be served at the same time.
The lowest number of cores is found in Core i3 CPUs: currently,
all Core i3s are dual-core processors.
INTEL TURBO BOOST
The Intel Turbo Boost Technology allows a processor to
dynamically increase its clock speed whenever the need
arises.
The maximum amount that Turbo Boost can raise clock
speed at any given time is dependent on the number of
active cores, the estimated current consumption, the
estimated power consumption, and the processor
temperature.
Because none of the Core i3 CPUs have Turbo Boost, the
i5-4570T can outrun them whenever it needs to.
CACHE SIZE
Whenever the CPU finds that it keeps using the same data over and
over, it stores that data in its cache.
With a larger cache, more data can be accessed quickly (a software
analogy follows).
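As a software analogy (Python's functools.lru_cache, not from the slides), a cache keeps recently used results close at hand, and a larger one evicts less often:

```python
from functools import lru_cache

@lru_cache(maxsize=2)        # analogy: a tiny cache holding two entries
def slow_lookup(address):
    print(f"miss: fetching {address} the slow way")
    return address * 10      # stand-in for an expensive memory access

slow_lookup(1); slow_lookup(2)
slow_lookup(1)               # hit: served from the cache, no 'miss' printed
slow_lookup(3)               # evicts the least recently used entry (2)
slow_lookup(2)               # miss again: the small cache already evicted it
```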
The End!
Thank You So Much!