4 - Performance Issues

CSC 2111
Computer Organisation and Architecture
Unit 4: Intended Learning Outcomes
 By the end of this unit, you should be able to:
 Understand the key performance issues that relate to computer design.
 Distinguish among multicore, MIC, and GPGPU organisations.
 Perform basic measures of computer performance.
Designing for
Performance
Unit Introduction
 Chipmakers have been busy learning how to fabricate chips of
greater and greater density.
 But the raw speed of the microprocessor will not achieve its
potential unless it is fed a constant stream of work to perform
in the form of computer instructions.
 As such, processor designers must come up with ever more
elaborate techniques for feeding the processor with
instructions.
 Among the techniques built into contemporary processors are
the following:
 Pipelining
 Branch prediction
 Superscalar execution
 Data flow analysis
 Speculative execution
Pipelining
 Recap
 The execution of an instruction involves multiple stages of
operation: fetching the instruction, decoding the opcode,
fetching operands, performing a calculation, and so on.
 Pipelining
 A processor works simultaneously on multiple instructions
by performing a different phase for each of the multiple
instructions at the same time.
 It overlaps operations by moving data or instructions into
a conceptual pipe with all stages of the pipe processing
simultaneously.
 For example, while one instruction is being executed, the
computer is decoding the next instruction.
 This is the same principle as seen in an assembly line.
Assembly Line Example
 Consider a water bottle packaging plant.
 Let there be 3 stages that a bottle should pass through: inserting the bottle (I), filling water in the bottle (F), and sealing the bottle (S).
 Let us consider these stages as stage 1, stage 2, and stage 3 respectively.
 Let each stage take 1 minute to complete its operation.
 Without (left) Vs. With (right) Pipelining
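The timing arithmetic behind this example can be sketched as follows (a toy model, not from the slides): without pipelining, each bottle passes through all three stages before the next one starts; with pipelining, a new bottle enters the line every stage-time.

```python
# Compare total time to process n bottles through 3 one-minute stages,
# with and without pipelining.

STAGES = 3        # insert, fill, seal
STAGE_TIME = 1    # minutes per stage

def time_without_pipelining(n):
    # Each bottle finishes all stages before the next bottle starts.
    return n * STAGES * STAGE_TIME

def time_with_pipelining(n):
    # The first bottle takes all 3 stages; each later bottle finishes
    # one stage-time after the previous one.
    return (STAGES + (n - 1)) * STAGE_TIME

print(time_without_pipelining(100))  # 300 minutes
print(time_with_pipelining(100))     # 102 minutes
```

With 100 bottles the pipelined line is almost three times faster, and the advantage approaches the stage count as n grows.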
Execution in a Pipelined
Processor
 Consider a processor having 4 stages for each instruction, and let there be 2 instructions to be executed.
Hyper Threading Vs. Pipelining
 In computer science a thread of execution is the smallest
sequence of programmed instructions that can be managed
independently by a scheduler, which is typically a part of
the operating system.
 The implementation of threads and processes differs between
operating systems, but in most cases a thread is a component of
a process.
 The multiple threads of a given process may be
executed concurrently (via multithreading capabilities), sharing
resources such as memory, while different processes do not
share these resources.
 Pipelining works on a single thread, hyperthreading works on
multiple threads.
Hyper Threading Vs.
Multithreading
Branch Prediction
A branch is an instruction in a computer program that can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order.

A branch predictor is a digital circuit that tries to guess which way a branch will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline.
Branch Prediction
Without branch prediction, the processor would have to
wait until the conditional jump instruction has passed the
execute stage before the next instruction can enter the
fetch stage in the pipeline.
The branch predictor attempts to avoid this waste of time
by trying to guess whether the conditional jump is most
likely to be taken or not taken.
The branch that is guessed to be the most likely is then
fetched and speculatively executed.
If it is later detected that the guess was wrong, then the
speculatively executed or partially executed instructions
are discarded and the pipeline starts over with the correct
branch, incurring a delay.
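One common predictor design is the 2-bit saturating counter; the sketch below is illustrative (the slides do not commit to a particular scheme). Two consecutive mispredictions are needed to flip the prediction, which works well for loop branches.

```python
# A 2-bit saturating-counter branch predictor (illustrative design).
class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # 0,1 = predict not-taken; 2,3 = predict taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # Move toward 3 on taken branches, toward 0 on not-taken ones.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
# A loop branch taken 9 times and then not taken once:
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    if p.predict() == taken:
        correct += 1
    p.update(taken)
print(correct, "of", len(outcomes), "predicted correctly")
```

Only the final loop exit is mispredicted, so 9 of the 10 branches are predicted correctly.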
Superscalar Execution
 This is the ability to issue more than one instruction in
every processor clock cycle.
 In effect, multiple parallel pipelines are used.
 A superscalar processor is a CPU that implements a form
of parallelism called instruction-level parallelism within a
single processor.
 In contrast to a scalar processor, which can execute at most
one single instruction per clock cycle, a superscalar
processor can execute more than one instruction during a
clock cycle by simultaneously dispatching multiple
instructions to different execution units on the processor.
 Each execution unit is not a separate processor (or a core if
the processor is a multi-core processor), but an execution
resource within a single CPU such as an arithmetic logic unit.
Superscalar Execution Vs.
Pipelining
 While a superscalar CPU is typically also pipelined,
superscalar and pipelining execution are
considered different performance enhancement
techniques.
 Superscalar executes multiple instructions in parallel by using multiple execution units.
 Pipelining executes multiple instructions in the
same execution unit in parallel by dividing the
execution unit into different phases.
Superscalar Execution
Pipeline Execution
Data flow analysis
 The processor analyzes which
instructions are dependent on each
other’s results, or data, to create an
optimized schedule of instructions.
 In fact, instructions are scheduled to be
executed when ready, independent of
the original program order.
 This prevents unnecessary delay.
Speculative execution
 Using branch prediction and data flow analysis,
some processors speculatively execute
instructions ahead of their actual appearance
in the program execution, holding the results in
temporary locations.
 This enables the processor to keep its
execution engines as busy as possible by
executing instructions that are likely to be
needed.
Performance Balance
Performance Balance
 Need for performance balance?
 Processor power has raced ahead at
breakneck speed, while other critical
components of the computer have not kept
up.
 This results in adjustments to the organisation and architecture to compensate for the mismatch among the capabilities of the various components, especially at the interface between the processor and main memory or I/O devices.
Processor – Memory Interface
 Increase the number of bits that are retrieved at one time by making DRAMs “wider” rather than “deeper”, and by using wide bus data paths.
 Reduce the frequency of memory access by
incorporating increasingly complex and efficient cache
structures between the processor and main memory.
 Increase the interconnect bandwidth between
processors and memory by using higher-speed buses
and a hierarchy of buses to buffer and structure data
flow.
Processor – I/O Interface
 As computers become faster and more capable, more sophisticated applications are developed that support the use of peripherals with intensive I/O demands.
Processor – I/O Interface
 Strategies of getting I/O data moved between
processor and peripherals.
 Caching and buffering schemes
 Use of higher-speed interconnection buses and
interconnection structures.
 Use of multiple-processor configurations to
satisfy I/O demands.
 Designers constantly strive to balance the
throughput and processing demands of the
processor components, main memory, I/O devices,
and the interconnection structures.
Multicore, MICS, and GPGPUs
 New approaches to improving performance:
 Multicore
 An approach to improving performance by placing multiple
processors on the same chip, with a large shared cache.
 Many Integrated Core (MIC)
 A multicore chip with a very large number of cores: more than 50 cores per chip.
 GPGPUs
 A chip with multiple general-purpose processors plus graphics
processing units (GPUs) and specialised cores for video
processing and other tasks.
 When a broad range of applications are supported by such a
processor, the term general-purpose computing on GPUs
(GPGPU) is used.
Basic Measures of
Computer Performance
Amdahl’s Law
Amdahl’s Law
 Computer system designers look for ways to
improve system performance by advances in
technology or change in design.
 However, a speedup in one aspect of the technology or design does not necessarily result in a corresponding improvement in overall performance.
 Amdahl’s law was first proposed by Gene
Amdahl in 1967 and deals with the potential
speedup of a program using multiple
processors compared to a single processor.
Amdahl’s Law
 Consider a program running on a single
processor such that:
 a fraction (1 - f) of the execution time involves
code that is inherently sequential, and
 a fraction f that involves code that is infinitely
parallelizable with no scheduling overhead.
 Let T be the total execution time of the program
using a single processor.
Amdahl’s Law
 Then the speedup using a parallel processor with
N processors that fully exploits the parallel
portion of the program is as follows:
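With T, f, and N as defined above, the speedup is:

```latex
\mathrm{Speedup}
  = \frac{\text{time on a single processor}}{\text{time on $N$ parallel processors}}
  = \frac{T}{(1-f)T + \frac{fT}{N}}
  = \frac{1}{(1-f) + \frac{f}{N}}
```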
Amdahl’s Law
 Illustration of Amdahl’s Law
Amdahl’s Law
 Amdahl’s Law for Multiprocessors
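Two consequences follow: when f is small, adding processors helps little, and as N grows the speedup is bounded above by 1/(1 - f). A small sketch of these numbers (not from the slides):

```python
# Evaluate the Amdahl's-law speedup 1 / ((1 - f) + f / N),
# where f is the parallelizable fraction and N the processor count.

def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# With f = 0.95 the speedup saturates well below N as N grows:
for n in (2, 8, 64, 1_000_000):
    print(n, round(amdahl_speedup(0.95, n), 2))
# The limit as N -> infinity is 1 / (1 - f) = 20.
```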
Clock Speed
Clock Speed
 Operations performed by a processor, such as
fetching an instruction, decoding the instruction,
performing an arithmetic operation, and so on,
are governed by a system clock.
 Typically, all operations begin with the pulse of
the clock.
 Thus, at the most fundamental level, the speed
of a processor is dictated by the pulse frequency
produced by the clock, measured in cycles per
second, or Hertz (Hz).
 MHz (megahertz, or millions of pulses per second)
 GHz (gigahertz, or billions of pulses per second)
System Clock
 Typically, clock signals
are generated by a
quartz crystal, which
generates a constant
signal wave while power
is applied.
 This wave is converted
into a digital voltage
pulse stream that is
provided in a constant
flow to the processor
circuitry.
System Clock
 The rate of pulses is known as the clock
rate, or clock speed.
 One increment, or pulse, of the clock is
referred to as a clock cycle, or a clock tick.
 The time between pulses is the cycle time.
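Clock rate and cycle time are reciprocals; a toy calculation (not from the slides):

```python
# Cycle time in nanoseconds from the clock rate in Hz (tau = 1 / f).
def cycle_time_ns(clock_hz):
    return 1e9 / clock_hz

print(cycle_time_ns(400e6))  # 2.5 ns at 400 MHz
print(cycle_time_ns(2e9))    # 0.5 ns at 2 GHz
```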
Instruction Cycle, Machine Cycle and T-State
 The execution of an instruction involves a number
of discrete steps:
 fetching the instruction from memory,
 decoding the various portions of the instruction,
 loading and storing data, and
 performing arithmetic and logical operations.
 Most instructions on most processors require
multiple clock cycles to complete.
Instruction Cycle, Machine Cycle and T-State
 Instruction Cycle
 Is the fetching, decoding and execution of a single instruction.
 Typically consists of one to five read or write operations
between processor and memory or input/output devices.
 Machine Cycle
 Is the time period required by each memory or I/O operation.
 In other words, to move a byte of data in or out of
the microprocessor, a machine cycle is required.
 T-State
 Each machine cycle consists of 3 to 6 clock periods/cycles,
referred to as T-states.
 Typically, one instruction cycle consists of one to five machine
cycles and one machine cycle consists of three to six T-states
i.e. three to six clock periods.
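These ranges bound how many clock periods a single instruction can take; a small sketch of the arithmetic (not from the slides):

```python
# Clock periods per instruction, from the ranges above:
# 1-5 machine cycles per instruction, 3-6 T-states per machine cycle.
def clock_periods(machine_cycles, t_states_per_cycle):
    return machine_cycles * t_states_per_cycle

print(clock_periods(1, 3))  # fastest case: 3 clock periods
print(clock_periods(5, 6))  # slowest case: 30 clock periods
```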
Instruction Cycle, Machine Cycle and T-State
INSTRUCTION EXECUTION RATE
 Parameters
 Instruction Count (Ic) for a program is the number of machine instructions executed for that program until it runs to completion or for some defined time interval.
 Average Cycles Per Instruction (CPI) for a program is the average number of clock cycles required per machine instruction.
 On any given processor, the number of clock cycles
required varies for different types of instructions, such
as load, store, branch, and so on.
 Hence, the average cycles per instruction (CPI) for
a program is an important parameter.
Class Exercise
 Consider the execution of a program that results in the execution of 2 million instructions.
 The program consists of 4 major types of instructions. The instruction mix and the CPI for each instruction type are given below.
 What is the average CPI?
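The instruction-mix table is not reproduced in this text, so the sketch below uses hypothetical counts and CPIs purely to show the calculation: weight each instruction type's CPI by its count and divide by the total count.

```python
# Average CPI with a HYPOTHETICAL instruction mix (the exercise's
# actual table values are not reproduced in this text).
Ic = 2_000_000  # total instructions, from the exercise

# (instruction count, cycles per instruction) -- hypothetical values
mix = [
    (1_000_000, 1),  # e.g. arithmetic/logic
    (  500_000, 2),  # e.g. load
    (  300_000, 2),  # e.g. store
    (  200_000, 4),  # e.g. branch
]
assert sum(count for count, _ in mix) == Ic

avg_cpi = sum(count * cpi for count, cpi in mix) / Ic
print(avg_cpi)  # 1.7 for this hypothetical mix
```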
CPI Formula
 Let CPIi be the number of cycles required for instruction type i, and Ii be the number of executed instructions of type i for a given program.
 Then we can calculate an average CPI as follows:
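In symbols, with CPIi and Ii as defined above and Ic the total instruction count:

```latex
\mathrm{CPI} = \frac{\sum_{i=1}^{n} \left(\mathrm{CPI}_i \times I_i\right)}{I_c}
```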
Processor Execution Time Formula
 A processor is driven by a clock with a constant frequency f or, equivalently, a constant cycle time τ, where τ = 1/f.
 The processor time T needed to execute a given program can be expressed as:
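With Ic the instruction count, CPI the average cycles per instruction, and τ = 1/f the cycle time, the simplest form (ignoring memory stalls) is:

```latex
T = I_c \times \mathrm{CPI} \times \tau
```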
Class Exercise
 The processor runs at a clock rate of 400
MHz
 What is the processor execution time?
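A sketch of the calculation using T = Ic × CPI / f; the average CPI depends on the missing instruction-mix table, so the 1.7 below is a hypothetical value carried over from the illustrative mix above.

```python
# Execution time from instruction count, average CPI, and clock rate.
Ic = 2_000_000       # instructions, from the exercise
avg_cpi = 1.7        # HYPOTHETICAL average cycles per instruction
f = 400e6            # clock rate: 400 MHz

# T = Ic x CPI x tau, where tau = 1 / f
exec_time_s = Ic * avg_cpi / f
print(exec_time_s * 1e3, "ms")  # 8.5 ms for these values
```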
Instruction Execution Rate
 A common measure of performance for a processor is
the rate at which instructions are executed
 Expressed as millions of instructions per second (MIPS), known as the MIPS rate.
 Expressed in terms of the clock rate and CPI as
follows:
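With Ic the instruction count, T the execution time, f the clock rate, and CPI the average cycles per instruction:

```latex
\text{MIPS rate} = \frac{I_c}{T \times 10^6} = \frac{f}{\mathrm{CPI} \times 10^6}
```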