Computer Architecture and Organization Notes
1. Introduction
In computer engineering, computer architecture is a set of rules and methods
that describe the functionality, organization and implementation of computer
systems. Some definitions of architecture describe it as the capabilities and
programming model of a computer but not a particular implementation. In other
descriptions, computer architecture involves instruction set architecture design,
microarchitecture design, logic design, and implementation.
The word architecture was not coined for computers; it was borrowed from
other disciplines, and the term does not come with exact definitions,
specifications or principles. From a modern construction point of view, it can
be divided into four categories:
• Structure, layout: description of the parts and their interconnection.
• Interaction, cooperation: the dynamic communication of all working parts of the computer.
• Realization, implementation: the internal structure of all working parts.
• Functionality, activity: the final behavior of the whole computer.
To understand how a computer works, we have to deal with all of these aspects.
In this course the aspects of a computer system are viewed from the points of view of:
• Architecture, i.e. those attributes of a system visible to a machine language programmer.
• Organization, i.e. the operational units and their interconnections that realize the architecture.
1.1. Components of a Computer System
A computer system, like any system, consists of an interrelated set of
components. The system is best characterized in terms of structure (the way in
which the components are interconnected) and function (the operation of the
individual components). Furthermore, a computer's organization is
hierarchical: each major component can be further described by decomposing
it into its major subcomponents and describing their structure and function.
For clarity and ease of understanding, this hierarchical organization is
described in this course from the top down:
• Computer system: major components are the processor, memory and I/O.
• Processor: major components are the control unit, registers, the ALU, and the instruction execution unit.
• Control unit: major components are the control memory, microinstruction sequencing logic, and registers.
1.1.1. Structural view of a Computer
The structural view of a computer concerns the way in which components are
interconnected. There are two common architectural styles with differing
structural views, namely the von Neumann architecture and the Harvard
architecture.
a) Von Neumann architecture
The von Neumann architecture is representative of most general-purpose
computer architectures. It uses the stored-program concept, first described for
the EDVAC computer in 1945. Key concepts include:
• Data and instructions are stored in a single read-write memory.
• Instructions and data share one memory system (the same word length and the same form of address).
• The contents of the memory are addressable by location, without regard to the type of data contained there.
• Execution occurs in a sequential fashion (unless explicitly modified) from one instruction to the next.
[Figure: Von Neumann architecture - the CPU is connected to a single memory, which stores both data and instructions, by an address bus and a data bus that carries both data and instructions.]
b) Harvard architecture
The Harvard architecture has the following characteristics:
1) Instructions and data are stored in separate memories.
2) Instruction and data signals are carried over different pathways.
3) Generally, the instruction word is wider than the data word.
4) In some computers, the instruction memory is read-only.
[Figure: Harvard architecture - separate instruction and data memories, each with its own bus (instruction bus, data bus) and its own address bus.]
Comparison between the Harvard and von Neumann architectures in the
absence of cache memory
i) In cases without caches, the Harvard architecture is more efficient than
the von Neumann architecture.
ii) The Harvard architecture has separate data and instruction buses, allowing
transfers to occur simultaneously on both buses. The von Neumann architecture
has only one bus, which is used for both data transfers and instruction
fetches, so data transfers and instruction fetches must be scheduled;
they cannot be performed at the same time.
iii) Because of its wider instruction word, the Harvard architecture can
support more instructions with less hardware. For example, a 24-bit
instruction word can encode 2^24 = 16,777,216 distinct instructions, far
more than a 16-bit instruction word (2^16 = 65,536). With the uniform bus
width of the von Neumann architecture, the processor would need
correspondingly wider data hardware if it wanted a 24-bit instruction width.
iv) Two buses accessing memory simultaneously leave more time for the CPU.
A von Neumann processor has to perform a command in two steps (first read
the instruction, and then read the data that the instruction requires),
whereas the Harvard architecture can read an instruction and its data at the
same time. The parallel method is faster and more efficient, because it
takes only one step for each command.
CHECK POINT
The principal historical advantage of the Harvard architecture (simultaneous access to more
than one memory system) has been nullified by modern cache systems, allowing the more
flexible Von Neumann machine equal performance in most cases. The Modified Harvard
architecture has thus been relegated to niche applications where the ease-of-programming /
complexity / performance trade-off favors it.
With the help of caches, both architectures gain efficiency, and both have
advantages and disadvantages. It is impossible to decide which is better; both
are still used in modern computers.
Using caches in both architectures
The CPU is faster than main memory, so the Harvard architecture attempted to
improve performance by having two separate buses, one to carry data and
another to carry instructions. However, the introduction of cache memory
reduced the need to access main memory frequently during instruction
execution. Modern high-performance computers with caches therefore
incorporate aspects of both the von Neumann and Harvard architectures: the
cache on the CPU is divided into an instruction cache and a data cache, while
the main memory need not be separated into two sections. Von Neumann
programmers can thus work on Harvard-style hardware without being aware of
it. As a result, with the help of caches, both architectures gain efficiency;
both have advantages and disadvantages, it is impossible to decide which is
better, and both are still used in modern computers. We can now compare the
two in more detail:
Von Neumann
Pros:
• Programmers organize the content of the memory and can use the whole capacity of the installed memory.
• One bus is simpler for the control unit design; development of the control unit is cheaper and faster.
Cons:
• One bus (for data, instructions and devices) is a bottleneck.
• An error in a program can overwrite instructions and crash program execution.
Both architectures are still used in modern computers, and both have been used
massively in mainstream production over the years. The Harvard architecture is
used primarily for small embedded computers and for signal processing (DSP);
many microcontroller architectures (like the ones you would find in a toaster)
are Harvard architectures. The von Neumann architecture is better suited to
desktop computers, laptops, workstations and high-performance computers.
Some computers combine advantages from both architectures, typically using
two separate memories: the first for programs and the second for dynamic
data. Good examples are handheld devices such as PDAs and mobile phones.
1.1.2. Internal Structure of the Computer Itself:
• Central Processing Unit (CPU): controls the operation of the computer and performs its data processing functions; often simply referred to as the processor.
• Main Memory: stores data and instructions.
• I/O: moves data between the computer and its external environment.
• System Interconnection: some mechanism that provides for communication among the CPU, main memory, and I/O.
The functional units of a computer system are summarized below.
1.2. The Central Processing Unit (CPU) or Processor
A processor (CPU) is the core component in a computer system. It executes
instructions and manipulates data. A processor has several core components
that work together to perform calculations. There are many factors that
influence the performance of a processor.
• Data bus width
• Processor speed/clock rate
• Internal CPU architecture
• I/O bus speed
• Cache memory (level 1 and level 2)
The main structural components of the CPU are discussed next:
1.2.1. Control Unit: Controls the operation of the CPU and hence the
computer.
The control unit sits inside the CPU and coordinates the input and output
devices of a computer system. It coordinates the fetching of program code
from main memory to the CPU and directs the operation of the other
processor components by providing timing and control signals.
1.2.2. Arithmetic and Logic Unit (ALU): performs the data processing
functions of a computer. The ALU is a digital circuit that performs arithmetic
and logical operations, where arithmetic operations include things such as
ADD and SUBTRACT and logical operations include things such as AND, OR
and NOT. The ALU is a fundamental building block of the central processing
unit (CPU), and without it the computer would not be able to calculate
anything. Some examples of assembly-code instructions that use the ALU are
listed below (not all processors have all of these instructions):
ADD ; add one number to another number
SUB ; subtract one number from another number
INC ; increment a number by 1
DEC ; decrement a number by 1
MUL ; multiply numbers together
OR  ; boolean algebra function
AND ; boolean algebra function
NOT ; boolean algebra function
XOR ; boolean algebra function
JNZ ; jump to another section of code if a number is not zero (used for loops and ifs)
JZ  ; jump to another section of code if a number is zero (used for loops and ifs)
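As a rough illustration of what the ALU does with such opcodes, the following Python sketch (not modelled on any particular processor; the 8-bit word size is an assumption) dispatches an opcode to the corresponding operation and reports a zero flag of the kind JZ/JNZ would test:

# Minimal ALU model: dispatches an opcode to an arithmetic or logic operation.
# Opcode names mirror the mnemonics above; the 8-bit width is an arbitrary choice.
MASK = 0xFF  # keep results within an 8-bit word

def alu(opcode, a, b=0):
    ops = {
        "ADD": lambda: a + b,
        "SUB": lambda: a - b,
        "INC": lambda: a + 1,
        "DEC": lambda: a - 1,
        "MUL": lambda: a * b,
        "AND": lambda: a & b,
        "OR":  lambda: a | b,
        "XOR": lambda: a ^ b,
        "NOT": lambda: ~a,
    }
    result = ops[opcode]() & MASK
    zero_flag = (result == 0)        # used by JZ / JNZ to decide whether to jump
    return result, zero_flag

print(alu("ADD", 7, 5))   # (12, False)
print(alu("SUB", 5, 5))   # (0, True) -> a JZ instruction would take the jump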
1.2.3. Registers: Registers are small amounts of fast storage that are part of
the processor and provide storage internal to the CPU. For immediate
calculations, using main memory is too slow. Imagine having to send a signal
along the address bus and some data along the data bus when all you want to
do is store the result of adding two numbers together. The distance between
the processor and main memory, even though it might be only a few centimetres,
is far enough for the signal to take a significant time to get there. To get past
this issue there are small amounts of memory stored inside the processor
itself, called registers. Registers are incredibly fast pieces of memory that are
used to store the results of arithmetic and logic calculations. Different
processors have different sets of registers. A common register is the
accumulator (acc), a data register that the user can directly address (talk to)
and use to store any results they wish.
Processors may also have other registers with particular purposes. Some
registers include:
• Program Counter (PC) - an incrementing counter that keeps track of the memory address of the instruction that is to be executed next.
• Memory Address Register (MAR) - holds the address of the memory location that is about to be accessed (the next instruction to be fetched, or data to be read or written).
• Memory Buffer Register (MBR) - a two-way register that holds data fetched from memory (and ready for the CPU to process) or data waiting to be stored in memory.
• Current Instruction Register (CIR) - a temporary holding ground for the instruction that has just been fetched from memory.
• Accumulator - used to store the results of calculations.
• General purpose registers - registers that users may use as they wish.
• Address registers - used for storing addresses.
• Condition registers - hold truth values used for loops and selection.
The registers are used as temporary holding areas during an instruction
execution cycle. An instruction cycle (sometimes called fetch-decode-
execute cycle) is the basic operation cycle of a computer. It is the process by
which a computer retrieves a program instruction from its memory,
determines what actions the instruction requires, and carries out those actions.
This cycle is repeated continuously by the central processing unit (CPU),
from bootup to when the computer is shut down.
1. Instruction Fetch (IF): The fetch cycle begins with retrieving the address
stored in the program counter. This is a valid address in memory holding the
instruction to be executed. The CPU fetches the instruction stored at that
address from memory and transfers it to the instruction register (the CIR),
which holds the instruction to be executed. The program counter is then
incremented to point to the address from which the next instruction is to be
fetched.
2. Instruction Decode (ID): The decode cycle interprets the instruction that
was fetched in the fetch cycle. The operands are retrieved from their
addresses if need be.
3. Data Fetch (DF): Loads any operands the instruction needs from memory
into the CPU's registers.
4. Execute (EX): From the instruction register, the data forming the
instruction is decoded by the control unit. It then passes the decoded
information as a sequence of control signals to the relevant function units
of the CPU to perform the actions required by the instruction such as
reading values from registers, passing them to the Arithmetic logic unit
(ALU) to add them together and writing the result back to a register. A
condition signal is sent back to the control unit by the ALU if it is
involved.
5. Result Return (RR) : The result generated by the operation is stored in the
main memory, or sent to an output device. Based on the condition
feedback from the ALU, the PC is either incremented to address the next
instruction or updated to a different address where the next instruction will
be fetched. The cycle is then repeated. The steps are summarized
diagrammatically as shown below.
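To make the cycle concrete, here is a minimal Python sketch of a fetch-decode-execute loop for an invented three-instruction machine (LOAD, ADD and HALT are illustrative opcodes, not a real ISA); it shows the program counter, fetch, decode and execute steps described above:

# Toy fetch-decode-execute loop. The instruction format and opcodes are
# invented for illustration; memory holds (opcode, operand) pairs.
memory = [
    ("LOAD", 7),    # acc <- 7
    ("ADD", 5),     # acc <- acc + 5
    ("HALT", None),
]

pc = 0        # program counter
acc = 0       # accumulator
running = True

while running:
    instruction = memory[pc]       # 1. fetch: read the instruction at address PC
    pc += 1                        #    increment PC to point at the next instruction
    opcode, operand = instruction  # 2. decode: split into opcode and operand
    if opcode == "LOAD":           # 3-4. (data fetch and) execute
        acc = operand
    elif opcode == "ADD":
        acc = acc + operand
    elif opcode == "HALT":
        running = False

print(acc)  # 12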
1.2.4. CPU Interconnection: The interconnection structure provides for
communication among the control unit, ALU, and registers. It consists of the
collection of paths connecting the various modules of a computer, namely the
CPU module, the memory module and the I/O module. It must support the
following types of transfers:
• Memory to CPU
• CPU to Memory
• I/O to CPU
• CPU to I/O
• I/O to or from Memory - using Direct Memory Access (DMA)
The common interconnection structure is the bus interconnection. A bus is
a shared transmission medium that can only be used by one device at a time.
When used to connect the major computer components (CPU, memory, I/O) it
is called a system bus. There are three functional groups of communication
lines, as depicted in figure 2: the data bus, the address bus and the control bus.
a) Data lines (data bus) - move data between system modules. The
width of a data bus is a key factor in determining overall system
performance. Usually the width of a data bus is equal to the word
size of a computer or ½ that size.
IMPORTANT NOTE
A Word refers to a group of bits that a CPU can process at one time. In computing, word is
a term for the natural unit of data used by a particular processor design. A word is a fixed-
sized piece of data handled as a unit by the instruction set or the hardware of the processor.
The number of bits in a word is called a word size/ word width or word length and it is an
important characteristic of any specific processor design or computer architecture.
Processors with many different word sizes have existed though powers of two (8, 16, 32,
64) have predominated for many years. A processor's word size is often equal to the width
of its external data bus though sometimes the bus is made narrower than the CPU (often
half as many bits) to economize on packaging and circuit board costs.
Size of data bus = CPU word size or ½ of CPU word size
Word size = power of 2
"char" size that is a fraction of it. This is a natural choice since
instructions and data usually share the same memory subsystem. In
Harvard architectures the word sizes of instructions and data need not
be related, as instructions and data are stored in different memories; for
example, the processor in the 1ESS electronic telephone switch had 37-
bit instructions and 23-bit data words.
b) Address bus - designates the source or destination of the data on the data
bus. Its width determines the maximum possible memory capacity of the system:
Maximum possible memory capacity = 2^(address bus width)
Example:
The size of an address bus is 32 bits. Compute the maximum size of memory
that this bus can reference.
Solution: The bus can carry 2^32 distinct addresses and hence can refer to up
to 2^32 bytes of memory = 4 GB of memory. Any memory beyond that cannot
be addressed.
The address bus is also used to address I/O ports. Typically the high-order
bits select a particular module while the lower-order bits select a memory
location or I/O port within the module.
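As a quick check of the relation above (assuming byte-addressable memory), a short Python sketch:

# Maximum addressable memory = 2 ** (address bus width), assuming each address
# refers to one byte (byte-addressable memory).
def max_memory_bytes(address_bus_width):
    return 2 ** address_bus_width

print(max_memory_bytes(32))                  # 4294967296 bytes
print(max_memory_bytes(32) / 2**30, "GiB")   # 4.0 GiB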
c) Control Bus - control access to and use of the data and address lines.
Typical control lines include:
• Memory Read and Memory Write
• I/O Read and I/O Write
• Transfer ACK
• Bus Request and Bus Grant
• Interrupt Request and Interrupt ACK
• Clock
• Reset
If one module wishes to send data to another, it must:
• Obtain use of the bus
• Transfer data via the bus
If one module wishes to request data from another, it must:
• Obtain use of the bus
• Transfer a request to the other module over the control and address lines
• Wait for the second module to send the data
Typical physical arrangement of a system bus:
• A number of parallel electrical conductors.
• Each system component (usually on one or more boards) taps into some or all of the bus lines (usually with a slotted connector).
• The system can be expanded by adding more boards.
• A bad component can be replaced by replacing the board on which it resides.
PCI (Peripheral Component Interconnect) is a widely used bus standard. The PCI bus:
• can function as a mezzanine or peripheral bus;
• in its current standard supports up to 64 data lines at 33 MHz;
• requires few chips to implement;
• supports other buses attached to the PCI bus;
• is in the public domain, and was initially developed by Intel to support Pentium-based systems;
• supports a variety of microprocessor-based configurations, including multiple processors;
• uses synchronous timing and centralized arbitration.
A clock speed of 1 MHz means 1,000,000 cycles per second and potentially a
million calculations per second. A computer with a 3.4 GHz clock might be
capable of processing 3,400,000,000 instructions per second. However, it isn't
as simple as that, as some processors can perform more than one calculation
on each clock cycle.
For example, two processors with the same instruction set architecture but
different organizations are the AMD Opteron and the Intel Core i7. Both
processors implement the x86 instruction set, but they have very different
pipeline and cache organizations.
Hardware refers to the specifics of a computer, including the detailed logic
design and the packaging technology of the computer. Often a line of
computers contains computers with identical instruction set architectures and
nearly identical organizations, but they differ in the detailed hardware
implementation. For example, the Intel Core i7 and the Intel Xeon 7560 are
nearly identical but offer different clock rates and different memory systems,
making the Xeon 7560 more effective for server computers.
Computer architecture
Computer architecture covers the three aspects of computer design including:
instruction set architecture, organization or microarchitecture, and hardware.
The overall view of computer design shows how the ISA interfaces the software with the hardware:
[Figure: layered view of a computer system - Applications, Compiler and Firmware, the ISA, CPU / Memory / I/O (the hardware part), and Digital Circuits.]
As can be seen from this the instruction set architecture (ISA) provides the
interface between the software and the hardware. It is the medium of
communication between the hardware and the software. A software
instruction is translated by the compiler into machine code which is then
interpreted by the ISA using the available set of instructions.
Program code → Compiler → Machine code → ISA → Hardware instructions
Categories of instructions
1. Data movement/transfer instructions (Data handling and memory
operations)
Set a register to a fixed constant value.
Move data from a memory location or register to another memory
location or register without changing its form.
Copy data from a memory location to a register, or vice-versa
Used to store the contents of a register, result of a computation, or to
retrieve stored data to perform a computation on it later.
Read and write data from hardware devices.
Examples: STORE, LOAD, EXCHANGE, MOVE, CLEAR, SET, PUSH, POP.
2. Control flow instructions (branch and jump operations)
Branch to another location in the program and execute instructions there.
Conditionally branch to another location if a certain condition holds.
Indirectly branch to another location.
Call another block of code, while saving the location of the next instruction as a point to return to.
The entire group of these instructions is called the instruction set. The
instruction set therefore refers to the range of instructions that a CPU can
execute, or the basic set of commands that a microprocessor understands. The
instruction set determines what functions the microprocessor can perform.
One of the principal characteristics that separates RISC from CISC
microprocessors is the size of the instruction set: RISC microprocessors have
relatively small instruction sets whereas CISC processors have relatively
large instruction sets.
Each instruction has two parts: one is the task to be performed called the
operation code (opcode) and the other is the data to be operated on called the
operand (data).
Operands: there can be zero or more operand specifiers, which may
specify registers, memory locations, or literal data
Types of Operand:
Depending on the word size, there will be different numbers of bits available
for the opcode and for the operand. There are two different philosophies at
play:
1. More instructions and less space for the operand.
2. Fewer instructions and more space for the operand (e.g. ARM).
Examples:
1. For a word with 4 bits for an opcode and 6 bits for an operand:
How many different instructions could I fit into the instruction set?
What is the largest number that I could use as data?
Answer:
Number of instructions: 2^4 = 16
Largest operand: 2^6 - 1 = 63
How many different instructions could I fit into the instruction set?
What is the largest number that I could use as data?
Answer :
Number of instructions:
largest operand:
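The same arithmetic can be packaged as a small Python helper; this is only a sketch, and since the bit widths for the second example are not given above, only the 4-bit/6-bit case is evaluated:

# For a fixed-length instruction split into an opcode field and an operand field:
#   number of distinct opcodes = 2 ** opcode_bits
#   largest unsigned operand   = 2 ** operand_bits - 1
def instruction_format(opcode_bits, operand_bits):
    return 2 ** opcode_bits, 2 ** operand_bits - 1

num_instructions, largest_operand = instruction_format(4, 6)
print(num_instructions)   # 16
print(largest_operand)    # 63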
Other considerations in instruction set design include:
• memory access;
• instruction complexity.
A related design decision is byte ordering (endianness): whether the most
significant byte of a multi-byte value is stored at the lowest address (big
endian) or the least significant byte is stored there (little endian). Each
ordering has advantages (a short illustration follows the list):
• Big endian:
– Is more natural.
– The sign of the number can be determined by looking at the byte at address offset 0.
– Strings and integers are stored in the same order.
• Little endian:
– Makes it easier to place values on non-word boundaries.
– Conversion from a 16-bit integer address to a 32-bit integer address does not require any arithmetic.
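A quick way to see the two orderings (a Python sketch; the value 0x12345678 is arbitrary):

# The same 32-bit value laid out in memory under the two byte orderings.
value = 0x12345678
print(value.to_bytes(4, byteorder="big").hex())     # 12345678 - most significant byte first
print(value.to_bytes(4, byteorder="little").hex())  # 78563412 - least significant byte first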
• The next consideration for architecture design concerns how the CPU will store data.
• We have three choices:
1. A stack architecture
2. An accumulator architecture
3. A general purpose register architecture.
• In choosing one over the others, the tradeoff is between the simplicity (and
cost) of the hardware design on the one hand and execution speed and ease of
use on the other.
Stack architecture (0-operand architecture)
• In stack architecture, instructions and operands are implicitly taken
from the stack. A stack cannot be accessed randomly.
• All arithmetic operations take place using the top one or two positions
on the stack
• Stack machines use one - and zero-operand instructions.
• PUSH and POP operations involve only the stack’s top element.
Binary instructions (e.g., ADD, MULT) use the top two items on the
stack.
For example, the expression Z = X × Y + W × U (postfix: X Y * W U * +) is evaluated as:
PUSH X
PUSH Y
MULT
PUSH W
PUSH U
MULT
ADD
POP Z
Similarly, C = A + B becomes: PUSH A, PUSH B, ADD, POP C.
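A minimal Python sketch of such a stack machine running the sequence above (the variable values are arbitrary, chosen only to check the result; MULT and ADD pop the top two items and push the result, and POP stores the top of the stack):

# Toy stack machine evaluating Z = X*Y + W*U with the instruction sequence above.
variables = {"X": 2, "Y": 3, "W": 4, "U": 5}
program = [("PUSH", "X"), ("PUSH", "Y"), ("MULT", None),
           ("PUSH", "W"), ("PUSH", "U"), ("MULT", None),
           ("ADD", None), ("POP", "Z")]

stack = []
for opcode, operand in program:
    if opcode == "PUSH":
        stack.append(variables[operand])     # push the variable's value
    elif opcode in ("MULT", "ADD"):
        b, a = stack.pop(), stack.pop()      # binary ops use the top two items
        stack.append(a * b if opcode == "MULT" else a + b)
    elif opcode == "POP":
        variables[operand] = stack.pop()     # store the result from the stack top

print(variables["Z"])   # 26  (2*3 + 4*5)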
Accumulator architecture (1-operand architecture)
• In an accumulator architecture, one operand is implicitly the accumulator, so
instructions name only one explicit operand. The expression Z = X × Y + W × U
looks like this:
LOAD X
MULT Y
STORE TEMP
LOAD W
MULT U
ADD TEMP
STORE Z
General purpose register architecture
• In a general purpose register (GPR) architecture, registers can be used
instead of memory.
– Faster than accumulator architecture.
– Efficient implementation for compilers.
– Results in longer instructions.
• Most systems today are GPR systems.
• There are three types:
– Memory-memory where two or three operands may be in
memory.
– Register-memory where at least one operand must be in a
register.
– Load-store where no operands may be in memory.
• The number of operands and the number of available registers have a
direct effect on instruction length.
• In a two-address ISA (e.g. Intel, Motorola), the infix expression
Z = X × Y + W × U might look like this:
LOAD R1,X
MULT R1,Y
LOAD R2,W
MULT R2,U
ADD R1,R2
STORE Z,R1
Many CISC and RISC machines fall under this two-operand category:
• CISC - move A to C; then add B to C. C = A + B needs two instructions. This
effectively 'stores' the result without an explicit store instruction.
• CISC - often machines are limited to one memory operand per instruction, so
C = A + B is coded as:
load a,reg1;
add b,reg1;
store reg1,c;
This requires a load/store pair for any memory movement, regardless of
whether the add result is stored to a different place, as in C = A + B, or to the
same memory location, as in A = A + B. C = A + B needs three instructions.
• RISC - requiring explicit memory loads, the instructions would be:
load a,reg1;
load b,reg2;
add reg1,reg2;
store reg2,c;
C = A + B needs four instructions.
In a three-operand (three-address) machine, unlike the 2-operand or 1-operand
forms above, all three values a, b, and c remain in registers, available for
further reuse.
The task the computer designer faces is a complex one: Determine what
attributes are important for a new computer, then design a computer to
maximize performance and energy efficiency while staying within cost,
power, and availability constraints. This task has many aspects, including
instruction set design, functional organization, logic design, and
implementation. The implementation may encompass integrated circuit
design, packaging, power, and cooling. Optimizing the design requires
familiarity with a very wide range of technologies, from compilers and
operating systems to logic design and packaging.
Over time, computer architects have strived to meet user needs by designing
systems that serve those needs as effectively as possible within economic and
technological constraints. Some key market demands that drive computer
architecture include:
• Performance: users' processing needs keep increasing, and the demand is for systems that can process huge workloads faster.
• Storage: users' storage needs keep increasing, and the demand is for systems that can store more and more information.
• Portability: users have become more mobile, and the demand is for smaller systems that can be carried along.
• Affordability: the demand is for computer systems that cost less to produce and are priced affordably for a wider population. The use of technology improvements to lower cost, as well as to increase performance, has been a major theme in the computer industry.
In response to these market demands, computer systems have evolved over
time to exhibit:
• performance increases almost yearly;
• memory size going up by a factor of 4 every 3 years or so;
• price drops every year;
• decreasing size.
Trends in technology determine not only the future cost but also the longevity
of an architecture. Key trends include:
Trends in Technology:
Computer technology changes rapidly, and if an instruction set architecture is
to be successful, it must be designed to survive the rapid change in computer
technology. A successful new instruction set architecture may last for decades,
and an architect must therefore plan for technology changes that can increase
the lifetime of a successful computer. To plan for the evolution of a
computer, the designer must be aware of rapid changes in implementation
technology. Five implementation technologies, which change at a dramatic
pace, are critical to modern implementations:
Semiconductor Flash memory: capacity per Flash chip has increased by about
50% to 60% per year recently, doubling roughly every two years.
Magnetic disk technology: Prior to 1990, density increased by about 30% per
year, doubling in three years. It rose to 60% per year thereafter, and increased
to 100% per year in 1996. Since 2004, it has dropped back to about 40% per
year, or doubled every three years. Disks are 15 to 25 times cheaper per bit
than Flash.
Designers often design for the next technology, knowing that when a product
begins shipping in volume the next technology may be the most cost-effective
or may have performance advantages. Traditionally, cost has decreased at
about the rate at which density increases.
A simple rule of thumb is that bandwidth grows by at least the square of the
improvement in latency. Computer designers should plan accordingly.
Challenge of designing to manage Power and Energy in Integrated Circuits
First, power must be brought in and distributed around the chip, and modern
microprocessors use hundreds of pins and multiple interconnect layers just for
power and ground. Second, power is dissipated as heat and must be removed.
How should a system architect or a user think about performance, power, and
energy? From the viewpoint of a system designer, there are three primary
concerns.
First, what is the maximum power a processor ever requires? Meeting this
demand can be important to ensuring correct operation. For example, if a
processor attempts to draw more power than a power supply system can
provide (by drawing more current than the system can supply), the result is
typically a voltage drop, which can cause the device to malfunction. Modern
processors can vary widely in power consumption with high peak currents;
hence, they provide voltage indexing methods that allow the processor to
slow down and regulate voltage within a wider margin. Obviously, doing so
decreases performance.
Second, what is the sustained power consumption? This metric is widely called
the thermal design power (TDP), since it determines the cooling requirement
of the system.
The third factor that designers and users need to consider is energy and
energy efficiency. Recall that power is simply energy per unit time: 1 watt =
1 joule per second. Which metric is the right one for comparing processors:
energy or power? In general, energy is always a better metric because it is
tied to a specific task and the time required for that task. In particular, the
energy to execute a workload is equal to the average power times the
execution time for the workload. Thus, if we want to know which of two
processors is more efficient for a given task, we should compare energy
consumption (not power) for executing the task.
For example, processor A may have a 20% higher average power
consumption than processor B, but if A executes the task in only 70% of the
time needed by B, its energy consumption will be 1.2 × 0.7 = 0.84, which is
clearly better.
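The comparison can be checked in a couple of lines of Python (the figures are the relative ones from the example; absolute units do not matter):

# Energy = average power * execution time. Processor A draws 20% more power
# than B but finishes the task in 70% of B's time (figures from the example).
power_b, time_b = 1.0, 1.0                  # normalised baseline for processor B
power_a, time_a = 1.2 * power_b, 0.7 * time_b

energy_a = power_a * time_a                 # 1.2 * 0.7 = 0.84 relative to B
energy_b = power_b * time_b                 # 1.00
print(round(energy_a, 2), energy_b)         # A uses less energy despite higher power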
Distributing the power, removing the heat, and preventing hot spots have
become increasingly difficult challenges. Power is now the major constraint
to using transistors; in the past, it was raw silicon area. Hence, modern
microprocessors offer many techniques to try to improve energy efficiency
despite flat clock rates and constant supply voltages:
1. Do nothing well. Most microprocessors today turn off the clock of inactive
modules to save energy and dynamic power. For example, if no floating-point
instructions are executing, the clock of the floating-point unit is disabled. If
some cores are idle, their clocks are stopped.
2. Dynamic Voltage-Frequency Scaling (DVFS). Modern microprocessors
typically offer a few clock frequencies and voltages in which to operate that
use lower power and energy.
3. Design for typical case. Given that PMDs and laptops are often idle,
memory and storage offer low power modes to save energy. For example,
DRAMs have a series of increasingly lower power modes to extend battery
life in PMDs and laptops, and there have been proposals for disks that have a
mode that spins at lower rates when idle to save power.
4. Overclocking. Intel started offering Turbo mode in 2008, where the chip
decides that it is safe to run at a higher clock rate for a short time possibly on
just a few cores until temperature starts to rise.
5. Make the common case fast (Focus on the Common Case): Larger
addresses and constants in instructions and keeping all instructions the
same length.
6. Good design demands good compromises: - PC-relative addressing
for branches and immediate addressing for constant operands.
7. Design for Moore’s law: design with future technological
advancements in mind
Definition
Response time or execution time of a program is defined as the time
between the start and the finish of a task (in time units)
Viewpoint two: if computer X is faster than Y, then within a given time X
processes more tasks than Y, or X completes more transactions than Y. This is
called throughput.
Definition
Throughput is defined as the total amount of work or tasks done in a given
time period (in number of tasks per unit of time)
Example:
If a car assembly plant produces 6 cars per hour, then the throughput of the
plant is 6 cars per hour.
If a car assembly plant takes 4 hours to produce a car, then the response time
of the plant is 4 hours per car.
Therefore:
Throughput = number of tasks / unit time (tasks per given time)
Response time = total time / task (total time per task)
The computer user wants the response time to decrease, while the data centre
manager wants the throughput increased.
The phrase "X is faster than Y" is used here to mean that the response time or
execution time is lower on X than on Y for the given task. In particular,
"X is n times faster than Y" will mean:
Execution time of Y / Execution time of X = n
Since execution time is the reciprocal of performance:
n = Execution time of Y / Execution time of X = (1 / Performance of Y) / (1 / Performance of X) = Performance of X / Performance of Y
CPU performance = cycles / time in seconds
Clock speed or clock rate is measured in Hertz (Hz), which means 'per second'.
• 1 Hz = one cycle per second, i.e. potentially one calculation per second.
• A clock speed of 1 MHz means 1,000,000 cycles per second and potentially a million calculations per second.
The clock cycle time is the reciprocal of the clock rate:
clock_cycle_time = 1 / clock rate, so for a clock rate of X cycles/sec, clock_cycle_time = 1/X seconds
• Thus, when we refer to different instruction types (from a performance point
of view), we are referring to instructions that require different numbers of
clock cycles to execute; one instruction may take several clock cycles. If a
program has X instructions and each instruction takes some number of clock
cycles to execute, then the clock cycles for the program equal the sum of the
clock cycles needed to execute all of its instructions.
• Thus: clock cycles for a program = total number of clock cycles needed to
execute all instructions of the given program.
CPI – the average number of clock cycles per instruction (for a given execution
of a given program) is an important parameter given as:
CPI = clock cycles for a program / instruction count
Since CPI = clock cycles for a program / instruction count, we have
clock cycles for a program = instruction count × CPI. Substituting this
expression into CPU time = clock cycles for a program / clock rate gives:
CPU time = Instruction count * CPI / Clock rate
From this equation it is clear that processor clock rate alone is not sufficient
to describe a computer's performance. Since good performance means less
CPU time, a processor with a high clock rate but also a high CPI has poor
performance: the benefit of the high clock rate is cancelled out by the high CPI.
For example, suppose machine A has a 200 MHz clock and a CPI of 1, while
machine B has a 400 MHz clock and a CPI of 4. Then:
Machine A runtime = (1 × instruction count) / (200 × 10^6)
Machine B runtime = (4 × instruction count) / (400 × 10^6)
For any given program, the runtime of A = ½ × the runtime of B; therefore
machine B will clearly be slower for any program, in spite of its higher clock
rate.
The goal of improving computer performance for the user is to decrease the
execution time (reduce CPU time) therefore from the CPU performance
equation:
CPU time = Instruction count * CPI / Clock rate
How can a designer improve (i.e. decrease) CPU time? From the equation, the
following may be done:
• Increase the clock rate, thus making the denominator of the equation larger. This may be achieved by improving hardware technology and organization.
• Reduce the CPI. This may be achieved by reworking the computer organization, the ISA and compiler technology.
• Reduce the instruction count. This may be achieved by improving the ISA and compiler technology.
Calculating CPI
The table below indicates the frequency of each instruction type executed in a
"typical" program and, from existing reference manuals, the number of cycles
per instruction for each type.
Instruction Type Frequency Cycles
Example 1:
Consider an implementation of MIPS ISA with 500 MHz clock and
– each ALU instruction takes 3 clock cycles,
– each branch/jump instruction takes 2 clock cycles,
– each sw instruction takes 4 clock cycles,
– Each lw instruction takes 5 clock cycles.
Also, consider a program that during its execution executes:
– x=200 million ALU instructions
– y=55 million branch/jump instructions
– z=25 million sw instructions
– w=20 million lw instructions
Find CPU time. Assume sequentially executing CPU.
Solution
Approach 1:
Clock cycles for a program = (x*3 + y*2 + z*4 + w*5)
= 910 × 10^6 clock cycles
CPU_time = Clock cycles for a program / Clock rate
= 910 × 10^6 / (500 × 10^6) = 1.82 sec
Approach 2:
CPI = (x*3 + y*2 + z*4 + w*5) / (x + y + z + w)
= 3.03 clock cycles/instruction
CPI = Clock cycles for a program / Instruction count
CPU time = Instruction count * CPI / Clock rate
= (x + y + z + w) * 3.03 / (500 × 10^6)
= 300 × 10^6 * 3.03 / (500 × 10^6)
= 1.82 sec
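The same calculation can be written as a short Python sketch, using the instruction mix and cycle counts of Example 1:

# CPU time = (sum of clock cycles over all instructions) / clock rate
#          = instruction_count * CPI / clock_rate
clock_rate = 500e6                      # 500 MHz
mix = {                                 # (instructions executed, cycles per instruction)
    "ALU":         (200e6, 3),
    "branch/jump": (55e6,  2),
    "sw":          (25e6,  4),
    "lw":          (20e6,  5),
}

total_cycles = sum(count * cycles for count, cycles in mix.values())
instruction_count = sum(count for count, _ in mix.values())
cpi = total_cycles / instruction_count

print(total_cycles)                     # 910,000,000 cycles
print(round(cpi, 2))                    # about 3.03 cycles per instruction
print(total_cycles / clock_rate)        # 1.82 seconds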
Example 2:
Solution
NB: The calculation may not be accurate since the numbers of cycles per
instruction given don’t account for pipeline effects and other advanced design
techniques.
Another element that affects computer performance is the delay in memory
access. Instruction execution involves a fetch-decode-execute cycle in which
the computer retrieves a program instruction from its memory, determines
what actions the instruction requires, and carries out those actions. Delays
while fetching program instructions and data from memory can therefore
significantly reduce performance. Some assumed delays: memory access =
2 ns, ALU operation = 2 ns, register file access = 1 ns.
Memory hierarchy is one among the many trade-offs in designing for high
performance i.e. the size and technology of each component. So the various
components can be viewed as forming a hierarchy of memories
(m1,m2,...,mn) in which each member mi is in a sense subordinate to the next
highest member mi+1 of the hierarchy. To limit waiting by higher levels, a
lower level will respond by filling a buffer and then signaling to activate the
transfer.
There are four major storage levels.
1. Internal – Processor registers and cache.
2. Main – the system RAM and controller cards.
3. On-line mass storage – Secondary storage.
4. Off-line bulk storage – Tertiary and Off-line storage.
The number of levels in the memory hierarchy and the performance at each
level has increased over time. For example, the memory hierarchy of an Intel
Haswell Mobile processor circa 2013 is:
1. Processor registers – the fastest possible access (usually 1 CPU cycle).
A few thousand bytes in size
2. Cache
3. Level 0 (L0) Micro operations cache – 6 KiB in size
4. Level 1 (L1) Instruction cache – 128 KiB in size
5. Level 1 (L1) Data cache – 128 KiB in size. Best access speed is around
700 GiB/second
6. Level 2 (L2) Instruction and data (shared) – 1 MiB in size. Best access
speed is around 200 GiB/second
7. Level 3 (L3) Shared cache – 6 MiB in size. Best access speed is around
100 GB/second
8. Level 4 (L4) Shared cache – 128 MiB in size. Best access speed is
around 40 GB/second
9. Main memory (Primary storage) – Gigabytes in size. Best access speed
is around 10 GB/second. In the case of a NUMA machine, access times
may not be uniform
10. Disk storage (Secondary storage) – Terabytes in size. As of 2013, best
access speed is from a solid state drive is about 600 MB/second
11. Nearline storage (Tertiary storage) – Up to exabytes in size. As of
2013, best access speed is about 160 MB/second
12. Offline storage
Cache Memory
A cache (pronounced "cash") is a small and very fast temporary storage
memory used to improve the average access time to slow memory. It is designed
to speed up the transfer of data and instructions. It is located inside or close to
the CPU chip. It is faster than RAM, and the data/instructions that have been
most recently or most frequently used by the CPU are stored in the cache.
A cache exploits spatial and temporal locality. In computer architecture, almost
everything is a cache!
o Registers are a cache on variables.
o The first-level cache is a cache on the second-level cache.
o The second-level cache is a cache on memory.
o Memory is a cache on disk (virtual memory).
The data and instructions are retrieved from RAM when CPU uses them for
the first time. A copy of that data or instructions is stored in cache. The next
time the CPU needs that data or instructions, it first looks in cache. If the
required data is found there, it is retrieved from cache memory instead of
main memory. It speeds up the working of CPU.
Memory hierarchy goals
Cache is a small high-speed memory. Stores data from some frequently used
addresses (of main memory).
Cache hit: data is found in the cache. This results in a data transfer at
maximum speed.
Cache miss: data is not found in the cache. The processor loads the data from
main memory and copies it into the cache. This results in an extra delay,
called the miss penalty.
Hit ratio = percentage of memory accesses satisfied by the cache.
Miss ratio = 1 - hit ratio.
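As an illustration of this bookkeeping, the following Python sketch simulates a small fully associative cache with least-recently-used replacement (the capacity and the policy are assumptions for the sketch, not a description of any particular CPU) and reports the hit and miss ratios for a stream of addresses:

# Count cache hits and misses for a stream of memory addresses.
from collections import OrderedDict

def simulate_cache(addresses, capacity=4):
    cache = OrderedDict()               # address -> None, ordered by recency
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)     # cache hit: mark as most recently used
        else:
            if len(cache) >= capacity:  # cache miss: evict least recently used
                cache.popitem(last=False)
            cache[addr] = None          # copy the block into the cache
    hit_ratio = hits / len(addresses)
    return hit_ratio, 1 - hit_ratio     # (hit ratio, miss ratio)

# A loop re-reading the same few addresses shows temporal locality paying off.
print(simulate_cache([0, 1, 2, 0, 1, 2, 0, 1, 2, 3]))   # (0.6, 0.4)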
Locality of References
This important fundamental observation comes from properties of programs.
The most important program property that we regularly exploit is locality of
references: programs tend to reuse data and instructions they have used
recently.
Temporal locality: recently accessed items are likely to be accessed again in
the near future.
Spatial locality: items whose addresses are near one another tend to be
referenced close together in time.
This principle also applies when deciding how to spend resources, since the
impact of making some occurrence faster is higher if the occurrence is frequent.
Helps performance
Smaller is Faster
Smaller pieces of hardware will generally be faster than larger pieces.
This simple principle is particularly applicable to memories built from the same
technology for two reasons:
In most technologies we can obtain smaller memories that are faster than
larger memories. This is primarily because the designer can use more power per
memory cell in a smaller design;