Computer Architecture and Organization Notes

Computer Design and Organization

This course is about:

 What computers consist of
 How computers work
 How they are organized internally
 What the design tradeoffs are
 How design affects programming and applications

1. Introduction
In electronic engineering, computer architecture is a set of rules and methods
that describe the functionality, organization and implementation of computer
systems. Some definitions of architecture define it as describing the
capabilities and programming model of a computer but not a particular
implementation. In other descriptions, computer architecture involves instruction set architecture design, microarchitecture design, logic design, and implementation.

The word architecture was not coined for computers; it was borrowed from other disciplines. The term does not provide exact definitions, specifications or principles. From a modern, constructional point of view, it can be divided into four categories:
 Structure, layout: description of the parts and their interconnection,
 Interaction, cooperation: the dynamic communication of all working parts of a computer,
 Realization, implementation: the internal structure of all working parts,
 Functionality, activity: the final behavior of the whole computer.
To understand how a computer works, we have to deal with all of these aspects.
In this course the aspects of a computer system are viewed from the points of
view of:
 Architecture i.e. those attributes of a system visible to a machine
language programmer
 Organization i.e. the operational units and their interconnections that
realize the architecture.
1.1. Components of a Computer System
A computer system, like any system, consists of an interrelated set of components. The system is best characterized in terms of structure (the way in which components are interconnected) and function (the operation of the individual components). Furthermore, a computer's organization is hierarchical. Each major component can be further described by decomposing it into its major subcomponents and describing their structure and function.
For clarity and ease of understanding, this hierarchical organization is
described in this course from the top down:
 Computer system: Major components are processor, memory, I/O.
 Processor: Major components are control unit, registers, ALU, and
instruction execution unit.
 Control unit: Major components are control memory, microinstruction
sequencing logic, and registers.
1.1.1. Structural view of a Computer
The structural view of a computer concerns the way in which components are
interconnected. There are two common architectural styles with differing
structural view namely: The Von Neumann architecture and the Harvard
architecture
a) Von Neumann architecture

The Von Neumann architecture is representative of most general-purpose computer architectures. It uses the stored-program concept, first described in the 1945 report on the EDVAC computer. Key concepts include:
 Data and instructions are stored in a single read-write memory
 Instructions and data share one memory system (the same word length and the same form of address)
 The contents of the memory are addressable by location, without regard to the type of data contained there
 Execution occurs in a sequential fashion (unless explicitly modified) from one instruction to the next
Von Neumann architecture (memory stores both instructions and data; a single data bus carries both instructions and data):

[Figure: CPU connected to one memory over a single address bus and a single data bus; the memory stores both data and instructions, and the data bus carries both.]

b) Harvard architecture
The Harvard architecture has the following characteristics:
1) Instructions and data are stored in separate memories.
2) Instruction and data signals are carried over different pathways.
3) Generally, the instruction word is wider than the data word.
4) In some computers, the instruction memory is read-only.

[Figure: CPU connected to two separate memories; an instruction memory with its own address and instruction buses, and a data memory with its own address and data buses.]
Comparison between the Harvard and von Neumann architectures in the absence of cache memory
i) In the absence of caches, the Harvard architecture is more efficient than the von Neumann architecture.
ii) The Harvard architecture has separate data and instruction buses, allowing transfers to take place simultaneously on both buses. The von Neumann architecture has only one bus, which is used for both data transfers and instruction fetches, so data transfers and instruction fetches must be scheduled; they cannot be performed at the same time.
iii) Because of its wider instruction word, the Harvard architecture can support more instructions with less hardware. For example, a processor with a 24-bit instruction word can encode 2^24 = 16,777,216 distinct instructions, far more than a 16-bit instruction word allows (2^16 = 65,536). With the uniform bus width of the von Neumann architecture, the processor would need wider data paths throughout if it wanted a 24-bit instruction width.
iv) Two buses accessing memory simultaneously make better use of CPU time. A von Neumann processor has to perform a command in two steps (first read the instruction, then read the data that the instruction requires), whereas the Harvard architecture can read an instruction and its data at the same time. The parallel method is faster and more efficient, because it takes only one step per command.

CHECK POINT

Is Harvard architecture better than Von Neumann’s Architecture?

The principal historical advantage of the Harvard architecture (simultaneous access to more
than one memory system) has been nullified by modern cache systems, allowing the more
flexible Von Neumann machine equal performance in most cases. The Modified Harvard
architecture has thus been relegated to niche applications where the ease-of-programming /
complexity / performance trade-off favors it.
With the help of caches, both architectures gain efficiency. Both have advantages and disadvantages; it is impossible to decide which is better, and both are still used in modern computers.
Using caches in both architectures
The CPU is faster than main memory access. The Harvard architecture therefore attempted to improve performance by providing two separate buses, one to carry data and another to carry instructions. However, the introduction of cache memory reduced the need to access main memory frequently during instruction execution. Modern high-performance computers with caches therefore incorporate aspects of both the von Neumann and Harvard architectures: on a von Neumann machine, the cache on the CPU is divided into an instruction cache and a data cache, while main memory need not be separated into two sections. As a result, programmers can treat the machine as a von Neumann architecture without needing to know the hardware details, even though the processor core behaves like a Harvard architecture. With the help of caches, both architectures gain efficiency; both have advantages and disadvantages, it is impossible to decide which is better, and both are still used in modern computers. We can now compare the two in more detail:

Von Neumann

Pros:
 Programmers organize the content of the memory and can use the whole capacity of the installed memory.
 One bus is simpler for the Control Unit design; development of the Control Unit is cheaper and faster.
 A computer with one bus is cheaper.
 Data and instructions are accessed in the same way.

Cons:
 One bus (for data, instructions and devices) is a bottleneck.
 An error in a program can rewrite instructions and crash program execution.

Harvard Pros and Cons

Pros:
1. Two memories with two buses allow parallel access to data and instructions. Execution can be up to 2 times faster.
2. Both memories can be produced with different technologies (Flash/EEPROM, SRAM/DRAM).
3. Both memories can use different cell sizes.
4. A program can't rewrite itself.

Cons:
1. A control unit for two buses is more complicated and more expensive.
2. Production of a computer with two buses is more expensive.
3. Development of a more complicated Control Unit needs more time.
4. Free data memory can't be used for instructions, and vice versa.
Both architectures are still used in modern computers, and over the years both have been used massively in mainstream production. The Harvard architecture is used primarily in small embedded computers and in signal processing (DSP); many microcontroller architectures (like the ones you would find in a toaster) are Harvard architectures. The von Neumann architecture is better suited to desktop computers, laptops, workstations and high-performance computers. Some computers combine advantages from both architectures; typically they use two separate memories, the first for programs and the second for dynamic data. Good examples are handheld devices such as PDAs and mobile phones.
1.1.2. Internal Structure of the Computer Itself:
 Central Processing Unit (CPU): Controls the operation of the computer
and performs its data processing functions. Often simply referred to as
processor
 Main Memory: Stores data and instructions
 I/O: Moves data between the computer and its external environment.
 System Interconnection: Some mechanism that provides for
communication among CPU, main memory, and I/O.
The functional units of a computer system are summarized in the figure below.

[Figure: functional units of a computer system (CPU, main memory, I/O, system interconnection).]
1.2. The Central Processing Unit (CPU) or Processor
A processor (CPU) is the core component in a computer system. It executes
instructions and manipulates data. A processor has several core components
that work together to perform calculations. There are many factors that
influence the performance of a processor.
 Data bus width
 Processor speed/clock rate
 Internal CPU architecture
 I/O bus speed
 Cache memory, level 1 and level 2
The main structural components of the CPU are discussed next:
1.2.1. Control Unit: Controls the operation of the CPU and hence the
computer.
The control unit sits inside the CPU and coordinates the input and output
devices of a computer system. It coordinates the fetching of program code
from main memory to the CPU and directs the operation of the other
processor components by providing timing and control signals.
1.2.2. Arithmetic and Logic Unit (ALU): performs the data processing functions of a computer. The Arithmetic Logic Unit, or ALU, is a digital circuit that performs arithmetic and logical operations, where arithmetic operations include things such as ADD and SUBTRACT and logical operations include things such as AND, OR and NOT. The ALU is a fundamental building block in the central processing unit (CPU) of a computer; without it the computer wouldn't be able to calculate anything. Some examples of assembly code instructions that would use the ALU are as follows (not all processors will have all these instructions):
 ADD ; add one number to another number
 SUB ; subtract one number from another number
 INC ; increment a number by 1
 DEC ; decrements a number by 1
 MUL ; multiply numbers together
 OR ; boolean algebra function
 AND ; boolean algebra function
 NOT ; boolean algebra function
 XOR ; boolean algebra function
 JNZ ; jump to another section of code if a number is not
zero (used for loops and ifs)
 JZ ; jump to another section of code if a number is zero
(used for loops and ifs)
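
To make the role of the ALU more concrete, here is a minimal illustrative sketch in Python (not any particular processor's instruction set; the function and opcode names are just for illustration) that models a few of the operations listed above, together with the zero flag that jump instructions such as JZ and JNZ would test:

# Minimal, illustrative ALU model: each mnemonic maps to a small integer
# operation. This is a teaching sketch, not a real instruction set.
def alu(opcode, a, b=0):
    """Perform one ALU operation and return (result, zero_flag)."""
    ops = {
        "ADD": lambda: a + b,
        "SUB": lambda: a - b,
        "INC": lambda: a + 1,
        "DEC": lambda: a - 1,
        "MUL": lambda: a * b,
        "AND": lambda: a & b,
        "OR":  lambda: a | b,
        "XOR": lambda: a ^ b,
        "NOT": lambda: ~a,
    }
    result = ops[opcode]()
    return result, (result == 0)   # the zero flag is what JZ / JNZ would test

print(alu("ADD", 2, 3))   # (5, False)
print(alu("SUB", 4, 4))   # (0, True) -> a JZ instruction would branch here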

1.2.3. Registers: Registers are small amount of fast storage which is part of
the processor and provides storage internal to the CPU. For immediate
calculations, using main memory is too slow. Imagine having to send a signal
along the address bus and some data along the data bus when all you want to
do is store the result of adding two numbers together. The distance between
the processor and main memory, even though it might be a few centimetres,
is far enough for the signal to take a significant time to get there. To get past
this issue there are small amounts of memory stored inside the processor
itself, these are called registers. Registers are incredibly fast pieces of
memory that are used to store the results of arithmetic and logic calculations.
Different processors will have different sets of registers. A common register
is the Accumulator (acc) which is a data register, where the user is able to
directly address (talk to) it and use it to store any results they wish.
Processors may also have other registers with particular purposes. Some
registers include:
 Program Counter (PC) - an incrementing counter that keeps track of the memory address of the instruction that is to be executed next.
 Memory Address Register (MAR) - holds the address of the memory location that is about to be accessed, whether that location holds an instruction to be fetched or data to be read or written

 Memory Buffer Register (MBR) - a two-way register that holds data
fetched from memory (and ready for the CPU to process) or data
waiting to be stored in memory
 Current Instruction register (CIR) - a temporary holding ground for
the instruction that has just been fetched from memory
 Accumulator - Used to store results of calculations
 General purpose register - allow users to use them as they wish
 Address registers - used for storing addresses
 Conditional registers - hold truth values for loop and selection
The registers are used as temporary holding areas during an instruction
execution cycle. An instruction cycle (sometimes called fetch-decode-
execute cycle) is the basic operation cycle of a computer. It is the process by
which a computer retrieves a program instruction from its memory,
determines what actions the instruction requires, and carries out those actions.
This cycle is repeated continuously by the central processing unit (CPU),
from bootup to when the computer is shut down.

Steps of Instruction Execution cycle (Fetch-decode-Execute Cycle)

1. Instruction Fetch (IF): The fetch cycle begins with retrieving the address stored in the program counter. The address stored is some valid address in the memory holding the instruction to be executed. The Central Processing Unit completes this step by placing this address in the Memory Address Register (MAR), fetching the instruction stored at that address from memory, and transferring the instruction to the Current Instruction Register (CIR), which holds the instruction to be executed. The program counter is then incremented to point to the address from which the next instruction is to be fetched.

2. Instruction Decode (ID): The decode cycle is used for interpreting the instruction that was fetched in the fetch cycle. The operands are retrieved from their addresses if need be.

3. Data Fetch (DF): Any operands (data) that the instruction requires are loaded from memory into CPU registers.

4. Execute (EX): From the instruction register, the data forming the
instruction is decoded by the control unit. It then passes the decoded
information as a sequence of control signals to the relevant function units
of the CPU to perform the actions required by the instruction such as
reading values from registers, passing them to the Arithmetic logic unit
(ALU) to add them together and writing the result back to a register. A
condition signal is sent back to the control unit by the ALU if it is
involved.

5. Result Return (RR) : The result generated by the operation is stored in the
main memory, or sent to an output device. Based on the condition
feedback from the ALU, the PC is either incremented to address the next
instruction or updated to a different address where the next instruction will
be fetched. The cycle is then repeated. The steps are summarized
diagrammatically as shown below.
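
As a concrete illustration of the cycle described above, the following minimal Python sketch simulates a hypothetical machine (the three-instruction format and the memory layout are invented for this example) stepping through fetch, decode and execute:

# A toy fetch-decode-execute loop for a hypothetical 3-instruction machine.
# memory holds (opcode, operand) pairs; 'acc' plays the role of the accumulator.
memory = [
    ("LOAD", 5),    # acc <- 5
    ("ADD", 7),     # acc <- acc + 7
    ("HALT", 0),
]

pc = 0          # program counter
acc = 0         # accumulator register
running = True

while running:
    opcode, operand = memory[pc]   # 1. fetch: read the instruction addressed by the PC
    pc += 1                        #    ... and increment the PC
    # 2-5. decode, execute and return the result
    if opcode == "LOAD":
        acc = operand
    elif opcode == "ADD":
        acc = acc + operand
    elif opcode == "HALT":
        running = False

print(acc)   # 12: result of the tiny program above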

1.2.4. CPU Interconnection: The Interconnection Structures provides for
communication among the control unit, ALU, and registers. It constitutes of
the collection of paths connecting the various modules of a computer namely:
the CPU module, memory module and, I/O module. It must support the
following types of transfers:
 Memory to CPU
 CPU to Memory
 I/O to CPU
 CPU to I/O
 I/O to or from Memory - using Direct Memory Access (DMA)
The most common interconnection structure is the bus interconnection. A bus is a shared transmission medium that may only be used by one device at a time. When used to connect the major computer components (CPU, memory, I/O), it is called a system bus. A system bus consists of three functional groups of communication lines: the data bus, the address bus and the control bus.

[Figure: the system bus connecting CPU, memory and I/O, made up of data, address and control lines.]

a) Data lines (data bus) - move data between system modules. The
width of a data bus is a key factor in determining overall system
performance. Usually the width of a data bus is equal to the word
size of a computer or ½ that size.

IMPORTANT NOTE
A Word refers to a group of bits that a CPU can process at one time. In computing, word is
a term for the natural unit of data used by a particular processor design. A word is a fixed-
sized piece of data handled as a unit by the instruction set or the hardware of the processor.
The number of bits in a word is called a word size/ word width or word length and it is an
important characteristic of any specific processor design or computer architecture.
Processors with many different word sizes have existed though powers of two (8, 16, 32,
64) have predominated for many years. A processor's word size is often equal to the width
of its external data bus though sometimes the bus is made narrower than the CPU (often
half as many bits) to economize on packaging and circuit board costs.
Size of data bus = CPU word size or ½ of CPU word size
Word size = power of 2

Depending on how a computer is organized, word-sized units may be used:

 To represent integral data types: integer types are held in word-sized memory locations, while floating-point numbers typically occupy either a word or a multiple of a word.
 As address holders: holders for memory addresses must be large enough to express the needed range of values but not excessively large, so the size used is often the word, though it can also be a multiple or a fraction of the word size.
 Determine size of Registers: Processor registers are designed with a
size appropriate for the type of data they hold, e.g. integers, floating
point numbers or addresses. Many computer architectures use "general purpose registers" that can hold any of several types of data; these registers must be sized to hold the largest of the types. Historically this has been the word size of the architecture, though increasingly special-purpose, larger registers have been added to deal with newer types.
 Instructions: Machine instructions are normally the size of the
architecture's word, such as in RISC architectures, or a multiple of the

"char" size that is a fraction of it. This is a natural choice since
instructions and data usually share the same memory subsystem. In
Harvard architectures the word sizes of instructions and data need not
be related, as instructions and data are stored in different memories; for
example, the processor in the 1ESS electronic telephone switch had 37-
bit instructions and 23-bit data words.

b) Address bus - designates the source or destination of data on the data bus. Its width determines the maximum possible memory capacity of the system (its width may differ from the data bus width).
 Maximum possible memory capacity = 2^(address bus width)
Example:
The size of an address bus is 32 bits. Compute the maximum amount of memory that this bus can reference.
Solution: The bus can carry 2^32 distinct addresses and hence can refer to up to 2^32 bytes of memory = 4 GB of memory; any memory beyond that cannot be addressed. (A short calculation sketch of this example follows at the end of this subsection.)
 Also used to address I/O ports. Typically: high-order bits select a
particular module while lower-order bits select a memory location or
I/O port within the module
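
As promised in the example above, here is a minimal sketch of the address-bus calculation in Python (assuming byte-addressable memory; the function name is our own):

# Maximum addressable memory for a given address bus width,
# assuming one byte is stored per address (byte-addressable memory).
def max_memory_bytes(address_bus_width_bits):
    return 2 ** address_bus_width_bits

bytes_32 = max_memory_bytes(32)
print(bytes_32)                      # 4294967296
print(bytes_32 / (1024 ** 3), "GB")  # 4.0 GB, matching the example above
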
c) Control Bus - control access to and use of the data and address lines.
Typical control lines include:
 Memory Read and Memory Write
 I/O Read and I/O Write
 Transfer ACK
 Bus Request and Bus Grant
 Interrupt Request and Interrupt ACK
 Clock
 Reset
If one module wishes to send data to another, it must:
 Obtain use of the bus

 Transfer data via the bus
If one module wishes to request data from another, it must:
 Obtain use of the bus
 Transfer a request to the other module over control and address lines
 Wait for second module to send data
Typical physical arrangement of a system bus
 A number of parallel electrical conductors
 Each system component (usually on one or more boards) taps into
some or all of the Bus lines (usually with a slotted connector)
 System can be expanded by adding more boards
 A bad component can be replaced by replacing the board where it
resides

Multiple Bus Hierarchies


A great number of devices on a bus will cause performance to suffer
 Propagation delay - the time it takes for devices to coordinate the use
of the bus
 The bus may become a bottleneck as the aggregate data transfer
demand approaches the capacity of the bus (in available transfer
cycles/second)
Traditional Hierarchical Bus Architecture
 Use of a cache structure insulates CPU from frequent accesses to main
memory
 Main memory can be moved off local bus to a system bus
 Expansion bus interface
o buffers data transfers between system bus and I/O controllers on
expansion bus
o insulates memory-to-processor traffic from I/O traffic
PCI = Peripheral Component Interconnect
 High-bandwidth
 Processor independent

 Can function as a mezzanine or peripheral bus
Current Standard for PCI
 up to 64 data lines at 33 MHz
 requires few chips to implement
 supports other buses attached to PCI bus
 public domain, initially developed by Intel to support Pentium-based
systems
 supports a variety of microprocessor-based configurations, including
multiple processors
 uses synchronous timing and centralized arbitration

Typical Desktop System with PCI

1.2.5. CPU Clock


The CPU clock is a timing device connected to the processor that synchronizes when the fetch-decode-execute cycle runs. Clock speed is measured as the number of cycles the CPU performs per second, in Hertz (Hz), which means 'per second'. A clock speed of 1 MHz means 1,000,000 cycles per second and potentially a million calculations. A computer with a speed of 3.4 GHz might be capable of performing 3,400,000,000 cycles per second. However, it isn't as simple as that, as some processors can perform more than one calculation on each clock cycle, while others need several cycles per instruction.

Increasing performance: If a designer wants to increase the performance of a


computer, they can try several things including
 Increasing the clock speed:
The most obvious way to increase the speed of a computer is to increase the speed of the computer clock. With a faster clock speed the processor performs more cycles, and hence more instructions, per second. The problem with an increased clock speed is that more current has to flow through the circuits, and the more current that flows, the hotter the processor becomes. Thus the faster the clock speed, the hotter the processor runs. To counter this, computer scientists have come up with smarter chip designs and introduced heat sinks, fans, and even liquid cooling into computers. If a processor runs too hot it can burn out!

 Adjusting word length


Another way to increase the performance of a computer is to increase the word size, that is, the number of bits the computer can process at one time. With a larger word, computers can handle larger or more precise values and do more complicated things. Modern computers mostly have 32- or 64-bit word sizes, with specialist hardware such as games consoles being able to handle up to 128-bit words. Increasing the word size, however, implies designing more complicated hardware, and in some cases processing lots of small words can be faster than processing larger ones.
 Increasing bus widths
 Data bus: wider = better performance
 Address bus: wider = more locations can be referenced
2. Organization and Architecture
Computer Architecture refers to those attributes of a system that have a direct
impact on the logical execution of a program.
Examples of architectural attributes
o the instruction set
o the number of bits used to represent various data types
o I/O mechanisms
o memory addressing techniques
Computer Organization refers to the operational units and their
interconnections that realize the architectural specifications. Examples are
things that are transparent to the programmer:
o control signals
o interfaces between computer and peripherals
o the memory technology being used
So, for example, the fact that a multiply instruction is available is a computer
architecture issue. How that multiply is implemented is a computer
organization issue.
Several years ago, the term computer architecture often referred only to
instruction set design. Other aspects of computer design were called
implementation, often insinuating that implementation is uninteresting or less
challenging. This view has changed since the architect’s or designer’s job is
much more than instruction set design, and the technical hurdles in the other
aspects of the project are likely more challenging than those encountered in
instruction set design.

The term organization includes the high-level aspects of a computer’s design,


such as the memory system, the memory interconnect, and the design of the
internal processor or CPU (central processing unit—where arithmetic, logic,
branching, and data transfer are implemented). The term microarchitecture is
also used instead of organization. For example, two processors with the same

instruction set architectures but different organizations are the AMD Opteron
and the Intel Core i7. Both processors implement the x86 instruction set, but
they have very different pipeline and cache organizations.
Hardware refers to the specifics of a computer, including the detailed logic
design and the packaging technology of the computer. Often a line of
computers contains computers with identical instruction set architectures and
nearly identical organizations, but they differ in the detailed hardware
implementation. For example, the Intel Core i7 and the Intel Xeon 7560 are
nearly identical but offer different clock rates and different memory systems,
making the Xeon 7560 more effective for server computers.

Computer architecture
Computer architecture covers the three aspects of computer design including:
instruction set architecture, organization or microarchitecture, and hardware.

The overall view of computer design, showing how the ISA sits between software and hardware, is summarized below:

[Figure: layered view of a computer system]
Software: Applications -> Operating System -> Compiler / Firmware
Instruction set architecture (ISA): the interface between hardware and software
Hardware: CPU, Memory, I/O -> Digital circuits -> Gates and transistors
(Abstraction increases as you move up the layers.)
As can be seen from this, the instruction set architecture (ISA) provides the interface between the software and the hardware; it is the medium of communication between the two. A software program is translated by the compiler into machine code, which the hardware then executes according to the ISA's available set of instructions:

Program code -> Compiler -> Machine code -> ISA -> Hardware instructions

2.1. Instruction Set Architecture


What is Instruction?

An instruction is a binary pattern designed inside the microprocessor to


perform a specific function. In other words, it is actually a command to the
microprocessor to perform a given task on specified data.

Categories of instructions
1. Data movement/transfer instructions (Data handling and memory
operations)
 Set a register to a fixed constant value.
 Move data from a memory location or register to another memory
location or register without changing its form.
 Copy data from a memory location to a register, or vice-versa
 Used to store the contents of a register, result of a computation, or to
retrieve stored data to perform a computation on it later.
 Read and write data from hardware devices.
Examples: STORE, LOAD, EXCHANGE, MOVE, CLEAR, SET, PUSH, POP.

 Specifies: source and destination (memory, register, stack), amount of


data e.g.
 LOAD—source is memory and destination is register
 STORE—source is register and destination is memory

2. Data processing instructions (Arithmetic and logic (ALU) instructions):


 Add, subtract, multiply, or divide the values of two registers, placing
the result in a register, possibly setting one or more condition codes in
a status register.
 Perform bitwise operations, e.g., taking the conjunction and disjunction
of corresponding bits in a pair of registers, taking the negation of each
bit in a register.
 Compare two values in registers (for example, to see if one is less, or if
they are equal).
Examples of arithmetic instructions are:

 ADD ; add one number to another number


 SUB ; subtract one number from another number
 INC ; increment a number by 1
 DEC ; decrements a number by 1
 MUL ; multiply numbers together
 OR ; boolean algebra function
 AND ; boolean algebra function
 NOT ; boolean algebra function
 XOR ; boolean algebra function

3. Branch instructions (control flow instructions)


 Alter the normal flow of control from executing the next instruction in
sequence

 Branch to another location in the program and execute instructions
there.
 Conditionally branch to another location if a certain condition holds.
 Indirectly branch to another location.
 Call another block of code, while saving the location of the next
instruction as a point to return to.

What is an Instruction Set?

The entire group of these instructions is called the instruction set. The instruction set therefore refers to the range of instructions that a CPU can execute, or the basic set of commands, or instructions, that a microprocessor understands.
The instruction set determines what functions the microprocessor can
perform. One of the principal characteristics that separate RISC from CISC
microprocessors is the size of the instruction set -- RISC microprocessors
have relatively small instruction sets whereas CISC processors have relatively
large instruction sets.

Instruction Format: Parts of an instruction

Each instruction has two parts: one is the task to be performed called the
operation code (opcode) and the other is the data to be operated on called the
operand (data).

On traditional architectures, an instruction includes:

 an opcode that specifies the operation to perform, such as "add contents of memory to register", and
 operands: there can be zero or more operand specifiers, which may specify registers, memory locations, or literal data.

Types of Operand:

 Addresses: immediate, direct, indirect, stack


 Numbers: integer or fixed point (binary, twos complement), floating
point (sign, significand, exponent), (packed) decimal (246 = 0000 0010
0100 0110)
 Characters: ASCII (128 printable and control characters + bit for error
detection)
 Logical Data: bits or flags, e.g., Boolean 0 and 1

Instruction and Word Size

 A word is a fixed-sized piece of data handled as a unit by the instruction


set or the hardware of the processor.
 A word length refers to the number of bits in a word. Microprocessors
(CPUs) are described in terms of their word size e.g. 8086 processor is 16
bit meaning its word size is 16 bits
 Machine instructions are normally the size of the architecture's word.
 Instruction size / Instruction Length = size of the architecture's word.
 Instruction Length is affected by
 Memory size
 Memory organization - addressing
 Bus structure, e.g., width
 CPU complexity
 CPU speed

Depending on the word size, there will be different numbers of bits available
for the opcode and for the operand. There are two different philosophies at
play:

1. Lots of different instructions and a smaller operand (Intel, AMD) and

2. Fewer instructions and more space for the operand (ARM).

CISC - Complex Instruction Set Computer (uses philosophy number 1) -


more instructions, allowing complex tasks to be executed, but the range and precision of the operand is reduced. Some instructions may be of variable length, for example taking extra words (or bytes) to address full memory addresses, load full data values or simply expand the available instructions. A CISC processor therefore has many opcode bits to support many instructions, but a small number of bits for the operand, reducing the range and precision of the operand.

RISC - Reduced Instruction Set Computer (uses philosophy number 2) - fewer instructions, allowing for larger and higher-precision operands.

Opcode size and number of instructions

 Size of opcode (in bits) = log2(number of instructions)
 Equivalently, the number of instructions supported by a microprocessor = 2^(opcode size in bits)

Operand size and number of registers

 Operand size (in bits) for a register operand = log2(number of registers)

Examples:

1. For a word with 4 bits for an opcode and 6 bits for an operand:

How many different instructions could I fit into the instruction set?
What is the largest number that I could use as data?

Answer:

 Number of instructions: 2^4 = 16
 Largest operand: 2^6 - 1 = 63

2. For a 16-bit word with 6 bits for an opcode:

How many different instructions could I fit into the instruction set?
What is the largest number that I could use as data?

Answer:

 Number of instructions: 2^6 = 64
 Largest operand: 16 - 6 = 10 bits remain for the operand, so 2^10 - 1 = 1023
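
The answers above follow directly from the formulas: size of opcode = log2(number of instructions) and largest unsigned operand = 2^(operand bits) - 1. A short Python check (assuming unsigned operand fields) is shown below:

# Check the two worked examples, assuming unsigned operand fields.
def instruction_count(opcode_bits):
    return 2 ** opcode_bits            # number of distinct opcodes

def largest_operand(operand_bits):
    return 2 ** operand_bits - 1       # largest unsigned value the field can hold

# Example 1: 4-bit opcode, 6-bit operand
print(instruction_count(4), largest_operand(6))        # 16 instructions, operand up to 63

# Example 2: 16-bit word with a 6-bit opcode -> 10 bits left for the operand
print(instruction_count(6), largest_operand(16 - 6))   # 64 instructions, operand up to 1023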

What is instruction set architecture (ISA)?

Instruction set architecture is the structure of a computer that a machine


language programmer must understand to write a correct (timing
independent) program for that machine. The instruction set architecture is
also the machine description that a hardware designer must understand to
design a correct implementation of the computer.

The instruction set must be able to:

• Access memory

• Perform arithmetic and logic operations

• Control the program flow (branching)

Instruction set architectures are measured according to:

• Main memory space occupied by a program.

• Instruction complexity.

• Instruction length (in bits).

• Total number of instructions in the instruction set.

ISA design Decisions

In designing an instruction set, consideration is given to:


• Instruction length: Whether short, long, or variable.
• Number of operands.
• Number of addressable registers.
• Memory organization.
• Whether byte- or word addressable.
• Addressing modes.
• Choose any or all: direct, indirect or indexed.
• Byte ordering, or endianness, is another major architectural
consideration.
• If we have a two-byte (or larger) integer, it may be stored so that the least significant byte is followed by the most significant byte, or vice versa.
• In little endian machines, the least significant byte is stored first (at the lowest address), followed by the most significant byte.
• Big endian machines store the most significant byte first (at the lowest address).
• As an example, suppose we have the hexadecimal number 12345678.
• The big endian and little endian arrangements of the bytes are:
– Big endian: 12 34 56 78
– Little endian: 78 56 34 12
(A short code sketch after this list of properties also demonstrates the two orderings.)

• Big endian:
– Is more natural.
– The sign of the number can be determined by looking at the byte
at address offset 0.
– Strings and integers are stored in the same order.
• Little endian:
– Makes it easier to place values on non-word boundaries.
– Conversion from a 16-bit integer address to a 32-bit integer
address does not require any arithmetic
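
The byte orderings listed above can be observed directly with Python's struct module; this short sketch packs the example value 0x12345678 both ways:

import struct

value = 0x12345678

big    = struct.pack(">I", value)   # big endian: most significant byte first
little = struct.pack("<I", value)   # little endian: least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412
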
• The next consideration for architecture design concerns how the CPU
will store data.
• We have three choices:
1. A stack architecture
2. An accumulator architecture
3. A general purpose register architecture.
• In choosing one over the other, the tradeoffs are simplicity (and cost)
of hardware design with execution speed and ease of use.
Stack architecture (0-operand architecture)
• In stack architecture, instructions and operands are implicitly taken
from the stack. A stack cannot be accessed randomly.
• All arithmetic operations take place using the top one or two positions
on the stack
• Stack machines use one - and zero-operand instructions.

• LOAD and STORE instructions require a single memory address


operand.

• Other instructions use operands from the stack implicitly.

• PUSH and POP operations involve only the stack’s top element.

Binary instructions (e.g., ADD, MULT) use the top two items on the
stack.

• Stack arithmetic requires that we use postfix notation: Z = X Y +, instead of the infix Z = X + Y.

• The principal advantage of postfix notation is that parentheses are not


used.

• For example, the infix expression Z = (X × Y) + (W × U) becomes:

Z = X Y × W U × + in postfix notation.

• In a stack ISA, the postfix expression

Z = X Y × W U × +

is coded as follows (it needs 8 instructions):

PUSH X
PUSH Y
MULT
PUSH W
PUSH U
MULT
ADD
POP Z

Similarly, C = A + B (postfix: A B +) needs four instructions:

PUSH A,
PUSH B,
ADD,
POP C.
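
A zero-operand stack ISA is easy to simulate. The minimal Python sketch below (a hypothetical machine; variables are held in a dictionary standing in for memory) runs the postfix program for Z = (X × Y) + (W × U) shown above:

# Minimal stack-machine sketch for the 0-operand programs shown above.
# Variables live in a dictionary standing in for memory.
def run(program, memory):
    stack = []
    for op, *arg in program:
        if op == "PUSH":
            stack.append(memory[arg[0]])        # push the value of a variable
        elif op == "POP":
            memory[arg[0]] = stack.pop()        # store the top of stack to a variable
        elif op == "MULT":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return memory

mem = {"X": 2, "Y": 3, "W": 4, "U": 5, "Z": 0}
prog = [("PUSH", "X"), ("PUSH", "Y"), ("MULT",),
        ("PUSH", "W"), ("PUSH", "U"), ("MULT",),
        ("ADD",), ("POP", "Z")]

print(run(prog, mem)["Z"])   # 26 = (2*3) + (4*5)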

For stack machines, the terms "0-operand" and "zero-address" apply to


arithmetic instructions, but not to all instructions, as 1-operand push and
pop instructions are used to access memory.

Accumulator Architecture (1-operand architecture)

• In accumulator architecture, one operand of a binary operation is


implicitly in the accumulator.
– One operand is in memory, creating lots of bus traffic.

• In a one-address ISA, like MARIE, the infix expression

Z = X × Y + W × U

looks like this:
LOAD X
MULT Y
STORE TEMP
LOAD W
MULT U
ADD TEMP
STORE Z

General Purpose Architecture (2-operand and 3-operand architectures)

• In general purpose register (GPR) architecture, registers can be used
instead of memory.
– Faster than accumulator architecture.
– Efficient implementation for compilers.
– Results in longer instructions.
• Most systems today are GPR systems.
• There are three types:
– Memory-memory where two or three operands may be in
memory.
– Register-memory where at least one operand must be in a
register.
– Load-store where no operands may be in memory.
• The number of operands and the number of available registers have a direct effect on instruction length.
• In a two-address ISA (e.g., Intel, Motorola), the infix expression

Z = X × Y + W × U

might look like this:
LOAD R1,X
MULT R1,Y
LOAD R2,W
MULT R2,U
ADD R1,R2
STORE Z,R1
Many CISC and RISC machines fall under this category:
 CISC (two memory operands allowed): C = A+B needs two instructions, e.g. move A to C; then add B to C. This effectively 'stores' the result without an explicit store instruction.
 CISC (limited to one memory operand per instruction): C = A+B is coded as:
 load a,reg1;
 add b,reg1;
 store reg1,c;
This requires a load/store pair for any memory movement, regardless of whether the result is stored to a different place, as in C = A+B, or to the same memory location, as in A = A+B. In this case C = A+B needs three instructions.
 RISC — Requiring explicit memory loads, the instructions would be:
 load a,reg1;
 load b,reg2;
 add reg1,reg2;
 store reg2,c.
o C = A+B needs four instructions.

• With a three-address ISA (e.g., mainframes), the infix expression

Z = X × Y + W × U

might look like this:
MULT R1,X,Y
MULT R2,W,U
ADD Z,R1,R2
Example:
CISC — for C = A+B we need one instruction: add a,b,c
 or, more typically, since most machines are limited to two memory operands:
 move a,reg1;
 add reg1,b,c
o In that case C = A+B needs two instructions.
RISC — arithmetic instructions use registers only, so explicit 2-operand
load/store instructions are needed:
 load a,reg1;
 load b,reg2;
 add reg1+reg2->reg3;
 store reg3,c;
 C = A+B needs four instructions.

 Unlike the 2-operand or 1-operand forms, this leaves all three values a, b, and c in registers, available for further reuse.

EXAMPLE

• A system has 16 registers and 4K of memory.
• We need 4 bits to access one of the registers. We also need 12 bits for a memory address.
• If the system is to have 16-bit instructions, we have two choices for our instructions:
1. a 4-bit opcode with three 4-bit register operands (register-to-register instructions), or
2. a 4-bit opcode with a single 12-bit memory address (memory-reference instructions).
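
A quick Python check of the arithmetic behind the two instruction formats (the field sizes come from log2 of the number of registers and of the memory size):

import math

registers    = 16
memory_words = 4 * 1024                        # 4K of memory

register_bits = int(math.log2(registers))      # 4 bits to select one of 16 registers
address_bits  = int(math.log2(memory_words))   # 12 bits for a memory address
opcode_bits   = 4

# Choice 1: opcode + three register operands (register-to-register format)
print(opcode_bits + 3 * register_bits)         # 16 bits

# Choice 2: opcode + one memory address (memory-reference format)
print(opcode_bits + address_bits)              # 16 bits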

2.2. Challenges for the Computer Architect


Computer architecture, like other architecture, is the art of determining the
needs of the user of a structure and then designing to meet those needs as
effectively as possible within economic and technological constraints.

The task the computer designer faces is a complex one: Determine what
attributes are important for a new computer, then design a computer to
maximize performance and energy efficiency while staying within cost,
power, and availability constraints. This task has many aspects, including
instruction set design, functional organization, logic design, and
implementation. The implementation may encompass integrated circuit
design, packaging, power, and cooling. Optimizing the design requires
familiarity with a very wide range of technologies, from compilers and
operating systems to logic design and packaging.

Over time computer architects have strived to cope with the user needs by
designing systems that meet those needs as effectively as possible within
economic and technological constraints. Some key market demands that drive
computer architecture include:
 Performance: processing needs users keep increasing and the demand
is to have systems that can process huge work load faster
 Storage: storage needs of users keep increasing and the demand is to
have systems that can store more and more information
 Portability: users have become more mobile and the demand is to
have smaller systems that can be carried along.
 Affordability: the demand is to have computer systems that cost less to produce and are affordable to a wider population. The use of technology improvements to lower cost, as well as to increase performance, has been a major theme in the computer industry.
In response these market demands, computer systems have evolved overtime
to witness:
 performance increases almost yearly
 memory size goes up a factor of 4 every 3 years or so
 price drops every year
 decreasing size

Challenge of designing for future amidst rapidly changing technology


One challenge for computer architects is that the design created today will require
several years of implementation, verification, and testing before appearing on the
market. This means that the architect must project what the technology will be like
several years in advance. Sometimes this is difficult.

To address this challenge, computer architects must be aware of important


trends in both the technology and the use of computers; as such trends affect

not only the future cost but also the longevity of architecture. Key trends
include:

Trends in Technology:
Computer technology changes rapidly and if an instruction set architecture is
to be successful; it must be designed to survive the rapid change in computer
technology. A successful new instruction set architecture may last for decades
and therefore an architect must plan for technology changes that can increase
the lifetime of a successful computer. To plan for the evolution of a
computer, the designer must be aware of rapid changes in implementation
technology. Five implementation technologies, which change at a dramatic
pace, are critical to modern implementations:

Integrated circuit logic technology—Transistor density increases by about 35% per year, quadrupling somewhat over four years (i.e. the density of integrated circuits increases roughly fourfold every three to four years). This trend is popularly known as Moore's law.

Semiconductor DRAM (dynamic random-access memory)— most DRAM


chips are primarily shipped in DIMM modules. Capacity per DRAM chip has
increased by about 25% to 40% per year recently, doubling roughly every
two to three years. This technology is the foundation of main memory. The
rate of improvement has continued to slow over time. There is concern as to whether the growth rate will stop at some point due to the increasing difficulty of efficiently manufacturing ever smaller DRAM cells [Kim 2005]. Several other technologies that may replace DRAM if it hits a capacity wall have been proposed.

Semiconductor Flash (electrically erasable programmable read-only


memory):

Capacity per Flash chip has increased by about 50% to 60% per year recently,
doubling roughly every two years.

Magnetic disk technology: Prior to 1990, density increased by about 30% per
year, doubling in three years. It rose to 60% per year thereafter, and increased
to 100% per year in 1996. Since 2004, it has dropped back to about 40% per
year, or doubled every three years. Disks are 15 to 25 times cheaper per bit
than Flash.

Network technology: Network performance depends both on the performance


of switches and on the performance of the transmission system.

Designers often design for the next technology, knowing that when a product begins shipping in volume, the next technology may be the most cost-effective or may have performance advantages. Traditionally, cost has decreased at about the rate at which density increases.

Performance Trends: Bandwidth over Latency


Bandwidth or throughput is the total amount of work done in a given time,
such as megabytes per second for a disk transfer. In contrast, latency or
response time is the time between the start and the completion of an event,
such as milliseconds for a disk access.

Performance is the primary differentiator for microprocessors and networks,


so they have seen the greatest gains: 10,000–25,000X in bandwidth and 30–
80X in latency. Capacity is generally more important than performance for
memory and disks, so capacity has improved most, yet bandwidth advances
of 300– 1200X are still much greater than gains in latency of 6–8X. Clearly,
bandwidth has outpaced latency across these technologies and will likely
continue. A simple rule of thumb is that bandwidth grows by at least the

square of the improvement in latency. Computer designers should plan
accordingly.

Challenge of balancing performance


Since a computer system is made up of various interconnected components
whose performance does not increase at the same pace, the architect is faced
with the challenge of balancing performance e.g. the rate at which the
microprocessor improve performance is not the same as other components
such as memory, bus, I/O system

Unequal increases in performance result in a need to adjust the organization and architecture to compensate for the mismatch among the capabilities of the various components.

Example: the interface between processor and main memory: while processor speed and memory capacity have grown rapidly, the speed with which data can be transferred between processor and main memory has lagged behind.

Key is balance. Because of constant and unequal changes in:


 Processor components
 Main memory
 I/O devices
 Interconnection structures
Designers must constantly strive to balance their throughput and processing
demands.

Challenge of designing to manage Power and Energy in Integrated Circuits
First, power must be brought in and distributed around the chip, and modern
microprocessors use hundreds of pins and multiple interconnect layers just for
power and ground. Second, power is dissipated as heat and must be removed.

How should a system architect or a user think about performance, power, and
energy? From the viewpoint of a system designer, there are three primary
concerns.

First, what is the maximum power a processor ever requires? Meeting this
demand can be important to ensuring correct operation. For example, if a
processor attempts to draw more power than a power supply system can
provide (by drawing more current than the system can supply), the result is
typically a voltage drop, which can cause the device to malfunction. Modern
processors can vary widely in power consumption with high peak currents;
hence, they provide voltage indexing methods that allow the processor to
slow down and regulate voltage within a wider margin. Obviously, doing so
decreases performance.

Second, what is the sustained power consumption? This metric is widely


called the thermal design power (TDP), since it determines the cooling
requirement. TDP is neither peak power, which is often 1.5 times higher, nor
is it the actual average power that will be consumed during a given
computation, which is likely to be lower still.

The third factor that designers and users need to consider is energy and
energy efficiency. Recall that power is simply energy per unit time: 1 watt =
1 joule per second. Which metric is the right one for comparing processors:
energy or power? In general, energy is always a better metric because it is
tied to a specific task and the time required for that task. In particular, the
energy to execute a workload is equal to the average power times the
execution time for the workload. Thus, if we want to know which of two
processors is more efficient for a given task, we should compare energy
consumption (not power) for executing the task.
For example, processor A may have a 20% higher average power
consumption than processor B, but if A executes the task in only 70% of the
time needed by B, its energy consumption will be 1.2 × 0.7 = 0.84, which is
clearly better.
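
The comparison above is just energy = average power × execution time; the short Python sketch below repeats the calculation with the numbers from the example, normalised so that processor B has power 1 and time 1:

# Energy = average power x execution time.
# Values are relative to processor B (power_B = 1, time_B = 1).
power_A, time_A = 1.2, 0.7   # A: 20% more power, but only 70% of the execution time
power_B, time_B = 1.0, 1.0

energy_A = power_A * time_A
energy_B = power_B * time_B

print(round(energy_A, 2))    # 0.84 -> A uses less energy for the task
print(energy_A < energy_B)   # True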

Distributing the power, removing the heat, and preventing hot spots have
become increasingly difficult challenges. Power is now the major constraint
to using transistors; in the past, it was raw silicon area. Hence, modern
microprocessors offer many techniques to try to improve energy efficiency
despite flat clock rates and constant supply voltages:

1. Do nothing well. Most microprocessors today turn off the clock of inactive
modules to save energy and dynamic power. For example, if no floating-point
instructions are executing, the clock of the floating-point unit is disabled. If
some cores are idle, their clocks are stopped.
2. Dynamic Voltage-Frequency Scaling (DVFS). Modern microprocessors
typically offer a few clock frequencies and voltages in which to operate that
use lower power and energy.
3. Design for typical case. Given that PMDs and laptops are often idle,
memory and storage offer low power modes to save energy. For example,
DRAMs have a series of increasingly lower power modes to extend battery

life in PMDs and laptops, and there have been proposals for disks that have a
mode that spins at lower rates when idle to save power.
4. Overclocking. Intel started offering Turbo mode in 2008, where the chip
decides that it is safe to run at a higher clock rate for a short time possibly on
just a few cores until temperature starts to rise.

Quantitative Principles of Computer Design

1. Take Advantage of Parallelism: process tasks in parallel


2. Principle of Locality: Important fundamental observations have come
from properties of programs. The most important program property
that we regularly exploit is the principle of locality: Programs tend to
reuse data and instructions they have used recently. A widely held rule
of thumb is that a program spends 90% of its execution time in only
10% of the code. An implication of locality is that we can predict with
reasonable accuracy what instructions and data a program will use in
the near future based on its accesses in the recent past. The principle of
locality also applies to data accesses, though not as strongly as to code
accesses. Two different types of locality have been observed. Temporal
locality states that recently accessed items are likely to be accessed in
the near future. Spatial locality says that items whose addresses are
near one another tend to be referenced close together in time.
3. Smaller is faster: for example, limiting the design to 32 registers keeps the register file small, and it is faster to access a small set of registers than a huge one.
4. Simplicity favors regularity: fixed-size instructions, 3 register operands in every arithmetic operation, and keeping the register fields in the same place in every instruction format.

5. Make the common case fast (Focus on the Common Case): for example, PC-relative addressing for branches and immediate addressing for constant operands, since these are the common cases.
6. Good design demands good compromises: for example, the compromise between wanting larger addresses and constants in instructions and keeping all instructions the same length.
7. Design for Moore’s law: design with future technological
advancements in mind

Performance of Computer Systems

 What exactly is performance? Or in other words


 What do we mean by saying that computer X is faster than computer
Y?
There are two view points to the above questions:
View point one: if computer X is faster than Y, then given the same program to execute, the program will run in less time on X than on Y. This is what the users of computers desire, and it is called the execution time or the response time of a program.

Definition
Response time or execution time of a program is defined as the time
between the start and the finish of a task (in time units)

View point two: if computer X is faster than Y, then within a given time X processes more tasks than Y, or completes more transactions than Y. This is called throughput.

Definition
Throughput is defined as the total amount of work or tasks done in a given
time period (in number of tasks per unit of time)
Example:
If a car assembly plant produces 6 cars per hour, then the throughput of the plant is 6 cars per hour.

If a car assembly plant takes 4 hours to produce a car, then the response time of the plant is 4 hours per car.

Therefore:
Throughput = tasks per given time = number of tasks / unit time
Response time = total time per task = total time / number of tasks

In general, there is no relationship between the two metrics i.e. throughput


and execution or response time. Throughput can be increased without
affecting the response time.
Example: the throughput of the car assembly plant may increase to 18 cars per hour, without changing the time to produce one car, by adding production lines.

The computer user is interested in reducing response time—the time between


the start and the completion of an event—also referred to as execution time,
while the manager of a data processing center may be interested in increasing
throughput (the total amount of work done in a given time).

The computer user wants response time to decrease, while data centre
manager want throughput increased.

In comparing design alternatives, we often want to relate the performance of two different computers, say, X and Y. The phrase "X is faster than Y" is used here to mean that the response time or execution time is lower on X than on Y for the given task. In particular, "X is n times faster than Y" will mean:

Execution time of Y / Execution time of X = n

Since execution time is the reciprocal of performance, this is equivalent to:

n = Execution time of Y / Execution time of X = (1 / Performance of Y) / (1 / Performance of X) = Performance of X / Performance of Y
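
The "n times faster" relation can be computed directly from two execution times, as in this short Python sketch (the measurements are hypothetical):

# "X is n times faster than Y" means n = execution time of Y / execution time of X.
def times_faster(exec_time_x, exec_time_y):
    return exec_time_y / exec_time_x

# Hypothetical measurements: the same program takes 10 s on X and 15 s on Y.
n = times_faster(10.0, 15.0)
print(n)   # 1.5 -> X is 1.5 times faster than Y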

Main factors influencing performance of computer system are:


– Processor and memory,
– Input/output controllers and peripherals,
– Compilers, and
– Operating system.

CPU Time or CPU Execution Time

CPU time is a true measure of processor/memory performance. CPU time (or


CPU Execution time) is the time between the start and the end of execution of
a given program. This time accounts for the time the CPU spends computing the given program, including operating system routines executed on the program's behalf; it does not include time spent waiting for I/O or running other programs.
Performance of processor/memory = 1 / CPU_time.

CPU performance is measured using the CPU clock, a timing device connected to the processor that synchronizes when the fetch-decode-execute cycle runs. The CPU clock speed is measured in terms of the number of cycles performed by the CPU per second:

CPU performance (measured this way) = cycles / time in seconds, i.e. the clock rate

Clock speed or clock rate is measured in Hertz (Hz), which means 'per second'.
• 1 Hz = one cycle per second, i.e. potentially one calculation per second.
• A clock speed of 1 MHz means 1,000,000 cycles per second, and potentially a million calculations per second.

Analysis of CPU Time

• CPU time depends on the program being executed, including the types of instructions executed and their frequency of usage.
• Computers are constructed in such a way that events in hardware are synchronized using a clock.
• A clock rate defines durations of discrete time intervals called clock
cycle times or clock cycle periods: Clock rate is given in Hz (=1/sec).
• Clock_cycle_time = 1/clock_rate (in sec) i.e.

clock_cycle_time = 1 / (X cycles/sec) = 1/X seconds

• Example: if a processor's clock rate is 2.4 kHz, then its clock cycle time is 1/2400 seconds.
• Thus, when we refer to different instruction types (from a performance point of view), we are referring to instructions that require different numbers of clock cycles to execute. One instruction may take several clock cycles; therefore, if a program has X instructions, each taking some number of clock cycles, then the clock cycles for the program equal the sum of the clock cycles needed to execute all the instructions of the program.
• Thus: clock cycles for a program = total number of clock cycles needed to execute all instructions of the given program.

Clock cycles for a program

The clock cycles for a program are the total number of clock cycles needed to execute all instructions of a given program.

• The average number of clock cycles per instruction is abbreviated as CPI (clock cycles per instruction).

CPI – the average number of clock cycles per instruction (for a given execution
of a given program) is an important parameter given as:
CPI = clock cycles for a program / instruction count

Example: If a program has 20 instructions and the total number of clock cycles is 60, then the average number of clock cycles per instruction is 60/20 = 3 clock cycles per instruction.
• CPU time = Clock cycles for a program * Clock cycle time = Clock
cycles for a program / Clock rate
Where:

Clock cycles for a program = (Clock cycles for a program / Instruction count) * Instruction count

Substituting the expression (Clock cycles for a program / Instruction count) with CPI,

we get: Clock cycles for a program = CPI * Instruction count

Therefore:
 CPU time = Instruction count * CPI / Clock rate
From this equation it is clear that the processor clock rate alone is not sufficient to describe a computer's performance. Since good performance means less CPU time, if the clock rate is very high but the CPI is also very high, performance is compromised because the high clock rate is cancelled out by the high CPI.
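
The CPU performance equation can be written as a small helper, together with the clock-cycle-time relationship from the previous section. This is a minimal Python sketch; the function names and the figures used are illustrative only.

def clock_cycle_time(clock_rate_hz):
    """Duration of one clock cycle in seconds (1 / clock rate)."""
    return 1 / clock_rate_hz

def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Illustrative figures: 100 million instructions, CPI of 2, 500 MHz clock.
print(clock_cycle_time(500e6))      # 2e-09 seconds per cycle
print(cpu_time(100e6, 2, 500e6))    # 0.4 seconds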

Why measuring performance using clock rate (Hz) alone is not good

 Using only the clock rate of a processor is a bad way to measure performance because performance = 1 / execution time (CPU time), and
 CPU time (execution time) = Instruction count * CPI / Clock rate

Therefore a processor with high clock rate and high CPI has poor
performance since the high clock rate is cancelled out by the high CPI

Example: Machine A has a clock rate of 200 MHz and a CPI of 1, while machine B has a clock rate of 400 MHz and a CPI of 4. Show that, in spite of its higher clock rate, the performance of machine B is poorer than that of machine A for a given program.
Solution:

Machine A runtime = (1 × instruction count) / (200 × 10^6) seconds

Machine B runtime = (4 × instruction count) / (400 × 10^6) seconds

For any given program, runtime of A = ½ × runtime of B; therefore machine B will clearly be slower for any program, in spite of its higher clock rate.
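
The comparison can be checked numerically for any instruction count; the Python sketch below uses an arbitrary count of 100 million instructions, since the ratio between the two runtimes is the same for any value.

def cpu_time(instruction_count, cpi, clock_rate_hz):
    return instruction_count * cpi / clock_rate_hz

ic = 100e6  # arbitrary instruction count; the ratio is independent of this value

time_a = cpu_time(ic, cpi=1, clock_rate_hz=200e6)   # 0.5 s
time_b = cpu_time(ic, cpi=4, clock_rate_hz=400e6)   # 1.0 s

print(time_a, time_b)
print(time_a / time_b)   # 0.5 -> machine A takes half the time of machine B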

Analysis of CPU Performance Equation

The goal of improving computer performance for the user is to decrease the
execution time (reduce CPU time) therefore from the CPU performance
equation:
 CPU time = Instruction count * CPI / Clock rate
How can a designer improve performance, i.e. decrease CPU time?
From the equation, the following may be done to decrease CPU time:
 Increase the clock rate, thus making the denominator of the equation larger. This may be achieved by improving hardware technology and organization.
 Reduce the CPI. This may be achieved by reworking the computer organization, the ISA and compiler technology.
 Reduce the instruction count. This may be achieved by improving the ISA and compiler technology.

Many potential performance improvement techniques primarily improve one component, with small or predictable impact on the other two.

Calculating CPI
The table below indicates the frequency of each instruction type executed in a "typical" program and, from existing reference manuals, the number of cycles per instruction for each type.

Instruction Type       Frequency   Cycles
ALU instruction        50%         4
Load instruction       30%         5
Store instruction      5%          4
Branch instruction     15%         2

A typical program therefore would have a CPI computed as follows:

CPI = 0.5*4 + 0.3*5 + 0.05*4 + 0.15*2 = 4 cycles/instruction
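
The weighted CPI is a simple weighted sum, as the Python sketch below shows using the instruction mix from the table above.

# Instruction mix from the table above: (frequency, cycles per instruction).
mix = {
    "ALU":    (0.50, 4),
    "Load":   (0.30, 5),
    "Store":  (0.05, 4),
    "Branch": (0.15, 2),
}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(cpi)   # 4.0 cycles/instruction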

Example 1:
Consider an implementation of the MIPS ISA with a 500 MHz clock and
– each ALU instruction takes 3 clock cycles,
– each branch/jump instruction takes 2 clock cycles,
– each sw instruction takes 4 clock cycles,
– Each lw instruction takes 5 clock cycles.
Also, consider a program that during its execution executes:
– x=200 million ALU instructions
– y=55 million branch/jump instructions
– z=25 million sw instructions
– w=20 million lw instructions
Find CPU time. Assume sequentially executing CPU.
Solution
Approach 1:
Clock cycles for a program = (x*3 + y*2 + z*4 + w*5)
                           = 910 × 10^6 clock cycles
CPU_time = Clock cycles for a program / Clock rate
         = 910 × 10^6 / (500 × 10^6) = 1.82 sec

Approach 2:
CPI = (x*3 + y*2 + z*4 + w*5) / (x + y + z + w)
    = 3.03 clock cycles/instruction
(CPI = Clock cycles for a program / Instruction count)
CPU time = Instruction count * CPI / Clock rate
         = (x + y + z + w) * 3.03 / (500 × 10^6)
         = 300 × 10^6 * 3.03 / (500 × 10^6)
         = 1.82 sec
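
Both approaches can be verified with a short Python sketch; the variable names below are illustrative only.

clock_rate = 500e6  # 500 MHz

# (count, cycles per instruction) for each instruction class in Example 1
workload = [
    (200e6, 3),  # ALU
    (55e6, 2),   # branch/jump
    (25e6, 4),   # sw
    (20e6, 5),   # lw
]

total_cycles = sum(count * cycles for count, cycles in workload)   # 910e6
instr_count = sum(count for count, _ in workload)                  # 300e6

cpu_time_1 = total_cycles / clock_rate          # Approach 1
cpi = total_cycles / instr_count                # Approach 2
cpu_time_2 = instr_count * cpi / clock_rate

print(cpi)                      # ~3.03 cycles/instruction
print(cpu_time_1, cpu_time_2)   # 1.82 s by either approach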

Example 2:

Consider an implementation of the MIPS ISA with a 1 GHz clock and
– each ALU instruction takes 4 clock cycles,
– each branch/jump instruction takes 3 clock cycles,
– each sw instruction takes 5 clock cycles,
– each lw instruction takes 6 clock cycles.
Also, consider the same program as in Example 1.
Find CPI and CPU time. Assume sequentially executing CPU.

Solution

CPI = (x*4 + y*3 + z*5 + w*6) / (x + y + z + w)
    = 4.03 clock cycles/instruction
CPU time = Instruction count * CPI / Clock rate
         = (x + y + z + w) * 4.03 / (1000 × 10^6)
         = 300 × 10^6 * 4.03 / (1000 × 10^6)
         = 1.21 sec
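
The same arithmetic applies to Example 2 with the new cycle counts and a 1 GHz clock, as in this short, self-contained Python sketch (illustrative names only).

clock_rate = 1e9  # 1 GHz

workload = [
    (200e6, 4),  # ALU
    (55e6, 3),   # branch/jump
    (25e6, 5),   # sw
    (20e6, 6),   # lw
]

total_cycles = sum(count * cycles for count, cycles in workload)   # 1210e6
instr_count = sum(count for count, _ in workload)                  # 300e6

cpi = total_cycles / instr_count
print(cpi)                              # ~4.03 cycles/instruction
print(instr_count * cpi / clock_rate)   # ~1.21 s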

NB: The calculation may not be accurate since the numbers of cycles per
instruction given don’t account for pipeline effects and other advanced design
techniques.

Another element that affects computer performance is delay in memory access. Instruction execution involves a fetch-decode-execute cycle, in which the computer retrieves a program instruction from its memory, determines what actions the instruction requires, and carries out those actions. Therefore, delays in accessing program instructions and data in memory can significantly reduce performance. Some assumed delays: memory access = 2 ns, ALU operation = 2 ns, register file access = 1 ns.

To address memory delays, designers of computer systems have adopted the concept of a memory hierarchy.

Memory: Memory Hierarchy and Its Importance

A "memory hierarchy" in computer storage distinguishes each level in the hierarchy by its response time.

The memory hierarchy reflects one of the many trade-offs in designing for high performance, namely the size and technology of each component. The various components can be viewed as forming a hierarchy of memories (m1, m2, ..., mn) in which each member mi is, in a sense, subordinate to the next-highest member mi+1 of the hierarchy. To limit waiting by higher levels, a lower level responds by filling a buffer and then signaling to activate the transfer.
There are four major storage levels.
1. Internal – Processor registers and cache.
2. Main – the system RAM and controller cards.
3. On-line mass storage – Secondary storage.
4. Off-line bulk storage – Tertiary and Off-line storage.

The number of levels in the memory hierarchy and the performance at each
level has increased over time. For example, the memory hierarchy of an Intel
Haswell Mobile processor circa 2013 is:
1. Processor registers – the fastest possible access (usually 1 CPU cycle).
A few thousand bytes in size
2. Cache
3. Level 0 (L0) Micro operations cache – 6 KiB in size
4. Level 1 (L1) Instruction cache – 128 KiB in size
5. Level 1 (L1) Data cache – 128 KiB in size. Best access speed is around
700 GiB/second
6. Level 2 (L2) Instruction and data (shared) – 1 MiB in size. Best access
speed is around 200 GiB/second
7. Level 3 (L3) Shared cache – 6 MiB in size. Best access speed is around
100 GB/second
8. Level 4 (L4) Shared cache – 128 MiB in size. Best access speed is
around 40 GB/second
9. Main memory (Primary storage) – Gigabytes in size. Best access speed
is around 10 GB/second. In the case of a NUMA machine, access times
may not be uniform
10. Disk storage (Secondary storage) – Terabytes in size. As of 2013, best
access speed is from a solid state drive is about 600 MB/second
11. Nearline storage (Tertiary storage) – Up to exabytes in size. As of
2013, best access speed is about 160 MB/second
12. Offline storage

Cache Memory
A cache (pronounced "cash") is a small and very fast temporary storage memory used to improve the average access time to slower memory. It is designed to speed up the transfer of data and instructions. It is located inside or close to the CPU chip. It is faster than RAM, and the data/instructions that are most recently or most frequently used by the CPU are stored in the cache.

 Exploits spatial and temporal locality
 In computer architecture, almost everything is a cache!
o Registers are a cache on variables
o The first-level cache is a cache on the second-level cache
o The second-level cache is a cache on memory
o Memory is a cache on disk (virtual memory)

The data and instructions are retrieved from RAM when the CPU uses them for the first time. A copy of that data or those instructions is stored in the cache. The next time the CPU needs that data or those instructions, it first looks in the cache. If the required data is found there, it is retrieved from cache memory instead of main memory. This speeds up the working of the CPU.

Types/Levels of Cache Memory


A computer can have several different levels of cache memory. The level number refers to the distance from the CPU, where Level 1 is the closest. All levels of cache memory are faster than RAM. The cache closest to the CPU is always fastest but generally costs more and stores less data than the other levels of cache.

The following are the different levels of cache memory.

Level 1 (L1) Cache
It is also called primary or internal cache. It is built directly into the processor chip. It has a small capacity, from 8 KB to 128 KB.
Level 2 (L2) Cache
It is slower than the L1 cache. Its storage capacity is larger, i.e. from 64 KB to 16 MB. Current processors contain an advanced transfer cache on the processor chip, which is a type of L2 cache. The common size of this cache is from 512 KB to 8 MB.
Level 3 (L3) Cache
This cache is separate from the processor chip, on the motherboard. It exists on computers that use the L2 advanced transfer cache. It is slower than the L1 and L2 caches. A personal computer often has up to 8 MB of L3 cache.

Memory hierarchy goals

To provide the CPU with the necessary data (and instructions) as quickly as possible.

• To achieve this goal, a cache should keep frequently used data
• "Cache hit": when the CPU finds requested data in the cache
• Hit rate = # of cache hits / # of cache accesses
• Average memory access latency (AMAL) = cache hit time + (1 – cache hit rate) × miss penalty
To decrease AMAL: reduce hit time, increase hit rate, and reduce the miss penalty.
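
The AMAL formula can be evaluated directly, as in the Python sketch below; the hit time, hit rate and miss penalty used here are illustrative figures, not measured values.

def amal(hit_time, hit_rate, miss_penalty):
    """Average memory access latency = hit time + (1 - hit rate) * miss penalty."""
    return hit_time + (1 - hit_rate) * miss_penalty

# Illustrative figures: 1 ns hit time, 95% hit rate, 50 ns miss penalty.
print(amal(hit_time=1, hit_rate=0.95, miss_penalty=50))   # 3.5 ns
# Raising the hit rate to 99% cuts the average latency sharply:
print(amal(hit_time=1, hit_rate=0.99, miss_penalty=50))   # 1.5 ns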

To reduce traffic on memory bus

• The cache becomes a "filter"
• It reduces the bandwidth requirements on main memory
• Typically, max. L1 bandwidth (to CPU) > max. L2 bandwidth (to L1) > max. memory bandwidth

A cache is a small, high-speed memory. It stores data from frequently used addresses (of main memory).

 Cache hit: data found in the cache. Results in data transfer at maximum speed.
 Cache miss: data not found in the cache. The processor loads the data from main memory and copies it into the cache. This results in an extra delay, called the miss penalty.
 Hit ratio = percentage of memory accesses satisfied by the cache.
 Miss ratio = 1 – hit ratio
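
A toy direct-mapped cache over a short address trace shows how hit and miss ratios arise. The Python sketch below is an illustration only, not a model of any particular processor; the trace and cache size are made up.

# Toy direct-mapped cache: 4 lines, one word per line (illustrative only).
NUM_LINES = 4
cache = [None] * NUM_LINES   # each entry stores the address currently cached

trace = [0, 1, 2, 0, 1, 4, 0, 1, 2, 3]   # made-up sequence of word addresses
hits = 0

for addr in trace:
    line = addr % NUM_LINES          # line this address maps to
    if cache[line] == addr:
        hits += 1                    # cache hit
    else:
        cache[line] = addr           # cache miss: load from memory into cache

hit_ratio = hits / len(trace)
print(f"hit ratio  = {hit_ratio:.2f}")    # 0.40 for this trace
print(f"miss ratio = {1 - hit_ratio:.2f}")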

Locality of References
This important fundamental observation comes from properties of programs.
The most important program property that we regularly exploit is locality of references: programs tend to reuse data and instructions they have used recently.

The 90/10 rule comes from empirical observation: "A program spends 90% of its time in 10% of its code."

An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future, based on its accesses in the recent past.

Two different types of locality have been observed:

Temporal locality states that recently accessed items are likely to be accessed again in the near future.

Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.

Make the Common Case Fast


In making a design trade-off, favor the frequent case over the infrequent case.

This principle also applies when determining how to spend resources, since the impact of making some occurrence faster is greater if the occurrence is frequent.

Improving the frequent occurrence:

Helps performance

Is simpler and can be done faster

Smaller is Faster
Smaller pieces of hardware will generally be faster than larger pieces.

This simple principle is particularly applicable to memories built from the same
technology for two reasons:

In high-speed machines, signal propagation is a major cause of delay;

In most technologies we can obtain smaller memories that are faster than
larger memories. This is primarily because the designer can use more power per
memory cell in a smaller design;

The important exception to the smaller-is-faster rule arises from differences in power consumption. Designs with higher power consumption will be faster and also usually larger. Thus, the smaller-is-faster rule applies only when power considerations are taken into account.

