Module 1
Overview: Introduction to
Computer Architecture and
Organization
Architecture & Organization 1
Architecture is those attributes visible to the programmer
Instruction set, number of bits used for data
representation, I/O mechanisms, addressing
techniques.
e.g. Is there a multiply instruction?
Organization is how architecture features are
implemented
Control signals, interfaces, memory technology.
e.g. Is there a hardware multiply unit or is it done by
repeated addition?
Architecture & Organization 2
All Intel x86 family share the same basic
architecture
The IBM System/370 family share the
same basic architecture
This gives code compatibility
At least backwards
Organization differs between different
versions
Structure & Function
Structure is the way in which
components relate to each other
Function is the operation of individual
components as part of the structure
Function
All basic computer functions are:
Data processing–process data in a variety of forms
to meet a variety of requirements
Data storage–short- and long-term data storage for
retrieval and update
Data movement–move data between the computer and the
outside world, i.e., devices that serve as sources and
destinations of data–the process is known as I/O
Control–control the processing, movement, and storage of
data according to instructions.
Functional View
[Diagram: possible computer operations – (a) data movement (device to device),
(b) storage (device to/from storage, i.e., read and write),
(c) processing from/to storage, (d) processing from storage to I/O]
Structure - Top Level - 1
[Diagram: top-level structure – the computer (central processing unit,
main memory, systems interconnection, input/output) connected to
peripherals and communication lines]
Structure - Top Level - 2
Central Processing Unit (CPU) – controls
operation of the computer and performs data
processing functions
Main memory – stores data
I/O – moves data between computer and
external environment
System interconnection – provides
communication between CPU, main memory
and I/O
Structure - The CPU - 1
[Diagram: the CPU within the computer – registers, arithmetic and logic unit,
control unit, and internal CPU interconnection, attached via the system bus
to main memory and I/O]
Structure - The CPU - 2
Control unit (CU)–control the operation of
CPU
Arithmetic and logic unit (ALU)-performs data
processing functions
Registers-provides internal storage to the CPU
CPU Interconnection-provides communication
between control unit (CU), ALU and registers
Structure - The Control Unit
[Diagram: the control unit within the CPU – sequencing logic, control unit
registers and decoders, and control memory, attached via the internal bus
to the ALU and registers]
Implementation of control unit – microprogrammed
implementation
Module 1
Overview: Von Neumann Machine
and Computer Evolution
First Generation – Vacuum Tubes
ENIAC - background
Electronic Numerical Integrator And Computer
Eckert and Mauchly
University of Pennsylvania
Trajectory tables for weapons (range and
trajectory)
Started 1943
Finished 1946
Too late for the war effort, but used to determine the
feasibility of the hydrogen bomb
Used until 1955
ENIAC - diagram
ENIAC - details
Decimal (not binary)
20 accumulators of 10 digits
Programmed manually by switches
18,000 vacuum tubes
30 tons
15,000 square feet
140 kW power consumption
5,000 additions per second
Von Neumann Machine/Turing
Stored Program concept
Main memory stores both programs and data – programs can be
changed without rewiring, so programming becomes easier
ALU operating on binary data
Control unit interpreting instructions from memory and
executing
Input and output (I/O) equipment operated by control unit
Princeton Institute for Advanced Studies
IAS computer
Completed 1952
Structure of von Neumann
machine
Von Neumann Architecture
Data and instruction are stored in a
single read-write memory
The contents of this memory is
addressable by location
Execution occurs in a sequential fashion
from one instruction to the next
IAS – details-1
1,000 x 40-bit words
Each word holds either a binary number (number word)
or two 20-bit instructions (instruction word)
IAS – details-2
Set of registers (storage in CPU)
Memory Buffer Register–contains a word to be stored in
memory or sent to the I/O unit, or receives a word from
memory or from the I/O unit
Memory Address Register–specify the address in
memory for MBR
Instruction Register-contains opcode
Instruction Buffer Register-temporary storage for
instruction
Program Counter-contains the address of the next instruction
to be fetched from memory
Accumulator and Multiplier Quotient-temporary storage
for operands and result of ALU operations
Structure of IAS – detail-3
Control unit–fetches instructions from memory and
executes them one by one
Commercial Computers
1947 - Eckert-Mauchly Computer Corporation founded to
manufacture computers commercially
UNIVAC I (Universal Automatic Computer)-first
commercial computer
Used by the US Bureau of the Census for 1950 calculations
Became part of Sperry-Rand Corporation
Late 1950s - UNIVAC II
Faster
More memory
IBM
Punched-card processing equipment
1953 - the 701
IBM’s first stored program computer
Scientific calculations
1955 - the 702
Business applications
Led to the 700/7000 series
Second Generation:
Transistors
Replaced vacuum tubes
Smaller
Cheaper
Less heat dissipation
Solid State device
Made from Silicon (Sand)
Invented 1947 at Bell Labs
William Shockley et al.
Transistor Based Computers
Second generation machines
More complex arithmetic and logic
unit (ALU) and control unit (CU)
Use of high level programming languages
NCR & RCA produced small transistor
machines
IBM 7000
DEC - 1957
Produced PDP-1
Transistor Based Computers
The Third Generation: Integrated Circuits
Microelectronics
Literally - “small electronics”
A computer is made up of gates, memory cells and
interconnections
Data storage-provided by memory cells
Data processing-provided by gates
Data movement-the paths between components that are
used to move data
Control-the paths between components that carry control
signals
These can be manufactured on a semiconductor
e.g. silicon wafer
Integrated Circuits
Early integrated
circuits- known as small
scale integration (SSI)
Moore’s Law
Increased density of components on chip-refer chart
Gordon Moore – co-founder of Intel
Number of transistors on a chip will double every year-refer
chart
Since the 1970s development has slowed a little:
the number of transistors doubles every 18 months (see the sketch below)
Cost of a chip has remained almost unchanged
Higher packing density means shorter electrical paths, giving
higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increases reliability
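As a rough illustration of the doubling trend above, the short Python sketch below projects transistor counts from a 1971 baseline of about 2,300 transistors (the Intel 4004); the fixed 18-month doubling period is taken from the slide and is an idealization, not a fitted model.

# Rough Moore's Law projection (illustration only, not measured data).
# Assumes a fixed doubling period of 18 months, starting from the
# Intel 4004 of 1971 with roughly 2,300 transistors.

def projected_transistors(year, base_year=1971, base_count=2_300,
                          doubling_months=18):
    """Estimate transistor count in a given year under a fixed doubling period."""
    doublings = (year - base_year) * 12 / doubling_months
    return base_count * 2 ** doublings

for year in (1971, 1980, 1990, 2000):
    print(year, f"{projected_transistors(year):,.0f}")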
Growth in CPU Transistor
Count
IBM 360 series
1964
Replaced (& not compatible with) 7000 series
First planned “family” of computers
Similar or identical instruction sets
Similar or identical O/S
Increasing speed
Increasing number of I/O ports (i.e. more terminals)
Increased memory size
Increased cost
Multiplexed switch structure-multiplexor
IBM 360 - images
DEC PDP-8
1964
First minicomputer (after
miniskirt!)
Did not need air conditioned
room
Small enough to sit on a lab
bench
$16,000
$100k+ for an IBM 360
Embedded applications & OEM –
could be integrated into a total system
with other manufacturers' equipment
DEC - PDP-8 Bus Structure
Universal bus structure (Omnibus)-separate
signal paths to carry control, data and address
signals
Common bus structure-controlled by the CPU
Later Generations:
Semiconductor Memory
1950s to 1960s-memory made of rings of ferromagnetic
material called cores-fast but expensive, bulky and with a
destructive read
Semiconductor memory-introduced by Fairchild in 1970
Size of a single core
i.e. 1 bit of magnetic core storage
Holds 256 bits
Non-destructive read
Much faster than core
Capacity approximately doubles each year
Generations of Computer
Vacuum tube - 1946-1957
Transistor - 1958-1964
Small scale integration SSI - 1965 on
Up to 100 devices on a chip
Medium scale integration MSI - to 1971
100-3,000 devices on a chip
Large scale integration LSI - 1971-1977
3,000 - 100,000 devices on a chip
Very large scale integration VLSI - 1978 -1991
100,000 - 100,000,000 devices on a chip
Ultra large scale integration ULSI – 1991 -
Over 100,000,000 devices on a chip
Microprocessors Intel
1971 - 4004
First microprocessor
All CPU components on a single chip
4 bit design
Followed in 1972 by 8008
8 bit
Both designed for specific applications
1974 - 8080
Intel’s first general purpose microprocessor
Designed to be the CPU of a general purpose microcomputer
Pentium Evolution - 1
8080
first general purpose microprocessor
8 bit data path
Used in first personal computer – Altair
8086
much more powerful
16 bit
instruction cache, prefetches a few instructions
8088 (8 bit external bus) used in first IBM PC
80286
16 MB memory addressable
80386
First 32 bit design
Support for multitasking- run multiple programs at the same time
Pentium Evolution - 2
80486
sophisticated powerful cache and instruction pipelining
built in maths co-processor
Pentium
Superscalar technique-multiple instructions executed in
parallel
Pentium Pro
Increased superscalar organization
Aggressive register renaming
branch prediction
data flow analysis
speculative execution
Pentium Evolution - 3
Pentium II
MMX technology
graphics, video & audio processing
Pentium III
Additional floating point instructions for 3D graphics
Pentium 4
Note Arabic rather than Roman numerals
Further floating point and multimedia enhancements
Itanium
64 bit
see chapter 15
Itanium 2
Hardware enhancements to increase speed
See Intel web pages for detailed information on processors
Module 1
Computer System: Designing and
Understanding Performance
Designing for Performance
Microprocessor Speed
The microprocessor achieves its full speed potential only if it is fed a
constant stream of data and instructions. Techniques used:
Branch prediction-the processor looks ahead in the instruction code
and predicts which branches (groups of instructions) are likely to
be processed next (a toy sketch follows this list)
Data flow analysis-processor analyzes instructions which are
dependent on each other’s result or data to create an optimized
schedule of instructions
Speculative execution-using branch prediction and data flow
analysis, the processor speculatively executes instructions ahead of
their appearance in the program, holding the results in temporary
locations.
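These mechanisms are implemented in processor hardware; purely as a toy illustration of the branch-prediction idea (not any particular processor's logic), a classic 2-bit saturating-counter predictor can be simulated in a few lines of Python:

# Toy 2-bit saturating-counter branch predictor (illustration only).
# States 0-1 predict "not taken", states 2-3 predict "taken".

def prediction_accuracy(outcomes, state=2):
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        correct += (predicted_taken == taken)
        # Move the counter toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# A loop branch that is taken nine times and then falls through once.
print(prediction_accuracy([True] * 9 + [False]))   # 0.9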
Performance Balance – Processor
and Memory
Performance balance-adjusting the
organization and architecture to compensate for
the mismatch in capabilities of various
components (e.g., processor vs. memory)
Processor speed increased
Memory capacity increased
Memory speed lags behind processor
speed
Performance Balance – Processor and
Memory (Performance Gap)
Performance Balance – Processor
and Memory - Solutions
Increase number of bits retrieved at one time
Make DRAM “wider” rather than “deeper”
Change DRAM interface
Include cache in DRAM chip
Reduce frequency of memory access
More complex and efficient cache between processor and
memory
Cache on chip/processor
Increase interconnection bandwidth between processor and memory
High speed buses
Hierarchy of buses
Performance Balance: I/O
Devices
Peripherals with intensive I/O demands-refer chart
Large data throughput demands-refer chart
Processors can handle this I/O process, but the problem
is moving data between processor and devices
Solutions:
Caching
Buffering
Higher-speed interconnection buses
More elaborate bus structures
Multiple-processor configurations
Performance Balance: I/O
Devices
Key is Balance : Designers
2 factors:
The rate at which performance is changing in the
various technology areas (processor, busses,
memory, peripherals) differs greatly from one type
of element to another
New applications and new peripheral devices
constantly change the nature of the demand on the
system in terms of typical instruction profile and
data access patterns
Improvements in Chip
Organization and Architecture
Increase hardware speed of processor
Fundamentally due to shrinking logic gate size
More gates, packed more tightly, increasing clock rate
Propagation time for signals reduced
Increase size and speed of caches
Dedicating part of processor chip
Cache access times drop significantly
Change processor organization and architecture
Increase effective speed of execution
Parallelism
Problems with Clock Speed
and Logic Density
Power
Power density increases with density of logic and clock speed
Dissipating heat
RC delay
Speed at which electrons flow limited by resistance and
capacitance of metal wires connecting them
Delay increases as RC product increases
Wire interconnects thinner, increasing resistance
Wires closer together, increasing capacitance
Memory latency
Memory speeds lag processor speeds
Solution:
More emphasis on organizational and architectural approaches
Intel Microprocessor
Performance
Approach 1: Increased Cache
Capacity
Typically two or three levels of cache between
processor and main memory (L1, L2, L3)
Chip density increased
More cache memory on chip
Faster cache access
Pentium chip devoted about 10% of chip area
to cache
Pentium 4 devotes about 50%
Approach 2: More Complex
Execution Logic
Enable parallel execution of instructions
Two approaches introduced:
Pipelining
Superscalar
(covered later on)
Diminishing Returns from
Approach 1 and Approach 2
Internal organization of processors very complex
Can get a great deal of parallelism
Further significant increases likely to be
relatively modest
Benefits from cache are reaching limit
Increasing clock rate runs into power dissipation
problem
Some fundamental physical limits are being
reached
New Approach – Multiple
Cores
Multiple processors on single chip
With large shared cache
Within a processor, the increase in performance is roughly proportional
to the square root of the increase in complexity (see the rough comparison below)
If software can use multiple processors, doubling number
of processors almost doubles performance
So, use two simpler processors on the chip rather than
one more complex processor
With two processors, larger caches are justified
Power consumption of memory logic less than processing logic
Example: IBM POWER4
Two cores based on PowerPC
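A rough back-of-the-envelope comparison of the two options above (the square-root relationship is often referred to as Pollack's rule); the 5% parallel overhead used below is an illustrative assumption, not a measured figure.

# Illustrative only: spend a 2x transistor budget on one bigger core
# (performance ~ square root of complexity) versus two simple cores
# (performance ~ number of cores, if the software can use both).
from math import sqrt

budget_increase = 2.0
one_complex_core = sqrt(budget_increase)    # about 1.41x a single simple core
two_simple_cores = 2.0 * 0.95               # about 1.9x, assuming ~5% overhead

print(f"one complex core:  ~{one_complex_core:.2f}x")
print(f"two simple cores:  ~{two_simple_cores:.2f}x")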
POWER4 Chip Organization
Module 1
Computer System: Designing and
Understanding Performance
(Book: Computer Organization and Design, 3rd ed., David A. Patterson and
John L. Hennessy, Morgan Kaufmann Publishers)
Introduction
Hardware performance is often key to the effectiveness of an entire
system of hardware and software
For different types of applications, different performance metrics
may be appropriate, and different aspects of a computer system
may be the most significant in determining overall performance
Understanding how best to measure performance, and the limitations of
such measurements, is important when selecting a computer system
Assessing performance helps answer questions such as:
Why does a piece of software perform as it does?
Why can one instruction set be implemented to perform better than another?
How does some hardware feature affect performance?
Defining Performance - 1
•How do we say one computer has better performance than another?
•Performance based on speed
•To take a single passenger from one point to another in the least time – Concorde
•Performance based on passenger throughput
•To transport 450 passengers from one point to another - 747
Response Time and Throughput - 2
As an individual computer user, you are interested in reducing
response time (or execution time)
The time between the start and completion of a task
Data center managers are often interested in increasing
throughput
The total amount of work done in a given time
Example: What is improved with the following changes?
Replacing the processor in a computer with a faster version – will improve
response time and throughput
Adding additional processors to a system that uses multiple processors for
separate tasks (e.g., searching the web) – will improve throughput
Performance and Execution
Time - 3
Performance and execution time are reciprocals:
Performance(X) = 1 / Execution time(X), so maximizing performance
means minimizing execution time
Relative Performance - 4
If computer A runs a program in 10 seconds and computer B runs the same
program in 15 seconds, how much faster is A than B?
We know A is n times faster than B
Performance(A) / Performance(B) = Execution time(B) / Execution time(A) = n
Performance ratio:
15 / 10 = 1.5
We can say
A is 1.5 times faster than B
B is 1.5 times slower than A
To avoid the potential confusion between the terms increasing and decreasing,
we usually say “improve performance” or “improve execution time”
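A minimal Python check of the ratio above, using the 10 s and 15 s execution times from the example:

# Relative performance: Performance(A)/Performance(B) = ExecTime(B)/ExecTime(A)
exec_time_a = 10.0   # seconds on computer A
exec_time_b = 15.0   # seconds on computer B

n = exec_time_b / exec_time_a
print(f"A is {n:.1f} times faster than B")   # prints 1.5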
Measuring Performance
Time – measure of computer performance
Definition of time - Wall-clock time, response time, or elapsed time
CPU execution time (or CPU time)
The time the CPU spends computing for this task and does not include
time spent waiting for I/O or running other programs
User CPU time vs. system CPU time
Clock cycle time (e.g., 0.25 ns) vs. clock rate (e.g., 4 GHz)
Different applications are sensitive to different aspects of the performance of
a computer system
Many applications, especially those running on servers, depend as much on
I/O performance as on CPU performance, so total elapsed time measured by a
wall clock is of interest
In some application environments, the user may care about throughput,
response time, or a complex combination of the two (e.g., maximum
throughput with a worst-case response time)
CPU Execution Time
A simple formula relates the most
basic metrics (i.e., clock cycles and
clock cycle time) to CPU time:
CPU Time = CPU Clock Cycles X Clock Cycle Time
         = CPU Clock Cycles / Clock Rate
Improving Performance
Our favorite program runs in 10 seconds on computer A, which has a 4 GHz
clock. We want computer B to run this program in 6 seconds, but computer B
requires 1.2 times as many clock cycles as computer A for this program. What is
computer B's clock rate?
Clock Cycles for program A
CPU Time(A) = CPU Clock Cycles(A) / Clock Rate(A)
10 s = CPU Clock Cycles(A) / 4 GHz
10 s = CPU Clock Cycles(A) / (4 X 10^9 Hz)
CPU Clock Cycles(A) = 40 X 10^9 cycles
CPU Time(B)
CPU Time(B) = 1.2 X CPU Clock Cycles(A) / Clock Rate(B)
6 s = 1.2 X CPU Clock Cycles(A) / Clock Rate(B)
Clock Rate(B) = 1.2 X 40 X 10^9 cycles / 6 seconds
Clock Rate(B) = 48 X 10^9 cycles / 6 seconds
Clock Rate(B) = 8 X 10^9 cycles / second
Clock Rate(B) = 8 GHz
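The same worked example expressed as a short Python check; the numbers are exactly those used above.

# Find computer B's clock rate from CPU Time = CPU Clock Cycles / Clock Rate.
clock_rate_a = 4e9      # 4 GHz
cpu_time_a = 10.0       # seconds on computer A
cpu_time_b = 6.0        # target time on computer B, in seconds
cycle_ratio = 1.2       # B needs 1.2 times as many clock cycles as A

clock_cycles_a = cpu_time_a * clock_rate_a                 # 40 x 10^9 cycles
clock_rate_b = cycle_ratio * clock_cycles_a / cpu_time_b
print(f"Clock rate of B = {clock_rate_b / 1e9:.0f} GHz")   # 8 GHz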
Clock Cycles Per Instruction (CPI)
The execution time must depend on the number of
instructions in a program and the average time per
instruction
CPU clock cycles = Instructions for a program × Average
clock cycles per instruction
Clock cycles per instruction (CPI)
Average number of clock cycles each instruction takes to
execute
CPI can provide one way of comparing two different
implementations of the same instruction set architecture
Using Performance Equation-1
Suppose we have two implementations of the same instruction set architecture and
for the same program. Which computer is faster and by how much?
Computer A: clock cycle time=250 ps and CPI=2.0
Computer B: clock cycle time=500 ps and CPI=1.2
Say I = number of instructions for the program, find number of clock cycles for A
and B
CPU Clock Cycles(A) = I X CPI(A)
CPU Clock Cycles(A) = I X 2.0
CPU Clock Cycles(B) = I X CPI(B)
CPU Clock Cycles(B) = I X 1.2
Compute CPU Time for A and B
CPU Time(A) = CPU Clock Cycles(A) X Clock Cycle Time(A)
CPU Time(A) = I X 2.0 X 250 ps = I X 500 ps
CPU Time(B) = CPU Clock Cycles(B) X Clock Cycle Time(B)
CPU Time(B) = I X 1.2 X 500 ps = I X 600 ps
Using Performance Equation-2
Clearly A is faster. The amount faster is
the ratio of execution times:
Performance(A) / Performance(B) = Execution time(B) / Execution time(A)
= (I X 600 ps) / (I X 500 ps) = 1.2
We can conclude, A is 1.2 times faster
than B for this program
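A short Python check of this comparison; since the instruction count I is the same for both machines, it cancels in the ratio and any positive value will do.

# CPU Time = Instruction Count x CPI x Clock Cycle Time
I = 1.0                           # unknown instruction count; cancels in the ratio
cpu_time_a = I * 2.0 * 250e-12    # CPI = 2.0, clock cycle time = 250 ps
cpu_time_b = I * 1.2 * 500e-12    # CPI = 1.2, clock cycle time = 500 ps

print(f"A is {cpu_time_b / cpu_time_a:.1f} times faster than B")   # 1.2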
Basic Performance Equation
Basic performance equation:
CPU Time = Instruction Count X CPI X Clock Cycle Time
Instruction count can be measured by using software tools that
profile the execution or by using a simulator of the architecture
Hardware counters, which are included on many processors,
can be used alternatively to record a variety of measurements
Number of instructions executed
Average CPI
Sources of performance loss
Basic Components of
Performance
Measuring the CPI
Sometimes it is possible to compute the CPU clock cycles by looking at the different types of instructions and using
their individual clock cycle counts
CPU Clock Cycles = the sum over all n instruction classes of (CPIi X Ci), where
Ci = count of the number of instructions of class i executed
CPIi = average number of cycles per instruction for that instruction class
n = number of instruction classes
Remember that overall CPI for a program will depend on both the number of cycles for each instruction type and
the frequency of each instruction type in the program execution
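A small Python helper sketching the summation above; the instruction mix in the example call (three classes taking 1, 2 and 3 cycles with counts 50, 30 and 20) is hypothetical.

# CPU clock cycles = sum over classes of (CPI_i x C_i);
# overall CPI = total cycles / total instruction count.
def total_cycles_and_cpi(class_cpis, class_counts):
    cycles = sum(cpi * count for cpi, count in zip(class_cpis, class_counts))
    return cycles, cycles / sum(class_counts)

# Hypothetical mix: classes taking 1, 2 and 3 cycles, with 50, 30 and 20 instructions.
print(total_cycles_and_cpi([1, 2, 3], [50, 30, 20]))   # (170, 1.7)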
Factors Affecting CPU
Performance
Comparing Code Segments-1
A compiler designer is trying to decide between 2 code
sequences for a particular computer. The hardware provides three
instruction classes A, B and C, requiring 1, 2 and 3 cycles respectively;
sequence 1 uses 2, 1 and 2 instructions of each class, while
sequence 2 uses 4, 1 and 1.
Which code sequence executes the most instructions?
Which will be faster? What is the CPI for each sequence?
Comparing Code Segments-2
Sequence 1 (Instruction Count(1)) 2+1+2=5 instructions
Sequence 2 (Instruction Count(2)) 4+1+1=6 instructions
CPU Clock Cycles(1)= (2X1)+(1X2)+(2X3) = 10 cycles
CPU Clock Cycles(2)= (4X1)+(1X2)+(1X3) = 9 cycles
So code sequence 2 is faster, even though it executes 1 extra instruction
Since code sequence 2 uses fewer clock cycles but more instructions, it must have a lower CPI
CPI = CPU Clock Cycles/Instruction Count
CPI(1) = CPU Clock Cycles(1)/Instruction Count(1) = 10/5 = 2
CPI(2) = CPU Clock Cycles(2)/Instruction Count(2) = 9/6 = 1.5
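The same calculation written out directly in Python; the class cycle counts 1, 2 and 3 are taken from the example above.

# Classes A, B, C take 1, 2 and 3 cycles respectively.
cycles_per_class = [1, 2, 3]
seq1 = [2, 1, 2]   # instructions of each class in sequence 1
seq2 = [4, 1, 1]   # instructions of each class in sequence 2

cycles1 = sum(c * n for c, n in zip(cycles_per_class, seq1))
cycles2 = sum(c * n for c, n in zip(cycles_per_class, seq2))
print(cycles1, cycles1 / sum(seq1))   # 10 cycles, CPI = 2.0
print(cycles2, cycles2 / sum(seq2))   # 9 cycles,  CPI = 1.5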
Module 1
Computer System: Computer
Components and Bus
Interconnection Structure
Von Neumann Architecture
Data and instruction are stored in a
single read-write memory
The contents of this memory is
addressable by location
Execution occurs in a sequential fashion
from one instruction to the next
Program Concept
Hardwired program-connecting/combining
various logic components to store data and
perform arithmetic and logic operations
Hardwired systems are inflexible
General purpose hardware can do different
tasks, given correct control signals
Instead of re-wiring, supply a new set of
control signals
What is a program?
A sequence of steps
For each step, an arithmetic or logical
operation is done
For each operation, a different/new set of
control signals is needed
For each operation a unique code is provided
e.g. ADD, MOVE
A hardware segment accepts the code and
issues the control signals
Hardware and Software
Approaches
Components
The Control Unit and the Arithmetic and Logic
Unit constitute the Central Processing Unit
Data and instructions need to get into the
system and results out
Input/output
Temporary storage of code and results is
needed
Main memory
Computer Components:
Top Level View
Interconnection Structures
All the units (processor, memory and
I/O components) must be connected –
interconnection structure
Different types of connections for
different types of units
Computer Modules
[Diagram: computer modules – memory, input/output and CPU]
Memory Connection
Receives and sends data
Receives addresses (of locations)
Receives control signals
Read
Write
Timing
Input/Output Connection(1)
Similar to memory from computer’s
viewpoint
Output
Receive data from computer
Send data to peripheral
Input
Receive data from peripheral
Send data to computer
Input/Output Connection(2)
Receive control signals from computer
Send control signals to peripherals
e.g. spin disk
Receive addresses from computer
e.g. port number to identify peripheral
Send interrupt signals (control)
CPU Connection
Reads instruction and data
Writes out data (after processing)
Sends control signals to other units
Receives (& acts on) interrupts
Buses Interconnection
There are a number of possible
interconnection systems
Single and multiple BUS structures are
most common
e.g. Control/Address/Data bus (PC)
e.g. Unibus (DEC-PDP)
What is a Bus?
A communication pathway connecting two
or more devices
Usually broadcast/shared medium
Often grouped
A number of channels in one bus
e.g. 32 bit data bus is 32 separate single bit
channels
Power lines may not be shown
Bus Interconnection Scheme
Data Bus
Carries data
Remember that there is no difference
between “data” and “instruction” at this
level
Width is a key determinant of
performance
8, 16, 32, 64 bit
Address bus
Identify the source or destination of data
e.g. CPU needs to read an instruction
(data) from a given location in memory
Bus width determines maximum memory
capacity of system
e.g. 8080 has 16 bit address bus giving 64k
address space
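The 64k figure follows directly from the bus width; a one-line Python check using the 8080's 16 address lines from the slide:

# Maximum addressable locations = 2 ** (number of address lines)
address_lines = 16
print(2 ** address_lines)   # 65,536 locations (64k), as on the 8080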
Control Bus
Control and timing information for the data
and address lines
Typical control lines:
Memory read/write signal
Interrupt request
Clock signals
Bus grant/request
Big and Yellow?
What do buses look like?
Parallel lines on circuit boards
Ribbon cables
Strip connectors on mother boards
e.g. PCI
Sets of wires
Physical Realization of Bus
Architecture
Single Bus Problems
Lots of devices on one bus leads to:
Propagation delays
Long data paths mean that co-ordination of
bus use can adversely affect performance
The bus may become a bottleneck if aggregate data transfer approaches bus
capacity
Most systems use multiple buses to
overcome these problems
Traditional (ISA)
(with cache)
High Performance Bus
PCI Bus
Peripheral Component Interconnect
Intel released to public domain
32 or 64 bit
64 data lines
High bandwidth, processor independent
bus
PCI Bus