Von Neumann Architecture
Programs are stored
on storage devices
Programs are copied
into memory for
execution
CPU reads each
instruction in the
program and
executes accordingly
Von Neumann/Turing
Stored Program Computer
ALU capable of operating on binary data
Both ALU & CU contain registers.
Princeton Institute for Advanced
Studies (IAS)
First implementation of von Neumann
stored program computer – the IAS
computer
Began in 1946
Completed in 1952
Structure of IAS machine
IAS Memory
1,000 × 40-bit words, each holding either a number or
a pair of instructions
Signed magnitude binary number
1 sign bit
39 bits for magnitude
2 x 20 bit instructions
Left and right instructions (left executed first)
8-bit opcode
12-bit address
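The word layout above (two 20-bit instructions per 40-bit word, each an 8-bit opcode followed by a 12-bit address) can be sketched with a few bit operations. This is a hypothetical helper for illustration, not part of any real IAS software; the function and field names are ours:

```python
def decode_ias_word(word):
    """Split a 40-bit IAS word into its left and right 20-bit instructions.

    Each 20-bit instruction is an 8-bit opcode followed by a 12-bit address.
    Illustrative sketch only; field names are ours.
    """
    left = (word >> 20) & 0xFFFFF      # left instruction (executed first)
    right = word & 0xFFFFF             # right instruction

    def split(instr):
        opcode = (instr >> 12) & 0xFF  # high 8 bits: opcode
        address = instr & 0xFFF        # low 12 bits: address (0..999 used)
        return opcode, address

    return split(left), split(right)

# Example: left instruction opcode 0x01 at address 500,
# right instruction opcode 0x0A at address 7
word = (0x01 << 32) | (500 << 20) | (0x0A << 12) | 7
print(decode_ias_word(word))  # ((1, 500), (10, 7))
```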
IAS Registers
Set of registers (storage in CPU)
Memory Buffer Register (MBR)
Memory Address Register (MAR)
Instruction Register (IR)
Instruction Buffer Register (IBR)
Program Counter (PC)
Accumulator (AC)
Multiplier Quotient (MQ)
IAS Registers
Memory buffer register (MBR): Contains a word
to be stored in memory or sent to the I/O unit, or is
used to receive a word from memory or from the I/O
unit.
Memory address register (MAR): Specifies the
address in memory of the word to be written from or
read into the MBR.
Instruction register (IR): Contains the 8-bit
opcode instruction being executed.
IAS Registers
Instruction buffer register (IBR): Employed to hold
temporarily the right-hand instruction from a word in
memory.
Program counter (PC): Contains the address of the
next instruction-pair to be fetched from memory.
Accumulator (AC) and multiplier quotient (MQ):
Employed to hold temporarily operands and results of
ALU operations. For example, the result of multiplying two
40-bit numbers is an 80-bit number; the most significant
40 bits are stored in the AC and the least significant in the
MQ.
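The AC/MQ split described above is easy to mimic with integer arithmetic. A minimal sketch, assuming unsigned 40-bit magnitudes (the real machine's signed-magnitude handling is omitted for brevity):

```python
MASK_40 = (1 << 40) - 1

def multiply_40bit(a, b):
    """Multiply two 40-bit magnitudes; return (AC, MQ).

    The 80-bit product is split as the slides describe: the most
    significant 40 bits go to the AC, the least significant to the MQ.
    Sign handling of the real signed-magnitude format is omitted.
    """
    product = a * b          # up to 80 bits
    ac = product >> 40       # most significant 40 bits -> AC
    mq = product & MASK_40   # least significant 40 bits -> MQ
    return ac, mq

ac, mq = multiply_40bit((1 << 39) + 3, 1 << 39)
# Recombining AC and MQ recovers the full 80-bit product:
assert (ac << 40) | mq == ((1 << 39) + 3) * (1 << 39)
```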
Structure of
IAS
Figure 2.3, p. 22
Moore’s Law
Gordon Moore - cofounder of Intel
He observed (based on experience) that the number of
transistors on a chip doubled every year
Since the 1970s, growth has slowed a little:
the number of transistors now doubles roughly every 18 months
The cost of a chip has remained almost unchanged
Higher packing density means shorter electrical paths,
giving higher performance
Smaller size gives increased flexibility/portability
Reduced power and cooling requirements
Fewer system interconnections increase reliability
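The doubling rule above compounds quickly. A small sketch of the projection (the starting count of ~2,300 transistors is the Intel 4004 from 1971; the helper is ours):

```python
def transistor_count(initial, years, doubling_period_years=1.5):
    """Project transistor count under Moore's law:
    doubling every 18 months (1.5 years)."""
    return initial * 2 ** (years / doubling_period_years)

# From ~2,300 transistors (Intel 4004, 1971) over 30 years:
# 30 / 1.5 = 20 doublings, i.e. a factor of 2**20 ~ one million,
# so roughly 2.4 billion transistors.
print(round(transistor_count(2300, 30)))
```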
Growth in CPU Transistor Count
Effects of Moore’s Law
The doubling of the number of transistors on a
single chip every 18 months has had some effects on
the application of technology:
Costs have fallen dramatically: chip prices have not changed
substantially since Moore made his prediction, while capability per chip has soared
Tighter packaging has allowed for shorter electrical paths
and therefore faster execution
Smaller packaging has allowed for more applications in
more environments
Reduction in power and cooling requirements which also
helps with portability
Solder connections are less reliable than on-chip connections;
with more functions on a single chip, fewer solder
connections are needed, which improves reliability
Effects of Moore’s Law (continued)
As technology allows for higher levels of
performance, processor designers must come
up with ways to use it.
Keeping all parts of the processor busy
Coordinating multiple pipelines
Improved branch prediction
Multiple processors
Optimizing execution
Real-time analysis of code to “re-order” execution
Speculative execution of code
Incorporating multiple functions on single chip
Performance Mismatch
Experienced significant improvement
Processor speed
Memory capacity
Experienced only minor improvement
Memory speed
Bus rates
I/O device performance
Speeding it up
Pipelining
On board cache
On board L1 & L2 cache
Branch prediction
Data flow analysis
Speculative execution
Branch Prediction
The processor looks ahead in the instruction code
fetched from memory and predicts which branches,
or groups of instructions, are likely to be processed
next. If the processor guesses right most of the
time, it can prefetch the correct instructions and
buffer them so that the processor is kept busy. The
more sophisticated examples of this strategy predict
not just the next branch but multiple branches
ahead. Thus, branch prediction increases the
amount of work available for the processor to
execute.
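One classic prediction mechanism (not named in the slides, but a standard textbook example) is a 2-bit saturating counter kept per branch. A minimal sketch; real predictors index a table of such counters by branch address:

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken,
    states 2-3 predict taken. One misprediction does not flip the
    prediction, which suits loop branches."""

    def __init__(self):
        self.state = 2  # start weakly "taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at 0 and 3
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
# A loop branch: taken 8 times, exits once, then taken 8 times again.
outcomes = [True] * 8 + [False] + [True] * 8
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(f"{correct}/{len(outcomes)} predicted correctly")  # 16/17
```

The single loop-exit misprediction does not disturb the following iterations, which is exactly why the 2-bit scheme beats a 1-bit one on loops.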
Data Flow Analysis
The processor analyzes which instructions are
dependent on each other’s results, or data, to
create an optimized schedule of instructions. In
fact, instructions are scheduled to be executed
when ready, independent of the original program
order. This prevents unnecessary delay.
Speculative Execution
Using branch prediction and data flow analysis,
some processors speculatively execute
instructions ahead of their actual appearance in
the program execution, holding the results in
temporary locations. This enables the processor
to keep its execution engines as busy as possible
by executing instructions that are likely to be
needed.
Performance Balance (Mismatch?)
Processor speed increased
Memory capacity increased
But not the speed
Thus, memory speed lags behind
processor speed
Logic and Memory Performance Gap
Solutions
Increase number of bits retrieved at one time
Change DRAM interface
Cache
Reduce frequency of memory access
More complex cache and cache on chip
Increase interconnection bandwidth
High speed buses
Hierarchy of buses
I/O Devices
Peripherals with intensive I/O demands
Large data throughput demands
Processors can handle this
Problem moving data
Solutions:
Caching
Buffering
Higher-speed interconnection buses
More elaborate bus structures
Multiple-processor configurations
Key is Balance
Processor components
Main memory
I/O devices
Interconnection structures
Improvements in Chip Organization
and Architecture
Increase hardware speed of processor
Fundamentally due to shrinking logic gate size
More gates, packed more tightly, increasing clock
rate
Propagation time for signals reduced
Increase size and speed of caches
Dedicating part of processor chip
Cache access times drop significantly
Change processor organization and architecture
Increase effective speed of execution
Parallelism
Increased Cache Capacity
Typically two or three levels of cache
between processor and main memory
Chip density increased
More cache memory on chip
Faster cache access
Pentium chip devoted about 10% of
chip area to cache
Pentium 4 devotes about 50%
More Complex Execution Logic
Enable parallel execution of instructions
Pipeline works like assembly line
Different stages of execution of different
instructions at same time along pipeline
Superscalar allows multiple pipelines
within single processor
Instructions that do not depend on one
another can be executed in parallel
Diminishing Returns
Internal organization of processors complex
Can get a great deal of parallelism
Further significant increases likely to be
relatively modest
Benefits from cache are reaching limit
Increasing clock rate runs into power
dissipation problem
Some fundamental physical limits are being
reached
New Approach – Multiple Cores
Multiple processors on single chip
Large shared cache
Within a processor, increase in performance
proportional to square root of increase in complexity
If software can use multiple processors, doubling
number of processors almost doubles performance
So, use two simpler processors on the chip rather than
one more complex processor
With two processors, larger caches are justified
Power consumption of memory logic less than processing logic
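The square-root relation above is often called Pollack's rule (the name is ours; the slides do not use it). A quick sketch of the trade-off it implies, assuming software that scales perfectly across cores:

```python
import math

def core_performance(complexity):
    """Per-core performance grows roughly as the square root of
    design complexity (Pollack's rule)."""
    return math.sqrt(complexity)

# One core at double the complexity vs. two cores of the original
# complexity, with perfectly parallel software:
one_big = core_performance(2.0)        # ~1.41x a baseline core
two_small = 2 * core_performance(1.0)  # ~2.0x a baseline core
print(one_big, two_small)
```

Under these assumptions the two simpler cores win, which is the argument the slide makes for multiple cores.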
Performance Assessment
Performance is one of the key
parameters to consider, along with
cost,
size,
security,
reliability, and
power consumption.
Performance Assessment
Raw speed is far less important than how a
processor performs when executing a given
application.
Application performance depends not just on the
raw speed of the processor, but on the
instruction set, the choice of implementation language,
the efficiency of the compiler, and the skill of the
programmer in implementing the application.
System Clock
Performance Assessment: Clock Speed
Key parameters
Performance, cost, size, security, reliability, power
consumption
System clock speed
Measured in Hz or multiples thereof (the pulse frequency
produced by the clock)
Clock rate, clock cycle, clock tick, cycle time
Signals in a CPU take time to settle down to 1 or 0
Some signals may change at different speeds
Computer operations need to be synchronised
Instruction execution is done in discrete steps:
Fetch, decode, load and store, arithmetic or logical
Usually require multiple clock cycles per instruction
Pipelining gives simultaneous execution of instructions
So, clock speed does not portray the complete picture for
different processors
Performance Assessment: Clock Speed
A 1-GHz processor receives 1 billion clock pulses per
second.
Clock Rate/Clock Speed: The rate of pulses
Cycle Time: the time duration between pulses
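Clock rate and cycle time are reciprocals of one another; a one-line helper makes the relation concrete (illustrative only):

```python
def cycle_time(clock_rate_hz):
    """Cycle time tau is the reciprocal of the clock rate f: tau = 1 / f."""
    return 1.0 / clock_rate_hz

tau = cycle_time(1e9)  # a 1-GHz processor
print(tau)             # 1e-09 seconds, i.e. 1 ns between pulses
```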
Instruction Execution Rate
A processor is driven by a clock with a constant
frequency f, or, equivalently:
1. a constant cycle time τ = 1/f
2. Ic = instruction count: the number of machine
instructions executed for that program until it runs to
completion, or for some defined time interval
(executed instructions, not instructions in the program text)
3. CPI = average cycles per instruction
Is CPI a constant value for a processor?
Why 'average'?
Instruction Execution Rate
On any given processor, the number of clock
cycles required varies for different types of
instructions, such as load, store, and branch.
Let CPIi be the number of cycles required for
instruction type i and Ii be the number of
executed instructions of type i for a given
program
The overall CPI is:
CPI = [ Σi (CPIi × Ii) ] / Ic
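The weighted-average formula CPI = Σi (CPIi × Ii) / Ic can be sketched directly. The instruction mix below is hypothetical, chosen only for illustration:

```python
def overall_cpi(mix):
    """Overall CPI = sum(CPI_i * I_i) / Ic.

    mix maps an instruction type to (cpi_i, count_i), where count_i
    is the number of executed instructions of that type.
    """
    total_cycles = sum(cpi * count for cpi, count in mix.values())
    total_instructions = sum(count for _, count in mix.values())
    return total_cycles / total_instructions

# Hypothetical mix of 1,000 executed instructions:
mix = {
    "alu":        (1, 500),  # CPI 1, 500 instructions
    "load_store": (3, 300),  # CPI 3, 300 instructions
    "branch":     (4, 200),  # CPI 4, 200 instructions
}
print(overall_cpi(mix))  # (500 + 900 + 800) / 1000 = 2.2
```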
Instruction Execution Rate
The processor time T needed to execute a given
program can be expressed as: T = Ic × CPI × τ
A refinement of this formula is based on the fact that
memory-related processing (memory references) takes
more time than processing done by the CPU
Rewriting: T = Ic × [p + (m × k)] × τ
where p = number of processor cycles needed to
decode and execute the instruction,
m = number of memory references needed per instruction, and
k = the ratio of memory cycle time to processor
cycle time.
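The refined formula T = Ic × [p + (m × k)] × τ can be evaluated directly. The numbers below are hypothetical, chosen only to show the mechanics:

```python
def execution_time(ic, p, m, k, tau):
    """T = Ic * [p + m * k] * tau, in seconds.

    ic:  executed instruction count
    p:   processor cycles per instruction (decode + execute)
    m:   memory references per instruction
    k:   ratio of memory cycle time to processor cycle time
    tau: processor cycle time in seconds
    """
    return ic * (p + m * k) * tau

# Hypothetical: 2 million instructions on a 400-MHz clock,
# 1.5 processor cycles and 0.4 memory references per instruction,
# memory 4x slower than the processor:
t = execution_time(ic=2_000_000, p=1.5, m=0.4, k=4, tau=1 / 400e6)
print(t)  # execution time in seconds
```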
Performance Factors & System Attributes
The five performance factors in the preceding
equation (Ic, p, m, k, τ) are influenced by four
system attributes:
the design of the instruction set (known as
instruction set architecture),
compiler technology (how effective the compiler is in
producing an efficient machine language program
from a high-level language program),
processor implementation, and
cache and memory hierarchy.
MIPS
Millions of instructions per second (MIPS)
Millions of floating point instructions per
second (MFLOPS)
Heavily dependent on instruction set,
compiler design, processor
implementation, cache & memory
hierarchy
MIPS rate
The MIPS rate can be expressed in terms of the clock rate and CPI
as follows:
MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)
MIPS
Consider the execution of a program which results in the
execution of 2 million instructions on a 400-MHz
processor. The program consists of four major types of
instructions. The instruction mix and the CPI for each
instruction type are given below, based on the result of a
program trace experiment:

Instruction type                    CPI   Instruction mix
Arithmetic and logic                 1        60%
Load/store with cache hit            2        18%
Branch                               4        12%
Memory reference with cache miss     8        10%
MIPS
The average CPI when the program is
executed on a uniprocessor with the above
trace results is:
CPI = (0.6 × 1) + (0.18 × 2) + (0.12 × 4) + (0.1 × 8) = 2.24
The corresponding MIPS rate is:
MIPS rate = (400 × 10^6) / (2.24 × 10^6) ≈ 178
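The arithmetic of the worked example can be checked in a few lines. The mix below is the commonly cited trace (60% ALU at CPI 1, 18% load/store at CPI 2, 12% branch at CPI 4, 10% cache-miss references at CPI 8), assumed here to be the one the example uses:

```python
# Check the worked example: 2 million instructions, 400-MHz clock.
mix = [  # (CPI_i, fraction of the instruction mix)
    (1, 0.60),  # arithmetic and logic
    (2, 0.18),  # load/store with cache hit
    (4, 0.12),  # branch
    (8, 0.10),  # memory reference with cache miss
]
cpi = sum(c * frac for c, frac in mix)   # weighted-average CPI
mips = 400e6 / (cpi * 1e6)               # MIPS = f / (CPI * 10^6)
print(round(cpi, 2), round(mips, 1))     # 2.24 178.6
```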