ACA Mod2


Processor and memory hierarchy

Design Space of Processors


• Processors can be “mapped” to a space that has clock rate and cycles
per instruction (CPI) as coordinates. Each processor type occupies a
region of this space.

• Newer technologies are enabling higher clock rates.

• Manufacturers are also trying to lower the number of cycles per instruction.

• Thus the “future processor space” is moving toward the lower right
of the processor design space.
• Processor families can be mapped onto a coordinate space of clock rate
versus CPI
o Clock rates have moved from lower
to higher speeds
o CPI ratings have been lowered

• Broad Categorization
o CISC
o RISC
CISC and RISC Processors
• Complex Instruction Set Computing (CISC) processors, such as the Intel
80486, the Motorola 68040, the VAX 8600, and the IBM S/390,
typically use microprogrammed control units and have lower clock rates
and higher CPI figures.
• Reduced Instruction Set Computing (RISC) processors, such as the Intel
i860, SPARC, MIPS R3000, and IBM RS/6000, use hardwired control
units and have higher clock rates and lower CPI figures.
VLIW Machines
• Very Long Instruction Word machines typically have many
more functional units than superscalars (and thus need longer
instructions, 256 to 1024 bits, to provide control for them).

• These machines mostly use microprogrammed control units
with relatively slow clock rates because of the need to use
ROM to hold the microcode.
Superpipelined Processors
• These processors typically use a multiphase clock (actually several clocks
that are out of phase with each other, each phase perhaps controlling the
issue of another instruction) running at a relatively high rate.
• The CPI in these machines tends to be relatively high (unless multiple
instruction issue is used).
• Processors in vector supercomputers are mostly superpipelined and use
multiple functional units for concurrent scalar and vector operations
Instruction Pipelines
• Typical instruction execution includes four phases:
– fetch
– decode
– execute
– write-back
• These four phases are frequently performed in
an overlapped, or “assembly line,” manner.
Pipeline Definitions
• Instruction pipeline cycle – the time required for each phase
to complete its operation (assuming equal delay in all phases)
• Instruction issue latency – the time (in cycles) required
between the issuing of two adjacent instructions
• Instruction issue rate – the number of instructions issued per
cycle (the degree of a superscalar)
• Simple operation latency – the delay (after the previous
instruction) associated with the completion of a simple
operation (e.g. integer add) as compared with that of a
complex operation (e.g. divide).
• Resource conflicts – when two or more instructions demand
use of the same functional unit(s) at the same time.
Pipelined Processors
• A base scalar processor:
– issues one instruction per cycle
– has a one-cycle latency for a simple operation
– has a one-cycle latency between instruction issues
– can be fully utilized if instructions can enter the pipeline at a rate of
one per cycle
• For a variety of reasons, instructions might not be able to be
pipelined as aggressively as in a base scalar processor. In these
cases, we say the pipeline is underpipelined.
• CPI rating is 1 for an ideal pipeline. Underpipelined systems
will have higher CPI ratings, lower clock rates, or both.
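
As a rough illustration (not part of the original slides), the sketch below shows why a base scalar processor's CPI approaches 1: with a k-stage pipeline and one instruction entering per cycle, N instructions finish in k + (N − 1) cycles. The stage names and the instruction count are assumed values for the example.

```python
# Minimal sketch: timing of a base scalar, 4-stage pipeline
# (fetch, decode, execute, write-back) issuing one instruction per cycle.
# Total cycles for N instructions in a k-stage pipeline = k + (N - 1),
# so CPI approaches 1 as N grows.

STAGES = ["F", "D", "E", "W"]          # fetch, decode, execute, write-back

def pipeline_cycles(n_instructions: int, k_stages: int = len(STAGES)) -> int:
    """Cycles needed when one instruction enters the pipeline per cycle."""
    return k_stages + (n_instructions - 1)

if __name__ == "__main__":
    n = 100                             # assumed instruction count
    cycles = pipeline_cycles(n)
    print(f"{n} instructions take {cycles} cycles, CPI = {cycles / n:.2f}")
```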
Data path and control unit of scalar processor

In a typical, simple scalar processor that does not employ an
instruction pipeline:
• Main memory, I/O controllers, etc. are connected to the external bus.
• The control unit generates the control signals required for the fetch,
decode, ALU operation, memory access, and write-result phases of
instruction execution.
Instruction-Set Architectures
Architectural Distinctions
CISC
• Earlier CISC processors used microprogrammed control units, with the
microcode held in ROM.
• Conventional CISC architecture uses a unified cache for holding both
instructions and data; therefore instructions and data must share the same
path.
• Some later CISC processors also use split caches.
• A microprogrammed control unit is a relatively simple logic circuit that
is capable of sequencing through microinstructions and generating the control
signals to execute each microinstruction.
RISC
• In a RISC processor, separate instruction and data caches are used with
different access paths.
• Split caches and hardwired control units are used in today's RISC machines.
CISC Scalar Processors
• Early systems had only integer fixed point facilities.
• Modern machines have both fixed and floating
point facilities, sometimes as parallel functional
units.
• Many CISC scalar machines are underpipelined.
• Representative systems:
– VAX 8600
– Motorola MC68040
– Intel Pentium
RISC Scalar Processors
• Designed to issue one instruction per cycle
• RISC and CISC scalar processors should have the same
performance if clock rate and program lengths are equal.
• RISC moves less frequent operations into software, thus
dedicating hardware resources to the most frequently used
operations.
• Representative systems:
– Sun SPARC
– Intel i860
– Motorola M88100
– AMD 29000
• SPARC family chips have been produced by Cypress Semiconductor, Inc.
Figure 4.7 shows the architecture of the Cypress CY7C601 SPARC
processor and of the CY7C602 FPU.
• The Sun SPARC instruction set contains 69 basic instructions.
• The SPARC runs each procedure with a set of thirty-two 32-bit IU
registers.
• Eight of these registers are global registers shared by all
procedures; the remaining 24 are window registers associated
with each procedure.
• The concept of using overlapped register windows is the most
important feature introduced by the Berkeley RISC architecture.
• The Cypress 601 implements eight overlapping windows (formed with 64 local
registers and 64 overlapped registers) plus eight globals, for a total of
136 registers.
• Each register window is divided into three eight-register sections, labeled Ins,
Locals, and Outs.
• The Local registers are only locally addressable by each procedure. The Ins
and Outs are shared among procedures.
• The calling procedure passes parameters to the called procedure via its Outs
(r8 to r15) registers, which are the Ins registers of the called procedure
(see the sketch after this list).
• The window of the currently running procedure is called the active window,
pointed to by a current window pointer.
• A window invalid mask is used to indicate which windows are invalid. The trap
base register serves as a pointer to a trap handler.
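
The overlapping-window idea above can be sketched in code. This is a minimal illustration only: the physical register numbering and the window_registers helper are assumptions made for the example, not the actual CY7C601 register-file wiring; it shows only that a caller's Outs are the callee's Ins and that 8 windows plus 8 globals give 136 registers.

```python
# Minimal sketch of overlapped register windows (8 windows,
# 8 globals + 64 locals + 64 overlapped Ins/Outs = 136 registers).
# The physical numbering below is illustrative, not the real hardware layout.

N_WINDOWS = 8
GLOBALS = list(range(0, 8))                    # shared by all procedures

def window_registers(w: int) -> dict:
    """Map logical Ins/Locals/Outs of window w to illustrative physical registers."""
    base = 8 + (w % N_WINDOWS) * 16            # 16 new physical registers per window
    ins     = list(range(base, base + 8))      # shared with the caller's Outs
    locals_ = list(range(base + 8, base + 16)) # private to this procedure
    # The Outs of window w are the Ins of the next window (the callee),
    # which is how parameters are passed without copying.
    nxt = 8 + ((w + 1) % N_WINDOWS) * 16
    outs = list(range(nxt, nxt + 8))
    return {"globals": GLOBALS, "ins": ins, "locals": locals_, "outs": outs}

if __name__ == "__main__":
    caller, callee = window_registers(0), window_registers(1)
    assert caller["outs"] == callee["ins"]     # overlap: parameters passed in place
    print(caller["outs"])
```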
Superscalar, Vector Processors
• Scalar processor: executes one instruction per cycle, with only one instruction pipeline.
• Superscalar processor: multiple instruction pipelines, with multiple instructions issued per
cycle, and multiple results generated per cycle.
• Vector processors issue one instruction that operates on multiple data items (arrays). This
is conducive to pipelining, with one result produced per cycle.

Superscalar Pipelines
Superscalar processors were originally developed as an alternative to vector
processors, with a view to exploiting a higher degree of instruction-level
parallelism.
A superscalar processor of degree m can issue m instructions per cycle.

The base scalar processor, implemented either in RISC or CISC, has m = 1.

In order to fully utilize a superscalar processor of degree m, m instructions
must be executable in parallel. This situation may not be true in all clock
cycles.
• In that case, some of the pipelines may be stalling in a wait state.
• In a superscalar processor, the simple operation latency should
require only one cycle, as in the base scalar processor.
• Due to the desire for a higher degree of instruction-level parallelism
in programs, the superscalar processor depends more on an
optimizing compiler to exploit parallelism
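
As a rough illustration (not from the slides), the sketch below shows what happens to throughput when not every cycle can issue m instructions; the degree m = 4, the 60% full-issue fraction, and the assumption that the remaining cycles issue only one instruction are all invented for the example.

```python
# Minimal sketch: achieved throughput of a degree-m superscalar when only a
# fraction of cycles can actually issue all m instructions (assumed numbers).

def effective_ipc(m: int, full_issue_fraction: float) -> float:
    """IPC when 'full_issue_fraction' of cycles issue m instructions and the
    remaining cycles issue only one (a simplifying assumption for illustration)."""
    return full_issue_fraction * m + (1.0 - full_issue_fraction) * 1.0

if __name__ == "__main__":
    ipc = effective_ipc(m=4, full_issue_fraction=0.6)   # assumed values
    print(f"IPC = {ipc:.2f}, CPI = {1 / ipc:.2f}")
```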
Typical Superscalar Architecture
• A typical superscalar will have
– multiple instruction pipelines
– an instruction cache that can provide multiple instructions per fetch
– multiple buses among the function units
• In theory, all functional units can be simultaneously active.
VLIW Architecture
• VLIW = Very Long Instruction Word
• Instructions are usually hundreds of bits long.
• Each instruction word essentially carries multiple “short instructions.”
• Each of the “short instructions” is effectively issued at the same time.
• (This is related to the long words frequently used in microcode.)
• Compilers for VLIW architectures should optimally try to predict branch
outcomes to properly group instructions.
Pipelining in VLIW Processors

• Decoding of instructions is easier in VLIW than in superscalars, because
each “region” of an instruction word is usually limited as to the type of
instruction it can contain.
• Code density in VLIW is less than in superscalars, because if a “region” of
a VLIW word isn’t needed in a particular instruction, it must still exist and
be filled with a “no-op” (see the sketch after this list).
• Superscalars can be object-code compatible with scalar processors; achieving
such compatibility between VLIW (parallel) and non-parallel architectures is
difficult.
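
To make the code-density point concrete, here is a minimal packing sketch; the slot names and operations are invented for illustration, and any slot the compiler cannot fill costs a no-op.

```python
# Minimal sketch of VLIW code density: each long word has a fixed slot per
# functional unit, and unused slots must still be filled with a no-op.

SLOTS = ["int_alu", "fp_alu", "load_store", "branch"]   # one slot per functional unit

def pack_vliw_word(ops: dict) -> list:
    """Build one VLIW word: an operation per slot, 'nop' where none is scheduled."""
    return [ops.get(slot, "nop") for slot in SLOTS]

if __name__ == "__main__":
    # The compiler found only two independent operations for this cycle,
    # so half of the word is wasted on no-ops (the code-density cost noted above).
    word = pack_vliw_word({"int_alu": "add r1,r2,r3", "load_store": "ld r4,0(r5)"})
    print(word)   # ['add r1,r2,r3', 'nop', 'ld r4,0(r5)', 'nop']
```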
VLIW Opportunities
• “Random” parallelism among scalar operations is exploited in VLIW, instead
of regular parallelism in a vector or SIMD machine.
• The efficiency of the machine is entirely dictated by the success, or
“goodness,” of the compiler in planning the operations to be placed in the
same instruction words.
• Different implementations of the same VLIW architecture may not be
binary-compatible with each other, resulting in different latencies
VLIW Summary
• VLIW reduces the effort required to detect parallelism using hardware or
software techniques.
• The main advantage of VLIW architecture is its simplicity in hardware
structure and instruction set.
• Unfortunately, VLIW does require careful analysis of code in order to
“compact” the most appropriate ”short” instructions into a VLIW word.
Vector Processors

• A vector processor is a coprocessor designed to perform vector computations.

• A vector is a one-dimensional array of data items (each of the same data type).

• Vector processors are often used in multipipelined supercomputers.

• Architectural types include:

– Register-to-Register (with shorter instructions and register files)

– Memory-to-Memory (longer instructions with memory addresses)


Register-to-Register Vector Instructions
• Assume Vi is a vector register of length n,
• si is a scalar register,
• M(1:n) is a memory array of length n, and “ο” is a vector operation.
• Typical instructions include the following (a sketch of these forms follows the list):
– V1 ο V2 → V3 (element-by-element operation)
– s1 ο V1 → V2 (scaling of each element)
– V1 ο V2 → s1 (binary reduction, e.g. sum of products)
– M(1:n) → V1 (load a vector register from memory)
– V1 → M(1:n) (store a vector register into memory)
– ο V1 → V2 (unary vector, e.g. negation)
– ο V1 → s1 (unary reduction, e.g. sum of vector)
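
A minimal Python sketch of the register-to-register forms listed above, using lists to stand in for vector registers; taking “ο” to be addition or multiplication is an assumption made only for illustration.

```python
# Minimal sketch of register-to-register vector instruction forms,
# modeled with Python lists; "o" is addition/multiplication purely for illustration.

def vbinary(v1, v2):        # V1 o V2 -> V3   (element-by-element operation)
    return [a + b for a, b in zip(v1, v2)]

def vscale(s1, v1):         # s1 o V1 -> V2   (scaling of each element)
    return [s1 * a for a in v1]

def vreduce(v1, v2):        # V1 o V2 -> s1   (binary reduction, e.g. sum of products)
    return sum(a * b for a, b in zip(v1, v2))

def vunary(v1):             # o V1 -> V2      (unary vector, e.g. negation)
    return [-a for a in v1]

def vsum(v1):               # o V1 -> s1      (unary reduction, e.g. sum of vector)
    return sum(v1)

if __name__ == "__main__":
    V1, V2 = [1, 2, 3], [4, 5, 6]
    print(vbinary(V1, V2), vscale(2, V1), vreduce(V1, V2), vunary(V1), vsum(V1))
```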
Memory-to-Memory Vector Instructions

• Typical memory-to-memory vector instructions (using the
same notation as given in the previous slide) include these:

– M1(1:n) ο M2(1:n) → M3(1:n) (binary vector)

– s1 ο M1(1:n) → M2(1:n) (scaling)

– ο M1(1:n) → M2(1:n) (unary vector)

– M1(1:n) ο M2(1:n) → M(k) (binary reduction)


Pipelines in Vector Processors
• Vector processors can usually make effective use of multiple pipelines in
parallel; the number of such parallel pipelines is effectively
limited by the number of functional units.
• As usual, the effectiveness of a pipelined system depends on
the availability and use of an effective compiler to generate
code that makes good use of the pipeline facilities
Symbolic Processors
• Symbolic processors are somewhat unique in that their architectures are
tailored toward the execution of programs in languages similar to LISP,
Scheme, and Prolog.
• In effect, the hardware provides a facility for the manipulation of the
relevant data objects with “tailored” instructions.
• These processors (and programs of these types) may invalidate
assumptions made about more traditional scientific and business
computations
Hierarchical Memory Technology
Memory in a system is usually characterized as appearing at various levels (0, 1, …) in a
hierarchy, with level 0 being CPU registers and level 1 being the cache closest to the
CPU.
Each level is characterized by five parameters:
• access time ti (round-trip time from CPU to ith level)
• memory size si (number of bytes or words in the level)
• cost per byte ci
• transfer bandwidth bi (rate of transfer between levels)
• unit of transfer xi (grain size for transfers)
Memory devices at a lower level are:
• Faster to access,
• Smaller in capacity,
• More expensive per byte,
• Higher in bandwidth, and
• Smaller in unit of transfer.
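
A minimal sketch of the five per-level parameters as a data structure; all numeric values below are invented placeholders used only to show the trend just stated, not figures from the slides.

```python
# Minimal sketch of the five parameters per level; the numbers are placeholders.

memory_hierarchy = [
    # level, access time t_i (ns), size s_i (bytes), cost c_i ($/byte),
    # bandwidth b_i (MB/s), unit of transfer x_i (bytes)
    {"level": "registers",   "t": 0.5, "s": 512,     "c": 1e-1,  "b": 80_000, "x": 8},
    {"level": "cache",       "t": 2,   "s": 512_000, "c": 1e-3,  "b": 20_000, "x": 64},
    {"level": "main memory", "t": 50,  "s": 8e9,     "c": 1e-8,  "b": 10_000, "x": 4096},
    {"level": "disk",        "t": 5e6, "s": 1e12,    "c": 1e-10, "b": 500,    "x": 65_536},
]

# Moving down the list: t, s and x grow while c and b shrink, matching the
# "faster / smaller / costlier / higher-bandwidth" trend stated above.
for lower, higher in zip(memory_hierarchy, memory_hierarchy[1:]):
    assert lower["t"] < higher["t"] and lower["s"] < higher["s"]
    assert lower["c"] > higher["c"] and lower["b"] > higher["b"]
```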

Registers and Caches


Registers
• The registers are parts of the processor;
• Register assignment is made by the compiler.
• Register transfer operations are directly controlled by the processor after
instructions are decoded.
• Register transfer is conducted at processor speed, in one clock cycle.
Caches
• The cache is controlled by the MMU and is programmer-transparent.
• The cache can also be implemented at one or multiple levels, depending on
the speed and application requirements.
• Multi-level caches are built either on the processor chip or on the processor
board.
• Multi-level cache systems have become essential to deal with memory
access latency.

Main Memory (Primary Memory)
• It is usually much larger than the cache and often implemented with the most
cost-effective RAM chips, such as DDR SDRAMs (double data rate
synchronous dynamic RAMs).
• The main memory is managed by an MMU in cooperation with the operating
system.
Disk Drives and Backup Storage
• The disk storage is considered the highest level of on-line memory.
• It holds the system programs such as the OS and compilers, and user
programs and their data sets.
• Optical disks and magnetic tape units are off-line memory for use as
archival and backup storage.
• They hold copies of present and past user programs and processed
results and files.
• Disk drives are also available in the form of RAID arrays.

Peripheral Technology
• Peripheral devices include printers, plotters, terminals, monitors,
graphics displays, optical scanners, image digitizers, output microfilm
devices etc.
• Some I/O devices are tied to special-purpose or multimedia applications.
Inclusion, Coherence, and Locality

• Information stored in a memory hierarchy (M1, M2,…, Mn) satisfies 3 important properties:
– Inclusion
– Coherence
– Locality

• The inclusion property is stated as:
M1 ⊆ M2 ⊆ … ⊆ Mn
The implication of the inclusion property is that all items of information in the
“innermost” memory level (the cache) also appear in the outer memory levels.

• The inverse, however, is not necessarily true. That is, the presence of a
data item in level Mi+1 does not imply its presence in level Mi. We call a
reference to a missing item a “miss.”
The Coherence Property
The requirement that copies of data items at successive memory levels
be consistent is called the “coherence property.”
Write-through
As soon as a data item in Mi is modified, immediate update of the
corresponding data item(s) in Mi+1, Mi+2, … Mn is required.
This is the most aggressive (and expensive) strategy.
Write-back
The data item in Mi+1 corresponding to a modified item in Mi is
not updated until it (or the block/page/etc. in Mi that
contains it) is replaced or removed.
This is the most efficient approach, but cannot be used (without
modification) when multiple processors share Mi+1, …, Mn.
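
A minimal sketch contrasting the two strategies for a two-level hierarchy (M1 and M2); the dictionary-based “memories” and the TwoLevel class are illustrative assumptions, not a real cache design.

```python
# Minimal sketch: write-through updates M2 on every write to M1;
# write-back defers the update until the modified block is evicted.

class TwoLevel:
    def __init__(self, write_back: bool):
        self.m1, self.m2 = {}, {}
        self.dirty = set()                  # items modified in M1 but not yet in M2
        self.write_back = write_back

    def write(self, addr, value):
        self.m1[addr] = value
        if self.write_back:
            self.dirty.add(addr)            # defer the update of M2
        else:
            self.m2[addr] = value           # write-through: update M2 immediately

    def evict(self, addr):
        if self.write_back and addr in self.dirty:
            self.m2[addr] = self.m1[addr]   # copy back only on replacement
            self.dirty.discard(addr)
        self.m1.pop(addr, None)

if __name__ == "__main__":
    wb = TwoLevel(write_back=True)
    wb.write(0x10, 42)
    print(wb.m2)        # {}: M2 is stale until the block is evicted
    wb.evict(0x10)
    print(wb.m2)        # {16: 42}
```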
Locality of References
Memory references are generated by the CPU for either instruction or
data access.

Temporal locality – if location M is referenced at time t, then it
(location M) will be referenced again at some time t + Δt.

Spatial locality – if location M is referenced at time t, then another
location M ± m will be referenced at time t + Δt.

Sequential locality – if location M is referenced at time t, then
locations M+1, M+2, … will be referenced at times t + Δt, t + Δt′, etc.

In each of these patterns, both m and Δt are “small.”
Hit Ratios
• When a needed item (instruction or data) is found in the level of the memory
hierarchy being examined, it is called a hit.

• Otherwise (when it is not found), it is called a miss (and the item must be
obtained from a lower level in the hierarchy).

• The hit ratio, hi, for Mi is the probability (between 0 and 1) that a needed data
item is found when sought in memory level Mi.

• The miss ratio is obviously just 1-hi.

• We assume h0 = 0 and hn = 1.
Access Frequencies
• The access frequency fi to level Mi is
fi = (1 − h1)(1 − h2) … (1 − hi−1) hi

• Note that f1 = h1, and Σi=1..n fi = 1.
Effective Access Times
• There are different penalties associated with misses at
different levels in the memory hierarchy.
– A cache miss is typically 2 to 4 times as expensive as a cache hit
(assuming success at the next level).
– A page fault (miss) is 3 to 4 orders of magnitude as costly as a page hit.
• The effective access time of a memory hierarchy can be
expressed as
Teff = Σi=1..n fi ti
     = h1t1 + (1 − h1)h2t2 + … + (1 − h1)(1 − h2) … (1 − hn−1)hntn


The first few terms in this expression dominate, but the effective access
time is still dependent on program behavior and memory design choices
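
A minimal sketch that computes the access frequencies fi and Teff from the formulas above; the hit ratios and access times are assumed values chosen only to illustrate how the lower levels can still dominate the result.

```python
# Minimal sketch: access frequencies f_i and effective access time T_eff
# from per-level hit ratios h_i and access times t_i (assumed values).

def access_frequencies(h):
    """f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) * h_i, with h_n = 1."""
    f, miss_so_far = [], 1.0
    for hi in h:
        f.append(miss_so_far * hi)
        miss_so_far *= (1.0 - hi)
    return f

def effective_access_time(h, t):
    return sum(fi * ti for fi, ti in zip(access_frequencies(h), t))

if __name__ == "__main__":
    h = [0.95, 0.99, 1.0]        # cache, main memory, disk (h_n = 1)
    t = [2, 50, 5_000_000]       # access times in ns (illustrative)
    f = access_frequencies(h)
    print(f, sum(f))             # the frequencies sum to 1
    print(effective_access_time(h, t), "ns")
```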
Hierarchy Optimization

• The total cost of a memory hierarchy is estimated as Ctotal = Σi=1..n ci si.

• This implies that the cost is distributed over n levels. Since c1 > c2 > c3 > … > cn, we
have to choose s1 < s2 < s3 < … < sn.

• The optimal design of a memory hierarchy should result in a Teff close to the t1 of M1
and a total cost close to the cost of Mn.
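
A minimal sketch of the cost estimate above, with invented cost-per-byte and size figures.

```python
# Minimal sketch: total cost C_total = sum(c_i * s_i) over the n levels,
# using invented cost-per-byte and size figures.

def total_cost(c, s):
    return sum(ci * si for ci, si in zip(c, s))

if __name__ == "__main__":
    c = [1e-3, 1e-8, 1e-10]          # $/byte: cache, main memory, disk
    s = [512e3, 8e9, 1e12]           # bytes:  cache, main memory, disk
    print(f"C_total = ${total_cost(c, s):.2f}")
```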
