ACA (15CS72) Notes: Module 2
Processors can be "mapped" to a space that has clock rate and cycles per instruction (CPI) as
coordinates. Each processor type occupies a region of this space.
Newer technologies are enabling higher clock rates.
Manufacturers are also trying to lower the number of cycles per instruction.
Thus the "future processor space" is moving toward the lower right of the processor design space (higher clock rates and lower CPI).
• Complex Instruction Set Computing (CISC) processors such as the Intel 80486, the Motorola 68040,
the VAX 8600, and the IBM S/390 typically use microprogrammed control units and have lower clock
rates and higher CPI figures.
• Reduced Instruction Set Computing (RISC) processors such as the Intel i860, SPARC, MIPS R3000,
and IBM RS/6000 have hard-wired control units, higher clock rates, and lower CPI figures.
Superscalar Processors
• This subclass of RISC processors allows multiple instructions to be issued simultaneously
during each cycle.
• The effective CPI of a superscalar processor should be less than that of a generic scalar RISC
processor.
• Clock rates of scalar RISC and superscalar RISC machines are similar.
VLIW Machines
• Very Long Instruction Word machines typically have many more functional units than superscalars
(and thus the need for longer – 256 to 1024 bits – instructions to provide control for them).
• These machines mostly use microprogrammed control units with relatively slow clock rates
because of the need to use ROM to hold the microcode.
Superpipelined Processors
• These processors typically use a multiphase clock (actually several clocks that are out of phase
with each other, each phase perhaps controlling the issue of another instruction) running at a
relatively high rate.
• The CPI in these machines tends to be relatively high (unless multiple instruction issue is used).
• Processors in vector supercomputers are mostly superpipelined and use multiple functional units
for concurrent scalar and vector operations.
Instruction Pipelines
• Instruction pipeline cycle – the time required for each phase to complete its operation
(assuming equal delay in all phases)
• Instruction issue latency – the time (in cycles) required between the issuing of two adjacent
instructions
• Instruction issue rate – the number of instructions issued per cycle (the degree of a
superscalar)
• Simple operation latency – the delay (after the previous instruction) associated with the
completion of a simple operation (e.g. integer add) as compared with that of a complex
operation (e.g. divide).
• Resource conflicts – when two or more instructions demand use of the same functional unit(s)
at the same time.
Pipelined Processors
– can be fully utilized if instructions enter the pipeline at a rate of one per cycle
• CPI rating is 1 for an ideal pipeline. Underpipelined systems will have higher CPI ratings,
lower clock rates, or both.
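As a rough illustration of these CPI figures, the sketch below (Python, with purely hypothetical stall rates) estimates the effective CPI of a k-stage pipeline. It is a back-of-the-envelope model, not a description of any particular processor.

    # Hypothetical model: cycles needed by a k-stage pipeline that issues at most
    # one instruction per cycle, with an assumed average number of stall cycles.
    def pipeline_cycles(n_instructions, k_stages, stalls_per_instruction=0.0):
        # k cycles to fill the pipeline, then roughly one issue per cycle plus stalls
        return k_stages + (n_instructions - 1) * (1.0 + stalls_per_instruction)

    n, k = 1_000_000, 5
    ideal = pipeline_cycles(n, k)                      # no hazards or stalls
    real = pipeline_cycles(n, k, stalls_per_instruction=0.4)
    print(ideal / n, real / n)                         # effective CPI: ~1.0 vs ~1.4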
• Figure 4.3 shows the data path architecture and control unit of a typical, simple scalar processor
which does not employ an instruction pipeline. Main memory, I/O controllers, etc. are connected to
the external bus.
• The control unit generates control signals required for the fetch, decode, ALU operation, memory
access, and write result phases of instruction execution.
• The control unit itself may employ hardwired logic, or—as was more common in older CISC style
processors—microcoded logic.
• Modern RISC processors employ hardwired logic, and even modern CISC processors make use of
many of the techniques originally developed for high-performance RISC processors.
• CISC
– Many different instructions
– Many different operand data types
– Many different operand addressing formats
– Relatively small number of general purpose registers
– Many instructions directly match high-level language constructions
• RISC
– Many fewer instructions than CISC (freeing chip space for more functional units!)
– Fixed instruction format (e.g. 32 bits) and simple operand addressing
– Relatively large number of registers
– Small CPI (close to 1) and high clock rates
Architectural Distinctions
• CISC
– Unified cache for instructions and data (in most cases)
– Microprogrammed control units and ROM in earlier processors (hard-wired control
units now in some CISC systems)
• RISC
– Separate instruction and data caches
– Hard-wired control units
• CISC Advantages
– Smaller program size (fewer instructions)
– Simpler control unit design
– Simpler compiler design
• RISC Advantages
– Has potential to be faster
– Many more registers
• RISC Problems
– More complicated register decoding system
– Hardwired control is less flexible than microcode
• Representative CISC scalar processors:
– VAX 8600
– Motorola MC68040
– Intel Pentium
• RISC and CISC scalar processors should have the same performance if clock rate and program
lengths are equal.
• RISC moves less frequent operations into software, thus dedicating hardware resources to the
most frequently used operations.
• Representative RISC scalar processors:
– Sun SPARC
– Intel i860
– Motorola M88100
– AMD 29000
• SPARC family chips have been produced by Cypress Semiconductor, among others. Figure 4.7 shows the
architecture of the Cypress CY7C601 SPARC processor and of the CY7C602 FPU.
• The Sun SPARC instruction set contains 69 basic instructions
• The SPARC runs each procedure with a set of thirty-two 32-bit IU registers.
• Eight of these registers are global registers shared by all procedures, and the remaining 24 are
window registers associated with each individual procedure.
• The concept of using overlapped register windows is the most important feature introduced by the
Berkeley RISC architecture.
• Fig. 4.8 shows eight overlapping windows (formed with 64 local registers and 64 overlapped
registers) and eight globals with a total of 136 registers, as implemented in the Cypress 601.
• Each register window is divided into three eight-register sections, labeled Ins, Locals, and Outs.
• The local registers are only locally addressable by each procedure. The Ins and Outs are shared
among procedures.
• The calling procedure passes parameters to the called procedure via its Outs (r8 to r15) registers,
which are the Ins registers of the called procedure.
• The window of the currently running procedure is called the active window, and is pointed to by the
current window pointer (CWP).
• A window invalid mask is used to indicate which window is invalid. The trap base register serves
as a pointer to a trap handler.
• A special register (the Y register) is used to create a 64-bit product with the multiply-step instructions.
Procedures can also be called without changing the window.
• The overlapping windows can significantly reduce the time required for interprocedure
communication, resulting in much faster context switching among cooperating procedures.
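The parameter passing through overlapping Ins/Outs can be illustrated with a small software model. The sketch below is a simplified Python simulation of eight 24-register windows over 128 physical window registers (plus 8 globals); it is not the actual CY7C601 hardware, and the index mapping is an assumption made for illustration.

    # Simplified model of SPARC-style overlapping register windows.
    NWINDOWS = 8
    globals_regs = [0] * 8                 # r0-r7, shared by all procedures
    window_regs = [0] * (NWINDOWS * 16)    # 64 Locals + 64 overlapped Ins/Outs

    def phys_index(cwp, kind, r):
        """Map (window, {'ins','locals','outs'}, r in 0..7) to a physical register."""
        if kind == "outs":
            return (cwp * 16 + r) % len(window_regs)
        if kind == "locals":
            return (cwp * 16 + 8 + r) % len(window_regs)
        # the Ins of window cwp are physically the Outs of window cwp - 1
        return ((cwp - 1) * 16 + r) % len(window_regs)

    cwp = 1                                            # current window pointer
    window_regs[phys_index(cwp, "outs", 0)] = 42       # caller writes its out[0]
    cwp += 1                                           # SAVE: callee gets the next window
    print(window_regs[phys_index(cwp, "ins", 0)])      # callee reads 42 from its in[0]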
A CISC or a RISC scalar processor can be improved with a superscalar or vector architecture.
Scalar processors are those executing one instruction per cycle.
Only one instruction is issued per cycle, and only one instruction completion is expected from
the pipeline per cycle.
In a superscalar processor, multiple instructions are issued per cycle and multiple results are
generated per cycle.
A vector processor executes vector instructions on arrays of data; each vector instruction involves a
string of repeated operations, which are ideal for pipelining with one result per cycle.
Superscalar processors are designed to exploit more instruction-level parallelism in user programs.
Only independent instructions can be executed in parallel without causing a wait state. The amount
of instruction level parallelism varies widely depending on the type of code being executed.
It has been observed that the average value is around 2 for code without loop unrolling. Therefore,
for these codes there is not much benefit gained from building a machine that can issue more than
three instructions per cycle.
The instruction-issue degree in a superscalar processor has thus been limited to 2 to 5 in practice.
• Code density in VLIW is lower than in superscalars, because if a "region" of a VLIW word isn't
needed by a particular instruction, it must still exist (filled with a no-op).
• Superscalars can be object-code compatible with scalar processors; such compatibility is difficult
to achieve between VLIW and non-VLIW architectures.
VLIW Opportunities
• The efficiency of the machine is entirely dictated by the success, or "goodness," of the
compiler in planning the operations to be placed in the same instruction words.
• Different implementations of the same VLIW architecture may not be binary-compatible with
each other, because they may have different operation latencies.
VLIW Summary
• VLIW reduces the effort required to detect parallelism using hardware or software techniques.
• The main advantage of VLIW architecture is its simplicity in hardware structure and instruction
set.
• Unfortunately, VLIW does require careful analysis of code in order to "compact" the most
appropriate "short" instructions into a VLIW word.
• Typical memory-to-memory vector instructions operate directly on vector operands held in
memory rather than in vector registers.
• Vector processors can usually use multiple pipelines in parallel; the number of such
parallel pipelines is limited by the number of functional units.
• As usual, the effectiveness of a pipelined system depends on the availability and use of an
effective compiler to generate code that makes good use of the pipeline facilities.
Symbolic Processors
• Symbolic processors are somewhat unique in that their architectures are tailored toward the
execution of programs in languages such as LISP, Scheme, and Prolog.
• In effect, the hardware provides a facility for the manipulation of the relevant data objects with
"tailored" instructions.
• These processors (and programs of these types) may invalidate assumptions made about more
traditional scientific and business computations.
Storage devices such as registers, caches, main memory, disk devices, and backup storage are often
organized as a hierarchy as depicted in Fig. 4.17.
The memory technology and storage organization at each level is characterized by five parameters:
the access time (t_i), memory size (s_i), cost per byte (c_i), transfer bandwidth (b_i), and unit of
transfer (x_i). Devices at a level closer to the CPU are faster to access, smaller in capacity,
more expensive per byte, have a higher bandwidth, and use a smaller unit of transfer.
In general, t_{i-1} < t_i, s_{i-1} < s_i, c_{i-1} > c_i, b_{i-1} > b_i and x_{i-1} < x_i for i = 1, 2, 3, and 4 in the
hierarchy, where i = 0 corresponds to the CPU register level.
The cache is at level 1, main memory at level 2, the disks at level 3 and backup storage at level 4.
Caches
The cache is controlled by the MMU and is programmer-transparent.
The cache can also be implemented at one or multiple levels, depending on the speed and
application requirements.
Multi-level caches are built either on the processor chip or on the processor board.
Multi-level cache systems have become essential to deal with memory access latency.
Peripheral Technology
Peripheral devices include printers, plotters, terminals, monitors, graphics displays, optical
scanners, image digitizers, output microfilm devices etc.
Some I/O devices are tied to special-purpose or multimedia applications.
Information stored in a memory hierarchy (M1, M2,…, Mn) satisfies 3 important properties:
1. Inclusion
2. Coherence
3. Locality
We consider cache memory the innermost level M1, which directly communicates with the CPU
registers.
The outermost level Mn contains all the information words stored. In fact, the collection of all
addressable words in Mn forms the virtual address space of a computer.
Program and data locality is characterized below as the foundation for using a memory hierarchy
effectively.
• The inclusion property states that M1 ⊂ M2 ⊂ … ⊂ Mn: any information item found in level Mi is also
found in the outer levels Mi+1, …, Mn. The inverse, however, is not necessarily true. That is, the presence
of a data item in level Mi+1 does not imply its presence in level Mi. We call a reference to a missing item a "miss."
The requirement that copies of data items at successive memory levels be consistent is called the
"coherence property."
Coherence Strategies
• Write-through
– The data item in Mi+1 is updated immediately whenever the corresponding item in Mi is
modified.
• Write-back
– The data item in Mi+1 corresponding to a modified item in Mi is not updated until it
(or the block/page/etc. in Mi that contains it) is replaced or removed.
– This is the more efficient approach, but cannot be used (without modification) when
multiple processors share Mi+1, …, Mn.
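A minimal sketch of the two strategies for a two-level hierarchy (M1 = cache, M2 = main memory) is shown below; it models memory at word granularity and ignores blocks, so it is only meant to contrast when M2 gets updated.

    class TwoLevelMemory:
        def __init__(self, write_back):
            self.m1, self.m2 = {}, {}        # M1 (cache) and M2 (main memory)
            self.dirty = set()
            self.write_back = write_back

        def write(self, addr, value):
            self.m1[addr] = value
            if self.write_back:
                self.dirty.add(addr)         # write-back: defer the M2 update
            else:
                self.m2[addr] = value        # write-through: update M2 immediately

        def evict(self, addr):
            if self.write_back and addr in self.dirty:
                self.m2[addr] = self.m1[addr]   # copy back only on replacement
                self.dirty.discard(addr)
            self.m1.pop(addr, None)

    wb = TwoLevelMemory(write_back=True)
    wb.write(0x10, 7)                        # M2 is temporarily inconsistent here
    wb.evict(0x10)
    print(wb.m2[0x10])                       # 7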
3. Locality of References
• Memory references are generated by the CPU for either instruction or data access.
• These accesses tend to be clustered in certain regions in time, space, and ordering.
There are three dimensions of the locality property:
– Temporal locality – recently referenced items (instructions or data) are likely to be
referenced again in the near future.
– Spatial locality – if location M is referenced at time t, a nearby location M ± m is likely
to be referenced at time t + Δt.
– Sequential locality – instructions tend to be executed in sequential program order.
Working Sets
• The set of addresses (bytes, pages, etc.) referenced by a program during the interval from t to
t + Δt, where Δt is called the working set parameter, changes slowly.
• This set of addresses, called the working set, should be present in the higher levels of M if a
program is to execute efficiently (that is, without requiring numerous movements of data items
from lower levels of M). This is called the working set principle.
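As a small illustration, the working set over an observation window Δt can be computed from a page trace as follows (the trace here is an assumed example, not one from the text):

    # W(t, dt): the set of distinct pages referenced in the last dt references before t.
    def working_set(page_trace, t, dt):
        return set(page_trace[max(0, t - dt):t])

    trace = [0, 1, 2, 4, 2, 3, 7, 2, 1, 0]    # assumed page trace
    print(working_set(trace, t=8, dt=4))      # {2, 3, 7}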
Hit Ratios
• When a needed item (instruction or data) is found in the level of the memory hierarchy being
examined, it is called a hit. Otherwise (when it is not found), it is called a miss (and the item
must be obtained from a lower level in the hierarchy).
• The hit ratio, hi, for Mi is the probability (between 0 and 1) that a needed data item is found
when sought in memory level Mi.
• We assume h0 = 0 and hn = 1.
Access Frequencies
• There are different penalties associated with misses at different levels in the memory hierarchy.
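The usual way to quantify this is through access frequencies: level Mi is actually accessed only when all inner levels miss. The sketch below computes f_i = (1 - h_1)(1 - h_2)…(1 - h_{i-1}) h_i and the resulting effective access time; the hit ratios and access times are assumed values for illustration.

    def access_frequencies(hit_ratios):
        """hit_ratios[i] = h_(i+1); the last level is assumed to always hit (h_n = 1)."""
        freqs, miss_so_far = [], 1.0
        for h in hit_ratios:
            freqs.append(miss_so_far * h)    # f_i = (1-h_1)...(1-h_{i-1}) * h_i
            miss_so_far *= (1.0 - h)
        return freqs

    h = [0.95, 0.99, 1.0]                    # cache, main memory, disk (assumed)
    t = [2, 50, 10_000_000]                  # access times in ns (assumed)
    f = access_frequencies(h)
    t_eff = sum(fi * ti for fi, ti in zip(f, t))
    print(f, t_eff)                          # Teff is dominated by the miss penalties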
Hierarchy Optimization
This implies that the cost is distributed over n levels. Since c1 > c2 > c3 > … > cn, we have to choose s1
< s2 < s3 < … < sn.
The optimal design of a memory hierarchy should result in a Teff close to the t1 of M1 and a total
cost close to the cost of Mn.
The optimization process can be formulated as a linear programming problem, given a ceiling C0 on
the total cost; that is, the problem is to minimize the effective access time subject to the cost constraint.
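Written out in the notation of the parameters above (the formula itself is reconstructed here, since it is not reproduced in these notes), the problem is roughly:

    minimize    T_eff = \sum_{i=1}^{n} f_i \, t_i
    subject to  C_total = \sum_{i=1}^{n} c_i \, s_i < C_0,  with  s_i > 0,\; c_i > 0  for  i = 1, \dots, n.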
• To facilitate the use of memory hierarchies, the memory addresses normally generated by
modern processors executing application programs are not physical addresses, but are rather
virtual addresses of data items and instructions.
• Physical addresses, of course, are used to reference the available locations in the real physical
memory of a system.
• Virtual addresses must be mapped to physical addresses before they can be used.
• The mapping from virtual to physical addresses can be formally defined as follows:
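One common way of writing this mapping (reconstructed here, since the formula does not appear in the notes) is as a function from the virtual address space V to the physical space M plus a fault indicator:

    f_t(v) = \begin{cases}
        m         & \text{if item } v \text{ resides in physical memory location } m \text{ at time } t \text{ (a hit)} \\
        \emptyset & \text{if item } v \text{ is missing from physical memory (a miss, i.e. a page fault)}
    \end{cases}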
• The mapping returns a physical address if a memory hit occurs. If there is a memory miss, the
referenced item has not yet been brought into primary memory.
Mapping Efficiency
• The efficiency with which the virtual to physical mapping can be accomplished significantly
affects the performance of the system.
• Private virtual memory
– In this scheme, each processor has a separate virtual address space, but all processors share the
same physical address space.
– Advantages:
– Disadvantages
• The same virtual address in different virtual spaces may point to different pages in
physical memory
• Shared virtual memory
– All processors share a single shared virtual address space, with each processor being given a
portion of it.
Advantages:
Disadvantages
• Processors must be capable of generating large virtual addresses (usually > 32 bits)
• Since the page table is shared, mutual exclusion must be used to guarantee atomic updates
• Segmentation must be used to confine each process to its own address space
• The address translation process is slower than with private (per processor) virtual memory
Memory Allocation
Both the virtual address space and the physical address space are partitioned into fixed-length pages.
• The purpose of memory allocation is to allocate pages of virtual memory to the page frames of
physical memory.
The process demands the translation of virtual addresses into physical addresses. Various schemes
for virtual address translation are summarized in Fig. 4.21a.
The translation demands the use of translation maps which can be implemented in various ways.
Translation maps are stored in the cache, in associative memory, or in the main memory.
To access these maps, a mapping function is applied to the virtual address. This function generates
a pointer to the desired translation map.
This mapping can be implemented with a hashing or congruence function.
Hashing is a simple computing technique for converting a long page number into a shorter one with
fewer bits.
The hashing function should randomize the virtual page number and produce a unique hashed
number to be used as the pointer.
Our purpose is to produce the physical address consisting of the page frame number, the block
number, and the word address.
The first step of the translation is to use the virtual page number as a key to search through the TLB
for a match.
The TLB can be implemented with a special associative memory (content addressable memory) or
use part of the cache memory.
In case of a match (a hit) in the TLB, the page frame number is retrieved from the matched page
entry. The cache block and word address are copied directly.
In case the match cannot be found (a miss) in the TLB, a hashed pointer is used to identify one of
the page tables where the desired page frame number can be retrieved.
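A small Python sketch of this two-step lookup is given below; the TLB and hashed page tables are plain dictionaries, and the structures and hash function are illustrative assumptions rather than any specific hardware organization.

    def translate(vpn, offset, tlb, hashed_page_tables, n_buckets):
        if vpn in tlb:                            # TLB hit: page frame number comes out directly
            return tlb[vpn], offset
        bucket = hash(vpn) % n_buckets            # hashed pointer selects one page table
        for entry_vpn, pfn in hashed_page_tables.get(bucket, []):
            if entry_vpn == vpn:                  # search that table for the virtual page
                tlb[vpn] = pfn                    # refill the TLB on a miss
                return pfn, offset
        raise LookupError("page fault")           # page not resident in memory

    tlb = {3: 7}
    hashed_page_tables = {0: [(8, 2)], 1: [(5, 4)]}   # bucket -> [(vpn, pfn), ...]
    print(translate(3, 0x1F4, tlb, hashed_page_tables, n_buckets=2))   # TLB hit -> (7, 500)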
1. Paging
2. Segmentation
1. Paging memory
• Main memory contains some number of pages which is smaller than the number of pages in the
virtual memory.
• For example, if the page size is 2K and the physical memory is 16M (8K pages) and the virtual
memory is 4G (2M pages), then there is a 256-to-1 mapping.
• A page map table is used for implementing a mapping, with one entry per virtual page.
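The numbers in the example can be checked directly:

    page_size = 2 * 1024                          # 2K page
    frames = (16 * 1024**2) // page_size          # 16M physical -> 8,192 frames (8K)
    virtual_pages = (4 * 1024**3) // page_size    # 4G virtual  -> 2,097,152 pages (2M)
    print(frames, virtual_pages, virtual_pages // frames)   # 8192 2097152 256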
2. Segmented memory
• In a segmented memory management system the blocks to be replaced in main memory are
potentially of unequal length and here the segments correspond to logical blocks of code or data.
• Segments, then, are ``atomic'' in the sense that either the whole segment should be in main
memory, or none of the segment should be there.
• The segments may be placed anywhere in main memory, but the instructions or data in one
segment should be contiguous.
3. Paged Segmentation
• Within each segment, the addresses are divided into fixed-size pages. A virtual address then consists
of three fields:
– Segment Number
– Page Number
– Offset
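For illustration, a 32-bit virtual address under paged segmentation might be decomposed as below; the 4/12/16-bit field widths are assumptions chosen for the example, not values from the text.

    def split_address(vaddr):
        seg    = (vaddr >> 28) & 0xF        # segment number (4 bits, assumed)
        page   = (vaddr >> 16) & 0xFFF      # page number within the segment (12 bits)
        offset =  vaddr        & 0xFFFF     # offset within the page (16 bits)
        return seg, page, offset

    seg, page, offset = split_address(0x1234ABCD)
    print(hex(seg), hex(page), hex(offset))  # 0x1 0x234 0xabcd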
Inverted paging
• Besides direct mapping, address translation maps can also be implemented with inverted mapping
(Fig. 4.21c).
• An inverted page table contains one entry for each page frame that has been allocated to users. Any
virtual page number can be paired with a given physical page number.
• Inverted page tables are accessed either by an associative search or by the use of a hashing
function.
• In using an inverted PT, only virtual pages that are currently resident in physical memory are
included. This provides a significant reduction in the size of the page tables.
• The generation of a long virtual address from a short physical address is done with the help of
segment registers, as demonstrated in Fig. 4.21c.
• The leading 4 bits (denoted sreg) of a 32-bit address name a segment register.
• The register provides a segment id that replaces the 4-bit sreg to form a long virtual address.
• This effectively creates a single long virtual address space with segment boundaries at multiples of
256 Mbytes (2^28 bytes).
• The IBM RT/PC had a 12-bit segment id (4096 segments) and a 40-bit virtual address space.
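The expansion can be mimicked in a few lines of Python; the segment register contents below are made-up values, and only the bit manipulation follows the description above.

    seg_regs = [0] * 16                          # 16 segment registers holding 12-bit ids
    seg_regs[0x3] = 0xABC                        # assumed id loaded into register 3

    def long_virtual_address(addr32):
        sreg = (addr32 >> 28) & 0xF              # leading 4 bits name a segment register
        offset = addr32 & 0x0FFFFFFF             # remaining 28 bits (256 MB segment)
        return (seg_regs[sreg] << 28) | offset   # 12 + 28 = 40-bit virtual address

    print(hex(long_virtual_address(0x31234567))) # 0xabc1234567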
• Either associative page tables or inverted page tables can be used to implement inverted mapping.
• The inverted page table can also be assisted with the use of a TLB. An inverted PT avoids the use
of a large page table or a sequence of page tables.
• Given a virtual address to be translated, the hardware searches the inverted PT for that address and,
if it is found, uses the table index of the matching entry as the address of the desired page frame.
• A hashing table is used to search through the inverted PT.
• The size of an inverted PT is governed by the size of the physical space, while that of traditional
PTs is determined by the size of the virtual space.
• Because of limited physical space, no multiple levels are needed for the inverted page table.
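A toy model of an inverted page table, sized by the number of physical frames and searched through a hash table, might look like this (all names and the hash scheme are illustrative assumptions):

    NUM_FRAMES = 8
    ipt = [None] * NUM_FRAMES                # one entry per physical frame: (pid, vpn)
    buckets = {}                             # hash(pid, vpn) -> list of candidate frames

    def map_page(pid, vpn, frame):
        ipt[frame] = (pid, vpn)
        buckets.setdefault(hash((pid, vpn)) % NUM_FRAMES, []).append(frame)

    def lookup(pid, vpn):
        for frame in buckets.get(hash((pid, vpn)) % NUM_FRAMES, []):
            if ipt[frame] == (pid, vpn):
                return frame                 # the matching IPT index is the frame number
        return None                          # not resident: would raise a page fault

    map_page(pid=1, vpn=0x12345, frame=3)
    print(lookup(1, 0x12345), lookup(1, 0x99999))   # 3 None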
• Memory management policies include the allocation and deallocation of memory pages to active
processes and the replacement of memory pages.
• In demand paging memory systems, page replacement refers to the process in which a resident page in
main memory is replaced by a new page transferred from the disk.
• Since the number of available page frames is much smaller than the number of pages, the frames
will eventually be fully occupied.
• In order to accommodate a new page, one of the resident pages must be replaced.
• The goal of a page replacement policy is to minimize the number of possible page faults so that the
effective memory-access time can be reduced.
• The effectiveness of a replacement algorithm depends on the program behavior and memory traffic
patterns encountered.
• A good policy should match the program locality property. The policy is also affected by page size
and by the number of available frames.
Page Traces: A page trace is a sequence of page frame numbers (PFNs) generated during the
execution of a given program.
The following page replacement policies are specified in a demand paging memory system for a page
fault at time t.
(1) Least recently used (LRU)—This policy replaces the page in R(t) which has the longest backward
distance (i.e., the page referenced furthest in the past).
(2) Optimal (OPT) algorithm—This policy replaces the page in R(t) with the longest forward
distance (i.e., the page that will not be referenced again for the longest time).
(3) First-in-first-out (FIFO)—This policy replaces the page in R(t) which has been in memory for the
longest time.
(4) Least frequently used (LFU)—This policy replaces the page in R(t) which has been least
referenced in the past.
(5) Circular FIFO—This policy joins all the page frame entries into a circular FIFO queue using a
pointer to indicate the front of the queue.
• An allocation bit is associated with each page frame. This bit is set upon initial allocation of a page
to the frame.
• When a page fault occurs, the queue is circularly scanned from the pointer position.
• The pointer skips the allocated page frames and replaces the very first unallocated page frame.
• When all frames are allocated, the front of the queue is replaced, as in the FIFO policy.
(6) Random replacement—This is a trivial algorithm which chooses any page for replacement
randomly.
Example:
Consider a paged virtual memory system with a two-level hierarchy: main memory M1 and disk
memory M2.
Assume a page size of four words. The number of page frames in M1 is 3, labeled a, b and c; and the
number of pages in M2 is 10, identified by 0, 1, 2, …, 9. The ith page in M2 consists of word
addresses 4i to 4i + 3 for all i = 0, 1, 2, …, 9.
A certain program generates the following sequence of word addresses which are grouped (underlined)
together if they belong to the same page. The sequence of page numbers so formed is the page trace:
Page tracing experiments are described below for three page replacement policies: LRU, OPT, and
FIFO, respectively. The successive pages loaded in the page frames (PFs) form the trace entries.
Initially, all PFs are empty.
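Since the original trace and trace tables are not reproduced in these notes, the sketch below runs the three policies on an assumed page trace with three frames; it is meant only to show how the fault counts for LRU, OPT, and FIFO would be obtained.

    def simulate(trace, n_frames, policy):
        resident, faults = [], 0             # 'resident' is ordered oldest/least-recent first
        for t, page in enumerate(trace):
            if page in resident:
                if policy == "LRU":          # a hit refreshes recency under LRU
                    resident.remove(page)
                    resident.append(page)
                continue
            faults += 1                      # page fault
            if len(resident) == n_frames:    # must choose a victim
                if policy == "OPT":          # victim = longest forward distance
                    future = trace[t + 1:]
                    victim = max(resident, key=lambda p:
                                 future.index(p) if p in future else len(future) + 1)
                else:                        # FIFO and LRU both evict the list head
                    victim = resident[0]
                resident.remove(victim)
            resident.append(page)
        return faults

    trace = [0, 1, 2, 4, 2, 3, 7, 2, 1, 0, 3]          # assumed page trace
    for policy in ("LRU", "OPT", "FIFO"):
        print(policy, simulate(trace, n_frames=3, policy=policy))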