15CS72_ACA_Module2FinalCopy
4. VLIW: Very long instruction word (VLIW) describes a computer processing architecture
in which a software-based language compiler or preprocessor breaks program
instructions down into basic operations that can be performed by the processor in parallel
(that is, at the same time). These operations are put into a very long instruction word,
which the processor can then take apart without further analysis, handing each operation
to an appropriate functional unit. Example: Itanium (3 operations per word). VLIW
processors statically issue multiple operations at each cycle.
8. Symbolic: Symbolic processors are used in Artificial Intelligence and are applied in
many areas, including text retrieval, machine intelligence, expert systems, and so on.
They are sometimes called PROLOG processors, Lisp processors, or symbolic
manipulators.
1. The goal of the processor manufacturers is to lower the CPI using innovative hardware
approaches. The figure below shows the comparison between CISC, RISC and Vector
processor with respect to CPI and Clock Speed.
2. Conventional processors like the Intel Pentium, M68040, VAX/8600, IBM 390 etc. fall
under the CISC architecture. Clock rates of today's CISC processors range up to a few
GHz. The CPI of CISC instructions varies from 1 to 20, so CISC processors sit in the
upper part of the design space.
3. RISC processors include SPARC, Power Series, MIPS, Alpha, ARM etc. The average
CPI of RISC instruction is around one or two clock cycles. Hence it is shown in the lower
part of the design space given below.
4. There are two categories of RISC, i.e. superscalar RISC and scalar RISC. The CPI of
superscalar RISC is even lower than that of scalar RISC.
5. The Vector processors are supercomputers which use multiple functional units for
concurrent scalar and vector operations. The effective CPI of these processors is very
low and is positioned at the lower right corner of the design space as shown in the
figure.
Instruction Pipelines
1. Pipelining is an implementation technique in which the execution of multiple
instructions is overlapped inside the pipeline. The pipeline has four stages, namely
fetch, decode, execute and writeback.
2. Basic definitions associated with the instruction pipeline:
Instruction pipeline cycle — the clock period of the instruction pipeline.
Instruction issue latency — the time (in cycles) required between the issuing of two
adjacent instructions.
Instruction issue rate — the number of instructions issued per cycle, also called the
degree of a superscalar processor.
Simple operation latency — Simple operations make up the vast majority of instructions
executed by a machine, such as integer adds, loads, stores, branches, moves, etc. In
contrast, complex operations are those requiring a longer latency, such as divides,
cache misses, etc. These latencies are measured in number of cycles.
Resource conflicts — This refers to a situation where two or more instructions demand
use of the same functional unit at the same time.
3. A base scalar processor in which one instruction is issued per cycle is shown below.
There is one cycle latency between instruction issues. The pipeline is fully utilized if all
instructions are issued at the rate of one per clock cycle. This is an ideal pipeline where
the effective CPI rating is 1.
4. In an under-pipelined processor with two cycles per instruction issue, the pipeline is
underutilized, as shown in the figure below. The CPI will be 2.
5. Another under-pipelined situation is shown in the figure below. Here the pipeline
cycle time is doubled by combining pipeline stages: fetch and decode are merged into
one stage, and execute and writeback into another. This results in poor pipeline
utilization.
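The effect of these three designs on CPI can be sketched numerically. This is an illustrative calculation only; the function and its parameters are invented for this note, not part of any real processor:

```python
# Hypothetical sketch: effective CPI, measured in base clock cycles, for the
# three pipeline designs described in the text. Assumes an ideal 4-stage
# pipeline (fetch, decode, execute, writeback) with no stalls.

def effective_cpi(cycles_per_issue: int, cycle_time_multiplier: float = 1.0) -> float:
    """An instruction issues every `cycles_per_issue` pipeline cycles, and
    each pipeline cycle lasts `cycle_time_multiplier` base clock cycles."""
    return cycles_per_issue * cycle_time_multiplier

base = effective_cpi(1)             # base scalar: one issue per cycle -> CPI 1
under_issue = effective_cpi(2)      # issue every two cycles -> CPI 2
under_stage = effective_cpi(1, 2.0) # merged stages double the cycle time -> CPI 2
```

Both under-pipelined variants end up with the same effective CPI of 2 in base cycles, which is why the figures show equally poor utilization for each.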
Instruction set size and format — CISC: large set of instructions with variable formats
(16-64 bits per instruction); RISC: small set of instructions with fixed format (32 bits
per instruction).
General-purpose registers and cache design — CISC: 8-24 GPRs with a unified cache;
RISC: 32-192 GPRs with a split cache.
CPI — CISC: between 2 and 16; RISC: average CPI < 1.5.
A CISC scalar processor may have an integer unit, a floating-point unit, or even multiple such
units, and it also supports pipelining. Some early representative CISC scalar processors are
the VAX 8600, Motorola MC68040, Intel i486, etc.
4.2.1 Superscalar processors: Multiple instructions are issued per cycle and multiple results
are generated per cycle. In superscalar processors it is possible to exploit instruction level
parallelism by executing the independent instructions in parallel without causing a wait state.
Pipelining in superscalar processors
The degree of superscalar processor is equal to the number of instructions issued per cycle. A
superscalar processor of degree m can issue m instructions per cycle.
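As a quick illustration of the degree-m idea (the function names and figures here are hypothetical, not taken from a real machine):

```python
# Ideal throughput of a degree-m superscalar processor: m instructions issue
# per cycle, so each instruction effectively costs 1/m cycles.

def ideal_cpi(degree_m: int) -> float:
    return 1.0 / degree_m

def time_to_execute(n_instructions: int, degree_m: int, clock_ghz: float) -> float:
    """Execution time in seconds for a program of n_instructions, assuming
    every cycle issues the full m instructions (no stalls or dependences)."""
    cycles = n_instructions * ideal_cpi(degree_m)
    return cycles / (clock_ghz * 1e9)
```

In practice, data dependences and resource conflicts keep the achieved CPI above this ideal 1/m.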
1. The instructions of the program are broken down into some operations which can be
executed simultaneously by multiple functional units and a very long instruction word is
formed.
2. Each VLIW instruction word ranges from say 256 to 1024 bits and there are multiple
functional units which share a common register file.
3. The VLIW instruction word is formed by the compiler, which can predict branch
outcomes using elaborate heuristics or run-time statistics. This is called code
compaction.
As shown in the figure above, a single instruction word carries 3 operations, so an effective
CPI of 0.33 is possible in this example.
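The CPI arithmetic can be checked with a one-line sketch (illustrative only):

```python
# If one long instruction word issues per cycle and each word packs
# ops_per_word independent operations, the effective CPI (cycles per
# operation) is 1 / ops_per_word.

def vliw_effective_cpi(ops_per_word: int) -> float:
    return 1.0 / ops_per_word

cpi = vliw_effective_cpi(3)  # the 3-operation word from the figure -> 1/3
```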
Vector Instructions
Some register-based vector operations, found in register-to-register vector processors,
are listed below. A vector register of length n is denoted Vi, a scalar register Si, a
memory array of length n M(1:n), and an operation by a small "o".
A reduction is an operation on one or two vector operands whose result is a scalar, such
as the dot product of two vectors or the maximum of all components in a vector.
Memory-based vector operations are found in memory-to-memory vector processors
such as those in the early supercomputer CDC Cyber 205. Listed below are a few
examples:
Here M1(1:n) and M2(1:n) are two vectors of length n and M(k) is a scalar quantity
stored in memory location k.
Vector pipeline can be attached to any scalar or superscalar processor. The pipeline for
scalar and vector execution is shown below
In scalar pipeline execution, a single operation is performed on a single data element.
In vector pipeline execution, the same operation is performed on each data element of
the vector.
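The scalar-versus-vector contrast above can be sketched in plain Python; no real vector hardware is modeled, and the function names are invented:

```python
# Scalar execution: one issued instruction performs one operation on one
# data element. A vector instruction names whole registers (V1 o V2 -> V3)
# and applies the same operation across every element pair.

def scalar_add(a, b):
    # one operation per instruction
    return a + b

def vector_add(V1, V2):
    # one vector instruction's worth of work: elementwise V1 + V2 -> V3
    return [x + y for x, y in zip(V1, V2)]

V3 = vector_add([1, 2, 3], [4, 5, 6])  # -> [5, 7, 9]
```

On real vector hardware the elementwise additions stream through a pipelined functional unit rather than looping in software, which is where the speedup comes from.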
Symbolic Processors
Symbolic processors are used in many areas such as pattern recognition, expert systems,
knowledge engineering, text retrieval, machine intelligence, etc. They are also known as
PROLOG processors, Lisp processors or symbolic manipulators.
Characteristics of Symbolic Processing
Figure 4.19 shows the memory reference patterns of three running programs or three software
processes. As a function of time, the virtual address space (identified by page numbers) is
clustered into regions due to the locality of references. The subset of addresses (or pages)
referenced within a given time window (t, t+Δt) is called the working set.
4.3.3 Memory Capacity Planning
Hit Ratio
Consider memory levels Mi and Mi-1 in a hierarchy, i = 1, 2, ..., n. The hit ratio hi at Mi is
the probability that an information item will be found in Mi. The miss ratio at Mi is defined
as 1 - hi.
The CPU first searches for data at the innermost level M1 and, on successive misses,
proceeds outward toward Mn. The access frequency to Mi is defined as
fi=(1-h1)(1-h2)...(1-hi-1)hi
This is indeed the probability of successfully accessing Mi when there are i-1 misses at the
lower levels and a hit at Mi. Note that
f1 + f2 + ... + fn = 1, and f1 = h1
Due to the locality property, the access frequencies decrease very rapidly from low to high
levels; that is, f1 > f2 > f3 > ... > fn. This implies that the inner levels of memory are
accessed more often than the outer levels.
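The access-frequency formula can be verified numerically. The hit ratios below are invented for illustration; the outermost level is given h = 1 so that every access eventually hits somewhere:

```python
# Compute fi = (1-h1)(1-h2)...(1-h_{i-1}) * hi for each level, given the
# per-level hit ratios. The frequencies must sum to 1 and f1 must equal h1.

def access_frequencies(hit_ratios):
    freqs = []
    miss_prod = 1.0           # probability of missing at all levels seen so far
    for h in hit_ratios:
        freqs.append(miss_prod * h)
        miss_prod *= (1.0 - h)
    return freqs

f = access_frequencies([0.9, 0.8, 1.0])
# f1 = 0.9, f2 = 0.1*0.8 = 0.08, f3 = 0.1*0.2*1.0 = 0.02; sum = 1
```

Note how sharply the frequencies fall off from inner to outer levels, matching f1 > f2 > ... > fn.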
Hierarchy Optimization
The total cost of a memory hierarchy is estimated as follows
Ctotal = c1·s1 + c2·s2 + ... + cn·sn
Also ti-1 < ti, si-1 < si, ci-1 > ci, bi-1 > bi and xi-1 < xi for i = 1, 2, 3, 4 in the hierarchy. The
optimal design of a memory hierarchy should have an effective access time Teff close to t1 of
M1 and a total cost close to the cost of Mn.
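A small sketch of this cost and access-time trade-off, using the standard effective-access-time expression Teff = f1·t1 + f2·t2 + ... + fn·tn; all device figures below are made-up assumptions, not real hardware data:

```python
# Ctotal = sum of ci*si (cost per unit times capacity, per level).
# Teff = sum of fi*ti (access frequency times access time, per level).
# Two invented levels: a small fast cache and a large slow main memory.

def total_cost(costs_per_kb, sizes_kb):
    return sum(c * s for c, s in zip(costs_per_kb, sizes_kb))

def effective_access_time(freqs, times_ns):
    return sum(f * t for f, t in zip(freqs, times_ns))

Ctotal = total_cost([0.1, 0.001], [512, 1_000_000])  # $/KB x KB, illustrative
Teff = effective_access_time([0.95, 0.05], [1, 50])  # ns; f1 = h1 = 0.95
```

With a 95% hit ratio at the fast level, Teff stays close to the 1 ns cache time even though most of the capacity (and most of the cost) lives in the slow, cheap outer level.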
Address Space
Each word in physical memory has a unique physical address. Virtual addresses are those used
by machine instructions making up an executable program.
The virtual addresses must be translated into physical addresses at run time. A system of
translation tables and mapping functions is used in this process. Hence we have a physical
address space and a virtual address space.
Address Mapping
Let V be the set of virtual addresses generated by a program running on a processor. Let M be
the set of physical addresses allocated to run this program. A virtual memory system demands
an automatic mechanism to implement the following mapping:
ft: V → M ∪ {Φ}
This mapping is a function of time, varying from moment to moment because physical memory
is dynamically allocated and deallocated. For any virtual address v, the mapping is formally
defined as follows:
ft(v) = m, if m ∈ M has been allocated to store the data identified by virtual address v
ft(v) = Φ, if the data for v is missing from M
Thus ft(v) translates the virtual address v into a physical address m if there is a memory hit,
and returns Φ if there is a miss.
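The mapping ft can be modeled with a dictionary, where None stands in for Φ (a miss that would trigger a page fault); the addresses below are arbitrary examples:

```python
# A toy model of ft: V -> M ∪ {Φ}. The dictionary plays the role of the
# translation tables; addresses are invented for illustration.

allocated = {0x1000: 0x8000, 0x2000: 0xA000}  # virtual -> physical

def f_t(v):
    # returns the physical address m on a hit, None (i.e. Φ) on a miss
    return allocated.get(v)

hit = f_t(0x1000)   # -> 0x8000
miss = f_t(0x3000)  # -> None: the OS must allocate memory and update ft
```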
Paged Memory
1. Paging is a technique of partitioning both physical memory and virtual memory into
fixed-size blocks.
2. Physical memory => Frames
3. Virtual memory => Pages
4. A page table consists of frame numbers indexed by page numbers.
5. Paging lowers performance somewhat, since each memory reference requires an extra page-table lookup for address translation.
6. It results in internal fragmentation but no external fragmentation.
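A minimal paged-translation sketch, assuming 4 KB pages and a page table stored as a plain list (all values are illustrative):

```python
# Split a virtual address into (page number, offset), look up the frame in a
# page table indexed by page number, and rebuild the physical address.

PAGE_SIZE = 4096  # assumed page size

def translate(vaddr, page_table):
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table[page]            # page table: page number -> frame number
    return frame * PAGE_SIZE + offset   # offset is unchanged within the frame

page_table = [3, 7, 1]  # page 0 -> frame 3, page 1 -> frame 7, page 2 -> frame 1
paddr = translate(4100, page_table)  # page 1, offset 4 -> 7*4096 + 4 = 28676
```

Because every page is the same size, only the high-order bits change during translation; the offset passes straight through.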
Segmented Memory
1. Used for logical structuring of a program.
2. Unlike pages, segments are of varied sizes.
3. Segmented memory is arranged as 2-D address space.
4. Virtual address has two parts: Segment number and an offset.
5. The offset addresses within each segment form 1-D contiguous addresses.
6. The segment numbers, which are not necessarily contiguous, form the second dimension.
7. Only external fragmentation, no internal fragmentation.
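A corresponding segmented-translation sketch, with invented segment-table entries of the form (base, limit); unlike paging, each segment has its own size, so the offset must be bounds-checked:

```python
# A segmented virtual address is (segment number, offset). Each segment
# table entry records where the segment starts (base) and how long it is
# (limit). Entries are illustrative; segment numbers need not be contiguous.

segment_table = {0: (0, 1000), 2: (5000, 400)}  # seg -> (base, limit)

def translate_segment(seg, offset):
    base, limit = segment_table[seg]
    if offset >= limit:
        raise ValueError("offset beyond segment limit")
    return base + offset

paddr = translate_segment(2, 100)  # -> 5000 + 100 = 5100
```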
Paged Segments
Inverted Paging
1. Direct paging works well for small address spaces, such as 32-bit ones.
2. A large virtual address space demands either large page tables or multilevel paging,
which slows down performance.
3. An inverted page table is created containing information about the frames in the physical
memory.
4. A single inverted page table is shared by all the processes.
5. The size of the inverted page table is governed by the size of the physical memory.
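A toy inverted-page-table lookup: each entry records which (process id, page number) occupies a physical frame, and the index of the matching entry is itself the frame number (entries invented for illustration):

```python
# One entry per physical frame, so the table's size tracks physical memory,
# not the virtual address space. Translation searches for the matching
# (pid, virtual page number) pair.

ipt = [(1, 0), (2, 5), (1, 3)]  # frame i is occupied by (pid, vpn)

def lookup(pid, vpn):
    for frame, entry in enumerate(ipt):
        if entry == (pid, vpn):
            return frame
    return None  # not resident: a page fault

frame = lookup(1, 3)  # -> 2 (process 1's page 3 is in frame 2)
```

A linear search is too slow for real hardware; actual implementations hash the (pid, vpn) pair into the table instead.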