CS6303 Computer Architecture 2
3. Define – ISA
The instruction set architecture, or simply architecture, of a computer is the
interface between the hardware and the lowest-level software. It includes anything
programmers need to know to make a binary machine language program work correctly,
including instructions, I/O devices, and so on.
4. Define – ABI
Typically, the operating system will encapsulate the details of doing I/O,
allocating memory, and other low-level system functions so that application programmers
do not need to worry about such details. The combination of the basic instruction set and
the operating system interface provided for application programmers is called the
application binary interface (ABI).
DSEC/IT/EC6303/CA
CS6303-COMPUTER ARCHITECTURE M.Ramu/AP/IT
11. Write the formula for CPU execution time for a program. (Nov/Dec-2016)
CPU execution time = CPU clock cycles for the program × Clock cycle time
                   = Instruction count × CPI / Clock rate
12. Write the formula for CPU clock cycles required for a program. (Apr/May-2014)
CPU clock cycles = Instructions for the program × Average clock cycles per instruction (CPI)
op | rs | rt | rd | shamt | funct
6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits
Where,
op: Basic operation of the instruction, traditionally called the opcode.
rs: The first register source operand.
rt: The second register source operand.
rd: The register destination operand. It gets the result of the operation.
shamt: Shift amount.
funct: Function.
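The field layout above can be sketched as a bit-packing routine; the register numbers and funct value below are the standard MIPS encodings for add $t0, $s1, $s2, shown for illustration:

```python
# Pack MIPS R-format fields into a 32-bit word.
# Field widths follow the table above: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6).
def encode_r_format(op, rs, rt, rd, shamt, funct):
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2  ->  op=0, rs=$s1(17), rt=$s2(18), rd=$t0(8), shamt=0, funct=32
word = encode_r_format(0, 17, 18, 8, 0, 32)
print(hex(word))  # 0x2324020
```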
15. Write an example for immediate operand. (Nov/Dec-2014)
The quick add instruction with one constant operand is called add immediate or addi.
To add 4 to register $s3, we just write
addi $s3, $s3, 4    # $s3 = $s3 + 4
1. Add 6₁₀ to 7₁₀ in binary and subtract 6₁₀ from 7₁₀ in binary. (Nov/Dec-2016)
Addition: 0110₂ + 0111₂ = 1101₂ = 13₁₀
Subtraction directly: 0111₂ − 0110₂ = 0001₂ = 1₁₀
Overflow conditions for addition and subtraction:
Operation | Operand A | Operand B | Result indicating overflow
A + B | ≥ 0 | ≥ 0 | < 0
A + B | < 0 | < 0 | ≥ 0
A − B | ≥ 0 | < 0 | < 0
A − B | < 0 | ≥ 0 | ≥ 0
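These sign rules can be checked mechanically; the following sketch assumes 4-bit two's-complement operands (the width is an illustrative assumption, not from the original):

```python
# Sketch: detect two's-complement overflow using the sign rules tabulated above.
# Assumes 4-bit operands (range -8..7); the width is illustrative.
BITS = 4

def add_overflows(a, b):
    # Overflow on A + B only when both operands have the same sign
    # and the (wrapped) result has the opposite sign.
    r = (a + b) & ((1 << BITS) - 1)
    if r >= 1 << (BITS - 1):
        r -= 1 << BITS          # reinterpret the wrapped result as signed
    return (a >= 0) == (b >= 0) and (r >= 0) != (a >= 0)

print(add_overflows(6, 7))    # True: 6 + 7 = 13 exceeds the 4-bit range
print(add_overflows(7, -6))   # False: 7 - 6 = 1 fits
```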
6. Define – ULP
Units in the Last Place is defined as the number of bits in error in the least significant
bits of the significand between the actual number and the number that can be represented.
For example, ARM added more than 100 instructions in the NEON multimedia
instruction extension to support sub-word parallelism, which can be used either with
ARMv7 or ARMv8.
4. What are the two state elements needed to store and access an instruction?
Two state elements are needed to store and access instructions, and an adder is needed to
compute the next instruction address. The state elements are the instruction memory and
the program counter.
A portion of the data path is used for fetching instructions and incrementing the
program counter. The fetched instruction is used by other parts of the data path.
10. What are the three instruction classes and their instruction formats?
The three instruction classes (R-type, load and store, and branch) use two
different instruction formats.
UNIT-IV PARALLELISM
6. Define – VLIW
Very Long Instruction Word (VLIW) is a style of instruction set architecture that
launches many operations that are defined to be independent in a single wide instruction,
typically with many separate opcode fields.
In a DRAM, the value kept in a cell is stored as a charge in a capacitor. A single
transistor is then used to access this stored charge, either to read the value or to
overwrite the charge stored there. Because DRAMs use only a single transistor per bit of
storage, they are much denser and cheaper per bit than SRAM. As DRAMs store the charge on
a capacitor, it cannot be kept indefinitely and must periodically be refreshed.
8. Consider a cache with 64 blocks and a block size of 16 bytes. To what block number
does byte address 1200 map?
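The mapping can be computed directly; a worked sketch of the standard formula (block address modulo number of blocks):

```python
# Worked answer: byte address -> cache block for a direct-mapped cache.
block_size = 16        # bytes per block
num_blocks = 64

byte_address = 1200
block_address = byte_address // block_size   # 1200 / 16 = 75
cache_block = block_address % num_blocks     # 75 mod 64 = 11
print(block_address, cache_block)  # 75 11
```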
9. How many total bits are required for a direct-mapped cache with 16 KiB
of data and 4-word blocks, assuming a 32-bit address?
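A worked sketch of the standard calculation (data bits plus tag bits plus a valid bit per cache entry):

```python
# Worked answer: total bits in a 16 KiB direct-mapped cache with 4-word blocks.
address_bits = 32
word_bits = 32
block_words = 4
data_bytes = 16 * 1024

num_blocks = data_bytes // (block_words * 4)          # 1024 blocks
index_bits = num_blocks.bit_length() - 1              # 10 index bits
offset_bits = (block_words * 4).bit_length() - 1      # 4 (2 word + 2 byte offset)
tag_bits = address_bits - index_bits - offset_bits    # 32 - 10 - 4 = 18
entry_bits = block_words * word_bits + tag_bits + 1   # data + tag + valid = 147
total_bits = num_blocks * entry_bits                  # 150528 bits = 147 Kibit
print(total_bits, total_bits // 1024)  # 150528 147
```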
3. Write the cache entry, putting the data from memory in the data portion of
the entry, writing the upper bits of the address (from the ALU) into the tag
field, and turning the valid bit on.
4. Restart the instruction execution at the first step, which will refetch the
instruction, this time finding it in the cache.
13. What are the various block placement schemes in cache memory?
Direct-mapped cache is a cache structure in which each memory location is
mapped to exactly one location in the cache.
Fully associative cache is a cache structure in which a block can be placed in any
location in the cache.
Set-associative cache is a cache that has a fixed number of locations (at least two)
where each block can be placed.
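As an illustrative sketch (the block number, cache size, and associativity below are assumed, not from the original), the candidate locations for memory block 12 in an 8-block cache under each scheme:

```python
# Candidate cache lines for one memory block under the three placement schemes.
block, cache_blocks, ways = 12, 8, 2

direct = [block % cache_blocks]                      # exactly one line: 12 mod 8 = 4
num_sets = cache_blocks // ways                      # 4 sets of 2 lines each
s = block % num_sets                                 # set 12 mod 4 = 0
set_assoc = [s * ways + w for w in range(ways)]      # any line in set 0: lines 0 and 1
fully = list(range(cache_blocks))                    # any line at all
print(direct, set_assoc, fully)
```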
The internal architectural design of computers differs from one system model to another.
However, the basic organization remains the same for all computer systems. The following
five units (also called "The functional units") correspond to the five basic operations
performed by all computer systems.
Input Unit
Data and instructions must enter the computer system before any computation can be
performed on the supplied data. The input unit that links the external environment with the
computer system performs this task. Data and instructions enter input units in forms that
depend upon the particular device used. For example, data is entered from a keyboard in a
manner similar to typing, and this differs from the way in which data is entered through a
mouse, which is another type of input device. However, regardless of the form in which they
receive their inputs, all input devices must provide a computer with data that are transformed
into the binary codes that the primary memory of the computer is designed to accept. This
transformation is accomplished by units called input interfaces. Input interfaces are
designed to match the unique physical or electrical characteristics of input devices to the
requirements of the computer system.
1. It accepts (or reads) the list of instructions and data from the outside world.
2. It converts these instructions and data into computer-acceptable format.
3. It supplies the converted instructions and data to the computer system for further
processing.
Output Unit
The job of an output unit is just the reverse of that of an input unit. It supplies
information and results of computation to the outside world. Thus it links the computer
with the external environment. As computers work with binary code, the results produced
are also in binary form. Hence, before supplying the results to the outside world, they
must be converted to human-acceptable (readable) form. This task is accomplished by units
called output interfaces.
1. It accepts the results produced by the computer, which are in coded form and hence
cannot be easily understood by us.
2. It converts these coded results to human-acceptable (readable) form.
3. It supplies the converted results to the outside world.
Storage Unit
The data and instructions that are entered into the computer system through input units have
to be stored inside the computer before the actual processing starts. Similarly, the results
produced by the computer after processing must also be kept somewhere inside the computer
system before being passed on to the output units. Moreover, the intermediate results
produced by the computer must also be preserved for ongoing processing. The Storage Unit
or the primary / main storage of a computer system is designed to do all these things. It
provides space for storing data and instructions, space for intermediate results and also space
for the final results.
1. All the data to be processed and the instructions required for processing (received
from input devices).
2. Intermediate results of processing.
3. Final results of processing before these results are released to an output device.
The main unit inside the computer is the CPU. This unit is responsible for all events
inside the computer. It controls all internal and external devices and performs arithmetic
and logical operations. The operations a microprocessor performs are called the
instruction set of this processor. The instruction set is "hard-wired" in the CPU and
determines the machine language for the CPU. The more complicated the instruction set is,
the slower the CPU works. Processors differ from one another in their instruction sets. If
the same program can run on two different computer brands, they are said to be compatible.
Programs written for IBM-compatible computers will not run on Apple computers because
these two architectures are not compatible.
The control Unit and the Arithmetic and Logic unit of a computer system are jointly known
as the Central Processing Unit (CPU). The CPU is the brain of any computer system. In a
human body, all major decisions are taken by the brain and the other parts of the body
function as directed by the brain. Similarly, in a computer system, all major calculations and
comparisons are made inside the CPU and the CPU is also responsible for activating and
controlling the operations of other units of a computer system.
The arithmetic and logic unit (ALU) of a computer system is the place where the actual
execution of the instructions takes place during processing operations. All calculations
are performed and all comparisons (decisions) are made in the ALU. The data and
instructions stored in the primary storage prior to processing are transferred as and when
needed to the ALU, where processing takes place. No processing is done in the primary
storage unit. Intermediate results generated in the ALU are temporarily transferred back
to the primary storage until needed at a later time. Data may thus move from primary
storage to the ALU and back again many times before the processing is over. After the
completion of processing, the final results, which are stored in the storage unit, are
released to an output device.
The arithmetic and logic unit (ALU) is the part where actual computations take place. It
consists of circuits that perform arithmetic operations (e.g., addition, subtraction,
multiplication, division) on data received from memory, and that are capable of comparing
numbers (less than, equal to, or greater than).
While performing these operations, the ALU takes data from temporary storage areas inside
the CPU named registers. Registers are a group of cells used for memory addressing, data
manipulation and processing. Some of the registers are general purpose and some are
reserved for certain functions. A register is a high-speed memory which holds only data
for immediate processing and the results of this processing. If these results are not
needed for the next instruction, they are sent back to the main memory and the registers
are occupied by the new data used in the next instruction.
All activities in the computer system are composed of thousands of individual steps. These
steps follow in some order at fixed intervals of time. These intervals are generated by
the Clock Unit. Every operation within the CPU takes place at the clock pulse. No
operation, regardless of how simple, can be performed in less time than transpires between
ticks of this clock, and some operations require more than one clock pulse. The faster the
clock runs, the faster the computer performs. The clock rate is measured in megahertz
(MHz) or gigahertz (GHz). Larger systems are even faster. In older systems the clock unit
is external to the microprocessor and resides on a separate chip. In most modern
microprocessors the clock is usually incorporated within the CPU.
Control Unit
How does the input device know that it is time to feed data into the storage unit? How
does the ALU know what should be done with the data once they are received? And how is it
that only the final results are sent to the output devices and not the intermediate
results? All this is possible because of the control unit of the computer system. By
selecting, interpreting, and seeing to the execution of the program instructions, the
control unit is able to maintain order and direct the operation of the entire system.
Although it does not perform any actual processing on the data, the control unit acts as a
central nervous system for the other
components of the computer. It manages and coordinates the entire computer system. It
obtains instructions from the program stored in main memory, interprets the instructions, and
issues signals that cause other units of the system to execute them.
The control unit directs and controls the activities of the internal and external devices.
It interprets the instructions fetched into the computer, determines what data, if any,
are needed, where they are stored, and where to store the results of the operation, and
sends the control signals to the devices involved in the execution of the instructions.
Addressing Modes
The term addressing modes refers to the way in which the operand of an instruction
is specified. Information contained in the instruction code is the value of the operand or the
address of the result/operand. Following are the main addressing modes that are used on
various platforms and architectures.
1) Immediate Mode
The operand value is contained explicitly in the instruction code itself.
2) Index Mode
The address of the operand is obtained by adding a constant value to the contents of a
general register (called the index register). The number of the index register and the
constant value are included in the instruction code. Index mode is used to access an array
whose elements are in successive memory locations: the constant in the instruction code
represents the starting address of the array, and the value of the index register
represents the index of the current element. By incrementing or decrementing the index
register, different elements of the array can be accessed.
3) Indirect Mode
The effective address of the operand is the contents of a register or main memory
location whose address appears in the instruction. Indirection is noted by placing the
name of the register or the memory address given in the instruction in parentheses. The
register or memory location that contains the address of the operand is a pointer. When an
execution takes place in such a mode, the instruction may be told to go to a specific
address. Once it's there, instead of finding an operand, it finds an address where the
operand is located.
5) Register Mode
The name (the number) of the CPU register is embedded in the instruction. The register
contains the value of the operand. The number of bits used to specify the register depends
on the total number of registers in the processor.
6) Displacement Mode
Similar to index mode, except that instead of an index register a base register is used.
The base register contains a pointer to a memory location, and an integer (constant)
called the displacement is added to it. The address of the operand is obtained by adding
the contents of the base register to the constant. The difference between index mode and
displacement mode lies in the number of bits used to represent the constant: when the
constant is represented with enough bits to address all of memory, we have index mode.
Index mode is more appropriate for array accessing; displacement mode is more appropriate
for structure (record) accessing.
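The effective-address rules above can be sketched as follows; the register numbers and memory contents are invented for illustration:

```python
# Sketch of effective-address computation for the addressing modes above.
# Register numbers, memory contents, and sizes are illustrative assumptions.
memory = {100: 42, 104: 7, 200: 100}
regs = {1: 100, 2: 4}   # R1 = base/pointer, R2 = index

def ea_index(start_const, index_reg):
    # Index mode: constant (array start) + contents of the index register.
    return start_const + regs[index_reg]

def ea_indirect(pointer_addr):
    # Indirect mode: the named location holds the operand's address.
    return memory[pointer_addr]

def ea_displacement(base_reg, disp):
    # Displacement mode: base register contents + constant displacement.
    return regs[base_reg] + disp

print(ea_index(100, 2))        # 104 -> operand is memory[104]
print(ea_indirect(200))        # 100 -> operand is memory[100]
print(ea_displacement(1, 4))   # 104
```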
A special case of indirect register mode: the register, whose number is included in the
instruction code, contains the address of the operand. Autoincrement mode: after operand
addressing, the contents of the register are incremented. Autodecrement mode: before
operand addressing, the contents of the register are decremented.
The Instruction Set Architecture (ISA) is the part of the processor that is visible to the
programmer or compiler writer. The ISA serves as the boundary between software and
hardware. We will briefly describe the instruction sets found in many of the microprocessors
used today. The ISA of a processor can be described using 5 categories:
Operand location
Can any ALU instruction operand be located in memory? Or must all operands be
kept internally in the CPU?
Operations
What is the type and size of each operand and how is it specified?
The statement C = A + B; compiles in all 3 architectures as:
Stack: PUSH A; PUSH B; ADD; POP C
Accumulator: LOAD A; ADD B; STORE C
GPR: LOAD R1,A; ADD R1,B; STORE C,R1
Not all processors can be neatly fitted into one of the above categories. The i8086 has
many instructions that use implicit operands although it has a general register set. The
i8051 is another example: it has 4 banks of GPRs but most instructions must have the A
register as one of their operands.
What are the advantages and disadvantages of each of these approaches?
Stack
Accumulator
GPR
Advantages: Makes code generation easy. Data can be stored for long periods in registers.
Disadvantages: All operands must be named leading to longer instructions.
Earlier CPUs were of the first 2 types, but in the last 15 years all CPUs made are GPR
processors. The 2 major reasons are that registers are faster than memory, and the more
data that can be kept internally in the CPU the faster the program will run. The other
reason is that registers are easier for a compiler to use.
As we mentioned before, most modern CPUs are of the GPR (General Purpose Register) type.
A few examples of such CPUs are the IBM 360, DEC VAX, Intel 80x86 and Motorola 68xxx. But
while these CPUs were clearly better than previous stack- and accumulator-based CPUs,
they were still lacking in several areas:
1. Instructions were of varying length from 1 byte to 6-8 bytes. This causes problems
with the pre-fetching and pipelining of instructions.
2. ALU (Arithmetic Logical Unit) instructions could have operands that were memory
locations. Because the number of cycles it takes to access memory varies, so does the
time for the whole instruction. This isn't good for compiler writers, pipelining and
multiple issue.
3. Most ALU instructions had only 2 operands, where one of the operands is also the
destination. This means this operand is destroyed during the operation, or it must be
saved somewhere beforehand.
Thus in the early 80's the idea of RISC was introduced. The RISC project (which led to
SPARC) was started at Berkeley and the MIPS project at Stanford. RISC stands for Reduced
Instruction Set Computer. The ISA is composed of instructions that all have exactly the
same size, usually 32 bits. Thus they can be pre-fetched and pipelined successfully. All
ALU instructions have 3 operands which are only registers. The only memory access is
through explicit LOAD/STORE instructions.
Thus C = A + B will be assembled as:
LOAD R1,A
LOAD R2,B
ADD R3,R1,R2
STORE C,R3
7. Hierarchy of memories
Performance:
Response time or execution time: The total time required for the computer to complete a
task, including disk accesses, memory accesses, I/O activities, and operating system
overhead.
Bandwidth: The amount of data that can be carried from one point to another in a given
time period (usually a second). This kind of bandwidth is usually expressed in bits (of
data) per second (bps). Occasionally, it's expressed as bytes per second (Bps).
Clock cycles per instruction (CPI): Average number of clock cycles per instruction for a
program or program fragment.
Multiplication:
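The figures for this algorithm are not reproduced here; as a sketch, the sequential shift-and-add multiplication algorithm can be expressed as follows (unsigned operands assumed, width illustrative):

```python
# Sketch of the sequential shift-and-add multiplication algorithm:
# test the multiplier's LSB, conditionally add, then shift both operands.
def shift_add_multiply(multiplicand, multiplier, bits=32):
    product = 0
    for _ in range(bits):
        if multiplier & 1:               # multiplier bit is 1:
            product += multiplicand      #   add multiplicand into the product
        multiplicand <<= 1               # shift multiplicand left one bit
        multiplier >>= 1                 # shift multiplier right one bit
    return product

print(shift_add_multiply(6, 7))  # 42
```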
Addition:
Subtraction:
3. Discuss in detail the division algorithm, with diagram and control lines.
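The diagrams and control lines are not reproduced here; assuming the restoring division algorithm is intended, a minimal sketch (unsigned operands, width illustrative):

```python
# Sketch of restoring division: trial-subtract the aligned divisor;
# on failure the remainder is "restored" (left unchanged).
def restoring_divide(dividend, divisor, bits=32):
    remainder, quotient = dividend, 0
    for i in range(bits - 1, -1, -1):
        shifted = divisor << i               # align divisor with bit i
        if remainder >= shifted:             # trial subtraction succeeds
            remainder -= shifted
            quotient |= 1 << i               # set quotient bit i
        # otherwise restore: remainder stays as it was
    return quotient, remainder

print(restoring_divide(74, 8))  # (9, 2)
```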
4. Explain in detail the principle of the carry-lookahead adder. Show how
16-bit CLAs can be constructed from 4-bit adders. (2014/2016/2017)
A carry-lookahead adder (CLA) or fast adder is a type of adder used in digital
logic. A carry-lookahead adder improves speed by reducing the amount of time required to
determine carry bits. It can be contrasted with the simpler, but usually slower,
ripple-carry adder, for which the carry bit is calculated alongside the sum bit, and each
bit must wait until the previous carry bit has been calculated to begin calculating its
own result and carry bits (see adder for detail on ripple-carry adders). The
carry-lookahead adder calculates one or more carry bits before the sum, which reduces the
wait time to calculate the result of the larger-value bits of the adder.
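A minimal sketch of the lookahead idea, using per-bit generate (g = a AND b) and propagate (p = a XOR b), with four 4-bit blocks chained by group generate/propagate as the question asks:

```python
# Sketch: 16-bit carry-lookahead built from four 4-bit blocks.
# Per-bit: g = a AND b (generate), p = a XOR b (propagate),
# carry recurrence c[i+1] = g[i] OR (p[i] AND c[i]).
def bit(x, i):
    return (x >> i) & 1

def cla_add16(a, b, c0=0):
    g = [bit(a, i) & bit(b, i) for i in range(16)]   # generate bits
    p = [bit(a, i) ^ bit(b, i) for i in range(16)]   # propagate bits
    carry = c0
    total = 0
    for blk in range(4):                             # four 4-bit CLA blocks
        base = 4 * blk
        c = carry
        for i in range(base, base + 4):
            total |= (p[i] ^ c) << i                 # sum bit = p XOR carry-in
            c = g[i] | (p[i] & c)                    # lookahead recurrence
        # group generate/propagate give the block's carry-out in one step
        G = g[base+3] | (p[base+3] & (g[base+2] | (p[base+2] & (g[base+1] | (p[base+1] & g[base])))))
        P = p[base] & p[base+1] & p[base+2] & p[base+3]
        carry = G | (P & carry)                      # carry into the next block
    return total & 0xFFFF

print(hex(cla_add16(0x1234, 0x0FFF)))  # 0x2233
```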
UNIT-III PROCESSOR AND CONTROL
1. Explain in detail the basic MIPS implementation with necessary
multiplexers and control lines.
2. Explain how the instruction pipeline works. What are the various situations where
an instruction pipeline can stall? Illustrate with an example. (Nov/2009/2014/2015)
3. What are hazards? Explain the methods for dealing with data
hazards. (2016/2017)
UNIT-IV PARALLELISM
Classifications
SISD (single instruction stream, single data stream): A sequential computer which exploits
no parallelism in either the instruction or data streams. A single control unit (CU)
fetches a single instruction stream (IS) from memory. The CU then generates appropriate
control signals to direct a single processing element (PE) to operate on a single data
stream (DS), i.e., one operation at a time.
Examples of SISD architecture are the traditional uniprocessor machines like older
personal computers (PCs; by 2010, many PCs had multiple cores) and mainframe computers.
MISD (multiple instruction streams, single data stream): Multiple instructions operate on
one data stream. This is an uncommon architecture which is generally used for fault
tolerance. Heterogeneous systems operate on the same data stream and must agree on the
result. Examples include the Space Shuttle flight control computer[4].
Single instruction, multiple threads (SIMT) is an execution model used in parallel
computing where single instruction, multiple data (SIMD) is combined with multithreading.
This is not originally part of Flynn's taxonomy but a proposed addition.
These four architectures are shown below visually. Each processing unit (PU) is shown for
a uni-core or multi-core computer:
2. Draw a neat sketch of the memory hierarchy and explain the need for cache
memory. (2014/2015)
1) Internal register: Internal register in a CPU is used for holding variables and temporary
results. Internal registers have a very small storage; however they can be accessed instantly.
Accessing data from the internal register is the fastest way to access memory.
2) Cache: Cache is used by the CPU for memory which is being accessed over and over
again. Instead of pulling it every time from the main memory, it is put in cache for fast
access. It is also a smaller memory, however, larger than internal register.
4) Hard disk: A hard disk is a hardware component in a computer. Data is kept permanently
in this memory. Memory from hard disk is not directly accessed by the CPU, hence it is
slower. As compared with RAM, hard disk is cheaper per bit.
5) Magnetic tape: Magnetic tape memory is usually used for backing up large data. When
the system needs to access a tape, it is first mounted to access the data. When the data is
accessed, it is then unmounted. The memory access time is slower for magnetic tape, and it
usually takes a few minutes to access a tape.
3. Explain various mechanisms of mapping main memory addresses into cache memory
addresses. (2014/2015)
1. Direct
2. Associative
3. Set Associative.
DIRECT MAPPING
The simplest technique, known as direct mapping, maps each block of main memory
into only one possible cache line. The mapping is expressed as
i = j modulo m
where
i = cache line number
j = main memory block number
m = number of lines in the cache
Figure 1 (a) shows the mapping for the first m blocks of main memory. Each block of
main memory maps into one unique line of the cache.
The next m blocks of main memory map into the cache in the same fashion; that is,
block Bm of main memory maps into line L0 of the cache, block Bm+1 maps into line
L1, and so on.
The mapping function is easily implemented using the main memory address. Figure
2 illustrates the general mechanism. For purposes of cache access, each main memory
address can be viewed as consisting of three fields.
The least significant w bits identify a unique word or byte within a block of main
memory; in most contemporary machines, the address is at the byte level. The
remaining s bits specify one of the 2^s blocks of main memory. The cache logic
interprets these s bits as a tag of s – r bits (most significant portion) and a line field
of r bits. This latter field identifies one of the m = 2^r lines of the cache. To
summarize,
The effect of this mapping is that blocks of main memory are assigned to lines of the cache
as follows:
Thus, the use of a portion of the address as a line number provides a unique mapping
of each block of main memory into the cache.
When a block is actually read into its assigned line, it is necessary to tag the data to
distinguish it from other blocks that can fit into that line. The most significant s - r
bits serve this purpose.
The direct mapping technique is simple and inexpensive to implement. Its main
disadvantage is that there is a fixed cache location for any given block.
Thus, if a program happens to reference words repeatedly from two different blocks
that map into the same line, then the blocks will be continually swapped in the cache,
and the hit ratio will be low (a phenomenon known as thrashing).
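The three-field interpretation can be sketched as follows; the field widths r and w below are assumed for illustration:

```python
# Sketch: interpret a main-memory address as Tag / Line / Word fields
# for a direct-mapped cache (field widths here are illustrative).
r, w = 14, 2               # r line bits (2^14 lines), w word bits (4 words per block)

def split_address(addr):
    word = addr & ((1 << w) - 1)            # least significant w bits
    line = (addr >> w) & ((1 << r) - 1)     # next r bits select the cache line
    tag = addr >> (w + r)                   # most significant s - r bits
    return tag, line, word

addr = 0x2ACCE
tag, line, word = split_address(addr)
# The fields reassemble to the original address:
assert (tag << (r + w)) | (line << w) | word == addr
print(tag, line, word)  # tag=2, line=0x2B33, word=2
```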
ASSOCIATIVE MAPPING
With associative mapping, there is flexibility as to which block to replace when a new
block is read into the cache.
Replacement algorithms, discussed later in this section, are designed to maximize the
hit ratio. The principal disadvantage of associative mapping is the complex circuitry
required to examine the tags of all cache lines in parallel.
SET-ASSOCIATIVE MAPPING
Set-associative mapping is a compromise that exhibits the strengths of both the direct
and associative approaches while reducing their disadvantages.
In this case, the cache consists of a number of sets, each of which consists of a number
of lines. The relationships are
m = v × k
i = j modulo v
where
i = cache set number
j = main memory block number
m = number of lines in the cache
v = number of sets
k = number of lines in each set
For set-associative mapping, the cache control logic interprets a memory address as
three fields: Tag, Set, and Word.
The d set bits specify one of v = 2^d sets. The s bits of the Tag and Set fields specify
one of the 2^s blocks of main memory.
Figure 5 illustrates the cache control logic. With fully associative mapping, the tag in
a memory address is quite large and must be compared to the tag of every line in the
cache. With k-way set-associative mapping, the tag in a memory address is much
smaller and is only compared to the k tags within a single set.
Advantages
If a thread gets a lot of cache misses, the other threads can continue taking advantage
of the unused computing resources, which may lead to faster overall execution as these
resources would have been idle if only a single thread were executed. Also, if a thread cannot
use all the computing resources of the CPU (because instructions depend on each other's
result), running another thread may prevent those resources from becoming idle.
If several threads work on the same set of data, they can actually share their cache, leading to
better cache usage or synchronization on its values.
Disadvantages
Multiple threads can interfere with each other when sharing hardware resources such as
caches or translation lookaside buffers (TLBs). As a result, execution times of a single thread are not
improved but can be degraded, even when only one thread is executing, due to lower frequencies or
additional pipeline stages that are necessary to accommodate thread-switching hardware.
Types of multithreading:
Coarse-grained multithreading
The simplest type of multithreading occurs when one thread runs until it is blocked by an
event that normally would create a long-latency stall. Such a stall might be a cache miss that
has to access off-chip memory, which might take hundreds of CPU cycles for the data to
return. Instead of waiting for the stall to resolve, a threaded processor would switch
execution to another thread that was ready to run. Only when the data for the previous thread
had arrived, would the previous thread be placed back on the list of ready-to-run threads.
For example:
The goal of multithreading hardware support is to allow quick switching between a blocked
thread and another thread ready to run. To achieve this goal, the hardware cost is to replicate
the program visible registers, as well as some processor control registers (such as the
program counter). Switching from one thread to another thread means the hardware switches
from using one register set to another; to switch efficiently between active threads, each
active thread needs to have its own register set. For example, to quickly switch between two
threads, the register hardware needs to be instantiated twice.
Additional hardware support for multithreading allows thread switching to be done in one
CPU cycle, bringing performance improvements. Also, additional hardware allows each
thread to behave as if it were executing alone and not sharing any hardware resources with
other threads, minimizing the amount of software changes needed within the application and
the operating system to support multithreading.
Many families of microcontrollers and embedded processors have multiple register banks to
allow quick context switching for interrupts. Such schemes can be considered a type of block
multithreading among the user program thread and the interrupt threads.
Interleaved multithreading
For example:
This type of multithreading was first called barrel processing, in which the staves of a barrel
represent the pipeline stages and their executing threads. Interleaved, preemptive,
fine-grained or time-sliced multithreading are more modern terminology.
In addition to the hardware costs discussed in the block type of multithreading, interleaved
multithreading has an additional cost of each pipeline stage tracking the thread ID of the
instruction it is processing. Also, since there are more threads being executed concurrently in
the pipeline, shared resources such as caches and TLBs need to be larger to avoid thrashing
between the different threads.
Simultaneous multithreading
In simultaneous multithreading (SMT), a superscalar processor issues instructions from more than one thread in the same CPU cycle. For example:
1. Cycle i: instructions j and j + 1 from thread A and instruction k from thread B are
simultaneously issued.
2. Cycle i + 1: instruction j + 2 from thread A, instruction k + 1 from thread B, and instruction m
from thread C are all simultaneously issued.
3. Cycle i + 2: instruction j + 3 from thread A and instructions m + 1 and m + 2 from thread C
are all simultaneously issued.
To distinguish the other types of multithreading from SMT, the term "temporal
multithreading" is used to denote when instructions from only one thread can be issued at a
time.
In addition to the hardware costs discussed for interleaved multithreading, SMT has the
additional cost of each pipeline stage tracking the thread ID of each instruction being
processed. Again, shared resources such as caches and TLBs have to be sized for the large
number of active threads being processed.
UNIT V
1. Explain about Multicore Processors.(2016/2015)
A multi-core CPU is a computer processor that contains two or more cores on a single chip. Each core executes instructions as if it were a separate computer, although the cores all sit on one chip, and each core looks mostly like the others. The cores are largely independent but work together in parallel. A dual-core processor is a multi-core processor with two independent cores; a quad-core processor has four. As the prefix suggests, the name of the processor is based on the number of cores on the chip.
Advantages
Having a multi-core processor in a computer means that it will work faster for programs
that can be split across cores.
The computer may not get as hot when it is turned on.
The computer needs less power because it can turn off cores that are not needed.
More features can be added to the computer.
The signals between the cores travel shorter distances, so they degrade less.
Disadvantages
A dual-core processor does not run twice as fast as a single-core processor; in practice
it gains only about 60-80% more speed.
The speed the computer works at depends on what the user is doing with it.
Multi-core processors cost more than single-core processors.
They are more difficult to manage thermally than lower-density single-core
processors.
Not all operating systems support more than one core.
Operating systems compiled for a multi-core processor will run slightly slower on a
single-core processor.
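The gap between core count and speedup can be illustrated with Amdahl's law (a standard result, not stated in these notes): if a fraction p of a program can run in parallel on n cores, the speedup is 1 / ((1 - p) + p / n). A minimal Python sketch, with a hypothetical parallel fraction:

```python
# Toy illustration of why a dual-core CPU gives less than 2x speedup.
# The 0.8 parallel fraction below is an assumed value for illustration.

def amdahl_speedup(parallel_fraction, cores):
    """Speedup = 1 / ((1 - p) + p / n), per Amdahl's law."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# If 80% of the work can run on both cores in parallel:
print(round(amdahl_speedup(0.8, 2), 2))   # ~1.67, i.e. about 67% faster
```

With a parallel fraction between roughly 0.75 and 0.9, two cores yield the 60-80% gain quoted above.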
2. What is virtual memory? Explain the steps involved in virtual memory address
translation.(2013/2014/2015)
A virtual address is a binary number in virtual memory that enables a process to use a
location in primary storage (main memory) independently of other processes, and to use more
space than actually exists in primary storage by temporarily relegating some contents to a hard
disk or internal flash drive.
In a system with virtual memory the main memory can be viewed as a cache for the disk,
which serves as the lower-level store. Due to the enormous difference between memory
access times and disk access times, a fully associative caching scheme is used. That is, the
entire main memory is a single set: any page can be placed anywhere in main memory. This
makes the set field of the address vanish. All that remains is a tag and an offset.
Since the tag field just identifies a page it is usually called the page number field.
Logical Addresses
With a virtual memory system, the main memory can be viewed as a local store for a
cache level whose lower level is a disk. Since it is fully associative there is no need for a set
field. The address just decomposes into an offset field and a page number field. The number
of bits in the offset field is determined by the page size. The remaining bits are the page
number.
An Example
A computer uses 32-bit byte addressing. The computer uses paged virtual memory with 4KB
pages. Calculate the number of bits in the page number and offset fields of a logical address.
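The arithmetic for this example can be sketched as follows (4 KB = 2^12 bytes, so the page size fixes the offset width):

```python
# 32-bit byte addresses with 4 KB pages:
# offset bits = log2(page size); page-number bits = address bits - offset bits.
address_bits = 32
page_size = 4 * 1024                       # 4 KB = 4096 bytes

offset_bits = page_size.bit_length() - 1   # log2(4096) = 12
page_number_bits = address_bits - offset_bits

print(offset_bits, page_number_bits)       # 12 20
```

So the offset field is 12 bits and the page number field is the remaining 20 bits.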
Page Tables
Virtual memory address translation uses a page table, a structured array in memory that is
indexed by page number.
Each page table entry contains information about a single page. Part of this information is a
frame number, which gives the location of the page in physical memory. In addition there are
control bits for controlling the translation process. Address translation concatenates the
frame number with the offset part of a logical address to form a physical address.
A page table base register (PTBR) holds the base address for the page table of the current
process. It is a processor register that is managed by the operating system.
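The translation steps above can be sketched as a toy model (4 KB pages are assumed, as in the example; a real MMU also checks valid and protection bits before using an entry):

```python
# Toy model of virtual-to-physical address translation with a page table.
# Illustrative only: entries here hold just a frame number, no control bits.

PAGE_SIZE = 4096      # assumption: 4 KB pages
OFFSET_BITS = 12      # log2(4096)

# Page table indexed by page number; values are frame numbers.
page_table = {0: 5, 1: 9, 2: 3}

def translate(virtual_address):
    page_number = virtual_address >> OFFSET_BITS
    offset = virtual_address & (PAGE_SIZE - 1)
    frame_number = page_table[page_number]     # a miss here would be a page fault
    # Concatenate frame number with the offset to form the physical address.
    return (frame_number << OFFSET_BITS) | offset

print(hex(translate(0x1234)))   # page 1 maps to frame 9, giving 0x9234
```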
DMA stands for "Direct Memory Access" and is a method of transferring data from the
computer's RAM to another part of the computer without processing it using the CPU. While
most data that is input or output from your computer is processed by the CPU, some data
does not require processing, or can be processed by another device.
In these situations, DMA can save processing time and is a more efficient way to move data
from the computer's memory to other devices. In order for devices to use direct memory
access, they must be assigned to a DMA channel. Each type of port on a computer has a set
of DMA channels that can be assigned to each connected device. For example, a PCI
controller and a hard drive controller each have their own set of DMA channels.
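The idea can be illustrated with a toy model (the names below are illustrative, not a real controller's API): the CPU only programs the channel, and the block copy then proceeds without per-word CPU instructions.

```python
# Toy model of a DMA transfer: the CPU programs the channel with a
# source, destination, and word count, then the controller moves the
# whole block on its own. All names here are hypothetical.

memory = list(range(16))        # pretend RAM
device_buffer = [0] * 4         # pretend device-side buffer

class DMAChannel:
    def program(self, src, dst, count):
        # CPU side: just three register writes to set up the transfer.
        self.src, self.dst, self.count = src, dst, count

    def start(self):
        # Controller side: the copy happens without CPU involvement.
        for i in range(self.count):
            device_buffer[self.dst + i] = memory[self.src + i]

ch = DMAChannel()
ch.program(src=8, dst=0, count=4)   # CPU sets up the transfer...
ch.start()                          # ...and is then free to do other work
print(device_buffer)                # [8, 9, 10, 11]
```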
DMA DIAGRAM:
IOP is a processor with direct memory access capability that communicates with I/O devices.
In this configuration, the computer system can be divided into a memory unit, and a number
of processors comprised of CPU and one or more IOPs. IOP is similar to CPU except that it
is designed to handle the details of I/O processing. Unlike DMA controller which is setup
completely by the CPU, IOP can fetch and execute its own instructions. IOP instructions are
designed specifically to facilitate I/O transfers. Instructions that are read from memory by an
IOP are called commands, to distinguish them from instructions read by the CPU. The command
words constitute the program for the IOP. The CPU informs the IOP where to find the
commands in memory when it is time to execute the I/O program.
The memory unit occupies a central position and can communicate with each processor by means
of DMA. The CPU is usually assigned the task of initiating the I/O program; from then on, the IOP
operates independently of the CPU and continues to transfer data between external devices and
memory.
CPU-IO Communication
Communication between the CPU and IOP may take different forms, depending on the
particular computer used. In most cases, the memory unit acts as a message center where each
processor leaves information for the other.
The CPU sends an instruction to test the IOP path. The IOP responds by inserting a status word in
memory for the CPU to check. The bits of the status word indicate the condition of the IOP and the
I/O device ("IOP overload condition", "device busy with another transfer", etc.). The CPU then
checks the status word to decide what to do next. If all is in order, the CPU sends the instruction to
start the I/O transfer. The memory address received with this instruction tells the IOP where
to find its program. The CPU may continue with another program while the IOP is busy with the
I/O program. When the IOP terminates the transfer (using DMA), it sends an interrupt request to
the CPU. The CPU responds by issuing an instruction to read the status from the IOP, and the IOP
answers by placing its status report into a specified memory location. By inspecting the
bits in the status word, the CPU determines whether the I/O operation was completed
satisfactorily, and the process is repeated.
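The status-word check in this handshake can be sketched as follows (the bit positions are assumptions for illustration, not taken from a real IOP):

```python
# Toy sketch of the CPU side of the CPU-IOP status-word handshake.
# Bit assignments below are hypothetical, chosen only to illustrate
# how the CPU inspects the status word the IOP leaves in memory.

DEVICE_BUSY  = 0b01   # assumed bit: device busy with another transfer
IOP_OVERLOAD = 0b10   # assumed bit: IOP overload condition

def cpu_checks_status(status_word):
    """CPU decision after reading the IOP's status word from memory."""
    if status_word & (DEVICE_BUSY | IOP_OVERLOAD):
        return "wait"                 # not safe to start a transfer yet
    return "start I/O transfer"       # all clear: send the start instruction

print(cpu_checks_status(0b00))   # start I/O transfer
print(cpu_checks_status(0b01))   # wait
```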
When the CPU is executing a process and a user requests another process, the running
process is disturbed; this event is called an interrupt.
Interrupts can be generated by the user, by error conditions, and by software and
hardware. The CPU must handle all interrupts very carefully, and it provides a response to
each interrupt that is generated. When an interrupt occurs, the CPU handles it using the
fetch, decode and execute operations.
Types of Interrupts
Generally there are three types of interrupts:
1) Internal Interrupt
2) Software Interrupt.
3) External Interrupt.
An external interrupt occurs when an input or output device requests an operation,
and the CPU executes the corresponding instructions first. For example, when a program is
executing and we move the mouse on the screen, the CPU handles this external interrupt
first and then resumes its operation.
Internal interrupts are those that occur due to some problem in the execution itself,
for example when a user performs an operation that contains an error. In other words,
internal interrupts are raised by operations or instructions that are not possible, but that the
user attempts anyway.
Software interrupts are those that make a call to the system, for example when we
are processing some instructions and want to execute one more application program.