Two-Marks Material Computer Architecture
Prepared by: Ms. R.P. Narmadha, AP/IT
Reviewed by:
Dr NGP IT CS 8491 Computer Architecture Dept of IT
SYLLABUS
UNIT IV PARALLELISM 9
Parallel processing challenges – Flynn's classification – SISD, MIMD, SIMD, SPMD, and Vector
Architectures - Hardware multithreading – Multi-core processors and other Shared Memory Multiprocessors
-Introduction to Graphics Processing Units, Clusters, Warehouse Scale Computers and other Message-
Passing Multiprocessors.
TEXT BOOK:
1. David A. Patterson and John L. Hennessy, Computer Organization and Design: The
Hardware/Software Interface, Fifth Edition, Morgan Kaufmann / Elsevier, 2014.
2. Carl Hamacher, Zvonko Vranesic, Safwat Zaky and Naraig Manjikian, Computer Organization and
Embedded Systems, Sixth Edition, Tata McGraw Hill, 2012.
REFERENCES:
1. William Stallings “Computer Organization and Architecture” , Seventh Edition , Pearson
Education, 2006.
2. John P. Hayes, “Computer Architecture and Organization”, Third Edition, Tata Mc Graw Hill,
1998.
3. John L. Hennessy and David A. Patterson, "Computer Architecture – A Quantitative Approach",
Morgan Kaufmann / Elsevier Publishers, Fifth Edition, 2012.
4. http://nptel.ac.in/.
UNIT 1
BASIC STRUCTURE OF A COMPUTER SYSTEM
Main memory vs. secondary memory:
- Main memory is closely connected to the processor; secondary memory is connected to main memory through the bus and a controller.
- In main memory, stored data are quickly and easily changed; in secondary memory, stored data are easily changed but changes are slow compared to main memory.
- Main memory holds the programs and data that the processor is actively working with; secondary memory is used for long-term storage of programs and data.
- Main memory interacts with the processor millions of times per second; before data and programs in secondary memory can be used, they must be copied from secondary memory into main memory.
8. Identify the advantages of network computers.
Networked computers have several major advantages:
Communication: Information is exchanged between computers at high speeds.
Resource sharing: Rather than each computer having its own I/O devices, computers
on the network can share I/O devices.
Nonlocal access: By connecting computers over long distances, users need not be near
the computer they are using.
12. Express the Execution Time. (Nov/Dec 2016) (Nov/Dec 2015)
Execution time is defined as the reciprocal of the performance of the computer system. It is
related by:
Execution time = 1 / Performance
CPU time means the time the CPU is computing, not including the time waiting for I/O or
running other programs. It can be further divided into the CPU time spent in the program, called user
CPU time, and the CPU time spent in the operating system performing tasks requested by the
program, called system CPU time.
14. List out the types of programs to evaluate the performance.
There are four levels of programs. They are:
- Real Programs
- Kernels
- Toy benchmarks
- Synthetic benchmarks
18. How would you formulate the speedup?
Speedup is the ratio:
Speedup = Performance with enhancement / Performance without enhancement
        = Execution time without enhancement / Execution time with enhancement
Speedup tells us how much faster a task will run using the machine with the enhancement as
opposed to the original machine.
20. Write the formula for CPU execution time for a program.
CPU execution time = Instruction count × CPI × Clock cycle time
                   = (Instruction count × CPI) / Clock rate
21. If computer A runs a program in 10 seconds, and computer B runs the same program in 15
seconds, how much faster is A over B?
Performance A / Performance B = Execution time B / Execution time A = 15 / 10 = 1.5,
so A is 1.5 times faster than B.
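The relative-performance calculation in Question 21 can be sketched in Python (an illustrative helper, not from the textbook):

```python
def speedup(time_slow, time_fast):
    # performance is the reciprocal of execution time, so the
    # performance ratio is the inverse of the execution-time ratio
    return time_slow / time_fast

print(speedup(15, 10))  # → 1.5, so A is 1.5 times faster than B
```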
The MIPS R-format instruction fields are: op | rs | rt | rd | shamt | funct
28. Show the addressing modes and its various types. (Nov/Dec 2017)
The different ways in which the location of an operand is specified in an instruction are referred to
as addressing modes.
The MIPS addressing modes are the following:
1. Immediate addressing
2. Register addressing
3. Base or displacement addressing
4. PC-relative addressing
5. Pseudo direct addressing
40. State the need for indirect addressing mode. (Apr/May 2017)
With direct addressing, the length of the address field is usually less than the word length,
thus limiting the address range. One solution is to have the address field refer to the
address of a word in memory, which in turn contains a full-length address of the operand.
This is known as indirect addressing.
45. Suppose that we are considering an enhancement to the processor of a server system used for Web
serving. The new CPU is 10 times faster on computation in the Web serving application than the
original processor. Assuming that the original CPU is busy with computation 40% of the time and is
waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the
enhancement? (April/May 2019)
Overall speedup = 1 / ((1 − 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56
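One way to check the Amdahl's Law arithmetic for Question 45 is a small Python helper (the function name is illustrative):

```python
def overall_speedup(fraction_enhanced, speedup_factor):
    # Amdahl's Law: the unenhanced fraction (here, I/O wait) limits the gain
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_factor)

print(overall_speedup(0.4, 10))  # about 1.56x overall
```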
UNIT – II
The ALU has two input registers, named A and B, and one output storage register, named C.
It performs the operation:
C = A op B
The input data are stored in A and B; according to the operation specified on the control lines, the
ALU performs the operation and puts the result in register C.
3. Add 6₁₀ to 7₁₀ in binary and subtract 6₁₀ from 7₁₀ in binary.
0110₂ + 0111₂ = 1101₂ = 13₁₀
0111₂ − 0110₂ = 0001₂ = 1₁₀
4. Write the overflow conditions for addition and subtraction. (APR/MAY 2015) (Nov/Dec 2016)
(Nov/Dec 2015)
The overflow conditions for addition and subtraction are:
Operation | Operand A | Operand B | Result indicating overflow
A + B     | ≥ 0       | ≥ 0       | < 0
A + B     | < 0       | < 0       | ≥ 0
A − B     | ≥ 0       | < 0       | < 0
A − B     | < 0       | ≥ 0       | ≥ 0
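These conditions can also be checked mechanically; the sketch below (assuming 8-bit two's-complement operands) flags results that fall outside the representable range:

```python
def add_overflows(a, b, bits=8):
    """True if a + b overflows bits-bit two's-complement arithmetic."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return not (lo <= a + b <= hi)

print(add_overflows(100, 100))   # two positives giving 200 > 127: True
print(add_overflows(100, -100))  # 0 fits in range: False
```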
The SRT division technique is used to perform faster division. It tries to guess several quotient bits
per step, using a table lookup based on the upper bits of the dividend and remainder.
The IEEE 754 standard floating point representation is almost always an approximation of the real number.
Fraction: the value, generally between 0 and 1, placed in the fraction field.
Scientific notation: a notation that renders numbers with a single digit to the left of the decimal
point.
Exponent: in the numerical representation system of floating point arithmetic, the value that is placed
in the exponent field.
11. List out the advantages of using normalized scientific notation.
There are three advantages:
It simplifies exchange of data that includes floating-point numbers
It simplifies the floating point arithmetic algorithms to know that numbers will always be in this form
It increases the accuracy of the numbers that can be stored in a word, since the unnecessary leading 0s
are replaced by real digits to the right of the binary point.
An unscheduled event that disrupts program execution is called an exception. It is also called
an interrupt.
The address of the instruction that overflowed is saved in a register, and the computer jumps
to a predefined address to invoke the appropriate routine for that exception.
Guard is the first of two extra bits kept on the right during intermediate calculations of floating point
numbers. It is used to improve rounding accuracy.
Round is a method to make the intermediate floating-point result fit the floating-point format; the
goal is typically to find the nearest number that can be represented in the format. IEEE 754, therefore, always
keeps two extra bits on the right during intermediate additions, called guard and round, respectively.
Units in the last place (ulp) is the number of bits in error in the least significant bits of the
significand between the actual number and the number that can be represented.
21. Show the sub word parallelism. (APR/MAY 2015, MAY/JUNE 2016)
Subword Parallelism-
Subword Parallelism is a technique that enables the full use of word-oriented datapaths when dealing
with lower-precision data. It is a form of low-cost, small-scale SIMD parallelism.
Graphics and audio applications can take advantage of performing simultaneous operations on short
vectors. By partitioning the carry chains within a 128-bit adder, a processor could use parallelism to
perform simultaneous operations on short vectors of sixteen 8-bit operands, eight 16-bit operands,
four 32-bit operands, or two 64-bit operands.
Example: a 128-bit adder can perform:
Sixteen 8-bit adds
Eight 16-bit adds
Four 32-bit adds
Two 64-bit adds
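The partitioned-adder idea can be imitated in software with masking (a SWAR-style sketch; the constants assume eight 8-bit lanes in a 64-bit word):

```python
LANE_HI = 0x8080808080808080  # the MSB of each 8-bit lane
LANE_LO = 0x7F7F7F7F7F7F7F7F  # the low 7 bits of each lane

def packed_add8(a, b):
    # add the low 7 bits of every lane, then patch each lane's MSB with
    # XOR so that carries never propagate across an 8-bit lane boundary
    low = (a & LANE_LO) + (b & LANE_LO)
    return (low ^ ((a ^ b) & LANE_HI)) & 0xFFFFFFFFFFFFFFFF

print(hex(packed_add8(0x01FF, 0x0101)))  # lanes wrap independently: 0x200
```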
22. State an opcode. How many bits are needed to specify 32 distinct operations?
The field that denotes the operation and format of the instruction is called opcode. To specify 32
distinct operations, 5 bits are necessary.
23. List out the use of round bit in floating point arithmetic.
Round is a method to make the intermediate floating point result to fit the floating point format.
The purpose of round is to find the nearest number that can be represented in the format.
Subword parallelism performs simultaneous operations on short vectors with the following values:
1. Sixteen 8-bit operands
2. Eight 16-bit operands
3. Four 32-bit operands
4. Two 64-bit operands
MIPS has two instructions to produce a proper product for signed and unsigned numbers such as
1. Multiply (mult)
2. Multiply unsigned (multu)
In binary numbers, the most significant bit (MSB) is used to represent the sign. If the MSB is 0, the
number is positive, and if it is 1, the number is negative.
32. List out the rules to perform addition on floating point numbers. (Apr/May 2017)
Step 1: Compare the exponents of the two numbers. Shift the smaller number to the right
until its exponent would match the larger exponent
Step 2: Add the significands
Step 3: Normalize the sum, either shifting right and incrementing the exponent or shifting
left and decrementing the exponent
Step 4: Check for Overflow or Underflow
Step 5: Round the significand to the appropriate number of bits.
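The first three steps above can be modeled with integer significands. This toy sketch rescales both numbers exactly instead of shifting the smaller one right with guard/round bits, as real hardware does:

```python
def fp_add(f1, e1, f2, e2):
    """Add f1*2**e1 and f2*2**e2, where f is an integer significand (toy model)."""
    # Step 1: rewrite both numbers on the smaller exponent
    e = min(e1, e2)
    f = (f1 << (e1 - e)) + (f2 << (e2 - e))  # Step 2: add the significands
    # Step 3: normalize by moving factors of two back into the exponent
    while f and f % 2 == 0:
        f //= 2
        e += 1
    return f, e

print(fp_add(3, -1, 1, -1))  # 1.5 + 0.5 → (1, 1), i.e. 1 * 2**1 = 2.0
```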
33. Write the single and double precision binary representation of −0.75ten (APR/MAY 2018)
−0.75₁₀ = −1.1₂ × 2⁻¹, so the biased exponents are 126 (single) and 1022 (double).
Single precision: sign 1, exponent 01111110, fraction 1000…0 (23 bits)
Double precision: sign 1, exponent 01111111110, fraction 1000…0 (52 bits)
Add the binary values of +4 and −48 (in two's complement) to get the correct answer:
  0 0 0 0 0 1 0 0 = +4
+ 1 1 0 1 0 0 0 0 = −48
_____________________
  1 1 0 1 0 1 0 0 = −44
UNIT – III
2. State the data path element and program counter. Nov/Dec 2016
A data path element is a unit used to operate on or hold data within a processor. In the MIPS
implementation, the data path elements include the instruction and data memories, the register file, the ALU
and adders. Program Counter (PC) is the register containing the address of the current instruction in the
program being executed.
5. List out the two state elements needed to store and access an instruction.
Two state elements needed to store and access instructions are the instruction memory and the
program counter. An adder is needed to compute the next instruction address.
10. List out the three instruction classes and their instruction formats?
The three instruction classes (R-type, load and store, and branch) use two different instruction
formats.
The destination address for a jump instruction is formed by concatenating the upper 4 bits of the
current PC + 4 to the 26-bit address field in the jump instruction and appending 00 as the 2 low-order bits.
13. Point out the five steps in MIPS instruction execution. (April/May 2019)
The five steps in MIPS instruction execution are:
1. Fetch instruction from memory.
2. Read registers while decoding the instruction. The regular format of MIPS instructions allows
reading and decoding to occur simultaneously.
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.
14. Write the formula for calculating time between instructions in a pipelined processor.
Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipeline stages
(under ideal conditions, i.e., when the stages are perfectly balanced)
15. Identify the Hazards. Write its types. (Nov/Dec 2015) (Apr/May 2017)(Apr/May 2017)
Hazards are the situations in pipelining when the next instruction cannot be executed in the following
clock cycle. The types of hazards are:
1. Structural Hazards
2. Data Hazards
3. Control Hazards
Forwarding, also called bypassing, is a method of resolving a data hazard by retrieving the missing data
element from internal buffers rather than waiting for it to arrive from programmer visible registers or
memory.
A load-use data hazard is a specific form of data hazard in which the data being loaded by a load
instruction has not yet become available when it is needed by another instruction.
30. List down the steps to be carried out in executing a load word instruction.
1. An instruction is fetched from the instruction memory and the PC is incremented
2. A register value is read from the register file
3. The ALU computes the sum of the value read from the register file and the sign-extended, lower 16
bits of the instruction
4. The sum from the ALU is used as the address for the data memory
5. The data from the memory unit is written into the register file in the destination register.
31. Name the control Signal required to perform arithmetic operations (Apr/May 2017)
The control signals required to perform arithmetic operations are
1. RegDst
2. RegWrite
3. ALUSrc
4. MemRead
5. MemWrite
UNIT IV
PARALLELISM
1. State the Instruction level parallelism. (Nov/Dec 2016) (Nov/Dec 2015) (Apr/May 2017)
Pipelining exploits the potential parallelism among instructions. This parallelism is called instruction-
level parallelism (ILP). There are two primary methods for increasing the potential amount of instruction-
level parallelism.
1. Increasing the depth of the pipeline to overlap more instructions.
2. Multiple issue.
Multiple issue is a scheme whereby multiple instructions are launched in one clock cycle. It is a
method for increasing the potential amount of instruction-level parallelism. It is done by replicating the
internal components of the computer so that it can launch multiple instructions in every pipeline stage. The
two approaches are
1. Static multiple issue (at compile time)
2. Dynamic multiple issue (at run time)
Speculation is one of the most important methods for finding and exploiting more ILP. It is an
approach whereby the compiler or processor guesses the outcome of an instruction to remove it as a
dependence in executing other instructions. For example, we might speculate on the outcome of a
branch, so that instructions after the branch could be executed earlier.
Static multiple issue is an approach to implement a multiple-issue processor where many decisions
are made by the compiler before execution.
Issue slots are the positions from which instructions could be issued in a given clock cycle. By
analogy, these correspond to positions at the starting blocks for a sprint. Issue packet is the set of
instructions that issues together in one clock cycle; the packet may be determined statically by the
compiler or dynamically by the processor.
Very Long Instruction Word (VLIW) is a style of instruction set architecture that launches many
operations that are defined to be independent in a single wide instruction, typically with many separate
opcode fields.
Superscalar is an advanced pipelining technique that enables the processor to execute more than one
instruction per clock cycle by selecting them during execution. Instructions issue in order, and the processor
decides whether zero, one, or more instructions can issue in a given clock cycle.
An important compiler technique to get more performance from loops is loop unrolling, where
multiple copies of the loop body are made. After unrolling, there is more ILP available by overlapping
instructions from different iterations.
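The transformation can be illustrated in Python (the compiler does this at the instruction level; the four accumulators stand in for independent instructions that can overlap):

```python
def sum_rolled(a):
    s = 0
    for x in a:
        s += x
    return s

def sum_unrolled4(a):
    # four copies of the loop body with independent accumulators,
    # exposing operations from different iterations to the scheduler
    s0 = s1 = s2 = s3 = 0
    n4 = len(a) - len(a) % 4
    for i in range(0, n4, 4):
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
    tail = sum(a[n4:])  # cleanup loop for leftover iterations
    return s0 + s1 + s2 + s3 + tail

print(sum_unrolled4(list(range(10))))  # → 45, same as the rolled loop
```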
Anti-dependence is an ordering forced by the reuse of a name, typically a register, rather than by a
true dependence that carries a value between two instructions. It is also called a name dependence.
Renaming is the technique used to remove anti-dependence, in which the registers are renamed by the
compiler or hardware.
Reservation station is a buffer within a functional unit that holds the operands and the operation. Reorder
buffer is the buffer that holds results in a dynamically scheduled processor until it is safe to store the results
to memory or a register.
Out-of-order execution is a situation in pipelined execution in which an instruction blocked from
executing does not cause the following instructions to wait. It preserves the data flow order of the program.
In-order execution requires the instruction fetch and decode unit to issue instructions in order, which
allows dependences to be tracked, and requires the commit unit to write results to registers and memory in
program fetch order. This conservative mode is called in-order commit.
Blocked multithreading
This is also known as coarse-grained multithreading. The instructions of a thread are executed
successively until an event occurs that may cause delay, such as a cache miss. This event
induces a switch to another thread. This approach is effective on an in-order processor that would
stall the pipeline for a delay event such as a cache miss.
Simultaneous multithreading (SMT): instructions are simultaneously issued from multiple threads to the
execution units of a superscalar processor. This combines the wide superscalar instruction issue
capability with the use of multiple thread contexts.
Shared memory multiprocessor (SMP) is one that offers the programmer a single physical address space
across all processors - which is nearly always the case for multicore chips. Processors communicate through
shared variables in memory, with all processors capable of accessing any memory location via loads and
stores.
Uniform memory access (UMA) is a multiprocessor in which latency to any word in main memory is about
the same no matter which processor requests the access.
Non uniform memory access (NUMA) is a type of single address space multiprocessor in which some
memory accesses are much faster than others depending on which processor asks for which word.
Executing some instructions in a different order from the way they occur in the instruction
stream, and beginning execution of instructions that may never be needed, may be reaching a
limit due to complexity and power consumption concerns.
An alternative approach, which allows for a high degree of instruction-level parallelism
without increasing circuit complexity or power consumption, is called multithreading.
The instruction stream is divided into several smaller streams, known as threads, such that the
threads can be executed in parallel.
Thread:
A dispatchable unit of work within a process. It includes a processor context (which includes the
program counter and stack pointer) and its own data area for a stack (to enable subroutine
branching).
A thread executes sequentially and is interruptible so that the processor can turn to another thread.
A thread is concerned with scheduling and execution.
22. Recall the task level parallelism and data level parallelism.
Task-level parallelism or process-level parallelism means utilizing multiple processors by running
independent programs simultaneously. Parallelism achieved by performing the same operation on
independent data is called data-level parallelism.
Consider an example in which instruction A comes before instruction B in program order. A writes to a
location and B writes to the same location. If B writes first and then A writes, the location will end up
with the wrong value. This is called an output dependency.
Consider an example in which instruction A comes before instruction B in program order. A reads from
a location and B writes to the location, so B has a WAR dependency on A. If B executes before A has
read its operand, then the operand will be lost. This is called an anti-dependency.
Multithreading implies that there are multiple threads of control in each processor. Multithreading
offers an effective mechanism for hiding long latency in building large-scale microprocessors.
Multithreading is the ability of a program or an operating system to serve more than one user at a
time and to manage multiple simultaneous requests without the need to have multiple copies of the
program running within the computer. To support this, central processing units have hardware support
to efficiently execute multiple threads.
29. Differentiate between strong scaling and weak scaling (APR/MAY 2015, NOV/DEC2017)
In strong scaling methods, speed up is achieved on a multiprocessor without increasing the size of
the problem. Strong scaling means measuring speed up while keeping the problem size fixed. In weak
scaling method, speed up is achieved on a multiprocessor while increasing the size of the problem
proportionally to the increase in the number of processors.
Synchronization is the process of coordinating the behavior of two or more processes running on different
processors.
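A minimal sketch of synchronization using Python's standard threading module; without the lock, the increments of the shared counter could interleave and lose updates:

```python
import threading

def parallel_count(n_threads=4, n_iters=10000):
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(n_iters):
            with lock:  # the critical section: one thread at a time
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(parallel_count())  # → 40000: every increment is preserved
```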
31. List the Fine grained multithreading and Coarse grained multithreading. MAY/JUNE
2016,NOV/DEC2017
Fine-grained multithreading
Switches between threads on each instruction, causing the execution of multiple threads to be interleaved.
- Usually done in a round-robin fashion, skipping any stalled threads
- The CPU must be able to switch threads every clock cycle
Coarse-grained multithreading
Switches threads only on costly stalls, such as L2 cache misses.
UNIT V
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again in the
near future.
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will
tend to be referenced in the near future.
Flash memory is a type of electrically erasable programmable read-only memory (EEPROM). Unlike
disks and DRAM, EEPROM technologies can wear out flash memory bits. To cope with such limits, most
flash products include a controller to spread the writes by remapping blocks that have been written
many times to less trodden blocks. This technique is called wear leveling.
Rotational latency, also called rotational delay, is the time required for the desired sector of a disk to rotate
under the read/write head, usually assumed to be half the rotation time.
8. Consider a cache with 64 blocks and a block size of 16 bytes. To what block number does byte
address 1200 map?
The block number is given by:
Block number = (Block address) modulo (Number of blocks in the cache)
Block address = ⌊1200 / 16⌋ = 75
Block number = 75 modulo 64 = 11
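The direct-mapped index calculation for Question 8 can be computed with a short helper (names are illustrative):

```python
def cache_block(byte_addr, block_bytes, num_blocks):
    block_addr = byte_addr // block_bytes   # which memory block the byte is in
    return block_addr % num_blocks          # direct-mapped cache index

print(cache_block(1200, 16, 64))  # → 11
```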
9. How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word
blocks, assuming a 32-bit address?
16 KiB of data is 4096 words, i.e., 1024 blocks of 4 words each, so 10 bits index the cache. Each
block holds 4 × 32 = 128 bits of data plus a tag of 32 − 10 − 2 − 2 = 18 bits and a valid bit.
Total bits = 1024 × (128 + 18 + 1) = 150,528 bits = 147 Kibibits, or about 18.4 KiB for a 16 KiB cache.
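The bit count for a direct-mapped cache can be reproduced with a sketch (assumes power-of-two sizes, 32-bit words, and a 32-bit address; the function name is illustrative):

```python
def direct_mapped_cache_bits(data_kib, words_per_block, addr_bits=32):
    block_bytes = words_per_block * 4               # 32-bit words
    num_blocks = data_kib * 1024 // block_bytes
    index_bits = num_blocks.bit_length() - 1        # log2 of a power of two
    offset_bits = block_bytes.bit_length() - 1      # byte offset within a block
    tag_bits = addr_bits - index_bits - offset_bits
    data_bits = words_per_block * 32
    return num_blocks * (data_bits + tag_bits + 1)  # +1 for the valid bit

print(direct_mapped_cache_bits(16, 4))  # → 150528 bits
```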
Write-through is a scheme in which writes always update both the cache and the next lower level of
the memory hierarchy, ensuring that data is always consistent between the two.
Write-back is a scheme that handles writes by updating values only to the block in the cache, then
writing the modified block to the lower level of the hierarchy when the block is replaced.
Average memory access time is the average time to access memory considering both hits and misses and the
frequency of different accesses. It is equal to the following:
AMAT = Time for a hit + Miss rate × Miss penalty
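The average memory access time formula is easy to evaluate; the figures below are made-up example values, not from the text:

```python
def amat(hit_time, miss_rate, miss_penalty):
    # every access pays the hit time; misses additionally pay the penalty
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 100))  # about 6 cycles for a 5% miss rate
```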
Direct-mapped cache is a cache structure in which each memory location is mapped to exactly one location
in the cache.
Fully associative cache is a cache structure in which a block can be placed in any location in the cache.
Set-associative cache is a cache that has a fixed number of locations (at least two) where each block
can be placed.
Reliability is a measure of the continuous service accomplishment or,equivalently, of the time to failure from
a reference point. Hence, mean time to failure (MTTF) is a reliability measure. A related term is annual
failure rate (AFR), which is just the percentage of devices that would be expected to fail in a year for a given
MTTF.
Availability is then a measure of service accomplishment with respect to the alternation between the
two states of accomplishment and interruption. Availability is statistically quantified as:
Availability = MTTF / (MTTF + MTTR)
Virtual memory is a technique that uses main memory as a “cache” for secondary storage. Two major
motivations for virtual memory: to allow efficient and safe sharing of memory among multiple programs,
and to remove the programming burdens of a small, limited amount of main memory.
Address translation, also called address mapping, is the process by which a virtual address is mapped
to an address used to access data in memory.
22. How does the size of memory work out if 20 address lines are used?
If the processor has 20 address lines, it is capable of addressing up to 2²⁰ memory locations. Hence
the size of the memory is 1 MB.
Valid bit is a field in the tables of a memory hierarchy that indicates that the associated block in the
hierarchy contains valid data.
A field that is set whenever a page is accessed and is used to implement LRU or other replacement schemes
is called reference bit.
DMA works in different modes. It is based on the degrees of overlap between the CPU and DMA
Operations. The various modes of DMA operations are
1. Block Transfer
2. Cycle Stealing
3. Transparent DMA
The two approaches to bus arbitration are centralized and distributed arbitration. In centralized
arbitration, a single bus arbiter performs the arbitration and selects the bus master. In distributed
arbitration, all the devices participate in the selection of the next bus master.
Memory-mapped I/O | I/O-mapped (isolated) I/O
3. Implementation: Easy to implement | Difficult to implement
6. Control lines used: READ M, WRITE M | READ M, WRITE M, READ IO, WRITE IO
1. The speed of the CPU is reduced due to low speed IO Devices. The speed with which the CPU can
test and transfer data between IO devices is limited due to low transfer rate of IO devices.
2. Most of the CPU time is wasted. The time that the CPU spends testing IO device status and executing
IO data transfers is too long; that time could be spent on other tasks.
Cache memory is the memory nearest to the CPU; all recently used instructions are stored in the
cache. The cache holds the data and instructions the CPU needs to perform a task, but its capacity is
small compared to main memory and the hard disk. The cache has a smaller access time than main
memory and is therefore faster: a cache memory may have an access time of 100 ns, while the main
memory may have an access time of 700 ns.
When a processor refers to a data item, if the referenced item is in the cache, the reference is called a
hit. If the referenced data is not in the cache, it is called a miss. The hit ratio is defined as the ratio of
the number of hits to the total number of references:
Hit ratio = Number of hits / Total number of references
In order to carry out two or more simultaneous accesses to memory, the memory must be partitioned
into separate modules. The advantage of a modular memory is that it allows interleaving, i.e.,
consecutive addresses are assigned to different memory modules.
33. Mention the use of DMA. (Dec 2012)(Dec 2013,APR/MAY2018)
DMA (Direct Memory Access) provides I/O transfer of data directly to and from the memory unit and the
peripheral. Direct memory access (DMA) is a method that allows an input/output (I/O) device to send or
receive data directly to or from the main memory, bypassing the CPU to speed up memory operations. The
process is managed by a chip known as a DMA controller (DMAC).