
COMPUTER ARCHITECTURE

ADVANCED COMPUTER ARCHITECTURE

Falguni Sinhababu
Government College of Engineering and Leather Technology

1
MULTICYCLE OPERATIONS

Advanced Computer Architecture 2


INTRODUCTION
▪ In our earlier discussions we considered a computer
instruction set consisting only of integer instructions,
with execution divided into 5 stages (IF, ID, EX, MEM and
WB), each stage requiring 1 clock cycle.
▪ A real implementation, however, will consist of both
integer and floating-point units.
▪ Floating-point operations are more complex than integer
operations.
▪ They will require more than one cycle in the EX stage.
▪ This makes pipeline scheduling and operation more complex.
▪ New types of data hazards appear that are otherwise not
possible in the integer pipeline.

Advanced Computer Architecture 3


(A) SOLUTION 1
▪ Do not make any change in the pipeline control.
▪ Use a slow clock such that the ALU operations in the
floating point instructions can finish in one clock
cycle (in EX stage).
▪ Drawback:
▪ Other operations are also slowed down, causing severe
degradation in performance.
▪ Not acceptable in practice.

Advanced Computer Architecture 4


(B) SOLUTION 2
▪ We allow the floating point arithmetic pipeline to have a
longer latency (time taken for an instruction to complete its
execution).
▪ EX cycle is considered to be repeated several times.
▪ The number of repetitions can vary depending on the operation.

▪ The EX stage will have multiple floating-point functional units.
▪ For example, one for addition/subtraction (pipelined), one for
multiplication (pipelined) and one for division (non-pipelined).
▪ The stall will occur in the pipeline if the instruction to be issued (ID
to EX stage) will either cause a structural hazard for the functional
unit, or a data hazard.
▪ Pipelining the functional units can avoid the structural hazard.
Advanced Computer Architecture 5
(B) SOLUTION 2
MUL  IF ID EX MEM WB
MUL     IF ID EX MEM WB

Without Pipeline

MUL  IF ID EX EX EX EX MEM WB
MUL     IF ID EX EX EX EX MEM WB

With Pipeline
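The timing difference can be sketched with a toy cycle model (a sketch, not the slides' own tool; it assumes an FP multiply occupies 4 EX cycles, as in the diagram above):

```python
EX_MUL = 4  # assumed number of EX cycles for an FP multiply, as in the diagram

def wb_cycle(ex_start):
    """Cycle in which WB completes, given the cycle in which EX begins."""
    return ex_start + EX_MUL + 1  # EX cycles, then MEM, then WB

# First MUL: IF in cycle 1, ID in cycle 2, EX starts in cycle 3
first = wb_cycle(3)                   # completes in cycle 8

# Pipelined FP unit: the second MUL enters EX one cycle behind the first
pipelined = wb_cycle(4)               # completes in cycle 9

# Non-pipelined FP unit: the second MUL's EX waits until the first drains
non_pipelined = wb_cycle(3 + EX_MUL)  # completes in cycle 12
```

The pipelined unit loses only one cycle between back-to-back multiplies; the non-pipelined unit loses the full EX depth.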

Advanced Computer Architecture 6


FLOATING POINT OPERATIONS
▪ Consider that there are
4 functional units
▪ Main integer unit that
handles loads and stores,
integer ALU operations
and branches
▪ Floating point adder/
subtractor
▪ Floating point and integer
multiplier
▪ Floating point and integer
divider.

Advanced Computer Architecture 7


MIPS32 FLOATING POINT EXTENSIONS
▪In the floating-point extension of MIPS32, there are
32 floating-point registers F0 to F31, each 32 bits
wide.
▪For double precision operations, register pairs
can be used to store the data:
▪Register pair <F0, F1> referred to as F0
▪Register pair <F2, F3> referred to as F2
▪Register pair <F30, F31> referred to as F30
Advanced Computer Architecture 8
FLOATING POINT INSTRUCTION EXAMPLES
▪ Load into a floating point register pair:
▪ L.D F6, 200(R2)
▪ F6= Mem(R2+200); F7 = Mem(R2+204);
▪ Store from a floating point register pair:
▪ S.D F4, 40(R5)
▪ Mem(R5+40) = F4; Mem(R5+44) = F5;
▪ Arithmetic operations for floating point register pairs:
▪ ADD.D F0, F4, F6
▪ SUB.D F12, F8, F20
▪ MUL.D F4, F6, F8
▪ DIV.D F8, F8, F10
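The register-pair semantics of L.D can be sketched in a small model (assumptions: memory is a dict of word-addressed 32-bit values, F is the 32-entry FP register file, and the base register R2 holds 0):

```python
# Toy model of L.D into a double-precision register pair
F = [0] * 32                              # FP register file F0..F31
mem = {200: 0x11111111, 204: 0x22222222}  # word-addressed memory

def load_double(ft, addr):
    """L.D Ft, offset(base): Ft gets the first word, Ft+1 the next word."""
    F[ft] = mem[addr]
    F[ft + 1] = mem[addr + 4]

load_double(6, 200)   # L.D F6, 200(R2), with R2 = 0
```

After the load, F6 holds the word at address 200 and F7 the word at address 204, mirroring the slide's semantics.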

Advanced Computer Architecture 9


LATENCY AND INITIATION INTERVAL
▪ The multi-cycle arithmetic units are often
pipelined to allow overlapped operations and
hence improve performance.
▪ Definition:
▪ Latency: the number of cycles between an
instruction producing a result and another
instruction using it.
▪ Initiation Interval: the number of cycles that
must elapse between issuing two operations of the
same type.
Advanced Computer Architecture 10
TYPICAL VALUES ASSUMED

Function Unit                     Latency   Initiation Interval
Integer ALU                          0               1
Data Memory (Integer/FP Load)        1               1
FP Add/Subtract                      3               1
FP Multiply                          6               1
FP Divide                           24              25

 Assumptions on number of EX stages:
 FP Add/Subtract: 4
 FP Multiply: 7
 FP Divide: 1 (not pipelined)

 It is possible to have up to:
 4 outstanding FP Add/Subtract
 7 outstanding FP Multiply
 1 FP Divide
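The latencies in the table determine how many stalls a dependent instruction suffers. A small helper makes the arithmetic concrete (a sketch; the operation names are illustrative labels for the table rows):

```python
# Latencies from the table: cycles a dependent instruction must wait after
# the producing instruction before it can use the result
LATENCY = {"int_alu": 0, "load": 1, "fp_add": 3, "fp_mul": 6, "fp_div": 24}

def stalls(producer, issue_gap):
    """Stall cycles when the consumer issues issue_gap cycles after the producer."""
    return max(0, LATENCY[producer] - (issue_gap - 1))

# An ADD.D issued immediately after the MUL.D it depends on stalls 6 cycles
mul_then_add = stalls("fp_mul", 1)
```

Separating the dependent pair by more cycles (for example by scheduling other work between them) reduces or removes the stalls.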
Advanced Computer Architecture 11
MULTI-CYCLE PIPELINE STRUCTURE

Advanced Computer Architecture 12


SOME SCENARIOS

Out of order completion of instructions

Stalls arising due to RAW Hazards

Advanced Computer Architecture 13


SOME SCENARIOS

▪ Three instructions are trying to write into the FP register bank simultaneously.
▪ WAW hazard for the last two conflicting instructions (both writing F6).

▪ No conflict in MEM as only the last instruction accesses memory.

Advanced Computer Architecture 14


EXPLOITING INSTRUCTION
LEVEL PARALLELISM

Advanced Computer Architecture 15


INTRODUCTION
▪ To keep the pipeline full, we try to exploit parallelism among
instructions
▪ Sequence of unrelated instructions that can be overlapped without
causing hazard
▪ Related instructions must be separated by appropriate number of clock
cycles equal to the pipeline latency between the pair of instructions

Advanced Computer Architecture 16


INTRODUCTION
▪ In addition, branches have one clock cycle delay.
▪ The functional units are fully pipelined (except
division), such that an operation can be issued on
every clock cycle.
▪ As an alternative, the functional units can also be
replicated.
▪ We now look at a simple compiler technique that
can create additional parallelism between
instructions.
▪ Help in reducing pipeline penalty
Advanced Computer Architecture 17
EXAMPLE
MIPS32 Code

Add a scalar s to a vector x

9 clock cycles per iteration (with 4 stalls)

Our program has 1000 iterations and there are 9
clock cycles per iteration, so there are a total of
9000 clock cycles.

Advanced Computer Architecture 18


EXAMPLE
▪ We now carry out instruction scheduling.
▪ Moving instructions around and making necessary changes to reduce stalls

Our program has 1000 iterations and now there
are 7 clock cycles per iteration, so there are a
total of 7000 clock cycles.

7 clock cycles per iteration (with 2 stalls)

Advanced Computer Architecture 19


LOOP UNROLLING
▪ We now carry out loop unrolling
▪ Replicating the body of the loop
multiple times, so that the loop
overhead per iteration reduces.
Unroll loop
3 times

▪ We use different registers for each


iteration.
▪ Number of stalls per loop = 3x4+1 = 13
▪ Clock cycles per loop = 14 + 13 = 27; cycles per iteration = 27/4 = 6.75
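The transformation itself can be illustrated in plain Python rather than MIPS32 (a sketch; it assumes the vector length is a multiple of 4, matching the unroll factor):

```python
def add_scalar(x, s):
    """Original loop: one element per iteration."""
    return [xi + s for xi in x]

def add_scalar_unrolled(x, s):
    """Body replicated 4 times; the loop overhead is paid once per 4 elements."""
    y = [0.0] * len(x)
    i = 0
    while i < len(x):
        y[i]     = x[i]     + s
        y[i + 1] = x[i + 1] + s
        y[i + 2] = x[i + 2] + s
        y[i + 3] = x[i + 3] + s
        i += 4
    return y
```

Both functions compute the same result; the unrolled version trades code size for fewer branch and index-update instructions per element.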

Advanced Computer Architecture 20


SCHEDULING THE UNROLLED LOOP

Scheduling
the unrolled
loop

No stalls. 14/4 = 3.5 cycles per iteration

Advanced Computer Architecture 21


LOOP UNROLLING: SUMMARY
▪ Lowering the total clock cycles from 9000 → 7000 →
3500
▪ Loop unrolling can expose more parallelism in
instructions that can be scheduled
▪ Effective way of improving pipeline performance
▪ Can be used to lower the CPI (to less than 1) in
architectures where more than one instruction can be
issued per cycle
a. Super Pipeline Architecture
b. Superscalar Architecture
c. Very Long Instruction Word (VLIW) architecture

Advanced Computer Architecture 22


SUPER PIPELINE ARCHITECTURE
▪ Super pipelining is the breaking of the stages of a given pipeline into
smaller stages (thus making the pipeline deeper) in an attempt to shorten
the clock period, and hence enhance instruction throughput by
keeping more and more instructions in flight at a time.

Advanced Computer Architecture 23


SUPERSCALAR ARCHITECTURE
▪ Superscalar Machines:
▪ Machines that can issue multiple independent
instructions per clock cycle when they are properly
scheduled by the compiler.
▪ Can result in a CPI of less than 1.
▪ How does it work ?
▪ The hardware can issue a small number (say 2 to 4) of
independent instructions in every clock cycle.
▪ The hardware checks for conflicts between
instructions.
▪ If the instructions are dependent, then only the first
instruction in the sequence will be issued.
Advanced Computer Architecture 24
SUPERSCALAR ARCHITECTURE SCHEMATIC

Advanced Computer Architecture 25


EXAMPLE
▪ Suppose two instructions can be issued every clock cycle.
a. One can be a load, store, branch or integer ALU operation.
b. The other can be a floating point operation.
▪ Used only for illustration.
▪ We have not shown how FP operations extend the EX cycle.
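A toy issue check captures the pairing rule described above (a sketch; the assumption is that slot 1 takes a load/store/branch/integer ALU operation and slot 2 an FP operation):

```python
# Operation classes for the two issue slots of the example machine
INT_SLOT = {"load", "store", "branch", "int_alu"}
FP_SLOT = {"fp_add", "fp_sub", "fp_mul", "fp_div"}

def can_dual_issue(op1, op2):
    """True if the pair fits the machine's (integer, FP) issue slots."""
    return op1 in INT_SLOT and op2 in FP_SLOT

pair_ok = can_dual_issue("load", "fp_add")     # one of each class
pair_bad = can_dual_issue("fp_add", "fp_mul")  # two FP operations: no
```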

Advanced Computer Architecture 26


INSTRUCTION DEPENDENCY CHECK
a. Can be checked dynamically by the hardware
(Superscalar Architecture)
b. The compiler can take the complete
responsibility of creating packets of
instructions that can be simultaneously
issued.
▪ Hardware does not dynamically take any decision
about multiple issues
▪ Also referred to as VLIW architecture

Advanced Computer Architecture 27


SOME ISSUES
▪ If we issue an integer and a FP operation in parallel, the
need for additional hardware is minimized.
▪ Different register set and functional units are used
▪ Only conflict is when the integer instruction is a FP load,
store or move.
▪ This creates contention for the FP register ports and can be
treated as a structural hazard.
▪ In the original MIPS32 pipeline, load instructions have a
latency of 1.
▪ In the superscalar version, the next 3 instructions cannot
use the result of the load without stalling.
▪ Branch delay also becomes 3 cycles.
Advanced Computer Architecture 28
VLIW ARCHITECTURE
▪ In a Very Long Instruction Word (VLIW) machine, an
instruction word is typically hundreds of bits in
length.
▪ Specifies a number of basic operations / instructions, each
using different functional units.
▪ Multiple functional units are used concurrently when a
VLIW “macro-instruction” is being executed.
▪ All the functional units share a common register file.
▪ Similar to the superscalar architecture in concept, but the
responsibility of identifying the set of instructions that
can execute concurrently lies with the compiler.
Advanced Computer Architecture 29
VLIW ARCHITECTURE SCHEMATIC

Advanced Computer Architecture 30


EXAMPLE
▪ We try to schedule this
unrolled program code on a
VLIW processor, assuming
that there are 4 functional
units:
▪ Two memory reference units
(to handle LOAD and
STORE).
▪ One floating point
arithmetic unit (Only the
ADD instruction).
▪ One integer operation and
branch unit (ADDI and
BNE).

Advanced Computer Architecture 31


SCHEDULING ON A VLIW PROCESSOR

The L.D instructions are issued on the two load/store units, followed by the two
ADD.D instructions. Before the S.D stores there is a delay of 2 cycles. The 4
iterations complete in 8 cycles.

Clock cycles/Iteration = 8/4 = 2.0
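Packing operations into a single long-instruction word can be sketched with a greedy slot filler (an illustration only, not a real scheduler; the slot counts match the 4 functional units assumed above — 2 memory, 1 FP, 1 integer/branch):

```python
def pack_word(ops):
    """ops: list of (unit, name). Fill one VLIW word; return (word, leftover)."""
    slots = {"mem": 2, "fp": 1, "int": 1}   # per-word slot capacities
    word, leftover = [], []
    for unit, name in ops:
        if slots.get(unit, 0) > 0:
            slots[unit] -= 1
            word.append(name)
        else:
            leftover.append((unit, name))   # must wait for the next word
    return word, leftover

word, rest = pack_word([("mem", "L.D F0"), ("mem", "L.D F6"),
                        ("mem", "L.D F10"), ("fp", "ADD.D F4")])
```

Only two of the three loads fit in one word; the third spills to the next word, which is exactly the kind of decision a VLIW compiler makes statically.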


Advanced Computer Architecture 32
VECTOR AND ARRAY PROCESSORS

Advanced Computer Architecture 33


VECTOR PROCESSOR
▪ Provide high level instructions that operate on entire
arrays of numbers (called vectors).
▪ A single vector instruction is equivalent to an entire
loop.
▪ No loop overheads are required.
▪ Example:
▪ A, B and C are three vectors containing 64 numbers each.
▪ The three vectors are mapped to vector registers V1, V2, V3
(say).
▪ The following vector instruction computes Ci = Ai + Bi
▪ ADDV V3, V1, V2
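The element-wise semantics of the vector add — one instruction standing in for a 64-iteration scalar loop — can be modeled directly (vector registers modeled as Python lists; a sketch, not vector ISA syntax):

```python
def addv(va, vb):
    """Semantics of a vector add: element-wise sum of two vector registers."""
    return [a + b for a, b in zip(va, vb)]

c = addv([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```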

Advanced Computer Architecture 34


BASIC VECTOR PROCESSOR ARCHITECTURE
▪A vector processor typically consists of an
ordinary pipelined scalar unit plus a
vector unit
▪All functional units within the vector unit
are deeply pipelined, resulting in a shorter
clock cycle time (CCT).
▪Deep pipelining of vector operations does not result
in hazards, since every element computation is
independent of the others.
Advanced Computer Architecture 35
VECTOR PROCESSOR ARCHITECTURE SCHEMATIC

Advanced Computer Architecture 36


VECTOR PROCESSOR ISA
▪ Vector Registers
▪ There are 8 vector registers V0, V1 … V7.
▪ Each vector register can hold 64 double-precision numbers.
▪ Each vector register has 2 read ports and 1 write port, to allow overlapping
operations.
▪ Vector functional units
▪ Each functional unit is fully pipelined and can start a new operation every clock
cycle.
▪ A hardware control unit detects hazards (conflicts for functional units and for
register accesses), and inserts stalls as required
▪ Vector Load/Store Unit:
▪ The load/store unit is fully pipelined and allows fast loading and storing of vectors.
▪ The memory system is also deeply interleaved to allow parallel access.
▪ After an initial latency (which indicates the access time of the memory), one word
can be accessed per clock cycle.

Advanced Computer Architecture 37


VECTOR PROCESSOR ISA
▪ Scalar registers:
▪ These are normal scalar and floating point registers.
▪ Can be used to provide data as input to the vector functional units, as
well as compute memory addresses for vector load/store.
▪ Vector control registers
▪ Vector Mask Register (VMASK)
▪ Indicates which elements of vector to operate on
▪ Vector length register (VLEN)
▪ Need to operate on vectors of different lengths
▪ Vector stride register (VSTR)
▪ Elements of a vector might be stored apart from each other in memory
▪ Stride: distance between two elements of a vector

Advanced Computer Architecture 38


EXAMPLE 1
▪ Consider the SAXPY or
DAXPY vector operation: Y MIPS32
= a*X + Y where X and Y Code
are vectors each of size 64
and a is a scalar.
▪ Rx contains starting address
of X
Vector
▪ Ry contains starting address Processor
of Y Code
▪ R1 contains the address of
the scalar ‘a’.
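A scalar reference for DAXPY shows what the short vector sequence in the figure computes (a sketch; the exact vector instructions are in the slide's figure, not reproduced here):

```python
def daxpy(a, x, y):
    """Y = a*X + Y, element-wise — the loop a few vector instructions replace."""
    return [a * xi + yi for xi, yi in zip(x, y)]

y = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```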
Advanced Computer Architecture 39
SOME PROPERTIES
▪ The vector processor greatly reduces the dynamic
instruction bandwidth (number of instructions actually
getting executed) (from 514 to 6).
▪ The frequency of pipeline interlocks is greatly reduced.
▪ In the original MIPS32 version, every ADD.D must wait for
MUL.D, and S.D must wait for ADD.D.
▪ In the vector processor version, pipeline stalls are required
once per vector operation, rather than once per vector
element.
▪ Pipeline stall frequency is reduced almost 64 times.

Advanced Computer Architecture 40


VECTOR START-UP AND INITIATION RATE
▪ The running time of each vector operation in the vector processor
has two components:
a. Start-up Time: arises due to the pipeline latency of the vector
operation.
▪ After how much time the first result will be available.
▪ Mainly determined by the depth of the pipeline
▪ A latency of 8 clock cycles means that the operation takes 8 cycles, and
there are 8 stages in the pipeline.
b. Initiation Rate: time per result once the vector instruction is
running.
▪ Usually 1 per clock cycle for individual operations.

▪ Total time to complete a vector operation of length n (n ≤ 64) is:
▪ Start-up Time + (n × Initiation Rate)
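The formula above is simple enough to compute directly (the 8-cycle start-up value is the slide's own example; the initiation rate of 1 result per cycle is the usual assumption stated earlier):

```python
def vector_op_cycles(n, startup, initiation=1):
    """Total cycles = start-up time + n × initiation rate (slide formula)."""
    return startup + n * initiation

# An 8-deep pipeline producing one result per cycle, full vector of length 64
total = vector_op_cycles(64, startup=8)
```

For long vectors the start-up cost is amortized: 72 cycles for 64 results is close to the ideal of one result per cycle.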

Advanced Computer Architecture 41


VECTOR CHAINING
▪ Vector chaining: Data forwarding from one vector functional unit to another

V1 V2 V3 V4 V5
LV v1
MULV v3,v1,v2
ADDV v5, v3, v4

Chain Chain

Load Unit
Mult. Add

Memory
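The benefit of chaining can be sketched with a toy cycle model (an illustration under the assumption that each unit has a fixed start-up time and then produces one element per cycle; the start-up values used are examples, not measured figures):

```python
def unchained(n, s1, s2):
    """Second op waits for the first to finish all n elements."""
    return (s1 + n) + (s2 + n)

def chained(n, s1, s2):
    """Second op consumes each element the cycle it is produced."""
    return s1 + s2 + n

slow = unchained(64, 7, 6)   # multiply fully drains before the add starts
fast = chained(64, 7, 6)     # add is chained onto the multiply's output
```

Chaining turns two back-to-back vector passes into one longer pipeline, paying each start-up once but the per-element cost only once overall.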

Advanced Computer Architecture 42


OTHER VECTOR PROCESSING CONCEPTS
▪ Vector Length Register
▪ Specifies the length of any vector operation. For example, to operate on
only the first 30 elements of a 64-element vector register, this register is used.
▪ Loading and storing vectors with stride
▪ Vector elements are stored in memory with uniform spacing between
elements.
▪ Adjacent elements of a vector are not sequential in memory.
▪ Strip mining
▪ How to split loops if the original loop handles vectors that are larger
than that supported by the hardware?
▪ Suppose the vector register is of length 64 and the program operates on a
vector of 200 elements. We run the full-length loop 3 times (covering 192
elements), and the remaining 8 elements are handled using strip mining as
the fourth operation.
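The splitting rule can be written out explicitly (a sketch; the maximum vector length of 64 matches the register length assumed throughout these slides):

```python
MVL = 64  # maximum vector length supported by the hardware

def strips(n, mvl=MVL):
    """Split an n-element operation into (start, length) strips of at most mvl."""
    out, start = [], 0
    while start < n:
        length = min(mvl, n - start)
        out.append((start, length))
        start += length
    return out

# 200 elements: three full-length strips plus an 8-element remainder
parts = strips(200)
```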
Advanced Computer Architecture 43
SIMD PROCESSING
▪Single instruction operates on multiple data
elements
▪ In time or in space
▪Multiple processing elements
▪Time-space duality
▪ Array processor: Instruction operates on multiple
data elements at the same time
▪ Vector processor: Instruction operates on multiple
data elements in consecutive time steps
Advanced Computer Architecture 44
ARRAY VS. VECTOR PROCESSORS

Instruction stream (4-element vector):
  LD  VR ← A[3:0]
  ADD VR ← VR, 1
  MUL VR ← VR, 2
  ST  A[3:0] ← VR

Array processor — same op at the same time (across space), different ops over time:
  t1: LD0 LD1 LD2 LD3
  t2: AD0 AD1 AD2 AD3
  t3: MU0 MU1 MU2 MU3
  t4: ST0 ST1 ST2 ST3

Vector processor — same op in consecutive time steps, different ops at the same time:
  t1: LD0
  t2: LD1 AD0
  t3: LD2 AD1 MU0
  t4: LD3 AD2 MU1 ST0
  t5:     AD3 MU2 ST1
  t6:         MU3 ST2
  t7:             ST3
Advanced Computer Architecture 45
SIMD ARRAY PROCESSING VS. VLIW
VLIW Array processor

Advanced Computer Architecture 46


MULTICORE PROCESSORS

Advanced Computer Architecture 47


MULTI-CORE PROCESSORS
▪ A processing system composed of two or more independent
cores or CPUs.
▪ The cores are typically integrated onto a single integrated
circuit die, or they may be integrated on multiple dies in a
single chip package.
▪ Cores share memory:
▪ In modern multi-core systems, typically the L1 and L2 cache are
private to each core, while the L3 cache is shared among the cores.
▪ In symmetric multi-core systems, all the cores are identical.
▪ Example: multi-core processors used in computer systems
▪ In asymmetric multi-core systems, the cores may have
different functionalities.
Advanced Computer Architecture 48
WHY MULTI-CORES
▪ It is difficult to sustain Moore’s law and at the same
time meet performance demand of various
applications.
▪ Difficult to increase clock frequency, mainly due to power
consumption issues.
▪ Possible solution:
▪ Replicate hardware and run them at a lower clock rate to
reduce power consumption.
▪ 1 core running at 3 GHz has the same performance as 2
cores running at 1.5 GHz with lower power consumption.
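The argument rests on how dynamic power scales. A rough model (a sketch: P ∝ cores × V² × f, with the assumption that a lower clock permits a lower supply voltage; the scaling factors below are illustrative, not measured):

```python
def relative_power(cores, freq, voltage):
    """Dynamic power relative to a baseline, using P ∝ cores × V² × f."""
    return cores * voltage ** 2 * freq

one_fast = relative_power(1, 1.0, 1.0)   # baseline: one core at full clock
two_slow = relative_power(2, 0.5, 0.8)   # two cores at half clock, lower V
```

In this model the two slower cores deliver comparable aggregate throughput at well under the baseline power, which is the multi-core argument in miniature.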

Advanced Computer Architecture 49


TAXONOMY OF PARALLEL ARCHITECTURES
(FLYNN’S CLASSIFICATION)
▪ Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966

▪ Single instruction-stream single data-stream (SISD).


▪ Traditional uniprocessor systems.

▪ Multiple instruction-stream single data-stream (MISD).


▪ No commercial implementation exists.
▪ Closest form: systolic array processor, streaming processor

▪ Single instruction-stream multiple data-stream (SIMD).


▪ Array and vector processors.

▪ Multiple instruction-stream multiple data-stream (MIMD).


▪ Multiprocessor systems (various architectures exist).
▪ Multithreaded processor

Advanced Computer Architecture 50


FLYNN’S CLASSIFICATION
[Figure: stream diagrams for the four classes — SISD: one instruction stream
(I1 I2 I3) on one data stream (D1 D2 D3); SIMD: one instruction stream on
multiple data streams (D11 … Dm3); MISD: multiple instruction streams
(I11 … I3m) on one data stream; MIMD: multiple instruction streams on
multiple data streams]
Advanced Computer Architecture 51


SINGLE-CORE COMPUTER
▪ Falls under SISD
Category.
▪ Typically two busses:
▪ A high-speed CPU–memory bus, which also connects to the I/O bridge.
▪ A lower-speed I/O bus, connecting various peripherals.

Advanced Computer Architecture 52


SINGLE-CORE PROCESSOR

Advanced Computer Architecture 53


MOTHER BOARD ARCHITECTURE
▪ Chipset consisting of north bridge and south bridge

Advanced Computer Architecture 54


MOTHER BOARD VIEW

Advanced Computer Architecture 55


MULTI-CORE ARCHITECTURE

Advanced Computer Architecture 56


TRADITIONAL MULTIPROCESSOR ARCHITECTURES
▪ Tightly coupled multiprocessors:
▪ The processors access common shared memory.
▪ Inter-processor communication takes place through shared
memory.
▪ Multi-core architectures fall under this category.
▪ Loosely coupled multiprocessors (cluster computers):
▪ Memory is distributed among the processors.
▪ Processors typically communicate through a high speed
interconnection network.

Advanced Computer Architecture 57


TIGHTLY COUPLED MULTIPROCESSORS

▪ Difficult to extend to large number of processors.


▪ Memory bandwidth requirement increases with the number of processors.
▪ Memory access time for all processors is uniform
▪ Called uniform memory access (UMA)

Advanced Computer Architecture 58


LOOSELY COUPLED MULTIPROCESSORS

▪ Cost-effective way to scale memory bandwidth.


▪ Communicating data between processors is complex and has higher latency.
▪ Memory access time depends on the location of the data.
▪ Called Non Uniform Memory Access(NUMA).

Advanced Computer Architecture 59


THANK YOU

Advanced Computer Architecture 60
