1 Vector Processing: Solutions

The document presents solutions to problems on vector processing, SIMD processing, and GPUs. 1. It provides assembly code for a loop that operates on arrays A, B, C, and D, and counts the cycles needed on different machines: 6005 cycles for the scalar code, 690 cycles on a vector processor without chaining, 482 cycles with chaining and 1 port to memory, and 215 cycles with chaining, 2 read ports, and 1 write port. 2. It analyzes a vector computer with specific instruction latencies. The minimum power-of-two number of memory banks needed to avoid stalls is 64. It then works through the execution of a short program on this machine.

Uploaded by

Phani Kumar

SOLUTIONS

1 Vector Processing
Consider the following piece of code:
for (i = 0; i < 100; i ++)
A[i] = ((B[i] * C[i]) + D[i])/2;
(a) Translate this code into assembly language using the following instructions in the ISA (note the number
of cycles each instruction takes is shown next to each instruction):

Opcode  Operands         Number of Cycles  Description
LEA     Ri, X            1                 Ri ← address of X
LD      Ri, Rj, Rk       11                Ri ← MEM[Rj + Rk]
ST      Ri, Rj, Rk       11                MEM[Rj + Rk] ← Ri
MOVI    Ri, Imm          1                 Ri ← Imm
MUL     Ri, Rj, Rk       6                 Ri ← Rj × Rk
ADD     Ri, Rj, Rk       4                 Ri ← Rj + Rk
ADD     Ri, Rj, Imm      4                 Ri ← Rj + Imm
RSHFA   Ri, Rj, amount   1                 Ri ← RSHFA (Rj, amount)
BRcc    X                1                 Branch to X based on condition codes

Assume one memory location is required to store each element of the array. Also assume that there are
8 registers (R0 to R7).
Condition codes are set after the execution of an arithmetic instruction. You can assume typically
available condition codes such as zero, positive, and negative.

MOVI R1, 99 // 1 cycle
LEA R0, A // 1 cycle
LEA R2, B // 1 cycle
LEA R3, C // 1 cycle
LEA R4, D // 1 cycle
LOOP:
LD R5, R2, R1 // 11 cycles
LD R6, R3, R1 // 11 cycles
MUL R7, R5, R6 // 6 cycles
LD R5, R4, R1 // 11 cycles
ADD R6, R7, R5 // 4 cycles
RSHFA R7, R6, 1 // 1 cycle
ST R7, R0, R1 // 11 cycles
ADD R1, R1, -1 // 4 cycles
BRGEZ R1 LOOP // 1 cycle

How many cycles does it take to execute the program?
5 + 100 × 60 = 6005
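The count can be checked with a short calculation. The following Python sketch (the names are illustrative and not part of the original solution) sums the latencies from the instruction table, assuming instructions execute strictly one after another with no overlap.

```python
# Latencies (in cycles) taken from the scalar ISA table above.
LATENCY = {"LEA": 1, "LD": 11, "ST": 11, "MOVI": 1,
           "MUL": 6, "ADD": 4, "RSHFA": 1, "BR": 1}

# Instruction mix of the program: 5 setup instructions, then the loop body.
SETUP = ["MOVI", "LEA", "LEA", "LEA", "LEA"]
LOOP_BODY = ["LD", "LD", "MUL", "LD", "ADD", "RSHFA", "ST", "ADD", "BR"]
ITERATIONS = 100


def scalar_cycles():
    """Total cycles, assuming strictly sequential, non-overlapped execution."""
    setup = sum(LATENCY[op] for op in SETUP)              # 5 cycles
    per_iteration = sum(LATENCY[op] for op in LOOP_BODY)  # 60 cycles
    return setup + ITERATIONS * per_iteration


print(scalar_cycles())  # 6005
```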

(b) Now write Cray-like vector assembly code to perform this operation in the shortest time possible. Assume
that there are 8 vector registers and the length of each vector register is 64. Use the following instructions
in the vector ISA:

Opcode  Operands         Number of Cycles  Description
LD      Vst, #n          1                 Vst ← n (Vst = Vector Stride Register)
LD      Vln, #n          1                 Vln ← n (Vln = Vector Length Register)
VLD     Vi, X            11, pipelined
VST     Vi, X            11, pipelined
Vmul    Vi, Vj, Vk       6, pipelined
Vadd    Vi, Vj, Vk       4, pipelined
Vrshfa  Vi, Vj, amount   1

LD Vln, 50
LD Vst, 1
VLD V1, B
VLD V2, C
VMUL V4, V1, V2
VLD V3, D
VADD V6, V4, V3
VRSHFA V7, V6, 1
VST V7, A

VLD V1, B + 50
VLD V2, C + 50
VMUL V4, V1, V2
VLD V3, D + 50
VADD V6, V4, V3
VRSHFA V7, V6, 1
VST V7, A + 50

How many cycles does it take to execute the program on the following processors? Assume that memory
is 16-way interleaved.

(i) Vector processor without chaining, 1 port to memory (1 load or store per cycle).
The third load (VLD) can overlap with the multiply (VMUL), since the two are independent. However, as there is only one port to memory and no chaining, the other operations cannot be pipelined.

Processing the first 50 elements takes 346 cycles, as shown below. The second row (third VLD, VADD, VRSHFA, VST) begins one cycle after the VMUL issues, so the total is 1 + 1 + 60 + 60 + 1 + 60 + 53 + 50 + 60 = 346.

| 1 | 1 | 11 | 49 | 11 | 49 | 6 | 49 |
| 11 | 49 | 4 | 49 | 1 | 49 | 11 | 49 |

Processing the next 50 elements takes 344 cycles, as shown below (no need to initialize Vln and Vst, as they stay at the same value).

| 11 | 49 | 11 | 49 | 6 | 49 |
| 11 | 49 | 4 | 49 | 1 | 49 | 11 | 49 |

Therefore, the total number of cycles to execute the program is 690 cycles.
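The 346, 344, and 690 figures can be reproduced with a small Python sketch. It is only an illustration: the function names are made up here, and it assumes that the third VLD issues one cycle after the VMUL and runs alongside it, which is what the 346-cycle figure implies.

```python
VLEN = 50  # elements per strip (100 elements processed as 50 + 50)


def vector_op(startup, vlen=VLEN):
    """Cycles for one pipelined vector instruction:
    startup latency for the first element, then one element per cycle."""
    return startup + (vlen - 1)


VLD = vector_op(11)    # 60 cycles
VST = vector_op(11)    # 60 cycles
VMUL = vector_op(6)    # 55 cycles
VADD = vector_op(4)    # 53 cycles
VRSHFA = vector_op(1)  # 50 cycles


def strip_cycles_no_chaining(init_cycles):
    """One 50-element strip without chaining and with one memory port."""
    two_loads = VLD + VLD  # the loads of B and C serialize on the single port
    # Assumption: the load of D issues 1 cycle after VMUL and overlaps with it.
    mul_and_third_load = max(VMUL, 1 + VLD)
    return init_cycles + two_loads + mul_and_third_load + VADD + VRSHFA + VST


first = strip_cycles_no_chaining(init_cycles=2)   # includes LD Vln and LD Vst
second = strip_cycles_no_chaining(init_cycles=0)  # Vln and Vst already set
total = first + second
```

Here `first`, `second`, and `total` evaluate to 346, 344, and 690.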

(ii) Vector processor with chaining, 1 port to memory.

In this case, the first two loads cannot be pipelined, as there is only one port to memory, and the third load has to wait until the second load has completed. However, the machine supports chaining, so all other operations can be pipelined.

Processing the first 50 elements takes 242 cycles, as shown below:

| 1 | 1 | 11 | 49 | 11 | 49 |
| 6 | 49 |
| 11 | 49 |
| 4 | 49 |
| 1 | 49 |
| 11 | 49 |

Processing the next 50 elements takes 240 cycles (same timeline as above, but without the first 2 instructions to initialize Vln and Vst).

Therefore, the total number of cycles to execute the program is 482 cycles.
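One way to see where 242 comes from: with a single memory port, the three loads and the store must run back to back, and the chained arithmetic hides underneath them. The Python sketch below encodes that reading. It is an illustrative model with hypothetical names, not the original solution's method.

```python
VLEN = 50  # elements per strip


def vector_op(startup, vlen=VLEN):
    """Startup latency for the first element, then one element per cycle."""
    return startup + (vlen - 1)


MEM_OP = vector_op(11)  # a VLD or a VST occupies the memory port for 60 cycles


def strip_cycles_chained_one_port(init_cycles):
    """One 50-element strip with chaining and a single memory port."""
    # Three loads and one store serialize on the port.
    memory_bound = 4 * MEM_OP
    # Chained VADD (4) and VRSHFA (1) produce their first result 11 + 4 + 1
    # cycles after the third load starts, well before the port frees up for
    # the store, so the arithmetic adds no cycles of its own.
    first_result_ready = 11 + 4 + 1
    assert first_result_ready <= MEM_OP
    return init_cycles + memory_bound


first = strip_cycles_chained_one_port(init_cycles=2)   # 242
second = strip_cycles_chained_one_port(init_cycles=0)  # 240
total = first + second                                 # 482
```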

(iii) Vector processor with chaining, 2 read ports and 1 write port to memory.
Assuming an in-order pipeline.

The first two loads can also be pipelined, as there are two ports to memory. The third load has to wait until the first two loads complete. However, the two loads for the second 50 elements can proceed in parallel with the store.

| 1 | 1 | 11 | 49 |
| 1 | 11 | 49 |
| 6 | 49 |
| 11 | 49 |
| 4 | 49 |
| 1 | 49 |
| 11 | 49 |
| 11 | 49 |
| 11 | 49 |
| 6 | 49 |
| 11 | 49 |
| 4 | 49 |
| 1 | 49 |
| 11 | 49 |

Therefore, the total number of cycles to execute the program is 215 cycles.

2 More Vector Processing
You are studying a program that runs on a vector computer with the following latencies for various
instructions:
• VLD and VST: 50 cycles for each vector element; fully interleaved and pipelined.
• VADD: 4 cycles for each vector element (fully pipelined).
• VMUL: 16 cycles for each vector element (fully pipelined).
• VDIV: 32 cycles for each vector element (fully pipelined).
• VRSHF (right shift): 1 cycle for each vector element (fully pipelined).
Assume that:
• The machine has an in-order pipeline.
• The machine supports chaining between vector functional units.
• In order to support 1-cycle memory access after the first element in a vector, the machine interleaves
vector elements across memory banks. All vectors are stored in memory with the first element mapped
to bank 0, the second element mapped to bank 1, and so on.
• Each memory bank has an 8 KB row buffer.
• Vector elements are 64 bits in size.
• Each memory bank has two ports (so that two loads/stores can be active simultaneously), and there
are two load/store functional units available.
(a) What is the minimum power-of-two number of banks required in order for memory accesses to never
stall? (Assume a vector stride of 1.)
64 banks (because memory latency is 50 cycles and the next power of two is 64)
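The reasoning is that, with a stride of 1, consecutive elements go to consecutive banks at one access per cycle, and each bank stays busy for the full access latency. The bank count must therefore be at least the latency, rounded up to a power of two. A minimal Python sketch of that rule (the function name is illustrative):

```python
def min_power_of_two_banks(latency_cycles):
    """Smallest power of two that is >= the per-element memory latency."""
    banks = 1
    while banks < latency_cycles:
        banks *= 2
    return banks


banks_needed = min_power_of_two_banks(50)  # 64
```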

(b) The machine (with as many banks as you found in part a) executes the following program (assume that
the vector stride is set to 1):

VLD V1 ← A
VLD V2 ← B
VADD V3 ← V1, V2
VMUL V4 ← V3, V1
VRSHF V5 ← V4, 2

It takes 111 cycles to execute this program. What is the vector length?
VLD   |----50----|---(VLEN-1)---|
VLD   |1|----50----|
VADD  |-4-|
VMUL  |-16-|
VRSHF |1|---(VLEN-1)---|

1 + 50 + 4 + 16 + 1 + (VLEN - 1) = 71 + VLEN = 111 → VLEN = 40 elements
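As a quick check of the algebra, the sketch below evaluates the chained timeline for a given vector length and searches for the length that gives 111 cycles. The function names and the search bound of 1024 are arbitrary choices made for this illustration.

```python
def chained_cycles(vlen):
    """Cycles for the five-instruction program with chaining.

    1       : the second VLD issues one cycle after the first
    50      : memory latency until the first element of B arrives
    4, 16, 1: startup latencies of the chained VADD, VMUL, and VRSHF
    vlen - 1: remaining elements stream out at one per cycle
    """
    return 1 + 50 + 4 + 16 + 1 + (vlen - 1)


def solve_vlen(total_cycles, max_vlen=1024):
    """Return the vector length that yields total_cycles, or None."""
    for vlen in range(1, max_vlen + 1):
        if chained_cycles(vlen) == total_cycles:
            return vlen
    return None


vlen = solve_vlen(111)  # 40
```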

If the machine did not support chaining (but could still pipeline independent operations), how many
cycles would be required to execute the same program?
VLD   |----50----|---(VLEN-1)---|
VLD   |1|----50----|---(VLEN-1)---|
VADD  |-4-|---(VLEN-1)---|
VMUL  |-16-|---(VLEN-1)---|
VRSHF |1|---(VLEN-1)---|

50 + 1 + 4 + 16 + 1 + 4 × (VLEN - 1) = 68 + 4 × VLEN = 228 cycles
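Without chaining, each dependent instruction must wait for its producer to finish all elements, so the (VLEN - 1) term is paid four times. The following Python sketch adds up the stages under that assumption (the names are illustrative):

```python
def unchained_cycles(vlen):
    """Cycles for the program when dependent instructions cannot chain."""
    # The two independent loads overlap; the second starts 1 cycle later.
    loads = 1 + 50 + (vlen - 1)
    vadd = 4 + (vlen - 1)    # starts only after both loads complete
    vmul = 16 + (vlen - 1)   # starts only after VADD completes
    vrshf = 1 + (vlen - 1)   # starts only after VMUL completes
    return loads + vadd + vmul + vrshf


cycles = unchained_cycles(40)  # 228
```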

(c) The architect of this machine decides that she needs to cut costs in the machine’s memory system. She
reduces the number of banks by a factor of 2 from the number of banks you found in part (a) above.
Because loads and stores might stall due to bank contention, an arbiter is added to each bank so that
pending loads from the oldest instruction are serviced first. How many cycles does the program take to
execute on the machine with this reduced-cost memory system (but with chaining)?

VLD   [0]  |----50----|            bank 0 (takes port 0)
      ...
      [31] |--31--|----50----|     bank 31
      [32] |---50---|              bank 0 (takes port 0)
      ...
      [39] |--7--|                 bank 7
VLD   [0]  |1|----50----|          bank 0 (takes port 1)
      ...
      [31] |1|--31--|----50----|   bank 31
      [32] |---50----|             bank 0 (takes port 1)
      ...
      [39] |--7--|                 bank 7
VADD  |--4--|                      (tracking last elements)
VMUL  |--16--|
VRSHF |1|

(B[39]: 1 + 50 + 50 + 7) + 4 + 16 + 1 = 129 cycles
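The bank contention can also be simulated element by element. The Python sketch below is a simplified model with hypothetical names: each load issues at most one element per cycle in order, an element waits if its bank is still busy, and the two loads do not interfere because each uses its own port on every bank.

```python
LATENCY = 50  # cycles per memory access
VLEN = 40     # vector length found in part (b)


def last_element_done(banks, issue_cycle, vlen=VLEN, latency=LATENCY):
    """Cycle at which the last element of one vector load returns."""
    bank_free = [0] * banks  # cycle at which each bank (on this port) is free
    next_issue = issue_cycle
    done = 0
    for i in range(vlen):
        bank = i % banks                      # elements interleave across banks
        start = max(next_issue, bank_free[bank])
        done = start + latency
        bank_free[bank] = done
        next_issue = start + 1                # in-order, one element per cycle
    return done


def program_cycles(banks):
    """Total cycles: the second load's last element, then the chained tail."""
    second_load_done = last_element_done(banks, issue_cycle=1)
    return second_load_done + 4 + 16 + 1      # VADD, VMUL, VRSHF startup


cycles_32_banks = program_cycles(32)  # 129
cycles_64_banks = program_cycles(64)  # 111, the stall-free case from part (b)
```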

Now, the architect reduces cost further by reducing the number of memory banks (to a lower power of
2). The program executes in 279 cycles. How many banks are in the system?

VLD   [0]  |---50---|
      ...
      [8]  |---50---|
      ...
      [16] |--50--|
      ...
      [24] |--50--|
      ...
      [32] |--50--|
      ...
      [39] |--7--|
VLD   [39] |1|
VADD  |--4--|
VMUL  |--16--|
VRSHF |1|

5 × 50 + 7 + 1 + 4 + 16 + 1 = 279 cycles → 8 banks
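The bank count can be found by trying each smaller power of two. The sketch below uses a closed form for the start time of the last element, which holds when the number of banks does not exceed the 50-cycle latency, so that every return to a bank has to wait for it. The names are illustrative, and the vector length of 40 is carried over from part (b).

```python
LATENCY = 50
VLEN = 40
CHAIN_TAIL = 4 + 16 + 1  # VADD, VMUL, VRSHF startup latencies


def program_cycles(banks, vlen=VLEN, latency=LATENCY):
    """Total cycles with chaining, assuming banks <= latency."""
    last = vlen - 1
    # The last element is in round `rounds` of visits to its bank and sits
    # `position` banks into that round; each round costs one full latency.
    rounds, position = divmod(last, banks)
    last_start = rounds * latency + position
    # +1: the second load issues one cycle after the first.
    return 1 + last_start + latency + CHAIN_TAIL


candidates = [1, 2, 4, 8, 16, 32]
matches = [b for b in candidates if program_cycles(b) == 279]  # [8]
```

With 8 banks, element 39 is in round 4 at position 7, giving 1 + (4 × 50 + 7) + 50 + 21 = 279.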

(d) Another architect is now designing the second generation of this vector computer. He wants to build a
multicore machine in which 4 vector processors share the same memory system. He scales up the number
of banks by 4 in order to match the memory system bandwidth to the new demand. However, when
he simulates this new machine design with a separate vector program running on every core, he finds
that the average execution time is longer than if each individual program ran on the original single-core
system with 1/4 the banks. Why could this be? Provide concrete reason(s).
Row-buffer conflicts (all cores interleave their vectors across all banks).

What change could this architect make to the system in order to alleviate this problem (in less than 20
words), while only changing the shared memory hierarchy?
Partition the memory mappings, or use better memory scheduling.

3 SIMD Processing
Suppose we want to design a SIMD engine that can support a vector length of 16. We have two options: a
traditional vector processor and a traditional array processor.
Which one is more costly in terms of chip area (circle one)?

The traditional vector processor The traditional array processor Neither

Explain:
An array processor requires 16 functional units for an operation whereas a vector processor requires only
1.

Assuming the latency of an addition operation is five cycles in both processors, how long will a VADD
(vector add) instruction take in each of the processors (assume that the adder can be fully pipelined and is
the same for both processors)?

For a vector length of 1:

The traditional vector processor:
5 cycles

The traditional array processor:

5 cycles

For a vector length of 4:

The traditional vector processor:
8 cycles (5 for the first element to complete, 3 for the remaining 3)

The traditional array processor:

5 cycles

For a vector length of 16:

The traditional vector processor:
20 cycles (5 for the first element to complete, 15 for the remaining 15)

The traditional array processor:

5 cycles
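The pattern in these answers can be written as two small formulas. The Python sketch below is an illustration with made-up names: the vector processor streams elements through one pipelined adder, while the array processor is assumed to have 16 adders working in lockstep, so it only covers vector lengths up to 16.

```python
ADD_LATENCY = 5  # cycles for one addition
LANES = 16       # adders in the array processor


def vector_processor_cycles(vlen):
    """One pipelined adder: first result after ADD_LATENCY, then one per cycle."""
    return ADD_LATENCY + (vlen - 1)


def array_processor_cycles(vlen):
    """One adder per element: all elements finish together."""
    if vlen > LANES:
        raise ValueError("this sketch only models vectors that fit in one pass")
    return ADD_LATENCY


for vlen in (1, 4, 16):
    print(vlen, vector_processor_cycles(vlen), array_processor_cycles(vlen))
```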

4 GPUs and SIMD I
We define the SIMD utilization of a program run on a GPU as the fraction of SIMD lanes that are kept busy
with active threads during the run of a program.
The following code segment is run on a GPU. Each thread executes a single iteration of the shown
loop. Assume that the data values of the arrays A and B are already in vector registers so there are no loads
and stores in this program. (Hint: Notice that there are 2 instructions in each thread.) A warp in the GPU
consists of 32 threads, there are 32 SIMD lanes in the GPU. Assume that each instruction takes the same
amount of time to execute.

for (i = 0; i < N; i++) {

if (A[i] % 3 == 0) { // Instruction 1
A[i] = A[i] * B[i]; // Instruction 2
}
}

(a) How many warps does it take to execute this program? Please leave the answer in terms of N .
⌈N/32⌉
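Since each thread runs one loop iteration, the warp count is the iteration count divided by the warp size, rounded up. A minimal Python sketch of that ceiling division (the function name is illustrative):

```python
def warps_needed(num_threads, warp_size=32):
    """Number of warps for num_threads threads, rounding up partial warps."""
    return (num_threads + warp_size - 1) // warp_size


examples = [warps_needed(n) for n in (1, 32, 33, 100)]  # [1, 1, 2, 4]
```

The same function covers warps of 64 threads by passing `warp_size=64`.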

(b) Assume integer array A has a repeating pattern of 24 ones followed by 8 zeros, and integer array B
has a different repeating pattern of 48 zeros followed by 64 ones. What is the SIMD utilization of this
program?

((24 + 8 × 2)/(32 × 2)) × 100% = 40/64 × 100% = 62.5%
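The same number falls out of a direct count of busy lanes. The Python sketch below is an illustrative model, not the original solution's method: Instruction 1 keeps every lane of a warp busy, and Instruction 2 is issued only if at least one thread in the warp takes the branch, keeping only those lanes busy. It assumes the array length is a multiple of the warp size. Array B does not appear because it does not affect the branch.

```python
from fractions import Fraction

WARP_SIZE = 32


def simd_utilization(a):
    """Fraction of busy lanes over all issued instructions."""
    busy = 0
    slots = 0
    for base in range(0, len(a), WARP_SIZE):
        warp = a[base:base + WARP_SIZE]
        # Instruction 1 (the branch condition) runs on every thread.
        busy += len(warp)
        slots += WARP_SIZE
        # Instruction 2 runs only on threads whose element is divisible by 3.
        taken = sum(1 for x in warp if x % 3 == 0)
        if taken:
            busy += taken
            slots += WARP_SIZE
    return Fraction(busy, slots)


a = ([1] * 24 + [0] * 8) * 4       # 24 ones followed by 8 zeros, repeated
utilization = simd_utilization(a)  # Fraction(5, 8), i.e. 62.5%
```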

(c) Is it possible for this program to yield a SIMD utilization of 100% (circle one)?

YES NO

If YES, what should be true about arrays A for the SIMD utilization to be 100%?
Yes. If all of A’s elements are divisible by 3, or if all are not divisible by 3.

What should be true about arrays B?

B can be any array of integers.

If NO, explain why not.

(d) Is it possible for this program to yield a SIMD utilization of 56.25% (circle one)?

YES NO

If YES, what should be true about arrays A for the SIMD utilization to be 56.25%?
Yes, if 4 out of every 32 elements of A are divisible by 3.

What should be true about arrays B?

B can be any array of integers.

If NO, explain why not.

(e) Is it possible for this program to yield a SIMD utilization of 50% (circle one)?

YES NO

If YES, what should be true about arrays A for the SIMD utilization to be 50%?

What should be true about arrays B?

If NO, explain why not.

No. The minimum is when only 1 out of every 32 elements of A is divisible by 3. This yields a 51.5625%
usage.
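The answers to parts (d) and (e) can be checked by listing every utilization that a warp can reach. The sketch below is an illustration with hypothetical names, and it assumes every warp has the same number of threads that take the branch.

```python
from fractions import Fraction

WARP_SIZE = 32


def utilization(taken):
    """SIMD utilization when `taken` threads per warp execute Instruction 2."""
    if taken == 0:
        # Instruction 2 is never issued, so only the fully busy
        # Instruction 1 counts.
        return Fraction(1)
    return Fraction(WARP_SIZE + taken, 2 * WARP_SIZE)


achievable = {k: utilization(k) for k in range(WARP_SIZE + 1)}

four_taken = achievable[4]          # 36/64 = 56.25%
lowest = min(achievable.values())   # 33/64 = 51.5625%, reached with 1 taken
half_reachable = Fraction(1, 2) in achievable.values()  # False
```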

Now, we will look at the technique we learned in class that tries to improve SIMD utilization by merging
divergent branches together. The key idea of the dynamic warp formation is that threads in one warp can
be swapped with threads in another warp as long as the swapped threads have access to the associated
registers (i.e., they are on the same SIMD lane).
Consider the following example of a program that consists of 3 warps X, Y and Z that are executing
the same code segment specified at the top of this question. Assume that the vector below specifies the
direction of the branch of each thread within the warp. 1 means the branch in Instruction 1 is resolved
to taken and 0 means the branch in Instruction 1 is resolved to not taken.

X = {10000000000000000000000000000010}
Y = {10000000000000000000000000000001}
Z = {01000000000000000000000000000000}

(f) Given the example above, suppose that you perform dynamic warp formation on these three warps.
What is the resulting outcome of each branch for the newly formed warps X', Y', and Z'?
There are several answers to this question, but the key is that the taken branch in Z can be combined
with either X or Y. However, the taken branches in the first threads of X and Y cannot be merged because
they are on the same GPU lane.
X' = 10000000000000000000000000000010
Y' = 11000000000000000000000000000001
Z' = 00000000000000000000000000000000
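One simple way to form such warps is to pack the taken threads lane by lane into as few warps as possible. The Python sketch below shows this greedy packing. It is only one possible policy, with names invented for this illustration, and it produces a different but equally valid answer from the one listed above: both leave two warps on the taken path instead of three. The vectors are read left to right as lanes 0 to 31.

```python
LANES = 32


def mask(taken_lanes):
    """Branch-outcome vector with 1 in each lane listed in taken_lanes."""
    return [1 if lane in taken_lanes else 0 for lane in range(LANES)]


def form_warps(warps):
    """Pack taken threads into the earliest warps without changing lanes."""
    # Count how many of the original warps have a taken thread in each lane.
    per_lane = [sum(w[lane] for w in warps) for lane in range(LANES)]
    # New warp j gets a taken thread in a lane if that lane has more than j.
    return [[1 if per_lane[lane] > j else 0 for lane in range(LANES)]
            for j in range(len(warps))]


X = mask({0, 30})
Y = mask({0, 31})
Z = mask({1})

new_warps = form_warps([X, Y, Z])
warps_on_taken_path = sum(1 for w in new_warps if any(w))  # 2
```

Lane 0 still needs two warps because X and Y both have a taken thread there, which is the constraint the answer above points out.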

(g) Given the specification for arrays A and B, is it possible for this program to yield a better SIMD
utilization if dynamic warp formation is used? Explain your reasoning.
No. Branch divergence happens on the same lane throughout the program resulting in the case where
there is no dynamically formed warp.

5 GPUs and SIMD II
We define the SIMD utilization of a program run on a GPU as the fraction of SIMD lanes that are kept
busy with active threads during the run of a program.
The following code segment is run on a GPU. Each thread executes a single iteration of the shown
loop. Assume that the data values of the arrays A, B, and C are already in vector registers so there are no
loads and stores in this program. (Hint: Notice that there are 4 instructions in each thread.) A warp in the
GPU consists of 64 threads, and there are 64 SIMD lanes in the GPU.

for (i = 0; i < 1048576; i++) {

if (B[i] < 4444) {
A[i] = A[i] * C[i];
B[i] = A[i] + B[i];
C[i] = B[i] + 1;
}
}

(a) How many warps does it take to execute this program?

Warps = (Number of threads) / (Number of threads per warp)
Number of threads = 2^20 (i.e., one thread per loop iteration).
Number of threads per warp = 64 = 2^6 (given).
Warps = 2^20 / 2^6 = 2^14

(b) When we measure the SIMD utilization for this program with one input set, we find that it is 67/256.
What can you say about arrays A, B, and C? Be precise. (Hint: Look at the "if" branch. What can you
say about A, B, and C?)
A: Nothing

B: 1 in every 64 of B’s elements is less than 4444

C: Nothing

(c) Is it possible for this program to yield a SIMD utilization of 100% (circle one)?

YES NO

If YES, what should be true about arrays A, B, C for the SIMD utilization to be 100%? Be precise. If
NO, explain why not.

Yes. Either:
(1) All of B’s elements are greater than or equal to 4444, or
(2) All of B’s elements are less than 4444.

(d) Is it possible for this program to yield a SIMD utilization of 25% (circle one)?

YES NO

If YES, what should be true about arrays A, B, and C for the SIMD utilization to be 25%? Be precise.
If NO, explain why not.
No. The smallest SIMD utilization possible is the same as part (b), 67/256, but this is greater than 25%.
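Both the 67/256 figure from part (b) and the lower bound used here follow from counting busy lanes over the four instructions. The sketch below is illustrative, with hypothetical names, and assumes every warp has the same number of elements of B below 4444.

```python
from fractions import Fraction

WARP_SIZE = 64
INSTRUCTIONS = 4  # the branch plus the three statements inside it


def utilization(taken):
    """SIMD utilization when `taken` threads per warp enter the if-body."""
    if taken == 0:
        # Only the branch is issued, and it keeps every lane busy.
        return Fraction(1)
    busy = WARP_SIZE + (INSTRUCTIONS - 1) * taken
    return Fraction(busy, INSTRUCTIONS * WARP_SIZE)


one_taken = utilization(1)                                   # 67/256
lowest = min(utilization(k) for k in range(WARP_SIZE + 1))   # also 67/256
```

Since 67/256 is about 26.2%, no input can bring the utilization down to 25%.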

6 GPUs and SIMD III
We define the SIMD utilization of a program run on a GPU as the fraction of SIMD lanes that are kept
busy with active threads during the run of a program. As we saw in lecture and practice exercises, the SIMD
utilization of a program is computed across the complete run of the program.
The following code segment is run on a GPU. Each thread executes a single iteration of the shown
loop. Assume that the data values of the arrays A, B, and C are already in vector registers so there are no
loads and stores in this program. (Hint: Notice that there are 6 instructions in each thread.) A warp in the
GPU consists of 64 threads, and there are 64 SIMD lanes in the GPU. Please assume that all values in array
B have magnitudes less than 10 (i.e., |B[i]| < 10, for all i).

for (i = 0; i < 1024; i++) {

A[i] = B[i] * B[i];
if (A[i] > 0) {
C[i] = A[i] * B[i];
if (C[i] < 0) {
A[i] = A[i] + 1;
}
A[i] = A[i] - 2;
}
}

Please answer the following five questions.

(a) How many warps does it take to execute this program?

Warps = (Number of threads) / (Number of threads per warp)
Number of threads = 2^10 (i.e., one thread per loop iteration).
Number of threads per warp = 64 = 2^6 (given).
Warps = 2^10 / 2^6 = 2^4

(b) What is the maximum possible SIMD utilization of this program?

100%

(c) Please describe what needs to be true about array B to reach the maximum possible SIMD utilization
asked in part (b). (Please cover all cases in your answer)
B: For every 64 consecutive elements: every value is 0, every value is positive, or every value is negative.
Must give all three of these.

(d) What is the minimum possible SIMD utilization of this program?

Answer: 132/384
Explanation: The first two lines must be executed by every thread in a warp (64/64 utilization for each
line). The minimum utilization results when a single thread from each warp passes both conditions on
lines 2 and 4, and every other thread fails to meet the condition on line 2. The thread per warp that meets
both conditions, executes lines 3-6 resulting in a SIMD utilization of 1/64 for each line. The minimum
SIMD utilization sums to (64 ∗ 2 + 1 ∗ 4)/(64 ∗ 6) = 132/384

(e) Please describe what needs to be true about array B to reach the minimum possible SIMD utilization
asked in part (d). (Please cover all cases in your answer)
B: Exactly 1 of every 64 consecutive elements must be negative. The rest must be zero. This is the only case
that this holds true.
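The maximum and minimum cases can be checked by counting busy lanes for one warp. The Python sketch below is an illustrative model with made-up names: an instruction is issued for the whole warp if at least one thread needs it, and only the threads that need it count as busy.

```python
from fractions import Fraction


def warp_utilization(b_values):
    """SIMD utilization of one warp for the six-instruction loop body."""
    lanes = len(b_values)
    a = [b * b for b in b_values]   # line 1: every thread
    outer = [x > 0 for x in a]      # line 2: every thread evaluates the branch
    busy, issued = 2 * lanes, 2
    if any(outer):
        n_outer = sum(outer)
        # Line 4 is true for a thread when C = A * B is negative.
        inner = [o and x * b < 0 for o, x, b in zip(outer, a, b_values)]
        busy += 2 * n_outer         # lines 3 and 4
        issued += 2
        if any(inner):
            busy += sum(inner)      # line 5
            issued += 1
        busy += n_outer             # line 6
        issued += 1
    return Fraction(busy, issued * lanes)


all_zero = warp_utilization([0] * 64)             # 1
all_positive = warp_utilization([2] * 64)         # 1
all_negative = warp_utilization([-2] * 64)        # 1
one_negative = warp_utilization([-1] + [0] * 63)  # 132/384
one_positive = warp_utilization([1] + [0] * 63)   # 131/320
```

A single positive value gives 131/320, which is higher than 132/384 because line 5 is never issued. This is why the minimum requires the one nonzero value to be negative.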
