
Stanford CS149: Parallel Computing

Written Assignment 1

Multi-Core Architecture

Problem 1:

A. Consider a multi-core processor that runs at 2 GHz and has 4 cores. Each core can perform up to one
8-wide SIMD vector instruction per clock and supports hardware multi-threading with 4 hardware
execution contexts per core. What is the maximum throughput of the processor in units of scalar
floating-point operations per second? Please show your calculations.

B. Consider the processor from part A running a program that perfectly parallelizes onto many cores,
makes perfect usage of SIMD vector processing, and has an arithmetic intensity of 4. (4 scalar
floating ops per byte of data transferred from memory.) If we assume the processor has a memory
system providing 64 GB/sec of bandwidth, is the program compute-bound or bandwidth-bound on
this processor? For simplicity, you may assume that 1 GB/sec is one billion bytes per second
(10^9 bytes/sec). Please show your work.

Page 1
C. Consider a cache that contains 32 KB of data, has a cache line size of 4 bytes, is fully associative
(meaning any cache line can go anywhere in the cache), and uses an LRU (least recently used—
the line evicted is the line that was last accessed the longest time ago) replacement policy. Please
describe why the following code will take a cache miss on every data access to the array A.

const int SIZE = 1024 * 64;

float A[SIZE];
float sum = 0.0;
for (int reps=0; reps<32; reps++)
    for (int i=0; i<SIZE; i++)
        sum += A[i];

Page 2
Dependency Graphs, ILP, and Superscalar/Multi-Threaded Execution

Problem 2:
Consider the following sequence of scalar instructions running within a single thread.
1. LD R0 <- mem[R4] // load memory address given by R4 into R0
2. ADD R1 <- R0, R0 // R1 = R0 + R0
3. ADD R2 <- R0, R0 // R2 = R0 + R0
4. ADD R3 <- R0, R0 // R3 = R0 + R0
5. MUL R1 <- R1, R1
6. MUL R2 <- R2, R2
7. MUL R3 <- R3, R3
8. ADD R1 <- R1, R2
9. ADD R1 <- R1, R3
10. ST mem[R5] <- R1 // store R1 into memory at address given by R5

A. Please draw the instruction dependency graph for this instruction sequence.

Page 3
B. What is the maximum amount of instruction level parallelism (ILP) present in the program? Keep
in mind that ILP is entirely a property of the program itself, not the machine it is run on.

C. Consider running this instruction stream on a single core, single-threaded processor that has
superscalar execution capability to perform up to two instructions per clock, PROVIDED THAT
EXACTLY ONE INSTRUCTION IS A MUL. In other words, the processor has two execution units
(ALUs): one can execute add/load/store instructions and the other can execute only multiplications.
Please assume all instructions (including loads/stores) individually complete in one cycle.
You cannot modify the instructions in the program. What is the minimum number of cycles needed
to execute this program? (Hint: think carefully about which instructions are allowed to run concurrently.
The processor can run two instructions at the same time if they are independent and exactly
one is a multiply.)

Page 4
D. Now assume that the code above is changed so that it is: (1) running in a loop, where instructions
1-10 make up the body of the loop, and in each iteration of the loop the addresses stored in R4 and
R5 are different. (2) Loop iterations are perfectly parallelized across std::threads. We are running
this code on a single core, multi-threaded processor that can process one instruction per clock
(arithmetic instructions, LDs, STs). However, from the moment a processor begins to execute a load
instruction, there is a latency of 25 cycles before the data can be used by a dependent instruction. To
be clear: if a core issues a LD in clock c, the LD occupies the core in clock c, but then the core cannot
run an instruction that uses the value of the load until clock c + 25.
How many threads must be interleaved for the core to run at 100% efficiency? Please justify your
answer.

Page 5
The patterns are pretty, but the SIMD efficiency may not be

Problem 3:
Consider the following ISPC code that processes a 16 × 16 input image (input) containing white, gray, or
black pixels. It produces a 16 × 16 output image (output).
const int IMAGE_SIZE = 16;

void myfunction(uniform float* input, uniform float* output) {
    for (uniform int row=0; row<IMAGE_SIZE; row++) {
        for (uniform int col=0; col<IMAGE_SIZE; col+=programCount) {

            int idx = row*IMAGE_SIZE + col + programIndex;

            float val = input[idx];   // load four bytes
            float result;

            if (!isWhite(val)) {
                if (isGray(val)) {
                    result = foo1(val);   // 10 cycles of arithmetic
                } else {
                    result = foo2(val);   // 10 cycles of arithmetic
                }
            } else {
                result = foo3(val);   // 10 cycles of arithmetic
            }
            output[idx] = result;     // store four bytes
        }
    }
}

The questions are on the next page...

Page 6
A. Consider running the code on the image below. Note the full image is 16×16 pixels... the entire
image is not shown in the figure but you can assume the pattern repeats as indicated, with only
the diagonal being colored.
Assume that the only arithmetic cycles we want you to count are the arithmetic instructions labeled
in the code. (All conditionals and loads/stores are “free”.) What is the overall SIMD efficiency (50%,
100%, etc.) achieved when running the code on a 1 GHz single-core CPU with a SIMD width
(and corresponding ISPC gang size) of 4? In other words, the core can perform one 4-wide SIMD
operation per clock. (Hint: start by computing the number of cycles needed to perform one row's
worth of the computation.)

[Figure: 16×16 image in which only the diagonal pixels are colored; the pattern repeats as indicated.]
Page 7
B. What is the SIMD efficiency obtained when running the same code, using the same processor, but
on the image below? Again, please keep in mind the image is 16 × 16 in size (e.g., the pattern of
columns below just extends downward another 8 rows).

[Figure: 16×16 image with a pattern of colored columns that extends downward through all rows.]

Page 8
C. Now imagine that you are given the following function:
transpose(uniform float* input, uniform float* output)
This function transposes the 16 × 16 input image and stores the result in a 16 × 16 output. (Recall
that a transpose is output[i][j] = input[j][i].) Write down pseudocode for how you would
use calls to myfunction and transpose to produce a computation that eliminates all execution
divergence when running myfunction on the image from part B. Yes, you can allocate temp buffers
as needed.
Fill in pseudocode in myNewFunction below. Your solution should compute the same output image
output as myfunction(input, output).
void myNewFunction(float* input, float* output)

Page 9
D. Imagine that the code is run on the same processor, but now with a memory system with a memory
bandwidth of 8 bytes per clock. Please assume that the transpose operation is bandwidth-bound
and the cost of the operation is equal to the time needed to transfer required data to and from
memory. You should also assume that myfunction is compute bound and that the required time is
a function of the number of cycles of arithmetic performed.
When running on the input image from part B, is the solution you proposed in part C faster or
slower than the single call to myfunction used in part B? How much faster is it (in clocks)?
Please keep in mind that the processor:

• Can execute one four-wide SIMD operation per clock


• Is attached to a memory bus that can transfer 8 bytes of data per clock
• The input/output images are 16×16 elements in size.
• Each image element (a float) is 4 bytes

Hint: compute the amount of time needed to transpose the data, and then compute the amount
of time (in clocks) needed to perform myfunction on this modified input.

Page 10
Hardware Basics

PRACTICE PROBLEM 1:

A. Consider a multi-core processor that has two cores. Each core runs at 1 GHz (1 billion clocks per
second). Each core is single-threaded (meaning it only maintains state for a single execution context)
and can complete one single-precision floating-point arithmetic operation per clock. What is the peak
arithmetic throughput of the processor in terms of floating-point operations per second?

B. Now imagine the cores from part A are upgraded so that they perform 16-wide SIMD instructions.
Assuming these cores still complete one of these SIMD instructions per clock, what is the peak
arithmetic throughput of the processor (in terms of floating point operations per second)?

C. Finally, imagine that each core from part B was a multi-threaded core that maintains execution
contexts for up to four hardware threads. What is the peak arithmetic throughput of the processor
in terms of floating-point operations per second?

D. Imagine that each core from part C was further modified to support superscalar execution where
the core can complete one scalar floating point operation and one 16-wide SIMD instruction per
clock from the same thread (if those instructions are independent). What is the peak arithmetic
throughput of the processor (in terms of floating point operations per second)?

Page 11
Caching Basics

PRACTICE PROBLEM 2:

A. Assume we are running a program on a processor with a data cache. All data loaded from memory
is first loaded into the processor’s data cache, and then transferred from the cache to the processor’s
registers. (This is true of most systems.) When new data is brought into the cache, the cache has a
policy of evicting the least recently used data (the data that has been accessed the longest time ago)
to make room for the newly accessed data. Imagine that the cache is 32 KB, and consider running
the following program:
const int SIZE = 64 * 1024;
float mydata[SIZE];  // hint: how much data is this?

float sum = 0.0;

for (int i=0; i<100; i++) {
    for (int j=0; j<SIZE; j++) {
        sum += mydata[j];
    }
}

A cache miss occurs when a processor accesses data from memory that is not present in the cache.
When the program starts running, each access to data during the i=0 iteration of the outer
loop is a cache miss, since that is the first time the data is accessed in the program. Now consider
the entire program’s execution. Please describe what fraction of the accesses to the mydata array
will be cache misses. (For those that are familiar with the details of cache operation, please assume
that the cache is fully-associative, and has a cache line size of one float. If these terms are unfamiliar
to you, you can safely ignore them... or Google it!)

B. Now consider the case where SIZE = 2048. Does your answer to part A change with this new array
size? Why or why not?

Page 12
C. Now assume that you are running the following code. Would you rather have a processor with a
32 KB data cache, or would you rather add multi-threading to the processor? Why?
float sum = 0.0;
for (int i=0; i<1000000; i++) {
    sum += 2.0 * myarray[i] + 1.0;
}

Page 13
Superscalar and Hardware Multi-Threading

PRACTICE PROBLEM 3:
Consider the following sequence of 12 instructions. There is a load operation, followed by 10 math
operations, followed by a store. Note that some of the instructions are scalar instructions, and others are
vector instructions operating on vector registers (Vx registers). The vector operations have “V” at the
beginning of their instruction names.

1. LD R1 <- [R0]
2. VSPLAT V0 <- R1 // copy R1 into all elements of V0
3. MUL R2 <- R1, R1
4. ADD R2 <- R2, 16 // R2 + 16
5. VSPLAT V1 <- R2 // copy R2 into all elements of V1
6. VMUL V2 <- V0, V0
7. VADD V3 <- V0, V0
8. VMUL V3 <- V1, V3
9. VMUL V3 <- V2, V3
10. VRED R1 <- V3 // reduction: sum all elements of V3 into R1
11. MUL R1 <- R1, R2
12. ST [R4] <- R1

A. Please draw the dependency graph for the instruction sequence.

Page 14
B. Imagine the instruction sequence is executed on a single-core, single-threaded processor. The processor
supports superscalar execution in that it can fetch/decode up to two instructions per clock, but it
has one scalar execution unit and one vector execution unit. Therefore, it can run instructions IN
ANY ORDER THAT RESPECTS THE PROGRAM DEPENDENCY GRAPH, but it can only run
two independent instructions per clock if and only if one instruction is a scalar instruction and the
other instruction is a vector instruction. Assuming that all instructions take 1 cycle to complete, how
many cycles does it take to complete this instruction sequence?

Page 15
C. Now assume the sequence of instructions on the previous page is run in a loop. For example:

float in[VERY_BIG];   // let VERY_BIG = 100,000,000
float out[VERY_BIG];

// parallelize iterations of this loop using threads
#pragma omp parallel for
for (int i=0; i<VERY_BIG; i++) {
    // assume myfunc() is the 10 non-LD/ST instrs in the seq above
    out[i] = myfunc(in[i]);
}

The loop is run on the same single core, single-threaded processor as before, but now the latency
of a LD instruction is 20 clocks. (That is, if the LD instruction begins on clock c, an instruction
depending on the LD can begin on clock c + 20: there is one cycle to execute the LD instruction
on the processor's scalar unit, followed by 19 cycles of waiting before the dependent instruction can
begin.) All other instructions still have a latency of 1 cycle.
In this setup, what is the utilization of the core’s vector execution unit? (What fraction of cycles is
the vector unit executing instructions? Hint: What are all the reasons the vector unit might not be
utilized? Note: it’s fine to give your answer as a fraction.)

Page 16
D. Now imagine the core is multi-threaded, and can choose to simultaneously execute two instructions
from the same thread, or from different threads, provided they meet the one-scalar-and-one-vector-
instruction-per-clock constraint. In this design, can you achieve full utilization of the vector
unit? If so, how many total threads do you need (please explain why)? If not, explain why not.

Page 17
A Task Queue on a Multi-Core, Multi-Threaded CPU

PRACTICE PROBLEM 4:
The figure below shows a single-core CPU with a 32 KB L1 cache and execution contexts for up to
two threads of control. The core executes threads assigned to contexts T0-T1 in an interleaved fashion,
switching the active thread only on a memory stall. Memory bandwidth is infinitely high in this
system, but memory latency on a cache miss is 200 clocks.
FAQ about the cache: To keep things simple, assume a cache hit takes only one cycle. Assume cache
lines are 4 bytes (a single floating-point value), and the cache implements a least-recently-used (LRU)
replacement policy, meaning that when a cache line needs to be evicted, the line that was last accessed
the furthest in the past is evicted. It may be helpful to think about how this cache behaves when a program
reads 33 KB of contiguous memory over and over. Hint: confirm to yourself that in this situation
every load will be a cache miss.
In this problem, assume the CPU performs no prefetching.

[Figure: a single core (Core 1) with execution contexts T0 and T1, backed by a 32 KB L1 cache connected to memory.]

You are implementing a task queue for a system with this CPU. The task queue is responsible for executing
independent tasks that are created as a part of a bulk launch (much like how an ISPC task launch creates
many independent tasks). You implement your task system using a pool of worker threads, all of which
are spawned at program launch. When tasks are added to the task queue, the worker threads grab the
next task in the queue by atomically incrementing a shared counter next_task_id. Pseudocode for the
execution of a worker thread is shown below.
mutex queue_lock;
int next_task_id;   // set to zero at time of bulk task launch
int total_tasks;    // set to total number of tasks at time of bulk task launch
float* task_args[MAX_NUM_TASKS];   // initialized elsewhere

while (1) {
    int my_task_id;

    LOCK(queue_lock);
    my_task_id = next_task_id++;
    UNLOCK(queue_lock);

    if (my_task_id < total_tasks)
        TASK_A(my_task_id, task_args[my_task_id]);
    else
        break;
}

Page 18
A. Consider one possible implementation of TASK_A from the code on the previous page:
function TASK_A(int task_id, float* X) {
    for (int i=0; i<1000; i++) {
        for (int j=0; j<1024*64; j++) {
            load X[j]   // assume this is a cold miss when i=0
            // ... 50 non-memory instructions using X
        }
    }
}

The inner loop of TASK_A scans over 64K elements (256 KB) of array X, performing 50
arithmetic instructions after each load. This process is repeated over the same data 1000 times.
Assume there are no other significant memory instructions in the program and that each task works
on a completely different input array X (there is no sharing of data across tasks). Remember that the
cache is 32 KB.

In order to process a bulk launch of TASK_A, you create two worker threads, WT0 and WT1, and
assign them to CPU execution contexts T0 and T1. Do you expect the program to execute substantially
faster using the two-thread worker pool than if only one worker thread was used? If so, please
calculate how much faster. (Your answer need not be exact; a back-of-the-envelope calculation is fine.)
If not, explain why.
(Careful: please consider the program’s execution behavior on average over the entire program’s execution
(“steady state” behavior). Past students have been tricked by only thinking about the behavior of the first loop
iteration of the first task.) It may be helpful to draw when threads are running and stalled waiting for a load
on the diagram below.

T0

T1
Time
(clocks)

Page 19
B. Consider the same setup as the previous problem. How many hardware threads would the CPU
core need in order for the machine to maintain peak throughput (100% utilization) on this workload?

C. Now consider the case where the program is modified to contain 100,000 instructions in the
innermost loop. Do you expect your two-thread worker pool to execute the program substantially faster
than a one-thread worker pool? If so, please calculate how much faster (your answer need not be exact;
a back-of-the-envelope calculation is fine). If not, explain why.

Page 20
D. Now consider the case where the cache size is changed to 1 MB and you are running the original
program from Part A (50 math instructions in the inner loop). When running the program from
part A on this new machine, do you expect your two-thread worker pool to execute the program
substantially faster than a one-thread worker pool? If so, please calculate how much faster (your answer
need not be exact; a back-of-the-envelope calculation is fine). If not, explain why.

T0

T1
Time
(clocks)

E. Now consider the case where the L1 cache size is changed to 384 KB. Assuming you cannot change
the implementation of TASK_A from Part A, would you choose to use a worker thread pool of one
or two threads? Why, and how much higher throughput does your choice achieve?

Page 21
Picking the Right CPU for the Job

PRACTICE PROBLEM 5:
You write a bit of ISPC code that modifies a grayscale image of size 32×height pixels based on the
contents of a black and white “mask” image of the same size. The code brightens input image pixels by
a factor of 1000 if the corresponding pixel of the mask image is white (the mask has value 1.0) and by a
factor of 10 otherwise.
The code partitions the image processing work into 128 ISPC tasks, which you can assume balance
perfectly onto all available CPU processors.
void brighten_image(uniform int height, uniform float image[], uniform float mask_image[])
{
    uniform int NUM_TASKS = 128;
    uniform int rows_per_task = height / NUM_TASKS;
    launch[NUM_TASKS] brighten_chunk(rows_per_task, image, mask_image);
}

void brighten_chunk(uniform int rows_per_task, uniform float image[], uniform float mask_image[])
{
    // ‘programCount’ is the ISPC gang size.
    // ‘programIndex’ is a per-instance identifier between 0 and programCount-1.
    // ‘taskIndex’ is a per-task identifier between 0 and NUM_TASKS-1.

    // compute starting image row for this task
    uniform int start_row = rows_per_task * taskIndex;

    // process all pixels in a chunk of rows
    for (uniform int j=start_row; j<start_row+rows_per_task; j++) {
        for (uniform int i=0; i<32; i+=programCount) {

            int idx = j*32 + i + programIndex;

            int iters = (mask_image[idx] == 1.f) ? 1000 : 10;

            float tmp = 0.f;

            for (int k=0; k<iters; k++)
                tmp += image[idx];   // these are the ops we want you to count

            image[idx] = tmp;
        }
    }
}

(question continued on next page)

Page 22
You go to the store to buy a new CPU that runs this computation as fast as possible. On the shelf you see
the following three CPUs on sale for the same price:

(A) 1 GHz single core CPU capable of performing one 32-wide SIMD floating point addition per clock

(B) 1 GHz 12-core CPU capable of performing one 2-wide SIMD floating point addition per clock

(C) 4 GHz single core CPU capable of performing one floating point addition per clock (no parallelism)

[Figure 1: Image masks used to govern image manipulation by brighten_image. Mask Image 1
(32 × height) has vertical white columns at every 4th pixel; Mask Image 2 (32 × height) has random
black or white rows.]

A. If your only use of the CPU will be to run the above code as fast as possible, and assuming the code
will execute using mask image 1 above, rank all three machines in order of performance. Please
explain how you determined your ranking by comparing execution times on the various processors.
When considering execution time, you may assume that (1) the only operations you need to account
for are the floating-point additions in the innermost ’k’ loop. (2) The ISPC gang size will be set to
the SIMD width of the CPU. (3) There are no stalls during execution due to data access.
(Hint: it may be easiest to consider the execution time of each row of the image.)

Page 23
B. Rank all three machines in order of performance for mask image 2. Please justify your answer;
you are not required to perform detailed calculations as in part A.

Page 24
Be a Parallel Processor Architect

PRACTICE PROBLEM 6:
You are hired to start the parallel processor design team at Lagunita Processors, Inc. Your boss tells you
that you are responsible for designing the company’s first shared address space multi-core processor,
which will be constructed by cramming multiple copies of the company’s best selling uniprocessor core
on a single chip. Your boss expects the project to yield at least a 5× speedup on the performance of the
program given below. You are not allowed to change the program, and assume that:

• Each Lagunita core can complete one floating point operation per clock

• Cores are clocked at 1 GHz, and each have a 1 MB cache using LRU replacement.

• All Lagunita processors (both single and multi-core) are attached to a 100 GB/s memory bus

• Memory latency is perfectly hidden (Lagunita processors have excellent pre-fetchers)

float A[N];   // let N = 100 million elements

float total = 0;

// ASSUME TIMER STARTS HERE //////////////////////////////////

for (int i=0; i<N; i++)
    total += A[i];

for (int i=0; i<9; i++) {

    // made up syntax for brevity: ’parallel_for’
    // Assume iterations of this loop are perfectly partitioned
    // using blocked assignment among X pthreads each running on
    // one of the processor’s X cores.
    parallel_for(int j=0; j<N; j++) {
        A[j] = A[j] / total;
    }
}

// ASSUME TIMER STOPS HERE //////////////////////////////////

Page 25
A. How do you respond to your boss’ request? Do you believe you can meet the performance goal? If
yes, how many cores should be included in the new multi-core processor? If no, explain why.

B. You tell your boss that if you were allowed to make a few changes to the code, you could deliver
a much better speedup with your parallel processor design. How would you change the code to
improve its speedup? (A simple description of the new code is fine.) If your answer was NO in
part A, how many processors are required to achieve 5× speedup now? If your answer was YES,
approximately what speedup do you expect from your previously proposed machine on the new
code? (Note: we are NOT looking for answers that optimize the program by rolling multiple
divisions into one.)

Page 26
C. Assume that the following year, Lagunita Processors, Inc. decides to produce a 32-core version of
your parallel CPU design. In addition to adding cores, your boss gives you the opportunity to
further improve the processor through one of the following three options.

• You may double each processor’s cache to 2 MB.


• You may increase memory bandwidth by 50%
• You may add a 4-wide SIMD unit to the core so that each core can perform 4 floating point
operations per clock.

If each of these options has the same cost, given the code you produced in part B (and what you
learned from assignment 1), which option do you recommend to your boss? Why?

Page 27
SPMD Tree Search

PRACTICE PROBLEM 7:
NOTE: This question is tricky. If you can answer this question you really understand how the SPMD
programming model maps to SIMD execution!
The figure below shows a collection of line segments in 1D. It also shows a binary tree data structure
organizing the segments into a hierarchy. Leaves of the tree correspond to the line segments. Each interior
tree node represents a spatial extent that bounds all its child segments. Notice that sibling leaves can (and
do) overlap. Using this data structure, it is possible to answer the question “what is the largest segment
that contains a specified point” without testing the point against all segments in the scene.
For example, the answer for point p = 0.15 is segment 5 (in node N5). The answer for the point p = 0.75
is segment 11 in node N11.

[Figure: a collection of line segments in 1D spanning [0.0, 1.0], and the binary search tree organizing
them. Node bounds (min / max): N0: 0.0/1.0 (root); N1: 0.0/0.40; N8: 0.50/1.0; N2: 0.0/0.30;
N9: 0.50/0.90; N14: 0.95/1.0; N3: 0.0/0.10; N4: 0.10/0.30; N7: 0.35/0.40; N10: 0.50/0.80;
N13: 0.70/0.90; N5: 0.10/0.25; N6: 0.20/0.30; N11: 0.50/0.80; N12: 0.60/0.80.]

struct Node {
    float min, max;      // if leaf: start/end of segment; else: bounds on all child segments
    bool leaf;           // true if node is a leaf
    int segment_id;      // segment id if this is a leaf
    Node* left, *right;  // child tree nodes
};

On the following two pages, we provide you two ISPC functions, find_segment_1 and find_segment_2
that both compute the same thing: they use the tree structure above to find the id of the largest line
segment that contains a given query point.

Page 28

// -- computes segment id of the largest segment containing points[programIndex]
// -- root_node is the root of the search tree
// -- each program instance processes one query point
export void find_segment_1(uniform float* points, uniform int* results, uniform Node* root_node) {

    Stack<Node*> stack;
    Node* node;
    float max_extent = 0.0;

    // p is the point this program instance is searching for
    float p = points[programIndex];
    results[programIndex] = NO_SEGMENT;

    stack.push(root_node);

    while (stack.size() != 0) {
        node = stack.pop();

        while (!node->leaf) {
            // [I-test]: test to see if point is contained within this interior node
            if (p >= node->min && p <= node->max) {
                // [I-hit]: p is within interior node... continue to child nodes
                stack.push(node->right);
                node = node->left;
            } else {
                // [I-miss]: point not contained within node, pop the stack
                if (stack.size() == 0)
                    return;
                else
                    node = stack.pop();
            }
        }

        // [S-test]: test if point is within segment, and segment is largest seen so far
        if (p >= node->min && p <= node->max && (node->max - node->min) > max_extent) {
            // [S-hit]: mark this segment as "best-so-far"
            results[programIndex] = node->segment_id;
            max_extent = node->max - node->min;
        }
    }
}

Page 29
export void find_segment_2(uniform float* points, uniform int* results, uniform Node* root_node) {

    Stack<Node*> stack;
    Node* node;
    float max_extent = 0.0;

    // p is the point this program instance is searching for
    float p = points[programIndex];
    results[programIndex] = NO_SEGMENT;

    stack.push(root_node);

    while (stack.size() != 0) {
        node = stack.pop();

        if (!node->leaf) {
            // [I-test]: test to see if point is contained within interior node
            if (p >= node->min && p <= node->max) {
                // [I-hit]: p is within interior node... continue to child nodes
                stack.push(node->right);
                stack.push(node->left);
            }
        } else {
            // [S-test]: test if point is within segment, and segment is largest seen so far
            if (p >= node->min && p <= node->max && (node->max - node->min) > max_extent) {
                // [S-hit]: mark this segment as "best-so-far"
                results[programIndex] = node->segment_id;
                max_extent = node->max - node->min;
            }
        }
    }
}

Begin by studying find_segment_1.


Given the input p = 0.1, a single program instance will execute the following sequence of steps:
(I-test, N0), (I-hit, N0), (I-test, N1), (I-hit, N1), (I-test, N2), (I-hit, N2), (S-test, N3), (S-hit, N3),
(I-test, N4), (I-hit, N4), (S-test, N5), (S-hit, N5), (S-test, N6), (S-test, N7), (I-test, N8),
(I-miss, N8), where each of the above
“steps” represents reaching a basic block in the code (see comments):

• (I-test, Nx) represents a point-interior node test against node x.

• (I-hit, Nx) represents logic of traversing to the child nodes of node x when p is determined to be
contained in x.

• (I-miss, Nx) represents logic of traversing to sibling/ancestor nodes when the point is not contained
within node x.

• (S-test, Nx) represents a point-segment (leaf node) test against the segment represented by node x.

• (S-hit, Nx) represents the basic block where node x is recorded as the new largest segment found.

The question is on the next page...

Page 30
A. Confirm you understand the above, then consider the behavior of a gang of 4 program instances
executing the above two ISPC functions find_segment_1 and find_segment_2. For example, you
may wish to consider execution on the following array:
points = {0.15, 0.35, 0.75, 0.95}
Describe the difference between the traversal approach used in find_segment_1 and find_segment_2
in the context of SIMD execution. Your description might want to specifically point out conditions
when find_segment_1 suffers from divergence. (Hint 1: you may want to make a table of four
columns, each row is a step by the entire gang and each column shows each program instance’s
execution. Hint 2: It may help to consider which solution is better in the case of large, heavily
unbalanced trees.)

B. Consider a slight change to the code where as soon as a best-so-far line segment is found (inside
[S-hit]) the code makes a call to a very, very expensive function. Which solution might be preferred
in this case? Why?

Page 31
