Stanford CS149: Parallel Computing

Written Assignment 3

Implementing CS149 Spark

Problem 1:
In this problem we want you to implement a very simple version of Spark, called CS149Spark, that supports
only a few operators. You will implement CS149Spark as a simple C++ library consisting of a base class
RDD as well as subclasses for all CS149Spark transforms.
class RDD {
public:
    virtual bool hasMoreElements() = 0;  // all RDDs must implement this
    virtual string next() = 0;           // all RDDs must implement this

    int count() {  // returns number of elements in the RDD
        int count = 0;
        while (hasMoreElements()) {
            string el = next();
            count++;
        }
        return count;
    }

    vector<string> collect() {  // returns STL vector representing RDD
        vector<string> data;
        while (hasMoreElements()) {
            data.push_back(next());
        }
        return data;
    }
};

class RDDFromTextFile : public RDD {

    ifstream inputFile;  // regular C++ file IO object

public:
    RDDFromTextFile(string filename) {
        inputFile.open(filename);  // prepares file for reading
    }

    bool hasMoreElements() {
        // peek() returns EOF once there is no more data to read
        return inputFile.peek() != EOF;
    }

    string next() {
        string line;
        getline(inputFile, line);  // reads next line from file
        return line;
    }
};

For example, given the two definitions above, a simple program that counts the lines in a text file can be
written as follows.
RDDFromTextFile r("myfile.txt"); // creates an RDD where each element is a string
                                 // corresponding to a line from the text file

printf("The RDD has length %d\n", r.count());

A. Now consider adding a l33tify RDD transform to CS149Spark, which returns a new RDD where
all instances of the character 'e' in string elements of the source RDD are converted to the character
'3'. For example, the following code sequence creates an RDD (r1) whose elements are lines from a
text file. The RDD r2 contains a l33tified version of these strings. This data is collected into a regular
C++ vector at the end of the program using the call to collect().
RDDFromTextFile r1("myfile.txt"); // creates an RDD where each element is a string
                                  // corresponding to a line from the text file

RDDL33tify r2(r1); // l33tify all elements of r1

vector<string> lines = r2.collect(); // lines from the file, but in l33t form

Implement the functions hasMoreElements() and next() for the l33tify RDD transformation
below. A full-credit solution will use a minimal memory footprint and never recompute (compute
more than once) any element of any RDD.
///////////////////////////////////////////////////////////////
class RDDL33tify : public RDD {

    RDD& parent;  // held by reference: RDD is abstract and cannot be stored by value

public:
    RDDL33tify(RDD& parentRDD) : parent(parentRDD) {}

    bool hasMoreElements() {

    }

    string next() {

    }
};
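
One way to approach part A is to treat the transform as a pass-through iterator. The sketch below is only an illustration under stated assumptions: RDDFromVector is a hypothetical in-memory source added so the example runs without a file, collectAll is a hypothetical free-function stand-in for collect(), and the parent is held by reference. It is not the course's reference solution.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>
using std::string;
using std::vector;

// Same interface as the handout's RDD base class (count/collect omitted for brevity).
class RDD {
public:
    virtual ~RDD() = default;
    virtual bool hasMoreElements() = 0;
    virtual string next() = 0;
};

// Hypothetical in-memory source, used only so the sketch can run without a file.
class RDDFromVector : public RDD {
    vector<string> items;
    std::size_t pos = 0;
public:
    explicit RDDFromVector(vector<string> v) : items(std::move(v)) {}
    bool hasMoreElements() override { return pos < items.size(); }
    string next() override {
        assert(hasMoreElements());
        return items[pos++];
    }
};

// Streaming l33tify: stores no elements of its own, only a reference to the parent.
class RDDL33tify : public RDD {
    RDD& parent;
public:
    explicit RDDL33tify(RDD& parentRDD) : parent(parentRDD) {}

    // One output per input, so there is more output exactly when the parent has more.
    bool hasMoreElements() override { return parent.hasMoreElements(); }

    // Pull one element from the parent, rewrite it, and hand it on.
    string next() override {
        string s = parent.next();
        for (char& c : s) {
            if (c == 'e') c = '3';
        }
        return s;
    }
};

// Hypothetical helper mirroring the handout's collect(): drains an RDD into a vector.
vector<string> collectAll(RDD& rdd) {
    vector<string> out;
    while (rdd.hasMoreElements()) out.push_back(rdd.next());
    return out;
}
```

Because each call to next() touches exactly one parent element and keeps nothing afterwards, the transform's own footprint is a single string regardless of input size.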

B. Now consider a transformation FilterLongWords that filters out all elements of the input RDD that
are strings longer than 32 characters.
Again, we want you to implement hasMoreElements() and next().
You may declare any member variables you wish and assume .length() exists on strings. Careful:
hasMoreElements() is trickier now! Again, a full-credit solution will use a minimal memory
footprint and never recompute any element of any RDD.
A sample program using the FilterLongWords RDD transformation is below:
RDDFromTextFile r1("myfile.txt"); // creates an RDD where each element is a string
                                  // corresponding to a line from the text file

RDDL33tify r2(r1); // converts elements to l33t form

RDDFilterLongWords r3(r2); // removes strings that are longer than 32 characters
printf("RDD r3 has length %d\n", r3.count());

/////////////////////////////////////////////////////////////////////
class RDDFilterLongWords : public RDD {

    RDD& parent;  // held by reference: RDD is abstract and cannot be stored by value

public:
    RDDFilterLongWords(RDD& parentRDD) : parent(parentRDD) {}

    bool hasMoreElements() {

    }

    string next() {

    }
};
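
The difficulty in part B is that hasMoreElements() cannot answer without looking ahead in the parent. A minimal sketch of one approach, assuming a one-element lookahead buffer, is shown here. RDDFromVector is again a hypothetical in-memory source, and the member names buffered, hasBuffered, and fill are illustrative choices rather than anything the handout prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>
using std::string;
using std::vector;

// Same interface as the handout's RDD base class (count/collect omitted for brevity).
class RDD {
public:
    virtual ~RDD() = default;
    virtual bool hasMoreElements() = 0;
    virtual string next() = 0;
};

// Hypothetical in-memory source, used only so the sketch can run without a file.
class RDDFromVector : public RDD {
    vector<string> items;
    std::size_t pos = 0;
public:
    explicit RDDFromVector(vector<string> v) : items(std::move(v)) {}
    bool hasMoreElements() override { return pos < items.size(); }
    string next() override { return items[pos++]; }
};

class RDDFilterLongWords : public RDD {
    RDD& parent;
    string buffered;           // next element that passed the filter, if any
    bool hasBuffered = false;  // true while `buffered` holds an unreturned element

    // Advance the parent until an element of at most 32 characters is found.
    // Each parent element is pulled exactly once, so nothing is recomputed.
    void fill() {
        while (!hasBuffered && parent.hasMoreElements()) {
            string s = parent.next();
            if (s.length() <= 32) {
                buffered = std::move(s);
                hasBuffered = true;
            }
        }
    }

public:
    explicit RDDFilterLongWords(RDD& parentRDD) : parent(parentRDD) {}

    // Safe to call repeatedly: once an element is buffered, fill() does nothing.
    bool hasMoreElements() override {
        fill();
        return hasBuffered;
    }

    string next() override {
        fill();
        assert(hasBuffered);
        hasBuffered = false;
        return std::move(buffered);
    }
};
```

The buffer holds at most one string at a time, so the extra memory is bounded by the length of the longest short element, not by the size of the input.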

C. Finally, implement a groupByFirstWord transformation, which is like Spark’s groupByKey, except
that (1) it uses the first word of the input string as the key, and (2) instead of building a list of all
elements with the same key, it concatenates all strings with the same key into one long string.
For example, groupByFirstWord on the RDD ["hello world", "hello cs149", "good luck", "parallelism
is fun", "good afternoon"] would produce the RDD ["hello world hello cs149", "good luck
good afternoon", "parallelism is fun"].
Your implementation can be rough pseudocode, and may assume the existence of a dictionary data
structure (mapping strings to strings) to actually perform the grouping, an iterator over the
dictionary’s keys, and useful string functions like .first() to get the first word of a string and
.append(string) to append one string to another.
Rough pseudocode is fine, but your solution should make it clear how you are tracking the next
element to return in next(). A full-credit solution will use a minimal memory footprint and never
recompute any element of any RDD.
class RDDGroupByFirstWord : public RDD {
    RDD& parent;  // held by reference: RDD is abstract and cannot be stored by value
    Dictionary<string, string> dict;  // assume dict["hello"] returns the string
                                      // associated with key "hello"

public:
    RDDGroupByFirstWord(RDD& parentRDD) : parent(parentRDD) {}

    bool hasMoreElements() {

    }

    string next() {

    }
};
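
For part C, the grouping cannot emit anything until the whole parent has been seen. The following is a rough sketch under several assumptions: std::map stands in for the handout's Dictionary, a map iterator named cursor tracks the next group to return, firstWord is a hypothetical replacement for .first(), and RDDFromVector is a hypothetical in-memory source. Because std::map iterates in key order, groups come out sorted by first word rather than in the order shown in the handout's example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>
using std::string;
using std::vector;

// Same interface as the handout's RDD base class (count/collect omitted for brevity).
class RDD {
public:
    virtual ~RDD() = default;
    virtual bool hasMoreElements() = 0;
    virtual string next() = 0;
};

// Hypothetical in-memory source, used only so the sketch can run without a file.
class RDDFromVector : public RDD {
    vector<string> items;
    std::size_t pos = 0;
public:
    explicit RDDFromVector(vector<string> v) : items(std::move(v)) {}
    bool hasMoreElements() override { return pos < items.size(); }
    string next() override { return items[pos++]; }
};

class RDDGroupByFirstWord : public RDD {
    RDD& parent;
    std::map<string, string> dict;              // first word -> concatenated strings
    std::map<string, string>::iterator cursor;  // next group to return
    bool built = false;

    // Everything before the first space (the whole string if there is no space).
    static string firstWord(const string& s) { return s.substr(0, s.find(' ')); }

    // Drain the parent exactly once, on first use, grouping as we go.
    void build() {
        if (built) return;
        while (parent.hasMoreElements()) {
            string s = parent.next();
            string& group = dict[firstWord(s)];
            if (!group.empty()) group += ' ';
            group += s;
        }
        cursor = dict.begin();
        built = true;
    }

public:
    explicit RDDGroupByFirstWord(RDD& parentRDD) : parent(parentRDD) {}

    bool hasMoreElements() override {
        build();
        return cursor != dict.end();
    }

    string next() override {
        build();
        assert(cursor != dict.end());
        return (cursor++)->second;
    }
};
```

Unlike the two streaming transforms, this one must hold every group in the dictionary before returning its first element, which is the contrast parts D and E ask about.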

D. Describe why the RDD transformations L33tify, FilterLongWords, and RDD construction from a
file, as well as the action count() can all execute efficiently on very large files (consider TB-sized
files) on a machine with a small amount of memory (1 GB of RAM).

E. Describe why the transformation GroupByFirstWord differs from the other transformations in terms
of how much memory footprint it requires to implement.

Introducing PKPU2.0: The GPU for the Metaverse

Problem 2:
Inspired by their early success documented in prior practice problems and the midterm practice problems,
your CS149 instructors decide to take on NVIDIA in the GPU design business and launch PKPU2.0... the
GPU designed (their marketing team claims) for metaverse applications! (PKPU stands for Prof. Kayvon
Processing Unit, or Prof. Kunle Processing Unit.) The PKPU2.0 runs CUDA programs in exactly the same
manner as the NVIDIA GPUs discussed in class, but it has the following characteristics:

• The processor has 16 cores (akin to NVIDIA SMs) running at 1 GHz.

• The cores execute CUDA threads in an implicit SIMD fashion running 32 consecutively numbered
CUDA threads together using the same instruction stream (PKPU2.0 implements 32-wide “warps”).

• Each core provides execution contexts for up to 256 CUDA threads (eight PKPU2.0 warps). Like
the GPUs discussed in class, once a CUDA thread is assigned to an execution context, the processor
runs the thread to completion before assigning a new CUDA thread to the context.

• The cores will fetch/decode one single-precision floating point arithmetic instruction (add, multiply,
compare, etc.) per clock (one fp operation completes per clock per ALU). Keep in mind this
instruction is executed on an entire warp in that clock, so exactly one warp can make progress each
clock. As we’ve often done in prior problems, you can assume that all other instructions (integer
ops, loads/stores) are “free” in that they are executed on other hardware units in the core, not the
main floating point ALUs.

A. When running at peak utilization, what is the PKPU2.0’s maximum throughput for executing
floating-point math operations?
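
As a worked check on the hardware model stated in the bullets, peak throughput is the product of the three given numbers. This assumes each core issues one 32-wide floating-point instruction every clock and that each of the 32 lanes counts as one operation:

```latex
\[
16\ \text{cores} \times 32\ \frac{\text{fp ops}}{\text{clock} \cdot \text{core}}
\times 10^{9}\ \frac{\text{clocks}}{\text{s}}
= 5.12 \times 10^{11}\ \frac{\text{fp ops}}{\text{s}}
= 512\ \text{GFLOP/s}
\]
```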

B. Consider a CUDA kernel launch that executes the following CUDA kernel on the processor. In this
program each CUDA thread computes one element of the results array Y using one element from
the input array X. Assume that (1) the program is run on large arrays of size 128 million elements,
(2) the CUDA program is compiled using a CUDA thread-block size of 128 threads, and (3) enough
thread blocks are created in the bulk thread launch so that there is exactly one CUDA thread per
output array element.

__global__ void my_cuda_function(float* X, float* Y) {

    // get array index from CUDA block/thread id
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    float val = X[idx];          // load instr

    float output;
    float val2 = 2.0f * val;     // 1 arithmetic cycle
    if (val2 > 0.0f) {           // 1 arithmetic cycle
        output = f1(val);        // 14 arithmetic cycles
    } else {
        output = f2(val);        // 14 arithmetic cycles
    }

    Y[idx] = output;             // memory store
}

The input array contains values with the following pattern (recall there are 128M elements):

[ 1.0,  2.0, ...,  32.0,
 -1.0, -2.0, ..., -32.0,
  1.0,  2.0, ...,  32.0,
 -1.0, -2.0, ..., -32.0, ...]

Does this workload suffer from instruction stream divergence? Please state YES or NO and explain
why.

C. Given the input values shown in the previous problem, what is the arithmetic intensity of the program,
in terms of PKPU2.0 cycles of floating point arithmetic (accounting for the potential of
divergence) per byte transferred from memory? Please write your answer as a fraction. (Hint:
This is best computed at the granularity of a warp!)

D. Assume that on the PKPU2.0, the memory latency of loads is 50 cycles. (Assume stores have 0 latency
and assume (for now) that the memory system has very high bandwidth.) Does the PKPU2.0
have the ability to hide all memory latency from loads? Why or why not?

E. Now assume that the PKPU2.0 memory system has 128 GB/sec of bandwidth (and still has a load
latency of 50 cycles). Is this program compute bound or bandwidth bound on the PKPU2.0? (Show the
calculations underlying your answer.) If you conclude the PKPU2.0 is bandwidth bound running
this code, tell us what the utilization of the processor will be. Remember the PKPU2.0 has 16
cores operating at 1 GHz.
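
The quantities in parts C through E can be derived mechanically from the stated machine parameters. The sketch below is one way to organize that arithmetic, not an official solution. It rests on two assumptions that should be checked against your own reasoning: that with the given input every 32-thread warp takes a single side of the branch (so a warp spends 1 + 1 + 14 = 16 arithmetic cycles), and that each thread moves 8 bytes (one 4-byte load, one 4-byte store). The struct and function names are made up for this illustration.

```cpp
#include <cassert>

// Machine parameters restated from the problem text.
struct Pkpu {
    double cores = 16;
    double clockHz = 1e9;
    double warpWidth = 32;
    double warpsPerCore = 8;                  // 256 thread contexts / 32 threads per warp
    double loadLatencyCycles = 50;
    double memBandwidthBytesPerSec = 128e9;
};

// Assumption: no divergence within a warp, so 1 + 1 + 14 cycles per warp.
constexpr double kArithCyclesPerWarp = 16.0;
// Assumption: one 4-byte load plus one 4-byte store per thread.
constexpr double kBytesPerThread = 8.0;

double bytesPerWarp(const Pkpu& p) { return p.warpWidth * kBytesPerThread; }

// Arithmetic intensity in arithmetic cycles per byte of memory traffic.
double arithmeticIntensity(const Pkpu& p) { return kArithCyclesPerWarp / bytesPerWarp(p); }

// Cycles of work the other resident warps can supply while one warp waits on a load.
double cyclesOfOtherWork(const Pkpu& p) { return (p.warpsPerCore - 1) * kArithCyclesPerWarp; }

bool canHideLoadLatency(const Pkpu& p) { return cyclesOfOtherWork(p) >= p.loadLatencyCycles; }

// Bandwidth needed to keep every core issuing arithmetic every clock.
double requiredBandwidth(const Pkpu& p) {
    double bytesPerCyclePerCore = bytesPerWarp(p) / kArithCyclesPerWarp;
    return bytesPerCyclePerCore * p.cores * p.clockHz;
}

// Fraction of peak arithmetic throughput achievable, capped at 1 when compute bound.
double utilization(const Pkpu& p) {
    double u = p.memBandwidthBytesPerSec / requiredBandwidth(p);
    return u < 1.0 ? u : 1.0;
}
```

Under these assumptions the functions evaluate to 1/16 cycle per byte, 112 cycles of overlappable work against a 50-cycle latency, 256 GB/sec of required bandwidth, and a utilization of 0.5.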

F. You are hired to improve the PKPU2.0’s performance on this workload. You have four options.

(a) Increase the maximum number of CUDA thread execution contexts by 2×.
(b) Triple the memory bandwidth.
(c) Add a data cache that can hold 1/2 of the elements in the input and output arrays.
(d) Double the SIMD width (aka warp size) to 64 (while still maintaining the ability to run exactly
one instruction per warp per clock).

Which option do you choose to get the best performance on the given input data, and what speedup
do you expect to observe (compared to the original unmodified PKPU2.0) from this change? Explain
why. (Note: assume that at the start of the CUDA program’s execution, all the input/output
data is located in main memory, and is not resident in cache.)

MSI Coherence Protocol Warmup

Problem 3:
Below is a state diagram for the MSI protocol.

[Figure: MSI state-transition diagram with states M (modified), S (shared), and I (invalid).
Transitions, written as observed event / generated bus action:
  M: PrRd / -- and PrWr / -- (stay in M); BusRd / BusWB (to S); BusRdX / BusWB (to I)
  S: PrRd / -- and BusRd / -- (stay in S); PrWr / BusRdX (to M); BusRdX / -- (to I)
  I: PrRd / BusRd (to S); PrWr / BusRdX (to M)
Source slide: Stanford CS149, Fall 2020.]

Consider the following sequence of operations by processors P0 and P1. Assume that each processor has a local
cache carrying out the MSI protocol. Each cache can store two cache lines of data (there’s room for both X
and Y). For simplicity, please assume that the variables X and Y each take an entire cache line.
Please fill out the table below, which indicates the state of the lines X and Y in the local caches of P0 and
P1 after each operation. We’ve given the first four rows for you as examples.

Operation      P0 X state   P1 X state   P0 Y state   P1 Y state
=============  ===========  ===========  ===========  ===========
P0 LOAD X      S [MISS]     I            I            I
P0 LOAD X      S [HIT]      I            I            I
P1 STORE Y     S            I            I            M [MISS]
P1 STORE X     I            M [MISS]     I            M
P1 LOAD Y
P1 STORE Y
P1 LOAD Y
P0 STORE X
P0 STORE Y

Bringing Locality Back

PRACTICE PROBLEM 1:
Justin Timberlake wants to get back in the news. He hears that Spark is all the rage and decides he’s going
to code up his own implementation to compete against that of the Apache project. Justin’s first test runs
the following Spark program, which creates four RDDs. The program takes Justin’s lengthy (1 TB!) list of
dancing tips and finds all misspelled words.
var lines = spark.textFile("hdfs://mydancetips.txt");       // 1 TB file
var lower = lines.map( x => x.toLower() );                  // convert lines to lower case
var words = lower.flatmap( x => x.split(" ") );             // convert RDD of lines to RDD of
                                                            // individual words
var misspelled = words.filter( x => !x.isInDictionary() );  // filter to find misspellings

print misspelled.count(); // print number of misspelled words

A. Understanding that the Spark RDD abstraction affords many possible implementations, Justin decides
to keep things simple and implements his Spark runtime such that each RDD is implemented
by a fully allocated array. This array is stored either in memory or on disk depending on the size of
the RDD and available RAM. The array is allocated and populated at the time the RDD is created,
as a result of executing the appropriate operator (map, flatmap, filter, etc.) on the input RDD.
Justin runs his program on a cluster with 10 computers, each of which has 100 GB of memory. The
program gets correct results, but Justin is devastated because the program runs incredibly slowly. He
calls his friend Taylor Swift, ready to give up on the venture. Encouragingly, Taylor says, “Shake
it off, Justin, just run your code on 40 computers.” Justin does this and observes a speedup much
greater than 4× over his original performance. Why is this the case?

B. With things looking good, Justin runs off to write a new single, “Bringing Locality Back”, to use in
marketing his product. At that moment, Taylor calls back and says, “Actually, Justin, I think
you can schedule the computations much more efficiently and get very good performance with less
memory and far fewer nodes.” Describe how you would change how Justin schedules his Spark
computations to improve memory efficiency and performance.
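
One scheduling idea that fits Taylor's hint is to push each line through all four operators before reading the next line, so that no intermediate RDD is ever stored in full. The single-node sketch below illustrates only that loop structure, under assumptions: the dictionary is a std::set passed in by the caller, isInDictionary is a hypothetical stand-in for the handout's x.isInDictionary(), and words are split on any whitespace rather than on a single space. On a cluster, each node would presumably run this loop over its own partition of the file and the per-node counts would be summed.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical stand-in for the handout's x.isInDictionary().
bool isInDictionary(const std::string& word, const std::set<std::string>& dictionary) {
    return dictionary.count(word) > 0;
}

// Stream the input one line at a time and apply map -> flatmap -> filter -> count
// to each line before reading the next, so no intermediate RDD is materialized.
long countMisspelled(std::istream& lines, const std::set<std::string>& dictionary) {
    long misspelled = 0;
    std::string line;
    while (std::getline(lines, line)) {
        // map: convert the line to lower case
        for (char& c : line) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        // flatmap: split the line into words (on whitespace)
        std::istringstream words(line);
        std::string word;
        while (words >> word) {
            // filter + count: tally words that are missing from the dictionary
            if (!isInDictionary(word, dictionary)) {
                ++misspelled;
            }
        }
    }
    return misspelled;
}
```

The working set is one line and one word at a time, so memory use is independent of the 1 TB file size, and the file is read exactly once.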

C. After hacking until midnight (which in turn inspired Taylor’s recent album), Justin and Taylor
run the optimized program on 10 nodes. The program runs for 1 hour, and then right before
misspelled.count() returns, node 6 crashes. Justin is devastated! He says, “Taylor, I have a
single to release, and I don’t have time to deal with rerunning programs from scratch.” Taylor gives
Justin a stink eye and says, “Don’t worry, it will be complete in just a few minutes.” Approximately
how long will it take after the crash for the program to complete? You should assume the .count()
operation is essentially free, but please clearly state any assumptions about how the computation
is scheduled in justifying your answer.

Fusion, Fusion, Fusion

PRACTICE PROBLEM 2:
Your boss asks you to buy a computer for running the program below. The program uses a math library
(cs149_math). The library functions should be self-explanatory, but example implementations of the
cs149math_add and cs149math_sum functions are given below.
const int N = 10000000; // very large

void cs149math_sub(float* A, float* B, float* output);

void cs149math_mul(float* A, float* B, float* output);

void cs149math_add(float* A, float* B, float* output) {

    // Recall from written asst 1 that this OpenMP directive tells the
    // C compiler that iterations of the for loop are independent, and
    // that implementations of C compilers that support
    // OpenMP will parallelize this loop using multiple threads.
    #pragma omp parallel for
    for (int i=0; i<N; i++)
        output[i] = A[i]+B[i];
}

float cs149math_sum(float* A) { // compute sum of all elements of the input array

    atomic<float> x(0.0f);
    #pragma omp parallel for
    for (int i=0; i<N; i++)
        x += A[i];
    return x;
}

//////////////////////////////////////////////////////////////////
// The program is below:
//////////////////////////////////////////////////////////////////

// assume arrays are allocated and initialized

float* src1, *src2, *src3, *tmp1, *tmp2, *tmp3, *dst;

cs149math_add(src1, src2, tmp1);        // 1
cs149math_mul(tmp1, src3, tmp2);        // 2
cs149math_mul(tmp2, src1, tmp3);        // 3
float x = cs149math_sum(tmp2) / N;      // 4
if (x > 10.0) {
    cs149math_mul(tmp3, src1, tmp1);    // 5
    cs149math_add(src1, tmp1, tmp2);    // 6
    cs149math_add(src1, tmp2, dst);     // 7
} else {
    cs149math_add(tmp3, src2, tmp1);    // 8
    cs149math_mul(src2, tmp1, tmp2);    // 9
    cs149math_mul(src2, tmp2, dst);     // 10
}

You have two computers to choose from, of equal price. (Assume that both machines have the same 16MB
cache and 0 memory latency.)

1. Computer A: Four cores 1 GHz, 4-wide SIMD, 192 GB/sec bandwidth

2. Computer B: Four cores 1 GHz, 8-wide SIMD, 128 GB/sec bandwidth

ASSUME THAT YOU ARE ALLOWED TO REWRITE THE CODE, INCLUDING REPLACING LIBRARY
CALLS IF DESIRED, provided that it computes exactly the same answer. (You can parallelize across
cores, vectorize, reorder loops, etc., but you are not permitted to change the math operations to turn adds
into multiplies, eliminate common subexpressions, etc.) Please give the arithmetic intensity of your new
program assuming that both loads and stores are 4 bytes of data transfer. (You can also assume 1 GB
is 10^9 bytes.) As a result, which machine do you choose? Why? (If you decide to change the program,
please give a pseudocode description of your changes: what is parallelized, what is vectorized, what does
the loop structure look like, etc.)
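
To make the idea of fusing library calls concrete, here is one possible restructuring, offered as a sketch rather than the expected answer. It assumes std::vector arguments in place of the raw pointers and global N, and the function name fused_program is made up. The loops are written sequentially. In a parallel version each loop would be split across cores and vectorized, with the sum handled as a reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fused rewrite: two passes over the data instead of seven library calls.
// The math operations and their operands are the same as in calls 1-10.
void fused_program(const std::vector<float>& src1, const std::vector<float>& src2,
                   const std::vector<float>& src3, std::vector<float>& tmp3,
                   std::vector<float>& dst) {
    const std::size_t n = src1.size();
    assert(src2.size() == n && src3.size() == n && tmp3.size() == n && dst.size() == n);

    // Pass 1 fuses calls 1-4: per element, 3 loads + 1 store and 4 arithmetic ops.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i++) {
        float t1 = src1[i] + src2[i];   // call 1
        float t2 = t1 * src3[i];        // call 2
        tmp3[i] = t2 * src1[i];         // call 3
        sum += t2;                      // call 4 (a reduction when parallelized)
    }
    float x = sum / static_cast<float>(n);

    // Pass 2 fuses calls 5-7 or 8-10: per element, 2 loads + 1 store and 3 arithmetic ops.
    if (x > 10.0f) {
        for (std::size_t i = 0; i < n; i++) {
            float t1 = tmp3[i] * src1[i];   // call 5
            float t2 = src1[i] + t1;        // call 6
            dst[i] = src1[i] + t2;          // call 7
        }
    } else {
        for (std::size_t i = 0; i < n; i++) {
            float t1 = tmp3[i] + src2[i];   // call 8
            float t2 = src2[i] * t1;        // call 9
            dst[i] = src2[i] * t2;          // call 10
        }
    }
}
```

By the counts in the comments, this particular sketch performs 7 arithmetic operations per 28 bytes transferred, or 1/4 operation per byte. Other rewrites, such as recomputing tmp3 in the second pass instead of storing it, trade bytes for operations differently.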

Cache Coherence and False Sharing

PRACTICE PROBLEM 3:
Consider the following program.

void worker(int* counter) { // each thread runs this function

    barrier();

    // <--- invalidate caches: assume all lines in all caches are
    //      invalidated here after leaving barrier --->

    for (int i=0; i<NUM_ITERS; i++)
        (*counter)++; // work here (load + incr + store)
}

void test(int num_threads) {

    std::thread threads[MAX_THREADS];
    int counter[MAX_THREADS]; // Assume this is aligned on a cache line boundary

    for (int i=0; i<num_threads; i++)
        threads[i] = std::thread(worker, &counter[i]); // spawn thread
    for (int i=0; i<num_threads; i++)
        threads[i].join(); // wait for thread to terminate
}

Consider calling test() with num_threads=2.

Assume that the code is run on a dual-core processor where:
– Each core has its own private cache
– Caches use the MSI protocol to implement coherence. For reference, the MSI diagram is given below.

[Figure: MSI state-transition diagram with states M (modified), S (shared), and I (invalid).
Transitions, written as observed event / generated bus action:
  M: PrRd / -- and PrWr / -- (stay in M); BusRd / BusWB (to S); BusRdX / BusWB (to I)
  S: PrRd / -- and BusRd / -- (stay in S); PrWr / BusRdX (to M); BusRdX / -- (to I)
  I: PrRd / BusRd (to S); PrWr / BusRdX (to M)
Source slide: Stanford CS149, Fall 2020.]

In your answer to the following questions:
• You should only analyze references to the counter array.
• Assume all arithmetic is free; you only need to think about memory operations in this problem.
• Assume that the bus access protocol is such that one core executes the whole C statement involving
both the memory read and then the memory write (see the comment “// work here”) before
allowing bus transactions from the other core.
• Assume that between iterations of the for loop by one core, the other core is able to execute one
iteration of its for loop as well.
• Each processor action (PrRd, PrWr shown in green) takes 1 cycle. (E.g., a cache hit on a PrWr takes
1 cycle.)
• Each bus transaction (BusRd, BusRdX, BusWB shown in red) takes 10 cycles. (Specifically: a PrWr
that generates a BusRdX takes a total of 1+10=11 cycles. A PrWr that generates a BusRdX that triggers
a remote cache BusWB (write back) requires 1+10+10=21 cycles.)
• No other operations manipulate the state of caches.
• Only consider operations that occur after the line <invalidate caches>.

A. How many cycles does the memory system take to execute the worker for-loop on one of the threads
with the following cache organization? On the figure, please list how many times the various coherence
protocol transitions occur.
• 16 byte cache size
• Fully associative cache (any line can go anywhere, no conflict misses)
• 4 byte cache line size

B. How many cycles does the memory system take to execute the worker for-loop on one of the threads
with the following cache organization? On the figure, please list how many times the various coherence
protocol transitions occur.
• 16 byte cache size
• Fully associative cache (any line can go anywhere, no conflict misses)
• 8 byte cache line size

Angry Students

PRACTICE PROBLEM 4:
Your friend is developing a game that features a horde of angry students chasing after professors for making
long exams. Simulating students is expensive, so your friend decides to parallelize the computation
using one thread to compute and update the students’ positions, and another thread to simulate the
students’ angriness. The state of the game’s N students is stored in the global array students (in the code
below).

struct Student {
float position; // assume position is 1D for simplicity
float angriness;
};

Student students[N];

////////////////////////////////

void update_positions() {
for (int i=0; i<N; i++) {
students[i].position = compute_new_position(i);
}
}

void update_angriness() {
for (int i=0; i<N; i++) {
students[i].angriness = compute_new_angriness(i);
}
}

////////////////////////////////

// ... initialize students here

std::thread t0, t1;

t0 = std::thread(update_positions);
t1 = std::thread(update_angriness);
t0.join();
t1.join();

A. Since there is no synchronization between thread 0 and thread 1, your friend expects a near-perfect
2× speedup when running on a two-core processor that implements invalidation-based cache coherence
using 64-byte cache lines. She is shocked when she doesn’t obtain it. Why is this the case?
(For this problem assume that there is sufficient bandwidth to keep two cores busy – “the code is
bandwidth bound” is not an answer we are looking for.) HINT: consider how data is laid out in
memory in a C struct.

B. Modify the program to correct the performance problem. You are allowed to modify the code and
data structures as you wish, but you are not allowed to change what computations are performed
by each thread and your solution should not substantially increase the amount of memory used
by the program. You only need to describe your solution in pseudocode (compilable code is not
required).
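
One layout change that matches the hint is to store the two fields in separate arrays, so that each thread writes only to cache lines the other thread never touches. The sketch below shows that structure-of-arrays layout under assumptions: N = 1024 is an arbitrary size chosen for the example, kCacheLine restates the 64-byte line size from part A, and the two compute functions are placeholder stand-ins for the handout's unspecified ones. Thread creation is left exactly as in the original program and is not repeated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 1024;          // hypothetical student count for the sketch
constexpr std::size_t kCacheLine = 64;   // cache line size given in part A

// Structure-of-arrays layout: each array starts on its own cache-line boundary,
// so no cache line contains both a position and an angriness value.
alignas(kCacheLine) float positions[N];
alignas(kCacheLine) float angriness[N];

// Placeholder stand-ins for the handout's unspecified compute functions.
float compute_new_position(std::size_t i) { return static_cast<float>(i) * 2.0f; }
float compute_new_angriness(std::size_t i) { return static_cast<float>(i) + 1.0f; }

// Run by thread t0: writes only to the positions array.
void update_positions() {
    for (std::size_t i = 0; i < N; i++) {
        positions[i] = compute_new_position(i);
    }
}

// Run by thread t1: writes only to the angriness array.
void update_angriness() {
    for (std::size_t i = 0; i < N; i++) {
        angriness[i] = compute_new_angriness(i);
    }
}
```

The total storage is still two floats per student, and each thread performs the same computation as before. Only the placement of the data in memory changes.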
