0% found this document useful (0 votes)

21 views51 pages

04 Progbasics

The lecture discusses the basics of parallel programming, focusing on creating and optimizing parallel programs through decomposition, assignment, orchestration, and mapping to hardware. It emphasizes the importance of identifying parallelizable tasks, managing dependencies, and maximizing speedup while considering factors like communication costs and workload balance. A case study on a grid-based solver illustrates the application of these concepts, including the use of red-black coloring to improve performance by reorganizing the update order of grid cells.

Uploaded by

saudiqbal886

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

21 views51 pages

04 Progbasics

Uploaded by

saudiqbal886

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Lecture 4:

Parallel Programming Basics

Parallel Computing
Stanford CS149, Fall 2023
Today’s topic: case study on writing an optimizing a
parallel program
▪ Demonstrated in two programming models
- data parallel
- shared address space

Stanford CS149, Fall 2023

Creating a parallel program
▪ Your thought process:
1. Identify work that can be performed in parallel
2. Partition work (and also data associated with the work)
3. Manage data access, communication, and synchronization

▪ A common goal is maximizing speedup *

For a fixed computation:

Time (1 processor)
Speedup( P processors ) =
Time (P processors)

* Other goals include achieving high efficiency (cost, area, power, etc.) or working on bigger problems than can fit on one machine
Stanford CS149, Fall 2023
Creating a parallel program
Problem to solve
Decomposition
Subproblems
(a.k.a. “tasks”,
“work to do”)

Assignment
Parallel Threads **
(“workers”)
** I had to pick a term Orchestration
Parallel program
(communicating
threads)
Mapping

These responsibilities may be assumed by the programmer,

Execution on by the system (compiler, runtime, hardware), or by both!
parallel machine

Adopted from: Culler, Singh, and Gupta Stanford CS149, Fall 2023
Problem decomposition
▪ Break up problem into tasks that can be carried out in parallel
▪ In general: create at least enough tasks to keep all execution units on a machine busy

Key challenge of decomposition:

identifying dependencies
(or... a lack of dependencies)

Stanford CS149, Fall 2023

Amdahl’s Law: dependencies limit maximum speedup
due to parallelism

▪ You run your favorite sequential program...

▪ Let S = the fraction of sequential execution that is inherently sequential (dependencies

prevent parallel execution)

▪ Then maximum speedup due to parallel execution ≤ 1/S

Stanford CS149, Fall 2023

A simple example
▪ Consider a two-step computation on a N x N image
- Step 1: multiply brightness of all pixels by two
(independent computation on each pixel)
- Step 2: compute average of all pixel values

▪ Sequential implementation of program

- Both steps take ~ N2 time, so total time is ~ 2N2
N
Parallelism

N
N2 N2
1

Execution time
Stanford CS149, Fall 2023
First attempt at parallelism (P processors)
▪ Strategy:
- Step 1: execute in parallel P
- time for phase 1: N2/P Sequential program

Parallelism
- Step 2: execute serially
- time for phase 2: N2
N2 N2
1

▪ Overall performance: Execution time

Speedup
N2/P
P
Parallel program
Parallelism
Speedup ≤ 2
N2
1

Execution time
Stanford CS149, Fall 2023
Parallelizing step 2
▪ Strategy:
- Step 1: execute in parallel
- time for phase 1: N2/P
- Step 2: compute partial sums in parallel, combine results serially
- time for phase 2: N2/P + P

▪ Overall performance:
- Speedup
Overhead of parallel algorithm:
N2/P N2/P P combining the partial sums
P

Parallelism Parallel program

Note: speedup → P when N >> P
1

Execution time
Stanford CS149, Fall 2023
Amdahl’s law
▪ Let S = the fraction of total work that is inherently sequential
▪ Max speedup on P processors given by:
speedup

Max Speedup S=0.01

S=0.05

S=0.1

Num Processors
Stanford CS149, Fall 2023
A small serial region can limit speedup on a large parallel machine
Summit supercomputer: 27,648 GPUs x (5,376 ALUs/GPU) = 148,635,648 ALUs
Machine can perform 148 million single precision operations in parallel
What is max speedup if 0.1% of application is serial?

Stanford CS149, Fall 2023

Decomposition
▪ Who is responsible for decomposing a program into independent tasks?
- In most cases: the programmer

▪ Automatic decomposition of sequential programs continues to be a challenging

research problem (very difficult in the general case)
- Compiler must analyze program, identify dependencies
- What if dependencies are data dependent (not known at compile time)?
- Researchers have had modest success with simple loop nests
- The “magic parallelizing compiler” for complex, general-purpose code has not yet been achieved

Stanford CS149, Fall 2023

Assignment: assigning tasks to workers
Problem to solve

Decomposition
Subproblems
(a.k.a. “tasks”,
“work to do”)

Assignment
Parallel Threads **
(“workers”)
Orchestration
Parallel program
(communicating
threads)
Mapping

Execution on
** I had to pick a term
parallel machine

Stanford CS149, Fall 2023

Assignment
▪ Assigning tasks to workers
- Think of “tasks” as things to do
- What are “workers”? (Might be threads, program instances, vector lanes, etc.)

▪ Goals: achieve good workload balance, reduce communication costs

▪ Can be performed statically (before application is run), or dynamically as program executes

▪ Although programmer is often responsible for decomposition, many languages/runtimes take

responsibility for assignment.

Stanford CS149, Fall 2023

Assignment examples in ISPC
export void ispc_sinx_interleaved( export void ispc_sinx_foreach(
uniform int N, uniform int N,
uniform int terms, uniform int terms,
uniform float* x, uniform float* x,
uniform float* result) uniform float* result)
{ {
// assumes N % programCount = 0 foreach (i = 0 ... N)
for (uniform int i=0; i<N; i+=programCount) {
{ float value = x[i];
int idx = i + programIndex; float numer = x[i] * x[i] * x[i];
float value = x[idx]; uniform int denom = 6; // 3!
float numer = x[idx] * x[idx] * x[idx]; uniform int sign = -1;
uniform int denom = 6; // 3!
uniform int sign = -1; for (uniform int j=1; j<=terms; j++)
{
for (uniform int j=1; j<=terms; j++) value += sign * numer / denom;
{ numer *= x[i] * x[i];
value += sign * numer / denom; denom *= (2*j+2) * (2*j+3);
numer *= x[idx] * x[idx]; sign *= -1;
denom *= (2*j+2) * (2*j+3); }
sign *= -1; result[i] = value;
} }
result[i] = value; }
}
} Decomposition of work by loop iteration
Decomposition of work by loop iteration foreach construct exposes independent work to system
Programmer-managed assignment: System-manages assignment of iterations (work) to ISPC program
Static assignment instances (abstraction leaves room for dynamic assignment, but
Assign iterations to ISPC program instances in interleaved fashion current ISPC implementation is static)
Stanford CS149, Fall 2023
Example 2: static assignment using C++11 threads
void my_thread_start(int N, int terms, float* x, float* results) { Decomposition of work by loop iteration
sinx(N, terms, x, result); // do work
}

Programmer-managed static assignment

void parallel_sinx(int N, int terms, float* x, float* result) { This program assigns loop iterations to threads in a
blocked fashion (first half of array assigned to the
int half = N/2.
spawned thread, second half assigned to main thread)
// launch thread to do work on first half of array
std::thread t1(my_thread_start, half, terms, x, result);

// do work on second half of array in main thread

sinx(N - half, terms, x + half, result + half);

[Link]();
}

Stanford CS149, Fall 2023

Dynamic assignment using ISPC tasks
void foo(uniform float* input,
ISPC runtime (invisible to the programmer)
uniform float* output, assigns tasks to worker threads in a thread pool
uniform int N)
{
// create a bunch of tasks
launch[100] my_ispc_task(input, output, N);
}

Next task ptr

List of tasks:
task 0 task 1 task 2 task 3 task 4 ... task 99

Implementation of task assignment to threads: after completing current task,

worker thread inspects list and assigns itself the next uncompleted task.

Worker Worker Worker Worker

thread 0 thread 1 thread 2 thread 3

Stanford CS149, Fall 2023

Orchestration
Problem to solve

Decomposition
Subproblems
(a.k.a. “tasks”,
“work to do”)

Assignment
Parallel Threads **
(“workers”)
Orchestration
Parallel program
(communicating
threads)
Mapping

Execution on
parallel machine

** I had to pick a term Stanford CS149, Fall 2023

Orchestration
▪ Involves:
- Structuring communication
- Adding synchronization to preserve dependencies if necessary
- Organizing data structures in memory
- Scheduling tasks

▪ Goals: reduce costs of communication/sync, preserve locality of data reference,

reduce overhead, etc.

▪ Machine details impact many of these decisions

- If synchronization is expensive, programmer might use it more sparsely

Stanford CS149, Fall 2023

Mapping to hardware
Problem to solve

Decomposition
Subproblems
(a.k.a. “tasks”,
“work to do”)

Assignment
Parallel Threads **
(“workers”)
Orchestration
Parallel program
(communicating
threads)
Mapping

Execution on
parallel machine

** I had to pick a term Stanford CS149, Fall 2023

Mapping to hardware
▪ Mapping “threads” (“workers”) to hardware execution units
▪ Example 1: mapping by the operating system
- e.g., map a thread to HW execution context on a CPU core

▪ Example 2: mapping by the compiler

- Map ISPC program instances to vector instruction lanes

▪ Example 3: mapping by the hardware

- Map CUDA thread blocks to GPU cores (discussed in a future lecture)

▪ Many interesting mapping decisions:

- Place related threads (cooperating threads) on the same core
(maximize locality, data sharing, minimize costs of comm/sync)
- Place unrelated threads on the same core (one might be bandwidth limited and another might be compute limited) to use
machine more efficiently
Stanford CS149, Fall 2023
A parallel programming example

Stanford CS149, Fall 2023

A 2D-grid based solver
▪ Problem: solve partial differential equation (PDE) on (N+2) x (N+2) grid
▪ Solution uses iterative algorithm:
- Perform Gauss-Seidel sweeps over grid until convergence
N
A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j]
+ A[i,j+1] + A[i+1,j]);

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2023
Grid solver algorithm: find the dependencies
Pseudocode for sequential algorithm is provided below
const int n;
float* A; // assume allocated for grid of N+2 x N+2 elements

void solve(float* A) {

float diff, prev;

bool done = false;

while (!done) { // outermost loop: iterations

diff = 0.f;
for (int i=1; i<n i++) { // iterate over non-border points of grid
for (int j=1; j<n; j++) {
prev = A[i,j];
A[i,j] = 0.2f * (A[i,j] + A[i,j-1] + A[i-1,j] +
A[i,j+1] + A[i+1,j]);
diff += fabs(A[i,j] - prev); // compute amount of change
}
}

if (diff/(n*n) < TOLERANCE) // quit if converged

done = true;
}
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2023
Step 1: identify dependencies
(problem decomposition phase)
N
Each row element depends on element to left.

Each row depends on previous row.

N
...
...

Note: the dependencies illustrated on this slide are grid

element data dependencies in one iteration of the solver
(in one iteration of the “while not done” loop)

Stanford CS149, Fall 2023

Step 1: identify dependencies
(problem decomposition phase)
N
There is independent work along the diagonals!

Good: parallelism exists!

Possible implementation strategy:

1. Partition grid cells on a diagonal into tasks
N
2. Update values in parallel
3. When complete, move to next diagonal
...
...

Bad: independent work is hard to exploit

Not much parallelism at beginning and end of computation.
Frequent synchronization (after completing each diagonal)

Stanford CS149, Fall 2023

Let’s make life easier on ourselves
▪ Idea: improve performance by changing the algorithm to one that is more amenable
to parallelism
- Change the order that grid cell cells are updated
- New algorithm iterates to same solution (approximately), but converges to solution
differently
- Note: floating-point values computed are different, but solution still converges to within error threshold
- Yes, we needed domain knowledge of the Gauss-Seidel method to realize this
change is permissible
- But this is a common technique in parallel programming

Stanford CS149, Fall 2023

New approach: reorder grid cell update via red-black coloring
Reorder grid traversal: red-black coloring

N
Update all red cells in parallel

When done updating red cells ,

update all black cells in parallel
(respect dependency on red cells)
N
Repeat until convergence

Stanford CS149, Fall 2023

Possible assignments of work to processors
Reorder grid traversal: red-black coloring

Question: Which is better? Does it matter?

Answer: it depends on the system this program is running on
Stanford CS149, Fall 2023
Consider dependencies in the program
1. Perform red cell update in parallel Compute red cells
2. Wait until all processors done with update Wait
3. Communicate updated red cells to other processors
4. Perform black cell update in parallel Compute black cells

5. Wait until all processors done with update Wait

6. Communicate updated black cells to other processors
7. Repeat
P1 P2 P3 P4

Stanford CS149, Fall 2023

Communication resulting from assignment
Reorder grid traversal: red-black coloring

= data that must be sent to P2 each iteration

Blocked assignment requires less data to be communicated between processors
Stanford CS149, Fall 2023
Two ways to think about writing this program

▪ Data parallel thinking

▪ SPMD / shared address space

Stanford CS149, Fall 2023

Data-parallel expression of solver

Stanford CS149, Fall 2023

Data-parallel expression of grid solver
Note: to simplify pseudocode: just showing red-cell update

const int n;
float* A = allocate(n+2, n+2)); // allocate grid Assignment: ???
void solve(float* A) {

bool done = false;

float diff = 0.f;
while (!done) { Decomposition:
for_all (red cells (i,j)) { processing individual grid elements
float prev = A[i,j];
A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
constitutes independent work
A[i+1,j] + A[i,j+1]);
reduceAdd(diff, abs(A[i,j] - prev));
}
Orchestration: handled by system
(builtin communication primitive: reduceAdd)
if (diff/(n*n) < TOLERANCE)
done = true;
}
Orchestration: handled by system
} (End of for_all block is implicit wait for all workers
before returning to sequential control)

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2023
Shared address space
(with SPMD threads)
expression of solver

Stanford CS149, Fall 2023

Shared address space expression of solver
SPMD execution model
Compute red cells
▪ Programmer is responsible for synchronization Wait
▪ Common synchronization primitives:
- Locks (provide mutual exclusion): only one Compute black cells
Wait
thread in the critical region at a time
- Barriers: wait for threads to reach this point
P1 P2 P3 P4

Stanford CS149, Fall 2023

Shared address space solver (pseudocode in SPMD execution model)
int
bool
n;
done = false;
// grid size
Assume these are global variables
float diff = 0.0; (accessible to all threads)
LOCK myLock;
BARRIER myBarrier; Assume solve() function is executed by all threads.
// allocate grid (SPMD-style)
float* A = allocate(n+2, n+2);

void solve(float* A) { Value of threadId is different for each SPMD instance:

float myDiff;
int threadId = getThreadId();
use value to compute region of grid to work on
int myMin = 1 + (threadId * n / NUM_PROCESSORS);
int myMax = myMin + (n / NUM_PROCESSORS)

What’s this lock doing here ?????

lock(myLock);
diff += myDiff;
unlock(myLock);
barrier(myBarrier, NUM_PROCESSORS);
if (diff/(n*n) < TOLERANCE) // check convergence, all threads get same answer
done = true;

}
barrier(myBarrier, NUM_PROCESSORS); And these barriers?
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2023
Synchronization in a shared address space

Stanford CS149, Fall 2023

Shared address space model (abstraction)
Threads communicate by reading/writing to locations in a shared address space (shared variables)
Assume x=0 when threads are launched
Thread 1: Thread 2:
// Do work here… void foo(int* x) {

// write to address holding // read from addr storing

// contents of variable x // contents of variable x
x = 1; while (x == 0) {}
print x;
}
Store to x
Thread 1
x

Shared address space

Thread 2 Load from x

(Communication operations shown in red)

(Pseudocode provided in a fake C-like language for brevity.) Stanford CS149, Fall 2023
A common metaphor:
A shared address space is
like a bulletin board

(Everyone can read/write)

Image credit:
[Link]
Stanford CS149, Fall 2023
Coordinating access to shared variables with synchronization
Shared (among all threads) variables:
int x = 0;
Lock my_lock;

Thread 1: Thread 2:

[Link](); my_lock.lock();
x++; x++;
[Link](); my_lock.unlock();

print(x); print(x);

Stanford CS149, Fall 2023

Review: why do we need mutual exclusion?
▪ Each thread executes:
- Load the value of variable x from a location in memory into register r1
(this stores a copy of the value in memory in the register)
- Add the contents of register r2 to register r1
- Store the value of register r1 into the address storing the program variable x
▪ One possible interleaving: (let starting value of x=0, r2=1)
T1 T2
r1 ← x T1 reads value 0
r1 ← x T2 reads value 0
r1 ← r1 + r2 T1 sets value of its r1 to 1
r1 ← r1 + r2 T2 sets value of its r1 to 1
X ← r1 T1 stores 1 to address of x
X ← r1 T2 stores 1 to address of x

▪ Need this set of three instructions must be “atomic”

Stanford CS149, Fall 2023
Example mechanisms for preserving atomicity
▪ Lock/unlock mutex around a critical section
[Link]();
// critical section
[Link]();

▪ Some languages have first-class support for atomicity of code blocks

atomic {
// critical section
}

▪ Intrinsics for hardware-supported atomic read-modify-write operations

atomicAdd(x, 10);

Stanford CS149, Fall 2023

Summary: shared address space model
▪ Threads communicate by:
- Reading/writing to shared variables in a shared address space
- Communication between threads is implicit in memory loads/stores
- Manipulating synchronization primitives
- e.g., ensuring mutual exclusion via use of locks

▪ This is a natural extension of sequential programming

- In fact, all our discussions in class have assumed a shared address space so far!

Stanford CS149, Fall 2023

Shared address space solver (pseudocode in SPMD execution model)
int n; // grid size
bool done = false;
float diff = 0.0;
LOCK myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) { Value of threadId is different for each SPMD instance:

float myDiff;
int threadId = getThreadId();
use value to compute region of grid to work on
int myMin = 1 + (threadId * n / NUM_PROCESSORS);
int myMax = myMin + (n / NUM_PROCESSORS)

while (!done) {
float myDiff = 0.f;
diff = 0.f;
barrier(myBarrier, NUM_PROCESSORS); Each thread computes the rows it is responsible for updating
for (j=myMin to myMax) {
for (i = red cells in this row) {
float prev = A[i,j];
A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j], A[i,j+1]);
myDiff += abs(A[i,j] - prev));
}
lock(myLock);
diff += myDiff; Lock for mutual exclusion
unlock(myLock);
barrier(myBarrier, NUM_PROCESSORS); Hint
if (diff/(n*n) < TOLERANCE) // check convergence, all threads get same answer
done = true;

Do you see a potential performance problem with

barrier(myBarrier, NUM_PROCESSORS);
}
}

this implementation?
Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2023
Shared address space solver (pseudocode in SPMD execution model)
int n; // grid size
bool done = false;

Improve performance by accumulating into

float diff = 0.0;
LOCK myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);
partial sum locally, then complete global
void solve(float* A) { reduction at the end of the iteration.
float myDiff;
int threadId = getThreadId();
int myMin = 1 + (threadId * n / NUM_PROCESSORS);
int myMax = myMin + (n / NUM_PROCESSORS)

}
myDiff += abs(A[i,j] - prev));
Compute partial sum per worker
lock(myLock);
diff += myDiff; Now only only lock once per thread,
unlock(myLock);
barrier(myBarrier, NUM_PROCESSORS);
not once per (i,j) loop iteration!
if (diff/(n*n) < TOLERANCE) // check convergence, all threads get same answer
done = true;
barrier(myBarrier, NUM_PROCESSORS);
}
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2023
Barrier synchronization primitive
▪ barrier(num_threads)
▪ Barriers are a conservative way to express dependencies Compute red cells

▪ Barriers divide computation into phases Barrier

▪ All computation by all threads before the barrier complete

Compute black cells
before any computation in any thread after the barrier begins
- In other words, all computations after the barrier are Barrier

assumed to depend on all computations before the barrier

P1 P2 P3 P4

Stanford CS149, Fall 2023

Shared address space solver
int n; // grid size
bool done = false;

Why are there three barriers?

float diff = 0.0;
LOCK myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
float myDiff;
int threadId = getThreadId();
int myMin = 1 + (threadId * n / NUM_PROCESSORS);
int myMax = myMin + (n / NUM_PROCESSORS)

while (!done) {
float myDiff = 0.f;
diff = 0.f;
barrier(myBarrier, NUM_PROCESSORS);
for (j=myMin to myMax) {
for (i = red cells in this row) {
float prev = A[i,j];
A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j], A[i,j+1]);
myDiff += abs(A[i,j] - prev));
}
lock(myLock);
diff += myDiff;
unlock(myLock);
barrier(myBarrier, NUM_PROCESSORS);
if (diff/(n*n) < TOLERANCE) // check convergence, all threads get same answer
done = true;
barrier(myBarrier, NUM_PROCESSORS);
}
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2023
Shared address space solver: one barrier
int
bool
n;
done = false;
// grid size
Idea:
LOCK myLock;
BARRIER myBarrier;
float diff[3]; // global diff, but now 3 copies Remove dependencies by using different diff
float *A = allocate(n+2, n+2); variables in successive loop iterations
void solve(float* A) {
float myDiff; // thread local variable
int index = 0; // thread local variable Trade off footprint for removing dependencies!
diff[0] = 0.0f;
barrier(myBarrier, NUM_PROCESSORS); // one-time only: just for init
(a common parallel programming technique)
while (!done) {
myDiff = 0.0f;
//
// perform computation (accumulate locally into myDiff)
//
lock(myLock);
diff[index] += myDiff; // atomically update global diff
unlock(myLock);
diff[(index+1) % 3] = 0.0f;
barrier(myBarrier, NUM_PROCESSORS);
if (diff[index]/(n*n) < TOLERANCE)
break;
index = (index + 1) % 3;
}
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2023
Grid solver implementation in two programming models
▪ Data-parallel programming model
- Synchronization:
- Single logical thread of control, but iterations of forall loop may be parallelized by the system (implicit
barrier at end of forall loop body)
- Communication
- Implicit in loads and stores (like shared address space)
- Special built-in primitives for more complex communication patterns:
e.g., reduce

▪ Shared address space

- Synchronization:
- Mutual exclusion required for shared variables (e.g., via locks)
- Barriers used to express dependencies (between phases of computation)
- Communication
- Implicit in loads/stores to shared variables
Stanford CS149, Fall 2023
Summary
▪ Amdahl’s Law
- Overall maximum speedup from parallelism is limited by amount of serial execution in a program

▪ Aspects of creating a parallel program

- Decomposition to create independent work, assignment of work to workers, orchestration (to
coordinate processing of work by workers), mapping to hardware
- We’ll talk a lot about making good decisions in each of these phases in the coming lectures

▪ Focus today: identifying dependencies

▪ Focus soon: identifying locality, reducing synchronization

Stanford CS149, Fall 2023

Parallel Programming Models Overview
No ratings yet
Parallel Programming Models Overview
62 pages
02 Basicarch
No ratings yet
02 Basicarch
83 pages
04 Progbasics
No ratings yet
04 Progbasics
43 pages
Parallel Programming Insights
No ratings yet
Parallel Programming Insights
42 pages
Parallel Programming Models Lecture
No ratings yet
Parallel Programming Models Lecture
47 pages
02 Basicarch
No ratings yet
02 Basicarch
103 pages
02 Multicore
No ratings yet
02 Multicore
66 pages
04 Progbasics
No ratings yet
04 Progbasics
83 pages
RG2 ParallelizationPrinciples HPCAI Jan2020
No ratings yet
RG2 ParallelizationPrinciples HPCAI Jan2020
40 pages
Apt05 2024S2
No ratings yet
Apt05 2024S2
23 pages
03 Ch1 BasicArch Parallel
No ratings yet
03 Ch1 BasicArch Parallel
79 pages
Simulating Ocean Currents
No ratings yet
Simulating Ocean Currents
35 pages
002 IntroHPC
No ratings yet
002 IntroHPC
33 pages
08 Dataparallel
No ratings yet
08 Dataparallel
51 pages
Written Asst2
No ratings yet
Written Asst2
27 pages
High Performance Computing Labs & Concepts
No ratings yet
High Performance Computing Labs & Concepts
5 pages
01 Whyparallelism
No ratings yet
01 Whyparallelism
82 pages
Clase01 - Introducción Al Paralelismo
No ratings yet
Clase01 - Introducción Al Paralelismo
30 pages
Clase01 - Introducción Al Paralelismo
No ratings yet
Clase01 - Introducción Al Paralelismo
30 pages
01 Introduction
No ratings yet
01 Introduction
41 pages
OpenACC 1
No ratings yet
OpenACC 1
44 pages
03 Progmodels Slides
No ratings yet
03 Progmodels Slides
61 pages
L19-20 PA Design Intro
No ratings yet
L19-20 PA Design Intro
31 pages
Stanford CS149 Parallel Computing Assignment
No ratings yet
Stanford CS149 Parallel Computing Assignment
31 pages
OpenACC Fundamentals
No ratings yet
OpenACC Fundamentals
38 pages
CS4230 Parallel Programming Introduction To Parallel Algorithms
No ratings yet
CS4230 Parallel Programming Introduction To Parallel Algorithms
25 pages
Unit 1
No ratings yet
Unit 1
11 pages
Introduction to Parallel Computing
No ratings yet
Introduction to Parallel Computing
34 pages
Layers of Implementing An Application in Software or Hardware Using Parallel Computers
No ratings yet
Layers of Implementing An Application in Software or Hardware Using Parallel Computers
46 pages
Parallel Computing Unit 3 - Principles of Parallel Computing Design
No ratings yet
Parallel Computing Unit 3 - Principles of Parallel Computing Design
78 pages
Parallel Algorithms: Theory and Practice: Deterministi C Parallelism
No ratings yet
Parallel Algorithms: Theory and Practice: Deterministi C Parallelism
51 pages
HPC Parallel
No ratings yet
HPC Parallel
122 pages
Introduction To Parallel Programming
No ratings yet
Introduction To Parallel Programming
29 pages
Part 1 - Lecture 3 - Parallel Software-1
No ratings yet
Part 1 - Lecture 3 - Parallel Software-1
45 pages
Par - 1 In-Term Exam - Course 2017/18-Q2
No ratings yet
Par - 1 In-Term Exam - Course 2017/18-Q2
7 pages
Parallel Programming Models Guide
No ratings yet
Parallel Programming Models Guide
32 pages
Assignment No. 2 PDC 21L-1786
No ratings yet
Assignment No. 2 PDC 21L-1786
6 pages
Par Proc Book
No ratings yet
Par Proc Book
335 pages
25-04 Gpu Programming Without Cuda
No ratings yet
25-04 Gpu Programming Without Cuda
38 pages
03 (Parallel Software)
No ratings yet
03 (Parallel Software)
38 pages
Par - 1 In-Term Exam - Course 2018/19-Q2
No ratings yet
Par - 1 In-Term Exam - Course 2018/19-Q2
9 pages
Parallel Processing: 6.004x Computation Structures Part 3 - Computer Organization
No ratings yet
Parallel Processing: 6.004x Computation Structures Part 3 - Computer Organization
41 pages
Intro Parallel Programming 2015
No ratings yet
Intro Parallel Programming 2015
38 pages
Daa 1
No ratings yet
Daa 1
40 pages
Lec7 PDF
No ratings yet
Lec7 PDF
16 pages
09 ParallelizationRecap PDF
No ratings yet
09 ParallelizationRecap PDF
62 pages
03 Multicore2-Ispc
No ratings yet
03 Multicore2-Ispc
56 pages
Module 3
No ratings yet
Module 3
104 pages
ParProcBook PDF
No ratings yet
ParProcBook PDF
410 pages
Lecture02 Types
No ratings yet
Lecture02 Types
21 pages
AA Part1
No ratings yet
AA Part1
43 pages
OpenACC Programming
No ratings yet
OpenACC Programming
43 pages
Chapter 7 - Parallel Programming Issues
No ratings yet
Chapter 7 - Parallel Programming Issues
68 pages
Openmp Lab: Antonio Gómez-Iglesias Agomez@Tacc - Utexas.Edu Texas Advanced Computing Center
No ratings yet
Openmp Lab: Antonio Gómez-Iglesias Agomez@Tacc - Utexas.Edu Texas Advanced Computing Center
17 pages
Module 1
No ratings yet
Module 1
68 pages
Unit1 2 and 3
No ratings yet
Unit1 2 and 3
76 pages
1.3.2 Labsim Features
No ratings yet
1.3.2 Labsim Features
3 pages
Numerical Investigation of Elliptical and Triangular Perforated Fins Under Forced Convection
100% (1)
Numerical Investigation of Elliptical and Triangular Perforated Fins Under Forced Convection
4 pages
Journey To The End of The Earth
No ratings yet
Journey To The End of The Earth
6 pages
OW Notes
No ratings yet
OW Notes
3 pages
Egypt Food & Drink Report
No ratings yet
Egypt Food & Drink Report
84 pages
9th Biology Test Series 360 MCQs Full Book 2018
No ratings yet
9th Biology Test Series 360 MCQs Full Book 2018
23 pages
CFM56 - 5B - Esm - Rev - 05 11 01 200 001 N PGK08 001 N - TSN.77 - U - 20161030
100% (1)
CFM56 - 5B - Esm - Rev - 05 11 01 200 001 N PGK08 001 N - TSN.77 - U - 20161030
3 pages
Modern Portfolio Theory - Wikipedia
No ratings yet
Modern Portfolio Theory - Wikipedia
12 pages
Expressing Surprise and Disbelief
No ratings yet
Expressing Surprise and Disbelief
9 pages
Vertiv Liebert Sts2 Static Transfer Switch Brochure SL 20600
No ratings yet
Vertiv Liebert Sts2 Static Transfer Switch Brochure SL 20600
12 pages
OIT - Impact of Lockdown Measures On Informal Economy
No ratings yet
OIT - Impact of Lockdown Measures On Informal Economy
1 page
Quarter 3-Grade 6 Summative 1
No ratings yet
Quarter 3-Grade 6 Summative 1
20 pages
Mock Test 13
No ratings yet
Mock Test 13
3 pages
SCS Load Securing GB
No ratings yet
SCS Load Securing GB
40 pages
09 Practice Problem 1 FEVIDAL
No ratings yet
09 Practice Problem 1 FEVIDAL
1 page
ADA Lab Manual
No ratings yet
ADA Lab Manual
47 pages
Anderol 3068 Safety Data Sheet
No ratings yet
Anderol 3068 Safety Data Sheet
11 pages
05 OR BMS MCQs
100% (1)
05 OR BMS MCQs
12 pages
7 Prokaryote and Eukaryote Cells-S
No ratings yet
7 Prokaryote and Eukaryote Cells-S
6 pages
Firefighter Lawsuit
No ratings yet
Firefighter Lawsuit
83 pages
Santa Talks Script Evaluation Rubrics
No ratings yet
Santa Talks Script Evaluation Rubrics
1 page
Final BOQ For Bridges
67% (3)
Final BOQ For Bridges
28 pages
Microsoft PSS Division Service Strategy
No ratings yet
Microsoft PSS Division Service Strategy
3 pages
Satyr Monk
No ratings yet
Satyr Monk
11 pages
Environmental Question Objective
100% (1)
Environmental Question Objective
26 pages
MA1010 Notes 1
No ratings yet
MA1010 Notes 1
38 pages
Scope of Work and Methodology Underground Utility Detection
No ratings yet
Scope of Work and Methodology Underground Utility Detection
15 pages
Maintenance by Robotics
No ratings yet
Maintenance by Robotics
17 pages
Evacuation Efforts for Indians in Ukraine
No ratings yet
Evacuation Efforts for Indians in Ukraine
14 pages
Working Capital Management 2
No ratings yet
Working Capital Management 2
13 pages