
Lecture 2:

A Modern Multi-Core
Processor
(Forms of parallelism + understanding latency and bandwidth)

Parallel Computer Architecture and Programming


CMU 15-418/15-618, Fall 2023
Today
▪ Today we will talk about computer architecture

▪ Four key concepts about how modern computers work


- Two concern parallel execution
- Two concern challenges of accessing memory

▪ Understanding these architecture basics will help you


- Understand and optimize the performance of your parallel programs
- Gain intuition about what workloads might benefit from fast parallel machines

CMU 15-418/618, Fall 2023
Part 1: parallel execution

CMU 15-418/618, Fall 2023


Example program
Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! ...
for each element of an array of N floating-point numbers
void sinx(int N, int terms, float* x, float* result)
{
for (int i=0; i<N; i++)
{
float value = x[i];
float numer = x[i] * x[i] * x[i];
int denom = 6; // 3!
int sign = -1;

for (int j=1; j<=terms; j++)


{
value += sign * numer / denom;
numer *= x[i] * x[i];
denom *= (2*j+2) * (2*j+3);
sign *= -1;
}

result[i] = value;
}
}
CMU 15-418/618, Fall 2023
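
Aside: driving the sinx routine (an illustrative sketch, not from the slides; the driver,
its array size, and its sample points are arbitrary choices).

// Hypothetical driver for the sinx routine above: fill an input array, run the
// Taylor-series version, and spot-check a few results against the C library's sinf().
// Compile with something like: gcc -O2 driver.c sinx.c -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void sinx(int N, int terms, float* x, float* result);  // routine shown above

int main(void) {
    const int N = 1000;
    const int terms = 5;                        // number of Taylor terms after the first
    float* x = malloc(N * sizeof(float));
    float* result = malloc(N * sizeof(float));

    for (int i = 0; i < N; i++)
        x[i] = -1.0f + 2.0f * (float)i / N;     // sample points in [-1, 1)

    sinx(N, terms, x, result);

    for (int i = 0; i < N; i += 250)            // spot-check a few elements
        printf("x=%f  taylor=%f  sinf=%f\n", x[i], result[i], sinf(x[i]));

    free(x);
    free(result);
    return 0;
}
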
Compile program
void sinx(int N, int terms, float* x, float* result)
{
   for (int i=0; i<N; i++)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}

Resulting instruction stream for one loop iteration (simplified):

x[i]
ld  r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st  addr[r2], r0
result[i]
CMU 15-418/618, Fall 2023


Execute program
My very simple processor: executes one instruction per clock

[Diagram: one core with Fetch/Decode, ALU (Execute), and Execution Context. The
program counter (PC) steps through the compiled instruction stream, one instruction
per clock:]

x[i]
ld  r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st  addr[r2], r0
result[i]

CMU 15-418/618, Fall 2023


Superscalar processor
Recall from last class: instruction level parallelism (ILP)
Decode and execute two instructions per clock (if possible)
[Diagram: one core with two Fetch/Decode units (1 and 2) and two execution units
(Exec 1, Exec 2) sharing one Execution Context, running the instruction stream:
ld r0, addr[r1]; mul r1, r0, r0; mul r1, r1, r0; ...; st addr[r2], r0]

Note: No ILP exists in this region of the program
(each mul depends on the result of the previous instruction)

CMU 15-418/618, Fall 2023
Aside: Pentium 4

Image credit: http://ixbtlabs.com/articles/pentium4/index.html CMU 15-418/618, Fall 2023


Processor: pre multi-core era
Majority of chip transistors used to perform operations

[Diagram: a single core whose area is devoted to Fetch/Decode, ALU (Execute), and
Execution Context, plus a big data cache, out-of-order control logic, a fancy branch
predictor, and a memory pre-fetcher]

More transistors = larger cache, smarter out-of-order logic, smarter branch predictor, etc.
(Also: more transistors → smaller transistors → higher clock frequencies)
CMU 15-418/618, Fall 2023
Processor: multi-core era

Idea #1:
Use increasing transistor count to add more cores to the processor

Rather than use transistors to increase sophistication of processor logic that
accelerates a single instruction stream
(e.g., out-of-order and speculative operations)

[Diagram: one simpler core: Fetch/Decode, ALU (Execute), Execution Context]

CMU 15-418/618, Fall 2023


Two cores: compute two elements in parallel

[Diagram: two cores, each with its own Fetch/Decode, ALU (Execute), and Execution
Context. Core 1 processes x[i] and core 2 processes x[j]; each runs the same
instruction stream: ld r0, addr[r1]; mul r1, r0, r0; mul r1, r1, r0; ...; st addr[r2], r0]

Simpler cores: each core is slower at running a single instruction stream
than our original "fancy" core (e.g., 0.75 times as fast)

But there are now two cores: 2 × 0.75 = 1.5 (potential for speedup!)
CMU 15-418/618, Fall 2023
But our program expresses no parallelism
void sinx(int N, int terms, float* x, float* result)
{
   for (int i=0; i<N; i++)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}

This program, compiled with gcc, will run as one thread on one of the processor cores.

If each of the simpler processor cores was 0.75X as fast as the original single
complicated one, our program now has a "speedup" of 0.75 (i.e., it is slower).

CMU 15-418/618, Fall 2023


Expressing parallelism using pthreads
typedef struct {
   int N;
   int terms;
   float* x;
   float* result;
} my_args;

void* my_thread_start(void* thread_arg)
{
   my_args* thread_args = (my_args*)thread_arg;
   sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result); // do work
   return NULL;
}

void parallel_sinx(int N, int terms, float* x, float* result)
{
   pthread_t thread_id;
   my_args args;

   args.N = N/2;
   args.terms = terms;
   args.x = x;
   args.result = result;

   pthread_create(&thread_id, NULL, my_thread_start, &args); // launch thread
   sinx(N - args.N, terms, x + args.N, result + args.N);     // do work on the main thread
   pthread_join(thread_id, NULL);
}

(The right-hand column of the slide shows the serial sinx function from earlier, unchanged.)

CMU 15-418/618, Fall 2023
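
Aside: the slide splits the work across exactly two threads. A natural generalization
(an illustrative sketch, not from the slides; NUM_THREADS, parallel_sinx_n, and the
chunking scheme are my own choices) splits the N elements across several workers,
reusing the my_args struct and my_thread_start function above:

#include <pthread.h>

#define NUM_THREADS 4

void parallel_sinx_n(int N, int terms, float* x, float* result) {
    pthread_t threads[NUM_THREADS];
    my_args args[NUM_THREADS];
    int chunk = (N + NUM_THREADS - 1) / NUM_THREADS;     // elements per thread, rounded up

    for (int t = 0; t < NUM_THREADS; t++) {
        int start = t * chunk;
        if (start > N) start = N;                        // clamp for small N
        int count = (start + chunk <= N) ? chunk : N - start;
        args[t].N = count;
        args[t].terms = terms;
        args[t].x = x + start;
        args[t].result = result + start;
        pthread_create(&threads[t], NULL, my_thread_start, &args[t]);  // launch worker
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);                  // wait for all workers
}
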


Data-parallel expression
(in our fictitious data-parallel language)

Loop iterations declared by the programmer to be independent.
With this information, you could imagine how a compiler might automatically
generate parallel threaded code.

void sinx(int N, int terms, float* x, float* result)
{
   // declare independent loop iterations
   forall (int i from 0 to N-1)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}

CMU 15-418/618, Fall 2023


Four cores: compute four elements in parallel

[Diagram: four cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context]
CMU 15-418/618, Fall 2023


Sixteen cores: compute sixteen elements in parallel

Sixteen cores, sixteen simultaneous instruction streams


CMU 15-418/618, Fall 2023
Intel Alder Lake-S (2021)

16 CPU cores (8 performance + 8 efficiency)

CMU 15-418/618, Fall 2023
Laptops: Apple M1 Pro (2021)

▪ 10 CPU Cores (8 performance + 2 efficiency)

▪ 16 GPU Cores

CMU 15-418/618, Fall 2023
NVIDIA GeForce GTX 1660 Ti GPU (2019)
24 major processing blocks
(1536 “CUDA cores”)

CMU 15-418/618, Fall 2023


Data-parallel expression
(in our fictitious data-parallel language)

Another interesting property of this code:

Parallelism is across iterations of the loop.
All the iterations of the loop do the same thing: evaluate the sine of a single
input number.

(Same forall-based sinx code as the previous data-parallel slide.)

CMU 15-418/618, Fall 2023
Add ALUs to increase compute capability

Idea #2:
Amortize cost/complexity of managing an instruction stream across many ALUs

SIMD processing
Single instruction, multiple data

Same instruction broadcast to all ALUs
Executed in parallel on all ALUs

[Diagram: one Fetch/Decode unit feeding ALU 0 through ALU 7, with a single shared
Execution Context]

CMU 15-418/618, Fall 2023


Add ALUs to increase compute capability

[Same diagram: Fetch/Decode feeding ALU 0 through ALU 7, one Execution Context]

Recall original compiled program:
Instruction stream processes one array element at a time using scalar instructions
on scalar registers (e.g., 32-bit floats):

ld  r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st  addr[r2], r0

CMU 15-418/618, Fall 2023
Scalar program
Original compiled program:
Processes one array element at a time using scalar instructions on scalar registers
(e.g., 32-bit floats).

(C source: the same serial sinx function shown earlier.)

ld  r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st  addr[r2], r0

CMU 15-418/618, Fall 2023
Vector program (using AVX intrinsics)
#include <immintrin.h>

Intrinsics available to C programmers


void sinx(int N, int terms, float* x, float* result)
{
float three_fact = 6; // 3!
for (int i=0; i<N; i+=8)
{
__m256 origx = _mm256_load_ps(&x[i]);
__m256 value = origx;
__m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
__m256 denom = _mm256_broadcast_ss(&three_fact);
int sign = -1;

for (int j=1; j<=terms; j++)


{
// value += sign * numer / denom
__m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
value = _mm256_add_ps(value, tmp);

numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));


denom = _mm256_mul_ps(denom, _mm256_set1_ps((float)((2*j+2) * (2*j+3))));
sign *= -1;
}
_mm256_store_ps(&result[i], value);
}
}
CMU 15-418/618, Fall 2023
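
Aside: the vector loop above assumes N is a multiple of 8 and that x and result are
32-byte aligned (required by _mm256_load_ps; _mm256_loadu_ps would lift the alignment
requirement). One way to handle leftover elements is a scalar tail loop; a sketch,
not on the slide (sinx_vector and sinx_scalar are my names for the AVX and serial
versions shown earlier):

void sinx_with_tail(int N, int terms, float* x, float* result) {
    int vec_N = N - (N % 8);                  // largest multiple of 8 that fits
    if (vec_N > 0)
        sinx_vector(vec_N, terms, x, result);                        // AVX loop above
    if (vec_N < N)
        sinx_scalar(N - vec_N, terms, x + vec_N, result + vec_N);    // serial sinx from earlier
}
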
Vector program (using AVX intrinsics)

(Same vector sinx code as the previous slide.)

Compiled program:
Processes eight array elements simultaneously using vector instructions on
256-bit vector registers

vloadps  xmm0, addr[r1]
vmulps   xmm1, xmm0, xmm0
vmulps   xmm1, xmm1, xmm0
...
vstoreps addr[xmm2], xmm0

CMU 15-418/618, Fall 2023


16 SIMD cores: 128 elements in parallel


16 cores, 128 ALUs, 16 simultaneous instruction streams


CMU 15-418/618, Fall 2023
Data-parallel expression
(in our fictitious data-parallel language)

(Same forall-based sinx code as before.)

Compiler understands loop iterations are independent, and that the same loop body
will be executed on a large number of data elements.

Abstraction facilitates automatic generation of both multi-core parallel code, and
vector instructions to make use of SIMD processing capabilities within a core.

CMU 15-418/618, Fall 2023
What about conditional execution?
(assume the logic below is to be executed for each element in input array 'A',
producing output into the array 'result')

[Diagram: time (clocks) runs downward; ALU 1 through ALU 8 across the top]

<unconditional code>
float x = A[i];

if (x > 0) {
   float tmp = exp(x, 5.f);
   tmp *= kMyConst1;
   x = tmp + kMyConst2;
} else {
   float tmp = kMyConst1;
   x = 2.f * tmp;
}

<resume unconditional code>
result[i] = x;

CMU 15-418/618, Fall 2023


What about conditional execution?
(same code and diagram as the previous slide)

Per-lane outcome of "if (x > 0)" across ALUs 1 through 8:  T T F T F F F F

CMU 15-418/618, Fall 2023


Mask (discard) output of ALU
(same code as the previous slide; per-lane outcomes: T T F T F F F F)

[Diagram: lanes whose condition is false have their results discarded (masked) while
the "if" branch executes; lanes whose condition is true are masked during the "else"
branch]

Not all ALUs do useful work!
Worst case: 1/8 peak performance

CMU 15-418/618, Fall 2023
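
Aside: explicit AVX version of the masking idea (an illustrative sketch, not from the
slides; the "if" branch is simplified to x*kMyConst1 + kMyConst2, without the exp() call):

#include <immintrin.h>

// Both "branches" are evaluated for all 8 lanes; a mask built from the comparison then
// selects, per lane, which result to keep: the "mask and discard" behavior shown above.
void conditional_example(int N, const float* A, float* result,
                         float kMyConst1, float kMyConst2) {
    __m256 c1  = _mm256_set1_ps(kMyConst1);
    __m256 c2  = _mm256_set1_ps(kMyConst2);
    __m256 two = _mm256_set1_ps(2.f);
    for (int i = 0; i < N; i += 8) {                    // assumes N is a multiple of 8
        __m256 x    = _mm256_loadu_ps(&A[i]);
        __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ);  // lanes with x > 0
        __m256 then_val = _mm256_add_ps(_mm256_mul_ps(x, c1), c2);        // simplified "if" work
        __m256 else_val = _mm256_mul_ps(two, c1);                         // "else" work
        __m256 r = _mm256_blendv_ps(else_val, then_val, mask);  // keep then_val where mask set
        _mm256_storeu_ps(&result[i], r);
    }
}
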


After branch: continue at full performance
(same code as the previous slide)

[Diagram: once all lanes have finished the divergent if/else region, the ALUs resume
executing the unconditional code together at full SIMD width]

CMU 15-418/618, Fall 2023


Terminology
▪ Instruction stream coherence (“coherent execution”)
- Same instruction sequence applies to all elements operated upon simultaneously
- Coherent execution is necessary for efficient use of SIMD processing resources
- Coherent execution IS NOT necessary for efficient parallelization across cores,
since each core has the capability to fetch/decode a different instruction stream

▪ "Divergent" execution
- A lack of instruction stream coherence

▪ Note: don't confuse instruction stream coherence with "cache
coherence" (a major topic later in the course)

CMU 15-418/618, Fall 2023
SIMD execution on modern CPUs
▪ SSE instructions: 128-bit operations: 4x32 bits or 2x64 bits (4-wide float vectors)

▪ AVX instructions: 256-bit operations: 8x32 bits or 4x64 bits (8-wide float vectors)

▪ Instructions are generated by the compiler


- Parallelism explicitly requested by programmer using intrinsics
- Parallelism conveyed using parallel language semantics (e.g., forall example)
- Parallelism inferred by dependency analysis of loops (hard problem, even best
compilers are not great on arbitrary C/C++ code)

▪ Terminology: “explicit SIMD”: SIMD parallelization is performed at compile time


- Can inspect program binary and see instructions (vstoreps, vmulps, etc.)

CMU 15-418/618, Fall 2023
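
Aside: an example of the "parallelism inferred by dependency analysis" case (a generic
sketch, not from the slides). A loop like this, with restrict-qualified pointers and
independent iterations, is simple enough that gcc or clang at -O3 with AVX enabled
(e.g., gcc -O3 -mavx2) will typically compile it into packed 8-wide float instructions
without any intrinsics:

void saxpy(int N, float a, const float* restrict x, float* restrict y) {
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];     // each iteration is independent of the others
}
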


SIMD execution on many modern GPUs
▪ “Implicit SIMD”
- Compiler generates a scalar binary (scalar instructions)
- But N instances of the program are *always run* together on the processor
execute(my_function, N) // execute my_function N times
- In other words, the interface to the hardware itself is data-parallel
- Hardware (not compiler) is responsible for simultaneously executing the same
instruction from multiple instances on different data on SIMD ALUs

▪ SIMD width of most modern GPUs ranges from 8 to 32


- Divergence can be a big issue
(poorly written code might execute at 1/32 the peak capability of the machine!)

CMU 15-418/618, Fall 2023
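
Aside: the execute(my_function, N) interface above, sketched in plain C (hypothetical
names, and a sequential loop used only to illustrate the programming model; real GPU
hardware runs groups of instances together in SIMD lock-step rather than one at a time):

typedef void (*kernel_fn)(int instance_id, void* args);

// "Launch" N instances of a scalar per-element function.
void execute(kernel_fn my_function, int N, void* args) {
    for (int id = 0; id < N; id++)      // conceptually, the N instances run together
        my_function(id, args);
}

// A scalar function written as if for a single data element:
void multiply_kernel(int i, void* args) {
    float** arrays = (float**)args;     // convention here: {A, B, C}
    arrays[2][i] = arrays[0][i] * arrays[1][i];
}
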


Example: Intel Core i9 (Coffee Lake)
8 cores
8 SIMD ALUs per core (AVX2 instructions)

[Diagram: eight cores, each with Fetch/Decode, ALU 0 through ALU 7, and an
Execution Context]

On campus:
GHC machines: 4 cores, 8 SIMD ALUs per core
Machines in GHC 5207 (old GHC 3000 machines): 6 cores, 4 SIMD ALUs per core
CPUs in "latedays" cluster: 6 cores, 8 SIMD ALUs per core

CMU 15-418/618, Fall 2023
Example: NVIDIA GTX 480
(in the Gates 5 lab)

15 cores
32 SIMD ALUs per core
1.3 TFLOPS

CMU 15-418/618, Fall 2023


Summary: parallel execution
▪ Several forms of parallel execution in modern processors
- Multi-core: use multiple processing cores
- Provides thread-level parallelism: simultaneously execute a completely different
instruction stream on each core
- Software decides when to create threads (e.g., via pthreads API)

- SIMD: use multiple ALUs controlled by same instruction stream (within a core)
- Efficient design for data-parallel workloads: control amortized over many ALUs
- Vectorization can be done by compiler (explicit SIMD) or at runtime by hardware
- [Lack of] dependencies is known prior to execution (usually declared by programmer,
but can be inferred by loop analysis by advanced compiler)

- Superscalar: exploit ILP within an instruction stream. Process different instructions from
the same instruction stream in parallel (within a core)
- Parallelism automatically and dynamically discovered by the hardware during
execution (not programmer visible)
Not addressed further in this class. That's for a proper computer architecture design course like 18-447.
CMU 15-418/618, Fall 2023
Quiz Time
▪ L2 Participation Quiz on Canvas

CMU 15-418/618, Fall 2023


Part 2: accessing memory

CMU 15-418/618, Fall 2023


Terminology
▪ Memory latency
- The amount of time for a memory request (e.g., load, store) from a
processor to be serviced by the memory system
- Example: 100 cycles, 100 nsec

▪ Memory bandwidth
- The rate at which the memory system can provide data to a processor
- Example: 20 GB/s

CMU 15-418/618, Fall 2023
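
Aside: a quick back-of-the-envelope combination of the two quantities (the numbers
reuse the examples on this slide; the formula is the usual first-order model
time = latency + bytes / bandwidth, not something specific to this course):

#include <stdio.h>

int main(void) {
    double latency_s = 100e-9;    // 100 ns
    double bandwidth = 20e9;      // 20 GB/s
    double t_line = latency_s + 64.0 / bandwidth;   // one 64-byte cache line: ~103 ns
    double t_gig  = latency_s + 1e9  / bandwidth;   // a 1 GB transfer: ~0.05 s
    printf("64 B : %.1f ns\n", t_line * 1e9);
    printf("1 GB : %.3f s\n", t_gig);
    return 0;
}
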


DEMO Bandwidth vs Latency
▪ Will need a few volunteers

CMU 15-418/618, Fall 2023


Real World Example
▪ What if we have to move exabytes of data?

▪ Problem:
- Move X bytes of data
- From datacenter in Pittsburgh to New York
▪ 100 PB, 370 miles, ~6.5 hours
- 1e+11 B / 25 MB/s = 1.1 hours
▪ 1 EB
- 1e+18 B / 25 MB/s = 1,267 years

CMU 15-418/618, Fall 2023


AWS Snowmobile

CMU 15-418/618, Fall 2023


Stalls
▪ A processor “stalls” when it cannot run the next instruction in
an instruction stream because of a dependency on a previous
instruction.

▪ Accessing memory is a major source of stalls


ld r0 mem[r2]
ld r1 mem[r3]
add r0, r0, r1

Dependency: cannot execute 'add' instruction until data at mem[r2] and
mem[r3] have been loaded from memory

▪ Memory access times ~ 100’s of cycles


- Memory “access time” is a measure of latency

CMU 15-418/618, Fall 2023


Review: why do processors have caches?

[Diagram: each core (1 … N) has its own 32 KB L1 cache and 256 KB L2 cache; all
cores share an 8 MB L3 cache, connected over a 25 GB/sec bus to gigabytes of
DDR3 DRAM]
CMU 15-418/618, Fall 2023


Caches reduce length of stalls (reduce latency)
Processors run efficiently when data is resident in caches
Caches reduce memory access latency *

[Same cache-hierarchy diagram as the previous slide]

* Caches also provide high bandwidth data transfer to CPU

CMU 15-418/618, Fall 2023
Prefetching reduces stalls (hides latency)
▪ All modern CPUs have logic for prefetching data into caches
- Dynamically analyze program’s access patterns, predict what it will access soon
▪ Reduces stalls since data is resident in cache when accessed
predict value of r2, initiate load
predict value of r3, initiate load
...
...
...
data arrives in cache
...
data arrives in cache
...
ld r0 mem[r2]      (these loads are cache hits)
ld r1 mem[r3]
add r0, r0, r1

Note: Prefetching can also reduce performance if the guess is wrong
(hogs bandwidth, pollutes caches) (more detail later in course)

CMU 15-418/618, Fall 2023
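
Aside: the slide describes hardware prefetching, which is automatic. Compilers also
expose software prefetch hints; a small sketch using GCC/Clang's __builtin_prefetch
(not mentioned on the slide; the loop, prefetch distance, and locality hint are arbitrary):

void scale(int N, float a, const float* x, float* y) {
    const int AHEAD = 16;                        // prefetch distance: a tuning knob
    for (int i = 0; i < N; i++) {
        if (i + AHEAD < N)
            __builtin_prefetch(&x[i + AHEAD], 0, 1);   // 0 = read, 1 = low temporal locality
        y[i] = a * x[i];
    }
}
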


Multi-threading reduces stalls
▪ Idea: interleave processing of multiple threads on the same
core to hide stalls

▪ Like prefetching, multi-threading is a latency hiding, not a


latency reducing technique

CMU 15-418/618, Fall 2023


Hiding stalls with multi-threading
Thread 1: elements 0 … 7

[Diagram: 1 core (1 thread): Fetch/Decode, ALU 0 through ALU 7, one execution
context; time (clocks) runs downward]

CMU 15-418/618, Fall 2023


Hiding stalls with multi-threading
Thread 1: elements 0 … 7      Thread 2: elements 8 … 15
Thread 3: elements 16 … 23    Thread 4: elements 24 … 31

[Diagram: 1 core (4 hardware threads): Fetch/Decode, ALU 0 through ALU 7, and four
execution contexts (1 through 4); time (clocks) runs downward]

CMU 15-418/618, Fall 2023


Hiding stalls with multi-threading
(same four threads and core as the previous slide)

[Diagram: thread 1 runs until it stalls on a memory access; the core then switches to
a runnable thread while thread 1 waits]

CMU 15-418/618, Fall 2023


Hiding stalls with multi-threading
(same four threads and core as the previous slide)

[Diagram: each thread in turn runs, stalls on memory, and the core switches to the
next runnable thread; the stalls are hidden as long as some thread is runnable, and
the threads eventually complete (Done!)]

CMU 15-418/618, Fall 2023
Throughput computing trade-off
(same four threads as the previous slides)

Key idea of throughput-oriented systems:
Potentially increase the time to complete work by any one thread, in order to
increase overall system throughput when running multiple threads.

During this time, the stalled thread is runnable, but it is not being executed by
the processor. (The core is running some other thread.)

CMU 15-418/618, Fall 2023
Storing execution contexts
Consider on-chip storage of execution contexts a finite resource.

[Diagram: Fetch/Decode, ALU 0 through ALU 7, and a pool of context storage (or L1 cache)]

CMU 15-418/618, Fall 2023
Many small contexts (high latency hiding ability)

1 core (16 hardware threads, storage for a small working set per thread)

[Diagram: Fetch/Decode, ALU 0 through ALU 7, and context storage partitioned into
16 small contexts (1 through 16)]

CMU 15-418/618, Fall 2023


Four large contexts (low latency hiding ability)

1 core (4 hardware threads, storage for a larger working set per thread)

[Diagram: Fetch/Decode, ALU 0 through ALU 7, and context storage partitioned into
4 large contexts (1 through 4)]

CMU 15-418/618, Fall 2023


Hardware-supported multi-threading
▪ Core manages execution contexts for multiple threads
- Runs instructions from runnable threads (processor makes decision about which
thread to run each clock, not the operating system)
- Core still has the same number of ALU resources: multi-threading only helps use
them more efficiently in the face of high-latency operations like memory access

▪ Interleaved multi-threading (a.k.a. temporal multi-threading)


- What I described on the previous slides: each clock, the core chooses a thread,
and runs an instruction from the thread on the ALUs

▪ Simultaneous multi-threading (SMT)


- Each clock, core chooses instructions from multiple threads to run on ALUs
- Extension of superscalar CPU design
- Example: Intel Hyper-threading (2 threads per core)

CMU 15-418/618, Fall 2023


Multi-threading summary
▪ Benefit: use a core's ALU resources more efficiently
- Hide memory latency
- Fill multiple functional units of superscalar architecture
(when one thread has insufficient ILP)

▪ Costs
- Requires additional storage for thread contexts
- Increases run time of any single thread
(often not a problem, we usually care about throughput in parallel apps)
- Requires additional independent work in a program (more independent work
than ALUs!)
- Relies heavily on memory bandwidth
- More threads → larger working set → less cache space per thread
- May go to memory more often, but can hide the latency
CMU 15-418/618, Fall 2023
Our fictitious multi-core chip
16 cores

8 SIMD ALUs per core
(128 total)

4 threads per core

16 simultaneous instruction streams

64 total concurrent instruction streams

512 independent pieces of work are needed to run the chip
with maximal latency hiding ability

CMU 15-418/618, Fall 2023
GPUs: Extreme throughput-oriented processors

NVIDIA GTX 480 core

[Diagram: Fetch/Decode; 16 SIMD function units (control shared across the 16 units,
1 MUL-ADD per clock); execution contexts (128 KB); "shared" memory (16+48 KB)]

• Instructions operate on 32 pieces of data at a time (called "warps").
• Think: warp = thread issuing 32-wide vector instructions
• Up to 48 warps are simultaneously interleaved
• Over 1500 elements can be processed concurrently by a core

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

CMU 15-418/618, Fall 2023


NVIDIA GTX 480: more detail (just for the curious)

NVIDIA GTX 480 core

[Same diagram as the previous slide: Fetch/Decode, 16 SIMD function units (control
shared across the 16 units, 1 MUL-ADD per clock), execution contexts (128 KB),
"shared" memory (16+48 KB)]

• Why is a warp 32 elements when there are only 16 SIMD ALUs?
• It's a bit complicated: the ALUs run at twice the clock rate of the rest of the chip,
so each decoded instruction runs on 32 pieces of data on the 16 ALUs over two ALU
clocks. (But to the programmer, it behaves like a 32-wide SIMD operation.)

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

CMU 15-418/618, Fall 2023


NVIDIA GTX 480: more detail (just for the curious)

NVIDIA GTX 480 core

[Same diagram, now with a second Fetch/Decode unit]

• This process occurs on another set of 16 ALUs as well
• So there are 32 ALUs per core
• 15 cores × 32 = 480 ALUs per chip

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

CMU 15-418/618, Fall 2023


NVIDIA GTX 480

Recall, there are 15 cores on the GTX 480:


That’s 23,000 pieces of data being
processed concurrently!

CMU 15-418/618, Fall 2023


CPU vs. GPU memory hierarchies
CPU:
[Each core: 32 KB L1 cache and 256 KB L2 cache; shared 8 MB L3 cache;
25 GB/sec bus to gigabytes of DDR3 DRAM]
Big caches, few threads, modest memory BW
Rely mainly on caches and prefetching

GPU:
[Each core: execution contexts (128 KB), scratchpad / L1 cache (64 KB), and a
12 KB GFX texture cache; shared 768 KB L2 cache; 177 GB/sec bus to ~1 GB of
GDDR5 DRAM]
Small caches, many threads, huge memory BW
Rely mainly on multi-threading
CMU 15-418/618, Fall 2023
Thought experiment
Task: element-wise multiplication of two vectors A and B (very large arrays!)
Assume the vectors contain millions of elements

- Load input A[i]
- Load input B[i]
- Compute A[i] × B[i]
- Store result into C[i]

Three memory operations (12 bytes) for every MUL

NVIDIA GTX 480 GPU can do 480 MULs per clock (@ 1.2 GHz)
Need ~6.4 TB/sec of bandwidth to keep the functional units busy (only have 177 GB/sec)
~3% efficiency… but 7x faster than a quad-core CPU!
(A 2.6 GHz Core i7 Gen 4 quad-core CPU connected to a 25 GB/sec memory bus will
exhibit similar efficiency on this computation)

CMU 15-418/618, Fall 2023
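
Aside: written out in C, the kernel from this thought experiment is just the loop below
(the function name is my own). The comment tallies the traffic behind the
12-bytes-per-MUL figure above:

// Per iteration: two 4-byte loads + one 4-byte store = 12 bytes of memory traffic
// for a single multiply, i.e., an arithmetic intensity of 1 MUL per 12 bytes.
void vector_mul(int N, const float* A, const float* B, float* C) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] * B[i];
}
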


Bandwidth limited!
If processors request data at too high a rate, the memory system cannot keep up.

No amount of latency hiding helps this.

Overcoming bandwidth limits is a common challenge for
application developers on throughput-optimized systems.

CMU 15-418/618, Fall 2023


Bandwidth is a critical resource
Performant parallel programs will:

▪ Organize computation to fetch data from memory less often


- Reuse data previously loaded by the same thread
(traditional intra-thread temporal locality optimizations)
- Share data across threads (inter-thread cooperation)

▪ Request data less often (instead, do more arithmetic: it’s “free”)


- Useful term: “arithmetic intensity” — ratio of math operations to data
access operations in an instruction stream
- Main point: programs must have high arithmetic intensity to utilize
modern processors efficiently

CMU 15-418/618, Fall 2023
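
Aside: to make "arithmetic intensity" concrete, here is a variant of the earlier
element-wise multiply that does more math per element loaded (an artificial,
illustrative sketch; the extra work exists only to change the ratio):

// Each loaded pair now feeds K dependent multiply-adds, so the same 12 bytes of
// memory traffic supports roughly K math operations instead of one.
void vector_mul_more_math(int N, int K, const float* A, const float* B, float* C) {
    for (int i = 0; i < N; i++) {
        float a = A[i], b = B[i], acc = b;
        for (int k = 0; k < K; k++)
            acc = acc * a + b;       // dependent multiply-adds on already-loaded data
        C[i] = acc;
    }
}
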


Summary
▪ Three major ideas that all modern processors employ to varying degrees
- Employ multiple processing cores
- Simpler cores (embrace thread-level parallelism over instruction-level parallelism)
- Amortize instruction stream processing over many ALUs (SIMD)
- Increase compute capability with little extra cost
- Use multi-threading to make more efficient use of processing
resources (hide latencies, fill all available resources)

▪ Due to high arithmetic capability on modern chips, many parallel


applications (on both CPUs and GPUs) are bandwidth bound

▪ GPU architectures use the same throughput computing ideas as CPUs:


but GPUs push these concepts to extreme scales

CMU 15-418/618, Fall 2023


For the rest of this class, know these terms
▪ Multi-core processor
▪ SIMD execution
▪ Coherent control flow
▪ Hardware multi-threading
- Interleaved multi-threading
- Simultaneous multi-threading
▪ Memory latency
▪ Memory bandwidth
▪ Bandwidth bound application
▪ Arithmetic intensity
CMU 15-418/618, Fall 2023
Another example:
for review and to check your understanding
(if you understand the following sequence you understand this lecture)

CMU 15-418/618, Fall 2023


Running code on a simple processor
My very simple program:
compute sin(x) using Taylor expansion
(Program: the serial sinx function from the start of the lecture.)

My very simple processor: completes one instruction per clock
[Diagram: one core with Fetch/Decode, ALU (Execute), and Execution Context]

CMU 15-418/618, Fall 2023


Review: superscalar execution
Unmodified program (the serial sinx function).

My single-core, superscalar processor:
executes up to two instructions per clock from a single instruction stream.

[Diagram: one core with two Fetch/Decode units, two execution units (Exec 1, Exec 2),
and one Execution Context]

Independent operations in the instruction stream are detected by the processor at
run-time and may be executed in parallel on execution units 1 and 2.

CMU 15-418/618, Fall 2023
Review: multi-core execution (two cores)
Modify program to create two threads of control (two instruction streams):
the pthread-based parallel_sinx code shown earlier.

My dual-core processor:
executes one instruction per clock from an instruction stream on each core.

[Diagram: two cores, each with Fetch/Decode, ALU (Execute), and Execution Context]

CMU 15-418/618, Fall 2023
Review: multi-core + superscalar execution
Modify program to create two threads of control (two instruction streams):
the pthread-based parallel_sinx code shown earlier.

My superscalar dual-core processor:
executes up to two instructions per clock from an instruction stream on each core.

[Diagram: two cores, each with two Fetch/Decode units, two execution units
(Exec 1, Exec 2), and one Execution Context]

CMU 15-418/618, Fall 2023
Review: multi-core (four cores)
Modify program to create many threads of control:
recall our fictitious data-parallel language (the forall version of sinx).

My quad-core processor:
executes one instruction per clock from an instruction stream on each core.

[Diagram: four cores, each with Fetch/Decode, ALU (Execute), and Execution Context]

CMU 15-418/618, Fall 2023
Review: four, 8-wide SIMD cores
Observation: the program must execute many iterations of the same loop body.
Optimization: share an instruction stream across execution of multiple iterations
(single instruction, multiple data = SIMD).

My SIMD quad-core processor:
executes one 8-wide SIMD instruction per clock from an instruction stream on each core.

[Diagram: four cores, each with Fetch/Decode, eight SIMD ALUs, and an Execution
Context. Program: the forall version of sinx.]

CMU 15-418/618, Fall 2023
Review: four SIMD, multi-threaded cores
Observation: memory operations have very long latency.
Solution: hide the latency of loading data for one iteration by executing arithmetic
instructions from other iterations.

My multi-threaded, SIMD quad-core processor:
executes one SIMD instruction per clock from one instruction stream on each core,
but can switch to processing the other instruction stream when faced with a stall
(e.g., on the memory load at the start of an iteration, or the memory store at the end).

[Diagram: four cores, each with Fetch/Decode, eight SIMD ALUs, and two execution
contexts. Program: the forall version of sinx.]

CMU 15-418/618, Fall 2023
Summary: four superscalar, SIMD, multi-threaded cores
My multi-threaded, superscalar, SIMD quad-core processor:
executes up to two instructions per clock from one instruction stream on each core
(in this example: one SIMD instruction + one scalar instruction).
Processor can switch to execute the other instruction stream when faced with stall.

[Diagram: four cores; each core has two Fetch/Decode units, an 8-wide SIMD
execution unit (SIMD Exec 1) plus a scalar execution unit (Exec 2), and two
execution contexts]
CMU 15-418/618, Fall 2023


Connecting it all together
Our simple quad-core processor:
Four cores, two-way multi-threading per core (max eight threads active on chip at once), up to two
instructions per clock per core (one of those instructions is 8-wide SIMD)

[Diagram: four cores, each with two Fetch/Decode units, SIMD Exec 1 + Exec 2, two
execution contexts, and private L1 and L2 caches. An on-chip interconnect connects
the cores to a shared L3 cache and a memory controller, which talks to DRAM over
the memory bus.]

CMU 15-418/618, Fall 2023


Thought experiment
▪ You write a C application that spawns two pthreads
▪ The application runs on the processor shown below
- Two cores, two execution contexts per core, up to two instructions per clock,
one instruction is an 8-wide SIMD instruction.

[Diagram: two cores, each with two Fetch/Decode units, Exec 1 + SIMD Exec 2, and
two execution contexts]

▪ Question: "who" is responsible for mapping your pthreads to the
processor's thread execution contexts?
Answer: the operating system

▪ Question: If you were the OS, how would you assign the two threads to
the four available execution contexts?

▪ Another question: How would you assign threads to execution contexts
if your C program spawned five pthreads?

CMU 15-418/618, Fall 2023