A Modern Multi-Core Processor
(Forms of parallelism + understanding latency and bandwidth)
Compile program

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}

Resulting scalar instruction stream (one iteration: input x[i], output result[i]):

  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  st  addr[r2], r0
[Diagram: executing the program on a simple processor (Fetch/Decode, ALU (Execute), Execution Context): the PC steps through the instruction stream above one instruction per clock, from ld r0, addr[r1] through st addr[r2], r0.]

[Diagram: a superscalar processor with two Fetch/Decode units and two execution units (Exec 1, Exec 2) can fetch, decode, and execute up to two independent instructions per clock from the same instruction stream.]
[Diagram: a single pre-multi-core-era core: Fetch/Decode, ALU (Execute), a big data cache, and a memory pre-fetcher.]
More transistors = larger cache, smarter out-of-order logic, smarter branch predictor, etc.
(Also: more transistors → smaller transistors → higher clock frequencies)
Processor: multi-core era

[Diagram: two cores, each with its own Fetch/Decode, ALU, and execution context; one core processes x[i] → result[i], the other processes x[j] → result[j].]
But our program expresses no parallelism

(same sinx code as above)

This program, compiled with gcc, will run as one thread on one of the processor cores.

If each of the simpler processor cores was 0.75X as fast as the original single complicated one, our program now has a "speedup" of 0.75 (i.e. it is slower).

But there are now two cores: 2 × 0.75 = 1.5 (potential for speedup!)
[Diagram: two cores, then four cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context.]
Idea #2:
Amortize cost/complexity of managing an instruction stream across many ALUs

SIMD processing
Single instruction, multiple data

[Diagram: one core with a single Fetch/Decode unit driving eight ALUs (ALU 0 … ALU 7); the same scalar instruction stream (ld r0, addr[r1]; mul r1, r0, r0; mul r1, r1, r0; …; st addr[r2], r0) is broadcast to all eight ALUs, each working on a different data element.]
Conditional execution on a SIMD processor: all eight lanes evaluate the branch condition, producing a per-lane mask (here T T F T F F F F). Both sides of the branch are executed; each lane keeps results only where its mask bit (or its negation, for the else side) is set. A sketch with AVX intrinsics follows the code.

<unconditional code>
float x = A[i];
if (x > 0) {
  float tmp = exp(x,5.f);
  tmp *= kMyConst1;
  x = tmp + kMyConst2;
} else {
  float tmp = kMyConst1;
  x = 2.f * tmp;
}
result[i] = x;
<unconditional code>
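A minimal sketch of this masking with explicit AVX intrinsics (assumptions: stand-in values for kMyConst1/kMyConst2, exp(x, 5.f) treated as x to the 5th power; compile with -mavx). Both sides are computed for all eight lanes, then a per-lane blend selects the result:

#include <immintrin.h>

#define K_MY_CONST1 1.5f   /* stand-in for kMyConst1 */
#define K_MY_CONST2 0.5f   /* stand-in for kMyConst2 */

// Process 8 consecutive floats of A starting at index i (assumes i+8 <= length).
void branch_with_mask(const float* A, float* result, int i)
{
    __m256 x    = _mm256_loadu_ps(&A[i]);
    __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ); // T/F per lane

    // "if" side, computed for every lane: x^5 * kMyConst1 + kMyConst2
    __m256 x2  = _mm256_mul_ps(x, x);
    __m256 x5  = _mm256_mul_ps(_mm256_mul_ps(x2, x2), x);
    __m256 yes = _mm256_add_ps(_mm256_mul_ps(x5, _mm256_set1_ps(K_MY_CONST1)),
                               _mm256_set1_ps(K_MY_CONST2));

    // "else" side, also computed for every lane: 2 * kMyConst1
    __m256 no  = _mm256_set1_ps(2.0f * K_MY_CONST1);

    // Per-lane select: take "yes" where the mask is true, "no" elsewhere.
    __m256 out = _mm256_blendv_ps(no, yes, mask);
    _mm256_storeu_ps(&result[i], out);
}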
▪ “Divergent” execution
- A lack of instruction stream coherence
▪ AVX instructions: 256 bit operations: 8x32 bits or 4x64 bits (8-wide float vectors)
[Diagram: a quad-core chip; each core has its own Fetch/Decode unit, eight SIMD ALUs (ALU 0 … ALU 7), and an execution context.]

On campus:
  GHC machines: 4 cores, 8 SIMD ALUs per core (AVX2 instructions)
  Machines in GHC 5207 (old GHC 3000 machines): 6 cores, 4 SIMD ALUs per core
  Another example: 15 cores, 32 SIMD ALUs per core, 1.3 TFLOPS
- SIMD: use multiple ALUs controlled by the same instruction stream (within a core)
  - Efficient design for data-parallel workloads: control amortized over many ALUs
  - Vectorization can be done by the compiler (explicit SIMD) or at runtime by hardware (see the sketch after this list)
  - [Lack of] dependencies is known prior to execution (usually declared by the programmer, but can be inferred via loop analysis by an advanced compiler)
- Superscalar: exploit ILP within an instruction stream. Process different instructions from the same instruction stream in parallel (within a core)
  - Parallelism automatically and dynamically discovered by the hardware during execution (not programmer visible)
Not addressed further in this class. That’s for a proper computer architecture design course like 18-447.
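A minimal sketch of the "vectorization by the compiler" bullet above (the function, flags, and restrict qualifiers are illustrative assumptions, not from the slides): with iterations known to be independent, a compiler such as gcc at -O3 with -mavx2 may turn this loop into 8-wide SIMD instructions, though it is not guaranteed to.

// A loop a vectorizing compiler can map onto SIMD ALUs.
// restrict promises the arrays don't alias, so iterations are independent
// (the "[lack of] dependencies" noted above).
void scale_add(int n, const float* restrict a,
               const float* restrict b, float* restrict out)
{
    for (int i = 0; i < n; i++)
        out[i] = 2.0f * a[i] + b[i];
}
// e.g.: gcc -O3 -mavx2 -c scale.c   (gcc -fopt-info-vec reports what was vectorized)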
Quiz Time
▪ L2 Participation Quiz on Canvas
▪ Memory bandwidth
- The rate at which the memory system can provide data to a processor
- Example: 20 GB/s
▪ Problem:
  - Move X bytes of data from a datacenter in Pittsburgh to New York (~370 miles)
  - 100 PB, driven the 370 miles: ~6.5 hours
  - Over a 25 MB/s link instead:
    - 100 GB: 1e+11 B / 25 MB/s ≈ 1.1 hours
    - 1 EB: 1e+18 B / 25 MB/s ≈ 1,267 years
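The same arithmetic as a small program (a sketch; the 25 MB/s link rate is the figure assumed in the bullets above):

#include <stdio.h>

int main(void)
{
    double bytes_per_sec = 25e6;   // 25 MB/s link
    double hours_100GB = 1e11 / bytes_per_sec / 3600.0;
    double years_1EB   = 1e18 / bytes_per_sec / 3600.0 / 24.0 / 365.25;
    printf("100 GB over the link: %.1f hours\n", hours_100GB);  // ~1.1 hours
    printf("1 EB over the link:   %.0f years\n", years_1EB);    // ~1,267 years
    return 0;
}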
[Diagram: memory hierarchy of a multi-core CPU. Each core (Core 1 … Core N) has its own L1 cache (32 KB) and L2 cache (256 KB); all cores share an L3 cache (8 MB) and a 25 GB/sec connection to DDR3 DRAM (gigabytes).]

* Caches also provide high-bandwidth data transfer to the CPU
Prefetching reduces stalls (hides latency)
▪ All modern CPUs have logic for prefetching data into caches
  - Dynamically analyze program's access patterns, predict what it will access soon
▪ Reduces stalls since data is resident in cache when accessed

  predict value of r2, initiate load
  predict value of r3, initiate load
  ...
  data arrives in cache
  ...
  data arrives in cache
  ...
  ld r0 mem[r2]      ← these loads are cache hits
  ld r1 mem[r3]
  add r0, r0, r1

Note: Prefetching can also reduce performance if the guess is wrong (hogs bandwidth, pollutes caches) (more detail later in course)
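The prefetching above happens automatically in hardware. As a rough software analogue (a sketch, not from the slides; the prefetch distance of 16 elements is an arbitrary assumption), GCC/Clang expose __builtin_prefetch to request a load ahead of use:

#include <stddef.h>

// Sum an array while prefetching elements we expect to need soon.
float sum_with_prefetch(const float* a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high locality */);
        sum += a[i];
    }
    return sum;
}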
[Diagram: hiding stalls with multi-threading. A single core (Fetch/Decode, Exec, four execution contexts 1-4) interleaves four threads: when the running thread stalls, the core switches to another runnable thread, and each thread alternates between runnable and stalled until all four are done.]
Throughput computing trade-off

[Diagram: four threads (Thread 1-4) processing elements 0 … 7, 8 … 15, 16 … 23, and 24 … 31 over time; the core's on-chip storage (context storage, or L1 cache) is shown partitioned into either sixteen small execution contexts or four larger ones.]
▪ Costs
- Requires additional storage for thread contexts
- Increases run time of any single thread
(often not a problem, we usually care about throughput in parallel apps)
- Requires additional independent work in a program (more independent work
than ALUs!)
- Relies heavily on memory bandwidth
- More threads → larger working set → less cache space per thread
- May go to memory more often, but can hide the latency
Our fictitious multi-core chip:
  16 cores
  16 simultaneous instruction streams
  64 total concurrent instruction streams
[Diagram: the CPU memory hierarchy again: per-core L1 (32 KB) and L2 (256 KB) caches, a shared L3 cache (8 MB), and a 25 GB/sec connection to DDR3 DRAM (gigabytes).]
CPU: Big caches, few threads, modest memory BW
Rely mainly on caches and prefetching

[Diagram: a GPU core: small GFX texture cache (12 KB), scratchpad/L1 cache (64 KB), and execution-context storage (128 KB).]
GPU: Small caches, many threads, huge memory BW
Rely mainly on multi-threading
Thought experiment
Task: element-wise multiplication of two vectors A and B (A × B = C)
Assume vectors contain millions of elements (a very large array!)
  - Load input A[i]
  - Load input B[i]
  - Compute A[i] × B[i]
  - Store result into C[i]
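The task in code (a sketch; the byte counts in the comment are the crux of the thought experiment: three 4-byte memory operations for every single multiply, so with typical CPU memory bandwidth this loop is limited by memory, not by ALU throughput):

// Element-wise multiply: C[i] = A[i] * B[i]
// Per element: load A[i] (4 B) + load B[i] (4 B) + store C[i] (4 B) = 12 bytes
// of memory traffic for one floating-point multiply.
void vec_mul(int n, const float* A, const float* B, float* C)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] * B[i];
}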
[Diagram: a single core with two execution units (Exec 1, Exec 2) and one execution context runs the sinx inner loop:]

  for (int j=1; j<=terms; j++)
  {
    value += sign * numer / denom;
    numer *= x[i] * x[i];
    denom *= (2*j+2) * (2*j+3);
    sign *= -1;
  }
  result[i] = value;

Independent operations in the instruction stream are detected by the processor at run-time and may be executed in parallel on execution units 1 and 2.
Review: multi-core execution (two cores)
Modify program to create two threads of control (two instruction streams)

My dual-core processor: executes one instruction per clock from an instruction stream on each core.

[Diagram: two superscalar cores, each with two Fetch/Decode units, two execution units (Exec 1, Exec 2), and an execution context.]

typedef struct {
  int N;
  int terms;
  float* x;
  float* result;
} my_args;

void parallel_sinx(int N, int terms, float* x, float* result)
{
  pthread_t thread_id;
  my_args args;

  args.N = N/2;
  args.terms = terms;
  args.x = x;
  args.result = result;
  // ... (continued in the sketch below)
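A sketch of how parallel_sinx might continue (assumed, not verbatim from the slide): spawn one worker thread for the first half of the elements, compute the second half on the calling thread, then join. The helper name my_thread_start is an assumption; requires <pthread.h>.

// Hypothetical worker: runs the original sinx on its half of the data.
void* my_thread_start(void* thread_arg)
{
    my_args* a = (my_args*)thread_arg;
    sinx(a->N, a->terms, a->x, a->result);
    return NULL;
}

    // ...continuing inside parallel_sinx:
    pthread_create(&thread_id, NULL, my_thread_start, &args); // launch thread
    sinx(N - args.N, terms, x + args.N, result + args.N);     // do the other half here
    pthread_join(thread_id, NULL);
}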
Review: four SIMD, multi-threaded cores
Observation: memory operations have very long latency
Solution: hide latency of loading data for one iteration by executing arithmetic instructions from other iterations

My multi-threaded, SIMD quad-core processor: executes one SIMD instruction per clock from one instruction stream on each core. But can switch to processing the other instruction stream when faced with a stall.

void sinx(int N, int terms, float* x, float* result)
{
  // declare independent loop iterations
  forall (int i from 0 to N-1)
  {
    float value = x[i];   // memory load
    // ... (rest of the loop body as in the scalar sinx above)
  }
}
Summary: four superscalar, SIMD, multi-threaded cores
My multi-threaded, superscalar, SIMD quad-core processor:
executes up to two instructions per clock from one instruction stream on each core
(in this example: one SIMD instruction + one scalar instruction).
Processor can switch to execute the other instruction stream when faced with stall.
[Diagram: the four cores are connected by an on-chip interconnect to a shared L3 cache and a memory controller, which drives the memory bus (to DRAM).]
▪ Question: If you were the OS, how would you assign the two threads to the four available execution contexts?

[Diagram: the four cores again, each with Fetch/Decode, a scalar Exec 1, and a SIMD Exec 2.]

pthreads?
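The OS normally makes this placement decision, but an application can influence it with CPU affinity. A sketch assuming Linux and the GNU pthread_setaffinity_np extension (the core numbers are arbitrary):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core (Linux-specific, GNU extension).
static int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
// e.g., pin the two threads to different physical cores (say 0 and 1) so they
// don't compete for the execution contexts of a single core.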