Lecture 16
William Gropp
www.cs.illinois.edu/~wgropp
Add to the (Model) Architecture
• What do you do with a billion transistors?
♦ For a long time, try to make an individual processor (what we now call a core) faster
♦ Increasingly complicated hardware yielded less and less benefit (speculation, out-of-order execution, prefetch, …)
• An alternative is to simply put multiple processing elements (cores) on the same chip
• Thus the “multicore processor” or “multicore chip”
Adding Processing Elements
• Something like this would be simple
• But in practice, some resources are shared, giving us…
[Figure: four cores, each with its own private L1 cache and private L2 cache]
Adding Processing Elements
[Figure: four cores with private L1 caches; each pair of cores shares an L2 cache, and all cores share memory]
Notes on Multicore
Inside a Thread’s Memory
Kinds of Threads
• Almost a process
♦ Kernel (Operating System) schedules
♦ Each thread can make independent system calls
• Co-routines and lightweight processes
♦ User schedules (sort of…)
• Memory references
♦ Hardware schedules
Kernel Threads
Hardware Threads
Simultaneous Multithreading (SMT)
• Share the functional units in a single core
♦ Remember the pipelining example – not all functional units (integer, floating point, load/store) are busy each cycle
♦ SMT idea is to have two threads sharing a single set of functional units
♦ May be able to keep more of the hardware busy (thus improving throughput)
• Each SMT thread takes more time than it would if it were the only thread
• Almost entirely managed by hardware
Why Use Threads?
• Latency hiding
♦ Low-overhead steering/probing
♦ Background checkpoint save
• Alternate method for nonblocking operations
♦ CORBA method invocation (no funky nonblocking calls)
• Hiding memory latency
• Fine-grain parallelism
♦ Compiler parallelism
Common Thread Programming Models
• Library-based (invoke a routine in a separate thread); see the sketch after this list
♦ pthreads (POSIX threads)
♦ See “Threads cannot be implemented as a library,” H. Boehm, http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf
• Separate enhancements to existing languages
♦ OpenMP, OpenACC, OpenCL, CUDA, …
• Within the language itself
♦ Java, C11, others
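A minimal sketch of the library-based model (my illustration; the names worker and id are arbitrary): a routine is handed to pthread_create, runs in a newly created thread, and pthread_join waits for it to finish. Compile with cc -pthread.

    #include <pthread.h>
    #include <stdio.h>

    /* Routine executed in the new thread; the argument and return
       value are exchanged as void pointers. */
    void *worker(void *arg)
    {
        int id = *(int *)arg;
        printf("hello from thread %d\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        int id = 1;
        if (pthread_create(&t, NULL, worker, &id) != 0)
            return 1;              /* creation failed */
        pthread_join(t, NULL);     /* wait for the thread to finish */
        return 0;
    }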
Thread Issues
• Synchronization
♦ Avoiding conflicting operations (memory references) between threads
• Variable Name Space
♦ Interaction between threads and the language
• Scheduling
♦ Will the OS do what you want?
Synchronization of Access
• Read/write model

  Thread 1:               Thread 2:
  a = 1;                  b = 1;
  barrier();              barrier();
  b = 2;                  while (a==1) ;
  a = 2;                  printf( "%d\n", b );

  What does thread 2 print?
Synchronization of Access (cont.)
• Many possibilities:
♦ 2 (what the programmer expected)
♦ 1 (thread 1 reorders the stores so a=2 executes before b=2, which is valid in the language)
♦ Nothing: a never changes in thread 2, so the while loop never exits
♦ Some other value from thread 1 (the value of b before this code starts)
How Can We Fix This?
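One standard repair, sketched below with C11 atomics plus pthreads (a minimal sketch and an assumption on my part; the lecture may present locks or another mechanism): declaring a atomic with release/acquire ordering forbids the store reordering that allowed 1 to be printed, and also guarantees the update of a becomes visible, so the spin loop cannot run forever.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int a = 1;   /* flag that thread 2 spins on */
    int b = 1;          /* payload */

    void *thread1(void *arg)
    {
        (void)arg;
        b = 2;
        /* Release store: the store to b cannot be reordered after it. */
        atomic_store_explicit(&a, 2, memory_order_release);
        return NULL;
    }

    void *thread2(void *arg)
    {
        (void)arg;
        /* Acquire load: the read of b cannot move before the loop. */
        while (atomic_load_explicit(&a, memory_order_acquire) == 1)
            ;
        printf("%d\n", b);   /* now guaranteed to print 2 */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

The release/acquire pair establishes a happens-before relation: once thread 2 observes a == 2, it must also observe b == 2.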
Variable Names
• Each thread can access all of a process's memory (except for the thread's stack*)
♦ Named variables refer to the address space, and are thus visible to all threads
♦ Compiler doesn't distinguish A in one thread from A in another
♦ No modularity
♦ Like using Fortran blank COMMON for all variables
• “Thread private” extensions are becoming common
♦ “Thread local storage” (TLS) is becoming common as an attribute; see the sketch after this list
♦ NEC has a variant where all variable names refer to different variables unless specified otherwise
  • All variables are on the thread stack by default (even globals)
  • More modular
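A minimal sketch of the TLS attribute using C11's _Thread_local (my illustration; gcc and clang also offer the older __thread spelling): counter gets one copy per thread, so each worker increments its own instance and main's copy is never touched.

    #include <pthread.h>
    #include <stdio.h>

    _Thread_local int counter = 0;  /* one copy per thread; an ordinary
                                       global would be a single copy
                                       shared by all threads */

    void *work(void *arg)
    {
        (void)arg;
        counter++;                                    /* this thread's copy */
        printf("worker's counter = %d\n", counter);   /* prints 1 */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("main's counter = %d\n", counter);     /* prints 0 */
        return 0;
    }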
Scheduling Threads
Node Execution Models
Performance Models
Performance Models: Memory
• Assume the time to move a unit of memory is t_m
♦ Due to latency in hardware; clock rate of data paths
♦ Rate is 1/t_m = r_m
• Also assume that there is a maximum rate r_max
♦ E.g., width of data path * clock rate
• Then the rate at which k threads can move data is min(k/t_m, r_max) = min(k*r_m, r_max)
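A small worked sketch of this model (the values r_m = 10 GB/s and r_max = 40 GB/s are assumed for illustration, not taken from the lecture): the aggregate rate grows linearly in k until k*r_m reaches r_max, and is flat beyond that point.

    #include <stdio.h>

    /* Aggregate memory bandwidth model: min(k * r_m, r_max). */
    double aggregate_rate(int k, double rm, double rmax)
    {
        double r = k * rm;
        return (r < rmax) ? r : rmax;
    }

    int main(void)
    {
        const double rm = 10.0, rmax = 40.0;   /* GB/s, assumed values */
        for (int k = 1; k <= 8; k++)
            printf("k = %d: %.0f GB/s\n", k, aggregate_rate(k, rm, rmax));
        return 0;
    }

The saturation point k = r_max/r_m (here 4 threads) is the knee of the curve on the next slide.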
Limits on Thread Performance
• Threads share memory resources
• Performance is roughly linear with additional threads until the maximum bandwidth R_max is reached
• At that point each thread receives a decreasing fraction of the available bandwidth
[Figure: aggregate bandwidth versus number of threads; rises linearly, then flattens at R_max]
Questions