2.2 DD2356 Threads
Stefano Markidis
Add to the (Model) Architecture
Adding Processing Elements I
• Here’s our model so far, with the vector and pipelining part of the “core”
– Most systems today have an L3 cache as well
• We can (try to) replicate everything...
[Figure: a single core with its L1 cache and L2 cache, connected to memory]
Adding Processing Elements II
Adding Processing Elements III
[Figure: two cores, each with its own L2 cache, sharing memory]
Notes on Multicore
Process VS Thread - I
• A thread is a basic unit of processor utilization, consisting of a program counter, a stack, and registers.
• Processes have a single thread of control: there is one program counter, and one sequence of instructions that can be carried out at any given time.
[Figure: a single-threaded process beside a multi-threaded process]
Process VS Thread - II
• An executing program (process) is defined by:
♦ Address space
♦ Program Counter
• Threads are multiple program counters within a single process, all sharing that process’s address space
[Figure: one process containing several threads]
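To make the distinction concrete, here is a minimal C sketch (my illustration, not from the original slides) in which one process creates two POSIX threads; each thread has its own program counter and stack, while both live in the process’s shared address space.

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread runs this function with its own program counter and stack. */
    void *worker(void *arg) {
        long id = (long)arg;
        int on_my_stack = 0;   /* a distinct address in each thread's private stack */
        printf("thread %ld: stack variable at %p\n", id, (void *)&on_my_stack);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Compile with cc -pthread; the two printed addresses differ because each thread gets its own stack.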
Programming Models for Multicore Processors
• Parallelism within a process
• Compiler-managed parallelism
– Transparent to programmer
– Rarely successful
• Threads
– Within a process, all memory shared
– Each “thread” executes “normal” code (a short OpenMP sketch follows this list)
– Many subtle issues (more later)
• Parallelism between processes within a node is covered later, in the third module
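As a small sketch of the threads model (mine, not from the slides), the OpenMP fragment below lets each thread execute ordinary C code on its slice of a loop; the whole array is visible to every thread because memory is shared within the process.

    #include <stdio.h>

    int main(void) {
        double a[1000];
        /* The OpenMP runtime creates a team of threads; each one executes
           ordinary C code on its share of the iterations, and all of them
           see the same array because memory is shared within the process. */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            a[i] = 2.0 * i;
        printf("a[999] = %.1f\n", a[999]);
        return 0;
    }

Compile with cc -fopenmp.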
Why Use Threads?
Common Thread Programming Models
Thread Issues
1. Synchronization
• Avoiding conflicting operations (memory references) between threads
2. Variable Name Space
• Interaction between threads and the language
3. Scheduling
• Will the OS do what you want?
Synchronization of Access
Read/write model
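The code for this slide was lost in extraction; the sketch below is a reconstruction of the kind of two-thread read/write example that the outcomes on the next slide describe. The initial values and exact statements are my assumptions.

    /* Assumed initial values: a = 1, b = 1 */

    Thread 1:                  Thread 2:
    b = 2;                     while (a != 2) ;   /* wait until thread 1 sets a */
    a = 2;                     r = b;             /* what value of b is read? */

The programmer expects r == 2; the next slide lists what can actually happen.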
Synchronization of Access
Many possibilities:
• 2 (what the programmer expected)
• 1 (thread 1 reorders the stores, so a=2 executes before b=2; this is valid in the language)
• Nothing: a never changes as seen by thread 2
• Some other value from thread 1 (the value b held before this code starts)
How Can We Fix This?
• Need to impose an order on the memory updates
– OpenMP has FLUSH
– Memory barriers (more on this later)
• Need to ensure that data updated by another thread is reloaded
– Copies of memory in cache may only be updated eventually
– In this example, a may well be (and is likely to be) kept in a register, never updated
– volatile in C forces loads from memory (a C11-atomics sketch follows below)
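As one concrete fix (my sketch; C11 atomics are a newer alternative to the FLUSH and volatile mechanisms the slide names), a release store in thread 1 paired with an acquire load in thread 2 both orders the two updates and forces a to be reloaded:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int a = 1;   /* the flag: atomic, so thread 2 must reload it */
    int b = 1;

    void *thread1(void *arg) {
        (void)arg;
        b = 2;                                               /* ordinary store */
        atomic_store_explicit(&a, 2, memory_order_release);  /* orders b = 2 before a = 2 */
        return NULL;
    }

    void *thread2(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&a, memory_order_acquire) != 2)
            ;                            /* atomic load: cannot be cached in a register */
        printf("b = %d\n", b);           /* release/acquire pairing guarantees 2 */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }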
Synchronization of Access
Variable Names
• Each thread can access all of a process’s memory (except for other threads’ stacks)
– Named variables refer to the address space, and are thus visible to all threads
– The compiler doesn’t distinguish A in one thread from A in another (illustrated below)
– No modularity
– Like using Fortran blank COMMON for all variables
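A small illustration (mine, not from the slides): a file-scope A names one location that every thread sees, while a local variable lives on each thread’s private stack.

    #include <pthread.h>
    #include <stdio.h>

    int A = 0;   /* one name, one location: visible to every thread in the process */

    void *worker(void *arg) {
        int A_local = (int)(long)arg;   /* lives on this thread's private stack */
        A = A_local;                    /* both threads write the SAME global A */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("A = %d\n", A);          /* 1 or 2, depending on which write lands last */
        return 0;
    }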
Scheduling Threads
• If threads used for latency hiding
– Schedule on the same core
– Provides better data locality, cache usage
• If threads used for parallel execution
– Schedule on different cores, using different memory pathways
– Appropriate for data parallelism (a pinning sketch follows below)
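One way to pin a thread to a core on Linux (a sketch using the GNU extension pthread_setaffinity_np; the core choice here is illustrative, and portable OpenMP codes often use OMP_PROC_BIND / OMP_PLACES instead):

    #define _GNU_SOURCE           /* pthread_setaffinity_np, sched_getcpu are GNU extensions */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);         /* pin the calling thread to core 0 */
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            fprintf(stderr, "could not set affinity\n");
        printf("now running on core %d\n", sched_getcpu());
        return 0;
    }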
Node Execution Models
• Where do threads run on a node?
– Typical user expectation: the user’s application uses all the cores and has complete access to them
• Reality is complex.
• Common cases include:
– The OS preempts core 0, or cores 0 and 2
– The OS preempts user threads and distributes them across cores
– A hidden core reserved for system services (Blue Gene/Q)
Performance Models: Memory
• Assume the time to move a unit of memory is t_m
– Due to latency in hardware and the clock rate of the data paths
– The per-thread rate is 1/t_m = r_m
• Also assume that there is a maximum aggregate rate r_max
– E.g., width of the data path × clock rate
• Then the rate at which k threads can move data is
– min(k/t_m, r_max) = min(k·r_m, r_max)
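Worked example with illustrative numbers (my assumptions, not the slide’s): if t_m = 1 ns per 8-byte word, then r_m = 8 GB/s per thread; with r_max = 24 GB/s, one thread moves min(8, 24) = 8 GB/s, two threads move 16 GB/s, and three or more saturate the path at 24 GB/s, so threads beyond k = 3 add no memory bandwidth.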
Limits on Thread Performance
Questions