
DD2356 – Threads

Stefano Markidis
Add to the (Model) Architecture

• What do you do with a billion transistors?
– For a long time, the answer was to try to make an individual processor (what we now call a core) faster
– Increasingly complicated hardware yielded less and less benefit (speculation, out-of-order execution, prefetch, ...)
• An alternative is to simply put multiple processing elements (cores) on the same chip
– Thus, the "multicore processor" or "multicore chip"

(Image source: https://superuser.com/questions/584900/how-distinguish-between-multicore-and-multiprocessor-systems)
2
Adding Processing Elements I

• Here's our model so far, with the vector and pipelining part of the "core"
– Most systems today have an L3 cache as well
• We can (try to) replicate everything...

[Diagram: a single core with its L1 cache and L2 cache, connected to memory]

3
Adding Processing Elements II

• Something like this would be simple
• But in practice, some resources are shared, giving us...

[Diagram: four cores, each with its own L1 cache, L2 cache, and memory]

4
Adding Processing Elements III

[Diagram: four cores, each with its own L1 cache; each pair of cores shares an L2 cache; all cores share one memory]

5
Notes on Multicore

• Some resources are shared
– Typically the larger (slower) caches and the path to memory
– Hardware threads may share functional units within a core (variously called simultaneous multithreading (SMT) or hyperthreading)
– There is rarely enough bandwidth for the shared resources (cache, memory) to supply all cores at the same time

6
Process VS Thread - I
• A thread is a basic unit of processor utilization, consisting of a program counter, a stack, and registers.
• Processes have a single thread of control: there is one program counter, and one sequence of instructions that can be carried out at any given time.

[Diagram: a single-threaded process vs. a multi-threaded process]

7
Process VS Thread - II

• An executing program (process) is defined by
♦ Address space
♦ Program Counter
• Threads are multiple program counters (+ stack and registers)

[Diagram: a single-threaded process vs. a multi-threaded process]
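To make this concrete, here is a minimal pthreads sketch (not from the slides): two threads in one process share the global variable, while each has its own stack and program counter.

/* One process, two threads. "shared" has a single copy visible to both
 * threads; "local" lives on each thread's private stack.
 * Compile with: cc example.c -pthread */
#include <pthread.h>
#include <stdio.h>

int shared = 0;                     /* in the process's address space: one copy */

void *worker(void *arg) {
    int local = *(int *)arg;        /* on this thread's own stack */
    shared = local;                 /* both threads write the same variable */
    printf("thread %d sees shared = %d\n", local, shared);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;
    pthread_create(&t1, NULL, worker, &id1);
    pthread_create(&t2, NULL, worker, &id2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("final shared = %d\n", shared);  /* 1 or 2, depending on thread order */
    return 0;
}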

8
Programming Models for Multicore Processors
• Parallelism within a process
• Compiler-managed parallelism
– Transparent to the programmer
– Rarely successful
• Threads
– Within a process, all memory is shared
– Each "thread" executes "normal" code
– Many subtle issues (more later)
• Parallelism between processes within a node is covered later, in the third module

9
Why Use Threads?

• Manage multiple points of interaction


• Low overhead steering/probing
– Background checkpoint save
• Alternate method for nonblocking operations
• Hiding memory latency
• Fine-grain parallelism
• Compiler parallelism

10
Common Thread Programming Models

• Library-based (invoke a routine in a separate thread)
– pthreads (POSIX threads)
– See "Threads cannot be implemented as a library," H. Boehm, http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf
• Separate enhancements to existing languages
– OpenMP, OpenACC, OpenCL, CUDA, ...
• Within the language itself
– Java, C11, others
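As a sketch of the "separate enhancements" style, an OpenMP parallel loop in C (the array size and values are arbitrary, chosen only for illustration):

/* The pragma, not the base language, introduces the threads.
 * Compile with, e.g., cc -fopenmp example.c */
#include <stdio.h>

int main(void) {
    double a[1000], b[1000];
    for (int i = 0; i < 1000; i++)
        b[i] = (double)i;

    #pragma omp parallel for          /* iterations divided among threads */
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * b[i];

    printf("a[999] = %f\n", a[999]);  /* prints 1998.000000 */
    return 0;
}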

11
Thread Issues

1. Synchronization
• Avoiding conflicting operations (memory references) between threads
2. Variable Name Space
• Interaction between threads and the Language
3. Scheduling
• Will the OS do what you want?

12
Synchronization of Access

Read/write model
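(The code from the original slide is not reproduced in this text. The sketch below is a reconstruction consistent with the possibilities listed on the next slide; both shared variables are assumed to start at 1, and no synchronization is used, which is the point.)

/* Thread 1 intends to update b, then signal completion through a.
 * Thread 2 waits for the signal, then reads b. */
#include <pthread.h>
#include <stdio.h>

int a = 1, b = 1;                    /* shared, both initially 1 */

void *thread1(void *arg) {
    b = 2;                           /* the "data" */
    a = 2;                           /* the "flag" */
    return NULL;
}

void *thread2(void *arg) {
    while (a != 2)                   /* spin until the flag appears set */
        ;
    printf("%d\n", b);               /* what does this print? */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

As written this is a data race, so the program may also hang; that is part of the exercise.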

What does thread 2 print?


Take a few minutes and think about the possibilities

13
Synchronization of Access

Many possibilities:
• 2 (what the programmer expected)
• 1 (thread 1 reorders the stores so a=2 is executed before b=2; this is valid in the language)
• Nothing: a never changes as seen by thread 2, so it loops forever
• Some other value from thread 1 (the value b had before this code started)

14
How Can We Fix This?
• Need to impose an order on the memory updates
– OpenMP has FLUSH
– Memory barriers (more on this later)
• Need to ensure that data updated by another thread is reloaded
– Copies of memory in cache may update eventually
– In this example, a may be (is likely to be) in register, never updated
– volatile in C
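A sketch of one way to get both properties, using C11 atomics (the slide mentions OpenMP FLUSH and volatile; the atomic flag here plays the same role, ordering the stores and forcing a to be reloaded rather than kept in a register):

/* The earlier example, fixed: a release store on the flag orders b=2 before
 * a=2, and an acquire load re-reads the flag on every iteration. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int b = 1;
atomic_int a = 1;

void *writer(void *arg) {
    b = 2;                                                /* data */
    atomic_store_explicit(&a, 2, memory_order_release);  /* flag, ordered after b */
    return NULL;
}

void *reader(void *arg) {
    while (atomic_load_explicit(&a, memory_order_acquire) != 2)
        ;                                                 /* a reloaded each time */
    printf("%d\n", b);                                    /* now guaranteed to print 2 */
    return NULL;
}

int main(void) {
    pthread_t tw, tr;
    pthread_create(&tr, NULL, reader, NULL);
    pthread_create(&tw, NULL, writer, NULL);
    pthread_join(tw, NULL);
    pthread_join(tr, NULL);
    return 0;
}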

15
Synchronization of Access

• Often need to ensure that updates happen atomically (all or nothing)


– Critical sections, lock/unlock, and similar methods
– Java has synchronized methods
– C11 provides atomic memory operations
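Both styles in a short C sketch: a pthreads mutex protecting a critical section, and the equivalent C11 atomic operation (thread and iteration counts are arbitrary):

/* Four threads each increment two counters 100000 times. Without the lock
 * or the atomic, the final counts would be unpredictable. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
atomic_long acounter = 0;

void *work(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* critical section: all-or-nothing update */
        counter++;
        pthread_mutex_unlock(&lock);

        atomic_fetch_add(&acounter, 1);  /* C11 atomic memory operation */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld, acounter = %ld\n", counter, (long)acounter);
    return 0;
}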

16
Variable Names
• Each thread can access all of a process's memory (except for other threads' stacks)
– Named variables refer to the address space, and are thus visible to all threads
– The compiler doesn't distinguish A in one thread from A in another
– No modularity
– Like using Fortran blank COMMON for all variables

17
Scheduling Threads
• If threads used for latency hiding
– Schedule on the same core
– Provides better data locality, cache usage
• If threads used for parallel execution
– Schedule on different cores using different memory pathways
– Appropriate for data parallelism
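Placement can also be requested explicitly. A Linux/glibc-specific sketch using pthread_setaffinity_np (a nonportable call; OpenMP codes would more often control placement with OMP_PROC_BIND / OMP_PLACES):

/* Pin a newly created thread to a chosen core (core 2 here is arbitrary).
 * Compile with: cc example.c -pthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

void *work(void *arg) {
    /* ... computation whose placement we want to control ... */
    return NULL;
}

int main(void) {
    pthread_t t;
    cpu_set_t set;

    pthread_create(&t, NULL, work, NULL);

    CPU_ZERO(&set);
    CPU_SET(2, &set);                                     /* allow only core 2 */
    if (pthread_setaffinity_np(t, sizeof(cpu_set_t), &set) != 0)
        fprintf(stderr, "could not set affinity\n");

    pthread_join(t, NULL);
    return 0;
}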

18
Node Execution Models
• Where do threads run on a node?
– Typical user expectation: the user's application uses all cores and has complete access to them
• Reality is complex. Common cases include:
– The OS preempts core 0, or cores 0 and 2
– The OS preempts user threads and distributes them across cores
– A hidden core (BG/Q)

19
Performance Models: Memory
• Assume the time to move a unit of memory is t_m
– Due to latency in hardware; clock rate of data paths
– The rate is 1/t_m = r_m
• Also assume that there is a maximum rate r_max
– E.g., width of data path * clock rate
• Then the rate at which k threads can move data is
– min(k/t_m, r_max) = min(k*r_m, r_max)
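A quick worked example with made-up numbers: suppose t_m = 0.1 ns per unit, so each thread moves r_m = 10 units/ns, and suppose r_max = 40 units/ns. Then 1, 2, 3, and 4 threads achieve 10, 20, 30, and 40 units/ns, but k = 8 threads still get min(8*10, 40) = 40 units/ns, i.e., only 5 units/ns per thread.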

20
Limits on Thread Performance

• Threads share memory resources


• Performance is roughly linear with
additional threads until the maximum
bandwidth is reached
• At that point, each thread receives a
decreasing fraction of available bandwidth

21
Questions

• How do you expect a multithreaded STREAM to perform as you add threads?


• What happens if there are more threads than cores?
– Can programs run faster in that case?

22
