
EEDG/CE/CS 6304 Computer Architecture

Lecture 12 – Thread-Level
Parallelism

Benjamin Carrion Schaefer


Associate Professor
Department of Electrical and Computer Engineering
Course Overview
• Fundamentals of Design and Analysis of
Computers (2 lectures)
– History, technological breakthroughs, etc.
– Trends and metrics: performance,
power/energy, cost
• CPU (7 Lectures)
– Instruction Set Architecture
– Arithmetic for Computers (new)
– Instruction Level Parallelism (ILP)
– Dynamic instruction scheduling
– Branch prediction
– Thread-level parallelism
– Modern processors
• Memories (4 Lectures)
– Memory hierarchy
– Caches
– Secondary storage
– Virtual memory
• Buses (1 lecture)
• New computer structures: Heterogeneous
computing (1 lecture)
Objectives
Upon completion of this chapter, you will be able to:

• Identify different types of parallelism and how these are exploited in modern computer systems.
• Understand different types of multiprocessor
systems
– Message passing vs. shared-memory systems
• Understand cache coherence and cache coherence protocols
– Invalidation based
• MESI
– Update based
Ref: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th Edition, Morgan Kaufmann, Chapter 5
Shen & Lipasti, Modern Processor Design, Waveland Press, Chapter 11
Types of Parallelism in Applications
1. Instruction-Level Parallelism (ILP)
– Multiple instructions from the same program executed concurrently
• Superscalar: Managed by hardware
• VLIW: Managed by compiler
– Limited by data and control dependencies
2. Task-level Parallelism (TLP)
– Several different tasks executed on the same data
– E.g., on a database, compute the average age and the max age
3. Data-level Parallelism (DLP)
– Distributes data across different nodes, which operate on the data in parallel
– E.g., matrix multiplication by dividing the matrices into smaller blocks
4. Transaction/Thread-level Parallelism
– Multiple processes/threads from different programs executed
concurrently
How to Benefit from Parallelism?

Amdahl's law: S(n) = 1 / ((1 − p) + p/n), where p is the fraction of the program that can be parallelized and n is the number of processors
Terminology
• Process:
– A program in execution
– Unit of work within the system. Program is a passive
entity; process is an active entity.
– Program instance loaded into memory by the OS
– Every process has its own address space organized in
code, data, stack segments
– Each process consists of at least one thread: main
thread of execution
• Thread:
– Is the smallest sequence of programmed instructions
that can be managed independently by a scheduler,
which is typically a part of the operating system
– Is a component of a process (lightweight process)
– Multiple threads can execute concurrently and share
resources (e.g., memory)
Cook Analogy

• You want to prepare food for several banquets. Each requires many dinners
• Two positions to fill:
– Boss (control). Gets all the ingredients and tells
the chef what to do.
– The Chef (datapath). Does all the cooking
• ILP Analogy
– One ultra-talented boss with many hands
– One ultra-talented chef with many hands
Need for new Acceleration Techniques

• Diminishing returns from exploiting:
– Pipelining
– ILP
• When many instructions depend on the previous one → pipeline hazards → bubbles
• Need to exploit all types of parallelism to
further improve performance
Performance Beyond Single-Thread ILP
• Some applications have large amounts of parallelism (e.g., databases, scientific applications)
• Data-Level Parallelism: Perform identical operations on data
– 1 kitchen, 1 boss, many chefs
• Transaction-level Parallelism: process with its own instructions and data
– A thread may be a subpart of a parallel program (thread) or an independent program (process)
– Each thread has all the state necessary to allow it to execute (instructions, data, PC, registers)
– Many kitchens, each with its own boss and chef
Single-Threaded CPU
• Superscalar
– Not all slots are used all
the time
– Only one process is
executed at a time
• Colors in RAM = different
process running
• White boxes in execution
core = pipeline bubbles
• Processor can issue up to 4 instructions per cycle, but never reaches that rate

Source: https://arstechnica.com/
Chip Multi-Processors (CMP): One Thread per Core
• Add an extra CPU
– Can schedule two independent processes, one on each core
– Use Moore's Law for speedup
• OS schedules processes to CPUs
IBM Power4: Example of Chip Multi-Processors (CMP)
Single Core Multi-threading
• Also called time-slice multi-
threading or super-threading
• Processor executes
instructions from multiple
threads
• Example: processor issues up
to 4 instructions at a time. All
from the same thread
Simultaneous Multi-threading
• Also called hyper-
threading
– Takes super
threading to the
next level
• Super-threading
without restriction
that all instructions
issued come from
the same thread
Pentium 4 Hyper-threading
• First commercial SMT (2-way
SMT)
– Hyperthreading==SMT
• Logical Processors share nearly
all resources of the physical
processor
– Caches, execution units, branch
predictors
• Die area overhead of
hyperthreading ~5%
• When one logical processor is stalled, the other can make progress
• A processor running only one active software thread runs at approximately the same speed with or without hyperthreading
Intel i7 Lineup – 7th Generation
Intel Core Mobile – 11th Generation

Intel® Turbo Boost Technology 2.0 accelerates processor and graphics performance for peak loads,
automatically allowing processor cores to run faster than the rated operating frequency if they're
operating below power, current, and temperature specification limits.
Intel Processors - Hyperthreading
Explicitly Multithreaded Processors
• Evolution to make each processor capable of
executing more than a single thread
– Increases the utilization of expensive resources
– OS enables better resource utilization
• When a thread is stalled due to a cache miss, branch misprediction, or page fault → schedule another thread
• Types of Multithreading
– Chip multi-processors (CMP) : one thread per core
– Fine-grained multi-threading (FGMT)
– Coarse-grained multi-threading (CGMT)
– Simultaneous multi-threading (SMT)
Types of Thread-Level Parallelism
Fine Multithreading
• How can we guarantee no dependencies
between instructions in a pipeline?
– Interleave execution of instructions from different program threads on the same pipeline
• E.g., interleave 4 threads (T1–T4) on a 5-stage pipeline, as sketched below
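Our illustration of the resulting round-robin fetch schedule: each cycle the pipeline fetches from the next thread, so consecutive instructions of any one thread enter 4 cycles apart and most intra-thread hazards never meet in the pipeline.

Cycle:  1   2   3   4   5   6   7   8
Fetch:  T1  T2  T3  T4  T1  T2  T3  T4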
Simple Multithreaded Pipeline
• Have to carry the thread select down the pipeline to ensure correct state bits are read/written at each pipe stage
• Appears to software (including OS) as multiple, but
slower CPUs
Fine-Grained Multithreading
• Advantages
+ No need for dependency checking between instructions
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput and utilization
• Disadvantages
– Extra hardware complexity: multiple hardware contexts, thread
selection logic
– Reduced single-thread performance (one instruction fetched every N cycles)
– Resource contention between threads in caches and memory
– Dependency logic between threads remains
Sun Niagara
• Introduced in 2005
• Fine-grained multithreaded pipeline
Coarse-grained Multithreading

• Idea: When a thread is stalled due to some event, switch to a different hardware context
• Also called switch-on-event multithreading
• Possible stall events:
– Cache misses
– Synchronization events
– FP operations
– Page Fault
Coarse Grained Multithreading

• Four possible states for each thread
– Running
– Ready
– Stalled
– Swapped
• Threads transition between states whenever
– A cache miss is initiated or completed
– The thread-switch logic decides
IBM RS64-IV
• Late-1990s CPU used in servers
• 4-way superscalar, in-order, 5-stage pipeline
• Two hardware contexts
• On an L2 cache miss
– Flush pipeline
– Switch to the other thread
Fine-grained vs. Coarse-grained MT

• Fine-grained advantages
+ Simpler to implement; can eliminate dependency checking and branch prediction completely
+ Switching need not have any performance overhead (coarse-grained requires a pipeline flush)
• Disadvantages
– Low single-thread performance: each thread gets 1/Nth of the bandwidth of the pipeline
Simultaneous Multi-threading (SMT)

• Requires additional hardware resources
– Register file per thread
– Program counter per thread
– Fetch logic
• Operating system view
– If a CPU supports n simultaneous threads → the OS views them as n processors
– OS distributes most time-consuming threads
“fairly” across the n processors that it sees
Multiprocessor Systems
• Definition of multiprocessor systems
– Computers consisting of tightly coupled processors whose coordination and usage are controlled by a single operating system and that share memory through a shared address space
Speedup and Performance (review)
• Performance measured in terms of speedup
– How much faster a program runs on a system
with n processors than with one processor
• Parallelism limited by sharing
– Amdahl’s law:
• Access to shared state must be serialized
• Serial portion limits parallel speedup
Limitations of Speedup
• Interprocessor communication
– When a processor computes a value that is needed by
another program running on another processor
• Synchronization
– Often need to synchronize processors to ensure that
they have all completed some phase before
continuing
• Load Balancing
– Difficult to divide program across processors such that
each processor has the same workload
Superlinear Speedup
• Speedup of an n-processor system > n (the program executes in less than 1/n of the single-processor time)
• Increased cache size
– Each processor has a cache memory → the total amount of cache is greater than on a uniprocessor
• Better program structure
– Programs perform less work when
executed on multiprocessor
systems. E.g., search algorithm is
different (finishes faster)
Multiprocessor Systems Programming Models

• Two major programming models for multiprocessor systems:
– Message passing: communicate through explicit messages (a processor executes explicit SEND and RECEIVE operations)
– Shared memory: the memory system handles inter-processor communication by allowing all processors to see data written by any processor
Type 1: Message-Passing Systems
• Built on physically distributed memory (see the DSM organization on the next slide)
• Each processor has its own address space
• Processors cannot read or write data contained in another processor's address space
• Implemented as distributed-memory machines
Type 1: Distributed Shared Memory (DSM)
• Physically distributed memory
• Supports larger processor counts
Type 2: Shared-Memory Multiprocessors (SMPs)
• Symmetric (shared-memory) multiprocessor: centralized shared-memory processors
• Typically a small number of processors (8 or fewer)
Type 2: Shared-Memory Systems

• Communication is implicit
• Provide a single address space that all
processors can read and write
• When a processor writes a location in the address space, any subsequent reads of that location by any processor see the result of the write
Shared Memory vs. Message Passing

• Shared Memory

• Message passing
Message-Passing vs. Shared-Memory
• Likely that message-passing and shared-memory will co-exist
in the future
• Shared-memory PROS:
+ The computer handles communication → one can write a parallel program without taking care of data communication. However, to achieve good performance, the programmer must consider how data is used by processors to minimize inter-processor communication (requests, invalidates, updates)
+ Attractive for irregular applications → it is difficult to determine what communication is required at the time the program is written
• Shared-memory CONS:
– The programmer's ability to control inter-processor communication is limited
– On many systems, transferring a large block of data between processors is more efficient when done as one communication → not possible in shared-memory systems (the hardware controls the amount of data sent)
Message-Passing vs. Shared-Memory

• Message-Passing Systems PROS:
+ Can achieve greater efficiency by allowing the programmer to control inter-process communication
+ Require fewer synchronization operations because the SEND and RECEIVE operations provide these
+ Attractive for scientific computations because of the regular structure of the applications → easier to determine when data must be communicated
• Message-Passing Systems CONS:
– Require the programmer to explicitly specify all communications
Challenges of Parallel Processing: 1. Parallelism
• Limited parallelism available in programs
• Example: If we want to achieve a speedup of 80x with 100 processors, what is the maximum amount of serial code?

Only 0.25% of the original code can be sequential
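The arithmetic behind the answer, applying Amdahl's law with n = 100 and a target speedup of 80:
80 = 1 / ((1 − p) + p/100) ⇒ (1 − p) + p/100 = 1/80 = 0.0125 ⇒ 1 − 0.99p = 0.0125 ⇒ p ≈ 0.9975
so at most 1 − p ≈ 0.25% of the code can be sequential.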


Challenges of Parallel Processing: 2. Communication
• High communication costs
• Typical latency
– Between cores on the same die: 35 to 50 clock cycles
– Between separate chips: 100 to 500 clock cycles
• Example: An application runs on 32 processors; a reference to remote memory takes 200 ns, during which the processor stalls. The clock speed is 3.3 GHz and the base CPI is 0.5. How much faster is the multiprocessor if there is no communication overhead vs. if 0.2% of the instructions involve a remote communication reference?

Remote reference cost = 200 ns ÷ (1/3.3 GHz ≈ 0.3 ns per cycle) ≈ 666 cycles
Effective CPI = 0.5 + 0.2% × 666 = 0.5 + 1.33 = 1.83
⇒ The system with all local references is 1.83/0.5 ≈ 3.7x faster


Challenges of Parallel Processing: 3. Synchronization of Shared-Memory Threads
• What is the result of A when two threads execute as shown below? (initial value A = 0)
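The slide's code figure is not reproduced here; a minimal sketch of the classic case it refers to, assuming each thread simply increments the shared variable A (POSIX threads, no synchronization):

#include <pthread.h>
#include <stdio.h>

int A = 0;  /* shared, initially 0 */

void *increment(void *arg) {
    /* A = A + 1 is a read-modify-write sequence: the two threads'
       reads and writes can interleave, so the final value of A
       may be 1 or 2. */
    A = A + 1;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("A = %d\n", A);  /* nondeterministic without synchronization */
    return 0;
}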
Addressing these Challenges
1. Challenge 1: Inadequate application parallelism
– New algorithms that offer better parallel performance
2. Challenge 2: Communication overhead
– Use caching
– Improve synchronization
– Latency-hiding techniques
– Memory consistency models for shared memories
3. Challenge 3: Synchronization
– Use of indivisible (atomic) primitives
– Use of mutexes (software)
Mutex: Mutual exclusion locks
• Mutexes are written explicitly by the programmer in the source code
• Prevent multiple threads from simultaneously executing critical sections of code that access shared data
– Mutexes are used to serialize the execution of threads
• All mutexes must be global
• A successful call to lock a mutex via mutex_lock() will cause another thread that is also trying to lock the same mutex to block until the owner thread unlocks it via mutex_unlock()
• Threads within the same process or within other processes can share mutexes
Writing Multi-threaded Programs
• Using POSIX thread libraries
– Standards-based thread API for C/C++
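(The sketches in this section use this API: pthread_create/pthread_join to spawn and wait for threads, and pthread_mutex_lock/pthread_mutex_unlock for locking; programs are compiled with the -pthread flag.)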
Mutexes

• Used to prevent data inconsistencies due to operations by multiple threads upon the same memory (see the sketch below)
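A minimal sketch of the earlier increment example made safe with a mutex, using the POSIX pthread names rather than the mutex_lock()/mutex_unlock() shorthand above; the critical section A = A + 1 is now serialized:

#include <pthread.h>
#include <stdio.h>

int A = 0;                                        /* shared data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* global mutex */

void *increment(void *arg) {
    pthread_mutex_lock(&lock);    /* block until we own the mutex */
    A = A + 1;                    /* critical section */
    pthread_mutex_unlock(&lock);  /* let a waiting thread proceed */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("A = %d\n", A);  /* always 2 now */
    return 0;
}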
Memory Consistency Models
• Complexity in designing shared-memory systems comes from presenting the illusion of a single memory system despite having multiple physical memories
• Implementations:
– Centralized memory
– Distributed memory
• Both have cache memories associated with each processor to reduce memory latency → multiple copies of data will exist in different caches
• Memory consistency model: Defines when memory operations
executed on one processor become visible on other processors
– Strong consistency: memory system acts exactly as if there were
only one memory in the computer
– Relaxed consistency: allows different processors to have different values for some data until the program requests that all memories be made consistent
Strong Consistency (Sequential Consistency)
• Memory system may execute multiple memory operations in parallel
– All parallel operations must generate the same result as if executed on a
system with a single memory system shared by all processors
• PROS:
– Makes the system easier to program (data written by any memory operation becomes immediately visible to all processors)
• CONS:
– Relaxed consistency system can lead to better performance
Strong Consistency Requirements
• Requirement:
– The result of a program must be the same as if the memory operations in the program occurred in the order they appear in the program
• READS: multiple concurrent reads are allowed
• WRITES: multiple writes to an address have to be serialized
Relaxed Consistency

• Allow reads and writes to complete out of order, BUT use synchronization operations to enforce ordering
• Different types of models based on what they relax (read or write)
– Specify ordering by rules, e.g., X → Y (X must complete before Y)
• Sequential consistency requires
– R→W, R→R, W→R, W→W
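For illustration, the classic litmus test for the W→R ordering (the variable and register names are ours): initially x = y = 0, and the two threads run on different processors:

Thread 1: x = 1; r1 = y;
Thread 2: y = 1; r2 = x;

Under sequential consistency (W→R enforced) at least one of the writes must complete before the other thread's read, so (r1, r2) = (0, 0) is impossible. A model that relaxes W→R (e.g., one that lets a write sit in a store buffer while a later read proceeds) does permit (0, 0), and the programmer must insert a synchronization/fence operation to rule it out.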
Strong vs. Relaxed Consistency
• Strong Consistency:
+ Advantage: simple method
– Disadvantage: performance penalty
• Relaxed Consistency:
+ Advantage: significant performance advantage
– Disadvantage: complexity in describing the models
Cache Coherence
• CPUs have caches
between main
memory and the
CPU
• Copies of a single
data item can exist
in multiple caches
• Modification of a
shared data item
by one CPU leads
to outdated copies
in the cache of
another CPU

Ref: D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence
Cache Coherence
• Typical solution
1. Caches keep track of whether a data item is shared between multiple processors
2. Upon modification of a shared data item → other caches are notified
3. The other caches reload the shared data item into their cache
Cache Coherence Protocols
• Cache-coherence protocol of shared-memory system
defines how data may be shared and replicated across
processors
• Define the specific set of rules that are executed to keep
each processor’s view of the memory system consistent
• Snooping Protocols
– Processors snoop the bus to see if any data in their caches has been updated
– Commonly used in centralized-memory machines
• Directory-based Protocols
– Keeps track of what is being shared in centralized location
– Commonly used for distributed memory machines
Shared-memory multiprocessors (SMPs)
• Cache Coherence Protocol
– Cache coherence is easy to implement → each processor can “see” the state of the memory bus, noticing any requests from processors to memory. Called CACHE SNOOPING (cache spying)
Distributed-Memory Multiprocessor
• Directory added to each node to implement cache coherence in a
distributed-memory multiprocessor
• Each directory responsible for tracking the caches that share the memory
addresses of the portion of memory in the node
• The coherence mechanism handles maintenance of: (1) directory information and (2) coherence actions
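A minimal sketch of what a per-block directory entry might hold, assuming a bit-vector of sharers (the type and field names are ours, for illustration):

#include <stdint.h>

/* Per-memory-block directory entry for a machine with up to 64 nodes. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;   /* how the block may currently be cached */
    uint64_t    sharers; /* bit i set => node i's cache holds a copy */
} dir_entry_t;

/* On a write request, the home node consults the entry and sends an
   invalidation to every node whose bit is set in 'sharers'. */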
Cache Coherence Protocols

• Further divided into two categories:
1. Invalidation-based
2. Update-based
• Depending on the application, either method can deliver better performance
Cache Coherence: Invalidation-based
• Multiple processors are allowed to have read-only copies of a cache line if no processor has a writable copy of the line
• Only one processor can have a writable copy of a given line at any time
• When a processor wants to write a line that other processors have copies of → the line is invalidated → the processors holding copies are forced to give them up
Cache Coherence: Update-based
• Allow multiple processors to have writable copies of a line
• When a processor writes a line that other processors have copies of → an update occurs → the new value of the data is transmitted to all sharing processors
A Simple Protocol: MSI
• Modified: the processor is the only one with a copy of the line and has written the line since loading it
• Shared: the processor has a copy of the line and one or more other processors do too → the processor may read the line, but any write attempt requires that the other copies be invalidated
• Invalid: the processor has no copy of the line
Problem with MSI
• A block is in no cache to begin with
• Problem:
– On a read, the block immediately goes to the “Shared” state although it may be the only cached copy (i.e., no other processor will cache it)
• Why is this a problem?
– Suppose the cache that reads the block wants to write to it at some point. It needs to broadcast an “invalidate” even though it has the only cached copy!
• If the cache knew it had the only cached copy in the system, it could have written to the block without notifying any other cache → saves unnecessary broadcasts of invalidations
MESI Protocol
• Commonly used invalidation-based cache-coherence
protocol
• Each line in a processor’s cache is assigned one of four states
to track which caches have copies of the line:
– Modified: the processor is the only one with a copy of the line and has written the line since loading it
– Exclusive: the processor is the only one with a copy of the line and has NOT written the line since loading it
– Shared: the processor has a copy of the line and one or more other processors do too → the processor may read the line, but any write attempt requires that the other copies be invalidated
– Invalid: the processor has no copy of the line
• In Modified and Exclusive, the processor can read/write freely
• The distinction between the Modified and Exclusive states lets the system figure out where the most recent value of the line is stored
MESI Protocol
• State transitions in the MESI protocol
Modified: the processor is the only one with a copy of the line and has written the line since loading it
Exclusive: the processor is the only one with a copy of the line and has NOT written the line since loading it
Shared: the processor has a copy of the line and one or more other processors do too → the processor may read the line, but any write attempt requires that the other copies be invalidated
Invalid: the processor has no copy of the line
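A minimal sketch of these transitions as seen by a single cache on a snooping bus (the function and event names are ours, not from the slides):

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Transition for the cache issuing the access. 'others_have_copy'
   is learned by snooping the bus on a read miss. */
mesi_t on_local_read(mesi_t s, int others_have_copy) {
    if (s == INVALID)                      /* read miss */
        return others_have_copy ? SHARED : EXCLUSIVE;
    return s;                              /* M, E, S: read hits keep state */
}

mesi_t on_local_write(mesi_t s) {
    /* From S or I the cache must first broadcast an invalidate;
       from E it may upgrade silently (the point of the E state). */
    return MODIFIED;
}

/* Transitions when another cache's request is snooped on the bus. */
mesi_t on_remote_read(mesi_t s) {
    /* A cache in M must first write the line back; all holders then share. */
    return (s == INVALID) ? INVALID : SHARED;
}

mesi_t on_remote_write(mesi_t s) {
    return INVALID;                        /* our copy is now stale */
}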
Example
• A four-processor shared-memory system implements
the MESI protocol for cache coherence. For the
following sequence of memory references, show the
state of the line containing the variable a in each
processor’s cache after each reference is resolved. All
processors start out with the line containing a marked Invalid in their caches.
Operations:
Read a (processor 0)
Read a (processor 1)
Read a (processor 2)
Write a (processor 3)
Read a (processor 0)
Solution
Modified: the processor is the only one with a copy of the line and has written the line since loading it
Exclusive: the processor is the only one with a copy of the line and has NOT written the line since loading it
Shared: the processor has a copy of the line and one or more other processors do too → the processor may read the line, but any write attempt requires that the other copies be invalidated
Invalid: the processor has no copy of the line
Operation    Processor 0  Processor 1  Processor 2  Processor 3
P0 reads a   E            I            I            I
P1 reads a   S            S            I            I
P2 reads a   S            S            S            I
P3 writes a  I            I            I            M
P0 reads a   S            I            I            S
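Reading the last row: P0's read miss forces P3, which holds the line in Modified, to write the line back to memory; both P0's and P3's copies then end in the Shared state.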
Method Comparison

• Performance is application specific
• Invalidation-based protocols are generally better on applications with substantial data locality (update-based protocols require a communication each time a shared line is written)
• Update-based protocols can achieve better performance on programs where one processor repeatedly updates a datum that is read by many other processors
Multithreading Categories Summary
Summary
• Multi-processors speedup and limitation
– Synchronization
– Communication
– Load balancing
• Different types of parallelism and CPU architectures to exploit them
• Message passing vs. shared-memory
systems
• Cache coherence protocols
