EE6304 Lecture 12 – Thread-Level Parallelism
Amdahl's law: $S(n) = \dfrac{1}{(1-p) + \frac{p}{n}}$, where $p$ is the fraction of the program that can be parallelized and $n$ is the number of processors.
Terminology
• Process:
– A program in execution
– Unit of work within the system. Program is a passive
entity; process is an active entity.
– Program instance loaded into memory by the OS
– Every process has its own address space organized in
code, data, stack segments
– Each process consists of at least one thread: main
thread of execution
• Thread:
– Is the smallest sequence of programmed instructions
that can be managed independently by a scheduler,
which is typically a part of the operating system
– Is a component of a process (lightweight process)
– Multiple threads can execute concurrently and share
resources (e.g., memory), as sketched below
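A minimal sketch in C (POSIX threads) of one process whose two threads share the process's single address space; the counter variable and thread count are illustrative, not from the slides:

```c
#include <pthread.h>
#include <stdio.h>

/* Shared by all threads: lives in the process's single address space. */
static int shared_counter = 0;

static void *worker(void *arg) {
    (void)arg;
    shared_counter++;   /* both threads touch the same variable (unsynchronized!) */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;   /* two threads inside one process */
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", shared_counter);
    return 0;
}
```

The unsynchronized increment is deliberately racy; the mutex sketch later in the lecture shows how to make it safe.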
Cook Analogy
https://arstechnica.com/
Chip Multi-Processors (CMP): One Thread per Core
Intel® Turbo Boost Technology 2.0 accelerates processor and graphics performance for peak loads,
automatically allowing processor cores to run faster than the rated operating frequency if they're
operating below power, current, and temperature specification limits.
Intel Processors - Hyperthreading
Explicitly Multithreaded Processors
• Evolution to make each processor capable of
executing more than a single thread
– Increases the utilization of expensive resources
– OS enables better resource utilization
• When a thread stalls due to a cache miss, branch
misprediction, or page fault → schedule another thread
• Types of Multithreading
– Chip multi-processors (CMP) : one thread per core
– Fine-grained multi-threading (FGMT)
– Coarse-grained multi-threading (CGMT)
– Simultaneous multi-threading (SMT)
Types of Thread-Level Parallelism
Fine Multithreading
• How can we guarantee no dependencies
between instructions in a pipeline?
– Interleave execution of instructions from different
program threads on the same pipeline
• E.g., interleave 4 threads (T1-T4) on a 5-stage
pipeline, as sketched below
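A minimal C sketch of the fetch-stage policy, assuming simple round-robin selection among four hardware thread contexts (the cycle count and thread names are illustrative):

```c
#include <stdio.h>

#define NUM_THREADS 4

int main(void) {
    /* Fine-grained MT: each cycle the fetch stage picks the next thread
       in round-robin order, so a given thread issues only once every
       NUM_THREADS cycles and its previous instruction is already deep
       in the pipeline when the next one enters. */
    for (int cycle = 0; cycle < 8; cycle++) {
        int thread = cycle % NUM_THREADS;   /* thread-select logic */
        printf("cycle %d: fetch from T%d\n", cycle, thread + 1);
    }
    return 0;
}
```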
Simple Multithreaded Pipeline
• Thread select must be carried down the pipeline to ensure
the correct state bits are read/written at each pipe stage
• Appears to software (including OS) as multiple, but
slower CPUs
Fine-Grained Multithreading
• Advantages
+ No need for dependency checking between instructions
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput and utilization
• Disadvantages
– Extra hardware complexity: multiple hardware contexts, thread
selection logic
– Reduced single thread performance (one instruction fetched
every N cycles)
– Resource contention between threads in caches and memory
– Dependency logic between threads remains
Sun Niagara
• Released in 2005 (UltraSPARC T1)
• Fine-grained multithreaded pipeline
Coarse-Grained Multithreading
• Fine-grained advantages
+ Simpler to implement; can eliminate dependency
checking and branch prediction completely
+ Thread switching need not have any performance
overhead (coarse-grained switching requires a pipeline
flush, as sketched below)
• Disadvantages
– Low single-thread performance: each thread gets
1/Nth of the bandwidth of the pipeline
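A minimal C sketch contrasting the coarse-grained policy: one thread runs until a (simulated) long-latency cache miss, then the pipeline is flushed and another thread is switched in. The miss cycles and flush penalty are illustrative numbers, not from the slides:

```c
#include <stdio.h>

#define FLUSH_PENALTY 3   /* cycles lost draining the pipeline on a switch */

int main(void) {
    int thread = 0, cycle = 0;
    int miss_at[2] = {4, 9};   /* simulated cache-miss cycle for each thread */
    /* Coarse-grained MT: run one thread until it misses in the cache,
       then flush the pipeline and switch to the other thread. */
    while (cycle < 12) {
        if (cycle == miss_at[thread]) {
            printf("cycle %d: T%d misses -> flush (%d cycles), switch\n",
                   cycle, thread + 1, FLUSH_PENALTY);
            cycle += FLUSH_PENALTY;
            thread ^= 1;
            continue;
        }
        printf("cycle %d: execute T%d\n", cycle, thread + 1);
        cycle++;
    }
    return 0;
}
```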
Simultaneous Multi-threading (SMT)
• Instructions from multiple threads can issue in the same
cycle, sharing the pipeline's functional units (the approach
used by Intel Hyper-Threading)
Shared Memory
• Communication is implicit
• Provides a single address space that all
processors can read and write
• When a processor writes a location in the address
space, any subsequent read of that location
by any processor sees the result of the write, as
sketched below
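A minimal C11 sketch of implicit communication through shared memory: one thread writes a data word and sets a flag; the other simply reads the same addresses, with no explicit send or receive. The variable names and the release/acquire flag protocol are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;          /* ordinary shared location */
static atomic_int ready = 0;  /* flag signalling the write is complete */

static void *producer(void *arg) {
    (void)arg;
    data = 42;                /* write into the shared address space */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                     /* spin until the write becomes visible */
    printf("read %d\n", data);  /* communication happened implicitly */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```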
Shared Memory vs. Message Passing
• Shared Memory
• Message passing
Message-Passing vs. Shared-Memory
• Likely that message-passing and shared-memory will co-exist
in the future
• Shared-memory PROS:
+ Computer handles communication → a parallel program can be
written without explicit data communication. However, to achieve
good performance, the programmer must consider how data is used by
processors to minimize inter-processor communication (requests,
invalidates, updates)
+ Attractive for irregular applications → it is difficult to determine what
communication is required at the time a program is written
• Shared-memory CONS:
– Programmer's ability to control inter-processor communication is
limited
– On many systems, transferring a large block of data between
processors is more efficiently done as one communication → not
possible in shared-memory systems (hardware controls the amount of
data sent)
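For contrast, a minimal message-passing sketch in C using a POSIX pipe between two processes: the sender must explicitly transmit the data, and the programmer controls exactly how much is sent per message (the payload is illustrative):

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd[2];
    pipe(fd);                            /* explicit communication channel */
    if (fork() == 0) {                   /* child process: receiver */
        int msg;
        close(fd[1]);
        read(fd[0], &msg, sizeof msg);   /* explicit receive */
        printf("received %d\n", msg);
        return 0;
    }
    int msg = 42;                        /* parent process: sender */
    close(fd[0]);
    write(fd[1], &msg, sizeof msg);      /* explicit send */
    close(fd[1]);
    wait(NULL);
    return 0;
}
```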
Message-Passing vs. Shared-Memory
• Locks are used to prevent data inconsistencies due to
operations by multiple threads on the same memory,
as sketched below
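A minimal sketch of such protection with a pthread mutex, reusing the shared-counter pattern from the terminology slide (the names and iteration count are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

static int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread updates at a time */
        counter++;                    /* read-modify-write is now atomic */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);   /* always 200000 with the lock */
    return 0;
}
```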
Memory Consistency Models
• Complexity in designing shared-memory systems comes from
presenting the illusion that there is a single memory system
despite having multiple physical memories
• Implementations:
– Centralized memory
– Distributed memory
• Both have cache memories associated with each processor to
reduce memory latency → multiple copies of data will exist in
different caches
• Memory consistency model: Defines when memory operations
executed on one processor become visible on other processors
– Strong consistency: memory system acts exactly as if there were
only one memory in the computer
– Relaxed consistency: Allows different processors to have different
values for some data until program requests all memories be made
consistent.
Strong Consistency (Sequential Consistency)
• Memory system may execute multiple memory operations in parallel
– All parallel operations must generate the same result as if executed on a
system with a single memory system shared by all processors
• PROS:
– Makes the system easier to program (data written by any memory
operation becomes immediately visible to all processors)
• CONS:
– Relaxed consistency system can lead to better performance
Strong Consistency Requirements
• Requirement:
– The result of a program must be the same as if the
memory operations in the program occurred in
the order they appear in the program
• READS: multiple reads allowed
• WRITES: multiple writes to an address have to
be serialized
Relaxed Consistency
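A minimal C11 litmus-test sketch of the distinction: with memory_order_seq_cst (sequential consistency) the outcome r1 == 0 && r2 == 0 is forbidden, while swapping in memory_order_relaxed permits it. The variable names are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x = 0, y = 0;
static int r1, r2;

/* Swap memory_order_seq_cst for memory_order_relaxed below to allow
   the "both reads see 0" outcome that relaxed models permit. */
static void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return NULL;
}

static void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2);  /* seq_cst forbids r1==0 && r2==0 */
    return 0;
}
```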