
EEDG/CE/CS 6304 Computer Architecture

Lecture 12 – Thread-Level
Parallelism

Benjamin Carrion Schaefer


Associate Professor
Department of Electrical and Computer Engineering
Course Overview
• Fundamentals of Design and Analysis of
Computers (2 lectures)
– History, technological breakthroughs, etc.
– Trends and metrics: performance,
power/energy, cost
• CPU (7 Lectures)
– Instruction Set Architecture
– Arithmetic for Computers (new)
– Instruction Level Parallelism (ILP)
– Dynamic instruction scheduling
– Branch prediction
– Thread-level parallelism
– Modern processors
• Memories (4 Lectures)
– Memory hierarchy
– Caches
– Secondary storage
– Virtual memory
• Buses (1 lecture)
• New computer structures: Heterogeneous
computing (1 lecture)
Objectives
Upon completion of this chapter, you will be able to:

• Identify different types of parallelism and how these are exploited in modern computer systems.
• Understand different types of multiprocessor
systems
– Message passing vs. shared-memory systems
• Understand cache coherence and cache coherence protocols
– Invalidation based
• MESI
– Update based
Ref: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th Edition, Morgan Kaufmann, Chapter 5
Shen & Lipasti, Modern Processor Design, Waveland Press, Chapter 11
Types of Parallelism in Applications
1. Instruction-Level Parallelism (ILP)
– Multiple instructions from the same program executed concurrently
• Superscalar: Managed by hardware
• VLIW: Managed by compiler
– Limited by data and control dependencies
2. Task-level Parallelism (TLP)
– Several different tasks executed on the same data
– E.g., on a database, compute the average age and the max age
3. Data-level Parallelism (DLP)
– Distributes data across different nodes, which operate on the data in parallel
– E.g., matrix multiplication by dividing the matrices into smaller blocks
4. Transaction/Thread-level Parallelism
– Multiple processes/threads from different programs executed
concurrently
How to Benefit from Parallelism?

Amdahl's law: S(n) = 1 / ((1 − p) + p/n), where p is the fraction of the program that can be parallelized and n is the number of processors
Terminology
• Process:
– A program in execution
– Unit of work within the system. Program is a passive
entity; process is an active entity.
– Program instance loaded into memory by the OS
– Every process has its own address space organized in
code, data, stack segments
– Each process consists of at least one thread: main
thread of execution
• Thread:
– Is the smallest sequence of programmed instructions
that can be managed independently by a scheduler,
which is typically a part of the operating system
– Is a component of a process (lightweight process)
– Multiple threads can execute concurrently and share
resources (e.g., memory)
Cook Analogy

• You want to prepare food for several banquets. Each requires many dinners
• Two positions to fill:
– Boss (control). Gets all the ingredients and tells
the chef what to do.
– The Chef (datapath). Does all the cooking
• ILP Analogy
– One ultra-talented boss with many hands
– One ultra-talented chef with many hands
Need for new Acceleration Techniques

• Diminishing returns from exploiting:
– Pipelining
– ILP
• When many instructions depend on the previous one → pipeline hazards → bubbles
• Need to exploit all types of parallelism to
further improve performance
Performance Beyond Single-Thread ILP
• Some applications have large amounts of parallelism (e.g., databases, scientific applications)
• Data-Level Parallelism: Perform identical operations on data
– 1 kitchen, 1 boss, many chefs
• Transaction-level Parallelism: process with its own instructions and data
– A thread may be a subpart of a parallel program (thread) or an independent program (process)
– Each thread has all the state necessary to allow it to execute (instructions, data, PC, registers)
– Many kitchens, each with its own boss and chef
Single-Threaded CPU
• Superscalar
– Not all slots are used all
the time
– Only one process is
executed at a time
• Colors in RAM = different
process running
• White boxes in execution
core = pipeline bubbles
• Processor can issue up to 4 instructions per cycle, but never reaches that rate

Source: https://arstechnica.com/
Chip Multi-Processors (CMP): One Thread per Core
• Add an extra CPU
– Can schedule two independent processes, one on each core
– Use Moore's Law for speedup
• OS schedules processes to CPUs
IBM Power4: Example of Chip Multi-Processors (CMP)
Single Core Multi-threading
• Also called time-slice multi-
threading or super-threading
• Processor executes
instructions from multiple
threads
• Example: processor issues up
to 4 instructions at a time. All
from the same thread
Simultaneous Multi-threading
• Also called hyper-
threading
– Takes super
threading to the
next level
• Super-threading
without restriction
that all instructions
issued come from
the same thread
Pentium 4 Hyper-threading
• First commercial SMT (2-way
SMT)
– Hyperthreading==SMT
• Logical Processors share nearly
all resources of the physical
processor
– Caches, execution units, branch
predictors
• Die area overhead of
hyperthreading ~5%
• When one logical processor is stalled, the other can make progress
• A processor running only one active software thread runs at approximately the same speed with or without hyperthreading
Intel i7 Lineup – 7th Generation
Intel Core Mobile – 11th Generation

Intel® Turbo Boost Technology 2.0 accelerates processor and graphics performance for peak loads,
automatically allowing processor cores to run faster than the rated operating frequency if they're
operating below power, current, and temperature specification limits.
Intel Processors - Hyperthreading
Explicitly Multithreaded Processors
• Evolution to make each processor capable of
executing more than a single thread
– Increases the utilization of expensive resources
– OS enables better resource utilization
• When a thread is stalled due to a cache miss, branch misprediction, or page fault → schedule another thread
• Types of Multithreading
– Chip multi-processors (CMP) : one thread per core
– Fine-grained multi-threading (FGMT)
– Coarse-grained multi-threading (CGMT)
– Simultaneous multi-threading (SMT)
Types of Thread-Level Parallelism
Fine Multithreading
• How can we guarantee no dependencies
between instructions in a pipeline?
– Interleave execution of instructions from different program threads on the same pipeline
• E.g., interleave 4 threads (T1–T4) on a 5-stage pipeline, as sketched below
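Our illustration of the resulting round-robin fetch schedule: each cycle the pipeline fetches from the next thread, so consecutive instructions of any one thread enter 4 cycles apart and most intra-thread hazards never meet in the pipeline.

Cycle:  1   2   3   4   5   6   7   8
Fetch:  T1  T2  T3  T4  T1  T2  T3  T4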
Simple Multithreaded Pipeline
• Have to carry the thread select down the pipeline to ensure correct state bits are read/written at each pipe stage
• Appears to software (including OS) as multiple, but
slower CPUs
Fine-Grained Multithreading
• Advantages
+ No need for dependency checking between instructions
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput and utilization
• Disadvantages
– Extra hardware complexity: multiple hardware contexts, thread
selection logic
– Reduced single-thread performance (one instruction fetched every N cycles)
– Resource contention between threads in caches and memory
– Dependency logic between threads remains
Sun Niagara
• Introduced in 2005
• Fine-grained multithreaded pipeline
Coarse-grained Multithreading

• Idea: When a thread is stalled due to some event, switch to a different hardware context
• Also called switch-on-event multithreading
• Possible stall events:
– Cache misses
– Synchronization events
– FP operations
– Page Fault
Coarse Grained Multithreading

• Four possible states for each thread
– Running
– Ready
– Stalled
– Swapped
• Threads transition between states whenever
– A cache miss is initiated or completed
– The thread-switch logic decides
IBM RS64-IV
• Late-1990s CPU used in servers
• 4-way superscalar, in-order, 5-stage pipeline
• Two hardware contexts
• On an L2 cache miss
– Flush pipeline
– Switch to the other thread
Fine-grained vs. Coarse-grained MT

• Fine-grained advantages
+ Simpler to implement; can eliminate dependency checking and branch prediction completely
+ Switching need not have any performance overhead (coarse-grained requires a pipeline flush)
• Disadvantages
– Low single-thread performance: each thread gets 1/Nth of the bandwidth of the pipeline
Simultaneous Multi-threading (SMT)

• Requires additional hardware resources
– Register file per thread
– Program counter per thread
– Fetch logic
• Operating system view
– If a CPU supports n simultaneous threads → the OS views them as n processors
– OS distributes most time-consuming threads
“fairly” across the n processors that it sees
Multiprocessor Systems
• Definition of multiprocessor systems
– Computers consisting of tightly coupled processors whose coordination and usage are controlled by a single operating system and that share memory through a shared address space
Speedup and Performance (review)
• Performance measured in terms of speedup
– How much faster a program runs on a system
with n processors than with one processor
• Parallelism limited by sharing
– Amdahl’s law:
• Access to shared state must be serialized
• Serial portion limits parallel speedup
Limitations of Speedup
• Interprocessor communication
– When a processor computes a value that is needed by
another program running on another processor
• Synchronization
– Often need to synchronize processors to ensure that
they have all completed some phase before
continuing
• Load Balancing
– Difficult to divide program across processors such that
each processor has the same workload
Superlinear Speedup
• Speedup of an n-processor system > n (the program executes in less than 1/n of the single-processor time)
• Increased cache size
– Each processor has a cache memory → the total amount of cache is greater than on a uniprocessor
• Better program structure
– Programs perform less work when
executed on multiprocessor
systems. E.g., search algorithm is
different (finishes faster)
Multiprocessor Systems Programming Models

• Two major programming models for multiprocessor systems:
– Message passing: communicate through explicit messages (a processor executes explicit SEND and RECEIVE operations)
– Shared memory: the memory system handles inter-processor communication by allowing all processors to see data written by any processor
Type 1: Message-Passing Systems
• Built on physically distributed memory (see the DSM organization on the next slide)
• Each processor has its own address space
• Processors cannot read or write data contained in another processor's address space
• Implemented as distributed-memory machines
Type 1: Distributed Shared Memory (DSM)
• Physically distributed memory
• Supports larger processor counts
Type 2: Shared-Memory Multiprocessors (SMPs)
• Symmetric (shared-memory) multiprocessor: centralized shared-memory processors
• Typically a small number of processors (8 or fewer)
Type 2: Shared-Memory Systems

• Communication is implicit
• Provide a single address space that all
processors can read and write
• When a processor writes a location in the address space, any subsequent reads of that location by any processor see the result of the write
Shared Memory vs. Message Passing

• Shared Memory

• Message passing
Message-Passing vs. Shared-Memory
• Likely that message-passing and shared-memory will co-exist
in the future
• Shared-memory PROS:
+ The computer handles communication → one can write a parallel program without taking care of data communication. However, to achieve good performance, the programmer must consider how data is used by processors to minimize inter-processor communication (requests, invalidates, updates)
+ Attractive for irregular applications → it is difficult to determine what communication is required at the time the program is written
• Shared-memory CONS:
– The programmer's ability to control inter-processor communication is limited
– On many systems, transferring a large block of data between processors is more efficient when done as one communication → not possible in shared-memory systems (the hardware controls the amount of data sent)
Message-Passing vs. Shared-Memory

• Message-Passing Systems PROS:
+ Can achieve greater efficiency by allowing the programmer to control inter-process communication
+ Require fewer synchronization operations because the SEND and RECEIVE operations provide these
+ Attractive for scientific computations because of the regular structure of the applications → easier to determine when data must be communicated
• Message-Passing Systems CONS:
– Require the programmer to explicitly specify all communications
Challenges of Parallel Processing: 1. Parallelism
• Limited parallelism available in programs
• Example: If we want to achieve a speedup of 80x with 100 processors, what is the maximum amount of serial code?

Only 0.25% of the original code can be sequential
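The arithmetic behind the answer, applying Amdahl's law with n = 100 and a target speedup of 80:
80 = 1 / ((1 − p) + p/100) ⇒ (1 − p) + p/100 = 1/80 = 0.0125 ⇒ 1 − 0.99p = 0.0125 ⇒ p ≈ 0.9975
so at most 1 − p ≈ 0.25% of the code can be sequential.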


Challenges of Parallel Processing: 2. Communication
• High communication costs
• Typical latency
– Between cores on the same die: 35 to 50 clock cycles
– Between separate chips: 100 to 500 clock cycles
• Example: An application runs on 32 processors; a reference to remote memory takes 200 ns, during which the processor stalls. The clock speed is 3.3 GHz and the base CPI is 0.5. How much faster is the multiprocessor if there is no communication overhead vs. if 0.2% of the instructions involve a remote communication reference?

Remote reference cost = 200 ns ÷ (1/3.3 GHz ≈ 0.3 ns per cycle) ≈ 666 cycles
Effective CPI = 0.5 + 0.2% × 666 = 0.5 + 1.33 = 1.83
⇒ The system with all local references is 1.83/0.5 ≈ 3.7x faster


Challenges of Parallel Processing: 3. Synchronization of Shared-Memory Threads
• What is the result of A when two threads execute as shown below? (initial value A = 0)
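The slide's code figure is not reproduced here; a minimal sketch of the classic case it refers to, assuming each thread simply increments the shared variable A (POSIX threads, no synchronization):

#include <pthread.h>
#include <stdio.h>

int A = 0;  /* shared, initially 0 */

void *increment(void *arg) {
    /* A = A + 1 is a read-modify-write sequence: the two threads'
       reads and writes can interleave, so the final value of A
       may be 1 or 2. */
    A = A + 1;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("A = %d\n", A);  /* nondeterministic without synchronization */
    return 0;
}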
Addressing these Challenges
1. Challenge 1: Inadequate application parallelism
– New algorithms that offer better parallel performance
2. Challenge 2: Communication overhead
– Use caching
– Improve synchronization
– Latency-hiding techniques
– Memory consistency models for shared memories
3. Challenge 3: Synchronization
– Use of indivisible (atomic) primitives
– Use of mutexes (software)
Mutex: Mutual exclusion locks
• Mutexes are written explicitly by the programmer in the source code
• Prevent multiple threads from simultaneously executing critical sections of code that access shared data
– Mutexes are used to serialize the execution of threads
• All mutexes must be global
• A successful call to lock a mutex via mutex_lock() will cause another thread that is also trying to lock the same mutex to block until the owner thread unlocks it via mutex_unlock()
• Threads within the same process or within other processes can share mutexes
Writing Multi-threaded Programs
• Using POSIX thread libraries
– Standards-based thread API for C/C++
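(The sketches in this section use this API: pthread_create/pthread_join to spawn and wait for threads, and pthread_mutex_lock/pthread_mutex_unlock for locking; programs are compiled with the -pthread flag.)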
Mutexes

• Used to prevent data inconsistencies due to operations by multiple threads upon the same memory (see the sketch below)
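A minimal sketch of the earlier increment example made safe with a mutex, using the POSIX pthread names rather than the mutex_lock()/mutex_unlock() shorthand above; the critical section A = A + 1 is now serialized:

#include <pthread.h>
#include <stdio.h>

int A = 0;                                        /* shared data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* global mutex */

void *increment(void *arg) {
    pthread_mutex_lock(&lock);    /* block until we own the mutex */
    A = A + 1;                    /* critical section */
    pthread_mutex_unlock(&lock);  /* let a waiting thread proceed */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("A = %d\n", A);  /* always 2 now */
    return 0;
}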
Memory Consistency Models
• Complexity in designing shared-memory systems comes from presenting the illusion of a single memory system despite having multiple physical memories
• Implementations:
– Centralized memory
– Distributed memory
• Both have cache memories associated with each processor to reduce memory latency → multiple copies of data will exist in different caches
• Memory consistency model: Defines when memory operations
executed on one processor become visible on other processors
– Strong consistency: memory system acts exactly as if there were
only one memory in the computer
– Relaxed consistency: allows different processors to have different values for some data until the program requests that all memories be made consistent
Strong Consistency (Sequential Consistency)
• Memory system may execute multiple memory operations in parallel
– All parallel operations must generate the same result as if executed on a
system with a single memory system shared by all processors
• PROS:
– Makes the system easier to program (data written by any memory operation becomes immediately visible to all processors)
• CONS:
– Relaxed consistency system can lead to better performance
Strong Consistency Requirements
• Requirement:
– The result of a program must be the same as if the memory operations in the program occurred in the order they appear in the program
• READS: multiple concurrent reads are allowed
• WRITES: multiple writes to an address have to be serialized
Relaxed Consistency

• Allow reads and writes to complete out of order, BUT use synchronization operations to enforce ordering
• Different types of models based on what they relax (read or write)
– Specify ordering by rules, e.g., X → Y (X must complete before Y)
• Sequential consistency requires
– R→W, R→R, W→R, W→W
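For illustration, the classic litmus test for the W→R ordering (the variable and register names are ours): initially x = y = 0, and the two threads run on different processors:

Thread 1: x = 1; r1 = y;
Thread 2: y = 1; r2 = x;

Under sequential consistency (W→R enforced) at least one of the writes must complete before the other thread's read, so (r1, r2) = (0, 0) is impossible. A model that relaxes W→R (e.g., one that lets a write sit in a store buffer while a later read proceeds) does permit (0, 0), and the programmer must insert a synchronization/fence operation to rule it out.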
Strong vs. Relaxed Consistency
• Strong Consistency:
+ Advantage: simple method
– Disadvantage: performance penalty
• Relaxed Consistency:
+ Advantage: significant performance advantage
– Disadvantage: complexity in describing the models
Cache Coherence
• CPUs have caches
between main
memory and the
CPU
• Copies of a single
data item can exist
in multiple caches
• Modification of a
shared data item
by one CPU leads
to outdated copies
in the cache of
another CPU

Ref: D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence
Cache Coherence
• Typical solution
1. Caches keep track of whether a data item is shared between multiple processors
2. Upon modification of a shared data item → other caches are notified
3. The other caches reload the shared data item into their cache
Cache Coherence Protocols
• Cache-coherence protocol of shared-memory system
defines how data may be shared and replicated across
processors
• Define the specific set of rules that are executed to keep
each processor’s view of the memory system consistent
• Snooping Protocols
– Processors snoop the bus to see if any data in their caches has been updated
– Commonly used in centralized-memory machines
• Directory-based Protocols
– Keeps track of what is being shared in centralized location
– Commonly used for distributed memory machines
Shared-memory multiprocessors (SMPs)
• Cache Coherence Protocol
– Cache coherence is easy to implement → each processor can “see” the state of the memory bus, noticing any requests from processors to memory. Called CACHE SNOOPING (cache spying)
Distributed-Memory Multiprocessor
• Directory added to each node to implement cache coherence in a
distributed-memory multiprocessor
• Each directory responsible for tracking the caches that share the memory
addresses of the portion of memory in the node
• The coherence mechanism handles maintenance of: (1) directory information and (2) coherence actions
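A minimal sketch of what a per-block directory entry might hold, assuming a bit-vector of sharers (the type and field names are ours, for illustration):

#include <stdint.h>

/* Per-memory-block directory entry for a machine with up to 64 nodes. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;   /* how the block may currently be cached */
    uint64_t    sharers; /* bit i set => node i's cache holds a copy */
} dir_entry_t;

/* On a write request, the home node consults the entry and sends an
   invalidation to every node whose bit is set in 'sharers'. */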
Cache Coherence Protocols

• Further divided into two categories:
1. Invalidation-based
2. Update-based
• Depending on the application, either method can deliver better performance
Cache Coherence: Invalidation-based
• Multiple processors are allowed to have read-only copies of a cache line if no processor has a writable copy of the line
• Only one processor can have a writable copy of a given line at any time
• When a processor wants to write a line that other processors have copies of → the line is invalidated → the processors holding copies are forced to give them up
Cache Coherence: Update-based
• Allow multiple processors to have writable copies of a line
• When a processor writes a line that other processors have copies of → an update occurs → the new value of the data is transmitted to all sharing processors
A Simple Protocol: MSI
• Modified: the processor is the only one with a copy of the line and has written the line since loading it
• Shared: the processor has a copy of the line and one or more other processors do too → the processor may read the line, but any write attempt requires that the other copies be invalidated
• Invalid: the processor has no copy of the line
Problem with MSI
• A block is in no cache to begin with
• Problem:
– On a read, the block immediately goes to the “Shared” state although it may be the only cached copy (i.e., no other processor will cache it)
• Why is this a problem?
– Suppose the cache that reads the block wants to write to it at some point. It needs to broadcast an “invalidate” even though it has the only cached copy!
• If the cache knew it had the only cached copy in the system, it could have written to the block without notifying any other cache → saves unnecessary broadcasts of invalidations
MESI Protocol
• Commonly used invalidation-based cache-coherence
protocol
• Each line in a processor’s cache is assigned one of four states
to track which caches have copies of the line:
– Modified: the processor is the only one with a copy of the line and has written the line since loading it
– Exclusive: the processor is the only one with a copy of the line and has NOT written the line since loading it
– Shared: the processor has a copy of the line and one or more other processors do too → the processor may read the line, but any write attempt requires that the other copies be invalidated
– Invalid: the processor has no copy of the line
• In Modified and Exclusive, the processor can read/write freely
• The distinction between the Modified and Exclusive states lets the system figure out where the most recent value of the line is stored
MESI Protocol
• State transitions in the MESI protocol
Modified: the processor is the only one with a copy of the line and has written the line since loading it
Exclusive: the processor is the only one with a copy of the line and has NOT written the line since loading it
Shared: the processor has a copy of the line and one or more other processors do too → the processor may read the line, but any write attempt requires that the other copies be invalidated
Invalid: the processor has no copy of the line
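A minimal sketch of these transitions as seen by a single cache on a snooping bus (the function and event names are ours, not from the slides):

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Transition for the cache issuing the access. 'others_have_copy'
   is learned by snooping the bus on a read miss. */
mesi_t on_local_read(mesi_t s, int others_have_copy) {
    if (s == INVALID)                      /* read miss */
        return others_have_copy ? SHARED : EXCLUSIVE;
    return s;                              /* M, E, S: read hits keep state */
}

mesi_t on_local_write(mesi_t s) {
    /* From S or I the cache must first broadcast an invalidate;
       from E it may upgrade silently (the point of the E state). */
    return MODIFIED;
}

/* Transitions when another cache's request is snooped on the bus. */
mesi_t on_remote_read(mesi_t s) {
    /* A cache in M must first write the line back; all holders then share. */
    return (s == INVALID) ? INVALID : SHARED;
}

mesi_t on_remote_write(mesi_t s) {
    return INVALID;                        /* our copy is now stale */
}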
Example
• A four-processor shared-memory system implements
the MESI protocol for cache coherence. For the
following sequence of memory references, show the
state of the line containing the variable a in each
processor’s cache after each reference is resolved. All
processors start out with the line containing a marked Invalid in their caches.
Operations:
Read a (processor 0)
Read a (processor 1)
Read a (processor 2)
Write a (processor 3)
Read a (processor 0)
Solution
Modified: the processor is the only one with a copy of the line and has written the line since loading it
Exclusive: the processor is the only one with a copy of the line and has NOT written the line since loading it
Shared: the processor has a copy of the line and one or more other processors do too → the processor may read the line, but any write attempt requires that the other copies be invalidated
Invalid: the processor has no copy of the line
Operation    Processor 0  Processor 1  Processor 2  Processor 3
P0 reads a   E            I            I            I
P1 reads a   S            S            I            I
P2 reads a   S            S            S            I
P3 writes a  I            I            I            M
P0 reads a   S            I            I            S
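Reading the last row: P0's read miss forces P3, which holds the line in Modified, to write the line back to memory; both P0's and P3's copies then end in the Shared state.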
Method Comparison

• Performance is application specific
• Invalidation-based protocols are generally better on applications with substantial data locality (update-based protocols require a communication each time a shared line is written)
• Update-based protocols can achieve better performance on programs where one processor repeatedly updates a datum that is read by many other processors
Multithreading Categories Summary
Summary
• Multi-processors speedup and limitation
– Synchronization
– Communication
– Load balancing
• Different types of parallelism and CPU architectures to exploit them
• Message passing vs. shared-memory
systems
• Cache coherence protocols
