
1

EE 457 Unit 9c

Thread Level Parallelism


2

Credits
• Some of the material in this presentation is taken from:
– Computer Architecture: A Quantitative Approach
• John Hennessy & David Patterson
• Some of the material in this presentation is derived from
course notes and slides from
– Prof. Michel Dubois (USC)
– Prof. Murali Annavaram (USC)
– Prof. David Patterson (UC Berkeley)
3

BACKGROUND KNOWLEDGE
4

Power
• Power and energy consumption are a MAJOR concern for processors
• Power consumption can be decomposed into:
– Static (P_STAT): Power constantly being dissipated (grows with # of transistors)
– Dynamic (P_DYN): Power consumed for switching a bit (1 to 0)
• P_DYN = I_DYN*V_DD ≈ ½*C_TOT*V_DD²*f
– Recall, I = C dV/dt
– V_DD is the logic '1' voltage, f = clock frequency
• Dynamic power favors parallel processing vs. higher clock rates (see the sketch below)
– V_DD is tied to f, so a reduction/increase in f leads to a similar change in V_DD
– Implies dynamic power is roughly proportional to f³ (a cubic savings in power if we can reduce f)
– Take a core and replicate it 4x => 4x performance and 4x power
– Take a core and increase its clock rate 4x => 4x performance and 64x power
• Static power
– Leakage occurs no matter what the frequency is
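As a rough illustration of the argument above, here is a minimal C sketch of the two scaling options, assuming (as the slide does) that V_DD must scale roughly with f; the capacitance, voltage, and frequency values are illustrative placeholders, not measurements.

```c
/* Sketch of the dynamic power argument: P_DYN ~ (1/2)*C_TOT*VDD^2*f,
 * with the simplifying assumption that VDD scales roughly linearly
 * with f, so dynamic power grows roughly as f^3.
 * Constants are illustrative, not measured values. */
#include <stdio.h>

static double dyn_power(double c_tot, double vdd, double f) {
    return 0.5 * c_tot * vdd * vdd * f;   /* P_DYN = 1/2 * C_TOT * VDD^2 * f */
}

int main(void) {
    double c = 1e-9;          /* total switched capacitance (F), illustrative */
    double v = 1.0, f = 2e9;  /* baseline: 1.0 V at 2 GHz                     */
    double base = dyn_power(c, v, f);

    /* Option A: replicate the core 4x at the same V and f  -> ~4x power */
    double four_cores = 4.0 * dyn_power(c, v, f);

    /* Option B: run one core 4x faster; assume VDD must also rise ~4x -> ~64x power */
    double fast_core  = dyn_power(c, 4.0 * v, 4.0 * f);

    printf("baseline: %.2f W, 4 cores: %.1fx, 4x clock: %.1fx\n",
           base, four_cores / base, fast_core / base);
    return 0;
}
```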
5

Temperature
• Temperature is related to power consumption
– Locations on the chip that burn more power will usually run hotter
• Locations where bits toggle (register file, etc.) often will become quite hot especially if
toggling continues for a long period of time
– Too much heat can destroy a chip
– Can use sensors to dynamically sense temperature
• Techniques for controlling temperature
– External measures: Remove and spread the heat
• Heat sinks, fans, even liquid cooled machines
– Architectural measures
• Throttle performance (run at slower frequencies / lower voltages)
• Global clock gating (pause..turn off the clock)
• None…results can be catastrophic
6

Modeling Interconnect Delay


• In modern circuits wire delay (transmitting the signal) begins
to dominate logic delay (time for gate to switch)
• As wires get longer
– Resistance goes up and Capacitance goes up causing longer time
delays (time is proportional to R*C)

A real wire can be modeled as: an ideal wire, a lumped RC model (overestimates delay), or a distributed RC model (better estimate).
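A small sketch of why the lumped model overestimates delay: splitting the wire into N equal RC segments and summing the Elmore delay of the resulting ladder approaches R*C/2, versus R*C for a single lumped RC. The resistance and capacitance values below are illustrative assumptions, not from a real process.

```c
/* Compare a single lumped RC against the Elmore delay of an
 * N-segment distributed RC ladder for the same total R and C. */
#include <stdio.h>

int main(void) {
    double r_total = 1000.0;   /* total wire resistance (ohms), illustrative  */
    double c_total = 1e-12;    /* total wire capacitance (F), illustrative    */
    int n = 100;               /* number of segments in the distributed model */

    double lumped = r_total * c_total;   /* single lumped RC */

    double elmore = 0.0;
    for (int i = 1; i <= n; i++) {
        /* resistance up to node i times the capacitance hanging at node i */
        elmore += (i * r_total / n) * (c_total / n);
    }

    printf("lumped ~ %.3e s, distributed (Elmore, N=%d) ~ %.3e s\n",
           lumped, n, elmore);   /* distributed is roughly half the lumped value */
    return 0;
}
```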
7

Dealing With Interconnect


• Interconnect delay rivals switching delay
• Important design considerations
– Long wire traces slow a signal down, thus global signals on a
chip require special attention
– Clock, reset, and other signals must be routed carefully and a
whole tree of buffers inserted to decrease the delay
8

A Case for Thread-Level Parallelism

CHIP MULTITHREADING AND MULTIPROCESSORS
9

Consideration
• Consider our out-of-order, pipelined processor
from Tomasulo part 2.
10

Question 1
• Do we have high frequencies?
11

Answer 1
• Do we have high frequencies?
– Yes, deep pipelines create shorter clock
cycles and higher frequencies

–Power ↑ ↑ ↑
–Temp. ↑
12

Question 2
• What effect does our OoO processor have on wire length?
13

Answer 2
• What effect does our OoO processor have on wire length?
– Wire length ↑ ↑
with CDB, ROB,
Issue logic, etc.
– Time ↑
14

Question 3
• What is the impact of short clock cycles
(high freq.) on cache miss penalties?
15

Answer 3
• What is the impact of short clock cycles
(high freq.) on cache miss penalties?
– Cache miss
penalties ↑ ↑
relative to
processor cycles
16

Memory Wall Problem


• Processor performance is increasing much faster than memory
performance

[Figure: processor performance improving ~55%/year vs. memory performance ~7%/year, opening a growing Processor-Memory Performance Gap. Hennessy and Patterson, Computer Architecture: A Quantitative Approach (2003)]

There is a limit to ILP! If a cache miss requires several hundred clock cycles, even OoO pipelines with 10's or 100's of in-flight instructions may stall.
17

Cache Hierarchy
• A hierarchy of caches can help mitigate the cache miss penalty
• L1 Cache
– 64 KB
– 2 cycle access time
– Common Miss Rate ~ 5%
• L2 Cache
– 1 MB
– 10-20 cycle access time
– Common Miss Rate ~ 1%
• Main Memory
– ~300 cycle access time

[Figure: hierarchy of processor P, L1 Cache, L2 Cache, L3 Cache, and main Memory]
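A quick back-of-the-envelope average memory access time (AMAT) using the rough numbers above. Treating the quoted miss rates as local miss rates at each level is an assumption, since the slide does not say.

```c
/* AMAT sketch with the slide's approximate cache parameters. */
#include <stdio.h>

int main(void) {
    double l1_hit = 2.0,  l1_miss = 0.05;   /* 64 KB L1: 2 cycles, ~5% miss    */
    double l2_hit = 15.0, l2_miss = 0.01;   /* 1 MB L2: 10-20 cycles, ~1% miss */
    double mem    = 300.0;                  /* main memory: ~300 cycles        */

    /* AMAT = L1 hit time + L1 miss rate * (L2 time + L2 miss rate * memory time) */
    double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem);

    printf("AMAT ~ %.2f cycles\n", amat);   /* ~2.9 cycles with these numbers */
    return 0;
}
```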
18

The Growing Memory Problem


• In an In-Order pipeline, a cache miss causes computation to
stall (i.e. a memory induced stall)
– Suppose we could improve our processor to achieve a
2x speedup in compute time
– This would only yield a minimal overall speedup due to memory
latency dominating compute
• Out-of-order may do better, but the problem remains
[Figure: single-thread execution alternating compute (C) and memory latency (M) phases; even with a 2x speedup in compute, actual program speedup is minimal because memory latency dominates. Adapted from: OpenSparc T1 Micro-architecture Specification]
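To make the point concrete, here is a tiny Amdahl-style calculation. The 25%/75% compute/memory split is an illustrative assumption, not a number taken from the figure.

```c
/* If memory stall time dominates, doubling compute speed barely helps. */
#include <stdio.h>

int main(void) {
    double compute = 0.25, memory = 0.75;   /* fractions of single-thread run time (assumed) */

    /* Amdahl-style: only the compute portion gets the 2x speedup */
    double speedup = 1.0 / (memory + compute / 2.0);

    printf("2x faster compute => only %.2fx overall speedup\n", speedup);  /* ~1.14x */
    return 0;
}
```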
19

Cache Penalty Example


• Assume 50% of instructions are LW/SW, an L1-D hit
rate of 90%, and miss penalty of 20 clock cycles
(assuming these misses hit in L2). What is the CPI for
our typical 5 stage pipeline?
– 50% * 10% = 5 out of every 100 instructions miss and cause stalls
– The other 95 instructions take 95 cycles to execute
– The 5 missing instructions take 105 = 5*(1+20) cycles to execute
– Total: 200 cycles for 100 instructions =>
CPI of 2

Effective CPI = Ideal CPI + Misses per Instruction * Miss Penalty Cycles
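The same calculation in code, writing the misses per instruction explicitly as the memory-reference fraction times the L1-D miss rate:

```c
/* Effective CPI = Ideal CPI + (mem refs per instr) * (miss rate) * (miss penalty) */
#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.0;    /* base 5-stage pipeline, no memory stalls */
    double mem_fraction = 0.50;   /* 50% of instructions are LW/SW           */
    double miss_rate    = 0.10;   /* L1-D hit rate of 90%                    */
    double miss_penalty = 20.0;   /* cycles, assuming misses hit in L2       */

    double cpi = ideal_cpi + mem_fraction * miss_rate * miss_penalty;

    printf("Effective CPI = %.1f\n", cpi);   /* 1 + 0.5*0.1*20 = 2.0 */
    return 0;
}
```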


20

Question 4
• But in OoO processors, can't we just
deepen our ROB, Issue queues, Store
Address Buffer, etc to hide cache misses?
21

Answer 4
• But in OoO processors, can't we just
deepen our ROB, Issue queues, Store
Address Buffer, etc to hide cache misses?
– Associative lookup
structures are expensive
and slow down
dramatically as they
deepen
– DOES NOT SCALE WELL
22

Motivating HW Multithread/Multicore
• Issues that prevent us from exploiting ILP in more
advanced single-core processors with deeper
pipelines and OoO Execution
– Slow memory hierarchy
– Increased power with higher clock rates
– Increased wire delay & size with more advanced structures
(ROBs, Issue queues, etc.) for potentially diminishing
returns
• All of these issues point us to find "easier" sources of
parallelism such as: TLP (Thread-Level Parallelism)
23

OVERVIEW OF TLP
24

What is a Thread?
• Thread (def.): Single execution sequence (instruction
stream) representing a separately schedulable task
– Schedulable task: Can be transparently paused and resumed
by the OS scheduler
• Consider the processor:
– For what resources would each thread need their own copy to execute in parallel?

[Figure: two threads (Thread 1 at PC 0x04a800, Thread 2 at 0x04001c), each a separate instruction stream of lw/sw/add/sub/or instructions, running on one CPU with registers $1, $3, ..., $31 and pc, and sharing one memory (0x0 to 0xffffffff) containing separate T1 and T2 stacks]


25

Separate or Shared?
• Consider the processor, for what resources
would each thread need their own copy?
For each resource, is it 1 Per Thread or Shared among all?
– Program Counter
– ALUs
– Register File
– Page Table Base Register
– Cache Memory
26

Shared vs. Private


27

Software Multithreading
• Used since the 1960's on uniprocessors to hide I/O latency
– Multiple processes with different virtual address spaces and process control blocks
– On an I/O operation, state is saved and another process is given to the CPU
– When the I/O operation completes, the process is rescheduled
• On a context switch…
– Trap processor and flush pipeline
– Save state in process control block (PC, register file, interrupt vector, page table base register)
– Restore state of another process
– Start execution and fill pipeline
• Context switch is also triggered by a timer for fairness
• Very high overhead! (1-10 µs)

[Figure: CPU (Regs, PC) with the OS scheduler tracking T1 = Ready, T2 = Blocked, T3 = Ready; each process keeps its saved state (Regs, PC, metadata) in its process control block]
28

Multicore vs. Multithreaded


• Multicore/Multiprocessor: Single chip containing
multiple processor cores that possess all the logic
resources necessary to execute one or more threads at
a time
– Require software/OS to context switch from one thread to
another and do not share hardware resources between
threads.
• Hardware Multithreading: A processor core that has
hardware support for executing multiple threads and
context switching between them without software
intervention
29

Typical Multicore (CMP) Organization


• Can simply replicate the entire processor core to create a chip multi-processor (CMP)
• Private L1's and L2's require maintaining coherency via snooping; sharing L1 is not a good idea
• L3 is shared (1 copy of data) and thus does not require a coherency mechanism
• A shared bus would be a bottleneck, so use a switched on-chip network (multiple simultaneous connections)

[Figure: chip multi-processor with four cores (P), each with private L1 and L2, connected through an on-chip interconnect network to banked, shared L3 and main memory]
30

Case for Multithreading


• Consider long latency events:
– Cache Miss, Exceptions, Lock (Synchronization), Long instructions such as
MUL/DIV
– Such events cause In-order and even OoO pipelines to be underutilized
• Goal/Idea: Swap to the next thread immediately (on next cycle)
when the current thread hits a long-latency event (i.e. cache miss)
– By executing multiple threads, processors can be kept busy with useful work
[Figure: four threads interleaved on one core; while one thread waits on memory latency (M), the others run their compute (C) phases, keeping the core busy. Adapted from: OpenSparc T1 Micro-architecture Specification]
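A simple model of the figure: if each thread computes for C cycles and then stalls for M cycles, the core stays busy when the other threads' compute covers the stall, i.e. roughly (N-1)*C >= M. The C and M values below are illustrative assumptions.

```c
/* How many threads are needed to hide memory latency in the C/M model? */
#include <stdio.h>

int main(void) {
    double c = 25.0;    /* compute cycles between misses (illustrative) */
    double m = 100.0;   /* memory latency in cycles (illustrative)      */

    double n = 1.0 + m / c;        /* threads needed to hide the latency */
    double util_1 = c / (c + m);   /* single-thread core utilization     */

    printf("1 thread: %.0f%% busy; need ~%.0f threads to stay ~100%% busy\n",
           util_1 * 100.0, n);     /* ~20% busy; ~5 threads */
    return 0;
}
```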
31

IMPLEMENTING HARDWARE MULTITHREADING
32

MT Needs Non-Blocking Caches


• Non-blocking cache: Does not block/pause on a miss but is able
to service hits while fetching one or more miss requests
– Needed to support multithreading
– Example: Pentium Pro has a non-blocking cache capable of handling 4
outstanding misses
[Figure: timelines of several threads on a non-blocking cache; while one thread's miss is being fetched, the cache continues to service hits (and further misses) for the other threads' compute (C) and memory (M) phases]
33

Hardware Multithreading
• Run multiple threads on the same core with
hardware support for fast context switch
– Multiple register files
– Multiple state registers (PCs, page table base
registers, interrupt vectors, etc.)
– Avoids saving context manually (via software)
34

Sun T1 "Niagara" Case Study

Ex. of Fine-grained
Multithreading

http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf
35

Sparc T1 Niagara
• 8 cores each executing 4 threads called a thread group
– Zero cycle thread switching penalty (round-robin)
– 6 stage pipeline
• Each core has its own L1 cache
• Each thread has its own
– Register file, instruction and store buffers
• Threads share…
– L1 cache, TLB, and execution units
• 3 MB shared L2 Cache, 4-banks, 12-way set-associative
– Is it a problem that it's not a power of 2? No!

Pipeline stages: Fetch, Thread Select, Decode, Execute, Memory, WB
36

Sun T1 "Niagara" Pipeline

[Figure: Sun T1 pipeline block diagram]
http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf
37

T1 Pipeline
• Thread select stage [Stage 2]
– Choose instructions to issue from ready threads
– Issues based on
• Instruction type
• Misses
• Resource conflicts
• Traps and interrupts
• Fetch stage [Stage 1]
– Thread select mux chooses which thread's instruction to
issue and uses that thread's PC to fetch more instructions
– Access I-TLB and I-Cache
– 2 instructions fetched per cycle
38

T1 Pipeline
• Decode stage [Stage 3]
– Accesses register file
• Execute Stage [Stage 4]
– Includes ALU, shifter, MUL and DIV units
– Forwarding Unit
• Memory stage [Stage 5]
– DTLB, Data Cache, and 4 store buffers (1 per thread)
• WB [Stage 6]
– Write to register file
39

Pipeline Scheduling
• No pipeline flush on context switch (except
potentially of instructions from faulting thread)
• Full forwarding/bypassing to consuming, junior
instructions of same thread
• In the case of a load, wait 2 cycles before an instruction from the same thread is issued
– Solves the forwarding latency issue
• Scheduler guarantees fairness between threads by
prioritizing the least recently scheduled thread
40

A View Without HW Multithreading


[Figure: issue slots over time for a single-threaded superscalar with software MT; only instructions from a single thread fill the slots, separated by expensive cache miss penalties and expensive software context switches]
41

Types/Levels of Multithreading
• How should we overlap and share the HW between instructions from different threads?
– Coarse-grained Multithreading: Execute one thread with all HW resources until a cache miss or misprediction incurs a stall or pipeline flush, then switch to another thread
– Fine-grained Multithreading: Alternate fetching
instructions from a different thread each clock
– Simultaneous Multithreading: Fetch and execute
instructions from different threads at the same time
42

Levels of TLP
[Figure: issue slots over time for four approaches]
• Superscalar: only instructions from a single thread; expensive cache miss penalty
• Coarse-grained MT: switch threads when one hits a long-latency event such as a stall due to a cache miss, pipeline flush, etc.
• Fine-Grained MT: alternate threads every cycle (Sun UltraSparc T2)
• Simultaneous Multithreading (SMT): mix instructions from different threads during the same issue cycle (Intel HyperThreading, IBM Power 5)
43

Fine Grained Multithreading


• Like Sun Niagara
• Alternates issuing instructions from different threads
each cycle provided a thread has instructions ready
to execute (i.e. not stalled)
• With enough threads, long latency events may be
completely hidden
– Some processors like Cray may have 128 or more threads
• Degrades single thread performance since it only
gets 1 out of every N cycles if all N threads are ready
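A minimal sketch of a fine-grained, least-recently-issued thread-select policy in the spirit of Niagara's round-robin selection. This is an illustrative model, not the actual T1 selection logic.

```c
/* Each cycle, pick the least recently issued thread that is ready
 * (not waiting on a miss); ready threads share the core round-robin. */
#include <stdio.h>
#include <stdbool.h>

#define NTHREADS 4

int main(void) {
    bool ready[NTHREADS]       = { true, true, false, true };  /* thread 2 stalled on a miss   */
    int  last_issued[NTHREADS] = { 0 };                        /* cycle each thread last issued */

    for (int cycle = 1; cycle <= 8; cycle++) {
        int pick = -1;
        for (int t = 0; t < NTHREADS; t++) {   /* least recently issued, ready thread */
            if (ready[t] && (pick < 0 || last_issued[t] < last_issued[pick]))
                pick = t;
        }
        if (pick >= 0) {
            last_issued[pick] = cycle;
            printf("cycle %d: issue from thread %d\n", cycle, pick);
        } else {
            printf("cycle %d: all threads stalled\n", cycle);
        }
    }
    return 0;
}
```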
44

Coarse Grained Multithreading


• Swaps threads on long-latency event
• Hardware does not have to swap threads in a single
cycle (as in fine-grained multithreading) but can take
a few cycles since the current thread has hit a long
latency event
• Requires flushing pipeline of current thread's
instructions and filling pipeline with new thread's
• Better single-thread performance
45

ILP and TLP


• TLP can also help ILP by providing another
source of independent instructions
• In a 3- or 4-way issue processor, better
utilization can be achieved when instructions
from 2 or more threads are executed
simultaneously
46

Simultaneous Multithreading
• Uses multiple-issue, dynamic scheduling mechanisms
to execute instructions from multiple threads at the
same time by filling issue slots with as many available
instructions from either thread
– Overcome poor utilization due to cache misses or lack of
independent instructions
– Requires HW to tag instructions based on their thread
• Requires greater level of hardware resources
(separate register renamer, branch prediction, store
buffers, and multiple register files, etc.)
47

2-Way SMT Updated Block Diagram


[Figure: updated OoO processor block diagram for 2-way hardware SMT. Per-thread (duplicated) structures: Instruction Queue 1/2, Register Rename 1/2, Register File 1/2, ROB1/ROB2, branch prediction buffers BPB1/BPB2, store buffers SB1/SB2, store address buffers SAB1/SAB2. Shared: I-Cache, D-Cache, Dispatch, Issue Unit, the Int./Mult./Div./Load-Store queues, execution units, and the CDB]

Dispatch can tag instructions with thread ID to separate instructions in the backend.
48

Example
• Intel HyperThreading Technology (HTT) is
essentially SMT
• Recent processors including Core i7 are multi-
core, multi-threaded, multi-issue, OoO
(dynamically scheduled) superscalar
processors
49

Future of Multicore/Multithreaded
• Multiple cores in shared memory configuration
• Per-core L1 or even L2
• Large on-chip shared cache
• Multiple threads on each core to fight memory wall
• Ever increasing on-chip threads
– To continue to meet Moore's Law
– CMP's with 1000's of threads envisioned
– Only sane option from technology perspective (i.e. out of
necessity)
– The big road block is parallel programming
50

Parallel Programming
• Implicit parallelism via…
– Parallelizing compilers
– Programming frameworks (e.g. MapReduce)
• Explicit parallelism
– OpenMP
– Task Libraries
• Intel Thread Building Blocks, Java Task Library
– Native threading (Windows threads, POSIX threads)
– MPI
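As a small taste of the explicit-parallelism style listed above, here is a minimal OpenMP sketch (compile with, e.g., gcc -fopenmp); the loop and its bounds are just an illustrative workload, not part of the slides.

```c
/* Spread a loop across threads and combine per-thread partial sums. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* Each thread sums a chunk of iterations; the reduction clause
     * combines the per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += 1.0 / (i + 1.0);
    }

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}
```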
51

BACKUP
52

Organization for OoO Execution


[Figure: block diagram of the OoO organization (adapted from Prof. Michel Dubois, simplified for EE 457). I-Cache and the Instruction Queue feed Dispatch, which uses the TAG FIFO, Register Status Table, and Register File; instructions wait in the Int./Mult./Div./Load-Store queues until the Issue Unit sends them to the Integer/Branch, Mul, Div, and D-Cache units, with results broadcast on the CDB]
53

2-Way SMT Updated Block Diagram


[Figure: updated OoO processor block diagram for 2-way hardware SMT (same diagram as in the main slides). Per-thread structures: Instruction Queue 1/2, Register Rename 1/2, Register File 1/2, ROB1/ROB2, BPB1/BPB2, SB1/SB2, SAB1/SAB2. Shared: I-Cache, D-Cache, Dispatch, Issue Unit, the Int./Mult./Div./Load-Store queues, execution units, and the CDB]

Dispatch can tag instructions with thread ID to separate instructions in the backend.
54

Multiple Functional Units


• We now provide multiple functional units
• After decode, issue to a queue, stalling if the unit is busy or
waiting for data dependency to resolve

[Figure: pipeline with multiple functional units; after IM and Reg read, instructions are issued into queues feeding the ALU, MUL, DIV, and DM (cache) units, then write back to Reg]
55

Functional Unit Latencies


[Figure: execution pipelines. Int. ALU / Addr. Calc.: single EX stage; FP Add: A1-A4; Int. & FP MUL: M1-M7; Int. & FP DIV]

An added complication of out-of-order execution & completion: WAW & WAR hazards.

Look ahead: the Tomasulo Algorithm will help absorb the latency of different functional units and cache miss latency by allowing other ready instructions to proceed out of order.

Functional Unit | Latency (required stall cycles between dependent [RAW] instrucs.) | Initiation Interval (distance between 2 independent instructions requiring the same FU)
Integer ALU | 0 | 1
FP Add | 3 | 1
FP Mul. | 6 | 1
FP Div. | 24 | 25
56

OoO Execution w/ ROB


• ROB allows for OoO execution but in-order completion

[Figure: OoO block diagram extended with a ROB (Reorder Buffer) next to the Register File, plus a Branch Prediction Buffer and Address Buffer; I-Cache and the Instruction Queue feed Dispatch, the per-unit queues feed the Issue Unit and the Integer/Branch, Mul, Div, and D-Cache (L/S) units, and results broadcast on the CDB. Exceptions? No problem, since completion is in order]
57
58
59

Updated Pipeline
[Figure: execution pipelines. Int. ALU / Addr. Calc.: single EX stage; FP Add: A1-A4; Int. & FP MUL: M1-M7; Int. & FP DIV]

Functional Unit | Latency (required stall cycles between dependent [RAW] instrucs.) | Initiation Interval (distance between 2 independent instructions requiring the same FU)
Integer ALU | 0 | 1
FP Add | 3 | 1
FP Mul. | 6 | 1
FP Div. | 24 | 25
60

Updated Pipeline
[Figure: full pipeline. PC, I-Cache, and Reg. File feed the parallel execution pipelines (EX; A1-A4; M1-M7; DIV), followed by the MEM stage and write-back to the Reg. File]

Functional Unit | Latency (required stall cycles between dependent [RAW] instrucs.) | Initiation Interval (distance between 2 independent instructions requiring the same FU)
Integer ALU | 0 | 1
FP Add | 3 | 1
FP Mul. | 6 | 1
FP Div. | 24 | 25
