Lecture 03: Architecture & Memory
CS3210 Parallel Computing
[Figure: the parallel computing stack — an application poses a problem; the problem exposes parallelism and is decomposed into tasks; tasks are computed by hardware, whose cores and processors provide physical parallelism.]
Computer architecture
Key concepts about how modern computers work
Concerns in parallel execution
Challenges of accessing memory
Understanding these architecture basics helps you:
Understand and optimize the performance of your parallel programs
Gain intuition about which workloads might benefit from fast parallel machines
Outline
Processor Architecture and Technology Trends
Various forms of parallelism
Flynn’s Parallel Architecture Taxonomy
Architecture of Multicore Processors
Memory Organization
Distributed-memory Systems
Shared-memory Systems
Hybrid (Distributed-Shared Memory) Systems
Concurrency vs. Parallelism
Concurrency:
Two or more tasks can start, run, and complete in overlapping time periods
They might not be running (executing on CPU) at the same instant
Two or more execution flows make progress at the same time, by interleaving their executions or by executing instructions (on CPU) at exactly the same time

Parallelism:
Two or more tasks can run (execute) simultaneously, at the exact same time
Tasks do not only make progress, they actually execute simultaneously
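A minimal sketch to make the distinction concrete (pthreads; the task bodies are placeholders, not from the slides). The two threads are concurrent by construction; whether they also run in parallel depends on whether the OS schedules them on different cores at the same instant.

```c
#include <pthread.h>
#include <stdio.h>

void *task(void *name) {
    // Each thread is an independent execution flow. On one core the OS
    // interleaves them (concurrency); on two or more cores they can
    // execute at exactly the same time (parallelism).
    printf("task %s making progress\n", (char *)name);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task, "A");
    pthread_create(&t2, NULL, task, "B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```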
Source of Processor Performance Gain
Parallelism in various forms is the main source of performance gain
Processor level: multiple processors, organized with shared memory or distributed memory
Bit Level Parallelism
Parallelism via increasing the processor word size
Word size may mean:
Unit of transfer between processor and memory
Memory address space capacity
Integer size
Single precision floating point number size
Instruction Level Parallelism: Pipelining
Split instruction execution into multiple stages, e.g.
Fetch (IF), Decode (ID), Execute (EX), Write-Back (WB)
Allows multiple instructions to occupy different stages in the same clock cycle
Provided there are no data / control dependencies
Number of pipeline stages == maximum achievable speedup
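A quick worked justification of that claim (idealized, ignoring hazards): with $k$ stages of equal length, $n$ instructions finish in $k + (n-1)$ cycles when pipelined, versus $nk$ cycles when not, so

$$\text{speedup} = \frac{nk}{k + (n-1)} \longrightarrow k \quad \text{as } n \to \infty$$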
Pipelined Execution: Illustration
[Figure: timing diagrams. Non-pipelined: each instruction completes IF, ID, EX, WB before the next begins. Pipelined: successive instructions of the program flow overlap, each one stage behind the previous.]
Disadvantages / complications: instructions must be independent; bubbles; hazards in data and control flow (e.g. read-after-write); speculation; out-of-order execution
The end of ILP
Instruction Level Parallelism: Superscalar
Duplicate the pipelines:
Allow multiple instructions to pass through the same stage
Scheduling is challenging (deciding which instructions can be executed together; see the sketch after this list):
Dynamic (Hardware decision)
Static (Compiler decision)
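A small illustration of what the scheduler (hardware or compiler) must detect, with hypothetical variables:

```c
// Sketch: which instructions can issue together on a 2-wide superscalar?
int schedule_example(int b, int c, int e, int f) {
    int a = b + c;  // independent of the next line:
    int d = e + f;  // both additions can issue in the same cycle
    return a + d;   // depends on a and d (read-after-write),
                    // so it must wait for both to complete
}
```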
Superscalar Execution: Illustration
[Figure: timing diagram of 2-wide superscalar execution — two instructions from the program flow enter IF, ID, EX, WB together in each clock cycle.]
Disadvantage: structural hazards (instructions competing for the same hardware resource)
Pipelined Processor:
Determines what instruction to run next
Execution unit: performs the operation described by an instruction
Registers: store values of variables used as inputs and outputs to operations

Superscalar Processor:
Automatically finds independent instructions in an instruction sequence and can execute them in parallel on multiple execution units
Instructions come from the same execution flow (thread)
Instruction Level Parallelism: SIMD
Superscalar: decodes and executes multiple (e.g. two) instructions per clock
Issue: typical programs do not have enough ILP to keep the extra units busy, even though their math operations form long sequences
SIMD: add ALUs to increase compute capability
Single instruction, multiple data: the same instruction is broadcast to and executed by all ALUs
Thread Level Parallelism: Motivation
Instruction-level parallelism is limited
For typical programs, only 2-3 instructions can be executed in parallel (whether pipelined or superscalar)
Due to data / control dependencies
Superscalar vs. SMT (2 hardware threads)

Superscalar:
Registers store values of variables used as inputs and outputs to operations, all from one execution flow

SMT (2 hardware threads):
Can run one scalar instruction per clock from either of the hardware threads
Appears as two logical cores
Thread Level Parallelism in the Processor
Processor can provide hardware support for multiple "thread contexts"
Known as simultaneous multithreading (SMT)
Information specific to each thread, e.g. Program Counter, Registers,
etc
Software threads can then execute in parallel
Many implementation approaches
Example:
Intel processors with hyper-threading technology, e.g. each i7 core
can execute 2 threads at the same time
Processor Level Parallelism (Multiprocessing)
Add more cores to the processor
The application should have multiple execution flows
Each process/thread has an independent context that can be
mapped to multiple processor cores
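A minimal sketch of such multiple execution flows (pthreads; the array size, thread count, and function names are illustrative, not from the slides): each thread carries its own context and sums a private slice of an array, so the OS can map the threads onto different cores.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];

void *sum_slice(void *arg) {
    long id = (long)arg;               // per-thread context
    long lo = id * (N / NTHREADS);
    long hi = lo + (N / NTHREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)     // independent work per thread,
        s += data[i];                  // mappable to separate cores
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;                 // placeholder input
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, sum_slice, (void *)i);
    double total = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("sum = %f\n", total);       // expect 1000000.0
    return 0;
}
```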
Flynn's Parallel Architecture Taxonomy
One commonly used taxonomy of parallel architecture:
Based on the parallelism of instructions and data streams in the most
constrained component of the processor
Proposed by M. Flynn in 1972(!)
Instruction stream:
A single execution flow
i.e. a single Program Counter (PC)
Data stream:
Data being manipulated by the instruction stream
Single Instruction Single Data (SISD)
A single instruction stream is executed
Each instruction works on a single data item
Most uniprocessors fall into this category
[Figure: a single instruction stream driving one Processing Unit over a single data stream.]
Single Instruction Multiple Data (SIMD)
A single stream of instructions
For example: one program counter
Each instruction works on multiple data
Popular model for supercomputers during the 1980s:
Exploits data parallelism; such machines are commonly known as vector processors
Modern processors have some form of SIMD:
E.g. the SSE, AVX instructions in Intel x86 processors
[Figure: one instruction shared by many PUs, with multiple data elements being processed in lockstep.]
SIMD nowadays
Data parallel architectures
AVX instructions
GPGPUs
Same instruction broadcast to all ALUs
AVX: intrinsic functions operate on vectors of four 64-bit values (e.g., a vector of 4 doubles)
Not great for divergent execution
Original program
Processes one array element using scalar instructions on
scalar registers (e.g., 32-bit floats)
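The slide's own code is not reproduced here; a minimal sketch of the kind of scalar loop it describes (function and array names are illustrative):

```c
// Scalar version: one 32-bit float is processed per loop iteration,
// using ordinary scalar instructions and scalar registers.
void scale_scalar(int N, const float *x, float *y) {
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i];   // one scalar multiply per element
}
```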
Vector program (using AVX intrinsics)
Intrinsic functions operate on vectors of
eight 32-bit values (e.g., vector of floats)
Processes eight array elements
simultaneously using vector instructions on
256-bit vector registers
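A sketch of the corresponding AVX loop (again not the slide's own code; this assumes N is a multiple of 8 and uses intrinsics from <immintrin.h>):

```c
#include <immintrin.h>

// AVX version: eight 32-bit floats are processed per iteration,
// using 256-bit vector registers.
void scale_avx(int N, const float *x, float *y) {
    __m256 two = _mm256_set1_ps(2.0f);       // broadcast 2.0 to all 8 lanes
    for (int i = 0; i < N; i += 8) {         // assumes N % 8 == 0
        __m256 v = _mm256_loadu_ps(&x[i]);   // load 8 floats at once
        v = _mm256_mul_ps(v, two);           // 8 multiplies in one instruction
        _mm256_storeu_ps(&y[i], v);          // store 8 results
    }
}
```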
Multiple Instruction Single Data (MISD)
Multiple instruction streams
All instruction streams work on the same data at any time
No actual implementation except for the systolic array
Multiple Instruction Multiple Data (MIMD)
Each PU fetches its own instructions
Each PU operates on its own data
Currently the most popular model for multiprocessors
Variant – SIMD + MIMD
Stream processors (NVIDIA GPUs)
A set of threads executing the same code (effectively SIMD)
Multiple sets of threads executing in parallel (effectively MIMD at this level)
MULTICORE ARCHITECTURE
Architecture of Multicore Processors
Hierarchical design
Pipelined design
Network-based design
Hierarchical Design
Multiple cores share multiple caches
Usages:
Standard desktop processors
Server processors
Graphics processing units
[Figure: multiple cores sharing a hierarchy of L1 and L2 caches.]
Hierarchical Design - Examples
Each core is a sophisticated, out-of-order processor designed to maximize ILP
Pipelined Design
Data elements are processed by multiple execution cores in a
pipelined way
Example: Pipelined Design
[Figure: an example pipelined multicore, where data elements flow through a chain of cores, each performing one processing step.]
Future Trends
Efficient on-chip interconnection
Enough bandwidth for data transfers between the cores
Scalable
Robust to tolerate failures
Efficient energy management
Reduce memory access time
Keyword: Network on Chip (NoC)
MEMORY ORGANIZATION
Parallel Computer Component
Typical uniprocessor components:
Processor with a core
One or more levels of caches
Memory module
Others (e.g. I/O)
[Figure: a uniprocessor, showing a core and its cache inside the processor, attached to a memory module.]
Processors in a parallel computer system are also commonly known as processing elements
Recap: why do modern processors have cache?
Processors run efficiently when data is resident in caches
Caches reduce memory access latency
Caches also provide high-bandwidth data transfer to the CPU
Recap: memory latency and bandwidth
Memory latency: the amount of time for a memory request
(e.g., load, store) from a processor to be serviced by the
memory system
Example: 100 cycles, 100 nsec
Memory bandwidth: the rate at which the memory system can
provide data to a processor
Example: 20 GB/s
Processor “stalls” when it cannot run the next instruction in an
instruction stream because of a dependency on a previous
instruction
Execution on a Processor (one add per clock)
[Figure: execution timeline in which the processor completes one add per clock.]
In Modern Computing, Bandwidth is the Critical Resource
Performant parallel programs should:
Organize computation to fetch data from memory less often
Reuse data previously loaded by the same thread (temporal locality
optimizations)
Share data across threads (inter-thread cooperation)
Favor performing additional arithmetic over storing and reloading values (the math is “free” relative to the memory traffic)
Main point: programs must access memory infrequently to utilize
modern processors efficiently
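A back-of-envelope illustration (the numbers are illustrative, not from the lecture): an element-wise multiply c[i] = a[i] * b[i] on 4-byte floats moves 12 bytes per math operation (two loads and one store). With 20 GB/s of memory bandwidth, throughput is capped at

$$\frac{20\ \text{GB/s}}{12\ \text{bytes/op}} \approx 1.7 \times 10^{9}\ \text{ops/s}$$

which is far below what even a single modern core can compute, so the loop is bandwidth-bound unless data is reused from cache.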
Memory Organization of Parallel Computers
[Figure: classification tree, dividing parallel computers into distributed-memory, shared-memory, and hybrid (distributed-shared memory) systems.]
Distributed-Memory Systems
[Figure: several nodes, each with its own processor, cache, local memory, and I/O, connected by an interconnection network.]
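Because each node can address only its own memory, data moves between nodes by explicit message passing. A minimal sketch using MPI (not from the slides; rank 0 sends one value to rank 1):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 0.0;
    if (rank == 0) {
        value = 3.14;   // exists only in rank 0's local memory
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // explicit receive: the only way rank 1 can observe rank 0's data
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }
    MPI_Finalize();
    return 0;
}
```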
Cache Coherence
Multiple copies of the same data can exist in different caches
Local update by one processing unit: other PUs should not keep seeing the unchanged (stale) data
[Figure: three PUs with private caches above a shared memory holding u = 5.
1. PU1 reads u (5) into its cache.
2. PU3 reads u (5) into its cache.
3. PU3 writes u = 7 in its own cache.
4. PU1 reads u again and still sees the stale value 5.
5. PU2 reads u and may see 5 or 7, depending on when memory is updated.]
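To see why this matters to software, consider one thread publishing a flag that another thread polls; hardware cache coherence is what lets the polling thread observe the write to its cached copy (a minimal sketch with C11 atomics, which also keep the compiler from hoisting the flag into a register):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int ready = 0;
int payload = 0;

void *producer(void *arg) {
    payload = 42;                        // write the data...
    atomic_store(&ready, 1);             // ...then publish the flag; the
    return NULL;                         // coherence protocol propagates
}                                        // the update to other caches

void *consumer(void *arg) {
    while (atomic_load(&ready) == 0)     // spins on a (cached) copy of the
        ;                                // flag; coherence invalidates it
    printf("payload = %d\n", payload);   // when the producer writes
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```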
Further Classification – Shared Memory
Two factors can further differentiate shared memory systems:
Processor-to-memory delay (UMA / NUMA): whether the delay to memory is uniform
Cache coherence support (e.g. ccNUMA, below)
Uniform Memory Access (Time) (UMA)
[Figure: UMA organization for a multiprocessor, in which all PUs access one shared memory with uniform latency.]
Example: Multicore NUMA
[Figure: multicore NUMA, where a multicore processor P contains cores 1, 2, ..., n sharing its locally attached memory; memory attached to other processors is reached with longer delay.]
ccNUMA
[Figure: ccNUMA, with several processors each holding local cache and memory, kept cache-coherent across the interconnect.]
COMA
[Figure: COMA, with several processors whose local memories act only as caches, connected by an interconnection network.]
Summary: Shared Memory Systems
Advantages:
No need to partition code or data
No need to physically move data among processors; communication is efficient
Disadvantages:
Special synchronization constructs are required
Lack of scalability due to contention
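On the synchronization point: concurrent updates to shared data need constructs such as locks. A minimal sketch (pthreads; the counter and loop bound are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;                          // shared data: no partitioning needed
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);         // synchronization construct guards
        counter++;                         // the update; it is also a point of
        pthread_mutex_unlock(&lock);       // contention, which limits scalability
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);    // expect 200000
    return 0;
}
```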
Summary
Goal of parallel architecture is to reduce the average time to
execute an instruction
Reading
CS149 Parallel Computing class at Stanford, by Kayvon Fatahalian and Kunle Olukotun
https://www.brendangregg.com/blog/2023-03-01/computer-performance-future-2022.html