COMP4300/8300 L2-3: Classical Parallel Hardware (2023)
W1: Hardware Overview
● so we talk the same language
● many things happen in parallel even on a single processor
● identify potential issues that (explicitly) parallel hardware can overcome
● why should we use 2 CPUs instead of doubling the speed of one?

● floating point operations (flops): add, mult, division (and maybe sqrt!)
● integer and logical operations (and, or, etc.)
● instruction processing (fetch, decoding, etc.)
● our primary focus will be on flops (as required by most scientific applications)
● main performance metric: flops/sec, or just FLOPS

Multiple Processor Design
● Flynn's taxonomy of parallel computers (SIMD vs MIMD)
● message-passing versus shared-address space programming
● UMA versus NUMA shared-memory computers
● dynamic/static connectivity
● evaluating static networks
● case study: the NCI Gadi supercomputer

The processor clock orchestrates its operation:
● all ops take a fixed number of clock ticks to complete (latency)
● clock speed is measured in GHz (10^9 cycles/second) or nsec (10^-9 seconds)
■ Apple iPhone 6 ARM A8 1.4 GHz (0.71 ns), NCI Gadi Intel Xeon Cascade Lake 3.2 GHz (0.31 ns), IBM zEC12 5.5 GHz (0.18 ns); see the conversion below
● clock speed limited by: transistor speed, speed of light, energy consumption, etc.
■ (to our knowledge) the IBM zEC12 is the fastest commodity processor at 5.5 GHz
■ in one 3.2 GHz clock cycle (0.31 ns) light travels only about 9 cm, and a chip is a few cm across!
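As a quick check on the cycle times quoted above (my own arithmetic, not on the slides): the clock period is the reciprocal of the clock rate, τ = 1/f, so 1/(1.4 × 10^9 Hz) ≈ 0.71 ns, 1/(3.2 × 10^9 Hz) ≈ 0.31 ns and 1/(5.5 × 10^9 Hz) ≈ 0.18 ns.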
Illustrating pipelining with an example: Adding Numbers

[Figure: elements X(6) ... X(1) flowing through a 4-stage pipeline, from Waiting through steps 1-4 to Done]

● X(1) takes 4 clock ticks to appear (startup latency); X(2) appears 1 tick after X(1)
● asymptotically achieves 1 result per tick
● the operation (X) is said to be pipelined: the steps in the pipeline are running in parallel
● requires the same op applied consecutively to different (independent) data items
■ good for “vector operations” (note the limitations on chaining output data to input)

Instruction Pipelining (Single Instruction Issue)

● break instructions into k stages that are overlapped in time
● e.g. (k = 5) stages: FI = Fetch Instruction, DI = Decode Instruction, FO = Fetch Operand, EX = Execute Instruction, WB = Write Back

[Figure: a branch instruction followed by speculatively issued ("guess") instructions moving through the FI DI FO EX WB stages one cycle apart, until the branch outcome is known ("sure")]

● Ideally, one gets k-way asymptotic parallelism (speedup; quantified below)
● However, it is hard to maximize utilization in practice:
■ constrained by dependencies among instructions; the CPU must ensure the result is the same as if there were no pipelining!
■ the FO & WB stages may involve memory accesses (and may possibly stall the pipeline)
■ conditional branch instructions are problematic: a wrong guess may require flushing succeeding instructions from the pipeline and rolling back
● tendency to increase the # of stages (especially acute from the 1990s to the 2000s); examples of #stages: UltraSPARC II (9) and III (14), Intel Prescott (31)
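A quick way to quantify these speedup claims (my own worked formula, consistent with the startup-latency and one-result-per-tick observations above): with k pipeline stages and n independent operations,

  T_pipelined = (k + n - 1) ticks    versus    T_serial = k * n ticks
  speedup = k*n / (k + n - 1), which tends to k as n grows

For the k = 4, n = 6 adding-numbers example the speedup is 24/9 ≈ 2.7; for n = 1000 it is ≈ 3.99, close to the ideal 4.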
● Simple idea: increase the execution rate by using w ≥ 2 (i.e., multiple) pipelines
● w (mutually independent) instructions are issued in parallel at each cycle, when possible
● Ideally this offers kw-way parallelism (recall k is the number of pipeline stages)
● However, a number of extra challenges arise:
■ increased complexity: the HW has to resolve dependencies at runtime before issuing several instructions simultaneously
■ some of the functional units might be shared by the pipelines (aka resource dependencies)
■ as a result, instructions to be issued together must have an appropriate ‘instruction mix’, e.g. UltraSPARC (w = 4): ≤ 2 different floating point, ≤ 2 integer/logical, ≤ 1 load/store, ≤ 1 branch
● Some remedies: pipeline feedback, branch prediction + speculative execution, out-of-order execution, compilers (e.g., VLIW processors)

Consider the DAXPY computation (a C sketch follows this slide):

  y(i) = y(i) + 1.234 ∗ x(i)

If at its peak the CPU can perform 8 flops/cycle (4 fused mult-adds):
● the memory system must load 8 doubles (x(i) and y(i) – 64 bytes) and store 4 doubles (y(i) – 32 bytes) each clock cycle
■ on a 2 GHz system this implies a memory system able to sustain 128 GB/s load traffic and 64 GB/s store traffic
● despite advances in memory technology (e.g., DDR5 SDRAM), memory is not able to pump data at such high rates

Memory latency and bandwidth are critical performance issues
● caches: reduce latency and provide improved cache-to-CPU bandwidth
● multiple memory banks: improve bandwidth (by parallel access)
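A minimal C sketch of the DAXPY loop above (the function name and signature are my own illustration, not from the slides); the comments restate the bandwidth arithmetic for the quoted peak rate:

  #include <stddef.h>

  /* y[i] = y[i] + a*x[i]: one fused multiply-add (2 flops) per element.
   * At a peak of 4 fused mult-adds (8 flops) per cycle the core consumes
   * 4 elements per cycle: 4 loads of x plus 4 loads of y = 64 bytes loaded,
   * and 4 stores of y = 32 bytes stored, per cycle.
   * At 2 GHz that is 2e9 * 64 B = 128 GB/s of load traffic and
   * 2e9 * 32 B = 64 GB/s of store traffic; far more than DRAM can sustain. */
  void daxpy(size_t n, double a, const double *x, double *y)
  {
      for (size_t i = 0; i < n; i++)
          y[i] = y[i] + a * x[i];
  }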
Memory Hierarchy

Main Memory −→ large, cheap memory; large latency/small bandwidth
↓
Cache −→ small, expensive memory; lower latency/higher bandwidth
↓
CPU Registers

● memory is partitioned into blocks (cache lines) and mapped to cache lines using a mapping algorithm (e.g., fully associative, direct, n-way set associative)
● cache lines are typically 16-128 bytes wide; entire cache lines are fetched from memory, not just one element (why?)
● cache hit (few cycles) / cache miss (large number of cycles)
● try to structure code to use an entire cache line of data before replacement (e.g., blocking strategies in dense matrix-matrix multiplication; a blocked sketch appears below)

Cache memory is effective because algorithms often use data that:
● was recently accessed from memory (temporal locality)
● was close to other recently accessed data (spatial locality)

Going (Explicitly) Parallel

● performance of a single processor is ultimately limited by its clock rate
● clock rate is in turn limited by power consumption, transistor switching time, etc.
● ILP allows multiple instructions at once, but it is limited by dependencies
● many problems are inherently distributed/exhibit potential parallelism

It’s time to go (explicitly) parallel

Parallel Hardware Overview
● Flynn’s taxonomy of parallel processors (1966, 1972)
■ SISD / SIMD / MISD / MIMD
● message-passing versus shared-address space programming
● UMA versus NUMA shared-memory computers
● dynamic/static networks
● evaluating cost and performance of static networks
● case study: NCI’s Gadi (2020–)
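To make the blocking strategy mentioned on the Memory Hierarchy slide concrete, here is a minimal C sketch (the tile size BS, the function name, and the assumption of row-major square matrices with n a multiple of BS are mine, not from the slides):

  #define BS 32   /* tile edge; assumed small enough that a few BS x BS tiles fit in cache */

  /* C += A*B for n x n row-major matrices, n a multiple of BS.
   * Working on BS x BS tiles means each cache line brought in from A, B and C
   * is reused many times (temporal locality) and traversed contiguously
   * (spatial locality) before it is evicted. */
  void matmul_blocked(int n, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
          for (int jj = 0; jj < n; jj += BS)
            for (int i = ii; i < ii + BS; i++)
              for (int k = kk; k < kk + BS; k++) {
                  double a = A[i*n + k];
                  for (int j = jj; j < jj + BS; j++)
                      C[i*n + j] += a * B[k*n + j];
              }
  }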
MIMD: Multiple Instruction Multiple Data

● examples include: (1) a quad-core PC; (2) the 2x 24-core Xeon CPUs on each Gadi node

[Figure: SIMD: a single global control unit drives an array of CPUs through an interconnect; MIMD: each CPU has its own control unit and connects to the interconnect independently]

● scheduling: efficient allocation of processors to tasks in a dynamic fashion
● synchronization: prevent processors from accessing the same data simultaneously
● interconnect design: processor-to-memory and processor-to-processor interconnects; also the I/O network - often processors are dedicated to I/O devices
● overhead: inevitably there is some overhead associated with coordinating activities between processors, e.g. resolving contention for resources
● partitioning: partitioning a computation/algorithm into concurrent tasks might not be trivial and may require algorithm redesign and/or significant programming effort
Logical classification of parallel computers

Address Space Organization: Message Passing
Address Space Organization: Shared Address Space

● there is a common shared data address space
● processes interact by modifying objects stored in this shared address space
● most typically implemented by so-called shared-memory computers
● the simplest implementation is flat or uniform memory access (UMA)
● synchronizing concurrent access to shared data objects and processor-processor communication (to maintain coherence among multiple copies) limits performance
● typically one observes sublinear memory bandwidth growth with the # of processors
● example: a quad-core laptop

[Figure: processors with private caches sharing memory through a common interconnect (UMA)]

Non-Uniform Memory Access (NUMA)

● all memory is still visible to the programmer (shared address space), but some memory accesses take longer than others
● designed to increase aggregate memory bandwidth with the # of processors
● parallel programs should be written such that fast (local) memory accesses are maximized (collocate data and computation accordingly; a first-touch sketch appears below)
● example: within each Gadi node, each socket (i.e., 24-core CPU) is connected to its own memory module, which is faster to access than the other (remote) one

[Figure: processors with private caches, each group attached to its own local memory, joined by an interconnect (NUMA)]
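One common way to collocate data and computation on a NUMA node (my own sketch, not prescribed by the slides; it assumes OpenMP and Linux-style first-touch page placement) is to initialise an array with the same parallel loop schedule that later computes on it:

  #include <stdlib.h>
  /* compile with e.g. gcc -fopenmp */

  /* With first-touch placement, each page of 'a' is physically allocated in the
   * memory module local to the socket whose thread first writes it. Using the
   * same static schedule for initialisation and computation keeps most accesses
   * local, avoiding slower remote-socket memory accesses. */
  void numa_friendly(size_t n)
  {
      double *a = malloc(n * sizeof *a);

      #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < n; i++)
          a[i] = 0.0;                /* first touch: places pages near each thread */

      #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < n; i++)
          a[i] = 2.0 * a[i] + 1.0;   /* mostly local memory traffic */

      free(a);
  }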
Dynamic Connectivity: Bus

● simplest/cheapest network: a shared medium common to all processors
● it is a completely blocking network: a point-to-point communication between a processor and a memory module, or between two processors, prevents any other communication
● limited bandwidth scalability (multiple accesses to memory are serialized)
● effective cache utilization can alleviate demands on the bus bandwidth

[Figure: processors with private caches attached to a shared bus, with the memory modules on the other side of the bus]

Dynamic Connectivity: Crossbar

● employs a 2D grid of switching nodes (complexity grows as O(p^2))
● it is a completely non-blocking network: a connection between two processors does not block a connection between any other two processors
● not scalable in terms of complexity and cost

[Figure: processor-and-memory nodes connected to the memory modules through a 2D grid of crossbar switches]
Dynamic Connectivity: Multi-staged Networks (e.g. Omega Network)

[Figure: processors 000 ... 111 connected to memories 000 ... 111 through a multi-stage switching network (Omega network)]

Static Connectivity: Complete, Mesh, Tree

● completely connected (becomes very complex!)
● linear array/ring, mesh/2D torus

[Figures of the corresponding topologies]
Static Connectivity: Hypercube

[Figure: a 4-dimensional hypercube with nodes labelled by 4-bit binary strings (0100, 0110, 1100, 1110, ...)]

● two (and exactly two) processing nodes along each of the d = log2(p) dimensions (thus p = 2^d processing nodes)
● the number of connections per processor grows as log2(p)
● recursive construction: a d-hypercube is built by connecting two (d − 1)-hypercubes
● two processing nodes are directly connected IF AND ONLY IF their labels differ in exactly one bit
● the number of links in the shortest path between two processors labelled s and t is the number of bits that are on (i.e., = 1) in the binary representation of s ⊕ t (bitwise XOR), e.g. 3 for 101 ⊕ 010 and 2 for 011 ⊕ 101 (see the sketch below)
● examples: Intel iPSC Hypercube, NCube, SGI Origin, Cray T3D, TOFU

Evaluating Static Interconnection Networks #1

Diameter
● the maximum distance (number of links on a shortest path) between any two processing nodes

Connectivity
● the multiplicity of paths between any two processors
● high connectivity is desirable as it minimizes contention (and also enhances fault-tolerance)
● arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
■ 1 for linear arrays and binary trees
■ 2 for rings and 2D meshes
■ 4 for a 2D torus
■ d for d-dimensional hypercubes
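A minimal sketch of the hypercube shortest-path rule above (the function name is mine; __builtin_popcount is a GCC/Clang builtin, so the compiler is an assumption):

  /* Number of links on a shortest path between hypercube nodes s and t:
   * the count of 1-bits in s XOR t, i.e. the dimensions in which the two
   * labels differ and which therefore must each be traversed once. */
  int hypercube_distance(unsigned s, unsigned t)
  {
      return __builtin_popcount(s ^ t);
  }

  /* e.g. hypercube_distance(5, 2) == 3  (101 vs 010)
   *      hypercube_distance(3, 5) == 2  (011 vs 101) */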
Channel width
● the number of bits that can be communicated simultaneously over a link connecting two processors

Bisection width and bandwidth
● bisection width is the minimum number of communication links that have to be removed to partition the network into two equal halves
● bisection bandwidth is the minimum volume of communication allowed between two halves of the network with equal numbers of processors

Cost
● many criteria can be used; we will use the number of communication links or wires required by the network

Network                  Diameter           Bisection width   Arc connectivity   Cost (no. of links)
Completely-connected     1                  p^2/4             p − 1              p(p − 1)/2
Binary Tree              2 log2((p + 1)/2)  1                 1                  p − 1
Linear array             p − 1              1                 1                  p − 1
Ring                     ⌊p/2⌋              2                 2                  p
2D Mesh                  2(√p − 1)          √p                2                  2(p − √p)
2D Torus                 2⌊√p/2⌋            2√p               4                  2p
Hypercube                log2 p             p/2               log2 p             (p log2 p)/2

Note: the Binary Tree suffers from a bottleneck: all traffic between the left and right sub-trees must pass through the root. The fat tree interconnect alleviates this.
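As a concrete reading of the table above (my own arithmetic, not from the slides): for p = 64, a hypercube has diameter log2 64 = 6, bisection width 64/2 = 32, arc connectivity 6 and cost (64 · 6)/2 = 192 links, whereas an 8 × 8 2D torus has diameter 2⌊8/2⌋ = 8, bisection width 2 · 8 = 16, arc connectivity 4 and cost 2 · 64 = 128 links; the hypercube buys a smaller diameter and a larger bisection width at a higher link cost.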
NCI’s Gadi: A Petascale Supercomputer

● 184K cores (dual-socket, 24-core Intel Xeon Platinum 8274 (Cascade Lake), 3.2 GHz) in 4243 compute nodes
● 192 GB memory per node (815 TB total)
● Mellanox InfiniBand HDR interconnect (100 Gb/s, ≈ 60 km of cables)
● interconnects: mesh (cores), full (sockets), Dragonfly+ (nodes)
● ≈ 22 PB Lustre parallel filesystem
● power: 1.5 MW max. load
● cooling systems: 100 tonnes of water
● 24th fastest in the world on debut (June 2020) – 9.3 PFLOPS
■ (probably) the fastest file system in the southern hemisphere
■ custom Linux kernel (CentOS 8)
■ highly customised PBS Pro scheduler

Further Reading: Parallel Hardware

● The Free Lunch Is Over!
● Ch 1, 2.1-2.4 of Introduction to Parallel Computing
● Ch 1, 2 of Principles of Parallel Programming