CS6461 Computer Architecture
Fall 2016
Morris Lancaster - Lecturer
Adapted from Professor Stephen Kaisler's Notes
Lecture 10
High Performance Computing:
Multiprocessors
Introduction
So far we have studied uniprocessors (one processor,
possibly pipelined, and one memory) and vector processors
(still one computer, with special functional units).
Performance can be improved further through the use of multiple
processors. Multiprocessors allow true concurrent or parallel
programming, not just multiprogramming.
Idea: create powerful computers by connecting many
smaller ones
good news: works for timesharing (better than supercomputer)
bad news: it's really hard to write good concurrent programs;
many commercial failures
Ref: Introduction to Parallel Processing: Algorithms and
Architectures, by Behrooz Parhami
Paradigms
Parallel Computing
Simultaneous use of multiple processors - all components of a
single architecture - to solve a task. Typically the processors are
identical and serve a single user (even if the machine is multi-user).
Distributed Computing
Use of a network of processors, each capable of being viewed
as a computer in its own right, to solve a problem. Processors
may be heterogeneous and multi-user, and usually individual
tasks are assigned to individual processors.
Concurrent Computing
Both of the above
Types of Parallelism
For A Given Problem
Speedup
If we can do some computations in parallel, then we
can attain a speedup over sequential execution:
Speedup = Tsequential/Tparallel
Amdahl (Gene) defined the speedup for a parallel
processor as:
S = 1/(f + (1-f)/p)
where: p = # processors; f = fraction of
unparallelizable code
So, if f = 10%, the speedup can be no greater than
10!
With p = 10, S = 1/(0.1 + 0.9/10) ~= 5.3
With p = infinity, S = 1/(0.1 + 0.9/infinity) = 10
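A minimal Python sketch (not from the original slides) of Amdahl's formula, reproducing the f = 10% example above:

  # Amdahl's law: S = 1 / (f + (1 - f) / p), where f is the serial
  # (unparallelizable) fraction and p is the number of processors.
  def amdahl_speedup(f, p):
      return 1.0 / (f + (1.0 - f) / p)

  print(amdahl_speedup(0.1, 10))      # ~5.26 with 10 processors
  print(amdahl_speedup(0.1, 10**9))   # approaches the 1/f = 10 limit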
Efficiency
Efficiency is the ratio of speedup to p: 5.3/10 ~= 53%
Ignores the possibility of new algorithms with a much smaller f
Ignores the possibility that more of the program is run from higher-
speed memory, such as registers, cache, or main memory
Often, the problem is scaled with the number of processors
f is a function of program size and may decrease as the problem grows
serial code may take constant time, independent of size
(George Michael 80/20 vs 20/80 rule)
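A small self-contained sketch of the efficiency ratio for the same f = 0.1, p = 10 example (illustrative only):

  # Efficiency E = S / p, with Amdahl's speedup S = 1 / (f + (1 - f) / p).
  def efficiency(f, p):
      s = 1.0 / (f + (1.0 - f) / p)
      return s / p

  print(efficiency(0.1, 10))   # ~0.53, i.e., about 53%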
Multiprocessor Example
Compare performance of a single Motorola 68020 with ten
68020s coupled as a multiprocessor
Consider the task of adding 100 numbers in memory
The ADD.W <ea>,Dn op takes 4 clock cycles. Thus, a single
68020 would take 400 clock cycles to add 100 numbers one at
a time
Multiprocessor works as follows:
a. All 10 uPs add their first 10 numbers each = 40 cycles
b. 5 uPs add the 10 partial sums pairwise = 4 cycles
c. 2 uPs add 4 of the 5 partial sums = 4 cycles
d. One uP adds the three remaining partial sums (two adds) = 8 cycles
Total: 56 cycles!
Performance Improvement: 7.14 (not 10 due to overhead)
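A hedged Python sketch of the cycle counting above (the 4-cycles-per-add figure and the reduction schedule come from the slide; the helper itself is illustrative only):

  # Reproduce the slide's cycle count for adding 100 numbers on ten 68020s.
  CYCLES_PER_ADD = 4

  sequential = 100 * CYCLES_PER_ADD   # 400 cycles on one CPU

  parallel = 0
  parallel += 10 * CYCLES_PER_ADD     # step a: 10 CPUs x 10 adds each, in parallel
  parallel += 1 * CYCLES_PER_ADD      # step b: 5 CPUs combine the 10 partial sums
  parallel += 1 * CYCLES_PER_ADD      # step c: 2 CPUs combine 4 of the 5 sums
  parallel += 2 * CYCLES_PER_ADD      # step d: 1 CPU adds the last 3 sums (2 adds)

  print(parallel)                     # 56 cycles
  print(sequential / parallel)        # speedup ~7.14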
Limitations
Software Inertia: Billions of dollars' worth of
FORTRAN, C/C++, and COBOL software exists.
Who will rewrite it?
Into what programming language?
Many programmers use multicore computers, but few
have direct experience writing parallel programs.
Who will retrain them?
What about languages which are becoming
obsolete: Tcl, other scripting languages, etc.?
The Path to PetaFLOPS
Michael Flynn's Hardware Taxonomy
I: Instruction Stream D: Data Stream
SI: Single Instruction Stream (a)
All processors execute the same instruction in the same cycle
Instruction may be conditional
in multiprocessors, a control processor issues the instruction
MI: Multiple Instruction Stream (c)
Different processors may be simultaneously executing different instructions
SD: Single Data Stream (d)
All processors operate on the same data item (e.g., copies of it) at the
same time
MD: Multiple Data Stream (b)
Different processors may be simultaneously operating on different data
items
Example: multiplying a coefficient vector by a data vector (e.g., in filtering):
y[i] := c[i] * x[i], 0 <= i < n
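A small illustrative sketch of this data-parallel operation expressed as a single whole-array operation, using NumPy as an assumed stand-in for hardware data parallelism (not something named in the slides):

  import numpy as np

  c = np.array([2.0, 0.5, 1.5, 3.0])   # coefficient vector
  x = np.array([1.0, 4.0, 2.0, 1.0])   # data vector

  # One logical operation applied to every element pair at once,
  # which is the essence of the SIMD / data-parallel model.
  y = c * x
  print(y)                              # [2. 2. 3. 3.]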
Taxonomy Visual
SISD
This model has been the subject of previous lectures.
SIMD
Cray-1
This was the topic of lecture 9.
SIMD
Execute one operation on multiple data streams
concurrency in time: vector processing
concurrency in space: array processing
Thinking Machines, Inc. Connection Machines: CM-1 (SIMD), CM-5 (MIMD)
Thinking Machines CM-1
MISD
CMU/GE WARP
Example: WARP Systolic Array
Designed by H.T. Kung at CMU (now at Harvard)
Built by General Electric for me for the DARPA
Strategic Computing Program (ca. 1986-1987)
1987: 100 MFLOPS for $300,000; about 30 times
cheaper than a Cray-1 (also 100 MFLOPS) at $10M
Limited programming models, however.
MIMD
Processor Coupling
Tightly Coupled System
Tasks and/or processors communicate in a highly
synchronized fashion
Communication is through a common shared memory
Shared memory system
Loosely Coupled System
Tasks or processors do not communicate in a
synchronized fashion
Communication is by message-passing packets
Overhead for data exchange is high
Distributed memory system
Granularity of Parallelism
Coarse-grain
A task is broken into a handful of pieces, each of which is executed by
a powerful processor
Processors may be heterogeneous
Computation/communication ratio is very high
Example: BBN Butterfly
Medium-grain
Tens to a few thousand processors, typically running the same code
Computation/communication ratio is often hundreds or more
Intel Paragon XP, Touchstone Series
Fine-grain
Thousands to perhaps millions of small pieces, executed by very small,
simple processors or through pipelines
Processors typically have instructions broadcast to them
Compute/communicate ratio often near unity
Example: Thinking Machines CM-1, CM-2, CM-200
Intel Paragon XP/S 140 Supercomputer
Memory Architectures
Shared (Global) Memory
A Global Memory Space accessible by all processors
Processors may also have some local memory
Distributed (Local, Message-Passing) Memory
All memory units are associated with processors
To retrieve information from another processor's memory, a
message must be sent there
Uniform Memory: all processors take the same time
to reach all memory locations
Non-Uniform Memory (NUMA): access time varies with which
shared memory location a processor references
Shared Memory Multiprocessors
Characteristics
All processors have equally direct access to one large memory address
space
Example systems
Bus and cache-based systems: Sequent Balance, Encore Multimax
Multistage interconnection network (IN)-based systems: Ultracomputer, Butterfly, RP3, HEP
Crossbar switch-based systems: C.mmp, Alliant FX/8
Limitations
Memory access latency; Hot spot problem
Centralized Shared vs. Distributed Memory
Large caches single memory serves Pro: reduces latency of local memory
small number of processes (up to 16 or accesses
so) Con: communicating data between
More processors introduces more processors becomes more complex
contention
Message Passing Multiprocessors
Characteristics
Interconnected computers
Each processor has its own memory, and communicates via message passing
Messages are variable-length data containers
Example systems
Tree structure: Teradata, DADO
Mesh-connected: Rediflow, Series 2010, J-Machine
Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III
Message Passing Multiprocessors
One-to-one communication: one source, one destination
Collective communication
One-to-many: multicast, broadcast (one-to-all), scatter
Many-to-one: combine (fan-in), global combine, gather
Many-to-many: all-to-all broadcast (gossiping), scatter-gather
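For illustration only, a hedged Python sketch of these communication patterns using mpi4py (an assumed library choice, not one named in the slides):

  # Run with e.g.: mpiexec -n 4 python collectives.py
  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()
  size = comm.Get_size()

  # One-to-all (broadcast): the root sends the same value to every process.
  value = comm.bcast(42 if rank == 0 else None, root=0)

  # One-to-many (scatter): the root hands each process its own chunk.
  chunks = [[i, i + 1] for i in range(size)] if rank == 0 else None
  my_chunk = comm.scatter(chunks, root=0)

  # Many-to-one (gather / combine): the root collects one result per process.
  partial = sum(my_chunk)
  totals = comm.gather(partial, root=0)

  if rank == 0:
      print(value, totals)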
Message Routing
Non-Uniform Memory Access
All memories can be addressed by all processors, but access to a processor's own
local memory is faster than access to another processor's remote memory.
Looks like a distributed-memory machine, but the interconnection network is usually
custom-designed switches and/or buses.
CC-NUMA: Cache Coherent NUMA
Exemplified by Kendall Square Research KSR1
Operating System Support
Individual OS
Each CPU has its own OS
Statically allocate physical memory to each CPU
Each CPU runs its own independent OS
Share peripherals
Each CPU handles its processes' system calls
Used in early multiprocessor systems
Simple to implement
Avoids concurrency issues by not sharing
Operating System Support
Individual OS Issues:
Each processor has its own scheduling queue.
Each processor has its own memory partition.
Consistency is an issue with independent disk buffer caches and
potentially shared files.
Operating System Support
Master-Slave Multiprocessors
OS mostly runs on a single fixed CPU.
User-level applications run on the other CPUs.
All system calls are passed to the Master CPU for
processing
Very little synchronization required
Simple to implement
Single centralized scheduler to keep all processors busy
Memory can be allocated as needed to all CPUs.
Issues: Master CPU becomes the bottleneck.
Operating System Support
Master-Slave Multiprocessors Issues:
Master CPU becomes the bottleneck.
Operating System Support
OS kernel runs on all processors, while load and resources are
balanced between all processors.
One alternative: a single mutex (mutual exclusion object) that makes
the entire kernel one large critical section; only one CPU can be in the
kernel at a time; only slightly better than master-slave
Better alternative: identify independent parts of the kernel and make
each of them its own critical section, which allows parallelism in the
kernel (see the sketch below)
Issues: a difficult task; the code is mostly similar to uniprocessor code;
the hard part is identifying independent parts that don't interfere with
each other
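A user-level analogy in Python (an illustrative sketch, not actual kernel code): a single "big lock" serializes everything, while per-resource locks let independent work proceed in parallel.

  import threading

  # One coarse-grained lock: only one thread can be "in the kernel" at a time.
  big_kernel_lock = threading.Lock()

  def syscall_coarse(work):
      with big_kernel_lock:
          work()

  # Finer granularity: independent subsystems get their own critical sections.
  scheduler_lock = threading.Lock()
  filesystem_lock = threading.Lock()

  def schedule(work):
      with scheduler_lock:      # does not block file-system callers
          work()

  def write_file(work):
      with filesystem_lock:     # does not block scheduler callers
          work()

  schedule(lambda: print("scheduling"))
  write_file(lambda: print("writing"))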
Interconnection Topologies
Shared Bus:
M3 wishes to communicate with S5
[1] M3 sends signals (the address) on the bus that cause S5 to
respond
[2] M3 sends data to S5 or S5 sends data to M3 (determined by
the command line)
Master Device: Device that initiates and controls the
communication
Slave Device: Responding device
Multiple-master buses: Bus conflict requires bus arbitration
Interconnection Topologies
Shared Bus:
All processors (and memory) are connected to a
common bus or busses
Memory access is fairly uniform, but not very scalable
A collection of signal lines that carry module-to-module
communication
Data highways connecting several digital system
elements
Can handle only one data transmission at a time
Can be easily expanded by connecting additional
processors to the shared bus, along with the
necessary bus arbitration circuitry
Interconnection Topologies
Mesh Architecture:
Diameter of an m x m mesh = 2(m - 1)
In general, an n-dimensional mesh
with p nodes has
diameter = n(p^(1/n) - 1)
Diameter can be halved by
having wrap-around
connections (=> Torus)
Ring is a 1-dimensional
mesh with wrap-around
connections
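A small Python sketch of these diameter formulas, assuming a symmetric mesh with p = m^n nodes (illustrative only):

  # Diameter of an n-dimensional mesh with p nodes (m = p**(1/n) per side),
  # and of the corresponding torus (wrap-around roughly halves each dimension).
  def mesh_diameter(p, n):
      m = round(p ** (1.0 / n))
      return n * (m - 1)

  def torus_diameter(p, n):
      m = round(p ** (1.0 / n))
      return n * (m // 2)

  print(mesh_diameter(64, 2))   # 8x8 mesh  -> 14
  print(torus_diameter(64, 2))  # 8x8 torus -> 8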
Mesh-Type Interconnects
Tree-Type Interconnects
Example: Mesh Matrix Multiplication
Interconnection Topologies
Multiport Memory:
Has multiple sets of address, data, and control pins to allow
simultaneous data transfers to occur
CPU and DMA controller can transfer data concurrently
A system with more than one CPU could handle simultaneous
requests from two different processors
Does not scale well because of the explosion in the number of busses
Memory Module Control Logic
Each memory module has control logic
Resolves memory module conflicts via fixed priority among CPUs
Handles requests to read from and write to the same memory
location simultaneously
Advantages
Multiple paths -> high transfer rate
Disadvantages
Multiple copies of memory control logic
Large number of connections
Interconnection Topologies
Crossbar Switch:
Processors (p) and memory banks (b) are connected to routing switches,
as in a telephone system
Switches might have queues (combining logic), which improve
functionality but increase latency
Switch settings may be determined by message headers or preset
by a controller
Connections can be packet-switched or circuit-switched (remain
connected as long as needed)
Nonblocking switch: the connection of a processing node to a memory
bank does not block the connection of any other processing node to
other memory banks
Older versions used circuit switching, where a dedicated path was
created for the duration of the communication
More recently, packet switching has been used across the interstitial
nodes
Requires p*b switches
Examples of machines that employ crossbars include the Sun Ultra
HPC 10000 and the Fujitsu VPP500
Butterfly Network
An example of blocking in an omega network: one of the messages
(010 to 111 or 110 to 100) is blocked at link AB.
Butterfly Network
Butterfly Routing
Interconnection Topologies
Omega Network (Butterfly Network)
Consists of log2(p) stages, where p is the number of inputs (processing
nodes) and also the number of outputs (memory banks)
Each stage consists of an interconnection pattern that connects
p inputs to p outputs:
Perfect shuffle (left rotation of the address bits):
j = 2i, for 0 <= i <= p/2 - 1
j = 2i + 1 - p, for p/2 <= i <= p - 1
Each switch has two connection modes:
Pass-through connection: the inputs are sent straight through to the
outputs
Cross-over connection: the inputs to the switching node are crossed
over and then sent out
Has p*(log2 p)/2 switching nodes: if p = 8, nodes = 12
Much better than a crossbar using p*p = 64 switches
The cost of such a network grows as Theta(p log p)
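A hedged Python sketch of the perfect-shuffle pattern and switch count described above (helper names are illustrative):

  import math

  # Perfect shuffle: j = 2i for 0 <= i <= p/2 - 1, j = 2i + 1 - p otherwise,
  # which is a left rotation of the binary input label.
  def perfect_shuffle(i, p):
      return 2 * i if i < p // 2 else 2 * i + 1 - p

  p = 8
  print([perfect_shuffle(i, p) for i in range(p)])   # [0, 2, 4, 6, 1, 3, 5, 7]

  # Switch count for an omega network: p/2 switches per stage, log2(p) stages.
  stages = int(math.log2(p))
  print(stages * p // 2)                              # 12 switches for p = 8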
Interconnection Topologies
Omega Network (Butterfly Network)
Omega network has self-routing property
The path for a cell to take to reach its destination can be
determined directly from its routing tag (i.e., destination port id)
Stage k of the network looks at bit k of the tag
If bit k is 0, then send cell out upper port
If bit k is 1, then send cell out lower port
Works for every possible input port (really!)
Route from any input x to output y by selecting links determined by
successive d-ary digits of y's label.
This process is reversible; we can route from output y back to x by
following the links determined by successive digits of x's label.
This self-routing property allows for simple hardware-based routing
of cells.
[Figure: self-routing example tracing the switch-by-switch label substitution from input x = x(k-1) ... x0 to output y = y(k-1) ... y0, one destination digit per stage]
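A sketch of the self-routing rule in Python (illustrative only; the bit ordering examined at each stage depends on the particular network, here taken most-significant bit first):

  # Self-routing in a log2(p)-stage network: each stage inspects one bit of
  # the destination tag; 0 -> upper output port, 1 -> lower output port.
  def route(dest, num_stages):
      ports = []
      for k in range(num_stages - 1, -1, -1):   # examine bits MSB to LSB
          bit = (dest >> k) & 1
          ports.append("upper" if bit == 0 else "lower")
      return ports

  print(route(0b101, 3))   # ['lower', 'upper', 'lower'] for destination 101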
Interconnection Topologies
Hypercube:
Processors are directly connected to only certain other
processors and data must traverse multiple hops to get to
additional processors
Usually distributed memory
Hardware may handle only single hops, or multiple hops
Software may mask hardware limitations
Latency is related to graph diameter, among many other
factors
Usually NUMA, nonblocking, scalable, upgradeable
Examples: Ring, Mesh, Torus, Hypercube, Binary Tree
p = 2^n processors, n >= 0
Processors are conceptually at the corners of an n-
dimensional hypercube, and each is directly connected to
its n neighboring nodes
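A small Python sketch of hypercube addressing, assuming nodes labeled 0 .. 2^n - 1 (helper names are illustrative): neighbors differ in exactly one bit, and a message travels one hop per differing bit.

  # n-dimensional hypercube: node labels are n-bit numbers; flipping one bit
  # gives a directly connected neighbor, so the diameter is n hops.
  def neighbors(node, n):
      return [node ^ (1 << d) for d in range(n)]

  def hypercube_route(src, dst, n):
      path, cur = [src], src
      for d in range(n):                 # correct one differing bit per hop
          if (cur ^ dst) & (1 << d):
              cur ^= (1 << d)
              path.append(cur)
      return path

  print(neighbors(0b000, 3))             # [1, 2, 4]
  print(hypercube_route(0b000, 0b101, 3))  # [0, 1, 5]: two hops for two differing bits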
Interconnection Topologies
Interconnection Topologies
Red Storm - Cray Inc. (2003)
Red Storm Overview
Red Storm Network
Red Storm Processor Board
Growth Over Thirty Years
What Have We Learned Over 30 Years?
Building general-purpose parallel machines is a very
difficult task.
Proof by contradiction:
Many companies have gone bankrupt or left the parallel machine
market
Even harder is developing general parallel programming
schemes
Still an art rather than a science
Additional Material
SIMD => Data Level Parallelism