Module 1

Parallel and Distributed Computing

UCS645
Lecture 1
Saif Nalband
Contents
Parallelism Fundamentals:
● Scope and issues of parallel and distributed computing,
● Parallelism,
● Goals of parallelism,
● Parallelism and concurrency,
● Multiple simultaneous computations.
One common definition
A parallel computer is a collection of processing elements
that cooperate to solve problems quickly

We care about performance, and we care about efficiency.
We’re going to use multiple processors to get both.
Speedup

One major motivation for using parallel processing is to achieve a speedup.

For a given problem:

    speedup (using P processors) = execution time (using 1 processor) / execution time (using P processors)
● Communication limits the maximum achievable speedup
● Minimizing the cost of communication improves speedup
● Imbalance in work assignment limits speedup
● Improving the distribution of work improves speedup
Course theme 1:
Designing and writing parallel programs ... that scale!

▪ Parallel thinking
1. Decomposing work into pieces that can safely be performed in parallel
2. Assigning work to processors
3. Managing communication/synchronization between the processors so that it does not limit speedup

▪ Abstractions/mechanisms for performing the above tasks


- Writing code in popular parallel programming languages
Course theme 2:
Parallel computer hardware implementation: how
parallel computers work
▪ Mechanisms used to implement abstractions efficiently
- Performance characteristics of implementations
- Design trade-offs: performance vs. convenience vs. cost

▪ Why do I need to know about hardware?
- Because the characteristics of the machine really matter (recall the speed-of-communication issues in earlier demos)
- Because you care about efficiency and performance (you are writing parallel programs, after all!)
Course theme 3:
Thinking about efficiency
▪ FAST != EFFICIENT

▪ Just because your program runs faster on a parallel computer does not mean it is using the hardware efficiently
- Is 2x speedup on a computer with 10 processors a good result?
▪ Programmer’s perspective: make use of the machine capabilities provided
▪ HW designer’s perspective: choose the right capabilities to put in the system (performance/cost; cost = silicon area? power? etc.)
Some historical context: why avoid parallel processing?
▪ Single-threaded CPU performance doubling ~ every 18 months
▪ Implication: working to parallelize your code was often not worth the
time
- Software developer does nothing, code gets faster next year. Woot!
[Figure: relative CPU performance by year. Image credit: Olukotun and Hammond, ACM Queue 2005]
Until ~15 years ago: two significant reasons for
processor performance improvement

1. Exploiting instruction-level parallelism (superscalar execution)
2. Increasing CPU clock frequency

What is a computer program?
Here is a program written in C
#include <stdio.h>

int main(int argc, char** argv) {
    int x = 1;
    for (int i = 0; i < 10; i++) {
        x = x + x;
    }
    printf("%d\n", x);  /* prints 1024 */
    return 0;
}
What is a program? (from a processor’s perspective)
A program is just a list of processor instructions! Compiling the C code above yields something like:

_main:
100000f10:  pushq  %rbp
100000f11:  movq   %rsp, %rbp
100000f14:  subq   $32, %rsp
100000f18:  movl   $0, -4(%rbp)
100000f1f:  movl   %edi, -8(%rbp)
100000f22:  movq   %rsi, -16(%rbp)
100000f26:  movl   $1, -20(%rbp)
100000f2d:  movl   $0, -24(%rbp)
100000f34:  cmpl   $10, -24(%rbp)
100000f38:  jge    23 <_main+0x45>
100000f3e:  movl   -20(%rbp), %eax
100000f41:  addl   -20(%rbp), %eax
100000f44:  movl   %eax, -20(%rbp)
100000f47:  movl   -24(%rbp), %eax
100000f4a:  addl   $1, %eax
100000f4d:  movl   %eax, -24(%rbp)
100000f50:  jmp    -33 <_main+0x24>
100000f55:  leaq   58(%rip), %rdi
100000f5c:  movl   -20(%rbp), %esi
100000f5f:  movb   $0, %al
100000f61:  callq  14
100000f66:  xorl   %esi, %esi
100000f68:  movl   %eax, -28(%rbp)
100000f6b:  movl   %esi, %eax
100000f6d:  addq   $32, %rsp
100000f71:  popq   %rbp
100000f72:  retq
What does a processor do?
A processor executes instructions.

Very Simple Processor:
- Fetch/Decode: determines what instruction to run next
- ALU (Execution Unit): performs the operation described by an instruction, which may modify values in the processor’s registers or the computer’s memory
- Execution context (registers R0-R3): maintains program state, storing the values of variables used as inputs and outputs to operations
One example instruction: add two numbers

Step 1: The processor gets the next program instruction from memory (it figures out what to do next):
    add R0 ← R0, R1
“Please add the contents of register R0 to the contents of register R1 and put the result of the addition into register R0.”

Step 2: Get the operation inputs from the registers:
    Contents of R0 input to execution unit: 32
    Contents of R1 input to execution unit: 64

Step 3: Perform the addition operation:
    The execution unit performs the arithmetic; the result is 96.

Step 4: Store the result 96 back to register R0.
    Register state afterwards: R0: 96, R1: 64, R2: 0xff681080, R3: 0x80486412
Execute program
My very simple processor executes one instruction per clock. It fetches/decodes each instruction in turn and runs it on the execution unit:

    ld  r0, addr[r1]
    mul r1, r0, r0
    mul r1, r1, r0
    ...
    st  addr[r2], r0
Review of how computers work…
What is a computer program? (from a processor’s perspective)
It is a list of instructions to execute!

What is an instruction?
It describes an operation for a processor to perform.
Executing an instruction typically modifies the computer’s
state.

What do I mean when I talk about a computer’s “state”?


The values of program data, which are stored in a
processor’s registers or in memory.
Let’s consider a very simple piece of code
a = x*x + y*y + z*z
Consider the following five-instruction program (assume registers R0 = x, R1 = y, R2 = z):

1. mul R0, R0, R0
2. mul R1, R1, R1
3. mul R2, R2, R2
4. add R0, R0, R1
5. add R3, R0, R2

R3 now stores the value of program variable ‘a’.
This program has five instructions, so it will take five clocks to execute, correct? Can we do better?
What if up to two instructions can be performed at once?
a = x*x + y*y + z*z (assume registers R0 = x, R1 = y, R2 = z)

Program:
1. mul R0, R0, R0
2. mul R1, R1, R1
3. mul R2, R2, R2
4. add R0, R0, R1
5. add R3, R0, R2

Two-wide schedule:
Clock   Processor 1          Processor 2
1       1. mul R0, R0, R0    2. mul R1, R1, R1
2       3. mul R2, R2, R2    4. add R0, R0, R1
3       5. add R3, R0, R2

R3 now stores the value of program variable ‘a’: three clocks instead of five.
What does it mean for our parallel schedule to “respect program order”?
What about three instructions at once?
a = x*x + y*y + z*z (assume registers R0 = x, R1 = y, R2 = z)

Clock   Processor 1          Processor 2          Processor 3
1       1. mul R0, R0, R0    2. mul R1, R1, R1    3. mul R2, R2, R2
2       4. add R0, R0, R1
3       5. add R3, R0, R2

R3 now stores the value of program variable ‘a’. Still three clocks: the adds form a dependency chain, so a third processor does not help here.
Instruction-level parallelism (ILP) example
a = x*x + y*y + z*z
- The three multiplies x*x, y*y, z*z are mutually independent: ILP = 3
- The first add (x*x + y*y) must wait for two of the multiplies: ILP = 1
- The second add (+ z*z) must wait for the first add: ILP = 1
- The final result is a
Superscalar processor execution
a = x*x + y*y + z*z (assume registers R0 = x, R1 = y, R2 = z)

1. mul R0, R0, R0
2. mul R1, R1, R1
3. mul R2, R2, R2
4. add R0, R0, R1
5. add R3, R0, R2

Idea #1, superscalar execution: the processor automatically finds* independent instructions in an instruction sequence and executes them in parallel on multiple execution units!

In this example: instructions 1, 2, and 3 can be executed in parallel without impacting program correctness (on a superscalar processor that determines that no dependencies exist between them). But instruction 4 must be executed after instructions 1 and 2, and instruction 5 must be executed after instruction 4.

* Or the compiler finds independent instructions at compile time and explicitly encodes dependencies in the compiled binary.
Superscalar processor
This processor can decode and execute up to two instructions per clock.
[Figure: two Fetch/Decode units and two execution units sharing one execution context, coordinated by out-of-order control logic]
Diminishing returns of superscalar execution
Most available ILP is exploited by a processor capable of issuing four instructions per clock (there is little performance benefit from building a processor that can issue more).
[Figure: speedup vs. instruction issue capability of the processor (instructions/clock, 0 to 16); the curve flattens beyond ~4]
ILP tapped out + the end of frequency scaling
- Processor clock rate stops increasing
- No further benefit from ILP
[Figure: transistor density, clock frequency, power, and instruction-level parallelism (ILP) over time]
The “power wall”
Power consumed by a transistor:
- Dynamic power ∝ capacitive load × voltage² × frequency
- Static power: transistors burn power even when inactive, due to leakage
High power means high heat. Power is a critical design constraint in modern processors.

Typical power budgets:
- Apple M1 laptop: 13 W TDP
- Intel Core i9-10900K (desktop CPU): 95 W
- NVIDIA RTX 4090 GPU: 450 W
- Mobile phone processor: 1/2 - 2 W
- World’s fastest supercomputer: megawatts
- Standard microwave oven: 900 W
Source: Intel, NVIDIA, Wikipedia, Top500.org


Power draw as a function of clock frequency
- Dynamic power ∝ capacitive load × voltage² × frequency
- Static power: transistors burn power even when inactive, due to leakage
- The maximum allowed frequency is determined by the processor’s core voltage
Single-core performance scaling
The rate of single-instruction-stream performance scaling has decreased (almost to zero):

1. Frequency scaling is limited by power
2. ILP scaling has tapped out

Architects are now building faster processors by adding more execution units that run in parallel (or units that are specialized for a specific task, like graphics or audio/video playback).

Software must be written to be parallel to see performance gains. No more free lunch for software developers!

[Figure: transistor density, clock frequency, power, and ILP over time. Image credit: “The Free Lunch Is Over” by Herb Sutter]


Example: multi-core CPU
Intel “Comet Lake” 10th-generation Core i9 10-core CPU (2020)
[Figure: die with ten cores, Core 1 through Core 10]

One thing you will learn in this course
▪ How to write code that efficiently uses the resources in a modern multi-core CPU
Example: running on a quad-core Intel CPU
- Four CPU cores
- AVX SIMD vector instructions + hyper-threading
- Baseline: a single-threaded C program compiled with -O3
- A parallelized program that uses all the parallel execution resources on this CPU is ~32-40x faster!
AMD Ryzen Threadripper 3990X
64 cores, up to 4.3 GHz
Eight 8-core chiplets
(Slide credit: Stanford CS149, Fall 2023)
NVIDIA AD102 GPU
GeForce RTX 4090 (2022)
76 billion transistors
18,432 fp32 multipliers, organized in 144 processing blocks (called SMs)
GPU-accelerated supercomputing
Frontier (at Oak Ridge National Lab), the world’s #1 in Fall 2022:
- 9,472 × 64-core AMD CPUs (606,208 CPU cores)
- 37,888 Radeon GPUs
- 21 megawatts
Mobile parallel processing
Power constraints also heavily influence the design of mobile systems.
Apple A15 Bionic (in iPhone 13, 14):
- 15 billion transistors
- 6-core CPU: 2 “big” CPU cores + 4 “small” CPU cores
- Multi-core GPU (5 GPU blocks)
Image credit: TechInsights Inc.
Mobile parallel processing

Raspberry Pi 3
Quad-core ARM A53 CPU
High Performance Computing Projects

1. Early Warning and Flood Prediction for River Basins of India - CDAC, CWC
2. A HPC Software Suite for Seismic Imaging to Aid Oil & Gas Exploration - CDAC, ONGC, NGRI, IITR
3. NSM Urban Modeling Project - CDAC, CPCB
4. NSM Platform for Genomics and Drug Discovery (NPGDD) - CDAC, IISc B, NII Delhi, IIT Delhi, NCBS Bangalore, NIBGM, Ministry of Ayush
5. Materials and Computational Chemistry - CDAC, IITK, IACS Kolkata, IISER Bhopal, SPPU Pune
6. Design & Development of DCLC Based System - CDAC, CMET, IITB
7. MPP Lab Project - CDAC, CDOT, IISc
But in modern computing, software must be more than just parallel…
• IT MUST ALSO BE EFFICIENT
Parallel + specialized HW
▪ Achieving high efficiency will be a key theme in this class
▪ We will discuss how modern systems not only use many processing units, but also utilize specialized processing units to achieve high levels of power efficiency
Specialized processing is ubiquitous in mobile systems
Apple A15 Bionic (in iPhone 13, 14):
- 15 billion transistors
- 6-core CPU: 2 “big” CPU cores + 4 “small” CPU cores
- Apple-designed multi-core GPU
- Neural Engine (NPU) for DNN acceleration
- Image/video encode/decode processor
- Motion (sensor) processor
Image credit: TechInsights Inc.


Google TPU pods
TPU = Tensor Processing Unit: a specialized processor for ML computations.
Many vendors now build specialized hardware to accelerate DNN inference/training:
- Huawei Kirin NPU
- Google TPU3
- GraphCore IPU
- Apple Neural Engine
- Intel Deep Learning Inference Accelerator
- SambaNova Cardinal SN10
- Cerebras Wafer Scale Engine
- NVIDIA Ampere GPU with Tensor Cores


Achieving efficient processing almost always comes down to accessing data efficiently.
What is memory?
A program’s memory address space:
▪ A computer’s memory is organized as an array of bytes
▪ Each byte is identified by its “address” in memory (its position in this array)
(We’ll assume memory is byte-addressable.)

Example memory contents:
Address   Value
0x0       16
0x1       255
0x2       14
0x3       0
0x4       0
0x5       0
0x6       6
0x7       0
0x8       32
0x9       48
0xA       255
0xB       255
...
0xF       0
0x10      128
...
0x1F      0

“The byte stored at address 0x8 has the value 32.”
“The byte stored at address 0x10 (16) has the value 128.”
In this illustration, the program’s memory address space is 32 bytes in size (so valid addresses range from 0x0 to 0x1F).
Terminology
▪ Memory access latency
- The amount of time it takes the memory system to provide data to the processor
- Example: 100 clock cycles, or 100 nsec
[Figure: a data request travels to memory and back; latency ~2 sec in the illustration]
Stalls
▪ A processor “stalls” (can’t make progress) when it cannot run the next instruction in an instruction stream because that instruction depends on a previous instruction that is not yet complete.
▪ Accessing memory is a major source of stalls:

    ld  r0, mem[r2]
    ld  r1, mem[r3]
    add r0, r0, r1

Dependency: the ‘add’ instruction cannot execute until the data from mem[r2] and mem[r3] has been loaded from memory.
▪ Memory access times are ~100s of cycles
- Memory “access time” is a measure of latency
What are caches?
Recall memory is just an array of values
And a processor has instructions for moving data from
memory into registers (load) and storing data from registers
into memory (store)
Caches reduce the length of stalls (i.e., they reduce memory access latency)
▪ Processors run efficiently when they access data that is resident in caches
▪ Caches reduce memory access latency when a processor accesses data that it has recently accessed*

* Caches also provide high-bandwidth data transfer


The implementation of the linear memory address space abstraction on a modern computer is complex.
The instruction “load the value stored at address X into register R0” might involve a complex sequence of operations by multiple data caches and an access to DRAM.

Common organization: a hierarchy of caches, e.g. L1 (32 KB), L2 (256 KB), L3 (20 MB), then DRAM (64 GB).
- Smaller-capacity caches nearer the processor: lower latency
- Larger-capacity caches farther away: higher latency
Data access times (Kaby Lake CPU)
Latency (number of cycles at 4 GHz):
- Data in L1 cache: 4
- Data in L2 cache: 12
- Data in L3 cache: 38
- Data in DRAM (best case): ~248
Parallel Processing- What is it?
• A parallel computer is a computer system that uses multiple
processing elements simultaneously in a cooperative manner to
solve a computational problem
• Parallel processing includes the techniques and technologies that make it possible to compute in parallel: hardware, networks, operating systems, parallel libraries, languages, compilers, algorithms, tools, …
• Parallel computing is an evolution of serial computing
• Parallelism is natural
• Computing problems differ in level / type of parallelism
Goals of Parallelism
The goals of parallelism in computing are centered around improving
performance, efficiency, and scalability. Here are some key objectives:
1. Increase Computational Speed
• Goal: Reduce the time required to complete computational tasks by dividing work among multiple processors.
• Benefit: Achieves faster processing and quicker results for complex computations.
2. Improve Resource Utilization
• Goal: Optimize the use of available computational resources, including processors, memory, and storage.
• Benefit: Enhances overall system efficiency and prevents resource underutilization.
3. Enhance Problem-Solving Capabilities
• Goal: Enable the handling of larger, more complex problems that would be infeasible for a single processor.
• Benefit: Supports advanced research, simulations, and data analysis in various fields.
4. Achieve Scalability
• Goal: Allow the system to scale by adding more processors to handle increased
workloads.
• Benefit: Ensures that the system can grow and adapt to higher demands without
significant redesign.
5. Reduce Execution Time
• Goal: Perform multiple operations simultaneously to minimize overall execution
time.
• Benefit: Increases throughput and enhances the user experience by reducing wait
times.
6. Fault Tolerance and Reliability
• Goal: Improve system reliability by distributing tasks across multiple processors, so
that failure of one processor does not halt the entire system.
• Benefit: Provides robustness and ensures continuous operation even in the
presence of hardware failures.
7. Cost-Effectiveness
• Goal: Leverage multiple, often less expensive, processors to achieve performance
comparable to a single, more expensive, high-performance processor.
• Benefit: Offers a cost-effective solution for achieving high computational power.
Concurrency vs. Parallelism
• The Art of Concurrency defines the difference as follows:

• A system is said to be concurrent if it can support two or more


actions in progress at the same time.
• A system is said to be parallel if it can support two or more
actions executing simultaneously.

• Concurrency is about dealing with lots of things at once.


Parallelism is about doing lots of things at once.
Concurrency vs. Parallelism
• Concurrent is not the same as parallel! Why?
• Parallel execution
• Concurrent tasks actually execute at the same time
• Multiple (processing) resources have to be available
• Parallelism = concurrency + “parallel” hardware
• Both are required
• Find concurrent execution opportunities
• Develop application to execute in parallel
• Run application on parallel hardware
• Is a parallel application a concurrent application?
Reading
● A Grama, A Gupta, G Karypis, V Kumar. Introduction to Parallel
Computing, Addison Wesley (2003). Chapter 1
Thank You
