
CS326 Parallel and Distributed Computing
SPRING 2021
NATIONAL UNIVERSITY OF COMPUTER AND EMERGING SCIENCES
Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem.
◦ To be run using multiple CPUs/cores

A problem is broken into discrete parts that can be solved concurrently
◦ Each part is further broken down to a series of instructions

Instructions from each part execute simultaneously on different CPUs
Some Questions
Why are we building parallel systems?

Why do we need ever-increasing performance?

Why do we need to write parallel programs?

How do we write parallel programs? (both styles are sketched below)
◦ Task Parallelism
◦ Data Parallelism
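As a rough illustration (not from the slides), the C sketch below contrasts the two styles using OpenMP; the function names and the choice of operations are illustrative. Compile with -fopenmp.

/* Data parallelism: the same operation is applied to different parts
   of the data; OpenMP divides the loop iterations among the cores. */
void scale_array(double *a, int n, double factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}

/* Task parallelism: two different tasks (a sum and a maximum) run
   concurrently over the same data. */
void sum_and_max(const double *a, int n, double *sum, double *max) {
    double s = 0.0, m = a[0];
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 0; i < n; i++) s += a[i];

        #pragma omp section
        for (int i = 1; i < n; i++) if (a[i] > m) m = a[i];
    }
    *sum = s;
    *max = m;
}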
Parallel VS Concurrent VS Distributed
In concurrent computing, a program is one in which multiple tasks can be in progress at any instant.

In parallel computing, a program is one in which multiple tasks cooperate closely to solve a problem.

In distributed computing, a program may need to cooperate with other programs to solve a problem.
Some General Parallel Terminologies
Task/Process
◦ A logically discrete section of computational work. A task is typically a program or
program-like set of instructions that is executed by a processor.

Parallel Task
◦ A task that can be executed by multiple processors safely (yields correct results)

Serial Execution
◦ Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens on a single-processor machine. However, virtually all parallel programs have sections that must be executed serially.
Parallel Execution
◦ Execution of a program by more than one task, with each task being able to execute the same or
different statement at the same moment in time.
Shared Memory
◦ From a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus-based) access to common physical memory.
◦ In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.

Distributed Memory
◦ In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.
Communications
◦ Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.

Synchronization
◦ The coordination of parallel tasks in real time, very often associated with communications. Often
implemented by establishing a synchronization point within an application where a task may not
proceed further until another task(s) reaches the same or logically equivalent point.

◦ Synchronization usually involves waiting by at least one task, and can therefore cause a parallel
application's wall clock execution time to increase.
Parallel Overhead
◦ The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel
overhead can include factors such as:
◦ Task start-up time
◦ Synchronizations
◦ Data communications
◦ Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
◦ Task termination time

Massively Parallel
◦ Refers to the hardware that comprises a given parallel system - one having many processors. The meaning of "many" keeps increasing, but currently BG/L* pushes this number to six digits.

*Blue Gene is an IBM project aimed at designing supercomputers that can reach operating speeds in the petaFLOPS (PFLOPS) range, with low power consumption.
Scalability
◦ Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors.

◦ Factors that contribute to scalability include:
◦ Hardware - particularly memory-CPU bandwidth and network communications
◦ Application algorithm
◦ Parallel overhead
◦ Characteristics of your specific application and coding
Parallel Computer Memory Architectures
Shared Memory
Distributed Memory
Hybrid Distributed-Shared Memory
Shared Memory
Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all
memory as global address space.

Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other processors (illustrated in the sketch below).
Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
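As a rough illustration of the shared-memory programming view (not part of the slides), the C sketch below starts two POSIX threads that update one counter in a single address space; a mutex coordinates the updates, and both threads see the same memory location. The counter name and iteration count are illustrative. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

/* One counter in a single shared address space: every thread sees the
   same memory location, so updates must be coordinated with a lock. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                 /* change is visible to all threads */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* prints 200000 */
    return 0;
}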
Shared Memory : UMA vs. NUMA
Uniform Memory Access (UMA):
◦ Identical processors with equal access and access times to memory
◦ Sometimes called CC-UMA - Cache Coherent UMA.

Non-Uniform Memory Access (NUMA):


◦ Not all processors have equal access time to all memories
◦ One SMP can directly access memory of another SMP
Distributed Memory
Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.
Distributed memory systems require a communication network to connect inter-processor memory.
When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated (see the sketch below).
The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
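As a rough illustration of explicit communication (not part of the slides), the C sketch below uses MPI: rank 1 cannot read rank 0's local memory directly, so the value must be sent and received over the network. The value and message tag are illustrative.

#include <mpi.h>
#include <stdio.h>

/* Rank 0's memory cannot be read directly by rank 1, so the value is
   sent explicitly over the network.  Run with at least two processes,
   e.g. mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;               /* exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}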
Parallel Computing Platforms
Implicit Parallelism: Trends in Microprocessor Architectures
Traditionally, a sequential computer consists of:
◦ a processor, memory, and a datapath
◦ all of which present bottlenecks to performance.

Increments in clock speed are severely diluted by the limitations of memory technology.
Pipelining and Superscalar Execution
Pipelining: overlapping various stages in instruction execution
◦ fetch, schedule, decode, operand fetch, execute, store, among others
◦ The Pentium 4, which operates at 2.0 GHz, has a 20-stage pipeline.
◦ The speed of a single pipeline is ultimately limited by the largest atomic task in the pipeline.
◦ In typical instruction traces, every fifth to sixth instruction is a branch instruction.
◦ This requires effective techniques for predicting branch destinations so that pipelines can be speculatively filled; a misprediction is costly.

Use multiple pipelines. The ability of a processor to issue multiple instructions in the same cycle is referred to as superscalar execution.
◦ During each clock cycle, multiple instructions are piped into the processor in parallel.
◦ These instructions are executed on multiple functional units.
A number of issues need to be resolved with superscalar execution:
◦ Data Dependency
◦ Resource Dependency
◦ Branch/Procedural Dependency

The processor needs the ability to issue instructions out of order to accomplish the desired reordering.
◦ The parallelism available with in-order issue of instructions can be highly limited.

Most current microprocessors are capable of out-of-order issue and completion.
◦ This model, also referred to as dynamic instruction issue, exploits maximum instruction-level parallelism.
Very Long Instruction Word (VLIW) Processors
The parallelism extracted by superscalar processors is often limited by the instruction look-ahead.

Instructions that can be executed concurrently are packed into groups and parceled off to the processor as a single long instruction word, to be executed on multiple functional units at the same time.
◦ The compiler has a larger context from which to select instructions.

The performance of VLIW processors is very sensitive to the compiler's ability to detect data and resource dependencies and read and write hazards, and to schedule instructions for maximum parallelism.
Limitation of Memory System Performance
The effective performance of a program on a computer relies not just on the speed of the processor but also on the ability of the memory system to feed data to the processor.

Latency: the time from issuing a memory request to receiving the requested word.
◦ The rate at which data can be pumped from the memory to the processor determines the bandwidth of the memory system.
Example
A processor operating at 1 GHz (1 ns clock) is connected to a DRAM with a latency of 100 ns (no caches).

The processor is capable of executing four instructions in each cycle of 1 ns (a peak of 4 GFLOPS).

Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data. If each operation needs a fresh word from memory, the effective rate is therefore at most one operation per 100 ns, i.e. 10 MFLOPS, far below the peak.
Improving Effective Memory Latency Using Caches
Cache: a low-latency high-bandwidth storage between the processor and the
DRAM.
◦ The data needed by the processor is first fetched into the cache. All subsequent accesses to
data items residing in the cache are serviced by the cache.

◦ Thus, in principle, if a piece of data is repeatedly used, the effective latency of this memory
system can be reduced by the cache.

◦ The fraction of data references satisfied by the cache is called the cache hit ratio of the
computation on the system.

◦ Data reuse is critical for cache performance because if each data item is used only once, it
would still have to be fetched once per use from the DRAM
◦ The cache consists of m blocks, called lines.
◦ In referring to the basic unit of the cache, the term line is used rather than the term block.

◦ Line size: the length of a line.
◦ Line size may be as small as 32 bits (one word).

◦ The number of lines is considerably less than the number of main memory blocks.
Cache Read Operation (figure)
Impact of Memory Bandwidth
One commonly used technique to improve memory bandwidth is to increase the size of the memory blocks.
◦ Consider again a memory system with a single-cycle cache and 100-cycle-latency DRAM, with the processor operating at 1 GHz.
◦ If the block size is one word, the processor takes 100 cycles to fetch each word.
◦ If the block size is increased to four words, the processor can fetch a four-word cache line every 100 cycles.
◦ Increasing the block size from one to four words did not change the latency of the memory system. However, it increased the bandwidth four-fold.
◦ Another way of quickly estimating performance bounds is to estimate the cache hit ratio.

Temporal and Spatial Locality of Data:
◦ Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration.

◦ Spatial locality (also termed data locality) refers to the use of data elements within relatively close storage locations.
◦ Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as when traversing the elements of a one-dimensional array. (A sketch of spatial locality in practice follows below.)
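As a rough illustration (not part of the slides), the C sketch below sums the same two-dimensional array twice. The row-wise loop walks memory contiguously and uses every word of each fetched cache line; the column-wise loop jumps a whole row between accesses and wastes most of each line, lowering the hit ratio. The array size is illustrative.

#define N 2048
static double a[N][N];   /* C stores this array row by row (row-major) */

/* Row-wise traversal: consecutive accesses touch adjacent addresses,
   so every word of each fetched cache line is used (good spatial
   locality, high hit ratio). */
double sum_rowwise(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal of the same array: consecutive accesses are a
   whole row apart, so most of each fetched line goes unused (poor
   spatial locality, low hit ratio) even though the arithmetic is
   identical. */
double sum_colwise(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}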
Alternate Approaches for Hiding Memory Latency
Imagine sitting at your computer browsing the web during peak network traffic hours.
The lack of response from your browser can be alleviated using one of three simple approaches:
◦ Anticipate which pages we are going to browse ahead of time and issue requests for them in advance: prefetching
◦ Open multiple browsers and access different pages in each browser; while we are waiting for one page to load, we can be reading another: multithreading
◦ Access a whole bunch of pages in one go, amortizing the latency across the various accesses: spatial locality
Multithreading for Latency Hiding
◦ Consider a loop whose iterations each compute an independent dot-product. Because each dot-product is independent of the others, and therefore represents a concurrent unit of execution, the loop may be rewritten so that each dot-product is computed by a separate thread, as in the sketch below:
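A minimal sketch in C with POSIX threads, assuming the loop in question is a matrix-vector multiply in which each row's dot-product with the vector is independent; the matrix, sizes, and function names are illustrative, not taken from the slides. Compile with -pthread.

#include <pthread.h>

#define N 256            /* rows/columns; one thread per row, for illustration only */

static double a[N][N], b[N], c[N];

/* Each thread computes one independent dot-product c[row] = a[row] . b.
   While one thread is stalled on a memory access, another thread whose
   operands have already arrived can run, hiding part of the latency. */
static void *row_dot_product(void *arg) {
    int row = *(int *)arg;
    double s = 0.0;
    for (int j = 0; j < N; j++)
        s += a[row][j] * b[j];
    c[row] = s;
    return NULL;
}

void matvec_threaded(void) {
    pthread_t tid[N];
    int rows[N];
    for (int i = 0; i < N; i++) {
        rows[i] = i;
        pthread_create(&tid[i], NULL, row_dot_product, &rows[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
}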
◦ Multithreaded processors are capable of maintaining the contexts of a number of threads of computation with outstanding requests (memory accesses, I/O, or communication requests) and executing them as the requests are satisfied.

◦ Such machines rely on multithreaded processors that can switch the context of execution in every cycle.
◦ They are able to hide latency effectively, provided there is enough concurrency (threads) to keep the processor from idling.
Prefetching for Latency Hiding
In a typical program, a data item is loaded and used by a processor
in a small time window. If the load results in a cache miss, then the
use stalls.
A simple solution to this problem is to advance the load operation so that even if there is a cache miss, the data is likely to have arrived by the time it is used (a sketch follows below).
However, if the data item has been overwritten between load and
use, a fresh load is issued.
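As a rough illustration (not from the slides), the C sketch below uses the GCC/Clang __builtin_prefetch intrinsic to issue the fetch for a[i + DIST] several iterations before that element is used; the prefetch distance is an assumed, machine-dependent tuning parameter. Hardware prefetchers and compiler-inserted prefetches achieve the same effect automatically in many cases.

#define DIST 16   /* prefetch distance in elements; machine-dependent, assumed here */

/* The prefetch for a[i + DIST] is issued DIST iterations ahead of its
   use, so the miss latency overlaps with useful work on earlier
   elements instead of stalling the loop. */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}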
