
Parallel Computing Platforms

Chieh-Sen (Jason) Huang

Department of Applied Mathematics

National Sun Yat-sen University

Thanks to Ananth Grama, Anshul Gupta, George Karypis, and Vipin
Kumar for providing slides.
Topic Overview

• Implicit Parallelism: Trends in Microprocessor Architectures

• Limitations of Memory System Performance


Scope of Parallelism

• Conventional architectures coarsely comprise a processor, a
  memory system, and a datapath.

• Each of these components presents significant performance
  bottlenecks.

• Parallelism addresses each of these components in significant
  ways.

• Different applications utilize different aspects of parallelism:
  e.g., data-intensive applications utilize high aggregate
  throughput, server applications utilize high aggregate network
  bandwidth, and scientific applications typically utilize high
  processing and memory system performance.

• It is important to understand each of these performance
  bottlenecks.
Implicit Parallelism: Trends in Microprocessor
Architectures

• Microprocessor clock speeds have posted impressive gains over
  the past two decades (two to three orders of magnitude).

• Higher levels of device integration have made available a large
  number of transistors.

• The question of how best to utilize these resources is an
  important one.

• Current processors use these resources in multiple functional
  units and execute multiple instructions in the same cycle.

• The precise manner in which these instructions are selected and
  executed provides impressive diversity in architectures.

• We shall not discuss these topics further here. (For details,
  see Grama's slides.)
Limitations of Memory System Performance

• The memory system, not processor speed, is often the bottleneck
  for many applications.

• Memory system performance is largely captured by two
  parameters: latency and bandwidth.

• Latency is the time from the issue of a memory request to the
  time the data is available at the processor.

• Bandwidth is the rate at which data can be pumped to the
  processor by the memory system.
Memory System Performance: Bandwidth and
Latency

• It is very important to understand the difference between
  latency and bandwidth.

• Consider the example of a fire hose. If the water comes out of
  the hose two seconds after the hydrant is turned on, the latency
  of the system is two seconds.

• Once the water starts flowing, if the hydrant delivers water at
  the rate of 5 gallons/second, the bandwidth of the system is
  5 gallons/second.

• If you want an immediate response from the hydrant, it is
  important to reduce latency.

• If you want to fight big fires, you want high bandwidth.


Memory Latency: An Example

Consider a processor operating at 1 GHz (1 ns clock) connected
to a DRAM with a latency of 100 ns (no caches). Assume that the
processor has two multiply-add units and is capable of executing
four instructions in each cycle of 1 ns. The following observations
follow:

• The peak processor rating is 4 GFLOPS.

• Since the memory latency is equal to 100 cycles and the block
  size is one word, every time a memory request is made, the
  processor must wait 100 cycles before it can process the data.
Memory Latency: An Example

On the above architecture, consider the problem of computing a
dot-product of two vectors.

• A dot-product computation performs one multiply-add on a
  single pair of vector elements, i.e., each floating point
  operation requires one data fetch.

• It follows that the peak speed of this computation is limited to
  one floating point operation every 100 ns, or a speed of 10
  MFLOPS, a very small fraction of the peak processor rating!
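A minimal sketch of the loop being analyzed (the array names and
the length n are illustrative, not from the original slides):

double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   // one multiply-add per pair of fetches
    return sum;
}

On the no-cache architecture above, each access to a[i] and b[i]
must wait out the full 100 ns DRAM latency, so the loop runs at
memory speed rather than at the 4 GFLOPS peak of the processor.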
Improving Effective Memory Latency Using Caches

• Caches are small and fast memory elements between the
  processor and DRAM.

• This memory acts as a low-latency, high-bandwidth storage.

• If a piece of data is repeatedly used, the effective latency of
  this memory system can be reduced by the cache.

• The fraction of data references satisfied by the cache is called
  the cache hit ratio of the computation on the system.

• Cache hit ratio achieved by a code on a memory system often
  determines its performance.
Impact of Caches: Example

Consider the architecture from the previous example. In this
case, we introduce a cache of size 32 KB with a latency of 1 ns or
one cycle. We use this setup to multiply two matrices A and B of
dimensions 32 × 32. We have carefully chosen these numbers so
that the cache is large enough to store matrices A and B, as well
as the result matrix C.
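For concreteness, a sketch of the computation being analyzed (a
plain triply nested loop; with N = 32 the three matrices occupy
3K words, small enough to fit in the 32 KB cache):

const int N = 32;
double A[N][N], B[N][N], C[N][N];

// Naive matrix multiplication: 2*N^3 operations on O(N^2) data.
// Once A and B have been fetched, every access hits the cache.
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        double s = 0.0;
        for (int k = 0; k < N; k++)
            s += A[i][k] * B[k][j];
        C[i][j] = s;
    }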
Impact of Caches: Example (continued)

The following observations can be made about the problem:

• Fetching the two matrices into the cache corresponds to
  fetching 2K words, which takes approximately 200 µs
  (2000 × 100 ns).

• Multiplying two n × n matrices takes 2n³ operations. For our
  problem, this corresponds to 64K operations, which can be
  performed in 16K cycles (or 16 µs) at four instructions per
  cycle.

• The total time for the computation is therefore approximately
  the sum of time for load/store operations and the time for the
  computation itself, i.e., 200 + 16 µs.

• This corresponds to a peak computation rate of
  64K FLOP / 216 µs, or 303 MFLOPS.
Impact of Caches

• In our example, we had O(n²) data accesses and O(n³)
  computation. This asymptotic difference makes the above
  example particularly desirable for caches.

• Repeated references to the same data item correspond to
  temporal locality.

• Aggressive caching exploits two kinds of locality:
  1. Spatial locality: access data in contiguous blocks.
  2. Temporal locality: reuse data that is already loaded.
Impact of Caches

• Loop unrolling: splitting the running sum into four independent
  partial sums removes the single dependence chain on sum, so the
  multiply-add units can be kept busy.

  for (i = 0; i < n; i += 4) {
      sum1 += a[i]   * b[i];
      sum2 += a[i+1] * b[i+1];
      sum3 += a[i+2] * b[i+2];
      sum4 += a[i+3] * b[i+3];
  }
  sum = sum1 + sum2 + sum3 + sum4;

• Blocked matrix multiplication: operate on submatrices (blocks)
  that fit in cache, so each block is reused many times:

$$\begin{pmatrix} A_1 & A_2 \\ A_3 & A_4 \end{pmatrix}
\begin{pmatrix} B_1 & B_2 \\ B_3 & B_4 \end{pmatrix} =
\begin{pmatrix} A_1B_1 + A_2B_3 & A_1B_2 + A_2B_4 \\
A_3B_1 + A_4B_3 & A_3B_2 + A_4B_4 \end{pmatrix}$$

• Homework: Compute the FLOPS of the loop unrolling and blocked
  matrix multiplication examples.
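A sketch of the blocked multiplication above (the block size BS
and the flat row-major layout are illustrative assumptions; n is
assumed to be a multiple of BS):

// Compute C += A * B one BS x BS block at a time, so that the
// working set of the three innermost loops stays resident in cache.
const int BS = 32;
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        double s = C[i*n + j];
                        for (int k = kk; k < kk + BS; k++)
                            s += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = s;
                    }
}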
Impact of Memory Bandwidth

[Figure: the architecture of a GPU.]

• Memory bandwidth is determined by the bandwidth of the memory
  bus as well as the memory units.

• Memory bandwidth can be improved by increasing the size of
  memory blocks.
Impact of Memory Bandwidth: Example

Consider the same setup as before, except in this case, the block
size is 4 words instead of 1 word. We repeat the dot-product
computation in this scenario:

• Assuming that the vectors are laid out linearly in memory, eight
  FLOPs (four multiply-adds) can be performed in 200 cycles.

• This is because a single memory access fetches four consecutive
  words in the vector.

• Therefore, two accesses can fetch four elements of each of the
  vectors. This corresponds to a FLOP every 25 ns, for a peak
  speed of 40 MFLOPS.
Impact of Memory Bandwidth

• It is important to note that increasing the block size does not
  change the latency of the system.

• Physically, the scenario illustrated here can be viewed as a
  wide data bus (4 words or 128 bits) connected to multiple memory
  banks.

• In practice, such wide buses are expensive to construct.

• In a more practical system, consecutive words are sent on the
  memory bus on subsequent bus cycles after the first word is
  retrieved.
Impact of Memory Bandwidth

• The above examples clearly illustrate how increased bandwidth
  results in higher peak computation rates.

• The data layouts were assumed to be such that consecutive data
  words in memory were used by successive instructions (spatial
  locality of reference).

• If we take a data-layout centric view, computations must be
  reordered to enhance spatial locality of reference.
Impact of Memory Bandwidth: Example

Consider the following code fragment:

for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (i = 0; i < 1000; i++)
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];

The code fragment sums columns of the matrix b into a vector
column_sum.

In C/C++, two-dimensional arrays are stored in row-major order:
consecutive elements of a row occupy consecutive addresses, as
this small program shows.

#include <iostream>
using namespace std;

int main() {
    int a[2][3];
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 3; j++)
            cout << '\t' << &a[i][j];  // addresses step by sizeof(int) along a row
        cout << endl;
    }
}
-----------------------------------------------------------
0x7fff9987c700 0x7fff9987c704 0x7fff9987c708
0x7fff9987c70c 0x7fff9987c710 0x7fff9987c714
Impact of Memory Bandwidth: Example

• The vector column_sum is small and easily fits into the cache.

• The matrix b is accessed in column order.

• The strided access results in very poor performance.

[Figure: (a) column-major data access; (b) row-major data access.]

Multiplying a matrix with a vector: (a) multiplying
column-by-column, keeping a running sum; (b) computing each
element of the result as a dot product of a row of the matrix with
the vector.
Impact of Memory Bandwidth: Example

We can fix the above code as follows:

for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];

In this case, the matrix is traversed in row order, and
performance can be expected to be significantly better. A small
benchmark sketch comparing the two traversal orders follows.
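This sketch is illustrative (the sizes and timing harness are
assumptions, and the actual speedup depends on the cache
hierarchy), but on typical machines the row-order loop is several
times faster:

#include <chrono>
#include <iostream>
using namespace std;

const int N = 1000;
double b[N][N], column_sum[N];

// Time one traversal order: strided (column-order) accesses when
// by_column is true, unit-stride (row-order) accesses otherwise.
double time_sum(bool by_column) {
    for (int i = 0; i < N; i++) column_sum[i] = 0.0;
    auto t0 = chrono::steady_clock::now();
    if (by_column)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                column_sum[i] += b[j][i];  // stride of N words per access
    else
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                column_sum[i] += b[j][i];  // consecutive words of row j
    return chrono::duration<double>(chrono::steady_clock::now() - t0).count();
}

int main() {
    cout << "column-order: " << time_sum(true) << " s" << endl;
    cout << "row-order:    " << time_sum(false) << " s" << endl;
}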
Memory System Performance: Summary

The series of examples presented in this section illustrates the
following concepts:

• Exploiting spatial and temporal locality in applications is
  critical for amortizing memory latency and increasing effective
  memory bandwidth.

• The ratio of the number of operations to the number of memory
  accesses is a good indicator of anticipated tolerance to memory
  bandwidth.

• Memory layouts and organizing computation appropriately can
  make a significant impact on spatial and temporal locality.
