
Lecture 1:

Why Parallelism?

CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)


One common definition

A parallel computer is a collection of processing elements
that cooperate to solve problems fast

▪ We care about performance*
▪ We’re going to use multiple processors to get it

* Note: different motivation from “concurrent programming” using pthreads in 15-213


DEMO 1
(15-418 Spring 2012’s first parallel program)



Speedup
One major motivation for using parallel processing: to achieve a speedup

For a fixed problem size:

Speedup(P processors) = Time(1 processor) / Time(P processors)
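
For example (hypothetical numbers, not from the demos): if the 1-processor run takes 12 s and the 4-processor run of the same problem takes 4 s, then Speedup(4 processors) = 12 / 4 = 3x, short of the ideal 4x.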



Class observations from demo 1

▪ Communication limited the maximum speedup achieved

▪ Minimizing the cost of communication improved speedup


- Moved students (“processors”) closer together (or let them shout)



DEMO 2
(scaling up to four processors)



Class observations from demo 2

▪ Imbalance in work assignment limited speedup
- Some processors ran out of work to do (went idle) while others were still working

▪ Improving the distribution of work improved speedup
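
In code, the same idea shows up as loop scheduling. A minimal sketch (assuming an OpenMP-capable C compiler; this is an illustrative example, not code from the lecture): when iterations have very different costs, a static block assignment leaves some threads idle, while a dynamic schedule hands out work on demand and keeps them busy.

/* Hypothetical example (not course code): iterations have very uneven cost.
   schedule(static) would give each thread one contiguous block, so the threads
   holding the cheap iterations finish early and sit idle; schedule(dynamic)
   hands out small chunks on demand, which balances the load. */
#include <stdio.h>
#include <omp.h>

static double fake_work(int i) {            /* stand-in task whose cost grows with i */
    double s = 0.0;
    for (int j = 0; j < i * 1000; j++) s += 1e-6;
    return s;
}

int main(void) {
    const int n = 2048;
    double total = 0.0;
    double start = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += fake_work(i);
    printf("total = %f, time = %.3f s\n", total, omp_get_wtime() - start);
    return 0;
}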



DEMO 3
(massively parallel execution)



Class observations from demo 3

▪ The problem I just gave you has a significant amount of communication compared to computation

▪ Communication costs can dominate a parallel computation, severely limiting speedup



Course theme 1:
Designing and writing parallel programs ... that scale!

▪ Parallel thinking
1. Decomposing work into parallel pieces
2. Assigning work to processors
3. Orchestrating communication/synchronization

▪ Abstractions for performing the above tasks
- Writing code in popular parallel programming languages
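
As a concrete illustration of the three steps above, here is a minimal sketch in C with pthreads (the library mentioned from 15-213; an illustrative example, not code from the course): the sum of an array is decomposed into contiguous pieces, each piece is assigned to one thread, and joining the threads and combining their partial sums is the orchestration.

/* Minimal illustrative sketch (assumes POSIX threads; not course code).
   Decomposition:  split the N array elements into P contiguous pieces.
   Assignment:     piece p is given to thread p.
   Orchestration:  pthread_join waits for every thread, then the partial
                   sums are combined into the final result. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define P 4

static float x[N];

typedef struct { int begin, end; double partial; } piece_t;

static void *sum_piece(void *arg) {
    piece_t *t = (piece_t *)arg;
    double s = 0.0;
    for (int i = t->begin; i < t->end; i++) s += x[i];
    t->partial = s;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) x[i] = 1.0f;   /* dummy data */

    pthread_t threads[P];
    piece_t pieces[P];
    int chunk = N / P;
    for (int p = 0; p < P; p++) {
        pieces[p].begin = p * chunk;
        pieces[p].end   = (p == P - 1) ? N : (p + 1) * chunk;
        pthread_create(&threads[p], NULL, sum_piece, &pieces[p]);
    }

    double sum = 0.0;
    for (int p = 0; p < P; p++) {
        pthread_join(threads[p], NULL);        /* wait, then combine */
        sum += pieces[p].partial;
    }
    printf("sum = %f\n", sum);
    return 0;
}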



Course theme 2:
Parallel computer hardware implementation: how parallel
computers work

▪ Mechanisms used to implement abstractions efficiently
- Performance characteristics of implementations
- Design trade-offs: performance vs. convenience vs. cost

▪ Why do I need to know about HW?
- Because the characteristics of the machine really matter
  (recall speed of communication issues in class demos)
- Because you care about performance (you are writing parallel programs)



Course theme 3:
Thinking about efficiency
▪ FAST != EFFICIENT

▪ Just because your program runs faster on a parallel computer, that
doesn’t mean it is using the hardware efficiently
- Is 2x speedup on 10 processors a good result?

▪ Programmer’s perspective: make use of provided machine capabilities

▪ HW designer’s perspective: choosing the right capabilities to put in the
system (performance/cost, cost = silicon area?, power?, etc.)



Logistics



Logistics
▪ Kayvon’s office hours
- Tues/Thurs 1:30-2:30 PM (right after class)
- GHC 7005

▪ TAs
- Michael Papamichael
- Mike Mu

▪ Textbook
- Culler and Singh, Parallel Computer Architecture: A Hardware/Software Approach
- Yes, it’s old. But many parts are still very good.



Logistics: assignments
▪ Four programming assignments
- First assignment individual, the rest are in pairs
- Each in a different parallel programming environment

- Assignment 1: ISPC programming on Intel quad-core CPU
- Assignment 2: OpenCL programming on NVIDIA GPUs
- Assignment 3: OpenMP programming on supercomputing cluster
- Assignment 4: MPI programming on supercomputing cluster



Logistics: final project
▪ 6-week final project
▪ Done in pairs

▪ Announcing: the first annual 418 parallelism competition!
- Non-CMU judges (from Intel, NVIDIA, etc.)
- Expect non-trivial prizes... (e.g., high-end GPUs, tablets)



Logistics: grades

40% assignments
30% exams
25% project
5% class participation



Why parallelism?



Why parallelism?
▪ The answer 10 years ago
- To get performance that was faster than what clock frequency scaling would provide
- Because if you just waited until next year, your code would run faster on the next-generation CPU

▪ Parallelizing your code was not always worth the time
- Do nothing: performance doubling ~ every 18 months



End of frequency scaling



Power wall
P = CV²F
P: power
C: capacitance
V: voltage
F: frequency

▪ Higher frequencies typically require higher voltages
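- As an illustration (made-up numbers, not from the slide): raising the clock 1.5x while also raising core voltage 1.25x increases dynamic power by 1.25² × 1.5 ≈ 2.3x, for at most a 1.5x performance gain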



Power vs. core voltage
[Figure: power vs. core voltage, Pentium M]

Credit: Shimin Chin


Programmer-invisible parallelism
▪ Bit-level parallelism
- 16-bit → 32-bit → 64-bit

▪ Instruction-level parallelism (ILP)
- Two instructions that are independent can be executed simultaneously
- “Superscalar” execution



ILP example
a"="(x*x"+"y*y"+"z*z)
ILP = 3 x*x y*y z*z

ILP = 1 +

ILP = 1 +
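
The same dependence structure written out as straight-line C (an illustrative sketch, not from the slide); a superscalar processor can issue the three independent multiplies together, but the adds form a chain:

#include <stdio.h>

int main(void) {
    float x = 1.0f, y = 2.0f, z = 3.0f;

    float xx = x * x;      /* these three multiplies have no   */
    float yy = y * y;      /* dependences on one another, so   */
    float zz = z * z;      /* they can execute in parallel     */

    float t = xx + yy;     /* must wait for xx and yy */
    float a = t + zz;      /* must wait for t         */

    printf("a = %f\n", a);
    return 0;
}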



ILP scaling
[Figure: speedup vs. instruction issue capability (0–16); the speedup axis runs 0–3]



Single core performance scaling
The rate of single-thread performance scaling has decreased (essentially to 0)

1. Frequency scaling limited by power
2. ILP scaling tapped out

No more free lunch for software developers!



Why parallelism?
▪ The answer 10 years ago
- To get performance that was faster than what clock frequency scaling would provide
- Because if you just waited until next year, your code would run faster on the next-generation CPU

▪ The answer today:
- Because it is the only way to achieve significantly higher application performance for the foreseeable future



Intel Sandy Bridge (2011)
▪ Quad core CPU + GPU



NVIDIA Fermi GPU (2009)
▪ 16 processing cores



Mobile processing
▪ Power limits heavily influencing designs

- Apple A5 (in iPhone 4S and iPad 2): dual-core CPU + GPU + image processor and more
- NVIDIA Tegra: quad-core CPU + GPU + image processor...



Supercomputing
▪ Today: clusters of CPUs + GPUs
▪ Pittsburgh Supercomputing Center: Blacklight
▪ 512 eight-core Intel Xeon processors
- 4096 total cores



Summary (what we learned)
▪ Single thread performance scaling has ended
- To run faster, you will need to use multiple processing elements
- Which means you need to know how to write parallel code

▪ Writing parallel programs can be challenging
- Problem partitioning, communication, synchronization
- Knowledge of machine characteristics is important

