Multi-core Programming - Introduction

Based on slides from Intel Software College
and on Multi-Core Programming: Increasing Performance Through Software Multi-threading
by Shameem Akhter and Jason Roberts

• “We will go from putting Hyper-Threading Technology in our products to bringing dual core capability in our mainstream client microprocessors over time. For the software developers out there, you need to assume that threading is pervasive.”

Paul Otellini
Chief Executive Officer
Intel Developer Forum, Fall 2003

Concurrency – in everyday use

• User watching streaming video on a laptop in a hotel room

• Simplistic user view – just like watching broadcast TV

• Reality
  – The PC must download the streaming video data, decompress/decode it, and display it on the screen; it must also handle the streaming audio and send it to the sound card
  – The OS may be doing system tasks
  – The server must receive the broadcast, encode/compress it in near real-time, and send it to possibly thousands of users


Reality of Streaming Video

• Requires managing many independent subsystems in parallel
  – Job may be decomposed into tasks that handle different parts
  – Concurrency is what permits efficient use of system resources to maximize performance
  – Concurrency – abstraction for implementing naturally parallel applications

Concurrency in sequential systems!!

• Streaming video
  – While waiting to receive the next frame, decode the previous frame
• FTP server
  – Create a task (thread) for each user that connects (see the sketch below)
  – Much simpler and easier to maintain
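To make the thread-per-connection idea concrete, here is a minimal C++ sketch (not from the original slides); handle_client and the fixed count of four clients are illustrative placeholders for a real accept loop.

// Sketch of a thread-per-connection server: each "connection" gets its own
// thread, so the code that serves one user stays simple and sequential.
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical per-client work: a real FTP server would read commands from
// the client's socket and send back files here.
void handle_client(int client_id) {
    std::cout << "serving client " << client_id << "\n";  // output from threads may interleave
}

int main() {
    std::vector<std::thread> workers;
    for (int client_id = 0; client_id < 4; ++client_id) {  // pretend 4 users connect
        workers.emplace_back(handle_client, client_id);    // one thread per user
    }
    for (auto& t : workers) t.join();                       // wait for all sessions to finish
}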


Concurrency v Parallelism

• Parallel
  – Multiple jobs (threads) are running simultaneously on different hardware resources or processing elements (PEs)
  – Each can execute and make progress at the same time
  – Each PE can execute an instruction from a different thread simultaneously
• Concurrency
  – We often say multiple threads or processes are running on the same PE or CPU at the same time
  – But this means that the execution of the threads is interleaved in time
  – A single PE is only executing an instruction from a single thread at any particular time
• To have parallelism, concurrent execution must use multiple hardware resources (see the sketch below)
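A minimal sketch of this last point, assuming a C++11 compiler: the two threads below are concurrent by construction, but they only execute in parallel if the machine has at least two hardware PEs free.

// Sketch: two concurrent threads; whether they actually run in parallel
// depends on how many hardware processing elements exist.
#include <iostream>
#include <thread>

void count_to(long limit) {
    volatile long x = 0;           // volatile keeps the loop from being optimized away
    while (x < limit) x = x + 1;
}

int main() {
    std::cout << "hardware PEs: " << std::thread::hardware_concurrency() << "\n";
    std::thread t1(count_to, 100000000L);  // concurrent with t2 ...
    std::thread t2(count_to, 100000000L);  // ... parallel only if >= 2 PEs are free
    t1.join();
    t2.join();
}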

Concurrency vs. Parallelism

– Concurrency: two or more threads are in progress at the same time:

[Diagram: Thread 1 and Thread 2 interleaved over time on a single core]

– Parallelism: two or more threads are executing at the same time

[Diagram: Thread 1 and Thread 2 running simultaneously]

– Multiple cores needed


Multiprocessing v Multitasking

• Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system
• Multitasking is the apparent simultaneous performance of two or more tasks by a computer's CPU

Bleeding Edge of Computer Architecture

In the 1980’s, it was a Vector SMP.
Custom components throughout.

In the 1990’s, it was a massively parallel computer.
COTS CPUs, everything else custom.

… mid to late 1990’s, clusters.
COTS components everywhere.


Flynn’s Taxonomy of Parallel Computers
1972

Classify by two dimensions
• Instruction streams
• Data streams

Flynn’s Taxonomy of Parallel Computers
1972
• SISD – single instruction, single data
• Traditional sequential computers
• Instructions executed in serial manner
• MISD – multiple instruction, single data
• More a theoretical model


Flynn’s Taxonomy of Parallel Computers
1972
• SIMD – single instruction, multiple data
  • same instruction applied to data on each of many processors
  • particularly useful for signal processing, image processing, multimedia
  • original array/vector machines
  • Almost all computers today have SIMD capabilities, e.g. MMX, SSE, SSE2, SSE3, AltiVec on PowerPC
  • these provide the capability to process multiple data streams in a single clock (see the sketch below)
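As a concrete illustration (not from the original slides), here is a minimal sketch using SSE intrinsics: a single _mm_add_ps operation adds four pairs of single-precision floats at once.

// Sketch of SIMD with SSE intrinsics: one instruction operates on four
// packed floats at once (SSE is available on all x86-64 CPUs).
#include <xmmintrin.h>
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float r[4];

    __m128 va = _mm_load_ps(a);      // load 4 floats into one 128-bit register
    __m128 vb = _mm_load_ps(b);
    __m128 vr = _mm_add_ps(va, vb);  // single instruction, four additions
    _mm_store_ps(r, vr);

    std::printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
}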
Flynn’s Taxonomy of Parallel Computers
1972
• MIMD – multiple instruction, multiple data
• execute different instructions on different data
• Most common parallel platform today
• Multi-core computers


Expanded Taxonomy of Parallel Architectures
• Arranged by tightness of coupling, i.e. latency
• Systolic – special hardware implementation of algorithms, signal processors, FPGA
• Vector – pipelining of arithmetic operations (ALU) and memory bank accesses (Cray)
• SIMD (Associative) – Single Instruction Multiple Data; same instruction applied to data on each of many processors (CM-1, MPP, Staran, Aspro, Wavetracer)
• Dataflow – fine-grained asynchronous flow control depending on data precedence constraints
• PIM (processor-in-memory) – combine memory and ALU on one circuit die; gives high memory bandwidth and low latency

Expanded Taxonomy of Parallel Architectures
• MIMD (Multiple Instruction Multiple Data) – execute different instructions on different data
• MPP (Massively Parallel Processors)
  – Distributed memory (Intel Paragon)
  – Shared memory w/o coherent caches (BBN Butterfly, T3E)
  – CC-NUMA [cache-coherent non-uniform memory architecture] (HP Exemplar, SGI Origin 2000)
• Clusters – ensemble of commodity components connected by an interconnection network within a single administrative domain and usually in one room
• (Geographically) Distributed Systems – exploit available cycles (Grid, DSI, Entropia, SETI@home)


From Cray to Beowulf

• Vector computers
• Parallel computers
  – shared memory, bus based (SGI Origin 2000)
  – distributed memory, interconnection network based (IBM SP2)
• Network of Workstations (Sun, HP, IBM, DEC) – possibly shared use
  – NOW (Berkeley), COW (Wisconsin)
• PC (Beowulf) cluster – originally dedicated use
  – Beowulf (CESDIS, Goddard Space Flight Center, 1994)
  – Possibly SMP nodes

Evolution of Parallel Machines

• Originally, parallel machines had
  – custom chips (CPU), a custom bus/interconnection network, and a custom I/O system
  – a proprietary compiler or library
• More recently, parallel machines have
  – a custom bus/interconnection network and possibly I/O system
  – standard chips
  – standard compilers (f90) or libraries (MPI or OpenMP) (see the sketch below)
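As an illustrative sketch of the library route (assuming a compiler with OpenMP support, e.g. built with -fopenmp; the array size and loop body are placeholders), one pragma is enough to spread a loop across the available cores:

// Sketch: parallelizing a loop with OpenMP on standard hardware.
// Each thread (typically one per core) handles a chunk of the iterations.
#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    static double a[1000000];

    #pragma omp parallel for          // distribute iterations across cores
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0 * i;
    }

    std::printf("threads available: %d, a[42] = %g\n",
                omp_get_max_threads(), a[42]);
}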


Intel has a long track record in Parallel Computing.

[Timeline, 1985–1995: Intel Scientific founded; iPSC/1 shipped; iPSC/2 shipped; iPSC/860 shipped and wins the Gordon Bell Prize; Delta shipped – fastest computer in the world; Paragon shipped – breaks Delta records; ASCI Option Red – world’s first TFLOP; ASCI Red upgrade regains the title as the “world’s fastest computer”]
… and we were pretty good at it
We held the MP-LINPACK record* over most of the 90’s

[Chart: MP-LINPACK performance in GFLOPS, 1991–1999, for Intel MPP supercomputers (512 to 9472 CPUs): Delta (512), Paragon (3744), Paragon (6768), ASCI Red (7264), ASCI Red (9152), ASCI Red (9472); also shown: Thinking Machines Inc** CM-5 (1024 CPUs), SGI** ASCI Blue Mountain (5040 CPUs), IBM** ASCI Blue Pacific, Hitachi CP-PACS (2048 CPUs)]

* Data from the Linpack Report, CS-89-85, April 11, 1999


Moore’s Law

• Chip complexity is not proportional to the number of transistors
  • Per-transistor complexity is less in large cache arrays than in execution units
• This doesn’t mean that performance is increasing exponentially:
  e.g. PIII 500 vs. PIII 1000: speedup 2.3 with ~3x the transistors and ~2x the MHz

“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.”
– Gordon Moore, Electronics Magazine, 19 April 1965

The most popular formulation: the number of transistors on integrated circuits doubles every 12 (18) months.

Mistaken Interpretation of Moore’s Law

• Clock frequency will double every 18 to 24 months
  – Because clock frequency has been the commonly used metric of performance
  – For 40 years clock speed did approximately do this
  – No longer true
• Many other ways to improve performance
  – Instruction-level parallelism (ILP) or dynamic out-of-order execution
    • Reorder instructions to eliminate pipeline stalls
    • Increase the number of instructions executed in a single clock cycle
    • Hardware-level parallelism invisible to the programmer
  – Increase the number of physical processors (multiprocessor systems)
  – Increase the number of cores in a single chip (chip-level multiprocessing)


Parallel computing is omnipresent (ubiquitous)

• Over the next few years, all computers will in some way be parallel computers.
  – Servers
  – Laptops
  – Cell phones
• What about software?
  – Herb Sutter of Microsoft wrote in Dr. Dobb’s Journal:
    • “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”
  – Performance will no longer rapidly increase from one generation to the next as hardware improves … unless the software is parallelized

Application performance will become a competitive feature for Independent Software Vendors.
