Multi-core Programming - Introduction

Based on slides from Intel Software College
and on Multi-Core Programming: Increasing Performance Through Software Multi-threading
by Shameem Akhter and Jason Roberts

• “We will go from putting Hyper-Threading Technology in our products to bringing dual core capability in our mainstream client microprocessors over time. For the software developers out there, you need to assume that threading is pervasive.”

Paul Otellini
Chief Executive Officer
Intel Developer Forum, Fall 2003

Concurrency – in everyday use

• User watching streaming video on a laptop in a hotel room

• Simplistic user view – just like watching broadcast TV

• Reality
  – The PC must download the streaming video data, decompress/decode it, and display it on the screen; it must also handle the streaming audio and send it to the sound card
  – The OS may be doing system tasks
  – The server must receive the broadcast, encode/compress it in near real-time, and send it to possibly thousands of users


Reality of Streaming Video

• Requires managing many independent subsystems in parallel
  – Job may be decomposed into tasks that handle different parts
  – Concurrency is what permits efficient use of system resources to maximize performance
  – Concurrency – abstraction for implementing naturally parallel applications

Concurrency in sequential systems!!

• Streaming video
  – While waiting to receive the next frame, decode the previous frame
• FTP server
  – Create a task (thread) for each user that connects (see the sketch below)
  – Much simpler and easier to maintain
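To make the thread-per-connection idea concrete, here is a minimal C++ sketch (not from the original slides); handle_client and the fixed count of four clients are illustrative placeholders for a real accept loop.

// Sketch of a thread-per-connection server: each "connection" gets its own
// thread, so the code that serves one user stays simple and sequential.
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical per-client work: a real FTP server would read commands from
// the client's socket and send back files here.
void handle_client(int client_id) {
    std::cout << "serving client " << client_id << "\n";  // output from threads may interleave
}

int main() {
    std::vector<std::thread> workers;
    for (int client_id = 0; client_id < 4; ++client_id) {  // pretend 4 users connect
        workers.emplace_back(handle_client, client_id);    // one thread per user
    }
    for (auto& t : workers) t.join();                       // wait for all sessions to finish
}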


Concurrency v Parallelism

• Parallel
  – Multiple jobs (threads) are running simultaneously on different hardware resources or processing elements (PEs)
  – Each can execute and make progress at the same time
  – Each PE can execute an instruction from a different thread simultaneously
• Concurrency
  – We often say multiple threads or processes are running on the same PE or CPU at the same time
  – But this means that the execution of the threads is interleaved in time
  – A single PE is only executing an instruction from a single thread at any particular time
• To have parallelism, concurrent execution must use multiple hardware resources (see the sketch below)
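A minimal sketch of this last point, assuming a C++11 compiler: the two threads below are concurrent by construction, but they only execute in parallel if the machine has at least two hardware PEs free.

// Sketch: two concurrent threads; whether they actually run in parallel
// depends on how many hardware processing elements exist.
#include <iostream>
#include <thread>

void count_to(long limit) {
    volatile long x = 0;           // volatile keeps the loop from being optimized away
    while (x < limit) x = x + 1;
}

int main() {
    std::cout << "hardware PEs: " << std::thread::hardware_concurrency() << "\n";
    std::thread t1(count_to, 100000000L);  // concurrent with t2 ...
    std::thread t2(count_to, 100000000L);  // ... parallel only if >= 2 PEs are free
    t1.join();
    t2.join();
}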

Concurrency vs. Parallelism

– Concurrency: two or more threads are in progress at the same time:

[Diagram: Thread 1 and Thread 2 interleaved over time on a single core]

– Parallelism: two or more threads are executing at the same time

[Diagram: Thread 1 and Thread 2 running simultaneously]

– Multiple cores needed


Multiprocessing v Multitasking

• Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system
• Multitasking is the apparent simultaneous performance of two or more tasks by a computer's CPU

Bleeding Edge of Computer Architecture

In the 1980’s, it was a Vector SMP.
Custom components throughout.

In the 1990’s, it was a massively parallel computer.
COTS CPUs, everything else custom.

… mid to late 1990’s, clusters.
COTS components everywhere.


Flynn’s Taxonomy of Parallel Computers
1972

Classify by two dimensions
• Instruction streams
• Data streams

Flynn’s Taxonomy of Parallel Computers
1972
• SISD – single instruction, single data
• Traditional sequential computers
• Instructions executed in serial manner
• MISD – multiple instruction, single data
• More a theoretical model


Flynn’s Taxonomy of Parallel Computers
1972
• SIMD – single instruction, multiple data
  • same instruction applied to data on each of many processors
  • particularly useful for signal processing, image processing, multimedia
  • original array/vector machines
  • Almost all computers today have SIMD capabilities, e.g. MMX, SSE, SSE2, SSE3, AltiVec on PowerPC
  • these provide the capability to process multiple data streams in a single clock (see the sketch below)
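As a concrete illustration (not from the original slides), here is a minimal sketch using SSE intrinsics: a single _mm_add_ps operation adds four pairs of single-precision floats at once.

// Sketch of SIMD with SSE intrinsics: one instruction operates on four
// packed floats at once (SSE is available on all x86-64 CPUs).
#include <xmmintrin.h>
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float r[4];

    __m128 va = _mm_load_ps(a);      // load 4 floats into one 128-bit register
    __m128 vb = _mm_load_ps(b);
    __m128 vr = _mm_add_ps(va, vb);  // single instruction, four additions
    _mm_store_ps(r, vr);

    std::printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
}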
Flynn’s Taxonomy of Parallel Computers
1972
• MIMD – multiple instruction, multiple data
• execute different instructions on different data
• Most common parallel platform today
• Multi-core computers


Expanded Taxonomy of Parallel Architectures
• Arranged by tightness of coupling, i.e. latency
• Systolic – special hardware implementation of algorithms, signal processors, FPGA
• Vector – pipelining of arithmetic operations (ALU) and memory bank accesses (Cray)
• SIMD (Associative) – Single Instruction Multiple Data; same instruction applied to data on each of many processors (CM-1, MPP, Staran, Aspro, Wavetracer)
• Dataflow – fine-grained asynchronous flow control depending on data precedence constraints
• PIM (processor-in-memory) – combine memory and ALU on one circuit die; gives high memory bandwidth and low latency

Expanded Taxonomy of Parallel Architectures
• MIMD (Multiple Instruction Multiple Data) – execute different instructions on different data
• MPP (Massively Parallel Processors)
  – Distributed memory (Intel Paragon)
  – Shared memory w/o coherent caches (BBN Butterfly, T3E)
  – CC-NUMA [cache-coherent non-uniform memory architecture] (HP Exemplar, SGI Origin 2000)
• Clusters – ensemble of commodity components connected by an interconnection network within a single administrative domain and usually in one room
• (Geographically) Distributed Systems – exploit available cycles (Grid, DSI, Entropia, SETI@home)


From Cray to Beowulf

• Vector computers
• Parallel computers
  – shared memory, bus based (SGI Origin 2000)
  – distributed memory, interconnection network based (IBM SP2)
• Network of Workstations (Sun, HP, IBM, DEC) – possibly shared use
  – NOW (Berkeley), COW (Wisconsin)
• PC (Beowulf) cluster – originally dedicated use
  – Beowulf (CESDIS, Goddard Space Flight Center, 1994)
  – Possibly SMP nodes

Evolution of Parallel Machines

• Originally, parallel machines had
  – custom chips (CPU), a custom bus/interconnection network, and a custom I/O system
  – a proprietary compiler or library
• More recently, parallel machines have
  – a custom bus/interconnection network and possibly I/O system
  – standard chips
  – standard compilers (f90) or libraries (MPI or OpenMP) (see the sketch below)
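As an illustrative sketch of the library route (assuming a compiler with OpenMP support, e.g. built with -fopenmp; the array size and loop body are placeholders), one pragma is enough to spread a loop across the available cores:

// Sketch: parallelizing a loop with OpenMP on standard hardware.
// Each thread (typically one per core) handles a chunk of the iterations.
#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    static double a[1000000];

    #pragma omp parallel for          // distribute iterations across cores
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0 * i;
    }

    std::printf("threads available: %d, a[42] = %g\n",
                omp_get_max_threads(), a[42]);
}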


Intel has a long track record in Parallel Computing.

[Timeline, 1985–1995: Intel Scientific founded; iPSC/1 shipped; iPSC/2 shipped; iPSC/860 shipped and wins the Gordon Bell Prize; Delta shipped – fastest computer in the world; Paragon shipped – breaks Delta records; ASCI Option Red – world’s first TFLOP; ASCI Red upgrade regains the title as the “world’s fastest computer”]
… and we were pretty good at it
We held the MP-LINPACK record* over most of the 90’s

[Chart: MP-LINPACK performance in GFLOPS, 1991–1999, for Intel MPP supercomputers (512 to 9472 CPUs): Delta (512), Paragon (3744), Paragon (6768), ASCI Red (7264), ASCI Red (9152), ASCI Red (9472); also shown: Thinking Machines Inc** CM-5 (1024 CPUs), SGI** ASCI Blue Mountain (5040 CPUs), IBM** ASCI Blue Pacific, Hitachi CP-PACS (2048 CPUs)]

* Data from the Linpack Report, CS-89-85, April 11, 1999


Moore’s Law

• Chip complexity is not proportional to the number of transistors
  • Per-transistor complexity is less in large cache arrays than in execution units
• This doesn’t mean that performance is increasing exponentially:
  e.g. PIII 500 vs. PIII 1000: speedup 2.3 with ~3x the transistors and ~2x the MHz

“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.”
– Gordon Moore, Electronics Magazine, 19 April 1965

The most popular formulation: the number of transistors on integrated circuits doubles every 12 (18) months.

Mistaken Interpretation of Moore’s Law

• Clock frequency will double every 18 to 24 months
  – Because clock frequency has been the commonly used metric of performance
  – For 40 years clock speed did approximately do this
  – No longer true
• Many other ways to improve performance
  – Instruction-level parallelism (ILP) or dynamic out-of-order execution
    • Reorder instructions to eliminate pipeline stalls
    • Increase the number of instructions executed in a single clock cycle
    • Hardware-level parallelism invisible to the programmer
  – Increase the number of physical processors (multiprocessor systems)
  – Increase the number of cores in a single chip (chip-level multiprocessing)


Parallel computing is omnipresent (ubiquitous)

• Over the next few years, all computers will in some way be parallel computers.
  – Servers
  – Laptops
  – Cell phones
• What about software?
  – Herb Sutter of Microsoft wrote in Dr. Dobb’s Journal:
    • “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”
  – Performance will no longer rapidly increase from one generation to the next as hardware improves … unless the software is parallelized

Application performance will become a competitive feature for Independent Software Vendors.
