Unit IV CA
Welcome…
19IT202T /
Computer Architecture
Syllabus – Unit IV
UNIT-IV PARALLELISM
Introduction to Multicore processors
and other shared memory
multiprocessors - Flynn's classification:
SISD, MIMD, SIMD, SPMD and
Vector - Hardware multithreading: Fine-
grained, Coarse-grained and
Simultaneous Multithreading (SMT) -
GPU architecture: NVIDIA GPU
Architecture, NVIDIA GPU Memory
Structure
Topics:
• Introduction to Multicore processors
• Other shared memory multiprocessors
• Flynn’s classification:
o SISD,
o MIMD,
o SIMD,
o SPMD and Vector
• Hardware multithreading
• GPU architecture
Introduction to Multicore
processors
Multicore processors
• What is a Processor?
o A single chip package that fits in a socket
o Cores can have functional units, cache, etc.
associated with them
• The main goal of multicore design is to
increase processing power by providing
multiple computing units on a single chip.
• A multicore processor is a single computing
component with two or more “independent”
processors (called "cores").
• Also known as a chip multiprocessor (CMP)
EXAMPLES
dual-core processor with 2 cores
• e.g. AMD Phenom II X2, Intel Core 2 Duo E8500
quad-core processor with 4 cores
• e.g. AMD Phenom II X4, Intel Core i5 2500T
hexa-core processor with 6 cores
• e.g. AMD Phenom II X6, Intel Core i7 Extreme Ed. 980X
octa-core processor with 8 cores
• e.g. AMD FX-8150, Intel Xeon E7-2820
Processor
Single core
Multicore
Number of core types
Homogeneous (symmetric) cores:
• All of the cores in a homogeneous multicore
processor are of the same type; typically the core
processing units are general-purpose central
processing units that run a single operating
system.
• Example: Intel Core 2
Heterogeneous Multicore Processor
• The cores in a heterogeneous multicore processor
are of different types; general-purpose CPU cores
may be combined with specialized cores such as
DSP or GPU cores.
shared memory multiprocessors
Shared Memory Multiprocessors
• A system with multiple CPUs “sharing” the
same main memory is called a multiprocessor.
• In a multiprocessor system all processes on
the various CPUs share a single logical
address space, which is mapped onto a physical
memory that may be distributed among the
processors.
• Each process can read and write a data item
simply using load and store operations, and
process communication is through shared
memory; a minimal sketch follows below.
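A minimal sketch of this idea in C++ host code (illustrative, not from the slides; the variable names are made up): one thread communicates a value to another purely through stores and loads on the shared address space.

#include <atomic>
#include <cstdio>
#include <thread>

// Both threads see the same shared address space.
std::atomic<int>  shared_data(0);
std::atomic<bool> ready(false);

int main() {
    std::thread producer([] {
        shared_data.store(42);    // communicate via a store ...
        ready.store(true);
    });
    std::thread consumer([] {
        while (!ready.load()) {}  // ... and via loads; no message passing needed
        std::printf("consumer read %d\n", shared_data.load());
    });
    producer.join();
    consumer.join();
    return 0;
}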
Shared Memory Multiprocessors
Questions:
• Multicore processor
• Hexacore processor
• Homogeneous Multicore processor
• Heterogeneous Multicore processor
• Multiprocessor
• Shared memory Multiprocessor
• Single address space multiprocessors come in two
styles.
o Uniform Memory Access (UMA)
o Non-Uniform Memory Access (NUMA)
UMA Architecture:
• In the first style, the latency to a word in
memory does not depend on which processor
asks for it. Such machines are called uniform
memory access (UMA) multiprocessors.
NUMA/DSMA Architecture:
• In the second style, some memory accesses
are much faster than others, depending on
which processor asks for which word, typically
because main memory is divided and attached to
different microprocessors or to different memory
controllers on the same chip.
• Such machines are called nonuniform memory
access (NUMA) multiprocessors.
Types:
The shared-memory multiprocessors fall into
two classes, depending on the number of
processors involved, which in turn dictates a
memory organization and interconnect
strategy.
• They are:
1. Centralized shared memory (Uniform Memory
Access)
2. Distributed shared memory (Non-Uniform
Memory Access)
1. Centralized shared memory architecture
2. Distributed shared memory architecture
Flynn’s
classification
Flynn's classification:
SISD
• SISD machines execute a single instruction on
individual data values using a single processor.
• Based on traditional Von Neumann uniprocessor
architecture, instructions are executed
sequentially or serially, one step after the next.
• Until recently, most computers were of the SISD
type.
• Conventional uniprocessor
SISD
SIMD
• An SIMD machine executes a single instruction on
multiple data values simultaneously using many
processors.
• Since there is only one instruction stream, the
processors do not each fetch and decode
instructions. Instead, a single control unit does the
fetching and decoding for all processors.
• SIMD architectures include array processors.
SIMD
• Data level parallelism:
o Parallelism achieved by performing the same operation on
independent data (see the kernel sketch below).
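A minimal sketch of data-level parallelism, written as a CUDA kernel (illustrative; the kernel name and arguments are made up): every thread executes the same add instruction, each on its own element.

// Single instruction (add), multiple data: thread i works on element i.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique element index
    if (i < n)
        c[i] = a[i] + b[i];  // same operation, independent data
}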
MISD
• Each processor executes a different sequence of instructions.
• In MISD computers, multiple processing units operate on
a single data stream.
• This category has few, if any, practical examples; it was included in
the taxonomy for the sake of completeness.
MISD
Questions:
• Uniform Memory Access (UMA)
• Non-Uniform Memory Access (NUMA)
• Centralized shared memory
• Distributed shared memory
• Flynn’s classification:
MIMD
• MIMD machines are usually referred to as
multiprocessors or multicomputers.
• MIMD machines may execute multiple instruction
streams simultaneously, unlike SIMD machines.
• Each processor has its own control unit, and the
processors may work on parts of one task or on
separate tasks.
• It has two subclasses: Shared memory and
distributed memory
MIMD
Analogy of Flynn’s Classifications
• An analogy of Flynn’s classification is the
check-in desk at an airport
SISD: a single desk
SIMD: many desks and a supervisor with
a megaphone giving instructions that
every desk obeys
MIMD: many desks working at their own
pace, synchronized through a central
database
Hardware categorization
Structure of a vector unit containing four lanes
Vector lane
• One or more vector functional units and a portion of the vector
register file.
Questions:
• MIMD
• Examples for Flynn’s classification
Hardware
multithreading
Hardware multithreading
• A thread is a lightweight process with its own
instructions and data.
• Each thread has all the state (instructions, data,
PC, register state, etc.) necessary to allow it to
execute.
• Multithreading (MT) allows multiple threads to
share the functional units of a single processor.
Hardware multithreading
• Multithreading increases the utilization of a
processor by switching to another thread when
one thread is stalled.
• Types of Multithreading:
o Fine-grained Multithreading
• Cycle by cycle
o Coarse-grained Multithreading
• Switch on event (e.g., cache miss)
o Simultaneous Multithreading (SMT)
• Instructions from multiple threads executed concurrently in the
same cycle
4-issue machine
Coarse-grained MT switches threads only
on costly stalls, such as L2 misses.
The processor is not slowed down (by
thread switching), since instructions from
other threads will only be issued when a
thread encounters a costly stall.
Since a CPU with coarse-grained MT issues
instructions from a single thread, when a
stall occurs the pipeline must be emptied.
The new thread must fill the pipeline before
instructions will be able to complete.
Coarse-grained MT switches threads only
on costly stalls, such as L2 misses.
Advantages:
– thread switching does not need to be
essentially free, and it is much less likely to slow down
the execution of an individual thread
Disadvantage:
– limited, due to pipeline start-up costs, in its
ability to overcome throughput loss
Pipeline must be flushed and refilled on
thread switches
Coarse-grained MT
Questions
• Define thread.
• What is meant by hardware multithreading?
• Types of multithreading
[Figure: issue-slot time stamps, comparing single-thread execution with multithreaded execution on a 4-issue machine]
Approaches to use the issue slots.
Amdahl’s law
Speedup
• Speedup measures the improvement in running time
due to parallelism. The number of PEs is given by n.
• Based on running times, S(n) = ts/tp , where
o ts is the execution time on a single processor, using the fastest
known sequential algorithm
o tp is the execution time using a parallel processor.
Speedup in Simplest Terms
Amdahl’s law:
“It states that the potential speedup gained by the parallel execution
of a program is limited by the portion that cannot be parallelized.”
Amdahl’s law
• Assume the execution time before the improvement is 1, in some
unit of time; the resulting formula is given below.
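In symbols (the standard form of the law; F is the fraction of execution time that can be parallelized and n the number of processors):

Execution time after = (1 − F) + F/n
Speedup = 1 / ((1 − F) + F/n)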
Question:
• When parallelizing an application, the ideal speedup is speeding up
by the number of processors. What is the speedup with 8
processors if 60% of the application is parallelizable?
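A worked solution using Amdahl's law, with F = 0.6 and n = 8:

Speedup = 1 / ((1 − 0.6) + 0.6/8) = 1 / (0.4 + 0.075) = 1 / 0.475 ≈ 2.1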
Question:
• When parallelizing an application, the ideal speedup is speeding up
by the number of processors. What is the speedup with 8
processors if 80% of the application is parallelizable?
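A worked solution, with F = 0.8 and n = 8:

Speedup = 1 / ((1 − 0.8) + 0.8/8) = 1 / (0.2 + 0.1) = 1 / 0.3 ≈ 3.33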
Question:
• Suppose that we are considering an enhancement that runs 10
times faster than the original machine but is usable only 40% of the
time. What is the overall speedup gained by incorporating the
enhancement?
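A worked solution; the enhanced mode covers a fraction F = 0.4 of the time and is 10 times faster:

Speedup = 1 / ((1 − 0.4) + 0.4/10) = 1 / (0.6 + 0.04) = 1 / 0.64 ≈ 1.56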
Question
• Suppose you want to achieve a speed-up
of 90 times faster with 100 processors.
What percentage of the original
computation can be sequential?
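A worked solution, solving Amdahl's law for the parallelizable fraction F:

90 = 1 / ((1 − F) + F/100)
(1 − F) + F/100 = 1/90
1 − 0.99 F ≈ 0.0111, so F ≈ 0.9989

The sequential portion can therefore be at most about 0.1% of the original computation.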
Question
• Suppose you want to perform two sums: one is a sum of 10
scalar variables, and one is a matrix sum of a pair of two-
dimensional arrays, with dimensions 10 by 10. For now
let’s assume only the matrix sum is parallelizable. What
speed-up do you get with 10 versus 40 processors?
• Next, calculate the speed-ups assuming the matrices grow
to 20 by 20.
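A worked solution (the usual textbook treatment, counting each addition as one time unit and treating the 10 scalar sums as the sequential part):

Single processor: 10 + 100 = 110 time units.
10 processors: 10 + 100/10 = 20, so speedup = 110/20 = 5.5 (55% of the ideal 10).
40 processors: 10 + 100/40 = 12.5, so speedup = 110/12.5 = 8.8 (22% of the ideal 40).
For 20 by 20 matrices (400 parallelizable additions, 410 total):
10 processors give 410/(10 + 40) = 8.2, and 40 processors give 410/(10 + 10) = 20.5,
much closer to the ideal speedups.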
Graphics
processing unit
(GPU)
Graphics processing unit (GPU)
• It is a processor optimized for 2D/3D graphics, video, visual computing, and display.
• It is a highly parallel, highly multithreaded multiprocessor optimized for visual
computing.
• It provides real-time visual interaction with computed objects via graphics images
and video.
• Heterogeneous systems combine a GPU with a CPU; a sketch follows below.
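A minimal sketch of such a heterogeneous CPU + GPU program (illustrative CUDA code, not from the slides; the kernel name and array size are made up): the CPU (host) allocates and moves data, while the GPU (device) runs the data-parallel kernel.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;                            // data-parallel work on the GPU
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));   // CPU (host) memory
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));      // GPU (device) memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     // launch kernel on the GPU
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("h[0] = %f\n", h[0]);                // CPU consumes the result
    cudaFree(d);
    free(h);
    return 0;
}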
GPU Hardware
An Introduction to the NVIDIA GPU Architecture
NVIDIA GPU Memory
Structures
Thank you…