Velammal Engineering College
Department of Computer Science and Engineering

Welcome…

Mr. A. Arockia Abins & Ms. R. Amirthavalli,
Asst. Prof., CSE, Velammal Engineering College

Slide Sources: Patterson & Hennessy COD book website (copyright Morgan Kaufmann), adapted and supplemented

Subject Code / Name: 19IT202T / Computer Architecture
Syllabus – Unit IV
UNIT-IV PARALLELISM
Introduction to Multicore processors and other shared memory multiprocessors - Flynn's classification: SISD, MIMD, SIMD, SPMD and Vector - Hardware multithreading: Fine-grained, Coarse-grained and Simultaneous Multithreading (SMT) - GPU architecture: NVIDIA GPU Architecture, NVIDIA GPU Memory Structure

Topics:
• Introduction to Multicore processors
• Other shared memory multiprocessors
• Flynn’s classification:
  o SISD,
  o MIMD,
  o SIMD,
  o SPMD and Vector
• Hardware multithreading
• GPU architecture
Introduction to Multicore processors

Multicore processors
• What is a Processor?
  o A single chip package that fits in a socket
  o Cores can have functional units, caches, etc. associated with them
• The main goal of multicore design is to increase processing power by placing several computing units on one chip.
• A multicore processor is a single computing component with two or more “independent” processors (called “cores”).
• Also known as a chip multiprocessor (CMP).
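A quick way to observe multiple cores from software is to ask the runtime how many hardware threads it sees. Below is a minimal C++ sketch (assuming a C++11 compiler; note that std::thread::hardware_concurrency() counts logical processors, so SMT threads are included):

```cpp
// Minimal sketch: query the logical-processor count and run one worker per core.
#include <iostream>
#include <thread>
#include <vector>

int main() {
    unsigned n = std::thread::hardware_concurrency();  // may return 0 if unknown
    std::cout << "Logical processors: " << n << "\n";

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < (n ? n : 1); ++i)
        workers.emplace_back([] {
            // Each worker runs on its own OS thread; the scheduler
            // typically spreads the threads across the available cores.
        });
    for (auto& t : workers) t.join();  // wait for all workers to finish
}
```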
Examples
• Dual-core processor (2 cores): e.g., AMD Phenom II X2, Intel Core 2 Duo E8500
• Quad-core processor (4 cores): e.g., AMD Phenom II X4, Intel Core i5 2500T
• Hexa-core processor (6 cores): e.g., AMD Phenom II X6, Intel Core i7 Extreme Ed. 980X
• Octa-core processor (8 cores): e.g., AMD FX-8150, Intel Xeon E7-2820
[Figure: Processor]
[Figure: Single core]
[Figure: Multicore]
Number of core types
Homogeneous (symmetric) cores:
• All of the cores in a homogeneous multicore processor are of the same type; typically the cores are general-purpose central processing units running a single multicore operating system.
• Example: Intel Core 2

Heterogeneous (asymmetric) cores:
• Heterogeneous multicore processors mix core types that often run different operating systems, and may include graphics processing units.
• Example: IBM’s Cell processor, used in the Sony PlayStation 3 video game console
[Figure: Homogeneous Multicore Processor]
[Figure: Heterogeneous Multicore Processor]
Shared memory multiprocessors

Shared Memory Multiprocessors
• A system with multiple CPUs “sharing” the same main memory is called a multiprocessor.
• In a multiprocessor system, all processes on the various CPUs share a single logical address space, which is mapped onto a physical memory that may be distributed among the processors.
• Each process can read and write a data item simply using load and store operations, and processes communicate through shared memory.
• Processors communicate through shared variables in memory, with all processors capable of accessing any memory location via loads and stores (see the sketch below).
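A minimal C++ sketch of communication through shared memory (std::atomic is used so the concurrent updates stay well-defined): several threads update one shared counter, and every thread addresses the same memory location.

```cpp
// Threads communicating through a shared variable in one address space.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::atomic<long> counter{0};          // the shared memory location
    const int kThreads = 4, kIncrements = 100000;

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&counter] {
            for (int i = 0; i < kIncrements; ++i)
                counter.fetch_add(1);      // read-modify-write on shared data
        });
    for (auto& w : workers) w.join();

    std::cout << counter << "\n";          // prints 400000: all threads saw one location
}
```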
Questions:
• Multicore processor
• Hexacore processor
• Homogeneous Multicore processor
• Heterogeneous Multicore processor
• Multiprocessor
• Shared memory Multiprocessor

• Single address space multiprocessors come in two styles:
  o Uniform Memory Access (UMA)
  o Non-Uniform Memory Access (NUMA)

UMA Architecture:
• In the first style, the latency to a word in memory does not depend on which processor asks for it. Such machines are called uniform memory access (UMA) multiprocessors.

NUMA/DSM Architecture:
• In the second style, some memory accesses are much faster than others, depending on which processor asks for which word, typically because main memory is divided and attached to different microprocessors or to different memory controllers on the same chip.
• Such machines are called nonuniform memory access (NUMA) multiprocessors.

Types:
Shared-memory multiprocessors fall into two classes, depending on the number of processors involved, which in turn dictates the memory organization and interconnect strategy:
1. Centralized shared memory (Uniform Memory Access)
2. Distributed shared memory (Non-Uniform Memory Access)
[Figure: 1. Centralized shared memory architecture]
[Figure: 2. Distributed shared memory architecture]
Flynn’s classification

Flynn’s classification:
• In 1966, Michael Flynn proposed a classification of computer architectures based on the number of instruction streams and data streams (Flynn’s Taxonomy):
  o SISD (Single Instruction stream, Single Data stream)
  o SIMD (Single Instruction stream, Multiple Data streams)
  o MISD (Multiple Instruction streams, Single Data stream)
  o MIMD (Multiple Instruction streams, Multiple Data streams)
[Figure: Flynn's classification, simple diagrammatic representation]
SISD
• An SISD machine executes a single instruction on individual data values using a single processor.
• Based on the traditional von Neumann uniprocessor architecture: instructions are executed sequentially, one after the next.
• Until recently, most computers were of the SISD type.
• Conventional uniprocessor.

[Figure: SISD organization]
SIMD
• An SIMD machine executes a single instruction on multiple data values simultaneously using many processing units.
• Since there is only one instruction stream, each processing unit does not have to fetch and decode each instruction; a single control unit does the fetching and decoding for all of them.
• SIMD architectures include array processors.
• Data-level parallelism:
  o Parallelism achieved by performing the same operation on independent data (see the sketch below).
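As a concrete illustration of data-level parallelism, here is a minimal C++ sketch: the loop applies the same addition to independent elements, so an optimizing compiler on an x86 target (an assumption about the platform) can auto-vectorize it into SSE/AVX SIMD instructions.

```cpp
// Element-wise add: the same operation applied to independent data.
// Built with optimization (e.g., g++ -O2), this loop is a candidate
// for auto-vectorization into SIMD instructions.
#include <cstdio>

int main() {
    const int N = 8;
    float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[N];

    for (int i = 0; i < N; ++i)        // one instruction stream, many data elements
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; ++i)
        printf("%.0f ", c[i]);         // prints: 9 9 9 9 9 9 9 9
    printf("\n");
}
```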
MISD
• Each processor executes a different sequence of instructions.
• In MISD computers, multiple processing units operate on one single data stream.
• This category does not actually exist in practice; it was included in the taxonomy for the sake of completeness.

[Figure: MISD organization]
Questions:
• Uniform Memory Access (UMA)
• Non-Uniform Memory Access (NUMA)
• Centralized shared memory
• Distributed shared memory
• Flynn’s classification:

MIMD
• MIMD machines are usually referred to as multiprocessors or multicomputers.
• Unlike SIMD machines, they can execute multiple different instructions simultaneously.
• Each processor includes its own control unit, and the processors can be assigned parts of one task or entirely separate tasks.
• It has two subclasses: shared memory and distributed memory.

[Figure: MIMD organization]
Analogy of Flynn’s Classifications
• An analogy for Flynn’s classification is the check-in desk at an airport:
  o SISD: a single desk
  o SIMD: many desks and a supervisor with a megaphone giving instructions that every desk obeys
  o MIMD: many desks working at their own pace, synchronized through a central database
[Figure: Hardware categorization; SSE: Streaming SIMD Extensions]
Computer Architecture Classifications
Processor Organizations:
• SISD (Single Instruction, Single Data Stream): Uniprocessor
• SIMD (Single Instruction, Multiple Data Stream): Vector Processor, Array Processor
• MISD (Multiple Instruction, Single Data Stream)
• MIMD (Multiple Instruction, Multiple Data Stream): Shared Memory (tightly coupled), Multicomputer (loosely coupled)
Vector
• A more elegant interpretation of SIMD is called a vector architecture.
• Vector architectures pipeline the ALU to get good performance at lower cost.
• They collect data elements from memory, put them in order into a large set of registers, operate on them sequentially in registers using pipelined execution units, and then write the results back to memory.

[Figure: Structure of a vector unit containing four lanes]

Vector lane
• One or more vector functional units and a portion of the vector register file.
Questions:
• MIMD
• Examples for Flynn’s classification

Hardware multithreading

Hardware multithreading
• A thread is a lightweight process with its own instructions and data.
• Each thread has all the state (instructions, data, PC, register state, etc.) necessary to allow it to execute.
• Multithreading (MT) allows multiple threads to share the functional units of a single processor.
Hardware multithreading
• Increases utilization of a processor by switching to another thread when one thread is stalled.
• Types of multithreading:
  o Fine-grained multithreading: switch threads cycle by cycle
  o Coarse-grained multithreading: switch on a costly event (e.g., cache miss)
  o Simultaneous multithreading (SMT): instructions from multiple threads executed concurrently in the same cycle
[Figure: a 4-issue machine with instruction slots from Thread A, Thread B, Thread C and Thread D]
Fine-grained MT
Idea: switch to another thread every cycle, such that no two instructions from the same thread are in the pipeline concurrently.
Advantages
+ No need for dependency checking between instructions (only one instruction in the pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles are used for executing useful instructions from different threads
+ Improved system throughput, latency tolerance and utilization

Fine-grained MT
Idea: switch to another thread every cycle, such that no two instructions from the same thread are in the pipeline concurrently.
Disadvantages
- Extra hardware complexity: multiple hardware contexts, thread selection logic
- Reduced single-thread performance (one instruction fetched every N cycles)
- Resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)
[Figure: time stamps of single-thread execution versus fine-grained MT interleaving]
Coarse-grained MT
• Switches threads only on costly stalls, such as L2 misses.
• The processor is not slowed down by thread switching, since instructions from other threads are only issued when a thread encounters a costly stall.
• Since a CPU with coarse-grained MT issues instructions from a single thread, when a stall occurs the pipeline must be emptied, and the new thread must fill the pipeline before instructions can complete.

Advantages:
• Thread switching does not have to be essentially free, and it is much less likely to slow down the execution of an individual thread.
Disadvantage:
• Limited, due to pipeline start-up costs, in its ability to overcome throughput loss: the pipeline must be flushed and refilled on thread switches.
[Figure: coarse-grained MT execution]
Questions
• Define thread.
• What is meant by hardware multithreading?
• Types of multithreading


Simultaneous Multithreading
• Simultaneous multithreading (SMT) is a variation on MT that exploits TLP simultaneously with ILP.
• SMT is motivated by multiple-issue processors, which have more functional-unit parallelism than a single thread can effectively use.
• Multiple instructions from different threads can be issued in the same cycle.
[Figure: time stamps of single-thread execution versus SMT issue across threads]
[Figure: approaches to using the issue slots under fine-grained MT, coarse-grained MT and SMT]
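To make the slot-filling idea concrete, here is a toy C++ model (hypothetical two-thread traces, not a real pipeline) comparing how many of a 4-issue machine's slots fine-grained MT and SMT fill each cycle:

```cpp
// Toy model of issue-slot filling on a 4-issue core under two MT policies.
// ready[t][c] = instructions thread t could issue in cycle c (0 = stalled).
#include <algorithm>
#include <cstdio>

int main() {
    const int ISSUE_WIDTH = 4, CYCLES = 6;
    int ready[2][6] = {{4, 0, 0, 3, 4, 2},   // thread A stalls in cycles 1-2
                       {2, 3, 4, 1, 0, 4}};  // thread B stalls in cycle 4

    printf("cycle  fine-grained  SMT\n");
    for (int c = 0; c < CYCLES; ++c) {
        // Fine-grained MT: strict round-robin, one thread owns the whole cycle,
        // so a stalled thread wastes all four slots on its turn.
        int fg = std::min(ready[c % 2][c], ISSUE_WIDTH);
        // SMT: slots in the same cycle can be filled from either thread.
        int smt = std::min(ready[0][c] + ready[1][c], ISSUE_WIDTH);
        printf("%5d  %12d  %4d\n", c, fg, smt);
    }
}
```

In this made-up trace SMT issues at or near full width almost every cycle, while fine-grained MT loses whole cycles whenever the thread whose turn it is happens to be stalled.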
Amdahl’s law

Speedup
• Speedup measures the performance gain from parallelism. The number of PEs (processing elements) is given by n.
• Based on measured running times, S(n) = ts / tp, where
  o ts is the execution time on a single processor, using the fastest known sequential algorithm
  o tp is the execution time using a parallel processor.
• For theoretical analysis, S(n) = ts / tp, where
  o ts is the worst-case running time of the fastest known sequential algorithm for the problem
  o tp is the worst-case running time of the parallel algorithm using n PEs.
[Figure: speedup in simplest terms]
Amdahl’s law:
“The potential speedup gained by the parallel execution of a program is limited by the portion that cannot be parallelized.”
Amdahl’s law
• Let execution time before the improvement be 1, for some unit of time, and let fraction F of that time be affected by an improvement of factor S. Then:
  o Execution time after = (1 - F) + F / S
  o Overall speedup = 1 / ((1 - F) + F / S)
• For parallel execution on n processors, S = n and F is the parallelizable fraction of the program.
Question:
• When parallelizing an application, the ideal speedup is speeding up by the number of processors. What is the speedup with 8 processors if 60% of the application is parallelizable?
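Worked solution, using the speedup formula above:
Speedup = 1 / ((1 - 0.60) + 0.60/8) = 1 / (0.40 + 0.075) = 1 / 0.475 ≈ 2.1, far below the ideal speedup of 8.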
Question:
• When parallelizing an application, the ideal speedup is speeding up by the number of processors. What is the speedup with 8 processors if 80% of the application is parallelizable?
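Worked solution:
Speedup = 1 / ((1 - 0.80) + 0.80/8) = 1 / (0.20 + 0.10) = 1 / 0.30 ≈ 3.3.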
Question:
• Suppose that we are considering an enhancement that runs 10 times faster than the original machine but is usable only 40% of the time. What is the overall speedup gained by incorporating the enhancement?
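Worked solution:
Overall speedup = 1 / ((1 - 0.40) + 0.40/10) = 1 / (0.60 + 0.04) = 1 / 0.64 ≈ 1.56.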
Question
• Suppose you want to achieve a speed-up of 90 times faster with 100 processors. What percentage of the original computation can be sequential?
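Worked solution:
90 = 1 / ((1 - P) + P/100), so (1 - P) + P/100 = 1/90, i.e., 1 - 0.99P ≈ 0.0111, giving P ≈ 0.9989. The sequential portion can therefore be at most about 0.1% of the original computation.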
Question
• Suppose you want to perform two sums: one is a sum of 10 scalar variables, and one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10. For now let’s assume only the matrix sum is parallelizable. What speed-up do you get with 10 versus 40 processors?
• Next, calculate the speed-ups assuming the matrices grow to 20 by 20.
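Worked solution (assuming each addition takes one time unit):
• Total work = 10 scalar + 100 matrix additions = 110.
• With 10 processors: time = 10 + 100/10 = 20, so speedup = 110/20 = 5.5 (55% of the ideal 10).
• With 40 processors: time = 10 + 100/40 = 12.5, so speedup = 110/12.5 = 8.8 (only 22% of the ideal 40).
• For 20 by 20 matrices (410 additions total): 10 processors give 410/50 = 8.2, and 40 processors give 410/20 = 20.5. Growing the problem size raises the parallel fraction, so the speedups improve.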
Graphics processing unit (GPU)

Graphics processing unit (GPU)
• A processor optimized for 2D/3D graphics, video, visual computing and display.
• A highly parallel, highly multithreaded multiprocessor optimized for visual computing.
• Provides real-time visual interaction with computed objects via graphics, images and video.
• Heterogeneous systems: combine a GPU with a CPU.
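Since the deck pairs CPUs with NVIDIA GPUs, here is a minimal CUDA C++ sketch (assuming the CUDA toolkit and an NVIDIA GPU; error checking omitted for brevity) of the canonical vector add, where each GPU thread handles one element:

```cpp
// Minimal CUDA sketch: element-wise vector add, one GPU thread per element.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory, visible to CPU and GPU
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // grid of 256-thread blocks
    cudaDeviceSynchronize();                       // wait for the kernel to finish

    printf("c[0] = %.1f\n", c[0]);                 // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```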
[Figure: GPU hardware]
[Figure: An Introduction to the NVIDIA GPU Architecture]
[Figure: NVIDIA GPU Memory Structures]
Thank you…
