Types of Parallelism
• Implicit Parallelism: The compiler, runtime, or system automatically identifies
and manages parallelism without requiring the programmer to explicitly define it.
Explicit Parallelism
• A programming approach where the developer explicitly instructs the system
where and how to parallelize operations using dedicated language features like
operators, function calls, or directives.
Task Parallelism
• In contrast to data parallelism, task parallelism breaks a problem down into independent tasks that can be executed concurrently on different processors: different operations are performed on different parts of the data.
• Task parallelism is called "coarse grain" parallelism because the computational work
is spread into just a few subtasks.
• More code is run in parallel because the parallelism is implemented at a higher level
than in data parallelism.
• Easier to implement and has less overhead than data parallelism.
• Example:
#pragma omp parallel for
for (int i = 0; i < n; i++)
a[i] = b[i] + c[i];
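• For comparison, a minimal sketch of running two different operations concurrently with OpenMP sections (the functions compute_fft and compute_stats are hypothetical placeholders):
  #pragma omp parallel sections
  {
      #pragma omp section
      compute_fft(signal, n);     /* task 1: one operation on one data set            */
      #pragma omp section
      compute_stats(samples, m);  /* task 2: a different operation, run concurrently  */
  }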
Thread-Level Parallelism
• Deals with running multiple threads within a program so that different parts of the code execute simultaneously. The programmer explicitly creates and manages the threads that run concurrently.
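• A minimal sketch using POSIX threads (the worker functions are hypothetical placeholders; error handling omitted):
  #include <pthread.h>

  void *worker_a(void *arg) { /* ... part A of the work ... */ return NULL; }
  void *worker_b(void *arg) { /* ... part B of the work ... */ return NULL; }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker_a, NULL);  /* explicitly create the threads */
      pthread_create(&t2, NULL, worker_b, NULL);
      pthread_join(t1, NULL);                     /* wait for both to finish       */
      pthread_join(t2, NULL);
      return 0;
  }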
Implicit Parallelism
• The compiler, runtime, or system automatically identifies and manages
parallelism without requiring the programmer to explicitly define it.
Instruction Level Parallelism
• Instruction Level Parallelism (ILP) is a set of techniques for executing multiple
instructions at the same time within the same CPU core (Note: ILP has
nothing to do with multicore)
• Basic Idea: Execute several instructions in parallel; i.e., overlap the execution
of instructions to run programs faster (“to improve performance”)
• The Problem: A CPU core has a lot of circuits, but at any given moment many parts remain idle. This wastes processing power.
• The Solution: Different parts of the CPU core work on different instructions simultaneously instead of waiting for one instruction to finish before starting the next. If a CPU can handle 10 operations at once, a program could run up to 10 times faster, though in reality the speedup is usually lower.
Examples of Instruction Level Parallelism
• Pipelining: Start performing an operation on one piece of data while finishing
the same operation on another piece of data – perform different stages of the
same INSTRUCTION on different sets of operands at the same time (like an
assembly line). Example: While adding two numbers, the CPU can start
preparing the next addition before finishing the first.
• Superscalar: The CPU can execute multiple different instructions at the same
time. Example: It can add, multiply, and load data simultaneously.
• Vectorization: The CPU performs the same operation on multiple data points at once instead of processing them one by one. Example: Adding two lists of numbers in one step instead of adding each number separately.
5 Stages in Executing a MIPS Instruction
• IF: Instruction Fetch, Increment Program Counter
• ID: Instruction Decode, Read Registers
• EX: Execution
  • Memory reference: Calculate Address
  • Arithmetic/logical: Perform Operation
• MEM:
  • Load: Read Data from Memory
  • Store: Write Data to Memory
• WB: Write Data Back to Register
Instruction Execution (figure)
Pipelining Execution (figure)
Superscalar Execution (figure)
Functional Units in a CPU (figure)
Super-pipelining: Superscalar + Pipeline (figure)
Vectorization
• A vector register is a register that’s made
up of many individual registers that
simultaneously perform the same
operation on multiple sets of operands,
producing multiple results. In a sense,
vectors are like operation-specific cache.
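• A hedged sketch of a vectorizable loop: the same addition is applied to every element, so the compiler can map it onto vector registers; the OpenMP simd directive is one portable way to request this (array names are illustrative):
  #pragma omp simd
  for (int i = 0; i < n; i++)
      a[i] = b[i] + c[i];   /* several elements handled per vector instruction */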
What determines the degree of ILP?
• Dependencies: property of the program
• Hazards: property of the pipeline
Types of Dependencies
• Data Dependence: When one instruction needs data from a previous
instruction before it can execute. The second instruction must wait for the first to
finish. (True Dependence)
• RAW: Read-After-Write
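• A minimal C illustration of a RAW dependence (variable names are for this sketch only):
  x = a + b;   /* S1 writes x                                  */
  y = x * 2;   /* S2 reads x, so it must wait for S1 to finish */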
Types of Dependencies
• Write-After-Write (WAW or Output Dependence): Occurs when two instructions write to the same variable. The second write cannot execute before the first completes.
• Control Dependence:
• When an instruction depends on a decision (branch/jump) from a previous instruction.
• The second instruction must wait to see if the first condition is true.
What is Dependency Analysis?
• It examines how different parts of a program depend on each other.
• Ensures that parts of the program execute in the correct order.
• Prevents incorrect results due to improper instruction execution.
• A key role of modern compilers, which optimize code execution while respecting
dependencies.
• A data dependency describes how different pieces of data affect each other.
• A control dependency describes how instruction sequences affect each other.
S1: X = A + B
S2: C = X + 1
(S2 is data dependent on S1 through X)
Loop Carried Dependency
FOR i = 2 to MAX
a[i] = a[i-1] + b[i]
END FOR
• There is no way to execute iteration i until after iteration i-1 has completed, so
this loop can’t be parallelized.
Why do we care?
• Loops are very common in many programs.
• Also, it’s easier to optimize loops than more arbitrary sequences of
instructions: when a program does the same thing over and over, it’s easier to
predict what’s likely to happen next.
• Loops are the favorite control structures of High Performance Computing. Both hardware vendors and compiler writers put great effort into optimizing loop performance using instruction-level parallelism (superscalar execution, pipelining, and vectorization).
ILP and Data Dependencies, Hazards
• HW/SW must preserve program order: the code must give the same results as if the instructions were executed sequentially, in the original order of the source program.
• Goal: Exploit parallelism by preserving program order only where it affects the
outcome of the program
Name Dependence and Hazards
• Name Dependence: When two instructions use the same variable name (register or memory location), but there is no flow of data between them associated with that variable name.
• Anti-Dependence: a later instruction S2 writes an operand that an earlier instruction S1 reads (S2 must not write it before S1 reads it)
  S1: A = X + B
  S2: X = C + D
• Output Dependence: a later instruction S2 writes an operand that an earlier instruction S1 also writes (S2 must not write it before S1 does)
  S1: X = A + B
  ...
  S2: X = C + D
• If an anti-dependence causes a hazard in the pipeline, the hazard is called a Write After Read (WAR) hazard.
• If an output dependence causes a hazard in the pipeline, the hazard is called a Write After Write (WAW) hazard.
Control Dependencies
• Every program has a well-defined flow of control that moves from instruction to
instruction. Every instruction is control dependent on some set of branches (if
condition, switch case, function calls, I/O), and, in general, these control
dependencies must be preserved to preserve program order.
IF p1 THEN
S1
IF p2 THEN
S2
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Speculation
• Types of predictors
• Branch Prediction: Guessing which path (if/else) the program will take.
• Value Prediction: Predicting future values based on past data
• Prefetching (memory access pattern prediction): Predicting which memory locations
will be needed soon and loading them early.
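• A minimal illustration of why branch prediction matters (the function and data are assumptions of this sketch): the loop below typically runs faster when data[] is sorted, because the branch becomes easy to predict.
  long sum_above_threshold(const int *data, int n) {
      long sum = 0;
      for (int i = 0; i < n; i++) {
          if (data[i] >= 128)   /* the hardware must guess this branch every iteration */
              sum += data[i];
      }
      return sum;
  }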
Limits to Pipelining
• Pipelining increases CPU efficiency by overlapping instruction execution, but it
has limitations due to hazards. Hazards prevent the next instruction from
executing in its designated clock cycle.
Why Does Order Matter?
• Dependencies can affect whether we can execute a particular part of the
program in parallel.
• If we cannot execute that part of the program in parallel, then it’ll be SLOW.
Dependencies in Loops
• Dependencies in loops are easy to understand if the loops are unrolled; the dependences are then between statement "instances".
FOR i = 1 to n
  S1: a[i] = b[i] + 1
  S2: c[i] = a[i] + 2
END FOR
• Unrolled view (iterations 1, 2, 3, 4, ...): in every iteration, the instance of S2 depends on the instance of S1 from the same iteration; no dependence crosses iterations.
Dependencies in Loops
• A little more complex example:
FOR i = 1 to n
  S1: a[i] = b[i] + 1
  S2: c[i] = a[i-1] + 2
END FOR
• Unrolled view: the instance of S2 in iteration i depends on the instance of S1 from iteration i-1, so the dependence crosses iteration boundaries (it is loop carried).
Dependencies in Loops
• Even more complex example:
FOR i = 1 to n
  S1: a = b[i] + 1
  S2: c[i] = a + 2
END FOR
• Unrolled view: in each iteration the instance of S2 depends on that iteration's instance of S1 through the scalar a, and because every iteration writes a, the reuse of the name a also creates dependences between iterations.
Branch Dependency
y = 7
IF (x != 0) THEN
y = 1.0/x
END IF
Optimizing with Dependencies
• It is valid to parallelize/vectorize a loop if no dependences cross its iteration
boundaries:
FOR i = 1 to n
  S1: a[i] = b[i] + 1
  S2: c[i] = a[i] + 2
END FOR
• Unrolled view: each iteration's S2 depends only on that same iteration's S1, so the iterations are independent of one another (see the OpenMP sketch below).
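• A hedged sketch of exploiting this: because no dependence crosses iteration boundaries, the loop can be distributed across threads (OpenMP shown; the arrays are assumed to be distinct):
  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
      a[i] = b[i] + 1;   /* S1 */
      c[i] = a[i] + 2;   /* S2 uses only this iteration's a[i] */
  }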
Loop or Branch Dependency?
• Is this a loop carried dependency or a branch dependency?
FOR i = 1 to MAX
IF (x[i] != 0)
y[i] = 1.0/x[i]
END IF
END FOR
Loop or Branch Dependency?
• Is this a loop carried dependency or a branch dependency?
FOR i = 1 to MAX
IF (x[i] != 0)
y[i] = 1.0/x[i]
END IF
END FOR
Answer:
• The given code has a branch dependency due to the conditional (IF x[i] != 0).
• However, it does not have a loop-carried dependency since each iteration
is independent.
Compiler-Driven Parallelism
• Tricks that compilers play for automatic parallelization
• Scalar optimizations
• Loop-level optimizations
Scalar Optimizations
• Copy Propagation
• Constant Folding
• Dead Code Removal
• Strength Reduction
• Common Subexpression Elimination
• Variable Renaming
• Not every compiler does all of these, so it can sometimes be worth doing them by hand.
Copy Propagation
Before:
  x = y
  z = 1 + x

After:
  x = y
  z = 1 + y

• After propagation, no data dependency remains between the two statements.
Constant Folding
• Notice that sum is actually the sum of two constants, so the compiler can
precalculate it, eliminating the addition that otherwise would be performed at
runtime.
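• A minimal illustration (the Before/After code is not preserved on this slide, so the variable names and values below are assumptions):
  Before:
    int add = 100;
    int aug = 200;
    int sum = add + aug;
  After (what the compiler effectively emits, folding 100 + 200 at compile time):
    int sum = 300;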
Dead Code Removal
Before:
  var = 5
  PRINT *, var
  STOP
  PRINT *, var * 2

After:
  var = 5
  PRINT *, var
  STOP
• Since the last statement never executes, the compiler can eliminate it.
Strength Reduction
Before:
  x = y^2.0
  a = c / 2.0

After:
  x = y * y
  a = c * 0.5
• Raising one value to the power of another, or dividing, is more expensive than
multiplying. If the compiler can tell that the power is a small integer, or that the
denominator is a constant, it’ll use multiplication instead.
Common Subexpression Elimination
Before:
  d = c * (a / b)
  e = (a / b) * 2.0

After:
  adivb = a / b
  d = c * adivb
  e = adivb * 2.0

Variable Renaming
Before:
  x = y * z
  q = r + x * 2.0
  x = a + b

After:
  x0 = y * z
  q = r + x0 * 2.0
  x = a + b

• The original code has an output dependency (both the first and last statements write x), while the new code doesn't – but the final value of x is still correct.
Loop Optimizations
• Hoisting Loop Invariant Code
• Unswitching
• Iteration Peeling
• Index Set Splitting
• Loop Interchange
• Unrolling
• Loop Fusion
• Loop Fission
• Inlining
• Not every compiler does all of these, so it can sometimes be worth doing them by hand.
Hoisting Loop Invariant Code
• The code that does not change inside the loop is known as loop invariant. It
doesn’t need to be calculated over and over.
Before:
  FOR i = 1 to n
    a[i] = b[i] + c * d
    e = g(n)
  END FOR

After:
  temp = c * d
  FOR i = 1 to n
    a[i] = b[i] + temp
  END FOR
  e = g(n)
Unswitching
The condition is j-independent, so it can migrate outside the j loop.

Before:
  FOR i = 1 to n
    FOR j = 2 to n
      IF (t[i] > 0) THEN
        a[i][j] = a[i][j] * t[i] + b[j]
      ELSE
        a[i][j] = 0.0
      END IF
    END FOR
  END FOR

After:
  FOR i = 1 to n
    IF (t[i] > 0) THEN
      FOR j = 2 to n
        a[i][j] = a[i][j] * t[i] + b[j]
      END FOR
    ELSE
      FOR j = 2 to n
        a[i][j] = 0.0
      END FOR
    END IF
  END FOR
Iteration Peeling
Before:
  FOR i = 1 to n
    IF ((i == 1) OR (i == n)) THEN
      x[i] = y[i]
    ELSE
      x[i] = y[i + 1] + y[i - 1]
    END IF
  END FOR

After:
  x[1] = y[1]
  FOR i = 2 to n-1
    x[i] = y[i + 1] + y[i - 1]
  END FOR
  x[n] = y[n]
Index Set Splitting
Before:
  FOR i = 1 to n
    a[i] = b[i] + c[i]
    IF (i > 10) THEN
      d[i] = a[i] + b[i - 10]
    END IF
  END FOR

After:
  FOR i = 1 to 10
    a[i] = b[i] + c[i]
  END FOR
  FOR i = 11 to n
    a[i] = b[i] + c[i]
    d[i] = a[i] + b[i - 10]
  END FOR
• Note that this is a generalization of peeling.
Loop Interchange
Before:
  FOR j = 1 to nj
    FOR i = 1 to ni
      a[i][j] = b[i][j]
    END FOR
  END FOR

After:
  FOR i = 1 to ni
    FOR j = 1 to nj
      a[i][j] = b[i][j]
    END FOR
  END FOR

• The array elements a[i][j] and a[i][j+1] are near each other in memory, while a[i+1][j] may be far away, so it makes sense to make the i loop the outer loop.
Unrolling
Before:
  FOR i = 1 to n
    a[i] = a[i] + b[i]
  END FOR

After:
  FOR i = 1, n, 4
    a[i]   = a[i]   + b[i]
    a[i+1] = a[i+1] + b[i+1]
    a[i+2] = a[i+2] + b[i+2]
    a[i+3] = a[i+3] + b[i+3]
  END FOR

• Unrolling creates multiple operations that typically load from the same, or adjacent, cache lines.
• So, an unrolled loop has more operations without increasing the memory accesses by much.
Loop Fusion
Before:
  FOR i = 1 to n
    a[i] = b[i] + 1
  END FOR
  FOR i = 1 to n
    c[i] = a[i] / 2
  END FOR
  FOR i = 1 to n
    d[i] = 1 / c[i]
  END FOR

After:
  FOR i = 1 to n
    a[i] = b[i] + 1
    c[i] = a[i] / 2
    d[i] = 1 / c[i]
  END FOR

• As with unrolling, this has fewer branches. It also has fewer total memory references.
Loop Fission
Before:
  FOR i = 1 to n
    a[i] = b[i] + 1
    c[i] = a[i] / 2
    d[i] = 1 / c[i]
  END FOR

After:
  FOR i = 1 to n
    a[i] = b[i] + 1
  END FOR
  FOR i = 1 to n
    c[i] = a[i] / 2
  END FOR
  FOR i = 1 to n
    d[i] = 1 / c[i]
  END FOR
• Fission reduces the cache footprint and the number of operations per iteration.
To Fuse or to Fizz?
• The question of when to perform fusion versus when to perform fission, like many optimization questions, is highly dependent on the application, the platform, and a lot of other issues that can get very complicated.
• That’s why it’s important to examine the actual behavior of the executable.
Inlining
Before:
  FOR i = 1 to n
    a[i] = func(i)
  END FOR

  FUNCTION func(x)
    func = x * 3
    RETURN func

After:
  FOR i = 1 to n
    a[i] = i * 3
  END FOR
Transforming Programs: Renaming
Before:
  S1: A = X + B
  S2: X = Y + 1
  S3: C = X + B
  S4: X = Z + B
  S5: D = X + 1

After:
  S1: A = X + B
  S2: X1 = Y + 1
  S3: C = X1 + B
  S4: X2 = Z + B
  S5: D = X2 + 1
Transform Programs: Scalar Expansion
Scalar expansion is an optimization technique that transforms a loop containing a scalar dependency into an equivalent
loop using an array version of the scalar variable. This eliminates dependencies and improves parallelization.
Before:
  FOR i = 1 to n
    S1: a = b[i] + 1
    S2: c[i] = a + d[i]
  END FOR

After:
  FOR i = 1 to n
    S1: a1[i] = b[i] + 1
    S2: c[i] = a1[i] + d[i]
  END FOR
  a = a1[n]
Transform Programs: Scalar Expansion
• The rule is simple: it is valid to expand a scalar if it is always assigned before
it is used within the body of the loop and if either
• its final value at the end of the loop is known, or
• it is never used after the loop.
Final Message:
• Implicit parallelism is becoming increasingly valuable as hardware and compiler capabilities advance.
• Try to get the maximum of what a compiler can do for you automatically (LEARN the compiler flags/options/directives). But keep checking whether your BULL (the compiler) is really pulling your cart FAST (not just dancing), and in the RIGHT direction.
• While optimizing code, use profiling to find the HOT SPOTS (bottlenecks) and then focus on them one by one, always ensuring correctness. The 10-90 rule often applies. Algorithm transformation can also be considered.
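• A hedged example of the kind of flags and workflow meant here, using GCC and common Linux tools (the exact options depend on your compiler and code):
  gcc -O3 -march=native -fopt-info-vec prog.c -o prog            # optimize and report which loops were vectorized
  gcc -O2 -pg prog.c -o prog && ./prog && gprof prog gmon.out    # compile for profiling, run, then list the hot spots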