
CS-402 Parallel and Distributed Systems

Fall 2024
Lecture No. 03
Quick Review
 Describe a computation graph
 For a computation graph, how do we define Work(CG), Span(CG),
and parallelism?
 Scheduling a computation graph
o What is a greedy schedule?
o How good is a greedy schedule?
Concept of Scaling
 Strong Scaling:
Definition: Strong scaling measures how the solution time varies with the number of processors for a fixed total
problem size.
Objective: The goal is to solve the same problem faster by adding more processors.
 Example: If you have a task that takes 10 hours to complete on one processor, strong scaling would involve
using more processors to reduce this time. If you use 10 processors and the task now takes 1 hour, you have
achieved a strong scaling speedup of 10x.
 Weak Scaling:
Definition: Weak scaling measures how the solution time varies with the number of processors for a fixed
problem size per processor.
Objective: The goal is to solve larger problems in the same amount of time by adding more processors.
 Example: If you have a task that takes 1 hour to complete on one processor, weak scaling would involve
increasing the problem size proportionally as you add more processors. If you use 10 processors and the task
still takes 1 hour, but the problem size is 10 times larger, you have achieved perfect weak scaling.
Concept of Scaling
Practical Implications:
Strong Scaling: Useful when you need to reduce the time to solution for a given problem. It is
often limited by the serial portion of the task, as described by Amdahl’s Law.
Weak Scaling: Useful when you need to handle larger problems as you add more resources. It
is often described by Gustafson’s Law, which suggests that the overall speedup can increase
with the problem size.
Visualization:
Strong Scaling: Imagine you have a pie, and you want to eat it faster by inviting more friends
to help. The pie size remains the same, but you finish it quicker.
Weak Scaling: Imagine you have a pie, and as more friends join, you bake a larger pie so that
everyone gets the same amount of pie in the same amount of time.
Amdahl’s Law
 Amdahl’s Law, formulated by computer scientist Gene Amdahl in 1967, is a principle used
to predict the theoretical maximum speedup of a task when only part of the task can be
parallelized. It is particularly relevant in the context of parallel computing.
 Key Points of Amdahl’s Law:
1. Speedup Calculation: Amdahl’s Law provides a formula to calculate the speedup of a task
based on the proportion of the task that can be parallelized. The formula is:

S = 1 / ((1 − P) + P / N)

where:
 ( S ) is the overall speedup.
 ( P ) is the proportion of the task that can be parallelized.
 ( N ) is the number of processors.
Amdahl’s Law
2. Limitation by Serial Portion: The law highlights that the speedup is limited by the portion
of the task that cannot be parallelized. Even if you use an infinite number of processors, the
maximum speedup is constrained by the serial part of the task.
3. Example: If 90% of a task can be parallelized (( P = 0.9 )), and you use 10 processors
(( N = 10 )), the speedup ( S ) would be:

S = 1 / ((1 − 0.9) + 0.9 / 10) = 1 / (0.1 + 0.09) = 1 / 0.19 ≈ 5.26

This means the task would be approximately 5.26 times faster with 10 processors.
4. Practical Implications: Amdahl’s Law is used to understand the limitations of parallel
processing and to make decisions about optimizing system performance. It shows that
adding more processors to speed up the parallelizable portion of a task yields diminishing
returns, because the serial portion increasingly dominates.
Amdahl’s Law (fixed size speedup, strong scaling)
 Given a program, let f be the fraction that must be sequential and 1 − f be
the fraction that can be parallelized.
 T(P) = f · T(1) + (1 − f) · T(1) / P
 speedup(P) = T(1) / T(P) = T(1) / (f · T(1) + (1 − f) · T(1) / P) = 1 / (f + (1 − f) / P)
 When P → ∞, speedup(P) → 1 / f
 Original paper: Amdahl, Gene M. (1967). "Validity of the Single Processor
Approach to Achieving Large-Scale Computing Capabilities". AFIPS
Conference Proceedings (30): 483–485.
Amdahl’s law
Amdahl’s law: As P increases, the fraction of time spent in the parallel region
shrinks, so performance is more and more dominated by the sequential
region.

[Figure: stacked execution-time bars for P = 1, 2, and 4 — the parallel portion shrinks with P while the sequential portion stays fixed.]

Implication of Amdahl’s Law
 For strong scaling, the speedup is
bounded by the sequential fraction of
the program, not by the number of
processors!
 Strong scaling will be hard to achieve
for many programs.
Gustafson’s Law (scaled speedup, weak scaling)
 Large-scale parallel/distributed systems are expected to allow for
solving problems faster or solving larger problems.
o Amdahl’s Law indicates that there is a limit on how much faster it can go.
o How about bigger problems? This is what Gustafson’s Law sheds light on!
 In Amdahl’s law, as the number of processors increases, the amount
of work in each node decreases (more processors sharing the
parallel part).
 In Gustafson’s law, as the number of processors increases, the
amount of work in each node remains the same (doing more work
collectively).
Gustafson’s law
Gustafson’s law: As P increases, the total work on each process remains the
same. So the total work increases with P.

[Figure: execution-time bars for P = 1, 2, and 4 — the per-process work is constant, so the total work grows with P.]

Gustafson’s Law (scaled speedup, weak scaling)
 The work on each processor is 1 (f is the fraction that is sequential,
1 − f is the fraction that is parallel).
 With P processors (each still taking time T(P) = 1), the total amount of useful
work is f + (1 − f) · P. Thus, T(1) = f + (1 − f) · P.
 Thus, speedup(P) = T(1) / T(P) = f + (1 − f) · P.
No. of PEs | Strong scaling speedup  | Weak scaling speedup
           | (Amdahl’s law, f = 10%) | (Gustafson’s law, f = 10%)
2          | 1.82                    | 1.9
4          | 3.08                    | 3.7
8          | 4.71                    | 7.3
16         | 6.40                    | 14.5
100        | 9.17                    | 90.1
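As a sanity check on the table, here is a minimal C++ sketch — our own illustration, not part of the lecture — that evaluates both formulas:

#include <cstdio>

// Strong scaling (Amdahl): speedup(P) = 1 / (f + (1 - f) / P)
double amdahl(double f, int p) { return 1.0 / (f + (1.0 - f) / p); }

// Weak scaling (Gustafson): speedup(P) = f + (1 - f) * P
double gustafson(double f, int p) { return f + (1.0 - f) * p; }

int main() {
    const double f = 0.10;                 // sequential fraction
    const int pes[] = {2, 4, 8, 16, 100};  // numbers of processing elements
    for (int p : pes)
        std::printf("P = %3d  strong = %5.2f  weak = %6.1f\n",
                    p, amdahl(f, p), gustafson(f, p));
    return 0;
}

Running it shows the strong-scaling column flattening out toward 1/f = 10 while the weak-scaling column keeps growing with P.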
Implication of Gustafson’s law
 For weak scaling, speedup(P) = f + (1 − f) · P
o Speedup is now proportional to P.
 Scalability is much better when the problem size can increase.
o Many applications can use more computing power to solve larger problems
 Weather prediction, large deep learning models.
 Gustafson, John L. (May 1988). "Reevaluating Amdahl's Law".
Communications of the ACM. 31 (5): 532–533.
Single Thread Performance and Loop optimizations
 Architecture features of a modern CPU core
 Locality and array reference pattern
 Dependence
 Loop optimizations
Single Thread Performance and Loop optimizations
Improving single-thread performance and optimizing loops are crucial for enhancing the
efficiency of programs, especially in scenarios where parallelism isn’t feasible. Here are some
key strategies:
Single-Thread Performance

i. Efficient Algorithms: Choose the most efficient algorithms for the task. Sometimes, a more
complex algorithm can significantly reduce execution time compared to a simpler one.
ii. Data Structures: Use appropriate data structures that offer the best performance for the
operations you need. For example, using a hash table for quick lookups instead of a list (see
the sketch after this list).
iii. Memory Access Patterns: Optimize memory access patterns to take advantage of CPU
cache. Accessing memory sequentially is generally faster than random access due to cache
line utilization.
iv. Compiler Optimizations: Enable compiler optimizations (e.g., -O2 or -O3 flags in GCC)
to let the compiler automatically optimize your code.
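As a small illustration of points (ii) and (iii) — our own sketch, not from the original text — the following contrasts a linear scan with a hash lookup; each scan is O(n) per query, while the hash set answers in O(1) expected time:

#include <algorithm>
#include <unordered_set>
#include <vector>

// Membership test two ways: a linear scan versus a hash lookup.
bool in_vector(const std::vector<int>& v, int key) {
    return std::find(v.begin(), v.end(), key) != v.end();  // O(n) per query
}

bool in_hash(const std::unordered_set<int>& s, int key) {
    return s.count(key) != 0;                               // O(1) expected
}

int main() {
    std::vector<int> v = {3, 1, 4, 1, 5, 9, 2, 6};
    std::unordered_set<int> s(v.begin(), v.end());
    return (in_vector(v, 5) == in_hash(s, 5)) ? 0 : 1;
}

Note the locality trade-off, though: the vector's contiguous layout is cache-friendly, so for small n the scan can actually win; the hash table pays off once n is large.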
Hardware features in modern CPU core

 Acknowledgement: Some information is from a presentation at
Intel’s Architecture Day 2021
(https://download.intel.com/newsroom/2021/client-
computing/intel-architecture-day-2021-presentation.pdf). The
features relate to how to write efficient programs for today’s CPU
cores.
 More details about CPU core design can be found in a computer
architecture book.
Hardware features in modern CPU core
Modern CPU cores come packed with a variety of advanced hardware features designed to
enhance performance, efficiency, and versatility. Here are some key features:

1. Multiple Cores
Modern CPUs often have multiple cores, allowing them to handle multiple tasks simultaneously. This is crucial
for multitasking and running complex applications efficiently.

2. Hyper-Threading / Simultaneous Multithreading (SMT)
This technology allows each physical core to handle multiple threads, effectively doubling the number of tasks
the CPU can manage at once. Intel calls this Hyper-Threading, while AMD refers to it as SMT.

3. Integrated Graphics
Many modern CPUs come with integrated graphics processing units (GPUs), which can handle basic graphics
tasks without the need for a separate graphics card. This is particularly useful for laptops and budget
desktops.
4. Dynamic Voltage and Frequency Scaling (DVFS)
This feature allows the CPU to adjust its power consumption and performance dynamically based on the
current workload. It helps in balancing performance with energy efficiency.
Hardware features in modern CPU core
5. Advanced Power Management
Modern CPUs incorporate various power-saving states (C-states) and techniques to reduce power consumption
when the CPU is idle or under light load.
6. Cache Memory
CPUs have multiple levels of cache (L1, L2, L3) to store frequently accessed data close to the cores, reducing
the time it takes to fetch data from the main memory.
7. Instruction Set Extensions
Modern CPUs support various instruction set extensions like SSE, AVX, and AVX-512, which provide
specialized instructions for tasks such as multimedia processing, scientific calculations, and cryptography.
8. Security Features
CPUs now include hardware-based security features like Intel’s SGX (Software Guard Extensions) and AMD’s
SEV (Secure Encrypted Virtualization) to protect against various types of cyber threats.
9. Thermal Management
Advanced thermal management features help prevent overheating by throttling the CPU speed or shutting
down cores when temperatures exceed safe limits.
10. High-Speed Interconnects
Modern CPUs use high-speed interconnects like Intel’s QuickPath Interconnect (QPI) or AMD’s Infinity Fabric to
facilitate fast communication between the CPU cores, memory, and other components.
Hardware features in modern CPU core
Practical Example:
Consider a high-end CPU like the Intel Core i9-14900K, which features 24 cores (8 performance
plus 16 efficient cores) and 32 threads, integrated graphics, and support for advanced instruction
sets such as AVX2. It also includes sophisticated power management and thermal control
mechanisms to ensure optimal performance under various workloads.

These features collectively enable modern CPUs to deliver high performance, energy efficiency,
and robust security, making them suitable for a wide range of applications from gaming to
scientific computing.
Hardware features in modern CPU core
Superscalar architecture is a method used in CPU design to improve performance by allowing
multiple instructions to be executed simultaneously during a single clock cycle. Here are some
key points about it:

Parallel Execution: Unlike traditional scalar processors that execute one instruction per cycle,
superscalar processors can handle multiple instructions in parallel.
Multiple Execution Units: These processors have multiple execution units, such as arithmetic logic units
(ALUs) and floating-point units (FPUs), which allow them to process several instructions at once.
Instruction-Level Parallelism: Superscalar processors exploit instruction-level parallelism by dynamically
checking for data dependencies between instructions at runtime.
Increased Throughput: This architecture increases the throughput, meaning the number of instructions
that can be executed in a unit of time is higher compared to scalar processors.
Compiler Optimization: Compilers play a crucial role in optimizing the instruction sequence to maximize
the use of available execution units.
Hardware features in modern CPU core
Superscalar architecture has several advantages and disadvantages. Here’s a breakdown:

Advantages
1. Increased Performance: By executing multiple instructions per clock cycle, superscalar
processors significantly boost performance and throughput.
2. Better Hardware Utilization: Multiple execution units are used more efficiently, reducing
idle times and improving overall hardware utilization.
3. Instruction-Level Parallelism: Superscalar processors can exploit instruction-level
parallelism, allowing for more complex and faster computations.
Hardware features in modern CPU core
Disadvantages
1. Complexity and Cost: The design and manufacturing of superscalar processors are more
complex and expensive due to the need for multiple execution units and sophisticated
instruction scheduling.
2. Power Consumption: These processors tend to consume more power, which can be a
significant drawback in power-sensitive applications.
3. Scheduling Issues: Managing the parallel execution of instructions can lead to scheduling
problems and potential performance bottlenecks.
4. Security Vulnerabilities: Techniques like speculative execution, used to enhance
performance, can introduce security risks.
Hardware features in modern CPU core

 Superscalar architecture
o Instruction pipelining
 Multiple instructions are in the pipeline at
different stages
 Pipeline hazards: an operand of an
instruction is not available when needed.
 Many causes: the operand has not yet
been calculated by another
instruction, a load from memory has
not completed, etc.
 Solution: stall the pipeline (delay the
stage for the instruction).
 The impact of branch instructions.
Hardware features in modern CPU core

 Superscalar architecture
o Multiple issue: allowing more than
one instruction to be issued at the
same time.
 More instruction-level parallelism.
o The execution stage may use many
execution units (ALU, Load, Store,
etc.), sometimes called ports.
 Different operations can be executed
simultaneously.
Hardware features in modern CPU core
 Branch Prediction Based on History
Branch prediction is a technique used in CPUs to guess the direction of a branch (e.g., an if-
then-else structure) before it is known definitively. This helps maintain the flow in the instruction
pipeline and improves performance. Here’s how it works:

1. Historical Data: The branch predictor uses historical data to make educated guesses about
whether a branch will be taken or not. For example, if a branch was taken the last few times,
the predictor might guess it will be taken again.
2. Two-Level Adaptive Prediction: This method uses two levels of history to make predictions.
The first level records the outcomes of recent branches, and the second level uses this history to
predict future branches.
3. Tournament Predictors: These use multiple prediction strategies and select the best one
based on past performance.
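The payoff of predictability is easy to see in software. The following C++ experiment — our own illustration, not from the slides, in the spirit of the well-known sorted-versus-unsorted benchmark — runs the same branch over random and then sorted data; a history-based predictor does far better on the sorted input because the branch outcome settles into long uniform runs:

#include <algorithm>
#include <cstdlib>
#include <vector>

// Sum only the large elements. The branch below is taken roughly half
// the time, in no particular order, when the data is random; after
// sorting, it is a long run of not-taken followed by a long run of taken.
long long sum_large(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v)
        if (x >= 128)   // this branch is what the predictor must guess
            sum += x;
    return sum;
}

int main() {
    std::vector<int> v(1 << 20);
    for (int& x : v) x = std::rand() % 256;
    long long a = sum_large(v);        // unpredictable branch pattern
    std::sort(v.begin(), v.end());
    long long b = sum_large(v);        // same work, predictable branches
    return (a == b) ? 0 : 1;           // timing the two calls shows the gap
}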
Hardware features in modern CPU core
 Out-of-Order Execution
Out-of-order execution is a technique used to improve CPU performance by allowing instructions
to be executed as soon as their operands are available, rather than strictly following the
program order. Here’s how it works:

1. Dynamic Scheduling: Instructions are scheduled dynamically based on the availability of
input data and execution units. This helps avoid idle CPU cycles.
2. Reservation Stations: Instructions are placed in reservation stations until their operands are
ready. Once ready, they are dispatched to the appropriate execution units.
3. Register Renaming: This technique helps eliminate false dependencies by allowing multiple
instructions to use the same registers without conflict.
Both branch prediction and out-of-order execution are crucial for enhancing the performance of
modern CPUs by maximizing the utilization of available resources and minimizing idle times.
Hardware features in modern CPU core
 Branch prediction based on history.
 Out-of-order execution
o Dual three-wide out-of-order decoders in the Intel presentation: allowing 6
instructions per cycle.
o Instructions in the out-of-order window (256 entries in the Intel talk) can be
executed out of order to exploit more parallelism.
 Many execution ports (17 in the Intel talk)
o 4 integer ALUs, 2 jump ports, 2 load ports, 2 store ports, 2 FP/vec store ports, 2
FP/vec stacks, etc.
Memory hierarchy
Reduce the average memory access cycle:
• Let register access take 1 cycle, L1 cache – 4 cycles, L2 cache – 10 cycles,
L3 cache – 40 cycles, memory – 200 cycles.
• 40% of data accesses hit in registers, 20% in L1, 20% in L2, 15% in L3, and 5%
go to memory. What is the average data access latency?
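Working through the slide's numbers (the answer itself is not given above): average latency = 0.40·1 + 0.20·4 + 0.20·10 + 0.15·40 + 0.05·200 = 0.4 + 0.8 + 2.0 + 6.0 + 10.0 = 19.2 cycles. Note that the 5% of accesses that go all the way to memory contribute more than half of the average — this is why data locality matters so much.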
Implication on the software
To exploit the parallelism in a CPU core, one should
 Use a good mix of instructions (load, store, different integer and floating-point ALU
operations).
 Some operations may be subject to operation latency constraints. For example, a
floating-point divide takes many cycles (sometimes 40).
 Minimize the number of branches and make them easy to predict.
 Branches create control dependence. If a branch is not predicted correctly, the
whole pipeline needs to be drained before the next instruction can be executed.
 CPU is faster than memory: exploit data locality and manage the ratio of memory
operations to ALU operations.
 Minimize the data dependence in the code (see the sketch after this list).
The hardware features help to a degree, but the programmer should still be mindful.
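As one illustration of minimizing data dependence, the following C++ sketch — our own, not from the slides — breaks the serial dependence chain in a reduction by using several independent accumulators, giving the out-of-order core independent additions to overlap:

// Both functions compute the same sum. The first forms a single
// dependence chain (each add must wait for the previous one); the
// second keeps four independent partial sums that can execute in
// parallel on a superscalar core.
double sum_chain(const double* a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

double sum_split(const double* a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];   // handle the remainder
    return (s0 + s1) + (s2 + s3);
}

Because floating-point addition is not associative, the two versions may round differently, which is why compilers only make this transformation when explicitly allowed (e.g., under GCC's -ffast-math).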
Exploit data locality
In parallel computing, data locality is crucial for optimizing performance. There are two
primary types of data locality:
1. Temporal Locality
Temporal locality refers to the reuse of specific data within a relatively short time period. If a
particular piece of data is accessed, it is likely to be accessed again soon. This principle is often
leveraged by caching mechanisms to keep frequently accessed data close to the processor.
2. Spatial Locality
Spatial locality refers to the use of data elements within relatively close storage locations. When a
data item is accessed, it is likely that nearby data items will be accessed soon. This is why data is
often stored in contiguous memory locations, allowing for efficient prefetching and caching.
Exploit data locality
Examples in Practice
1. Matrix Multiplication: In matrix operations, accessing elements in row-major or column-
major order can significantly impact performance due to spatial locality. Loop blocking
techniques can be used to enhance both temporal and spatial locality (a sketch follows this list).
2. Cache Optimization: Modern CPUs use multi-level caches to exploit both temporal and
spatial locality. Frequently accessed data is kept in the fastest, smallest cache levels, while
less frequently accessed data is stored in larger, slower caches.
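As a sketch of the loop blocking mentioned in point 1 — our own illustration, with the matrix size N and block size B chosen arbitrarily — each pass works on B×B tiles small enough to stay cache-resident while they are reused:

const int N = 1024;   // matrix dimension (illustrative)
const int B = 64;     // block size, chosen so a few B x B tiles fit in cache

// Blocked C = C + A * Bm for N x N row-major matrices.
void matmul_blocked(const double (*A)[N], const double (*Bm)[N], double (*C)[N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                // One tile multiply: the same B x B sub-blocks are reused
                // many times before eviction (temporal locality), and the
                // innermost j loop walks rows sequentially (spatial locality).
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += A[i][k] * Bm[k][j];
}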
Data Locality and Performance

1. Memory Access Patterns: Even if two programs perform the same operations, the order
and pattern in which they access memory can differ. Programs with better data locality will
have fewer cache misses, leading to faster execution.
2. Cache Utilization: Programs that access data in a sequential manner (good spatial locality)
or reuse data frequently (good temporal locality) make better use of the CPU cache. This
reduces the need to fetch data from slower main memory.
3. Loop Optimizations: Techniques like loop unrolling and blocking can transform a program
to improve its data locality. For example, accessing array elements in a cache-friendly order
can significantly boost performance.
Example
Consider two semantically equivalent loops (in C/C++, 2-D arrays are stored in row-major order):

// Example 1: Good locality
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}

// Example 2: Poor locality
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}

In the first example, the inner loop walks each row sequentially, matching the row-major
storage order, so consecutive accesses fall within the same cache line. In the second example,
the inner loop strides down a column, jumping N elements between accesses; it touches a
different cache line on almost every access, which leads to poor cache performance.
Semantically equivalent programs can have very
different locality, and thus performance
(a) spatial locality            (b) no spatial locality
for (i = 0; i < n; i++)         for (j = 0; j < n; j++)
  for (j = 0; j < n; j++)         for (i = 0; i < n; i++)
    m[i][j] = 0;                    m[i][j] = 0;

[Figure: row-major memory layout — m[0][0], m[0][1], …, m[0][n-1], m[1][0], m[1][1], …, m[1][n-1], m[2][0], m[2][1], … are contiguous. Loop (a) walks this sequence in order; loop (b) jumps n elements between consecutive writes.]

Cache operation: on a miss, the whole cache line is brought into the cache. (a) will have far
fewer cache misses than (b). Run lect5/2d.cpp to see the performance difference.
Summary
 Speedup, scalability, strong scaling, weak scaling
 Amdahl’s law
 Gustafson’s law
 Modern CPU core, its hardware features, and key issues
Practical Scenario
 Imagine a software development team is working on a large application. The project consists of various tasks,
some of which can be parallelized (e.g., coding, testing), while others must be done sequentially (e.g., project
planning, integration).
 Breakdown:
 Total Project Time: 1000 hours
 Parallelizable Tasks: 800 hours (80% of the total time)
 Sequential Tasks: 200 hours (20% of the total time)
 Applying Amdahl’s Law:
 The team decides to use 4 developers to work on the parallelizable tasks. According to Amdahl’s Law, the
speedup ( S ) can be calculated as follows:
where: ( P = 0.8 ) (80% of the tasks can be parallelized)
( N = 4 ) (number of developers)
Plugging in the values:
S = 1 / ((1 − 0.8) + 0.8 / 4) = 1 / (0.2 + 0.2) = 2.5
Practical Scenario
Interpretation:
 Original Time: 1000 hours
 New Time with 4 Developers: 1000 / 2.5 = 400 hours
 By using 4 developers, the team can reduce the project time from 1000 hours to 400 hours.
However, the speedup is limited by the 200 hours of sequential tasks that cannot be parallelized.
 Practical Insight:
 Even if the team adds more developers, the maximum speedup is constrained by the 200 hours of
sequential work. This demonstrates the diminishing returns of adding more resources to
parallelizable tasks.
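As a quick check on that limit (our own arithmetic, not on the slide): with arbitrarily many developers, S approaches 1 / (1 − P) = 1 / 0.2 = 5, so the project can never finish in less than 1000 / 5 = 200 hours — exactly the sequential portion.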
