PDC Lecture 03
Fall 2024
Quick Review
Describe a computation graph
For a computation graph, how do we define Work(CG), Span(CG),
and parallelism?
Scheduling a computation graph
o What is a greedy schedule?
o How good is a greedy schedule?
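These answers are not restated on the slide, so here is a hedged recap in the standard notation (standard material, not copied from the lecture):

\[ \mathrm{Work}(CG) = T_1, \qquad \mathrm{Span}(CG) = T_\infty, \qquad \text{Parallelism} = \frac{T_1}{T_\infty} \]

A greedy schedule on p processors satisfies Brent's bound \( T_p \le T_1/p + T_\infty \); since any schedule needs \( T_p \ge \max(T_1/p,\; T_\infty) \), a greedy schedule is within a factor of two of optimal.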
Concept of Scaling
Strong Scaling:
Definition: Strong scaling measures how the solution time varies with the number of processors for a fixed total
problem size.
Objective: The goal is to solve the same problem faster by adding more processors.
Example: If you have a task that takes 10 hours to complete on one processor, strong scaling would involve
using more processors to reduce this time. If you use 10 processors and the task now takes 1 hour, you have
achieved a strong scaling speedup of 10x.
Weak Scaling:
Definition: Weak scaling measures how the solution time varies with the number of processors for a fixed
problem size per processor.
Objective: The goal is to solve larger problems in the same amount of time by adding more processors.
Example: If you have a task that takes 1 hour to complete on one processor, weak scaling would involve
increasing the problem size proportionally as you add more processors. If you use 10 processors and the task
still takes 1 hour, but the problem size is 10 times larger, you have achieved perfect weak scaling.
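To make the two notions quantitative, the usual textbook definitions (my phrasing, not from the slides) are:

\[ S(p) = \frac{T(1)}{T(p)}, \qquad E_{\text{strong}}(p) = \frac{S(p)}{p}, \qquad E_{\text{weak}}(p) = \frac{T(1)}{T(p)} \ \text{with the problem size scaled by } p \]

In the examples above, the strong-scaling run achieves S(10) = 10 (efficiency 1.0), and the weak-scaling run achieves E_weak(10) = 1, i.e., perfect weak scaling.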
Concept of Scaling
Practical Implications:
Strong Scaling: Useful when you need to reduce the time to solution for a given problem. It is
often limited by the serial portion of the task, as described by Amdahl’s Law.
Weak Scaling: Useful when you need to handle larger problems as you add more resources. It
is often described by Gustafson’s Law, which suggests that the overall speedup can increase
with the problem size.
Visualization:
Strong Scaling: Imagine you have a pie, and you want to eat it faster by inviting more friends
to help. The pie size remains the same, but you finish it quicker.
Weak Scaling: Imagine you have a pie, and as more friends join, you bake a larger pie so that
everyone gets the same amount of pie in the same amount of time.
Amdahl’s Law
Amdahl’s Law, formulated by computer scientist Gene Amdahl in 1967, is a principle used
to predict the theoretical maximum speedup of a task when only part of the task can be
parallelized. It is particularly relevant in the context of parallel computing.
Key Points of Amdahl’s Law:
1. Speedup Calculation: Amdahl’s Law provides a formula to calculate the speedup of a task
based on the proportion of the task that can be parallelized. The formula is:
\[ S = \frac{1}{(1 - P) + P/N} \]
where:
S is the overall speedup,
P is the proportion of the task that can be parallelized,
N is the number of processors.
Amdahl’s Law
2. Limitation by Serial Portion: The law highlights that the speedup is limited by the portion
of the task that cannot be parallelized. Even if you use an infinite number of processors, the
maximum speedup is constrained by the serial part of the task.
3. Example: If 90% of a task can be parallelized (P = 0.9) and you use 10 processors
(N = 10), the speedup S is
\[ S = \frac{1}{(1 - 0.9) + 0.9/10} = \frac{1}{0.1 + 0.09} = \frac{1}{0.19} \approx 5.26 \]
This means the task would be approximately 5.26 times faster with 10 processors.
4. Practical Implications: Amdahl’s Law is used to understand the limits of parallel
processing and to guide decisions about optimizing system performance. It shows that
adding more processors yields diminishing returns, because the sequential portion of the
task increasingly dominates the total execution time.
Amdahl’s Law (fixed-size speedup, strong scaling)
Given a program, let f be the fraction that must be sequential and 1 - f be
the fraction that can be parallelized. On P processors,
\[ T_P = f\,T_1 + \frac{(1 - f)\,T_1}{P} \]
\[ S = \frac{T_1}{T_P} = \frac{1}{f + (1 - f)/P} \]
When \( P \to \infty \), \( S \to 1/f \).
Original paper: Amdahl, Gene M. (1967). "Validity of the Single Processor
Approach to Achieving Large-Scale Computing Capabilities". AFIPS
Conference Proceedings (30): 483–485.
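A minimal C++ sketch of this formula (the function name and the sample value f = 0.1 are mine, not from the lecture):

#include <cstdio>

// Amdahl speedup for sequential fraction f on p processors: 1 / (f + (1-f)/p).
double amdahl_speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main() {
    const double f = 0.1;  // assume 10% of the work is inherently sequential
    for (int p = 1; p <= 1024; p *= 4)
        std::printf("P = %4d  speedup = %5.2f\n", p, amdahl_speedup(f, p));
    // As P grows, the speedup approaches 1/f = 10 and no further.
}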
Amdahl’s law
Amdahl’s law: As P increases, the fraction of time spent in the parallel region
shrinks, and performance becomes more and more dominated by the sequential
region.
Common ways to reduce execution time, including the sequential part, are:
i. Efficient Algorithms: Choose the most efficient algorithm for the task. Sometimes a more
complex algorithm can significantly reduce execution time compared to a simpler one.
ii. Data Structures: Use data structures that offer the best performance for the operations
you need, for example a hash table for quick lookups instead of a list (see the sketch after
this list).
iii. Memory Access Patterns: Optimize memory access patterns to take advantage of the CPU
cache. Accessing memory sequentially is generally faster than random access due to cache
line utilization.
iv. Compiler Optimizations: Enable compiler optimizations (e.g., the -O2 or -O3 flags in GCC)
to let the compiler automatically optimize your code.
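As a small illustration of point ii (a sketch with container and key types of my choosing, not code from the lecture): a lookup in a std::unordered_map costs O(1) on average, while scanning a list of pairs costs O(n) per query.

#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Linear scan: O(n) per lookup.
bool in_list(const std::vector<std::pair<std::string, int>>& v,
             const std::string& key) {
    return std::any_of(v.begin(), v.end(),
                       [&](const auto& p) { return p.first == key; });
}

// Hash table: O(1) average per lookup.
bool in_table(const std::unordered_map<std::string, int>& m,
              const std::string& key) {
    return m.count(key) != 0;
}

For a few dozen lookups the difference is negligible; for millions of lookups against a large collection, the hash table wins by orders of magnitude.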
Hardware features in modern CPU core
1. Multiple Cores
Modern CPUs often have multiple cores, allowing them to handle multiple tasks simultaneously. This is crucial
for multitasking and running complex applications efficiently.
2. Integrated Graphics
Many modern CPUs come with integrated graphics processing units (GPUs), which can handle basic graphics
tasks without the need for a separate graphics card. This is particularly useful for laptops and budget
desktops.
3. Dynamic Voltage and Frequency Scaling (DVFS)
This feature allows the CPU to adjust its power consumption and performance dynamically based on the
current workload. It helps in balancing performance with energy efficiency.
Hardware features in modern CPU core
4. Advanced Power Management
Modern CPUs incorporate various power-saving states (C-states) and techniques to reduce power consumption
when the CPU is idle or under light load.
5. Cache Memory
CPUs have multiple levels of cache (L1, L2, L3) to store frequently accessed data close to the cores, reducing
the time it takes to fetch data from the main memory.
6. Instruction Set Extensions
Modern CPUs support various instruction set extensions like SSE, AVX, and AVX-512, which provide
specialized instructions for tasks such as multimedia processing, scientific calculations, and cryptography
(see the sketch after this list).
7. Security Features
CPUs now include hardware-based security features like Intel’s SGX (Software Guard Extensions) and AMD’s
SEV (Secure Encrypted Virtualization) to protect against various types of cyber threats.
8. Thermal Management
Advanced thermal management features help prevent overheating by throttling the CPU speed or shutting
down cores when temperatures exceed safe limits.
9. High-Speed Interconnects
Modern CPUs use high-speed interconnects like Intel’s QuickPath Interconnect (QPI) or AMD’s Infinity Fabric to
facilitate fast communication between the CPU cores, memory, and other components.
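As a small illustration of the instruction-set-extensions point (a hedged sketch: the array contents are mine, and I use plain AVX rather than AVX-512 since it is far more widely available), one intrinsic below performs eight float additions in a single instruction. Compile with -mavx on GCC or Clang:

#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);      // load 8 floats into a 256-bit register
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // 8 additions in a single instruction
    _mm256_store_ps(c, vc);

    for (float x : c) std::printf("%.0f ", x);  // prints 9 eight times
    std::printf("\n");
}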
Hardware features in modern CPU core
Practical Example:
Consider a high-end CPU like the Intel Core i9-14900K, which features 24 cores and 32
threads, integrated graphics, and support for advanced instruction sets such as AVX2. It also
includes sophisticated power management and thermal control mechanisms to ensure optimal
performance under various workloads.
These features collectively enable modern CPUs to deliver high performance, energy efficiency,
and robust security, making them suitable for a wide range of applications from gaming to
scientific computing.
Hardware features in modern CPU core
Superscalar architecture is a method used in CPU design to improve performance by allowing
multiple instructions to be executed simultaneously during a single clock cycle. Here are some
key points about it:
Parallel Execution: Unlike traditional scalar processors that execute one instruction per cycle,
superscalar processors can handle multiple instructions in parallel.
Multiple Execution Units: These processors have multiple execution units, such as arithmetic logic units
(ALUs) and floating-point units (FPUs), which allow them to process several instructions at once.
Instruction-Level Parallelism: Superscalar processors exploit instruction-level parallelism by dynamically
checking for data dependencies between instructions at runtime.
Increased Throughput: This architecture increases the throughput, meaning the number of instructions
that can be executed in a unit of time is higher compared to scalar processors.
Compiler Optimization: Compilers play a crucial role in optimizing the instruction sequence to maximize
the use of available execution units.
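A small C++ sketch of the instruction-level parallelism a superscalar core can exploit (the function names and two-way split are my own illustration): in sum_two_chains the two partial sums are independent, so two adds can issue in the same cycle on different execution units, whereas sum_one_chain is a single long dependency chain.

#include <cstddef>
#include <vector>

double sum_one_chain(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;           // each add depends on the previous one
    return s;
}

double sum_two_chains(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) {   // two independent chains: can issue in parallel
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < v.size()) s0 += v[i];        // odd leftover element
    return s0 + s1;
}

(With floating point, the compiler will not reassociate the single-chain version on its own, so the two-chain version can measurably outperform it.)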
Hardware features in modern CPU core
Superscalar architecture has several advantages and disadvantages. Here’s a breakdown:
Advantages
1. Increased Performance: By executing multiple instructions per clock cycle, superscalar
processors significantly boost performance and throughput.
2. Better Hardware Utilization: Multiple execution units are used more efficiently, reducing
idle times and improving overall hardware utilization.
3. Instruction-Level Parallelism: Superscalar processors can exploit instruction-level
parallelism, allowing for more complex and faster computations.
Hardware features in modern CPU core
Disadvantages
1. Complexity and Cost: The design and manufacturing of superscalar processors are more
complex and expensive due to the need for multiple execution units and sophisticated
instruction scheduling.
2. Power Consumption: These processors tend to consume more power, which can be a
significant drawback in power-sensitive applications.
3. Scheduling Issues: Managing the parallel execution of instructions can lead to scheduling
problems and potential performance bottlenecks.
4. Security Vulnerabilities: Techniques like speculative execution, used to enhance
performance, can introduce security risks.
Hardware features in modern CPU core
Superscalar architecture
o Instruction pipelining
Multiple instructions are in the pipeline at
different stages.
Pipeline hazards: an operand of an
instruction is not available when needed.
There are many causes: the operand has not
yet been calculated by another instruction,
a load from memory has not completed, etc.
Solution: stall the pipeline (delay the
stage for that instruction).
Branch instructions also disrupt the pipeline,
since the next instruction to fetch is not
known until the branch resolves.
Hardware features in modern CPU core
Superscalar architecture
o Multiple issue: allowing more than
one instruction to be issued at the
same time, exposing more
instruction-level parallelism.
o The execution stage may use many
execution units (ALU, load, store,
etc.), sometimes called ports, so
different operations can be executed
simultaneously.
Hardware features in modern CPU core
Branch Prediction Based on History
Branch prediction is a technique used in CPUs to guess the direction of a branch (e.g., an if-
then-else structure) before it is known definitively. This helps maintain the flow in the instruction
pipeline and improves performance. Here’s how it works:
1. Historical Data: The branch predictor uses historical data to make educated guesses about
whether a branch will be taken or not. For example, if a branch was taken the last few times,
the predictor might guess it will be taken again.
2. Two-Level Adaptive Prediction: This method uses two levels of history to make predictions.
The first level records the outcomes of recent branches, and the second level uses this history to
predict future branches.
3. Tournament Predictors: These use multiple prediction strategies and select the best one
based on past performance.
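A classic way to observe history-based prediction (a hedged sketch; the array size, threshold, and timing scaffolding are mine): the if below is almost perfectly predictable on sorted data and essentially random on unsorted data, so the sorted pass typically runs noticeably faster.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Count elements above a threshold; the if-branch is what the predictor learns.
long count_big(const std::vector<int>& v) {
    long n = 0;
    for (int x : v)
        if (x > 128) ++n;  // taken/not-taken pattern depends on data order
    return n;
}

int main() {
    std::vector<int> v(1 << 24);
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> d(0, 255);
    for (int& x : v) x = d(gen);

    auto run = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long n = count_big(v);
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%s: count=%ld, %lld ms\n", label, n, (long long)ms);
    };

    run("random order");  // branch outcome looks random: many mispredictions
    std::sort(v.begin(), v.end());
    run("sorted");        // long runs of the same outcome: few mispredictions
}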
Hardware features in modern CPU core
Out-of-Order Execution
Out-of-order execution is a technique used to improve CPU performance by allowing instructions
to be executed as soon as their operands are available, rather than strictly following the
program order. Instructions are fetched and decoded in order, buffered, dispatched to execution
units as their operands become ready, and finally retired in program order so that the results
visible to the program remain correct.
Cache Optimization: Modern CPUs use multi-level caches to exploit both temporal and
spatial locality. Frequently accessed data is kept in the fastest, smallest cache levels, while
less frequently accessed data is stored in larger, slower caches.
Data Locality and Performance
1. Memory Access Patterns: Even if two programs perform the same operations, the order
and pattern in which they access memory can differ. Programs with better data locality will
have fewer cache misses, leading to faster execution.
2. Cache Utilization: Programs that access data in a sequential manner (good spatial locality)
or reuse data frequently (good temporal locality) make better use of the CPU cache. This
reduces the need to fetch data from slower main memory.
3. Loop Optimizations: Techniques like loop unrolling and blocking can transform a program
to improve its data locality. For example, accessing array elements in a cache-friendly order
can significantly boost performance.
Example
Consider two semantically equivalent loops:

(a)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        m[i][j] = 0;

(b)
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        m[i][j] = 0;

In C/C++, a 2D array is stored in row-major order:
m[0][0], m[0][1], ..., m[0][n-1], m[1][0], m[1][1], ..., m[1][n-1], m[2][0], m[2][1], ...
Loop (a) walks this layout sequentially, one row at a time, so consecutive accesses fall on the
same cache line. Loop (b) jumps an entire row between consecutive accesses, touching a
different cache line almost every time.

Semantically equivalent programs can have very
different locality, and thus performance.

Cache operation: on a miss, the whole cache line is brought into the cache, so (a) will have far
fewer cache misses than (b). Run lect5/2d.cpp to see the performance difference.
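The file lect5/2d.cpp is not reproduced here; a minimal benchmark in its spirit might look like the following (matrix size and timing scaffolding are my own):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;
    std::vector<std::vector<int>> m(n, std::vector<int>(n, 1));

    auto run = [&](bool row_wise, const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        if (row_wise) {
            for (int i = 0; i < n; i++)      // (a): walks each row sequentially
                for (int j = 0; j < n; j++)
                    m[i][j] = 0;
        } else {
            for (int j = 0; j < n; j++)      // (b): strides across rows
                for (int i = 0; i < n; i++)
                    m[i][j] = 0;
        }
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%s: %lld ms\n", label, (long long)ms);
    };

    run(true,  "(a) row-wise");
    run(false, "(b) column-wise");
    std::printf("%d\n", m[0][0]);  // keep the stores observable
}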
Summary
Amdahl’s law
Gustafson’s law
Example (Amdahl’s Law): Suppose a project takes 1000 hours in total, of which 200 hours of
sequential tasks cannot be parallelized and the remaining 800 hours can be divided among
developers. By using 4 developers, the team can reduce the project time from 1000 hours to
200 + 800/4 = 400 hours, a speedup of 2.5x. However, the speedup is limited by the 200 hours
of sequential work.
Practical Insight:
Even if the team adds more developers, the total time can never drop below 200 hours, so the
maximum speedup is 1000/200 = 5x. This demonstrates the diminishing returns of adding more
resources to a partially parallelizable task.
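Restating the same numbers in Amdahl’s-law form (my rephrasing): the sequential fraction is f = 200/1000 = 0.2, so

\[ S(4) = \frac{1}{0.2 + 0.8/4} = \frac{1}{0.4} = 2.5, \qquad S(\infty) = \frac{1}{0.2} = 5 \]

No number of developers can push the speedup past 5x.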