
CS-402 Parallel and Distributed Systems

Fall 2024
Lecture No. 03
Quick Review
 Describe a computation graph
 For a computation graph, how do we define Work(CG), Span(CG),
and parallelism?
 Scheduling a computation graph
o What is a greedy schedule?
o How good is a greedy schedule?
Concept of Scaling
 Strong Scaling:
Definition: Strong scaling measures how the solution time varies with the number of processors for a fixed total
problem size.
Objective: The goal is to solve the same problem faster by adding more processors.
 Example: If you have a task that takes 10 hours to complete on one processor, strong scaling would involve
using more processors to reduce this time. If you use 10 processors and the task now takes 1 hour, you have
achieved a strong scaling speedup of 10x.
 Weak Scaling:
Definition: Weak scaling measures how the solution time varies with the number of processors for a fixed
problem size per processor.
Objective: The goal is to solve larger problems in the same amount of time by adding more processors.
 Example: If you have a task that takes 1 hour to complete on one processor, weak scaling would involve
increasing the problem size proportionally as you add more processors. If you use 10 processors and the task
still takes 1 hour, but the problem size is 10 times larger, you have achieved perfect weak scaling.
Concept of Scaling
Practical Implications:
Strong Scaling: Useful when you need to reduce the time to solution for a given problem. It is
often limited by the serial portion of the task, as described by Amdahl’s Law.
Weak Scaling: Useful when you need to handle larger problems as you add more resources. It
is often described by Gustafson’s Law, which suggests that the overall speedup can increase
with the problem size.
Visualization:
Strong Scaling: Imagine you have a pie, and you want to eat it faster by inviting more friends
to help. The pie size remains the same, but you finish it quicker.
Weak Scaling: Imagine you have a pie, and as more friends join, you bake a larger pie so that
everyone gets the same amount of pie in the same amount of time.
Amdahl’s Law
 Amdahl’s Law, formulated by computer scientist Gene Amdahl in 1967, is a principle used
to predict the theoretical maximum speedup of a task when only part of the task can be
parallelized. It is particularly relevant in the context of parallel computing.
 Key Points of Amdahl’s Law:
1. Speedup Calculation: Amdahl’s Law provides a formula to calculate the speedup of a task
based on the proportion of the task that can be parallelized. The formula is:

S = 1 / ((1 − P) + P / N)

where:
 ( S ) is the overall speedup.
 ( P ) is the proportion of the task that can be parallelized.
 ( N ) is the number of processors.
Amdahl’s Law
2. Limitation by Serial Portion: The law highlights that the speedup is limited by the portion
of the task that cannot be parallelized. Even if you use an infinite number of processors, the
maximum speedup is constrained by the serial part of the task.
3. Example: If 90% of a task can be parallelized (( P = 0.9 )), and you use 10 processors
(( N = 10 )), the speedup ( S ) would be:

S = 1 / ((1 − 0.9) + 0.9 / 10) = 1 / (0.1 + 0.09) = 1 / 0.19 ≈ 5.26

This means the task would be approximately 5.26 times faster with 10 processors.
4. Practical Implications: Amdahl’s Law is used to understand the limitations of parallel
processing and to make decisions about optimizing system performance. It shows that
adding more processors to speed up the parallelizable portion of a task yields diminishing
returns, because the serial portion increasingly dominates.
Amdahl’s Law (fixed size speedup, strong scaling)
 Given a program, let f be the fraction that must be sequential and 1 − f be
the fraction that can be parallelized.
 T(P) = f · T(1) + (1 − f) · T(1) / P
 speedup(P) = T(1) / T(P) = T(1) / (f · T(1) + (1 − f) · T(1) / P) = 1 / (f + (1 − f) / P)
 When P → ∞, speedup(P) → 1 / f
 Original paper: Amdahl, Gene M. (1967). "Validity of the Single Processor
Approach to Achieving Large-Scale Computing Capabilities". AFIPS
Conference Proceedings (30): 483–485.
Amdahl’s law
Amdahl’s law: As P increases, the fraction of time spent in the parallel region
shrinks, so performance is more and more dominated by the sequential
region.

[Figure: stacked execution-time bars for P = 1, 2, and 4 — the parallel portion shrinks with P while the sequential portion stays fixed.]

Implication of Amdahl’s Law
 For strong scaling, the speedup is
bounded by the sequential fraction of
the program, not by the number of
processors!
 Strong scaling will be hard to achieve
for many programs.
Gustafson’s Law (scaled speedup, weak scaling)
 Large-scale parallel/distributed systems are expected to allow for
solving problems faster or solving larger problems.
o Amdahl’s Law indicates that there is a limit on how much faster it can go.
o How about bigger problems? This is what Gustafson’s Law sheds light on!
 In Amdahl’s law, as the number of processors increases, the amount
of work in each node decreases (more processors sharing the
parallel part).
 In Gustafson’s law, as the number of processors increases, the
amount of work in each node remains the same (doing more work
collectively).
Gustafson’s law
Gustafson’s law: As P increases, the total work on each process remains the
same. So the total work increases with P.

[Figure: execution-time bars for P = 1, 2, and 4 — the per-process work is constant, so the total work grows with P.]

Gustafson’s Law (scaled speedup, weak scaling)
 The work on each processor is 1 (f is the fraction that is sequential,
1 − f is the fraction that is parallel).
 With P processors (each still taking time T(P) = 1), the total amount of useful
work is f + (1 − f) · P. Thus, T(1) = f + (1 − f) · P.
 Thus, speedup(P) = T(1) / T(P) = f + (1 − f) · P.
No. of PEs | Strong scaling speedup  | Weak scaling speedup
           | (Amdahl’s law, f = 10%) | (Gustafson’s law, f = 10%)
2          | 1.82                    | 1.9
4          | 3.08                    | 3.7
8          | 4.71                    | 7.3
16         | 6.40                    | 14.5
100        | 9.17                    | 90.1
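As a sanity check on the table, here is a minimal C++ sketch — our own illustration, not part of the lecture — that evaluates both formulas:

#include <cstdio>

// Strong scaling (Amdahl): speedup(P) = 1 / (f + (1 - f) / P)
double amdahl(double f, int p) { return 1.0 / (f + (1.0 - f) / p); }

// Weak scaling (Gustafson): speedup(P) = f + (1 - f) * P
double gustafson(double f, int p) { return f + (1.0 - f) * p; }

int main() {
    const double f = 0.10;                 // sequential fraction
    const int pes[] = {2, 4, 8, 16, 100};  // numbers of processing elements
    for (int p : pes)
        std::printf("P = %3d  strong = %5.2f  weak = %6.1f\n",
                    p, amdahl(f, p), gustafson(f, p));
    return 0;
}

Running it shows the strong-scaling column flattening out toward 1/f = 10 while the weak-scaling column keeps growing with P.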
Implication of Gustafson’s law
 For weak scaling, speedup(P) = f + (1 − f) · P
o Speedup is now proportional to P.
 Scalability is much better when the problem size can increase.
o Many applications can use more computing power to solve larger problems
 Weather prediction, large deep learning models.
 Gustafson, John L. (May 1988). "Reevaluating Amdahl's Law".
Communications of the ACM. 31 (5): 532–533.
Single Thread Performance and Loop optimizations
 Architecture features of a modern CPU core
 Locality and array reference pattern
 Dependence
 Loop optimizations
Single Thread Performance and Loop optimizations
Improving single-thread performance and optimizing loops are crucial for enhancing the
efficiency of programs, especially in scenarios where parallelism isn’t feasible. Here are some
key strategies:
Single-Thread Performance

i. Efficient Algorithms: Choose the most efficient algorithms for the task. Sometimes, a more
complex algorithm can significantly reduce execution time compared to a simpler one.
ii. Data Structures: Use appropriate data structures that offer the best performance for the
operations you need. For example, using a hash table for quick lookups instead of a list (see
the sketch after this list).
iii. Memory Access Patterns: Optimize memory access patterns to take advantage of CPU
cache. Accessing memory sequentially is generally faster than random access due to cache
line utilization.
iv. Compiler Optimizations: Enable compiler optimizations (e.g., -O2 or -O3 flags in GCC)
to let the compiler automatically optimize your code.
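As a small illustration of points (ii) and (iii) — our own sketch, not from the original text — the following contrasts a linear scan with a hash lookup; each scan is O(n) per query, while the hash set answers in O(1) expected time:

#include <algorithm>
#include <unordered_set>
#include <vector>

// Membership test two ways: a linear scan versus a hash lookup.
bool in_vector(const std::vector<int>& v, int key) {
    return std::find(v.begin(), v.end(), key) != v.end();  // O(n) per query
}

bool in_hash(const std::unordered_set<int>& s, int key) {
    return s.count(key) != 0;                               // O(1) expected
}

int main() {
    std::vector<int> v = {3, 1, 4, 1, 5, 9, 2, 6};
    std::unordered_set<int> s(v.begin(), v.end());
    return (in_vector(v, 5) == in_hash(s, 5)) ? 0 : 1;
}

Note the locality trade-off, though: the vector's contiguous layout is cache-friendly, so for small n the scan can actually win; the hash table pays off once n is large.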
Hardware features in modern CPU core

 Acknowledgement: Some information is from a presentation at
Intel’s Architecture Day 2021
(https://download.intel.com/newsroom/2021/client-
computing/intel-architecture-day-2021-presentation.pdf). The
features relate to how to write efficient programs for today’s CPU
cores.
 More details about CPU core design can be found in a computer
architecture book.
Hardware features in modern CPU core
Modern CPU cores come packed with a variety of advanced hardware features designed to
enhance performance, efficiency, and versatility. Here are some key features:

1. Multiple Cores
Modern CPUs often have multiple cores, allowing them to handle multiple tasks simultaneously. This is crucial
for multitasking and running complex applications efficiently.

2. Hyper-Threading / Simultaneous Multithreading (SMT)
This technology allows each physical core to handle multiple threads, effectively doubling the number of tasks
the CPU can manage at once. Intel calls this Hyper-Threading, while AMD refers to it as SMT.

3. Integrated Graphics
Many modern CPUs come with integrated graphics processing units (GPUs), which can handle basic graphics
tasks without the need for a separate graphics card. This is particularly useful for laptops and budget
desktops.
4. Dynamic Voltage and Frequency Scaling (DVFS)
This feature allows the CPU to adjust its power consumption and performance dynamically based on the
current workload. It helps in balancing performance with energy efficiency.
Hardware features in modern CPU core
5. Advanced Power Management
Modern CPUs incorporate various power-saving states (C-states) and techniques to reduce power consumption
when the CPU is idle or under light load.
6. Cache Memory
CPUs have multiple levels of cache (L1, L2, L3) to store frequently accessed data close to the cores, reducing
the time it takes to fetch data from the main memory.
7. Instruction Set Extensions
Modern CPUs support various instruction set extensions like SSE, AVX, and AVX-512, which provide
specialized instructions for tasks such as multimedia processing, scientific calculations, and cryptography.
8. Security Features
CPUs now include hardware-based security features like Intel’s SGX (Software Guard Extensions) and AMD’s
SEV (Secure Encrypted Virtualization) to protect against various types of cyber threats.
9. Thermal Management
Advanced thermal management features help prevent overheating by throttling the CPU speed or shutting
down cores when temperatures exceed safe limits.
10. High-Speed Interconnects
Modern CPUs use high-speed interconnects like Intel’s QuickPath Interconnect (QPI) or AMD’s Infinity Fabric to
facilitate fast communication between the CPU cores, memory, and other components.
Hardware features in modern CPU core
Practical Example:
Consider a high-end CPU like the Intel Core i9-14900K, which features 24 cores (8 performance
plus 16 efficient cores) and 32 threads, integrated graphics, and support for advanced instruction
sets such as AVX2. It also includes sophisticated power management and thermal control
mechanisms to ensure optimal performance under various workloads.

These features collectively enable modern CPUs to deliver high performance, energy efficiency,
and robust security, making them suitable for a wide range of applications from gaming to
scientific computing.
Hardware features in modern CPU core
Superscalar architecture is a method used in CPU design to improve performance by allowing
multiple instructions to be executed simultaneously during a single clock cycle. Here are some
key points about it:

Parallel Execution: Unlike traditional scalar processors that execute one instruction per cycle,
superscalar processors can handle multiple instructions in parallel.
Multiple Execution Units: These processors have multiple execution units, such as arithmetic logic units
(ALUs) and floating-point units (FPUs), which allow them to process several instructions at once.
Instruction-Level Parallelism: Superscalar processors exploit instruction-level parallelism by dynamically
checking for data dependencies between instructions at runtime.
Increased Throughput: This architecture increases the throughput, meaning the number of instructions
that can be executed in a unit of time is higher compared to scalar processors.
Compiler Optimization: Compilers play a crucial role in optimizing the instruction sequence to maximize
the use of available execution units.
Hardware features in modern CPU core
Superscalar architecture has several advantages and disadvantages. Here’s a breakdown:

Advantages
1. Increased Performance: By executing multiple instructions per clock cycle, superscalar
processors significantly boost performance and throughput.
2. Better Hardware Utilization: Multiple execution units are used more efficiently, reducing
idle times and improving overall hardware utilization.
3. Instruction-Level Parallelism: Superscalar processors can exploit instruction-level
parallelism, allowing for more complex and faster computations.
Hardware features in modern CPU core
Disadvantages
1. Complexity and Cost: The design and manufacturing of superscalar processors are more
complex and expensive due to the need for multiple execution units and sophisticated
instruction scheduling.
2. Power Consumption: These processors tend to consume more power, which can be a
significant drawback in power-sensitive applications.
3. Scheduling Issues: Managing the parallel execution of instructions can lead to scheduling
problems and potential performance bottlenecks.
4. Security Vulnerabilities: Techniques like speculative execution, used to enhance
performance, can introduce security risks.
Hardware features in modern CPU core

 Superscalar architecture
o Instruction pipelining
 Multiple instructions are in the pipeline at
different stages
 Pipeline hazards: an operand of an
instruction is not available when needed.
 Many causes: the operand has not yet
been calculated by another
instruction, a load from memory has
not completed, etc.
 Solution: stall the pipeline (delay the
stage for the instruction).
 The impact of branch instructions.
Hardware features in modern CPU core

 Superscalar architecture
o Multiple issue: allowing more than
one instruction to be issued at the
same time.
 More instruction-level parallelism.
o The execution stage may use many
execution units (ALU, Load, Store,
etc.), sometimes called ports.
 Different operations can be executed
simultaneously.
Hardware features in modern CPU core
 Branch Prediction Based on History
Branch prediction is a technique used in CPUs to guess the direction of a branch (e.g., an if-
then-else structure) before it is known definitively. This helps maintain the flow in the instruction
pipeline and improves performance. Here’s how it works:

1. Historical Data: The branch predictor uses historical data to make educated guesses about
whether a branch will be taken or not. For example, if a branch was taken the last few times,
the predictor might guess it will be taken again.
2. Two-Level Adaptive Prediction: This method uses two levels of history to make predictions.
The first level records the outcomes of recent branches, and the second level uses this history to
predict future branches.
3. Tournament Predictors: These use multiple prediction strategies and select the best one
based on past performance.
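The payoff of predictability is easy to see in software. The following C++ experiment — our own illustration, not from the slides, in the spirit of the well-known sorted-versus-unsorted benchmark — runs the same branch over random and then sorted data; a history-based predictor does far better on the sorted input because the branch outcome settles into long uniform runs:

#include <algorithm>
#include <cstdlib>
#include <vector>

// Sum only the large elements. The branch below is taken roughly half
// the time, in no particular order, when the data is random; after
// sorting, it is a long run of not-taken followed by a long run of taken.
long long sum_large(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v)
        if (x >= 128)   // this branch is what the predictor must guess
            sum += x;
    return sum;
}

int main() {
    std::vector<int> v(1 << 20);
    for (int& x : v) x = std::rand() % 256;
    long long a = sum_large(v);        // unpredictable branch pattern
    std::sort(v.begin(), v.end());
    long long b = sum_large(v);        // same work, predictable branches
    return (a == b) ? 0 : 1;           // timing the two calls shows the gap
}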
Hardware features in modern CPU core
 Out-of-Order Execution
Out-of-order execution is a technique used to improve CPU performance by allowing instructions
to be executed as soon as their operands are available, rather than strictly following the
program order. Here’s how it works:

1. Dynamic Scheduling: Instructions are scheduled dynamically based on the availability of
input data and execution units. This helps avoid idle CPU cycles.
2. Reservation Stations: Instructions are placed in reservation stations until their operands are
ready. Once ready, they are dispatched to the appropriate execution units.
3. Register Renaming: This technique helps eliminate false dependencies by allowing multiple
instructions to use the same registers without conflict.
Both branch prediction and out-of-order execution are crucial for enhancing the performance of
modern CPUs by maximizing the utilization of available resources and minimizing idle times.
Hardware features in modern CPU core
 Branch prediction based on history.
 Out-of-order execution
o Dual three-wide out-of-order decoders in the Intel presentation: allowing 6
instructions per cycle.
o Instructions in the out-of-order window (256 entries in the Intel talk) can be
executed out of order to exploit more parallelism.
 Many execution ports (17 in the Intel talk)
o 4 integer ALUs, 2 jump ports, 2 load ports, 2 store ports, 2 FP/vec store ports, 2
FP/vec stacks, etc.
Memory hierarchy
Reduce the average memory access cycle:
• Let register access take 1 cycle, L1 cache – 4 cycles, L2 cache – 10 cycles,
L3 cache – 40 cycles, memory – 200 cycles.
• 40% of data accesses hit in registers, 20% in L1, 20% in L2, 15% in L3, and 5%
go to memory. What is the average data access latency?
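Working through the slide's numbers (the answer itself is not given above): average latency = 0.40·1 + 0.20·4 + 0.20·10 + 0.15·40 + 0.05·200 = 0.4 + 0.8 + 2.0 + 6.0 + 10.0 = 19.2 cycles. Note that the 5% of accesses that go all the way to memory contribute more than half of the average — this is why data locality matters so much.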
Implication on the software
To exploit the parallelism in a CPU core, one should
 Use a good mix of instructions (load, store, different integer and floating-point ALU
operations).
 Some operations may be subject to operation latency constraints. For example, a
floating-point divide takes many cycles (sometimes 40).
 Minimize the number of branches and make them easy to predict.
 Branches create control dependence. If a branch is not predicted correctly, the
whole pipeline needs to be drained before the next instruction can be executed.
 CPU is faster than memory: exploit data locality and manage the ratio of memory
operations to ALU operations.
 Minimize the data dependence in the code (see the sketch after this list).
The hardware features help to a degree, but the programmer should still be mindful.
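As one illustration of minimizing data dependence, the following C++ sketch — our own, not from the slides — breaks the serial dependence chain in a reduction by using several independent accumulators, giving the out-of-order core independent additions to overlap:

// Both functions compute the same sum. The first forms a single
// dependence chain (each add must wait for the previous one); the
// second keeps four independent partial sums that can execute in
// parallel on a superscalar core.
double sum_chain(const double* a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

double sum_split(const double* a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];   // handle the remainder
    return (s0 + s1) + (s2 + s3);
}

Because floating-point addition is not associative, the two versions may round differently, which is why compilers only make this transformation when explicitly allowed (e.g., under GCC's -ffast-math).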
Exploit data locality
In parallel computing, data locality is crucial for optimizing performance. There are two
primary types of data locality:
1. Temporal Locality
Temporal locality refers to the reuse of specific data within a relatively short time period. If a
particular piece of data is accessed, it is likely to be accessed again soon. This principle is often
leveraged by caching mechanisms to keep frequently accessed data close to the processor.
2. Spatial Locality
Spatial locality refers to the use of data elements within relatively close storage locations. When a
data item is accessed, it is likely that nearby data items will be accessed soon. This is why data is
often stored in contiguous memory locations, allowing for efficient prefetching and caching.
Exploit data locality
Examples in Practice
1. Matrix Multiplication: In matrix operations, accessing elements in row-major or column-
major order can significantly impact performance due to spatial locality. Loop blocking
techniques can be used to enhance both temporal and spatial locality (a sketch follows this list).
2. Cache Optimization: Modern CPUs use multi-level caches to exploit both temporal and
spatial locality. Frequently accessed data is kept in the fastest, smallest cache levels, while
less frequently accessed data is stored in larger, slower caches.
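As a sketch of the loop blocking mentioned in point 1 — our own illustration, with the matrix size N and block size B chosen arbitrarily — each pass works on B×B tiles small enough to stay cache-resident while they are reused:

const int N = 1024;   // matrix dimension (illustrative)
const int B = 64;     // block size, chosen so a few B x B tiles fit in cache

// Blocked C = C + A * Bm for N x N row-major matrices.
void matmul_blocked(const double (*A)[N], const double (*Bm)[N], double (*C)[N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                // One tile multiply: the same B x B sub-blocks are reused
                // many times before eviction (temporal locality), and the
                // innermost j loop walks rows sequentially (spatial locality).
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += A[i][k] * Bm[k][j];
}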
Data Locality and Performance

1. Memory Access Patterns: Even if two programs perform the same operations, the order
and pattern in which they access memory can differ. Programs with better data locality will
have fewer cache misses, leading to faster execution.
2. Cache Utilization: Programs that access data in a sequential manner (good spatial locality)
or reuse data frequently (good temporal locality) make better use of the CPU cache. This
reduces the need to fetch data from slower main memory.
3. Loop Optimizations: Techniques like loop unrolling and blocking can transform a program
to improve its data locality. For example, accessing array elements in a cache-friendly order
can significantly boost performance.
Example
Consider two semantically equivalent loops (in C/C++, 2-D arrays are stored in row-major order):

// Example 1: Good locality
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}

// Example 2: Poor locality
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}

In the first example, the inner loop walks each row sequentially, matching the row-major
storage order, so consecutive accesses fall within the same cache line. In the second example,
the inner loop strides down a column, jumping N elements between accesses; it touches a
different cache line on almost every access, which leads to poor cache performance.
Semantically equivalent programs can have very
different locality, and thus performance
(a) spatial locality            (b) no spatial locality
for (i = 0; i < n; i++)         for (j = 0; j < n; j++)
  for (j = 0; j < n; j++)         for (i = 0; i < n; i++)
    m[i][j] = 0;                    m[i][j] = 0;

[Figure: row-major memory layout — m[0][0], m[0][1], …, m[0][n-1], m[1][0], m[1][1], …, m[1][n-1], m[2][0], m[2][1], … are contiguous. Loop (a) walks this sequence in order; loop (b) jumps n elements between consecutive writes.]

Cache operation: on a miss, the whole cache line is brought into the cache. (a) will have far
fewer cache misses than (b). Run lect5/2d.cpp to see the performance difference.
Summary
 Speedup, scalability, strong scaling, weak scaling
 Amdahl’s law
 Gustafson’s law
 Modern CPU core, its hardware features, and key issues
Practical Scenario
 Imagine a software development team is working on a large application. The project consists of various tasks,
some of which can be parallelized (e.g., coding, testing), while others must be done sequentially (e.g., project
planning, integration).
 Breakdown:
 Total Project Time: 1000 hours
 Parallelizable Tasks: 800 hours (80% of the total time)
 Sequential Tasks: 200 hours (20% of the total time)
 Applying Amdahl’s Law:
 The team decides to use 4 developers to work on the parallelizable tasks. According to Amdahl’s Law, the
speedup ( S ) can be calculated as follows:
where: ( P = 0.8 ) (80% of the tasks can be parallelized)
( N = 4 ) (number of developers)
Plugging in the values:
S = 1 / ((1 − 0.8) + 0.8 / 4) = 1 / (0.2 + 0.2) = 2.5
Practical Scenario
Interpretation:
 Original Time: 1000 hours
 New Time with 4 Developers: 1000 / 2.5 = 400 hours
 By using 4 developers, the team can reduce the project time from 1000 hours to 400 hours.
However, the speedup is limited by the 200 hours of sequential tasks that cannot be parallelized.
 Practical Insight:
 Even if the team adds more developers, the maximum speedup is constrained by the 200 hours of
sequential work. This demonstrates the diminishing returns of adding more resources to
parallelizable tasks.
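As a quick check on that limit (our own arithmetic, not on the slide): with arbitrarily many developers, S approaches 1 / (1 − P) = 1 / 0.2 = 5, so the project can never finish in less than 1000 / 5 = 200 hours — exactly the sequential portion.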
