PDC Notes by Zatch-1

The document outlines the differences between data parallelism and control parallelism, highlighting their definitions, examples, granularity, and advantages and disadvantages. It also discusses performance metrics in parallel computing, Amdahl's Law, shared and distributed memory architectures, leader election algorithms, and mutual exclusion mechanisms. Overall, it provides a comprehensive overview of key concepts in parallel and distributed computing systems.

Difference Between Data Parallelism and Control Parallelism

Definition
- Data parallelism: the same operation is applied to different parts of the data simultaneously.
- Control parallelism: different operations are executed in parallel, possibly on different data.

Focus
- Data parallelism: parallelizing data processing by dividing the data across multiple processors.
- Control parallelism: parallelizing control flow by executing different tasks in parallel.

Example
- Data parallelism: matrix multiplication, image processing, large dataset computations.
- Control parallelism: multi-threaded applications, executing multiple functions or tasks concurrently.

Granularity
- Data parallelism: typically fine-grained (splitting large data sets into smaller chunks).
- Control parallelism: typically coarse-grained (separate execution flows or independent tasks).

Dependency
- Data parallelism: requires the same operation to be applicable to all data elements.
- Control parallelism: tasks can be independent or interdependent.

Implementation
- Data parallelism: achieved through SIMD (Single Instruction, Multiple Data) or GPU-based parallelism.
- Control parallelism: achieved through MIMD (Multiple Instruction, Multiple Data) or multithreading.

Hardware Suitability
- Data parallelism: works well on GPUs, vector processors, and SIMD architectures.
- Control parallelism: works well on multi-core CPUs, distributed systems, and MIMD architectures.

Example in Programming
- Data parallelism: parallel for-loops, GPU-based tensor computations in deep learning.
- Control parallelism: multithreading in Java, parallel execution of different functions in a program.

Advantage
- Data parallelism: high efficiency in processing large datasets due to uniform operations.
- Control parallelism: greater flexibility, as different computations can run independently.

Disadvantage
- Data parallelism: limited to problems where the same operation applies to all data elements.
- Control parallelism: may require synchronization between tasks, leading to overhead.

Performance Metrics:
Performance metrics are essential in parallel and distributed computing to evaluate and optimize system efficiency, resource utilization, and scalability. Metrics such as speedup, efficiency, and utilization assess how well resources are used, and they help evaluate how well a system performs when multiple processors or nodes work together. Key performance metrics include:
1. Speedup: measures how much faster a parallel system executes a task compared to a sequential system. It indicates the efficiency gained through parallelism.
2. Efficiency: the ratio of speedup to the number of processors used, reflecting how effectively the system utilizes the available resources.
3. Scalability: describes how well a parallel system performs as the size of the problem or the number of processors increases.
4. Throughput: measures the number of tasks or operations completed per unit time, highlighting system productivity.
5. Latency: indicates the delay in completing a task or a communication between nodes, which is crucial in distributed systems.
6. Load Balancing: evaluates how evenly workloads are distributed across processors to minimize idle time and maximize resource usage.
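To make the first two metrics concrete, here is a minimal Python sketch; the timing figures and the task count are hypothetical measurements, used only to illustrate the definitions above.

```python
# Minimal sketch (hypothetical numbers): computing speedup, efficiency, and
# throughput from measured sequential and parallel execution times.

def speedup(t_sequential, t_parallel):
    """Speedup S = T_sequential / T_parallel."""
    return t_sequential / t_parallel

def efficiency(s, num_processors):
    """Efficiency E = S / N, i.e. how well the N processors are utilized."""
    return s / num_processors

if __name__ == "__main__":
    t_seq = 120.0   # seconds on 1 processor (assumed measurement)
    t_par = 20.0    # seconds on 8 processors (assumed measurement)
    n = 8

    s = speedup(t_seq, t_par)      # 6.0x
    e = efficiency(s, n)           # 0.75 -> 75% utilization
    throughput = 1000 / t_par      # tasks per second, if the run processed
                                   # 1000 tasks (assumed)

    print(f"Speedup: {s:.2f}x, Efficiency: {e:.2%}, Throughput: {throughput:.1f} tasks/s")
```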
Amdahl's Law: Speed-up Performance Law
- The potential speedup of an algorithm on a parallel computing platform is given by Amdahl's Law, originally formulated by Gene Amdahl in the 1960s.
- It is one of the speedup performance laws.
- It is based on a fixed problem size (or fixed workload).
- To keep the efficiency of a system fixed, we actually have to increase both the size of the problem and the number of processors simultaneously.
- Amdahl's Law tells us that, for a given problem size, the speedup does not increase linearly as the number of processors increases; in fact, the speed-up tends to saturate.
- Amdahl's Law states that the small portion of the program which cannot be parallelized (the serial part) will limit the overall speed-up available from parallelization.
- Typically, any large mathematical or engineering problem consists of several parallelizable parts and several serial (non-parallelizable) parts.
- Computation Problem = Serial Part + Parallel Part

"Amdahl's Law is a law governing the speedup of using parallel processors on a problem, versus using only one serial processor, under the assumption that the problem size remains the same when parallelized."

Amdahl's Law is a fundamental principle in parallel computing that defines the theoretical maximum speedup of a task when parts of it can be parallelized. It highlights the impact of the sequential portion of an algorithm on overall performance.

Formula:

S(N) = 1 / ((1 - P) + P/N)

where:
- S(N) = speedup with N processors
- P = fraction of the task that can be parallelized
- 1 - P = fraction of the task that must remain sequential
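A short worked example of the formula (the parallel fraction P = 0.90 and the processor counts below are illustrative values, not measurements):

```python
# Minimal worked example of Amdahl's Law: S(N) = 1 / ((1 - P) + P / N).

def amdahl_speedup(p, n):
    """Theoretical speedup for parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    p = 0.90  # assume 90% of the program can be parallelized
    for n in (1, 2, 4, 8, 16, 64, 1024):
        print(f"N = {n:5d}  ->  speedup = {amdahl_speedup(p, n):5.2f}x")
    # Even with 1024 processors the speedup stays below 1 / (1 - 0.90) = 10x,
    # showing how the serial fraction saturates the achievable speedup.
```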
1. Shared Memory Architecture
- In a shared memory architecture, multiple processors operate independently but share a common memory as a global address space.
- Shared memory systems are tightly coupled systems, since the processors share a common memory.
- Only one processor can access a given shared memory location at a time.
- Changes made to a memory location by one processor are visible to all other processors.
- Shared memory multiprocessor machines can be divided into two main classes based upon memory access times: UMA and NUMA.
  1. Uniform Memory Access (UMA)
  2. Non-Uniform Memory Access (NUMA)
  3. Cache-Only Memory Architecture (COMA): a special case of NUMA

Interconnection Networks for Parallel Computers
- Interconnection networks carry data between processors and to memory.
- Interconnects are made of switches and links (wires, fiber).
- Interconnects are classified as static or dynamic.
- Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks.
- Dynamic networks are built using switches and communication links and are also referred to as indirect networks.

Uniform Memory Access (UMA)
- UMA is a shared memory architecture used in parallel computers.
- In the UMA model, the physical memory is shared uniformly by all the processors.
- All the processors have equal access time to every memory location (which is why it is called Uniform Memory Access).
- It is most commonly represented by Symmetric Multiprocessor (SMP) machines.
- Each processor may have a private cache; such machines are sometimes called CC-UMA (Cache-Coherent UMA).
- Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
- The UMA multiprocessor model: processors 1 to n are connected through a system interconnect (bus, crossbar, or multistage network) to shared I/O devices and shared memory modules.

Non-Uniform Memory Access (NUMA)
- A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.
- The shared memory is physically distributed among all the processors as local memories. The collection of all local memories forms a global address space which can be accessed by all the processors.
- It is faster to access a local memory with the local processor.
- Access to remote memory attached to other processors is slower, due to the added delay through the interconnection network.
- The system interconnect takes the form of a common bus, crossbar switch, or multistage network.
- NUMA machines are often made by physically linking two or more SMPs; one SMP can directly access the memory of another SMP.
- Not all processors have equal access time to all memories (which is why it is called Non-Uniform Memory Access); memory access across the link is slower.

Cache-Only Memory Architecture (COMA)
- The COMA model is a special case of a NUMA machine in which the distributed main memories are converted to caches.
- All caches form a global address space and there is no memory hierarchy at each processor node.
- Remote cache access is assisted by distributed cache directories.

Shared Memory: Advantages
- The global address space provides a user-friendly programming perspective on memory.
- Data sharing between tasks is both fast and uniform due to the proximity of memory to the CPUs.

Shared Memory: Disadvantages
- The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs geometrically increases the traffic on the shared memory-CPU path and, for cache-coherent systems, geometrically increases the traffic associated with cache/memory management.
- The programmer is responsible for the synchronization constructs that ensure "correct" access to global memory.
- Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever-increasing numbers of processors.

Distributed Memory Architecture
- Distributed memory systems consist of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals.
- Distributed memory systems are loosely coupled systems, as processors have their own local memory.
- Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors.
- Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
- When a processor needs access to data in another processor, this is done by passing messages between processors. It is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
- The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet. Topologies include ring, tree, mesh, torus, hypercube, cube-connected cycles, etc.
- Distributed memory model: each node has its own processor, cache, and memory, and the nodes communicate only through the interconnection network.

Distributed Memory: Advantages
- Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionally.
- Each processor can rapidly access its own memory without interference and without the overhead incurred in maintaining cache coherency.

Distributed Memory: Disadvantages
- The programmer is responsible for many of the details associated with data communication between processors.
- It may be difficult to map existing data structures, based on global memory, to this memory organization.
- Non-uniform memory access times.
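The shared-memory/distributed-memory distinction shows up directly in how programs exchange data. Below is a minimal, illustrative Python sketch using only the standard library: threads stand in for shared-memory processors, and separate processes connected by a pipe stand in for message-passing nodes. It is a teaching sketch under those assumptions, not a real cluster program.

```python
# Illustrative sketch: shared-address-space style (threads) vs
# message-passing style (processes), using only the standard library.
import threading
import multiprocessing as mp

# Shared-memory style: both threads see the same 'counter' variable,
# so access must be protected by a lock (coherence is the hardware's job).
counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

# Distributed-memory style: the worker has its own address space, and the
# only way to return a result is an explicit message over a channel.
def worker(conn, n):
    total = sum(range(n))
    conn.send(total)        # explicit communication, defined by the programmer
    conn.close()

if __name__ == "__main__":
    t1 = threading.Thread(target=add_many, args=(10000,))
    t2 = threading.Thread(target=add_many, args=(10000,))
    t1.start(); t2.start(); t1.join(); t2.join()
    print("shared-memory counter:", counter)          # 20000

    parent, child = mp.Pipe()
    p = mp.Process(target=worker, args=(child, 1000))
    p.start()
    print("message-passing result:", parent.recv())   # 499500
    p.join()
```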
Summary of Amdahl's Law: Advantages, Disadvantages, and Significance

Advantages:
- Predicts Performance Gains: provides a mathematical model to estimate speedup with multiple processors.
- Identifies Bottlenecks: highlights the impact of sequential portions in a program.
- Guides Cost-Effective Scaling: prevents over-provisioning of resources.
- Helps in Parallel Algorithm Design: encourages maximizing parallel execution.
- Widely Applicable: used in HPC, cloud computing, AI, and distributed systems.

Disadvantages:
- Assumes Fixed Problem Size: does not account for increasing workloads (addressed by Gustafson's Law).
- Ignores Communication Overhead: real-world systems have synchronization and data-transfer delays.
- Sequential Bottleneck Limits Speedup: even with infinite processors, the sequential portion restricts performance.
- Limited to Homogeneous Systems: does not consider CPU-GPU or hybrid computing architectures.
- Not Suitable for Dynamic Workloads: does not adapt to real-time scheduling variations.

Significance:
- Helps Analyze Parallel System Performance: determines the efficiency of adding more processors.
- Provides a Realistic Expectation of Speedup: shows diminishing returns when increasing processors.
- Guides Resource Allocation in Computing: helps optimize parallel architectures.
- Forms the Basis for Further Parallel Computing Models: inspired laws like Gustafson's Law for scalable workloads.

Leader Election Algorithm
A Leader Election Algorithm is a distributed algorithm used in networked systems to select a single process (the leader) among multiple distributed processes. The leader is responsible for coordination, decision-making, and resource allocation in the distributed system.

Need for a Leader Election Algorithm: in distributed systems there is no central authority, so leader election ensures:
- Coordination among processes.
- Avoidance of conflicts (e.g., multiple processes trying to access the same resource).
- Efficient task distribution and fault tolerance.

How Does It Work?
1. Processes communicate to decide who should be the leader.
2. Election Criteria: the leader is chosen based on factors such as process ID, computational power, or priority.
3. Election Messages: processes send election messages to claim leadership.
4. Winner Announcement: the selected leader informs the others.

Types of Leader Election Algorithms:
1. Bully Algorithm:
   - The highest-ID process becomes the leader.
   - If the leader crashes, a new leader is elected.
2. Ring Algorithm:
   - Processes are arranged in a logical ring.
   - A token circulates, and the process with the highest ID wins.
3. Randomized Leader Election: uses randomization to pick a leader fairly.

1. Bully Algorithm
The Bully Algorithm is a leader election algorithm used in distributed systems where the process with the highest ID becomes the leader. When a leader fails, a new election is initiated by any process that detects the failure. It sends election messages to all higher-ID processes, and if no higher process responds, it declares itself the new leader.

Steps:
1. A process detects that the leader has failed.
2. It sends ELECTION messages to all higher-ID processes.
3. If no higher process responds, it becomes the leader.
4. If a higher process responds, that process starts its own election.
5. The highest remaining process broadcasts a COORDINATOR message to inform the others.
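A minimal simulation of these steps over integer process IDs (the set of alive processes and the printed messages are illustrative; a real implementation exchanges ELECTION and COORDINATOR messages over the network with timeouts):

```python
# Minimal simulation of the Bully election (IDs and the 'alive' set are
# illustrative; a real system uses network messages and timeouts).

def bully_election(initiator, alive_ids):
    """Return the coordinator elected when `initiator` detects a failure."""
    higher = [pid for pid in alive_ids if pid > initiator]
    if not higher:
        # No higher-ID process answered: the initiator wins.
        print(f"P{initiator}: no higher process alive -> I am the coordinator")
        return initiator
    # Some higher-ID process is alive; the highest of them takes over the
    # election and (after its own ELECTION round) announces itself.
    coordinator = max(higher)
    print(f"P{initiator}: ELECTION answered by {sorted(higher)}; "
          f"P{coordinator} broadcasts COORDINATOR")
    return coordinator

if __name__ == "__main__":
    alive = {1, 2, 3, 4, 5}                  # P6 (the old leader) has crashed
    new_leader = bully_election(3, alive)    # P3 detects the failure
    print("New leader:", f"P{new_leader}")   # -> P5
```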
Bully Algorithm Example with 6 Processes (Failure Scenario)
Assume six processes in a distributed system, each having a unique ID (1 to 6). The highest-ID process is always the leader.

Initial setup:
- P1: Alive (lowest ID)
- P2: Alive
- P3: Alive
- P4: Alive
- P5: Alive
- P6: Leader, then fails (highest ID, crashes)

Step-by-step execution:
1. P6 (the leader) fails: the remaining processes detect that P6 is unresponsive.
2. P3 starts an election: P3 realizes P6 is down and sends ELECTION messages to P4, P5, and P6.
3. P4 and P5 respond: since P4 and P5 have higher IDs than P3, they take over the election. P5 initiates another election (since P6 is down, P5 is the next highest).
4. P5 becomes the new leader: P5 sends a message to P6 and gets no response. Since no higher process responds, P5 declares itself the leader and broadcasts "I am the new leader" to all remaining processes.

Final state:
- P1: Alive
- P2: Alive
- P3: Alive (initiated the election)
- P4: Alive (participated in the election)
- P5: New leader (highest available ID)
- P6: Failed (no response)

Important points:
- If a leader fails, the next highest-ID process takes charge.
- If multiple processes detect the failure, the lowest ID among them starts the election.
- Only the highest available process wins the election and becomes the new leader.

2. Ring Algorithm
The Ring Algorithm is a leader election algorithm where processes are arranged in a logical ring (each process knows its next neighbour). When a leader fails, an election message is circulated around the ring, collecting process IDs. The process with the highest ID is elected as the new leader and announces itself to all nodes. The elected leader (coordinator) then manages shared resources or coordinates tasks.

Steps:
1. Process Initiation: any process can initiate the election if it detects a failure of the current leader.
2. Message Passing (Election Phase): the initiator sends an ELECTION message to the next process in the ring; each process appends its own ID and forwards the message.
3. Leader Selection: when the message returns to the initiator, the process with the highest collected ID is the new leader.
4. Announcement (Coordinator Phase): the leader sends a COORDINATOR message around the ring to inform all processes.

Example
Consider four processes P1, P2, P3, P4 (P4 has the highest ID), arranged as P1 -> P2 -> P3 -> P4 -> P1.
- If P1 initiates an election, the message travels as [1] -> P2 -> [1,2] -> P3 -> [1,2,3] -> P4 -> [1,2,3,4] -> P1.
- Since P4 is the highest ID collected, a COORDINATOR(P4) message is sent around the ring. (A small simulation of this election appears after the comparison table below.)

Advantages and Disadvantages of the Ring Algorithm
- Pro: simple to implement. Con: message overhead increases with the number of nodes.
- Pro: no central point of failure. Con: if a process crashes, ring recovery is needed.

Comparison: Bully Algorithm vs. Ring Algorithm
- Election Basis: Bully: highest ID; Ring: highest ID.
- Topology: Bully: fully connected; Ring: logical ring.
- Message Complexity: Bully: O(n²) (high); Ring: O(n) (moderate).
- Failure Handling: Bully: needs re-election if the highest-ID process fails; Ring: needs re-election if the ring breaks.
- Speed: Bully: fastest; Ring: slow.
- Use Cases: Bully: centralized systems; Ring: token-based systems.
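The simulation referred to above: a minimal sketch of the ring election using the four-process example (the ring order is the one from the example; a real implementation would pass the message between networked nodes):

```python
# Minimal simulation of the Ring election (the ring order is illustrative).

def ring_election(ring, initiator_index):
    """`ring` lists process IDs in ring order; the ELECTION message collects
    IDs as it travels once around, and the highest collected ID wins."""
    n = len(ring)
    collected = []
    i = initiator_index
    for _ in range(n):                 # message goes once around the ring
        collected.append(ring[i])
        i = (i + 1) % n
    leader = max(collected)
    print(f"ELECTION collected {collected}; COORDINATOR({leader}) circulated")
    return leader

if __name__ == "__main__":
    ring = [1, 2, 3, 4]                # P1 -> P2 -> P3 -> P4 -> P1
    print("New leader: P", ring_election(ring, 0), sep="")   # -> P4
```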
Mutual Exclusion
- Mutual Exclusion (mutex) is a fundamental concept in concurrent computing that ensures that multiple processes or threads do not access a shared resource simultaneously.
- It prevents race conditions and ensures data consistency.

Types of Mutual Exclusion Mechanisms
1. Software-based Solutions
   - Peterson's Algorithm: works for two processes using a flag mechanism.
   - Dekker's Algorithm: ensures mutual exclusion without hardware support.
2. Hardware-based Solutions
   - Test-and-Set (TAS) Lock: uses an atomic operation to lock critical sections.
   - Compare-and-Swap (CAS): uses atomic operations for locking.
3. Operating System-based Solutions
   - Semaphores: a synchronization tool using counters (binary and counting semaphores). A semaphore is a synchronization primitive used in operating systems and distributed computing to control access to shared resources. It prevents race conditions and ensures proper execution of concurrent processes. (A usage sketch is given at the end of this section.)

Types of Semaphores
1. Binary Semaphore (Mutex)
   - Can have only two values: 0 (locked) or 1 (unlocked).
   - Used for mutual exclusion, allowing only one process to access a critical section at a time.
   - Example: locking a shared file so only one user can edit it at a time.
   - Example scenario: imagine a public restroom with a single stall. If a person enters, they lock the door (semaphore = 0). When they leave, they unlock it (semaphore = 1), allowing another person to enter.
2. Counting Semaphore
   - Can have a value greater than 1, controlling access for multiple processes.
   - Used when multiple instances of a resource are available.
   - Example: limiting database connections in a web application. Consider a web application where multiple users request access to a database, but only a limited number of connections (say, 5) are available.

Other OS-level mechanisms:
- Mutex Locks: a simplified version of semaphores for mutual exclusion.
- Monitors: high-level synchronization using object-oriented approaches.
- Token-Based Approaches: a token is passed among processes to grant access.
  - A unique token is shared among all the sites. If a site possesses the unique token, it is allowed to enter its critical section.
  - This approach uses sequence numbers to order requests for the critical section. Each request for the critical section carries a sequence number, which is used to distinguish old requests from current ones.
  - This approach ensures mutual exclusion because the token is unique.

Non-token based approach:
- A site communicates with other sites in order to determine which site should execute the critical section next. This requires the exchange of two or more successive rounds of messages among sites.
- This approach uses timestamps instead of sequence numbers to order requests for the critical section. Whenever a site makes a request for the critical section, it gets a timestamp. The timestamp is also used to resolve any conflict between critical section requests.
- All algorithms that follow the non-token based approach maintain a logical clock. Logical clocks are updated according to Lamport's scheme.
- Example: Ricart-Agrawala Algorithm, which uses timestamp-based message passing.
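The usage sketch referred to above: a small Python example of the counting-semaphore scenario, using threading.Semaphore to cap concurrent "database connections" at 5. The pool size and the simulated query are assumptions for illustration, not a real database API.

```python
# Counting-semaphore sketch: limit concurrent "database connections" to 5.
import threading
import time

MAX_CONNECTIONS = 5
db_connections = threading.Semaphore(MAX_CONNECTIONS)

def handle_request(user_id):
    with db_connections:               # blocks when 5 connections are in use
        print(f"user {user_id}: connected")
        time.sleep(0.1)                # pretend to run a query
    print(f"user {user_id}: released connection")

if __name__ == "__main__":
    threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(12)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```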
Difference between Token-based and Non-Token-based Algorithms in Distributed Systems:

1. Token-based: a unique token is shared among all the sites in the distributed computing system; a site is allowed to enter its critical section only if it possesses the token. Non-token-based: there is no token, and no concept of sharing a token for access; instead, two or more successive rounds of messages are exchanged between sites to determine which site is to enter the critical section next.
2. Token-based: uses sequence numbers to order requests for the critical section and to resolve conflicts between simultaneous requests. Non-token-based: uses timestamps to order requests for the critical section and to resolve conflicts between simultaneous requests.
3. Token-based: produces less message traffic than non-token-based algorithms. Non-token-based: produces more message traffic than token-based algorithms.
4. Token-based: free from deadlock (no two or more processes wait indefinitely for messages that will never arrive), because of the existence of the unique token in the distributed system. Non-token-based: not free from the deadlock problem, as they are based on timestamps only.
5. Token-based: requests are executed exactly in the order in which they are made. Non-token-based: there is no such guarantee of execution order.
6. Token-based: considered more scalable, since the token itself carries the information needed to grant access and the server does not have to store session state. Non-token-based: less scalable, since the server is not freed from those tasks.
7. Token-based: access control can be fine-grained, because roles, permissions, and resources can be specified inside the token. Non-token-based: access control is not as fine-grained, as there is no token to carry such information.
8. Token-based: makes authentication easy. Non-token-based: does not make authentication easy.
9. Token-based examples: Singhal's Heuristic Algorithm, Raymond's Tree-Based Algorithm, Suzuki-Kasami Algorithm. Non-token-based examples: Lamport's Algorithm, Ricart-Agrawala Algorithm, Maekawa's Algorithm.

Load Balancing
Load balancing is the process of distributing tasks or workloads across multiple resources (servers, processors, etc.) to ensure efficient operation, avoid overload, and improve performance. It is widely used in computer networks, cloud computing, and multi-processor systems. The primary focus of load balancing and task scheduling in distributed computing is on how multiple interconnected systems coordinate to efficiently manage tasks and resources.
- In a distributed system, multiple servers or nodes handle requests from users or applications.
- Load balancing ensures that work is evenly distributed among the available servers/nodes to avoid overloading any single system.

Need for Load Balancing:
1. Prevents Server Overload: ensures no single server gets overwhelmed.
2. Optimizes Resource Usage: distributes tasks efficiently to utilize all available resources.
3. Enhances Reliability: improves fault tolerance and system availability.
4. Improves Response Time: reduces waiting time for users or processes.

Types of Load Balancing in Distributed Computing
1. Static Load Balancing: workload distribution is pre-determined and does not change during execution.
   - Example: Round Robin Algorithm (requests are assigned to servers in a cyclic order; see the sketch after this list).
2. Dynamic Load Balancing: workload allocation is adjusted in real time based on system performance.
   - Example: Least Connections Algorithm (assigns new tasks to the node with the fewest active connections).
3. Centralized vs. Decentralized Load Balancing
   - Centralized: a single controller makes all load-balancing decisions (e.g., a master server managing worker nodes).
   - Decentralized: nodes make their own load-balancing decisions by communicating with each other (e.g., peer-to-peer systems).
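A minimal sketch of the two scheduling policies named above (the server names and the bookkeeping dictionary are made up for illustration):

```python
# Sketch of Round Robin (static) and Least Connections (dynamic) assignment.
from itertools import cycle

servers = ["node-a", "node-b", "node-c"]

# Static: Round Robin assigns requests in a fixed cyclic order.
round_robin = cycle(servers)
def assign_round_robin():
    return next(round_robin)

# Dynamic: Least Connections assigns to the node with the fewest active tasks.
active = {s: 0 for s in servers}
def assign_least_connections():
    target = min(active, key=active.get)
    active[target] += 1                # bookkeeping; decrement when the task ends
    return target

if __name__ == "__main__":
    print([assign_round_robin() for _ in range(6)])
    print([assign_least_connections() for _ in range(6)])
```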
Issues in Designing Load-Balancing Algorithms:
Many issues need to be taken into account while designing load-balancing algorithms:
- Load Estimation Policy: determines how the load of a node in a distributed system is measured.
- Process Transfer Policy: decides whether a process should be executed locally or remotely (a small sketch of such a policy follows this list).
- State Information Exchange Policy: determines the strategy for exchanging system load information among the nodes in a distributed system.
- Location Policy: determines the selection of destination nodes for the migration of a process.
- Priority Assignment Policy: determines whether priority is given to a local or a remote process on a node for execution.
- Migration Limiting Policy: determines the limit value for the migration of processes.
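As one concrete illustration of a process transfer policy, here is a minimal double-threshold sketch; the threshold values and the load measure are assumptions for illustration, not something prescribed by these notes:

```python
# Hypothetical double-threshold transfer policy: run locally while the local
# load is below HIGH; otherwise migrate to a genuinely under-loaded remote node.
HIGH_THRESHOLD = 0.80   # assumed values, normalized load in [0, 1]
LOW_THRESHOLD = 0.40

def place_process(local_load, remote_loads):
    """Return 'local' or the name of the remote node to migrate the process to."""
    if local_load < HIGH_THRESHOLD:
        return "local"
    candidate = min(remote_loads, key=remote_loads.get)
    if remote_loads[candidate] < LOW_THRESHOLD:
        return candidate
    return "local"

if __name__ == "__main__":
    print(place_process(0.95, {"node-a": 0.30, "node-b": 0.70}))   # -> node-a
    print(place_process(0.50, {"node-a": 0.30, "node-b": 0.70}))   # -> local
```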
