
UNIT - 3

1] 2-D MESH SIMD MODEL


The 2-D mesh SIMD (Single Instruction, Multiple Data) model is a parallel computing architecture
commonly used for data-parallel computations on regular grids, such as image processing, matrix
operations, and stencil computations. In this model, multiple processing elements (PEs) are arranged in a
2-D grid, and each PE executes the same instruction simultaneously but operates on different data
elements.

Diagram of 2-D Mesh SIMD Model

+---------+---------+---------+---------+
| PE(0,0) | PE(1,0) | PE(2,0) | PE(3,0) |
+---------+---------+---------+---------+
| PE(0,1) | PE(1,1) | PE(2,1) | PE(3,1) |
+---------+---------+---------+---------+
| PE(0,2) | PE(1,2) | PE(2,2) | PE(3,2) |
+---------+---------+---------+---------+
| PE(0,3) | PE(1,3) | PE(2,3) | PE(3,3) |
+---------+---------+---------+---------+

In this diagram:

1. Each square represents a processing element (PE).

2. PEs are organized in a 2-D grid, forming a mesh.


3. Each PE is connected to its immediate north, south, east, and west neighbors; these links are the communication paths used to exchange data.

Example Algorithm: Matrix Addition

Let's consider a simple algorithm for parallel matrix addition using the 2-D mesh SIMD model:

Problem: Given two matrices A and B of size N×N, compute the matrix sum C = A + B in parallel using the
2-D mesh SIMD model.

Algorithm:

Each PE (i, j) reads one element from matrices A and B, denoted as A(i, j) and B(i, j), respectively.

Each PE computes the sum of the corresponding elements: C(i, j) = A(i, j) + B(i, j).

Because each output element depends only on the corresponding elements of A and B, no data exchange between neighboring PEs is required for this operation; neighbor communication becomes necessary only for computations, such as stencils, that read adjacent elements.

The result matrix C is obtained by collecting the elements computed by the individual PEs (or simply left distributed across the mesh).

Pseudocode:

for i = 0 to N-1 do
    for j = 0 to N-1 do
        C(i, j) = A(i, j) + B(i, j)

Parallelization:

Each PE executes the inner loop independently, computing the sum for one element of the result matrix.

The outer loop can be parallelized across rows or columns of the matrix, with each row or column
assigned to a different group of PEs.

Synchronization is only required when the computation references values held by neighboring PEs (for example, exchanging boundary values in stencil-style updates); pure element-wise addition needs no such exchange.
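As a hedged illustration, the C sketch below parallelizes the element-wise loop with OpenMP, each (i, j) iteration playing the role of PE(i, j); the matrix size N and the sample data are assumptions made only for illustration.

#include <stdio.h>

#define N 4   /* assumed mesh/matrix size for illustration */

int main(void) {
    double A[N][N], B[N][N], C[N][N];

    /* Fill A and B with sample data. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
        }

    /* Each (i, j) iteration plays the role of PE(i, j): the same
       instruction, C = A + B, applied to a different data element. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];

    printf("C[1][2] = %g\n", C[1][2]);   /* (1+2) + (1-2) = 2 */
    return 0;
}

Compiled with OpenMP support (e.g., gcc -fopenmp), every addition runs independently; without it, the pragma is ignored and the loop runs sequentially, which mirrors the data-parallel nature of the model.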

Performance:

The sequential algorithm performs O(N^2) additions. With an N×N mesh of PEs, every PE performs a single addition, so the parallel computation time is O(1) (ignoring the cost of distributing and collecting the data); with P < N^2 PEs it is O(N^2 / P).

The speedup depends on factors such as the number of PEs, communication overhead, and load
balancing.

This example illustrates how the 2-D mesh SIMD model can be used to parallelize computations on
regular grids, achieving high throughput and efficiency through data parallelism.

Neighborhood Communication:
PEs in the mesh communicate with their immediate neighbors to exchange data required for
computations.

Communication patterns can vary based on the algorithm, but common approaches include
nearest-neighbor communication or communication along rows and columns.

Load Balancing:

Achieving balanced work distribution among PEs is essential for optimal performance.

Irregularities in the data or computation may require load balancing techniques to ensure that all PEs
contribute equally to the workload.

Boundary Handling:

Handling boundary conditions in the mesh can be challenging, as PEs at the edges have fewer neighbors.

Various strategies, such as ghost cells or padding, can be used to address boundary issues and ensure
correct computation.
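As a rough sketch of the ghost-cell idea (not tied to any particular machine), the C fragment below pads a local block with one extra row and column on each side so that a 5-point averaging stencil can read all four neighbors without special edge cases; the block size and fill values are assumptions.

#include <stdio.h>

#define B 4          /* assumed local block size held by one PE */
#define P (B + 2)    /* padded size: one ghost layer on each side */

int main(void) {
    double local[P][P] = {0};   /* interior cells plus ghost border */
    double out[P][P]   = {0};

    /* Fill the interior; in a real mesh the ghost rows/columns would be
       received from the north/south/east/west neighbors, or set from a
       boundary condition at the edge of the global grid. */
    for (int i = 1; i <= B; i++)
        for (int j = 1; j <= B; j++)
            local[i][j] = 10 * i + j;

    /* The stencil can now index i-1, i+1, j-1, j+1 unconditionally. */
    for (int i = 1; i <= B; i++)
        for (int j = 1; j <= B; j++)
            out[i][j] = 0.25 * (local[i-1][j] + local[i+1][j] +
                                local[i][j-1] + local[i][j+1]);

    printf("out[2][2] = %g\n", out[2][2]);   /* 0.25*(12+32+21+23) = 22 */
    return 0;
}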

Scalability:

The 2-D mesh SIMD model scales well for problems with regular structures, as the number of PEs
increases.

However, scalability may be limited by factors such as communication overhead, synchronization requirements, and memory constraints.

Fault Tolerance:

Fault tolerance mechanisms are essential for robustness in large-scale systems.

Redundancy, checkpointing, and error detection/correction techniques can help mitigate the impact of
hardware failures on computation.

Algorithmic Variations:

The 2-D mesh SIMD model can be adapted to various parallel algorithms beyond simple matrix
operations.

Examples include image processing filters, stencil computations, cellular automata simulations, and finite
difference methods.

Hybrid Models:

Hybrid parallel models combining SIMD with other parallel paradigms, such as MPI for distributed
memory systems or OpenMP for shared memory systems, are common.

Hybrid approaches leverage the strengths of different parallel models to achieve better performance and
scalability.

Memory Hierarchy:

Efficient utilization of memory hierarchy (e.g., caches, shared memory) is crucial for optimizing
performance.

Data locality and access patterns should be considered to minimize memory access latency and maximize
throughput.

Programming Challenges:

Programming for the 2-D mesh SIMD model requires careful consideration of data distribution,
communication, and synchronization.

High-level parallel programming languages and libraries, such as CUDA, OpenCL, and MPI, provide
abstractions and tools to simplify parallel programming on such architectures.

Performance Optimization:

Performance tuning techniques, such as loop unrolling, data prefetching, and vectorization, can be
applied to improve the efficiency of computations on SIMD architectures.

Profiling and analysis tools help identify bottlenecks and optimize critical sections of the algorithm.

These points highlight various aspects of utilizing the 2-D mesh SIMD model in parallel algorithms,
emphasizing its versatility, scalability, and potential challenges in achieving high-performance parallel
computation.

2] PARALLEL ALGORITHMS FOR REDUCTION


Reduction is a common operation in parallel computing where a collection of values is aggregated into a
single value through a binary associative operation, such as addition, multiplication, or finding the
maximum/minimum. Parallel reduction algorithms aim to efficiently compute the reduction result using
multiple processors or threads. Here are some parallel algorithms for reduction:

Tree-based Reduction:

In this approach, the reduction operation is performed hierarchically in a binary tree structure.

Initially, each processor or thread computes local reductions on subsets of the input data.

Then, the results are combined pairwise up the tree until a single result is obtained at the root.

Common variations include binary tree, balanced tree, and skewed tree reduction.

Scan-Based Reduction:
Scan-based algorithms compute prefix sums or cumulative operations, which can be used to perform
reduction.

The input data is partitioned into segments, and each segment's reduction result is computed
independently.

Then, prefix sums of the segment results are computed to propagate partial results through the tree
structure.

Finally, the global reduction result is obtained by combining the segment results with the prefix sums.

Parallel Prefix Sum:

Parallel prefix sum algorithms compute the cumulative sum of elements in an array efficiently.

These algorithms can be adapted for reduction by performing the reduction operation in conjunction
with the prefix sum computation.

By carefully choosing the binary associative operation, parallel prefix sum algorithms can be used for
various reduction tasks.
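A minimal C sketch of the scan-then-reduce idea: per-segment partial sums (values chosen purely for illustration) are combined with an inclusive prefix sum, and the last prefix entry is the global reduction result.

#include <stdio.h>

#define SEGMENTS 4

int main(void) {
    /* Partial results, one per segment, as if each segment had already
       been reduced independently (illustrative values). */
    int segment_sum[SEGMENTS] = {10, 6, 6, 14};

    /* Inclusive prefix sum over the segment results; in a parallel
       setting this step would itself use a parallel scan. */
    int prefix[SEGMENTS];
    prefix[0] = segment_sum[0];
    for (int i = 1; i < SEGMENTS; i++)
        prefix[i] = prefix[i - 1] + segment_sum[i];

    /* The last prefix entry is the reduction of all segments. */
    printf("global sum = %d\n", prefix[SEGMENTS - 1]);   /* 36 */
    return 0;
}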

Bitwise Reduction:

Bitwise reduction algorithms are used for reduction operations on Boolean or bit-wise data types.

These algorithms exploit bitwise operations (e.g., bitwise AND, OR, XOR) to combine values in parallel.

Bitwise reduction can be efficiently implemented using parallel hardware instructions or specialized
parallel algorithms.
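A small C sketch of a bitwise reduction, OR-combining per-worker status flags into a single word; the flag values are illustrative.

#include <stdio.h>

int main(void) {
    /* One status word per worker; each set bit reports a condition. */
    unsigned flags[4] = {0x0, 0x2, 0x0, 0x8};
    unsigned combined = 0;

    /* Bitwise OR is associative and commutative, so the combination
       order does not matter and the loop could equally be arranged
       as a tree of ORs executed in parallel. */
    for (int i = 0; i < 4; i++)
        combined |= flags[i];

    printf("combined flags = 0x%x\n", combined);   /* 0xa */
    return 0;
}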

Parallel Sorting-Based Reduction:

Reduction can be performed using parallel sorting algorithms, such as parallel merge sort or parallel
quicksort.

After sorting the input data, the reduction operation can be applied by combining adjacent sorted
elements in parallel.

Distributed Reduction:

In distributed computing environments, reduction can be performed across multiple nodes or processors
using message passing or distributed memory models.

Algorithms such as scatter-reduce or gather-reduce distribute the input data to different nodes, perform
local reductions, and then combine the partial results to obtain the global reduction result.
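In MPI this pattern maps onto the collective MPI_Reduce; below is a minimal C sketch in which each rank contributes one value (chosen for illustration) and rank 0 receives the combined result.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank holds one partial value; here simply rank + 1. */
    int local = rank + 1;
    int global = 0;

    /* Combine all partial values with + and deliver the result to rank 0. */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum over %d ranks = %d\n", size, global);

    MPI_Finalize();
    return 0;
}

Run with, for example, mpirun -np 4 ./reduce, this would print a global sum of 10.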

Hybrid Reduction:

Hybrid reduction algorithms combine multiple parallelization techniques, such as tree-based, scan-based, and sorting-based approaches, to optimize performance.

These algorithms leverage the strengths of different parallelization strategies to achieve efficient
reduction across various hardware architectures.

Optimization Techniques:

Various optimization techniques, such as load balancing, data partitioning, and cache-aware algorithms,
can be applied to improve the performance of parallel reduction algorithms.

Specialized hardware features, such as vectorization, multi-threading, and GPU acceleration, can also be
utilized to enhance the efficiency of reduction operations.
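For example, on a shared-memory machine much of this work can be delegated to the compiler and runtime through OpenMP's reduction clause; here is a minimal C sketch with an assumed array size and contents.

#include <stdio.h>

int main(void) {
    enum { N = 1000 };
    double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0;              /* illustrative data: the sum should be N */

    double sum = 0.0;

    /* The runtime gives every thread a private partial sum and combines
       the partials when the parallel region ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %g\n", sum);   /* 1000 */
    return 0;
}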

These parallel algorithms for reduction are essential building blocks in many parallel applications,
including numerical simulations, data processing, and scientific computing. The choice of algorithm
depends on factors such as the size of the input data, the characteristics of the reduction operation, the
hardware architecture, and the desired performance goals.

Let's consider a simple example of parallel reduction using a tree-based algorithm. Suppose we want to
compute the sum of an array of numbers in parallel. We'll use a binary tree structure to perform the
reduction.

Example: Parallel Sum Reduction

Input: Array of numbers A = [3, 7, 1, 5, 2, 4, 6, 8]

Algorithm:

Partition the input array into segments, with each segment assigned to a processor or thread.

Each processor computes the local sum of its assigned segment.

Perform a binary tree reduction to combine the local sums and compute the global sum.

Diagram:

Global sum:                     36
                           /          \
                      16                  20
                   /      \            /      \
Local sums:      10        6         6         14
                 P0        P1        P2        P3
                /  \      /  \      /  \      /  \
Inputs:        3    7    1    5    2    4    6    8

In this example:

The input array [3, 7, 1, 5, 2, 4, 6, 8] is partitioned into four two-element segments, assigned to processors P0, P1, P2 and P3.

Each processor computes the local sum of its segment:

Local Sum P0 = 3 + 7 = 10, Local Sum P1 = 1 + 5 = 6, Local Sum P2 = 2 + 4 = 6, and Local Sum P3 = 6 + 8 = 14.

Processors combine their local sums pairwise up the tree until a single global sum is obtained at the root.

Execution:

Initially, each processor computes its local sum independently.

At the next level of the tree the local sums are combined pairwise: 10 + 6 = 16 and 6 + 14 = 20.

Finally, the global sum 16 + 20 = 36 is obtained at the root of the tree.

This diagram illustrates how a binary tree-based reduction algorithm can efficiently compute the sum of
an array in parallel. Each level of the tree represents a stage of the reduction, with processors combining
their partial results until the final result is obtained at the root of the tree.
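A compact C sketch that mirrors this execution: each "processor" first reduces its two-element segment, and the partial sums are then combined pairwise level by level (the tree is simulated sequentially here; on real hardware every level would execute in parallel).

#include <stdio.h>

int main(void) {
    /* The array from the example, viewed as four two-element segments. */
    int a[8] = {3, 7, 1, 5, 2, 4, 6, 8};

    /* Level 1: each processor reduces its own segment. */
    int partial[4];
    for (int p = 0; p < 4; p++)
        partial[p] = a[2 * p] + a[2 * p + 1];     /* 10, 6, 6, 14 */

    /* Higher levels: combine pairwise until one value remains. */
    for (int active = 4; active > 1; active /= 2)
        for (int p = 0; p < active / 2; p++)
            partial[p] = partial[2 * p] + partial[2 * p + 1];

    printf("global sum = %d\n", partial[0]);      /* 36 */
    return 0;
}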

3] ODD-EVEN MERGE SORT


Odd-even merge sort is a parallel sorting algorithm based on merge sort. It exploits the parallelism inherent in odd-even compare-exchange networks to sort elements in parallel. Here is how the odd-even merge sort algorithm works, with a description and a diagram:

Odd-Even Merge Sort Algorithm

Partitioning Phase:

Divide the input array into equal-sized segments, each assigned to a processor or thread.

Each processor sorts its segment independently using a sequential or parallel sorting algorithm.

Odd-Even Merge Phase:

Perform a series of odd-even merge operations to merge adjacent segments and produce sorted
subarrays.

In each iteration:

Odd-even comparisons are performed between elements at corresponding positions in adjacent segments.

Elements are exchanged if they are out of order to ensure that each pair of adjacent segments is sorted.

Global Merge Phase:

Merge adjacent sorted subarrays produced in the odd-even merge phase to obtain the final sorted array.

This phase can be implemented using a parallel merging algorithm, such as parallel merge sort or parallel
merge tree.

Diagram of Odd-Even Merge Sort

Consider an example of sorting an array of numbers [5, 2, 8, 3, 1, 7, 6, 4] using odd-even merge sort.

Initial Array: [5, 2, 8, 3, 1, 7, 6, 4]

Partitioning Phase:

-----------------------------------------

Processor 1: [5, 2, 8, 3]

Processor 2: [1, 7, 6, 4]

Odd-Even Merge Phase:

-----------------------------------------

Iteration 1: odd-even comparisons within each segment (each segment becomes locally sorted)

Processor 1: [2, 3, 5, 8]

Processor 2: [1, 4, 6, 7]

Iteration 2: odd-even comparisons between the adjacent segments (the smaller half goes to Processor 1, the larger half to Processor 2)

Processor 1: [1, 2, 3, 4]

Processor 2: [5, 6, 7, 8]

Global Merge Phase:

-----------------------------------------

Final Sorted Array: [1, 2, 3, 4, 5, 6, 7, 8]

In this example:

Initially, the array is partitioned into two segments, [5, 2, 8, 3] and [1, 7, 6, 4], assigned to two
processors.

Each processor sorts its segment independently.

During the odd-even merge phase, odd-even comparisons are performed between adjacent segments to
merge and sort them.

After two iterations, all adjacent segments are merged and sorted.

Finally, a global merge phase merges the sorted segments to produce the final sorted array [1, 2, 3, 4, 5,
6, 7, 8].
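The compare-exchange idea behind the odd-even phases can be sketched in C as a sequence of odd-even transposition passes over the example array; this is a simplified stand-in for the full odd-even merging network, with the data taken from the example above.

#include <stdio.h>

#define N 8

/* Compare-exchange: put the smaller of a[i], a[j] first. */
static void compare_exchange(int *a, int i, int j) {
    if (a[i] > a[j]) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

int main(void) {
    int a[N] = {5, 2, 8, 3, 1, 7, 6, 4};   /* array from the example */

    /* N alternating even/odd phases are sufficient to sort; every
       compare-exchange within one phase is independent and could be
       executed in parallel by different PEs. */
    for (int phase = 0; phase < N; phase++) {
        int start = (phase % 2 == 0) ? 0 : 1;
        for (int i = start; i + 1 < N; i += 2)
            compare_exchange(a, i, i + 1);
    }

    for (int i = 0; i < N; i++)
        printf("%d ", a[i]);               /* 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}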

Parallelism and Efficiency

Odd-even merge sort exhibits parallelism at multiple levels, including segment sorting, odd-even
comparisons, and global merging.

The algorithm has good scalability and can efficiently utilize multiple processors or threads.

However, the performance of odd-even merge sort depends on factors such as load balancing,
communication overhead, and the efficiency of the underlying sorting and merging algorithms.

Work Distribution: In parallel odd-even merge sort, the work is distributed among multiple processors or
threads to exploit parallelism. Each processor typically handles a subset of the data, with communication
between processors to perform merging.

Load Balancing: Ensuring load balance is crucial for efficient parallel odd-even merge sort. Load
imbalance can occur when the workload is not evenly distributed among processors, leading to some
processors finishing their tasks much earlier than others. Load balancing techniques like dynamic
workload distribution or workload stealing can be employed to mitigate this issue.

Communication Overhead: Parallel odd-even merge sort involves communication between processors
during the merging phase. Minimizing communication overhead is important for performance.
Techniques such as efficient message passing or shared memory can be used to reduce communication
costs.

Parallelization Strategies: Various parallelization strategies can be employed in odd-even merge sort,
including task parallelism and data parallelism. Task parallelism involves assigning different processors to
perform different tasks, such as sorting or merging, while data parallelism involves dividing the data into
chunks and assigning each processor to work on a subset of the data.

Scalability: Scalability refers to the ability of the parallel odd-even merge sort algorithm to efficiently
utilize additional resources as the problem size or the number of processors increases. Designing
algorithms that scale well with increasing problem size or processor count is essential for handling large
datasets efficiently.

Cache Efficiency: Cache efficiency is another important consideration in parallel odd-even merge sort.
Minimizing cache misses and optimizing memory access patterns can significantly improve performance.
Techniques such as data locality optimization and cache-conscious algorithms can be employed to
enhance cache efficiency.

Fault Tolerance: In distributed computing environments, fault tolerance becomes crucial. Parallel
odd-even merge sort algorithms should be designed to handle failures gracefully, ensuring that the
computation can continue even if some processors fail. Techniques such as checkpointing and
redundancy can be used to achieve fault tolerance.

Synchronization Overhead: Synchronization between parallel processes or threads can introduce overhead, impacting performance. Minimizing synchronization overhead by using lock-free or wait-free algorithms can improve scalability and performance in parallel odd-even merge sort.

Hybrid Approaches: Hybrid approaches combining parallel odd-even merge sort with other
parallelization techniques, such as multi-threading and vectorization, can further enhance performance
on modern multi-core CPUs and accelerators like GPUs.

Algorithmic Optimizations: Various algorithmic optimizations can be applied to parallel odd-even merge
sort to improve its efficiency, such as reducing the number of comparisons during merging, optimizing
the merging phase, and exploiting properties of the data to minimize operations.
