Coa PPT-2
PARALLELISM
(RA2211003030204)
DATE: 20/10/23
Introduction
CLASSIFICATION: ILP
Basic difference between ILP and the pipelining process?
Pipelining breaks instruction execution down into stages, whereas ILP focuses on executing multiple instructions at the same time.

DATA-LEVEL PARALLELISM
Data-level parallelism is an approach to computer processing that aims to increase data throughput by operating on multiple elements of data simultaneously.
A data-parallel job on an array of 'n' elements can be divided equally among all the processors.
In sequential execution, summing all the elements of the array takes n*Ta time units, where Ta is the time to process one element.
For the same data-parallel job on 4 processors, the time taken reduces to (n/4)*Ta + merging overhead time units.
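As an illustration (not from the original slides), here is a minimal data-parallel array sum in C with OpenMP; the array size and values are placeholders. The reduction splits the n elements across the worker threads, and combining the per-thread partial sums at the end corresponds to the merging overhead in the formula above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Data-level parallelism: each of p threads sums roughly n/p
 * elements, and OpenMP merges the per-thread partial sums at the
 * end (the "merging overhead" in the formula above). */
double parallel_sum(const double *a, int n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}

int main(void) {
    int n = 1000000;
    double *a = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) a[i] = 1.0;
    printf("sum = %f\n", parallel_sum(a, n));  /* expect 1000000.0 */
    free(a);
    return 0;
}
```

Compiled with an OpenMP-enabled compiler, e.g. gcc -fopenmp.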
Classification:
SIMD (Single Instruction, Multiple Data)
SIMT (Single Instruction, Multiple Threads)
MIMD (Multiple Instruction, Multiple Data)

TASK-LEVEL PARALLELISM
An algorithm can be broken up into independent tasks when multiple computing resources are available.
This enables multiple portions of a visualization task to be executed in parallel.
The number of independent tasks that can be identified, as well as the number of CPUs available, limits the maximum amount of parallelism.
Task parallelism is used effectively in the movie industry, where several frames in an animated production are rendered in parallel (a sketch follows below).
Data parallelism, by contrast, is a more finely grained parallelism, in that we achieve our performance improvement by applying the same small set of tasks iteratively over multiple streams of data.
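A minimal sketch of task-level parallelism in C with OpenMP tasks, using the frame-rendering example from above; render_frame is a hypothetical stand-in, not part of any real renderer.

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical stand-in for rendering one frame of an animation;
 * each frame is an independent unit of work. */
static void render_frame(int frame) {
    printf("frame %d rendered by thread %d\n",
           frame, omp_get_thread_num());
}

int main(void) {
    int num_frames = 8;
    #pragma omp parallel   /* spawn a team of worker threads        */
    #pragma omp single     /* one thread creates the tasks ...      */
    for (int f = 0; f < num_frames; f++) {
        #pragma omp task firstprivate(f)  /* ... any thread runs them */
        render_frame(f);
    }
    return 0;
}
```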
APPLICATIONS OF PARALLELISM
a. High-Performance Computing (HPC):
Powers supercomputer clusters for fast simulations and scientific research.
b. Gaming Industry:
Powers complex graphics rendering and AI-driven gameplay.
c. Data Analytics:
Parallelism accelerates data processing for insights and decision-making.
d. Scientific Computing:
Used in simulations for climate modeling, physics, and medical research.
EXAMPLES OF MIMD ARCHITECTURE
Supercomputers: Weather simulations, nuclear research.
Cluster Computing: Beowulf clusters for parallel processing.
Distributed Databases: Data partitioned across multiple servers.
Heterogeneous Computing: Multi-core CPUs and GPUs for graphics and AI.
Cloud Computing: Virtualized instances running various tasks.

MASSIVELY PARALLEL PROCESSING (MPP) SYSTEM
Parallel Computing Solution: MPP is a type of parallel computing architecture.
Scalable: It is designed to scale by adding more processors and nodes.
Data Parallelism: Ideal for processing tasks that can be divided into parallel data chunks.
High Performance: Suited for computationally intensive workloads and big-data analytics.
Distributed Memory: Each processor has its own memory, requiring communication for data sharing.
Complex and Costly: Implementing and managing MPP systems can be complex and expensive.
Examples: Teradata, Greenplum, and Hadoop are examples of MPP solutions.
MPI
Known as the Message Passing Interface.
A standardized message-passing system used for communication between processes in parallel computing.
Essential for parallel applications and distributed computing.
MPI is commonly used in SPMD (Single Program, Multiple Data) models.
Processes exchange messages for synchronization and data sharing.
Offers both one-to-one (point-to-point) communication and collective communication operations (see the sketch below).
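As an illustration (not from the slides), a minimal SPMD program in C using MPI: every process runs the same code, rank 0 sends a message to rank 1 (one-to-one communication), and MPI_Reduce performs a collective sum across all ranks.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */

    /* One-to-one communication: rank 0 sends a value to rank 1. */
    if (rank == 0 && size > 1) {
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }

    /* Collective communication: sum every process's rank on rank 0. */
    int local = rank, global = 0;
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks = %d\n", global);

    MPI_Finalize();
    return 0;
}
```

Typically compiled with mpicc and launched with mpirun -np <processes>.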
CONCLUSION
Parallelism is a foundational concept that empowers modern computing to tackle increasingly complex and resource-intensive tasks, making it essential in the world of technology and scientific research. In a world without parallelism, computing would be slower, less efficient, and limited in its ability to handle complex tasks.

Research
Parallel Galaxy Simulation with the Barnes-Hut Algorithm
Alex Patel and William Liu
OUR GOAL
To demonstrate that the simulation of galaxy evolution is highly parallelizable on CPU platforms.
Sub-problem: We reduced the immense task of constructing an accurate galaxy simulator to one that approximates the effect gravity has on the evolution of a galaxy's bodies.
This sub-problem is highly parallelizable on CPU platforms, even with more involved sequential methods of approximation.

CHALLENGES
This problem is a classic example of an N-body problem, in which we have a configuration of bodies and their positions in space, and we aim to update the position of each body by considering the positions of every other body.
A naive approach of computing every body's acceleration by considering all pairs of bodies is embarrassingly parallelizable, since we can evenly balance load by partitioning the bodies into equal buckets (a sketch follows below).
However, this algorithm is sub-optimal for larger-scale simulations, since its computational cost grows as O(n^2), where n is the total number of bodies we are considering; this becomes prohibitively expensive for large n.
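A minimal sketch of the naive all-pairs pass in C with OpenMP; the Body struct, constants, and field names are assumptions for illustration, not the authors' actual code.

```c
#include <math.h>

typedef struct { double x, y; } Vec2;
typedef struct { Vec2 pos, vel, acc; double mass; } Body;

#define GRAV 6.674e-11  /* gravitational constant             */
#define EPS  1e-3       /* softening to avoid division by 0   */

/* Naive all-pairs O(n^2) acceleration pass. The outer loop is
 * embarrassingly parallel: each body's new acceleration depends only
 * on read-only positions, so OpenMP can partition the bodies into
 * equal buckets across threads. */
void compute_accelerations(Body *bodies, int n) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double ax = 0.0, ay = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = bodies[j].pos.x - bodies[i].pos.x;
            double dy = bodies[j].pos.y - bodies[i].pos.y;
            double d2 = dx * dx + dy * dy + EPS;
            double s  = bodies[j].mass / (d2 * sqrt(d2)); /* m/d^3 */
            ax += dx * s;
            ay += dy * s;
        }
        bodies[i].acc.x = GRAV * ax;
        bodies[i].acc.y = GRAV * ay;
    }
}
```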
Research Methods
APPROACH
BARNES-HUT ALGORITHM
A notable sequential algorithm is the Barnes-Hut
The Barnes-Hut Algorithm operates by
Algorithm, in which we build a spatial tree to form a
hierarchical clustering of bodies so that during the first constructing a spatial tree to
acceleration computation phase, each body can treat far hierarchically distribute bodies between
away clusters as a single larger body to reduce total tree nodes based on closeness in space. A
computation. This results in an average O(n log n) common type of tree used in this scenario
algorithm instead of the all-pairs naive O(n 2 ) algorithm, is the quadtree in 2 dimensions, due to
where n is the number of bodies we are considering. the relative simplicity of its construction.
1. If the node we are looking at is a leaf, then add the force contribution from
the body at the leaf if it exists.
2. If the side length of the node’s region divided by the distance from the
body to the center of mass of the node is less than some defined θ, treat the
node as a single mass and add its force contribution.
(L/D)<θ
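The following is a minimal sketch in C of this traversal under an assumed quadtree node layout; it is not the authors' fine-grained locking or lock-free implementation, and skipping the query body's own leaf is omitted for brevity.

```c
#include <math.h>

/* Hypothetical quadtree node for the 2-D Barnes-Hut traversal:
 * "side" is the side length L of the node's square region, and
 * (cx, cy) / mass are its center of mass and total mass. */
typedef struct QuadNode {
    double side;                 /* region side length L          */
    double cx, cy, mass;         /* center of mass and total mass */
    int is_leaf;
    struct QuadNode *child[4];   /* NW, NE, SW, SE (NULL if none) */
} QuadNode;

/* Accumulate acceleration on the body at (bx, by) following the
 * rules above: use a leaf's body directly, collapse a far-away
 * cluster into one mass when (L/D) < theta, otherwise recurse. */
void bh_force(const QuadNode *node, double bx, double by,
              double theta, double *ax, double *ay) {
    if (node == NULL || node->mass == 0.0) return;
    double dx = node->cx - bx;
    double dy = node->cy - by;
    double d  = sqrt(dx * dx + dy * dy) + 1e-9; /* avoid div by 0 */

    if (node->is_leaf || node->side / d < theta) {
        double s = node->mass / (d * d * d);  /* G factored out */
        *ax += dx * s;
        *ay += dy * s;
    } else {
        for (int q = 0; q < 4; q++)
            bh_force(node->child[q], bx, by, theta, ax, ay);
    }
}
```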
Research Methods
EXPECTATIONS
We note that the expected number of nodes touched during force aggregation for a single body is ≈ log(N)/θ^2, resulting in an O(n log n) algorithm as long as θ > 0.

[Figure: example of a quadtree and the hierarchical clustering it induces.]

To perform a simulation, we need to evolve the galaxy over time. To do this, we iterate over a number of simulation steps; at each step we compute the acceleration for each body, then integrate over a short timestep to get the new position of each body (a sketch follows below).
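A sketch of that outer loop in C, reusing the hypothetical Body struct and compute_accelerations() from the all-pairs sketch above; the slides do not specify the integrator, so a simple Euler step is assumed here.

```c
/* Evolve the galaxy: one acceleration pass plus one integration pass
 * per simulation step. A plain Euler update is assumed; the
 * acceleration pass can be the all-pairs version or Barnes-Hut. */
void simulate(Body *bodies, int n, int steps, double dt) {
    for (int s = 0; s < steps; s++) {
        compute_accelerations(bodies, n);
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            bodies[i].vel.x += bodies[i].acc.x * dt;
            bodies[i].vel.y += bodies[i].acc.y * dt;
            bodies[i].pos.x += bodies[i].vel.x * dt;
            bodies[i].pos.y += bodies[i].vel.y * dt;
        }
    }
}
```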
Our application targets multi-core CPU platforms, specifically machines with homogeneous compute resources such as the GHC machines. Processors on the GHC machines have 8 cores supporting simultaneous multithreading (Intel hyperthreading), allowing for efficient use of a maximum of 16 threads in our application.

All code was written from scratch in the C programming language, using the OpenMP parallel framework. OpenMP is preferable in our case to a lower-level API, since it allows us to simply identify parallel blocks of code for the compiler and machine to map to compute resources and execute.

Supporting tools:
cycletimer.c from the Graph Rats starter code
module monitor.{h/c} to provide macros to keep track of the timings of each sub-routine in the algorithm
gcc compiler without any flags (except -Wall to catch warnings)
Research Methods
IMPLEMENTATIONS AND BENCHMARKS
In total, we implemented 3 parallel implementations of the galaxy simulator:
1. Parallel naive all-pairs O(n^2) algorithm.
2. Parallel Barnes-Hut Algorithm with a fine-grained locking quadtree.
3. Parallel Barnes-Hut Algorithm with a lock-free quadtree.
We measure the performance of each implementation on 3 benchmarks:
A. 1-to-1: Equal number of clusters and bodies.
B. sqrt: The number of clusters is the square root of the number of bodies.
C. single: There is a single cluster of all the bodies.
For each benchmark, we vary θ between the values 0.1, 0.3, and 0.5, and the number of threads between all values in the range [1, 16].

VISUALIZATION
We implemented a visualization program, gviz, to be able to verify the correctness of our implementations. It is written in C++, compiled with the g++ compiler using the flags -m64 -std=c++11, and uses the OpenGL graphics framework with the glfw3 library to quickly render bodies as they evolve.
Thank You
By IRKAN, JIYA, SARAL, SOHINI