
PARALLELISM
COMPUTER ORGANIZATION AND ARCHITECTURE
SUBJECT CODE: 21CSS201T
DATE: 20/10/23

MEMBERS
0001. IRKAN A. SAIFI (RA2211003030196)
0010. JIYA SHRIVASTAVA (RA2211003030204)
0011. SARAL RASTOGI (RA2211003030216)
0100. SOHINI GANGULY (RA2211003030218)

TABLE OF CONTENTS
01 Introduction
02 Needs
03 Types
04 Applications
05 Conclusion
06 Research

INTRODUCTION
Parallel processing is associated with data locality and data communication.
Parallel computer architecture is the method of organizing all the resources to maximize performance and programmability within the limits given by technology and cost at any instance of time.
VLSI technology allows a large number of components to be accommodated on a single chip and clock rates to increase.
WHY PARALLELISM?
Parallel computer architecture adds a new dimension to the development of computer systems by using more and more processors.
Performance at a given point in time: a large number of processors >> a single processor.

NEED FOR PARALLELISM
EFFICIENCY: Hardware with multiple cores, threads, or processors can run many processes at once.
SPEED: Larger computational problems are separated into smaller tasks that run concurrently.
COST-EFFECTIVE: Requires more parts than serial processing, BUT produces more results in less time.

TYPES OF PARALLELISM
ILP: The simultaneous execution of multiple instructions from a program. Pipelining is a form of ILP.
DLP: Data-level parallelism is an approach to computer processing that aims to increase data throughput by operating on multiple elements of data simultaneously.
TLP: An algorithm is broken up into independent tasks, and multiple computing resources are available to run them.

INSTRUCTION-LEVEL PARALLELISM
Multiple operations are performed in a single cycle, either by executing them simultaneously or by utilizing the gaps between two successive operations that are created by latencies (e.g., a floating-point operation with a latency of 3 cycles).
The decision of when to execute an operation depends largely on the compiler rather than the hardware. However, the extent of the compiler's control depends on the type of ILP architecture.
CLASSIFICATION OF ILP ARCHITECTURES
Sequential: The program is not expected to explicitly convey any information regarding parallelism to the hardware.
Dependencies: The program explicitly mentions information regarding dependencies between operations.
Independence: The program provides information regarding which operations are independent of each other, so that they can be executed instead of 'nops'.

EXAMPLE
A sequential processor takes 12 cycles to execute 8 operations, whereas a processor with ILP takes only 4 cycles.
While in sequential execution each cycle has only one operation being executed, in the processor with ILP, cycle 1 has 4 operations and cycle 2 has 2 operations.

BASIC DIFFERENCE BETWEEN ILP AND PIPELINING
Pipeline processing breaks instruction execution into stages, whereas ILP focuses on executing multiple instructions at the same time.

DATA-LEVEL PARALLELISM
Data-level parallelism is an approach to computer processing that aims to increase data throughput by operating on multiple elements of data simultaneously.
A data-parallel job on an array of n elements can be divided equally among all the processors.
In sequential execution, the time taken is n*Ta time units, since one processor sums all the elements of the array. For the same data-parallel job on 4 processors, the time reduces to (n/4)*Ta + merging overhead time units.
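To make this division of work concrete, the following is a minimal sketch (not from the original slides) of a data-parallel array sum in C using OpenMP; the array size and contents are placeholders.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        for (int i = 0; i < N; i++)
            a[i] = 1.0;                 /* placeholder data */

        double sum = 0.0;
        /* Each thread sums a chunk of the array; the reduction clause
           performs the "merging overhead" step at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);
        return 0;
    }

Compiled with gcc -fopenmp, the loop iterations are split across the available cores, mirroring the (n/4)*Ta + merging overhead estimate above.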
CLASSIFICATION OF DLP
SIMD
SIMT
MIMD

TASK-LEVEL PARALLELISM
An algorithm is broken up into independent tasks, and multiple computing resources are available to run them.
Enables multiple portions of a visualization task to be executed in parallel.
The number of independent tasks that can be identified, as well as the number of CPUs available, limits the maximum amount of parallelism.

DLP VS TLP
Data parallelism is a more finely grained parallelism, in that we achieve our performance improvement by applying the same small set of tasks iteratively over multiple streams of data.
Task parallelism is used effectively in the movie industry, where several frames in an animated production are rendered in parallel (see the sketch below).
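A minimal, hypothetical sketch of task-level parallelism in C with OpenMP: independent tasks (standing in for independently rendered frames) run on separate threads. The function names are illustrative only.

    #include <stdio.h>

    /* Hypothetical independent tasks, e.g. two frames being rendered. */
    static void render_frame(int id) { printf("frame %d rendered\n", id); }
    static void mix_audio(void)      { printf("audio mixed\n"); }

    int main(void) {
        /* Each section is an independent task that may run on its own thread. */
        #pragma omp parallel sections
        {
            #pragma omp section
            render_frame(1);

            #pragma omp section
            render_frame(2);

            #pragma omp section
            mix_audio();
        }
        return 0;
    }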

APPLICATIONS OF PARALLELISM
a. High-Performance Computing (HPC):
Powers supercomputer clusters for fast simulations and scientific research.

b. Gaming Industry:
Powers complex graphics rendering and AI-driven gameplay.

c. Data Analytics:
Parallelism accelerates data processing for insights and decision-making.

d. Scientific Computing:
Used in simulations for climate modeling, physics, and medical research.
APPLICATIONS (INDUSTRIES)
a. Tracking, processing and storing big data
b. Collaborative digital workspaces
c. AI, virtual reality and advanced graphics
d. Logistical planning and tracking for transportation
e. Online search engines
f. Weather prediction

CHALLENGES
a. Data Dependencies: Managing data dependencies between parallel threads.
b. Scalability Issues: Difficulty in scaling performance with a growing number of processors.
c. Load Balancing: Ensuring an even distribution of tasks among processors.
d. Synchronization Overhead: The overhead introduced by synchronization mechanisms.

OVERCOMING CHALLENGES
a. Dynamic Scheduling: Implementing dynamic scheduling to balance workloads and avoid bottlenecks.
b. Caching Strategies: Using advanced caching techniques to manage data dependencies.
c. Parallel Algorithms: Developing and optimizing parallel algorithms for specific tasks.
d. Hybrid Architectures: Combining different parallel architectures for improved performance.

REAL-WORLD EXAMPLES
01 SUPERCOMPUTERS: Summit and Fugaku for scientific research.
02 GPUs: Powering gaming and AI applications.
03 CLOUD COMPUTING: Scalable and high-performance cloud services.
04 WEATHER: Weather, nuclear, and molecular research.
FUTURE TRENDS
01 QUANTUM COMPUTING: Exploring the potential of quantum computing for revolutionary parallelism.
02 NEUROMORPHIC COMPUTING: Mimicking the brain's architecture.
03 EDGE COMPUTING: Pushing processing closer to data sources for low-latency, high-efficiency parallel operations.
04 EXASCALE COMPUTING: Preparing for the era of exascale computing to solve complex problems.

PARALLEL PROCESSING ARCHITECTURES
Parallel processing architecture is the design of computer systems to simultaneously execute multiple tasks or instructions with increased speed and efficiency.
Flynn's Taxonomy classifies parallel processing architectures into four categories:
SISD (Single Instruction, Single Data)
SIMD (Single Instruction, Multiple Data)
MISD (Multiple Instruction, Single Data)
MIMD (Multiple Instruction, Multiple Data)

SHARED MEMORY ARCHITECTURES
In a shared memory model, processors communicate by reading and writing locations in a shared memory that is equally accessible to all processors. Each processor may also have registers, buffers, caches, and local memory banks as additional memory resources.
Some basic issues in the design of shared-memory systems have to be taken into consideration. These involve access control, synchronization, protection, and security.

EXAMPLES OF SHARED MEMORY SYSTEMS
UMA: Symmetric Multiprocessor (SMP) machines
NUMA: Cray T3D and the Hector multiprocessor
COMA: The Data Diffusion Machine (DDM)
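To illustrate the synchronization issue mentioned above, here is a minimal sketch (an illustrative example, not from the slides) of two ways of updating a shared counter in C with OpenMP.

    #include <stdio.h>

    int main(void) {
        long unsafe = 0, safe = 0;

        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            unsafe++;                 /* data race: threads may lose updates */

            #pragma omp atomic        /* synchronized access to shared memory */
            safe++;
        }

        printf("unsafe = %ld, safe = %ld\n", unsafe, safe);
        return 0;
    }

Without the atomic directive, the shared location is updated without any access control, which is exactly the kind of hazard shared-memory designs must handle.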
NON-UNIFORM MEMORY ACCESS (NUMA)
A method of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally.
Improves the system's performance and allows it to expand as processing needs evolve.
NUMA can be thought of as a microprocessor cluster in a box. The cluster typically consists of four microprocessors interconnected on a local bus to a shared memory on a single motherboard. The bus may be a peripheral component interconnect (PCI) bus, the shared memory is called an L3 cache, and the motherboard is often referred to as a card.

DISTRIBUTED MEMORY ARCHITECTURES
A distributed-memory MIMD architecture is known as a multicomputer. It replicates processor/memory pairs and links them through an interconnection network. Each processor/memory pair is known as a processing element (PE), and PEs work more or less independently of each other.
In distributed-memory MIMD machines, each processor has its own memory. A processor has no explicit knowledge of the memory of other processors.

MESSAGE-PASSING MODEL
In this model, data is shared by sending and receiving messages between co-operating processes, using system calls. Message passing is particularly useful in a distributed environment where the communicating processes may reside on different, network-connected systems. Message-passing architectures are usually easier to implement but are also usually slower than shared-memory architectures.

An example might be a networked cluster of nodes:
- nodes are networked together, each with multiple cores
- each node uses its own local memory
- nodes and cores communicate via messages

A message might contain:
1. A header that identifies the sending and receiving processes
2. A block of data
3. Process control information
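The message layout described above could be modelled by a struct such as the following; this is purely an illustrative sketch, and all field names are hypothetical.

    #include <stddef.h>

    /* Hypothetical message layout: header, payload, control information. */
    struct message_header {
        int src_process;      /* sending process id                 */
        int dst_process;      /* receiving process id               */
        size_t payload_len;   /* length of the data block in bytes  */
    };

    struct message {
        struct message_header header;  /* 1. identifies sender and receiver */
        unsigned char payload[4096];   /* 2. block of data                  */
        int control_flags;             /* 3. process control information    */
    };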
CLUSTERS AND BEOWULF CLUSTERS
A cluster refers to a set of interconnected computers or servers that collaborate to provide a unified computing resource. Clustering is an effective method to ensure high availability, scalability, and fault tolerance in computer systems.
A Beowulf cluster is formed using identical, ordinary computers arranged into a small local area network (LAN). Programs allow these computers to share processing among them, so Beowulf clusters form a parallel processing unit out of common personal computers.

EXAMPLES OF DISTRIBUTED MEMORY SYSTEMS
Data can be kept statically in nodes if most computations happen locally, and only changes on edges have to be reported to other nodes. An example of this is a simulation where data is modeled using a grid and each node simulates a small part of the larger grid. On every iteration, nodes inform all neighboring nodes of the new edge data.

SIMD ARCHITECTURE
Known as Single Instruction, Multiple Data.
SIMD architecture processes multiple data elements with a single instruction at the same time.
Suitable for data-parallel tasks where the same operation is performed on multiple pieces of data simultaneously.
SIMD processors often have a single control unit (CU) and multiple processing elements (PEs).
SIMD is efficient for tasks like image processing, audio processing, and simulations that involve a large dataset with similar operations on each element.

MIMD ARCHITECTURE
Known as Multiple Instruction, Multiple Data.
MIMD architecture allows multiple processors to independently execute different instructions on different sets of data.
Each processor in a MIMD system has its own control unit and memory, enabling it to execute different programs or tasks.
MIMD is highly versatile and can be applied to various parallel computing tasks, but it may require more sophisticated synchronization and communication mechanisms compared to SIMD.
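As a small illustration of the SIMD style (the same operation applied across a dataset), here is a sketch in C that asks the compiler to vectorize a per-pixel brightness adjustment using OpenMP's simd directive; the pixel buffer and scaling factor are placeholders.

    #include <stddef.h>

    /* Apply the same operation (scaling) to every pixel: a SIMD-friendly loop.
       The simd directive asks the compiler to emit vector instructions that
       process several pixels per instruction. */
    void brighten(float *pixels, size_t n, float factor) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            pixels[i] *= factor;
    }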
SPMD MODEL
Known as Single Program, Multiple Data.
Involves a single program or application code that all processors execute.
Each processor works on its own data or data subset, allowing for data parallelism.
Processors may work on different data, but they follow the same control flow and execute the same operations.
SPMD is used in parallel computing frameworks like MPI and OpenMP.

EXAMPLE OF SIMD ARCHITECTURE: THE GPU
GPU stands for Graphics Processing Unit.
Originally designed for rendering graphics in video games.
Now widely used for general-purpose computing (GPGPU).
Parallel architecture with many cores for concurrent processing.
Commonly used in machine learning, scientific simulations, and cryptography.
Requires specialized programming, often using APIs like CUDA or OpenCL.

EXAMPLES OF MIMD ARCHITECTURE
Supercomputers: Weather simulations, nuclear research.
Cluster Computing: Beowulf clusters for parallel processing.
Distributed Databases: Data partitioned across multiple servers.
Heterogeneous Computing: Multi-core CPUs and GPUs for graphics and AI.
Cloud Computing: Virtualized instances running various tasks.

MASSIVELY PARALLEL PROCESSING (MPP) SYSTEMS
Parallel Computing Solution: MPP is a type of parallel computing architecture.
Scalable: Designed to scale by adding more processors and nodes.
Data Parallelism: Ideal for processing tasks that can be divided into parallel data chunks.
High-Performance: Suited for computationally intensive workloads and big data analytics.
Distributed Memory: Each processor has its own memory, requiring communication for data sharing.
Complex and Costly: Implementing and managing MPP systems can be complex and expensive.
Examples: Teradata, Greenplum, and Hadoop are examples of MPP solutions.
MPI
Known as the Message Passing Interface.
A standardized message-passing system used for communication between processes in parallel computing.
Essential for parallel applications and distributed computing.
MPI is commonly used in SPMD models.
Processes exchange messages for synchronization and data sharing.
Offers both one-to-one (point-to-point) and collective communication operations.

OPENMP VS CUDA
(Comparison table: OpenMP vs CUDA)
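To ground the MPI bullet points, here is a minimal point-to-point example in C, assuming a standard MPI installation (compiled with mpicc and launched with mpirun); it is a sketch rather than a definitive pattern.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int value = 42;
            /* One-to-one communication: rank 0 sends a message to rank 1. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

The same binary runs on every process (the SPMD model); behaviour branches on the process rank.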

CONCLUSION
Parallelism is a foundational concept that empowers modern computing to tackle increasingly complex and resource-intensive tasks, making it essential in the world of technology and scientific research. In a world without parallelism, computing would be slower, less efficient, and limited in its ability to handle complex tasks.

RESEARCH: PARALLEL GALAXY SIMULATION WITH THE BARNES-HUT ALGORITHM
Alex Patel and William Liu
ABSTRACT
Implementation of multiple optimized parallel implementations of a galaxy evolution simulator for use on multi-core CPU platforms using the OpenMP framework. Given the success of the implementations, it is demonstrated that galaxy simulation is highly parallelizable on the CPU, even when computed using more involved methods such as the Barnes-Hut Algorithm.

WE KNOW THAT
01. The gravitational force on a single body is accumulated over all N total bodies.
02. The force on a body from another body is inversely proportional to the square of the distance between the bodies.
03. The force is directly proportional to the product of the masses of the bodies.
04. As the distance between two bodies becomes arbitrarily small, the acceleration approaches infinity. To resolve this issue, we introduce a small softening factor ε to set the acceleration between nearly coincident bodies to zero.
The acceleration on any given body follows from F = ma, so a = F/m.
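The slides reference the governing equation without reproducing it; under the standard softened formulation that the description above implies, the acceleration of body i would be written as follows — treat the exact form as an assumption rather than the authors' own notation.

\[
\mathbf{a}_i = \frac{\mathbf{F}_i}{m_i}
= \sum_{\substack{j=1 \\ j \neq i}}^{N}
  \frac{G\, m_j\, (\mathbf{r}_j - \mathbf{r}_i)}
       {\left(\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^2 + \varepsilon^2\right)^{3/2}}
\]

The ε² term in the denominator is the softening factor from point 04: as the separation goes to zero the numerator vanishes while the denominator stays at ε³, so the pairwise acceleration tends to zero instead of diverging.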

OUR GOAL
To demonstrate that the simulation of galaxy evolution is highly parallelizable on CPU platforms.
Sub-Problem: We reduce the immense task of constructing an accurate galaxy simulator to one that approximates the effect gravity has on the evolution of a galaxy's bodies.
This sub-problem is highly parallelizable on CPU platforms, even with more involved sequential methods of approximation.

CHALLENGES
This problem is a classic example of an N-body problem, in which we have a configuration of bodies and their positions in space, and we aim to update the position of each body by considering the positions of every other body.
A naive approach of computing every body's acceleration by considering all pairs of bodies is embarrassingly parallelizable, since we can evenly balance the load by partitioning the bodies into equal buckets (see the sketch below).
However, this algorithm is sub-optimal for larger-scale simulations, since the computational cost grows as O(n^2), where n is the total number of bodies we are considering, which becomes prohibitively expensive for large n.
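A sketch of what the parallel naive all-pairs acceleration update might look like in C with OpenMP; the Body struct, field names, and constants are assumptions, not the authors' code.

    #include <math.h>

    #define G   6.674e-11   /* gravitational constant (placeholder units) */
    #define EPS 1e-3        /* softening factor epsilon                   */

    typedef struct { double x, y, m, ax, ay; } Body;   /* hypothetical layout */

    /* Naive all-pairs O(n^2) acceleration update. The outer loop is
       embarrassingly parallel: each thread gets an equal bucket of bodies. */
    void compute_accelerations(Body *bodies, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double ax = 0.0, ay = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = bodies[j].x - bodies[i].x;
                double dy = bodies[j].y - bodies[i].y;
                double d2 = dx * dx + dy * dy + EPS * EPS;
                double inv_d3 = 1.0 / (d2 * sqrt(d2));
                ax += G * bodies[j].m * dx * inv_d3;
                ay += G * bodies[j].m * dy * inv_d3;
            }
            bodies[i].ax = ax;
            bodies[i].ay = ay;
        }
    }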
APPROACH
A notable sequential algorithm is the Barnes-Hut Algorithm, in which we build a spatial tree to form a hierarchical clustering of bodies, so that during the acceleration computation phase each body can treat far-away clusters as a single larger body to reduce total computation. This results in an average O(n log n) algorithm instead of the all-pairs naive O(n^2) algorithm, where n is the number of bodies we are considering.

BARNES-HUT ALGORITHM
The Barnes-Hut Algorithm operates by first constructing a spatial tree to hierarchically distribute bodies between tree nodes based on closeness in space. A common type of tree used in this scenario is the quadtree in 2 dimensions, due to the relative simplicity of its construction.

QUADTREE IN 2D: CONDITIONS
After the quadtree is constructed, we aggregate forces for each body. To do this, we first consider the root, then recurse into each of the 4 subtrees until one of the following conditions is met:

1. If the node we are looking at is a leaf, then add the force contribution from the body at the leaf, if it exists.

2. If the side length L of the node's region divided by the distance D from the body to the center of mass of the node is less than some defined θ, i.e. (L/D) < θ, treat the node as a single mass and add its force contribution (see the sketch below).
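A compact sketch of this recursion in C; the QuadNode layout and the add_force helper are hypothetical stand-ins for the authors' actual data structures.

    #include <math.h>

    /* Hypothetical quadtree node: either a leaf holding one body, or an
       internal node with 4 children and an aggregate mass/center of mass. */
    typedef struct QuadNode {
        struct QuadNode *child[4];   /* NULL for leaves                      */
        double cm_x, cm_y, mass;     /* center of mass and total mass        */
        double side;                 /* side length L of this node's region  */
        int is_leaf, has_body;
    } QuadNode;

    /* Assumed helper: accumulates the softened force from a point mass. */
    static void add_force(double x, double y, double m,
                          double bx, double by, double *fx, double *fy) {
        const double G = 6.674e-11, EPS = 1e-3;   /* placeholder constants */
        double dx = x - bx, dy = y - by;
        double d2 = dx * dx + dy * dy + EPS * EPS;
        double inv_d3 = 1.0 / (d2 * sqrt(d2));
        *fx += G * m * dx * inv_d3;
        *fy += G * m * dy * inv_d3;
    }

    void aggregate_force(const QuadNode *node, double bx, double by,
                         double theta, double *fx, double *fy) {
        if (node == NULL) return;

        /* Condition 1: leaf node -> add the body's contribution if present. */
        if (node->is_leaf) {
            if (node->has_body)
                add_force(node->cm_x, node->cm_y, node->mass, bx, by, fx, fy);
            return;
        }

        /* Condition 2: (L / D) < theta -> treat the cluster as one mass. */
        double dx = node->cm_x - bx, dy = node->cm_y - by;
        double dist = sqrt(dx * dx + dy * dy);
        if (dist > 0.0 && node->side / dist < theta) {
            add_force(node->cm_x, node->cm_y, node->mass, bx, by, fx, fy);
            return;
        }

        /* Otherwise recurse into the 4 subtrees. */
        for (int i = 0; i < 4; i++)
            aggregate_force(node->child[i], bx, by, theta, fx, fy);
    }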
EXPECTATIONS
We note that the expected number of nodes touched during force aggregation for a single body is ≈ log(N)/θ^2, resulting in an O(n log n) algorithm as long as θ > 0.

(Figure: example of a quadtree and the hierarchical clustering it induces.)

To perform a simulation, we need to evolve the galaxy over time. To do this, we iterate over a number of simulation steps; at each step we compute the acceleration for each body, then integrate over a short timestep to get the new position of each body.

APPROXIMATIONS
The Barnes-Hut Algorithm is an approximation of the discrete N-body problem. But the N-body problem itself is an approximation of the evolution of a galaxy.
How can we integrate acceleration and velocity to compute updated positions for each body?
We could simply multiply the acceleration by the timestep to compute the change in velocity, and multiply the new velocity by the timestep to compute the change in position. This method is known as Forward Euler.

ENERGY? APPROXIMATE!
If the energy of the system is constantly increasing, we will notice on the visualizer that bodies are getting farther and farther apart from one another as the simulation continues.
Verlet Integration: this technique lowers the change in energy by using the velocity half a timestep in the future to integrate position, instead of the velocity an entire timestep in the future (see the sketch below).
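A sketch contrasting the two update rules in C; the Body struct is a hypothetical layout, and the half-step formulation is the usual leapfrog variant of Verlet integration, which may differ in detail from the authors' implementation.

    typedef struct { double x, y, vx, vy, ax, ay; } Body;   /* hypothetical */

    /* Forward Euler (as described above): advance velocity by a full step,
       then advance position with that new velocity. */
    void euler_step(Body *b, double dt) {
        b->vx += b->ax * dt;
        b->vy += b->ay * dt;
        b->x  += b->vx * dt;
        b->y  += b->vy * dt;
    }

    /* Leapfrog (Verlet-style): velocity is advanced by only half a step,
       position is integrated with that half-step velocity, and the velocity
       update is completed after accelerations are recomputed. */
    void leapfrog_half_kick(Body *b, double dt) {
        b->vx += 0.5 * b->ax * dt;
        b->vy += 0.5 * b->ay * dt;
    }

    void leapfrog_drift(Body *b, double dt) {
        b->x += b->vx * dt;
        b->y += b->vy * dt;
    }

    /* One full step: half kick, drift, recompute accelerations, half kick. */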
OVERVIEW
We have explored how we can construct an optimized sequential implementation that satisfies our conditions of correctness for a gravity-based galaxy simulation approximation.

TWO CHALLENGES
1. Inserting bodies into the quadtree in parallel requires the data structure to handle concurrent operations.
2. Different bodies require a different amount of work to accumulate accelerations from the quadtree. This would result in an imbalanced workload if we were to arbitrarily assign bodies during the acceleration update phase (see the sketch below).
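One standard way to attack the second challenge in OpenMP is dynamic scheduling, sketched below; whether the authors used this exact clause is not stated, so treat it as an illustration rather than their solution.

    /* Hypothetical acceleration-update loop. With schedule(dynamic, 64),
       threads grab small chunks of bodies as they finish, so a thread that
       drew "cheap" bodies keeps working instead of idling while another
       thread grinds through expensive ones. */
    void update_accelerations(int n /* number of bodies */) {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n; i++) {
            /* aggregate forces for body i -- work per body varies */
        }
    }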

APPROACH
Our application targets multi-core CPU platforms, specifically machines with homogeneous compute resources such as the GHC machines. Processors on the GHC machines have 8 cores supporting simultaneous multithreading (Intel Hyper-Threading), allowing for efficient use of a maximum of 16 threads in our application.
All code was written from scratch in the C programming language, using the OpenMP parallel framework. OpenMP is preferable in our case to a lower-level API since it allows us to simply identify parallel blocks of code for the compiler and machine to map to compute resources and execute.

TOOLS
cycletimer.c from the GraphRats starter code.
A module monitor.{h/c} providing macros to keep track of the timings of each sub-routine in the algorithm.
The gcc compiler without any flags (except -Wall to catch warnings).
VISUALIZATION
We wrote a visualization program gviz to be able to verify the correctness of our implementations. It is written in C++, compiled with the g++ compiler using the flags -m64 -std=c++11, and uses the OpenGL graphics framework with the library glfw3 to quickly render bodies as they evolve.

SURVEY
In total, we implemented 3 parallel implementations of the galaxy simulator:
1. Parallel naive all-pairs O(n^2) algorithm.
2. Parallel Barnes-Hut Algorithm with a fine-grained locking quadtree.
3. Parallel Barnes-Hut Algorithm with a lock-free quadtree.

We measure the performance of each implementation on 3 benchmarks:
A. 1-to-1: Equal number of clusters and bodies.
B. sqrt: The number of clusters is the square root of the number of bodies.
C. single: There is a single cluster of all the bodies.

For each benchmark we vary θ between the values 0.1, 0.3, and 0.5, and the number of threads between all values in the range [1, 16].

CONCLUSION
We have implemented 3 versions of a simple galaxy simulation focused only on body-to-body gravitational forces. One of the implementations is an optimized parallel naive all-pairs O(n^2) algorithm that serves as a baseline. The other two implement the Barnes-Hut Algorithm with variants of a concurrent quadtree: fine-grained locking and lock-free. We have shown that galaxy simulation, in terms of purely gravitational forces, is highly parallelizable on multi-core CPU platforms with homogeneous compute resources.
Our lock-free implementation can perform a simulation of over one million bodies in seconds with compiler optimization flags enabled (gcc -Ofast), and our visualizations are intuitively reasonable and preserve our definition of correctness.
As a whole, the project was an exploratory dive into many parallel architecture and programming concepts: problem subdivision, work-load assignment, concurrent data structures, artifactual communication, cache-coherence considerations, profiling, benchmarking, and scaling analysis.
Thank You
By IRKAN, JIYA, SARAL, SOHINI
