
PARALLEL COMPUTING
UNIT-1: BIG DATA

Presented by:
Priyanka Rahi
SINGLE NODE CAPACITY
 In any normal analytics cycle, the computer's job is to store data, move that data from storage into compute capacity (which includes memory), and write the important results back to storage once they are computed.
 With Big Data, you have more data than will fit on a single computer.
PARALLELISM
Linear Processing
 Linear processing is the traditional method of computing a problem, where the problem statement is broken into a set of instructions that are executed sequentially until all instructions are completed successfully.
 If an error occurs in any one of the instructions, the entire sequence of instructions is executed from the beginning after the error has been resolved.
 Linear processing is therefore best suited for minor computing tasks and is inefficient and time consuming when it comes to processing complex problems such as Big Data.

Parallel Processing
 The alternative to linear processing is parallel processing. Here too, the problem statement is broken down into a set of executable instructions.
 The instructions are then distributed to multiple execution nodes of equal processing power and are executed in parallel.
 Since the instructions run on separate execution nodes, errors can be fixed and re-executed locally, independent of other instructions (see the sketch below).
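To make the contrast concrete, here is a minimal Python sketch (not from the original slides; the one-second task and the four-worker pool are illustrative assumptions) that runs the same four independent tasks first sequentially and then on a pool of worker processes.

import time
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk_id):
    """Stand-in for one unit of work, e.g. processing one data chunk."""
    time.sleep(1)  # simulate one second of computation
    return f"chunk {chunk_id} done"

if __name__ == "__main__":
    chunks = range(4)

    # Linear processing: the instructions run one after another.
    start = time.time()
    linear_results = [process_chunk(c) for c in chunks]
    print(f"linear:   {time.time() - start:.1f}s")   # roughly 4 seconds

    # Parallel processing: the same instructions are distributed to
    # several worker processes and executed at the same time.
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as pool:
        parallel_results = list(pool.map(process_chunk, chunks))
    print(f"parallel: {time.time() - start:.1f}s")   # roughly 1 second

If one worker fails here, only its own chunk needs to be rerun, which mirrors the error-handling difference described above.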
PARALLEL PROCESSING ADVANTAGES
 Parallel processing offers significant advantages when dealing with complex problems such as Big Data.
 Some of the key benefits of using parallel processing are:
 Reduced processing times: Parallel processing can process Big Data in a
fraction of the time compared to linear processing.
 Less memory and processing requirements: Since the problem instructions
are executed on separate execution nodes, memory and processing requirements
are low even while processing large volumes of data.
 Flexibility: The biggest advantage of parallel processing is that execution nodes
can be added and removed as and when required. This significantly reduces
infrastructure cost.
DATA SCALING
 Data scaling is a technique for managing, storing, and processing ever-growing volumes of data.
 You can get a larger single-node computer, but when your data is growing exponentially, it will eventually outgrow whatever capacity is available.
 Increasing the capacity of a single node in this way is called scaling up.
HORIZONTAL SCALING
 A better strategy is to scale out or to scale horizontally.

 This simply means adding additional nodes with the same capacity until the problem is
tractable.

 The individual nodes arranged in this way are called a computing cluster.

 Compute clusters can solve problems known as “embarrassingly parallel” calculations. These are workloads that can easily be divided and run independently of one another. If any one process fails, it has no impact on the others and can easily be rerun. An example would be changing the date format in a single column of a large dataset that has been split into multiple smaller chunks stored on different nodes of the cluster (see the sketch below).
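As a minimal sketch of that example (the chunk contents, the date formats, and the reformat_chunk helper are illustrative assumptions, not part of the slides), each chunk below is converted independently by a separate worker, with no coordination between workers.

from concurrent.futures import ProcessPoolExecutor
from datetime import datetime

def reformat_chunk(rows):
    """Convert the date column of one chunk from DD/MM/YYYY to YYYY-MM-DD."""
    out = []
    for date_str, value in rows:
        iso = datetime.strptime(date_str, "%d/%m/%Y").strftime("%Y-%m-%d")
        out.append((iso, value))
    return out

if __name__ == "__main__":
    # Pretend each list is a chunk stored on a different node of the cluster.
    chunks = [
        [("01/02/2023", 10), ("15/02/2023", 12)],
        [("03/03/2023", 7)],
        [("21/04/2023", 99), ("22/04/2023", 5)],
    ]
    # Each chunk is handled by a separate worker; a failed chunk can
    # simply be rerun without affecting the others.
    with ProcessPoolExecutor() as pool:
        converted = list(pool.map(reformat_chunk, chunks))
    print(converted)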
ISSUES IN PARALLEL COMPUTING
 Sometimes, however, a task such as sorting a large data set adds significant complexity to the process (see the sketch after this list).
 Here, the multiple computations must coordinate with one another, because each process needs to be aware of the state of its peer processes in order to complete the calculation.
 This requires the processes to send messages to each other across a network, or to write them to a file system that is accessible to all processes on the cluster.
 The level of complexity increases significantly, because you are basically
asking a cluster of computers to behave as a single computer.
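The sketch below (with illustrative data and chunk sizes) shows why sorting is not embarrassingly parallel: each worker can sort its own chunk independently, but producing the final order requires a merge step that gathers and coordinates all of the partial results.

import heapq
import random
from concurrent.futures import ProcessPoolExecutor

def sort_chunk(chunk):
    """Independent, embarrassingly parallel part: sort one chunk locally."""
    return sorted(chunk)

if __name__ == "__main__":
    data = [random.randint(0, 1000) for _ in range(12)]
    chunks = [data[0:4], data[4:8], data[8:12]]

    # Each chunk is sorted by a separate worker process.
    with ProcessPoolExecutor() as pool:
        sorted_chunks = list(pool.map(sort_chunk, chunks))

    # Coordination step: no chunk's result is final on its own; the
    # partial results must be gathered in one place and merged.
    fully_sorted = list(heapq.merge(*sorted_chunks))
    print(fully_sorted)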
DATA LOCALITY
 In the Hadoop ecosystem, the concept of “bringing compute to the data” is
a central idea in the design of the cluster.
 The cluster is designed so that computations on certain pieces, or partitions, of the data take place right at the location of the data whenever possible.
 The resulting output will also be written to the same node (see the sketch below).
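A minimal sketch of the idea, assuming a hypothetical partition-to-node map and task list (this is not Hadoop's actual scheduler API): each task is assigned to whichever node already stores the partition it reads, so the data does not have to move across the network.

from collections import defaultdict

# Which node stores which partition (normally reported by the storage layer).
partition_locations = {"P1": "node-1", "P2": "node-2", "P3": "node-3"}

def schedule(tasks):
    """Group tasks by the node that holds the partition they read."""
    plan = defaultdict(list)
    for task, partition in tasks:
        plan[partition_locations[partition]].append(task)
    return dict(plan)

if __name__ == "__main__":
    tasks = [("count_rows", "P1"), ("sum_column", "P2"), ("count_rows", "P3")]
    # Each node runs only the tasks for data it already holds; the output
    # of each task would likewise be written back to that same node.
    print(schedule(tasks))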
FAULT TOLERANCE
 Fault tolerance comes into play when computers break and outages happen.

 Fault tolerance refers to the ability of a system to continue operating without interruption when one
or more of its components fail.
 This works for Hadoop’s primary data storage system (HDFS) and other similar storage systems (like S3 and object storage).
 Consider the first 3 partitions of a dataset labelled P1, P2, and P3, which reside on the first node.

 In this system, copies of each of these data partitions are also stored at other locations, or on other nodes, within the cluster.
 If the first node ever goes down, you can add a new node to the cluster and recover the lost partitions by copying the data from one of the other nodes where copies of P1, P2, and P3 are stored (see the sketch below).
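The following minimal sketch (node names, the placement table, and the recover helper are illustrative assumptions, not HDFS code) shows how replicated partitions allow recovery when a node is lost.

# partition -> list of nodes that hold a copy (primary first)
placement = {
    "P1": ["node-1", "node-2", "node-3"],
    "P2": ["node-1", "node-3", "node-4"],
    "P3": ["node-1", "node-2", "node-4"],
}

def recover(failed_node, new_node):
    """Re-create lost copies on a replacement node from surviving replicas."""
    for partition, nodes in placement.items():
        if failed_node in nodes:
            nodes.remove(failed_node)
            source = nodes[0]  # any surviving replica will do
            print(f"copy {partition} from {source} to {new_node}")
            nodes.append(new_node)

if __name__ == "__main__":
    # Losing node-1 does not lose P1, P2, or P3, because copies survive
    # on the other nodes and can be copied to the replacement node.
    recover("node-1", "node-5")
    print(placement)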
