BigData_ParallelComputing
BIG DATA COMPUTING
UNIT-1
Presented by:
Priyanka Rahi
SINGLE NODE CAPACITY
In a typical analytics cycle, the computer's job is to store data, move that
data from storage into compute capacity (which includes memory) for
processing, and write the important results back to storage once they are
computed.
With Big Data, you have more data than will fit on a single computer.
PARALLELISM
Linear Processing
Linear processing is the traditional method of computing a problem: the problem statement is broken into a set of instructions that are executed sequentially until all instructions are completed successfully.
If an error occurs in any one of the instructions, the entire sequence of instructions is executed from the beginning after the error has been resolved.
It is evident from this processing method that linear processing is best suited for minor computing tasks, and is inefficient and time consuming when it comes to processing complex problems such as Big Data.
Parallel Processing
The alternative to linear processing is parallel processing. Here too, the problem statement is broken down into a set of executable instructions.
The instructions are then distributed to multiple execution nodes of equal processing power and are executed in parallel.
Since the instructions run on separate execution nodes, errors can be fixed and the affected instructions re-executed locally, independent of other instructions. A sketch contrasting the two approaches follows.
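To make the contrast concrete, here is a minimal sketch in Python, assuming the "problem" is a list of independent, CPU-bound instructions; the function heavy_instruction and the task sizes are illustrative stand-ins, not part of any particular library.

```python
import time
from multiprocessing import Pool

def heavy_instruction(n):
    # Stand-in for one unit of work in the instruction set.
    return sum(i * i for i in range(n))

tasks = [2_000_000] * 8

if __name__ == "__main__":
    # Linear processing: execute the instructions one after another.
    start = time.perf_counter()
    linear_results = [heavy_instruction(t) for t in tasks]
    print(f"linear:   {time.perf_counter() - start:.2f}s")

    # Parallel processing: distribute the same instructions across
    # worker processes of equal "processing power".
    start = time.perf_counter()
    with Pool(processes=4) as pool:
        parallel_results = pool.map(heavy_instruction, tasks)
    print(f"parallel: {time.perf_counter() - start:.2f}s")

    # Both approaches compute the same answer.
    assert linear_results == parallel_results
```

On a multi-core machine the parallel run should finish in a fraction of the linear time, which is the first advantage listed on the next slide.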
PARALLEL PROCESSING ADVANTAGES
Parallel processing offers significant advantages when dealing with complex problems such as Big Data.
Some of the other benefits of using Parallel processing are:
Reduced processing times: Parallel processing can process Big Data in a
fraction of the time compared to linear processing.
Lower memory and processing requirements per node: Since the problem's instructions
are executed on separate execution nodes, each node's memory and processing requirements
stay low even while processing large volumes of data.
Flexibility: The biggest advantage of parallel processing is that execution nodes
can be added and removed as and when required. This significantly reduces
infrastructure cost.
DATA SCALING
Data scaling is a technique to manage, store, and process the overflow of data.
One option is scaling up: getting a larger single-node computer. But when your data is growing
exponentially, it will eventually outgrow whatever capacity a single machine can offer.
The alternative is scaling out: adding additional nodes of the same capacity until the problem is
tractable.
The individual nodes arranged in this way are called a computing cluster.
Compute clusters can solve problems known as "embarrassingly parallel"
calculations. These are workloads that can easily be divided and run
independently of one another. If any one process fails, it has no impact on the others and
can easily be rerun. An example would be changing the date format in a single
column of a large dataset that has been split into multiple smaller chunks stored
on different nodes of the cluster; a sketch of this follows.
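The snippet below sketches that example. The in-memory lists stand in for chunks stored on different cluster nodes, and the particular date formats are illustrative assumptions.

```python
from datetime import datetime
from multiprocessing import Pool

def reformat_dates(chunk):
    # Convert "31/12/2023" -> "2023-12-31" for every row in this chunk.
    # A failure here would affect only this chunk and could be rerun;
    # the other chunks proceed independently.
    return [datetime.strptime(d, "%d/%m/%Y").strftime("%Y-%m-%d")
            for d in chunk]

chunks = [
    ["31/12/2023", "01/01/2024"],   # e.g. chunk on node 1
    ["15/06/2023", "20/07/2023"],   # e.g. chunk on node 2
    ["05/03/2024", "09/04/2024"],   # e.g. chunk on node 3
]

if __name__ == "__main__":
    with Pool() as pool:
        converted = pool.map(reformat_dates, chunks)
    print(converted)
```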
ISSUES IN PARALLEL COMPUTING
Some tasks, however, such as sorting a large data set, add significant complexity to the
process.
Now, the multiple computations must coordinate with one another because
each process needs to be aware of the state of its peer processes in order
to complete the calculation.
This requires sending messages across a network to each other or writing
them to a file system that is accessible to all processes on the cluster.
The level of complexity increases significantly, because you are basically
asking a cluster of computers to behave as a single computer.
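Here is a small sketch of that coordination problem, using a single machine to stand in for a cluster: each worker can sort its own partition independently, but producing one globally sorted output requires a merge step that must see every partition's result at once. On a real cluster, that merge is where network messages or a shared file system come in.

```python
import heapq
from multiprocessing import Pool

partitions = [
    [42, 7, 19],    # data on node 1
    [3, 88, 21],    # data on node 2
    [56, 4, 30],    # data on node 3
]

if __name__ == "__main__":
    # Phase 1: independent local sorts (embarrassingly parallel).
    with Pool() as pool:
        sorted_parts = pool.map(sorted, partitions)

    # Phase 2: global merge -- this step needs the state of *every*
    # partition, which is what forces the processes to coordinate.
    globally_sorted = list(heapq.merge(*sorted_parts))
    print(globally_sorted)
```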
DATA LOCALITY
In the Hadoop ecosystem, the concept of “bringing compute to the data” is
a central idea in the design of the cluster.
The cluster is designed in a way that computations on certain pieces, or
partitions, of the data will take place right at the location of the data when
possible.
The resulting output will also be written to the same node.
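A minimal sketch of this idea follows, using a hypothetical Node class to model a cluster node that holds one partition: the function travels to the data, the partition never leaves its node, and the output is written back on the same node.

```python
class Node:
    def __init__(self, name, partition):
        self.name = name
        self.partition = partition   # data stored locally on this node
        self.output = None

    def run(self, func):
        # The computation executes where the data lives; only the small
        # function travels over the network, not the (huge) partition.
        self.output = func(self.partition)
        return self.output

cluster = [Node("node1", [1, 2, 3]), Node("node2", [4, 5, 6])]
totals = [node.run(sum) for node in cluster]   # compute ships to the data
print(totals)   # [6, 15]; each result stays on the node that produced it
```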
FAULT TOLERANCE
Fault tolerance comes into play when computers break and outages happen.
Fault tolerance refers to the ability of a system to continue operating without interruption when one
or more of its components fail.
This works for Hadoop's primary data storage system (HDFS) and other similar storage systems (like
S3 and object storage).
Consider the first 3 partitions of a dataset labelled P1, P2, and P3, which reside on the first node.
In this system, copies of each of these data partitions are also stored on other locations or nodes
within the cluster.
If the first node ever goes down, you can add a new node to the cluster and recover the lost
partitions by copying data from one of the other nodes where copies of P1, P2, and P3 partitions are
stored.
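A toy sketch of this recovery process, assuming a replication factor of 2; the dictionary cluster and the helper recover are purely illustrative, since HDFS and object stores handle replication and recovery automatically.

```python
cluster = {
    "node1": {"P1": [1, 2], "P2": [3, 4], "P3": [5, 6]},
    "node2": {"P1": [1, 2], "P3": [5, 6]},   # replicas
    "node3": {"P2": [3, 4]},                 # replica
}

def recover(failed_node, cluster):
    """Rebuild a failed node's partitions from surviving replicas."""
    lost = cluster.pop(failed_node)
    new_node = {}
    for pid in lost:
        # Find any surviving node that holds a copy of this partition.
        source = next(n for n, parts in cluster.items() if pid in parts)
        new_node[pid] = list(cluster[source][pid])   # copy the replica
    cluster["node4"] = new_node                      # replacement node
    return cluster

# node1 goes down; P1, P2, and P3 are rebuilt on a new node from copies.
print(recover("node1", cluster)["node4"])
```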