4. Big Data Platforms

The document discusses various platforms for handling big data, focusing on the importance of choosing the right platform based on application requirements and system capabilities. It compares horizontal and vertical scaling, detailing platforms like Apache Hadoop, Apache Spark, and High-Performance Computing clusters, among others. Additionally, it evaluates these platforms based on scalability, data I/O performance, fault tolerance, and their suitability for real-time processing and iterative tasks.


Platforms to Handle Big Data

Dr. Jigna Ashish Patel


Assistant Professor, CSE Dept,
Institute of Technology,
Nirma University
Objective of the lecture
• Choosing the right platform
• Understanding the needs of the application/algorithm
• Making the right decision
Application/algorithm level requirements
• How quickly do we need to get the results?
• How big is the data to be processed?
• Does the model building require several iterations or a single iteration?
System/platform level requirements
• Will there be a need for more data processing capability in the
future?
• Is the rate of data transfer critical for this application?
• Is there a need for handling hardware failures within the application?
Scaling
• Scaling is the ability of a system to adapt to increased demands in
terms of data processing.
Horizontal Scaling
• It involves distributing the workload across many servers, which may
even be commodity machines.
• It is also known as "scale out": multiple independent machines are
added together in order to improve the processing capability.
• Typically, multiple instances of the operating system run on
separate machines.
Vertical Scaling
• Vertical scaling involves installing more processors, more memory, and
faster hardware, typically within a single server.
• It is also known as "scale up" and usually involves a single instance
of an operating system.
Comparison of Pros and Cons
• Horizontal scaling: scales out to very large data sizes on low-cost
commodity hardware and tolerates node failures, but it requires
distributed software and the network communication between nodes
adds overhead.
• Vertical scaling: offers a simpler programming model and fast data
access within a single machine, but the hardware is expensive and
there is a hard ceiling on how far one server can grow.
Horizontal Scaling Platforms
• Peer-to-Peer Networks
• Apache Hadoop
• Apache Spark

Vertical Scaling Platforms
• High-Performance Computing (HPC) clusters
• Multicore CPU
• Graphics Processing Unit (GPU)
• Field Programmable Gate Arrays (FPGA)
Peer-to-Peer Networks
• Involve millions of machines connected in a network.
• A decentralized and distributed network architecture in which the
nodes in the network (known as peers) both serve and consume
resources.
• One of the oldest distributed computing platforms.
• The Message Passing Interface (MPI) is the communication scheme
typically used in such a setup to communicate and exchange data
between peers.
• Each node can store data instances, and the scale-out is
practically unlimited (can be millions of nodes).
Apache Hadoop
• An open-source framework for storing and processing large
datasets using clusters of commodity hardware.
• Hadoop is designed to scale out from a single server to hundreds
of machines.
• Highly fault tolerant.
• The Hadoop platform contains the following two important
components: (1) the Hadoop Distributed File System (HDFS) and
(2) YARN (Yet Another Resource Negotiator).
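The MapReduce model at the heart of Hadoop can be illustrated with the classic word-count example. The sketch below is a minimal single-process simulation in plain Python (the function names are illustrative, not Hadoop APIs); in a real job the two phases run distributed across the cluster, with HDFS supplying the input splits.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct key (word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(lines):
    # In Hadoop, a shuffle step between the phases groups pairs by key;
    # here the dictionary inside reduce_phase plays that role.
    return reduce_phase(map_phase(lines))
```

The same map/shuffle/reduce shape carries over to many analytics tasks, which is why the model scales out so naturally.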
Apache Spark
• Developed by researchers at the University of California at
Berkeley; designed to overcome the disk I/O limitations of Hadoop.
• Has the ability to perform in-memory computations.
• Allows the data to be cached in memory, thus eliminating
Hadoop's disk-overhead limitation for iterative tasks.
• Supports Java, Scala, and Python; for certain tasks it has been
measured to be up to 100× faster than Hadoop MapReduce.
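Spark's advantage on iterative tasks comes from reading the data once and keeping it cached in memory, whereas Hadoop MapReduce re-reads from disk on every pass. The pure-Python sketch below contrasts the two access patterns; `load_dataset` is a stand-in for an expensive HDFS read, not a Spark API.

```python
import time

def load_dataset():
    """Stand-in for an expensive disk read (e.g. from HDFS)."""
    time.sleep(0.01)  # simulated I/O latency
    return list(range(1000))

def sum_without_cache(n_iters):
    """Hadoop-style: every iteration pays the disk-read cost again."""
    total = 0
    for _ in range(n_iters):
        data = load_dataset()
        total = sum(data)
    return total

def sum_with_cache(n_iters):
    """Spark-style: read once, keep the dataset in memory across iterations."""
    data = load_dataset()
    total = 0
    for _ in range(n_iters):
        total = sum(data)
    return total
```

Both functions compute the same result, but the cached variant pays the load cost once instead of once per iteration, which is exactly the saving Spark's in-memory caching provides for iterative model building.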
HPC Clusters
• Also known as blades or supercomputers: machines with thousands
of cores.
• They can have different varieties of disk organization, cache,
communication mechanisms, etc.
• Powerful hardware that is optimized for speed and throughput.
• They are not as scalable as Hadoop or Spark clusters, but they are
still capable of processing terabytes of data.
Multicore CPU
• Multicore refers to one machine having dozens of processing
cores. The cores usually share memory but only one disk.
• The number of cores per chip and the number of operations a
core can perform have increased significantly. Newer breeds of
motherboards allow multiple CPUs within a single machine,
thereby increasing the parallelism.
• Until the last few years, CPUs were mainly responsible for
accelerating algorithms for big data analytics.
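A shared-memory multicore machine is typically programmed by splitting the data into chunks and farming the chunks out to worker threads or processes. The sketch below uses Python's `ThreadPoolExecutor` only to show that structure; for genuinely CPU-bound Python code, `ProcessPoolExecutor` would be needed to sidestep the interpreter lock, but the chunk-and-combine pattern is the same.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    """Work done by one core: a partial result over one chunk of the data."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    """Split the data into chunks, process them in parallel, combine results."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))
```

Because all workers share one address space, no data needs to be copied between them; the trade-off is the single disk and memory bus that all cores contend for.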
GPU
• Designed to accelerate the creation of images in a frame
buffer intended for display output.
• GPUs were primarily used for graphical operations such as video
and image editing, accelerating graphics-related processing, etc.
Due to their massively parallel architecture, recent
developments in GPU hardware and related programming
frameworks have given rise to general-purpose computing on
GPUs (GPGPU).
• In addition to the processing cores, a GPU has its own high-throughput
GDDR5 memory, which is many times faster than typical
DDR3 CPU memory.
FPGA
• Highly specialized hardware units built for specific applications.
• FPGAs can be highly optimized for speed and can be orders of
magnitude faster than other platforms for certain applications.
• Due to the customized hardware, the development cost is typically
much higher than for other platforms.
• On the software side, coding has to be done in a hardware
description language (HDL) with low-level knowledge of the
hardware, which increases the algorithm development cost.
Comparison of platforms
System/Platform Level characteristics

• Scalability
• Data I/O performance
• Fault Tolerance
Scalability

Platform                          Scalability
Peer-to-Peer                      *****
Virtual clusters (MapReduce/MPI)  *****
Virtual clusters (Spark)          *****
HPC clusters (MPI/MapReduce)      ***
Multicore (Multithreading)        **
GPU (CUDA)                        **
FPGA (HDL)                        *
Data I/O Performance

Platform                          Data I/O Performance
Peer-to-Peer                      *
Virtual clusters (MapReduce/MPI)  **
Virtual clusters (Spark)          ***
HPC clusters (MPI/MapReduce)      ****
Multicore (Multithreading)        ****
GPU (CUDA)                        *****
FPGA (HDL)                        *****
Fault Tolerance

Platform                          Fault Tolerance
Peer-to-Peer                      *
Virtual clusters (MapReduce/MPI)  *****
Virtual clusters (Spark)          *****
HPC clusters (MPI/MapReduce)      ****
Multicore (Multithreading)        ****
GPU (CUDA)                        ****
FPGA (HDL)                        ****
Comparison of platforms
Application/Algorithm Level characteristics

• Real time processing


• Data size supported
• Iterative task support
Real Time Processing

Platform                          Real Time Processing
Peer-to-Peer                      *
Virtual clusters (MapReduce/MPI)  **
Virtual clusters (Spark)          **
HPC clusters (MPI/MapReduce)      ***
Multicore (Multithreading)        ***
GPU (CUDA)                        *****
FPGA (HDL)                        *****
Data Size supported

Platform                          Data Size Supported
Peer-to-Peer                      *****
Virtual clusters (MapReduce/MPI)  ****
Virtual clusters (Spark)          ****
HPC clusters (MPI/MapReduce)      ****
Multicore (Multithreading)        **
GPU (CUDA)                        **
FPGA (HDL)                        **
Iterative Task Support

Platform                          Iterative Task Support
Peer-to-Peer                      **
Virtual clusters (MapReduce/MPI)  **
Virtual clusters (Spark)          ***
HPC clusters (MPI/MapReduce)      ****
Multicore (Multithreading)        ****
GPU (CUDA)                        ****
FPGA (HDL)                        ****
How will you choose a platform for particular criteria?
• Amount of time
• Number of iterations
• Fault tolerance
• Scalability
Choice of platform
• Data size
• Speed/Throughput
• Training/applying a model
K-means Clustering
• K-means on MapReduce
• K-means on MPI
• K-means on GPU
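To make the mapping concrete, one K-means iteration fits the MapReduce model naturally: the map step assigns each point to its nearest centroid (keyed by centroid index), and the reduce step averages the points assigned to each key. The single-process sketch below uses illustrative names and represents points as tuples.

```python
def nearest(point, centroids):
    """Map step: index of the centroid closest to this point."""
    return min(range(len(centroids)),
               key=lambda j: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[j])))

def kmeans_iteration(points, centroids):
    """One iteration: assign points (map), then recompute means per key (reduce)."""
    groups = {}
    for point in points:
        groups.setdefault(nearest(point, centroids), []).append(point)
    new_centroids = list(centroids)
    for j, members in groups.items():
        dim = len(members[0])
        new_centroids[j] = tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(dim))
    return new_centroids
```

Repeating the iteration until the centroids stop moving gives the full algorithm. On MPI, each rank would hold a partition of the points and the averaging would become an all-reduce; on a GPU, each thread would compute one point's nearest-centroid assignment in parallel.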
