
Platforms to Handle Big Data

Dr. Jigna Ashish Patel


Assistant Professor, CSE Dept,
Institute of Technology,
Nirma University
Objective of the lecture
• Choosing the right platform
• Understanding the needs of the application/algorithm
• Making the right decision by asking:
• How quickly do we need to get the results?
• How big is the data to be processed?
• Does the model building require several iterations or a single iteration?
System/platform level requirements
• Will there be a need for more data processing capability in the future?
• Is the rate of data transfer critical for this application?
• Is there a need for handling hardware failures within the application?
Horizontal Scaling
• It involves distributing the workload across many servers, which may even be commodity machines.
• It is also known as “scale out”, where multiple independent machines are added in order to improve the processing capability.
• Typically, multiple instances of the operating system are running on
separate machines.
Vertical Scaling
• Vertical scaling involves installing more processors, more memory and faster hardware, typically within a single server.
• It is also known as “scale up” and usually involves a single instance of an operating system.
Horizontal Scaling Platforms
• Peer-to-Peer Network
• Apache Hadoop
• Apache Spark

Vertical Scaling Platforms


• High performance computing clusters
• Multicore CPU
• Graphics Processing Unit (GPU)
• Field Programmable Gate Arrays (FPGA)
Peer-to-Peer networks
• Peer-to-peer networks involve millions of machines connected in a network.
• They use a decentralized and distributed network architecture in which the nodes (known as peers) both serve and consume resources.
• They are among the oldest distributed computing platforms.
• Message Passing Interface (MPI) is the communication scheme typically used in such a setup to communicate and exchange data between peers (see the sketch after this list).
• Each node can store data instances, and the scale-out is practically unlimited (can be millions of nodes).
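As a concrete illustration of message passing between peers, here is a minimal sketch using the mpi4py package (the package choice, the per-peer workload, and the script name are assumptions, not part of the slides): each peer computes a local partial result and the results are aggregated with a reduce operation.

```python
# Minimal message-passing sketch with mpi4py (illustrative only).
# Run with, e.g.: mpiexec -n 4 python peers.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this peer's id
size = comm.Get_size()   # total number of peers

# Each peer computes a partial result on its own slice of the work.
local_result = sum(range(rank * 100, (rank + 1) * 100))

# Peers exchange and aggregate results; rank 0 collects the global sum.
total = comm.reduce(local_result, op=MPI.SUM, root=0)
if rank == 0:
    print("global sum:", total)
```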
Apache Hadoop
• An open source framework for storing and processing large datasets using clusters of commodity hardware.
• Hadoop is designed to scale up to hundreds of nodes.
• It is highly fault tolerant.
• The Hadoop platform contains the following two important components: (1) HDFS, the Hadoop Distributed File System, for storage, and (2) YARN for resource management and job scheduling (a minimal word-count sketch in the Hadoop Streaming style follows).
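To make the processing model concrete, below is a hedged word-count sketch in the Hadoop Streaming style (a standard illustrative example; the script name, paths, and invocation are assumptions). The mapper emits key/value pairs and the reducer aggregates the values for each key.

```python
# Hadoop Streaming style word count (illustrative; paths/names assumed).
# Example invocation:
#   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#     -mapper "python wc.py map" -reducer "python wc.py reduce" -file wc.py
import sys
from itertools import groupby

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive together.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```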
Apache Spark
• Developed by researchers at the University of California, Berkeley, and designed to overcome the disk I/O limitations of Hadoop.
• It has the ability to perform in-memory computations.
• It allows data to be cached in memory, thus eliminating Hadoop’s disk overhead limitation for iterative tasks (see the caching sketch below).
• It supports Java, Scala and Python, and for certain tasks it has been tested to be up to 100× faster than Hadoop MapReduce.
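A minimal PySpark sketch of in-memory caching for iterative work is shown below (the file name and the per-iteration operation are placeholders); the point is that the data is loaded once and reused from memory across iterations instead of being re-read from disk each time.

```python
# Minimal PySpark caching sketch (assumes a local Spark installation and a
# hypothetical input file "points.csv").
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

# Load the data once and cache it in memory so repeated passes avoid
# the per-iteration disk I/O of a chain of MapReduce jobs.
points = spark.read.csv("points.csv", inferSchema=True).cache()

for i in range(10):            # each iteration reuses the cached data
    total = points.count()     # stand-in for one pass of an iterative algorithm
    print(f"iteration {i}: {total} rows")

spark.stop()
```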
HPC clusters
• Also known as blades or supercomputers, these are machines with thousands of cores.
• They can have a variety of disk organizations, caches, communication mechanisms, etc.
• They use powerful hardware that is optimized for speed and throughput.
• They are not as scalable as Hadoop or Spark clusters, but they are still capable of processing terabytes of data.
Multicore CPU
• Multicore refers to one machine having dozens of processing cores. They usually have shared memory but only one disk.
• The number of cores per chip and the number of operations that a core can perform have increased significantly. Newer breeds of motherboards allow multiple CPUs within a single machine, thereby increasing the parallelism (a shared-memory sketch follows).
• Until the last few years, CPUs were mainly responsible for accelerating algorithms for big data analytics.
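As a small illustration of parallelism on a multicore CPU, here is a sketch using Python's standard multiprocessing module (the workload and the number of worker processes are illustrative assumptions).

```python
# Minimal multicore sketch: split a computation across worker processes.
from multiprocessing import Pool

def partial_sum(chunk):
    """Work done independently on one core."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::8] for i in range(8)]   # split work across 8 workers
    with Pool(processes=8) as pool:
        result = sum(pool.map(partial_sum, chunks))
    print(result)
```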
GPU
• A GPU is designed to accelerate the creation of images in a frame buffer intended for display output.
• GPUs were primarily used for graphical operations such as video and image editing, accelerating graphics-related processing, etc. Due to their massively parallel architecture, recent developments in GPU hardware and related programming frameworks have given rise to GPGPU (general-purpose computing on GPUs); a minimal sketch follows.
• In addition to the processing cores, a GPU has its own high-throughput GDDR5 memory, which is many times faster than typical DDR3 main memory.
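A minimal GPGPU sketch is shown below using the CuPy library (an assumption; the slides do not prescribe a framework). It presumes an NVIDIA GPU with CUDA drivers installed and offloads an element-wise computation to the device.

```python
# Minimal GPGPU sketch with CuPy (requires an NVIDIA GPU + CUDA).
import cupy as cp

x = cp.random.random(10_000_000)   # array allocated in GPU memory
y = cp.sqrt(x) * 2.0               # computed in parallel on the GPU cores
print(float(y.sum()))              # result transferred back to the host
```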
FPGA
• FPGAs are highly specialized hardware units that are custom-built for specific applications.
• FPGAs can be highly optimized for speed and can be orders of
magnitude faster compared to other platforms for certain
applications.
• Due to customized hardware, the development cost is typically
much higher compared to other platforms.
• On the software side, coding has to be done in a hardware description language (HDL), with low-level knowledge of the hardware, which increases the algorithm development cost.
Comparison of platforms
System/Platform Level characteristics

• Scalability
• Data I/O performance
• Fault Tolerance
Scalability

Platform Scalability
Peer-to-Peer *****
Virtual Clusters (MapReduce/MPI) *****
Virtual Clusters (Spark) *****
HPC clusters (MPI/MapReduce) ***
Multicore (Multithreading) **
GPU (CUDA) **
FPGA (HDL) *
Data I/O Performance

Platform Data I/O Performance
Peer-to-Peer *
Virtual Clusters (MapReduce/MPI) **
Virtual Clusters (Spark) ***
HPC clusters (MPI/MapReduce) ****
Multicore (Multithreading) ****
GPU (CUDA) *****
FPGA (HDL) *****
Fault Tolerance

Platform Fault Tolerance
Peer-to-Peer *
Virtual Clusters (MapReduce/MPI) *****
Virtual Clusters (Spark) *****
HPC clusters (MPI/MapReduce) ****
Multicore (Multithreading) ****
GPU (CUDA) ****
FPGA (HDL) ****
Comparison of platforms
Application/Algorithm Level characteristics

• Real time processing


• Data size supported
• Iterative task support
Real Time Processing

Platform Real Time Processing
Peer-to-Peer *
Virtual Clusters (MapReduce/MPI) **
Virtual Clusters (Spark) **
HPC clusters (MPI/MapReduce) ***
Multicore (Multithreading) ***
GPU (CUDA) *****
FPGA (HDL) *****
Data Size supported

Platform Data Size Supported
Peer-to-Peer *****
Virtual Clusters (MapReduce/MPI) ****
Virtual Clusters (Spark) ****
HPC clusters (MPI/MapReduce) ****
Multicore (Multithreading) **
GPU (CUDA) **
FPGA (HDL) **
Iterative Task Support

Platform Iterative Task Support
Peer-to-Peer **
Virtual Clusters (MapReduce/MPI) **
Virtual Clusters (Spark) ***
HPC clusters (MPI/MapReduce) ****
Multicore (Multithreading) ****
GPU (CUDA) ****
FPGA (HDL) ****
How will you choose one of the platforms for a particular criterion?
• Amount of time
• Number of iterations
• Fault tolerance
• Scalability
Choice of platform
• Data size
• Speed/Throughput
• Training/Applying a model
K-means clustering
K-means on MapReduce
K-means on MPI
K-means on GPU
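The sketch below is a plain NumPy version of the K-means iteration (the data, k, and iteration count are illustrative placeholders). The assignment step is what the MapReduce, MPI, and GPU variants parallelize as a "map" over the points, and the centroid update plays the role of the "reduce".

```python
# Minimal NumPy K-means sketch (illustrative data and parameters).
import numpy as np

def kmeans(points, k=3, iterations=10):
    # Start from k randomly chosen points as initial centroids.
    centroids = points[np.random.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # "Map" step: assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # "Reduce" step: recompute each centroid as the mean of its points
        # (a robust implementation would also handle empty clusters).
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

if __name__ == "__main__":
    data = np.random.rand(1000, 2)          # hypothetical 2-D data
    centers, assignment = kmeans(data)
    print(centers)
```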
