4. Big Data Platforms

The document discusses various platforms for handling big data, focusing on the importance of choosing the right platform based on application requirements and system capabilities. It compares horizontal and vertical scaling, detailing platforms like Apache Hadoop, Apache Spark, and High-Performance Computing clusters, among others. Additionally, it evaluates these platforms based on scalability, data I/O performance, fault tolerance, and their suitability for real-time processing and iterative tasks.


Platforms to Handle Big Data

Dr. Jigna Ashish Patel


Assistant Professor, CSE Dept,
Institute of Technology,
Nirma University
Objective of the lecture
• Choosing the right platform
• Understanding the needs of the application/algorithm
• Making the right decision
Application/algorithm level requirements
• How quickly do we need to get the results?
• How big is the data to be processed?
• Does the model building require several iterations or a single iteration?
System/platform level requirements
• Will there be a need for more data processing capability in the
future?
• Is the rate of data transfer critical for this application?
• Is there a need for handling hardware failures within the application?
Scaling
• Scaling is the ability of a system to adapt to increased demands in
terms of data processing.
Horizontal Scaling
• It involves distributing the workload across many servers, which may
even be commodity machines.
• It is also known as "scale out": multiple independent machines are
added together in order to improve the processing capability.
• Typically, multiple instances of the operating system run on
separate machines.
Vertical Scaling
• Vertical scaling involves installing more processors, more memory, and
faster hardware, typically within a single server.
• It is also known as "scale up" and usually involves a single instance
of an operating system.
Comparison of Pros and Cons
• Horizontal scaling: scales out to very large data sizes on low-cost
commodity hardware and tolerates node failures, but it requires
distributed software and the network communication between nodes
adds overhead.
• Vertical scaling: offers a simpler programming model and fast data
access within a single machine, but the hardware is expensive and
there is a hard ceiling on how far one server can grow.
Horizontal Scaling Platforms
• Peer-to-Peer Networks
• Apache Hadoop
• Apache Spark

Vertical Scaling Platforms
• High-Performance Computing (HPC) clusters
• Multicore CPU
• Graphics Processing Unit (GPU)
• Field Programmable Gate Arrays (FPGA)
Peer-to-Peer Networks
• Involve millions of machines connected in a network.
• A decentralized and distributed network architecture in which the
nodes in the network (known as peers) both serve and consume
resources.
• One of the oldest distributed computing platforms.
• The Message Passing Interface (MPI) is the communication scheme
typically used in such a setup to communicate and exchange data
between peers.
• Each node can store data instances, and the scale-out is
practically unlimited (can be millions of nodes).
Apache Hadoop
• An open-source framework for storing and processing large
datasets using clusters of commodity hardware.
• Hadoop is designed to scale out from a single server to hundreds
of machines.
• Highly fault tolerant.
• The Hadoop platform contains the following two important
components: (1) the Hadoop Distributed File System (HDFS) and
(2) YARN (Yet Another Resource Negotiator).
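The MapReduce model at the heart of Hadoop can be illustrated with the classic word-count example. The sketch below is a minimal single-process simulation in plain Python (the function names are illustrative, not Hadoop APIs); in a real job the two phases run distributed across the cluster, with HDFS supplying the input splits.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct key (word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(lines):
    # In Hadoop, a shuffle step between the phases groups pairs by key;
    # here the dictionary inside reduce_phase plays that role.
    return reduce_phase(map_phase(lines))
```

The same map/shuffle/reduce shape carries over to many analytics tasks, which is why the model scales out so naturally.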
Apache Spark
• Developed by researchers at the University of California at
Berkeley; designed to overcome the disk I/O limitations of Hadoop.
• Has the ability to perform in-memory computations.
• Allows the data to be cached in memory, thus eliminating
Hadoop's disk-overhead limitation for iterative tasks.
• Supports Java, Scala, and Python; for certain tasks it has been
measured to be up to 100× faster than Hadoop MapReduce.
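Spark's advantage on iterative tasks comes from reading the data once and keeping it cached in memory, whereas Hadoop MapReduce re-reads from disk on every pass. The pure-Python sketch below contrasts the two access patterns; `load_dataset` is a stand-in for an expensive HDFS read, not a Spark API.

```python
import time

def load_dataset():
    """Stand-in for an expensive disk read (e.g. from HDFS)."""
    time.sleep(0.01)  # simulated I/O latency
    return list(range(1000))

def sum_without_cache(n_iters):
    """Hadoop-style: every iteration pays the disk-read cost again."""
    total = 0
    for _ in range(n_iters):
        data = load_dataset()
        total = sum(data)
    return total

def sum_with_cache(n_iters):
    """Spark-style: read once, keep the dataset in memory across iterations."""
    data = load_dataset()
    total = 0
    for _ in range(n_iters):
        total = sum(data)
    return total
```

Both functions compute the same result, but the cached variant pays the load cost once instead of once per iteration, which is exactly the saving Spark's in-memory caching provides for iterative model building.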
HPC Clusters
• Also known as blades or supercomputers: machines with thousands
of cores.
• They can have different varieties of disk organization, cache,
communication mechanisms, etc.
• Powerful hardware that is optimized for speed and throughput.
• They are not as scalable as Hadoop or Spark clusters, but they are
still capable of processing terabytes of data.
Multicore CPU
• Multicore refers to one machine having dozens of processing
cores. The cores usually share memory but only one disk.
• The number of cores per chip and the number of operations a
core can perform have increased significantly. Newer breeds of
motherboards allow multiple CPUs within a single machine,
thereby increasing the parallelism.
• Until the last few years, CPUs were mainly responsible for
accelerating algorithms for big data analytics.
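A shared-memory multicore machine is typically programmed by splitting the data into chunks and farming the chunks out to worker threads or processes. The sketch below uses Python's `ThreadPoolExecutor` only to show that structure; for genuinely CPU-bound Python code, `ProcessPoolExecutor` would be needed to sidestep the interpreter lock, but the chunk-and-combine pattern is the same.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    """Work done by one core: a partial result over one chunk of the data."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    """Split the data into chunks, process them in parallel, combine results."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))
```

Because all workers share one address space, no data needs to be copied between them; the trade-off is the single disk and memory bus that all cores contend for.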
GPU
• Designed to accelerate the creation of images in a frame
buffer intended for display output.
• GPUs were primarily used for graphical operations such as video
and image editing, accelerating graphics-related processing, etc.
Due to their massively parallel architecture, recent
developments in GPU hardware and related programming
frameworks have given rise to general-purpose computing on
GPUs (GPGPU).
• In addition to the processing cores, a GPU has its own high-throughput
GDDR5 memory, which is many times faster than typical
DDR3 CPU memory.
FPGA
• Highly specialized hardware units built for specific applications.
• FPGAs can be highly optimized for speed and can be orders of
magnitude faster than other platforms for certain applications.
• Due to the customized hardware, the development cost is typically
much higher than for other platforms.
• On the software side, coding has to be done in a hardware
description language (HDL) with low-level knowledge of the
hardware, which increases the algorithm development cost.
Comparison of platforms
System/Platform Level characteristics

• Scalability
• Data I/O performance
• Fault Tolerance
Scalability

Platform                          Scalability
Peer-to-Peer                      *****
Virtual clusters (MapReduce/MPI)  *****
Virtual clusters (Spark)          *****
HPC clusters (MPI/MapReduce)      ***
Multicore (Multithreading)        **
GPU (CUDA)                        **
FPGA (HDL)                        *
Data I/O Performance

Platform                          Data I/O Performance
Peer-to-Peer                      *
Virtual clusters (MapReduce/MPI)  **
Virtual clusters (Spark)          ***
HPC clusters (MPI/MapReduce)      ****
Multicore (Multithreading)        ****
GPU (CUDA)                        *****
FPGA (HDL)                        *****
Fault Tolerance

Platform                          Fault Tolerance
Peer-to-Peer                      *
Virtual clusters (MapReduce/MPI)  *****
Virtual clusters (Spark)          *****
HPC clusters (MPI/MapReduce)      ****
Multicore (Multithreading)        ****
GPU (CUDA)                        ****
FPGA (HDL)                        ****
Comparison of platforms
Application/Algorithm Level characteristics

• Real time processing


• Data size supported
• Iterative task support
Real Time Processing

Platform                          Real Time Processing
Peer-to-Peer                      *
Virtual clusters (MapReduce/MPI)  **
Virtual clusters (Spark)          **
HPC clusters (MPI/MapReduce)      ***
Multicore (Multithreading)        ***
GPU (CUDA)                        *****
FPGA (HDL)                        *****
Data Size supported

Platform                          Data Size Supported
Peer-to-Peer                      *****
Virtual clusters (MapReduce/MPI)  ****
Virtual clusters (Spark)          ****
HPC clusters (MPI/MapReduce)      ****
Multicore (Multithreading)        **
GPU (CUDA)                        **
FPGA (HDL)                        **
Iterative Task Support

Platform                          Iterative Task Support
Peer-to-Peer                      **
Virtual clusters (MapReduce/MPI)  **
Virtual clusters (Spark)          ***
HPC clusters (MPI/MapReduce)      ****
Multicore (Multithreading)        ****
GPU (CUDA)                        ****
FPGA (HDL)                        ****
How will you choose a platform for particular criteria?
• Amount of time
• Number of iterations
• Fault tolerance
• Scalability
Choice of platform
• Data size
• Speed/Throughput
• Training/applying a model
K-means Clustering
• K-means on MapReduce
• K-means on MPI
• K-means on GPU
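To make the mapping concrete, one K-means iteration fits the MapReduce model naturally: the map step assigns each point to its nearest centroid (keyed by centroid index), and the reduce step averages the points assigned to each key. The single-process sketch below uses illustrative names and represents points as tuples.

```python
def nearest(point, centroids):
    """Map step: index of the centroid closest to this point."""
    return min(range(len(centroids)),
               key=lambda j: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[j])))

def kmeans_iteration(points, centroids):
    """One iteration: assign points (map), then recompute means per key (reduce)."""
    groups = {}
    for point in points:
        groups.setdefault(nearest(point, centroids), []).append(point)
    new_centroids = list(centroids)
    for j, members in groups.items():
        dim = len(members[0])
        new_centroids[j] = tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(dim))
    return new_centroids
```

Repeating the iteration until the centroids stop moving gives the full algorithm. On MPI, each rank would hold a partition of the points and the averaging would become an all-reduce; on a GPU, each thread would compute one point's nearest-centroid assignment in parallel.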
