Foundation to speed up Hadoop's computational software process. Spark has its own cluster management, so Hadoop is only one of the ways to deploy Spark.
• Spark uses Hadoop in two ways: one is storage and the other is processing. Since Spark has its own cluster-management computation, it uses Hadoop for storage purposes only.

Apache Spark
• Apache Spark is a distributed, open-source processing system used for big data workloads. Spark uses optimized query execution and in-memory caching for fast queries against data of any size. In short, it is a fast, general-purpose engine for large-scale data processing.
• It is much faster than earlier approaches to big data such as classical MapReduce. Spark is faster because it executes in RAM/memory, which makes processing faster than working from disk drives.
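As a small illustration of in-memory caching, the hedged PySpark sketch below loads a dataset, caches it, and runs two aggregations against the cached copy. The file name and column name are placeholders, and a local or cluster SparkSession is assumed to be available.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession; "local[*]" is an assumption for a
# single-machine run and would be replaced by a real cluster master URL.
spark = SparkSession.builder.master("local[*]").appName("caching-demo").getOrCreate()

# Hypothetical input file and column name, used only for illustration.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# cache() keeps the DataFrame in memory after it is first computed,
# so repeated queries do not re-read the file from disk.
df.cache()

print(df.count())                          # first action: materializes and caches the data
df.groupBy("event_type").count().show()   # second action: served from the in-memory copy
```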
Apache Spark Evolution
• Spark is one of the most important sub-projects of Hadoop. It was developed in 2009 by Matei Zaharia at UC Berkeley's AMPLab. In 2010, it was open-sourced under the BSD license. Spark was donated to the Apache Software Foundation in 2013, and it has been a top-level Apache project since February 2014.

Components of Spark
• Apache Spark Core: Spark Core is the underlying general execution engine for the Spark platform. It provides in-memory computing and the ability to reference datasets in external storage systems.
• Spark SQL: This component is the Apache Spark module for working with many kinds of structured data. The interfaces provided by Spark SQL give Spark more information about both the data and the computation being performed.
• Spark Streaming: Spark Streaming allows Spark to process streaming data in real time. The data can be ingested from several sources such as the Hadoop Distributed File System (HDFS), Flume, and Kafka. That data can then be processed with complex algorithms and pushed out to live dashboards, databases, and file systems.
• Machine Learning Library (MLlib): Apache Spark ships with a rich library called MLlib. MLlib includes a wide range of machine learning algorithms for collaborative filtering, clustering, regression, and classification. It also contains other tools for constructing, evaluating, and tuning ML pipelines. All of these capabilities let Spark scale out across the cluster.
• GraphX: Apache Spark comes with a library for manipulating graph data and performing graph computations, known as GraphX. This component unifies the Extract, Transform, and Load (ETL) process, iterative graph computation, and exploratory analysis in a single system.

Architecture of Spark
• The architecture of Spark contains three main elements, listed below:
• API
• Data Storage
• Resource Management

API
• This element lets application developers create Spark-based applications through a standard API interface. Spark offers APIs for the Python, Java, and Scala programming languages.

Data Storage
• Spark uses the Hadoop Distributed File System for data storage purposes. It works with any Hadoop-compatible data source, including Cassandra, HBase, HDFS, etc.

Resource Management
• Spark can be deployed as a stand-alone server, or on a shared computing framework such as YARN or Mesos.

RDD in Spark
• RDD stands for Resilient Distributed Dataset. It is a core concept of the Spark framework. Think of an RDD as a table in a database. RDDs support two kinds of operations, illustrated in the sketch below:
• Action
• Transformation
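To make the action/transformation distinction concrete, here is a hedged PySpark RDD sketch; the numbers and lambda functions are illustrative only, and a SparkSession is assumed to be available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an in-memory list (illustrative data).
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations are lazy: they only describe a new RDD and do not run yet.
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Actions trigger actual computation and return results to the driver.
print(squares.collect())   # [4, 16, 36]
print(squares.count())     # 3
```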
Spark Installation
• There are a few different ways to use and install Spark. We can install Spark on our own machine as a stand-alone framework, or use Spark VM (Virtual Machine) images available from vendors such as MapR, Hortonworks, and Cloudera. We can also use Spark already configured and installed in the cloud (such as Databricks Cloud).
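For the stand-alone, single-machine option, a hedged sketch of a first sanity check might look like the following; it assumes PySpark was installed with pip, which is only one of several possible setups.

```python
# A minimal local-mode check, assuming PySpark was installed with:
#   pip install pyspark
from pyspark.sql import SparkSession

# "local[*]" runs Spark inside this single process, using all local cores.
spark = SparkSession.builder.master("local[*]").appName("install-check").getOrCreate()

# If the installation works, this prints the running Spark version.
print(spark.version)
spark.stop()
```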
Features of Spark
• Fast processing: One of the most essential aspects of Spark is its speed, which has led the big data world to choose this technology over others. Big data is characterized by volume, velocity, variety, and veracity, and therefore needs to be processed at high speed. Spark's RDD (Resilient Distributed Dataset) saves time in read and write operations, allowing it to run many times faster than Hadoop.
• Flexibility: Spark supports more than one language and lets developers write applications in Python, R, Scala, or Java.
• In-memory computing: Apache Spark can store data in the servers' RAM, which allows quick access and speeds up analytics.
• Real-time processing: Apache Spark can process streaming data in real time. Unlike MapReduce, which only processes stored data, Apache Spark can process data as it arrives and therefore produce instant results (see the streaming sketch after this list).
• Better analytics: Where MapReduce provides map and reduce functions, Spark provides much more. Apache Spark brings together a rich set of SQL queries, machine learning, complex analytics, and so on. With all of these capabilities, analytics can be done more effectively with Spark.
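As a hedged illustration of real-time processing, the sketch below uses Spark Structured Streaming to count words arriving on a local network socket; the host and port are placeholders, and a tool such as netcat would have to feed text into that socket for the query to show output.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.master("local[*]").appName("streaming-demo").getOrCreate()

# Read a stream of text lines from a socket (host and port are illustrative placeholders).
lines = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console as new data arrives.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```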
Hadoop Ecosystem
• Apache Hadoop is an open-source framework intended to make interaction with big data easier. For those who are not acquainted with this technology, one question arises: what is big data? Big data is a term for data sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has made its place in industries and companies that need to work on large, sensitive data sets requiring efficient handling.

Components
• There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common Utilities.
• The following components collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
Moving data in and out of Hadoop
• Moving data in and out of Hadoop involves various methods depending on the source and destination of the data, as well as the specific Hadoop components involved. Here are some common techniques and tools used:
• 1. HDFS (Hadoop Distributed File System)
• Uploading Data to HDFS:
• HDFS Command Line Interface (CLI): You can use the hadoop fs -put or hdfs dfs -put commands to upload data to HDFS (an illustrative example is included in the sketch after this list).
• 3. Apache Flume
• Flume is used for efficiently collecting, aggregating, and moving large amounts of log data from various sources to a centralized data store, such as HDFS.
• 4. Apache Kafka
• Kafka is a distributed messaging system often used for building real-time data pipelines. Data from various sources can be sent to Kafka topics, and from there, Kafka consumers can write the data into Hadoop.
• 5. Apache NiFi
• NiFi is a data integration tool that supports data ingestion, routing, and transformation. It can be used to move data between different systems, including Hadoop.
• 6. Hive and HBase Integration
• Hive: You can load data into Hive tables using the LOAD DATA command, and data from Hive tables can be exported using commands like INSERT OVERWRITE.
• HBase: Data can be imported into HBase using tools like HBase bulk load, and exported using HBase Export.
• 7. Custom Scripts and APIs
• For specific use cases, custom scripts using Hadoop APIs (Java, Python, etc.) can be written to move data in and out of Hadoop (see the sketch after this list).
• 8. Cloud Integration
• If you're using Hadoop in the cloud, integration with cloud storage (such as AWS S3, Azure Blob Storage, or Google Cloud Storage) can be done using tools like S3DistCp for Amazon S3 or similar tools for other cloud providers.
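Tying items 1 and 7 together, here is a hedged Python sketch of moving data into and out of HDFS from a custom script; the namenode address, port, and all paths are placeholders, and the equivalent hadoop fs commands appear only as illustrative comments.

```python
# Equivalent CLI commands would look like, for example:
#   hadoop fs -put /tmp/local_input.csv /user/hadoop/data/input.csv     (into HDFS)
#   hadoop fs -get /user/hadoop/data/output/ /tmp/local_output/         (out of HDFS)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-io-demo").getOrCreate()

# Moving data IN: read a local file and write it into HDFS as Parquet.
local_df = spark.read.csv("file:///tmp/local_input.csv", header=True, inferSchema=True)
local_df.write.mode("overwrite").parquet("hdfs://namenode:8020/user/hadoop/data/input_parquet")

# Moving data OUT: read from HDFS, take a subset, and write back to the local file system.
hdfs_df = spark.read.parquet("hdfs://namenode:8020/user/hadoop/data/input_parquet")
hdfs_df.limit(100).write.mode("overwrite").csv("file:///tmp/local_output", header=True)
```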