
Apache Spark Ecosystem – Complete Spark Components Guide
1. Objective
In this tutorial on the Apache Spark ecosystem, we will learn what Apache Spark is and what its ecosystem consists of. It covers the components of the Spark ecosystem: the Spark Core component, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and SparkR. We will also learn the features of the Apache Spark ecosystem components in this Spark tutorial.

Apache Spark Ecosystem – Complete Spark Components Guide


2. What is Apache Spark?
Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R. Spark provides an optimized engine that supports general execution graphs. It also offers rich high-level tools for structured data processing, machine learning, graph processing, and streaming. Spark can either run standalone or on an existing cluster manager.

3. Introduction to Apache Spark Ecosystem Components

Apache Spark Ecosystem – Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, SparkR.

The following six components of the Apache Spark ecosystem empower Apache Spark: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and SparkR.

Let us now learn about these Apache Spark ecosystem components in detail
below:
3.1. Apache Spark Core
All the functionality provided by Apache Spark is built on top of Spark Core. It delivers speed by providing in-memory computation capability. Thus, Spark Core is the foundation for parallel and distributed processing of huge datasets.
The key features of Apache Spark Core are:
• It is in charge of essential I/O functionalities.
• It plays a significant role in programming and monitoring the Spark cluster.
• Task dispatching.
• Fault recovery.
• It overcomes the snag of MapReduce by using in-memory computation.

Spark Core is embedded with a special collection called the RDD (resilient distributed dataset). The RDD is one of the core abstractions of Spark. Spark RDDs partition data across all the nodes in a cluster and hold it in the cluster's memory pool as a single unit. There are two kinds of operations performed on RDDs, transformations and actions:
• Transformation: a function that produces a new RDD from existing RDDs.
• Action: transformations only build new RDDs from one another; when we want to work with the actual dataset and return a result, we use an action.
Refer to these guides to learn more about the Spark RDD Transformations & Actions API and the different ways to create RDDs in Spark.
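As a minimal sketch in Scala (the object name and sample numbers are illustrative, not from the tutorial), the snippet below chains transformations such as map and filter, then triggers execution with a reduce action:

import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformations: each one lazily builds a new RDD from an existing one
    val numbers = sc.parallelize(1 to 10)
    val squares = numbers.map(n => n * n)
    val evens   = squares.filter(_ % 2 == 0)

    // Action: triggers execution of the lineage and returns a result to the driver
    val total = evens.reduce(_ + _)
    println(s"Sum of even squares: $total")

    sc.stop()
  }
}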
3.2. Apache Spark SQL
The Spark SQL component is a distributed framework for structured data processing. With Spark SQL, Spark gets more information about the structure of both the data and the computation. With this information, Spark can perform extra optimizations. It uses the same execution engine while computing an output, regardless of which API or language is used to express the computation.
Spark SQL works with structured and semi-structured data. It also enables powerful, interactive, analytical applications across both streaming and historical data. Spark SQL is the Spark module for structured data processing; it thus acts as a distributed SQL query engine.

Features of Spark SQL include:


• Cost-based optimizer. Follow the Spark SQL Optimization tutorial to learn more.
• Mid-query fault tolerance: queries can scale to thousands of nodes and run for multiple hours on the Spark engine while recovering from failures in the middle of a query. Follow this guide to learn more about Spark fault tolerance.
• Full compatibility with existing Hive data.
• DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC.
• Provision to carry structured data inside Spark programs, using either SQL or the familiar DataFrame API.
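As a brief Scala sketch (the people.json path and the age/city columns are assumed for illustration), the snippet below expresses the same query through both the DataFrame API and plain SQL:

import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-example")
      .master("local[*]")
      .getOrCreate()

    // Read structured data from a JSON source (path is a placeholder)
    val people = spark.read.json("people.json")

    // The query expressed through the DataFrame API ...
    people.filter(people("age") > 21).groupBy("city").count().show()

    // ... and the same query expressed as SQL against a temporary view
    people.createOrReplaceTempView("people")
    spark.sql("SELECT city, COUNT(*) FROM people WHERE age > 21 GROUP BY city").show()

    spark.stop()
  }
}

Both versions go through the same optimizer and execution engine, which is the point made above: the API used to express the computation does not change how it is executed.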

3.3. Apache Spark Streaming


It is an add-on to the core Spark API that allows scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark can ingest data from sources like Kafka, Flume, Kinesis, or TCP sockets and process it with various algorithms. Finally, the processed data is delivered to file systems, databases, and live dashboards. Spark uses micro-batching for real-time streaming.
Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches of data. Hence, Spark Streaming groups the live data into small batches and delivers them to the batch engine for processing. It also provides fault-tolerance characteristics. Learn Spark Streaming in detail from this Apache Spark Streaming Tutorial.
How does Spark Streaming work?
There are 3 phases of Spark Streaming:

a. GATHERING
Spark Streaming provides two categories of built-in streaming sources:
• Basic sources: sources available directly in the StreamingContext API, for example file systems and socket connections.
• Advanced sources: sources like Kafka, Flume, Kinesis, etc., which are available through extra utility classes. Hence Spark can access data from different sources such as Kafka, Flume, Kinesis, or TCP sockets.

b. PROCESSING
The gathered data is processed using complex algorithms expressed with high-level functions, for example map, reduce, join, and window. Refer to this guide to learn about Spark Streaming transformation operations.

c. DATA STORAGE
The processed data is pushed out to file systems, databases, and live dashboards.
Spark Streaming also provides a high-level abstraction known as a discretized stream, or DStream.
A DStream in Spark represents a continuous stream of data. We can create a DStream in two ways: either from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is a sequence of RDDs.
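As an illustrative Scala sketch (the localhost:9999 socket and the 5-second batch interval are assumptions, not part of the tutorial), the snippet below covers the three phases above: it gathers data from a basic socket source as a DStream, processes each micro-batch with high-level operations, and pushes the results to the console:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, at least one for processing
    val conf = new SparkConf().setAppName("dstream-example").setMaster("local[2]")
    // Each micro-batch covers 5 seconds of incoming data
    val ssc = new StreamingContext(conf, Seconds(5))

    // Gathering: a basic source from the StreamingContext API (a TCP socket)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Processing: high-level operations on the DStream
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Data storage: here the per-batch results are simply printed
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}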
3.4. Apache Spark MLlib (Machine Learning Library)
MLlib in Spark is a scalable machine learning library that delivers both high-quality algorithms and high speed.
The motive behind MLlib's creation is to make machine learning scalable and easy. It contains implementations of various machine learning algorithms, for example clustering, regression, classification, and collaborative filtering. Some lower-level machine learning primitives, such as a generic gradient descent optimization algorithm, are also present in MLlib.

As of Spark 2.0, the RDD-based API in the spark.mllib package entered maintenance mode. Since that release, the DataFrame-based API is the primary machine learning API for Spark, so MLlib will not add any new features to the RDD-based API.
The reason MLlib switched to a DataFrame-based API is that it is more user-friendly than RDDs. Some of the benefits of using DataFrames are Spark data sources, SQL/DataFrame queries, the Tungsten and Catalyst optimizations, and uniform APIs across languages. MLlib also uses the linear algebra package Breeze, a collection of libraries for numerical computing and machine learning.
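As a small Scala sketch of the DataFrame-based spark.ml API (the toy data and parameter values are made up for illustration), the snippet below fits a logistic regression model on a DataFrame of labeled feature vectors:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MLlibExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mllib-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy training data: (label, features) rows in a DataFrame
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    // Fit a classification model using the DataFrame-based API
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)
    println(s"Coefficients: ${model.coefficients}")

    spark.stop()
  }
}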

3.5. Apache Spark GraphX


GraphX in Spark is an API for graphs and graph-parallel execution. It is a network graph analytics engine and data store. Clustering, classification, traversal, searching, and pathfinding are also possible on graphs. Furthermore, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
GraphX also optimizes how vertices and edges are represented when they are primitive data types. To support graph computation it provides fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API.
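As a minimal Scala sketch (the vertex names and the "follows" edge label are invented for illustration), the snippet below builds a property graph from vertex and edge RDDs and applies a couple of the fundamental operators:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graphx-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Vertices carry a user name, edges carry a relationship label
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")
    ))

    // A directed multigraph with properties attached to each vertex and edge
    val graph = Graph(vertices, edges)

    // Fundamental operators: inspect in-degrees and extract a subgraph
    graph.inDegrees.collect().foreach(println)
    val follows = graph.subgraph(epred = _.attr == "follows")
    println(s"Edges in subgraph: ${follows.edges.count()}")

    sc.stop()
  }
}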
3.6. Apache SparkR
SparkR was introduced in the Apache Spark 1.4 release. The key component of SparkR is the SparkR DataFrame. DataFrames are a fundamental data structure for data processing in R, and the concept extends to other languages with libraries like pandas.
R also provides software facilities for data manipulation, calculation, and graphical display. Hence, the main idea behind SparkR was to explore different techniques to integrate the usability of R with the scalability of Spark. It is an R package that gives a lightweight frontend to use Apache Spark from R.

There are various benefits of SparkR:


• Data Sources API: by tying into Spark SQL’s data sources API, SparkR can read data from a variety of sources, for example Hive tables, JSON files, and Parquet files.
• DataFrame optimizations: SparkR DataFrames also inherit all the optimizations made to the computation engine in terms of code generation and memory management.
• Scalability to many cores and machines: operations that execute on SparkR DataFrames are distributed across all the cores and machines available in the Spark cluster. As a result, SparkR DataFrames can run on terabytes of data and on clusters with thousands of machines.

4. Conclusion
Apache Spark amplifies existing big data tools for analysis rather than reinventing the wheel. It is the Apache Spark ecosystem components that make it more popular than other big data frameworks. Hence, Apache Spark is a common platform for different types of data processing, for example real-time data analytics, structured data processing, and graph processing. Therefore, Apache Spark is gaining considerable momentum and is a promising alternative for supporting ad-hoc queries and iterative processing logic, replacing MapReduce. It offers interactive code execution using the Python and Scala REPLs, but you can also write and compile your applications in Scala and Java.

Features of Apache Spark – Learn the benefits of using Spark
1. Objective
Apache Spark, being an open-source framework for big data, has various advantages over other big data solutions: Apache Spark is dynamic in nature, it supports in-memory computation of RDDs, and it provides reusability, fault tolerance, real-time stream processing, and much more. In this tutorial on the features of Apache Spark, we will discuss the various advantages of Spark that answer the questions: Why should we learn Apache Spark? Why is Spark better than Hadoop MapReduce, and why is Spark called the 3G of Big Data?

Features of Apache Spark – Learn the benefits of using Spark


2. Introduction to Apache Spark
Apache Spark is a lightning-fast, in-memory data processing engine. Spark is designed mainly for data science, and its abstractions make data science easier. Apache Spark provides high-level APIs in Java, Scala, Python, and R. It also has an optimized engine for general execution graphs. Apache Spark is one of the largest open source projects in data processing.

3. Features of Apache Spark


Let’s discuss sparkling features of Apache Spark:

a. Swift Processing
Using Apache Spark, we achieve high data processing speeds: about 100x faster in memory and 10x faster on disk. This is made possible by reducing the number of read/write operations to disk.

b. Dynamic in Nature
We can easily develop a parallel application, as Spark provides around 80 high-level operators.

c. In-Memory Computation in Spark


With in-memory processing, we can increase the processing speed. The data is cached, so we need not fetch it from disk every time, which saves time. Spark has a DAG execution engine that facilitates in-memory computation and acyclic data flow, resulting in high speed. A brief sketch of caching follows.
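A minimal Scala sketch of in-memory caching (it assumes a SparkContext named sc is already available, as in the Spark shell, and the logs.txt path is a placeholder):

// Cache the filtered RDD so repeated actions reuse the in-memory copy
val logs = sc.textFile("logs.txt")
val errors = logs.filter(_.contains("ERROR")).cache()

// The first action materializes and caches the RDD; later actions read it from memory
println(errors.count())
println(errors.take(5).mkString("\n"))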

d. Reusability
We can reuse Spark code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
e. Fault Tolerance in Spark
Apache Spark provides fault tolerance through its core abstraction, the RDD. Spark RDDs are designed to handle the failure of any worker node in the cluster, so data loss is kept to a minimum. Learn the different ways to create RDDs in Apache Spark.
f. Real-Time Stream Processing
Spark has a provision for real-time stream processing. Earlier, the problem with Hadoop MapReduce was that it could handle and process data that is already present, but not real-time data. With Spark Streaming we can solve this problem.
g. Lazy Evaluation in Apache Spark
All the transformations we make on Spark RDDs are lazy in nature; that is, they do not give a result right away, but instead a new RDD is formed from the existing one. This increases the efficiency of the system, as shown in the sketch below. Follow this guide to learn more about Spark lazy evaluation in great detail.
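A brief Scala sketch of this behaviour (assuming an existing SparkContext sc, as in the Spark shell; the data is invented for illustration):

val data = sc.parallelize(1 to 1000000)

// Transformations are only recorded in the lineage: this line returns immediately
val doubled = data.map(_ * 2)

// Only the action below forces Spark to actually execute the computation
println(doubled.count())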
h. Support Multiple Languages
In Spark, there is support for multiple languages like Java, R, Scala, and Python. Thus, it provides flexibility and overcomes the limitation of Hadoop that applications can be built only in Java.
Get the best Scala books to become an expert in the Scala programming language.
i. Active, Progressive and Expanding Spark
Community
Developers from over 50 companies were involved in the making of Apache Spark. The project was initiated in 2009 and is still expanding; about 250 developers have contributed to its growth. It is one of the most important projects of the Apache community.

j. Support for Sophisticated Analysis


Spark comes with dedicated tools for streaming data, interactive/declarative queries, and machine learning, which complement plain map and reduce.

k. Integrated with Hadoop


Spark can run independently and also on the Hadoop YARN cluster manager, so it can read existing Hadoop data. Thus, Spark is flexible.
l. Spark GraphX
Spark has GraphX, a component for graph and graph-parallel computation. It simplifies graph analytics tasks with its collection of graph algorithms and builders.
m. Cost Efficient
Apache Spark is a cost-effective solution for big data problems, whereas Hadoop requires a large amount of storage and a large data center for replication.

4. Conclusion
In conclusion, Apache Spark is one of the most advanced and popular products of the Apache community. It provides the ability to work with streaming data, offers various machine learning libraries, can work on structured and unstructured data, deals with graphs, and more.
