Spark Architecture


Apache Spark

• Apache Spark is a general-purpose, in-memory compute engine.
• Hadoop provides storage, computation, and resource management through HDFS, MapReduce, and YARN.
• Spark is a plug-and-play compute engine: it replaces only the computation layer.
• Spark computes data in memory, whereas MapReduce works on disk (see the caching sketch below).
• Spark has lower latency because it performs far fewer disk read and write operations.
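The in-memory point is easiest to see with caching. A minimal sketch, assuming sc is the SparkContext and the file path and filter condition are only placeholders:

val logs   = sc.textFile("abc.txt")                              // placeholder input file
val errors = logs.filter(line => line.contains("ERROR")).cache() // keep the filtered RDD in memory

errors.count()   // first action: reads from disk and fills the in-memory cache
errors.count()   // second action: served from memory, no disk read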

Spark is an alternative to MapReduce:
• Storage system: any, e.g. HDFS, Amazon S3, or the local file system (illustrated below)
• Resource manager: any, e.g. YARN, Mesos, or Kubernetes
• Performance: up to 10x faster than MapReduce
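A sketch of the plug-and-play idea: the same textFile call can read from different storage systems. The paths below are hypothetical examples, and S3 access additionally needs the S3A connector configured:

val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/abc.txt") // HDFS (hypothetical path)
val fromS3    = sc.textFile("s3a://my-bucket/data/abc.txt")      // Amazon S3 via the S3A connector
val fromLocal = sc.textFile("file:///tmp/abc.txt")               // local file system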

Spark is an open-source distributed computing engine used for processing and analyzing large amounts of data. Like Hadoop MapReduce, it distributes data across the cluster and processes it in parallel.
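As an illustration of this parallel processing, here is the classic word count written with the RDD API (a sketch; the input path is a placeholder):

val words  = sc.textFile("abc.txt").flatMap(line => line.split(" "))  // split each line into words
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)          // count occurrences in parallel
counts.take(10).foreach(println)                                      // action: triggers the computation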
Spark processing: start → disk read → compute in RAM (memory) → disk write → end
Spark use cases and features:
• Machine learning
• Data cleaning
• Streaming
• Hive support

Spark data abstractions: RDD (Resilient Distributed Dataset), DataFrame, and Dataset
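A sketch of the three abstractions over the same data, assuming a spark-shell session where a SparkSession is available as spark:

import spark.implicits._
case class Person(name: String, age: Int)

val rdd = spark.sparkContext.parallelize(Seq(Person("Ana", 30)))  // RDD: low-level collection of objects
val df  = rdd.toDF()                                              // DataFrame: untyped rows with a schema
val ds  = df.as[Person]                                           // Dataset: typed rows with a schema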
val rdd1 = sc.textFile("abc.txt")                 // Transformation 1
val rdd2 = rdd1.map(line => line.toUpperCase)     // Transformation 2 (placeholder map function)
val rdd3 = rdd2.filter(line => line.nonEmpty)     // Transformation 3 (placeholder filter predicate)
rdd3.count()                                      // Action

Operations on an RDD are of two types: transformations and actions.

Note: transformations are lazy; they are executed only when an action is called. A DAG is generated when Spark evaluates the statements, e.g. for the chain above:

textFile → map → filter
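The recorded lineage (and hence the DAG) can be inspected with toDebugString, continuing the sketch above:

println(rdd3.toDebugString)   // prints the textFile → map → filter lineage that forms the DAG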


Resilient Distributed Dataset (RDD)
RDDs are groups of data items that can be stored in memory on the worker nodes.
• Resilient: the data can be restored on failure.
• Distributed: the data is distributed among different nodes.
• Dataset: a group of data.

Driver Program
The Driver Program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark application, which runs as an independent set of processes on a cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks:
• It acquires executors on nodes in the cluster.
• It sends the application code to the executors; the application code is defined by the JAR or Python files passed to the SparkContext.
• Finally, the SparkContext sends tasks to the executors to run.
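A minimal sketch of a driver program creating its SparkContext and naming the cluster manager it should connect to (the app name and master URL are placeholders; in spark-shell a SparkContext is already provided as sc):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")   // placeholder application name
  .setMaster("yarn")     // cluster manager: "yarn", "local[*]", or a Mesos/Kubernetes master URL
val sc = new SparkContext(conf)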
SparkContext
SparkContext is the main entry point to Spark Core. It gives access to the rest of Spark's functionality, establishes the connection to the Spark execution environment, and provides access to the Spark cluster through a resource manager. The SparkContext acts as the master of the Spark application.

Worker Node
• A worker node is a slave node.
• Its role is to run the application code in the cluster.

Executor
• An executor is a process launched for an application on a worker node.
• It runs tasks and keeps data in memory or on disk across them.
• It reads and writes data to external sources.
• Every application has its own executors.
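Executor resources are typically requested through configuration when the application starts. A sketch using standard Spark configuration keys (the values are placeholders):

val conf = new SparkConf()
  .setAppName("ExecutorDemo")
  .set("spark.executor.instances", "4")   // number of executors to acquire
  .set("spark.executor.memory", "2g")     // memory per executor
  .set("spark.executor.cores", "2")       // CPU cores per executor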

• Directed: the graph is directly connected from one node to another, which creates a sequence.
• Acyclic: there is no cycle or loop in the graph.
• Graph: a combination of vertices and edges, with all the connections in a sequence.
We can call the DAG a sequence of computations performed on data: an edge refers to a transformation applied on top of the data, while a vertex refers to an RDD partition.
This eliminates the multi-stage execution model of Hadoop MapReduce and gives more efficient performance than Hadoop.
Thank you !!
