Spark Architecture
• Spark is an alternative to MapReduce
• 10x faster than MapReduce
• Storage system – HDFS, Amazon S3, local file system
• Any resource manager like YARN, Mesos, Kubernetes
[Figure: processing flow from Start to End; labels: RAM [Memory], Disk write, Disk read]
Spark
• Machine learning
• Data cleaning
• Streaming
• Hive support
RDD – Resilient Distributed Dataset
DataFrame
Dataset
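The three abstractions named above can be compared side by side. The following is a minimal sketch, assuming Spark SQL is on the classpath and a local master; the `Person` record, the sample data, and the `adultCounts` helper are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object ThreeAbstractions {
  // Runs the same "age >= 18" filter through an RDD, a DataFrame, and a Dataset.
  def adultCounts(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._

    val people = Seq(Person("Ann", 31), Person("Bob", 17))

    // RDD: a distributed collection of JVM objects, no schema attached.
    val rdd: RDD[Person] = spark.sparkContext.parallelize(people)
    // DataFrame: rows with named columns (name, age), queried by column.
    val df: DataFrame = rdd.toDF()
    // Dataset: the same columns, but typed as Person again.
    val ds: Dataset[Person] = df.as[Person]

    (
      rdd.filter(_.age >= 18).count(),
      df.filter($"age" >= 18).count(),
      ds.filter(_.age >= 18).count()
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("three-abstractions")
      .master("local[*]") // assumption: local mode, no cluster
      .getOrCreate()
    println(adultCounts(spark))
    spark.stop()
  }
}
```

All three filters select the same records; the difference is whether the filter is written against objects (RDD, Dataset) or against named columns (DataFrame).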
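val rdd1 = sc.textFile("abc.txt")              // Transformation 1
val rdd2 = rdd1.map(line => line.trim)         // Transformation 2 (example function)
val rdd3 = rdd2.filter(line => line.nonEmpty)  // Transformation 3 (example predicate)
rdd3.count()                                   // Action: triggers execution of the steps above
Operations
• Transformations
• Actions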
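The split between transformations and actions can be observed directly. This is a minimal sketch, assuming Spark is on the classpath and a local master; the `LazyPipeline` object, the `run` helper, and the in-memory sample lines (standing in for abc.txt) are assumptions for illustration. A Spark accumulator counts how often the `map` function actually runs.

```scala
import org.apache.spark.sql.SparkSession

object LazyPipeline {
  // Returns (map calls before the action, result of count, map calls after the action).
  def run(spark: SparkSession, lines: Seq[String]): (Long, Long, Long) = {
    val sc = spark.sparkContext
    val mapCalls = sc.longAccumulator("mapCalls")

    val rdd1 = sc.parallelize(lines)                                // stands in for sc.textFile
    val rdd2 = rdd1.map { line => mapCalls.add(1); line.trim }      // transformation: only recorded
    val rdd3 = rdd2.filter(_.nonEmpty)                              // transformation: only recorded

    val before = mapCalls.sum   // nothing has executed yet
    val total = rdd3.count()    // action: runs the whole chain
    val after = mapCalls.sum    // map has now run once per input line

    (before, total, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-pipeline")
      .master("local[*]") // assumption: local mode, no cluster
      .getOrCreate()
    println(run(spark, Seq("spark", "  ", "rdd", "")))
    spark.stop()
  }
}
```

The accumulator stays at zero until `count()` is called, which is the point of the slide: transformations describe the work, and only an action starts it. Note that Spark does not guarantee exactly-once accumulator updates inside transformations if tasks are retried, so this counting trick is for demonstration only.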
DAG (Directed Acyclic Graph)
• Directed – Each edge connects one node directly to another, which creates a sequence.
• Acyclic – There is no cycle or loop in the graph.
• Graph – A combination of vertices and edges, with all the connections in a sequence.
A DAG can be seen as a sequence of computations performed on data. In this graph, each edge refers to a
transformation on the data, while each vertex refers to an RDD partition.
This graph lets Spark avoid the multistage execution model of Hadoop MapReduce, which gives it better
performance than Hadoop.
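The three properties (directed, acyclic, graph) can be made concrete with a toy model. This is a sketch in plain Scala and not Spark's internal scheduler; the `Dag` class, the `executionOrder` method, and the vertex names are hypothetical. Vertices are datasets, each edge is a transformation, and a valid execution order exists only if the graph has no cycle.

```scala
import scala.annotation.tailrec

object LineageDag {
  // Each edge (from, to) is a transformation that turns dataset `from` into dataset `to`.
  final case class Dag(edges: List[(String, String)]) {
    val vertices: Set[String] = edges.flatMap { case (from, to) => List(from, to) }.toSet

    // Kahn's algorithm: Some(order) if the graph is acyclic, None if it contains a cycle.
    def executionOrder: Option[List[String]] = {
      @tailrec
      def loop(
          remaining: Set[String],
          left: List[(String, String)],
          acc: List[String]
      ): Option[List[String]] =
        if (remaining.isEmpty) Some(acc.reverse)
        else {
          // A vertex is ready when no remaining edge points into it.
          val ready = remaining.filter(v => !left.exists(_._2 == v))
          if (ready.isEmpty) None // every remaining vertex waits on another: a cycle
          else {
            val next = ready.min // deterministic choice among ready vertices
            loop(remaining - next, left.filterNot(_._1 == next), next :: acc)
          }
        }
      loop(vertices, edges, Nil)
    }
  }

  def main(args: Array[String]): Unit = {
    val lineage = Dag(List("rdd1" -> "rdd2", "rdd2" -> "rdd3"))
    println(lineage.executionOrder)
    println(Dag(List("a" -> "b", "b" -> "a")).executionOrder)
  }
}
```

In Spark itself, calling `toDebugString` on an RDD prints the lineage that Spark has recorded for it.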
Thank you!