Spark
• Need for a powerful engine that can process data in real time
(streaming) as well as in batch mode
• Need for a powerful engine that can respond with sub-second latency
and perform in-memory analytics
• Need for a powerful engine that can handle diverse workloads:
– Batch
– Streaming
– Interactive
– Graph
– Machine Learning
[Figure: Spark, 206 nodes, 23 min. Src: Databricks]
Certified Apache Spark and Scala Training – DataFlair
Goals
Easy to combine batch, streaming, and interactive
computations
[Figure: One stack to rule them all: Batch, Interactive, Streaming]
[Figure: chains of operations (Operation 1, Operation 2, …, Operation n) shown together with Disk]
[Figure: Master and Worker; an RDD holding objects Obj1, Obj2, Obj3, …, Obj n]
[Figure: RDD divided into Partition1, Partition2, Partition3, Partition4, Partition5, Partition6]
[Figure: Create RDD. Blocks B1 to B12 of Employee-data.txt in the Hadoop Cluster; the RDD has partitions Partition-1 to Partition-5, ...]
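The idea of an RDD as one dataset split into partitions can be sketched in plain Python. This is a toy model and not Spark itself: `make_partitions` is a hypothetical helper, the labels B1 to B12 simply stand in for the blocks of Employee-data.txt, and cutting them into five contiguous slices is an assumption made for illustration.

```python
def make_partitions(records, num_partitions):
    """Split records into num_partitions contiguous, near-equal slices."""
    n = len(records)
    return [
        records[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# Hypothetical stand-ins for the twelve blocks of the input file.
blocks = [f"B{i}" for i in range(1, 13)]

# Five partitions, matching the number of partitions named on the slide.
partitions = make_partitions(blocks, 5)

for index, partition in enumerate(partitions, start=1):
    print(f"Partition-{index}: {partition}")
```

Each partition can then be processed independently, which is what lets the work be spread over the workers of a cluster.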
RDD Operations
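[Figure: RDD Operations. Primary Storage, RDD, Cache, Lineage, Actions, Result]
• An RDD is created from Primary Storage and can be kept in Cache
• Lineage connects one RDD to the RDD derived from it
• Actions (saveAsTextFile(), count()…) return output to the Driver or
export data to a storage system after computation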
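How lineage, cache, and actions fit together can be sketched with a small single-machine model. `ToyRDD` is a hypothetical class written for this illustration and is not how Spark is implemented; its method names only mirror the RDD methods `map`, `filter`, `cache`, `count`, and `collect`. The point it shows is that transformations are only recorded, and nothing runs until an action asks for a result.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not the Spark implementation)."""

    def __init__(self, source, lineage=()):
        self._source = source            # original input records
        self._lineage = tuple(lineage)   # recorded transformations, not yet run
        self._use_cache = False
        self._cached = None              # filled by the first action after cache()

    def map(self, fn):
        # Transformation: record the step and return a new ToyRDD.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step and return a new ToyRDD.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def cache(self):
        # Ask for the computed data to be kept after the first action.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        data = list(self._source)
        for kind, fn in self._lineage:   # replay the lineage step by step
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        # Action: triggers the computation and returns a number.
        return len(self._compute())

    def collect(self):
        # Action: triggers the computation and returns the records.
        return list(self._compute())


calls = []


def double(x):
    calls.append(x)   # track how often the function really runs
    return x * 2


rdd = ToyRDD(range(1, 6)).map(double).filter(lambda x: x > 4).cache()
calls_before_action = len(calls)   # transformations alone run nothing
total = rdd.count()                # first action: replays the lineage
result = rdd.collect()             # second action: served from the cache
calls_after_actions = len(calls)
print(calls_before_action, total, result, calls_after_actions)
```

Because the lineage is kept, the same result could be rebuilt from the source if the cached data were lost, which is the idea behind recovering automatically from failures.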
• Automatic Memory Management
• Diverse processing platform
• Fault Tolerance: recovers automatically
• Window Criteria: time-based window criteria
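A time-based window criterion can be illustrated with a few lines of plain Python. This is a sketch under assumptions and not the Spark Streaming API: `count_per_window` is a hypothetical helper, the events are made-up (timestamp in seconds, payload) pairs, and the windows are fixed and non-overlapping with a length of 10 seconds.

```python
from collections import Counter


def count_per_window(events, window_seconds):
    """Count events per fixed, non-overlapping time window.

    Each window is identified by its start time in seconds.
    """
    counts = Counter()
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical events: (timestamp in seconds, payload).
events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (14, "e"), (27, "f")]

window_counts = count_per_window(events, 10)
print(window_counts)
```

The grouping key depends only on the event time, so the same records always fall into the same window regardless of the order in which they arrive.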