Module 5: Apache Spark
Shortcomings of MapReduce
Learning objectives
• List the main bottlenecks of MapReduce
• Explain how Apache Spark solves them
Shortcomings of MapReduce
Other languages?
Interactivity?
Solution?
• New framework: the same features as MapReduce, and more
• Capable of reusing the Hadoop ecosystem, e.g. HDFS, YARN…
• Born at UC Berkeley
Solutions by Spark
• Other workflows (e.g. join, filter, map-reduce-map)? Flexibility
• Interactivity?
• Other languages?
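The workflows named above (filter, join, map-reduce-map) can be chained in a single program. The following is a minimal PySpark sketch using the RDD API, not a reference implementation. The input format (`user,amount` lines), the threshold, and all function names are assumptions made for illustration.

```python
"""Sketch: map -> filter -> reduce -> join -> map with the PySpark RDD API."""


def to_pair(line):
    """First map: parse a 'user,amount' line into a (user, amount) pair."""
    user, amount = line.split(",")
    return user.strip(), float(amount)


def is_large(pair, threshold=10.0):
    """Filter: keep only purchases at or above the (hypothetical) threshold."""
    return pair[1] >= threshold


def describe(record):
    """Second map: format one joined (user, (total, country)) record."""
    user, (total, country) = record
    return f"{user} ({country}): {total:.2f}"


def run(sc, purchase_lines, user_countries):
    """Chain the steps on an existing SparkContext `sc`."""
    purchases = sc.parallelize(purchase_lines)
    countries = sc.parallelize(user_countries)   # (user, country) pairs
    totals = (purchases.map(to_pair)             # map
                       .filter(is_large)         # filter
                       .reduceByKey(lambda a, b: a + b))  # reduce per user
    # join on the user key, then map again to a printable string
    return sorted(totals.join(countries).map(describe).collect())


def main():
    """Run the sketch locally; requires PySpark to be installed."""
    from pyspark import SparkContext
    sc = SparkContext("local[2]", "workflow-sketch")
    try:
        lines = ["ann,12.5", "bob,3.0", "ann,20.0", "bob,15.0"]
        users = [("ann", "PK"), ("bob", "US")]
        for row in run(sc, lines, users):
            print(row)
    finally:
        sc.stop()
```

Calling `main()` on a machine with PySpark installed runs the whole chain in local mode. Keeping the per-record functions separate from the Spark calls means they can be checked without a cluster.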
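Worker Node
[Figure: a single worker node, whose job is to run processes. It hosts a Spark Executor inside a Java Virtual Machine (JVM) together with Python processes, and reads data from HDFS.]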
Worker Nodes
[Figure: several worker nodes, each running an executor (JVM) with its Python processes.]
Worker Nodes
Cluster Manager
• YARN/Standalone
• Provision/Restart Workers
[Figure: the cluster manager next to the worker nodes, each running an executor (JVM) with its Python processes.]
Worker Nodes
• Where the actual processes run
Driver Program
• For every application, it instantiates an instance of the Spark Context
[Figure: the driver program next to the worker nodes, each running an executor (JVM) with its Python processes.]
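As a rough illustration of the driver program's role, the sketch below creates one Spark Context for one application and hands a small job to the executors. The application name, the `local[2]` master, and the helper names `build_settings` and `create_context` are assumptions for illustration. `spark.app.name` and `spark.master` are standard Spark configuration keys.

```python
"""Sketch: a driver program that instantiates one Spark Context per application."""


def build_settings(app_name, master):
    """Collect the two settings a driver needs before it can start."""
    return {"spark.app.name": app_name, "spark.master": master}


def create_context(settings):
    """Instantiate the Spark Context; requires PySpark to be installed."""
    from pyspark import SparkConf, SparkContext
    conf = SparkConf().setAll(list(settings.items()))
    return SparkContext(conf=conf)


def main():
    # Hypothetical application name; local[2] runs Spark on 2 local threads.
    sc = create_context(build_settings("driver-sketch", "local[2]"))
    try:
        # The driver only describes the job; executors do the actual work.
        numbers = sc.parallelize(range(100), numSlices=4)
        print(numbers.map(lambda x: x * x).sum())
    finally:
        sc.stop()  # release the executors when the application ends
```

Calling `main()` starts the application, runs one job, and stops the context again.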
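On Cloudera VM
• Cluster manager: Standalone
• One worker node
• Limited scheduling
[Figure: the driver program with its Spark Context, the Standalone cluster manager, and a single worker node running an executor (JVM) with Python.]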
On Amazon EMR
• EC2 nodes
• Cluster mode
• Cluster manager: YARN
[Figure: the master node, the driver program with its Spark Context, YARN as cluster manager, and worker nodes each running an executor (JVM) with Python processes.]
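One way to see the difference between the two setups is in how the same application would be submitted. The sketch below only builds the `spark-submit` command lines and does not run them. The script name `app.py` and the host `localhost` are hypothetical. Port 7077 is the default for a Standalone master, and `--master` and `--deploy-mode` are standard `spark-submit` options.

```python
"""Sketch: the same application submitted to Standalone versus YARN."""
import shlex


def spark_submit_args(script, master, deploy_mode=None):
    """Build the argument list for spark-submit without executing it."""
    args = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        args += ["--deploy-mode", deploy_mode]
    args.append(script)
    return args


# Single-machine setup with the Standalone cluster manager (hypothetical host).
STANDALONE = spark_submit_args("app.py", "spark://localhost:7077")

# Cluster managed by YARN, with the application submitted in cluster mode.
YARN_CLUSTER = spark_submit_args("app.py", "yarn", deploy_mode="cluster")

if __name__ == "__main__":
    print(shlex.join(STANDALONE))
    print(shlex.join(YARN_CLUSTER))
```

The application code itself stays the same in both cases; only the master and the deploy mode change.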