Module 5: Apache Spark

Apache Spark solves the main bottlenecks of MapReduce by providing more flexible workflows beyond map and reduce steps, faster computation through in-memory caching of data, and support for multiple programming languages including Python and Scala for interactive use. Spark's architecture includes a driver program that launches parallel operations on executor processes across worker nodes managed by a cluster manager like YARN.



Shortcomings of MapReduce

Learning objectives
• List the main bottlenecks of MapReduce
• Explain how Apache Spark solves them

Shortcomings of MapReduce
• Forces your pipeline into Map and Reduce steps; other workflows (e.g. join, filter, map-reduce-map) are awkward to express
• Reads from disk for each MapReduce job, which is costly for iterative algorithms such as machine learning
• Only a native Java programming interface: other languages? Interactivity?
Solution?
• A new framework: the same features as MapReduce, and more
• Capable of reusing the Hadoop ecosystem, e.g. HDFS, YARN…
• Born at UC Berkeley
Solutions by Spark
• Other workflows? e.g. join, filter, map-reduce-map
• Flexibility: ~20 highly efficient distributed operations, usable in any combination
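To make the "any combination of operations" point concrete, here is a minimal pure-Python sketch of the semantics of three such operations (map, filter, and join on key-value pairs). The names `rdd_map`, `rdd_filter`, and `rdd_join` are illustrative stand-ins, not Spark's API; in real Spark the same operations run distributed over RDDs.

```python
def rdd_map(data, f):
    # like rdd.map(f): apply f to every element
    return [f(x) for x in data]

def rdd_filter(data, pred):
    # like rdd.filter(pred): keep elements where pred is True
    return [x for x in data if pred(x)]

def rdd_join(left, right):
    # like a pair-RDD inner join: match (key, value) pairs on the key
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

ages = [("alice", 30), ("bob", 25), ("carol", 41)]
cities = [("alice", "NY"), ("carol", "LA")]

# Chain filter -> join -> map, a workflow that is hard to express
# as a single Map and Reduce pass:
adults = rdd_filter(ages, lambda kv: kv[1] >= 30)
joined = rdd_join(adults, cities)
names = rdd_map(joined, lambda kv: kv[0])
print(joined)  # [('alice', (30, 'NY')), ('carol', (41, 'LA'))]
print(names)   # ['alice', 'carol']
```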
Solutions by Spark
• Iterative algorithms? e.g. machine learning
• Fast computation: in-memory caching of data, specified by the user
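The following pure-Python sketch shows why user-specified caching matters for iterative algorithms: each MapReduce job re-reads its input from disk, while Spark can keep a dataset in memory (via `rdd.cache()`) so later iterations skip the read. The disk-read counter here is only a simulation of HDFS access.

```python
disk_reads = {"count": 0}

def read_from_disk():
    disk_reads["count"] += 1        # count simulated HDFS reads
    return [1.0, 2.0, 3.0, 4.0]

# MapReduce-style: every iteration re-reads the data from disk
disk_reads["count"] = 0
for _ in range(3):
    data = read_from_disk()
    step = sum(data) / len(data)    # stand-in for one training step
reads_without_cache = disk_reads["count"]

# Spark-style: read once, cache, then iterate in memory
disk_reads["count"] = 0
cached = read_from_disk()           # analogous to caching the RDD on first use
for _ in range(3):
    step = sum(cached) / len(cached)
reads_with_cache = disk_reads["count"]

print(reads_without_cache, reads_with_cache)  # 3 1
```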


Solutions by Spark
• Interactivity? Other languages?
• Native Python and Scala (and R) interfaces, with interactive shells
100 TB sorting competition: in 2014 Spark won the Daytona GraySort benchmark, sorting 100 TB about three times faster than the previous Hadoop MapReduce record while using roughly one tenth of the machines.
Architecture of Spark
• Driver Program: runs your application and holds a SparkContext; each application instantiates its own SparkContext
• Cluster Manager (YARN or Standalone): grants resources, i.e. decides where processes run, and provisions/restarts workers
• Worker Nodes: each worker node runs one or more Executors; an Executor is a JVM process that can host multiple Python processes, reading data from HDFS

[Diagram: the Driver Program's SparkContext connects through the Cluster Manager to Executors (JVM plus Python processes) on the Worker Nodes, with HDFS as storage.]

Deployment examples
• Local mode, e.g. on the Cloudera VM: driver and a single executor on one worker node, with limited scheduling
• Standalone cluster mode, e.g. on Amazon EMR EC2 nodes: driver on the master node, executors on the worker nodes
• YARN cluster mode: YARN schedules executors across the worker nodes
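The architecture above can be exercised from a short driver program. This is a sketch of the standard PySpark entry point (it assumes the `pyspark` package is installed); the application name and the data are made up for illustration.

```python
# The driver creates a SparkContext, which contacts the cluster manager
# to obtain executors on the worker nodes.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("module5-demo")      # hypothetical application name
        .setMaster("local[*]"))          # local mode, e.g. on the Cloudera VM;
                                         # use "yarn" on a cluster such as Amazon EMR

sc = SparkContext(conf=conf)
rdd = sc.parallelize(range(100))
print(rdd.filter(lambda x: x % 2 == 0).count())  # counts the even numbers
sc.stop()
```

Changing only the master URL moves the same program from local mode to a YARN or Standalone cluster; the driver code itself does not change.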
