Architecture and Components of Spark
▸ the API libraries form the top layer of the Spark stack; all of them can be used together in a single Spark application
▸ the bottom layer is the cluster managers that Spark works with for resource management
▸ Spark Core: distributes workloads, monitors applications, schedules tasks, manages memory, handles fault recovery, interacts with HDFS, and houses the APIs that define RDDs
▸ Spark API Libraries: built on top of Spark Core; they inherit all Spark Core features like fault tolerance...
▹ Spark SQL: structured data processing
▹ Spark Streaming: processing of live data stream
▹ Spark MLlib: common machine learning functionality
▹ GraphX: library for manipulating graphs and performing graph parallel computations
▹ SparkR: provides lightweight frontend to use Apache Spark from R
▸ Spark Cluster Manager: for resource allocation - spark can connect to pluggable resource managers
▹ Standalone: simple cluster manager included within Spark and makes it easy to setup a cluster
▹ Apache Mesos: general cluster manager that can run Hadoop MapReduce and service applications
▹ Hadoop YARN: cluster manager of Hadoop 2
▸ Spark Runtime Architecture
▹ master-slave architecture; master=driver, slave=executor
▹ drivers and executors run in their own Java processes
▹ a Spark application is launched using the cluster manager
▸ SparkContext
▹ main entry point to everything Spark
▹ defined in the driver program
▹ tells Spark how and where to access cluster
▹ connects to cluster manager
▹ coordinates Spark processes running on different nodes
▹ used to create RDDs and shared variables on a cluster
▸ Driver
▹ this is where the main() method of the user program runs
▹ converts the user program into tasks (the smallest unit of work); tasks are bundled into "stages"
▹ it schedules tasks on executors
▸ Executors
▹ run for the entire lifetime of the application
▹ register themselves with the driver, which allows the driver to schedule tasks on them
▹ worker processes that run individual tasks and return results to the driver
▹ provide in-memory storage for RDDs as well as disk storage
▸ Workflow:
User app → driver program contacts the cluster manager for resources → cluster manager launches executors → driver splits the program into tasks and sends them to executors → executors run the tasks and return results to the driver → executors are terminated and resources are released (see the sketch below)
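A minimal sketch of this driver workflow in Scala (assuming Spark's Scala API and a local master; the application name "CountApp" and the sample data are illustrative only):

    import org.apache.spark.{SparkConf, SparkContext}

    object CountApp {
      def main(args: Array[String]): Unit = {
        // the driver defines the SparkContext, which connects to the cluster manager
        val conf = new SparkConf().setAppName("CountApp").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // create an RDD; the driver converts the operations below into tasks
        val rdd = sc.parallelize(Seq(4, 5, 10, 10))

        // the action triggers execution: tasks run on executors, results come back to the driver
        val total = rdd.map(x => x + 1).reduce((a, b) => a + b)
        println(total) // 33

        // stopping the context terminates executors and releases cluster resources
        sc.stop()
      }
    }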
RDD Operations
▸ Spark offers >80 high-level operations beyond MapReduce:
▹ ex. transformations: map, filter, distinct, flatMap
▹ ex. actions: collect, count, take, reduce
▸ map: apply function to each element in RDD and return new RDD
RDD {4,5,10,10}, using Scala syntax:
rdd.map(x => x+1)
Result: {5,6,11,11}
▸ filter
rdd.filter(x => x < 10)
Result: {4,5}
▸ distinct
rdd.distinct()
Result: {4,5,10}
▸ flatMap: similar to map but returns a sequence rather than a single element; apply the function, then flatten the result
rdd.flatMap(x => List(x-1, x+1))
Result: {3,5,4,6,9,11,9,11}
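The four operations above as a runnable sketch, assuming a spark-shell session where sc (the SparkContext) is already defined; collect() is the action that brings the results back to the driver:

    val rdd = sc.parallelize(Seq(4, 5, 10, 10))

    rdd.map(x => x + 1).collect()                   // Array(5, 6, 11, 11)
    rdd.filter(x => x < 10).collect()               // Array(4, 5)
    rdd.distinct().collect()                        // Array(4, 5, 10), order may vary
    rdd.flatMap(x => List(x - 1, x + 1)).collect()  // Array(3, 5, 4, 6, 9, 11, 9, 11)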
Quiz
1) Which of the following are Spark API libraries?
Correct answer
Spark SQL, Spark MLlib
Explanation
Although ETL and deep learning can be performed using Spark, there are no libraries with these names. ETL can be performed using Spark SQL and deep learning can be performed using Spark MLlib.

2) Spark Core is the base engine of Spark. Choose the correct functions of Spark Core:
Correct answer
Task scheduler and memory management, fault recovery

7) What is a task?
Correct answer
The smallest unit of work sent to one executor
Explanation
A task is the smallest unit of work sent to one executor; tasks are bundled into "stages". The driver splits the user program into tasks and stages.

8) The driver returns results to executors.
Correct answer
False
Explanation
The driver schedules tasks on executors; results from these tasks are delivered back to the driver.
Quiz
1) reduceByKey is preferred to groupByKey.
Correct answer
True
Explanation
groupByKey causes shuffling of large amounts of data and is hence not preferred. reduceByKey, on the other hand, reduces by key first and then shuffles data to worker nodes where further reducing happens (see the sketch after this quiz).

2) Actions are lazily evaluated.
Correct answer
False
Explanation
Actions are evaluated immediately and are responsible for getting results from Spark data operations.

3) Which of the following are operators in Spark?
Correct answer
map, reduce
Explanation
"print" is not an operator, meaning it is neither a transformation nor an action. "print" is offered by the native Java, Python or Scala API.

8) Operations on RDDs are grouped into transformations, collections and actions.
Correct answer
False
Explanation
Operations on RDDs are either transformations or actions.

9) RDDs can be created by which of the following approaches?
Correct answer
By using a Spark API like textFile(), by transforming another RDD
Explanation
RDDs can be created by loading from external storage like a file system; textFile() on SparkContext converts the contents of a file into an RDD. Transformation operations on an RDD also result in an RDD, but action operations on an RDD do not yield an RDD.

10) If an RDD wordsRDD contains {'pencil', 'paper', 'computer', 'mouse'}, what is the result of wordsRDD.map(lambda x : len(x)).reduce(lambda x,y: x+y)? Hint: len() is a function that returns the length of a string.
Correct answer
The value 24
Explanation
The map function returns an RDD containing the length of each string, which is {6, 5, 8, 5}. The reduce function is chained to the map function and adds all the lengths to return the value 24.

11) Which of the following statements are true about RDDs?
Correct answer
All of the above
Explanation
The RDD is the primary data API allowing data to be processed in Spark. RDDs are distributed collections of elements that can be reconstructed on failure and are hence fault tolerant. RDDs are immutable, as transformations on RDDs result in new RDDs with the original RDD staying untouched.

12) Transformations on RDDs result in
Correct answer
a new RDD, an update of the DAG
Explanation
When a Spark operator takes a function as a parameter, operations are invoked on RDDs by executing the function on each element of the RDD.
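A minimal sketch of the reduceByKey vs groupByKey point from question 1 above, assuming a spark-shell session with sc defined; the pair data here is made up for illustration:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

    // reduceByKey combines values per key within each partition before shuffling,
    // so less data moves across the network
    pairs.reduceByKey((a, b) => a + b).collect()        // Array((a,3), (b,1)), order may vary

    // groupByKey shuffles every (key, value) pair first and only then groups and sums
    pairs.groupByKey().mapValues(v => v.sum).collect()  // Array((a,3), (b,1)), order may vary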