07_Apache Spark - An Introduction
7 Apache Spark
An Introduction
Compiled by
Dr. Muhammad Sajid Qureshi
Contents
❖ Apache Spark
▪ Introduction, features, ecosystem, and major benefits
▪ Components of Spark
• Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX
▪ Architecture of Spark
• RDDs, Jobs, Tasks, and Context
▪ Anatomy of a Spark Job Run
• Job submission, DAG creation, Task scheduling, Task Execution
▪ Executors and Cluster Managers
▪ Spark applications
• Spark can store large datasets in a distributed fashion and apply parallel processing to them.
• Being an in-memory data processing engine, it can process real-time data streams.
• Spark can run on YARN and works with Hadoop file formats and storage backends like HDFS.
▪ Data analysts use Spark to process, analyze, transform, and visualize data at very large scale.
• Spark provides a user-friendly interface for programming a cluster, with implicit data parallelism and fault tolerance.
▪ In 2014, Spark set a world record when Databricks used it to sort a 100 TB dataset in the Daytona GraySort contest.
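A minimal PySpark sketch of this workflow (application and variable names are illustrative, not from the slides): it distributes a collection across the cluster, processes it in parallel, and returns a result to the driver.

```python
from pyspark.sql import SparkSession

# Entry point to Spark; connects to whatever cluster is configured.
spark = SparkSession.builder.appName("IntroDemo").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection across the cluster as an RDD.
numbers = sc.parallelize(range(1_000_000))

# The squaring work runs in parallel on the RDD's partitions.
total = numbers.map(lambda x: x * x).sum()

print(total)
spark.stop()
```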
❖ In-memory computing
▪ Because the data resides in RAM, read/write operations and processing are very fast.
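A small sketch of the idea (dataset and names are illustrative): once an RDD is cached, later actions read its partitions from executor memory instead of recomputing them.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(1_000_000)).map(lambda x: x % 97)

data.cache()             # ask Spark to keep the partitions in executor memory
data.count()             # first action computes the RDD and fills the cache
data.distinct().count()  # this job reads the cached partitions from RAM

spark.stop()
```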
❖ Fault tolerance
▪ The RDD mechanism makes Spark a reliable data processing engine: if a node fails, the lost data is recovered by recomputing RDD partitions from their lineage.
Hadoop vs. Spark
o Speed: Hadoop's MapReduce framework is slower than Spark because it loads data from storage devices before processing it; Spark, being an in-memory processing engine, performs parallel data processing much faster.
o Authentication: Hadoop uses Kerberos, which is complicated to set up; Spark uses a shared secret for easier authentication, and can additionally run on YARN to use Kerberos.
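The shared secret mentioned above is enabled through Spark configuration; a minimal sketch, assuming a standalone-style setup where the secret is set manually (on YARN, Spark generates the secret automatically). The secret value is a placeholder.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.authenticate", "true")              # turn on shared-secret authentication
    .set("spark.authenticate.secret", "CHANGE_ME")  # placeholder; never commit a real secret
)

spark = SparkSession.builder.config(conf=conf).appName("AuthDemo").getOrCreate()
```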
▪ An RDD is a read-only collection of objects that is partitioned across multiple data nodes in a cluster.
• Through a series of transformations, input RDDs are turned into a set of target RDDs, on which an action is then performed.
▪ RDDs are resilient because Spark can automatically reconstruct a lost partition by recomputing it from the RDDs it was derived from.
• If the return type of an operation is an RDD, it's a transformation; otherwise, it's an action.
▪ An action triggers a computation on an RDD and does something with the results—either returning them to
the user, or saving them to external storage.
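The return-type rule is easy to check in code; a short sketch with illustrative data (the output path in the last line is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RddDemo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

evens = rdd.filter(lambda x: x % 2 == 0)  # returns an RDD -> transformation (lazy)
doubled = evens.map(lambda x: x * 2)      # returns an RDD -> transformation (lazy)

print(doubled.collect())  # returns a list to the driver -> action, triggers the computation

doubled.saveAsTextFile("out/doubled")  # an action that saves the results to external storage
```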
▪ Stages are split into tasks by the Spark runtime and are run in parallel on partitions of an RDD spread across
the cluster—just like tasks in MapReduce.
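The one-task-per-partition relationship can be observed directly; a small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionDemo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000), numSlices=8)  # explicitly request 8 partitions
print(rdd.getNumPartitions())                   # 8

rdd.count()  # the resulting stage runs 8 tasks, one per partition, in parallel
```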
▪ A job always runs in the context of an application (represented by a SparkContext instance) that serves to
group RDDs and shared variables.
▪ An application can run more than one job, in series or in parallel, and provides the mechanism for a job to
access an RDD that was cached by a previous job in the same application.
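A sketch of two jobs in one application: both actions below run in the same SparkContext, and the second job reuses the RDD cached by the first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TwoJobsDemo").getOrCreate()
sc = spark.sparkContext  # one SparkContext = one application

words = sc.parallelize(["spark", "hadoop", "spark", "yarn"]).cache()

print(words.count())             # job 1: computes the RDD and populates the cache
print(words.distinct().count())  # job 2: same application, reuses the cached RDD

spark.stop()
```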
▪ Job submission
▪ DAG creation
▪ Task scheduling
▪ Task execution
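The DAG built for a job can be inspected before the job runs; a small sketch using the RDD's toDebugString(), whose indentation marks the stage boundary introduced by the shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DagDemo").getOrCreate()
sc = spark.sparkContext

pairs = (
    sc.parallelize(range(100))
      .map(lambda x: (x % 10, 1))
      .reduceByKey(lambda a, b: a + b)  # the shuffle here splits the job into two stages
)

print(pairs.toDebugString().decode())  # lineage of the RDD; indentation shows the stages

pairs.collect()  # the action submits the job: DAG -> stages -> tasks -> execution
```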
❖ Spark uses executors to run the tasks that make up a job.
▪ First, the executor makes sure that the task's dependencies are up to date; it keeps a local cache of all the dependencies that previous tasks have used.
▪ Second, it deserializes the task code from the serialized bytes that were sent as part of the
launch-task message.
▪ Third, it executes the task code; the task runs in the same JVM as the executor.
• Tasks can return a result to the driver. The result is serialized and sent to the executor
backend, and then back to the driver as a status update message.
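This serialization is visible to the programmer: the task's function, together with any driver-side variables it captures, is serialized and shipped to the executors, and the action's result comes back the same way. A sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ClosureDemo").getOrCreate()
sc = spark.sparkContext

factor = 3  # a driver-side variable captured by the closure below

# The lambda (and 'factor' with it) is serialized and sent to the executors.
scaled = sc.parallelize([1, 2, 3]).map(lambda x: x * factor)

# take() is an action: executors run the task code, serialize the results,
# and the driver receives them via the status update messages.
print(scaled.take(3))  # [3, 6, 9]
```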
▪ Spark requires a cluster manager to manage the lifecycle of executors that run the jobs.
• Local mode
✓ In local mode, there is a single executor running in the same JVM as the driver.
• Standalone
✓ A simple distributed implementation that runs a single master and multiple worker nodes.
• Apache Mesos
✓ A general-purpose cluster resource manager that can share a cluster among many different applications.
• Hadoop YARN
✓ When YARN is used as a cluster manager for Spark, each Spark application corresponds to an
instance of a YARN application, and each executor runs in its own YARN container.
✓ The Mesos and YARN cluster managers are superior to the standalone manager as they can
manage resources of other applications running on the cluster. They also enforce a scheduling
policy across all of them.
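The cluster manager is chosen through the master URL passed when the application starts; a minimal sketch (hostnames are placeholders, and in practice the URL is usually supplied via spark-submit --master rather than hard-coded):

```python
from pyspark.sql import SparkSession

# Local mode: driver and executor share a single JVM.
spark = SparkSession.builder.master("local[*]").appName("ManagerDemo").getOrCreate()

# Standalone mode would use a URL such as "spark://master-host:7077",
# and YARN the literal string "yarn" (with HADOOP_CONF_DIR pointing at
# the Hadoop cluster configuration).
```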
▪ Deploy modes (on YARN)
• Client mode
✓ In client mode, the driver program runs inside the client process.
✓ The client mode is required for programs having an interactive component, like Spark-shell or
PySpark.
• Cluster mode
✓ In this mode, the driver program runs on the cluster in the YARN Application Master.
✓ YARN cluster mode is appropriate for production jobs, since the entire application runs on the cluster and its logs are collected there.
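Deploy mode is fixed at submission time (spark-submit's --deploy-mode flag); from inside the application it can only be inspected. A small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ModeDemo").getOrCreate()

# "client": the driver runs in the submitting process (spark-shell, pyspark).
# "cluster": the driver runs on the cluster, inside the YARN Application Master.
mode = spark.sparkContext.getConf().get("spark.submit.deployMode", "client")
print(f"Running in {mode} deploy mode")
```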
❖ Suggested videos
• https://www.youtube.com/watch?v=QaoJNXW6SQo&t=3s
• https://www.youtube.com/watch?v=znBa13Earms
• https://www.youtube.com/watch?v=jDkLiqlyQaY