
In the name of ALLAH, the Beneficent, the Merciful

7 Apache Spark
An Introduction

Compiled by
Dr. Muhammad Sajid Qureshi
Contents*

❖ Apache Spark
▪ Introduction, features, ecosystem, and major benefits
▪ Components of Spark
• Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX
▪ Architecture of Spark
• RDDs, Jobs, Tasks, and Context
▪ Anatomy of a Spark Job Run
• Job submission, DAG creation, Task scheduling, Task Execution
▪ Executors and Cluster Managers
▪ Spark applications

* Most of the contents are extracted from:


+ “Hadoop: The Definitive Guide” (Chapter 19) by Tom White, O’Reilly Media Inc., 4th edition.



What is Spark?
❖ Spark – an alternative to MapReduce
▪ Spark is an efficient, open-source in-memory cluster computing framework.

• It can store large datasets in a distributed fashion and process them in parallel.

• Being an in-memory data processing engine, it can also process real-time data streams.

• Spark can run on YARN and works with Hadoop file formats and storage backends like HDFS.

▪ Data analysts use Spark to process, analyze, transform, and visualize data at very large scale.

• It efficiently performs iterative and interactive operations on data.

• Spark provides a user-friendly interface for programming a cluster with implicit data parallelism and fault tolerance.



What is Spark?
❖ Spark – an alternative to MapReduce
▪ Spark supports multiple programming languages.

• It provides APIs in Scala, Python, R, and Java.

▪ Spark started as a project at the AMPLab of the University of California, Berkeley, in 2009.

▪ In 2014, Spark set a world record when Databricks used it to sort a large dataset efficiently.



Spark’s features



Spark’s features
❖ Fast data processing
▪ Spark uses the Resilient Distributed Dataset (RDD) for quick and reliable data processing.

❖ In-memory computing
▪ Because data resides in RAM, read/write operations and processing are very fast.

❖ Support for multiple programming languages


▪ Scala, Python, R, and Java

❖ Fault tolerance
▪ The RDD mechanism makes Spark a reliable data processing engine, as RDDs can recover lost data when a node fails.

❖ Rich libraries for data processing


▪ Spark offers rich libraries to process, analyze, transform, and visualize data.



Spark’s features



Hadoop Versus Spark

Hadoop

o The MapReduce framework is slower than Spark because it loads data from storage devices before processing it.
o MapReduce is designed for batch processing of large datasets.
o Data nodes store intermediate results in their local storage, which slows down compilation of the final results.
o Hadoop uses Kerberos for authentication, which is complicated.
o Writing a program for the Hadoop framework (usually in Java) requires more effort.

Spark

o Being an in-memory processing engine, Spark can do parallel data processing much faster than MapReduce.
o Spark can do batch processing as well as process real-time (streaming) data.
o Spark uses a shared secret for easier authentication; additionally, it can run on YARN to use Kerberos.
o Spark supports Scala, which simplifies programming for it.



Components of Spark
❖ Spark has the following major components:
▪ Spark Core
• It manages RDDs in Spark, enabling efficient and reliable data processing.
• It is responsible for memory management, job scheduling, and fault tolerance.
• It also coordinates with storage systems like HDFS, HBase, and DBMSs.
▪ Spark SQL
• This component allows fast processing of structured and semi-structured data.
▪ Spark Streaming
• It offers a lightweight API to process real-time data (data streams).
▪ Spark MLlib
• A simple, scalable library containing various machine learning algorithms for big data analytics.
▪ Spark GraphX
• Specially designed for storage and processing of graph data (as used by LinkedIn, DBpedia, Meta, etc.).



Spark Ecosystem



Spark Components – SQL

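To make the component concrete, here is a minimal Spark SQL sketch in Scala (not from the original slides); the JSON file path and column names are assumptions for illustration.

import org.apache.spark.sql.SparkSession

// Minimal Spark SQL sketch: load semi-structured JSON into a DataFrame
// and query it with SQL. The file path and column names are hypothetical.
object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

    val people = spark.read.json("people.json")        // infer the schema from JSON
    people.createOrReplaceTempView("people")            // expose the DataFrame to SQL
    spark.sql("SELECT name, age FROM people WHERE age >= 30").show()

    spark.stop()
  }
}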


Spark Components – Streaming

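As a rough illustration (not part of the original slides), the following Scala sketch uses the DStream API to count words arriving on a TCP socket; the host and port are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: word counts over 5-second micro-batches read from a text socket.
// The host/port ("localhost", 9999) are placeholders.
object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}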


Spark Components – MLlib

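A minimal MLlib sketch, assuming the DataFrame-based spark.ml API; the data points below are made up purely for illustration.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Sketch: k-means clustering of a tiny in-memory dataset (the points are invented).
object MLlibDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("mllib-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ).map(Tuple1.apply).toDF("features")

    val model = new KMeans().setK(2).setSeed(1L).fit(data)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}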


Spark Components – GraphX

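A small GraphX sketch (not from the slides) on a made-up follower graph, just to show how vertex and edge RDDs are combined into a graph.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Sketch: build a tiny directed graph and count followers (in-degree) per user.
// The vertices and edges are invented for illustration.
object GraphXDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("graphx-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))
    val graph   = Graph(users, follows)

    graph.inDegrees.collect().foreach { case (id, n) => println(s"user $id has $n followers") }

    spark.stop()
  }
}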


The RDD in Spark
❖ Role of the Resilient Distributed Dataset (RDD) in Spark
▪ Spark uses RDDs for quick and reliable distributed in-memory data processing.

▪ An RDD is a read-only collection of objects that is partitioned across multiple nodes in a cluster.

• In a Spark program, initially one or more RDDs are loaded as input.

• Then, through a series of transformations, they are turned into a set of target RDDs, which have an
action performed on them.

▪ RDDs are resilient because Spark can automatically reconstruct a lost partition by recomputing it from the
RDDs that it was computed from.

▪ An RDD can be created in three ways (a brief sketch follows this list):

• From an in-memory collection of objects (known as parallelizing a collection)

• Using a dataset from external storage (such as HDFS)

• Transforming an existing RDD
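A brief Scala sketch of the three creation paths listed above; the HDFS path is a placeholder.

import org.apache.spark.sql.SparkSession

// Sketch: the three ways of creating an RDD. The HDFS path is a placeholder.
object RddCreationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // 1. Parallelizing an in-memory collection of objects
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. Using a dataset from external storage such as HDFS
    val lines = sc.textFile("hdfs:///data/input.txt")

    // 3. Transforming an existing RDD
    val squares = nums.map(n => n * n)

    println(squares.collect().mkString(", "))
    spark.stop()
  }
}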



Operations on RDDs



Transformation and Actions on RDDs
❖ Spark provides two categories of operations on RDDs: transformations and actions.
▪ A transformation generates a new RDD from an existing one.

• If the return type of an operation is an RDD, then it is a transformation; otherwise, it is an action.

▪ An action triggers a computation on an RDD and does something with the results—either returning them to
the user, or saving them to external storage.

• Actions have an immediate effect, but transformations do not—they are lazy.

▪ Spark’s library contains a rich set of operators, including the following transformations (see the sketch after this list):

• Mapping

• Grouping, aggregating, and repartitioning

• Sampling and joining RDDs

• Treating RDDs as sets
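The Scala sketch below (with invented sample data) illustrates the distinction: map and reduceByKey are lazy transformations, while collect and saveAsTextFile are actions that trigger the computation.

import org.apache.spark.sql.SparkSession

// Sketch: lazy transformations versus actions. Nothing runs on the cluster until
// an action (collect / saveAsTextFile) is called. The output path is a placeholder.
object TransformationsVsActions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ops-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("spark", "hadoop", "spark", "rdd"))

    // Transformations: return new RDDs and are evaluated lazily
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // Actions: trigger the computation and return or store the results
    counts.collect().foreach(println)
    counts.saveAsTextFile("/tmp/word-counts")   // placeholder output path

    spark.stop()
  }
}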



Operations on RDDs



Spark Applications, Jobs, Stages, and Tasks
❖ The Job in Spark
▪ A Spark job is made up of an arbitrary directed acyclic graph (DAG) of stages, each of which is
roughly equivalent to a map or reduce phase in MapReduce.

▪ Stages are split into tasks by the Spark runtime and are run in parallel on partitions of an RDD spread across
the cluster—just like tasks in MapReduce.

▪ A job always runs in the context of an application (represented by a SparkContext instance) that serves to
group RDDs and shared variables.

▪ An application can run more than one job, in series or in parallel, and provides the mechanism for a job to
access an RDD that was cached by a previous job in the same application, as sketched below.
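As a hedged illustration of the last point, the Scala sketch below runs two jobs (two actions) inside one application and reuses an RDD cached by the first job; the input path is a placeholder.

import org.apache.spark.sql.SparkSession

// Sketch: one application (one SparkContext) running two jobs that share a cached RDD.
// The input path is a placeholder.
object TwoJobsOneApplication {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("two-jobs").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.textFile("hdfs:///data/words.txt")
      .flatMap(_.split("\\s+"))
      .cache()                                  // keep the RDD in memory for later jobs

    val total    = words.count()                // job 1
    val distinct = words.distinct().count()     // job 2 reuses the cached RDD

    println(s"total=$total distinct=$distinct")
    spark.stop()
  }
}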



How Spark runs a job?

❖ Anatomy of a Spark Job Run

▪ Job submission

▪ DAG creation

▪ Task scheduling

▪ Task execution



The stages and RDDs in a Spark job



Spark Job Executors

❖ Spark uses executors to run the tasks that make up a job.

▪ First, the executor maintains a local cache of all the dependencies that previous tasks have used.

▪ Second, it deserializes the task code from the serialized bytes that were sent as part of the
launch-task message.

▪ Third, it executes the task code in the same JVM as the executor.

• Tasks can return a result to the driver. The result is serialized and sent to the executor
backend, and then back to the driver as a status update message.



Cluster Managers for Spark

❖ Cluster Managers for Spark

▪ Spark requires a cluster manager to manage the lifecycle of executors that run the jobs.

▪ Spark is compatible with a variety of cluster managers with different characteristics, as sketched at the end of this list:

• Local cluster

✓ In local mode there is a single executor running in the same JVM as the driver.

✓ This mode is useful for testing or running small jobs.

• Standalone

✓ It is a simple distributed implementation that runs a single master and multiple workers.



Cluster Managers for Spark

❖ Cluster Managers for Spark

• Apache Mesos

✓ Mesos is a general-purpose cluster resource manager that allows fine-grained sharing of
resources across different applications.

• Hadoop YARN

✓ When YARN is used as a cluster manager for Spark, each Spark application corresponds to an
instance of a YARN application, and each executor runs in its own YARN container.

✓ The Mesos and YARN cluster managers are superior to the standalone manager as they can
manage resources of other applications running on the cluster. They also enforce a scheduling
policy across all of them.
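As a rough sketch (not from the slides), the cluster manager is usually selected through the master URL when the application is configured; the host names below are placeholders, and for YARN the Hadoop configuration must be available to the application.

import org.apache.spark.sql.SparkSession

// Sketch: choosing a cluster manager via the master URL. Host names are placeholders.
object MasterUrlExamples {
  def main(args: Array[String]): Unit = {
    // Local mode: a single executor running in the same JVM as the driver (handy for tests)
    val spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

    // Standalone cluster:  .master("spark://master-host:7077")
    // Apache Mesos:        .master("mesos://mesos-host:5050")
    // Hadoop YARN:         .master("yarn")   // needs HADOOP_CONF_DIR / YARN_CONF_DIR set

    spark.stop()
  }
}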



Spark Cluster Managers



Spark on YARN

❖ Spark deployment on YARN


▪ Running Spark on YARN provides better integration with other Hadoop components.

▪ Spark can be deployed on YARN in two modes:

• Client mode
✓ In client mode, the driver program runs in the client process.
✓ Client mode is required for programs with an interactive component, like spark-shell or
PySpark.

• Cluster mode
✓ In this mode, the driver program runs on the cluster in the YARN Application Master.
✓ YARN cluster mode is appropriate for production jobs that require logging activity.



How Spark executors are started in YARN client mode



How Spark executors are started in YARN cluster mode



Applications of Spark



Spark Use Case



Related Resources
❖ Apache Spark Tutorials
▪ Apache Spark

• https://www.youtube.com/watch?v=QaoJNXW6SQo&t=3s

▪ Understanding Apache Spark

• https://www.youtube.com/watch?v=znBa13Earms

▪ How Apache Spark runs a job

• https://www.youtube.com/watch?v=jDkLiqlyQaY



Contents’ Review
❖ Apache Spark
▪ Introduction, features, ecosystem, and major benefits
▪ Components of Spark
• Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX
▪ Architecture of Spark
• RDDs, Jobs, Tasks, and Context
▪ Anatomy of a Spark Job Run
• Job submission, DAG creation, Task scheduling, Task Execution
▪ Executors and Cluster Managers
▪ Spark applications

You are Welcome!


Questions?
Comments!
Suggestions!!

