UNIT - III
Spark Definition
Spark is a unified computing engine with a set of libraries for parallel data processing on computer clusters.
Computing Engine:
Spark makes parallel data computing/processing possible, much like MapReduce in Hadoop.
The difference is that MapReduce works only on data stored in Hadoop (HDFS), whereas
Spark can work on top of any distributed storage, relieving the end user from worrying
about where to store and retrieve the data.
History
In 2009, when the University of California, Berkeley created a new resource manager for Hadoop called Mesos,
Spark was created as a programming framework to test the functionality of Mesos.
• Polyglot: Spark supports four languages: Java, Scala, Python, and R. We can write Spark
code in any one of these languages. Spark also provides a command-line interface (shell)
in Scala and Python.
Layered View of Spark
• Layer 1: The base Spark download provides Spark Core, which can process data both in
memory (RAM) and on disk.
Spark SQL:
• Spark SQL is built on top of Spark Core. It provides support for structured
data.
• It allows querying the data via SQL (Structured Query Language) as well as the
Apache Hive variant of SQL called HQL (Hive Query Language).
• It supports JDBC and ODBC connections, which allow existing databases,
data warehouses, and business intelligence tools to connect to Spark.
• It also supports various sources of data like Hive tables, Parquet, and JSON.
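The snippet below is a minimal sketch of these ideas for the Scala spark-shell, where a SparkSession named spark is predefined; the people.json file and its name and age columns are hypothetical examples.

// Load structured data from a hypothetical JSON file into a DataFrame.
val people = spark.read.json("people.json")

// Register the DataFrame as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")

// Query the data with ordinary SQL; HiveQL works the same way when Hive support is enabled.
spark.sql("SELECT name, age FROM people WHERE age >= 18").show()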
Components of Spark Unified Stack
Spark Streaming:
• Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.
• It uses Spark Core's fast scheduling capability to perform streaming
analytics.
• It accepts data in mini-batches and performs RDD transformations on that
data.
• Its design ensures that the applications written for streaming data can be
reused to analyze batches of historical data with little modification.
• The log files generated by web servers can be considered as a real-time
example of a data stream
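A minimal DStream sketch for the Scala spark-shell (where a SparkContext named sc is predefined); the socket source on localhost:9999 is a hypothetical stand-in for a real stream such as web-server logs.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build a StreamingContext that cuts the stream into 5-second mini-batches.
val ssc = new StreamingContext(sc, Seconds(5))

// A DStream of text lines from a hypothetical socket source (e.g. started with: nc -lk 9999).
val lines = ssc.socketTextStream("localhost", 9999)

// Ordinary RDD transformations are applied to every mini-batch.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // begin receiving and processing the stream
ssc.awaitTermination()  // run until the streaming job is stopped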
MLlib:
• MLlib is a machine learning library that contains various machine
learning algorithms.
• These include correlations and hypothesis testing, classification and
regression, clustering, and principal component analysis.
• It is nine times faster than the disk-based implementation used by
Apache Mahout.
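A small MLlib sketch for the Scala spark-shell showing one of the algorithms listed above, k-means clustering; the four 2-D points are made-up values used only for illustration.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// A tiny made-up dataset of 2-D points in a DataFrame with a "features" column.
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)
).map(Tuple1.apply)
val data = spark.createDataFrame(points).toDF("features")

// Fit a k-means model with two clusters and print the learned cluster centres.
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)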
GraphX:
• GraphX is a library used to manipulate graphs and
perform graph-parallel computations.
• It facilitates creating a directed graph with arbitrary properties
attached to each vertex and edge.
• To manipulate graphs, it supports various fundamental operators like
subgraph, joinVertices, and aggregateMessages.
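A small GraphX sketch for the Scala spark-shell; the vertex names and the "follows" edge labels are made-up properties used only to show how a directed property graph is built and filtered with the subgraph operator.

import org.apache.spark.graphx.{Edge, Graph}

// Vertices carry an arbitrary property (a name); edges carry a relationship label.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

// Build the directed property graph.
val graph = Graph(vertices, edges)

// Fundamental operators: in-degrees of every vertex, and a filtered subgraph.
graph.inDegrees.collect().foreach(println)
val follows = graph.subgraph(epred = triplet => triplet.attr == "follows")
println(follows.edges.count())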
Spark Application Architecture
Spark applications run as independent sets of processes on a cluster,
coordinated by the Spark Context object in our main program (called the
driver program).
Specifically, to run on a cluster, the Spark Context can connect to several
types of cluster managers (either Spark’s own standalone cluster
manager, Mesos, YARN or Kubernetes), which allocate resources across
applications. Once connected, Spark acquires executors on nodes in the
cluster, which are processes that run computations and store data for our
application. Next, it sends our application code (defined by JAR or Python
files passed to Spark Context) to the executors. Finally, Spark Context
sends tasks to the executors to run.
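A hedged sketch of the driver side of this flow as a standalone Scala application; the master URL, application name, and memory setting below are hypothetical placeholders for whatever cluster manager and resources are actually used.

import org.apache.spark.sql.SparkSession

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    // The driver program: building the SparkSession creates the SparkContext,
    // which connects to the cluster manager named by the master URL.
    val spark = SparkSession.builder()
      .appName("MyDriverApp")
      .master("spark://master-host:7077")     // hypothetical standalone master; could be YARN, Mesos or Kubernetes
      .config("spark.executor.memory", "2g")  // resources requested for the executors
      .getOrCreate()

    // Work defined here is split into tasks that the SparkContext sends to the executors.
    val total = spark.sparkContext.parallelize(1 to 1000000).sum()
    println(s"sum = $total")

    spark.stop()  // release the executors acquired for this application
  }
}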
Useful things to note
1. Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads. This has the benefit of isolating applications from each other,
on both the scheduling side (each driver schedules its own tasks) and
executor side (tasks from different applications run in different JVMs).
However, it also means that data cannot be shared across different
Spark applications (instances of Spark Context) without writing it to an
external storage system.
2. Spark is agnostic to the underlying cluster manager. As long as it
can acquire executor processes, and these communicate with each
other, it is relatively easy to run it even on a cluster manager that also
supports other applications (e.g. Mesos/YARN/Kubernetes).
3. The driver program must listen for and accept incoming connections
from its executors throughout its lifetime. As such, the driver program
must be network addressable from the worker nodes.
Worker Nodes:
• The worker (slave) nodes host the executors, which process tasks and return the
results back to the Spark Context.
• The Spark Context issues tasks, and the worker nodes execute them.
• They simplify processing by using the worker nodes (1 to n) to handle as many
jobs as possible in parallel, dividing each job into sub-jobs spread across
multiple machines.
• The Spark master monitors the worker nodes to ensure that the computation is
carried out correctly.
• Each Spark task runs on a single worker node, inside an executor.
• In Spark, a partition is the unit of work; each partition is processed by one task,
which is assigned to a single executor.
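A short spark-shell sketch of the partition-to-task relationship described above; the four partitions and the 1-to-1000 range are arbitrary illustration values.

// Split an RDD into 4 partitions; each partition is processed by exactly one task,
// and the tasks are distributed to executors on the worker nodes.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)
println(rdd.getNumPartitions)   // 4

// Count the records that each task (one per partition) would process.
val perPartition = rdd.mapPartitionsWithIndex { (index, it) => Iterator((index, it.size)) }
perPartition.collect().foreach(println)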
RealTime Analytics with Spark
• The Apache Spark architecture handles a continuous stream of data by
dividing the stream into micro-batches, exposed through an API called a
Discretized Stream, or DStream.
• A DStream is a sequence of RDDs created from input data arriving from
sources such as Kafka or Flume, or by applying operations on other
DStreams.
• The RDDs thus generated can be converted into DataFrames and queried
using Spark SQL.
• DStream data can be kept in Spark's working memory and exposed, through
Spark's JDBC server, to any application that can query RDDs, so it can be
queried later on demand.
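A hedged spark-shell sketch of this flow, with a hypothetical socket source on localhost:9999 standing in for Kafka or Flume; each micro-batch RDD is turned into a DataFrame and queried with Spark SQL.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import spark.implicits._

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical input stream

// Each micro-batch of the DStream is an RDD; convert it to a DataFrame and query it.
lines.foreachRDD { rdd =>
  val words = rdd.flatMap(_.split(" ")).toDF("word")
  words.createOrReplaceTempView("words")
  spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word").show()
}

ssc.start()
ssc.awaitTermination()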
Most real-time analytics systems can be broken down into a receiver system, a
stream-processing system, and a storage system.
Batch Analytics with Spark
• Batch processing is used to deal with enormous amounts of data for
implementing high-volume and repeating data jobs, each of which
performs a specific operation without the need for user intervention.
• A single RDD can be divided into multiple logical partitions so that these
partitions can be stored and processed on different machines of a cluster.
• RDDs are immutable (read-only) in nature.
• You cannot change an original RDD, but you can create new RDDs by performing
coarse-grained operations, like transformations, on an existing RDD.
• An RDD in Spark can be cached and used again for future transformations, which
is a huge benefit for users.
• RDDs are said to be lazily evaluated, i.e., they delay the evaluation until it is really
needed. This saves time.
Features of an RDD in Spark
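A spark-shell sketch of the features listed above: immutability (transformations return new RDDs), lazy evaluation (nothing runs until an action), partitioning, and caching.

// Source RDD split into 4 partitions; transformations never modify it, they return new RDDs.
val numbers = sc.parallelize(1 to 100, 4)

// Lazily evaluated: no computation happens yet, Spark only records the lineage.
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Cache the derived RDD so future actions can reuse it without recomputation.
squares.cache()

// Actions trigger the actual evaluation.
println(squares.count())   // first action: computes and caches the partitions
println(squares.sum())     // second action: served from the cache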