
Introduction to Apache Spark™
Hadoop Limitations
Forces your data processing into Map and Reduce
Other workflows are missing: join, filter, flatMap, groupByKey, union, intersection, …
The Hadoop implementation of Map-Reduce is designed for out-of-core data, not in-memory data
Overhead due to replication
Only Java is natively supported: Java is not a high-performance programming language, and support for other languages is needed
Only for batch processing: interactivity and streaming data are missing
Map-Reduce Limitations
The Map-Reduce paradigm is fundamentally limited in expressiveness
Optimized for simple operations on a large amount of data
It is perfect… if your goal is to make a histogram from a large dataset!
Hard to compose and nest multiple operations
Not efficient for iterative tasks, e.g. machine learning
Based on "acyclic data flow" from disk to disk (HDFS): read and write to disk before and after Map and Reduce (stateless machine)
Not obvious how to perform operations with different cardinality
Example: try implementing All-Pairs efficiently
One Solution is Apache Spark
A new general framework which solves many of the shortcomings of MapReduce
Idea: layer a system on top of Hadoop
It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, S3, …
Has many other workflows, e.g. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first, … (around 30 efficient distributed operations)
Achieves fault tolerance by re-execution instead of replication
One Solution is Apache Spark
In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.)
Native Scala, Java, Python, and R support
Supports interactive shells for exploratory data analysis
The Spark API is extremely simple to use
History of Spark
Developed at AMPLab, UC Berkeley, in 2009
Open sourced in 2010 under a BSD license
In 2013, the project was donated to the Apache Software Foundation
In February 2014, Spark became a Top-Level Apache Project
In November 2014, Spark founder M. Zaharia's company Databricks set a new world record in large-scale sorting using Spark
Sort competition

                          Hadoop MR Record (2013)         Spark Record (2014)
Data Size                 102.5 TB                        100 TB
Elapsed Time              72 mins                         23 mins
# Nodes                   2100                            206
# Cores                   50400 physical                  6592 virtualized
Cluster disk throughput   3150 GB/s                       618 GB/s (est.)
Network                   dedicated data center, 10Gbps   virtualized (EC2), 10Gbps
Sort rate                 1.42 TB/min                     4.27 TB/min
Sort rate/node            0.67 GB/min                     20.7 GB/min

Spark: 3x faster with 1/10 the nodes

Sort benchmark, Daytona Gray: sort of 100 TB of data (1 trillion records)
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Spark: Main benefits
Spark Uses Memory instead of Disk
Hadoop: use disk for data sharing
[Figure: each iteration reads its input from HDFS and writes its output back to HDFS (HDFS read, Iteration 1, HDFS write, HDFS read, Iteration 2, HDFS write)]

Spark: in-memory data sharing
[Figure: data is read from HDFS once and shared in memory between Iteration 1 and Iteration 2]
Apache Spark
Apache Spark supports data analysis, machine learning, graphs, streaming data, etc. It can read/write from a range of data types and allows development in multiple languages.

Languages: Scala, Java, Python, R, SQL
Libraries: Spark SQL (DataFrames), Spark Streaming, MLlib (ML Pipelines), GraphX, all built on Spark Core
Data sources: Alluxio, Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style storage (GlusterFS, Lustre)
Spark
Spark Core
contains the basic functionality of Spark, including components for
task scheduling, memory management, fault recovery, interacting
with storage systems and more
Spark Core is also home to the API that defines resilient distributed
datasets (RDDs), which are Spark’s main programming abstraction
Spark SQL
package for working with structured data
allows querying data via SQL as well as HQL (Hive Query
Language)
supports many sources of data, including Hive tables, Parquet and
JSON
Spark SQL allows developers to intermix SQL queries with the
programmatic data manipulations supported by RDDs in Python,
Java, and Scala, all within a single application, thus combining SQL
with complex analytics
Cluster Managers: variety of cluster managers, including
Hadoop YARN, Apache Mesos, and a simple cluster manager
included in Spark itself called the Standalone Scheduler
Spark
Spark Streaming
a Spark component that enables processing of live streams of
data.
Examples: log files generated by production web servers, or queues
of messages containing status updates posted by users of a web
service.
provides an API for manipulating data streams that closely
matches the Spark Core’s RDD API, making it easy for programmers
to move between applications that manipulate data stored in
memory, on disk, or arriving in real time.
Underneath its API, Spark Streaming was designed to provide the
same degree of fault tolerance, throughput, and scalability as
Spark Core
MLlib: a library containing common machine learning (ML)
functionality
GraphX: a library for manipulating graphs (e.g., a social
network’s friend graph) and performing graph-parallel
computations.
Spark Architecture
Resilient Distributed Datasets (RDDs) – key Spark construct
Simply an immutable distributed collection of objects spread across a cluster, stored in RAM or on disk
Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster
RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes
Created through lazy parallel transformations
Automatically rebuilt on failure
Resilient Distributed Dataset (RDD) – key Spark construct
RDDs (Resilient Distributed Datasets) are data containers
RDDs represent data or transformations on data
RDDs can be created from Hadoop InputFormats (such as HDFS files), by calling parallelize() on a dataset, or by transforming other RDDs (you can stack RDDs); see the sketch below
Actions can be applied to RDDs; actions force calculations and return values
Lazy evaluation: nothing is computed until an action requires it
RDDs are best suited for applications that apply the same operation to all elements of a dataset, and less suitable for applications that make asynchronous fine-grained updates to shared state
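A minimal PySpark sketch of these creation paths and of lazy evaluation (the SparkContext settings and the input file data.txt are illustrative assumptions, not from the slides):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-intro")

# Create RDDs: from a Python collection and from a (local or HDFS) file.
nums = sc.parallelize([1, 2, 3, 4, 5])
lines = sc.textFile("data.txt")

# Transformations are lazy: nothing is computed yet.
squares = nums.map(lambda x: x * x)
long_lines = lines.filter(lambda line: len(line) > 80)

# Actions trigger the computation and return results to the driver.
print(squares.collect())      # [1, 4, 9, 16, 25]
print(long_lines.count())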
Fault Tolerance
• RDDs contain lineage graphs (coarse-grained updates/transformations) that help rebuild partitions that were lost
• Only the lost partitions of an RDD need to be recomputed upon failure
• They can be recomputed in parallel on different nodes without having to roll back the entire application
• This also lets the system tolerate slow nodes (stragglers) by running a backup copy of the troubled task; the original process on the straggling node is killed when the new process completes
• Cached/checkpointed partitions are also used to recompute lost partitions if available in shared memory
Spark – RDD Persistence
Spark's RDDs are by default recomputed each time you run an action on them
You can also persist (cache) an RDD if you know it will be needed again
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it)
This allows future actions to be much faster (often >10x)
Mark an RDD to be persisted using the persist() or cache() methods on it; the first time it is computed in an action, it will be kept in memory on the nodes (see the sketch below)
The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it
You can choose a storage level (MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, etc.)
You can manually call unpersist()
If the data is too big to be cached, it will spill to disk with a Least Recently Used (LRU) replacement policy
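A short, hedged sketch of persist()/cache() in PySpark; the log file name and the chosen storage level are illustrative assumptions:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-demo")

logs = sc.textFile("access.log")              # hypothetical input file
errors = logs.filter(lambda line: "ERROR" in line)

# Mark the RDD for caching; it is materialized on the first action.
errors.persist(StorageLevel.MEMORY_AND_DISK)  # or simply errors.cache()

print(errors.count())    # computes and caches the partitions
print(errors.take(5))    # reuses the cached partitions

errors.unpersist()       # release the storage manually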
RDD Persistence (Storage Levels)

MEMORY_ONLY (default): Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.

MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY: Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes.
Spark Operations: Two Types

Transformations (create a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, intersection, flatMap, union, join, cogroup, cross, mapValues

Actions (return results to the driver program):
collect, reduce, count, takeSample, take, lookupKey, first, takeOrdered, countByKey, save, foreach
Sample Spark transformations

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin (see the sketch below).
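A small PySpark sketch of these transformations, assuming an existing SparkContext sc (e.g. from a pyspark shell); the sample data is made up for illustration:

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5, 5])

evens = a.filter(lambda x: x % 2 == 0)       # [2, 4]
both = a.union(b).distinct()                 # [1, 2, 3, 4, 5]
common = a.intersection(b)                   # [3, 4]

users = sc.parallelize([("u1", "Ann"), ("u2", "Bob")])
clicks = sc.parallelize([("u1", "page1"), ("u1", "page2")])
joined = users.join(clicks)                  # [("u1", ("Ann", "page1")), ("u1", ("Ann", "page2"))]
print(joined.collect())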
Map and flatMap
map: The map() transformation takes in any function and applies it to every element of the RDD, producing a new RDD. The input and return types may differ; for example, the input RDD may contain Strings while the RDD returned by map() contains Booleans.
flatMap: With flatMap(), each input element can produce zero, one, or many elements in the output RDD. The simplest use of flatMap() is to split each input string into words.
map and flatMap are similar in that they take an element from the input RDD and apply a function to it. The key difference is that map() returns exactly one element per input element, while flatMap() can return a list of elements, which is then flattened (see the sketch below).
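A brief illustrative sketch of the difference, again assuming an existing SparkContext sc:

lines = sc.parallelize(["to be or", "not to be"])

# map: one output element per input element (here, a list per line)
per_line = lines.map(lambda line: line.split(" "))
# [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap: each input element can produce zero or more output elements
words = lines.flatMap(lambda line: line.split(" "))
# ['to', 'be', 'or', 'not', 'to', 'be']

print(per_line.collect())
print(words.collect())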
mapPartitions
mapPartitions(func): converts each partition of the source RDD into many elements of the result (possibly none). It is like map, but instead of being called once per element, func runs separately on each partition (block) of the RDD.
mapPartitionsWithIndex(func): like mapPartitions, but it additionally provides func with an integer value representing the index of the partition, so the function is applied partition by partition together with its index (see the sketch below).
More transformations
groupByKey(): When we use groupByKey() on a dataset of (K, V) pairs, the data is shuffled according to the key K into another RDD. With this transformation, a lot of unnecessary data gets transferred over the network.
reduceByKey(): When we use reduceByKey() on a dataset of (K, V) pairs, values with the same key are combined on each machine before the data is shuffled.
sortByKey(): When we apply sortByKey() to a dataset of (K, V) pairs, the data is sorted by the key K into another RDD.
join(): Join is a database term; it combines fields from two tables using common values. The join() operation in Spark is defined on pair RDDs, i.e. RDDs whose elements are tuples in which the first element is the key and the second element is the value.
coalesce(): To avoid a full shuffle of data we use coalesce(). It reuses existing partitions so that less data is shuffled, and lets us cut the number of partitions. For example, if we have four partitions and want only two, the data from the dropped partitions is moved onto the partitions we keep (see the sketch below).
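A compact sketch of these pair-RDD transformations and of coalesce(), assuming an existing SparkContext sc; the data and partition counts are illustrative:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)], 4)

# reduceByKey combines values per key on each machine before shuffling.
counts = pairs.reduceByKey(lambda x, y: x + y)     # [("a", 3), ("b", 4)]

# groupByKey shuffles all values for a key to one place.
grouped = pairs.groupByKey().mapValues(list)       # [("a", [1, 2]), ("b", [1, 3])]

# sortByKey orders the pairs by key.
ordered = counts.sortByKey()

# coalesce reduces the number of partitions without a full shuffle.
fewer = pairs.coalesce(2)
print(counts.collect(), fewer.getNumPartitions())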
Narrow vs. Wide transformations
[Figure: map is a narrow transformation, where each output partition depends on a single input partition; groupByKey is a wide transformation, where values for the same key, e.g. (A,1) and (A,2) from different partitions, are shuffled together into (A,[1,2])]
Lineage Graph
Spark keeps track of the set of dependencies between different RDDs, called the lineage graph
It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost
Directed Acyclic Graphs (DAG)
[Figure: a DAG of RDDs connected by transformations]
DAGs track dependencies (also known as lineage)
nodes are RDDs
arrows are transformations
Actions
What is an action?
The final stage of the workflow
Triggers the execution of the DAG
Returns the results to the driver, or writes the data to HDFS or to a file
Sample Spark Actions
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count(): Return the number of elements in the dataset.

Remember: actions cause calculations to be performed; transformations just set things up (lazy evaluation). A short sketch follows.
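A short sketch of these actions, assuming an existing SparkContext sc:

nums = sc.parallelize([5, 3, 8, 1, 9, 2])

total = nums.reduce(lambda x, y: x + y)   # 28 (the function is commutative and associative)
small = nums.filter(lambda x: x < 5)

print(small.collect())      # [3, 1, 2], safe only because the result is small
print(nums.count())         # 6
print(nums.take(3))         # first 3 elements
print(nums.takeOrdered(3))  # [1, 2, 3]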
reduce, fold, aggregate
reduce(): The reduce() function takes two elements as input from the RDD and produces an output of the same type as the input elements. A simple example of such a function is addition: we can add the elements of an RDD or count the number of words. It accepts a commutative and associative operation as an argument.
fold(): The signature of fold() is like reduce(), but it additionally takes a "zero value" as input, which is used for the initial call on each partition. The condition on the zero value is that it should be the identity element of the operation. The key difference between fold() and reduce() is that reduce() throws an exception for an empty collection, while fold() is defined for an empty collection.
aggregate(): Gives us the flexibility to return a data type different from the input type. aggregate() takes two functions to get the final result: one to combine an element from the RDD with the accumulator, and a second to combine two accumulators. We also supply the initial zero value of the type we want to return (see the sketch below).
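A hedged sketch contrasting the three actions, assuming an existing SparkContext sc; computing (sum, count) to derive a mean is just one illustrative use of aggregate():

nums = sc.parallelize([1, 2, 3, 4])

# reduce: same output type as the elements
print(nums.reduce(lambda x, y: x + y))           # 10

# fold: like reduce, but with a "zero value" (the identity of the operation);
# it is also defined for an empty RDD.
print(nums.fold(0, lambda x, y: x + y))          # 10

# aggregate: the result type may differ from the element type.
# Here we compute (sum, count) in one pass and derive the mean.
sum_count = nums.aggregate(
    (0, 0),                                      # zero value of the result type
    lambda acc, x: (acc[0] + x, acc[1] + 1),     # merge an element into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))     # merge two accumulators
print(sum_count[0] / sum_count[1])               # 2.5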
foreach
foreach(): Useful when we want to apply an operation to each element of an RDD without returning a value to the driver, for example inserting a record into a database (see the sketch below).
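A minimal sketch, assuming an existing SparkContext sc; save_to_db is a hypothetical helper standing in for a real database insert:

def save_to_db(record):
    pass  # e.g. insert the record into a database; nothing is returned to the driver

records = sc.parallelize([("u1", 10), ("u2", 20)])
records.foreach(save_to_db)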
Spark Workflow
[Figure: a driver program with a SparkContext submits a chain of transformations (flatMap, map, groupByKey) followed by a collect action to the cluster]
When to use RDDs?
Consider these scenarios or common use cases for using RDDs when:
you want low-level transformations, actions, and control over your dataset;
your data is unstructured, such as media streams or streams of text;
you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
you don't care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
Spark SQL, DataFrames and Datasets
Spark SQL is a Spark module for structured data processing.
Unlike the basic Spark RDD API, the interfaces provided by
Spark SQL provide Spark with more information about the
structure of both the data and the computation being
performed.
Internally, Spark SQL uses this extra information to perform
extra optimizations.
There are several ways to interact with Spark SQL including
SQL, DataFrame API and the Dataset API.
When computing a result, the same execution engine is used,
independent of which API/language you are using to express
the computation.
This unification means that developers can easily switch back
and forth between different APIs based on which provides the
most natural way to express a given transformation.
Problem with RDD
DataFrame & DataSet (Spark 2.0)
DataFrames
Like an RDD, a DataFrame is an immutable distributed collection of data
Unlike an RDD, the data is organized into named columns, like a table in a relational database: DataFrames have a schema
Designed to make processing of large data sets even easier, DataFrames allow developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction
They provide a domain-specific language API to manipulate your distributed data, and make Spark accessible to a wider audience beyond specialized data engineers
DataFrames are cached and optimized by Spark
DataFrames are built on top of RDDs and the core Spark API
DataFrames
Similar to a relational database table, a Python Pandas DataFrame, or R's data tables
Immutable once constructed
Track lineage
Enable distributed computations
How to construct DataFrames (see the sketch below):
Read from file(s)
Transform an existing DataFrame (Spark or Pandas)
Parallelize a Python collection (list)
Apply transformations and actions
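A hedged PySpark sketch of these construction paths; the file name users.json and the column names are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# 1) Read from a file
users = spark.read.json("users.json")

# 2) Parallelize a Python collection (with column names)
people = spark.createDataFrame(
    [("Ann", 19), ("Bob", 25)], ["name", "age"])

# 3) Transform an existing DataFrame
adults = people.filter(people.age >= 21)

adults.show()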
Datasets
Starting in Spark 2.0, Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API
Conceptually, consider a DataFrame as an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped JVM object
A Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java
Since Python and R have no compile-time type safety, we only have the untyped API there, namely DataFrames
DataFrame vs DataSet
In Apache Spark 2.0, these two APIs are unified: we can consider a DataFrame as an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast, is a collection of strongly-typed JVM objects.
Spark checks whether the types in a DataFrame align with its schema at run time, not at compile time. This is because elements of a DataFrame are of type Row, and Row cannot be parameterized by a type at compile time, so the compiler cannot check its type. As a result, DataFrames are untyped and not type-safe.
Datasets, on the other hand, check whether types conform to the specification at compile time. That is why Datasets are type-safe.
Benefits of DataFrame and Dataset APIs
Static typing and runtime type safety
With DataFrames and Datasets you can catch errors at compile time, which saves developer time and costs
High-level abstraction and custom view into structured and semi-structured data
DataFrames, as a collection of Dataset[Row], render a structured custom view into your semi-structured data
Benefits of DataFrame and Dataset APIs
Ease of use of APIs with structure
Although structure may limit control over what your Spark program can do with the data, it introduces rich semantics and an easy set of domain-specific operations that can be expressed as high-level constructs
Expressing your computation in a domain-specific API is far simpler and easier than with relational-algebra-type expressions (as in RDDs)
Performance and optimization
Because the DataFrame and Dataset APIs are built on top of the Spark SQL engine, Spark uses Catalyst to generate an optimized logical and physical query plan
Since Spark, as a compiler, understands your Dataset-typed JVM objects, it maps them to Tungsten's internal memory representation using Encoders. As a result, Tungsten Encoders can efficiently serialize/deserialize JVM objects as well as generate compact bytecode that can execute at superior speeds
Benefits of DataFrame and Dataset APIs
Example: space
Example: performance
DataFrame example

# Create a new DataFrame that contains "students"
students = users.filter(users.age < 21)

# Alternatively, using Pandas-like syntax
students = users[users.age < 21]

# Count the number of student users by gender
students.groupBy("gender").count()

# Join young students with another DataFrame called logs
students.join(logs, logs.userId == users.userId, "left_outer")
Spark Stream Processing
Data streaming scenario
• Continuous and rapid input of data
• Limited memory to store the data (less than linear in the input size)
• Limited time to process each element
• Sequential access
• Algorithms have one or very few passes over the data
Data streaming scenario
• Typically, simple functions of the stream are computed and used as input to other algorithms:
• Number of distinct items
• Heavy hitters
• Longest increasing subsequence
• …
• Closed-form solutions are rare; approximation and randomisation are the norm
Sampling
• Sampling: selection of a subset of items from a large data set
• Goal: the sample retains the properties of the whole data set
• Important for drawing the right conclusions from the data
Sampling framework
• Algorithm A chooses every incoming element with a certain probability
• If the element is "sampled", A puts it into memory, otherwise the element is discarded
• Depending on the situation, algorithm A may discard some items from memory after having added them
• For every query of the data set, algorithm A computes some function based only on the in-memory sample
Reservoir sampling
1. Sample the first k elements from the stream
2. Sample the i-th element (i > k) with probability k/i (if sampled, randomly replace a previously sampled item)

• Limitations:
• The wanted sample must fit into main memory
• Distributed sampling is not possible (all elements need to be processed sequentially)

A plain-Python sketch follows.
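A plain-Python sketch of the algorithm above (not Spark-specific; stream can be any iterable of unknown length):

import random

def reservoir_sample(stream, k):
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            sample.append(item)                 # keep the first k elements
        elif random.random() < k / i:           # keep element i with probability k/i
            sample[random.randrange(k)] = item  # replace a random earlier choice
    return sample

print(reservoir_sample(range(10000), 5))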
Reservoir sampling example
[Figure: histogram of the entire stream]
Min-wise sampling
Task: Given a data stream of unknown length, randomly pick k elements from the stream so that each element has the same probability of being chosen.

1. For each element in the stream, tag it with a random number in the interval [0,1]
2. Keep the k elements with the smallest random tags
Min-wise sampling
Task: Given a data stream of unknown length, randomly pick k elements from the stream so that each element has the same probability of being chosen.

• Can be run in a distributed fashion with a merging stage (every subset has the same chance of having the smallest tags)
• Disadvantage: more memory/CPU intensive than reservoir sampling (see the sketch below)
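A plain-Python sketch of min-wise sampling as described above; the merging step for the distributed case is only indicated in a comment:

import heapq
import random

def minwise_sample(stream, k):
    # Tag each element with a random number in [0,1] and keep the k smallest tags.
    tagged = ((random.random(), x) for x in stream)
    return [x for _, x in heapq.nsmallest(k, tagged)]

# Distributed use (sketch): sample each partition independently while keeping
# the tags, then merge by retaining the k overall-smallest tags.
print(minwise_sample(range(10000), 5))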
Summarizing vs. filtering
• So far: all data is useful; summarise it due to the lack of space/time
• Now: not all data is useful, some is harmful
• Classic example: spam filtering
• Mail servers can analyse the textual content
• Mail servers have blacklists
• Mail servers have whitelists (very effective!)
• Incoming mails form a stream; quick decisions are needed (delete or forward)
• Applications in Web caching, packet routing, resource location, etc.
Problem statement
• A set W containing m values (e.g. IP addresses, email addresses, etc.)
• Working memory of size n bits
• Goal: a data structure that allows efficient checking of whether the next element in the stream is in W
• return TRUE with probability 1 if the element is indeed in W
• return FALSE with high probability if the element is not in W
Bloom filter
Bloom filter: element testing
Bloom filter: how many hash functions are useful?
• Example: m = 10^9 whitelisted IP addresses and n = 8x10^9 bits in memory (see the sketch below)
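Since the Bloom filter slides above are figures, here is a minimal plain-Python sketch of the idea. The class name and the salted use of Python's built-in hash are illustrative assumptions; the optimal number of hash functions k = (n/m) ln 2 is the standard result, which for the example above (n/m = 8 bits per element) gives k ≈ 6:

import math

class BloomFilter:
    def __init__(self, n_bits, n_items):
        self.n = n_bits
        # Optimal number of hash functions: k = (n/m) * ln 2
        self.k = max(1, round((n_bits / n_items) * math.log(2)))
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # k bit positions, simulated by salting Python's built-in hash
        return (hash((i, item)) % self.n for i in range(self.k))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # TRUE with probability 1 for members; FALSE with high probability otherwise
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter(n_bits=80_000, n_items=10_000)
bf.add("192.168.0.1")
print("192.168.0.1" in bf, "10.0.0.1" in bf)   # True, and False with high probability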
Requirements for Stream
Processing

▪ Scalable to large clusters

▪ Second-scale latencies
▪ Simple programming model
▪ Integrated with batch & interactive processing

▪ Efficient fault-tolerance
Spark Streaming
Spark Streaming is an extension of the core Spark API that
enables scalable, high-throughput, fault-tolerant stream
processing of live data streams.

Spark Streaming receives live input data streams and


divides the data into batches, which are then processed
by the Spark engine to generate the final stream of results
in batches.
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
▪ Chop up the live data stream into batches of X seconds
▪ Spark treats each batch of data as an RDD and processes it using RDD operations
▪ Finally, the processed results of the RDD operations are returned in batches
▪ Batch sizes as low as ½ second, latency ~ 1 second
[Figure: live data stream, Spark Streaming, batches of X seconds, Spark, processed results]

Potential for combining batch processing and streaming processing in the same system
Discretized Streams (DStreams)
Discretized Stream or DStream is the basic abstraction
provided by Spark Streaming
It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream.
Internally, a DStream is represented by a continuous series
of RDDs, which is Spark’s abstraction of an immutable,
distributed dataset
Each RDD in a DStream contains data from a certain interval
Any operation applied on a DStream translates to
operations on the underlying RDDs.
Get hashtags from Twitter
Sources
Basic sources:
TCP socket: ssc.socketTextStream(...)
File stream: StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass]
Advanced sources (require interfacing with external non-Spark libraries):
Kafka
Kinesis
Flume
Transformations on DStreams
Steps in Spark Streaming
Define the input sources by creating input DStreams.
Define the streaming computations by applying transformation and output operations to DStreams.
Start receiving data and processing it using streamingContext.start().
Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
The processing can be manually stopped using streamingContext.stop().
DStream Example (Python)
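The slide's own example is not reproduced here; the following is a representative network word-count sketch in Python, assuming a text source on localhost:9999 (e.g. started with nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)                      # 1-second batches

lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                                    # print each batch's result

ssc.start()
ssc.awaitTermination()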
Window Operations
Allow you to apply transformations over a sliding window of data
Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream
Window length: the duration of the window (3 in the figure)
Sliding interval: the interval at which the window operation is performed (2 in the figure)
Transformations on window
Example (see the sketch below)
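A hedged sketch of a windowed word count (window length 30 s, sliding interval 10 s, 1-second batches); the source, durations, and checkpoint path are illustrative assumptions:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowedWordCount")
ssc = StreamingContext(sc, 1)                      # 1-second batches
ssc.checkpoint("checkpoint-dir")                   # required for windowed state (hypothetical path)

pairs = (ssc.socketTextStream("localhost", 9999)
            .flatMap(lambda line: line.split(" "))
            .map(lambda w: (w, 1)))

# Count words over the last 30 seconds, recomputed every 10 seconds.
windowed_counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,        # add counts entering the window
    lambda a, b: a - b,        # subtract counts leaving the window
    30, 10)

windowed_counts.pprint()
ssc.start()
ssc.awaitTermination()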
MLlib Operations
You can also easily use machine learning algorithms
provided by MLlib.
First of all, there are streaming machine learning algorithms
(e.g. Streaming Linear Regression, Streaming KMeans, etc.)
which can simultaneously learn from the streaming data as
well as apply the model on the streaming data.
Beyond these, for a much larger class of machine learning
algorithms, you can learn a learning model offline (i.e.
using historical data) and then apply the model online on
streaming data.
Caching / Persistence
Similar to RDDs, DStreams also allow developers to persist the
stream’s data in memory. That is, using the persist() method on
a DStream will automatically persist every RDD of that
DStream in memory.
Checkpointing
A streaming application must operate 24/7 and hence must be
resilient to failures unrelated to the application logic (e.g., system
failures, JVM crashes, etc.). There are two types of data that are
checkpointed.
Metadata checkpointing - Saving of the information defining the
streaming computation to fault-tolerant storage like HDFS. This is
used to recover from failure of the node running the driver of the
streaming application. Metadata includes:
Configuration - The configuration that was used to create the
streaming application.
DStream operations - The set of DStream operations that define the
streaming application.
Incomplete batches - Batches whose jobs are queued but have not
completed yet.
Data checkpointing - Saving of the generated RDDs to reliable
storage. This is necessary in some stateful transformations that
combine data across multiple batches
intermediate RDDs of stateful transformations are periodically
checkpointed to reliable storage (e.g. HDFS) to cut off the
dependency chains.
(2) Spark Structured Streaming
A scalable and fault-tolerant stream processing engine built on
the Spark SQL engine
You can express your streaming computation the same way
you would express a batch computation on static data.
The Spark SQL engine will take care of running it incrementally
and continuously and updating the final result as streaming
data continues to arrive.
You can use the Dataset/DataFrame API in Scala, Java,
Python or R to express streaming aggregations, event-time
windows, stream-to-batch joins, etc.
The computation is executed on the same optimized Spark SQL
engine.
the system ensures end-to-end exactly-once fault-tolerance
guarantees through checkpointing and Write-Ahead Logs.
Two modes of Structured
Streaming in Spark
Structured Streaming queries are processed using a
micro-batch processing engine, which processes data
streams as a series of small batch jobs thereby
achieving end-to-end latencies as low as 100
milliseconds and exactly-once fault-tolerance
guarantees.
Since Spark 2.3, a new low-latency processing mode
called Continuous Processing, which can achieve
end-to-end latencies as low as 1 millisecond with
at-least-once guarantees.
Without changing the Dataset/DataFrame operations
in your queries, you will be able to choose the mode
based on your application requirements.
Programming Model
Treat a live data stream as a table that is being continuously appended, and run the streaming computation as an incremental query on the unbounded input table
Results
A query on the input will generate the "Result Table"
Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table
Modes of Output
Complete Mode: The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
Append Mode: Only the new rows appended to the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
Update Mode: Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this differs from Complete Mode in that it only outputs the rows that have changed since the last trigger. If the query doesn't contain aggregations, it is equivalent to Append Mode.
Wordcount example
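A representative Structured Streaming word-count sketch in Python (the socket source on localhost:9999 and the console sink are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# Unbounded input table: one row per line received on the socket
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split lines into words and count them (the Result Table)
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Write the whole Result Table to the console on every trigger (Complete Mode)
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()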
API using Datasets and
DataFrames
Since Spark 2.0, DataFrames and Datasets can
represent static, bounded data, as well as streaming,
unbounded data.
Similar to static Datasets/DataFrames, you can use the
common entry point SparkSession
(Scala/Java/Python/R docs) to create streaming
DataFrames/Datasets from streaming sources, and
apply the same operations on them as static
DataFrames/Datasets.
Streaming DataFrames can be created through the
DataStreamReader interface returned by
SparkSession.readStream().
Operations on streaming
DataFrames/Datasets
You can apply all kinds of operations on streaming
DataFrames/Datasets – ranging from untyped, SQL-like
operations (e.g. select, where, groupBy), to typed
RDD-like operations (e.g. map, filter, flatMap).
Window Operations on Event Time: Aggregations over
a sliding event-time window
