Spark
Hadoop Limitations
[Figure: in Hadoop MapReduce, each iteration (Iteration 1, Iteration 2, ...) reads its input from and writes its output to HDFS, paying disk I/O on every step]
Apache Spark
Apache Spark supports data analysis, machine learning, graphs,
streaming data, etc. It can read from and write to a range of data
sources and allows development in multiple languages.
[Spark stack: DataFrames and ML Pipelines sit on top of Spark SQL, Spark Streaming, MLlib, and GraphX, which are all built on Spark Core]
Data sources: Alluxio, Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and
HPC-style file systems (GlusterFS, Lustre)
Spark Core
contains the basic functionality of Spark, including components for
task scheduling, memory management, fault recovery, interacting
with storage systems and more
Spark Core is also home to the API that defines resilient distributed
datasets (RDDs), which are Spark’s main programming abstraction
Spark SQL
package for working with structured data
allows querying data via SQL as well as HQL (Hive Query
Language)
supports many sources of data, including Hive tables, Parquet and
JSON
Spark SQL allows developers to intermix SQL queries with the
programmatic data manipulations supported by RDDs in Python,
Java, and Scala, all within a single application, thus combining SQL
with complex analytics
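As a quick illustration of this intermixing, here is a minimal PySpark sketch using the newer DataFrame-based API rather than raw RDDs (the file name people.json and the column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-mix-example").getOrCreate()

# Load a JSON source into a DataFrame and register it for SQL queries
people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

# Query via SQL ...
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

# ... and intermix with programmatic manipulation in the same application
adults.orderBy("age", ascending=False).show()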
Cluster Managers: Spark runs over a variety of cluster managers, including
Hadoop YARN, Apache Mesos, and a simple cluster manager
included in Spark itself called the Standalone Scheduler
Spark Streaming
a Spark component that enables processing of live streams of
data.
Examples: log files generated by production web servers, or queues
of messages containing status updates posted by users of a web
service.
provides an API for manipulating data streams that closely
matches the Spark Core’s RDD API, making it easy for programmers
to move between applications that manipulate data stored in
memory, on disk, or arriving in real time.
Underneath its API, Spark Streaming was designed to provide the
same degree of fault tolerance, throughput, and scalability as
Spark Core
MLlib: a library containing common machine learning (ML)
functionality
GraphX: a library for manipulating graphs (e.g., a social
network’s friend graph) and performing graph-parallel
computations.
Spark Architecture
Resilient Distributed Datasets
(RDDs) – key Spark construct
Simply an immutable distributed collection of
objects spread across a cluster stored in RAM or
disk
Each RDD is split into multiple partitions, which
may be computed on different nodes of the
cluster
RDDs can contain any type of Python, Java, or
Scala objects, including user defined classes.
Created through lazy parallel transformations
Automatically rebuilt on failure
Resilient Distributed Dataset
(RDD) – key Spark construct
RDDs (Resilient Distributed Datasets) are data containers
RDDs represent data or transformations on data
RDDs can be created from Hadoop InputFormats
(such as HDFS files), “parallelize()” datasets, or by
transforming other RDDs (you can stack RDDs)
Actions can be applied to RDDs; actions force
calculations and return values
Lazy evaluation: Nothing computed until an action
requires it
RDDs are best suited for applications that apply the
same operation to all elements of a dataset
Less suitable for applications that make
asynchronous fine-grained updates to shared state
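A minimal PySpark sketch of these points, assuming a local cluster: an RDD is created with parallelize(), transformations are stacked lazily, and an action forces the computation.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-basics")

# Create an RDD by parallelizing a Python collection
nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: nothing has been computed yet
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions force the calculation and return values to the driver
print(evens.collect())   # [4, 16]
print(squares.count())   # 5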
Fault Tolerance
• RDDs contain lineage graphs (coarse-grained
updates/transformations) that are used to rebuild partitions that were lost
• Only the lost partitions of an RDD need to be recomputed upon
failure.
• They can be recomputed in parallel on different nodes without
having to roll back the entire app
• Also lets a system tolerate slow nodes (stragglers) by running a
backup copy of the troubled task.
• Original process on straggling node will be killed when new
process is complete
• Cached/checkpointed partitions are also used to recompute lost
partitions if available in shared memory
Spark – RDD Persistence
Spark’s RDDs are by default recomputed each time you run an
action on them.
You can also persist (cache) an RDD if you know it will be
needed again
When you persist an RDD, each node stores any partitions of it
that it computes in memory and reuses them in other actions on
that dataset (or datasets derived from it)
Allows future actions to be much faster (often >10x).
Mark RDD to be persisted using the persist() or cache() methods
on it. The first time it is computed in an action, it will be kept in
memory on the nodes.
Cache is fault-tolerant – if any partition of an RDD is lost, it will
automatically be recomputed using the transformations that
originally created it
Can choose storage level (MEMORY_ONLY, DISK_ONLY,
MEMORY_AND_DISK, etc.)
Can manually call unpersist()
If data is too big to be cached, then it will spill to disk with Least
Recently Used (LRU) replacement policy
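A small sketch of persisting an RDD (the input file name is hypothetical):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-example")

words = sc.textFile("data.txt").flatMap(lambda line: line.split())

# Mark the RDD to be kept in memory once it has been computed
words.persist(StorageLevel.MEMORY_ONLY)   # words.cache() uses the default level

print(words.count())              # first action: computes and caches the partitions
print(words.distinct().count())   # reuses the cached partitions

words.unpersist()                 # manually release the cached partitions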
RDD Persistence (Storage Levels)
Transformations (create a new RDD):
map, flatMap, filter, sample, union, intersection, join, cogroup, cross,
groupByKey, reduceByKey, sortByKey, mapValues
Actions (return results to the driver program):
collect, reduce, count, first, take, takeOrdered, takeSample, countByKey,
save, lookupKey, foreach
Sample Spark transformations
mapPartitions(func)
mapPartitions() converts each partition of the source RDD
into zero or more elements of the result. It is like map(), but
instead of being applied to each element, the supplied function
runs once per partition (block) of the RDD, and the partitions
are processed in parallel.
mapPartitionsWithIndex()
Like mapPartitions(), but the supplied function additionally
receives an integer value representing the index of the
partition it is processing.
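A small PySpark sketch of both transformations (partition counts and data are made up):

from pyspark import SparkContext

sc = SparkContext("local[*]", "map-partitions-example")
rdd = sc.parallelize(range(1, 9), 4)   # 8 elements in 4 partitions

# mapPartitions: the function receives an iterator over one whole partition
def partition_sum(iterator):
    yield sum(iterator)

print(rdd.mapPartitions(partition_sum).collect())   # one sum per partition, e.g. [3, 7, 11, 15]

# mapPartitionsWithIndex: the function also receives the partition index
def tag_with_index(index, iterator):
    return ((index, x) for x in iterator)

print(rdd.mapPartitionsWithIndex(tag_with_index).collect())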
More transformations
groupByKey()
When groupByKey() is applied to a dataset of (K, V) pairs, the data is shuffled
according to the key K into a new RDD. A lot of unnecessary data can be
transferred over the network in this transformation.
reduceByKey()
When reduceByKey() is applied to a dataset of (K, V) pairs, pairs with the same
key are combined on each machine before the data is shuffled.
sortByKey()
When sortByKey() is applied to a dataset of (K, V) pairs, the data is sorted
according to the key K into a new RDD.
join()
Join is a database term: it combines fields from two tables using common
values. join() in Spark is defined on pair RDDs, i.e. RDDs whose elements are
tuples in which the first element is the key and the second is the value.
coalesce()
coalesce() avoids a full shuffle of data by reusing existing partitions, so less
data is moved, and it lets us cut the number of partitions. For example, if data
is spread over four nodes and we want only two, the data from the extra nodes
is moved onto the nodes we keep.
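A small PySpark sketch of the pair-RDD transformations above (the data is made up):

from pyspark import SparkContext

sc = SparkContext("local[*]", "pair-rdd-example")

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)], 4)

# reduceByKey combines values per key on each machine before shuffling
counts = pairs.reduceByKey(lambda x, y: x + y)
print(counts.sortByKey().collect())            # [('a', 2), ('b', 2), ('c', 1)]

# join matches keys of two pair RDDs and pairs up their values
ages = sc.parallelize([("a", 30), ("b", 25)])
print(counts.join(ages).collect())             # e.g. [('a', (2, 30)), ('b', (2, 25))]

# coalesce reduces the number of partitions without a full shuffle
print(counts.coalesce(2).getNumPartitions())   # 2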
Narrow vs. Wide transformations
[Figure: map is a narrow transformation (each output partition depends on a single input partition); groupByKey is a wide transformation (each output partition depends on many input partitions and requires a shuffle)]
Lineage Graph
[Figure: example DAG of RDDs connected by transformations]
DAGs track dependencies (also known as lineage)
nodes are RDDs
arrows are transformations
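One way to see the lineage Spark records is toDebugString(), which prints the chain of dependencies behind an RDD (a minimal PySpark sketch):

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-example")

rdd = (sc.parallelize(range(100))
         .map(lambda x: (x % 10, x))
         .reduceByKey(lambda a, b: a + b)
         .filter(lambda kv: kv[1] > 100))

# Prints the DAG of dependencies tracked for this RDD
print(rdd.toDebugString().decode())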
Actions
What is an action
The final stage of the workflow
Triggers the execution of the DAG
Returns the results to the driver
Or writes the data to HDFS or to a file
Sample Spark Actions
reduce(func): Aggregate the elements of the
dataset using a function func (which takes two
arguments and returns one). The function should
be commutative and associative so that it can
be computed correctly in parallel.
collect(): Return all the elements of the dataset
as an array at the driver program. This is usually
useful after a filter or other operation that returns
a sufficiently small subset of the data.
count(): Return the number of elements in the
dataset.
foreach()
foreach() is useful when we want to apply an operation to each
element of an RDD without returning a value to the driver,
for example inserting each record into a database.
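A small PySpark sketch of these actions (the data is made up; the foreach side effect is a placeholder for, e.g., a database insert):

from pyspark import SparkContext

sc = SparkContext("local[*]", "actions-example")
nums = sc.parallelize([5, 3, 8, 1, 9, 2])

print(nums.reduce(lambda a, b: a + b))   # 28 (func is commutative and associative)
print(nums.count())                      # 6
print(nums.collect())                    # all elements returned to the driver
print(nums.take(3))                      # first 3 elements
print(nums.takeOrdered(3))               # [1, 2, 3]

# foreach runs on the executors and returns nothing to the driver
nums.foreach(lambda x: None)             # placeholder for a side effect per element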
Spark Workflow
[Figure: the driver program creates a SparkContext, sends work to the cluster, and collect() returns the results to the driver]
When to use RDDs?
Consider these common scenarios, i.e. use RDDs when:
you want low-level transformation and actions and
control on your dataset;
your data is unstructured, such as media streams or
streams of text;
you want to manipulate your data with functional
programming constructs rather than domain-specific
expressions;
you don’t care about imposing a schema, such as
columnar format, while processing or accessing data
attributes by name or column; and
you can forgo some optimization and performance
benefits available with DataFrames and Datasets for
structured and semi-structured data.
Spark SQL, DataFrames and
Datasets
Spark SQL is a Spark module for structured data processing.
Unlike the basic Spark RDD API, the interfaces provided by
Spark SQL provide Spark with more information about the
structure of both the data and the computation being
performed.
Internally, Spark SQL uses this extra information to perform
extra optimizations.
There are several ways to interact with Spark SQL including
SQL, DataFrame API and the Dataset API.
When computing a result, the same execution engine is used,
independent of which API/language you are using to express
the computation.
This unification means that developers can easily switch back
and forth between different APIs based on which provides the
most natural way to express a given transformation.
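To illustrate the unification, here is a minimal PySpark sketch (the users data and column names are made up): the same computation is expressed once via SQL and once via the DataFrame API, and both run on the same execution engine.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-api-example").getOrCreate()

users = spark.createDataFrame(
    [("Alice", 34), ("Bob", 19), ("Carol", 45)], ["name", "age"])
users.createOrReplaceTempView("users")

via_sql = spark.sql("SELECT name FROM users WHERE age > 21")
via_api = users.filter(users.age > 21).select("name")

via_sql.show()
via_api.show()   # same result, same engine underneath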
Problem with RDD
DataFrame & DataSet
Spark 2.0
DataFrames
Like an RDD, a DataFrame is an immutable distributed
collection of data
Unlike an RDD, the data is organized into named columns, like
a table in a relational database; DataFrames have a schema
Designed to make processing of large data sets even easier,
a DataFrame allows developers to impose a structure onto a
distributed collection of data, allowing a higher-level
abstraction
It provides a domain-specific language API to manipulate
distributed data and makes Spark accessible to a wider
audience, beyond specialized data engineers.
DataFrames are cached and optimized by Spark
DataFrames are built on top of the RDDs and the core
Spark API
DataFrames
Similar to a relational database table, a Python pandas
DataFrame, or R's data.table
Immutable once constructed
Track lineage
Enable distributed computations
How to construct DataFrames (a sketch follows this list):
Read from file(s)
Transform an existing DataFrame (Spark or pandas)
Parallelize a Python collection (list)
Then apply transformations and actions
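A minimal PySpark sketch of these construction paths (file name and data are hypothetical):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-construction").getOrCreate()

# 1) Read from a file
df_file = spark.read.csv("people.csv", header=True, inferSchema=True)

# 2) Transform an existing pandas DataFrame into a Spark DataFrame
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 19]})
df_pandas = spark.createDataFrame(pdf)

# 3) Parallelize a Python collection (list of tuples plus column names)
df_list = spark.createDataFrame([("Carol", 45), ("Dan", 23)], ["name", "age"])

# Then apply transformations and actions as usual
df_list.filter(df_list.age > 30).show()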
Datasets
In Apache Spark 2.0, these two APIs are unified and we can consider
Dataframe as an alias for a collection of generic objects Dataset[Row], where
a Row is a generic untyped JVM object. Dataset, by contrast, is a collection
of strongly-typed JVM objects.
Spark checks whether the types in a DataFrame align with its schema at
run time, not at compile time. Elements of a DataFrame are of Row type,
and Row cannot be parameterized with a type by the compiler, so the
compiler cannot check element types. Because of that, a DataFrame is
untyped and not type-safe.
Datasets, on the other hand, check whether types conform to the specification
at compile time. That is why Datasets are type-safe.
Benefits of DataFrame and
Dataset APIs
Static-typing and runtime type-safety
In DataFrames and Datasets you can catch errors at
compile time which saves developer-time and costs
Example: Space
Example: performance
DataFrame example
# Create a new DataFrame that contains "students"
students = users.filter(users.age < 21)
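A self-contained version of the snippet above (a sketch; the users data is made up here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("students-example").getOrCreate()

users = spark.createDataFrame(
    [("Alice", 34), ("Bob", 19), ("Carol", 17)], ["name", "age"])

# Create a new DataFrame that contains "students"
students = users.filter(users.age < 21)
students.show()   # Bob and Carol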
• Sequential access
• Heavy hitters
• Longest increasing subsequence
• ….
Sampling
• Sampling: selection of a subset of items from a
large data set
Sampling framework
• Algorithm A chooses every incoming element with a
certain probability
• If the element is “sampled”, A puts it into memory,
otherwise the element is discarded
• Depending on different situations, algorithm A may
discard some items from memory after having added
them
• For every query of the data set, algorithm A computes
some function only based on the in-memory sample
Reservoir sampling
1. Sample the first k elements from the stream
2. Sample the ith element (i>k) with probability k/i (if
sampled, randomly replace a previously sampled
item)
• Limitations:
• The desired sample must fit into main memory
• Distributed sampling is not possible (all elements
need to be processed sequentially)
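A small plain-Python sketch of the algorithm described above (the stream and sample size are placeholders):

import random

def reservoir_sample(stream, k):
    """Uniform random sample of k elements from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # 1. Keep the first k elements
            reservoir.append(item)
        else:
            # 2. Keep the i-th element with probability k/i,
            #    replacing a uniformly chosen previously sampled item
            j = random.randint(1, i)
            if j <= k:
                reservoir[j - 1] = item
    return reservoir

print(reservoir_sample(range(10000), 100))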
Reservoir sampling example
[Figure: worked example of reservoir sampling on a numeric stream]
Min-wise sampling
Task: Given a data stream of unknown length, randomly
pick k elements from the stream so that each element
has the same probability of being chosen.
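One standard way to do this is min-wise sampling: tag every element with an independent uniform random value and keep the k elements with the smallest tags; every element then has the same chance of being chosen, and samples built over different parts of the stream can be merged. A small plain-Python sketch (names and sizes are illustrative):

import heapq
import random

def minwise_sample(stream, k):
    """Keep the k elements whose random tags are smallest."""
    # Heap of (-tag, item): the root holds the largest tag currently kept
    heap = []
    for item in stream:
        tag = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (-tag, item))
        elif tag < -heap[0][0]:
            heapq.heapreplace(heap, (-tag, item))
    return [item for _, item in heap]

print(minwise_sample(range(10000), 100))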
Summarizing vs. filtering
• So far: all data is useful; summarise it due to the lack of space/time
Problem statement
• A set W containing m values (e.g. IP addresses, email addresses, etc.)
Bloom filter: element testing
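Since the slide content for this step is not shown, here is a minimal plain-Python sketch of how element testing in a Bloom filter works (the salted SHA-256 hash functions and sizes are choices made for illustration): add() sets k bit positions, and might_contain() reports True only if all k positions are set, which allows false positives but never false negatives.

import hashlib

class BloomFilter:
    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(n_bits=8000, n_hashes=5)
bf.add("192.168.0.1")
print(bf.might_contain("192.168.0.1"))   # True
print(bf.might_contain("10.0.0.1"))      # almost certainly False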
Bloom filter: how many hash functions are useful?
• Example: m = 10^9 whitelisted IP addresses and n = 8×10^9 bits in memory
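Using the standard Bloom filter analysis (with the slide's notation of m elements and n bits, the false-positive rate is minimized at k = (n/m) ln 2 hash functions), the example above works out roughly as follows:

import math

m = 1e9    # whitelisted IP addresses (elements)
n = 8e9    # bits of memory

k_opt = (n / m) * math.log(2)                    # optimal number of hash functions
fpr = (1 - math.exp(-k_opt * m / n)) ** k_opt    # false-positive rate at that k

print(round(k_opt, 2))   # ~5.55, so 5 or 6 hash functions are useful
print(round(fpr, 4))     # ~0.0214, i.e. about a 2% false-positive rate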
Requirements for Stream
Processing
▪ Second-scale latencies
▪ Simple programming model
▪ Integrated with batch & interactive processing
▪ Efficient fault-tolerance
Spark Streaming
Spark Streaming is an extension of the core Spark API that
enables scalable, high-throughput, fault-tolerant stream
processing of live data streams.
Basic sources
TCP socket: ssc.socketTextStream(...)
File stream: StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass]
Advanced sources: this category of sources requires
interfacing with external non-Spark libraries
Kafka
Kinesis
Flume
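A classic DStream sketch using the TCP socket source listed above (host, port, and batch interval are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-wordcount")   # >= 2 threads: receiver + processing
ssc = StreamingContext(sc, 1)                          # 1-second micro-batches

# Basic source: lines of text arriving on a TCP socket
lines = ssc.socketTextStream("localhost", 9999)

# Transformations on DStreams mirror the RDD API
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()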
Transformations on DStreams
Steps in Spark Streaming