Introduction To Spark
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2018SP
Recap
MapReduce
• For easily writing applications that process vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner
• Takes care of scheduling tasks, monitoring them, and re-executing failed tasks
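To make the recap concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It is not Hadoop code: the function `map_reduce` and the sample readings are hypothetical, and a real framework would also schedule the tasks across machines and re-execute failed ones.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy, single-process model of the MapReduce phases (not Hadoop's API)."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle phase: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical input: (city, temperature) readings.
readings = [("Ithaca", 3), ("Austin", 25), ("Ithaca", -1), ("Austin", 31)]
max_temp = map_reduce(
    readings,
    mapper=lambda rec: [(rec[0], rec[1])],
    reducer=lambda city, temps: max(temps),
)
print(max_temp)
```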
Spark (features)
Spark: General-purpose computational framework that substantially improves on the performance of MapReduce, but retains the basic model
• Memory-based data processing framework → avoids costly I/O by keeping intermediate results in memory
• Leverages distributed memory for repeated reads and writes
• Remembers the operations applied to a dataset via DAGs
• Data-locality-based computation → High Performance
• Well suited to both iterative (or stream-processing) and batch workloads
Today’s Topics
• Why Spark?
• Spark Basic Concepts
• Spark Basic Programming
Apache Spark
Note: Spark can connect to several types of cluster managers (Spark’s own standalone cluster manager, Mesos, or YARN)
Spark Ecosystem: A Unified Pipeline
Note: Spark is not designed for real-time IoT workloads. The streaming layer is meant for continuous input streams, such as financial data from stock markets, where events occur steadily and must be processed as they occur. But there is no notion of direct I/O with sensors or actuators, so Spark would not be suitable for IoT use cases.
Key ideas
In Hadoop, each developer tends to invent his or her own style of work.
With Spark, a serious effort is made to standardize the parallel code that people are writing, which often runs for many “cycles” or “iterations” in which a lot of information is reused.
How this works
You express your application as a graph of RDDs.
The graph is evaluated only as needed: Spark computes only the RDDs actually needed for the output you have requested.
Spark can then be told to cache the reusable information in memory, in SSD storage, or even on disk, based on when it will be needed again, how big it is, and how costly it would be to recreate.
You write the RDD logic and control all of this via hints.
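The idea of evaluating only what the requested output needs, and caching what gets reused, can be sketched with a toy dependency graph in plain Python. This is not Spark's API: the `Node` class, its `cache` flag (standing in for a persistence hint), and the `runs` counter are hypothetical names used only for illustration.

```python
class Node:
    """Toy stand-in for an RDD: a named step that depends on parent steps."""
    def __init__(self, name, compute, parents=(), cache=False):
        self.name = name
        self.compute = compute        # function of the parents' results
        self.parents = list(parents)
        self.cache = cache            # the "hint": keep the result after first use
        self._cached = None
        self.runs = 0                 # how many times this node was computed

    def evaluate(self):
        if self.cache and self._cached is not None:
            return self._cached       # reuse instead of recomputing
        inputs = [p.evaluate() for p in self.parents]
        self.runs += 1
        result = self.compute(*inputs)
        if self.cache:
            self._cached = result
        return result

source = Node("source", lambda: list(range(10)))
evens = Node("evens", lambda xs: [x for x in xs if x % 2 == 0], [source], cache=True)
squares = Node("squares", lambda xs: [x * x for x in xs], [evens])
unused = Node("unused", lambda xs: [x + 1 for x in xs], [source])

first = squares.evaluate()    # computes source, evens, squares
second = squares.evaluate()   # reuses the cached evens; source is not touched
print(first, source.runs, evens.runs, squares.runs, unused.runs)
```

The `unused` node is never computed because no requested output depends on it, and the cached `evens` node shields `source` from being recomputed on the second request.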
Today’s Topics
• Motivation
• Spark Basics
• Spark Programming
Spark Basics (1)
Spark: Flexible, in-memory data processing framework written in Scala
Goals:
• Simplicity (easier to use):
Rich APIs for Scala, Java, and Python
• Generality: APIs for different types of workloads
Batch, Streaming, Machine Learning, Graph
• Low latency (performance): In-memory processing and caching
• Fault tolerance: Ability to recover from losses (resilient)
Spark Basics (2)
There are two ways to manipulate data in Spark:
• Spark Shell:
Interactive – for learning or data exploration
Python or Scala
• Spark Applications
For large scale data processing
Spark Shell
The Spark Shell provides interactive data exploration (REPL)
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Datasets
• Transformations
• Actions
Spark Context (1)
• Every Spark application requires a Spark context: the main entry point to the Spark API
• The Spark Shell provides a preconfigured Spark context called “sc”
Spark Context (3)
The Spark context works as a client and represents the connection to a Spark cluster.
Resilient Distributed Dataset
An RDD (Resilient Distributed Dataset) is the fundamental unit of data in Spark: an immutable collection of objects (or records, or elements) that can be operated on “in parallel” (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recovers from node failures
• An RDD keeps its lineage information, so it can be recreated from its parent RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions (more partitions – more parallelism)
Dataset -- initial data can come from a file or be created programmatically
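The "resilient" property can be illustrated with a toy model in plain Python, in which a derived dataset remembers its parent and the function that produced it. This is only a sketch under simplifying assumptions: `ToyRDD` is a hypothetical class, not Spark's implementation, and it models only one-to-one partition dependencies such as `map`.

```python
class ToyRDD:
    """Toy model of lineage: each dataset remembers its parent and the
    function that derives it, so a lost partition can be rebuilt."""
    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent
        self.fn = fn
        # Source datasets hold data; derived ones start with empty slots.
        if partitions is not None:
            self.partitions = partitions
        else:
            self.partitions = [None] * len(parent.partitions)

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def get_partition(self, i):
        if self.partitions[i] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; reload it from storage")
            # Rebuild only partition i from the parent's partition i.
            parent_part = self.parent.get_partition(i)
            self.partitions[i] = [self.fn(x) for x in parent_part]
        return self.partitions[i]

    def collect(self):
        return [x for i in range(len(self.partitions)) for x in self.get_partition(i)]

base = ToyRDD(partitions=[[1, 2], [3, 4], [5, 6]])
doubled = base.map(lambda x: x * 2)
before = doubled.collect()
doubled.partitions[1] = None    # simulate losing one in-memory partition
after = doubled.collect()       # the lost partition is recomputed from lineage
print(before, after)
```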
RDDs
Key Idea: Write applications in terms of transformations on
distributed datasets. One RDD per transformation.
• Organize the RDDs into a DAG showing how data flows.
• An RDD can be saved and reused, or recomputed. Spark can save it to disk if the dataset does not fit in memory
• Built through parallel transformations (map, filter, group-by, join, etc.) and automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
RDDs are designed to be “immutable”
• Create once, then reuse without changes. Spark knows the lineage, so an RDD can be recreated at any time → Fault tolerance
• Avoids data inconsistency problems (no simultaneous updates) → Correctness
• Can live in memory as easily as on disk → Caching
• Safe to share across processes/tasks → Improves performance
• Tradeoff: (Fault tolerance & Correctness) vs (Disk, Memory & CPU)
Creating an RDD
Three ways to create an RDD:
• From a file or set of files
• From data in memory
• From another RDD
Example: A File-based RDD
RDD Operations
Two types of operations:
• Transformations: define a new RDD based on the current RDD(s)
• Actions: return values
RDD Transformations
• A set of operations on an RDD that define how it should be transformed
• As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable)
• Transformations are lazily evaluated, which allows optimizations to take place before execution
• Examples: map(), filter(), groupByKey(), sortByKey(), etc.
Example: map and filter Transformations
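As a rough local analogy (not the figure from the original slide), Python's built-in `map` and `filter` behave much like the RDD transformations of the same names: each returns a new lazy object and leaves its input untouched. The word list below is hypothetical sample data; on a real RDD the same chain would be written with the RDD's own `map` and `filter` methods.

```python
words = ["spark", "hadoop", "rdd", "mapreduce"]

# map: apply a function to every element, producing a new dataset.
upper = map(str.upper, words)
# filter: keep only the elements that satisfy a predicate.
long_words = filter(lambda w: len(w) > 5, upper)

# Nothing has been computed yet; both objects are lazy iterators.
# Materializing the result plays the role of a Spark action.
result = list(long_words)
print(result)
```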
RDD Actions
• Apply the transformation chains on RDDs, optionally performing some additional operations (e.g., counting)
• Some actions only store data to an external data source (e.g. HDFS); others fetch data from the RDD (and its transformation chain) to which the action is applied, and convey it to the driver
• Some common actions
count() – returns the number of elements
Lazy Execution of RDDs
Data in RDDs is not processed until an action is performed
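Lazy execution can be demonstrated with a small plain-Python model in which transformations are only recorded and an action runs them. `LazyDataset`, `traced_double`, and the `log` list are hypothetical names for this sketch; Spark's real scheduler is far more elaborate.

```python
log = []  # records every time the mapping function really runs

def traced_double(x):
    log.append(x)
    return 2 * x

class LazyDataset:
    """Toy model: transformations are only recorded; an action runs them."""
    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)

    def map(self, fn):                      # transformation: record only
        return LazyDataset(self.data, self.steps + (("map", fn),))

    def filter(self, pred):                 # transformation: record only
        return LazyDataset(self.data, self.steps + (("filter", pred),))

    def _run(self):
        for item in self.data:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):          # kind == "filter"
                    keep = False
                    break
            if keep:
                yield item

    def count(self):                        # action: triggers execution
        return sum(1 for _ in self._run())

pipeline = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(log)   # the pipeline is defined, nothing has run
n = pipeline.count()
calls_after_action = len(log)    # the action forced every element through map
print(calls_before_action, n, calls_after_action)
```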
Example: Mine error logs
Load error messages from a log into memory, then interactively
search for various patterns:
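The code from the original slide did not survive extraction, so the following is only a local sketch of the same workflow in plain Python, with hypothetical tab-separated log lines. On a cluster the equivalent steps would be a filter for error lines, a map to extract the message field, a request to cache the result, and repeated counting queries against the cached data.

```python
# Hypothetical log lines; in Spark these would be read from a file.
log_lines = [
    "INFO\t2018-03-01\tserver started",
    "ERROR\t2018-03-01\tmysql connection refused",
    "ERROR\t2018-03-02\tphp timeout",
    "WARN\t2018-03-02\tdisk almost full",
    "ERROR\t2018-03-03\tmysql deadlock detected",
]

# Keep only error records, then extract the message field (third column).
errors = [line for line in log_lines if line.startswith("ERROR")]
messages = [line.split("\t")[2] for line in errors]  # kept in memory, like a cached RDD

def count_matching(pattern):
    """Interactive query: reuses the in-memory messages, no re-read of the log."""
    return sum(1 for m in messages if pattern in m)

mysql_errors = count_matching("mysql")
php_errors = count_matching("php")
print(mysql_errors, php_errors)
```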
Key Idea: Elastic parallelism
RDD operations are designed to be embarrassingly parallel.
Spark spreads the work over the nodes where the data resides, offering highly concurrent execution that minimizes delays. The term for this is “partitioned computation”.
If some component crashes, or is even just slow, Spark simply kills that task and launches a substitute.
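A minimal sketch of partitioned computation with task re-launch, using threads on one machine in place of cluster nodes. The names `sum_partition` and `run_with_retry` and the injected failure are hypothetical; the sketch only retries a task after it fails and does not model how Spark handles tasks that are merely slow.

```python
from concurrent.futures import ThreadPoolExecutor

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
attempts = {}  # partition index -> number of attempts made

def sum_partition(index):
    """One task: process a single partition. The first attempt on
    partition 1 fails on purpose to simulate a crashed worker."""
    attempts[index] = attempts.get(index, 0) + 1
    if index == 1 and attempts[index] == 1:
        raise RuntimeError("simulated worker failure")
    return sum(partitions[index])

def run_with_retry(index, max_attempts=3):
    for _ in range(max_attempts):
        try:
            return sum_partition(index)
        except RuntimeError:
            continue  # launch a substitute task for the same partition
    raise RuntimeError(f"partition {index} failed {max_attempts} times")

# One task per partition, run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(run_with_retry, range(len(partitions))))

total = sum(partial_sums)
print(partial_sums, total, attempts[1])
```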
RDD and Partitions (Parallelism example)
RDD Graph: Data Set vs Partition Views
Much like in Hadoop MapReduce, each RDD is associated with (input) partitions.
RDDs: Data Locality
• Data Locality Principle
Keep high-value RDDs precomputed, in cache or on SSD.
Run tasks that need a specific RDD, with those same inputs, on the node where the cached copy resides.
This can maximize in-memory computational performance.
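The placement rule can be sketched as a tiny scheduling function. Everything here is hypothetical (the node names, the partition labels, and the fallback of picking the first node); it only illustrates preferring the node that already holds the cached input.

```python
def pick_node(task_input, cache_locations, all_nodes):
    """Prefer a node that already caches the task's input; otherwise
    fall back to any node (here simply the first one)."""
    preferred = cache_locations.get(task_input, [])
    for node in all_nodes:
        if node in preferred:
            return node
    return all_nodes[0]

nodes = ["node-1", "node-2", "node-3"]
# Which nodes hold a cached copy of which partition (hypothetical labels).
cache_locations = {"rdd-A/part-0": ["node-2"], "rdd-A/part-1": ["node-3"]}

inputs = ["rdd-A/part-0", "rdd-A/part-1", "rdd-B/part-0"]
placement = {inp: pick_node(inp, cache_locations, nodes) for inp in inputs}
print(placement)
```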
Why is this a good strategy?
Spark tries to run tasks that will need the same intermediary data on the same nodes.
If MapReduce jobs were arbitrary programs, this wouldn’t help because reuse would be
very rare.
But in fact the MapReduce model is very repetitious and iterative, and often applies the
same transformations again and again to the same input files.
Those particular RDDs become great candidates for caching.
A MapReduce programmer may not know how many iterations will occur, but Spark itself is smart enough to evict RDDs if they don’t actually get reused.
RDDs -- Summary
RDDs are partitioned, locality-aware, distributed collections
RDDs are immutable
RDDs are data structures that:
• Either point to a direct data source (e.g. HDFS), or
• Apply some transformations to their parent RDD(s) to generate new data elements
Computations on RDDs:
• Represented by lazily evaluated lineage DAGs composed of chained RDDs
Lifetime of a Job in Spark
Anatomy of a Spark Application
Cluster Manager (YARN/Mesos)
Iterative Algorithms: Spark vs MapReduce
Spark Programming (1)
Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
Spark Programming (3)
Basic Actions
nums = sc.parallelize([1, 2, 3])
Example: Word Count
lines = sc.textFile("hamlet.txt")
# Split lines into words, pair each word with 1, then sum the 1s per word
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))
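To see what each step of the word count produces without a cluster, here is a plain-Python walk-through of the same three stages. The helper `reduce_by_key` is a hypothetical local stand-in for the per-key merge, and the two input lines are sample data rather than the contents of hamlet.txt.

```python
def reduce_by_key(fn, pairs):
    """Merge the values of each key with a binary function."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged

lines = ["to be or not to be", "that is the question"]   # sample input
words = [w for line in lines for w in line.split(" ")]    # flatMap step
pairs = [(w, 1) for w in words]                           # map step
counts = reduce_by_key(lambda x, y: x + y, pairs)         # reduceByKey step
print(counts)
```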
Example: Spark Streaming
Represents streams as a series of RDDs over time (typically at sub-second intervals, but this is configurable)
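The "series of RDDs over time" idea amounts to cutting the stream into small batches by arrival time. The sketch below shows that grouping in plain Python; `micro_batches`, the event tuples, and the 0.5-second interval are assumptions made for illustration, and intervals with no events are simply skipped in this toy version.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length windows.
    Each returned list plays the role of one small RDD in the stream."""
    batches = {}
    for timestamp, value in events:
        batch_id = int(timestamp // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    return [batches[b] for b in sorted(batches)]

# Hypothetical events: (arrival time in seconds, payload).
events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d"), (1.3, "e")]
batches = micro_batches(events, batch_interval=0.5)

# Each batch can now be processed with ordinary batch logic.
counts_per_batch = [len(b) for b in batches]
print(batches, counts_per_batch)
```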
Spark: Combining Libraries (Unified Pipeline)
# Illustrative pseudocode; the step that builds `model` is not shown here
# Load data using Spark SQL
points = spark.sql("select latitude, longitude from tweets")
# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
Spark: Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for the number of tasks:
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
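What the task-count parameter controls can be sketched locally: keys are hashed into that many buckets, and each bucket is reduced independently, as one task would do. This `reduce_by_key` is a hypothetical toy, not Spark's partitioner; `zlib.crc32` is used only because it is stable across Python processes.

```python
import zlib

def reduce_by_key(pairs, fn, num_tasks):
    """Toy model: hash each key to one of `num_tasks` buckets, then
    reduce each bucket independently, as one task would."""
    buckets = [{} for _ in range(num_tasks)]
    for key, value in pairs:
        bucket = buckets[zlib.crc32(key.encode("utf-8")) % num_tasks]
        bucket[key] = fn(bucket[key], value) if key in bucket else value
    return buckets

pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
buckets = reduce_by_key(pairs, lambda x, y: x + y, num_tasks=5)

# Every key lands in exactly one bucket, so merging the buckets is trivial.
merged = {k: v for bucket in buckets for k, v in bucket.items()}
print(merged)
```

More tasks allow more parallelism, but each task then handles fewer keys, so the right number depends on the data size and the cluster.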
Summary