Intro To Apache Spark: Paco Nathan

download slides:
http://cdn.liber118.com/spark/dbc_bids.pdf
Getting Started
Getting Started:
• https://class02.cloud.databricks.com/
Getting Started: Key Features
Getting Started: Initial coding exercise
Workspace/training-paco/00.log_example
Open in one browser window, then rebuild a new notebook to run the code shown:
Spark Deconstructed
Spark Deconstructed: Log Mining Example
Workspace/training-paco/01.log_example
Open in one browser window, then rebuild a new notebook by copying its code cells:
Spark Deconstructed: Log Mining Example

Diagram: a Driver program coordinating three Workers
Spark Deconstructed: Log Mining Example

# base RDD
lines = sqlContext.table("error_log")

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
messages.toDebugString

res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
MappedRDD[3] at map at <console>:16 (3 partitions)
FilteredRDD[2] at filter at <console>:14 (3 partitions)
MappedRDD[1] at textFile at <console>:12 (3 partitions)
HadoopRDD[0] at textFile at <console>:12 (3 partitions)
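The same lineage can be printed from PySpark as well; a quick sketch, assuming the messages RDD defined above:

print(messages.toDebugString())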
Spark Deconstructed: Log Mining Example

Walkthrough of how the code above executes on the cluster:

Diagram: the Driver and three Workers
Diagram: the input data is distributed across the Workers as block 1, block 2, block 3
Diagram: for action 1, each Worker reads its block from HDFS
Diagram: each Worker processes its block and caches the data (cache 1, cache 2, cache 3)
Diagram: the cached partitions remain on the Workers
Diagram: action 2 is submitted from the Driver against the cached messages RDD
Diagram: for action 2, each Worker processes from cache
Spark Deconstructed: Log Mining Example
# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
Diagram: the base RDD

# base RDD
lines = sqlContext.table("error_log")
Spark Deconstructed: Log Mining Example
Diagram: transformations chaining one RDD to the next

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()
Spark Deconstructed: Log Mining Example
Diagram: transformations chain RDDs together; an action returns a value

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()
A Brief History
A Brief History:
Timeline: 2004, MapReduce paper; 2006, Hadoop @ Yahoo!; 2010, Spark paper
A Brief History: MapReduce
A Brief History: MapReduce
Open Discussion:
Enumerate several changes in data center technologies since 2002…
A Brief History: MapReduce
pistoncloud.com/2013/04/storage-and-the-mobility-gap/

meanwhile, spinny disks haven't changed all that much…

storagenewsletter.com/rubriques/hard-disk-drives/hdd-technology-trends-ibm/
A Brief History: MapReduce
A Brief History: MapReduce
Diagram: MapReduce surrounded by specialized systems: Pregel, Giraph, Impala, GraphLab, Storm, S4
A Brief History: Spark
spark.apache.org
A Brief History: Key distinctions for Spark vs. MapReduce
• generalized patterns: unified engine for many use cases
TL;DR: Smashing The Previous Petabyte Sort Record
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
TL;DR: Sustained Exponential Growth
TL;DR: Spark Expertise Tops Median Salaries within Big Data
oreilly.com/data/free/2014-data-science-salary-survey.csp
Coding Exercises
Coding Exercises: WordCount
Definition: count how often each word appears in a collection of text documents

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

A distributed computing framework that can run WordCount efficiently in parallel at scale can likely handle much larger and more interesting compute problems.
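A minimal PySpark sketch of the same algorithm, assuming a SparkContext sc is available (the full exercise notebook is referenced below):

# flatMap/map play the role of map(), reduceByKey plays the role of reduce
lines = sc.textFile("README.md")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.take(5)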
Coding Exercises: WordCount
Workspace/training-paco/02.wc_example
Open in one browser window, then rebuild a new notebook by copying its code cells:
Coding Exercises: Join
Workspace/training-paco/03.join_example
Open in one browser window, then rebuild a new notebook by copying its code cells:
Coding Exercises: Join – Operator Graph
Diagram: operator graph for the join example, with RDDs A, B, C, D and E across stage 1 (map()) and stage 2 (join()), and one cached partition
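A minimal sketch of join() in PySpark on two hypothetical pair RDDs (the exercise notebook uses its own datasets), assuming sc is available:

# join() matches pair RDDs on key and combines the values into tuples
a = sc.parallelize([("spark", 1), ("hadoop", 2)])
b = sc.parallelize([("spark", "fast"), ("hadoop", "batch")])
a.join(b).collect()
# e.g. [('spark', (1, 'fast')), ('hadoop', (2, 'batch'))]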
Coding Exercises: Workflow assignment
Spark Essentials
Spark Essentials:
The Scala and Python shells can be launched, respectively, with:

./bin/spark-shell
./bin/pyspark
Spark Essentials: SparkContext

Scala:
scala> sc
res: spark.SparkContext = spark.SparkContext@470d1f30

Python:
>>> sc
<pyspark.context.SparkContext object at 0x7f7570783350>
Spark Essentials: Master
master / description:
• local: run Spark locally with one worker thread (no parallelism)
• local[K]: run Spark locally with K worker threads (ideally set to # cores)
• spark://HOST:PORT: connect to a Spark standalone cluster; PORT depends on config (7077 by default)
• mesos://HOST:PORT: connect to a Mesos cluster; PORT depends on config (5050 by default)
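As a sketch of how a master URL gets used outside the notebooks (in Databricks Cloud the sc is created for you; the app name here is just an example):

from pyspark import SparkConf, SparkContext

# connect to a local master with 4 worker threads
conf = SparkConf().setMaster("local[4]").setAppName("intro_example")
sc = SparkContext(conf=conf)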
Spark Essentials: Master
spark.apache.org/docs/latest/cluster-overview.html

Diagram: two Worker Nodes, each running an Executor with a cache and tasks
Spark Essentials: Clusters
Diagram: Worker Nodes running Executors, each with a cache and tasks
Spark Essentials: RDD
Scala:
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e

Python:
>>> data = [1, 2, 3, 4, 5]
>>> data
[1, 2, 3, 4, 5]

>>> distData = sc.parallelize(data)
>>> distData
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229
Spark Essentials: RDD
Diagram: transformations chain RDDs together; an action materializes a value
Spark Essentials: RDD
Scala:
scala> val distFile = sc.textFile("README.md")
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08

Python:
>>> distFile = sc.textFile("README.md")
14/04/19 23:42:40 INFO storage.MemoryStore: ensureFreeSpace(36827) called with curMem=0, maxMem=318111744
14/04/19 23:42:40 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 36.0 KB, free 303.3 MB)
>>> distFile
MappedRDD[2] at textFile at NativeMethodAccessorImpl.java:-2
Spark Essentials: Transformations
Spark Essentials: Transformations
transformation / description:
• map(func): return a new distributed dataset formed by passing each element of the source through a function func
Spark Essentials: Transformations
transformation / description:
• groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
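A quick PySpark sketch of these two transformations on a small pair RDD, assuming sc is available:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# map: transform each element
pairs.map(lambda kv: (kv[0], kv[1] * 10)).collect()
# e.g. [('a', 10), ('b', 20), ('a', 30)]

# groupByKey: gather all values for each key
pairs.groupByKey().mapValues(list).collect()
# e.g. [('a', [1, 3]), ('b', [2])]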
Spark Essentials: Transformations
Scala:
val distFile = sc.textFile("README.md")
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()

(distFile is a collection of lines)

Python:
distFile = sc.textFile("README.md")
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
Spark Essentials: Transformations

(The same example repeated, highlighting the closures passed to map and flatMap.)
Spark Essentials: Actions
action / description:
• reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel
• collect(): return all the elements of the dataset as an array at the driver program; usually useful after a filter or other operation that returns a sufficiently small subset of the data
• count(): return the number of elements in the dataset
• first(): return the first element of the dataset (similar to take(1))
• take(n): return an array with the first n elements of the dataset; currently not executed in parallel, instead the driver program computes all the elements
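A quick sketch of these actions on a small RDD, assuming sc is available:

nums = sc.parallelize([1, 2, 3, 4, 5])

nums.reduce(lambda a, b: a + b)   # 15
nums.collect()                    # [1, 2, 3, 4, 5]
nums.count()                      # 5
nums.first()                      # 1
nums.take(3)                      # [1, 2, 3]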
Spark Essentials: Actions
action / description:
• saveAsTextFile(path): write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system; Spark will call toString on each element to convert it to a line of text in the file
• saveAsSequenceFile(path): write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system; only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.)
• countByKey(): only available on RDDs of type (K, V); returns a Map of (K, Int) pairs with the count of each key
• foreach(func): run a function func on each element of the dataset; usually done for side effects such as updating an accumulator variable or interacting with external storage systems
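A minimal sketch of countByKey and saveAsTextFile, assuming sc is available; the output directory is hypothetical:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

pairs.countByKey()   # defaultdict with {'a': 2, 'b': 1}

# saveAsTextFile writes one text part-file per partition under the directory;
# "/tmp/wc_out" is just an example path and must not already exist
pairs.map(lambda kv: "%s\t%d" % kv).saveAsTextFile("/tmp/wc_out")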
Spark Essentials: Actions
Scala:
val f = sc.textFile("README.md")
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect.foreach(println)

Python:
from operator import add
f = sc.textFile("README.md")
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.reduceByKey(add).collect()
Spark Essentials: Persistence
Spark Essentials: Persistence
storage level / description:
• MEMORY_ONLY: store RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed (this is the default level)
• MEMORY_AND_DISK: store RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed
• MEMORY_ONLY_SER: store RDD as serialized Java objects (one byte array per partition); this is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read
• MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed
• DISK_ONLY: store the RDD partitions only on disk
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc: same as the levels above, but replicate each partition on two cluster nodes
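A minimal sketch of choosing a storage level explicitly in PySpark (cache() uses the default level; persist() takes one of the levels above):

from pyspark import StorageLevel

lines = sc.textFile("README.md")
lines.persist(StorageLevel.MEMORY_AND_DISK)
lines.count()   # first action materializes and persists the RDD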
Spark Essentials: Persistence
Scala:
val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
w.reduceByKey(_ + _).collect.foreach(println)

Python:
from operator import add
f = sc.textFile("README.md")
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
w.reduceByKey(add).collect()
Spark Essentials: Broadcast Variables
Spark Essentials: Broadcast Variables
Scala:
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

Python:
broadcastVar = sc.broadcast(list(range(1, 4)))
broadcastVar.value
Spark Essentials: Accumulators
Spark Essentials: Accumulators
Scala:
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

accum.value

Python:
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])

def f(x):
    global accum
    accum += x

rdd.foreach(f)

accum.value
Spark Essentials: Accumulators

(The same example repeated, with a callout noting that accum.value is read driver-side.)
Spark Essentials: API Details
Follow-Up
certification:
• Anthony Joseph, UC Berkeley (begins 2015-02-23): edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181
• Ameet Talwalkar, UCLA (begins 2015-04-14): edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
books:
• Learning Spark, by Holden Karau, Andy Konwinski, Matei Zaharia (O'Reilly, 2015*): shop.oreilly.com/product/0636920028512.do
• Fast Data Processing with Spark, by Holden Karau (Packt, 2013): shop.oreilly.com/product/9781782167068.do
• Spark in Action, by Chris Fregly (Manning, 2015*): sparkinaction.com/