
Intro to Apache Spark

Paco Nathan, @pacoid



databricks.com/

download slides:
cdn.liber118.com/spark/dbc_bids.pdf

Licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Lecture Outline:

• login and get started with Apache Spark on Databricks Cloud

• understand theory of operation in a cluster

• a brief historical context of Spark, where it fits with other Big Data frameworks

• coding exercises: ETL, WordCount, Join, Workflow

• tour of the Spark API

• follow-up: certification, events, community resources, etc.

2
Getting Started
Getting Started:

Everyone will receive a username/password for one of the Databricks Cloud shards:

• https://class01.cloud.databricks.com/

• https://class02.cloud.databricks.com/

Run notebooks on your account at any time throughout the duration of the course. The accounts will be kept open afterwards, long enough to save/export your work.

4
Getting Started:

Workspace/databricks-guide/01 Quick Start

Open in a browser window, then follow the discussion of the notebook key features:

5
Getting Started:

Workspace/databricks-guide/01 Quick Start 


Key Features:

• Workspace, Folder, Notebook, Export



• Code Cells, run/edit/move

• Markdown

• Tables

6
Getting Started: Initial coding exercise

Workspace/training-paco/00.log_example

Open in one browser window, then rebuild a new notebook to run the code shown:

7
Spark Deconstructed
Spark Deconstructed: Log Mining Example

Workspace/training-paco/01.log_example

Open in one browser window, then rebuild a new notebook by copying its code cells:

9
Spark Deconstructed: Log Mining Example

# load error messages from a log into memory
# then interactively search for patterns

# base RDD
lines = sqlContext.table("error_log")

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()

10
Spark Deconstructed: Log Mining Example

We start with Spark running on a cluster… submitting code to be evaluated on it:

[diagram: a Driver connected to three Workers]

11
Spark Deconstructed: Log Mining Example

At this point, we can look at the transformed RDD operator graph:

messages.toDebugString

res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
MappedRDD[3] at map at <console>:16 (3 partitions)
FilteredRDD[2] at filter at <console>:14 (3 partitions)
MappedRDD[1] at textFile at <console>:12 (3 partitions)
HadoopRDD[0] at textFile at <console>:12 (3 partitions)

13
Spark Deconstructed: Log Mining Example

[diagram sequence: execution of the code above in the cluster]

• the Driver distributes tasks to three Workers, each holding one block of the data (block 1, block 2, block 3)

• for action 1, each Worker reads its HDFS block, processes it, and caches the resulting data (cache 1, cache 2, cache 3)

• for action 2, each Worker processes its partition again, this time reading from the cache rather than from HDFS
Spark Deconstructed: Log Mining Example

Looking at the RDD transformations and actions from another perspective…

[diagram: a chain of RDDs linked by transformations, ending in an action that returns a value]

# base RDD
lines = sqlContext.table("error_log")

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()

23
Spark Deconstructed: Log Mining Example

[diagram: a single base RDD]

# base RDD
lines = sqlContext.table("error_log")

24
Spark Deconstructed: Log Mining Example

[diagram: a chain of RDDs produced by transformations]

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

25
Spark Deconstructed: Log Mining Example

[diagram: transformations chain RDDs together; an action returns a value]

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

26
A Brief History
A Brief History:

2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark top-level

28
A Brief History: MapReduce

circa 1979 – Stanford, MIT, CMU, etc.



set/list operations in LISP, Prolog, etc., for parallel processing

www-formal.stanford.edu/jmc/history/lisp/lisp.htm

circa 2004 – Google



MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

research.google.com/archive/mapreduce.html

circa 2006 – Apache



Hadoop, originating from the Nutch Project

Doug Cutting

research.yahoo.com/files/cutting.pdf

circa 2008 – Yahoo



web scale search indexing

Hadoop Summit, HUG, etc.

developer.yahoo.com/hadoop/

circa 2009 – Amazon AWS



Elastic MapReduce

Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.

aws.amazon.com/elasticmapreduce/

29
A Brief History: MapReduce

Open Discussion:

Enumerate several changes in data center
technologies since 2002…

30
A Brief History: MapReduce

[chart: HDD technology trends – Rich Freitas, IBM Research]
pistoncloud.com/2013/04/storage-and-the-mobility-gap/

meanwhile, spinny disks haven’t changed all that much…
storagenewsletter.com/rubriques/hard-disk-drives/hdd-technology-trends-ibm/
31
A Brief History: MapReduce

MapReduce use cases showed two major limitations:

1. difficulty of programming directly in MR

2. performance bottlenecks, or batch not fitting the use cases

In short, MR doesn’t compose well for large applications

Therefore, people built specialized systems as workarounds…

32
A Brief History: MapReduce

[diagram: MapReduce for general batch processing, surrounded by specialized systems – Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4 – for iterative, interactive, streaming, graph, etc. workloads]

The State of Spark, and Where We're Going Next
Matei Zaharia
Spark Summit (2013)
youtu.be/nU6vO2EJAb4

33
A Brief History: Spark

Developed in 2009 at UC Berkeley AMPLab, then open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations

“Organizations that are looking at big data challenges – including collection, ETL, storage, exploration and analytics – should consider Spark for its in-memory performance and the breadth of its model. It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis.”

Gartner, Advanced Analytics and Data Science (2014)

spark.apache.org
34
A Brief History: Spark

2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark top-level

Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
USENIX HotCloud (2010)
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
NSDI (2012)
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

35
A Brief History: Spark

Unlike the various specialized systems, Spark’s goal was to generalize MapReduce to support new apps within the same engine

Two reasonably small additions are enough to express the previous models:

• fast data sharing

• general DAGs

This allows for an approach which is more efficient for the engine, and much simpler for the end users

36
A Brief History: Spark

37
A Brief History: Spark

used as libs, instead of specialized systems
38
A Brief History: Spark

Some key points about Spark:

• handles batch, interactive, and real-time within a single framework

• native integration with Java, Python, Scala

• programming at a higher level of abstraction

• more general: map/reduce is just one set of supported constructs

39
A Brief History: Key distinctions for Spark vs. MapReduce

• generalized patterns
  unified engine for many use cases

• lazy evaluation of the lineage graph
  reduces wait states, better pipelining

• generational differences in hardware
  off-heap use of large memory spaces

• functional programming / ease of use
  reduction in cost to maintain large apps

• lower overhead for starting jobs

• less expensive shuffles

40
TL;DR: Smashing The Previous Petabyte Sort Record

databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

41
TL;DR: Sustained Exponential Growth

Spark is one of the most active Apache projects


ohloh.net/orgs/apache

42
TL;DR: Spark Expertise Tops Median Salaries within Big Data

oreilly.com/data/free/2014-data-science-salary-survey.csp

43
Coding Exercises
Coding Exercises: WordCount

Definition:

count how often each word appears in a collection of text documents

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

This simple program provides a good test case for parallel processing, since it:

• requires a minimal amount of code

• demonstrates use of both symbolic and numeric values

• isn’t many steps away from search indexing

• serves as a “Hello World” for Big Data apps

A distributed computing framework that can run WordCount efficiently in parallel at scale can likely handle much larger and more interesting compute problems

45
Coding Exercises: WordCount

WordCount in 3 lines of Spark

WordCount in 50+ lines of Java MR
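For comparison, the Spark version looks roughly like this in PySpark (a sketch, assuming a README.md file as input; the same pattern appears later in the Actions examples):

Python:
f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
wc.collect()  # returns a list of (word, count) pairs to the driver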

46
Coding Exercises: WordCount

Workspace/training-paco/02.wc_example

Open in one browser window, then rebuild a new notebook by copying its code cells:

47
Coding Exercises: Join

Workspace/training-paco/03.join_example

Open in one browser window, then rebuild a new notebook by copying its code cells:

48
Coding Exercises: Join – Operator Graph

[operator graph: map() transformations over RDDs A, B, C, D feed a join() producing E, across three stages, with one cached partition]
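As a rough illustration of what a join produces, here is a tiny sketch with made-up pair RDDs (not the notebook's actual data):

Python:
a = sc.parallelize([("spark", 1), ("hadoop", 2)])             # hypothetical (K, V) pairs
b = sc.parallelize([("spark", "fast"), ("hadoop", "batch")])  # hypothetical (K, W) pairs
a.join(b).collect()  # [('spark', (1, 'fast')), ('hadoop', (2, 'batch'))]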

49
Coding Exercises: Workflow assignment

How to “think” in terms of leveraging notebooks, based on Computational Thinking:

1. create a new notebook

2. copy the assignment description as markdown

3. split it into separate code cells

4. for each step, write your code under the
markdown

5. run each step and verify your results

50
Coding Exercises: Workflow assignment

Let’s assemble the pieces of the previous few code examples. Using the readme and change_log tables:

1. create RDDs to filter each line for the keyword Spark

2. perform a WordCount on each, i.e., so the results are (K,V) pairs of (keyword, count)

3. join the two RDDs

4. how many instances of “Spark” are there?
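One possible sketch of these steps, mirroring the idioms from the earlier notebooks (exact table/column access may differ in your workspace):

Python:
# assumes each table exposes one line of text per row, as in the log example
readme = sqlContext.table("readme")
change_log = sqlContext.table("change_log")

def spark_wordcount(lines):
    # keep only lines mentioning "Spark", then count words as (keyword, count) pairs
    words = lines.filter(lambda x: "Spark" in x[0]).flatMap(lambda x: x[0].split(" "))
    return words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

joined = spark_wordcount(readme).join(spark_wordcount(change_log))
joined.lookup("Spark")  # counts of the keyword itself in each table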

51
Spark Essentials
Spark Essentials:

Intro apps, showing examples in both Scala and Python…

Let’s start with the basic concepts in:

spark.apache.org/docs/latest/scala-programming-guide.html

using, respectively:

./bin/spark-shell
./bin/pyspark

53
Spark Essentials: SparkContext

The first thing that a Spark program does is create a SparkContext object, which tells Spark how to access a cluster

In the shell for either Scala or Python, this is the sc variable, which is created automatically

Other programs must use a constructor to instantiate a new SparkContext

Then in turn SparkContext gets used to create other variables
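For example, a standalone PySpark program might construct one like this (a minimal sketch; the app name and master URL are placeholders):

Python:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
sc = SparkContext(conf=conf)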

54
Spark Essentials: SparkContext

Scala:
scala> sc
res: spark.SparkContext = spark.SparkContext@470d1f30

Python:
>>> sc
<pyspark.context.SparkContext object at 0x7f7570783350>

55
Spark Essentials: Master

The master parameter for a SparkContext determines which cluster to use

master – description

local – run Spark locally with one worker thread (no parallelism)

local[K] – run Spark locally with K worker threads (ideally set to # cores)

spark://HOST:PORT – connect to a Spark standalone cluster; PORT depends on config (7077 by default)

mesos://HOST:PORT – connect to a Mesos cluster; PORT depends on config (5050 by default)
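The same application code can target different clusters simply by changing the master, e.g. when launching the shells (illustrative values):

./bin/spark-shell --master local[4]
./bin/pyspark --master spark://HOST:7077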

56
Spark Essentials: Master

spark.apache.org/docs/latest/cluster-overview.html

[diagram: a Driver Program (SparkContext) connects through a Cluster Manager to Worker Nodes, each running an Executor with a cache and tasks]

57
Spark Essentials: Clusters

1. master connects to a cluster manager to allocate resources across applications

2. acquires executors on cluster nodes – processes that run compute tasks and cache data

3. sends app code to the executors

4. sends tasks for the executors to run

[diagram: Driver Program (SparkContext), Cluster Manager, and Worker Nodes with Executors, caches, and tasks]

58
Spark Essentials: RDD

Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel

There are currently two types:

• parallelized collections – take an existing Scala collection and run functions on it in parallel

• Hadoop datasets – run functions on each record of a file in Hadoop distributed file system or any other storage system supported by Hadoop

59
Spark Essentials: RDD

• two types of operations on RDDs: transformations and actions

• transformations are lazy (not computed immediately)

• the transformed RDD gets recomputed when an action is run on it (default)

• however, an RDD can be persisted into storage in memory or disk

60
Spark Essentials: RDD

Scala:
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e

Python:
>>> data = [1, 2, 3, 4, 5]
>>> data
[1, 2, 3, 4, 5]

>>> distData = sc.parallelize(data)
>>> distData
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229

61
Spark Essentials: RDD

Spark can create RDDs from any file stored in HDFS or other storage systems supported by Hadoop, e.g., local file system, Amazon S3, Hypertable, HBase, etc.

Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and can also take a directory or a glob (e.g. /data/201404*)

[diagram: transformations chain RDDs together; an action returns a value]

62
Spark Essentials: RDD

Scala:
scala> val distFile = sc.textFile("README.md")
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08

Python:
>>> distFile = sc.textFile("README.md")
14/04/19 23:42:40 INFO storage.MemoryStore: ensureFreeSpace(36827) called with curMem=0, maxMem=318111744
14/04/19 23:42:40 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 36.0 KB, free 303.3 MB)
>>> distFile
MappedRDD[2] at textFile at NativeMethodAccessorImpl.java:-2

63
Spark Essentials: Transformations

Transformations create a new dataset from an existing one

All transformations in Spark are lazy: they do not compute their results right away – instead they remember the transformations applied to some base dataset

• optimize the required calculations

• recover from lost data partitions
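A small illustration of laziness (a sketch, assuming a README.md input file):

Python:
lines = sc.textFile("README.md")                    # nothing is read yet
spark_lines = lines.filter(lambda x: "Spark" in x)  # still lazy: only the lineage is recorded
print(spark_lines.toDebugString())                  # inspect the recorded transformations
spark_lines.count()                                 # the action triggers the actual computation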

64
Spark Essentials: Transformations

transformation – description

map(func) – return a new distributed dataset formed by passing each element of the source through a function func

filter(func) – return a new dataset formed by selecting those elements of the source on which func returns true

flatMap(func) – similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)

sample(withReplacement, fraction, seed) – sample a fraction fraction of the data, with or without replacement, using a given random number generator seed

union(otherDataset) – return a new dataset that contains the union of the elements in the source dataset and the argument

distinct([numTasks]) – return a new dataset that contains the distinct elements of the source dataset

65
Spark Essentials: Transformations

transformation – description

groupByKey([numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs

reduceByKey(func, [numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function

sortByKey([ascending], [numTasks]) – when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument

join(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key

cogroup(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples – also called groupWith

cartesian(otherDataset) – when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
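A quick sketch of a few of these on a small pair RDD (illustrative data; output ordering may vary):

Python:
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
pairs.reduceByKey(lambda x, y: x + y).collect()  # [('a', 3), ('b', 1)]
pairs.groupByKey().mapValues(list).collect()     # [('a', [1, 2]), ('b', [1])]
pairs.sortByKey().collect()                      # pairs sorted by key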

66
Spark Essentials: Transformations

Scala:
val distFile = sc.textFile("README.md")
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()

distFile is a collection of lines

Python:
distFile = sc.textFile("README.md")
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()

67
Spark Essentials: Transformations

Scala:
val distFile = sc.textFile("README.md")
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()

closures

Python:
distFile = sc.textFile("README.md")
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()

68
Spark Essentials: Transformations

Scala:
val distFile = sc.textFile("README.md")
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()

closures

Python:
distFile = sc.textFile("README.md")
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()

looking at the output, how would you compare results for map() vs. flatMap() ?

69
Spark Essentials: Actions

action – description

reduce(func) – aggregate the elements of the dataset using a function func (which takes two arguments and returns one), and should also be commutative and associative so that it can be computed correctly in parallel

collect() – return all the elements of the dataset as an array at the driver program – usually useful after a filter or other operation that returns a sufficiently small subset of the data

count() – return the number of elements in the dataset

first() – return the first element of the dataset – similar to take(1)

take(n) – return an array with the first n elements of the dataset – currently not executed in parallel, instead the driver program computes all the elements

takeSample(withReplacement, fraction, seed) – return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed

70
Spark Essentials: Actions

action – description

saveAsTextFile(path) – write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file

saveAsSequenceFile(path) – write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

countByKey() – only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key

foreach(func) – run a function func on each element of the dataset – usually done for side effects such as updating an accumulator variable or interacting with external storage systems
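A short sketch exercising a few of these actions (assuming a README.md input; the output path is a placeholder):

Python:
words = sc.textFile("README.md").flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.take(5)                        # first five (word, 1) pairs, returned to the driver
words.countByKey()                   # dict mapping each word to its occurrence count
words.saveAsTextFile("/tmp/wc_out")  # placeholder path; writes one part file per partition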

71
Spark Essentials: Actions

Scala:
val f = sc.textFile("README.md")
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect.foreach(println)

Python:
from operator import add
f = sc.textFile("README.md")
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.reduceByKey(add).collect()

72
Spark Essentials: Persistence

Spark can persist (or cache) a dataset in memory across operations

Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster

The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it

73
Spark Essentials: Persistence

storage level – description

MEMORY_ONLY – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER – Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER – Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY – Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc – Same as the levels above, but replicate each partition on two cluster nodes.
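For example, choosing a level other than the cache() default might look like this (a sketch, assuming a README.md input):

Python:
from pyspark import StorageLevel

f = sc.textFile("README.md")
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
w.persist(StorageLevel.MEMORY_AND_DISK)    # spill partitions that don't fit in memory to disk
w.reduceByKey(lambda a, b: a + b).count()  # first action materializes and persists w
w.reduceByKey(lambda a, b: a + b).count()  # later actions reuse the persisted partitions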

74
Spark Essentials: Persistence

Scala:
val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
w.reduceByKey(_ + _).collect.foreach(println)

Python:
from operator import add
f = sc.textFile("README.md")
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
w.reduceByKey(add).collect()

75
Spark Essentials: Broadcast Variables

Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks

For example, to give every node a copy of a large input dataset efficiently

Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
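For instance, a small lookup table can be broadcast once and then referenced inside tasks (a sketch with made-up data):

Python:
severity = sc.broadcast({"ERROR": 3, "WARN": 2, "INFO": 1})  # hypothetical lookup table
levels = sc.parallelize(["INFO", "ERROR", "WARN", "ERROR"])
levels.map(lambda lvl: (lvl, severity.value.get(lvl, 0))).collect()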

76
Spark Essentials: Broadcast Variables

Scala:
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

Python:
broadcastVar = sc.broadcast(list(range(1, 4)))
broadcastVar.value

77
Spark Essentials: Accumulators

Accumulators are variables that can only be “added” to through an associative operation

Used to implement counters and sums, efficiently in parallel

Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend for new types

Only the driver program can read an accumulator’s value, not the tasks

78
Spark Essentials: Accumulators

Scala:
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

accum.value

Python:
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])
def f(x):
    global accum
    accum += x

rdd.foreach(f)

accum.value

79
Spark Essentials: Accumulators

Scala:
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

accum.value

driver-side

Python:
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])
def f(x):
    global accum
    accum += x

rdd.foreach(f)

accum.value

80
Spark Essentials: API Details

For more details about the Scala/Java API:

spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package

For more details about the Python API:

spark.apache.org/docs/latest/api/python/

81
Follow-Up
certification:

Apache Spark developer certificate program


• oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
MOOCs:

Anthony Joseph
UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181

Ameet Talwalkar
UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
community:

spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
books:

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do

Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/
