Chapter 4: Spark
DATA TRAINING
Intro to Spark
Thoại Nam
• Shared/Distributed memory
• MapReduce drawbacks
• Spark
• Most programs have a high degree of locality in their accesses
  o Spatial locality: accessing things nearby previous accesses
  o Temporal locality: reusing an item that was previously accessed
• The memory hierarchy tries to exploit locality to improve average access time
  [Figure: memory hierarchy, from the processor (control, datapath, registers, on-chip cache) to the second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape, "cloud"); typical sizes grow from KB to MB, GB, TB, and PB]
• Restrict the programming interface so that the system can do more automatically: "Here's an operation, run it on all of the data"
  o I don't care where it runs (you schedule that)
  o In fact, feel free to run it twice on different nodes
• MapReduce turned out to be an incredibly useful and widely deployed framework for processing large amounts of data. However, its design forces programs to comply with its computation model (see the sketch after this list):
  o Map: create <key, value> pairs
  o Shuffle: combine common keys together and partition them to reduce workers
  o Reduce: process each unique key and all of its associated values
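To make the three phases concrete, here is a minimal single-process Python sketch of the model, using word count; the phase functions are illustrative helpers, not part of any MapReduce API.

from collections import defaultdict

def map_phase(documents):
    # Map: emit a <key, value> pair for every word
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group the values of each common key together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: process each unique key and all of its associated values
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data spark", "spark and mapreduce"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
# {'big': 1, 'data': 1, 'spark': 2, 'and': 1, 'mapreduce': 1}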
• Many applications had to run MapReduce over multiple passes to process their data
• All intermediate data had to be stored back in the file system (GFS at Google, HDFS elsewhere), which tended to be slow, since stored data was not just written to disks but also replicated
• The next MapReduce phase could not start until the previous MapReduce job completed fully
• MapReduce was also designed to read its data from a distributed file system (GFS/HDFS). In many cases, however, data resides within an SQL database or is streaming in (e.g., activity logs, remote monitoring).
Ø Highly flexible and general-purpose way of dealing with big data processing needs
Ø Does not impose a rigid computation model, and supports a variety of input types
Ø Deals with text files, graph data, database queries, and streaming sources, and is not confined to a two-stage processing model
Ø Programmers can develop arbitrarily complex, multi-step data pipelines arranged in an arbitrary directed acyclic graph (DAG) pattern
Ø Programming in Spark involves defining a sequence of transformations and actions
Ø Spark supports a map operation and a reduce operation, so it can implement traditional MapReduce operations, but it also supports SQL queries, graph processing, and machine learning (see the pipeline sketch after this list)
Ø Stores its intermediate results in memory, providing dramatically higher performance
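As a hedged illustration of a multi-step pipeline, the following PySpark sketch chains several transformations into a DAG and finishes with an action; the application name is made up and the input path is a placeholder.

from pyspark import SparkContext

sc = SparkContext(appName="PipelineSketch")      # illustrative driver setup

# A pipeline that goes beyond the two-stage map/reduce model:
words = (sc.textFile("hdfs://...")               # load lines (path is a placeholder)
           .flatMap(lambda line: line.split())   # transformation: lines -> words
           .filter(lambda w: w != "")            # transformation: drop empty tokens
           .map(lambda w: (w.lower(), 1))        # transformation: <key, value> pairs
           .reduceByKey(lambda a, b: a + b))     # transformation: aggregate per key

top10 = words.takeOrdered(10, key=lambda kv: -kv[1])  # action: materialize a result
print(top10)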
[Figure: the Spark stack, a compute engine (memory management, task scheduling, fault recovery, interaction with cluster management) layered on top of distributed storage]
• An application that uses Spark identifies data sources and the operations on that data. The main application, called the driver program, is linked with the Spark API, which creates a SparkContext (the heart of the Spark system, coordinating all processing activity). This SparkContext in the driver program connects to a Spark cluster manager. The cluster manager is responsible for allocating worker nodes, launching executors on them, and keeping track of their status
• Each worker node runs one or more executors. An executor is a process that runs an instance of a Java Virtual Machine (JVM)
• When each executor is launched by the manager, it establishes a connection back to the driver program
• The executor runs tasks on behalf of a specific SparkContext (application) and keeps related data in memory or disk storage
• A task is a transformation or action; the executor remains running for the duration of the driver program.
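A minimal driver-side sketch, assuming a standalone cluster manager; the master URL and application name are placeholders, not values from the slides.

from pyspark import SparkConf, SparkContext

# The driver program creates a SparkContext, which connects to the cluster
# manager; the manager allocates workers and launches executors for this app.
conf = (SparkConf()
        .setAppName("LogMiningDriver")            # illustrative name
        .setMaster("spark://master-host:7077"))   # placeholder standalone master URL
sc = SparkContext(conf=conf)

# Executors now run tasks on behalf of this SparkContext until sc.stop().
rdd = sc.parallelize(range(1000))
print(rdd.count())    # an action executed by the executors
sc.stop()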
• Data in Spark is a collection of Resilient Distributed Datasets (RDDs). This is often a huge collection of stuff. Think of an individual RDD as a table in a database or a structured file.
• Input data is organized into RDDs, which will often be partitioned across many computers. RDDs can be created in three ways (a short sketch of each follows this list):
  (1) They can be present as any file stored in HDFS or any other storage system supported in Hadoop. This includes Amazon S3 (a key-value server, similar in design to Dynamo), HBase (Hadoop's version of Bigtable), and Cassandra (a NoSQL eventually-consistent database). This data is created by other services, such as event streams, text logs, or a database. For instance, the results of a specific query can be treated as an RDD. A list of files in a specific directory can also be an RDD.
  (2) RDDs can be streaming sources using the Spark Streaming extension. This could be a stream of events from remote sensors, for example. For fault tolerance, a sliding window is used, where the contents of the stream are buffered in memory for a predefined time interval.
  (3) An RDD can be the output of a transformation function. This allows one task to create data that can be consumed by another task and is the way tasks pass data around. For example, one task can filter out unwanted data and generate a set of key-value pairs, writing them to an RDD. This RDD will be cached in memory (overflowing to disk if needed) and will be read by a task that reads the output of the task that created the key/value data.
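A hedged PySpark sketch of the three creation paths; the paths, host, port, and batch interval are assumptions for illustration, and the streaming piece uses the Spark Streaming (DStream) extension mentioned above.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="RDDSources")

# (1) From storage supported by Hadoop (HDFS shown; S3, HBase, etc. via their connectors)
log_lines = sc.textFile("hdfs://...")                # path elided as in the slides

# (2) From a streaming source via Spark Streaming (a socket source as an example)
ssc = StreamingContext(sc, batchDuration=10)         # 10-second micro-batches (assumed)
events = ssc.socketTextStream("sensor-host", 9999)   # each batch arrives as an RDD

# (3) As the output of a transformation: this step filters and re-keys the data,
#     producing a new RDD that later tasks can consume
error_pairs = (log_lines
               .filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: (s.split("\t")[0], 1)))
error_pairs.cache()                                  # keep it in memory for reuse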
• Spark allows two types of operations on RDDs: transformations and actions
  o Transformations read an RDD and return a new RDD. Example transformations are map, filter, groupByKey, and reduceByKey. Transformations are evaluated lazily, which means they are computed only when some task wants their data (the RDD that they generate). At that point, the driver schedules them for execution
  o Actions are operations that evaluate and return a new value. When an action is requested on an RDD object, the necessary transformations are computed and the result is returned. Actions tend to be the things that generate the final output needed by a program. Example actions are reduce, grab samples, and write to file
Transformations:
• groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
• reduceByKey(func, [numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function
• sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument
• join(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
Actions:
• reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel
• collect(): return all the elements of the dataset as an array at the driver program; usually useful after a filter or other operation that returns a sufficiently small subset of the data
• count(): return the number of elements in the dataset
• first(): return the first element of the dataset; similar to take(1)
• take(n): return an array with the first n elements of the dataset; currently not executed in parallel, instead the driver program computes all the elements
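A brief sketch, assuming an existing SparkContext sc, showing that transformations such as reduceByKey and sortByKey only record lineage, while actions such as count and collect trigger execution; the sample data is made up for illustration.

# Assumes an existing SparkContext `sc`.
sales = sc.parallelize([("book", 12), ("pen", 3), ("book", 7), ("pen", 5)])

# Transformations: nothing runs yet, only the lineage is recorded.
totals = sales.reduceByKey(lambda a, b: a + b)   # aggregate values per key
sorted_totals = totals.sortByKey()               # sort the (K, V) pairs by key

# Actions: these force the transformations above to execute.
print(sorted_totals.count())     # 2
print(sorted_totals.collect())   # [('book', 19), ('pen', 8)]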
• Spark does not care how data is stored. The appropriate RDD connector determines how to read the data. For example, RDDs can be the result of a query in a Cassandra database, and new RDDs can be written to Cassandra tables. Alternatively, RDDs can be read from HDFS files or written to an HBase table.
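As a hedged illustration of the same program logic running over different storage backends, the sketch below reads through Hadoop-supported URI schemes and writes results back out; the HDFS and S3 paths are placeholders, and a Cassandra or HBase source would be wired in through its own connector rather than these calls.

# Assumes an existing SparkContext `sc`; all paths are placeholders.
hdfs_rdd = sc.textFile("hdfs://namenode:9000/logs/app.log")   # read from HDFS
s3_rdd   = sc.textFile("s3a://my-bucket/logs/app.log")        # same API, S3 backend

errors = hdfs_rdd.union(s3_rdd).filter(lambda s: "ERROR" in s)
errors.saveAsTextFile("hdfs://namenode:9000/out/errors")      # write results back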
• For each RDD, the driver tracks the sequence of transformations used to create it
• That means every RDD knows which task was needed to create it. If any RDD is lost (e.g., the task that created it died), the driver can ask the task that generated it to recreate it
• The driver maintains the entire dependency graph, so this recreation may end up being a chain of transformation tasks going back to the original data.
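To inspect the recorded lineage, PySpark exposes toDebugString() on an RDD; the sketch below assumes an existing SparkContext sc, and the exact output format (and whether it is returned as bytes or text) depends on the Spark version.

# Assumes an existing SparkContext `sc`.
lines = sc.textFile("hdfs://...")                        # path elided as in the slides
errors = lines.filter(lambda s: s.startswith("ERROR"))
pairs = errors.map(lambda s: (s.split("\t")[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Print the dependency chain the driver would replay if a partition were lost.
debug = counts.toDebugString()
print(debug.decode() if isinstance(debug, bytes) else debug)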
[Figure: an RDD created by textFile = sc.textFile("SomeFile.txt") flows through a chain of transformations, each producing a new RDD, until an action produces a value, e.g.:
  linesWithSpark.count()   # 74
  linesWithSpark.first()   # '# Apache Spark'
]
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Figure sequence: the driver sends tasks to the workers; each worker reads its HDFS block (Block 1, 2, 3), processes it, caches its partition of messages (Cache 1, 2, 3), and returns results to the driver. The first count reads from HDFS and populates the caches; the second query is processed from the in-memory caches instead of re-reading HDFS.]
Scala:
  val lines = sc.textFile(...)
  lines.filter(x => x.contains("ERROR")).count()

Java:
  JavaRDD<String> lines = sc.textFile(...);
  lines.filter(new Function<String, Boolean>() {
    Boolean call(String s) {
      return s.contains("error");
    }
  }).count();

Performance: Java and Scala are faster due to static typing, but Python is often fine.
[Figure: running time per iteration (y-axis 0 to 120) over 10 iterations, comparing a map/reduce implementation with RDD-based execution; after the first iteration (about 81), subsequent iterations take roughly 56 to 59.]