Spark and Scala Week 1
Event Detection Case Study: Detecting Earthquakes from Tweets

Stream tweets and filter for earthquake-related text:

    TwitterUtils.createStream(...)
                .filter(_.getText.contains("earthquake") ||
                        _.getText.contains("shaking"))

Semantic Analysis
    Distinguish tweets about a real event ("Earth is shaking!
    Earthquake!") from irrelevant mentions ("Attending an
    Earthquake Conference")

Pipeline
    Code the classification model in MLlib
    Prediction of tweets once the model is ready
    Compute the density of positive tweets to detect an event
    Retrieve the location services' addresses with Spark SQL
    Send emails to the retrieved addresses
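The semantic-analysis step could be coded in MLlib roughly as below. This is a minimal sketch, not the case study's actual code: labeledTweets is a hypothetical RDD[(Double, String)] of pre-labeled training tweets, and the hashing-based feature pipeline is an assumption.

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Hash each tweet's words into a fixed-size feature vector
    val tf = new HashingTF(numFeatures = 1000)

    // label 1.0 = reports a real earthquake, 0.0 = irrelevant mention
    val training = labeledTweets.map { case (label, text) =>
      LabeledPoint(label, tf.transform(text.split(" ").toSeq))
    }

    // Train a binary classifier; filtered live tweets can then be
    // scored with model.predict(...)
    val model = LogisticRegressionWithSGD.train(training, numIterations = 100)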
Spark Architecture & Cluster
RDD (Resilient Distributed Dataset)
Definition
    A Basic Data Abstraction in Spark
    A Collection of Data Items distributed across the Network
    RDDs are the interface for running Transformations and Actions in Spark
Characteristics
    Immutable
    Lazily Evaluated (demonstrated below)
    Distributed
    Fault-Tolerant
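A minimal sketch of lazy evaluation, assuming a running SparkContext sc: the transformation records only the lineage, and nothing is computed until an action is called.

    val rdd = sc.parallelize(1 to 10)

    // map is a transformation: no computation happens here
    val doubled = rdd.map(_ * 2)

    // count is an action: it triggers the actual computation
    doubled.count()   // Long = 10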
Methods to Create RDD
    By Parallelizing a Collection of Objects in the Driver Program
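For example (a minimal sketch, assuming a running SparkContext sc):

    // Turn a local collection held in the driver into an RDD
    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)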
Operations in RDD
    Transformations
        Create a New Dataset from an Existing one
    Actions
        Return a value to the Driver Program after running a
        Computation on the Dataset
Transformations in RDD:
    map(func)
    filter(func)
    flatMap(func)
    mapPartitions(func)
    mapPartitionsWithIndex(func)
    sample(withReplacement, fraction, seed)
    union(otherDataset)
    intersection(otherDataset)
    distinct([numTasks])
    groupByKey([numTasks])
    reduceByKey(func, [numTasks])
    aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
    sortByKey([ascending], [numTasks])
    join(otherDataset, [numTasks])
    cogroup(otherDataset, [numTasks])
    cartesian(otherDataset)
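A few of these in action (a minimal sketch, assuming a running SparkContext sc; each call returns a new RDD and is lazily evaluated):

    val lines = sc.parallelize(Seq("spark", "scala", "spark streaming"))

    val tokens = lines.flatMap(_.split(" "))      // flatMap
    val longer = tokens.filter(_.length > 4)      // filter
    val pairs  = tokens.map(word => (word, 1))    // map
    val counts = pairs.reduceByKey(_ + _)         // reduceByKey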
Actions in RDD:
    reduce(func)
    collect()
    count()
    first()
    take(n)
    takeSample(withReplacement, num, [seed])
    takeOrdered(n, [ordering])
    saveAsTextFile(path)
    saveAsSequenceFile(path)
    saveAsObjectFile(path)
    countByKey()
    foreach(func)
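A few actions, which trigger the computation and return results to the driver (a minimal sketch, assuming a running SparkContext sc):

    val nums = sc.parallelize(1 to 100)

    nums.count()          // Long = 100
    nums.first()          // Int = 1
    nums.take(3)          // Array(1, 2, 3)
    nums.reduce(_ + _)    // Int = 5050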
Shared Variables
    Broadcast Variables
    Accumulators
Broadcast Variables
    Allow the programmer to keep a read-only variable cached on each
    machine rather than shipping a copy of it with tasks
    Can be used to give every node a copy of a large input dataset
    Broadcast variables are created from a variable v by calling
    SparkContext.broadcast(v)
    The broadcast variable is a wrapper around v, and its value can be
    accessed by calling the value method
Code:
    scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
    broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

    scala> broadcastVar.value
    res0: Array[Int] = Array(1, 2, 3)
Accumulators
    Accumulators are variables that are only "added" to through an
    associative and commutative operation
    They can be used to implement counters (as in MapReduce) or sums
    Spark supports accumulators of numeric types, and programmers can
    add support for new types
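By analogy with the broadcast example, a minimal REPL sketch (assuming the classic Spark 1.x accumulator API; newer versions use sc.longAccumulator instead):

    scala> val accum = sc.accumulator(0, "My Accumulator")
    accum: org.apache.spark.Accumulator[Int] = 0

    scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

    scala> accum.value
    res2: Int = 10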
Basics of Scala
    Object-Oriented Programming
    Functional Programming
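A tiny sketch of the two paradigms side by side (a hypothetical example, not from the original notes):

    // Object-oriented: a class bundling data and behaviour
    case class Tweet(user: String, text: String) {
      def mentionsEarthquake: Boolean = text.contains("earthquake")
    }

    // Functional: immutable values and higher-order functions
    val tweets = List(Tweet("a", "earthquake now!"), Tweet("b", "nice day"))
    val alerts = tweets.filter(_.mentionsEarthquake).map(_.user)
    // alerts: List[String] = List(a)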