
WEEK 1

SPARK & SCALA


What is Apache Spark?

 Lightning fast cluster computing
 Faster data processing platform
 Fastest open-source engine for sorting a petabyte
 80 high-level operators that make coding quick
"Hello World!" of Big Data: The Word Count Example

In Java MapReduce, the word count example takes roughly 50 lines of code.

The same program written in Spark & Scala:

sparkContext.textFile("hdfs://…")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs://…")
Other Significant Features of Apache Spark…

 Interactive shell (REPL) lets you explore results as you go
 Provides APIs in Scala, Java, and Python, with support for other languages (such as R) on the way
 Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.)
 Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run standalone
 Complemented by a set of powerful, higher-level libraries (Spark SQL, Spark Streaming, MLlib, GraphX)
Event Detection Case

 Faster than the Japan Meteorological Agency
 Analyzing a Twitter stream
 Filtering relevant tweets – "Earthquake" or "Shaking"
Using "Spark Streaming" code…

TwitterUtils.createStream(...)
  .filter(_.getText.contains("earthquake") ||
          _.getText.contains("shaking"))
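For context, a minimal sketch of how that filter might sit inside a complete Spark Streaming job. This assumes the external spark-streaming-twitter connector is on the classpath and Twitter credentials are supplied as twitter4j system properties; the application name and batch interval are illustrative.

// Sketch only: connector, credentials, and batch interval are assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object EarthquakeDetector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EarthquakeDetector")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

    // None = take credentials from twitter4j system properties
    val tweets = TwitterUtils.createStream(ssc, None)

    // Keep only tweets whose text mentions an earthquake or shaking
    val relevant = tweets.filter { status =>
      val text = status.getText.toLowerCase
      text.contains("earthquake") || text.contains("shaking")
    }

    relevant.map(_.getText).print()   // print a sample of matching tweets per batch

    ssc.start()
    ssc.awaitTermination()
  }
}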
Event Detection Case…

Semantic Analysis: distinguish tweets like "Attending a Conference on Earthquake" from "Earth is Shaking" (an actual earthquake)

Flow: Prediction model is ready (code in MLlib) → density of positive tweets → location of tweets retrieved with Twitter services → sending emails to addresses (Spark SQL)
Spark Architecture & Cluster
RDD (Resilient Distributed Dataset)

Definition:
 A basic data abstraction in Spark
 A collection of data items distributed across the network
 RDDs are the interface for running transformations and actions in Spark

Characteristics: Immutable, Lazily Evaluated, Distributed, Fault-Tolerant
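A small spark-shell sketch (assuming sc is the SparkContext; the data is made up) illustrating two of these characteristics, immutability and lazy evaluation:

val numbers = sc.parallelize(1 to 1000)   // an RDD; it is never modified in place
val doubled = numbers.map(_ * 2)          // map returns a *new* RDD; nothing runs yet (lazy)
val total   = doubled.reduce(_ + _)       // the action finally triggers the computation
println(total)                            // 1001000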
Methods to Create RDD

 By referencing an external dataset in an external storage system
 By parallelizing a collection of objects in the driver program

(both methods are sketched below)
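A minimal spark-shell sketch of both creation methods; the HDFS path is illustrative, not a real location.

val fromFile = sc.textFile("hdfs://namenode:8020/data/input.txt")   // 1. reference an external dataset
val fromCollection = sc.parallelize(Seq("spark", "scala", "rdd"))    // 2. parallelize a driver-side collection
println(fromCollection.count())                                      // 3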
Operations in RDD

 Transformations: create a new dataset from an existing one
 Actions: return a value to the driver program after running a computation on the dataset
Transformations in RDD:

 map(func)
 filter(func)
 flatMap(func)
 mapPartitions(func)
 mapPartitionsWithIndex(func)
 sample(withReplacement, fraction, seed)
 union(otherDataset)
 intersection(otherDataset)
 distinct([numTasks])
 groupByKey([numTasks])
 reduceByKey(func, [numTasks])
 aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
 sortByKey([ascending], [numTasks])
 join(otherDataset, [numTasks])
 cogroup(otherDataset, [numTasks])
 cartesian(otherDataset)
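An illustrative chain of a few of these transformations in spark-shell; the input data is made up, and note that no job runs yet.

val lines = sc.parallelize(Seq("to be or not to be", "to do or not to do"))

val wordCounts = lines
  .flatMap(_.split(" "))        // flatMap: one word per element
  .filter(_.nonEmpty)           // filter: drop empty strings
  .map(word => (word, 1))       // map: pair each word with a count of 1
  .reduceByKey(_ + _)           // reduceByKey: sum counts per word
  .sortByKey()                  // sortByKey: order alphabetically

// Still nothing has executed: transformations only describe the computation.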
Actions in RDD:

 reduce(func)
 collect()
 count()
 first()
 take(n)
 takeSample(withReplacement, num, [seed])
 takeOrdered(n, [ordering])
 saveAsTextFile(path)
 saveAsSequenceFile(path)
 saveAsObjectFile(path)
 countByKey()
 foreach(func)
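Continuing the wordCounts sketch above, calling an action is what finally triggers execution; the output path below is illustrative.

println(wordCounts.count())          // number of distinct words
wordCounts.take(3).foreach(println)  // first three (word, count) pairs returned to the driver
// wordCounts.saveAsTextFile("hdfs://namenode:8020/output/wordcounts")  // illustrative path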
Variables

Shared Variables:
 Broadcast Variables
 Accumulators
Broadcast Variables

 Allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
 Can be used to give every node a copy of a large input dataset
 Broadcast variables are created from a variable v by calling SparkContext.broadcast(v)
 The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method

Code:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
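A sketch of the typical broadcast pattern (the lookup table and names are illustrative): ship a read-only map to every executor once and reference it inside a task.

val countryNames = Map("IN" -> "India", "US" -> "United States", "JP" -> "Japan")
val bcNames = sc.broadcast(countryNames)           // cached once per executor

val codes = sc.parallelize(Seq("IN", "JP", "US", "IN"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "Unknown"))

resolved.collect().foreach(println)                // India, Japan, United States, India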
Accumulators

 Accumulators are variables that are only "added" to through an associative and commutative operation
 They can be used to implement counters (as in MapReduce) or sums
 Spark supports accumulators of numeric types, and programmers can add support for new types
 An accumulator is created from an initial value v by calling SparkContext.accumulator(v)
 Tasks running on a cluster can then add to it using the add method or the += operator
 Only the driver program can read the accumulator's value, using its value method

Code:

scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in ...
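To round off the example, a sketch of how the driver would then read the accumulated total back with the value method (the REPL result label shown is illustrative):

scala> accum.value
res1: Long = 10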
Week 2

 Basics of Scala
 Object-Oriented Programming
 Functional Programming
 Scala Data Types
 Scala Functions
 Scala Variables