Spark Introduction

Spark is a cluster computing framework that addresses inefficiencies in MapReduce for iterative and interactive algorithms. Spark introduces Resilient Distributed Datasets (RDDs) that allow data to be cached in memory across jobs for faster processing. RDDs are read-only, partitioned collections that can be rebuilt if lost. This in-memory approach allows Spark to be much faster than MapReduce for complex analytics and iterative algorithms on large datasets.

Introduction to Spark

• MapReduce enables big data analytics on large, unreliable clusters.
• There are multiple implementations of MapReduce:
  - based on HDFS
  - based on NoSQL databases
  - based on the cloud
• MapReduce cannot support complex/iterative applications efficiently.
• Root cause: inefficient data sharing. The only way to share data across jobs is via stable storage.
• There is room for further improvement with MapReduce on:
  - iterative algorithms
  - interactive ad-hoc queries
• In other words, MapReduce lacks efficient primitives for data sharing.

• This is where Spark comes into the picture: instead of loading the data from disk for every query, why not do in-memory data sharing?
• Spark addresses this issue with Resilient Distributed Datasets (RDDs).
• RDDs allow applications to keep working sets in memory for reuse.
RDD Overview
• A new programming model (RDD):
  - parallel/distributed computing
  - in-memory sharing
  - fault tolerance
• An implementation of RDD: Spark
Spark’s Philosophy
• Generalize MapReduce
  - Richer programming model: fewer systems to master
  - Better memory management: less data movement leads to better performance for complex analytics
• Spark’s solution: Resilient Distributed Datasets (RDDs)
  - Read-only, partitioned collections of objects
  - A distributed immutable array
  - Created through parallel transformations on data in stable storage
  - Can be cached for efficient reuse
  - The operations that generate an RDD are logged, so it can be automatically rebuilt on (partial) failure

Cited from Matei Zaharia, "Spark: Fast, Interactive, Language-Integrated Cluster Computing", AMPLab, UC Berkeley
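As a concrete illustration of these properties, here is a minimal Scala sketch, assuming an existing SparkContext named sc; the HDFS path is hypothetical. It creates an RDD from stable storage, derives a new RDD through a transformation, and caches it for reuse:

  // Build a base RDD from a file in stable storage (path is hypothetical)
  val lines = sc.textFile("hdfs://namenode:8020/data/events.log")

  // A transformation creates a new, derived RDD; nothing is computed yet
  val parsed = lines.map(line => line.split("\t"))

  // Mark the derived RDD for in-memory caching so later jobs can reuse it
  parsed.cache()

  // If a cached partition is lost, Spark replays the logged transformations
  // (the lineage) on the base data to rebuild just that partition
  println(parsed.count())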
Spark API
• Spark provides APIs in the following languages: Scala, Java, Python, R

• Data operations
  - Transformations: lazily create new RDDs
    wc = dataset.flatMap(tokenize).reduceByKey(add)
  - Actions: trigger execution of the computation
    wc.collect()
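A minimal Scala sketch of this transformation/action split, assuming an existing SparkContext named sc; the numeric dataset is only for illustration:

  // Transformations are lazy: each call only records a step in the lineage
  val nums    = sc.parallelize(1 to 1000000)
  val squares = nums.map(n => n.toLong * n)      // nothing executes yet
  val evens   = squares.filter(_ % 2 == 0)       // still nothing executes

  // The action is what actually launches a job on the cluster
  val total = evens.count()
  println(s"even squares: $total")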
Abstraction: Dataflow Operators
• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ...
Spark Example: Log Mining
• Load error messages from a log into memory, then run interactive queries

  lines = spark.textFile("hdfs://...")              // base RDD
  errors = lines.filter(_.startsWith("ERROR"))      // transformation
  errors.persist()
  errors.filter(_.contains("Foo")).count()          // action
  errors.filter(_.contains("Bar")).count()

[Diagram: master (driver) node dispatching tasks to worker nodes]

• Result: full-text search on 1 TB of data in 5-7 sec, vs. 170 sec with on-disk data

Simple Yet Powerful
• WordCount implementation: Hadoop vs. Spark
• Pregel: iterative graph processing, 200 LoC using Spark
• HaLoop: iterative MapReduce, 200 LoC using Spark

WordCount
• val counts = sc.textFile("hdfs://...").flatMap(line => line.split(" ")).map(word => (word, 1L)).reduceByKey(_ + _)
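For context, here is a minimal self-contained Scala sketch of the same word count as a standalone application; the application name, input path, and output path are hypothetical:

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      // Local master for illustration; on a cluster this comes from spark-submit
      val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
      val sc   = new SparkContext(conf)

      val counts = sc.textFile("hdfs://namenode:8020/input/docs")    // hypothetical path
        .flatMap(line => line.split(" "))
        .map(word => (word, 1L))
        .reduceByKey(_ + _)

      counts.saveAsTextFile("hdfs://namenode:8020/output/wordcount") // hypothetical path
      sc.stop()
    }
  }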
What is an Iterative Algorithm?
  data = input data
  w = <target vector>            (shared data structure)

  at each iteration:
    do something to each item in data
    update(w)                    (update the shared data structure)

This is the logistic regression example from the cited paper: the data points are cached in memory once and reused in every iteration, while only the small vector w is updated on the driver between iterations.

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Copied from Matei Zaharia et al., "Spark: Fast, Interactive, Language-Integrated Cluster Computing", AMPLab, UC Berkeley
Evaluation
• 10 iterations on 100 GB of data using 25-100 machines

Spark Ecosystem

Copied from Matei Zaharia, "Spark's Role in the Big Data Ecosystem", Databricks

• Spark Core: the execution engine for the Spark platform. It provides distributed in-memory computing capabilities.
• Spark SQL: an engine for Hadoop Hive that enables unmodified Hive queries to run up to 100x faster on existing deployments and data.
• Spark Streaming: an engine that enables powerful interactive and analytical applications on streaming data.
• MLlib: a scalable machine learning library.
• GraphX: a graph computation engine built on top of Spark.
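As a small illustration of how these components share one platform, here is a minimal Spark SQL sketch, assuming a modern Spark distribution with SparkSession; the file path and column names are hypothetical:

  import org.apache.spark.sql.SparkSession

  object SqlExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("SqlExample")
        .master("local[*]")       // local master for illustration
        .getOrCreate()

      // Hypothetical JSON file with fields "name" and "age"
      val people = spark.read.json("hdfs://namenode:8020/data/people.json")
      people.createOrReplaceTempView("people")

      // SQL queries run on the same engine that executes RDD jobs
      spark.sql("SELECT name, age FROM people WHERE age > 30").show()

      spark.stop()
    }
  }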
Hadoop vs. Spark
• Spark is a computing framework that can be deployed on top of Hadoop.
• You can view Hadoop as an operating system for a distributed computing cluster, while Spark is an application running on that system to provide in-memory analytics functions.
