Overview
Project Goals
Extend the MapReduce model to better support
two common classes of analytics apps:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining
Enhance programmability:
»Integrate into Scala programming language
»Allow interactive use from Scala interpreter
Motivation
Most current cluster programming models are
based on acyclic data flow from stable storage
to stable storage
[Diagram: acyclic data flow from input through Map and Reduce stages to output]
Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures
Motivation
Acyclic data flow is inefficient for applications
that repeatedly reuse a working set of data:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data
from stable storage on each query
Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for
efficient reuse
Retain the attractive properties of MapReduce
» Fault tolerance, data locality, scalability
Actions on RDDs
» Count, reduce, collect, save, …
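For example, a cached RDD can serve repeated actions without re-reading from disk. A minimal sketch in the deck's own Scala style (the file path is a placeholder, and spark is the context variable used in the Log Mining example below):

val nums = spark.textFile("hdfs://.../ints.txt").map(_.toInt).cache()
nums.count                      // first action scans the file and fills the cache
nums.reduce(_ + _)              // later actions reuse the in-memory working set
nums.filter(_ > 100).collect()  // action: returns matching elements to the driver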
Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count        // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads an HDFS block (Block 1-3), builds an in-memory partition of the cached RDD (Cache 1-3), and returns results to the driver]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance
RDDs maintain lineage information that can be
used to reconstruct lost partitions
Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))
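As a minimal model of the idea (plain Scala, not Spark's actual internals), each node in a lineage chain knows its parent and how to recompute any one partition, so a lost partition can be rebuilt on demand:

trait Lineage[T] { def compute(split: Int): Iterator[T] }

// The source re-reads its HDFS block; derived nodes replay their transformation.
class HdfsSource(blocks: Array[Array[String]]) extends Lineage[String] {
  def compute(split: Int) = blocks(split).iterator
}
class Filtered[T](parent: Lineage[T], f: T => Boolean) extends Lineage[T] {
  def compute(split: Int) = parent.compute(split).filter(f)
}
class Mapped[T, U](parent: Lineage[T], f: T => U) extends Lineage[U] {
  def compute(split: Int) = parent.compute(split).map(f)
}

// Rebuilding partition 1 of `messages` after a failure just replays the chain:
val source   = new HdfsSource(Array(Array("INFO\ta\tb"), Array("ERROR\tdisk\tfull")))
val errors   = new Filtered[String](source, _.startsWith("ERROR"))
val messages = new Mapped[String, String](errors, _.split('\t')(2))
messages.compute(1).foreach(println)   // prints "full"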
Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
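The snippet assumes a Point type, a readPoint parser, and a Vector supporting the arithmetic the loop uses; none of these are defined on the slide. One hypothetical, self-contained way to fill them in:

import scala.util.Random

// Hypothetical support code (not from the talk): a tiny dense vector with
// just the operations the loop needs, plus the Point type and line parser.
case class Vector(elems: Array[Double]) {
  def dot(o: Vector): Double = elems.zip(o.elems).map { case (a, b) => a * b }.sum
  def *(s: Double): Vector   = Vector(elems.map(_ * s))
  def +(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a + b })
  def -(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a - b })
}
object Vector {
  def random(d: Int): Vector = Vector(Array.fill(d)(Random.nextDouble))
}
// Lets the gradient expression write scalar * vector as well as vector * scalar.
implicit class Scalar(s: Double) { def *(v: Vector): Vector = v * s }

case class Point(x: Vector, y: Double)   // y is the label: +1.0 or -1.0

// Parse one line of "label f1 f2 ... fD" into a Point.
def readPoint(line: String): Point = {
  val nums = line.trim.split(' ').map(_.toDouble)
  Point(Vector(nums.tail), nums.head)
}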
Logistic Regression Performance
[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 127 s / iteration. Spark: 174 s for the first iteration, 6 s for each further iteration]
Spark Applications
In-memory data mining on Hive data (Conviva)
Predictive analytics (Quantifind)
City traffic prediction (Mobile Millennium)
Twitter spam classification (Monarch)
Collaborative filtering via matrix factorization
…
Conviva GeoReport
[Chart: time (hours): Hive 20, Spark 0.5]
Pipelines functions within a stage
Cache-aware work reuse and locality
[Diagram: DAG of RDDs (A-G) split into stages at shuffle boundaries such as groupBy]
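As a rough illustration (not Spark's scheduler code), pipelining within a stage amounts to fusing narrow transformations into one pass over each partition, much like chained Scala iterators:

// filter and map run in a single traversal of the partition; no intermediate
// collection is materialized between the two operations.
val partition = Iterator("ERROR\tdisk\tfull", "INFO\tall\tok")
val pipelined = partition
  .filter(_.startsWith("ERROR"))   // narrow dependency: stays in the same stage
  .map(_.split('\t')(2))           // fused into the same single pass
pipelined.foreach(println)         // prints: full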
www.spark-project.org
[email protected]
Related Work
DryadLINQ, FlumeJava
» Similar “distributed collection” API, but cannot reuse
datasets efficiently across queries
Relational databases
» Lineage/provenance, logical logging, materialized views
GraphLab, Piccolo, BigTable, RAMCloud
» Fine-grained writes similar to distributed shared memory
Iterative MapReduce (e.g. Twister, HaLoop)
» Implicit data sharing for a fixed computation pattern
Caching systems (e.g. Nectar)
» Store data in files, no explicit control over what is cached
Behavior with Not Enough RAM
[Chart: iteration time (s) vs % of working set in memory: cache disabled 68.8, 25% 58.1, 50% 40.7, 75% 29.7, fully cached 11.5]
Fault Recovery Results
[Chart: iteration time (s) over iterations 1-10, comparing no failure against a failure in the 6th iteration. Iteration 1: 119 s; other iterations: 56-59 s; the 6th iteration with a failure: 81 s]
Spark Operations
Transformations (define a new RDD):
» map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, mapValues
Actions (return a result to the driver program):
» collect, reduce, count, save, lookupKey
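A short sketch combining a few of these operations (data and variable names invented for illustration; assumes a SparkContext named spark):

val pairs  = spark.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _)    // transformation: ("a", 4), ("b", 2)
val labels = spark.parallelize(Seq(("a", "x")))
val joined = counts.join(labels)         // transformation: ("a", (4, "x"))
joined.collect()                         // action: returns the results to the driver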