Spark Introduction

Spark is a cluster computing framework that addresses inefficiencies in MapReduce for iterative and interactive algorithms. Spark introduces Resilient Distributed Datasets (RDDs) that allow data to be cached in memory across jobs for faster processing. RDDs are read-only, partitioned collections that can be rebuilt if lost. This in-memory approach allows Spark to be much faster than MapReduce for complex analytics and iterative algorithms on large datasets.

Introduction to Spark

• MapReduce enables big data analytics on large, unreliable clusters.
• There are multiple implementations of MapReduce:
  - based on HDFS
  - based on NoSQL databases
  - based on the cloud
• MapReduce cannot support complex/iterative applications efficiently.
• Root cause: inefficient data sharing. The only way to share data across jobs is via stable storage.
• There is room for further improvement with MapReduce on:
  - iterative algorithms
  - interactive ad-hoc queries
• In other words, MapReduce lacks efficient primitives for data sharing.

• This is where Spark comes into the picture: instead of loading the data from disk for every query, why not do in-memory data sharing?
• Spark addresses this issue with Resilient Distributed Datasets (RDDs).
• RDDs allow applications to keep working sets in memory for reuse.
RDD Overview
• A new programming model (RDD):
  - parallel/distributed computing
  - in-memory sharing
  - fault tolerance
• An implementation of RDD: Spark
Spark’s Philosophy
• Generalize MapReduce
  - Richer programming model: fewer systems to master
  - Better memory management: less data movement leads to better performance for complex analytics
• Spark’s solution: Resilient Distributed Datasets (RDDs)
  - Read-only, partitioned collections of objects
  - A distributed immutable array
  - Created through parallel transformations on data in stable storage
  - Can be cached for efficient reuse
  - The operations that generate an RDD are logged, so it can be automatically rebuilt on (partial) failure

Cited from Matei Zaharia, "Spark: Fast, Interactive, Language-Integrated Cluster Computing", AMPLab, UC Berkeley
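As a concrete illustration of these properties, here is a minimal Scala sketch, assuming an existing SparkContext named sc; the HDFS path is hypothetical. It creates an RDD from stable storage, derives a new RDD through a transformation, and caches it for reuse:

  // Build a base RDD from a file in stable storage (path is hypothetical)
  val lines = sc.textFile("hdfs://namenode:8020/data/events.log")

  // A transformation creates a new, derived RDD; nothing is computed yet
  val parsed = lines.map(line => line.split("\t"))

  // Mark the derived RDD for in-memory caching so later jobs can reuse it
  parsed.cache()

  // If a cached partition is lost, Spark replays the logged transformations
  // (the lineage) on the base data to rebuild just that partition
  println(parsed.count())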
Spark API
• Spark provides APIs in the following languages: Scala, Java, Python, R

• Data operations
  - Transformations: lazily create new RDDs
    wc = dataset.flatMap(tokenize).reduceByKey(add)
  - Actions: trigger execution of the computation
    wc.collect()
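A minimal Scala sketch of this transformation/action split, assuming an existing SparkContext named sc; the numeric dataset is only for illustration:

  // Transformations are lazy: each call only records a step in the lineage
  val nums    = sc.parallelize(1 to 1000000)
  val squares = nums.map(n => n.toLong * n)      // nothing executes yet
  val evens   = squares.filter(_ % 2 == 0)       // still nothing executes

  // The action is what actually launches a job on the cluster
  val total = evens.count()
  println(s"even squares: $total")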
Abstraction: Dataflow Operators
• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ...
Spark Example: Log Mining
• Load error messages from a log into memory, then run interactive queries

  lines = spark.textFile("hdfs://...")              // base RDD
  errors = lines.filter(_.startsWith("ERROR"))      // transformation
  errors.persist()
  errors.filter(_.contains("Foo")).count()          // action
  errors.filter(_.contains("Bar")).count()

[Diagram: master (driver) node dispatching tasks to worker nodes]

• Result: full-text search on 1 TB of data in 5-7 sec, vs. 170 sec with on-disk data

Simple Yet Powerful
• WordCount implementation: Hadoop vs. Spark
• Pregel: iterative graph processing, 200 LoC using Spark
• HaLoop: iterative MapReduce, 200 LoC using Spark

WordCount
• val counts = sc.textFile("hdfs://...").flatMap(line => line.split(" ")).map(word => (word, 1L)).reduceByKey(_ + _)
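For context, here is a minimal self-contained Scala sketch of the same word count as a standalone application; the application name, input path, and output path are hypothetical:

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      // Local master for illustration; on a cluster this comes from spark-submit
      val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
      val sc   = new SparkContext(conf)

      val counts = sc.textFile("hdfs://namenode:8020/input/docs")    // hypothetical path
        .flatMap(line => line.split(" "))
        .map(word => (word, 1L))
        .reduceByKey(_ + _)

      counts.saveAsTextFile("hdfs://namenode:8020/output/wordcount") // hypothetical path
      sc.stop()
    }
  }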
What is an Iterative Algorithm?
  data = input data
  w = <target vector>            (shared data structure)

  at each iteration:
    do something to each item in data
    update(w)                    (update the shared data structure)

This is the logistic regression example from the cited paper: the data points are cached in memory once and reused in every iteration, while only the small vector w is updated on the driver between iterations.

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Copied from Matei Zaharia et al., "Spark: Fast, Interactive, Language-Integrated Cluster Computing", AMPLab, UC Berkeley
Evaluation
• 10 iterations on 100 GB of data using 25-100 machines

Spark Ecosystem

Copied from Matei Zaharia, "Spark's Role in the Big Data Ecosystem", Databricks

• Spark Core: the execution engine for the Spark platform. It provides distributed in-memory computing capabilities.
• Spark SQL: an engine for Hadoop Hive that enables unmodified Hive queries to run up to 100x faster on existing deployments and data.
• Spark Streaming: an engine that enables powerful interactive and analytical applications on streaming data.
• MLlib: a scalable machine learning library.
• GraphX: a graph computation engine built on top of Spark.
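As a small illustration of how these components share one platform, here is a minimal Spark SQL sketch, assuming a modern Spark distribution with SparkSession; the file path and column names are hypothetical:

  import org.apache.spark.sql.SparkSession

  object SqlExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("SqlExample")
        .master("local[*]")       // local master for illustration
        .getOrCreate()

      // Hypothetical JSON file with fields "name" and "age"
      val people = spark.read.json("hdfs://namenode:8020/data/people.json")
      people.createOrReplaceTempView("people")

      // SQL queries run on the same engine that executes RDD jobs
      spark.sql("SELECT name, age FROM people WHERE age > 30").show()

      spark.stop()
    }
  }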
Hadoop vs. Spark
• Spark is a computing framework that can be deployed on top of Hadoop.
• You can view Hadoop as an operating system for a distributed computing cluster, while Spark is an application running on that system to provide in-memory analytics functions.
