Module 5: Apache Spark

Apache Spark solves the main bottlenecks of MapReduce by providing more flexible workflows beyond map and reduce steps, faster computation through in-memory caching of data, and support for multiple programming languages including Python and Scala for interactive use. Spark's architecture includes a driver program that launches parallel operations on executor processes across worker nodes managed by a cluster manager like YARN.



Shortcomings of MapReduce

Learning objectives
• List the main bottlenecks of MapReduce
• Explain how Apache Spark solves them

Shortcomings of MapReduce
• Forces your pipeline into Map and Reduce steps; other workflows (e.g. join, filter, map-reduce-map) are awkward to express
• Reads from disk for each MapReduce job, which is costly for iterative algorithms such as machine learning
• Only a native Java programming interface: other languages? Interactivity?
Solution?
• A new framework: the same features as MapReduce, and more
• Capable of reusing the Hadoop ecosystem, e.g. HDFS, YARN…
• Born at UC Berkeley
Solutions by Spark
• Other workflows? e.g. join, filter, map-reduce-map
• Flexibility: ~20 highly efficient distributed operations, usable in any combination
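To make the "any combination of operations" point concrete, here is a minimal pure-Python sketch of the semantics of three such operations (map, filter, and join on key-value pairs). The names `rdd_map`, `rdd_filter`, and `rdd_join` are illustrative stand-ins, not Spark's API; in real Spark the same operations run distributed over RDDs.

```python
def rdd_map(data, f):
    # like rdd.map(f): apply f to every element
    return [f(x) for x in data]

def rdd_filter(data, pred):
    # like rdd.filter(pred): keep elements where pred is True
    return [x for x in data if pred(x)]

def rdd_join(left, right):
    # like a pair-RDD inner join: match (key, value) pairs on the key
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

ages = [("alice", 30), ("bob", 25), ("carol", 41)]
cities = [("alice", "NY"), ("carol", "LA")]

# Chain filter -> join -> map, a workflow that is hard to express
# as a single Map and Reduce pass:
adults = rdd_filter(ages, lambda kv: kv[1] >= 30)
joined = rdd_join(adults, cities)
names = rdd_map(joined, lambda kv: kv[0])
print(joined)  # [('alice', (30, 'NY')), ('carol', (41, 'LA'))]
print(names)   # ['alice', 'carol']
```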
Solutions by Spark
• Iterative algorithms? e.g. machine learning
• Fast computation: in-memory caching of data, specified by the user
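The following pure-Python sketch shows why user-specified caching matters for iterative algorithms: each MapReduce job re-reads its input from disk, while Spark can keep a dataset in memory (via `rdd.cache()`) so later iterations skip the read. The disk-read counter here is only a simulation of HDFS access.

```python
disk_reads = {"count": 0}

def read_from_disk():
    disk_reads["count"] += 1        # count simulated HDFS reads
    return [1.0, 2.0, 3.0, 4.0]

# MapReduce-style: every iteration re-reads the data from disk
disk_reads["count"] = 0
for _ in range(3):
    data = read_from_disk()
    step = sum(data) / len(data)    # stand-in for one training step
reads_without_cache = disk_reads["count"]

# Spark-style: read once, cache, then iterate in memory
disk_reads["count"] = 0
cached = read_from_disk()           # analogous to caching the RDD on first use
for _ in range(3):
    step = sum(cached) / len(cached)
reads_with_cache = disk_reads["count"]

print(reads_without_cache, reads_with_cache)  # 3 1
```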


Solutions by Spark
• Interactivity? Other languages?
• Native Python and Scala (and R) interfaces, with interactive shells
100 TB sorting competition: in 2014 Spark won the Daytona GraySort benchmark, sorting 100 TB about three times faster than the previous Hadoop MapReduce record while using roughly one tenth of the machines.
Architecture of Spark
• Driver Program: runs your application and holds a SparkContext; each application instantiates its own SparkContext
• Cluster Manager (YARN or Standalone): grants resources, i.e. decides where processes run, and provisions/restarts workers
• Worker Nodes: each worker node runs one or more Executors; an Executor is a JVM process that can host multiple Python processes, reading data from HDFS

[Diagram: the Driver Program's SparkContext connects through the Cluster Manager to Executors (JVM plus Python processes) on the Worker Nodes, with HDFS as storage.]

Deployment examples
• Local mode, e.g. on the Cloudera VM: driver and a single executor on one worker node, with limited scheduling
• Standalone cluster mode, e.g. on Amazon EMR EC2 nodes: driver on the master node, executors on the worker nodes
• YARN cluster mode: YARN schedules executors across the worker nodes
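The architecture above can be exercised from a short driver program. This is a sketch of the standard PySpark entry point (it assumes the `pyspark` package is installed); the application name and the data are made up for illustration.

```python
# The driver creates a SparkContext, which contacts the cluster manager
# to obtain executors on the worker nodes.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("module5-demo")      # hypothetical application name
        .setMaster("local[*]"))          # local mode, e.g. on the Cloudera VM;
                                         # use "yarn" on a cluster such as Amazon EMR

sc = SparkContext(conf=conf)
rdd = sc.parallelize(range(100))
print(rdd.filter(lambda x: x % 2 == 0).count())  # counts the even numbers
sc.stop()
```

Changing only the master URL moves the same program from local mode to a YARN or Standalone cluster; the driver code itself does not change.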
