Architecture and Components of Spark
▸ the API libraries form the top layer of the Spark stack; all of them can be used together in a single Spark application
▸ the bottom layer is the cluster managers that Spark works with for resource management
▸ Spark Core: distributes workloads, monitors applications, schedules tasks, manages memory, handles fault recovery, interacts with HDFS, and houses the APIs that define RDDs
▸ Spark API Libraries: built on top of Spark Core; they inherit all Spark Core features like fault tolerance...
▹ Spark SQL: structured data processing
▹ Spark Streaming: processing of live data stream
▹ Spark MLlib: common machine learning functionality
▹ GraphX: library for manipulating graphs and performing graph parallel computations
▹ SparkR: provides lightweight frontend to use Apache Spark from R
▸ Spark Cluster Manager: for resource allocation - spark can connect to pluggable resource managers
▹ Standalone: simple cluster manager included within Spark and makes it easy to setup a cluster
▹ Apache Mesos: general cluster manager that can run Hadoop MapReduce and service applications
▹ Hadoop YARN: cluster manager of Hadoop 2
▸ Spark Runtime Architecture
▹ master-slave architecture; master=driver, slave=executor
▹ drivers and executors run in their own Java processes
▹ a Spark application is launched using the cluster manager
▸ SparkContext
▹ main entry point to everything Spark
▹ defined in the driver program
▹ tells Spark how and where to access cluster
▹ connects to cluster manager
▹ coordinates Spark processes running on different nodes
▹ used to create RDDs and shared variables on a cluster
▸ Driver
▹ this is where the main() method of the user program runs
▹ converts the user program into tasks (the smallest unit of work); tasks are bundled into "stages"
▹ it schedules tasks on executors
▸ Executors
▹ run for the entire lifetime of the application
▹ register themselves with the driver, which allows the driver to schedule tasks on them
▹ worker processes that run individual tasks and return results to the driver
▹ provide in-memory storage for RDDs as well as disk storage
▸ Workflow:
User app → driver program contacts the cluster manager for resources → cluster manager launches executors → driver splits the program into tasks and sends them to executors → executors run the tasks and return results to the driver → executors are terminated and resources are released (see the sketch below)
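A minimal sketch of this driver workflow in Scala (assuming Spark's Scala API and a local master; the application name "CountApp" and the sample data are illustrative only):

    import org.apache.spark.{SparkConf, SparkContext}

    object CountApp {
      def main(args: Array[String]): Unit = {
        // the driver defines the SparkContext, which connects to the cluster manager
        val conf = new SparkConf().setAppName("CountApp").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // create an RDD; the driver converts the operations below into tasks
        val rdd = sc.parallelize(Seq(4, 5, 10, 10))

        // the action triggers execution: tasks run on executors, results come back to the driver
        val total = rdd.map(x => x + 1).reduce((a, b) => a + b)
        println(total) // 33

        // stopping the context terminates executors and releases cluster resources
        sc.stop()
      }
    }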
RDD Operations
▸ Spark offers >80 high-level operations beyond MapReduce:
▹ ex. transformations: map, filter, distinct, flatMap
▹ ex. actions: collect, count, take, reduce
▸ map: apply function to each element in RDD and return new RDD
RDD {4,5,10,10}, using Scala syntax:
rdd.map(x => x+1)
Result: {5,6,11,11}
▸ filter
rdd.filter(x => x < 10)
Result: {4,5}
▸ distinct
rdd.distinct()
Result: {4,5,10}
▸ flatMap: similar to map but returns a sequence rather than a single element; apply the function, then flatten the result
rdd.flatMap(x => List(x-1, x+1))
Result: {3,5,4,6,9,11,9,11}
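The four operations above as a runnable sketch, assuming a spark-shell session where sc (the SparkContext) is already defined; collect() is the action that brings the results back to the driver:

    val rdd = sc.parallelize(Seq(4, 5, 10, 10))

    rdd.map(x => x + 1).collect()                   // Array(5, 6, 11, 11)
    rdd.filter(x => x < 10).collect()               // Array(4, 5)
    rdd.distinct().collect()                        // Array(4, 5, 10), order may vary
    rdd.flatMap(x => List(x - 1, x + 1)).collect()  // Array(3, 5, 4, 6, 9, 11, 9, 11)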
Quiz
1) Which of the following are Spark API libraries?
Correct answer
Spark SQL, Spark MLlib
Explanation
Although ETL and deep learning can be performed using Spark, there are no libraries with these names. ETL can be performed using Spark SQL and deep learning can be performed using Spark MLlib.

2) Spark Core is the base engine of Spark. Choose the correct functions of Spark Core:
Correct answer
Task scheduler and memory management, fault recovery

7) What is a task?
Correct answer
The smallest unit of work sent to one executor
Explanation
A task is the smallest unit of work sent to one executor; tasks are bundled into "stages". The driver splits the user program into tasks and stages.

8) The driver returns results to executors.
Correct answer
False
Explanation
The driver schedules tasks on executors; results from these tasks are delivered back to the driver.
Quiz
1) reduceByKey is preferred to groupByKey.
Correct answer
True
Explanation
groupByKey causes shuffling of large amounts of data and is hence not preferred. reduceByKey, on the other hand, reduces by key first and then shuffles data to worker nodes where further reducing happens (see the sketch after this quiz).

2) Actions are lazily evaluated.
Correct answer
False
Explanation
Actions are evaluated immediately and are responsible for getting results from Spark data operations.

3) Which of the following are operators in Spark?
Correct answer
map, reduce
Explanation
"print" is not an operator, meaning it is neither a transformation nor an action. "print" is offered by the native Java, Python or Scala API.

8) Operations on RDDs are grouped into transformations, collections and actions.
Correct answer
False
Explanation
Operations on RDDs are either transformations or actions.

9) RDDs can be created by which of the following approaches?
Correct answer
By using a Spark API like textFile(), by transforming another RDD
Explanation
RDDs can be created by loading from external storage like a file system; textFile() on SparkContext converts the contents of a file into an RDD. Transformation operations on an RDD also result in an RDD, but action operations on an RDD do not yield an RDD.

10) If an RDD wordsRDD contains {'pencil', 'paper', 'computer', 'mouse'}, what is the result of wordsRDD.map(lambda x : len(x)).reduce(lambda x,y: x+y)? Hint: len() is a function that returns the length of a string.
Correct answer
The value 24
Explanation
The map function returns an RDD containing the length of each string, which is {6, 5, 8, 5}. The reduce function is chained to the map function and adds all the lengths to return the value 24.

11) Which of the following statements are true about RDDs?
Correct answer
All of the above
Explanation
The RDD is the primary data API allowing data to be processed in Spark. RDDs are distributed collections of elements that can be reconstructed on failure and are hence fault tolerant. RDDs are immutable, as transformations on RDDs result in new RDDs with the original RDD staying untouched.

12) Transformations on RDDs result in
Correct answer
a new RDD, an update of the DAG
Explanation
When a Spark operator takes a function as a parameter, operations are invoked on RDDs by executing the function on each element of the RDD.
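A minimal sketch of the reduceByKey vs groupByKey point from question 1 above, assuming a spark-shell session with sc defined; the pair data here is made up for illustration:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

    // reduceByKey combines values per key within each partition before shuffling,
    // so less data moves across the network
    pairs.reduceByKey((a, b) => a + b).collect()        // Array((a,3), (b,1)), order may vary

    // groupByKey shuffles every (key, value) pair first and only then groups and sums
    pairs.groupByKey().mapValues(v => v.sum).collect()  // Array((a,3), (b,1)), order may vary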