8 Apache Spark

Apache Spark is an open-source processing engine that supports multiple programming languages and is designed for in-memory computing. It uses Resilient Distributed Datasets (RDDs) for fault tolerance and parallel processing, with lazy evaluation enabling efficient data transformations. Spark provides components such as Spark SQL, MLlib, GraphX, and Spark Streaming for different data processing needs.

[Figure: from MPI, peer-to-peer networks, and the Distributed File System (HDFS) to In-Memory Cluster Computing, with Fault Tolerance, Iterative Applications, and Lazy Evaluation]
Apache Spark
• Processing engine; instead of just "map" and "reduce", it defines a large set of operations (transformations & actions)
• Open-source software
• Supports Java, Scala, R, and Python
• Key construct: the Resilient Distributed Dataset (RDD)
Why Spark? In-Memory Computing
Spark Stack and what you can do
• Spark SQL
  • for SQL and structured data processing
• MLlib
  • machine learning algorithms
• GraphX
  • graph processing
• Spark Streaming
  • stream processing of live data streams
Resilient Distributed Dataset (RDD)
The fundamental unit of data in Spark: an immutable collection of objects (or records, or elements) that can be operated on "in parallel" (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recover from node failures
• An RDD keeps its lineage information → it can be recreated from its parent RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions → (more partitions means more parallelism)
Dataset -- initial data can come from a file or be created programmatically
How basic operations work

Three main steps:
• Create an RDD
• Transformations
• Actions
Creating RDDs
Two ways of creating an RDD:
• Initialize a collection of values
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
• Load data file(s) from the local file system, HDFS, S3, etc.
val rdd = sc.textFile("file:///anyText.txt")
RDD Operations
Two types of operations:
• Transformations: define a new RDD based on current RDD(s)
• Actions: return values
RDD Transformations
• A set of operations on an RDD that define how it should be transformed
• As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable)
• Transformations are lazily evaluated, which allows optimizations to take place before execution
• Examples: map(), filter(), groupByKey(), sortByKey(), etc.
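A minimal sketch (assuming a live SparkContext named sc, as in the creation examples above; the values are illustrative):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
// map: apply a function to each element, producing a new RDD
val squares = nums.map(n => n * n)
// filter: keep only elements matching a predicate, again a new RDD
val evens = squares.filter(n => n % 2 == 0)
// Nothing has executed yet: both calls only extended the lineage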
RDD Actions
• Apply transformation chains on RDDs, eventually performing some additional operations (e.g., counting)
• Some actions only store data to an external data source (e.g., HDFS); others fetch data from the RDD (and its transformation chain) upon which the action is applied, and convey it to the driver
• Some common actions:
➢ count() – return the number of elements
➢ take(n) – return an array of the first n elements
➢ collect() – return an array of all elements
➢ saveAsTextFile(file) – save to text file(s)
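Continuing the sketch above, an action is what finally triggers execution and returns a result (the output path is hypothetical):

println(evens.count())                     // 2
println(evens.take(1).mkString(", "))      // 4
println(evens.collect().mkString(", "))    // 4, 16
evens.saveAsTextFile("file:///tmp/evens")  // writes partition files under /tmp/evens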
Lazy Execution of RDDs
Data in RDDs is not processed until an action is performed
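A sketch of this laziness (reusing the hypothetical file name from earlier): each line below returns immediately without touching the file; only count() makes Spark read and process the data.

val lines = sc.textFile("file:///anyText.txt")      // nothing is read yet
val words = lines.flatMap(line => line.split(" "))  // still nothing
val longWords = words.filter(w => w.length > 5)     // still nothing
println(longWords.count())  // action: the whole chain executes now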
DAG (Directed Acyclic Graph)
• A critical component of the Spark execution engine that provides several advantages for the efficient processing of large-scale data
• The DAG allows Spark to break down a large-scale data processing job into smaller, independent tasks, and to:
  • execute them in parallel
  • optimize the job execution
  • achieve fault tolerance
  • reuse intermediate results
  • provide a visual representation of the logical execution plan
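As a quick sketch, Spark can print an RDD's lineage (the basis of the DAG) via toDebugString; the reduceByKey below introduces a shuffle, i.e., a stage boundary in the DAG (the file name is again hypothetical):

val counts = sc.textFile("file:///anyText.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)          // shuffle: starts a new stage in the DAG
println(counts.toDebugString)  // prints the lineage of this RDD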
Lifetime of a Job in Spark
Two ways of working with Spark
• Interactively (spark-shell)
  • for learning or data exploration
  • Python or Scala
• Standalone application (spark-submit)
  • for large-scale data processing
  • Python, Scala, or Java
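For the interactive route, Spark ships with shells that start with a ready-made SparkContext (sc):

spark-shell   # Scala REPL
pyspark       # Python REPL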
Launching a standalone application with spark-submit:

spark-submit --class company.division.yourClass --master spark://<master-host>:7077 --name "Pi" Pi.jar
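For completeness, a minimal standalone application that a command like the one above could launch. This is a sketch of the classic Monte Carlo Pi estimate, not the deck's actual Pi.jar source; the package and class names merely mirror the command:

package company.division

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

object yourClass {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Pi")
    val sc = new SparkContext(conf)
    val n = 100000  // number of random points to sample
    // Count the points that fall inside the unit circle
    val inside = sc.parallelize(1 to n).filter { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      x * x + y * y < 1
    }.count()
    println(s"Pi is roughly ${4.0 * inside / n}")
    sc.stop()
  }
}

Package it as Pi.jar (e.g., with sbt) and submit it with the spark-submit command above.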
References

• https://spark.apache.org/docs/latest/
• https://sparkbyexamples.com/
