8 Apache Spark

Apache Spark is an open-source processing engine that supports multiple programming languages and is designed for in-memory computing. It uses Resilient Distributed Datasets (RDDs) for fault tolerance and parallel processing, with lazy evaluation enabling efficient data transformations. Spark provides components such as Spark SQL, MLlib, GraphX, and Spark Streaming for different data processing needs.

[Figure: from MPI, peer-to-peer networks, and the Distributed File System (HDFS) to In-Memory Cluster Computing, with Fault Tolerance, Iterative Applications, and Lazy Evaluation]
Apache Spark
• Processing engine; instead of just "map" and "reduce", it defines a large set of operations (transformations & actions)
• Open-source software
• Supports Java, Scala, R, and Python
• Key construct: the Resilient Distributed Dataset (RDD)
Why Spark? In-Memory Computing
Spark Stack and what you can do
• Spark SQL
  • for SQL and structured data processing
• MLlib
  • machine learning algorithms
• GraphX
  • graph processing
• Spark Streaming
  • stream processing of live data streams
Resilient Distributed Dataset (RDD)
The fundamental unit of data in Spark: an immutable collection of objects (or records, or elements) that can be operated on "in parallel" (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recover from node failures
• An RDD keeps its lineage information → it can be recreated from its parent RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions → (more partitions means more parallelism)
Dataset -- initial data can come from a file or be created programmatically
How basic operations work

Three main steps:
• Create an RDD
• Transformations
• Actions
Creating RDDs
Two ways of creating an RDD:
• Initialize a collection of values
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
• Load data file(s) from the local file system, HDFS, S3, etc.
val rdd = sc.textFile("file:///anyText.txt")
RDD Operations
Two types of operations:
• Transformations: define a new RDD based on current RDD(s)
• Actions: return values
RDD Transformations
• A set of operations on an RDD that define how it should be transformed
• As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable)
• Transformations are lazily evaluated, which allows optimizations to take place before execution
• Examples: map(), filter(), groupByKey(), sortByKey(), etc.
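A minimal sketch (assuming a live SparkContext named sc, as in the creation examples above; the values are illustrative):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
// map: apply a function to each element, producing a new RDD
val squares = nums.map(n => n * n)
// filter: keep only elements matching a predicate, again a new RDD
val evens = squares.filter(n => n % 2 == 0)
// Nothing has executed yet: both calls only extended the lineage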
RDD Actions
• Apply transformation chains on RDDs, eventually performing some additional operations (e.g., counting)
• Some actions only store data to an external data source (e.g., HDFS); others fetch data from the RDD (and its transformation chain) upon which the action is applied, and convey it to the driver
• Some common actions:
➢ count() – return the number of elements
➢ take(n) – return an array of the first n elements
➢ collect() – return an array of all elements
➢ saveAsTextFile(file) – save to text file(s)
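Continuing the sketch above, an action is what finally triggers execution and returns a result (the output path is hypothetical):

println(evens.count())                     // 2
println(evens.take(1).mkString(", "))      // 4
println(evens.collect().mkString(", "))    // 4, 16
evens.saveAsTextFile("file:///tmp/evens")  // writes partition files under /tmp/evens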
Lazy Execution of RDDs
Data in RDDs is not processed until an action is performed
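A sketch of this laziness (reusing the hypothetical file name from earlier): each line below returns immediately without touching the file; only count() makes Spark read and process the data.

val lines = sc.textFile("file:///anyText.txt")      // nothing is read yet
val words = lines.flatMap(line => line.split(" "))  // still nothing
val longWords = words.filter(w => w.length > 5)     // still nothing
println(longWords.count())  // action: the whole chain executes now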
DAG (Directed Acyclic Graph)
• A critical component of the Spark execution engine that provides several advantages for the efficient processing of large-scale data
• The DAG allows Spark to break down a large-scale data processing job into smaller, independent tasks, and to:
  • execute them in parallel
  • optimize the job execution
  • achieve fault tolerance
  • reuse intermediate results
  • provide a visual representation of the logical execution plan
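As a quick sketch, Spark can print an RDD's lineage (the basis of the DAG) via toDebugString; the reduceByKey below introduces a shuffle, i.e., a stage boundary in the DAG (the file name is again hypothetical):

val counts = sc.textFile("file:///anyText.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)          // shuffle: starts a new stage in the DAG
println(counts.toDebugString)  // prints the lineage of this RDD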
Lifetime of a Job in Spark
Two ways of working with Spark
• Interactively (spark-shell)
  • for learning or data exploration
  • Python or Scala
• Standalone application (spark-submit)
  • for large-scale data processing
  • Python, Scala, or Java
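For the interactive route, Spark ships with shells that start with a ready-made SparkContext (sc):

spark-shell   # Scala REPL
pyspark       # Python REPL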
Launching a standalone application with spark-submit:

spark-submit --class company.division.yourClass --master spark://<master-host>:7077 --name "Pi" Pi.jar
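For completeness, a minimal standalone application that a command like the one above could launch. This is a sketch of the classic Monte Carlo Pi estimate, not the deck's actual Pi.jar source; the package and class names merely mirror the command:

package company.division

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

object yourClass {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Pi")
    val sc = new SparkContext(conf)
    val n = 100000  // number of random points to sample
    // Count the points that fall inside the unit circle
    val inside = sc.parallelize(1 to n).filter { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      x * x + y * y < 1
    }.count()
    println(s"Pi is roughly ${4.0 * inside / n}")
    sc.stop()
  }
}

Package it as Pi.jar (e.g., with sbt) and submit it with the spark-submit command above.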
References

• https://spark.apache.org/docs/latest/
• https://sparkbyexamples.com/
