Data Platform and Analytics Foundational Training: (Speaker Name)
Apache Spark: A unified framework
A unified, open source, parallel data processing framework for big data analytics
Primary resource managers: Hadoop 1.0+ or Hadoop YARN
Alternative resource managers: Mesos or the Spark resource manager
Faster data, faster results
Spark is the 2014 Sort Benchmark winner, 3x faster than the 2013 winner (Hadoop).
[Chart: Sort Benchmark, Hadoop vs. Spark]
  Hadoop: 102.5 TB sorted in 72 minutes on 2100 nodes (50400 cores)
  Spark: 100 TB sorted in 23 minutes on 206 nodes (6592 cores)
Spark is fast not just for in-memory,
[Chart: Logistic regression, running time (s), Hadoop vs. Spark 0.9]
Spark cluster architecture
[Diagram: Spark cluster architecture]
Client access: Browser, Gateway, Zeppelin, Jupyter
Head node: Spark master, Spark driver (driver program, SparkContext), App 0, App 1, App 2
Worker nodes 1 to 4: one Worker per node, each running Tasks for a submitted Job
Data: RDD, HDFS
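The driver, worker, and task roles in this architecture can be sketched in plain Python. This is only an analogy, not the Spark API: it assumes a thread pool stands in for the worker nodes and a list of slices stands in for the partitions of an RDD. The names split_into_partitions, sum_of_squares_task, and run_job are hypothetical and chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into contiguous, roughly equal slices (stand-in for RDD partitions)."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional record each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def sum_of_squares_task(partition):
    """One task: the unit of work a single worker runs against a single partition."""
    return sum(x * x for x in partition)


def run_job(records, num_workers=4):
    """Driver side: create one task per partition, hand tasks to workers, combine results."""
    partitions = split_into_partitions(records, num_workers)
    # Assumption: a local thread pool plays the role of the four worker nodes.
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        partial_results = list(workers.map(sum_of_squares_task, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_job(list(range(1, 11))))
```

In a real cluster the driver program would create the job through its SparkContext and the Spark master would schedule the tasks on worker nodes; the sketch only mirrors the split, run in parallel, combine pattern.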
Use Cases
Apache Spark use cases
High performance batch computation
Interactive analytics