Spark Context, Resilient Distributed Datasets
[Diagram: data sharing in MapReduce vs. Spark — each MapReduce step reads its input from HDFS and writes its output back, while Spark can read once, cache the data in memory, and reuse it across steps]
Solution: Resilient Distributed Datasets (RDDs)
Fault recovery? Lineage!
Log the coarse-grained operations applied to a partitioned dataset.
Simply recompute the lost partitions if a failure occurs!
No cost if no failure occurs.
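A minimal PySpark sketch of this idea, assuming a local SparkContext (names and values are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

# Transformations only log coarse-grained operations as lineage;
# nothing is computed yet.
nums = sc.parallelize(range(1000), numSlices=4)
evens = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# The lineage graph Spark would replay to recompute a lost partition:
print(evens.toDebugString().decode())

# An action triggers the actual computation across the partitions.
print(evens.count())

sc.stop()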
Control
Partitioning: Spark also gives you control over how your RDDs are partitioned across nodes (as sketched below).
https://spark.apache.org/
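A hedged sketch of that control in PySpark, using hypothetical integer keys and local mode for illustration:

from pyspark import SparkContext

sc = SparkContext("local[4]", "partitioning-demo")

# Key-value pairs; keys here are hypothetical IDs in 0..9.
pairs = sc.parallelize([(i % 10, i) for i in range(100)])

# Hash-partition into 4 partitions so all values for a key co-locate.
by_key = pairs.partitionBy(4)
print(by_key.getNumPartitions())  # -> 4

# glom() materializes each partition as a list, to inspect placement.
print(by_key.glom().map(len).collect())

sc.stop()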
Action!
When an action is invoked, Spark walks the lineage graph and schedules tasks:

log: HadoopRDD (path = hdfs://...)
  → errors: FilteredRDD (func = _.contains(…), shouldCache = true)
  → Task 1, Task 2, ...
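The same pipeline as a PySpark sketch. The slide elides the real HDFS path and the filter argument; "access.log" and "ERROR" below are hypothetical stand-ins:

from pyspark import SparkContext

sc = SparkContext("local[*]", "log-mining")

# HadoopRDD: each line of the log file becomes one record
# ("access.log" is a hypothetical stand-in for the hdfs:// path).
log = sc.textFile("access.log")

# FilteredRDD: keep only the error lines ("ERROR" is an assumed match).
errors = log.filter(lambda line: "ERROR" in line)

# shouldCache = true: keep the filtered partitions in memory for reuse.
errors.cache()

# This action launches Task 1, Task 2, ... one per partition.
print(errors.count())

sc.stop()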
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
To follow along, use the script at http://cern.ch/kacper/spark.txt, or start the Python interpreter:
$ pyspark
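Inside the shell a SparkContext is already bound to sc, so a first session might look like this (values are illustrative):

>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(lambda x: x + 1).collect()
[2, 3, 4]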