Apache Spark Workshop, 2020-09-17
[Diagram: the Spark stack. Components on top of Spark Core: Spark SQL, Spark Streaming, machine learning (MLlib), GraphX, 3rd-party libraries. Language APIs: R, Java, Python, Scala. Cluster managers: Standalone scheduler, EC2, Hadoop YARN, Apache Mesos, Kubernetes]
▶ Core functionalities
▶ task scheduling
▶ memory management
▶ fault recovery
▶ storage systems interaction
▶ etc.
▶ Basic data structure definitions/abstractions
▶ Resilient Distributed Datasets (RDDs)
▶ main Spark data structure (see the sketch below)
▶ Directed Acyclic Graph (DAG)
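A minimal RDD sketch (assuming a SparkContext named sc, as provided by the pyspark shell; the numbers are purely illustrative):

nums = sc.parallelize(range(1, 11), 4)     # RDD with 4 partitions
squares = nums.map(lambda x: x * x)        # transformation: lazy, only extends the DAG
squares.reduce(lambda a, b: a + b)         # action: triggers execution -> 385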
▶ Structured data manipulation
▶ DataFrames definition
▶ Table-like data representation
▶ Extension of RDDs
▶ Schema definition
▶ Execution of SQL queries (see the sketch below)
▶ Native support for schema-based data
▶ Hive, Parquet, JSON, CSV
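A minimal Spark SQL sketch (assuming a SparkSession named spark; the file name reviews.csv is a placeholder):

df = spark.read.csv('reviews.csv', header=True, inferSchema=True)
df.createOrReplaceTempView('reviews')      # register the DataFrame for SQL queries
spark.sql('SELECT sentiment, COUNT(*) AS n FROM reviews GROUP BY sentiment').show()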
▶ Data analysis of streaming data
▶ e.g. tweets, log messages
▶ Features of stream processing
▶ High-throughput
▶ Fault-tolerant
▶ End-to-end exactly-once guarantees
▶ High-level abstraction of a discretized stream
▶ DStream, represented as a sequence of RDDs (see the sketch below)
▶ Spark 2.3+: Continuous Processing
▶ end-to-end latencies as low as 1 ms
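A minimal DStream sketch (assuming a SparkContext sc and a text source on localhost:9999, e.g. started with nc -lk 9999; local mode needs local[N] with N >= 2 so the receiver does not starve processing):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches
lines = ssc.socketTextStream('localhost', 9999)    # DStream: a sequence of RDDs
counts = lines.flatMap(lambda l: l.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()                                    # print each batch's word counts
ssc.start()
ssc.awaitTermination()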
▶ Common ML functionalities
▶ ML Algorithms
▶ common learning algorithms such as classification, regression, clustering, and collaborative filtering
▶ Featurization
▶ feature extraction, transformation, dimensionality reduction, and selection
▶ Pipelines
▶ tools for constructing, evaluating, and tuning ML Pipelines
▶ Persistence
▶ saving and loading algorithms, models, and Pipelines
▶ Utilities
▶ linear algebra, statistics, data handling, etc.
▶ Two APIs
▶ RDD-based API (spark.mllib package)
▶ Spark 2.0+, DataFrame-based API (spark.ml package)
▶ Methods scale out across the cluster by default (see the pipeline sketch below)
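A minimal spark.ml pipeline sketch (assuming a DataFrame df2 with 'review' and 'sentiment' columns; the stage choices and save path are illustrative, not the workshop's actual model):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, StringIndexer
from pyspark.ml.classification import LogisticRegression

pipe = Pipeline(stages=[
    Tokenizer(inputCol='review', outputCol='words'),        # featurization
    HashingTF(inputCol='words', outputCol='features'),
    StringIndexer(inputCol='sentiment', outputCol='label'),
    LogisticRegression(maxIter=10),                         # ML algorithm
])
model = pipe.fit(df2)           # fits every stage, distributed across the cluster
model.save('sentimentModel')    # persistence; reload with PipelineModel.load(...)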
▶ Local mode
▶ "Pseudo-cluster" ad-hoc setup using a script
▶ Cluster mode
▶ Running via cluster manager
▶ Interactive mode
▶ Direct manipulation in a shell (pyspark, spark-shell)
▶ Non-distributed single-JVM deployment mode
▶ Spark library spawns (in a JVM)
▶ driver
▶ scheduler
▶ master
▶ executor
▶ Parallelism is the number of threads, defined by the parameter N in the Spark master URL
▶ local[N] (see the sketch below)
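A minimal local-mode sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[4]') \
    .appName('workshop') \
    .getOrCreate()
sc = spark.sparkContext    # driver, scheduler, master and executor share this one JVM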
▶ Components
▶ Worker
▶ Node in a cluster that runs one or more executors
▶ An executor manages computation, storage, and caching
▶ Cluster manager
▶ Allocates resources, negotiated via the driver program's SparkContext (see the sketch below)
▶ Driver program
▶ A program holding SparkContext and main code to execute in Spark
▶ Sends application code to executors to execute
▶ Listens to incoming connections from executors
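Connecting a driver to a cluster manager looks just like local mode; only the master URL changes. A minimal sketch (host and port are placeholders for a standalone cluster):

from pyspark.sql import SparkSession

# The driver's SparkContext asks this cluster manager for executors on the workers
spark = SparkSession.builder \
    .master('spark://master-node:7077') \
    .appName('workshop') \
    .getOrCreate()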
1. Data preparation/import
▶ RDD creation, i.e. a parallel dataset with partitions
2. Transformations/actions definition
▶ Creation of tasks (units of work), each sent to one executor
▶ A job is the set of tasks executed by an action
3. Creation of a directed acyclic graph (DAG)
▶ Contains a graph of RDD operations
▶ Definition of stages: sets of tasks to be executed in parallel (i.e. at partition level)
4. Execution of a program
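The workshop example below walks through these steps on an IMDB movie-reviews dataset; rdd2 is assumed to hold the raw CSV lines, one 'review,sentiment' record per line.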
def organize(line):
    # Split '"review",sentiment' records; fall back to a plain comma split
    data = line.split('",')
    data = data if len(data) == 2 else line.split(',')
    return (data[1], data[0][1:51] + ' ...')   # (sentiment, truncated review text)

movies = rdd2.filter(lambda x: x != 'review,sentiment').map(organize)  # drop the header line
movies.count()  # 50,000
movies = movies.filter(lambda x: x[0] in ['positive', 'negative'])     # keep clean labels only
movies.count()  # 45,936
movieCounts = movies.groupByKey().map(lambda x: (x[0], len(x[1])))     # reviews per sentiment
posReviews.cache().collect()   # posReviews: an RDD derived earlier (not shown on this slide)
[Diagram: the DAG of the job above, split into Stage 1 and Stage 2]
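The stage boundary comes from the groupByKey shuffle; the lineage (DAG) behind it can be inspected directly, a minimal sketch:

movieCounts.toDebugString()    # RDD lineage, including the shuffle boundary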
df1.show()          # display the first rows of the DataFrame
df1.printSchema()   # print the inferred schema
df2.printSchema()   # df1, df2: DataFrames built earlier in the workshop (not shown here)
▶ User-defined functions (UDFs) are custom functions that run against the "database" directly
▶ Caveats
▶ Optimization problems (especially in PySpark!)
▶ Special values must be handled by the programmer (e.g. null values)
▶ Approaches to use UDFs (see the sketch below)
▶ df = df.withColumn(...)
▶ df = sqlContext.sql("SELECT * FROM <UDF>")
▶ rdd.map(UDF())
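A minimal withColumn UDF sketch producing the 'positiveWords' column used below; the word list is a toy assumption, not the workshop's actual list:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

POSITIVE = {'good', 'great', 'excellent', 'wonderful'}   # toy word list (assumption)

# Handle null reviews explicitly -- one of the caveats above
count_positive = udf(
    lambda text: sum(w in POSITIVE for w in text.split()) if text else 0,
    IntegerType())

df2 = df2.withColumn('positiveWords', count_positive(df2['review']))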
df2 = df2.drop('review')                                  # drop the raw text column
df2.select(df2['sentiment'], df2['positiveWords']).show(3)
df2.select(df2['sentiment'], df2['positiveWords']) \
    .filter(df2['positiveWords'] > 10).show(3)            # reviews with many positive words
df2.groupBy('sentiment').count().show()                   # rows per sentiment label
df2.summary().show()                                      # descriptive statistics for numeric columns
squeue -u campus02   # list the user's pending/running Slurm jobs
sacct -j 51438       # accounting details for job 51438
▶ https://training.databricks.com/visualapi.pdf
▶ https://events.prace-ri.eu/event/896/
▶ https://luminousmen.com/post/spark-core-concepts-explained
▶ https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:slurm_sbatch_script_for_spark_applications
▶ https://researchcomputing.princeton.edu/faq/spark-via-slurm