
Introduction to Apache Spark

Slavko Žitnik, Marko Prelevikj


University of Ljubljana, Faculty for computer and information science



Agenda

▶ About Apache Spark


▶ Spark execution modes
▶ Basic Spark data structures (RDDs)
▶ Hands-on tutorial
▶ RDDs and operations
▶ DataFrames, User defined functions and SparkSQL
▶ Hands-on lab exercises – Jupyter notebooks
▶ Hands-on lab exercise – Spark on an HPC
▶ Apache Spark deployment using Slurm
▶ Challenge exercises (independent work) & debug session



About Apache Spark

▶ Fast, expressive, general-purpose in-memory cluster computing framework compatible with Apache Hadoop and built around speed, ease of use and streaming analytics
▶ Faster and easier than Hadoop MapReduce*
▶ Large community and 3rd party libraries
▶ Provides high-level APIs (Java, Scala, Python, R)
▶ Supports variety of workloads
▶ interactive queries, streaming, machine learning and graph processing



Apache Spark Use cases

▶ Logs processing (Uber)


▶ Event detection and real-time analysis
▶ Interactive analysis
▶ Latency reduction
▶ Advanced ad-targeting (Yahoo!)
▶ Recommendation systems (Netflix, Pinterest)
▶ Fraud detection
▶ Sentiment analysis (Twitter)
▶ ...


Apache Spark general setup: Twitter sentiment analysis



Hadoop MapReduce vs. Apache Spark

▶ Big data frameworks


▶ Performance
▶ Ease of use
▶ Costs
▶ Data processing
▶ Fault tolerance
▶ Security
▶ Hadoop
▶ Archival data analysis
▶ Spark
▶ Real-time data analysis


Apache Spark ecosystem

▶ Libraries: Spark SQL, Spark Streaming, Machine learning (MLlib), GraphX, 3rd party libraries
▶ Apache Spark Core
▶ Cluster managers: Standalone scheduler, EC2, Hadoop YARN, Apache Mesos, Kubernetes
▶ Language APIs: R, Java, Python, Scala


Spark ecosystem: Spark Core

▶ Core functionalities
▶ task scheduling
▶ memory management
▶ fault recovery
▶ storage systems interaction
▶ etc.
▶ Basic data structure definitions/abstractions
▶ Resilient Distributed Datasets (RDDs)
▶ main Spark data structure
▶ Directed Acyclic Graph (DAG)


Spark ecosystem: Spark SQL

▶ Structured data manipulation
▶ DataFrames definition
▶ Table-like data representation
▶ RDDs extension
▶ Schema definition
▶ SQL queries execution
▶ Native support for schema-based data
▶ Hive, Parquet, JSON, CSV (a hedged reader sketch follows below)
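
For illustration only (a minimal sketch, assuming a SQLContext named sqlContext as created later in the hands-on part, and hypothetical file paths that are not part of the workshop data):

df_parquet = sqlContext.read.parquet('data/example.parquet')   # hypothetical path
df_json = sqlContext.read.json('data/example.json')            # hypothetical path
df_json.printSchema()                                          # the schema is inferred from the data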



Spark ecosystem: Spark Streaming

▶ Data analysis of streaming data
▶ e.g. tweets, log messages
▶ Features of stream processing
▶ High-throughput
▶ Fault-tolerant
▶ End-to-end
▶ Exactly-once
▶ High-level abstraction of a discretized stream
▶ DStream, represented as a sequence of RDDs (a minimal sketch follows below)
▶ Spark 2.3+: Continuous Processing
▶ end-to-end latencies as low as 1ms
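
A minimal, hedged DStream word-count sketch (assumes an existing SparkContext sc and a text stream on localhost:9999, e.g. started with nc -lk 9999):

from pyspark.streaming import StreamingContext

# note: local mode needs at least two threads (e.g. local[2]) – one for the receiver, one for processing
ssc = StreamingContext(sc, batchDuration=5)           # 5-second micro-batches
lines = ssc.socketTextStream('localhost', 9999)
counts = lines.flatMap(lambda l: l.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()                                       # print the per-batch counts
ssc.start()
ssc.awaitTermination()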



Spark ecosystem: MLlib

▶ Common ML functionalities
▶ ML Algorithms
▶ common learning algorithms such as classification, regression, clustering, and collaborative filtering
▶ Featurization
▶ feature extraction, transformation, dimensionality reduction, and selection
▶ Pipelines
▶ tools for constructing, evaluating, and tuning ML Pipelines
▶ Persistence
▶ saving and loading algorithms, models, and Pipelines
▶ Utilities
▶ linear algebra, statistics, data handling, etc.
▶ Two APIs
▶ RDD-based API (spark.mllib package)
▶ Spark 2.0+, DataFrame-based API (spark.ml package)
▶ Methods scale out across the cluster by default


Spark ecosystem: GraphX

▶ Support for graphs and graph-parallel computation
▶ Extension of RDDs (Graph)
▶ directed multigraph with properties on vertices and edges
▶ Graph computation operators
▶ subgraph, joinVertices, aggregateMessages, etc.
▶ Pregel API support


Spark Execution modes

▶ Local mode
▶ "Pseudo-cluster" ad-hoc setup using a script
▶ Cluster mode
▶ Running via a cluster manager
▶ Interactive mode
▶ Direct manipulation in a shell (pyspark, spark-shell)


Spark execution modes: Local mode

▶ Non-distributed, single-JVM deployment mode
▶ Spark library spawns (in a JVM)
▶ driver
▶ scheduler
▶ master
▶ executor
▶ Parallelism is the number of threads, defined by the parameter N in the Spark master URL
▶ local[N] (see the sketch below)
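
A minimal, hedged sketch (the application name is arbitrary): a context created with local[4] runs the driver and executor in one JVM with four worker threads.

import pyspark

sc = pyspark.SparkContext(appName='LocalModeExample', master='local[4]')
print(sc.defaultParallelism)   # 4 – one task slot per local thread
sc.stop()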



Spark execution modes: Cluster mode

▶ Deployment on a private cluster


▶ Apache Mesos
▶ Hadoop YARN
▶ Kubernetes
▶ Standalone mode, ...



Spark execution modes: Cluster mode

▶ Components
▶ Worker
▶ Node in a cluster that runs an executor
▶ Executor manages computation, storage and caching
▶ Cluster manager
▶ Allocates resources via SparkContext with Driver program
▶ Driver program
▶ A program holding SparkContext and main code to execute in Spark
▶ Sends application code to executors to execute
▶ Listens to incoming connections from executors



Spark execution modes: Cluster mode

▶ Deploy modes (standalone clusters)


▶ Client mode (default)
▶ Driver runs in the same process as client that submits the app
▶ Cluster mode
▶ Driver launched from a worker process
▶ Client process exits immediately after application submission



Spark Execution process

1. Data preparation/import
▶ RDDs creation – i.e. parallel dataset
with partitions
2. Transformations/actions definition*
▶ Creation of tasks (units of work) sent to one executor
▶ Job is a set of tasks executed by an action*
3. Creation of a directed acyclic graph (DAG)
▶ Contains a graph of RDD operations
▶ Definition of stages – sets of tasks to be executed in parallel (i.e. at a partition level)
4. Execution of the program (a minimal illustration follows below)
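
A hedged pySpark illustration of this flow (the data and names are made up; it assumes a SparkContext sc as created later in the hands-on part):

rdd = sc.parallelize(range(1, 1001), 4)                 # 1. RDD with 4 partitions
squares = rdd.map(lambda x: x * x)                      # 2. transformation – lazy, nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)            # 2. another lazy transformation
print(evens.toDebugString().decode())                   # 3. lineage/DAG recorded so far (bytes in pySpark 2.x)
print(evens.count())                                    # 4. action – triggers a job, split into stages and tasks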



Spark Programming concepts (Resilient Distributed Datasets - RDDs)

▶ Basic data representation in Spark
▶ A distributed collection of items – partitions
▶ Enables operations to be performed in parallel
▶ Immutable (read-only)
▶ Fault tolerant
▶ The "recipe" of data transformations is preserved, so a partition can be re-created at any time
▶ Caching
▶ Different storage levels possible (see the sketch below)
▶ Supports a set of Spark transformations and actions
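
A hedged sketch of partitioning and explicit caching with a chosen storage level (the numbers are illustrative; sc is the SparkContext from the hands-on part):

from pyspark import StorageLevel

nums = sc.parallelize(range(1, 1000001), 8)             # distributed collection with 8 partitions
print(nums.getNumPartitions())                          # 8
squared = nums.map(lambda x: x * x)
squared.persist(StorageLevel.MEMORY_AND_DISK)           # keep partitions in memory, spill to disk if needed
print(squared.count())                                  # first action materialises and caches the partitions
print(squared.sum())                                    # later actions reuse the cached data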



Spark Programming concepts (Resilient Distributed Datasets - RDDs)

▶ Computations are expressed using
▶ creation of new RDDs
▶ transforming existing RDDs
▶ operations on RDDs to compute results (actions)
▶ Spark distributes the data within RDDs across nodes (executors) in the cluster and parallelizes calculations


RDD Operations

▶ RDDs enable the following operations
▶ transformations
▶ lazy operations that return a new RDD from input RDDs
▶ narrow or wide types (illustrated below)
▶ examples: map, filter, join, groupByKey, ...
▶ actions
▶ return a result or write to storage, and execute the pending transformations
▶ examples: count, collect, save
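
A hedged sketch of the distinction (the data is made up): map is a narrow transformation, groupByKey is a wide one that shuffles data between partitions, and only the final action starts the computation.

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('c', 4)], 2)
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))      # narrow transformation – lazy
grouped = doubled.groupByKey()                          # wide transformation – lazy, but introduces a shuffle
sums = grouped.mapValues(sum)                           # narrow transformation over the shuffled groups
print(sums.collect())                                   # action – e.g. [('a', 8), ('b', 4), ('c', 8)]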



RDD Transformations vs. actions



Hands On
1. Use NoMachine to log in to UL FS's HPC
2. Open Terminal/Console/"Konzola"
3. Clone the Workshop Git repository and enter its folder
git clone https://github.com/szitnik/Apache-Spark-Workshop.git
cd Apache-Spark-Workshop
4. Enter the following commands
module load Spark/2.4.0-Hadoop-2.7-Java-1.8
python3 -m venv spark-workshop-env
. spark-workshop-env/bin/activate

pip install --upgrade pip


pip install pyspark jupyter
python



RDDs in Spark

▶ We will use the pySpark library interactively


import pyspark
sc = pyspark.SparkContext(appName='SparkWorkshop', master='local[1]')



RDDs in Spark
▶ Creation of RDDs
▶ From a collection
rdd1 = sc.parallelize([('John', 23), ('Mark', 11), ('Jenna', 44),
('Sandra', 61)])
▶ From a file
rdd2 = sc.textFile('data/IMDB Dataset.csv')

▶ Basic transformations map(), filter(), flatMap()


older = rdd1.filter(lambda x: x[1] > 18)
anonymized = older.map(lambda x: (x[0][0], x[1]))

birthdays = rdd1.map(lambda x: list(range(1, x[1]+1)))


birthdays = rdd1.flatMap(lambda x: list(range(1, x[1]+1)))



RDDs in Spark
▶ Further actions, transformations
rdd2.take(2)

def organize(line):
    data = line.split('",')
    data = data if len(data) == 2 else line.split(',')
    return (data[1], data[0][1:51] + ' ...')

movies = rdd2.filter(lambda x: x != 'review,sentiment').map(organize)

movies.count()  # 50,000
movies = movies.filter(lambda x: x[0] in ['positive', 'negative'])
movies.count()  # 45,936
movieCounts = movies.groupByKey().map(lambda x: (x[0], len(x[1])))
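
As a side note (not from the original slides), the same counts can be computed without materialising the grouped values, which is usually cheaper than groupByKey:

movieCounts = movies.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)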



RDDs in Spark
▶ Caching
movies.take(2)

posReviews = movies.filter(lambda x: x[0] == 'positive').map(lambda x: x[1])


negReviews = movies.filter(lambda x: x[0] == 'negative').map(lambda x: x[1])

posReviews.cache().collect()



RDDs in Spark
▶ Caching
posReviews.filter(lambda x: 'good' in x).count()  # 605
negReviews.filter(lambda x: 'bad' in x).count()   # 788


RDDs in Spark



RDDs in Spark
▶ DAG exploration
def splitLine(line):
    return line.replace(',', ' ').replace('"', ' ').replace('.', ' ').split()

rdd2 = sc.textFile('data/IMDB Dataset.csv', 4)


wordCounts = rdd2.flatMap(splitLine).map(lambda word: (word, 1)). \
reduceByKey(lambda a,b: a+b, 3)
wordCounts.takeOrdered(10, key = lambda x: -x[1])

Operator chain: textFile() → flatMap() → map() → reduceByKey() → takeOrdered(); the shuffle introduced by reduceByKey() splits the job into Stage 1 and Stage 2


RDDs in Spark
▶ DAG exploration (admin console result)


DataFrames (= RDDs + schema) in Spark
▶ Spark SQL enables read/write from/to files, JSON, databases, etc.
▶ DataFrames are interoperable with Pandas dataframes
▶ DataFrames creation ...
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, IntegerType, StringType, StructField
sqlContext = SQLContext(sc)

df1 = sqlContext.createDataFrame(rdd1, ["name", "age"])

ExampleRow = Row("name", "age")


rdd1a = rdd1.map(lambda x: ExampleRow(x[0], x[1]))
df1 = sqlContext.createDataFrame(rdd1a)



DataFrames (= RDDs + schema) in Spark
▶ ... DataFrames creation ...
schema = StructType([StructField("name", StringType(), False), \
StructField("age", IntegerType(), True)])
df1 = sqlContext.createDataFrame(rdd1, schema)

df1.show()

df1.printSchema()



DataFrames (= RDDs + schema) in Spark
▶ ... DataFrames creation
df2 = sqlContext.read.format('csv').option('header', 'true'). \
option('mode', 'DROPMALFORMED').load('data/IMDB Dataset.csv')
df2.show(5)

df2.printSchema()



DataFrames and User Defined Functions (UDF) in Spark

▶ User defined functions are custom functions to run against the "database" directly
▶ Caveats
▶ Optimization problems (especially in pySpark!)
▶ Special values handling by the programmer (e.g. null values)
▶ Approaches to use UDFs
▶ df = df.withColumn
▶ df = sqlContext.sql("SELECT * FROM <UDF>") – requires registering the UDF first (see the sketch below)
▶ rdd.map(UDF())
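
A hedged sketch of the SQL-registration approach, which the later examples do not show; the function name reviewLenSQL is illustrative, and it reuses sqlContext and df2 from the DataFrame examples (Spark 2.x API; newer versions use spark.udf.register):

from pyspark.sql.types import IntegerType

# register a Python function so it can be called from SQL
sqlContext.registerFunction('reviewLenSQL', lambda r: len(r), IntegerType())
df2.createOrReplaceTempView('imdb')
sqlContext.sql('SELECT sentiment, reviewLenSQL(review) AS reviewLength FROM imdb LIMIT 5').show()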



DataFrames and User Defined Functions (UDF) in Spark
▶ Examples
from pyspark.sql.functions import udf

reviewLen = udf(lambda r: len(r), IntegerType())


reviewSnippet = udf(lambda r: r[0:50] + ' ...', StringType())

df2 = df2.withColumn('reviewLength', reviewLen('review'))


df2 = df2.withColumn('reviewSnippet', reviewSnippet('review'))



DataFrames and User Defined Functions (UDF) in Spark
▶ Examples
def words(review, type='positive'):
    sentimentWords = ['good', 'great', 'nice', 'awesome']
    if type == 'negative':
        sentimentWords = ['bad', 'worst', 'ugly', 'scary']
    return sum(map(lambda w: review.count(w), sentimentWords))

positiveWords = udf(lambda r: words(r), IntegerType())
negativeWords = udf(lambda r: words(r, 'negative'), IntegerType())

df2 = df2.withColumn('positiveWords', positiveWords('review'))
df2 = df2.withColumn('negativeWords', negativeWords('review'))

df2 = df2.drop('review')


DataFrames and User Defined Functions (UDF) in Spark
▶ Examples
df2.cache().show()



Spark SQL: DataFrame operations
▶ Examples
df2.select('sentiment', 'positiveWords').show(3)

df2.select(df2['sentiment'], df2['positiveWords']).show(3)

df2.select(df2['sentiment'], df2['positiveWords']). \
filter(df2['positiveWords'] > 10).show(3)

df2.groupBy('sentiment').count().show()

df2.summary().show()



Spark SQL: SQL operations
▶ Examples
df2.createOrReplaceTempView('imdb')

sqlContext.sql('SELECT * FROM imdb WHERE positiveWords > 10 LIMIT 5').show()

sqlContext.sql('SELECT sentiment, count(*) FROM imdb GROUP BY sentiment').show()



Lab exercises - Jupyter
▶ Run the jupyter notebook command in the project folder and run the notebooks from the notebooks folder


Lab exercise – Spark on an HPC
▶ Move to the spark-hpc folder:
▶ 00_clean.sh
▶ Script to clean up logs and data generated by previous runs
▶ 01_run-sbatch.sh
▶ Prepares and submits the scripts with a Spark job
▶ conf/spark-env.sh
▶ Sets environment variables for the worker and log folders
▶ job.py
▶ PySpark source code (a simple Pi calculation script)
▶ logs/
▶ Spark and Slurm log folder
▶ NOTES.txt
▶ Short Slurm command reference
▶ spark-job-TEMPLATE.sh
▶ Slurm script for job submission
▶ workers/
▶ Workers' working directories


Lab exercise – Spark on an HPC
▶ job.py
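
The original slide shows the contents of job.py as a screenshot. A minimal, hedged sketch of such a Pi-estimation PySpark job might look like this (the exact file in the repository may differ):

# job.py – sketch of a simple Monte-Carlo Pi estimation with PySpark
import random
from pyspark import SparkContext

sc = SparkContext(appName='PiEstimation')
num_samples = 10000000

def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(num_samples)).filter(inside).count()
print('Pi is roughly', 4.0 * count / num_samples)
sc.stop()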



Lab exercise – Spark on an HPC
▶ spark-job-TEMPLATE.sh



Lab exercise – Spark on an HPC
▶ 01_run-sbatch.sh



Lab exercise – Spark on an HPC
▶ Run spark application on an HPC (commands):
./01_run-sbatch.sh

squeue -u campus02

sacct -j 51438



Lab exercise – Spark on an HPC
▶ Check the output log:
cat logs/slurm_stdout_err__51438.log



Challenge exercises
▶ Check the Lab 1 and Lab 2 Jupyter notebooks and solve the challenges at the end
▶ Train a classifier to predict movie review sentiment (a hedged pipeline sketch follows below)
▶ Use the provided IMDB reviews data (CSV) and split it into train and test sets
▶ Extract features (e.g. TF-IDF), train a model (e.g. SVM) and test it
▶ See the MLlib documentation at https://spark.apache.org/docs/latest/ml-guide.html
▶ Use more nodes with workers on an HPC
▶ Adapt the HPC lab exercise to run on multiple nodes
▶ Hint: https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:slurm_sbatch_script_for_spark_applications
▶ Viewing event logs in the Spark UI after the Slurm job is finished
▶ Replicate the HPC exercise from before, retrieve the logs and run the history server to explore the Spark UI
▶ Hints: https://researchcomputing.princeton.edu/faq/spark-via-slurm
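
As a starting hint (not a complete solution), a hedged spark.ml pipeline sketch for the sentiment challenge; it re-reads the raw CSV because the earlier examples dropped the review column, and LogisticRegression stands in for the suggested SVM:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

# re-read the raw reviews (same options as in the DataFrame examples)
df = sqlContext.read.format('csv').option('header', 'true') \
        .option('mode', 'DROPMALFORMED').load('data/IMDB Dataset.csv')
train, test = df.randomSplit([0.8, 0.2], seed=42)

tokenizer = Tokenizer(inputCol='review', outputCol='words')
tf = HashingTF(inputCol='words', outputCol='tf')
idf = IDF(inputCol='tf', outputCol='features')
label = StringIndexer(inputCol='sentiment', outputCol='label')
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, tf, idf, label, lr]).fit(train)
model.transform(test).select('sentiment', 'prediction').show(5)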



References

▶ https://training.databricks.com/visualapi.pdf
▶ https://events.prace-ri.eu/event/896/
▶ https://luminousmen.com/post/spark-core-concepts-explained
▶ https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:slurm_sbatch_script_for_spark_applications
▶ https://researchcomputing.princeton.edu/faq/spark-via-slurm


THANK YOU FOR YOUR ATTENTION

www.prace-ri.eu

