Slide 8 Spark Shell Tutorial
(Spark)
Instructor: Trong-Hop Do
May 5th 2021
S3Lab
Smart Software System Laboratory
“Big data is at the foundation of all the
megatrends that are happening today, from
social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Big Data
Spark shell
● Open Spark shell
● Command: spark-shell
What is SparkContext?
● Test some methods with the default SparkContext object sc in Spark shell
● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].
Spark RDD
Create an RDD through Parallelized Collection
● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].
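As a minimal sketch in spark-shell (where sc is already in scope; the sample values are illustrative):

```scala
// sc is the SparkContext provided by spark-shell
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))  // RDD[Int] from a local collection
rdd.count()   // Long = 5
rdd.first()   // Int = 1
```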
Spark RDD
Read single or multiple text files into single RDD?
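One way this might look in the shell — the file paths below are hypothetical:

```scala
// One file, a comma-separated list of files, or a glob all give a single RDD[String]
val one  = sc.textFile("file:///tmp/a.txt")
val many = sc.textFile("file:///tmp/a.txt,file:///tmp/b.txt")
val glob = sc.textFile("file:///tmp/*.txt")
// wholeTextFiles returns (fileName, fileContent) pairs instead of individual lines
val pair = sc.wholeTextFiles("file:///tmp/")
```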
Spark RDD
Load CSV File into RDD
● The textFile() method reads each CSV record (line) as a String and returns an RDD[String].
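For instance (hypothetical file path; each element of the result is one unparsed line):

```scala
val rddFromCsv = sc.textFile("file:///tmp/data.csv")  // RDD[String]
rddFromCsv.take(2).foreach(println)  // prints the first two raw CSV lines
```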
● Usually we need every record in the CSV split on the comma delimiter and stored in the RDD as
multiple columns. To achieve this, apply the map() transformation on the RDD to convert
RDD[String] to RDD[Array[String]] by splitting every record on the comma delimiter.
map() returns a new RDD instead of updating the existing one.
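A sketch of that map() step, reusing a hypothetical CSV path:

```scala
val csv  = sc.textFile("file:///tmp/data.csv")  // RDD[String]: raw lines
val cols = csv.map(line => line.split(","))     // RDD[Array[String]]: one array per record
cols.take(2).foreach(a => println(a.mkString(" | ")))
```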
Spark RDD Operations
● https://spark.apache.org/docs/latest/api/python/reference/pyspark.html
● https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
Two types of RDD operations:
○ Transformations
○ Actions
● A Spark transformation is a function that produces new RDDs from existing RDDs. It takes an
RDD as input and produces one or more RDDs as output; each applied transformation creates
a new RDD.
● RDD actions are operations that return raw values. In other words, any RDD function
that returns something other than an RDD is considered an action in Spark programming.
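The distinction can be seen in a small sketch (values illustrative):

```scala
val nums    = sc.parallelize(Seq(1, 2, 3, 4))
val doubled = nums.map(_ * 2)       // transformation: lazy, returns a new RDD
val total   = doubled.reduce(_ + _) // action: triggers execution, returns Int = 20
```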
RDD Transformation
RDD Action
RDD Transformation – Word count example
● Let’s print out the content of these RDDs (each record is printed in a single line)
Notice how each element in a tuple is accessed. If a were an array, the first and second elements
would be accessed as a(0) and a(1). Here a is a tuple, so the first and second elements are accessed as a._1 and a._2.
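A quick illustration in plain Scala:

```scala
val a = Array("x", "y")
a(0)  // String = "x" — array indexing uses parentheses and is 0-based
a(1)  // String = "y"

val t = ("spark", 3)
t._1  // String = "spark" — tuple fields are 1-based: _1, _2, ...
t._2  // Int = 3
```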
● Let’s print out the result (and see that the count of each word has now become the key)
If there is more than one partition, the records within each partition are sorted, but the
sorted results of the different partitions may be interleaved in the final output.
To avoid this, use sortByKey(numPartitions=1)
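In the Scala shell the same call is written positionally, since sortByKey takes (ascending, numPartitions); counts here is an assumed RDD[(Int, String)] of (count, word) pairs:

```scala
// Force a single partition so the final output is globally sorted
val sorted = counts.sortByKey(true, 1)  // sortByKey(ascending = true, numPartitions = 1)
```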
● You can also apply a map transformation to switch the word back to being the key
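For example, with wordCounts as an assumed RDD[(String, Int)]:

```scala
// swap to (count, word), sort descending by count on one partition, then swap back
val byCount = wordCounts.map(p => (p._2, p._1)).sortByKey(false, 1)
val result  = byCount.map(p => (p._2, p._1))  // back to (word, count)
```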
RDD Action – Easiest Wordcount using countByValue
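A sketch (sample words illustrative); note that countByValue is an action that returns a local Map to the driver, so it suits small result sets:

```scala
val words  = sc.parallelize(Seq("a", "b", "a", "c", "a"))
val counts = words.countByValue()  // scala.collection.Map[String, Long]
// counts contains: a -> 3, b -> 1, c -> 1
```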
RDD Action
● Let’s create some RDD
● reduce() – Reduces the elements of the dataset using the specified binary operator.
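For instance (sample data illustrative):

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
nums.reduce(_ + _)              // Int = 15 (sum)
nums.reduce((a, b) => a max b)  // Int = 5  (maximum)
```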
● collect() – Return the complete dataset as an Array.
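For example (collect pulls the whole dataset to the driver, so use it only on small RDDs):

```scala
val rdd = sc.parallelize(Seq(10, 20, 30))
val arr: Array[Int] = rdd.collect()  // Array(10, 20, 30)
```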
● count() – Return the count of elements in the dataset.
● countApprox() – Return an approximate count of elements in the dataset; this method returns a
possibly incomplete result when execution time exceeds the given timeout.
● countApproxDistinct() – Return an approximate number of distinct elements in the dataset.
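Sketched in the shell (sample data illustrative; countApprox’s timeout is in milliseconds):

```scala
val rdd = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
rdd.count()                 // Long = 6
rdd.countApprox(1000L)      // bounded-time estimate; partial if 1000 ms elapse first
rdd.countApproxDistinct()   // approximately 3 distinct elements
```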
Spark Pair RDD Functions
Pair RDD Transformation
Spark defines the PairRDDFunctions class with several functions for working with pair RDDs (RDDs of key-value pairs)
Pair RDD Action
Wordcount using reduceByKey
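The classic pipeline might look like this (input lines are illustrative):

```scala
val lines  = sc.parallelize(Seq("a b a", "b a"))
val counts = lines
  .flatMap(_.split(" "))   // RDD[String]: individual words
  .map(w => (w, 1))        // RDD[(String, Int)]: pair RDD
  .reduceByKey(_ + _)      // merge the counts per key
counts.collect()           // Array((a,3), (b,2)) in some order
```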
Another usage of reduceByKey
Wordcount using countByKey action
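A sketch; note countByKey is an action that counts records per key (ignoring the values) and returns a local Map:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 9), ("a", 5)))
pairs.countByKey()  // Map with a -> 2, b -> 1: number of records per key
```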
Join two RDDs
If there are duplicate keys in either RDD, join produces a cartesian product of the matching values.
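The duplicate-key behaviour can be seen in a small sketch:

```scala
val left  = sc.parallelize(Seq(("k", 1), ("k", 2)))
val right = sc.parallelize(Seq(("k", "x"), ("k", "y")))
left.join(right).collect()
// 2 x 2 = 4 results for key "k":
// (k,(1,x)), (k,(1,y)), (k,(2,x)), (k,(2,y))
```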
Spark shell Web UI
Accessing the Web UI of Spark
Explore Spark shell Web UI
Partition and parallelism in RDDs
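A few calls worth trying while watching the UI (by default at http://localhost:4040 while the shell is running):

```scala
val rdd = sc.parallelize(1 to 100, 4)  // request 4 partitions explicitly
rdd.getNumPartitions                   // Int = 4
rdd.repartition(8).getNumPartitions    // Int = 8
rdd.count()                            // each action appears as a job in the Web UI
```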
● Open Spark shell Web UI
Q&A