
Big Data

(Spark)
Instructor: Trong-Hop Do
May 5th 2021

S3Lab
Smart Software System Laboratory

1
“Big data is at the foundation of all the
megatrends that are happening today, from
social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems

Big Data 2
Spark shell
● Open Spark shell
● Command: spark-shell

3
What is SparkContext?

● SparkContext is the entry point to Spark. It is defined in the
org.apache.spark package and is used to programmatically create Spark
RDDs, accumulators, and broadcast variables on the cluster.
● Its object sc is available by default in spark-shell; it can also be
created programmatically using the SparkContext class.
● Most of the operations and methods we use in Spark, such as
accumulators, broadcast variables, and parallelize, come from
SparkContext.
4
What is SparkContext?

● Test some methods with the default SparkContext object sc in Spark shell

● There are many more methods of SparkContext
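A minimal sketch of this kind of inspection, run in spark-shell (Scala) where sc is predefined; the printed values depend on your environment:

// Inspect the default SparkContext
sc.appName              // e.g. "Spark shell"
sc.master               // e.g. "local[*]" or "yarn"
sc.version              // the Spark version string
sc.defaultParallelism   // default number of partitions for operations like parallelize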


5
Spark RDD

● Resilient Distributed Dataset (RDD) is a fundamental data structure of
Spark. It is an immutable, distributed collection of objects. Each dataset in
an RDD is divided into logical partitions, which may be computed on
different nodes of the cluster.
● A Spark RDD can be created in several ways using the Scala and PySpark
APIs: for example, with sparkContext.parallelize(), from a text file, or from
another RDD, DataFrame, or Dataset.
6
Spark RDD
Create an RDD through Parallelized Collection

● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].

Note: the printed list might be unordered because foreach runs in
parallel across the partitions, which makes the output ordering
non-deterministic. The action collect() returns an array with the
partitions concatenated in order. Alternatively, set the number of
partitions to 1.
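A minimal sketch of this step in spark-shell (the data values are illustrative):

// Create an RDD[Int] from a local Scala collection
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// foreach runs in parallel across partitions, so the print order is non-deterministic
rdd.foreach(println)

// collect() returns all elements to the driver as an Array, concatenated in partition order
rdd.collect().foreach(println)

// Using a single partition also gives a deterministic print order
sc.parallelize(List(1, 2, 3, 4, 5), 1).foreach(println)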

7
Spark RDD
Create an RDD through Parallelized Collection

● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].

8
Spark RDD
Read single or multiple text files into a single RDD

● Create files and put them to HDFS
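One possible way to do this from a terminal (not spark-shell); the file names and the HDFS path /user/cloudera/textdata are illustrative, not necessarily those on the original slide:

echo "This is file one" > text01.txt
echo "This is file two" > text02.txt
hdfs dfs -mkdir -p /user/cloudera/textdata
hdfs dfs -put text01.txt text02.txt /user/cloudera/textdata/
hdfs dfs -ls /user/cloudera/textdata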

9
Spark RDD
Read single or multiple text files into a single RDD

● Read single file and create an RDD[String]
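A sketch, assuming the illustrative HDFS path used above:

// Each element of the RDD is one line of the file
val rdd1 = sc.textFile("/user/cloudera/textdata/text01.txt")
rdd1.foreach(println)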

10
Spark RDD
Read single or multiple text files into a single RDD

● Read multiple files and create an RDD[String]
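textFile() also accepts a directory or a wildcard, so all matching files go into one RDD; the path is illustrative:

val rddAll = sc.textFile("/user/cloudera/textdata/*")
rddAll.count()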

11
Spark RDD
Read single or multiple text files into a single RDD

● Read multiple specific files and create an RDD[String]
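textFile() also takes a comma-separated list of paths; the file names are illustrative:

val rddSome = sc.textFile("/user/cloudera/textdata/text01.txt,/user/cloudera/textdata/text02.txt")
rddSome.collect().foreach(println)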

12
Spark RDD
Load CSV File into RDD

● Create .csv files and put them to HDFS

13
Spark RDD
Load CSV File into RDD

● The textFile() method reads each CSV record (line) as a String and returns an RDD[String]
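A sketch with a hypothetical file persons.csv already in HDFS:

// Each element is one full CSV line, e.g. "1,John,30"
val csvRdd = sc.textFile("/user/cloudera/csvdata/persons.csv")
csvRdd.foreach(println)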

14
Spark RDD
Load CSV File into RDD

● We need every record in the CSV to be split on the comma delimiter and stored in the RDD as
multiple columns. To achieve this, we use the map() transformation on the RDD to
convert RDD[String] to RDD[Array[String]] by splitting every record on the comma delimiter.
The map() method returns a new RDD instead of updating the existing one (see the sketch below).

Each record in the csvData RDD is an Array[String]

f(0) and f(1) are the first and second elements of the array f
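A sketch of the map() step described above (the column indices are illustrative):

// Split each CSV line on the comma delimiter: RDD[String] -> RDD[Array[String]]
val csvData = csvRdd.map(f => f.split(","))

// Access individual columns of each record
csvData.foreach(f => println(f(0) + " | " + f(1)))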
15
Spark RDD
Load CSV File into RDD

● Skip the header row of a CSV file


● Command: rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
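Applied to the CSV RDD from the previous slides, the command looks like this; it drops the first row of the first partition only:

val noHeader = csvRdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
noHeader.foreach(println)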

16
Spark RDD Operations

● https://spark.apache.org/docs/latest/api/python/reference/pyspark.html

● https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/

17
Spark RDD Operations
Two types of RDD operations

● Apache Spark RDDs support two types of operations:

○ Transformations

○ Actions
● A Spark transformation is a function that produces a new RDD from existing RDDs. It takes
an RDD as input and produces one or more RDDs as output. Every time we apply a
transformation, a new RDD is created.
● RDD actions are operations that return raw values. In other words, any RDD function
that returns something other than an RDD is considered an action in Spark programming.
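A minimal illustration of the difference, in spark-shell:

val nums = sc.parallelize(1 to 5)

// Transformation: returns a new RDD; nothing is computed yet (lazy)
val doubled = nums.map(_ * 2)

// Actions: trigger the computation and return values to the driver
doubled.collect()   // Array(2, 4, 6, 8, 10)
doubled.count()     // 5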

18
Spark RDD Operations
RDD Transformation

● Two types of RDD Transformation

○ Narrow transformations are the result of functions such as map() and filter().
They compute data that lives on a single partition, so no data movement
between partitions is needed to execute them.

○ Wider transformations are the result of functions such as groupByKey() and
reduceByKey(). They compute data that lives on many partitions, so data
movement (a shuffle) between partitions is required to execute them.
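A small sketch contrasting the two kinds of transformation (the pair data is illustrative):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow: filter works within each partition, no shuffle
val filtered = pairs.filter(_._2 > 1)

// Wide: reduceByKey must bring equal keys together, causing a shuffle
val summed = pairs.reduceByKey(_ + _)
summed.collect()    // Array((a,4), (b,2)); order may vary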

19
Spark RDD Operations
RDD Transformation

20
Spark RDD Operations
RDD Action

21
Spark RDD Operations
RDD Transformation – Word count example

● Create a .txt file and put it to HDFS

22
Spark RDD Operations
RDD Transformation – Word count example

● Read the file and create an RDD[String]

23
Spark RDD Operations
RDD Transformation – Word count example

● Apply flatMap() Transformation


● flatMap() transformation applies a function to each record and flattens the result, returning a new RDD. In the
example below, it splits each record on spaces and then flattens the result, so the resulting RDD contains a
single word in each record.
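A sketch of this step; the file path is illustrative and the variable names follow the testfile naming used later in these slides:

// Read the text file: each record is one line
val testfile1 = sc.textFile("/user/cloudera/wordcount/words.txt")

// Split each line on spaces and flatten, so each record is a single word
val testfile2 = testfile1.flatMap(line => line.split(" "))
testfile2.foreach(println)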

24
Spark RDD Operations
RDD Transformation – Word count example

● Apply map() Transformation


● map() transformation is used to apply any operation to each record, such as adding a column or
updating a column. The output of a map transformation always has the same
number of records as its input.
● In this word count example, we add a new column with value 1 for each word. The
result is a pair RDD of key-value pairs, with the word of type String as the key
and 1 of type Int as the value (see the sketch below).
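A sketch of this step, continuing from the flatMap output above:

// Pair each word with the value 1: RDD[String] -> RDD[(String, Int)]
val testfile3 = testfile2.map(word => (word, 1))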

Each record in the testfile3 RDD is a tuple (String, Int), not an array

25


Spark RDD Operations
RDD Transformation – Word count example

● Let’s print out the content of these RDDs (each record is printed on its own line)

Key Value

26
Spark RDD Operations
RDD Transformation – Word count example

● Apply reduceByKey() Transformation


● reduceByKey() merges the values for each key using the specified function. In our example, it
applies the sum function to the values of each word. The resulting RDD
contains unique words and their counts.

● Note: reduceByKey() is a transformation, so count is an RDD
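A sketch of this step; the counts shown in the comment are only illustrative:

// Sum the 1s for each key; the result is still an RDD because reduceByKey is a transformation
val count = testfile3.reduceByKey((a, b) => a + b)
count.foreach(println)   // e.g. (spark,3), (data,2), ...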

27
Spark RDD Operations
RDD Transformation – Word count example

● Let’s print out the result

28
Spark RDD Operations
RDD Transformation – Word count example

● Another way to use reduceByKey
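One common shorthand, assuming the same pair RDD (the slide may show a different variant):

// Underscore placeholders stand for the two values being merged
val count2 = testfile3.reduceByKey(_ + _)
count2.foreach(println)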

29
Spark RDD Operations
RDD Transformation – Word count example

● Try sortByKey() Transformation


● sortByKey() transformation is used to sort RDD elements by key.
● In this example, we want to sort the RDD elements by the count of each word. Therefore, we
need to convert RDD[(String, Int)] to RDD[(Int, String)] using a map transformation, and then apply
sortByKey, which sorts on the integer key.

Notice how each element in a tuple is accessed. If a is an array, the second and first elements are accessed
as a(1) and a(0). Here a is a tuple, so the second and first elements are accessed as a._2 and a._1.
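A sketch of the swap-then-sort approach described above:

// Swap (word, count) to (count, word), then sort by the integer key
val swapped = count.map(a => (a._2, a._1))
val sorted = swapped.sortByKey()   // ascending by count
sorted.foreach(println)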
30
Spark RDD Operations
RDD Transformation – Word count example

● Let’s print out the result (and see that the count of each word now becomes the key)

31
Spark RDD Operations
RDD Transformation – Word count example

● Apply sortByKey() transformation

If there is more than one partition, the records within each partition are sorted, but the
sorted results from different partitions may be interleaved in the printed output.
To avoid this, sort into a single partition, e.g. sortByKey(numPartitions=1) in PySpark.
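In the Scala API the number of output partitions is the second argument of sortByKey (the first is the ascending flag):

// Sort into a single partition so the printed output appears fully ordered
val sortedOne = swapped.sortByKey(true, 1)
sortedOne.foreach(println)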

32
Spark RDD Operations
RDD Transformation – Word count example

● You can also apply a map transformation to switch the word back to being the key
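A sketch of this step, continuing from the sorted RDD above:

// Swap back so the word is the key again: (count, word) -> (word, count)
val sortedByWordKey = sorted.map(a => (a._2, a._1))
sortedByWordKey.foreach(println)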

33
Spark RDD Operations
RDD Transformation – Word count example

● You can do all these steps in one line of code
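A sketch of the whole pipeline as one expression, under the same assumed file path (the outer parentheses let spark-shell read it as a single multi-line expression):

val result = (sc.textFile("/user/cloudera/wordcount/words.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(a => (a._2, a._1))
  .sortByKey(true, 1)
  .map(a => (a._2, a._1)))
result.foreach(println)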

34
Spark RDD Operations
RDD Transformation – Word count example

● Or you can use sortBy
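sortBy can sort on the value directly, without swapping key and value first:

val byCount = count.sortBy(_._2, ascending = false, numPartitions = 1)
byCount.foreach(println)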

35
Spark RDD Operations
RDD Transformation – Word count example

● Apply filter() Transformation


● filter() transformation is used to filter the records in an RDD. In this example we keep only
the words that start with “t”.
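A sketch of this step on the word-count pair RDD:

// Keep only the pairs whose word starts with "t"
val tWords = count.filter(a => a._1.startsWith("t"))
tWords.foreach(println)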

36
Spark RDD Operations
RDD Transformation – Word count example

● Merge two RDDs
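One way to merge two RDDs is union(); the second file path is hypothetical:

val moreWords = sc.textFile("/user/cloudera/wordcount/words2.txt").flatMap(_.split(" "))
val merged = testfile2.union(moreWords)
merged.count()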

37
Spark RDD Operations
RDD Action – Easiest Wordcount using countByValue
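countByValue() is an action that returns the counts directly to the driver as a Map, so no reduceByKey is needed (a sketch on the word RDD from the earlier example):

// Word count in one action; the result is a scala.collection.Map[String, Long]
val counts = testfile2.countByValue()
counts.foreach(println)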

38
Spark RDD Operations
RDD Action
● Let’s create some RDDs
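Hypothetical RDDs used in the action examples that follow (the slide's actual data may differ):

val listRdd = sc.parallelize(List(1, 2, 3, 4, 5, 3, 2))
val inputRdd = sc.parallelize(Seq(("Z", 1), ("A", 20), ("B", 30), ("C", 40), ("B", 30), ("B", 60)))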

39
Spark RDD Operations
RDD Action
● reduce() - Reduces the elements of the dataset using the specified binary
operator.
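A sketch using the hypothetical listRdd defined above:

listRdd.reduce(_ + _)               // 20, the sum of the elements
listRdd.reduce((a, b) => a min b)   // 1, the minimum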

40
Spark RDD Operations
RDD Action
● collect() – Return the complete dataset as an Array.
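For example, with the hypothetical listRdd from above:

listRdd.collect()   // Array(1, 2, 3, 4, 5, 3, 2)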

41
Spark RDD Operations
RDD Action
● count() – Return the count of elements in the dataset.
● countApprox() – Return an approximate count of elements in the dataset; this method returns a
possibly incomplete result if the timeout is reached before execution finishes.
● countApproxDistinct() – Return an approximate number of distinct elements in the dataset.
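A sketch with the hypothetical listRdd:

listRdd.count()                 // 7
listRdd.countApprox(1000)       // approximate count, with a 1000 ms timeout
listRdd.countApproxDistinct()   // approximate number of distinct elements, here about 5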

42
Spark RDD Operations
RDD Action

● first() – Return the first element in the dataset.


● top() – Return top n elements from the dataset.
● min() – Return the minimum value from the dataset.
● max() – Return the maximum value from the dataset.
● take() – Return the first num elements of the dataset.
● takeOrdered() – Return the num smallest elements from the dataset; this is
the opposite of the top() action.
● takeSample() – Return the subset of the dataset in an Array.
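A sketch of these actions on the hypothetical listRdd:

listRdd.first()               // 1
listRdd.top(2)                // Array(5, 4), the largest elements
listRdd.min()                 // 1
listRdd.max()                 // 5
listRdd.take(3)               // Array(1, 2, 3), the first elements in partition order
listRdd.takeOrdered(3)        // Array(1, 2, 2), the smallest elements
listRdd.takeSample(false, 3)  // 3 random elements, without replacement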

43
Spark Pair RDD Functions
Pair RDD Transformation
Spark defines the PairRDDFunctions class with several functions for working with pair RDDs (RDDs of key-value pairs)

44
Spark Pair RDD Functions
Pair RDD Action

Spark defines the PairRDDFunctions class with several functions for working with pair RDDs (RDDs of key-value pairs)

45
Spark Pair RDD Functions
Wordcount using reduceByKey

46
Spark Pair RDD Functions
Wordcount using reduceByKey

● Save the output in a text file
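A sketch, assuming the word-count pair RDD count from the earlier example; the output directory must not already exist:

// Save the word counts to HDFS as text files (one part file per partition)
count.saveAsTextFile("/user/cloudera/wordcount/output")
// Then check it from a terminal: hdfs dfs -cat /user/cloudera/wordcount/output/part-*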

47
Spark Pair RDD Functions
Wordcount using reduceByKey

● Check the result

48
Spark Pair RDD Functions
Another usage of reduceByKey

49
Spark Pair RDD Functions
Wordcount using countByKey action
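A sketch on the (word, 1) pair RDD; countByKey is an action that counts the records per key and returns a local Map:

val keyCounts = testfile3.countByKey()   // scala.collection.Map[String, Long]
keyCounts.foreach(println)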

50
Spark Pair RDD Functions
Join two RDDs

● Create two RDDs from text files
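A sketch with hypothetical files whose lines look like "a 1" and "a x"; the paths and the parsing are assumptions:

val rddA = sc.textFile("/user/cloudera/join/file1.txt").map { line =>
  val p = line.split(" "); (p(0), p(1))
}
val rddB = sc.textFile("/user/cloudera/join/file2.txt").map { line =>
  val p = line.split(" "); (p(0), p(1))
}
val joined = rddA.join(rddB)
joined.foreach(println)   // each record is (key, (valueFromA, valueFromB))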

51
Spark Pair RDD Functions
Join two RDDs

● Check the result

If there are duplicate keys in either RDD, the join produces a Cartesian product of the matching records.

52
Spark Pair RDD Functions
Join two RDDs

● Use cogroup or groupByKey to get unique keys in the joining result
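A sketch using cogroup on the same two RDDs; it yields one record per key, with the values from each RDD grouped:

val grouped = rddA.cogroup(rddB)
grouped.foreach { case (k, (fromA, fromB)) =>
  println((k, fromA.toList, fromB.toList))
}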

53
Spark shell Web UI
Accessing the Web UI of Spark

● Open http://quickstart.cloudera:4040 in a web browser

54
Spark shell Web UI
Explore Spark shell Web UI

● Click DAG Visualization


Number of partitions

55
Spark shell Web UI
Explore Spark shell Web UI

● View the Directed Acyclic Graph (DAG) of the completed job

56
Spark shell Web UI
Partition and parallelism in RDDs

57
Spark shell Web UI
Partition and parallelism in RDDs

● Execute a parallel task in the shell
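A sketch of a parallel job; the data size and the 5 partitions are illustrative:

// Create an RDD with 5 partitions and run an action; the job then appears in the Web UI
val bigRdd = sc.parallelize(1 to 1000000, 5)
bigRdd.map(_ * 2).count()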

58
Spark shell Web UI
Partition and parallelism in RDDs
● Open Spark shell Web UI

In total, 5 partition tasks are completed

59


Spark shell Web UI
Partition and parallelism in RDDs

Parallel execution of 5 completed jobs

60
Q&A

Thank you for your attention


We hope to reach success together.

61