
Big Data

(Spark)
Instructor: Trong-Hop Do
May 5th 2021

S3Lab
Smart Software System Laboratory

1
“Big data is at the foundation of all the
megatrends that are happening today, from
social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems

Big Data 2
Spark shell
● Open Spark shell
● Command: spark-shell

3
What is SparkContext?

● SparkContext is the entry point to Spark. It is defined in the
org.apache.spark package and is used to programmatically create Spark
RDDs, accumulators, and broadcast variables on the cluster.
● Its object sc is available by default in spark-shell; it can also be
created programmatically using the SparkContext class.
● Most of the operations and methods we use in Spark, such as
accumulators, broadcast variables, and parallelize, come from
SparkContext.
4
What is SparkContext?

● Test some methods with the default SparkContext object sc in Spark shell

● There are many more methods of SparkContext
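A minimal sketch of this kind of inspection, run in spark-shell (Scala) where sc is predefined; the printed values depend on your environment:

// Inspect the default SparkContext
sc.appName              // e.g. "Spark shell"
sc.master               // e.g. "local[*]" or "yarn"
sc.version              // the Spark version string
sc.defaultParallelism   // default number of partitions for operations like parallelize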


5
Spark RDD

● Resilient Distributed Dataset (RDD) is a fundamental data structure of
Spark. It is an immutable, distributed collection of objects. Each dataset in
an RDD is divided into logical partitions, which may be computed on
different nodes of the cluster.
● A Spark RDD can be created in several ways using the Scala and PySpark
APIs: for example, with sparkContext.parallelize(), from a text file, or from
another RDD, DataFrame, or Dataset.
6
Spark RDD
Create an RDD through Parallelized Collection

● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].

Note: the printed list might be unordered because foreach runs in
parallel across the partitions, which makes the output ordering
non-deterministic. The action collect() returns an array with the
partitions concatenated in order. Alternatively, set the number of
partitions to 1.
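A minimal sketch of this step in spark-shell (the data values are illustrative):

// Create an RDD[Int] from a local Scala collection
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// foreach runs in parallel across partitions, so the print order is non-deterministic
rdd.foreach(println)

// collect() returns all elements to the driver as an Array, concatenated in partition order
rdd.collect().foreach(println)

// Using a single partition also gives a deterministic print order
sc.parallelize(List(1, 2, 3, 4, 5), 1).foreach(println)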

7
Spark RDD
Create an RDD through Parallelized Collection

● Spark shell provides the SparkContext variable “sc”; use sc.parallelize() to create an RDD[Int].

8
Spark RDD
Read single or multiple text files into a single RDD

● Create files and put them to HDFS
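One possible way to do this from a terminal (not spark-shell); the file names and the HDFS path /user/cloudera/textdata are illustrative, not necessarily those on the original slide:

echo "This is file one" > text01.txt
echo "This is file two" > text02.txt
hdfs dfs -mkdir -p /user/cloudera/textdata
hdfs dfs -put text01.txt text02.txt /user/cloudera/textdata/
hdfs dfs -ls /user/cloudera/textdata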

9
Spark RDD
Read single or multiple text files into a single RDD

● Read single file and create an RDD[String]
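A sketch, assuming the illustrative HDFS path used above:

// Each element of the RDD is one line of the file
val rdd1 = sc.textFile("/user/cloudera/textdata/text01.txt")
rdd1.foreach(println)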

10
Spark RDD
Read single or multiple text files into a single RDD

● Read multiple files and create an RDD[String]
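textFile() also accepts a directory or a wildcard, so all matching files go into one RDD; the path is illustrative:

val rddAll = sc.textFile("/user/cloudera/textdata/*")
rddAll.count()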

11
Spark RDD
Read single or multiple text files into a single RDD

● Read multiple specific files and create an RDD[String]
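textFile() also takes a comma-separated list of paths; the file names are illustrative:

val rddSome = sc.textFile("/user/cloudera/textdata/text01.txt,/user/cloudera/textdata/text02.txt")
rddSome.collect().foreach(println)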

12
Spark RDD
Load CSV File into RDD

● Create .csv files and put them to HDFS

13
Spark RDD
Load CSV File into RDD

● The textFile() method reads each CSV record (line) as a String and returns an RDD[String]
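A sketch with a hypothetical file persons.csv already in HDFS:

// Each element is one full CSV line, e.g. "1,John,30"
val csvRdd = sc.textFile("/user/cloudera/csvdata/persons.csv")
csvRdd.foreach(println)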

14
Spark RDD
Load CSV File into RDD

● We need every record in the CSV to be split on the comma delimiter and stored in the RDD as
multiple columns. To achieve this, we use the map() transformation on the RDD to
convert RDD[String] to RDD[Array[String]] by splitting every record on the comma delimiter.
The map() method returns a new RDD instead of updating the existing one (see the sketch below).

Each record in the csvData RDD is an Array[String]

f(0) and f(1) are the first and second elements of the array f
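A sketch of the map() step described above (the column indices are illustrative):

// Split each CSV line on the comma delimiter: RDD[String] -> RDD[Array[String]]
val csvData = csvRdd.map(f => f.split(","))

// Access individual columns of each record
csvData.foreach(f => println(f(0) + " | " + f(1)))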
15
Spark RDD
Load CSV File into RDD

● Skip the header row of a CSV file


● Command: rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
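Applied to the CSV RDD from the previous slides, the command looks like this; it drops the first row of the first partition only:

val noHeader = csvRdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
noHeader.foreach(println)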

16
Spark RDD Operations

● https://spark.apache.org/docs/latest/api/python/reference/pyspark.html

● https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/

17
Spark RDD Operations
Two types of RDD operations

● Apache Spark RDDs support two types of operations:

○ Transformations

○ Actions
● A Spark transformation is a function that produces a new RDD from existing RDDs. It takes
an RDD as input and produces one or more RDDs as output. Every time we apply a
transformation, a new RDD is created.
● RDD actions are operations that return raw values. In other words, any RDD function
that returns something other than an RDD is considered an action in Spark programming.
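A minimal illustration of the difference, in spark-shell:

val nums = sc.parallelize(1 to 5)

// Transformation: returns a new RDD; nothing is computed yet (lazy)
val doubled = nums.map(_ * 2)

// Actions: trigger the computation and return values to the driver
doubled.collect()   // Array(2, 4, 6, 8, 10)
doubled.count()     // 5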

18
Spark RDD Operations
RDD Transformation

● Two types of RDD Transformation

○ Narrow transformations are the result of functions such as map() and filter().
They compute data that lives on a single partition, so no data movement
between partitions is needed to execute them.

○ Wider transformations are the result of functions such as groupByKey() and
reduceByKey(). They compute data that lives on many partitions, so data
movement (a shuffle) between partitions is required to execute them.
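A small sketch contrasting the two kinds of transformation (the pair data is illustrative):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow: filter works within each partition, no shuffle
val filtered = pairs.filter(_._2 > 1)

// Wide: reduceByKey must bring equal keys together, causing a shuffle
val summed = pairs.reduceByKey(_ + _)
summed.collect()    // Array((a,4), (b,2)); order may vary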

19
Spark RDD Operations
RDD Transformation

20
Spark RDD Operations
RDD Action

21
Spark RDD Operations
RDD Transformation – Word count example

● Create a .txt file and put it to HDFS

22
Spark RDD Operations
RDD Transformation – Word count example

● Read the file and create an RDD[String]

23
Spark RDD Operations
RDD Transformation – Word count example

● Apply flatMap() Transformation


● flatMap() transformation applies a function to each record and flattens the result, returning a new RDD. In the
example below, it splits each record on spaces and then flattens the result, so the resulting RDD contains a
single word in each record.
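A sketch of this step; the file path is illustrative and the variable names follow the testfile naming used later in these slides:

// Read the text file: each record is one line
val testfile1 = sc.textFile("/user/cloudera/wordcount/words.txt")

// Split each line on spaces and flatten, so each record is a single word
val testfile2 = testfile1.flatMap(line => line.split(" "))
testfile2.foreach(println)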

24
Spark RDD Operations
RDD Transformation – Word count example

● Apply map() Transformation


● map() transformation is used to apply any operation to each record, such as adding a column or
updating a column. The output of a map transformation always has the same
number of records as its input.
● In this word count example, we add a new column with value 1 for each word. The
result is a pair RDD of key-value pairs, with the word of type String as the key
and 1 of type Int as the value (see the sketch below).
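A sketch of this step, continuing from the flatMap output above:

// Pair each word with the value 1: RDD[String] -> RDD[(String, Int)]
val testfile3 = testfile2.map(word => (word, 1))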

Each record in the testfile3 RDD is a tuple (String, Int), not an array

25


Spark RDD Operations
RDD Transformation – Word count example

● Let’s print out the content of these RDDs (each record is printed on its own line)

Key Value

26
Spark RDD Operations
RDD Transformation – Word count example

● Apply reduceByKey() Transformation


● reduceByKey() merges the values for each key using the specified function. In our example, it
applies the sum function to the values of each word. The resulting RDD
contains unique words and their counts.

● Note: reduceByKey() is a transformation, so count is an RDD
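A sketch of this step; the counts shown in the comment are only illustrative:

// Sum the 1s for each key; the result is still an RDD because reduceByKey is a transformation
val count = testfile3.reduceByKey((a, b) => a + b)
count.foreach(println)   // e.g. (spark,3), (data,2), ...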

27
Spark RDD Operations
RDD Transformation – Word count example

● Let’s print out the result

28
Spark RDD Operations
RDD Transformation – Word count example

● Another way to use reduceByKey
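One common shorthand, assuming the same pair RDD (the slide may show a different variant):

// Underscore placeholders stand for the two values being merged
val count2 = testfile3.reduceByKey(_ + _)
count2.foreach(println)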

29
Spark RDD Operations
RDD Transformation – Word count example

● Try sortByKey() Transformation


● sortByKey() transformation is used to sort RDD elements by key.
● In this example, we want to sort the RDD elements by the count of each word. Therefore, we
need to convert RDD[(String, Int)] to RDD[(Int, String)] using a map transformation, and then apply
sortByKey, which sorts on the integer key.

Notice how each element in a tuple is accessed. If a is an array, the second and first elements are accessed
as a(1) and a(0). Here a is a tuple, so the second and first elements are accessed as a._2 and a._1.
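A sketch of the swap-then-sort approach described above:

// Swap (word, count) to (count, word), then sort by the integer key
val swapped = count.map(a => (a._2, a._1))
val sorted = swapped.sortByKey()   // ascending by count
sorted.foreach(println)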
30
Spark RDD Operations
RDD Transformation – Word count example

● Let’s print out the result (and see that the count of each word now becomes the key)

31
Spark RDD Operations
RDD Transformation – Word count example

● Apply sortByKey() transformation

If there is more than one partition, the records within each partition are sorted, but the
sorted results from different partitions may be interleaved in the printed output.
To avoid this, sort into a single partition, e.g. sortByKey(numPartitions=1) in PySpark.
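In the Scala API the number of output partitions is the second argument of sortByKey (the first is the ascending flag):

// Sort into a single partition so the printed output appears fully ordered
val sortedOne = swapped.sortByKey(true, 1)
sortedOne.foreach(println)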

32
Spark RDD Operations
RDD Transformation – Word count example

● You can also apply a map transformation to switch the word back to being the key
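A sketch of this step, continuing from the sorted RDD above:

// Swap back so the word is the key again: (count, word) -> (word, count)
val sortedByWordKey = sorted.map(a => (a._2, a._1))
sortedByWordKey.foreach(println)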

33
Spark RDD Operations
RDD Transformation – Word count example

● You can do all these steps in one line of code
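A sketch of the whole pipeline as one expression, under the same assumed file path (the outer parentheses let spark-shell read it as a single multi-line expression):

val result = (sc.textFile("/user/cloudera/wordcount/words.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(a => (a._2, a._1))
  .sortByKey(true, 1)
  .map(a => (a._2, a._1)))
result.foreach(println)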

34
Spark RDD Operations
RDD Transformation – Word count example

● Or you can use sortBy
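sortBy can sort on the value directly, without swapping key and value first:

val byCount = count.sortBy(_._2, ascending = false, numPartitions = 1)
byCount.foreach(println)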

35
Spark RDD Operations
RDD Transformation – Word count example

● Apply filter() Transformation


● filter() transformation is used to filter the records in an RDD. In this example we keep only
the words that start with “t”.
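A sketch of this step on the word-count pair RDD:

// Keep only the pairs whose word starts with "t"
val tWords = count.filter(a => a._1.startsWith("t"))
tWords.foreach(println)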

36
Spark RDD Operations
RDD Transformation – Word count example

● Merge two RDDs
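One way to merge two RDDs is union(); the second file path is hypothetical:

val moreWords = sc.textFile("/user/cloudera/wordcount/words2.txt").flatMap(_.split(" "))
val merged = testfile2.union(moreWords)
merged.count()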

37
Spark RDD Operations
RDD Action – Easiest Wordcount using countByValue
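countByValue() is an action that returns the counts directly to the driver as a Map, so no reduceByKey is needed (a sketch on the word RDD from the earlier example):

// Word count in one action; the result is a scala.collection.Map[String, Long]
val counts = testfile2.countByValue()
counts.foreach(println)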

38
Spark RDD Operations
RDD Action
● Let’s create some RDDs
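Hypothetical RDDs used in the action examples that follow (the slide's actual data may differ):

val listRdd = sc.parallelize(List(1, 2, 3, 4, 5, 3, 2))
val inputRdd = sc.parallelize(Seq(("Z", 1), ("A", 20), ("B", 30), ("C", 40), ("B", 30), ("B", 60)))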

39
Spark RDD Operations
RDD Action
● reduce() - Reduces the elements of the dataset using the specified binary
operator.
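A sketch using the hypothetical listRdd defined above:

listRdd.reduce(_ + _)               // 20, the sum of the elements
listRdd.reduce((a, b) => a min b)   // 1, the minimum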

40
Spark RDD Operations
RDD Action
● collect() – Return the complete dataset as an Array.
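For example, with the hypothetical listRdd from above:

listRdd.collect()   // Array(1, 2, 3, 4, 5, 3, 2)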

41
Spark RDD Operations
RDD Action
● count() – Return the count of elements in the dataset.
● countApprox() – Return an approximate count of elements in the dataset; this method returns a
possibly incomplete result if the timeout is reached before execution finishes.
● countApproxDistinct() – Return an approximate number of distinct elements in the dataset.
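A sketch with the hypothetical listRdd:

listRdd.count()                 // 7
listRdd.countApprox(1000)       // approximate count, with a 1000 ms timeout
listRdd.countApproxDistinct()   // approximate number of distinct elements, here about 5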

42
Spark RDD Operations
RDD Action

● first() – Return the first element in the dataset.


● top() – Return top n elements from the dataset.
● min() – Return the minimum value from the dataset.
● max() – Return the maximum value from the dataset.
● take() – Return the first num elements of the dataset.
● takeOrdered() – Return the num smallest elements from the dataset; this is
the opposite of the top() action.
● takeSample() – Return the subset of the dataset in an Array.
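A sketch of these actions on the hypothetical listRdd:

listRdd.first()               // 1
listRdd.top(2)                // Array(5, 4), the largest elements
listRdd.min()                 // 1
listRdd.max()                 // 5
listRdd.take(3)               // Array(1, 2, 3), the first elements in partition order
listRdd.takeOrdered(3)        // Array(1, 2, 2), the smallest elements
listRdd.takeSample(false, 3)  // 3 random elements, without replacement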

43
Spark Pair RDD Functions
Pair RDD Transformation
Spark defines the PairRDDFunctions class with several functions for working with pair RDDs (RDDs of key-value pairs)

44
Spark Pair RDD Functions
Pair RDD Action

Spark defines the PairRDDFunctions class with several functions for working with pair RDDs (RDDs of key-value pairs)

45
Spark Pair RDD Functions
Wordcount using reduceByKey

46
Spark Pair RDD Functions
Wordcount using reduceByKey

● Save the output in a text file
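A sketch, assuming the word-count pair RDD count from the earlier example; the output directory must not already exist:

// Save the word counts to HDFS as text files (one part file per partition)
count.saveAsTextFile("/user/cloudera/wordcount/output")
// Then check it from a terminal: hdfs dfs -cat /user/cloudera/wordcount/output/part-*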

47
Spark Pair RDD Functions
Wordcount using reduceByKey

● Check the result

48
Spark Pair RDD Functions
Another usage of reduceByKey

49
Spark Pair RDD Functions
Wordcount using countByKey action
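A sketch on the (word, 1) pair RDD; countByKey is an action that counts the records per key and returns a local Map:

val keyCounts = testfile3.countByKey()   // scala.collection.Map[String, Long]
keyCounts.foreach(println)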

50
Spark Pair RDD Functions
Join two RDDs

● Create two RDDs from text files
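A sketch with hypothetical files whose lines look like "a 1" and "a x"; the paths and the parsing are assumptions:

val rddA = sc.textFile("/user/cloudera/join/file1.txt").map { line =>
  val p = line.split(" "); (p(0), p(1))
}
val rddB = sc.textFile("/user/cloudera/join/file2.txt").map { line =>
  val p = line.split(" "); (p(0), p(1))
}
val joined = rddA.join(rddB)
joined.foreach(println)   // each record is (key, (valueFromA, valueFromB))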

51
Spark Pair RDD Functions
Join two RDDs

● Check the result

If there are duplicate keys in either RDD, the join produces a Cartesian product of the matching records.

52
Spark Pair RDD Functions
Join two RDDs

● Use cogroup or groupByKey to get unique keys in the joining result
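A sketch using cogroup on the same two RDDs; it yields one record per key, with the values from each RDD grouped:

val grouped = rddA.cogroup(rddB)
grouped.foreach { case (k, (fromA, fromB)) =>
  println((k, fromA.toList, fromB.toList))
}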

53
Spark shell Web UI
Accessing the Web UI of Spark

● Open http://quickstart.cloudera:4040 in a web browser

54
Spark shell Web UI
Explore Spark shell Web UI

● Click DAG Visualization


Number of partitions

55
Spark shell Web UI
Explore Spark shell Web UI

● View the Directed Acyclic Graph (DAG) of the completed job

56
Spark shell Web UI
Partition and parallelism in RDDs

57
Spark shell Web UI
Partition and parallelism in RDDs

● Execute a parallel task in the shell
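A sketch of a parallel job; the data size and the 5 partitions are illustrative:

// Create an RDD with 5 partitions and run an action; the job then appears in the Web UI
val bigRdd = sc.parallelize(1 to 1000000, 5)
bigRdd.map(_ * 2).count()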

58
Spark shell Web UI
Partition and parallelism in RDDs
● Open Spark shell Web UI

In total, 5 partition tasks are completed

59


Spark shell Web UI
Partition and parallelism in RDDs

Parallel execution of 5 completed jobs

60
Q&A

Thank you for your attention


We hope to reach success together.

61