Lecture 4 - Pair RDD and DataFrame
http://spark.apache.org/
http://shop.oreilly.com/product/0636920028512.do
Spark architecture
Supported languages
• Python
• lines = sc.parallelize(["workshop", "spark"])
• Scala
• val lines = sc.parallelize(List("workshop", "spark"))
• Java
• JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"))
Python
lines = sc.textFile("/path/to/file.txt")
Scala
val lines = sc.textFile("/path/to/file.txt")
Java
JavaRDD<String> lines = sc.textFile("/path/to/file.txt")
RDD persistence
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
OFF_HEAP
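A minimal PySpark sketch of selecting one of these storage levels with persist(); the file path and data are hypothetical:
from pyspark import StorageLevel

lines = sc.textFile("/path/to/file.txt")
words = lines.flatMap(lambda line: line.split(" "))
# Keep the RDD in memory, spilling partitions to disk if they do not fit
words.persist(StorageLevel.MEMORY_AND_DISK)
words.count()      # first action computes and caches the RDD
words.take(5)      # later actions reuse the cached partitions
words.unpersist()  # release the storage when no longer needed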
RDDs
§ RDDs can hold any serializable type of element
–Primitive types such as integers, characters, and booleans
–Sequence types such as strings, lists, arrays, tuples, and dicts (including nested data types)
–Scala/Java objects (if serializable)
–Mixed types
§ Some RDDs are specialized and have additional functionality
–Pair RDDs: RDDs consisting of key-value pairs
–Double RDDs: RDDs consisting of numeric data
myData = ["Alice","Carlos","Frank","Barbara"]
> myRdd = sc.parallelize(myData)
> myRdd.take(2) ['Alice', 'Carlos']
★ Paired RDD
★ DataFrame
★ DataSet
Paired RDD
Pair RDDs
Map-Reduce
§ Map-reduce is a common programming model
–Easily applicable to distributed processing of large data sets
Map-Reduce in Spark
§ Map-reduce in Spark works on pair RDDs
§ Map phase
–Operates on one record at a time
–“Maps” each record to zero or more new records
–Examples: map, flatMap, filter, keyBy
§ Reduce phase
–Works on map output
–Consolidates multiple records
–Examples: reduceByKey, sortByKey, mean
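A minimal PySpark sketch of the two phases; the web-log file and its layout are hypothetical:
# Map phase: one (key, value) pair per record, keyed by the first field (e.g. an IP address)
logs = sc.textFile("/path/to/weblogs.txt")
pairs = logs.map(lambda line: (line.split(" ")[0], 1))
# Reduce phase: consolidate all values that share a key
counts = pairs.reduceByKey(lambda a, b: a + b)
counts.take(5)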
reduceByKey
§ The function passed to reduceByKey combines two values that share the same key
–The function must be binary (it takes two values and returns one)
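For example, keeping the highest value per key with a binary function (the data is made up):
scores = sc.parallelize([("alice", 80), ("bob", 95), ("alice", 90)])
best = scores.reduceByKey(lambda a, b: max(a, b))  # called with two values of the same key
best.collect()  # [('alice', 90), ('bob', 95)] (order may vary)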
§ Examples
–countByKey returns a map with the count of occurrences of each key
–groupByKey groups all the values for each key in an RDD
–sortByKey sorts in ascending or descending order
–join returns an RDD containing all pairs with matching keys from two RDDs
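A short sketch of these operations on two small, hypothetical pair RDDs:
orders = sc.parallelize([("u1", 10), ("u2", 5), ("u1", 7)])
users = sc.parallelize([("u1", "Alice"), ("u2", "Carlos")])
orders.countByKey()                            # {'u1': 2, 'u2': 1}
orders.groupByKey().mapValues(list).collect()  # [('u1', [10, 7]), ('u2', [5])] (order may vary)
orders.sortByKey(ascending=False).collect()
users.join(orders).collect()                   # [('u1', ('Alice', 10)), ('u1', ('Alice', 7)), ('u2', ('Carlos', 5))]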
SQL Context
§ The main Spark SQL entry point is a SQL context object
–Requires a SparkContext object
–The SQL context in Spark SQL is similar to the Spark context in core Spark
§ There are two implementations
–SQLContext
  –Basic implementation
–HiveContext
  –Reads and writes Hive/HCatalog tables directly
  –Supports the full HiveQL language
  –Requires the Spark application to be linked with Hive libraries
  –Cloudera recommends using HiveContext
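A minimal sketch of creating either context in a standalone PySpark application (the application name is hypothetical):
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

sc = SparkContext(appName="sql-demo")
sqlContext = SQLContext(sc)      # basic implementation
# hiveContext = HiveContext(sc)  # requires the Hive libraries to be available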
§ The Spark shell creates a HiveContext instance automatically
–Available as the variable sqlContext
–You will need to create one yourself when writing a Spark application
–Having multiple SQL context objects is allowed
DataFrames
§ DataFrames are the main abstraction in Spark SQL
§ Convenience functions
–json(filename)
–parquet(filename)
–orc(filename)
–table(hive-tablename)
–jdbc(url,table,options)
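These functions are available through sqlContext.read; a sketch with hypothetical file, table, and connection names:
people_df = sqlContext.read.json("people.json")
sales_df = sqlContext.read.parquet("sales.parquet")
hive_df = sqlContext.read.table("customers")   # Hive table (HiveContext only)
jdbc_df = sqlContext.read.jdbc("jdbc:mysql://dbhost/shop", "accounts",
                               properties={"user": "dbuser", "password": "pw"})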
Data Sources
§ Spark SQL 1.6 built-in data source types
–table
–json
–parquet
–jdbc
–orc
§ You can also use third party data source libraries, such as
–Avro (included in CDH)
–HBase
–CSV
–MySQL
–and more being added all the time
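A sketch of loading a third-party format by name, assuming the spark-avro and spark-csv packages are on the classpath (file names are hypothetical):
avro_df = sqlContext.read.format("com.databricks.spark.avro").load("events.avro")
csv_df = (sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .load("events.csv"))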
DataFrame Actions
DataFrame Queries
§ DataFrame query methods return new DataFrames
–Queries can be chained like transformations
§ Some query methods
–distinct returns a new DataFrame with the distinct elements of this DataFrame
–join joins this DataFrame with a second DataFrame
  –Variants for inner, outer, left, and right joins
–limit returns a new DataFrame with the first n rows of this DataFrame
–select returns a new DataFrame with data from one or more columns of the base DataFrame
–where returns a new DataFrame with rows meeting specified query criteria (alias for filter)
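Chaining a few of these query methods, assuming a hypothetical people_df DataFrame with name and age columns:
adults = (people_df
          .where(people_df.age >= 18)  # alias for filter
          .select("name", "age")
          .distinct()
          .limit(10))
adults.show()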
Joining DataFrames
§ A basic inner join when the join column is present in both DataFrames
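A sketch of both join forms, assuming hypothetical people_df and orders_df DataFrames that share a pcode column:
# Shared column name: inner join on pcode
people_df.join(orders_df, "pcode").show()
# Explicit join expression and join type
people_df.join(orders_df, people_df.pcode == orders_df.pcode, "left_outer").show()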
SQL Queries
§ When using HiveContext, you can query Hive/Impala
tables using HiveQL
– Returns a DataFrame
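A sketch of a HiveQL query issued through the context; the customers table is hypothetical:
top_df = sqlContext.sql(
    "SELECT zipcode, COUNT(*) AS cnt FROM customers "
    "GROUP BY zipcode ORDER BY cnt DESC LIMIT 10")
top_df.show()  # the result is an ordinary DataFrame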
Saving DataFrames
§ Data in DataFrames can be saved to a data source
§ Use DataFrame.write to create a DataFrameWriter
§ DataFrameWriter provides convenience functions to
externally save the data represented by a DataFrame
–jdbc inserts into a new or existing table in a database
–json saves as a JSON file
–parquet saves as a Parquet file
–orc saves as an ORC file
–text saves as a text file (string data in a single column only)
–saveAsTable saves as a Hive/Impala table (HiveContext only)
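A sketch of a few DataFrameWriter calls on a hypothetical counts_df DataFrame (output paths and table name are made up):
counts_df.write.parquet("/path/to/output_parquet")
counts_df.write.mode("overwrite").json("/path/to/output_json")
counts_df.write.saveAsTable("count_summary")  # Hive/Impala table, HiveContext only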
Prerequisites
• Docker
• Zeppelin Docker Container
• Terminal Tools (Command Prompt, PowerShell)
Working with Spark RDDs and Pair RDDs
RDD Operations
§ Transformations
–map(), flatMap(), filter(), union(), intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), …
§ Actions
–count(), collect(), first(), top(n), take(n), takeOrdered(n), countByValue(), reduce(), foreach(), …
Lambda Expression
General syntax: lambda arguments: expression
PySpark WordCount example:
input_file = sc.textFile("/path/to/text/file")
# Map phase: split each line into words and emit (word, 1) pairs
word_pairs = input_file.flatMap(lambda line: line.split(" ")) \
                       .map(lambda word: (word, 1))
# Reduce phase: sum the counts for each word
counts = word_pairs.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("/path/to/output/")
Try the following:
- Create an RDD from a text file
- Count the number of airports
- Filter by country
- Group by country
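One possible sketch, assuming a comma-separated airports.csv file whose columns are id, name, city, country:
airports = sc.textFile("airports.csv").map(lambda line: line.split(","))
airports.count()                                        # number of airports
airports.filter(lambda a: a[3] == "Vietnam").collect()  # filter by country
(airports.map(lambda a: (a[3], a[1]))                   # group airport names by country
         .groupByKey()
         .mapValues(list)
         .take(3))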
Try the following:
- Join two RDDs
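A self-contained sketch of joining two pair RDDs keyed by country (the data is made up):
populations = sc.parallelize([("Vietnam", 97000000), ("France", 67000000)])
capitals = sc.parallelize([("Vietnam", "Hanoi"), ("France", "Paris")])
populations.join(capitals).collect()
# [('Vietnam', (97000000, 'Hanoi')), ('France', (67000000, 'Paris'))]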
Working with DataFrames and Spark SQL
Creating a DataFrame (1)
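Two common ways to create a DataFrame, sketched with hypothetical data and file names:
# From a JSON file, with the schema inferred
people_df = sqlContext.read.json("people.json")
# From an existing RDD of tuples, naming the columns explicitly
rows = sc.parallelize([("Alice", 34), ("Carlos", 29)])
people_df2 = sqlContext.createDataFrame(rows, ["name", "age"])
people_df2.printSchema()
people_df2.show()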
Creating a DataFrame
DataFrame APIs
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
Spark SQL
Thank you for your attention!