Lecture 4 - Pair RDD and DataFrame

Working with Pair RDDs and DataFrame


Where to learn Spark?

http://spark.apache.org/

http://shop.oreilly.com/product/0636920028512.do

Spark architecture


Easy ways to run Spark?

★ Your IDE (e.g. Eclipse or IntelliJ IDEA)
★ Standalone Deploy Mode: the simplest way to deploy Spark on a single machine
★ Docker & Zeppelin
★ Amazon EMR
★ Hadoop vendors (Cloudera, Hortonworks)

Supported languages


RDD (Resilient Distributed Dataset)
– Resilient: If data in memory is lost, it can be recreated
– Distributed: Processed across the cluster
– Dataset: Initial data can come from a source such as a
file, or it can be created programmatically
- RDDs are the fundamental unit of data in Spark
- Most Spark programming consists of performing
operations on RDDs

Creating RDD (I)

• Python
  lines = sc.parallelize(["workshop", "spark"])

• Scala
  val lines = sc.parallelize(List("workshop", "spark"))

• Java
  JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"))


Creating RDD (II)

Python
  lines = sc.textFile("/path/to/file.txt")

Scala
  val lines = sc.textFile("/path/to/file.txt")

Java
  JavaRDD<String> lines = sc.textFile("/path/to/file.txt")


RDD persistence

MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
OFF_HEAP
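
A minimal PySpark sketch of choosing one of these storage levels (it assumes the interactive shell provides sc, as in the earlier examples; the input path is hypothetical):

  from pyspark import StorageLevel

  lines = sc.textFile("/path/to/file.txt")       # hypothetical input path
  lines.persist(StorageLevel.MEMORY_AND_DISK)    # keep partitions in memory, spill to disk if needed
  lines.count()                                  # the first action materializes and caches the RDD
  lines.unpersist()                              # release the cached partitions when no longer needed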


Working with RDDs


RDDs
RDDs can hold any serializable type of element
–Primitive types such as integers, characters, and
booleans
–Sequence types such as strings, lists, arrays, tuples,
and dicts (including nested data types)
–Scala/Java Objects (if serializable)
–Mixed types
§ Some RDDs are specialized and have additional functionality
–Pair RDDs: RDDs consisting of key-value pairs
–Double RDDs: RDDs consisting of numeric data


Creating RDDs from Collections

You can create RDDs from collections instead of files


–sc.parallelize(collection)

> myData = ["Alice", "Carlos", "Frank", "Barbara"]
> myRdd = sc.parallelize(myData)
> myRdd.take(2)
['Alice', 'Carlos']


Creating RDDs from Text Files (1)


For file-based RDDs, use SparkContext.textFile
– Accepts a single file, a directory of files, a wildcard list of
files, or a comma-separated list of files. Examples:
–sc.textFile("myfile.txt")
–sc.textFile("mydata/")
–sc.textFile("mydata/*.log")
–sc.textFile("myfile1.txt,myfile2.txt")
–Each line in each file is a separate record in the RDD
Files are referenced by absolute or relative URI
–Absolute URI:
–file:/home/training/myfile.txt
–hdfs://nnhost/loudacre/myfile.txt


Examples: Multi-RDD Transformations (1)


Examples: Multi-RDD Transformations (2)
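
The figures on these two slides are not reproduced here; a hedged PySpark sketch of the usual multi-RDD transformations, with made-up sample data:

  rdd1 = sc.parallelize(["Chicago", "Boston", "Paris", "San Francisco", "Tokyo"])
  rdd2 = sc.parallelize(["San Francisco", "Boston", "Amsterdam", "Mumbai", "McMurdo Station"])

  rdd1.union(rdd2).collect()          # all elements of both RDDs (duplicates are kept)
  rdd1.intersection(rdd2).collect()   # elements present in both RDDs
  rdd1.subtract(rdd2).collect()       # elements of rdd1 that are not in rdd2
  rdd1.zip(rdd2).collect()            # positional pairs; both RDDs need the same partitioning and element count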


Some Other General RDD Operations

Other RDD operations


–first returns the first element of the RDD
–foreach applies a function to each element in an RDD
–top(n) returns the largest n elements using natural ordering
Sampling operations
–sample creates a new RDD with a sampling of elements
–takeSample returns an array of sampled elements


Other data structures in Spark

★ Paired RDD

★ DataFrame

★ DataSet


Paired RDD

Paired RDD = an RDD of key/value pairs

[Figure: an RDD of user records (user1 … user5) is keyed by ID to form a pair RDD of (id1, user1) … (id5, user5) pairs]


Pair RDDs


Pair RDDs

§ Pair RDDs are a special form of RDD
–Each element must be a key-value pair (a two-element tuple)
–Keys and values can be any type
§ Why?
–Use with map-reduce algorithms
–Many additional functions are available for common data processing needs, such as sorting, joining, grouping, and counting


Creating Pair RDDs


The first step in most workflows is to get the data into key/value form

–What should the RDD be keyed on?
–What is the value?

Commonly used functions to create pair RDDs


–map
–flatMap / flatMapValues
–keyBy
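
A hedged sketch of keyBy and flatMapValues in PySpark (map is illustrated in the next slide's example); the sample data here is invented:

  logs = sc.parallelize(["GET /index.html user3", "GET /about.html user1"])
  byUser = logs.keyBy(lambda line: line.split(" ")[2])      # ('user3', 'GET /index.html user3'), ...

  skills = sc.parallelize([("u1", "spark:hive"), ("u2", "pig")])
  skills.flatMapValues(lambda v: v.split(":")).collect()    # [('u1', 'spark'), ('u1', 'hive'), ('u2', 'pig')]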


Example: A Simple Pair RDD


Example: Create a pair RDD from a tab-separated file
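
The slide's code listing is not reproduced; a minimal PySpark sketch, assuming a hypothetical tab-separated file whose first two fields are a user ID and a name:

  users = sc.textFile("userlist.tsv") \
            .map(lambda line: line.split("\t")) \
            .map(lambda fields: (fields[0], fields[1]))   # (user id, name) pairs
  users.take(2)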


Example: Keying Web Logs by User ID


Mapping Single Rows to Multiple Pairs


Answer: Mapping Single Rows to Multiple Pairs
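
The slide's worked answer is not reproduced; one hedged way to do it, assuming each input line holds an ID and a colon-separated list of items:

  orders = sc.parallelize(["00001\tsku010:sku933:sku022", "00002\tsku912"])
  pairs = orders.map(lambda line: line.split("\t")) \
                .flatMap(lambda fields: [(fields[0], item) for item in fields[1].split(":")])
  pairs.collect()
  # [('00001', 'sku010'), ('00001', 'sku933'), ('00001', 'sku022'), ('00002', 'sku912')]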


Map-Reduce
§ Map-reduce is a common programming model
–Easily applicable to distributed processing of large
data sets

§ Hadoop MapReduce is the major implementation


–Somewhat limited
–Each job has one map phase, one reduce phase
–Job output is saved to files

§ Spark implements map-reduce with much greater flexibility
–Map and reduce functions can be interspersed
–Results can be stored in memory
–Operations can easily be chained


Map-Reduce in Spark
§ Map-reduce in Spark works on pair RDDs

§ Map phase
–Operates on one record at a time
–“Maps” each record to zero or more new records
–Examples: map, flatMap, filter, keyBy

§ Reduce phase
–Works on map output
–Consolidates multiple records
–Examples: reduceByKey, sortByKey, mean


Example: Word Count


reduceByKey
The function passed to reduceByKey combines two values that share the same key
– The function must be binary (it takes two values and returns one)


> val counts = sc.textFile(file).
    flatMap(line => line.split(' ')).
    map(word => (word, 1)).
    reduceByKey((v1, v2) => v1 + v2)

OR

> val counts = sc.textFile(file).
    flatMap(_.split(' ')).
    map((_, 1)).
    reduceByKey(_ + _)


Pair RDD Operations


§ In addition to map and reduceByKey operations, Spark has several operations specific to pair RDDs

§ Examples
–countByKey returns a map with the count of
occurrences of each key
–groupByKey groups all the values for each key in an
RDD
–sortByKey sorts in ascending or descending order
–join returns an RDD containing all pairs with matching keys from two RDDs
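
A brief PySpark sketch of these operations on a toy pair RDD (the data is invented; output shown in comments is illustrative):

  sales = sc.parallelize([("apple", 2), ("pear", 1), ("apple", 5)])
  prices = sc.parallelize([("apple", 0.5), ("pear", 0.3)])

  sales.countByKey()                              # e.g. {'apple': 2, 'pear': 1} -- occurrences per key
  sales.groupByKey().mapValues(list).collect()    # e.g. [('apple', [2, 5]), ('pear', [1])]
  sales.sortByKey(ascending=False).collect()      # sorted by key, descending
  sales.join(prices).collect()                    # e.g. [('apple', (2, 0.5)), ('apple', (5, 0.5)), ('pear', (1, 0.3))]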


Example: Pair RDD Operations


Example: Joining by Key


Other Pair Operations


§ Some other pair operations
–keys returns an RDD of just the keys, without the values
–values returns an RDD of just the values, without keys
–lookup(key) returns the value(s) for a key
–leftOuterJoin, rightOuterJoin, fullOuterJoin join two RDDs, including keys defined only in the left, right, or either RDD respectively
–mapValues, flatMapValues execute a function on just the
values, keeping the key the same
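
A short PySpark sketch of these operations (toy data, illustrative output in comments):

  pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
  other = sc.parallelize([("a", "x"), ("c", "y")])

  pairs.keys().collect()                        # ['a', 'b', 'a']
  pairs.values().collect()                      # [1, 2, 3]
  pairs.lookup("a")                             # [1, 3]
  pairs.mapValues(lambda v: v * 10).collect()   # [('a', 10), ('b', 20), ('a', 30)]
  pairs.leftOuterJoin(other).collect()          # keys only in 'pairs' get None, e.g. ('b', (2, None))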


DataFrames and Apache Spark SQL


What is Spark SQL?


§ What is Spark SQL?
–Spark module for structured data processing
–Replaces Shark (a prior Spark module, now deprecated)
–Built on top of core Spark

§ What does Spark SQL provide?


–The DataFrame API—a library for working with data as
tables
–Defines DataFrames containing rows and columns
–DataFrames are the focus of this chapter!
–Catalyst Optimizer—an extensible optimization framework
–A SQL engine and command line interface


SQL Context
§ The main Spark SQL entry point is a SQL context object
–Requires a SparkContext object
–The SQL context in Spark SQL is similar to Spark context in
core Spark
§ There are two implementations
–SQLContext
–Basic implementation
–HiveContext
–Reads and writes Hive/HCatalog tables directly
–Supports full HiveQL language
–Requires the Spark application be linked with Hive libraries
–Cloudera recommends using HiveContext


Creating a SQL Context

§ The Spark shell creates a HiveContext instance automatically
–Available as sqlContext
–You will need to create one yourself when writing a Spark application
–Having multiple SQL context objects is allowed

§ A SQL context object is created based on the Spark context
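
A PySpark sketch for a standalone Spark 1.6-style application (the shell already provides sqlContext; the application name is hypothetical):

  from pyspark import SparkConf, SparkContext
  from pyspark.sql import SQLContext, HiveContext

  conf = SparkConf().setAppName("MyApp")       # hypothetical application name
  sc = SparkContext(conf=conf)
  sqlContext = SQLContext(sc)                  # basic implementation
  # hiveContext = HiveContext(sc)              # full HiveQL support; needs Hive libraries on the classpath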


DataFrames
§ DataFrames are the main abstraction in Spark SQL

–Analogous to RDDs in core Spark


–A distributed collection of structured data organized into named columns
–Built on a base RDD containing Row objects


Creating a DataFrame from a Data Source

§ sqlContext.read returns a DataFrameReader object

§ DataFrameReader provides the functionality to load data into a DataFrame

§ Convenience functions
–json(filename)
–parquet(filename)
–orc(filename)
–table(hive-tablename)
–jdbc(url,table,options)


Example: Creating a DataFrame from a JSON File
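
The slide's screenshot is not reproduced; a minimal sketch, assuming a hypothetical people.json file:

  peopleDF = sqlContext.read.json("people.json")   # expects one JSON object per line
  peopleDF.printSchema()
  peopleDF.show(5)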


Example: Creating a DataFrame from a Hive/Impala Table


Loading from a Data Source Manually


§ You can specify settings for the DataFrameReader
–format: Specify a data source type
–option: A key/value setting for the underlying data source
–schema: Specify a schema instead of inferring from the data
source
§ Then call the generic base function load
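
A sketch of the generic reader path, using the JDBC data source as an example (the connection settings are hypothetical):

  accountsDF = (sqlContext.read
      .format("jdbc")                                # data source type
      .option("url", "jdbc:mysql://dbhost/mydb")     # hypothetical connection settings
      .option("dbtable", "accounts")
      .option("user", "training")
      .option("password", "training")
      .load())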


Data Sources
§ Spark SQL 1.6 built-in data source types
–table
–json
–parquet
–jdbc
–orc
§ You can also use third party data source libraries, such as
–Avro (included in CDH)
–HBase
–CSV
–MySQL
–and more being added all the time


DataFrame Basic Operations


§ Basic operations deal with DataFrame metadata (rather than its data)
§ Some examples
–schema returns a schema object describing the data
–printSchema displays the schema as a visual tree
–cache / persist persists the DataFrame to disk or memory
–columns returns an array containing the names of the columns
–dtypes returns an array of (column name, type) pairs
–explain prints debug information about the DataFrame to the console
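
A short sketch using the hypothetical peopleDF from the JSON example above (outputs in comments are illustrative):

  peopleDF.printSchema()   # schema as a visual tree
  peopleDF.columns         # e.g. ['age', 'name', 'pcode']
  peopleDF.dtypes          # e.g. [('age', 'bigint'), ('name', 'string'), ('pcode', 'string')]
  peopleDF.cache()         # persist the DataFrame in memory for reuse
  peopleDF.explain()       # print the execution plan to the console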


DataFrame Basic Operations


DataFrame Actions

§ Some DataFrame actions
–collect returns all rows as an array of Row objects
–take(n) returns the first n rows as an array of Row objects
–count returns the number of rows
–show(n) displays the first n rows (default = 20)


DataFrame Queries
§ DataFrame query methods return new DataFrames
– Queries can be chained like transformations
§ Some query methods
–distinct returns a new DataFrame with distinct elements of
this DF
–join joins this DataFrame with a second DataFrame
– Variants for inner, outer, left, and right joins
–limit returns a new DataFrame with the first n rows of this DF
–select returns a new DataFrame with data from one or
more columns of the base DataFrame
–where returns a new DataFrame with rows meeting
specified query criteria (alias for filter)
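
A sketch chaining a few of these methods on the hypothetical peopleDF (pcodesDF is an assumed second DataFrame sharing a 'pcode' column):

  peopleDF.select("name", "age").where(peopleDF["age"] > 20).limit(5).show()
  peopleDF.distinct().count()
  peopleDF.join(pcodesDF, "pcode").show()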


DataFrame Query Strings


Querying DataFrames using Columns

§ Columns can be referenced in multiple ways
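
A sketch of the common column reference styles in PySpark (column name assumed from the earlier examples):

  peopleDF.select(peopleDF["age"]).show()    # index-style column reference
  peopleDF.select(peopleDF.age).show()       # attribute-style column reference
  peopleDF.where(peopleDF.age > 20).show()   # column expressions inside a filter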


Joining DataFrames
§ A basic inner join when the join column is in both DataFrames


Joining DataFrames



SQL Queries
§ When using HiveContext, you can query Hive/Impala
tables using HiveQL
– Returns a DataFrame
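
A minimal sketch (the table name is assumed to exist in the Hive metastore):

  customersDF = sqlContext.sql(
      "SELECT name, total FROM customers WHERE total > 100 ORDER BY total DESC")
  customersDF.show()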


Saving DataFrames
§ Data in DataFrames can be saved to a data source
§ Use DataFrame.write to create a DataFrameWriter
§ DataFrameWriter provides convenience functions to
externally save the data represented by a DataFrame
–jdbc inserts into a new or existing table in a database
–json saves as a JSON file
–parquet saves as a Parquet file
–orc saves as an ORC file
–text saves as a text file (string data in a single column only)
–saveAsTable saves as a Hive/Impala table (HiveContext only)


Options for Saving DataFrames


§ DataFrameWriter option methods
–format specifies a data source type
–mode determines the behavior if the file or table already exists: overwrite, append, ignore, or error (default is error)
–partitionBy stores data in partitioned directories in the form column=value (as with Hive/Impala partitioning)
–options specifies properties for the target data source
–save is the generic base function to write the data
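
A sketch combining the convenience and generic writer paths (output paths and column name are hypothetical):

  # convenience function: save as Parquet
  peopleDF.write.parquet("/path/to/people_parquet")

  # generic form with explicit writer options
  peopleDF.write \
      .format("parquet") \
      .mode("append") \
      .partitionBy("pcode") \
      .save("/path/to/people_by_pcode")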


DataFrames and RDDs


§ DataFrames are built on RDDs
–Base RDDs contain Row objects
–Use rdd to get the underlying RDD


DataFrames and RDDs


§ Row RDDs have all the standard Spark actions and
transformations
–Actions: collect, take, count, and so on
–Transformations: map, flatMap, filter, and so on

§ Row RDDs can be transformed into pair RDDs to use map-reduce methods

§ DataFrames also provide convenience methods (such as map, flatMap, and foreach) for converting to RDDs


Working with Row Objects


–Use Array-like syntax to return values with type Any
–row(n) returns element in the nth column
–row.fieldIndex("age") returns the index of the age column
–Use methods to get correctly typed values
–row.getAs[Long]("age")
–Use type-specific get methods to return typed values
–row.getString(n) returns nth column as a string
–row.getInt(n) returns nth column as an integer
–And so on


Prerequisites

• Docker
• Zeppelin Docker Container
• Terminal Tools (Command Prompt, PowerShell)


Working with
Spark RDDs, Pair-RDDs



RDD Operations

Transformations: map(), flatMap(), filter(), union(), intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), …

Actions: count(), collect(), first(), top(n), take(n), takeOrdered(n), countByValue(), reduce(), foreach(), …

Lambda Expression
PySpark WordCount example:

input_file = sc.textFile("/path/to/text/file")
# split lines into words, pair each word with 1, then sum the counts per word
words = input_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1))
counts = words.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("/path/to/output/")

General lambda syntax: lambda arguments: expression



PySpark RDD API


https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD


Practice with flight data (1)

Data: airports.dat (https://openflights.org/data.html)


[Airport ID, Name, City, Country, IATA, ICAO, Latitude, Longitude, Altitude,
Timezone, DST, Tz database, Type, Source]

Try the following:
- Create an RDD from the text file
- Count the number of airports
- Filter by country
- Group by country
- Count the number of airports in each country
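
One possible hedged solution sketch (field positions follow the list above; CSV quirks such as quoted commas inside names are ignored for simplicity):

  airports = sc.textFile("airports.dat").map(lambda line: line.split(","))

  airports.count()                                          # number of airports
  airports.filter(lambda f: f[3] == '"Vietnam"').count()    # airports in one country (field values are quoted)
  byCountry = airports.map(lambda f: (f[3], f)).groupByKey()
  airports.map(lambda f: (f[3], 1)) \
          .reduceByKey(lambda a, b: a + b) \
          .take(10)                                         # (country, number of airports)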



Practice with flight data (2)

• Data: airports.dat (https://openflights.org/data.html)


[Airport ID, Name, City, Country, IATA, ICAO, Latitude, Longitude, Altitude,
Timezone, DST, Tz database, Type, Source]
• Data: routes.dat
[Airline, Airline ID, Source airport, Source airport ID, Destination airport,
Destination airport ID, Codeshare, Stops, Equipment]

Try the following:
- Join the two RDDs
- Count the number of flights arriving in each country
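
A hedged sketch of the join (same simplifications as above; the destination airport ID is field 5 of routes.dat and the airport ID is field 0 of airports.dat):

  airports = sc.textFile("airports.dat").map(lambda l: l.split(","))
  routes = sc.textFile("routes.dat").map(lambda l: l.split(","))

  airportCountry = airports.map(lambda f: (f[0], f[3]))      # (airport id, country)
  arrivals = routes.map(lambda f: (f[5], 1))                 # keyed by destination airport id
  flightsPerCountry = (arrivals.join(airportCountry)         # (airport id, (1, country))
                               .map(lambda kv: (kv[1][1], kv[1][0]))
                               .reduceByKey(lambda a, b: a + b))
  flightsPerCountry.take(10)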


Working with
DataFrame and Spark SQL



Creating a DataFrame (1)


Creating a DataFrame

From CSV file:

From RDD:
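
The slide's screenshots are not reproduced; a hedged sketch of both paths (the CSV reader shown is the spark-csv package used with Spark 1.x distributions; in Spark 2+ it is built in as format "csv"):

  # From a CSV file
  airportsDF = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "true")
      .load("airports.dat"))

  # From an RDD of Row objects
  from pyspark.sql import Row
  peopleRDD = sc.parallelize([Row(name="Alice", age=30), Row(name="Carlos", age=25)])
  peopleDF = sqlContext.createDataFrame(peopleRDD)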



DataFrame APIs

• DataFrame: show(), collect(), createOrReplaceTempView(), distinct(), filter(), select(), count(), groupBy(), join(), …
• Column: like()
• Row: row.key, row[key]
• GroupedData: count(), max(), min(), sum(), …

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html


Spark SQL

• Create a temporary view


• Query using SQL syntax
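
A minimal sketch using the hypothetical peopleDF built from the RDD above (in Spark 1.6 the method is registerTempTable; createOrReplaceTempView, listed in the API slide, is its Spark 2+ equivalent):

  peopleDF.registerTempTable("people")    # Spark 1.6; use createOrReplaceTempView in Spark 2+
  adultsDF = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
  adultsDF.show()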



Thank you for your attention!
