SIC - Big Data - Chapter 6 - Workbook
Contents
Lab 1: Working with the PySpark Shell...................................................3
Lab 2: Basic Python..............................................................................11
Lab 3: Working with Core API Transformations......................................17
Lab 4: Working with Pair RDDs.............................................................34
Lab 5: Putting it all together................................................................44
Lab 6: Working with the DataFrame API................................................56
Lab 7: Working with Hive from Spark....................................................65
Lab 8: Spark SQL Transformations........................................................72
Lab 9: Working with Spark SQL.............................................................87
Lab 10: Transforming RDDs to DataFrames................................................99
Lab 11: Working with the DStream API..................................................105
Lab 12: Working with Multi-Batch DStream API..........................................109
Lab 13: Working with Structured Streaming API.........................................118
Lab 14: Create an Apache Spark Application............................................122
Lab 1: Working with the PySpark Shell
In this lab, we will use the PySpark shell to explore working with PySpark. First, we will
get hands-on experience with the generic shell. In the next lab, we will configure the
shell to work with Jupyter and explore how to work with PySpark in Jupyter.
1.1.1 Open .bashrc file for editing from your home directory. You may use
any editing tool of your choice. Here, we shall show using KWrite.
1.1.2 Open KWrite by clicking on the KWrite text editor icon in the tools
panel at the bottom right of the screen
1.1.5 Check the Show Hidden Files option
1.1.6 Select .bashrc file and edit the file to enable ipython mode
1.1.7 Normally, the .bashrc file gets executed whenever you start a new
terminal session and the settings are associated with that terminal
only. In order to apply the changed settings, close the current
terminal and open a new terminal. Alternatively, you may also
execute the following command from the current terminal to execute
the .bashrc file
1.2 Start the pyspark shell
1.3 The pyspark command is actually a script that performs various setup
steps, including creating the SparkContext and the SparkSession. Your
screen should be similar to below:
1.4 Check and verify the SparkContext and SparkSession are available
In [ ]: sc
In [ ]: spark
1.5 Exit the shell using either the exit() command or Ctrl-d
In [3]: exit()
2.1 Explore and copy alice_in_wonderland.txt file to your HDFS home directory
[student@localhost ~]$ cd /home/student/Data
2.1.3 Copy the file to your HDFS home directory. Name the new file
alice.txt
2.2 Verify that the file has been uploaded properly. We will use the hdfs
command line.
2.3 We will use the Hue file browser interface this time
If ZooKeeper is not running or has failed, you will see a screen similar to below
Restart ZooKeeper if necessary:
Once Hue is running properly, you will get a screen similar to below after checking its
status
2.5 Open Hue from Firefox browser. Use the available bookmark or use the
following URL: http://localhost:8888. Use the following authorization
information
Username: student
Password: student
2.6 Select from the left tab menu and navigate to the file browser pane.
2.7 Select alice.txt file. Hue will display the content of the file.
If the file is a text file, including JSON, CSV or Plain Text files, Hue is able to display
the contents. If the file is in a binary format such as Parquet file format, use the
parquet-tools command line tool.
3.2.1 Create aliceRDD by reading the alice.txt file from our home HDFS
directory
In [ ]: aliceRDD = sc.textFile("alice.txt")
3.3 Transform alice.txt, selecting only lines that contain rabbit in them.
3.3.1 Create a new RDD with lines that contain "rabbit" in it. Do not
distinguish between uppercase and lowercase. Convert all the lines
to lowercase first and then filter for lines that contain "rabbit"
In [ ]: rabbitRDD = aliceRDD \
.map(lambda line: line.lower()) \
.filter(lambda line: "rabbit" in line)
3.3.2 Print out 5 rows of the resulting rabbitRDD and verify that they all
contain "rabbit"
3.3.3 Use the count() action command to count how many lines contain
"rabbit". How many lines contain "rabbit"?
3.4 Save rabbitRDD as a text file to your HDFS home directory and verify that it
was properly saved. Name the file "rabbit.txt".
In [ ]: rabbitRDD.saveAsTextFile("rabbit.txt")
3.4.2 Use the HDFS command line to verify that rabbit.txt has been saved.
Notice that rabbit.txt is a directory rather than a text file. This is
because Apache Spark is a distributed parallel system with multiple
Executors processing the data. Each Executor saves its partition to
the designated output path.
3.4.3 Use the -cat subcommand option to view the contents of part-00000.
Verify that all of the lines contain the word "rabbit" and all the text is in
lowercase.
4.1 Sometimes it might be desirable to merge the output results that are
partitioned into multiple files into a single local file. Use the HDFS
command line with the -getmerge option to create a single file
4.2 Verify that a local file with the merged content has been saved to the local
drive
[student@localhost ~]$ ls
[student@localhost ~]$ cat rabbit.txt
Lab 2: Basic Python
1. Working with Jupyter
1.2 Start from a new terminal or execute .bashrc file with the source command
1.3 Start the PySpark shell. The shell should launch Firefox automatically and
start Jupyter. In case the browser does not start automatically, open the
link provided on the command line.
1.5 In Jupyter, enter a command in a cell and press Shift+Enter to execute the
command and move to the next cell. Test the interface by checking for the
instantiation of the SparkContext and SparkSession objects.
2. Applying Python functions to our data
2.1.2 Use textFile() with a local file. The path to a local file must be
provided with a full URL such as file:/home/student/Data/kv1.txt
2.1.3 Use the take(n) action to view the contents of the file. Your output
should be similar to below
2.2 Create a 2 element List for each line of the RDD
2.2.1 Use map() to transform each row. Use the String split(<delimiter>)
to split the string using "\x01" as the delimiter.
2.2.2 Use take(n) to verify that you have a List for each row in the RDD
2.3 Create a Python function that takes a string and replaces a pattern string
with a replacement string. Apply the function to the second element in the
List.
2.3.2 Use map() and apply above function to transform the RDD.
2.3.3 Use take(n) to verify your RDD. Your results should be similar to
below
2.4 Create a Python function that will slice a string and return only part of the
string. Apply the function to remove the "value is " prefix in the string.
You may choose to create a named function and apply it or use the lambda
notation with an anonymous function.
2.4.1 The "value is " prefix occupies index 0 through 8. We want to slice the
string from index 9 to the end.
2.4.2 Use take(n) to verify your RDD. Your results should be similar to
below.
2.5 Each row in the current RDD is a number represented as a String. Convert
each row to a Tuple. Each tuple will have two elements: the first element
will be an integer representation of the string number and the second
element will be the string number itself.
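As a minimal sketch (slicedRDD is a hypothetical name for the RDD produced in step 2.4):
tupleRDD = slicedRDD.map(lambda s: (int(s), s))   # (integer value, original string)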
2.5.2 Use take(n) to verify your RDD. Your results should be similar to
below.
2.6 Create a Python function that takes a number and returns True if the
number is an even number. Use the function to filter the RDD.
2.6.2 Use an if statement to return True if the number is even and False
otherwise.
2.6.3 Use .filter(<Boolean function>) to select rows where the first element
of the Tuple is an even number.
2.6.4 Use take(n) to verify your RDD. Your results should be similar to
below.
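Putting steps 2.6.1 through 2.6.4 together, a minimal sketch might look like the following (tupleRDD is the assumed name of the RDD from step 2.5):
def isEven(n):
    # Return True for even numbers, False otherwise
    if n % 2 == 0:
        return True
    else:
        return False

evenRDD = tupleRDD.filter(lambda t: isEven(t[0]))   # keep tuples whose first element is even
evenRDD.take(5)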
2.7 In Python, every datatype has a truth value. For integers, every integer
other than 0 is considered True; only 0 is considered False. Redo the above
using a short lambda function instead.
2.7.2 Use take(n) to verify your RDD. Your results should be similar to
below.
2.8 Create a function that navigates through a collection of Tuples. For each
element of the Tuple, if the element is an integer type, add 1000 to it and
print it. If the element is a string type, print two copies of it. If the data
type is neither, print "ERROR"
2.8.1 Use the for loop to navigate to each element in the collection
2.8.2 Use a nested for loop to navigate to each element in the Tuple
2.8.3 Use type(<variable>) to determine the type. To test for an integer, use
the is operator with int. To test for a string, use the is operator with
str.
2.9 Get the first 5 rows of the evenRDD created above. Pass this collection to
the function created in step 2.8 above.
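A sketch of what steps 2.8 and 2.9 might look like, assuming the collection is a list of (int, str) tuples such as the one produced by evenRDD.take(5):
def inspect(collection):
    for tup in collection:            # outer loop over the collection
        for elem in tup:              # inner loop over each element of the tuple
            if type(elem) is int:
                print(elem + 1000)    # integer: add 1000 and print
            elif type(elem) is str:
                print(elem, elem)     # string: print two copies
            else:
                print("ERROR")        # any other type

inspect(evenRDD.take(5))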
3. Practicing Python Basics
3.2 Create a function that converts all words in a sentence to start with
uppercase. This is different from converting the entire word into uppercase
3.2.2 Use the string split(<delimiter>) method to separate the words in the
string
3.2.4 For each word, capitalize the first letter with the string upper()
method
3.2.5 Use string slicing to add the rest of the word. Hint: word[1:] - if the
second index is omitted, the end of the string is assumed and the rest of
the string is returned
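A minimal sketch of such a function (the function and variable names are illustrative):
def capitalizeWords(sentence):
    words = sentence.split()                              # 3.2.2: separate the words
    capped = [w[0].upper() + w[1:] for w in words if w]   # 3.2.4/3.2.5: uppercase first letter, keep the rest
    return " ".join(capped)

capitalizeWords("down the rabbit hole")   # 'Down The Rabbit Hole'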
3.3 Convert all words in the Alice in Wonderland to capitalized words
3.3.1 Use the map transformation with the function you created above
3.4 As it turns out, Python has a nice string method called capitalize() that does
the same thing. Redo step 3.2 above using the capitalize method. Redo
step 3.3 using your new function
3.5 This time, we will do the inverse of capitalize. Here, we capitalize all the
letters other than the first letter.
Lab 3: Working with Core API Transformations
In this lab, we will practice working with the various Core API transformations that
were introduced in the lectures. Start a new PySpark shell that runs on Jupyter to
begin the labs.
1.1 Make sure that Hue is up and running. Follow the steps from Lab 1:2.3.
1.2.1 Verify that "alice.txt" file exists on your HDFS home directory. It was
created from a previous lab. Either use the hdfs dfs -ls command or
Hue to do this.
1.2.3 Verify that aliceRDD has been created properly using take(5)
1.3.1 Make sure an AWS account with free tier access is available. One
was created in Hands-On C4U1 Lab1. If not, create one now by
following the instructions.
1.3.2 Make sure the necessary AWS credentials are available to work with
the AWS CLI. If not, follow the instructions in Hands-On C4U1 Lab 1, Step
3.5 and create a new security credential. AWS only allows two
credentials to be created at a time. If two credentials already exist
and the Secret Key associated with the Access Key ID had been
misplaced, a new one will have to be created. Delete one of the
credentials and create a new one.
1.3.3 Make sure that the AWS CLI is installed on your computer. If not, follow
the instructions from Hands-On C4U1 Lab 1, Step 3.1.
1.3.4 Make sure that AWS CLI is configured with the proper Access Key ID
and Secret Key. Follow the steps from Hands-On C4U1 Lab 1, Step
3.5 if unsure how to do this.
1.3.5 Use the AWS CLI to create a new bucket. Bucket names become part
of the web access URL. Therefore, a globally unique name must be
set for the bucket. For the rest of this lab, "<bucket-name>" will be
used as the bucket name.
1.3.7 From Jupyter, run the following command to read weblog.log from the
s3 bucket. Make sure to replace the Access Key ID, Secret Key and
bucket name with your own information.
1.3.8 Verify that weblog.log has been read from the s3 bucket by
calling .take(5) on the s3RDD
1.4.3 Verify that the files have been copied properly. Use Hue to verify and
review the file contents.
1.4.5 Verify that an RDD has been properly created using take(1). Observe
the format of the element in the RDD. The "\t" is a tab. Each
element is a pair tuple where the first element is the path to the file
and the second element is the content of the file.
2.1 Using the map transformation to partially parse the s3RDD from above.
2.1.1 Use s3RDD.take(5) to observe the data format. The Apache Web log
data shows the IP address and the UserID in the locations shown
below. In addition, there is a timestamp, followed by activity
information.
2.1.3 Observe the result with take(5). Notice that the result is a List with
several elements. Also notice that the first element is the IP address
and the third element is the UserID.
2.1.4 Use the .map() transformation to extract the IP address and the UserID.
Create a new tuple from the IP address and UserID. In Python,
create tuples with the (<element>, <element>, <element>, …)
syntax. Name the new RDD IpUserRDD.
2.1.5 Observe the result with take(5). Each element is a tuple of form (IP,
UserID)
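A sketch of steps 2.1.2 through 2.1.5, assuming the weblog fields are whitespace-delimited with the IP address first and the UserID third (as observed in step 2.1.1):
IpUserRDD = s3RDD \
    .map(lambda line: line.split()) \
    .map(lambda fields: (fields[0], fields[2]))   # (IP address, UserID)

IpUserRDD.take(5)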
2.2.1 In a new cell, import the JSON library. The library contains methods
that help parse JSON files.
import json
2.2.3 Use take(2) to observe the result. How did json.loads() parse the
JSON content?
In the previous step, when the JSON files were read using
wholeTextFiles(), each JSON file became the <JSON content> element of the
(<path>, <JSON content>) tuple that is created. Since each file
contains multiple JSON records, when json.loads() is applied to each
JSON content, many JSON records are parsed as Python dictionaries.
A Python dictionary contains Key:Value information. The format is
{key:value, key:value, key:value, …}. The multiple JSON records within
each JSON content are parsed as individual dictionaries, and all records
are returned in a List.
2.2.4 The JSON file was successfully parsed but the result is not in the
format that is needed. Each JSON record should be an individual
element. Notice above that we requested 2 rows with take(2) and
got 2 Lists, each containing multiple records. Use the flatMap()
transformation instead to create a new RDD where each element of
the List created by the transformation function is placed in its own
row. Modify the code and use flatMap() instead of map().
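A minimal sketch of the change, assuming the RDD of (<path>, <JSON content>) tuples is named jsonFilesRDD (a hypothetical name):
import json

jsonRecordsRDD = jsonFilesRDD.flatMap(lambda kv: json.loads(kv[1]))   # one row per JSON record
jsonRecordsRDD.take(2)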
2.3 Use distinct() to remove any duplicate elements in an RDD
2.3.1 Start with the aliceRDD that was created in the step above. Use the
flatMap() transformation with the String split() method to create an RDD of
words
2.3.4 Use RDD.count() on the new RDD to count the number of distinct words
in Alice in Wonderland.
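Putting steps 2.3.1 through 2.3.4 together, a minimal sketch:
wordsRDD = aliceRDD.flatMap(lambda line: line.split())   # one word per row
distinctWordsRDD = wordsRDD.distinct()                   # remove duplicates
distinctWordsRDD.count()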
3.1.1 Create a List with the following elements and name it fruit1:
["Banana", "Pear", "Kiwi", "Peach", "Grape"]
3.1.4 Create a List with the following elements and name it fruit2:
["Strawberry", "Kiwi", "Watermelon", "Banana", "Apple"]
3.5 Create cartesianRDD by using the dataset1.cartesian(dataset2) transformation
4.1.1 Create the following function: This function receives an iterator as its
input parameter. Think of an iterator as any type of collection that
contains multiple items and has a next() method that can be used by
the for loop to traverse the collection. The function prints a message
and then uses the for loop with the iterator to scan each element.
Python's yield operator is an easy way to create a new iterator.
Alternatively, an empty List could have been created and each item
appended to the empty list.
def perPartition(it):
    print("I just did a heavy operation once per partition")
    for item in it:
        yield "updated_" + item
4.1.5 Use collect() to view the results. Notice that the print statement has
been executed twice, once for each partition. The iterator was used to
update each element in each of the partitions.
The format of the transformation is mapPartitionsWithIndex(lambda index, iterator:
<some function>(index, iterator))
4.2.1 Modify perPartition above to now accept an index number for the
partition. This time change the print statement to include which
partition is performing the print by using the partition index number
passed. Name the new function, perPartitionIndex.
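A sketch of what perPartitionIndex might look like (the message wording and the recordsRDD name are illustrative):
def perPartitionIndex(index, it):
    print("Partition", index, "just did a heavy operation")
    for item in it:
        yield "updated_" + item

records2RDD = recordsRDD.mapPartitionsWithIndex(perPartitionIndex)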
def actionPerPartition1(it):
    print("I just did a foreachPartition action")
4.4 Use foreachPartition() action to print number of rows in each partition.
4.4.1 Sometimes, after transforming the data, one of the partitions can get
overloaded with more data compared to other partitions. This is
called a skewing problem. Create the following function that will print
the number of elements in each partition.
def actionPerPartition2(it):
    print("I have", len(list(it)), "elements")
records2RDD has the mapPartitionsWithIndex(perPartitionIndex) in its
dependency lineage.
def printElements(iterator):
    for item in iterator: print(item)
    print("*********************")
5.2.3 Use foreachPartition with printElements on the RDD to print elements
in each partition
6. Miscellaneous transformations
6.1 Use sample() transformation to sample an RDD and create a partial RDD.
This transformation is very useful when doing data exploration on a large
dataset. sample() can be used with replacement or not. When set to True,
sampled data can be repeated. When set to False, once a datapoint has
been sampled, it is not replaced with another one, and therefore data is not
repeated
6.1.1 Create an RDD consisting of the numbers from 1 to 99. Use the Python
range(1, 100) function to generate the numbers. Use
SparkContext.parallelize to create the RDD. Name the RDD
to100RDD.
6.1.3 Repeat the above, but this time generate a random number for the seed.
Use the following code to generate a random seed.
import random
seed = int(random.random() * 100)
6.1.4 This time, take a sample with replacement. Notice that when
withReplacement is set to True, sampled data can be repeated.
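A minimal sketch of steps 6.1.1 through 6.1.4:
import random

to100RDD = sc.parallelize(range(1, 100))
seed = int(random.random() * 100)

to100RDD.sample(False, 0.1, seed).collect()   # without replacement: no repeated values
to100RDD.sample(True, 0.1, seed).collect()    # with replacement: values may repeat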
Lab 4: Working with Pair RDDs
In this lab, we will create pair RDDs and work with various pair RDD transformations.
myrdd = sc.parallelize(mydata1)
1.1.3 Use take(5) to make sure the RDD has been created
1.2 This time use the map() transformation to accomplish the same thing.
Some developers prefer to use map() because it is much more direct as to
the transformation. Some developers prefer to use keyBy() because the
transformation name itself loudly states that a (Key, Value) tuple is expected
as its output. Try doing this on your own before turning the page.
1.3 Create a more complicated nested pair rdd. Take the data source from
above and create a pair rdd of form ("name", ("age", "gender")). "name"
and "gender" are strings while "age" is an integer.
1.3.1 Transform each row of strings to a row of List containing all the
elements. Use the String.split(",") method to parse the string into
individual items.
1.3.2 From each List, create the required complex tuple, thus creating a
pair rdd.
1.3.3 take(5) to test your transformations.
1.4.1 Expand mydata by adding a list of favorite colors for each person.
Each person has chosen two favorite colors. The two colors are
delimited with a colon (:)
mydata2 = ["Henry,red:blue",
"Jessica,pink:turquoise",
"Sharon,blue:pink",
"Jonathan,blue:green",
"Shaun,sky blue:red",
"Jasmine,yellow:orange"]
1.4.3 From each string, parse each of the items and create a List
1.4.4 From the List, select the first element (name) and second element
(favorite colors) to create a pair rdd
1.4.5 Use flatMapValues() with split(":") so that each favorite color is
placed in its separate (key, value) pair. It will duplicate the original key
for each new row created.
2.1 Calculate the sum of ages for all males and the sum of ages for all females.
2.1.1 Use mydata1 from previous step to create a pair RDD of form
(gender, age)
2.1.2 The gender is the key and the age is the value of the pair rdd created in
the step above. Use reduceByKey() with a lambda v1, v2: v1 + v2 function
to add all the ages of rows with the same key. The gender is currently the
key, so all values of rows with the same gender will be added.
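A sketch of steps 2.1.1 and 2.1.2, assuming each row of mydata1 has the form "name,age,gender" (the exact field order is an assumption):
genderAgeRDD = sc.parallelize(mydata1) \
    .map(lambda s: s.split(",")) \
    .map(lambda fields: (fields[2], int(fields[1])))   # (gender, age)

genderAgeRDD.reduceByKey(lambda v1, v2: v1 + v2).collect()   # total age per gender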
2.2 Calculate the maximum age for each gender. Change the function passed
to reduceByKey() to calculate the maximum. Python has a max() function.
2.3 Repeat above to calculate the minimum this time. Python has a min()
function
2.5.1 Create a pair rdd of (age, name) from mydata1
2.6 Using mydata2, produce a report that shows for each color, all persons
whose favorite color it is.
2.6.1 Follow steps 1.4 from above to create a pair rdd tuple with (person,
color) information for all records of a person and their favorite colors
in mydata2.
2.6.2 Swap the tuple so that each row shows (color, person)
2.6.3 Use groupByKey to group each color and create a list of persons
whose favorite color it is.
2.6.4 Use a nested for loop to print each color and all persons whose
favorite color it is. Add a tab or some spacing on the inner loop so
that a tabulated output is produced.
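Steps 2.6.1 through 2.6.4 might be sketched as follows (personColorRDD is an assumed name for the (person, color) pair rdd built in step 2.6.1):
colorPersonRDD = personColorRDD.map(lambda t: (t[1], t[0]))   # swap to (color, person)
colorLikers = colorPersonRDD.groupByKey()

for color, persons in colorLikers.collect():   # outer loop: each color
    print(color)
    for p in persons:                          # inner loop: persons who like that color
        print("\t", p)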
2.7 The groupByKey() is a very expensive operation because it requires all
partitions to exchange their entire data sets with each other. There is no
opportunity to aggregate the data before the exchange occurs.
Redo the transformations to produce the report from the previous step 2.6.
Instead of groupByKey in the last step, use aggregateByKey().
2.7.1 Initialize the starting aggregation value to an empty list. The
aggregation functions will append to this empty list all persons who
like the key's color.
zeroValue = []
2.7.4 Create an RDD from mydata2 with two (2) partitions. A second
parameter can be added to the parallelize() method to manually set
the number of partitions.
sc.parallelize(mydata2, 2)
colorLikers2.collect()
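A sketch of how aggregateByKey() might replace groupByKey() here: the sequence function appends a person to the partition-local list, and the combination function concatenates the lists from different partitions (colorPersonRDD is the assumed (color, person) pair rdd from step 2.6):
zeroValue = []

colorLikers2 = colorPersonRDD.aggregateByKey(
    zeroValue,
    lambda acc, person: acc + [person],   # sequence function: runs within a partition
    lambda acc1, acc2: acc1 + acc2)       # combination function: merges partition results

colorLikers2.collect()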
2.8 Joining Pair RDDs
2.8.1 Create a pair rdd from mydata1. The pair rdd should be in the
following form:
(name, [age, gender]). Name the rdd, data1RDD.
2.8.2 Create a pair rdd from mydata2. The pair rdd should be in the
following form:
(name, [color1, color2]). Name the rdd, data2RDD.
data1RDD.join(data2RDD)
# indexes [0] [1][0] [1][0][1] [1][1] [1][1][1]
myGender = myTuple[1][0][1]
myColor2 = myTuple[1][1][1]
Lab 5: Putting it all together
In our MySQL database, there are records of authors who have posted messages.
There is an XML file which shows the latitude and longitude of the location where
each message was posted. There is a JSON file for each author which shows the type
of phones owned by the author.
Put together all this information and produce a report that shows the Author (first
name and last name), posted message (first few words of the title of the post), using
phone (a list of phones owned by the author), at location (latitude, longitude).
$ mysql -u student -p
Enter password: # type student when prompted for the password
1.1.2 From mysql, use the following commands to examine the schema for
the authors table
show databases;
use labs;
show tables;
desc authors;
1.2 Use Sqoop to import the authors table to /user/student/authors HDFS
directory
1.2.3 Set the authorizations: --username and --password are both "student"
1.2.5 Set the --target-dir, where the imported data will be saved, to
/user/student/authors
sqoop import \
--connect jdbc:mysql://localhost/labs \
--username student --password student \
--table authors --target-dir /user/student/authors \
--as-textfile
1.2.7 Use HDFS command line or Hue to verify that the data has been
imported
1.3 Examine the posts table from MySQL. Use DESC command to get a
printout of the posts table schema
1.4 Use Sqoop to import the posts table to /user/student/posts HDFS directory
1.4.2 Set the --connect string to jdbc:mysql://localhost/<database name>
1.4.3 Set the authorizations: --username and --password are both "student"
1.4.5 Set the --target-dir, where the imported data will be saved, to
/user/student/posts
sqoop import \
--connect jdbc:mysql://localhost/labs \
--username student --password student \
--table posts --target-dir /user/student/posts \
--hive-drop-import-delims \
--as-textfile
1.4.7 Use HDFS command line or Hue to verify that the data has been
imported
1.5.2 Examine author_phone.json file and review its content and data
schema
1.6.2 Examine any of the XML files and review its content and data schema
2.1 Create authorNameRDD, matching the First Name and Last Name with the
author <id>
2.1.1 Using SparkContext.textFile(), read the data from the imported
authors table
2.2 Create postsRDD, showing the post <id> with the author_id of the post and
first few letters of the title of each post
2.2.1 Using SparkContext.textFile(), read the data from the imported posts
table
2.2.2 Transform the RDD to form (<id>, (author_id, <first 10 letters of the
title>))
The pair rdd shows for each post <id>, the author_id of the post and
the first 10 letters of the title of the post. The resulting rdd should
look similar to below:
2.3 Create phoneRDD from the authors_phone.json file. Some of the authors have
multiple phones and the phone model for some of the authors is not known.
Unfortunately, rather than showing "Unknown", the data shows an empty
string.
2.3.1 Import the json library in order to parse the JSON records
2.3.3 The JSON library has a json.loads(<JSON string>) method that creates
a collection of JSON records. Use flatMap() transformation with the
json.loads() function to create a new row for each JSON record.
2.3.4 Use the Dictionary.get(<key name>, None) method. This method returns
either the Value for the matching Key or None if not found. Using the
get() method, transform the JSON record into a pair tuple of form
(author_id, phone_model)
def setPhoneName(s):
    if s == "": return "Unknown"
    elif "," in s: return s.replace(",", " or")
    else: return s
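Combining the get() lookup with setPhoneName(), a sketch of the phoneRDD transformation (jsonRecordsRDD is a hypothetical name for the flatMap output of step 2.3.3; the field names follow the step above):
phoneRDD = jsonRecordsRDD.map(
    lambda rec: (rec.get("author_id", None),
                 setPhoneName(rec.get("phone_model", ""))))   # (author_id, phone name)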
2.4 Create latlongRDD from the XML files in post_records. The output should
be of form (post_id, (latitude, longitude))
2.4.1 The XML files contain post_id and location fields. The location field is
a comma delimited field showing the latitude and longitude. Use
wholeTextFiles() to read the XML files. wholeTextFiles() returns a
tuple of form (<path to file>,
<XML content>). The XML content contains many XML records.
2.4.2 Define a getPosts(<XML string>) helper function. This function will parse
the XML content and return a collection of XML records.
import xml.etree.ElementTree as ET
def getPosts(s):
    posts = ET.fromstring(s)
    return posts.iter("record")

def getPostID(elem):
    return elem.find("post_id").text

def getPostLocation(elem):
    return elem.find("location").text
2.4.6 Using the two helper functions just created, parse the post_id and
location information from each XML record. Transform the RDD to
form a (<post_id>, <location>) pair rdd.
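A sketch of step 2.4.6, assuming xmlFilesRDD is the (<path>, <XML content>) RDD produced by wholeTextFiles():
latlongRDD = xmlFilesRDD \
    .flatMap(lambda kv: getPosts(kv[1])) \
    .map(lambda rec: (getPostID(rec), getPostLocation(rec)))   # (post_id, "latitude,longitude")

latlongRDD.take(5)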
3.1 Join authorNameRDD with phoneRDD to match the author's name with their
phone(s)
3.1.2 take(5) to verify the output. The resulting output should be similar to
below.
3.1.3 Use count() to make sure that there are 10000 records
3.2 Join postsRDD with latlongRDD to match each post information with the
location of the post.
3.3 Join authorNamePhoneRDD with authorPostLocationRDD.
3.4.2 Try again, using the aggregateByKey() transformation. What should
the sequence function look like? How about the combination
function? The final output will look similar to below.
Lab 6: Working with the DataFrame API
In this lab, we will use the PySpark shell to explore working with PySpark
DataFrame API. Create DataFrames from various data sources and perform basic
transformations.
1.1.3 Use Hue or the HDFS command line command to verify that
people.json has been copied to HDFS
1.1.6 Print out the contents of the dataframe using .show() action.
1.2 Filter the dataframe for persons who are older than 20
1.2.3 Print out the contents of the dataframe using .show() action.
1.3 Create a new dataframe that only contains the name columns.
workerDF = spark.read.csv("people.csv")
workerDF.printSchema()
workerDF.show()
The output is not quite what was expected. The dataframe consists of a
single column named _c0 of type string. Look at the underlying data and
fix the problem.
There are several problems. The first is that Spark has gobbled all the data
into a single column. Look carefully at the data. Notice that the delimiter
is a semi-colon (;). Spark's default delimiter for CSV files is a comma (,).
An option must be given to specify the non-default delimiter. This can be
set using the "sep" option.
The next problem is the column names. Currently there is a _c0 for the
single column name. Spark defaults to _c0, _c1, etc when it does not know
the column names. However, look carefully. The first row contains all the
column names. Set the "header" option to "true" to let Spark know that the
first row is a header.
1.4.4 Fix the two problems and print the schema and show the contents to
verify the problem has been fixed.
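A sketch of the corrected read, using the "sep" and "header" options described above:
workerDF = spark.read \
    .option("sep", ";") \
    .option("header", "true") \
    .csv("people.csv")

workerDF.printSchema()
workerDF.show()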
1.5 Create dataframe from Parquet file
parquet-tools show users.parquet | more
1.5.5 Use Hue or hdfs command line to verify the file has been copied
1.6.2 Copy spark-avro_2.12-3.1.2.jar file to $SPARK_HOME/jars
cp spark-avro_2.12-3.1.2.jar $SPARK_HOME/jars/.
1.6.5 Use Hue or hdfs command line to verify the file has been copied
spark.read.format("avro").load(<path_to_file>)
2. Saving DataFrames
avroDF.write.json("/user/student/users_json")
2.1.2 Use HDFS command line or Hue to verify that the data has been
properly saved and in the correct format
2.2 Save the dataframe created from people.json as a CSV file
2.2.3 Include a header row with the column names using .option("header",
"true")
2.2.5 Use HDFS command line or Hue to verify the data has been saved
and in the correct format.
2.3 Try saving avroDF as a CSV file. Choose a delimiter of your choice and
save to /user/student/users_csv.
2.3.1 What happened? What is the error message received? Not all data
formats can be saved in every format. In this case, the Avro file has
a column that is an Array (similar to a List in Python). Unfortunately,
a CSV format cannot support a complex column such as an Array.
2.4.2 Set the option to not include a header row. In other words, set the
header option to false.
3. DataFrame write modes
newWorkersDF = spark.createDataFrame(new_workers)
newWorkersDF.write \
.option("header", "false") \
.csv("/user/student/workers")
An ERROR is raised. What is the error message? Why did the error
message occur?
3.2.2 Use the .mode(<write_mode>) method to change the write mode to
append instead. Use the following command this time:
newWorkersDF.write \
    .mode("append") \
    .csv("/user/student/workers")
3.2.7 Verify using Hue that the additional worker data has been appended
to the designated directory.
The new workers have been added but the column names have been lost.
This is because we explicitly set the header option to false when the data
was saved. In another lab, we will learn how to explicitly create schema
information and apply it to dataframes.
Lab 7: Working with Hive from Spark
In this lab, we will use Spark to access Hive tables, modify them and save them as
both managed and external Hive tables.
1.1 Check to make sure authors table exists in Hive. If not, import from MySQL.
1.1.1 In a previous lab, the authors table in MySQL was imported to Hive in
mydb database. Make sure mydb database exists and authors table
exists: Start beeline with the following command and use show
databases from within it.
$ beeline -u jdbc:hive2://
If the database does not exist, use the following command to create the mydb
database
If mydb does not exist, the authors table most likely does not exist either. From
another terminal, use Sqoop to import the authors table to Hive. Use the following
sqoop command.
$ sqoop import \
--connect jdbc:mysql://localhost/labs \
--username student --password student \
--fields-terminated-by '\t' --table authors \
--hive-import --hive-database 'mydb' \
--hive-table 'authors' --split-by id
1.1.2 Check to make sure authors table has been properly imported, either
from a previous lab or from the previous step above. Run the
following command from beeline to check.
1.2.3 Show the first 5 rows of authorsDF with .show(5)
2.1 Starting from authorsDF, create a new dataframe with only the "id", "email",
and "birthdate" columns
2.1.2 Print the schema and show the first 5 rows to confirm the
transformation. By default, show() prints 20 rows and truncates
columns. To control the number of rows printed and to avoid truncation,
use the following parameters: show(<num_rows>, truncate=<True/False>)
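A sketch of step 2.1, assuming the dataframe read from the Hive table is named authorsDF:
bdayDF = authorsDF.select("id", "email", "birthdate")
bdayDF.printSchema()
bdayDF.show(5, truncate=False)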
2.2 Save the Hive table as a Managed table
2.3.2 Navigate to mydb database and print all tables in mydb database
2.3.4 Use the describe formatted Hive command to review the metadata
for author_bday table
As you scroll through the output, find the section that shows the location of
the data stored and the table type. Notice that author_bday is a managed
hive table, and the data is stored in the Hive warehouse directory as
expected.
2.3.5 Use Hue to navigate to the location where the data has been stored.
Verify the location and content.
Notice that the data has been saved as a parquet file. Spark's default
format for saving a dataframe is Parquet format
2.3.6 Use hdfs dfs -get subcommand to copy part-00000-xxxxx file to local
disk. Use parquet-tools to view the schema and content of the
parquet file. To make getting the file easier, use the * wildcard to
match the ending after the initial part-00000.
3.1 Create a new dataframe with only the "id", "first_name", and "last_name"
columns
3.3 Save the Hive table as an External table in CSV format, using tab separator,
and that includes a header row. Spark saves Hive tables as an External
table if an explicit path is provided. Otherwise, the table is saved as a
managed table in the Hive warehouse directory.
3.3.1 Use
<DataFrame>.write.format(<format>).option(<options>).saveAsTable(<db>.
<table>)
nameDF.write \
.format("csv") \
.option("path", "/user/student/author_names") \
.option("sep", "\t") \
.option("header", "true") \
.saveAsTable("mydb.author_names")
3.4 Use beeline to verify that the new external table has been created. Make
sure to use the desc formatted command to verify that the table is an
external table and that the location of the data is as expected.
Notice that Hive keeps a "placeholder" under the mydb.db directory
since the table belongs to this database.
However, as you scroll further down, there is another "path" information
where the actual data is stored.
3.5 Use Hue to verify that the data has been written in the correct format and
in the designated CSV format.
Lab 8: Spark SQL Transformations
In this lab, we will explore various Spark SQL transformations. We will go beyond
basic transformations and begin working with Column objects to create Column
expressions that allow more powerful transformations.
1.1 Print the first name, email address and birthdate of the 3 oldest authors
1.1.2 Transform the dataframe by selecting only the first_name, email, and
birthdate. Chain this transformation to the end and execute both
cells.
1.1.3 Use orderBy(<column name>) on the "birthdate" column. Chain the
transformation and execute both cells to view the results
Notice that although show(5) was executed, only 3 rows are displayed.
This is because limit(3) has reduced the dataframe to 3 rows. There aren't
any more rows to print, so even though show(5) was called, Spark returned
the maximum number of rows available.
1.2 Print the first name, email address and birthdate of the 3 oldest authors.
However, this time modify the transformation from above such that the
query selects the 3 oldest authors born after the new millennium. That is,
we want to only choose from authors born on or after January 1, 2001.
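One way to sketch this filter with a column expression on the birthdate column (authorsDF and the exact column names are assumptions based on the steps above):
from pyspark.sql.functions import col

millennialDF = authorsDF \
    .select("first_name", "email", "birthdate") \
    .where(col("birthdate") >= "2001-01-01") \
    .orderBy("birthdate")

millennialDF.show(5)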
1.2.2 Now that the code has been tested and verified, add back the limit(3)
transformation to display the 3 oldest authors born after the new
millennium.
2.1.3 Copy sales.csv to the HDFS home directory
2.1.4 Create a dataframe from sales.csv and name it salesDF. Does the
source file have a header row? What is the delimiter? Is it the
default "," separator?
2.1.5 Print the schema and show(5) to verify the dataframe has been
created
2.2 Using column objects, select just the Country, ItemType, UnitsSold,
UnitPrice, and UnitCost
2.2.2 Print the schema and use show() to execute the transformation and
verify the output
2.3 Use column objects cast function to change datatype and perform
calculation
Notice the output shows that the variable is of type Column with the actual
operations as the value.
2.3.3 The output of step 2.3.1 shows a column name that is literally the
operation performed to obtain the column result. Use the alias
column object method to give it a more appropriate name. Redo
calcRevDF, this time using the calcRevenue column object that was saved,
along with the alias method, to name the column "Sales_Revenue"
2.4 Notice that there are multiple entries for a country. For example, in the above
output there are at least two row items for Libya. This is because each sale
is further classified by ItemType. Redo the above calculation, but this time
include the ItemType
2.5 Modify revItemCountryDF from above. Use a comparative operator to
check if quota has been met
2.5.1 Create a new column that will either be true or false depending on if
a quota has been met. The DataFrame API offers the
withColumn(<column_name>, <operation to calculate value of
column>) method. Set a variable to a quota of 3 million.
quota = 3000000
withColumn("Quota_Met", quotaMet)
2.6 This time, change the definition of what it means for the quota to have
been met. The quota will be met if either the calculated amount is greater
than the quota or UnitsSold exceeds a set number
2.6.4 Redo the quotaMet column object. It should now create a new
column object after checking for both amtQuota and cntQuota. If
either of the quotas has been met, return True
2.6.6 Using withColumn, add the "Quota_Met" column. The column value will
be based on the modified quotaMet column object
Notice that Spark was smart enough to know that salesRevenue should
be cast to some number. In fact, it has automatically cast it to the double
datatype. Separating the conditions and saving them to variables makes
the withColumn() transformation much more legible. In PySpark, the "|"
character is used to "OR" conditions.
3.1.2 Use count() to count number of rows for each ItemType. The output
should be similar to below:
Can you guess why the code selects just the "ItemType" column before
grouping by that column? groupBy is a very expensive operation that
requires all partitions to exchange information in order to generate a global
grouping. Therefore, it is important to reduce the amount of data that
needs to be shuffled around before calling the groupBy transformation.
The only required column is "ItemType" itself so it is the only column
selected.
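A sketch of step 3.1, selecting only the grouping column before the shuffle as discussed above:
itemCountDF = salesDF.select("ItemType") \
    .groupBy("ItemType") \
    .count()

itemCountDF.show()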
3.2 What is the average number of items sold for each ItemType?
3.2.4 Print the schema and show() to verify the transformations and the
output
3.3 What is the average sales revenue for each ItemType?
3.3.5 Print the schema and show() to verify the transformations and the
output
4. Creating User-Defined functions
4.1 Our current quotaMet condition does not really provide very good
information. It does not consider the underlying ItemType in testing if
quota has been met. Modify the condition to reflect the ItemType.
"Personal Care" : 5468,
"Fruits" : 5073,
"Snacks" : 4818,
"Clothes" : 4845,}
4.1.2 Create a user-defined function (UDF) that takes two parameters. The
first parameter is the ItemType. The second parameter is a count of
the ItemType. The UDF will reference cntQuotaDict to check whether
the count is greater than the count in the dictionary
4.1.3 In order to use a UDF, the function must first be registered. Use the
udf() function. The return type of the function must also be registered
4.2.3 A UDF may be called within a select statement. Continue within the
select transformation from the step above to include the result of the
UDF. The UDF expects its parameters to be passed as Column
objects. Use the col() method to explicitly create column objects and
pass them to the UDF. The UDF requires the ItemType and UnitsSold
columns
4.2.4 Use the alias function to rename the result of the UDF to "Met_Quota"
4.2.5 Print schema and show() to verify the transformation and output
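Putting steps 4.1.2 through 4.2.5 together, a minimal sketch (the cntQuotaDict shown earlier is only partially visible, so the dictionary below is illustrative):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

cntQuotaDict = {"Personal Care": 5468, "Fruits": 5073, "Snacks": 4818, "Clothes": 4845}

def cntQuotaMet(item_type, units_sold):
    # True when the count exceeds the quota registered for this ItemType
    return units_sold > cntQuotaDict.get(item_type, 0)

cntQuotaMetUDF = udf(cntQuotaMet, BooleanType())   # register the UDF with its return type

salesDF.select("ItemType", "UnitsSold",
               cntQuotaMetUDF(col("ItemType"),
                              col("UnitsSold").cast("integer")).alias("Met_Quota")) \
       .show(5)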
Lab 9: Working with Spark SQL
Spark SQL allows Spark developers to use ISO standard SQL to create queries from
DataFrames.
1.2 Try another, more complex SQL query where part of the query includes
a string to filter on.
This SQL query returns the first_name, last_name and email address of
authors whose email address ends with "org". When a query includes
strings such as '%org', use triple quotes (""") to surround the SQL query.
This makes it easy to include strings within the query without having to
escape them.
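A sketch of such a query, assuming a temporary view named authors has been registered for the authors dataframe:
orgDF = spark.sql("""
    SELECT first_name, last_name, email
    FROM authors
    WHERE email LIKE '%org'
""")
orgDF.show(5)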
1.3 Create a query and show all authors who are born in the new millennium.
1.3.2 Filter the results for birthdates that fall in the new millennium
As can be seen, using SQL directly can be much more convenient for those
who are comfortable with the Structured Query Language. However, many
developers find combining Column expressions and SQL statements in their
code to be the most productive approach.
2.1.1 Make sure that sales.csv is in the HDFS home directory. If it is not,
navigate to /home/student/Data and copy sales.csv to HDFS.
DataFrame.createTempView() will raise an error if the temporary view already exists.
DataFrame.createOrReplaceTempView() is a better alternative when
developing applications, since code is executed repeatedly and
createTempView() will generate an error after the temporary view has been
created the first time.
salesDF.createOrReplaceTempView("sales")
2.3 Using SparkSession.sql(), query sales table and print the first 10 rows of all
columns
2.4.1 Create a SQL query that groups the sales by each Country, ItemType
and SalesChannel. Print the sum of TotalRevenue as Revenue_Sum
and sum of TotalProfit as Profit_Sum for each group. The output
should be ordered by the highest profit first and in descending order.
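A sketch of what the 2.4.1 query might look like against the sales temporary view (column names follow the step description):
spark.sql("""
    SELECT Country, ItemType, SalesChannel,
           SUM(TotalRevenue) AS Revenue_Sum,
           SUM(TotalProfit)  AS Profit_Sum
    FROM sales
    GROUP BY Country, ItemType, SalesChannel
    ORDER BY Profit_Sum DESC
""").show(10)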
When using this option, there isn't any opportunity to provide any
options nor transform the data before querying it. This syntax should
only be used when such operations are unnecessary.
3.1 Python provides sparksql-magic which allows using cell magic within
Jupyter
!pip install sparksql-magic
%load_ext sparksql_magic
3.1.3 Now use %%sparksql to indicate that we will use sparksql magic on
this cell. Follow this with a simple query.
4. Joining DataFrames
joinDF = authorDF \
    .join(authorPhoneDF,
          authorDF.id == authorPhoneDF.author_id)
4.1.4 Print the schema and use show() to verify the transformation and
output
4.2.1 Create a temporary view for authorsDF. Name the view, author
4.2.3 Use the sql() transformation to join the two tables/views. Select only
the first_name, last_name, and phone_model as output columns
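A sketch of step 4.2.3, assuming the temporary views are named author and author_phone and that the join key is the author id (as in the DataFrame join above):
joinSqlDF = spark.sql("""
    SELECT a.first_name, a.last_name, p.phone_model
    FROM author a
    JOIN author_phone p ON a.id = p.author_id
""")
joinSqlDF.show(5)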
4.3 Use sparksql-magic to run a query that prints the 10 youngest authors who do
not have any phones. We will have a giveaway event for the lucky
candidates.
4.3.1 Join author and author_phone. These are the names of the temporary
views created earlier
4.3.4 Order by birthdate in descending order. Since the birthdate is
ordered in descending order, the youngest authors will appear first.
5.1.4 Create a new managed table and name it spark_test. This table will
contain two columns: name as string and age as integer. Leave all
the rest to the default setting.
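A sketch of the DDL for step 5.1.4, run either through spark.sql() or a sparksql-magic cell (the database is left at its default here, which is an assumption):
spark.sql("CREATE TABLE spark_test (name STRING, age INT)")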
5.1.5 Review details of spark_test using DESC FORMATTED command
5.1.7 Print all the columns from spark_test to verify that the rows have
been inserted
5.2 Use either spark.sql() or a sparksql-magic cell to execute these DML
commands
5.2.2 Test by adding a few more rows to spark_test, but now including the
gender
6.1 Use SparkSession.catalog to access the Catalog API. Return the list of
databases, list of tables, and list of columns. Change the current database.
6.1.1 Use spark.catalog.listDatabases() to return a list of databases. Use a
for loop to print each of the databases
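A sketch of step 6.1.1:
for db in spark.catalog.listDatabases():
    print(db.name)   # each entry is a Database object; name is one of its fields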
Lab 10: Transforming RDDs to DataFrames
In this lab, we will begin with a semi-structured data source, transform it to
give it structure, and convert it to a DataFrame for query operations.
1.1 Import json and create a function that will parse the phone name and set it
to a friendly text, parsing multiple phones as necessary.
Import json and use the following function to set the phone name
import json
def setPhoneName(s):
    if s == "": return "Unknown"
    elif "," in s: return s.replace(",", " or")
    else: return s
1.1.2 Use flatMap to create separate rows from parsing the JSON records.
Use json.loads() on the content of the file to parse the JSON records.
Recall that wholeTextFiles creates a tuple of form
(<path>, <content>). Pass <content> to json.loads()
1.2.1 There are several ways to create a schema definition. A quick and
dirty method is to create a string with the following format:
" <col name> <col type>, <col name> <col type>,<col name> <col
type>,…"
1.2.2 Create a string variable and name it schema. Enter the following
value for the string:
1.3.2 Print the schema and show(5, truncate=False) to verify that the
dataframe has been created as expected
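A sketch of the schema string and the dataframe creation, assuming phoneRDD holds (author_id, phone name) tuples as built in step 1.1 (the column names in the schema string are an assumption):
schema = "author_id INT, phone_model STRING"
phoneDF = spark.createDataFrame(phoneRDD, schema)

phoneDF.printSchema()
phoneDF.show(5, truncate=False)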
2.1 Import xml.etree.ElementTree and create 3 functions. getPosts() will parse
XML strings and produce a collection of XML records. getPostID() will take
an XML record and return the post_id as text. getPostLocation() will read
an XML record and return the location as text. The location consists of
the latitude and longitude delimited by a comma (","). Use the
following functions:
import xml.etree.ElementTree as ET
def getPosts(s):
    posts = ET.fromstring(s)
    return posts.iter("record")

def getPostID(elem):
    return elem.find("post_id").text

def getPostLocation(elem):
    return elem.find("location").text
2.3 Use flatMap to create a row for each XML record. Pass the <content>
portion of the (<path>, <content>) tuple resulting from wholeTextFiles to
getPosts(<content>). Recall that getPosts returns a collection of XML
records. flatMap() then breaks the collection up into individual items and
creates a new row for each XML record.
2.5 Use the map() transformation to break the nested List from above to a
simple List consisting of [<post_id>, <latitude>, <longitude>]. Use the
float() function to cast the string latitude and longitude to a float datatype.
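A sketch of step 2.5, assuming each row of the previous RDD still carries the post_id and the raw "latitude,longitude" string (the RDD name and the exact shape of the previous step's output are assumptions):
latlonRowsRDD = postLocRDD.map(
    lambda t: [t[0],
               float(t[1].split(",")[0]),    # latitude as float
               float(t[1].split(",")[1])])   # longitude as float

latlonRowsRDD.take(5)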
2.1 Create a dataframe with the schema
2.1.2 Print the schema and show(5, truncate=False) to verify that the
dataframe has been created as expected
2.2 Save latlonDF dataframe as a Hive managed table under mydb database.
Name the table post_latlon
2.3 Use either the Catalog API or SparkSession.sql() to verify that the table has
been created
Lab 11: Working with the DStream API
In this lab, we will capture unstructured streaming data using the DStream API.
1.2 Execute the stream_alice.sh shell script to begin simulating a streaming data source
on the Linux socket at port 44444
$ /home/student/Scripts/stream_alice.sh
ssc = StreamingContext(sc, 5)
2.2.3 Check to make sure that ssc has been created properly
ssc
3. Develop the transformation logic for each DStream
3.1.2 Create a variable named host and set the value to 'localhost'
host = 'localhost'
port = 44444
3.2 Create a DStream with socketTextStream. Set the host and port to the
variables that we created in the previous step
aliceDS = ssc.socketTextStream(host,port)
3.3.1 Split the string into individual words. Hint: Use flatMap instead of
map
3.3.2 Create a key:value tuple. Each word will generate a (word, 1) tuple
3.3.3 Use reduceByKey() to aggregate all the values with the same word
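Steps 3.3.1 through 3.3.3 might be sketched as:
wcDS = aliceDS \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda v1, v2: v1 + v2)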
wcDS.pprint()
3.5 Create a function to print the word count for each DStream
ssc.start()
ssc.awaitTermination()
4.1 As soon as you start the StreamingContext, the simulated streaming data
source will engage and begin streaming to the designated socket
Lab 12: Working with Multi-Batch DStream API
In this lab, we will work with multiple DStream batches.
1.1.2 Open any of the weblogs and review the source data.
1.1.3 The first element of the weblog is the IP address through which the
user connected to the webserver
1.1.4 There is a (-) followed by another number. This third element is the
user_id of the person browsing the web. In our case, this user_id is
the author_id in our authors table in Hive.
1.1.5 The rest of the weblog entry is the timestamp, the GET/PUT operation, and
other miscellaneous items that we are not interested in for now.
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: multi_dstream.py <hostname> <port>", file=sys.stderr)
        sys.exit(-1)
ssc.start()
ssc.awaitTermination()
1.2.4 The housekeeping logic has been done for you. The SparkContext
has been created. The StreamingContext with a DStream duration of
1 second has been created.
1.2.5 A streaming data source has been set to the socketTextStream at the
provided hostname and port number.
ssc.checkpoint("checkpoint_dir")
1.3.3 The streaming weblogs are saved to the variable logs. Count how
many connections are coming in over a window of 5 seconds.
Calculate this count every 2 seconds. Use the
DStream.countByWindow(windowDuration, slideDuration)
transformation for this
cnt = logs.countByWindow(5,2)
cnt.pprint()
$ cat stream_web.sh
1.4.3 The bash script simply calls port_stream.py with the hostname, port
number and the data source directory to stream the weblogs to the
Linux socket.
$ ./stream_web.sh
The script will respond by letting you know that it is waiting for a
connection at port 44444
1.5.1 Before starting the program, we have to switch the Spark Shell to
iPython mode.
1.5.3 Select .bashrc file and edit the file to enable ipython mode
1.5.5 Use spark-submit with --master 'local[2]' to make sure that there are
at least two threads
2. Create a streaming application that keeps state
2.1 Save the old program and prepare for new code
cnt = logs.countByWindow(5,2)
cnt.pprint()
2.2.1 Use the wordcount logic that we have seen many times to count the
number of occurrences of author_id
.map(lambda coll: (coll[2], 1))
2.3.1 Create a function that will update the count. The function is expected
to receive a collection of new occurrences and the current state. It
needs to check if the state is None (meaning it is the first time we are
seeing this author_id); if so, it should return a count of the new
occurrences. If there is a state, it should return the state plus the
count of new occurrences. The function would be similar to the
following:
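A sketch of what updateAuthorCount might look like (the original listing is not reproduced in the text, so this is an assumption based on the description above):
def updateAuthorCount(new_occurences, curr_state):
    if curr_state is None:
        # First time this author_id is seen: just count the new occurrences
        return sum(new_occurences)
    # Otherwise add the new occurrences to the running total
    return curr_state + sum(new_occurences)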
totalAuthorCnt = authorCnt \
    .updateStateByKey(lambda new_occurences, curr_state:
                      updateAuthorCount(new_occurences, curr_state))
2.3.4 Use spark-submit with --master 'local[2]' to make sure that there are
at least two threads
2.4 Currently the application prints out authors who have accessed the website
and the frequency of their contacts. However, there is no ordering or
sorting in place. Modify the application so that authors with the most
frequent accesses to the website are displayed first
mostFreqAuthor = totalAuthorCnt \
.map(lambda tup:(tup[1], tup[0])) \
2.4.3 Use map() to flip the pair rdd back to (<author_id>,<frequency>)
2.4.4 Use spark-submit with --master 'local[2]' to make sure that there
are at least two threads
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: multi_dstream.py <hostname> <port>", file=sys.stderr)
        sys.exit(-1)

    totalAuthorCnt = authorCnt \
        .updateStateByKey(lambda new_occurences, curr_state:
                          updateAuthorCount(new_occurences, curr_state))

    mostFreqAuthor = totalAuthorCnt \
        .map(lambda tup: (tup[1], tup[0])) \
        .transform(lambda rdd: rdd.sortByKey(False)) \
        .map(lambda tup: (tup[1], tup[0]))

    mostFreqAuthor.pprint()

    ssc.start()
    ssc.awaitTermination()
Lab 13: Working with Structured Streaming API
In this lab, we will work with streaming data that is structured. The streaming
source will be created as dataframes where isStreaming is true.
1.1.1 In an earlier lab, the posts table in MySQL was imported to HDFS
under /user/student/posts directory. Verify that the data is there. If
not, use Sqoop to import the data.
1.1.2 Use hdfs dfs -get to copy the data from HDFS to the local disk. Use
the following command:
$ cd /home/student/Data
$ hdfs dfs -get posts
1.1.3 Verify that a new directory named posts had been created in your
local disk
1.1.4 Navigate to posts. There are several data partitions. There is also a
_SUCCESS file. Remove this file.
$ cd ./posts
$ rm _SUCCESS
1.1.5 Review one of the data partitions. The data is in CSV format but
there is no header row. We will need to know the schema
information in order to use this data.
$ mysql -u student -p
#type student when prompted for password
1.2.2 From MariaDB, navigate to labs database and use DESC to get the
schema information for the posts table
MariaDB [(none)]> USE labs;
MariaDB [(labs)]> DESC posts;
2.1 Reset PySpark to start from Jupyter by modifying .bashrc in the home
directory
postsDF = spark.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", "44444") \
.load()
2.4 Sockets read data with a fixed schema. All data is contained in a single
column named "value." It is necessary to use column expressions to create
the columns with the data types as observed from the description of the
Posts table in MySQL
from pyspark.sql.functions import split

aPostDF = postsDF \
    .withColumn("id", split(postsDF.value, ",")[0].cast("integer")) \
    .withColumn("author_id", split(postsDF.value, ",")[1].cast("integer")) \
    .withColumn("title", split(postsDF.value, ",")[2]) \
    .withColumn("description", split(postsDF.value, ",")[3]) \
    .withColumn("content", split(postsDF.value, ",")[4]) \
    .withColumn("date", split(postsDF.value, ",")[5])
3.1 Once the base streaming dataframe has been created, use dataframe
transformations to format the data as desired
myOutDF = aPostDF \
    .select("author_id",
            aPostDF.title[0:10].alias("Post_Title"),
            "date")
4. Start the Structured Streaming Engine
4.1 Use writeStream with the console format to display the output on the console.
Set the output mode to append and do not truncate the output. Set a
trigger for a micro-batch every 2 seconds.
myStream = myOutDF.writeStream \
    .format("console") \
    .option("truncate", "false") \
    .outputMode("append") \
    .trigger(processingTime="2 seconds") \
    .start()

myStream.awaitTermination()
Lab 14: Create an Apache Spark Application
So far, the spark shell has been used to develop and test applications. In this lab,
we will create a PySpark application and execute it using spark-submit. We will
also practice setting various configurations.
1.4.1 The SparkConf library has already been imported. Use the imported
library to instantiate a new instance of SparkConf. Save it to variable
sconf. Set the application name to "Bunch O' Rabbits" using the
setAppName() method
1.5.1 The SparkContext library has already been imported. Use the
imported library to instantiate a new instance of SparkContext. Save
it to variable sc
sc = SparkContext(conf = sconf)
sc.setLogLevel("ERROR")
1.6 Create the transformations to read "alice.txt" from the HDFS home
directory and count the number of times the word "Rabbit" appears in the
text. Make sure the logic is case-insensitive, i.e. capture both "Rabbit" and
"rabbit".
1.6.1 Create a new RDD by reading the filename passed to the system.
The filename is in sys.argv[1]
src = sys.argv[1]
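A sketch of the counting logic for the remaining substeps (counting word occurrences case-insensitively; the exact transformation chain is an assumption):
rabbitCount = sc.textFile(src) \
    .flatMap(lambda line: line.lower().split()) \
    .filter(lambda word: "rabbit" in word) \
    .count()

print("Number of rabbits:", rabbitCount)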
2.1.2 Every time a bash terminal is opened, .bashrc is read and executed.
Either close the current terminal and open a new one to
execute .bashrc, or use source ~/.bashrc to manually execute it.
source ~/.bashrc
2.2 Use spark-submit to execute CountRabbits.py. Pass alice.txt as the text file
to read.
2.3 View the Spark Web UI
2.3.3 This time, submit the program with the YARN master and cluster deploy
mode.
2.3.4 Open the Yarn Web UI and look for either a running job or finished
job, depending on your timing
2.3.5 Click on the application id as shown above
2.3.6 If the application has already completed, the tracking URL will show
the history server; otherwise, it will show the ApplicationMaster.
END OF LAB