Hands-On Exercise: Using the Spark Shell
cd ~
cp -rf /opt/training/sparkdev .
PySpark
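The launch command itself is not shown here; assuming the shell is on the PATH as in the rest of these exercises, start it with:
$ pyspark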
Some Commands to try:
o sc – the SparkContext
<pyspark.context.SparkContext at 0x2724490>
o sc.[TAB] – to see the possible options/functions/methods on the object
o exit – press Ctrl-D or type exit()
Spark-shell
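As above, the launch command is not shown; presumably:
$ spark-shell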
Some Commands to try:
o sc – the SparkContext
res0: org.apache.spark.SparkContext =
org.apache.spark.SparkContext@2f0301fa
o sc.[TAB] – to see the possible options/functions/methods on the object
o exit – press Ctrl-D or type exit
/opt/training/sparkdev/data/frostroad.txt
/opt/training/sparkdev/data/weblogs/2013-09-15.log
Solutions:
/opt/training/sparkdev/solutions/
UsingTheSparkShell.pyspark
UsingTheSparkShell.scalaspark
LogIPs.pyspark
LogIPs.scalaspark
In this Exercise you will practice using RDDs in the Spark Shell:
You will start by reading a simple text file. Then you will use Spark to explore the Apache web server output
logs of the customer service site of a fictional mobile phone service provider called Loudacre.
1. Start the Spark Shell if you exited it from the previous exercise. You may use either Scala (spark-shell)
or Python (pyspark). These instructions assume you are using Python.
2. Review the simple text file we will be using by viewing (without editing) the file in a text editor. The file
is located at:
/opt/training/sparkdev/data/frostroad.txt
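The numbering jumps from step 2 to step 4, and step 4 refers to an RDD named mydata, so the missing step presumably creates that RDD from the file. A minimal sketch, using the path from step 2 and the mydata variable name used below (the file: URI prefix matches the style used later in this exercise):
pyspark> mydata=sc.textFile("file:/opt/training/sparkdev/data/frostroad.txt")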
4. Note that Spark has not yet read the file. It will not do so until you perform an operation on the RDD. Try
counting the number of lines in the dataset:
pyspark> mydata.count()
The count operation causes the RDD to be materialized (created and populated). The number of lines
should be displayed, e.g. Out[4]: 23
5. Try executing the collect() operation to display the data in the RDD.
pyspark> mydata.collect()
Note that this returns and displays the entire dataset. This is convenient for a very small RDD like this one, but be
careful about using collect() on more typical, large datasets.
6. Using command completion, you can see all the available transformations and operations you can
perform on an RDD. Type mydata. and then the [TAB] key.
In this exercise you will be using data in ~/sparkdev/data/weblogs. Initially you will work with the log file from a
single day. Later you will work with the full data set, consisting of many days' worth of logs.
7. Review one of the .log files in the directory. Note the format of the lines, e.g.
8. Set a variable for the data file so you do not have to retype it each time.
pyspark> logfile="file:/home/<username>/sparkdev/data/weblogs/2013-09-15.log"
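Step 9 appears to be missing; it presumably creates the base RDD that step 10 filters, e.g. (the variable name logs is taken from the next step):
pyspark> logs=sc.textFile(logfile)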
10. Create an RDD containing only those lines that are requests for JPG files.
pyspark> jpglogs=logs.filter(lambda x: ".jpg" in x)
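Step 11 is missing here; given that step 12 talks about counting the JPG requests, it presumably counts (or previews) the filtered RDD, e.g.:
pyspark> jpglogs.count()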
12. Sometimes you do not need to store intermediate data in a variable, in which case you can combine the steps
into a single line of code. For instance, if all you need is to count the number of JPG requests, you can execute
this in a single command:
pyspark> sc.textFile(logfile).filter(lambda x: ".jpg" in x).count()
13. Now try using the map function to define a new RDD. Start with a very simple map that returns the length of
each line in the log file.
pyspark> logs.map(lambda s: len(s)).take(5)
This prints out an array of five integers corresponding to the length of each of the first five lines in the log file.
14. That’s not very useful. Instead, try mapping to an array of words for each line:
pyspark> logs.map(lambda s: s.split()).take(5)
This time it prints out five arrays, each containing the words in the corresponding log file line.
15. Now that you know how map works, define a new RDD containing just the IP addresses from each line in the
log file. (The IP address is the first “word” in each line).
pyspark> ips = logs.map(lambda s: s.split()[0])
pyspark> ips.take(5)
16. Although take and collect are useful ways to look at data in an RDD, their output is not very readable.
Fortunately, though, they return arrays, which you can iterate through:
pyspark> for x in ips.take(10): print x
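Step 17 is missing here; since step 18 lists a /home/<username>/iplist folder, it presumably saves the IP address RDD to that directory, e.g. (local-file URI assumed):
pyspark> ips.saveAsTextFile("file:/home/<username>/iplist")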
18. In a terminal window, list the contents of the /home/<username>/iplist folder. You should see multiple files.
The one you care about is part-00000, which should contain the list of IP addresses. “Part” (partition) files are
numbered because there may be results from multiple tasks running on the cluster; you will learn more about
this later.
1. Challenge 1: As you did in the previous step, save a list of IP addresses, but this time, use the whole web log data set (weblogs/*)
instead of a single day’s log.
Tip: You can use the up-arrow to edit and execute previous commands. You should only need to modify the
lines that read and save the files. Note that the directory you specify when calling saveAsTextFile() must not
already exist.
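A possible approach, sketched under the assumption that the output goes to a new directory (here arbitrarily named iplist-all):
pyspark> ips=sc.textFile("file:/home/<username>/sparkdev/data/weblogs/*").map(lambda line: line.split()[0])
pyspark> ips.saveAsTextFile("file:/home/<username>/iplist-all")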
2. Challenge 2: Use RDD transformations to create a dataset consisting of the IP address and corresponding user ID for each request for
an HTML file. (Disregard requests for other file types). The user ID is the third field in each log file line.
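One way to sketch this, assuming requests for HTML files contain ".html" in the request line (mirroring the ".jpg" filter used earlier) and reusing the logs RDD:
pyspark> htmllogs=logs.filter(lambda line: ".html" in line).map(lambda line: (line.split()[0], line.split()[2]))
pyspark> for pair in htmllogs.take(5): print pair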
In this Exercise you will continue exploring the Loudacre web server log files, as
well as the Loudacre user account data, using key-value pair RDDs.
This time, work with the entire set of data files in the weblog folder rather than just a single day’s logs.
1. Count the number of requests from each user:
a. Use map to create a pair RDD with the user ID as the key and the integer 1 as the value. (The user ID is the third field in each line.)
b. Use reduceByKey() to sum the values for each user ID, giving an RDD of (userid, count) pairs.
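A minimal pyspark sketch of this counting step, assuming the local weblogs path used earlier and an arbitrary variable name userreqs (reused in the later steps):
pyspark> userreqs=sc.textFile("file:/home/<username>/sparkdev/data/weblogs/*").map(lambda line: (line.split()[2], 1)).reduceByKey(lambda v1,v2: v1+v2)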
2. Display the user IDs and hit count for the users with the 10 highest hit counts.
a. Use map to reverse the key and value, producing (count, userid) pairs.
b. Use sortByKey(False) to sort the swapped data by count.
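A sketch of the swap-and-sort, building on the assumed userreqs RDD from the previous step:
pyspark> userreqs.map(lambda pair: (pair[1], pair[0])).sortByKey(False).take(10)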
3. Create an RDD where the user id is the key, and the value is the list of all the IP addresses that
user has connected from. (IP address is the first field in each request line.)
Hint: Map to (userid, ipaddress) and then use groupByKey.
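A sketch using groupByKey, again assuming the local weblogs path and an arbitrary variable name userips:
pyspark> userips=sc.textFile("file:/home/<username>/sparkdev/data/weblogs/*").map(lambda line: (line.split()[2], line.split()[0])).groupByKey()
pyspark> for (userid, ips) in userips.take(3): print userid, list(ips)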
4. The data set in the ~/sparkdev/data/accounts.csv consists of information about Loudacre’s user
accounts. The first field in each line is the user ID, which corresponds to the user ID in the web
server logs. The other fields include account details such as creation date, first and last name and
so on.
Join the accounts data with the weblog data to produce a dataset keyed by user ID which contains
the user account information and the number of website hits for that user.
a. Map the accounts data to key/value pairs with the user ID (the first field) as the key and the list of account fields as the value.
b. Join the pair RDD with the set of userid/hit-count pairs calculated in the first step. (Note: the
example data is abbreviated here; you should see the actual user IDs displayed.)
c. Display the user ID, hit count, and first name (3rd value) and last name (4th value) for the first
10 elements, e.g.:
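A sketch of the join, assuming the accounts path from step 4 and the userreqs RDD sketched earlier; the name indexes (values[3], values[4]) follow the 4th/5th-field hint in the later challenge and are an assumption:
pyspark> accounts=sc.textFile("file:/home/<username>/sparkdev/data/accounts.csv").map(lambda line: line.split(',')).map(lambda fields: (fields[0], fields))
pyspark> accounthits=accounts.join(userreqs)
pyspark> for (userid, (values, count)) in accounthits.take(10): print userid, count, values[3], values[4]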
Challenge 1: Use keyBy to create an RDD of account data with the postal code (9th field in the CSV file) as
the key.
• Hint: refer to the Spark API for more information on the keyBy operation
• Tip: Assign this new RDD to a variable for use in the next challenge
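A sketch using keyBy, assuming the same accounts file and an arbitrary variable name accountsByPCode for use in the next challenge (9th field = index 8):
pyspark> accountsByPCode=sc.textFile("file:/home/<username>/sparkdev/data/accounts.csv").map(lambda line: line.split(',')).keyBy(lambda fields: fields[8])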
Challenge 2: Create a pair RDD with postal code as the key and a list of names (Last Name,First Name) in
that postal code as the value.
• Hint: First name and last name are the 4th and 5th fields respectively
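A sketch building on the assumed accountsByPCode RDD; the "Last Name,First Name" formatting follows the challenge text (4th and 5th fields = indexes 3 and 4):
pyspark> namesByPCode=accountsByPCode.mapValues(lambda fields: fields[4] + ',' + fields[3]).groupByKey()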
Challenge 3: Sort the data by postal code, then for the first five postal codes, display the code and list the
names in that postal zone, e.g.
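One possible way to display the first five postal codes and their names, assuming the namesByPCode RDD from the previous challenge:
pyspark> for (pcode, names) in namesByPCode.sortByKey().take(5): print pcode, list(names)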
Hands-On Exercise: Using HDFS
Files Used in This Exercise:
~/sparkdev/data/weblogs/*
SparkHDFS.pyspark
SparkHDFS.scalaspark
$ hdfs
3. The hdfs command is subdivided into several subsystems. The subsystem for working with the
files on the cluster is called FsShell. This subsystem can be invoked with the command hdfs dfs. In
the terminal window, enter:
$ hdfs dfs
You see a help message describing all the commands associated with the FsShell subsystem.
4. Enter:
$ hdfs dfs -ls /
This shows you the contents of the root directory in HDFS. There will be multiple entries, one of
which is /user. Individual users have a “home” directory under this directory, named after their
username; your username in this course is <username>@manulife.com, therefore your home
directory is /user/<username>.
5. Try viewing the contents of the /user directory by running:
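The command itself appears to be missing here; since the next sentence notes that nothing is listed yet, it presumably lists your (still empty) home directory under /user, e.g.:
$ hdfs dfs -ls /user/<username>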
There are no files yet, so the command silently exits. This is different than if you ran hdfs dfs
-ls /foo, which refers to a directory that doesn’t exist and which would display an error
message.
Note that the directory structure in HDFS has nothing to do with the directory structure of the
local filesystem; they are completely separate namespaces.
If a file is deleted with -rm, it is moved to the .Trash folder and kept there for only 30 days.
Uploading Files
Besides browsing the existing filesystem, another important thing you can do with FsShell is to
upload new data into HDFS.
7. Change directories to the local filesystem directory containing the sample data for the
course.
$ cd ~/sparkdev/data
If you perform a regular Linux ls command in this directory, you will see a few files, including
the weblogs directory you used in previous exercises.
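The upload command for step 8 appears to be missing; presumably it uses -put, e.g.:
$ hdfs dfs -put weblogs /user/<username>/weblogs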
This copies the local weblogs directory and its contents into a remote HDFS directory named
/user/<username>/weblogs.
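The listing commands referred to next appear to be missing; presumably you list the uploaded directory both with an absolute path and with a relative one, e.g.:
$ hdfs dfs -ls /user/<username>/weblogs
$ hdfs dfs -ls weblogs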
You should see the same results. If you do not pass a directory name to the -ls command, it
assumes you mean your home directory, i.e. /user/<username>.
Relative paths
If you pass any relative (non-absolute) paths to FsShell commands, they are
considered relative to your home directory.
11. Enter:
$ hdfs dfs -cat weblogs/2014-03-08.log | tail -n 50
This prints the last 50 lines of the file to your terminal. This command is useful for viewing the
output of Spark programs. Often, an individual output file is very large, making it inconvenient
to view the entire file in the terminal. For this reason, it is often a good idea to pipe the
output of the fs -cat command into head, tail, more, or less.
12. To download a file to work with on the local filesystem use the dfs -get command. This
command takes two arguments: an HDFS path and a local path. It copies the HDFS
contents into the local filesystem:
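The example command for -get is not shown; a sketch with an assumed log file name and local destination:
$ hdfs dfs -get weblogs/2014-03-08.log /tmp/2014-03-08.log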
13. There are several other operations available with the hdfs dfs command to perform most
common filesystem manipulations: mv, rm, cp, mkdir, and so on. Enter:
$ hdfs dfs
This displays a brief usage report of the commands available within FsShell. Try playing
around with a few of these commands.
14. In the Spark Shell, create an RDD based on one of the log files you just uploaded to HDFS:
pyspark> logs=sc.textFile("hdfs://localhost/user/<username>/weblogs/2014-03-08.log")
15. Save the JPG requests in the dataset to HDFS:
pyspark> logs.filter(lambda s: ".jpg" in s).saveAsTextFile("hdfs://localhost/user/<username>/jpgs")
16. Back in the terminal, view the created directory and files it contains.
$ hdfs dfs -ls jpgs
$ hdfs dfs -cat jpgs/* | more
17. Optional: Explore the NameNode UI: http://localhost:50070. In particular, try the menu selection Utilities →
Browse the file system.
Note that in this course you are running a “cluster” on a single host. This would never happen in a
production environment, but is useful for exploration, testing, and practicing.
1. In a terminal window, start the Spark Master and Spark Worker daemons:
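The start commands themselves are not shown here; on a plain Spark Standalone installation they would typically be the sbin scripts below, although the course VM may use service wrappers instead (paths and master URL assumed):
$ $SPARK_HOME/sbin/start-master.sh
$ $SPARK_HOME/sbin/start-slave.sh spark://localhost:7077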
2. Start Firefox on your VM and visit the Spark Master UI by using the provided bookmark or visiting
http://localhost:18080/.
3. You should not see any applications in the Running Applications or Completed Applications areas
because you have not run any applications on the cluster yet.
4. A real-world Spark cluster would have several workers configured. In this class we have just one,
running locally, which is named by the date it started, the host it is running on, and the port it is
listening on. For example:
5. Click on the worker ID link to view the Spark Worker UI and note that there are no executors
currently running on the node.
6. Return to the Spark Master UI and take note of the URL shown at the top. You may wish to select
and copy it into your clipboard.
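Steps 7 and 8 appear to be missing; they presumably restart the Spark Shell against the Standalone Master, e.g. (master URL assumed to match the one copied from the Master UI):
$ pyspark --master spark://localhost:7077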
9. You can confirm that you are connected to the correct master by viewing the sc.master
property:
pyspark> sc.master
10. Execute a simple operation to test execution on the cluster. For example,
pyspark> sc.textFile("weblogs/*").count()
11. Reload the Spark Standalone Master UI in Firefox and note that now the Spark Shell appears in the
list of running applications.
12. Click on the application ID (app-xxxxxxx) to see an overview of the application, including the list of
executors running (or waiting to run) tasks from this application. In our small classroom cluster,
there is just one, running on the single node in the cluster, but in a real cluster there could be
multiple executors running on each worker node.