At a glance
Powered by AI

Initial Setup:

Get Kerberos Ticket:

kinit -kt /opt/scripts/keytabs/spark.headless.keytab [email protected]

(use -kt with the keytab when no password is required; omit -kt when a password will be entered)

Copy Data & Scripts to your Home Folder:

cd ~

cp -rf /opt/training/sparkdev .

Hands-On Exercise: Using the Spark Shell


Using the Python Spark Shell

• pyspark
Some commands to try:
  o sc – the SparkContext
    <pyspark.context.SparkContext at 0x2724490>
  o sc.[TAB] – to see possible options/functions/methods on the object
  o exit – Ctrl-D or type exit()

Using the Scala Spark Shell

• spark-shell
Some commands to try:
  o sc – the SparkContext
    res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2f0301fa
  o sc.[TAB] – to see possible options/functions/methods on the object
  o exit – Ctrl-D or type exit

Spark Shell solutions (Python or Scala)

Spark UI: http://10.252.50.88:4040/jobs/


Hands-On Exercise: Getting Started with RDDs
Data for exercises:

/opt/training/sparkdev/data/frostroad.txt

/opt/training/sparkdev/data/weblogs/2013-09-15.log

Solutions:

/opt/training/sparkdev/solutions/

UsingTheSparkShell.pyspark

UsingTheSparkShell.scalaspark

LogIPs.pyspark

LogIPs.scalaspark

In this Exercise you will practice using RDDs in the Spark Shell:

You will start by reading a simple text file. Then you will use Spark to explore the Apache web server output
logs of the customer service site of a fictional mobile phone service provider called Loudacre.

Load and view text file:

1. Start the Spark Shell if you exited it from the previous exercise. You may use either Scala (spark-shell)
or Python (pyspark). These instructions assume you are using Python.

2. Review the simple text file we will be using by viewing (without editing) the file in a text editor. The file
is located at:
/opt/training/sparkdev/data/frostroad.txt

3. Define an RDD to be created by reading in a simple text file:


pyspark> mydata = sc.textFile("file:/home/<username>/sparkdev/data/frostroad.txt")

4. Note that Spark has not yet read the file. It will not do so until you perform an operation on the RDD. Try
counting the number of lines in the dataset:
pyspark> mydata.count()
The count operation causes the RDD to be materialized (created and populated). The number of lines
should be displayed, e.g. Out[4]: 23
5. Try executing the collect() operation to display the data in the RDD.
pyspark> mydata.collect()

Note that this returns and displays the entire dataset. This is convenient for very small RDDs like this one, but be
careful about using collect on larger, more typical datasets.

6. Using command completion, you can see all the available transformations and operations you can
perform on an RDD. Type mydata. and then the [TAB] key.

Explore the Loudacre web log files:

In this exercise you will be using data in ~/sparkdev/data/weblogs. Initially you will work with the log file from a
single day. Later you will work with the full data set consisting of many days' worth of logs.

7. Review one of the .log files in the directory. Note the format of the lines: the IP address is the first field and the user ID is the third field of each request line.

8. Set a variable for the data file so you do not have to retype it each time.
pyspark>logfile="file:/home/<username>/sparkdev/data/weblogs/2013-09-15.log"

9. Create an RDD from the data file.


pyspark> logs = sc.textFile(logfile)

10. Create an RDD containing only those lines that are requests for JPG files.
pyspark> jpglogs=logs.filter(lambda x: ".jpg" in x)

11. View the first 10 lines of the data using take:


pyspark> jpglogs.take(10)

12. Sometimes you do not need to store intermediate data in a variable, in which case you can combine the steps
into a single line of code. For instance, if all you need is to count the number of JPG requests, you can execute
this in a single command:
pyspark> sc.textFile(logfile).filter(lambda x: ".jpg" in x).count()

13. Now try using the map function to define a new RDD. Start with a very simple map that returns the length of
each line in the log file.
pyspark> logs.map(lambda s: len(s)).take(5)

This prints out an array of five integers corresponding to the length of each of the first five lines in the log file.
14. That’s not very useful. Instead, try mapping to an array of words for each line:
pyspark> logs.map(lambda s: s.split()).take(5)
This time it prints out five arrays, each containing the words in the corresponding log file line.

15. Now that you know how map works, define a new RDD containing just the IP addresses from each line in the
log file. (The IP address is the first “word” in each line).
pyspark> ips = logs.map(lambda s: s.split()[0])
pyspark> ips.take(5)

16. Although take and collect are useful ways to look at data in an RDD, their output is not very readable.
Fortunately, though, they return arrays, which you can iterate through:
pyspark> for x in ips.take(10): print x

17. Finally, save the list of IP addresses as a text file:


pyspark> ips.saveAsTextFile("file:/home/<username>/iplist")

18. In a terminal window, list the contents of the /home/<username>/iplist folder. You should see multiple files.
The one you care about is part-00000, which should contain the list of IP addresses. “Part” (partition) files are
numbered because there may be results from multiple tasks running on the cluster; you will learn more about
this later.
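If you prefer to verify the output without leaving the shell, the saved directory can simply be read back as a new RDD. A minimal sketch, assuming the same iplist path used in step 17:

pyspark> saved = sc.textFile("file:/home/<username>/iplist")
pyspark> for x in saved.take(10): print x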

If You Have More Time:


If you have more time, attempt the following challenges:

1. Challenge 1: As you did in the previous step, save a list of IP addresses, but this time, use the whole web log data set (weblogs/*)
instead of a single day’s log.

• Tip: You can use the up-arrow to edit and execute previous commands. You should only need to modify the
lines that read and save the files. Note that the directory you specify when calling saveAsTextFile() must not
already exist.
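One possible approach is sketched below; the variable names and the iplist_all output directory are illustrative, and the output directory must not already exist:

pyspark> logs = sc.textFile("file:/home/<username>/sparkdev/data/weblogs/*")
pyspark> ips = logs.map(lambda s: s.split()[0])
pyspark> ips.saveAsTextFile("file:/home/<username>/iplist_all")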

2. Challenge 2: Use RDD transformations to create a dataset consisting of the IP address and corresponding user ID for each request for
an HTML file. (Disregard requests for other file types). The user ID is the third field in each log file line.

• Display the data in the form ipaddress/userid.
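A minimal sketch of one way to solve this challenge, using illustrative variable names and assuming the logs RDD covers the files you want; the ".html" filter and the field positions follow the description above:

pyspark> htmllogs = logs.filter(lambda s: ".html" in s)
pyspark> htmlusers = htmllogs.map(lambda s: s.split()[0] + "/" + s.split()[2])
pyspark> for pair in htmlusers.take(10): print pair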


Hands-On Exercise: Working with Pair RDDs
Files Used in This Exercise:
Data files (local)
~/sparkdev/data/weblogs/*
~/sparkdev/data/accounts.csv
Solution (in ~/sparkdev/solutions):
UserRequests.pyspark
UserRequests.scalaspark

In this Exercise you will continue exploring the Loudacre web server log files, as
well as the Loudacre user account data, using key-value Pair RDDs.
This time, work with the entire set of data files in the weblog folder rather than just a single day’s logs.

1. Using map-reduce style operations, count the number of requests from each user. (A combined sketch of steps 1 through 4 follows step 4 below.)


a. Use map to create a Pair RDD with the user ID as the key, and the integer 1 as the value.
(The user ID is the third field in each line.) For example:

logs.map(lambda line: line.split()[2]).map(lambda userid: (userid, 1))

b. Use reduceByKey to sum the values for each user ID, e.g.:

reduceByKey(lambda v1, v2: v1 + v2)

2. Display the user IDs and hit counts for the users with the 10 highest hit counts.
a. Use map to reverse the key and value, i.e. map each (userid, count) pair to (count, userid).
b. Use sortByKey(False) to sort the swapped data by count in descending order.

3. Create an RDD where the user ID is the key, and the value is the list of all the IP addresses that
user has connected from. (The IP address is the first field in each request line.)
• Hint: Map to (userid, ipaddress) and then use groupByKey.

4. The data set in ~/sparkdev/data/accounts.csv consists of information about Loudacre's user
accounts. The first field in each line is the user ID, which corresponds to the user ID in the web
server logs. The other fields include account details such as creation date, first and last name, and
so on.

Join the accounts data with the weblog data to produce a dataset keyed by user ID which contains
the user account information and the number of website hits for that user.

a. Map the accounts data to key/value-list pairs: (userid, [values…])

b. Join the Pair RDD with the set of userid/hit counts calculated in the first step. (Note: you should
see the actual user IDs displayed.)
c. Display the user ID, hit count, first name (3rd value), and last name (4th value) for the first
10 elements.
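The following is a minimal sketch of steps 1 through 4, assuming the data paths listed at the top of this exercise; variable names are illustrative and the provided UserRequests solution files may take a different approach. Here the value list keeps all of the CSV fields, so the first and last name are assumed to sit at indexes 3 and 4 (the 4th and 5th fields, as noted in the challenge hints below); adjust the indexes if you drop the user ID from the value list.

pyspark> logs = sc.textFile("file:/home/<username>/sparkdev/data/weblogs/*")
# Step 1: (userid, 1) pairs, reduced to a per-user request count
pyspark> userreqs = logs.map(lambda line: (line.split()[2], 1)).reduceByKey(lambda v1, v2: v1 + v2)
# Step 2: swap to (count, userid), sort descending, and take the top 10
pyspark> userreqs.map(lambda pair: (pair[1], pair[0])).sortByKey(False).take(10)
# Step 3: group all the IP addresses each user has connected from
pyspark> userips = logs.map(lambda line: (line.split()[2], line.split()[0])).groupByKey()
# Step 4: key the accounts data by user ID and join with the per-user hit counts
pyspark> accounts = sc.textFile("file:/home/<username>/sparkdev/data/accounts.csv").map(lambda line: line.split(',')).map(lambda fields: (fields[0], fields))
pyspark> accounthits = accounts.join(userreqs)
# assumes first/last name are the 4th and 5th CSV fields
pyspark> for userid, (values, count) in accounthits.take(10): print userid, count, values[3], values[4]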

If You Have More Time


If you have more time, attempt the following challenges:

Challenge 1: Use keyBy to create an RDD of account data with the postal code (9th field in the CSV file) as
the key.

• Hint: refer to the Spark API for more information on the keyBy operation

• Tip: Assign this new RDD to a variable for use in the next challenge

Challenge 2: Create a pair RDD with postal code as the key and a list of names (Last Name,First Name) in
that postal code as the value.

• Hint: First name and last name are the 4th and 5th fields respectively

• Optional: Try using the mapValues operation

Challenge 3: Sort the data by postal code, then, for the first five postal codes, display the code and list the
names in that postal zone.
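A minimal sketch of one possible approach to the three challenges, assuming the field positions given in the hints above (postal code is the 9th field, first name the 4th, last name the 5th); variable names are illustrative:

pyspark> accountdata = sc.textFile("file:/home/<username>/sparkdev/data/accounts.csv").map(lambda line: line.split(','))
# Challenge 1: key each account record by postal code (9th field, index 8)
pyspark> accountsByPCode = accountdata.keyBy(lambda fields: fields[8])
# Challenge 2: keep only "Last Name,First Name" as the value, then collect the names per postal code
pyspark> namesByPCode = accountsByPCode.mapValues(lambda fields: fields[4] + ',' + fields[3]).groupByKey()
# Challenge 3: sort by postal code and display the first five codes with their lists of names
pyspark> for pcode, names in namesByPCode.sortByKey().take(5): print pcode, list(names)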
Hands-On Exercise: Using HDFS
Files Used in This Exercise:

Data files (local)

~/sparkdev/data/weblogs/*

Solution (in ~/sparkdev/solutions):

SparkHDFS.pyspark

SparkHDFS.scalaspark

$ hdfs

3. The hdfs command is subdivided into several subsystems. The subsystem for working with the
files on the cluster is called FsShell. This subsystem can be invoked with the command hdfs dfs. In
the terminal window, enter:

$ hdfs dfs

You see a help message describing all the commands associated with the FsShell subsystem.

4. Enter:
$ hdfs dfs -ls /

This shows you the contents of the root directory in HDFS. There will be multiple entries, one of
which is /user. Individual users have a “home” directory under this directory, named after their
username; your username in this course is <username>@manulife.com, therefore your home
directory is /user/<username>.
5. Try viewing the contents of the /user directory by running:

$ hdfs dfs -ls /user

You will see your home directory in the directory listing.

6. List the contents of your home directory by running:


$ hdfs dfs -ls /user/<username>

There are no files yet, so the command silently exits. This is different than if you ran hdfs dfs
-ls /foo, which refers to a directory that doesn’t exist and which would display an error
message.

Note that the directory structure in HDFS has nothing to do with the directory structure of the
local filesystem; they are completely separate namespaces.

If a file is deleted with -rm, it is moved to the .Trash folder and kept there for 30 days.

To delete a file permanently (bypassing the .Trash folder), use -rm -skipTrash.

Uploading Files
Besides browsing the existing filesystem, another important thing you can do with FsShell is to
upload new data into HDFS.

7. Change directories to the local filesystem directory containing the sample data for the
course.

$ cd ~/sparkdev/data

If you perform a regular Linux ls command in this directory, you will see a few files and directories, including
the weblogs directory you used in previous exercises.

8. Insert this directory into HDFS:


$ hdfs dfs -put weblogs /user/<username>/weblogs

This copies the local weblogs directory and its contents into a remote HDFS directory named
/user/<username>/weblogs.

9. List the contents of your HDFS home directory now:


$ hdfs dfs -ls /user/<username>
You should see an entry for the weblogs directory.
10. Now try the same dfs -ls command but without a path argument:
$ hdfs dfs -ls

You should see the same results. If you do not pass a directory name to the -ls command, it
assumes you mean your home directory, i.e. /user/<username>.

Relative paths
If you pass any relative (non-absolute) paths to FsShell commands, they are
considered relative to your home directory.

Viewing and Manipulating Files


Now view some of the data you just copied into HDFS.

11. Enter:
$ hdfs dfs -cat weblogs/2014-03-08.log | tail -n 50
This prints the last 50 lines of the file to your terminal. This command is useful for viewing the
output of Spark programs. Often, an individual output file is very large, making it inconvenient
to view the entire file in the terminal. For this reason, it is often a good idea to pipe the
output of the dfs -cat command into head, tail, more, or less.

12. To download a file to work with on the local filesystem use the dfs -get command. This
command takes two arguments: an HDFS path and a local path. It copies the HDFS
contents into the local filesystem:

$ hdfs dfs -get weblogs/2013-09-22.log ~/logfile.txt


$ less ~/logfile.txt

13. There are several other operations available with the hdfs dfs command to perform most
common filesystem manipulations: mv, rm, cp, mkdir, and so on. Enter:
$ hdfs dfs
This displays a brief usage report of the commands available within FsShell. Try playing
around with a few of these commands.

Accessing HDFS files in Spark:


14. In the Spark Shell, create an RDD based on one of the files you uploaded to HDFS. For
example:

pyspark> logs=sc.textFile("hdfs://localhost/user/<username>/weblogs/2014-03-08.log")
15. Save the JPG requests in the dataset to HDFS:
pyspark> logs.filter(lambda s: ".jpg" in s).saveAsTextFile("hdfs://localhost/user/<username>/jpgs")

16. Back in the terminal, view the created directory and files it contains.
$ hdfs dfs -ls jpgs
$ hdfs dfs -cat jpgs/* | more
17. Optional: Explore the NameNode UI: http://localhost:50070. In particular, try the menu selection Utilities → Browse
the file system.

Hands-On Exercise: Running Spark Shell on a Cluster


In this Exercise you will start the Spark Standalone master and worker daemons, explore the Spark
Master and Spark Worker User Interfaces (UIs), and start the Spark Shell on the cluster.

Note that in this course you are running a “cluster” on a single host. This would never happen in a
production environment, but is useful for exploration, testing, and practicing.

1. In a terminal window, start the Spark Master and Spark Worker daemons:

$ sudo service spark-master start


$ sudo service spark-worker start
Note: You can stop the services by replacing start with stop, or force the service to restart by using
restart. You may need to do this if you suspend and restart the VM.

View the Spark Standalone Cluster UI

2. Start Firefox on your VM and visit the Spark Master UI by using the provided bookmark or visiting
http://localhost:18080/.
3. You should not see any applications in the Running Applications or Completed Applications areas
because you have not run any applications on the cluster yet.
4. A real-world Spark cluster would have several workers configured. In this class we have just one,
running locally, which is named by the date it started, the host it is running on, and the port it is
listening on. For example:

5. Click on the worker ID link to view the Spark Worker UI and note that there are no executors
currently running on the node.
6. Return to the Spark Master UI and take note of the URL shown at the top. You may wish to select
and copy it into your clipboard.

Start Spark Shell on the cluster


7. Return to your terminal window and exit Spark Shell if it is still running.
8. Start Spark Shell again, this time setting the MASTER environment variable to the Master URL you
noted in the Spark Standalone Web UI. For example, to start pyspark:
$ MASTER=spark://localhost:7077 pyspark
Or the Scala shell:
$ spark-shell --master spark://localhost:7077
You will see additional info messages confirming registration with the Spark Master. (You may
need to hit Enter a few times to clear the screen log and see the shell prompt.) For example:

9. You can confirm that you are connected to the correct master by viewing the sc.master
property:
pyspark> sc.master
10. Execute a simple operation to test execution on the cluster. For example,
pyspark> sc.textFile("weblogs/*").count()
11. Reload the Spark Standalone Master UI in Firefox and note that the Spark Shell now appears in the
list of running applications.

12. Click on the application ID (app-xxxxxxx) to see an overview of the application, including the list of
executors running (or waiting to run) tasks from this application. In our small classroom cluster
there is just one executor, running on the single node in the cluster, but in a real cluster there could
be multiple executors running on different nodes.
