(Hortonworks University) HDP Developer Apache Spark
Lab Guide
The contents of this course and all its lessons and related materials, including handouts to
audience members, are Copyright © 2012 - 2015 Hortonworks, Inc.
No part of this publication may be stored in a retrieval system, transmitted or reproduced in any
way, including, but not limited to, photocopy, photograph, magnetic, electronic or other record,
without the prior written permission of Hortonworks, Inc.
This instructional program, including all material provided herein, is supplied without any
guarantees from Hortonworks, Inc. Hortonworks, Inc. assumes no liability for damages or legal
action arising from the use or misuse of contents or details contained herein.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
• HDP Certified Developer: for Hadoop developers using frameworks like Pig, Hive, Sqoop and
Flume.
• HDP Certified Administrator: for Hadoop administrators who deploy and manage Hadoop
clusters.
• HDP Certified Developer: Java: for Hadoop developers who design, develop and architect
Hadoop-based solutions written in the Java programming language.
• HDP Certified Developer: Spark: for Hadoop developers who write and deploy applications for
the Spark framework.
• HDF Certified Professional: for DataFlow Operators responsible for building and deploying
HDF workflows.
How to Register: Visit www.examslocal.com and search for “Hortonworks” to register for an
exam. The cost of each exam is $250 USD, and you can take the exam anytime, anywhere
using your own computer. For more details, including a list of exam objectives and instructions
on how to attempt our practice exams, visit http://hortonworks.com/training/certification/
Earn Digital Badges: Hortonworks Certified Professionals receive a digital badge for each
certification earned. Display your badges proudly on your résumé, LinkedIn profile, email
signature, etc.
On Demand Learning
Hortonworks University courses are designed and developed by Hadoop experts and
provide an immersive, valuable, real-world experience. Our scenario-based training
courses offer unmatched depth and expertise. We prepare you to be an expert with
highly valued, practical skills and to successfully complete Hortonworks
Technical Certifications.
The online library accelerates time to Hadoop competency. In addition, the content is
continually expanded with new material.
Visit: http://hortonworks.com/training/class/hortonworks-university-self-paced-learning-
library/
Lab Steps
Perform the following steps:
1. Start the VM
a. If applicable, start VMWare Player (or Fusion) on your local machine, select the course
VM from the list of virtual machines, then click the Play virtual machine link.
Note:
Type "yes" if asked "are you sure you want to continue connecting".
root@ubuntu:~# ssh sandbox
[root@sandbox ~]# start_ambari
b. From the command line, enter the following command, which displays the usage of the
hdfs dfsadmin utility:
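For example, running the utility with no arguments displays its usage:
# hdfs dfsadmin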
Note:
The “dfs” in dfsadmin stands for distributed filesystem, and the dfsadmin utility contains
administrative commands for communicating with the Hadoop Distributed File System.
c. Notice the dfsadmin utility has a -report option, which outputs the current health of
your cluster. Enter the following command to view this report:
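# hdfs dfsadmin -report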
Answer: Look for the value of “Configured Capacity” at the start of the output.
Answer: Look for the value of “Present Capacity” at the start of the output.
Answer: Data in HDFS is chunked into blocks and copied to various nodes in the
cluster. If a particular block does not have enough copies, it is referred to as “under
replicated.”
Answer: 1
<aws_ip>:8080
Log into Ambari using the following credentials
Username: admin
Password: admin
b. You should now be logged into Ambari and can see the cluster information. Your
screen should look something like this:
Result
We have verified that you are able to log in and that the cluster is set up and running. We are now ready to proceed.
Lab Steps
Perform the following steps:
1 . View the hdfs dfs command
a. Within your AWS instance, open a Terminal window if you do not have one open already.
b. From the command line, enter the following command to view the usage:
# hdfs dfs
c. Notice the usage contains options for performing file system tasks in HDFS, like
copying files from a local folder into HDFS, retrieving a file from HDFS, copying and
moving files around, and making and removing directories. In this lab, you will perform
these commands and many others to help you become comfortable working with
HDFS.
b. Run the -ls command, but this time specify the root HDFS folder:
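# hdfs dfs -ls /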
Important: Notice how adding the / in the -ls command caused the contents of the root folder to
display, but leaving off the / showed the contents of /user/root, which is the user root’s home
directory in HDFS. If you do not provide a path for an hdfs dfs command, the user’s home
directory in HDFS is assumed.
Notice you only see the test directory. To recursively view the contents of a folder, use -ls -R:
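# hdfs dfs -ls -R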
3 . Delete a directory
a. Delete the test2 folder (and recursively its subcontents) using the -rm -R command:
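# hdfs dfs -rm -R test2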
Note: Notice Hadoop created a .Trash folder for the root user and moved the deleted
content there. The .Trash folder empties automatically after a configured amount of time.
# cd /root/spark/data/
# tail data.txt
c. Run the following -put command to copy data.txt into the test folder in HDFS:
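# hdfs dfs -put data.txt test/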
b. Verify the file is in both places by using the -ls -R command on test. The output
should look like the following:
b. You can also use the -tail command to view the end of a file:
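For example (the path assumes the data.txt copied into test earlier):
# hdfs dfs -tail test/data.txt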
Result
You should now be comfortable with executing the various HDFS commands, including creating
directories, putting files into HDFS, copying files out of HDFS, and deleting files and folders.
Lab Steps
Perform the following steps:
# ssh sandbox
For Scala:
# spark-shell
For Python:
# pyspark
#ssh sandbox
# cd ~/spark/data
# tail selfishgiant.txt
3 . From the Spark Shell, write the logic for counting all the words
a. Create an RDD from the file we just viewed above
>>> baseRdd=sc.textFile("file:///root/spark/data/selfishgiant.txt")
b. Verify that you have created an RDD from the correct file using take(1)
>>> baseRdd.take(1)
c. Each element is currently a string. Transform each string into an array of words and examine the
output
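One possible transformation (the name splitRdd is illustrative):
>>> splitRdd = baseRdd.map(lambda line: line.split(" "))
>>> splitRdd.take(1)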
d. Map each element into a key value pair, with the key being the word and the value
being 1. Examine the output.
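One way to do this (since each element is now an array of words, flatMap produces one pair per word; the name mappedRdd is illustrative):
>>> mappedRdd = splitRdd.flatMap(lambda words: [(word, 1) for word in words])
>>> mappedRdd.take(5)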
e. Reduce the key value pairs to get the count of each word
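For example, producing the reducedRdd used below:
>>> reducedRdd = mappedRdd.reduceByKey(lambda a, b: a + b)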
>>> reducedRdd.take(20)
>>> reducedRdd.collect()
Result
You should now know how to start the spark shell and perform some basic RDD transformations and
actions.
Lab Steps
Perform the following steps:
1 . Put the required data for the lab from local into the HDFS
a. From within your AWS instance, open a terminal.
# cd /root/spark/data
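For example, assuming the data files are in this directory and the /user/root target directory used by the solutions later in this lab:
# hdfs dfs -put *.csv /user/root/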
2 . Explore the data that was just put into the HDFS, using your local machine
a. Use the head/vi/tail commands to take a look at the data:
flights.csv (Field / Index / Example value):
ElapsedTime 8 63
AirTime 9 49
ArrDelay 10 1
DepDelay 11 8
Origin 12 JAX
Dest 13 FLL
Distance 14 318
TaxiIn 15 6
TaxiOut 16 8
Cancelled 17 0
CancellationCode 18
Diverted 19 0
carrier.csv
airports.csv
plane-data.csv
i. The charts above will be helpful when trying to access individual fields.
b. This application looks like a word count. As a general rule of thumb, process the
minimal amount of data to get the answer. Transform the RDD created above to only
get the necessary fields, along with anything else needed for a word count.
c. Reduce the RDD to get the number of flights for each airline.
c. Create a new RDD using the smallest amount of required data, and join the
airportsRdd to flightsRdd.
ii. Join the RDDs to get the correct city, retaining only the required data.
5 . CHALLENGE:
Find the longest departure delay for each airline if it's over 15 minutes
a. This application is similar to a word count, believe it or not.
c. Instead of adding together values, compare them to find the longest for each key
HINT: max(a,b) returns the greater of the two values. Make sure you are comparing
ints; the data is read in as strings until cast.
6 . CHALLENGE: Find the most common airplane model for flights over 1500 miles
NOTE: Not all data is perfect (plane-data.csv has some missing values); make sure to filter
out airplane records that do not contain 9 fields after being split into an array.
SOLUTIONS
3. a:
>>>flightRdd=sc.textFile("/user/root/flights.csv").map(lambda line: line.split(","))
3. b:
>>> carrierRdd = flightRdd.map(lambda line: (line[5],1))
>>> carrierRdd.take(1)
3. c:
>>> cReducedRdd = carrierRdd.reduceByKey(lambda a,b: a+b)
3. d:
>>> carriersSorted = cReducedRdd.map(lambda (a,b): (b,a)).sortByKey(ascending=False)
>>> carriersSorted.take(3)
4.b:
>>> airportsRdd = sc.textFile("/user/root/airports.csv").map(lambda line:
line.split(","))
4. c. i:
>>> cityRdd = airportsRdd.map(lambda line: (line[0], line[2]))
>>> flightOrigDestRdd = flightRdd.map(lambda line: (line[12], line[13]))
4. c. ii:
>>> origJoinRdd = flightOrigDestRdd.join(cityRdd)
>>> destAndOrigJoinRdd = origJoinRdd.map(lambda (a,b): (b[0],b[1])).join(cityRdd)
>>> citiesCleanRdd = destAndOrigJoinRdd.values()
4. d:
>>> citiesReducedRdd = citiesCleanRdd.map(lambda line: (line,1)).reduceByKey(lambda a,b:
a+b)
4. e:
>>> citiesReducedRdd.map(lambda (a,b): (b,a)).sortByKey(ascending=False).take(5)
5:
>>> flightRdd.filter(lambda line: int(line[11]) > 15) \
.map(lambda line: (line[5], line[11])).reduceByKey(lambda a,b:
max(int(a),int(b))).take(10)
6:
>>> airplanesRdd = sc.textFile("/user/root/plane-data.csv") \
.map(lambda line: line.split(",")) \
.filter(lambda line:len(line) == 9)
>>> flight15Rdd = flightRdd \
.filter(lambda line: int(line[14]) > 1500) \
.map(lambda line: (line[7],1))
>>> tailModelRdd = airplanesRdd \
.map(lambda line: (line[0],line[4]))
>>> flight15Rdd.join(tailModelRdd) \
.map(lambda (a,b): (b[1],b[0])) \
.reduceByKey(lambda a,b: a+b) \
.map(lambda (a,b): (b,a)).sortByKey(ascending=False).take(2)
Lab Steps
Perform the following steps:
1 . Navigate to a fresh Spark web UI
a. Close any REPLs currently open
>>>exit()
If it seems like the REPL is taking a long time to exit, hit enter.
i. The application will be joining data, so split the data into K/V using map and the
UniqueCarrier field.
>>>flightRdd=sc.textFile("/user/root/flights.csv") \
.map(lambda line: line.split(","))
>>>flightsKVRdd=flightRdd.map(##Key with 5th index, keep the 6th value)
>>>flightsKVRdd.getNumPartitions()
>>>joinedRdd = flightsKVRdd.join(carrierRdd)
>>>joinedRdd.count()
i. Refresh the web UI.
b. Repeat steps 2a and 3, but repartition the flightsKVRdd to 10 partitions. Explore the
tasks of the stages more in this example:
>>>flightspartKVRdd=flightsKVRdd.repartition(10)
>>>flightspartKVRdd.getNumPartitions()
>>>flightspartKVRdd.join(carrierRdd).count()
c. Find the number of flights using the 10 partition RDD by unique carrier and sort the list.
iii. View the Web UI, repeating the steps from 3a.
NOTE: If you see grey stages like below, it is because Spark temporarily stores
intermediate shuffle files on local disk; instead of re-processing all the data, it picks
up the intermediate data and skips the earlier stages. Data is stored to disk
temporarily during operations that require a shuffle.
SOLUTIONS
2. a:
>>>flightRdd=sc.textFile("/user/root/flights.csv") \
.map(lambda line: line.split(","))
>>>flightsKVRdd=flightRdd.map(lambda line: (line[5], line[6]))
>>>flightsKVRdd.getNumPartitions()
2. b:
>>> carrierRdd = sc.textFile("/user/root/carriers.csv")\
.map(lambda line: line.split(",")) \
.map(lambda line: (line[0], line[1]))
3. b:
>>>flightspartKVRdd=flightsKVRdd.repartition(10)
>>>flightspartKVRdd.getNumPartitions()
>>>flightspartKVRdd.join(carrierRdd).count()
4. c:
>>> flightspartKVRdd.map(lambda (a,b): (a,1)) \
.reduceByKey(lambda a,b: a+b).join(carrierRdd) \
.map(lambda (a,b): (b[0],b[1])) \
.sortByKey(ascending=False).collect()
Lab Steps
Perform the following steps:
1 . Testing caching
a. Perform a count on the RDD joinedRdd from the previous lab (if you deleted it, paste in the
code to create it again)
ii. In the following steps, we will be comparing the time, so make sure to save the time
in a notepad or write it down.
i. Note the time it took to complete. Was it more or less than in 2b? Why?
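A minimal sketch of the experiment, assuming joinedRdd still exists from the previous lab (timings can be read from the console or the Spark web UI):
>>> joinedRdd.cache()    # marks the RDD for caching; nothing is stored until an action runs
>>> joinedRdd.count()    # first count computes the join and populates the cache
>>> joinedRdd.count()    # second count reads from the cache and should finish noticeably faster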
Result
You have successfully used caching and persistence to realize performance benefits.
Lab Steps
Perform the following steps:
1 . Start by pasting in the first line of code
a. This will create an RDD:
>>>data = sc.parallelize([1,2,3,4,5])
b. Notice the last RDD is still called data. Run toDebugString and take a look at the
lineage:
>>> print(data.toDebugString())
4 . Enabling checkpointing
a. Enable checkpointing:
>>> sc.setCheckpointDir("checkpointDir")
b. It isn’t necessary to checkpoint every iteration; figure out a way to checkpoint every 7
iterations.
>>>data = sc.parallelize([1,2,3,4,5])
>>>for x in range(1000):
... ##Create the checkpoint
... ##Only do it every 7th iteration of x
... data=data.map(lambda i: i+1)
>>> data.take(1)
g. Use the toDebugString on the above code to see what checkpointing is doing.
h. It works!
SOLUTIONS
4. d:
>>>for x in range(100):
...     if x % 7 == 0:
...         data.checkpoint()
...     data = data.map(lambda i: i+1)
>>>data.take(1)
>>>print(data.toDebugString())
Lab Steps
Perform the following steps:
1 . Develop an application for pyspark
a. Start by copying the directory /root/spark/python/projects/myapp/ to your
working directory.
b. This is a simple exercise focusing on building and submitting an application with Spark.
i. The code should look like something that would be copied line by line into the
REPL with a basic Python wrapper around it.
g. Put the selfishgiant.txt file in HDFS if it is not already there.
h. In the application, perform a wordcount on the selfishgiant.txt file and print out
the top 10 most frequent words.
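A minimal sketch of what such an application might look like (the file and variable names are illustrative; the official solution is included in the VM):
# myapp.py -- illustrative sketch only
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("SelfishGiantWordCount")
    sc = SparkContext(conf=conf)

    # Read the file from HDFS (path assumed from the earlier HDFS lab)
    baseRdd = sc.textFile("/user/root/selfishgiant.txt")

    # Classic word count: split, pair, reduce
    counts = baseRdd.flatMap(lambda line: line.split(" ")) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)

    # Print the 10 most frequent words, ordered by count descending
    for word, count in counts.takeOrdered(10, key=lambda pair: -pair[1]):
        print("%s: %d" % (word, count))

    sc.stop()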
2 . Submitting an application
a. Using spark-submit, submit the application to the cluster.
NOTE: Specify the version of Python you are using by adding it before your submit
command: PYSPARK_PYTHON=/usr/bin/python spark-submit …
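A typical invocation from the sandbox (the file name myapp.py and the yarn-client master are assumptions; adjust to your project):
# PYSPARK_PYTHON=/usr/bin/python spark-submit --master yarn-client myapp.py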
ii. Once submitted, open Firefox and navigate to the Spark history server at
sandbox:18080 and find your application.
SOLUTIONS
Sample solution code for this lab is contained within the VM.
Lab Steps
Perform the following steps:
1 . Open up the REPL
2 . Count the number of planes that don’t have all the data filled out
a. Create an RDD from the plane-data.csv file and split it out:
>>> planeRdd=sc.textFile("/user/root/plane-data.csv") \
.map(lambda line: line.split(","))
c. Using foreach, check whether the size of the resulting array is 9; if not, increment the
accumulator.
i. Create a function to do this, pass an array and the accumulator as the input.
>>>print(badData.value)
SOLUTIONS
2. b:
>>> badData=sc.accumulator(0)
2. c. i:
>>>def dataCheck(line, dataCounter):
...     if len(line) != 9:
...         dataCounter += 1
2. c. ii:
>>>planeRdd.foreach(lambda line: dataCheck(line, badData))
Lab Steps
Perform the following steps:
1 . Open up the REPL if not still open from a previous lab
>>>execfile("/root/spark/python/stubs/lab9.py")
>>>print(result)
>>>type(result)
>>>carrierbc=sc.broadcast(result)
b. Using the broadcast.value API, create a new RDD with the flight number and carrier
name; this is called a broadcast join.
c. Verify the broadcast join worked by running a take and return a few records.
SOLUTIONS
2. d:
>>>carrierbc=sc.broadcast(result)
3. b:
>>>flightUpdate=flightRdd \
.map(lambda (a,b): (a,carrierbc.value[b]))
Lab Steps
Perform the following steps:
1 . Open up the REPL if not still open from the previous lab
a. Import the Row class from pyspark.sql:
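For example (the field list below is an assumption chosen to match the columns used later in this lab, and flightRdd is the split RDD from the earlier flights lab):
>>> from pyspark.sql import Row
>>> flightORdd = flightRdd.map(lambda line: Row(UniqueCarrier=line[5], \
    Origin=line[12], Dest=line[13], DepDelay=int(line[11]), \
    Distance=int(line[14]), TaxiIn=int(line[15]), TaxiOut=int(line[16])))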
>>> flightDF=sqlContext.createDataFrame(flightORdd)
d. Using the printSchema() API, examine the schema that was just created for the
dataframe.
>>>flightDF.write.format("parquet").save("/user/root/flights.parquet")
b. In a new terminal window, verify the file was written to the HDFS.
>>>dfflight=sqlContext.read.##Try to finish
b. Find the percentage of delayed flights out of total flights for each airline, and sort the list to
get the most-delayed airlines by airline code.
i. Create a UDF to check whether a flight is delayed or not, then select the fields. The UDF
will return an integer, so import the udf function and IntegerType:
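For example (the name depUDF matches the solution for 5. b. ii; treating any positive DepDelay as delayed is an assumption):
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import IntegerType
>>> depUDF = udf(lambda delay: 1 if delay > 0 else 0, IntegerType())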
ii. Select the columns using the UDF to check if a flight was delayed or not:
>>>delayDF = dfflight.select(dfflight.UniqueCarrier, \
##Use UDF here##.alias("IsDelayed"), dfflight.DepDelay)
iii. Using groupBy and the agg operator, create a count of DepDelay to get the total
number of flights, and a sum of the IsDelayed column:
>>>delayGroupDF = delayDF \
.groupBy(delayDF.UniqueCarrier).agg(##Add dict here##)
iv. Create a UDF to get the percentage of delayed flights; import FloatType as
well:
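For example (the name calc_percent matches the solution for 5. b. v):
>>> from pyspark.sql.types import FloatType
>>> calc_percent = udf(lambda delayed, total: float(delayed) / total * 100, FloatType())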
v. Create the final DF by using a select, the UDF, and a sort, then show it:
>>> delayGroupDF.select(delayGroupDF.UniqueCarrier, \
calc_percent(##Use the correct columns for the udf##) \
.alias("Percentage")).sort(##Sort on percent##).show()
c. CHALLENGE: Find the top 5 airlines with the longest average flight distance.
a. Find the top 5 airports with the longest average taxi time in.
b. Find the top 5 airports with the shortest average taxi time out.
SOLUTIONS
4. c:
>>>dfflight=sqlContext.read.format("parquet") \
.load("/user/root/flights.parquet")
5. a:
>>> dfflight.select(dfflight.Origin, dfflight.DepDelay) \
.groupBy('Origin').avg() \
.withColumnRenamed("AVG(DepDelay)", "DelayAvg") \
.sort('DelayAvg', ascending=False).show()
5. b. ii:
>>>delayDF = dfflight.select(dfflight.UniqueCarrier, \
depUDF(dfflight.DepDelay).alias("IsDelayed"), dfflight.DepDelay)
5. b. iii:
>>>delayGroupDF = delayDF.groupBy(delayDF.UniqueCarrier) \
.agg({"IsDelayed": "sum", "DepDelay": "count"})
5. b. v:
>>> delayGroupDF.select(delayGroupDF.UniqueCarrier, \
calc_percent("SUM(IsDelayed)","COUNT(DepDelay)") \
.alias("Percentage")).sort("Percentage", ascending=False).show()
5. c:
>>> dfflight.select("UniqueCarrier", "Distance") \
.groupBy("UniqueCarrier").avg() \
.sort("AVG(Distance)", ascending=False).show(5)
6. a:
>>> dfflight.select("Origin", "TaxiIn") \
.groupBy("Origin").avg() \
.sort("AVG(TaxiIn)", ascending=False).show(5)
6. b:
>>> dfflight.select("Origin", "TaxiOut") \
.groupBy("Origin").avg() \
.sort("AVG(TaxiOut)", ascending=True).show(5)
Lab Steps
Perform the following steps:
1 . Open up the REPL if not still open from the previous lab
a. Verify the sqlContext is of type HiveContext:
>>>type(sqlContext)
>>>sqlContext.sql("USE flight")
4 . Using the HiveContext, create two DataFrames: one from the table flights and
the other from planes
5 . Sort the flights DataFrame by distance to find the longest flight, then do a take
to look at the distance of the longest flight
6 . Filter all flights on the longest flight distance, and return the tail numbers of
those flights
7 . Join the tail numbers to the planes DataFrame to get the models of the airplanes
8 . Group the result by model and count it to find the most common models
SOLUTIONS
4:
>>>sqlContext.sql("Use flight")
>>> flights = sqlContext.table("flights")
>>> planes = sqlContext.table("planes")
5:
>>>flights.sort("distance", ascending=False).take(1)
6:
>>>longflights = flights.filter(flights.distance==4962) \
.select("tailnum").distinct()
7:
>>>longflightplanes = longflights\
.join(planes, 'tailnum' , 'inner')
8:
>>> longflightplanes.select("model").groupBy("model") \
.count().show()
Lab Steps
Perform the following steps:
1 . Close the REPL
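The StreamingContext class used below must be imported first:
>>> from pyspark.streaming import StreamingContext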
>>>ssc = StreamingContext(sc, 5)
>>>inputDS = ssc.socketTextStream("sandbox",9999)
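The word-count DStream referenced below is not defined yet; one stateless version (the name wc matches the pprint call below) could be:
>>> wc = inputDS.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)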
>>>wc.pprint()
>>>sc.setLogLevel("ERROR")
>>>ssc.start()
NOTE: You may see an error when it starts; it is waiting for an input connection.
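The input connection comes from a netcat listener; assuming nc is available, start one in a separate terminal:
# nc -lk 9999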
a. Start typing words separated by spaces, pressing return occasionally to submit them
c. While the application is running, navigate to the web UI in Firefox and explore the web
UI tabs:
sandbox:4040
d. To quit the streaming application, press Ctrl-D in the REPL, and Ctrl-C in the terminal
running nc.
Result
You have now successfully created and run a stateless application.
Lab Steps
Perform the following steps:
1 . Close the REPL
2 . Start a new REPL specifying the following information:
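A streaming REPL needs at least two local execution threads, so one plausible invocation (an assumption; your environment may differ) is:
# pyspark --master local[2]
Once the REPL is up, import StreamingContext again:
>>> from pyspark.streaming import StreamingContext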
>>>ssc = StreamingContext(sc, 2)
>>>inputDS = ssc.socketTextStream("sandbox",9999)
>>>ssc.checkpoint("hdfs:///user/root/checkpointDir")
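The windowed word count referenced below is not defined yet; one version using reduceByKeyAndWindow (the 20-second window and 10-second slide are illustrative) could be:
>>> windowDS = inputDS.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 20, 10)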
>>>windowDS.pprint()
>>>sc.setLogLevel("ERROR")
>>>ssc.start()
4 . In a new terminal, run the following command to start outputting to the stream:
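For example, assuming a netcat listener you type into, as in the previous lab:
# nc -lk 9999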
c. While the application is running, navigate to the web UI in Firefox and explore the web
UI tabs:
sandbox:4040
d. To quit the streaming application, press Ctrl-D in the REPL, and Ctrl-C in the terminal
running nc.
Result
You have now successfully created an application that utilizes the window function.