Cloudlab Exercise 11 Lesson 11


1) Connect to Cloudlab.

2) Check that the data file from the previous lab exists in HDFS. The directory should contain a file called logfile1
that will be used as the log data for this exercise.
hdfs dfs -ls logdir
3) Start the Spark shell. You will see many messages before you get the Scala prompt. Note that the Spark
context is opened as sc, and a SQL context connected to the Hive metastore is opened as sqlContext.
spark-shell
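As an optional check (not part of the original steps), you can type the names of these contexts at the Scala prompt and the shell will print each object's type.
sc          // prints something like: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...
sqlContext  // prints the SQL/Hive context object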
4) Import the library for defining storage levels in Spark.
import org.apache.spark.storage.StorageLevel
5) We will process the log file data as in the MapReduce job. Read the logfile1 created in the earlier exercise
into a variable. At this point you will see errors only if there is a syntax error, because Spark uses lazy
evaluation, like Pig. Spark creates an RDD for this data file.
val logFile = sc.textFile("hdfs://ip-172-31-22-91.us-west-2.compute.internal:8020/user/a8117/logdir/logfile1");
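Because of lazy evaluation, a wrong path is not reported until an action runs. As an optional check (not part of the original exercise), you can force evaluation with an action such as count():
logFile.count()   // reads the file now; fails here if the path is wrong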
6) Specify the storage level as Memory Only for this RDD.
logFile.persist(StorageLevel.MEMORY_ONLY);
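For reference, StorageLevel defines several other levels besides MEMORY_ONLY; a few common ones are listed below as comments only, since an RDD's storage level cannot be changed once it has been assigned.
// StorageLevel.MEMORY_AND_DISK    -- spill partitions to disk when memory is full
// StorageLevel.DISK_ONLY          -- keep partitions on disk only
// StorageLevel.MEMORY_ONLY_SER    -- store serialized objects in memory to save space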
7) Create another RDD that contains the word INFO, WARN, or Others for each line, based on the line
content.
val keyval = logFile.map(line => if (line.contains("INFO")) "INFO" else (if (line.contains("WARN")) "WARN" else "Others"))
8) Specify the storage level for this RDD as Memory and Disk with replication of 2.
keyval.persist(StorageLevel.MEMORY_AND_DISK_2)
9) Use a map to add a count of 1 to each word, and then use Spark's reduceByKey method to add up the
counts for each key.
val res = keyval.map(word => (word, 1)).reduceByKey((a, b) => a + b)
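To see what map plus reduceByKey does, here is a small standalone illustration on an in-memory collection (the sample words are made up and this snippet is not part of the lab):
val sample = sc.parallelize(Seq("INFO", "WARN", "INFO", "Others"))   // hypothetical sample data
sample.map(w => (w, 1)).reduceByKey((a, b) => a + b).collect()       // e.g. Array((INFO,2), (Others,1), (WARN,1)); order may vary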
10) This produces the count for each log type, like the MapReduce job did. Store this RDD as a text file in
HDFS.
res.saveAsTextFile("hdfs://ip-172-31-22-91.us-west-2.compute.internal:8020/user/a8117/resout")



This starts the Spark job, which is submitted to YARN. The result is created as a directory in
HDFS, similar to a MapReduce job.
11) We can use the collect function on the RDD to see its contents as an array.
res.collect()
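collect() returns an Array of (logType, count) tuples. Optionally (not part of the original steps), you can print one pair per line:
res.collect().foreach(println)   // prints each (key, count) pair on its own line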
12) End the Spark session.
exit
13) Check the directory in HDFS used to store the RDD.
hdfs dfs -ls resout
Note that it has a _SUCCESS file, like in MapReduce. It has four part files because Spark ran four reducers;
two of the reducer outputs are empty.
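The number of part files matches the number of reduce partitions. If you wanted a single output file, one option (not part of this exercise, and the output path below is only an example) would be to coalesce the RDD to one partition before saving, back in the Spark shell:
// res.coalesce(1).saveAsTextFile("resout_single")   // hypothetical: writes a single part file to a new directory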
14) Check the contents of the part-00002 and part-00003 files, which are not empty.
hdfs dfs -cat resout/part-00002
hdfs dfs -cat resout/part-00003
15) Spark also provides a web interface that shows the status of Spark jobs. It is available on port 4040;
access it at the URL http://h1.cloudxlab.com:4040.
16) Explore the different tabs. The Jobs tab shows all the Spark jobs. The Stages tab shows the various
stages of each job. The Storage tab shows the RDDs we persisted and whether they are in memory or on
disk. The Environment tab shows the Spark program environment, and the Executors tab shows the driver
and executors.

© Copyright 2015, Simplilearn. All rights reserved.
