Cloudlab Exercise 11 Lesson 11
2) Check that the data file from the previous lab exists in HDFS. The logdir directory should contain a file called logfile1
that will be used as the log data for this exercise.
hdfs dfs -ls logdir
3) Start the Spark shell. You will see many messages before you get the Scala prompt. Note that the
Spark context is available as sc, and a SQL context connected to the Hive metastore is available as
sqlContext.
spark-shell
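As a quick sanity check (our addition, not part of the original steps), you can type the two context names at the Scala prompt; each should echo back its object rather than an error:
sc          // the SparkContext created by the shell
sqlContext  // the SQL context connected to the Hive metastore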
4) Import the library for defining storage levels in Spark.
import org.apache.spark.storage.StorageLevel
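StorageLevel defines several predefined levels; the two used later in this exercise are MEMORY_ONLY and MEMORY_AND_DISK_2. For reference, a few common levels (this list is ours, not part of the lab, and is not exhaustive):
// StorageLevel.MEMORY_ONLY        keep the RDD deserialized in memory only
// StorageLevel.MEMORY_AND_DISK    spill partitions that do not fit in memory to disk
// StorageLevel.DISK_ONLY          store partitions on disk only
// StorageLevel.MEMORY_AND_DISK_2  like MEMORY_AND_DISK, but replicated on two nodes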
5) We will process the log file data as in a MapReduce job. Read the logfile1 created in the
earlier exercise into a variable. Spark uses lazy evaluation, much like Pig, so at this point you will
see an error only if there is a syntax error; the file itself is not read yet. Spark creates an RDD for this data file.
val logFile = sc.textFile("hdfs://ip-172-31-22-91.us-west-2.compute.internal:8020/user/a8117/logdir/logfile1");
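Because of the lazy evaluation mentioned above, a wrong path in this URL would not be reported until an action runs. As an optional check we are adding here, you can force Spark to read the file:
logFile.first()   // triggers the read and prints the first log line if the path is correct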
6) Specify the storage level as Memory Only for this RDD.
logFile.persist(StorageLevel.MEMORY_ONLY);
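To confirm the level took effect (an optional check, not part of the original steps), ask the RDD for its storage level:
logFile.getStorageLevel   // should report a memory-only level with replication of 1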
7) Create another RDD in which each line is replaced by the word INFO, WARN, or Others, depending on the
line's content.
val keyval = logFile.map(line => if (line.contains("INFO")) "INFO" else if (line.contains("WARN")) "WARN" else "Others")
8) Specify the storage level for this RDD as Memory and Disk with replication of 2.
keyval.persist(StorageLevel.MEMORY_AND_DISK_2)
9) Use a map to pair each word with a count of 1, and then use Spark's reduceByKey method to add up
the counts for each key.
val res = keyval.map(word => (word, 1)).reduceByKey((a, b) => a + b)
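reduceByKey is also lazy, so nothing has actually executed yet. As an optional step we are adding before the save, you can collect the (small) result to the driver and print it:
res.collect().foreach(println)   // prints (key, count) pairs such as (INFO,<n>), (WARN,<n>), (Others,<n>)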
10) This produces the count for each log type, just like the MapReduce job. Store this RDD as a text file in
HDFS.
res.saveAsTextFile("hdfs://ip-172-31-22-91.us-west-2.compute.internal:8020/user/a8117/resout")
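To verify the output (an optional check we are adding, reusing the same resout path), read the saved directory back into an RDD and print it; you could equally run hdfs dfs -cat resout/part-* from a terminal:
sc.textFile("hdfs://ip-172-31-22-91.us-west-2.compute.internal:8020/user/a8117/resout").collect().foreach(println)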