Cloudlab Exercise 11 Lesson 11
2) Check that the data file from the previous lab exists in HDFS. The logdir directory should contain a file called logfile1
that will be used as the log data for this exercise.
hdfs dfs -ls logdir
3) Start the Spark shell. You will see many messages before you get the Scala prompt. Note that the
Spark context is available as sc, and a SQL context connected to the Hive metastore is available as
sqlContext.
spark-shell
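As a quick sanity check (our addition, not part of the original steps), you can type the two context names at the Scala prompt; each should echo back its object rather than an error:
sc          // the SparkContext created by the shell
sqlContext  // the SQL context connected to the Hive metastore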
4) Import the library for defining storage levels in Spark.
import org.apache.spark.storage.StorageLevel
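StorageLevel defines several predefined levels; the two used later in this exercise are MEMORY_ONLY and MEMORY_AND_DISK_2. For reference, a few common levels (this list is ours, not part of the lab, and is not exhaustive):
// StorageLevel.MEMORY_ONLY        keep the RDD deserialized in memory only
// StorageLevel.MEMORY_AND_DISK    spill partitions that do not fit in memory to disk
// StorageLevel.DISK_ONLY          store partitions on disk only
// StorageLevel.MEMORY_AND_DISK_2  like MEMORY_AND_DISK, but replicated on two nodes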
5) We will process the log file data as in a MapReduce job. Read the logfile1 created in the
earlier exercise into a variable. Spark uses lazy evaluation, much like Pig, so at this point you will
see an error only if there is a syntax error; the file itself is not read yet. Spark creates an RDD for this data file.
val logFile = sc.textFile("hdfs://ip-172-31-22-91.us-west-2.compute.internal:8020/user/a8117/logdir/logfile1");
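Because of the lazy evaluation mentioned above, a wrong path in this URL would not be reported until an action runs. As an optional check we are adding here, you can force Spark to read the file:
logFile.first()   // triggers the read and prints the first log line if the path is correct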
6) Specify the storage level as Memory Only for this RDD.
logFile.persist(StorageLevel.MEMORY_ONLY);
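To confirm the level took effect (an optional check, not part of the original steps), ask the RDD for its storage level:
logFile.getStorageLevel   // should report a memory-only level with replication of 1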
7) Create another RDD in which each line is replaced by the word INFO, WARN, or Others, depending on the
line's content.
val keyval = logFile.map(line => if (line.contains("INFO")) "INFO" else if (line.contains("WARN")) "WARN" else "Others")
8) Specify the storage level for this RDD as Memory and Disk with replication of 2.
keyval.persist(StorageLevel.MEMORY_AND_DISK_2)
9) Use a map to pair each word with a count of 1, and then use Spark's reduceByKey method to add up
the counts for each key.
val res = keyval.map(word => (word, 1)).reduceByKey((a, b) => a + b)
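reduceByKey is also lazy, so nothing has actually executed yet. As an optional step we are adding before the save, you can collect the (small) result to the driver and print it:
res.collect().foreach(println)   // prints (key, count) pairs such as (INFO,<n>), (WARN,<n>), (Others,<n>)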
10) This produces the count for each log type, just like the MapReduce job. Store this RDD as a text file in
HDFS.
res.saveAsTextFile("hdfs://ip-172-31-22-91.us-west-2.compute.internal:8020/user/a8117/resout")
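To verify the output (an optional check we are adding, reusing the same resout path), read the saved directory back into an RDD and print it; you could equally run hdfs dfs -cat resout/part-* from a terminal:
sc.textFile("hdfs://ip-172-31-22-91.us-west-2.compute.internal:8020/user/a8117/resout").collect().foreach(println)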