Lab Spark

1. Before starting

For this lab, we provide you with some initial source code that can be downloaded
at https://github.com/AllaHINDIR/lab-spark.git

2. The dataset

The dataset we will study in this lab comes from the Climatological Database for
the World's Oceans (CLIWOC) project. Detailed information about this project can be found at
https://en.wikipedia.org/wiki/CLIWOC.

In a few words, this dataset has been created from the logbooks of ships that were
navigating the oceans during the 18th and 19th centuries. These logbooks were
maintained by the ships' crews and contain a wealth of information about the
weather conditions observed at the time.

Each observation in the dataset contains many fields, including the date of the
observation, the location, the air temperature, the wind speed, etc. A detailed
description of all fields is provided in the archive (file CLIWOC_desc.html).

The dataset is to be downloaded at the following address:

https://cloud.univ-grenoble-alpes.fr/index.php/s/5zjJaf5BStr4zw2

The file extracted from the archive is to be stored in the directory data of the provided
source code.

3. Work on the dataset

The provided project contains initial Spark code in Python (file code/navigation.py)
and in Scala (file code/Navigation/src/main/scala/navigation.scala).
This code already works and does the following:
• It creates an RDD out of the input dataset
• It displays 5 of the nationalities that produced reports in the dataset (column
"Nationality")
• Note that the provided code assumes that the dataset has been stored in the
directory code/data.

Start by running this code and try to understand how it works. (A rough Python sketch of
the same logic is given below for reference.)
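If you cannot open the repository right away, the following sketch shows roughly what the
provided code does. It is not the actual content of code/navigation.py: the file name, the
field separator, and the way the "Nationality" column is located are assumptions made for
illustration only.

```python
from pyspark import SparkContext

# Rough sketch of the provided example's logic (assumptions: the dataset is a
# CSV-like text file stored under data/, and "Nationality" is one of the columns
# named in the header line; the real navigation.py may differ).
sc = SparkContext("local[1]", "navigation-sketch")

lines = sc.textFile("data/CLIWOC15.csv")             # hypothetical file name
header = lines.first()
col_index = header.split(",").index("Nationality")   # locate the column by name

nationalities = (lines.filter(lambda l: l != header)
                      .map(lambda l: l.split(",")[col_index]))

# Display 5 of the nationalities present in the dataset
for nat in nationalities.distinct().take(5):
    print(nat)

input("Press Enter to terminate (keeps the Web UI at http://localhost:4040 alive)...")
sc.stop()
```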


A complete documentation of the Spark API for manipulating RDDs is available online:
• For Scala:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html
• For Python:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis

4. A few comments

• While you are debugging a program, it can be good to run with a single executor
thread (local[1]) to avoid very large and messy log traces in case of error.
• Once your program works, you can try executing it with more executor threads to
observe the impact on performance. (Elapsed time can be measured as described here
for Python: https://stackoverflow.com/a/25823885, and here for Scala:
https://stackoverflow.com/a/37731494.) A short sketch combining these points is given
after this list.
• Some of the questions below can be solved in different ways. Do not hesitate to
implement multiple solutions and compare their performance.
• Caching RDDs can have a significant impact on the performance of Spark (see
https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). You
can also try to evaluate this impact during your tests.
• Note that in the dataset, an entry with no value is represented by the string "NA"
(short for Not Available).
• You can access the Spark Web UI by opening the URL http://localhost:4040/ (or
http://127.0.0.1:4040/). Take the time during the lab to observe the information accessible
through this interface (graph of tasks, execution time, etc.).
– The Web UI is only accessible while a Spark application is running. To
let you connect to the interface easily, the provided example code
prevents the application from terminating immediately after the computations are
done: you are required to press Enter to terminate the program.
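As an illustration of these points, here is a small Python sketch showing one way to choose
the number of executor threads, time a computation, and cache an RDD. The master string,
the timing approach, and the file name are only examples; adapt them to the provided code.

```python
import time
from pyspark import SparkContext

# "local[1]" = one executor thread (easier to debug); try "local[4]" or "local[*]"
# once the program works, to observe the impact on performance.
sc = SparkContext("local[1]", "timing-sketch")

lines = sc.textFile("data/CLIWOC15.csv")  # hypothetical file name, as above
lines.cache()                             # keep the RDD in memory across actions

start = time.time()
n = lines.count()                         # first action: reads the file and fills the cache
print("count =", n, "in", time.time() - start, "s")

start = time.time()
n = lines.count()                         # second action: should benefit from the cache
print("count =", n, "in", time.time() - start, "s")

sc.stop()
```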

5. Questions

Implement a Spark program that does the following:


1. When running the provided example code, you will observe that some entries are
equivalent. More specifically, you will see two entries for "British" whose only
difference is an extra whitespace character in the name. Propose a new version of the
computation that treats these two entries as the same one.
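One possible direction (not the only one) is to normalize the nationality strings before
counting them, for instance by stripping surrounding whitespace. The variable names below
are hypothetical:

```python
# Assuming 'nationalities' is the RDD of raw "Nationality" strings built earlier
cleaned = nationalities.map(lambda s: s.strip())

for nat in cleaned.distinct().take(5):
    print(nat)
```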

2. Count the total number of observations included in the dataset (each line
corresponds to one observation)

3. Count the number of years over which observations have been made (Column
"Year" should be used)

4. Display the oldest and the newest year of observation


5. Display the years with the minimum and the maximum number of observations
(and the corresponding number of observations)
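As a hint for questions 3 to 5, the kind of RDD operations involved might look like the
sketch below (assuming a 'years' RDD holding the "Year" value of each observation has
already been extracted; all names are hypothetical):

```python
# Number of distinct years of observation (question 3)
n_years = years.distinct().count()

# Oldest and newest year (question 4); "NA" entries are skipped before converting to int
valid_years = years.filter(lambda y: y != "NA").map(int)
print(valid_years.min(), valid_years.max())

# Years with the minimum and maximum number of observations (question 5)
counts = valid_years.map(lambda y: (y, 1)).reduceByKey(lambda a, b: a + b)
print(counts.min(key=lambda kv: kv[1]), counts.max(key=lambda kv: kv[1]))
```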

6. Count the distinct departure places (column "VoyageFrom") using two methods
(i.e., once with distinct() and once with reduceByKey()) and compare the execution times.
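For question 6, the two methods could be compared along these lines (sketch only;
'departures' is a hypothetical RDD of the "VoyageFrom" values):

```python
# Method 1: distinct()
n1 = departures.distinct().count()

# Method 2: reduceByKey() — keep one entry per key, then count the keys
n2 = (departures.map(lambda p: (p, 1))
                .reduceByKey(lambda a, b: a + b)
                .count())

assert n1 == n2
```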

7. Display the 10 most popular departure places


8. Display the 10 most often taken routes (a route being defined by a pair
"VoyageFrom"-"VoyageTo").
• Here you can start by implementing a version in which a pair A-B and a pair
B-A correspond to different routes.
• Then implement a second version in which A-B and B-A are considered the
same route (a sketch follows).
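For the second version, one common trick is to put each pair in a canonical order before
counting, for example (sketch; 'voyages' is a hypothetical RDD of (VoyageFrom, VoyageTo)
tuples):

```python
# Order each pair so that A-B and B-A map to the same key
routes = voyages.map(lambda ft: tuple(sorted(ft)))

top10 = (routes.map(lambda r: (r, 1))
               .reduceByKey(lambda a, b: a + b)
               .takeOrdered(10, key=lambda kv: -kv[1]))
for route, count in top10:
    print(route, count)
```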

9. Compute which month (column "Month") is the hottest on average over all the years,
considering all the temperatures (column "ProbTair") reported in the dataset.
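A possible structure for question 9, as a sketch (the 'month_temp' RDD of
(Month, ProbTair) string pairs is hypothetical; remember to discard "NA" temperatures):

```python
# Average temperature per month, then pick the month with the highest average
valid = month_temp.filter(lambda mt: mt[1] != "NA")

sums = (valid.map(lambda mt: (mt[0], (float(mt[1]), 1)))
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))

averages = sums.mapValues(lambda s: s[0] / s[1])
hottest = averages.max(key=lambda kv: kv[1])
print("Hottest month on average:", hottest)
```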
