Lab Spark

1. Before starting

For this lab, we provide you with some initial source code that can be downloaded
at https://github.com/AllaHINDIR/lab-spark.git

2. The dataset

The dataset we will study in this lab comes from the Climatological Database for
the World's Oceans (CLIWOC) project. Detailed information about this project can be found at
https://en.wikipedia.org/wiki/CLIWOC.

In a few words, this dataset has been created from the logbooks of ships that were
navigating the oceans during the 18th and 19th centuries. These logbooks were
maintained by the ships' crews and contain a wealth of information about the
weather conditions observed at the time.

Each observation in the dataset contains many fields, including the date of the
observation, the location, the air temperature, the wind speed, etc. A detailed
description of all fields is provided in the archive (file CLIWOC_desc.html).

The dataset is to be downloaded at the following address:

https://cloud.univ-grenoble-alpes.fr/index.php/s/5zjJaf5BStr4zw2

The file extracted from the archive is to be stored in the directory data of the provided
source code.

3. Work on the dataset

The provided project contains initial Spark code in Python (file code/navigation.py)
and in Scala (file code/Navigation/src/main/scala/navigation.scala).
This code already works and does the following:
• It creates an RDD out of the input dataset
• It displays 5 of the nationalities that produced reports in the dataset (column
"Nationality")
• Note that the provided code assumes that the dataset has been stored in the
directory code/data.

Start by running this code and try to understand how it works. (A rough Python sketch of
the same logic is given below for reference.)
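If you cannot open the repository right away, the following sketch shows roughly what the
provided code does. It is not the actual content of code/navigation.py: the file name, the
field separator, and the way the "Nationality" column is located are assumptions made for
illustration only.

```python
from pyspark import SparkContext

# Rough sketch of the provided example's logic (assumptions: the dataset is a
# CSV-like text file stored under data/, and "Nationality" is one of the columns
# named in the header line; the real navigation.py may differ).
sc = SparkContext("local[1]", "navigation-sketch")

lines = sc.textFile("data/CLIWOC15.csv")             # hypothetical file name
header = lines.first()
col_index = header.split(",").index("Nationality")   # locate the column by name

nationalities = (lines.filter(lambda l: l != header)
                      .map(lambda l: l.split(",")[col_index]))

# Display 5 of the nationalities present in the dataset
for nat in nationalities.distinct().take(5):
    print(nat)

input("Press Enter to terminate (keeps the Web UI at http://localhost:4040 alive)...")
sc.stop()
```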


A complete documentation of the Spark API for manipulating RDDs is available online:
• For Scala:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html
• For Python:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis

4. A few comments

• While you are debugging a program, it can be good to run with a single executor
thread (local[1]) to avoid very large and messy log traces in case of error.
• Once your program works, you can try executing it with more executor threads to
observe the impact on performance. (Elapsed time can be measured as described here
for Python: https://stackoverflow.com/a/25823885, and here for Scala:
https://stackoverflow.com/a/37731494.) A short sketch combining these points is given
after this list.
• Some of the questions below can be solved in different ways. Do not hesitate to
implement multiple solutions and compare their performance.
• Caching RDDs can have a significant impact on the performance of Spark (see
https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). You
can also try to evaluate this impact during your tests.
• Note that in the dataset, an entry with no value is represented by the string "NA"
(short for Not Available).
• You can access the Spark Web UI by opening the URL http://localhost:4040/ (or
http://127.0.0.1:4040/). Take the time during the lab to observe the information accessible
through this interface (graph of tasks, execution time, etc.).
– The Web UI is only accessible while a Spark application is running. To
let you connect to the interface easily, the provided example code
prevents the application from terminating immediately after the computations are
done: you are required to press Enter to terminate the program.
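As an illustration of these points, here is a small Python sketch showing one way to choose
the number of executor threads, time a computation, and cache an RDD. The master string,
the timing approach, and the file name are only examples; adapt them to the provided code.

```python
import time
from pyspark import SparkContext

# "local[1]" = one executor thread (easier to debug); try "local[4]" or "local[*]"
# once the program works, to observe the impact on performance.
sc = SparkContext("local[1]", "timing-sketch")

lines = sc.textFile("data/CLIWOC15.csv")  # hypothetical file name, as above
lines.cache()                             # keep the RDD in memory across actions

start = time.time()
n = lines.count()                         # first action: reads the file and fills the cache
print("count =", n, "in", time.time() - start, "s")

start = time.time()
n = lines.count()                         # second action: should benefit from the cache
print("count =", n, "in", time.time() - start, "s")

sc.stop()
```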

5. Questions

Implement a Spark program that does the following:


1. When running the provided example code, you will observe that some entries are
equivalent. More specifically, you will see two entries for "British" whose only
difference is an extra whitespace character in the name. Propose a new version of the
computation that treats these two entries as the same one.
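One possible direction (not the only one) is to normalize the nationality strings before
counting them, for instance by stripping surrounding whitespace. The variable names below
are hypothetical:

```python
# Assuming 'nationalities' is the RDD of raw "Nationality" strings built earlier
cleaned = nationalities.map(lambda s: s.strip())

for nat in cleaned.distinct().take(5):
    print(nat)
```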

2. Count the total number of observations included in the dataset (each line
corresponds to one observation)

3. Count the number of years over which observations have been made (Column
"Year" should be used)

4. Display the oldest and the newest year of observation


5. Display the years with the minimum and the maximum number of observations
(and the corresponding number of observations)
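As a hint for questions 3 to 5, the kind of RDD operations involved might look like the
sketch below (assuming a 'years' RDD holding the "Year" value of each observation has
already been extracted; all names are hypothetical):

```python
# Number of distinct years of observation (question 3)
n_years = years.distinct().count()

# Oldest and newest year (question 4); "NA" entries are skipped before converting to int
valid_years = years.filter(lambda y: y != "NA").map(int)
print(valid_years.min(), valid_years.max())

# Years with the minimum and maximum number of observations (question 5)
counts = valid_years.map(lambda y: (y, 1)).reduceByKey(lambda a, b: a + b)
print(counts.min(key=lambda kv: kv[1]), counts.max(key=lambda kv: kv[1]))
```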

6. Count the distinct departure places (column "VoyageFrom") using two methods
(i.e., once with distinct() and once with reduceByKey()) and compare the execution times.
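For question 6, the two methods could be compared along these lines (sketch only;
'departures' is a hypothetical RDD of the "VoyageFrom" values):

```python
# Method 1: distinct()
n1 = departures.distinct().count()

# Method 2: reduceByKey() — keep one entry per key, then count the keys
n2 = (departures.map(lambda p: (p, 1))
                .reduceByKey(lambda a, b: a + b)
                .count())

assert n1 == n2
```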

7. Display the 10 most popular departure places


8. Display the 10 most often taken routes (a route being defined by a pair
"VoyageFrom"-"VoyageTo").
• Here you can start by implementing a version in which a pair A-B and a pair
B-A correspond to different routes.
• Then implement a second version in which A-B and B-A are considered the
same route (a sketch follows).
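For the second version, one common trick is to put each pair in a canonical order before
counting, for example (sketch; 'voyages' is a hypothetical RDD of (VoyageFrom, VoyageTo)
tuples):

```python
# Order each pair so that A-B and B-A map to the same key
routes = voyages.map(lambda ft: tuple(sorted(ft)))

top10 = (routes.map(lambda r: (r, 1))
               .reduceByKey(lambda a, b: a + b)
               .takeOrdered(10, key=lambda kv: -kv[1]))
for route, count in top10:
    print(route, count)
```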

9. Compute which month (column "Month") is the hottest on average over all the years,
considering all the temperatures (column "ProbTair") reported in the dataset.
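A possible structure for question 9, as a sketch (the 'month_temp' RDD of
(Month, ProbTair) string pairs is hypothetical; remember to discard "NA" temperatures):

```python
# Average temperature per month, then pick the month with the highest average
valid = month_temp.filter(lambda mt: mt[1] != "NA")

sums = (valid.map(lambda mt: (mt[0], (float(mt[1]), 1)))
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))

averages = sums.mapValues(lambda s: s[0] / s[1])
hottest = averages.max(key=lambda kv: kv[1])
print("Hottest month on average:", hottest)
```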
