Assignment 1 Formatted-V4
Overview
In this assignment you write MapReduce programs to develop an understanding of the principles involved in
solving complex problems on the Hadoop execution platform.
Learning Outcomes
The key course learning outcomes are:
− CLO 1: model and implement efficient big data solutions for various application areas using appropriately
selected algorithms and data structures.
− CLO 2: analyse methods and algorithms, to compare and evaluate them with respect to time and space
requirements and make appropriate design choices when solving real-world problems.
− CLO 3: motivate and explain trade-offs in big data processing technique design and analysis in written
and oral form.
− CLO 4: explain the Big Data Fundamentals, including the evolution of Big Data, the characteristics of Big
Data and the challenges introduced.
− CLO 6: apply the novel architectures and platforms introduced for Big data, i.e., Hadoop, MapReduce and
Spark.
Assessment Details
You have two datasets: Trips.txt, which records trip information, and Taxis.txt, which records taxi information.
Both Trips.txt and Taxis.txt are stored on HDFS. Complete the following MapReduce programming tasks
in Python. Note that using any other language, such as Java, will result in a mark of 0 for the assignment.
Also, you are not allowed to use any Python MapReduce library such as mrjob.
Task 1 (5 marks)
For each taxi, compute the number of trips and the average distance per trip by developing MapReduce programs
in Python. The program should implement in-mapper combining with state preserved across input lines.
The code must work with 3 reducers. You need to submit a shell script named task1-run.sh. Running the shell
script must perform the task, with the shell script and the code files located in the same folder (no subfolders).
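The in-mapper combining pattern named above can be sketched as follows. This is a minimal illustration, not a reference solution: the record layout of Trips.txt is not given in this section, so the comma separator and the column positions (taxi_col, dist_col) are assumptions, and the function names and the map/reduce command-line switch are hypothetical. The mapper keeps one dictionary for all of its input lines and emits a partial (count, total distance) pair per taxi only at the end. It emits sums rather than averages so that the reducer can combine partial results correctly.

```python
import sys
from collections import defaultdict


def combine(lines, taxi_col=1, dist_col=2, sep=","):
    """In-mapper combining: accumulate [trip count, total distance] per taxi
    in one dictionary that persists across every line this mapper reads.
    Column positions and separator are ASSUMED, not taken from the assignment."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        fields = line.strip().split(sep)
        try:
            taxi = fields[taxi_col]
            dist = float(fields[dist_col])
        except (IndexError, ValueError):
            continue  # skip malformed or header lines
        totals[taxi][0] += 1
        totals[taxi][1] += dist
    return totals


def reduce_lines(lines):
    """Merge partial results. Input lines are 'taxi<TAB>count<TAB>total',
    sorted by taxi, as Hadoop Streaming delivers them to a reducer.
    Yields (taxi, trips, average distance)."""
    current, count, total = None, 0, 0.0
    for line in lines:
        taxi, c, t = line.rstrip("\n").split("\t")
        if taxi != current:
            if current is not None:
                yield current, count, total / count
            current, count, total = taxi, 0, 0.0
        count += int(c)
        total += float(t)
    if current is not None:
        yield current, count, total / count


# Hypothetical entry point: "map" or "reduce" selects the role.
if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        for taxi, (count, total) in combine(sys.stdin).items():
            print(f"{taxi}\t{count}\t{total}")
    elif sys.argv[1] == "reduce":
        for taxi, count, avg in reduce_lines(sys.stdin):
            print(f"{taxi}\t{count}\t{avg}")
```

Because all partial records for a given taxi carry the same key, they reach the same reducer, so the logic does not depend on whether the job runs with 1 or 3 reducers. The dictionary holds one entry per distinct taxi seen by a mapper, which is the memory cost of this pattern.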
RMIT Classification: Trusted
The code must work with 3 reducers, for different settings of k, and for different settings of v. Also, you should
write a shell script named task2-run.sh. Running the shell script must perform the task, with the shell
script and the code files located in the same folder (no subfolders). Note that k and v must be passed to
task2-run.sh as arguments when it is executed.
Note that Task 3 should consist of two MapReduce subtasks: the first is a join operation and the second is a
counting operation. The output of the first subtask is the input of the second. The execution of the two
subtasks should be specified in task3-run.sh. You are not allowed to copy Trips.txt and/or Taxis.txt to the
local machine and process them there.
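The two-stage structure described above (a join whose output feeds a count) can be sketched in plain Python as below. This is an illustration under stated assumptions only: the file layouts are not given in this section, so the sketch assumes hypothetical records of the form taxi_id,attribute in Taxis.txt and trip_id,taxi_id,... in Trips.txt, and it counts trips per taxi attribute because the quantity to be counted is not stated here. All function names are hypothetical. In a real Hadoop Streaming job the functions would read stdin and write tab-separated lines, and the mapper would typically tell the two inputs apart by the input file name (for example through the mapreduce_map_input_file environment variable), not by an explicit argument.

```python
from itertools import groupby


def join_map(line, source, sep=","):
    """Stage 1 mapper: key every record by taxi ID and tag it with its source.
    ASSUMED layouts: taxis -> taxi_id,attribute ; trips -> trip_id,taxi_id,..."""
    fields = line.strip().split(sep)
    if source == "taxis":
        return fields[0], "A", fields[1]   # (taxi_id, tag, attribute)
    return fields[1], "B", fields[0]       # (taxi_id, tag, trip_id)


def join_reduce(records):
    """Stage 1 reducer: records are (taxi_id, tag, value) sorted by taxi_id.
    Buffers each group, so it does not rely on tag order within a key."""
    for taxi_id, group in groupby(records, key=lambda r: r[0]):
        attribute, trips = None, []
        for _, tag, value in group:
            if tag == "A":
                attribute = value
            else:
                trips.append(value)
        if attribute is not None:          # inner join: drop unmatched trips
            for trip_id in trips:
                yield trip_id, taxi_id, attribute


def count_map(joined_record):
    """Stage 2 mapper: emit (attribute, 1) for each joined record."""
    return joined_record[2], 1


def count_reduce(pairs):
    """Stage 2 reducer: pairs are (attribute, 1) sorted by attribute."""
    for attribute, group in groupby(pairs, key=lambda p: p[0]):
        yield attribute, sum(n for _, n in group)
```

The sorted() calls a caller would place between the map and reduce functions stand in for the shuffle-and-sort step that Hadoop performs between the two phases of each job.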
Submission
Your assignment must follow the requirements below and be submitted via Canvas > Assignment 1. Assessment
declaration: when you submit work electronically, you agree to the assessment declaration.
Format Requirements
Failure to follow the requirements incurs a penalty of up to 10 marks.
1. If your student ID is s1234567, then please create a zip file named s1234567_BDP_A1.zip.
• You need to include a “README” file in the zip file. In the README, provide sufficient information
on how to run your code for each task on AWS EMR.
• Place the code files and shell scripts for all three tasks in the same folder (i.e., no subfolders), and then
zip the folder.
• Do not include hadoop-streaming-3.1.4.jar in the zip file.
2. On HDFS, the input files must be in /input/ and the output must be in /output/, as follows:
/input/Trips.txt
/input/Taxis.txt
/output/task1
/output/task2
/output/task3
3. Besides the zip file, collect the code and the shell scripts of all three tasks in a separate PDF file (copy
& paste into a text editor and then save it as a PDF file). Submit the PDF file as well, so there are two
submissions: the zip file and the PDF file. The PDF file is for the Turnitin plagiarism
check.
Functional Requirements
Failure to follow the requirements incurs a penalty of up to 5 marks.
− The code must be well written and follow a good coding style.
− The code and scripts must include concise and clear comments that explain the logical flow of the
program.
RMIT University treats plagiarism as a very serious offense constituting misconduct. Plagiarism covers a
variety of inappropriate behaviors, including:
− Failure to properly document a source
− Using copyright material from the internet or databases
− Collusion between students
For further information on our policies and procedures, please refer to
https://www.rmit.edu.au/students/student-essentials/rights-and-responsibilities/academic-integrity
Marking Guide
• A late submission incurs a penalty of 10% of the marks for every 24 hours (or part thereof) that it is late.
• If unexpected circumstances affect your ability to complete the assignment, you can apply for special consideration.
− For special consideration requests of up to 7*24 hours, email the course coordinator directly with supporting evidence.
− Requests for special consideration of more than 7*24 hours must be made via University Special
Consideration: https://www.rmit.edu.au/students/student-essentials/assessment-and-exams/assessment/special-consideration.