Assignment 2
Fall 2024
Deadline: Oct 10, 23:59 PST
2. Requirements
2.1 Programming Requirements
a. You must use Python to implement all tasks, and you may use only the Python standard library
(external libraries such as numpy or pandas are not allowed). There will be a 10% bonus for each task
if you also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You must use only the Spark RDD API, so that you learn low-level Spark operations. You will not
receive any points if you use Spark DataFrame or DataSet.
We will use the specified library versions to compile and test your code. You will receive no points
if we cannot run your code on Vocareum.
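For reference, here is a minimal, non-authoritative sketch of reading the input as an RDD and grouping
it into baskets. All variable names, the header-skip step, and the argument position are assumptions
for illustration, not requirements of the assignment:

    import sys
    from pyspark import SparkContext

    # Hypothetical setup: read the CSV and collect each user's
    # business_ids into a basket. sys.argv[3] follows the Task 1
    # execution example (<input_file_path>).
    sc = SparkContext(appName="assignment2")
    sc.setLogLevel("ERROR")
    lines = sc.textFile(sys.argv[3])
    header = lines.first()                        # assumes a header row
    baskets = (lines.filter(lambda r: r != header)
                    .map(lambda r: r.split(","))
                    .map(lambda c: (c[0], c[1]))  # (user_id, business_id)
                    .groupByKey()
                    .mapValues(lambda v: sorted(set(v))))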
3. Datasets
In this assignment, you will use one simulated dataset and one real-world dataset.
In Task 1, you will build and test your program with a small simulated CSV file that has been provided
to you.
Then, in Task 2, you will generate a subset of the Ta Feng dataset (https://bit.ly/2miWqFS) with a
structure similar to the simulated data.
Figure 1 shows the file structure of the Task 1 simulated CSV: the first column is user_id and the
second column is business_id.
4. Tasks
In this assignment, you will implement the SON algorithm on top of the Spark framework to solve both
tasks (Task 1 and Task 2). You need to find all possible combinations of frequent itemsets in any
given input file within the required time. You can refer to Chapter 6 of the Mining of Massive
Datasets book, concentrating on Section 6.4, Limited-Pass Algorithms. (Hint: you can choose the
A-Priori, Multihash, or PCY algorithm to process each chunk of the data.)
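To make the two-pass structure concrete, below is a hedged sketch of the SON skeleton on Spark RDDs.
It is illustrative only: apriori() here is a naive level-wise miner that you should replace with a
proper A-Priori/PCY/Multihash pass for acceptable speed, and all names are placeholders, not part of
the required interface:

    from itertools import combinations
    from operator import add

    def apriori(baskets, threshold):
        # Placeholder chunk miner: naive level-wise counting.
        # Too slow for real data; swap in A-Priori/PCY/Multihash.
        baskets = [set(b) for b in baskets]
        items = sorted({i for b in baskets for i in b})
        frequent, prev, k = [], set(), 1
        while True:
            if k == 1:
                candidates = [frozenset([i]) for i in items]
            else:
                # keep only k-sets whose every (k-1)-subset was frequent
                candidates = [frozenset(c) for c in combinations(items, k)
                              if all(frozenset(s) in prev
                                     for s in combinations(c, k - 1))]
            level = {c for c in candidates
                     if sum(1 for b in baskets if c <= b) >= threshold}
            if not level:
                return frequent
            frequent.extend(level)
            prev, k = level, k + 1

    def son(sc, baskets_rdd, support):
        n = baskets_rdd.count()

        # Pass 1: mine each chunk with a proportionally lowered threshold.
        def mine_chunk(partition):
            chunk = list(partition)
            return apriori(chunk, support * len(chunk) / n)

        candidates = (baskets_rdd.mapPartitions(mine_chunk)
                                 .distinct()
                                 .collect())

        # Pass 2: count every candidate exactly over all baskets.
        bc = sc.broadcast(candidates)
        frequent = (baskets_rdd
                    .flatMap(lambda b: [(c, 1) for c in bc.value
                                        if c <= set(b)])
                    .reduceByKey(add)
                    .filter(lambda kv: kv[1] >= support)
                    .keys()
                    .collect())
        return candidates, frequent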
4.1 Task 1: Simulated data (3 pts)
There are two CSV files (small1.csv and small2.csv) on Vocareum under ‘/resource/asnlib/publicdata’.
small1.csv is a test file that you can use to debug your code. For Task 1, we will only test your
code on small2.csv.
In this task, you need to build two kinds of market-basket models.
Output format:
1. Runtime: the total execution time from loading the file to finishing writing the output file. You
need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.
2. Output file:
(1) Intermediate result
You should use “Candidates:” as the tag. For each line, output the candidate frequent itemsets you
found after the first pass of the SON algorithm, followed by an empty line after each combination.
The printed itemsets must be sorted in lexicographical order (both user_id and business_id are
strings).
(2) Final result
You should use “Frequent Itemsets:” as the tag. For each line, output the final frequent itemsets you
found after finishing the SON algorithm. The format is the same as the intermediate results. The
printed itemsets must be sorted in lexicographical order.
Here is an example of the output file:
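Since the original example figure is not reproduced here, the following is only a hypothetical
rendering based on the format description above; the exact punctuation and spacing must match the
figure in the handout:

    Candidates:
    ('100'), ('101'), ('102')

    ('100', '101'), ('100', '102')

    Frequent Itemsets:
    ('100'), ('101')

    ('100', '101')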
Both the intermediate results and final results should be saved in ONE output result file.
Execution example:
Python: spark-submit task1.py <case number> <support> <input_file_path> <output_file_path>
Scala: spark-submit --class task1 hw2.jar <case number> <support> <input_file_path>
<output_file_path>
4.2 Task 2: Ta Feng data
Input format:
1. Filter threshold: Integer used to filter out unqualified users.
2. Support: Integer that defines the minimum count to qualify as a frequent itemset.
3. Input file path: This is the path to the input file including path, file name and extension.
4. Output file path: This is the path to the output file including path, file name and extension.
Output format:
1. Runtime: the total execution time from loading the file to finishing writing the output file. You
need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.
2. Output file
The output file format is the same as Task 1. Both the intermediate results and final results should
be saved in ONE output result file.
Execution example:
Python: spark-submit task2.py <filter threshold> <support> <input_file_path> <output_file_path>
Scala: spark-submit --class task2 hw2.jar <filter threshold> <support> <input_file_path>
<output_file_path>
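As a reference for the runtime requirement, here is a minimal sketch of argument handling and the
required “Duration” line, following the Task 2 execution example above; the variable names are
placeholders:

    import sys
    import time

    start = time.time()
    filter_threshold = int(sys.argv[1])
    support = int(sys.argv[2])
    input_file_path = sys.argv[3]
    output_file_path = sys.argv[4]
    # ... run SON and write both result sections to output_file_path ...
    print("Duration: %d" % (time.time() - start))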
5. Evaluation Metric
Task 1:
Input File Case Support Runtime (sec)
small2.csv 1 4 <=200
small2.csv 2 9 <=100
Task 2:
Input File Filter Threshold Support Runtime (sec)
Customer_product.csv 20 50 <=500
6. Grading Criteria
(A % penalty is deducted from the points you would otherwise earn.)
1. You can use your 5 free late days separately or together.
a. https://forms.gle/S7nsS1QXKe2bysvC6
b. This form records the number of late days you use for each assignment. We will not
count late days if no request is submitted, so remember to submit the request BEFORE
the deadline.
2. There will be a 10% bonus if you use both Scala and Python.
3. We will compare your code against all the code we can find on the web (e.g., GitHub) as well as
other students’ code from this and previous sections for plagiarism detection. If plagiarism is
detected, you will receive no points for the entire assignment, and we will report all detected
plagiarism.
4. All submissions will be graded on Vocareum. Please strictly follow the format provided; otherwise,
you will not receive points even if the answer is correct.
5. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty.
6. If you use Spark DataFrame, DataSet, or Spark SQL, there will be a 20% penalty.
7. We can regrade your assignment within seven days after the scores are released. No disputes will
be considered after one week.
8. There will be a 20% penalty for late submissions within a week, and no points after a week.
9. The Scala bonus will be calculated only when your Python results are correct. There is no
partial credit for Scala. See the example below:
Example situations
Task Score for Python Score for Scala Total