Assignment 3 - 553
Fall 2024
Deadline: Wednesday, October 30, 11:59 PM PST
2. Assignment Requirements
2.1 Programming Language and Library Requirements
a. You must use Python to implement all tasks. You may only use standard Python libraries (i.e., external
libraries like numpy or pandas are not allowed). There will be a 10% bonus for each task (or case) if you
also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You are required to use only Spark RDDs, so that you understand Spark operations. You will not
receive any points if you use Spark DataFrame or DataSet.
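For instance, a minimal RDD-only loading pattern might look like the sketch below (the file name and variable names are illustrative, not required):

from pyspark import SparkContext

sc = SparkContext(appName="assignment3")
# Read the CSV as plain text and drop the header line; no DataFrame/DataSet APIs are used.
lines = sc.textFile("yelp_train.csv")
header = lines.first()
rows = (lines.filter(lambda line: line != header)
             .map(lambda line: line.split(","))            # -> [user_id, business_id, stars]
             .map(lambda f: (f[0], f[1], float(f[2]))))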
3. Yelp Data
In this assignment, the datasets you are going to use are from:
https://drive.google.com/drive/folders/1SufecRrgj1yWMOVdERmBBUnqz0EX7ARQ?usp=sharing
Make sure you are logged in with your USC Gmail account to view this folder.
We generated the following two datasets from the original Yelp review dataset with some filters. We
randomly took 60% of the data as the training dataset, 20% of the data as the validation dataset, and
20% of the data as the testing dataset.
● yelp_train.csv: the training data, which only includes the columns user_id, business_id, and
stars.
● yelp_val.csv: the validation data, which is in the same format as the training data.
● other datasets: providing additional information (like the average stars or location of a business)
Note: We are not sharing the test dataset. The goal of this assignment is to create a generalizable
recommender that works well on the unseen test data.
4. Tasks
Note: This assignment has been divided into 2 parts on Vocareum to provide more computational
resources.
Grading:
We will compare your output file against the ground truth file using precision and recall metrics.
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
The ground truth file has been provided on the Google Drive, named “pure_jaccard_similarity.csv”.
You can use this file to compare your results to the ground truth as well.
The ground truth dataset only contains the business pairs (from yelp_train.csv) whose Jaccard
similarity is >= 0.5. Each business pair is sorted alphabetically, so each pair appears only
once in the file (i.e., if pair (a, b) is in the dataset, (b, a) will not be there).
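As a quick sanity check, a minimal sketch like the following can compute these metrics against the ground truth (the output file name is hypothetical, and it assumes both CSVs have a header row and one pair per row):

import csv

def load_pairs(path):
    # Sort within each pair so (a, b) and (b, a) compare as equal.
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)                                   # skip the header row
        return {tuple(sorted(row[:2])) for row in reader}

truth = load_pairs("pure_jaccard_similarity.csv")
mine = load_pairs("task1_output.csv")                  # hypothetical output file name

tp = len(mine & truth)
precision = tp / len(mine)                             # TP / (TP + FP)
recall = tp / len(truth)                               # TP / (TP + FN)
print(precision, recall)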
In order to get full credit for this task, you should have precision >= 0.99 and recall >= 0.97. If not,
you will get only partial credit based on the formula:
(Precision / 0.99) * 0.4 + (Recall / 0.97) * 0.4
Your runtime should be less than 100 seconds. If your runtime is 100 seconds or more, you will not
receive any points for this task.
4.2 Task 2: Recommendation Systems (5 points)
In task 2, you are going to build different types of recommendation systems using yelp_train.csv to
predict the star ratings for given user ids and business ids. You can make any improvements to your
recommendation system in terms of speed and accuracy. You can use the validation dataset
(yelp_val.csv) to evaluate the accuracy of your recommendation systems, but please don’t include it in
your training data.
Here are some possible options for evaluating your recommendation systems (there are many more ways; a combined sketch follows this list):
1. You can compare your results to the corresponding ground truth, compute the absolute
differences, and take the mean across all data points. This is called the Mean Absolute Error (MAE).
Example:
Predicted rating: 4.73; True rating: 3.5; abs(True - Predicted) = abs(3.5 - 4.73) = 1.23
2. You can divide the absolute differences into 5 levels and count the number in each level, as in
the following example:
Rating diff. Count
>=0 and <1 n = 12,345
>=1 and <2 n = 3,506
>=2 and <3 n = 1,189
>=3 and <4 n = 36
>=4 n = 17
As shown in example 2, there are 12,345 predictions with < 1 difference from the ground
truth. This is a good sign, since most predictions fall within the smallest possible error window.
Through method 2 you can calculate the error distribution of your predictions and use it to
improve the performance of your recommendation systems.
3. You can compute the Root Mean Squared Error (RMSE) using the following formula:
RMSE = sqrt( (1/n) * Σ_i (Pred_i - Rate_i)^2 )
Where Pred_i is the prediction for business i, Rate_i is the true rating for business i, and n is the
total number of businesses you are predicting. Squared error is similar to absolute error, but
penalizes large deviations more harshly than absolute error.
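A minimal sketch combining all three options above, assuming your prediction file uses the user_id, business_id, prediction layout described below (file names are illustrative):

import csv
import math

def load_ratings(path):
    # Map (user_id, business_id) -> star value; assumes a header row.
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)
        return {(r[0], r[1]): float(r[2]) for r in reader}

truth = load_ratings("yelp_val.csv")
pred = load_ratings("predictions.csv")                    # hypothetical output file name

diffs = [abs(pred[k] - truth[k]) for k in pred if k in truth]

mae = sum(diffs) / len(diffs)                             # option 1: MAE
levels = [sum(1 for d in diffs if i <= d < i + 1) for i in range(4)]
levels.append(sum(1 for d in diffs if d >= 4))            # option 2: error distribution
rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))  # option 3: RMSE
print(mae, levels, rmse)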
Input format: (we will use the following commands to execute your code)
Case 1:
spark-submit task2_1.py <train_file_name> <test_file_name> <output_file_name>
Param: train_file_name: the name of the training file (e.g., yelp_train.csv), including the file path
Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path
Case 2:
spark-submit task2_2.py <folder_path> <test_file_name> <output_file_name>
Param: folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder.
Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path
Case 3:
spark-submit task2_3.py <folder_path> <test_file_name> <output_file_name>
Param: folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder.
Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path
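Since your scripts are executed exactly as shown above, a minimal argument-handling skeleton for task2_1.py might look like this (a sketch only; the model-building part is up to you):

import sys
from pyspark import SparkContext

# Positional arguments in exactly the Case 1 order shown above.
train_file, test_file, output_file = sys.argv[1], sys.argv[2], sys.argv[3]

sc = SparkContext(appName="task2_1")
train_rdd = sc.textFile(train_file)
test_rdd = sc.textFile(test_file)
# ... build the recommender and write predictions to output_file ...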
Output format:
a. The output file is a CSV file containing all the prediction results for each user and business pair in the
validation/testing data. The header is “user_id, business_id, prediction”. There is no requirement on the
order of rows in this task, and no requirement on the number of decimals for the prediction values. Please
refer to the format in Figure 2.
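One possible way to produce that layout with only the standard library (the file name and prediction values here are placeholders):

import csv

# predictions: iterable of (user_id, business_id, predicted_stars) tuples
predictions = [("u1", "b1", 3.87)]                     # placeholder data

with open("task2_output.csv", "w", newline="") as f:   # hypothetical output path
    writer = csv.writer(f)
    writer.writerow(["user_id", "business_id", "prediction"])
    writer.writerows(predictions)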
Grading:
We will compare your prediction results against the ground truth. We will grade all the cases in Task 2
based on your accuracy, measured by RMSE. For your reference, the table below shows the RMSE baselines and
running times for predicting the validation data. The time limit of Case 3 is set to 30 minutes because we
hope you consider this factor and try to improve on it as much as possible (hint: this will help you a lot in
the competition project at the end of the semester).
              Case 1   Case 2   Case 3
RMSE          1.09     1.00     0.99
Running Time  130s     400s     1800s
For grading, we will use the test data to evaluate your recommendation systems. If you can pass the
RMSE baselines in the above table, you should be able to pass the RMSE baselines for the test data.
However, if your recommendation system only passes the RMSE baselines for the validation data, you
will receive 50% of the points for each case.
5. Submission
You need to submit the following files on Vocareum with exactly the same names:
a. Four Python scripts:
● task1.py
● task2_1.py
● task2_2.py
● task2_3.py
b. [OPTIONAL] hw3.jar and Four Scala scripts:
● task1.scala
● task2_1.scala
● task2_2.scala
● task2_3.scala
6. Grading Criteria
(% penalty = a percentage penalty applied to the points you would otherwise receive)
1. You can use the days of your free 5-day extension separately or all together.
a. Google Forms Link for Extension: https://forms.gle/S7nsS1QXKe2bysvC6
2. There will be a 10% bonus if you use both Scala and Python.
3. We will compare your code against all the code we can find on the web (e.g., GitHub) as well as other
students’ code from this and other (previous) sections for plagiarism detection. If plagiarism is detected,
you will receive no points for the entire assignment and we will report all detected plagiarism.
4. All submissions will be graded on Vocareum. Please strictly follow the format provided, otherwise
you won’t receive points even though the answer is correct.
5. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty.
6. Do NOT use Spark DataFrame, DataSet, or Spark SQL.
7. We can regrade your assignment within seven days of the scores being released. We will not
accept any regrading requests after a week. There will be a 20% penalty if our original grading turns out
to be correct.
8. There will be a 20% penalty for late submissions within a week and no points after a week.
9. The Scala bonus is calculated only if your Python results earn points, and only if the Scala
implementation is correct; there are no partial points awarded for Scala. See the examples below:
Example situations:

Task     Score for Python                 Score for Scala (10% of previous column if correct)   Total
Task 1   Correct: 3 points                Correct: 3 * 10%                                      3.3
Task 1   Wrong: 0 points                  Correct: 0 * 10%                                      0.0
Task 1   Partially correct: 1.5 points    Correct: 1.5 * 10%                                    1.65
Task 1   Partially correct: 1.5 points    Wrong: 0                                              1.5