Assignment 2
Fall 2024
Deadline: Oct 10, 23:59 PST
2. Requirements
2.1 Programming Requirements
a. You must use Python to implement all tasks, and you may use only the Python standard library
(external libraries such as numpy or pandas are not allowed). There will be a 10% bonus for each task
if you also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You must use only the Spark RDD API, so that you learn low-level Spark operations. You will not
receive any points if you use Spark DataFrame or DataSet.
We will use the specified library versions to compile and test your code. You will receive no points
if we cannot run your code on Vocareum.
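For reference, here is a minimal, non-authoritative sketch of reading the input as an RDD and grouping
it into baskets. All variable names, the header-skip step, and the argument position are assumptions
for illustration, not requirements of the assignment:

    import sys
    from pyspark import SparkContext

    # Hypothetical setup: read the CSV and collect each user's
    # business_ids into a basket. sys.argv[3] follows the Task 1
    # execution example (<input_file_path>).
    sc = SparkContext(appName="assignment2")
    sc.setLogLevel("ERROR")
    lines = sc.textFile(sys.argv[3])
    header = lines.first()                        # assumes a header row
    baskets = (lines.filter(lambda r: r != header)
                    .map(lambda r: r.split(","))
                    .map(lambda c: (c[0], c[1]))  # (user_id, business_id)
                    .groupByKey()
                    .mapValues(lambda v: sorted(set(v))))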
3. Datasets
In this assignment, you will use one simulated dataset and one real-world dataset.
In Task 1, you will build and test your program with a small simulated CSV file that has been provided
to you.
Then, in Task 2, you will generate a subset of the Ta Feng dataset (https://bit.ly/2miWqFS) with a
structure similar to the simulated data.
Figure 1 shows the file structure of the Task 1 simulated CSV: the first column is user_id and the
second column is business_id.
4. Tasks
In this assignment, you will implement the SON algorithm on top of the Spark framework to solve both
tasks (Task 1 and Task 2). You need to find all possible combinations of frequent itemsets in any
given input file within the required time. You can refer to Chapter 6 of the Mining of Massive
Datasets book, concentrating on Section 6.4, Limited-Pass Algorithms. (Hint: you can choose the
A-Priori, Multihash, or PCY algorithm to process each chunk of the data.)
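To make the two-pass structure concrete, below is a hedged sketch of the SON skeleton on Spark RDDs.
It is illustrative only: apriori() here is a naive level-wise miner that you should replace with a
proper A-Priori/PCY/Multihash pass for acceptable speed, and all names are placeholders, not part of
the required interface:

    from itertools import combinations
    from operator import add

    def apriori(baskets, threshold):
        # Placeholder chunk miner: naive level-wise counting.
        # Too slow for real data; swap in A-Priori/PCY/Multihash.
        baskets = [set(b) for b in baskets]
        items = sorted({i for b in baskets for i in b})
        frequent, prev, k = [], set(), 1
        while True:
            if k == 1:
                candidates = [frozenset([i]) for i in items]
            else:
                # keep only k-sets whose every (k-1)-subset was frequent
                candidates = [frozenset(c) for c in combinations(items, k)
                              if all(frozenset(s) in prev
                                     for s in combinations(c, k - 1))]
            level = {c for c in candidates
                     if sum(1 for b in baskets if c <= b) >= threshold}
            if not level:
                return frequent
            frequent.extend(level)
            prev, k = level, k + 1

    def son(sc, baskets_rdd, support):
        n = baskets_rdd.count()

        # Pass 1: mine each chunk with a proportionally lowered threshold.
        def mine_chunk(partition):
            chunk = list(partition)
            return apriori(chunk, support * len(chunk) / n)

        candidates = (baskets_rdd.mapPartitions(mine_chunk)
                                 .distinct()
                                 .collect())

        # Pass 2: count every candidate exactly over all baskets.
        bc = sc.broadcast(candidates)
        frequent = (baskets_rdd
                    .flatMap(lambda b: [(c, 1) for c in bc.value
                                        if c <= set(b)])
                    .reduceByKey(add)
                    .filter(lambda kv: kv[1] >= support)
                    .keys()
                    .collect())
        return candidates, frequent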
4.1 Task 1: Simulated data (3 pts)
There are two CSV files (small1.csv and small2.csv) on Vocareum under ‘/resource/asnlib/publicdata’.
small1.csv is a test file that you can use to debug your code. For Task 1, we will only test your
code on small2.csv.
In this task, you need to build two kinds of market-basket models.
Output format:
1. Runtime: the total execution time from loading the file to finishing writing the output file. You
need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.
2. Output file:
(1) Intermediate result
You should use “Candidates:” as the tag. For each line, output the candidate frequent itemsets you
found after the first pass of the SON algorithm, followed by an empty line after each combination.
The printed itemsets must be sorted in lexicographical order (both user_id and business_id are
strings).
(2) Final result
You should use “Frequent Itemsets:” as the tag. For each line, output the final frequent itemsets you
found after finishing the SON algorithm. The format is the same as the intermediate results. The
printed itemsets must be sorted in lexicographical order.
Here is an example of the output file:
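Since the original example figure is not reproduced here, the following is only a hypothetical
rendering based on the format description above; the exact punctuation and spacing must match the
figure in the handout:

    Candidates:
    ('100'), ('101'), ('102')

    ('100', '101'), ('100', '102')

    Frequent Itemsets:
    ('100'), ('101')

    ('100', '101')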
Both the intermediate results and final results should be saved in ONE output result file.
Execution example:
Python: spark-submit task1.py <case number> <support> <input_file_path> <output_file_path>
Scala: spark-submit --class task1 hw2.jar <case number> <support> <input_file_path>
<output_file_path>
4.2 Task 2: Ta Feng data
Input format:
1. Filter threshold: Integer used to filter out unqualified users.
2. Support: Integer that defines the minimum count to qualify as a frequent itemset.
3. Input file path: This is the path to the input file including path, file name and extension.
4. Output file path: This is the path to the output file including path, file name and extension.
Output format:
1. Runtime: the total execution time from loading the file to finishing writing the output file. You
need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.
2. Output file
The output file format is the same as Task 1. Both the intermediate results and final results should
be saved in ONE output result file.
Execution example:
Python: spark-submit task2.py <filter threshold> <support> <input_file_path> <output_file_path>
Scala: spark-submit --class task2 hw2.jar <filter threshold> <support> <input_file_path>
<output_file_path>
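As a reference for the runtime requirement, here is a minimal sketch of argument handling and the
required “Duration” line, following the Task 2 execution example above; the variable names are
placeholders:

    import sys
    import time

    start = time.time()
    filter_threshold = int(sys.argv[1])
    support = int(sys.argv[2])
    input_file_path = sys.argv[3]
    output_file_path = sys.argv[4]
    # ... run SON and write both result sections to output_file_path ...
    print("Duration: %d" % (time.time() - start))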
5. Evaluation Metric
Task 1:
Input File Case Support Runtime (sec)
small2.csv 1 4 <=200
small2.csv 2 9 <=100
Task 2:
Input File Filter Threshold Support Runtime (sec)
Customer_product.csv 20 50 <=500
6. Grading Criteria
(A % penalty is deducted from the points you would otherwise earn.)
1. You can use your 5 free late days separately or together.
a. https://forms.gle/S7nsS1QXKe2bysvC6
b. This form records the number of late days you use for each assignment. We will not
count late days if no request is submitted, so remember to submit the request BEFORE
the deadline.
2. There will be a 10% bonus if you use both Scala and Python.
3. We will compare your code against all the code we can find on the web (e.g., GitHub) as well as
other students’ code from this and previous sections for plagiarism detection. If plagiarism is
detected, you will receive no points for the entire assignment, and we will report all detected
plagiarism.
4. All submissions will be graded on Vocareum. Please strictly follow the format provided; otherwise,
you will not receive points even if the answer is correct.
5. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty.
6. If you use Spark DataFrame, DataSet, or Spark SQL, there will be a 20% penalty.
7. We can regrade your assignment within seven days after the scores are released. No disputes will
be considered after one week.
8. There will be a 20% penalty for late submissions within a week, and no points after a week.
9. The Scala bonus will be calculated only when your Python results are correct. There is no
partial credit for Scala. See the example below:
Example situations
Task Score for Python Score for Scala Total