
ECS 116 Databases for Non-Majors / Data Management for Data Science

Programming Assignment 3

Prelude
1. The goal of this programming assignment is to provide some hands-on experience with MongoDB, one of the
early and still widely used NoSQL databases.

2. The assignment is worth 10 points.

3. The assignment is due Sunday, June 1, 2025, at 11:59 pm.

4. This assignment is to be completed in teams of 3 or 4 people.

5. Each team member should have a full installation of the data and working code on their laptop, so that they
can run everything. (Teammates might develop different parts of the code, but in the end each teammate
should be able to run everything on their own machine. Optionally, teammates may also create some or all
parts of the code on their own.) In particular, each teammate will be required to create the json files and the
csv file described below, all based on execution on their own machine.

6. Each team should create a single document that includes the jointly written paragraphs describing the results
obtained (Part 6).

7. Some parts of this project may take a long time to run. (E.g., one part of Step 1.1 can take about an hour or
more.) So be sure to start on the project early, and plan to complete it well before the deadline. Also, if you
have a very slow machine, then contact the professor about having a modified requirement for some parts.

8. As with Programming Assignments 1 and 2, ChatGPT and/or other LLMs can be used, and we ask that you
give a brief statement about how they were used and what each teammate's experience was.

9. Late submissions will be graded according to the late policy. Specifically, 10% of the grade is deducted if you
are up to 24 hours late, 20% is deducted if you are 24 to 48 hours late, and no credit is given if the assignment
is turned in after 48 hours.

10. Plagiarism is strictly prohibited. You’re free to discuss high-level concepts amongst your peers. However,
cheating will result in no points on the assignment and reporting to OSSJA.

Step 1: Creating MongoDB collection holding listings with embedded reviews, using df and dict operations
For this step you are to:

1. Create a notebook that populates the MongoDB on your laptop with the full set of 37,434 listing objects, with
reviews data embedded, as illustrated in Loading Local MongoDB with Listings & Reviews-vXX.ipynb.

2. The collection you create should be named listings_with_reviews.

3. The notebook that you submit for this part should be named 1--Building-listings-with-reviews-using-python.ipynb.

4. Once you have MongoDB populated, you are to add 4 cells to your notebook with pymongo scripts/queries that
do the following (an illustrative sketch of these four queries appears at the end of this Step):

• Query 1: Output is the number of listings whose last_review date is between February 1, 2021, and
March 15, 2023, inclusive.

• Query 2: Output is the number of listings that have an array of reviews with length at least 50. (You may
want to take inspiration from https://stackoverflow.com/questions/41918605/mongodb-find-array-length-greater-than-specified-size.)

• Query 3: Output is the number of listings that have a review containing the word "awesome" (case
sensitive) OR a review containing the word "amazing" (case sensitive).

• Query 4: Same as Query 3, but ignoring case.

5. Please prepare a csv file named Step1_query_counts.csv. It should have the following properties:

(a) The csv file has a header row.

(b) Column 1 is labeled "Query Number"

(c) Column 2 is labeled "Count"

(d) There are 4 rows, corresponding to the 4 queries, in that order.

(e) The entries of the first column are simply 1, 2, 3, and 4.

(f) The entries of the second column should be the counts that you obtained for each of the 4 queries by
running your notebook on your machine.

Here is an example of the format for the csv file that you are to create. The Count values here are completely
made up.

Query Number    Count
1               10
2               150
3               23
4               104

Table 1: Example structure of the table in the csv file to be submitted

(You can create the csv file by hand, i.e., you do not need to have your notebook produce it.)

6. Final Note: In this Step, you brought slightly pre-processed data from PostgreSQL into your PyMongo
environment, and used python and pandas to further format the data for insertion into a MongoDB collection.
That approach was not time efficient – it may have taken around 50 to 70 minutes for some of the processing.
In Steps 2, 3 and 4, you will do things primarily within the MongoDB system, and primarily using the
aggregate operation with pipelines. You will see that if you can do something within MongoDB, then it is
generally much more time efficient.
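
As referenced in item 4 above, here is a minimal sketch of the four counting queries. The database name airbnb
and the review-text field reviews.comments are assumptions, not part of the assignment; adjust them to match
your own setup:

    import csv
    from datetime import datetime
    from pymongo import MongoClient

    coll = MongoClient('localhost', 27017)['airbnb']['listings_with_reviews']

    # Query 1: last_review between Feb 1, 2021 and Mar 15, 2023, inclusive.
    q1 = coll.count_documents({'last_review': {'$gte': datetime(2021, 2, 1),
                                               '$lte': datetime(2023, 3, 15)}})

    # Query 2: the reviews array has at least 50 elements (element 49 exists).
    q2 = coll.count_documents({'reviews.49': {'$exists': True}})

    # Query 3: some review contains 'awesome' OR 'amazing' (case sensitive).
    q3 = coll.count_documents({'$or': [
        {'reviews.comments': {'$regex': 'awesome'}},
        {'reviews.comments': {'$regex': 'amazing'}}]})

    # Query 4: same as Query 3, but case insensitive ('i' option).
    q4 = coll.count_documents({'$or': [
        {'reviews.comments': {'$regex': 'awesome', '$options': 'i'}},
        {'reviews.comments': {'$regex': 'amazing', '$options': 'i'}}]})

    # Write the counts into Step1_query_counts.csv (this can also be done by hand).
    with open('Step1_query_counts.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Query Number', 'Count'])
        for i, count in enumerate([q1, q2, q3, q4], start=1):
            writer.writerow([i, count])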

Step 2: Creating MongoDB collection holding listings with embedded
calendar availability, using an aggregation pipeline
For this step, please refer to the notebook "2--Loading-Local-MongoDB-with-calendar-csv-data--vXX.ipynb" in the
PA3 Materials folder in Canvas. That notebook shows how you can load the calendar.csv file into a dataframe, and
from there into MongoDB. In the notebook, the resulting collection is called "calendar". The datatype for fields
holding dates should be datetime, and any boolean fields should have type Boolean.
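
For illustration, here is a minimal sketch of that loading step. The column names (listing_id, date, available,
price, minimum_nights, maximum_nights), the 't'/'f' encoding of available, and the database name are all
assumptions based on the Airbnb-style calendar.csv, not the official notebook code:

    import pandas as pd
    from pymongo import MongoClient

    df = pd.read_csv('calendar.csv', dtype={'listing_id': str})
    df['date'] = pd.to_datetime(df['date'])                          # datetime type
    df['available'] = df['available'].map({'t': True, 'f': False})   # Boolean type
    df['price'] = pd.to_numeric(df['price'].str.replace(r'[$,]', '', regex=True))

    db = MongoClient('localhost', 27017)['airbnb']   # assumed database name
    db.calendar.drop()                               # start fresh if re-running
    db.calendar.insert_many(df.to_dict('records'))   # pandas Timestamps encode as BSON dates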

Your goal is to build a notebook, called "2--Building-listings-with-calendar-using-aggregate.ipynb", that creates
another collection, called "listings_with_calendar", which includes a document for each listing. The document
should have fields for

1. id: holds the listing id value of the listing

2. average_price: holds the average price across all of the records in calendar.csv associated with this listing,
having numeric type (integer or float is OK)

3. first_available_date: holds the minimum date of any calendar record associated with the listing, having
type datetime

4. last_available_date: holds the maximum date of any calendar record associated with the listing, having type
datetime

5. dates_list: holds an array of documents, one for each calendar entry associated with the listing. Each of
these documents should include the following fields:

(a) date, of type datetime

(b) available, of type Boolean

(c) price, of type numeric (not string)

(d) minimum_nights, of type integer

(e) maximum_nights, of type integer

The notebook includes some code illustrating how you can check the data types of documents in a MongoDB
collection. It also includes a function convert_lwc_to_json that illustrates how you can convert MongoDB
documents into python dictionaries that can be written into json files.

For this step of the Programming Assignment, you are to:

1. Create a pipeline specification that will build the listings with calendar collection, and use that pipeline as part
of the notebook to create the collection. Here are some notes:

(a) In one approach to building a pipeline that works, the first step is a $group operator. That can be used
to define how the scalar fields are to be populated. Also, to populate the dates_list field you can use
the $push operator, which forms an array of all elements that are being grouped.

The second (and final) step is to use the $out operator to write the result of the aggregation into the
collection listings_with_calendar. (An illustrative sketch of such a pipeline appears at the end of this Step.)

(b) When you are working to define the pipeline, you may want to work with a small collection that corresponds
to, e.g., the first 5000 documents in calendar.

(c) Your collection listings_with_calendar should hold 37,431 documents. (Why is that three less than the
number of documents in the collection listings_with_reviews that you built for Step 1 above?)

2. Select a subset of the collection which holds documents for all listings whose id has prefix '1001', convert
these documents into something that can be written into a json file, and write them into a file named
listings_with_calendar_subset_1001.json. This file is to be included in your zip submission. Here are
some notes:

(a) The notebook "2--Loading-Local-MongoDB-with-calendar-csv-data--vXX.ipynb" provides an illustration
of how to convert data from MongoDB into dictionaries that can be written out to json files.

(b) You can find a json file similar to the file you are to produce in Canvas in the PA3 Materials folder; it is
named listings_with_calendar_subset_avg_price_18370.json.

Note: One good way to inspect a big json document is to open a new tab in Firefox (not Chrome) and
then drag the file into that browser window.

(c) Your file should hold 28 documents.
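
As mentioned in note 1(a), one workable pipeline pairs a $group with an $out. Here is a minimal sketch, assuming
the calendar field names used above and an assumed database name; note that $group's required _id key serves as
the listing's id here:

    from pymongo import MongoClient

    db = MongoClient('localhost', 27017)['airbnb']   # assumed database name

    pipeline = [
        # Group the calendar records by listing, computing the scalar fields and
        # pushing one sub-document per calendar entry into dates_list.
        {'$group': {
            '_id': '$listing_id',
            'average_price': {'$avg': '$price'},
            'first_available_date': {'$min': '$date'},
            'last_available_date': {'$max': '$date'},
            'dates_list': {'$push': {
                'date': '$date',
                'available': '$available',
                'price': '$price',
                'minimum_nights': '$minimum_nights',
                'maximum_nights': '$maximum_nights'}}}},
        # Write the result of the aggregation into the target collection.
        {'$out': 'listings_with_calendar'},
    ]

    # allowDiskUse lets the $group spill to disk, since the pushed arrays are large.
    db.calendar.aggregate(pipeline, allowDiskUse=True)
    print(db.listings_with_calendar.count_documents({}))   # expect 37,431

The subset extraction for listings_with_calendar_subset_1001.json can then follow the convert_lwc_to_json
pattern from the provided notebook.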

Step 3: Creating MongoDB collection holding listings with embedded reviews, using an aggregation pipeline
In this step you will revisit the goal of Step 1, which was to build a MongoDB collection listings_with_reviews.
But for this step, you will use the aggregate function rather than doing things with pandas and python.

Specifically, you are to build a notebook called "3--Building-listings-with-reviews-using-aggregate.ipynb" that:

1. Imports the listings.csv and reviews.csv files into dataframes. (As in the notebook 2--Building-listings-with-
calendar-using-aggregate.ipynb, as you import listings.csv into a dataframe make sure that the datatypes for
id and host_id are strings, and similarly for reviews.csv and field names id, listing_id, and reviewer_id.)

2. Modifies the dataframe for listings in the following ways:

(a) It has only the 18 columns for the listings table that were used in Step 1. (Use a command like
df_listings.drop(cols_to_drop, axis=1, inplace=True), where cols_to_drop is a list of all columns
to be dropped from the dataframe.)

(b) The types of the price and reviews_per_month columns are converted to numeric using commands like:
df_listings['reviews_per_month'] = pd.to_numeric(df_listings['reviews_per_month']). (For
price you may need to drop some '$' and ',' characters.)

3. Puts these dataframes into MongoDB collections listings and reviews. (Please ensure that date columns
are converted to the datetime data type, and address issues with NaT, as in the notebook 2--Building-listings-
with-calendar-using-aggregate.ipynb.)

4. Using an aggregation pipeline, builds a collection listings_with_reviews_m. This should hold data very
similar to the collection listings_with_reviews that you built for Step 1. (There are some minor differences
because of some operations performed in Step 1 vis-a-vis some operations performed here. Can you find them?)

Some notes:

(a) One way to build the pipeline would be to start with a $lookup, and then use $out to write the output
into the target collection. (An illustrative sketch appears at the end of this Step.)

(b) IMPORTANT NOTE: To make your pipeline run quickly (e.g., in about 7 to 15 seconds), you should
create an index on 'listing_id' in your db.reviews collection. You can use a command such as the following:

db.reviews.create_index(’listing_id’)

(If you don’t have this index, your pipeline would probably run for 2 to 4 hours!)

5. Finally, produce a file "listings_with_reviews_m_subset_1001.json" that holds documents that correspond to
the documents in your listings_with_reviews_m collection whose id value has prefix "1001". Some notes:

(a) In addition to dealing with the datetime values, you will have to modify the ObjectId values (make them
strings) and the NaN values (test for that using the math.isnan function, and map to None).

(b) Your file should hold 28 documents, as in Step 2.
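
As referenced in note (a), here is a minimal sketch of the $lookup-based pipeline plus the json-serialization
clean-up from item 5. It assumes the listings and reviews collections built in items 1-3 above; the helper name
to_json_ready and the database name are hypothetical:

    import json
    import math
    from datetime import datetime
    from bson import ObjectId
    from pymongo import MongoClient

    db = MongoClient('localhost', 27017)['airbnb']   # assumed database name

    # Without this index, the $lookup would scan db.reviews once per listing.
    db.reviews.create_index('listing_id')

    db.listings.aggregate([
        {'$lookup': {'from': 'reviews',
                     'localField': 'id',           # listing id in db.listings
                     'foreignField': 'listing_id',
                     'as': 'reviews'}},
        {'$out': 'listings_with_reviews_m'},
    ])

    def to_json_ready(value):
        # Recursively map BSON and NaN values to json-serializable ones.
        if isinstance(value, ObjectId):
            return str(value)                      # ObjectId -> string
        if isinstance(value, datetime):
            return value.isoformat()               # datetime -> ISO string
        if isinstance(value, float) and math.isnan(value):
            return None                            # NaN -> null
        if isinstance(value, dict):
            return {k: to_json_ready(v) for k, v in value.items()}
        if isinstance(value, list):
            return [to_json_ready(v) for v in value]
        return value

    docs = [to_json_ready(d) for d in
            db.listings_with_reviews_m.find({'id': {'$regex': '^1001'}})]
    with open('listings_with_reviews_m_subset_1001.json', 'w') as f:
        json.dump(docs, f, indent=2)
    print(len(docs))                               # expect 28 documents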

Step 4: Creating MongoDB collection holding listings with embedded data for both reviews and calendar availability
For this step, you are to create a notebook named "4--Building-listings-with-reviews-and-cal.ipynb" that forms a
kind of join of your collections listings_with_reviews_m and listings_with_calendar, and puts it into a collection
called listings_with_reviews_and_cal. In particular, each document in the collection listings_with_reviews_and_cal
should include information about a listing, including

1. all scalar fields about that listing from both listings_with_reviews_m and listings_with_calendar, except
for the id field from listings_with_calendar.

2. a field reviews, holding the array of data about reviews associated with the listing

3. a field dates_list holding the array of data about calendar entries associated with the listing.

Here are some notes about one way to build a pipeline to create the collection listings_with_reviews_and_cal;
a sketch combining these notes appears after the list.

1. Run the aggregate command on the listings_with_reviews_m collection.

Remember that listings_with_reviews_m has 3 listings that listings_with_calendar does not have.

2. Start the pipeline with a $lookup that will form, intuitively speaking, something close to the left join of
listings_with_reviews_m and listings_with_calendar. In my $lookup, I used the name cal_docs for
holding the array of docs from listings_with_calendar.

3. Now use the $unwind operator on the cal_docs field. This has the effect of breaking each cal_docs array
into separate documents. The intermediate result after the $unwind is quite close to being the left join of
listings_with_reviews_m and listings_with_calendar. In order to retain data about the three listings not
in listings_with_calendar, you need to use the following formulation:

{ '$unwind': { 'path': '$cal_docs',
               'preserveNullAndEmptyArrays': True
             }
},

4. Now use an $addFields operator to add in the fields for average_price, first_available_date,
last_available_date, and dates_list. For each of these you will need a formulation something like:

'first_available_date': '$$ROOT.cal_docs.first_available_date',

What is $$ROOT here? This step of the pipeline is basically operating on a stream of documents. For a given
document, $$ROOT refers to the root of that document.

5. Now use an $unset operator to remove the cal_docs field.

6. Finally, use $out to write the output of the pipeline into the collection listings with reviews and cal.
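
Here is a minimal sketch combining notes 1 through 6. It assumes that the _id of each listings_with_calendar
document holds the listing id (as in the $group sketch for Step 2), and an assumed database name; everything
else follows the notes above:

    from pymongo import MongoClient

    db = MongoClient('localhost', 27017)['airbnb']   # assumed database name

    db.listings_with_reviews_m.aggregate([
        # Note 2: something close to a left join against listings_with_calendar.
        {'$lookup': {'from': 'listings_with_calendar',
                     'localField': 'id',
                     'foreignField': '_id',   # assumed to hold the listing id
                     'as': 'cal_docs'}},
        # Note 3: unwind, keeping the 3 listings with no calendar data.
        {'$unwind': {'path': '$cal_docs',
                     'preserveNullAndEmptyArrays': True}},
        # Note 4: copy the calendar fields up to the top level.
        {'$addFields': {
            'average_price': '$$ROOT.cal_docs.average_price',
            'first_available_date': '$$ROOT.cal_docs.first_available_date',
            'last_available_date': '$$ROOT.cal_docs.last_available_date',
            'dates_list': '$$ROOT.cal_docs.dates_list'}},
        # Note 5: remove the now-redundant cal_docs field.
        {'$unset': 'cal_docs'},
        # Note 6: write the output of the pipeline into the target collection.
        {'$out': 'listings_with_reviews_and_cal'},
    ])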

As with Steps 2 and 3, you are to produce a json file called "listings_with_reviews_and_cal_subset_1001.json"
that holds documents that correspond to the documents in your listings_with_reviews_and_cal collection whose
id value has prefix "1001".

Step 5: Comments about what you observed


As a team, write a short paragraph for each of the following questions based on your work on this assignment.

1. For Part 3, if you include the index then the pipeline runs in 1 or 2 minutes, but if you leave the index out
then it would take about 4 hours. For this question, assume that on a particular machine it takes 2 minutes
with index, and 4 hours without index.

Compute the (approximate) time it takes for MongoDB to make one full scan of the db.reviews collection.

Compute the (approximate) time it takes, on average, for MongoDB to perform an index-based retrieval of all
documents in db.reviews having a particular listing_id value.

Hint: The pipeline used for this Step is essentially doing a left-join of db.listings with db.reviews.

2. For Part 2 we did not use an index. Why does your pipeline for Part 2 run in a minute or two, even though
the calendar.csv file has many more entries than the reviews.csv file?

3. For Part 4 we again did not use an index. Why does your pipeline for Part 4 run in a minute or two, even
though both collections being joined have 37K+ documents in them?

4. Would your pipeline for Part 4 run faster if you included an index on id for one or both of the collections?

Step 6: Submission Instructions


There are 2 components to your submission:

1. Each TEAM should create a single pdf report with the filename Assignment-3-TEAM-REPORT.pdf. (We have
realized that when a file is submitted into Canvas, it is automatically pre-pended with the student's last name
and first name.) The report should include the following named sections, in this order:

(a) "Teammates": List your teammate names here.

(b) "Statement about ChatGPT (and/or other LLMs)": Please include here a short statement about which
teammates, if any, used ChatGPT and/or other LLMs to help to generate any of your code. If there
was use of LLMs, then please indicate who used them and for what purposes. Also, please describe the
experience: was it helpful or not, and how/why?

(c) "Statement about distributed work": Please include here a short statement about which team members
did which work to create your codebase.

(d) "Comments on Observed Performance": Include here 4 subsections that include the answers to the 4
questions posed in Step 5 above.

(e) "References": If you used any outside sources, please include them in this section. (If you didn't use any
outside sources, then include the statement "We did not use any outside sources".)

2. Each STUDENT should submit a zip including several files. The name of the file should have the form
PA 3.zip. The zip should include the following things:

(a) The report Assignment-3-TEAM-REPORT.pdf produced by your team.

(b) The following json files:

i. listings_with_reviews_m_subset_1001.json

ii. listings_with_calendar_subset_1001.json

iii. listings_with_reviews_and_cal_subset_1001.json

(c) The csv file that you produced for Part 1, with file name "Step1_query_counts.csv"

(d) All notebooks and helper function files that you used to create the json files and visualizations on your
machine. (This code might have been developed individually by you and/or jointly by your team.) Please
use the suggested file names for the notebooks, and meaningful file names for other files.
