COMP551 Fall 2020 P1
Preamble
• Quiz TAs: Arna Ghosh and Howard Huang
• This mini-project is due on October 16th at 11:59pm EST. Late work will be automatically subject to a 20%
penalty, and can be submitted up to 5 days after the deadline. No submissions will be accepted after this 5-day
period.
• This mini-project is to be completed in groups of three. There are three tasks outlined below which offer one
possible division of labour, but how you split the work is up to you. All members of a group will receive the
same grade. It is not expected that all team members will contribute equally to all components. However, every
team member should make integral contributions to the project.
• You will submit your assignment on MyCourses as a group. You must register your group on MyCourses and
any group member can submit. See MyCourses for details.
• You are free to use libraries with general utilities, such as matplotlib, numpy, scipy, pandas and sklearn for
Python.
Background
In this mini-project, you will be exploring two COVID-19-related datasets. The goal is to gain experience in deploying
unsupervised and supervised machine learning techniques to tackle a real-world data science problem. You are
encouraged to explore techniques you have learned in class to visualize the data and thereafter form a hypothesis about
possible patterns in the data.
2. Load the datasets into Pandas dataframes or NumPy objects (i.e., arrays or matrices) in Python.
3. Clean the data. Are there any symptoms that have no search data available? Do all regions have valid
hospitalization data (you can assume a region has valid hospitalization data if it has sufficient non-zero entries)?
You should remove regions and features that have too many missing or invalid data entries.
4. Merge the two datasets. Note that the time resolution differs between the two datasets: the search trends data is
weekly, whereas the hospitalization data is daily. Your task is to bring both datasets to
weekly resolution and then merge them into one array (NumPy or Pandas).
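The cleaning and merging steps above can be sketched as follows. This is a minimal illustration on toy data, not the expected solution: the column names (`region`, `date`, `week`, `hospitalized_new`, `symptom_fever`, `symptom_rare`), the 50% missing-value threshold, and the choice to label weeks by their Monday are all assumptions made for the example, so check them against the real datasets before reusing any of it.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_frac=0.5):
    """Keep only columns whose fraction of missing values is at most the threshold."""
    keep = df.columns[df.isna().mean() <= max_missing_frac]
    return df[keep]


def daily_to_weekly(daily, value_col):
    """Sum daily counts per region into weeks labelled by their Monday."""
    daily = daily.copy()
    daily["date"] = pd.to_datetime(daily["date"])
    # weekday is 0 for Monday, so subtracting it snaps each date back to its Monday.
    daily["week"] = daily["date"] - pd.to_timedelta(daily["date"].dt.weekday, unit="D")
    return daily.groupby(["region", "week"], as_index=False)[value_col].sum()


# Toy daily hospitalization data: two regions, 14 days starting Monday 2020-08-03.
days = pd.date_range("2020-08-03", periods=14, freq="D")
hosp_daily = pd.DataFrame({
    "region": ["A"] * 14 + ["B"] * 14,
    "date": list(days) * 2,
    "hospitalized_new": [1] * 14 + [2] * 14,
})

# Toy weekly search data; the hypothetical "symptom_rare" column has no data at all.
weeks = pd.date_range("2020-08-03", periods=2, freq="7D")
search_weekly = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "week": list(weeks) * 2,
    "symptom_fever": [0.5, 0.7, 0.4, 0.9],
    "symptom_rare": [np.nan] * 4,
})

search_clean = drop_sparse_columns(search_weekly)
hosp_weekly = daily_to_weekly(hosp_daily, "hospitalized_new")

# Inner join keeps only region-weeks present in both datasets.
merged = search_clean.merge(hosp_weekly, on=["region", "week"], how="inner")
print(merged)
```

Summing is a natural aggregate for daily new-case counts. A cumulative column would need a different aggregate (for example, the last value of the week), and the week boundaries must match whatever convention the weekly search data uses.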
3. Explore using a clustering method – e.g., k-means – to evaluate possible groups in the search trends dataset. Do
the clusters remain consistent for raw as well as PCA-reduced data?
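One possible way to check whether clusters stay consistent between raw and PCA-reduced data is to cluster both and compare the two label assignments with the adjusted Rand index, which ignores how the clusters happen to be numbered. The sketch below uses synthetic data as a stand-in for the search trends features. The number of clusters, the number of principal components, and the use of the adjusted Rand index are illustrative assumptions, not requirements of the task.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic stand-in: three well-separated groups of 50 points in 5 dimensions.
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 5)) for c in centers])

# Project onto the first two principal components.
X_reduced = PCA(n_components=2).fit_transform(X)

# Cluster the raw and the reduced data with the same settings.
labels_raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_pca = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)

# A value of 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(labels_raw, labels_pca)
print(f"Adjusted Rand index, raw vs PCA-reduced: {ari:.3f}")
```

On real data the agreement will usually be lower than on this deliberately easy example, and it is worth repeating the comparison for several values of `n_clusters` and `n_components`.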
3. [Optional] Explore other prediction strategies. For example, one strategy could be to learn separate models for
predicting hospitalization in each region or cluster from Task 2.
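As a rough sketch of the optional per-region strategy, one could fit an independent regressor on each region's rows and route every prediction to the model for its region. The data here is random noise used only to show the mechanics. The region labels, the feature matrix, the choice of a decision tree, and the helper name `predict_by_region` are all hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical data: 3 regions with 20 weekly rows each and 4 features.
regions = np.repeat(["A", "B", "C"], 20)
X = rng.normal(size=(60, 4))
y = rng.normal(size=60)

# Fit one model per region, using only that region's rows.
models = {}
for region in np.unique(regions):
    mask = regions == region
    models[str(region)] = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X[mask], y[mask])


def predict_by_region(models, X, regions):
    """Predict each row with the model fitted for its region (NaN if no model exists)."""
    preds = np.full(len(X), np.nan)
    for region, model in models.items():
        mask = regions == region
        if mask.any():
            preds[mask] = model.predict(X[mask])
    return preds


preds = predict_by_region(models, X, regions)
print(preds[:5])
```

Per-region models see far less training data than a single pooled model, so they should be compared against the pooled baseline on held-out data rather than assumed to be better.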
Deliverables
You must submit two separate files to MyCourses (using the exact filenames and file types outlined below):
1. code.zip: Your data processing, classification and evaluation code (as some combination of .py and .ipynb files).
2. writeup.pdf: Your (max 5-page) project write-up as a PDF (details below).
Project write-up
Your team must submit a project write-up that is a maximum of five pages (single-spaced, 11pt font or larger,
minimum 0.5-inch margins). One extra page may be used for references/bibliographical content. We highly recommend
that students use LaTeX to complete their write-ups. You have some flexibility in how you report your results, but
you must adhere to the following structure and minimum requirements:
Abstract (100-250 words)
Summarize the project task and your most important findings. For example, include sentences like “In this project we
investigated the performance of two regression models, namely k-nearest neighbours and decision trees, on predicting
COVID-19 hospitalization cases from related symptom searches” and “We found that the k-nearest neighbour regression
approach achieved worse/better accuracy than decision trees and was significantly faster/slower to train.”
3. A comparison of regression performance (mean squared error or mean absolute error) between KNN and decision
trees on the aforementioned cross-validation schemes
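A comparison of this kind could be produced along the following lines. This is a sketch under stated assumptions: the data is synthetic, the hyperparameters are arbitrary, and a plain shuffled 5-fold split stands in for the project's own cross-validation schemes, which should be substituted through the `cv` argument.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data: the target depends mostly on the first feature.
X = rng.normal(size=(200, 5))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=200)

# Stand-in splitter; replace with the splitter that matches your validation scheme.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "knn": KNeighborsRegressor(n_neighbors=5),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
}

# scikit-learn reports errors as negated scores so that higher is always better.
scoring = {"mse": "neg_mean_squared_error", "mae": "neg_mean_absolute_error"}

results = {}
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {
        "mse": -scores["test_mse"].mean(),
        "mae": -scores["test_mae"].mean(),
    }
    print(f"{name}: MSE={results[name]['mse']:.3f}, MAE={results[name]['mae']:.3f}")
```

Using the same splitter object and the same folds for both models keeps the comparison fair, and the `results` dictionary maps directly onto a table in the write-up.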
Evaluation
The mini-project is out of 100 points, and the evaluation breakdown is as follows:
– Did you go beyond the bare minimum requirements for the write-up (e.g., by including a discussion of
related work in the introduction)?
– Do you effectively present numerical results (e.g., via tables or figures)?
• Originality / creativity (15 points)
– Did you go beyond the bare minimum requirements for the experiments? For example, did you investigate
which features are the most useful (e.g., by correlating them with your predictions or removing them from
your data)?
– Did you use other publicly available data to run more interesting experiments (e.g., using neighbourhood
information or weather conditions for different states)? This could potentially give you better performance
on the validation set.
– Within the context of producing the required results, did you propose a creative idea?
– Note: Simply adding in a random new experiment will not guarantee a high grade on this section! You
should be thoughtful and organized in your report in explaining why you performed an additional experiment
and how it helped in evaluating your hypothesis.
Final Remarks
You are expected to display initiative, creativity, scientific rigour, critical thinking, and good communication skills.
You don’t need to restrict yourself to the requirements listed above - feel free to go beyond, and explore further.
You can discuss methods and technical issues with members of other teams, but you cannot share any code or
data with other teams.