COMP551 Fall 2020 P1

This document outlines a mini-project analyzing COVID-19 search trends and hospitalization data. Students are asked to acquire, preprocess, analyze, and merge two publicly available datasets. They must visualize the search trends data, cluster it using dimensionality reduction techniques, and compare the performance of KNN and decision tree models in predicting hospitalizations from search trends. The deliverables are a code submission and a 5-page write-up summarizing data processing, results, and conclusions. The write-up must include visualizations of clustered, reduced data and a comparison of regression model performances on different validation schemes.

Uploaded by

Alain


MiniProject 1: Analyzing COVID-19 Search Trends and Hospitalization

COMP 551, Fall 2020, McGill University

October 1, 2020

Please read this entire document before beginning the assignment

Preamble
• Quiz TAs: Arna Ghosh and Howard Huang
• This mini-project is due on October 16th at 11:59pm EST. Late work will be automatically subject to a 20%
penalty, and can be submitted up to 5 days after the deadline. No submissions will be accepted after this 5-day
period.
• This mini-project is to be completed in groups of three. There are three tasks outlined below which offer one
possible division of labour, but how you split the work is up to you. All members of a group will receive the
same grade. It is not expected that all team members will contribute equally to all components. However, every
team member should make integral contributions to the project.
• You will submit your assignment on MyCourses as a group. You must register your group on MyCourses and
any group member can submit. See MyCourses for details.
• You are free to use libraries with general utilities, such as matplotlib, numpy, scipy, pandas and sklearn for
Python.

Background
In this mini-project, you will be exploring two COVID-19-related datasets. The goal is to gain experience in deploying
unsupervised and supervised machine learning techniques to tackle a real-world data science problem. You are
encouraged to explore techniques you have learned in class to visualize the data and thereafter form a hypothesis about
possible patterns in the data.

Task 1: Acquire, preprocess, and analyze the data

Your first task is to acquire the data, analyze it, and clean it (if necessary). You will use two publicly available datasets
provided by Google Research in this project, outlined below.
• Dataset 1 (Search Trends dataset): This aggregated, anonymized dataset shows trends in search patterns
for symptoms and is intended to help researchers to better understand the impact of COVID-19. Read the
dataset details and how it was generated here. We suggest using the weekly resolution dataset that can be
downloaded from the associated github repo.
• Dataset 2 (COVID hospitalization cases dataset): This is an open source dataset that aggregates public
COVID-19 data sources into a single dataset. The dataset includes time series data for COVID-19 cases, deaths,
tests, hospitalizations, discharges among other attributes. For more information about the dataset, refer to the
associated github repo. We suggest using the hospitalizations from this dataset for the United States regions
that are also present in the Search Trends dataset. You can get the data (provided under the CC-BY license)
from here.
The essential subtasks for this part of the project include:
1. Download the datasets. Hint: If you are working locally, you could try cloning the repository and then using the
csv files from your cloned version. Do not forget to mention the date (or version) of the dataset that you ended
up using in your report.

2. Load the datasets into Pandas dataframes or NumPy objects (i.e., arrays or matrices) in Python.
3. Clean the data. Are there any symptoms that have no search data available? Do all regions have valid
hospitalization data (you can assume regions to have valid hospitalization data if they have sufficient non-zero entries)?
You should remove regions and features that have too many missing or invalid data entries.

4. Merge the two datasets. Note that the time resolution is different for the two datasets: the search trends data
are weekly, whereas the hospitalization cases are at the daily resolution. Your task is to bring both datasets to
the weekly resolution and thereafter merge them into one array (NumPy or Pandas).
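
The cleaning and merging subtasks above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not a reference solution: the column names (`region`, `date`, `symptom:*`, `hospitalized_new`), the region codes, the thresholds `max_missing` and `min_nonzero`, and the assumption that weekly search rows are dated by the Monday that starts the week are all hypothetical and should be checked against the actual CSV files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
regions = ["US-AA", "US-BB", "US-CC"]  # made-up region codes
weeks = pd.date_range("2020-03-02", periods=4, freq="W-MON")
days = pd.date_range("2020-03-02", periods=28, freq="D")

# Synthetic weekly search table (column names are assumptions).
search = pd.MultiIndex.from_product(
    [regions, weeks], names=["region", "date"]
).to_frame(index=False)
search["symptom:fever"] = rng.random(len(search))
search["symptom:cough"] = rng.random(len(search))
search["symptom:empty"] = np.nan  # a symptom with no search data at all

# Synthetic daily hospitalization table; US-CC reports only zeros.
hosp = pd.MultiIndex.from_product(
    [regions, days], names=["region", "date"]
).to_frame(index=False)
hosp["hospitalized_new"] = rng.integers(1, 20, len(hosp))
hosp.loc[hosp["region"] == "US-CC", "hospitalized_new"] = 0


def clean_search(df, max_missing=0.5):
    """Drop symptom columns whose fraction of missing values exceeds max_missing."""
    symptom_cols = [c for c in df.columns if c.startswith("symptom:")]
    missing = df[symptom_cols].isna().mean()
    return df.drop(columns=missing[missing > max_missing].index)


def valid_regions(df, min_nonzero=5):
    """Keep only regions with at least min_nonzero non-zero daily entries."""
    nonzero = (df["hospitalized_new"] > 0).groupby(df["region"]).sum()
    return df[df["region"].isin(nonzero[nonzero >= min_nonzero].index)]


def to_weekly(df):
    """Sum daily counts into weeks labelled by the Monday that starts each week."""
    week_start = df["date"] - pd.to_timedelta(df["date"].dt.weekday, unit="D")
    return (
        df.assign(date=week_start)
        .groupby(["region", "date"], as_index=False)["hospitalized_new"]
        .sum()
    )


weekly_hosp = to_weekly(valid_regions(hosp))
# The inner join keeps only (region, week) pairs present in both tables.
merged = clean_search(search).merge(weekly_hosp, on=["region", "date"], how="inner")
print(merged.head())
```

Summing the daily counts is one reasonable way to aggregate new hospitalizations per week; a cumulative column would instead call for taking the last value of each week.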

Task 2: Visualize and cluster the data

Your next task is to leverage dimensionality reduction techniques and visualize the data. The subtasks for this part
include:
1. Visualize the evolution of popularity of various symptoms across different regions over time. Specifically, you
need to visualize how the distribution of search frequency of each symptom aggregated across different regions
changes over time. You may produce these plots for only some of the most popular symptoms. Hint: check out
some visualizations here for inspiration.
2. Visualize the search trends dataset in a lower dimensional space. Use Principal Component Analysis (PCA) to
reduce the data dimensionality. Hint: You may treat each time point as an independent data point.

3. Explore using a clustering method – e.g., k-means – to evaluate possible groups in the search trends dataset. Do
the clusters remain consistent for raw as well as PCA-reduced data?
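
One possible way to approach subtasks 2 and 3 is sketched below, using scikit-learn on a synthetic feature matrix. The matrix `X`, the choice of `k = 3`, the standardization step, and the use of the adjusted Rand index to compare the two clusterings are illustrative assumptions, not requirements of the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the search trends matrix: each row is one
# (region, week) data point and each column is one symptom.
centers = rng.normal(scale=5.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(40, 10)) for c in centers])

# Standardize so that no single symptom dominates the principal components.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components for plotting.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Cluster the raw (scaled) data and the PCA-reduced data separately.
k = 3  # hypothetical number of clusters
labels_raw = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
labels_pca = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_2d)

# The adjusted Rand index is invariant to how clusters are numbered:
# 1.0 means identical partitions, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(labels_raw, labels_pca)
print(f"Adjusted Rand index (raw vs PCA-reduced): {ari:.2f}")
```

A scatter plot of `X_2d` coloured by `labels_pca` (for example with matplotlib's `scatter`) would give the kind of low-dimensional visualization with cluster labels that the write-up asks for.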

Task 3: Supervised Learning

In this part, you will compare two supervised learning frameworks, namely K-nearest neighbours (KNN) and decision
trees, to predict the hospitalization cases given the search trends data. The specific subtasks for this part include:
1. Split the data into train and validation sets using two strategies – based on regions and based on time. Specifically,
in the first case, you need to keep all data from some regions in the validation set and train on the rest (keep 80%
regions in training set and 20% in validation set, doing this multiple times to estimate cross-validation results).
In the second case, you need to keep data for the last couple of timepoints (keep data after ‘2020-08-10’) from
all regions in the validation set and train on the rest of the data.
2. Compare the regression performance of KNNs and decision trees for each of the train-validation split strategies.
Note that you can report a 5-fold cross-validation performance for region-based train-validation split, wherein
you vary which regions are kept in the validation set for each fold. Please clearly report your validation error in
both cases.

3. [Optional] Explore other prediction strategies. For example, one strategy could be to learn separate models for
predicting hospitalization in each region or cluster from Task 2.
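
The two validation strategies can be sketched as follows, again on synthetic data. The feature matrix, the target, the hyperparameters (`n_neighbors=5`, `max_depth=5`), and the choice of mean absolute error are assumptions made for illustration; `GroupKFold` is used here as one convenient way to keep all rows of a region in the same fold.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic merged table: 10 made-up regions observed over 30 weeks.
dates = pd.date_range("2020-03-02", periods=30, freq="W-MON")
df = pd.MultiIndex.from_product(
    [range(10), dates], names=["region", "date"]
).to_frame(index=False)
X = rng.random((len(df), 5))  # stand-in for symptom features
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=len(df))  # stand-in target

# Factories so that every split gets a freshly initialized model.
models = {
    "knn": lambda: KNeighborsRegressor(n_neighbors=5),
    "tree": lambda: DecisionTreeRegressor(max_depth=5, random_state=0),
}


def fit_score(make_model, train, val):
    """Fit on the training rows and return the validation mean absolute error."""
    model = make_model().fit(X[train], y[train])
    return mean_absolute_error(y[val], model.predict(X[val]))


# Strategy 1: 5-fold cross-validation over regions. Each fold holds out
# about 20% of the regions, and no region appears on both sides of a split.
groups = df["region"].to_numpy()
region_cv = {
    name: np.mean(
        [fit_score(make, tr, va)
         for tr, va in GroupKFold(n_splits=5).split(X, y, groups)]
    )
    for name, make in models.items()
}

# Strategy 2: train on data up to 2020-08-10, validate on everything after.
is_val = (df["date"] > pd.Timestamp("2020-08-10")).to_numpy()
train_idx, val_idx = np.flatnonzero(~is_val), np.flatnonzero(is_val)
time_split = {name: fit_score(make, train_idx, val_idx) for name, make in models.items()}

print("region-based CV MAE:", region_cv)
print("time-based MAE:", time_split)
```

Splitting by row with a plain `KFold` would leak weeks from the same region into both training and validation sets, which is why the grouping is made explicit here.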

Deliverables
You must submit two separate files to MyCourses (using the exact filenames and file types outlined below):
1. code.zip: Your data processing, classification and evaluation code (as some combination of .py and .ipynb files).
2. writeup.pdf : Your (max 5-page) project write-up as a pdf (details below).

Project write-up
Your team must submit a project write-up that is a maximum of five pages (single-spaced, 11pt font or larger;
minimum 0.5 inch margins, an extra page for references/bibliographical content can be used). We highly recommend
that students use LaTeX to complete their write-ups. You have some flexibility in how you report your results, but
you must adhere to the following structure and minimum requirements:

Abstract (100-250 words)
Summarize the project task and your most important findings. For example, include sentences like “In this project we
investigated the performance of two regression models, namely k-nearest neighbours and decision trees, on predicting
COVID-19 hospitalization cases from related symptoms search”, “We found that the k-nearest neighbour regression
approach achieved worse/better accuracy than decision trees and was significantly faster/slower to train.”

Introduction (5+ sentences)

Summarize the project task, the two datasets, and your most important findings. This should be similar to the
abstract but more detailed. You should include background information and potential citations to relevant work (e.g.,
other papers analyzing these datasets).

Datasets (5+ sentences)

Very briefly describe the datasets and how you processed them. If you have come up with new features to get
better results, you should explain them here. Present the exploratory analysis you have done to understand the data, e.g.
visualization plots and data filtering.

Results (7+ sentences, possibly with figures or tables)

Describe the results of all the experiments mentioned in Tasks 2 and 3 (at a minimum) as well as any other interesting
results you find. At a minimum you must report:
1. A visualization of the search trends data in lower dimensions
2. Same plot as above but with cluster labels for each data point to illustrate the clustering results

3. A comparison of regression performance (mean squared error or mean absolute error) between KNN and decision
trees on the aforementioned cross-validation schemes

Discussion and Conclusion (5+ sentences)

Summarize the key takeaways from the project and possibly directions for future investigation.

Statement of Contributions (1-3 sentences)

State the breakdown of the workload across the team members.

Evaluation
The mini-project is out of 100 points, and the evaluation breakdown is as follows:

• Completeness (20 points)

– Did you submit all the materials?
– Did you run all the required experiments?
– Did you follow the guidelines for the project write-up?

• Correctness (40 points)

– Are your cross-validation schemes implemented correctly?
– Are your models used/implemented correctly?
– Are your visualizations informative and visually appealing?
– Is your reported accuracy close to (our internal) reference solutions?
– If you proposed any features, did your proposed features actually improve performance, or do you adequately
demonstrate that it was not possible to improve performance?
• Writing quality (25 points)
– Is your report clear and free of grammatical errors and typos?

– Did you go beyond the bare minimum requirements for the write-up (e.g., by including a discussion of
related work in the introduction)?
– Do you effectively present numerical results (e.g., via tables or figures)?
• Originality / creativity (15 points)

– Did you go beyond the bare minimum requirements for the experiments? For example, did you investigate
which features are the most useful (e.g., by correlating them with your predictions or removing them from
your data)?
– Did you use other publicly available data to run more interesting experiments (e.g., using neighbourhood
information, or weather conditions for different states)? This could potentially give you better performance
on the validation set.
– Within the context of producing the required results, did you propose a creative idea?
– Note: Simply adding in a random new experiment will not guarantee a high grade on this section! You
should be thoughtful and organized in your report in explaining why you performed an additional experiment
and how it helped in evaluating your hypothesis.

Final Remarks
You are expected to display initiative, creativity, scientific rigour, critical thinking, and good communication skills.
You don’t need to restrict yourself to the requirements listed above - feel free to go beyond, and explore further.
You can discuss methods and technical issues with members of other teams, but you cannot share any code or
data with other teams.
