Project3 Handout

CS145: Data Management and Data Systems

Stanford University, Fall 2018

Project 3: Querying, Visualizing, Predicting -- The Full Data Cycle

20% of Course Grade

Proposal Due Date: Friday, November 9th, 11:59PM

Project Due Date: Friday, November 30th, 11:59PM

Overview

Welcome to the final CS145 project! You've gotten a good amount of experience with SQL by now.
Congrats! In the first project you learned how to navigate a reasonably complicated dataset and extract
facts from it. In the second project, you learned how to think about the design and tradeoffs of different
schema designs for real datasets -- you also got experience visualizing and reasoning about the
information in these datasets (and you probably realized that real-world data is not as nice and structured
as we'd like!).

In this assignment, you will use the tools you've learned throughout the quarter to follow your own
explorations on a topic of your choice. You will pick your own dataset and come up with one or more
interesting questions that you want to explore, just like how you began to explore what makes a Git repo
popular using the GitHub dataset for Project 2. You will finish by using machine learning to make
predictions related to your question.

This project may be done alone or in pairs.

For this project, we will also require submission of a very short proposal (see Task A for more details),
detailing your choice of dataset and question as well as your group. The intent of the proposal is for CAs
to give feedback on whether or not your dataset and question are appropriate for this project; it is worth 5%
of the project grade and must be submitted by Friday, November 9th at 11:59 PM. No late days can be
used for this proposal.

Note: Though this assignment can be done in pairs, late days are applied individually. If a pair submits
late, any member without late days will receive a zero.

Task A: Project Proposal (5%)

This proposal is not meant to be a long assignment; you should not need to write much more than two or
three small paragraphs. To submit your proposal, go to the Project 3 - Proposal assignment on
Gradescope and fill out the questions directly on Gradescope. The proposal will ask about what dataset
you plan to use and what question(s) you’d like to tackle regarding that dataset.

The proposal is worth 5% of your project grade and is due on Friday, November 9th. Don’t worry too
much about it; as long as you turn it in and fill out the form with reasonable detail, you will receive full
credit. After you turn it in, the CAs will quickly go through the proposals to ensure that the dataset and
question you choose to explore are rich and complex enough for this project; by Tuesday, November
13th, you will have feedback on your proposal. We expect that in most cases, it will be a simple go-ahead;
for some, you may be asked to choose a dataset with greater size or complexity or to choose a question
with more depth. In other cases, we may advise you to decrease the scope of your project.

You are allowed to change your choice of dataset and question after you submit your proposal; however,
you will have to take initiative and come to OH if you want to verify that your new choices are
appropriate for this project.

Task B: Introduction to Machine Learning (10%)

For this project, you will be using BigQuery’s machine learning features to make predictions related to
your question of choice. Do not worry if you’ve never studied machine learning before! We have created
a Colab notebook, project3_MLwarmup.ipynb, that will guide you through the fundamentals of what
you need to know to complete this project. You can access the notebook from the course website, make a
copy of it in your own drive, and begin the assignment.

Read the notebook and fill out the questions where specified.

This notebook will be worth 10% of your project grade.

Task C: Your Own Data Cycle (85%)

For the main portion of Project 3, you will be exploring a question that interests you on a BigQuery
dataset of your choice. Please create a Colab notebook containing all of your work. You will use what
you’ve learned from Project 1 and Project 2 to begin your explorations on the dataset and your question.
Once you’ve done your explorations, you will use machine learning to make predictions relevant to your
interesting question.

As an example, take the GitHub dataset from Project 2. The question we were trying to answer there was
what factors impact the popularity of a GitHub repository. Given this question, once we finish our
explorations and decide what features were important, we can use machine learning to predict the
popularity of various repositories based on those features.
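
As a rough sketch of what that prediction step might look like, the Python below only builds the three
BigQuery ML statements (train, evaluate, predict) as strings. The model name, table name, column names,
and the precomputed split column are all hypothetical placeholders, not part of the course materials,
and linear regression is just one possible model type.

```python
# Hypothetical names: replace these with your own dataset, table, and columns.
MODEL = "my_dataset.repo_popularity_model"  # assumed model name
TABLE = "my_dataset.repo_features"          # assumed feature table
LABEL = "num_stars"                         # assumed label column
FEATURES = ["num_commits", "num_contributors", "num_files"]  # assumed features


def select_features(split, with_label=True):
    """Build the SELECT that feeds training, evaluation, or prediction."""
    cols = FEATURES + ([LABEL] if with_label else [])
    # Assumes the table has a precomputed 'split' column (train/eval/test).
    return f"SELECT {', '.join(cols)} FROM `{TABLE}` WHERE split = '{split}'"


def create_model_sql():
    """CREATE MODEL statement: linear regression on the training rows."""
    return (
        f"CREATE OR REPLACE MODEL `{MODEL}` "
        f"OPTIONS(model_type='linear_reg', input_label_cols=['{LABEL}']) AS "
        + select_features("train")
    )


def evaluate_sql():
    """ML.EVALUATE on held-out rows that still carry the label."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{MODEL}`, ({select_features('eval')}))"


def predict_sql():
    """ML.PREDICT on rows the model never saw; the label is left out."""
    inner = select_features("test", with_label=False)
    return f"SELECT * FROM ML.PREDICT(MODEL `{MODEL}`, ({inner}))"


for statement in (create_model_sql(), evaluate_sql(), predict_sql()):
    print(statement)
    print()
```

In a Colab notebook, each printed statement could then be run against BigQuery in the same way you ran
queries in Projects 1 and 2. Keeping the training, evaluation, and prediction rows disjoint is what makes
the evaluation numbers meaningful.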

This part of the project is largely open-ended; the specific explorations you choose to do are up to you to
decide. However, we expect your project will contain at least the following sections:

● Analysis of your dataset (10%)
○ Comment on how the dataset you chose is organized. Some questions you may think
about are: Is there redundant data? What are the relationships between tables? If there is
only a single table, what are the keys of the table? Are there keys (in the FD sense, or
maybe in the OKV sense) at all? Are the tables normalized in some way (e.g. BCNF) or
are they denormalized? What are the tradeoffs between the design of the dataset you
chose as it is and other alternatives? Why do you think the original authors of the dataset
chose this particular structure?
○ Your analysis should demonstrate that you understand the structure of the data you are
working with.
○ You may supplement your analysis with concrete examples to corroborate your
statements. For example, your analysis could benefit from having an E/R diagram to
reference in some cases.
● Exploring your questions, with appropriate visualizations (60%)
○ Use SQL in your Colab notebook to gather information about your dataset -- you should
use your plotting library of choice to visualize and understand your data. Ask and answer
quantitative questions that revolve around your central questions. Your exploration
should both shed light on the questions that you’ve asked yourself and provide insights
for the feature engineering you will use when you generate predictions.
○ In this section, conclude with an analysis and summary of your observations. What
features of your data seem especially prominent or related to your question? Are there
anomalies? What trends do you see? Have you answered your questions?
○ Your analysis in this section should demonstrate that you have a reasonable quantitative
understanding of your dataset. By the end of this section, you should be ready to use
your newly-gained domain knowledge to generate predictions.
● Predictions based on your explorations (20%)
○ Use BigQuery to make predictions related to the questions you set out to ask. In the
GitHub example in Project 2, the prediction task would be, given that I have a good set of
features about a GitHub repo, can I predict how popular that repo is?
○ Evaluate your model in BigQuery. Comment on the performance of your model. Is it
good? Is it bad? In either case, why do you think your model does well or badly?
○ Finally, use your model to make predictions on data other than the data you trained on
(i.e. the data you used to create your model).
● Conclusion (10%)
○ What have you learned? What conclusions have you made or been unable to make about
your dataset and why? What is obvious, and what did you not expect to see? Support your
statements with the charts or predictions you generated. If you had more time, what other
data exploration would you pursue?
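
For the dataset-analysis section, questions such as "what are the keys?" or "does this functional
dependency hold?" can be checked empirically on a sample of rows. The sketch below is one minimal way to
do that in plain Python; the toy rows and column names are made up for illustration. Note that a sample
can only refute a key or FD, never prove that it holds for the whole dataset.

```python
def is_key(rows, columns):
    """True if no two rows agree on every column in `columns`."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in columns)
        if value in seen:
            return False
        seen.add(value)
    return True


def fd_holds(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds on these rows."""
    mapping = {}
    for row in rows:
        left = tuple(row[c] for c in lhs)
        right = tuple(row[c] for c in rhs)
        # setdefault stores the first right-hand side seen for this left-hand side
        if mapping.setdefault(left, right) != right:
            return False
    return True


# Made-up sample rows, standing in for the result of a SQL query.
rows = [
    {"repo": "a/x", "owner": "a", "language": "Python", "owner_country": "US"},
    {"repo": "a/y", "owner": "a", "language": "Go", "owner_country": "US"},
    {"repo": "b/z", "owner": "b", "language": "Python", "owner_country": "DE"},
]

print(is_key(rows, ["repo"]))                        # True
print(is_key(rows, ["owner"]))                       # False: owner "a" repeats
print(fd_holds(rows, ["owner"], ["owner_country"]))  # True
print(fd_holds(rows, ["language"], ["owner"]))       # False: Python maps to a and b
```

Here owner -> owner_country holds while owner is not a key of the table, which is the kind of
redundancy a BCNF decomposition would remove.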

Note that we are generally more concerned with the depth with which you’ve leveraged visualizations and
BigQuery to gain insight into your questions than the quality of your machine learning predictions. This
is an open-ended project -- there is no right or wrong answer, only quality of exploration and inquiry.

This part will be worth 85% of the project grade.

Honor Code
As in all Stanford classes, you are expected to follow the Stanford Honor Code. For example, the
following activities are prohibited and will be treated as Honor Code violations (this is not intended to be
a complete list of Honor Code violations):
● Submitting code that you did not write personally, with the exception of project code written by
your partner.
● Consulting pre-existing solutions for problem sets and projects (such as solutions posted on the
Internet).
● Posting your solutions on the Internet or making them available to other students in any form.
You are allowed to discuss general approaches and issues with other students in the class besides your
team members. It's also fine to give other students help finding bugs if they are stuck, or to answer
general questions. However, any code you write must be written by you and your partner, from scratch, without
consulting existing solutions. We reserve the right to use computer software such as MOSS to analyze
material that you submit in order to detect duplication with other students or existing solutions.
A general way to think about this is that if a particular activity significantly short-circuits the learning
process (it saves you time but reduces the amount you learn and/or figure out on your own), or if it
misrepresents your capabilities or accomplishments, then it is probably an Honor Code violation.

Submission Instructions

Proposal Submission:
Please directly answer the proposal questions in the Project 3 - Proposal assignment on Gradescope.

Project Submission:
1. Download the Machine Learning Warmup Colab notebook as an iPython notebook - you can do
this by going to File > Download .ipynb. Submit it to the Project 3 - ML Warmup iPython
assignment on Gradescope.
2. Download your personal exploration Colab notebook as an iPython notebook. Submit the iPython
notebook to the Project 3 - Personal Question iPython assignment on Gradescope. Please name
your file according to the format firstname_lastname.ipynb for single-person submissions,
and for two-person submissions firstname1_lastname1_firstname2_lastname2.ipynb
(e.g. ben_parks.ipynb or ben_parks_jennie_chen.ipynb).
3. For each of your submissions on Gradescope, make sure to add your project partner as a
group. You can do this after submitting by clicking View or Edit Group underneath your name
and searching for your partner.
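
The file-naming format in step 2 can be sketched as a small helper. This is only an illustration: the
function name is hypothetical, and lowercasing the names is an assumption inferred from the examples
above rather than a stated rule.

```python
def notebook_filename(*members):
    """Build the submission filename from one or two (first, last) name pairs."""
    if not 1 <= len(members) <= 2:
        raise ValueError("a submission has one or two group members")
    parts = []
    for first, last in members:
        # Lowercasing is an assumption based on the handout's examples.
        parts.append(first.strip().lower())
        parts.append(last.strip().lower())
    return "_".join(parts) + ".ipynb"


print(notebook_filename(("Ben", "Parks")))                       # ben_parks.ipynb
print(notebook_filename(("Ben", "Parks"), ("Jennie", "Chen")))   # ben_parks_jennie_chen.ipynb
```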

In total, for this project you should be submitting to two different assignments on Gradescope, one for
each iPython file.

Note: We reserve the right to deduct points from your project if you do not follow the submission
instructions. Please also leave yourself enough time to do the assignment/submission!

You may resubmit as many times as you like; however, only the latest submission and timestamp will be
saved, and we will use your latest submission for grading your work and determining any late penalties
that may apply. Submissions via email will not be accepted.
