CS145: Data Management and Data Systems

Stanford University, Fall 2019

Project 1: Exploring NCAA Basketball Data

10% of Course Grade

Due Date: Friday, October 11th, 11:59PM

Overview
Welcome to CS145! Throughout the course you will be using Google’s BigQuery platform to
gain hands-on practice with real-world data systems.

BigQuery is Google’s service for big data. Throughout the three class projects, we will be using
BigQuery’s basic SQL querying interface, its interface with Colaboratory [1], and its built-in
machine learning features.

Google has published many datasets on BigQuery -- these range from StackOverflow statistics to
real-time air quality data. In this first part of the course project you will be using BigQuery’s
SQL interface to answer questions about the NCAA Basketball Dataset. To find the dataset in
BigQuery, follow the instructions in the Getting Started with BigQuery support document.

Please note: this is a solo project. You may discuss ideas at a high level with other students, but
all work should be your own. Please note the names and SUNet IDs of any students you
collaborate with.

Task A: Getting Set Up

Before proceeding, make sure you have read and understood the Getting Started with BigQuery
support document (available on the course website) which describes how to get up and running
with your BigQuery account, how to manage your course credit, and so on.

Note: This is a very important step as we will be unable to give extra Google Cloud credit to
students who use up all of their credit. If you have any questions about Google Cloud, your
account, or your credits, check Piazza for similar questions or make a post yourself.

[1] Colaboratory is a free document collaboration tool built on top of Jupyter. You can think of it as a Jupyter
notebook, except that you are able to collaborate on it with multiple people.
Task B: Familiarize yourself with the NCAA Basketball Dataset
Now that you’ve oriented yourself in BigQuery, your second task is to examine the schemas and
the descriptions of the NCAA Basketball dataset tables and understand the data that you will be
working with.

You may try running some simple queries over the tables to get a feel for them, or use
BigQuery’s “Preview” tab to see what the data looks like.

Some notes:
● _sr stands for “Sportradar”, which is a company that collects sports data, down to the x/y
coordinates of events (shot attempted, rebound, turnover, foul).
● The historical data makes a distinction between tournament games and regular season
games. Please make sure you're using the right table!
● “pbp” means play-by-play, which is very granular data about each event that happens in
the game.

Task C: Querying!
Now that you’ve gotten comfortable and familiar with BigQuery and its SQL querying interface,
let’s get to work and answer some questions about the NCAA Basketball dataset.

We intend for part of this assignment to be about how to translate a question in plain English into
the terms of a schema. In other words, we want you to read the tables, explore the data, and think about
which tables and columns are necessary to answer the question we’re asking. This skill is
necessary for the remainder of the projects, and it is exactly how real-world data querying and
analysis work!

Your queries should be fairly efficient -- they should each take at most ten seconds to execute on
BigQuery, and most of them will finish in under about 4 seconds. If any of your queries are
taking much longer than that, you’ve probably written them in a particularly inefficient way;
please try rewriting them, and see the course staff if you need help.

Please also check to make sure you’re not querying more than a couple of GB of data - we’ve
specifically chosen this dataset so that no one need exhaust their credits completing the
assignment. All the queries you write should fit within the 1TB of free querying you’re allotted
for the month.

You can save your queries for each question from the BigQuery interface directly, or you can
keep track of your queries in separate files yourself. Remember that you can use BigQuery’s
“Query History” tab to inspect previous queries you’ve run.

Note: When querying in BigQuery, table names should be wrapped in backticks (`). For
example, instead of saying:
SELECT * FROM bigquery-public-data.ncaa_basketball.mascots
say:
SELECT * FROM `bigquery-public-data.ncaa_basketball.mascots`
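
Since the queries eventually end up in a Python file, the backtick rule can be kept consistent with a small helper. The sketch below is only an illustration: `quote_table` is a hypothetical name, not part of any course-provided code, and the likely reason the backticks matter here (the hyphens in the project ID) is an assumption rather than something this handout states.

```python
def quote_table(project: str, dataset: str, table: str) -> str:
    """Return a fully qualified table path wrapped in backticks (hypothetical helper)."""
    for part in (project, dataset, table):
        # An empty part or an embedded backtick would produce a malformed path.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return f"`{project}.{dataset}.{table}`"


# Build the example query from the note above.
table_path = quote_table("bigquery-public-data", "ncaa_basketball", "mascots")
query = f"SELECT * FROM {table_path}"
print(query)
```

The printed string is the second (backticked) form shown in the note.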

Questions:

We will provide answers for these questions so that you can check your work. Please make sure
your output from BigQuery matches these answers, both in terms of values and ordering.
Read instructions carefully; if we ask for rounded answers, we may deduct points for not
rounding.

Note: While matching these answers is a good sanity check, it does not guarantee a perfect
score. The datasets we will use to grade your assignment may not perfectly match the datasets
on BigQuery; therefore, make sure that your queries are generalizable to other datasets (given
that schemas are identical).

We reserve the right to deduct points from your project if your queries are hard-coded in some
way or are not generalizable to other tables.
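
To make the difference between a hard-coded and a generalizable query concrete, here is a minimal sketch. It runs on Python's built-in sqlite3 module rather than BigQuery, and the table `toy_venues` with its columns is invented for illustration; it is not one of the NCAA tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_venues (team TEXT, venue_name TEXT, capacity INTEGER)")
conn.executemany(
    "INSERT INTO toy_venues VALUES (?, ?, ?)",
    [("Team A", "Arena One", 5000), ("Team B", "Arena Two", 9000)],
)

# Hard-coded: returns the same literal no matter what the table contains.
hard_coded = "SELECT 'Arena Two' AS venue_name, 9000 AS capacity"

# Generalizable: derives the answer from whatever rows are present.
generalizable = """
    SELECT venue_name, capacity
    FROM toy_venues
    WHERE capacity = (SELECT MAX(capacity) FROM toy_venues)
"""

before = (conn.execute(hard_coded).fetchall(), conn.execute(generalizable).fetchall())

# Simulate a grading dataset: same schema, different rows.
conn.execute("INSERT INTO toy_venues VALUES ('Team C', 'Arena Three', 12000)")
after = (conn.execute(hard_coded).fetchall(), conn.execute(generalizable).fetchall())

print(before)
print(after)
```

Both queries agree on the original rows, but only the second one tracks the data once a new row is added.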

For the following questions, unless otherwise specified, a game can be either a tournament
game or a regular season game.

Write standard SQL queries to answer the following questions:

1. (1 point) What is the name and capacity of Stanford’s NCAA basketball team venue?

Answer:
Row  venue_name       venue_capacity
1    Maples Pavilion  7392

2. (1 point) How many games were played in Stanford’s venue in the 2013-2014 season?

Answer:
Row games_at_stanford
1 16

3. (1 point) Hexadecimal color codes are a way of representing color on a computer. Hex
color codes have the form #AABBCC, where AA, BB, and CC are hexadecimal numbers
(00, 01, … , FE, FF) indicating the intensity of red, green, and blue in the color,
respectively.

Hint: be careful with the case of the colors in the dataset -- some use lower case
characters and some use upper case characters. Note that in the expected answer below,
the original case from the dataset is kept.

What teams have the maximum possible red intensity in their color? Give (team market,
color) as your answer. Order your results alphabetically by team market.

Answer:
Row  market              color
1    Idaho State         #ff7800
2    Morehead State      #ffc300
3    North Carolina A&T  #ffb82b
4    Northern Colorado   #ffb500
5    Oklahoma State      #FF6600
6    Pacific             #ff6900
7    South Dakota        #ff2310
8    Syracuse            #ff5113
9    Tennessee-Martin    #ff6900
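
The #AABBCC definition and the hint about letter case can be illustrated outside SQL. The following Python sketch is not the query the question asks for; `red_intensity` is a hypothetical helper, and the last two sample colors are made up for contrast (the first two appear in the expected answer).

```python
def red_intensity(hex_color: str) -> int:
    """Return the red component (0-255) of a '#AABBCC' color string."""
    if len(hex_color) != 7 or not hex_color.startswith("#"):
        raise ValueError(f"expected '#AABBCC', got {hex_color!r}")
    # int(..., 16) accepts both upper- and lower-case hex digits,
    # so '#ff...' and '#FF...' are treated the same.
    return int(hex_color[1:3], 16)


sample = ["#ff7800", "#FF6600", "#8C1515", "#00ff00"]

# Keep the original spelling of each color, as the expected answer does.
max_red = [color for color in sample if red_intensity(color) == 255]
print(max_red)
```

Comparing the parsed number rather than the raw text is one way to stay case-insensitive while still reporting the color exactly as stored.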

4. (1 point) How many home games has Stanford won in seasons 2013 to 2017 (inclusive)?
Give (number of games won, average score for Stanford in those games, average score of
the opponents in those games) as your answer. Round any decimal values to two places.

Answer:
Depending on which table you use for your query, you may get slightly different values.
Either of the following results is acceptable.

Row  number  avg_stanford  avg_opponent
1    71      78.04         64.21

Row  number  avg_stanford  avg_opponent
1    71      78.07         64.13

5. (2 points) How many players have been on a team based in the same city where they
were born? For this question, please only use the player’s birth city and state (do not
include the player’s birth country).

Answer:
Row num_players
1 606

6. (2 points) What is the biggest margin of victory in the historical tournament data? Output
the winning team name, losing team name, winning team points, losing team points, and
the win margin of that game.

Answer:
Row  win_name  lose_name  win_pts  lose_pts  margin
1    Jayhawks  Panthers   110      52        58

7. (3 points) In a basketball tournament, teams are ranked from best to worst prior to
starting the matches. This ranking is called the “seed” of the team (1 is the best team, and
a higher number indicates a worse team). In general, a higher ranked team is expected to
beat a lower ranked team.

Definition: An upset occurs whenever a team with seed A beats a team with seed B, and
A > B.

What percentage of historical tournament games are upsets? Round to two decimal
places. For example, if 50.2489% of games are upsets, your query should return 50.25.

Answer:
Row upset_percentage
1 27.26
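
As a quick check of the upset definition and the rounding rule, here is a small Python sketch on invented seed pairs (not taken from the dataset). `upset_percentage` is a hypothetical helper. Note that Python's `round` and BigQuery's `ROUND` can differ on exact halfway cases, so treat this only as an illustration of the arithmetic.

```python
def upset_percentage(games):
    """games: iterable of (winner_seed, loser_seed) pairs; returns a percentage."""
    games = list(games)
    if not games:
        raise ValueError("no games")
    # An upset is a win by the team with the numerically larger (worse) seed.
    upsets = sum(1 for win_seed, lose_seed in games if win_seed > lose_seed)
    return round(100 * upsets / len(games), 2)


# Made-up tournament results: (winner seed, loser seed).
toy_games = [(1, 16), (12, 5), (2, 15), (9, 8), (3, 14), (4, 13), (11, 6)]
print(upset_percentage(toy_games))
```

Three of the seven toy games are upsets, so the function returns 42.86 (3/7 is 42.857...%).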

8. (3 points) Which pairs of NCAA basketball teams are 1) based in the same state and 2)
have the same team color? Output the team names and the state. Put the team name that
comes alphabetically first in each pair on the leftmost column, and order the rows
alphabetically by the first column.

Answer:
Row  teamA       teamB        state
1    Bearcats    Norse        KY
2    Cougars     Red Raiders  TX
3    Razorbacks  Red Wolves   AR
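
Listing each pair once, with the alphabetically first name on the left, is a common self-join pattern. The sketch below shows that pattern on an invented table using Python's sqlite3 module; `toy_teams` and its columns are assumptions for illustration and do not reflect the real NCAA schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_teams (name TEXT, state TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO toy_teams VALUES (?, ?, ?)",
    [
        ("Owls", "XX", "#112233"),
        ("Bears", "XX", "#112233"),
        ("Hawks", "YY", "#112233"),  # same color, different state
        ("Lions", "XX", "#445566"),  # same state, different color
    ],
)

pairs = conn.execute(
    """
    SELECT a.name AS teamA, b.name AS teamB, a.state
    FROM toy_teams AS a
    JOIN toy_teams AS b
      ON a.state = b.state
     AND a.color = b.color
     AND a.name < b.name   -- each pair appears once, smaller name on the left
    ORDER BY teamA
    """
).fetchall()
print(pairs)
```

Using `<` instead of `<>` in the join condition is what prevents both (Bears, Owls) and (Owls, Bears) from appearing.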

9. Definition: A geographical location L is a unique tuple (city, state, country).

Definition: A geographical location L “makes” points for a team T whenever a player that
was born in L scores points for T.

(3 points) What three geographical locations made the most points for Stanford’s team in
seasons 2013 through 2017, and how many points did they make?

Restrictions:
- For the purposes of this query, avoid using the “birth_place” column.

Answer:
Row  city         state  country  total_points
1    Phoenix      AZ     USA      2223
2    Minneapolis  MN     USA      1427
3    Rock Island  IL     USA      1399

10. (4 points) Since the start of the 2013 season, which teams have had more than 5 players
score 15 or more points in the first half in a single game? Note: These players did not all
have to score 15+ points in the first half of the same game.

Output the top 5 team markets and the number of players for each team meeting this
criterion, from most to least, breaking ties by team market in alphabetical order.

Answer:
Row  team_market  num_players
1    Kentucky     14
2    Oregon       14
3    UCLA         14
4    Duke         13
5    Marquette    13

11. Definition: Team X is a top performer in season Y if no other team had more wins than
X in the same season. This includes teams with either null or non-null markets.

(4 points) What five teams (identify them here by their “markets”) were top performers in
the most seasons between 1900 and 2000 (inclusive), and how many times were they top
performers? Output the team markets and the number of times each team was a top
performer. If there are ties in the final output, break them by giving a higher ranking to
team markets that come first alphabetically. Ignore teams with NULL markets only in the
final output.

Answer:
Row  team_market                            top_performer_count
1    University of California, Los Angeles  6
2    University of Kentucky                 6
3    Texas Southern University              5
4    University of Pennsylvania             5
5    Western Kentucky University            5
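
One consequence of the top-performer definition is that a season can have several top performers when teams tie for the most wins. The Python sketch below works through that on made-up win counts; the team labels, seasons, and the helper `top_performers` are all hypothetical.

```python
from collections import Counter


def top_performers(wins_by_season):
    """Map each season to the set of teams that no other team out-won."""
    result = {}
    for season, wins in wins_by_season.items():
        best = max(wins.values())
        # Every team tied at the maximum counts as a top performer.
        result[season] = {team for team, w in wins.items() if w == best}
    return result


toy = {
    1950: {"A": 20, "B": 20, "C": 11},  # A and B tie
    1951: {"A": 18, "B": 25, "C": 9},
    1952: {"A": 22, "B": 17, "C": 22},  # A and C tie
}

tops = top_performers(toy)
counts = Counter(team for teams in tops.values() for team in teams)

# Most seasons first; ties broken alphabetically, as the question requires.
ranking = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranking)
```

Here A and B are each top performers in two seasons and C in one, so the ranking lists A before B.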

Submission Instructions
Once you have written queries that answer all questions and conform to the given result schemas,
you’re ready to submit.

To submit:
1. Copy all the queries you wrote for Task C into the project1_submission.py file
(available on the course website), pasting all of your queries into the corresponding
places.
2. If you collaborated with others to generate your queries, add their names and SUNet IDs
to the comment at the top of the project1_submission.py file.
3. Submit this Python file on Gradescope. In order to be correctly graded, the file must
be named project1_submission.py.

You may resubmit as many times as you like; however, only the latest submission and timestamp
will be saved, and we will use your latest submission for grading your work and determining any
late penalties that may apply. Submissions via email will not be accepted!

IMPORTANT SUBMISSION NOTES:

When you submit to Gradescope, we will run a syntax checker that will make sure that your
SQL runs OK. It should run immediately and return whether the query ran OK or if there were
errors - please make sure that you get a positive result from this test in your final
submission.

You will not see a final grade until after the project deadline. The answers are provided
above for the questions so that you may check your work yourself. It is your responsibility to
ensure that your final submission is free from Python or SQL syntax errors and that you follow
all instructions in this section.

We reserve the right to deduct points from your project if you do not follow the submission
instructions, or if you have syntax errors in your queries.

FAQ

Question:
I’m getting syntax errors when I submit to Gradescope, but I don’t see these syntax errors when I
run on BigQuery.

Answer:
Some things to check:
● Were queries copied correctly to the submission file?
● Are you using Standard SQL on BigQuery?
● Did you use backticks around table names?
Otherwise, the errors may come from our autograder, which does not use the same SQL dialect as
BigQuery. For example, our autograder does not support window functions (OVER,
PARTITION BY, etc.) and will throw an error when it encounters them. If you are sure that your query works
correctly (and is generalizable) with BigQuery SQL, then don’t worry; we’ll be doing a
manual pass through submissions, and as long as your query is correct using Standard SQL
syntax, you will get full credit.
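
If you would rather avoid window functions altogether, many "best row per group" queries can be written with a grouped subquery and a join. The sketch below shows that rewrite on an invented table using Python's sqlite3 module; `toy_scores` is hypothetical, and whether the autograder accepts this form is an assumption the handout does not confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_scores (team TEXT, player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO toy_scores VALUES (?, ?, ?)",
    [("A", "p1", 10), ("A", "p2", 17), ("B", "p3", 8), ("B", "p4", 8)],
)

# Top scorer(s) per team without OVER / PARTITION BY:
# compute each team's maximum, then join back to find the matching rows.
rows = conn.execute(
    """
    SELECT s.team, s.player, s.points
    FROM toy_scores AS s
    JOIN (
        SELECT team, MAX(points) AS best
        FROM toy_scores
        GROUP BY team
    ) AS m
      ON s.team = m.team AND s.points = m.best
    ORDER BY s.team, s.player
    """
).fetchall()
print(rows)
```

Ties are kept (both players on team B are returned), which matches what filtering on a rank of 1 would produce.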

Question:
Do I have to match the column names given in the solutions? For example, in question 7, do I
have to name the column “upset_percentage”?

Answer:
You can name your columns whatever you want, as long as the content matches both in ordering
and in values.
