Project 2: Exploring The Github Dataset With Colaboratory: Collaborators
In this project, you will explore one of BigQuery's public datasets on GitHub and learn to make visualizations in order to answer your questions.
This project is due on Monday, November 4th at 11:59 PM. It is worth 50 points, for 10% of your overall grade. After completing this project,
make sure to follow the submission instructions in the handout to submit on Gradescope.
Collaborators:
Please list the names and SUNet IDs of your collaborators below:
Overview
BigQuery has a massive dataset of GitHub files and statistics, including information about repositories, commits, and file contents. In this
project, we will be working with this dataset. Don't worry if you are not too familiar with Git and GitHub -- we will explain everything you need to
know to complete this part of the assignment.
Notes
The GitHub dataset available on BigQuery is actually quite massive. A single query on the "contents" table alone (it is 2.16TB!) can eat up your
1TB allowance for the month AND cut into about 10% of your GCloud credit for the class.
To make this part of the project more manageable, we have subsetted the original data. We have preserved almost all information in the original
tables, but we kept only the information on the top 500,000 most "watched" GitHub repos between January 2016 and October 2018.
You can see the tables we will be working with here. Read through the schemas to get familiar with the data. Note that some of the tables are
still quite large (the contents table is about 500GB), so you should exercise the usual caution when working with them. Before running queries
in this notebook, it's good practice to first set up query limits on your BigQuery account or check how many bytes will be billed in the web UI.
Make sure to use our subsetted dataset, not the original BigQuery dataset!
GitHub: GitHub is a source-control service provider. GitHub allows you to collaborate on and keep track of source code in a fairly efficient
way.
commit: A commit can be thought of as a change that is applied to some set of files. I.e., if some set of files is in some state A, you can
make changes to A and commit your changes to the set of files so that it is now in state B. A commit is identified by a hash of the
information in your change (the author of the commit, who actually committed [i.e. applied] the changes to the set of files, the changes
themselves, etc.)
parent commit: The commit that came before your current commit.
repo: A repo (short for repository) is GitHub's abstraction for a collection of files along with a history of commits on those files. If you have
GitHub username "foo" and you make a repository called "data-rocks", your repo name will be "foo/data-rocks". You can think of a repo's
history in terms of its commits. E.g., "foo/data-rocks" can have the set of "states" A->B->C->D, where each state change (A->B, B->C, C->D)
was due to a commit.
branch: To keep track of different commit histories, GitHub repos can have branches. The 'main' branch (i.e. commit history) of the repo is
called the 'master' branch. Say on "foo/data-rocks" we have the commit history A->B->C->D on the master branch. If someone else comes
along and decides to add a cool new feature to "foo/data-rocks", they can create a branch called "cool-new-feature" that branches away
from the master branch. All the code from the main branch will be there, but new code added to "cool-new-feature" will not be on the main
branch.
ref: For the purpose of this assignment, you can think of the 'ref' field on the "files" table as referring to the branch in which a file lives in a
repository at some point in time.
For the purposes of this question, you don't need to know about the following things in detail:
Commit trees
The encoding attribute on the commits table
If you want more clarifications about Git and GitHub in order to answer this question, be sure to post on Piazza or come to Office Hours. In
many cases, a quick web search will also help answer your question.
Section 1 | Understanding the Dataset (4 points)
lrepo_name
b) What is the primary key in github_repo_licenses ? What is the foreign key? (1 point)
c) If we were given an author and we wanted to know what language repos they like to contribute to, which tables
should we use? (1 point)
d) If we wanted to know whether using different licenses had an effect on a repo's watch count, which tables would
we use? (1 point)
github_repos and github_repo_licenses
In this section, we'll look at some inefficient queries and think about how we can make them more efficient. For this section, we'll consider
efficiency in terms of how many bytes are processed.
NOTE: We do NOT recommend running this unoptimized query in BigQuery, as it will run for a very long time (over 15 minutes, if not longer).
However, feel free to run an optimized version of this query after finishing part (c), which takes about 5 seconds to run.
This query returns a column of distinct author names, including only authors who have committed more than 20 times.
b) Briefly explain why this query is inefficient (in terms of bytes that need to be processed) and how it can be
improved to be more efficient. (1 point)
It's inefficient because, for each record in the outer loop, a full scan of the entire table is required. To improve it, we need to avoid rescanning
the entire table for each record. We can instead aggregate the table once with a GROUP BY clause and filter the groups with a HAVING clause.
c) Following from part (b), write a more efficient version of the query. (2 points)
SELECT author.name
FROM `cs145-fa19.project2.github_repo_commits`
GROUP BY author.name
HAVING COUNT(*)>20
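To see why the aggregate form works, here is a minimal local sketch using Python's built-in sqlite3 module. The single-column commits table and its rows are made up for illustration, and BigQuery's nested author.name field is flattened to a plain column -- this is a stand-in, not the real dataset:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE commits(author_name TEXT)")
# "alice" gets 25 commits, "bob" only 3
con.executemany("INSERT INTO commits VALUES (?)",
                [("alice",)] * 25 + [("bob",)] * 3)

# Aggregate once, then filter the groups -- a single scan of the table,
# instead of one full scan per record.
rows = con.execute("""
    SELECT author_name
    FROM commits
    GROUP BY author_name
    HAVING COUNT(*) > 20
""").fetchall()
print(rows)  # [('alice',)]
```

Only "alice" clears the 20-commit threshold, so only her name survives the HAVING filter.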
SELECT id
FROM (
SELECT files.id, files.mode, contents.size
FROM
`cs145-fa19.project2.github_repo_files` files,
`cs145-fa19.project2.github_repo_readme_contents` contents
WHERE files.id = contents.id
)
WHERE mode = 33188 AND size > 1000
LIMIT 10
d) Briefly explain why this query is inefficient (in terms of bytes that need to be processed) without the query
optimization and how it can be improved to be more efficient. (1 point)
This query is inefficient because the inner subquery joins the two tables and materializes the mode and size columns for every matching row
before any filtering happens. To optimize the query, we can use an INNER JOIN to combine the two tables and apply the mode and size
constraints directly in the WHERE clause, so rows are filtered as early as possible and fewer bytes flow through the query.
e) Following from part (d), write a more efficient version of the query. (2 points)
Hint: Think about the number of bytes processed by the unoptimized query. Can any operator be moved around to reduce this number?
SELECT contents.id
FROM `cs145-fa19.project2.github_repo_files` files
JOIN `cs145-fa19.project2.github_repo_readme_contents` contents
  ON files.id = contents.id
WHERE files.mode = 33188 AND contents.size > 1000
LIMIT 10
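As a local sanity check that the flattened join is equivalent to the nested version, here is a sketch using Python's sqlite3 module. The tables and rows are made-up stand-ins for the BigQuery schema, and sqlite does not model bytes billed, so this only confirms that both forms return the same ids:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE files(id INTEGER, mode INTEGER);
    CREATE TABLE contents(id INTEGER, size INTEGER);
    INSERT INTO files VALUES (1, 33188), (2, 33188), (3, 16384);
    INSERT INTO contents VALUES (1, 5000), (2, 10), (3, 9000);
""")

# Original shape: materialize the join first, filter afterwards.
nested = con.execute("""
    SELECT id FROM (
        SELECT files.id, files.mode, contents.size
        FROM files, contents
        WHERE files.id = contents.id
    ) WHERE mode = 33188 AND size > 1000
""").fetchall()

# Optimized shape: join directly and filter in the same WHERE clause.
flattened = con.execute("""
    SELECT contents.id
    FROM files JOIN contents ON files.id = contents.id
    WHERE files.mode = 33188 AND contents.size > 1000
""").fetchall()

print(nested, flattened)  # both [(1,)]
```

Only id 1 has both a regular-file mode and a size over 1000 bytes, and both query shapes agree on that.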
f) Run both the original query and your optimized query on BigQuery and pay attention to the number of bytes
processed. How do they compare, and is it what you expect? Explain why this is happening in a few sentences. (1
point)
Hint: Look at the query plan under "Execution details" in the bottom panel of BigQuery. It may be especially helpful to look at stage "S00: Input".
The average times for reads and writes of the original query are 171 and 209 respectively, versus 100 and 14 for the optimized query. As we can
see, the optimized query performs far fewer reads and, especially, far fewer writes than the original query, so the original is slower, which is
what I expected in part (d).
To learn more about writing efficient SQL queries and how BigQuery optimizes queries, check out Optimizing query computation and Query plan
and timeline.
In this section, you'll be answering questions about the dataset, similar to the first project. The difference is that instead of answering with a
query, you will be answering with a visualization. Part of this assignment is for you to think about which data (specifically, which indicators) you
should be using in order to answer a particular question, and about what type of chart/picture/visualization you should use to clearly convey
your answer.
General Instructions
For each question, you will have at least two cells -- a SQL cell where you run your query (and save the output to a data frame), and a
visualization cell, where you construct your chart. For this project, make sure that all data manipulation is done in SQL. Please do
not modify your data output using pandas or some other data library.
Please make all charts clear and readable - this includes adding axes labels, clear tick marks, clear point markers/lines/color schemes
(i.e. don't repeat colors across categories), legends, and so on.
Visualization
For this project, we will be officially supporting the use of matplotlib (https://matplotlib.org/3.0.0/tutorials/index.html), but feel free to use
another graphing library if you are more comfortable with it.
%matplotlib inline
That "%%" magic converts the cell into a SQL cell. The resulting table that is generated is saved into the variable you name on the %%bigquery line (variable in the example below).
Then in a second cell, use the library of your choice to plot the variable. Here is an example using matplotlib:
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(variable["x"], variable["y"])
plt.title("Plot Title")
plt.xlabel("X-axis label")
plt.ylabel("Y-axis label")
plt.show()
Let's get our feet wet with this data by creating the following plots:
Note that you will not receive full credit if your charts are poorly made (i.e. very unclear or unreadable).
Hints
Some of these plots will need at least one of their axes to be log-scaled in order to be readable
For more readable plots, you can use pandas.DataFrame.sample. A sample size between 1,000 and 10,000 should give you more readable
plots.
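For example, here is a sketch of that down-sampling step. The data frame and its columns are made up for illustration; in the notebook you would call sample on the data frame produced by your %%bigquery cell:

```python
import pandas as pd

# Stand-in for a query result with many rows
df = pd.DataFrame({"size": range(100_000), "watch_count": range(100_000)})

# Reproducible down-sample to 1,000 rows for a readable scatter plot
sampled = df.sample(n=1000, random_state=0)
print(len(sampled))  # 1000
```

Passing random_state makes the sample reproducible across notebook re-runs, which keeps your charts stable between submissions.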
Reminders
Be careful with your queries! Don't run SELECT * blindly on a table in this Colab notebook, since you will not get a warning of how much
data the query will consume. Always check how much data a query will consume in the BigQuery UI first -- you are also better off setting a query
limit as we described earlier.
Don't forget to use the subsetted GitHub tables we provide here, not the original ones on BigQuery.
a) Language distribution (2 points)
(x-axis: programming language, y-axis: # repos containing at least one file in that language)
To keep the chart readable, only keep the top 20 languages.
Hint: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
%%bigquery --project $project_id q3a
# YOUR QUERY HERE
SELECT name lang_name, COUNT(DISTINCT files.lrepo_name) num_repos
FROM `cs145-fa19.project2.github_repo_files` files
JOIN (
  SELECT lrepo_name, name
  FROM `cs145-fa19.project2.github_repo_languages`,
  UNNEST(language)
) lang
ON files.lrepo_name = lang.lrepo_name
GROUP BY name
ORDER BY num_repos DESC
LIMIT 20
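A corresponding plotting cell might look like the following sketch. The language names and counts here are placeholder data standing in for the q3a result; a log-scaled y-axis keeps the long tail of languages readable:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in Colab, %matplotlib inline handles this
import matplotlib.pyplot as plt

# Placeholder data standing in for q3a["lang_name"] / q3a["num_repos"]
lang_name = ["JavaScript", "HTML", "Python", "Java", "CSS"]
num_repos = [210_000, 180_000, 150_000, 120_000, 110_000]

plt.figure(figsize=(10, 5))
plt.bar(lang_name, num_repos)
plt.yscale("log")
plt.xticks(rotation=45, ha="right")
plt.title("Top languages by number of repos")
plt.xlabel("Programming language")
plt.ylabel("# repos (log scale)")
plt.tight_layout()
ax = plt.gca()
```

Rotated tick labels and tight_layout keep long language names from overlapping, which is part of the readability requirement above.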
To begin, let's see if any of the features we've already explored give us any good answers.
1. Languages used
2. Average file size in a repo
3. Average message length of commits in a repo
Then, make a plot of how average commit message length of a repository correlates to its watch count. Round the average commit message
length to the nearest integer.
d) Which, if any, of the features we inspected above have a high correlation to a repo having a high watch count?
Does the answer make sense, or does it seem counterintuitive? Explain your answer in a small paragraph, no more
than 200 words. Be sure to cite the charts you generated. (3 points)
Repos with a high watch count are usually written in popular programming languages (q4a) such as JavaScript, HTML, Python, etc. The average
file size of a high-watch-count repo is around 10^3 B (q4b), which is a reasonable size for source code. For most repos, the average commit
message length is around 10-100 words (q4c_avg_commit_length_count), which makes sense because we usually don't exceed 100 words when
writing a commit message. However, for the most popular (high watch count) repos, commit messages are near 1000 words or a bit above
(q4c_msg_length_watch_count). I think this is because those projects have many people collaborating; therefore, the commit
messages are longer in order to keep track of the details of each commit.
At this point we have learned a couple of things about how certain features may or may not impact the popularity of a GitHub repo. However, we
really only looked at features of GitHub repos that we had initially explored when we were getting a feel for the dataset! There must be
more things we can inspect than that.
If you do a web search for "how to make my git repo popular," you will find that more than a couple of people suggest investing time in your
README file. The README usually gives an overview of a GitHub project and may include other information about the codebase, such as whether
its most recent build passed or how to begin contributing to that repo. Here is an example README file for the popular web development
framework Vue.js.
Consider a README file to be any file with a path beginning with "README", not case-sensitive.
Note: If a project has multiple README files, you can just take the average size of those files.
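That matching rule can be expressed in one line. Here is a tiny sketch (the function name is ours, not part of the assignment):

```python
def is_readme(path: str) -> bool:
    # Case-insensitive: the path must begin with "README"
    return path.upper().startswith("README")

print(is_readme("README.md"), is_readme("readme.rst"), is_readme("docs/README"))
# True True False
```

In BigQuery Standard SQL, the equivalent predicate would be something like UPPER(path) LIKE 'README%'.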
c) Would you say that a "good" README is correlated with a popular repository, based on the features you studied?
Why or why not? If you were to analyze more in-depth features on the README file for correlation with repo
popularity, what would they be? (3 points)
From the plot in q5a, the watch count of a repo with a README file is, overall, larger than that of a repo without one, which suggests that a repo
with a README is more likely to have a high watch count than a repo without one. From the plot in q5b, we can see that repos with a high
watch count have READMEs that are neither too short nor too long, typically in the range of 10^3 to 10^4. This is reasonable, because a good
README tends to have enough information to describe the repo while not being so long that it becomes impossible to read. In general, I think a
good repo definitely requires a README file, and the content should be informative without overflowing with information. In addition, I think
repos whose READMEs contain graphs and plots are more likely to have a high watch count due to better project visualization.
Question 6 (Extra Credit): What other features might correlate with a highly watched repo? (3
possible points)
We studied only a handful of features that could correlate with a highly watched repo. Can you find a few more that seem especially promising?
Back your proposed features with data and charts.
I would like to explore how different licenses correlate with watch count. Based on the plots below, interestingly, the MIT license is used in the
most repos; however, the repos that use the cc0-1.0 license have the highest watch count on average.