Project 2: Exploring The Github Dataset With Colaboratory: Collaborators
In this project, you will explore one of BigQuery's public datasets on GitHub and learn to make visualizations in order to answer your questions.
This project is due on Monday, November 4th at 11:59 PM. It is worth 50 points, for 10% of your overall grade. After completing this project,
make sure to follow the submission instructions in the handout to submit on Gradescope.
Collaborators:
Please list the names and SUNet IDs of your collaborators below:
Overview
BigQuery has a massive dataset of GitHub files and statistics, including information about repositories, commits, and file contents. In this
project, we will be working with this dataset. Don't worry if you are not too familiar with Git and GitHub -- we will explain everything you need to
know to complete this part of the assignment.
Notes
The GitHub dataset available on BigQuery is actually quite massive. A single query on the "contents" table alone (it is 2.16TB!) can eat up your
1TB allowance for the month AND cut into about 10% of your GCloud credit for the class.
To make this part of the project more manageable, we have subsetted the original data. We have preserved almost all information in the original
tables, but we kept only the information on the top 500,000 most "watched" GitHub repos between January 2016 and October 2018.
You can see the tables we will be working with here. Read through the schemas to get familiar with the data. Note that some of the tables are
still quite large (the contents table is about 500GB), so you should exercise the usual caution when working with them. Before running queries
in this notebook, it's good practice to first set up query limits on your BigQuery account or check how many bytes will be billed in the web UI.
Make sure to use our subsetted dataset, not the original BigQuery dataset!
GitHub: GitHub is a source-control service provider. GitHub allows you to collaborate on and keep track of source code in a fairly efficient
way.
commit: A commit can be thought of as a change that is applied to some set of files. I.e., if some set of files is in some state A, you can
make changes to A and commit your changes to the set of files so that it is now in state B. A commit is identified by a hash of the
information in your change (the author of the commit, who actually committed [i.e. applied] the changes to the set of files, the changes
themselves, etc.)
parent commit: The commit that came before your current commit.
repo: A repo (short for repository) is GitHub's abstraction for a collection of files along with a history of commits on those files. If you have
GitHub username "foo" and you make a repository called "data-rocks", your repo name will be "foo/data-rocks". You can think of a repo's
history in terms of its commits. E.g., "foo/data-rocks" can have the set of "states" A->B->C->D, where each state change (A->B, B->C, C->D)
was due to a commit.
branch: To keep track of different commit histories, GitHub repos can have branches. The 'main' branch (i.e. commit history) of the repo is
called the 'master' branch. Say on "foo/data-rocks" we have the commit history A->B->C->D on the master branch. If someone else comes
along and decides to add a cool new feature to "foo/data-rocks", they can create a branch called "cool-new-feature" that branches away
from the master branch. All the code from the main branch will be there, but new code added to "cool-new-feature" will not be on the main
branch.
ref: For the purpose of this assignment, you can think of the 'ref' field on the "files" table as referring to the branch in which a file lives in a
repository at some point in time.
For the purposes of this question, you don't need to know about the following things in detail:
Commit trees
The encoding attribute on the commits table
If you want more clarifications about Git and GitHub in order to answer this question, be sure to post on Piazza or come to Office Hours. In
many cases, a quick web search will also help answer your question.
Section 1 | Understanding the Dataset (4 points)
lrepo_name
b) What is the primary key in github_repo_licenses ? What is the foreign key? (1 point)
c) If we were given an author and we wanted to know what language repos they like to contribute to, which tables
should we use? (1 point)
d) If we wanted to know whether using different licenses had an effect on a repo's watch count, which tables would
we use? (1 point)
github_repos and github_repo_licenses
In this section, we'll look at some inefficient queries and think about how we can make them more efficient. For this section, we'll consider
efficiency in terms of how many bytes are processed.
NOTE: We do NOT recommend running this unoptimized query in BigQuery, as it will run for a very long time (over 15 minutes, if not longer).
However, feel free to run an optimized version of this query after finishing part (c), which takes about 5 seconds to run.
This query returns a column of distinct author names, including only authors who have committed more than 20 times.
b) Briefly explain why this query is inefficient (in terms of bytes that need to be processed) and how it can be
improved to be more efficient. (1 point)
It's inefficient because, for each record in the outer loop, a full scan of the entire table is required. To improve it, we need to avoid rescanning
the entire table for each record. We can instead aggregate the table once with a GROUP BY clause and filter the groups with a HAVING clause.
c) Following from part (b), write a more efficient version of the query. (2 points)
SELECT author.name
FROM `cs145-fa19.project2.github_repo_commits`
GROUP BY author.name
HAVING COUNT(*)>20
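To see why the aggregate form works, here is a minimal local sketch using Python's built-in sqlite3 module. The single-column commits table and its rows are made up for illustration, and BigQuery's nested author.name field is flattened to a plain column -- this is a stand-in, not the real dataset:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE commits(author_name TEXT)")
# "alice" gets 25 commits, "bob" only 3
con.executemany("INSERT INTO commits VALUES (?)",
                [("alice",)] * 25 + [("bob",)] * 3)

# Aggregate once, then filter the groups -- a single scan of the table,
# instead of one full scan per record.
rows = con.execute("""
    SELECT author_name
    FROM commits
    GROUP BY author_name
    HAVING COUNT(*) > 20
""").fetchall()
print(rows)  # [('alice',)]
```

Only "alice" clears the 20-commit threshold, so only her name survives the HAVING filter.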
SELECT id
FROM (
SELECT files.id, files.mode, contents.size
FROM
`cs145-fa19.project2.github_repo_files` files,
`cs145-fa19.project2.github_repo_readme_contents` contents
WHERE files.id = contents.id
)
WHERE mode = 33188 AND size > 1000
LIMIT 10
d) Briefly explain why this query is inefficient (in terms of bytes that need to be processed) without the query
optimization and how it can be improved to be more efficient. (1 point)
This query is inefficient because the inner subquery joins the two tables and materializes the mode and size columns for every matching row
before any filtering happens. To optimize the query, we can use an INNER JOIN to combine the two tables and apply the mode and size
constraints directly in the WHERE clause, so rows are filtered as early as possible and fewer bytes flow through the query.
e) Following from part (d), write a more efficient version of the query. (2 points)
Hint: Think about the number of bytes processed by the unoptimized query. Can any operator be moved around to reduce this number?
SELECT contents.id
FROM `cs145-fa19.project2.github_repo_files` files
JOIN `cs145-fa19.project2.github_repo_readme_contents` contents
  ON files.id = contents.id
WHERE files.mode = 33188 AND contents.size > 1000
LIMIT 10
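As a local sanity check that the flattened join is equivalent to the nested version, here is a sketch using Python's sqlite3 module. The tables and rows are made-up stand-ins for the BigQuery schema, and sqlite does not model bytes billed, so this only confirms that both forms return the same ids:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE files(id INTEGER, mode INTEGER);
    CREATE TABLE contents(id INTEGER, size INTEGER);
    INSERT INTO files VALUES (1, 33188), (2, 33188), (3, 16384);
    INSERT INTO contents VALUES (1, 5000), (2, 10), (3, 9000);
""")

# Original shape: materialize the join first, filter afterwards.
nested = con.execute("""
    SELECT id FROM (
        SELECT files.id, files.mode, contents.size
        FROM files, contents
        WHERE files.id = contents.id
    ) WHERE mode = 33188 AND size > 1000
""").fetchall()

# Optimized shape: join directly and filter in the same WHERE clause.
flattened = con.execute("""
    SELECT contents.id
    FROM files JOIN contents ON files.id = contents.id
    WHERE files.mode = 33188 AND contents.size > 1000
""").fetchall()

print(nested, flattened)  # both [(1,)]
```

Only id 1 has both a regular-file mode and a size over 1000 bytes, and both query shapes agree on that.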
f) Run both the original query and your optimized query on BigQuery and pay attention to the number of bytes
processed. How do they compare, and is it what you expect? Explain why this is happening in a few sentences. (1
point)
Hint: Look at the query plan under "Execution details" in the bottom panel of BigQuery. It may be especially helpful to look at stage "S00: Input".
The average times for reads and writes of the original query are 171 and 209 respectively, versus 100 and 14 for the optimized query. As we can
see, the optimized query performs far fewer reads and, especially, far fewer writes than the original query, so the original is slower, which is
what I expected in part (d).
To learn more about writing efficient SQL queries and how BigQuery optimizes queries, check out Optimizing query computation and Query plan
and timeline.
In this section, you'll be answering questions about the dataset, similar to the first project. The difference is that instead of answering with a
query, you will be answering with a visualization. Part of this assignment is for you to think about which data (specifically, which indicators) you
should be using in order to answer a particular question, and about what type of chart/picture/visualization you should use to clearly convey
your answer.
General Instructions
For each question, you will have at least two cells -- a SQL cell where you run your query (and save the output to a data frame), and a
visualization cell, where you construct your chart. For this project, make sure that all data manipulation is done in SQL. Please do
not modify your data output using pandas or some other data library.
Please make all charts clear and readable - this includes adding axes labels, clear tick marks, clear point markers/lines/color schemes
(i.e. don't repeat colors across categories), legends, and so on.
Visualization
For this project, we will be officially supporting the use of matplotlib (https://matplotlib.org/3.0.0/tutorials/index.html), but feel free to use
another graphing library if you are more comfortable with it.
%matplotlib inline
That "%%" magic converts the cell into a SQL cell. The resulting table that is generated is saved into the variable you name on the %%bigquery line (variable in the example below).
Then in a second cell, use the library of your choice to plot the variable. Here is an example using matplotlib:
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(variable["x"], variable["y"])
plt.title("Plot Title")
plt.xlabel("X-axis label")
plt.ylabel("Y-axis label")
plt.show()
Let's get our feet wet with this data by creating the following plots:
Note that you will not receive full credit if your charts are poorly made (i.e. very unclear or unreadable).
Hints
Some of these plots will need at least one of their axes to be log-scaled in order to be readable
For more readable plots, you can use pandas.DataFrame.sample. A sample size between 1,000 and 10,000 should give you more readable
plots.
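For example, here is a sketch of that down-sampling step. The data frame and its columns are made up for illustration; in the notebook you would call sample on the data frame produced by your %%bigquery cell:

```python
import pandas as pd

# Stand-in for a query result with many rows
df = pd.DataFrame({"size": range(100_000), "watch_count": range(100_000)})

# Reproducible down-sample to 1,000 rows for a readable scatter plot
sampled = df.sample(n=1000, random_state=0)
print(len(sampled))  # 1000
```

Passing random_state makes the sample reproducible across notebook re-runs, which keeps your charts stable between submissions.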
Reminders
Be careful with your queries! Don't run SELECT * blindly on a table in this Colab notebook, since you will not get a warning of how much
data the query will consume. Always check how much data a query will consume in the BigQuery UI first -- you are also better off setting a query
limit as we described earlier.
Don't forget to use the subsetted GitHub tables we provide here, not the original ones on BigQuery.
a) Language distribution (2 points)
(x-axis: programming language, y-axis: # repos containing at least one file in that language)
To keep the chart readable, only keep the top 20 languages.
Hint: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
%%bigquery --project $project_id q3a
# YOUR QUERY HERE
SELECT name lang_name, COUNT(DISTINCT files.lrepo_name) num_repos
FROM `cs145-fa19.project2.github_repo_files` files
JOIN (
  SELECT lrepo_name, name
  FROM `cs145-fa19.project2.github_repo_languages`,
  UNNEST(language)
) lang
ON files.lrepo_name = lang.lrepo_name
GROUP BY name
ORDER BY num_repos DESC
LIMIT 20
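A corresponding plotting cell might look like the following sketch. The language names and counts here are placeholder data standing in for the q3a result; a log-scaled y-axis keeps the long tail of languages readable:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in Colab, %matplotlib inline handles this
import matplotlib.pyplot as plt

# Placeholder data standing in for q3a["lang_name"] / q3a["num_repos"]
lang_name = ["JavaScript", "HTML", "Python", "Java", "CSS"]
num_repos = [210_000, 180_000, 150_000, 120_000, 110_000]

plt.figure(figsize=(10, 5))
plt.bar(lang_name, num_repos)
plt.yscale("log")
plt.xticks(rotation=45, ha="right")
plt.title("Top languages by number of repos")
plt.xlabel("Programming language")
plt.ylabel("# repos (log scale)")
plt.tight_layout()
ax = plt.gca()
```

Rotated tick labels and tight_layout keep long language names from overlapping, which is part of the readability requirement above.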
To begin, let's see if any of the features we've already explored give us any good answers.
1. Languages used
2. Average file size in a repo
3. Average message length of commits in a repo
Then, make a plot of how average commit message length of a repository correlates to its watch count. Round the average commit message
length to the nearest integer.
d) Which, if any, of the features we inspected above have a high correlation to a repo having a high watch count?
Does the answer make sense, or does it seem counterintuitive? Explain your answer in a small paragraph, no more
than 200 words. Be sure to cite the charts you generated. (3 points)
Repos with a high watch count are usually written in popular programming languages (q4a) such as JavaScript, HTML, Python, etc. The average
file size of a high-watch-count repo is around 10^3 B (q4b), which is a reasonable size for source code. For most repos, the average commit
message length is around 10-100 words (q4c_avg_commit_length_count), which makes sense because we usually don't exceed 100 words when
writing a commit message. However, for the most popular (high watch count) repos, commit messages are near 1000 words or a bit above
(q4c_msg_length_watch_count). I think this is because those projects have many people collaborating; therefore, the commit
messages are longer in order to keep track of the details of each commit.
At this point we have learned a couple of things about how certain features may or may not impact the popularity of a GitHub repo. However, we
really only looked at features of GitHub repos that we had initially explored when we were getting a feel for the dataset! There must be
more things we can inspect than that.
If you do a web search for "how to make my git repo popular," you will find that more than a couple of people suggest investing time in your
README file. The README usually gives an overview of a GitHub project and may include other information about the codebase, such as whether
its most recent build passed or how to begin contributing to that repo. Here is an example README file for the popular web development
framework Vue.js.
Consider a README file to be any file with a path beginning with "README", not case-sensitive.
Note: If a project has multiple README files, you can just take the average size of those files.
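That matching rule can be expressed in one line. Here is a tiny sketch (the function name is ours, not part of the assignment):

```python
def is_readme(path: str) -> bool:
    # Case-insensitive: the path must begin with "README"
    return path.upper().startswith("README")

print(is_readme("README.md"), is_readme("readme.rst"), is_readme("docs/README"))
# True True False
```

In BigQuery Standard SQL, the equivalent predicate would be something like UPPER(path) LIKE 'README%'.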
c) Would you say that a "good" README is correlated with a popular repository, based on the features you studied?
Why or why not? If you were to analyze more in-depth features on the README file for correlation with repo
popularity, what would they be? (3 points)
From the plot in q5a, the watch count of a repo with a README file is, overall, larger than that of a repo without one, which suggests that a repo
with a README is more likely to have a high watch count than a repo without one. From the plot in q5b, we can see that repos with a high
watch count have READMEs that are neither too short nor too long, typically in the range of 10^3 to 10^4. This is reasonable, because a good
README tends to have enough information to describe the repo while not being so long that it becomes impossible to read. In general, I think a
good repo definitely requires a README file, and the content should be informative without overflowing with information. In addition, I think
repos whose READMEs contain graphs and plots are more likely to have a high watch count due to better project visualization.
Question 6 (Extra Credit): What other features might correlate with a highly watched repo? (3
possible points)
We studied only a handful of features that could correlate with a highly watched repo. Can you find a few more that seem especially promising?
Back your proposed features with data and charts.
I would like to explore how different licenses correlate with watch count. Based on the plots below, interestingly, the MIT license is used in the
most repos; however, the repos that use the cc0-1.0 license have the highest watch count on average.