
Project 2: Exploring the GitHub Dataset with Colaboratory

In this project, you will explore one of BigQuery's public datasets on GitHub and learn to make visualizations in order to answer your questions.
This project is due on Monday, November 4th at 11:59 PM. It is worth 50 points, for 10% of your overall grade. After completing this project,
make sure to follow the submission instructions in the handout to submit on Gradescope.

Notes (read carefully!):


Be sure you read the instructions on each cell and understand what it is doing before running it.
Don't forget that you can always re-download the starter notebook from the course website if you need to.
You may create new cells to use for testing, debugging, exploring, etc., and this is in fact encouraged! Just make sure that the final answer for each question is in its own cell and clearly indicated.
Colab will not warn you about how many bytes your SQL query will consume. Be sure to check on the BigQuery UI first before running queries here!
See the assignment handout for submission instructions.
Have fun!

Collaborators:
Please list the names and SUNet IDs of your collaborators below:

Liang Xu, liangxu

Overview
BigQuery has a massive dataset of GitHub files and statistics, including information about repositories, commits, and file contents. In this project, we will be working with this dataset. Don't worry if you are not too familiar with Git and GitHub -- we will explain everything you need to know to complete this part of the assignment.

Notes
The GitHub dataset available on BigQuery is actually quite massive. A single query on the "contents" table alone (it is 2.16TB!) can eat up your
1TB allowance for the month AND cut into about 10% of your GCloud credit for the class.

To make this part of the project more manageable, we have subsetted the original data. We have preserved almost all information in the original tables, but kept only the information on the top 500,000 most "watched" GitHub repos between January 2016 and October 2018.

You can see the tables we will be working with here. Read through the schemas to get familiar with the data. Note that some of the tables are still quite large (the contents table is about 500GB), so you should exercise the usual caution when working with them. Before running queries on this notebook, it's good practice to first set up query limits on your BigQuery account or see how many bytes will be billed on the web UI.

Make sure to use our subsetted dataset, not the original BigQuery dataset!
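One way to check bytes before running a query from a notebook is BigQuery's dry-run mode, which reports the bytes a query would process without billing you. The sketch below assumes the google-cloud-bigquery client library is available and you are authenticated; the `format_bytes` helper is our own, not part of any library:

```python
def format_bytes(n):
    """Render a byte count the way BigQuery's UI does, e.g. 2.16e12 -> '2.16 TB'."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000:
            return f"{n:.2f} {unit}"
        n /= 1000
    return f"{n:.2f} PB"

def estimate_query_bytes(sql, project_id):
    """Dry-run a query; BigQuery returns total_bytes_processed without running it."""
    from google.cloud import bigquery  # deferred import; needs google-cloud-bigquery
    client = bigquery.Client(project=project_id)
    job = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
    return job.total_bytes_processed

# e.g. print(format_bytes(estimate_query_bytes("SELECT size FROM ...", project_id)))
```

This gives you the same number the web UI shows in its validator, so you can sanity-check a query's cost before running it in Colab.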

A Super Quick Primer on Git


If you are not very familiar with Git and GitHub, here are some high-level explanations that will give you enough context to get you through this
part of the problem:

GitHub: GitHub is a source-control service provider. GitHub allows you to collaborate on and keep track of source code in a fairly efficient way.

commit: A commit can be thought of as a change that is applied to some set of files. I.e., if some set of files is in some state A, you can make changes to A and commit your changes to the set of files so that it is now in state B. A commit is identified by a hash of the information in your change (the author of the commit, who actually committed [i.e. applied] the changes to the set of files, the changes themselves, etc.)

parent commit: The commit that came before your current commit.

repo: A repo (short for repository) is GitHub's abstraction for a collection of files along with a history of commits on those files. If you have GitHub username "foo" and you make a repository called "data-rocks", your repo name will be "foo/data-rocks". You can think of a repo's history in terms of its commits. E.g., "foo/data-rocks" can have the set of "states" A->B->C->D, where each state change (A->B, B->C, C->D) was due to a commit.

branch: To keep track of different commit histories, GitHub repos can have branches. The 'main' branch (i.e. commit history) of the repo is
called the 'master' branch. Say on "foo/data-rocks" we have the commit history A->B->C->D on the master branch. If someone else comes
along and decides to add a cool new feature to "foo/data-rocks", they can create a branch called "cool-new-feature" that branches away
from the master branch. All the code from the main branch will be there, but new code added to "cool-new-feature" will not be on the main
branch.

ref: For the purpose of this assignment, you can think of the 'ref' field on the "files" table as referring to the branch in which a file lives in a repository at some point in time.

For the purposes of this question, you don't need to know about the following things in detail:

Commit trees
The encoding attribute on the commits table

If you want more clarifications about Git and GitHub in order to answer this question, be sure to post on Piazza or come to Office Hours. In many cases, a quick web search will also help answer your question.
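To make the commit idea above concrete, here is a toy Python sketch (our own simplification, not Git's actual object format) in which a commit id is a hash over the commit's information, including its parent:

```python
import hashlib

def commit_hash(author, committer, message, parent=None):
    # A commit id is a hash of the change's information: the author, who
    # committed it, the change itself (here just a message), and the parent
    # commit it builds on -- so identical changes with different parents
    # get different ids.
    payload = f"{author}|{committer}|{message}|{parent or ''}"
    return hashlib.sha1(payload.encode()).hexdigest()

# A linear history A -> B on one branch:
a = commit_hash("foo", "foo", "state A: initial commit")
b = commit_hash("foo", "foo", "state B: add feature", parent=a)
assert a != b  # each state change is a distinct commit
```

Real Git hashes a structured commit object, but the dependence of the id on author, committer, changes, and parent is the same idea.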
Section 1 | Understanding the Dataset (4 points)

Question 1: Schema Comprehension (4 points)


Each of the following parts is worth 1 point.

a) What is the primary key of github_repo_files ? (1 point)


Things to note:

A file ID changes based on a file's contents; it is not assigned at a file's creation.


Different repos can have files with the same paths.
It is possible to have separate files with identical contents.
A repo may have one file across multiple branches.

lrepo_name

b) What is the primary key in github_repo_licenses ? What is the foreign key? (1 point)

The primary key is license. The foreign key is lrepo_name.

c) If we were given an author and we wanted to know what language repos they like to contribute to, which tables
should we use? (1 point)

github_repo_commits and github_repo_languages

d) If we wanted to know whether using different licenses had an effect on a repo's watch count, which tables would
we use? (1 point)
github_repos and github_repo_licenses

Section 2 | Query Performance (8 points)

In this section, we'll look at some inefficient queries and think about how we can make them more efficient. For this section, we'll consider efficiency in terms of how many bytes are processed.

Question 2: Optimizing Queries (8 points)

For the next three subquestions, consider the following query:

SELECT DISTINCT author.name
FROM `cs145-fa19.project2.github_repo_commits` commits_1
WHERE (SELECT COUNT(*)
       FROM `cs145-fa19.project2.github_repo_commits` commits_2
       WHERE commits_1.author.name = commits_2.author.name) > 20

NOTE: We do NOT recommend running this unoptimized query in BigQuery, as it will run for a very long time (over 15 minutes, if not longer). However, feel free to run an optimized version of this query after finishing part (c), which takes about 5 seconds to run.

a) In one to two sentences, explain what this query does. (1 point)

This query returns the distinct names of authors who have made more than 20 commits.

b) Briefly explain why this query is inefficient (in terms of bytes that need to be processed) and how it can be improved to be more efficient. (1 point)

It's inefficient because, for each record in the outer query, the correlated subquery iterates over the entire table. To improve it, we need to avoid iterating over the entire table for each record; we can use GROUP BY and HAVING clauses instead.

c) Following from part (b), write a more efficient version of the query. (2 points)

SELECT author.name
FROM `cs145-fa19.project2.github_repo_commits`
GROUP BY author.name
HAVING COUNT(*) > 20

For the next three subquestions, consider the following query:

SELECT id
FROM (
SELECT files.id, files.mode, contents.size
FROM
`cs145-fa19.project2.github_repo_files` files,
`cs145-fa19.project2.github_repo_readme_contents` contents
WHERE files.id = contents.id
)
WHERE mode = 33188 AND size > 1000
LIMIT 10
d) Briefly explain why this query is inefficient (in terms of bytes that need to be processed) without the query optimization and how it can be improved to be more efficient. (1 point)

This query is inefficient because it reads and writes too much data from the two tables in the FROM clause. To optimize the query, we can use an INNER JOIN to combine the two tables and then set the constraint using WHERE.

e) Following from part (d), write a more efficient version of the query. (2 points)
Hint: Think about the number of bytes processed by the unoptimized query. Can any operator be moved around to reduce this number?

SELECT contents.id
FROM `cs145-fa19.project2.github_repo_files` files
JOIN `cs145-fa19.project2.github_repo_readme_contents` contents
ON files.id = contents.id
WHERE files.mode = 33188 AND contents.size > 1000
LIMIT 10

f) Run both the original query and your optimized query on BigQuery and pay attention to the number of bytes processed. How do they compare, and is it what you expect? Explain why this is happening in a few sentences. (1 point)
Hint: Look at the query plan under "Execution details" in the bottom panel of BigQuery. It may be especially helpful to look at stage "S00: Input".

The average times for reads and writes of the original query are 171 and 209 respectively, while those of the optimized query are 100 and 14. As we can see, the optimized query performs far fewer reads and writes, especially writes, than the original query, so the original query is slower, which is what I expected in part (d).

To learn more about writing efficient SQL queries and how BigQuery optimizes queries, check out Optimizing query computation and Query plan and timeline.

Section 3: Visualizing the Dataset (38 points)

In this section, you'll be answering questions about the dataset, similar to the first project. The difference is that instead of answering with a query, you will be answering with a visualization. Part of this assignment is for you to think about which data (specifically, which indicators) you should be using in order to answer a particular question, and about what type of chart/picture/visualization you should use to clearly convey your answer.

General Instructions
For each question, you will have at least two cells - a SQL cell where you run your query (and save the output to a data frame), and a visualization cell, where you construct your chart. For this project, make sure that all data manipulation is done in SQL. Please do not modify your data output using pandas or some other data library.
Please make all charts clear and readable - this includes adding axis labels, clear tick marks, clear point markers/lines/color schemes (i.e. don't repeat colors across categories), legends, and so on.

Setting up BigQuery and Dependencies


Run the cell below (shift + enter) to authenticate your project.
Note that you need to fill in the project_id variable with the Google Cloud project id you are using for this course. You can see your project ID by going to https://console.cloud.google.com/cloud-resource-manager

# Run this cell to authenticate yourself to BigQuery.


from google.colab import auth
auth.authenticate_user()
project_id = "cs145-255023"

Visualization
For this project, we will be officially supporting the use of matplotlib (https://matplotlib.org/3.0.0/tutorials/index.html), but feel free to use another graphing library if you are more comfortable with it.

# Add imports for any visualization libraries you may need


import matplotlib.pyplot as plt

%matplotlib inline

How to Use BigQuery and visualize in Colab


Jupyter notebooks (what Colab notebooks are based on) have a concept called "magics". If you write the following line at the top of a Code cell:

%%bigquery --project $project_id variable # this is the key line


SELECT ....
FROM ...

That "%%" converts the cell into a SQL cell. The resulting table that is generated is saved into "variable".

Then in a second cell, use the library of your choice to plot the variable. Here is an example using matplotlib:

plt.figure()
plt.scatter(variable["x"], variable["y"])
plt.title("Plot Title")
plt.xlabel("X-axis label")
plt.ylabel("Y-axis label")

Question 3: A First Look at Repo Features (6 points)

Let's get our feet wet with this data by creating the following plots:

1. Language distribution across repos
2. File size distribution across repos
3. The distribution of the length of commit messages across repos

Note that you will not receive full credit if your charts are poorly made (i.e. very unclear or unreadable).

Hints
Some of these plots will need at least one of their axes to be log-scaled in order to be readable.
For more readable plots, you can use pandas.DataFrame.sample. A sample size between 1,000 and 10,000 should give you more readable plots.
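As a sketch of the sampling hint, using a hypothetical DataFrame `df` standing in for a query result (the column names here are our own illustration):

```python
import pandas as pd

# Hypothetical query result: one row per (size, count) pair.
df = pd.DataFrame({"size": range(1, 10001), "num_files": range(10000, 0, -1)})

# Subsample to 2,000 points so the scatter plot stays readable;
# random_state makes the sample reproducible across notebook runs.
sub = df.sample(n=2000, random_state=0)

# Then pass sub to plt.scatter and call plt.xscale("log") / plt.yscale("log")
# so a heavy-tailed axis stays readable.
```

Sampling is done on the small query result in pandas purely for plot readability; the data manipulation itself still happens in SQL, as the instructions require.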

Reminders
Be careful with your queries! Don't run SELECT * blindly on a table in this Colab notebook since you will not get a warning of how much data the query will consume. Always check how much data a query will consume on the BigQuery UI first -- you are also better off setting a query limit as we described earlier.
Don't forget to use the subsetted GitHub tables we provide here, not the original ones on BigQuery.
a) Language distribution (2 points)
(x-axis: programming language, y-axis: # repos containing at least one file in that language)
To keep the chart readable, only keep the top 20 languages.

Hint: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
%%bigquery --project $project_id q3a
# YOUR QUERY HERE
SELECT name lang_name, COUNT(DISTINCT files.lrepo_name) num_repos  -- DISTINCT: count repos, not files
FROM `cs145-fa19.project2.github_repo_files` files JOIN (SELECT lrepo_name, name
FROM `cs145-fa19.project2.github_repo_languages`,
UNNEST(language)) lang
ON files.lrepo_name = lang.lrepo_name
GROUP BY name
ORDER BY num_repos DESC
LIMIT 20

# YOUR PLOT CODE HERE


plt.figure(figsize=(13, 5))
plt.scatter(q3a["lang_name"], q3a["num_repos"])
plt.title("Language distribution")
plt.xlabel("Programming Language")
plt.ylabel("# of repos")
plt.xticks(rotation="vertical")


b) File size distribution (2 points)


(x-axis: file size, y-axis: # files of that size)

%%bigquery --project $project_id q3b

# YOUR QUERY HERE


SELECT size, COUNT(*) num_files
FROM `cs145-fa19.project2.github_repo_contents`
GROUP BY size
ORDER BY size

# YOUR PLOT CODE HERE


q3b_sub = q3b.sample(2000)
plt.figure(figsize=(13, 7))
plt.scatter(q3b_sub["size"], q3b_sub["num_files"], s=5)
plt.yscale('symlog')
plt.xscale('symlog')
plt.title("File size distribution")
plt.xlabel("file size")
plt.ylabel("number of files")

c) The distribution of the length of commit messages (2 points)


(x-axis: length of the commit message, y-axis: # commits with that length)
Note: The query for this plot may use ~30GB of data.

%%bigquery --project $project_id q3c

# YOUR QUERY HERE


SELECT LENGTH(message) m_len, COUNT(*) num_commits
FROM `cs145-fa19.project2.github_repo_commits`
GROUP BY m_len
ORDER BY m_len

# YOUR PLOT CODE HERE


q3c_sub = q3c.sample(1000)
plt.figure(figsize=(13, 7))
plt.scatter(q3c_sub["m_len"], q3c_sub["num_commits"], s=5)
plt.yscale('symlog')
plt.xscale('symlog')
plt.title("Distribution of the length of commit messages")
plt.xlabel("length of the commit message")
plt.ylabel("number of commits")

What Makes a Good Repo?


Given that we have some interesting data at our disposal, let's try to answer the question: what makes a good GitHub repo? For our purposes, a
"good" repo is simply a repo with a high watch count; this refers to how many people are following the repo for updates.

To begin, let's see if any of the features we've already explored give us any good answers.

Question 4: Using What We've Worked With (17 points)


Create plots for the following features in a repo and how they relate to that repo's watch count:

1. Languages used
2. Average file size in a repo
3. Average message length of commits in a repo

a) Languages used (4 points)


As in Q3a, please only keep the top 20 languages to keep the chart readable.

%%bigquery --project $project_id q4a

# YOUR QUERY HERE


SELECT name lang_name, SUM(repos.watch_count) num_watch_count
FROM `cs145-fa19.project2.github_repos` repos JOIN (SELECT lrepo_name, name
FROM `cs145-fa19.project2.github_repo_languages`,
UNNEST(language)) lang
ON repos.lrepo_name = lang.lrepo_name
GROUP BY name
ORDER BY num_watch_count DESC
LIMIT 20

# YOUR PLOT CODE HERE


plt.figure(figsize=(13, 5))
plt.scatter(q4a["lang_name"], q4a["num_watch_count"])
plt.title("Languages used")
plt.xlabel("Programming Language")
plt.ylabel("total watch count")
plt.xticks(rotation="vertical")

b) Average file size in a repo (4 points)


Note: For this question, you may use the github_repo_readme_contents table instead of the full contents table.

%%bigquery --project $project_id q4b

# YOUR QUERY HERE


SELECT s.avg_size, SUM(repos.watch_count) num_watch_count
FROM `cs145-fa19.project2.github_repos` repos
JOIN (SELECT files.lrepo_name, ROUND(AVG(contents.size)) avg_size
FROM `cs145-fa19.project2.github_repo_readme_contents` contents
JOIN `cs145-fa19.project2.github_repo_files` files
ON contents.id = files.id
GROUP BY files.lrepo_name) s
ON repos.lrepo_name = s.lrepo_name
GROUP BY avg_size
ORDER BY avg_size

# YOUR PLOT CODE HERE


q4b_sub = q4b.sample(1000)
plt.figure(figsize=(15, 10))
plt.scatter(q4b_sub["avg_size"], q4b_sub["num_watch_count"], s=5)
plt.yscale('symlog')
plt.xscale('symlog')
plt.title("Average file size in a repo vs watch count")
plt.xlabel("average file size in a repo")
plt.ylabel("total watch count")

c) Average message length of commits on a repo. (6 points)


First, make a plot of the average commit message length of repositories against the number of repositories with that average commit message
length.

Then, make a plot of how average commit message length of a repository correlates to its watch count. Round the average commit message
length to the nearest integer.

%%bigquery --project $project_id q4c_avg_commit_length_count

# YOUR QUERY HERE


SELECT avg_c_len, COUNT(DISTINCT c1.lrepo_name) num_repos  -- DISTINCT: count repos, not commits
FROM `cs145-fa19.project2.github_repo_commits` c1
JOIN (SELECT c2.lrepo_name, ROUND(AVG(LENGTH(c2.message))) avg_c_len
      FROM `cs145-fa19.project2.github_repo_commits` c2
      GROUP BY c2.lrepo_name) c_avg
ON c1.lrepo_name = c_avg.lrepo_name
GROUP BY avg_c_len
ORDER BY avg_c_len

# YOUR PLOT CODE HERE


plt.figure(figsize=(15, 10))
plt.scatter(q4c_avg_commit_length_count["avg_c_len"], q4c_avg_commit_length_count["num_repos"], s=5)
plt.yscale('symlog')
plt.xscale('symlog')
plt.title("Average commit length count")
plt.xlabel("average commit message length")
plt.ylabel("number of repositories")

%%bigquery --project $project_id q4c_msg_length_watch_count

# YOUR QUERY HERE


SELECT avg_c_len, AVG(repos.watch_count) num_watch_count
FROM `cs145-fa19.project2.github_repos` repos
JOIN (SELECT c.lrepo_name, ROUND(AVG(LENGTH(c.message))) avg_c_len
FROM `cs145-fa19.project2.github_repo_commits` c
GROUP BY c.lrepo_name) c_avg
ON repos.lrepo_name = c_avg.lrepo_name
GROUP BY avg_c_len
ORDER BY avg_c_len

# YOUR PLOT CODE HERE


plt.figure(figsize=(15, 10))
plt.scatter(q4c_msg_length_watch_count["avg_c_len"], q4c_msg_length_watch_count["num_watch_count"], s=5)
plt.yscale('symlog')
plt.xscale('symlog')
plt.title("Average commit length watch count")
plt.xlabel("average commit message length")
plt.ylabel("average watch count")

d) Which, if any, of the features we inspected above have a high correlation to a repo having a high watch count?
Does the answer make sense, or does it seem counterintuitive? Explain your answer in a small paragraph, no more
than 200 words. Be sure to cite the charts you generated. (3 points)

The repos with high watch counts are usually written in popular programming languages (q4a) such as JavaScript, HTML, Python, etc. The file size of a high-watch-count repo is around 10^3 B (q4b), which is about right for code. For most repos, the average length of a commit message is around 10-100 characters (q4c_avg_commit_length_count), and it makes sense because we usually don't exceed 100 words when writing a commit message. However, for the most popular (high watch count) repos, the length of commit messages is near 1000 characters or a bit above (q4c_msg_length_watch_count). I think this is because those projects have many people collaborating; therefore, the commit messages are longer in order to keep track of the details in each commit.

What Do Others Have to Say?

At this point we have learned a couple of things about how certain features may or may not impact the popularity of a GitHub repo. However, we really only looked at features of GitHub repos that we had initially explored when we were getting a feel for the dataset! There have got to be more things we can inspect than that.

If you do a web search for "how to make my git repo popular," you will find that more than a couple of people suggest investing time in your README file. The README usually gives an overview of a GitHub project and may include other information about the codebase, such as whether its most recent build passed or how to begin contributing to that repo. Here is an example README file for the popular web development framework Vue.js.

IMPORTANT: Note about Contents Table


Note that the original github_repo_contents table is about half a TB! In order to save you the pain of using up 500GB of your credits to subset this table into a workable size for this problem, we have done it for you.
For the rest of this question, be sure that you use the github_repo_readme_contents table!

Question 5: Analyzing README Features (15 points)


Analyze the following features of a repo's README file and how they relate to the popularity of a repository, generating an informative plot for each feature:

1. Having or not having a README file
2. The length of the README file

Consider a README file to be any file with the path beginning with "README", not case-sensitive.
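The case-insensitive "path begins with README" test can be mirrored in Python to sanity-check your SQL predicate (`is_readme` is our own helper name, not part of the assignment):

```python
import re

def is_readme(path):
    # Path must *begin* with "README", in any letter case; re.match anchors
    # at the start of the string, so files in subdirectories don't match.
    return re.match(r"readme", path, re.IGNORECASE) is not None

assert is_readme("README.md")
assert is_readme("readme.rst")
assert not is_readme("docs/README.md")  # does not begin with README
```

Checking a few cases like these makes it easy to spot predicates that only match exact-case variants such as "README" and "readme".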

a) Having or not having a README file (6 points)

%%bigquery --project $project_id q5a1

# YOUR QUERY HERE


SELECT watch_count having_readme
FROM `cs145-fa19.project2.github_repos` r1
JOIN (SELECT DISTINCT f1.lrepo_name
      FROM `cs145-fa19.project2.github_repo_files` f1
      WHERE UPPER(path) LIKE "README%") r2  -- case-insensitive, per the spec
ON r1.lrepo_name = r2.lrepo_name

%%bigquery --project $project_id q5a2

# YOUR QUERY HERE


SELECT watch_count no_readme
FROM `cs145-fa19.project2.github_repos` r1
JOIN (SELECT DISTINCT f1.lrepo_name
      FROM `cs145-fa19.project2.github_repo_files` f1
      WHERE f1.lrepo_name NOT IN (SELECT DISTINCT t.lrepo_name
                                  FROM `cs145-fa19.project2.github_repo_files` t
                                  WHERE UPPER(path) LIKE "README%")) r2  -- case-insensitive, per the spec
ON r1.lrepo_name = r2.lrepo_name

# YOUR PLOT CODE HERE


plt.figure(figsize=(15, 10))
plt.scatter([0 for i in range(q5a2["no_readme"].size)], q5a2["no_readme"], s=5)
plt.scatter([1 for i in range(q5a1["having_readme"].size)], q5a1["having_readme"], s=5)
plt.yscale('symlog')
plt.title("Watch count for having or not having a README file")
plt.xlabel("having (1) or not having (0) readme file")
plt.ylabel("watch count")

b) The length of the README file (6 points)

You may ignore README files with length 0.

Note: If a project has multiple README files, you can just take the average size of those files.

%%bigquery --project $project_id q5b

# YOUR QUERY HERE


SELECT LENGTH(contents.content) c_len, SUM(repos.watch_count) watch_count
FROM `cs145-fa19.project2.github_repo_readme_contents` contents
JOIN `cs145-fa19.project2.github_repo_files` files
ON files.id = contents.id
JOIN `cs145-fa19.project2.github_repos` repos
ON repos.lrepo_name = files.lrepo_name
WHERE LENGTH(contents.content) > 0  -- ignore zero-length READMEs
GROUP BY c_len
ORDER BY c_len

# YOUR PLOT CODE HERE


q5b_sub = q5b.sample(1000)
plt.figure(figsize=(15, 10))
plt.scatter(q5b_sub["c_len"], q5b_sub["watch_count"], s=5)
plt.xscale('symlog')
plt.yscale('symlog')
plt.title("Watch count compared to the length of the README file")
plt.xlabel("length of the README file")
plt.ylabel("total watch count")

c) Would you say that a "good" README is correlated with a popular repository, based on the features you studied?
Why or why not? If you were to analyze more in-depth features on the README file for correlation with repo
popularity, what would they be? (3 points)

From the plot in q5a, overall, the watch count of repos having a README file is larger than that of repos without one, which suggests a repo with a README is more likely to have a high watch count. From the plot in q5b, we can see that the repos with high watch counts have README lengths that are neither too short nor too long, typically in the range of 10^3 to 10^4. This is reasonable because such a README tends to have enough information to describe the repo while not having so many words that it becomes impossible to read. In general, I think a good repo would definitely require a README file, and the content should be informative but not overflowing with information. In addition, I think repos whose READMEs contain graphs and plots are more likely to have high watch counts due to better project visualization.

Question 6 (Extra Credit): What other features might correlate with a highly watched repo? (3
possible points)
We studied only a handful of features that could correlate with a highly watched repo. Can you find a few more that seem especially promising? Back your proposed features with data and charts.

I would like to explore how different licenses correlate with watch count. Based on the plots below, interestingly, the MIT license is used in the most repos; however, the repos that use the cc0-1.0 license have the highest watch count on average.

%%bigquery --project $project_id q6a

# YOUR QUERIES HERE


SELECT license, COUNT(*) num_repos
FROM `cs145-fa19.project2.github_repo_licenses`
GROUP BY license
ORDER BY num_repos DESC
LIMIT 20

%%bigquery --project $project_id q6b

# YOUR QUERIES HERE


SELECT license, ROUND(AVG(watch_count)) avg_watch_count
FROM `cs145-fa19.project2.github_repos` repos
JOIN `cs145-fa19.project2.github_repo_licenses` lic
ON repos.lrepo_name = lic.lrepo_name
GROUP BY license
ORDER BY avg_watch_count DESC
LIMIT 20

# YOUR PLOT CODE HERE


# plot for q6a
plt.figure(figsize=(15, 10))
plt.scatter(q6a["license"], q6a["num_repos"])
plt.yscale('symlog')
plt.title("Number of repos in each license")
plt.xlabel("license")
plt.ylabel("number of repos")
plt.xticks(rotation="vertical")


# plot for q6b


plt.figure(figsize=(15, 10))
plt.scatter(q6b["license"], q6b["avg_watch_count"])
plt.yscale('symlog')
plt.title("Watch count of different licenses")
plt.xlabel("license")
plt.ylabel("average watch count")
plt.xticks(rotation="vertical")
