Jobst - 2022 - Efficient GitHub Crawling Using The GraphQL API
1 Introduction
2 Related Work
Various systems have been proposed that acquire data from online repositories.
Linstead et al. presented Sourcerer, an infrastructure for analyzing source code
repositories [22]. The authors processed source code files from GitHub projects
and analyzed them using Latent Dirichlet Allocation [2] and its variant, the
Author-Topic Model [29]. The results can serve as a summary of program
functionality, developer activity, and more. A system developed to support scien-
tists and practitioners in MSR research is Boa, presented by Dyer et al. [7]. The
vice through complex queries, which also introduces security risks [33]. A direct
comparison between the REST and GraphQL architectural models was made by
Seabra et al. [30]. Three target applications were implemented in both models,
from which performance metrics could be derived. Two-thirds of the applications
tested saw performance improvements in terms of average number of requests
per second and data transfer rate when using GraphQL. However, GraphQL performance dropped below that of the REST counterpart once the workload exceeded 3,000 requests. Similarly, Brito et al. migrated seven REST-based systems to use
GraphQL [4]. The migration reduced the size of JSON responses by up to 94% in
the number of fields and 99% in the number of bytes (median results). In another
study, Brito and Valente described a controlled experiment in which students had
to implement similar queries, once in REST, once in GraphQL [5]. Their results
showed that students were faster at implementing GraphQL queries, especially
when the REST endpoints contained more complex queries. Surprisingly, the
results held true for the more experienced groups of graduate students as well.
As mentioned earlier, GraphQL queries can be unexpectedly large, not least
because of their nested structure, making query cost estimation an important
feature. Estimating costs based on a static worst-case analysis of queries has had
limited success, leading Mavroudeas et al. to propose a machine learning-based
approach. Testing their approach on publicly available commercial APIs, they
found that their framework is able to predict query costs with high accuracy and
consistently outperforms static analysis.
The presented systems come with limitations when used for data generation. Some apply obstructive transformations to the raw data [22] or provide only a fixed set of processing functionality [7], which rules them out when a different transformation or format is needed. Systems that do provide raw data [14,32] usually leverage only a subset of the available data, e.g. issue tracking and version control history. Additionally, most systems use the GitHub REST API, whose conservative rate limit creates the need for multiple API tokens; otherwise, crawling additional information takes a long time. With the GraphQL API, a single API token can crawl significantly more information, which makes it more suitable for most users.
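To make the rate-limit argument concrete, the GitHub GraphQL API exposes its own accounting through a rateLimit field, so the cost of a crawl can be inspected per query. The following is a minimal sketch, not taken from any of the surveyed systems; the endpoint and fields are as documented by GitHub, the token value is a placeholder:

import requests  # any HTTP client works; requests is assumed here

GITHUB_GRAPHQL = "https://api.github.com/graphql"
TOKEN = "ghp_..."  # placeholder personal access token

# One GraphQL request fetches 100 issues and reports its own rate-limit
# cost; collecting the same data over REST takes 100+ separate requests.
QUERY = """
query {
  repository(owner: "vuejs", name: "vue") {
    issues(first: 100) { nodes { title } }
  }
  rateLimit { cost remaining resetAt }
}
"""

response = requests.post(
    GITHUB_GRAPHQL,
    json={"query": QUERY},
    headers={"Authorization": f"bearer {TOKEN}"},
)
print(response.json()["data"]["rateLimit"])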
3 Concept
[Figure: conceptual architecture – a Fetcher service crawls GitHub (REST/GraphQL API), an Importer service persists the data to an RDBMS, and the services communicate over an event backbone.]
“polyglot persistence” [21]. In our system, the Importer service stores the raw
data collected from GitHub, and new services, even if they build on this data,
may use a new storage component. We believe this also has usability benefits, as
single conceptual models are easier to understand than a complicated mixture.
Event-driven architectures are also not formally defined, most likely because
there are many different notions of what an event is and what it is used for.
An event contains at least the form or action of the event and a timestamp of
when the event occurred. In summary, three purposes of events can be formu-
lated [11]. In the simplest case, an event just serves as a mere notification that
something happened. For instance, in our system, the fetching service publishes
data from the crawling process as events. As mentioned earlier, two systems or
services, even if they use conceptually different data models, can still have sim-
ilar attributes. In this case, state changes from one service must be transmitted
to others that have a similar view of the data, which can be done in the form
of an event. In our system, the importing service publishes state changes via
events so that dependent or related components can react to them. For a small
monolithic application with a single data model, the state of the application is
usually mirrored in the database or can be derived from it in case of failure. In
contrast, the state of a distributed system is mirrored in all data of all services
together. However, because services can change state individually or event-driven
state transfers can occur, accessing older state is not as easy as with a single
database log. One way to maintain this capability in a distributed system is
event sourcing, where all changes to the system are recorded as events [9]. In
theory, the event log can be used to recover the system state at any point in
time. This feature will not be used in the current version of Prometheus but
may be relevant for future use-cases where it can be implemented without any
architectural change.
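As a minimal sketch of the idea (the names and the in-memory log are our own illustrative assumptions, not Prometheus internals), event sourcing amounts to an append-only log of events from which any past state can be replayed:

import time

EVENT_LOG = []  # append-only; in a real system this would be durable

def record_event(action: str, payload: dict) -> None:
    # An event carries at least its action and a timestamp (see above).
    EVENT_LOG.append({"action": action, "ts": time.time(), "payload": payload})

def replay(until_ts: float = float("inf")) -> dict:
    # Rebuild entity state at an arbitrary point in time from the log.
    state = {}
    for event in EVENT_LOG:
        if event["ts"] > until_ts:
            break
        if event["action"] == "entity_upserted":
            state[event["payload"]["id"]] = event["payload"]
    return state

record_event("entity_upserted", {"id": "issue/1", "title": "A bug"})
print(replay())  # {'issue/1': {'id': 'issue/1', 'title': 'A bug'}}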
[Fig. 2a – REST API crawling: a tree of GET requests, one per entity and page:
GET /repos/vuejs/vue
GET /repos/vuejs/vue/issues?page=1
GET /repos/vuejs/vue/issues/1
GET /repos/vuejs/vue/issues/2
GET /users/tenderlove
GET /users/tenderlove/repos]

[Fig. 2b – GraphQL API crawling: a cursor from the response is needed to paginate. A single POST carries the query

query {
  repository(name: "vue", owner: "vuejs") {
    name
    issues(first: 100, after: null) {
      nodes {
        title
        author {
          login
        }
      }
    }
  }
}

The response contains "pageInfo": { "hasNextPage": true, "endCursor": "Y3Vyc29yYyOp==" }; the cursor is set as the after argument of the next query. Nested entities such as "author": {"login": "tenderlove"} seed follow-up queries like query { user(login: "tenderlove") { name } }.]
Fig. 2: A comparison of how entities of the GitHub REST and GraphQL APIs are crawled; the pagination settings are the page parameter (REST) and the first/after arguments (GraphQL).
Since it is good practice to combine services that change frequently, and the GitHub data schema may change often [10], our service could also address data persistence. We argue, however, that frequent changes to the GitHub data model are unlikely, as GitHub has a long history of how people use the API and has put considerable thought into the GraphQL schema. It is more likely that the data model simply gets extended for new features, which does not break existing ones. GraphQL also makes minor changes easy to handle: attributes that will soon be dropped can be marked as deprecated, and new attributes can simply be added as nullable.
Therefore, storage capabilities will be handled explicitly by a separate service as described in section 3.3.
The interface of this service will only be used to control the crawling process, which means sending crawling jobs and monitoring progress. In our system, job
descriptions will be defined as GraphQL queries. Using GraphQL for job descrip-
tions makes sense since it is the same paradigm as the GitHub API, and because
it seems easier for developers to formulate GraphQL queries than, e.g., REST
queries [5]. These queries will be the same as regular GitHub GraphQL API
queries but allow for additional or different parameters. For instance, the query
seen in figure 2b uses 100 as the first argument value, which is the allowed max-
imum in the GitHub API. In our system, values beyond this maximum, e.g. 10,000, are possible. Allowing additional or changed arguments means the system needs logic to handle them. For example, it must paginate automatically when more than 100 elements are requested. An additional problem arises when connections are nested and more than 100 elements are required in the nested connection. Pagination requires a cursor of the previous page, which is specified as the after argument, as seen in figure 2b. The response to a query with nested pagination returns multiple cursors for the nested entities, one per parent entity, since each cursor refers to its parent. But since only one cursor can be supplied in the after argument, there is no correct way to map these cursors in a consecutive query. In most cases, however, the nested pagination problem can be solved by resolving it in separate queries. This is achieved by replacing a connection field with a field that queries just one entity: in the left query of figure 2b, issues can be replaced with issue, which returns a single issue. This resolves the nesting, if any, but requires splitting the query into one that fetches all issues and a second one that paginates the nested connection.
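A hedged sketch of the resulting split, using the fields from figure 2b (issue(number:) and assignees are real GitHub GraphQL fields; the variable names and flow are our own illustration):

# Query 1: paginate only the top connection (issues), no nested connections.
ISSUES_QUERY = """
query($after: String) {
  repository(owner: "vuejs", name: "vue") {
    issues(first: 100, after: $after) {
      pageInfo { hasNextPage endCursor }
      nodes { number title }
    }
  }
}
"""

# Query 2: for every issue from query 1, paginate the nested connection
# (e.g. assignees) through the direct-access field issue(number:).
ASSIGNEES_QUERY = """
query($number: Int!, $after: String) {
  repository(owner: "vuejs", name: "vue") {
    issue(number: $number) {
      assignees(first: 100, after: $after) {
        pageInfo { hasNextPage endCursor }
        nodes { login }
      }
    }
  }
}
"""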
[Fig. 3 – Docker setup: the fetcher container queries the GitHub GraphQL endpoint and accepts post/get-job requests via REST; the importer containers persist entities to an RDBMS and publish database changes (binlog).]
in the database needs an update. In GraphQL, one could check the updatedAt
field, but unlike the REST API, this consumes rate limit, so one could also
simply ask for all attributes and update the entity if needed.
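A small sketch of the second strategy (db is a hypothetical key-value view of the relational store; the field name mirrors GitHub's updatedAt):

def upsert_entity(db: dict, fetched: dict) -> None:
    # Fetch all attributes anyway, then insert or update in one step;
    # a separate updatedAt probe would consume rate limit regardless.
    stored = db.get(fetched["id"])
    if stored is None or stored["updatedAt"] < fetched["updatedAt"]:
        db[fetched["id"]] = fetched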
4 Prometheus System
The individual services are developed and deployed as Docker containers. These
coincide with the conceptual requirements for microservice architectures. In fact,
it has been shown that Docker can be a good fit when implementing microservices
[19]. Figure 3 shows the general Docker setup of Prometheus. The importer service consists of two containers: one for importing fetched entities and one for publishing changes in the database. Separating service functionality into several containers is not uncommon; this way, the system can spawn multiple containers of the desired functionality in case of a heavier workload. Both containers use the same relational database. The fetching functionality resides in a single container. The GitHub fetcher uses the GitHub GraphQL endpoint to query data. Crawling jobs can be submitted, and job progress summaries retrieved, via a REST API. The GitHub fetcher as well as the metastore publisher publish events via an event service when service state changes occur. Redis is chosen as the event system; its basic publish/subscribe and queuing mechanisms are sufficient for a prototype implementation of the proposed architecture.
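A minimal sketch of such a state-change event published via Redis (using the redis-py client; the channel name and payload shape are our own assumptions, not the exact Prometheus format):

import json, time
import redis

r = redis.Redis(host="localhost", port=6379)

def publish_state_change(entity: str, action: str) -> None:
    # Dependent services subscribe to this channel and react to changes.
    event = {"action": action, "entity": entity, "ts": time.time()}
    r.publish("prometheus.events", json.dumps(event))

publish_state_change("repository/vuejs/vue/issue/1", "created")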
GitHub Fetcher. One of the most important functions of the crawling service is
the processing of the job definitions. This includes splitting the query if there is
nested pagination. In addition, pagination must continue until the parameters
entered by the user are satisfied or there are no more objects. The service must also pass parameters from responses to the consecutive queries resulting from the splitting process, if any. If a query has nested pagination, the way to resolve it is to first
query the top paginated node, e.g. issues of a repository, and then for every
node in the response query the nested nodes, e.g. assignees of every single issue.
This approach is not always applicable; more precisely, it is only possible if the returned entities of the paginated top node can also be accessed directly.
Pseudocode on how to do that can be seen in code listing 1.1. The algorithm
starts with the query originally supplied, then remove_nested_pagination re-
places connection nodes containing a nested connection with their direct-access
counterpart (e.g. issue instead of issues). It is important that the top node is
replaced and the rest of the branch remains untouched, even if it contains more
nested connections. If a substitution has taken place, the nested pagination will
be removed from the previous query (e.g. assignees of issues). Also, obsolete
nodes are removed from the new query, i.e. all connection nodes that do not
have a nested connection. These are already crawled by the previous query and
are not needed in the new query. Finally, the new query and follow-up parame-
ter information, e.g. an issue number, are added. The loop continues to replace
nested connections until there are none left.
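Since listing 1.1 is pseudocode, the following is a runnable Python sketch of the same loop under our own simplified query model (a tree of fields; names such as replace_top_nested are ours, and the removal of already-crawled obsolete nodes is omitted for brevity):

import copy
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str                       # field name, e.g. "issues"
    is_connection: bool = False     # paginated connection field?
    direct: Optional[str] = None    # direct-access counterpart, e.g. "issue"
    children: List["Node"] = field(default_factory=list)

def has_nested_connection(node: Node) -> bool:
    return any(c.is_connection or has_nested_connection(c) for c in node.children)

def replace_top_nested(node: Node) -> bool:
    # remove_nested_pagination: swap the topmost connection that still
    # contains a nested connection for its direct-access counterpart,
    # leaving the rest of the branch untouched.
    for child in node.children:
        if child.is_connection and has_nested_connection(child):
            child.name, child.is_connection = child.direct or child.name, False
            return True
        if replace_top_nested(child):
            return True
    return False

def drop_nested(node: Node, below_connection: bool = False) -> None:
    # Remove nested pagination from the previous query: keep the first
    # connection level, drop connections below it (e.g. issues' assignees).
    kept = []
    for child in node.children:
        if child.is_connection and below_connection:
            continue
        drop_nested(child, below_connection or child.is_connection)
        kept.append(child)
    node.children = kept

def split(query: Node) -> List[Node]:
    queries, current = [], query
    while True:
        follow_up = copy.deepcopy(current)
        if not replace_top_nested(follow_up):
            break                   # no nested connections remain
        drop_nested(current)        # previous query loses the nesting
        queries.append(current)
        current = follow_up         # follow-up uses e.g. issue(number:)
    queries.append(current)
    return queries

# Example mirroring figure 2b: repository -> issues -> assignees.
assignees = Node("assignees", is_connection=True)
issues = Node("issues", is_connection=True, direct="issue", children=[assignees])
print(split(Node("repository", children=[issues])))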
5 Evaluation
This section evaluates the performance of Prometheus fetching and import ser-
vices. As a performance metric, we will measure throughput in fetched and stored
entities per second. This is done for Prometheus and Microsoft’s ghcrawler, a
REST-based GitHub crawler, to see whether the promised speed increase of the GraphQL API holds. Two experiments are performed: a simple one and one with deeper relationships.
(9 697 issues plus 1 repo). Currently, there is only one adjustable parameter for
retrieval performance in Prometheus, which is the number of work packages that
are combined in a query. Combining more work packages increases performance,
but can also lead to timeouts. Currently, this parameter cannot be dynamically
adjusted and is set to 100 for both experiments. This is quite aggressive and
sometimes leads to API timeouts, but so far never to unresolvable timeouts. For
ghcrawler, the required visitor maps are implemented in the source code. Four
tokens and ten concurrent processing loops are used for both tasks.
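Combining work packages is possible because GraphQL allows aliasing, so independent sub-queries share a single request. A hedged sketch of how such a combined query could be assembled (the wp prefix and selected fields are our own illustration):

def combine_work_packages(issue_numbers):
    # Each work package becomes an aliased sub-query of one request.
    fields = " ".join(
        f"wp{i}: issue(number: {n}) {{ title }}"
        for i, n in enumerate(issue_numbers)
    )
    return f'query {{ repository(owner: "vuejs", name: "vue") {{ {fields} }} }}'

print(combine_work_packages([1, 2, 3]))  # one request instead of three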
Figure 4 shows the result of the simple job. The first thing to notice is that
ghcrawler processes more entities than Prometheus. This is because the GitHub
REST API considers each pull request as an issue, but not vice versa [13]. Thus,
the REST API also returns all pull requests in the issues endpoint, and therefore
ghcrawler processes them (2 169 additional entities). The GitHub GraphQL API
explicitly separates pull requests and issues, so there is no such overhead in
Prometheus job execution. A unique feature of this job in Prometheus is that the
work packages are completely sequential: getting the next page of issues requires the last cursor, which means there is no advantage to combining work packages.
In terms of overall job execution, Prometheus is 4.2 times faster, with an average
throughput of 92 entities per second, while ghcrawler has a throughput of 26
entities per second. If we assume that ghcrawler does not retrieve the unwanted
pull requests, Prometheus is still 3.5 times faster at retrieving all entities. On
average, the processing loop of the Prometheus fetching service took 1.14 seconds
with a standard deviation (SD) of 0.46 to process a work package. The majority
of that time is used to make the actual API request, which took 1.09 seconds
(SD = 0.46) on average.
response already returns the summary representation of the assignees. Also, there
are only 20 unique users, so previously queried users that exist in the database
are not queried again because they are retrieved from storage.
5.3 Discussion
In the simple use case – fetching all issues from a repository – Prometheus
clearly outperforms ghcrawler in both execution time and token consumption.
Even if we exclude the discussed pull request overhead when retrieving issues,
Prometheus is still 3.5 times faster when retrieving all issues. Looking at to-
ken consumption, the difference is drastic. While ghcrawler requires at least
three tokens to issue all requests against the REST API, Prometheus consumes only about two percent of the rate limit of a single token. The result of the second experiment is different: Prometheus is slower in this case. This is because the current implementation incurs unnecessary overhead when fetching empty connections, and because it is currently fully synchronous in terms of the actual API calls. Although Prometheus is 3.7 times slower, it still has higher throughput and lower token consumption. The token consumption remained almost the same even though far more nodes were queried, because queries could be combined in this experiment. This is a particularly interesting result, as another study suggests that GraphQL API responses are smaller than those of REST APIs [4].
While this is true for an end-user application, crawling application developers
must be careful to avoid this pitfall. Especially because GraphQL may perform
worse on heavier loads than a REST counterpart [30], one does not want to
flood GraphQL endpoints with unnecessary calls. Even if not present in this experiment, the opposite can also occur: badly chosen calls can lead to unexpectedly complex queries, which may overload the server or even the client [24].
6 Conclusion
the strict splitting, we can still query items from the nested connection. When
querying paginated fields, one can also retrieve the total count of items available.
If we do both, we can eliminate all the redundant queries that made the second
experiment slow. We can also further optimize for throughput. For example,
a consecutive query always fetches the parents of the paginated node of interest,
resulting in severe overhead that should be eliminated. Furthermore, the processing
loop should be asynchronous so that the high response times do not affect the
execution time. Lastly, more use cases have to be implemented and tested to
verify the system’s effectiveness.
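For illustration, totalCount is a real field on GitHub GraphQL connections; the query below is our own sketch of the proposed optimization: fetch the first page of the nested connection inline together with its totalCount, and issue follow-up queries only for parents whose count exceeds the inline page.

# Our sketch of the proposed optimization: redundant follow-up queries
# are skipped whenever totalCount fits within the inline page size.
QUERY = """
query {
  repository(owner: "vuejs", name: "vue") {
    issues(first: 100) {
      nodes {
        number
        assignees(first: 10) {
          totalCount
          nodes { login }
        }
      }
    }
  }
}
"""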
References
1. Bjertnes, L., Tørring, J.O., Elster, A.C.: LS-CAT: A large-scale CUDA AutoTuning
dataset. In: 2021 International Conference on Applied Artificial Intelligence (ICAPAI). pp. 1–6. IEEE (2021). https://doi.org/10.1109/ICAPAI49758.2021.9462050
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine
Learning Research 3, 993–1022 (2003)
3. Borges, H., Hora, A., Valente, M.T.: Understanding the factors that impact
the popularity of GitHub repositories. In: 2016 IEEE International Confer-
ence on Software Maintenance and Evolution (ICSME). pp. 334–344 (2016).
https://doi.org/10.1109/ICSME.2016.31
4. Brito, G., Mombach, T., Valente, M.T.: Migrating to GraphQL: A practi-
cal assessment. In: Proc. 26th International Conference on Software Anal-
ysis, Evolution and Reengineering. pp. 140–150. SANER ’19, IEEE (2019).
https://doi.org/10.1109/SANER.2019.8667986
5. Brito, G., Valente, M.T.: REST vs GraphQL: A controlled experiment. In: Proc.
International Conference on Software Architecture. pp. 81–91. ICSA ’20, IEEE
(2020). https://doi.org/10.1109/ICSA47634.2020.00016
6. di Cosmo, R., Zacchiroli, S.: Software heritage: Why and how to preserve software
source code. In: iPRES 2017-14th International Conference on Digital Preservation.
pp. 1–10 (2017)
7. Dyer, R., Nguyen, H.A., Rajan, H., Nguyen, T.N.: Boa: Ultra-large-scale software
repository and source-code mining. ACM Transactions on Software Engineering
and Methodology 25(1), 1–34 (2015). https://doi.org/10.1145/2803171
8. de F. Farias, M.A., Novais, R., Júnior, M.C., da Silva Carvalho, L.P., Mendonça,
M., Spínola, R.O.: A systematic mapping study on mining software repositories.
In: Proceedings of the 31st Annual ACM Symposium on Applied Computing. pp.
1472–1479. SAC ’16, ACM (2016). https://doi.org/10.1145/2851613.2851786
9. Fowler, M.: Event sourcing (2005), https://martinfowler.com/eaaDev/
EventSourcing.html, accessed on 17.05.2022
10. Fowler, M., Lewis, J.: Microservices (2014), https://www.martinfowler.com/
articles/microservices.html, accessed on 17.05.2022
11. Fowler, M.: What do you mean by “Event-Driven”? (2017), https:
//martinfowler.com/articles/201701-event-driven.html, accessed on
17.05.2022
12. Gasparini, M., Clarisó, R., Brambilla, M., Cabot, J.: Participation inequal-
ity and the 90-9-1 principle in open source. In: Proceedings of the 16th
International Symposium on Open Collaboration. pp. 1–7. ACM (2020).
https://doi.org/10.1145/3412569.3412582
29. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for
authors and documents. In: Proc. 20th Conference on Uncertainty in Artificial
Intelligence. pp. 487–494. UAI ’04, AUAI Press (2004)
30. Seabra, M., Nazário, M.F., Pinto, G.: REST or GraphQL? A performance
comparative study. In: Proc. XIII Brazilian Symposium on Software Com-
ponents, Architectures, and Reuse. pp. 123–132. SBCARS ’19, ACM (2019).
https://doi.org/10.1145/3357141.3357149
31. Tiwari, N.M., Upadhyaya, G., Rajan, H.: Candoia: A platform and ecosystem
for mining software repositories tools. In: 2016 IEEE/ACM 38th International
Conference on Software Engineering Companion (ICSE-C). pp. 759–761 (2016)
32. Trautsch, A., Trautsch, F., Herbold, S., Ledel, B., Grabowski, J.: The
SmartSHARK ecosystem for software repository mining. In: Proceedings of the
ACM/IEEE 42nd International Conference on Software Engineering: Companion
Proceedings. pp. 25–28. ACM (2020). https://doi.org/10.1145/3377812.3382139
33. Wittern, E., Cha, A., Davis, J.C., Baudart, G., Mandel, L.: An empirical study of
GraphQL schemas. In: Proc. International Conference on Service-Oriented Com-
puting. pp. 3–19. Springer (2019). https://doi.org/10.1007/978-3-030-33702-5_1
34. Zhang, D., Han, S., Dang, Y., Lou, J.G., Zhang, H., Xie, T.: Software analytics in practice. IEEE Software 30(5), 30–37 (2013). https://doi.org/10.1109/MS.2013.94