Characterizing The Roles of Contributors in Open-Source Scientific Software Projects
Abstract: The development of scientific software is, more than ever, critical to the practice of science, and this is accompanied by a trend towards more open and collaborative efforts. Unfortunately, there has been little investigation into who is driving the evolution of such scientific software or how the collaboration happens. In this paper, we address this problem. We present an extensive analysis of seven open-source scientific software projects in order to develop an empirically-informed model of the development process. This analysis was complemented by a survey of 72 scientific software developers. In the majority of the projects, we found senior research staff (e.g. professors) to be responsible for half or more of commits (an average commit share of 72%) and heavily involved in architectural concerns (seniors were more likely to interact with files related to the build system, project meta-data, and developer documentation). Juniors (e.g. graduate students) also contribute substantially; in one studied project, juniors made almost 100% of its commits. Still, graduate students had the longest contribution periods among juniors (with 1.72 years of commit activity compared to 0.98 years for postdocs and 4 months for undergraduates). Moreover, we also found that third-party contributors are scarce, contributing for just one day to the project. The results from this study aim to help scientists to better understand their own projects, communities, and the contributors’ behavior, while paving the road for future software engineering research.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND2018-9345C.

I. INTRODUCTION

Computing technologies have had a profound impact on the practice of science: simulation and data-intensive computation are now known as the third and fourth paradigms of science, on equal footing with experimentation and theory [1]. This shift has accelerated the growth of a diverse ecosystem of scientific software projects. The term “scientific software” is an umbrella that covers all aspects of the research pipeline, including codes for simulation and data analysis, dataset management, communication infrastructure, and underlying mathematical libraries [2]. It is software that exists “to support the exploration of a scientific question” [3].

What makes scientific software projects different from traditional software projects? Scientific software operates at the boundaries of human knowledge and tends to be in constant flux as new insights motivate unforeseen changes in requirements [4]. As noted by Segal [5], this pressing need to produce or enable the production of knowledge lends itself to a mindset where “software is valued only insofar as it progresses the science”, often in conflict with the need to have reliable, maintainable code. However, Turk and colleagues remarked that, in an era of increasing scale and complexity, “the cyberinfrastructure necessary to address problems in computational science is no longer tractably solved by individuals working in isolation” [6]; broader, more open collaboration necessitates a shift in how the software is developed. From a software engineering research perspective, this motivates important questions about how the software evolves, who develops it, and how quality can emerge from this process.

We focus on the people meeting the demand for scientific software. Such scientific software developers represent a population so far not properly understood, since their characteristics, motivation, and needs to contribute to scientific software projects are intrinsically different from what drives traditional open source contributors. For instance, the actors that play the scientific developer role include students, postdocs, faculty, and staff. Their knowledge, skills, and goals can vary greatly, and they contribute to projects in different ways throughout their tenure. As a consequence, the plethora of existing studies on open source contributors might not help much, since they hardly take into account their roles or the complexity of the domains that scientific software is immersed in.

Much is still unknown about the state-of-the-practice of developing scientific software. For instance, who performs the majority of commit activities? Who fixes bugs? In order to better understand the relationship between these contributors and the software, we first leverage the availability and transparency of social coding websites to inspect data related to source code contributions and contributors. We selected a curated list of seven open-source scientific software projects by searching three different platforms: the Journal of Open Source Software, GitHub, and DOECODE, a platform for publicly funded DOE research codes. For each selected project, we identified the roles played by different contributors by analyzing each project’s documentation, websites, and other readily available sources. We then surveyed representative scientific software developers in order to cross-validate the findings from the repository analysis.

Using quantitative and qualitative data, our study produced a set of findings, some of which confirmed anecdotal accounts
Authorized licensed use limited to: Pontificia Universidad Catolica de Chile. Downloaded on September 20,2023 at 13:09:18 UTC from IEEE Xplore. Restrictions apply.
RQ3: How much of the development work is done by different contributors by role?
RQ4: What kinds of software maintenance and evolution activities do contributors perform?
RQ5: How do scientific software developers perceive their own software development process?

Our first question, RQ1, is demographic in nature, is based on aggregated project data, and tests the representativeness of our dataset. RQ2 enables us to make inferences about the division of labor based on personnel composition. Next, RQ3 digs into the kinds of responsibilities, such as file ownership and test file creation, that different contributors take up. RQ4 investigates what kinds of maintenance and evolution changes, such as adding new features or fixing bugs, these contributors make to the project. Finally, to answer RQ5, we surveyed 72 scientific software developers regarding their own contribution behavior.

B. Data Collection Procedure

Figure 1 depicts the steps followed by our data collection procedure.

[Figure 1 (flow diagram): step 1, sources DOECODE (1,039), JOSS (324), and GitHub (500); step 2, 1,863 scientific projects; step 3, manual filter using C1, C2, and C3; step 4, 7 selected projects.]
Fig. 1: Steps of the data collection procedure.

The first step (step 1) aims to find representative projects. We relied upon three data sources. First, we consulted DOECODE¹, a platform for publicly funded DOE research codes. Next, we searched the Journal of Open Source Software (JOSS)², a database of open source research software [14], which requires all entries be publicly available. Finally, we searched by topic on GitHub to find repositories with relevant tags (e.g., computational-neuroscience, bioinformatics). This yielded roughly 1,039 repositories from DOECODE, 324 from JOSS, and another 500 from GitHub. These numbers correspond to all projects on these platforms, except for GitHub, where we stopped searching once we had found 500 projects. This resulted in an initial set of 1,863 open source scientific software projects, which we chose to take into consideration (step 2).

From this list, we manually analyzed these repositories over several days (step 3), filtering the results according to the following criteria:
C1) Projects should have a contributor list. The repository must link to a detailed contributor list or research team page that identifies the roles played by different contributors to the project. We use this data to later distinguish contributors’ roles (Section III-E).
C2) Projects should be active. The project must be at least a year old, and the repository must have more than 500 commits. For example, a large number of projects on DOECODE were developed internally and then later released to the public. Thus, the GitHub repository is a shallow copy of the most recent version with no commit history.
C3) Projects should be collaborative. There must be at least three contributors who can be positively identified, and at least one of these must be considered a “junior” contributor. Many research projects on GitHub are small codes developed by individual researchers in isolation without any significant collaboration with others. Others are collaborative projects between senior staff at different institutions.

After applying these filters, we ended up with a curated list of seven scientific software projects (step 4).

C. Characterizing the population

We believe that the 1,863 projects in our population of repositories are a representative sample of scientific software projects that can be found in the wild. However, many of these projects are unlikely to provide useful information for our purposes, such as short-lived or single-user research codes, snapshots of codes released for publications, untouched clones of decades-old legacy projects, or mirrors of private repositories lacking history information. As shown in Figure 2, filtering for the number of commits and contributors eliminates roughly 8 out of 10 of the repositories; the remainder are most likely to be active, collaborative, and (most importantly) to have a rich history on GitHub that we can pull apart. The median project within this group has 12 contributors and 1,770 commits spread out over 2.93 years.

We can examine a sample of 5,000 of the contributors to these repositories, using their number of commits and the number of days spanned by those commits as a proxy for involvement. Of these contributors, 29% make only one commit, and 40% are active for no more than a day. Among contributors more active than these, the average individual has an active tenure of 1.37 years, and during this time they make 116 commits (6% of the commit activity of the median project). If we zoom in on any one of these contributors, we can observe their activities, but what interests us most is how their identity relates to those activities. This is a more challenging problem to solve, and analysis is much less scalable. Instead, in this work we take a deep dive into a handful of projects whose members we can identify, as a case study into the composition of these teams.

¹ https://www.osti.gov/doecode/
² https://joss.theoj.org/
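The thresholds in criteria C1 to C3 and the involvement proxy (number of commits and days spanned by those commits) are simple enough to sketch in a few lines. The block below is only an illustration under stated assumptions: the `Project` record, its field names, and both function names are hypothetical and do not come from the study's tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Project:
    """Hypothetical record describing one candidate repository."""
    name: str
    has_contributor_list: bool  # C1: a contributor list or team page exists
    age_years: float            # C2: time since the first commit
    commits: int                # C2: commits visible in the repository
    identified: int             # C3: positively identified contributors
    juniors: int                # C3: identified junior contributors


def passes_filters(p: Project) -> bool:
    """Apply the C1-C3 thresholds exactly as worded in the text."""
    c1 = p.has_contributor_list
    c2 = p.age_years >= 1 and p.commits > 500
    c3 = p.identified >= 3 and p.juniors >= 1
    return c1 and c2 and c3


def involvement(commit_dates: List[date]) -> Tuple[int, int]:
    """Return (number of commits, days spanned by those commits)."""
    span_days = (max(commit_dates) - min(commit_dates)).days
    return len(commit_dates), span_days


# A contributor with a single commit spans zero days.
print(involvement([date(2018, 1, 1)]))
print(passes_filters(Project("demo", True, 2.0, 800, 4, 1)))
```

A contributor whose span is zero days falls into the "active for no more than a day" group described above. In the study, C1 was checked by hand, so the boolean field here stands in for a manual judgment.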
established). In cases where we had to rely on incomplete
GitHub profile data (e.g. unlisted thirdparty contributors), we
attempted to extract names from handles (e.g. johnsmith79
→ John Smith) and cross-referenced those names with
web searches for similarly named researchers in the relevant
field; where we could not be reasonably convinced that the
identities matched, the contributor was left as unknown. To
ease understanding, we further group these contributors as
juniors (i.e., gradstudent, undergrad, and postdoc), seniors
(i.e., staff), and thirdparty.
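The grouping of roles and the first, mechanical part of the handle lookup can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, and the token splitter only prepares search terms, since matching a handle to a real person was a manual, judgment-based step.

```python
import re

# Role labels from the text, mapped to the three coarse groups.
ROLE_GROUPS = {
    "gradstudent": "junior",
    "undergrad": "junior",
    "postdoc": "junior",
    "staff": "senior",
    "thirdparty": "thirdparty",
}


def group_of(role):
    """Map a fine-grained role to its group; anything else stays unknown."""
    return ROLE_GROUPS.get(role, "unknown")


def handle_tokens(handle):
    """Split a handle into lowercase name-like tokens for a web search.

    Digits and separators are dropped and camelCase boundaries are split.
    A handle with no internal boundary, such as "johnsmith79", comes back
    as a single token and still needs human judgment to split.
    """
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", handle)
    return [t.lower() for t in re.split(r"[^A-Za-z]+", spaced) if t]


print(group_of("postdoc"))
print(handle_tokens("JohnSmith79"))
```

Returning "unknown" for unmapped roles mirrors the rule above that contributors are left as unknown when their identity cannot be reasonably established.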
TABLE I: List of studied projects. Age is given in years. kLOC is calculated using the cloc utility and includes blank, comment, and code lines. PL means Programming Language.

Project (GitHub)                 Contributors identified  kLOC    Commits  Stars  PL      Age  Description
Chaste (Chaste/Chaste) [15]      97%                      371.4k  4.7k     22     C++     8    Tissue and cell level electrophysiology, discrete tissue modeling, and soft tissue modeling
Khmer (dib-lab/khmer) [16]       90%                      145.1k  6.6k     528    Python  7    Nucleotide k-mer counting, filtering, and graph traversal
PyGBe (barbagroup/pygbe) [17]    100%                     12.4k   0.9k     28     Python  6    Biomolecular electrostatics and nanoparticle plasmonics
LBANN (LLNL/lbann) [18]          99%                      66.8k   3.5k     40     C++     4    Artificial neural network toolkit
Hail (hail-is/hail)              98%                      72.9k   3.1k     357    Scala   2    Genomic analysis
Genn (genn-team/genn) [19]       96%                      37.4k   1.8k     77     C++     6    Neuron and synapse modeling
openMOC (mit-crpg/openMOC)       90%                      21.6k   2.6k     50     C++     4    Nuclear reactor physics
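Since the kLOC column counts blank, comment, and code lines together, a small sketch may help show how such a total could be derived from cloc output. It assumes the JSON report layout produced by cloc's `--json` option, with a `SUM` entry holding `blank`, `comment`, and `code` counts; the function name and the sample report are hypothetical, and the numbers are made up for illustration.

```python
import json


def kloc_from_cloc_json(report):
    """Total blank + comment + code lines, in thousands, from a cloc report."""
    total = json.loads(report)["SUM"]
    return (total["blank"] + total["comment"] + total["code"]) / 1000.0


# Hand-written stand-in for a cloc JSON report; values are invented.
sample = json.dumps({
    "header": {"n_files": 2, "n_lines": 1500},
    "Python": {"nFiles": 1, "blank": 100, "comment": 200, "code": 700},
    "C++": {"nFiles": 1, "blank": 50, "comment": 50, "code": 400},
    "SUM": {"blank": 150, "comment": 250, "code": 1100, "nFiles": 2},
})
print(kloc_from_cloc_json(sample))
```

Counting blank and comment lines makes these totals larger than the code-only figures often quoted for a project, which is worth keeping in mind when comparing Table I with other sources.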
distinction that we are making is that this is not the same thing as the number of contributors listed as team members on the webpage of the project or research group. For example, a graduate student may be a user of the software but not a contributor, and in the case of projects like Chaste, we can have multiple staff members responsible for design work and guidance of students but who have no commits to their name.

Fig. 4: For RQ2, for each project we present a bar chart with totals of identified contributors sorted by role.

We showcase our results in Figure 4. Taken together, junior members make up the majority of team contributors (average 69%; median 80%), with the remainder being seniors (average 31%; median 20%). Meanwhile, in all but one of the projects we studied, we were able to identify thirdparty contributors (average 20%; median 27%). Additionally, as a rule, both team members and thirdparty contributors that we identified have a background relevant to the domain of the project, something which we learned by analyzing biographical information used to classify contributors by role.

RQ3: How much of the development work is done by different contributors by role?

For RQ3, we want to characterize the amount of development work that is done by these different actors. To begin, we consider the share of commits produced by different contributors, the number and frequency of commits being well-worn metrics for engagement and investment in a project [26], [28]. Taking averages across all projects that we studied, half or more of commits are made by senior members (average 50.76%; median 63.29%). However, the majority of the other half are commits by junior contributors (average 42.9%; median 36.7%). Moreover, for 4 out of 7 projects (Khmer, Chaste, Hail, and openMOC), senior researchers are responsible for a plurality of commits (with an average commit share of 77.12%; median 73.34%); the opposite is true for Genn, LBANN, and PyGBe (average share 23.92%; median 22.58%), with LBANN having a more even split between junior and senior members. We also note that, for the projects we have studied, thirdparty contributors tend to play a very minor role in this regard; only LBANN has a notable share of commits attributable to thirdparty users (17.7% versus an average of 3.59% and median of 0.06%). Overall, in all projects but one (Hail), junior contributors produce a significant share of commits, meaning that even when seniors do most of the heavy lifting, juniors play an essential role in the realization of the project.

However, while juniors and seniors alike generate significant amounts of commit activity, this is not to say that the scope of their activities is comparable. To better understand this, we consider interactions with and ownership of files. For the purposes of this work, we record a user as interacting with a file each time they make a commit that touches that file. By that measure, the typical junior interacts with a much smaller percentage of files compared to a senior (average 5.85% vs. 20.35%; median 0.65% vs. 11.51%). This is to say that a distinguishing characteristic of junior developers in our corpus is that they often have a narrow focus on a particular subset of a project. Meanwhile, the same is especially true for thirdparty contributors, who interact with an even smaller percentage of files (average 1.64%; median 0.66%).

Likewise, we can also consider file creation. Earlier work by Poncin et al. [29] addresses file creation in their operationalization of “core” developers, as frequent creation and modification of files indicates that a user is helping to drive the vision or direction of the software. Related to this, in a recent study of large-scale open source projects, Lin et al. [25] found that users who created files tended to be longer-term contributors than those who modified files. In 5 out of 7 of the projects we studied (Khmer, Chaste, Hail, openMOC, and Genn), senior team members created the majority of files (average of 69.44%); LBANN is almost evenly split by this measure, and PyGBe, as a student-driven project, has only a quarter of its files originating from senior members.

RQ4: What kinds of maintenance and evolution activities do contributors perform?

What value do different kinds of contributors add to a project? For example, once a gradstudent exits a project, in what ways did they influence the evolution of the software during their tenure? We can find some evidence for this through project pages and documentation when teams provide an itemized list of accomplishments of different contributors (as is the case for several of the projects in our study); it is typical to see juniors receiving credit for implementing novel features pertinent to their research, seniors for building out infrastructure and performing maintenance, and thirdparty contributors for providing support or helping improve the codebase. However, relatively few projects provide this kind of fine-grained information, and we would also like to be able to interrogate those claims in an empirical way.

To answer our question, we present two views of the development activities that elucidate the kinds of work that
different contributors produce. The first is an analysis of commit messages based on the approach of Hattori and Lanza [28], and the second is an analysis of file paths involved in commits based on the work of Vasilescu et al. [30]. In both cases, we are interested in categorizing commit activity according to its purpose or intent.

In the framework set out by Hattori and Lanza [28], commits are divided into four major categories of activity:
1) Forward engineering (Fwd), for instance, adding new features;
2) Reengineering (Reng), for instance, refactoring activities;
3) Corrective (Corr), for instance, fixing bugs;
4) Management (Mgmt), for instance, updating documentation.

In order to automatically classify commits into these categories, the authors compare the content of commit messages against predefined word banks for each commit type based on the earliest match found. For instance, consider the following commit message: “This commit adds integrators supporting the combined, staggered, and pseudotransient forward sensitivity analysis methods where the sensitivity equations are solved alongside the forward equations.”⁴ Unpacking the semantics of this commit message requires extensive domain knowledge, which is difficult to automate. However, the word “add” is a match in the word bank for forward engineering; it is reasonable to assume (in this case) that the commit is adding a new feature to the software.

This approach is limited in that it only considers the lexical content of messages, and it also fails to handle situations where a commit may belong to more than one category, but it remains useful as a diagnostic tool. To test the validity of this classification scheme against our corpus, we chose to manually classify a representative random sample of commits drawn from across all projects. Assuming that all projects have statistically similar commits (in the sense that the distribution of commits by type is roughly the same), a sample of 378 commits should reflect the overall population of roughly 23,000 commits with a confidence level of 95% and a confidence interval of ±5%. In order to arrive at the ground truth, our manual classification considers not only commit messages but also the artifacts (such as source code, documentation, and test data) that were modified and the context in which that occurred (such as preceding commits and related files).

Table II shows the confusion matrix between the automated approach and the manual analysis. First, we note that because a commit may fail to match against the word bank, it is possible for this approach to fail to find a label. This happened for 27% of commits in our sample. We identified three causes for this: (1) manual classification relied on words that were outside of the word bank (e.g., “vectorized book-keeping kernels”), (2) commit messages were automatically generated and vague as to their purpose (e.g., merging an arbitrary pull request), and (3) messages could be highly ambiguous (e.g., “complete breakdown of intuition” or “it is all becoming clear”). However, we consider this to be a more gentle form of failure than attempting to shoehorn an unintelligible commit into an arbitrary category.

For commits which were classified automatically, the manual and automatic approaches agreed 69% of the time; much of the error was concentrated on management commits, only a third of which were correctly labeled. If we limit our consideration to just forward engineering, reengineering, and corrective commits, then the automatic approach agrees 89% of the time, with some minor confusion between forward and reengineering activities. As such, we limit our consideration to just those three.

The second approach we use is derived from that of Vasilescu et al. [30], which categorizes commit activity by examining the filepaths involved in changes; file extensions (e.g. .cpp versus .csv) and hints in file paths (e.g. /src/ versus /test/) can clue us in to the purpose of a file and, by extension, the kind of labor that an individual provides a project through their interaction with those files. The classification algorithm itself is analogous to what was previously described: filepaths are matched against a bank of regular expressions that map to different categories of files. Unlike with the Hattori-Lanza scheme, these results are much less ambiguous because it is reasonable to assume that file extensions indicate actual file types. For this work, we made several addenda to the regexes used in the original paper in order to cover additional programming languages (e.g., Pascal and Ada), data storage types commonly used in scientific computing (e.g., HDF5 and FASTA), as well as a handful of previously unaddressed build and configuration artifacts (e.g., Dockerfiles and Gradle build files). In Table III and Table IV we provide results for our two analyses as an aggregate of all contributors in the projects we studied as a way of approximating a “typical” project.
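The two matching rules and the validation arithmetic can be sketched together. The block below rests on several assumptions: the word banks and path patterns are placeholders (the actual banks of Hattori and Lanza and the regexes of Vasilescu et al. are not reproduced in the text), "earliest match" is read as the bank word occurring first in the message, and the sample size uses Cochran's formula with a finite population correction, which is one common choice that happens to reproduce the 378 quoted above. Only the `TABLE_II` counts are taken from the paper.

```python
import math
import re

# Placeholder word banks; entries are illustrative assumptions.
WORD_BANKS = {
    "Fwd": ["add", "implement", "support"],
    "Reng": ["refactor", "rename", "simplify"],
    "Corr": ["fix", "bug", "crash"],
    "Mgmt": ["doc", "license", "release"],
}

# Placeholder path patterns, checked in order (first match wins).
PATH_BANK = [
    ("test", re.compile(r"(^|/)tests?/")),
    ("build", re.compile(r"(^|/)(Makefile|Dockerfile|CMakeLists\.txt)$")),
    ("data", re.compile(r"\.(csv|h5|hdf5|fasta)$")),
    ("code", re.compile(r"\.(c|cpp|h|hpp|py|scala)$")),
]

# Counts from Table II: outer key is the automated label, inner key the
# manual label. The Unk row is omitted because it has no automated label.
TABLE_II = {
    "Fwd": {"Fwd": 53, "Reng": 11, "Corr": 1, "Mgmt": 14},
    "Reng": {"Fwd": 3, "Reng": 60, "Corr": 0, "Mgmt": 13},
    "Corr": {"Fwd": 1, "Reng": 4, "Corr": 60, "Mgmt": 5},
    "Mgmt": {"Fwd": 7, "Reng": 14, "Corr": 11, "Mgmt": 16},
}


def classify_message(message):
    """Return the category whose bank word appears earliest, else 'Unk'."""
    text = message.lower()
    hits = []
    for label, words in WORD_BANKS.items():
        for word in words:
            found = re.search(r"\b" + re.escape(word), text)
            if found:
                hits.append((found.start(), label))
    return min(hits)[1] if hits else "Unk"


def classify_path(path):
    """Return the first file category whose pattern matches the path."""
    for category, pattern in PATH_BANK:
        if pattern.search(path):
            return category
    return "unknown"


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with finite population correction."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def agreement(labels):
    """Share of automatically labeled commits whose manual label matches."""
    total = sum(TABLE_II[a][m] for a in labels for m in labels)
    return sum(TABLE_II[k][k] for k in labels) / total


print(classify_message("This commit adds integrators supporting the combined methods"))
print(classify_path("test/data/input.csv"))
print(sample_size(23000))
print(agreement(["Fwd", "Reng", "Corr", "Mgmt"]), agreement(["Fwd", "Reng", "Corr"]))
```

With the Table II counts, `agreement` evaluates to 189/273 (about 69%) over all four categories and 173/193 (about 89%) once management commits are excluded, and `sample_size(23000)` evaluates to 378, which matches the figures reported in the text.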
TABLE II: Confusion matrix for validation of the Hattori-Lanza [28] classification scheme. Rows are automated labels and columns are manual labels. Unknown (Unk) commits are those which the algorithm failed to classify.

             Manual
Automated    Fwd   Reng   Corr   Mgmt
Fwd          53    11     1      14
Reng         3     60     0      13
Corr         1     4      60     5
Mgmt         7     14     11     16
Unk          9     36     5      55

⁴ https://GitHub.com/trilinos/Trilinos/commit/e8e6d67

Table III shows what percentage of commits made by an average individual are categorized as forward engineering, reengineering, and corrective activities; Table IV shows what percentage of individuals have made at least one commit that interacts with a file of a given type.

As a group, senior contributors have the highest average share of forward engineering commits (33.37%), which is to say that commits made by seniors are more likely to include novel development work, such as adding or extending software capabilities. Senior developers also play a key role in realizing the supporting infrastructure of their projects, with a majority having interactions with build, devdoc, and
metadata files. Likewise, a plurality of seniors interacts with data/database files (such as data for validation tests and parameters for research models) and test files.

Next, the activities of junior contributors resemble those of senior contributors in many key respects. Like seniors, juniors universally interact with code files, and a comparable share of their commits goes towards forward engineering (26.77% for juniors vs. 33.37% for seniors), reengineering (22.11% vs. 18.90%), and corrective (15.93% vs. 14.91%) activities. Also like seniors, the majority of juniors interact with build, devdocs, and test files (though in smaller measures compared to seniors), and this was true in general for all projects we studied. This reinforces our earlier observations suggesting that their work is neither subordinate nor peripheral compared to the work of seniors, but is instead vitally important to the enterprise.

Finally, we consider thirdparty contributors who, as we determined earlier, are relatively minor players who as a rule only sporadically contribute to projects. Code and devdocs aside, they scarcely interact with any kind of file. One point that stands out, however, is that these contributors are more likely to perform corrective activities (with an average commit share of 35.06%). This is to say that while they make very few commits, the commits they do make are more likely to be bug fixes; that suggests that thirdparty contributors are most likely to be users of the software who have the domain knowledge and development expertise needed to correct such bugs or “scratch their own itch”. In essence, thirdparty contributor behavior is similar to the kind of work produced by peripheral developers, who are typically involved in bug fixes and have irregular or short-term involvement in a project [31], [32].

TABLE III: The relative share of automatically classified commits of an average junior, senior, or thirdparty contributor. Management commits are omitted.

             Fwd Share   Reng Share   Corr Share
Juniors      26.76%      21.11%       15.93%
Seniors      33.37%      16.07%       14.91%
thirdparty   19.98%      18.89%       39.45%

TABLE IV: The percentage of contributors in each category who have at least one commit that interacts with a given file type. N/A indicates that no matches were found for regexes in any projects studied for a given category.

                   Juniors   Seniors   thirdparty
Documentation      19.1%     26.0%     0%
Images             14.5%     22.2%     2.7%
Localization       2%        0%        0%
UI                 N/A       N/A       N/A
Media              27.7%     33.3%     2.7%
Code               100%      100%      70.3%
Project Metadata   36.2%     51.85%    8.1%
Configuration      34.0%     33.3%     5.4%
Build              63.8%     77.8%     29.7%
Devdocs            63.8%     88.8%     66.7%
Data/Databases     36.2%     63.0%     18.9%
Test               74.5%     85.2%     27.0%
Libraries          N/A       N/A       N/A

RQ5: How do scientific software developers perceive their own software development process?

For our final research question, we consider how scientific software developers view their own projects based on our survey data; this provides us with points of comparison with our quantitative findings. Among our survey respondents (those we identified as being most responsible for their respective projects), 36% were postdocs, 30% were non-academic professionals, another 30% were students (undergraduate or graduate), and 14% were professors. The majority of them (62%) work for a university or college, 11% work for the government, 8.5% work in industry, and the other 18.5% play other roles. Regarding their highest academic degree, 76.4% had already received or were working towards their doctoral degree (18% have a master's degree, and only 4% have a bachelor's degree). They worked in a variety of fields, including computer science, fine art, chemistry, political science, urban planning, and neuroscience.

The majority of projects in our survey (61%) were developed by a team of people, though it is worth noting that a significant number of projects were the work of individuals (39%). On average, these teams had 3.6 contributors (3rd quartile: 4.2, max: 15); this roughly aligns with the number of active contributors in a given year in the repositories which we mined (average: 3.8, 3rd quartile: 5, max: 11).

When we asked (Q5) what and how they contribute, we observed that 50 respondents reported software development-oriented activities such as fixing bugs, developing scripts to support research, improving documentation, and adding tests. Closely tied to software development activities, 18 respondents reported contributing to non-software activities, such as paper writing, grant writing, running experiments, etc. Along these lines, three respondents perceive their contributing role as “Conducting research that feeds back into the project”.

Regarding how scientific software developers are trained to do their jobs (Q6), 70% of the respondents were self-taught, although some of them received mentorship from senior contributors (e.g., “Shadowing a more senior developer for a week or two”), while others benefited from online training programs (e.g., “The Molecular Sciences Software Institute (molssi.org) training programs.”), took advantage of their own documentation (e.g., “We make sure that the docs are self-contained to ease onboarding for remote teams”), or even relied on the pull-request process (e.g., “By first contributing some pull requests and getting code reviews”). Only four respondents were trained through their academic degree.

When considering the responsibilities they need to take on to prepare for the departure of a team member (Q7), 20 respondents mentioned the importance of keeping the documentation updated (e.g., “We simply try to ensure that all developments are adequately documented at the time, to help the understanding of future developers”). Interestingly, 8 respondents mentioned that this never happened, which is partially because they are working on a small or solo team (e.g., “No one has departed yet (or joined...)”). Other respondents mentioned the need to
train others, to push code, or to add tests.
Of those projects run by teams of two or more people, 35 out of 41 specifically called attention to the role played by juniors in developing their software. 20 of these described them as being responsible for developing specific, non-core features of the software; projects that followed this pattern offered up explanations such as “[juniors] – by necessity – start with smaller peripheral bugs and features. Core development requires a lot of experience and knowledge.” Another twelve projects, however, cast juniors as being developers of core infrastructure, typically on the reasoning that senior members “cannot afford to put much time into development”.
Meanwhile, 24 of the 41 team projects emphasized the role of seniors in development. 8 of these said seniors developed the core of the software and 4 the periphery. Those that did so often emphasized the need for experience in development, insofar as “the more education a team member has (software development life-cycle, good coding practices, etc.), the better they are at seeing ‘big picture’ development tasks [...] these people often lead development”. However, in contrast to this, ten respondents characterized seniors as being visionaries first and developers second. In this view, the role of seniors is to “coordinate activities”, “drive the direction of the project”, “guide the conceptual development”, and to provide the “theoretical details”.
Lastly, only 9 out of the 72 projects gave recognition to third-party contributors. Among these projects, the typical view was that while third parties “contribute seldomly”, they were also a common source of bug fixes, a view echoed in our findings from RQ4. Likewise, these contributors were also responsible for “[submitting] small patches to make [a] tool better meet their own niche use cases”.
V. DISCUSSION
We have summarized the major findings in Table V, and now consider the potential implications of our work.

TABLE V: A summary with descriptions of typical contributors, based on the findings in this work.
Contributor Findings
Seniors
• Are often active on a project for four or more years.
• Make the majority of contributions to a typical project (average 50%). They create the most files and touch the most code.
• Are most often responsible for forward engineering activities, development of the core of the software, and infrastructure tasks.
• Provide guidance and visionary leadership to junior contributors, especially when they do not have the time or resources to work on the code themselves.
Juniors
• Are active for no more than 2 years. Roughly 25%-35% of their time is spent on software development.
• Make up the majority of team members.
• Perform many of the same development activities as seniors, and, collectively, generate 42% of commit activity on average.
• Are most likely to develop peripheral, non-core features of the software.
Third Parties
• Are active for only one day.
• Have a background in the domain of the project and an interest in using the software.
• Make only a handful of commits. These commits are most likely to be bug fixes.

Training. As a group, juniors have long been the subject of science public policy literature. Novice researchers are “canaries in the mine” for the health of the scientific enterprise, as it is during this period that they are meant to learn the values and skills needed to participate fully as scientists [33]. However, while software development is an increasingly important skill, the amount of direct experience they acquire may be limited by competing demands in their academic careers (see RQ1). This brings into focus a number of different topics regarding software sustainability (e.g., the importance of good practices such as code reviews) as well as educational policies (e.g., the need for more formal software development training in STEM curricula).
Building Communities. Much has been written about the value of openness in science and the need for community support of scientific software. As we noted in Section III-C, 40% of contributors stay on for only a day. Our analysis suggests that many of these may be third-party users who have a formal background in a domain relevant to the project; even though these users may only be involved for a short time, they can still make valuable contributions such as fixing bugs. However, despite the widespread presence of short-term committers in the population and third-party contributors being present in all but one of the projects in our sample, only 12% of respondents in our companion survey mentioned these contributors. Based on this, we believe that better community policies could help attract these contributors, such as providing guides for new contributors and explicitly giving credit to these users.
Supporting Sustainability. Scientific software projects are known for being long-lived and under constant pressure to keep pace with scientific advances. It is common for senior project members to provide visionary leadership to guide that process, but how this translated to software construction was unclear. Our findings place seniors in a primary role as core developers who are most likely responsible for forward engineering and infrastructure tasks. Our findings suggest directions for future research, such as how research priorities generate software development tasks and when and how those tasks are delegated. A more complete understanding of this phenomenon would help software engineers develop better tools and techniques to support that effort.
VI. THREATS TO VALIDITY
First, not all members of a project show up as contributors to the repository; for example, senior staff may offer guidance and support while leaving the actual implementation work to others. Second, people who stop contributing code may still be part of the project. This is frequently the case for graduate student contributors, who may refocus on completing coursework or a thesis towards the end of their tenure. Third, team websites are not always up-to-date, and not all contributors are given
explicit recognition for their work; in one case, an undergraduate student was uncredited on a project site, but a subsequent web search uncovered a university press release detailing a research grant that was awarded for them to do specific work on the project in question. Finally, not all contributors are project members; as with most open-source software projects, open-source scientific software attracts third-party contributors, typically senior researchers who benefit from using the software.
Coding individuals using our approach means addressing several potential ambiguities. First, on a few occasions, a subject may have belonged to different categories at different times (e.g., a staff member starting off as a postdoc). When this occurred, we labeled them according to the role in which they made the majority of their commits. Second, for large research institutions (e.g., national labs), a software project may receive contributions from people nominally part of the same organization, but unaffiliated with the research team; we resolved this by treating those contributors as third-party contributors.
The commit analysis performed to answer RQ4 was an automated process which, when applied to a large number of commits, can silently yield false positives (i.e., commits that were unable to be categorized), since commit messages might lack semantic sense [34] or even be empty [35] (i.e., zero words). To mitigate such bias, we conducted a manual analysis over a representative sample of 378 commits (confidence level of 95% with an interval of ±5%). We observed a low number of uncategorized commits. Although uncategorized commits still exist, we believe this approach is the fairest way to categorize the commit intention because, since we are dealing with scientific software projects, the domain of our studied projects is highly specific and complex. Therefore, any other attempt to categorize commits using our own domain experience would introduce even more bias.
Lastly, one could argue that this study does not provide a novel contribution, e.g., “obviously graduate students are largely responsible for adding new features”. However, such common-sense assumptions are often not backed up by empirical evidence. This paper compiles such evidence and, more importantly, quantifies the phenomenon; even though some perceptions are confirmed (e.g., “gradstudents stay longer than postdocs and undergraduates”), others are uncovered (e.g., “thirdparty members survive only one day on average”).
VII. RELATED WORK
Studies on scientific software development. Turk [6] presents techniques for encouraging community engagement with scientific software in the astrophysics community, arguing that attracting third-party contributors requires intentional actions designed to encourage their participation. Likewise, Bangerth and Heister [13] outline the practices of successful open-source scientific software libraries, which includes a discussion of the value proposition behind open-sourcing software primarily written by juniors. Sletholt et al. [36] perform a case study of agile development practices among scientific software projects. Howison and Herbsleb [8], a qualitative study of the incentives for creating and maintaining scientific software, identifies the general breakdown of contributors to projects that they study; however, both the focus and methods used in that work differ from ours, as they are not concerned with the specifics of how different people contribute to the software development. Finally, Storer [37] provides an excellent survey of the state of software engineering practice within the scientific community. Our work shares the same motivations as others, but, to our knowledge, is the first to tackle this subject from a software repository mining perspective.
Studies on roles of contributors in open source projects. The study of core developers, i.e., developers that play an essential role in developing the system architecture and forming the general leadership structure, in open source projects is a fruitful research area [38], [39], [40]. Core developers are well known for being active contributors. A general rule of thumb suggests that contributors with more than 80% of the overall contributions are considered core developers [31], [41]. Indeed, for some projects, this number is even higher. Recent work indicates that several well-known, active open source projects rely on 1–2 core developers to drive most of their maintenance and evolution tasks [42]. On the periphery side, research indicates that peripheral developers account for more than 90% of the contributors of a project [31]. Moreover, several authors have acknowledged the existence and the growth of casual contributors (i.e., developers that contribute just once) [43], [32], [44]. Here we enriched the understanding of core and peripheral developers. We also introduced the notion of third-party contributors, which share some of the behaviors commonly observed in peripheral developers (e.g., they have a short-term relationship with the project, and most of their contributions are intended to fix bugs).
VIII. CONCLUSION
Scientific software projects are critical to the advancement of the scientific enterprise, and software engineering research can directly help those efforts through tailored tools, techniques, and practices. However, there has historically been a lack of hard data on who contributes to scientific software and how they behave. In this work, we mined logs from seven non-trivial open source scientific software projects in order to provide answers to these questions. Among our findings, we found that while senior researchers perform the lion’s share of the work in many projects, junior researchers are often on the frontlines driving the software development. We also considered the habits of third-party contributors who, while often operating at the periphery of projects, have a valuable role to play in fixing and improving code.
For future work, we plan to conduct ethnographic studies of scientific software projects to better understand topics such as feature creation and bug fixing. We also plan to study how scientific software projects compare with conventional projects.
Acknowledgments. We thank the reviewers, the 72 respondents, and PROPESP/UFPA and CNPq (#406308/2016-0).
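
Note on the validation sample in Section VI. The 378-commit sample (95% confidence, ±5% interval) is the kind of figure a standard sample-size calculation produces. The sketch below uses Cochran's formula with a finite-population correction. This is an assumption about how such a number can be derived, not a description of the procedure the authors followed, and the population of 23,000 commits is a hypothetical value chosen only for illustration.

```python
import math


def sample_size(population=None, z=1.96, margin=0.05, p=0.5):
    """Sample size for estimating a proportion.

    z          -- z-score for the confidence level (1.96 for roughly 95%)
    margin     -- half-width of the confidence interval (0.05 for +/-5%)
    p          -- assumed proportion; 0.5 is the most conservative choice
    population -- total number of items, or None for an unbounded population
    """
    # Cochran's formula for an unbounded population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is None:
        return math.ceil(n0)
    # Finite-population correction shrinks the required sample.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


if __name__ == "__main__":
    print(sample_size())                  # unbounded population
    print(sample_size(population=23000))  # hypothetical commit count
```

Without a population the formula gives 385, and the correction lowers the figure as the population shrinks; a hypothetical population of about 23,000 commits gives 378. Rounding conventions differ between calculators, so nearby values are also plausible.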
REFERENCES
[1] T. Hey, S. Tansley, K. M. Tolle et al., The fourth paradigm: data-intensive scientific discovery. Microsoft Research, Redmond, WA, 2009, vol. 1.
[2] J. C. Carver, N. P. Chue Hong, and G. K. Thiruvathukal, Software Engineering for Science. CRC Press, 2016.
[3] E. S. Mesh and J. S. Hawker, “Scientific software process improvement decisions: A proposed research strategy,” in Software Engineering for Computational Science and Engineering (SE-CSE), 2013 5th International Workshop on. IEEE, 2013, pp. 32–39.
[4] C. Letondal and U. Zdun, “Anticipating scientific software evolution as a combined technological and design approach,” in Second International Workshop on Unanticipated Software Evolution, 2003.
[5] J. Segal, “Scientists and software engineers: A tale of two cultures,” 2008.
[6] M. J. Turk, “Scaling a code in the human dimension,” in Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery. ACM, 2013, p. 69.
[7] J. Howison and J. D. Herbsleb, “Incentives and integration in scientific software production,” in Proceedings of the 2013 conference on Computer supported cooperative work. ACM, 2013, pp. 459–470.
[8] J. Howison and J. D. Herbsleb, “Scientific software production: incentives and collaboration,” in Proceedings of the ACM 2011 conference on Computer supported cooperative work. ACM, 2011, pp. 513–522.
[9] J. E. Hannay, C. MacLeod, J. Singer, H. P. Langtangen, D. Pfahl, and G. Wilson, “How do scientists develop and use scientific software?” in Proceedings of the 2009 ICSE workshop on Software Engineering for Computational Science and Engineering. IEEE Computer Society, 2009, pp. 1–8.
[10] P. E. Stephan, How economics shapes science, 2012, vol. 1.
[11] M. Heroux, “Software engineering for computational science and engineering: What can work and what will not,” Invited talk, presented at the 2017 International Workshop on Software Engineering for High Performance Computing in Computational and Data-Enabled Science and Engineering, 2017.
[12] J. Howison, E. Deelman, M. J. McLennan, R. Ferreira da Silva, and J. D. Herbsleb, “Understanding the scientific software ecosystem and its impact: Current and future measures,” Research Evaluation, vol. 24, no. 4, pp. 454–470, 2015.
[13] W. Bangerth and T. Heister, “What makes computational open source software libraries successful?” Computational Science & Discovery, vol. 6, no. 1, p. 015010, 2013.
[14] A. M. Smith, K. E. Niemeyer, D. S. Katz, L. A. Barba, G. Githinji, M. Gymrek, K. D. Huff, C. R. Madan, A. C. Mayes, K. M. Moerman et al., “Journal of open source software (joss): design and first-year review,” PeerJ Computer Science, vol. 4, p. e147, 2018.
[15] G. R. Mirams, C. J. Arthurs, M. O. Bernabeu, R. Bordas, J. Cooper, A. Corrias, Y. Davit, S.-J. Dunn, A. G. Fletcher, D. G. Harvey et al., “Chaste: an open source c++ library for computational physiology and biology,” PLoS computational biology, vol. 9, no. 3, p. e1002970, 2013.
[16] M. R. Crusoe, H. F. Alameldin, S. Awad, E. Boucher, A. Caldwell, R. Cartwright, A. Charbonneau, B. Constantinides, G. Edvenson, S. Fay et al., “The khmer software package: enabling efficient nucleotide sequence analysis,” F1000Research, vol. 4, 2015.
[17] C. D. Cooper, J. P. Bardhan, and L. A. Barba, “A biomolecular electrostatics solver using python, gpus and boundary elements that can handle solvent-filled cavities and stern layers,” Computer physics communications, vol. 185, no. 3, pp. 720–729, 2014.
[18] B. Van Essen, H. Kim, R. Pearce, K. Boakye, and B. Chen, “Lbann: Livermore big artificial neural network hpc toolkit,” in Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. ACM, 2015, p. 5.
[19] E. Yavuz, J. Turner, and T. Nowotny, “Genn: a code generation framework for accelerated brain simulations,” Scientific reports, vol. 6, p. 18854, 2016.
[20] G. Pinto, I. Wiese, and L. F. Dias, “How do scientists develop scientific software? an external replication,” in 25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, Campobasso, Italy, March 20-23, 2018, 2018, pp. 582–591.
[21] J. M. Sheltzer and J. C. Smith, “Elite male faculty in the life sciences employ fewer women,” Proceedings of the National Academy of Sciences, vol. 111, no. 28, pp. 10107–10112, 2014.
[22] B. L. Berg, “Methods for the social sciences,” Qualitative Research Methods for the Social Sciences. Boston: Pearson Education, 2004.
[23] A. Wood, P. Rodeghero, A. Armaly, and C. McMillan, “Detecting speech act types in developer question/answer conversations during bug repair,” in Proc. of the 26th ACM Symposium on the Foundations of Software Engineering, 2018.
[24] E. L. Kaplan and P. Meier, “Nonparametric estimation from incomplete observations,” Journal of the American statistical association, vol. 53, no. 282, pp. 457–481, 1958.
[25] B. Lin, G. Robles, and A. Serebrenik, “Developer turnover in global, industrial open source projects: Insights from applying survival analysis,” in Proceedings of the 12th International Conference on Global Software Engineering. IEEE Press, 2017, pp. 66–75.
[26] M. Nagappan, T. Zimmermann, and C. Bird, “Diversity in software engineering research,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 2013, pp. 466–476.
[27] K. Ferguson, B. Huang, L. Beckman, and M. Sinche, “National postdoctoral association institutional policy report 2014: Supporting and developing postdoctoral scholars,” Washington, DC: National Postdoctoral Association, 2014.
[28] L. P. Hattori and M. Lanza, “On the nature of commits,” in Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2008, pp. III–63.
[29] W. Poncin, A. Serebrenik, and M. Van Den Brand, “Process mining software repositories,” in Software maintenance and reengineering (CSMR), 2011 15th european conference on. IEEE, 2011, pp. 5–14.
[30] B. Vasilescu, A. Serebrenik, M. Goeminne, and T. Mens, “On the variation and specialisation of workload – a case study of the gnome ecosystem community,” Empirical Software Engineering, vol. 19, no. 4, pp. 955–1008, 2014.
[31] K. Crowston, K. Wei, Q. Li, and J. Howison, “Core and periphery in free/libre and open source software team communications,” in 39th Hawaii International Conference on Systems Science (HICSS-39 2006), 4-7 January 2006, Kauai, HI, USA, 2006.
[32] G. Pinto, I. Steinmacher, and M. Gerosa, “More common than you think: An in-depth study of casual contributors,” in IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016, pp. 112–123.
[33] K. S. Louis, J. M. Holdsworth, M. S. Anderson, and E. G. Campbell, “Becoming a scientist: The effects of work-group size and organizational climate,” The Journal of Higher Education, vol. 78, no. 3, pp. 311–336, 2007.
[34] W. Maalej and H. Happel, “Can development work describe itself?” in Proceedings of the 7th International Working Conference on Mining Software Repositories, MSR 2010 (Co-located with ICSE), Cape Town, South Africa, May 2-3, 2010, Proceedings, 2010, pp. 191–200.
[35] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen, “Boa: A language and infrastructure for analyzing ultra-large-scale software repositories,” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13, 2013.
[36] M. T. Sletholt, J. Hannay, D. Pfahl, H. C. Benestad, and H. P. Langtangen, “A literature review of agile practices and their effects in scientific software development,” in Proceedings of the 4th international workshop on software engineering for computational science and engineering. ACM, 2011, pp. 1–9.
[37] T. Storer, “Bridging the chasm: A survey of software engineering practice in scientific programming,” ACM Computing Surveys (CSUR), vol. 50, no. 4, p. 47, 2017.
[38] R. T. Fielding, “Shared leadership in the apache project,” Commun. ACM, vol. 42, no. 4, pp. 42–43, Apr. 1999.
[39] S. A. Licorish and S. G. MacDonell, “Understanding the attitudes, knowledge sharing behaviors and task performance of core developers: A longitudinal study,” Information & Software Technology, vol. 56, no. 12, pp. 1578–1596, 2014.
[40] J. Coelho, M. T. Valente, L. L. Silva, and A. Hora, “Why we engage in FLOSS: Answers from core developers,” in 11th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), 2018, pp. 1–8.
[41] A. Mockus, R. T. Fielding, and J. D. Herbsleb, “Two case studies of open source software development: Apache and mozilla,” ACM Trans. Softw. Eng. Methodol., vol. 11, no. 3, Jul. 2002.
[42] G. Avelino, L. Passos, A. Hora, and M. T. Valente, “A novel approach for estimating truck factors,” in 24th International Conference on Program Comprehension (ICPC), 2016, pp. 1–10.
[43] R. Pham, L. Singer, O. Liskin, F. Figueira Filho, and K. Schneider, “Creating a shared understanding of testing culture on a social coding site,” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13.
[44] A. Lee, J. C. Carver, and A. Bosu, “Understanding the impressions, motivations, and barriers of one time code contributors to FLOSS projects: a survey,” in Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, 2017, pp. 187–197.
IX. APPENDIX

TABLE VI: Addenda to File Extensions used in Vasilescu et al. [30]
Category Addenda
Doc ".*\.md"
Code ".*\.pas((\.swp)?)(~?)", ".*\.pxd((\.swp)?)(~?)", ".*\.ads((\.swp)?)(~?)", ".*\.adb((\.swp)?)(~?)", ".*\.bin"
Devdoc ".*\.pdf", ".*citation.*", ".*license.*", ".*doxyfile.*", ".*\.wiki", ".*\.tex", ".*\.bib", ".*\.dox", ".*authors"
Db ".*\.csv", ".*\.xml", ".*\.fa", ".*\.xlsx", ".*\.zip", ".*\.h5", ".*\.bz2", ".*\.tar(\.gz)?", ".*\.fq(\.gz)?", ".*\.pts", ".*\.pdb", ".*\.pqr", ".*\.vert", ".*\.node", ".*\.edge", ".*\.param(eters)?", ".*\.phi0", ".*\.prototext(\.bve)?", ".*\.pkl", ".*\.pbs"
Build ".*\.build", ".*dockerfile", ".*\.gradle"
Config ".*\.vcxproj((\.filters)?)(~?)", ".*\.qpg", ".*\.dsp", ".*\.epf"
Img ".*\.graffle"

A. Questions Used in the Online Survey
1) Which of the following best describes you (select all that apply)?
• Student (e.g. undergraduate or graduate student)
• Postdoc
• Professional
• Professor
2) If you are currently employed, which of the following best describes your current employer?
• Government
• University or college
• Business or industry
• Non-profit organization
• Other (please specify)
3) Please list the highest academic degree you have received or that you are currently working toward.
• High school degree or equivalent
• Associate’s Degree
• Bachelor’s
• Master’s
• Doctorate
4) What is the subject of this degree?
5) How many people are on your team?
6) What and how do they contribute (e.g., adding new features to the core of the project, conducting research that feeds back into the project, writing tests, fixing bugs, et cetera.)?
7) How does a new team member get trained to do their job (e.g. they are self-taught)?
8) What (if anything) do you do to prepare for the departure of a team member (e.g. we ask them to document the undocumented parts of their code)?
9) What kinds of roles and responsibilities do different people play in the development of your software? In particular, consider the following groups: junior researchers (undergraduates, graduate students, and postdocs), senior researchers (seasoned staff), and third party contributors.
10) Are there differences in what different people provide your project? If so, please explain what (e.g. juniors are responsible for small issues while seniors drive the architecture and solve the main problems)?
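
Note on Table VI. The addenda in Table VI are regular expressions over file names. A minimal sketch of how such patterns might be applied to the path of a changed file follows. The lowercasing, the full-string matching, the first-match-wins ordering, and the `classify` helper are all assumptions made for illustration, and only a subset of the patterns is included; the base rules of Vasilescu et al. [30] are not reproduced here.

```python
import re

# Subset of the Table VI addenda (hypothetical selection for illustration).
ADDENDA = {
    "doc": [r".*\.md"],
    "devdoc": [r".*\.pdf", r".*citation.*", r".*license.*", r".*\.tex", r".*authors"],
    "db": [r".*\.csv", r".*\.tar(\.gz)?", r".*\.h5"],
    "build": [r".*\.build", r".*dockerfile", r".*\.gradle"],
    "code": [r".*\.pas((\.swp)?)(~?)", r".*\.pxd((\.swp)?)(~?)"],
}


def classify(path):
    """Return the first category whose pattern matches the whole lowercased path."""
    name = path.lower()
    for category, patterns in ADDENDA.items():
        if any(re.fullmatch(pattern, name) for pattern in patterns):
            return category
    return "unknown"


if __name__ == "__main__":
    for path in ["README.md", "docs/LICENSE", "Dockerfile",
                 "src/solver.pxd~", "data/run.tar.gz", "main.c"]:
        print(path, "->", classify(path))
```

Because categories are checked in dictionary order, a file that matches several patterns is assigned to the first one; a real pipeline would need an explicit precedence rule.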