Characterizing The Roles of Contributors in Open-Source Scientific Software Projects

This document summarizes a study characterizing the roles of contributors in open-source scientific software projects. The study analyzed 7 open-source scientific software projects and surveyed 72 scientific software developers. The key findings were: 1) Senior researchers (e.g. professors) tended to be the most active contributors, responsible for half or more of commits on average and heavily involved in architectural concerns. 2) Junior contributors (e.g. graduate students) also contributed substantially, with graduate students having the longest contribution periods. 3) Third-party contributors were scarce, typically contributing for just one day.


2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

Characterizing the Roles of Contributors in Open-source Scientific Software Projects

Reed Milewicz, Sandia National Laboratories, Email: [email protected]
Gustavo Pinto, Federal University of Pará, Email: [email protected]
Paige Rodeghero, Clemson University, Email: [email protected]

Abstract—The development of scientific software is, more than ever, critical to the practice of science, and this is accompanied by a trend towards more open and collaborative efforts. Unfortunately, there has been little investigation into who is driving the evolution of such scientific software or how the collaboration happens. In this paper, we address this problem. We present an extensive analysis of seven open-source scientific software projects in order to develop an empirically-informed model of the development process. This analysis was complemented by a survey of 72 scientific software developers. In the majority of the projects, we found senior research staff (e.g. professors) to be responsible for half or more of commits (an average commit share of 72%) and heavily involved in architectural concerns (seniors were more likely to interact with files related to the build system, project meta-data, and developer documentation). Juniors (e.g. graduate students) also contribute substantially; in one studied project, juniors made almost 100% of its commits. Still, graduate students had the longest contribution periods among juniors (with 1.72 years of commit activity compared to 0.98 years for postdocs and 4 months for undergraduates). Moreover, we also found that third-party contributors are scarce, contributing to the project for just one day. The results from this study aim to help scientists to better understand their own projects, communities, and the contributors’ behavior, while paving the road for future software engineering research.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND2018-9345C.

2574-3864/19/$31.00 ©2019 IEEE
DOI 10.1109/MSR.2019.00069
Authorized licensed use limited to: Pontificia Universidad Catolica de Chile. Downloaded on September 20,2023 at 13:09:18 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Computing technologies have had a profound impact on the practice of science: simulation and data-intensive computation are now known as the third and fourth paradigms of science, on equal footing with experimentation and theory [1]. This shift has accelerated the growth of a diverse ecosystem of scientific software projects. The term “scientific software” is an umbrella that covers all aspects of the research pipeline, including codes for simulation and data analysis, dataset management, communication infrastructure, and underlying mathematical libraries [2]. It is software that exists “to support the exploration of a scientific question” [3].

What makes scientific software projects different from traditional software projects? Scientific software operates at the boundaries of human knowledge and tends to be in constant flux as new insights motivate unforeseen changes in requirements [4]. As noted by Segal [5], this pressing need to produce or enable the production of knowledge lends itself to a mindset where “software is valued only insofar as it progresses the science”, often in conflict with the need to have reliable, maintainable code. However, Turk and colleagues remarked that, in an era of increasing scale and complexity, “the cyber-infrastructure necessary to address problems in computational science is no longer tractably solved by individuals working in isolation” [6]; broader, more open collaboration necessitates a shift in how the software is developed. From a software engineering research perspective, this motivates important questions about how the software evolves, who develops it, and how quality can emerge from this process.

We focus on the people meeting the demand for scientific software. Such scientific software developers represent a population so far not properly understood, since their characteristics, motivation, and needs to contribute to scientific software projects are intrinsically different from what drives traditional open source contributors. For instance, the actors that play the scientific developer role include students, postdocs, faculty, and staff. Their knowledge, skills, and goals can vary greatly, and they contribute to projects in different ways throughout their tenure. As a consequence, the plethora of existing studies on open source contributors might not help much, since they hardly take into account their roles or the complexity of the domains that scientific software is immersed in.

Much is still unknown about the state-of-the-practice of developing scientific software. For instance, who performs the majority of commit activities? Who fixes bugs? In order to better understand the relationship between these contributors and the software, we first leverage the availability and transparency of social coding websites to inspect data related to source code contributions and contributors. We selected a curated list of seven open-source scientific software projects by searching three different platforms: the Journal of Open Source Software, GitHub, and DOECODE, a platform for publicly funded DOE research codes. For each selected project, we identified the roles played by different contributors by analyzing each project’s documentation, websites, and other readily available sources. We then surveyed representative scientific software developers in order to cross-validate the findings from the repository analysis.

Using quantitative and qualitative data, our study produced a set of findings, some of which confirmed anecdotal accounts
while others were unexpected. We discuss them in detail in Section IV. In the following, we highlight three of them.
• Senior researchers tend to be the most active and prolific contributors in terms of commits and file creation. In four of the seven projects we studied, faculty and staff contributors were responsible for half or more of commits made to the project (with an average commit share of 72%). In five projects, senior members were also responsible for the majority of files created and, by that measure, the resulting project structure. This influence over the overall direction of the software project was also evident in the fact that senior researchers were the most likely to have interacted with files related to the build system, project metadata, and developer documentation.
• Junior contributors, especially graduate students, are critical drivers of new features as well as supporting activities like test creation. On average, junior contributors were responsible for 42% of commits across all projects we studied; in one case, juniors were responsible for nearly 100% of all commit activity. The majority of these commits came from graduate students, who had the longest contribution periods among juniors (with 1.72 years of commit activity compared to 0.98 years for postdocs and 4 months for undergraduates). Similar to senior contributors, junior contributors are significantly involved in creating new features, improving existing capabilities, and fixing bugs.
• An open-source model facilitates external contributions, but the results are mixed. On one hand, an open-source model makes it easier to attract thirdparty contributors to help grow and maintain the software. However, the software is also made for and by members of a relatively niche and intensely preoccupied community. In the majority of projects we studied, thirdparty contributors tended to be domain expert users who were only active for one day. We also note, however, that these same contributors are more likely to offer defect-correcting commits, which is highly valuable.

II. BACKGROUND

Scientific software projects are complicated undertakings that must contend with limited budgets, an occasional lack of software development expertise, and the inherent complexity of the domain [2].

Senior researchers and staff. As is common in the sciences, a typical software project coalesces around a principal investigator (PI) and one or more co-investigators (Co-Is) who have secured the resources needed for development (e.g. time, money). The reasons for developing software are varied, but include the use-value of the software as a vehicle for research, academic credit, and (in the case of commercial software) revenue [7]. However, it is well-known that the scientist-as-software-developer rarely has the time to maintain the code that they write [8]. Time and energy must be divided between writing papers and grant proposals, reviewing manuscripts, mentoring, conducting experiments, et cetera. Hannay et al. found that while 84% of interviewees considered developing scientific software important for their own research, the average scientist spent just 30% of their work time on development activities [9].

Junior researchers. In order to meet the labor needs for development, established researchers often rely upon student and post-graduate labor; juniors are seen as young, full of ideas, and (most importantly) inexpensive personnel [10]. According to Heroux 2017, senior members provide a stable presence, determining the scientific questions and the trajectory for the software; they are familiar with the conceptual models and the software design, but they may spend less time writing actual code. Juniors, meanwhile, are transient members with a dual focus on contributing code and producing publications; they undergo a staged process of onboarding, becoming experienced, and departing, and during this time they may make substantial contributions to the software [11].

Third party contributors. For every developer of a critical scientific software package, countless more depend upon it. However, unlike in conventional software development, there are no clear-cut distinctions between users and developers, and, as Turk 2013 argues, trying to force these terms is “actively harmful” to our understanding [6]. Even when they are not directly responsible for a package, it is not uncommon for scientific end-users to write code of their own, such as “glue code” that draws together different software tools into a workflow or custom components that utilize others’ software. The decision to use another’s software creates risks because it is not guaranteed that the software will be supported in the future or kept current with the pace of changes in the field [12]. Converting users into contributors is perhaps even more difficult. However, as Bangerth and Heister 2013 explain, scientific software libraries require the support of a broader community of users and contributors in order to survive in the long term [13].

III. METHODOLOGY

In this section, we describe our research questions and our approach (Section III-A). For the repository mining portion of our work, we outline our data collection methodology (Section III-B), the corpus we have assembled in order to find the answers to our research questions (Section III-D), and how we distinguish contributors’ roles (Section III-E). For the survey portion of this work, we describe our protocol (Section III-F).

A. Research Questions

In this paper, we characterize the habits of scientific software contributors and contributions. Some of our questions are intended to test common wisdom, while others aim to probe deeper into the relationships between project contributors. Our research questions are as follows:
RQ1: What is the tenure of different contributors by role on a scientific software project?
RQ2: What is the breakdown of team contributors by role?

RQ3: How much of the development work is done by different contributors by role?
RQ4: What kinds of software maintenance and evolution activities do contributors perform?
RQ5: How do scientific software developers perceive their own software development process?

Our first question RQ1 is demographic in nature based on aggregated project data, and tests the representativeness of our dataset. RQ2 enables us to make inferences about the division of labor based on personnel composition. Next, RQ3 digs into the kinds of responsibilities, such as file ownership and test file creation, that different contributors take up. RQ4 investigates what kinds of maintenance and evolution changes, such as adding new features or fixing bugs, these contributors make to the project. Finally, to provide answers to RQ5 we surveyed 72 scientific software developers regarding their own contribution behavior.

B. Data Collection Procedure

Figure 1 depicts the steps followed by our data collection procedure.

[Figure 1: flow diagram. Step 1: DOECODE (1,039), JOSS (324), and GitHub (500) repositories. Step 2: 1,863 scientific software projects. Step 3: manual filter using criteria C1, C2, and C3. Step 4: 7 selected projects.]
Fig. 1: Steps of the data collection procedure.

The first step (step 1) aims to find representative projects. We relied upon three data sources. First, we consulted DOECODE¹, a platform for publicly funded DOE research codes. Next, we searched the Journal of Open Source Software (JOSS)², a database of open source research software [14], which requires all entries be publicly available. Finally, we did searches by topic on GitHub to find repositories with relevant tags (e.g., computational-neuroscience, bioinformatics). This yielded roughly 1,039 repositories from DOECODE, 324 from JOSS, and another 500 from GitHub. These numbers correspond to all projects on these platforms, except for GitHub, where we stopped searching once we had found 500 projects. This resulted in an initial set of 1,863 open source scientific software projects which we chose to take into consideration (step 2).

¹ https://www.osti.gov/doecode/
² https://joss.theoj.org/

From this list, we manually analyzed these repositories over several days (step 3), filtering the results according to the following criteria:
C1) Projects should have a contributor list. The repository must link to a detailed contributor list or research team page that identifies the roles played by different contributors to the project. We use this data to later distinguish contributors’ roles (Section III-E).
C2) Projects should be active. The project must be at least a year old, and the repository must have more than 500 commits. For example, a large number of projects on DOECODE were developed internally and then later released to the public. Thus, the GitHub repository is a shallow copy of the most recent version with no commit history.
C3) Projects should be collaborative. There must be at least three contributors who can be positively identified, and at least one of these must be considered a “junior” contributor. Many research projects on GitHub are small codes developed by individual researchers in isolation without any significant collaborations with others. Others are collaborative projects between senior staff at different institutions.

After applying these filters, we ended up with a curated list of seven scientific software projects (step 4).

C. Characterizing the population

We believe that the 1,863 projects in our population of repositories are a representative sample of scientific software projects that can be found in the wild. However, many of these projects are unlikely to provide useful information for our purposes, such as short-lived or single-user research codes, snapshots of codes released for publications, untouched clones of decades-old legacy projects, or mirrors of private repositories lacking history information. As shown in Figure 2, filtering for the number of commits and contributors eliminates roughly 8 out of 10 of the repositories; the remainder are most likely to be active, collaborative, and (most importantly) to have a rich history on GitHub that we can pull apart. The median project within this group has 12 contributors and 1770 commits spread out over 2.93 years.

We can examine a sample of 5000 of the contributors to these repositories, using their number of commits and the number of days spanned by those commits as a proxy for involvement. 29% of these contributors make only one commit, and 40% are active for no more than a day. Among contributors more active than these, the average individual has an active tenure of 1.37 years, and during this time they make 116 commits (6% of the commit activity of the median project). If we zoom in on any one of these contributors, we can observe their activities, but what interests us most is how their identity relates to those activities. This is a more challenging problem to solve, and analysis is much less scalable. Instead, in this work we take a deep dive into a handful of projects whose members we can identify, as a case study into the composition of these teams.
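
As an illustration of the automatable part of this filtering, the following minimal sketch applies the commit, age, and contributor thresholds from criteria C2 and C3 to a list of repository records. The `Repo` record, its field names, and the sample values are hypothetical and are not data from the study; criterion C1 and the identification of junior contributors were manual steps and are not modeled here.

```python
from dataclasses import dataclass

# Thresholds taken from criteria C2 and C3; the constant names are illustrative.
MIN_COMMITS = 500        # C2: "more than 500 commits"
MIN_AGE_YEARS = 1.0      # C2: "at least a year old"
MIN_CONTRIBUTORS = 3     # C3: "at least three contributors"


@dataclass
class Repo:
    """Hypothetical summary record for one repository."""
    name: str
    commits: int
    contributors: int
    age_years: float


def passes_activity_filter(repo: Repo) -> bool:
    """Check only the numeric parts of C2 and C3."""
    return (
        repo.commits > MIN_COMMITS
        and repo.age_years >= MIN_AGE_YEARS
        and repo.contributors >= MIN_CONTRIBUTORS
    )


def filter_repos(repos):
    """Return the repositories that pass, plus the share that was kept."""
    kept = [r for r in repos if passes_activity_filter(r)]
    share = len(kept) / len(repos) if repos else 0.0
    return kept, share


# Made-up records for illustration only.
sample = [
    Repo("solver-a", commits=1770, contributors=12, age_years=2.9),
    Repo("snapshot-b", commits=1, contributors=1, age_years=0.1),
    Repo("thesis-code-c", commits=640, contributors=1, age_years=3.0),
    Repo("young-d", commits=900, contributors=5, age_years=0.5),
    Repo("legacy-e", commits=501, contributors=3, age_years=1.0),
]

kept, share = filter_repos(sample)
print([r.name for r in kept], f"{share:.0%} kept")
```

Keeping the thresholds as named constants makes it straightforward to explore how sensitive the surviving share is to each cutoff.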

Fig. 2: A symmetric log plot of the number of contributors and commits of repositories considered in this work, with those falling beneath the thresholds colored in gray. After these first filtering steps, 359 repositories remain (19% of the original total), comprising the work of around 10000 contributors.

D. Studied projects

Our data collection process yielded a total of seven projects which met our criteria. The descriptions of these projects can be found in Table I.

Taken together, we argue that our sample of scientific software projects is relevant due to their diversity: 1) they are written in up to four different programming languages (although mainly written in C++, Python, and Scala), with an average of 110.13 kLOC; 2) they span very different domains; 3) they have an average of 5.2 years of historical records; and 4) they are written by scientists that do not necessarily have a computer science or (in particular) a software engineering background. When looking at the number of stars, one might argue that our selected scientific software projects have low popularity. However, as a recent work reported, the median number of stars of R packages published on GitHub is 2 [20].

E. Establishing identities of contributors

Our next step was to establish the identities of contributors. We followed the strategy used by Sheltzer and Smith [21] and first scraped data from laboratory websites and project documentation. This was followed up by searching departmental directories and performing web searches in order to disambiguate contributors where necessary.

The final step in this process is to code each individual in our dataset. This amounts to reviewing the assembled information about each individual and assigning a label to them. For the purposes of this study, we sorted subjects into the following categories: undergrad (i.e., undergraduate students), gradstudent (i.e., master’s and doctoral students), postdoc (i.e., postdoctoral researchers), staff (i.e., investigators and support staff), thirdparty (i.e., external collaborators), and unknown (i.e., those whose identity could not be established). In cases where we had to rely on incomplete GitHub profile data (e.g. unlisted thirdparty contributors), we attempted to extract names from handles (e.g. johnsmith79 → John Smith) and cross-referenced those names with web searches for similarly named researchers in the relevant field; where we could not be reasonably convinced that the identities matched, the contributor was left as unknown. To ease understanding, we further group these contributors as juniors (i.e., gradstudent, undergrad, and postdoc), seniors (i.e., staff), and thirdparty.

F. Complementary Survey of Developers

The majority of the findings of this work come from a quantitative analysis of repositories. To triangulate our findings and better understand the perceptions of scientific software developers, we additionally performed a complementary qualitative analysis of scientific software development teams. We sought to capture, in their own words, (1) who contributes to their projects, (2) how they prepare for contributors to join or leave, and (3) what roles different people play in the development of the software.

To do this, we designed an online survey. For each participant, we presented three multiple choice questions on their background and experience in developing scientific software, followed by five open-ended questions addressing their team composition and their division of labor. Participation in the survey was voluntary and responses were anonymous. For the open-ended questions, we coded the answers and organized them into categories following the guidelines on open coding procedures [22] (cf. [23]).

To identify the target population, we reached out to development teams whose projects had been accepted to the Journal of Open Source Software (JOSS); we used the JOSS GitHub repository, which is used to track submissions, to collect information on points-of-contact. From the accepted submissions to JOSS, we identified 273 scientific software developers that either owned or made the majority of contributions to a GitHub project, and recruited them by email. 17 emails were not sent due to mailing errors, and 11 emails were returned due to out of office automatic replies. Over the period of two weeks, we received 72 answers (a response rate of approximately 30%).

IV. ANALYSIS

In this section we provide answers to each research question.

RQ1: What is the tenure of different contributors by role on a project?

For RQ1 we want to know what the expected contribution period is for different contributors based on their project role. While students may spend years with a research team and staff for decades, only a limited portion of that time will be spent on software development activities. Knowing how much labor is available to a team helps staff to understand issues related to task allocation or job rotation. To gather this information, we apply the Kaplan-Meier procedure [24] to perform a survival analysis of participants from all projects organized by role.
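
To make the procedure concrete, a minimal sketch of the Kaplan-Meier product-limit estimator follows. It is not the implementation used in this study; the function names and the sample durations are hypothetical. Each contributor is represented by a duration in days and a flag recording whether the contributor was observed to become inactive (an event) or was still active when observation ended (censored).

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for every time t at which an event occurred.

    durations: days between a contributor's first and last recorded activity
    observed:  True if the contributor became inactive (event),
               False if still active at the end of observation (censored)
    """
    events = Counter(d for d, o in zip(durations, observed) if o)
    exits = Counter(durations)   # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the share of the risk set surviving t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5 (heavy censoring)


# Made-up durations in days; one of the two 900-day contributors is censored.
durations = [1, 30, 30, 200, 400, 900, 900]
observed = [True, True, True, True, True, True, False]

curve = kaplan_meier(durations, observed)
print(curve)
print("median survival (days):", median_survival(curve))
```

With these made-up inputs the estimated median survival time is 200 days. Censored contributors remain in the risk set until their last observed day, which is what distinguishes the estimator from simply taking the median of the raw durations.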

TABLE I: List of studied projects. Age is given in years. kLOC is calculated using the cloc utility, encompassing blank, comment, and code lines. PL means Programming Language.

Project/GitHub                   Contributors Identified (%)  kLOC    Commits  Stars  PL      Age  Description
Chaste (Chaste/Chaste) [15]      97%                          371.4k  4.7k     22     C++     8    Tissue and cell level electrophysiology, discrete tissue modeling, and soft tissue modeling
Khmer (dib-lab/khmer) [16]       90%                          145.1k  6.6k     528    Python  7    Nucleotide k-mer counting, filtering, and graph traversal
PyGBe (barbagroup/pygbe) [17]    100%                         12.4k   0.9k     28     Python  6    Biomolecular electrostatics and nanoparticle plasmonics
LBANN (LLNL/lbann) [18]          99%                          66.8k   3.5k     40     C++     4    Artificial neural network toolkit
Hail (hail-is/hail)              98%                          72.9k   3.1k     357    Scala   2    Genomic analysis
Genn (genn-team/genn) [19]       96%                          37.4k   1.8k     77     C++     6    Neuron and synapse modeling
openMOC (mit-crpg/openMOC)       90%                          21.6k   2.6k     50     C++     4    Nuclear reactor physics

In the medical domain, survival analysis measures the fraction of patients who remain alive for a certain amount of time after treatment. In our work, survival refers to how long it takes for a contributor to become inactive. For the purposes of this analysis, we consider a contributor inactive if they have not made a commit within the last 180 days, following the example of Lin et al. [25]; this is more strict than is done in other works (cf. [26], where one commit per year counts as active). We do this because contributors may start or cease their commits in the middle of their tenure (e.g., a junior pivoting towards finishing a thesis). After experimenting with different settings, we found that six months was a reasonable limit. Moreover, recent work has also experimented with different thresholds (e.g., 30, 90, 180 days), and results suggest the same trends over the experiments [25]. We used the gitstats utility³ to collect information on the length of each subject’s participation in their respective projects.

³ https://github.com/hoxu/gitstats

The results of the analysis can be seen in Figure 3. This figure shows a series of declining horizontal steps which approaches the true survival function for that population. The x-axis represents the survival duration, while the y-axis indicates the probability that a contributor can survive (i.e., keep actively contributing to the project). The median survival time per group is 4.06 years for staff, 1.72 years for gradstudents, 0.98 years for postdocs, 4 months for undergrads, and just a day for thirdparty contributors.

Fig. 3: A Kaplan-Meier survival plot of contributors to projects in our dataset, grouped according to role. S(t) indicates the number of individuals in the population who are still actively contributing t days after starting.

A careful, contextualized reading of this data proves informative. With regards to juniors, aside from staff, we see that gradstudents are the most likely to stay around. That being said, a typical PhD student (as almost all of the graduate students in our dataset are) takes 5 years to graduate, and this means that the median gradstudent only spends 34.52% of their tenure contributing to a project. Meanwhile, postdocs spend even less time than gradstudents as contributors, even though the typical term limit of postdocs today is also 5 years [27]; that being said, postdocs are brought on-board with prior knowledge that they can immediately apply to the software project. Finally, undergraduate students in our dataset only spend one semester out of their 4 year education participating in developing scientific software. Taken all together, the median junior in our projects spends only 24.84% of their time in their position doing software development work.

Meanwhile, seniors provide the most stable presence, with a median survival time of 4 years. We note that this is affected by right-censoring because the average age of our projects is 5 years. However, our evidence suggests that senior members who do contribute code may not do so indefinitely. Once the software reaches a point of maturity, they may hand off the work to juniors. In other cases, they may leave virtually all the work to juniors. Lastly, we found that in the projects we studied, thirdparty contributors tend to remain at the periphery and do not engage with a project for any significant length of time; as we observed, thirdparty contributors stay, on average, 1 day.

RQ2: What is the breakdown of team contributors by role?

For RQ2, we are interested in the breakdown of contributors to each project by role. How many people contribute to projects overall? How can we characterize them? A key

distinction that we are making is that this is not the same with LBANN having a more even split between junior and
thing as the number of contributors listed as team members senior members. We also note that, for the projects we have
on the webpage of the project or research group. For example, studied, thirdparty contributors tend to play a very minor role
a graduate student may be a user of the software but not a in this regard; only LBANN has a notable share of commits
contributor, and in the case of projects like Chaste, we can attributable to thirdparty users (17.7% versus an average of
have multiple staff members responsible for design work and 3.59% and median of 0.06%). Overall, in all projects but
guidance of students but who have no commits to their name. one (Hail), junior contributors produce a significant share of
commits, meaning that even when seniors do most of the
heavy lifting, juniors play an essential role in the realization
of the project.
However, while both junior and senior alike generate
significant amounts of commit activity, this is not to say that
the scope of their activities is comparable. To better understand
this, we consider interactions with and ownership of files. For
the purposes of this work, we record a user as interacting with
a file each time they make a commit that touches that file. By
that measure, the typical junior interacts with a much smaller
percentage of files compared to a senior (average 5.85% vs.
20.35%; median 0.65% vs. 11.51%). This is to say that a
distinguishing characteristic of junior developers in our corpus
is that they often have a narrow focus on a particular subset of
a project. Meanwhile, the same is especially true for thirdparty
contributors who interact with an even smaller percentage of
files (average 1.64%; median 0.66%).
Likewise, we can also consider file creation. Earlier work by Poncin et al. [29] addresses file creation in their operationalization of “core” developers, as frequent creation and modification of files indicates that a user is helping to drive the vision or direction of the software. Related to this, in a recent study of large-scale open source projects, Lin et al. [25] found users who created files tended to be longer-term contributors than those who modified files. In 5 out of 7 of the projects we studied (Khmer, Chaste, Hail, openMOC, and Genn), senior team members created the majority of files (average of 69.44%); LBANN is almost evenly split by this measure, and PyGBe, as a student-driven project, has only a quarter of its files originating from senior members.

Fig. 4: For RQ2, for each project we present a bar chart with totals of identified contributors sorted by role.

We show our results in Figure 4. Taken together, junior members make up the majority of team contributors (average 69%; median 80%), with the remainder being seniors (average 31%; median 20%). Meanwhile, in all but one of the projects we studied, we were able to identify thirdparty contributors (average 20%; median 27%). Additionally, as a rule, both team members and thirdparty contributors that we identified have a background relevant to the domain of the project, something which we learned by analyzing biographical information used to classify contributors by role.

RQ3: How much of the development work is done by different contributors by role?
For RQ3, we want to characterize the amount of development work that is done by these different actors. To begin, we consider the share of commits produced by different contributors, the number and frequency of commits being well-worn metrics for engagement and investment in a project [26], [28]. Taking averages across all projects that we studied, half or more of commits are made by senior members (average 50.76%; median 63.29%). However, the majority of the other half are commits by junior contributors (average 42.9%; median 36.7%). Moreover, for 4 out of 7 projects (Khmer, Chaste, Hail, and openMOC), senior researchers are responsible for a plurality of commits (with an average commit share of 77.12%; median 73.34%); the opposite is true for Genn, LBANN, and PyGBe (average share 23.92%; median 22.58%),

RQ4: What kinds of maintenance and evolution activities do contributors perform?
What value do different kinds of contributors add to a project? For example, once a gradstudent exits a project, in what ways did they influence the evolution of the software during their tenure? We can find some evidence for this through project pages and documentation when teams provide an itemized list of accomplishments of different contributors (as is the case for several of the projects in our study); it is typical to see juniors receiving credit for implementing novel features pertinent to their research, seniors for building out infrastructure and performing maintenance, and thirdparty contributors for providing support or helping improve the codebase. However, relatively few projects provide this kind of fine-grained information, and we would also like to be able to interrogate those claims in an empirical way.
To answer our question, we present two views of the development activities that elucidate the kinds of work that


Authorized licensed use limited to: Pontificia Universidad Catolica de Chile. Downloaded on September 20,2023 at 13:09:18 UTC from IEEE Xplore. Restrictions apply.
different contributors produce. The first is an analysis of commit messages based on the approach of Hattori and Lanza [28], and the second is an analysis of file paths involved in commits based on the work of Vasilescu et al. [30]. In both cases, we are interested in categorizing commit activity according to its purpose or intent.
In the framework set out by Hattori and Lanza [28], commits are divided into four major categories of activity:
1) Forward engineering (Fwd), for instance, adding new features;
2) Reengineering (Reng), for instance, refactoring activities;
3) Corrective (Corr), for instance, fixing bugs;
4) Management (Mgmt), for instance, updating documentation.
In order to automatically classify commits into these categories, the authors compare the content of commit messages against predefined word banks for each commit type based on the earliest match found. For instance, consider the following commit message: “This commit adds integrators supporting the combined, staggered, and pseudotransient forward sensitivity analysis methods where the sensitivity equations are solved alongside the forward equations.”⁴ Unpacking the semantics of this commit message requires extensive domain knowledge, which is difficult to automate. However, the word “add” is a match in the word bank for forward engineering; it is reasonable to assume (in this case) that the commit is adding a new feature to the software.
⁴ https://github.com/trilinos/Trilinos/commit/e8e6d67
This approach is limited in that it only considers the lexical content of messages, and it also fails to handle situations where a commit may belong to more than one category, but it remains useful as a diagnostic tool. To test the validity of this classification scheme against our corpus, we chose to manually classify a representative random sample of commits drawn from across all projects. Assuming that all projects have statistically similar commits (in the sense that the distribution of commits by type is roughly the same), a sample of 378 commits might reflect the overall population of roughly 23,000 commits with a confidence level of 95% and an interval of ±5%. In order to arrive at the ground truth, our manual classification considers not only commit messages but also the artifacts (such as source code, documentation, and test data) that were modified and the context in which that occurred (such as preceding commits and related files).

TABLE II: Confusion matrix for validation of the Hattori-Lanza [28] classification scheme. Unknown (Unk) commits are those which the algorithm failed to classify.

                 Manual
Automated    Fwd    Reng   Corr   Mgmt
Fwd          53     11     1      14
Reng         3      60     0      13
Corr         1      4      60     5
Mgmt         7      14     11     16
Unk          9      36     5      55

Table II shows the confusion matrix between the automated approach and the manual analysis. First, we note that because a commit may fail to match against the word bank, it is possible for this approach to fail to find a label. This happened for 27% of commits in our sample. We identified three causes for this: (1) manual classification relied on words that were outside of the word bank (e.g., “vectorized book-keeping kernels”), (2) commit messages were automatically generated and vague as to their purpose (e.g., merging an arbitrary pull request), and (3) messages could be highly ambiguous (e.g., “complete breakdown of intuition” or “it is all becoming clear”). However, we consider this to be a more gentle form of failure than attempting to shoehorn an unintelligible commit into an arbitrary category.
For commits which were classified automatically, the manual and automatic approaches agreed 69% of the time; much of the error was concentrated on management commits, only a third of which were correctly labeled. If we limit our consideration to just forward engineering, reengineering, and corrective commits, then the automatic approach agrees 89% of the time, with some minor confusion between forward and reengineering activities. As such, we limit our consideration to just those three.
The second approach we use is derived from that of Vasilescu et al. [30], which categorizes commit activity by examining the file paths involved in changes; file extensions (e.g., .cpp versus .csv) and hints in file paths (e.g., /src/ versus /test/) can clue us in to the purpose of a file and, by extension, the kind of labor that an individual provides to a project through their interaction with those files. The classification algorithm itself is analogous to what was previously described: file paths are matched against a bank of regular expressions that map to different categories of files. Unlike with the Hattori-Lanza scheme, these results are much less ambiguous because it is reasonable to assume that file extensions indicate actual file types. For this work, we made several addenda to the regexes used in the original paper in order to cover additional programming languages (e.g., Pascal and Ada), data storage types commonly used in scientific computing (e.g., HDF5 and FASTA), as well as a handful of previously unaddressed build and configuration artifacts (e.g., Dockerfiles and Gradle build files). In Table III and Table IV we provide results for our two analyses as an aggregate of all contributors in the projects we studied as a way of approximating a “typical” project. The former shows what percentage of commits made by an average individual are categorized as forward engineering, reengineering, and corrective activities; the latter shows what percentage of individuals have made at least one commit that interacts with a file of a given type.
As a group, senior contributors have the highest average share of forward engineering commits (33.37%), which is to say that commits made by seniors are more likely to include novel development work, such as adding or extending software capabilities. Senior developers also play a key role in realizing the supporting infrastructure of their projects, with a majority having interactions with build, devdocs, and
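The agreement figures quoted above can be recomputed directly from the counts in Table II. The sketch below assumes that rows hold the automated label and columns the manual label, which is our reading of the table since only the automated approach can yield Unk; the names `TABLE_II` and `agreement` are illustrative.

```python
# Counts from Table II. Rows: automated label; columns: manual label,
# in the order Fwd, Reng, Corr, Mgmt (assumed orientation).
LABELS = ["Fwd", "Reng", "Corr", "Mgmt"]
TABLE_II = {
    "Fwd": [53, 11, 1, 14],
    "Reng": [3, 60, 0, 13],
    "Corr": [1, 4, 60, 5],
    "Mgmt": [7, 14, 11, 16],
    "Unk": [9, 36, 5, 55],
}


def agreement(labels):
    """Fraction of commits on the diagonal of the sub-matrix over `labels`."""
    cols = [LABELS.index(label) for label in labels]
    agree = sum(TABLE_II[label][LABELS.index(label)] for label in labels)
    total = sum(TABLE_II[label][c] for label in labels for c in cols)
    return agree / total


total_commits = sum(sum(row) for row in TABLE_II.values())
unknown_rate = sum(TABLE_II["Unk"]) / total_commits

print(f"sample size:         {total_commits}")
print(f"unclassified:        {unknown_rate:.1%}")
print(f"agreement (4 types): {agreement(LABELS):.1%}")
print(f"agreement (3 types): {agreement(LABELS[:3]):.1%}")
```

Under that reading, the counts sum to the 378 sampled commits, and the computed rates fall just above the truncated percentages given in the text.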
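A minimal sketch of this earliest-match scheme is given below. The word banks are hypothetical and heavily abbreviated (the banks of Hattori and Lanza [28] are not reproduced here), and the simple suffix handling that lets “adds” match “add” is our own assumption about how inflected forms might be treated.

```python
import re

# Hypothetical, abbreviated word banks; illustrative only.
WORD_BANKS = {
    "Fwd": {"add", "implement", "create", "support"},
    "Reng": {"refactor", "clean", "rename", "simplify"},
    "Corr": {"fix", "bug", "correct", "repair"},
    "Mgmt": {"merge", "license", "release", "document"},
}
# Assumed suffixes so that inflected forms such as "adds" match "add".
SUFFIXES = ("", "s", "es", "ed", "ing")


def classify_commit(message):
    """Label a commit by the earliest word that matches any word bank."""
    for token in re.findall(r"[a-z]+", message.lower()):
        for label, bank in WORD_BANKS.items():
            if any(token == word + suffix for word in bank for suffix in SUFFIXES):
                return label
    return "Unk"  # no keyword matched, so the commit stays unclassified


example = (
    "This commit adds integrators supporting the combined, staggered, "
    "and pseudotransient forward sensitivity analysis methods"
)
label = classify_commit(example)
```

Because tokens are scanned in order, “adds” decides the label before “supporting” is ever considered, which is the earliest-match behavior described above.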
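The sample size quoted above is consistent with the standard formula for estimating a proportion with a finite population correction. The text does not state which formula was used, so the sketch below is offered only as a plausible check; it assumes the most conservative proportion p = 0.5 and z = 1.96 for a 95% confidence level.

```python
import math


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with finite population correction.

    Assumptions (not stated in the source): z = 1.96 for 95% confidence
    and the worst-case proportion p = 0.5.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2      # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)           # finite population correction
    return math.ceil(n)


n_required = sample_size(23_000)
```

For a population of roughly 23,000 commits this gives 378, matching the number of commits that were manually classified.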
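The path-based scheme can be illustrated in the same spirit. The regex bank below is hypothetical and far smaller than the one used by Vasilescu et al. [30]; the category names follow Table IV, but the patterns, their order, and the function name `classify_path` are assumptions made for this sketch.

```python
import re

# Hypothetical, abbreviated regex bank. The first matching category wins,
# so more specific hints (such as a test directory) are listed first.
FILE_CATEGORIES = [
    ("Test", re.compile(r"(^|/)tests?/|(^|/)test_[^/]*$")),
    ("Build", re.compile(r"(^|/)(CMakeLists\.txt|Makefile|Dockerfile|build\.gradle)$")),
    ("Devdocs", re.compile(r"(^|/)(README|CHANGELOG|INSTALL)[^/]*$", re.IGNORECASE)),
    ("Data/Databases", re.compile(r"\.(csv|h5|hdf5|fasta|sql)$", re.IGNORECASE)),
    ("Code", re.compile(r"\.(c|cc|cpp|h|hpp|py|f90|java|pas|adb)$", re.IGNORECASE)),
]


def classify_path(path):
    """Return the first category whose pattern matches `path`, else None."""
    for category, pattern in FILE_CATEGORIES:
        if pattern.search(path):
            return category
    return None


def categories_touched(paths):
    """Set of categories a contributor interacted with across all commits."""
    return {c for c in map(classify_path, paths) if c is not None}


touched = categories_touched(["src/solver.cpp", "data/genome.fasta", "Dockerfile"])
```

Collecting `categories_touched` per contributor and counting, for each category, how many contributors in a role have it in their set would yield percentages of the kind shown in Table IV.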

metadata files. Likewise, a plurality of seniors interacts with data/database files (such as data for validation tests and parameters for research models) and test files.

TABLE III: The relative share of automatically classified commits of an average junior, senior, or thirdparty contributor. Management commits are omitted.

              Fwd Share   Reng Share   Corr Share
Juniors       26.76%      21.11%       15.93%
Seniors       33.37%      16.07%       14.91%
thirdparty    19.98%      18.89%       39.45%

Next, the activities of junior contributors resemble those of senior contributors in many key respects. Like seniors, juniors universally interact with code files, and a comparable share of their commits go towards forward engineering (26.77% for juniors vs. 33.37% for seniors), reengineering (22.11% vs. 18.90%), and corrective (15.93% vs. 14.91%) activities. Also like seniors, the majority of juniors interact with build, devdocs, and test files (though in smaller measures compared to seniors), and this was true in general for all projects we studied. This reinforces our earlier observations suggesting that their work is neither subordinate nor peripheral compared to the work of seniors, but is instead vitally important to the enterprise.
Finally, we consider thirdparty contributors who, as we determined earlier, are relatively minor players who as a rule only sporadically contribute to projects. Code and devdocs aside, they scarcely interact with any kind of file. One point that stands out, however, is that these contributors are more likely to perform corrective activities (with an average commit share of 35.06%). This is to say that while they make very few commits, the commits they do make are more likely to be bug fixes; that suggests that thirdparty contributors are most likely to be users of the software who have the domain knowledge and development expertise needed to correct such bugs or “scratch their own itch”. In essence, thirdparty contributor behavior is similar to the kind of work produced by peripheral developers, who are typically involved in bug fixes and have irregular or short-term involvement in a project [31], [32].

TABLE IV: The percentage of contributors in each category who have at least one commit that interacts with a given file type. N/A indicates that no matches were found for regexes in any projects studied for a given category.

                   Juniors   Seniors   thirdparty
Documentation      19.1%     26.0%     0%
Images             14.5%     22.2%     2.7%
Localization       2%        0%        0%
UI                 N/A       N/A       N/A
Media              27.7%     33.3%     2.7%
Code               100%      100%      70.3%
Project Metadata   36.2%     51.85%    8.1%
Configuration      34.0%     33.3%     5.4%
Build              63.8%     77.8%     29.7%
Devdocs            63.8%     88.8%     66.7%
Data/Databases     36.2%     63.0%     18.9%
Test               74.5%     85.2%     27.0%
Libraries          N/A       N/A       N/A

RQ5: How do scientific software developers perceive their own software development process?
For our final research question, we consider how scientific software developers view their own projects based on our survey data; this provides us with points of comparison with our quantitative findings. Among our survey respondents (those we identified as being most responsible for their respective projects), 36% were postdocs, 30% were non-academic professionals, another 30% were students (undergraduate or graduate), and 14% were professors. The majority of them (62%) work for a university or college, 11% work for the government, 8.5% work in industry, and the other 18.5% play other roles. Regarding their highest academic degree, 76.4% had already received or were working towards their doctorate degree (18% have a master's degree, and only 4% have a bachelor's degree). They worked in a variety of fields, including computer science, fine art, chemistry, political science, urban planning, and neuroscience.
The majority of projects in our survey (61%) were developed by a team of people, though it is worth noting that a significant number of projects were the work of individuals (39%). On average, these teams had 3.6 contributors (3rd quartile: 4.2, max: 15); this roughly aligns with the number of active contributors in a given year in the repositories which we mined (average: 3.8, 3rd quartile: 5, max: 11).
When we asked (Q5) what and how they contribute, we observed that 50 respondents reported software development-oriented activities such as fixing bugs, developing scripts to support research, improving documentation, and adding tests. Strongly tied with software development activities, 18 respondents reported contributing to non-software activities, such as paper writing, grant writing, and running experiments. Along these lines, three respondents perceive their contributing role as “Conducting research that feeds back into the project”.
Regarding how scientific software developers are trained to do their jobs (Q6), 70% of the respondents were self-taught, although some of them received mentorship from senior contributors (e.g., “Shadowing a more senior developer for a week or two”), while others benefited from online training programs (e.g., “The Molecular Sciences Software Institute (molssi.org) training programs.”), took advantage of their own documentation (e.g., “We make sure that the docs are self-contained to ease onboarding for remote teams”), or even the pull-request process (e.g., “By first contributing some pull requests and getting code reviews”). Only four respondents were trained through their academic degree.
When considering the responsibilities they need to take on to prepare for the departure of a team member (Q7), 20 respondents mentioned the importance of keeping the documentation updated (e.g., “We simply try to ensure that all developments are adequately documented at the time, to help the understanding of future developers”). Interestingly, 8 respondents mentioned that this never happened, which is partially because they are working on a small or solo team (e.g., “No one has departed yet (or joined...)”). Other respondents mentioned the need to

train others, to push code, or to add tests.
Of those projects run by teams of two or more people, 35 out of 41 specifically called attention to the role played by juniors in developing their software. Twenty of these described them as being responsible for developing specific, non-core features of the software; projects that followed this pattern offered up explanations such as “[juniors] – by necessity – start with smaller peripheral bugs and features. Core development requires a lot of experience and knowledge.” Another twelve projects, however, cast juniors as being developers of core infrastructure, typically on the reasoning that senior members “cannot afford to put much time into development”.
Meanwhile, 24 of the 41 team projects emphasize the role of seniors in development. Eight of these said seniors developed the core of the software and four the periphery. Those that did so often emphasized the need for experience in development, insofar as “the more education a team member has (software development life-cycle, good coding practices, etc.), the better they are at seeing ‘big picture’ development tasks [...] these people often lead development”. However, in contrast to this, ten respondents characterized seniors as being visionaries first and developers second. In this view, the role of seniors is to “coordinate activities”, “drive the direction of the project”, “guide the conceptual development”, and to provide the “theoretical details”.
Lastly, only 9 out of the 72 projects gave recognition to thirdparty contributors. Among these projects, the typical view was that while thirdparties “contribute seldomly”, they were also a common source of bug fixes, a finding echoed in our findings from RQ4. Likewise, these contributors were also responsible for “[submitting] small patches to make [a] tool better meet their own niche use cases”.

V. DISCUSSION

We have summarized the major findings in Table V, and now consider the potential implications of our work.

TABLE V: A summary with descriptions of typical contributors, based on the findings in this work.

Contributor: Findings
Seniors
• Are often active on a project for four or more years.
• Make the majority of contributions to a typical project (average 50%). They create the most files and touch the most code.
• Are most often responsible for forward engineering activities, development of the core of the software, and infrastructure tasks.
• Provide guidance and visionary leadership to junior contributors, especially when they do not have the time or resources to work on the code themselves.
Juniors
• Are active for no more than 2 years. Roughly 25%-35% of their time is spent on software development.
• Make up the majority of team members.
• Perform many of the same development activities as seniors, and, collectively, generate 42% of commit activity on average.
• Are most likely to develop peripheral, non-core features of the software.
Third Parties
• Are active for only one day.
• Have a background in the domain of the project and an interest in using the software.
• Make only a handful of commits. These commits are most likely to be bug fixes.

Training. As a group, juniors have long been the subject of science public policy literature. Novice researchers are “canaries in the mine” for the health of the scientific enterprise, as it is during this period that they are meant to learn the values and skills needed to participate fully as scientists [33]. However, while software development is an increasingly important skill, the amount of direct experience they acquire may be limited by competing demands in their academic careers (see RQ1). This brings into focus a number of different topics regarding software sustainability (e.g., the importance of good practices such as code reviews) as well as educational policies (e.g., the need for more formal software development training in STEM curricula).
Building Communities. Much has been written about the value of openness in science and the need for community support of scientific software. As we noted in Section III-C, 40% of contributors stay on for only a day. Our analysis suggests that many of these may be thirdparty users who have a formal background in a domain relevant to the project; even though these users may only be involved for a short time, they can still make valuable contributions such as fixing bugs. However, despite the widespread presence of short-term committers in the population and thirdparty contributors being present in all but one of the projects in our sample, only 12% of respondents in our companion survey mentioned these contributors. Based on this, we believe that better community policies could help attract these contributors, such as providing guides for new contributors and explicitly giving credit to these users.
Supporting Sustainability. Scientific software projects are known for being long-lived and under constant pressure to keep pace with scientific advances. It is common for senior project members to provide visionary leadership to guide that process, but how this translates to software construction was unclear. Our findings place seniors in a primary role as core developers who are most likely responsible for forward engineering and infrastructure tasks. Our findings suggest directions for future research, such as how research priorities generate software development tasks and when and how those tasks are delegated. A more complete understanding of this phenomenon would help software engineers develop better tools and techniques to support that effort.

VI. THREATS TO VALIDITY

First, not all members of a project show up as contributors to the repository; for example, senior staff may offer guidance and support while leaving the actual implementation work to others. Second, people who stop contributing code may still be part of the project. This is frequently the case for graduate student contributors, who may refocus on completing coursework or a thesis towards the end of their tenure. Third, team websites are not always up-to-date, and not all contributors are given

explicit recognition for their work; in one case, an undergraduate student was uncredited on a project site, but a subsequent web search uncovered a university press release detailing a research grant that was awarded for them to do specific work on the project in question. Finally, not all contributors are project members; as with most open-source software projects, open-source scientific software attracts third-party contributors, typically senior researchers who benefit from using the software.
Coding individuals using our approach means addressing several potential ambiguities. First, on a few occasions, a subject may have belonged to different categories at different times (e.g., a staff member starting off as a postdoc). When this occurred, we labeled them according to the role in which they made the majority of their commits. Second, for large research institutions (e.g., national labs), a software project may receive contributions from people nominally part of the same organization, but unaffiliated with the research team; we resolved this by treating those contributors as thirdparty contributors.
The commit analysis performed to answer RQ4 was an automated process which, when applied to a large number of commits, can silently yield false positives (i.e., commits that were unable to be categorized), since commit messages might lack semantic sense [34] or even be empty [35] (i.e., zero words). To mitigate such bias, we conducted a manual analysis over a representative sample of 378 commits (confidence level of 95% with an interval of ±5%). We observed a low number of uncategorized commits. Although uncategorized commits still exist, we believe this approach is the fairest way to categorize the commit intention because, since we are dealing with scientific software projects, the domain of our studied projects is highly specific and complex. Therefore, any other attempt to categorize commits using our own domain experience would introduce even more bias.
Lastly, one could argue that this study does not provide a novel contribution, e.g., “obviously graduate students are largely responsible for adding new features”. However, such common-sense assumptions are often not backed up by empirical evidence. This paper compiles such evidence and, more importantly, quantifies the phenomenon; even though some perceptions are confirmed (e.g., “gradstudents stay longer than postdocs and undergraduates”), others are uncovered (e.g., “thirdparty members survive only one day on average”).

VII. RELATED WORK

Studies on scientific software development. Turk [6] presents techniques for encouraging community engagement with scientific software in the astrophysics community, arguing that attracting thirdparty contributors requires intentional actions designed to encourage their participation. Likewise, Bangerth and Heister [13] outline the practices of successful open-source scientific software libraries, which includes a discussion of the value proposition behind open-sourcing software primarily written by juniors. Sletholt et al. [36] perform a case study of agile development practices among scientific software projects. Howison and Herbsleb [8], in a qualitative study of the incentives for creating and maintaining scientific software, identify the general breakdown of contributors to projects that they study; however, both the focus and methods used in that work differ from ours, as they are not concerned with the specifics of how different people contribute to the software development. Finally, Storer [37] provides an excellent survey of the state of software engineering practice within the scientific community. Our work shares the same motivations as others, but, to our knowledge, is the first to tackle this subject from a software repository mining perspective.
Studies on roles of contributors in open source projects. The study of core developers, i.e., developers that play an essential role in developing the system architecture and forming the general leadership structure, in open source projects is a fruitful research area [38], [39], [40]. Core developers are well known for being active contributors. A general rule of thumb suggests that contributors with more than 80% of the overall contributions are considered core developers [31], [41]. Indeed, for some projects, this number is even higher. Recent work indicates that several well-known, active open source projects rely on 1–2 core developers to drive most of their maintenance and evolution tasks [42]. On the periphery side, research indicates that peripheral developers account for more than 90% of the contributors of a project [31]. Moreover, several authors have acknowledged the existence and the growth of casual contributors (i.e., developers that contribute just once) [43], [32], [44]. Here we enriched the understanding of core and peripheral developers. We also introduced the notion of third-party contributors, who share some of the behaviors commonly observed in peripheral developers (e.g., they have a short-term relationship with the project, and most of their contributions are intended to fix bugs).

VIII. CONCLUSION

Scientific software projects are critical to the advancement of the scientific enterprise, and software engineering research can directly help those efforts through tailored tools, techniques, and practices. However, there has historically been a lack of hard data on who contributes to scientific software and how they behave. In this work, we mined logs from seven non-trivial open source scientific software projects in order to provide answers to these questions. Among our findings, we found that while senior researchers perform the lion’s share of the work in many projects, junior researchers are often on the frontlines driving the software development. We also considered the habits of thirdparty contributors who, while often operating at the periphery of projects, have a valuable role to play in fixing and improving code.
For future work, we plan to conduct ethnographic studies of scientific software projects to better understand topics such as feature creation and bug fixing. We also plan to study how scientific software projects compare with conventional projects.

Acknowledgments. We thank the reviewers, the 72 respondents, and PROPESP/UFPA and CNPq (#406308/2016-0).

R EFERENCES [22] B. L. Berg, “Methods for the social sciences,” Qualitative Research
Methods for the Social Sciences. Boston: Pearson Education, 2004.
[1] T. Hey, S. Tansley, K. M. Tolle et al., The fourth paradigm: data- [23] A. Wood, P. Rodeghero, A. Armaly, and C. McMillan, “Detecting speech
intensive scientific discovery. Microsoft research Redmond, WA, 2009, act types in developer question/answer conversations during bug repair,”
vol. 1. in Proc. of the 26th ACM Symposium on the Foundations of Software
Engineering, 2018.
[2] J. C. Carver, N. P. Chue Hong, and G. K. Thiruvathukal, Software
[24] E. L. Kaplan and P. Meier, “Nonparametric estimation from incomplete
Engineering for Science. CRC Press, 2016.
observations,” Journal of the American statistical association, vol. 53,
[3] E. S. Mesh and J. S. Hawker, “Scientific software process improvement
no. 282, pp. 457–481, 1958.
decisions: A proposed research strategy,” in Software Engineering for
[25] B. Lin, G. Robles, and A. Serebrenik, “Developer turnover in global, in-
Computational Science and Engineering (SE-CSE), 2013 5th Interna-
dustrial open source projects: Insights from applying survival analysis,”
tional Workshop on. IEEE, 2013, pp. 32–39.
in Proceedings of the 12th International Conference on Global Software
[4] C. Letondal and U. Zdun, “Anticipating scientific software evolution as Engineering. IEEE Press, 2017, pp. 66–75.
a combined technological and design approach,” in Second International
[26] M. Nagappan, T. Zimmermann, and C. Bird, “Diversity in software
Workshop on Unanticipated Software Evolution, 2003.
engineering research,” in Proceedings of the 2013 9th Joint Meeting
[5] J. Segal, “Scientists and software engineers: A tale of two cultures,” on Foundations of Software Engineering. ACM, 2013, pp. 466–476.
2008. [27] K. Ferguson, B. Huang, L. Beckman, and M. Sinche, “National post-
[6] M. J. Turk, “Scaling a code in the human dimension,” in Proceedings doctoral association institutional policy report 2014: Supporting and de-
of the Conference on Extreme Science and Engineering Discovery veloping postdoctoral scholars,” Washington, DC: National Postdoctoral
Environment: Gateway to Discovery. ACM, 2013, p. 69. Association, 2014.
[7] J. Howison and J. D. Herbsleb, “Incentives and integration in scientific [28] L. P. Hattori and M. Lanza, “On the nature of commits,” in Proceedings
software production,” in Proceedings of the 2013 conference on Com- of the 23rd IEEE/ACM International Conference on Automated Software
puter supported cooperative work. ACM, 2013, pp. 459–470. Engineering. IEEE Press, 2008, pp. III–63.
[8] ——, “Scientific software production: incentives and collaboration,” in Proceedings of the ACM 2011 conference on Computer supported cooperative work. ACM, 2011, pp. 513–522.
[9] J. E. Hannay, C. MacLeod, J. Singer, H. P. Langtangen, D. Pfahl, and G. Wilson, “How do scientists develop and use scientific software?” in Proceedings of the 2009 ICSE workshop on Software Engineering for Computational Science and Engineering. IEEE Computer Society, 2009, pp. 1–8.
[10] P. E. Stephan, How economics shapes science, 2012, vol. 1.
[11] M. Heroux, “Software engineering for computational science and engineering: What can work and what will not,” Invited talk, presented at the 2017 International Workshop on Software Engineering for High Performance Computing in Computational and Data-Enabled Science and Engineering, 2017.
[12] J. Howison, E. Deelman, M. J. McLennan, R. Ferreira da Silva, and J. D. Herbsleb, “Understanding the scientific software ecosystem and its impact: Current and future measures,” Research Evaluation, vol. 24, no. 4, pp. 454–470, 2015.
[13] W. Bangerth and T. Heister, “What makes computational open source software libraries successful?” Computational Science & Discovery, vol. 6, no. 1, p. 015010, 2013.
[14] A. M. Smith, K. E. Niemeyer, D. S. Katz, L. A. Barba, G. Githinji, M. Gymrek, K. D. Huff, C. R. Madan, A. C. Mayes, K. M. Moerman et al., “Journal of open source software (joss): design and first-year review,” PeerJ Computer Science, vol. 4, p. e147, 2018.
[15] G. R. Mirams, C. J. Arthurs, M. O. Bernabeu, R. Bordas, J. Cooper, A. Corrias, Y. Davit, S.-J. Dunn, A. G. Fletcher, D. G. Harvey et al., “Chaste: an open source c++ library for computational physiology and biology,” PLoS computational biology, vol. 9, no. 3, p. e1002970, 2013.
[16] M. R. Crusoe, H. F. Alameldin, S. Awad, E. Boucher, A. Caldwell, R. Cartwright, A. Charbonneau, B. Constantinides, G. Edvenson, S. Fay et al., “The khmer software package: enabling efficient nucleotide sequence analysis,” F1000Research, vol. 4, 2015.
[17] C. D. Cooper, J. P. Bardhan, and L. A. Barba, “A biomolecular electrostatics solver using python, gpus and boundary elements that can handle solvent-filled cavities and stern layers,” Computer physics communications, vol. 185, no. 3, pp. 720–729, 2014.
[18] B. Van Essen, H. Kim, R. Pearce, K. Boakye, and B. Chen, “Lbann: Livermore big artificial neural network hpc toolkit,” in Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. ACM, 2015, p. 5.
[19] E. Yavuz, J. Turner, and T. Nowotny, “Genn: a code generation framework for accelerated brain simulations,” Scientific reports, vol. 6, p. 18854, 2016.
[20] G. Pinto, I. Wiese, and L. F. Dias, “How do scientists develop scientific software? an external replication,” in 25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, Campobasso, Italy, March 20-23, 2018, 2018, pp. 582–591.
[21] J. M. Sheltzer and J. C. Smith, “Elite male faculty in the life sciences employ fewer women,” Proceedings of the National Academy of Sciences, vol. 111, no. 28, pp. 10107–10112, 2014.
[29] W. Poncin, A. Serebrenik, and M. Van Den Brand, “Process mining software repositories,” in Software maintenance and reengineering (CSMR), 2011 15th european conference on. IEEE, 2011, pp. 5–14.
[30] B. Vasilescu, A. Serebrenik, M. Goeminne, and T. Mens, “On the variation and specialisation of workload—a case study of the gnome ecosystem community,” Empirical Software Engineering, vol. 19, no. 4, pp. 955–1008, 2014.
[31] K. Crowston, K. Wei, Q. Li, and J. Howison, “Core and periphery in free/libre and open source software team communications,” in 39th Hawaii International Conference on Systems Science (HICSS-39 2006), 4-7 January 2006, Kauai, HI, USA, 2006.
[32] G. Pinto, I. Steinmacher, and M. Gerosa, “More common than you think: An in-depth study of casual contributors,” in IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016, pp. 112–123.
[33] K. S. Louis, J. M. Holdsworth, M. S. Anderson, and E. G. Campbell, “Becoming a scientist: The effects of work-group size and organizational climate,” The Journal of Higher Education, vol. 78, no. 3, pp. 311–336, 2007.
[34] W. Maalej and H. Happel, “Can development work describe itself?” in Proceedings of the 7th International Working Conference on Mining Software Repositories, MSR 2010 (Co-located with ICSE), Cape Town, South Africa, May 2-3, 2010, Proceedings, 2010, pp. 191–200.
[35] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen, “Boa: A language and infrastructure for analyzing ultra-large-scale software repositories,” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13, 2013.
[36] M. T. Sletholt, J. Hannay, D. Pfahl, H. C. Benestad, and H. P. Langtangen, “A literature review of agile practices and their effects in scientific software development,” in Proceedings of the 4th international workshop on software engineering for computational science and engineering. ACM, 2011, pp. 1–9.
[37] T. Storer, “Bridging the chasm: A survey of software engineering practice in scientific programming,” ACM Computing Surveys (CSUR), vol. 50, no. 4, p. 47, 2017.
[38] R. T. Fielding, “Shared leadership in the apache project,” Commun. ACM, vol. 42, no. 4, pp. 42–43, Apr. 1999.
[39] S. A. Licorish and S. G. MacDonell, “Understanding the attitudes, knowledge sharing behaviors and task performance of core developers: A longitudinal study,” Information & Software Technology, vol. 56, no. 12, pp. 1578–1596, 2014.
[40] J. Coelho, M. T. Valente, L. L. Silva, and A. Hora, “Why we engage in FLOSS: Answers from core developers,” in 11th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), 2018, pp. 1–8.
[41] A. Mockus, R. T. Fielding, and J. D. Herbsleb, “Two case studies of open source software development: Apache and mozilla,” ACM Trans. Softw. Eng. Methodol., vol. 11, no. 3, Jul. 2002.
[42] G. Avelino, L. Passos, A. Hora, and M. T. Valente, “A novel approach for estimating truck factors,” in 24th International Conference on Program Comprehension (ICPC), 2016, pp. 1–10.

[43] R. Pham, L. Singer, O. Liskin, F. Figueira Filho, and K. Schneider, “Creating a shared understanding of testing culture on a social coding site,” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13.
[44] A. Lee, J. C. Carver, and A. Bosu, “Understanding the impressions, motivations, and barriers of one time code contributors to FLOSS projects: a survey,” in Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, 2017, pp. 187–197.

IX. APPENDIX

TABLE VI: Addenda to File Extensions used in Vasilescu et al. [30]

Category  Addenda
Doc       ".*\.md"
Code      ".*\.pas((\.swp)?)(~?)", ".*\.pxd((\.swp)?)(~?)",
          ".*\.ads((\.swp)?)(~?)", ".*\.adb((\.swp)?)(~?)",
          ".*\.bin"
Devdoc    ".*\.pdf", ".*citation.*", ".*license.*", ".*doxyfile.*",
          ".*\.wiki", ".*\.tex", ".*\.bib", ".*\.dox", ".*authors"
Db        ".*\.csv", ".*\.xml", ".*\.fa", ".*\.xlsx", ".*\.zip", ".*\.h5",
          ".*\.bz2", ".*\.tar(\.gz)?", ".*\.fq(\.gz)?", ".*\.pts",
          ".*\.pdb", ".*\.pqr", ".*\.vert", ".*\.node", ".*\.edge",
          ".*\.param(eters)?", ".*\.phi0", ".*\.prototext(\.bve)?",
          ".*\.pkl", ".*\.pbs"
Build     ".*\.build", ".*dockerfile", ".*\.gradle"
Config    ".*\.vcxproj((\.filters)?)(~?)", ".*\.qpg", ".*\.dsp", ".*\.epf"
Img       ".*\.graffle"

A. Questions Used in the Online Survey

1) Which of the following best describes you (select all that apply)?
• Student (e.g. undergraduate or graduate student)
• Postdoc
• Professional
• Professor
2) If you are currently employed, which of the following best describes your current employer?
• Government
• University or college
• Business or industry
• Non-profit organization
• Other (please specify)
3) Please list the highest academic degree you have received or that you are currently working toward.
• High school degree or equivalent
• Associate’s Degree
• Bachelor’s
• Master’s
• Doctorate
4) What is the subject of this degree?
5) How many people are on your team?
6) What and how do they contribute (e.g., adding new features to the core of the project, conducting research that feeds back into the project, writing tests, fixing bugs, et cetera.)?
7) How does a new team member get trained to do their job (e.g. they are self-taught)?
8) What (if anything) do you do to prepare for the departure of a team member (e.g. we ask them to document the undocumented parts of their code)?
9) What kinds of roles and responsibilities do different people play in the development of your software? In particular, consider the following groups: junior researchers (undergraduates, graduate students, and postdocs), senior researchers (seasoned staff), and third party contributors.
10) Are there differences in what different people provide your project? If so, please explain what (e.g. juniors are responsible for small issues while seniors drive the architecture and solve the main problems)?
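
As an illustration of how the file-extension addenda in Table VI might be applied, the following Python sketch assigns a file path to a category by regular-expression matching. It is not the authors' tooling: the function name `classify_path`, the lowercasing of paths, the first-match-wins ordering, and the "unknown" fallback are all assumptions made for this example. Only the addenda from Table VI are included; the base patterns of Vasilescu et al. [30] are not reproduced here.

```python
import re

# Addenda patterns transcribed from Table VI. The base pattern set of
# Vasilescu et al. [30] is intentionally omitted from this sketch.
ADDENDA = {
    "doc": [r".*\.md"],
    "code": [
        r".*\.pas((\.swp)?)(~?)", r".*\.pxd((\.swp)?)(~?)",
        r".*\.ads((\.swp)?)(~?)", r".*\.adb((\.swp)?)(~?)",
        r".*\.bin",
    ],
    "devdoc": [
        r".*\.pdf", r".*citation.*", r".*license.*", r".*doxyfile.*",
        r".*\.wiki", r".*\.tex", r".*\.bib", r".*\.dox", r".*authors",
    ],
    "db": [
        r".*\.csv", r".*\.xml", r".*\.fa", r".*\.xlsx", r".*\.zip",
        r".*\.h5", r".*\.bz2", r".*\.tar(\.gz)?", r".*\.fq(\.gz)?",
        r".*\.pts", r".*\.pdb", r".*\.pqr", r".*\.vert", r".*\.node",
        r".*\.edge", r".*\.param(eters)?", r".*\.phi0",
        r".*\.prototext(\.bve)?", r".*\.pkl", r".*\.pbs",
    ],
    "build": [r".*\.build", r".*dockerfile", r".*\.gradle"],
    "config": [
        r".*\.vcxproj((\.filters)?)(~?)", r".*\.qpg", r".*\.dsp", r".*\.epf",
    ],
    "img": [r".*\.graffle"],
}


def classify_path(path):
    """Return the first category whose pattern matches the whole path.

    Assumptions of this sketch: paths are lowercased before matching
    (the table's patterns are all lowercase), categories are tried in
    table order, and unmatched paths are labelled "unknown".
    """
    name = path.lower()
    for category, patterns in ADDENDA.items():
        for pattern in patterns:
            if re.fullmatch(pattern, name):
                return category
    return "unknown"


if __name__ == "__main__":
    for example in ["README.md", "src/solver.pxd", "LICENSE",
                    "data/reads.fq.gz", "Dockerfile", "main.c"]:
        print(example, "->", classify_path(example))
```

Because the first matching category wins, a path such as `docs/LICENSE.md` would be labelled "doc" rather than "devdoc" under this ordering; a real analysis would need to fix the precedence deliberately.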

