Agile (Data) Science: A (Draft) Manifesto
J. J. Merelo
4/7/2022
Abstract
Science has a data management problem, as well as a project management problem. While industrial-grade
data science teams have embraced the agile mindset, and adopted or created all kinds of tools to
create reproducible workflows, academia-based science is still (mostly) mired in a mindset that is focused
on a single final product (a paper), without focusing on incremental improvement, on any specific problem
or customer, or paying any attention to reproducibility. In this report we argue for the adoption of
the agile mindset and agile data science tools in academia, to make a more responsible and, above all,
reproducible science.
Introduction
By agile, we usually mean a mindset that is applied to the whole software development lifecycle, one that is
customer-centered and focused on continuous improvement of increasingly complex minimally viable prod-
ucts. The name comes from the Agile Manifesto (Beck et al. 2001), literally the “Manifesto for agile software
development”. This manifesto has certainly changed the way software at large is developed, and has become
mainstream, spawning many different methodologies and best practice guidelines. It has proved to be an
efficient way of carrying out all kinds of projects, from small to large-scale ones, mitigating the presence
of bugs and proving to be more efficient (Abrahamsson et al. 2017) than the methodology that prevailed
previously (and still does in many sectors), generally called waterfall (Andrei et al. 2019), which separated
(or siloed) the different teams, from specification to testing, with every team acting at a different part of
the lifecycle.
Despite being prevalent in software development (and, in general, project development) environments, the
agile mindset has certainly not reached science at large, which arguably still works in a way that closely
resembles the waterfall methodology.
Since data science and engineering have become an integral part of the workflow in many companies, agile
data science is, mostly, the way it is done. Again, this is largely because data science is mostly done in
industry, and not in academia, which does not have the same kind of workflows to deal with its own data.
Our intention is to try and put science back in data science. We will critically examine how science
is done, the main reasons why this agile mindset is not being used in science, how agile
concepts would translate to science, and eventually what agile data science, and science at large, would look like.
We will first present what attempts have been made to translate agile concepts to the (academic) world of
science.
This situation has been challenged repeatedly, lately, mainly after the introduction of the aforementioned
agile manifesto. In (Amatriain and Hornos 2009), which is essentially a presentation and not a formal paper,
several proposals are made to apply agile “methods” in research, something that has been proposed repeatedly
in later years, for instance in this blog post (Carattino, n.d.) and even in this paper (Baijens, Helms, and
Iren 2020) which specifies an agile methodology, Scrum, and how it can be applied specifically to data science
projects. As a matter of fact, there were several attempts to raise the issue again and bring it to the attention
of the research community: a blog post introduced agile research (Amatriain 2008) and even drafted an Agile
Research Manifesto (Amatriain 2009). This was almost totally forgotten until it was brought up two years
ago in a blog called “Agile Science” (Bergman 2018). Independently, some researchers proposed an (almost)
ultimatum for Agile Research in (Way, Chandrasekhar, and Murthy 2009), and eventually it became fruitful
in a restricted environment, mHealth, in (Wilson et al. 2018). This goes to show that it is still part of the
fringe, and has not been incorporated either into funding agencies’ guidelines or into common science and
research practice.
This is certainly related to Open Science: Open Science adapts the main ideas of open source software
development to the publication of scientific results and artifacts; the push for Open Science (Robson et
al. 2021) has provided new venues and new ways of understanding and producing science. However,
the uptake of new methodologies is still very slow. While most companies have created pipelines for data
management (Rodríguez 2019), there are neither clear guidelines and best practices, nor resources where
scientific data management can be done at scale and, more importantly, in a way that can have a
(positive) outcome for your career.
we should formulate tests on data to check that it remains in the same format, range and general
characteristics that allow our hypothesis to be valid. So agile science would need to test data to start with,
before using it as input for workflows; but it should also unit-test all software used, perform integration
tests on data plus software, and eventually transform the hypothesis itself into actual software tests that
continuously check whether the hypothesis still holds.
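As a minimal sketch of what such data tests could look like (assuming pytest and pandas; the file name, columns, ranges and hypothesis threshold are hypothetical placeholders, not part of any concrete workflow):

```python
# test_data.py -- hypothetical data tests, meant to run on every change via CI.
# Assumes a CSV dataset at data/measurements.csv with the columns used below.
import pandas as pd


def load_data():
    return pd.read_csv("data/measurements.csv")


def test_schema():
    # The workflow relies on exactly these columns being present.
    df = load_data()
    assert {"sample_id", "temperature", "ph"} <= set(df.columns)


def test_ranges():
    # Values outside these ranges would invalidate the downstream analysis.
    df = load_data()
    assert df["temperature"].between(-50, 60).all()
    assert df["ph"].between(0, 14).all()


def test_hypothesis_still_holds():
    # The paper's claim recast as an executable, continuously run check:
    # here, that a reported correlation stays above some threshold.
    df = load_data()
    assert df["temperature"].corr(df["ph"]) > 0.3
```

Run with pytest on every push, checks like these fail the build as soon as new data stops matching the expected schema, or stops supporting the hypothesis.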
Increasingly, now that the Internet itself, as well as myriad sensors and devices, is a continuous source of
data, publishing a paper drawing conclusions from a small piece of data is valuable and helpful. Creating a
tested, continuously deployed workflow that checks that hypothesis over and over again, and that has been
released as free software to be integrated as input or middleware in other workflows, is immensely more
valuable. But it needs another hypothesis.
The way forward
Agile fixed software development by proposing a series of principles, attached to the Agile Manifesto (Beck
et al. 2001), that eventually spawned a series of tools on the one hand, and best practices on the other. These
tools can be grouped into generic CI and CD toolchains, including MLOps tools (Kreuzberger, Kühl, and
Hirschl 2022), and teamwork tools (usually attached to source repositories) such as Jira or GitHub itself; to
these we add the use of different methodologies (Kanban (Ahmad, Markkula, and Oivo 2013), Scrum), practices (reviews,
retrospective meetings) and roles (product owner, stakeholder) to streamline software production, bring
value to stakeholders, and provide a sane, stable, nurturing and eventually productive working environment.
I have been advocating for using these techniques for quite a long time, at least since 2011 (the last version of
the talk “The art of evolutionary algorithm programming” is available as (Merelo 2013)). In this report I try to
put everything together under the same framework, which we will call agile science.
Science should not be different, and a (roughly) direct translation of all these practices, however they are
interpreted, would be beneficial. We’ll try, anyway, to delve a bit further into those concepts to see how they
translate and how they could be applied, in practice, to science.
In many cases, and especially in data science/machine learning, there will be specialized tools such as MLflow
(Zaharia et al. 2018), with frontends such as Snapper ML (Domenech and Guillén 2020), to simplify the
creation of workflows. No doubt these high-level workflow tools will be extended to other fields, and integrated
with mainstream deployment tools such as Docker or Kubernetes. Integrating these practices seamlessly
merges product (workflow) development with software development, and also leverages existing tools such
as GitHub or GitLab with their accompanying workflow design tools (GitHub Actions, pipelines), as well as
other cloud environments with their accompanying tools.
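As a minimal sketch of how such a tool fits into a scientific workflow (assuming MLflow is installed; the run name, parameters and metric are hypothetical placeholders):

```python
# track_run.py -- hypothetical sketch of experiment tracking with MLflow.
import mlflow

with mlflow.start_run(run_name="hypothesis-check"):
    # Record the exact configuration, so the run can be reproduced later.
    mlflow.log_param("dataset_version", "v3")
    mlflow.log_param("random_seed", 42)

    # ... the actual analysis would run here ...
    effect_size = 0.42  # placeholder for a computed result

    # Log the outcome; a CI job can retrieve and assert on it,
    # turning the hypothesis into a continuously checked test.
    mlflow.log_metric("effect_size", effect_size)
```

The tracking server then keeps the full history of runs and configurations, which is what makes the workflow, rather than the single paper, the reproducible product.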
This also decouples the production of a workflow from its actual deployment. As long as deployment is
clearly expressed, the workflow can be deployed by the scientific team producing it, on premises or on the cloud, or
by anyone else elsewhere. One could even think about a global free infrastructure for doing this kind
of thing, or even a model similar to pay-to-publish: pay-to-deploy, letting the hosting place take care of
long-term maintenance. This also makes science, and the scientific effort, much more sustainable, and satisfies
stakeholders (fourth hypothesis above) by keeping the product of science funding available and working way
beyond the mere lifetime of the grant, or even of the group itself.
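As a minimal sketch of such a decoupled workflow (assuming Flask; the endpoint and the placeholder analysis are hypothetical), the same service could run unchanged on a laptop, an on-premises server, or any cloud:

```python
# serve_workflow.py -- hypothetical sketch: a scientific workflow exposed
# as a small web service, deployable anywhere Python runs.
from flask import Flask, jsonify

app = Flask(__name__)


def run_hypothesis_check():
    # Placeholder for the actual analysis, re-run on current data.
    return {"hypothesis_holds": True, "effect_size": 0.42}


@app.route("/check")
def check():
    # Other workflows can consume this endpoint as input or middleware.
    return jsonify(run_hypothesis_check())


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Whoever operates the service, whether the team itself or a hypothetical pay-to-deploy host, only needs to run this file; the workflow itself does not change.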
Conclusions
In this report we have tried to propose a set of best practices that we think would benefit science at large, but
especially those disciplines that rely heavily on data and software to produce results. Essentially, this set of
practices interprets, translates and codifies the agile (software development) manifesto for the scientific arena,
converting what this manifesto values into a series of four hypotheses that will guide agile science.
Hypotheses need to be proved, however, and science prides itself on being able to establish fact over anecdotal,
or even counter-intuitive, evidence. This is why we also provide a path forward in the shape of several suggested
best practices that will help prove those hypotheses beyond any doubt. There is strong evidence that
supports them, and our own experience applying them for some time via open-repository product development,
especially in papers such as (García-Ortega, Sánchez, and Merelo-Guervós 2021) and, in general, most papers
that we have published lately, shows that they help stakeholder participation in the production of workflows,
make it easier to evolve software related to science and to streamline product development, and also make it
easier to respond to new requirements. Proving the positive effects of these practices is, however, left as future work.
Acknowledgements
This research was funded by projects TecNM-5654.19-P and DemocratAI PID2020-115570GB-C22.
References
Abrahamsson, Pekka, Outi Salo, Jussi Ronkainen, and Juhani Warsta. 2017. “Agile Software Development
Methods: Review and Analysis.” https://arxiv.org/abs/1709.08439.
Ahmad, Muhammad Ovais, Jouni Markkula, and Markku Oivo. 2013. “Kanban in Software Development:
A Systematic Literature Review.” In 2013 39th Euromicro Conference on Software Engineering and
Advanced Applications, 9–16. IEEE.
Amatriain, Xavier. 2008. “Agile Research.” http://technocalifornia.blogspot.com/2008/06/agile-research.html.
———. 2009. “A Manifesto for Agile Research.” https://xamat.github.io/AgileResearch/.
Amatriain, Xavier, and Gemma Hornos. 2009. “Agile Methods in Research.” https://www.slideshare.net/xamat/agile-science
Andrei, Bogdan-Alexandru, Andrei-Cosmin Casu-Pop, Sorin-Catalin Gheorghe, and Costin-Anton Boiangiu.
2019. “A Study on Using Waterfall and Agile Methods in Software Project Management.” Journal Of
Information Systems & Operations Management, 125–35.
Arney, Kat. n.d. “Science Is Broken. Here’s How to Fix It.” http://littleatoms.com/science/science-broken-heres-how-fix-it.
Baijens, J., R. Helms, and D. Iren. 2020. “Applying Scrum in Data Science Projects.” In 2020 IEEE 22nd
Conference on Business Informatics (CBI), 1:30–38. https://doi.org/10.1109/CBI49978.2020.00011.
Beck, Kent, Mike Beedle, Arie Van Bennekum, Alistair Cockburn, Ward Cunningham, Martin Fowler, James
Grenning, et al. 2001. “Manifesto for Agile Software Development.”
Bergman, Olle. 2018. https://crastina.se/xavier-invented-agile-science-a-decade-ago/.
Carattino, Aquiles. n.d. “Agile Development for Science: Scientific Work Can Also Benefit from Principles
Derived from Software Development.” https://www.uetke.com/blog/general/agile-development-for-science/.
Domenech, Antonio Molner, and Alberto Guillén. 2020. “Ml-Experiment: A Python Framework for
Reproducible Data Science.” Journal of Physics: Conference Series 1603 (September): 012025.
https://doi.org/10.1088/1742-6596/1603/1/012025.
García-Ortega, Rubén Héctor, Pablo García Sánchez, and Juan J. Merelo-Guervós. 2021. “Tropes in Films:
An Initial Analysis.” https://arxiv.org/abs/2006.05380.
Gibney, Elizabeth. 2020. “This AI Researcher Is Trying to Ward Off a Reproducibility Crisis.” Nature 577
(7788): 14.
Kardas, Marcin, Piotr Czapla, Pontus Stenetorp, Sebastian Ruder, Sebastian Riedel, Ross Taylor, and Robert
Stojnic. 2020. “Axcell: Automatic Extraction of Results from Machine Learning Papers.” arXiv Preprint
arXiv:2004.14356.
Kreuzberger, Dominik, Niklas Kühl, and Sebastian Hirschl. 2022. “Machine Learning Operations (MLOps):
Overview, Definition, and Architecture.” arXiv Preprint arXiv:2205.02302.
Levecque, K., F. Anseel, A. D. Beuckelaer, J. Heyden, and L. Gisle. 2017. “Work Organization and Mental
Health Problems in PhD Students.” Research Policy 46: 868–79.
Mäkinen, Sasu, Henrik Skogström, Eero Laaksonen, and Tommi Mikkonen. 2021. “Who Needs MLOps:
What Data Scientists Seek to Accomplish and How Can MLOps Help?” arXiv Preprint arXiv:2103.08942.
Merelo, JJ. 2013. “The Art of Evolutionary Algorithm Programming.” https://issuu.com/jjmerelo/docs/art-ecp-cec13.
Moonesinghe, Ramal, Muin J Khoury, and A Cecile JW Janssens. 2007. “Most Published Research Findings
Are False—but a Little Replication Goes a Long Way.” PLoS Med 4 (2): e28.
Oomen, Sandra, Benny De Waal, Ademar Albertin, and Pascal Ravesteyn. 2017. “How Can Scrum Be
Succesful? Competences of the Scrum Product Owner.”
“Papers with Code.” n.d. https://paperswithcode.com.
Robson, Samuel G, Myriam A Baum, Jennifer L Beaudry, Julia Beitner, Hilmar Brohmer, Jason Chin,
Katarzyna Jasko, et al. 2021. “Nudging Open Science.” PsyArXiv. https://doi.org/10.31234/osf.io/zn7vt.
Rodríguez, Jesús. 2019. “How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and
Discovery for Machine Learning Solutions.” https://www.kdnuggets.com/2019/08/linkedin-uber-lyft-airbnb-netflix-solvin
Vandewalle, Patrick. 2012. “Code Sharing Is Associated with Research Impact in Image Processing.” Com-
puting in Science & Engineering 14 (4): 42–47.
———. 2019. “Code Availability for Image Processing Papers: A Status Update.” In WIC IEEE SP
Symposium on Information Theory and Signal Processing in the Benelux, Date: 2019/05/28-2019/05/29,
Location: Gent, Belgium.
Wattanakriengkrai, Supatsara, Bodin Chinthanet, Hideaki Hata, Raula Gaikovina Kula, Christoph Treude,
Jin Guo, and Kenichi Matsumoto. 2020. “GitHub Repositories with Links to Academic Papers: Open
Access, Traceability, and Evolution.” https://arxiv.org/abs/2004.00199.
Way, Thomas, Sandhya Chandrasekhar, and Arun Murthy. 2009. “The Agile Research Penultimatum.” In
Software Engineering Research and Practice, 530–36. Citeseer.
Wilson, Kumanan, Cameron Bell, Lindsay Wilson, and Holly Witteman. 2018. “Agile Research to Comple-
ment Agile Development: A Proposal for an mHealth Research Lifecycle.” NPJ Digital Medicine 1 (1):
1–6.
Woolston, Chris. 2019. “PhDs: The Tortuous Truth.” Nature 575 (7782): 403–7.
Zaharia, Matei, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth
Murching, et al. 2018. “Accelerating the Machine Learning Lifecycle with MLflow.” IEEE Data Eng.
Bull. 41 (4): 39–45.