Pipeline Provenance for Analysis, Evaluation, Trust or Reproducibility
Abstract
Data volumes and rates of research infrastructures will continue to increase in the upcoming years and impact how we interact with their final data products. Little of the processed data can be directly investigated and most of it will be automatically processed with as little user interaction as possible. Capturing all necessary information of such processing ensures reproducibility of the final results and generates trust in the entire process.
We present PRAETOR (Pipeline pRovenance for Analysis, Evaluation, Trust Or Reproducibility), a software suite that enables automated generation, modelling, and analysis of provenance information of Python pipelines. Furthermore, evaluating the pipeline performance against a user-defined quality metric stored in the provenance is a first step towards machine learning processes, in which such information can be fed into dedicated optimisation procedures.
1 Introduction
In a general sense, provenance information documents the history of a thing. For data products, this concept has been adopted to document the entire production history of the product itself.
However, collecting data and generating science-ready datasets and scientific results is a complex procedure that involves a broad range of performance measurements of the infrastructure, expert knowledge, and domain-specific aspects of the data processing. Provenance that describes which additional information went into a data product, and which information is available about it, is therefore a key factor in generating confidence and trust in its scientific use.
Here we present PRAETOR, a software suite that records provenance and provides a framework to explore such information.
PRAETOR has been developed with astronomical use cases in mind (Johnson et al., 2021), but is applicable to a broad range of applications. Astronomical observations are comprehensive examples of data taking: raw and uncalibrated data are converted into physically meaningful units based on metadata of technical equipment and observatory conditions; spurious values are cleaned from the (meta)data; data properties are validated and the quality of data products is assessed; the data are transformed into science-ready data products; and finally, the scientific results are analysed and validated. Such workflows are mostly implemented as individual steps with semi-automatic processing, but with upcoming data rates there is a need for fully automated systems with self-regulated processing and optimisation of production lines for science-ready data products.
PRAETOR provides tools to collect provenance of workflows (in the following, we use the term pipeline for an automated workflow), a user interface (UI) for browsing the provenance data, a database solution to query and analyse the captured provenance, and the ability to concatenate provenance information of individual pipeline executions.
2 Software
In the development of PRAETOR, we have investigated a way to capture provenance from Python-based pipelines and developed diagnostic tools to evaluate a pipeline by utilising its provenance information.
To fully understand the inner workings of the astronomical pipeline described above, one needs information on its input parameters, the original dataset, the observatory metadata, the individual function calls and their parameters, and the relation between all of these pieces of information. Such a set of information is a complex structure and extracting these details is a challenging task.
In order to document this information, we extended PROV, the standard for recording provenance, whose data model represents things, processes, and those responsible for processes as entities, activities, and agents, respectively (Belhajjame et al., 2013).
The main extensions to PROV take the form of attributes that represent components common within Python pipelines, such as function names, Python modules and their versions, and the memory consumption of individual processes.
For analysis purposes, we implemented a quality metric attribute which can be attached to any component within the pipeline as well as to the pipeline itself.
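As an illustration of this data model, the following is a minimal sketch in plain Python, not the PRAETOR implementation. The relation names (used, wasGeneratedBy, wasAssociatedWith) come from PROV; the class names, the "ex:" identifiers, and all attribute names (function_name, module_version, memory_mb, quality_metric) are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Node:
    """A PROV-style node; kind is 'entity', 'activity', or 'agent'."""
    identifier: str
    kind: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ProvGraph:
    """Nodes plus (subject, relation, object) triples between them."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add(self, identifier: str, kind: str, **attributes: Any) -> Node:
        node = Node(identifier, kind, dict(attributes))
        self.nodes[identifier] = node
        return node

    def relate(self, subject: str, relation: str, obj: str) -> None:
        # Refuse relations that point at nodes that were never declared.
        for end in (subject, obj):
            if end not in self.nodes:
                raise KeyError(f"unknown node: {end}")
        self.relations.append((subject, relation, obj))


graph = ProvGraph()
graph.add("ex:raw_data", "entity")
graph.add("ex:calibrated", "entity")
graph.add("ex:pipeline_user", "agent")
# Hypothetical attribute names standing in for the Python-specific extensions.
graph.add(
    "ex:calibrate_1",
    "activity",
    function_name="calibrate",
    module="numpy",
    module_version="1.26.0",
    memory_mb=120.5,
    quality_metric=0.93,
)
graph.relate("ex:calibrate_1", "used", "ex:raw_data")
graph.relate("ex:calibrated", "wasGeneratedBy", "ex:calibrate_1")
graph.relate("ex:calibrate_1", "wasAssociatedWith", "ex:pipeline_user")
```

Storing the extensions as attributes on standard PROV node types, rather than as new node or relation types, is what keeps such a record readable by generic PROV tooling.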
Based on the astronomical use case analysis (Johnson et al., 2021), we have identified a number of further items with which to extend the PROV model for PRAETOR, but their implementation is left to future work.
Many of these extensions are relations between different objects within the provenance, e.g. whether an object was used as data or as a parameter, or whether a process was responsible for loading a specific object.
The motivation for using only a minimally adapted PROV in the current release of PRAETOR was compatibility with existing tools designed for PROV, such as the ProvToolBox and prov2neo.
In addition, the generated PRAETOR-based provenance is designed to be interoperable with tools developed for other PROV-based provenance models, such as those from IVOA (Servillat et al., 2020).
Extracting provenance information and the relations between individual operations can be an overwhelming task, in particular if the pipelines are unknown and treated as "black boxes". To obtain a first overview of the available information, a UI has been developed to browse through the provenance data. Once a deeper understanding of the available information has been acquired, queries and analyses of the captured provenance can be carried out via database operations.
The UI provides a general overview of the software packages, tools, and files used, the processing time, basic memory consumption, and individual functions. Information on the individual functions can be accessed on a dedicated page that provides details on invocations of functions in the pipeline and their input and output parameters. To investigate the sequence of function calls within the pipeline itself, adjacent activities can be followed.
Apart from being a complex structure, provenance data can also grow to substantial sizes, so that efficient processing can become a limiting issue.
Therefore, the PRAETOR package provides a framework for uploading provenance to two different kinds of database: graph databases (Neo4j) and RDF triple stores (Fuseki).
A set of queries is available that extracts key information, such as function invocations, inputs, and outputs, into pandas dataframes.
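To make the idea of such a query concrete, the sketch below runs a triple-pattern match over a small in-memory list of triples and collects, per activity, the function name, inputs, and outputs. It is a stand-in under stated assumptions: it does not use Neo4j, Fuseki, or the PRAETOR query functions, and the "ex:" identifiers and the ex:functionName predicate are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy provenance for a two-step pipeline (identifiers are illustrative).
TRIPLES: List[Triple] = [
    ("ex:load_1", "rdf:type", "prov:Activity"),
    ("ex:load_1", "ex:functionName", "load"),
    ("ex:raw", "prov:wasGeneratedBy", "ex:load_1"),
    ("ex:clean_1", "rdf:type", "prov:Activity"),
    ("ex:clean_1", "ex:functionName", "clean"),
    ("ex:clean_1", "prov:used", "ex:raw"),
    ("ex:cleaned", "prov:wasGeneratedBy", "ex:clean_1"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def invocation_table(triples: List[Triple]) -> List[Dict[str, object]]:
    """One row per activity: function name, used inputs, generated outputs."""
    rows = []
    for activity, _, _ in match(triples, p="rdf:type", o="prov:Activity"):
        names = match(triples, s=activity, p="ex:functionName")
        rows.append({
            "activity": activity,
            "function": names[0][2] if names else None,
            "inputs": [obj for _, _, obj in match(triples, s=activity, p="prov:used")],
            "outputs": [sub for sub, _, _ in match(triples, p="prov:wasGeneratedBy", o=activity)],
        })
    return rows


rows = invocation_table(TRIPLES)
```

A list of dictionaries like rows can be passed directly to pandas.DataFrame to obtain a tabular view comparable to the dataframes mentioned above.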
3 A first glimpse
The software suite has three main components: provenance generation/capturing, provenance analysis and the UI.
The generation and analysis components are packaged within the same Python package, whereas the UI is deployed as a series of Docker containers.
The Python package is available on PyPI and can be installed using e.g. pip install praetor in either a virtual environment or a container alongside an existing pipeline.
The information captured by the package includes any imported modules, functions, file access, and other core Python functionality.
Which of these pieces of information is included can be defined in the praetor_settings_user.py settings file.
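The general idea of switchable capture can be sketched with a decorator that records each function call. This is an assumption-laden illustration and not how PRAETOR instruments a pipeline: the CAPTURE dictionary is a hypothetical stand-in and does not reflect the actual contents of praetor_settings_user.py, and the record fields are invented for the example.

```python
import functools
import time
import tracemalloc
from typing import Any, Callable, Dict, List

# Hypothetical switches; NOT the real options of praetor_settings_user.py.
CAPTURE = {"arguments": True, "memory": True}

# Every decorated call appends one record here.
RECORDS: List[Dict[str, Any]] = []


def record_provenance(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record name, module, inputs, output, runtime, and peak memory of a call."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        record: Dict[str, Any] = {
            "function": func.__name__,
            "module": func.__module__,
        }
        if CAPTURE["arguments"]:
            record["inputs"] = {
                "args": [repr(a) for a in args],
                "kwargs": {k: repr(v) for k, v in kwargs.items()},
            }
        # Start tracing only if no outer call has started it already.
        started_here = False
        if CAPTURE["memory"] and not tracemalloc.is_tracing():
            tracemalloc.start()
            started_here = True
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            record["output"] = repr(result)
            return result
        finally:
            record["duration_s"] = time.perf_counter() - start
            if CAPTURE["memory"]:
                # Peak since tracing began, so nested calls share one peak.
                record["peak_memory_bytes"] = tracemalloc.get_traced_memory()[1]
            if started_here:
                tracemalloc.stop()
            RECORDS.append(record)

    return wrapper


@record_provenance
def scale(values, factor=2.0):
    return [v * factor for v in values]


scaled = scale([1, 2, 3], factor=0.5)
```

After the call, RECORDS holds a single dictionary describing the invocation of scale. Setting an entry of CAPTURE to False drops the corresponding fields from subsequent records.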
The UI can be installed by cloning the GitLab repository and following the relevant installation instructions to start up its Docker environment. Once running, the UI is available via localhost in a web browser; its queries are based on the triple store queries within the PRAETOR analysis package.
Various tutorials are available, including a full n2n-example that explains how to build a Singularity container and generate provenance of an example pipeline. The example pipeline calls various functions, each with different input and output parameters, which are called either within the pipeline itself or via a module. A subset of the provenance of the example pipeline is shown in the figure of the UI. This information can also be obtained using the database framework of PRAETOR. For this, we suggest performing the installation within a virtual environment such as Conda, as explained in the n2n-example.
4 Conclusion
Provenance is of timely importance, both for documenting the pathway through scientific processing and for publishing FAIR (Findable, Accessible, Interoperable, Reusable) data products (Wilkinson et al., 2016). We have presented PRAETOR, a software suite for automated generation and analysis of provenance information from Python-based pipelines. We explained the rationale for capturing provenance in a PROV extension and the decision to divide the software into three stand-alone packages to operate on any Python pipeline. A showcase of the analysis of generated provenance via the user interface and the database framework has been presented.
The provenance information of multiple executions of a pipeline can be used to characterise the input and output parameters. Pipelines that automatically produce science-ready datasets often suffer from data irreversibility, where the raw input data and intermediate results are not stored for repeat analysis. PRAETOR provides the tools to evaluate such pipelines and can serve as a basis for AI and neural network optimisation.
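As a minimal sketch of how provenance from several executions might be characterised, the example below groups runs by the value of one input parameter and averages a quality metric per group. The run records, the parameter name threshold, and the quality values are made up for illustration and are not results from PRAETOR.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List

# Illustrative records, one per pipeline execution (values are invented).
RUNS: List[Dict[str, Any]] = [
    {"run": "r1", "threshold": 3.0, "quality": 0.71},
    {"run": "r2", "threshold": 5.0, "quality": 0.88},
    {"run": "r3", "threshold": 5.0, "quality": 0.84},
    {"run": "r4", "threshold": 7.0, "quality": 0.65},
]


def mean_quality_by(runs: List[Dict[str, Any]], parameter: str) -> Dict[Any, float]:
    """Average the quality metric over all runs sharing a parameter value."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run[parameter]].append(run["quality"])
    return {value: mean(scores) for value, scores in grouped.items()}


summary = mean_quality_by(RUNS, "threshold")
# Parameter value with the highest mean quality across executions.
best = max(summary, key=summary.get)
```

A summary of this kind is the sort of input that a dedicated optimisation procedure could consume when searching for better pipeline parameters.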
References
- Belhajjame et al. (2013) Belhajjame, K., B’Far, R., Cheney, J., et al. 2013, W3C Recommendation, 14, 15
- Johnson et al. (2021) Johnson, M., Paradies, M., Dembska, M., et al. 2021, in TaPP 2021
- Servillat et al. (2020) Servillat, M., Riebe, K., Boisson, C., et al. 2020
- Wilkinson et al. (2016) Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., et al. 2016, Scientific Data, 3, doi: 10.1038/sdata.2016.18