PyChemFlow: An Automated Pre-Processing Pipeline in Python for Reproducible Machine Learning on Chemical Data
Abstract: PyChemFlow is a Python library for automated and reproducible data pre-processing. Based on open-source code, PyChemFlow has simple requirements that rely on pandas, scikit-learn and joblib. The library's backbone is built up of transformer objects, which are fully constructed during the PyChemFlow fitting process using training data and can be conveniently stored using joblib. The user can run the library with a one-line command after splitting data into train and validation sets or while working with additional data. This is especially useful when reproducibility is critical. PyChemFlow also offers the ability to persistently store metadata, in addition to providing customizable and configurable data manipulation steps.
https://doi.org/10.26434/chemrxiv-2023-3zpw0 ORCID: https://orcid.org/0000-0002-3541-9624 Content not peer-reviewed by ChemRxiv. License: CC BY-NC-ND 4.0
Missing Data Imputation
In practice, missing data occur frequently because of manual data entry, incorrect measurements, equipment malfunctions, intentional omissions, etc. Even a few missing values in some features can reduce the sample size if the affected instances are removed; hence an imputation step is performed here.
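The imputation idea can be illustrated with a minimal sketch in plain pandas. This is not PyChemFlow's own code: the column names, the toy values and the choice of the median as fill value are assumptions made for illustration only. The point shown is that the fill values are learned from the training set and then reused unchanged on other data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
train = pd.DataFrame({
    "logP": [1.2, np.nan, 3.1, 2.0],
    "mol_weight": [180.0, 250.5, np.nan, 310.2],
})
test = pd.DataFrame({
    "logP": [np.nan, 0.7],
    "mol_weight": [199.9, np.nan],
})

# Learn the fill values from the training set only
# (the median is an assumed choice, not taken from the paper).
fill_values = train.median(numeric_only=True)

# Reuse the training-derived values on both sets, so that no
# information from the test set enters the imputation.
train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)
```

Deriving the fill values from the training data alone keeps the test set untouched by the fitting step, which is the same principle the fitted transformer objects follow.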
A total of 7 pre-processing steps are applied to the input data, and we briefly describe each of these steps.

A. Handling null values involves identifying and replacing missing or undefined data points in the dataset. This can be done by methods such as data imputation or deletion.
B. Removing correlated features using the Spearman method involves calculating the correlation coefficient between features in a pairwise fashion and identifying highly correlated features that can be removed to improve model performance.
C. The declaration of feature types includes the identification of continuous and discrete features, as this information is important for proper data preprocessing and modeling.
D. One-hot encoding of discrete features involves creating a new binary representation for each discrete feature category. This allows models to better handle categorical data.
E. Variance-based feature selection allows the identification and removal of features with low variance because they are unlikely to contain useful information for the model.
F. Min-max scaling is a technique in which the values of a feature are scaled to a specific range, usually between 0 and 1. This ensures that all features are on the same scale, which can prevent bias due to varying feature ranges.

The pipeline can then be reloaded and applied to the test or other data sets by using the transform function.

    preproc_pipe_load = joblib.load("file.joblib")
    processed_test = preproc_pipe_load.transform(test)

In this example, train and test are pandas data frames which were pre-split.

3. Limitations and future work

As with many pre-processing and data manipulation tools and libraries, there is no one-size-fits-all solution, and a user might run into the need for further customization. The open-source PyChemFlow code is well documented and can easily be modified further by those with a solid foundation in Python programming. However, the authors tried to cover the basic steps of data processing so that the library can be used as is. Future work on this library includes increased customization, the addition of optional steps, and more flexibility through additional arguments to the functions and classes. The authors see this library as a dynamic one which will be used and developed in the future. Even though the authors suggest using such libraries for increasing transparency and reproducibility, their usability will also depend on data distributions.
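The fit-on-train, store, reload and transform workflow described earlier can be sketched with pandas, scikit-learn and joblib alone. This is a hedged illustration, not PyChemFlow's API: the helper `spearman_drop_list`, the 0.9 correlation threshold, the toy data and the file name `preproc.joblib` are all assumptions, and only three of the listed steps (Spearman filtering, variance-based selection and min-max scaling) are shown.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def spearman_drop_list(df, threshold=0.9):
    """Hypothetical helper: list one feature of each highly correlated pair."""
    corr = df.corr(method="spearman").abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (illustrative only): "b" is a monotonic function of "a",
# "const" has zero variance.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
data = pd.DataFrame({
    "a": x,
    "b": x ** 3,
    "c": rng.normal(size=50),
    "const": np.ones(50),
})
train, test = data.iloc[:40], data.iloc[40:]

# Everything is fitted on the training data only.
drop_cols = spearman_drop_list(train, threshold=0.9)
pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),  # drops zero-variance features
    ("scale", MinMaxScaler()),                       # scales each feature to [0, 1] on train
])
pipe.fit(train.drop(columns=drop_cols))

# Persist the fitted objects together with the metadata needed to reuse them.
path = os.path.join(tempfile.mkdtemp(), "preproc.joblib")
joblib.dump({"drop_cols": drop_cols, "pipeline": pipe}, path)

# Later, or in another session: reload and apply to unseen data.
loaded = joblib.load(path)
processed_test = loaded["pipeline"].transform(test.drop(columns=loaded["drop_cols"]))
```

Because the scaler's minimum and maximum come from the training set, transformed test values may fall slightly outside the 0 to 1 range; this is expected and is the price of keeping the test data out of the fitting step.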

Supporting information

The open-source Python code is available at https://github.com/mariolovric/pychemflow, and the structure of the repository is described in the readme file.

Funding and acknowledgements

M.L. is funded by the EU-Commission Grant Nr-101057497-EDIAQI. The Know-Center is funded within the Austrian COMET Program – Competence Centers for Excellent Technologies – under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.