Scikit-fingerprints: easy and efficient computation of
molecular fingerprints in Python
Jakub Adamczyk∗, Piotr Ludynia
AGH University of Krakow, Department of Computer Science, Cracow, Poland
arXiv:2407.13291v1 [cs.SE] 18 Jul 2024
Abstract
In this work, we present scikit-fingerprints, a Python package for compu-
tation of molecular fingerprints for applications in chemoinformatics. Our
library offers an industry-standard scikit-learn interface, allowing intuitive
usage and easy integration with machine learning pipelines. It is also highly
optimized, featuring parallel computation that enables efficient processing of
large molecular datasets. Currently, scikit-fingerprints stands as the most
feature-rich library in the Python ecosystem, offering over 30 molecular fin-
gerprints. Our library simplifies chemoinformatics tasks based on molecular
fingerprints, including molecular property prediction and virtual screening.
It is also flexible, highly efficient, and fully open source.
Keywords: molecular fingerprints, chemoinformatics, molecular property
prediction, Python, machine learning, scikit-learn
2000 MSC: 92-04, 92-08, 92E10, 68N01
∗ Corresponding author
Email address: [email protected] (Jakub Adamczyk)
Preprint submitted to SoftwareX July 19, 2024
Metadata
Table 1: Code metadata
Nr.  Code metadata description                                        Value
C1   Current code version                                             1.6.1
C2   Permanent link to code/repository used for this code version    https://github.com/scikit-fingerprints/scikit-fingerprints/tree/SoftwareX_submission_v1.6.1
C3   Permanent link to Reproducible Capsule
C4   Legal Code License                                               MIT
C5   Code versioning system used                                      git
C6   Software code languages, tools, and services used                Python 3.9 or newer, RDKit
C7   Compilation requirements, operating environments & dependencies  Linux, Windows, MacOS
C8   If available Link to developer documentation/manual              https://scikit-fingerprints.github.io/scikit-fingerprints/
C9   Support email for questions                                      [email protected]
1. Motivation and significance
Molecules are the basic structures processed in computational chemistry.
They are most commonly represented as molecular graphs, which need to
be converted into multidimensional vectors for the majority of processing
algorithms, most prominently for machine learning (ML) applications. This is
typically done with molecular fingerprints, which are feature extraction
algorithms encoding structural information about molecules as vectors [1]. They
are ubiquitously used in chemoinformatics, e.g. for chemical space diversity
measurement [2, 3, 4] and visualization [5, 6], clustering [7, 8, 9, 10],
virtual screening [11, 12], molecular property prediction [13, 14, 15], and many
more [16, 17, 18, 19, 20, 21]. These chemoinformatics tasks, often relying
on machine learning methods, are important for many real-life applications,
particularly in de novo drug design. For properly assessing the performance
of predictive models, train-test splitting is crucial, and molecular fingerprints
can also be used there [22, 23, 24, 25, 26]. The performance of
fingerprint-based models remains very competitive, even compared to state-of-the-art
graph neural networks (GNNs) [14].
Selection of the optimal fingerprint representation for a given application is
nontrivial, and typically requires computing many different fingerprints [14],
and may also require tuning their hyperparameters [27, 28]. Using multiple
fingerprints at once often improves results, e.g., via concatenation [15] or
data fusion [29, 30]. Processing large molecular datasets necessitates efficient
implementations that leverage modern multicore CPUs. Python, the most
popular language in chemoinformatics today, includes the scikit-learn library
[31], which has become the de facto standard tool for machine learning tasks.
The library is renowned for its intuitive and widely adopted API [32].
Popular open source tools for computing molecular fingerprints, such as
Chemistry Development Kit (CDK) [33], OpenBabel [34] or RDKit [35], are
written in Java or C++, and unfortunately only RDKit has an official Python
wrapper. None of them are compatible with the scikit-learn API, and they only
support sequential computation. Each of these tools also supports only a
limited number of fingerprints.
Figure 1: Package diagram of scikit-fingerprints.
Here, we present scikit-fingerprints, a new Python library for easy and ef-
ficient computation of molecular fingerprints. It is fully scikit-learn com-
patible, enabling easy integration into ML pipelines as a feature extractor
for molecular data. It offers optimized parallel computation of fingerprints,
enabling processing of large datasets and experiments with multiple
algorithms, like data fusion. We implemented over 30 different fingerprints,
making it the most feature-rich library in the Python ecosystem for molecular
fingerprinting. Those include ones based only on molecular graph topology
(2D), as well as those utilizing graph conformational structure (3D, spatial).
It is fully open source, publicly available on PyPI [36] and on GitHub at
https://github.com/scikit-fingerprints/scikit-fingerprints.
2. Software description
2.1. Software architecture
scikit-fingerprints is a Python package for computing molecular fingerprints,
and it is aimed at chemoinformatics and ML workflows. Its interface is fully
compatible with the scikit-learn API [32], ensured by proper inheritance from
scikit-learn base classes and comprehensive tests.
The package structure is shown in Figure 1. All functionality is contained in
the skfp package, allowing easy imports. Base classes are in the skfp.bases
package, and they can be used to extend the functionality with new or
customized fingerprints. skfp.datasets has functions for loading popular
datasets, for easy benchmarking. skfp.preprocessing contains classes for
preprocessing molecules before computing fingerprints, as described in
Section 2.2.1. Fingerprints are represented as classes in the skfp.fingerprints
package. Lastly, skfp.utils contains additional utility classes, such as input
type validators.
Figure 2: Class diagram for fingerprint classes. Some classes omitted for readability.
(Diagram: the abstract class BaseFingerprintTransformer is the parent of
BaseSubstructureFingerprint, AtomPairFingerprint, AutocorrFingerprint,
AvalonFingerprint, E3FPFingerprint, and others; BaseSubstructureFingerprint is
the parent of GhoseCrippenFingerprint, KlekotaRothFingerprint, and
LaggnerFingerprint.)
2.2. Software functionalities
scikit-fingerprints user-facing functionalities can be broken into preprocessing
and fingerprint calculation. It also supports loading popular datasets. In
addition, in contrast to existing software, we support efficient parallelism,
and implement multiple measures for ensuring high code quality and security.
2.2.1. Preprocessing
Fingerprints take RDKit Mol objects as input to the .transform() method.
However, for convenience, all 2D-based fingerprints also take SMILES input,
converting them internally. If done multiple times, this entails a small perfor-
mance penalty, so scikit-fingerprints offers MolFromSmiles and MolToSmiles
classes for easier conversions.
SMILES representation for a molecule is not unique, and does not convey
all information. In particular, incorrect or very unlikely molecules can be
written in SMILES form. MolFromSmiles by design performs only basic
checks, to enable reading arbitrary data. For expanded checks, we imple-
mented MolStandardizer class. Since there is no one-size-fits-all solution
for molecular standardization, we use the most broadly used sanity checks,
recommended by RDKit [37]. This helps to ensure high data quality at the
beginning of the pipeline.
All fingerprints utilizing conformational (3D, spatial) information require
Mol input, with conformers calculated using RDKit, with conf_id prop-
erty set. Conformer generation can be troublesome, with multiple differ-
ent algorithms and settings available. ConformerGenerator class in scikit-
fingerprints greatly simplifies this process, offering reasonable defaults. It at-
tempts to maximize efficiency for easy molecules and minimize failure chance
for complex compounds, based on ETKDGv3 algorithm [38], known to give
excellent results [39].
2.2.2. Fingerprints calculation
Different molecular fingerprints are represented as classes, all inheriting from
BaseFingerprintTransformer, and further from BaseSubstructureFingerprint
for substructure fingerprints like Klekota-Roth [40] (see Figure 2). They act
as stateless scikit-learn transformer objects and are used mainly via the
.transform() method. It takes a list of SMILES strings or RDKit Mol
objects, and outputs a dense NumPy array [41] or sparse SciPy array in
CSR format [42]. Various options, such as vector length for hashed
fingerprints (e.g. ECFP [43]), binary/count variant, dense/sparse output etc. are
specified by constructor parameters. This ensures full composability with
scikit-learn constructs like pipelines and feature unions.
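To make the constructor options above more concrete, the sketch below shows in plain Python what the vector length and the binary/count variant mean for a hashed fingerprint. It is a toy illustration only: the function name fold_features, the example feature identifiers, and the modulo folding are assumptions made for this example, and it is neither the ECFP algorithm nor the scikit-fingerprints implementation.

```python
from collections import Counter


def fold_features(feature_ids, fp_size=16, count=False):
    """Fold integer feature identifiers into a fixed-length vector.

    Toy illustration: each identifier is mapped to position `id % fp_size`.
    With count=False the vector holds 0/1 flags; with count=True it holds
    the number of features that landed on each position.
    """
    hits = Counter(feature_id % fp_size for feature_id in feature_ids)
    if count:
        return [hits.get(i, 0) for i in range(fp_size)]
    return [1 if i in hits else 0 for i in range(fp_size)]


# Hypothetical feature identifiers extracted from one molecule
# (e.g. hashed substructures); 3 and 19 collide when fp_size is 16.
features = [3, 19, 7, 7, 42]

binary_fp = fold_features(features, fp_size=16, count=False)
count_fp = fold_features(features, fp_size=16, count=True)
```

A shorter vector increases the chance of such collisions, which is one reason why the vector length is worth treating as a tunable hyperparameter.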
We implement over 30 different fingerprints of various types, e.g. circu-
lar ECFP [43] and SECFP [44], path-based Atom Pair [45] and Topolog-
ical Torsion [46], substructure-based MACCS [47] and Klekota-Roth [40],
physicochemical descriptors like USRCAT [48] and Mordred [49], and more.
We used efficient RDKit subroutines, written in C++, e.g. for matching
SMARTS patterns. A full list of implemented fingerprints is available in the
scikit-fingerprints online documentation.
2.2.3. Parallelism
Since molecules can be processed independently when computing fingerprints,
the task is “embarrassingly parallel” [50]. This means that we can efficiently
utilize all available cores. To minimize inter-process communication, by de-
fault, input molecules are split into as many chunks as there are physical
cores available, and processed in parallel by Python workers. We utilize
Joblib [51], with Loky executor, which uses memory mapping to efficiently
pass the resulting arrays between processes. Furthermore, by using sparse
arrays and smaller chunk sizes, users can minimize the memory utilization
for large datasets and fingerprints that yield long output vectors [28].
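The chunking strategy described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the library's code: it uses threads from concurrent.futures instead of Joblib worker processes, os.cpu_count() reports logical rather than physical cores, and toy_fingerprint is a hypothetical stand-in for a real fingerprint computation.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    n_chunks = max(1, min(n_chunks, len(items)))
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def toy_fingerprint(smiles_chunk):
    """Stand-in for fingerprint computation: counts C, O, N characters."""
    return [[s.count("C"), s.count("O"), s.count("N")] for s in smiles_chunk]


def compute_parallel(smiles_list, n_workers=None):
    """Process chunks independently and concatenate results in input order."""
    n_workers = n_workers or os.cpu_count() or 1
    chunks = split_into_chunks(smiles_list, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(toy_fingerprint, chunks)  # preserves chunk order
        return [row for chunk_result in results for row in chunk_result]


smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CN", "O=C=O"]
features = compute_parallel(smiles, n_workers=2)
```

Using one chunk per worker keeps the number of hand-offs between workers small, while smaller chunks would trade some of that efficiency for lower peak memory use.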
In addition, we support distributed computing with Dask [52], used as a
Joblib executor. This way, scikit-fingerprints can take advantage of large
high-performance computing (HPC) clusters. All that is required is a single
parameter passed to the Joblib configuration, to connect to the Dask cluster [53].
2.2.4. Datasets loading
Fingerprints are often used in the context of molecular property prediction
on standardized benchmarks. In particular, they constitute strong baselines,
often outperforming complex graph neural networks (GNNs) [13, 17, 14].
Therefore, their easy usage is important for fair evaluation of advancements
in graph classification.
We utilized HuggingFace Hub [54, 55] to host datasets. It offers easy down-
loading, caching, and loading datasets, with automated compression to Par-
quet format. Currently, the most widely used MoleculeNet [56] benchmark
has been integrated, and further datasets can be easily added with unified
interface. Users can load datasets similarly to scikit-learn example datasets.
For example, loading the BBBP dataset from MoleculeNet uses the function
load_bbbp().
2.2.5. Code quality and CI/CD
We ensure high code quality and security with multiple measures. The code is
versioned using Git and GitHub. New features have to be submitted through
Pull Requests and undergo code review. We use pre-commit hooks [57] to
verify code quality before each commit:
• bandit [58], safety [59] - security analysis and dependency vulnera-
bility scanning, following security recommendations [60, 61]
• black [62], flake8 [63], isort [64], pyupgrade [65] - code style, fol-
lowing reproducibility and readability guidelines [66]
• mypy [67] - type checking; our entire code is statically typed, following
security recommendations [68, 69]
• xenon [70] - cyclomatic complexity
We implemented a comprehensive suite of 196 unit and integration tests.
They use PyTest framework [71], and are run automatically on GitHub Run-
ners as a part of CI/CD process. Passing all tests is required to merge the
code into the master branch. We run tests on a full matrix of operating
systems (Linux, Windows, MacOS) and Python versions (from 3.9 to 3.12),
ensuring proper execution in different environments.
Any changes to the documentation are automatically deployed to GitHub
Pages. New package versions are deployed to PyPI by using GitHub
Releases, with a description of the new changes. Internally, this uses a GitHub
Actions workflow and creates a Git tag on the commit used in the given
release. scikit-fingerprints can be installed via pip by running pip install
scikit-fingerprints.
3. Illustrative examples
3.1. Parallel computation
Computing fingerprints in parallel is useful for all molecular tasks, in
particular for large databases in virtual screening. To illustrate the capability of
scikit-fingerprints in this regard, we compute fingerprints for the popular HIV
dataset from the MoleculeNet benchmark [56]. It contains a wide variety of
molecules for medicinal chemistry data, including organometallics, small
and large molecules, some atoms with a very high number of bonds etc. We
limit the data to 10 thousand molecules, due to the high computational time of
running the benchmark multiple times for many data sizes and fingerprints.
Code is available in the GitHub repository, in the benchmarking directory.
Figure 3: Computation time for PubChem fingerprint. (Line plot; x-axis:
number of molecules, 0 to 10000; y-axis: time of computation [s], 0 to 80;
one line each for 1, 2, 4, 8, and 16 cores.)
As an example, we present the timings for the PubChem fingerprint [72],
commonly used for virtual screening, in Figure 3. Speedup for all fingerprints¹
is shown in Figure 4, when using 16 cores and 10 thousand molecules.
Speedup is defined as the ratio of sequential to parallel computation time. We
calculate those times as an average of 5 runs, using a machine with an Intel
Core i7-13700K 3.4 GHz CPU. For 3D fingerprints, we do not include the
conformer generation time.
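The speedup definition above amounts to a single division of averaged timings. The short sketch below spells it out; the timing values are hypothetical placeholders chosen for illustration, not measurements from the benchmark.

```python
from statistics import mean


def speedup(sequential_times, parallel_times):
    """Ratio of mean sequential time to mean parallel time."""
    return mean(sequential_times) / mean(parallel_times)


# Hypothetical timings in seconds for 5 runs each (not measured values).
sequential_runs = [80.0, 82.0, 79.0, 81.0, 78.0]
parallel_runs = [10.0, 10.5, 9.5, 10.0, 10.0]

print(f"speedup: {speedup(sequential_runs, parallel_runs):.1f}x")
```

A value above 1 means the parallel run was faster, while a value below 1 means that parallel overhead outweighed the gain.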
PubChem fingerprint clearly benefits from parallelism, with time decreasing
with almost perfect speedup for more cores. This behavior is typical in
particular for all substructure-based fingerprints, which have to check numerous
SMARTS patterns for each molecule. This gain is especially significant for
larger datasets.
¹ We omit Pharmacophore fingerprint due to excessive computation time. Due to
checking multiple SMARTS patterns for all atoms, it is by far the slowest fingerprint.
Figure 4: Speedup for fingerprints when using 16 cores. (Horizontal bar chart;
x-axis: speedup, 0 to 16; y-axis labels: AtomPair, Autocorr, Avalon, E3FP,
ECFP, ERG, EState, FunctionalGroups, GhoseCrippen, KlekotaRoth, Laggner,
Layered, Lingo, MACCS, MAP, MHFP, Mordred, MORSE, MQNs, Pattern,
Pharmacophore, PhysiochemicalProperties, PubChem, RDF, RDKit, SECFP,
TopologicalTorsion, USR, USRCAT, WHIM.)
High speedup values indicate that a significant majority of fingerprints bene-
fits from parallelism, with Klekota-Roth achieving the greatest improvement.
In general, computationally costly ones like SECFP or Mordred gain the
most. Only the fastest ones like ECFP or Atom Pair have speedup less than
1, meaning slower computation than the sequential one. However, we did
not tune the number of cores here, and using 2 or 4 could be more beneficial
for those fingerprints given this amount of data.
3.2. Sparse matrix support
Molecular fingerprints are often extremely sparse, therefore using proper rep-
resentation can result in large savings in memory usage, compared to dense
arrays. Differences are particularly significant for large datasets, which are
typical for virtual screening or similarity searching.
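To illustrate why a sparse layout saves memory, the sketch below converts a small dense matrix into the three arrays of the CSR format (values, column indices, and row pointers) using plain Python lists. It is a didactic sketch with made-up data, not how SciPy or scikit-fingerprints build their arrays internally.

```python
def dense_to_csr(rows):
    """Convert a dense matrix (list of lists) into CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero value
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr


# Made-up count fingerprints for three molecules, mostly zeros.
dense = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 2, 0, 0],
]
data, indices, indptr = dense_to_csr(dense)

dense_values = sum(len(row) for row in dense)
stored_values = len(data) + len(indices) + len(indptr)
```

In this example 24 dense entries shrink to 10 stored numbers, and the gap widens as vectors get longer and sparser.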
scikit-fingerprints has full support for sparse matrix computations, using
SciPy. As an example, we calculated the memory usage of the resulting
fingerprint arrays for PCBA dataset from MoleculeNet [56], consisting of
almost 440 thousand molecules. In Table 2, we report memory usage of
dense and sparse representations. We also report memory savings, defined as
how many times the sparse representation reduced the memory usage. For
brevity, we show the results of 5 fingerprints with the largest reduction. Code
to produce results for all fingerprints is available in the GitHub repository,
in benchmarking directory.
Table 2: Memory usage of fingerprints in dense and sparse versions.
Fingerprint name            Dense array size (MB)   Sparse array size (MB)   Memory savings
Klekota-Roth                2029                    23                       88.2x
FCFP                        855                     15                       57x
Physiochemical Properties   855                     17                       50.3x
ECFP                        855                     19                       45x
Topological Torsion         855                     19                       45x
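The memory savings defined above are simply the dense size divided by the sparse size. As a quick check, the sketch below recomputes that ratio from the sizes reported in Table 2; the dictionary layout and function name are choices made for this example only.

```python
# (dense MB, sparse MB) as reported in Table 2
table_2 = {
    "Klekota-Roth": (2029, 23),
    "FCFP": (855, 15),
    "Physiochemical Properties": (855, 17),
    "ECFP": (855, 19),
    "Topological Torsion": (855, 19),
}


def memory_savings(dense_mb, sparse_mb):
    """How many times smaller the sparse array is than the dense one."""
    return dense_mb / sparse_mb


savings = {
    name: round(memory_savings(dense, sparse), 1)
    for name, (dense, sparse) in table_2.items()
}
for name, value in savings.items():
    print(f"{name}: {value}x")
```

The recomputed ratios agree with the "Memory savings" column after rounding to one decimal place.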
Clearly, fingerprints greatly benefit from sparse representations, with den-
sity of arrays around just 1-2%. In particular, the very popular ECFP and
FCFP fingerprints [43] are among those benefitting the most. The Klekota-
Roth fingerprint [40], which is quite long for a substructure-based fingerprint,
obtains a reduction from almost 2 GB RAM to just 23 MB, i.e. 88.2 times.
Those savings would be even more important during hyperparameter tuning
of downstream classifiers, when many copies of the data matrix are created.
Using sparse representation did not negatively impact computation time,
compared to the dense one.
3.3. Molecular property prediction
scikit-fingerprints can greatly simplify the process of classifying molecules.
We show a part of a pipeline in Listing 3.3, responsible for computing ECFP
fingerprints from SMILES strings and their classification. For brevity, we
omit loading the data, which is just standard Pandas code for CSV files.
Inputs can be any sequences that consist of SMILES strings or RDKit Mol ob-
jects, e.g. Python lists, NumPy arrays, or Pandas series. Since ECFPFingerprint
is a stateless transformer class, it uses an empty .fit() method in the
pipeline. The code is also parallelized, requiring only the n_jobs param-
eter.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from skfp.fingerprints import ECFPFingerprint

# Fingerprint computation and classification combined in one pipeline
pipeline = make_pipeline(
    ECFPFingerprint(n_jobs=-1),
    RandomForestClassifier(n_jobs=-1, random_state=0),
)
# smiles_train, smiles_test and y_train come from the omitted data loading step
pipeline.fit(smiles_train, y_train)
y_pred = pipeline.predict(smiles_test)
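The role of the empty .fit() method mentioned above can be illustrated with a minimal stand-in written in plain Python. This is a sketch of the general fit/transform protocol only: ToyStatelessFingerprint and its two features are hypothetical, and the class does not inherit from scikit-learn or scikit-fingerprints base classes.

```python
class ToyStatelessFingerprint:
    """Toy transformer with a no-op fit(), mimicking a stateless extractor."""

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit() only returns self.
        return self

    def transform(self, X):
        # Hypothetical features: string length and number of digit characters.
        return [[len(s), sum(ch.isdigit() for ch in s)] for s in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


fp = ToyStatelessFingerprint()
train_features = fp.fit_transform(["CCO", "c1ccccc1"])
# Because no state is stored, transform() gives the same result without fit().
test_features = ToyStatelessFingerprint().transform(["c1ccccc1"])
```

Since the output for a molecule never depends on the training data, such a transformer cannot leak information between the training and test sets.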
3.4. Fingerprint hyperparameter tuning
Most papers in the literature neglect hyperparameter tuning for molecular
fingerprints, only tuning downstream classifiers. We conjecture that this
is also due to the lack of easy to use and efficient software for computing
fingerprints. The works that do perform such tuning [27, 28] indicate that it
is indeed beneficial.
We performed hyperparameter tuning for all 2D fingerprints on MoleculeNet
single-task classification datasets [56], using scaffold split provided by OGB
[73]. Only the pharmacophore fingerprint was omitted due to excessive com-
putation time for some molecules. A Random Forest classifier with default
hyperparameters was used, in order to isolate the tuning improvements to
just fingerprints. In Table 3, we report area under receiver operating charac-
teristic curve (AUROC) values obtained when using tuned hyperparameters,
improvement from tuning compared to the default parameters, and average
gain over all datasets. Due to space limitations, we present the results for five
fingerprints that had the highest average gain. They can be therefore con-
sidered the methods with the highest tunability [74]. Hyperparameter grids
and code are available in the GitHub repository, in benchmarking directory.
Tuning fingerprints results in considerable gains, as high as 5.8% AUROC in
case of RDKit fingerprint [75] on BBBP dataset. Notably, substructure-based
Ghose-Crippen fingerprint [76] gains 4% AUROC on average, using feature
counts instead of binary indicators. This signifies that further research in
this area, utilizing scikit-fingerprints, would be highly beneficial.
Table 3: Molecular property prediction performance using different fingerprints and gain
from tuning their hyperparameters.
               Dataset AUROC and tuning gain            Average tuning
Fingerprint    BACE          BBBP          HIV          AUROC gain
GhoseCrippen   84.0 (+2.9)   73.3 (+4.9)   76.0 (+4.3)  +4.0
RDKit          83.0 (+1.2)   73.0 (+5.8)   76.7 (+0.6)  +2.5
Laggner        80.1 (+3.1)   73.8 (+0.7)   76.1 (+1.0)  +1.6
Avalon         83.8 (+2.3)   71.3 (+0.6)   78.0 (+1.7)  +1.5
EState         82.3 (+1.7)   71.7 (+1.0)   76.4 (+0.0)  +0.9
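The average tuning gain in Table 3 is the arithmetic mean of the per-dataset gains. The sketch below recomputes it from the values in the table and ranks the fingerprints accordingly; the variable names are chosen for this example only.

```python
# Tuning gains in AUROC points per dataset, copied from Table 3.
gains = {
    "GhoseCrippen": {"BACE": 2.9, "BBBP": 4.9, "HIV": 4.3},
    "RDKit": {"BACE": 1.2, "BBBP": 5.8, "HIV": 0.6},
    "Laggner": {"BACE": 3.1, "BBBP": 0.7, "HIV": 1.0},
    "Avalon": {"BACE": 2.3, "BBBP": 0.6, "HIV": 1.7},
    "EState": {"BACE": 1.7, "BBBP": 1.0, "HIV": 0.0},
}

# Mean gain over the three datasets, rounded as in the table.
average_gain = {
    name: round(sum(per_dataset.values()) / len(per_dataset), 1)
    for name, per_dataset in gains.items()
}

# Fingerprints ordered from the highest to the lowest average gain.
ranking = sorted(average_gain, key=average_gain.get, reverse=True)
```

The recomputed averages match the last column of Table 3, and the ranking reproduces the row order of the table.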
3.5. Complex pipelines for 3D fingerprints
For tasks requiring 3D information, i.e. fingerprints based on conformers,
the whole processing pipeline becomes more complex. Conformers need to
be generated and often post-processed with force field optimization, and re-
sulting fingerprints may have missing values. Additionally, using more than
one fingerprint is often beneficial, especially for virtual screening, as they take
different geometry features into consideration. In Listing 3.5, we present an
example of how to create such a pipeline for vectorizing molecules for screening,
for GETAWAY [77] and WHIM [78] descriptors. This short example would
require well over 100 lines of code in RDKit, even without parallelization.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline, make_union
from skfp.fingerprints import GETAWAYFingerprint, WHIMFingerprint
from skfp.preprocessing import ConformerGenerator

pipeline = make_pipeline(
    # Generate conformers and optimize them with a force field
    ConformerGenerator(optimize_force_field="MMFF94", n_jobs=-1),
    # Compute both 3D fingerprints and concatenate them
    make_union(
        GETAWAYFingerprint(n_jobs=-1),
        WHIMFingerprint(n_jobs=-1),
    ),
    # Fill in missing values with column means
    SimpleImputer(strategy="mean"),
)
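The last two steps of the pipeline above, concatenating feature blocks and imputing missing values with column means, can be pictured with a small standard-library sketch. The helper names and the numbers are hypothetical, and the code only mirrors the idea behind make_union and SimpleImputer(strategy="mean") rather than their actual implementations.

```python
import math


def concat_features(block_a, block_b):
    """Row-wise concatenation of two feature matrices (lists of rows)."""
    return [row_a + row_b for row_a, row_b in zip(block_a, block_b)]


def impute_column_means(rows):
    """Replace NaN entries with the mean of the observed values in a column."""
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if not math.isnan(row[col])]
        # Columns with no observed values fall back to 0.0 in this sketch.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[col] if math.isnan(value) else value for col, value in enumerate(row)]
        for row in rows
    ]


nan = float("nan")
# Hypothetical descriptor blocks for three molecules; NaN marks a failed value.
block_a = [[1.0, 2.0], [3.0, nan], [5.0, 6.0]]
block_b = [[0.5], [nan], [1.5]]

features = impute_column_means(concat_features(block_a, block_b))
```

After imputation every molecule has a complete feature vector, so downstream estimators that cannot handle missing values can be used.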
3.6. Comparison with existing software
We compare our library to existing libraries for chemoinformatics, which also
include molecular fingerprints computation. Differences are summarized in
Table 4.
We implement the largest number of fingerprints, including all those
available in other libraries as well as new ones like MAP4 [79] or E3FP [80]. In
terms of Python support, scikit-fingerprints is the first one to have a native
Python package, with other libraries not supporting Python at all (CDK
and OpenBabel), or just using an autogenerated wrapper (RDKit). We also
fully support parallelism and even distributed computing, which is either
nonexistent or very limited elsewhere. scikit-fingerprints is also the only
library utilizing pre-commit hooks and dedicated security tools, and offering
convenient, integrated datasets.
Table 4: Comparison of scikit-fingerprints with CDK, OpenBabel, and RDKit.
                          CDK              OpenBabel        RDKit                 scikit-fingerprints
Language                  Java             C++              C++                   Python
Actively maintained?      Yes              No               Yes                   Yes
Number of fingerprints    13               7                22                    31
Official Python package   No               No (abandoned)   Yes (autogenerated)   Yes
Parallelism               No               No               Very limited          Yes
Pre-commit hooks          No               No               No                    Yes
Code quality tools        Yes              No               Yes                   Yes
Security tools            No               No               No                    Yes
Integrated datasets       No               No               No                    Yes
Easy commercial usage     Yes (LGPL-2.1)   No (GPL-2.0)     Yes (BSD-3)           Yes (MIT)
4. Impact
scikit-fingerprints is a comprehensive library for computing molecular fin-
gerprints. Leveraging fully scikit-learn compatible interfaces, researchers
can easily integrate it with complex pipelines for processing molecular data.
Comprehensive capabilities, with over 30 fingerprints, both 2D and 3D, with
efficient conformer generation, enable using varied solutions for molecular
property prediction, virtual screening, and other tasks. Intuitive and uni-
fied APIs make it easy to use for domain specialists with less programming
expertise, like computational chemists, chemoinformaticians, or molecular
biologists. We also put strong emphasis on code quality, security, and auto-
mated checks and analyzers.
Lack of efficient parallelism is a major downside of existing solutions. Modern
molecular databases can easily encompass millions of molecules, especially for
virtual screening [11, 12]. Our solution, utilizing all available cores, results in
significant speedups, enabling efficient processing of large datasets. This is
also beneficial for hyperparameter tuning [27, 28], fingerprint concatenation
[15], data fusion [29, 30], and other computationally complex tasks.
Simple class hierarchy and high code quality make our solution easily
extensible. New fingerprints can easily be added, automatically benefiting from
parallelization and scikit-learn compatibility. The GitHub repository has had
7 contributors to date, showing good reception by the community and an easy
learning curve. The first issue by an external researcher was opened within a
week of making the library public, highlighting the need for modern software in
this area.
Research shows that fingerprint-based molecular property prediction is
still competitive compared to graph neural networks [13, 16, 14], justifying
further research in this area. In particular, fingerprints should be applied as
baselines for fair evaluation of the impact of novel approaches, which is
particularly easy with our library. scikit-fingerprints has already been applied
to research in molecular chemistry. In [81], it was used to implement ECFP
fingerprint as a baseline algorithm, ensuring fair comparison of various
approaches on the MoleculeNet benchmark [56]. It is also being actively applied
for predicting pesticide toxicity for honey bees, using the recently proposed
ApisTox dataset [82]. Additionally, numerous research projects and Master's
theses at the Faculty of Computer Science at AGH University of Krakow are
currently utilizing it.
Finally, scikit-fingerprints is constantly evolving, with new fingerprints being
added. We are also working on expanding the functionality, e.g.
implementing data splitting functions based on fingerprints, or dataset loaders for
popular benchmark datasets. Therefore, its impact in chemoinformatics will
be even greater in the future.
5. Conclusions
We have developed scikit-fingerprints, an open-source Python library for
computation of molecular fingerprints. It is simple to use, fully compatible
with the scikit-learn API, and easily installable from PyPI. It is also the
most feature-rich and highly efficient library available in the Python
ecosystem, allowing parallel computation of over 30 different fingerprints. Multiple
mechanisms have been implemented to ensure high code quality,
maintainability, and security. It fills the gap for a single, definitive tool in the
Python ecosystem for molecular fingerprints. It facilitates quicker, more
efficient, and also more comprehensive experiments in the fields of
chemoinformatics, de novo drug design and computational molecular chemistry.
Acknowledgements
Research was supported by the funds assigned by Polish Ministry of Science
and Higher Education to AGH University of Krakow, and by the grant from
“Excellence Initiative - Research University” (IDUB) for the AGH University
of Krakow. We thank Michał Szafarczyk and Michał Stefanik for help with
code implementation, and Wojciech Czech for help with manuscript review.
We also thank Alexandra Elbakyan for her work and support for accessibility
of science.
References
[1] R. Todeschini, V. Consonni, Molecular Descriptors for Chemoinformat-
ics, John Wiley & Sons, 2009.
[2] A. Koutsoukas, S. Paricharak, W. R. Galloway, D. R. Spring, A. P. IJz-
erman, R. C. Glen, D. Marcus, A. Bender, How Diverse Are Diversity
Assessment Methods? A Comparative Analysis and Benchmarking of
Molecular Descriptor Space, Journal of Chemical Information and Mod-
eling 54 (1) (2014) 230–242.
[3] R. Sayle, 2D similarity, diversity and clustering in RDKit, RDKit UGM
(2019).
[4] A. Bender, How Similar Are Those Molecules after All? Use Two De-
scriptors and You Will Have Three Different Answers, Expert Opinion
on Drug Discovery 5 (12) (2010) 1141–1151.
[5] S. Riniker, G. A. Landrum, Similarity maps - a visualization strategy for
molecular fingerprints and machine-learning methods, Journal of Chem-
informatics 5 (2013) 1–7.
[6] M. Lovrić, T. Ðuričić, H. T. N. Tran, H. Hussain, E. Lacić, M. A. Rasmussen,
R. Kern, Should We Embed in Chemistry? A Comparison of Unsupervised
Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints,
Pharmaceuticals 14 (8) (2021) 758.
[7] S. Hernández-Hernández, P. J. Ballester, On the Best Way to Cluster
NCI-60 Molecules, Biomolecules 13 (3) (2023) 498.
[8] D. Butina, Unsupervised Data Base Clustering Based on Daylight’s Fin-
gerprint and Tanimoto Similarity: A Fast and Automated Way To Clus-
ter Small and Large Data Sets, Journal of Chemical Information & Com-
puter Sciences 39 (4) (1999) 747–750.
[9] M. G. Malhat, H. M. Mousa, A. B. El-Sisi, Improving Jarvis-Patrick
algorithm for drug discovery, in: 2014 9th International Conference on
Informatics and Systems, IEEE, 2014, pp. DEKM–61.
[10] R. Taylor, Simulation Analysis of Experimental Design Strategies for
Screening Random Compounds as Potential New Drugs and Agrochem-
icals, Journal of Chemical Information & Computer Sciences 35 (1)
(1995) 59–67.
[11] S. Riniker, G. A. Landrum, Open-source platform to benchmark finger-
prints for ligand-based virtual screening, Journal of Cheminformatics
5 (1) (2013) 26.
[12] I. Muegge, P. Mukherjee, An overview of molecular fingerprint similarity
search in virtual screening, Expert Opinion on Drug Discovery 11 (2)
(2016) 137–148.
[13] B. Zagidullin, Z. Wang, Y. Guan, E. Pitkänen, J. Tang, Comparative
analysis of molecular fingerprints in prediction of drug combination ef-
fects, Briefings in Bioinformatics 22 (6) (2021) bbab291.
[14] D. Jiang, Z. Wu, C.-Y. Hsieh, G. Chen, B. Liao, Z. Wang, C. Shen,
D. Cao, J. Wu, T. Hou, Could graph neural networks learn better
molecular representation for drug discovery? A comparison study of
descriptor-based and graph-based models, Journal of Cheminformatics
13 (2021) 1–23.
[15] L. Xie, L. Xu, R. Kong, S. Chang, X. Xu, Improvement of Prediction
Performance With Conjoint Molecular Fingerprint in Deep Learning,
Frontiers in Pharmacology 11 (2020) 606668.
[16] N. M. O’Boyle, R. A. Sayle, Comparing structural fingerprints using
a literature-based similarity benchmark, Journal of Cheminformatics 8
(2016) 1–14.
[17] D. Baptista, J. Correia, B. Pereira, M. Rocha, Evaluating molecular
representations in machine learning models for drug response prediction
and interpretability, Journal of Integrative Bioinformatics 19 (3) (2022)
20220006.
[18] Y. Song, S. Chang, J. Tian, W. Pan, L. Feng, H. Ji, A Comprehensive
Comparative Analysis of Deep Learning Based Feature Representations
for Molecular Taste Prediction, Foods 12 (18) (2023) 3386.
[19] Y. Long, H. Pan, C. Zhang, H. T. Song, R. Kondor, A. Rzhetsky, Molec-
ular Fingerprints Are a Simple Yet Effective Solution to the Drug–Drug
Interaction Problem, Drugs 500 (2022) 1–7.
[20] D. Boldini, D. Ballabio, V. Consonni, R. Todeschini, F. Grisoni, S. A.
Sieber, Effectiveness of molecular fingerprints for exploring the chemical
space of natural products, Journal of Cheminformatics 16 (1) (2024) 35.
[21] B. Ran, L. Chen, M. Li, Y. Han, Q. Dai, et al., Drug-Drug Interactions
Prediction Using Fingerprint Only, Computational and Mathematical
Methods in Medicine 2022 (2022).
[22] J. Deng, Z. Yang, H. Wang, I. Ojima, D. Samaras, F. Wang, A systematic
study of key elements underlying molecular property prediction,
Nature Communications 14 (1) (2023) 6395.
doi:https://doi.org/10.1038/s41467-023-41948-6.
[23] M. Ashton, J. Barnard, F. Casset, M. Charlton, G. Downs, D. Gorse,
J. Holliday, R. Lahana, P. Willett, Identification of Diverse Database
Subsets using Property-Based and Fragment-Based Molecular Descriptions,
Quantitative Structure-Activity Relationships 21 (6) (2002) 598–604.
doi:https://doi.org/10.1002/qsar.200290002.
[24] R. Kpanou, P. Dallaire, E. Rousseau, J. Corbeil, Learning self-
supervised molecular representations for drug–drug interaction predic-
tion, BMC Bioinformatics 25 (1) (2024) 47.
[25] J. Adamczyk, J. Poziemski, P. Siedlecki, ApisTox: a new benchmark
dataset for the classification of small molecules toxicity on honey bees,
arXiv preprint arXiv:2404.16196 (2024).
[26] G. A. Landrum, M. Beckers, J. Lanini, N. Schneider, N. Stiefl, S. Riniker,
SIMPD: an algorithm for generating simulated time splits for validating
machine learning approaches, Journal of Cheminformatics 15 (1) (2023)
119.
[27] S. Cui, Q. Li, D. Li, Z. Lian, J. Hou, et al., Hyper-Mol: Molecular Rep-
resentation Learning via Fingerprint-Based Hypergraph, Computational
Intelligence and Neuroscience 2023 (2023).
[28] L. Pattanaik, C. W. Coley, Molecular Representation: Going Long on
Fingerprints, Chem 6 (6) (2020) 1204–1207.
[29] C. M. Ginn, P. Willett, J. Bradshaw, Combination Of Molecular Sim-
ilarity Measures Using Data Fusion, in: Virtual Screening: An Alter-
native or Complement to High Throughput Screening? Proceedings of
the Workshop ’New Approaches in Drug Design and Discovery’, special
topic ’Virtual Screening’, Schloß Rauischholzhausen, Germany, March
15–18, 1999, Springer, 2002, pp. 1–16.
[30] G. M. Sastry, V. S. Inakollu, W. Sherman, Boosting Virtual Screening
Enrichments with Data Fusion: Coalescing Hits from Two-Dimensional
Fingerprints, Shape, and Docking, Journal of Chemical Information and
Modeling 53 (7) (2013) 1531–1542.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay,
Scikit-learn: Machine Learning in Python, Journal of Machine Learning
Research 12 (2011) 2825–2830.
[32] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel,
V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. Van-
derPlas, A. Joly, B. Holt, G. Varoquaux, API design for machine learn-
ing software: experiences from the scikit-learn project, in: ECML PKDD
Workshop: Languages for Data Mining and Machine Learning, 2013, pp.
108–122.
[33] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Wil-
lighagen, The Chemistry Development Kit (CDK): An open-source Java
library for chemo- and bioinformatics, Journal of Chemical Information
& Computer Sciences 43 (2) (2003) 493–500.
[34] N. M. O’Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch,
G. R. Hutchison, Open Babel: An open chemical toolbox, Journal of
Cheminformatics 3 (2011) 1–14.
[35] RDKit: Open-source Cheminformatics, https://www.rdkit.org, accessed:
2024-05-08. doi:10.5281/zenodo.10633624.
[36] PyPI: Python Package Index (PyPI) is a repository of software for the
Python programming language, https://pypi.org/, accessed: 2024-05-08.
[37] The RDKit Book: Molecular Sanitization,
https://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization,
accessed: 2024-05-08.
[38] S. Wang, J. Witek, G. A. Landrum, S. Riniker, Improving Conformer
Generation for Small Rings and Macrocycles Based on Distance Geometry
and Experimental Torsional-Angle Preferences, Journal of Chemical
Information and Modeling 60 (4) (2020) 2044–2058.
[39] A. T. McNutt, F. Bisiriyu, S. Song, A. Vyas, G. R. Hutchison, D. R.
Koes, Conformer Generation for Structure-Based Drug Design: How
Many and How Good?, Journal of Chemical Information and Modeling
63 (21) (2023) 6598–6607.
[40] J. Klekota, F. P. Roth, Chemical substructures that enrich for biological
activity, Bioinformatics 24 (21) (2008) 2518–2525.
[41] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Vir-
tanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith,
R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane,
J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Shep-
pard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T. E. Oliphant,
Array programming with NumPy, Nature 585 (7825) (2020) 357–362.
doi:10.1038/s41586-020-2649-2.
URL https://doi.org/10.1038/s41586-020-2649-2
[42] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy,
D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J.
van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J.
Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng,
E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman,
I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H.
Ribeiro, F. Pedregosa, P. van Mulbregt, SciPy 1.0 Contributors, SciPy
1.0: Fundamental Algorithms for Scientific Computing in Python, Na-
ture Methods 17 (2020) 261–272. doi:10.1038/s41592-019-0686-2.
[43] D. Rogers, M. Hahn, Extended-Connectivity Fingerprints, Journal of
Chemical Information and Modeling 50 (5) (2010) 742–754.
[44] D. Probst, J.-L. Reymond, A probabilistic molecular fingerprint for big
data settings, Journal of Cheminformatics 10 (2018) 1–12.
[45] R. E. Carhart, D. H. Smith, R. Venkataraghavan, Atom pairs as molec-
ular features in structure-activity studies: definition and applications,
Journal of Chemical Information and Computer Sciences 25 (2) (1985)
64–73.
[46] R. Nilakantan, N. Bauman, J. S. Dixon, R. Venkataraghavan, Topo-
logical torsion: a new molecular descriptor for SAR applications. Com-
parison with other descriptors, Journal of Chemical Information and
Computer Sciences 27 (2) (1987) 82–85.
[47] J. L. Durant, B. A. Leland, D. R. Henry, J. G. Nourse, Reoptimization
of MDL keys for use in drug discovery, Journal of Chemical Information
& Computer Sciences 42 (6) (2002) 1273–1280.
[48] A. M. Schreyer, T. Blundell, USRCAT: real-time ultrafast shape recog-
nition with pharmacophoric constraints, Journal of Cheminformatics 4
(2012) 1–12.
[49] H. Moriwaki, Y.-S. Tian, N. Kawashita, T. Takagi, Mordred: a molecu-
lar descriptor calculator, Journal of Cheminformatics 10 (2018) 1–14.
[50] M. Herlihy, N. Shavit, The Art of Multiprocessor Programming, Revised
Reprint, Elsevier, 2012, p. 14.
[51] Joblib: running Python functions as pipeline jobs,
https://joblib.readthedocs.io/en/stable/, accessed: 2024-05-08.
[52] M. Rocklin, et al., Dask: Parallel computation with blocked algorithms
and task scheduling, in: SciPy, 2015, pp. 126–132.
[53] Dask documentation: Scikit-Learn & Joblib, https://ml.dask.org/joblib.html,
accessed: 2024-05-08.
[54] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cis-
tac, T. Rault, R. Louf, M. Funtowicz, et al., HuggingFace’s Trans-
formers: State-of-the-art Natural Language Processing, arXiv preprint
arXiv:1910.03771 (2019).
[55] HuggingFace Hub: scikit-learn organization,
https://huggingface.co/scikit-fingerprints, accessed: 2024-05-08.
[56] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S.
Pappu, K. Leswing, V. Pande, MoleculeNet: A Benchmark for Molecular
Machine Learning, Chemical Science 9 (2) (2018) 513–530.
[57] pre-commit: A framework for managing and maintaining multi-language
pre-commit hooks, https://pre-commit.com, accessed: 2024-05-08.
[58] bandit: a tool designed to find common security issues in Python code,
https://bandit.readthedocs.io/en/latest/, accessed: 2024-05-08.
[59] Safety: Python dependency vulnerability scanner,
https://github.com/pyupio/safety, accessed: 2024-05-08.
[60] S. Peng, P. Liu, J. Han, A Python Security Analysis Framework in
Integrity Verification and Vulnerability Detection, Wuhan University
Journal of Natural Sciences 24 (2) (2019) 141–148.
[61] M. Alfadel, D. E. Costa, E. Shihab, Empirical analysis of security vul-
nerabilities in Python packages, Empirical Software Engineering 28 (3)
(2023) 59.
[62] black: The uncompromising Python code formatter,
https://black.readthedocs.io/en/stable/, accessed: 2024-05-08.
[63] flake8: Your Tool For Style Guide Enforcement,
https://flake8.pycqa.org/en/latest/, accessed: 2024-05-08.
[64] isort: A Python utility / library to sort imports,
https://pycqa.github.io/isort/, accessed: 2024-05-08.
[65] pyupgrade: A tool to automatically upgrade syntax for newer versions
of the language, https://github.com/asottile/pyupgrade, accessed:
2024-05-08.
[66] C. T. Hoyt, B. Zdrazil, R. Guha, N. Jeliazkova, K. Martinez-Mayorga,
E. Nittinger, Improving reproducibility and reusability in the Journal of
Cheminformatics, Journal of Cheminformatics 15 (1) (2023) 62.
[67] mypy: Optional static typing for Python, https://mypy-lang.org/,
accessed: 2024-05-08.
[68] F. Khan, B. Chen, D. Varro, S. Mcintosh, An Empirical Study of Type-
Related Defects in Python Projects, IEEE Transactions on Software
Engineering 48 (8) (2021) 3145–3158.
[69] H. Gulabovska, Z. Porkoláb, Survey on Static Analysis Tools of Python
Programs, in: SQAMIA, 2019.
[70] xenon: Monitoring tool based on radon, https://github.com/rubik/xenon,
accessed: 2024-05-08.
[71] pytest: Helps you write better programs, https://pytest.org/,
accessed: 2024-05-08.
[72] PubChem Subgraph Fingerprint,
https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf,
accessed: 2024-05-08.
[73] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta,
J. Leskovec, Open Graph Benchmark: Datasets for Machine Learning on
Graphs, Advances in Neural Information Processing Systems 33 (2020)
22118–22133.
[74] P. Probst, A.-L. Boulesteix, B. Bischl, Tunability: Importance of Hyper-
parameters of Machine Learning Algorithms, Journal of Machine Learn-
ing Research 20 (53) (2019) 1–32.
[75] The RDKit Book: RDKit Fingerprints,
https://www.rdkit.org/docs/RDKit_Book.html#rdkit-fingerprints, accessed: 2024-05-08.
[76] A. K. Ghose, G. M. Crippen, Atomic Physicochemical Parameters for
Three-Dimensional Structure-Directed Quantitative Structure-Activity
Relationships I. Partition Coefficients as a Measure of Hydrophobicity,
Journal of Computational Chemistry 7 (4) (1986) 565–577.
[77] V. Consonni, R. Todeschini, M. Pavan, Structure/Response Correlations
and Similarity/Diversity Analysis by GETAWAY Descriptors. 1. Theory
of the Novel 3D Molecular Descriptors, Journal of Chemical Information
and Computer Sciences 42 (3) (2002) 682–692.
[78] R. Todeschini, P. Gramatica, et al., New 3D molecular descriptors: the
WHIM theory and QSAR applications, Perspectives in Drug Discovery
and Design 9 (0) (1998) 355–380.
[79] A. Capecchi, D. Probst, J.-L. Reymond, One molecular fingerprint to
rule them all: drugs, biomolecules, and the metabolome, Journal of
Cheminformatics 12 (2020) 1–15.
[80] S. D. Axen, X.-P. Huang, E. L. Cáceres, L. Gendelev, B. L. Roth, M. J.
Keiser, A Simple Representation of Three-Dimensional Molecular Struc-
ture, Journal of Medicinal Chemistry 60 (17) (2017) 7393–7409.
[81] J. Adamczyk, W. Czech, Molecular topological profile (moltop) – simple
and strong baseline for molecular graph classification (2024). arXiv:
2407.12136.
URL https://arxiv.org/abs/2407.12136
[82] J. Adamczyk, J. Poziemski, P. Siedlecki, ApisTox: a new benchmark
dataset for the classification of small molecules toxicity on honey bees,
arXiv preprint arXiv:2404.16196 (2024).