J Comput Aided Mol Des

DOI 10.1007/s10822-015-9893-9

Autocorrelation descriptor improvements for QSAR: 2DA_Sign and 3DA_Sign
Gregory Sliwoski1,2 • Jeffrey Mendenhall1 • Jens Meiler1

Received: 14 September 2015 / Accepted: 23 December 2015


© Springer International Publishing Switzerland 2015

Abstract Quantitative structure–activity relationship (QSAR) is a branch of computer aided drug discovery that relates chemical structures to biological activity. Two well established and related QSAR descriptors are two- and three-dimensional autocorrelation (2DA and 3DA). These descriptors encode the relative position of atoms or atom properties by calculating the separation between atom pairs in terms of number of bonds (2DA) or Euclidean distance (3DA). The sums of all values computed for a given small molecule are collected in a histogram. Atom properties can be added with a coefficient that is the product of atom properties for each pair. This procedure can lead to information loss when signed atom properties are considered such as partial charge. For example, the product of two positive charges is indistinguishable from the product of two equivalent negative charges. In this paper, we present variations of 2DA and 3DA called 2DA_Sign and 3DA_Sign that avoid information loss by splitting unique sign pairs into individual histograms. We evaluate these variations with models trained on nine datasets spanning a range of drug target classes. Both 2DA_Sign and 3DA_Sign significantly increase model performance across all datasets when compared with traditional 2DA and 3DA. Lastly, we find that limiting 3DA_Sign to maximum atom pair distances of 6 Å instead of 12 Å further increases model performance, suggesting that conformational flexibility may hinder performance with longer 3DA descriptors. Consistent with this finding, limiting the number of bonds in 2DA_Sign from 11 to 5 fails to improve performance.

Keywords Quantitative structure activity relationship · Descriptor · 2D autocorrelation · 3D autocorrelation · Artificial neural network · Virtual high-throughput screening

Abbreviations
2DA      2D autocorrelation
3DA      3D autocorrelation
ANN      Artificial neural network
BCL      BioChemical library
CADD     Computer aided drug discovery
GPCR     G-protein coupled receptor
HTS      High-throughput screen
LB-CADD  Ligand-based CADD
logAUC   Area under the logarithmic ROC curve
LOO      Leave-one-out
QSAR     Quantitative structure–activity relationship
RDF      Radial distribution function
ROC      Receiver operating characteristic
VDW      Van der Waals

Electronic supplementary material The online version of this article (doi:10.1007/s10822-015-9893-9) contains supplementary material, which is available to authorized users.

Jens Meiler
[email protected]; http://www.meilerlab.org

Gregory Sliwoski
[email protected]

1 Departments of Chemistry, Pharmacology, and Biomedical Informatics, Center for Structural Biology, Institute for Chemical Biology, Vanderbilt University, 7330 Stevenson Center, Station B 351822, Nashville, TN 37235, USA
2 Institute of Biochemistry, Leipzig University, Brüderstraße 34, 04103 Leipzig, Germany

Introduction

Computer aided drug discovery (CADD) is a multi-faceted approach that implements computational tools into the drug discovery pipeline [1]. CADD can reduce the time and resources required for the development of novel therapeutics. Scientifically, CADD can also provide insights into the complex interaction between a small molecule and a biological target protein. Ligand-based CADD (LB-CADD) is one approach that focuses on analyzing the collective chemical properties of a set of active and inactive compounds without leveraging explicit knowledge of the target protein structure. One fundamental principle of LB-CADD is quantitative structure–activity relationship (QSAR) modeling. The goal of QSAR modeling is to define the relation between chemical structure and biological activity in a quantitative way so that the activity of new molecules can be predicted to prioritize acquisition or synthesis. In general, QSAR can be separated into two major components: a quantitative description of molecular structure (descriptor) and a mathematical model that uses these multidimensional descriptors as input to predict activity. Both components come in a variety of flavors and strategies that vary in performance depending on the specific project. Machine learning techniques are the most commonly applied non-linear mathematical QSAR models [2]. For this study, we use Artificial Neural Networks (ANN) as implemented in BCL::ChemInfo [3] to generate our mathematical models across all conditions.

Descriptors of chemical structure are typically computed as a combination of atomic properties (mass, volume, surface area, partial charge, electronegativity, polarizability, etc.) that are processed with a translation and rotation invariant geometric function to describe the distribution of these properties in the molecular structure. Descriptors can be grouped into five categories, depending on the ‘dimensionality’ of the small molecule description required: (1D) Descriptors that can be derived from the molecular formula such as molecular weight by summing up all atom masses or total charge by summing up nominal charges. (2D) Descriptors that depend on constitution such as the number of hydrogen bond donors/acceptors, number of ring systems, topological surface area, and some approximations of volume and surface area. A topological index, for example, encodes which atoms are chemically bonded [4]. (2.5D) Configuration-dependent descriptors that encode, for example, the relation of stereo-centers within a topological index [5]. (3D) Conformation-dependent descriptors including Radial Distribution Functions (RDF) [6] and 3-Dimensional Autocorrelation (3DA) [7] that encode aforementioned atomic properties in a three-dimensional fingerprint. (4D) Descriptors that take conformational flexibility into account such as those derived from low energy conformational ensembles [8].

A descriptor is considered useful when it provides pertinent information about a compound while adding minimal noise to the overall model. In this respect, the most useful descriptors are the ones with the greatest degree of information density (information used by the model divided by total information). A descriptor that provides no useful information is often ignored by statistical models but can sometimes reduce model performance by overwhelming it with noise [9]. The goal of this paper is to evaluate potential improvements to 2DA and 3DA descriptors [7].

2DA [10] and 3DA [7] descriptors both generate histograms of atom pair distances within a molecule up to a cutoff distance. The major difference between these descriptors that designates their dimensionality is in their representation of interatomic distance. For 2DA, distances are measured in terms of the number of bonds between two connected atoms. 3DA, on the other hand, represents interatomic distance in terms of Euclidean distance typically measured in angstroms. To extend these descriptors beyond the geometric characteristics of a molecule, atom pair distances are weighted by atom properties such as partial charge, electronegativity, etc. The formal definition of 2DA and 3DA is shown in Eq. 1.

$$\mathrm{Autocorrelation}(r_a, r_b) = \sum_{i}^{n} \sum_{j}^{n} \delta(r_a \le r_{i,j} < r_b)\, P_i P_j \qquad (1)$$

where $r_{i,j}$ is the distance between atoms i and j and n is the total number of atoms in the molecule. $P_i$ and $P_j$ are the atom properties for atoms i and j used to weight the autocorrelation. $r_a$ and $r_b$ define the lower and upper boundaries of each consecutive distance bin.

Weighting 2DA and 3DA with atom properties $P_i$ and $P_j$ allows these descriptors to encode the distribution of specific atom properties within a molecule. These properties may be unsigned in the case of atomic mass or signed in the case of partial charge. However, significant information loss arises when signed atom properties are used to weight 2DA and 3DA due to sign-cancellation. For example, a pair of atoms both with positive partial charges will be encoded the same as a pair with negative partial charges. Therefore, we introduce variations of 2DA and 3DA specifically for heterogeneously signed atom properties called 2DA_Sign and 3DA_Sign, respectively. With 2DA_Sign/3DA_Sign, we separate a single 2DA/3DA histogram into three: negative–negative, positive–positive, and opposite sign property pairs. Comparing 2DA_Sign and 3DA_Sign histograms with their traditional counterparts reveals the different forms of information loss that arise when weighting with signed atom properties. Figure 1a compares a single 2DA weighted with TotalCharge (TotalCharge = σ + π partial charges) with the three histograms generated for the same molecule’s TotalCharge-weighted 2DA_Sign. Figure 1b provides the same illustration for 3DA and 3DA_Sign weighted with
TotalCharge. Two specific instances of information loss are highlighted in Fig. 1a. In the distance bin [7:8), standard 3DA weighted with TotalCharge contains almost no signal. However, when sign pairs are separated with 3DA_Sign, very strong signals emerge for negative–negative and opposite sign pairs. Because each bin of the histogram represents a sum of atom pairs with similar distances, the positive products of negative–negative pairs and the negative products of negative–positive pairs cause their signals to cancel each other. Additionally, standard 3DA contains similar signals at distance bins [8:9) and [10:11). However, when unique sign pairs are split with 3DA_Sign, it becomes clear that these signals represent different distributions of negative–negative and positive–positive sign pairs within these distance bins.

Fig. 1 2DA and 3DA lose information when weighted with signed atom properties. a Information loss is revealed when standard 2DA weighted with total atom charge is split into three curves that isolate different sign pairs. 2DA descriptors out to a cutoff distance of 11 bonds are compared for an active compound from screen AID 435034. b Information loss is revealed when standard 3DA weighted with total atom charge is split into three curves that isolate different sign pairs. 3DA descriptors out to 12 Å at a resolution of 1.0 Å per bin are compared for the same compound. Sections are highlighted including (a) standard 3DA encodes almost no signal for distance bin [7:8), whereas sign pair splitting reveals significant presence of negative sign pairs and opposite sign pairs. b1 and b2 standard 3DA encodes equal intensities for bins [8:9) and [10:11), whereas sign pair splitting reveals that the contributions of negative sign pairs and positive sign pairs are significantly different for these two distance bins

Lastly, by default we use a 2DA that encodes distances up to 11 bonds and 3DA that encodes all atom pair distances up to 12 Å [11]. This distance is sufficient to capture the maximum distance within most small molecules. However, 3D descriptors such as 3DA are computed from a single predicted conformation of each molecule. As interatomic distance increases, the degree of flexibility and rotatable bonds may increase, leading to greater degrees of conformational uncertainty at larger distances. This uncertainty and potential error may interfere with QSAR model training. This issue is 3-dimensional and therefore we test a higher resolution 3DA_Sign variation that is limited to 6.0 Å instead of 12.0 Å. As a comparison, we test a similar variation of 2DA_Sign that is limited to a maximum distance of five bonds instead of 11. Here, no noise is added by conformational flexibility.

To test whether these variations are useful in training QSAR models, we used a generalizable framework for benchmarking the utility of 2DA_Sign and 3DA_Sign [11]. With any novel QSAR descriptor, performance evaluation is both important and challenging. In most cases, a predictive model can disregard information that does not increase performance. However, this is not guaranteed and extra descriptors adding too much noise can decrease performance. Additionally, properties that add noise for one dataset may be useful information for another. One approach is to provide the model with as many descriptors as available and perform iterative steps of descriptor selection where those that fail to significantly improve model performance are discarded. However, with an initial set of n descriptors, there are 2^n possible combinations. Coupled with the importance of cross-validation to avoid over-fitting, this process can quickly become time consuming or even intractable. Additionally, any descriptor selection must be repeated for every target of interest or high-throughput screening (HTS) dataset. Several algorithms have been presented to perform efficient descriptor selection [9]. However, as more descriptors and descriptor variations are developed, it is beneficial to use heuristics to eliminate descriptors unlikely to be beneficial. Therefore, we evaluated our descriptors with a rigorous benchmarking protocol that evaluates model performance across a variety of targets and datasets to identify those that consistently improve model performance.

Results

Developing a standard approach to descriptor benchmarking

The simplest evaluation of a descriptor’s utility is through a one-to-one comparison of models trained with and without the descriptor of interest. To keep the total information

Table 1 Nine datasets were used to train models and evaluate model performance across different QSAR descriptor conditions

PubChem project bioassay ID  Target                               Active compounds  Inactive compounds
1798                         M1 muscarinic receptor (agonist)     187               61,646
1843                         Kir2.1 potassium channel             172               301,321
2258                         KCNQ2 potassium channel              213               302,192
2689                         Serine threonine kinase 33           172               319,620
435008                       Orexin 1 receptor                    233               217,925
435034                       M1 muscarinic receptor (antagonist)  362               61,394
463087                       Cav3 calcium channel                 703               100,172
485290                       Tyrosyl-DNA phosphodiesterase 1      281               341,084
488997                       Choline transporter                  252               302,084

PubChem bioassay ID for the overall project is indicated, as well as specific target, total number of confirmed actives, and total inactive compounds
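
Every dataset in Table 1 contains well under 1 % active compounds, which is why ranking quality among the highest-scored compounds matters more than the full ROC area. The logAUC score used in this paper (area under the logarithmic ROC curve between false positive rates of 0.001 and 0.1, as stated in the Fig. 2 caption) can be sketched as follows. This is a minimal illustration and not the BCL implementation: the function name `log_auc`, the step-function integration, the handling of ties, and the normalization by the log-width of the window are all assumptions of this sketch.

```python
import math


def log_auc(scores, labels, fpr_min=0.001, fpr_max=0.1):
    """Area under the ROC curve on a log10(FPR) axis, scaled to [0, 1].

    Assumptions of this sketch (not taken from the paper): the ROC curve
    is treated as a step function, tied scores get no special handling,
    and the area is divided by log10(fpr_max / fpr_min) so that a perfect
    ranking scores 1.0.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one active and one inactive compound")

    # Rank compounds from highest to lowest predicted activity.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    area = 0.0
    true_pos = 0
    false_pos = 0
    for _, is_active in ranked:
        if is_active:
            true_pos += 1  # vertical ROC step: true positive rate rises
            continue
        # Horizontal ROC step: FPR moves one inactive compound to the right.
        left = false_pos / n_neg
        false_pos += 1
        right = false_pos / n_neg
        # Keep only the part of the step inside [fpr_min, fpr_max].
        left = max(left, fpr_min)
        right = min(right, fpr_max)
        if right > left:
            tpr = true_pos / n_pos
            area += tpr * (math.log10(right) - math.log10(left))

    return area / (math.log10(fpr_max) - math.log10(fpr_min))
```

Under these assumptions, a ranking that places every active ahead of every inactive scores 1.0 and a ranking that places every inactive first scores 0.0; the logarithmic axis gives the low false positive region most of the weight.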

provided to QSAR models in either condition constant, it is best to compare models trained with comparable descriptors or variations. Therefore, performance evaluations were isolated for each descriptor type. 2DA_Sign was compared against 2DA, 3DA_Sign was compared against 3DA, and 3DA/3DA_Sign at 6.0 Å was compared to 3DA/3DA_Sign at 12 Å. To enforce statistical comparability, all ANN parameters and objective functions are kept constant as well as any atom properties used for weighting. This does not always ensure that the total number of descriptors provided to models in both conditions is equal. For example, 3DA_Sign splits different sign pair variants by multiplying a single 3DA histogram into three. To avoid the possibility that 3DA_Sign outperforms 3DA simply because it supplies more descriptors, we decreased the resolution of 3DA_Sign three-fold to keep the total number of properties consistent across conditions. Any increase in model performance, therefore, will not be due to increased input vector length. This strategy is inappropriate for 2DA_Sign evaluation, however, because there is no resolution factor to compensate for the differences in vector size. Therefore, we are forced to evaluate 2DA_Sign with three times as many data-points as 2DA.

Model performance is judged by its ability to predict the activity of compounds it has never seen. Compounds not used for training are evaluated and ranked by their predicted activity. Plotting these predictions as true or false positives generates a receiver operating characteristic (ROC) curve. By computing the area under the curve of a logarithmic x-axis ROC curve, it is possible to score the ratio of true positive predictions to false positive predictions for the high confidence predictions.

When training and evaluating QSAR model performance, large datasets that cover large chemical spaces are preferred [12]. These datasets often come from high-throughput screening (HTS) projects where active compounds have been verified against a single target. Alternatively, smaller, focused datasets may be used to evaluate novel descriptors using leave-one-out (LOO) cross-validation [13]. However, this method of benchmarking can be misleading and tends to rely heavily on the presence of specific geometries rather than more subtle properties [14]. To apply the most generalizable benchmark possible, we used nine HTS datasets curated from PubChem [11]. These datasets target various proteins including G-protein coupled receptors (GPCRs), kinases, and ion channels. The number of compounds in these datasets ranges from approximately 61,000 to 344,000. These datasets are detailed in Table 1.

Because each 3D descriptor tested can be weighted with a variety of atom properties, we used nine different atom properties with and without accessible van der Waals (VDW) surface area scaling. Accessible VDW surface area accounts for varying accessibility of different atoms in a molecule arising from overlapping and covered VDW surfaces. Additionally, we provide all models with a standard set of 1D descriptors to achieve a performance baseline that strengthens comparisons. All 1D molecule descriptors and atom properties used for weighting are outlined in Table 2.

2DA_Sign and 3DA_Sign: separating atom properties by sign

The most common method for weighting 2DA and 3DA is with the product of atom properties for each atom pair. For signed properties such as partial charge, information can be lost as the product of two negative values cannot be distinguished from the product of two positive values. To avoid this information loss, we modified the 2DA and 3DA descriptors to allocate atom pairs into one of three histograms depending on whether the atom properties are both negative, both positive, or opposite. These descriptors are called 2DA_Sign and 3DA_Sign and are designed specifically for signed properties such as partial charge since

Table 2 Properties used to train ANN models are categorized as scalar (one property per molecule) and atom (one property per atom)

Property                         Type      Description                                                    Signed
Molecular weight                 Molecule  Total weight of molecule
HBondDonor                       Molecule  Total hydrogen bond donors in molecule
HBondAcceptor                    Molecule  Total hydrogen bond acceptors in molecule
LogP                             Molecule  Octanol/water coefficient; solubility
TotalCharge                      Molecule  Total charge of molecule
NRotBond                         Molecule  Number of rotatable bonds
NAromaticRings                   Molecule  Number of aromatic rings
NRings                           Molecule  Number of closed rings
TopologicalPolarSurfaceArea      Molecule  Total surface area of molecule that is polar
BondGirth                        Molecule  Maximum number of bonds between two atoms
MaxRingSize                      Molecule  Number of atoms in largest ring
MinRingSize                      Molecule  Number of atoms in smallest ring
AromaticAtoms                    Molecule  Number of atoms in aromatic rings
IntersectionAtoms                Molecule  Number of atoms in ring intersections
AromaticIntersectionAtoms        Molecule  Number of atoms in aromatic ring intersections
MaxSigmaCharge                   Molecule  Maximum sigma charge
MinSigmaCharge                   Molecule  Minimum sigma charge
TotalSigmaCharge                 Molecule  Sum of all sigma charges
StDevSigmaCharge                 Molecule  Standard deviation of all sigma charges
MaxVcharge                       Molecule  Maximum V-charge
MinVcharge                       Molecule  Minimum V-charge
TotalVcharge                     Molecule  Sum of absolute values of all V-charges
StDevVcharge                     Molecule  Standard deviation of all V-charges
Girth                            Molecule  Widest diameter of molecule
Identity                         Atom      Unweighted; 1 for all atoms
SigmaCharge [15–17]              Atom      Partial charge localized to σ-electron system                  X
PiCharge [18–20]                 Atom      Partial charge localized to π-electron system                  X
TotalCharge                      Atom      Total partial charge of atom                                   X
Vcharge [21]                     Atom      Partial charge accounting for resonance                        X
EffectivePolarizability [22–24]  Atom      Responsiveness of electron density to external field
IsRingIntersection               Atom      1 if atom is at a non-aromatic ring intersection, 0 otherwise
IsInAromaticRing                 Atom      1 if atom is within aromatic ring, 0 otherwise
InAromaticRingIntersection       Atom      1 if atom is at an aromatic ring intersection, 0 otherwise

Molecule properties are used in every condition as a standard baseline of QSAR information and contain general information regarding overall molecular properties. Atom properties are used in every condition to weight the corresponding descriptor (2DA, 3DA, 2DA_Sign, or 3DA_Sign) with and without VDW surface area scaling. Atom properties that are split into unique sign pairs with the 2DA_Sign and 3DA_Sign descriptors are indicated as ‘signed.’ Algorithms used for the implementations of these atom properties are referenced
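
The atom properties marked as signed in Table 2 are the ones that are split by sign pair. The difference between a standard 2DA and a sign-split 2DA can be sketched as follows. This is a minimal illustration under stated assumptions, not the BCL implementation: the function names are hypothetical, the double sum of Eq. 1 is taken literally (all ordered pairs, including each atom with itself at distance 0), the raw product is accumulated in all three histograms, and a property value of zero is treated as positive. The three-atom chain and its charges are made up for illustration and are not data from the paper.

```python
from collections import deque


def bond_distances(adjacency):
    """All-pairs shortest path lengths in bonds, via BFS from each atom."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


def sign_pair(p_i, p_j):
    """Classify a property pair; zero counts as positive (assumption)."""
    if p_i < 0 and p_j < 0:
        return "neg_neg"
    if p_i >= 0 and p_j >= 0:
        return "pos_pos"
    return "opposite"


def autocorrelation_2d(adjacency, props, max_bonds=11):
    """Standard 2DA: one histogram, bin d sums P_i * P_j at d bonds."""
    hist = [0.0] * (max_bonds + 1)
    dist = bond_distances(adjacency)
    n = len(props)
    for i in range(n):
        for j in range(n):
            d = dist[i][j]
            if d is not None and d <= max_bonds:
                hist[d] += props[i] * props[j]
    return hist


def autocorrelation_2d_sign(adjacency, props, max_bonds=11):
    """Sign-split 2DA: the same sum, routed into three histograms."""
    hists = {key: [0.0] * (max_bonds + 1)
             for key in ("neg_neg", "pos_pos", "opposite")}
    dist = bond_distances(adjacency)
    n = len(props)
    for i in range(n):
        for j in range(n):
            d = dist[i][j]
            if d is not None and d <= max_bonds:
                key = sign_pair(props[i], props[j])
                hists[key][d] += props[i] * props[j]
    return hists


# Toy chain A-B-C with illustrative (made-up) charges.
adjacency = [[1], [0, 2], [1]]
charges = [-0.5, -0.5, 0.5]
print(autocorrelation_2d(adjacency, charges, max_bonds=2))
print(autocorrelation_2d_sign(adjacency, charges, max_bonds=2))
```

In this toy case the one-bond bin of the standard histogram is 0.0 because the negative–negative pair (A, B) and the opposite-sign pair (B, C) cancel, while the sign-split version keeps 0.5 in the negative–negative histogram and -0.5 in the opposite-sign histogram. This mirrors the cancellation the paper describes for the 3DA bin [7:8).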

unsigned properties will solely fill the positive–positive vector. Therefore, when testing the utility of 2DA_Sign and 3DA_Sign, we only apply these new descriptors with signed properties. All unsigned properties are included with standard 2DA or 3DA depending on the condition.

Models trained with signed properties encoded with 2DA_Sign outperformed models trained with standard 2DA for all properties across all datasets tested. The average performance as measured by the area under the logarithmic ROC curve (logAUC) was 0.335. Compared with the average standard 2DA logAUC of 0.295, using 2DA_Sign in place of 2DA for signed atom properties resulted in an increase in performance of approximately 13.8 %. Model performance across nine datasets is compared for 2DA and 2DA_Sign in Fig. 2.

As mentioned, 3DA_Sign was encoded with a larger distance step size than 3DA to ensure that the input vector lengths between the two conditions remained constant. Despite the lower resolution, 3DA_Sign improved model performance over standard 3DA in all datasets. Average model performance across nine datasets as measured by logAUC was 0.358 when applying signed properties with
3DA_Sign (vs 0.343 with 3DA), an average improvement of 4.4 % (paired t test p < 0.05). Model performance across nine datasets is compared for 3DA and 3DA_Sign in Fig. 2.

Fig. 2 Model performance is compared across nine datasets for descriptor modifications. Model performance is evaluated as logAUC (area under the logarithmic ROC curve between 0.001 and 0.1) for different QSAR descriptor methods. Different colored datasets are indicated by their PubChem HTS project assay ID. 2DA_Sign significantly increases model performance (*2DA_Sign vs 2DA, paired t test p < 0.0001, n = 9). Limiting 2DA to 5 bond lengths with 2DA_Sign (short) instead of 11 (2DA_Sign) does not increase performance. 3DA_Sign significantly increases model performance when compared to using standard 3DA with signed properties (*3DA_Sign vs 3DA paired t test p < 0.05, n = 9). Limiting maximum atom pair distance to 6.0 Å in 3DA_Sign (short) significantly increases model performance when compared to limiting maximum atom pair distance to 12 Å (*3DA_Sign vs 3DA_Sign (short) paired t test p < 0.001, n = 9)

Finally, we tested limiting the maximum atom pair distance encoded for 3DA/3DA_Sign to 6.0 Å instead of 12.0. By focusing on the first 6.0 Å at higher resolution, model performance increased significantly from an average logAUC of 0.358 to 0.381 (6.4 % improvement, paired t test p < 0.001). Figure 2 compares model performance across nine datasets when encoding atom pair distances up to 12 versus 6.0 Å. When 2DA_Sign is limited to a maximum distance of 5 bonds instead of 11, on the other hand, performance is not increased. Instead, there is a non-significant decrease in average performance to logAUC 0.324.

Discussion

This study outlines a general QSAR descriptor benchmarking technique that can be used to evaluate novel descriptors. Three potential QSAR descriptor modifications are evaluated using this generalizable benchmark strategy. Descriptors represent small molecules as vectors of numerical properties that can train ANNs to predict small molecule activity towards a specific target. These descriptors come in a continuously growing range of dimensions and information content. Coupled with the high degree of customization for many descriptors, training models using every available descriptor is not only computationally inefficient, but may introduce noise that hinders model performance. Therefore, an evaluation of a novel descriptor is critical before including it with QSAR model application. This evaluation must also be applied across multiple datasets with different targets. By nature, these biological targets may focus on different property demands, thereby making a broad statement of a descriptor’s utility difficult.

The first descriptor tested is a variation of 2DA that is designed for weighting with signed atom properties. Multiplying two negative properties produces the same result as multiplying two equivalent positive properties, leading to misinformation for molecules with two or more atoms with negative properties. Additionally, histogram bins represent a sum of all atom pairs connected by a specific number of bonds. This can lead to additional information loss when opposite signed signals are added. To avoid these problems, we introduce 2DA_Sign to replace standard 2DAs when weighting with signed atom properties. 2DA_Sign generates three histograms of equal length, splitting atom pairs into negative–negative, positive–positive, and opposite signs. Using 2DA_Sign in place of 2DA for signed properties resulted in an average model performance increase of 13.8 % across nine HTS datasets.

Secondly, we tested a variation of 3DA called 3DA_Sign that treats unique sign pairs the same as 2DA_Sign. Because 3DA vector length is controlled by maximum cutoff distance and resolution, it was possible to adjust the resolution of 3DA_Sign to ensure constant input vector length. Encoding signed atom properties with 3DA_Sign in place of 3DA increased model performance across all nine datasets by approximately 4.4 %.

Because of the signed nature of the information provided by 2DA_Sign and 3DA_Sign over traditional autocorrelation, it is possible that targets placing higher demands for charged active compounds may benefit significantly more from these descriptor improvements than targets that require more neutral compounds. Therefore, it was important to examine the charge demands of nine datasets used for evaluation to ensure that they contained active compounds with diverse charge profiles. In Table 3, formal charge populations are listed for all active compounds across all datasets. This reveals a range of charge profiles across active datasets with 435034 containing the lowest percent of neutral compounds (39 %) and 435008 containing the highest (88 %). Additionally, the percent of active compounds with positive formal charges varies significantly from 2 % (485290) to 59 % (435034) and the percent of active compounds with negative formal charges varies from less than 1 % (1843, 463087) to 46 % (485290).

To evaluate whether a higher presence of formal charges in the active compounds allows for a greater performance increase when using 2DA_Sign or 3DA_Sign, the Pearson’s correlation coefficient was calculated between the

Table 3 Percent of active compounds with formal charges varies across datasets

PubChem project bioassay ID  Percent neutral actives  Percent positive actives  Percent negative actives  Percent zwitterion actives
1798                         74                       19                        5                         1
1843                         55                       42                        1                         2
2258                         80                       6                         11                        2
2689                         65                       13                        20                        2
435008                       88                       4                         5                         3
435034                       39                       59                        1                         0
463087                       86                       13                        0                         1
485290                       47                       2                         46                        5
488997                       48                       33                        6                         12

Active compounds across all datasets were analyzed for the presence of formal charges to ensure that target datasets with diverse formal charge preferences were tested due to the nature of the 2DA_Sign and 3DA_Sign descriptors. Active compounds are broken down into neutral, overall positive, overall negative, and zwitterion properties below
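
The Discussion relates the charge percentages of Table 3 to per-dataset performance increases through Pearson correlations over n = 9 datasets. Because the per-dataset performance values are not listed in the text, the sketch below does not reproduce those correlations. It only shows how a Pearson r is computed and how a reported r can be converted to a t statistic with n - 2 degrees of freedom to check a significance claim. The function names are hypothetical, and the critical value 1.895 is the standard one-tailed t-table entry for 7 degrees of freedom at the 0.05 level.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_from_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# One-tailed critical value at the 0.05 level for 9 - 2 = 7 degrees of
# freedom, taken from a standard t table.
T_CRIT_ONE_TAILED_05_DF7 = 1.895

# Correlations reported in the text for n = 9 datasets.
for r in (0.67, -0.67, 0.18, 0.10):
    t = t_from_r(r, 9)
    significant = abs(t) > T_CRIT_ONE_TAILED_05_DF7
    print(f"r = {r:+.2f}  t = {t:+.3f}  one-tailed p < 0.05: {significant}")
```

With nine datasets, |r| = 0.67 gives |t| of roughly 2.39, which exceeds 1.895, while r = 0.18 gives t of roughly 0.48. This is consistent with the significance statements made for 3DA_Sign in the text.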

presence of formal charges within the active compounds and the performance increase for each dataset. When comparing the percentage of neutral actives within a dataset with the performance increase seen when using the improved autocorrelation descriptor over the traditional one, no significant correlation was found with 2DA_Sign (r = 0.10) or 3DA_Sign (r = 0.18). Additionally, no correlation was found between the performance increases with 2DA_Sign and the percentage of active compounds with overall positive charge (r = -0.24) or overall negative charge (r = 0.12). However, a moderate but significant negative correlation was found between the performance increase with 3DA_Sign and the percentage of actives with overall positive charge (r = -0.67, one-tailed p < 0.05). Additionally, a moderate but significant positive correlation was found between performance increase and negative charge (r = 0.67, one-tailed p < 0.05). This suggests a potential link between the formal charge demand of active compounds and the improved performance seen with using 3DA_Sign to separate the autocorrelation of signed atom properties. However, the opposite correlations with regards to positive and negative charge presence and the fact that this correlation is not reflected with 2DA_Sign make it difficult to predict the specific relationship. Additionally, the comparative performance increases seen with both primarily neutral and largely charged active compounds suggest that the improvements seen with these descriptors are independent of specific charge demands of the targets. All plots examining potential relationships between active compound formal charges and performance increases can be found in the supplemental information (supplemental figure S1: percentage neutral actives; supplemental figure S2: percentage positively charged actives; supplemental figure S3: percentage negatively charged actives; supplemental figure S4: percentage zwitterion actives).

Lastly, we tested a maximum atom pair distance limitation of 6.0 Å instead of 12.0. Although 12.0 Å covers the maximum width of many small molecules, encoding longer atom pair distances can provide false information in cases of high molecular flexibility or rotation. A 6.0 Å limitation, on the other hand, focuses more on fragments within the molecule that are relatively invariant with respect to the arbitrary choice of conformer used to represent each molecule. Additionally, shorter distances can be sampled at a higher resolution without increasing input vector size. We found that limiting the maximum atom pair distance to 6.0 Å significantly increases performance across nine datasets by an average of 6.4 %. The fact that we see no performance increase when limiting 2DA_Sign to a distance cutoff of 5 bonds supports the conclusion that the increase in performance of a limited 3DA_Sign is linked to conformational flexibility. However, both results support the notion that 2DA and 3DA descriptors with maximum radius of 5 bonds and 6 Å, respectively, are sufficient to describe molecular structure for QSAR studies.

In conclusion, we present three recommendations for ANN-based QSAR descriptor selection: (1) Encoding signed properties in standard 2DA results in information loss that can significantly decrease model performance. Therefore, it is preferable to split unique sign pairs as with 2DA_Sign. (2) Similar information loss can be seen with standard 3DA and unique sign pairs should be split as with 3DA_Sign. (3) Limiting 3DAs to encode atom pairs up to 6.0 Å instead of 12.0 can significantly improve model performance.

Methods

HTS dataset curation

Nine datasets were used to evaluate descriptor performance. Specific details regarding all curation steps have
123
J Comput Aided Mol Des

been previously described [11]. However, relevant details for all datasets are summarized here:

Datasets were selected for high-throughput screening assays that focused on a single well-defined biological target protein. Active compound sets contain only those hits that were verified in confirmatory assays and did not show cross-activity with other targets tested against. Additionally, only those sets with at least 150 confirmed active compounds were used and the final collection was designed to encompass a variety of pharmaceutically relevant target protein classes. Specifically, for dataset 1798, positive allosteric modulators of M1 were identified in the primary calcium flux assay AID 626. Only active compounds that were verified in confirmatory assay AID 1741 were kept for the dataset and those with cross activity with M4 (AID 1488) were removed. The final ratio of active to inactive compounds is 1:329. For 1843, inhibitors of the inward-rectifying potassium ion channel Kir2.1 were identified in the thallium flux assay AID 1672. Only active compounds verified in confirmatory assays AID 2032 and AID 463252 were kept and compounds showing non-specific activity in additional verification assays were removed. The final active to inactive ratio is 1:1751. For 2258, potentiators of the KCNQ2 potassium channel were identified in the primary thallium flux assay AID 2239. Only active compounds that were verified in confirmatory assay AID 2287 were kept and false positives or compounds showing non-specific effects in additional assays were removed. The final active to inactive ratio is 1:1418. For 2689, inhibitors of the serine/threonine kinase 33 were identified in the primary screen AID 2661. Only active compounds that were verified in confirmatory screen 2821 were kept and non-selective compounds identified in additional assays were removed. The final active to inactive ratio is 1:1858. For 435008, antagonists of the orexin 1 receptor were identified in three primary screens AID 485270, AID 463079, and AID 434989. Only active compounds verified in confirmatory screens AID 504701 and AID 504699 were kept for the active dataset. The final active to inactive ratio is 1:935. For 435034, negative allosteric modulators of the M1 receptor were identified in the primary calcium flux assay AID 628. Only active compounds verified in confirmatory assay AID 677 were kept and compounds that showed cross activity with the M4 receptor in assay AID 860 were removed. The final active to inactive ratio is 1:169. For 463087, inhibitors of the T-type calcium channel Cav3 were identified in the primary calcium flux assay AID 449839. Only actives that were verified in confirmatory screens were kept as the active compound dataset. The final active to inactive ratio is 1:142. For 485290, inhibitors of the tyrosyl-DNA phosphodiesterase 1 were identified in the primary screen 485290. Only active compounds verified in the confirmatory assay AID 489007 were kept as the active compound dataset. The final active to inactive ratio is 1:1213. For 488997, inhibitors of the choline transporter were identified in the primary screen AID 488975. Only active compounds verified in all three confirmatory assays AID 493221, AID 504840, and AID 588401 were kept and compounds showing non-specific activity in additional assays were removed. The final active to inactive ratio is 1:1198. Inactive compounds for all datasets represent those identified in the primary screens.

Generation of numerical descriptors for QSAR model creation

Numerical descriptors and QSAR models were generated and evaluated over nine HTS datasets detailed in Tables 1, 2 and 3. The curation of these datasets has been previously outlined [11]. 3D conformations of all small molecules were generated using the CORINA [25] software package. The BioChemical Library (BCL) software was used to generate all molecular descriptors tested in this study. All descriptors and atom properties used to weight 2DA and 3DA descriptors are described in Table 2. When weighting autocorrelation descriptors, all atom properties are represented with and without accessible surface area scaling.

Standard length 2DA and 2DA_Sign descriptors contained a cutoff distance of 11 bonds (12 values). Shortened 2DA_Sign descriptors contained a cutoff distance of five bonds (6 values). 3DA descriptors tested at a 12.0 Å cutoff were calculated for a step size of 0.167 Å (72 total values). 3DA descriptors tested at a 6.0 Å cutoff were calculated for a step size of 0.084 Å (72 total values). 3DA_Sign descriptors tested at a 12.0 Å cutoff were calculated for a step size of 0.5 Å (3 × 24 = 72 total values). 3DA_Sign descriptors tested at a 6.0 Å cutoff were calculated for a step size of 0.25 Å (3 × 24 = 72 total values).

Artificial neural network model architecture and training

All ANN models were trained using back propagation and a sigmoid transfer function with a simple weight update of η = 0.05 and α = 0.5, a hidden layer of 32 neurons, 0.1 visible neuron dropout, and 0.5 hidden neuron dropout. Each dataset was divided into two sets of compounds: compounds used to train the model (training) and compounds kept hidden from the model during training to evaluate predictability after training has completed (independent). Five-fold cross-validation was used where 20 individual ANN models were trained for each HTS dataset by rotating which compounds appeared in the training and independent sets. Final active or inactive prediction for each independent compound was taken as a consensus across models for which that compound appeared in the independent set. The objective function used during training was the area under the logarithmic receiver operating
characteristic (ROC) curve [26, 27] (logAUC [28]) between false positive rates of 0.001 and 0.1.

ANN model performance evaluation

All models were evaluated with the same objective function used for training. ROC curves with a logarithmic x-axis were generated for consensus predictions sorted by predicted activity and the area under the curve was calculated for the range of 0.001–0.1 (the top 10 % of predicted compound activities). For all statistical comparisons, two-tailed paired t tests were performed between descriptor conditions across the nine HTS datasets.

Figures and artwork

All graphs were generated with Microsoft Excel 2007. Molecule structures were generated with Molecular Operating Environment (MOE, Chemical Computing Group).

Acknowledgments Work in the Meiler laboratory is supported through NIH (R01 GM080403, R01 GM099842, R01 DK097376, R01 HL122010, R01 GM073151, U19 AI117905) and NSF (CHE 1305874).

References

1. Sliwoski G, Kothiwale S, Meiler J, Lowe EW Jr (2014) Computational methods in drug discovery. Pharmacol Rev 66(1):334–395. doi:10.1124/pr.112.007336
2. Salt DW, Yildiz N, Livingstone DJ, Tinsley CJ (1992) The use of artificial neural networks in QSAR. Pestic Sci 36(2):161–170. doi:10.1002/ps.2780360212
3. Butkiewicz M, Lowe EW, Meiler J (2012) Bcl::ChemInfo—qualitative analysis of machine learning models for activation of HSD involved in Alzheimer’s Disease. In: Computational intelligence in bioinformatics and computational biology (CIBCB), 2012 IEEE symposium on, 9–12 May 2012, pp 329–334. doi:10.1109/cibcb.2012.6217248
4. Trinajstić N (1992) Chemical graph theory. In: Mathematical chemistry series, 2nd edn. CRC Press, Boca Raton
5. Balaban AT (1998) Topological and stereochemical molecular descriptors for databases useful in QSAR, similarity/dissimilarity and drug design. SAR QSAR Environ Res 8(1–2):1–21. doi:10.1080/10629369808033259
6. Hemmer MC, Steinhauer V, Gasteiger J (1999) Deriving the 3D structure of organic molecules from their infrared spectra. Vib Spectrosc 19(1):151–164. doi:10.1016/S0924-2031(99)00014-4
7. Broto P, Moreau G, Vandycke C (1984) Molecular structures: perception, autocorrelation descriptor and SAR studies. Perception of molecules: topological structure and 3-dimensional structure. Eur J Med Chem 19(1):61–65
8. Hopfinger AJ, Wang S, Tokarski JS, Jin B, Albuquerque M, Madhav PJ, Duraiswami C (1997) Construction of 3D-QSAR models using the 4D-QSAR analysis formalism. J Am Chem Soc 119(43):10509–10524. doi:10.1021/ja9718937
9. Shahlaei M (2013) Descriptor selection methods in quantitative structure–activity relationship studies: a review study. Chem Rev 113(10):8093–8103. doi:10.1021/cr3004339
10. Moreau G, Broto P (1980) The auto-correlation of a topological-structure—a new molecular descriptor. Nouv J Chim 4(6):359–360
11. Butkiewicz M, Lowe EW Jr, Mueller R, Mendenhall JL, Teixeira PL, Weaver CD, Meiler J (2013) Benchmarking ligand-based virtual high-throughput screening with the PubChem database. Molecules 18(1):735–756. doi:10.3390/molecules18010735
12. Kubinyi H, Folkers G, Martin YC (1998) 3D QSAR in drug design. Qdsar, vol 2. Kluwer, Dordrecht
13. Kiralj R, Ferreira MMC (2009) Basic validation procedures for regression models in QSAR and QSPR studies: theory and application. J Braz Chem Soc 20:770–787
14. Manchester J, Czermiński R (2009) CAUTION: popular “Benchmark” data sets do not distinguish the merits of 3D QSAR methods. J Chem Inf Model 49(6):1449–1454. doi:10.1021/ci9000508
15. Gasteiger J, Marsili M (1978) A new model for calculating atomic charges in molecules. Tetrahedron Lett 19(34):3181–3184. doi:10.1016/S0040-4039(01)94977-9
16. Gasteiger J, Marsili M (1980) Iterative partial equalization of orbital electronegativity—a rapid access to atomic charges. Tetrahedron 36(22):3219–3228. doi:10.1016/0040-4020(80)80168-2
17. Guillen MD, Gasteiger J (1983) Extension of the method of iterative partial equalization of orbital electronegativity to small ring systems. Tetrahedron 39(8):1331–1335. doi:10.1016/S0040-4020(01)91901-5
18. Bauerschmidt S, Gasteiger J (1997) Overcoming the limitations of a connection table description: a universal representation of chemical species. J Chem Inf Comput Sci 37(4):705–714
19. Streitwieser A (1961) Molecular orbital theory for organic chemists. Wiley, New York
20. Gasteiger J, Saller H (1985) Calculation of the charge distribution in conjugated systems by a quantification of the resonance concept. Angew Chem Int Ed Engl 24(8):687–689. doi:10.1002/anie.198506871
21. Gilson MK, Gilson HS, Potter MJ (2003) Fast assignment of accurate partial atomic charges: an electronegativity equalization method that accounts for alternate resonance forms. J Chem Inf Comput Sci 43(6):1982–1997
22. Gasteiger J, Hutchings MG (1983) New empirical models of substituent polarisability and their application to stabilisation effects in positively charged species. Tetrahedron Lett 24(25):2537–2540
23. Gasteiger J, Hutchings MG (1984) Quantitative models of gas-phase proton-transfer reactions involving alcohols, ethers, and their thio analogs. Correlation analyses based on residual electronegativity and effective polarizability. J Am Chem Soc 106(22):6489–6495. doi:10.1021/ja00334a006
24. Miller KJ (1990) Additivity methods in molecular polarizability. J Am Chem Soc 112(23):8533–8542. doi:10.1021/ja00179a044
25. Sadowski J, Gasteiger J (1993) From atoms and bonds to three-dimensional atomic coordinates: automatic model builders. Chem Rev 93(7):2567–2581. doi:10.1021/cr00023a012
26. Cleves AE, Jain AN (2006) Robust ligand-based modeling of the biological targets of known drugs. J Med Chem 49(10):2921–2938. doi:10.1021/Jm051139t
27. Hristozov DP, Oprea TI, Gasteiger J (2007) Virtual screening applications: a study of ligand-based methods and different structure representations in four different scenarios. J Comput Aided Mol Des 21(10–11):617–640. doi:10.1007/s10822-007-9145-8
28. Clark RD, Webster-Clark DJ (2008) Managing bias in ROC curves. J Comput Aided Mol Des 22(3–4):141–146. doi:10.1007/s10822-008-9181-z
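
Illustrative note on the logAUC objective. The objective function described under "ANN model performance evaluation" (area under the ROC curve with a logarithmic x-axis, restricted to false positive rates of 0.001 to 0.1) can be sketched roughly as follows. This is a minimal sketch and not the BCL implementation: the function names `roc_points` and `log_auc` are hypothetical, the ROC curve is treated as a step function, the integral is approximated with a trapezoid rule on a log-spaced grid, and the result is divided by the width of the log10 interval so that a perfect ranking scores 1.0. The text does not state whether or how the published values are normalized, so that scaling is an assumption.

```python
import math


def roc_points(scores, labels):
    """Return (fpr, tpr) lists for compounds ranked by descending score.

    labels: 1 for active, 0 for inactive. Tied scores yield a single point.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, y in ranked if y)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one active and one inactive compound")
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, y) in enumerate(ranked):
        if y:
            tp += 1
        else:
            fp += 1
        # Emit a ROC point only once a group of tied scores is complete.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def log_auc(scores, labels, lo=0.001, hi=0.1, n_grid=200):
    """Area under the ROC curve on a log10 FPR axis between lo and hi.

    Assumption: normalized by log10(hi / lo), so the value lies in [0, 1].
    """
    fpr, tpr = roc_points(scores, labels)

    def tpr_at(x):
        # Step-function ROC: best TPR reached at or before FPR = x.
        return max(t for f, t in zip(fpr, tpr) if f <= x)

    a, b = math.log10(lo), math.log10(hi)
    xs = [a + (b - a) * k / n_grid for k in range(n_grid + 1)]
    ys = [tpr_at(10.0 ** x) for x in xs]
    area = sum((ys[k] + ys[k + 1]) / 2 * (xs[k + 1] - xs[k])
               for k in range(n_grid))
    return area / (b - a)
```

Because the x-axis is logarithmic, the interval from 0.001 to 0.01 contributes as much to the score as the interval from 0.01 to 0.1, which rewards models that rank actives at the very top of the list. For datasets with fewer than about 1000 inactives the lower bound of 0.001 falls below the first non-zero false positive rate, so the step-function assumption matters there.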
