Autocorrelation Descriptor Improvements For QSAR: 2DA_Sign and 3DA_Sign
DOI 10.1007/s10822-015-9893-9
Abstract Quantitative structure–activity relationship (QSAR) is a branch of computer aided drug discovery that relates chemical structures to biological activity. Two well established and related QSAR descriptors are two- and three-dimensional autocorrelation (2DA and 3DA). These descriptors encode the relative position of atoms or atom properties by calculating the separation between atom pairs in terms of number of bonds (2DA) or Euclidean distance (3DA). The sums of all values computed for a given small molecule are collected in a histogram. Atom properties can be added with a coefficient that is the product of atom properties for each pair. This procedure can lead to information loss when signed atom properties are considered such as partial charge. For example, the product of two positive charges is indistinguishable from the product of two equivalent negative charges. In this paper, we present variations of 2DA and 3DA called 2DA_Sign and 3DA_Sign that avoid information loss by splitting unique sign pairs into individual histograms. We evaluate these variations with models trained on nine datasets spanning a range of drug target classes. Both 2DA_Sign and 3DA_Sign significantly increase model performance across all datasets when compared with traditional 2DA and 3DA. Lastly, we find that limiting 3DA_Sign to maximum atom pair distances of 6 Å instead of 12 Å further increases model performance, suggesting that conformational flexibility may hinder performance with longer 3DA descriptors. Consistent with this finding, limiting the number of bonds in 2DA_Sign from 11 to 5 fails to improve performance.

Keywords Quantitative structure activity relationship · Descriptor · 2D autocorrelation · 3D autocorrelation · Artificial neural network · Virtual high-throughput screening
Abbreviations
2DA       2D autocorrelation
3DA       3D autocorrelation
ANN       Artificial neural network
BCL       BioChemical library
CADD      Computer aided drug discovery
GPCR      G-protein coupled receptor
HTS       High-throughput screen
LB-CADD   Ligand-based CADD
logAUC    Area under the logarithmic ROC curve
LOO       Leave-one-out
QSAR      Quantitative structure–activity relationship
RDF       Radial distribution function
ROC       Receiver operating characteristic
VDW       Van der Waals

Electronic supplementary material The online version of this article (doi:10.1007/s10822-015-9893-9) contains supplementary material, which is available to authorized users.

Jens Meiler
[email protected]; http://www.meilerlab.org
Gregory Sliwoski
[email protected]
1 Departments of Chemistry, Pharmacology, and Biomedical Informatics, Center for Structural Biology, Institute for Chemical Biology, Vanderbilt University, 7330 Stevenson Center, Station B 351822, Nashville, TN 37235, USA
2 Institute of Biochemistry, Leipzig University, Brüderstraße 34, 04103 Leipzig, Germany
J Comput Aided Mol Des
Table 1 Nine datasets were used to train models and evaluate model performance across different QSAR descriptor conditions
Pubchem project bioassay ID Target Active compounds Inactive compounds
provided to QSAR models in either condition constant, it is best to compare models trained with comparable descriptors or variations. Therefore, performance evaluations were isolated for each descriptor type. 2DA_Sign was compared against 2DA, 3DA_Sign was compared against 3DA, and 3DA/3DA_Sign at 6.0 Å was compared to 3DA/3DA_Sign at 12 Å. To enforce statistical comparability, all ANN parameters and objective functions are kept constant as well as any atom properties used for weighting. This does not always ensure that the total number of descriptors provided to models in both conditions is equal. For example, 3DA_Sign splits different sign pair variants by multiplying a single 3DA histogram into three. To avoid the possibility that 3DA_Sign outperforms 3DA simply because it supplies more descriptors, we decreased the resolution of 3DA_Sign three-fold to keep the total number of properties consistent across conditions. Any increase in model performance, therefore, will not be due to increased input vector length. This strategy is inappropriate for 2DA_Sign evaluation, however, because there is no resolution factor to compensate for the differences in vector size. Therefore, we are forced to evaluate 2DA_Sign with three times as many data-points as 2DA.
Model performance is judged by its ability to predict the activity of compounds it has never seen. Compounds not used for training are evaluated and ranked by their predicted activity. Plotting these predictions as true or false positives generates a receiver operating characteristic (ROC) curve. By computing the area under the curve of a logarithmic x-axis ROC curve, it is possible to score the ratio of true positive predictions to false positive predictions for the high confidence predictions.
When training and evaluating QSAR model performance, large datasets that cover large chemical spaces are preferred [12]. These datasets often come from high-throughput screening (HTS) projects where active compounds have been verified against a single target. Alternatively, smaller, focused datasets may be used to evaluate novel descriptors using leave-one-out (LOO) cross-validation [13]. However, this method of benchmarking can be misleading and tends to rely heavily on the presence of specific geometries rather than more subtle properties [14]. To apply the most generalizable benchmark possible, we used nine HTS datasets curated from PubChem [11]. These datasets target various proteins including G-protein coupled receptors (GPCRs), kinases, and ion channels. The number of compounds in these datasets ranges from approximately 61,000 to 344,000. These datasets are detailed in Table 1.
Because each 3D descriptor tested can be weighted with a variety of atom properties, we used nine different atom properties with and without accessible van der Waals (VDW) surface area scaling. Accessible VDW surface area accounts for varying accessibility of different atoms in a molecule arising from overlapping and covered VDW surfaces. Additionally, we provide all models with a standard set of 1D descriptors to achieve a performance baseline that strengthens comparisons. All 1D molecule descriptors and atom properties used for weighting are outlined in Table 2.

2DA_Sign and 3DA_Sign: separating atom properties by sign

The most common method for weighting 2DA and 3DA is with the product of atom properties for each atom pair. For signed properties such as partial charge, information can be lost as the product of two negative values cannot be distinguished from the product of two positive values. To avoid this information loss, we modified the 2DA and 3DA descriptors to allocate atom pairs into one of three histograms depending on whether the atom properties are both negative, both positive, or opposite. These descriptors are called 2DA_Sign and 3DA_Sign and are designed specifically for signed properties such as partial charge since
Table 2 Properties used to train ANN models are categorized as scalar (one property per molecule) and atom (one property per atom)
Property Type Description Signed
unsigned properties will solely fill the positive–positive vector. Therefore, when testing the utility of 2DA_Sign and 3DA_Sign, we only apply these new descriptors with signed properties. All unsigned properties are included with standard 2DA or 3DA depending on the condition.
Models trained with signed properties encoded with 2DA_Sign outperformed models trained with standard 2DA for all properties across all datasets tested. The average performance as measured by the area under the logarithmic ROC curve (logAUC) was 0.335. Compared with the average standard 2DA logAUC of 0.295, using 2DA_Sign in place of 2DA for signed atom properties resulted in an increase in performance of approximately 13.8 %. Model performance across nine datasets is compared for 2DA and 2DA_Sign in Fig. 2.
As mentioned, 3DA_Sign was encoded with a larger distance step size than 3DA to ensure that the input vector lengths between the two conditions remained constant. Despite the lower resolution, 3DA_Sign improved model performance over standard 3DA in all datasets. Average model performance across nine datasets as measured by logAUC was 0.358 when applying signed properties with
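The sign-splitting idea behind 2DA_Sign and 3DA_Sign can be illustrated with a minimal Python sketch. This is not the BCL implementation: the function names are hypothetical, distances are assigned to hard bins, self pairs fall into the zero-distance bin, zero-valued properties are treated as positive, and the split histograms accumulate the absolute value of the property product. All of these are assumptions made only for illustration.

```python
def plain_autocorrelation(props, dist, n_bins, step=1.0):
    """Standard autocorrelation: sum of property products per distance bin."""
    hist = [0.0] * n_bins
    for i in range(len(props)):
        for j in range(i, len(props)):
            b = int(dist[i][j] / step)  # hard binning (assumption)
            if b < n_bins:
                hist[b] += props[i] * props[j]
    return hist


def sign_split_autocorrelation(props, dist, n_bins, step=1.0):
    """Allocate each atom pair to one of three histograms by sign pair."""
    hists = {
        "neg_neg": [0.0] * n_bins,
        "pos_pos": [0.0] * n_bins,
        "opposite": [0.0] * n_bins,
    }
    for i in range(len(props)):
        for j in range(i, len(props)):
            b = int(dist[i][j] / step)
            if b >= n_bins:
                continue  # pair lies beyond the cutoff
            if props[i] < 0 and props[j] < 0:
                key = "neg_neg"
            elif props[i] >= 0 and props[j] >= 0:
                key = "pos_pos"  # zeros counted as positive (assumption)
            else:
                key = "opposite"
            # magnitude of the product (assumption); the sign is carried by the key
            hists[key][b] += abs(props[i] * props[j])
    return hists


# Two toy "molecules": two atoms one bond apart, with equal and opposite charges.
bond_dist = [[0, 1], [1, 0]]
both_positive = [0.5, 0.5]
both_negative = [-0.5, -0.5]

print(plain_autocorrelation(both_positive, bond_dist, 2))
print(plain_autocorrelation(both_negative, bond_dist, 2))
print(sign_split_autocorrelation(both_positive, bond_dist, 2))
print(sign_split_autocorrelation(both_negative, bond_dist, 2))
```

In this toy case the plain descriptor is identical for the two molecules, while the sign-split version fills the positive-positive histogram for one and the negative-negative histogram for the other. For a 3DA-style descriptor the same functions could be called with Euclidean distances and a step such as 0.5 Å.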
Table 3 Percent of active compounds with formal charges varies across datasets
PubChem project   Percent neutral   Percent positive   Percent negative   Percent zwitterion
bioassay ID       actives           actives            actives            actives
1798              74                19                 5                  1
1843              55                42                 1                  2
2258              80                6                  11                 2
2689              65                13                 20                 2
435008            88                4                  5                  3
435034            39                59                 1                  0
463087            86                13                 0                  1
485290            47                2                  46                 5
488997            48                33                 6                  12
Active compounds across all datasets were analyzed for the presence of formal charges to ensure that target datasets with diverse formal charge preferences were tested due to the nature of the 2DA_Sign and 3DA_Sign descriptors. Active compounds are broken down into neutral, overall positive, overall negative, and zwitterion properties below
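The correlation coefficients reported in the text for these charge percentages are computed over nine datasets, so their significance can be sanity-checked from r and n alone. The following Python sketch assumes the usual t-test for a Pearson coefficient with n - 2 degrees of freedom; the source does not state which test was used, and the function names are hypothetical. The t distribution tail is integrated numerically so that only the standard library is needed.

```python
import math


def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def one_tailed_p(r, n, steps=2000):
    """One-tailed p-value for a Pearson r over n points (assumed t-test)."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the integral of the density from 0 to t (steps is even)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 - area


# r values quoted in the text, each over the nine datasets
for r in (0.10, 0.18, -0.24, 0.12, -0.67, 0.67):
    print(r, round(one_tailed_p(r, 9), 3))
```

Under these assumptions, only the coefficients of magnitude 0.67 fall below the 0.05 threshold, which agrees with the statements in the text.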
presence of formal charges within the active compounds and the performance increase for each dataset. When comparing the percentage of neutral actives within a dataset with the performance increase seen when using the improved autocorrelation descriptor over the traditional one, no significant correlation was found with 2DA_Sign (r = 0.10) or 3DA_Sign (r = 0.18). Additionally, no correlation was found between the performance increases with 2DA_Sign and the percentage of active compounds with overall positive charge (r = -0.24) or overall negative charge (r = 0.12). However, a moderate but significant negative correlation was found between the performance increase with 3DA_Sign and the percentage of actives with overall positive charge (r = -0.67, one-tailed p < 0.05). Additionally, a moderate but significant positive correlation was found between performance increase and negative charge (r = 0.67, one-tailed p < 0.05). This suggests a potential link between the formal charge demand of active compounds and the improved performance seen with using 3DA_Sign to separate the autocorrelation of signed atom properties. However, the opposite correlations with regard to positive and negative charge presence and the fact that this correlation is not reflected with 2DA_Sign make it difficult to predict the specific relationship. Additionally, the comparative performance increases seen with both primarily neutral and largely charged active compounds suggest that the improvements seen with these descriptors are independent of specific charge demands of the targets. All plots examining potential relationships between active compound formal charges and performance increases can be found in the supplemental information (supplemental figure S1: percentage neutral actives; supplemental figure S2: percentage positively charged actives; supplemental figure S3: percentage negatively charged actives; supplemental figure S4: percentage zwitterion actives).
Lastly, we tested a maximum atom pair distance limitation of 6.0 Å instead of 12.0 Å. Although 12.0 Å covers the maximum width of many small molecules, encoding longer atom pair distances can provide false information in cases of high molecular flexibility or rotation. A 6.0 Å limitation, on the other hand, focuses more on fragments within the molecule that are relatively invariant with respect to the arbitrary choice of conformer used to represent each molecule. Additionally, shorter distances can be sampled at a higher resolution without increasing input vector size. We found that limiting the maximum atom pair distance to 6.0 Å significantly increases performance across nine datasets by an average of 6.4 %. The fact that we see no performance increase when limiting 2DA_Sign to a distance cutoff of 5 bonds supports the conclusion that the increase in performance of a limited 3DA_Sign is linked to conformational flexibility. However, both results support the notion that 2DA and 3DA descriptors with maximum radius of 5 bonds and 6 Å, respectively, are sufficient to describe molecular structure for QSAR studies.
In conclusion, we present three recommendations for ANN-based QSAR descriptor selection: (1) Encoding signed properties in standard 2DA results in information loss that can significantly decrease model performance. Therefore, it is preferable to split unique sign pairs as with 2DA_Sign. (2) Similar information loss can be seen with standard 3DA and unique sign pairs should be split as with 3DA_Sign. (3) Limiting 3DAs to encode atom pairs up to 6.0 Å instead of 12.0 Å can significantly improve model performance.

Methods

HTS dataset curation

Nine datasets were used to evaluate descriptor performance. Specific details regarding all curation steps have
been previously described [11]. However, relevant details for all datasets have been summarized:
Datasets were selected for high-throughput screening assays that focused on a single well-defined biological target protein. Active compound sets contain only those hits that were verified in confirmatory assays and did not show cross-activity with other targets tested against. Additionally, only those sets with at least 150 confirmed active compounds were used and the final collection was designed to encompass a variety of pharmaceutically relevant target protein classes. Specifically, for dataset 1798, positive allosteric modulators of M1 were identified in the primary calcium flux assay AID 626. Only active compounds that were verified in confirmatory assay AID 1741 were kept for the dataset and those with cross activity with M4 (AID 1488) were removed. The final ratio of active to inactive compounds is 1:329. For 1843, inhibitors of the inward-rectifying potassium ion channel Kir2.1 were identified in the thallium flux assay AID 1672. Only active compounds verified in confirmatory assays AID 2032 and AID 463252 were kept and compounds showing non-specific activity in additional verification assays were removed. The final active to inactive ratio is 1:1751. For 2258, potentiators of the KCNQ2 potassium channel were identified in the primary thallium flux assay AID 2239. Only active compounds that were verified in confirmatory assay AID 2287 were kept and false positives or compounds showing non-specific effects in additional assays were removed. The final active to inactive ratio is 1:1418. For 2689, inhibitors of the serine/threonine kinase 33 were identified in the primary screen AID 2661. Only active compounds that were verified in confirmatory screen 2821 were kept and non-selective compounds identified in additional assays were removed. The final active to inactive ratio is 1:1858. For 435008, antagonists of the orexin 1 receptor were identified in three primary screens AID 485270, AID 463079, and AID 434989. Only active compounds verified in confirmatory screens AID 504701 and AID 504699 were kept for the active dataset. The final active to inactive ratio is 1:935. For 435034, negative allosteric modulators of the M1 receptor were identified in the primary calcium flux assay AID 628. Only active compounds verified in confirmatory assay AID 677 were kept and compounds that showed cross activity with the M4 receptor in assay AID 860 were removed. The final active to inactive ratio is 1:169. For 463087, inhibitors of the T-type calcium channel Cav3 were identified in the primary calcium flux assay AID 449839. Only actives that were verified in confirmatory screens were kept as the active compound dataset. The final active to inactive ratio is 1:142. For 485290, inhibitors of the tyrosyl-DNA phosphodiesterase 1 were identified in the primary screen 485290. Only active compounds verified in the confirmatory assay AID 489007 were kept as the active compound dataset. The final active to inactive ratio is 1:1213. For 488997, inhibitors of the choline transporter were identified in the primary screen AID 488975. Only active compounds verified in all three confirmatory assays AID 493221, AID 504840, and AID 588401 were kept and compounds showing non-specific activity in additional assays were removed. The final active to inactive ratio is 1:1198. Inactive compounds for all datasets represent those identified in the primary screens.

Generation of numerical descriptors for QSAR model creation

Numerical descriptors and QSAR models were generated and evaluated over nine HTS datasets detailed in Tables 1, 2 and 3. The curation of these datasets has been previously outlined [11]. 3D conformations of all small molecules were generated using the CORINA [25] software package. The BioChemical Library (BCL) software was used to generate all molecular descriptors tested in this study. All descriptors and atom properties used to weight 2DA and 3DA descriptors are described in Table 2. When weighting autocorrelation descriptors, all atom properties are represented with and without accessible surface area scaling.
Standard length 2DA and 2DA_Sign descriptors contained a cutoff distance of 11 bonds (12 values). Shortened 2DA_Sign descriptors contained a cutoff distance of five bonds (6 values). 3DA descriptors tested at a 12.0 Å cutoff were calculated for a step size of 0.167 Å (72 total values). 3DA descriptors tested at a 6.0 Å cutoff were calculated for a step size of 0.084 Å (72 total values). 3DA_Sign descriptors tested at 12.0 Å cutoff were calculated for a step size of 0.5 Å (3 × 24 = 72 total values). 3DA_Sign descriptors tested at 6.0 Å cutoff were calculated for a step size of 0.25 Å (3 × 24 = 72 total values).

Artificial neural network model architecture and training

All ANN models were trained using back propagation and a sigmoid transfer function with a simple weight update of η = 0.05 and α = 0.5, a hidden layer of 32 neurons, 0.1 visible neuron dropout, and 0.5 hidden neuron dropout. Each dataset was divided into two sets of compounds: compounds used to train the model (training) and compounds kept hidden from the model during training to evaluate predictability after training has completed (independent). Five-fold cross-validation was used where 20 individual ANN models were trained for each HTS dataset by rotating which compounds appeared in the training and independent sets. Final active or inactive prediction for each independent compound was taken as a consensus across models for which that compound appeared in the independent set. The objective function used during training was the area under the logarithmic receiver operating
characteristic (ROC) curve [26, 27] (logAUC [28]) between false positive rates of 0.001 and 0.1.

ANN model performance evaluation

All models were evaluated with the same objective function used for training. ROC curves with a logarithmic x-axis were generated for consensus predictions sorted by predicted activity and the area under the curve was calculated for the range of 0.001–0.1 (the top 10 % of predicted compound activities). For all statistical comparisons, two-tailed paired t tests were performed between descriptor conditions across the nine HTS datasets.

Figures and artwork

All graphs were generated with Microsoft Excel 2007. Molecule structures were generated with Molecular Operating Environment (MOE, Chemical Computing Group).

Acknowledgments Work in the Meiler laboratory is supported through NIH (R01 GM080403, R01 GM099842, R01 DK097376, R01 HL122010, R01 GM073151, U19 AI117905) and NSF (CHE 1305874).

References

1. Sliwoski G, Kothiwale S, Meiler J, Lowe EW Jr (2014) Computational methods in drug discovery. Pharmacol Rev 66(1):334–395. doi:10.1124/pr.112.007336
2. Salt DW, Yildiz N, Livingstone DJ, Tinsley CJ (1992) The use of artificial neural networks in QSAR. Pestic Sci 36(2):161–170. doi:10.1002/ps.2780360212
3. Butkiewicz M, Lowe EW, Meiler J (2012) Bcl::ChemInfo—qualitative analysis of machine learning models for activation of HSD involved in Alzheimer's Disease. In: Computational intelligence in bioinformatics and computational biology (CIBCB), 2012 IEEE symposium on, 9–12 May 2012, pp 329–334. doi:10.1109/cibcb.2012.6217248
4. Trinajstić N (1992) Chemical graph theory. In: Mathematical chemistry series, 2nd edn. CRC Press, Boca Raton
5. Balaban AT (1998) Topological and stereochemical molecular descriptors for databases useful in QSAR, similarity/dissimilarity and drug design. SAR QSAR Environ Res 8(1–2):1–21. doi:10.1080/10629369808033259
6. Hemmer MC, Steinhauer V, Gasteiger J (1999) Deriving the 3D structure of organic molecules from their infrared spectra. Vib Spectrosc 19(1):151–164. doi:10.1016/S0924-2031(99)00014-4
7. Broto P, Moreau G, Vandycke C (1984) Molecular structures: perception, autocorrelation descriptor and SAR studies. Perception of molecules: topological structure and 3-dimensional structure. Eur J Med Chem 19(1):61–65
8. Hopfinger AJ, Wang S, Tokarski JS, Jin B, Albuquerque M, Madhav PJ, Duraiswami C (1997) Construction of 3D-QSAR models using the 4D-QSAR analysis formalism. J Am Chem Soc 119(43):10509–10524. doi:10.1021/ja9718937
9. Shahlaei M (2013) Descriptor selection methods in quantitative structure–activity relationship studies: a review study. Chem Rev 113(10):8093–8103. doi:10.1021/cr3004339
10. Moreau G, Broto P (1980) The auto-correlation of a topological-structure—a new molecular descriptor. Nouv J Chim 4(6):359–360
11. Butkiewicz M, Lowe EW Jr, Mueller R, Mendenhall JL, Teixeira PL, Weaver CD, Meiler J (2013) Benchmarking ligand-based virtual high-throughput screening with the PubChem database. Molecules 18(1):735–756. doi:10.3390/molecules18010735
12. Kubinyi H, Folkers G, Martin YC (1998) 3D QSAR in drug design. Qdsar, vol 2. Kluwer, Dordrecht
13. Kiralj R, Ferreira MMC (2009) Basic validation procedures for regression models in QSAR and QSPR studies: theory and application. J Braz Chem Soc 20:770–787
14. Manchester J, Czermiński R (2009) CAUTION: popular "Benchmark" data sets do not distinguish the merits of 3D QSAR methods. J Chem Inf Model 49(6):1449–1454. doi:10.1021/ci9000508
15. Gasteiger J, Marsili M (1978) A new model for calculating atomic charges in molecules. Tetrahedron Lett 19(34):3181–3184. doi:10.1016/S0040-4039(01)94977-9
16. Gasteiger J, Marsili M (1980) Iterative partial equalization of orbital electronegativity—a rapid access to atomic charges. Tetrahedron 36(22):3219–3228. doi:10.1016/0040-4020(80)80168-2
17. Guillen MD, Gasteiger J (1983) Extension of the method of iterative partial equalization of orbital electronegativity to small ring systems. Tetrahedron 39(8):1331–1335. doi:10.1016/S0040-4020(01)91901-5
18. Bauerschmidt S, Gasteiger J (1997) Overcoming the limitations of a connection table description: a universal representation of chemical species. J Chem Inf Comput Sci 37(4):705–714
19. Streitwieser A (1961) Molecular orbital theory for organic chemists. Wiley, New York
20. Gasteiger J, Saller H (1985) Calculation of the charge distribution in conjugated systems by a quantification of the resonance concept. Angew Chem Int Ed Engl 24(8):687–689. doi:10.1002/anie.198506871
21. Gilson MK, Gilson HS, Potter MJ (2003) Fast assignment of accurate partial atomic charges: an electronegativity equalization method that accounts for alternate resonance forms. J Chem Inf Comput Sci 43(6):1982–1997
22. Gasteiger J, Hutchings MG (1983) New empirical models of substituent polarisability and their application to stabilisation effects in positively charged species. Tetrahedron Lett 24(25):2537–2540
23. Gasteiger J, Hutchings MG (1984) Quantitative models of gas-phase proton-transfer reactions involving alcohols, ethers, and their thio analogs. Correlation analyses based on residual electronegativity and effective polarizability. J Am Chem Soc 106(22):6489–6495. doi:10.1021/ja00334a006
24. Miller KJ (1990) Additivity methods in molecular polarizability. J Am Chem Soc 112(23):8533–8542. doi:10.1021/ja00179a044
25. Sadowski J, Gasteiger J (1993) From atoms and bonds to three-dimensional atomic coordinates: automatic model builders. Chem Rev 93(7):2567–2581. doi:10.1021/cr00023a012
26. Cleves AE, Jain AN (2006) Robust ligand-based modeling of the biological targets of known drugs. J Med Chem 49(10):2921–2938. doi:10.1021/Jm051139t
27. Hristozov DP, Oprea TI, Gasteiger J (2007) Virtual screening applications: a study of ligand-based methods and different structure representations in four different scenarios. J Comput Aided Mol Des 21(10–11):617–640. doi:10.1007/s10822-007-9145-8
28. Clark RD, Webster-Clark DJ (2008) Managing bias in ROC curves. J Comput Aided Mol Des 22(3–4):141–146. doi:10.1007/s10822-008-9181-z
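The logAUC measure used in the Methods as both training objective and evaluation metric can be sketched in Python as follows. This is an illustrative reading, not the BCL code: the function names are hypothetical, the ROC curve is treated as a step function, tied scores are not handled specially, and the area is divided by the width of the logarithmic window (two decades between false positive rates of 0.001 and 0.1) so that a perfect ranking scores 1. That normalization is an assumption.

```python
import math


def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate), best score first."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def tpr_at(points, fpr):
    """Highest true positive rate reached without exceeding the given FPR."""
    return max(t for f, t in points if f <= fpr)


def log_auc(scores, labels, lo=0.001, hi=0.1, n_grid=200):
    """Area under the ROC curve on a log10 FPR axis, normalized to [0, 1]."""
    points = roc_points(scores, labels)
    a, b = math.log10(lo), math.log10(hi)
    xs = [a + (b - a) * k / n_grid for k in range(n_grid + 1)]
    ys = [tpr_at(points, 10 ** x) for x in xs]
    # trapezoid rule in log10(FPR) space
    area = sum((ys[k] + ys[k + 1]) / 2 * (xs[k + 1] - xs[k]) for k in range(n_grid))
    return area / (b - a)


# Toy example: 5 actives among 1000 inactives, ranked perfectly and then reversed.
labels = [1] * 5 + [0] * 1000
perfect = [1.0] * 5 + [0.0] * 1000
reversed_scores = [0.0] * 5 + [1.0] * 1000
print(log_auc(perfect, labels), log_auc(reversed_scores, labels))
```

Because the window covers only low false positive rates, the score rewards models that rank actives among their most confident predictions, which matches the motivation given for the logarithmic x-axis.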