Chemical Library Design (Methods in Molec Bio 685) - J. Zhou (Humana, 2011) BBS
Edited by
Editor
Joe Zhongxiang Zhou
Department of Pharmacology
University of California
La Jolla, CA 92093, USA
[email protected]
ISSN 1064-3745
e-ISSN 1940-6029
ISBN 978-1-60761-930-7
e-ISBN 978-1-60761-931-4
DOI 10.1007/978-1-60761-931-4
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010937983
Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of
the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013,
USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified
as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Humana Press is part of Springer Science+Business Media (www.springer.com)
Preface
Over the last two decades we have seen a dramatic change in the drug discovery process
brought about by chemical library technologies and high-throughput screening, along
with other equally remarkable advances in biomedical research. Though still evolving,
chemical library technologies have become an integral part of the core drug discovery
technologies. This volume primarily focuses on the design aspects of the chemical library
technologies. Library design is a process of selecting useful compounds from a potentially
very large pool of synthesizable candidates. For drug discovery, the selected compounds
have to be biologically relevant. Given the enormous number of compounds accessible
to the contemporary synthesis and purification technologies, powerful tools are indispensable for uncovering those few useful ones. This book includes chapters on historical
overviews, state-of-the-art methodologies, practical software tools, and successful applications of chemical library design written by the best expert practitioners.
The book is divided into five sections. Section I covers general topics. Chapter 1 highlights the key events in the history of high-throughput chemistry and offers a historical
perspective on the design of screening, targeted, and optimization libraries. Chapter 2 is
a short introduction to the basics of chemoinformatics necessary for library design. Chapter 3 describes a practical algorithm for multiobjective library design. Chapter 4 discusses
a scalable approach to designing lead generation libraries that emphasize both diversity
and representativeness along with other objectives. Chapter 5 explains how Free-Wilson
selectivity analysis can be used to aid combinatorial library design. Chapter 6 shows how
predictive QSAR and shape pharmacophore models can be successfully applied to targeted library design. Chapter 7 describes a combinatorial library design method based
on reagent pharmacophore fingerprints to achieve optimal coverage of pharmacophoric
features for a given scaffold.
Three chapters in Section II focus on the methods and applications of structure-based
library design. Chapter 8 reviews the docking methods for structure-based library design.
Chapters 9 and 10 contain two detailed protocols illustrating how to apply structure-based library design to the successful optimization of lead matter in real drug discovery projects.
Section III consists of three chapters on fragment-based library design. Chapter 11
describes the key factors that define a good fragment library for successful fragment-based
drug discovery. It also provides a summary view of the fragment libraries published so far
by various pharmaceutical companies. Chapter 12 shows how a fragment library is used
in fragment-based drug design. Chapter 13 introduces a new chemical structure mining
method that searches into a huge virtual library of combinatorial origin. The method uses
fragmental (or partial) mappings between the query structure and the target molecules in
its initial search algorithms.
Chapter 14 in Section IV describes a workflow for designing a kinase targeted library.
It illustrates how to assemble a lead generation library for a target family using known
ligand-target family interaction data from various sources.
Section V contains four chapters on library design tools. PGVL Hub described
in Chapter 15 is an integrated desktop tool for molecular design including library
design. It streamlines the design workflow from product structure formation to property
calculations, to filtering, to interfaces with other software tools, and to library production
management. An application of PGVL Hub to the optimization of human CHK1 kinase
inhibitors is presented in Chapter 16. Chapter 17 is a detailed protocol on how to use
the library design tool GLARE to perform product-oriented design of combinatorial libraries.
Finally, Chapter 18 is a detailed protocol on how to use the library design tool CLEVER
to perform library design and visualization.
Joe Zhongxiang Zhou
Contents
Preface
Contributors
SECTION I GENERAL TOPICS (Chapters 1-7)
SECTION II (Chapters 8-10)
SECTION III (Chapters 11-13)
13. LEAP into the Pfizer Global Virtual Library (PGVL) Space: Creation of Readily Synthesizable Design Ideas Automatically
Qiyue Hu, Zhengwei Peng, Jaroslav Kostrowicki, and Atsuo Kuki
SECTION IV (Chapter 14)
SECTION V (Chapters 15-18)
Contributors
CAROLYN BECK Department of Industrial and Enterprise Systems Engineering,
University of Illinois at Urbana-Champaign, Urbana, IL, USA
PAUL H. BERNARDO Institute of Chemical and Engineering Sciences, Singapore,
Singapore
NIKLAS BLOMBERG DECS GCS Computational Chemistry, AstraZeneca R&D
Mölndal, Mölndal, Sweden
CLAUDIO N. CAVASOTTO School of Biomedical Informatics, The University of Texas
Health Science Center at Houston, Houston, TX, USA
CHRISTINA L.L. CHAI Institute of Chemical and Engineering Sciences, Singapore,
Singapore
BO CHAO PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
HONGMING CHEN DECS GCS Computational Chemistry, AstraZeneca R&D
Mölndal, Mölndal, Sweden
JOHN CLARK PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
BOB CONER PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
JASON B. CROSS Cubist Pharmaceuticals, Inc., Lexington, MA, USA
ROLAND E. DOLLE Department of Chemistry, Adolor Corporation, Exton, PA, USA
KLAUS DRESS Oncology Medicinal Chemistry, La Jolla Laboratories, Pfizer Inc., San
Diego, CA, USA
JERRY O. EBALUNODE Department of Pharmaceutical Sciences, BRITE Institute,
North Carolina Central University, Durham, NC, USA
JEFF ELLERAAS Oncology Medicinal Chemistry, La Jolla Laboratories, Pfizer Inc., San
Diego, CA, USA
OLA ENGKVIST DECS GCS Computational Chemistry, AstraZeneca R&D Mölndal,
Mölndal, Sweden
ERIC FEYFANT Pfizer Global R&D, Cambridge, MA, USA
QIYUE HU Pfizer Global Research and Development, La Jolla Laboratories, San Diego,
CA, USA
BUWEN HUANG Oncology Medicinal Chemistry, La Jolla Laboratories, Pfizer Inc., San
Diego, CA, USA
SHOGO ITO PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
THERESA L. JOHNSON Pfizer Research Technology Center, Cambridge, MA, USA
CHRISTOS C. KANNAS Department of Computer Science, University of Cyprus, Nicosia,
Cyprus; Noesis Chemoinformatics, Nicosia, Cyprus
DAVID KLATTE PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
JAMES KONG PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
JAROSLAV KOSTROWICKI Pfizer Global Research and Development, La Jolla
Laboratories, San Diego, CA, USA
ATSUO KUKI Pfizer Global Research and Development, La Jolla Laboratories, San Diego,
CA, USA
TZE HAU LAM Data Mining Department, Institute for Infocomm Research, Singapore,
Singapore
ELIZABETH A. LUNNEY PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
SARATHY MATTAPARTI PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
JAMES NA Pfizer Global Research and Development, La Jolla Laboratories, San Diego,
CA, USA
CHRISTOS A. NICOLAOU Noesis Chemoinformatics, Nicosia, Cyprus
GENEVIEVE D. PADERES Cancer Crystallography & Computational Chemistry, La
Jolla Laboratories, Pfizer Inc., San Diego, CA, USA
KEVIN PARIS Pfizer Global R&D, Cambridge, MA, USA
TOM PAULY Oncology Medicinal Chemistry, La Jolla Laboratories, Pfizer Inc., San
Diego, CA, USA
ZHENGWEI PENG Pfizer Global Research and Development, La Jolla Laboratories, San
Diego, CA, USA
SHARANGDHAR S. PHATAK School of Biomedical Informatics, The University of Texas
Health Science Center at Houston, Houston, TX, USA
PAUL A. REJTO Oncology, La Jolla Laboratories, Pfizer Inc., San Diego, CA, USA
SRINIVASA SALAPAKA Department of Mechanical Science and Engineering, University
of Illinois at Urbana-Champaign, Urbana, IL, USA
SIMONE SCIABOLA Pfizer Research Technology Center, Cambridge, MA, USA
NUNZIO SCIAMMETTA PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
ROBERT SELLIAH Drug Design Consulting, Irvine, CA, USA
PUNEET SHARMA Integrated Data Systems Department, Siemens Corporate Research,
Princeton, NJ, USA
THOM SHULOK PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
ROBERT V. STANTON Pfizer Research Technology Center, Cambridge, MA, USA
THOMAS THACHER PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
JOO CHUAN TONG Data Mining Department, Institute for Infocomm Research,
Singapore, Singapore; Department of Biochemistry, Yong Loo Lin School of Medicine,
National University of Singapore, Singapore, Singapore
ALEXANDER TROPSHA Laboratory for Molecular Modeling and Carolina Center
for Exploratory Cheminformatics Research, School of Pharmacy, University of North
Carolina at Chapel Hill, Chapel Hill, NC, USA
JEAN-FRANÇOIS TRUCHON Chemical Modeling and Informatics, Merck Frosst
Canada, Kirkland, QC, Canada
DÉSIRÉE H.H. TSAO Pfizer Global R&D, Cambridge, MA, USA
CHRIS WALLER PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
HUALIN XI Pfizer Research Technology Center, Cambridge, MA, USA
SHUNQI YAN Drug Design Consulting, Irvine, CA, USA
BO YANG PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA
WEIFAN ZHENG Department of Pharmaceutical Sciences, BRITE Institute, North
Carolina Central University, Durham, NC, USA
JOE ZHONGXIANG ZHOU PGRD-La Jolla, Pfizer Inc., San Diego, CA, USA;
Department of Pharmacology, University of California, San Diego, CA, USA
Section I
General Topics
Chapter 1
Historical Overview of Chemical Library Design
Roland E. Dolle
Abstract
High-throughput chemistry (HTC) is approaching its 20-year anniversary. Since 1992, some 5,000
chemical libraries, prepared for the purpose of biological investigation and drug discovery, have been
published in the scientific literature. This review highlights the key events in the history of HTC with
emphasis on library design. A historical perspective on the design of screening, targeted, and optimization libraries and their application is presented. Design strategies pioneered in the 1990s remain viable in
the twenty-first century.
Key words: High-throughput chemistry, chemical library, random library, targeted library,
optimization library, library design, biological activity, drug discovery.
1. Milestones in High-Throughput Chemistry
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_1, Springer Science+Business Media, LLC 2011
Fig. 1.1. Time chart showing selected events in the history of HTC. Key: (a) Affymax is the first combinatorial chemistry company to go public. (b) Ellman's solid-phase parallel synthesis of benzodiazepines fuels HTC. (c) Parke-Davis introduces Diversomer®, an apparatus for solid-phase synthesis of small molecules. (d) Pharmacopeia licenses Columbia University's encoded split synthesis technology and the company goes public a year later (NASDAQ symbol: PCOP). (e) ArQule goes public (NASDAQ symbol: ARQL) with its industrialized solution-phase synthesis of discrete purified compounds. (f) IRORI introduces radio frequency (Rf) encoding technology for solid-phase synthesis in cans containing reusable Rf chips. (g) Glaxo Wellcome buys Affymax for $539 M in cash. (h) Lipinski publishes a landmark correlation of physicochemical properties of drugs; the Rule of 5 (Ro5) has a profound impact on library design. (i) 1992-1996: 80% of published libraries are from industry; 75% use solid-phase synthesis. (j) Pharmacopeia generates 6 M encoded compounds. (k) ArQule has the largest number of collaborations (27) reported for a combichem company. (l) Inaugural issue of Molecular Diversity, the first journal dedicated to HTC. (m) SAR by NMR: compounds binding to proximal subsites of a protein are linked and optimized using HTC. (n) Agouron Pharm. moves a human rhinovirus 3C protease inhibitor into clinical trials; HTC played a key role in its discovery. (o) S. Schreiber introduces the concept of chemical genetics and diversity-oriented synthesis (DOS). (p) A. Czarnik becomes editor of a new ACS journal: Journal of Combinatorial Chemistry. (q) Academia overtakes industry in library synthesis publications for the first time. (r) Human genome sequence is published in Science. (s) Dynamic combinatorial chemistry. (t) First Gordon Research Conference entitled combinatorial chemistry: High Throughput Chemistry & Chemical Biology. (u) D. Curran develops fluorous reagents and tags and launches Fluorous Technology Inc. (FTI). (v) DNA-templated synthesis. (w) Solution-phase overtakes solid-phase in library synthesis. (x) Microwave-assisted synthesis gains momentum in HTC. (y) ChemBank public database established. (z) First reports of fragment-based drug discovery. (aa) NIH Roadmap defined. NIH funds the Chemical Genomics Center and Molecular Library Initiative, establishing 10 chemical methodology and library design centers throughout the US. (ab) Broad Institute established, furthers application of DOS in chemical biology. (ac) Flow-through synthesis for HTC gains in popularity. (ad) Of the 497 library publications reported in 2008, 90% originated from academic labs; >80% were made by solution-phase chemistry. (ae) HTC Gordon Research Conference celebrates tenth anniversary and revises conference title: High Throughput Chemistry & Chemical Biology.
Fig. 1.2. One of the first nonpeptide library syntheses (reprinted (adapted or in part) with permission from Journal of
the American Chemical Society. Copyright 1992 American Chemical Society).
Fig. 1.3. One of the first devices for HTC (copyright (1993) National Academy of
Sciences, USA).
Table 1.1
ArQule collaborations 1996-1997
Pharmaceutical companies
Abbott Laboratories
ACADIA Pharmaceuticals
Fibrogen
Monsanto Company
Aurora Biosciences
Genome Therapeutics
Pharmacia Biotech AB
GenQuest
Roche Biosciences
Genzyme
DGI Biotechnologies
Immunex Corp.
ICAgen, Inc.
Ontogeny
Ribogene
Sankyo Company
Sepracor, Inc.
SUGEN, Inc.
ViroPharma
2. Historical Library Designs
The objective of creating a chemical library for drug discovery,
regardless of its size or method of synthesis, is to supply biologically active compounds. For the purpose of this text, chemical
libraries can be classified into one of two categories: screening
libraries and optimization libraries. The screening library category is further subdivided into (a) random libraries, collections
with a unique design theme that has a distant, if any, relation to
known biologically active agents, and (b) targeted libraries where
the link with other biologically active structures is clearly evident.
Targeted libraries generally contain a known pharmacophore, i.e.,
a set of structural features in a molecule that is recognized at
the molecular target (enzyme, receptor, etc.) and is responsible
for that molecule's biological activity (9). They may also contain
structural scaffolds that interact with a variety of molecular targets, commonly referred to as privileged scaffolds (10). Optimization libraries, on the other hand, function primarily to enhance
the biological activity of an existing lead. Potency, selectivity, and
metabolic stability are examples of deficits in leads which can be
addressed using optimization libraries. The term lead is defined
here as a biologically active molecule that has emerged from a
high-throughput screen or has been reported in the scientific or patent
literature.
2.1. Historical Designs: Random Screening Libraries
[Schemes not recoverable from the source; captions missing. Surviving labels: peptide sublibrary formats H-O1-X-X-X-NH2, H-X-O2-X-X-NH2, H-X-X-O3-X-NH2, and H-X-X-X-O4-NH2; transformations: reduction, cleavage, (COIm)2; H-phe-phe-nle-arg-NH2, Ki = 1.2 (unit missing in source; selective kappa opioid agonist); H-Tyr-tyr-Gly-Trp-NH2, Ki = 3.0 nM (selective delta opioid agonist); H-Tyr-nve-Gly-Nal-NH2, Ki = 0.4 nM (selective mu opioid agonist); FE 2000665, H-phe-phe-nle-arg-NHCH2-(4-pyridyl), Ki = 0.24 nM (kappa opioid agonist), Ki (mu) = 4,050 nM, Ki (delta) = 20,300 nM.]
[Scheme for Fig. 1.6 not recoverable from the source. Surviving labels: REM resin; 4 commercial and 9 custom amino phenol inputs (R = 3-OH, 4-OH, 5-OH); acylation, sulfonylation, or carbamoylation then cleavage; Mitsunobu then cleavage; alkylation then cleavage; triflation and Suzuki coupling then cleavage. An inset table compared the physicochemical properties (MW, ClogP, number of H-bond donors, number of H-bond acceptors, number of rotatable bonds) of >75% of the library members with the Lipinski Ro5 limits; the cell values cannot be reliably assigned.]
Fig. 1.6. Nonoligomeric library (reprinted (adapted or in part) with permission from
Journal of the American Chemical Society. Copyright 2002 American Chemical Society).
versus oligomeric libraries is the control over design and physicochemical properties. In the example of Fig. 1.6, >75% of the
library members fell well within the Ro5 and successfully targeted
central nervous system (CNS) property space. Screening the
library against a variety of biological targets revealed a ca. 1 μM
lead against the glycine-2 transporter.
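The kind of Ro5 check referred to above can be sketched in a few lines of Python. This is only an illustrative sketch: the compound names and property values are hypothetical, the properties are assumed to have been computed elsewhere, and the thresholds are the commonly quoted Lipinski limits (MW > 500, ClogP > 5, more than 5 H-bond donors, more than 10 H-bond acceptors each count as a violation). It is not the procedure used for the library in Fig. 1.6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Properties:
    """Precomputed physicochemical properties of one library member."""
    mw: float     # molecular weight
    clogp: float  # calculated logP
    hbd: int      # number of H-bond donors
    hba: int      # number of H-bond acceptors


def ro5_violations(p: Properties) -> int:
    """Count how many of the four Lipinski limits a compound exceeds."""
    return sum((p.mw > 500, p.clogp > 5, p.hbd > 5, p.hba > 10))


def fraction_compliant(library: dict, max_violations: int = 0) -> float:
    """Fraction of library members with at most `max_violations` violations."""
    passed = [name for name, p in library.items()
              if ro5_violations(p) <= max_violations]
    return len(passed) / len(library)


# Hypothetical library members; the values are made up for illustration.
LIBRARY = {
    "cmpd-1": Properties(mw=342.4, clogp=3.1, hbd=1, hba=4),
    "cmpd-2": Properties(mw=418.5, clogp=4.6, hbd=2, hba=5),
    "cmpd-3": Properties(mw=521.7, clogp=5.8, hbd=1, hba=6),
    "cmpd-4": Properties(mw=296.3, clogp=2.2, hbd=0, hba=3),
}

if __name__ == "__main__":
    for name, props in LIBRARY.items():
        print(name, ro5_violations(props))
    print("fraction with no violations:", fraction_compliant(LIBRARY))
```

In a design setting such a filter would typically be applied to the enumerated virtual products before reagents are selected, so that the fraction of compliant members becomes a design objective rather than an after-the-fact statistic.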
Diversity-oriented synthesis (DOS) libraries. DOS libraries are
a special class of nonpolymeric libraries distinguished by their
synthetic design. Emphasis is placed on complexity-generating
reactions to drive structural complexity in combination with
branching pathways to drive structural diversity (Fig. 1.7) (17).
A single library will contain multiple stereochemically rich molecular frameworks incorporating multiple building blocks and
functional groups. Less emphasis is placed on physicochemical
properties. They are intended for application in chemical biology. Originally prepared using encoded split-pool synthesis, DOS
libraries are now prepared as discrete compounds on multimilligram scale. Build-couple-pair is the current paradigm for
constructing DOS libraries (18).
[Scheme for Fig. 1.7 not recoverable from the source. Surviving labels: Achmatowicz reaction (1260 members); amino alcohols; (2-bromo)bromomethyl aryls.]
Fig. 1.7. Diversity-oriented synthesis (DOS) libraries (reprinted (adapted or in part) with permission from Journal of
the American Chemical Society. Copyright 2005 American Chemical Society).
2.2. Historical Designs: Targeted Screening Libraries
[Scaffold drawings not recoverable from the source; caption missing. Surviving labels: biaryl, indole, spirocycles, benzopyran, 1,4-benzodiazepinone, arylpiperazine, purine, benzhydryl, diarylethyl.]
[Scheme not recoverable from the source; caption missing. Surviving labels: Example 1: library design; 4 x dienophiles; metal salt (1,3-dipolar cycloaddition reaction); 3 x mercaptoacyl chlorides; deprotection; cleavage; angiotensin converting enzyme (ACE), Ki = 0.16 nM (purified diastereomer); ACE, Ki > 100 nM (purified diastereomer); matrix metalloprotease-1 (MMP-1), IC50 = 50 nM.]
[Scheme for Fig. 1.10 not recoverable from the source. Surviving labels: library design; inputs: 10 acids (R4-CO2H), 2 statines, 63 amino acids, 40 amines; 25,200 members; Ki = 15 nM (plm II) and 140 nM (cat D); Ki = 29 nM (plm II) and 44 nM (cat D).]
Fig. 1.10. Statine pharmacophore library targeting aspartic acid proteases (reprinted (adapted or in part) with permission from Journal of the American Chemical Society. Copyright 2001 American Chemical Society).
activity at the two enzymes were identified as well as agents showing up to 75-fold selectivity for plm II versus cat D.
Affymax's thiolacyl library (Fig. 1.9) and Pharmacopeia's
statine library (Fig. 1.10) are pharmacophore-based libraries;
however, their designs differ. In the former library, a
pool of advanced library intermediates is derivatized with
the pharmacophore (thiolacylation) as the final step in library
construction, while in the latter library the pharmacophore
(statine) is derivatized with synthons as part of library
construction.
2-Aryl indole as a G-protein-coupled receptor (GPCR) privileged scaffold. The indole ring is a premier example of a privileged scaffold. The heterocycle is present in a profusion of medicinally important natural products and pharmaceutical substances,
and it is associated with an extraordinary manifold of biological
[Scheme and assay table for Fig. 1.11 not recoverable from the source. Surviving labels: Kenner safety-catch resin; DIC, THF/DCM; ArNHNH2, then ZnCl2/HOAc, then archive and mix/split; C6F5CH2OH, DIAD, Ph3P, then amines (Z-subunits), then amine scavenge; BH3-DMS in dioxane at 50 °C, then HCl/MeOH at 50 °C. Assay panel (concentration, μM): 5-HT6 (5), MCR-4 (2), 5-HT2a (0.1), GnRH (1), NPY5 (2), CCR5 (8), NK1 (1); the per-compound values cannot be reliably assigned. Reported Ki values: 5-HT2a, 10 nM; MCR-4, 612 nM; 5-HT6, 0.7 nM; NPY5, 0.8 nM; GnRH, 52 nM; NK1, 0.8 nM; CCR5, 1190 nM.]
Fig. 1.11. Privileged GPCR pharmacophore library (reprinted (adapted or in part) with permission from Journal of
the American Chemical Society. Copyright 2003 American Chemical Society).
[Scheme not recoverable from the source; caption missing. Surviving labels: library design; custom inputs; iPr2EtN, nBuOH, 80 °C; resin cleavage; biological activity: CDK1, IC50 = 28 nM; CDK2, IC50 = 33 nM; CDK2-cyclin A, IC50 = 6 nM; estrogen sulfotransferase, IC50 = 500 nM.]
applied to structurally related halogenated heterocycles expanding chemotype diversity beyond the purine scaffold. Approximately 45,000 substituted purines and related derivatives were
synthesized in total. Screening the library afforded potent cyclin-dependent kinase-1 (CDK1; IC50 = 28 nM) and CDK2 (IC50 =
6 nM) inhibitors. These kinases utilize ATP to phosphorylate proteins on serine and threonine amino acid residues, regulating cell
division. Inhibitors of estrogen sulfotransferase (IC50 = 500 nM)
(24) and enzymes involved in cell regeneration were found (25).
Estrogen sulfotransferase catalyzes the transfer of a sulfuryl group
from 3′-phosphoadenosine 5′-phosphosulfate (PAPS) to estrogen,
regulating hormone homeostasis.
2.3. Historical Designs: Optimization Libraries
[Scheme for Fig. 1.13 not recoverable from the source. Surviving labels: lead enhancement; lead, kobs/I = 280,000 M-1 s-1; optimization library (RCOCl inputs); ca. 500 member optimization library to discover a replacement for the metabolically labile N-benzylthiocarbamate in the lead inhibitor; advanced lead, kobs/I = 260,000 M-1 s-1; AG7088, clinical candidate, kobs/I = 1,470,000 M-1 s-1.]
Fig. 1.13. Optimization library for human rhinovirus 3C protease (reprinted (adapted or in part) with permission from
Journal of the American Chemical Society. Copyright 2001 American Chemical Society).
[Scheme for Fig. 1.14 not recoverable from the source; the receptor-subtype symbols attached to the Ki values were lost in extraction. Surviving labels: library design; (+)-enantiomer as starting material; steps (i) Boc-Aa, (ii) TFA, (iii) BH3-Me2S, then R3COOH; selective kappa antagonists?; lead structure, Ki = 0.74 nM and Ki = 322 nM; library of 288 members; screening results: Ki = 7 nM with selectivity ratios of 57 and >824 (functional antagonist); no binding; Ki = 54 nM and Ki = 10 nM.]
Fig. 1.14. Kappa (κ) opioid receptor antagonist optimization library (reprinted (adapted or in part) with permission
from Journal of the American Chemical Society. Copyright 1999 American Chemical Society).
[Scheme for Fig. 1.15 not recoverable from the source. Surviving labels: dual approach to library design; combinatorial optimization strategy with a ca. 1000 member library; Part 1: amine + R-Ph-NCO, screening; Part 2: amine + 4-Me-Ph-NCO, screening; IC50 = 1700 nM; IC50 > 25,000 nM; advanced lead, IC50 = 54 nM (raf kinase) and IC50 = 360 nM (p38 MAP kinase).]
Fig. 1.15. Contrasting raf kinase inhibitor optimization strategies (reprinted (adapted
or in part) with permission from Journal of the American Chemical Society. Copyright
2002 American Chemical Society).
both the inhibitory potency and selectivity of the urea against raf
kinase. A two-part sequential optimization strategy was devised.
In part one, coupling conservatively altered 3-aminothienyls with
phenyl-substituted isocyanates was carried out. A ca. 10-fold
improvement in activity over the original lead was obtained with
a 4-methyl group in the phenyl ring. In part 2, the optimized
3. Summary
HTC originated in the early 1990s in response to unprecedented
access to molecular targets, advances in high-throughput screening technology, and the demand for new chemical compound
collections. Approaching two decades of application, there are
over 5,000 chemical libraries reported in the literature (8). Initial
design strategies based on oligomeric and nonoligomeric libraries
with multiple (>3) points of diversity have progressed toward
more carefully crafted molecules with attention paid to physicochemical and toxiphoric properties. Today, library compounds
are typically synthesized on a milligram scale (10-100 mg), purified, and evaluated not only against the primary target but also in
selectivity assays including (a) in vitro drug metabolism pharmacokinetic (DMPK) assays which measure a compound's metabolic
stability and interaction with cytochrome P450 metabolizing
enzymes, and (b) ion channels associated with cardiac function. Libraries are being used to generate multiple SARs to efficiently identify and simultaneously address compound liabilities.
Library designs incorporating pharmacophores (19, 21) and privileged structures (22, 23) have historically been successful in lead
finding. New chemotypes are needed to investigate previously
Chapter 2
Chemoinformatics and Library Design
Joe Zhongxiang Zhou
Abstract
This chapter provides a brief overview of chemoinformatics and its applications to chemical library design.
It is meant to be a quick starter and to serve as an invitation to readers for more in-depth exploration of
the field. The topics covered in this chapter are chemical representation, chemical data and data mining,
molecular descriptors, chemical space and dimension reduction, quantitative structure-activity relationship, similarity, diversity, and multiobjective optimization.
Key words: Chemoinformatics, QSAR, QSPR, similarity, diversity, library design, chemical
representation, chemical space, virtual screening, multiobjective optimization.
1. Introduction
Library design is essentially a selection process, selecting a useful subset of compounds from a candidate pool. How to select
this subset depends on the purpose of the library. For a simple
probe of a local structure-activity relationship (SAR), medicinal
chemists may be able to choose an excellent subset of representatives from a small pool of synthesizable compounds to achieve the
goal without resorting to any sophisticated design tools. For complex library applications, though, design tools are indispensable
for obtaining optimal results. The majority of the design tools used
for library design fall into a field called chemoinformatics, a discipline that studies the transformation of data into information
and information into knowledge for better decision making (1).
In fact, the recent explosive development in chemoinformatics
has mainly been stimulated by the ever-increasing applications of
chemical library technologies in the pharmaceutical industry.
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_2, Springer-Science+Business Media, LLC 2011
2. Chemoinformatics
Although still rapidly evolving, chemoinformatics as a scientific
discipline is relatively mature. This section is meant to be introductory only. Interested readers are referred to various monographs on chemoinformatics for a deep understanding of the field
(4-8).
2.1. Chemical Representation
(a) [Structure drawing of acetaminophen not recoverable from the source.]
(b) MOLfile contents recovered from the figure (trailing property fields of the counts, atom, and bond lines are omitted):
Header block: SMMXDraw12120917342D
Counts line: 11 atoms, 11 bonds, version V2000
Atom block (atom number: x, y, z, element):
 1:  12.3082  -7.1882  0.0000  C
 2:  13.0242  -7.6016  0.0000  C
 3:  13.7402  -7.1882  0.0000  C
 4:  14.4562  -7.6016  0.0000  N
 5:  15.1722  -7.1882  0.0000  C
 6:  15.1722  -6.3615  0.0000  O
 7:  15.8882  -7.6016  0.0000  C
 8:  13.7402  -6.3615  0.0000  C
 9:  13.0242  -5.9481  0.0000  C
10:  12.3082  -6.3615  0.0000  C
11:  11.5922  -5.9481  0.0000  O
Bond block (first atom, second atom, bond type):
 2  1  2
 3  2  1
 3  4  1
 4  5  1
 5  6  2
 5  7  1
 8  3  2
 9  8  1
10  9  2
10 11  1
 1 10  1
M END
The counts line, atom block, and bond block together form the connection table (Ctab).
Fig. 2.1. Illustrative example of a MOLFile for acetaminophen (also known as paracetamol). (a) Molecular structure of
acetaminophen, commonly known as Tylenol. Tylenol is a widely used medicine for reducing fever and pain. (b) MOLFile
for acetaminophen.
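To make the block structure of Fig. 2.1(b) concrete, the following Python sketch writes the acetaminophen atoms and bonds from the figure into a V2000-style text block and then reads them back by fixed column positions. It is a minimal sketch under stated assumptions: the function names are hypothetical, the first and third header lines are placeholders, all trailing property fields are written as zero, and only the columns needed here (counts, coordinates, element symbol, bond atoms, bond type) are parsed.

```python
from collections import Counter

# Coordinates, elements, and bonds as listed in Fig. 2.1(b).
ATOMS = [
    (12.3082, -7.1882, "C"), (13.0242, -7.6016, "C"), (13.7402, -7.1882, "C"),
    (14.4562, -7.6016, "N"), (15.1722, -7.1882, "C"), (15.1722, -6.3615, "O"),
    (15.8882, -7.6016, "C"), (13.7402, -6.3615, "C"), (13.0242, -5.9481, "C"),
    (12.3082, -6.3615, "C"), (11.5922, -5.9481, "O"),
]
BONDS = [(2, 1, 2), (3, 2, 1), (3, 4, 1), (4, 5, 1), (5, 6, 2), (5, 7, 1),
         (8, 3, 2), (9, 8, 1), (10, 9, 2), (10, 11, 1), (1, 10, 1)]


def build_molblock(atoms, bonds, name="acetaminophen"):
    """Write a V2000-style block; property fields are zeroed (assumption)."""
    lines = [name, "  SMMXDraw12120917342D", ""]            # header block
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for x, y, sym in atoms:                                  # atom block
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {sym:<3}" + " 0" + "  0" * 11)
    for a, b, order in bonds:                                # bond block
        lines.append(f"{a:3d}{b:3d}{order:3d}" + "  0" * 4)
    lines.append("M  END")
    return "\n".join(lines)


def parse_molblock(text):
    """Read atoms and bonds back using fixed column positions."""
    lines = text.splitlines()
    counts = lines[3]                                        # 4th line: counts
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = [(float(l[0:10]), float(l[10:20]), float(l[20:30]), l[31:34].strip())
             for l in lines[4:4 + n_atoms]]
    bonds = [(int(l[0:3]), int(l[3:6]), int(l[6:9]))
             for l in lines[4 + n_atoms:4 + n_atoms + n_bonds]]
    return atoms, bonds


if __name__ == "__main__":
    block = build_molblock(ATOMS, BONDS)
    atoms, bonds = parse_molblock(block)
    print(block)
    print("heavy atoms:", dict(Counter(sym for *_, sym in atoms)))
```

Hydrogens are implicit in this connection table, so the element counts cover heavy atoms only.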
Table 2.1
Illustrative SMILES: molecular structures and the corresponding SMILES strings are paired vertically. The numbered
arrows on the three cyclic molecular structures are not part
of the molecules. They are used to indicate the break points
for deriving the corresponding SMILES strings (see text)
[Structure drawings not recoverable from the source.] SMILES strings listed in the table:
CCC
CC=C
CC#C
CCC(C)N
CC(C)C(C(C)N)C(C)
c1ccncc1
c1cc2c(cc[nH]2)nc1
CC(=O)Nc1ccc(cc1)O
Note that a single molecule may correspond to many different, but equivalent, SMILES strings. For example, for a given
asymmetric molecule, starting from a different asymmetric atom
will lead to a different, but equally valid, SMILES string. These
various SMILES are called isomeric SMILES. They can be converted to a unique form called canonical SMILES (11).
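The point that different strings can describe the same molecule can be checked with a small sketch. The Python code below parses a limited SMILES subset (organic-subset atoms, bracket atoms treated as opaque labels, branches, single-digit ring closures, explicit bond symbols) into a graph and compares two strings through a Morgan-style neighbourhood signature. All function names are hypothetical, and this is not the canonicalization algorithm of reference (11): equal signatures are a necessary condition for two strings to describe the same graph, not a proof, and production work should rely on a chemistry toolkit's canonical SMILES.

```python
import hashlib
import re

TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[=#\-:]|[()]|\d")


def _aromatic(atom):
    """Lowercase element symbols denote aromatic atoms in SMILES."""
    return atom.lstrip("[")[0].islower()


def _order(atoms, i, j, explicit):
    """Explicit bond symbol if given; else aromatic ('a') or single ('-')."""
    if explicit:
        return explicit
    return "a" if _aromatic(atoms[i]) and _aromatic(atoms[j]) else "-"


def parse_smiles(smiles):
    """Return (atoms, bonds) where bonds maps frozenset({i, j}) -> order."""
    atoms, bonds, stack, ring_open = [], {}, [], {}
    prev, pending = None, None
    for tok in TOKEN.findall(smiles):
        if tok == "(":
            stack.append(prev)              # remember the branch point
        elif tok == ")":
            prev = stack.pop()              # return to the branch point
        elif tok in "=#-:":
            pending = tok                   # bond symbol for the next bond
        elif tok.isdigit():
            if tok in ring_open:            # second occurrence closes the ring
                other, explicit = ring_open.pop(tok)
                bonds[frozenset((prev, other))] = _order(
                    atoms, prev, other, pending or explicit)
            else:
                ring_open[tok] = (prev, pending)
            pending = None
        else:                               # an atom token
            atoms.append(tok)
            idx = len(atoms) - 1
            if prev is not None:
                bonds[frozenset((prev, idx))] = _order(atoms, prev, idx, pending)
            prev, pending = idx, None
    return atoms, bonds


def signature(smiles, rounds=3):
    """Sorted atom labels after iterative neighbourhood hashing."""
    atoms, bonds = parse_smiles(smiles)
    nbrs = {i: [] for i in range(len(atoms))}
    for pair, order in bonds.items():
        i, j = tuple(pair)
        nbrs[i].append((order, j))
        nbrs[j].append((order, i))
    labels = list(atoms)
    for _ in range(rounds):
        labels = [
            hashlib.sha1(repr((labels[i], sorted((o, labels[j])
                                                 for o, j in nbrs[i]))).encode()
                         ).hexdigest()
            for i in range(len(atoms))
        ]
    return sorted(labels)


if __name__ == "__main__":
    a = "CC(=O)Nc1ccc(cc1)O"        # acetaminophen as written in Table 2.1
    b = "Oc1ccc(NC(C)=O)cc1"        # same molecule, different starting atom
    print(signature(a) == signature(b))
    print(signature("CCC") == signature("CC=C"))
```

The second string starts from the hydroxyl oxygen instead of the methyl carbon, which changes the text but not the underlying graph.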
Daylight has extended SMILES rules to accommodate
general descriptions of molecular patterns and chemical reactions (13). These SMILES extensions are called SMARTS and
SMIRKS. SMARTS is a language for describing molecular patterns while SMIRKS defines rules for chemical reaction transformations.
SMILES strings are very concise and hence are suitable for storing and transporting large numbers of molecular structures, while MOLfiles and their extension, SDfiles, have the option to store more complicated molecular data such as 3D conformational information and biological data associated with the molecules. There are many other file formats not discussed here. Interested readers can find a list of file types at the following web site: http://www.ch.ic.ac.uk/chemime/.
2.2. Data, Databases,
and Data Mining
Theoretical molecular descriptors cover much broader varieties and usually are more readily available even though the complexity of their computational procedures may vary widely. Major
classes of computed molecular descriptors include the following:
(i) Constitutional counts such as molecular weight, number
of heavy atoms, number of rotatable bonds, number of
rings, and number of aromatic rings.
(ii) 2D molecular properties such as numbers of hydrogen
bond donors/acceptors and their strengths, and number of
polar atoms.
(iii) Topological descriptors from graph theory such as various
graph-theoretic invariants of molecular graphs, 2D and
3D autocorrelations, and various property-weighted graph-theoretic quantities.
(iv) Geometrical descriptors such as shape, radius of gyration, moments of inertia, volume, polar/nonpolar surface
areas.
(v) Electrostatic properties such as dipole moment, partial
atomic charges.
(vi) Fingerprints such as 2D fingerprints like Daylight fingerprints and UNITY fingerprints and 3D fingerprints like
pharmacophore fingerprints.
(vii) Quantum chemical descriptors such as HOMO/LUMO
energies, E-state values.
(viii) Predicted physicochemical properties such as calculated
solubility, calculated logP, and various molecular properties from QSPR predictive models.
There are also various hybrid descriptors. For example, electrotopological descriptors are a hybrid of topological and electronic descriptors.
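As a small illustration of class (i) above, the following Python sketch computes two constitutional descriptors, molecular weight and heavy-atom count, from a molecular formula. The helper names and the abbreviated atomic-mass table are assumptions made for this example; descriptor packages normally derive such counts from the full structure rather than from a formula string.

```python
import re

# Standard atomic masses (abbreviated table, rounded to three decimals).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def parse_formula(formula):
    """Turn a simple formula such as 'C8H9NO2' into {'C': 8, 'H': 9, ...}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

def constitutional_descriptors(formula):
    """Return molecular weight and heavy-atom count for a simple formula."""
    counts = parse_formula(formula)
    weight = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    heavy = sum(n for s, n in counts.items() if s != "H")
    return {"molecular_weight": round(weight, 3), "heavy_atoms": heavy}

# Acetaminophen, the molecule of Fig. 2.1, has the formula C8H9NO2.
acetaminophen = constitutional_descriptors("C8H9NO2")
```

For acetaminophen this gives a molecular weight of about 151.165 and 11 heavy atoms, matching the 11 atoms listed in the MOLfile of Fig. 2.1.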
Applications of molecular descriptors are as diverse as their
definitions. The important classes of applications include QSAR
and/or QSPR, similarity, diversity, predictive models for virtual
screening and/or data mining, and data visualization. We will discuss
briefly some of these applications in the next sections.
There are literally thousands of molecular descriptors available
for various applications. We have mentioned only a few of them
in the previous paragraphs. Interested readers can find a more complete coverage of molecular descriptors in reference (15), which
gives definitions for 3,300 molecular descriptors. Many software packages,
as well as subroutines that are an integral part of other programs, are available
to generate various types of molecular descriptors. Table 2.2 lists
a few of these software packages.
2.4. Chemical Space
and Dimension
Reduction
Table 2.2
A selected list of software for computing molecular descriptors
Software name | Number of descriptors | Distributor (and/or author) | Reference
ADAPT | >264 | | http://research.chem.psu.edu/pcjgroup/desccode.html
ADMET Predictor | 297 | Simulations Plus | http://www.simulationsplus.com/
ADRIANA.code | 1,244 | Molecular Networks | http://www.molecularnetworks.com/products/adrianacode
CODESSA | 1,500 | | http://www.codessapro.com/index.htm
DRAGON | 3,224 | |
JOELib | 68 | | http://www.ra.cs.uni-tuebingen.de/software/joelib/index.html
MOE | >600 | | http://www.chemcomp.com/
Molconn-Z | >40 | eduSoft | http://www.edusoftlc.com/molconn/
Molgen-QSPR | 708 | | http://www.molgen.de/?src=documents/molgenqspr.html
PowerMV | >1,000 | | http://nisla05.niss.org/PowerMV/
PreADMET | 1,081 | Bioinformatics & Molecular Design Research Center, South Korea | http://preadmet.bmdrc.org/index.php?option=com_content&view=frontpage&Itemid=1
Sarchitect | 1,084 | | http://www.strandls.com/sarchitect/index.html
TAM | >20 | |
TOPIX | 130 | | http://www.lohninger.com/topix.html
Type of descriptors, ADMET Predictor: constitutional, functional group counts, topological, E-state, Moriguchi descriptors, Meylan flags, molecular patterns, electronic properties, 3D descriptors, hydrogen bonding, acid-base ionization, empirical estimates of quantum descriptors.
[Type-of-descriptor entries for the other packages are only partly legible and cannot be assigned to specific rows: "Constitutional, BCUT, etc."; "Constitutional, topological, physicochemical, etc."; "Topological"; and a fragment listing geometrical, fingerprints, and atom pairs.]
Table 2.3
Selected similarity coefficients to be used with 2D fingerprints for molecule pair (A, B)
Coefficient | Expression (a) | Value range | Notes
Tanimoto | $\frac{c}{a+b+c}$ | 0.0 to 1.0 |
Cosine | $\frac{c}{\sqrt{(a+c)(b+c)}}$ | 0.0 to 1.0 |
Hamming | $a+b$ | 0.0 to $\infty$ | This is a dissimilarity coefficient
Russell-Rao | $\frac{c}{a+b+c+d}$ | 0.0 to 1.0 |
Forbes | $\frac{(a+b+c+d)\,c}{(a+c)(b+c)}$ | 0.0 to $\infty$ |
Pearson | $\frac{cd-ab}{\sqrt{(a+c)(b+c)(a+d)(b+d)}}$ | -1.0 to 1.0 |
Simpson | $\frac{c}{\min\{(a+c),\,(b+c)\}}$ | 0.0 to 1.0 |
Euclid | $\sqrt{\frac{c+d}{a+b+c+d}}$ | 0.0 to 1.0 |
(a) a is the count of bits that are on in the A string but off in the B string; b is the count of bits that are off in the A string but on in the B string; c is the count of bits that are on in both the A string and the B string; d is the count of bits that are off in both the A string and the B string.
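The bit counts a, b, c, and d defined in the footnote of Table 2.3 translate directly into code. The sketch below is a minimal Python illustration in which each fingerprint is represented as the set of its on-bit positions; this representation, the function names, and the two 8-bit example fingerprints are assumptions made for the example, not the format of any particular fingerprint package.

```python
from math import sqrt

def bit_counts(fp_a, fp_b, n_bits):
    """Return (a, b, c, d) for two fingerprints given as sets of on-bit indices."""
    a = len(fp_a - fp_b)          # on in A, off in B
    b = len(fp_b - fp_a)          # off in A, on in B
    c = len(fp_a & fp_b)          # on in both
    d = n_bits - a - b - c        # off in both
    return a, b, c, d

def similarity_coefficients(fp_a, fp_b, n_bits):
    """Evaluate the coefficients of Table 2.3 for one molecule pair."""
    a, b, c, d = bit_counts(fp_a, fp_b, n_bits)
    n = a + b + c + d
    return {
        "tanimoto": c / (a + b + c),
        "cosine": c / sqrt((a + c) * (b + c)),
        "hamming": a + b,                      # dissimilarity
        "russell_rao": c / n,
        "forbes": n * c / ((a + c) * (b + c)),
        "pearson": (c * d - a * b) / sqrt((a + c) * (b + c) * (a + d) * (b + d)),
        "simpson": c / min(a + c, b + c),
        "euclid": sqrt((c + d) / n),
    }

# Hypothetical 8-bit fingerprints, used only to exercise the formulas.
fp_a = {0, 1, 2, 3}
fp_b = {2, 3, 4}
coefficients = similarity_coefficients(fp_a, fp_b, n_bits=8)
```

For this toy pair a = 2, b = 1, c = 2, and d = 3, so the Tanimoto coefficient is 2/5 = 0.4 and the Hamming distance is 3. The sketch does not guard against empty fingerprints, for which several denominators vanish.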
differences between diversity and dissimilarity. Diversity is a property of a molecular collection while dissimilarity can be defined
for pairs of molecules as well.
Since diversity is a collective property, its precise quantification requires a mathematical description of the distribution of the
molecular collection in a chemical space. When one set of molecules
is considered to be more diverse than another, the molecules
in this set cover more chemical space and/or are distributed more evenly in chemical space. Historically, diversity analysis is closely linked to compound selection and combinatorial
library design. In reality, library design is also a selection process,
selecting compounds from a virtual library before synthesis. There
are three main categories of selection procedures for building a
diverse set of compounds: cluster-based selection, partition-based
selection, and dissimilarity-based selection.
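Of the three categories, partition-based selection is the simplest to sketch in code: each descriptor axis is divided into intervals, every compound falls into one cell, and one representative is kept per occupied cell. The Python example below is a minimal illustration under stated assumptions: the two-dimensional descriptor values are invented, equal-width bins are used on every axis, and the representative is simply the first compound encountered in each cell.

```python
def partition_select(descriptors, bin_width):
    """Keep one representative (the first seen) from each occupied cell.

    descriptors maps a compound name to a tuple of descriptor values;
    each axis is cut into equal-width intervals of size bin_width.
    """
    cells = {}
    for name, values in descriptors.items():
        cell = tuple(int(v // bin_width) for v in values)
        cells.setdefault(cell, name)   # later compounds in the cell are skipped
    return cells

# Hypothetical compounds in a two-dimensional chemical space.
library = {
    "cpd1": (0.2, 0.3),
    "cpd2": (0.4, 0.1),   # same cell as cpd1
    "cpd3": (1.5, 0.2),
    "cpd4": (1.7, 1.9),
    "cpd5": (0.1, 1.2),
}
selected = partition_select(library, bin_width=1.0)
representatives = sorted(selected.values())
```

Here cpd1 and cpd2 share a cell, so only four of the five compounds are selected. Empty cells are equally informative in practice, since they mark regions of chemical space the collection does not cover.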
The cluster-based selection procedure starts with classifying
compounds into clusters of similar molecules with a clustering
algorithm followed by selection of representative(s) from each
cluster (24). On the other hand, the partition-based selection
procedure partitions chemical space into cells by dividing values
of each dimension into various intervals and selects representative
[1]
[2]
and
There is a long history of efforts to find simple and interpretable f1 and f2 functions for various activities and properties
(29, 30). The quest for predictive QSAR models started with
Hammett's pioneering work to correlate molecular structures with
chemical reactivities (30-32). However, the widespread applications of modern predictive QSAR and QSPR actually started
with the seminal work of Hansch and coworkers on pesticides
(29, 33, 34), and the development of various powerful analysis tools for multivariate analysis, such as PLS (partial least squares) and neural networks,
has fueled these widespread applications.
Nowadays, numerous publications on guidelines, workflows, and
common errors for building predictive QSAR and QSPR models, not to mention the countless papers on applications, are well
documented in the literature (35-41).
In principle, a valid QSAR/QSPR model should contain the
following information (39): (i) a defined endpoint; (ii) an unambiguous algorithm; (iii) a defined domain of applicability; (iv)
appropriate measures of goodness of fit, robustness, and predictivity; and (v) a mechanistic interpretation, if possible. Building
predictive QSAR/QSPR models is a process from experimental data to model and to predictions. Collecting reliable experimental data (and subdividing the data into training set and
testing set) is the first step of the model-building process. The
second step of the process is usually to select relevant parameters
(or molecular descriptors) that are most responsive to the variation of activities (or properties) in the data set. The third step is
QSAR/QSPR modeling and model validation. Finally, the validated models are applied to make predictions. Usually, the second and the third, and sometimes the first, steps are repeated
to select the best combination of parameter set and models (see,
for example, reference (40)). Although the majority of QSAR/QSPR
models are built with molecular descriptors, there are parameter-free models. For example, the Free-Wilson method builds predictive QSAR/QSPR models for a series of substituted compounds
without any molecular descriptors (42). Its drawback is that the
Free-Wilson method requires a data set covering almost all combinations of substituents at all substituted sites, and the method is not
applicable to sets of noncongeneric molecules.
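The model-building steps described above (split the data, fit a model on the training set, validate it on the testing set) can be sketched with a deliberately small Python example. Everything in it is an assumption made for illustration: the single descriptor, the invented activity values, and the choice of ordinary least squares with one variable, which stands in for the multivariate methods such as PLS used in practice.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for activity = slope * descriptor + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the model predictions on (xs, ys)."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(errors) / len(errors))

# Step 1: hypothetical data, already divided into training and testing sets.
train_x, train_y = [1.0, 2.0, 4.0, 5.0], [3.1, 4.9, 9.2, 10.8]
test_x, test_y = [3.0, 3.5], [7.1, 7.9]

# Steps 2-3: one descriptor is assumed to be already selected; fit and validate.
slope, intercept = fit_line(train_x, train_y)
fit_error = rmse(train_x, train_y, slope, intercept)     # goodness of fit
test_error = rmse(test_x, test_y, slope, intercept)      # predictivity

# Step 4: apply the validated model inside the range spanned by the training set.
prediction = slope * 2.5 + intercept
```

The testing-set descriptors lie within the range of the training set, which is a crude stand-in for the defined domain of applicability listed as item (iii).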
It is interesting to note that various QSAR/QSPR models
from an array of methods can be very different in both complexity and predictivity. For example, a simple QSPR equation with
three parameters can predict logP within one unit of measured
values (43) while a complex hybrid mixture discriminant analysis
random forest model with 31 computed descriptors can only predict the volume of distribution of drugs in humans within about
twofold of experimental values (44). The volume of distribution
is a more complex property than partition coefficient. The former is a physiological property and has a much higher uncertainty
in its experimental measurements while logP is a much simpler
physicochemical property and can be measured more accurately.
These and other factors can dictate whether a good predictive
model can be built.
2.7. Multiobjective
Optimization
$$\sum_{i=1}^{n} w_i \, f_i(\mathrm{Obj}_i) \qquad [3]$$
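Expression [3] has the form of a weighted sum, in which each objective is passed through its own function and multiplied by a weight before the terms are added. The Python sketch below illustrates that aggregation under stated assumptions: the two objectives (similarity to a lead and molecular weight), the weights, the transform functions, and the candidate values are all hypothetical.

```python
def weighted_sum(objective_values, weights, transforms):
    """Aggregate several objectives into one score: sum of w_i * f_i(Obj_i)."""
    return sum(w * f(v) for w, f, v in zip(weights, transforms, objective_values))

def identity(value):
    """Use the objective value as it is (assumed already scaled to 0-1)."""
    return value

def weight_desirability(molecular_weight):
    """Hypothetical transform: full credit up to 500, none above."""
    return 1.0 if molecular_weight <= 500 else 0.0

weights = (0.7, 0.3)
transforms = (identity, weight_desirability)

# Hypothetical candidates: (similarity to lead, molecular weight).
candidates = {"A": (0.80, 450.0), "B": (0.90, 620.0)}
scores = {name: weighted_sum(values, weights, transforms)
          for name, values in candidates.items()}
best = max(scores, key=scores.get)
```

Candidate B is more similar to the lead, but A scores higher overall because B loses the molecular-weight term. The outcome depends entirely on the chosen weights, which is the usual caveat of weighted-sum approaches.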
[Figure 2.2: flowchart with three stages. Library for VS; In-silico assay (structure-based: docking, etc.; ligand-based: similarity, clustering, pharmacophore, QSAR models, etc.); Virtual hit follow-up (synthesis if needed, experimental validation of activity).]
Fig. 2.2. Three components of a typical VS process: compound library, virtual assay,
and hit follow-up for virtual hits.
3. Library and
Library Design
3.1. Compound
Library for Drug
Discovery
There are two major classes of libraries for drug discovery: diverse
libraries for lead discovery and focused libraries for lead optimization. Lead discovery libraries emphasize diversity while lead
optimization libraries prefer similar compounds. The purpose of
lead discovery libraries is to find lead matter and to provide
potential active compounds for further optimization. Without any
prior knowledge about the active compounds for a given target, it is reasonable to start with a library of enough chemical
space coverage to demarcate the biologically relevant chemical
space for the target. Therefore, libraries for lead discovery often
comprise diverse compounds with drug-like/lead-like properties.
Lead matter without proper drug-likeness/lead-likeness properties might be trapped in a local and unoptimizable zone of
the chemical space during the lead optimization stage. On the other
hand, the purpose of lead optimization libraries is to improve the
activity and the property profile of the lead matter. With a lead
compound in hand, the search for better and optimized compounds is usually performed among similar compounds with limited diversity
around the lead molecule in the chemical space.
There are three major sources for a typical corporate compound collection: project-specific compounds accumulated over
a long period of time through medicinal chemistry efforts for various therapeutic projects, individual compounds from commercial sources, and compounds from combinatorial chemistry. In
practice, compound collections are often divided into subsets, for
example, the diverse subsets for general HTS and target-focused
subsets (such as kinase libraries or GPCR libraries). For library
design, diversity and similarity are generally built into the libraries
of compounds to be synthesized and/or purchased (73).
Stimulated by the widespread applications of HTS technologies, combinatorial chemistry has provided a powerful tool for
rapidly adding large numbers of compounds to the corporate collections of many pharmaceutical companies. A virtual combinatorial
library consists of libraries from individual reactions, and compounds from a single reaction share a common product core
(see Fig. 2.3). The number of compounds in a combinatorial
library can grow rapidly with the number of reaction components
and the number of reactants for each component. For example, a full combinatorial library from a three-component reaction
with 200 reactants for each component would contain 8 million products (200 × 200 × 200). A virtual library can also be represented as a template with R-groups attached at various variation sites. This representation is also called a Markush structure, the
standard chemical structure representation often used in chemical patents.
Template-based libraries can be considered a generalization
of the scenario shown in Fig. 2.3, where the product cores of
individual reactions are the templates. Notice that reaction-based
virtual libraries have explicit chemistries for compound syntheses
and, through careful selection of reactants, may therefore include only synthesizable compounds,
while general template-based virtual libraries usually do not indicate the chemical accessibility of the compounds.
[Figure 2.3: schematic of a virtual combinatorial library made up of compounds of core 1 from Reaction 1, compounds of core 2 from Reaction 2, through compounds of core N from Reaction N; each core carries R-group sites (R1, R2, R3).]
Fig. 2.3. Virtual combinatorial library is the start point for any combinatorial library
design. It consists of libraries from individual reactions. Compounds from a given
reaction share a unique product core.
3.2. Library
Enumeration
[Figure 2.4: reaction scheme in which a component carrying R1 and R2 and a component carrying R3 combine to give a product bearing R1, R2, and R3.]
Fig. 2.4. Product enumerations of a combinatorial library. For reaction-based enumeration, individual groups of N(R1)(R2) and (CO)-R3 are replaced by corresponding molecular fragments from reactants A and B. For template-based enumeration,
the R-groups R1, R2, and R3 are replaced by independent lists of molecular fragments. Note that some combinations of R1 and R2 may not exist in component A for
reaction-based enumerations. The template-based product structure with R-groups is
also called a Markush structure and its enumeration is called Markush enumeration or
Markush exemplification.
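Template-based (Markush) enumeration as described in the caption of Fig. 2.4 amounts to substituting every combination of R-group fragments into the template. The Python sketch below illustrates this with SMILES strings; the amide template, the short R-group lists, and the function name are assumptions chosen for the example, and plain string substitution only works for fragments that do not clash with ring-closure digits in the template.

```python
from itertools import product
from math import prod

# Hypothetical amide template following Fig. 2.4: R3-(CO)-N(R1)(R2).
TEMPLATE = "{r3}C(=O)N({r1}){r2}"

# Hypothetical, independent R-group lists written as SMILES fragments.
R1 = ["C", "CC"]
R2 = ["C", "C(C)C"]
R3 = ["C", "c1ccccc1", "C1CC1"]

def enumerate_library(template, **rgroups):
    """Yield one product SMILES for every combination of R-group fragments."""
    names = list(rgroups)
    for combination in product(*(rgroups[name] for name in names)):
        yield template.format(**dict(zip(names, combination)))

library = list(enumerate_library(TEMPLATE, r1=R1, r2=R2, r3=R3))

# Library size is the product of the list lengths: 2 * 2 * 3 = 12 here,
# and 200 * 200 * 200 = 8 million for the three-component example in the text.
full_size = prod([200, 200, 200])
```

The first product generated is `CC(=O)N(C)C`. A reaction-based enumeration would differ in that R1 and R2 come paired from the same reactant, so only the amine combinations that actually exist in component A would be generated.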
4. Concluding
Remarks
Library design has become an integral part of the drug discovery process. Chemical library design has been transformed from a
pure tool for supplying vast numbers of compounds into a powerful
tool for generating quality leads and drug candidates. Although
the controversy over how to define the best set of compounds for lead
generation is not completely resolved, tremendous progress has
been made in finding biologically relevant subregions of the chemical space, particularly when confined to a target or a target family
(see, for example, references (85, 86)). Providing biologically relevant compounds will continue to be one of the main goals of
library design.
Since modern drug discovery is mainly a data-driven process and chemoinformatics is at the center of data integration
and utilization, it is natural that the majority of library design tools
are chemoinformatics tools. Therefore, a deep understanding of
chemoinformatics is necessary for taking full advantage of library
technologies.
Though relatively mature, chemoinformatics is still an active
field of intensive research. Numerous new methods and tools continue to be developed. Here we have selectively covered, without
giving too many details, a few topics important to library design.
The interplay and mutual stimulation between chemoinformatics
and library design have been well documented in the literature. We
hope that the brief introduction in this chapter can serve as a
guide for you to enter the exciting field of chemoinformatics
and its applications to chemical library design.
Acknowledgment
The chapter was prepared while the author was visiting Professor Andy McCammon's group. The author is very grateful to
Professor Andy McCammon and his group for the exciting and
stimulating scientific environment during the preparation of the
chapter.
References
1. Brown, F. B. (1998) Chemoinformatics: what is it and how does it impact drug discovery. Annu Rep Med Chem 33, 375-384.
2. Bohacek, R. S., McMartin, C., Guida, W. C. (1996) The art and practice of structure-based drug design: a molecular modeling perspective. Med Res Rev 16, 3-50.
3. Walters, W. P., Stahl, M. T., Murcko, M. A. (1998) Virtual screening - an overview. Drug Discov Today 3, 160-178.
4. Gasteiger, J. (ed.) (2003) Handbook of Chemoinformatics: From Data to Knowledge, Wiley-VCH, Weinheim.
5. Bajorath, J. (ed.) (2004) Chemoinformatics: Concepts, Methods, and Tools for Drug Discovery, Humana Press, Totowa, NJ.
6. Oprea, T. I. (ed.) (2005) Chemoinformatics in Drug Discovery, Wiley-VCH, Weinheim.
7. Leach, A. R. and Gillet, V. J. (2007) An Introduction to Chemoinformatics, Springer, London.
8. Bunin, B. A., Siesel, B., Morales, G. A., Bajorath, J. (2007) Chemoinformatics: Theory, Practice, & Products, Springer, The Netherlands.
9. http://www.symyx.com/solutions/white_papers/ctfile_formats.jsp, last accessed February, 2010.
10. Weininger, D. (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28, 31-36.
11. Weininger, D. (1989) SMILES, 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci 29, 97-101.
12. Weininger, D. (1990) SMILES, 3. Depict. Graphical depiction of chemical structures. J Chem Inf Comput Sci 30, 237-243.
13. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html, last accessed February, 2010.
14. Simsion, G. C., Witt, G. C. (2001) Data Modeling Essentials, 2nd ed. Coriolis, Scottsdale, USA.
15. Todeschini, R., Consonni, V. (2009) Molecular Descriptors for Chemoinformatics Vol. 1, 2nd ed. Wiley-VCH, Weinheim, Germany.
16. Jolliffe, I. T. (2002) Principal Component Analysis, 2nd ed. Springer, New York.
17. Borg, I. and Groenen, P. J. F. (2005) Modern Multidimensional Scaling: Theory and Applications, 2nd ed. Springer, New York.
18. Domine, D., Devillers, J., Chastrette, M., Karcher, W. (1993) Non-linear mapping for structure-activity and structure-property modeling. J Chemometrics 7, 227-242.
19. Wermuth, C. G. (2006) Similarity in drugs: reflections on analogue design. Drug Discov Today 11, 348-354.
20. Willett, P. (2000) Chemoinformatics - similarity and diversity in chemical libraries. Curr Opin Biotech 11, 85-88.
21. Maldonado, A. G., Doucet, J. P., Petitjean, M., Fan, B.-T. (2006) Molecular similarity and diversity in chemoinformatics: from theory to applications. Mol Divers 10, 39-79.
22. Willett, P. (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 11, 1046-1053.
23. Holliday, J. D., Hu, C.-Y., Willett, P. (2002) Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bitstrings. Comb Chem High Throughput Screening 5, 155-166.
24. Dunbar, J. B. (1997) Cluster-based selection. Perspect Drug Discov Des 7/8, 51-63.
25. Mason, J. S., Pickett, S. D. (1997) Partition-based selection. Perspect Drug Discov Des 7/8, 85-114.
26. Rusinko, A. III, Farmen, M. W., Lambert, C. G. et al. (1999) Analysis of a large structure/biological activity dataset using recursive partitioning. J Chem Inf Comput Sci 39, 1017-1026.
27. Lajiness, M. S. (1997) Dissimilarity-based compound selection techniques. Perspect Drug Discov Des 7/8, 65-84.
28. Pickett, S. D., Luttman, C., Guerin, V., Laoui, A., James, E. (1998) DIVSEL and COMPLIB - strategies for the design and comparison of combinatorial libraries using pharmacophore descriptors. J Chem Inf Comput Sci 38, 144-150.
29. Hansch, C., Hoekman, D., Gao, H. (1996) Comparative QSAR: toward a deeper understanding of chemicobiological interactions. Chem Rev 96, 1045-1074.
30. Jaffé, H. H. (1953) A reexamination of the Hammett equation. Chem Rev 53, 191-261.
31. Hammett, L. P. (1935) Some relations between reaction rates and equilibrium. Chem Rev 17, 125-136.
32. Hammett, L. P. (1937) The effect of structure upon the reactions of organic compounds. Benzene derivatives. J Am Chem Soc 59, 96-103.
33. Hansch, C., Maloney, P. P., Fujita, T., Muir, R. M. (1962) Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature 194, 178-180.
34. Hansch, C. (1993) Quantitative structure-activity relationships and the unnamed science. Acc Chem Res 26, 147-153.
35. Livingstone, D. J. (2004) Building QSAR models: a practical guide, in (Cronin, M. T. D., Livingstone, D. J. eds.) Predicting Chemical Toxicity and Fate. CRC Press, Boca Raton, FL, 2004, pp. 151-170.
36. Walker, J. D., Dearden, J. C., Schultz, T. W., Jaworska, J., Comber, M. H. I. (2003) in (Walker, J. D. ed.) QSARs for New Practitioners, in QSARs for Pollution Prevention, Toxicity Screening, Risk Assessment, and Web Applications. SETAC Press, Pensacola, FL, pp. 3-18.
37. Walker, J. D., Jaworska, J., Comber, M. H. I., Schultz, T. W., Dearden, J. C. (2003) Guidelines for developing and using quantitative structure-activity relationships. Environ Toxicol Chem 22, 1653-1665.
38. Cronin, M. T. D., Schultz, T. W. (2003) Pitfalls in QSAR. J Theoret Chem (Theochem) 622, 39-51.
39. OECD Principles for the Validation of (Q)SARs, http://www.oecd.org/dataoecd/33/37/37849783.pdf, last accessed February, 2010.
40. Tropsha, A., Golbraikh, A. (2007) Predictive QSAR modeling workflow, model applicability domains, and virtual screening. Curr Pharmaceut Design 13, 3494-3504.
41. Dearden, J. C., Cronin, M. T. D., Kaiser, K. L. E. (2009) How not to develop a quantitative structure-activity or structure-property relationship (QSAR/QSPR). SAR and QSAR in Environ Res 20, 241-266.
81. Grabowski, K., Baringhaus, K.-H., Schneider, G. (2008) Scaffold diversity of natural products: inspiration for combinatorial library design. Nat Prod Rep 25, 892-904.
82. Stocks, M. J., Wilden, G. R. H., Pairaudeau, G., Perry, M. W. D., Steele, J., Stonehouse, J. P. (2009) A practical method for targeted library design balancing lead-like properties with diversity. ChemMedChem 4, 800-808.
83. Hann, M. M., Leach, A. R., Harper, G. (2001) Molecular complexity and its impact on the probability of finding leads for drug discovery. J Chem Inf Comput Sci 41, 856-864.
84. Gillet, V. J. (2002) Reactant- and product-based approaches to the design of combinatorial libraries. J Comput Aided Mol Des 16, 371-380.
85. Balakin, K. V., Ivanenkov, Y. A., Savchuk, N. P. (2009) Compound library design for targeted families, in (Jacoby, E. ed.) Chemogenomics. Humana Press, New York, pp. 21-46.
86. Xi, H., Lunney, E. A. (2010) The design, annotation and application of a kinase-targeted library, in (Zhou, J. Z. ed.) Chemical Library Design. Humana Press, New York, Chapter 14.
Chapter 3
Molecular Library Design Using Multi-Objective
Optimization Methods
Christos A. Nicolaou and Christos C. Kannas
Abstract
Advancements in combinatorial chemistry and high-throughput screening technology have enabled the
synthesis and screening of large molecular libraries for the purposes of drug discovery. Contrary to initial
expectations, the increase in screening library size, typically combined with an emphasis on compound
structural diversity, did not result in a comparable increase in the number of promising hits found. In
an effort to improve the likelihood of discovering hits with greater optimization potential, more recent
approaches attempt to incorporate additional knowledge to the library design process to effectively guide
the search. Multi-objective optimization methods capable of taking into account several chemical and
biological criteria have been used to design collections of compounds satisfying simultaneously multiple pharmaceutically relevant objectives. In this chapter, we present our efforts to implement a multiobjective optimization method, MEGALib, custom-designed to the library design problem. The method
exploits existing knowledge, e.g. from previous biological screening experiments, to identify and profile
molecular fragments used subsequently to design compounds compromising the various objectives.
Key words: Multi-objective molecular library design, multi-objective evolutionary algorithm,
selective library design, MEGALib.
1. Introduction
Drug discovery can be seen as the quest to design small molecules
exhibiting favourable biological effects in vivo. Such molecules
need to balance a combination of multiple properties, including
binding affinity to the pharmaceutical target, appropriate pharmacokinetics, and limited (or no) toxicity (1, 2). The lack of consideration of the multitude of properties in the early stages of lead identification and optimization frequently hinders subsequent efforts
for drug discovery (3). Indeed, one of the common causes for
lead compounds to fail in the later stages of drug discovery is the
lack of consideration of multiple objectives at the early stage of
optimization of candidate compounds (4).
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_3, Springer Science+Business Media, LLC 2011
Traditional molecular library design (MLD) methods, modelled after the standard experimental drug discovery procedures,
ignored the multi-objective nature of drug discovery and focussed
on the design of libraries taking into account a single criterion.
Often, the focus has been on maximizing library diversity in an
effort to select compounds representative of the entire possible
population (5) or in designing compound collections exploring
a well-defined region of the chemical space defined by similarity to known ligands (6). The resulting molecular libraries were typically synthesized using combinatorial chemistry, which enables the
synthesis of large numbers of compounds, and screened via high-throughput screening systems. They revealed that simply synthesizing
and screening large numbers of diverse (or similar) compounds
may not increase the probability of discovering promising hits
(7). Instead, because of the multi-objective nature of drug discovery
and the influence of other factors, such as absorption, distribution, metabolism, excretion, toxicity (ADMET), selectivity and cost, molecular screening libraries need to be carefully planned and a number of design
objectives must be taken into account (8). In recent times, MLD
efforts have been exploring the use of multi-objective optimization (MOOP) techniques capable of designing libraries based on
a number of properties simultaneously (9).
1.1. Multi-objective
Optimization Basics
Fig. 3.1. A MOP with two minimization objectives and a set of solutions represented
as circles. The rank of each solution (number next to circle) is based on the number of
solutions that dominate it (i.e. are better) in both objectives. The area defined by the
dashed lines of each solution contains the solutions that dominate it. Non-dominated
solutions are labelled 0. Point (0, 0) represents the ideal solution to this problem.
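The ranking scheme described in the caption of Fig. 3.1 can be written in a few lines of Python. The sketch below follows the caption's wording, in which one solution dominates another when it is better in both minimization objectives; many formulations use the weaker Pareto condition (no worse in every objective and strictly better in at least one), so the strict test here is an assumption taken from the caption. The five example points are invented.

```python
def dominates(p, q):
    """True if solution p is better (smaller) than q in every objective."""
    return all(p_i < q_i for p_i, q_i in zip(p, q))

def domination_ranks(points):
    """Rank of each solution = number of solutions that dominate it."""
    return [sum(dominates(q, p) for q in points) for p in points]

# Hypothetical solutions to a two-objective minimization problem.
solutions = [(1, 5), (2, 2), (3, 3), (4, 1), (5, 5)]
ranks = domination_ranks(solutions)
non_dominated = [s for s, r in zip(solutions, ranks) if r == 0]
```

Here (1, 5), (2, 2), and (4, 1) receive rank 0 and form the non-dominated front, (3, 3) is dominated only by (2, 2), and (5, 5) is dominated by three solutions. The pairwise comparison costs a number of operations proportional to the square of the population size, which is acceptable for the small populations typical of such examples.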
1.3. Multi-objective
Molecular Library
Design Applications
2. Materials
2.1. Multi-objective
Optimization
Software
2.2. Molecular
Building Block
Preparation Software
2.3. Datasets
1. Dataset 1, a set of well-known estrogen receptor (ER) ligands, contains five compounds, three with increased selectivity to ER- and two with increased selectivity to ER-.
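[Figure 3.3: ligand structure drawings with their RBA values; the values legible in the source are 0.17, 13, 76, 32.2, 6.4, and 0.2.]
Fig. 3.3. Ligands and their relative binding affinity (RBA) to estrogen receptors α
and β (25).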
3. Methods
Recently, we proposed the Multi-objective Evolutionary Graph
Algorithm (MEGA), an optimization algorithm designed for the
evolution of chemical structures satisfying multiple constraints
(9). The technique combines evolutionary techniques with graph
data structures to directly manipulate graphs and perform a global
search for promising molecule designs. MEGA supports the use
of problem-specific knowledge and local search techniques with
the aim of improving both performance and scalability. Initial applications of the algorithm to the problem of de novo design showed
that the technique is able to produce a diverse collection of equivalent solutions and, thus, support the drug discovery process (9).
Based on our experiences we have designed a custom version of
the original algorithm, termed MEGALib, to meet the requirements of multi-objective library design. The method focusses on
designing the best possible products, i.e. chemical structures, for
the problem under investigation and makes no attempt to minimize the number of reagents used; its main applications to date
have been in designing small, focussed molecular libraries for secondary screening. This section first describes MEGALib and then
gives a detailed overview of the methodology used to prepare the fragment collection and the computational objectives
required by the algorithm. The later part of the section thoroughly describes an application of MEGALib to the problem of
designing a selective library of compounds.
3.1. Multi-objective
Library Design
Algorithm
Description
Upon termination of the process the algorithm selects a compound set from the working population equal to the user-supplied
library size as the library proposed. The selection of the library
members is performed in a manner identical to the parent selection method described previously.
The algorithm exploits existing knowledge through the inclusion of multiple, problem-specific objectives, the use of bondtype information when evolving molecules and the exploitation of
the weights associated with the building blocks provided which
result in favouring those with an increased weight, i.e. having
privileged status.
3.2. Fragment
Collection
Preparation
3.3. Computational
Objective Encoding
Fig. 3.6. Scaffolds representative of the compounds in the library designed using MEGALib.
to ER- ligands; thus, the problem has been transformed to a biobjective minimization problem with the ideal solution at point
(0, 0).
Figure 3.6 presents a small subset from the collection of the
scaffolds found in the compounds of the designed library. Each
scaffold gave rise to one or more compounds of the designed
library with varying performance to the objectives of the experiment through different substitutions on the various attachment
points indicated as R groups. Consequently, the resulting library
was sufficiently diverse indicating that MEGALib has been successful in identifying and preserving the structural diversity of the
designed compounds.
4. Notes
1. Designing focussed vs diverse libraries. The scope and diversity of the library designed by MEGALib can be controlled
using the user-supplied parameters required by the algorithm primarily by the choice of objectives and building
block pool. Diverse libraries may be designed by formulating
population diversity as one of the objectives of MEGALib.
To this end, Ward's clustering method combined with the
Kelley cluster level selection described in Section 3 may be
used. Additional objectives ensure that the set of diverse
molecules produced will meet, for example, drug-likeness
criteria. Focussed libraries are meant for a specific target
(or related targets) and therefore objectives encoding targetspecific information must be used (17). The use of a carefully
References
1. Ekins, S., Boulanger, B., Swaan, P. W., Hupcey, M. A. (2002) Towards a new age of virtual ADME/TOX and multidimensional drug discovery. J Comput Aided Mol Des 16, 381–401.
2. Agrafiotis, D. K., Lobanov, V. S., Salemme, F. R. (2002) Combinatorial informatics in the post-genomics era. Nat Rev Drug Discov 1, 337–346.
3. Baringhaus, K. H., Matter, H. (2004)
Efficient strategies for lead optimization by
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
Chapter 4
A Scalable Approach to Combinatorial Library Design
Puneet Sharma, Srinivasa Salapaka, and Carolyn Beck
Abstract
In this chapter, we describe an algorithm for the design of lead-generation libraries required in
combinatorial drug discovery. This algorithm addresses simultaneously the two key criteria of diversity and representativeness of compounds in the resulting library and is computationally efficient when
applied to a large class of lead-generation design problems. At the same time, additional constraints on
experimental resources are also incorporated in the framework presented in this chapter. A computationally efficient scalable algorithm is developed, where the ability of the deterministic annealing algorithm to
identify clusters is exploited to truncate computations over the entire dataset to computations over individual clusters. An analysis of this algorithm quantifies the trade-off between the error due to truncation
and computational effort. Results on test datasets corroborate the analysis and show improvement in computation time
by factors as large as ten or more depending on the datasets.
Key words: Library design, combinatorial optimization, deterministic annealing.
1. Introduction
In recent years, combinatorial chemistry techniques have provided important tools for the discovery of new pharmaceutical
agents. Lead-generation library design, the process of screening
and then selecting a subset of potential drug candidates from a
vast library of similar or distinct compounds, forms a fundamental step in combinatorial drug discovery (1). Recent advances in
high-throughput screening such as using micro/nanoarrays have
given further impetus to large-scale investigation of compounds.
However, combinatorial libraries often consist of extremely large
collections of chemical compounds, typically several million. The
time and cost of associated experiments makes it practically
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_4, Springer Science+Business Media, LLC 2011
2. Issues in Lead-Generation Library Design
Design constraints: In addition to diversity and representativeness, other design criteria include confinement, which quantifies the degree to which the properties of a set of compounds
lie in a prescribed range (8), and maximizing the activity of
the set of compounds against some predefined targets. Activity is usually measured in terms of the quantitative structure
of the given set. Additionally, the cost of chemical compounds
and experimental resources is significant and presents one of the
main impediments in combinatorial diagnostics and drug synthesis. Different compounds require different experimental supplies
which are typically available in limited quantities. The presence
of these multiple (and often conflicting) design objectives makes
the library design a multiobjective optimization problem with
constraints.
3. Basic Problem Formulation and Modifications
$$\min_{r_j \in \Omega,\; 1 \le j \le M} \; \sum_{i=1}^{N} p(x_i) \min_{1 \le j \le M} d(x_i, r_j) \qquad [1]$$
Here, $\Omega$ represents the chemical property space corresponding to the VCL, $d(x_i, r_j)$ represents an appropriate distance metric
between the lead compound $r_j$ and the compound $x_i$, $p(x_i)$ is the
relative weight that can be attached to compound $x_i$ (if all compounds are of equal importance, then the weights $p(x_i) = 1/N$ for
each $i$), and $M$ is typically much smaller than $N$. That is, this problem seeks a subset of $M$ lead compounds $r_j$ in a descriptor space
such that the average distance of a compound $x_i$ from its nearest
lead compound is minimized. Alternatively, this problem can also
be formulated as finding an optimal partition of the descriptor
space into $M$ clusters $R_j$ and assigning to each cluster a lead
compound $r_j$ such that the following cost function is minimized:
$$\sum_{j=1}^{M} \sum_{x_i \in R_j} d(x_i, r_j)\, p(x_i)$$
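The cost function above can be evaluated directly for a candidate set of lead compounds. The following is a minimal sketch, assuming squared Euclidean distance as the metric and hypothetical 2-d descriptor vectors; the function names `sq_dist` and `distortion` are illustrative and do not come from the chapter.

```python
def sq_dist(x, r):
    """Squared Euclidean distance between two descriptor vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(x, r))


def distortion(compounds, leads, weights=None, dist=sq_dist):
    """Weighted average distance of each compound to its nearest lead compound."""
    n = len(compounds)
    if weights is None:
        weights = [1.0 / n] * n  # equal importance: p(x_i) = 1/N
    return sum(w * min(dist(x, r) for r in leads)
               for x, w in zip(compounds, weights))


# Hypothetical 2-d descriptor vectors forming two tight groups.
compounds = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_leads = distortion(compounds, [(0.0, 1.0), (10.0, 1.0)])
one_lead = distortion(compounds, [(5.0, 1.0)])
```

With one lead per group the cost is far lower than with a single central lead, which is the behaviour the formulation rewards.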
weighted equally. However, design constraints often require distinguishing them from one another to reflect different aspects of
the clusters. For example, when addressing the issue of representativeness in the lead-generation library, the lead compounds that
represent larger clusters need to be distinguished from those that
represent outliers.
We incorporate representativeness into the problem formulation by specifying an additional relative weight parameter $\lambda_j$, $1 \le j \le M$, for each lead compound. This parameter $\lambda_j$ quantifies the
size of the cluster represented by the compound $r_j$, and it is
proportional to the number of compounds in that cluster.
Thus, the resulting library design will associate lead compounds
that represent outliers with low values of $\lambda_j$ and the lead compounds that represent the majority members with correspondingly
high values. In this way, the algorithm can be used to identify
distinct compounds through property vectors $r_j$ in the descriptor space that denote the $j$th lead compound and at the same
time determine how representative each lead compound is. For
instance, $\lambda_j = 0.2$ implies that lead compound $r_j$ represents 20%
of all compounds in the VCL. The following modified optimization problem describes the diversity goals in the basic
formulation as well as the representativeness through the relative
weights $\lambda_j$:
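$$\min_{r_j,\, \lambda_j,\; 1 \le j \le M} \; \sum_{i=1}^{N} p(x_i) \min_{1 \le j \le M} d(x_i, r_j) \quad \text{such that} \quad \sum_{j=1}^{M} \lambda_j = 1 \qquad [2]$$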
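To illustrate how the relative weights can express representativeness, the sketch below assigns each compound to its nearest lead compound and accumulates the compound weights per lead. This hard assignment is a simplifying assumption made here for illustration (the deterministic annealing algorithm described later uses soft associations), and the names `lead_weights` and `sq_dist` are hypothetical.

```python
def sq_dist(x, r):
    """Squared Euclidean distance (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(x, r))


def lead_weights(compounds, leads, weights=None):
    """Fraction of the (weighted) library represented by each lead compound."""
    n = len(compounds)
    if weights is None:
        weights = [1.0 / n] * n
    lam = [0.0] * len(leads)
    for x, w in zip(compounds, weights):
        nearest = min(range(len(leads)), key=lambda j: sq_dist(x, leads[j]))
        lam[nearest] += w
    return lam


# Hypothetical data: four similar compounds and one outlier.
compounds = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (10.0, 10.0)]
lam = lead_weights(compounds, [(0.05, 0.05), (10.0, 10.0)])
```

Here the outlier lead receives a weight of about 0.2, i.e. it represents 20% of the compounds, and the weights sum to one as required by the constraint.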
$$\sum_{n} \sum_{i} p_n(x_i^n) \min_{j} d(x_i^n, r_j) \qquad [3]$$
4. Computational Issues
Problem formulations [1]–[3] for designing lead-generation libraries
under different constraints belong to a class of combinatorial
resource allocation problems, which have been widely studied.
They arise in many different applications such as minimum distortion problems in data compression (11), facility location problems (12), optimal quadrature rules and discretization of partial
differential equations (13), locational optimization problems in
control theory (9), pattern recognition (14), and neural networks
(15). Combinatorial resource allocation problems are nonconvex
and computationally complex and it is well documented (16) that
most of them have many local minima that riddle the cost surface.
Therefore, the main computational issue is developing an efficient algorithm that avoids local minima. Due to the large size of
VCLs, and the combinatorial nature of the problem, the issue of
algorithm scalability takes central importance. Since the number
of computations to be performed by the lead-generation library
design algorithm scales up exponentially with an increase in the
amount of data, most algorithms become prohibitively slow and
expensive (computationally) for large datasets.
4.1. Deterministic Annealing Algorithm
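$$\min_{r_j} \; \min_{p(r_j|x_i)} \; D - T_k H \; := F$$
over iterations indexed by $k$, where $T_k$ is a parameter called temperature which tends to zero as $k$ tends to infinity. The cost function $F$ is called free energy, where this terminology is motivated
by statistical chemistry (18). Here the distortion
$$D = \sum_{i=1}^{N} p(x_i) \sum_{j=1}^{M} d(x_i, r_j)\, p(r_j|x_i)$$
is the weighted average distance between compounds and lead compounds, and $H$ is the entropy of the association weights $p(r_j|x_i)$. Minimizing $F$ with respect to these weights gives
$$p(r_j|x_i) = \frac{e^{-d(x_i, r_j)/T_k}}{Z_i}, \quad \text{where } Z_i := \sum_{j=1}^{M} e^{-d(x_i, r_j)/T_k} \qquad [4]$$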
Note that the weighting parameters p(rj |xi ) are simply radial
basis functions, which clearly decrease in value exponentially as rj
and xi move farther apart. The corresponding minimum of F is
obtained by substituting for p(rj |xi ) from equation [4]:
$$F = -T_k \sum_{i} p(x_i) \log Z_i \qquad [5]$$
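The association weights of equation [4] and the free energy of equation [5] can be sketched numerically as follows. This is an illustrative sketch, assuming squared Euclidean distance and equal compound weights by default; the function names are hypothetical, and the subtraction of the smallest distance is only a numerical safeguard against underflow at low temperature.

```python
import math


def sq_dist(x, r):
    """Squared Euclidean distance (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(x, r))


def associations(x, leads, temperature):
    """p(r_j | x): Gibbs weights of compound x over all lead compounds."""
    d = [sq_dist(x, r) for r in leads]
    d_min = min(d)  # shift exponents so the largest term is exp(0)
    e = [math.exp(-(v - d_min) / temperature) for v in d]
    z = sum(e)
    return [v / z for v in e]


def free_energy(compounds, leads, temperature, weights=None):
    """F = -T * sum_i p(x_i) * log Z_i, computed in a numerically stable way."""
    n = len(compounds)
    if weights is None:
        weights = [1.0 / n] * n
    total = 0.0
    for x, w in zip(compounds, weights):
        d = [sq_dist(x, r) for r in leads]
        d_min = min(d)
        log_z = -d_min / temperature + math.log(
            sum(math.exp(-(v - d_min) / temperature) for v in d))
        total += w * log_z
    return -temperature * total


compounds = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
leads = [(0.0, 1.0), (10.0, 1.0)]
p_mid = associations((5.0, 1.0), leads, temperature=1.0)
f_cold = free_energy(compounds, leads, temperature=0.01)
```

At low temperature each compound associates almost entirely with its nearest lead, so the free energy approaches the distortion of the hard nearest-lead assignment.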
To minimize $F$ with respect to the lead compounds $r_j$, we
[6]
As noted earlier, one of the major problems with combinatorial optimization algorithms is that of scalability, i.e., the number
of computations scales up exponentially with an increase in the
amount of data. In the DA algorithm, the computational complexity can be addressed in two steps: first by reducing the number of iterations and second by reducing the number of computations at every iteration. The DA algorithm, as described earlier,
exploits the phase transition feature (18) in its process to decrease
the number of iterations (in fact, in the DA algorithm the
temperature variable is typically decreased exponentially, which results in
few iterations). The number of computations per iteration in the
DA algorithm is $O(M^2 N)$, where $M$ is the number of lead compounds and $N$ is the total number of compounds in the underlying VCL. In this section, we present an algorithm that requires
fewer computations per iteration. This amendment becomes necessary in the context of the selection problem in combinatorial
chemistry, as the datasets are so large that DA is typically too slow and often fails to handle the computational complexity. We exploit the feature inherent in the DA algorithm that,
for a given temperature, the farther an individual compound is
from a cluster, the lower is its influence on the cluster (as is evident from equation [4]). That is, if two clusters are far apart,
then they have very small interaction between them. Thus, if we
ignore the effect of a separated cluster on the remaining compound locations, the resulting error will not be significant (see
Fig. 4.1). Ignoring the effects of separated regions (i.e., groups
Fig. 4.1. (a) Illustration depicting the different clusters in the dataset, together with the
interaction between each pair of points (and clusters). (b) Separated regions determined
after characterizing intercluster interaction and separation.
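$$A = \begin{pmatrix}
\sum_{x \in R_1} p(r_1|x)\,p(x) & \cdots & \sum_{x \in R_m} p(r_1|x)\,p(x) \\
\sum_{x \in R_1} p(r_2|x)\,p(x) & \cdots & \sum_{x \in R_m} p(r_2|x)\,p(x) \\
\vdots & \ddots & \vdots \\
\sum_{x \in R_1} p(r_m|x)\,p(x) & \cdots & \sum_{x \in R_m} p(r_m|x)\,p(x)
\end{pmatrix}$$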
In a probabilistic framework, this matrix is a finite-dimensional Markov operator, with the term $A_{j,i}$ denoting the
transition probability from region $R_i$ to $R_j$. The higher the transition probability, the greater is the amount of interaction between
the two regions. Once the transition matrix is formed, the next
step is to identify regions, that is, groups of clusters, which are
separate from the rest of the data. The separation is characterized by a quantity which we denote by $\epsilon$. We say a cluster ($R_j$) is
$\epsilon$-separate if the level of its interaction with each of the other clusters ($A_{j,i}$, $i = 1, 2, \ldots, n$, $i \ne j$) is less than $\epsilon$. The value $\epsilon$ is used
to partition the descriptor space into separate regions for reduced
and scalable computational effort, and it quantifies the increase
in the distortion cost function of the proposed scalable algorithm
with respect to the DA algorithm.
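The construction of the interaction matrix and the separation test can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: each region $R_i$ is taken to be the set of compounds most strongly associated with lead $r_i$, compounds carry equal weight, the distance is squared Euclidean, and all function names are hypothetical.

```python
import math


def sq_dist(x, r):
    return sum((a - b) ** 2 for a, b in zip(x, r))


def associations(x, leads, temperature):
    """Gibbs association weights p(r_j | x) at the given temperature."""
    d = [sq_dist(x, r) for r in leads]
    d_min = min(d)
    e = [math.exp(-(v - d_min) / temperature) for v in d]
    z = sum(e)
    return [v / z for v in e]


def interaction_matrix(compounds, leads, temperature):
    """A[j][i] = sum over x in R_i of p(r_j | x) p(x), with p(x) = 1/N."""
    n, m = len(compounds), len(leads)
    a = [[0.0] * m for _ in range(m)]
    for x in compounds:
        p = associations(x, leads, temperature)
        i = max(range(m), key=lambda j: p[j])  # region R_i that contains x
        for j in range(m):
            a[j][i] += p[j] / n
    return a


def separated_clusters(a, eps):
    """Indices j whose interaction with every other cluster is below eps."""
    m = len(a)
    return [j for j in range(m)
            if all(a[j][i] < eps for i in range(m) if i != j)]


# Hypothetical data: two neighbouring clusters and one distant cluster.
leads = [(0.0, 0.0), (1.0, 0.0), (20.0, 0.0)]
compounds = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (0.8, 0.0),
             (20.0, 0.0), (20.2, 0.0)]
A = interaction_matrix(compounds, leads, temperature=1.0)
separate = separated_clusters(A, eps=0.005)
```

Only the distant cluster passes the test; the two neighbouring clusters interact too strongly to be treated independently at this temperature.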
4.2.2. Trade-Off Between Error in Lead Compound Location and Computation Time
As was discussed in Section 4.2, the greater the number of separate regions we use, the smaller the computation time for the
scalable algorithm. At the same time, a greater number of separate regions results in a higher deviation in the distortion term
of the proposed algorithm from the original DA algorithm. This
trade-off between reduction in computation time and increase in
distortion error is systematically addressed in the following. For
any pair $(r_j, V)$, where $r_j$ is a lead compound and $V$ is a subset of
the descriptor space $\Omega$, we define
$$G_j(V) := \sum_{x_i \in V} x_i\, p(x_i)\, p(r_j|x_i), \qquad H_j(V) := \sum_{x_i \in V} p(x_i)\, p(r_j|x_i) \qquad [7]$$
Then, from the DA algorithm, the location of the lead compound ($r_j$) is determined by $r_j = G_j(\Omega)/H_j(\Omega)$. Since the cluster $\Omega_j$ is
separated from all the other clusters, the lead compound location
$\bar{r}_j$ will be determined in the scalable algorithm by
$$\bar{r}_j = \frac{G_j(\Omega_j)}{H_j(\Omega_j)} \qquad [8]$$
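The effect of truncating the computation to a single region can be illustrated numerically. The sketch below is an assumption-laden illustration: it takes the lead location to be the association-weighted centroid of the compounds (the usual deterministic annealing update for squared Euclidean distance), uses equal compound weights, and compares the centroid over the whole dataset with the centroid over one region only. All names are hypothetical.

```python
import math


def sq_dist(x, r):
    return sum((a - b) ** 2 for a, b in zip(x, r))


def associations(x, leads, temperature):
    d = [sq_dist(x, r) for r in leads]
    d_min = min(d)
    e = [math.exp(-(v - d_min) / temperature) for v in d]
    z = sum(e)
    return [v / z for v in e]


def lead_location(compounds, leads, j, temperature, subset=None):
    """Association-weighted centroid for lead j over all compounds or a subset."""
    indices = range(len(compounds)) if subset is None else subset
    num = [0.0] * len(leads[j])
    den = 0.0
    for i in indices:
        x = compounds[i]
        w = associations(x, leads, temperature)[j] / len(compounds)
        den += w
        num = [a + w * b for a, b in zip(num, x)]
    return [a / den for a in num]


leads = [(0.0, 0.0), (1.0, 0.0), (20.0, 0.0)]
compounds = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (0.8, 0.0),
             (20.0, 0.0), (20.2, 0.0)]

# Well-separated cluster: truncation changes almost nothing.
far_full = lead_location(compounds, leads, 2, 1.0)
far_trunc = lead_location(compounds, leads, 2, 1.0, subset=[4, 5])

# Interacting cluster: truncation shifts the lead location noticeably.
near_full = lead_location(compounds, leads, 0, 1.0)
near_trunc = lead_location(compounds, leads, 0, 1.0, subset=[0, 1])
```

The comparison shows why the separation threshold matters: ignoring the rest of the data is harmless for the distant cluster but introduces a visible error for a cluster that still interacts with its neighbour.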
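$$|r_j - \bar{r}_j| \le \frac{\max\{G_j(\Omega_j^c)\, H_j(\Omega_j),\; G_j(\Omega_j)\, H_j(\Omega_j^c)\}}{H_j(\Omega_j)\, H_j(\Omega)} \qquad [9]$$
where $\Omega_j^c = \Omega \setminus \Omega_j$. With $M_j^c = \frac{1}{N} \sum_{x_i \in \Omega_j^c} x_i$,
we note that
$$G_j(\Omega_j^c) \le \Big(\sum_{x_i \in \Omega_j^c} x_i\Big) H_j(\Omega_j^c) = N M_j^c\, H_j(\Omega_j^c) \qquad [10]$$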
1
N
xi xi
c
kj
r j rj
Mj M j
k =j
= max
,
j , where j =
MN
M M
kj
[11]
gives
[12]
j
j
N max
4.2.3. Scalable Algorithm
Mjc Mj
M , M
[13]
5. Simulation Results
5.1. Design for Diversity and Representativeness
Fig. 4.2. Simulation results for dataset 1. (a) The locations $x_i$, $1 \le i \le 200$, of compounds (circles) and $r_j$, $1 \le j \le 10$, of lead compounds (crosses) in the 2-d descriptor space. (b) The weights $\lambda_j$ associated with different locations of
lead compounds. (c) The given weight distribution $p(x_i)$ of the different compounds in the dataset. Reprinted (adapted
or in part) with permission from Journal of Chemical Information and Modeling. Copyright 2008 American Chemical
Society.
compromised. This is due to the fact that the algorithm inherently recognizes the natural clusters in the VCL. As is seen from
the figure, the algorithm identifies all clusters. The two clusters
which were quite distinct from the rest of the compounds are also
identified albeit with a smaller weight. As can be seen from the
pie chart, the outlier cluster was assigned a weight of 2%, while
the central cluster was assigned a significant weight of 22%.
5.2. Scalability and Computation Time
Fig. 4.4. (a) Separated regions $R_1$, $R_2$, $R_3$, and $R_4$ as determined by the proposed algorithm. (b) Comparison of lead
compound locations $r_j$ and $\bar{r}_j$. Reprinted (adapted or in part) with permission from Journal of Chemical Information
and Modeling. Copyright 2008 American Chemical Society.
Table 4.1
Comparison between the original and proposed algorithm

Algorithm            Distortion   Computation time (s)
The original DA      300.80       129.41
Proposed algorithm   316.51       21.53

Reprinted (adapted or in part) with permission from Journal of Chemical Information and Modeling. Copyright 2008 American Chemical Society
handle larger datasets. The results from the two algorithms are
presented in Table 4.1. As can be seen, the proposed scalable
algorithm takes just about 17% of the time used by the original (nonscalable) algorithm and results in only a 5.2% increase in
distortion; this was obtained for $\epsilon = 0.005$. Both the algorithms
were terminated when the number of lead compounds reached
12. The computation time for the scalable algorithm can be further reduced (by changing $\epsilon$), but at the expense of increased
distortion.
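The percentages quoted above follow directly from the values in Table 4.1, as this short check shows. The variable names are illustrative, and the second number reported for each algorithm is taken to be the computation time in seconds.

```python
# Values from Table 4.1.
da_distortion, da_time = 300.80, 129.41
proposed_distortion, proposed_time = 316.51, 21.53

time_fraction = proposed_time / da_time                      # share of the original run time
distortion_increase = proposed_distortion / da_distortion - 1.0
speedup = da_time / proposed_time

print(f"time used: {100 * time_fraction:.1f}% of original")
print(f"distortion increase: {100 * distortion_increase:.1f}%")
print(f"speed-up factor: {speedup:.1f}")
```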
5.2.1. Further Examples
Fig. 4.5. (a, b, c) Simulated dataset with locations xi of compounds (circles) and lead compound locations rj (crosses)
determined by the algorithm. Reprinted (adapted or in part) with permission from Journal of Chemical Information
and Modeling. Copyright 2008 American Chemical Society.
Table 4.2
Distortion and computation times for different datasets

Case     Algorithm            Distortion   Computation time (s)
Case 2   The original DA      290.06       44.19
         Proposed algorithm   302.98       11.98
Case 3   The original DA      672.31       60.43
         Proposed algorithm   717.52       39.77
Case 4   The original DA      808.83       127.05
         Proposed algorithm   848.79       41.85

Reprinted (adapted or in part) with permission from Journal of Chemical Information and Modeling. Copyright 2008 American Chemical Society
points each. Both the algorithms were executed till they identified 16 lead compound locations. Results for the three cases have
been presented in Table 4.2.
This dataset is a modified version of the test library set (19). Each
of the 50,000 members in this set is represented by 47 descriptors which include topological, geometric, hybrid, constitutional,
and electronic descriptors. These molecular descriptors are computed using the Chemistry Development Kit (CDK) Descriptor
Calculator (20, 21). These 47-dimensional data were then normalized and projected onto a two-dimensional space. The projection was carried out using Principal Component Analysis. Simulations were completed on this two-dimensional dataset. The
proposed scalable algorithm was used to identify 25 lead compound locations from this dataset (see Fig. 4.6). The algorithm
gave higher weights at locations which had larger numbers of similar compounds. Maximally diverse compounds are identified with
a very small weight. The original version of the algorithm could
not complete the computations for this dataset (on a 512 MB
RAM 1.5 GHz Intel Centrino processor).
5.4. Additional Constraints on Lead Compounds
Fig. 4.6. Choosing 25 lead compound locations from the drug discovery dataset.
Reprinted (adapted or in part) with permission from Journal of Chemical Information
and Modeling. Copyright 2008 American Chemical Society.
5.4.1. Constraints on Experimental Resources
In this dataset, the VCL is divided into three classes based on the
experimental supplies required by the compounds for testing, as
shown in Fig. 4.7a by different symbols. It contains a total of
280 compounds with 120 of the first class (denoted by circles),
40 of the second class (denoted by squares), and 120 of the third
class (denoted by triangles). We incorporate experimental supply
constraints into the algorithm by translating them into direct constraints on each of the lead compounds. With these experimental
supply constraints, the algorithm was used to select 15 lead compound locations ($r_j$) in this dataset with capacities ($W_{jn}$) fixed for
Fig. 4.7. (a) Simulation results with constraints on experimental resources. (b) Simulation results with exclusion constraint. The locations $x_i$, $1 \le i \le 90$, of compounds (circles) and $r_j$, $1 \le j \le 6$, of lead compounds (crosses). Dotted circles represent undesirable properties. Reprinted (adapted or in part) with permission from Journal of
Chemical Information and Modeling. Copyright 2008 American Chemical Society.
each class of resource. The crosses in Fig. 4.7a represent the selection from the algorithm in the wake of the capacity constraints for
different types of compounds. As can be seen from the selection,
the algorithm successfully addressed the key issues of diversity and
representativeness together with the constraints that were placed
due to experimental resources.
5.4.2. Constraints on Exclusion and Inclusion of Certain Properties
There may arise scenarios where we would like to inhibit selection of compounds exhibiting properties within certain prespecified ranges. This constraint can be easily incorporated in the
cost function by modifying the distance metric used in the problem formulation. Consider a case in a 2-d dataset where each
point xi has an associated radius (denoted by ij ). The selection problem is the same, but with the added constraint that all
the selected lead compounds (rj ) must be at least ij distance
removed from xi . The proposed algorithm can be modified to
solve this problem by defining the distance function, given by
2
d(xi , rj ) =
xi rj
ij , which penalizes any selection (rj )
which is in close proximity to the compounds in the VCL. For
the purpose of simulation, a dataset was created with 90 compounds (xi , i = 1, . . . , 90). The dotted circle around the locations
xi denotes the region in the property space that is to be avoided
by the selection algorithm. The objective was to select six lead
compounds from this dataset such that the criterion of diversity and representativeness is optimally addressed in the selected
subset. The selected locations are represented by crosses. From
Fig. 4.7b, note that the algorithm identifies the six clusters under
the constraint that none of the cluster centers are located in the
undesirable property space (denoted by dotted circles).
6. Conclusions
In this chapter, we proposed an algorithm for the design of lead-generation libraries. The problem was formulated in a constrained
multiobjective optimization setting and posed as a resource allocation problem with multiple constraints. As a result, we successfully tackled the key issues of diversity and representativeness of
compounds in the resulting library. Another distinguishing feature of the algorithm is its scalability, which makes it computationally efficient compared to other such optimization techniques.
We characterized the level of interaction between various clusters and used it to divide the clustering problem with huge data
size into manageable subproblems with small size. This resulted
in significant improvements in the computation time and enabled
the algorithm to be used on larger sized datasets. The trade-off
between computation effort and error due to truncation is also
characterized, thereby giving an option to the end user.
References
1. Gordon, E. M., Barrett, R. W., Dower, W. J., Fodor, S. P. A., Gallop, M. A. (1994) Applications of combinatorial technologies to drug discovery. 2. Combinatorial organic synthesis, library screening strategies, and future directions. J Med Chem 37(10), 1385–1401.
2. Blaney, J., Martin, E. (1997) Computational approaches for combinatorial library design and molecular diversity analysis. Curr Opin Chem Biol 1, 54–59.
3. Willett, P. (1997) Computational tools for the analysis of molecular diversity. Perspect Drug Discov Design 7/8, 1–11.
4. Rassokhin, D. N., Agrafiotis, D. K. (2000) Kolmogorov-Smirnov statistic and its applications in library design. J Mol Graph Model 18(4–5), 370–384.
5. Lipinski, C. A., Lombardo, F., Dominy, B. W., Feeney, P. J. (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development setting. Adv Drug Del Review 23, 3–25.
6. Higgs, R. E., Bemis, K. G., Watson, I. A., Wikel, J. H. (1997) Experimental designs for selecting molecules from large chemical databases. J Chem Inf Comput Sci 37, 861–870.
7. Clark, R. D. (1997) Optisim: an extended dissimilarity selection method for finding diverse representative subsets. J Chem Inf Comput Sci 37(6), 1181–1188.
8. Agrafiotis, D. K., Lobanov, V. S. (2000) Ultrafast algorithm for designing focussed combinatorial arrays. J Chem Inf Comput Sci 40, 1030–1038.
9. Salapaka, S., Khalak, A. (2003) Constraints on locational optimization problems. Proceedings of the IEEE Control and Decisions Conference. Maui, HI, 9–12 December 2003, pp. 1741–1746.
10. Sharma, P., Salapaka, S., Beck, C. (2008)
A scalable approach to combinatorial library
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
Chapter 5
Application of Free–Wilson Selectivity Analysis
for Combinatorial Library Design
Simone Sciabola, Robert V. Stanton, Theresa L. Johnson,
and Hualin Xi
Abstract
In this chapter we present an application of in silico quantitative structure–activity relationship (QSAR)
models to establish a new ligand-based computational approach for generating virtual libraries. The Free–Wilson
methodology was applied to extract rules from two data sets containing compounds which were
screened against either kinase or PDE gene family panels. The rules were used to make predictions for all
compounds enumerated from their respective virtual libraries. We also demonstrate the construction of
R-group selectivity profiles by deriving activity contributions against each protein target using the QSAR
models. Such selectivity profiles were used together with protein structural information from X-ray data
to provide a better understanding of the subtle selectivity relationships between kinase and PDE family
members.
Key words: QSAR, Free–Wilson, MLR, virtual libraries, combinatorial chemistry, protein kinase,
PDE, enzyme inhibition, enzyme selectivity, docking.
1. Introduction
Combinatorial chemistry has become an essential tool in the
pharmaceutical industry for identifying new leads and optimizing the potency of potential lead candidates while reducing the
time and costs associated with producing effective and competitive new drugs. By speeding up the process of chemical synthesis,
it is now possible to generate large diverse compound libraries
to screen for novel bioactivities. At the same time improvements
in high-throughput screening (HTS) allow selectivity panels for
J.Z. Zhou (ed.) , Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_5, Springer Science+Business Media, LLC 2011
Sciabola et al.
analysis (MLR) was used to model the selectivity profiles of different chemical series in our in-house kinase and PDE screening
panel. Overall, reliable estimations for R-group activity contributions against each protein in the data set were observed and
used for enumerating focused virtual libraries to predict more
selective inhibitors. When an external test set of cherry-picked
compounds was used to test the validity of the in silico models, a strong correlation of experimental versus predicted inhibition values was found. Lastly, the availability of X-ray structures
in the public domain for both PKs and PDEs allowed us to further validate our QSAR models by combining the information
from the Free–Wilson approach with the three-dimensional (3D)
structural knowledge of the target, providing more insight into
specific enzyme selectivity.
2. Methods
2.1. Assay Conditions
concentration of 20 nM. The catalytic domain of PDEs was incubated with a reaction mixture of 50 mM Tris–HCl, pH 7.5,
1.3 mM MgCl2, 1 mM DTT, and ³H-cAMP or ³H-cGMP at
room temperature on an orbital shaker for 30 min. Compounds
to be tested are submitted to the assay at a concentration of
4 mM in 100% DMSO. Compounds are initially diluted in 50%
DMSO/water. Subsequent dilutions are in 15% DMSO/water to
achieve 5× the desired assay concentration. Each well receives
10 μL drug or DMSO vehicle, 20 μL ³H-cAMP or ³H-cGMP,
and 20 μL enzyme (diluted 1:1,000 in assay buffer). The incubation is terminated by the addition of 25 μL of PDE SPA beads
(0.2 mg/well). The reaction product ³H-AMP or ³H-GMP
was precipitated out by BaSO4 while unreacted ³H-cAMP or
³H-cGMP remained in the supernatant. After centrifugation, the
radioactivity in the supernatant was measured in a liquid scintillation counter after a 500 min delay. The enzymatic properties were analyzed by steady-state kinetics. The nonlinear
regression of the Michaelis–Menten equation as well as Eadie–Hofstee
plots were analyzed to obtain the values of KM, Vmax, and
kcat. For measurement of IC50, ten concentrations of inhibitors
were used at the substrate concentration of <1/10 KM and the
suitable enzyme concentration. All measurements were repeated
three times.
2.2. Data Sets
Fig. 5.1. Number of compounds tested against each of the 45 protein kinases in the
in-house selectivity panel. Histogram bars are subdivided according to the compounds
frequency in the four kinase chemical series.
pICCalc
50
pIC50 @1M,
= pIC50 @10M,
Table 5.1
2D depiction for the five chemical series. R-positions represent sites which were allowed to change within a given
library while X-positions indicate not changing chemical
matter whose structure cannot be disclosed

Protein family   Chemical series      Number of compounds   R-groups
Kinases          Diaminopyrimidine    388                   R1 = 77, R2 = 183
Kinases          Pyrrolopyrazole      312                   R1 = 124, R2 = 87
Kinases          Pyrrolopyrimidine    181                   R1 = 8, R2 = 169
Kinases          Quinazoline          94                    R1 = 19, R2 = 5, R3 = 37, R4 = 33
PDEs             Pyrazolopyrimidine   1505                  R1 = 62, R2 = 157, R3 = 872, R4 = 339
PDE compounds were tested for IC50; therefore, no data transformation was required in this case. The negative logarithm of
IC50 was used as the dependent variable in the model-building
process.
2.3. Free–Wilson (FW)
activity is additive and constant, regardless of the structural variation on the other sites of substitution in the rest of the molecule.
The classical Free–Wilson linear model is expressed by the following equation:
$$\mathrm{BioActivity} = \sum_{ij} \alpha_{ij} R_{ij} + \mu$$
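A Free–Wilson model of this kind can be fitted by ordinary least squares on indicator variables. The following is a minimal sketch on hypothetical data, not the authors' implementation: one substituent per site is chosen as a reference (an assumption made here so that the coefficients are identifiable), the intercept plays the role of the constant term, and the small Gaussian elimination solver is included only to keep the example free of external libraries.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_free_wilson(compounds, activities, reference):
    """Return (intercept, {(site, group): contribution}) from indicator regression."""
    columns = sorted({(s, g) for c in compounds for s, g in c.items()
                      if reference[s] != g})
    x = [[1.0] + [1.0 if c[s] == g else 0.0 for s, g in columns]
         for c in compounds]
    p = len(columns) + 1
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(x, activities)) for i in range(p)]
    coef = solve(xtx, xty)  # normal equations: (X^T X) beta = X^T y
    return coef[0], dict(zip(columns, coef[1:]))


# Hypothetical series with two substitution sites and perfectly additive pIC50 values.
compounds = [{"R1": "a", "R2": "c"}, {"R1": "b", "R2": "c"},
             {"R1": "a", "R2": "d"}, {"R1": "b", "R2": "d"}]
activities = [5.0, 6.0, 5.5, 6.5]
mu, contrib = fit_free_wilson(compounds, activities, {"R1": "a", "R2": "c"})
```

On this toy data the fit recovers a baseline of 5.0, a gain of 1.0 for substituent b at R1, and a gain of 0.5 for substituent d at R2.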
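$$r^2 = \frac{\left[\sum_{i \in test} \left(y_i^{pred} - \bar{y}^{pred}\right)\left(y_i^{act} - \bar{y}^{act}\right)\right]^2}{\sum_{i \in test} \left(y_i^{pred} - \bar{y}^{pred}\right)^2 \sum_{i \in test} \left(y_i^{act} - \bar{y}^{act}\right)^2}$$
$$STE = \sqrt{\frac{1}{n-2}\left[\sum_{i \in test} \left(y_i^{act} - \bar{y}^{act}\right)^2 - \frac{\left[\sum_{i \in test} \left(y_i^{pred} - \bar{y}^{pred}\right)\left(y_i^{act} - \bar{y}^{act}\right)\right]^2}{\sum_{i \in test} \left(y_i^{pred} - \bar{y}^{pred}\right)^2}\right]}$$
Here, $y_i^{pred}$ is the predicted activity of the $i$th test compound,
$y_i^{act}$ is its measured activity, $\bar{y}^{pred}$ and $\bar{y}^{act}$ are the averages
of the predicted and measured activity values, respectively, and $n$
is the sample size.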
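Both statistics can be computed from paired predicted and measured values in a few lines. The sketch below assumes the standard definitions of the squared Pearson correlation and of the standard error of the estimate with n − 2 degrees of freedom; the function name and the example values are hypothetical.

```python
import math


def r2_and_ste(pred, act):
    """Squared Pearson correlation and standard error of the estimate."""
    n = len(pred)
    mean_pred = sum(pred) / n
    mean_act = sum(act) / n
    s_pa = sum((p - mean_pred) * (a - mean_act) for p, a in zip(pred, act))
    s_pp = sum((p - mean_pred) ** 2 for p in pred)
    s_aa = sum((a - mean_act) ** 2 for a in act)
    r2 = s_pa ** 2 / (s_pp * s_aa)
    # max() guards against a tiny negative value caused by rounding
    ste = math.sqrt(max(0.0, s_aa - s_pa ** 2 / s_pp) / (n - 2))
    return r2, ste


# Hypothetical predicted and measured pIC50 values.
r2_exact, ste_exact = r2_and_ste([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r2_noisy, ste_noisy = r2_and_ste([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

A perfect linear relationship gives a correlation of 1 and a standard error of 0, while the noisier example gives lower correlation and a nonzero error.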
The squared Pearson correlation coefficients for the linear models built upon the Diaminopyrimidine, Pyrrolopyrazole,
Pyrrolopyrimidine, and Quinazoline series across the 45 protein kinases are, respectively, in the range of
$r^2_{fitting}$ = 0.82–0.95 (average $r^2_{fitting}$ = 0.87),
$r^2_{fitting}$ = 0.73–0.93 (average $r^2_{fitting}$ = 0.85),
$r^2_{fitting}$ = 0.36–0.99 (average $r^2_{fitting}$ = 0.80), and
$r^2_{fitting}$ = 0.46–0.97 (average $r^2_{fitting}$ = 0.76). For the PDE case study, the
correlation coefficients for the Pyrazolopyrimidine series when
tested in the PDE2 and PDE10 biochemical assays are, respectively,
$r^2_{fitting}$ = 0.94 ± 0.17 and $r^2_{fitting}$ = 0.92 ± 0.18. The highly
significant correlation between experimental and calculated pIC50
Fig. 5.2. Leave-One-Out cross-validation results reported as predicted vs. experimental pIC50 values for the four kinase
chemical series. In general, model prediction of pIC50 is in good agreement with experimental pIC50 derived from percent
of inhibition, with a global correlation coefficient $r^2_{corr,CV}$ = 0.90 for the Diaminopyrimidine (a), $r^2_{corr,CV}$ = 0.84 for the
Pyrrolopyrazole (b), $r^2_{corr,CV}$ = 0.77 for the Pyrrolopyrimidine (c), and $r^2_{corr,CV}$ = 0.73 for the Quinazoline (d) series.
Fig. 5.3. Leave-One-Out cross-validation results for the Pyrazolopyrimidine series tested in the biochemical assays PDE2
(a) and PDE10 (b).
were obtained for the Pyrrolopyrazole series, where LOO estimations of 5,413 objects gave an overall correlation coefficient
$r^2_{corr,CV}$ = 0.85 (STE = 0.47), the Pyrrolopyrimidine series with
$r^2_{corr,CV}$ = 0.77 (STE = 0.53 based on 650 LOO estimations),
and the Quinazoline series where 707 LOO estimations resulted in
$r^2_{corr,CV}$ = 0.73 (STE = 0.64). The same cross-validation protocol
was carried out in the case of the Pyrazolopyrimidine series, obtaining
the following correlation coefficients: $r^2_{corr,CV}$ = 0.78 (STE =
0.38, 485 LOO estimations) and $r^2_{corr,CV}$ = 0.76 (STE = 0.46,
473 LOO estimations) when tested in the PDE2 and PDE10
assays, respectively (Fig. 5.3).
Since Free–Wilson models use the presence or absence of
distinct R-group fragments as the basic variables in regression,
the derived model coefficients can be treated as a quantitative
estimate of the activity contribution of each R-group. If
the additivity assumption holds, these R-group contributions
can be used to make reliable predictions for all the enumerated
compounds in a virtual library, where all R-group fragments are
crossed with each other.
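Enumerating a virtual library and predicting its activities from additive R-group contributions can be sketched as follows. The contribution values and names are hypothetical, and the sketch assumes that every R-group at one site can be combined with every R-group at the other sites.

```python
from itertools import product


def enumerate_library(mu, contributions):
    """Predict activity for every R-group combination under the additive model."""
    sites = sorted(contributions)
    library = {}
    for combo in product(*(sorted(contributions[s]) for s in sites)):
        library[combo] = mu + sum(contributions[s][g]
                                  for s, g in zip(sites, combo))
    return library


# Hypothetical baseline and per-site R-group contributions (in pIC50 units).
mu = 5.0
contributions = {
    "R1": {"a": 0.0, "b": 1.0},
    "R2": {"c": 0.0, "d": 0.5, "e": -0.3},
}
library = enumerate_library(mu, contributions)
best = max(library, key=library.get)
```

Two R1 groups crossed with three R2 groups give six virtual compounds, of which only a subset would typically have been synthesized; the additive model supplies a prediction for each.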
2.5. Virtual Library Space Analysis
As a result, many compounds with desired potency and selectivity profiles could potentially be missed. By using high-quality
QSAR models, the activity and selectivity of compounds in the
virtual library can be reliably estimated, thus, greatly expanding
the chemical space coverage and increasing the chance of finding
compounds with attractive biological properties.
To demonstrate this, we enumerated the full virtual library
for the five chemical series shown in Table 5.1. We obtained
861 compounds for the Diaminopyrimidine series, 1,764 compounds for the Pyrrolopyrazole series, 598 for the Pyrrolopyrimidine series, 2,370 for the Quinazoline series, and 214,486 for
the Pyrazolopyrimidine series, using only those R-groups from
the existing compounds for which the activity contribution could
be estimated across the 45 protein kinases (first four chemical
series) and the two PDE assays (Pyrazolopyrimidine). We then
calculated their selectivity profile using the QSAR models derived
from Free–Wilson analysis. Among the existing compounds in the
kinase series, 27 of them (17 Diaminopyrimidines, 1 Pyrrolopyrazole, 6 Pyrrolopyrimidines, 3 Quinazolines) met our selectivity
criteria (pIC50 > 5.3 against no more than 5 kinases on the panel).
In the full virtual library, however, 111 additional compounds
(57 Diaminopyrimidines, 8 Pyrrolopyrazoles, 31 Pyrrolopyrimidines, 15 Quinazolines) were predicted to be selective. In the PDE
series, the library expansion provided a greater enrichment
in the number of potentially selective compounds, moving from
three selective compounds in the original library (pIC50 ≥ 7 in
one assay and pIC50 ≤ 5.3 in the second assay) to 4,103 selective
compounds in the virtual space.
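The kinase selectivity criterion can be applied to predicted profiles with a simple filter. The sketch below uses hypothetical compound names and pIC50 values, and it additionally assumes that a selective compound must be active against at least one kinase, which the text does not state explicitly.

```python
def is_selective(profile, threshold=5.3, max_hits=5):
    """True if pIC50 exceeds the threshold for 1 to max_hits kinases."""
    hits = [target for target, pic50 in profile.items() if pic50 > threshold]
    return 1 <= len(hits) <= max_hits  # lower bound of 1 is an assumption


# Hypothetical predicted pIC50 profiles over a small kinase panel.
profiles = {
    "cpd1": {"K1": 7.2, "K2": 4.9, "K3": 5.0, "K4": 4.1, "K5": 4.4, "K6": 5.1},
    "cpd2": {"K1": 6.8, "K2": 6.1, "K3": 5.9, "K4": 6.4, "K5": 5.5, "K6": 7.0},
    "cpd3": {"K1": 4.0, "K2": 4.2, "K3": 4.5, "K4": 4.1, "K5": 4.9, "K6": 5.0},
}
selective = [name for name, profile in profiles.items() if is_selective(profile)]
```

Only the first compound passes: the second hits too many kinases and the third is inactive across the panel.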
We have also noticed an increase in the number of kinases
selectively targeted upon expansion of the inhibitors' chemical space, suggesting that such a procedure would also be suitable as a tool for exploring potential target hopping. Indeed,
when applied to our data set, existing selective compounds from
the Diaminopyrimidine, Pyrrolopyrazole, Pyrrolopyrimidine, and
Quinazoline series targeted 14, 5, 7, and 3 protein kinases,
respectively. However, after complete enumeration of the virtual
libraries, 28, 19, 31, and 12 protein kinases were predicted to
be selectively inhibited by compounds in the four series, respectively. This shows how series originally developed for a specific
kinase could be turned into selective inhibitors for other kinases
by exploiting different R-group combinations.
2.6. R-Group
Selectivity Profiles
R2-group structures, giving rise to two different matrices containing 36 × 45 R1- and 26 × 45 R2-group contributions. In the
Pyrrolopyrazole series, a total of 60 R1- and 35 R2-group structures were available for analysis, leading to two coefficient matrices of 60 × 45 R1- and 35 × 45 R2-group contributions. Analysis
of the R-group structures for the Pyrrolopyrimidine and Quinazoline series resulted in two coefficient matrices of 3 × 45 R1- and
57 × 45 R2-group contributions for the former series and four
coefficient matrices of 4 × 45 R1-, 2 × 45 R2-, 15 × 45 R3-, and
11 × 45 R4-group contributions for the latter. In the Pyrazolopyrimidine PDE series, a total of 5 R1-, 79 R2-, 543 R3-, and 3
R4-group structures were available for analysis, leading to four
coefficient matrices of 5 × 2 R1-, 79 × 2 R2-, 543 × 2 R3-, and 3 × 2
R4-group contributions.
The main objective in this R-group selectivity analysis was to
detect whether small changes in structure could give rise to large
variations in activity. This was achieved by computing all pairwise structural similarities between R-groups at each substitution
site, using a combination of structural descriptors (37, 38) with the
Tanimoto coefficient as the similarity measure, and then keeping only R-group pairs
with Tanimoto similarity greater than 0.8. Afterward, each surviving R-group pair was assigned a profile given by the difference between the original coefficient profiles of the two R-groups being
compared. This produced one selectivity map for each R-group
position within each different chemical series. Figures 5.4 and
5.5 show a few snapshots of this data transformation for the
Diaminopyrimidine and Pyrazolopyrimidine series, reported as
heat maps where each R-group pair/assay combination is assigned
a color ranging from white (ΔpIC50 = 0) to red (ΔpIC50 ≥ 2).
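The pair-filtering and profile-differencing steps can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each R-group is represented by a set of "on" fingerprint bits and by a list of per-assay Free-Wilson coefficients; the function names and toy values are hypothetical and are not the descriptors used in the study.

```python
from itertools import combinations

def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def selectivity_map(fingerprints, coefficients, threshold=0.8):
    """Keep R-group pairs above the similarity threshold and return, for each
    pair (A, B), the per-assay difference of coefficients (B minus A)."""
    rows = {}
    for a, b in combinations(sorted(fingerprints), 2):
        if tanimoto(fingerprints[a], fingerprints[b]) > threshold:
            rows[(a, b)] = [cb - ca for ca, cb in
                            zip(coefficients[a], coefficients[b])]
    return rows

# Toy data: rgA and rgB differ by one bit; rgC is unrelated.
fps = {"rgA": set(range(1, 10)), "rgB": set(range(1, 11)), "rgC": {20, 21}}
coefs = {"rgA": [0.5, 1.0], "rgB": [2.0, 1.0], "rgC": [0.0, 0.0]}
smap = selectivity_map(fps, coefs)
```

Each row of the resulting map corresponds to one row of the heat map: a structurally similar R-group pair with its difference in contribution for every assay.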
Fig. 5.4. Structural models for binding site interactions of the Diaminopyrimidine series. Selectivity maps are shown next
to each binding site model. ΔpIC50 for the specific R-group pair/assay combination is highlighted in yellow (R-group
combinations are reported as rows and protein kinase assays as columns within the heat map). (a) R-groupA (orange)
and R-groupB (violet) at site R1 of Diaminopyrimidine docked into the crystal structure of GSK3 (1O9U). The extra
methyl in R-groupB is responsible for its increased activity contribution. (b) Position R2 of Diaminopyrimidine in protein
kinase PAK4 (2CDZ). R-groupB (violet) undergoes a 45° rotation in order to orient the tert-butyloxy tail toward the buried
lipophilic pocket made up of residues R586, M585, and L448.
Fig. 5.5. Structural models for binding site interactions of the Pyrazolopyrimidine series. Selectivity maps are shown next
to each binding site model. ΔpIC50 for the specific R-group pair/assay combination is highlighted in yellow (R-group
combinations are reported as rows and PDE assays as columns within the heat map). (a) R-groupA (orange) and
R-groupB (violet) at site R2 of Pyrazolopyrimidine docked into the in-house PDE2 crystal structure. The extra phenethyl
moiety in R-groupB makes an extended hydrophobic interaction with residue L809 and is responsible for the observed
increase in activity. (b) Position R3 of Pyrazolopyrimidine in the PDE2 crystal structure. The two extra linker atoms
in R-groupB (violet) determine its different binding mode compared to R-groupA. The 1,3-dimethoxy benzene
portion of R-groupB undergoes a 90° rotation in order to orient itself toward a buried lipophilic pocket and interact
directly with the side chain of residue L770.
Table 5.2
R-group/kinase contributions from Free-Wilson selectivity maps

Core                Site  Protein (structure)  F-W ΔpIC50 (RB − RA)
Diaminopyrimidine   R1    gsk3 (1O9U)          +1.8
Diaminopyrimidine   R2    pak4 (2CDZ)          +2.5
Pyrazolopyrimidine  R2    pde2 (in-house)      +1.8
Pyrazolopyrimidine  R3    pde2 (in-house)      +1.1

(The R-groupA and R-groupB columns contain structure drawings that are not reproduced here.)
Similar conclusions can be derived when analyzing the core-docking results for the Pyrazolopyrimidine series. The in-house
X-ray structure of PDE2 was used to elucidate the difference
in activity (ΔpIC50 = 1.8) when moving from R-groupA to
R-groupB (Table 5.2) at position R2 of the Pyrazolopyrimidine
core. Figure 5.5a highlights the structural explanation:
the additional phenyl ring at this site does
not influence the R-group binding mode but extends the
stacked hydrophobic interaction toward residue L809. When position R3 of the Pyrazolopyrimidine series was examined, a ΔpIC50
of 1.1 units was obtained by substituting two highly similar
R-groups in the PDE2 biochemical assay (R-groupA and
R-groupB in Table 5.2). Figure 5.5b shows how the variation
in R-group composition determines a different binding mode
for the two R-groups, with the 1,3-dimethoxy benzene portion
of R-groupB now filling a hydrophobic pocket in the active site
made up of a combination of lipophilic (L770, L809, I866, I870)
residues, and optimizing stacked hydrophobic interactions with
the isopropyl moiety of residue L770 (Fig. 5.5b).
3. Notes
1. The Free-Wilson approach has proven to be a successful
strategy for the analysis of data sets where large library
collections of compounds obtained through combinatorial
chemistry have been screened against a panel of related proteins or target families, thus aiding the overall quest for
selective inhibitors.
2. A key advantage of the Free-Wilson method over standard descriptor-based QSAR techniques is the estimation of activity contributions for individual R-group
structures, which are readily interpretable to medicinal
chemists.
3. Expanding the original chemical space of a
given chemical series into a complete virtual library allowed us to identify compounds with desirable selectivity profiles.
4. The major disadvantage lies in the use of R-groups as
descriptors in model building, which gives the models a
well-defined boundary of the chemical space that can be
predicted. The method can only explore the chemical space defined by
the R-group combinations present in the training set compounds and cannot be applied, as it is, for predicting the
activity of new compounds with R-groups beyond those
used in the analysis.
5. Data preparation and quality control are key steps in applying the Free-Wilson methodology to model biological data.
Care must be taken to make sure the underlying data comply with the FW additive assumption.
6. Compounds with correlated R-groups and outlier compounds whose R-groups did not occur in other compounds
were removed from the data set as the activity contribution
for these R-groups could not be estimated.
7. In the case of sparse structural matrices, these were normally
rearranged into independent blocks where R-groups from
one block would not cross over with other blocks, and statistical analysis was applied to each block separately to estimate the activity contribution for each R-group. Blocks
whose R-group activity contributions could not be estimated due to a lack of R-group crossovers were
eliminated. The block separation and compound removal
procedure maximized the total number of R-group activity
contributions that could be estimated.
8. LOO cross-validation analysis of FW QSAR models
showed an overall agreement between predicted and experimental pIC50 for each individual combination of chemical
series and protein target.
9. The construction of R-group selectivity profiles based on
in silico R-group contributions allowed us to identify structural determinants for selectivity where a small modification
in the R-groups results in a significant difference in selectivity
profiles.
10. The R-group selectivity knowledge coupled with the availability of X-ray data for many of the kinase/PDE structures
provides a basis for scientists to formulate novel lead
transformation ideas for inhibitor compounds with better
physicochemical properties.
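The block separation mentioned in Note 7 can be pictured as finding connected components in a graph whose nodes are R-groups and whose edges are compounds. The following Python sketch shows one way to do this with a union-find structure; it is an assumption about how such a split could be implemented, not the authors' procedure, and the function name and toy compounds are hypothetical.

```python
def independent_blocks(compounds):
    """Group compounds into blocks that share no R-group with other blocks.

    compounds: list of tuples, one R-group label per substitution site.
    Returns a sorted list of blocks, each a list of compound indices.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Link every R-group of a compound to the R-group at its first site.
    for rgroups in compounds:
        first = (0, rgroups[0])
        find(first)
        for site, label in enumerate(rgroups[1:], start=1):
            parent[find((site, label))] = find(first)

    blocks = {}
    for index, rgroups in enumerate(compounds):
        blocks.setdefault(find((0, rgroups[0])), []).append(index)
    return sorted(blocks.values())

# Toy example: the last compound shares no R-group with the others.
toy_compounds = [("a", "x"), ("a", "y"), ("b", "y"), ("c", "z")]
blocks = independent_blocks(toy_compounds)
```

Here the first three compounds form one block because they are linked through shared R-groups, while the fourth compound forms a block of its own and would be a candidate for removal.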
Acknowledgment
This chapter is adapted in part with permission from Simone
Sciabola et al. (2008) J Chem Info Model 48, 1851–1867.
Copyright 2008 American Chemical Society.
References
1. Manning, G., Whyte, D. B., Martinez, R.,
Hunter, T., Sudarsanam, S. (2002) The
protein kinase complement of the human
genome. Science 298, 1912–1934.
Chapter 6
Application of QSAR and Shape Pharmacophore Modeling
Approaches for Targeted Chemical Library Design
Jerry O. Ebalunode, Weifan Zheng, and Alexander Tropsha
Abstract
Optimization of chemical library composition affords more efficient identification of hits from biological
screening experiments. The optimization could be achieved through rational selection of reagents used in
combinatorial library synthesis. However, with a rapid advent of parallel synthesis methods and availability
of millions of compounds synthesized by many vendors, it may be more efficient to design targeted
libraries by means of virtual screening of commercial compound collections. This chapter reviews the
application of advanced cheminformatics approaches such as quantitative structure-activity relationships
(QSAR) and pharmacophore modeling (both ligand and structure based) for virtual screening. Both
approaches rely on empirical SAR data to build models; thus, the emphasis is placed on achieving models
of the highest rigor and external predictive power. We present several examples of successful applications
of both approaches for virtual screening to illustrate their utility. We suggest that the expert use of both
QSAR and pharmacophore models, either independently or in combination, enables users to achieve
targeted libraries enriched with experimentally confirmed hit compounds.
Key words: QSAR modeling, pharmacophore modeling, model validation, virtual screening.
1. Introduction
There is an increasing realization that rationally designed chemical libraries significantly facilitate the process of discovering new
drug candidates. The library is described as focused (or targeted)
when compounds selected into the library are optimized with
respect to at least one target property [the property(-ies) can be
specific biological activities and/or various desired parameters of
drug likeness, including drug safety, that are generally covered by
the optimal ADME/Tox paradigm]. Naturally, rational design of
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_6, Springer Science+Business Media, LLC 2011
such libraries is only possible when a sufficient amount of experimental data (e.g., results of biological testing for ligands and/or
target structural information) relevant to the target property(-ies)
is available.
In the early days of combinatorial chemistry, rational design
of chemical libraries frequently implied the selection of building
blocks (from a large available pool) that would produce a reduced
library enriched with potential hit compounds. For instance,
in one of our earlier studies we have developed an approach
termed FOCUS-2D (1, 2) for designing targeted libraries via
rational selection of building blocks. The approach was based
on a virtual combinatorial synthesis procedure where the products were assembled by combining reagents (or building blocks)
into virtual compounds. The building blocks were sampled using
a stochastic optimization procedure, and the scoring function optimized in this process was either the similarity of products to
one or more known active compounds or the target activity predicted from
independently developed quantitative structure-activity relationship (QSAR) models. The virtual library of high scoring (i.e.,
predicted to be active) compounds was assembled and analyzed
in terms of building blocks found with the highest frequency
within selected compounds; thus, the ultimate goal of the study
was the rational selection of building blocks that would be used to
build a complete chemical library (as opposed to cherry-picking
selected compounds).
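The general idea of scoring virtual products and then ranking building blocks by how often they occur among the top-scoring products can be sketched as follows. This is not the FOCUS-2D implementation: plain random sampling stands in for the stochastic optimizer, products are represented by the union of hypothetical building-block feature sets, and the score is the Tanimoto similarity to a hypothetical reference active. All names and values are illustrative assumptions.

```python
import random
from collections import Counter

def tanimoto(a, b):
    """Tanimoto coefficient between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def select_building_blocks(pools, features, reference,
                           n_samples=500, top_fraction=0.1, seed=0):
    """Sample virtual products, keep the best-scoring fraction, and count
    how often each building block occurs at each position."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_samples):
        blocks = tuple(rng.choice(pool) for pool in pools)
        product_features = set().union(*(features[b] for b in blocks))
        scored.append((tanimoto(product_features, reference), blocks))
    scored.sort(reverse=True)
    top = scored[: max(1, int(len(scored) * top_fraction))]
    counts = [Counter() for _ in pools]
    for _, blocks in top:
        for position, block in enumerate(blocks):
            counts[position][block] += 1
    return counts

# Toy example: two reagent pools; A1 + B1 reproduces the reference features.
pools = [["A1", "A2"], ["B1", "B2"]]
features = {"A1": {1, 2}, "A2": {7, 8}, "B1": {3}, "B2": {9}}
counts = select_building_blocks(pools, features, reference={1, 2, 3})
preferred = [c.most_common(1)[0][0] for c in counts]
```

The most frequent building block at each position is the one that would be carried forward into the full combinatorial library.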
Although studies into rational building block selection such
as those described above were popular in the early days of computational combinatorial chemistry, the alternative approaches
looking into rational selection of compounds from commercial libraries of already synthesized or synthetically feasible compounds have gradually prevailed. In fact, in a popular review
Jamois (3) has compared reagent-based vs. product-based strategies for library design and concluded that several studies have
demonstrated the superiority of product-based designs in yielding diverse and representative subsets. Nowadays, large commercial libraries and services that provide integrated links to commercially available compounds are widely available (for instance,
ca. 10 M compounds have been compiled in the publicly available ZINC database (4)); see a recent review (5) for a partial
list of additional chemical databases. Thus, most of the current approaches employ various virtual screening strategies to
select specific compound subsets for subsequent experimental
exploration.
This chapter discusses the application of popular cheminformatics approaches, such as rigorously built QSAR models and
shape pharmacophore models, to the problem of targeted library
design. QSAR models offer a unique ability to rationalize existing experimental SAR data in the form of robust quantitative
models that predict target property directly from structural chemical descriptors; thus, they can be used to screen an external chemical library to select compounds predicted to be active against
the target. Conversely, shape pharmacophore models utilize the
representative shape of active ligands or the negative image (or
pseudo molecule) extracted from the binding site of the target
protein to query 3D conformational databases of virtual or real
molecular libraries. With enough attention paid to critical issues of
model validation and applicability domain definition, both QSAR
and shape pharmacophore models could be used successfully (and
concurrently) to mine external virtual libraries to identify putative compounds with the desired target properties. The selected
compounds then become candidates for the resulting rationally
designed compound library.
This chapter will initially discuss current algorithms for developing externally predictive QSAR models and present experimentally confirmed examples of identifying novel bioactive compounds by means of QSAR model-based virtual screening. It
will also present a novel shape pharmacophore modeling method
and its validation through retrospective analysis of known biologically active compounds. Of course, many approaches, both
structure based and ligand based, have been used for virtual
screening. We have decided to focus on these specific methodologies, i.e., QSAR and pharmacophore modeling, because both
approaches are well known to both computational and medicinal chemists as structure optimization tools used at later stages
of drug discovery after the lead compounds have been identified experimentally. However, in recent years these approaches,
among other cheminformatics methods (6), have found new
applications as virtual screening tools. The methods and applications discussed in this chapter should be of interest to both
computational and synthetic chemists and experimental biologists
working in the areas of biological screening of chemical libraries.
2. Predictive
QSAR Models as
Virtual Screening
Tools
One of the most important problems in QSAR analysis is establishing the domain of applicability for each model. In the absence
of the applicability domain restriction, each model can formally
predict the activity of any compound, even with a completely different structure from those included in the training set. Thus, the
absence of the model applicability domain as a mandatory component of any QSAR model would lead to the unjustified extrapolation of the model in the chemistry space and, as a result, a
high likelihood of inaccurate predictions. In our research we have
always paid particular attention to this issue (12, 20–27). A good
overview of commonly used applicability domain definitions can
be found in reference (28).
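As a concrete illustration of an applicability domain, the Python sketch below uses one simple distance-to-training-set definition: a query compound is considered inside the domain if its nearest training compound in descriptor space is closer than a cutoff derived from the nearest-neighbor distances within the training set. The cutoff form (mean plus Z times the standard deviation), the default Z = 0.5, and all function names are assumptions made for this example; other definitions are reviewed in the reference cited above.

```python
import math

def _dist(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_distances(train):
    """Distance from each training compound to its nearest training neighbor."""
    return [min(_dist(a, b) for j, b in enumerate(train) if j != i)
            for i, a in enumerate(train)]

def domain_threshold(train, z=0.5):
    """Cutoff = mean + z * standard deviation of nearest-neighbor distances."""
    d = nearest_neighbor_distances(train)
    mean = sum(d) / len(d)
    std = math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))
    return mean + z * std

def in_domain(query, train, z=0.5):
    """True if the query lies within the cutoff of some training compound."""
    nearest = min(_dist(query, t) for t in train)
    return nearest <= domain_threshold(train, z)

# Toy two-descriptor training set on a unit square.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
inside = in_domain((0.5, 0.5), train)
outside = in_domain((5.0, 5.0), train)
```

Predictions for compounds flagged as outside the domain would simply be discarded rather than reported.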
In our earlier publications (8, 12) we have recommended
a set of statistical criteria which must be satisfied by a predictive model. For continuous QSAR, the criteria that we follow
in developing activity/property predictors are as follows: (i) correlation coefficient $R$ between the predicted and the observed
activities; (ii) coefficients of determination (29) for regressions through the origin (predicted versus
observed activities, $R_0^2$, and observed versus predicted activities, $R_0'^2$); (iii) slopes $k$ and $k'$ of the regression lines through the origin. We consider a QSAR model predictive if the following conditions are satisfied: (i) $q^2 > 0.5$; (ii) $R^2 > 0.6$;
(iii) $(R^2 - R_0^2)/R^2 < 0.1$ and $0.85 \le k \le 1.15$, or $(R^2 - R_0'^2)/R^2 < 0.1$ and $0.85 \le k' \le 1.15$;
(iv) $|R_0^2 - R_0'^2| < 0.3$, where $q^2$ is the cross-validated correlation coefficient calculated for the training set, but
all other criteria are calculated for the test set (for additional discussion, see (30)).
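These acceptance conditions translate directly into a small check. The Python sketch below is an illustration only: the function names are hypothetical, and the assignment of which through-origin regression yields the unprimed and which the primed quantities is an assumption made here (the unprimed pair is taken from regressing observed on predicted values, the primed pair from the reverse regression).

```python
def _mean(values):
    return sum(values) / len(values)

def r_squared(x, y):
    """Squared Pearson correlation coefficient between x and y."""
    mx, my = _mean(x), _mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def through_origin(target, regressor):
    """Fit target = k * regressor (no intercept); return (k, R0 squared)."""
    k = (sum(t * r for t, r in zip(target, regressor))
         / sum(r * r for r in regressor))
    mt = _mean(target)
    ss_res = sum((t - k * r) ** 2 for t, r in zip(target, regressor))
    ss_tot = sum((t - mt) ** 2 for t in target)
    return k, 1.0 - ss_res / ss_tot

def is_predictive(q2, observed, predicted):
    """Apply conditions (i) to (iv); q2 comes from training-set cross-validation,
    observed/predicted are test-set activities."""
    r2 = r_squared(observed, predicted)
    k, r0 = through_origin(observed, predicted)
    k_p, r0_p = through_origin(predicted, observed)
    cond3 = (((r2 - r0) / r2 < 0.1 and 0.85 <= k <= 1.15)
             or ((r2 - r0_p) / r2 < 0.1 and 0.85 <= k_p <= 1.15))
    return q2 > 0.5 and r2 > 0.6 and cond3 and abs(r0 - r0_p) < 0.3

# Toy test set: perfect predictions versus perfectly reversed predictions.
obs = [5.0, 6.0, 7.0, 8.0]
good = is_predictive(0.7, obs, [5.0, 6.0, 7.0, 8.0])
bad = is_predictive(0.7, obs, [8.0, 7.0, 6.0, 5.0])
```

The reversed example shows why the through-origin conditions matter: its squared correlation is 1, yet it fails condition (iii) and is rejected.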
In our recent studies we were fortunate to recruit experimental collaborators who have validated computational hits identified
by virtual screening of commercially available compound libraries
using rigorously validated QSAR models. Examples include anticonvulsants (25), HIV-1 reverse transcriptase inhibitors (32),
D1 antagonists (33), antitumor compounds (34), beta-lactamase
inhibitors (35), human histone deacetylase (HDAC) inhibitors
(36), and geranylgeranyltransferase-I inhibitors (37). Thus, models resulting from predictive QSAR workflow could be used to
prioritize the selection of chemicals for the experimental validation. To illustrate the power of validated QSAR models as virtual screening tools, we shall discuss the examples of studies that
resulted in experimentally confirmed hits. We note that such studies could only be done if there are sufficient data available for
a series of tested compounds such that robust validated models
could be developed using the workflow described in Fig. 6.2.
The following examples illustrate the use of QSAR models developed with predictive QSAR modeling and validation workflow
(Fig. 6.2) for virtual screening of commercial libraries to identify
experimentally confirmed hits.
2.2.1. Discovery of
Novel Anticancer Agents
A combined approach of validated QSAR modeling and virtual screening was successfully applied to the discovery of novel
tylophorine derivatives as anticancer agents (34). QSAR models were initially developed for 52 chemically diverse
phenanthrene-based tylophorine derivatives (PBTs) with known
experimental EC50 using chemical topological descriptors (calculated with the MolConnZ program) and the variable selection
k-nearest neighbor (kNN) method. Several validation protocols
were applied to achieve robust QSAR models. The original data set was divided into multiple training and test sets, and
the models were considered acceptable only if the leave-one-out
cross-validated R2 (q2) values were greater than 0.5 for the training sets and the correlation coefficient R2 values were greater
than 0.6 for the test sets. Furthermore, the q2 values for the
actual data set were shown to be significantly higher than those
obtained for the same data set with randomized target properties
(Y-randomization test), indicating that models were statistically
significant. Ten best models were then employed to mine a commercially available ChemDiv Database (ca. 500K compounds)
resulting in 34 consensus hits with moderate to high predicted
activities. Ten structurally diverse hits were experimentally tested
and eight were confirmed active, with the highest experimental
EC50 of 1.8 μM, implying an exceptionally high hit rate (80%).
The same 10 models were further applied to predict EC50 for
four new PBTs, and the correlation coefficient (R2) between the
experimental and the predicted EC50 for these compounds plus
eight active consensus hits was shown to be as high as 0.57.
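The two validation steps named above, leave-one-out cross-validation and the Y-randomization test, can be sketched with a plain (non-variable-selection) kNN regressor. This is a simplified stand-in for the method used in the study: the descriptors, the value of k, the Euclidean distance, and the function names are all assumptions made for illustration.

```python
import math
import random

def knn_predict(x, train_x, train_y, k=2):
    """Predict activity as the mean activity of the k nearest training compounds."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
    return sum(train_y[i] for i in order[:k]) / k

def loo_q2(xs, ys, k=2):
    """Leave-one-out q2 = 1 - PRESS / total sum of squares."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        press += (ys[i] - knn_predict(xs[i], rest_x, rest_y, k)) ** 2
    return 1.0 - press / sum((y - mean_y) ** 2 for y in ys)

def y_randomization(xs, ys, n_rounds=20, k=2, seed=0):
    """q2 values obtained after shuffling the activities among the compounds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        scores.append(loo_q2(xs, shuffled, k))
    return scores

# Toy data: one descriptor, activity increasing linearly with it.
xs = [(float(i),) for i in range(12)]
ys = [0.5 * i for i in range(12)]
real_q2 = loo_q2(xs, ys)
random_q2 = y_randomization(xs, ys)
```

A model is considered statistically meaningful in this test when the q2 for the real activities clearly exceeds every q2 obtained with shuffled activities.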
2.2.2. Discovery of
Novel Histone
Deacetylase (HDAC)
Inhibitors
Histone deacetylases (HDACs) play a critical role in transcription regulation. Small-molecule HDAC inhibitors have emerged
as a promising class of agents for the treatment of cancer and other cell proliferation diseases. We have generated QSAR models for 59 chemically diverse
compounds with inhibition activity on class I HDAC. MOE (38)- and MolConnZ (39)-based 2D descriptors were combined with
the variable selection k-nearest neighbor (kNN) and support vector machine (SVM)
approaches independently to improve the predictive power of
models. Rigorous model validation approaches were employed
including randomization of target activity (Y-randomization test)
and assessment of model predictability by consensus prediction
on two external data sets. Highly predictive QSAR models were
generated with leave-one-out cross-validation R2 (q2) values for
the training set and R2 values for the test set as high as 0.81 and
0.80, respectively, with the MolConnZ/kNN approach and 0.94 and
0.81, respectively, with the MolConnZ/SVM approach. Validated
QSAR models were then used to mine four chemical databases:
National Cancer Institute (NCI) database, Maybridge database,
ChemDiv database, and one ZINC database, including a total of
over 3 million compounds. The searches resulted in 48 consensus hits, including two reported HDAC inhibitors that were not
included in the original data set. Four hits with novel structural
features were purchased and tested using the same biological assay
that was employed to assess the inhibition activity of the training
set compounds. Three of these four compounds were confirmed
active with the best inhibitory activity (IC50) of 1 μM. The overall
workflow for model development, validation, and virtual screening is illustrated in Fig. 6.3.
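The notion of a consensus hit can be illustrated with a short Python sketch. The text does not spell out the exact consensus rule, so the rule used here is an assumption: a compound is kept only if every model returns a prediction (a model may decline by returning None, for example when the compound is outside its applicability domain) and every prediction reaches the activity cutoff. The function name, the toy models, and the toy library are hypothetical.

```python
def consensus_hits(library, models, cutoff):
    """Return {compound: mean prediction} for compounds that all models
    predict at or above the cutoff."""
    hits = {}
    for name, descriptors in library.items():
        predictions = [model(descriptors) for model in models]
        # Skip compounds that any model rejects or predicts below the cutoff.
        if any(p is None or p < cutoff for p in predictions):
            continue
        hits[name] = sum(predictions) / len(predictions)
    return hits

# Toy models acting on two-element descriptor tuples.
toy_models = [
    lambda d: d[0],
    lambda d: d[1],
    lambda d: None if d[0] > 9 else (d[0] + d[1]) / 2,  # declines far-out compounds
]
toy_library = {"c1": (7.0, 8.0), "c2": (7.0, 5.0), "c3": (10.0, 10.0)}
hits = consensus_hits(toy_library, toy_models, cutoff=6.5)
```

Only the first toy compound survives: the second is predicted inactive by one model and the third is rejected by the model that declines to predict it.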
2.2.3. Discovery of
Novel
Geranylgeranyltransferase-I
(GGTase-I) Inhibitors
Fig. 6.3. Application of predictive QSAR workflow including virtual screening to discover
novel HDAC inhibitors.
(GGTIs) have therapeutic potential to treat inflammation, multiple sclerosis, atherosclerosis, and many other diseases. Following
our standard QSAR modeling workflow, we have developed and
rigorously validated models for 48 GGTIs using variable selection k-nearest neighbor (40), automated lazy learning (26),
and genetic algorithm-partial least squares (41) QSAR methods.
The QSAR models were employed for virtual screening of 9.5
million commercially available chemicals yielding 47 diverse computational hits. Seven of these compounds with novel scaffolds
and high predicted GGTase-I inhibitory activities were tested in
vitro, and all were found to be bona fide and selective micromolar
inhibitors.
Figure 6.4 shows the structures of both representative training set compounds and confirmed computational hits. We should
emphasize that QSAR models have been traditionally viewed
as lead optimization tools capable of predicting compounds
with chemical structure similar to the structure of molecules
used for the training set. However, this study clearly indicates
(Fig. 6.4) that with enough attention given to the model development process and using chemical descriptors characterizing
whole molecules (as opposed to, e.g., chemical fragments), it is
indeed possible to discover compounds with novel chemical scaffolds. Furthermore, in our study we have additionally demonstrated that these novel hits could not be identified using tradi-
Fig. 6.4. Discovery of GGTase-I inhibitors with novel chemical scaffolds using a combination of QSAR modeling and
virtual screening. Labels in the figure: Peptidomimetics; Pyrazoles; Mean IC50 5 μM; Sigma: IC50 = 8 μM; Asinex: IC50 = 35 μM; Enamine: IC50 = 43 μM; Two similar hits.
3. Shape
Pharmacophore
Modeling as
Virtual Screening
Tool
in structure-based drug design. When the constraints of critical functional groups and their spatial orientation are taken into
account, together with shape complementarity, one can create a
shape pharmacophore model. This latter model has proved to be
more effective in virtual screening experiments. In the following
sections, we first describe the basic concept of shape and shape
pharmacophore modeling and then present some recent literature
examples.
3.1. Basic Concept of
Molecular Shape
Analysis
3.1.1. Alignment-Based
and Alignment-Free
Methods
3.1.1.1.
Alignment-Based
Algorithms
the first time (47). The shape of each atom was described as a
suitable electron density function, and then three Gaussian functions were fitted to each of the atomic electron density functions.
An analytical shape similarity index was formulated according to
the Carbo index (50, 51). Molecular superposition was achieved
via the optimization of the similarity index.
Shape-Matching Method by Grant and Pickup (40, 51). This
method defines a Gaussian density for each atom to replace the
hard sphere representation of atoms (52). The molecular volume
is expressed as a series of integration terms, representing the intersection volumes between the atoms in a molecule. The Gaussian
description was used to compare the shapes of two molecules by
optimizing their volume overlap using analytical derivatives with
respect to rotations and translations. This idea was later implemented in the ROCS program. However, it has been pointed
out (53) that ROCS, by default, gives the same radius value to
all heavy atoms in the molecule. This approximation led to the
conclusion that the volume calculation in ROCS might not be
as accurate as expected from the original theory of Grant and
Pickup. Nonetheless, ROCS has been shown to be very successful in many validation studies and actual applications.
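The Gaussian description makes the overlap of two atoms an analytical expression, which is what allows fast shape comparison. The Python sketch below illustrates this with a deliberately simplified model: only pairwise (first-order) atom-atom overlaps are summed, no rotational or translational optimization is performed, and the Gaussian amplitude `P = 2.7` with an exponent chosen to reproduce the hard-sphere volume is an assumption made for the example. It is not the ROCS implementation.

```python
import math

P = 2.7  # Gaussian amplitude; value assumed here for illustration

def alpha(radius):
    """Exponent chosen so that one Gaussian integrates to the sphere volume."""
    return math.pi * (3.0 * P / (4.0 * math.pi * radius ** 3)) ** (2.0 / 3.0)

def pair_overlap(center_a, radius_a, center_b, radius_b):
    """Analytical overlap integral of two spherical atomic Gaussians."""
    a, b = alpha(radius_a), alpha(radius_b)
    d2 = sum((x - y) ** 2 for x, y in zip(center_a, center_b))
    return P * P * math.exp(-a * b * d2 / (a + b)) * (math.pi / (a + b)) ** 1.5

def overlap(mol_a, mol_b):
    """First-order overlap: sum over all atom pairs of the two molecules."""
    return sum(pair_overlap(ca, ra, cb, rb)
               for ca, ra in mol_a for cb, rb in mol_b)

def shape_tanimoto(mol_a, mol_b):
    """Shape Tanimoto from overlaps, for molecules in a fixed relative pose."""
    o_ab = overlap(mol_a, mol_b)
    return o_ab / (overlap(mol_a, mol_a) + overlap(mol_b, mol_b) - o_ab)

# Toy molecules: lists of (center, radius); the second is the first moved 50 Å away.
mol = [((0.0, 0.0, 0.0), 1.7), ((1.5, 0.0, 0.0), 1.7)]
far = [((50.0, 0.0, 0.0), 1.7), ((51.5, 0.0, 0.0), 1.7)]
same = shape_tanimoto(mol, mol)
apart = shape_tanimoto(mol, far)
```

In a real shape-matching run the relative pose would be optimized to maximize the overlap before the Tanimoto value is reported.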
3.1.1.2. Alignment-Free
Algorithms
The PEST and PESD Methods. These methods were developed by Breneman's group. The PEST (property-encoded surface translator) method is based on the combination of the
TAE descriptors (54) and the shape signatures idea by Zauhar
(45). It uses the TAE molecular surface representations to define
property-encoded boundaries. It first computes the molecular surface property distributions and then collects ray-tracing
path information and lastly generates the shape descriptors. The
2D histograms are generated to represent surface shape profile, encoding both shape and surface properties. Similarly, the
property-encoded shape distributions (PESD) descriptors have
recently been reported and employed to study ligand-protein
binding affinities (56). The PESD algorithm is different from
PEST in that it is based on a fixed number of randomly sampled point pairs on the molecular surface that does not require
ray tracing. Both PEST and PESD descriptors should account for
the distribution of both the polar and non-polar regions and electrostatic potential on the molecular surface.
The USR (Ultrafast Shape Recognition) Method. This method
was reported by Ballester and Richards (53) for compound
database search on the basis of molecular shape similarity. It was
reportedly capable of screening billions of compounds for similar shapes on a single computer. The method is based on the
notion that the relative position of the atoms in a molecule is
completely determined by inter-atomic distances. Instead of using
all inter-atomic distances, USR uses a subset of distances, reducing the computational costs. Specifically, the distances from
all atoms of a molecule to each of four strategic points are calculated. Each set of distances forms a distribution, and the first three
moments (mean, variance, and skewness) of each of the four distributions
are calculated. Thus, for each molecule, 12 USR descriptors are
calculated. The inverse of the translated and scaled Manhattan
distance between two shape descriptors is used to measure the
similarity between the two molecules. A value of 1 corresponds
to maximum similarity and a value of 0 corresponds to minimum similarity.
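A compact Python sketch of this descriptor scheme is given below. Several details are assumptions made for the example rather than statements from the text: the four reference points are taken to be the molecular centroid, the atom closest to the centroid, the atom farthest from the centroid, and the atom farthest from that atom; the third descriptor of each distribution is computed as the signed cube root of the third central moment; and the similarity is 1 / (1 + mean absolute descriptor difference).

```python
import math

def _moments(values):
    """Mean, variance, and a skewness-type term of a distance distribution."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, var, skew]

def usr_descriptors(coords):
    """Twelve shape descriptors from atom coordinates (list of 3-tuples)."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    d_ctd = [math.dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]   # closest atom to centroid
    fct = coords[d_ctd.index(max(d_ctd))]   # farthest atom from centroid
    d_fct = [math.dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]   # farthest atom from fct
    descriptors = []
    for ref in (ctd, cst, fct, ftf):
        descriptors.extend(_moments([math.dist(c, ref) for c in coords]))
    return descriptors

def usr_similarity(a, b):
    """Inverse of the translated and scaled Manhattan distance; 1 = identical."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))

# Toy molecule, a translated copy, and a copy scaled by a factor of two.
mol = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0),
       (0.0, 1.5, 1.0), (3.0, 0.5, 0.2)]
moved = [(x + 10.0, y - 3.0, z + 2.0) for x, y, z in mol]
scaled = [(2 * x, 2 * y, 2 * z) for x, y, z in mol]
d_mol = usr_descriptors(mol)
sim_moved = usr_similarity(d_mol, usr_descriptors(moved))
sim_scaled = usr_similarity(d_mol, usr_descriptors(scaled))
```

Because the descriptors depend only on distances, the translated copy scores essentially 1, while the enlarged copy scores much lower; no alignment step is needed in either case.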
3.2. Examples of
Application of Shape
and Pharmacophore
Models for Virtual
Screening
3.2.1. Ligand-Based
Studies
When a few ligands are known for a particular target, one can
use ligand-based shape-matching technology to search for potential ligands via virtual screening. A ligand-based application of
shape-matching methods starts with a ligand with known biological activity. Its 3D conformations are often pregenerated by a
conformer generator. Also, a multiconformer database of potential drug molecules is pregenerated to be used by the shape-matching program. For alignment-based methods, the conformers of the known ligand (i.e., the query) will be directly aligned
with those of the database molecules. Molecules that align well
with the query molecule will be selected for further consideration.
In the case of alignment-free methods, both the shape descriptors
of the query and those of the database molecules are first calculated, and a similarity value is calculated between the query and
each of the database molecules. Molecules with better similarity
values to the query are selected for further consideration.
In the validation study by Hawkins et al. (60), the shape-matching method ROCS was compared to 7 well-known docking
tools, in terms of their abilities to recover known ligands for 21
different protein targets. The comparative study showed that the
3D shape method (ROCS) performed at least as well as, and
often better than, the docking tools studied. Their work indicated
that shape-based virtual screening methods could be both efficient
(in terms of the computational speed) and effective (in terms of
hit enrichment) in virtual screening projects.
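Hit enrichment in such retrospective comparisons is commonly summarized by an enrichment factor: the fraction of actives among the top-ranked compounds divided by the fraction of actives in the whole database. The short Python sketch below shows this calculation; the function name and the toy ranking are hypothetical, and the cited studies may have used other enrichment measures.

```python
def enrichment_factor(ranked_labels, fraction):
    """Enrichment factor for the top fraction of a ranked list.

    ranked_labels: 1 for a known active, 0 otherwise, ordered best score first.
    """
    n_total = len(ranked_labels)
    n_top = max(1, int(round(n_total * fraction)))
    actives_total = sum(ranked_labels)
    actives_top = sum(ranked_labels[:n_top])
    return (actives_top / n_top) / (actives_total / n_total)

# Toy ranking: both actives are found in the top 20% of ten compounds.
ranking = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(ranking, 0.2)
```

An enrichment factor of 1 corresponds to random selection, so larger values indicate that the method concentrates actives at the top of the list.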
In a comparative study, McGaughey et al. (61) investigated
several 2D similarity methods (including Daylight fingerprint similarity (62) and TOPOSIM (63)), 3D shape similarity methods
(ROCS and SQ (64)), and several known docking tools (FLOG
(65), FRED (66), and Glide (67, 68)). Based on the performance on a benchmark set of 11 protein targets, they observed
that, on average, the ligand-based shape method with chemistry
constraints outperformed more sophisticated docking tools. Their
results also demonstrated that shape matching (including chemistry constraints) could select more diverse active compounds
than 2D similarity methods. This indicates that shape-matching
tools may offer a better scaffold hopping capability than 2D
methods.
Moffat et al. (69) also compared three ligand-based shape
similarity methods, including CatShape (70), FBSS (71), and
ROCS. These methods have been compared on the basis of
retrospective virtual screening experiments. All three methods
have demonstrated significant enrichment, but ROCS with CFF
option (CFF: chemical force field) gave the best performance.
They reported that shape matching, coupled with chemistry constraints, afforded better enrichment factors than shape-matching
alone, indicating the importance of including chemistry information in the search. This observation is consistent with the
recent validation study by Hawkins et al. (60) and by Ebalunode et al. (72). In general, flexible methods gave slightly better performance than the respective rigid search methods; however, the increased performance could not justify the increased
low Tanimoto coefficients between the found hits and the query
molecule. Visual inspection also confirmed that none of the nine
actives found shared a common scaffold with the template. Thus,
this example demonstrated the power of a pure shape similarity
method for scaffold hopping projects.
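The Tanimoto coefficient mentioned above can be illustrated with a minimal sketch, assuming each fingerprint is represented as a set of "on" bit indices. The bit sets below are invented for illustration and are not taken from the study; the function name is likewise hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    union = fp_a | fp_b
    if not union:
        # Convention assumed here: two empty fingerprints score 0.0.
        return 0.0
    return len(fp_a & fp_b) / len(union)


# Invented bit sets: the hit shares only one on-bit with the query.
query = {1, 4, 7, 9, 12}
hit = {4, 20, 21, 33, 40}
score = tanimoto(query, hit)  # 1 shared bit out of 9 distinct bits
```

A low score of this kind between a shape-based hit and the query is what signals that the two molecules differ in 2D topology even though their shapes match.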
Finally, Ebalunode et al. reported a structure-based shape
pharmacophore modeling for the discovery of novel anesthetic
compounds (88). The 3D structure of apoferritin, a surrogate
target for GABA_A, was used as the basis for the development of
several shape pharmacophore models. They demonstrated that
(1) the method effectively recovered known anesthetic agents
from a diverse database of compounds; (2) the shape pharmacophore scores had a significant linear correlation with the measured binding data of several anesthetic compounds, without
prior calibration and fitting; and (3) the computed scores also
correctly predicted the trend of the EC50 values of a set of
anesthetics.
4. Summary and Conclusions
We have discussed the application of cheminformatics approaches
such as QSAR and shape pharmacophore modeling to the problem of targeted library design by means of virtual screening. Both
approaches offer unique abilities to rationalize existing experimental SAR data in the form of models that could identify novel
compounds predicted to interact with the specific target. Pharmacophore models achieve this task by establishing that a compound contains specific chemical features characteristic of known
bioactive compounds, whereas QSAR models have the ability to
predict the target activity quantitatively from structural chemical
descriptors of compounds. As with any computational molecular
modeling approach, it is imperative that both QSAR and pharmacophore modeling approaches are used expertly. Therefore, this
chapter has focused on the discussion of critical components of
both approaches that should be studied and executed rigorously
to enable their successful application. We have shown that with
enough attention paid to critical issues of model validation and
(in the case of QSAR modeling) applicability domain definition,
the models could indeed be used successfully to mine external
virtual libraries, especially of commercially available chemicals, to
create targeted compound libraries with desired properties. The
methods and applications discussed in this chapter should be of
help to both computational and synthetic chemists and experimental biologists working in the areas of biological screening of
chemical libraries.
Acknowledgments
A.T. acknowledges support from NIH (grant R01GM066940).
J.E. and W.Z. acknowledge the financial support by the Golden
Leaf Foundation via the BRITE Institute, North Carolina Central University. W.Z. also acknowledges funding from NIH (grant
SC3GM086265).
References
1. Zheng, W., Cho, S. J., Tropsha, A. (1998) Rational combinatorial library design. 1. Focus-2D: a new approach to the design of targeted combinatorial chemical libraries. J Chem Inf Comput Sci 38, 251–258.
2. Cho, S. J., Zheng, W., Tropsha, A. (1998) Focus-2D: a new approach to the design of targeted combinatorial chemical libraries, in (Altman, R. B., Dunker, A. K., Hunter, L., Klein, T. E., eds.) Pacific Symposium on Biocomputing 98, Hawaii, Jan 4–9, 1998. World Scientific, Singapore, pp. 305–316.
3. Jamois, E. A. (2003) Reagent-based and product-based computational approaches in library design. Curr Opin Chem Biol 7, 326–330.
4. Irwin, J. J., Shoichet, B. K. (2005) ZINC – a free database of commercially available compounds for virtual screening. J Chem Inf Model 45, 177–182.
5. Oprea, T., Tropsha, A. (2006) Target, chemical and bioactivity databases – integration is key. Drug Discov Today 3, 357–365.
6. Varnek, A., Tropsha, A. (2008) Cheminformatics Approaches to Virtual Screening, RSC, London.
7. Tropsha, A. (2006) I in (Martin, Y. C., ed.) Comprehensive Medicinal Chemistry I. pp. 113–126, Elsevier, Oxford.
8. Golbraikh, A., Tropsha, A. (2002) Beware of q2! J Mol Graph Model 20, 269–276.
9. Novellino, E., Fattorusso, C., Greco, G. (1995) Use of comparative molecular field analysis and cluster analysis in series design. Pharm Acta Helv 70, 149–154.
10. Norinder, U. (1996) Single and domain made variable selection in 3D QSAR applications. J Chemomet 10, 95–105.
11. Tropsha, A., Cho, S. J. (1998) in (Kubinyi, H., Folkers, G., and Martin, Y. C., eds.) 3D QSAR in Drug Design. Kluwer, Dordrecht, The Netherlands, pp. 57–69.
12. Tropsha, A., Gramatica, P., Gombar, V. K. (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. Quant Struct Act Relat Comb Sci 22, 69–77.
13. Golbraikh, A., Tropsha, A. (2002) Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. J Comput Aided Mol Des 16, 357–369.
14. Golbraikh, A., Shen, M., Xiao, Z., Xiao, Y. D., Lee, K. H., Tropsha, A. (2003) Rational selection of training and test sets for the development of validated QSAR models. J Comput Aided Mol Des 17, 241–253.
15. Pavan, M., Netzeva, T. I., Worth, A. P. (2006) Validation of a QSAR model for acute toxicity. SAR QSAR Environ Res 17, 147–171.
16. Vracko, M., Bandelj, V., Barbieri, P., Benfenati, E., Chaudhry, Q., Cronin, M., Devillers, J., Gallegos, A., Gini, G., Gramatica, P., Helma, C., Mazzatorta, P., Neagu, D., Netzeva, T., Pavan, M., Patlewicz, G., Randic, M., Tsakovska, I., Worth, A. (2006) Validation of counter propagation neural network models for predictive toxicology according to the OECD principles: a case study. SAR QSAR Environ Res 17, 265–284.
17. Saliner, A. G., Netzeva, T. I., Worth, A. P. (2006) Prediction of estrogenicity: validation of a classification model. SAR QSAR Environ Res 17, 195–223.
18. Roberts, D. W., Aptula, A. O., Patlewicz, G. (2006) Mechanistic applicability domains for non-animal based prediction of toxicological endpoints. QSAR analysis of the Schiff base applicability domain for skin sensitization. Chem Res Toxicol 19, 1228–1233.
19. Zhang, S., Golbraikh, A., Tropsha, A. (2006) Development of quantitative structure-binding affinity relationship models based on novel geometrical chemical descriptors of the protein-ligand interfaces. J Med Chem 49, 2713–2724.
20. Golbraikh, A., Bonchev, D., Tropsha, A. (2001) Novel chirality descriptors derived
57. Nilakantan, R., Bauman, N., Venkataraghavan, R. (1993) New method for rapid characterization of molecular shapes: applications in drug design. J Chem Inf Comput Sci 33, 79–85.
58. Schlosser, J., Rarey, M. (2009) Beyond the virtual screening paradigm: structure-based searching for new lead compounds. J Chem Inf Model 49, 800–809.
59. Zauhar, R. J. (1995) SMART: a solvent-accessible triangulated surface generator for molecular graphics and boundary element applications. J Comput Aided Mol Des 9, 149–159.
60. Hawkins, P. C., Skillman, A. G., Nicholls, A. (2007) Comparison of shape-matching and docking as virtual screening tools. J Med Chem 50, 74–82.
61. McGaughey, G. B., Sheridan, R. P., Bayly, C. I., Culberson, J. C., Kreatsoulas, C., Lindsley, S., Maiorov, V., Truchon, J. F., Cornell, W. D. (2007) Comparison of topological, shape, and docking methods in virtual screening. J Chem Inf Model 47, 1504–1519.
62. Daylight. version 4.82. 2003. Aliso Viejo, CA, USA, Daylight Chemical Information Systems Inc.
63. Kearsley, S. K., Sallamack, S., Fluder, E. M., Andose, J. D., Mosley, R. T., Sheridan, R. P. (1996) Chemical similarity using physiochemical property descriptors. J Chem Inf Comput Sci 36, 118–127.
64. Miller, M. D., Sheridan, R. P., Kearsley, S. K. (1999) SQ: a program for rapidly producing pharmacophorically relevant molecular superpositions. J Med Chem 42, 1505–1514.
65. Miller, M. D., Kearsley, S. K., Underwood, D. J., Sheridan, R. P. (1994) FLOG: a system to select quasi-flexible ligands complementary to a receptor of known three-dimensional structure. J Comput Aided Mol Des 8, 153–174.
66. McGann, M. R., Almond, H. R., Nicholls, A., Grant, J. A., Brown, F. K. (2003) Gaussian docking functions. Biopolymers 68, 76–90.
67. Friesner, R. A., Banks, J. L., Murphy, R. B., Halgren, T. A., Klicic, J. J., Mainz, D. T., Repasky, M. P., Knoll, E. H., Shelley, M., Perry, J. K., Shaw, D. E., Francis, P., Shenkin, P. S. (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47, 1739–1749.
68. Halgren, T. A., Murphy, R. B., Friesner, R. A., Beard, H. S., Frye, L. L., Pollard, W. T., Banks, J. L. (2004) Glide: a new approach
Chapter 7
Combinatorial Library Design from Reagent Pharmacophore Fingerprints
Hongming Chen, Ola Engkvist, and Niklas Blomberg
Abstract
Combinatorial and parallel chemical synthesis technologies are powerful tools in early drug discovery
projects. Over the past couple of years an increased emphasis on targeted lead generation libraries and
focussed screening libraries in the pharmaceutical industry has driven a surge in computational methods
to explore molecular frameworks to establish new chemical equity. In this chapter we describe a complementary technique in the library design process, termed ProSAR, to effectively cover the accessible
pharmacophore space around a given scaffold. With this method reagents are selected such that each
R-group on the scaffold has an optimal coverage of pharmacophoric features. This is achieved by optimising the Shannon entropy, i.e. the information content, of the topological pharmacophore distribution
for the reagents. As this method enumerates compounds with a systematic variation of user-defined pharmacophores to the attachment point on the scaffold, the enumerated compounds may serve as a good
starting point for deriving a structure–activity relationship (SAR).
Key words: ProSAR, combinatorial library design, topological pharmacophore, pharmacophore
fingerprint, genetic algorithm, Shannon entropy, multi-objective optimisation.
1. Introduction
Effective structure–activity relationship (SAR) generation is at the
centre of any medicinal chemistry campaign. Much work has been
done to devise effective methods to explain and explore SAR
data for medicinal chemistry teams to drive the design cycles
within drug discovery projects (1). Recent work on SAR generation highlights the commonly observed discontinuity of SAR
and bioactivity data, the so-called activity cliffs (2). This also
emphasises the need to empirically determine SAR for each lead
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_7, Springer Science+Business Media, LLC 2011
2. Materials
The 2-point pharmacophores were created by an in-house tool,
TRUST (34), and a shell script was written to create reagent pharmacophore fingerprints from the TRUST output. The greedy
search algorithm (35) was implemented in Python (36) to read in
the reagent pharmacophore fingerprints and optimise the pharmacophore
entropy. An in-house genetic algorithm-driven library design tool,
GALOP, was used to optimise the library under multiple user-supplied
constraints. Library product properties were calculated by various
in-house prediction tools. Tanimoto similarity for reagents was
calculated using the FOYFI fingerprint, an in-house fingerprint (37) that is similar in spirit to the standard Daylight
fingerprint (38). Database similarity searches were done using
an in-house 2D similarity search tool (34) with the FOYFI fingerprint. An in-house program, FLUSH (37), was used for structure clustering.
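The greedy entropy optimisation described above can be sketched as follows. This is not the authors' implementation: the function names, the dictionary-of-counts fingerprint format and the toy reagents are all assumptions made for illustration. At each step the sketch adds the reagent that most increases the Shannon entropy of the pooled bin counts.

```python
import math


def shannon_entropy(counts):
    """SE = -sum(p_i * log2(p_i)) over bins with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def greedy_select(fingerprints, n_select):
    """Pick n_select reagents, each time adding the one that maximises SE.

    fingerprints: dict mapping a reagent id to a list of per-bin counts
    (hypothetical format). Ties go to the first reagent in dict order.
    """
    n_bins = len(next(iter(fingerprints.values())))
    pooled = [0] * n_bins
    chosen = []
    remaining = dict(fingerprints)
    while remaining and len(chosen) < n_select:
        best_id, best_se = None, -1.0
        for rid, fp in remaining.items():
            se = shannon_entropy([p + f for p, f in zip(pooled, fp)])
            if se > best_se:
                best_id, best_se = rid, se
        chosen.append(best_id)
        pooled = [p + f for p, f in zip(pooled, remaining.pop(best_id))]
    return chosen, shannon_entropy(pooled)


# Toy reagents with four pharmacophore bins (invented data).
reagents = {"A": [1, 0, 0, 0], "B": [1, 0, 0, 0], "C": [0, 1, 0, 0], "D": [0, 0, 1, 1]}
chosen, se = greedy_select(reagents, 2)
```

A greedy search is fast and deterministic but is not guaranteed to find the global optimum, which is one reason a genetic algorithm may be preferred when several objectives are optimised together.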
Three sets of commercially available reagents are used in this
study: a set of 493 aliphatic primary amines for selecting 20
3. Method
3.1. Methodology
3.1.1. Definition of the Pharmacophore Fingerprint
SE = -\sum_i p_i \log_2 p_i    [1]
p_i = \frac{c_i}{\sum_k c_k}    [2]
SE_j    [3]
is a deterministic method in nature, ProSAR and occupancy optimisation give the optimal reagent selection (within the given constraints), while random selection was repeated ten times to give
ten different reagent sets.
The distributions of reagent pharmacophore bins for the
ProSAR reagent set, one random reagent set and the occupancy-optimised reagent set are compared in Fig. 7.2. The total SE of
the ProSAR selection is 4.15, the average SE of the ten random selections is 3.3 and the SE for the occupancy-based selection is 3.4.
The entropy-driven ProSAR selection has the same pharmacophore
coverage as the occupancy-optimised set, and both optimisation
techniques achieve better coverage than random selection. Additionally, the entropy-optimised ProSAR set has the most even bin distribution of the selections. As an example, for the lipophilic bins
no. 14 to 17, the reagent count for random selection is 39 and
the count for occupancy selection is 33, while the count for these
bins is reduced to 15 in the ProSAR library. Entropy-based
optimisation thus achieves the same level of pharmacophore coverage
as occupancy optimisation but gives a more even distribution of
pharmacophores in the reagent set and does not bias the selection towards reagents with lipophilic pharmacophores.
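The difference between occupancy and entropy as objectives can be seen in a small sketch. The bin counts below are invented and the function names are hypothetical; the point is only that two selections can cover the same number of bins while differing strongly in how evenly those bins are populated.

```python
import math


def shannon_entropy(counts):
    """SE = -sum(p_i * log2(p_i)) over bins with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def covered_bins(counts):
    """Occupancy: number of bins hit by at least one reagent."""
    return sum(1 for c in counts if c > 0)


# Two invented selections of 20 reagent features over four bins.
even = [5, 5, 5, 5]
skewed = [17, 1, 1, 1]

same_coverage = covered_bins(even) == covered_bins(skewed)
se_even = shannon_entropy(even)      # reaches the maximum, log2(4)
se_skewed = shannon_entropy(skewed)  # lower: one bin dominates
```

An occupancy objective cannot distinguish these two selections, whereas the entropy objective penalises the skewed one, which is the behaviour reported above for the lipophilic bins.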
Fig. 7.2 Pharmacophore fingerprint distribution for 20 primary amines selected by using
the ProSAR strategy, random selection and occupancy optimisation of fingerprint bins,
respectively.
Fig. 7.4 (a) Pharmacophore fingerprint distribution for the R1 reagents. (b) Pharmacophore fingerprint distribution for
the R2 reagents (adapted from (33)).
Table 7.1
Results of the designed libraries for the Affymax example
(adapted from (33))
ProSAR
libraries
Random
librariesa
20
11.7
1.1
Shannon entropy
R1
4.61
3.1
3.2
R2
4.65
2.9
3.6
R1
27
13.8
12.1
R2
27
13.3
17.2
Libraries
Number of
covered bins
Diversity
librariesa
Fig. 7.5 Library example for concurrent reagent pharmacophore entropy and library
property profile optimisation.
Table 7.2
Results for the GA-optimised librariesa (adapted from (33))
Libraries
ProSAR+
propertyb
Diversity+
propertyc
Propertyd
Random
librarye
Full
library
% of good compounds
99.7
99.7
100
62.2
62
Number of clusters
Shannon entropy R1
21
46.1
23
NC
Dissimilarity
index
Number of
covered bins
14.1
3.03
2.86
2.38
2.71
2.83
R2
3.52
2.62
2.32
2.81
2.94
R1
0.74
0.80
0.64
0.72
0.74
R2
0.69
0.80
0.65
0.71
0.73
R1
10.5
10.3
R2
15.4
10.2
10.5
10.7
21
12
20
a The values listed in the table are averaged over ten library designs, except for the full library.
b Libraries obtained by optimising both the pharmacophore entropy and the property profile simultaneously.
c Libraries obtained by optimising both the diversity and the property profile simultaneously.
d Libraries obtained by only optimising the property profile.
e Libraries obtained by randomly selecting reagents.
Fig. 7.6 Comparison of pharmacophore fingerprint distribution for libraries with different design strategies. (a) Pharmacophore fingerprint distribution for R1 reagents. (b) Pharmacophore fingerprint distribution for R2 reagents (adapted from
(33)).
Fig. 7.7 Selected R1 reagents for the ProSAR library (adapted from (33)).
4. Conclusion
The ProSAR strategy for library design selects reagents by optimising the reagent pharmacophore space to achieve a systematic
variation of the pharmacophores relative to a scaffold attachment.
We show that optimising the Shannon entropy of the reagent
Fig. 7.8 Selected R1 reagents for the diversity library (adapted from (33)).
Fig. 7.9 Selected R2 reagents for the ProSAR library (adapted from (33)).
Fig. 7.10 Selected R2 reagents for the diversity library (adapted from (33)).
Acknowledgements
The authors are grateful to the following colleagues at
AstraZeneca: Dr. David Cosgrove for providing the FOYFI fingerprint programs, Dr. Jens Sadowski for providing the tool to
extract the R-groups for the library compounds and Dr. Ulf
Börjesson for developing the GALOP program.
References
1. Bajorath, J., Peltason, L., Wawer, M., Guha, R., Lajiness, M. S., van Drie, J. H. (2009) Navigating structure–activity landscapes. Drug Discovery Today 14, 698–705.
2. Maggiora, G. M. (2006) On outliers and activity cliffs – why QSAR often disappoints. J Chem Inf Model 46, 1535.
3. Sisay, M. H., Peltason, L., Bajorath, J. (2009) Structural interpretation of activity cliffs revealed by systematic analysis of structure–activity relationships in analog series. J Chem Inf Model 49, 2179–2189.
4. Boström, J., Hogner, A., Schmitt, S. (2006) Do structurally similar ligands bind in a similar fashion? J Med Chem 49, 6716–6725.
5. Spellmeyer, D. C., Grootenhuis, P. D. J. (1999) Recent developments in molecular diversity: computational approaches to combinatorial chemistry. Annu Rep Med Chem Rev 34, 287–296.
6. Beno, B. R., Mason, J. S. (2001) The design of combinatorial libraries using properties and 3D pharmacophore fingerprints. Drug Discovery Today 6, 251–258.
7. Willett, P. (2000) Chemoinformatics – similarity and diversity in chemical libraries. Curr Opin Biotechnol 11, 85–88.
8. Bemis, G. W., Murcko, M. A. (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem 39, 2887–2893.
9. Xu, Y. J., Johnson, M. (2002) Using molecular equivalence numbers to visually explore structural features that distinguish chemical libraries. J Chem Inf Comp Sci 42, 912–926.
10. Pitt, W., Parry, D. M., Perry, B. G., Groom, C. R. (2009) Heteroaromatic rings of the future. J Med Chem 52, 2952–2963.
11. Good, A. C., Kuntz, I. D. (1995) Investigating the extension of pairwise distance pharmacophore measures to triplet-based descriptors. J Comput Aided Mol Des 9, 373–379.
12. Mason, J. S., Morize, I., Menard, P. R., Cheney, D. L., Hulme, C., Labaudiniere, R. F. (1999) New 4-point pharmacophore method for molecular similarity and diversity applications: overview of the method and applications, including a novel approach to the design of combinatorial libraries containing privileged substructures. J Med Chem 42, 3251–3264.
13. Symyx, 2.5; Symyx Technologies Inc., Santa Clara, CA 95051, USA.
14. McGregor, M. J., Muskal, S. M. (1999) Pharmacophore fingerprinting. 1. Application to QSAR and focused library design. J Chem Inf Comput Sci 39, 569–574.
15. Pickett, S. D., Mason, J. S., McLay, I. M. (1996) Diversity profiling and design using 3D pharmacophore: pharmacophore-derived queries (PDQ). J Chem Inf Comput Sci 36, 1214–1223.
16. Mason, J. S., Beno, B. R. (2000) Library design using BCUT chemistry-space descriptors and multiple four-point pharmacophore fingerprints: simultaneous optimization and structure-based diversity. J Mol Graph Mod 18, 438–451.
17. Cato, S. J. (2000) Exploring pharmacophores with Chem-X, in (Güner, O., ed.) Pharmacophore Perception, Development, and Use in Drug Design. International University Line, La Jolla, CA, pp. 107–125.
18. Good, A. C., Lewis, R. A. (1997) New methodology for profiling combinatorial libraries and screening sets: cleaning up the design process with HARPick. J Med Chem 40, 3926–3936.
19. Chen, X., Rusinko, A., III, Young, S. S. (1998) Recursive partitioning analysis of a large structure-activity data set using three-dimensional descriptors. J Chem Inf Comput Sci 38, 1054–1062.
20. Matter, H., Pötter, T. (1999) Comparing 3D pharmacophore triplets and 2D fingerprints for selecting diverse compound subsets. J Chem Inf Comput Sci 39, 1211–1225.
21. Eksterowicz, J. E., Evensen, E., Lemmen, C., Brady, G. P., Lanctot, J. K., Bradley, E. K., Saiah, E., Robinson, L. A.,
SAGE and SCA algorithms. J Chem Inf Comput Sci 41, 1470–1477.
56. Jamois, E. J., Hassan, M., Waldman, M. (2000) Evaluation of reagent-based and product-based strategies in the design of combinatorial library subsets. J Chem Inf Comput Sci 40, 63–70.
57. Szardenings, A. K., Antonenko, V., Campbell, D. A., DeFrancisco, N., Ida, S., Si, L., Sharkov, N., Tien, D., Wang, Y., Navre, M. (1999) Identification of highly selective inhibitors of collagenase-1 from combinatorial libraries of diketopiperazines. J Med Chem 42, 1348–1357.
58. Campbell, D. A., Look, G. C., Szardenings, A. K., Patel, P. V. (2001) US6271232B1; Campbell, D. A., Look, G. C., Szardenings, A. K., Patel, P. V. (1999) US5932579A; Campbell, D. A., Look, G. C., Szardenings, A. K., Patel, P. V. (1997) WO97/48685A1.
59. Szardenings, A. K., Harris, D., Lam, S., Shi, L., Tien, D., Wang, Y., Patel, D. V., Navre, M., Campbell, D. A. (1998) Rational design and combinatorial evaluation of enzyme inhibitor scaffolds: identification of novel inhibitors of matrix metalloproteinases. J Med Chem 41, 2194–2200.
60. Martin, Y. C., Kofron, J. L., Traphagen, L. M. (2002) Do structurally similar molecules have similar biological activity? J Med Chem 45, 4350–4358.
61. Pickett, S. D., McLay, I. M., Clark, D. E. (2000) Enhancing the hit-to-lead properties of lead optimization libraries. J Chem Inf Comput Sci 40, 263–272.
62. Gillet, V. J., Khatib, W., Willett, P., Fleming, P. J., Green, D. V. S. (2002) Combinatorial library design using a multiobjective genetic algorithm. J Chem Inf Comput Sci 42, 375–385.
63. Brown, R. D., Hassan, M., Waldman, M. (2000) Combinatorial library design for diversity, cost efficiency, and drug-like character. J Mol Graph Model 18, 427–437.
Section II
Structure-Based Library Design
Chapter 8
Docking Methods for Structure-Based Library Design
Claudio N. Cavasotto and Sharangdhar S. Phatak
Abstract
The drug discovery process mainly relies on the experimental high-throughput screening of huge
compound libraries in the pursuit of new active compounds. However, spiraling research and development costs and unimpressive success rates have driven the development of more rational, efficient, and
cost-effective methods. With the increasing availability of protein structural information, advancement in
computational algorithms, and faster computing resources, in silico docking-based methods are increasingly used to design smaller and focused compound libraries in order to reduce screening efforts and
costs and at the same time identify active compounds with a better chance of progressing through the
optimization stages. This chapter is a primer on the various docking-based methods developed for the
purpose of structure-based library design. Our aim is to elucidate some basic terms related to the docking technique and explain the methodology behind several docking-based library design methods. This
chapter also aims to guide the novice computational practitioner by laying out the general steps involved
for such an exercise. Selected successful case studies conclude this chapter.
Key words: Structure-based library design, drug discovery, docking, high-throughput screening,
combinatorial chemistry.
1. Introduction
The finding, optimization, and bioevaluation of small molecules
that can interact with therapeutically relevant targets to modulate biological processes is the core of the drug discovery process. So far, this has been mainly dominated by high-throughput
screening (HTS), a hardware technology that allows the rapid
screening of compound libraries to identify potentially active ones
(1, 2). HTS, however, requires a ready source of a large and preferably diverse set of compounds to serve as starting points for the
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_8, Springer Science+Business Media, LLC 2011
consortium projects (20), and developments in homology modeling methods (21), an increasing number of 3D structures of targets are now available for several structure-based drug discovery
applications (e.g., virtual screening (22–26), binding mode predictions (27–31), and lead optimization (32)). Structural information coded in the characteristics of binding sites, such as
receptor:ligand interaction patterns, can be used to prioritize
compounds for experimental screening using docking methods
(33–35), and such exercises have been successful in the past
(36, 37). With the continual development in docking algorithms
and computational resources, structure-based docking methods
will play an increasingly important role in compound library
design.
It is timely, then, to review the use of docking methods
for structure-based library design and to understand how best
to implement them in drug discovery. First, we will define and
explain basic concepts and terminology related to structure-based drug design and docking. Next, docking methods for
library design will be presented with brief notes explaining the
practical considerations involved in such an exercise. The chapter will conclude with selected case studies highlighting the
recent successes of docking methods for structure-based library
design.
2. Requirements
The three major requirements for docking-based library design
are basically the same as for high-throughput docking (HTD):
a 3D representation of the target structure (experimental or modeled), a database of compounds in electronic format, and a suitable docking algorithm.
2.1. 3D Structure of Target
2.2. Compound Collections
Many commercial vendors and academic labs provide collections of compounds or fragments (Sigma-Aldrich, Available Chemicals Directory (ACD) (http://www.symyx.com),
Maybridge (www.maybridge.com), TCI America (http://www.tciamerica.com), ChemDB (51) (http://cdb.ics.uci.edu), ZINC
(52), ChemBridge (www.chembridge.com); cf. (53) for a review
of publicly accessible chemical databases) in computationally
readable formats such as sdf or mol2 (42). Along with the
experimental combinatorial library design process mentioned
earlier, several software tools, such as CombiLibMaker in
Sybyl (www.tripos.com) and QuaSAR-Combigen from CCG
(www.chemcomp.com), were developed to enumerate and predict the 2D/3D structures of compounds using chemical fragments without the expensive and tedious experimental part. On
the other hand, pharmaceutical companies have historically maintained huge compound libraries and continue to add compounds
with novel chemistry to these collections. The size of such compound collections is estimated to be 10^6 compounds (54).
These compound libraries or their constitutive parts (reagents,
fragments) form the source of inputs for docking methods for
structure-based library design.
However, these compounds and the fragments are not without their intrinsic problems and should not be used as is. Some
examples of potentially problematic compounds include those
with chemically reactive groups, dyes, and fluorescent compounds which interfere with assays, frequent hitters/promiscuous
binders, and inorganic complexes (55). It is important, then, to a
priori filter out such compounds or reagents which are practically
useless from a drug discovery point of view.
Some of the common steps toward preparing and filtering
chemical compound libraries include
(a) Removing compounds with salts, counter ions, chemically
reactive groups (e.g., metal chelators), undesirable atoms or
functional groups, inorganic compounds, and duplicates.
(b) Generating the correct tautomeric, protonation, and
stereoisomeric states for each compound. Several states for a given compound may be generated and
kept in the final library.
(c) Filtering compounds based on drug-like or lead-like physicochemical properties or other in-house scoring schemes
(e.g., Lipinski's rule of 5 (56), lead-like filters suggested
by Oprea et al. (57)).
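Parts of steps (a) and (c) can be sketched in a few lines, assuming the relevant properties have already been computed by some other tool. The record layout, the field names and the property values below are hypothetical; the thresholds are the standard rule-of-5 limits (molecular weight 500, logP 5, 5 hydrogen-bond donors, 10 hydrogen-bond acceptors), and allowing one violation is a common convention rather than something prescribed in this chapter.

```python
def ro5_violations(props):
    """Count Lipinski rule-of-5 violations for one compound record."""
    return sum([
        props["mw"] > 500,    # molecular weight
        props["logp"] > 5,    # calculated logP
        props["hbd"] > 5,     # hydrogen-bond donors
        props["hba"] > 10,    # hydrogen-bond acceptors
    ])


def filter_library(compounds, max_violations=1):
    """Drop duplicate structures, then keep compounds within the violation limit.

    Duplicates are detected by an identical "key" field, which is assumed
    to be a canonical structure identifier produced upstream.
    """
    seen, kept = set(), []
    for cpd in compounds:
        if cpd["key"] in seen:
            continue
        seen.add(cpd["key"])
        if ro5_violations(cpd) <= max_violations:
            kept.append(cpd)
    return kept


# Invented records for illustration only.
library = [
    {"id": "cpd-1", "key": "S1", "mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5},
    {"id": "cpd-2", "key": "S1", "mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5},
    {"id": "cpd-3", "key": "S2", "mw": 710.9, "logp": 6.3, "hbd": 6, "hba": 12},
    {"id": "cpd-4", "key": "S3", "mw": 512.6, "logp": 4.2, "hbd": 1, "hba": 7},
]
kept_ids = [cpd["id"] for cpd in filter_library(library)]
```

Salt stripping, reactive-group filtering and state enumeration (step (b)) need a cheminformatics toolkit and are deliberately left out of this sketch.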
In cases where fragments, rather than compounds, are used
to design libraries, the fragments may be filtered based on
the nature of the binding site and their ability to adhere to
existing chemistry protocols with respect to their attachments
to templates or other fragments (58). Other filtering criteria
include excluding fragments with hydrolysable groups (e.g., sulfonyl halides, anhydride aliphatic ester) and potential cytotoxic
groups like thiourea and cyclohexanone (55).
2.3. Docking Algorithms and Methods
Fig. 8.1. Schematic depiction of the seed and grow docking approach for structure-based library design. The programs that use this approach include CombiDOCK and
PRO_SELECT.
(b) Dock and link: The substituent groups are docked to the
interacting sites in the binding pocket, scored, and then
linked to each other based on chemistry constraints (70).
This method, as illustrated in Fig. 8.2, attempts to take
advantage of the known significant interactions within the
binding site to bias the final compound library.
Fig. 8.2. Schematic depiction of the dock and link docking approach. The program
BUILDER v.2 is based on this approach.
Notes:
For fragment-based methods, one needs to consider some
important issues.
(a) For the seed and grow method, the orientation of the scaffold is highly critical. Any errors at this stage may render the
results at later steps irrelevant (71).
(b) There must be a ready-to-use synthetic protocol to build
these libraries based on the scaffold and fragments used.
(c) Although the seed and grow method reduces the number
of compounds as compared to a full library enumeration,
the availability of a large number of fragments may still result
in a huge number of compounds. Further filtering steps
or diversity analysis exercises may be required to choose a final
subset of compounds.
(d) It is important to have diverse fragment libraries to maximize the chances of library diversity.
(e) In the case of the dock and link method, though it is likely to
have fragments satisfying key interactions within the binding
site, the final compound may not be amenable to synthesis
(71).
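The scale issue raised in note (c) is simple arithmetic, sketched below. The sketch assumes that fragments at each attachment site are scored independently on a fixed scaffold pose and that only the best few per site are carried forward; both are simplifying assumptions for illustration, and the function names and fragment counts are hypothetical.

```python
from math import prod


def full_enumeration_size(fragments_per_site):
    """Number of products if every fragment combination is enumerated."""
    return prod(fragments_per_site)


def site_by_site_evaluations(fragments_per_site):
    """Number of fragment placements scored when sites are treated independently."""
    return sum(fragments_per_site)


def reduced_library_size(fragments_per_site, top_k):
    """Products left after keeping at most top_k fragments per site."""
    return prod(min(n, top_k) for n in fragments_per_site)


# Hypothetical scaffold with three attachment sites, 1000 fragments each.
sites = [1000, 1000, 1000]
full = full_enumeration_size(sites)        # 1,000,000,000 products
scored = site_by_site_evaluations(sites)   # 3,000 placements
reduced = reduced_library_size(sites, 10)  # 1,000 products
```

Even after such a reduction, raising the number of retained fragments per site from 10 to 100 brings the count back to a million products, which is why further filtering or diversity selection is usually needed.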
3. Docking Methods for Structure-Based Library Design
(http://mdi.ucsf.edu/CombiBUILD.html) have been developed for the purpose of library design. Please refer to Table 8.1
for a list of the docking methods and software tools mentioned in this chapter. However, it should
be noted that, though useful, most of the programs are neither
easy to implement nor easy to use as is (87). As a result, these methods
have found limited applicability in the scientific community.
On the other hand, several commercial library vendors like
Cerep (www.cerep.fr), Asinex (www.asinex.com), and Enamine
(www.enamine.net) offer target-focused libraries using docking-based protocols.
Table 8.1
Docking-based programs for library design
Method/program | Description | Refs
CombiDOCK | Tweaks the DOCK algorithm to identify suitable scaffold orientations in the binding pocket. Proceeds using the seed and grow approach to design combinatorial libraries | (73)
PRO_SELECT | Combines combinatorial chemistry and fragment-based docking methods to design structure-based libraries | (75)
DREAM++ | | (80)
Builder v.2 | | (81)
OptiDOCK | | (83)
Basis products (BPs) | | (85)
CombiGlide | | www.schrodinger.com
CombiSMoG | | (86)
4. Applications
This section will highlight a few applications of docking-based
methods for library design.
The CombiDOCK algorithm was applied to design a
structure-based library for cathepsin D protease using a hydroxyethylamine scaffold. This scaffold has three attachment points.
Ten fragments were chosen for each site and incorporated in the
final library design. The resulting 1000 compounds (10 × 10 × 10) were filtered for
inaccuracies in bond geometries to give 750 compounds, which
were synthesized and assayed. Results
indicated that this library had an enrichment factor (EF) of 2.5,
whereas a completely random ranking would result in an EF of
1.0 (73). The EF is the ratio of the probability of finding
a true ligand in a filtered sub-library to the probability
of finding a ligand at random.
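The EF definition above can be written out as a short calculation. The following is a minimal sketch only: the function name and the hit counts are hypothetical and are not taken from the CombiDOCK study (73); they are chosen so that the result matches an EF of 2.5.

```python
def enrichment_factor(hits_selected, n_selected, hits_total, n_total):
    """EF = hit rate in the selected sub-library / hit rate in the whole library."""
    if n_selected <= 0 or n_total <= 0 or hits_total <= 0:
        raise ValueError("counts must be positive")
    rate_selected = hits_selected / n_selected   # probability of a ligand in the sub-library
    rate_random = hits_total / n_total           # probability of a ligand at random
    return rate_selected / rate_random


# Hypothetical example: three attachment points with ten fragments each
library_size = 10 ** 3
ef = enrichment_factor(hits_selected=25, n_selected=100,
                       hits_total=100, n_total=library_size)
print(f"library size = {library_size}, EF = {ef:.2f}")
```

An EF of 1.0 corresponds to no improvement over random selection, so the value is only meaningful relative to the hit rate of the unfiltered library.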
The PRO_SELECT method was applied to design an
inhibitor library for thrombin, a key serine protease. The crystal structure of thrombin includes a covalently bound inhibitor, the
tripeptide PPACK. L-Proline, the centrally located portion of
PPACK, was chosen as the template, and its alternate locations
were generated by docking/modeling a noncovalently bound
analogue of PPACK. Analysis of the binding site revealed the
requirements for a hydrogen bond donor and a hydrophobic
group at either end of the template. A 3D database search of
potential fragment binders based on the analysis of the binding
site resulted in over 400,000 hits. The PRO_SELECT method was
able to drastically reduce the number of fragments to 17, which
were then used to build a chemical library. Over 30 molecules
were then synthesized, of which at least 50% showed micromolar
activities (75).
In another study Head et al. used a docking-based method
to design a library of potentially novel inhibitors for caspases
3 and 8, key regulators of apoptosis (88). The authors chose
thiomethylketone as a scaffold for this study, as it is a common
denominator of a class of compounds inhibiting caspases 3 and 8.
Two attachment points on the thiomethylketone scaffold were
identified. The ketone group is postulated to covalently bind with
the catalytic cysteine. Thus from Fig. 8.3 it is seen that the R
group points away from the S2 binding pocket. Hence a small
number of reagents (8) were fixed for R based on availability
and ease of synthesis. To identify potential functional groups for
the other attachment point (R), roughly 7000 monoacid reagents
from the ACD database were selected for combinatorial docking.
First, a simplified thiomethylketone with R and R set to methyl
was docked in the binding pocket to identify initial template
locations. Next, the eight reagents for the R point and 7000
monoacids for the R points were combinatorially attached to the
templates, docked, and scored. Two criteria were used to obtain
the final reagents for the R group: (1) docking scores and (2) distance filters based on the experimental data of isatin-based compounds and crystal structures of other caspases. Based on these
results approximately 150 reagents were selected per caspase and
roughly 10% of these reagents underwent full conformational
sampling. As the array size for synthesis was 96, only 12 reagents
for the R group (seven for caspase 3, three for caspase 8, and
three common for both) were selected based on visual inspection
of the predicted binding modes. Sixty-one compounds were synthesized and tested. Five of the 61 compounds tested against caspase 3 and two compounds against caspase 8 showed micromolar
activity. Interestingly, a homology model of caspase 8 was used
for this study, which clearly indicates the usefulness of homology
modeling in structure-based library design.
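The reagent selection described above (a docking score combined with a distance filter, then a cut to fit the synthesis array) can be sketched as follows. This is an illustrative sketch, not the protocol of Head et al. (88): the class, the function, the score convention (lower is better), the cutoff, and all data values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DockedReagent:
    name: str
    score: float      # docking score; lower is assumed to be better
    distance: float   # distance (in angstroms) to a reference point used as a filter


def select_reagents(docked, max_distance, n_keep):
    """Keep reagents passing the distance filter, then return the n_keep best scores."""
    passing = [r for r in docked if r.distance <= max_distance]
    passing.sort(key=lambda r: r.score)
    return passing[:n_keep]


# Hypothetical docking results for four monoacid reagents
docked = [
    DockedReagent("acid_A", -9.1, 3.2),
    DockedReagent("acid_B", -10.4, 6.8),   # best score, but fails the distance filter
    DockedReagent("acid_C", -8.7, 2.9),
    DockedReagent("acid_D", -7.5, 3.9),
]
selected = select_reagents(docked, max_distance=4.0, n_keep=2)
print([r.name for r in selected])

# Array-size arithmetic from the text: 8 fixed reagents x 12 selected reagents = 96
array_size = 8 * 12
```

The distance filter is applied before ranking so that a well-scored pose in an implausible location cannot displace a reasonable one.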
Decornez et al. used a generalized kinase model and a combination of 2D (fingerprint-based similarity) and 3D (docking) methods
to develop a kinase family-focused library (15). The
authors used 2800 kinase inhibitor compounds as a reference
for a 2D search of their in-house database of 260 K compounds
5. Conclusions
Despite the initial promise, advancements in HTS methods and
combinatorial chemistry have so far failed to improve the success
rates of drug discovery programs. Since the experimental screening of these gigantic libraries is costly and time consuming, it
is of utmost importance to rationally, efficiently, and economically explore the available chemical space of compounds in order
to design smaller and focused compound libraries for experimental evaluation. Several docking-based methods make use of the
increasing availability of structural information of drug targets to
a priori filter out those compounds that are unlikely to bind to the
target. This chapter highlights several such docking methods
used in library design, together with their application to actual
cases.
References
1. Mayr, L. M., Fuerst, P. (2008) The future of high-throughput screening. J Biomol Screen 13, 443–448.
2. Entzeroth, M. (2003) Emerging trends in high-throughput screening. Curr Opin Pharmacol 3, 522–529.
3. Schnecke, V., Bostrom, J. (2006) Computational chemistry-driven decision making
based drug design: a molecular modeling perspective. Med Res Rev 16, 3–50.
6. Walters, W. P., Stahl, M. T., Murcko, M. A. (1998) Virtual screening - an overview. Drug Discov Today 3, 160–178.
7. Phatak, S. S., Stephan, C. C., Cavasotto, C. N. (2009) High-throughput and in silico screenings in drug discovery. Expert Opin Drug Discov 4, 947–959.
8. Keseru, G. M., Makara, G. M. (2006) Hit discovery and hit-to-lead approaches. Drug Discov Today 11, 741–748.
9. Macarron, R. (2006) Critical review of the role of HTS in drug discovery. Drug Discov Today 11, 277–279.
10. Fox, S., Farr-Jones, S., Sopchak, L., Boggs, A., Nicely, H. W., Khoury, R., Biros, M. (2006) High-throughput screening: update on practices and success. J Biomol Screen 11, 864–869.
11. Keseru, G. M., Makara, G. M. (2009) The influence of lead discovery strategies on the properties of drug candidates. Nat Rev Drug Discov 8, 203–212.
12. Lipkin, M. J., Stevens, A. P., Livingstone, D. J., Harris, C. J. (2008) How large does a compound screening collection need to be? Comb Chem High Throughput Screening 11, 482–493.
13. Nestler, H. P. (2005) Combinatorial chemistry and fragment screening - two unlike siblings? Curr Drug Discov Tech 2, 1–12.
14. Diller, D. J., Merz, K. M., Jr. (2001) High throughput docking for library design and library prioritization. Proteins 43, 113–124.
15. Decornez, H., Gulyas-Forro, A., Papp, A., Szabo, M., Sarmay, G., Hajdu, I., Cseh, S., Dorman, G., Kitchen, D. B. (2009) Design, selection, evaluation of a general kinase-focused library. ChemMedChem 4, 1273–1278.
16. Lipinski, C. A. (2000) Drug-like properties and the causes of poor solubility and poor permeability. J Pharmacol Toxicol Methods 44, 235–249.
17. Schnur, D. M. (2008) Recent trends in library design: rational design revisited. Curr Opin Drug Discov Devel 11, 375–380.
18. Villar, H. O., Koehler, R. T. (2000) Comments on the design of chemical libraries for screening. Mol Divers 5, 13–24.
19. Manjasetty, B. A., Turnbull, A. P., Panjikar, S., Bussow, K., Chance, M. R. (2008) Automated technologies and novel techniques to accelerate protein crystallography for structural genomics. Proteomics 8, 612–625.
20. Gileadi, O., Knapp, S., Lee, W. H., Marsden, B. D., Muller, S., Niesen, F. H., Kavanagh, K. L., Ball, L. J., von Delft, F.,
70. Schneider, G. (2002) Trends in virtual combinatorial library design. Curr Med Chem 9, 2095–2101.
71. Beavers, M. P., Chen, X. (2002) Structure-based combinatorial library design: methodologies and applications. J Mol Graph Model 20, 463–468.
72. Coupez, B., Lewis, R. A. (2006) Docking and scoring - theoretically easy, practically impossible? Curr Med Chem 13, 2995–3003.
73. Sun, Y., Ewing, T. J., Skillman, A. G., Kuntz, I. D. (1998) CombiDOCK: structure-based combinatorial docking and library design. J Comput Aided Mol Des 12, 597–604.
74. Kuntz, I. D., Blaney, J. M., Oatley, S. J., Langridge, R., Ferrin, T. E. (1982) A geometric approach to macromolecule-ligand interactions. J Mol Biol 161, 269–288.
75. Murray, C. W., Clark, D. E., Auton, T. R., Firth, M. A., Li, J., Sykes, R. A., Waszkowycz, B., Westhead, D. R., Young, S. C. (1997) PRO_SELECT: combining structure-based drug design and combinatorial chemistry for rapid lead discovery. 1. Technology. J Comput Aided Mol Des 11, 193–207.
76. Blaney, J. M., Dixon, J. S. (1993) A good ligand is hard to find: automated docking methods. Perspect Drug Discov Des 1, 301–319.
77. Kuntz, I. D., Meng, E. C., Shoichet, B. (1994) Structure-based molecular design. Acc Chem Res 27, 117–123.
78. Bohm, H. J. (1994) The development of a simple empirical scoring function to estimate the binding constant for a protein-ligand complex of known three-dimensional structure. J Comput Aided Mol Des 8, 243–256.
79. Clark, D. E., Frenkel, D., Levy, S. A., Li, J., Murray, C. W., Robson, B., Waszkowycz, B., Westhead, D. R. (1995) PRO-LIGAND: an approach to de novo molecular design. 1. Application to the design of organic molecules. J Comput Aided Mol Des 9, 13–32.
80. Makino, S., Ewing, T. J., Kuntz, I. D. (1999) DREAM++: flexible docking program for virtual combinatorial libraries. J Comput Aided Mol Des 13, 513–532.
81. Roe, D. C., Kuntz, I. D. (1995) BUILDER v.2: improving the chemistry of a de novo design strategy. J Comput Aided Mol Des 9, 269–282.
82. Van Gunsteren, W. F., Berendsen, H. J. C. (1977) Algorithms for macromolecular dynamics and constraint dynamics. Mol Phys 34, 1311–1327.
83. Sprous, D. G., Lowis, D. R., Leonard, J. M., Heritage, T., Burkett, S. N., Baker, D. S., Clark, R. D. (2004) OptiDock: virtual
87. Zhou, J. Z. (2008) Structure-directed combinatorial library design. Curr Opin Chem Biol 12, 379–385.
88. Head, M. S., Ryan, M. D., Lee, D., Feng, Y., Janson, C. A., Concha, N. O., Keller, P. M., deWolf, W. E., Jr. (2001) Structure-based combinatorial library design: discovery of non-peptidic inhibitors of caspases 3 and 8. J Comput Aided Mol Des 15, 1105–1117.
89. Zhao, L., Huang, W., Liu, H., Wang, L., Zhong, W., Xiao, J., Hu, Y., Li, S. (2006) FK506-binding protein ligands: structure-based design, synthesis, and neurotrophic/neuroprotective properties of substituted 5,5-dimethyl-2-(4-thiazolidine)carboxylates. J Med Chem 49, 4059–4071.
Chapter 9
Structure-Based Library Design in Efficient Discovery
of Novel Inhibitors
Shunqi Yan and Robert Selliah
Abstract
Structure-based library design employs both structure-based drug design (SBDD) and combinatorial
library design. Combinatorial library design concepts have evolved over the past decade, and this chapter
covers several novel aspects of structure-based library design together with successful case studies in
antiviral drug design against an HCV target. Discussions include reagent selections, diversity library designs,
virtual screening, scoring/ranking, and post-docking pose filtering, in addition to the considerations
of chemistry synthesis. Validation criteria for a successful design include an X-ray co-crystal complex
structure, in vitro biological data, and the number of compounds to be made, and these are addressed in
this chapter as well.
Key words: Structure-based drug design, structure-based library design, library design, focused
library design, diversity library, combinatorial library, docking, reagent selections, HCV NS5B,
thiazolone.
1. Introduction
Structure-based library design engages in dual approaches of
structure-based drug design (SBDD) and combinatorial library
design (Fig. 9.1). Design concepts of combinatorial library have
been evolving since its conception more than a decade ago. Early
efforts mainly focused on the capability to synthesize large numbers of compounds through combinatorial chemistry, with the confidence that high-throughput screening (HTS) (1) of every possible compound in a large library would lead to potential druggable
hits and leads, and eventually to development candidates after
subsequent lead optimizations. Needless to say, this approach
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_9, Springer Science+Business Media, LLC 2011
2. Materials
A number of computational methodologies have been used in this
work. Reagents for library designs were exported
from the ACD database (2). GOLD (3) and LigandFit (4) were used
for docking, while MOE software (5) was used for reagent filtering and library enumeration. Post-docking pose filtering and reactive group filtering were carried out using a MOE SVL script
(5). Diversity analysis and physical property calculations were performed with Cerius2 (4). The HKL package was used for X-ray data
processing (6), and X-ray structure determination and refinement
were carried out with CNX (4). The X-ray co-crystal complex structures
discussed in this chapter were deposited in the PDB with
codes 2O5D, 2HWH, 2HWI, and 2I1R.
The chemical synthesis of the compounds is applicable to
library production. The synthesis starts with a condensation reaction of a readily available reagent, rhodanine, with an aldehyde
to afford a thiazolonone intermediate, which can undergo a coupling reaction with an amine, amino acid, or a variety of other
amino-containing derivatives to give final products in good
yields (7–9).
3. Methods
3.1. Introduction
SBDD begins with designs of novel scaffolds based on the structural binding information of hit or lead compounds in complex
with a target, which is usually an enzyme or another protein. Structures of previously
hard-to-crystallize proteins and their corresponding protein–ligand
co-crystal complexes are now routinely determined owing to the significant technological improvements
in crystallography, molecular biology, and protein science in the
last decade. Given an X-ray complex structure of a protein–ligand
co-crystal, various computational tools such as virtual screening
(10, 11), de novo design (11), and scaffold hopping are utilized to design brand new and better molecules with potentially
novel IP space coverage. Such designs are often accomplished
by seeking better or comparable ligand–protein interactions as predicted computationally. Alternatively, a 3D structure
of a target can also be approximated with reasonable confidence
from a homology model if the level of amino acid sequence identity
between the target and a known structure, from either in-house X-ray determinations or the PDB, is high. Recently, more and
more protein structures have been solved at high resolution
(<2.5 Å) by NMR experiments. Structures from NMR are sometimes advantageous for SBDD in terms of the flexibility of the protein,
since the experiments are generally performed in solution at room
temperature and, therefore, the dynamic states of the protein more closely resemble
the genuine states of the protein under physiological conditions
enumerated using the refined building block list using commercial software such as MOE (5) or CombiLibMaker in Sybyl (16).
The virtual library thus obtained may be further trimmed using
ADME filters. The final focused library is thus prepared for virtual
screening against a protein target (Fig. 9.4) (4, 5, 16).
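The enumeration and trimming steps described above can be illustrated with a small, self-contained sketch. It does not use MOE or CombiLibMaker: the building blocks, the additive mass estimate, the core mass, and the 500 Da cutoff standing in for an ADME filter are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical building blocks: (name, mass contribution in Da)
aldehydes = [("ald_1", 96.1), ("ald_2", 140.6), ("ald_3", 310.4)]
amines = [("amine_1", 75.1), ("amine_2", 131.2)]
CORE_MW = 115.0   # assumed mass of the common core
MAX_MW = 500.0    # assumed cutoff standing in for an ADME filter


def enumerate_library(r1_list, r2_list, core_mw):
    """Combine every R1 with every R2 and estimate the product mass additively."""
    for (n1, mw1), (n2, mw2) in product(r1_list, r2_list):
        yield (f"{n1}+{n2}", core_mw + mw1 + mw2)


virtual_library = list(enumerate_library(aldehydes, amines, CORE_MW))
focused_library = [(name, mw) for name, mw in virtual_library if mw <= MAX_MW]

print(len(virtual_library), "enumerated;", len(focused_library), "pass the filter")
```

Filtering the reagent lists before enumeration keeps the product count small, since the virtual library grows as the product of the list sizes.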
3.3. Virtual Screening, Scoring, and Ranking of Focused Library
Table 9.1
Parameter definition for post-docking filter
Label, smarts_pattern, x_y_z_coordinates, distance_threshold
[d1_NH, [NH]([CH2])cn, 2.5]
[d2_OC, O(=C([NH])[CH2][NH]), 2.0]
[d3_nap, [cX3][nX3]nc([NH])c[#6], 2.0]
[d4_Me, [nX3]nc([NH])cc, 2.0]
[d5_nR, [cX3]([cX3])[cX3]([NH])n[nX3], 2.0]
initial virtual library are selected for further analysis only if all of
the distance criteria in column 4 of Table 9.1 are satisfied.
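The all-criteria-must-pass logic of such a post-docking filter can be sketched in a few lines. This is not the MOE SVL script used in the work: it assumes that the substructure (SMARTS) matching has already been done by a cheminformatics toolkit and that each label maps to the coordinates of the matched atom in the docked pose. The reference coordinates below are hypothetical; only the labels and thresholds echo Table 9.1.

```python
import math

# Hypothetical filter definitions: label -> (reference x_y_z_coordinates, distance_threshold)
FILTERS = {
    "d1_NH": ((1.0, 0.0, 0.0), 2.5),
    "d2_OC": ((4.0, 1.0, 0.0), 2.0),
}


def pose_passes(matched_atoms, filters):
    """Return True only if every labeled atom lies within its distance threshold."""
    for label, (ref_xyz, threshold) in filters.items():
        atom_xyz = matched_atoms.get(label)
        if atom_xyz is None:                       # pattern not matched in this pose
            return False
        if math.dist(atom_xyz, ref_xyz) > threshold:
            return False
    return True


# Coordinates of the matched atoms in two hypothetical docked poses
good_pose = {"d1_NH": (1.5, 0.5, 0.0), "d2_OC": (4.5, 1.0, 1.0)}
bad_pose = {"d1_NH": (1.5, 0.5, 0.0), "d2_OC": (8.0, 1.0, 0.0)}

print(pose_passes(good_pose, FILTERS), pose_passes(bad_pose, FILTERS))
```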
Post-docking pharmacophore-based filtering, followed by
ranking with various scoring functions (Fig. 9.1), is greatly advantageous
compared with relying on docking scoring functions alone.
Docking scoring functions (3–5, 12, 16, 17, 23) in
practice have considerable limitations in prioritizing compounds in accordance with their corresponding binding
affinities or enzymatic potency (24–27). One of the reasons for
this lack of correlation is the artificially high ranking of incorrect docking poses (28, 29). However, it is well documented that
docking methods are able to reproduce the bound conformation
of a ligand in a protein–ligand complex determined by X-ray crystallography (29–32). Therefore, once the molecules with the correct poses are identified by post-docking filters, the problem of
scoring wrong poses is avoided, and multiple scoring functions
can then be used to rank molecules in the focused library
with a better chance of success (Fig. 9.1).
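The text does not specify how the multiple scoring functions are combined. One common option, shown here purely as an assumption, is a rank-sum consensus: each scoring function ranks the filtered molecules, and the ranks are added. The function name, the scoring-function names, and the scores are hypothetical, and lower scores are assumed to be better for every function.

```python
def consensus_rank(scores_by_function):
    """Rank-sum consensus: a lower total rank is better.

    scores_by_function maps a scoring-function name to {molecule: score},
    with lower scores assumed to be better for every function.
    """
    totals = {}
    for scores in scores_by_function.values():
        ordered = sorted(scores, key=scores.get)          # best score first
        for rank, mol in enumerate(ordered, start=1):
            totals[mol] = totals.get(mol, 0) + rank
    # Ties are broken alphabetically so the result is deterministic
    return sorted(totals, key=lambda mol: (totals[mol], mol))


# Hypothetical scores for three molecules that passed the pose filter
scores = {
    "score_a": {"mol_1": -9.0, "mol_2": -7.5, "mol_3": -8.2},
    "score_b": {"mol_1": -40.0, "mol_2": -55.0, "mol_3": -48.0},
    "score_c": {"mol_1": -6.1, "mol_2": -5.0, "mol_3": -6.5},
}
ranking = consensus_rank(scores)
print(ranking)
```

Working with ranks rather than raw scores avoids having to put scoring functions with different units and ranges on a common scale.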
A set of molecules that rank high after this process would
be synthesized and subjected to biological tests, i.e., in vitro enzymatic assays or binding affinity experiments, in order to confirm
the design rationale. Simultaneously, X-ray co-crystal structures of
these ligands in complex with the target are to be determined to
further corroborate the modeling results. Positive results from such
approaches are decisive for the selection of the next set of compounds for
synthesis and for the future directions of lead optimization.
3.4. Structure-Based Library Design in Discovery of HCV NS5B Polymerase Inhibitors
3.4.1. Background
In our HCV programs, we aimed to discover novel scaffolds efficiently to explore the allosteric site of HCV NS5B by means of
a structure-based approach including focused library design (7).
Our main strategies for new scaffolds are to maintain the key pharmacophores of the initial hit, establish sizable chemistry space,
and, most importantly, identify directions for future diversification
and optimization. We started with hit 1 from high-throughput
screening, which has an IC50 value of 2.0 μM (Fig. 9.5). An
X-ray complex structure indicated that 1 binds to a location
in the allosteric site (Fig. 9.6). Key inhibitor–protein interactions include the following: (1) both C=O and N on the thiazolone ring form hydrogen bonds with the backbone NHs of Tyr477
and Ser476, (2) a sulfonamide oxygen atom engages in a hydrogen-bonding interaction with the basic side chain NH3+ of Arg501,
(3) the aromatic furan and phenyl rings interact with the protein
through hydrophobic contacts (Fig. 9.6). Furthermore, such binding
information enables us to envision that a novel structure 2, once
incorporated with a suitable (S) amino acid, possesses not only
pharmacophores equivalent to those of 1 but also additional chemistry opportunities for exploring more space in the pocket (Figs. 9.5 and 9.6).
Fig. 9.7. Alignment of 1 (in sticks) from X-ray with 2 (sticks and balls) from docking.
Structural analysis of the binding mode of 3 in the pocket further led to the identification of more new scaffolds 4 and 5
as HCV NS5B inhibitors (Fig. 9.9) (9). A small focused library
was enumerated and selected for synthesis after virtual screening.
In general, carboxyl compounds with more flexibility resulting
from the addition of one methylene (CH2) group have potency comparable to that of the original compound 3. The most potent compound 6 has an IC50 of 8.5 μM (Table 9.2). Compound 11
shows enzymatic potency similar to that of 6, while the mono-substituted
molecules 7–10, regardless of the chiral centers, showed much
weaker potency (Table 9.2). Molecules with a tetrazole moiety, a
commonly used bioisostere of the carboxyl group (COOH), were predicted to fit well into the target as well, and a few of them were synthesized. As seen from Table 9.2, tetrazole compounds 12–14 are
moderately potent, with IC50 values of 9.7, 19.0, and 14.0 μM,
respectively. Extending the tetrazole group by one more CH2
group is tolerated by the protein. To prove the design rationale for
future structure-based designs, a co-crystal structure of 12 in complex with HCV NS5B was successfully determined at a resolution
of 2.2 Å. The electron density was clear for inhibitor 12, which
binds to the thumb sub-domain as expected (Fig. 9.10). The overall interactions of 12 with the protein are comparable with those of 3
(Fig. 9.10).
Table 9.2
Enzymatic potency (IC50 in μM) of the novel scaffolds (columns: Entry, R, IC50 in μM)
Fig. 9.10. X-ray complex structure of 12 with HCV NS5B (3 in yellow sticks).
Fig. 9.12. Electron-density map and interactions of acylsulfonamide with NS5B allosteric site.
4. Notes
1. The key to success is to diligently carry out several cycles of
filtering, such as availability, reactive group, and diversity
selection of reagents, before library enumeration. It is also
necessary to carry out automatic pharmacophore-based
post-docking pose filtering prior to using any docking scoring functions.
2. Induced fit docking (IFD) should be carried out periodically
to check whether including receptor flexibility improves
docking results. Regular docking, while very fast,
treats all amino acids as rigid, which does not reflect the true
nature of protein flexibility, and consequently true positives
may be missed.
3. Molecular dynamics (MD) should be performed for binding pockets defined mostly by side chains of flexible protein residues to generate an ensemble of binding sites. Such
an ensemble can be used for subsequent docking or virtual
screening in a parallel fashion.
4. An SBDD design should be confirmed by a later X-ray complex structure, which in turn serves to initiate a cycle of
iterative structure-based drug design (SBDD). SBDD starts
from an X-ray or NMR complex structure of a ligand with a protein, and a design, if synthesized, validated, and confirmed
by X-ray, creates a starting point for a new level of SBDD
efforts.
5. Do not use any scoring function blindly without validation
against the specific drug target. Most SBDD efforts involve both
References
1. Hertzberg, R. P., Pope, A. J. (2000) High-throughput screening: new technology for the 21st century. Curr Opin Chem Biol 4, 445–451.
2. MDL Information Systems. http://www.mdli.com.
3. Cambridge Crystallographic Data Centre, UK. http://gold.ccdc.cam.ac.uk.
4. Accelrys, San Diego, CA, USA. http://www.accelrys.com.
5. Chemical Computing Group, Montreal, Quebec, Canada. http://www.chemcomp.com.
6. HKL Research, Inc. http://www.hklxray.com.
7. Yan, S., Appleby, T., Larson, G., Wu, J. Z., Hamatake, R., Hong, Z., Yao, N. (2006) Structure-based design of a novel thiazolone scaffold as HCV NS5B polymerase allosteric inhibitors. Bioorg Med Chem Lett 16, 5888–5891.
8. Yan, S., Appleby, T., Larson, G., Wu, J. Z., Hamatake, R. K., Hong, Z., Yao, N. (2007) Thiazolone-acylsulfonamides as novel HCV NS5B polymerase allosteric inhibitors: convergence of structure-based drug design and X-ray crystallographic study. Bioorg Med Chem Lett 17, 1991–1995.
9. Yan, S., Larson, G., Wu, J. Z., Appleby, T., Ding, Y., Hamatake, R., Hong, Z., Yao, N. (2007) Novel thiazolones as HCV NS5B polymerase allosteric inhibitors: Further designs, SAR, and X-ray complex structure. Bioorg Med Chem Lett 17, 63–67.
10. Lyne, P. D. (2002) Structure-based virtual screening: an overview. Drug Discov Today 7, 1047–1055.
11. Jain, S. K., Agrawal, A. (2004) De novo drug design: an overview. Indian J Pharm Sci 66, 721–728.
12. Schrodinger, LLC., Portland, OR, USA. http://www.schrodinger.com.
13. Rarey, M., Kramer, B., Lengauer, T. (1999) Docking of hydrophobic ligands with interaction-based matching algorithms. Bioinformatics 15, 243–250.
14. Kramer, B., Rarey, M., Lengauer, T. (1997) CASP2 experiences with docking flexible ligands using FlexX. Proteins Suppl 1, 221–225.
15. Kramer, B., Rarey, M., Lengauer, T. (1999) Evaluation of the FLEXX incremental construction algorithm for protein-ligand docking. Proteins 37, 228–241.
16. Tripos, St. Louis, MO, USA. http://www.tripos.com.
17. Molecular Graphics Laboratory, The Scripps Research Institute, San Diego, CA, US. http://autodock.scripps.edu.
18. DesJarlais, R. L., Sheridan, R. P., Dixon, J. S., Kuntz, I. D., Venkataraghavan, R. (1986)
Chapter 10
Structure-Based and Property-Compliant Library Design
of 11β-HSD1 Adamantyl Amide Inhibitors
Genevieve D. Paderes, Klaus Dress, Buwen Huang, Jeff Elleraas,
Paul A. Rejto, and Tom Pauly
Abstract
Multiproperty lead optimization that satisfies multiple biological endpoints remains a challenge in the
pursuit of viable drug candidates. Optimization of a given lead compound to one having a desired set
of molecular attributes often involves a lengthy iterative process that utilizes existing information, tests
hypotheses, and incorporates new data. Within the context of a data-rich corporate setting, computational
tools and predictive models have provided the chemists a means for facilitating and streamlining this
iterative design process. This chapter discloses an actual library design scenario for following up a lead
compound that inhibits 11β-hydroxysteroid dehydrogenase type 1 (11β-HSD1) enzyme. The application
of computational tools and predictive models in the targeted library design of adamantyl amide 11β-HSD1 inhibitors is described. Specifically, the multiproperty profiling using our proprietary PGVL (Pfizer
Global Virtual Library) Hub is discussed in conjunction with the structure-based component of the
library design using our in-house docking tool AGDOCK. The docking simulations were based on a
piecewise linear potential energy function in combination with an efficient evolutionary programming
search engine. The library production protocols and results are also presented.
Key words: Multiproperty lead optimization, library design, adamantyl amide, targeted library,
11β-hydroxysteroid dehydrogenase type 1, 11β-HSD1, PGVL, Pfizer Global Virtual Library,
structure-based, AGDOCK, piecewise linear, evolutionary programming.
1. Introduction
Glucocorticoids (GC) are steroid hormones that regulate various physiological processes via stimulation of the nuclear glucocorticoid receptors (1). Chronically elevated levels of active
GC hormones (e.g., cortisol) have been associated with many
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_10, Springer Science+Business Media, LLC 2011
Paderes et al.
diseases, including diabetes, obesity, dyslipidemia, and hypertension. In mammalian tissues, GC hormonal regulation is controlled by two isozymes of 11β-hydroxysteroid dehydrogenase
that catalyze the interconversion of inert cortisone and active
cortisol, namely, 11β-HSD1, which is present predominantly in
the liver, adipose tissue, and brain, and 11β-HSD2, which is
mainly expressed in the kidney and placenta (2, 3). 11β-HSD1
is a bidirectional, NADPH-dependent enzyme that catalyzes the
conversion of inactive 11-keto GCs (cortisone in humans and
11-dehydrocorticosterone in rodents) into hormonally active
11β-hydroxy GCs (cortisol in humans and corticosterone in
rodents), whereas 11β-HSD2 is a unidirectional dehydrogenase
that catalyzes the reverse reaction (cortisol to cortisone) using
solely NAD+ as a cofactor (3, 4). In recent years, studies
in animal models (5–7) and in humans (8–12) have provided evidence
for the role of 11β-HSD1 enzyme activity in obesity, diabetes, and
insulin insensitivity. In line with these findings, inhibition of 11β-HSD1 by the steroid carbenoxolone (CBX) improved
insulin sensitivity in humans (13, 14). Thus, 11β-HSD1 is considered a promising target for the treatment of glucocorticoid-related diseases and has given rise to several classes of nonsteroidal
11β-HSD1 inhibitors (15–18), including the adamantyl triazoles
and amides (19–21).
The identification of an adamantyl amide inhibitor of human
11β-HSD1 (Fig. 10.1) in our laboratories has prompted us
to design a targeted library of close analogs using the Pfizer
Global Virtual Library (PGVL) Hub, a desktop tool for designing
libraries and accessing Pfizer internal tools, models, and resources.
With PGVL Hub, we were able to input our customized alcohol-containing adamantyl amide template and select the appropriate
reaction protocol with its corresponding set of amine monomers
(Fig. 10.2). The reaction protocol involves the transformation
of alcohols to amines via mesylation followed by amine substitution (22, 23). The initial set of amine monomers from in-house
and commercial sources gave us 13,000 amines. Reduction in
the virtual chemistry space to 1,000 was achieved by selecting only available secondary amines having molecular weights
Fig. 10.2. Reaction protocol for the transformation of alcohols to amines via mesylation and amine substitution, as
shown in PGVL Hub.
less than 200. Since the objective of the library design was to
improve the cellular potency while retaining good solubility and
stability in human liver microsomes (HLM) and human hepatocytes (HHep), in silico property calculation and profiling were
performed on the enumerated virtual products, resulting in 300
predicted property-compliant virtual products.
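The monomer reduction step described above (available secondary amines with molecular weights below 200) amounts to a simple conjunction of filters. The sketch below is illustrative only and does not reflect the PGVL Hub interface: the `Amine` record, its fields, and the example pool are hypothetical, and the molecular weights are approximate.

```python
from dataclasses import dataclass


@dataclass
class Amine:
    name: str
    mol_weight: float   # Da
    n_h_on_n: int       # hydrogens on the amine nitrogen (1 for a secondary amine)
    available: bool


def select_monomers(amines, max_mw=200.0):
    """Keep available secondary amines with molecular weight below max_mw."""
    return [a for a in amines
            if a.available and a.n_h_on_n == 1 and a.mol_weight < max_mw]


# Hypothetical monomer pool
pool = [
    Amine("pyrrolidine", 71.1, 1, True),
    Amine("benzylamine", 107.2, 2, True),        # primary amine, rejected
    Amine("morpholine", 87.1, 1, True),
    Amine("bulky_sec_amine", 245.3, 1, True),    # too heavy
    Amine("rare_sec_amine", 120.2, 1, False),    # not available
]
kept = select_monomers(pool)
print([a.name for a in kept])
```

The subsequent property profiling would be applied to the enumerated products rather than to the monomers, so it is not part of this sketch.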
In order to ensure the retention of enzyme activity, the virtual products were subjected to fixed anchor docking using
AGDOCK, wherein the adamantyl amide moiety was fixed to
a specified crystal-bound coordinate during the docking simulations. At the time of our library design, the human 11β-HSD1 (hu11β-HSD1) crystal structure was not available. Thus,
we utilized our available in-house guinea pig 11β-HSD1 (gp11β-HSD1) protein crystal structure for docking our virtual library
and selected its bound adamantyl ligand, which showed activity in hu11β-HSD1, for defining the coordinates of our fixed
anchor structure. The docking simulations were carried out
using a piecewise linear intermolecular function (24–27) and
a stochastic search algorithm based on evolutionary programming (28, 29). Evaluation of the dock hits led to the selection of the top-ranking virtual compounds based on their estimated high-throughput docking scores. The resulting structure-based and property-compliant, 88-compound library design was
then submitted to production for combinatorial synthesis. Initial
Fig. 10.3. Adamantyl amide analogs exhibit similar binding modes in guinea pig (green)
and in human (pink) 11β-HSD1 cocrystal complexes, with Ser-170 and Tyr-183 forming
hydrogen bond interactions with the bound ligands. A nonconserved residue (Tyr-231 in
guinea pig and Asn-123 in human) differentiates the active sites for these analogs.
PGVL is defined as a set of virtual molecules that can be synthesized from the available monomers and existing templates using
validated reaction protocols at Pfizer. It covers a vast virtual chemistry space on the order of 10^13 compounds. PGVL Hub is the
corresponding desktop interface used for quick navigation of the
virtual chemistry space and contains the basic features of an earlier
library design tool called LiBrain (33). Searching PGVL for compounds similar to a given lead or HTS hit can be carried out using
a Lead Centric Mining tool within PGVL Hub or a desktop
application called the Bayesian Idea Generator (34). For library
designs, virtual searching and screening are simply conducted on
specific subsets of PGVL, as defined by reaction types that utilize a
set of registered chemistry protocols along with their specific sets
of mined reactant monomers. One of the most useful features in
PGVL Hub is its ability to access Pfizer's internal computational
tools and models. Thus, calculation of physicochemical properties
(e.g., thermodynamic solubility) and use of predicted biological
AGDOCK is a Pfizer application for rapid and automated computational prediction of the binding geometries (conformation
and orientation) of compounds in a given protein active site, as
defined by the input defining ligand. It operates in three modes,
namely noncovalent docking, covalent docking (25, 36), and partially fixed or fixed anchor docking (36, 37). The default mode
is noncovalent docking with full ligand conformational flexibility
that explores a large number of degrees of freedom. Significant
reduction in the number of degrees of freedom is achieved with
the latter two modes in which part of the ligand is fixed within the
active site of the protein, either through covalent bond formation
with the receptor (covalent docking) or by imposition of positional constraints on an anchor fragment (fixed anchor docking)
that is primarily responsible for molecular recognition. AGDOCK
employs two search engines, evolutionary programming (24–27)
and simulated annealing (38), both of which allow for a full search
of the ligand conformation and orientation within the active site.
It also supports two intermolecular potentials, AMBER (39) and
piecewise linear potential (24–27), and an intramolecular potential consisting of van der Waals and torsional terms derived from
the DREIDING force field (40). The intermolecular potential
developed for AGDOCK incorporates both steric and hydrogen
bond contributions which are calculated from the sum of pairwise interactions between the ligand and the protein heavy atoms
using piecewise linear potentials. This energy function along with
an evolutionary search technique enables the structure prediction
of the protein-ligand complex.
2. Materials
2.1. PGVL Monomers
1. Reactant A: R-1(2-hydroxy-ethyl)-pyrrolidine-2-carboxylic
acid adamantan-2-ylamide (Fig. 10.2)
2. Reactant B: 88 cyclic and acyclic secondary amines with
molecular weights ranging from 71.2 to 162.1 Da
Paderes et al.
1. Structure file (in SDF or PDB format) containing the crystal-bound conformation of the reference ligand (Reactant A) in the
gp11β-HSD1 cocrystal complex
2. Structure file (in PDB format) containing the coordinates of
the protein crystal structure derived from the gp11β-HSD1
cocrystal complex (Fig. 10.4a)
3. Structure file (in SDF or PDB format) of the anchor or core
structure (Fig. 10.4b) which will be used in specifying the
fixed coordinates of the common fragment in the virtual
library of adamantyl amide analogs
4. Structure file (in SDF format) of the virtual library compounds to be subjected to fixed anchor or partially fixed
docking
Fig. 10.4. (a) Crystal structure of the gp11β-HSD1 complex with Reactant A, which was used
as the defining ligand in docking simulations; residues Ser-170 and Tyr-183 are labeled. (b) Adamantyl amide core structure used in
fixed anchor docking.
2.4. Computational Tools and Resources
3. Methods
3.1. PGVL Library Design
The library design was conducted with PGVL Hub, which allows
the retrieval of the appropriate reaction protocol along with its
corresponding sets of reactant monomers (Fig. 10.2). There are
four monomer sources: a commercial domain (ACD)
and three in-house domains (AXL, MN, and PF). With the selection of the in-house and commercial domains, the virtual library
size is 26,404 (2 Reactant A × 13,202 Reactant B). In
this design, we selected only the in-house monomers, which gave
us a virtual library size of 11,664 (2 A × 5,832 B). By specifying a single alcohol-containing template for Reactant A, which
is needed for generating close analogs of the adamantyl amide
lead compound, the virtual library size was reduced to 5,832
(1 A × 5,832 B). Further reduction in chemistry space was
achieved through filtering at both the monomer and virtual product levels, as outlined in the following library design
steps.
1. Calculate the molecular weight and structural alerts (substructures containing undesirable or reactive functionalities)
for 5,832 amines (see Note 1).
2. Perform a substructure search for secondary amines.
3. Select only secondary amines with molecular weight (MW)
less than 200 and with no structural alerts. This step drastically reduced the number of amines from 5,832 to 1,019.
4. Enumerate the virtual products for the alcohol template and
the selected amines using the PGVL Hub virtual product
enumerator.
5. Calculate the following molecular properties within PGVL
Hub using global computational tools and models: (a) Rule-of-Five (RO5), MW, cLogP, number of hydrogen-bond
donors (HBD), number of N and O atoms (NO), and number of RO5 violations; (b) topological polar surface area
(TPSA); (c) number of rotatable bonds (NRB); (d) LogD
(see Note 2); and (e) aqueous solubility (see Note 3).
6. Impose the desired molecular property profile for the virtual
products by setting computed property thresholds using the
PGVL Hub Decision Maker feature, as shown in Fig. 10.5.
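The monomer-level filtering in steps 1 to 3 and the library-size arithmetic quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Amine` record, the `select_amines` and `library_size` helpers, and the four example monomers are hypothetical and are not part of PGVL Hub, where the substructure search and structural-alert flags are computed by in-house tools.

```python
from dataclasses import dataclass


@dataclass
class Amine:
    name: str
    mw: float            # molecular weight in Da
    is_secondary: bool   # outcome of a secondary-amine substructure search
    has_alert: bool      # True if any structural alert was flagged


def select_amines(amines, mw_cutoff=200.0):
    """Keep secondary amines below the MW cutoff that carry no structural alert."""
    return [a for a in amines
            if a.is_secondary and a.mw < mw_cutoff and not a.has_alert]


def library_size(n_reactant_a, n_reactant_b):
    """Size of a two-component virtual library: every A is paired with every B."""
    return n_reactant_a * n_reactant_b


# Hypothetical monomers, for illustration only
amines = [
    Amine("pyrrolidine", 71.1, True, False),
    Amine("morpholine", 87.1, True, False),
    Amine("aniline", 93.1, False, True),            # primary amine with an alert
    Amine("large_secondary_amine", 245.3, True, False),
]

selected = select_amines(amines)
print([a.name for a in selected])

# Library sizes quoted in the text
print(library_size(2, 13202), library_size(2, 5832), library_size(1, 5832))
```

With a single Reactant A template, the virtual library size equals the number of amines that survive the filter, which is why the monomer-level selection dominates the reduction in chemistry space.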
Fig. 10.5. Virtual product property profiling within PGVL Hub. The upper threshold for cLogD and the lower threshold
for c_LogS were determined from Spotfire analysis of 2-aminoacetamide lead series. The upper threshold values for
MW, number of rotatable bonds (NRB), and polar surface area (TPSA) were user-specified parameters. The rest of the
thresholds (e.g., lower threshold for calculated LogD at pH 7.4 or the upper threshold for calculated Log of solubility) are
either the lowest or the highest property values of the virtual products.
The energy function (24) used to predict the structure and energy
of the protein–ligand complex contains an intermolecular term
for the interaction between the ligand and the protein and an
intramolecular term for the internal energy of the ligand. The
intramolecular term includes the torsional and the van der Waals
functions of the DREIDING force field (40) and is useful in
differentiating between low- and high-energy ligand geometries
and in preventing internal ligand collapse (overlap between ligand atoms). The intermolecular term is a simplified intermolecular potential, which is a pairwise sum of piecewise linear potentials over all ligand and protein heavy (nonhydrogen) atoms (Fig.
10.6). Both the hydrogen-bonding and repulsive terms are modulated by a scaling factor based on the relative orientation of the
protein and the ligand atoms (Fig. 10.7). The piecewise linear
potentials for the hydrogen bond and steric interactions have the
same functional form but different parameters. This functional
form has the advantage of remaining finite when the
interatomic distance approaches zero, in contrast to the very high energy of the Lennard–Jones potential, thereby allowing the ligand to come in close contact with the protein during the early
stages of docking simulations. The parameters used in the piecewise linear potentials depend on the type of interaction and the
size of the atom. There are three types of interaction that arise
from four different protein and ligand atom types, as follows: (a)
hydrogen bond interactions between donors and acceptors, (b)
repulsive interactions between pairs of donors or acceptors, and
(c) steric interactions between nonpolar atoms or one nonpolar
and another atom type (Table 10.1). Every pair of interacting
atoms is assigned one of these three types of interaction. Atoms
are also assigned atomic radii of 1.4, 1.8, and 2.2 Å, corresponding to small (F, metal ions), medium (C, O, N), and large
(S, P, Cl, Br) atoms, respectively (36). These parameters were
derived from optimized interatomic distances observed in high-quality crystal structures.
Fig. 10.6. (a) Functional form of the hydrogen bond interaction energy (A = 15.0,
B = 2.3, C = 2.6, D = 3.1, E = 3.4, F = 4.0) and nonpolar dispersion (A = 15.0,
B = 0.93σ, C = σ, D = 1.25σ, E = 1.5σ, F = 0.4), where σ is the sum of the atomic
radii of the protein and the ligand atoms. (b) Functional form of the repulsive interaction
(A = 15.0, B = 3.2, C = 6.0, F = 1.5). A and F are in kcal/mol and B–E are in Angstroms.
Both panels plot energy against interatomic distance.
Fig. 10.7. (a) Hydrogen bond strength (maximum 1.0) is a function of the angle determined by the
relative orientation of the protein and ligand atoms; the angle axis is marked at 90, 120, and 180. (b) A protein donor atom D bound
to one hydrogen atom H makes an angle with the ligand atom L. (c) A protein donor
atom D bound to two hydrogen atoms H makes an angle with the ligand atom L.
(d) A protein acceptor atom A makes an angle with the ligand atom L.
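A short sketch may help make the pairwise term concrete. It is not the AGDOCK implementation. It assumes that the atom-pair rules follow Table 10.1, that the hydrogen bond and steric terms share the well-shaped form of Fig. 10.6a (energy A at zero distance, falling linearly to zero at B, to a well between C and D, and returning to zero at E), that F is the magnitude of the well depth so the energy in the well is -F, and that the steric distances scale with the sum of the two atomic radii. The function names are hypothetical, and the angular scaling of Fig. 10.7 and the repulsive form of Fig. 10.6b are left out.

```python
def interaction_type(protein_type: str, ligand_type: str) -> str:
    """Interaction type for one protein/ligand heavy-atom pair (after Table 10.1).

    Atom types are 'donor', 'acceptor', 'both', or 'nonpolar'.
    """
    if "nonpolar" in (protein_type, ligand_type):
        return "steric"
    if protein_type == ligand_type and protein_type in ("donor", "acceptor"):
        return "repulsive"
    return "hbond"


def plp_well(r, A, B, C, D, E, F):
    """Piecewise linear well (assumed form): A at r = 0, zero at B and E, -F on [C, D]."""
    if r < B:
        return A * (1.0 - r / B)           # finite repulsion at short range
    if r < C:
        return -F * (r - B) / (C - B)      # descent into the well
    if r < D:
        return -F                          # flat bottom of the well
    if r < E:
        return -F * (E - r) / (E - D)      # ascent back to zero
    return 0.0                             # no interaction at long range


# Hydrogen bond parameters as listed in the Fig. 10.6 caption
# (A and F in kcal/mol, B to E in Angstroms)
HBOND = dict(A=15.0, B=2.3, C=2.6, D=3.1, E=3.4, F=4.0)


def steric_params(radius_protein, radius_ligand):
    """Steric parameters scaled by sigma, the sum of the two atomic radii."""
    sigma = radius_protein + radius_ligand
    return dict(A=15.0, B=0.93 * sigma, C=sigma, D=1.25 * sigma,
                E=1.5 * sigma, F=0.4)


print(interaction_type("donor", "acceptor"), plp_well(2.8, **HBOND))
print(interaction_type("nonpolar", "both"), plp_well(3.6, **steric_params(1.8, 1.8)))
```

Because the energy at zero distance is capped at A rather than diverging, early generations of a search can pass through clashing poses without being dominated by a single overlapping atom pair.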
Table 10.1
Three types of interaction between ligand and protein heavy
atoms arising from different atom types. Primary and secondary amines are defined to be donors while oxygen and
nitrogen atoms with no bound hydrogens are defined to be
acceptors. Crystallographic water molecules and hydroxyl
groups are defined to be both donor and acceptor. Carbon
and sulfur atoms are defined to be nonpolar

            Ligand
Protein     Donor       Acceptor    Both      Nonpolar
Donor       Repulsive   H-bond      H-bond    Steric
Acceptor    H-bond      Repulsive   H-bond    Steric
Both        H-bond      H-bond      H-bond    Steric
Nonpolar    Steric      Steric      Steric    Steric

3.3. Evolutionary Search for Ligand Exploration
Table 10.2
Rotatable bond types with common threshold energy values
Threshold energy value (kcal/mol): 1.0, 2.0, 5.0, 10.0, 25.0, 50.0
In a glove box, the alcohol (320 μL, 80.0 μmol, 1.0 eq, 0.25 M
in anhydrous DCE), TEA (33 μL, 240 μmol, 3.0 eq, neat TEA),
DMAP (40 μL, 8.0 μmol, 0.1 eq, 0.2 M in anhydrous DCE), and
methanesulfonyl chloride (320 μL, 160 μmol, 2.0 eq, 0.5 M in
anhydrous DCM) were added to a 10 × 95 mm test tube. The
test tube was sealed with a test tube cap and stirred in the glove
box for 3 h at ambient temperature. The solvent was evaporated (SpeedVac or GeneVac, vacuum, medium heat, 16 h) and
the residue was dissolved in anhydrous DMF (400 μL). TEA (80
μL, 80.0 μmol, 1.0 eq, 1 M in anhydrous DMF) and the amine
(480 μL, 240.0 μmol, 3.0 eq, 0.5 M in anhydrous DMF) were
added. The test tube was sealed with a test tube cap. The reaction
was heated and stirred at 80 °C for 5 h. The solvent was evaporated (SpeedVac or GeneVac, vacuum, medium heat, 16 h), and
the residue was dissolved in DMSO (1.340 mL). The reaction
mixtures were analyzed by LC–MS and the products isolated by
automated mass-directed HPLC.
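The reagent quantities above can be cross-checked with simple arithmetic, since a volume in microliters multiplied by a molar concentration gives an amount in micromoles. The sketch below assumes that the volumes are in microliters and the amounts in micromoles, with the alcohol (80.0 μmol) as the limiting reagent; the helper name and the dictionary layout are hypothetical. Neat TEA is omitted because its amount depends on density rather than on a stock concentration.

```python
def micromoles(volume_ul, molarity):
    """Amount in micromoles delivered by volume_ul microliters of a stock of given molarity."""
    return volume_ul * molarity


ALCOHOL_UMOL = 80.0  # limiting reagent, 1.0 eq

# reagent: (volume in uL, stock concentration in mol/L, stated equivalents)
additions = {
    "alcohol": (320, 0.25, 1.0),
    "DMAP": (40, 0.2, 0.1),
    "methanesulfonyl chloride": (320, 0.5, 2.0),
    "TEA (second step)": (80, 1.0, 1.0),
    "amine": (480, 0.5, 3.0),
}

for name, (volume, conc, stated_eq) in additions.items():
    amount = micromoles(volume, conc)
    print(f"{name}: {amount:.1f} umol = {amount / ALCOHOL_UMOL:.2f} eq "
          f"(stated {stated_eq} eq)")
```

Each computed amount agrees with the stated equivalents, which is a quick sanity check worth repeating whenever stock concentrations are changed.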
All chromatographic separations were at ambient temperature. Analytical-scale separations were achieved using Agilent HP 1100
MSD systems with a Phenomenex Gemini C18 column
(4.6 × 50 mm ID, 5.0 μm) or an Agilent Zorbax Extend C18
column (4.6 × 50 mm ID, 3.5 μm). The mobile phase consisted
of water and acetonitrile, each with 0.05% trifluoroacetic acid, and
was applied as a linear gradient of 0–100% organic solvent in 3.0 or
1.75 min, depending on the column used. The MSD utilized positive mode APCI with a scan range from 100 to 1,000 amu. The
mass-directed preparative HPLC was a Waters Fractionlynx system operating at 50 mL/min using the Gemini C18 stationary
phase in a 20 × 50 mm ID column. The mobile-phase solvents
were the same as for the analytical scale, with a 1 min hold to allow
for a 1,200 μL injection of the crude sample. The gradient was
0–100% organic in 5.4 min. The Waters Micromass ZQ single
quad MS utilized positive mode electrospray ionization with
a 1:10,000 split from the preparative flow to the MS using a
methanol carrier fluid.
The library synthesis steps are as follows:
(1) Prepare an 8 × 11 array of 10 × 95 mm test tubes in a test
tube rack.
(2) Add one 6 × 3 mm stir bar to each of the test tubes.
(3) Dry the rack of test tubes, the vials, and the caps needed to
make the stock solutions at 100 °C for 16 h (overnight).
Predried vials and caps must be used in subsequent steps.
(4) Transfer the rack of test tubes, the vials, and the caps into
a glove box and keep them there until use.
(5) In the glove box, prepare a 0.25 M stock solution of each
alcohol (Reactant A) in anhydrous DCE. Note: If the alcohol is supplied as a
salt, an equal number of equivalents of TEA should be added.
(6) In the glove box, prepare a 0.5 M stock solution
of methanesulfonyl chloride (MW = 114.55) in anhydrous DCE.
Fig. 10.8. Cellular activity and hepatocyte stability of the 37 purified library compounds. (a) cLipE vs. human 11β-HSD1 EC50 (hHEK_EC50). The chart enables rapid identification of the active and lipophilic efficient compounds (in the lower right
corner), e.g., PF-03440171 with cLipE = 8.7 and EC50 = 0.03 nM. (b) Human hepatocyte intrinsic apparent clearance
(GL_hHepCl) vs. hHEK_EC50. The chart enables rapid identification of active and metabolically stable compounds, such as
PF-03440142 with GL_hHepCl = 5 μL/min/million and EC50 = 67 nM.
Compound annotations shown in the figure:
PF-03440171: hKiapp = 1.6 nM; hHEK_EC50 = 0.03 nM; cLogD = 1.65; Kin_Solubility = 355 μM; GL_hHepCl = 25.6 μL/min/million; GL_HLM = 158 μL/min/mg
PF-03440142: hKiapp = 5.1 nM; hHEK_EC50 = 67 nM; eLogD = 0.91; Kin_Solubility = 459 μM; GL_hHepCl = 5.0 μL/min/million; GL_HLM = 9.7 μL/min/mg
4. Notes
1. Structural alerts (STA) are substructures that
have been associated with multiple examples of adverse in
vivo events (47). Information on the chemistry, pharmacology, and toxicology of these substructures has been collated to provide a strong rationale for their classification as
STA due to their intrinsic reactivity, ability to intercalate into
DNA, coordination with a metal, metabolic activation, or
transformation to a species capable of covalently binding to
biological macromolecules. A common example of an STA is
the aniline functional group, which can be oxidized either at
the aniline nitrogen or at the ortho and para aromatic carbon atoms to form reactive nitroso and iminoquinone
intermediates, respectively, which can lead to increased risk
of mutagenicity, in vivo carcinogenicity, and hepatotoxicity.
Hence, chemists who design libraries must be aware of
these potential liabilities and understand the chemistry and
mechanism involved in the metabolism of the designed compounds. If the purpose of the library is to probe the receptor pocket to find novel chemotypes, then compounds with
STA may be included in the design, provided they can be
modified later to eliminate the risks. In general, chemists are
advised to avoid compounds with STA in order to minimize
the risk of adverse outcomes, thereby supporting the safety and
efficacy of drugs.
2. PGVL Hub has access to a plethora of computational models for predicting molecular properties. In this library design,
the ACD LogD was used in profiling the virtual products.
LogD is a measure of lipophilicity, where
D, the distribution coefficient, is a pH-dependent parameter which changes with the
degree of ionization. The ACD LogD was selected after
careful comparison with alternative LogD predictors, such
as the Pallas LogD and the in-house Cubist models based
on Pfizer experimental LogD. Chemists are encouraged to
test different LogD models to see how well they correlate
with the experimental data of their lead series.
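The pH dependence of LogD can be illustrated with the textbook relationship for a compound with a single ionizable group, assuming that only the neutral form partitions into the organic phase. This is a simplified sketch and not the ACD, Pallas, or Cubist model mentioned above; the function names and the example values are hypothetical.

```python
import math


def logd_monoprotic_base(logp, pka, ph=7.4):
    """LogD of a monoprotic base, assuming only the neutral form partitions."""
    return logp - math.log10(1.0 + 10.0 ** (pka - ph))


def logd_monoprotic_acid(logp, pka, ph=7.4):
    """LogD of a monoprotic acid, assuming only the neutral form partitions."""
    return logp - math.log10(1.0 + 10.0 ** (ph - pka))


# Hypothetical amine with logP = 3.0 and pKa = 9.0
for ph in (2.0, 7.4, 10.0):
    print(f"pH {ph}: LogD = {logd_monoprotic_base(3.0, 9.0, ph):.2f}")
```

Under this approximation LogD never exceeds logP and approaches it as the compound becomes fully neutral, which is one reason predicted and experimental values for a basic lead series should be compared at the assay pH.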
3. The legacy aqueous thermodynamic solubility model is an
in-house linear regression model consisting of 11 calculated
descriptors which include cLogP and polar surface area. This
is the model that was used to shape the property profile of
the virtual products. This model has now been superseded
by a Cubist (48) thermodynamic solubility model (R² =
0.93 and RMSE = 0.44, N = 2794) based on a data set
of 3,075 compounds with intrinsic solubility. This model
Fig. 10.9. Aminoacetamide lead series used in establishing thresholds for solubility and
metabolic stability to guide the library design. R1 can be adamantyl, cycloalkyl, benzyl
or substituted benzyl, aryl, or heteroaryl. R2 can be alkyl, substituted alkyl, cycloalkyl,
benzyl, substituted benzyl, or acetyl. R3 can be H or OH.
Fig. 10.10. (a) Experimental LogD (eLogD) vs. human liver microsome stability (HLM_%Rem@1μM). A threshold value
of eLogD < 2.7 is required for >70% stability in HLM; the point (2.62, 68.8) is marked on the plot. (b) Experimental vs. calculated LogD. eLogD < 2.7 translates to
cLogD < 2.0 at pH 7.4.
Fig. 10.11. (a) Experimental human hepatocyte stability (HHEP_%Rem@1μM) vs. calculated Log of solubility (c_LogS).
(b) Experimental human hepatocyte stability expressed as apparent intrinsic clearance (HHEP_CL(int)_μL/min/million)
vs. calculated Log of solubility (c_LogS). A threshold value of cLogS > 3.0 is required for stability in human
hepatocytes.
stituents in the analogs. One must be careful when selecting a ligand fragment to fix during docking since not all
ligand fragments can act as molecular anchors. A molecular anchor is characterized by a specific binding mode with
a dominant free energy minimum and a large stability gap,
defined as the free energy of the crystal binding mode relative to the free energy of alternative binding modes (26). An
advantage of fixed anchor docking is that the large number
of degrees of freedom due to ligand flexibility is drastically
reduced and that calculation of the free energy of binding for
close analogs containing the anchor fragment is significantly
facilitated.
6. During docking, the ligand is required to remain in a rectangular box that encompasses the active site. Ligand conformations and orientations are searched via an evolutionary
programming algorithm within this rectangular box. A constant energy penalty is added to every ligand atom outside
this box. If the virtual library of compounds contains many
large substituents (Reactant B), it is advisable to increase
this cushion to a larger value in order to accommodate the
Fig. 10.12. (a) Experimental human liver microsome stability (HLM_%Rem@1μM) vs. calculated Log of solubility (c_LogS). (b) Experimental human liver microsome stability expressed as apparent intrinsic clearance
(HLM_CL(int)_μL/min/mg) vs. calculated Log of solubility (c_LogS). A threshold value of cLogS > 3 and 4.0 is
required for stability in HLM based on %R and intrinsic clearance, respectively.
Fig. 10.13. Examples of dock poses from fixed anchor docking of the virtual library in
the gp11β-HSD1 crystal structure. Labeled residues: Ser-A170, Tyr-A177, Pro-A178, and Tyr-A183.
Acknowledgments
The authors would like to thank Simon Bailey, Martin Edwards,
and Michael McAllister for their valuable advice, encouragement,
and guidance. Specifically, the authors are grateful to Stanley
Kupchinsky for the synthesis of the starting adamantyl amide
lead and to the Discovery Computation group at PGRD La
Jolla for the development of PGVL and AGDOCK, under the
leadership of Atsuo Kuki and Peter Rose, respectively. Thanks
are especially due to the following colleagues who developed
and performed our project assays, specifically, Jacques Ermolieff (11β-HSD1 enzyme assays); Andrea Fanjul (11β-HSD1 cellular assays); Nora Wallace, Christine Taylor, and Rob Foti
(HLM assays); and Veronica Zelesky, Kevin Whalen, and Walter
Mitchell (HHEP assays). This work was supported by the 11β-HSD1 project team and the Pfizer Diabetes Therapeutic Area
management.
References
1. Charmandari, E., Kino, T., Chrousos, G. P. (2004) Glucocorticoids and their actions: an introduction. Ann N Y Acad Sci 1024, 1–8.
Section III
Fragment-Based Library Design
Chapter 11
Design of Screening Collections for Successful
Fragment-Based Lead Discovery
James Na and Qiyue Hu
Abstract
A successful fragment-based lead discovery (FBLD) campaign largely depends on the content of the
fragment collection being screened. To design a successful fragment collection, several factors must be
considered, including collection size, property filters, hit follow-up considerations, and screening methods. In this chapter, we will discuss each factor and how it was applied to the design and assembly of one
or more fragment collections in a major pharmaceutical company setting. We will also present examples
and statistics of screening results from such collections and how subsequent collections can be improved.
Lastly, we will provide a summary comparison of selected fragment collections from literature.
Key words: Fragment-based lead discovery, screening collection, library design, computational
filtering, NMR screening
1. Introduction
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_11, Springer Science+Business Media, LLC 2011
Na and Hu
2. Factors to Consider in Creating a Fragment Collection
2.1. Collection Size
Solubility is an important factor to consider in selecting fragments, especially when the assay of choice is high-concentration
screening. This is often the case since fragments are typically weak
binders with low millimolar to high micromolar activity. In most
instances the solvent used is polar, such as DMSO
or water. Therefore, the fragments have to be water soluble, and
compounds having ionizable groups or polar functions favor solubility. There are various methods to calculate the solubility of
a compound; see a recent review for details (13). These calculated values can be used to guide the selection of compounds for
a given collection.
Besides solubility, other physicochemical factors to be considered in building a fragment collection are molecular weight, number of heavy atoms, rotatable bonds, lipophilicity, and polar surface area. All or some of these factors can be accounted for when
building a collection, although molecular weight is one factor
that is almost always considered. In general the MW range for a
fragment collection should fall within 120–300, with the median
MW at around 200–250. Compounds with MW much less than
150 are undesirable due to a higher risk of nonspecific or undetectable binding, while compounds larger than 300 begin to resemble full-size molecules rather than fragments. Adjusting the MW range also affects the number of heavy atoms,
a closely related molecular property. For the number of rotatable bonds, a fragment with MW around 250 would typically
have 0–3 rotatable bonds, which in general is a desirable range
for a fragment collection. Another important factor to consider is
lipophilicity, usually measured as logP, which can have a big influence on binding affinity. The generally accepted range for logP is
0–3 (14).
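The property ranges above can be applied as a simple window check, followed by a look at the median MW of what remains. The sketch below is a minimal illustration: the window values are taken from the text, while the dictionary layout, the helper names, and the five example fragments are hypothetical. Property values are assumed to have been calculated beforehand with whatever tools are available.

```python
from statistics import median

# Ranges taken from the text: MW 120-300, 0-3 rotatable bonds, logP 0-3
FRAGMENT_WINDOW = {
    "mw": (120.0, 300.0),
    "rot_bonds": (0, 3),
    "logp": (0.0, 3.0),
}


def in_window(fragment):
    """True if every listed property lies inside its allowed range."""
    return all(low <= fragment[key] <= high
               for key, (low, high) in FRAGMENT_WINDOW.items())


def summarize(collection):
    """Return the fragments inside the window and their median MW (None if empty)."""
    kept = [f for f in collection if in_window(f)]
    median_mw = median(f["mw"] for f in kept) if kept else None
    return kept, median_mw


# Hypothetical fragments with precomputed properties
fragments = [
    {"name": "frag_a", "mw": 136.2, "rot_bonds": 1, "logp": 1.2},
    {"name": "frag_b", "mw": 228.3, "rot_bonds": 2, "logp": 2.1},
    {"name": "frag_c", "mw": 251.3, "rot_bonds": 3, "logp": 2.8},
    {"name": "frag_d", "mw": 342.4, "rot_bonds": 5, "logp": 3.9},   # too large
    {"name": "frag_e", "mw": 98.1, "rot_bonds": 0, "logp": -0.4},   # too small
]

kept, median_mw = summarize(fragments)
print([f["name"] for f in kept], median_mw)
```

Checking the median as well as the per-fragment window helps catch a collection that passes every individual filter but is skewed toward the small end of the range.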
One of the less common factors considered when building a fragment collection is synthetic attractiveness of the fragments, or put
differently, ease of hit follow-up for the hit-to-lead process. In
general, chemists like to have hits which present opportunities
for making analogs, but often fragments lack the synthetic handle(s) that a chemist desires. It would be relatively easy to assemble a fragment collection from a pool of reagent-type compounds
which contain one or multiple reactive centers. However, some
chemistry savvy must be exercised to avoid highly reactive functional groups such as sulfonyl halides or isocyanates, which can
react with the protein side chains. Because fragments are often
screened as a mixture, care also must be taken to avoid compounds which may react with each other when mixed together.
Another factor to consider is that most reactive functional groups
can also elicit binding interaction, and this interaction can be
altered or destroyed altogether when reaction occurs at the site.
For example, the bromine of an alkyl bromide can elicit a binding
interaction with the protein target which can result in the compound being a screening hit. However, if the bromine is used as
a chemistry vector for elaboration, the bromine is then displaced
upon reaction, which eliminates the bromine–protein interaction,
which may be an important part of the overall binding interaction. Hence, the selection of compounds which contain reactive
functional groups to be included in a screening collection must
be done carefully, preferably with input from medicinal/synthetic
chemists.
Less reactive synthons which also facilitate the hit follow-up
process contain chemistry handles or functional groups on a
molecule which can be easily converted to grow or shape the
molecule. Among the more useful functional groups for this purpose are aromatic halides, in which the halogen (or even
the aromatic hydrogens) can often be the chemistry vector that
allows the chemist to explore other parts of the binding pocket.
Primary amines and carboxylic acids are other functional groups
that can be considered useful chemistry handles to be included in
a fragment collection. When screened against a protein target, the
binding interaction from these functional groups can mostly be
preserved even after they are used as chemistry vectors for elaboration. Figure 11.1 lists commonly used functional groups which
serve as chemistry handles in a fragment.
Sometimes, reactive functional groups are protected. For
instance, a novel primary amine can be protected by reacting
with a small acid (e.g., acetic acid), and the resulting amide would
still retain some of the characteristics of the original amine where
both the amino R-group and the resulting amide reaction product can elicit binding interaction with the protein. And if this pro-
Fig. 11.1. Functional groups that are useful as chemistry handles in a fragment: acid/ester, amine, amide, sulfonamide, alcohol/phenol, thiol, activated CH (A = C, N), aromatic halide (X = F, Cl, Br), and nitrile.
3. Building a Fragment Collection
There are numerous ways to build a fragment collection. Consideration of factors described in the earlier sections can be used
to build a generic collection or to build a collection tailored to a
specific screening method or a particular target. One of the more
popular and rational approaches to assemble a fragment collection is to build the collection based on the screening technique.
Among the more popular fragment screening methods are NMR
techniques, beginning with the SAR by NMR method pioneered by Fesik et al. at Abbott (5). Other methods using NMR
include saturation transfer difference (STD) (15, 16) and WaterLOGSY (17). There have been several efforts to assemble fragment collections based on NMR screening techniques (11, 18).
In the following sections, we will describe in detail efforts to
build fragment collections and the processes involved in their creation. Two such efforts were performed at a major pharmaceutical
company (Pfizer) while a third took place at a biotech company
(Vernalis).
At the Pfizer research site in La Jolla, the preferred primary
screening method for fragments is the STD NMR technique,
although other research sites have employed other screening
methods. Prior to 2006, Pfizer had a legacy fragment collection of
5 K compounds, but this collection had two major drawbacks:
many of the fragments lacked chemistry handles to facilitate hit
follow-up efforts and almost all of the fragments were purchased
as screening compounds and therefore were of insufficient quantity for chemistry efforts. Therefore, it was decided to build a
fragment collection for the La Jolla NMR screening campaigns.
The goal was to create a collection optimized for NMR screening
while being chemically attractive for hit follow-up efforts.
The approach taken to achieve these goals was to first select
a set of novel reagents, then react these compounds with simple
reagents to cap the reactive functionalities. Virtual products for
the selected compounds were created and then passed through
an in silico filtering process. Finally, the filtered products were synthesized as combinatorial libraries.
Selection of the reagents was based on the Pfizer internal
compound collection which allowed speedy acquisition of any
selected compound. A set of primary amines, secondary amines,
and carboxylic acids which were not commercially available were
chosen for consideration. These acids and amines were designed
by medicinal chemists via a Pfizer internal screening file enrichment effort to be novel and diverse, and more importantly, were
not part of any existing Pfizer fragment collection. The MW
cutoff was set at an upper limit of 200, with most of the amines
having a MW in the range of 100–150. All compounds under consideration
had at least 5 g quantity available to ensure that future follow-up
activities were enabled.
The combinatorial reactions chosen for the novel amines were
amide bond formation and sulfonamide formation. The novel carboxylic acids were derivatized to simple amides. For the amine
reactions, we chose two simple carboxylic acids (propionic acid
and benzoic acid) and two simple sulfonyl chlorides (methylsulfonyl chloride and benzenesulfonyl chloride) as the capping
groups. Propyl amine and benzylamine were chosen as the capping groups to react with the novel carboxylic acids. Because only
one reactant is variable, these combinatorial libraries were
essentially 1 × N libraries, where the single component is a simple
capping reagent and the N component is the set of novel amines or acids.
Next, we used in-house library design software (see details
in Chapter 15) to enumerate the virtual libraries and then calculated various physical properties. Products were removed from
consideration if MW > 300, number of rotatable bonds > 3, or
ClogP > 3. For solubility, two in-house model calculations were
applied as filters: turbidimetric 10 mg/mL and thermodynamic
solubility >100 μM. The resulting cherry-picked library was then
reviewed by NMR spectroscopists to remove compounds likely to
give artifacts, to be insoluble, or to be false positives. These included some conjugated systems and compounds
likely to give indistinct NMR spectra.
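The 1 × N enumeration and the property-based exclusion described above can be sketched as follows. The capping reagents are those named in the text; everything else is an assumption made for illustration: the novel amine identifiers and the product property values are hypothetical, a product is treated as excluded when any one of the three cutoffs is exceeded, thermodynamic solubility is taken to be in micromolar, and the turbidimetric filter is left out. The in-house design software and solubility models are not reproduced here.

```python
AMINE_CAPS = ("propionic acid", "benzoic acid",
              "methylsulfonyl chloride", "benzenesulfonyl chloride")
ACID_CAPS = ("propyl amine", "benzylamine")


def enumerate_1xn(cap, reagents):
    """One 1 x N library: a single capping reagent paired with N novel reagents."""
    return [(cap, reagent) for reagent in reagents]


def passes_filters(props):
    """Apply the product-level cutoffs (any single violation excludes the product)."""
    if props["mw"] > 300 or props["rot_bonds"] > 3 or props["clogp"] > 3:
        return False
    return props["thermo_sol_uM"] > 100  # assumed unit: micromolar


# Hypothetical novel amines; each cap defines its own 1 x N library
novel_amines = ["amine_001", "amine_002", "amine_003"]
libraries = {cap: enumerate_1xn(cap, novel_amines) for cap in AMINE_CAPS}
n_products = sum(len(lib) for lib in libraries.values())
print(len(libraries), "libraries,", n_products, "virtual products")

# Hypothetical calculated properties for three virtual products
products = [
    {"id": "p1", "mw": 214.3, "rot_bonds": 2, "clogp": 1.4, "thermo_sol_uM": 850},
    {"id": "p2", "mw": 318.4, "rot_bonds": 3, "clogp": 2.2, "thermo_sol_uM": 400},
    {"id": "p3", "mw": 276.3, "rot_bonds": 3, "clogp": 2.9, "thermo_sol_uM": 60},
]
print([p["id"] for p in products if passes_filters(p)])
```

Keeping each capping reagent as its own library mirrors how the products would be synthesized, one plate per cap, while the filter is applied per product.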
Approximately 1,200 amines and 300 carboxylic acids were
selected for inclusion in the fragment libraries, from which
approximately 20 fragment libraries were synthesized. These
libraries yielded 2,000 products with sufficient purity (>95%)
and quantity (1.2 mL of 30 M solution), and the product
structures were confirmed via 1D NMR. This fragment collection became known as the NMR Combicores to denote their
purpose and their combichem origin. It was distributed across
several major Pfizer research sites and used in multiple fragment
screens.
One of the lessons learned from previous fragment screening collections is that having fragments which are enabled for chemical
expansion is a key factor in engaging chemists in hit
optimization. This was addressed in the Combicores collection
described above via capped functional groups. However, Combicores is a specialized collection since it was designed specifically
for NMR screening. To build a more generic fragment collection to accommodate protein target screening requirements such
as reagent stability, sensitivity of screening methods, and druggability (19) of binding sites, the Pfizer Global Fragment Initiative (GFI) (20) was initiated with the goal of assembling a fragment collection suitable for several screening methods, including
NMR techniques, SPR, high-concentration bioassays, and fragment crystallography. The assembly process for the collection
involved several computational filters, chemical complexity analysis, diversity analysis, and manual review by chemists to ensure
chemical attractiveness for follow-up. Details of the complete process will be published elsewhere (20). The analysis of the screening results presented at the 2009 fall ACS meeting showed
that fragment screening of GFI offered a consistently high hit rate
across protein classes (21).
Figure 11.2 shows a typical screening and hit-to-lead cascade
utilized in an FBLD campaign at Pfizer. In the first stage, we perform a primary screen (STD) along with a confirmation screen
(HCS, MS, or SPR). For a fragment to be considered a primary
NMR hit, the STD values must be >10 but less than 40, and MS
confirms that at least one copy of the fragment is bound with the
protein (binding = YES). In the biophysical confirmation step,
we conduct competitive binding studies with MS or a biochemical
assay to see if the compound displaces known active site binders in
order to confirm that the fragment is bound at the active site. For
this purpose we also attempt crystallography on the more active
hits when the protein target is amenable. The biochemical assay
results also allow us to calculate ligand efficiencies (LE) (22) of
the fragment hits. In the second stage, the hits that have LE ≥ 0.3
and are chemically attractive are selected for hit
follow-up activities. These activities include database mining for
similar analogs, which are submitted for biochemical screening;
synthesis of analogs by chemists; design of core-based fragments to enable further elaboration of the hits; and design of
structure-based targeted libraries based on the top selected hits. This
is an interactive and iterative process involving a project team
consisting of chemists, spectroscopists, biologists, and computational chemists. Once lead series are identified with good activity
and SAR, they are passed to the lead optimization stage.
[Fig. 11.2 labels: IC50 scale running from 1 mM through 100 µM, 10 µM, and 1 µM to < 100 nM; stages: Biophysical Confirmation; DB Mining, SBDD, Library Designs, Analoging; Lead Series (lead series with SAR) developed toward Lead Optimization; series hand-off to medchem. The number of compounds decreases at each stage.]
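As a rough illustration of the ligand efficiency filter described above, the sketch below uses the commonly quoted definition LE = -ΔG / (number of heavy atoms), with ΔG approximated as RT ln(IC50). Treating IC50 as a stand-in for the dissociation constant, the temperature, and all function names are assumptions made for this example; only the 0.3 cutoff comes from the text.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def ligand_efficiency(ic50_molar, heavy_atoms, temperature_k=298.15):
    """Approximate LE in kcal/mol per heavy atom (assumes IC50 ~ Kd)."""
    if ic50_molar <= 0 or heavy_atoms <= 0:
        raise ValueError("IC50 and heavy atom count must be positive")
    # Binding free energy estimate; negative when IC50 is below 1 M.
    delta_g = R_KCAL * temperature_k * math.log(ic50_molar)
    return -delta_g / heavy_atoms

def passes_le_cutoff(ic50_molar, heavy_atoms, cutoff=0.3):
    """True when the fragment meets the LE cutoff quoted in the text."""
    return ligand_efficiency(ic50_molar, heavy_atoms) >= cutoff

# Hypothetical fragment: IC50 of 100 uM with 13 heavy atoms.
example_le = ligand_efficiency(100e-6, 13)
print(f"LE = {example_le:.2f} kcal/mol per heavy atom")
```

A weak binder can still score well on this measure if it is small, which is why LE rather than raw potency is used to rank fragment hits.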
3.2. Vernalis NMR Screening Collection
Fig. 11.3. Pharmacophoric triangle detection. The dotted lines define a triangle comprising three features: piHydrophobic (H=, centroid of benzene), piAcceptor (A=, oxygen of carboxylic acid), and piPolar (P=, oxygen of hydroxyl), and the shortest bond
path between each pair of features is 2 (A= to H=), 1 (P= to H=), and 4 (A= to P=).
Reprinted (adapted or in part) with permission from Journal of Chemical Information
and Modeling. Copyright 2008 American Chemical Society.
Combining all four SeeDs libraries resulted in 1,315 fragments for the collection. Various properties were calculated, analyzed, and compared with a drug-like reference set created from
the WDI and a binding reference set created from PDB. These
results can be found in the key reference (18).
4. Screening Results
There are various methods to conduct a fragment screening
campaign. The most commonly utilized methods include various NMR techniques, mass spectrometry, SPR, and biochemical
screens. X-ray crystallography is a preferred method since it provides a binding conformation, but it can only be used when the target protein is well behaved. Various calorimetry techniques have
also been used for fragment screening, but these have been less
commonly utilized. The merits of each method have been discussed in the literature (26) and will not be outlined here.
An analysis of screening hits based on 12 NMR screens (Table
11.1) for a range of protein targets conducted over an 8-year
period at Vernalis was performed (27). Three main aspects of the
analysis were (1) the relationship of the fragment hit rates to the
druggability of the target; (2) comparison of hits and nonhits with the
entire fragment library; and (3) the specificity and ligand complexity of the fragment hits.
Composition of the Vernalis fragment library evolved over
the course of 4 years through changes in what was synthesized
in-house, available commercially, and removed from the collection through quality control process. Although the number of
compounds remains roughly the same, the content has changed
dramatically, which makes the analysis quite challenging and interesting as well.
4.1. Fragment Screening Campaigns
three NMR experiments, Class 2 hit shows changes in two experiments, and a Class 3 hit in only one experiment.
4.2. Fragment Hit Rates and Druggability Index
4.3. Comparison of Hits, Nonhits, and the Entire Fragment Library
39
10
24
58
20
23
54
53
42
51
39
35
10
Class 1
seriesc
1,351
1,068
1,064
1,351
1,260
1,351
1,351
1,351
868
855
1,250
308
Library size
0.7
2.2
3.2
0.4
4.5
4.0
4.4
0.4
7.3
4.9
3.1
3.6
Low
High
High
Low
High
High
High
Low
High
High
High
High
Category
No
Yes
Yes
No
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
High-affinity
ligandsd
a Total number of fragments identified by at least one NMR experiment to interfere with the binding of known competitor compound.
b Number of fragments identified by all three NMR experiments (STD, water-LOGSY, and CPMG) to interfere with the binding of known competitor compound.
c Total number of unique chemical series suggested by the clustering results of Class 1 fragment hits with a Tanimoto coefficient of 0.70 and MACCS keys.
d Reported affinities <300 nM. Please refer to the paper for references.
PPI-3
34
40
52
PPI-1
13
PIN-1
PPI-2
119
PDPK1
55
60
82
101
38
HSP70
63
44
JNK3
81
40
11
Class 1b
HSP90
54
FAAH
109
15
Totala
DNA gyrase
CDK2
AK
Protein
Number of hits
Table 11.1
SeeDs screening hit rates for 12 protein targets. With kind permission from Springer Science+Business Media: Journal
of Computer-Aided Molecular Design, 23, 2009, 603, I-Jen Chen and Roderick E. Hubbard, Table 4
Fig. 11.4. Targets with observed high (>2%, light bars) and low (<2%, darker bars)
Class 1 hit rates compared to the druggability score (Dscore) calculated by SiteMap.
The red arrow indicates the minimum Dscore for targets yielding high hit rates for
the current data set. With kind permission from Springer Science+Business Media:
Journal of Computer-Aided Molecular Design, 23, 2009, 603, I-Jen Chen and Roderick
E. Hubbard, Fig. 4.
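The hit classes and the high/low hit-rate split used in Table 11.1 and Fig. 11.4 can be sketched as follows. This is a minimal illustration, assuming that a hit's class is determined only by how many of the three NMR experiments (STD, water-LOGSY, CPMG) respond, and that the 2% line in Fig. 11.4 separates the two categories; the function names and the example counts are hypothetical.

```python
NMR_EXPERIMENTS = ("STD", "water-LOGSY", "CPMG")

def hit_class(responses):
    """Return 1, 2, or 3 for a hit, or None for a nonhit.

    `responses` maps an experiment name to True when the fragment
    showed a change in that experiment.
    """
    n_positive = sum(1 for exp in NMR_EXPERIMENTS if responses.get(exp, False))
    # Three positive experiments -> Class 1, two -> Class 2, one -> Class 3.
    return {3: 1, 2: 2, 1: 3}.get(n_positive)

def class1_hit_rate(n_class1_hits, library_size):
    """Class 1 hit rate as a percentage of the screened library."""
    return 100.0 * n_class1_hits / library_size

def hit_rate_category(rate_percent, threshold=2.0):
    """'High' above the 2% line drawn in Fig. 11.4, otherwise 'Low'."""
    return "High" if rate_percent > threshold else "Low"

# Illustrative numbers only: 63 Class 1 hits from 868 screened fragments.
rate = class1_hit_rate(63, 868)
print(round(rate, 1), hit_rate_category(rate))
```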
The screening results also clearly indicate that small fragments can
be specific binders, even for proteins within the same family.
Given their relatively small size, there are natural concerns
regarding nonspecific binding of fragments. Based on the Vernalis study, 62% of the fragments were competitive binders with
just one target and another 24% were hits for just two targets.
This study shows that most fragments are in actuality quite target
specific. The pie chart in Fig. 11.6 focuses on the hits from three
kinase screenings, PDPK1, CDK2, and JNK3. It shows that at
least 52% of the fragment hits are unique to one kinase, and only
11% of the hits are shared among all three proteins.
Fig. 11.5. Distribution plots of (a) molecular weight (MW), (b) number of rotatable bonds (NRot), (c) SlogP, (d) number
of pharmacophore (ph4) triangles, and (e) number of rings for the whole library (VER_ref), all hits (Class 1–3), Class 1
hits, and nonhits. With kind permission from Springer Science+Business Media: Journal of Computer-Aided Molecular
Design, 23, 2009, 603, I-Jen Chen and Roderick E. Hubbard, Fig. 6.
Ligand complexity can be represented by the number of pharmacophore triangles in fragment structures. Figure 11.7 plots
the averaged pharmacophore complexity of both the hits and the
Class 1 hits (all three NMR spectra confirm binding) for each target. It would appear that the level of complexity required for a
fragment to be detected in binding varies from target to target.
HSP70 appears to be the most demanding target as it requires the
Fig. 11.6. Overlap of kinase fragment hits. The horizontal lines indicate the portion of
unique fragment hits to each kinase. The crossed area (11%) is the portion of common
hits to all three kinases. With kind permission from Springer Science+Business Media:
Journal of Computer-Aided Molecular Design, 23, 2009, 603, I-Jen Chen and Roderick
E. Hubbard, Fig. 9.
Fig. 11.7. Pharmacophore complexity observed for all fragment hits and Class 1 hits for 12 protein targets. With kind
permission from Springer Science+Business Media: Journal of Computer-Aided Molecular Design, 23, 2009, 603, I-Jen
Chen and Roderick E. Hubbard, Fig. 8.
most complex fragments (20 and 27 triangles for all hits and Class
1 hits, respectively) among all targets studied. Perhaps HSP70's hit rate was
among the lowest because fewer fragments have the complexity
required for HSP70 binding. In contrast, both DNA gyrase
and FAAH showed low average ligand complexity, and they are
indeed the top two targets among the 12 screens, with the highest Class
1 hit rates of 4.9 and 7.3%, respectively.
5. Overview of Published Fragment Libraries
5.2. Physical Properties
Most of the fragment libraries are designed with physical properties within the rule-of-3 constraints (14), which were originally
proposed for molecules used for high-throughput fragment crystallography. For solubility, since there are no reliable methods to
predict aqueous solubility, clogP is often used as a guide. At
AstraZeneca, for example, a clogP value above 2.2 is considered enough of a risk
for a neutral compound that its solubility is determined
experimentally (8).
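A property filter of this kind can be sketched in a few lines. The thresholds below are the rule-of-3 limits as they are commonly quoted (MW ≤ 300, clogP ≤ 3, at most 3 hydrogen-bond donors and 3 acceptors); these values, the `Fragment` record, and the example fragments are assumptions for illustration and are not taken from this chapter. The 2.2 clogP flag for neutral compounds follows the AstraZeneca practice mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    name: str
    mw: float        # molecular weight in Da
    clogp: float     # calculated logP
    hbd: int         # hydrogen-bond donors
    hba: int         # hydrogen-bond acceptors
    neutral: bool = True

def passes_rule_of_three(frag):
    """Check the commonly quoted rule-of-3 limits (assumed thresholds)."""
    return frag.mw <= 300 and frag.clogp <= 3 and frag.hbd <= 3 and frag.hba <= 3

def needs_solubility_measurement(frag, clogp_cutoff=2.2):
    """Flag neutral compounds whose clogP suggests a solubility risk."""
    return frag.neutral and frag.clogp > clogp_cutoff

# Hypothetical fragments with made-up property values.
library = [
    Fragment("frag_a", 180.2, 1.4, 1, 2),
    Fragment("frag_b", 250.3, 2.8, 2, 3),
    Fragment("frag_c", 320.4, 3.5, 1, 4),
]
kept = [f.name for f in library if passes_rule_of_three(f)]
flagged = [f.name for f in library
           if passes_rule_of_three(f) and needs_solubility_measurement(f)]
print(kept, flagged)
```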
5.3. Screening Technologies
In recent years there have been encouraging advances in screening technologies suitable for fragment screening. Even for an
established method such as NMR, new techniques have been
developed and put into practice at various companies that
have expanded the scope of targets for FBDD. For example,
0.3
3
194
190
20,342
132
1,400
AstraZeneca
Vertex
Vernalis
1,000
7,000
10,000
20,063
21,869
2,000
SGXc
Sunesis
Graffinity
Evotec
Evotec
ZoBio/Pyxis
Discovery
1.2
1.6
1.3
2.7
1.9
1 to 3
NA
NA
NA
NA
NA
NA
2.3
NA
NA
NA
NA
Number of
rotatable
bonds
NA
NA
3.2
NA
NA
NA
NA
NA
H-bond
acceptors
a All single values in this table for properties are mean, except the values for the Pfizer collection which are median.
b Multiple means more than one screening technology, including NMR, SPR, biochemical assay, and X-ray.
c The property values reported for SGX applies to 90% of the molecules for the SGX collection.
218
247
2.2
300
276
NA
NA
NA
2.2 to 5.5
1.9
NA
NA
NA
1.6
1.5
<200
174
300
127 to 350
850
20,000
Astex
Plexxikon
NA
NA
1,200
AstraZeneca
1.0
3
205
300
2,792
2,000
1.5
Roche
220
10,000
Abbott
ClogP
Pfizera
MW
Size
Company
H-bond
donors
52.6
70
NA
NA
NA
NA
NA
60
NA
NA
NA
NA
60
56.9
NA
Polar
surface
area (Å²)
NMR (target
immobilized)
Biochemical
assay
NMR
SPR (ligand
immobilized)
Tether
X-ray
X-ray
X-ray
NMR
(26)
(33)
(33)
(26, 33)
(37)
(36)
(35)
(34)
(18, 27)
(11, 12)
(8)
NMR
(8)
Multipleb
(26, 33)
(20, 21)
(26, 33)
References
NMR
SPR (target
immobilized)
Multipleb
SAR by NMR
Screening
technologies
Table 11.2
Overview of some key physical properties for selected fragment libraries and their associated screening methods
6. Conclusion
The composition of a fragment collection can have a profound
effect on the success of an FBLD campaign. Consideration of the
screening method and ease of chemistry follow-up are two of the
more important factors in creating a fragment collection. It has
been shown that by using a combination of computational analysis and human expertise, a fragment collection can be created
to accommodate a single method or several screening methods
without being target or protein family specific. Further, a carefully
designed fragment collection can result in high hit rates across a
variety of targets to produce hits with novelty and good ligand
efficiency, thereby accelerating the lead discovery process.
Acknowledgments
We would like to thank Drs. Ben Burke and Zhongxiang (Joe)
Zhou for their valuable comments and insights throughout the
preparation of this manuscript.
References
1. Congreve, M., Chessari, C., Tisi, D., Woodhead, A. (2008) Recent advances in fragment-based drug discovery. J Med Chem 51, 3661–3680.
2. Hesterkamp, T., Whittaker, M. (2008) Fragment-based activity space: smaller is better. Curr Opin Chem Biol 12, 260–268.
3. Hajduk, P. J., Greer, J. (2007) A decade of fragment-based drug design: strategic advances and lessons learned. Nat Rev Drug Discov 6, 211–219.
4. Albert, J. S., Blomberg, N., Breeze, A. L.,
Brown, A. J. H., Burrows, J. N., Edwards,
P. D., Folmer, R. H. A., Geschwindner, S.,
Griffen, E. J., Kenny, P. W., Nowak, T., Olsson, L. -L., Sanganee, H., Shapiro, A. B.
Chapter 12
Fragment-Based Drug Design
Eric Feyfant, Jason B. Cross, Kevin Paris, and Désirée H.H. Tsao
Abstract
Fragment-based drug design (FBDD), which is comprised of both fragment screening and the use of
fragment hits to design leads, began more than 15 years ago and has been steadily gaining in popularity
and utility. Its origin lies in the fact that the coverage of chemical space and the binding efficiency of
hits are directly related to the size of the compounds screened. Nevertheless, FBDD still faces challenges,
among them developing fragment screening libraries that ensure optimal coverage of chemical space,
physical properties, and chemical tractability. Fragment screening also requires sensitive assays, often biophysical in nature, to detect weak binders. In this chapter we introduce the technologies used to
address these challenges and outline the experimental advantages that make FBDD one of the most
popular new hit-to-lead processes.
Key words: Fragment-based drug design, fragment screening, ligand efficiency, NMR, X-ray
crystallography.
1. Introduction
1.1. General Views
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_12, Springer Science+Business Media, LLC 2011
Fragment libraries differ from drug-like and lead-like libraries primarily by having members with a significantly lower molecular
weight (MW), typically in the 140–300 Da range. However, as
fragment screening programs have matured over recent years,
other key factors that help improve the success rates of these
projects have been identified.
There has been much effort in identifying physical properties of ideal fragment libraries. Since fragments tend to be smaller
than most drugs, clinical candidates, leads, or high-throughput
screening (HTS) hits, they are able to make fewer interactions
on average and tend to have lower affinities for their protein
targets. Affinities are often in the high micromolar or low millimolar range, necessitating solubility at least to that degree and
potentially higher depending on the assay protocol. Congreve and
2. Materials
2.1. STD NMR Screening
3. Methods
3.1. STD NMR Screening
4. If co-crystallization is indicated by properties of the protein or the fragment library, prepare a solution of the
protein with a suitable concentration of fragment(s) (suggested: 100 mM). Screen this solution around known crystallization conditions for the protein. If no crystals are
observed, a full screen using numerous conditions may be
indicated.
5. Prepare a cryopreservation solution compatible with your
crystals, starting with the protein stabilization buffer. Upon
testing, the amount of cryo agent (DMSO, low molecular weight PEG, glycerol, ethylene glycol, etc.) used should
produce a clear glass effect with no water rings when analyzed with X-rays.
6. Treat the crystal exposed to fragment(s) with the cryopreservation buffer and vitrify the sample with liquid nitrogen (see
Note 11).
7. During data collection from crystals exposed to fragment(s),
collect a data set that is complete in the low-resolution shells
and has high redundancy. Also beneficial will be the highest resolution data possible, so examination of multiple crystals to select the one with suitable qualities is crucial (see
Note 12).
4. Notes
4.1. NMR Screening
1. The NMR screening samples can be prepared in an automated fashion with a programmable platform such as Tecan
(by Tecan) and samples can be automatically loaded into the
spectrometer by using a Sample Rail (by Bruker Biospin).
This allows for maximum spectrometer time and the sample
is always freshly prepared prior to data acquisition.
2. The protein stock concentration should be in the NMR running buffer at concentrations slightly higher than what is
used in the NMR samples. Alternatively, if the protein stability is better in a different buffer, the protein could be stored
at high concentration (80 µM or higher) and a small aliquot
diluted into the NMR running buffer for sample preparation.
3. Fragments in mixtures can sometimes precipitate due to
the high total fragment concentration in solution, which
could be up to 5 mM. In most cases where precipitation
is observed, we have noted that the other fragments in the
mixture are still soluble and give good NMR signals. Thus
the mixture is still usable.
4. The higher the DMSO percentage used, the higher the fragment mixture solubility will be. The protein needs to be
stable and active for at least a couple of hours under these
conditions for data collection.
5. Competition experiments can be performed within the same
NMR sample mixture used for screening if protein amounts
are limited. The competitor is just added to the NMR solution in the tube and mixed well.
6. Fragments in the mixtures that bind to the protein target can easily be identified by comparing the NMR
spectrum of the hit with the spectra of the individual
fragments.
4.2. X-Ray Screening
7. The more concentrated the fragment sample the less a dilution effect is observed when added to the protein. For
those proteins or crystals highly sensitive to DMSO concentration we have found that soaking is problematic and
co-crystallization is indicated. Fragments are generally low-affinity compounds, and in order for weakly binding compounds to be observed with X-ray crystallography they
need to possess excellent solubility. A rule of thumb used is
ten times the binding constant. Applying this to fragment
screening, it is desirable to have the compound at 100 mM
during the experiments. While this level of solubility is easily obtained in DMSO, the addition of an aqueous component will be an issue as precipitation of the small molecule
often occurs. During soaking experiments it is not uncommon to have a successful experiment despite heavy precipitate or even crystallization of the small molecule upon addition of the fragment to the protein stabilization solution.
When co-crystallizing protein with fragments, centrifugation prior to screening will be required in cases where precipitation is observed.
8. For automation in our lab, we use a Hamilton STAR
for creating/dispensing crystallization solutions, a TTP
LabTech mosquito liquid handling robot for setting
up crystallization drops, a Formulatrix robotic storage/retrieval/imaging system for crystallization trays, and
a Rigaku ACTOR robot for automatic crystal handling for
testing of diffraction properties.
9. Prior to initiating the FBS soaking experiments a substantial amount of investigation needs to be completed on the
methodology that will be used. The parameters that should
be considered and optimized include
The length of time to soak the compound into the
crystal
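The rule of thumb in Note 7 (work at about ten times the binding constant) can be made concrete with a simple single-site binding model. The sketch below assumes the fragment is in large excess over the protein, so that the free fragment concentration is close to the total, and it uses the standard expression occupancy = [L] / ([L] + Kd); the function names and the 10 mM example are illustrative assumptions.

```python
def fraction_bound(ligand_conc_molar, kd_molar):
    """Fractional site occupancy for single-site binding, ligand in excess."""
    return ligand_conc_molar / (ligand_conc_molar + kd_molar)

def target_concentration(kd_molar, fold=10.0):
    """Concentration suggested by the 'ten times the binding constant' rule."""
    return fold * kd_molar

# A fragment with a 10 mM binding constant calls for about 100 mM.
kd = 10e-3
conc = target_concentration(kd)
occupancy = fraction_bound(conc, kd)
print(f"{conc * 1e3:.0f} mM gives {occupancy:.0%} occupancy")
```

At ten times the binding constant roughly 91% of the sites are occupied under this model, which is why a 100 mM working concentration is desirable for fragments with millimolar affinity.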
Chapter 13
LEAP into the Pfizer Global Virtual Library (PGVL) Space:
Creation of Readily Synthesizable Design Ideas
Automatically
Qiyue Hu, Zhengwei Peng, Jaroslav Kostrowicki, and Atsuo Kuki
Abstract
Pfizer Global Virtual Library (PGVL) of 10¹³ readily synthesizable molecules offers a tremendous opportunity for lead optimization and scaffold hopping in drug discovery projects. However, mining into a
chemical space of this size presents a challenge for the concomitant design informatics due to the fact
that standard molecular similarity searches against a collection of explicit molecules cannot be utilized,
since no chemical information system could create and manage more than 10⁸ explicit molecules. Nevertheless, by accepting a tolerable level of false negatives in search results, we were able to bypass the
need for full 10¹³ enumeration and enabled the efficient similarity search and retrieval into this huge
chemical space for practical usage by medicinal chemists. In this report, two search methods (LEAP1 and
LEAP2) are presented. The first method uses PGVL reaction knowledge to disassemble the incoming
search query molecule into a set of reactants and then uses reactant-level similarities into actual available
starting materials to focus on a much smaller sub-region of the full virtual library compound space. This
sub-region is then explicitly enumerated and searched via a standard similarity method using the original query molecule. The second method uses a fuzzy mapping onto candidate reactions and does not
require exact disassembly of the incoming query molecule. Instead Basis Products (or capped reactants)
are mapped into the query molecule and the resultant asymmetric similarity scores are used to prioritize
the corresponding reactions and reactant sets. All sets of Basis Products are inherently indexed to specific
reactions and specific starting materials. This again allows focusing on a much smaller sub-region for
explicit enumeration and subsequent standard product-level similarity search. A set of validation studies were conducted. The results have shown that the level of false negatives for the disassembly-based
method is acceptable when the query molecule can be recognized for exact disassembly, and the fuzzy
reaction mapping method based on Basis Products has an even better performance in terms of lower
false-negative rate because it is not limited by the requirement that the query molecule needs to be
recognized by any disassembly algorithm. Both search methods have been implemented and accessed
through a powerful desktop molecular design tool (see ref. (33) for details). The chapter will end with a
comparison of published search methods against large virtual chemical space.
Key words: LEAP, PGVL, combinatorial chemistry, library design, similarity search, disassembly,
Basis Product, symmetric similarity score, asymmetric similarity score, lead hopping.
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_13, Springer Science+Business Media, LLC 2011
1. Introduction
The high attrition rate across multiple stages of the modern drug
discovery process has significantly hampered the productivity of
the pharmaceutical industry as a whole (1). One of the countermeasures implemented by pharmaceutical companies against this
challenge is to build a large and diverse library of combinatorially
enabled molecules to boost productivity in hit identification and
lead optimization (2). Through a multi-year multi-million dollar
investment in collaborations with ChemBridge, Tripos, ChemRx,
and Arqule, Pfizer has developed validated reactions for parallel
synthesis, and implemented those protocols, to expand its corporate compound collection for biological screening to 3 million
(3–5).
As an integral part of the parallel synthesis of these arrays
of compounds, these collaborations and internal effort produced
and validated 2500 parallel synthetic protocols spanning across
757 diverse chemical reactions. These combinatorial reactions,
their synthetic procedures, and their reactant scope and limitations are well defined and have been captured electronically for
future library production (6). Those experimentally validated synthetic protocols and their corresponding reactant sets compatible
with their reaction conditions implicitly lead to a huge chemical compound space (PGVL, or Pfizer Global Virtual Library)
with more than 10¹³ virtual yet synthetically feasible compounds.
All starting materials are known, specified, and available. Only
a very small fraction of PGVL has ever been synthesized (10⁶
out of 10¹³). Conceptually a medicinal chemist can use a query
(or seed) molecule as input to search for similar molecules inside
PGVL and thereby retrieve new analogs for lead optimization and
scaffold hopping. Previous work has demonstrated that there are
many lead- and drug-like molecules in this type of large virtual
compound space spanned by combinatorial reactions selected by
medicinal chemists and existing reactant sets (7). Yet there are significant challenges inherent in making the desired similarity search
practical against such a huge chemical space. The standard similarity search methods require the construction of a file or database
containing explicit molecules. However, as of today, no chemical
information technology is known to enumerate and store more
than 10⁸ molecules; for example, both CAS (Chemical Abstracts
Service) (8) and PubChem (9) have collections of substances on
the 10⁷ scale.
In the publications from the Tripos group, the authors had
demonstrated that they could bypass the need for full enumeration of a huge virtual space and enable similarity search by extensively leveraging the reactant-level information (10). Even though
2. Methods
The standard similarity search problem is commonly defined as
the following: Given an input query molecule, find the molecules
within a collection of compounds that are most similar (either top N
or within a predefined similarity threshold) to the query molecule. Of
course a molecular similarity measure has to be given between a
pair of molecules. The Tanimoto coefficient calculated on the basis
of molecular fingerprints is the most commonly used similarity
measure (17). Medicinal chemists routinely perform this type of
similarity search against molecular databases containing
10⁶–10⁸ explicit molecules. The set of molecules returned by a similarity search is expected to be well defined by the search parameters
such as the query molecule, the search domain, and the similarity measure in combination with the underlying molecular fingerprints. In this report, we refer to this set of molecules returned
from such a search into a standard explicit database as the reference set to be used in comparison with search results obtained by
new similarity search methodologies.
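The standard search defined above can be sketched as follows, with fingerprints represented as plain sets of "on" bit indices. This is a minimal illustration under that assumption; the function names and the toy database are hypothetical, and a production system would use a cheminformatics toolkit's fingerprints instead.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of bits."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(union)

def similarity_search(query_fp, database, top_n=None, threshold=None):
    """Return (name, score) pairs, best first.

    Either keep the top N results, keep everything at or above a
    similarity threshold, or apply both restrictions together.
    """
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in database.items()]
    if threshold is not None:
        scored = [item for item in scored if item[1] >= threshold]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored if top_n is None else scored[:top_n]

# Toy explicit database (hypothetical bit sets).
database = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3},
    "mol_c": {1, 2, 5, 6},
    "mol_d": {7, 8},
}
query = {1, 2, 3, 4}
top_two = similarity_search(query, database, top_n=2)
above_cutoff = similarity_search(query, database, threshold=0.3)
print(top_two)
```

Because every database entry is scored, the cost grows linearly with the number of explicit molecules, which is the limitation the LEAP methods are designed to work around.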
[Fig. 13.1 panel labels: Input: query molecule. Step 3: on-the-fly enumeration of the identified sub-regions (~10² to 10⁶ molecules). Step 4: standard similarity searches against those explicit virtual molecules using the query molecule. Output: search result.]
Fig. 13.1. Internal flowchart for the LEAP1 fully automated process. The diagram illustrates that there are three reactions
identified whose chemical spaces are colored as pink, green, and yellow, which LEAP1 automatically identified as disconnection routes. LEAP1 then retrieves the most relevant sub-region within the chemistry space, followed by on-the-fly
enumeration of those identified sub-regions. The final step can be any 2D/3D virtual screening algorithms. LEAP1 was
implemented using Scitegic fingerprint technologies.
For a given combinatorial reaction and its associated fully enumerated product space spanned by all suitable reactants, Basis
Products (BP) form a much smaller subset within the full product space and at the same time provide a systematic and efficient
sampling of all reactants suitable for that reaction. Figure 13.2
depicts an example for a two-component reaction. A Basis Product contains information about the R-groups as well as the reaction core, which can be expressed in the following statement (see
Fig. 13.2):
BP = R-groups of one reaction component
+ Reaction Core + CAP(s) from other component(s)
where CAP is the R-group of the smallest reactant from each reactant list.
In Fig. 13.2, the first row and the first column of products
are defined as the Basis Products for that reaction. Basis Products
have a one-to-one relationship with their corresponding reactants. It can also be seen that in a two-component reaction there
are two sets of Basis Products, and in a three-component reaction
there are three sets of Basis Products: always one set per reaction
component. M plus N reactants lead to M plus N Basis Products,
while the fully combinatorial product space is M × N in size. Currently there are 10⁶ Basis Products in PGVL, far fewer than
the full PGVL space of about 10¹³–10¹⁴ compounds.
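The M + N versus M × N relationship can be illustrated with a small two-component example. In this sketch each reactant is a (name, heavy atom count) pair, the CAP of a list is simply its smallest member, and a product is represented by the pair of reactant names that would form it. The two cap reactants are the ones named in Fig. 13.2; the other reactant labels and all sizes are hypothetical.

```python
from itertools import product

def cap(reactants):
    """Pick the smallest reactant of a list to act as the CAP."""
    return min(reactants, key=lambda r: r[1])

def basis_products(list_a, list_b):
    """One Basis Product per reactant: each A with B_CAP, each B with A_CAP."""
    a_cap, b_cap = cap(list_a), cap(list_b)
    bp_a = [(a[0], b_cap[0]) for a in list_a]
    bp_b = [(a_cap[0], b[0]) for b in list_b]
    return bp_a, bp_b

def full_products(list_a, list_b):
    """The fully enumerated M x N product space."""
    return [(a[0], b[0]) for a, b in product(list_a, list_b)]

list_a = [("2-aminopyridine", 7), ("A2", 9), ("A3", 12)]
list_b = [("1-bromopropan-2-one", 5), ("B2", 8), ("B3", 10), ("B4", 11)]

bp_a, bp_b = basis_products(list_a, list_b)
everything = full_products(list_a, list_b)
print(len(bp_a) + len(bp_b), "Basis Products vs", len(everything), "full products")
```

Note that the product of the two caps appears once in each set, so the M + N count is a count of reactant-indexed entries rather than of distinct structures.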
Importantly, all Basis Products are real products, members within the product space, and, like the simple truncated
R-groups, retain no transient reactant-only functional groups
(reactive halides, aldehydes, etc.); in R-group methods these disappear by clipping, whereas in Basis Products these are transmuted in the reaction transformation preparing the Basis Products. Yet unlike truncated R-groups, Basis Products also incorporate the full reaction core (all of the newly formed bonds) as
part of the BP structure. Furthermore the collection of available
starting materials, e.g., aliphatic amines, aldehydes, acyl chlorides,
benzyl halides, collapses to a smaller number of unique R-groups
when clipped, whereas the same set of starting materials expands
[Fig. 13.2 panel labels: (a) reaction VRXN-2-00051; A: aminoheterocycles; B: alpha-halo ketones; Basis Products for all 2-amino heterocycles (plus atom-level annotations). (b) Basis Products of A (R1 + Core + B_CAP), e.g., VRXN-2-00051_A_1; Basis Products of B (A_CAP + Core + R2), e.g., VRXN-2-00051_B_1; Full Products.]
Fig. 13.2. Illustration of the basic concept of Basis Products. (a) The PGVL reaction scheme of VRXN-2-00051 (formation
of the H-imidazo[1,2-a]pyridine ring system using aminoheterocycles and alpha-halo ketones) is used for the illustration;
(b) The Basis Products of A are formed by all A reactants with one constant B reactant (B_CAP, 1-bromopropan-2-one).
The Basis products of B are formed by all B reactants with a constant A reactant (A_CAP, 2-amino pyridine). The blue
triangle and yellow hexagon represent two such basis products. The red star represents a product molecule which is
related to those two corresponding basis products.
The asymmetric similarity measure was first described by Tversky (22) to provide a general mathematical framework for the
perception of similarity and was later adapted to molecular similarity
by Bradshaw (23). The mathematical formulas for both similarity
measures against BPs are shown below:
Symmetric similarity (SS) favors maximum common features
and penalizes non-common features:
SS =
The well-known symmetric similarity measure rewards common features shared by two molecules and penalizes unique features present in either molecule which are not found in the other.
Its value reaches 1 only when both molecules are identical. The
asymmetric similarity measure focuses on the degree to which a
test molecule (BPs in our case) can map into the original query
molecule. When a BP molecule, which is typically smaller, is
mapped into the query molecule, the asymmetric similarity measure can still reach 1.0 when the BP can be fully mapped into the
query molecule; in other words, when the BP is a substructure of the
query molecule. Figure 13.3 uses a query molecule within PGVL
and its corresponding Basis Products to highlight the difference
between symmetric and asymmetric similarity measures. From the
differences of the AS and SS scores of the same BP, it is seen that
indeed the standard symmetric similarity measure penalizes any
differences between two molecules, while the asymmetric similarity measure used in LEAP2 focuses on mapping the Basis Product into the query molecule, while ignoring the unique features
[Fig. 13.3 panel labels: Query molecule; Basis Products VRXN-2-00051_A_1 and VRXN-2-00051_B_1; similarity scores shown: SS=82%, AS=98% and SS=84%, AS=100%.]
Fig. 13.3. Comparison of symmetric and asymmetric similarity scores. A virtual product
from VRXN-2-00051 is used as a query molecule. The two corresponding Basis Products
are VRXN-2-00051_A_1 and VRXN-2-00051_B_1. In reference to the query molecule,
their corresponding similarity scores are listed under SS and AS (see equations [1] and
[2] for details), respectively, depending on the similarity methods used.
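The contrast between the two scores in Fig. 13.3 can be reproduced in miniature. The sketch below assumes the usual Tanimoto form for the symmetric score, SS = |Q ∩ BP| / |Q ∪ BP|, and a Tversky form for the asymmetric score, AS = |Q ∩ BP| / (|Q ∩ BP| + α|BP − Q| + β|Q − BP|) with α = 1 and β = 0, so that only Basis Product features missing from the query are penalized. These forms and weights are assumptions consistent with the description in the text; the chapter's own equations [1] and [2] may differ in detail.

```python
def symmetric_similarity(query_fp, bp_fp):
    """Tanimoto-style score: penalizes features unique to either molecule."""
    union = query_fp | bp_fp
    return len(query_fp & bp_fp) / len(union) if union else 0.0

def asymmetric_similarity(query_fp, bp_fp, alpha=1.0, beta=0.0):
    """Tversky-style score for mapping a Basis Product into the query.

    alpha weights BP features absent from the query; beta weights query
    features absent from the BP (ignored when beta is 0).
    """
    common = len(query_fp & bp_fp)
    denom = common + alpha * len(bp_fp - query_fp) + beta * len(query_fp - bp_fp)
    return common / denom if denom else 0.0

# Toy fingerprints: the Basis Product is a strict subset of the query.
query_fp = set(range(10))
bp_fp = set(range(7))
ss = symmetric_similarity(query_fp, bp_fp)
as_score = asymmetric_similarity(query_fp, bp_fp)
print(ss, as_score)
```

With α = β = 1 the asymmetric expression reduces to the symmetric one, which makes the relationship between the two measures easy to check.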
(1) Search a database of Basis Products using the Asymmetric Similarity measure. This search is done using the query
molecule against a database of 10⁶ explicitly enumerated
Basis Products. The asymmetric similarity search in the BP
database is implemented using the MDL Keys fingerprint (24)
with ISIS host technology (25).
The output is a set of Basis Products with high asymmetric similarity (AS) values (the default cutoff value is set
to 90%) when they are mapped into the query molecule.
The reaction schemes and reactants encoded by those Basis
Products are then extracted, ranked, and used to form subregions of PGVL for subsequent just-in-time enumeration
and symmetric similarity search against the query molecule.
As in LEAP1, the top 20 similar molecules per
reaction component list are used, as a default setting, to
ensure balanced sampling of reactants for each reaction
component and reasonable performance. This setting is user
adjustable.
(2) Enumerate sub-region(s) using the smaller sets of reactants
identified in step 1. This on-the-fly enumeration step is identical to step 3 of LEAP1.
(3) Perform a standard similarity search using the original query
molecule against the enumerated products from the sub-region(s) obtained in step 2. This is identical to step 4 of
LEAP1.
Since LEAP2 was also built on Pipeline Pilot technology, multiple molecular fingerprints and similarity methods are
available, currently including MDL Public Keys
and different levels of FCFPs and ECFPs (18).
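The contrast between the two measures, and the role of the AS cutoff in step 1, can be illustrated with a minimal sketch. It assumes that fingerprints are represented as sets of on-bit indices, takes the Tanimoto coefficient as the symmetric score, and assumes a Tversky-style score with weights (1, 0) as the asymmetric score; equations [1] and [2] define the measures actually used in LEAP2. The function names and bit sets below are invented for illustration only.

```python
def symmetric_similarity(query: set, bp: set) -> float:
    """Tanimoto coefficient: features unique to either molecule lower the score."""
    union = query | bp
    return len(query & bp) / len(union) if union else 0.0


def asymmetric_similarity(query: set, bp: set,
                          alpha: float = 1.0, beta: float = 0.0) -> float:
    """Tversky-style score for a Basis Product (BP) mapped into the query.

    alpha weights features found only in the BP; beta weights features
    found only in the query. With beta = 0 the query's extra features
    are ignored (an assumption of this sketch).
    """
    common = len(query & bp)
    denom = common + alpha * len(bp - query) + beta * len(query - bp)
    return common / denom if denom else 0.0


def prefilter(query: set, basis_products: dict, cutoff: float = 0.90) -> list:
    """Keep the names of BPs whose asymmetric similarity reaches the cutoff."""
    return [name for name, fp in basis_products.items()
            if asymmetric_similarity(query, fp) >= cutoff]


# Hypothetical fingerprints: bp_sub is fully contained in the query.
query = set(range(1, 11))
bp_sub = {1, 2, 3, 4, 5, 6}
bp_other = {1, 2, 3, 20, 21}

ss = symmetric_similarity(query, bp_sub)
as_ = asymmetric_similarity(query, bp_sub)
kept = prefilter(query, {"BP_sub": bp_sub, "BP_other": bp_other})
```

In this toy case the contained Basis Product scores 1.0 on the asymmetric measure but only 0.6 on the symmetric one, which is the behavior described for Fig. 13.3.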
3. Results and Discussion
3.1. Method Validation and Performance Profiling
As mentioned before, LEAP1 and LEAP2 are the results of conscious choices between accuracy and practical execution performance. Therefore it is important to conduct a set of controlled
validation studies to assess the accuracy in terms of rates of false
positives and false negatives in their search results and performance in terms of end-to-end search turnaround time.
To reach those objectives, we conducted validations to answer
the following questions:
(1) Given a set of molecules known to be inside PGVL as
query molecules, what is the success rate for returning the
expected molecules identical to the query molecules (100%
similarity threshold)? This is by definition a baseline test
that a validated search strategy must pass.
(2) Given a sub-region of PGVL that can be enumerated
explicitly and a query molecule, compare search results
obtained by a LEAP search with the reference sets obtained
by the exhaustive search against the fully enumerated
Table 13.1
Construction of a fully enumerated virtual library space (VL) for the second validation study

Mapped seed structure: [structure drawing not reproduced]

VRXNs          Real VL size        Validation VL size
VRXN-2-00004   438 x 1171          60 x 60
VRXN-2-00006   544 x 264           50 x 50
VRXN-2-00010   3371 x 6635         60 x 60
VRXN-2-00011   449 x 7044          50 x 50
VRXN-3-00063   19 x 721 x 5697     18 x 50 x 50
VRXN-3-00064   77 x 389 x 5175     50 x 50 x 50
VRXN-3-00065   44 x 632 x 444      17 x 50 x 50
Total          388,798,585         224,700
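The validation VL total can be reproduced from the per-reaction dimensions, since the size of each fully enumerated sub-library is the product of its reactant counts. The following is a minimal sketch using the validation column of Table 13.1; the dictionary layout and function name are illustrative choices, not part of the PGVL system.

```python
from math import prod

# Reactant counts per reaction component, validation column of Table 13.1.
validation_dims = {
    "VRXN-2-00004": (60, 60),
    "VRXN-2-00006": (50, 50),
    "VRXN-2-00010": (60, 60),
    "VRXN-2-00011": (50, 50),
    "VRXN-3-00063": (18, 50, 50),
    "VRXN-3-00064": (50, 50, 50),
    "VRXN-3-00065": (17, 50, 50),
}


def library_size(dims: dict) -> int:
    """Sum over reactions of the product of reactant counts."""
    return sum(prod(counts) for counts in dims.values())


validation_total = library_size(validation_dims)
```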
Table 13.2
True-positive and false-negative rates of the LEAP1 method as a function of search
threshold for molecular similarity

Similarity   No of cpds     No of cpds retrieved   No of true   % true     No of false   % false
threshold    retrieved by   by exhaustive          positives    positive   negatives     negative
             LEAP1          search                 in LEAP1     in LEAP1   in LEAP1      in LEAP1
-            -              -                      -            100        -             -
0.9          -              -                      -            100        -             -
0.8          11             11                     11           100        -             -
0.7          51             52                     51           98         -             -
0.6          249            257                    249          97         -             -
0.55         530            557                    530          95         27            -
0.52         915            968                    915          95         53            -
0.5          1331           1410                   1331         94         79            -
0.48         1699           1807                   1699         94         108           -
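The rate columns of Table 13.2 follow from the two count columns. The sketch below assumes that the exhaustive search result is the reference set and that every compound retrieved by LEAP1 is a true positive, as the matching counts in the table suggest; the function name is illustrative.

```python
# (threshold, true positives in LEAP1, cpds retrieved by exhaustive search)
rows = [
    (0.8, 11, 11),
    (0.7, 51, 52),
    (0.6, 249, 257),
    (0.55, 530, 557),
    (0.52, 915, 968),
    (0.5, 1331, 1410),
    (0.48, 1699, 1807),
]


def rates(true_positives: int, reference: int):
    """Return (false negatives, % true positive, % false negative)."""
    false_negatives = reference - true_positives
    pct_tp = 100.0 * true_positives / reference
    pct_fn = 100.0 * false_negatives / reference
    return false_negatives, pct_tp, pct_fn


summary = {t: rates(tp, ref) for t, tp, ref in rows}
```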
[Figure 13.4: plot of % of cpds (y-axis, 0 to 100) against similarity threshold (x-axis, 0.4 to 0.9)]
Fig. 13.4. Performance of LEAP1 when compared against the exhaustive search in the
second validation study. See main text for details.
Table 13.3
Comparison of performance of LEAP1 vs. exhaustive search

Method              Validation VL (s)   Real VL (s)
LEAP1               446                 602
Exhaustive search   9700                16,783,917a
Speedup factor      22                  27,880

a Estimated on the reasonable assumption that standard search time is proportional
to the size of the VL to be searched. The exhaustive search against the
smaller VL of 224,700 takes 9700 s. To a first approximation, an exhaustive search against the real VL of 388,798,585 would therefore take
16,783,917 s (about 194 days) to complete. See Table 13.1 for VL sizes. Times are
in seconds on a single 3 GHz Pentium CPU.
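The footnote's extrapolation and the speedup factors can be reproduced with a few lines of arithmetic. This is a sketch under the footnote's own assumption that exhaustive search time scales linearly with VL size; the variable names are illustrative.

```python
validation_vl_size = 224_700
real_vl_size = 388_798_585

leap1_validation_s = 446
leap1_real_s = 602
exhaustive_validation_s = 9700

# Linear extrapolation of the exhaustive search time to the real VL.
exhaustive_real_s = exhaustive_validation_s * real_vl_size / validation_vl_size

speedup_validation = exhaustive_validation_s / leap1_validation_s
speedup_real = exhaustive_real_s / leap1_real_s
days = exhaustive_real_s / 86_400  # seconds per day
```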
[Figure 13.5 structures, labeled: AZITHROMYCIN, CAFFEINE, CELECOXIB, VALSARTAN, EFAVIRENZ, VENLAFAXINE, FLUCONAZOLE, FLUOXETINE, ALENDRONATE, GABAPENTIN, IBUPROFEN, ATORVASTATIN, OLANZAPINE, NELFINAVIR, ESOMEPRAZOLE, AMLODIPINE, PAROXETINE, CLOPIDOGREL, LANSOPRAZOLE, RANITIDINE, RISPERIDONE, SILDENAFIL, SIMVASTATIN, SERTRALINE]
Fig. 13.5. A diverse set of 24 drug molecules on the market is compiled for the third validation study.
[Figure 13.6: LEAP1 and LEAP2 results plotted as similarity score (y-axis, 0.2 to 0.9) for each query drug (x-axis)]
Fig. 13.6. Results from the third validation study. The y-axis represents the Tanimoto similarity score of returned hits
with respect to their corresponding query molecule, calculated from FCFP4 molecular fingerprints (31). The x-axis
lists the drug molecules in Fig. 13.5. Search hits are color coded by the PGVL reaction (VRXN) from which they
originate.
relaxed to 0.7, then 5 of 24 searches lead to PGVL hits for meaningful follow-up.
Based on the validation study, the typical search
times required for LEAP1 (15 min) and LEAP2 (45 min)
are short enough for routine, practical use by medicinal
chemists. On average LEAP2 is about three times slower than
LEAP1 because of its larger VRXN coverage. In essence, LEAP2 uses
a fuzzy reaction retrieval strategy that returns more candidate PGVL regions of interest in the intermediate steps of the
algorithm.
3.2. Application of LEAP1 and LEAP2
3.2.1. LEAP1
Table 13.4
Two example virtual hits from among hundreds in the 14 LEAP1 searches

Name    monomerID                   VRXN           simScore   Est_IC50 (nM)a
MCH-1   MFCD01443686:MFCD00238752   VRXN-2-00001   0.37       0.64
MCH-2   MN-011201:MN017087          VRXN-2-00001   0.33       0.69

[Structure column: 2D structure drawings not reproduced]

a IC50 estimated by the 3D pharmacophore model of MCH; the corresponding mappings are shown
in Fig. 13.7.
[Figure 13.7 panels: MCH-1, MCH-2]
Fig. 13.7. Novel synthesizable compounds (see 2D structures in Table 13.4) produced
by LEAP1 searches with high score judged by a project-specific 3D pharmacophore
model (red blob: basic feature, light blue blob: hydrophobe, green vector blob: hydrogen
bond acceptor).
per VRXN per query molecule, with default settings for the remainder of the parameters. The LEAP2 search resulted in 900 hits
originating from 18 PGVL chemical reactions. Three targeted
libraries were subsequently designed and synthesized based on
those LEAP2 hits. The efforts resulted in 281 compounds synthesized, of which 13 yielded IC50 values ranging from 1 to 20 µM
(see Table 13.5). This result demonstrates that the LEAP2 method
is capable of generating multiple different design ideas that can
be implemented quickly and fruitfully by the project team.
Table 13.5
Hits from the caspase-3 targeted libraries

Compound_Number   IC50 (µM)   VRXN_ID
Cpd-1             1.01        VRXN-2-00086
Cpd-2             1.85        VRXN-2-00086
Cpd-3             3.36        VRXN-2-00086
Cpd-4             3.56        VRXN-2-00086
Cpd-5             5.82        VRXN-2-00086
Cpd-6             6.03        VRXN-2-00086
Cpd-7             7.46        VRXN-2-00086
Cpd-8             7.69        VRXN-2-00086
Cpd-9             12.5        VRXN-2-00010
Cpd-10            14.2        VRXN-2-00086
Cpd-11            16.7        VRXN-2-00086
Cpd-12            17.5        VRXN-2-00086
Cpd-13            19.5        VRXN-2-00010
4. Conclusion
It is very useful to emphasize systematic data capture within an
organization as large as Pfizer. It has been beneficial to derive
knowledge from those data in projects and sites different from
the original settings which led to the original development of a
given reaction protocol and most valuable if this knowledge can
be reused in the essential operations on a regular basis.
The PGVL system is a large-scale knowledge system derived
from rigorous multi-year systematic reaction knowledge capture,
including the registration of large numbers of bona fide starting materials and validated parallel synthesis reaction protocols.
LEAP chemistry space mining methodologies are ways to enable
5. Notes on Comparison with Other Published Search Methods Against Large Virtual Chemical Space
Table 13.6
Summary and comparison of representative methods to search into large virtual chemical space indexed by combinatorial
libraries

Method name | Ref# | Origin | Search turnaround time | Virtual library | # of chemical reactions | Size of virtual library space | Query input | Similarity measure
CoLibri/FTrees-FS | | Boehringer Ingelheim (BI)/BioSolvIT | Min | BI CLAIM (Comprehensive Library of Accessible Innovative Molecules) | NA | 1.00E+11 | 2D ligand | 2D FeatureTree
CoLibri/FTrees-FS | 12 | Pfizer/BioSolvIT | Min | PGVL (Pfizer Global Virtual Library) | 157a | 1.00E+13 | 2D ligand | 2D FeatureTree
MoBSS | 13 | Pfizer | Min | PGVL | 534a | 1.00E+13 | 2D ligand | 2D Atom Pair (AP)
LEAP2 | 11, this report | Pfizer | Min | PGVL | 441a | 1.00E+13 | 2D ligand | 2D
LEAP1 | 11, this report | Pfizer | Min | PGVL | 358a | 1.00E+13 | 2D ligand | 2D
FTree-FS | 14 | Roche | Min | RECAP/TOPAS | 11 | 1.00E+18 | 2D ligand | 2D FeatureTree
AllChem | 16 | Tripos | Hour | Tripos Discovery Research (TDR) | 100 | 1.00E+20 | 3D/2D ligand | 3D Topomer
NA | 15 | Algodign | Month | Literature | 400 | 1.00E+13 | 3D target | 3D docking

Although there are 700+ VRXNs in the PGVL system, not all of them are registered to the full extent needed to enable the LEAP1 and LEAP2 searches. For MoBSS and the FTree-based
methods, because of the assumptions made about fingerprint additivity, some VRXNs, such as variable ring formations that depend on the reactant combinations used, were
excluded from the implementation. For the CoLibri/FTrees-FS method, the final enumeration step was implemented using CoLibri technology, which differs from the PGVL
foundation system, so certain VRXNs are excluded as well.
a The methods based on PGVL were implemented at different times; LEAP1 was the first of the four. The other three are second-generation methods.
Acknowledgments
The authors would like to thank the following Pfizer colleagues
for their generous help and support: Bo Yang, Thom Shulok,
structural descriptors and similarity coefficients. J Chem Inf Comput Sci 42,
1407–1414.
18. Pipeline Pilot from SciTegic: http://www.scitegic.com/
19. Shi, S., Peng, Z., Kostrowicki, J., Paderes, G., Kuki A. (2000) Efficient combinatorial
filtering for desired molecular properties of reaction products. J Mol Graph Model 18,
478–496.
20. Zhou, Z., Shi, S., Na, J., Peng, Z., Thacher, T. (2009) Combinatorial library-based
design with basis products. J Comput Aided Mol Des 23, 725–736.
21. Lau, W., Hepworth, D., Magee, T., Du, J., Bakken, G., Miller, M., Hendsch, Z.,
Thanabal, V., Kolodziej, S., Xing, L., Hu, Q., Narasimhan, L., Love, R., Charlton,
M., Hughes, S., Van Hoorn, W., Mills, J., Withka, J. (2010) Design of a multi-purpose
fragment screening library using molecular complexity and orthogonal diversity metrics.
J Comput-Aided Mol Des.
22. Tversky, A. (1977) Features of similarity. Psycholog Rev 84, 327–352.
23. Bradshaw, J. (1997) Introduction to the Tversky Similarity Measure. Presented at
Daylight MUG Meeting, Laguna Beach, CA,
URL http://www.daylight.com/meetings/mug97/agenda97/Bradshaw/MUG97/tvtversky.html.
24. Durant, J. L., Leland, B. A., Henry, D. R., Nourse, J. G. (2002) Reoptimization of
MDL keys for use in drug discovery. J Chem Inf Comput Sci 42, 1273–1280.
25. ISIS host from Symyx:
http://www.symyx.com/products/software/cheminformatics/isis-host/index.jsp
26. Qu, D., Ludwig, D.S., Gammeltoft, S. et al. (1996) A role for melanin-concentrating
hormone in the central regulation of feeding behavior. Nature 380, 243–247.
27. Saito, Y., Nothacker, H., Wang, Z., et al. (1999) Molecular characterization of the
Section IV
Library Design for Kinase Family
Chapter 14
The Design, Annotation, and Application of a Kinase-Targeted Library
Hualin Xi and Elizabeth A. Lunney
Abstract
We present here a workflow for designing a kinase-targeted library (KTL) with the goal of capturing
known kinase inhibitor chemical space. We validated our design retrospectively using recent high-throughput
screening data and found significant enrichment of kinase inhibitor hits while retaining
the majority of the active kinase inhibitor series. To further assist kinase projects in triaging KTL screen
hits, we also developed a methodology to systematically annotate known kinase inhibitors in the KTL
with regard to their binding modes.
Key words: Protein kinase, kinase-targeted library, library design, kinase chemical cores,
substructure search, SMARTS Query, subsetting, binding mode annotation.
1. Introduction
The protein kinase family is one of the largest gene families
encompassing almost 2% of the human genome. The enzymes
phosphorylate proteins through the catalytic transfer of phospho
groups from ATP to the protein substrates. Protein kinases play
key roles in numerous cellular pathways that impact multiple cellular events such as growth, division, differentiation, and apoptosis. From a pharmaceutical perspective, kinases have been targeted
in drug design across multiple therapeutic areas. The most prominent of these is oncology, for which eight small molecule kinase
inhibitors have currently been approved in the USA.
This family of proteins exhibits a common fold that results in
a two-lobe structure: a smaller N-terminal region connected by
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_14, Springer Science+Business Media, LLC 2011
2. Materials
Kinase protein/ligand crystal complexes were retrieved from the
Pfizer Crystal Structure Database, an in-house X-ray structure
repository that contains internally solved structures and selected
ones imported from the Protein Data Bank (1). Kinase assay data
were obtained by querying the Pfizer screening database for
screens associated with any kinase target and tagged with IC50 or
Ki as the endpoint type.
3. Methods
3.1. Kinase Domain and ATP Binding Site
Fig. 14.1. Ribbon structure (magenta) of the phosphorylase kinase crystal structure
2PHK (20) bound with ATP (green carbons, colored by atom type) and substrate peptide
(light blue ribbon). The N- and C-terminal lobes are highlighted; the hinge region is
shown in cyan, the α-C helix in gray, and the A-loop in orange.
Fig. 14.2. Protein kinase active site of JNK3 bound with an ATP analogue extracted
from an X-ray structure (1JNK) (21). Hinge region and DFG segment of the A-loop are
shown; the sugar and phosphate binding areas are circled. Highlighted schematically:
substrate binding region and the adjacent sites: NE, SE, W, and N. ATP hydrogen-bonding
interactions with the hinge region are labeled D1 (Donor1) and A (Acceptor). Another
potential donor interaction (D2) with a hinge carbonyl is shown.
Unactivated states of the protein kinases have also been targeted in inhibitor design. In fact, five of the eight approved
small molecule drugs have been reported to target the unactivated form of the enzyme (3). Two characterized unactivated
forms are the DFG-out (4, 5) and the αC-Glu-out conformations
(6) (Fig. 14.3). In the DFG-out conformation, the DFG Phe, at
the beginning of the A-loop, reorients from a buried pocket near
the α-helix C and can extend to the ATP binding region. This
in turn opens a pocket that can be accessed by inhibitor ligands.
The multikinase inhibitors imatinib, which targets Abl in chronic
myelogenous leukemia (CML) patients, and sorafenib, which targets angiogenesis in renal cell carcinoma, are approved drugs that
probe this pocket (5, 7).
In the αC-Glu-out conformation the α-helix C moves away
from the ATP site, such that the conserved Glu in the helix
does not form an ionic interaction with the conserved Lys in
β-strand 3. Here again a pocket is formed for inhibitor binding
(Fig. 14.3). The EGFR-targeted drug lapatinib takes advantage
of this pocket in binding the tyrosine kinase (8), as do the MEK
inhibitors identified by the Parke-Davis (PD) group (9). However, the latter compounds differ from lapatinib in that they do
not extend to the hinge region of the ATP binding site and are
non-competitive with ATP.
Fig. 14.3. An extension of Fig. 14.2 illustrating the DFG-out and αC-Glu-out binding
regions. (Acknowledgment to J. F. Ohren for original mapping of the scheme).
Another category of kinase inhibitors includes compounds targeting the substrate binding region (Figs. 14.1 and 14.2), which
are also non-competitive with ATP (10). In addition, inhibitors
that have been designed to target the related phosphoinositol
kinases (PIK) can be characterized accordingly.
3.2. Compilation of the Kinase-Targeted Library
Table 14.1
Examples of substructure queries
[Substructure query drawings not reproduced; variable positions are specified with atom lists such as [C,N] and [C,N,O,S]]
that did not yet have a solved structure in the CSDB, we mined
our corporate screening database for any compounds with an
IC50 of less than 1 µM in either functional or enzymatic kinase
assays. A total of 34K compounds from 200 kinase assays were
found. From these kinase-active compounds, we filtered out those already represented by the CSDB substructure queries
and then clustered the rest into structural series. From these, an
additional 100 substructure queries were derived from the maximal common substructure of each series.
In addition to these substructure queries, a small number
of SMARTS (16) queries were derived to capture the more
general hydrogen bond Donor-Acceptor-Donor (D-A-D) motif
that is frequently observed in core structures interacting with
kinases at the hinge regions. To make the search more specific, these SMARTS queries capture the presence of at least
two out of three hydrogen bond features at the D-A-D motif
(Table 14.2). Although single acceptor cores (for example, the
pyridine moiety of Gleevec binding to Abl (17)) are missed by
these SMARTS queries, the ones existing in the CSDB would
be captured by the CSDB substructure queries. We tested the
sensitivity and specificity of these SMARTS queries on CSDB
ligands and the randomly selected compound set. Whereas the
SMARTS queries matched 75% of kinase ligands in CSDB, they matched only
24% of the compounds in the random set. Only
40% of the hits from the random set were also found among the hits
from the CSDB substructure queries, indicating that the SMARTS
queries could potentially identify additional kinase inhibitor-like
compounds.
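The "at least two out of three" rule and the comparison of match rates can be sketched in a few lines. This is an abstraction only: the actual queries are SMARTS patterns matched against structures, whereas the sketch assumes that the presence of each hinge-binding feature (D1, A, D2) has already been determined for a core. The function names and the example cores are hypothetical.

```python
def matches_dad(d1: bool, a: bool, d2: bool, min_features: int = 2) -> bool:
    """True if at least `min_features` of the D-A-D hydrogen bond features are present."""
    return sum((d1, a, d2)) >= min_features


def match_rate(cores) -> float:
    """Fraction of (d1, a, d2) feature tuples that satisfy the rule."""
    cores = list(cores)
    return sum(matches_dad(*c) for c in cores) / len(cores) if cores else 0.0


# Hypothetical cores: acceptor-donor pair, single acceptor, full D-A-D.
acceptor_donor = (False, True, True)
single_acceptor = (False, True, False)   # e.g. a lone pyridine nitrogen
full_dad = (True, True, True)

rate = match_rate([acceptor_donor, single_acceptor, full_dad])

# Ratio of the reported match rates (75% of CSDB kinase ligands vs. 24% of random set).
enrichment = 0.75 / 0.24
```

Under this rule a single-acceptor core is not matched, which is consistent with the statement that such cores are covered by the CSDB substructure queries instead.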
Table 14.2
Example of SMARTS queries

Motifs   Examples                                                          SMARTS
AD       Pyrazole
AD       Amide in a ring
AD       Azaindole, adenosine, amino-pyrimidine, etc.
AD       Pyrrolopyrimidine
AD       Other cases for amides: 5-member-aryl-amide, 6-member-aryl-amide
AD       Biaryl urea                                                       a1aaaaa1-[N;!H0]C(=O)[N;!H0]a

With these sets of substructure and SMARTS queries, we
searched our corporate compound collections and identified
840K hits from a total of 2.8 M compounds. A set of drug-likeness filters was then applied to these compounds to reduce
the total number of compounds to 720K. We then split these
720K compounds into two collections: the library set (270K
compounds), amenable to combinatorial synthesis with available library
synthesis protocols, and the medchem set (450K compounds), mostly made through traditional medicinal chemistry
synthesis. To further prioritize these hits, compounds in the
library set were grouped by library protocol ID and compounds
in the medchem set were clustered into structural series using
Ward's clustering with Daylight fingerprints. Four representative structures from each library or series were then selected. A
panel of experienced kinase chemists was asked to review
and prioritize the represented compound libraries or series based
on physical properties, synthetic feasibility, and structural novelty. The review process focused on the chemical series as
opposed to individual compounds. Although each kinase expert
might unintentionally be biased toward a subset of chemical series
that he or she had worked on in the past, by including multiple experts
in the review process we aimed for a more unbiased
collective representation of the kinase chemical space. In the
end, 310K compounds were retained after pooling the chosen
series together. We validated this selection retrospectively using
data from two recent kinase HTS projects (HGK and JNK1) that
screened the full compound collection at Pfizer and data from
Pfizer kinase selectivity panel screens (Table 14.3). For HGK,
82% of the Rule of 5 (Ro5) (18) compliant confirmed hits were
recovered in the 310K collection, representing an 8-fold hit rate
Table 14.3
Enrichment of kinase inhibitors in the initial compilation of
the KTL (310K compound collection) using substructures and
SMARTS queries prior to applying subsetting

                                     HGK         JNK1
# compounds screened                 1,600,000   1,600,000
# confirmed actives                  945         1455
# Ro5-compliant confirmed actives    833         1376
# recovered in KTL                   685         920
% recovered                          82          67
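The recovery percentages in Table 14.3 can be checked from the counts. The sketch assumes, based on the statement that 82% of the Ro5-compliant confirmed hits for HGK were recovered, that the pair 833 and 685 (and likewise 1376 and 920 for JNK1) are the Ro5-compliant actives and the actives recovered in the 310K collection; the dictionary and function names are illustrative.

```python
# (Ro5-compliant confirmed actives, actives recovered in the 310K collection)
counts = {
    "HGK": (833, 685),
    "JNK1": (1376, 920),
}


def pct_recovered(total: int, recovered: int) -> int:
    """Percentage of actives recovered, rounded to the nearest integer."""
    return round(100 * recovered / total)


recovery = {target: pct_recovered(*pair) for target, pair in counts.items()}
```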
Table 14.4
Validation of the subsetting algorithm retrospectively using
data from two HTS screening projects. The majority of the active
series were retained after the subsetting

HGK        JNK1
688 (40)   730 (44)
263 (37)   208 (37)

3.3. Kinase-Targeted Library Annotations
Fig. 14.5. Annotation for the ATP site inhibitor, SB203580. The core is Pyri(mi)dine-5-MemberHeterocycle_4, which is a
hydrogen bond acceptor with the hinge region. The compound probes the Phosphate (Phos) and NE sites.
[Figure 14.6 panels: compound and core structure drawings for sorafenib (core Bayer_like_dfg) and PD318088 (core MEK_like_3)]
Fig. 14.6. Annotations for inhibitors that bind to unactivated kinase conformations.
Sorafenib binds to the DFG-out conformation and its core is defined as Bayer_like_dfg.
PD318088 binds to the αC-Glu-out conformation and its template is MEK_like_3.
3.4. Performance of KTL and Future Plans
4. Notes
4.1. Advantage of Using Substructure Query-Based Method for KTL Design
The KTL core annotations have been integrated into several in-house desktop applications for compound design and HTS hit
triage. The annotations provide key binding information for the
inhibitors and can be used to cluster compounds or to search for
inhibitors with a particular binding feature. Overall these insights
can help accelerate the HTS triage process and allow project teams
to advance chemical matter in a timely manner.
References
1. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H.,
Shindyalov, I. N., Bourne, P. E. (2000) The Protein Data Bank. Nucleic Acids Res 28,
235–242.
2. Johnson, L. N., Lowe, E. D., Noble, M. E. M., Owen, D. J. (1998) The structural basis
for substrate recognition and control by protein kinases. FEBS Lett 430, 1–11.
3. Alton, G. R., Lunney, E. A. (2009) Targeting the unactivated conformations
of protein kinases for small molecule drug discovery. Expert Opin Drug
Discov 3, 595–605.
Section V
Library Design Tools
Chapter 15
PGVL Hub: An Integrated Desktop Tool for Medicinal Chemists to Streamline Design and Synthesis of Chemical Libraries and Singleton Compounds
Zhengwei Peng, Bo Yang, Sarathy Mattaparti, Thom Shulok,
Thomas Thacher, James Kong, Jaroslav Kostrowicki, Qiyue Hu,
James Na, Joe Zhongxiang Zhou, David Klatte, Bo Chao, Shogo
Ito, John Clark, Nunzio Sciammetta, Bob Coner, Chris Waller,
and Atsuo Kuki
Abstract
PGVL Hub is an integrated molecular design desktop tool that has been developed and globally deployed
throughout Pfizer discovery research units to streamline the design and synthesis of combinatorial
libraries and singleton compounds. This tool supports various workflows for design of singletons, combinatorial libraries, and Markush exemplification. It also leverages the proprietary PGVL virtual space
(which contains 10^14 molecules spanned by experimentally derived synthesis protocols and suitable reactants) for lead idea generation, lead hopping, and library design. There has been an intense focus on ease
of use, good performance and robustness, and synergy with existing desktop tools such as ISIS/Draw and
SpotFire. In this chapter we describe the three-tier enterprise software architecture, key data structures
that enable a wide variety of design scenarios and workflows, major technical challenges encountered and
solved, and lessons learned during its development and deployment throughout its production cycles.
In addition, PGVL Hub represents an extendable and enabling platform to support future innovations
in library and singleton compound design while being a proven channel to deliver those innovations to
medicinal chemists on a global scale.
Key words: Drug discovery, chem-informatics, molecular design, combinatorial chemistry, combinatorial library, synthesis protocol, PGVL, reactant, product, enumeration, filtering integration,
workflow, streamline, desktop tool, software deployment.
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_15, Springer Science+Business Media, LLC 2011
1. Introduction
Like many new technologies introduced to drug discovery, combinatorial library design and synthesis have matured during the
last 15 years (1–5). This is reflected by the fact that their practice
has shifted significantly from experts, such as computational
and combinatorial chemists, to general medicinal chemists working in the pharmaceutical industry (6–27). In terms of library
design technology, we have seen a similar shift of focus from
methodology development to integration and deployment. With
the maturing of this technology, three new needs have emerged.
First, a significant amount of chemistry knowledge has accumulated in the form of detailed synthetic protocols. These protocols not only contain step-by-step synthesis instructions but
also specify which chemical reactants are considered
compatible with the reaction conditions explored and validated
experimentally (termed the scope and limitation of the given
synthetic protocol). Systematic capturing, mining, and reuse of
such knowledge would bring tremendous value and competitive advantage to a pharmaceutical company in hit generation,
hit follow-up, and lead optimization. In a separate publication,
we have described Pfizer's effort over the past 10 years in this
type of systematic knowledge capture and reuse, which led to the
PGVL (Pfizer Global Virtual Library) chemistry knowledge base
(28). Second, unlike the expert community, which is more interested in the latest design algorithms or flexibility in customizable
solutions, medicinal chemists working on drug discovery projects
place more emphasis on ease of use, start-to-finish workflow
management, and robust deployment as their top needs (22–27).
For example, the ADEPT system was developed by scientists from
Glaxo and Daylight and deployed to the Glaxo discovery chemistry community as an integrated suite of Web tools on its corporate intranet for reactant selection, library enumeration,
molecular property profiling, and library design (25). The REALISIS system from Johnson & Johnson essentially accomplished
the same major goals of combinatorial library design, with more
focus on medicinal chemists and utilization of more advanced
software architecture and components (26). Finally, a modern
medicinal chemist uses many software packages to increase productivity. To gain acceptance by the chemist, a new software package needs to work synergistically with existing software packages
to reduce training effort and to enable richer and more powerful
workflows. Over the years, both commercial vendor offerings and in-house chem-informatics solutions have evolved to address these
three needs and have achieved varying degrees of
success (6–27). Within Pfizer, we have witnessed a similar path
2. Major Requirements
There are a few key design scenarios commonly requested by
medicinal chemists. They are described in the following sections.
2.1. Singleton Design
2.2. Markush Exemplification
In this design scenario, users can take advantage of the captured chemistry knowledge inside the PGVL chemistry knowledge base by using various readily available components, ranging from high-quality product enumeration instructions to pre-mined reactant lists suitable for pre-validated synthetic protocols,
in order to streamline library design and production (28). Since
a significant portion of modern corporate screening compound
collections originates from combinatorial chemistry, HTS hits
resulting from this subset synthesized via combinatorial chemistry
can be followed up quickly and effectively with targeted libraries
using the same pre-validated synthetic protocols and pre-mined
compatible reactant sets. This is one of the unique objectives for
PGVL Hub.
has been developed via the lead centric mining tool, and its design
concepts and application will be described in detail by a separate
publication (29). The output of a LEAP search is a collection of
PGVL virtual compounds, each linked to an available combinatorial synthesis protocol and a combination of explicit reactants that
fully describes how this compound can be synthesized. Chemists
can then use these results to further evaluate each hit and launch
one or several targeted library designs to follow up on the LEAPderived hits.
2.6. Additional Considerations
3. PGVL Hub
3.1. Three-Tier Enterprise Architecture
We have chosen the J2EE (http://www.sun.com/java/) three-tier enterprise software architecture for PGVL Hub (Fig. 15.1).
The client side is a J2SE GUI built as a Java Web Start (33)
deployable application, which enables easy and automated deployment and update of both prototype and production versions globally from a single central server machine. Chemists can easily
install the client-side software component via a Web link available on an internal Web page. The Java Web Start technology
also provides automatic version check and upgrade of the client-side component each time PGVL Hub is launched by the user.
This mechanism was used to deploy PGVL Hub during all stages
of the software life cycle with maximum ease and flexibility while
minimizing administrative cost.
[Figure 15.1 diagram labels: PGVL Hub client-side GUI component; Corporate Compound Structure Service; PGVL Services; PGVL Data; Product Enumeration Service; Other PGVL Computing Service; Corporate Molecular Property Computing Service; Library Planning & Production systems; Corporate Compound DBs]
Fig. 15.1. Three-tier J2EE architecture used by PGVL Hub. The client is a Java Swing GUI deployed via the Java Web
Start technology (33) to chemists' desktops. The middle tier is a WebLogic J2EE server (51). PGVL data are hosted by
Oracle (52), and the product enumeration service and the PGVL computing service are hosted on a SciTegic Pipeline Pilot
server (34). The corporate compound structure service provides support to the PGVL Hub client for compound ID to structure
look-up, query searches into corporate databases, inventory checking, and compound duplicate checking. The corporate
molecular property computing service (41) returns computed properties chosen by the user for a set of submitted molecules.
The J2EE middle tier manages the interaction between the client side and server-side backend resources (e.g., various databases and
molecular property computing services) and enables the client-side and server-side software components to be updated independently. The server-side backend has access to various corporate
compound databases and the PGVL chemistry knowledge base,
which delivers captured combinatorial reactions, synthetic protocols, and pre-mined and indexed suitable reactants for these protocols (28). It also contains computational services that perform
product enumeration and compute various molecular properties
at the request of the client-side GUI component. Since several
of the server-side services are not unique to PGVL Hub, we were
able to leverage emerging general-purpose services via the service-oriented architecture (SOA). Reciprocally, the PGVL server side
is also designed and deployed as a service, so software packages
other than the client-side component of PGVL Hub can tap into
the PGVL services as a valuable resource. The underlying database
engine for PGVL data is Oracle, and the underlying engine for
product enumeration is SciTegic Pipeline Pilot (34). More discussion on the PGVL services and captured chemistry knowledge
can be found in a separate publication (28).
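The decoupling that the middle tier provides can be illustrated with a minimal sketch. PGVL Hub itself is a Java/J2EE system; the Python below only models the idea that a client addresses backend services by name through a middle tier, so that a service can be replaced without changing the client. The class, service names, and payload layout are all hypothetical and do not reflect the actual PGVL interfaces.

```python
from typing import Any, Callable, Dict

Service = Callable[[Dict[str, Any]], Dict[str, Any]]


class MiddleTier:
    """Routes client requests to named backend services."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def register(self, name: str, service: Service) -> None:
        # Re-registering a name swaps the backend without touching the client.
        self._services[name] = service

    def handle(self, name: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        try:
            service = self._services[name]
        except KeyError:
            raise LookupError(f"no backend service registered as {name!r}") from None
        return service(payload)


def toy_enumeration(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for product enumeration: cross product of two reactant lists,
    # with products represented by placeholder labels.
    products = [f"{a}+{b}" for a in payload["A"] for b in payload["B"]]
    return {"products": products}


def toy_property(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a property service: returns a trivial per-molecule value.
    return {"properties": {m: {"label_length": len(m)} for m in payload["molecules"]}}


hub = MiddleTier()
hub.register("enumerate", toy_enumeration)
hub.register("properties", toy_property)

result = hub.handle("enumerate", {"A": ["a1", "a2"], "B": ["b1"]})
props = hub.handle("properties", {"molecules": result["products"]})
```

The client code only knows the service names, which mirrors the point that client-side and server-side components can be updated independently.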
3.2. Data Structures
A hierarchical data structure was designed to capture all data elements created during a design session, as shown in Fig. 15.2.
At the very basic level of Molecular Structure, a CTAB string
Fig. 15.2. Key data structures within PGVL Hub for library design. The three data structures (Collection, Design, and
Session) within PGVL Hub are hierarchical; each is built on top of the previous ones sequentially. A Collection contains
a set of molecules and their molecular properties. By default, a one-to-one relationship between two data entities is
assumed, unless marked as either 1-to-n, which means that for each parent entity there are n (1, 2, 3, . . .) instances of child
entities associated, or 1-to-*, which means that for one parent entity there could be any number (0, 1, 2, 3, . . .) of child entities
associated. An MDL CTAB string (35) is used to represent the 2D structure of a molecule. Design encapsulates everything
about a library design effort using one chemical reaction scheme. The Reaction is either an Rxn ID pointing to a pre-registered PGVL reaction (28), an MDL RDF string (35) for a user-drawn reaction scheme for product enumeration, or even
a Markush drawing with R-groups hanging off the core (also in MDL CTAB string format, see Ref. (35)) for the Markush
exemplification workflow. The Rxn Component is created for each reaction component in the reaction as a placeholder
for various reactant sets (or R-group sets for the Markush exemplification design workflow) for that reaction component
to enable reactant-based library design. The Libraries folder is a place to hold many explicitly designed libraries. This
folder also enables chemists to compare and combine individual libraries (designed based on different design objectives
and protocols) within this folder into new ones. The Library concept contains all elements of a single explicitly designed
library and is self-contained, including the Reaction information used to form the library. The single set of reactant sets
(one for each reaction component) is fully synchronized with the product collection automatically during a library
design to ensure self-consistency. Finally, Session contains everything a user has worked on during a single working
session after launching PGVL Hub and can be easily saved as an XML file for later use. Inside a Session folder, one
can have multiple Design folders, each with a unique reaction. The Generic Molecule Collection is intended for the
simple singleton design workflow. It is also a convenient place to share sets of molecules between different Design
folders via Drag-and-Drop or Copy-and-Paste operations. The LEAP Result folder is a specialized Collection which
holds the Lead-Centric-Mining (LEAP) results as a collection of PGVL molecules with their combinatorial origin (reaction,
protocol, and reactant combination to make each individual molecule).
(a molecular format published by MDL) with 2D atomic coordinates is used for molecular representation (35). PGVL Hub
not only renders molecular structures graphically based on the
2D atomic coordinates inside the CTAB string but also allows
chemists to update those 2D atomic coordinates via ISIS/Draw
to improve the 2D layout of a molecule if desired. At the
Molecule level, users can add any number of properties (textual or numerical) to a molecule, with the format being a combination of name and value. We have found this data structure
to be adequate, although it could be extended to handle more
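The hierarchy of Molecule, Collection, Design, and Session described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names, the placeholder CTAB text, and the reaction ID are hypothetical and do not reflect the actual (Java-based) PGVL Hub implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

PropertyValue = Union[str, float]  # properties are textual or numerical


@dataclass
class Molecule:
    ctab: str  # MDL CTAB string with 2D coordinates
    properties: Dict[str, PropertyValue] = field(default_factory=dict)


@dataclass
class Collection:
    name: str
    molecules: List[Molecule] = field(default_factory=list)


@dataclass
class ReactionComponent:
    label: str  # hypothetical label such as "A" or "B"
    reactant_sets: List[Collection] = field(default_factory=list)


@dataclass
class Library:
    reaction: str                    # copy of the reaction used
    reactant_sets: List[Collection]  # one per reaction component
    products: Collection


@dataclass
class Design:
    reaction: str  # Rxn ID, RDF string, or Markush CTAB
    components: List[ReactionComponent] = field(default_factory=list)
    libraries: List[Library] = field(default_factory=list)


@dataclass
class Session:
    designs: List[Design] = field(default_factory=list)
    generic_molecules: Collection = field(
        default_factory=lambda: Collection("Generic Molecule Collection"))


# Example: one molecule with a numerical and a textual property.
mol = Molecule(ctab="<CTAB placeholder>")
mol.properties["ClogP"] = 3.9
mol.properties["Source"] = "in-house"
amines = Collection("amines", [mol])
design = Design(reaction="RXN-EXAMPLE-001",  # hypothetical ID
                components=[ReactionComponent("B", [amines])])
session = Session(designs=[design])
```

Keeping properties as free-form name and value pairs mirrors the flexibility described above: any computed or imported annotation can be attached to a molecule without changing the structure.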
Library design, unlike singleton design, usually involves collections of many molecules. It is fairly common to encounter design
cases starting with hundreds to thousands of reactants for each
reaction component which could potentially lead to huge enumerated libraries. Such a large collection of molecules poses three
performance challenges to PGVL Hub. First, how do we move
them quickly between the client and server components of PGVL
Hub? Second, how do we store and manage them once they arrive
at a client machine? Lastly, how do we enhance the user experience when performing molecular property calculation on such
Fig. 15.3. Library design workflow: initiating a design by defining the reaction and gathering reactant sets. (a) Common workflows supported by PGVL Hub. Here Monomers inside the figure means reactant sets. Clicking on a button that
represents a step of the workflow leads to more detailed GUI panels for users to further specify what needs to be done.
(b) One can define a reaction either by searching, browsing, and selecting a pre-defined PGVL reaction or by drawing a new reaction
on the fly using ISIS/Draw. (c) A Markush core capped by R-groups plus actual R-group fragment sets are used for the
Markush exemplification workflow in place of reaction and reactant sets. (d) There are many ways to get reactant lists
into PGVL Hub (see main text). Here one GUI component of ChemSelect (AQB) (50) is shown. Names of chemistry
functional groups are listed in the menu for the user to specify which functional groups should and should not appear in the
desired reactants. The query is searched against in-house inventory systems, and the search results are loaded into
PGVL Hub as reactant lists.
Once the reactant lists are loaded, chemists can filter them
through various molecular property calculations, substructure
search/mapping, and similarity score against one or a set of lead
structures. The substructure mapping and similarity scores against
a collection of molecules are performed by a server-side SciTegic
Pipeline Pilot component on the fly, eliminating the need for pre-registration. For library synthesis considerations, PGVL Hub also
enables calculation of the amount of each reactant required for
library synthesis and determines its availability from the corporate
reagent inventory system. These calculations and filters are in
place to enable the designer to derive lists of desirable reactants for
library synthesis.
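The reactant amount calculation can be illustrated with a small sketch. In a fully combinatorial library, each reactant of one component is used once for every combination of the other components. The per-reaction amount (50 umol), the reactant IDs, and the inventory figures below are made-up values for illustration, and the function names are hypothetical; the actual calculation in PGVL Hub may include further factors such as excess equivalents.

```python
from math import prod


def uses_per_reactant(component_sizes):
    """Number of products each reactant of a component takes part in."""
    total = prod(component_sizes)
    return [total // n for n in component_sizes]


def available_reactants(inventory_umol, required_umol):
    """IDs of reactants whose stock covers the required amount."""
    return sorted(rid for rid, amount in inventory_umol.items()
                  if amount >= required_umol)


sizes = [2, 44]                   # e.g., 2 acids x 44 amines
uses = uses_per_reactant(sizes)   # each acid is used 44 times, each amine twice
amount_per_reaction_umol = 50.0   # assumed value
required_per_amine = uses[1] * amount_per_reaction_umol

inventory = {"amine-001": 250.0, "amine-002": 80.0, "amine-003": 100.0}
in_stock = available_reactants(inventory, required_per_amine)
```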
Having the desired reactant lists, the chemist can now create a virtual library by enumerating the product structures in a
fully combinatorial manner. Enumeration instructions are pre-validated for all PGVL registered reactions; for a user-specified
reaction, PGVL Hub enables enumeration via the Markush
representation of the reaction scheme. Once the products are
Fig. 15.4. Library design workflow: analysis and filtering of molecular collections. (a) A grid view panel and a table view panel
are shown here displaying collections of product and reactant molecules. Both views can be sorted by properties,
and the visibility and column location of each molecular property can be customized by the user. Both views allow the user to
browse through large collections of molecules very efficiently and mark molecules of interest with various color pens.
Also notice that the top workflow bar keeps track of design progress and highlights workflow steps already initiated.
The first example in Grid View also offers a good view of all content within a single Design. (b) The user finds, selects,
and submits computation jobs to the Pfizer molecular property calculation service (41). This service contains many in
silico models for ADME&T and even project-specific SAR activity models. (c) The user makes all filtering decisions within this
Decision Maker panel. All available molecular properties can be used by Decision Maker. The user's hand-marking
with color pens or a selection returned from a SpotFire session is used as binary input. Numerical data are filtered
using range slider bars. The color histograms display the property distributions of the starting set (in green) and the
current set (in blue) before a filtering action is fully committed. This gives the user immediate and dynamic feedback
on possible consequences of the current filter settings. Textual properties can also be used for decision making (not
shown here) via string searches such as Exact Match, Contains, Starts with, and Ends with in comparison with a
user-specified string. One can also use SpotFire for data visualization and selection.
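The three kinds of input handled by the Decision Maker (numerical ranges, string matches, and binary marks) can be sketched as a simple filter. This is a hedged illustration only: molecules are represented as plain dictionaries, and the function name, property names, and values are hypothetical rather than taken from PGVL Hub.

```python
# String search modes named in the caption above.
STRING_MATCHERS = {
    "exact": lambda text, query: text == query,
    "contains": lambda text, query: query in text,
    "starts_with": lambda text, query: text.startswith(query),
    "ends_with": lambda text, query: text.endswith(query),
}


def apply_filters(molecules, numeric_ranges, text_rules, required_marks=()):
    """Keep molecules that satisfy every range, string rule, and mark."""
    kept = []
    for mol in molecules:
        ranges_ok = all(name in mol and low <= mol[name] <= high
                        for name, (low, high) in numeric_ranges.items())
        text_ok = all(name in mol and STRING_MATCHERS[mode](mol[name], query)
                      for name, (mode, query) in text_rules.items())
        marks_ok = all(mol.get(mark, False) for mark in required_marks)
        if ranges_ok and text_ok and marks_ok:
            kept.append(mol)
    return kept


molecules = [
    {"id": "P1", "MW": 320.4, "ClogP": 2.1, "source": "in-house stock", "marked_red": True},
    {"id": "P2", "MW": 512.7, "ClogP": 5.3, "source": "vendor catalog", "marked_red": False},
    {"id": "P3", "MW": 401.2, "ClogP": 3.8, "source": "in-house stock", "marked_red": False},
]

kept = apply_filters(molecules,
                     numeric_ranges={"MW": (0, 450), "ClogP": (-1, 4)},
                     text_rules={"source": ("starts_with", "in-house")})
kept_ids = [m["id"] for m in kept]

marked = apply_filters(molecules, {"MW": (0, 450)}, {}, required_marks=("marked_red",))
marked_ids = [m["id"] for m in marked]
```

Evaluating the filter on a copy of the collection before committing it is what allows the starting and current distributions to be compared side by side.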
Fig. 15.5. Integration between PGVL Hub and ISIS/Draw. Clicking on any structural box in the PGVL Hub window brings up ISIS/Draw
with the molecular structure from that structural box, if one exists. Once the user is done creating or
polishing a structure drawing inside ISIS/Draw, a single click in ISIS/Draw will transfer the new structure drawing back
to the PGVL Hub structural box, and the ISIS/Draw window will disappear automatically.
Fig. 15.6. Seamless integration between PGVL Hub and SpotFire. The user can launch SpotFire from within PGVL Hub to visualize
the molecular properties associated with a molecule collection. Any selection made within SpotFire is dynamically passed
back to PGVL Hub as markings on individual molecules. The user can then use the Decision Maker within PGVL Hub to
make selections based on the SpotFire markings.
[Fig. 15.7 panel annotations: Reactants are sorted based on the number of failed products they are associated with. Drag the slider(s) to remove reactants with the highest number of failed products. Status on library: the effect of removing reactants on the remaining library is shown by the display.]
Fig. 15.7. Interactive combinatorial library shaping. (a) A user-defined Pass/Fail score (such as Rule-of-Five) can be
constructed and computed for product molecules based on existing molecular properties. This type of Pass/Fail score
can be used for combinatorial library shaping. (b) After the user selects a library for combinatorial shaping, PGVL Hub allows the
user to pick the appropriate Pass/Fail scores as input, then sorts the reactants for each reaction component from low to high
based on the number of Failed product molecules each reactant molecule is associated with and plots the library status visually.
The user then uses the slider bars (one for each reaction component) to remove the worst reactant(s) and gets immediate
feedback from the status report. The green curves are static, based on the input reactant sets; the blue curves are
updated dynamically to indicate the possible outcome. The user can explore various strategies for removing the worst offenders
from the reactant sets to reduce the number of Failed products while still maintaining a fully combinatorial library of good
size and production efficiency (number of products to be synthesized vs. number of reactants to be handled). The user then
commits the shaping by creating a new library.
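The shaping procedure in the caption above can be sketched for a two-component library as follows. This is a minimal illustration, not the PGVL Hub algorithm: the function names and the toy Pass/Fail data are hypothetical, and failure counts are computed once from the input library (matching the static ranking described above) rather than recomputed after each removal.

```python
from itertools import product


def failure_counts(reactants, failed, position):
    """Count failed products per reactant at the given component position."""
    counts = {r: 0 for r in reactants}
    for combo, is_failed in failed.items():
        if is_failed:
            counts[combo[position]] += 1
    return counts


def shape(a_set, b_set, failed, drop_a, drop_b):
    """Drop the worst drop_a A-reactants and drop_b B-reactants.

    Returns the kept reactants, the remaining library size, and the
    number of failed products left in the remaining library.
    """
    a_counts = failure_counts(a_set, failed, 0)
    b_counts = failure_counts(b_set, failed, 1)
    # Sort from low to high failure count; ties are broken by name.
    a_sorted = sorted(a_set, key=lambda r: (a_counts[r], r))
    b_sorted = sorted(b_set, key=lambda r: (b_counts[r], r))
    a_keep = a_sorted[:len(a_sorted) - drop_a]
    b_keep = b_sorted[:len(b_sorted) - drop_b]
    remaining = list(product(a_keep, b_keep))
    n_failed = sum(failed[combo] for combo in remaining)
    return a_keep, b_keep, len(remaining), n_failed


a_set = ["A1", "A2", "A3"]
b_set = ["B1", "B2", "B3", "B4"]
# Toy Pass/Fail score: every product containing A3 or B4 fails.
failed = {(a, b): (a == "A3" or b == "B4") for a, b in product(a_set, b_set)}

before = shape(a_set, b_set, failed, 0, 0)
after = shape(a_set, b_set, failed, 1, 1)
```

Because whole reactants are removed rather than individual products, the remaining library stays fully combinatorial, which is the point of shaping at the reactant level.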
Fig. 15.8. Substructure mapping, highlighting, and drill-down. Based on the on-the-fly substructure query and mapping
capability within SciTegic Pipeline Pilot, PGVL Hub allows the user to perform substructure queries against a set of target
molecules. In the example shown, a set of substructure queries globally collected and validated as undesirable substructure features to be avoided is mapped onto the target molecules (41b).
4. Remarks
4.1. Deployment, Usage, and Impact
Since the PGVL Hub client side is Java Web Start deployable and
Pfizer-supported desktop computers all have the Java Web Start
utility installed, the installation of PGVL Hub is easily done by
users themselves directly via a Pfizer internal web site. This web
site also contains other resources to facilitate the distribution,
training, and support of PGVL Hub. Training materials include
the user's guide, presentation slides, animated tutorials, and scientific literature pertaining to singleton and library designs. Support is provided by a global support team, a network of local
power users at various Pfizer research sites, and a global steering
committee comprising PGVL champions.
The initial deployment began around 2003 targeting a small
yet highly motivated community of approximately 100 beta
testers comprising mainly medicinal chemists and computational
chemists. Based on feedback from these users and results of two
formal software usability studies conducted at various Pfizer sites
involving medicinal chemists with some or no prior exposure to
PGVL Hub, we were able to add further enhancements to make
the software even better and more user-friendly. A full deployment was initiated around 2005 to the global discovery chemistry
community of over 2000 potential users at that time. The adoption and usage of PGVL Hub are tracked, and reports of usage
statistics are provided through an internal Web page; one graph
within a typical report is shown in Fig. 15.9. For the past few
years, PGVL Hub usage has been steady at 60–100 launches per
day. Of the approximately 1000 registered users, 30% are considered to be experienced users based on their numbers of logins
during the last 12 months. The usage tracking tool also identified expert users as well as novices which helped the support
team recognize opportunities for training and allocation of support efforts. The feedback from PGVL local champions and the
usage data collected so far illuminate some aspects of PGVL Hub
success, including penetration and adoption by the intended user
community (>50%), frequency of usage, and level of expertise
reached by expert users (about 1 in 6 is a frequent user). However,
the true impact of PGVL Hub should be measured by increased
quality of singletons and library compounds chemists designed to
move drug discovery projects forward and the productivity gained
due to PGVL Hub usage (53–56). Unfortunately we do not have
a systematic way of tracking these factors directly other than feedback from medicinal chemists and their research leadership. Ultimately, the success of PGVL Hub to enable smarter designs and
higher productivity should be better assessed by successful drug
candidates coming out of drug discovery projects.
[Fig. 15.9 graph omitted; the time axis runs from 07/01/04 to 02/16/09.]
Fig. 15.9. Tracking usage of PGVL Hub. Each time a user logs in to PGVL Hub, the user ID and a time stamp are recorded in a tracking database. A usage report
can be generated via a Web reporting tool.
It is likely that every pharmaceutical company has an integrated library design tool in place as part of its strategy to incorporate the combinatorial library approach into its drug discovery
process. Since most of these tools have not been published, we could only
make comparisons among ADEPT, REALISIS, and PGVL Hub
(see Table 15.1). Here we shall leave the details for readers to
explore further while making a general statement that all three are
designed to address essentially the same set of major questions in
library design. They differ only in level of software engineering,
GUI capability and intuitiveness, and scope of feature coverage.
4.3. A Proven Platform for Future Enhancement and Innovation
PGVL Hub has been well entrenched as one of the key desktop
molecular design tools used by Pfizer medicinal chemists. Its solid
three-tier enterprise architecture and powerful client-side component easily deployed by Java Web Start provide a very attractive
platform with a proven track record for future enhancement and
innovations in singleton and library design. There are many possibilities for further enhancement based on user requests as well
as attractive methodologies and algorithms already published in
the literature (6–27). Here we would like to list a few, with some
already being prototyped.
Yes
Yes
No
Reaction encoding/
enumeration method
Feature or capability
No
Yes
No
Limited set
Table 15.1
Comparison of three integrated library design tools from the pharmaceutical industry
Yes
Yes
Very large collection, including many vendor-supplied and internally developed in silico
models for ADME&T end points and target SAR
Yes
ISIS reaction scheme entered by user or predefined reaction object for 500 PGVL
registered combinatorial reactions/reactant
clipping and assembly of a Markush core
and R-groups
Software architecture
No
Targeted users
Integrated decision
making
Limited
Limited
Software performance
enhancements
Java GUI deployed via Web Start for easy deployment and update, J2EE three-tier. Access to
various DBs, SciTegic Pipeline Pilot, and other
molecular property predicting services
Very general and powerful. Both table- and grid-viewers are fully configurable by user. User can
also select and re-order property columns for both
display and export
For discovery projects where target protein structures are available, a structure-based library design strategy would be highly
desirable. In the current version of PGVL Hub, 2D library
molecules are exported out of PGVL Hub and imported to
SBDD-enabled molecular design tools; it would be desirable to
have a more integrated workflow. The challenge is how to deal
with the one-to-many relationship between a 2D molecule inside
PGVL Hub and its many potential 3D conformers in complex
with the target protein binding sites. By utilizing an aggregation
step to reduce the one-to-many relationship into a one-to-one
relationship (e.g., keeping just the best docking score of the best
3D conformation), the SBDD aspect is reduced to the best docking score, and a very simple molecular property is returned to
PGVL Hub for decision making. This approach is very simple and
intuitive, but is best for smaller libraries due to the computation-intensive nature of docking and scoring. One way to extend the
range of SBLD coverage to a much larger virtual space is through
the usage of Basis Products (21). The detail of this SBLD effort
using Basis Products has been described in a separate publication
(48).
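The aggregation step that turns the one-to-many relationship (one 2D molecule, many docked 3D conformers) into a single molecular property can be sketched as follows. The molecule IDs and scores are made-up values, the function name is hypothetical, and the sketch assumes that a lower docking score is better, which depends on the scoring function actually used.

```python
def best_score_per_molecule(pose_scores):
    """Reduce (molecule_id, docking_score) pairs to one best score per molecule.

    Assumes lower scores are better; flip the comparison otherwise.
    """
    best = {}
    for mol_id, score in pose_scores:
        if mol_id not in best or score < best[mol_id]:
            best[mol_id] = score
    return best


# Several docked conformers per 2D molecule (illustrative values).
poses = [("M1", -7.2), ("M1", -8.5), ("M2", -6.1), ("M1", -6.9), ("M2", -6.4)]
best = best_score_per_molecule(poses)
```

The resulting dictionary maps each 2D molecule to one number, which can then be treated like any other molecular property during filtering.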
All of the more advanced library design capabilities described
above can be integrated into the PGVL Hub platform in an intuitive way and utilized by medicinal chemists routinely to impact
progression of drug discovery projects.
5. Conclusions
PGVL Hub is an integrated desktop tool which has been developed and globally deployed throughout Pfizer discovery research
units for singleton and library design and synthesis. It has a highly
intuitive and interactive GUI, an excellent performance profile,
and is easy to install and update. For the past several years it
has been routinely accessed by hundreds of medicinal chemists
and other scientists for compound and library design work. This
tool provides direct access to Pfizer's proprietary PGVL chemistry knowledge base to enable fast HTS hit follow-up and lead
optimization. It offers a very rich and intuitive set of design capabilities and covers a wide range of workflows commonly used
by medicinal chemists. PGVL Hub also has the advantage of
being integrated with other desktop tools, such as ISIS/Draw,
Microsoft Excel, SpotFire, and other 2D and/or 3D molecular
design tools, and it leverages the best features of those tools to
provide synergies and an integrated workflow. Its three-tier J2EE
enterprise architecture and a powerful GUI provide a proven platform and delivery mechanism for future enhancements. Beyond
its usage statistics, the true measure of PGVL Hub's positive
impact in design quality and productivity should be an increase of
attractive chemical leads emerging from drug discovery projects.
Acknowledgments
Over the years, the PGVL development team has received strong
support and help from Pfizer research management, PGVL site
champions and steering committee, user communities in medicinal chemistry and computational chemistry, research informatics, and sister software development projects. We would like to
express our deepest gratitude and apologize for not being able to
list all their names explicitly here.
References
1. Hogan, J. C. Jr. (1997) Combinatorial chemistry in drug discovery. Nat Biotechnol 15,
328–330.
2. Hall, S. E. (1997) The future of combinatorial chemistry as drug discovery paradigm.
Pharm Res 14(9), 1104–1105.
3. Salemme, F. R., Spurlino, J., Bone, R. (1997)
Serendipity meets precision: the integration
of structure-based drug design and combinatorial chemistry for efficient drug discovery.
Structure 5, 319–324.
4. Floyd, C. D., Leblanc, C., Whittaker, M.
(1999) Combinatorial chemistry as a tool
for drug discovery. Prog Med Chem 36,
91–163.
This is one of the successful examples that
re-usable components are developed and
shared among multiple software development projects within Pfizer.
51. WebLogic is a J2EE middle-tier software
suite from the former BEA Systems, now
part of Oracle (http://www.oracle.com/
appserver/weblogic/weblogic-suite.html)
52. Oracle is a database application software from
Oracle (http://www.oracle.com/index.html)
53. Smith, G. F. (2006) Enabling HTS
Hit follow-up via Chemo informatics,
File Enrichment, and Outsourcing. High
Throughput Medicinal Chemistry II; MMS
Conferencing & Events Ltd., Institute of
Physics; London, 2006. This article is also
available on-line via this web link (http://
www.mmsconferencing.com/pdf/htmc/g.
smith.pdf).
54. Clark, J. D., Hu, Q., Kuki, A., Peng, Z.,
Sciammetta, N., Smith, G. F., Ramirez-
Chapter 16
Design of Targeted Libraries Against the Human Chk1
Kinase Using PGVL Hub
Zhengwei Peng and Qiyue Hu
Abstract
PGVL Hub is a Pfizer internal desktop tool for chemical library and singleton design. In this chapter, we
give a short introduction to PGVL Hub, the core workflow it supports, and the rich design capabilities
it provides. By re-creating two legacy targeted libraries against the human checkpoint kinase 1 (Chk1)
as a showcase, we illustrate how PGVL Hub could be used to help library designers carry out the steps
in library design and realize design objectives such as SAR expansion and improvement in both kinase
selectivity and compound aqueous solubility. Finally we share several tips about library design and usage
of PGVL Hub.
Key words: PGVL Hub, combinatorial chemistry, library design, reaction, synthesis protocol,
reactant, product, enumeration, filtering, Chk1, kinase, inhibitor, SAR, ADME&T (Absorption,
Distribution, Metabolism, Excretion, and Toxicity), selectivity, solubility, protein-ligand complex.
1. Introduction
1.1. PGVL Hub
PGVL Hub was developed for and deployed within Pfizer global
chemistry communities (1). The main goal of PGVL Hub is to
offer bench chemists a very capable desktop tool to (a) access
Pfizer's proprietary chemistry knowledge database containing
information about many experimentally validated combinatorial
chemistry synthesis protocols; (b) support and streamline the full
cycle of library design, synthesis, and registration; and (c) harness
the power of synergy with many other desktop software packages
(ISIS/Draw, MS-Excel, SpotFire (http://spotfire.tibco.com/),
and additional 2D/3D molecular design tools). PGVL Hub has
been used by more than 1000 users within Pfizer with more than
110,000 user logins accumulated since 2004. Its application to
targeted library design is the focus of this chapter.
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_16, Springer Science+Business Media, LLC 2011
In 2007, Teng and coworkers published a potent and selective
lead series of human Chk1 inhibitors (2). Their work was initiated by two advanced lead matters 1808-1 and 1819-1 obtained
through two rounds of targeted library design and synthesis based
on a high-throughput screening (HTS) hit Cpd-1 (see Fig. 16.1
for the progress of this hit through the library-based lead optimization). This report distills the essence of those two legacy targeted libraries to showcase the design process using PGVL Hub.
The PGVL Hub screen shots used in this report are recreations
of our legacy design efforts in the past.
[Fig. 16.1 structures omitted: Cpd-1; Series 1808 (2x44, VRXN-2-00010); Series 1819 (2x88, VRXN-2-00010).]
Property              Cpd-1        1808-1       1819-1
CHK1_Ki               0.005uM      0.3uM        0.0005uM
ClogP                 4.6          3.9          3.5
ACD_logD (pH=7.4)     3.2          2.3          2.3
CDK1_Ki               0.009uM      24uM         4%@1uM
CDK2_Ki               0.015uM      5uM          0.242uM
VEGF_Ki               0.0066uM     53.9%@1uM    29%@1uM
LCK                   37%@1uM      46%@1uM      14%@1uM
Fig. 16.1. Progress of two rounds of Chk1 targeted libraries. Cpd-1 is the original HTS hit with a broad kinase inhibition
profile, based on which the first round library was designed and synthesized. 1808-1 is the best hit from the first
round targeted library with an improved kinase selectivity profile, based on which the second round library was designed
and synthesized. 1819-1 is the best lead with improved potency, kinase selectivity, and solubility. Co-crystal structures
of the Chk1 kinase domain and corresponding lead compounds were solved and extensively utilized in structure-based
singleton and library designs. For details of the X-ray co-crystal structures, please refer to the publications from Ming
et al. (4a) and Foloppe et al. (4b).
2. Materials
More information about the biological roles of the Chk1 gene
and its protein product can be found in Ref. (3). In short, DNA
damage is sensed and signaled to the Chk1 kinase to activate the
checkpoint between the G2 (secondary gap) and the M (mitosis) phases of the cell cycle. The cell cycle arrest at this checkpoint
allows the cell to repair its DNA damage before proceeding to
the M phase (see Fig. 16.2). Cells entering the M phase with
unrepaired DNA damage tend to suffer from mitotic catastrophe, which ultimately leads to cell death via the apoptosis pathway. Since the anti-cancer drugs used in standard chemotherapies
are mostly DNA-damaging agents intended to induce cancer cell
death in the M phase, it has been hypothesized that a Chk1 kinase
inhibitor could synergistically enhance the anti-cancer effect of
those DNA-damaging agents. What makes this approach even
more attractive is the observation that normal cells tend to arrest
at various checkpoints in both G1 and G2 phases after DNA damage, while many cancer cells tend to rely heavily on the G2/M
checkpoint to repair DNA damage, where the activity of Chk1 is critical. This implies selective killing of cancer cells vs. normal cells (3).
The first X-ray crystal structure for the Chk1 kinase domain was
solved by Pfizer scientists at the Pfizer La Jolla site (4a), and the
key structural features around its hinge region in the ATP binding
site are also depicted in Fig. 16.2. Due to its above-mentioned
Fig. 16.2. Cell cycle, Chk1 at the G2/M checkpoint, and key structural features of the
Chk1 kinase domain (4a). The highlighted location is the hinge region of the Chk1 kinase
domain, which also corresponds to the same regions of protein-ligand structures shown
in Fig. 16.1.
connections with cancers, Chk1 has been identified as an attractive oncology target (5) for inhibition by small organic molecules
(6). For details on the biological assays and solutions of protein-ligand
X-ray crystal structures cited in this report, please refer to
Refs. (2) and (4a) directly.
Fig. 16.3. Screen shots of PGVL Hub. It has two ways to display molecules and their properties (Structural Viewer panel
and Table viewer panel). It has been integrated with SpotFire for data visualization. It also has a decision maker capable
of handling numerical and textual data as well as user selections by hand.
The library design tool used for this report is PGVL Hub (1).
Figure 16.3 contains several screen shots of PGVL Hub, highlighting its capabilities in viewing molecules and their properties,
decision making, and integration with SpotFire for data visualization. Figure 16.4 describes the main workflow of library design
seen through the workflow manager of PGVL Hub and the key
questions a library designer would ask and address with the help
of a library design tool such as PGVL Hub. Those key questions
in library design can be summarized into the following list: What
chemical reaction should be used? What reactants are available to
start the library design? Which reactants (also called monomers)
or products should be chosen for testing design hypotheses as
well as satisfying various constraints such as ADME&T compliance? How can the library design be communicated to collaborators as well as a downstream synthesis process? Figure 16.5
describes more about the design strategies and PGVL Hub's
capabilities that allow a library designer to analyze the possible
[Fig. 16.4 annotation: (4) Sooner or later, one wants to enumerate product structures explicitly.]
Fig. 16.4. The basic library design workflow enabled in PGVL Hub.
Fig. 16.5. Library design strategies and features enabled in PGVL Hub.
3. Methods
3.1. Information Known Before the Design of the First Targeted Library
As shown in Fig. 16.6, we planned to use a pre-registered combinatorial chemistry protocol (LJ0194) to synthesize the targeted library. PGVL Hub allowed us to easily search and load
this pre-registered reaction scheme into a design session without
the need to draw a reaction scheme required for product enumeration (see Fig. 16.7). Even for this simple reaction, a simple
reaction scheme drawn by users may not be sufficient to ensure
proper formation of product structures in the case where bonds
associated with chiral centers are near the reactive sites on the
reactants.
[Fig. 16.6 scheme omitted. Design hypothesis: explore the right-hand side of Cpd-1 for specificity (selectivity) and new ways to achieve affinity. VRXN ID: VRXN-2-00010; synthetic protocol: LJ0194. Labeled regions: selectivity pocket, hinge region. Reactant sets: 2 acids, 44 amines.]
Fig. 16.6. Objectives of the first round library. The main goal is to explore the protein pocket probed by the right-hand
side of Cpd-1 to improve kinase selectivity and further build SAR knowledge. The single aryl-aryl bond was replaced by
three bonds containing an amide group with more flexibility. A registered combinatorial synthesis protocol (LJ0194) was
used for this library, and a 2x44 plate format was planned before the design of the library.
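The size of the planned library follows directly from the fully combinatorial pairing of the two reactant sets. The sketch below only enumerates reactant ID pairs; the IDs are placeholders, and no product structures are generated, since that step relies on the pre-validated enumeration instructions on the server side.

```python
from itertools import product

# Placeholder reactant IDs (hypothetical): 2 acids and 44 amines.
acids = [f"acid-{i}" for i in range(1, 3)]
amines = [f"amine-{i:02d}" for i in range(1, 45)]

# Fully combinatorial pairing: every acid with every amine.
library = [{"product_id": f"{acid}+{amine}", "acid": acid, "amine": amine}
           for acid, amine in product(acids, amines)]

# Group the products by acid, one group of 44 per acid.
plates = {acid: [p for p in library if p["acid"] == acid] for acid in acids}
```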
3.2.2. Selection of Reactants
Due to the small size of the binding pocket accessed by the right-hand side of Cpd-1, we first hypothesized that smaller amines
would have a higher chance of fitting to that binding pocket. By
applying the MW and ClogP calculations and filtering, we significantly reduced the possible choices of amines from the original
set of 8449. Using molecular similarity score (11, 14) with respect
to 4-amino-2-methoxyphenol, we ensured that the neighborhood
of the right-hand side of Cpd-1 was well sampled (see Fig. 16.1).
Finally we looked up the reactant amount available in our chemical inventory for each reactant and only used those with sufficient amount for library production so that the designed library
could be synthesized without further delays. With those two considerations, we were able to focus the remaining choices further
Fig. 16.7. Accessing the pre-registered reactions and pre-mined suitable reactants for synthesis protocol LJ0194 to
initiate library design. PGVL Hub makes it simple to load the pre-registered reaction scheme and suitable reactants for
the reaction conditions specified in synthesis protocol LJ0194. The product structure enumeration and the synthetic
feasibility of those product molecules are taken care of by the PGVL Hub through extensive knowledge capturing and
reusing, so that the library designers can focus their efforts on design issues such as target binding, selectivity, and
ADME&T.
to a subset of a few hundred amines. Then we used the Structural Viewer panel of PGVL Hub (see Fig. 16.3) to display many
amines on a single page and browsed through them visually. Each
molecule was examined in terms of the possible hypotheses it could
help to form and validate. Molecular diversity of the final library
was also a consideration. Desirable amines were marked with the color
markers provided by PGVL Hub and included in the final set of
44 (plus a few backups). Even though the actual legacy design also
contained input from the project medicinal chemists and library
production chemists, the first targeted library was designed mainly
based on the reactant-level considerations described so far. The
first targeted library yielded several hits with weaker potency than
Cpd-1 (see Fig. 16.9); however, assay data on kinase selectivity
suggested that the top hit 1808-1 had a much improved kinase
selectivity profile and some improvement in solubility (see Fig.
16.1). This prompted the project team to solve the co-crystal
Fig. 16.8. Reactant-level (pre-enumeration) design steps. This is a screen shot of PGVL Hub during the design of the
two libraries. The reactant sets for this two-component reaction and the generated explicit libraries and products are
all captured during the design session (see the left-hand side). The A-component is for acids and the B-component
for amines. The molecular structures of the two special acids are shown in Fig. 16.6. Many annotations can be added
to reactants to aid their analysis and selection. Here ClogP, molecular weight (MW), similarity (SIMI) with respect to a
user-defined reactant, and reactant amount available in the inventory house are just a few examples.
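The reactant-level annotations listed in this caption (ClogP, MW, SIMI, inventory amount) lend themselves to a simple filtering pass. The following Python sketch is only an illustration of that idea and is not part of PGVL Hub: the `Reactant` record, the representation of fingerprints as sets of bit indices, and every threshold passed to `select_reactants` are assumptions made for the example. The similarity measure is the Tanimoto coefficient (11).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reactant:
    """Hypothetical reactant record; field names are illustrative only."""
    name: str
    mw: float               # molecular weight
    clogp: float            # calculated logP
    fingerprint: frozenset  # indices of the "on" bits of a structural fingerprint
    stock_mg: float         # amount available in the inventory


def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_reactants(candidates, reference_fp, max_mw, max_clogp,
                     min_simi, min_stock_mg):
    """Keep reactants that pass all four reactant-level filters.

    The survivors are returned sorted by decreasing similarity to the
    reference fingerprint, so the closest neighbors are reviewed first.
    """
    kept = []
    for r in candidates:
        simi = tanimoto(r.fingerprint, reference_fp)
        if (r.mw <= max_mw and r.clogp <= max_clogp
                and simi >= min_simi and r.stock_mg >= min_stock_mg):
            kept.append((simi, r))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in kept]
```

In practice the surviving set would still be browsed visually, as described in the text; the filters only reduce the list to a size that can be inspected by eye.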
Fig. 16.9. Top hits from the first targeted library. One can see that a fairly diverse set of small amines are all tolerated
by the binding pocket but with a significant reduction in potency when compared with the initial HTS hit Cpd-1 (5 nM).
On the other hand, the top hit 1808-1 shows significant improvement in kinase selectivity and improved solubility (see
data given in Fig. 16.1).
[Figure 16.10: reaction scheme combining 2 acids with 88 amines (R1, R2), with the selectivity pocket and hinge region annotated. Design hypothesis: explore the right-hand side of 1808-1 for specificity (selectivity) and further improve potency. VRXN ID: VRXN-2-00010; synthetic protocol: LJ0194.]
Fig. 16.10. Objectives of the second round targeted library. The main goal is to further expand the SAR knowledge around
the right-hand side of 1808-1 with the aim to improve potency while retaining kinase selectivity. The same combinatorial
synthesis protocol (LJ0194) was used and a 2 × 88 plate format was planned before the design of the library.
Fig. 16.11. Product-level design steps. In this screen shot, product structures and their calculated properties are listed in
the table. Those annotations are key to implementing various design considerations such as ADME&T profile (e.g., Rule of
Five (8, 9), LogD (10)), molecular similarity with respect to lead compound 1808-1 (SIMI), protein–ligand complementarity
(docking score against the binding pocket in the Chk1 kinase domain initially occupied by 1808-1, such as HT score (13)),
kinase selectivity (predicted CDK2 activity based on an in silico model), and finally checking for duplicates against the
corporate compound database (PRGL_Lookup).
Fig. 16.12. Top hits from the second round library. One compound 1819-1 (0.5 nM) shows significant improvement over
the lead 1808-1 (300 nM) in terms of Chk1 inhibition results. More data show (see Fig. 16.1) that it is even more potent
than the initial hit Cpd-1 (5 nM) while having a much improved kinase selectivity profile and better solubility.
library compounds, supplemented by the two co-crystal structures associated with 1801-1 and 1819-1 showing significant
protein flexibility around the ligand binding site, provided a
solid foundation for additional lead optimization effort on this
project (2).
There are many design requirements, constraints, and
hypotheses for a given library design case. In our example, we
have touched upon several reactant-level as well as product-level design
considerations. These considerations include, but are not limited to,
ADME&T properties, similarity with respect to one or more lead
molecules, docking and scoring against a given protein binding
site, activity models for selectivity profiling, and even practical
issues such as reactant availability in chemical inventory systems
and duplication check against corporate compound collections.
Historical usage tracking strongly suggests that PGVL Hub is a
proven, streamlined, and highly effective design environment to
fulfill many of those diverse library design objectives in the hands
of Pfizer medicinal and computational chemists.
4. Notes
1. Shorten the cycle time: The design–synthesis–test cycle of lead
optimization is the dominant workflow in drug discovery
projects. The library design–synthesis–test cycle should be
short enough to be compatible with the progression of the
project so that relevant project questions can be proposed
and answered by targeted libraries in a timely manner. Effective communication and coordination among the library
designer, his/her project collaborators, and the library production team is essential to reduce the cycle time. PGVL
Hub allows one library designer to save a full design session
into a file and share it with another collaborator to enable
effective communication and coordination. Selecting only
reactants that are readily available from chemical inventory
systems is another way to bypass the wait required to restock missing reactants, and the inventory check feature of
PGVL Hub makes this check straightforward.
2. Complementary to singleton synthesis: In terms of lead optimization, library synthesis works as a shot-gun approach,
with multiple shots on the goal while its resolution is limited by availability of types of combinatorial chemistries and
suitable reactants. On the other hand, the one-on-one singleton synthesis practiced by standard medicinal chemistry
offers the highest resolution, yet at a lower throughput. So
for a well-explored SAR region, the singleton approach is
the best way to further refine project leads effectively, while
Acknowledgments
We would like to express our gratitude toward other members of
the Chk1 project team for their design input and their efforts in
library production (Haresh Vazir and Dr. Ming Teng), bio-assays
11. (a) Tanimoto, T. T. (1957) IBM Internal Report 17th Nov. 1957; and (b)
Wikipedia entry on the Tanimoto coefficient:
http://en.wikipedia.org/wiki/Tanimoto_coefficient#Tanimoto_Coefficient_.28Extended_Jaccard_Coefficient.29
12. (a) Gehlhaar, D. K., Verkhivker, G. M.,
Rejto, P. A., Sherman, C. J., Fogel, D. B.,
Fogel, L. J., Freer, S. T. (1995) Molecular recognition of the inhibitor AG-1343
by HIV-1 protease: conformationally flexible docking by evolutionary programming.
Chem Biol 2(5), 317–324.
12. (b) Gehlhaar, D., Bouzida, D., Rejto,
P. A. (1999) Reduced dimensionality in
Chapter 17
GLARE: A Tool for Product-Oriented Design of Combinatorial
Libraries
Jean-François Truchon
Abstract
Combinatorial chemistry with two or more diversity points often leads to an immense number of theoretical products. It is sensible to select the reagents based on the desired properties of the products in the
hope of maximizing the usefulness of the synthesized molecules. The presented tool enables the filtering
of reagents such that any further reagent selection will form products matching the desired properties.
Virtual combinatorial libraries leading to thousands of billions of products can be rapidly assessed. The
publicly available software (http://glare.sourceforge.net) and key algorithmic elements are discussed.
Key words: Combinatorial library design, computer algorithms, product properties, multi-objective optimization.
1. Introduction
The design of chemical libraries often requires the selection of
only a small number of reagents compared to what is available
commercially or from proprietary repositories. The combinatorial
nature of multi-reagent synthesis can lead to far more products
than can be synthesized or screened. It is thus of great interest to
apply filtering schemes that are adapted to the goal of the chemical library (1, 2). In order to help the chemist focus on only the
reagents that would form a library with desired computed properties, it is attractive to use a computational algorithm to eliminate the unproductive reagents. In practice, it is often difficult
to map the desired product properties into reagent-based filtering rules. For example, if the products need to fit within a log
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_17, Springer Science+Business Media, LLC 2011
P range, filtering rules applied to the reagents are rather difficult to find. Although one can guess a reagent-based threshold, it
would be difficult to account for the fact that some reagent classes
are fundamentally greasier than others. This sort of strategy has
been found to be misleading (3). The daunting task of generating chemical structures for millions or billions of virtual products
and assessing their properties in order to find the best reagents to
form the combinatorial matrix is clearly challenging. This is worsened by the multi-objective thresholds needed when more than
one property is monitored simultaneously.
A practical solution to this problem, called GLARE (Global
Library Assessment of REagents) (4), has been developed and
validated in our laboratories and is explained in this chapter. We
will focus on a specific chemical combinatorial library to illustrate
the workflow and the use of the software.
2. Materials
2.1. Computer Program GLARE
The computer program used in this chapter has been made publicly available under an Open Source Initiative BSD License at
http://glare.sourceforge.net. This program is written in C++ and
has been successfully compiled on diverse platforms such as Mac
OS 10.X, Linux, IBM AIX, and Windows XP. In its current form, GLARE is mainly a command line application that is
invoked identically on any of the platforms. Details of the parameters and options can be found at the aforementioned web site.
2.2. Chemical Databases
There exist a multitude of chemical reagent sources. The Available Chemical Directory™ (ACD) collection, from Symyx Technologies Inc., lists as many as 1,160,000 unique chemicals with
chemical structure, pricing, supplier, purity, forms, etc.
[Figure 17.1: reaction scheme. Amino alcohols (651) + aldehydes (637) + sulfonyl chlorides (143) → oxazolidines (59,300,241).]
Fig. 17.1. The oxazolidine library used to illustrate how the GLARE tool works. The
number of reagents considered and the total number of products they can potentially
form are given in parentheses under each chemical class.
[Figure 17.2: four histograms of product frequency (%) for the initial and final libraries, by number of hydrogen-bond acceptors, number of hydrogen-bond donors, number of non-hydrogen atoms, and calculated logP.]
Fig. 17.2. Property profiles of the products in the oxazolidine library formed with all available reagents (black) and after
filtering the reagents (grey) based on the product properties with GLARE. The multi-objective thresholds are illustrated
by the dashed vertical lines. The initial library is formed by 651 × 637 × 143 products and the filtered library by 144 ×
143 × 92 products (aminoalcohols × aldehydes × sulfonyl chlorides).
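The library sizes quoted in this caption follow directly from multiplying the number of reagents in each dimension. As a small worked check (a sketch only; the dictionary keys are names chosen here for readability), the arithmetic can be written as:

```python
from math import prod

# Reagent counts per combinatorial dimension, as quoted in the caption.
initial = {"aminoalcohols": 651, "aldehydes": 637, "sulfonyl_chlorides": 143}
filtered = {"aminoalcohols": 144, "aldehydes": 143, "sulfonyl_chlorides": 92}


def library_size(dimensions):
    """Number of products in the full combinatorial matrix."""
    return prod(dimensions.values())


initial_size = library_size(initial)
filtered_size = library_size(filtered)

# Fraction of the original virtual products that the filtered matrix spans.
fraction_kept = filtered_size / initial_size
```

The initial product count agrees with the total given for the oxazolidine library in Fig. 17.1, and the filtered matrix covers only a few percent of it.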
3. Methods
The different steps, files, and parameters necessary to optimize
the oxazolidine library are discussed in this section.
3.1. Selection of the Reagents
It is standard practice to remove chemical functionalities likely to interfere with subsequent synthetic steps in the library. To this end, we
used a Merck & Co proprietary web tool called Virtual
Library Toolkit (VLTK), described elsewhere (6).
3.2. Product Properties and Offset Calculations
[1]
\[ P_{\text{offset}} = P_{\text{product}} - \sum_{i=1}^{N} P_{\text{reagent}_i} \tag{2} \]
This has been shown to work well for a diverse set of libraries
(see Note 2) (4). In Fig. 17.3, the offset is calculated for the
oxazolidine library for properties related to Lipinski's rule of five
(8): the number of hydrogen bond acceptors (HBA), the number
of hydrogen bond donors (HBD), the number of non-hydrogen
atoms (NHA), and the calculated log P (9).
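[Figure 17.3: reaction scheme in which three reagents (R1, R2, R3) combine to give the oxazolidine product (P), with the property table below.]

        P      R1     R2     R3     Offset
HBA     3      1      1      2      −1
HBD     0      4      0      0      −4
NHA     18     6      4      10     −2
logP    1.3    −0.3   0.3    1.5    −0.2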
Fig. 17.3. Calculation of the oxazolidine offsets from a specific example. From the product (P ) property, the corresponding property of each reagent (Ri ) is subtracted. Four properties are
considered: the number of hydrogen bond acceptors (HBA), the number of hydrogen
bond donors (HBD), the number of non-hydrogen atoms (NHA), and the logarithm of the
octanol/water partition constant (log P ).
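The offset calculation described in this caption can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than GLARE code: the function names are invented here, and only the three integer count properties of the Fig. 17.3 example (HBA, HBD, NHA) are used.

```python
# Property values of the product and of the three reagents in the
# Fig. 17.3 example, restricted to the three count properties.
PRODUCT = {"HBA": 3, "HBD": 0, "NHA": 18}
REAGENTS = [
    {"HBA": 1, "HBD": 4, "NHA": 6},   # R1
    {"HBA": 1, "HBD": 0, "NHA": 4},   # R2
    {"HBA": 2, "HBD": 0, "NHA": 10},  # R3
]


def property_offset(product, reagents):
    """Offset = product property minus the sum of the reagent properties."""
    return {name: product[name] - sum(r[name] for r in reagents)
            for name in product}


def predict_product(reagents, offset):
    """Additive estimate of product properties: reagent sum plus offset."""
    return {name: offset[name] + sum(r[name] for r in reagents)
            for name in offset}


OFFSET = property_offset(PRODUCT, REAGENTS)
```

Once the offset has been derived from one representative product, `predict_product` estimates the properties of any other reagent combination without building its structure, which is the point of the additive scheme.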
With GLARE, there are only two file types that need to be prepared. First, the reagent property files contain one reagent per
line in a text file starting with a reagent ID followed by a list
of numbers corresponding to the reagent properties. The offset
information is given in a separate file with the same format and
contains only one line.
Second, the virtual library is combined according to the
instructions outlined in the library definition file. An example
for the oxazolidine library is given in Fig. 17.4. The keyword
DIMDEF associates a list of reagent property files to one combinatorial dimension identified by a user-defined alias (e.g., ALDEHYDES). The listed reagent files are simply appended in the program. The LIBDEF keyword is followed by a user-defined library
# Defines the combinatorial dimensions and list the reagent property files that are combined.
DIMDEF AMINOALCOHOLS amino_alcohols_acd.gli amino_alcohols_inhouse.gli
DIMDEF ALDEHYDES aldehydes_acd.gli
DIMDEF SULFONYLS sulfonyl_chlorides_acd.gli
DIMDEF OFFSET oxazolidine_offset.gli
# Defines a combinatorial library called oxazolidines formed by the matrix of products
# from the combination of the listed dimensions
LIBDEF OXAZOLIDINES AMINOALCOHOLS ALDEHYDES SULFONYLS OFFSET
# Defines a property name with the expected minimum and maximum value of the products.
PROPDEF HBD 0 5
PROPDEF HBA 0 10
PROPDEF NHA 0 35
PROPDEF LOGP -2.4 5.0
# Gives the order of the properties found in the property input files.
INPUTDEF HBA HBD NHA LOGP
Fig. 17.4. Example of the library definition file for GLARE. Bold text indicates a dedicated keyword, text in italic a user-defined alias, lines starting with a hash mark are
comments, and normal text gives the keyword-associated parameters.
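To make the structure of the two file types concrete, the following Python sketch reads a library definition like the one in Fig. 17.4 and one line of a reagent property file. It is not the GLARE parser: the function names are invented for this illustration, whitespace-separated fields are an assumption, and the reagent ID and values in the usage note are made up.

```python
EXAMPLE_DEFINITION = """\
# Dimensions and the reagent property files that are combined.
DIMDEF AMINOALCOHOLS amino_alcohols_acd.gli amino_alcohols_inhouse.gli
DIMDEF ALDEHYDES aldehydes_acd.gli
DIMDEF SULFONYLS sulfonyl_chlorides_acd.gli
DIMDEF OFFSET oxazolidine_offset.gli
LIBDEF OXAZOLIDINES AMINOALCOHOLS ALDEHYDES SULFONYLS OFFSET
PROPDEF HBD 0 5
PROPDEF HBA 0 10
PROPDEF NHA 0 35
PROPDEF LOGP -2.4 5.0
INPUTDEF HBA HBD NHA LOGP
"""


def parse_library_definition(text):
    """Collect DIMDEF, LIBDEF, PROPDEF, and INPUTDEF lines into dictionaries."""
    definition = {"dimensions": {}, "libraries": {},
                  "properties": {}, "input_order": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and hash-mark comments carry no data
        keyword, *args = line.split()
        if keyword == "DIMDEF":
            definition["dimensions"][args[0]] = args[1:]
        elif keyword == "LIBDEF":
            definition["libraries"][args[0]] = args[1:]
        elif keyword == "PROPDEF":
            definition["properties"][args[0]] = (float(args[1]), float(args[2]))
        elif keyword == "INPUTDEF":
            definition["input_order"] = args
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return definition


def parse_reagent_line(line, input_order):
    """Split 'ID v1 v2 ...' into (ID, {property: value})."""
    reagent_id, *values = line.split()
    if len(values) != len(input_order):
        raise ValueError(f"expected {len(input_order)} values for {reagent_id}")
    return reagent_id, dict(zip(input_order, map(float, values)))
```

For instance, with the property order above, a hypothetical line such as `AA0001 1 4 6 -0.3` would be read as HBA = 1, HBD = 4, NHA = 6, and LOGP = -0.3.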
[Figure 17.5: plot against the goodness filtering threshold (%), with both axes running up to 100.]
Fig. 17.5. This figure shows how requiring a higher fraction of the products to comply
(goodness) with the desired product properties reduces the fraction of retained reagents
(effectiveness). The initial goodness of the oxazolidine library used here is 18%.
[Figure 17.6: number of reagents left (0 to 200) for the aminoalcohols, aldehydes, and sulfonyl chlorides, plotted against the scaling parameter on a log scale from 0.001 to 1000.]
Fig. 17.6. This figure shows the final number of reagents left in each dimension once
a 95% goodness threshold is obtained, as a function of the scaling parameter displayed on a log scale. The larger the scaling parameter, the more reagents from the initially less populated
dimension are left. We found that a value of 6 generally leads to useful results.
[Figure 17.7: optimized oxazolidine library effectiveness (24% to 40%), with points labeled by their timings (111 s, 50.6 s, 18.7 s, 3.94 s, 1.03 s, 0.24 s, 0.07 s, 0.04 s) and one point marked "no partitioning".]
Fig. 17.7. This figure illustrates the advantages and disadvantages of using partitioning. On the one hand, the timings (shown next to the individual points in seconds) are
tremendously reduced; on the other hand, the effectiveness of the optimized oxazolidine library is sub-optimal with smaller partitions. A partition of 16 reagents seems best
overall.
In summary, the two main parameters related to the compliance to the product property rules (goodness) and the number
of reagents left for further selection (effectiveness) can be controlled by adjusting the algorithmic goodness threshold, the scaling parameter, and the size of the reagent partitions. Each library
being different, it may sometimes be useful to deviate from the
proposed defaults. The oxazolidine library is a good surrogate
for the relationships normally involved and Figs. 17.5, 17.6, and
17.7 can be used to assess sensitivity and expected effects of modifying these parameters.
4. Notes
1. Here the words "good" and "goodness" are strictly related
to the binary classification that a product is good only if it
fits all the multi-objective criteria. GLARE could easily be
adapted to work with a scalar fitness score.
2. The most spectacular exceptions to the property additivity
scheme come from nitrogen atoms that can change their
basicity, their polar surface area, their number of donors,
etc. If this becomes an issue for a library, the reagents can
be initially split according to each case and a different offset
used.
3. When the partitioning scheme is used, only a small subset of
the products is examined and the goodness is then defined
as the fraction of the examined products with the desired
product properties.
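The binary notion of goodness used in these notes can be illustrated with a brute-force Python sketch. It is emphatically not the GLARE algorithm, which is designed to avoid examining every product; the function names are assumptions, the property ranges are those of the PROPDEF lines in Fig. 17.4, and the additive estimate with an offset follows Section 3.2.

```python
from itertools import product

# Allowed product property ranges, copied from the PROPDEF lines of Fig. 17.4.
RANGES = {"HBD": (0, 5), "HBA": (0, 10), "NHA": (0, 35), "LOGP": (-2.4, 5.0)}


def product_properties(reagents, offset):
    """Additive estimate: sum of the reagent properties plus the offset."""
    return {name: offset[name] + sum(r[name] for r in reagents)
            for name in RANGES}


def is_good(properties):
    """A product is good only if every property lies within its range."""
    return all(low <= properties[name] <= high
               for name, (low, high) in RANGES.items())


def goodness(dimensions, offset):
    """Fraction of good products over the full matrix (non-empty dimensions)."""
    combinations = list(product(*dimensions))
    good = sum(1 for combo in combinations
               if is_good(product_properties(combo, offset)))
    return good / len(combinations)


def reagent_goodness(dimensions, offset, dim_index, reagent):
    """Goodness of the sub-library in which one dimension is fixed to a reagent."""
    trial = list(dimensions)
    trial[dim_index] = [reagent]
    return goodness(trial, offset)
```

Ranking reagents by `reagent_goodness` and discarding the lowest-scoring ones is one simple way to raise the overall goodness; how GLARE itself selects and removes reagents is described in the original publication (4).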
Acknowledgments
The author thanks Dr. Christopher Bayly from Merck Frosst
Canada for his initial important contribution to GLARE and for
a careful proofreading of this chapter.
References
1. Gillet, V. J. (2008) New directions in library
design and analysis. Curr Opin Chem Biol 12,
372–378.
2. Song, C. M., Bernardo, P. H., Chai, C. U.,
Tong, J. C. (2009) CLEVER: Pipeline for
designing in silico chemical libraries. J Mol
Graph Model 27, 578–583.
3. Truchon, J.-F., Bayly, C. I. (2006) Is there
a single "Best Pool" of commercial reagents
to use in combinatorial library design to conform to a desired product-property profile?
Aust J Chem 59, 879–882.
4. Truchon, J.-F., Bayly, C. I. (2006)
GLARE: a new approach for filtering
large reagent lists in combinatorial library
design using product properties. J Chem
Inf Model 46, 1536–1548. http://glare.sourceforge.net
5. Conde-Frieboes, K., Schjeltved, R. K., Breinholt, J. (2002) Diastereoselective synthesis of
2-aminoalkyl-3-sulfonyl-1,3-oxazolidines on
solid support. J Org Chem 67, 8952–8957.
6. Feuston, B. P., Chakravorty, S. J., Conway, J. F., Culberson, J. C., Forbes, J.,
Kraker, B., Lennon, P. A., Lindsley, C.,
McGaughey, G. B., Mosley, R., Sheridan, R.
P., Valenciano, M., Kearsley, S. K. (2005)
Web enabling technology for the design,
enumeration, optimization and tracking of
compound libraries. Curr Top Med Chem 5,
773–783.
7. Shi, S. G., Peng, Z. W., Kostrowicki, J.,
Paderes, G., Kuki, A. (2000) Efficient combinatorial filtering for desired molecular properties of reaction products. J Mol Graph
Model 18, 478–496.
8. Lipinski, C. A., Lombardo, F., Dominy, B.
W., Feeney, P. J. (1997) Experimental and
computational approaches to estimate solubility and permeability in drug discovery and
Chapter 18
CLEVER: A General Design Tool for Combinatorial Libraries
Tze Hau Lam, Paul H. Bernardo, Christina L.L. Chai,
and Joo Chuan Tong
Abstract
CLEVER is a computational tool designed to support the creation, manipulation, enumeration, and
visualization of combinatorial libraries. The system also provides a summary of the diversity, coverage,
and distribution of selected compound collections. When deployed in conjunction with large-scale virtual
screening campaigns, CLEVER can offer insights into what chemical compounds to synthesize, and,
more importantly, what not to synthesize. In this chapter, we describe how CLEVER is used and offer
advice in interpreting the results.
Key words: Virtual combinatorial library, Markush technique, compound analysis, chemoinformatics, chemistry.
1. Introduction
Combinatorial chemistry has become increasingly essential in the
modern drug discovery pipeline (1, 2). Through the discovery of
new chemical reactions and commercially available reagents, the
size of these libraries has grown exponentially over the past few
years (3). Often, such libraries are far too large to be synthesized
and screened in their entirety. Moreover, the output frequently contains a high level of redundancy in terms of the similarity in the
physicochemical properties of the derived compounds. Therefore,
a rational approach for combinatorial library design is desirable
in order to maximize the outcome of an expensive synthesis and
screening campaign (4).
Here we introduce CLEVER (Chemical Library Editing,
Visualizing, and Enumerating Resource), a platform-independent
J.Z. Zhou (ed.), Chemical Library Design, Methods in Molecular Biology 685,
DOI 10.1007/978-1-60761-931-4_18, Springer Science+Business Media, LLC 2011
2. Materials
1. Java version 1.6 or above.
2. SmiLib v2.0 (5) for rapid combinatorial library generation in Simplified Molecular Input Line Entry Specification
(SMILES) (6).
3. The Chemistry Development Kit (CDK) Application Programming Interface (API) (7), OpenBabel (8), or CORINA
(9) for generating 3D coordinates (SDF format) from
SMILES strings.
4. Jmol (10) for interactive display of molecular structures in
3D space.
5. JFreeChart (http://www.jfree.org/jfreechart/) for generating histograms and 2D scatter plots for chemical compound
analysis.
3. Methods
CLEVER is implemented using the Java 3D API (see Section
4.1). The main framework is made up of five key modules for
chemical library editing, enumeration, conversion, visualization,
and analysis. The operations of these functionalities are accomplished by the various applications at the resource layer. For the
purpose of illustration, the compound calothrixin B, a secondary
metabolite isolated from the Calothrix cyanobacteria (11–13), is
used as the scaffold molecule with the variable functional groups
[Rn] attached (Fig. 18.1). The calothrixins are redox-active natural products which display potent antimalarial and anticancer
properties and thus there is interest in probing the physical as
well as biological profiles of their derivatives (14). In this exercise,
six functional groups have been selected as the building blocks
(Table 18.1).
Fig. 18.1. Compound CID: 9817721 and its corresponding scaffold structure for
enumerating a novel library.
Table 18.1
SMILES string configuration for scaffold and building blocks

Scaffold            SMILES
S1                  O=C(C(C(C=C([R3])C=C1)=C1N=C2)=C2C3=O)C4=C3C5=CC=C([R2])C([R1])=C5N4

Attachment blocks   SMILES
B1                  C[A]
B2                  C(C)(C)([A])
B3                  F[A]
B4                  CC[A]
B5                  C=C[A]
B6                  C1=CC=CC=C1[A]
1. Use the library editor to create a library file for the compounds under study (Fig. 18.2). Library files are essentially plain text files that contain a record on each line,
with an entry identifier and a SMILES string for the
Fig. 18.4. Chemical library enumeration. (a) Initiation for the scaffold and block lists. (b)
Illustration of the usage of the enumerator.
6. Click on the Enumerate Library button to start enumeration. A full enumeration will generate a new library consisting of 216 compounds derived from the systematic permutation of the variable sites with the six attachment blocks on
the core scaffold.
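The figure of 216 compounds follows from placing any of the six attachment blocks at each of the three variable sites of the scaffold (6 × 6 × 6). The Python sketch below only counts and lists these assignments; it does not join SMILES strings, which is the job of the enumerator, and the function name is an assumption made for this illustration.

```python
from itertools import product

# Attachment blocks from Table 18.1 (identifier -> SMILES with [A] marker).
BLOCKS = {
    "B1": "C[A]",
    "B2": "C(C)(C)([A])",
    "B3": "F[A]",
    "B4": "CC[A]",
    "B5": "C=C[A]",
    "B6": "C1=CC=CC=C1[A]",
}

# Variable sites on scaffold S1.
SITES = ("R1", "R2", "R3")


def enumerate_assignments(blocks, sites):
    """Yield every assignment of one block identifier to each variable site."""
    for combo in product(blocks, repeat=len(sites)):
        yield dict(zip(sites, combo))


ASSIGNMENTS = list(enumerate_assignments(BLOCKS, SITES))
```

Each entry of `ASSIGNMENTS` names which block occupies which site, so the list can also serve as a checklist when inspecting the enumerated library.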
3.2.2. Flexible Library Enumeration
3.2.3. Library Enumeration Using Linkers
3. Uncheck the empty linker option to allow addition and modification of the linkers (Fig. 18.6). In this exercise, we only
demonstrate enumeration using two linkers. More linkers
could be included for chemical library construction.
3.3. Chemical Library Analysis
3.3.1. Computation of Physicochemical Properties
3.3.2. Filtering of Chemical Library Based on Predefined Schemes
3.3.3. Evaluation of Chemical Libraries
Fig. 18.7. Physicochemical properties computation and the filtration of chemical libraries
based on a predefined scheme.
4. Notes
1. Install a Java Virtual Machine (a runtime version of Java,
or JRE 1.6 and above). The JVM is compatible with all the major
operating systems including Windows, MacOS, and Linux.
2. Ensure the input scaffold and the building block plain text
lists are saved with the .smi extension. Files with any other extension are not recognized by the CLEVER enumerator and will generate an error.
3. CLEVER only allows up to a maximum of 90 [Rn] functional groups to be defined. However, there is no restriction
on the number of scaffolds, linkers, and building blocks.
SUBJECT INDEX
Caco-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Calculations . . . . . . . . . . 29, 60, 64, 105, 122123, 164, 177,
193194, 196, 210, 225, 227, 231, 245,
297298, 302, 304307, 311, 327, 329, 334,
340341
Carbo index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Catalyst . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
CDK2 . . . . . . . . . . . . . . . . . . 18, 232233, 322, 326, 329, 331
Cell-based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Cell-based partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Centroid . . . . . . . . . . . . . . . . . . . . . . . . . . . 7778, 82, 228229
Chem-Diverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Chemical library . . . . . . . . . . . . . . 324, 2728, 48, 111130,
156, 165, 167168, 180, 231, 295317, 337,
340, 347348, 350355
Chemical reactions . . . . . . . . . . . . 29, 31, 165, 167, 179, 188,
254, 270, 272, 301302, 304, 324, 347
Chemical representation . . . . . . . . . . . . . . . . . . . . . . . . . . 2832
Chemical space . . . . . . . . . . . . . 28, 3340, 4345, 48, 54, 62,
102, 106, 115, 136, 156, 167, 170, 220, 236,
242, 254, 257, 271274, 286, 316
Cheminformatics . . . . . . . . . . . . . . . . 112113, 129, 296, 341
Chemistry
combinatorial . . . . . . . . . . 5, 45, 54, 71, 77, 91, 106, 112,
156, 163, 167, 170, 175176, 245, 255, 298,
321, 326, 347
high throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
medicinal . . . . . . . . . . 4, 16, 45, 115, 135136, 286, 333
Chemoinformatics . . . . . . . . . . . . . . . . . . . . . . 2749, 57, 176
Cherry-picking . . . . . . . . . . . . . 94, 112, 225, 298, 302, 306,
309, 314, 325
Chk1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321335
Chromosome . . . . . . . . . . . . . . . . . . . . . . . . 5960, 65, 68, 140
Chronic myelogenous leukemia (CML) . . . . . . . . . . 92, 282
cLipE (calculated lipophilic efficiency) . . . . . . . . . . 205206
cLogD (calculated LogD) . . . . . . . . . . . . . . . . . . . . . 198, 206,
208209
cLogP . . . . . . . . . . . 7, 11, 32, 197, 205207, 225, 236237,
243, 287, 322, 326327, 329
Cluster . . . . . . . . . 3839, 6061, 66, 68, 7374, 7782, 84,
88, 144145, 149, 229, 287, 290, 303, 325
Clustering . . . . . . . . . . . 39, 4344, 60, 66, 68, 88, 137, 160,
229, 232, 284, 286
Collaborations . . . . . . . . . . . . . . . . . . . . . . . . . . . 57, 254, 332
COMBIBUILD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166167
CombiDock . . . . . . . . . . . . . . . . . . . . . . . . . 161, 163, 167168
CombiGlide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166167, 180
CombiLibMaker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159, 180
Combinatorial . . . . . . . . . . . 45, 79, 16, 2223, 3940, 43,
4547, 54, 7188, 91107, 112, 114, 117, 128,
B
Basis Product . . . . . . . . . . . . . . . 166167, 259263, 273, 316
BCUT descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Binary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4, 35, 114, 136, 140,
307, 345
Binding mode annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Bioactivity data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Biological activity . . . . . . . 4, 8, 1718, 40, 9798, 101, 111,
114115, 121, 124
Biologically active compounds . . . . . . . . . 3, 8, 113114, 128
Bond order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Builder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162, 165, 167, 245
Building . . . . . . . . . . . . . 4, 1011, 28, 3841, 44, 47, 57, 59,
6162, 64, 6667, 97101, 106, 112, 137, 156,
179180, 220222, 224230, 245246, 273,
348350, 352, 355356
D
Database . . . . . . . . . . . . . . . . . 5, 32, 112, 118119, 124126,
128129, 137138, 141144, 149, 157,
168170, 177178, 180, 183, 226227, 229,
254255, 260, 262, 280, 284285, 290, 300,
313, 321, 326, 331
Data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28, 3234, 117
tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Daylight . . . . . . . . . . . . 31, 34, 125, 137, 274, 284, 286, 314
fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34, 286
Degrees of freedom . . . . . . . . . . . . . . . . . . . . . . . . . . . 195, 210
De novo design . . . . . . . . . . . . . . . . . . . . . . . 58, 177, 245, 290
Dependent variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9798
Descriptors . . . . . . . . . . . . 28, 3338, 4041, 60, 65, 86, 93,
9798, 101, 103, 106, 113114, 118120,
122125, 129, 139, 207, 227, 273
Design
approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
based library . . . . . . . 137, 155170, 175187, 261, 298,
301302, 316
chemical library . . . . . . . . . . . . . . . 323, 28, 48, 111129
Desktop tool . . . . . . . . . . . . . . . . . . . . . . . . 192, 295317, 321
Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Deterministic annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7577
2D fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . 34, 3839, 128
3D fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Diaminopyrimidine . . . . . . . . . . . . 95, 97, 99100, 102105
Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3738
Dimension reduction . . . . . . . . . . . . . . . . . . . . . 28, 3438, 40
Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . 159, 227, 315, 338
Disassembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Discriminant analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Dissimilarity . . . . . . . . . . . . . . 3740, 65, 142, 145, 147, 149
Distance range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Diverse libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44, 66, 145
Diversity
analysis . . . . . . . . . . . . . . 39, 60, 136, 162, 177178, 226
library . . . . . . . . . . . . . . . . . . 142, 145146, 148149, 176
Diversity oriented synthesis (DOS) . . . . . . . . . . . 5, 7, 1112
DOCK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163, 165, 167, 178
Docking . . . . . . . . . 43–44, 63, 67, 104–105, 125, 155–170,
176–178, 180–183, 187, 193–196, 198,
200–210, 245, 271–272, 280, 283, 303, 306,
316, 326, 329, 331, 333–334
3D pharmacophore . . . . . . . . . . . . . . . . . . 231, 269, 271, 306
DRAGON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
DREAM++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165, 167
Drug discovery . . . . . . . . . . . . 4–5, 8, 28, 32–33, 41–44, 48,
53–54, 58, 71–72, 86, 113, 121, 126–128,
135–136, 155–159, 170, 181, 219, 227, 236,
242, 255, 271, 273, 275, 296, 312–313,
316–317, 333, 347
Drug-likeness . . . . . . . . . . . . . . 42, 45, 56, 66, 111, 348, 353
E
EC50 . . . . . . . . . . . . . . . . . . . . . . . 18, 118, 129, 192, 205–206
EGFR inhibitors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Eigenvalue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35, 78
Electron density map . . . . . . . . . . . . . . . . . . . . . . . . . . 158, 187
Empirical . . . . . . . . . . . . . . . . . . . . 33, 35, 114, 135, 163, 208
Encoding . . . . . . . . . . . . . . . . . . . . 5, 59, 62–63, 66, 124, 138,
200, 302, 314
Enrichment factor . . . . . . . . . . . . . . . . . . . . . . . . 125, 127, 168
Enumeration . . . . . . . . . . . . . 46–47, 72, 102, 162, 164, 167,
177–179, 187–188, 196, 254, 256–258, 261,
263, 272, 296, 298, 300, 314, 325–326,
328–329, 334, 348, 350–353
Enzyme
inhibition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
selectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23, 94
Erlotinib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Euclidean distance . . . . . . . . . . . . . . . . . . . . . . . . . . . 37, 65, 78
Evaluation
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55–56
programming . . . . . . . . . . . . . . . . . . . . 193, 195, 200, 210
Excretion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54, 156, 297
F
Features . . . . . . . . . . . . . . . . . . . . 8, 30, 63, 77, 119, 129, 136,
146, 163, 179, 194, 228–229, 243, 261, 285,
297, 299, 309–311
Filtering . . . . . . . . . . . . . . . . . . . . . 43, 47, 59, 63, 65, 67, 139,
158–160, 162, 176–177, 181, 187, 197, 208,
224, 228–229, 305–307, 309, 325, 327, 329,
331, 337–339, 342–343, 348, 353
integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
G
Gastrointestinal stromal tumor (GIST) . . . . . . . . . . . . . . . 92
Gaussian functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122–123
Gefitinib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Genetic Algorithm . . . . . . . . . . . 56, 120, 137, 140, 144, 316
Gleevec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92, 285
Glide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125, 170, 178, 180
GOLD . . . . . . . . . . . . . . . . . . . . 177–178, 180, 182–184, 186
GPCR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15–17, 45
Graph theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30, 34
GROWMOL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
H
Hamming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
HCV NS5B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181–187
hERG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32, 140, 144
High throughput chemistry . . . . . . . . . . . . . . . . . . . . . . . . 37
High-throughput screening (HTS) . . . . . . . . . . . . . . . 33, 38,
45, 58, 91, 127, 155–156, 170, 175, 194,
219–221, 231, 241–242, 286–290, 298, 317,
322, 326, 330, 332
HIPPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Histone deacetylases (HDACs) . . . . . . . . . . . . . . . . . 117–119
Hit rate . . . . . . . . . . . . . . . . . . . 118, 128, 205, 226, 231–232,
235, 286–287, 289
Homology model . . . . . . . . . . . . . . . . . . . . 157–158, 169, 177
HOOK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
HSITE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
HSP70 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231–232, 234–235
Human rhinovirus 3C protease . . . . . . . . . . . . . . . . . 5, 19–20
Hydrogen bond acceptor (HA) . . . . . . . . . . . . . . . . . . . . . 138
Hydrogen-bond donor (HD) . . . . . . . . . 138, 146, 197, 339
11β-Hydroxysteroid dehydrogenase type 1
(11β-HSD1) . . . . . . . . . . . . . . . . . . . . . . . 191–212
I
IC50 . . . . . . . . . . . . 14, 18, 21, 23, 32, 57, 95, 97, 119–120,
182, 184, 186, 269–270, 280, 285
Imatinib-resistant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Imatinib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92, 282
Independent variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4, 123, 145, 231
Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Informatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296, 306
Inhibitor . . . . . . . . 5, 12, 19–22, 58, 92, 107, 127, 168–170,
182–186, 192, 246–247, 250, 280–283, 285,
288–290, 323, 326
Iressa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
ISIS . . . . . . . . . 262, 297, 299, 301, 304–308, 314, 317, 321
J
JNK3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232–233, 282
K
Kappa opioid receptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Kinase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279–290, 321–334
Kinase chemical cores . . . . . . . . . . . . . . . . . . . . . . . . . 280, 290
Kinase targeted library (KTL) . . . . . . . . . . . . . 280, 283–284,
287–290
K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
k nearest neighbor (kNN) . . . . . . . . . . . . . . . . . . 118–120, 160
Knowledge-based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21, 167
L
LCK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322, 326
Lead hopping . . . . . . . . . . 128, 264, 268, 271, 274, 298, 315
Lead-likeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
LEAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253–274, 298–299
LEAP1 . . . . . . . . . . . . . . . . . . . . 255–258, 263–269, 271–275
LEAP2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255–264, 266–273
Leave-one-out (LOO) . . . . . . . . . . . 100–101, 107, 118–119
LEGEND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Lennard-Jones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Library design . . . . . . 3–24, 27–49, 53–68, 71–88, 91–107,
111–130, 135–150, 155–170, 175–188,
191–212, 219–238, 241–250, 253–275,
279–290, 295–317, 321–335, 337–345,
347–356
Library design strategies . . . . . . . . . . . . . . . . . . 137, 140, 325
Ligand efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238, 242
LigBuilder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
LIGSITE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 93–94, 207
LipE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Lipophilic groups (LIP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
LogD . . . . . . . . . . . . 197–198, 205–209, 303, 326, 329, 331
LogP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
LUDI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
M
MACCS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Markush . . . . . . . . . . . . 46–47, 297–298, 301, 305–306, 314
Markush exemplification . . . . . 46, 297, 301, 305–306, 314
Markush technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
MCSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
MDL . . . . . . . . . . . . . 30, 258, 262–263, 274, 297, 301–302,
304, 306, 309
MDL ISIS/Draw . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304, 306
MedChem . . . . . . . . . . . . . . . . . . . . . . . . . . 138, 141, 226, 286
Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Medicinal chemistry . . . . . . . . . . . . . . . . . . . . . 4, 16, 45, 115,
135–136, 286, 333
MEGALib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59, 61–68
Melanin-concentrating hormone receptor 1
(MCHR1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
N
National Cancer Institute (NCI) . . . . . . . . . . . . . . . 119, 126
Negative charge centre (NEG) . . . . . . . . . . . . . . . . . . . . . . 138
Neural networks . . . . . . . . . . . . . . . . . . . . . . . . 40, 44, 75, 160
NMR . . . . . . . . . 5, 157, 176–177, 187, 219–220, 224–238,
242–243, 245–249
NMR screening . . . 224–225, 227–230, 238, 244, 246–249
Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Non-dominated solution . . . . . . . . . . . . . . . . . . . . . 54–55, 61
Nonoligomeric library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Non-small cell lung cancer (NSCLC) . . . . . . . . . . . . . . . . 92
Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
NSisFragment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57, 64
O
OEChem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57, 341
OptiDock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166–167
Optimization library . . . . . . . . . . . . . . . . . . . . . . . . 19–21, 333
ORIENT++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
P
PAMPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Pareto . . . . . . . . . . . . . . . . . . . . . 42, 54–56, 59–61, 64–65, 67
Pareto-optimal solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Q
Quantitative structure activity relationship (QSAR)
methods . . . . . . . . . . . . . . . . . . . . . . . . . . 28, 113–114, 120
R
Raf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21–24
Random library . . . . . . . . . . . . . . . . . . . . . . 8, 10, 12, 144–145
Rapid Overlay of Compound Structures (ROCS),
122–123, 125–127
REACT++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Reactant . . . . . . 46, 196, 257–260, 272, 297–302, 304–311,
314–315, 327–329, 332–334
Reaction transform . . . . . . . . . . . . . . . . . . . . . . . . . . 5, 31, 259
Reagent selections . . 56, 137, 139–142, 144, 176–178, 338
REALISIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296, 313–314
RECAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57, 61, 272–273
Regression . . . . . . . . . . . . . . . . 93–95, 98–101, 116, 207, 227
Renal cell carcinoma (RCC) . . . . . . . . . . . . . . . . . . . . . 92, 282
Research and development . . . . . . . . . . . . . . . . . . . . . . . . . 155
Review . . . . . . . . . . . . . . . . 28, 112, 115, 117, 126, 136, 157,
159–160, 219, 221, 225–226, 229, 236, 238,
245, 255, 286, 311
Rings . . . . . . . . . . . . . . . . . . . . . 12–13, 15–16, 22, 31–32, 34,
105–106, 147, 165, 182, 227–228, 233–234,
243, 248, 260, 272, 284, 286, 288
Root node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Rule-of-five . . . . . . . 42, 48, 63–64, 303, 310, 326, 331, 340
Rxn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297, 301, 304
S
Scaffold . . . . . . . . . . 13, 15–16, 66, 120, 167, 178, 182–186,
349–352, 355–356
Scaffold hopping . . . . . . . . . . . 125, 127–129, 136–137, 177,
254, 267
Scalable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71–88
Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37, 80, 199, 344
SciTegic . . . . . . . . . . 257, 274–275, 300, 305, 311, 315, 326
Scoring . . . . . 44, 59, 63, 112, 128, 159–161, 163, 170, 176,
180–181, 184, 187–188, 201, 212, 273, 306,
316, 326, 333, 334
Screen . . . . . . . . 8–9, 91, 113, 219–220, 225–226, 228–231,
233, 235–236, 248, 280, 286–287, 289, 303,
306, 322, 324, 329, 331, 347
Screening collection . . . . . . . . . . . . . . . . . . . . . . 219–238, 243
SEARCH++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Search . . . . . . . 144, 200–201, 257, 262–263, 265–266, 268,
271–274, 305, 315
Searching . . . . . . . . . . . . . . . . 42, 45, 126, 194, 271, 311, 314
SeeDs . . . . . . . . . . . . . . . . . . . . . . . . . . 227, 229230, 232, 258
Selection . . . . . 39–40, 47–48, 59–62, 64, 68, 72, 112, 118,
139–141, 307–308, 325–329, 339–340, 354
Selective library design . . . . . . . . . . . . . . . . . . . . . . . . . . . 63–66
Selectivity . . . . . 8–9, 15, 19–24, 54, 57–58, 63–64, 91–107,
280–281, 286–287, 322, 326–330, 332–334
Shannon entropy . . . . . . . . . . . . . . . . . . . . 139, 144–145, 147
SHAPE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126–127
Shape complementarity . . . . . . . . . . . . . . . . . . . . . . . . 121–122
Similarity . . . . . . . . . . . . . . 28, 34, 37–40, 42–45, 47–48, 54,
56–57, 63–65, 67, 103, 112, 115, 121–129, 137,
142–145, 147, 149, 158, 169–170, 183, 194,
254–259, 261–266, 268–269, 271–274, 284,
290, 298, 305–306, 315–316, 325–327, 329,
331, 333, 347
T
Tanimoto . . . . . . . . . . . . 38–39, 63, 103, 128–129, 137, 142,
144–145, 147, 149, 232, 255, 268, 274, 284, 326
Tanimoto coefficient . . . . . . . . . . . . . . 38, 129, 232, 274, 326
Tarceva . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Targeted . . . . . . . . . . . . . . . 8, 11–19, 92, 102, 111–130, 192,
226, 243, 269–270, 279–290, 298–299, 315,
321–335, 343
Targeted library . . . . . . . . 8, 12–14, 19, 112, 129, 192, 226,
269–270, 279–290, 298–299, 321–335
Tautomers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158–159
Techniques . . . . . 4, 8, 16, 21, 54, 57–58, 60–61, 71, 92, 97,
99–101, 106, 114–115, 117, 126, 141, 156–157,
160, 163, 195, 200, 224,
226–227, 230, 236, 242–246
Thiazolone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182–184, 186
Tools . . . . . . . . . . . . 27–28, 33, 40, 46–48, 62, 71, 113–122,
125–126, 137, 156, 159, 167, 176–177, 192,
194, 196–197, 202, 245, 295–317, 321–335,
337–345, 347–356
Topological pharmacophore . . . . . . . . . . . . . . . . . . . . 136–139
Toxicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53–54, 205, 297
Training and test sets . . . . . . . . . . . . . . . . . . . . . 115, 117–118
Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40, 44, 60
Tripos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159, 254, 272–273
Tversky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Tyrosine kinase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92, 170, 282
U
Undesirable functional group . . . . . . . . . . . . . . . . . . . . . . . 227
V
Validation . . . . . . . . 41, 43–44, 98–101, 107, 113–119, 123,
125–126, 129, 138, 141, 176, 178, 187, 255,
263–268, 271, 274, 287–288
Virtual combinatorial library . . . . . . . . . . . . . . . . . . . . 45, 128
Virtual libraries . . . . . . . . . . . . . . . 39–40, 43–44, 46, 56, 94,
101–102, 106, 112–113, 121, 128–129,
166–167, 178–179, 181, 183, 192–193,
196–197, 201–202, 208, 210–212, 225,
253–275, 296, 303, 305, 315, 340–341
Virtual screening . . . . . . . . . . . . . . . 28, 34, 43–44, 112–122,
124–129, 157, 160, 176–177, 180–181,
183–184, 187, 195, 245, 257, 271, 316
VSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
W
Workflow . . . . . . . . . . . . . . . 40–41, 104, 114, 116–119, 127,
137, 255, 283–284, 290, 296–297, 301,
304–307, 309, 316
X
X-ray . . . . . . . . . . . . . . . . . . 94, 104, 106–107, 126–127, 136,
176–178, 181–185, 187, 194, 219, 230, 237,
242, 244–249, 280, 282, 288, 322–324, 329,
331, 335
Z
Zinc metalloprotease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12