A Practical Guide To Scientific Data Analysis


P1: OTE/OTE/SPH P2: OTE

JWBK419-FM JWBK419/Livingstone September 25, 2009 13:8 Printer Name: Yet to Come


A Practical Guide to Scientific Data Analysis David Livingstone


© 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-85153-1

A Practical Guide to
Scientific Data Analysis
David Livingstone
ChemQuest, Sandown, Isle of Wight, UK

A John Wiley and Sons, Ltd., Publication



This edition first published 2009



© 2009 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ,
United Kingdom
For details of our global editorial offices, for customer services and for information about
how to apply for permission to reuse the copyright material in this book please see our
website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in
accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act
1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as
trademarks. All brand names and product names used in this book are trade names, service
marks, trademarks or registered trademarks of their respective owners. The publisher is not
associated with any product or vendor mentioned in this book. This publication is designed
to provide accurate and authoritative information in regard to the subject matter covered. It
is sold on the understanding that the publisher is not engaged in rendering professional
services. If professional advice or other expert assistance is required, the services of a
competent professional should be sought.
The publisher and the author make no representations or warranties with respect to the
accuracy or completeness of the contents of this work and specifically disclaim all warranties,
including without limitation any implied warranties of fitness for a particular purpose. This
work is sold with the understanding that the publisher is not engaged in rendering
professional services. The advice and strategies contained herein may not be suitable for
every situation. In view of ongoing research, equipment modifications, changes in
governmental regulations, and the constant flow of information relating to the use of
experimental reagents, equipment, and devices, the reader is urged to review and evaluate the
information provided in the package insert or instructions for each chemical, piece of
equipment, reagent, or device for, among other things, any changes in the instructions or
indication of usage and for added warnings and precautions. The fact that an organization or
Website is referred to in this work as a citation and/or a potential source of further
information does not mean that the author or the publisher endorses the information the
organization or Website may provide or recommendations it may make. Further, readers
should be aware that Internet Websites listed in this work may have changed or disappeared
between when this work was written and when it is read. No warranty may be created or
extended by any promotional statements for this work. Neither the publisher nor the author
shall be liable for any damages arising herefrom.
Library of Congress Cataloging-in-Publication Data
Livingstone, D. (David)
A practical guide to scientific data analysis / David Livingstone.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-85153-1 (cloth : alk. paper)
1. QSAR (Biochemistry) – Statistical methods. 2. Biochemistry – Statistical methods.
I. Title.
QP517.S85L554 2009
615′.1900727–dc22
2009025910
A catalogue record for this book is available from the British Library.
ISBN 978-0-470-85153-1
Typeset in 10.5/13pt Sabon by Aptara Inc., New Delhi, India.
Printed and bound in Great Britain by TJ International, Padstow, Cornwall

This book is dedicated to the memory of my
first wife, Cherry (18/5/52–1/8/05), who
inspired me, encouraged me and helped me
in everything I’ve done, and to the memory
of Rifleman Jamie Gunn (4/8/87–25/2/09),
whom we both loved very much and who
was killed in action in Helmand
province, Afghanistan.

Contents

Preface xi
Abbreviations xiii

1 Introduction: Data and Its Properties, Analytical Methods
and Jargon 1
1.1 Introduction 2
1.2 Types of Data 3
1.3 Sources of Data 5
1.3.1 Dependent Data 5
1.3.2 Independent Data 6
1.4 The Nature of Data 7
1.4.1 Types of Data and Scales of Measurement 8
1.4.2 Data Distribution 10
1.4.3 Deviations in Distribution 15
1.5 Analytical Methods 19
1.6 Summary 23
References 23

2 Experimental Design – Experiment and Set Selection 25
2.1 What is Experimental Design? 25
2.2 Experimental Design Techniques 27
2.2.1 Single-factor Design Methods 31
2.2.2 Factorial Design (Multiple-factor Design) 33
2.2.3 D-optimal Design 38
2.3 Strategies for Compound Selection 40
2.4 High Throughput Experiments 51
2.5 Summary 53
References 54

3 Data Pre-treatment and Variable Selection 57
3.1 Introduction 57
3.2 Data Distribution 58
3.3 Scaling 60
3.4 Correlations 62
3.5 Data Reduction 63
3.6 Variable Selection 67
3.7 Summary 72
References 73

4 Data Display 75
4.1 Introduction 75
4.2 Linear Methods 77
4.3 Nonlinear Methods 94
4.3.1 Nonlinear Mapping 94
4.3.2 Self-organizing Map 105
4.4 Faces, Flowerplots and Friends 110
4.5 Summary 113
References 116

5 Unsupervised Learning 119
5.1 Introduction 119
5.2 Nearest-neighbour Methods 120
5.3 Factor Analysis 125
5.4 Cluster Analysis 135
5.5 Cluster Significance Analysis 140
5.6 Summary 143
References 144

6 Regression Analysis 145
6.1 Introduction 145
6.2 Simple Linear Regression 146
6.3 Multiple Linear Regression 154
6.3.1 Creating Multiple Regression Models 159
6.3.1.1 Forward Inclusion 159
6.3.1.2 Backward Elimination 161
6.3.1.3 Stepwise Regression 163
6.3.1.4 All Subsets 164
6.3.1.5 Model Selection by Genetic Algorithm 165
6.3.2 Nonlinear Regression Models 167
6.3.3 Regression with Indicator Variables 169
6.4 Multiple Regression: Robustness, Chance Effects,
the Comparison of Models and Selection Bias 174
6.4.1 Robustness (Cross-validation) 174
6.4.2 Chance Effects 177
6.4.3 Comparison of Regression Models 178
6.4.4 Selection Bias 180
6.5 Summary 183
References 184

7 Supervised Learning 187
7.1 Introduction 187
7.2 Discriminant Techniques 188
7.2.1 Discriminant Analysis 188
7.2.2 SIMCA 195
7.2.3 Confusion Matrices 198
7.2.4 Conditions and Cautions for
Discriminant Analysis 201
7.3 Regression on Principal Components and PLS 202
7.3.1 Regression on Principal Components 203
7.3.2 Partial Least Squares 206
7.3.3 Continuum Regression 211
7.4 Feature Selection 214
7.5 Summary 216
References 217

8 Multivariate Dependent Data 219
8.1 Introduction 219
8.2 Principal Components and Factor Analysis 221
8.3 Cluster Analysis 230
8.4 Spectral Map Analysis 233
8.5 Models with Multivariate Dependent and
Independent Data 238
8.6 Summary 246
References 247

9 Artificial Intelligence and Friends 249
9.1 Introduction 250
9.2 Expert Systems 251
9.2.1 LogP Prediction 252
9.2.2 Toxicity Prediction 261
9.2.3 Reaction and Structure Prediction 268
9.3 Neural Networks 273
9.3.1 Data Display Using ANN 277
9.3.2 Data Analysis Using ANN 280
9.3.3 Building ANN Models 287
9.3.4 Interrogating ANN Models 292
9.4 Miscellaneous AI Techniques 295
9.5 Genetic Methods 301
9.6 Consensus Models 303
9.7 Summary 304
References 305

10 Molecular Design 309
10.1 The Need for Molecular Design 309
10.2 What is QSAR/QSPR? 310
10.3 Why Look for Quantitative Relationships? 321
10.4 Modelling Chemistry 323
10.5 Molecular Fields and Surfaces 325
10.6 Mixtures 327
10.7 Summary 329
References 330

Index 333

Preface

The idea for this book came in part from teaching quantitative drug
design to B.Sc. and M.Sc. students at the Universities of Sussex and
Portsmouth. I have also needed to describe a number of mathematical
and statistical methods to my friends and colleagues in medicinal
(and physical) chemistry, biochemistry, and pharmacology departments
at Wellcome Research and SmithKline Beecham Pharmaceuticals. I have
looked for a textbook which I could recommend which gives practical
guidance in the use and interpretation of the apparently esoteric methods
of multivariate statistics, otherwise known as pattern recognition. I
would have found such a book useful when I was learning the trade, and
so this is intended to be that sort of guide.
There are, of course, many fine textbooks of statistics and these are
referred to as appropriate for further reading. However, I feel that there
isn’t a book which gives a practical guide for scientists to the processes of
data analysis. The emphasis here is on the application of the techniques
and the interpretation of their results, although a certain amount of
theory is required in order to explain the methods. This is not intended
to be a statistical textbook (indeed, an elementary knowledge of statistics
is assumed of the reader) but is meant to be a statistical companion for
the novice or casual user.
It is necessary here to consider the type of research which these methods
may be used for. Historically, techniques for building models to
relate biological properties to chemical structure have been developed in
pharmaceutical and agrochemical research. Many of the examples used
in this text are derived from these fields of work. There is no reason,
however, why any sort of property which depends on chemical structure
should not be modelled in this way. This might be termed quantitative
structure–property relationships (QSPR) rather than QSAR, where
A stands for activity. Such models are beginning to be reported; recent
examples include applications in the design of dyestuffs, cosmetics,
egg-white substitutes, artificial sweeteners, cheese-making, and prepared
food products. I have tried to incorporate some of these applications
to illustrate the methods, as well as the more traditional examples of
QSAR.
There are also many other areas of science which can benefit from the
application of statistical and mathematical methods to an examination
of their data, particularly multivariate techniques. I hope that scientists
from these other disciplines will be able to see how such approaches can
be of use in their own work.
The chapters are ordered in a logical sequence, the sequence in which
data analysis might be carried out – from planning an experiment
through examining and displaying the data to constructing quantitative
models. However, each chapter is intended to stand alone so that
casual users can refer to the section that is most appropriate to their
problem. The one exception to this is the Introduction which explains
many of the terms which are used later in the book. Finally, I have included
definitions and descriptions of some of the chemical properties
and biological terms used in panels separated from the rest of the text.
Thus, a reader who is already familiar with such concepts should be able
to read the book without undue interruption.

David Livingstone
Sandown, Isle of Wight
May 2009

Abbreviations

π hydrophobicity substituent constant
σ electronic substituent constant
alk hydrogen-bonding capability parameter
H enthalpy
AI artificial intelligence
ANN artificial neural networks
ANOVA analysis of variance
BPN back-propagation neural network
CA cluster analysis
CAMEO Computer Assisted Mechanistic Evaluation of Organic
reactions
CASE Computer Assisted Structure Evaluation
CCA canonical correlation analysis
CoMFA Comparative Molecular Field Analysis
CONCORD CONnection table to CoORDinates
CR continuum regression
CSA cluster significance analysis
DEREK Deductive Estimation of Risk from Existing Knowledge
ED50 dose to give 50 % effect
ESDL10 electrophilic superdelocalizability
ESS explained sum of squares
FA factor analysis
FOSSIL Frame Orientated System for Spectroscopic Inductive
Learning
GABA γ-aminobutyric acid
GC-MS gas chromatography-mass spectrometry
HOMO highest occupied molecular orbital
HPLC high-performance liquid chromatography
HTS high throughput screening
I50 concentration for 50 % inhibition
IC50 concentration for 50 % inhibition
ID3 iterative dichotomizer three
IR infrared
Km Michaelis–Menten constant
KNN k-nearest neighbour technique
LC50 concentration for 50 % lethal effect
LD50 dose for 50 % death
LDA linear discriminant analysis
LLM linear learning machine
logP logarithm of a partition coefficient
LOO leave one out at a time
LV latent variable
m.p. melting point
MAO monoamine oxidase
MIC minimum inhibitory concentration
MLR multiple linear regression
mol.wt. molecular weight
MR molar refractivity
MSD mean squared distance
MSE explained mean square
MSR residual mean square
MTC minimum threshold concentration
NLM nonlinear mapping
NMR nuclear magnetic resonance
NOA natural orange aroma
NTP National Toxicology Program
OLS ordinary least squares
PC principal component
PCA principal component analysis
PCR principal component regression
p.d.f. probability density function
pI50 negative log of the concentration for 50 % inhibition
PLS partial least squares
PRESS predicted residual sum of squares
QDA quantitative descriptive analysis
QSAR quantitative structure-activity relationship
QSPR quantitative structure-property relationship
R² squared multiple correlation coefficient
ReNDeR Reversible Non-linear Dimension Reduction
RMSEP root mean square error of prediction
RSS residual or unexplained sum of squares
SE standard error
SAR structure-activity relationships
SIMCA see footnote p. 195
SMA spectral map analysis
SMILES Simplified Molecular Input Line Entry System
SOM self-organizing map
TD50 dose for 50 % toxic effect
TOPKAT Toxicity Prediction by Komputer Assisted Technology
TS taboo search
TSD total squared distance
TSS total sum of squares
UFS unsupervised forward selection
UHTS ultra high throughput screening
UV ultraviolet spectrophotometry
Vm Van der Waals’ volume
