
Frank Emmert-Streib • Salissou Moutari • Matthias Dehmer

Elements of Data Science, Machine Learning, and Artificial Intelligence Using R

Frank Emmert-Streib
Tampere University
Tampere, Finland

Salissou Moutari
Queen's University Belfast
Belfast, UK

Matthias Dehmer
Swiss Distance University of Applied Science
Brig, Switzerland
Tyrolean Private University UMIT TIROL
Hall in Tyrol, Austria

ISBN 978-3-031-13338-1
ISBN 978-3-031-13339-8 (eBook)
https://doi.org/10.1007/978-3-031-13339-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.


We dedicate the book to our families:
Moureen
Shaymae, Sarah, and Siham
Miriana
Preface

The digitalization of all areas of science, industry, and society has led to an
unprecedented flood of data. However, after the initial enthusiasm for the anticipated
wealth of information, in most cases this information remains deeply buried inside
the data and needs to be uncovered. This requires analysis of the data, which
is usually nontrivial and often challenging. All these developments led to the
establishment of the field of data science.
The data science field combines methods and approaches from machine learning,
artificial intelligence, and statistics. This makes it inherently interdisciplinary, as
it leverages scientific approaches to derive valuable insights from data. A key to
the success of data science is that it centers an analysis around data. This allows
one to move away from making theoretical assumptions, thus directing the analysis
of a problem toward data-driven approaches. Consequently, this often requires
nonparametric approaches that rely on computational implementations. In general,
data science encompasses a strong computational component, enabling one to put
theoretical concepts into practice. For this reason, this book provides many examples
of such implementations using the statistical programming language R.
Our motivation for writing this book arose out of our experience over many
years. From teaching, supervising, and conducting scientific and industrial research,
we realized that many students and scientists are struggling to understand the
underlying concepts of methods from data science, which were derived from a
variety of fields, including machine learning, artificial intelligence, and statistics.
For this reason, we present in this book the basics, core methods, and advanced
methods with an emphasis on understanding the corresponding concepts. That
means we are not aiming for comprehensive coverage of all existing methods; rather,
we provide selected topics from data science to foster a thorough understanding
of the subject. Building on these, deeper insights into all aspects of data science
can be reached. This will provide a springboard for mastering advanced methods.
Furthermore, we combine this with computational realizations of analysis methods
using the widely used programming language R.
This book is intended for graduate students and advanced undergraduate students
in the interdisciplinary field of data science with a major in computer science,
information technology, or engineering. The book is organized into three main parts.
Part I: General Topics; Part II: Core Methods; and Part III: Advanced Topics. Each
chapter contains the theoretical basics and many practical examples that can be
practiced side by side. This way, one can put the learned theory into a practical
application and gain a profound conceptual understanding over time.
During the preparation of this book, many colleagues provided us with input,
help, and support. In particular, we would like to thank Zengqiang Chen, Amer
Farea, Markus Geuss, Galina Glazko, Tobias Häberlein, Andreas Holzinger, Arno
Homburg, Bo Hu, Oliver Ittig, Joni-Kristian Kämäräinen, Juho Kanniainen, Urs-
Martin Künzi, Abbe Mowshowitz, Aliyu Musa, Rainer Schubert, Yongtang Shi, Jin
Tao, Martin Welk, Chengyi Xia, Olli Yli-Harja, and Jusen Zhang, and we apologize to
anyone whom we may have mistakenly failed to name. For proofreading and help with various
chapters, we would like to express our special thanks to Shailesh Tripathi, Kalifa
Manjang, Tanvi Sharma, Nadeesha Perera, Zhen Yang, and many students from
the course Computational Diagnostics of Data (DATA.ML.390). We would also
like to thank our editors Mary James, Zoe Kennedy, Vinodhini Srinivasan, Sanjana
Sundaram, and Brian Halm from Springer, who have always been available and
helpful.
Finally, we hope this book helps to spread the enthusiasm and joy we have for
this field, and inspires students and scientists in their studies and research.

Tampere, Finland    Frank Emmert-Streib
Belfast, UK    Salissou Moutari
Brig, Switzerland    Matthias Dehmer
August 2023
Contents

1 Introduction to Learning from Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


1.1 What Is Data Science? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Converting Data into Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Big Aims: Big Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Generating Insights by Visualization . . . . . . . . . . . . . . . . . . . 7
1.3 Structure of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 Part III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Our Motivation for Writing This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 How to Use This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

Part I General Topics


2 General Prediction Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Categorization of Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Properties of the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Properties of the Optimization Algorithm . . . . . . . . . . . . . . 19
2.2.3 Properties of the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Overview of Prediction Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Causal Model versus Predictive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Explainable AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Fundamental Statistical Characteristics of Prediction Models . . . . 23
2.6.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


3 General Error Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Fundamental Error Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Error Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 True-Positive Rate and True-Negative Rate . . . . . . . . . . . . 33
3.4.2 Positive Predictive Value and Negative Predictive
Value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.3 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.4 F-Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.5 False Discovery Rate and False Omission Rate . . . . . . . . 36
3.4.6 False-Negative Rate and False-Positive Rate . . . . . . . . . . . 36
3.4.7 Matthews Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.8 Cohen’s Kappa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.9 Normalized Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.10 Area Under the Receiver Operator Characteristic
Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Evaluation of Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.1 Evaluation of an Individual Method . . . . . . . . . . . . . . . . . . . . 45
3.5.2 Comparing Multiple Binary Decision-Making
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Resampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Resampling Methods for Error Estimation . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 Holdout Set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.2 Leave-One-Out CV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 K-Fold Cross-Validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Extended Resampling Methods for Error Estimation . . . . . . . . . . . . . 58
4.3.1 Repeated Holdout Set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.2 Repeated K-Fold CV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.3 Stratified K-Fold CV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.1 Resampling With versus Resampling Without
Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Different Types of Prediction Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.7 Sampling from a Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.8 Standard Error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.1 Genomic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.2 Network Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.3 Text Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.4 Time-to-Event Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.5 Business Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Part II Core Methods


6 Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1 Exploratory Data Analysis and Descriptive Statistics . . . . . . . . . . . . . 92
6.1.1 Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.1.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.1.3 Summary Statistics and Presentation of Information . . 93
6.1.4 Measures of Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.1.5 Measures of Scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.1.6 Measures of Shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1.7 Data Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.1.8 Example: Summary of Data and EDA . . . . . . . . . . . . . . . . . . 102
6.2 Sample Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2.2 Unbiased Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2.3 Biased Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2.4 Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3.1 Conjugate Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.3.2 Continuous Parameter Estimation. . . . . . . . . . . . . . . . . . . . . . . 112
6.3.3 Discrete Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.4 Bayesian Credible Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.3.5 Prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3.6 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.4 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4.1 Asymptotic Confidence Intervals for MLE . . . . . . . . . . . . . 124
6.4.2 Bootstrap Confidence Intervals for MLE . . . . . . . . . . . . . . . 127
6.4.3 Meaning of Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . 128
6.5 Expectation-Maximization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.5.1 Example: EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

7 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.2 What Is Clustering? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.3 Comparison of Data Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.3.1 Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.3.2 Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.4 Basic Principle of Clustering Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.5 Non-hierarchical Clustering Methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.5.1 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.5.2 K-Medoids Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.5.3 Partitioning Around Medoids (PAM) . . . . . . . . . . . . . . . . . . . 148
7.6 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.6.1 Dendrograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.6.2 Two Types of Dissimilarity Measures . . . . . . . . . . . . . . . . . . 150
7.6.3 Linkage Functions for Agglomerative Clustering . . . . . . 151
7.6.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.7 Defining Feature Vectors for General Objects . . . . . . . . . . . . . . . . . . . . . 153
7.8 Cluster Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.8.1 External Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.8.2 Assessing the Numerical Values of Indices. . . . . . . . . . . . . 158
7.8.3 Internal Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8 Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.2.1 An Overview of PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.2.2 Geometrical Interpretation of PCA . . . . . . . . . . . . . . . . . . . . . 165
8.2.3 PCA Procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.2.4 Underlying Mathematical Problems in PCA . . . . . . . . . . . 167
8.2.5 PCA Using Singular Value Decomposition . . . . . . . . . . . . 168
8.2.6 Assessing PCA Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.2.7 Illustration of PCA Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.2.8 Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.2.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.2.10 Non-negative Matrix Factorization . . . . . . . . . . . . . . . . . . . . . 179
8.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.3.1 Filter Methods Using Mutual Information . . . . . . . . . . . . . 186
8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.2 What Is Classification? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.3 Common Aspects of Classification Methods . . . . . . . . . . . . . . . . . . . . . . 192

9.3.1 Basic Idea of a Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192


9.3.2 Training and Test Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.3.3 Error Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
9.4 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.4.1 Educational Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.4.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
9.5 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
9.5.1 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
9.6 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.7 k-Nearest Neighbor Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.8 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.8.1 Linearly Separable Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.8.2 Nonlinearly Separable Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.8.3 Nonlinear Support Vector Machines . . . . . . . . . . . . . . . . . . . . 219
9.8.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
9.9 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
9.9.1 What Is a Decision Tree? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
9.9.2 Step 1: Growing a Decision Tree. . . . . . . . . . . . . . . . . . . . . . . . 227
9.9.3 Step 2: Assessing the Size of a Decision Tree . . . . . . . . . . 230
9.9.4 Step 3: Pruning a Decision Tree. . . . . . . . . . . . . . . . . . . . . . . . . 234
9.9.5 Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
10 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
10.2 What Is Hypothesis Testing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
10.3 Key Components of Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
10.3.1 Step 1: Select Test Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
10.3.2 Step 2: Null Hypothesis H0 and Alternative
Hypothesis H1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
10.3.3 Step 3: Sampling Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
10.3.4 Step 4: Significance Level α . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.3.5 Step 5: Evaluate the Test Statistic from Data . . . . . . . . . . . 248
10.3.6 Step 6: Determine the p-Value . . . . . . . . . . . . . . . . . . . . . . . . . . 248
10.3.7 Step 7: Make a Decision about the Null Hypothesis . . . 249
10.4 Type 2 Error and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
10.4.1 Connections between Power and Errors . . . . . . . . . . . . . . . . 252
10.5 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
10.5.1 Confidence Intervals for a Population Mean with
Known Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
10.5.2 Confidence Intervals for a Population Mean with
Unknown Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
10.5.3 Bootstrap Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . 255

10.6 Important Hypothesis Tests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256


10.6.1 Student’s t-Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
10.6.2 Correlation Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
10.6.3 Hypergeometric Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
10.6.4 Finding the Correct Hypothesis Test . . . . . . . . . . . . . . . . . . . . 264
10.7 Permutation Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
10.8 Understanding versus Applying Hypothesis Tests . . . . . . . . . . . . . . . . 268
10.9 Historical Notes and Misinterpretations . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
10.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11 Linear Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
11.1.1 What Is Linear Regression? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
11.1.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
11.2 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
11.2.1 Ordinary Least Squares Estimation of Coefficients . . . . 277
11.2.2 Variability of the Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
11.2.3 Testing the Necessity of Coefficients . . . . . . . . . . . . . . . . . . . 280
11.2.4 Assessing the Quality of a Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
11.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
11.4 Multiple Linear Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
11.4.1 Testing the Necessity of Coefficients . . . . . . . . . . . . . . . . . . . 284
11.4.2 Assessing the Quality of a Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
11.5 Diagnosing Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
11.5.1 Error Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
11.5.2 Linearity Assumption of the Model . . . . . . . . . . . . . . . . . . . . . 288
11.5.3 Leverage Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
11.5.4 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
11.5.5 Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
11.5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
11.6 Advanced Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
11.6.1 Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
11.6.2 Nonlinearities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
11.6.3 Categorical Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
11.6.4 Generalized Linear Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
11.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
11.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
12 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
12.2 Difference Between Model Selection and Model Assessment. . . . 309
12.3 General Approach to Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
12.4 Model Selection for Multiple Linear Regression Models . . . . . . . . . 312
12.4.1 R² and Adjusted R² . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
12.4.2 Mallow’s Cp Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313

12.4.3 Akaike’s Information Criterion (AIC) and
Schwarz’s BIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
12.4.4 Best Subset Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
12.4.5 Stepwise Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
12.5 Model Selection for Generalized Linear Models . . . . . . . . . . . . . . . . . . 318
12.5.1 Negative Binomial Regression Model . . . . . . . . . . . . . . . . . . 318
12.5.2 Zero-Inflated Poisson Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
12.5.3 Quasi-Poisson Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
12.5.4 Comparison of GLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
12.6 Model Selection for Bayesian Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
12.7 Nonparametric Model Selection for General Models with
Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
12.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
12.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330

Part III Advanced Topics


13 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
13.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
13.2.1 Preprocessing and Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
13.2.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
13.2.3 R Packages for Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 336
13.3 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
13.3.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
13.4 Non-negative Garrote Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
13.5 LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
13.5.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
13.5.2 Explanation of Variable Selection. . . . . . . . . . . . . . . . . . . . . . . 341
13.5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
13.5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
13.6 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
13.7 Dantzig Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
13.8 Adaptive LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
13.8.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
13.9 Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
13.9.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
13.9.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
13.10 Group LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
13.10.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
13.10.2 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
13.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
13.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
13.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357

14 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359


14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
14.2 Architectures of Classical Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 360
14.2.1 Mathematical Model of an Artificial Neuron . . . . . . . . . . . 360
14.2.2 Feedforward Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 362
14.2.3 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
14.2.4 Overview of General Network Architectures . . . . . . . . . . . 364
14.3 Deep Feedforward Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
14.3.1 Example: Deep Feedforward Neural Networks . . . . . . . . 366
14.4 Convolutional Neural Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
14.4.1 Basic Components of a CNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
14.4.2 Important Variants of CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
14.4.3 Example: CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
14.5 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
14.5.1 Pre-training Phase: Unsupervised . . . . . . . . . . . . . . . . . . . . . . . 385
14.5.2 Fine-Tuning Phase: Supervised . . . . . . . . . . . . . . . . . . . . . . . . . 389
14.6 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
14.6.1 Example: Denoising and Variational Autoencoder . . . . . 392
14.7 Long Short-Term Memory Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
14.7.1 LSTM Network Structure with Forget Gate . . . . . . . . . . . . 401
14.7.2 Peephole LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
14.7.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
14.7.4 Example: LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
14.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
14.8.1 General Characteristics of Deep Learning . . . . . . . . . . . . . . 416
14.8.2 Explainable AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
14.8.3 Big Data versus Small Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
14.8.4 Advanced Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
14.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
14.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
15 Multiple Testing Corrections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
15.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
15.2.1 Formal Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
15.2.2 Simulations Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
15.2.3 Focus on Pairwise Correlations . . . . . . . . . . . . . . . . . . . . . . . . . 425
15.2.4 Focus on a Network Correlation Structure . . . . . . . . . . . . . 426
15.2.5 Application of Multiple Testing Procedures . . . . . . . . . . . . 426
15.3 Motivation of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
15.3.1 Theoretical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
15.3.2 Experimental Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
15.4 Types of Multiple Testing Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
15.4.1 Single-Step versus Stepwise Approaches . . . . . . . . . . . . . . . 430
15.4.2 Adaptive versus Nonadaptive Approaches . . . . . . . . . . . . . 433

15.4.3 Marginal versus Joint Multiple Testing Procedures . . . . 433


15.5 Controlling the FWER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
15.5.1 Šidák Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
15.5.2 Bonferroni Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
15.5.3 Holm Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
15.5.4 Hochberg Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
15.5.5 Hommel Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
15.5.6 Westfall-Young Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
15.6 Controlling the FDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
15.6.1 Benjamini-Hochberg Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 444
15.6.2 Adaptive Benjamini-Hochberg Procedure . . . . . . . . . . . . . . 445
15.6.3 Benjamini-Yekutieli Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 447
15.6.4 Benjamini-Krieger-Yekutieli Procedure . . . . . . . . . . . . . . . . 448
15.6.5 Blanchard-Roquain Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 449
15.7 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
15.8 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
15.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
15.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
16 Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
16.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
16.2.1 Effect of Chemotherapy: Breast Cancer Patients . . . . . . . 456
16.2.2 Effect of Medication: Agitation . . . . . . . . . . . . . . . . . . . . . . . . . 457
16.3 Censoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
16.4 General Characteristics of a Survival Function . . . . . . . . . . . . . . . . . . . . 459
16.5 Nonparametric Estimator for the Survival Function . . . . . . . . . . . . . . 460
16.5.1 Kaplan-Meier Estimator for the Survival Function . . . . 460
16.5.2 Nelson-Aalen Estimator for the Survival Function. . . . . 461
16.6 Comparison of Two Survival Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
16.6.1 Log-Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
16.7 Hazard Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
16.7.1 Weibull Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
16.7.2 Exponential Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
16.7.3 Log-Logistic Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
16.7.4 Log-Normal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
16.7.5 Interpretation of Hazard Functions. . . . . . . . . . . . . . . . . . . . . . 468
16.8 Cox Proportional Hazard Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
16.8.1 Why Is the Model Called a Proportional Hazard
Model? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
16.8.2 Interpretation of General Hazard Ratios . . . . . . . . . . . . . . . . 472
16.8.3 Adjusted Survival Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
16.8.4 Testing the Proportional Hazard Assumption . . . . . . . . . . 473
16.8.5 Parameter Estimation of the CPHM via Maximum
Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476

16.9 Stratified Cox Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479


16.9.1 Testing No-Interaction Assumption . . . . . . . . . . . . . . . . . . . . . 479
16.9.2 Case of Many Covariates Violating the
PH Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
16.10 Survival Analysis Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
16.10.1 Comparison of Survival Curves . . . . . . . . . . . . . . . . . . . . . . . . . 481
16.10.2 Analyzing a Cox Proportional Hazard Model . . . . . . . . . . 483
16.10.3 Testing the PH Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
16.10.4 Hazard Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
16.11 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
16.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
16.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
17 Foundations of Learning from Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
17.2 Computational and Statistical Learning Theory . . . . . . . . . . . . . . . . . . . 490
17.2.1 Probabilistic Learnability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
17.2.2 Probably Approximately Correct (PAC) Learning . . . . . 492
17.2.3 Vapnik-Chervonenkis (VC) Theory . . . . . . . . . . . . . . . . . . . . . 500
17.3 Importance of Bias for Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
17.4 Learning as Optimization Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
17.4.1 Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
17.4.2 Structural Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
17.5 Fundamental Theorem of Statistical Learning. . . . . . . . . . . . . . . . . . . . . 506
17.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
17.7 Modern Machine Learning Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
17.7.1 Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
17.7.2 One-Class Classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
17.7.3 Positive-Unlabeled Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
17.7.4 Few/One-Shot Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
17.7.5 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
17.7.6 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
17.7.7 Multi-Label Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
17.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
17.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
18 Generalization Error and Model Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
18.2 Overall View of Model Diagnosis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
18.3 Expected Generalization Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
18.4 Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
18.5 Error-Complexity Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
18.5.1 Example: Linear Polynomial Regression Model . . . . . . . 531
18.5.2 Example: Error-Complexity Curves . . . . . . . . . . . . . . . . . . . . 533
18.5.3 Interpretation of Error-Complexity Curves . . . . . . . . . . . . . 535

18.6 Learning Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537


18.6.1 Example: Learning Curves for Linear Polynomial
Regression Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
18.6.2 Interpretation of Learning Curves . . . . . . . . . . . . . . . . . . . . . . . 539
18.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
18.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
18.9 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
18.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Chapter 1
Introduction to Learning from Data

We are living in a data-rich era, where every field of science or industry sector
generates data in a seemingly effortless manner [160, 394]. To emphasize the
importance of this, data have been called the “oil of the twenty-first century” [232].
To deal with this flood of data, a new field has been established called data science
[83, 150, 228]. Data science combines the skill sets and expert knowledge of many
different fields, including machine learning, artificial intelligence, statistics, and
pattern recognition [150, 220, 483].
The availability of data provides new opportunities in all fields and industries
to gain new information and tackle difficult problems. However, data alone do not
provide information; first, they need to be analyzed to unlock the answers to the
questions buried within them. This is what we call learning from data. To grasp
the importance of learning from data, let’s look at three examples from genomics,
finance, and internet applications.
Traditionally, biology has not been a field one would associate with technology.
However, in the last 30 years, significant advances in experimental techniques have
been made, allowing the easy and affordable generation of different types of data
concerning various aspects of biological cells. The most prominent example is
probably the sequencing of DNA. The collection of these different data types is
summarized under the term “omics” or sometimes “genomics data.” Importantly,
genomics technology is not only used in research, but also in hospitals to generate
patient data. Such data can be used for diagnostic, prognostic, and therapeutic
applications with a direct benefit for patients. A second source of big data is
the finance world. Nowadays, there are myriad financial markets, including stock
exchanges, that provide temporal information about the market value of companies
on even a subminute scale. This information can be used by investors to select
an optimal portfolio that is resilient to economic crises. A third example of mass
data can be found in internet applications, such as online shopping or social
networking. Such applications utilize the internet, which became available to the
public in the 1990s, to place orders or exchange information about all aspects of our
private and professional lives.
In this book, we discuss the basic tools that enable data scientists to learn from
data. As we will see, it takes a journey to gain an understanding of the different
methods and approaches and become proficient.

1.1 What Is Data Science?

From time to time, new scientific fields emerge to adapt to the changing world.
Examples of newly established academic disciplines include economics (the first
professorship in economics was established at the University of Cambridge in 1890
and was held by Alfred Marshall [330]); computer science (the first department of
computer science in the United States was established at Purdue University in 1962,
whereas the term “computer science” appeared first in [164]); bioinformatics (the
term was first used in [249] in 1978); and, most recently, data science [99, 321, 394].
The first appearance of the term “data science” is ascribed to Peter Naur in
1974 [358], but it took almost 30 years before there was a significant push for
the establishment of an independent discipline with this name [83]. Since then,
the Research Center for Dataology and Data Science was established at Fudan
University in Shanghai, China, in 2007, and Harvard Business Review even called
data scientist “the sexiest job of the twenty-first century” [381].
A natural question to ask is, “What is data science?” In [146], a data-driven sci-
entometrics analysis of this question is presented by studying publication statistics
data provided by Google Scholar. As a result, the top 20 most influential fields were
all found to contribute significantly to data science. Those fields include machine
learning, artificial intelligence, data mining, and statistics.
An important conclusion of this analysis is that data science is not a monolithic
field. Instead, it consists of many different approaches and concepts that have their
origins in entirely different fields and communities; e.g., machine learning, artificial
intelligence, or statistics. From the perspective of a learner, this is unfortunate
because the learning process will not be monolithic but undulating. However, due to
its inclusive nature — that is, covering methods irrespective of the field of origin —
in our opinion, data science provides the most comprehensive toolbox for analyzing
data.
Regardless of its inter- and multidisciplinary nature, data science consists of the
following five major components (see Fig. 1.1):
• machine learning
• artificial intelligence
• statistics
• mathematics
• programming

Fig. 1.1 Data science is composed of five major components, each of which makes a unique contribution. The figure shows the components machine learning, artificial intelligence, statistics, mathematics, and programming together forming data science.

The first three components, machine learning, artificial intelligence,
and statistics, provide all the methods used in data science to analyze data.
Representative methods are, for example, support vector machines (SVMs), neural
networks (NNs), and generalized linear models (GLMs). Each of these methods
is based on mathematics, which provides the fundamental methodology for
formulating such methods. Finally, programming connects everything together.
It is important to realize that programming is a glue skill that (1) enables the
practical application of methods from machine learning, artificial intelligence, and
statistics; (2) allows the combination of different methods from different fields; and
(3) provides practical means for developing novel computer-based methods (using,
for example, resampling methods or Monte Carlo simulations). For clarity reasons,
we want to emphasize that when we speak about “programming skills” we mean
scientific and statistical programming rather than general-purpose programming
skills. All of these points are of great importance for data science. In some sense,
mathematics and programming form the root of data science (see Fig. 1.1), whereas
machine learning, artificial intelligence, and statistics provide the methodological
realizations, thus establishing the “roofing.”
It is this multicomponent nature of data science that explains why the field is
usually taught at a graduate level, whereas mathematics and programming are taught
at the undergraduate level. Learning the basics of mathematics and programming
requires a considerable amount of time if one wants to attain a higher level of
proficiency. Furthermore, it is clear that one book alone can neither cover all relevant
topics of these five components nor introduce them in sufficient detail as needed for
the beginner. For this reason, we refer to our introductory textbook for learning
basics in mathematics and programming [153].

Regarding useful programming languages, R and Python are very popular
today. However, while both provide similar capabilities, there are differences in
certain situations. In this book, we prefer R over Python due to its statistical origin.
In fact, R was developed to provide a “statistical programming language.” We will
see the benefits of this when discussing hypothesis testing (Chap. 10), resampling
methods (Chap. 4), and linear regression (Chap. 11), where R provides excellent
functionalities.
Although this book does not provide an introduction to programming and math-
ematics (this is provided in [153]), it presents examples in R for the methods from
machine learning, artificial intelligence, and statistics. A pedagogical side
effect of this presentation is enhanced computational thinking. This has a profound
influence on one’s analytical problem-solving capabilities because it enables one to
think in guided ways, which can be computationally realized in a practical manner,
rather than coming up with ungrounded proposals that are intractable.

1.2 Converting Data into Knowledge

In data science, our main aim is to learn about a phenomenon that underlies
available data. In general terms, this means that we want to extract reliable
information from the data through the application of appropriate analysis methods.
In the following, we discuss this in more detail.

1.2.1 Big Aims: Big Questions

To learn about a given problem, we need to answer important relevant questions by
interrogating related data in a way that allows us to enhance our current knowledge
about the problem. Ideally, we would like to ask “big questions,” in the sense
that their answers would put us into a position to solve the problem entirely. The
following are some examples of such big questions:
• What is a cure for breast cancer?
• What will be the stock price of Alphabet (parent company of Google) on the 27th
of May 2057?
• What products will a customer order from Amazon next time?
• What is the meaning of life?
It is easy to see that an answer to any of the preceding questions would have a huge
impact — on different levels. In the first case, you would certainly be awarded the
Nobel Prize in Medicine and Physiology, whereas in the second and third cases, you
could become rich. Finally, in the fourth case, you might not earn scientific merits
but would probably make a lot of people happy because the search would be over.

From the preceding examples, there are two important lessons to be learned.
First, there are usually no analysis methods available that could provide a direct
answer to any of the preceding questions. Even worse, the results that can be
obtained from current analysis methods are usually not even close to being answers
for “big questions.” The reason for this is that data science methods work differently,
as we will see. Second, questions of the fourth type are out of the scope of this book
as we are only dealing with questions that can be addressed by the analysis of data.
The second lesson seems trivial; however, there are related cases that are not
so easy to spot in practice because they come disguised. For instance, a couple of
years ago we were analyzing proteomics data from SELDI-TOF (surface-enhanced
laser desorption/ionization time-of-flight) experiments (providing information about
proteins), and to our surprise we were not able to detect anything by any analysis
method. Curiously, also, independent analysis attempts by several other teams
confirmed our negative findings. Later, we found out that the experiments conducted
were corrupted, which confirmed that the data did not contain any meaningful
information.
The first lesson is not trivial either, but its underlying argument is different. To
understand this, we show in Fig. 1.2 an overview of six principal categories of
analysis methods. For each of these categories, we provide information about the
data type such an analysis is based on, the question the method addresses, and some
examples of methods, which are discussed in later chapters of this book. To simplify
the discussion, we present only simplified versions of, for example, the data types
that can be handled, without affecting the following discussion. It is important to
note that for each method category, the question addressed is of a simple nature
compared to any of the three “big questions” just mentioned. The reason for this
is not that we intentionally selected only method categories that address such
simple questions, but that these are characteristics of all data analysis methods.
Furthermore, it is important to emphasize that the questions that can be addressed
by the six principal method categories are representative for the categories as a
whole and not just for particular methods of a category. However, this means that
any “big question” needs to somehow be related back to such “simple questions”
for which analysis methods are available. A further consequence of this is that to
study “big questions” one needs not only one but several different methods applied
sequentially. Hence, studying “big questions” requires a data analysis process
where a multitude of methods are applied in a sequential way [150].
In Fig. 1.3, we visualize the steps needed to reformulate a “big question” in
order to arrive at a question that can be addressed by means of data science. The
reformulation of the question requires expert knowledge and statistical thinking
because essential elements of the problem need to be conserved, whereas unimpor-
tant elements can be neglected. For the reformulated question, there may be applicable
methods available. This requires computational thinking to select and implement the
appropriate one. Potentially, this method needs to be adapted to fit the reformulated
question optimally. This requires mathematical thinking to redesign the algorithm.
From the results of this analysis, we can then draw conclusions back to the original

Fig. 1.2 The six principal method categories discussed in this book allow us to address a large variety of application-specific questions. For each of the categories (clustering, hypothesis testing, linear regression, deep learning, classification, and survival analysis), the figure lists the required data type, the question addressed, and example methods.

“big question.” In general, in science, the steps along the orange triangle are
iteratively repeated, establishing what is called scientific discovery.
In summary, to analyze data one needs to learn how to (re)formulate questions
so that the data can be analyzed in a problem-oriented manner. The key for this
step is having a thorough understanding of the principal analysis categories and
their methods. This requires statistical thinking for the reformulation of the question
itself, the analysis of the data, and the conclusions that can be drawn thereof.
Another point of caution to mention is that simple does not mean trivial. This
is visualized in Fig. 1.4, where we show the three main components for analyzing
data (data, methods, and results) and the corresponding topics one needs to address
in order to specify a data analysis project. This involves preprocessing, modeling,
analysis, and design. The numbers in brackets indicate the chapters, in this book,
that discuss the respective subjects. For the beginner, it may be interesting to see

Fig. 1.3 To study a “big question” amenable to an analysis method from data science, one needs to reformulate it. This requires expert knowledge, statistical thinking, computational thinking, and mathematical thinking. The figure traces the path from a big question to reformulated question(s), the selection and adaptation of an analysis method, the analysis itself, and the conclusions drawn from the analysis results back to the original question.

the iteration arc from results to data. This means even a simple data analysis does
not consist of running the analysis just once, but many times — for example,
for estimating learning curves (discussed in Chap. 18). Overall, to specify a data
analysis, one needs to address each aspect of the categories shown in Fig. 1.4.
From an educational perspective, the preceding described complexity of data
science projects poses challenges because one cannot address all issues at the same
time. Instead, it is easier to learn the elements of data science step-by-step to gain
a thorough understanding of its components. In this book, we will also follow this
approach.

1.2.2 Generating Insights by Visualization

It is important to note that in addition to the six quantitative method categories listed
in Fig. 1.2, there is one further principal method category that is of a qualitative
nature. This category comprises visual exploration methods. The conceptual idea
behind this was introduced in the 1950s by John Tukey, who advocated widely the
idea of data visualization as a means to generate novel hypotheses about data. In the
statistics community, such an approach is called exploratory data analysis (EDA).
In general, EDA uses data visualization techniques, such as box plots, scatter plots,
and so forth, as well as summary statistics, like the mean or the variance, to get either
an overview of the characteristics of the data or to generate a new understanding. For

Fig. 1.4 The three main components for analyzing data (data, methods, and results) and corresponding topics one needs to address (non-exhaustive list) in order to specify a data analysis project: preprocessing of the data (dimension reduction (8), feature selection (8)); modeling (parameter estimation (6), clustering (7), classification (9), hypothesis testing (10), linear regression (11), deep learning (14), survival analysis (16)); analysis of the results (error measures (3), model assessment (18), multiple testing corrections (15)); and the overall design (resampling methods (4), model selection (12), regularization (13), generalization error (18)). An iteration arc leads from the results back to the data.

instance, a first step in formulating a question that can be addressed by a quantitative
analysis method (see Fig. 1.3) would consist of visualizing the data.
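As a small illustration of such a first exploratory look, the following R snippet computes summary statistics and draws two common EDA plots. The built-in iris data set is used here purely as an illustrative stand-in for one's own data; the chosen variables are not taken from any later chapter.

    data(iris)

    # Summary statistics (minimum, quartiles, mean, maximum) for every variable
    summary(iris)

    # Box plot of a numerical variable grouped by a categorical variable
    boxplot(Sepal.Length ~ Species, data = iris, ylab = "Sepal length")

    # Scatter plot of two numerical variables to inspect a possible dependency
    plot(iris$Petal.Length, iris$Petal.Width,
         xlab = "Petal length", ylab = "Petal width")

Plots and summaries of this kind often suggest which of the quantitative method categories in Fig. 1.2 is appropriate for the subsequent analysis.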
Overall, this means that there are seven principal method categories of data
science. In Chap. 6, we discuss a variety of different visualization methods that can
be utilized for the purpose of doing an exploratory data analysis.

1.3 Structure of the Book

In the following, we discuss the structure and the content of this book. Overall, the
book is structured into three main parts.

1.3.1 Part I

In Part I, we start with a discussion of general prediction models and their catego-
rizations as well as the difference between inferential models and predictive models.

In Chap. 3, we discuss general error measures one can use for supervised learning
problems, which are later discussed in Part II of the book. In Chap. 4, we introduce
resampling methods, such as cross-validation, and show how they are used for error
estimation. Furthermore, we discuss related topics; for example, subsampling and
sampling from a distribution. This chapter shows that error estimates of learning
models are random variables that require estimates of their variability, such as in
the form of the standard error. The last chapter in Part I is about different data types
frequently encountered in data science. This chapter is important because methods
do not operate in isolation but are always applied to data. Hence, data are of course
an important part of data science, and a sufficient understanding of them is required.
In this chapter, we provide five examples of different data types (genomic data,
network data, text data, time-to-event data, and business data) to show that data
structures can be complex.

1.3.2 Part II

In Part II, we present core analysis methods. Specifically, we discuss statistical
inference, clustering, dimension reduction, classification, hypothesis testing, linear
regression, and model selection. Each of these chapters presents the methods side-
by-side with examples that use the statistical programming language R.
Part II starts with a chapter on statistical inference. First, we present an
overview of descriptive statistics and exploratory data analysis (EDA). As briefly
mentioned earlier, EDA advocates visualizing data as a means to generate insights
about the underlying problem [242, 475]. That means the visualization of data
is very important because it represents a form of data analysis. Further topics
discussed in this chapter are Bayesian inference, maximum likelihood estimation,
and the expectation-maximization (EM) algorithm. In Chap. 7, we discuss clustering
methods, which can be used when only unlabeled data are available. Also, Chap. 8
presents unsupervised learning methods but for use in dimension reduction. In
contrast, in Chap. 9, we discuss supervised learning methods for classification. We
present a variety of approaches for classification, including naive Bayes classifier,
logistic regression, and decision tree. In Chap. 10, we introduce hypothesis testing
and a number of important hypothesis tests that find frequent application in practice;
for example, Student’s t-test or Fisher’s exact test. Chap. 11 presents supervised
learning methods again; however, unlike in Chap. 9, the focus is on a continuous output
instead of a categorical one. Data of such a type can be studied using regression
methods. Finally, Part II ends with a chapter about model selection.
The order of the chapters in Part II follows a largely logical order such as
would be used when conducting a data science project and also considers the
dependency between the methods. For instance, for linear regression models one
can use a hypothesis test to check whether a regression coefficient vanishes. Hence,
hypothesis testing needs to be presented first in order to understand this. Another
example is the topic of model selection, which aims to choose the best model

among a family of prediction models, such as regression models. Hence, regression
is presented before model selection.

1.3.3 Part III

In Part III, we discuss advanced topics of data science. Specifically, we introduce
regularization, deep learning, multiple testing corrections, and survival analysis
models. Furthermore, we discuss theoretical foundations of learning from data
and the generalization error. All of these topics and methods provide powerful
approaches for conducting complex data science projects in a sound way. It is
important to emphasize that many of these methods are not independent but provide
extensions of methods discussed in Part II. For instance, regularization (Chap. 13)
builds upon linear regression models (Chap. 11), and multiple testing corrections
(Chap. 15) extend statistical hypothesis testing (Chap. 10). Also, survival analysis
(Chap. 16) is not a stand-alone subject but contains elements of linear regression
models (Chap. 11) and statistical hypothesis testing (Chap. 10).
The interdependency of the chapters also reflects a general characteristic of
data science. This means that selecting a particular method for an analysis usually
requires a multitude of “other” considerations beyond the analysis method itself.
Hence, data science projects are typically complex, resisting a cookbook-style
presentation.
Finally, we would like to remark that each chapter ends with a brief summary.
The summary also contains a learning outcome box that highlights the most
important lesson from a chapter. In general, a chapter will contain many learning
outcomes; however, by highlighting one, we want to encourage the reader to reflect
about its content from a bird’s-eye view. This is important because to understand
any topic, one needs to be able to switch between the technical aspects offered by a
method and the general perspective it provides. This mental switching allows one to
avoid getting lost in either the nitty-gritty or a generic outlook.

1.4 Our Motivation for Writing This Book

Conducting research projects in data science for science or industry typically
requires the application of appropriate methods. This involves model selection,
model assessment, and problem visualization. For each subtask, decisions need
to be made, such as what to study and how to study. Taken together, this makes
data science projects complex, requiring a multitude of skills and knowledge from a
variety of areas. To succeed, one needs a detailed understanding of all those methods
and approaches that is far beyond “cookbook thinking.” The goal of this book is
to provide a data science toolbox with the most important methods from machine
learning, artificial intelligence, and statistics that can be used to analyze data.

We provide detailed information about such methods on an abstraction level that
balances a mathematical understanding with an application-oriented one. Overall,
we aim not only to provide information about analysis methods but also to advocate
statistical thinking. The latter allows the learner to advance from individual methods
to complex data science projects.
From our experience in supervising data science projects, we learned that
there are three key components for learning how to analyze data: (1) a thorough
understanding of methods and data; (2) an easy way to apply a method to data;
and (3) iterative modification of (1) and (2) to study a problem. For (1), sufficient
explanations of the methods and the data are needed to have a good idea of how a
method works and what the data mean. For (2), computational implementations of
methods are needed because essentially all data science projects require a computer.
In this book, we use the statistical programming language R because it has a long
history in the statistics community and is increasingly dominating machine learning
and artificial intelligence. For (3), only your creativity is the limit, which makes data
science an art. More formally, statistical thinking is needed to connect all parts with
each other.
From the preceding discussion, it follows that for beginners, books offering
comprehensive coverage may not be very beneficial, because such presentations
usually lack a thorough discussion of methods, neglect computational discussions
of their applications, and do not guide the reader toward statistical thinking. Instead,
such books are great reference books for experts. Also, books that focus only on
methods from particular fields, such as from statistics or machine learning, have
a limited utility in learning data science, because for an optimal analysis the most
appropriate methods need to be used, regardless of which field introduced them.
Our book is intended for the beginner trying to learn the basics of data science.
Hence, we emphasize the interplay between mathematical thinking, computational
thinking, and statistical thinking, as shown in Fig. 1.5, throughout the book when
discussing the different topics. We favor basics over a comprehensive presentation
because once the basics are learned and understood, all advanced methods can

Fig. 1.5 Data science requires proficiency in mathematical thinking, computational thinking, and statistical thinking. Only by integrating those skills and knowledge can one obtain the optimal results when working on a data science project.

be self-learned. Finally, we present important methods from essentially any field,
including machine learning, artificial intelligence, and statistics, because the selec-
tion of methods needs to be made eclectically to obtain the best results.

1.5 How to Use This Book

Our textbook can be used in many different areas related to data science, including
computer science, information technology, and statistics. The following list gives
some suggestions of different courses at the (advanced) undergraduate and graduate
levels for which selected chapters of our book could be used:
• Artificial intelligence
• Big data
• Bioinformatics
• Business analytics
• Computational biology
• Computational finance
• Computational social science
• Data analytics
• Data mining
• Data science
• Deep learning
• Machine learning
• Natural language processing
• Statistical data analytics
The target audience of the book is graduate students in computer science,
information technology, and statistics, but it can also be used for advanced under-
graduate courses in related fields. For students lacking a thorough mathematical
understanding of basic topics, including probability, analysis, or linear algebra, we
recommend Mathematical Foundations of Data Science Using R [153] as an intro-
ductory textbook. This textbook also provides an introduction to the programming
language R on the level required for the following chapters.
The order of the chapters in this book follows a natural progression of the
difficulty level of the topics (see Fig. 1.6). Hence, the beginner can follow the
chapters in the presented order to gain a progressive understanding of data science.
However, we would like to note that there is no one size that fits all, and the same
is true for learning data science. For instance, the topics of Chap. 18 about the
generalization error can be seen either as a conceptual roof for all the previous
chapters or as a conceptual root. The difference is the order of the presentation. We
decided to present this topic at the end of the book because in our experience many
students ask, during the learning process, “What is it good for?” This question is
naturally answered when presenting first practical applications of the generalization
error. However, other students may prefer to first obtain theoretical insights before

Fig. 1.6 Interconnectedness of the chapters in this book, from Chap. 2 (General prediction models) to Chap. 18 (Generalization error and model assessment). The chapters in Parts I, II, and III are shown in blue, purple, and red, respectively, while the links correspond to major dependencies among the topics. This indicates that data science forms a complex network of methods, concepts, and algorithms.

knowing how to realize them practically. In this case, the last chapter should be
read at an earlier stage. Due to the fact that all chapters have their own major focus,
the intermediate and advanced reader can choose an individual reading order for a
personalized learning experience.

1.6 Summary

During the course of this book, we will see that the topics defining data science are
very interconnected. In Fig. 1.6, we show a visualization of this interconnectedness.
The shown links correspond only to the major dependencies among the topics; how-
ever, many more connections exist. As one can see, there are forward connections
(shown in orange) and backward connections (shown in green) among the topics.
Overall, this highlights that data science forms a complex network of interconnected
topics.
Learning Outcome 1: Data Science

Data science forms a complex network of methods, concepts, and algorithms,
which defies a linear learning experience.

To achieve the best learning outcome, we advise the reader to work through the
book multiple times and in different orders; for an example of these orders, see
Fig. 1.6. Metaphorically, one can see data science as a language rather than a single
method that needs to be wrapped creatively around a data-based problem in order to
communicate efficiently with the information buried within the data.
Part I
General Topics
Chapter 2
General Prediction Models

2.1 Introduction

This chapter provides a general overview of prediction models. We present three
different categorizations commonly used to organize the various methods in data
science. This will show that there is more than one way to look at prediction models,
and that no one is superior to the others. In addition, we present our own pragmatic
organization of methods that we will use for the following chapters, which is formed
by a mixture of application-based and model-based categories. Finally, we discuss a
fundamental statistical characteristic that holds for every prediction model. We will
see that every output of a prediction model is a random variable. In later chapters,
we will utilize this in several ways, such as when discussing resampling methods
(see Chap. 4) or the expected generalization error (see Chap. 18).

2.2 Categorization of Methods

Data science is an interdisciplinary field. That means its methods have not been
developed by one research community, but many. In the previous chapter, we
mentioned that data science’s major contributions come from machine learning,
artificial intelligence, statistics, and pattern recognition. A consequence of this
is that there is no common criterion that can be used to categorize methods,
such as which method has been used to derive learning algorithms. Instead, there
are considerable differences among the methods in data science, and there is no
universal concept from which all methods are derived. For this reason, depending
on the perspective, various categorizations of methods are possible.
In Fig. 2.1, we show an overview of three such categorizations. The three
columns in this figure emphasize different aspects of methods and their properties.
Specifically, the first column emphasizes properties of the data; the second column,


Fig. 2.1 Categorization of methods from data science. The three columns emphasize different aspects of the methods and their properties: properties of the data (unsupervised, supervised, and semi-supervised learning), properties of the optimization algorithm (probability-based, error-based, similarity-based, and information-based), and properties of the model (regression, kernel, instance-based, ensemble, and structural).

properties of the optimization algorithm; and the third column, properties of the
model itself. No one perspective is superior to the others, but rather, each is valid in
its own right. In the following sections, we discuss each of the three main categories
briefly.

2.2.1 Properties of the Data

This category emphasizes a data perspective. A data perspective is particularly
practical because it allows the selection of a method based on the properties of the
available data. For instance, if data for the variables X (input) are available without
response variable(s) Y (output), methods for unsupervised learning need to be used.
In contrast, if data for the variables X are available in combination with real-valued
response variable(s) Y , then supervised methods, such as for a regression model,
can be used.
In general, one can distinguish between three major types of data, as defined in
Fig. 2.2. Because each data type contains a different type of information, different
learning paradigms have been developed to deal with such data. The corresponding
learning paradigms are called unsupervised learning, supervised learning, and semi-
supervised learning.
For supervised learning, we distinguish between two different data characteris-
tics (A and B); see Fig. 2.2. This indicates that there are further subcategories of the
learning paradigm. For Data A, one uses a regression model to estimate the effect
of explanatory variables (X, also called predictors) on the dependent variable(s) (Y).
Another example of supervised learning is reinforcement learning [457].
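To make this distinction concrete, the following R sketch assembles the two basic data situations from the built-in iris data; the choice of the data set, the variables, and the object names is purely illustrative.

    data(iris)

    # Unsupervised setting: feature vectors X only, without a response variable
    X <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

    # Supervised setting for classification: features X together with a class label y
    y_class <- iris$Species

    # Supervised setting for regression: features together with a real-valued response
    X_reg  <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]
    y_real <- iris$Petal.Width

Dropping the response from a supervised data set always yields data suitable for unsupervised learning, whereas the reverse step requires additional label information.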
We want to note that methods from classical statistics, such as for parameter
estimation or statistical hypothesis testing, are based on the assumption of the form

Fig. 2.2 Properties of the data. Shown are the three major learning paradigms, namely, unsupervised learning, supervised learning, and semi-supervised learning, as well as the characteristics of the data types on which they are based. For supervised learning, the figure distinguishes between data with a real-valued response (A, regression) and data with a class label (B, classification). The questions addressed are for illustration purposes only, and other questions are possible.

of data used for unsupervised learning. Despite this fact, such methods are usually
not designated as unsupervised learning methods. However, this is merely due to the
jargon used by different research communities.
The third data characteristic we distinguish in Fig. 2.2 is called semi-supervised
learning. Semi-supervised learning is a mixture of data for supervised learning and
unsupervised learning that provides labeled and unlabeled data at the same time. It is
easy to imagine that the analysis of such data requires other methods than those used
for supervised learning and unsupervised learning. For completeness, we would like
to add that the terms “unsupervised,” “supervised,” and “semi-supervised” learning
have their origins in machine learning.
Finally, we would like to add that there are more advanced learning paradigms,
such as transfer learning or one-shot learning. These are derived from the basic data
characteristics in Fig. 2.2, however, providing additional structure. In Chap. 17, we
will discuss these learning paradigms in detail.

2.2.2 Properties of the Optimization Algorithm

Unlike using the properties of the data to categorize methods, utilizing properties of
the optimization algorithms provides a categorization from a theoretical perspective.

For instance, a Naive Bayes classifier estimates probability distributions and iden-
tifies the maximum posterior for making predictions, whereas k-Nearest Neighbor
assesses the similarity of a feature vector to a set of reference vectors for assigning
a prediction based on voting. Examples of error-based optimization methods are
neural networks or regression models. In contrast, a decision tree or random forests
are examples of information-based methods.
These are just four examples that demonstrate that the internal working mech-
anisms of the different methods from data science can be quite different from
each other from a theoretical perspective, making it impossible to find a common
denominator. As discussed, this is another indicator of the interdisciplinarity of
the field because different areas follow different approaches. Categorizing methods
according to the properties of optimization algorithms is more of a theoretical than
a practical interest because it does not necessarily imply specific application areas
where a method could be used.

2.2.3 Properties of the Model

Finally, categorizing prediction methods according to the properties of the model is
similar to doing so based on the properties of optimization algorithms, but it assumes
the perspective of the model structure itself rather than the means of optimization.
Hence, it provides a functional point of view on the working mechanism of a model
as a whole.
For instance, support vector machines (SVMs) are kernel-based methods that
project feature vectors into a high-dimensional space where, for example, the
classification is performed. In contrast, deep neural networks learn the structure of
a network to, first, re-represent the feature vectors in a new space (representation
learning) and, second, to classify them. Both properties (kernel and structural)
provide a metaphorical visualization of the working principles of the methods.
Similarly, regression — for instance, multiple linear regression — emphasizes that
an input is mapped onto an output, and instance-based learning — for example,
k-Nearest Neighbor — does not store an internal model but rather investigates
each new feature vector (instance) by performing a comparison with a set of
reference vectors. Finally, ensemble methods, such as random forests, utilize the
same base classifier multiple times and achieve the final decision by combining the
classification results from the individual trees (many trees make a forest).

2.2.4 Summary

From the preceding discussion, it follows that there is no grand underlying principle
unifying all prediction methods that would enable their objective categorizations.
Instead, many perspectives are possible, and each has its own merits. Hence,
every categorization used is neither absolute nor unique but instead presents a

certain perspective. In the following, we present yet another perspective on the
categorization of prediction models, which we follow in this book.

2.3 Overview of Prediction Models

The number of different methods provided by data science is vast, so listing
them next to one another does not allow for an easy overview, nor do the three
categorizations presented in the previous section. Instead, in Fig. 2.3, we show such

Fig. 2.3 Overview of prediction models where the main categories form a mixture of application-based and model-based groups. This allows a more intuitive overview of prediction models. The categories shown are parameter estimation, clustering, classification, hypothesis testing, regression, deep learning, and survival analysis, each with example methods.

an overview based on a mixture of application-based and model-based groups. This
implies that there are cross-connections between the categories. For instance, an
autoencoder is a deep learning model, which can be used for clustering problems,
and logistic regression is a generalized linear model (GLM), which can be used for
classification problems. Similarly, survival analysis is a special case of a GLM.
We used the preceding categories to structure the presentation of the models
discussed in this book because, in our opinion, this provides the most intuitive
overview. Also, it allows us to emphasize the underlying systematics — for example,
of generalized linear models or hypothesis tests — from which follow many models
as special cases of a larger category.
For completeness, we would like to mention that there are advanced prediction
models we did not include in the preceding overview. For instance, ensemble
methods like boosting utilize many “weak” classifiers and combine them in a way
that improves the individual classifiers. Other advanced methods include adversarial
networks [205]. These are deep neural networks, and the idea behind this method is
to let two neural networks compete with each other in a game-like setting. Overall,
this establishes a generative model that can be used for producing new data in a
simulation-like framework. Further methods exist for causal inference and network
inference, which aim to reveal structural information among features or variables.
Finally, there are methods for the modeling of data. For instance, graphical mod-
els are methods for the representation of high-dimensional probability distributions
[272, 290]. Importantly, these methods are probabilistic (and not statistical) models,
which aim, first of all, to describe a problem rather than to make predictions. This
distinction is especially important because from this it follows that graphical models
are not data-driven. For this reason, they do not provide core methods for data
science.

2.4 Causal Model versus Predictive Model

We would like to note that there are two further model categories originating from
the statistics community [61] that are commonly distinguished. The first model type
is called a causal model (also known as an inferential or explanatory model). Such
a model provides a causal explanation of the data generation process. The second
model type is called a predictive model. The purpose of such models is to make
forecasts for unseen instances (data points), such as by performing a classification
or regression [437].
Certainly, an inferential model is more informative than a predictive model
because an explanatory model can also be used to make predictions, but a predictive
model does not provide (causal) explanations for such predictions. An example of an
inferential model is a causal Bayesian network, whereas a decision tree (discussed in
Chap. 9) is a prediction model. Due to the complementary capabilities of predictive
and causal models, they coexist next to each other, and each is useful in its own
right.

In this book, our focus is on predictive models because the beginner needs to start
with intuitive and simpler models before learning advanced models. Causal models
are advanced models that require a solid foundation with prediction models in order
to fully appreciate their functionality. Besides this, in many practical situations, it is
either impossible to estimate a causal model (for instance, due to data limitations)
or unclear how to estimate such a model in a sound way. Hence, despite the obvious
theoretical advantages of a causal model over a predictive model, practically, the
former is not attainable for every problem.

2.5 Explainable AI

Very recently, another model type emerged that is commonly summarized under
the term explainable AI (XAI) [17, 154]. The goal of XAI is to generate human-
understandable models because applications in medicine, politics, and finance
require the conveyance of the working mechanism of a model to stakeholders.
Interestingly, every causal model is also an explainable model; however, the reverse
is not true. So, this type of model is situated between a causal model and a predictive
model. A practical example of an explainable model is a decision tree.
Considering the goal of XAI, one realizes that this type of model comes with a
certain subjectivity because different humans can exhibit a different understanding.
Another complication comes from the fact that there are several closely related
concepts to “explainability,” which, however, add further subtleties [17]. Such
concepts are as follows:
1. Understandability
2. Comprehensibility
3. Interpretability
Research about XAI is currently at the very beginning, and at present many
approaches are being tested, such as SHAP (SHapley Additive exPlanations) [323],
but so far there is no generally accepted gold standard. At the moment, it is even
unclear if it is always possible to substitute a black-box model with an explainable
model or if there are possible approximations that provide a compromise between a
“good prediction” and a “good explanation.”

2.6 Fundamental Statistical Characteristics of Prediction Models

In Chap. 1, we emphasized on several occasions the importance of statistical
thinking. This refers not only to methods from statistics but also to methods from
machine learning and artificial intelligence. In this section, we want to elaborate on
this by showing that the output of any prediction method is a random variable.

Fig. 2.4 Fundamental characteristics of prediction models. Top: An experiment (Ex) is repeated m times, leading to m different data sets. Each data set is analyzed with the same method (M), leading to m different results. Bottom: The histogram (left) shows a summary of an error measure (E), such as accuracy, F-score, or any other error score, while the probability distribution fR(M, D) (right) is obtained in the limit m → ∞, that is, for an infinite number of experiments.

In Fig. 2.4 (top), we show our setup by outlining the general framework used
when analyzing data. This framework consists of the following three components.
First, an experiment (Ex) is conducted, leading to the generation of data (D). Then
the data are analyzed using a method (M), leading to results (R). If one has just
one data set, say D1 , one obtains just one result, R1 . Here, R1 can correspond to
an error measure (E); for example, accuracy, F-score, or any other error score (for
a discussion of general error measures, see Chap. 3). For the following discussion,
the specific measure is not crucial; we only need to decide which one of the above
(accuracy or F-score) we want to use. Obviously, it is possible to repeat the same
experiment a second time, giving us data set D2 and result R2 . What would we
expect from such repeated experiments?
Repeating an experiment m times allows us to obtain a histogram of the values of
the error measure. From such a histogram (see Fig. 2.4 (bottom) for an example), one
can already see a distribution emerging from the repeated experiments. If one goes
to the limit m → ∞, conducting an infinite number of experiments, this histogram
actually becomes a probability distribution for our error measure. If we call the
resulting probability distribution fR (M, D), one can write the following:

E ∼ fR (M, D) (2.1)

This means that the observed results (R) of a prediction method (M) for data
(D) lead to a particular value of the error measure (E). In Eq. 2.1 ‘∼’ means the
value of E is drawn (or sampled) from the distribution fR (M, D) (the sampling
from a probability distribution is discussed in detail in Chap. 4), providing an
abstract formulation for the visualization in Fig. 2.4. In statistics, a variable with
this property is called a random variable. Importantly, this implies that the output
of any prediction model E is a random variable.
One may wonder where this randomness comes from. From statistics, we know
that when we have a random variable as the input of a deterministic function, we get
another random variable as output. Specifically, given x drawn from distribution fx ,
i.e., x ∼ fx , and a deterministic function g given by

g : x → y, (2.2)

which maps x onto y, then y is also a random variable.


Given the distribution fR (M, D), one can ask which of the two input variables,
that is, M and D, of the function is the random variable? Since the method (M)
is fixed, it must be the data (D). In fact, for a given data set D = {x1 , . . . , xn } with
n samples, each data point xi is drawn from a distribution determined by the
experiment (Ex), and it is given by

xi ∼ fD (Ex). (2.3)

That means the randomness in the measurement of each data point xi translates into
the randomness of the error measure E. This translation of randomness between the
data and an error measure is fundamental, and it applies to any prediction model.
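The following R sketch illustrates this translation of randomness under simple, illustrative assumptions: the experiment draws n normally distributed data points, the method is the sample mean used as an estimator, and the error measure is the squared deviation of the estimate from the true mean; all numerical values are chosen for illustration only.

    set.seed(1)                # illustrative seed for reproducibility
    m <- 1000                  # number of repeated experiments
    n <- 50                    # sample size of each data set

    E <- numeric(m)
    for (i in 1:m) {
      D        <- rnorm(n, mean = 0.5, sd = 0.1)  # experiment: generate data set D_i
      estimate <- mean(D)                         # method: estimate the mean
      E[i]     <- (estimate - 0.5)^2              # error measure of this repetition
    }

    # Histogram approximating the distribution fR(M, D) of the error measure
    hist(E, breaks = 30, xlab = "Error measure", main = "")

Although the method is fixed, the histogram shows a spread of values of the error measure that originates entirely from the randomness of the data.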
An important consequence of the preceding discussion is that now that we are
aware that the output of any prediction model is a random variable, with associated
underlying (but usually unknown) probability distribution fR (M, D), we need to
interpret the results accordingly. This affects profoundly the way we conduct a data
analysis. In Chap. 4, we return to this issue when discussing resampling methods.

2.6.1 Example

To gain a practical understanding of this theoretical result, let’s study a numerical
example of a prediction model to show that its output is a random variable.
Listing 2.1 shows such an example for a t-test (discussed in detail in Chap. 10) as a
prediction model.
The first part of Listing 2.1 defines our experiment. Here, we generate normally
distributed data with a mean of μ = 0.4 and a standard deviation of σ = 0.1. That
means xi ∼ N (μ, σ ) for i ∈ {1, . . . , n}, where n is the sample size. As a prediction
model, we want to use a t-test for testing the following hypothesis:

Null hypothesis: The mean value of the population is 0.5: μ = 0.5.
Alternative hypothesis: The mean value of the population is not 0.5: μ ≠ 0.5.
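A minimal R sketch of such an analysis is given below; the random seed and the sample size are illustrative assumptions, so the resulting p-values will differ from the specific values reported for Listing 2.1.

    set.seed(123)                          # illustrative seed
    n <- 10                                # illustrative sample size

    D1 <- rnorm(n, mean = 0.4, sd = 0.1)   # first data set from the experiment
    D2 <- rnorm(n, mean = 0.4, sd = 0.1)   # second data set from the same experiment

    # One-sample t-test of H0: mu = 0.5 against the two-sided alternative
    p1 <- t.test(D1, mu = 0.5)$p.value
    p2 <- t.test(D2, mu = 0.5)$p.value
    c(p1, p2)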

As one can see from Listing 2.1, the p-values resulting from the application of a
t-test to D1 and D2 are p1 = 0.0084 and p2 = 0.1607, respectively. First, we notice
that p1 and p2 are not identical. Second, we notice that both p-values are not even
close, because p2 is almost 20 times larger than p1 . Third, using a significance level
to make a decision about the statistical significance of the results (discussed in detail
in Chap. 10), we find that for α = 0.05, p1 is statistically significant, whereas p2
is not. Hence, both p-values result in different decisions because we need to reject
the null hypothesis based on α = 0.05 and p1 = 0.0084, whereas for α = 0.05 and
p2 = 0.1607, we cannot reject the null hypothesis.
Overall, from the preceding numerical example, one can see not only different
numerical values but also different decisions that follow from declaring significance.
To understand the severity of this, we would like to mention that hypothesis tests
are frequently used to study the effect of medications (with the help of survival
analysis discussed in Chap. 16). In such a context, the preceding results could be
interpreted as “the medication has an effect,” corresponding to rejecting the null
hypothesis, or “the medication has no effect,” corresponding to not rejecting the
null hypothesis. Apparently, both statements are opposing each other, and hence
both cannot be correct. Instead, one must be wrong. Again considering the fact that
for both decisions the same method has been used, but different data (from the same
experiment) was used, the need for treating the output of a prediction model as a
random variable is reinforced.

2.7 Summary

In this chapter, we presented different views on prediction models. We have seen that
there are different perspectives, and each is beneficial in its own right; however, none
are perfect or complete. This reflects the interdisciplinary nature of data science and
the lack of one underlying and guiding theoretical principle. Interestingly, this is in
contrast with physics, which is based on the Hamiltonian theory that allows one to
derive fundamental laws in mechanics, electrodynamics, and quantum mechanics.
One important difference between statistical models is whether they are a causal
model or a predictive model. While a causal model is superior from a theoretical
point of view (it provides an explanation and predictions), it is problematic from a
practical point of view. One reason for the latter is that in practice, we start from
data, and this requires us to learn a causal model from data. However, this turned
out to be very difficult and is not even feasible in all circumstances, such as when
there are limitations in the available data [366, 383, 450]. Hence, from a practical
perspective as well as from the perspective of a beginner, predictive models are the
first step when learning data science.
Furthermore, we discussed a numerical example that showed that the outcome
of a prediction model is a random variable. We are aware that we may have used
various terms and methods a beginner may not be familiar with. For this reason,
we suggest the reader return to the setting outlined in Fig. 2.4 after reading about
the corresponding topics in the following chapters and to repeat a similar analysis
for other methods. This will allow the reader to generate an important learning
experience that holds for all prediction models.
Learning Outcome 2: Prediction Models

The output of a prediction model is a random variable, which is associated
with an underlying probability distribution.

In our experience, truly understanding the meaning of this is an important step
in comprehending data science in general and in working on the practical analysis
of a project, because one needs to think in terms of a probability distribution when
talking about the output of a model. Hence, the preceding observation can serve as
a guiding principle when looking at prediction models.

2.8 Exercises

1. Repeat the analysis provided by Listing 2.1.
• Conduct this analysis for new randomly sampled data and for D1 and D2
shown in Listing 2.1.

• Repeat the analysis and record your results by plotting the percentage of
rejection/acceptance as a function of the standard deviation σ .
• What is the influence of m (number of experiments — see Fig. 2.4) on the
results?
2. Modify the previous analysis using any prediction model you are familiar with.
For this you need to generate new data that are appropriate for your method (for
example, for a classification, you need labeled data).
3. Write an R program that maps the values of x = {4, 2, −1, 5} into y via the
deterministic function

g : x → y. (2.4)

Modify this program by adding a noise term ε with ε ∼ N(μ = 0, σ = 0.01).
In R, values from a normal distribution can be drawn using the command
‘rnorm(n=1, mean=0, sd=0.01)’.
4. In Sect. 2.6, we distinguished between results (R) and the values of an error
measure (E). Discuss this difference between R and E by explicating the
relationship between the contingency table and the F1-score. Hint: See Chap. 3.
5. For a medical example, in order to see the importance of explainable models (see
Sect. 2.5) in the context of biomarkers, read the article [327]. Summarize and
interpret these results. Hint: See a discussion in [156].
Chapter 3
General Error Measures

3.1 Introduction

When using a prediction method to analyze data, one needs to evaluate the outcome
corresponding to the prediction. There are different types of error measures one
can use depending on the nature of the method. In this chapter, we focus on
error measures that can be used to evaluate classification methods. Classification
methods are supervised learning methods that require labeled data. This allows a
straightforward evaluation, because for every prediction there is a label available,
which can be used to assess if the prediction is true or false. This allows the
quantification of the results of a prediction model. In this chapter, we will see that
there are many error measures that can be used, and each one focuses on a particular
aspect of the prediction.
There are different forms of classification, depending on the number of classes
considered. The simplest and most widely used one is a binary classification
(also known as two-class classification), where a data point is assigned to either
class +1 or class −1. In general, one can view binary classification as binary
decision-making, because each sample requires us to make a decision regarding
its classification. Binary decision-making is a topic of great interest in many fields,
including biomedical sciences, economics, management, politics, medicine, natural
sciences, and social sciences. Despite the considerable differences in the problems
studied within these fields, we can summarize the discussion of classification errors
in terms that are useful for all application domains.
This chapter provides a comprehensive survey of well-established error measures
for evaluating the outcome of binary decision-making problems. We will start by
introducing four so-called fundamental errors from which most error measures
are derived. Then, we discuss 14 different error measures. Finally, we discuss the
evaluation of the outcome of a single method and that of multiple methods, showing
that such an evaluation is a complex task requiring the interpretation of an analysis.


Fig. 3.1 Overview of three main steps for obtaining error measures. First, binary decision-making categorizes instances in either class +1 or class −1. The result of this classification is either correct or false. Second, a repeated application allows one to obtain the values in the contingency table, which evaluates the classification results. Third, from this, one can derive a variety of error measures.

3.2 Motivation

For the discussion of error measures, we start with a visualization of the overall
problem, which is shown in Fig. 3.1. This figure shows three steps. First, data are
used for decision-making, leading to a binary classification. Second, a repeated
application of this allows one to obtain the values in a so-called contingency table.
A contingency table provides an evaluation of the classification results, offering
numerical summary information about the true positives (TP), false positives (FP),
true negatives (TN), and false negatives (FN) (discussed in detail later). Third, based
on the values in the contingency table, one can derive a variety of error measures,
which are functions of TP, FP, TN, and FN.
In the next section, we discuss the contingency table and its constituents, and in
the following sections, we discuss error measures and their definitions.

3.3 Fundamental Error Measures

A contingency table (also called a confusion matrix) provides a summary of binary decision-making. In the following, we assume that we have two classes, called +1
and −1. Here, the indicators of the two classes are labels, or nominal numbers (also
called categorical numbers). That means the class labels do not provide numerical
values, but rather names, to distinguish the classes.

In general, the outcome of a decision-making process can be one of the following four cases:
1. The actual outcome is class +1, and we predict +1.
2. The actual outcome is class +1, and we predict −1.
3. The actual outcome is class −1, and we predict +1.
4. The actual outcome is class −1, and we predict −1.
It is convenient to give these four cases four different names. We call them the
following:
1. True positive: TP
2. False negative: FN
3. False positive: FP
4. True negative: TN
A summary of the outcomes is provided by the contingency table, shown in Fig. 3.2.
If one repeats such an analysis multiple times, making predictions for a number of
instances, one obtains integer values for each of the preceding four measures; that
is, TP, FN, FP, TN ∈ N. Hence, summing over the rows or columns in a contingency
table provides information about the following:

Total number of instances in class +1: P = TP + FN. (3.1)
Total number of instances in class −1: N = FP + TN. (3.2)
Total number of instances predicted +1: R = TP + FP. (3.3)
Total number of instances predicted −1: A = FN + TN. (3.4)

It is easy to see that these four quantities characterize the outcome of a binary
decision-making process completely. For this reason, we are calling them the four
fundamental error measures. Most of the error measures we will discuss in the
following sections will be based on these four fundamental errors. To facilitate
understanding, we will utilize the contingency table whenever beneficial to explain
the different error measures.
We would like to note that we use the term “fundamental error measures” to
emphasize the importance of these four measures relative to all other error measures
discussed in subsequent sections. Mathematically, we will see that all the other

Fig. 3.2 Summary of binary decision-making in the form of a contingency table.

                              prediction outcome
                          class +1    class -1    total
actual      class +1      TP          FN          P
outcome     class -1      FP          TN          N
            total         R           A

measures are functions of the four fundamental error measures. Hence, for those
measures, TP, FP, TN, and FN are independent variables.
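To make these quantities concrete, a minimal R sketch could look as follows; the label vectors actual and predicted are hypothetical and serve only as an illustration:

# hypothetical label vectors with classes +1 and -1
actual    <- c(1, 1, 1, -1, -1, 1, -1, -1, 1, -1)
predicted <- c(1, -1, 1, -1, 1, 1, -1, -1, 1, -1)
TP <- sum(actual == 1  & predicted == 1)    # true positives
FN <- sum(actual == 1  & predicted == -1)   # false negatives
FP <- sum(actual == -1 & predicted == 1)    # false positives
TN <- sum(actual == -1 & predicted == -1)   # true negatives
P <- TP + FN   # total number of instances in class +1
N <- FP + TN   # total number of instances in class -1
R <- TP + FP   # total number of instances predicted +1
A <- FN + TN   # total number of instances predicted -1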

3.4 Error Measures

With the contingency table available, we have all we need to discuss the error
measures. In Fig. 3.3, we show an overview of the error measures we will discuss.
In general, the error measures can be grouped into three main categories. The first
group focuses on correct outcome, the second on incorrect outcome, and the third
group on both outcomes. In the following sections, we will discuss measures from
each of these categories.

Fundamental errors:
TP = True positive: correct positive prediction
FP = False positive (type I error): incorrect positive prediction
TN = True negative: correct negative prediction
FN = False negative (type II error): incorrect negative prediction

Summarizations:
P = TP + FN
N = FP + TN
R = TP + FP
A = FN + TN
T = TP + FP + FN + TN
PR = prevalence = P/T = (TP + FN)/(TP + FN + TN + FP) ∈ [0, 1]

Error measures (focus on correct outcome):
TPR = True positive rate = sensitivity = recall = TP/P = TP/(TP + FN) ∈ [0, 1]
TNR = True negative rate = specificity = TN/N = TN/(TN + FP) ∈ [0, 1]
PPV = positive predictive value = precision = TP/R = TP/(TP + FP) ∈ [0, 1]
NPV = negative predictive value = TN/A = TN/(TN + FN) ∈ [0, 1]
ACC = accuracy = (TP + TN)/(TP + TN + FP + FN) = (TP + TN)/(P + N) = (TP + TN)/(R + A) ∈ [0, 1]
Fβ = (1 + β²) · (PPV × sensitivity)/(β² · PPV + sensitivity)
F1 = 2 · (PPV × sensitivity)/(PPV + sensitivity) ∈ [0, 1]

Error measures (focus on incorrect outcome):
FDR = false discovery rate = FP/R = FP/(FP + TP) ∈ [0, 1]
FOR = false omission rate = FN/A = FN/(FN + TN) ∈ [0, 1]
FNR = false negative rate = FN/P = FN/(FN + TP) ∈ [0, 1]
FPR = false positive rate = FP/N = FP/(FP + TN) ∈ [0, 1]
E = error rate = (FP + FN)/(TP + TN + FP + FN) ∈ [0, 1]

Error measures (focus on both outcomes):
MCC = Matthews correlation coefficient = (TP · TN − FP · FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) ∈ [−1, 1]
κ = Cohen's kappa = (ACC − rACC)/(1 − rACC) ∈ (−∞, 1], with rACC = (P · R + N · A)/T²
NMIA = (asymmetric) normalized mutual information = I(actual, predicted)/H(actual) = (H(actual) − H(actual | predicted))/H(actual) ∈ [0, 1]
NMIS = (symmetric) normalized mutual information = I(actual, predicted)/√(H(actual) · H(predicted)) ∈ [0, 1]

Fig. 3.3 Overview of error measures for binary decision-making and two-class classification.

3.4.1 True-Positive Rate and True-Negative Rate

The true-positive rate and true-negative rate are defined as follows:

TPR = True positive rate = sensitivity = TP/P = TP/(TP + FN) ∈ [0, 1] (3.5)
TNR = True negative rate = specificity = TN/N = TN/(TN + FP) ∈ [0, 1] (3.6)
The definitions ensure that both measures are bound between zero and one. For an error-free classification, we obtain FN = FP = 0, which implies TPR = TNR = 1. However, for TP = TN = 0, we obtain TPR = TNR = 0.
In the literature, the true-positive rate is also called sensitivity, and the true-
negative rate is called specificity [171].
It is important to note that both quantities utilize only half of the information
contained in the confusion matrix. The TPR uses only values from the first row and
the TNR only values from the second row. In Fig. 3.4, we highlight this by encircling
the used fundamental errors. For simplicity, we refer to the first row as P-level and
the second row as N-level. Hence, the TPR uses only values from the P-level and
the TNR values from the N-level.
From Fig. 3.4, it is clear that both measures are symmetric with respect to the
utilized information; that is, TPR and TNR merely exchange the roles of both
classes. This formulation allows us to remember the measures more easily.
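As a minimal illustration with hypothetical counts (not taken from any of the book's listings), both rates follow directly from the entries of the contingency table:

TP <- 40; FN <- 10; FP <- 5; TN <- 45   # hypothetical contingency-table entries
TPR <- TP / (TP + FN)                   # sensitivity, Eq. (3.5)
TNR <- TN / (TN + FP)                   # specificity, Eq. (3.6)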

3.4.2 Positive Predictive Value and Negative Predictive Value

The positive predictive value and negative predictive value are defined by the
following:

PPV = positive predictive value = precision = TP/R = TP/(TP + FP) ∈ [0, 1]. (3.7)
NPV = negative predictive value = TN/A = TN/(TN + FN) ∈ [0, 1]. (3.8)


Fig. 3.4 Left: The true-positive rate (TPR) uses only information from the P-level. Right: The
true-negative rate (TNR) uses only information from the N-level.


Fig. 3.5 Left: The positive predictive value (PPV) uses only information from the R-level. Right:
The negative predictive value (NPV) uses only information from the A-level.

The definitions ensure that both measures are bound between zero and one. For
an error-free classification, we obtain FN = FP = 0, which implies PPV = NPV = 1.
Meanwhile, for TP = TN = 0, we obtain PPV = NPV = 0.
In the literature, the positive predictive value is also called precision [162].
Similar to TPR and TNR, PPV and NPV are estimated using only half of the
information contained in the confusion matrix. The PPV uses only values from the
first column, and the NPV uses only values from the second column. In Fig. 3.5,
we highlight this by encircling the used fundamental errors. For simplicity, we
refer to the first column as R-level and the second column as A-level. Hence,
the PPV uses only values from the R-level, and the NPV uses values from the
A-level.
From Fig. 3.5, it can be observed that both measures are again symmetric with respect to the utilized information, and just the roles of the classes are exchanged.

3.4.3 Accuracy

Accuracy is defined as follows:

ACC = accuracy = (TP + TN)/(TP + TN + FP + FN) = (TP + TN)/(P + N) = (TP + TN)/(R + A) ∈ [0, 1]. (3.9)
This definition ensures that the accuracy value is bound between zero and one.
For an error-free classification, we obtain FN = FP = 0, which implies ACC = 1.
Meanwhile, for TP = TN = 0, we obtain ACC = 0. Another term used to refer to
accuracy, in the context of clustering evaluation, is Rand index.
In contrast with the quantities TPR, TNR, PPV, and NPV, the accuracy value uses
all values in the confusion matrix.

3.4.4 F-Score

The general definition of the F-score is as follows:

Fβ = (1 + β²) · (PPV × sensitivity) / (β² · PPV + sensitivity). (3.10)

In this equation, the parameter β is a weighting parameter that can assume values in the interval [0, ∞]. The parameter β allows us to weight the relative importance of the PPV and the sensitivity. Hence, the F-score is a family of measures and not just one error measure.
In Fig. 3.6, we show an example of two different value pairs of PPV and sensitivity. One can observe that for β = 0, the F-score corresponds to the PPV, whereas for β → ∞, it corresponds to the sensitivity. Intermediate values of β enable one to obtain "averaged" F-score values.
For β = 1, one obtains the F1 -score,

F1 = 2 · (PPV × sensitivity) / (PPV + sensitivity) ∈ [0, 1] (3.11)

The F1 -score is the harmonic mean of PPV and sensitivity, where the harmonic
mean is defined as follows:
1/F1 = (1/n) · (1/PPV + 1/sensitivity), with n = 2. (3.12)

The F1 -score uses three of the four fundamental errors; namely, TP, FP, and FN.

Fig. 3.6 Behavior of the Fβ-score depending on the parameter β. The results are shown for PPV = 0.60 and sensitivity = 0.90 (left figure) and PPV = 0.90 and sensitivity = 0.60 (right figure).
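A short R sketch, which is our own illustration, reproduces the qualitative behavior shown in Fig. 3.6: for β = 0 the score equals the PPV, and for large β it approaches the sensitivity.

fbeta <- function(ppv, sens, beta) {
  (1 + beta^2) * ppv * sens / (beta^2 * ppv + sens)   # Eq. (3.10)
}
beta <- seq(0, 10, by = 0.1)
f.left  <- fbeta(ppv = 0.60, sens = 0.90, beta = beta)   # setting of the left panel
f.right <- fbeta(ppv = 0.90, sens = 0.60, beta = beta)   # setting of the right panel
plot(beta, f.left, type = "l", ylim = c(0.55, 0.95), xlab = "beta", ylab = "F-score")
lines(beta, f.right, lty = 2)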

3.4.5 False Discovery Rate and False Omission Rate

The false discovery rate and false omission rate are defined by the following:

FDR = false discovery rate = FP/R = FP/(FP + TP) ∈ [0, 1]. (3.13)
FOR = false omission rate = FN/A = FN/(FN + TN) ∈ [0, 1]. (3.14)
The preceding definitions ensure that both measures are bound between zero
and one. For an error-free classification, we obtain FN = FP = 0, which implies
FDR = FOR = 0. Meanwhile, for TP = TN = 0, we obtain FDR = FOR = 1.
FDR and FOR also utilize only half of the information contained in the confusion
matrix. The FDR uses only values from the first column, and the FOR uses only
values from the second column. In Fig. 3.7, we highlight this by encircling the
used fundamental errors. That means the FDR uses information from the R-level,
but in contrast to the PPV, which also uses information from this level, the FDR
focuses on failure by forming the quotient of FP and R. Similarly, the FOR uses
only information from the A-level, forming the quotient of FN and A. These can be
compared with the PPV and the NPV in Fig. 3.5.
From Fig. 3.7, one sees again that both measures are symmetric with respect to
the utilized information, and just the roles of the classes are exchanged. Frequent
application domains for the false discovery rate and the false omission rate are
biology, medicine, and genetics [161, 185]. When discussing multiple testing
corrections in Chap. 15, we will see further examples of the application of the FDR.

3.4.6 False-Negative Rate and False-Positive Rate

The false-negative rate and false-positive rate are defined by the following:


Fig. 3.7 Left: The false discovery rate (FDR) uses only information from the R-level. Right: The
false omission rate (FOR) uses only information from the A-level. Both measures focus on failure
only.


Fig. 3.8 Left: The false-negative rate (FNR) uses only information from the P-level. Right: The
false-positive rate (FPR) uses only information from the N-level. In contrast with the TPR and the
TNR, both measures focus on failure.

FNR = false negative rate = FN/P = FN/(FN + TP) ∈ [0, 1]. (3.15)
FPR = false positive rate = FP/N = FP/(FP + TN) ∈ [0, 1]. (3.16)
By definition, both measures are bound between zero and one. For an error-free
classification, we have FN = FP = 0, which implies FNR = FPR = 0. Meanwhile,
for TP = TN = 0, we obtain FNR = FPR = 1.
Both quantities are similar to the TPR and TNR, since they use only information
from the first and second rows of the contingency table, respectively. Specifically,
the FNR uses only values from the first row, whereas the FPR uses only values from
the second row. However, both measures focus on failure. In Fig. 3.8, we highlight
this by encircling the used fundamental errors. These can be compared with the TPR
and the TNR in Fig. 3.4.
From Fig. 3.8, it can be observed that both measures are symmetric with respect
to the utilized information, and just the roles of the classes are exchanged.

3.4.7 Matthews Correlation Coefficient

A common issue when applying machine learning (ML) techniques to a real-world problem is having an imbalanced target variable. In this case, the Matthews
correlation coefficient (MCC) is a good measure [331]. The MCC score was first
introduced by Matthews [331] to assess the performance of protein secondary
structure prediction. It is defined as follows:

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) ∈ [−1, 1]. (3.17)

MCC ranges from −1 to +1. A value of −1 indicates that the prediction is entirely
wrong, whereas a value of +1 indicates a perfect prediction. MCC=0 means that
we have a random classification, where the model predictions have no detectable
correlation with the true results.

In Fig. 3.9, we show some numerical results for the behavior of the MCC. For the
shown simulations, we assumed a fixed prevalence (=P/T) of 0.1, a fixed sensitivity
of 0.5, and T = 10,000. The values of the specificity were varied from 0.0 to
1.0. From these four measures, the four fundamental errors can be derived, as can
the Matthews correlation coefficient. The chosen value for the prevalence ensures a
strong imbalance between both classes.
Listing 3.1 shows how to estimate the Matthews correlation coefficient for any
given values of prevalence, specificity, sensitivity, and T (total number of instances).
Since the MCC is based on TP, TN, FP, and FN, these values need to be estimated first by utilizing their corresponding definitions.
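A minimal sketch of such a computation, which is not identical to Listing 3.1 and uses our own function name mcc.from.rates, could look as follows:

# derive TP, FN, TN, FP from prevalence, sensitivity, specificity, and T, then compute MCC
mcc.from.rates <- function(prevalence, sensitivity, specificity, T.total) {
  P  <- prevalence * T.total   # instances in class +1
  N  <- T.total - P            # instances in class -1
  TP <- sensitivity * P
  FN <- P - TP
  TN <- specificity * N
  FP <- N - TN
  (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
}
mcc.from.rates(prevalence = 0.1, sensitivity = 0.5, specificity = 0.9, T.total = 10000)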

In general, low specificity values correspond to poor classification results, as indicated by very low ACC values and negative values of the MCC (see Fig. 3.9).
With increasing values for the specificity, the ACC increases. However, due to the
imbalance of the classes and the way we defined our model, keeping the TPR
constant, the increasing values for ACC are misleading. Indeed, the MCC increases
but not linearly and much more slowly than those for the ACC. Moreover, the
highest achievable value for MCC is only 0.68, in contrast with 0.95 for ACC. Using
ACC alone would not reveal this problem.


Fig. 3.9 Behavior of the MCC and ACC (accuracy) depending on the specificity and fixed values
for sensitivity, prevalence, and T . For the shown results, we use a prevalence of 0.1, a sensitivity
of 0.5, and T = 10,000.

Because an imbalance in the categories of the classes is a frequent problem, the Matthews correlation coefficient is used throughout all application domains of data science.

3.4.8 Cohen’s Kappa

The next measure we discuss is called Cohen’s kappa [85]. It is used essentially as
a measure to assess how well a classifier performs compared to how well it would
have performed by chance. That means a model has a high kappa value if there is
a big difference in the accuracy between a model and a random model. Formally
expressed, Cohen’s kappa was defined in [85] as follows:

κ = Cohen's kappa = (ACC − rACC) / (1 − rACC) ∈ (−∞, 1]. (3.18)

Here, ACC is the accuracy, and rACC denotes the randomized accuracy, defined as follows:

rACC = (P · R + N · A) / T². (3.19)

Usually, Cohen’s kappa is used as a measure of agreement between two categorical


variables in various machine learning applications across many disciplines, includ-
ing epidemiology, medicine, psychology, and sociology [414, 477].
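As a minimal sketch with hypothetical counts (our own illustration), Cohen's kappa can be computed directly from the four fundamental errors:

cohens.kappa <- function(TP, FN, FP, TN) {
  T.total <- TP + FN + FP + TN
  P <- TP + FN; N <- FP + TN
  R <- TP + FP; A <- FN + TN
  ACC  <- (TP + TN) / T.total
  rACC <- (P * R + N * A) / T.total^2   # randomized accuracy, Eq. (3.19)
  (ACC - rACC) / (1 - rACC)             # Eq. (3.18)
}
cohens.kappa(TP = 40, FN = 10, FP = 10, TN = 40)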

3.4.9 Normalized Mutual Information

The normalized mutual information is an information-theoretic measure [410]. In most studies, the normalized mutual information is applied to classification tasks and other learning problems.
Specifically, Baldi et al. [23] defined the (asymmetric) normalized mutual information as follows:
NMIA = I(actual, predicted) / H(actual)
     = [ (TP/T) log(TP/T) + (TN/T) log(TN/T) + (FP/T) log(FP/T) + (FN/T) log(FN/T)
         − (TP/T) log((TP + FP)/T · (TP + FN)/T) − (FN/T) log((TP + FN)/T · (TN + FN)/T)
         − (FP/T) log((TP + FP)/T · (TN + FP)/T) − (TN/T) log((TN + FN)/T · (TN + FP)/T) ] / H(actual)

In fact, this expression can be simplified as follows [23]:

NMIA = [ (TP/T) log(TP/T) + (TN/T) log(TN/T) + (FP/T) log(FP/T) + (FN/T) log(FN/T)
         − (TP/T) log(R · P/T²) − (FN/T) log(P · A/T²)
         − (FP/T) log(R · N/T²) − (TN/T) log(A · N/T²) ] / H(actual)

Many variants of the normalized mutual information measure have been introduced
and applied (see, for example, [258, 393, 410, 438, 491]). For instance, Hu and
Wang [258] defined the normalized mutual information measures on the so-called
augmented confusion matrix. This matrix was defined by adding one column for
a rejected class to a conventional confusion matrix (see [258]). Wallach [491]
considered normalized confusion matrices representing models and formulated the
normalized mutual information measures on that matrix as well as other error
measures, such as precision and recall.
In [24], it was noted that the normalized mutual information just defined is
asymmetric in the argument of the entropy because
NMIA = I(actual, predicted) / H(actual) ≠ I(actual, predicted) / H(predicted). (3.20)

For this reason, in [454], a symmetric normalized mutual information was suggested, and it is defined as follows:

NMIS = I(actual, predicted) / √(H(actual) · H(predicted)). (3.21)
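A minimal R sketch with hypothetical counts (our own illustration) computes both versions of the normalized mutual information from the four fundamental errors:

nmi <- function(TP, FN, FP, TN) {
  p.joint <- matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE) / (TP + FN + FP + TN)
  p.actual    <- rowSums(p.joint)             # marginal distribution of the actual classes
  p.predicted <- colSums(p.joint)             # marginal distribution of the predictions
  p.indep <- outer(p.actual, p.predicted)     # product of the marginals
  I  <- sum(ifelse(p.joint > 0, p.joint * log(p.joint / p.indep), 0))     # mutual information
  Ha <- -sum(ifelse(p.actual > 0, p.actual * log(p.actual), 0))           # H(actual)
  Hp <- -sum(ifelse(p.predicted > 0, p.predicted * log(p.predicted), 0))  # H(predicted)
  c(NMIA = I / Ha, NMIS = I / sqrt(Ha * Hp))
}
nmi(TP = 40, FN = 10, FP = 10, TN = 40)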

3.4.10 Area Under the Receiver Operator Characteristic Curve

The final error measure we are presenting is the area under the receiver operator
characteristic (AUROC) curve [56, 162]. In contrast with all the previous measures
discussed so far, the AUROC curve can only be obtained via a construction process,
rather than being derived directly from the contingency table. The reason for this is
that the AUROC does not make use of an optimal threshold of a classifier to decide
how to categorize data points (instances). Instead, this threshold can be derived, as
we will show at the end of this section.
The first step of this construction process is to derive the ROC curve, and the
second is to integrate this curve to obtain the area under it. To construct an ROC
curve, we need to obtain pairs of the true-positive rate and the false-positive rate;
that is, (TPRi , FPRi ). That means the ROC curve presents the TPR as a function of
the FPR. Since the TPR is equivalent to the sensitivity, and the FPR is equivalent to
(1 — specificity), an alternative representation would be the sensitivity as a function
of (1 — specificity).
Let’s assume that we have a data set with n samples. Regardless of the
classification method, we can obtain either a score of si or a probability of pi for
every instance as an indicator of the membership for class +1 (and analogously
values for class −1). In the following, we use probabilities, but the discussion for
scores is similar. Based on these probabilities, a decision is obtained by thresholding
the values. That means for

pi > pt , (3.22)

we decide to place instance i into class +1; otherwise, in class −1. Rearranging the
values pi in increasing order, we obtain the following:

p[1] ≤ p[2] ≤ · · · ≤ p[n]. (3.23)

Now, we apply successively all possible thresholds to obtain two groups: one group
corresponding to class +1 and the other group to class −1. This results overall in
n + 1 different thresholds and, hence, groupings. This is visualized in Fig. 3.10. In
this figure, the vertical lines in red correspond to the thresholds used to categorize
instances.
It is interesting to note that these thresholds pti are not unique, but rather are given by constraints, as shown on the right-hand side of Fig. 3.10. For instance, threshold

Fig. 3.10 Left: Ordered probabilities p[1] ≤ p[2] ≤ . . . ≤ p[n] and the corresponding thresholds, indicated by vertical red lines. Right: The thresholds are constrained as follows: pt1 ≤ p[1]; p[1] < pt2 ≤ p[2]; . . . ; p[n] < ptn+1.

pt2 can assume any value between p[1] and p[2]; that is, p[1] < pt2 ≤ p[2], and all such values will give the same classification results.
For each of these groupings, we can calculate the four fundamental errors and,
based on these, every error measure shown in Fig. 3.3, including the TPR and FPR,
which are needed for an ROC curve. Overall, this results in n + 1 pairs of (TPRi ,
FPRi ) for i ∈ {1, . . . , n + 1}, from which an ROC curve is constructed.
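A minimal R sketch of this construction, using hypothetical class labels and class +1 probabilities (our own illustration, independent of the book's listings), could look as follows:

set.seed(1)
labels <- c(1, 1, -1, 1, -1, -1, 1, -1, 1, -1)   # hypothetical true classes
probs  <- runif(length(labels))                  # hypothetical probabilities for class +1
thresholds <- c(-Inf, sort(probs))               # n + 1 effective thresholds
roc <- t(sapply(thresholds, function(pt) {
  pred <- ifelse(probs > pt, 1, -1)              # threshold the probabilities
  TP <- sum(labels == 1  & pred == 1);  FN <- sum(labels == 1  & pred == -1)
  FP <- sum(labels == -1 & pred == 1);  TN <- sum(labels == -1 & pred == -1)
  c(FPR = FP / (FP + TN), TPR = TP / (TP + FN))
}))
plot(roc[, "FPR"], roc[, "TPR"], type = "b", xlab = "FPR", ylab = "TPR")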
In Fig. 3.11, we show an example of an ROC curve (in purple) resulting from
a logistic regression analysis. These results were generated with Listing 3.2, using
the Michelin data discussed in Sect. 9.6, where we also introduce logistic regression
informally.
Here, we are only interested in the resulting ROC curve in Fig. 3.11, which shows
the TPR (=sensitivity) as a function of the FPR (=1 − specificity). As one can
see in Listing 3.2, these values are provided by the function roc() included in the
package plotROC.
The area under the receiver operator characteristic, called AUROC, in Fig. 3.11
is 0.804. For comparison purposes, we added a diagonal line (in red). If the ROC curve looked like this line, the AUROC would be 0.5, indicating a random classification
performance. Furthermore, we added a blue curve that reaches to the top left corner.
In this case, the AUROC would be 1.0, indicating a perfect classifier without error.
An ROC curve can be used not only to obtain the AUROC value to evaluate a
classifier, but also to determine the optimal threshold for the classifier. Remember,
to obtain ROC curves, such a threshold is not used. To obtain such an optimal
threshold, we need to define an optimization function. In the literature, there are
two frequent choices [478]. The first is the distance from the ROC curve to the
upper-left corner, that is, (TPR = 1, FPR = 0), given by

D-ROCi = √((1 − TPRi)² + FPRi²), (3.24)
        = √((1 − Sei)² + (1 − Spi)²). (3.25)

The second is called Youden's Index [517], given by

Youden-Indexi = TPRi − FPRi (3.26)
              = Sei − (1 − Spi). (3.27)

Here, Se and Sp correspond to the sensitivity and specificity, respectively. The optimal thresholds are then obtained by finding the following:

iopt^d = argmin_i {D-ROCi}, (3.28)
iopt^Y = argmax_i {Youden-Indexi}. (3.29)

In general, these indices do not result in identical values; however, this is the case
for the example shown in Fig. 3.11, as illustrated in Fig. 3.12. For the ROC curve in
Fig. 3.11, this leads to TPR = 0.66 and FPR = 0.17.

Fig. 3.11 ROC curve (in purple) for a logistic regression model (AUC = 0.804). The red curve corresponds to a random classifier, and the blue curve to a perfect classifier.

Fig. 3.12 Youden's Index (purple) and D-ROC curve (blue) depending on the index. For the logistic regression in Fig. 3.11, both measures result in the same optimal cutoff index.

3.5 Evaluation of Outcome

So far, we have discussed many different error measures, and one may ask, “Do
we really need all of them, or is there one measure that summarizes all others?”
The answer to this question is that there is no single error measure that would be
appropriate in all possible situations under all feasible conditions. Instead, one needs
to realize that evaluating binary decision-making is a multivariate problem. This
means, usually, that we need more than one error measure to evaluate the outcome
in order to avoid false interpretations of the prediction results.

3.5.1 Evaluation of an Individual Method

To illustrate this problem, we present an example. For this example, we simulate the values in the contingency table according to a model. This means we are defining an error model. To simplify the analysis, we define the error model for the proportions of the four fundamental errors, TP, FP, TN, and FN. Let us denote these proportions pTP, pFP, pTN, and pFN, respectively. Each of these proportions (probabilities) can assume values between zero and one, and the four quantities sum up to one, as follows:

pTP + pFP + pTN + pFN = 1 (3.30)

If we multiply each of these proportions by T, the total number of instances, we recover the four fundamental errors, as follows:

TP = T · pTP (3.31)
TN = T · pTN (3.32)
FP = T · pFP (3.33)
FN = T · pFN (3.34)

Hence, there is no loss in generality by utilizing the proportions of the four fundamental errors.
In Fig. 3.13, we visualize our error model that defines the proportions of the four
fundamental errors. Formally, we define the error model as shown in Listing 3.3.

Fig. 3.13 Visualization of the error model that defines the proportions of the four fundamental errors.

The simulations start by assuming pTP takes its values in the interval from 0.2 to
0.6 in step sizes of 0.01. Based on this, the values of pFN and pTN are determined
via functional relationships. Finally, the values of pFP ensure the conservation
of the total probability. Overall, the model starts with a poor decision outcome,
corresponding to low pTP and pTN values, and improves toward higher values that
correspond to better decision-making. Furthermore, the model is imbalanced in the
sizes of the classes, as can be seen from the values of P and N. It is also imbalanced
in the predictions (see R and A).
In Fig. 3.14, we present results for seven error measures that are obtained using
the values from our error model. The red points correspond to the starting value of
the error model; that is, pTP = 0.2, pFN = 0.36, pTN = 0.10, and pFP = 0.34 (see
Listing 3.3).
All of these pairwise comparisons show nonlinearities, at least to some extent.
The most linear (but not exactly linear) correspondence is exhibited between the
ACC and the F-score, followed by the relationship between the FDR and the F-
score. For the starting point — that is, pTP = 0.2, etc. — we obtain ACC = 0.18 and
F = 0.33, indicating poor classification performance. Taking, in addition, the value
of the FDR = 0.68 into consideration, we see that 68% of all samples classified as
class +1 are false. From this perspective, the classification results appear even worse. From the PPV-versus-NPV plot, one can see that class +1 is easier to recover
compared to class −1 because there is a strong nonlinearity between these two error
measures, with a faster increase in the values of the PPV (precision).
Overall, the interpretation of the classification results, given the error measures,
is not straightforward, because, usually, neither the best nor the worst classification
results are observed. Instead, the values of, for example, the F-score or the PPV are situated

Fig. 3.14 Summary of seven error measures according to the error model shown in Fig. 3.13. The red points correspond to the starting value of the error model, i.e., pTP = 0.2, pFN = 0.36, pTN = 0.10, and pFP = 0.34.

in between. Furthermore, it is not possible to use just one error measure to obtain
all the information on the classification performance. Instead, one needs to compare
multiple error measures with each other to draw a conclusion about the performance.
This will always require a discussion of the results.

3.5.2 Comparing Multiple Binary Decision-Making Methods

In the preceding example, we showed that changes in the four fundamental errors
(as defined by an error model) can lead to nonlinear effects in the dependent errors.
However, this type of issue is not the only problem when evaluating binary decision-
making. Another problem arises when evaluating two (or more) binary decision-
making methods. To demonstrate this type of problem, we show in Fig. 3.15 the
outcome of three binary decision-making methods.
Specifically, in the first part of Fig. 3.15, we show the proportion of the four
fundamental errors for three methods. Here, we assumed that the application of
a method to a data set results in the shown errors, and that all three methods
are applied to the same data set. To demonstrate this problem, we consider two
scenarios. The first scenario corresponds to a comparison of the outcome of method
1 with that of method 2, and the second scenario to a comparison of the outcome of
method 1 with that of method 3.
For scenario one, from Fig. 3.15, we see the following:
• pTP1 > pTP2,
• pTN1 > pTN2,
• pFN1 < pFN2,
• pFP1 < pFP2.

Fig. 3.15 Outcome of three binary decision-making methods. Top: Contingency tables for the three methods (method 1: pTP = 0.4, pFN = 0.1, pFP = 0.1, pTN = 0.4; method 2: pTP = 0.35, pFN = 0.2, pFP = 0.15, pTN = 0.3; method 3: pTP = 0.3, pFN = 0.1, pFP = 0.1, pTN = 0.5). Bottom: Nine further error measures (TPR, TNR, PPV, NPV, ACC, F, FDR, FNR, FPR) for each method.
That means the true predictions for method 1 are always better than those for method
2, and the false predictions for method 1 are always worse than those for method
2. For this reason, it seems obvious that method 1 performs better than method 2
regardless of what fundamental error measure is used and no matter if one considers
just one of these or their combinations. The result of this comparison shows that
method 1 always performs better than method 2.
However, for scenario two, shown in Fig. 3.15, comparing method 1 with method
3 yields the following:
• pTP1 > pTP3 ,
• pTN1 < pTN3 ,
• pFN1 = pFN3 ,
• pFP1 = pFP3 .
First, we observe that the false predictions are identical. Second, the proportion of
true positives is higher for method 1, but the proportion of true negatives is higher
for method 3. Third, the absolute value of the distance of the true predictions is
equal for the positive and negative classes; that is,

ΔpTP = pTP1 − pTP3 = 0.1, (3.35)
ΔpTN = pTN1 − pTN3 = −0.1, (3.36)

but the sign is different. Overall, one can see that without further information, one
cannot decide if method 1 is better than method 3 or vice versa.

This can be further clarified by looking at different error measures that are
functions of the four fundamental errors. In Fig. 3.15, we show results for N = 100.
For scenario one, we see that all error measures that focus on positive outcome
are higher for method 1 (in orange) than for method 2 (in blue), and all error
measures that focus on negative outcome are lower for method 1 than for method 2.
In contrast, for scenario two, we see that there are some error measures that focus
on positive outcome, which are higher for method 1 (in orange) than for method 3
(in brown), and some are lower for method 1 (in orange) than for method 3. For
instance,

TPR1 > TPR3 , (3.37)


TNR1 < TNR3 . (3.38)

Similar results hold for error measures that focus on negative outcome. For instance,

FDR1 < FDR3 , (3.39)


FPR1 > FPR3 . (3.40)

Hence, without additional information, one cannot decide which method is the best.
In summary, the preceding examples demonstrate the following: First, there is
not one measure that summarizes the information provided by all error measures.
Second, even using all of the preceding error measures does not guarantee the
identification of the best method. As a consequence, domain-specific information
about the problem, such as from biology, medicine, finance, or the social sciences,
is needed to have a further criterion for assessing a method. This can be seen as
introducing some further error measure(s).

3.6 Summary

In this chapter, we provided a comprehensive discussion of many error measures used in data science [150]. Such measures can be applied to common classification
problems and hypothesis testing because both types of method conduct binary
decision-making.
We have seen that many error measures are based on a contingency table, which
summarizes the outcome of decision-making. Specifically, the four fundamental
errors (true positive TP, false negative FN, false positive FP, and true negative
TN) provide the base information from which the functional form of general error
measures is derived. As we have seen, most error measures are rather simple in their
definitions. However, the normalized mutual information and the receiver operator
characteristic curve are more complex in their definition and estimation.

In Sect. 3.5, we saw that the evaluation of one or more methods is not always
straightforward. Despite the fact that there are many error measures, none is superior
to the others in all situations, nor do all error measures taken together lead to a
unique decision in all situations.
Learning Outcome 3: Error Measures

There are many error measures because they all provide a quantification for
a different aspect, and the choice of the measure(s) to use for a particular
analysis problem needs to be decided on a case-by-case basis.

While for many data analysis problems we will not encounter issues with
interpreting the outcome of a classifier, there are cases that are hard to interpret. For
those cases, domain-specific information about the problem, such as from biology,
medicine, economy, or social sciences, needs to be taken into consideration when
assessing a method. It is interesting to note that the latter can be seen as introducing
further error measures. This is of course possible since the list of error measures we
discussed in this chapter is not exhaustive. In fact, the number of error measures one
can construct, based on the four fundamental errors, is not limited.
Finally, we would like to note that in practice, resampling methods [128, 346],
such as cross-validation, are used to estimate the variability of error measures. This
is necessary because in Chap. 2 we learned that the outcome of a prediction model
is a random variable that is associated with an underlying probability distribution.
From this, one can estimate the mean value of error measures as well as the
corresponding standard deviation (called the standard error). So, error measures are only one side of the coin when evaluating the outcome of binary decision-making.
Resampling methods complement these error measures by enabling estimates of the
distribution of an error measure. We will return to this topic in Chap. 4 where we
will discuss resampling methods.

3.7 Exercises

1. Show that the following analytical result holds for the F-score.

limβ→∞ Fβ = sensitivity. (3.41)

Hint: Expand the definition of the F-score.


2. Show numerically, by writing a program in R, that the preceding result is correct.
3. Extend Listing 3.1, used to estimate the Matthews correlation coefficient, in the
following ways:
• Plot the MCC for different values of the specificity.
• Plot the MCC for different values of sensitivity.
3.7 Exercises 51

• Investigate the effect of the prevalence on the results. What are naturally occurring prevalence values in classification problems?
4. Write an R program to estimate the symmetric mutual information. What influence does the imbalance of classes have on NMIS?
5. Compare the results for the symmetric mutual information with the asymmetric
mutual information.
6. Construct an ROC curve manually by following the definition of its construction
in Sect. 3.4.10. Start by randomly drawing 6 probabilities p[i] corresponding to
the membership of an instance for class +1.
• Why are the thresholds pti not unique?
• Give some numerical examples for pti and study their influence.
Chapter 4
Resampling Methods

4.1 Introduction

This chapter introduces resampling and subsampling methods. Both method types
deal with the sampling of data. Due to the nontrivial nature of the latter topic, we
also discuss this in detail in Sect. 4.7.
The methods discussed in this chapter are different from the other methods
presented in this book. As we will see, resampling and subsampling methods allow
the generation of “new” data sets from any given data set, which can then be used
either for the assessment of a prediction model or for the estimation of parameters.
Some important resampling methods that will be discussed in this chapter include
holdout set, leave-one-out cross-validation (LOO-CV), k-fold CV, repeated k-fold
CV, and bootstrap [129, 203, 427]. Subsampling is similar to resampling but is used
for systematically reducing the number of samples. Such methods can be used for
advanced applications, such as for the estimation of learning curves, which will be
discussed in detail in Chap. 18.
All resampling and subsampling methods are nonparametric methods, which
makes them flexible and easy to use. Here, “nonparametric” means that they are
lacking sophisticated analytical formulations expressed via mathematical equations.
Instead, such methods are realized numerically in a computational manner. This
lack of mathematical elegance has the practical advantage of making such methods
easy to understand and easy to implement computationally. Furthermore, despite
this simplicity, resampling methods are powerful enough to enable us to estimate
the underlying probability distribution associated with an outcome (error) of a
prediction model, as discussed in Chap. 2.


The resampling methods discussed in the following sections can be categorized as follows:
• Resampling methods for error estimation
– Holdout set
– Leave-one-out CV
– K-fold CV
• Extended resampling methods for error estimation include the following:
– Repeated holdout set
– Repeated k-fold CV
– Stratified k-fold CV
• Resampling methods for parameter estimation include the following:
– Bootstrap

We would like to highlight that in this chapter, we focus on the situation where one prediction model needs to be evaluated, and this is equivalent to model
assessment. The situation where multiple models need to be evaluated will be
discussed in Chap. 12 because that scenario requires model selection and model
assessment.
In addition to resampling and subsampling methods, this chapter discusses the
standard error and the meaning of sampling from a distribution. Both concepts have
wide implications in data science, and for these reasons they deserve an in-depth
discussion.

4.2 Resampling Methods for Error Estimation

In the following sections, we discuss resampling methods used for estimating errors
of prediction models. That means such resampling methods can be applied to either
classification or regression methods.

4.2.1 Holdout Set

The holdout set method is the simplest of all resampling methods. It is based on a
two-step process, which works as follows. First, the original data are randomized,
and then the data points are separated into two parts of equal size. One part is used
as a training data set and the other as a testing data set. We would like to note that
the proportions used as training and testing data can be parameters, allowing choices
other than 50%.

Fig. 4.1 Visualization of leave-one-out CV. The original data are copied n times, and then each copy is separated such that n − 1 points are used as training data and 1 point is used as test data. Importantly, the indicated test data are selected successively in a non-overlapping manner.

In practice, the holdout set method is not often used because its estimates are
merely based on one testing data set. However, from a didactic point of view, it is
instructive for the more complex methods described next. Besides this, there are
extensions similar to the holdout set method that are of practical relevance, as we
will see.
Finally, we would like to note that there is one case in which the holdout set
is the method of choice; namely, when the sample size is very large. In this case,
splitting the data set into two parts leaves us with two still very large data sets that
are sufficient for training and testing purposes.

4.2.2 Leave-One-Out CV

The leave-one-out cross-validation (LOO-CV) method is also called Jackknife. Assuming that the original data contain n samples, LOO-CV is a two-step process
that first makes n identical copies of the original data and then splits each copy such
that n − 1 points are used as training data and 1 point is used as test data, where
the indicated test data are selected successively in a non-overlapping manner. The
construction process of the LOO-CV method and its data splitting is visualized in
Fig. 4.1.
In this figure, the original data are represented via the indices of the data points.
That means for a data set {x1, . . . , xn} with n samples, the shown folds (cells) contain the index of data point xi; that is, i. Furthermore, one can see that for each split i,
a testing data set is available and is used to estimate the error Etest (i) for a given
method; for example, a classifier. After all these individual test errors for the n splits

have been estimated, they are summarized using the mean error, given by

Etest(LOO-CV) = (1/n) · Σ_{i=1}^{n} Etest(i). (4.1)

The reason why it makes sense to average over the results of different data sets
(given by the splits) is that Etest (i) is a random variable that changes its value
(slightly) for different data sets. This means that there is an underlying distribution
from which this random variable is drawn. Furthermore, it means that in addition to
getting a mean value, one needs to quantify the variability of the estimate of Etest .
This is done using the standard deviation and the standard error, given by

s(Etest(LOO-CV)) = √( Σ_{i=1}^{n} (Etest(i) − Etest(LOO-CV))² / (n − 1) ), (4.2)
SE(Etest(LOO-CV)) = s(Etest(LOO-CV)) / √n, (4.3)

where s(Etest (LOO-CV)) is the standard deviation of Etest (LOO-CV). In Sect. 4.8,
we will provide an in-depth discussion of the standard error and its meaning in
general.
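As a minimal sketch, assuming a simple linear regression as the prediction model and the squared error as Etest(i) (both choices are our own, purely for illustration), LOO-CV could be implemented as follows:

set.seed(1)
n <- 30
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n, sd = 0.5)              # hypothetical data
e.test <- sapply(1:n, function(i) {
  model <- lm(y ~ x, data = d[-i, ])             # train on the n - 1 remaining points
  pred  <- predict(model, newdata = d[i, ])      # predict the left-out point
  (d$y[i] - pred)^2                              # squared test error Etest(i)
})
mean(e.test)                                      # Eq. (4.1)
sd(e.test) / sqrt(n)                              # standard error, Eq. (4.3)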

4.2.3 K-Fold Cross-Validation

K-fold cross-validation is an extension of the LOO-CV method that increases the size of the test data at the expense of having fewer splits, resulting in fewer training
data points. Specifically, k-fold cross-validation first randomizes the original data
and makes k copies. Here, it is important to note that k < n. Then, each split is
separated into k-folds, where k − 1 folds are used as training data and one fold is
used as test data. The folds corresponding to the test data are selected successively
in a non-overlapping manner.
This is visualized in Fig. 4.2. As one can see, in contrast to LOO-CV, for k-fold
CV, the test data sets consist of more than one data point. Consequently, the training
data contain fewer data points compared to LOO-CV. Furthermore, it is important
to note that each fold contains multiple data points as indicated by the indices of the
data points. From these indices, one can also see that, for different splits, the indices
in the folds do not change.
After all the individual test errors for the k splits have been estimated, they are
summarized using the mean error, given by

Etest(CV) = (1/k) · Σ_{i=1}^{k} Etest(i), (4.4)

Fig. 4.2 Visualization of k-fold cross-validation. Before splitting the data, the original data points are randomized but then kept fixed for all splits.

and the standard deviation and standard error, given by

s(Etest(CV)) = √( Σ_{i=1}^{k} (Etest(i) − Etest(CV))² / (k − 1) ), (4.5)
SE(Etest(CV)) = s(Etest(CV)) / √k, (4.6)

where s(Etest(CV)) is the standard deviation of Etest(CV). Importantly, in contrast with LOO-CV, for k-fold CV n has to be substituted by k; that is, the number of folds.
On a technical note, we would like to remark that a LOO-CV corresponds to an
n-fold CV where n is the number of samples in the data.
In Listing 4.1, we show an example, using R, of the randomization of data
corresponding to the first step in Fig. 4.2. Specifically, Listing 4.1 shows the
randomization of the indices of n data points.
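A minimal sketch of this randomization step, which is not identical to Listing 4.1 and uses our own variable names, could look as follows:

n     <- 100                                        # number of data points
ind   <- sample(x = 1:n, size = n, replace = FALSE) # randomized indices, each selected once
folds <- split(ind, rep(1:10, length.out = n))      # assign the randomized indices to k = 10 folds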

We would like to emphasize that it is important to set the option "replace = FALSE" to ensure each index is selected just once. This is different from
the bootstrap discussed in Sect. 4.4. The difference between the two is due to
resampling with replacement and resampling without replacement, which will be
discussed in Sect. 4.4.1.

4.3 Extended Resampling Methods for Error Estimation

The following two resampling methods provide extensions of the holdout set and
the k-fold CV methods to improve the accuracy of the estimated errors. In contrast,
the stratified k-fold CV method deals with the problem of imbalanced samples.

4.3.1 Repeated Holdout Set

The repeated holdout set method is just a repeated application of the holdout
set method recently described. Importantly, each application performs a new
randomization of the original data. Also, the data separation into training and testing
data is usually not done with equal proportions; instead, 2/3 of the data points are
frequently used for training and 1/3 for testing.
The advantage of the repeated holdout set over the ordinary holdout set, discussed
in Sect. 4.2.1, is that it can average over R repeats. This also allows one to estimate
a standard error, while this was not possible for the holdout set method.

4.3.2 Repeated K-Fold CV

The repeated k-fold CV method is a repeated application of the k-fold CV method, described earlier. Each application of the k-fold CV method performs a
new randomization of the original data.
The purpose of the repeated holdout set and the repeated k-fold CV is the same
— namely, to reduce the variability of the error estimator. The choice of the most
appropriate method depends on the underlying data and the sample size [283].

4.3.3 Stratified K-Fold CV

The stratified k-fold CV method can be used for data with an additional structure.
Such a structure could be provided by labeled data; that is, instead of data of the form {x1, . . . , xn}, we have {(x1, y1), . . . , (xn, yn)}. The simplest example would be a (multiple) classification, where the yi correspond to the labels of the data points xi.
The problem is that in this case we have at least two types of data points, one
for class +1 and one for class −1, but one needs data points from both classes for
training and testing. However, due to the randomization of the data, it may happen
that data points from one class either are completely missing in the training or testing
data or are disproportionally represented. This is especially problematic when the
classes are already imbalanced, as the randomization can amplify this imbalance
even further.

For this reason, a stratification of the data points can be necessary. Here,
“stratification” just means to perform a k-fold CV for each class (stratum) separately
to ensure the same proportion of data points from the strata is used for the training
and testing data sets.
A comparative analysis showed that stratified k-fold CV has a lower bias and
lower variance compared to regular k-fold CV [289]. Similar to the repeated holdout
set and the repeated k-fold CV method, these results also depend on the data and the
sample size.

4.4 Bootstrap

Now we come to a resampling method that is different from all the previous
methods. The bootstrap method was introduced by Efron in the 1970s, and it is
one of the first computer-intensive approaches in statistics [132, 501]. In general,
the bootstrap method does not lead to training and testing data. For this reason, it is
not used for error estimation, which requires first to estimate the parameters of the
model and then to estimate the error, but rather is used for parameter estimation.
The working mechanism of bootstrap is as follows. The method generates B new
data sets with B ∈ N that can be even larger than n. Each of these B data sets is
generated by drawing n samples with replacement. This means that it is possible
that data points can appear multiple times in a new data set. This implies that the
number of unique data points in each new data set can be smaller than n. This is
illustrated in Fig. 4.3, where in Set 1, data point 6 appears twice.

Fig. 4.3 Visualization of bootstrap. The method generates B new data sets with B ∈ N that can be larger than n. Each of these B data sets is generated by drawing n samples with replacement. This means that it is possible for data points to appear many times in a new data set (see the value "3" in Set 2).

In Listing 4.2, we show an example of generating one bootstrap set. Because we are resampling with replacement, the number of unique indices in the variable ind will usually be smaller than the sample size n. This can be seen using the command unique(ind). It is important to note that for the bootstrap method, one uses not only the unique data points but also the duplicated ones. This duplication induces a weighting when the data are used for estimations.
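A minimal sketch of generating one such bootstrap set, which is not identical to Listing 4.2, could look as follows:

n   <- 100                                       # sample size of the original data
ind <- sample(x = 1:n, size = n, replace = TRUE) # indices of one bootstrap set
length(unique(ind))                              # usually smaller than n due to duplications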

4.4.1 Resampling With versus Resampling Without Replacement

To clarify the difference between resampling with replacement and resampling without replacement, we show, in Fig. 4.4, a visualization of both resampling
approaches. The top row (A) shows resampling with replacement, and the bottom
row (B) shows resampling without replacement. In the left column, the data points
available before the sampling are shown. In our case, the sample size of the data is
n = 7.
Let’s suppose we sample m = 6 instances from these with replacement and
without replacement. The data points available after the sampling and the sampled
data points are shown in the middle and the right columns, respectively, in Fig. 4.4.
Because resampling with replacement will replace every drawn instance, the data
points after resampling (middle column) and the data points before sampling
(left column) are the same. In contrast, resampling without replacement does
not replace drawn instances, and for this reason the data points after resampling
(middle column) and the data points before sampling are not the same. Hence, the
middle column shows the instances that are available for further sampling after six
instances have already been drawn.
Also, the sampled instances, shown in the right column, are different. Resampling with replacement enables duplicate instances (see, for example, the orange
triangles), while for resampling without replacement, this is not possible. As a
consequence, for resampling with replacement, the number of unique instances can
be smaller than the number of drawn instances due to duplications (see Fig. 4.4).
In contrast, for resampling without replacement, the number of unique instances
corresponds to the number of drawn samples; that is, m. The latter means that for resampling without replacement, m always needs to be smaller than or equal to n.

Fig. 4.4 Comparison of resampling with replacement (a) and resampling without replacement (b).
The left column shows the data points available before sampling, the middle column shows the data
points available after sampling six instances, and the right column shows the sampled data points.

This example shows that there are crucial differences between the two resampling
approaches, and for this reason it is important to ensure that the appropriate one is
used when conducting an analysis.
In R, resampling with replacement is realized via the function sample() using the
option “replace = TRUE,” whereas for resampling without replacement, the option
“replace = FALSE” is required.
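For instance, the difference between the two options can be seen from the following minimal sketch, where the data vector x and the number of drawn instances m are illustrative choices.

x <- 1:7                                              # data points, n = 7
m <- 6                                                # number of instances to draw

with.repl    <- sample(x, size = m, replace = TRUE)   # duplicates are possible
without.repl <- sample(x, size = m, replace = FALSE)  # every drawn value is unique

length(unique(with.repl))      # can be smaller than m
length(unique(without.repl))   # always equal to m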

4.5 Subsampling

To estimate errors of learning algorithms, as well as for parameter estimation, it is
important to know how these estimates depend on the sample size. For this purpose,
subsampling can be used.
The basic idea of subsampling is to systematically reduce the sample size of
the original data set. Specifically, given a data set with sample size n, one reduces
successively the number of samples by randomly selecting x% of the data points
without replacement (see Fig. 4.5). For instance, one can obtain data sets with
{90%, 75%, 50%, 25%} of the original sample size n. These data sets can then be
used with any of the resampling methods discussed earlier; for example, to estimate
the classification error of a method.


Fig. 4.5 Visualization of subsampling. The method uses a random subsample of the original data. Hence, the resulting data set contains x% of the original data points, drawn without replacement.

Due to the random selection of x% of the data points without replacement, this
procedure needs to be repeated m times in order to obtain an appropriate average.
Frequently, one uses a value of m between 10 and 100, depending on the underlying
data and the computational complexity of the involved methods.
In Listing 4.3, we illustrate how to subsample a data set using R. The resulting
variable ind contains the indices of the original data points and not their values
(similar to all previous examples).
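A minimal sketch of such a subsampling step in R could look as follows; the sample size n and the fraction x are illustrative choices.

n <- 100                                                   # sample size of the original data
x <- 0.75                                                  # keep 75% of the data points

ind <- sample(1:n, size = round(x * n), replace = FALSE)   # indices, not values
length(ind)                                                # 75 indices drawn without replacement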

4.6 Different Types of Prediction Data Sets

In the previous sections, we distinguished between two different data sets: training
data and test data (also called testing data). In addition, we would like to note that
there are two further data sets to examine (see also Chap. 18):
• In-sample data
• Out-of-sample data

Both of these data sets are associated with the predictions of a model. Specifically,
if one uses the training data set as the testing data set, i.e., if one makes double
use of the training data, then such a set is called in-sample data. In contrast, if the
training data set is different from the testing data set, then the testing data are also
called out-of-sample data. Hence, the distinction does not introduce a new data set,
but just clarifies the role of a data set with respect to predictions of a model. Thus, a
prediction is based on either in-sample data or out-of-sample data.

4.7 Sampling from a Distribution

As seen in previous sections, when applying any resampling method, some form of
random assignment of data points is required. However, this means that one needs to
draw values from a probability distribution. In this section, we want to take a more
detailed look at what it means to draw a sample from a probability distribution. In
statistics, this is called sampling from a distribution. In the following, we start with
the simplest way of sampling from a distribution, which can even be done without
a computer. Then we show how to translate this strategy into a computational form,
which can be easily executed using R or any other programming language.
In Fig. 4.6, we show an example of a continuous probability distribution f. In order for f to be a probability distribution, it needs to hold that f(x) ≥ 0 for all x within its domain D_x and ∫_{D_x} f(x′) dx′ = 1. It is always possible to divide the domain D_x into a grid of m + 1 points of equal distance Δx = x_{i+1} − x_i, as shown in Fig. 4.6. This can be used to convert the continuous probability distribution into


Fig. 4.6 Given a continuous probability distribution f , and using a grid, one can convert f into a
discrete probability distribution.

a discrete probability distribution, where each of the m intervals, Δx, is assigned a probability according to

p_i = ∫_{x_i}^{x_{i+1}} f(x′) dx′    (4.7)

with i ∈ {1, . . . , m}. By defining the interval midpoints x̄_i = (x_i + x_{i+1})/2, one can assign

p_i = Prob(X = x̄_i)    (4.8)

to all i ∈ {1, . . . , m}, which defines a discrete probability distribution formally. It is clear that in the limit m → ∞, the width of the intervals goes to zero, i.e., Δx → 0, and one recovers the original continuous probability distribution.
We performed the preceding derivation to show that it is sufficient to have a
sampling procedure for a discrete probability distribution because every continuous
probability distribution can be approximated with a discrete one.
Now, let’s assume that we have such a discrete probability distribution as defined
in Eq. 4.8. The simplest way to utilize this discrete probability distribution is in a
physical sampling process; for example, via an urn experiment. Specifically, let’s
assume that we have an urn with N balls. Since our discrete probability distribution
assumes m different values because i ∈ {1, . . . , m}, we need to label Ni = N · pi
balls with label “i.” Here, a label could be either a name or a color. Either way, the
Ni balls need to be recognizable as similar. Because, usually, Ni = N · pi will not
result in an integer number, we need to round Ni toward the nearest integer. This
labeling process is repeated for all i ∈ {1, . . . , m} until all N balls have received
one label. Finally, we put all balls in an urn. By drawing one ball from this urn while
blindfolded, we simulate xi ∼ P , which corresponds to sampling from distribution
P with probabilities given by Eq. 4.8.
The preceding procedure can be summarized as follows:
1. N: number of balls
2. Number of balls with label “i”: Ni = N · pi (rounding needed)
3. Place all balls in an urn.
4. Draw one ball randomly (blindfolded) from the urn.
As one can see, this sampling procedure does not require a computer but rather
can be done with balls and an urn. Hence, it is a purely physical (mechanical)
process. However, this procedure can be easily converted into a computational form.
This is shown in Listing 4.4.
Following Listing 4.4 line by line, one can see how the instructions of the
preceding (mechanical) procedure are realized.
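A minimal sketch of this urn procedure in R could look as follows; the probability vector p and the number of balls N are illustrative choices.

p <- c(0.2, 0.5, 0.3)                # discrete probabilities p_i for i = 1, ..., m
N <- 1000                            # total number of balls

Ni  <- round(N * p)                  # number of balls carrying label i (rounded)
urn <- rep(1:length(p), times = Ni)  # place all labeled balls in the urn

ball <- sample(urn, size = 1)        # draw one ball "blindfolded" from the urn
ball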
Despite the simplicity of Listing 4.4, it can be further simplified by using the
function sample() available in R. Specifically, by using the option “prob,” one can

draw a sample from an arbitrary discrete probability distribution, as illustrated in
Listing 4.5. When using the function sample(), it is important to set the option
“replace = TRUE” to enable the drawing of the same value more than once.
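A minimal sketch of this direct approach could look as follows; the probability vector p and the number of draws are illustrative choices.

p <- c(0.2, 0.5, 0.3)                                  # discrete probability distribution
values <- 1:3                                          # the values to draw from

smp <- sample(values, size = 1000, replace = TRUE, prob = p)
table(smp) / length(smp)       # empirical frequencies approximate p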

Finally, we would like to note that R of course offers predefined functions to
sample from, and some examples are given in Listing 4.6. However, these functions
hide the complexity and make it unclear for the beginner how sampling actually
works. For this reason, we presented the preceding details.
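For completeness, a few of these predefined sampling functions are shown below; the parameter values are illustrative choices.

rnorm(5, mean = 0, sd = 1)          # 5 samples from a normal distribution
runif(5, min = 0, max = 1)          # 5 samples from a uniform distribution
rbinom(5, size = 10, prob = 0.3)    # 5 samples from a binomial distribution
rpois(5, lambda = 2)                # 5 samples from a Poisson distribution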
Further examples of probability distributions available in R can be found in [153].

4.8 Standard Error

In the preceding sections, we have seen that resampling methods provide a means to
estimate the probability distribution associated with the output of a prediction model
since such an output is a random variable. Usually, this probability distribution is
not estimated explicitly, but rather is characterized by its mean and its variance.
For instance, a k-fold CV estimates the mean error based on k folds; i.e.,

E_{k-CV} = (1/k) Σ_{i=1}^{k} E_i .    (4.9)

This means that the Ei for i ∈ {1, . . . , k} are drawn from an underlying probability
distribution P , i.e., Ei ∼ P , and by changing the training and testing data, the
statistical model will yield (slightly) different error values.
In the following, we will derive a general result for the standard deviation of the
mean in Eq. 4.9, which is called the standard error. Since the standard deviation
(σ ) corresponds to the square root of the variance (σ 2 ), we are using both terms
interchangeably.
To derive the standard error, let’s define a formal setting, because the results will
hold not only for the mean of errors but also for general mean values. For this reason,
let’s define the mean and variance by

Y = (1/n) Σ_{i=1}^{n} X_i ,    (4.10)
μ_Y = E[Y] ,    (4.11)
σ_Y^2 = E[(Y − μ_Y)^2] .    (4.12)

Here, Y is the sample mean of the data {X1 , . . . , Xn } and μY is its expectation
value, σY2 is the variance of Y , and σY is the standard deviation of Y . For clarity, we
would like to highlight that the expectation value of the mean μY is also called the
population mean.
The difference between the sample mean Y and the population mean μY is that
the former is based on a finite sample of size n given by {X1 , . . . , Xn }, whereas the
latter is evaluated by using the entire population. This requires knowledge about the
probability distribution P from which the Y is drawn, i.e., Y ∼ P , which is only
known in theory. That means a data sample is not sufficient.
There is one further distribution, Q, from which the data points are drawn, i.e., Xi ∼ Q. This distribution Q is characterized by

μ = E[X]    (4.13)
σ^2 = E[(X − μ)^2] .    (4.14)

Here, μ is the mean value of X and σ^2 is the variance of X.
First, let's derive a result for the population mean of Y:

μ_Y = E[Y] = E[(Σ_{i=1}^{n} X_i)/n]    (4.15)
= (1/n) E[Σ_{i=1}^{n} X_i] = (1/n) Σ_{i=1}^{n} E[X_i]    (4.16)
= (1/n) Σ_{i=1}^{n} E[X] = (n/n) E[X]    (4.17)
= μ.    (4.18)

This shows that the population mean of Y is the same as the population mean of X.
At first, it may appear confusing why we estimate a “mean of a mean,” but E[Y ]
is exactly this. The reason for this is that since the Xi are random variables, Y is also
a random variable, and for both variables, a mean can be estimated. The difference
is that the Xi are drawn from Q, whereas Y is drawn from P . Hence, the underlying
probability distributions of these two random variables are different.
Now, let's repeat such a derivation for the (population) variance of Y:

σ_Y^2 = E[(Y − μ_Y)^2] = E[(Y − E[Y])^2]    (4.19)
= E[((1/n) Σ_{i=1}^{n} X_i − E[(1/n) Σ_{i=1}^{n} X_i])^2]    (4.20)
= (1/n^2) E[(Σ_{i=1}^{n} X_i − E[Σ_{i=1}^{n} X_i])^2]    (4.21)
= (1/n^2) E[(Σ_{i=1}^{n} X_i − Σ_{i=1}^{n} E[X_i])^2]    (4.22)
= (1/n^2) E[(Σ_{i=1}^{n} X_i − Σ_{i=1}^{n} E[X])^2]    (4.23)
= (1/n^2) E[(Σ_{i=1}^{n} X_i − Σ_{i=1}^{n} μ)^2]    (4.24)
= (1/n^2) E[(Σ_{i=1}^{n} (X_i − μ))^2]    (4.25)
= (1/n^2) E[Σ_{i=1}^{n} (X_i − μ)^2]    (4.26)
= (1/n^2) Σ_{i=1}^{n} E[(X_i − μ)^2]    (4.27)
= (1/n) E[(X − μ)^2] = σ^2/n.    (4.28)

The step from Eq. 4.25 to Eq. 4.26 uses the independence of the Xi, so that the expectations of all cross terms vanish.
In summary, the preceding result means that there is a general relationship between
the (population) standard deviation of an observation X and the (population)
standard deviation of Y , which is given by
σ_Y = σ/√n .    (4.29)

It is important to emphasize that this result is based on expectation values, which
provide population estimates. However, when dealing with data, we need to have
sample estimates for the corresponding entities. For μ and σ , the sample estimates
are given by

X̄ = (1/n) Σ_{i=1}^{n} X_i ,    (4.30)

s = √( (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)^2 ) .    (4.31)

Overall, this gives the sample estimate of Eq. 4.29, given by

SE = s/√n .    (4.32)

Due to the importance of this result, the sample estimate of σY has its own name,
the standard error.
The important result of the preceding derivation is that whenever we estimate a mean value, for example, based on prediction errors {E_i}_{i=1}^{n}, one can estimate the (sample) standard deviation of this mean, which corresponds to the standard error. For practical applications, the results from resampling methods should always be summarized using the mean prediction error and its standard error.
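As a minimal sketch, the mean error and its standard error can be obtained in R as follows; the error values in errs are illustrative, for example, from k = 10 folds of a cross-validation.

errs <- c(0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.10, 0.14, 0.13)

mean.err <- mean(errs)
se       <- sd(errs) / sqrt(length(errs))    # SE = s / sqrt(n)

c(mean = mean.err, standard.error = se)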

4.9 Summary

In this chapter, we distinguished two applications for resampling methods: (1) error
estimation and (2) parameter estimation. Regarding error estimation, the results
from the resampling methods can be summarized as follows:

E_{HOS} = E    (4.33)

E_{LOO-CV} = (1/n) Σ_{i=1}^{n} E_i    (4.34)

E_{k-CV} = (1/k) Σ_{i=1}^{k} E_i    (4.35)

E_{r-HOS} = (1/m) Σ_{i=1}^{m} E_i    (4.36)

E_{r-k-CV} = (1/m) Σ_{i=1}^{m} E_{k-CV,i}    (4.37)

Here, Ei are the errors obtained from the testing data set i, n corresponds to the
sample size of the original data, and m is the number of repetitions of a resampling
method. As one can see, the holdout set (HOS) is the only method that does not
average over errors, since this method allows one to estimate only one such error.
From the discussion on the various resampling methods, we have seen that there
are many similarities. For instance, when setting k = n (the number of samples) for
the k-fold CV, this results in LOO-CV. In general, when choosing k, there are two
competing effects. For large values of k, the bias of the true error estimate is small,
but the variance of the true error estimator is large, whereas for small values of k,
the situation is reversed. In practice, common choices are k = 5 or k = 10. The
general drawbacks of cross-validation can be summarized as follows:
• The computation time can be long because the entire analysis needs to be
repeated k times for each model.
• The number of folds (k) needs to be determined.
• For a small number of folds, the bias of the estimator will be large.
For situations where the data are very limited, leave-one-out CV (LOO-CV) has
been found to be advantageous [346].
For repeated resampling methods, such as repeated k-fold CV or repeated
holdout set, the problems are similar because a large number of repetitions m
reduces the variance of the estimates but also introduces a computational burden.

Learning Outcome 4: Resampling Methods

Resampling and subsampling methods do not form standalone analysis
methods, but they complement others (for instance, prediction models) by
allowing one to “simulate” repeated experiments.

This implies that resampling and subsampling methods allow one to estimate
the underlying probability distribution associated with the outcome of a prediction
model (see Learning Outcome 2 in Chap. 2) for estimating the (mean) error and the
standard error.
Finally, we would like to remark that despite the fact that there are many
technical variations of cross-validation and other resampling methods (for example,
bootstrap) to improve the estimates [16, 295, 346], the approaches discussed in this
chapter are frequently utilized in practical data science projects.

4.10 Exercises

1. Study the number of unique instances obtained from resampling with replace-
ment. Start with n = 100 data points and estimate the percentage of unique
instances in dependence on m drawn samples for m = {10, 20, 50, 75, n}. In
order to obtain stable estimates, an averaging is needed. How do the results
change when varying n? Hint: Extend the code shown in Listing 4.7.

2. Generate a data set with n = 100 samples. For this data set, perform a
subsampling by drawing x = {90%, 60%, 40%} samples.
3. Convince yourself that the standard error is a standard deviation — however, not
for one observation, but for the sample mean.
4. Given the standard deviation of X and of Y as in Eqs. 4.14 and 4.12, what is the
limit of the corresponding sample estimates for the standard error and the sample
standard deviation?

lim_{n→∞} SE = ?    (4.38)
lim_{n→∞} s = ?    (4.39)
Chapter 5
Data

5.1 Introduction

When learning how to analyze data, it may appear natural to focus entirely on
methods. However, methods provide only one part of the story since they cannot
be used without data. Unfortunately, until recently, when the field of data science
emerged, “data” were severely underappreciated by the research communities,
giving the false impression that knowledge about data is less important than
understanding a method. To counteract this impression, we dedicate a chapter
exclusively to data at the beginning of this book.
In this chapter, we provide an overview of different data types. We will see that
data generation processes (for various data types) can be fairly different. Here,
we are not interested in acquiring enough knowledge to conduct experiments by
ourselves. Instead, our focus is on gaining a theoretical understanding that explains
the generation of the data and the underlying processes behind them. We believe that
a sensible data analysis is only feasible if one has a sufficient understanding of the
underlying phenomena, the data generation process, and the related experimental
measurements. For this reason, we describe in this chapter five different data types
and the fields from which they come. The descriptions will be brief, and for
professional data analysis projects, many more details may be needed. However,
the discussed examples should be sufficient to see a general pattern when dealing
with data.
In general, when working on a data analysis project, the data always require some
work regarding the following:
• Gathering
• Loading/storing
• Cleaning
• Preprocessing
• Understanding


This can become fairly complex and time-consuming, requiring weeks or even
months of work. Hence, the discussion in this chapter is only exemplary for general
data science projects to demonstrate the importance of the preceding steps.
By the end of this chapter, we will understand that “data” come in many
shapes and forms and so are very heterogeneous and situational with respect to
the phenomena with which they are associated. All of this will make it clear that the
understanding of data needs to be taken seriously when conducting a data science
project.

5.2 Data Types

In the following, we discuss five different data types: genomic data, network data,
text data, time-to-event data, and business data. Some of these data types are field
specific, whereas others can occur in many different fields. For instance, genomic
data occur in biology, medicine, and pharmacology, which are all life sciences. The
life sciences are fields that study living organisms, like plants, animals, or human
beings. In contrast, network data are not field specific, but rather can be found in any
field, including life sciences, physics, chemistry, social sciences, finance, business,
and economics.

5.2.1 Genomic Data

The Human Genome Project not only resulted in the sequencing of the human genome but also sparked the development of molecular measurement devices [396]. These measurement devices transformed biology in the early 2000s into a
technological field with the capability to generate large amounts of data. In contrast
with traditional experiments in biology, targeting only one or a few molecular
entities within biological cells at a time, the new measurement devices perform
high-throughput recordings of thousands or even tens of thousands of genes or
proteins. Importantly, such experiments can be performed not only in biology but
also in medicine and pharmacology to study human diseases or effects of drugs on
treatment. Nowadays, research in life sciences is usually data-driven due to the flood
of available data resources enabled by powerful high-throughput technologies.
According to the central dogma of molecular biology [93], every biological
cell consists of three fundamental information-carrying levels: genes, mRNAs
(messenger ribonucleic acids), and proteins. The genes provide information about
functional units stored in the DNA (deoxyribonucleic acid), also called the genome.
If activated, genes are transcribed into mRNAs, which are then translated into
proteins. This flow of information is visualized in Fig. 5.1, which shows a simplified
biological cell. Because this is common to all biological cells, the information
exchange between the three levels — DNA, mRNA, and protein — is fundamental.


Fig. 5.1 Simplified visualization of a biological (eukaryotic) cell. Every cell contains three
fundamental information-carrying levels: DNA, mRNA, and protein. On each level, measurements
about the state of a cell can be conducted.

We would like to highlight that there is also a connection between the protein
level and the DNA level (see the orange arrow in Fig. 5.1). Biologically, this is
provided by specific types of proteins, which are called transcription factors (TFs).
A TF binds to a particular region on the DNA (called the promoter region) to
regulate the activation of a gene (for more information about this, see the discussion
about gene regulatory networks in Sect. 5.2.2). Hence, the connections between the
three levels are circular, which can give rise to very complex dynamic processes.
Regarding the generation of data, the central dogma of molecular biology also
informs us about possible measurements that can occur on these three levels. For
instance, on the DNA level, we can obtain information about mutations of genes;
that is, changes in the sequence of nucleotides that form genes. On the mRNA level,
one can measure the concentration of mRNA within a cell, and on the protein level,
one can measure either the binding among two or more proteins or their three-
dimensional structure via crystallography. These are just a few examples, and for
each of them, there are dedicated measurement devices that allow one to measure
this molecular information for thousands of such entities. As an order of magnitude,
we want to mention that humans have about 22,000 genes (the exact number is, to
this day, not known).
In the following, we will focus on data that provide information about the
measurement of the concentration of mRNA. Such genomic (or omic) data are
called gene expression data. There are two major technologies that can be used for
measuring mRNAs: DNA microarrays and RNA-seq. The latter is a next-generation
sequencing (NGS) technology [494], whereas the former is based on hybridization
[430]. In general, measuring mRNAs is important because they allow the study of
the functioning of cells. Every cell of an organism contains the same amount of
DNA (collection of all genes); however, not all genes are active at all times, nor are
the same genes active in different cell types; for example, breast cells or neurons.
The mRNAs allow one, on the one hand, to identify active genes, and on the other
hand to verify the presence of proteins. The latter is important because proteins

are the active units in cells that are required for performing all the work so that an
organism can function properly.
Despite the fact that DNA microarray and RNA-seq technologies measure
mRNAs, the preprocessing steps from raw data to normalized data, which can be
used for an analysis, are considerably different. However, a commonality is that
both preprocessing pipelines are very complex. This means that the corresponding
pipelines can be structured into subproblems for which R packages exist to process
them. Note that each of these packages is typically the result of a PhD thesis. This
should give an impression of the complexity of the preprocessing and explain why
a detailed discussion of these steps is beyond the scope of this book. However, in
Fig. 5.2, we show an example result of such an analysis.
Specifically, the figure shows a heatmap of expression data for 106 genes (rows)
for 295 breast cancer patients (columns) [480]. A heatmap assumes a matrix form,
M, where the rows correspond to genes and the columns to samples, which in our
case are patients.
Definition 5.1 (Gene Expression Data) The expression matrix, M, is numerical;
that is, Mi,j ∈ R, with i ∈ {1, . . . , g} and j ∈ {1, . . . , s} where g is the total number
of genes and s is the total number of samples.
If expression data for all mRNAs (both protein coding and non-coding) are available, such data are called transcriptomics data, or transcriptome. For completeness,
we would like to mention that the entirety of the information about all proteins is
called proteomics data and the entirety of the information about all genes is called
genomics data (classically called genetics).
The heatmap in Fig. 5.2 clusters (clustering is discussed in detail in Chap. 7)
the patients according to the similarity of the expression values of their genes.
While the evaluation of a similar clustering for genes would be very complex, the
evaluation of the clustering for these patients is simple because each patient can be
medically categorized according to a “good prognosis” (red) or a “poor prognosis”
(green) with respect to overall survival. In Fig. 5.2, this is visualized at the top of
the heatmap, where one can see that the two main clusters are fairly good but not
perfect. Overall, this is a good example for an exploratory data analysis (EDA),
already mentioned in Chap. 1 and further elaborated in Chap. 6. This also shows
that creating a visualization is the first step in gaining deeper insights about the
underlying meaning of data.

5.2.2 Network Data

Network data appear when a system contains variables that are heterogeneously
connected to each other. In contrast to genomic data, this is not limited to a single
field but instead can occur in a variety of fields. Examples of areas where network
data are observed include chemistry, biology, economics, finance, and medicine
[25, 51, 97, 138, 141, 151]. Of course, the meaning of such networks is domain-
dependent and requires additional information.

Fig. 5.2 Heatmap of gene expression data. Shown are the expression values of 106 genes (rows)
for 295 breast cancer patients (columns) [480]. The patients are categorized on the top according
to “good prognosis” (red) and “poor prognosis” (green) with respect to survival.

To define a graph (or network) formally, we need to specify its set of vertices or
nodes, V , and its set of edges, E. That means any vertex i ∈ V is a node of the
network. Similarly, any element Eij ∈ E is an edge of the network, which means
that the vertices i and j are connected to each other. In the case of an undirected
graph, there is no direction for this connection, which means node i is connected
with j but also node j is connected with i.
Figure 5.3 shows an example of a simple undirected graph with V =
{1, 2, 3, 4, 5, 6} and E = {E12 , E23 , E34 , E14 , E35 , E36 }. It is clear from Fig. 5.3
that in an undirected network, the symbol Eij is symmetric in its arguments, i.e.,
Eij = Ej i , since the order of the nodes is not important.

Graph G with vertex set V = {1, 2, 3, 4, 5, 6} and edge set E = {E12, E23, E34, E14, E35, E36}. Adjacency matrix A:

0 1 0 1 0 0
1 0 1 0 0 0
0 1 0 1 1 1
1 0 1 0 0 0
0 0 1 0 0 0
0 0 1 0 0 0

Fig. 5.3 An example for a simple graph. Left: Visualization of an undirected graph G. Right: The
adjacency matrix, A, of the graph on the left-hand side.

Definition 5.2 An undirected network G = (V, E) is defined by a vertex set V and an edge set E ⊆ \binom{V}{2}.
E ⊆ \binom{V}{2} means that all edges of G belong to the set of subsets of vertices with two elements.
To encode a network mathematically, a matrix representation can be used.
The adjacency matrix, A, is a square matrix with |V| rows and |V| columns. The matrix elements, Aij, of the adjacency matrix provide the
connectivity of a network.
Definition 5.3 The adjacency matrix, A, of an undirected network G is defined by

A_{ij} = 1 if i is connected with j in G, and A_{ij} = 0 otherwise,    (5.1)

for i, j ∈ V.
The adjacency matrix, A, of the graph in Fig. 5.3, is shown on the right-hand side.
Since this network is undirected, its adjacency matrix is symmetric, i.e., Aij = Aj i
holds for all i and j .
Definition 5.4 (Network Data) The adjacency matrix, A, represents network data.
In Fig. 5.4, we show four real-world examples of networks. The first example
shows a chemical graph. In this case, chemical elements correspond to nodes, and
the bindings between the elements correspond to edges. The chemical structure
shown is serotonin.
The second example shows a small part of a friendship network or acquaintance
network. Such networks can be constructed from social media data — for example,
Facebook or LinkedIn or, as is the case for Fig. 5.4, Twitter. A general problem
with such networks and their visualization is that they can be very large. The
shown subnetwork shows that Barack Obama follows Joe Biden and Bill Clinton.
However, in addition, Obama follows 590,000 more Twitter users, and 130 million


Fig. 5.4 Four examples for real-world networks. 1. Chemical structure of serotonin. 2. Friendship
network from Twitter. 3. Bipartite graph corresponding to the diseasome. 4. Gene regulatory
network providing information about the activation of genes in biological cells.

users follow Obama. From this information, it becomes clear that it is impossible
to visualize all edges in the network, even when we focus only on Barack Obama.
Despite this complexity, friendship networks can be easily constructed by collecting
information about who-follows-who. That means such networks are constructed in
an edge-by-edge manner.
The third network is a so-called bipartite network. In contrast to the networks
discussed so far, a bipartite network consists of two types of nodes. In Fig. 5.4, the
first type of node corresponds to genes and the second to disorders. For simplicity,
let’s call the first node type G and the second node type D. A bipartite network has
the property that edges can only occur between nodes of different type; that is,

Eij = 1 if gi ∈ G and dj ∈ D. (5.2)



To distinguish such a network from a regular graph, we often write G = (G, D, E),
and the gene-disorder bipartite network was called diseasome [200]. Examples of
such gene-disorder pairs are BRCA1 for breast cancer, KRAS for pancreatic cancer, BRCA1 for ovarian cancer, and C9orf72 for ALS (amyotrophic lateral sclerosis).
The diseasome is another example of a network that is constructed edge by edge
[148], where the information about individual edges is obtained from databases;
for example, the Online Mendelian Inheritance in Man (OMIM) [375] provides
information about thousands of individual experiments.
The fourth network in Fig. 5.4 is a regulatory network or gene regulatory
network (GRN). A GRN describes molecular activities within biological cells
relating to the regulation of genes. Specifically, in order for genes to become
activated, a certain number of proteins (called transcription factors) need to bind
to the DNA to a promoter region. This is illustrated in Fig. 5.4, where the proteins
P1 to P3 activate the gene X. This sounds similar to a friendship network; however,
there is one crucial difference. While for a friendship network the “friends” are
directly observable — for example, from Twitter via “Followers” and “Following”
— this information is only indirectly available for a GRN. Here, “indirectly” means
this information needs to be inferred via statistical methods. One of these methods
is BC3Net (bagging conservative causal core network) [109], which is based on an
ensemble method (bagging) that is applied to statistical hypothesis tests.
The preceding examples show that the construction of a network can be simple,
following deterministic rules, as for acquaintance networks of Twitter friends, or
difficult, requiring the statistical inference of connections via hypothesis testing, as
for gene regulatory networks or financial networks [7, 328]. Hence, the generation
of network data can be quite involved.
From a practical point of view, for large networks with many nodes — for example, |V| > 10,000 — it can be preferable, in order to reduce the storage requirement on a computer, to use an edge list E instead of the adjacency matrix
A. Listing 5.2.2 shows an example of how to obtain the edge list from A for the
network in Fig. 5.3.
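A minimal sketch of such a conversion in base R could look as follows, using the adjacency matrix of the graph in Fig. 5.3; the variable names are illustrative.

A <- matrix(c(0,1,0,1,0,0,
              1,0,1,0,0,0,
              0,1,0,1,1,1,
              1,0,1,0,0,0,
              0,0,1,0,0,0,
              0,0,1,0,0,0), nrow = 6, byrow = TRUE)

E <- which(A == 1, arr.ind = TRUE)       # all index pairs (i, j) with A[i, j] = 1
E <- E[E[, 1] < E[, 2], , drop = FALSE]  # keep each undirected edge only once
colnames(E) <- c("node.i", "node.j")
E                                        # the edge list of the graph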

The reduction in storage requirements comes from the fact that an edge list
stores only information about existing edges and ignores all non-edges (non-existing
edges).

5.2.3 Text Data

The third data type is very interesting because text data are not numbers. Instead,
text data are symbol sequences that form a natural language, where each language
has its own characteristic symbols (given by the alphabet). Hence, when analyzing
text data, the first task is to convert symbol sequences into a representation that
is amenable for machine learning methods. This requires a form of mathematical
representation corresponding to, for example, vectors of numbers.
One aspect that makes the analysis of text data a difficult task is that the
conversion between symbol sequences and a (mathematical) representation is not
unique. Instead, many different approaches have been introduced and used over
the past decades. Here, we discuss five widely used text data representations:
part-of-speech (POS), one-hot document (OHD), one-hot encoding (OHE), term
frequency-inverse document frequency (TF-IDF), and word embedding (WE).
The idea of part-of-speech (POS) tagging is to assign each word (or lexical item)
in a sentence to a grammatical category. Examples for such categories are noun
(NOUN), determiner (DET), or auxiliary (AUX). For an example, see Fig. 5.5. Put
simply, words in the same POS category show a similar syntactic behavior in a
sentence by playing a similar role for the grammatical structure. For linguistics,
POS tagging allows a systematic analysis of the grammatical structure of texts. For
instance, in Fig. 5.5, we show the distribution of 49 categories as found in the novel
Moby Dick by Herman Melville, published in 1851.
The next two representations, one-hot document (OHD) and one-hot encoding
(OHE), are similar. Both are based on a bag-of-words model for a document (or
sentence). Specifically, given a vocabulary with V unique words, a binary vector of
length V is defined. Each component of this vector corresponds to one word, and its
element is 1 if the word is present in the document (or the sentence); see Fig. 5.5 for
an example. Here, V = 5 and two simple documents are considered, giving v1 for
document 1 and v2 for document 2.
Importantly, the number of appearances of a word is irrelevant. It is only
important if a word exists in a document. In contrast to OHD, one-hot encoding
is on the word level, which means a binary vector is formed that represents only
one word. For instance, according to the definition in Fig. 5.5 given by v, the word
“plus” corresponds to the vector vplus = (0, 0, 1, 0, 0).
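A minimal sketch of constructing such one-hot document vectors in R could look as follows; the helper function and variable names are illustrative.

vocab <- c("one", "two", "plus", "times", "is")

one.hot.document <- function(doc, vocab) {
  words <- tolower(strsplit(doc, "\\s+")[[1]])
  as.integer(vocab %in% words)             # 1 if the word occurs in the document
}

one.hot.document("One plus one is two", vocab)    # v1 = (1, 1, 1, 0, 1)
one.hot.document("One times one is one", vocab)   # v2 = (1, 0, 0, 1, 1)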
The term frequency-inverse document frequency (TF-IDF) representation is
an extension of the previous representation and considers relative frequencies.
Specifically, to define TF-IDF, one needs two components, term frequencies and
inverse document frequencies. The term frequencies are defined for each word
(term), wi , in a document, dj , by

Text representations: POS (part-of-speech), OHD (one-hot document), OHE (one-hot encoding), TF-IDF (term frequency-inverse document frequency), and WE (word embedding).

POS example: "This" "sentence" "is" "an" "example" "." is tagged as "DET" "NOUN" "AUX" "DET" "NOUN" "PUNCT". The figure also shows the POS tag distribution of Moby Dick.

OHD example: document 1: "One" "plus" "one" "is" "two"; document 2: "One" "times" "one" "is" "one"; with v = ("one", "two", "plus", "times", "is") this gives v1 = (1, 1, 1, 0, 1) and v2 = (1, 0, 0, 1, 1).

Fig. 5.5 Examples of text representations.

TF(w_i, d_j) = #{w_i | w_i ∈ d_j} / len(d_j)    (5.3)

where len(dj ) is the length of document dj (total number of words — not the unique
number of words) and #{wi |wi ∈ dj } is the word frequency (how often does wi
appear in dj ).
The document frequencies give information about the fraction of documents
containing a certain word, wi , out of all documents, and it is defined by

DF(w_i) ∼ #{d | w_i ∈ d} / |D| ,    (5.4)

where |D| is the total number of documents, i.e., D = {d1 , . . . , d|D| }. The inverse
document frequency is just the inverse of the preceding expression; however, this

inverse is usually scaled using the logarithm as follows:

IDF(w_i) = log( |D| / #{d | w_i ∈ d} ) .    (5.5)

The resulting TF-IDF for each word and document is then given by

TF-IDF(w_i, d_j) = TF(w_i, d_j) · IDF(w_i) .    (5.6)

As an example, we consider the following (simple) documents, each consisting
of one sentence:

d_1: one plus one is two;    (5.7)
d_2: one times one is one.    (5.8)

For this we obtain the following:

TF("one", d_1) = 2/5    (5.9)
TF("plus", d_1) = 1/5    (5.10)
TF("is", d_1) = 1/5    (5.11)
TF("two", d_1) = 1/5    (5.12)
TF("one", d_2) = 3/5    (5.13)
TF("times", d_2) = 1/5    (5.14)
TF("is", d_2) = 1/5    (5.15)

and

IDF("one") = log(2/2) = 0    (5.16)
IDF("plus") = log(2/1) = 0.69    (5.17)
IDF("is") = log(2/2) = 0    (5.18)
IDF("two") = log(2/1) = 0.69    (5.19)
IDF("times") = log(2/1) = 0.69.    (5.20)

The resulting term frequency-inverse document frequencies for this example are
shown in Listing 5.2, which also provides steps to calculate these values using R.
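A minimal sketch of such a calculation in base R could look as follows; the variable names are illustrative, and the rounding of the output may differ slightly from the values quoted above.

docs <- list(d1 = c("one", "plus", "one", "is", "two"),
             d2 = c("one", "times", "one", "is", "one"))
vocab <- unique(unlist(docs))

tf  <- sapply(docs, function(d) sapply(vocab, function(w) sum(d == w) / length(d)))
idf <- sapply(vocab, function(w) log(length(docs) / sum(sapply(docs, function(d) w %in% d))))

tf.idf <- tf * idf      # rows correspond to words, columns to documents
round(tf.idf, 2)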
Overall, a word that is representative of a document, for example, because it
appears often in this document but not in others, receives a high TF - IDF value,
whereas a word that appears often in all documents receives a low TF - IDF value
due to its lack of being representative for any single document.

By using the TF-IDF values, one can now extend the definition of one-hot-
document by substituting the “1”s with the values of the term frequency-inverse
document frequencies while the zeroes remain unchanged. This leads to real valued
vectors that can be used to compare documents. For our example, this gives the
following:

v = ("one", "two", "plus", "times", "is")    (5.21)
v_1 = (0, 0.13, 0.13, 0, 0)    (5.22)
v_2 = (0, 0, 0, 0.13, 0)    (5.23)

All of these measures are defined for a bag-of-words. A natural extension of a
bag-of-words approach is a bag-of-n-grams. In general, an n-gram is a contiguous
sequence of n items from a given text. These items can be, for instance, letters,
syllables, or words. Put simply, an n-gram is any sequence of n tokens. Using n-
grams, for example, of words, one can again determine the TF-IDF. This allows one
to study more complex entities, rather than single words, and their meaning for a
given text.
The last text representation we will discuss is a word embedding (WE) model. In
contrast with the previous representations, a WE model learns a real-valued vector
representation of a word from text data. The learning considers the context of a
word, at least to some extent, and hence it is different from approaches based on
bag-of-words and even bag-of-n-grams.
Technically, there are a number of different methods — frequently utilizing
neural networks — for learning such a vector representation of a word, but word2vec

was the first one introduced in 2013. The most popular architecture of word2vec is
based on continuous bag-of-words (CBOW) [337]. The CBOW model predicts the
current word from a window of surrounding context words, where the order of the
context words has no influence on the prediction. The dimension of a vector is a
parameter, and it is not determined by the size of the vocabulary. The learned vector representations of words allow one to construct vectors for entities beyond words, such as phrases or entire documents, to study their similarity.
Definition 5.5 (Text Data) Text data are symbol sequences that do not correspond
to numbers. To analyze text data mathematically, there are different text repre-
sentations (for instance, POS, OHD, OHE, TF-IDF, WE), which have different
interpretations and applications.
We conclude by remarking that the field dedicated to analyzing text data is
natural language processing (NLP), and recent years have seen many advances that
utilize methods from deep learning (see Chap. 14).

5.2.4 Time-to-Event Data

The next data type is called time-to-event data. Such data have yet another form
compared to all other data types, which is very characteristic. In its simplest form,
time-to-event data can be represented as a triplet given by:

time-to-event data: (ID, t: time duration, c: censoring) (5.24)

Here, ID corresponds to the identification number of, for example, a patient, time
duration t is a time interval, and censoring c is a binary label, such as, c ∈ {1, 2} with
1 corresponding to censoring occurred and 2 indicating the event occurred. A crucial
difference with many other data types is that the preceding information, that is, t and
c, cannot be directly measured by an experiment. Instead, this information needs
to be extracted from a certain process, and this process is domain or application
dependent.
To provide a motivating example, let’s discuss time-to-event data in a medical
context. In Fig. 5.6 (top), we show patients participating in a study and receiving
a treatment at a hospital. Each treatment, such as a surgery, will lead to a patient-
specific health record that contains all the medical information about the patient.
It is clear that not all patients receive the surgery on the same day at the hospital,
but rather this occurs over a longer period, perhaps months or years, depending on
the duration of the study. From the patient-specific health records, one can then
extract the described time-to-event data. However, this requires the definition of an
“event.” In a medical context, an example of an event is “death.” Other examples
are “relapse” or “exhibiting symptoms.” This allows one to obtain t; for example,
corresponding to the time from surgery to the time of death.


Fig. 5.6 Generation of time-to-event data in a medical context. Top: Patients receive treatment for
a particular disease. This treatment is extended over a certain time duration. Bottom: A summary
of time-to-event data extracted from health records of patients. Unobserved events lead to a complication that requires additional information, which is called censoring.

The meaning of the third entity of time-to-event data, censoring, is described
in Fig. 5.6 (bottom). Because the collection of patient health records occurs over
a period of time, one needs to distinguish between three different cases. For the
first case, represented by patient IDA and IDC , the event “death” occurs either in
hospital or outside the hospital, but the hospital is informed about the death of
the patient, and this event happens within the time period of the study. In both
cases, the health record of the patient can be updated by including this information.
For the second case, represented by patient IDB , the event “death” occurs at some
time; however, the hospital is not informed, or the study has ended already. Hence,
from the health record, only information about the last doctor visit of the patient is
available, and nothing after the last visit. The time of the last visit is called censored,
and it indicates a patient lived at least till this day. For the third case, represented by
patient IDD , the event “death” occurs within the study; however, the patient decided
before this to drop out of the study. Hence, this event, despite occurring within the
time frame of the study, is not observed. It is therefore labeled as censored.
Overall, for the example shown in Fig. 5.6, one gets the following time-to-event
data for the four patients:

(ID_A, t_1, c_1 = 2);    (5.25)
(ID_B, t_2, c_2 = 1);    (5.26)
(ID_C, t_3, c_3 = 2);    (5.27)
(ID_D, t_4, c_4 = 1).    (5.28)
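As a minimal sketch, these triplets can be stored in R as a small data frame; the time values below are illustrative, and censoring is coded as 1 (censored) and 2 (event observed).

tte <- data.frame(ID   = c("A", "B", "C", "D"),
                  time = c(5.2, 3.1, 7.8, 2.4),
                  cens = c(2, 1, 2, 1))
tte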

Despite their simple appearance, the analysis of these data is far from simple. In
Chap. 16, we will see the analysis methods developed for such data, which are
summarized under the term “survival analysis.”
For the description of time-to-event data, an “event” plays a pivotal role.
Depending on the application, there are various definitions of “event,” which makes
survival analysis widely applicable. In an industrial context, an event could be
the “malfunctioning” or “failure” of a component/device of a machine, while in
marketing an event could be the “purchase” of a product or “churn” of a customer.

5.2.5 Business Data

The last data type in this chapter is business data. Business data are similar to
expression data from genomic experiments, which assume a matrix form, but
business data come in the form of a table. Although a matrix and a table may look
similar at first glance, there is a crucial difference, as discussed later. Examples
of business data are customer, transaction, financial, product, or market research
data. Hence, one encounters such data frequently in management, economics, or
marketing.
Let’s look at an example. The data set UniversalBank.csv contains data about
5,000 customers of the Universal Bank. The data include customer demographic
information (age, income, etc.), the customer’s relationship with the bank (mort-
gage, securities account, etc.), and the customer’s response to the last personal loan
campaign (Personal Loan). In R, tables in csv format, such as those used by Excel, can be easily loaded, as the example in Listing 5.3 shows.
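A minimal sketch of loading such a table could look as follows; the file path is an assumption, and the file needs to be available in the working directory.

bank <- read.csv("UniversalBank.csv", header = TRUE)   # file path is an assumption

dim(bank)        # 5000 rows (customers) and 14 columns (features)
str(bank)        # overview of the features and their types
head(bank, 3)    # the first three customers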

The table contains 14 columns, corresponding to 14 features. These features have the following meaning: ID, customer ID; Age, customer's age in completed years; Experience, number of years of professional experience; Income, annual income
Experience, number of years of professional experience; Income, annual income
of the customer (1,000); ZIP code, home address zip code; Family, family size of
the customer; CCAvg, average monthly credit card spending (1,000); Education
(education level), 1 for undergrad, 2 for graduate, 3 for advanced/professional;
Mortgage, value of house mortgage, if any (1,000); Personal loan (Did this customer
accept the personal loan offered in the last campaign?), 1 for yes, 0 for no; Securities
Acct (Does the customer have a securities account with the bank?), 1 for yes, 0 for
no; CD Account (Does the customer have a certificate of deposit — CD — account
with the bank?), 1 for yes, 0 for no; Online (Does the customer use internet bank
facilities?), 1 for yes, 0 for no; Credit Card (Does the customer use a credit card
issued by the Bank?): 1 for yes, 0 for no.
A potential problem with business data is that their features can be of a different
level of measurement (see Sect. 10.6.4 for a detailed discussion of level [or scale]
of measurement). Put simply, this means that there are different types of features,
as can be seen by comparing, for example, the feature “Income” and “ZIP.Code.”
While the former can assume numbers, the latter is merely a label. This means
the number of a zip code could be replaced by any character string without losing
information. Due to the label type of the zip code, the summation of two zip codes
does not make any sense. So, despite the fact that a zip code is a number, its meaning
indicates it is a label.
This is the main difference between a table and a matrix. While a table can
contain different types of features, such as numbers, labels, character strings, and
so forth, a matrix contains only one type of feature. This is a crucial difference
between the gene expression data discussed earlier and the business data.
When analyzing data with mixed feature types, this needs to be taken into consideration because one cannot just treat a label as a number, since both variables convey
different information. For this reason, in statistics such variables are called nominal
or categorical. In general, when selecting a method, the level of measurement of the
features needs to be considered, and not every method can be used indifferently. In
Sect. 10.6.4, we will see that there are in total four different levels of measurement
one needs to distinguish, and three of those can be found in the data from the
Universal Bank (nominal, ZIP.Code; ordinal, Education; and ratio, Income).
Definition 5.6 (Business Data) Business data are represented by tables. The
entries in a table can correspond to different levels of measurement (mixed type of
information).
Other types of data that usually come in the form of a table are from economics,
finance, politics, psychology, or the social science. Since data from those domains
usually contain a mixture of levels of measurement, the same arguments for business
data are applicable to them.

5.3 Summary

The lessons learned from this chapter are the following: First, one should realize
that data-generating processes can be very complicated and are usually highly
domain specific. That means data generated in biology are very different from
data coming from economics. However, to conduct a sensible analysis, one needs
sufficient background information and insights about the application domain that
generated the data. For the many fields, including those discussed here, acquiring
this knowledge is nontrivial and requires a substantial effort. Second, the resulting
data types can also be quite different for different application domains. This also
affects the selection of prediction models, which can be used for the analysis since
different methods have different data requirements. Third, data may not even come
in a numerical form, as we have seen for text data studied in the field of natural
language processing, where a mapping from “text to numbers” is required before
any method can be used for an analysis.
Learning Outcome 5: Data

Data come in many different forms (data heterogeneity), which cannot be
mapped into each other. This zoo of data types requires diverse families of
prediction methods.

In the following chapters, where we focus on methods, the discussion about the
data will usually be short. However, this is merely for practical reasons that require
us to limit the length of our discussions. For real-world data analysis projects, the
analyst always needs to make sure that the data are addressed properly and with
sufficient detail.
Part II
Core Methods
Chapter 6
Statistical Inference

The basic idea of statistics is to use a data sample to draw conclusions about the
underlying population from which the data have been drawn. Since, in reality, a
sample of data always has a finite size, any conclusions reached about the population
are always uncertain to a degree. The goal of statistics is to quantify the amount
of uncertainty around the conclusions that are made based on a sample of data. In
general, statistical inference is the (systematic) process of making predictions about
a population, using data drawn from that population.
Figure 6.1 provides a visualization of the concept of statistical inference that
connects properties in a theoretical world (upper part) described by probability
theory with properties in reality (lower part) described by statistics. Assuming a
population is given, such properties can be deduced probabilistically. To achieve
this, probability theory, that is, mathematics, can be used to study a population.
Meanwhile, the lower part of Fig. 6.1 corresponds to a data sample and the
properties that can be inferred from it. Since a sample is always finite, such
conclusions are only estimates of the population properties. Overall, the upper part
belongs to a theoretical world in which the laws of mathematics hold by means
of probability theory, whereas the lower part represents the reality, which requires
methods for inferring conclusions by means of statistics or data science. These
worlds are separated from each other, and information can only be exchanged via a
data sample, which is not only small compared to the population itself but is usually
also corrupted by measurement errors and noise. That means even the small part of
the population we can observe is potentially corrupted. From this description, the
difficulty of the problem we are facing should become clear.
The preceding description is true for all fields dealing with the analysis of
data, including machine learning, artificial intelligence, and pattern recognition, but
statistics was the first field that articulated this explicitly. Hence, the term “statistical
inference” is usually connected to the field of statistics, although all data science-
related fields are facing the same issue.



Fig. 6.1 Conceptual framework of statistical inference that connects properties in a theoretical
world (upper part) with properties in reality (lower part).

We start this chapter by discussing exploratory data analysis (EDA) [475],
including descriptive statistics and properties of sample estimators. Both topics
provide valuable means for any data science project because such methods are
universally applicable. Then, we discuss the two main approaches for parameter
estimation, namely, Bayesian inference and maximum likelihood estimation (MLE),
and their conceptual differences. Finally, we discuss the expectation-maximization
(EM) algorithm as a practical way to find the (local) maximum of a likelihood
function or the maximum posteriori (maximum of the posterior distribution), which
corresponds to the parameter estimate of a statistical model, using an iterative
approach.

6.1 Exploratory Data Analysis and Descriptive Statistics

The first step of any statistical investigation consists of preprocessing the data,
assessing its quality, analyzing its structure, and calculating the associated summary
statistics.

6.1.1 Data Structure

Data can be defined as a collection of facts derived from experiments or observations. Aspects of the data to be taken into consideration for a statistical investigation
include the following:
• Sample size: This is the total number of data points collected; a very small sample
size is likely to lead to unreliable statistical results.
• Number of variables: For a large number of variables, one needs to investigate
whether all the variables are necessary.

• Type of variables: Variables can be categorized into two main classes:


– Qualitative or categorical variable
· Nominal, for example, patient gender (male, female)
· Ordinal, for example, response to a treatment (poor, mild, good)
– Quantitative variable
· Continuous, for example, patient age (0-120 years)
· Discrete, for example, number of admissions to care (1, 2, 3, etc.)

• Independence of the observations: We need to investigate whether the observa-


tions from the variable of interest are independent from each other.
• Distributions of the variables: We need to investigate whether the variables are
following classical distributions; for example, the normal distribution.
We would like to remark that variables are sometimes also called features,
especially in the machine learning community.

6.1.2 Data Preprocessing

Quality data is essential for an effective use of statistical inference techniques, and
the data preprocessing step aims to assess the quality of the data.
There are four main stages in the preprocessing of the data:
• Data consolidation: data collection, data selection, and data integration
• Data cleaning: data auditing, data de-cleansing, imputation of missing values or
outliers, elimination of data inconsistency, and data quality validation
• Data transformation: data discretization/aggregation, data normalization, and
derivation of new attributes/variables
• Data reduction: variable reduction, sample reduction, and data balancing

6.1.3 Summary Statistics and Presentation of Information

Descriptive statistics can be defined as summary statistics of the whole data set as
well as the most important subgroups. The statistics to be calculated depend on the
type and number of variables under consideration, and they generally consist of the
following:
• Measures of location: They describe the central tendency of the data.
• Measures of scale: They describe the spread of the data; that is, the departure
from the central tendency.
• Measures of shapes: They describe the form/shape of the distribution of the data.
94 6 Statistical Inference

To be more informative, the summary statistics need to be presented using the


most suitable format, which can include the following:
• Tables: They are used for presenting data itself, summary statistics (such as
contingency tables), and final results.
• Graphs: They are used for displaying broad qualitative information, such as the
shape of a distribution, the form of a relationship between a pair of variables, the
presence of outliers, etc.; the most commonly used graphs for an exploratory data
analysis include histograms, bar charts, boxplots, scatter plot, and pie charts.
Due to their importance, in the following sections we discuss measures of
location, measures of scale, and measures of shape in more detail.

6.1.4 Measures of Location

A measure of location attempts to describe a typical individual sample through


a number. That means the possible complex characteristics of a population are
summarized by a single number that should reflect the centrality of the population.

6.1.4.1 Sample Mean

The sample mean, also called the arithmetic sample mean or the average, denoted x̄,
is defined as the sum of the observations of the sample divided by the sample size.
Therefore, the sample mean is estimated as follows:

1
n
x̄ = xi (6.1)
n
i=1

One of the problems associated with the sample mean is that a small number
of outliers can have a huge impact on its value. Such outliers could stem, for
example, from erroneous measurements. Importantly, it is possible that even just
one measurement error could have a determining effect on the sample mean.
Let’s illustrate this problem using a data sample with an outlier, which is 45, and
a similar data sample without this outlier (see Table 6.1).
The presence of the outlier has doubled the value of the sample mean without
outlier, which is a significant error in the description of the sample.

Table 6.1 Illustration of the sample mean on two data samples, where one contains an outlier, and
the other doesn’t.
Sample mean
Data sample with outlier 3 5 2 3 45 4 2 3 5 4 7.6
Data sample without outlier 3 5 2 3 4 4 2 3 5 4 3.5
6.1 Exploratory Data Analysis and Descriptive Statistics 95

Table 6.2 Illustration of the trimmed sample mean on a data sample containing an outlier.
Sample mean
Original data sample 3 5 2 3 45 4 2 3 5 4 7.6
10% trimmed data sample 2 3 3 3 4 4 5 5 3.6

6.1.4.2 Trimmed Sample Mean

The trimmed mean is the sample mean, which results from trimming a certain
percentage from both ends of the ordered original data sample. Let’s calculate the
10% trimmed mean of the following data sample containing an outlier, which is 45:
3, 5, 2, 3, 45, 4, 2, 3, 5, 4 (see Table 6.2).
Therefore, the trimmed mean can be used to address one of the shortcomings of
the sample mean; namely, its sensitivity to outliers. However, the threshold of the
data to be trimmed needs to be chosen carefully to take advantage of this potential
benefit of the trimmed mean without throwing away valuable observations within
the data.

6.1.4.3 Sample Median

The median of a data sample, denoted mx , is defined as the middle point of the
ordered observations from the sample. It is estimated by, first, ordering the data from
the smallest to the largest value and then counting upward for half the observations.
Let n denote the sample size of the data. Then, the value of the sample median
depends on whether the number n is even or odd. The sample median is given by
⎧  
⎨the n + 1 th ordered value

if n is odd,
2    

⎩the average of n th and n + 1 th ordered values if n is even.
2 2
Therefore,

mx = x(n+1)/2 if n odd, (6.2)


1 
mx = xn/2 + xn/2+1 if n even, (6.3)
2
where xi , i = 1, . . . , n are the ordered values of the observations from the sample
data.
For the following data sample

(180, 175, 191, 184, 178, 188), (6.4)

the ordered observations are


96 6 Statistical Inference

Table 6.3 Illustration of the sample median on two data samples, where one contains an outlier,
and the other doesn’t.
Sample median
Data sample with outlier 3 5 2 3 45 4 2 3 5 4
Ordered observations 2 2 3 3 3 4 4 5 5 45 3.5
Data sample without outlier 3 5 2 3 4 4 2 3 5 4
Ordered observations 2 2 3 3 3 4 4 4 5 5 3.5

(175, 178, 180, 184, 188, 191). (6.5)

Since the sample size is 6 (even), the median is the average of 180 and 184, which
is 182.
If we have one more observation value, say 189, then the ordered observations
are

(175, 178, 180, 184, 188, 189, 191) (6.6)

and the median is given by 184, since the sample size is odd.
In contrast with the sample mean, the sample median is quasi-insensitive to
outliers. In fact, the median is affected by at most two values in the sample data,
and these are the halfway points of the ordered observations data. Let’s illustrate
this desirable property of the sample median in Table 6.3 using a data sample with
an outlier, which is 45, and a similar data sample where this outlier is replaced by
the value 4.
The presence of the outlier didn’t affect the value of the sample median, which
is the same for both data samples.

6.1.4.4 Quartile

A quartile is any of the three values that divide an ordered data sample into four
equal parts, so that each part forms a quarter of the sample. Therefore, the second
quartile is nothing but the sample median.
For the following data sample

(180, 175, 191, 184, 178, 188, 189, 183, 197, 186, 172, 169, 181, 177, 170, 172),

the ordered observations are

(169, 170, 172, 172 | 175, 177, 178, 180 | 181, 183, 184, 186 | 188, 189, 191, 197).

The first, second, and third quartiles for this data sample are given here:
172 + 175
• 1st quartile = = 173.5
2
6.1 Exploratory Data Analysis and Descriptive Statistics 97

180 + 181
• 2nd quartile (sample median) = = 180.5
2
186 + 188
• 3rd quartile = = 187.
2

6.1.4.5 Percentile

A percentile is the data value that is greater than or equal to a certain percentage of
the observations in a data sample.
For the following sample

(180, 175, 191, 184, 178, 188, 189, 183, 197, 186, 172, 169, 181, 177, 170, 172),

the ordered data are

(169, 170, [172, 172], 175, 177, 178, 180, 181, 183, 184, 186, 188, 189, 191, 197).

The 20th percentile is the value that is greater or equal to 20% of the observations.
Since the sample size is 16, then the rank of the 20th percentile is given by 20% ×
16 = 3.2 ≈ 3. Therefore, the 20th percentile is 172.

6.1.4.6 Mode

The mode of a data sample is the observation that occurs most often in the sample,
in other words, the observation that is more likely to be sampled.
As an example, let’s consider the following data sample:

(1, 5, 2, 3, 4, 4, 2, 3, 5, 4) (6.7)

for which the ordered sample is

(1, 2, 2, 3, 3, 4, 4, 4, 5, 5). (6.8)

For this data sample, the mode is 4, since it is the most observed value.

6.1.4.7 Proportion

While the sample mean, sample median, quartile, and percentile are meaningful and
can be readily calculated for quantitative data, the most representative measure of
location for categorical data is the proportion or percentage of each of the categories
associated with the observations in the data sample.
98 6 Statistical Inference

For instance, “prevalence” is the proportion of patients with the disease of interest
(see Chap. 3), whereas “fatality” is the proportion of people who died due to the
event of interest, and so forth.

6.1.5 Measures of Scale

A measure of scale (or dispersion) attempts to describe the extent to which values
in a sample differ from some measures of location, for example, the sample mean
of the same sample.

6.1.5.1 Sample Variance

The sample variance describes the spread of the observations of the data sample
around the sample mean. It is given by the average of the squares of the difference
between the sample mean and each individual observation in the data sample. If x̄
denotes the mean of the sample x1 , . . . , xn , the sample variance, denoted s 2 , is given
by

1
n
s2 = (xi − x̄)2 . (6.9)
n
i=1

For statistical reasons, the following formulation of the variance is preferred due to
its theoretical properties; namely, it is an unbiased estimation of the sample variance.

1 
n
s =
2
(xi − x̄)2 . (6.10)
n−1
i=1

There are two other important measures of scale, which are directly derived from
the sample variance, namely, the sample standard deviation and the sample standard
error of the mean, which are defined as follows:
• The sample standard√deviation, denoted s, is given by the square root of the
variance; that is, s = s 2 .
• The standard error of the mean of a data sample, denoted SE, is obtained by
dividing the sample standard deviation by the square root of the sample size; that
s
is, SE = √ . A discussion of the standard error is provided in Chap. 4.
n
As an example, let’s consider the data sample (1, 5, 2, 3, 4, 4, 2, 3, 5, 4). For this
data sample, the sample variance, the sample standard deviation, and the standard
error are 1.8, 1.3, and 0.4, respectively.
6.1 Exploratory Data Analysis and Descriptive Statistics 99

Before we continue, we would like to highlight the difference between a sample


variance, estimated using Eq. (6.10), and the population variance given by Eq. (6.9).

6.1.5.2 Range

The range of a data sample is given by the difference between the largest and
the smallest observation in the sample. For instance, the range of the data sample
(1, 5, 2, 3, 4, 4, 2, 3, 5, 4) is given by 5 − 1 = 4.

6.1.5.3 Interquartile Range

The interquartile range, denoted IQR, for a data sample is given by the difference
between the 3rd and the 1st quartiles.
For the following data sample

(180, 175, 191, 184, 178, 188, 189, 183, 197, 186, 172, 169, 181, 177, 170, 172),

the 1st and 3rd quartiles are 173.5 and 187, respectively. Hence, the corresponding
interquartile range is 13.5.

6.1.6 Measures of Shape

A measure of shape attempts to describe some graphical properties of the distri-


bution of the data sample through some numerical values. The main properties of
interest include the symmetry of the distribution, its tendency to skew, its uniformity,
and its number of modes (that is, whether it is unimodal, bimodal, or multimodal).

6.1.6.1 Skewness

The skewness defines the asymmetry of a distribution; that is, the degree to which a
distribution is distorted either to the left or to the right.
If the values of the sample mean and sample median fall to the right of the mode,
then the distribution is said to be skewed positively. Therefore, with positive skew,
we have sample mean > sample median > sample mode.
If the values of the sample mean and sample median fall to the left of the mode,
then the distribution is said to be skewed negatively. Therefore, with negative skew,
we have sample mean < sample median < sample mode.
A measure of skewness is the Fisher-Pearson coefficient of skewness [384]. Let
x1 , x2 , . . . , xn denote a data sample of size n. Then, the Fisher-Pearson coefficient
of skewness, denoted γ , is given by
100 6 Statistical Inference

n
1
i=1 (xi − x̄)3
γ = n
, (6.11)
s3
where x̄ and s denote the sample mean and the sample standard deviation,
respectively.
The following adjusted variant of the Fisher-Pearson coefficient of skewness
(6.11) is also commonly used:

n(n − 1)
γadj = γ. (6.12)
n−2

Other measures of skewness include the following:


• The Pearson coefficients of skewness, denoted γP 1 and γP 2 , given by

x̄ − m̄
γP 1 = , (6.13)
s
3(x̄ − mx )
γP 2 = , (6.14)
s
where x̄, m̄, s, and mx denote the sample mean, the sample mode, the sample
standard deviation, and the sample median, respectively.
• The Galton measure of skewness, also known as Bowley’s measure of skewness
or Yule’s coefficient, denoted γG , given by

Q1 + Q3 − 2Q2
γG = , (6.15)
Q3 − Q1

where Q1 , Q2 , and Q3 denote the first, second, and third quartiles, respectively,
of the data sample.

Remark 6.1 The skewness coefficient of a symmetrical distribution (for example,


the normal distribution) is near zero. Negative values for the coefficient of skewness
indicate that the data are left skewed, whereas positive values indicate that the data
are right skewed.

6.1.6.2 Kurtosis

The kurtosis is a measure of the peakedness of a symmetrical distribution. Let


x1 , x2 , . . . , xn denote a data sample of size n. Then, the kurtosis of the sample,
denoted η, is given by
n
1
i=1 (xi − x̄)4
η= n
, (6.16)
s4
6.1 Exploratory Data Analysis and Descriptive Statistics 101

where x̄ and s denote the sample mean and the sample standard deviation,
respectively.
The following adjusted variant of kurtosis (6.16) is also commonly used:

ηadj = η − 3. (6.17)

Using the definition of kurtosis provided by Eq. (6.16), the kurtosis of the normal
distribution has an expected value of 3. Therefore, using the definition of kurtosis
provided by Eq. (6.17), the kurtosis of the normal distribution is approximately 0.
Remark 6.2 The skewness coefficient and the kurtosis can be used to quantify the
extent to which a distribution differs from the normal distribution.

6.1.7 Data Transformation

Some data could benefit from transformations of the observations to get the sample
“fit” for statistical analysis. For quantitative data, the transformation can be used for
the following:
• Reducing the skewness:
– To reduce right skewness, we can take square roots, logarithms, or reciprocals
of the observations in the sample.
– To reduce left skewness, we can take squares, cubes, or higher powers of the
observations in the sample.
• Achieving approximate normality: the observations of the sample can be stan-
dardized by calculating the z-score for each observation, xi , which is given by

(xi − x̄)
zi = , (6.18)
s
where x̄ and s denote the mean and standard deviation of the sample.
• Stabilizing the variance: the form of the transformation is dictated by the
dependence of the variance on the other sample characteristics; for example, the
sample mean.
Categorical variables can also benefit from special transformations. The most
common transformation is made on the proportion or percentage values of the
categories. This is referred to as the logit (or logistic) transformation, and it is
defined as follows:
• logit (p) = log(p/(1 − p)) for proportions,
• logit (p) = log(p/(100 − p)) for percentages,
where p is a proportion or a percentage.
102 6 Statistical Inference

6.1.8 Example: Summary of Data and EDA

In the following, we provide a numerical example using data of the recurrence time
to infection for kidney patients who are using portable dialysis equipment. The
time is measured from the point of the insertion of the catheter. The data set is
available from the package “survival,” available in R, and for the following analysis
we use only data from 18 patients experiencing a recurrent infection (corresponding
to uncensored data; see Chap. 5).
In Table 6.4, we provide summary statistics for measures of location and
measures of scale for the kidney patients’ data. The first part of the table presents the
measures of location, and the second half the measures of scale. Here, we assumed
that the data for the 18 patients are in a vector called “dat.”
Listing 6.2 gives details of the different measures. It is important to note that
there is no function for obtaining the mode in the base package of R. However, by
estimating the density of the recurrence times, this information can be obtained in
only two lines of code.

In addition to the summary statistics, Listing 6.2 also gives four examples of a
graphical summary of data using a histogram, density plot, boxplot, and empirical
cumulative distribution function (ECDF). The results of these visualizations are
depicted in Fig. 6.2. It is worth highlighting that the data underlying each of these
four different graphs is identical. From the different shapes of the graphs, one can
see the power of visualizing the data, as each graph emphasizes a different aspect of
the “distribution” of the data.
6.1 Exploratory Data Analysis and Descriptive Statistics 103

Table 6.4 Summary statistics for measures of location and measures of scale for the kidney
patient data.
Measure type Name R code Value for the kidney data
Location Sample mean mean(dat) 46.38
Sample median median(dat) 23
Trimmed mean (10%) mean(dat, trim=0.1) 42
Mode d <- density(dat)
d$x[which.max(d$y)] 14.58
Scale Variance var(dat) 2673.6
Range range(dat) 4159
IQR IQR(dat) 58
6
5

0.008
Frequency
4

Density
3

0.004
2
1

0.000
0

0 50 100 150 200 0 50 100 150 200


Recurrence times to infection Recurrence times to infection
1.0
Empirical distribution function
Recurrence times to infection
150

0.8
100

0.6
0.4
50

0.2
0.0
0

0 50 100 150 200


Recurrence times to infection

Fig. 6.2 Graphical summary of the kidney data using a histogram, a density plot, a boxplot, and
an empirical cumulative distribution function (ECDF).

A histogram gives a comprehensive overview of the frequency of binned


recurrence times corresponding to the number of patients in the respective bins.
In contrast, a density plot shows a smoothed probability density, allowing one to
obtain a global overview of the entire population, whereas a boxplot focuses on the
IQR corresponding to 50% of the population and its median value. Yet a different
perspective is provided by the ECDF, which integrates over recurrence times to give
104 6 Statistical Inference

a global overview of the percentage of the population experiencing an event (in our
case, a recurring infection).
Overall, this example shows how summary statistics, such as for descriptive
statistics, and graphical visualizations are used and why such measures provide
valuable information for an exploratory data analysis.

6.2 Sample Estimators

Statistics theory in general, and statistical inference in particular, is concerned with


the effective use of a data sample, collected from a given population, to devise a
model that involves the estimation of some unknown parameters. The parameters of
interest, which may have some physical interpretation, generally provide a concise
description of relatively complex data or represent some essential properties of the
system under consideration.
The form of the parameterization, that is, the choice of the model, can be
motivated by the following considerations:
1. The potential physical interpretation of the values of the parameters
2. Some theoretical properties of the parameters, such as the stability of the values
of the parameters
3. The properties of the procedure used to estimate the parameters, such as the
numerical stability of the procedure
In the following, we will assume that we have a data sample x1 , x2 , . . . , xn with n
independent observations of a random variable X defined on a probability space, and
the probability distribution of X depends on unknown parameters, say m, denoted
by the vector θ = (θ1 , θ2 , . . . , θm ).

6.2.1 Point Estimation

Suppose that m = 1. Therefore, θ = θ1 , and we have only a single parameter to


estimate. The point estimation of the parameter θ consists of constructing a function
h such that h(x1 , x2 , . . . , xn ) provides a “good” estimate of the parameter θ ; that is,
it should be closely concentrated around the true value of θ .
To explain this, we provide the following definitions.
Definition 6.1 The random variable h(X1 , X2 , . . . , Xn ) is called a point estimator
of θ , and its value, h(x1 , x2 , . . . , xn ), for the sample under investigation is called the
point estimate of θ .

Definition 6.2 For a given value of θ , the probability distribution of the random
variable h(X1 , X2 , . . . , Xn ), denoted fh (·|θ ), is called the sampling distribution.
6.2 Sample Estimators 105

Theoretically, the probability distribution of the random variable h(X1 , X2 , . . . , Xn )


can be obtained as follows:
⎛ ⎞ ⎛ ⎞
X1 Y1 = h(X1 , X2 , . . . , Xn )
⎜ X2 ⎟ ⎜ Y2 = X2 ⎟
⎜ ⎟ ⎜ ⎟
• Transform ⎜ . ⎟ to ⎜ . ⎟.
⎝ .. ⎠ ⎝ .. ⎠
Xn Yn = Xn
• Find the joint probability distribution of Y1 , Y2 , . . . , Yn .
• Find the marginal probability distribution of Y1 .
Note that the estimate of θ is likely to vary when we consider a different data
sample. Therefore, it is more suitable to generalize the point estimation to interval
estimation by constructing two functions, say h1 and h2 , with h1 (x1 , x2 , . . . , xn ) <
h2 (x1 , x2 , . . . , xn ), such that θ is likely to lie within the interval

[h1 (x1 , x2 , . . . , xn ); h2 (x1 , x2 , . . . , xn )]. (6.19)

The difference between point estimation methods revolves around the choice of
the function h. For a fixed value of the parameter θ , the sampling distribution fh (·|θ )
generally depends upon θ , the sample size n, as well as other unknown parameters,
which may not be of interest.
The aim of statistical inference is to assess or estimate this variation from
sample data. Some detailed discussions and illustrations of the most commonly used
approaches are the purpose of the subsequent sections in this chapter, as well as
Chaps. 10, 11, 12, and 18.

6.2.2 Unbiased Estimators

The aim of point estimation is to find a function h that is “closely concentrated


around the true value θ .” One of the most commonly used approaches to mathe-
matically express “closely concentrated around the true value θ ” is to require that

E[h(X1 , X2 , . . . , Xn )] = θ, (6.20)

and the variance V(h(X1 , X2 , . . . , Xn )) is small for any value of θ .


To define this formally, we give a couple of definitions.
Definition 6.3 The estimator h(X1 , X2 , . . . , Xn ) is called an unbiased estimator
of θ if

E[h(X1 , X2 , . . . , Xn )] = θ, ∀ θ ∈ , ∀ n ∈ N,

where denotes the parameter space, that is, the set of all possible values of θ , and
n is the size of the data sample.
106 6 Statistical Inference

Definition 6.4 The estimator h(X1 , X2 , . . . , Xn ) is called an asymptotically unbi-


ased estimator of θ if

E[h(X1 , X2 , . . . , Xn )] −→ θ, as n −→ ∞,

where n denotes the size of the data sample.

Definition 6.5 If h1 (X1 , X2 , . . . , Xn ) and h2 (X1 , X2 , . . . , Xn ) are two unbiased


estimators of θ , then
• h1 is said to be more efficient than h2 if

V[h1 (X1 , X2 , . . . , Xn )] < V[h2 (X1 , X2 , . . . , Xn )], ∀ θ ∈ , ∀ n ∈ N;

• h1 is said to be asymptotically more efficient than h2 if

V[h1 (X1 , X2 , . . . , Xn )] < V[h2 (X1 , X2 , . . . , Xn )], as n −→ ∞.

In either case, h1 would be preferred over h2 .


Definition 6.6 An unbiased estimator of θ , h∗ (X1 , X2 , . . . , Xn ), such that

V[h∗ (X1 , X2 , . . . , Xn )] ≤ V[h(X1 , X2 , . . . , Xn )], ∀ θ ∈ ,

for any other unbiased estimator, h, of θ is called the minimum variance unbiased
estimator. In other words, h∗ is the “best” unbiased estimator of θ .
Although it is desirable, in statistical analysis, to use unbiased estimators,
sometimes the following happens:
• The sought after unbiased estimator may be difficult to obtain or may not exist.
• The sought after unbiased estimator may not be a satisfactory estimate for the
problem of interest.
• The variance of the available unbiased estimator is too large.

6.2.3 Biased Estimators

To introduce this concept, we start with some definitions.


Definition 6.7 Suppose that h(X1 , X2 , . . . , Xn ) is an estimator of θ such that

E[h(X1 , X2 , . . . , Xn )] = θ + g(θ, n),


6.2 Sample Estimators 107

where n is the size of the data sample and g(θ, n) is called the bias. If the bias
g(θ, n) = 0, then h is called a biased estimator of θ .
Note that if ∀ θ ∈ , limn−→∞ g(θ, n) −→ 0, then h is an asymptotically
unbiased estimator of θ .
When h(X1 , X2 , . . . , Xn ) is a biased estimator of θ , the variance of h, V(h), is
not a suitable measure of the concentration of the values of h around θ , and the
mean square error (MSE) of h is preferred instead.
Definition 6.8 The MSE of an estimator of θ , h(X1 , X2 , . . . , Xn ), is defined as
follows:
 
MSE(h(X1 , X2 , . . . , Xn )) = E (h(X1 , X2 , . . . , Xn ) − θ )2 (6.21)

= E(h(X1 , X2 , . . . , Xn )) + [g(θ, n)]2 . (6.22)

Note that for an unbiased estimator h, we have MSE(h) = V(h).

Definition 6.9 Let h1 (X1 , X2 , . . . , Xn ) and h2 (X1 , X2 , . . . , Xn ) be two estimators


of θ . Then,
• h1 is said to be more efficient than h2 if

MSE[h1 (X1 , X2 , . . . , Xn )] < MSE[h2 (X1 , X2 , . . . , Xn )], ∀ θ ∈ , ∀ n ∈ N;

• h1 is said to be asymptotically more efficient than h2 if

MSE[h1 (X1 , X2 , . . . , Xn )] < MSE[h2 (X1 , X2 , . . . , Xn )], as n −→ ∞.

6.2.4 Sufficiency

There is no general approach for finding a suitable estimator for a parameter θ , and
we often resort to using a specified technique that provides a good estimator of θ .
However, there is a general framework, known as the principle of sufficiency, that
can be applied when estimating parameters from certain distributions.
Definition 6.10 Let ξ(X1 , X2 , . . . , Xn ) denote a statistic, that is, a function of
the random variables X1 , X2 , . . . , Xn , that doesn’t depend upon any unknown
parameter. Then, the statistic ξ(X1 , X2 , . . . , Xn ) is said to be sufficient for the
unknown parameter θ if the conditional distribution of X1 , X2 , . . . , Xn , given the
value of ξ , doesn’t depend upon θ .
108 6 Statistical Inference

The principle of sufficiency [167] states that all the useful information in a
random sample, for inference purposes, is contained in a sufficient statistic, if one
exists. Therefore, any statistical procedure for making inferences about a parameter
θ should be based on a sufficient statistic if one exists.
Finding a sufficient statistic is not a trivial problem. However, there are some
theoretical tools that could be valuable in the formulation of the problem. One of
them is the Fisher-Neyman factorization theorem [167, 363].
Theorem 6.1 (Fisher-Neyman Factorization Theorem) A statistic ξ(X1 , X2 , . . . ,
Xn ) is said to be sufficient for θ if and only if the joint probability distribution
function of X1 , X2 , . . . , Xn , denoted fX , can be written in the following form:

fX (x1 , x2 , . . . , xn |θ ) = s(x1 , x2 , . . . , xn )g(ξ(x1 , x2 , . . . , xn ), θ ),

where s and g are non-negative functions.

Problem 6.1 (Sufficiency) Let X1, . . . , Xn be n independent random variables


following a normal distribution N(μ, σ 2 ), where σ 2 is known. Find a sufficient
statistic for the unknown parameter μ.
The probability density function is given by
2
1 − (x−μ)
f (x|μ) = √ e 2
2σ .
2π σ

The corresponding joint probability density function is given by

)
n
fX (x1 , x2 , . . . , xn |μ) = f (xi |μ) (6.23)
i=1

)
n
1 (x −μ)
− i
2

= √ e 2σ 2 (6.24)
i=1
2π σ
 n
1 − 1 n
(x −μ)2
= √ e 2σ 2 i=1 i (6.25)
2π σ
 n  n 
1 − 1 (x −x̄)2 +n(x̄−μ)2
= √ e 2σ 2 i=1 i (6.26)
2π σ
  n 
− 1 n
(x −x̄)2 1 − 1 n(x̄−μ)2
= e 2σ 2 i=1 i √ e 2σ 2
2π σ
(6.27)
= [s(x1 , x2 , . . . , xn )] [g(x̄, μ)] , (6.28)

1
n
where x̄ = xi .
n
i=1
6.2 Sample Estimators 109

Therefore, based on the Fisher-Neyman factorization theorem, x̄ is a sufficient


statistic for μ.

Problem 6.2 (Counter-Example of Sufficiency) Let X1 , . . . , Xn be n indepen-


dent random variables following a Cauchy distribution with parameter θ ; that is,
1
f (x|θ ) = , −∞ < x < +∞.
π [1 + (x − θ )2 ]

The corresponding joint probability density function is given by

)
n
fX (x1 , x2 , . . . , xn |θ ) = f (xi |θ ) (6.29)
i=1

)
n
1
= . (6.30)
π [1 + (xi − θ )2 ]
i=1

Clearly, the joint probability density function (6.29) cannot be factorized in terms
of s(x1 , x2 , . . . , xn )g(ξ(x1 , x2 , . . . , xn ), θ ). Therefore, a sufficient statistic for the
parameter θ of a Cauchy distribution doesn’t exist.
A typical class of probability distributions that satisfy the Fisher-Neyman
factorization theorem is the regular exponential class (REC).
Definition 6.11 A probability distribution function, f (x|θ ), belongs to the REC if

f (x|θ ) = ea(θ)b(x)+c(θ)+d(x) . (6.31)

Some well-known distributions that belong to the REC include the following:
• The binomial distribution B(n, θ ), where n is known.
Since
 
n
f (x|θ ) = θ x (1 − θ )n−x , x = 0, . . . , n,
x
 
θ
then we have a(θ ) = log , b(x) = x, c(θ ) = nlog(1 − θ ), d(x) =
  1−θ
n
log
x
• The Poisson distribution: Poisson (θ ) given by

θ x e−θ
f (x|θ ) = , x = 0, 1, . . . ,
x!

where a(θ ) = logθ, b(x) = x, c(θ ) = −θ, d(x) = −log(x!)


110 6 Statistical Inference

• The normal distribution N(μ, σ 2 ), where either μ or σ 2 is known


• The negative binomial distribution with parameters (r, θ ), where r is known
• The gamma distribution with parameters (α, β), where the shape parameter α is
known
• The geometric distribution
• The negative exponential distribution
The aforementioned important properties of the point estimator (bias, mean
square error, sufficiency) provide “out-of-the-box” tools to assess the quality or the
qualitative behavior of the estimator.

6.3 Bayesian Inference

The principal idea of Bayesian inference is based on utilizing Bayes’ theorem. We


formulate Bayes’ theorem here for continuous random variables:

f (x|θ )p(θ )
p(θ |x) = (6.32)
p(x)

with

p(x) = f (x|θ )p(θ ) (6.33)

Here, θ is a parameter of a model, f (x|θ ), for which we would like to make an


estimate based on the observed data x.
In Bayesian inference, it is common to denote the preceding terms in the
following way:

p(θ |x) = f (x|θ ) p(θ ) / p(x) (6.34)


posterior likelihood prior evidence

whereas the evidence, (p(x)), is given by Eq. 6.33. The prior distribution p(θ )
summarizes our knowledge about the parameter θ before we observe the data
x. That means the prior is data independent. The likelihood f (x|θ ) defines a
statistical model from which the data could have been generated. Finally, the
posterior distribution p(θ |x) gives us an updated distribution of the parameter θ .
Here, “updated” means with respect to our knowledge before the data, given by the
prior distribution.
To make a Bayesian inference about a model parameter θ , we need, in addition
to the likelihood f (x|θ ), which defines the statistical model, a prior distribution
over θ , given by p(θ ). This distribution needs to be determined independent of the
observed data x. That means the knowledge about x must not be used to specify
6.3 Bayesian Inference 111

the distribution p(θ ). This explains the name “prior distribution,” because this
distribution should be specified before the data are observed.
Finally, to make an inference about a model parameter θ , one needs to specify a
rule that is used to extract this information from the posterior distribution p(θ |x).
There are several measures of location that can be used, and we want to mention
three of them.

posterior mean θ̄ = θp(θ |x)dθ (6.35)

θm
posterior median θ̄ = θm : p(θ |x)dθ = 0.5 (6.36)

posterior mode θ̄ = argmaxθ p(θ |x) (6.37)

The first is the posterior mean, which corresponds to the expected value of θ . The
second is the posterior median, which has the advantage of being more robust
against outliers, compared to the posterior mean. Finally, the posterior mode is the
maximum of the posterior distribution. For this reason, it is also called the maximum
a posteriori (MAP). An advantage of the MAP estimator over the posterior mean
and the posterior median is that if p(θ |x) is graphically visualized, the MAP can be
easily spotted by eye. For this reason, in the following we use the MAP estimator.
Bayesian inference can be used for three different purposes that will be discussed
in more detail in the following sections:
1. Parameter estimation
2. Prediction
3. Model selection
Before we continue, we would like to note that the first two points just discussed
— that is, parameter estimation and prediction — are similar since a parameter
estimation is a prediction; namely, a prediction of the (true) parameter value.
Alternatively, one can see prediction as parameter estimation, where the value of
the error is the parameter to be estimated.
Importantly, the difference is that usually parameter estimation is a one-step pro-
cess, whereas prediction is a two-step process. Specifically, one can see parameter
estimation as the following mapping:

D → parameter estimation. (6.38)

Here, the data, D, are used to estimate the value of a parameter. In contrast, for
prediction we have the following two mappings:

D → parameter estimation (of α), (6.39)


M(α, D ) → prediction (of the error). (6.40)
112 6 Statistical Inference

The first mapping estimates a parameter, let’s call it α, and the second mapping uses
this parameter (in addition to another data set D ) to make an estimation of the error
of a model M, and this mapping is called the prediction. Hence, prediction requires
two mappings, two data sets (D and D ), and the coupling of both mappings by
using the estimated parameter α as the parameter of the model M.
Leaving philosophical considerations aside, the understanding of the principal
idea behind Bayesian inference is quite simple, as one can see from the preceding
formulation. However, practically, the realization of a Bayesian inference can be
intricate. In the following, we will not present a comprehensive discussion but rather
limit our focus to a special case of Bayesian inference, which is based on the so-
called conjugate priors.

6.3.1 Conjugate Priors

In Table 6.5, we present an overview of known conjugate priors for likelihood


functions. For all of these configurations, the calculation of the posterior distribution
is straightforward because a closed-form expression for the posterior is known due
to the particular form of the (conjugate) prior, and it is given in the first column of
the table.
This information enables one to simplify the estimation considerably because for
all the cases shown in Table 6.5, a Bayesian inference can be conducted analytically
without any numerical approximations.

6.3.2 Continuous Parameter Estimation

In this section, we present an application of conjugate priors that illustrates the


working mechanism of Bayesian inference.
Problem 6.3 Suppose that there is an infinitely large urn containing a fraction θ
of red balls and a fraction 1 − θ of blue balls. We draw n balls from this urn and
observe x red balls. What is θ ?
This problem is solved via a five-step procedure:

Table 6.5 Conjugate priors and their corresponding likelihood functions.


Posterior Likelihood Prior
μP ·σL2 +μL ·σP2 σP2 ·σL2
Normal: N( , ) Normal: N(μL , σL2 ) Normal: N(μP , σP2 )
σP2 +σL2 σP2 +σL2
Beta: Be(x + α, n − x + β) Binomial: B(n, θ) Beta: Be(α, β)
Gamma: G(α + γ , β + x) Gamma: G(γ , δ) Gamma: G(α, β)
Gamma: G(α + 0.5, β + (μ − x)2 /2) Normal: N(μL , σL2 ) Gamma: G(α, β)
6.3 Bayesian Inference 113

(1) Define the likelihood.


(2) Define the prior.
(3) Calculate the denominator p(x).
(4) Calculate the posterior distribution p(θ |x).
(5) Obtain an estimate of the parameter from the posterior distribution.

Step 1 The likelihood of drawing n balls of which x are red from an infinitely large
urn is a binomial distribution given by
 
n x
f (x|θ ) = θ (1 − θ )n−x . (6.41)
x

We would like to note that the infinite size of the urn is important here as it
guarantees that the fraction of red balls, given by θ , does not change when removing
some balls from the urn. If θ were not constant, we would need to account for this.
Alternatively, instead of assuming an infinitely large urn, we could also draw the
n balls one after another, and each time, after we had observed the color of a ball, we
placed it back in the urn. This way, the number of balls in the urn would be constant,
and this would also guarantee a constant probability θ of drawing a red ball.

Step 2 From Table 6.5, we see that the conjugate prior of a binomial distribution is
a beta distribution. Thus, the prior is given by the following beta distribution:

(α + β) α−1  β−1


p(θ ) = Be(θ |α, β) = θ 1−θ . (6.42)
(α)(β)

The value of Be(θ |α, β) is defined for any θ between 0 and 1. That means for any
values of the parameters α and β, there will be always a value of θ ∈ [0, 1], which
could be used as an estimate of the proportion of red balls.

Step 3 The next step is to calculate the denominator in Eq. 6.32 using Eq. 6.33, as
follows:
1
p(x) = f (x|θ )p(θ )dθ
0
1 n (α + β) α−1  β−1
= θ x (1 − θ )n−x θ 1−θ dθ
0 x (α)(β)
 
n (α + β) 1
= θ x+α−1 (1 − θ )n−x+β−1 dθ
x (α)(β) 0
 
n (α + β) 1
= θ α −1 (1 − θ )β −1 dθ
x (α)(β) 0
114 6 Statistical Inference

 
n (α + β) (α )(β )
=
x (α)(β) (α + β )
 
n (α + β) (x + α)(n − x + β)
= (6.43)
x (α)(β) (α + n + β)

Here, we defined α = x + α and β = n − x + β to simplify the notation.

Step 4 Finally, we calculate the posterior probability from the preceding com-
ponents. After this step, we will also have a full appreciation of using a beta
distribution as a prior distribution and also understand what we mean by a conjugate
prior.

f (x|θ )p(θ )
p(θ |x) =
p(x)
(α + n + β)  β−1
= θ x (1 − θ )n−x θ α−1 1 − θ
(x + α)(n − x + β)
(α + n + β)  n−x+β−1
= θ x+α−1 1 − θ (6.44)
(x + α)(n − x + β)

As one can see, the resulting posterior distribution p(θ |x) is again a beta dis-
tribution, as is the prior, but with different parameters. In general, whenever the
prior and the posterior are probability distributions from the same family, we call
the prior conjugated. This is the case for the preceding example because the beta
prior distribution and the binomial likelihood “fit together,” or conjugate each other.
Hence, the combination of a prior and a likelihood is key for this analytical result.
In Fig. 6.3, we show nine numerical examples for different values of the sample
size n and the proportion of red balls in the urn θ . In the first row, we assume that
θ = 1.0, meaning that there are only red balls in the urn. For this reason, we always
observe as many red balls as the number of samples we draw; that is, x = n. In each
figure, we show the prior distribution in red and the estimated posterior distribution
in blue. We include the prior distribution to remind us that our initial guess for θ is
uninformative because it is based on a uniform prior distribution. Drawing samples,
we can see that by increasing the sample size (from left to right), our estimates of
θ are getting closer to the true value of θ = 1.0, which is shown as dotted vertical
lines.
In the second row of Fig. 6.3, we assume a fixed proportion of red balls given
by θ = 0.2 (dotted vertical lines). Furthermore, we also assume that we always
observe x = θ · n red balls. Here, the symbol y indicates the floor function,
which rounds down the real value y to the nearest integer value. This rounding is
necessary because x needs to be an integer value. The assumption x = θ · n is
quite strict and artificial, but it will help us in making a point, as we will see later.
From the second row in Fig. 6.3, we can see that our estimates of θ are always
correct for any value of the sample size n because the maximum of the posterior
6.3 Bayesian Inference 115

6
prior
prior prior
posterior
posterior posterior
sample size n ,x sample size n ,x
sample size n ,x

prior prior
prior posterior
posterior posterior
sample size n ,x sample size n ,x
sample size n ,x

prior prior
posterior posterior
prior sample size n ,x
posterior sample size n ,x
sample size n ,x

Fig. 6.3 Examples of a Bayesian inference for a continuous parameter p of a binomial likelihood
function.

distribution is always exactly at θ = 0.2. However, the difference between the three
examples is that our uncertainty about this value decreases with the increase of
the sample size from left to right. Here, by uncertainty we mean the variance of
the posterior distribution. This is plausible because the more information we have
(larger sample sizes), the more false estimates can be potentially eliminated.
In row 3 of Fig. 6.3, we assume again a fixed proportion of red balls given by
θ = 0.2 (dotted vertical lines). However, this time we do not fix the number of
observed red balls by assuming x = θ · n. Instead, we draw n samples from a
binomial distribution with probability θ = 0.2, as specified in Eq. 6.41. Hence, we
are actually conducting experiments to generate the data x, which corresponds to
the number of observed red balls. In contrast with the second row in Fig. 6.3, this
time all modes of the posterior distribution do not exactly correspond to θ = 0.2,
but rather are located around this value. The reason for this is that in none of the
experiments did the number of observed red balls correspond exactly to x = θ · n
116 6 Statistical Inference

(see the figure legend for observed x values in the three cases). This leads to an
error, which has some consequences. Interestingly, despite the different locations
of the mode of the posterior distribution, the variance is reduced with an increased
sample size, as in the second row in Fig. 6.3.

6.3.2.1 Example: Continuous Bayesian Inference Using R

Although the results in the previous section can be obtained analytically, writing a
computer program can be helpful. For this reason, in Listing 6.4, we provide the R
script to reproduce the results in Fig. 6.3. When executing this program repeatedly,
one will realize that each time a slightly different result is obtained. This aspect is
studied in Exercise 6.

6.3.3 Discrete Parameter Estimation

In this section, we show how to conduct a Bayesian inference for a discrete


parameter. That means, in contrast with Sect. 6.3.2, where we assumed θ was a
continuous parameter, in this section we assume that θ is discrete; that is, it can take
values {θ1 , . . . , θk }. In this case, we make a Bayesian inference from

f (x|θi )p(θi )
p(θi |x) = , (6.45)
j f (x|θj )p(θj )

which is the discrete version of Eq. 6.32.


6.3 Bayesian Inference 117

In Sect. 6.3.2, we provided an analytical solution to Problem 6.3, where a


binomial likelihood is combined with a prior corresponding to a beta distribution.
The rationale behind this is that the posterior is a beta distribution and its parameters
are given by the parameters of the likelihood and the prior. However, you may
wonder, what if there is no known analytical solution to a problem? Or what if
there is such a solution but you don’t know it?
In the following, we study again Problem 6.3; however, this time we do not
make use of the elegant analytical approach based on a conjugate prior for the
continuous parameter θ , as in Sect. 6.3.2. Instead, we will study a discrete version
using numerical means.
Let’s assume that we limit the values of θi to d equally spaced discrete values
between 0 and 1 and that we do not have any information that would favor any of
these values. For this reason, we assume a non-informative prior distribution, which
assigns to each value of θi the same probability. In Listing 6.5, we present the R
script that implements Eq. 6.45 to produce the results shown in Fig. 6.4.
118 6 Statistical Inference

For Fig. 6.4, we used three different values of d to show the influence on the
study of the number of points used to evaluate the parameter θ . Specifically, for row
1 we use d = 5; for row 2, d = 20; and for row 3, d = 50. Moreover, for each
column we use a different sample size: column 1 n = 5 and column 2 n = 20.
Clearly, for higher values of d the estimated posterior distributions become
increasingly smooth and very similar to the results from the analytical solution
(see Fig. 6.3). This is an important observation because it shows that the numerical
solution is as good as the analytical solution for large values of d. Interestingly, d
does not need to be extremely large in our example to provide sensible results, and
even values as low as d = 5 give a meaningful approximation.
The observation from this specific example is actually true in general, which
means we can always approximate an analytical solution numerically, and the
difference between both solutions decreases with increased values of d. However,
practically, the disadvantage of a numerical solution is that it is usually more time-
consuming because the likelihood and the prior need to be evaluated at many
different parameter points. For our one-dimensional problem considered here, this
is not a problem; however, for high-dimensional problems — say, with ten variables
— this can be computationally expensive because there are d 10 evaluations for the
ten variables, each with d discrete values. Fortunately, there are advanced Monte
Carlo approaches that have been developed for such cases.
In addition to the asymptotical equivalence between analytical and numerical
solutions, another advantage of using a discrete Bayesian inference is that there is
more flexibility in the choice of the priors compared to the continuous approach.
Specifically, using a beta distribution as a prior does not actually allow one to
assume any arbitrary functional shape, but is limited to the forms that can be
modeled by different values of the parameters α and β. For instance, the distribution
shown in Fig. 6.5 cannot be realized with a beta distribution or any other analytical
distribution. In this respect, a discrete numerical approach is more flexible than an
analytical one.

6.3.4 Bayesian Credible Intervals

In Sects. 6.3.2 and 6.3.3, we discussed Bayesian inference for point estimation.
However, so far we have not explicitly addressed the variability of such estimates.
Instead, we only discussed using measures of location, like the posterior mean
or maximum a posteriori (see Eqs. 6.35 and 6.37), to summarize the information
provided by the posterior distribution. Now, we turn our focus to the quantification
of the variability of Bayesian point estimates by means of so-called credible
intervals.
Suppose that for a given posterior distribution, p(θ |x), we would like to know
what the probability is of observing a value of the model parameter θ within the
interval (θl , θu ). Formally, this probability can be estimated by
6.3 Bayesian Inference 119

1.0
prior prior
posterior posterior
sample size = 5 , x = 3 sample size = 20 , x = 3
0.4

0.8
0.3

0.6
0.2

0.4
0.1

0.2
0.0

0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
θ θ

prior prior
0.25
0.10

posterior posterior
sample size = 5 , x = 3 0.20 sample size = 20 , x = 3
0.08

0.15
0.06

0.10
0.04

0.05
0.02
0.00

0.00

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
θ θ
0.10

prior prior
0.04

posterior posterior
sample size = 5 , x = 3 sample size = 20 , x = 3
0.08
0.03

0.06
0.02

0.04
0.01

0.02
0.00

0.00

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
θ θ

Fig. 6.4 Examples of a Bayesian inference for a discrete parameter θ of a discretized binomial
distribution. The connecting lines between the points are only included to enhance the visual
impression of the discrete distributions.
120 6 Statistical Inference

0.03
0.02
0.01
0.00

0.0 0.2 0.4 0.6 0.8 1.0


θ
Fig. 6.5 Discrete prior distribution that cannot be realized with a beta distribution.

θu
P r(θl < θ < θu ) = p(θ |x)dθ. (6.46)
θl

If we write

1 − α = P r(θl < θ < θu ), (6.47)

for α ∈ [0, 1], then (θl , θu ) is called the (1 − α) × 100% credible interval for θ .
The interpretation of the credible interval (θl , θu ) is straightforward from
Eq. 6.46, meaning that the probability of observing any value of θ within this
interval is 1 − α.
Interestingly, there are actually many intervals within the posterior distribution
that result in a probability of 1 − α. The two most common ones are shown in
Fig. 6.6. The left figure shows

θu
1−α = p(θ |x)dθ with p(θl |x) = p(θu |x), (6.48)
θl
6.3 Bayesian Inference 121

posterior distribution of

posterior distribution of
1−α 1−α

Fig. 6.6 Two examples for defining Bayesian credible intervals for the same underlying posterior
distribution p(θ|x).

whereas the right figure depicts

θu α θl ∞
1−α = p(θ |x)dθ with = p(θ |x)dθ = p(θ |x)dθ. (6.49)
θl 2 0 θu

We would like to emphasize that both definitions use the same underlying posterior
distribution p(θ |x). However, the values of θl and θu are set up differently.
The left credible interval has equal probability for the values of the posterior
distribution at the boundaries of the interval, that is, at θl and θu , whereas the right
credible interval has an equal tail probability of α/2 in the lower and upper tails of
p(θ |x).

6.3.5 Prediction

There are two ways to make a systematic prediction about a distribution of values.
The first is based on the prior and the second on the posterior distribution from a
Bayesian inference.

prior predictive density f (y ) = f (y |θ )p(θ )dθ ; (6.50)

posterior predictive density f (y ) = f (y |θ )p(θ |y)dθ. (6.51)

Here, p(θ ) is the prior and p(θ |y) is the posterior distribution of the data denoted y.
The distribution f (y ) corresponds to the predicted distribution of a new observation
y.
122 6 Statistical Inference

6.3.6 Model Selection

The last problem we discuss, and for which Bayesian inference can be used, is model
selection. Suppose that we have two parametric models M1 and M2 , and we would
like to know which of these models is a better description of the observed data
x. This problem is called model selection. In the following, we want to limit our
discussion to two models, M1 and M2 , but extending to a larger number of models
is possible.
If we have two models, we select model M1 with probability p(M1 ) and model
M2 with probability p(M2 ). Because there are only two models, it holds that
p(M1 ) + p(M2 ) = 1. A Bayesian approach to model selection is provided by using
a so-called Bayes factor (BF). A Bayes factor is defined as the posterior odds of
model M1 divided by the prior odds of model M1 :

oddspost (M1 )
BF = (6.52)
oddsprior (M1 )
p(M1 |x)/p(M2 |x)
= (6.53)
p(M1 )/p(M2 )

The posterior distributions p(Mi |x) are obtained from

p(x|Mi )p(Mi )
p(Mi |x) = for i ∈ {1, 2}, (6.54)
p(x)

where p(x|Mi ) is called the marginal likelihood and is given by

p(x|Mi ) = f (x|θi , Mi )p(θi )dθi for i ∈ {1, 2}. (6.55)

This probability is called “marginal” because the likelihood of model Mi is


integrated over its parameters θi .
If we substitute Eq. 6.54 into 6.53, then the evidence p(x) cancels out, and a
Bayes factor is given by the fraction of marginal likelihoods of the models; that is,

p(x|M1 )
BF = . (6.56)
p(x|M2 )

Once a BF has been calculated, its interpretation is simple. If BF > 1, then model
M1 fits the data better, while if BF < 1, then model M2 better fits the data. However,
what is not so simple is to define a vector + (θ1 , θ2 ) so that a BF is either larger
or smaller, and we are sufficiently confident to consider the findings robust.
6.4 Maximum Likelihood Estimation 123

6.4 Maximum Likelihood Estimation

Now, we turn our attention to another principal way of making a statistical inference
about a model parameter. This is called maximum likelihood estimation (MLE).
Definition 6.12 Let f (xi |θ ) be a probability function from which the observed data
x = {x1 , . . . , xn } are sampled. Then L(θ |x) = f (x|θ ) is called the likelihood
function because the distribution f (x|θ ) is considered as a function of its parameter
θ.

Definition 6.13 Based on the previous definition, a maximum likelihood estima-


tion of a parameter can be defined as follows:

θ̄ = argmaxθ L(θ |x). (6.57)

Hence, the maximum likelihood estimation is another way to obtain a point


estimator for the parameter of a model.
Problem 6.4 Suppose that we have a coin and we perform n independent random
experiments resulting in x = {x1 , . . . , xn } observations, where each xi is either head
(1) or tail (0). What is the probability of the coin’s coming up heads?
For each conducted experiment, we can assume that the random variable xi
is drawn from a Bernoulli distribution. This results in the following likelihood
function:

L(θ |x) = ni=1 θ xi (1 − θ )1−xi . (6.58)

For practical reasons, it is usually beneficial to take the logarithm of Eq. 6.58 and
maximize logL(θ |x) instead:
n 
 
logL(θ |x) = xi logθ + (1 − xi )log(1 − θ ) , (6.59)
i=1

n   
n 
= xi logθ + n − xi log(1 − θ ). (6.60)
i=1 i=1

To find the maximum of logL(θ |x), we calculate the first derivative and solve the
following:

dlogL(θ |x) !
=0 ⇒ (6.61)

n
i=1 xi n − ni=1 xi
− =0 ⇒ (6.62)
θ 1−θ
124 6 Statistical Inference

Fig. 6.7 Logarithmic


n = 30
likelihood function for 
n = 30 independent Bernoulli i
xi = 7
experiments where seven −
heads have been observed.

log L( |x)

− θestimate θtrue = 1/3

1
n
θ̄ = xi . (6.63)
n
i=1

To verify that this is indeed a maximum and not a minimum, we could calculate
the second derivative of logL(θ |x) to confirm it is positive for θ̄ :
n
d 2 logL(θ |x) i=1 xi n − ni=1 xi
=− + . (6.64)
dθ 2 θ2 (1 − θ )2

Alternatively, we show in Fig. 6.7 a graphical visualization of logL(θ |x), given by


Eq. 6.60, for an example where n = 30 and seven heads have been observed. We
can see that this function has its maximum at θ̄ = n1 ni=1 xi = 0.23, whereas the
true value the parameter θ is 1/3.

6.4.1 Asymptotic Confidence Intervals for MLE

The purpose of an MLE is to obtain a point estimator for a random sample that
has been drawn from a parametric distribution. Again, this does not tell us anything
about the variability of this estimate. To obtain some insights about this variability,
we can utilize the result from the following theorem, which provides an estimate for
the asymptotic sampling distribution of the MLE.
Theorem 6.2 Let θ̂ be an MLE that maximizes L(θ |x), with x = {xi , . . . , xn }, and
assume also that the second and third derivatives of the likelihood function exist.
Then, the asymptotic distribution of θ̂ for n → ∞ is given by
 
θ̂ ∼ N θ0 , (I (θ0 ))−1 . (6.65)
6.4 Maximum Likelihood Estimation 125

Here, θ0 is the true value of θ , and I (θ0 ) is the Fisher information.


The results of this theorem mean that θ̂ is normally distributed with mean θ0 and
variance I (θ0 ).
The Fisher information is defined as follows:
Definition 6.14 (Fisher Information) Let x = {xi , . . . , xn } be a random variable
with likelihood function L(θ |x) for a statistical model with parameter θ , and
λ(θ |x) = logL(θ |x) is the logarithm of this likelihood function. Assume that
λ(θ |x) is twice differentiable with respect to θ . Then, the Fisher information I (θ )
is defined by
 2 
I (θ ) = E λ (θ |x) (6.66)

 2
= λ (θ |x) f (x|θ )dx (6.67)

For applications, an alternative formulation of the Fisher information proves useful,


which we state here without proof.
Theorem 6.3 (Alternative Form of Fisher Information) The Fisher information
can be also written by
 
I (θ ) = −E λ (θ |x) (6.68)

=− λ (θ |x)f (x|θ )dx. (6.69)

To show how to calculate the Fisher information practically, let’s consider the
following example:
Problem 6.5 Suppose that we have a coin with a probability for head (1) of θ and
a probability for tail (0) of 1 − θ . Hence, this model generates random variables x,
from {0, 1}, with a Bernoulli distribution f (x|θ ) = θ x · (1 − θ )1−x . What is the
Fisher information?
If we observe just one sample, then

L(θ |x) = θ x (1 − θ )1−x (6.70)

is the likelihood function and

λ(θ |x) = xlogθ + (1 − x)log(1 − θ ) (6.71)

is the logarithmic likelihood. Calculating the first and second derivatives gives
126 6 Statistical Inference

x 1−x
λ (θ |x) = − ; (6.72)
θ 1−θ
x 1−x
λ (θ |x) = − 2 − . (6.73)
θ (1 − θ )2

From this, the Fisher information writes as follows:


 
I (θ ) = −E λ (θ |x) (6.74)

1   1  
= E x + E (1 − x) (6.75)
θ2 (1 − θ )2
  
1   1
= 2E x + 1−E x (6.76)
θ (1 − θ )2
1 1
= + , (6.77)
θ 1−θ

since E[x] = θ for a Bernoulli distribution.


In principle, we can now calculate the Fisher information for any sample of
arbitrary size, that is, x = {x1 , . . . , xn }, with a known likelihood function L(θ |x).
In this case, we denote the Fisher information In (θ ) to indicate that it is based on a
sample of size n. However, for independent and identically distributed (iid) samples,
this can even be simplified, because the following relation holds between In (θ ) and
I (θ ) (with I (θ ) = In=1 (θ )):
Theorem 6.4 (Additivity Property [176]) It holds

In (θ ) = nI (θ ). (6.78)

Problem 6.6 Suppose that we have a coin with a probability for head (1) of θ and
a probability for tail (0) that is 1 − θ . The coin is thrown n times, generating x =
{x1 , . . . , xn } with xi ∈ {0, 1}. Hence, each sample xi is drawn from a Bernoulli
distribution f (xi |θ ) = θ xi (1 − θ )1−xi . What is the Fisher information?
Using the result from Theorem 6.4, the Fisher information for Problem 6.6 can
be easily obtained for n draws of a coin, as follows:
1 1 
In (θ ) = n + . (6.79)
θ 1−θ

In Fig. 6.8, we show an example of an asymptotic sampling distribution of θn


for a Bernoulli sample with n = 20. The interval marked by the two vertical green
lines corresponds to the (1 − α) confidence interval. Here, we use α = 0.05, which
means that the shown interval is the 95% confidence interval. The red areas under
the distribution on the left and right sides each correspond to α/2.
6.4 Maximum Likelihood Estimation 127

Fig. 6.8 Asymptotic


θtrue θM LE n = 20
sampling distribution of θn
given by N(θ, (I (θ0 ))−1 ).

n
sampling distribution of
α/2 α/2

6.4.2 Bootstrap Confidence Intervals for MLE

In the previous section, we showed how to obtain confidence intervals for the MLE
of a model parameter θ . These intervals are based on asymptotic results; that is,
they assume a large sample size n. However, this does not guarantee that for small
n, the resulting confidence intervals are good approximations of the asymptotic
results. For this reason, in practice and for small n, it is more appropriate to estimate
confidence intervals numerically, using a bootstrap approach instead of results based
on asymptotic considerations.
To estimate confidence intervals for MLE, we can apply the following
procedure:
1. Estimate the MLE, θ̂ , for x = {x1 , . . . , xn } with xi ∼ f (x|θ0 ).
2. Generate new parametric bootstrap samples x b = {x1b , . . . , xnb } with xib ∼
f (x|θ̂ ).
3. Estimate new MLE, θ̂ b , for the new parametric bootstrap samples x b =
{x1b , . . . , xnb }.
The idea behind this procedure is to use the MLE estimate θ̂ of the sample x =
{x1 , . . . , xn } as a parameter for f (x|θ̂ ) to generate n new bootstrap samples. The
way the functional form of f (x|θ ) is used in combination with the MLE estimate θ̂
to generate new samples is called parametric bootstrap. Then, for b ∈ {1, . . . , B}
bootstrap samples, new MLE θ̂ b are calculated. From this, the sampling distribution
of θn can be estimated and then used to derive the desired confidence interval.
Problem 6.7 Suppose that we have a coin with a probability for head (1) of θ and a
probability for tail (0) that is 1 − θ . This model generates random variables x, from
{0, 1}, with a Bernoulli distribution f (x|θ ) = θ x (1 − θ )1−x . What is the bootstrap
95% confidence interval of θ̂ ?
128 6 Statistical Inference

θtrue = 1/3 θtrue = 1/3 θM LE n = 30


θM LE n = 10
sampling distribution of

sampling distribution of
Fig. 6.9 Sampling distribution of θn . The blue crosses correspond to the estimates from the
bootstrap approach, and the red dashed curves give the asymptotic distribution of the MLE. The
vertical lines in red and blue correspond to the 95% confidence intervals for the bootstrap and the
asymptotic distribution, respectively.

In Fig. 6.9, we show estimates of the sampling distribution of θn . The blue crosses
correspond to the results from bootstrap estimates that use the preceding procedure,
and the red dashed curve is the asymptotic distribution given by
 
N θ̂, (nθ (1 − θ ))−1 (6.80)

using the Fisher information in Eq. 6.79. We can see that for n = 10 and n = 30, the
approximate bootstrap estimates and the corresponding 95% confidence intervals
are quite similar to the results from the asymptotic distribution, although these
sample sizes are only moderately large.

6.4.3 Meaning of Confidence Intervals

It is worth discussing the meaning of confidence intervals for MLE in detail, as these
are frequently misinterpreted. First of all, a (1 − α) confidence interval does not
mean that an MLE for θ lies within a percentage of 100 × (1 − α) in this interval. If
one would like to have such an interpretation, one needs to use a Bayesian approach
and credible intervals, as discussed in Sect. 6.3.4. Instead, a (1 − α) confidence
interval for an MLE for θ means that, if we repeat the same experiment many times,
then, on average, in 100 × (1 − α) percent of the cases, a (1 − α) confidence interval
will include the MLE for θ .
The rationale behind this interpretation is as follows. To know the probability
for an MLE of θ to lie within a certain confidence interval, we need to consider
this MLE as a random variable, and we can only make probabilistic statements
about random variables. Since the MLE is a random variable, then we need to
define its prior probability, which brings us directly into a Bayesian framework. The
6.5 Expectation-Maximization Algorithm 129

problem is that by using a maximum likelihood framework we assume that there is


an unknown but fixed value of θ (the parameter of the distribution from which the
data are sampled) instead of a probability distribution.

6.5 Expectation-Maximization Algorithm

The expectation-maximization (EM) algorithm is an iterative method to estimate a


parameter, or a vector of parameters θ , of a parametric probability distribution while
attempting to maximize the associated likelihood function.
Although the EM has been widely popularized by Dempster, Laird, and Rubin
[103], similar approaches can be traced back to the work of Newcomb [361] in 1886
and McKendrick [332] in 1926.
The problem of interest can be formulated as follows. Let Y denote an n-
dimensional random real variable (i.e., Y ∈ Rn ) with a probability density function
fY (y|θ ), where θ is a parameter or a vector of parameters to be estimated. For a
given realization y of Y , we would like to estimate the value of θ , denoted θ ∗ ,
which maximizes the likelihood function of θ , given by

L(θ ) = fY (y|θ ). (6.81)

This yields the following optimization problem:

θ ∗ = arg max L(θ ), (6.82)


θ∈

where is the set of all potential values of θ and L is defined in Eq. 6.81.
In theory, various techniques can be used to solve the preceding optimization
problem. However, it is often difficult to define the topology of the set of potential
parameter values, . Therefore, it is customary to resort to some heuristic methods,
and the EM algorithm is one of the most commonly used to estimate solutions to
the problem in Eq. (6.82).
The EM assumes that there exists another random real variable X ∈ Rm with
a probability density function fX (x|θ ), such that for a given realization x of X, it
is computationally easier to find the value of θ , which maximizes the likelihood
function of θ given by

Lx (θ ) = fX (x|θ ). (6.83)

In this case, the optimization problem of interest is therefore

θ ∗ = arg max fX (x|θ ). (6.84)


θ∈
130 6 Statistical Inference

Since log() is a monotonically increasing function, then a value of θ , which


maximizes Lx (θ ), also maximizes logLx (θ ) = logfX (x|θ ). Often, it is easier to
find the parameter θ that maximizes the log-likelihood logfX (x|θ ).
In the EM approach, the vector y, which represents the observed data or the data
at hand, is referred to as the incomplete data, whereas the vector x, which represents
the “ideal” data (which would facilitate the estimation of the parameter θ ) we wish
we had, is called the complete data.
For a given estimate θ k−1 of θ and the data y, the EM algorithm estimates
logLx (θ ) as the conditional expected value of the random function logfX (X|θ ),
conditioned on y and θ k−1 as follows:

EX|y,θ k−1 (logfX (X|θ )) = fX|Y (x|y, θ k−1 )logfX (x|θ )dx, (6.85)
y

where fX|Y (x|y, θ k−1 ) is the conditional probability density function for the
complete data x, given θ k−1 and y, whereas y is the closure of the set

{x : fX|Y (x|y, θ k−1 ) > 0}.

This step of the EM algorithm is referred to as the expectation step.


The only unknown in the right-hand side of Eq. 6.85 is θ . Thus, it can be defined
as a function of θ for a fixed given value θ k−1 , denoted Q(θ |θ k−1 ); that is,

Q(θ |θ k−1 ) = fX|Y (x|y, θ k−1 )logfX (x|θ )dx. (6.86)


y

The maximization step of the EM algorithm consists of finding the value of θ that
maximizes Q(θ |θ k−1 ). The corresponding value of θ , denoted θ k , is the “optimal”
value of θ for the iteration k.
Therefore, the EM algorithm generates a sequence {θ k } of estimates of θ , and
during such a process, the following is expected:
1. The sequence {L(θ k )} is increasing; that is, ∀ k, L(θ k+1 ) ≥ L(θ k ).
2. The sequence {L(θ k )} converges to L(θ ∗ ).
3. The sequence {θ k } converges to θ ∗ .
In the discrete case, we need to replace the probability density function with the
probability mass function and replace the integral with the summation.
The EM algorithm can be summarized as follows:
Step 0: Initialization Let θ 0 denote an initial estimate of the parameter θ ; θ 0 is
generally generated randomly.

Step 1: Expectation Step Given the observed (or “incomplete”) data y and θ k−1 ,
the current estimate of the parameter θ , formulate the conditional probability
density function fX|Y (x|y, θ k−1 ) for the complete data X. Then, use the formulated
6.5 Expectation-Maximization Algorithm 131

conditional probability to compute the conditional expected log-likelihood as a


function of θ , as follows:

Q(θ |θ k−1 ) = fX|Y (x|y, θ k−1 )logfX (x|θ )dx. (6.87)


y

Step 2: Maximization Step Maximize the function Q(θ |θ k−1 ) over θ ∈ to find

θ k = arg max(Q(θ |θ k−1 )). (6.88)


θ∈

* * * *
Step 3: Exit Condition If *θ k − θ k−1 * < ε or *L(θ k ) − L(θ k−1 )* < ε, where
L(θ ) is defined in (6.81), for some ε > 0, then stop; otherwise, go to Step 1.
For a compelling discussion of the EM algorithm and its various extensions, we
refer the reader to the textbook of McLachlan and Krishnan [333].

6.5.1 Example: EM Algorithm

In this section, we provide an illustrative example of the EM algorithm, introduced


in [333] and further elaborated by Byrne [66]. Let W denote the non-negative
random variable representing the time to failure of an item, which is assumed to
be exponentially distributed; that is,

1 −w
f (w|θ ) = e θ, (6.89)
θ
where the parameter θ > 0 denotes the expected time to failure.
The associated cumulative probability distribution is given by
w
F (w|θ ) = 1 − e− θ . (6.90)

Suppose that we observe a random sample of n items and let wi , i = 1, . . . , n,


denote their corresponding failure times. We would like to use these data to estimate
the mean time to failure θ , which in this case is a scalar parameter. In practice, the
study generally terminates before we observe the random sample w1 , w2 , . . . , wn .
This means that we can only record the time to failure for the first r items (r < n)
that failed. For the n − r items whose times to failure were not observed at the end
of the study, we assume that their time to failure is T , the duration of the study. Such
a data set is referred to as censored, and let us denote it y = (y1 , . . . , yn ), with
132 6 Statistical Inference

+
yi = wi , for i = 1, . . . , r;
yi = T , for i = r + 1, . . . , n;

The censored data y can be viewed as the incomplete data, whereas the completed
data will be those obtained if the study terminated only when all the n items failed.
The probability of an item’s surviving for the duration of the study, T , is
T
1 − F (T |θ ) = 1 − 1 + e− θ (6.91)
T
= e− θ . (6.92)

Therefore, the likelihood or the probability density function of the vector of


incomplete data, y, is
 r  n 
) 1 yi )
−θ − Tθ
fY (y|θ ) = e e (6.93)
θ
i=1 i=r+1
 r 
) 1 yi T
= e− θ e−(n−r) θ . (6.94)
θ
i=1

The log-likelihood function of y is


 r  
) 1 yi
−θ −(n−r) Tθ
logfY (y|θ ) = log e e (6.95)
θ
i=1
 r 
1 
= −rlogθ − yi + (n − r)T . (6.96)
θ
i=1

In this particular example, the value of θ , which maximizes logfY (y|θ ), can be
obtained analytically by solving

∂logfY (y|θ )
= 0.
∂θ

The corresponding optimal value of θ , denoted θ̂ , is given by


 r 
1 
θ̂ = yi + (n − r)T .
r
i=1

However, this is not generally the case when dealing with incomplete data.
Now, assume that the actual times to failure of the items that did not fail during
the study are in fact known, and let’s denote the corresponding (completed) data as
6.5 Expectation-Maximization Algorithm 133

x = (w1 , . . . , wn ). Then, the likelihood or the probability density function of the


vector of complete data, x, is

)
n
1 − wi
fX (x|θ ) = e θ . (6.97)
θ
i=1

The log-likelihood function of x is

1
n
logfX (x|θ ) = −nlogθ − wi . (6.98)
θ
i=1

Again, in this example the value of θ that maximizes logfX (x|θ ) can be easily
obtained analytically by solving

∂logfX (x|θ )
= 0,
∂θ

and the corresponding optimal value of θ , denoted θ̂ , is given by

1
n
θ̂ = wi .
n
i=1

In the subsequent section, we are going to apply the EM algorithm to this


example. Assume that we are given the incomplete data, y, and we have the
current estimate of the parameter θ , θ k−1 . We will illustrate the expectation and
the maximization steps of the EM algorithm.
Expectation Step
Note that logfX (x|θ ) is linear in the unobserved data wi , i = r + 1, . . . , n. Then,
to calculate EX|y,θ k−1 (logfX (X|θ )), we just need to replace the unobserved values
with their conditional expected values, given y and θ k .
 n  
1 
EX|y,θ k−1 (logfX (X|θ )) = −nlogθ − yi + (n − r)θ k−1 . (6.99)
θ
i=1

Thus,
 n  
1 
Q(θ |θ k−1
) = −nlogθ − yi + (n − r)θ k−1
. (6.100)
θ
i=1
134 6 Statistical Inference

Maximization Step
The value of θ , which maximizes Q(θ |θ k−1 ), is obtained by solving

∂Q(θ |θ k−1 )
= 0,
∂θ
which yields
 n  
1 
θ =
k
yi + (n − k)θ k−1
. (6.101)
n
i=1

Exit
* Condition *
If *θ k − θ k−1 * < ε, where ε is a specified threshold, then the algorithm stops;
otherwise, we iterate the expectation step using the calculated value θ k .

6.6 Summary

In this chapter, we discussed the basics of statistical inference. We started by


discussing exploratory data analysis (EDA), which comprises descriptive statistics
and visualization methods for quickly summarizing information contained in data.
We have seen that descriptive statistics has a close relationship with sample
estimators, and for this reason we discussed the associated important properties.
The classical task of statistics is parameter estimation, for which two different
conceptual frameworks exist. The first is Bayesian inference, and the second is
maximum likelihood estimation (MLE). We discussed both methodologies and
emphasized the differences. Finally, we discussed the expectation-maximization
(EM) algorithm as a practical way to find the (local) maximum of a likelihood
function or the maximum posteriori (maximum of the posterior distribution) of a
statistical model using an iterative approach.
Learning Outcome 6: Statistical Inference

Statistical inference refers to the process of learning from a (small) data


sample and making predictions about an underlying population from which
data have been drawn. Due to the variety of ways to formulate this problem,
there are many approaches that can be used to investigate the different aspects
of the problem.
6.7 Exercises 135

We would like to emphasize that the preceding description is essentially true for
all fields dealing with the analysis of data, including machine learning, artificial
intelligence, and pattern recognition. However, statistics was the first field that
articulated and formulated this aim explicitly. Hence, the term “statistical inference”
is usually connected to the field of statistics, although all data science-related fields
utilize the same conceptual framework.

6.7 Exercises

1. Let X1 , X2 , . . . , Xn be some independent random variables following the normal


distribution N(μ, σ 2 ). Show that the estimator of the variance, σ , given by

1  1
n n
S2 = (Xi − X̄)2 , where the random variable X̄ = Xi ,
n−1 n
i=1 i=1

is an unbiased estimator.
2. Suppose H1 is a biased estimator of θ , with a small bias, but its MSE is
considerably lower than the MSE of H2 , which is an unbiased estimator of θ .
Which estimator is preferred?
3. Let H be an unbiased estimator of θ . Show that the estimator η(H ) is generally
not an unbiased estimator of η(θ ), unless the function η is linear.
4. Show that the following probability distributions belong to the regular expression
class (REC), and find the functions a, b, c, and d as well as a sufficient statistic
for the parameter, based on a random sample of size n:
a. The binomial distribution B(n, θ ), where n is known
b. The Poisson distribution with parameter θ
c. The geometric distribution with parameter θ , given by

f (x|θ ) = θ (1 − θ )x , x = 0, 1 . . .

d. The negative exponential distribution with parameter θ , given by

f (x|θ ) = θ e−θx , x ≥ 0, θ > 0

e. The normal distribution N(μ, σ 2 ), where σ 2 is known.


5. Let X1 , X2 , . . . , Xn be some independent random variables with the probability
distribution function, f (x|θ ), from the regular exponential class (REC).
a. Using
 the Fisher-Neyman factorization theorem, show that the statistic
b(Xi ) is a sufficient statistic for any distribution in the REC.
i
136 6 Statistical Inference

b. Suppose that the random variables X1 , X2 , . . . , Xn are from the binomial


distribution B(m, θ ), where m is known. Using the result from the previous
question, determine a sufficient statistic for the parameter θ .
6. Repeat the analysis shown in Listing 6.4 and record the estimates for the
maximum a posteriori (MAP). Why is the MAP a random variable? What is
the cause of this randomness?
Chapter 7
Clustering

7.1 Introduction

The task of grouping data points or instances into clusters is quite fundamental
in data science [218, 265]. In general, clustering methods belong to the area
of unsupervised learning [224] because the data sets using such methods are
unlabeled; that is, no information is available about the true cluster to which a
data point belongs. The aim of clustering methods is to group a set of data points,
which can correspond to a wide variety of objects — for example, texts, vectors, or
networks — into groups that we call clusters.
Many different approaches can be used for defining clustering methods. How-
ever, in this chapter, we focus on clustering methods based on similarity and distance
measures [108, 416]. Such methods provide criteria for thresholding the similarity
(or distance) of data points for their assignment to groups. So, in order to understand
the clustering techniques we want to study here, a thorough understanding of
similarity and distance is needed. Note that distance and similarity cannot be
defined fully formally, and we need to restrict the study to quantitative similarity
or distance measures for data points. Those can be real numbers or vectors or even
more complex structures such as matrices. Later, we discuss various similarity and
distance measures to induce clustering solutions.
In this chapter, we discuss the following two main classes of clustering methods:

1. Non-hierarchical clustering methods


2. Hierarchical clustering methods
Furthermore, we discuss the difficult problem of cluster validation. Because the data
used for clustering are unlabeled, judging the quality of clusters is a challenging
task and can be performed by using quantitative measures and/or by using domain
knowledge that requires further assumptions.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 137
F. Emmert-Streib et al., Elements of Data Science, Machine Learning, and Artificial
Intelligence Using R, https://doi.org/10.1007/978-3-031-13339-8_7
138 7 Clustering

We start this chapter by examining the task of clustering data. Then, we outline
existing similarity and distance measures that are frequently used in data science.
Afterward, we discuss important clustering methods [218, 265] and techniques for
evaluating clustering solutions.

7.2 What Is Clustering?

Cluster analysis, also called clustering, relies only on data of the type X = {xi }n1 ,
without additional information, such as class labels that would allow one to associate
individual data points xi with specific classes or, in general, specific categories.
Depending on the context, the data points xi are sometimes also called feature
vectors, profile vectors, or instances. In the following, we denote the number of
available instances by n; that is, i ∈ {1, . . . , n}. This corresponds to the sample size.
In contrast, we denote the length of the data points xi by p.
Suppose there are n = 300 patients suffering from a tumor. From biopsies
of each tumor, a molecular profile for each tumor is generated by measuring the
gene expression levels of p = 10,000 genes using, for example, DNA microarrays.
Similar data sets X = {xi }n1 have been used to identify clusters of tumors according
to their similarity. In this case, xi corresponds to patient i, or their tumor, and each
component of xi corresponds to the expression level of a gene. There are different
chip types available for DNA microarrays that allow to measure a different number
of genes (see Chap 5). For instance, there are chips that allow one to measure the
gene expression levels of p = 20,000 genes for the same tumor biopsies. As one can
see, in this case the number of patients (n = 300) would not change, of course, but
the number of measurements for the genes would. The preceding example allows
one to clearly see the asymmetry between p and n. For any analysis, it is important
to be very clear about the meaning of p and n in a given context, which is problem-
specific.
To complete our discussion, we need one further layer of argument. This relates
to identifying the features and the instances for a given data set. In fact, this is a
definition that needs to be articulated by the analyst. For instance, using our tumor
example, if we want to cluster tumors, the tumors become the instances (n) and the
genes the features (p). However, if we want to cluster genes, then the genes become
the instances (n) and the tumors the features (p). Both are valid perspectives using
the same data X [376]. Hence, the role of, for example, a patient is not always the
same, but rather depends on the question the analyst wants to ask.
Figure 7.1 summarizes the overall situation. It shows characteristics of the
data type, principles of major clustering methods, and some known clustering
approaches.
7.3 Comparison of Data Points 139

Data type:
X = {(xi )}n 1 with xi ∈
p

xi is called a data point, instance or feature vector


p: number of variables
n: number of samples

Question addressed:
Is there a ’structure’ between variables?

Principles of major clustering approaches:


Partitioning-based clustering =⇒ partitions X into K non-overlapping clusters
based on a distance measure, d, for data points resulting in clusters Ck =
{x1,k , . . . xNk ,k } such that X = ∪K
k=1 Ck , with k ∈ {1, . . . , K} and Nk is the
number of data points in cluster k.

Hierarchical clustering =⇒ hierarchical partitioning of X resulting in nested


clusters as function of a distance measure, d, for clusters called linkage func-
tion.

Fig. 7.1 Overview of cluster analysis with respect to the data type used, the question addressed,
and the principles of major approaches.

7.3 Comparison of Data Points

For clustering methods, similarity and distance measures are crucial to group data
points. Before we start outlining the various similarity and distance measures used
in data science, we will define such measures in general [49, 149].
Definition 7.1 Let X be a set and s a mapping s : X × X −→ [0, 1]. s(x, y) with
x, y ∈ X is called similarity measure if

s(x, y) > 0 (positivity). (7.1)


s(x, y) = s(y, x) (symmetry). (7.2)
s(x, y) = 1 iff x = y (identity). (7.3)

A distance measure can be defined similarly.


Definition 7.2 Let X be a set and d a mapping d : X × X −→ R+ . d(x, y) with
x, y ∈ X is called distance measure if

d(x, y) ≥ 0 (positivity). (7.4)


d(x, y) = d(y, x) (symmetry). (7.5)
d(x, y) = 0 iff x=y (identity). (7.6)
140 7 Clustering

If in addition the following holds:

d(x, z) ≤ d(x, y) + d(y, z), x, y, z ∈ X, (7.7)

then we call d(x, y) a distance metric [108].


The inequality in Eq. 7.7 is called the triangular inequality [108]. We would like
to emphasize that the properties of distance measures are sufficient for constructing
clustering methods. However, the triangular inequality and, hence, distance metrics
are required if one wants to define an embedded distance measure in a metrical
space.
Let’s study two examples that show how to correctly classify quantitative
measures for real numbers.
Example 7.1 Let’s define a quantitative similarity measure s between x, y ∈ R by

1
s(x, y) := . (7.8)
1 + |x − y|

Now, we show that all properties of Definition 7.1 hold.


Positivity: As |x −y| ≥ 0, we see that s(x, y) > 0 and that the positivity property
is fulfilled.
Symmetry: As |x − y| = |y − x|, the symmetry property is also satisfied.
Identity: To see the identity property, we first show that

1
s(x, y) := = 1 ⇒ x = y.
1 + |x − y|

Thus, we have to distinguish two cases when considering |x − y|.


The first case x − y ≥ 0 leads to 1+x−y
1
= 1 and, finally, to x = y.
The second case x − y < 0 leads to 1
1−(x−y) = 1. Again, this yields to x = y.
Now, it remains to show that if x = y ⇒ s(x, y) := 1+|x−y| 1
= 1. But if x = y,
then s(x, x) = 1. Altogether, we see that the measure given by Eq. 7.8 is a similarity
measure according to Definition 7.1.

Example 7.2 We define a quantitative distance measure between x, y ∈ R by

d(x, y) := |x − y|. (7.9)

Let’s show that all the properties in Definition 7.2 are fulfilled.
Positivity: The positivity property is fulfilled by the definition of the modulus
function.
Symmetry: The symmetry property follows directly because |x − y| = |y − x|.
Identity: To show that d(x, y) := |x − y| = 0 ⇐⇒ x = y, we start with
d(x, y) := |x − y| = 0 ⇒ x = y. Considering the two cases of |x − y|, we
always obtain x = y, as this follows from x − y = 0 and −(x − y) = 0. The second
direction follows immediately; x = y leads to d(x, y) = |x − x| = 0.
7.3 Comparison of Data Points 141

Let’s study a numerical example for the application of the measure from
Example 7.1. Suppose that x = 1 and y = 1.1. We would expect that the similarity
measure given by Eq. 7.8 yields a high similarity score, close to one, since the values
of x and y are quite close to each other. Numerically, we obtain s(1, 1.1) = 0.90909.
If we utilize the known distance-similarity relation [49]

d(x, y) = 1 − s(x, y), s(x, y) ≤ 1, (7.10)

we get d(1, 1.1) = 1 − s(1, 1.1) = 0.090910. That means a high similarity value
between x and y corresponds to a small distance between x and y.

7.3.1 Distance Measures

In this section, we discuss some specific distance measures [49, 218], which are
widely used in data science. For all the discussed distance measures, we assume
that the properties given in Definition 7.2 are satisfied.
Examples of commonly used distance measures for pairs of data points are as
follows:

 p  2

dE (x, y) =  xi − yi (Euclidian distance). (7.11)
i=1


p 1/α
dmin (x, y) = |xi − yi |p (Minkowski distance). (7.12)
i=1


p
dman (x, y) = |xi − yi | (Manhattan distance). (7.13)
i=1
dmax (x, y) = max |xi − yi | (maximum distance). (7.14)
i
dmin (x, y) = min |xi − yi | (minimum distance). (7.15)
i
1 − ρxi ,yj
dρ (x, y) = (correlation distance). (7.16)
2
For the Minkowski distance, α is a positive integer value, which gives for α = 2 the
Euclidean distance.
We would like to point out that each of the preceding distance measures provides
the minimum possible distance for d(x, x), which is called the self-distance. In this
case, d(x, x) = 0 for all possible data points x.
Let’s illustrate the calculation of the Euclidian distance via an example. Fig-
ure 7.2 depicts two vectors x = (x1 , x2 ) ∈ R2 and y = (y1 , y2 ) ∈ R2 . If we set
142 7 Clustering

Fig. 7.2 Calculating the y


Euclidian distance between
two-dimensional vectors by (x2,y2)
using Pythagoras’ theorem.

d y2-y1

(x1,y1) x2-x1 (x2,y1)

a = x2 − x1 and b = y2 − y1 , then, using Pythagoras’ theorem, we obtain

d 2 = a 2 + b2 = (x2 − x1 )2 + (y2 − y1 )2 , (7.17)

and

d= (x2 − x1 )2 + (y2 − y1 )2 . (7.18)

Listing 7.1 shows an application of the command dist() for calculating the Euclid-
ian distance of two vectors using R5 . By setting different options for “method,”
one can obtain alternative distances for “maximum,” “Manhattan,” “Canberra,” or
“Minkowski.”
7.4 Basic Principle of Clustering Algorithms 143

7.3.2 Similarity Measures

In this section, we discuss some known similarity measures [49, 218] that are used
for clustering methods. For all the discussed similarity measures, we assume that
the properties given in Definition 7.1 are satisfied.
Examples of commonly used similarity measures for pairs of data points and sets
are as follows [218]:

2(|X ∩ Y |)
sD (X, Y ) = (Dice’s coefficient). (7.19)
|X| + |Y |
|X ∩ Y |
sJ (X, Y ) = (Jaccard’s coefficient). (7.20)
|X ∪ Y |
sρ (x, y) = ρxi ,yj (correlation coefficient). (7.21)
p
i=1 xi yi
scos (x, y) =   (cosine similarity). (7.22)
p 2 p 2
x
i=1 i i=1 yi

In Eqs. 7.19 and 7.20, X and Y are finite sets. The cosine similarity has been applied
extensively in text mining and information retrieval; see [21]. We illustrate the
calculation of Jaccard’s coefficient using an example. Let X and Y be two text
fragments given by the two sets

X = {Data, Science, is, challenging}, (7.23)


Y = {Information, Science, is, modern}. (7.24)

We obtain X ∩ Y = {Science, is} and, hence, |X ∩ Y | = 2. Furthermore,


X ∪ Y = {Data, Information, Science, is, challenging, modern} and |X ∪ Y |=6.
Hence, the similarity between the two text fragments X and Y , measured by
Jaccard’s coefficient, equals sJ (X, Y ) = 26 = 13 .

7.4 Basic Principle of Clustering Algorithms

There are two basic principles/properties of any clustering method that have been
proposed, in the literature, from a theoretical point of view; see [20, 49, 265].
Suppose that we start with a set of objects to be grouped using a clustering
algorithm. First, the objects in a given generated cluster should be (very) similar
to each other (homogeneity) [49, 265] with respect to a chosen similarity measure;
see Sect. 7.3. Second, the objects that belong to different generated clusters should
be (very) different from each other (heterogeneity) [49, 265]. We emphasize that
Everitt et al. [158] call clusters, which fulfill these two properties, natural clusters.
144 7 Clustering

A B
C1

C2

C D

C4 C3 C4

C1 C2 C1 C2

Fig. 7.3 Homogeneity versus heterogeneity of a concrete clustering solution.

Various quantitative measures have been proposed to quantify homogeneity and


heterogeneity; see [49].
One possible definition of “homogeneity” can be stated by this well-known
homogeneity measure: [49]

2  p p
Hom(X) := d(xi , xj ). (7.25)
p(p − 1)
j =1 i=1

For sets X = {x1 , x2 , . . . , xp } and Y = {y1 , y2 , . . . , yp } we call X more


homogenous than Y if Hom(X) < Hom(Y ). In general, the smaller the value of
Hom(X), the more homogenous the cluster, according to Eq. 7.25.
We briefly illustrate this measure using an example. Let’s assume that X =
{1, 2, 3}, Y = {5, 7, 10} and d(xi , xj ) = |xi − xj |. First, we note that X and
Y are non-overlapping and therefore appear heterogeneous. Equation 7.25 yields
Hom(X) = 43 and Hom(Y ) = 10 3 , showing that Hom(X) < Hom(Y ). Based on our
intuition, we also find that X is more homogenous than Y as the distances between
the data points in X are smaller compared to Y .
7.5 Non-hierarchical Clustering Methods 145

Figure 7.3 explains the concept of homogeneity and heterogeneity visually; see
[20]. In Fig. 7.3a, there is no structure in the given data set, and hence no proper
clusters can be generated. Two homogenous clusters C1 and C2 can be seen in
Fig. 7.3b. Also, there is heterogeneity between C1 and C2 . In Fig. 7.3c, we see three
clusters. Again, C1 and C2 are homogenous. However, C4 forms a large cluster, and
the homogeneity of this cluster is rather low. But there is heterogeneity between
C1 , C2 , and C4 . The last situation, in Fig. 7.3d, shows four clusters. C1 and C2 are
homogenous, and the large cluster from Fig. 7.3c is now split into C3 and C4 . We
now see in Fig. 7.3d that by generating C3 and C4 , the homogeneity property is
fulfilled more properly compared to the old cluster C4 in Fig. 7.3c. However, we
observe that C3 and C4 are not disjoint, and therefore we end up with overlapping
clusters. Thus, the heterogeneity between C3 and C4 in Fig. 7.3d is not fulfilled.
Note that the two preceding properties, namely, homogeneity and heterogeneity,
belong to the so-called hard clustering paradigm; see [265]. In this paradigm, data
points are only allowed to be a member of one cluster.
The counterpart of hard clustering is soft or fuzzy clustering [265]. When using
soft clustering, an object belongs to a cluster to a certain degree; this property is also
referred to as fuzzy membership. Therefore, an object may belong to several clusters
with a degree greater than zero.

7.5 Non-hierarchical Clustering Methods

We are now in a position to address clustering methods. The first important


class of algorithms we discuss are non-hierarchical clustering methods, which are
sometimes also called partition-based methods [49, 265].

7.5.1 K-Means Clustering

The K-means clustering method [325] is an iterative algorithm that requires as


input the number of clusters K. The algorithm is initialized by randomly setting
the cluster centers {mk }Kk=1 . Then, one assigns each data point, xi , to exactly one
cluster, resulting in K sets of data points,

C(k) = {xi |xi is in cluster k}, (7.26)

that contain the data points for each cluster. Mathematically, this is accomplished
by calculating the Euclidean distance between xi and all centroids mk and selecting
the cluster with minimal Euclidean distance using

j = argmin{dE (xi , mk )}. (7.27)


k

That means xi will be assigned to cluster C(j ).


146 7 Clustering

Then, the data points of the clusters, i.e., C(k), are used to calculate updated
centroids of the clusters, given by

1 
mk = xj . (7.28)
Nk
j :xj ∈C(k)

Here, Nk is the number of data points in C(k); that is, Nk = |C(k)|. The centroids
are just the mean value of all samples and can be seen as a representative of a cluster.
This completes the first iteration step. Then, all the preceding steps are repeated,
leading to updated C(k) and centroids.
To terminate the algorithm, one can either set a fixed number of iterations, I , or
assess the progress made during the iterations. The latter implies that one needs a
quality measure to assess the progress quantitatively. For this reason, the squared
Euclidean distance
K  
 t  
ESS = xi − x̄ k xi − x̄ k (7.29)
k=1 xi ∈Ck

is used. The closer the samples are around the centroids of their respective cluster,
the smaller the ESS. For example, by using a small ε > 0, one can terminate the
iteration process if

ESS(i) − ESS(i + 1) > ε (7.30)

no longer holds. The major steps in the implementation of the K-means are given
by Algorithm 1.

An obvious disadvantage of K-means is that K must be given to start the


algorithm. However, K is generally not known and can also be in the eye of the
beholder, thus requiring special domain knowledge. Also, K-means is sensitive to
7.5 Non-hierarchical Clustering Methods 147

outliers that could disrupt a natural clustering structure. Another drawback relates
to the initial choice of the centroids, which has a strong impact on the expected
clustering solution. Therefore, a global minimum of the objective function cannot be
guaranteed with only one starting configuration of initial centroids. To overcome this
problem, one can run K-means multiple times by using different initial centroids.

7.5.2 K-Medoids Clustering

We can generalize the K-means algorithm by making two modifications [265]. First,
instead of using centroids to represent clusters, one can use medoids [280]. Second,
instead of the Euclidean distance, one can use any other distance measure defined
in Sect. 7.3.2.
In contrast to a centroid, which is the mean of all data points that belong to a
cluster, a medoid corresponds to one of the data points within a cluster itself. That
means a medoid does not need to be estimated by any measure, for example, the
mean, but just needs to be selected among all data points in a cluster.
To select such a medoid, a criterion is used that is based on the distances between
data points within the cluster. Specifically, a medoid for cluster k is defined as the
data point xi , with xi ∈ C(k), which has the minimal distance to all other data points
in C(k),

D(i) = d(xi , xj ). (7.31)
j :xj ∈C(k)

In Algorithm 2, we highlight this part of the algorithm in orange.


148 7 Clustering

The generalization from centroids to medoids has its price, because the identifi-
cation of the K medoids is far more computationally demanding than the estimation
of the K centroids.

7.5.3 Partitioning Around Medoids (PAM)

There is a further variation of the K-means clustering algorithm called partitioning


around medoids (PAM); see [265]. The basic steps of PAM are shown in Algo-
rithm 3. PAM assigns in its first step data points xi to their closest medoids (red
part in Algorithm 3). Then, these clusters are quantified by measuring the distance
between all data points and their corresponding medoids. In Algorithm 3, this
measure is denoted by Q. Then, for all medoid-data point pairs, a swapping of
xi with mk is assessed by calculating the resulting quality measure Qki . Finally,
the medoid-data point pair that leads to the maximal reduction in Q − Qki ,
corresponding to the largest reduction in the distances between all data points and
medoids, is selected.
7.6 Hierarchical Clustering 149

7.6 Hierarchical Clustering

The second class of clustering methods we discuss are hierarchical clustering. Hier-
archical clustering algorithms are among the most popular clustering approaches
[49, 265]. There is a large variety of procedures that can be distinguished by the
distance measure they are using. Furthermore, all of these methods perform either
an agglomerative (bottom-up) or a divisive (top-down) clustering.
Suppose that we have n data points we want to cluster. Agglomerative algorithms
start with n clusters, where each cluster consists of exactly one data point. Then,
the distances between all n clusters are evaluated, and the two “closest” clusters
are merged, resulting in n − 1 clusters. This successive merging of two clusters is
iteratively repeated until all clusters are merged into a single cluster. In Algorithm 4,
we summarize the principal algorithmic steps. The distance function D can be any
of the particular linkage functions defined in Sect. 7.6.3.
In contrast, divisive algorithms start with just one cluster that contains all n data
points. Then, this large cluster is successively split down into more clusters until
one ends up with n clusters, each containing just one data point.
To perform the corresponding merging or splitting steps for the agglomerative
and divisive algorithms, appropriate distance measures need to be used to decide
what “close clusters” means. We discuss these distance measures in Sect. 7.6.3.
In the following, we focus on agglomerative algorithms because they are usually
computationally more efficient and less restrictive.

7.6.1 Dendrograms

Graphically, the result of an agglomerative algorithm (and also of a divisive


algorithm) can be represented as a dendrogram. In Fig. 7.4, we show two examples
of dendrograms. In general, a dendrogram is similar to a tree containing branches
that correspond to the clusters. However, it is important to note that in contrast to
150 7 Clustering

1.5 1.5

1.0 1.0
height

height
0.5 0.5

0.0 0.0
7 3 4 2 9 6 8 1 5 7 3 4 2 9 6 8 1 5

Fig. 7.4 Two dendrograms corresponding to the same result of an agglomerative clustering
algorithm.

an ordinary tree, a dendrogram contains a scale; namely, its height. For this reason,
on the left-hand side of both dendrograms, an axis that corresponds to the height
of the branches is shown. We will see that these heights are related to the distances
between the clusters.
There are various ways to visualize a dendrogram. One is a “rectangular”
display of the branches (left), and the other is “triangular” (right). Despite the
different visual representations, both dendrograms contain exactly the same amount
of information about the clustering of the nine data points in Fig. 7.4. For this
reason, it is merely a matter of personal taste which representation form to choose.

7.6.2 Two Types of Dissimilarity Measures

One needs to distinguish between two different types of distance measures. The first
distance measure directly assesses the distance between two data points, whereas the
second measure evaluates the distance between clusters containing data points. That
means one needs to distinguish between distance measures for the following:
• Pairs of data points (see Sect. 7.3)
• Pairs of clusters
This implies that the former measures can be written as d(x, y), where x and
y are two data points of length p; see Sect. 7.3. The latter can be expressed as
d(C(m), C(n)), where C(m) and C(n) are two clusters that contain the two data
points; that is, x ∈ C(m) and y ∈ C(n). Distance measures between clusters will
be defined in the next section because they are required by hierarchical clustering
methods.
7.6 Hierarchical Clustering 151

7.6.3 Linkage Functions for Agglomerative Clustering

It is often unclear how to determine the distance between clusters. Cluster distances
are needed to generate a dendrogram and, finally, to generate a clustering solution
by cutting the dendrogram horizontally. Agglomerative clustering algorithms are
distinguished from each other depending on the cluster distance measure they are
using. However, most of these cluster distance measures, with the exception of the
Ward distance [497], are based on the data point distances given by d(xi , xj ).
In general, a cluster distance measure, D, is called a linkage function. In the
following, we give examples for five widely used linkage functions:

Dsingle (Ci , Cj ) = min d(x, y) with x ∈ Ci and y ∈ Cj (single linkage).

(7.32)
Dcomp (Ci , Cj ) = max d(x, y) with x ∈ Ci and y ∈ Cj (complete linkage).
(7.33)
1 
Dave (Ci , Cj ) = d(x, y) (average linkage). (7.34)
|Ci ||Cj |
x∈Ci and y∈Cj

|Ci ||Cj | * *
*μi − μj *2 (Ward method).
Dward (Ci , Cj ) = (7.35)
|Ci | + |Cj |

In Dward (Ci , Cj ), μi and μj are the centers of the cluster Ci and Cj , respectively.
Using the preceding measures, one can calculate the distance between clusters.
These are crucial to generate the dendrogram and to generate plausible clustering
solutions. The final clustering and the quality of the clusters (for example, measured
by homogeneity) strongly depend on the linkage function, which needs to be chosen
in advance.

7.6.4 Example

In Fig. 7.5, we show four examples of hierarchical clustering using the four distance
measures discussed in the previous section. The clustering analysis in Listing 7.2
is performed for the Iris data set, which provides measurements in centimeters for
the variables sepal length and width and petal length and width, respectively, for 50
flowers from each of three species of Iris. The species are Iris setosa, Iris versicolor,
and Iris virginica. For our analysis, we used only a subset of data consisting of f ive
flowers from each species.
From Fig. 7.5, it is interesting to see that each of the four linkage functions gives
different results. Only the single linkage function gives the “correct” clusters as
known from the three Iris species.
152 7 Clustering

3
Single linkage Complete linkage

4
2
height

height
1 2

0 SE 1 SE 5 SE 2SE 3SE 4 VE 4 VE 1 VE 3 VE 2 VE 5 VI 2 VI 3 VI 1 VI 4 VI 5 0 SE 1 SE 5 SE 2SE 3SE 4 VE 4 VI 2 VE 1 VE 3 VE 2 VE 5 VI 3 VI 1 VI 4 VI 5

25
4 Average linkage Ward
20

3
15
height

height

2
10

1 5

0 SE 1 SE 5 SE 2SE 3SE 4 VE 4 VI 3 VI 1 VI 4 VI 5 VI 2 VE 1 VE 3 VE 2 VE 5 0 SE 1 SE 5 SE 2SE 3SE 4 VI 3 VI 1 VI 4 VI 5 VE 1 VE 3 VE 2 VE 5 VE 4 VI 2

Fig. 7.5 Analysis of the Iris data set using the Euclidian distance measure. The following cluster
distance measures were used. First row: single linkage (left) and complete linkage (right). Second
row: average linkage (left) and Ward (right).
7.7 Defining Feature Vectors for General Objects 153

7.7 Defining Feature Vectors for General Objects

So far, we have described partition-based clustering and hierarchical clustering


algorithms for data sets of the type X = {xi }n1 with xi ∈ Rp . That means feature
vectors are given as an input for the clustering method. Furthermore, we mentioned
briefly set-based measures, such as Jaccard’s and Dice’s coefficients, to show that
clustering is more flexible with respect to the required representation of data points.
Now, we go one step further by generalizing these requirements. That means
it is not only possible to cluster objects represented by vectors or sets, but also
general objects. In Fig. 7.6, we show a visualization of this. On the left-hand
side, two examples are shown for different types of objects: graph and document.
Regardless of the nature of such objects, it is always possible to map these to feature
vectors using domain-specific quantifications. For documents, this could correspond
to features like TF-IDS (term frequency-inverse document frequency), POS (part
of speech), or WE (word embeddings); see Chap. 5. Similarly, such a mapping is
possible for graphs, and we discuss next some quantitative features corresponding
to topological indices and graph entropy measures. Hence, by mapping an (abstract)

Graph

Features:
topological indices and
graph entropy measures

⎛ ⎞
x1
⎜ x ⎟
⎜ 2 ⎟
⎜ ⎟
⎜ . ⎟
⎜ .. ⎟
⎝ ⎠
xp

Feature vector

Features:
-TF-IDS
-POS: Part-of-speech
-sentiments
-OHD: One-hot document
-WE: word embeddings
Document

Fig. 7.6 Mapping of general objects to feature vectors. The left-hand side shows two examples for
different types of objects: graph and document. These objects are then mapped by domain-specific
quantifications, leading to features.
154 7 Clustering

object to a feature vector, one can obtain an approximation of the properties of the
object itself.
Now, we describe how a clustering can be performed for graphs or complex
networks [139, 219]. One reason we choose networks is that graphs/networks are
currently ubiquitous in data science and related disciplines. For instance, they have
been applied for classification and modeling tasks extensively; see, for example,
[87, 98, 260, 362, 470].
To cluster networks, we need to transform a network G = (V , E) into a vector
V = (I1 , I2 , . . . , In ). In the simplest case, Ij : G −→ R+ , 1 ≤ j ≤ n is a
topological index capturing structural information of a network. F is a class of
graphs. For applications, many of these indices have been used [100, 107, 139]
where a so-called graph invariant is required. A graph invariant is a graph measure
(or index) that is invariant under isomorphism [98]; two graphs are ismorphic if
they are structurally equivalent. In the following, we give some topological indices,
which have been applied to characterize graphs in a wide range of applications
[98, 107, 113]:

|V | |V |
1 
W := d(vi , vj ) (Wiener index), (7.36)
2
i=1 j =1

where d(vi , vj ) denotes the shortest distance between vi and vj .

|V |

Z1 := δ(vi ) (first Zagreb index), (7.37)
i=1

where δ(vi ) is the degree of the vertex vi .


 1
R := [δ(vi )δ(vj )]− 2 (Randić index). (7.38)
(vi ,vj )∈E

|E|  1
B := [DS i DS j ]− 2 (Balaban index), (7.39)
μ+1
(vi ,vj )∈E

where DSi denotes the distance sum (row sum) of vi and μ is the cyclomatic number
(that is, the number of rings in the graph).
We end this section by introducing measures for graph entropy, which turned out
to be very meaningful for the quantitative characterization of graphs [50, 96, 100,
350]:


k  
|Ni | |Ni |
Ia := − log (topological information content), (7.40)
|V | |V |
i=1
7.8 Cluster Validation 155

where |Ni | stands for the number of topologically equivalent vertices in the i-th
vertex orbit of G, and k is the number of different orbits.

 
1 1
ID := − log
|V | |V |
 2ki
ρ(G)  
2ki
− log (magnitude-based information index), (7.41)
|V |2 |V |2
i=1

where the distance of a value i in the distance matrix D appears 2ki times. ρ(G)
stands for the diameter of a graph G.
Finally, the graph entropy, based on vertex functionals, is defined as follows [96,
100]:

|V |
 
 f (vi ) f (vi )
If := − |V |
log |V |
, (7.42)
i=1 j =1 f (vj ) j =1 f (vj )

where f : V −→ R+ and the vertex probabilities are

f (vi )
p(vi ) := |V |
. (7.43)
j =1 f (vj )

Applications of graph entropy can be found in bioinformatics, systems biology,


and computer science, in general, for tackling classification, clustering, and model-
ing tasks; see, for example, [137, 296, 352].
Earlier, we mentioned that a graph (or network) G = (V , E) can be characterized
by a vector V. For instance, we could define feature vectors by V1 = (W, R, B)
or V2 = (ID , Ia , If , W, Z1 ) or any other combination of graph measures for
performing clustering. Altogether, this allows us to define numerical vectors, which
can be used to apply the clustering techniques discussed in the previous sections.
Lastly, we would like to remark that the preceding measures and many more have
been implemented in the R package QuACN [353, 354].

7.8 Cluster Validation

The validation of clusters is a challenging task because often one does not have
any prior or domain knowledge for judging the results of a clustering [265]. This
is also directly visible from the unlabeled data upon which the clustering is based.
Therefore, various measures have been developed to quantify and assess the validity
of clusters. In this section, we distinguish two major categories of such measures
[265]: external criteria and internal criteria.
156 7 Clustering

7.8.1 External Criteria

In this case, the result of a clustering is assessed using additional external infor-
mation that was not used for the cluster analysis itself. This information consists
of labels for the data points defining the gold standard of comparison. Depending
on the origin of these labels, this can be the correct solution for the problem or
just the best assignment available; for example, provided by human experts. An
example for the latter could be a biomedical data set containing measurements
of tumor samples of patients for which the labels were provided by a pathologist
performing a histological analysis of the morphology of the tumor tissues. Overall,
this means labeled data are available; however, the labels have not been used for
learning the clustering algorithm. This allows one to assess clustering in the same
way as a (multiclass) classification problem.
Let’s assume that a cluster analysis results in a partitioning of n data points given
by C = {C1 , . . . , CK }, where K is the total number of clusters. That means, for
each cluster, Cm is a set consisting of the data points that belong to this cluster;
that is, Cm = {xi }. Furthermore, let’s denote the reference information by R =
{R1 , . . . , RL } with a possibly different number of clusters; that is, K = L.
Utilizing R, it is now possible to decide if, for example, the two data points,
xi and xj , are correctly or incorrectly placed in the same cluster. Specifically, if
xi , xj ∈ Rm and xi , xj ∈ Cn , we call this pair a true positive (TP). Similarly, we
define the following:
• If xi ∈ Rm , xj ∈ Rm and xi ∈ Cn , xj ∈ Cn , we call this pair true negative (TN).
• If xi ∈ Rm , xj ∈ Rm and xi , xj ∈ Cn , we call this pair false positive (FP).
• If xi , xj ∈ Rm and xi ∈ Cn , xj ∈ Cn , we call this pair false negative (FN).
Performing such a pairwise data-point comparison enables us to identify the total
number of true positive, false positive, true negative, and false negative pairs. As
a consequence, any statistical measure based on these four errors (discussed in
Chap. 3) can be used. However, in the context of cluster analysis, the following
indices are frequently used [217, 265]:
Rand Index It is defined by

TP +TN
R= . (7.44)
T P + T N + FP + FN

Jaccard Index It is defined by

TP
R= . (7.45)
T P + FP + FN
7.8 Cluster Validation 157

F-Score It is defined by

TP PR
F = = , (7.46)
T P + FP + FN P +R

where P is the precision and R the recall (sensitivity) given by

TP
P = , (7.47)
T P + FP
TP
R= . (7.48)
T P + FN

Fowlkes-Mallows (FM) Index It is defined by


 
TP TP √ √
FM = = P R. (7.49)
T P + FP T P + FN

The FM index is the geometric mean of the precision and recall P and R, while the
F-score is their harmonic mean.

Normalized Mutual Information The NMI is defined by

I (C, R)
NMI = , (7.50)
max{H (C), H (R)}

where I (C, R) is the mutual information between the predicted set (C) and the
reference set (R) and H (R) and H (C) are the entropies of these sets.
To evaluate the information-theoretic entities, we need to estimate marginal and
joint probability distributions based on the comparison of R and C. Table 7.1 is a
contingency table obtained from such a comparison. Here, the ordering of the sets
Ci and Rj is not crucial, because every pair is assessed. For instance, n11 gives the
number of data points common in R1 and C1 , and a1 is the number of all data points
in R1 ; that is, a1 = j n1j .

Table 7.1 Contingency table Predicted


to evaluate the normalized
Reference C1 C2 ... CK Sums
mutual information.
R1 n11 n12 ... n1,K a1
R2 n21 n22 ... n2,K a2
.. .. .. .. .. ..
. . . . . .
RL nL,1 nL,2 ... nL,K aL
Sums b1 b2 ... bK ij nij = N
158 7 Clustering

From the contingency table, we can estimate the following necessary probability
distributions:
bi
p(Ci ) = , (7.51)
N
aj
p(Rj ) = , (7.52)
N
nij
p(Ci , Rj ) = , (7.53)
N
These are then used to estimate the mutual information and the entropies as follows:


K
H (C) = − p(Ci ) log p(Ci ), (7.54)
i=1


L
H (R) = − p(Rj ) log p(Rj ), (7.55)
j =1


K 
L  
p(Ci , Rj )
I (C, R) = p(Ci , Rj ) log . (7.56)
p(Ci )p(Rj )
i=1 j =1

7.8.2 Assessing the Numerical Values of Indices

After obtaining a numerical value of any of the preceding indices, the next question
is, what does it mean? Specifically, is the result from our cluster analysis “good”
enough compared to the reference set R, or not? This question is actually not easy
to answer as it requires further considerations.
One way to show that the obtained clustering C is meaningful compared to R is
via a hypothesis test. We would like to note that we can use any of the preceding
indices as a test statistic for a hypothesis test. However, since we usually do not
know the analytical form of the sampling distribution that belongs to a selected
test statistic, we need to obtain the sampling distribution for the null hypothesis
numerically.

7.8.3 Internal Criteria

When no external information about the partitioning of the data points within a
data set is available in the form of a reference set R, then we need to assess the
obtained clustering, C, using other criteria. In this case, the assessment of the
quality of a cluster analysis is a more challenging task. Well-known criteria to
7.8 Cluster Validation 159

evaluate the quality of a clustering include the Dunn index, the Davies-Bouldin
index, and the silhouette index. Most of these indices are based on concepts such as
the homogeneity and the variance of the clustering. For instance, we say a cluster
is homogeneous if there are small distances and variances between the points in the
cluster. We provide here the definitions of the aforementioned indices.
Dunn Index Assume that the distance between two clusters Cm and Cn is given by

d(C_m, C_n) = \min_{x \in C_m,\, y \in C_n} d(x, y), \quad (7.57)

for any distance measure defined in Sect. 7.8.1.


The diameter of a cluster Ck is defined by

diam(C_k) = \max_{x, y \in C_k} d(x, y). \quad (7.58)

The diameter is just the maximum distance between any two data points in a cluster
Ck .
For a fixed number K of clusters, the Dunn index is
 
D_K = \min_{m \in \{1,\dots,K\}} \min_{n \in \{m+1,\dots,K\}} \left\{ \frac{d(C_m, C_n)}{\max_{k \in \{1,\dots,K\}} diam(C_k)} \right\}. \quad (7.59)

If the number of clusters K is not known, one can estimate DK for different
values, such as by performing K-means clustering and choosing different numbers
of clusters. Then, the number of clusters that maximizes the Dunn index can be
selected as the best number of clusters. In general, larger values of DK indicate
better clustering since the aim is to maximize the distance between clusters.
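As an illustration, the Dunn index can be computed from the pairwise distance matrix and a vector of cluster labels, and then evaluated for several values of K. The following is a minimal sketch; the function name and the simulated data are assumptions made for this example.

# Minimal sketch: Dunn index (Eq. 7.59) for a given partition, Euclidean distance.
dunn_index <- function(x, cl) {
  d  <- as.matrix(dist(x))                 # pairwise Euclidean distances
  ks <- unique(cl)
  diam <- sapply(ks, function(k) max(d[cl == k, cl == k]))        # Eq. 7.58
  sep <- Inf
  for (m in seq_along(ks)) {
    for (n in seq_along(ks)) {
      if (n > m) {
        sep <- min(sep, min(d[cl == ks[m], cl == ks[n]]))         # Eq. 7.57
      }
    }
  }
  sep / max(diam)
}

set.seed(1)
x <- rbind(matrix(rnorm(50, 0), ncol = 2), matrix(rnorm(50, 4), ncol = 2))
sapply(2:5, function(K) dunn_index(x, kmeans(x, centers = K)$cluster))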
Davies-Bouldin Index The Davies-Bouldin index, DB, is defined by

DB = \frac{1}{K} \sum_{k=1}^{K} R_k, \quad (7.60)

where

R_k = \max_{n \in \{1,\dots,K\},\, n \neq k} R_{kn}, \quad (7.61)

R_{mn} = \frac{s_m + s_n}{d(C(m), C(n))}, \quad (7.62)

with

s_k = \left( \frac{1}{b_k} \sum_{x_i \in C(k)} |x_i - m_k|^r \right)^{1/r}, \quad (7.63)

d(C(m), C(n)) = \left( \sum_{i=1}^{p} |m_{m,i} - m_{n,i}|^q \right)^{1/q}. \quad (7.64)

Note that here bk is the number of data points in cluster C(k) and mk is its centroid.

Silhouette Coefficient The silhouette coefficient [411], denoted si, is defined for
each data point xi with xi ∈ C(k) by

s_i(k) = \frac{b_i - a_i(k)}{\max\{a_i(k), b_i\}}, \quad (7.65)

a_i(k) = \frac{1}{N_k} \sum_{j:\, x_j \in C(k)} d(x_j, x_i), \quad (7.66)

b_i = \min_{m \neq k} \{a_i(m)\}, \quad (7.67)

where ai denotes the average distance between the data point xi and all the other
data points in its cluster and bi denotes the minimum average distance between xi
and the data points in other clusters. By definition, the silhouette coefficient, si , is
normalized; that is,

− 1 ≤ si ≤ +1, for all i. (7.68)

Average Silhouette Coefficient For the n data points {x_i}_{i=1}^{n}, the average silhouette
coefficient is just the average of all silhouette values:

s_K = \frac{1}{n} \sum_{i=1}^{n} s_i. \quad (7.69)

To evaluate s_K quantitatively, the following characteristic values were suggested [411]:
• 0.71–1.00: strong clusters
• 0.51–0.70: reasonably good clusters
• 0.26–0.50: weak clusters
• < 0.25: no substantial clusters
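In R, the silhouette values s_i and their average can be obtained, for example, with the silhouette() function from the cluster package. The following minimal sketch assumes that this package is installed and uses simulated data for illustration.

# Minimal sketch: silhouette values and their average for a K-means solution.
library(cluster)

set.seed(1)
x  <- rbind(matrix(rnorm(50, 0), ncol = 2), matrix(rnorm(50, 4), ncol = 2))
cl <- kmeans(x, centers = 2)$cluster

sil <- silhouette(cl, dist(x))       # one value s_i per data point
mean(sil[, "sil_width"])             # average silhouette coefficient
plot(sil)                            # silhouette plot, grouped by cluster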

7.9 Summary

Clustering methods are a type of unsupervised learning. In this chapter, we have seen
that clustering algorithms can discover structures in data without using information
about labels. For this reason, clustering is a powerful approach to perform an
exploratory data analysis (EDA) and discover new groups or classes in a data set.
This is usually the first step in a data science project because it allows a largely
unbiased interrogation of the data.
The application of clustering algorithms is not always straightforward since the
true number of clusters hidden in the data is usually unknown. For this reason,
although there exist various measures to evaluate the goodness/quality of clusters,
the quality of a clustering solution is in the eye of the beholder. Therefore, it’s not
surprising that the evaluation of clusters is the most intricate part of a clustering
analysis in practice.
Learning Outcome 7: Clustering

Clustering methods are based on unlabeled data. For this reason, the evalua-
tion of the found clusters/groups is challenging since no reference information
is available that could be directly used for their evaluation.

In this chapter, we have seen that instances to be clustered can not only
correspond to profile vectors but also to networks. Even more generally, one can
cluster images, documents, or human behavior. Since this can correspond to very
different data types (as discussed in Chap. 5), the underlying similarity measures in
such cases can assume domain-specific forms. Nevertheless, the basic idea of those
clustering methods is very similar to those discussed in this chapter.

7.10 Exercises

1. Show that s_1: R × R → [0, 1], defined by s_1(x, y) := e^{-(x-y)^2}, is a similarity
   measure according to Definition 7.1.
2. Show that d_1: R × R → [0, 1], defined by d_1(x, y) := 1 - e^{-(x-y)^2}, is a
   distance measure according to Definition 7.2.


3. Given a similarity measure s(x, y), where s(x, y) ≤ 1, i.e., s(x, y) fulfills the
properties given by Definition 7.1, show that d(x, y) = 1 − s(x, y) is a distance
measure.
4. Find some examples in terms of applications for non-symmetric similarity or
distance measures.
5. Calculate the Euclidean distance between v1 := (1, 2, 0) and v2 := (−2, 1, 7).
6. Use R to compute the Euclidean distance between v1 and v2.

7. Given the vectors

a := (0, 0, 1, 2),
b := (2, 1, 1, 2),
c := (8, 0, 2, 2),
d := (0, −1, 1, 2),
e := (20, 10, 1, 4),
f := (0, 0, 0, 23),
g := (1, 1, 1, 1),

use R to compute the pairwise Euclidean distances between these vectors and
generate the corresponding distance matrix.
8. Use R to run K-means with K = 2 and the Euclidean distance. Use the function
scale() to standardize the data. To perform the clustering, choose two initial cen-
troids arbitrarily. Compare the clustering results using different initial centroids.
9. Use R to run agglomerative clustering with the Euclidean distance and average
linkage. Use the function scale() to standardize the data. Plot a dendrogram. Find
a method to identify meaningful clusters from this dendrogram.
Chapter 8
Dimension Reduction

8.1 Introduction

When speaking about big data, one generally refers to the (sample) size of the
data, which is also called volume. However, there is another entity that can make
data “big” in a certain sense — the dimensionality of a data point. Specifically,
for data represented by a n × p matrix, where n corresponds to the number of
samples (observations) and p corresponds to the number of features, a data point
is represented as a p-dimensional vector where its components correspond to the
so-called features.
In data science, we are frequently confronted with data sets that have a large
number of features. However, many of these features are highly redundant or
non-informative, which generally hinders the ability of most machine learning
algorithms to perform efficiently. A common approach to address these issues is
to check whether a low-dimensional structure can be detected within these high-
dimensional data. If the answer is yes, then we can identify the most meaningful
basis in a lower dimension, which can be used to re-represent the data. This results
in a new data matrix of the form n × k with k < p and possibly k ≪ p.
The procedures used to devise such a compact representation of the data
without a significant loss of information are referred to as dimension reduction
(or dimensionality reduction) techniques. According to their working mechanisms,
most dimension reduction techniques can be divided into two categories:
1. Feature extraction techniques: These methods generate a small set of new
features containing most of the information from the original data set via some
linear/nonlinear weighted combination of the original features.
2. Feature selection techniques: These methods identify and select the most relevant
features from the original data set.
In this chapter, we introduce both techniques, and we also present some examples
of such methods. Specifically, for feature extraction we discuss PCA (principal


component analysis) and NNMF (non-negative matrix factorization) techniques,


whereas for feature selection we present maximum relevance and MRMR (min-
imum redundancy and maximum relevance) techniques. Most of the methods
discussed in this chapter are based on unlabeled data; hence, they belong to the
paradigm of unsupervised learning methods.

8.2 Feature Extraction

A variety of approaches for feature extraction have been proposed, including


principal component analysis [255, 385], Isomap [463], diffusion maps [299], local
tangent space analysis [523], and multilayer autoencoders [102, 239]. However,
in this chapter we will restrict ourselves to the most popular of these methods —
namely, principal component analysis (PCA).

8.2.1 An Overview of PCA

PCA is a feature extraction process by definition, and therefore it aims to find a


subset of linear combinations of the original features that encompasses the majority
of the variation within the data. The elements of the sought-after subset are referred
to as principal components. They are mutually uncorrelated and are extracted such
that the first few encompass most of the variation within the original features
or variables. The principal components are extracted in a decreasing order of
importance, with the first accounting for the maximum variation within the original
data set. The second principal component represents the variance-maximizing
direction orthogonal to the first principal component. Subsequently, it follows that
the kth principal component is the direction that maximizes variance among all
directions orthogonal to the previous k − 1 components.
In other words, PCA seeks the most accurate data representation in a lower-
dimensional space through a linear transformation of an original set of features
or variables into a substantially smaller set of uncorrelated variables that represent
most of the information in the original set of features or variables [271].
As such, PCA enables us to highlight any trends, patterns, and outliers in the data
that may have been unclear from the original data set. Because of the simplicity with
which it extracts important information from complex data sets, PCA is used abundantly
in many forms of analysis, in particular for the initial analysis of large data sets,
to obtain insights about the dimensionality of the data and the distribution of the
variance within these dimensions.

Fig. 8.1 Example of PCA for a two-dimensional data set.

8.2.2 Geometrical Interpretation of PCA

Geometrically, PCA is analogous to fitting a hyperplane to noisy data, and it can be


achieved via a series of rotations and/or projections with orthogonality constraints
on the data points from a high-dimensional space onto a lower-dimensional space,
as illustrated in Fig. 8.1.
In Fig. 8.1, it is clear that the first principal component lies along the direction
of the highest variation within the data, whereas the second principal component is
orthogonal to the first one. The two principal components form the new axes for the
data, and they can be viewed as a result of the rotation of the original x-y axes.
However, in Fig. 8.2, the difference in the degree of variation within the data
along the first and second principal components is not obvious. Yet, the two principal
components are orthogonal to each other.

8.2.3 PCA Procedure

Let X ∈ Rn×p denote the data matrix, where the columns and rows represent the
features or variables and the observations, respectively. The PCA procedure on the
matrix X can be summarized by the following key steps:
1. Center the variables in X so that the mean of each column is equal to 0. When
the variables are measured in different units, then the centered matrix needs to
be standardized; for instance, by dividing each column by its norm. After such a
transformation, each variable has a unit norm. Let X̂ denote the corresponding
transformed matrix.
2. Calculate the matrix

2
nd
Pr
in
ci
pa
lC
om
po
5

ne
t
en

nt
p on
C om
al
0 rin
cip
1
st P 10

−5 0
−10 −5 0 5 10−10

Fig. 8.2 Example of PCA for a three-dimensional data set. Only the plane defined by the first two
principal components is represented.

S = \frac{1}{n-1} X̂^T X̂. \quad (8.1)

If the matrix X̂ consists of the centered variables in X, then S is called the


covariance matrix, and in this case the subsequent analysis is referred to as a
covariance PCA. However, if X̂ is the centered and standardized form of X, then
S is called the correlation matrix, and the subsequent analysis is referred to as a
correlation PCA.
3. Find the p eigenvalues of S, denoted λi , i = 1, . . . , p, and their corresponding
right eigenvectors ui such that

S u_i = λ_i u_i, \quad i = 1, \dots, p, \quad (8.2)
‖u_i‖ = 1, \quad i = 1, \dots, p, \quad (8.3)
u_i^T u_j = 0, \quad i \neq j. \quad (8.4)

Let Λ denote the following diagonal matrix:

Λ = \mathrm{diag}(λ_1, λ_2, \dots, λ_p), \quad with λ_1 ≥ λ_2 ≥ \dots ≥ λ_p,

and U = (u1 , . . . , up ) ∈ Rp×p is the matrix of the eigenvectors. Then, the matrix
S can be obtained as follows:

S = U Λ U^T. \quad (8.5)

The eigenvector ui represents one of the directions of the principal components,


whereas the corresponding eigenvalue λi represents the variance of the data along
the direction ui . The projection of a vector x along the direction ui is given by uTi x.
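These steps can be carried out directly with base R. The following minimal sketch applies them to the numeric variables of the iris data (an arbitrary choice for illustration) and cross-checks the result against prcomp().

# Minimal sketch: PCA via the eigendecomposition of the correlation matrix.
X     <- as.matrix(iris[, 1:4])                 # numeric variables only
X_hat <- scale(X, center = TRUE, scale = TRUE)  # Step 1: center and standardize

S   <- t(X_hat) %*% X_hat / (nrow(X_hat) - 1)   # Step 2: Eq. 8.1
eig <- eigen(S)                                 # Step 3: eigenvalues/eigenvectors
eig$values                                      # variances along the directions u_i
U <- eig$vectors                                # loadings

Y <- X_hat %*% U                                # scores (projections of the data)

pca <- prcomp(X, center = TRUE, scale. = TRUE)  # cross-check (signs may differ)
head(pca$rotation); head(U)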

8.2.4 Underlying Mathematical Problems in PCA

Suppose that we have a data set represented by a matrix X ∈ Rn×p ; that is, the
data consist of n observations and p variables. Then, the PCA of the data matrix
X consists of a series of best-fit subspaces of dimensions r = 1, 2, . . . , p − 1 along
the directions that maximize the variation within the data. In other words, at each
step, we are seeking a vector such that the variance of the projections of the data
points in X onto the corresponding one-dimensional subspace is maximized. Since
the sample covariance/correlation matrix of the data points in the matrix X is given
by

S = \frac{1}{n-1} X̂^T X̂, \quad (8.6)

where X̂ denotes the transformed version of X, then we can find the first best-fit
subspace by solving the following optimization problem:

\max_{u} \; u^T X̂^T X̂ u \quad (8.7)

u^T u = 1 \quad (8.8)

The optimal solution of the optimization problem given by Eqs. 8.7–8.8, denoted
u1 , is the eigenvector of the matrix S associated with the largest eigenvalue.
The second best-fit subspace is obtained by solving the following optimization
problem:

\max_{u} \; u^T X̂^T X̂ u \quad (8.9)

u^T u = 1 \quad (8.10)

u_1^T u = 0 \quad (8.11)

The additional constraint given by Eq. 8.11, compared to the optimization


problem given in Eqs. 8.7–8.8, enforces the orthogonality between the first best-

fit subspace obtained previously (which is characterized by u1 ) and the sought-after


second best-fit subspace. The optimal solution of the optimization problem given
in Eqs. 8.9–8.11, denoted u2 , is the eigenvector of the matrix S associated with the
second largest eigenvalue.
Similarly, the kth best-fit subspace is given by the solution of the following
optimization problem:

\max_{u} \; u^T X̂^T X̂ u \quad (8.12)

u^T u = 1 \quad (8.13)

u_i^T u = 0, \quad i = 1, \dots, k − 1. \quad (8.14)

The vectors ui , i = 1, . . . , k < p, also referred to as the loadings of the principal


components, form the first k columns of the PCA rotation matrix.
The preceding steps can be generalized to find the best-fit k-dimensional
subspace, with k < p, by solving the following optimization problem:

\max_{U} \; U^T X̂^T X̂ U \quad (8.15)

U^T U = I, \quad (8.16)

where U denotes the basis of the optimal k-dimensional subspace onto which to
project the data set X̂ and I denotes the k × k identity matrix.
The solution of the quadratic problem 8.15–8.16 provides us with the orthonor-
mal eigenvectors ui , thus satisfying the constraints 8.3 and 8.4 from Step 3 of the
PCA procedure, along which the variance of the data is maximum. Substituting the
eigenvectors ui in the constraint in Eq. 8.2, we can solve the corresponding linear
problem to obtain the eigenvalues λi . However, this multi-step solution to obtain the
eigenvalues λi and their associated eigenvectors ui can be replaced by a single-step
process that uses a computationally efficient technique on the matrix X̂ called the
singular value decomposition.

8.2.5 PCA Using Singular Value Decomposition

Singular value decomposition (SVD) is a generalized approach for matrix factoriza-


tion. Let X ∈ Rn×p denote the original data matrix, where the columns represent the
features or variables in the data, and let X̂ denote the corresponding centered and
standardized matrix. Then, the SVD of X̂ is given by the following factorization:

X̂ = U Σ V^T, \quad (8.17)

where U ∈ R^{n×n} and V ∈ R^{p×p} are unitary or orthogonal matrices — that is,
U^T U = I_n and V^T V = I_p — and Σ ∈ R^{n×p} is a matrix with real, non-negative
entries on the diagonal and zero off-diagonal elements.
The matrix Σ has at most r = min(n, p) non-zero diagonal entries, λ_i for i =
1, 2, . . . , r, which are ordered in descending order:

λ_1 ≥ λ_2 ≥ . . . ≥ 0. \quad (8.18)

The matrices U and V are referred to as the matrices of left and right singular vectors,
respectively, whereas Σ is the diagonal matrix of singular values.
The projection of the observations onto the principal component space, also
referred to as the scores’ matrix, denoted Y , is given by

Y = U Σ. \quad (8.19)

The matrix V , referred to as the loadings’ matrix, can also be used to obtain the
projection of the observations onto the principal component space, Y . This can be
shown using Eqs. 8.17 and 8.19 as follows:

Y = U Σ \quad (8.20)
  = U Σ V^T V, \quad since V^T V = I_p \quad (8.21)
  = X̂ V. \quad (8.22)

Therefore, given the scores’ matrix Y and the loadings’ matrix V , the original data
matrix X̂ can be recovered as follows:

X̂ = Y V T (8.23)

The projection of a new given observation vector xnew ∈ R1×p — not included in
the PCA process — onto the principal component space or its scores, denoted ynew ,
can be obtained as follows:

ynew = x̂new V , (8.24)

where x̂new is the centered and normalized version of xnew obtained using the same
values of the mean and standard deviation utilized to derive matrix X̂.
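As an illustration, the following minimal sketch computes the scores via the SVD in base R and verifies numerically that U Σ and X̂ V coincide; the data set used is an arbitrary choice.

# Minimal sketch: PCA scores via the SVD of the centered/standardized matrix.
X_hat <- scale(as.matrix(iris[, 1:4]))   # centered and standardized data
s  <- svd(X_hat)                         # X_hat = U D V^T
Y1 <- s$u %*% diag(s$d)                  # scores via U Sigma (Eq. 8.19)
Y2 <- X_hat %*% s$v                      # scores via X_hat V  (Eq. 8.22)
max(abs(Y1 - Y2))                        # essentially zero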

8.2.6 Assessing PCA Results

The results of a PCA can be assessed using various metrics, including the following:

• The importance of each principal component, reflected by the magnitude of its


corresponding eigenvalue; that is, the fraction of variation within the original data
captured by the principal component.
• The correlation between a principal component and a variable.
• The contribution of a given observation i to the construction of a given principal
component j, denoted contrib_{i,j}, given by

contrib_{i,j} = \frac{y_{i,j}^2}{λ_j}, \quad (8.25)

where yi,j is the score — that is, the projection of the observation i onto the
principal component j — and λj is the eigenvalue of the principal component j .
• The contribution of a principal component j to the representation of a given
observation i in the principal component space, denoted cos²_{i,j}, given by

cos²_{i,j} = \frac{y_{i,j}^2}{\sum_j y_{i,j}^2}, \quad (8.26)

where yi,j is the score — that is, the projection — of the observation i onto the
principal component j ; the quantity cos2i,j is called the squared cosine between
the observation i and the principal component j .
A commonly used approach to identify how many principal components should
be retained is to plot the eigenvalues as a function of their indices, which denotes
the order of importance of the principal components. Then, the index, corresponding
to a sharp change of direction in the eigenvalues graph, also known as the elbow,
provides a cut-off point for the number of principal components to be retained.
Such a plot is known as a scree plot. Another approach is to consider a principal
component as relevant when its associated eigenvalue is larger than the mean of all
the eigenvalues.

8.2.7 Illustration of PCA Using R

In R, PCA can be performed readily using the linear algebra tools and following the
steps defined in Sects. 8.2.3 or 8.2.5. However, many packages that enable one to
perform PCA directly are available, including h2o, FactoMineR, ade4, amap, and
stats. We will use the latter package to illustrate the application of PCA on the data
set PimaIndiansDiabetes available in the package mlbench. The analysis and the
visualization of the results are performed using Listing 8.1.
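Listing 8.1 itself is not reproduced here; a minimal sketch of such an analysis, assuming the packages stats and mlbench are available, might look as follows.

# Minimal sketch along the lines of Listing 8.1 (not the original listing).
library(mlbench)
data(PimaIndiansDiabetes)

X   <- PimaIndiansDiabetes[, 1:8]                # the eight numeric variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)   # correlation PCA

summary(pca)                      # proportion of variance per component
screeplot(pca, type = "lines")    # scree plot (cf. Fig. 8.3)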

The scree plot of the corresponding analysis is depicted in Fig. 8.3, and it
can be used to identify the appropriate number of principal components to be
retained. For this example, there appears to be an elbow at the third principal
component. Therefore, the elbow approach suggests retaining the first three principal
components, which account for 60.7% of the overall variance. The alternative
approach also suggests the retention of the first three principal components, since
only the values of the first three eigenvalues are greater than the mean of all the
eigenvalues.

[Scree plot (Fig. 8.3): the eight principal components explain 26.2%, 21.6%, 12.9%, 10.9%, 9.5%, 8.5%, 5.2%, and 5.1% of the variance, respectively.]

Fig. 8.3 Scree plot for the PCA in Listing 8.1.

Fig. 8.4 Correlation between the first two principal components and the 8 variables within the
data (namely, age, pregnant, pressure, glucose, mass, pedigree, insulin, triceps).

The quality of the PCA can be assessed using the correlation between the retained
principal components and the variables. Figure 8.4 shows the representation of the


Fig. 8.5 Contribution of the variables to the construction of the principal components: the larger
the box in the cell (i, j), the larger the contribution of the variable in row i to the construction of the
principal component in column j.

correlations between the variables and the first two principal components. The sign
of the dimension axis indicates the sign of the correlation, whereas the magnitude
of the corresponding vector represents the degree of correlation. The variables age
and pregnant have a high positive correlation and a low negative correlation with
the first and second principal components, respectively. The variables glucose and
pressure have a modest positive correlation and a moderate negative correlation with
the first and second principal components, respectively. The variables insulin and
triceps have a moderate negative correlation with both the first and second principal
components. The variable mass has a moderate negative correlation and low
negative correlation with the first and second principal components, respectively.
The variable pedigree has a modest negative correlation with both the first and
second principal components.
Figure 8.5 depicts the contribution of the variables in the construction of
the principal components. The variables are represented by the rows, whereas
the principal components are represented by the columns. Most of the variables
have some moderate to low contribution to the construction of most principal
components, whereas:
• The variable pregnant has only a moderate contribution to the construction of the
fifth principal component (denoted Dim.5) and a very low or no contribution at
all to the construction of the other principal components.

Fig. 8.6 Projection of the observations into the space of the first two principal components.

• The variable pedigree has a high contribution to the construction of the fourth
principal component (denoted Dim.4), a moderate contribution to the construc-
tion of the third principal component (denoted Dim.3), and a very low or no
contribution at all to the construction of the other principal components.
• The variable age has a significant contribution to the construction of the
second and seventh principal components (denoted Dim.2 and Dim.7), while its
contribution to the construction of the other principal components is very low.
• The variables mass and pressure are the main contributors to the construction of
the sixth principal component (denoted Dim.6).
We can further assess the quality of the results from the PCA by analyzing the
representation of the observations in the space of the retained principal components.
For our particular example, we can assess whether the representation of the
observations in the space of the first two principal components could facilitate
the classification of observations in terms of diabetes test outcome. Figure 8.6
provides some insights on the complexity of the task to be performed by a
classification model to predict diabetes outcome, if only the first two principal
components are retained. In fact, there is no separation between the observations
with a negative diabetes outcome, represented by circles, and those with a positive
diabetes outcome, represented by triangles. This suggests that we would need to

consider additional principal components or use a kernel PCA approach, which will
be introduced in the next section.
Listing 8.2 provides an illustration of PCA using SVD via the package MASS.
This package enables one to carry out the SVD of a given matrix X ∈ Rm×n and
outputs the matrices U , D, and V . In this listing, X is an 8 × 8 matrix, and the
output matrices U , D, and V were used to estimate the approximation of the matrix
X, denoted X̂, using the following scenarios: (a) all the principal components, (b)
only the principal components corresponding to the four largest eigenvalues, and
(c) only the principal components corresponding to the two largest eigenvalues. The
visualization of the matrices X, U, D, V, and X̂ = U D V^T for each of the three cases
is depicted in Fig. 8.7a, b, and c, respectively.
From the visualization results, it is clear that using all the principal components
enables one to reconstruct perfectly the matrix X. However, using only the
principal components corresponding to the four largest eigenvalues provides a
good reconstruction of the matrix X, whereas using only the principal components
corresponding to the two largest eigenvalues results in an average reconstruction of
the matrix X.
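The following minimal sketch illustrates such a rank-restricted reconstruction using base R's svd(); since the matrix used in Listing 8.2 is not reproduced, a random 8 × 8 matrix is used here as an assumption.

# Minimal sketch: reconstructing a matrix from a truncated SVD.
set.seed(1)
X <- matrix(runif(64), nrow = 8, ncol = 8)

s <- svd(X)                                        # X = U D V^T
reconstruct <- function(k) {
  s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k, drop = FALSE])
}

err <- sapply(c(8, 4, 2), function(k) norm(X - reconstruct(k), "F"))
err   # reconstruction error using all, the four largest, and the two largest components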

8.2.8 Kernel PCA

Although PCA enables one to reduce the dimensions of the data matrix, this doesn’t
always improve the performance of a model, as shown in the example illustrated in


Fig. 8.7 Visualization of the reconstruction of matrix X, in Listing 8.2: (a) Using all principal
components. (b) Using only the principal components corresponding to the four largest eigenval-
ues. (c) Using only the principal components corresponding to the two largest eigenvalues.

Fig. 8.6. For this reason, an extension to PCA called kernel PCA (KPCA) has been
introduced.
KPCA is the nonlinear form of PCA, and it leverages some complex spatial
transformations of the features. These transformations may sometimes require
moving into a high-dimensional feature space because such a transformation could
enable a simpler model to perform better on data with a complex structure.
To illustrate the concept of KPCA, let us consider a binary classification problem,
where the two classes are highlighted in red and blue in the plots in Fig. 8.8. The
graphs on the left represent the original data in a two-dimensional space, which
obviously require nonlinear classifiers to model the problem. The graphs on the
right-hand side represent the transformed data in a three-dimensional space. Clearly,
linear classifiers would perform well in differentiating the two classes in the space
of the transformed data.
The mapping process used in KPCA to transform the data is called a kernel.
Many kernels have been suggested in the literature, including the following:
• Linear kernel:

k(x, x′) = x^T x′;

• Polynomial kernel of degree d:

k(x, x′) = (x^T x′ + 1)^d;

• Gaussian kernel or radial basis function with bandwidth σ:

k(x, x′) = exp(−‖x − x′‖² / (2σ²)),

where x and x′ denote vectors of observations (data points) in the data set of interest.
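In R, kernel PCA with such kernels is provided, for example, by the kpca() function of the kernlab package. The following minimal sketch assumes that this package is installed; the value of the bandwidth parameter sigma is arbitrary.

# Minimal sketch: kernel PCA with a radial basis function kernel.
library(kernlab)

kp <- kpca(~ ., data = iris[, 1:4],
           kernel = "rbfdot", kpar = list(sigma = 0.1), features = 2)

scores <- rotated(kp)                      # projections onto the kernel components
plot(scores, col = as.integer(iris$Species),
     xlab = "KPC 1", ylab = "KPC 2")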

8.2.9 Discussion

Although the origin of PCA can be traced back to Pearson [385] (1901), it remains
one of the most popular techniques in multivariate analysis. PCA is a valuable tool
for removing correlated features, which can reduce the performance of machine
learning algorithms. Furthermore, PCA helps to reduce the problem of overfitting,
which can result from the high dimensionality of the data.
By reducing the dimensionality of the data, PCA also offers the opportunity to
visualize the data. Visualization is an important step in data science and exploratory
data analysis (EDA), as discussed in Chap. 6. Note that the PCA presented in the
previous sections can only be used on quantitative variables; that is, numerical
but not categorical features. However, some generalizations of PCA, such as
correspondence analysis (CA) and multiple factor analysis (MFA), allow one to


Fig. 8.8 Illustrative examples of the relevance of KPCA. (a) Example 1: original data in 2D. (b)
Example 1: transformed data in 3D. (c) Example 2: original data in 2D. (d) Example 2: transformed
data in 3D. (e) Example 3: original data in 2D. (f) Example 3: transformed data in 3D.

address the cases of qualitative variables and mixed variables (quantitative and
qualitative), respectively.

8.2.10 Non-negative Matrix Factorization

Non-negative matrix factorization (NNMF) is one of the most widely used tools in the
analysis of high-dimensional data. It enables one to automatically extract meaning-
ful features from a non-negative data set. In fact, unsupervised learning techniques,
such as PCA, can be viewed as constrained matrix factorization problems. Let
X ∈ Rn×p denote an n × p matrix with non-negative elements, representing n
samples of p-dimensional data. Then, the non-negative matrix factorization of X
consists of finding two matrices U ∈ Rn×m and V ∈ Rm×p , both with non-negative
elements, such that m < min(n, p) and

X = UV. (8.27)

Since the matrices U and V are smaller compared to X, then such a mapping can
be viewed as a compression of the data within X.
Mathematically, the NNMF problem can be formulated as follows:

Find (U, V) ∈ (R^{n×m}, R^{m×p})
subject to X = U V,                                        (8.28)
[U]_{il} ≥ 0, i = 1, . . . , n, l = 1, . . . , m
[V]_{lj} ≥ 0, l = 1, . . . , m, j = 1, . . . , p,

where [U ]il and [V ]lj denote the elements of the matrices U and V , respectively.
To solve problem 8.28, the following approximated form is generally used:

\min_{U ∈ R^{n×m},\, V ∈ R^{m×p}} F(X, U V)
subject to                                                 (8.29)
[U]_{il} ≥ 0, i = 1, . . . , n, l = 1, . . . , m
[V]_{lj} ≥ 0, l = 1, . . . , m, j = 1, . . . , p,

where the objective function F is a suitable scalar measure of the matrix X and the
product of the sought-after matrices U and V . Examples of such a scalar measure
include the following:
1. The Frobenius norm, where

F(X, U V) = ‖X − U V‖²_F                                   (8.30)
          = \sum_{i=1}^{n} \sum_{j=1}^{p} |[X]_{ij} − [U V]_{ij}|²,   (8.31)

with [X]ij and [U V ]ij denoting the elements of the matrices X and U V ,
respectively.
2. The generalized Kullback-Leibler divergence or the relative entropy, where

F(X, U V) = \sum_{i=1}^{n} \sum_{j=1}^{p} \left( [X]_{ij} \log \frac{[X]_{ij}}{[U V]_{ij}} − [X]_{ij} + [U V]_{ij} \right),   (8.32)

with [X]ij and [U V ]ij denoting the elements of the matrices X and U V ,
respectively.
Note that the functions in Eq. 8.30 and 8.32 are not convex in U and V . There-
fore, the NNMF is a non-convex optimization problem, and available numerical
optimization techniques can only guarantee locally optimal solutions to the problem.
However, since matrix multiplication is bilinear, the function F(X, U V), such as
in Eqs. 8.30 and 8.32, is convex in the argument U for fixed V, and convex in the
argument V for fixed U.
The most commonly used approach to find a local optimum of problem 8.29 is
a variant of the gradient descent method, known as the block-coordinate descent.
Let U0 and V0 denote some given initial values of the matrices U and V . Then, the
method alternates between optimizing U and optimizing V , respectively, as follows:
At a given iteration k, the updated matrices Uk and Vk are given by

U_k = \arg\min_{U} F(X, U V_{k−1})                                (8.33)
    = U_{k−1} − η_U ⊙ ∇_U F(X, U_{k−1} V_{k−1}),                  (8.34)

V_k = \arg\min_{V} F(X, U_k V)                                    (8.35)
    = V_{k−1} − η_V ⊙ ∇_V F(X, U_k V_{k−1}),                      (8.36)

where η_U and η_V are the learning rate matrices used to update the matrices U and
V, respectively; the symbol ⊙ denotes the Hadamard product or the element-wise
product; and the partial gradients ∇_U F and ∇_V F are the matrices of partial derivatives

∇_U F = \left[ \frac{∂F}{∂[U]_{il}} \right]_{i=1,\dots,n;\; l=1,\dots,m}, \qquad
∇_V F = \left[ \frac{∂F}{∂[V]_{lj}} \right]_{l=1,\dots,m;\; j=1,\dots,p}.   (8.37)

8.2.10.1 NNMF Using the Frobenius Norm as Objective Function

Using the Frobenius norm, the objective function of the NNMF problem 8.29 can be written as

F(X, U V) = ‖X − U V‖²_F
          = tr[(X − U V)^T (X − U V)]
          = tr[X^T X − X^T (U V) − (U V)^T X + (U V)^T (U V)]
          = tr[X^T X] − tr[X^T (U V)] − tr[(U V)^T X] + tr[(U V)^T (U V)]
          = tr[X^T X] − tr[X^T U V] − tr[V^T U^T X] + tr[V^T U^T U V],

where tr[Z] denotes the trace of the matrix Z.


Then, the partial gradient of F (X, U V ) with respect to U is given by

∇_U F(X, U V) = ∇_U tr[X^T X] − ∇_U tr[X^T U V] − ∇_U tr[V^T U^T X] + ∇_U tr[V^T U^T U V]
             = ∇_U tr[X^T X] − ∇_U tr[V X^T U] − ∇_U tr[X V^T U^T] + ∇_U tr[U V V^T U^T]
             = 0 − (V X^T)^T − X V^T + U[(V V^T)^T + V V^T]
             = 0 − X V^T − X V^T + 2 U V V^T
             = −2(X V^T − U V V^T).

Similarly, the partial gradient of F (X, U V ) with respect to V is given by

∇_V F(X, U V) = ∇_V tr[X^T X] − ∇_V tr[X^T U V] − ∇_V tr[V^T U^T X] + ∇_V tr[V^T U^T U V]
             = 0 − (X^T U)^T − U^T X + [U^T U + (U^T U)^T] V
             = 0 − U^T X − U^T X + 2 U^T U V
             = −2(U^T X − U^T U V).

Thus, the scheme 8.34–8.36 can be written as

U_k = U_{k−1} + η̃_U ⊙ (X V_{k−1}^T − U_{k−1} V_{k−1} V_{k−1}^T),   (8.38)
V_k = V_{k−1} + η̃_V ⊙ (U_k^T X − U_k^T U_k V_{k−1}),               (8.39)

where η̃_U = 2 η_U and η̃_V = 2 η_V.


The updating scheme given by Eqs. 8.38–8.39 is referred to as the additive
update rules. Since these rules are just the conventional gradient descent method,

all the values in the learning rates need to be set to some sufficiently small positive
numbers to ensure the convergence of the scheme.
Now, let’s set the learning rate in Eq. 8.38 to

η̃_U = η̃_{U_{k−1}} = \frac{U_{k−1}}{U_{k−1} V_{k−1} V_{k−1}^T},   (8.40)

where the fraction symbol denotes the element-wise division. Then, the additive
update rule in Eq. 8.38 can be written as

U_k = U_{k−1} + \frac{U_{k−1}}{U_{k−1} V_{k−1} V_{k−1}^T} ⊙ (X V_{k−1}^T − U_{k−1} V_{k−1} V_{k−1}^T)   (8.41)
    = U_{k−1} + U_{k−1} ⊙ \frac{X V_{k−1}^T}{U_{k−1} V_{k−1} V_{k−1}^T} − U_{k−1} ⊙ \frac{U_{k−1} V_{k−1} V_{k−1}^T}{U_{k−1} V_{k−1} V_{k−1}^T}   (8.42)
    = U_{k−1} ⊙ \frac{X V_{k−1}^T}{U_{k−1} V_{k−1} V_{k−1}^T}.   (8.43)

Let’s set the learning rate in Eq. 8.39 to

η̃_V = η̃_{V_{k−1}} = \frac{V_{k−1}}{U_k^T U_k V_{k−1}},   (8.44)

where the fraction symbol denotes the element-wise division and Uk is given by
Eq. 8.43. Then, the additive update rule in Eq. 8.39 can be written as

V_k = V_{k−1} + \frac{V_{k−1}}{U_k^T U_k V_{k−1}} ⊙ (U_k^T X − U_k^T U_k V_{k−1})   (8.45)
    = V_{k−1} + V_{k−1} ⊙ \frac{U_k^T X}{U_k^T U_k V_{k−1}} − V_{k−1} ⊙ \frac{U_k^T U_k V_{k−1}}{U_k^T U_k V_{k−1}}   (8.46)
    = V_{k−1} ⊙ \frac{U_k^T X}{U_k^T U_k V_{k−1}}.   (8.47)

The resulting update rules 8.43–8.47, given in an element-wise form in


Eqs. 8.48–8.49, are referred to as the multiplicative update rules [304].

[U_k]_{i,l} = [U_{k−1}]_{i,l} \frac{[X V_{k−1}^T]_{i,l}}{[U_{k−1} V_{k−1} V_{k−1}^T]_{i,l}},   (8.48)

[V_k]_{l,j} = [V_{k−1}]_{l,j} \frac{[U_k^T X]_{l,j}}{[U_k^T U_k V_{k−1}]_{l,j}},   (8.49)

with i = 1, . . . , n, l = 1, . . . , m, j = 1, . . . , p, and where [A]rs denotes the


element in row r and column s of the matrix A.
The data-adaptive feature of the learning rates given by Eq. 8.40 and Eq. 8.44
enables the multiplicative update rules given by Eqs. 8.48–8.49 to intrinsically
satisfy the non-negativity constraint in the NNMF problem given by Eq. 8.29.
Therefore, such learning rates improve the running time of the scheme compared
to the additive update rules 8.38–8.39.
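The multiplicative update rules in Eqs. 8.48–8.49 translate almost literally into R. The following minimal sketch is an illustration, not code from the book; the function name, the random initialization, and the small constant eps (added to avoid division by zero) are assumptions.

# Minimal sketch: NNMF with the Frobenius norm via multiplicative updates.
nnmf_frobenius <- function(X, m, n_iter = 500, eps = 1e-9) {
  n <- nrow(X); p <- ncol(X)
  U <- matrix(runif(n * m), n, m)      # non-negative initialization
  V <- matrix(runif(m * p), m, p)
  for (k in seq_len(n_iter)) {
    U <- U * (X %*% t(V)) / (U %*% V %*% t(V) + eps)   # Eq. 8.48
    V <- V * (t(U) %*% X) / (t(U) %*% U %*% V + eps)   # Eq. 8.49
  }
  list(U = U, V = V)
}

set.seed(1)
X   <- matrix(runif(64), 8, 8)
fit <- nnmf_frobenius(X, m = 4)
norm(X - fit$U %*% fit$V, "F")          # reconstruction error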

8.2.10.2 NNMF Using the Generalized Kullback-Leibler Divergence as


Objective Function

Using the generalized Kullback-Leibler divergence in Eq. 8.32 as the objective


function of the NNMF problem 8.29 yields the following additive update rules at
iteration k [304]:
 
[U_k]_{i,l} = [U_{k−1}]_{i,l} + [η_U]_{i,l} \left( \sum_{s} [V_{k−1}]_{l,s} \frac{[X]_{i,s}}{[U_{k−1} V_{k−1}]_{i,s}} − \sum_{s} [V_{k−1}]_{l,s} \right)   (8.50)

[V_k]_{l,j} = [V_{k−1}]_{l,j} + [η_V]_{l,j} \left( \sum_{r} [U_k]_{r,l} \frac{[X]_{r,j}}{[U_k V_{k−1}]_{r,j}} − \sum_{r} [U_k]_{r,l} \right)   (8.51)

Setting the learning rates in Eq. 8.50 to

[η_U]_{i,l} = [η_{U_{k−1}}]_{i,l} = \frac{[U_{k−1}]_{i,l}}{\sum_{s} [V_{k−1}]_{l,s}}   (8.52)

yields

[U_k]_{i,l} = [U_{k−1}]_{i,l} \frac{\sum_{s} [V_{k−1}]_{l,s} \frac{[X]_{i,s}}{[U_{k−1} V_{k−1}]_{i,s}}}{\sum_{s} [V_{k−1}]_{l,s}}.   (8.53)

Now, setting the learning rates in 8.51 to

[η_V]_{l,j} = [η_{V_{k−1}}]_{l,j} = \frac{[V_{k−1}]_{l,j}}{\sum_{r} [U_k]_{r,l}}   (8.54)

yields

[V_k]_{l,j} = [V_{k−1}]_{l,j} \frac{\sum_{r} [U_k]_{r,l} \frac{[X]_{r,j}}{[U_k V_{k−1}]_{r,j}}}{\sum_{r} [U_k]_{r,l}}   (8.55)

The update rules 8.53–8.55 provide the multiplicative update scheme for the
NNMF using the generalized Kullback-Leibler divergence [304].

8.2.10.3 Example of NNMF Using R

Listing 8.3 provides an illustration of NNMF using the package NMF in R. This
package allows one to carry out the NNMF of a given matrix X ∈ Rn×p for a given
rank m ≤ min(n, p). In this listing, X is an 8 × 8 matrix, and three possible values
of m were considered: (a) m = 8, (b) m = 4, and (c) m = 2. For these three cases,
the approximation of the matrix X, denoted X̂, is estimated. The visualization of
the resulting matrices, X, U , V , and X̂ = U V , for each value of m, are shown in
Fig. 8.9a, b, and c, respectively. Similar to PCA, using all the features enables one to
reconstruct perfectly the matrix X. Using only the four best features provides a good
reconstruction of the matrix X, whereas using only the two best features results in
an average reconstruction of the matrix X.
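Listing 8.3 itself is not reproduced here; a minimal sketch of such an analysis, assuming the NMF package is installed and using a random matrix for illustration, might look as follows.

# Minimal sketch along the lines of Listing 8.3 (not the original listing).
library(NMF)

set.seed(1)
X <- matrix(runif(64), 8, 8)

res   <- nmf(X, rank = 4)      # NNMF with m = 4
U     <- basis(res)            # n x m matrix U
V     <- coef(res)             # m x p matrix V
X_hat <- U %*% V               # reconstruction of X
norm(X - X_hat, "F")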

8.3 Feature Selection

In contrast with feature extraction, where a small subset of relevant features is


derived through some (nonlinear) combination of the original features, the
relevant features obtained with feature selection consist of a subset of the original


Fig. 8.9 Visualization of the reconstruction of matrix X in Listing 8.3: (a) Using all features
available. (b) Using only the four best features selected through NNMF. (c) Using only the two
best features selected through NNMF.

features. Feature selection is typically used to identify a subset S of the most relevant
features from a high-dimensional data matrix X ∈ Rn×p of input variables to target
a response variable Y, where |S| = k ≪ p. Feature selection algorithms can be
categorized into three main classes, as follows:
1. Filter methods: Statistical metrics — for example, the Pearson correlation
coefficient or the mutual information — are used to identify the most relevant
features.
2. Wrapper methods: A learning algorithm is used to train models on various
combinations of the features of matrix X and then selects the features that yield
the best out-of-sample performance.
3. Embedded methods: They perform the feature selection during the construction
of the model.

8.3.1 Filter Methods Using Mutual Information

Mutual information is a measure that quantifies a relationship between two random


variables that have been sampled simultaneously, and it forms the building block for
defining criteria for feature selection in machine learning. The mutual information
is based on the concept of entropy, which quantifies the uncertainty present in the
distribution of a variable X, and it is defined by

H(X) = −\sum_{x ∈ X} p(x) \log p(x), \quad if X is discrete,

H(X) = −\int_{X} f(x) \log f(x)\, dx, \quad if X is continuous,

where p(x) (resp. f(x)) denotes the probability mass (resp. density) function of X.


The conditional entropy of X given Y is defined by
 
H(X|Y) = −\sum_{y ∈ Y} p(y) \sum_{x ∈ X} p(x|y) \log p(x|y), \quad if X and Y are discrete,

H(X|Y) = −\int_{Y} f(y) \int_{X} f(x|y) \log f(x|y)\, dx\, dy, \quad if X and Y are continuous,

where p(x|y) (resp. f (x|y)) denotes the conditional probability density function of
X given Y and p(y) (resp. f (y)) denotes the probability density function of Y .
Mutual information estimates the amount of information about a given random
variable that can be obtained through another random variable. The mutual infor-
mation between two random variables X and Y is defined by

I(X; Y) = H(X) − H(X|Y)   (8.56)
        = \sum_{y ∈ Y} \sum_{x ∈ X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}, \quad if X and Y are discrete,   (8.57)

I(X; Y) = H(X) − H(X|Y)   (8.58)
        = \int_{Y} \int_{X} f(x, y) \log \frac{f(x, y)}{f(x)\, f(y)}\, dx\, dy, \quad if X and Y are continuous,   (8.59)

where p(x, y) (resp. f (x, y)) denotes the joint probability density function of X and
Y , and p(x) and p(y) (resp. f (x) and f (y)) are the marginal probability density
functions of X and Y .
When the random variables X and Y are independent, then p(x, y) = p(x)p(y).
From this, it follows that I (X, Y ) = 0. In other words, the mutual information
allows one to establish a similarity between p(x, y) and p(x)p(y).
The feature selection problem consists of finding the subset S that has the
maximum mutual information between its features XS and the target variable Y .
This can be formulated via the following constrained optimization problem:

Ŝ = \arg\max_{S} I(X_S; Y)   (8.60)
subject to   (8.61)
S ⊂ X   (8.62)
|S| = k.   (8.63)

In general, the optimization problem in Eqs. 8.60–8.63 is NP-hard since it requires
searching all the possible subsets S ⊂ X. Therefore, heuristic algorithms need to be
used to find (possibly suboptimal) solutions to the problem. The most commonly
used heuristic algorithms are based on greedy approximation and include the
following:
• The maximum relevance method, which selects the most relevant features
iteratively; that is, at each step it adds to S the feature X_i that maximizes the
relevance I(X_i; Y).
• The minimum redundancy and maximum relevance method [387], which
selects the most relevant and least redundant features iteratively; that is, at each
step it adds to S the feature X_i that maximizes the relevance I(X_i; Y) while
minimizing the average redundancy \frac{1}{|S|} \sum_{X_j ∈ S} I(X_i; X_j) with the already selected features.

The main difference between these two algorithms lies in the formulation of the
objective function. While the maximum relevance algorithm extracts all the features
that contribute to maximizing relevance, the minimum redundancy and maximum
relevance algorithm extracts only independent features (that is, with minimum
redundancy) that contribute to maximizing relevance.
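The following minimal sketch illustrates the mRMR strategy for discrete (or discretized) features; maximum relevance corresponds to dropping the redundancy term. The plug-in mutual information estimator and the function names are assumptions made for this illustration, not implementations from the book.

# Minimal sketch: greedy mRMR feature selection.
mi <- function(x, y) {                        # plug-in mutual information
  p_xy <- table(x, y) / length(x)
  p_x  <- rowSums(p_xy); p_y <- colSums(p_xy)
  idx  <- p_xy > 0
  sum(p_xy[idx] * log(p_xy[idx] / outer(p_x, p_y)[idx]))
}

mrmr <- function(X, y, k) {
  selected   <- integer(0)
  candidates <- seq_len(ncol(X))
  for (step in seq_len(k)) {
    score <- sapply(candidates, function(j) {
      relevance  <- mi(X[, j], y)             # I(X_j; Y)
      redundancy <- if (length(selected) == 0) 0 else
        mean(sapply(selected, function(s) mi(X[, j], X[, s])))
      relevance - redundancy                  # maximum relevance would use relevance only
    })
    best       <- candidates[which.max(score)]
    selected   <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  selected
}

# Example: discretize the iris features and select the two best for Species
Xd <- apply(iris[, 1:4], 2, function(v) cut(v, breaks = 5, labels = FALSE))
mrmr(Xd, iris$Species, k = 2)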

8.4 Summary

In this chapter, we discussed feature extraction and feature selection. While both
approaches aim to reduce the dimensionality of the features, i.e., p, in the original
data matrix, they differ fundamentally in the way this is realized. For feature
extraction, this is accomplished by transforming the original features into a lower-
dimensional space. This results in synthetic (or new) features generated from the
original features. Hence, the generated features contain information about all the
original features to a certain degree. In contrast, feature selection performs a literal
selection of a subset of features from the original features.
Feature extraction techniques such as PCA derive new synthetic features using a
linear combination of the original ones. This process is not exempt from some loss
of information, although the technique strives to minimize this loss. The derived
synthetic features can have a high discrimination power and enable control of
the overfitting problem. However, the synthetic features can be computationally
expensive to obtain, and their interpretability is not obvious due to the synthetic
nature of these new features.
The underlying problem for feature selection is NP-hard. Hence, the techniques
available in the literature can only guarantee suboptimal solutions. Since feature
selection yields a subset of features from the original ones, there is no interpretabil-
ity issue with the features, as opposed to feature extraction. This is important

in many contexts and applications, such as genomics, where the meaning of the
features is important; for instance, when they are used as biomarkers.
Learning Outcome 8: Dimension Reduction

There are two conceptually different approaches to dimension reduction:


feature extraction and feature selection. Both result in a low-dimensional
representation of a data matrix, but the meaning of “features” is entirely
different.

8.5 Exercises

1. Load the R built-in data set “decathlon” and carry out a PCA on the data set after
removing the variables “Rank,” “Points,” and “Competition.”
a. Provide the summary of the PCA and discuss the results.
b. Plot the scree plot and identify the appropriate number of principal compo-
nents to be retained.
c. Plot the correlation graph between the first two principal components and the
variables and discuss the results.
d. Plot the graph representing the contribution of the variables in the construction
of the principal components, and then discuss the results.
e. Reconstruct the original data using all the principal components and discuss
the results.
f. Reconstruct the original data using the first four principal components and
discuss the results.
g. Reconstruct the original data using the first two principal components and
discuss the results.
2. Load the R built-in data set “decathlon” and carry out an NNMF on the data set
after removing the variables “Rank,” “Points,” and “Competition.”
a. Provide the summary of the NNMF and discuss the results.
b. Reconstruct the original data using all the features, discuss the results, and
compare them with the results from 1.e.
c. Reconstruct the original data using the best four features, discuss the results,
and compare them with the results from 1.f.
d. Reconstruct the original data using the best two features, discuss the results,
and compare them with the results from 1.g.
Chapter 9
Classification

9.1 Introduction

In this chapter, we discuss classification methods. We start by clarifying what a


classification means and what type of data are needed. Then we discuss aspects
common to general classification methods. This includes an extension of measures
for binary decision-making to multi-class classification problems. As we will
see, this extension is not trivial, because the contingency table becomes multi-
dimensional when conditioned on different classes.
There are many classification methods that have been developed in statistics
and machine learning, making it impossible to provide comprehensive coverage.
For this reason, in this chapter we selected six important and popular methods
(namely, naive Bayes classifier, linear discriminant analysis, k-nearest neighbor
classification, logistic regression, support vector machine, and decision tree) that
provide a representative overview of the diverse ideas underlying classification
methods widely used in many applications.

9.2 What Is Classification?

Classification is a supervised learning task. That means in addition to the data


points, X = {x1 , . . . , xn } with xi ∈ Rp and p ∈ N, where n corresponds to the
sample size and p to the number of features, information about the classes to which
these data points belong, yi , is needed. Thus, to study a classification problem,
paired data of the form X = {(x1 , y1 ), . . . , (xn , yn )} are required; see Fig. 9.1. It
is important to highlight that the class labels, yi , are categorical variables, because
they just provide an indicator — a label — to name a certain class uniquely. For
example, for a binary classifier, one could use the labels y ∈ {0, 1}, y ∈ {+1, −1},
or y ∈ {N, P } to indicate the two classes. For the former choices, it is important to


Data type:
X = {(x_i, y_i)}_{i=1}^{n} with x_i ∈ R^p, y_i ∈ {l_1, l_2}, where l_1 and l_2 are categorical variables
xi is called a data point or feature vector
yi is called a class label
p: number of features or number of variables
n: number of samples

Question addressed:
What class label should be assigned to a (new) data point?

Principles of major classification approaches:


Probability distribution-based classification =⇒ estimation of conditional
probability distributions to make probabilistic predictions for the class labels
of new data points

Support Vector Machine =⇒ mapping of the data points xi into a high-


dimensional feature space by utilizing the kernel trick allows the linear sepa-
ration of these data points by hyperplanes

Decision Tree =⇒ a sequence of linear, binary decisions, organized as a tree,


leads to the classification of data points

Fig. 9.1 Overview of the classification problem with respect to the data type used, the question
addressed, and the principal methods discussed in this chapter.

remember that the numbers do not have their usual meaning; for example, that 0 is
smaller than 1 — that is, 0 < 1. Instead, they are merely used as labels (categorical
variables) and not as numerical values.
The availability of class labels provides valuable additional information that
can be used in the learning process of a classification method. Specifically, when
learning the parameters of a classification method, the error of this classifier can be
quantified by means of the class labels. This can be seen as a “supervision” of the
learning process because guided feedback can be evaluated and used to improve the
classifier. This inspires the name “supervised learning.”
Before we discuss the different classification methods in detail, we must review
some aspects common to general classification problems.

9.3 Common Aspects of Classification Methods

9.3.1 Basic Idea of a Classifier

A classification method, also called a classifier, is a prediction model, M, that makes


a prediction about the class label of a data point, x. That means

y′ = M(x; α), (9.1)

where α corresponds to the (true) parameter(s) of the model and y′ corresponds to
the predicted class label. We distinguish the predicted class labels y′ from the true class
labels y because the model can make errors.
The prediction results in the pair (x, y′), which associates x with the class y′.
Formally, the aim of a classifier M is to learn a mapping from an input space X to
an output space Y with
M : X → Y. (9.2)

9.3.2 Training and Test Data

For classification methods, it is important to distinguish the following two types of


data sets: the training data, Dtrain , and the test data, Dtest .
The training data are used for learning the parameter(s) of the model given by α.
That means the estimated model parameter(s) α̂ are a function of the training data:

α̂ = f (Dtrain ). (9.3)

Here, the “hat” indicates that the values of the parameters α̂ are estimates, which
can be different from the true values given by α.
In contrast, test data are used to evaluate the prediction capabilities of a model
by estimating an error measure:

Ê = g(M(x; α̂), Dtest ). (9.4)

Here, the function g corresponds to an error measure, such as accuracy or F-score


(see Chap. 3). Since the estimated parameter(s) of the model α̂ are needed for this,
the estimated error measure is a function of Dtrain and Dtest .
Formally, supervised learning can be defined based on the definition of a domain
and a task.
Definition 9.1 A domain D consists of a feature space, X, and a marginal
probability distribution, P (X), where X ∈ X, and it is given by D = {X, P (X)}.

Definition 9.2 A task T consists of an outcome space Y and a prediction function,


f (X), with f : X → Y. Therefore, T = {Y, f (X)}.
The task provides a mapping from the feature space X to the outcome space Y.
Before we proceed, we would like to make a general remark. When we discuss
model selection in Chap. 12 and the expected generalization error in Chap. 18, we
will see that, in general, one needs to distinguish between training data, testing data,
and validation data. However, in this chapter we will focus on individual classifi-
cation methods, which require only model assessment but no model selection. That

means the following setting serves an educational purpose by allowing us to focus


on individual methods, but it does not suit real-world problems where one needs to
choose among alternative classification models.

9.3.3 Error Measures

To evaluate a classifier, one needs to specify an error measure. In Chap. 3, we


discussed many error measures for binary decision-making, which corresponds to
a two-class classification. We have seen that essentially all such error measures are
based on the four fundamental errors: TP (true positives), TN (true negative), FP
(false positives), and FN (false negatives). That means such measures are (nonlinear)
functions of TP, TN, FP, and FN.
Because a classifier makes a unique prediction for the class labels of input
samples, the four fundamental errors are easy to estimate for a given test data set,
DTest = {(x1 , y1 ), . . . , (xT , yT )}, in the following way:


TP = \sum_{i=1}^{T} I(M(x_i) = y_i \mid y_i = +1)   (9.5)

TN = \sum_{i=1}^{T} I(M(x_i) = y_i \mid y_i = −1)   (9.6)

FP = \sum_{i=1}^{T} I(M(x_i) ≠ y_i \mid y_i = −1)   (9.7)

FN = \sum_{i=1}^{T} I(M(x_i) ≠ y_i \mid y_i = +1)   (9.8)

Here, the function I () is the so-called indicator function. It gives only two different
values, either a 1 or a 0.
I(x) = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{if } x \text{ is false} \end{cases}   (9.9)

If the argument of the indicator function is “true,” it gives 1; otherwise, 0. The


separator “|” should be understood as “conditioned on” the argument on the right-
hand side. For example, the explicit meaning of I(M(x_i) = y_i | y_i = +1) is that if
y_i is given and y_i = +1 holds, then for M(x_i) = y_i the indicator function is 1. A
moment of reflection will reveal the meaning of this evaluation, corresponding to a
true positive prediction of the classifier. Hence, a summation over all elements i of
I(M(x_i) = y_i | y_i = +1) in the test data set gives the total number of true positive
predictions (TP).

A convenient way to summarize these results for a two-class classification is by
means of a contingency table or confusion matrix, shown in Table 9.1.

Table 9.1 Contingency table summarizing the results of a two-class classifier.

                  Predicted
Truth             Class +1      Class −1
Class +1          TP            FN
Class −1          FP            TN

9.3.3.1 Error Measures for Multi-class Classification

In this chapter, we study multi-class classification in addition to two-class classifiers.


However, this requires an extension of error measures for binary decision-making.
Specifically, there are two types of such extensions. The first type provides an error
measure for all classes, while the second gives an error measure for each class by
providing information about “one class versus the remaining classes.”
An example for the first type of multi-class error measure is the accuracy (ACC).
Specifically, in this case of multi-class classification, the accuracy (ACC) is defined
by

ACC = \frac{1}{T} \sum_{i=1}^{T} I(M(x_i) = y_i)   (9.10)

where T is the total number of predictions. This is a simple generalization of the


definition of accuracy for binary classification that maintains its original meaning;
that is, the number of correct predictions is divided by the number of total
predictions (see Chap. 3).
Examples of the second type of multi-class error measure are sensitivity,
specificity, positive predictive value, and negative predictive value. Specifically,
instead of defining a global error measure over all classes, as for the accuracy,
the aforementioned measures are defined for each class separately. For a c-class
classification problem, with c > 2, this means that one has, for example, c sensitivity
values (one for each class). The logic behind this is to define the four fundamental
errors for “one against the rest.” For simplicity, let’s assume we have a three-
class classification problem with c = 3; however, an extension to more classes
is straightforward.
Figure 9.2 (top) shows a numerical example of a contingency table (the numbers
correspond to the results from an LDA classification, discussed in Sect. 9.5, but this
is not important for our current discussion), and the following three tables show how
the meaning of the cells changes with respect to the four fundamental errors when
conditioned on different classes. That means a cell in a contingency table does not
always correspond to, for example, a TP, but this depends on the class designated as
“positive.” So, this conditioning results in three contingency tables.

Example:
                               predicted outcome
                     Class 1    Class 2    Class 3    Total
actual    Class 1    2767       217        16         b1
outcome   Class 2    110        2846       44         b2
          Class 3    9          51         2940       b3
          Total      a1         a2         a3         T

For class 1:
                               predicted outcome
                     Class 1    Class 2    Class 3    Total
actual    Class 1    TP         FN         FN         b1
outcome   Class 2    FP         TN         TN         b2
          Class 3    FP         TN         TN         b3
          Total      a1         a2         a3         T

For class 2:
                               predicted outcome
                     Class 1    Class 2    Class 3    Total
actual    Class 1    TN         FP         TN         b1
outcome   Class 2    FN         TP         FN         b2
          Class 3    TN         FP         TN         b3
          Total      a1         a2         a3         T

For class 3:
                               predicted outcome
                     Class 1    Class 2    Class 3    Total
actual    Class 1    TN         TN         FP         b1
outcome   Class 2    TN         TN         FP         b2
          Class 3    FN         FN         TP         b3
          Total      a1         a2         a3         T

Fig. 9.2 Contingency tables for a three-class classification. Top: Numerical results for an LDA
classification (see Sect. 9.5). The remaining contingency tables show how the meaning of the cells
changes with respect to the four fundamental errors when conditioned on different classes.

To emphasize the focus on the different classes, we highlighted one row and
one column for each of these tables that have a succinct meaning for defining the
four fundamental errors. Overall, each contingency table allows us to estimate the
sensitivity, specificity, positive predictive value, and negative predictive value in the
usual way, as discussed in Chap. 3; however, with the meaning “one against the rest.”
For instance, for the numerical values in Fig. 9.2 (top), one obtains the following
values for the sensitivity:
• Class 1: 0.92
• Class 2: 0.94
• Class 3: 0.98
Hence, for a c-class classification problem, the contingency table becomes multi-
dimensional. Based on these class-specific errors, one can obtain a global error
measure by summarizing the “local” errors. For instance, the global sensitivity (true
positive rate [TPR]) is given by

TPR_global = (1/c) Σ_{i=1}^{c} TPR_i.                        (9.11)
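
To make these calculations concrete, the following short base-R snippet recomputes the class-specific sensitivities and the global TPR from the counts shown in Fig. 9.2 (top); it is a minimal sketch, not one of the book's listings.

# Confusion matrix from Fig. 9.2 (top): rows = actual classes, columns = predicted classes
cm <- matrix(c(2767,  217,   16,
                110, 2846,   44,
                  9,   51, 2940),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = paste("Class", 1:3),
                             predicted = paste("Class", 1:3)))

# Class-specific sensitivity: TP_i divided by the number of actual class-i samples
tpr <- diag(cm) / rowSums(cm)
round(tpr, 2)       # compare with the values reported in the text

# Global sensitivity (TPR) according to Eq. 9.11
mean(tpr)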

9.4 Naive Bayes Classifier

The first classification method we discuss is the naive Bayes classifier. The basic
idea of a naive Bayes classifier is quite simple. This model is given by a conditional
probability distribution p(c|x) that is used to classify an instance x as follows:
cp(x) = argmax_{c ∈ C} { p(c|x) }.                           (9.12)

That means for a given instance x, this classifier uses a conditional probability
distribution to make a prediction for the class label of the data point x by selecting
the class label with the maximum probability.
If one considers p(c|x) as the posterior probability and p(x|c) as the distribution
from which the samples are drawn (called the likelihood; see Chap. 6), the name of this
classifier becomes clear, and Eq. 9.12 can be written using Bayes' theorem (see
Chap. 6) as follows:

cp(x) = argmax_{c ∈ C} { p(x|c) p(c) / p(x) }.               (9.13)

9.4.1 Educational Example

In the following, we give a simple educational example to clarify the working


mechanism of a naive Bayes classifier. First, we generate a training data set for
two classes with labels 1 and 2. For each class, we draw ni samples from a normal
distribution with a mean of μi and a standard deviation of σi using Listing 9.1:

Fig. 9.3 True and estimated probability distributions for two classes from which data points are
sampled; that is, x ∼ p(ci |x) with i ∈ {1, 2}. The estimated distributions are obtained from n1
samples from class 1 and n2 samples from class 2, as shown in the rug plot. The true decision
boundary is shown as a black vertical dashed line.

Figure 9.3 shows the true distributions for classes 1 (blue) and 2 (green) as
simulated in Listing 9.1. In addition, the drawn samples n1 and n2 are shown as a rug
plot (colored lines above the x-axis). It is important to note that the true distributions
of both classes overlap, as indicated by the vertical dashed line (red); however, the
samples drawn from these distributions are nicely separated. That means, on the
left-hand side, first come all samples from class 1 (blue), and then come all samples
from class 2 (green).
A naive Bayes classifier learns a conditional probability distribution; that is,

p(x|ci ) = p(x|mi , si ), (9.14)

for each class by assuming a normal distribution. That means the training samples
from both classes are used separately to obtain maximum likelihood estimates for
the mean and the standard deviation of a normal distribution that describes the
data best. For this reason, in Eq. 9.14, we did not use the Greek symbols for the
population mean and standard deviation — that is, μi and σi , respectively — but
rather the sample estimates of the mean (given by mi ) and standard deviation (given
by si ) to indicate that these values are estimates from the training data.
In R, this can be accomplished by using the library klaR, which provides the
function NaiveBayes(); setting the option "usekernel=F" ensures that normal
densities are fitted. The prior for the naive Bayes classifier can either be explicitly
given, using the "prior" option, or be left unspecified, in which case the class
proportions of the training set are used.

The numerical results for our sample data can be obtained by applying the
function predict() to the estimated model with a certain data set as argument.
Listing 9.2 shows the corresponding implementation using R. In this case, we are
using the training data as the test data; that is, we evaluate in-sample rather than
out-of-sample predictions (see Chap. 4). As a result, we obtain a perfect classification
because the predicted classes correspond to the true classes for all instances.
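
Since the book's listings are not reproduced here, the following is a minimal sketch of how such an analysis could look. The sample sizes, means, and standard deviations are illustrative assumptions, not the values used in the book's Listing 9.1.

library(klaR)

set.seed(123)
n1 <- 25; n2 <- 25                       # hypothetical sample sizes
x  <- c(rnorm(n1, mean = 1, sd = 0.3),   # samples from class 1
        rnorm(n2, mean = 2, sd = 0.3))   # samples from class 2
y  <- factor(rep(c(1, 2), times = c(n1, n2)))
df <- data.frame(x = x, y = y)

# Fit a naive Bayes classifier with normal class-conditional densities
model <- NaiveBayes(y ~ x, data = df, usekernel = FALSE)

# In-sample predictions (training data used as test data)
pred <- predict(model, newdata = df)
table(truth = df$y, predicted = pred$class)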

To understand the numerical outcome of this classification, we included in


Fig. 9.3 the estimated distributions for classes 1 (light blue) and 2 (dark green),
and we also added the estimated decision boundary as a vertical dashed line. As one
can see, the estimated distributions are not identical to the true class distributions,
but they provide reasonable estimates, especially given the small number of samples
used. Most important, the estimated decision boundary, which is directly obtained
from the estimated distributions, separates perfectly samples from class 1 (left-hand
side) from those in class 2 (right-hand side). For this reason, the classification is
perfect for these data.
Descriptively, the decision boundary in Eq. 9.13 is given by the intersection point
of the (estimated) distributions, enabling the classifier to predict class 1 to its left-
hand side and class 2 for x values on its right-hand side. Formally, this can be written
as
Predict class 1 if p(x|c1) > p(x|c2);
Predict class 2 if p(x|c2) > p(x|c1).

In Listing 9.3, we show how to estimate and visualize the probability distribu-
tions underlying a naive Bayes classifier.
Before we continue discussing further examples for the naive Bayes classifier,
we would like to stop for a moment and reflect about the position of the decision
boundary. Intuitively, if we took into account only the position of the data points, any
decision boundary between the last instance from class 1 and the first instance from
class 2 would lead to the same numerical result as the classification for the preceding
data. However, application of this strategy would in general not lead to an optimal
decision boundary for unseen data. The latter point is very important because the

goal of any classifier is to utilize a training sample to learn its parameters as well as
possible so that unseen data (test data) can be classified optimally. This extrapolation
of the training data toward test data, to obtain optimal classification results, is
practically realized by a naive Bayes classifier by estimating the parameters of
(normal) distributions instead of devising a learning rule directly based on the data
points.

9.4.2 Example

In the following, we investigate the behavior of the accuracy (A), the precision (P),
and the recall (R) values of a naive Bayes classifier as functions of the distance
between the mean values of the distributions of the two classes. Specifically, for
the true normal distributions of classes 1 and 2, we assume the parameters μ1 =
1, σ1 = 0.35, and μ2 = f × μ1 , σ2 = 0.25. Here, f is a positive parameter that
allows one to increase the distance between the means of class 1 and class 2. In the
following, this parameter can assume the values f ∈ {1, 1.1, 1.2, 1.5, 2}. For f = 1
the mean of both classes is identical, i.e., μ1 = μ2 = 1, and for all other values of
f , the distance increases up to μ2 = 2 for f = 2.

In Listing 9.4, we evaluate the accuracy (A), the precision (P), and the recall
(R) values of a naive Bayes classifier for every value of f . To do this, we generate
independent test data of size N1t = 1000 and N2t = 3000. For the corresponding
training data, we generate much smaller data sets of size N1 = 30 and N2 = 20.
The results are shown in Fig. 9.4.
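
A minimal sketch of such an evaluation is given below (it is not the book's Listing 9.4). It uses the parameter values stated in the text; treating class 1 as the "positive" class for precision and recall is an assumption on our part.

library(klaR)

set.seed(1)
f.values <- c(1, 1.1, 1.2, 1.5, 2)
mu1 <- 1; s1 <- 0.35; s2 <- 0.25
N1 <- 30; N2 <- 20; N1t <- 1000; N2t <- 3000

results <- sapply(f.values, function(f) {
  mu2 <- f * mu1
  train <- data.frame(
    x = c(rnorm(N1, mu1, s1), rnorm(N2, mu2, s2)),
    y = factor(rep(c(1, 2), times = c(N1, N2))))
  test <- data.frame(
    x = c(rnorm(N1t, mu1, s1), rnorm(N2t, mu2, s2)),
    y = factor(rep(c(1, 2), times = c(N1t, N2t))))

  model <- NaiveBayes(y ~ x, data = train, usekernel = FALSE)
  pred  <- predict(model, newdata = test)$class

  # four fundamental errors with class 1 taken as "positive"
  TP <- sum(pred == 1 & test$y == 1)
  TN <- sum(pred == 2 & test$y == 2)
  FP <- sum(pred == 1 & test$y == 2)
  FN <- sum(pred == 2 & test$y == 1)

  c(accuracy  = (TP + TN) / (TP + TN + FP + FN),
    precision = TP / (TP + FP),
    recall    = TP / (TP + FN))
})
colnames(results) <- paste0("f=", f.values)
round(results, 2)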

Fig. 9.4 Accuracy, precision, and recall values of a naive Bayes classifier against the distance
between the means of the two classes.

One can see that with an increasing distance between the mean values of the class
distributions, the accuracy (A), the precision (P), and the recall (R) values increase.
This is expected because the farther the distance, the easier the classification
problem (see also Fig. 9.3). For f = 2 the obtained error measures assume almost
perfect values (close to one), meaning that a further increase of the distance would
not lead to a further improvement in the classification performance. This allows one
to make statements about a saturating convergence of the classification method.
Interestingly, for f = 1, the values of A, P, and R are not “equally bad,” but
the recall values are quite high compared to A and P. This is a reflection of the
nonlinear dependency of the error measures on the four fundamental errors, TP, TN,
FP, and FN (see Chap. 3) and provides another argument why there is not just one
error measure that is important but different error measures that complement each
other (see the discussion in Chap. 3). This is actually true not only for a naive Bayes
classifier but also for any other classifier or statistical method for which TP, TN, FP,
and FN can be obtained.

9.5 Linear Discriminant Analysis

A classification method that is related to a naive Bayes classifier is linear dis-


criminant analysis (LDA). This method also estimates conditional probability
distributions that correspond to normal distributions,

p(x|y = i) = N(x|μi, Σi)                                                          (9.15)
           = 1 / ( (2π)^{p/2} |Σi|^{1/2} ) exp( −(1/2) (x − μi)^T Σi^{−1} (x − μi) ),   (9.16)

similar to the naive Bayes classifier. However, it additionally assumes that the
covariance matrices for all classes are identical; that is, Σi = Σ for all i.
This constraint on the covariance matrices can be utilized to derive a simplified
criterion for identifying a MAP solution given by

cp(x) = argmax_{c ∈ C} { p(c|x) }.                           (9.17)

Applying Bayes’ theorem to Eq. 9.16 and taking the logarithm lead to

δi(x) = log p(y = i|x) = log p(x|y = i) + log pi                                  (9.18)
      = −(1/2)(x − μi)^T Σi^{−1} (x − μi) − log( (2π)^{p/2} |Σi|^{1/2} ) + log pi  (9.19)
      = x^T Σi^{−1} μi − (1/2) μi^T Σi^{−1} μi + log pi                            (9.20)
        − (1/2) x^T Σi^{−1} x − log( (2π)^{p/2} |Σi|^{1/2} ),                      (9.21)

where the terms in line (9.21) are identical for all δi(x), since Σi = Σ.

Here, pi is the prior for class i. Since LDA assumes Σi = Σ for all classes, only
the terms containing an μi and pi are different for different classes. This leads to
the simplified linear discriminant function given by

δi(x) = x^T Σ^{−1} μi − (1/2) μi^T Σ^{−1} μi + log pi,       (9.22)

from which the MAP class predictions are obtained as follows:

c = argmax_{i ∈ C} { δi(x) }.                                (9.23)

As an example of an LDA, let’s study a three-class classification problem.


For this, we simulate training and test data drawn from two-dimensional normal
distributions with mean μi and a common covariance matrix Σ. Listings 9.5 and 9.6
show how these data are simulated, and in Fig. 9.5 (top left and top right), we
show a visualization of the training data for the three classes. For the training
data, we generated N1 = 30, N2 = 50, N3 = 40 samples, and for the test data
N1 = N2 = N3 = 3000.

In Fig. 9.5 (top right), we highlight the mean values of the normal distributions
to see how far apart the centers of mass are from each other. Furthermore, one can
see that the overlap of the three distributions is moderate; that is, the mixing of data
points from different classes is limited.

In R, the package MASS provides the function lda(), which implements the LDA
classifier. Here, “df.train” is a data frame that contains the training data generated
using Listing 9.5, and the column of this data frame named “y” contains the class
labels corresponding to the two-dimensional data points, x, provided in the first two
columns.
Listing 9.7 also contains the predictions of the model for the test data. It is
important to note that the data frame of the test data needs to have the same column
names as “df.train,” corresponding to the names of the p = 2 dimensional predictor
variables.

We would like to note that since we are analyzing a three-class classification


problem, for the assessment we need to use the error measures discussed in
Sect. 9.3.3.1, which assess “one class against the rest.” As one can see, the LDA
results in a very good classification, and the error measures for the individual classes
are quite homogeneous.
In Fig. 9.5 (bottom row), we show level plots for the decision boundaries for the
training data (bottom left) and the posterior probabilities for the class predictions
(bottom right). From this visualization, one can see why the LDA is a linear
classifier, as the decision boundaries between two classes are straight lines. In a later
section of this chapter, we will see that there are other classification methods that

Fig. 9.5 Linear discriminant analysis (LDA). Top left: Distribution of the training data. Top
right: Training data with the mean values of the class distributions highlighted. Bottom
left: Decision boundaries obtained for the training data. Bottom right: Visualization of the
corresponding probabilities for the decisions on the left side.

allow jagged, nonlinear boundaries (see Fig. 9.8) that can deal with additional overlapping
data points from different classes.

9.5.1 Extensions

As just discussed, the LDA assumes that the covariance matrices of all classes are
identical, Σi = Σ. If we do not make this assumption, the linear discriminant
function in Eq. 9.22 becomes nonlinear, because the quadratic terms in x are
no longer identical for the different classes. Thus, allowing arbitrary covariance
matrices leads to the quadratic discriminant function

δi(x) = −(1/2) log|Σi| − (1/2)(x − μi)^T Σi^{−1} (x − μi) + log pi          (9.24)
      = δ̃i(x) − (1/2)(x − μi)^T Σi^{−1} (x − μi),                           (9.25)

where δ̃i(x) = −(1/2) log|Σi| + log pi collects the terms that do not depend on x.
Further extensions are possible and allow, for example, mixtures of Gaussians
or use of non-parametric density estimates. In general, such methods are called
Gaussian discriminant analysis.

9.6 Logistic Regression

The logistic regression method is a member of a larger family of methods called


generalized linear models (GLMs), which are discussed in Chap. 11. For this
reason, we present here only the underlying idea of logistic regression and discuss
further details in Sect. 11.6.4.4.
In contrast with common regression methods, logistic regression is used for
binary data and hence can be used for classification. Binary data means that
the response variable Y can assume two values; for example, 1 and 0. The idea
of logistic regression is similar to that of the previous methods because logistic
regression also aims to estimate a conditional probability distribution. Specifically,
a logistic regression model estimates

p(Y = 1|x), (9.26)

providing a probability for Y = 1 given an input x. Due to the binary nature of the
response, from this estimate it follows that

p(Y = 0|x) = 1 − p(Y = 1|x). (9.27)

Definition 9.3 The probability p(Y = 1|x) corresponds to the proportion of


responses giving Y = 1 for an input x. For brevity, we write

p(x) = p(Y = 1|x). (9.28)

Let’s visualize this definition with an example. In Listing 9.8, we show the
first few lines of the data from the Michelin Guides for restaurants in New York.
To fit each data line into one line, we skipped a few columns, which in the
following we do not need anyway. Briefly, the data provide some information about
whether a restaurant is recommended by the Michelin Guides (“InMichelin”). For
the following analysis, we use “InMichelin” as the response variable, which can be
either “Yes” or “No.” The decision for including a restaurant in the Michelin Guides

is based on a number of covariates, of which four are numerical (“Food,” “Decor,”


“Service,” and “Cost”) and one is categorical (“Cuisine”). The numerical covariates
provide scores for the corresponding categories, such as for service or food, whereas
the categorical variable labels a restaurant with respect to the type of food provided.
To discuss the idea of logistic regression, it is sufficient to restrict our analysis to
one covariate. For our analysis, we select the “Food” score.

In Table 9.2, we show summary data for the Michelin Guides data. Specifically, we
obtained the count data shown by counting, for example, how many restaurants (m)
received a food score of 15. Then we counted how many of these restaurants are
recommended by the Michelin Guides ("InMichelin"). From these values, we
estimate the proportion of recommended restaurants, given by

prop(restaurants in Michelin guide | food score) = InMichelin / m.     (9.29)
For this estimate, it is important to highlight that the proportion has a conditional
dependency on the food score.
In Fig. 9.6, we visualize the data in Listing 9.8. Specifically, the histograms in
the top figure correspond to the response variables “Yes” (green histogram) and
“No” (blue histogram) for the corresponding food scores. These food scores are
also shown in the boxplots at the bottom of the figure. These are the raw data from
Listing 9.8. In contrast, the (black) points in Fig. 9.6 correspond to the estimated
values of the proportion in Listing 9.8. To emphasize that the proportion has been
estimated, we call it the “sample proportion” (in contrast with the population
proportion). Overall, for the estimated proportion, we can observe a tendency toward
increasing values for increasing food scores (Fig. 9.6).
Formally, a logistic regression model is given by

log( p(x) / (1 − p(x)) ) = β0 + β1 x.                        (9.30)

Table 9.2 Summary data from the Michelin Guides for New York restaurants.

      InMichelin   Food score   m    Prop
1     0            15.00        1    0.00
2     0            16.00        1    0.00
3     0            17.00        8    0.00
4     2            18.00        15   0.13
5     5            19.00        18   0.28
6     8            20.00        33   0.24
7     15           21.00        26   0.58
8     4            22.00        12   0.33
9     12           23.00        18   0.67
10    6            24.00        7    0.86
11    11           25.00        12   0.92
12    1            26.00        2    0.50
13    6            27.00        7    0.86
14    4            28.00        4    1.00

The term on the left-hand side, that is, log( p(x) / (1 − p(x)) ), is called the logit or
log odds. The latter name comes from the fact that

odds(p) = p / (1 − p).                                       (9.31)

Hence, Eq. 9.30 can also be written as

log( odds(p(x)) ) = logit( p(x) ) = β0 + β1 x.               (9.32)

The odds in Eq. 9.31 can assume values between 0 and ∞; that is,

odds : [0, 1] → [0, ∞). (9.33)

However, taking the logarithm, the logit can assume values between −∞ and ∞.
This makes sense because the regression term on the right-hand side is unbounded.
Solving for p(x) in Eq. 9.30 gives

p(x) = exp(β0 + β1 x) / (1 + exp(β0 + β1 x)) = 1 / (1 + exp(−(β0 + β1 x)))   (9.34)
     = S(β0 + β1 x).                                                          (9.35)

Here, S is the logistic function, which is an example of a sigmoid function describing


an “S”-shaped curve. The logistic function assumes values between 0 and 1; that is,

logistic function S : (−∞, ∞) → [0, 1]. (9.36)



Fig. 9.6 Visualization of the Michelin data. Top figure: Estimated proportions of the response
variable. Bottom figure: Boxplots of the food scores in dependence on the recommendations given
in the Michelin Guides.

This is convenient since the probability on the left-hand side also assumes values
between 0 and 1.
In Listing 9.9, we fit the logistic regression model to our data, and the results are
shown in Fig. 9.7. To use the estimates of the model, given by β̂0 and β̂1, to predict
the class of a new instance x, we proceed as follows. First, we calculate p(x), given
by

p(x) = exp(β̂0 + β̂1 x) / (1 + exp(β̂0 + β̂1 x)).              (9.37)

Then, we need to define a decision boundary. For this, we define


classLR(x) = { 0, if p(x) ≤ 0.5,
             { 1, if p(x) > 0.5.                             (9.38)

Hence, based on the definition of classLR (x), we can make a class prediction for
every new instance x.
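
A minimal sketch of these steps with glm() is given below (it is not the book's Listing 9.9); it assumes the Michelin data are available in a data frame called "michelin" with columns "InMichelin" (Yes/No) and "Food" (food score), which is an assumption about the object name.

# Encode the binary response as 0/1
michelin$y <- ifelse(michelin$InMichelin == "Yes", 1, 0)

# Fit the logistic regression model of Eq. 9.30 with the food score as covariate
fit <- glm(y ~ Food, data = michelin, family = binomial)
coef(fit)                      # estimates of beta_0 and beta_1

# Predicted probabilities p(x) and class predictions via the 0.5 threshold (Eq. 9.38)
p.hat <- predict(fit, type = "response")
class.pred <- ifelse(p.hat > 0.5, 1, 0)
table(truth = michelin$y, predicted = class.pred)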

9.7 k-Nearest Neighbor Classifier

The next classification method we discuss is the k-nearest neighbor (KNN).


In this section, we give the basic idea of k-nearest neighbor classification. From
a given training data set, each data point can be represented by a set of variables. So,
the data points are plotted in a high-dimensional vector space; the axes in the space

Fig. 9.7 Visualization of the Michelin data and the estimated proportions.

correspond to the variables under consideration. Also, we assume that we have a


fixed number of classes, which are usually represented by colors; for example, in a
two-dimensional space. Now, a new data point from a test data set can be classified
by determining the k-nearest neighbors that are most similar (that is, minimum
distance) to this test point. To classify this point, we need to use existing similarity
or distance measures, which were discussed in Chap. 7. It is clear that the result
of the classification depends on the selected distance/similarity measure. Another
question relates to finding an appropriate value for k. It is well known that √N is a
good choice for k, where N is the number of data points.
We start with an educational example to understand the k-nearest neighbor clas-
sification. In Fig. 9.8 (top-left), we show the training data for three classes generated
with the following code. For each class i ∈ {1, 2}, we assume an underlying two-
dimensional normal distribution with mean μi and covariance matrix Σi . Class 3 is
generated using a random process of selecting either N(μ3.1 , Σ3 ) or N(μ3.2 , Σ3 ).
As a result of this random process, class 3 falls into two groups that are separated by
classes 1 and 2, as one can see in Fig. 9.8 (top left).
We use these training data to visualize the numerical estimation of the conditional
probabilities:

p(y = c|x) = N(c, x) / k = (1/k) Σ_{i ∈ Ne(x)} I(yi = c).    (9.39)

Fig. 9.8 K-nearest neighbor classification with k = 5. Top left: Distribution of the training data.
Top right: Explanation of the estimation of the conditional probability values for each point.
Bottom left: Decision boundaries obtained for these training data. Bottom right: Visualization of
the corresponding probabilities for the decisions on the left side.

In Fig. 9.8 (top right), we show the training data plus one additional test data point
(in black). Around this test data point is a circle that includes exactly k = 5 training
data points corresponding to a KNN classifier with five neighbors. Within this circle
we observe

#red instances - corresponding to class 3 = 5 (9.40)


#blue instances - corresponding to class 1 = 0 (9.41)
#green instances - corresponding to class 2 = 0, (9.42)

leading to the following estimates for the conditional probabilities:

p(y = 1|x) = 0, (9.43)


p(y = 2|x) = 0, (9.44)
p(y = 3|x) = 1. (9.45)

It is important to note that the diameter, d, of the circle is not fixed for every test
data point but is determined by the k neighbor training data points because every
circle needs to contain exactly k training data points. Furthermore, the shape of
the neighborhood around a test data point is given by a circle because the distance
between data points is measured using the Euclidean metric.
To analyze the KNN classifier, we generate a test data set using a normal
distribution with the same parameters as in Listing 9.10. For this configuration, we
generate N1 = N2 = N3 = 3000 samples.
In R, the package class provides the function knn() with the following options.
The first argument is the training data set, “df.train.” It assumes the form of a matrix
or a data frame, where each row corresponds to one sample and the number of
columns corresponds to the number of predictor variables (features). The second
argument is another matrix or data frame containing the test data set. The third
argument is a vector giving the class labels of the training data. The option “k” sets
the number of neighbors to be considered. The function knn() returns the predictions
for each sample in “df.test” corresponding to the class labels. Furthermore, by
setting “prob=’TRUE’,” the probability values corresponding to the predictions, as
defined in Eq. 9.39, are returned.
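
A minimal sketch of such a call is shown below, assuming df.train and df.test as described in the text, with the two predictor variables in the first two columns and the class labels in a column named "y".

library(class)

pred.knn <- knn(train = df.train[, 1:2],   # predictors of the training data
                test  = df.test[, 1:2],    # predictors of the test data
                cl    = df.train$y,        # class labels of the training data
                k     = 5,
                prob  = TRUE)

table(truth = df.test$y, predicted = pred.knn)
attr(pred.knn, "prob")[1:5]   # probabilities of the winning class (Eq. 9.39)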

From Listing 9.11, one can see that the obtained results for accuracy, sensitivity,
specificity, and so forth are quite good.
In Fig. 9.8 (bottom left), we visualize the decision boundaries of the KNN clas-
sifier learned from the training data, and in Fig. 9.8 (bottom right), the probability
values corresponding to these decisions are shown. First, it is important to note
that in contrast with the classifiers we have discussed so far, the observed decision
boundaries are not straight lines or concentric circles. Instead, they can assume any
shape. This is due to the non-parametric character of the KNN classifier, because no
assumptions are made about the conditional probability distributions in Eq. 9.39 to
limit these to a certain family of probability distributions. Instead, the conditional
probability distributions are estimated numerically from the training data and hence
are very flexible.

From Fig. 9.8 (bottom left), one can see that there are several regions that change
the class labels many times. These are regions that have a tie for different classes;
that is, the same number of votes for more than one class. In general, there are
multiple ways ties can be broken. For this analysis, we extended the neighborhood
successively by one until a unique maximal vote was reached. For the function knn(),
this can be done by setting the option “use.all=TRUE,” which is the default option
of this function. These “tie regions” are also easy to spot in Fig. 9.8 (bottom right),
as they have the lowest probabilities (blue values) for the decisions.

9.8 Support Vector Machine

The next classification method we discuss is a support vector machine (SVM).


Originally developed by Vladimir Vapnik and Alexey Chervonenkis in the 1960s
for linearly separable data, the “kernel trick” introduced in the 1990s allowed one
to extend this framework to nonlinear problems [54]. We will see that an SVM is
quite different from the other classification methods discussed so far, because an
SVM is not a probabilistic classifier that aims to estimate a conditional probability
distribution. Instead, an SVM aims to estimate optimal separating hyperplanes in a
transformed feature space.
We start our discussion of SVMs by first introducing its underlying idea and
motivation. Then, we will turn to a more mathematical formulation and discuss the
extensions.

9.8.1 Linearly Separable Data

Suppose that we have a binary classification problem and the training data are
linearly separable. We can formulate this problem with linear equations that assume
the following form:

w · xi + b > 0 for yi = +1 (9.46)


w · xi + b < 0 for yi = −1. (9.47)

Here, xi are data points in R^p, w is the normal vector of the hyperplane, and b is a
scalar offset that shifts the hyperplane. Since we assume that the data points are linearly separable, there
is no misclassification, and all data points that belong to the same class lie on the
same side of the hyperplane, identified by the sign of the linear equation. This is
visualized in Fig. 9.9.
In Fig. 9.9, we included two (of many) possible hyperplanes (in blue and red),
separating the data points and providing a perfect classification. However, the
question is, which of the possible separating hyperplanes should be chosen? Recall

Fig. 9.9 Maximum margin (hard-margin) SVM for linearly separable classes. The optimal
hyperplane, H0 , maximizes the distance between the auxiliary hyperplanes H+1 and H−1 .

that the angle and position of a hyperplane can be modified by changing w and b
(see [153]).
The hyperplane we consider as “optimal” is depicted in blue in Fig. 9.9. This
hyperplane, H0 , is located “in the middle,” between the two classes. To understand
this better, we included two auxiliary hyperplanes, H+1 and H−1 (in dashed lines),
which are parallel to the optimal hyperplane and are characterized by

H+1 : w · xi + b = +1, (9.48)


H−1 : w · xi + b = −1. (9.49)

Here, the minimal distance from the optimal hyperplane to either of these
auxiliary hyperplanes is the same.
To find the parameters of the linear equations, one rescales Eqs. 9.46 and 9.47 in
the following way. First, we determine
min{ w · xi + b } = +m+1    for yi = +1,                     (9.50)
max{ w · xi + b } = −m−1    for yi = −1,                     (9.51)

for the training data. Here m+1 and m−1 are positive numbers; that is, m+1 , m−1 ∈
R+ , which correspond to the minimal distances from the data points in class +1 and
−1 to the hyperplane. Since the data points in class −1 assume negative values,
we wrote −m−1 to make m−1 positive. The optimal hyperplane shall have an equal
distance to both classes, leading to m = m+1 = m−1 . If we substitute w and b with

their rescaled values w/m and b/m (keeping, for simplicity, the symbols w and b), we obtain

w · xi + b ≥ +1,   for all i with yi = +1,                   (9.52)
w · xi + b ≤ −1,   for all i with yi = −1.                   (9.53)

Let us reflect for a moment on what the distances of these hyperplanes are from
the origin. For the hyperplane in Eq. 9.52, we have a perpendicular distance to the
origin, denoted d+1 = |1−b|/||w||, and for Eq. 9.53, we have d−1 = |−1−b|/||w||.
Hence, the margin between the two auxiliary hyperplanes is d+1 − d−1 = 2/||w||.
From this, we can see that if we want to maximize the distance between the two
auxiliary hyperplanes, we need to minimize ||w|| by fulfilling the constraints in
Eqs. 9.52 and 9.53. By re-writing these equations in the form
yi ( w · xi + b ) − 1 ≥ 0    for all i,                      (9.54)

we are now in a position to express this problem as an optimization problem.


Specifically, we can define a Lagrangian by
LP = (1/2) ||w||² − Σ_i αi [ yi ( w · xi + b ) − 1 ].        (9.55)

A Lagrangian is a function used to solve a constrained optimization problem.


Minimization of LP with respect to w and b for positive Lagrange multipliers αi
gives the desired solution. We just want to note that when solving this optimization
problem, the data points xi with positive Lagrange multipliers, i.e., αi > 0, are
called “support vectors.” These lie exactly on the two auxiliary hyperplanes (see
Fig. 9.9). All other Lagrange multipliers are zero, and the corresponding data points
are not relevant for learning the SVM.

9.8.2 Nonlinearly Separable Data

The preceding formulation does not allow data that are not linearly separable. That
means it can only be applied to linearly separable data, which limits the application
severely. To extend the preceding classifier to general data that are not linearly
separable, some so-called slack variables need to be introduced to the problem.
Using positive slack variables ξi ∈ R+ , errors in the classification can be
counterbalanced so that the following equations hold:

w · xi + b ≥ +1 − ξi , for all i with yi = +1 (9.56)


w · xi + b ≤ −1 + ξi , for all i with yi = −1 (9.57)

Specifically, if there is a classification error for data point xi, the corresponding
slack variable is ξi > 0, and its value corresponds to the distance from the auxiliary
hyperplane given by its class label yi. For instance, for class +1, a slack
variable is between zero and one if xi is located between H+1 and H0, and larger
than one if located beyond H0 (see Fig. 9.10 for a visualization). However, if the
classification is correct, then ξi = 0. That means Σ_i ξi is an indicator for the
classification error.
The corresponding Lagrangian (for the hinge loss [344]) can be written as
LP = (1/2) ||w||² − Σ_i αi [ yi ( w · xi + b ) − 1 + ξi ] + C Σ_i ξi − Σ_i μi ξi.   (9.58)

Here, the μi are additional Lagrange multipliers included to enforce the positivity of
the slack variables, and C is a constant representing the cost of constraint violation.

9.8.3 Nonlinear Support Vector Machines

Finally, we are in a position to extend the classifier to nonlinear decision boundaries.


To do this, we first reformulate the Lagrangian in Eq. 9.58. Specifically, the dual
Lagrangian of Eq. 9.58 is given by


Fig. 9.10 Soft-margin SVM for nonseparable classes. Falsely classified data points have a slack
variable value larger than 1 (see ξk), and correctly classified data points within the margin have a
slack value between 0 and 1 (see ξl).

LD = Σ_i αi − (1/2) Σ_{i,j} αi αj yi yj xi · xj.             (9.59)

The crucial point here is that LD contains the data only in the form of a scalar
product xi · xj . This allows one to “plug in” a transformation that behaves like a
scalar product; namely, a kernel. This is in general called the kernel trick.
To understand this, we map from the space of our data to another (possibly
infinite dimensional) Euclidean space H as follows:

Φ : R^p → H,                                                 (9.60)

and we define a kernel by

K(xi, xj) = Φ(xi) · Φ(xj).                                   (9.61)

That means a kernel is just a function that depends on two arguments and behaves
like a scalar product in the Euclidean space H. Interestingly, if one finds such a
kernel, one no longer explicitly needs the auxiliary mapping Φ, but can use it
implicitly. This becomes clear when looking at the kernels given in Eqs. 9.62-9.64
because the mapping Φ is not immediately identifiable.
It can be shown that the following functions are kernels:

K(xi, xj) = exp( −||xi − xj||² / (2σ²) )     (Radial basis function)     (9.62)
K(xi, xj) = ( xi · xj + 1 )^q                (Polynomial of degree q)    (9.63)
K(xi, xj) = tanh( κ xi · xj − δ )            (Sigmoid function)          (9.64)

Further examples of kernel functions are the Mahalanobis kernel and graph kernels.
We would like to mention that for the linear kernel

K(xi , xj ) = xi · xj (9.65)

one obtains a (linear) SVM.


We would like to also mention that Mercer’s condition [344] can be used to check
whether a function is a valid kernel.
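
To make the kernel functions of Eqs. 9.62-9.64 concrete, the following sketch implements them as plain R functions; the parameter values used in the example call are arbitrary illustrative choices.

rbf.kernel <- function(xi, xj, sigma = 1) {
  exp(-sum((xi - xj)^2) / (2 * sigma^2))      # radial basis function, Eq. 9.62
}
poly.kernel <- function(xi, xj, q = 2) {
  (sum(xi * xj) + 1)^q                        # polynomial of degree q, Eq. 9.63
}
sigmoid.kernel <- function(xi, xj, kappa = 1, delta = 0) {
  tanh(kappa * sum(xi * xj) - delta)          # sigmoid function, Eq. 9.64
}

xi <- c(1, 2); xj <- c(2, 0.5)
c(rbf = rbf.kernel(xi, xj), poly = poly.kernel(xi, xj), sig = sigmoid.kernel(xi, xj))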

9.8.4 Examples

The first example we discuss is for a binary classification. In Listing 9.12, we


generate training data with N1 = 30, N2 = 50 samples. Each of the data points is

sampled from a two-dimensional normal distribution with mean μi and covariance
matrix Σi.

To perform an analysis with an SVM, we use the package e1071, available in R.


Listing 9.13 shows this analysis for a radial base kernel, and the resulting decision
boundaries are visualized in Fig. 9.11 (top). As one can see, the decision boundaries
are nonlinear between the two class regions. The support vectors from the other data
points are indicated using the symbol “s.”
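
A minimal sketch of such an analysis with the package e1071 is given below (it is not the book's Listing 9.13); it assumes df.train and df.test as described, with the predictors in the first two columns and the class labels stored as a factor in column "y".

library(e1071)

# Fit an SVM with a radial basis kernel
model.svm <- svm(y ~ ., data = df.train, kernel = "radial")

# Decision boundaries and support vectors can be inspected with plot()
plot(model.svm, df.train)

# Predictions for the test data
pred <- predict(model.svm, newdata = df.test)
table(truth = df.test$y, predicted = pred)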
To highlight the difference compared to the other kernels, we repeated a similar
analysis for the sigmoidal kernel. The results of this analysis are shown in Fig. 9.11
(bottom). As one can see, this kernel also performs a nonlinear separation; however,
the decision boundary has a different shape. This indicates a flexibility induced by
the kernels, which is a significant improvement over linear decision boundaries.
Other possible kernels provided in the package e1071 are as follows:
• Linear
• Polynomial
• Radial basis
• Sigmoid
The option “type” is not of importance here, because we want to focus on the
base version of an SVM for classification problems only. However, we would like
to note that an SVM can also be used for regression problems. Furthermore, there
exist different algorithmic variations of SVM from which one can select.
In Listing 9.13, we show an example for making predictions. For this, we
simulated test data similar to Listing 9.12; however, this time for N1 = 300, N2 =
300 samples.
The next example we study is a three-class classification. For this, we generate
data similar to Listing 9.10. The resulting decision boundaries are visualized in
Fig. 9.12, and Listing 9.13 shows the results of the prediction.
We would like to note that kernels are usually parameter dependent. For instance,
the radial basis function depends on σ (see Eq. 9.62) and the polynomial function on
q (see Eq. 9.63). Furthermore, the Lagrangian of an SVM allows one to specify the
cost of a constraints violation, C. For our analysis, we used default values given
in the R package; however, these are not the optimal values for these parameters.

Instead, these parameters need to be estimated via model selection, which will be
discussed in Chap. 12.

9.9 Decision Tree

The systematic investigation of decision trees goes back to the 1960s and 1970s,
and automatic interaction detection (AID) and theta automatic interaction detection
(THAID) are widely considered the first algorithms for regression trees and
classification trees, respectively [319]. However, a breakthrough came with the
development of the classification and regression tree (CART) [60] in the 1980s,

Fig. 9.11 Support vector machine classification for a radial-based kernel (top) and a sigmoidal
kernel (bottom). Shown are the decision boundaries between the two classes. Support vectors are
highlighted using the symbol "s."

because this algorithm addressed successfully the underfitting and overfitting


problems of AID and THAID through a pruning process (discussed next). This
section will be dedicated to the CART method.

9.9.1 What Is a Decision Tree?

Let’s start with an example of a decision tree and its components. For our
example, we will use the package rpart available in R. This package provides an
implementation of CART (classification and regression tree) as described in [60].

Fig. 9.12 Three-class classification with a support vector machine for a radial-based kernel.
Shown are the decision boundaries between the three classes. Support vectors are highlighted
using the symbol "s."

The reason behind the name “rpart” (recursive partitioning) for this package instead
of CART is that the latter is a protected trademark name.
In Fig. 9.13 (top left), we show a set of training data for three classes. These
data are sampled from three normal distributions, as in Listing 9.10. A decision tree
that learned from these training data is shown in Fig. 9.13 (top right). Let’s for the
moment just accept that we have a decision tree without asking how we obtained it.
That question will be answered shortly.
In general, a decision tree consists of two types of nodes:
1. Decision nodes
2. Leaf nodes
In Fig. 9.13 (top right), the decision nodes are shown in gray, whereas the leaf nodes
are shown in color. Furthermore, for every decision node, there is an inequality. For
example, decision node 1 contains the inequality

x.1 ≤ 1.9. (9.66)

In general, these inequalities are used to make decisions regarding the partitioning
of the data. Because such a partitioning of the data is applied until the data samples
reach a leaf node, this is a recursive partitioning of the data, explaining the name of
the package: rpart. For example, for decision node 1, the data are split into two parts
depending on whether the inequality is true or false for the x.1 component of a data
point. The data points for which the inequality holds reach decision node 2, while
the remaining data points reach decision node 3. Hence, a decision tree performs, at
each decision node, a bipartitioning of the data, because an inequality is either true
or false. Overall, this makes the tree a binary decision tree.

As a consequence of the form of decisions (see, for example, Eq. 9.66), one
obtains linear decision boundaries for the tree. For our example, this is visualized
in Fig. 9.13 (bottom left). That means additional decision nodes will lead not to
nonlinear decisions but rather to more fine-grained boundaries.
The information shown in the leaf nodes corresponds to the predicted class label
(number in the middle of the first row in each leaf node) and the fraction of training
samples classified as class 1, 2, or 3. For instance, leaf node 5 predicts that each data
sample reaching it belongs to class 3. However, for the training data, about 5% of
these samples are actually from class 1, another 5% are from class 2, and about 90% are
from class 3. The total numbers that correspond to these percentages are shown in Fig. 9.13
(bottom right). This figure includes not only the training samples that reach the leaf
(bottom right). This figure includes not only the training samples that reach the leaf
nodes but in addition shows these numbers for the decision nodes. In this way, it may
become more clear how the data are actually processed during the decision-making.

Fig. 9.13 A decision tree for three classes. Top left: Distribution of the training data. Top right:
Decision tree learned from the training data. Bottom left: Decision boundaries of the decision
tree. Bottom right: Same decision tree as top right but showing the number of training samples in
each class at each node.

9.9.1.1 Three Principal Steps to Get a Decision Tree

As we have seen, a decision tree has a structure that can be intuitively understood
via a simple graphical representation. Now that we understand a decision tree, we
turn to the question of how we actually construct a decision tree from training data.
We answer this question by subdividing it into three sub-problems:
1. Growing a decision tree
2. Assessing the size a decision tree
3. Pruning a decision tree
The general idea behind this is to, first, define a splitting criterion, called an impurity
function, that will be used at the decision nodes to separate the data. This process

grows a tree, but does not provide a stopping criterion that would prevent further
subdivision of the data. For this reason, this process will result in a fully grown
tree that contains in each leaf node either just one training sample or only samples
from the same class. Second, to find the optimal size of the tree, we need to define
a criterion that provides us with information about the predictive performance of
the tree. Third, based on this, the tree will be pruned by getting rid of branches that
would most likely lead to overfitting of the data. This describes the logical way to
create a decision tree. In the following sections, we describe each of these three
steps in detail.

9.9.2 Step 1: Growing a Decision Tree

To grow a tree, we need to define (1) a decision criterion and (2) a decision function
that selects a decision criterion. For simplicity, we are focusing here on continuous
variables, and we choose as decision criterion an inequality for one variable of the
form

xi ≤ γ . (9.67)

A common choice is to select the variable xi randomly from all available variables,
which leaves us to specify the numerical value of the parameter γ . To do this, we
need to define a decision function, called an impurity function. The purpose of the
impurity function is to assess a split.
As the name “impurity” suggests, it is bad to split a set of data points in a way
that results in a high level of impurity. Here, the impurity is assessed with respect
to the discrete distribution of the samples; that is, for C different classes, there will
be a certain number of training samples fj in class j ∈ {1, . . . , C}, with fj =
number of training samples in class j . From this, one obtains the distribution

pj = fj / Σ_k fk,                                            (9.68)

which is used to assess the impurity of a node. There are two impurity functions
widely used; namely, the entropy and the Gini index, defined as follows:
  
i.e(n) = − pj log pj (Entropy); (9.69)
j

   
1
i.g(n) = pi pj = 1− pj2 (Gini index). (9.70)
2
i=j j

These impurity functions depend on n, which is the index of the node being
evaluated within a decision tree.
Fig. 9.14 Two examples of impurity functions: Gini index (lower curve) and entropy (upper curve)
for a binary classification.

In Fig. 9.14, we show the entropy and the Gini index for a binary classification
problem; that is, the number of classes C = 2. As one can see, in both cases the
impurity is high for intermediate values of p and lowest for p = 0 and p = 1. That
means both functions are useful for identifying desirable splits.
To identify the optimal value of γ at a given node m, one needs a measure that
evaluates the outcome of a split; that is, the goodness of the split, Δi(m; γ).
Formally, this measure evaluates, for a selected impurity function i, the impurity
reduction at node m relative to its two offspring (children), mL and mR, in the following
way:

Δi(m; γ) = i(m; γ) − ProbL · i(mL; γ) − ProbR · i(mR; γ).    (9.71)

For a given γ, ProbL is the probability of training samples' reaching the left (child)
node and ProbR = 1 − ProbL is the probability of their reaching the right (child)
node. Evaluating Δi(m; γ) for different values of γ, the optimal γ will maximize
Δi(m; γ); that is,

γ* = argmax_γ Δi(m; γ),                                      (9.72)

resulting in a maximal impurity reduction. Hence, Δi(m; γ) describes the goodness
of split based on an impurity function.
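
The following sketch translates the Gini impurity (Eq. 9.70) and the impurity reduction of Eq. 9.71 into plain R functions for a candidate split xi ≤ γ; it is written directly from these formulas and is not taken from the internals of rpart. The small data set at the end is purely illustrative.

# Gini impurity of a set of class labels (Eq. 9.70)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity reduction for the split x <= gamma (Eq. 9.71)
impurity.reduction <- function(x, y, gamma) {
  left  <- x <= gamma
  probL <- mean(left)
  probR <- 1 - probL
  gini(y) - probL * gini(y[left]) - probR * gini(y[!left])
}

# Example: evaluate candidate thresholds for one variable
set.seed(1)
x <- c(rnorm(20, 0), rnorm(20, 3))
y <- rep(c("a", "b"), each = 20)
sapply(c(0, 1.5, 3), function(g) impurity.reduction(x, y, g))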

Fig. 9.15 Fully grown decision tree for a Gini impurity function.

Successive application of the preceding procedure to each decision node in a tree
will result in a full tree that stops growing when one of the following two conditions
holds:
1. All samples belong to the same class.
2. There is only one sample left.
Such nodes then correspond to the leaf nodes. In both cases, only sample(s) of one
class are left, and hence the impurity of the node has already reached zero.
In Fig. 9.15, we show an example of such a tree. For this example, we used a Gini
impurity, and Listing 9.15 gives the corresponding implementation using R.
The function used to generate a decision tree using R is rpart(). The option
“method” for this function allows one to select between a regression tree and a
decision tree, among others. Since we are only focusing on classification problems,
this is specified by “class.” The next option, “parms,” allows one to select an
impurity function. Available options are “information,” corresponding to the entropy
impurity, and “gini,” giving the Gini impurity.
The next option, “control,” is very important because it allows one to set stopping
criteria via the auxiliary function “rpart.control.” Specifically, “minsplit” sets the
minimum number of observations that must exist in a node in order for a split to
be attempted. That means if the number of samples at a node is smaller than k
(for “minsplit=k”), then this will be a leaf node. The option “cp” is a (positive)

complexity parameter. Any split that does not decrease the overall lack of fit by
a factor of “cp” is not attempted. By setting “cp” to 0.0, we are effectively not
making use of this option; hence, we accept any attempted split, no matter how much
improvement it will bring. This is because we want to grow a tree to its maximal
size, which we call a full tree.
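
A minimal sketch of growing such a full tree is given below (it is not the book's Listing 9.15); it assumes df.train with predictors x.1, x.2 and class labels in a factor column y, and the package rpart.plot is used only for visualization.

library(rpart)
library(rpart.plot)   # optional, for plotting the tree

# Grow a full classification tree with the Gini impurity;
# minsplit = 2 and cp = 0.0 effectively disable the stopping criteria
model <- rpart(y ~ ., data = df.train,
               method  = "class",
               parms   = list(split = "gini"),
               control = rpart.control(minsplit = 2, cp = 0.0))

rpart.plot(model)      # visualize the fully grown tree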

9.9.3 Step 2: Assessing the Size of a Decision Tree

Having grown a full tree, we need to assess it and then cut it to the right size.
A common way to determine the optimal size of a decision tree is to assess the
classification error of the tree according to its complexity. This will allow us to
identify the optimal model with minimal complexity so as to avoid overfitting. As
usual, there is more than one way to accomplish this goal. In the following, we
first outline an intuitive way to do this, and then we show a more elegant, formal
approach that is thus more involved.

9.9.3.1 Intuitive Approach

Having a fully grown tree Tf , we can successively reduce its size by cutting off
branches one node after another. If we do this for all possible cuts, this results in a
set of trees {Tf , . . . , T0 }, where Tf is the full tree and T0 is the root tree consisting
just of one node. Using the training data, we can evaluate the prediction error of all
trees {Tf , . . . , T0 }, such as via cross-validation. Finally, we select the smallest tree
that achieves an acceptable error. Here, we define the size of a tree as its number of
nodes.
In principle, this is a valid approach; however, the meaning of “acceptable error”
is quite vague because we did not define with sufficient detail what numerical
decision could be used to call the error of a tree sufficiently low. For this reason,
we are presenting now a precise formulation of this problem.

9.9.3.2 Formal Approach

To approach this problem formally, we define first the tree cost-complexity as


follows:

Rα (T ) = R(T ) + α|T̃ |. (9.73)

Here, α is a positive cost-complexity parameter and T̃ is the set of terminal nodes


in T , which means |T̃ | measures the tree complexity with respect to the number of
terminal nodes in tree T . For this reason, |T̃ | is given by

|T̃ | = number of terminal nodes in tree T . (9.74)

There is a relationship between the complexity of a tree and its size |T | given by

|T | = 2|T̃ | − 1. (9.75)

The term R(T), called the resubstitution error, is defined by

R(T) = Σ_{τ ∈ T̃} R(τ) = Σ_{τ ∈ T̃} p(τ) r(τ),                (9.76)

where τ is a terminal node in tree T . The underlying idea of R(T ) is to assume


that the training data provide a good representation of the population, allowing a
representative estimation of the error. Hence, R(T ) estimates the misclassification
error of a tree T , using the training data.
The resubstitution error of a node, R(τ ), is given by a product of p(τ ) and r(τ ).
The probability p(τ) is just the number of training samples at node τ divided
by the total number of training samples,

p(τ) = (training samples at node τ) / (total number of training samples),     (9.77)

and r(τ ) is the within-node (conditional) misclassification error. Formally, r(τ ) is


defined by

r(τ) = Σ_i c(j|i) p(Y = i|τ).                                (9.78)

The term c(j |i) is the (conditional) misclassification cost of classifying a sample
from class i as j . Practically, r(τ ) can be estimated by

r(τ) = 1 − max_i p(Y = i|τ),                                 (9.79)

and the conditional probability p(Y = i|τ ) can be estimated from the training data
as follows:
p(Y = i|τ) = (training samples at node τ in class i) / (training samples at node τ).   (9.80)

The key idea is to use Rα (T ) — which is a scalar, positive, real-valued number;


that is, Rα (T ) ∈ R+ — as a representation of the complexity of tree T . Breiman
[60] showed that the following holds:
• If α > β then either Tα = Tβ or Tα is a strict sub-tree of Tβ .
In other words, if one increases α successively, then the resulting sub-trees are
getting smaller and smaller and are strict sub-trees of each other.
Now we are in a position to present a precise realization of the intuitive approach,
discussed in the previous section, using Rα (T ). Specifically, by varying α from 0 to
+∞, we get a set of nested trees where Tα=0 corresponds to the full tree and Tα→+∞
to the root node. Then, for each of these trees Tα , we estimate its classification error
using a cross-validation approach. In Fig. 9.16, we show this error as a function of
the complexity of a tree. In R, α is denoted by cp. The only point that is left is to
select where to cut the decision tree.
To find the optimal complexity, cp∗ , there are a number of different approaches.
However, one of the simplest and most frequently applied criterion is the “one
standard error rule” (1-SE rule) [226]. The 1-SE rule selects the smallest tree
(least complex tree) that is within one standard error (SE) of the tree, T , with the
best cross-validation error; that is, Emin = E(T ). That means from the candidate
set

S = {cp|E(T (cp)) ≤ Emin + SE(Emin )}, (9.81)

the optimal complexity is given by



Fig. 9.16 Cross-validation classification error as a function of the complexity of decision trees.
The size of the trees indicates the number of terminal nodes, and the dashed horizontal line
corresponds to Emin + SE(Emin).

cp* = argmin_{cp} { cp ∈ S }.                                (9.82)

Figure 9.16 shows the classification error as a function of cp. From this, one can
see that the optimal complexity is obtained for three splits. Unfortunately, neither the
shown classification error nor the complexity values are absolute values, but rather
are scaled, and the values are not consistent among the different functions available
in R. Specifically, the “cp” values obtained with “model$cptable” (see Listing 9.16)
are different from the values shown in Fig. 9.16.

Listing 9.17 shows the transformation that converts the cptab values from "cptable"
into the cpfig values displayed by the "plotcp" function.

This difference is important because using cpfig = 0.074 from Fig. 9.16 gives a
different tree than using cptab = 0.042 from “model$cptable.” The correct values
are given in the table; thus, for our example, the optimal complexity value is cp∗ =
0.042.

9.9.4 Step 3: Pruning a Decision Tree

The final step is to actually prune the tree at the selected complexity level. In R, this
can be done by using the function prune().
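
A minimal sketch of selecting the optimal complexity from the cptable and pruning is given below; here "model" refers to a full tree grown with rpart() as sketched above, and picking the largest cp within one SE follows the textual description of the 1-SE rule (selecting the least complex tree), which is an interpretation on our part.

ctab <- model$cptable
emin <- min(ctab[, "xerror"])
se   <- ctab[which.min(ctab[, "xerror"]), "xstd"]

# Candidate set S: all cp values whose cross-validation error is within one SE
S <- ctab[ctab[, "xerror"] <= emin + se, "CP"]

# Least complex tree within one SE corresponds to the largest cp value in S
cp.star <- max(S)
model.pruned <- prune(model, cp = cp.star)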

Figure 9.17 shows the resulting decision tree. One can see that this tree uses three
splits as indicated in the “cptable” shown in Listing 9.16. We would like to note that
one should always cross-check if the resulting cut corresponds to the desired tree
complexity.

9.9.4.1 Alternative Way to Construct Optimal Decision Trees: Stopping


Rules

There is a second way to obtain an optimal decision tree that is fundamentally


different from the approaches just described. Instead of growing a full tree and
pruning it, as suggested by Breiman, one can use a stopping rule to prevent a tree
from further growing. This is called early stopping or pre-pruning. This idea is
very appealing at first because it appears simpler and, in fact, is computationally
less demanding. However, the problem with such an approach is that one needs to

Fig. 9.17 Pruned decision tree of the full tree in Fig. 9.15. The three splits are x.1 < 3.3, x.2 < 3.5, and x.2 >= 0.39.

use a stopping rule that is capable of looking ahead. What we mean by that is the
following: Suppose that you grow a decision tree, and the stopping criterion you
selected suggests not to further grow a specific branch. Unfortunately, it is possible
that when actually making this split, further down this branch, the resulting leaf
nodes may in fact be better terminal nodes than the ones further up in the tree. This
problem is common to greedy optimization methods, and growing a tree with a stopping
rule is just one example of such a method.
Practical criteria used as stopping rules are, for example, requiring a minimum
number of samples in a node in order to attempt a split. This rule can be used in
combination with or instead of requiring a minimum value of impurity reduction
(see Eq. 9.71).
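As a minimal sketch of such pre-pruning, assuming the rpart package, stopping rules can be set via rpart.control(); the threshold values below are illustrative assumptions, not recommendations from the text.

library(rpart)

ctrl <- rpart.control(minsplit = 20,  # minimum number of samples in a node to attempt a split
                      cp = 0.01,      # minimum (scaled) improvement required for a split
                      maxdepth = 5)   # additional cap on the tree depth
model.stop <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)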

9.9.5 Predictions

Finally, after obtaining the optimal decision tree, we can now use it for making
predictions. In Listing 9.19, we show an example using the test data.
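Since Listing 9.19 is not reproduced here, the following is a minimal sketch of this step; the iris-based train/test split and the variable names are illustrative assumptions in place of the simulated data used in the text.

library(rpart)

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]
model <- rpart(Species ~ ., data = train, method = "class")
pred  <- predict(model, newdata = test, type = "class")
table(predicted = pred, true = test$Species)   # confusion matrix on the test data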

9.10 Summary

In this chapter, we discussed classification problems. In general, a classification


requires a form of supervised learning, where each data point is associated with
a label. Importantly, the label provides only categorical information and does not
correspond to a numerical value. The case of numerical response variables will be
the topic of Chap. 11, where we discuss regression models.

We discussed a naive Bayes classifier, linear discriminant analysis, k-nearest


neighbor classification, logistic regression, support vector machines, and decision
trees. Interestingly, despite the fact that all these methods provide approaches for
classifying data, their underlying working mechanisms are quite different from each
other. Specifically, while naive Bayes classifier, linear discriminant analysis, logistic
regression, and k-nearest neighbor classifications learn conditional probability dis-
tributions, a support vector machine solves an optimization problem that optimizes
the regularized separation of data points by hyperplanes. Yet a different approach is
used by a decision tree, which is a non-parametric procedure that decomposes the
classification problem into separate (linear) decision rules.
When we discussed general prediction models in Chap. 2, we saw that in data
science there is no unifying principle that would allow one to categorize all models
uniquely. In this chapter, we have seen an example of this plurality for classification
methods.
Learning Outcome 9: Classification

Classification methods are supervised learning approaches that require labeled


data for training. The labels provide only categorical information and do not
correspond to numerical values.

9.11 Exercises

1. The classification error of a binary classification is defined by

classification error = (FP + FN) / (TP + TN + FP + FN).   (9.83)
How is this error related to the accuracy given by

accuracy = (TP + TN) / (TP + TN + FP + FN)?   (9.84)
2. Study a naive Bayes classifier for a two-class classification problem using
simulated data.
a. Estimate the four fundamental errors, TP, FP, TN, and FN, for the values given
in Listing 9.4.
b. Reproduce the results with Listing 9.2.
c. Investigate the influence of σ1 and σ2 for the true class probability distribu-
tions on the accuracy, the precision, and the recall, respectively. Visualize the
results by plotting the error measures against σ1 and σ2 , respectively.

d. Repeat the preceding analysis for in-sample and out-of-sample data (see
Chap. 4). Discuss the differences.
3. Determine a linear classifier that perfectly classifies the Boolean training data
set True = {(1, 1), (0, 1), (1, 0)} and False = {(0, 0)}. The classifier, to be
determined, learns the logical OR-function.
4. Determine a linear classifier that perfectly classifies the Boolean training data
set True = {(1, 1)} and False = {(0, 0), (0, 1), (1, 0)}. The classifier, to be
determined, learns the logical AND function.
5. Consider the following two-class problem in a two-dimensional space, where the
training data are given by the following:
Class1 = {(11, 11), (13, 11), (8, 10), (9, 9), (7, 7), (7, 5), (16, 3)},
Class2 = {(8, 11), (15, 9), (15, 7), (13, 5), (14, 4), (9, 3), (11, 3)}.
a. Draw a scatter plot of the data using R.
b. Can the data be classified perfectly by a hyperplane?
c. Determine the decision boundary in the scatter plot for a 1nn classifier.
d. Classify the data point (5.5, 11) by using a Gaussian classifier.
6. Consider the linear SVM model:
a. Construct the typical linear function used for SVM classification. How can
one assign an input vector x to the positive or negative class?
b. Suppose that the training examples are linearly separable. How many decision
boundaries can separate positive data points from negative ones? Which
decision boundary does the SVM algorithm calculate and why?
c. Does the linear SVM model always work, even if we use noisy training data?
If the answer is no, how can the basic linear SVM model be modified to cope
with noisy training data?
Chapter 10
Hypothesis Testing

10.1 Introduction

Statistical hypothesis testing is among the most misunderstood quantitative analysis


methods in data science, despite its seeming simplicity. Having originated from
statistics, hypothesis testing has complex interdependencies between its procedural
components, which makes it hard to thoroughly comprehend. In this chapter, we
discuss the underlying logic behind statistical hypothesis testing and the formal
meaning of its components and their connections. Furthermore, we discuss some
examples of hypothesis tests.
Despite the fact that the core methodology of statistical hypothesis testing dates
back many decades, questions regarding its interpretation and practical usage are
still under discussion today [14, 34, 264, 498, 499]. Furthermore, there are new
statistical hypothesis tests constantly being developed [109, 400, 401]. Given the
need to make sense of the increasing flood of data that we are currently facing in all
areas of science and industry, statistical hypothesis testing provides a valuable tool
for binary decision-making. Hence, a future without statistical hypothesis testing is
hard to imagine.
The first method that can be considered a hypothesis test goes back to J.
Arbuthnot in 1710 [195, 216]. However, the modern form of statistical hypothesis
testing originated in the combination of work from R. A. Fisher, J. Neyman, and
E. Pearson [168–170, 364, 365]. Examples of applications of hypothesis testing can
be found in all areas of science, including medicine, biology, business, marketing,
finance, psychology, and social sciences. Specific examples in biology include the
identification of differentially expressed genes or pathways; in marketing it is used
to identify the efficiency of marketing campaigns or the alteration of consumer
behavior; and in medicine it has been used to assess surgical procedures, treatments,
or the effectiveness of medications [106, 125, 345, 443].
In this chapter, we provide a basic discussion of statistical hypothesis testing
and its components. First, we discuss the basic idea of hypothesis testing. Then,


we discuss its seven main components and their interconnections. Thereafter, we


address potential errors resulting from hypothesis testing and the meaning of the
power. Furthermore, we show that a confidence interval complements the value
provided by a test statistic. Finally, we present different examples of statistical tests
that can be applied to a wide range of problems.

10.2 What Is Hypothesis Testing?

The principal idea of a statistical hypothesis test is to decide whether a data sample
is “typical” or “atypical” compared to a population, assuming that a hypothesis we
formulated about the population is true. Here, “data sample” refers to a small portion
of entities taken from a population, for example, measured via an experiment,
whereas the population comprises all possible entities.
In Fig. 10.1, we give an intuitive example of the basic idea of hypothesis testing.
In this particular example, the population consists of all cats, and the data sample
is one individual cat randomly drawn (or sampled) from the entire population.
In statistics, “randomly drawn” is referred to as “sampling,” as we discussed in
detail in Chap. 4. To perform the comparison between the data sample and the
population, one needs to introduce a quantification of the situation. In our case,
this quantification consists of a mapping from a cat to a number. This number could
correspond to, for example, the body weight, body size, fluffiness, or hair length of
a cat. In statistics, this mapping is called a test statistic.

Fig. 10.1 Intuitive example explaining the basic idea underlying a one-sample hypothesis test: a data sample (one cat) is drawn from the population of cats and mapped to a number (e.g., body weight, body size, or tail length), and a hypothesis about the same quantity is formulated for the population.

A key component in hypothesis testing is of course the hypothesis. The hypothe-


sis is a quantitative statement formulated about the value of the test statistic for the
population. In our case it could be about the body parts of a cat; for example, body
size. A particular hypothesis we can formulate is as follows: “The mean body size
equals 30 cm.” Such a hypothesis is called the null hypothesis, and it is denoted as
H0 .
Now, assume that we have a population of cats having a body size of 30 cm,
including some natural variations. Because the population consists of (infinitely)
many cats and for each cat we obtain such a quantification, this results in a
probability distribution, called the sampling distribution, for the mean body size.
Here, it is important to note that our population is a hypothetical population
that obeys our null hypothesis. In other words, the null hypothesis specifies the
population completely.
Now, having a numerical value for the test statistic, representing the data
sample, having the sampling distribution, and representing the population, we can
compare them to evaluate the null hypothesis that we have formulated. From this
comparison, we obtain another numerical value, called the p-value, which quantifies
the typicality or atypicality of the configuration, assuming the null hypothesis is
true. Finally, based on the p-value, a decision is made to accept or reject the null
hypothesis.
On a technical note, we want to remark that since in this example there is only
one population involved, this is called a one-sample hypothesis test. However, the
principal idea extends also to hypothesis tests involving more than one population.

10.3 Key Components of Hypothesis Testing

In the following sections, we will formalize the example just discussed. In general,
regardless of the specific hypothesis test one is conducting, there are seven
components common to all hypothesis tests. These components are summarized
in Fig. 10.2. We listed these components in the order they enter the process when

Main components of a statistical hypothesis test:

1. Select appropriate test statistic T
2. Define null hypothesis H0 and alternative hypothesis H1 for T
3. Find the sampling distribution for T, given H0 true
4. Choose significance level α
5. Evaluate test statistic t for sample data
6. Determine the p-value
7. Make a decision (accept H0 or reject H0)

Fig. 10.2 The seven main components that are common to all hypothesis tests. The right part of the figure summarizes the four possible outcomes of the decision: for H0 true, accepting H0 is a true negative (TN) and rejecting H0 a false positive (FP, Type 1 error); for H1 true, accepting H0 is a false negative (FN, Type 2 error) and rejecting H0 a true positive (TP).

performing a hypothesis test. For this reason, they can also be considered steps of
a hypothesis test. Because they are interconnected, their logical order is important.
Overall, this means that a hypothesis test is a procedure that needs to be executed.
In the following subsections, we will discuss each of these seven procedural
components in detail.

10.3.1 Step 1: Select Test Statistic

Put simply, a test statistic quantifies a data sample. In statistics, the term “statistic”
refers to any mapping (or function) between a data sample and a numerical value.
Popular examples are the mean value or the variance. Formally, the test statistic can
be written as

tn = T (D(n)), (10.1)

where D(n) = {x1 , . . . , xn } is a data sample with sample size n. Here, we denoted
the mapping by T and the value we obtain by tn . Typically, the test statistic can
assume real values, that is, tn ∈ R, but restrictions are possible.
A test statistic assumes a central role in a hypothesis test because by deciding
which test statistic to use, one determines/defines a hypothesis test to a large extent.
This is because it will enter the hypotheses we will formulate in step 2. Hence, one
needs to carefully select a test statistic that is of interest and importance for the
conducted study.
We would like to emphasize that in this step, we select the test statistic, but we
neither evaluate it nor use it yet. This is postponed until step 5.

10.3.2 Step 2: Null Hypothesis H0 and Alternative


Hypothesis H1

At this step, we define two hypotheses, which are called the null hypothesis H0
and the alternative hypothesis H1 . Both hypotheses make statements about the
population value of the test statistic and are mutually exclusive. For the test statistic
t = T (D), selected in step 1, we call the population value of t as θ . Based on this,
we can formulate the following hypotheses:
Null hypothesis: H0 : θ = θ0 .
Alternative hypothesis: H1 : θ > θ0 .
As one can see, due to the way the two hypotheses are formulated, the value of
the population parameter θ can only be true for one statement but not for both. For
instance, either θ = θ0 is true, and the alternative hypothesis H1 is false, or θ > θ0
is true, but then the null hypothesis H0 is false.

In Fig. 10.2, we show the four possible outcomes of a hypothesis test. Each
of these outcomes has a specific name that is commonly used. For instance, if
the null hypothesis is false and we reject H0 , this is called a “true positive” (TP)
decision. The reason for calling it “positive” is related to the asymmetric meaning
of a hypothesis test, because rejecting H0 when H0 is false is more informative
than accepting H0 when H0 is true. In this case, one can consider the outcome of a
hypothesis test a positive result.
The preceding alternative hypothesis is an example of a one-sided hypothesis.
Specifically, we formulated a right-sided hypothesis because the alternative assumes
values larger than θ0 . In addition, we can formulate a left-sided alternative hypoth-
esis by stating:
Alternative hypothesis: H1 : θ < θ0 .
Furthermore, we can formulate a two-sided alternative hypothesis that is indifferent
to the side, as follows:
Alternative hypothesis: H1 : θ ≠ θ0 .
Despite the variety of hypothesis tests [435], the preceding description holds for
all of them. However, this does not mean that if you understand one hypothesis test,
you understand all, but rather that if you understand the principle of one hypothesis
test, you understand the principle of all.
To connect the test statistic t, which is a sample value, with its population value
θ , one needs to know the probability distribution of the test statistic. Because of
this connection, this probability distribution received a special name — sampling
distribution. It is important to emphasize that the sampling distribution represents
the values of the test statistic, assuming that the null hypothesis is true. This means
that, in this case, the population value for θ is θ0 .
Let’s assume for now that we know the sampling distribution of our test
statistic. By comparing the particular value tn of our test statistic with the sampling
distribution, we obtain a quantification for the “typicality” of this value with respect
to the sampling distribution, assuming that the null hypothesis is true.

10.3.3 Step 3: Sampling Distribution

In our general discussion about the principal idea of a hypothesis test, we mentioned
that the connection between a test statistic and its sampling distribution is crucial for
any hypothesis test. For this reason, we elaborate on this point in more detail in this
section.
In this section, we want to answer the following questions:
1. What is the sampling distribution?
2. How does one obtain the sampling distribution?
3. How does one use the sampling distribution?

For question (1): First of all, the sampling distribution is a probability distri-
bution. It is the distribution of the test statistic T , which is a random variable,
given some assumptions. We can make this statement more precise by defining the
sampling distribution of the null hypothesis as follows.
Definition 10.1 (Sampling Distribution) Let X(n) = {X1 , . . . , Xn } be a random
sample from a population with Xi ∼ Ppop ∀i, and let T (X(n)) be a test statistic.
Then, the probability distribution fn (x|H0 true) of T (X(n)), assuming H0 is true,
is called the sampling distribution of the null hypothesis.
Similarly, one defines the sampling distribution of the alternative hypothesis by
fn (x|H1 true). Since there are only two different hypotheses, H0 and H1 , there are
only two different sampling distributions in this context. However, we would like to
note that sampling distributions also play a role outside statistical hypothesis testing;
for example, for estimation theory or bootstrapping [77].
There are several points that are important in the preceding definition. For this
reason, we would like to highlight these explicitly. First, the distribution Ppop
from which the random variables Xi are sampled can assume any form and is not
limited to, for example, a normal distribution. Second, the test statistic is a random
variable itself because it is a function of random variables. For this reason, there
exists a distribution from which the values of this random variable are sampled.
Third, the test statistic is a function of the sample size n, and for this reason the
sampling distribution is also a function of n. That means, if we change the sample
size n, we change the sampling distribution. Fourth, the fact that fn (x|H0 true) is
the probability distribution of T (X(n)) means that by taking an infinite number
of samples from fn (x|H0 true), in the form T (X(n)) ∼ fn (x|H0 true), we can
perfectly reconstruct the distribution fn (x|H0 true) itself. The last point allows,
under certain conditions, a numerical approximation of the sampling distribution.
We will take a closer look at the last point in the following example.

10.3.3.1 Examples

Suppose that we have a random sample X(n) = {X1 , . . . , Xn } of size n where


each data point Xi is sampled from a gamma distribution with parameters α = 4
and β = 2; that is, Xi ∼ gamma(α = 4, β = 2). Hence, here we have Ppop =
gamma(α = 4, β = 2).
Furthermore, let’s use the mean value as a test statistic; that is,

tn = T(X(n)) = (1/n) Σ_{i=1}^{n} Xi.   (10.2)

In Fig. 10.3a-c, we show three examples for three different values of n (in A
n = 1, in B n = 3, and in C n = 10) when drawing E = 100,000 samples of
X(n), from which we estimate E = 100,000 different mean values T . Specifically,
Fig. 10.3 Panels a-c show approximate sampling distributions Ps(n, E) of the mean for different values of the sample size n (a: n = 1, b: n = 3, c: n = 10), each estimated from E = 100,000 samples. Panel a shows Ps(n = 1, E = 100,000), which is equal to the population distribution of Xi. Panel d shows a qq-plot comparing Ps(n = 10, E = 100,000) with a normal distribution.

in Fig. 10.3a-c, we show density estimates of these 100,000 values. As indicated


earlier, in the limit of infinite number of samples E, the approximate sampling
distribution Ps (n, E) will become the (theoretical) sampling distribution,

fn(x|H0 true) = lim_{E→∞} Ps(n, E),   (10.3)

as a function of the sample size n.


For n = 1, we obtain the special case that the sampling distribution is the same
as the underlying distribution of the population Ppop , which is in our case a gamma
distribution with the parameters α = 4 and β = 2, as shown in Fig. 10.3a. For all
other n > 1, we observe a transformation in the distributional shape of the sampling

distribution, as shown in Fig. 10.3b and c. However, this transformation should be


familiar to us because from the central limit theorem, we know that the mean of n
independent samples {X1, . . . , Xn} with mean μ and variance σ² follows a normal
distribution with mean μ and standard deviation σ/√n; that is,

X̄ ∼ N(μ, σ/√n).   (10.4)

Note that this result is only strictly true when n is large.


A question to ask is, what is a large n? In Fig. 10.3d, we show a qq-plot that
demonstrates that for n = 10 the resulting distribution, Ps (n = 10, E = 100,000),
is quite close to such a normal distribution (with the appropriate parameters).
We would like to point out that the central limit theorem holds for arbitrary i.i.d.
(independent and identically distributed) random variables {X1 , . . . , Xn }. Hence,
the sampling distribution for the mean is always the normal distribution given in
Eq. 10.4.
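As a complement, the following is a minimal sketch (not one of the book's listings) of how such an approximate sampling distribution can be simulated in R; the seed and the exact call structure are illustrative assumptions, while E, n, and the gamma parameters follow the example in the text.

set.seed(123)
E <- 100000                          # number of repeated samples
n <- 10                              # sample size
t <- replicate(E, mean(rgamma(n, shape = 4, rate = 2)))
plot(density(t), main = "Approximate sampling distribution Ps(n, E)")
qqnorm(t); qqline(t)                 # compare with a normal distribution (cf. Fig. 10.3d)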
At this point, we could stop, because we found the sampling distribution for our
problem. However, by going a step further, we can obtain a numerical simplification.
Specifically, we do so by applying a so-called z-transformation given by

Z = (X̄ − μ) / (σ/√n),   (10.5)

which transforms the mean value of X̄ to Z, and we obtain a simplification because


the distribution of Z is a standard normal distribution; that is,

Z ∼ N(0, 1). (10.6)

This is a simplification, because the standard normal distribution does not depend
on any parameter, in contrast with the previous normal distribution (see Eq. 10.4),
which depends on μ, σ , and n.
While statistics is certainly amazing at times, it is not magic. In our case, this
means that the parameters μ, σ , and n did not disappear entirely, but rather were
merely shifted into the z-transformation (see Eq. 10.5).
Now we need to distinguish between two cases.
• Case 1: σ is known.
• Case 2: σ is unknown.
If we know the variance σ 2 , the sampling distribution of our transformed mean
X̄, which we called Z, is a standard normal distribution. However, if we do not know
the variance σ 2 , we cannot perform the z-transformation in Eq. 10.5, because this
transformation depends on σ . In this case, we need to estimate the variance from
the sample {X1 , . . . , Xn } as follows:

Table 10.1 Sampling distributions of the z-score and the t-score. Here "dof" means degrees of freedom.

Test statistic   Sampling distribution               Prior knowledge about parameters
z-score          N(0,1)                              σ² needs to be known
t-score          Student's t-distribution, n−1 dof   none

s² = 1/(n − 1) Σ_{i=1}^{n} (Xi − X̄)².   (10.7)

Then, we can use the estimate of the variance, s, with the so-called t-transformation

T = (X̄ − μ) / (s/√n).   (10.8)

Although this t-transformation is formally very similar to the z-transformation


in Eq. 10.5, the resulting random variable T does not follow a standard normal
distribution but rather a student’s t-distribution with n − 1 degrees of freedom (dof).
We want to mention that this holds only for Xi ∼ N(μ, σ ); that is, for normally
distributed samples.
Table 10.1 summarizes the results from this section regarding the sampling
distribution of the z-score (Eq. 10.5) and the t-score (Eq. 10.8).

10.3.4 Step 4: Significance Level α

The significance level α is a number between zero and one; that is, α ∈ [0, 1]. It has
the meaning

α = P(Type 1 error) = P(reject H0 |H0 true ), (10.9)

which is the probability of rejecting the null hypothesis H0 given that H0 is true.
Alternatively, this gives us the probability of making a Type 1 error, resulting in a
false-positive decision.
When conducting a hypothesis test, one has the freedom to choose the value of
α. However, when deciding about its numerical value, one needs to be aware of
potential consequences. Possibly the most frequent choice for α is 0.05; however,
for genome-wide association studies (GWAS), values as low as 10⁻⁸ are used [378].
The reason for such a wide variety of used values is the possible consequences
incurred in different application domains. Specifically, we discuss student’s t-test,
correlation tests, and a hypergeometric test. For GWAS, Type 1 errors can result

in wasting millions of dollars, because follow-up experiments in this field are very
costly. Hence, α is chosen to be very small to avoid Type 1 errors.
Finally, we want to remark that formally, we obtain the value of the right-hand
side of Eq. 10.9 by integrating the sampling distribution, as given by Eq. 10.13. This
is discussed in detail in Step 6.

10.3.5 Step 5: Evaluate the Test Statistic from Data

In this step, we connect everything just discussed with the real world, as represented
by the data, since everything until this step has been theoretical. Specifically, for
D(n) = X(n) = {x1 , . . . , xn }, we estimate the numerical value of the test statistic,
selected in step 1, giving us

tn = T (D(n)). (10.10)

Here, tn represents a particular numerical value obtained from the observed data
D(n). Because our data set depends on the number of samples n, the numerical
value of tn also will depend on n. This is explicitly indicated by the subscript.

10.3.6 Step 6: Determine the p-Value

To determine the p-value of a hypothesis test, we need to use the sampling


distribution (see step 3) and the estimated test statistic tn (see step 5). That means
the p-value results from a comparison of theoretical assumptions, as represented by
the sampling distribution, with real observations, as represented by the data sample,
assuming H0 is true.
This situation is visualized in Fig. 10.4 for a right-sided alternative hypothesis.
The p-value is the probability of observing more-extreme values than the test
statistic tn , assuming H0 is true:

p = P (observe x at least as extreme as t |H0 is true)


= P (x ≥ t |H0 is true). (10.11)

Formally, the p-value is obtained by an integral over the sampling distribution:



p = ∫_{tn}^{∞} fn(x′|H0 true) dx′.   (10.12)

We would like to emphasize that since the test statistic is a random variable, the
p-value is also a random variable as it depends on the test statistic [355].

Fig. 10.4 Determining the p-value from the sampling distribution fn(x|H0 true) of the test statistic. The area to the right of tn corresponds to the p-value, and the area to the right of the critical value θc corresponds to α; values below θc lead to acceptance and values above θc to rejection of H0.

Furthermore, we can use the following integral:



α = ∫_{θc}^{∞} fn(x′|H0 true) dx′   (10.13)

to solve for θc . That means the significance level α implies a threshold θc . In step 7,
we will see that the final decision to reject or accept the null hypothesis is based on
either the p-value or the test statistic tn .
Remark 10.1 The sample size, n, has an influence on the numerical analysis of a
problem. For this reason, the test statistic and the sampling distribution are indexed
by n. However, the sample size has no effect on the formulation and expression of
the hypothesis (see step 2), because we make statements about a population value
that holds for any value of n.

10.3.7 Step 7: Make a Decision about the Null Hypothesis

In the last step, we are finally making a decision about the null hypothesis
formulated in step 2. As mentioned, there are two alternative ways to do this. We
can make a decision based on either the p-value or the value of the test statistic tn :
1. Decision based on the p-value:

If p < α ⇒ reject H0 (10.14)

2. Decision based on the value of the test statistic tn :

If tn > θc ⇒ reject H0 (10.15)



Here, θc is obtained by solving the integral in Eq. 10.13.


If we cannot reject the null hypothesis, we accept it. For clarity, we want to
mention that when we reject the null hypothesis, it means we accept the alternative
hypothesis. Conversely, when we accept the null hypothesis, it means we reject the
alternative hypothesis.

10.4 Type 2 Error and Power

When making binary decisions, there are a number of errors one can make. In
this section, we go one step back and take a more theoretical look at a hypothesis
test with respect to the possible errors that can be made. In the section “Step 2:
Null Hypothesis H0 and Alternative Hypothesis H1 ,” we mentioned that there are
two possible errors one can make, a false positive and a false negative, and when
discussing step 4, we introduced the meaning of a Type 1 error. Now we extend this
discussion to the Type 2 error.
As discussed, there are only two possible configurations that need to be
distinguished. Either H0 is true or it is false. If H0 is true (respectively, false), it
is equally correct to say H1 is false (respectively, true). Now, let’s assume that H1
is true. To evaluate the Type 2 error, we require the sampling distribution, assuming
that H1 is true. However, to perform a hypothesis test, as discussed in the previous
sections (see Fig. 10.2), we do not need to know the sampling distribution, assuming
that H1 is true. Instead, we need the sampling distribution, assuming that H0 is
true, because this distribution corresponds to the null hypothesis. The good news is
that the sampling distribution, assuming that H1 is true, can be easily obtained if
we make the alternative hypothesis more precise. Let’s assume we are testing the
following hypothesis:
Null hypothesis: H0 : θ = θ0
Alternative hypothesis: H1 : θ > θ0
In this case, H0 is precisely specified because it sets the population parameter θ to
θ0 . In contrast, H1 only limits the range of possible values for θ , but does not set it
to a particular value.
To determine the Type 2 error, we need to set θ , in the alternative hypothesis, to
a particular value. So let’s set the population parameter θ = θ1 in H1 for θ1 > θ0 .
That means we define the following:
Alternative hypothesis: H1 : θ = θ1 with θ1 > θ0
In Fig. 10.5, we visualize the corresponding sampling distribution for H1 and H0 .
If we reject H0 when H1 is true, this is a correct decision, and the green area in
Fig. 10.5 represents the corresponding probability for this, formally given by

1 − β = P(reject H0 |H1 is true) = ∫_{θc}^{∞} fn(x′|H1 true) dx′.   (10.16)

Fig. 10.5 Visualization of the sampling distributions fn(x|H0 true) and fn(x|H1 true) assuming a fixed sample size n and setting the value of θ to θ1 in the alternative hypothesis. The Type 1 error α is the area under fn(x|H0 true) in the rejection region, while the Type 2 error β and the power 1 − β are the corresponding areas under fn(x|H1 true).

In short, this probability is usually denoted by 1 − β and called the power of a test.
However, if we do not reject H0 when H1 is true, we make an error, given by

β = P (Type 2 error) = P (do not reject H0 |H1 is true). (10.17)

However, this is the same as

β = P (Type 2 error) = P (accept H0 |H1 is true). (10.18)

This is called the Type 2 error. In Fig. 10.5, we highlight the Type 2 error probability
in orange.
We would like to emphasize that the Type 1 and the Type 2 errors are both long-
run frequencies for repeated experiments. That means both probabilities give the
error when repeating exactly the same test many times. This is in contrast with the
p-value, which is the probability for a given data sample. Hence, the p-value does
not allow one to draw conclusions about repeated experiments.

10.4.1 Connections between Power and Errors

From Fig. 10.5, we can see the relationship between the power (1 − β), the Type
1 error (α), and the Type 2 error (β), summarized in the table given in Fig. 10.6.
Ideally, one would like to have a test with a high power and low Type 1 error
and low Type 2 error. However, from Fig. 10.5, we see that these three entities are
not independent from each other. Specifically, if we increase the power (1 − β) by
changing α, we increase the Type 1 error (α), because this will reduce the critical
value θc . In contrast, reducing α leads to an increase in the Type 2 error (β) and a
reduction in power. Hence, in practice, one needs to make a compromise between
the ideal goals.
For the preceding discussion, we assumed a fixed sample size n. However, as we
discussed in the example of section “Step 3: Sampling Distribution,” the variance of
the sampling distribution depends on the sample size via the standard error (SE), as
follows:
SE = σpop/√n.   (10.19)

Importantly, this provides a way to increase the power and to minimize the Type 2
error by increasing the sample size n. That means by keeping the population means
θ0 and θ1 unchanged, but increasing the sample size from n to a larger value n′, i.e.,
n′ > n, the sampling distributions for H0 and H1 become narrower because their
variances decrease according to Eq. 10.19. Thus, with an increased sample size, the
overlap between the distributions, represented by β, is reduced. This leads to an
increase in the power and a decrease in the Type 2 error for an unchanged value of
the significance level α. In the extreme case, n → ∞, the power approaches 1 and
the Type 2 error 0, for a fixed Type 1 error α.
From this discussion, the importance of the sample size in a study becomes
apparent, as it is a control mechanism to influence the resulting power and the Type
2 error.

                 Decision
Truth   accept H0                           reject H0
H0      1 − α = P(accept H0 |H0 true)       α = P(reject H0 |H0 true)
H1      β = P(accept H0 |H1 true)           1 − β = P(reject H0 |H1 true)

Fig. 10.6 Overview of the different errors resulting from hypothesis testing and their probabilistic
meaning.

10.5 Confidence Intervals

The test statistic is a function of the data (see step 1 in Sect. 10.3.1), and hence it is a
random variable. That means there is a variability to a test statistic because its value
changes for different samples. To quantify the interval within which such values fall,
one can use a confidence interval (CI) [26, 57].
Definition 10.2 The interval I = [a, b] is called a confidence interval for parameter
θ if it contains this parameter with probability 1 − α for α ∈ [0, 1]; that is,
 
P(a ≤ θ ≤ b) = 1 − α.   (10.20)

The interpretation of a CI, I = [a, b], is that for repeated samples, the corresponding
confidence intervals are expected to contain the true value of θ with probability
1 − α. Here, it is important to note that θ is fixed because it is a population value.
What is random is the estimate of the boundaries of the CI; that is, a and b. Hence,
for repeated samples, θ is fixed but I is a random interval.
The connection between a 1 − α confidence interval and a hypothesis test for
a significance level of α is that if the value of the test statistic falls within the CI,
then we don’t reject the null hypothesis. However, if the confidence interval does
not contain the value of the test statistic, we reject the null hypothesis. Hence, the
decisions reached by both approaches always agree with each other.
If one does not make any assumption about the shape of the probability
distribution, for example, symmetry around zero, there is an infinite number of CIs
because neither the starting nor the ending values of a and b are uniquely defined,
but rather follow from assumptions. Frequently, one is interested in obtaining a CI
for a quantile separation of the data in the form
 
P(qα/2 ≤ θ ≤ q1−α/2) = 1 − α,   (10.21)

where qα/2 and q1−α/2 are quantiles of the sampling distribution with respect to
(100α/2)% and 100(1 − α/2)% of the data, respectively.

10.5.1 Confidence Intervals for a Population Mean with


Known Variance

From the central limit theorem, we know that the sum of random variables θ̂ = (1/n) Σ_{i} xi is normally distributed. If we normalize this with a z-transformation as follows:

Z = (θ̂ − E[θ̂]) / SE,   (10.22)

then Z follows a standard normal distribution — that is, N(0, 1) — where SE is the
standard error of θ̂ given by
SE = σ/√n.   (10.23)

Adjusting the definition of a confidence interval in Eq. 10.21 to our problem gives

 
P(qα/2 ≤ Z ≤ q1−α/2) = 1 − α   (10.24)

with

qα/2 = −zα/2 ; (10.25)


q1−α/2 = zα/2 . (10.26)

Here, the values of ±zα/2 are obtained by solving the equations for a standard
normal distribution probability
 
P(Z < −zα/2) = α/2;   (10.27)
P(Z > zα/2) = α/2.   (10.28)

Using these and solving the inequality in Eq. 10.24 for the expectation value gives
the confidence interval I = [a, b] with
a = θ̂ − zα/2 σ(θ̂) = θ̂ − zα/2 σ/√n;   (10.29)
b = θ̂ + zα/2 σ(θ̂) = θ̂ + zα/2 σ/√n.   (10.30)

Here, we assumed that σ is known. Hence, the preceding CI is valid for a z-test.
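A minimal sketch of this interval in R, assuming simulated data with an illustrative (known) σ and α = 0.05, could look as follows; the variable names and parameter values are assumptions made here for the example.

set.seed(1)
n     <- 50
sigma <- 2                            # assumed to be known
x     <- rnorm(n, mean = 5, sd = sigma)
alpha <- 0.05
z     <- qnorm(1 - alpha/2)           # z_{alpha/2}
theta.hat <- mean(x)
ci <- c(theta.hat - z * sigma/sqrt(n), theta.hat + z * sigma/sqrt(n))
ci                                    # 95% confidence interval [a, b]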

10.5.2 Confidence Intervals for a Population Mean with


Unknown Variance

If we assume that σ is not known, then the sampling distribution of a population


mean becomes the student’s t-distribution. For this, σ needs to be estimated from
samples using the sample standard deviation s. In this case, a similar derivation as
earlier results in
a = θ̂ − tα/2 s/√n;   (10.31)
b = θ̂ + tα/2 s/√n.   (10.32)

Here, ±tα/2 are critical values for a student’s t-distribution, obtained as in Eqs. 10.27
and 10.28. Such a CI is valid for a t-test; see Sect. 10.6.1.

10.5.3 Bootstrap Confidence Intervals

When a sampling distribution is not given in an analytical form, numerical


approaches need to be used. In such a situation, a CI can be numerically obtained via
nonparametric bootstrap [132]. This is the most generic way to obtain a CI. Using
the augmented definition in Eq. 10.21, for any test statistic θ̂ , the CI can be obtained
from
 
P(q̂α/2 ≤ θ̂ ≤ q̂1−α/2) = 1 − α,   (10.33)

where the quantiles q̂α/2 and q̂1−α/2 are directly obtained from the data, resulting
in I = [q̂α/2 , q̂1−α/2 ]. Such a confidence interval can be used for any statistical
hypothesis test.
We would like to emphasize that in contrast with Eq. 10.21, here, the quantiles
q̂α/2 and q̂1−α/2 are estimates of the quantiles qα/2 and q1−α/2 from the sampling
distribution. Hence, the obtained CI is merely an approximation.
An example of this is shown in Listing 10.1.
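Since that listing is not reproduced here, the following is a minimal sketch of a nonparametric bootstrap confidence interval; the data, the choice of the mean as test statistic, and the number of bootstrap replicates B are illustrative assumptions.

set.seed(1)
x <- rgamma(30, shape = 4, rate = 2)          # a data sample
B <- 10000                                    # number of bootstrap replicates
theta.boot <- replicate(B, mean(sample(x, replace = TRUE)))
alpha <- 0.05
ci <- quantile(theta.boot, probs = c(alpha/2, 1 - alpha/2))
ci                                            # approximate [q_{alpha/2}, q_{1-alpha/2}]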

10.6 Important Hypothesis Tests

In the following, we discuss some important hypothesis tests that are frequently
used in many application domains.

10.6.1 Student’s t-Test

The student’s t-test, also known as the t-test, can be used to test hypotheses about
the mean. In Sect. 10.3.3.1, we saw that a t-transformation shows that the sampling
distribution of a t-score is given by student’s t-distribution, which explains the name
of the test. For this reason, the test statistics of a t-test is a t-score given by

T = (X̄ − μ) / (s/√n),   (10.34)
with X̄ = (1/n) Σ_{i=1}^{n} Xi.   (10.35)

Here, {Xi }ni=1 are n observations drawn from an underlying distribution associated
with a population. If the variance, σ , of this population is known, the t-score
becomes a z-score (see Sect. 10.3.3.1) and the t-test a z-test. However, for real-world
problems, this is usually not the case. This means that s needs to be estimated from
the observations as follows:
s = √[(Σ_{i=1}^{n} Xi² − (Σ_{i=1}^{n} Xi)²/n) / (n − 1)].   (10.36)

10.6.1.1 One-Sample t-Test

There are different versions of a t-test, but the simplest one is the one-sample test. In
this case, there is just one population, f1 , from which observations are drawn, i.e.,
xi ∼ f1 , and the hypotheses are formulated as follows:

H0 : μ = μ0 . (10.37)
H1 : μ > μ0 . (10.38)

Practically, a t-test is used as follows. Given the n observations {Xi }ni=1 , we first
estimate X̄ and s from Eqs. 10.35 and 10.36. Then, we estimate the test statistics
from Eq. 10.34 using our null hypothesis; that is,

tn = (X̄ − μ0) / (s/√n).   (10.39)

This gives a numerical value that can be used to integrate along the sampling
distribution, as specified by the alternative hypothesis. Here we used a right-sided
alternative hypothesis, which corresponds to an integration, as shown in Fig. 10.4;
that is,

p = ∫_{tn}^{∞} fSt−t(x) dx.   (10.40)

Here, fSt−t is a student’s t-distribution corresponding to the sampling distribution.


Using R, a t-test can be easily performed, as shown in the example in Listing 10.2.
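As that listing is not reproduced here, the following is a minimal sketch of a one-sample t-test for H0: μ = 0 against the right-sided alternative H1: μ > 0; the simulated data are an illustrative assumption.

set.seed(1)
x <- rnorm(20, mean = 0.3, sd = 1)
t.test(x, mu = 0, alternative = "greater")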

10.6.1.2 Two-Sample t-Test

An extension of the student’s t-test to two samples is needed when there are
two independent underlying populations, f1 , f2 , from which samples are drawn.
Specifically, let's assume that we have n observations X = {Xi}_{i=1}^{n} from population
one (f1) and m observations Y = {Yi}_{i=1}^{m} from population two (f2). In this case, we
want to formulate the hypothesis about both populations in the following form:

H0 : μ1 = μ2 (10.41)
H1 : μ1 > μ2 . (10.42)

Here, μ1 corresponds to the population mean of the first population and μ2 to the
population mean of population two. In this case, the test statistics are given by

tn = (X̄ − Ȳ) / √(sX²/n + sY²/m),   (10.43)

where X̄, Ȳ , and sX , sY are estimated according to Eqs. 10.35 and 10.36.
One technical detail we would like to note is that this test statistic can be used
for unequal sample sizes n and m and unequal variances sX and sY . In this case,
the t-test is formally called the Welch’s t-test, and this is the default when using the
function t.test() available in R.
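A minimal sketch of such a two-sample (Welch's) t-test for H0: μ1 = μ2 against the right-sided alternative H1: μ1 > μ2 could look as follows; the simulated data with unequal sample sizes and variances are an illustrative assumption.

set.seed(1)
X <- rnorm(25, mean = 1, sd = 1)
Y <- rnorm(30, mean = 0, sd = 2)
t.test(X, Y, alternative = "greater")    # var.equal = FALSE is the default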

10.6.1.3 Extensions

A generalization of the student’s t-test called Hotelling’s t-squared test is used for
the multivariate case. This is the case when the mean value of a population, μ, is not
given by a scalar value but rather by a vector; that is, μ ∈ Rp , with p > 1. In such a
case, observations are drawn from, for example, a multivariate normal distribution,
X ∼ N(μ, Σ), where Σ is the covariance matrix.
Another extension is needed when there are more than two populations from
which observations are drawn. For instance, for testing the null hypothesis

H0 : μ1 = μ2 = · · · = μk   (10.44)

with k > 2 and k ∈ {3, 4, . . . }, an ANOVA (Analysis of Variance) test is used.



10.6.2 Correlation Tests

Another test statistic frequently of interest is the correlation value. This statistic
measures the association between two variables. Given two variables X = {Xi}_{i=1}^{p}
and Y = {Yi}_{i=1}^{p}, one can estimate the sample Pearson product-moment correlation
coefficient by

r = Sx,y / √(Sx,x Sy,y),   (10.45)

with

Sx,y = Σ_{i=1}^{p} xi yi − (Σ_{i=1}^{p} xi)(Σ_{i=1}^{p} yi)/p   (10.46)
Sx,x = Σ_{i=1}^{p} xi² − (Σ_{i=1}^{p} xi)²/p   (10.47)
Sy,y = Σ_{i=1}^{p} yi² − (Σ_{i=1}^{p} yi)²/p.   (10.48)

For a two-sided alternative, the hypothesis about the population correlation is


formulated as follows:

H0 : ρ = 0. (10.49)
H1 : ρ ≠ 0.   (10.50)

In Listing 10.4, we provide an example of this case since for a simple linear
regression, y ∼ β0 + β1 x, one can show that the regression coefficient β1
corresponds to the correlation coefficient, r. The results are plausible because
according to the way we simulated the data, the null hypothesis is false.
Aside from the Pearson’s correlation, there is also the Spearman’s correlation.
The Spearman’s rank-order correlation is the nonparametric version of the Pear-
son’s product-moment correlation. Spearman’s correlation coefficient measures the
strength between two ranked variables. That means such values are at least on an
ordinal scale (see Sect. 10.6.4). This implies that even when the observations are real
valued, that is, xi , yi ∈ R, Spearman’s rank-order correlation uses only information
about the rank order of these values. In R, this information is obtained using the
function rank().

An example for a Spearman’s rank-order correlation test is shown in Listing 10.5.


Here, we include two versions of the test that both give the same results.
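As that listing is not reproduced here, the following is a minimal sketch of both versions; the simulated data are an illustrative assumption.

set.seed(1)
X <- rnorm(50)
Y <- X + rnorm(50, sd = 0.5)
cor.test(X, Y, method = "spearman")
cor.test(rank(X), rank(Y), method = "spearman")   # same result, since only the ranks are used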

The hypothesis tested in Listing 10.5 can be formulated as follows:

H0 : There is no (monotonic) association between the two variables.


H1 : There is a (monotonic) association between the two variables.

In Fig. 10.7, we show an example that makes the difference between Pearson’s
and Spearman’s correlations clear. The alternative form of the Spearman correlation
leads to the same result, because Spearman’s correlation utilizes only the ranks of
the observations X and Y .

X      Y      rank(X)  rank(Y)
−0.26  −0.18  2        2
 0.94   0.38  3        3
 1.27   2.17  5        4
 1.11   2.22  4        5
−1.36  −0.24  1        1

Pearson correlation: r = 0.806
Spearman correlation: rS = 0.90

R code:
Pearson correlation: cor(X, Y, method = "pearson")
Spearman correlation: cor(X, Y, method = "spearman")
Spearman correlation: cor(rank(X), rank(Y), method = "spearman")

Fig. 10.7 An example of Pearson and Spearman correlations. The alternative form of the
Spearman correlation leads to the same result.

10.6.3 Hypergeometric Test

The hypergeometric test, also known as Fisher’s exact test, is used to determine the
enrichment of one subpopulation in another.
To explain the idea behind a hypergeometric test, we consider a problem
frequently studied in genomics. Suppose that an experiment is conducted involving
a large number of genes, and the experiment shows that some of these genes are
active. Such genes are said to be differentially expressed. The question we want to
answer is whether the genes that are differentially expressed are overrepresented in a
specific biological process. An alternative formulation used to describe this is to ask
whether the differentially expressed genes are enriched in this biological process.
The basic idea of a hypergeometric test is shown in Fig. 10.8. As one can see,
there are different colors for the genes, shown as dots and (big) circles that enclose
the dots. The reason for this is that each gene/dot is characterized by two properties.
The first property distinguishes whether a gene is differentially expressed or not.
If a gene is differentially expressed, it is enclosed by the red circle; otherwise, it
is outside the red circle and enclosed by the purple circle. The second property
distinguishes if a gene is a member of a biological process or not. If it is a member
of a biological process, its dot is shown in green, otherwise in orange.
More formally, one can summarize the preceding groupings of genes in a tabular
form. In Fig. 10.9, we show a contingency table that corresponds to the visualization
in Fig. 10.8. The two properties of the genes, that is, “differentially expressed” and
“member in biological process,” correspond to the rows and columns, respectively.
Specifically, the total number of differentially expressed genes is given by n1+ , and
the number of all genes is n. Hence, n2+ corresponds to the number of genes that are
not differentially expressed. Similarly, the total number of genes that are associated
with the biological process of interest is given by n+1 , and the number of genes that
are not associated with this biological process is given by n+2 .
We would like to highlight that there are a number of constraints that hold for the
rows and columns. These are given by

n1+ = n11 + n12 ; (10.51)


n2+ = n21 + n22 ; (10.52)
n+1 = n11 + n21 ; (10.53)
n+2 = n12 + n22 ; (10.54)
n = n1+ + n2+ ; (10.55)
n = n+1 + n+2 . (10.56)

Hence, the sums of the rows and columns are conserved. This implies that all other
entities follow from the four numbers n11, n12, n21, and n22.

10.6.3.1 Null Hypothesis and Sampling Distribution

To emphasize the entities of interest for formulating a proper statistical hypothesis,


we show a revised contingency table in Fig. 10.10. The crucial point is to realize that
to address our original question about the enrichment of differentially expressed
genes that are also involved in a biological process, the entities given by x and
y are important. Specifically, since both numbers are random variables, there
are underlying probability distributions from which those numbers are drawn.
Both distributions correspond to binomial distributions, however, characterized by
different parameters; that is,

x ∼ Binom(X = x|n+1 , p1 ), (10.57)

Fig. 10.8 For a hypergeometric test, one needs to distinguish between two properties of a gene: property 1, whether the gene is differentially expressed, and property 2, whether the gene is a member of a biological process. This is visualized by the colors given to the dots and circles.

Fig. 10.9 Contingency table that summarizes a hypergeometric test as visualized in Fig. 10.8. The shown colors are the same as in Fig. 10.8.

                          Member in biological process
Differentially expressed  Yes    No     Total
Yes                       n11    n12    n1+
No                        n21    n22    n2+
Total                     n+1    n+2    n

Fig. 10.10 Contingency table that summarizes a hypergeometric test as visualized in Fig. 10.8. The shown colors are the same as in Fig. 10.8.

                          Member in biological process
Differentially expressed  Yes    No     Total
Yes                       x      y      n1+
No                        n21    n22    n2+
Total                     n+1    n+2    n

y ∼ Binom(Y = y|n+2 , p2 ). (10.58)

The binomial distribution for a random variable, Z, is given by


 
P(Z = k) = Binom(k|n, p) = (n choose k) p^k (1 − p)^(n−k).   (10.59)

Importantly, the parameter p defines the probability of drawing a gene with a certain
property. In our case, the property is either to be differentially expressed, given by
p1 , or not, given by p2 . At this point, we need to realize that this is the test statistic
we are looking for to formally describe our initial hypothesis. That means we can
formulate the null hypothesis as

H0 : p1 = p2 . (10.60)

Assuming that the null hypothesis is true, that is, p = p1 = p2 , the independence
of X and Y , and z = x + y = n1+ , we derive the null distribution as follows:

P(X = x|X + Y = z) = P(X = x, X + Y = z) / P(X + Y = z)   (10.61)
                   = P(X = x, Y = z − x) / P(X + Y = z)   (10.62)
                   = P(X = x) P(Y = z − x) / P(X + Y = z)   (10.63)
                   = [(n+1 choose x) p^x (1 − p)^(n+1 − x) · (n+2 choose z − x) p^(z−x) (1 − p)^(n+2 − z + x)] / [(n+1 + n+2 choose z) p^z (1 − p)^(n+1 + n+2 − z)]   (10.64)
                   = (n+1 choose x) · (n+2 choose z − x) / (n+1 + n+2 choose z).   (10.65)

Fig. 10.11 Numerical values of the contingency table corresponding to the example shown in Fig. 10.8.

                          Member in biological process
Differentially expressed  Yes        No         Total
Yes                       x = 3      y = 9      n1+ = 12
No                        n21 = 13   n22 = 4    n2+ = 17
Total                     n+1 = 16   n+2 = 13   n = 29

Equation 10.65 is nothing but a hypergeometric distribution. Hence, the null


distribution corresponding to the null hypothesis H0 : p1 = p2 is a hypergeometric
distribution given by
P(X = x|X + Y = n1+) = (n+1 choose x) · (n+2 choose z − x) / (n+1 + n+2 choose z).   (10.66)

Depending on the formulation of the alternative hypothesis HA , one obtains a


p-value. Specifically, for the alternative hypothesis

HA : p1 > p2,   (10.67)

which corresponds to an enrichment of X compared to Y, the p-value is given by


p = P(X ≥ n11 |X + Y = n1+) = Σ_{x=n11}^{n1+} P(X = x|X + Y = n1+).   (10.68)

Now, we have everything we need to conduct a hypergeometric test.

10.6.3.2 Examples

Let’s consider the example visualized in Fig. 10.8. The numerical values of the
contingency table are shown in Fig. 10.11. From these values, we can estimate the
p-value either by directly utilizing a hypergeometric distribution or by using the
function fisher.test() provided in R.
The application of both methods, shown in Listing 10.6, results in p = 0.9993.
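As that listing is not reproduced here, the following is a minimal sketch of both computations using the numbers from Fig. 10.11; the variable names are assumptions made for this example.

# Upper tail of the hypergeometric distribution: p = P(X >= 3)
p.hyper <- phyper(3 - 1, m = 16, n = 13, k = 12, lower.tail = FALSE)

# Equivalent one-sided Fisher's exact test on the 2x2 contingency table
tab <- matrix(c(3, 9,
                13, 4), nrow = 2, byrow = TRUE)
p.fisher <- fisher.test(tab, alternative = "greater")$p.value

c(p.hyper, p.fisher)   # both give p = 0.9993, as reported in the text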

10.6.4 Finding the Correct Hypothesis Test

The preceding tests are just a few examples. In fact, there are many different hypoth-
esis tests, and it is impossible to discuss them all. Practically, the question is how to
find the correct test for a given problem? For this reason, every comprehensive book
about hypothesis testing distinguishes them according to three properties. First, how

many populations are involved? For instance, for one population, one needs a one-
sample test, for two populations a two-sample test, and so on. Second, what is
the hypothesis that should be tested? This defines the test statistic and determines
the sampling distribution. Third, what is the level of measurement of the data? In
statistics, one distinguishes between nominal data (categorical data), ordinal data
(rank-order data), interval data, and ratio data. Moving from nominal data to the
other levels provides more and more information.
• Nominal data: Data points are merely identifiers; for example, license plate of a
car.
• Ordinal data: Data points have properties of nominal data; plus they can
be ordered. However, differences between data points have no meaning; for
example, final position in a car race.
• Interval data: Data points have properties of ordinal data, and equal intervals
have the same meaning; for example, physical position of the cars finishing a
race.
• Ratio data: Data points have properties of interval data, and they have a true zero
point; for example, weight of a car.
The understanding of all data types is straightforward, except the ratio data. Let’s
consider two examples to show what “have a true zero point” means.
Example: (weight) Suppose that we have two bags given by X = 70 kg
(kilogram) and Y = 140 kg. From this, it follows that Y /X = 2; that is, bag Y
is twice as heavy as bag X. By going to the unit “stone” instead of “kg,” this result
does not change, because 70 kg corresponds to 11 stone and 140 kg corresponds to
22 stone.
Example: (temperature) Suppose that the temperatures we measure at two
different locations are X = 20 F (Fahrenheit) and Y = 40 F, respectively. From
this, it follows that Y /X = 2. However, going to the unit “Celsius” (C)

X = 20 F → X = −6.6 C (10.69)
Y = 40 F → Y = 4.4 C (10.70)

one obtains Y/X ≈ −0.67. Also, for X = 0 kg, one obtains X = 0 stone; however,
X = 0 C does not mean X = 0 F (X = 0 C corresponds to X = 32 F). This means
that for temperature, the data do not have a true zero because depending on the unit,
its meaning changes. Also, in this case, it does not make sense to say that at location
X, it is twice as warm as at location Y , since by changing the unit this assertion is
no longer valid.
A comprehensive book about hypothesis tests is [436]. There, one can find a very
large collection of different tests that can be distinguished with the three properties
we discussed. In total, 32 main test categories are presented and discussed over
nearly 1,200 pages.

10.7 Permutation Tests

The preceding hypothesis tests, for example, the t-test or Fisher’s exact test, are
examples of parametric tests. In general, a parametric test is a hypothesis test that
makes certain assumptions about the underlying distribution(s), which can be used
to derive analytical solutions for the sampling distribution. “Analytical solution”
means that the functional form of the sampling distribution is precisely given. For
instance, the t-test resulted in student’s t-distribution and for Fisher’s exact test in
the hypergeometric distribution. Whenever it is possible and justified, this results
in elegant solutions. However, practically, there are two problems. First, this is not
always possible, and, second, the derivation is mathematically demanding, as we
have seen for the derivation of the hypergeometric distributions based on binomial
distributions.
Luckily, there is another category of hypothesis test that avoids these problems,
called permutation tests. In contrast with the t-test or the Fisher’s exact test, a
permutation test is an approach rather than a particular method that can be applied
to general two-sample problems. Also, a permutation test does not require any
assumptions about the underlying distribution(s). For this reason, permutation tests
are nonparametric tests that provide a numerical approximation of a sampling
distribution rather than its analytical solution.
The underlying idea of a permutation test is simple, and we provide its visualiza-
tion in Fig. 10.12. Suppose that we have data from two populations corresponding
to sample 1 given by X = {Xi }ni=1 and sample 2 given by Y = {Yi }m i=1 . Based on
the data, a test statistic is estimated. An example could be the mean value, given by

X̄ = (1/n) ∑_{i=1}^{n} X_i, (10.71)

Ȳ = (1/m) ∑_{i=1}^{m} Y_i, (10.72)

t_{n,m} = X̄ − Ȳ. (10.73)
Fig. 10.12 A visualization of the basic idea of a randomization test. Shown are two realizations
of randomized data.

Based on this, a null and alternative hypothesis can be formulated as follows:

H0 : μ1 = μ2 . (10.74)
H1 : μ1 > μ2 . (10.75)

where μ1 and μ2 are the mean values of population 1 and population 2, respectively.
The assumption made by a permutation test is that, given that the null hypothesis
is true, the observations have an equal probability to be drawn from population
one or population two. This assumption can be used to randomize the data. In
Fig. 10.12, we show two realizations of such a randomization. Importantly, for
each realization the test statistic can be estimated; that is, t1 (n, m) and t2 (n, m).
Hence, by combining such estimates from many randomizations {tr (n, m)}R r=1 , the
corresponding sampling distribution is estimated.
In total there are (n + m)! different realizations. When all possible realizations
are used, the test is called a permutation test. However, if only a random subset of
all possible realizations is used, then the test is called a randomization test, which is
an approximation of a permutation test.

In Listing 10.7, we present an example using the mean as the test statistic. As one
can see, the estimation of the p-value involves merely the counting of realizations
that are larger than t (n, m), as this approximates the integration over the sampling
distribution.

Listing 10.7: Permutation test with R

n <- 10; m <- 5;
X <- rnorm(n, mean=0, sd=1)
Y <- rnorm(m, mean=0.5, sd=1)
t <- mean(X) - mean(Y)    # observed test statistic

R <- 1000                 # number of realizations

tr <- vector(mode = "numeric", length = R)
for(i in 1:R){
  data_r <- sample(c(X, Y))                                # randomize the group assignment
  tr[i] <- mean(data_r[1:n]) - mean(data_r[(n+1):(n+m)])   # test statistic of the randomized data
}
p <- length(which(tr > t))/R   # p-value for right-sided alternative hypothesis

In the example in Listing 10.7, we used R = 1000 realizations. However,
the general question is, what value of R should be used when performing a
randomization test? Since the second smallest possible p-value is 1/R (the smallest
p-value is zero), the value of R determines the resolution of possible p-values.
For R = 1000, this gives pres = 0.001 = 1/R, which is sufficient for (single)
hypothesis tests and a significance level α larger or equal to pres . Hence, in general,
the value of R needs to be chosen as follows:
Given the significance level α → R ≥ 1/α. (10.76)

10.8 Understanding versus Applying Hypothesis Tests

Finally, we present an example to demonstrate the difference between "understanding" and "applying" a hypothesis test. In Fig. 10.13, we show a worked example for a t-test.
On the left-hand side of Fig. 10.13, a summary of the test is presented, and on
the right-hand side of Fig. 10.13, we show the numerical solution to the problem
using R. The script of the solution is only two lines. In the first line, the data
sample is defined, and in the second, the hypothesis test is conducted. Then, the
function t.test() has arguments that specify the used data and the type of the
alternative hypothesis. In our case, we are using a right-sided alternative indicated

Data from experiment: D = {0.3, 0.3, 0.1, 0.5, 0.1}

Main components of a one-sample t-test:
1. Select appropriate test statistic T: t-score
2. Define null hypothesis H0: θ = 0 and alternative hypothesis H1: θ > 0
3. Find the sampling distribution for T, given H0 is true: Student's t-distribution
4. Choose significance level: α = 0.05
5. Evaluate test statistic t for sample data: 3.2071
6. Determine the p-value: 0.01634
7. Make a decision (accept H0 or reject H0): reject H0

Numerical solution: use t.test()

Fig. 10.13 Example of a one-sample t-test conducted using the statistical programming
language R.

by “alternative=’greater’.” In addition, the null hypothesis needs to be specified. In


our case, we used the default, which is θ = 0; however, by using the argument “mu,”
one can set different values.
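As a minimal sketch of the two-line script described above (the exact listing is not reproduced here; the data vector is taken from Fig. 10.13 as extracted, so the reported t-score of 3.2071 may not be reproduced exactly):

x <- c(0.3, 0.3, 0.1, 0.5, 0.1)      # data sample from Fig. 10.13
t.test(x, alternative = "greater")   # one-sample t-test, right-sided alternative, default mu = 0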
From this example, one can learn the following. First, the practical execution of a
hypothesis test using R is very simple. In fact, every hypothesis test assumes a form
similar to the provided example. Second, with this simplicity, all the complexity
of a hypothesis test discussed in the previous sections of this chapter is hidden
behind the abstract computer command “t.test().” Therefore, a deeper understanding
of a hypothesis test cannot be gained by the practical execution of a black box
(in the preceding example “t.test()” is the black-box). The last point may seem
counterintuitive, especially if one skips the preceding discussion. This is one of the
causes of the widespread misunderstanding of statistical hypothesis tests, in general.
Hence, to understand the logic and mechanics of the inner workings of hypothesis
testing, one needs to open up the black box and study its seven main components
one by one.

10.9 Historical Notes and Misinterpretations

We want to end this chapter by providing some historical information about the
development of statistical hypothesis testing, which can be useful for understanding
some problems that accompany this method.
The modern formulation of statistical hypothesis testing, as presented in this
chapter, was not established as one theory, but rather evolved from two separately
introduced theories. The first is due to Fisher [168] and the second due to Neyman

and Pearson [365]. Since the 1960s, a unified form, which is sometimes called Null
Hypothesis Significance Testing (NHST) [372, 460], has existed. Fisher introduced
the concept of a p-value, while Neyman and Pearson introduced the alternative
hypothesis as complement to the null hypothesis, the Type I and Type II errors,
as well as the power.
There is an ongoing discussion about the differences of both concepts (see,
for example, [42, 307, 388]), which is in general very difficult to follow because
these involve philosophical interpretations of these theories. Unfortunately, these
differences are of interest not only for historical reasons, but also because they lead
to contaminations and misunderstandings of the modern formulation of statistical
hypothesis testing. In particular, arguments are often taken out of context, whereas
properties differ between the various theories [207, 213]. For this reason, we discuss
some of these properties in the following:
1. Is the p-value the probability that the null hypothesis is true given the data?
No, it is the probability of observing more extreme values than the test statistic,
if the null hypothesis is true, i.e., P (x ≥ |t| |H0 is true) (see Eq. 10.11). Hence,
one assumes that H0 is true for obtaining the p-value. Instead, the question aims
to find P (H0 |D).
2. Is the p-value the probability that the alternative hypothesis is true given the
data?
No, see question (1). This would be P (H1 |D).
3. If the null hypothesis is rejected, is the p-value the probability of your rejection
error?
No, the rejection error is the Type I error given by α.
4. Is the p-value the probability of observing our data sample given the null
hypothesis is true?
No, this would be the likelihood.
5. If one repeats an experiment, does one obtain the same p-value?
No, because p-values do not provide information about the long-run frequencies
of repeated experiments like the Type I or Type II errors. Instead, they give the
probability resulting from comparing the test statistic (as a function of the data),
while the null hypothesis is assumed to be true.
6. Does the p-value give the probability that the data were produced by random
chance alone?
No, despite the fact that the data were produced by assuming that H0 is true.
7. Does the same p-value from two studies provide the same evidence against the
null hypothesis?
Yes, but only in the very rare case where everything in the two studies and the
formulated hypotheses is identical, including the sample sizes. In any other case,
p-values are difficult to compare with each other, and no conclusion can be
drawn.

We think that many of the preceding confusions are a result of verbal interpretations of the theory that neglected the mathematical definitions of the entities used. This
is understandable since many people are interested in the application of statistical
hypothesis testing but not in the underlying probability theory. A related problem
may be that a hypothesis test is set up to answer exactly one question (based
on a data sample to reject a null hypothesis). However, there are certainly many
more questions experimentalists would like to have answers for, but a statistical
hypothesis test is not designed to answer more than one question at a time.
In general, to address questions regarding interpretations of a hypothesis test,
it is a good strategy to start answering such questions by looking at the basic
definitions of the involved entities, because only these are exact and provide
insightful interpretations.

10.10 Summary

In this chapter, we provided a basic introduction to the concepts underlying statistical hypothesis testing. Our goal was to make the seven individual components
upon which a general hypothesis test is based as clear as possible. Furthermore, we
discussed some hypothesis tests that are frequently used in applications, such as
Student's t-test and a hypergeometric test (also known as Fisher's exact test).
In our experience, gaining an understanding of hypothesis testing is usually
considered difficult. For this reason, we were aiming at an accessible level of
description, and we presented the bare backbone of the method. We tried to avoid
the application of domain-specific jargon in order to make the knowledge transfer
to different application areas in data science, including biomedical science, social
sciences, marketing, medicine, or psychology, easier [88, 142, 259, 369, 401, 420].
Learning Outcome 10: Statistical Hypothesis Testing

A statistical hypothesis test compares a test statistic with a reference population distribution (sampling distribution), obtained by assuming that the null hypothesis is true, in order to decide whether the null hypothesis should be rejected.

Finally, we would like to note that in many practical applications, one is required
to perform not only one but multiple hypothesis tests simultaneously; for instance,
to identify the differential expression of genes when comparing the effect of a
medication or to identify which marketing campaign is more successful. In such
a situation, one needs to apply a multiple testing correction (MTC) to control the
resulting errors [35, 147, 161]. This is a highly nontrivial problem and a complex
topic in itself, which can lead to erroneous outcomes if not addressed properly [40].
For this reason, we discuss multiple testing corrections in detail in Chap. 15.

10.11 Exercises

1. Discuss the seven main components of a statistical hypothesis test.


2. How many different alternative hypotheses can be formulated? Make a sketch
similar to Fig. 10.4 and define the corresponding integrals for obtaining the p-
values.
3. Plot the gamma distribution, gamma(α = 4, β = 2), corresponding to Fig. 10.3a.
Hint: When using the function rgamma() available in R, the arguments “shape”
and “rate” correspond to α and β, respectively.
4. Estimate the approximate sampling distribution for a sample of size n = 5, where
the individual data points are samples from the gamma distribution, defined in the
previous exercise.
Chapter 11
Linear Regression Models

11.1 Introduction

Another widely used analysis method originating from statistics is linear regression
[221, 226]. Since many application problems require the prediction of a numerical
output variable, such as for forecasting stock prices, temperatures, or sales, such
models are often used in economics, climate science, marketing, and so forth [32,
208].
Similar to classification, linear regression is a type of a supervised learning
model. However, in contrast to classification, a regression model generates a
numerical variable as output, whereas a classification results in a categorical
variable (which is merely a label). Although classical ordinary least squares (OLS)
regression models have been known for a long time, they are still frequently used
today. Importantly, in recent years, there have been many new developments that
have extended classical regression models significantly and that require a thorough
understanding of the more basic models.
In this chapter, we introduce ordinary least squares (OLS) linear regression
models, including methods for diagnosing such models. Furthermore, we discuss
extended models that allow interaction terms, nonlinearities, or categorical pre-
dictors. Finally, we introduce generalized linear models (GLMs), which allow
the response variable to have a distribution other than a normal distribution, thus
enabling a flexible modeling of the response.

11.1.1 What Is Linear Regression?


In a linear regression problem, one tries to find a linear function that fits the observed
values of the form {(x_i, y_i)}_{i=1}^{n} best. Here, "best" means that we need to select
a criterion that allows the quantification of the quality of a fit. It is important to
emphasize that one tries to find a linear function and not a function of general


Data type:
X = {(x_i, y_i)}_{i=1}^{n} with x_i ∈ R^p, y_i ∈ R^m
x_i is called input or predictor vector
y_i is called output or response vector
p: number of covariates or number of predictors
n: number of samples
m: number of response variables

Question addressed:
Is there a (linear) functional dependency between input and output?

Principles of major linear regression approaches:
Simple Linear Regression ⇒ m = 1, p = 1
Multiple Linear Regression ⇒ m = 1, p > 1
General Linear Regression ⇒ m > 1, p > 1
In each case, find coefficients β_i that optimize an error measure.
Fig. 11.1 Overview of linear regression methods with respect to the data type used, the question
addressed, and the principles of major approaches.

shape. This can be seen as a constraint on the class of functions we consider for
this problem. At the beginning of this chapter, we will limit our discussion to linear
regression problems; however, at the end we will also discuss extensions that allow
nonlinear functions.
If xi and yi are scalar values, that is, xi , yi ∈ R, the regression problem is called
a simple linear regression. However, when xi is a vector of dimension p and yi is
a vector of dimension m, the problem is called a general linear regression problem;
see Fig. 11.1.

11.1.2 Motivating Example

To get an intuitive understanding of linear regression, we present an example in


Fig. 11.2. The scatterplot in this figure shows 20 data points corresponding to
the number of murders per annum (per 1 million inhabitants), depending on the
unemployment rate (in percentage) [286]. That means the number of murders per
annum per 1 million inhabitants corresponds to the y values and the unemployment
rate to the x values. Formally, the values of all pairs can be written as (xi , yi ) where
i ∈ {1, . . . , n = 20}; see Table 11.1.
From the visualization in Fig. 11.2, one gets the impression that the relationship
between y and x can be approximated by a linear function of the form

y ≈ β0 + β1 x, (11.1)

Fig. 11.2 Scatterplot of the number of murders per annum per 1 million inhabitants (y-axis)
against the unemployment rate (x-axis).

Table 11.1 Data points shown in Fig. 11.2. Here, the x-values correspond to the unemployment rate (in percentage) and the y-values to the number of murders per annum per 1 million inhabitants [286].

Index   x     y
1       6.2   11.2
2       6.4   13.4
3       9.3   40.7
4       5.3   5.3
5       7.3   24.8
6       5.9   12.7
7       6.4   20.9
8       7.6   35.7
9       4.9   8.7
10      6.4   9.6
11      6.0   14.5
12      7.4   26.9
13      5.8   15.7
14      8.6   36.2
15      6.5   18.1
16      8.3   28.9
17      6.7   14.9
18      8.6   25.8
19      8.4   21.7
20      6.7   25.7

Fig. 11.3 Scatterplot of the number of murders per annum per 1 million inhabitants (y-axis)
against the unemployment rate (x-axis), including three guessed linear approximation lines.

where β0 and β1 are two unknown parameters. In Fig. 11.3, we added three specific
linear functions that have different values for β0 and β1 . For each case, we just
guessed the values of the parameters. Of course, this approach is unsatisfactory
because it is arbitrary and unsystematic. For this reason, in the next section we will
discuss a systematic method that allows the estimation of the values β0 and β1 from
our data set {(x1 , y1 ), . . . , (xn , yn )}.

11.2 Simple Linear Regression

In this section, we will formalize the problem at hand by introducing a model for
the linear relationship between two variables. Such a model is called simple linear
regression. In later sections, we will extend this to more variables (see multiple
linear regression in Sect. 11.4).
Suppose that we have n data points of the form {(x1 , y1 ), . . . , (xn , yn )}. The X
variable is called the covariate, explanatory variable, regressor, input, or predictor,
and the Y variable is called the response, explained variable, or output. A simple
linear regression model is defined by

y = E[Y |X = x] = β0 + β1 x, (11.2)

where β0 and β1 are the model parameters. Equation 11.2 defines the expectation
value of Y given that X assumes the specific value x. The reason why this model
is formulated for an expectation value is that both variables, Y and X, are random
variables that are related via

Y ≈ β0 + β1 X. (11.3)

In Eq. 11.3 we used the approximation symbol (≈) to indicate that this mapping
is not exact but that there is a random error. If we denote these errors by εi , we can
add them to Eq. 11.3 to get the following exact equation:

yi = β0 + β1 xi + εi (11.4)

For the error term, we assume that E[εi |X = x] = 0 and V ar(εi ) = σ 2 for all i.

11.2.1 Ordinary Least Squares Estimation of Coefficients

After formulating the problem, we can present a solution for finding “the best”
values for the parameters β0 and β1 in our linear regression model in Eq. 11.2
(or 11.3) by using the least squares error measure. To define this error measure,
we denote by

ŷ_i = β̂_0 + β̂_1 x_i (11.5)

a predicted value of y_i based on the estimated values of the parameters β̂_0 and β̂_1.
Here, the symbol ˆ (called hat) indicates predicted values. We can measure the quality
of this prediction by taking the difference of the true value from the predicted value
given by

ei = yi − yˆi . (11.6)

The entity ei is the prediction error for the ith data point, and it is called the residual.
It is clear that smaller values of ei indicate a better prediction, and in the case of
ei = 0, the prediction is perfect. In a similar way, we can estimate these errors for
all n data points, and as an aggregate measure, we define


RSS = ∑_{i=1}^{n} e_i², (11.7)

which is called the residual sum of squares (RSS). Due to the squares of ei , we
ensure that each term is not negative. Hence, the RSS is a quantitative error measure
that assesses the quality of the estimated values of βˆ0 and βˆ1 , and the best values
are those that minimize RSS.
To find the extremum of a function, we need to calculate its first derivative and
then check with the second derivative whether the extremum is a minimum or a maximum
[153]. As a result of this calculation (exercises), we find

β̂_1 = ∑_{i=1}^{n}(x_i − x̄)(y_i − ȳ) / ∑_{i=1}^{n}(x_i − x̄)² = Cov(X, Y)/Var(X), (11.8)

β̂_0 = ȳ − β̂_1 x̄ (11.9)

as the optimal parameter values. Here, x̄ = (1/n) ∑_{i=1}^{n} x_i and ȳ = (1/n) ∑_{i=1}^{n} y_i are the
sample means. This is the ordinary least squares (OLS) solution for a simple linear
regression model.
In Fig. 11.4, we show the resulting OLS regression line for our murders data
set. The result was obtained using the function lm() (linear model), available in R,
and the corresponding script is provided in Listing 11.1.


Fig. 11.4 Scatterplot of the number of murders per annum per 1 million inhabitants (y-axis)
against the unemployment rate (x-axis), including the least squares solution.

Overall, the result from the OLS regression, obtained using the function sum-
mary(), contains not only information about the value of the (regression) coefficients
but also their standard errors, t-scores, and p-values (see Listing 11.1). The latter
are discussed in Sects. 11.2.2 and 11.2.3. Information about the distribution of the
residuals is shown at the beginning of the outputs of the function, providing some
summary statistics. Furthermore, the residual standard error (Sect. 11.2.2), multiple
R-squared (Sect. 11.2.4), adjusted R-squared, and F-statistic are provided.
More information about the result from the OLS regression can be obtained using
the function confint(), providing information about the 95% confidence intervals of
the regression coefficients.

Finally, using the function coef(), we obtain a vector of the estimated regression
coefficients.
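Because Listing 11.1 itself is not reproduced here, the following sketch illustrates the described workflow for the data of Table 11.1; the object name res is chosen only for illustration.

# x: unemployment rate (%), y: murders per annum per 1 million inhabitants (Table 11.1)
x <- c(6.2, 6.4, 9.3, 5.3, 7.3, 5.9, 6.4, 7.6, 4.9, 6.4,
       6.0, 7.4, 5.8, 8.6, 6.5, 8.3, 6.7, 8.6, 8.4, 6.7)
y <- c(11.2, 13.4, 40.7, 5.3, 24.8, 12.7, 20.9, 35.7, 8.7, 9.6,
       14.5, 26.9, 15.7, 36.2, 18.1, 28.9, 14.9, 25.8, 21.7, 25.7)
res <- lm(y ~ x)   # OLS fit of the simple linear regression model
summary(res)       # coefficients, standard errors, t-scores, p-values, RSE, R-squared
confint(res)       # 95% confidence intervals of the regression coefficients
coef(res)          # vector of the estimated regression coefficients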

At this point, it should be clear why linear regression models are discussed after
statistical inference (Chap. 6) and hypothesis testing (Chap. 10) — there are various
concepts discussed in those chapters, such as confidence intervals and hypothesis
tests, that are needed for regression models.

11.2.2 Variability of the Coefficients

Recall that our estimation of the regression coefficients of the linear model in
Eqs. 11.8 and 11.9 is based on a data sample {(x1 , y1 ), . . . , (xn , yn )}. That means,
if we were given another data sample of the same size and we repeated the same
procedure, we would get estimates for the model parameters that may be different
from the first ones. For this reason, a natural question is, how variable are our
estimates in Eqs. 11.8 and 11.9?
To derive the following results, we need to make some further assumptions.
Namely, the errors εi are independent from each other and:
• V ar(εi ) = σ 2 for all i ∈ {1, . . . , n}.
• εi ∼ N (0, σ 2 ) for all i ∈ {1, . . . , n}.
Based on these assumptions, one can show that the standard errors of the estimated
model parameters βˆ0 and βˆ1 are given by

SE(β̂_1)² = σ² / ∑_{i=1}^{n}(x_i − x̄)², (11.10)

SE(β̂_0)² = σ² [ 1/n + x̄² / ∑_{i=1}^{n}(x_i − x̄)² ]. (11.11)

Furthermore, one can estimate the variance of the noise, that is, V ar(ε), by

σ̂² = (1/(n−2)) ∑_{i=1}^{n} e_i². (11.12)

11.2.3 Testing the Necessity of Coefficients

An important question one should always ask after estimating the model parameters
is whether these parameters are necessary or can be removed. The interesting point
is that Eqs. 11.8 and 11.9 will always give some numerical values for βˆ0 and βˆ1 , but
what is not clear from these numbers is if one can remove the parameters from the
model (by setting them to zero) without affecting the performance of the model. In
statistical terms, we can formulate this question in the form of a hypothesis test for
the model parameter β1 as follows:
Null hypothesis H0: β1 = 0
Alternative hypothesis H1: β1 ≠ 0
Practically, one can use a one-sample t-test to perform the hypothesis test for the
t-score

t = (β̂_1 − 0) / SE(β̂_1). (11.13)

Here, the standard error from Eq. 11.10 is used.


The resulting values are shown in the fourth and fifth columns of the coefficients
table, see Listing 11.1.

11.2.4 Assessing the Quality of a Fit

To assess the quality of a fit, two entities are used: the residual standard error (RSE)
and the R 2 statistic.
The residual standard error (RSE) assesses the standard deviation of all errors
ei = yi − ŷi made in predicting the (true) y-values. This is accomplished by
normalizing RSS in the following way:


RSE = √( (1/(n−2)) ∑_{i=1}^{n} (y_i − ŷ_i)² ); (11.14)
    = √( RSS/(n−2) ). (11.15)

The information RSE provides about the quality of a fit is relative, where high values
indicate a worse and low values a better fit. For this reason, RSE measures the lack-
of-fit. Here, “relative” means that one cannot provide absolute values of RSE that can
be used as a strict guideline, but one needs to compare the RSE values of different
fits with each other on a case-by-case basis to assess the meaning of a particular
value.
The R 2 statistic, also called the coefficient of determination, measures the
proportion of explained variance by the model. It is defined by

R² = (TSS − RSS) / TSS, (11.16)

where

TSS = ∑_{i=1}^{n} (y_i − ȳ)², (11.17)

RSS = ∑_{i=1}^{n} (y_i − ŷ_i)². (11.18)

The interpretation of TSS is that it measures the variance in the response variables
yi relative to ȳ, and RSS measures the variance of the predicted and true response
variables. Hence, TSS can be seen as a measure of the variance in the data (without
considering a model fit), and RSS is the variance left after we fit a model. Overall,
this means that R 2 is a measure for the explained variance in the data. The
normalization by TSS ensures that R 2 can only assume values in [0, 1].
For a simple linear regression, it can be shown that

√(R²) = r. (11.19)

That means the square root of R² corresponds to the correlation coefficient between
the predictor variable X and the response variable Y; that is, r = Cor(X, Y).
It is important to note that the function summary() in R (see Listing 11.1) denotes
R 2 as multiple R-squared.

11.3 Preprocessing

Let’s assume we have data of the form (xi , yi ) with i ∈ {1, . . . , n}, where n is the
number of samples, x_i = (X_i1, . . . , X_ip)^T corresponds to the vector of p predictors,
and yi is the response variable. We denote by y ∈ Rn the vector of response variables
and by X ∈ Rn×p the predictor matrix. The vector β = (β1 , . . . , βp )T gives the
regression coefficients.
The predictors and the response variable shall be standardized; that means

x̄_j = (1/n) ∑_{i=1}^{n} X_ij = 0 for all j, (11.20)

s̄_j² = (1/n) ∑_{i=1}^{n} X_ij² = 1 for all j, (11.21)

ȳ = (1/n) ∑_{i=1}^{n} y_i = 0. (11.22)

Here, x̄j and s̄j2 are the mean and variance of the predictor variables, and ȳ is the
mean of the response variables.
To study the regularization of regression models, we need to solve optimization
problems, which are formulated in terms of norms. For this reason, we review, in the
following, the norms needed for the subsequent sections. For a real vector x ∈ Rn
and q ≥ 1, the Lq-norm is defined by
‖x‖_q = ( ∑_{i=1}^{n} |x_i|^q )^{1/q}. (11.23)

For the special case q = 2, one obtains the L2-norm (also known as Euclidean
norm), and for q = 1 the L1-norm. Interestingly, for q < 1, Eq. 11.23 is defined,
but it is no longer a norm in the mathematical sense.
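As a small illustration of Eq. 11.23, the following hypothetical helper computes the Lq-norm of a vector in R:

lq_norm <- function(x, q) {   # Lq-norm of a real vector x, q >= 1 (Eq. 11.23)
  sum(abs(x)^q)^(1/q)
}
lq_norm(c(3, -4), q = 2)      # L2-norm (Euclidean norm): 5
lq_norm(c(3, -4), q = 1)      # L1-norm: 7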

11.4 Multiple Linear Regression

In this section, we extend the results for a simple linear regression model, discussed
in the previous section, to a multiple linear regression model.
We begin our discussion by formulating a multiple regression problem:


Y_i = ∑_{j=1}^{p} X_ij β_j + ε_i (11.24)

Here, x_i = (X_i1, . . . , X_ip)^T ∈ R^p is the vector of p predictor variables that are linearly
mapped onto the response variable Y_i ∈ R for sample i. The mapping is defined
by the p regression coefficients β_j. For the noise term, ε_i, we assume again that
εi ∼ N(0, σ 2 ), which summarizes all kinds of uncertainties, including measurement
errors.
To see the similarity between a multiple linear regression, which has p predictor
variables, and a simple linear regression, having one predictor variable, one can
write Eq. 11.24 in the form

y_i = x_i^T β + ε_i. (11.25)

Here, x_i^T β is the inner product (scalar product) between the two p-dimensional
vectors x_i = (X_i1, . . . , X_ip)^T and β = (β_1, . . . , β_p)^T. One can further summarize
Eq. 11.25 for all samples i ∈ {1, . . . , n} by

y = Xβ + ε. (11.26)

Here, the noise terms assume the form ε ∼ N(0, σ 2 I n ), where I n is the Rn×n
identity matrix. The matrix X is called the model matrix.
The solution of Eq. 11.26 can be formulated as an optimization problem given by

β̂^OLS = arg min_β ‖y − Xβ‖_2². (11.27)

The ordinary least squares (OLS) solution of Eq. 11.27 can be analytically calcu-
lated, assuming that X has a full column rank, which implies that XT X is positive
definite. The estimation of the model parameters is given by

β̂^OLS = (X^T X)^{−1} X^T y. (11.28)

If X does not have a full column rank, the solution cannot be uniquely determined.
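The closed-form solution in Eq. 11.28 can be checked numerically against lm(); the following is a sketch with simulated data (all object names are chosen for illustration).

set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 3), n, 3))      # model matrix with an intercept column, full column rank
beta <- c(1, 2, -1, 0.5)
y <- X %*% beta + rnorm(n)

beta_ols <- solve(t(X) %*% X) %*% t(X) %*% y   # Eq. 11.28
coef(lm(y ~ X - 1))                            # same estimates via lm(); '-1' because X already contains the intercept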

11.4.1 Testing the Necessity of Coefficients

Similar to simple linear regression, where we tested the necessity of coefficients in


Sect. 11.2.3, we need to assess the coefficients for multiple linear regression models.
However, instead of conducting a test for only one coefficient, for multiple linear
regression, one tests the following:
Null hypothesis H0: β1 = β2 = · · · = βp = 0
Alternative hypothesis H1: at least one βj ≠ 0
As a test score, the F-statistic is used:

F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]. (11.29)

The resulting value is shown in Listing 11.4.

11.4.2 Assessing the Quality of a Fit

The coefficient of determination defined in Sect. 11.2.4 is only valid for simple
linear regression models. For a multiple linear regression model, this definition
needs to be adjusted, considering the number of parameters in the model. The
resulting measure is called adjusted R 2 , and it is given by

R²_adj = 1 − [RSS/(n − p − 1)] / [TSS/(n − 1)]. (11.30)

The resulting value of the adjusted R 2 for our example is shown in Listing 11.4.

11.5 Diagnosing Linear Models

To fit linear models, we made several theoretical assumptions, for example, about
the error and the model itself. In this section, we will discuss the following topics,
which are important for model diagnoses:
1. Error assumptions
• Independence
• Homoscedasticity (constant variance)
• Normality
2. Linearity assumption
3. Leverage points
4. Outliers
5. Collinearity
From these topics listed above one can see that diagnosing linear models is a
nontrivial and complex task that should not be taken lightly because omissions can
lead to severe problems and even invalidate an analysis. In the following, we discuss
these topics one by one.
286 11 Linear Regression Models

10 10

5 5
residuals(res)

ei+1
0 0

−5 −5

−10 −10
5 10 15 20 −10 −5 0 5 10
index ei

Fig. 11.5 Diagnostic plots to check independence of residuals. Left: Index plot of the residuals.
Right: ei vs ei+1 .

11.5.1 Error Assumptions

For the errors of a regression model, we made the following assumptions:


• Independence
• Constant variance
• Normality

Independence The first assumption we check is the independence of errors.


Graphically, one can diagnose this by plotting ei versus the index or by plotting
ei versus ei+1 .

The left plot in Fig. 11.5 looks a bit suspicious because there are several residuals
in a row, which assume increasingly negative values; however, the plot for ei versus
ei+1 on the right-hand side looks fine. To confirm this quantitatively, one can use
the Durbin-Watson test for ei versus ei+1 . It uses the following test statistic:
DW = ∑_{i=2}^{n} (e_i − e_{i−1})² / ∑_{i=1}^{n} e_i². (11.31)

From the resulting p-value in Listing 11.6, one can see that the null hypothesis
stating a zero correlation cannot be rejected; hence, the residuals are independent of
each other.
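The Durbin-Watson test is available, for example, through the function dwtest() from the lmtest package; this sketch assumes the fitted model object is called res, as in the listings of this section.

library(lmtest)   # install.packages("lmtest") if necessary
dwtest(res)       # Durbin-Watson test for autocorrelation of the residuals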
Homoscedasticity The second assumption we check is the homoscedasticity
(constant variance). For this, we plot the residuals (e_i) as a function of ŷ_i.

Fig. 11.6 Scatterplot showing ei as a function of yˆi (blue points). The red line is a linear regression
of these values.

Figure 11.6 shows the result from Listing 11.7. In addition, we added a red line
that corresponds to the linear regression of these values, obtained using the command
lm(residuals(res) ~ fitted.values(res)). Homoscedasticity means that the variance does
not change along the fitted values; hence, the spread of the residuals should be approximately constant.
In such a case, a linear regression has a slope of zero, as shown by the results in
Fig. 11.6. Any non-zero slope is an indicator of heteroscedasticity (the absence of
homoscedasticity).
Normality The third assumption we check is the normality of the errors. Graphi-
cally, this can be done with a Q-Q plot.

Listing 11.8 shows how to do this using R, and the result is displayed in Fig. 11.7.
This assumption can also be checked quantitatively using the Shapiro-Wilk test.
Listing 11.9 shows the result of this test.
The null hypothesis of the Shapiro-Wilk test states the normality of a distribution.
As one can see, the p-value indicates that the null hypothesis cannot be rejected.
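Since Listings 11.8 and 11.9 are not reproduced here, a minimal sketch of both checks (again assuming the fitted model res) could look as follows:

e <- residuals(res)
qqnorm(e)         # Q-Q plot of the residuals against normal quantiles (cf. Fig. 11.7)
qqline(e)         # reference line
shapiro.test(e)   # Shapiro-Wilk test; H0: the residuals are normally distributed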

11.5.2 Linearity Assumption of the Model

Since we want to fit a linear model, that is, either a simple linear regression or a
multiple linear regression model, the underlying relationship between the data of the
predictors and the response variable needs to be linear. For a simple linear regression
model, this can be graphically checked by plotting xi versus yi . However, for (high-
dimensional) multiple linear regression models, this is not straightforward, because
only two-dimensional projections can be plotted easily. For this reason, alternatively,
one can use a score quantifying the quality of a fit (see Sect. 11.4.2).

11.5.3 Leverage Points

Another entity with which to characterize a regression model is a leverage point.


The idea behind a leverage point is to realize that not all data points exhibit the
same influence on the model. To identify influential points, one can use the leverage
hi . For a simple linear regression model, one can show that

V ar(ei ) = σ 2 (1 − hi ), (11.32)
V ar(yˆi ) = σ 2 hi , (11.33)

where σ 2 is the variance of the noise and the hi are estimated by

h_i = 1/n + (x_i − x̄)² / ∑_{i=1}^{n}(x_i − x̄)². (11.34)

From Eq. 11.34, it follows that 0 ≤ h_i ≤ 1 and ∑_{i=1}^{n} h_i = p. Hence, the effect of
a large h_i is to reduce the variance of a residual. Here, "large" means a value well above the
average leverage p/n = 2/n, where 2 is the number of parameters of the model and n
is the sample size; a common rule of thumb flags points whose leverage exceeds twice this average (see Fig. 11.8).

Fig. 11.7 Q-Q plot for checking the normality of the residual distribution.

We would like to note that the leverage points given by Eq. 11.34 depend only on
the xi values. Hence, they assess the influence of extremal x-points.

From Fig. 11.8, one can see that the data points 3 and 9 have larger leverages than
all the other points.
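In R, the leverages h_i of a fitted model can be obtained directly; a minimal sketch (assuming the fitted model res with n = 20 observations) is:

h <- hatvalues(res)                    # leverages h_i
avg <- length(coef(res)) / length(h)   # average leverage p/n, here 2/20 = 0.1
which(h > 2 * avg)                     # points with more than twice the average leverage (cf. Fig. 11.8)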

Fig. 11.8 Half-norm plot for the leverages hi . The two labeled data points are twice as large as
the average leverage value of p/n, which here is 0.1.

11.5.4 Outliers

In contrast with the leverage points, which focus only on the x-points, outliers
are indicated by “unusual y-points.” For identifying outliers, studentized residuals,
given by
ẽ_i = e_i / SE(e_i), (11.35)

can be utilized. Observations with studentized residuals, e˜i , larger than an absolute
value of 3 are generally considered outliers.
As a warning, we would like to remark that an unusual y-point could also indicate
a deficiency of the model. Hence, one needs to be careful in judging whether a y-
point is an outlier.
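Studentized residuals are directly available in R via rstandard(); note that R additionally provides externally studentized residuals via rstudent(). A minimal sketch (assuming the fitted model res):

e_stud <- rstandard(res)   # studentized residuals e_i / SE(e_i), cf. Eq. 11.35
which(abs(e_stud) > 3)     # observations commonly flagged as outliers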

11.5.5 Collinearity

Collinearity between two predictor variables means that these two variables are
closely related. Often, this is indicated by a high correlation coefficient. The problem
is that this collinearity can lead to a reduction in the quality of the fit for the model
due to a counterproductive competitive effect between these variables. A related

problem is that in such a case, it is also difficult to separate the individual influences
of the predictor variables on the response. More generally, this can also involve more
than two predictor variables. In that case, it is called multicollinearity.
A simple way to check for pairwise collinearity is by looking at the correlation
matrix for the predictor variables. However, this does not always reveal an existing
collinearity and has even more severe limitations for multicollinearity. For this
reason, the variance inflation factor (VIF), defined by

VIF(β̂_i) = 1 / (1 − R²_{X_i|X_−i}), (11.36)

can be used [433]. Here, R²_{X_i|X_−i} is the R² of a regression model of X_i on all remaining
variables, indicated by X_−i. For a value of R²_{X_i|X_−i} close to one, multicollinearity is
detected, and the resulting value of VIF(β̂_i) will be large.
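The VIF can be computed, for example, with the function vif() from the car package; the following sketch uses an illustrative multiple regression on the mtcars data.

library(car)   # install.packages("car") if necessary
res.mult <- lm(mpg ~ wt + hp + disp, data = mtcars)   # illustrative multiple linear regression
vif(res.mult)  # variance inflation factors; large values indicate (multi)collinearity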

11.5.6 Discussion

Least squares regression models can perform very badly when there are outliers
in the data. For this reason, it can be very helpful to perform outlier detection on
the data and remove these outliers from the data before building the regression
model. Least squares regression is so sensitive to outliers because the model does
not perform any form of coefficient shrinkage of the regression coefficients, such
as, for example, the LASSO (see Chap. 13). However, without such a restrictive
mechanism built directly into the model, outliers can cause the coefficients in the
model to become very large.
Another factor that can lead to the bad performance of a model is the correlation
between predictor variables. The disadvantage of the regression model is that it
does not perform any form of variable selection to reduce the number of predictor
variables, such as ridge regression or LASSO. Instead, it uses the variables specified
as input to the model.
The third factor that can reduce the performance of a model is called heteroskedasticity or heteroscedasticity. It refers to varying (that is, non-constant)
variances of the error term, depending on the sampling region. One particular
problem caused by heteroskedasticity is that it leads to inefficient and biased
estimates of the OLS standard errors, which can result in biased statistical tests
of the regression coefficients [279].
In summary, ordinary least squares regression performs neither a shrinkage
nor a variable selection, potentially leading to the aforementioned problems. For
this reason, advanced regression models have been introduced to guard against
such problems. We will discuss such extensions in Chap. 13, where we introduce
regularization methods.

11.6 Advanced Topics

In the next section, we address some advanced topics of multiple linear regression
that further extend the framework toward more complex models. Specifically, we
will discuss interactions, nonlinearities, and categorical predictors.

11.6.1 Interactions

The first extension of multiple linear regression we discuss includes interaction


terms in the regression model. In general, an interaction is a multiplicative effect
of two (or more) predictors on the response variable. An example for such an
interaction term is given in the following model:

yi = β0 + β1 x1 + β2 x2 + β3 x1 x2 + εi . (11.37)

The term x1 x2 introduces a nonlinearity in the model, which means that the model
is no longer linear in x1 and x2 .
However, by introducing the new auxiliary variable x3 = x1 x2 , one can create
a new model that is linear in x1 , x2 , and x3 . In R, such a model can be defined via
x1 ∗ x2 . This creates the exact same model as in Eq. 11.37. In Listing 11.11, we
present an example using simulated data.
If one wants to include just the interaction term but not the individual predictors,
one needs to use the syntax x1 : x2 . This creates a model of the form

yi = β0 + β1 x1 x2 + εi , (11.38)

which is obviously completely different from the model in Eq. 11.37.
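Because Listing 11.11 is not reproduced here, the following sketch with simulated data illustrates both formula variants:

set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2*x1 - x2 + 3*x1*x2 + rnorm(n)

res.full <- lm(y ~ x1 * x2)   # main effects plus interaction, as in Eq. 11.37
res.int  <- lm(y ~ x1 : x2)   # interaction term only, as in Eq. 11.38
summary(res.full)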

11.6.2 Nonlinearities

The next extension allows nonlinear terms, such as polynomials, in a model. For
instance, a model of the form

y = β0 + β1 x1 + β2 x1³ + ε (11.39)

is nonlinear in x1. However, by defining a new variable x2 = x1³, the preceding model becomes

y = β0 + β1 x1 + β2 x2 + ε, (11.40)

which is linear in x1 and x2 .


Listing 11.12 presents an example of how a polynomial nonlinearity is analyzed
using R. To indicate that a covariate should enter the model as a polynomial term, the wrapper
I() needs to be used within the formula of the function lm().
It is important to note that this works not only for polynomials but also for general
nonlinear transformations; for example, using I(log(x)) makes this trick applicable
to general nonlinearities.
In Fig. 11.9, we show a visualization of the results in Listing 11.12. Here, the
purple line corresponds to the fitted coefficients of the cubic model in Eq. 11.39. For
comparison, we repeated a similar analysis for a quadratic model given by

y = β0 + β1 x1 + β2 x1². (11.41)

The result for this model is shown by the red line in Fig. 11.9.
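A sketch of how such polynomial terms are specified (Listing 11.12 itself is not reproduced here; simulated data are used for illustration):

set.seed(1)
n <- 100
x1 <- runif(n, -1, 3)
y <- 1 + 2*x1 + 1.5*x1^3 + rnorm(n)

res.cubic <- lm(y ~ x1 + I(x1^3))   # cubic model, Eq. 11.39
res.quad  <- lm(y ~ x1 + I(x1^2))   # quadratic model, Eq. 11.41
summary(res.cubic)
# general nonlinear transformations work the same way, e.g., lm(y ~ I(log(x1))) for positive x1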

11.6.3 Categorical Predictors

Linear regression models can also handle categorical predictors. A categorical


predictor does not assume numerical values, such as real numbers, but rather
categories, which are also called levels. The categorical predictor itself is called
a factor.
Generally, one categorical variable with n levels can be substituted by n − 1
dummy variables (or indicators), each with two levels. These n − 1 new dummy
variables contain the same information as the single variable.
As an example, let’s use the ’mtcars’ data set to perform a regression of the
variable “mpg” (miles per gallon) on the variables weight (“wt”) of a car and its
number of cylinders (“cyl”). Here, we consider “wt” as a numerical predictor and
“cyl” as factor. For this example, “cyl” has three levels: 4 cylinder, 6 cylinder, and
8 cylinder. Hence, we need two indicator variables, xI 1 and xI 2 , to code the factor
“cyl.” That means the regression model will assume the following form:

yi = β0 + β1 x1 + β2 xI1 + β3 xI2 + εi . (11.42)



Fig. 11.9 The figure shows results of two nonlinear regression models. The purple line corre-
sponds to a cubic model and the red line to a quadratic model.

Here, x1 ∈ R (as usual), but xI1, xI2 ∈ {0, 1}. In total, this allows one to code three
different configurations given by

yi = β0 + β1 x1 + εi , for xI1 = 0, xI2 = 0 (11.43)


yi = (β0 + β2 ) + β1 x1 + εi , for xI1 = 1, xI2 = 0 (11.44)
yi = (β0 + β3 ) + β1 x1 + εi , for xI1 = 0, xI2 = 1. (11.45)

As one can see, all three models have the same slope, but have a different intercept
due to the influence of the different levels of the factorial variable. Importantly, an
indicator variable can only assume one level because both indicators code together
for the number of cylinders (“cyl”), which is in our case a factor. For this reason, for
example, xI1 = 1, xI2 = 1 is not possible (see the discussion on using the function
contrasts()). The numerical result of this model is shown in Listing 11.13.
To clarify the meaning of a factor, one can use the function contrasts() available
in R. The output of this function is a table showing the coding of two indicator
variables (columns) for the number of cylinders (rows). This means that “4
cylinders” (cyl4) is the reference point because it corresponds to the absence of
the indicator variables; that is, xI1 = 0, xI2 = 0. It also shows that cyl6 and
cyl8, corresponding to 6 and 8 cylinders, are coded using xI1 = 1, xI2 = 0 and
xI1 = 0, xI2 = 1, respectively. In summary, this gives the following meaning for the

different models:

yi = β0 + β1 x1 + εi , for 4 cylinder, (11.46)


yi = (β0 + β2 ) + β1 x1 + εi , for 6 cylinder, (11.47)
yi = (β0 + β3 ) + β1 x1 + εi , for 8 cylinder. (11.48)

Hence, the results for 6 and 8 cylinders are shown with reference to 4 cylinders
(see the change in the intercept in Eqs. 11.47 and 11.48).
If one wants to use a different reference, one needs to indicate this explicitly.
For instance, including the command dat$cyl <- relevel(dat$cyl, ref="6") will use
6 cylinders as the reference. It is worth noting that in this case, the remaining
indicators are arranged automatically. To illustrate this, an example is shown in
Listing 11.14.

If one does not want an automatic arrangement, one can also specify the order of
all levels. This can be done using the command dat$cyl <- factor(dat$cyl, levels =
c("8", "6", "4")) (that is, we change the first line in Listing 11.14).

11.6.4 Generalized Linear Models

The last extension we present in this chapter is a model family called generalized
linear models (GLM). GLMs were popularized by McCullagh and Nelder in the
1980s.
In contrast with the models discussed so far, a GLM is different (more general) in
two aspects. First, a GLM allows the response variable to have a distribution other
than the normal distribution, and second, we can model the connection between
predictors and (the mean of) response by a link function. For the distribution, the
only restriction is that the distribution of the response needs to be a representative
of the exponential family. This family class of probability distributions includes
normal, binomial, Poisson, Dirichlet, or gamma distributions; hence, such a model
family is very flexible.

Table 11.2 An overview of common link functions, η = g(μ), and their corresponding mean functions, μ = g⁻¹(η).

Link function name   Link function: η = g(μ)   Mean function: μ = g⁻¹(η)   Mean function name
Identity             μ                          η                           Identity
Inverse              μ^{−1}                     η^{−1}                      Inverse
Inverse-square       μ^{−2}                     η^{−1/2}                    Inverse-square-root
Log                  ln μ                       exp(η)                      Exponential
Logit                ln(μ/(1−μ))                1/(1+exp(−η))               Logistic
Probit               Φ^{−1}(μ)                  Φ(η)                        Normal quantile

A GLM is defined by the following three components:


1. A conditional probability distribution, f (), for the response variable Y from the
exponential family, y ∼ f (y|μ).
2. A linear predictor, η, given by η = β0 + β1 x.
3. A link function, g, given by μ = g −1 (η) = E[Y |X].
The linear predictor includes the independent variables X in the model. So, we have
formulated a GLM for one covariate for simplicity; however, an extension to the
multivariate case is straightforward by defining


η = β0 + ∑_{i=1}^{p} βi xi. (11.49)

The link function models the relationship between the linear predictor and the
mean of the distribution function given by η = g(μ). The definition of the link
function assumes that g is a differentiable function for which an inverse exists. The
inverse g −1 is called the mean function because it gives the mean of the response
variable; that is, μ = g −1 (η).
In Table 11.2, we show some examples of common link functions and the
corresponding mean functions. Specifically, the first two columns show the names
of link functions and their equations, that is, η = g(μ), and the remaining column
shows the corresponding mean functions, μ = g −1 (η), and their common names.
For the probit function, Φ(·) is the cumulative distribution function of the
standard normal distribution.
At this point, the importance of the link function can be discussed in more detail.
Let’s consider two link functions: Identity and Inverse. The corresponding linear
predictors are given by

η = μ = β0 + β1 x = E[Y |X]; (11.50)


η = μ−1 = β0 + β1 x = E[Y |X]−1 . (11.51)

The linear predictor in Eq. 11.50 corresponds to a linear regression model, whereas
Eq. 11.51 requires a transformation (the inverse) to obtain

E[Y|X] = (β0 + β1 x)^{−1}. (11.52)

Hence, from Eq. 11.52, it is clear that the expectation value of the response variable
is nonlinear in the predictor variable. As one can see, by using different link
functions one can obtain different forms of nonlinearity to model the relationship
between predictor and response variables.
When specifying a GLM, the link function needs to suit the probability distri-
bution of the response variable. This means that it cannot be arbitrarily chosen.
Information about possible link functions can be obtained via the help function
in R for the function family(). From this, one finds, for example, the “Gaussian”
(aka normal) family accepts the link functions “Identity,” “Log,” and “Inverse,” and
the “Poisson” family accept the links “Log,” “Identity,” and “Square-root.” This
information can be used to fit a GLM via the function glm.
From these examples, one can see that the link function is not unique to a family;
multiple families can have the same link function. For instance, “Log” can be
used as a link function for a Gaussian or a Poisson family. Overall, R provides
information for six families (Gaussian, Binomial, Poisson, Quasi, Gamma, and
Inverse-Gaussian), each with a number of link functions.
After specifying the probability family, the linear predictor, and the link function,
the parameters of the model {βi}_{i=0}^{p} need to be estimated. For this optimization, the
function glm uses maximum likelihood estimation.

11.6.4.1 How to Determine Which Family to Use When Fitting a GLM

An important question is how to decide which family to use when fitting a GLM.
This depends on the type of the response variable. Specifically, one can select a
family based on the following distinctions:
• Continuous and unbounded response: Gaussian distribution
• Continuous and non-negative response: Gamma distribution or Inverse-Gaussian
distribution
• Count response: Poisson distribution, quasi-Poisson distribution, or negative
binomial distribution
• Binary response: Binomial distribution
• Multi-category response: Multinomial distribution

11.6.4.2 Advantages of GLMs over Traditional OLS Regression

Some advantages of GLMs over traditional (OLS) regression include the following:

• The choice of link function is separate from the choice of the probability of the
response variable. This gives more flexibility for the modeling.
• If the link produces additive effects, then we do not need constant variance.
• The coefficients of the model are fitted via maximum likelihood estimation.
• R provides the function glm, which is very flexible in specifying the three
components needed to define a GLM. This allows for an easy comparison of
different GLMs.

11.6.4.3 Example: Poisson Regression

In the following, we discuss a problem involving count data. For such data the
response variable assumes only discrete values; for example, y ∈ {0, 1, . . . , m}.
Importantly, the number of values is larger than two, that is, m > 1; otherwise,
we would have binary data. A probability distribution for count data is the Poisson
distribution. Hence, in the following, we establish a GLM for a Poisson regression.
We use data from [320] on the number of publications produced by Ph.D.
biochemists to illustrate the application of a Poisson regression. The data can be
loaded from the Stata website, as shown in Listing 11.15.

In Fig. 11.10, we show the number of publications by biochemists produced in


the last three years of their Ph.D. program. This corresponds to the response variable
of our model.
In Listing 11.16, we show a Poisson regression with Log as the link function for
the covariate “ment” corresponding to the articles published by the mentor in the
last three years.
The coefficients in Listing 11.16 can be accessed using the command
exp(coef(res.poisson)).
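Since Listings 11.15 and 11.16 are not reproduced here, the following sketch shows the described model fit; it assumes that the biochemist data of [320], with the variables art and ment, are available, for example, as the data set bioChemists from the pscl package.

data("bioChemists", package = "pscl")   # install.packages("pscl") if necessary
res.poisson <- glm(art ~ ment, family = poisson(link = "log"), data = bioChemists)
summary(res.poisson)                    # coefficients, deviances, AIC, Fisher scoring iterations
exp(coef(res.poisson))                  # multiplicative effect of a one-unit change in ment, cf. Eq. 11.65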
In the output of Listing 11.16, one can see information about the deviance of the
model. In general, deviance is a measure of goodness of fit of a generalized linear
model, and higher values indicate a worse fit.
There are two versions of the deviance provided in Listing 11.16: Null deviance
and Residual deviance. The null deviance shows how well the response variable
is predicted by a model that includes only the intercept but no other covariates.
In contrast, the residual deviance is for the full model as specified by the function
glm(). For this reason, the null deviance is always larger than the residual deviance

Fig. 11.10 The figure shows the number of published articles in the last three years of Ph.D.
program.

when the data are not explained by a model containing only an intercept. In Listing
11.16, it is indicated that “Dispersion parameter for Poisson family taken to be 1”
because this is the assumption of the Poisson model. However, as a rule of thumb,

the ratio of the residual deviance and the degrees of freedom (df) gives an estimate
for the actual dispersion:

φ̂ = Residual deviance / df. (11.53)

If the fit is good φ̂ ≈ 1. From the value in Listing 11.16, we find that φ̂ =
1.82, which means that there is about 80% overdispersion. We will continue this
discussion at the end of this section.
Information about “Fisher Scoring” refers to the optimization method used
to estimate the coefficients. Since the maximum likelihood estimation cannot be
performed analytically, numerical procedures need to be used, such as the Newton-
Raphson method. The Fisher Scoring improves upon the Newton-Raphson method
by replacing the Hessian matrix with its expected value, which is the (negative)
Fisher information.
The meaning of AIC (Akaike information criterion) will be discussed in
Sect. 12.5.
Using the coefficients in Listing 11.16, one can connect to the Poisson distribution, which is given by

P(Y = y|λ) = exp(−λ) λ^y / y!, (11.54)

and characterized by the single parameter λ.


For a Poisson distribution, the mean and variance are given by

E[Y |x] = λ, (11.55)


Var[Y |x] = λ. (11.56)

Using the connection between the link function and the linear predictor, one
obtains

ln(λ) = β0 + β1 x. (11.57)

Here, we just used μ = λ since the parameter for a Poisson distribution is commonly
denoted λ. From Eq. 11.57, it follows that

λ = exp(β0 + β1 x). (11.58)

Overall, this results in the Poisson distribution for our GLM, given by

P(Y = y|x, β0, β1) = exp(−exp(β0 + β1 x)) exp(β0 + β1 x)^y / y!. (11.59)

Here, (x, y) is an observation point. It is important to note that Eq. 11.59 provides
the probability for different responses given by y. However, the predictions of the

GLM are with respect to the expectation value; that is,

E[Y |x] = exp(β0 + β1 x). (11.60)

Equation 11.60 is also important in interpreting the regression coefficients. In order


to explore this, let’s rewrite Eq. 11.60 as

E[Y |x] = exp(β0 )exp(β1 x). (11.61)

From this, it follows that:


• β1 > 0: Positive association with x
• β1 < 0: Negative association with x
• β1 = 0: No association with x
Furthermore, for x = 0 we obtain, from Eq. 11.61, that

E[Y |x = 0] = exp(β0 ). (11.62)

To interpret the change in the response, going from x to x +1 — that is, changing
the value of x by one unit — one needs to assess the following expressions:

E[Y |x] = exp(β0 )exp(β1 x); (11.63)


E[Y |x + 1] = exp(β0 )exp(β1 (x + 1)) = exp(β0 )exp(β1 x)exp(β1 ). (11.64)

Combining the preceding two equations leads to

E[Y |x + 1] = E[Y |x]exp(β1 ). (11.65)

Hence, the change of the response for one-unit change in x is given by the factor
exp(β1 ). This implies that the exponent of the regression coefficient has an intuitive
interpretation for the Poisson model but not the regression coefficient itself. Let’s
use this model to make some predictions.

The results of these predictions, obtained with Listing 11.17, together with the data,
are shown in Fig. 11.11.
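A sketch of such predictions (assuming the fitted model res.poisson from above) is:

new.dat <- data.frame(ment = 0:80)                                   # grid of mentor publication counts
pred <- predict(res.poisson, newdata = new.dat, type = "response")   # E[Y|x] = exp(beta0 + beta1*x), cf. Eq. 11.60
head(cbind(new.dat, pred))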
Fig. 11.11 Results for predictions of the fitted Poisson model (purple line). The blue points correspond to the data. (x-axis: number of published articles in the last 3 years by the mentor; y-axis: response.)

To assess the goodness of fit, we use the deviance (see Listing 11.16). The deviance is a measure of how well the model fits the data. Since the deviance can be derived as the profile likelihood ratio test comparing the current model to the saturated model, likelihood theory would predict that (assuming the model is correctly specified) the deviance follows a chi-squared distribution, with degrees of freedom equal to the difference in the number of parameters.
Using R, the p-value for the deviance goodness-of-fit test is estimated as shown
in Listing 11.18.
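A minimal sketch of such a test, assuming the fitted model object is called fit.pois as above, compares the residual deviance with a chi-squared distribution with the residual degrees of freedom:

# Deviance goodness-of-fit test: upper-tail chi-squared p-value
pchisq(deviance(fit.pois), df = df.residual(fit.pois), lower.tail = FALSE)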

The null hypothesis for this test is: “Our model is correctly specified.” For a
significance level of α = 0.05, we have strong evidence to reject the null hypothesis
because the p-value is much smaller than α. In general, a high p-value indicates no
evidence of lack-of-fit, and a low p-value indicates evidence of lack-of-fit.
Having identified a problem with the quality of the fit (see Listing 11.18), the next
step is to find the reason behind this. One problem could be dispersion. In general,
dispersion quantifies how much higher (or lower) the observed variance is, relative
to the expected variance of the model. For a Poisson distribution, the expected value
is equal to the variance; hence, the dispersion should be 1. In other words, when
we fit a Poisson model, we expect the variance to increase with the mean value by a factor of 1. If the variance increases at a lower rate than the mean, then the data are underdispersed, and if it increases at a higher rate, then the data are overdispersed. This is tested in Listing 11.19.

The hypothesis of the overdispersion test can be formulated as follows:


Null hypothesis: “The true dispersion is not greater than 1.”
Alternative hypothesis: “The true dispersion is greater than 1.”
Given the very low p-value of 9.289e-09, we need to reject the null hypothesis.
This means that we found a reason for the bad quality of the fit.
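One way such an overdispersion test can be carried out is with dispersiontest() from the AER package; the following is a sketch and may differ from Listing 11.19 (the model object fit.pois is an assumption):

library(AER)   # provides dispersiontest() for Poisson GLMs

# H0: the true dispersion is not greater than 1; H1: it is greater than 1
dispersiontest(fit.pois, alternative = "greater")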
The preceding problem can be placed in a wider context. Instead of thinking of
testing the quality of a model, this can be seen as testing several models, among
which we try to find the best one. This setting is formalized by model selection.
For this reason, we stop our discussion here and continue in Sect. 12.5, where we
formally introduce model selection for GLMs.

11.6.4.4 Example: Logistic Regression

The next GLM we discuss is the logistic regression. This model is for binary data,
that is, the response can only assume two values; for example, “Yes” or “No.” The
model assumes that the response is sampled from a binomial distribution:

Y |x ∼ Binom(p(x)). (11.66)

The linear predictor is given by

η = β0 + β1 x. (11.67)

The link function is given by the logit function

g(p(x)) = logit(p(x)) = η (11.68)

with

logit(p) = log( p / (1 − p) ).    (11.69)

In Sect. 9.6, we showed that Eq. 11.68 can be solved for p(x), giving

p(x) = exp(β0 + β1 x) / (1 + exp(β0 + β1 x)) = 1 / (1 + exp(−(β0 + β1 x))),    (11.70)

which is the parameter of the binomial distribution (see Eq. 11.66).


In Listing 11.20, we present an example to illustrate how to specify a logistic
regression model using R.
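A hedged sketch of such a model specification, using simulated binary data (the data frame dat2 and all numerical values are assumptions, not the content of Listing 11.20):

# Sketch: logistic regression for a binary response with a logit link
set.seed(1)
dat2 <- data.frame(x = rnorm(100))
dat2$y <- rbinom(100, size = 1, prob = 1 / (1 + exp(-(0.5 + 1.5 * dat2$x))))

fit.logit <- glm(y ~ x, family = binomial(link = "logit"), data = dat2)
summary(fit.logit)

# Predicted probabilities p(x) on the response scale
head(predict(fit.logit, type = "response"))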

A numerical example of a logistic regression model was already presented and


discussed in Sect. 9.6 because the model performs a classification.
The logistic regression model is a good example to show that regression and
classification are not that different. In fact, it depends on the type of output variable,
and for a binomial distribution, a GLM can serve as a classifier.

11.7 Summary

In this chapter, we introduced OLS regression models and GLMs. Both frameworks
provide linear regression models. Despite the fact that the underlying idea of an
OLS regression model is easy to understand, we have seen that the diagnostic
of such models is not that simple, and it requires attention because the analyst
needs to make decisions. Furthermore, advanced topics of OLS regression, such as
interaction terms, nonlinearities, and categorical predictors, make the models more
involved and nontrivial. This is even more true for GLMs, which provide an elegant
extension of OLS regression models by allowing the response variable to have an
error distribution other than a normal distribution. Also, the link function permits
one to model different forms of nonlinearity.
Learning Outcome 11: Linear Regression Models

Linear regression models are supervised learning approaches that require


labeled data for the training. Here the labels provide numerical information
from the output variable(s). Linear regression models are a family of methods
that comprise many different regression models, including OLS and GLM.

Finally, we would like to highlight that GLMs have an elegant underlying


framework, allowing one to derive a multitude of different models with different
statistical properties. This observation is worth noting because it is different from
all the methods we’ve discussed so far in this book.

11.8 Exercises

1. For the data in Listing 11.12, estimate a nonlinear regression model. Specifically,
fit a quadratic regression model given by

y = β0 + β1 x1 + β2 x1².    (11.71)

Compare the results with the cubic regression model as discussed in Sect. 11.6.2.
2. Repeat the analysis but with more terms in the regression model. Specifically, fit
the model

y = β0 + β1 x1 + β2 x1² + β3 x1³.    (11.72)

Discuss the differences between the quadratic model in Eq. 11.71 and the cubic
model in Eq. 11.39.
3. Repeat the analysis but use a polynomial regression model of 6-th order given by

y = β0 + β1 x1 + β2 x1² + β3 x1³ + β4 x1⁴ + β5 x1⁵ + β6 x1⁶.    (11.73)

Is there an advantage or disadvantage to going to higher orders? A formal


approach to the problem is provided in Chap. 12 because this problem is related
to model selection.
4. Repeat the analysis shown in Listing 11.14 for the categorical variable “cyl.” But
this time, use 8 cylinder as the reference, and order the remaining levels in the
order 6 cylinder and 4 cylinder. How does one check that the order of the levels
is correct?
Chapter 12
Model Selection

12.1 Introduction

In this chapter, we discuss approaches for a problem called model selection. Model
selection is always needed when there are a number of candidate models that could
be used for a prediction task, but we want to choose the best one among them. For
instance, for a classification problem, we may consider an SVM or a decision tree.
Similarly, for a regression analysis, there may be different options for the number of
predictors of the model. In either case, one needs to decide which statistical model
to select from the available candidates. This is the topic of model selection.
There is a topic related to model selection called model assessment. Model
selection and model assessment are frequently confused despite the fact that each
topic focuses on a different goal. For this reason, we start our discussion about
model selection by clarifying the difference between it and model assessment.

12.2 Difference Between Model Selection and Model Assessment

If we lived in an ideal world with infinitely large data sets, the problem of model
selection would be an easy one. For each candidate model, one could estimate the
test error for infinitely large test data. Based on this, one could select the model with
the smallest error. Importantly, in this case the ideal test error would correspond
to the so-called expected generalization error [45, 368, 493], providing the most
complete information about the generalization abilities of a prediction model.
We discuss the important (but difficult) expected generalization error formally in
Chap. 18.
Overall, this means that in an ideal world with infinitely large data sets, the
expected generalization error could be used to assess and select models. However,


we do not live in such an ideal world. For this reason, the finite size of the data
enforces a trade-off between model selection and model assessment, which makes
both tasks nontrivial.
In reality, model selection and model assessment aim to deal with finite data,
requiring us to be more economical about data usage. While model selection
has its focus on selecting one model, model assessment aims to approximate the
expected generalization error. The underlying idea of model selection is to estimate
an auxiliary function that is different than the expected generalization error but
suffices to rank different models in a similar way as we would have done using
the expected generalization error, if it were known. This means that the measure
used for model selection just needs to result in the same ordering of the models
as if the generalization errors of the models had been used for the ranking. Hence,
model selection is actually a model ordering problem, and the best model is selected
without necessarily estimating the expected generalization error. This explains why
model assessment and model selection are in general two different approaches.
Briefly, we can state the goals of model selection and model assessment as
follows:
Model selection: Estimate a performance criterion of different models in
order to choose the best model.
Model assessment: For the best model resulting from model selection,
estimate its expected generalization error.
In the following, we outline the underlying idea of model selection in more detail.

12.3 General Approach to Model Selection

There are two schools of thought for model selection, and they differ in how the
“best model” is defined. The first defines the best model as the “best prediction
model,” and the second defines it as the “true model” that generated the data
[112, 172, 173]. For this reason, the latter is also referred to as model identification.
The first definition fits seamlessly with our preceding discussion about the expected
generalization error. In contrast, the second one assumes that the true model also
has the best expected generalization error. For very large sample sizes — that is,
ntrain → ∞ — this is uncontroversial. However, for finite sample sizes, as is always
the situation in practice, this is not necessarily the case.
In Fig. 12.1, we visualize the theoretical problem of model selection. In
Fig. 12.1a, we show three model families, g1 , g2 , and g3 , indicated by the three
curves colored in blue, red, and green. Each of these model families corresponds
to a statistical model; for example, a linear regression model with p1 , p2 , and
p3 covariates. Each point along these curves corresponds to a particular model
obtained by estimating the parameters of the models using a training data set D.
Here, g1 (x, β̂1 (D)), g2 (x, β̂2 (D)), and g3 (x, β̂3 (D)) are three such models that
have been specified by the training data. For a given instance, x, each of these
models makes a prediction ŷi = gi (x, β̂i (D)) (with i ∈ {1, . . . , 3}).

Fig. 12.1 Visualization of the theoretical problem of model selection. (a): Three model families, g1, g2, and g3, are shown, as are the estimates of three specific models obtained using the training data; the true (population) model, its realization (sample model), and the associated bias, variance, and noise are indicated. (b): A summary combining model selection and model assessment, emphasizing that different data sets (training, validation, and testing data) are used for the different analysis steps (model selection, Eval; model assessment, Etest).

After the parameters of the three models have been estimated, one performs a
model selection to identify the best model according to a chosen criterion. For
this, a validation data set is used. The best model should be “closest” to the true
model (shown in brown). An additional problem is that the true model as well
as the estimated best model has a variability due to the finite size of the data and
noise; for example, due to measurement errors. This is indicated by the light brown
and light blue circles in Fig. 12.1a. As a consequence, the true model may appear
differently, as indicated by the “realization” of the true model. Similarly, there is not
just one best model but different realizations, all of which fall within the light blue
circle in Fig. 12.1a. Formally, the models at the center of these circles correspond
to population models, whereas each model within the large circles corresponds to a
sample model. Finally, one performs a model assessment of the best model using a
test data set.
In Fig. 12.1b, a summary of the entire model selection process is shown. Here,
we emphasize that different data (training data, validation data, and testing data) are
used for the different analysis steps. Assuming an ideal (very large) data set D, there
are no problems with the practical realization of these steps. However, practically,
we have no ideal data set but rather one with a finite sample size. For this reason,
there are various approximations to realize model selection practically.

In the following, we discuss three different model selection methods. The first
one is for multiple linear regression models, the second one is for Bayesian models,
and the third one can be used for general models.

12.4 Model Selection for Multiple Linear Regression Models

The methods we discuss in this section, except the AIC and BIC, are specifically for
multiple linear regression models. This means that for other types of models, they
cannot be used.

12.4.1 R 2 and Adjusted R 2

The first measure we discuss is called the coefficient of determination (COD) [121,
509]. The COD is defined as

R² = SSR / SST = 1 − SSE / SST.    (12.1)
This definition is based on sum of squares of regression (SSR), sum of squares total (SST), and sum of squares error (SSE), given as follows:
This definition is based on sum of squares of regression (SSR), sum of squares total
(SST), and sum of squares error (SSE), given as follows:


SST = TSS = Σ_{i=1}^{n} (y_i − ȳ)² = ‖Y − Ȳ‖²;    (12.2)

SSR = ESS = Σ_{i=1}^{n} (ŷ_i − ȳ)² = ‖Ŷ − Ȳ‖²;    (12.3)

SSE = RSS = Σ_{i=1}^{n} (ŷ_i − y_i)² = Σ_{i=1}^{n} e_i² = ‖Ŷ − Y‖².    (12.4)

Here, ȳ = (1/n) Σ_{i=1}^{n} y_i is the mean value of the response variable, and e_i = ŷ_i − y_i are the residuals. For reasons of completeness, we would like to mention that in the literature SST is also called TSS (total sum of squares), SSR is also called ESS (explained sum of squares), and SSE is also called RSS (residual sum of squares).
The COD is a measure of how well the model explains the variance of the response variable. A disadvantage of R² is that it never decreases when predictors are added; hence, any submodel of a full model always has a smaller (or equal) value, regardless of the quality of the added predictors. For this reason, a modified version of R² has been introduced, called the adjusted coefficient of determination (ACOD).
The ACOD is defined as

R²_adj = 1 − [ SSE (n − 1) ] / [ SST (n − p) ].    (12.5)

It can also be written as a function of R 2 , as follows:

R²_adj = 1 − [ (n − 1) / (n − p − 1) ] (1 − R²).    (12.6)

The ACOD adjusts for sample size n of the training data and the model complexity,
defined by the number of covariates p.
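Both quantities are reported by summary() for a fitted linear model; the following is a small sketch with simulated data (all names and numbers are assumptions):

# Sketch: R^2 and adjusted R^2 for a multiple linear regression
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)       # x2 is irrelevant by construction

fit <- lm(y ~ x1 + x2)
summary(fit)$r.squared            # coefficient of determination (COD)
summary(fit)$adj.r.squared        # adjusted COD, penalizing the extra covariate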

12.4.2 Mallow’s Cp Statistic

For a general model, by using in-sample data {(x_i, y_i)} for training and out-sample data {(x_i′, y_i′)} for testing, one can show that

E[ (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ] < E[ (1/n) Σ_{i=1}^{n} (y_i′ − ŷ_i′)² ].    (12.7)

Furthermore, if the model is linear, having p predictors and an intercept, one can
show that

E[ (1/n) Σ_{i=1}^{n} (y_i′ − ŷ_i′)² ] = E[ (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ] + (2/n) σ² (p + 1).    (12.8)

The last term in Eq. 12.8 is called the optimism because it is the amount by which
the in-sample error underestimates the out-sample error. Hence, a large value for the
optimism indicates a large discrepancy between both errors. It is interesting to note
the following:
1. The optimism increases with σ 2 .
2. The optimism increases with p.
3. The optimism decreases with n.
The preceding relationship can be explained as follows:
1. Adding more noise (indicated by increasing σ 2 ) and leaving n and p fixed makes
it harder for a model to be learned.
2. Increasing the complexity of the model (indicated by increasing p) and leaving
σ 2 and n fixed makes it easier for a model to fit the test data but makes it prone
to overfitting.
3. Increasing the test data set (indicated by increasing n) and leaving σ 2 and p fixed
reduces the chances of overfitting.

The problem with Eq. 12.8 is that σ 2 corresponds to the true value of the noise,
which is unknown. For this reason, one needs to use an estimator to obtain a
reasonable approximation. One can show that estimating σ̂² from the largest (i.e., the least parsimonious) model yields an unbiased estimator of σ² if the true model is smaller (i.e., more parsimonious).
Using this estimate of σ² leads to Mallow’s Cp statistic [196, 536], given by

Cp = E[ (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ] + (2/n) σ̂² (p + 1).    (12.9)

Alternatively, we can write Eq. 12.9 as follows:

Cp = MSE + (2/n) σ̂² (p + 1).    (12.10)
For model selection, one chooses the model that minimizes Cp. Mallow’s Cp is
only used for linear regression models that are evaluated with the squared error.
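A hedged sketch of this computation, in which σ̂² is estimated from the largest model and Cp is evaluated for a smaller candidate model (all object names and data are assumptions):

# Sketch: Mallow's Cp for a candidate model
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit.full <- lm(y ~ x1 + x2 + x3)           # largest (least parsimonious) model
fit.cand <- lm(y ~ x1 + x2)                # candidate model with p = 2 predictors

sigma2.hat <- summary(fit.full)$sigma^2    # estimate of the noise variance
p   <- 2
MSE <- mean(residuals(fit.cand)^2)
Cp  <- MSE + (2 / n) * sigma2.hat * (p + 1)
Cp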

12.4.3 Akaike’s Information Criterion (AIC) and Schwarz’s BIC

The next two model selection criteria have similarity to Eq. 12.10.
Akaike’s Information Criterion (AIC) The Akaike’s information criterion (AIC)
[5, 65, 458] for a model M is defined by
 
AIC(M) = −2 log LM + 2dim(M). (12.11)

Here, LM is the likelihood of the model M evaluated at the maximum likelihood


estimate, and dim(M) is the dimension of the model corresponding to the number
of free parameters. As with Mallow’s Cp, the Akaike’s information criterion selects the model that minimizes AIC(M).
For a linear model, one can show that the log likelihood is given by

log L_M = −(n/2) log(MSE) + C,    (12.12)

where C is a model independent constant and the dimension of the model is

dim(M) = p + 2. (12.13)

Taken together, this gives


 
AIC(M) = n log(MSE) + 2p + C′,    (12.14)

with C′ = −2C + 4. For model comparisons, the constant C′ is irrelevant.

The BIC (Bayesian Information Criterion) BIC [359, 428], also called the
Schwarz criterion, has a similar form as the AIC. The BIC is defined by

BIC(M) = −2 log(L_M) + p log(n).    (12.15)

For a linear model with normally distributed errors, this simplifies to

BIC(M) = n log(MSE) + p log(n).    (12.16)

The BIC selects the model that minimizes BIC(M).


The idea common to both AIC and BIC is to penalize larger models. Since
log(n) > 2 for n ≥ 8, the BIC penalizes more strongly than does the AIC (usually
data sets have more than 8 samples). Hence, the BIC selects smaller models (i.e.,
is more parsimonious) than the AIC. The BIC has a consistency property, meaning
that when the true unknown model is one of the models under consideration and the
sample size, n, tends to infinity, the BIC selects the correct model. In contrast, the
AIC does not have this consistency property.
In general, the AIC and the BIC are considered to have different views on model
selection [4]. The BIC assumes that the true model is among the studied ones, and
its goal is to identify it. In contrast, the AIC does not assume this; instead, the goal of
the AIC is to find the model that maximizes the predictive accuracy. In practice, the
true model is rarely among the model families studied, and for this reason the BIC
cannot select the true model. For such a case, the AIC is the appropriate approach
for finding the best approximating model. Several studies preferred the AIC over the
BIC for practical applications [4, 65, 359]. For instance, in [489], it was found that
the AIC can select a better model compared to the BIC, even when the true model is
among the studied models. Specifically, for regression models, in [513] it has been
demonstrated that the AIC is asymptotically efficient for selecting the model with
the least MSE (mean squared error), while the BIC is not when the true model is not
among the studied models.
In summary, the AIC and the BIC have the following characteristics:
• The BIC selects smaller models (more parsimonious) compared to the AIC and
therefore is more prone to underfitting.
• The AIC selects larger models (less parsimonious) compared to the BIC and
therefore is more prone to overfitting.
• The AIC represents a frequentist point of view.
• The BIC represents a Bayesian point of view.
• The AIC is asymptotically efficient but not consistent.
• The BIC is consistent but not asymptotically efficient.
• The AIC should be used when the objective is the prediction accuracy of a model.
• The BIC should be used when the objective is the model interpretability.

The AIC and the BIC are generic in their applications and not limited to linear
models, and they can be applied whenever we have a likelihood of a model [297].
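In R, both criteria can be obtained for any fitted model that provides a likelihood; the following is a small self-contained sketch (the data and model names are assumptions):

# Sketch: comparing two nested linear models with AIC and BIC
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

m1 <- lm(y ~ x1)          # smaller model
m2 <- lm(y ~ x1 + x2)     # larger model

AIC(m1, m2)   # smaller value preferred
BIC(m1, m2)   # smaller value preferred; stronger penalty on model size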

12.4.4 Best Subset Selection

So far, we have discussed evaluation criteria that can be used for model selection.
However, we did not discuss how these criteria are actually used. In the following,
we provide this information, discussing best subset selection (Algorithm 7), forward
stepwise selection (Algorithm 8), and backward stepwise selection (Algorithm 9)
[30, 105, 121]. All of these approaches are computational.
The most brute-force model selection strategy is to evaluate each possible model.
This is the idea behind best subset selection (Best).

Best subset selection evaluates each model with k parameters by the MSE or R 2 .
Because each of these models has the same complexity (a model with k parameters),
measures based on the model complexity are not needed. However, when comparing
p + 1 different models, which have different parameters (see line 5 in Algorithm 7),
a complexity-penalizing measure such as the Cp , AIC, or BIC needs to be used.
For a linear regression model, one needs to fit all combinations with p predictors.
A problem with the best subset selection is that in total, one needs to evaluate Σ_{k=0}^{p} (p choose k) = 2^p different models. For p = 20, this is already over 10^6 models, leading to computational problems in practice. For this reason, approximations of the best subset selection are needed.
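For linear regression, best subset selection can, for moderate p, be carried out with the regsubsets() function from the leaps package; the following sketch uses assumed simulated data and is not part of the book’s listings:

library(leaps)   # provides regsubsets() for (best) subset selection

# Sketch: exhaustive search over all subsets of up to 5 predictors
set.seed(1)
df <- data.frame(matrix(rnorm(100 * 5), ncol = 5))
names(df) <- paste0("x", 1:5)
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(100)

best <- regsubsets(y ~ ., data = df, nvmax = 5, method = "exhaustive")
s    <- summary(best)

# Step 1 fixes the best model of each size (via RSS); step 2 compares sizes:
which.min(s$cp)                    # size minimizing Mallow's Cp
which.min(s$bic)                   # size minimizing the BIC
coef(best, which.min(s$bic))       # coefficients of the selected model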

12.4.5 Stepwise Selection

Two such approximations are discussed in the following. Both of these follow a
greedy approach, where the forward stepwise selection proceeds in a bottom-up
manner, while the backward stepwise selection proceeds in a top-down manner.

12.4.5.1 Forward Stepwise Selection

The idea of the forward stepwise selection (FSS) is to start with a null model,
without parameters, and add successively one parameter at a time that is best
according to a selection criterion.
For a linear regression model with p predictors, this gives


1 + Σ_{k=0}^{p−1} (p − k) = 1 + p(p + 1)/2    (12.17)

models. For p = 20, this gives only 211 different models that need to be evaluated.
This is a great improvement over best subset selection.

12.4.5.2 Backward Stepwise Selection

The idea of backward stepwise selection (BSS) is to start with a full model with p
parameters and remove successively one parameter at a time that is worst according
to a selection criterion.

The number of models that need to be evaluated using the backward stepwise
selection is exactly the same as for the forward stepwise selection.
Neither stepwise selection strategy is guaranteed to find the best model contain-
ing a subset of the p predictors. However, when p is large, these approaches may be

the only ones practically feasible. Despite the apparent symmetry of the forward
stepwise selection and the backward stepwise selection, there is a difference in
situations where p > n — that is, where we have more parameters than samples
in our data. In this case, the forward stepwise selection approach can still be applied
because the procedure may be systematically limited to n parameters.
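Both strategies can be sketched with the step() function from base R, which adds or removes one term at a time based on the AIC (the data frame df and its columns are assumptions):

# Sketch: forward and backward stepwise selection based on the AIC
set.seed(1)
df <- data.frame(matrix(rnorm(100 * 5), ncol = 5))
names(df) <- paste0("x", 1:5)
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(100)

null.model <- lm(y ~ 1, data = df)     # intercept-only model
full.model <- lm(y ~ ., data = df)     # model with all predictors

# Forward: start from the null model and add one predictor at a time
fwd <- step(null.model, scope = formula(full.model), direction = "forward", trace = 0)

# Backward: start from the full model and remove one predictor at a time
bwd <- step(full.model, direction = "backward", trace = 0)

coef(fwd)
coef(bwd)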

12.5 Model Selection for Generalized Linear Models

In Sect. 11.6.4, we introduced generalized linear models (GLMs) as an extension


of OLS regression models. Now, we continue our discussion of the Poisson model
in the context of model selection. Specifically, the results in Sect. 11.6.4.3 showed
that the Poisson regression model did not provide a good fit for the studied data.
This was indicated by the deviance goodness-of-fit test. In the following section, we
continue this discussion by investigating a number of possible GLMs that could be
used instead of a Poisson regression model.

12.5.1 Negative Binomial Regression Model

A possible issue with a Poisson distribution is that it assumes that the mean and
the variance are the same. However, sometimes more flexibility is needed when the
variance is greater than the mean, which is called overdispersion. In such a situation,
a negative binomial regression model can be used because the negative binomial
distribution has one parameter more than the Poisson distribution, which allows one
to adjust the variance independent of the mean.
Specifically, a negative binomial regression model assumes that the variance is a
quadratic function of the mean:

E[Y |x] = μ, (12.18)


Var[Y|x] = μ + μ²φ.    (12.19)

Here, φ is the dispersion parameter.


In Listing 12.1, we show a negative binomial regression with log as a link
function for the same data used in Sect. 11.6.4.3. The covariate “ment” corresponds
to the articles published by the mentor in the last three years, and the covariate “art”
is the number of articles published during the Ph.D.
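A hedged sketch of such a fit with glm.nb() from the MASS package; here the bioChemists data from the pscl package (which contain the variables art and ment) are assumed as an example and may differ from the data object used in Listing 12.1:

library(MASS)                           # provides glm.nb() for negative binomial regression
data("bioChemists", package = "pscl")   # assumed example data with art and ment

# Negative binomial regression with a log link
fit.nb <- glm.nb(art ~ ment, data = bioChemists, link = log)
summary(fit.nb)    # includes the estimated dispersion parameter (theta)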

Let’s perform some tests to check the quality of the preceding fit. In Listing 12.2,
we show the deviance goodness-of-fit test, and in Listing 12.3 the dispersion test.

The null hypothesis for deviance goodness-of-fit is: “Our model is correctly
specified.” Given that the resulting p-value is around 0.05 (see Listing 12.2), we
do not have strong evidence to reject the null hypothesis. In general, a high p-value
indicates no evidence of lack-of-fit, and a low p-value indicates evidence of lack-
of-fit.

The null hypothesis for the dispersion test is: “There is no difference in
dispersion.” By specifying the alternative hypothesis as “two.sided,” the dispersion
can correspond to an over- or underdispersion. The parameter “less” is for underdis-
persion, whereas the parameter “greater” is for overdispersion. As one can see from
Listing 12.3, the p-value does not allow one to reject the null hypothesis.
These results look better than for the Poisson model in Sect. 11.6.4.3.

12.5.2 Zero-Inflated Poisson Model

Another possible problem with a Poisson regression model that can also lead to
overdispersion is the presence of too many zero-count observations. This can be
modeled with a zero-inflated Poisson model, given by

f(y|x) = π + (1 − π) exp(−λ)          if y = 0,
       = (1 − π) λ^y exp(−λ) / y!     if y > 0.    (12.20)

From this model, one can obtain the mean and the variance as follows:

E[Y |x] = (1 − π )λ, (12.21)


Var[Y|x] = (1 − π)(λ + πλ²),    (12.22)

where π is the probability that an observation is zero.


From Fig. 11.10, one can see that the number of zeros in our data is indeed quite
large (there are 275 zeros).
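Such a model can be fitted with zeroinfl() from the pscl package; the following sketch again assumes bioChemists-like data and is not the book’s listing:

library(pscl)    # provides zeroinfl(); bioChemists is an assumed example data set
data("bioChemists", package = "pscl")

sum(bioChemists$art == 0)    # number of zero counts in the data

# Zero-inflated Poisson: count part and zero-inflation part are separated by '|'
fit.zip <- zeroinfl(art ~ ment | ment, data = bioChemists, dist = "poisson")
summary(fit.zip)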

12.5.3 Quasi-Poisson Model

For a Quasi-Poisson regression, the variance is assumed to be a linear function of


the mean. The model fits an extra dispersion parameter to account for that extra
variance:

E[Y |x] = μ, (12.23)


Var[Y |x] = μφ. (12.24)

Here, φ is the dispersion parameter.


From Listing 12.5, we can see that the residual deviance does not show much
improvement upon the Poisson regression model. Furthermore, we would like to
note that the AIC does not return a value, because there exists no maximum
likelihood for a Quasi-Poisson model.
On a technical note, we would like to add that the dispersion parameter, as shown
in the output in Listing 12.5, can be obtained via the following command in Listing
12.6.
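A hedged sketch of a quasi-Poisson fit and of how the dispersion parameter could be extracted (assuming bioChemists-like data; the book’s own code is given in Listings 12.5 and 12.6):

# Sketch: quasi-Poisson regression with an extra dispersion parameter phi
data("bioChemists", package = "pscl")
fit.qp <- glm(art ~ ment, family = quasipoisson(link = "log"), data = bioChemists)
summary(fit.qp)

# Estimated dispersion parameter phi, as reported in the model summary
summary(fit.qp)$dispersion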

Overall, one can see that the result is similar to that of the Poisson regression model (see Sect. 11.6.4.3); hence, this does not provide an alternative for the studied data.

12.5.4 Comparison of GLMs

While in the previous section we discussed several GLMs individually, in this


section we discuss a test that allows one to compare pairs of GLMs.
To compare GLM pairs, one can use the Vuong test. This test is a likelihood ratio
test, which can be used for both nested and non-nested models [490]. Nested models
just differ in the covariates used for a fit, and one model is fully embedded in the
larger model. In contrast, non-nested models can assume general forms, and they
cannot be reduced into each other [126].
A Vuong test is similar to a traditional likelihood ratio test (LRT), but differs
in the sampling distribution. While an LRT uses a chi-square null distribution, the

Vuong LRT uses a weighted sum of chi-square distributions. The latter distribution
converges to the traditional chi-square distribution when the full model is the true
model (for nested models).
From Listing 12.7, one can see that there are actually two different hypothesis
tests shown: variance test and non-nested likelihood ratio test. The reason for this is
that it is often difficult to see whether non-nested models are overlapping. The null
hypothesis of these tests are as follows:
• H0 for the variance test: “Model 1 and model 2 are indistinguishable.”
• H0 for the non-nested likelihood ratio test: “Model fits of both models are equal.”

For a significance level α = 0.05, both null hypotheses need to be rejected,


although the p-value for the variance test is much larger than that for the non-nested
likelihood ratio test. This is understandable because the Poisson regression model
can be considered a special case of the negative binomial model. The comparison of
the other pairs of GLMs is left as an exercise.
One can also use Vuong’s theory to obtain confidence intervals of the AIC and the
BIC for non-nested models. The results of this analysis are shown in Listing 12.8.
Here, it is important to note that the 95% confidence intervals of both the AIC
differences and the BIC differences do not overlap with zero. This means that AIC1
(for model 1) is always larger than AIC2 (for model 2), and similarly for the BICs.
Hence, model 1 can be considered a better fit than model 2.
Finally, we can compare more than two GLMs with each other based on AIC
or BIC, as shown in Listing 12.9. This example compares the Poisson regression
model, the negative binomial model, and the zero-inflated Poisson model.
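A hedged sketch of how such comparisons could be carried out, using the vuongtest() and icci() functions from the nonnest2 package and the generic AIC()/BIC() functions (the data and model objects are assumptions and may differ from Listings 12.7-12.9):

library(MASS); library(pscl); library(nonnest2)
data("bioChemists", package = "pscl")

m.pois <- glm(art ~ ment, family = poisson(link = "log"), data = bioChemists)
m.nb   <- glm.nb(art ~ ment, data = bioChemists)
m.zip  <- zeroinfl(art ~ ment | ment, data = bioChemists, dist = "poisson")

# Vuong test (variance test and non-nested likelihood ratio test) for one pair
vuongtest(m.nb, m.pois)

# Confidence intervals for the AIC and BIC differences (Vuong's theory)
icci(m.nb, m.pois)

# Comparing several GLMs at once via AIC and BIC
AIC(m.pois, m.nb, m.zip)
BIC(m.pois, m.nb, m.zip)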

12.6 Model Selection for Bayesian Models

In this section, we discuss model selection for yet another specific class of models
— namely, Bayesian models. This means that the method discussed in this section
can be used for probabilistic models, which allow the estimation of a posterior
distribution.
The model selection criterion we discuss in the following is called the Bayes’
factor [268, 277, 300, 348]. Suppose that we have a finite set of candidate models
{M_i} with i ∈ {1, . . . , M} that we can use for fitting the data D. To select the best
model from a Bayesian perspective, we need to evaluate the posterior probability of
each model, given by

P (Mi |D), (12.25)



for the available data. Using Bayes’ theorem, one can write this probability as

P(M_i | D) = p(D | M_i) P(M_i) / Σ_{j=1}^{M} p(D | M_j) P(M_j).    (12.26)

Here, the term p(D|Mi ) is called the evidence for the model Mi , or simply
evidence.
The ratio of the posterior probabilities for models MA and MB , corresponding
to the posterior odds of the models, is given by

P(M_A | D) / P(M_B | D) = [ p(D | M_A) P(M_A) ] / [ p(D | M_B) P(M_B) ] = [ p(D | M_A) / p(D | M_B) ] × prior odds = BF_AB × prior odds.

That means the Bayes’ factor of the models is the ratio of the posterior odds and the prior odds:

BF_AB = posterior odds / prior odds = [ P(M_A | D) / P(M_B | D) ] / [ P(M_A) / P(M_B) ].    (12.27)

For non-informative priors, that is, P(M_A) = P(M_B) = 0.5, the Bayes’ factor simplifies to

BF_AB = posterior odds = P(M_A | D) / P(M_B | D).    (12.28)

To see practical problems when using the Bayes’ factor, let’s assume that a model
Mi depends on the parameter θ . Then, the evidence can be written as

p(D | M_i) = ∫ p(D | θ, M_i) p(θ | M_i) dθ.    (12.29)

A serious problem with this expression is that it can be very hard to evaluate
numerically, especially in high dimensions, where no closed-form solution is
available. This is a general problem of the Bayes’ factor, which makes its application
problematic.
It is interesting to note that there is a close connection between the BIC and the
Bayes’ factor. Specifically, in [277] it has been proven that for n → ∞ the following
holds:

2 ln BFBA ≈ BICA − BICB . (12.30)

Note that ln BF_AB = − ln BF_BA; that is, the relation is antisymmetric. Equation 12.30 implies that model comparison results for the BIC and the Bayes’ factor can approximate each other.
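This relation also suggests a simple numerical approximation of the Bayes’ factor from the BIC values of two fitted models; the following is a small sketch (model names and data are assumptions):

# Sketch: approximate Bayes' factor BF_BA from the BIC difference (Eq. 12.30)
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

mA <- lm(y ~ 1)     # model A: intercept only
mB <- lm(y ~ x)     # model B: includes the predictor

# 2 * ln(BF_BA) ~ BIC_A - BIC_B  =>  BF_BA ~ exp((BIC_A - BIC_B) / 2)
BF.BA <- exp((BIC(mA) - BIC(mB)) / 2)
BF.BA    # large values favor model B over model A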

Table 12.1 Interpretation of the comparison of models with the BIC and Bayes’ factor.

Evidence       ΔBIC = BIC_k − BIC_min    BF_min,k
Weak           0−2                       1−3
Positive       2−6                       3−20
Strong         6−10                      20−150
Very strong    > 10                      > 150

Fig. 12.2 Visualization of resampling for model selection (MS) and model assessment (MA). The original data are randomized once and split; for each of the k splits, the training data are used for fitting, the validation data give Eval(i|m) for every model m ∈ {1, . . . , M} (model selection), and only the optimal model, mopt, is evaluated using the testing data, giving Etest(i|mopt) (model assessment).

For practical applications and for interpreting the BIC and Bayes’ factors, in
[399] the following evaluation for comparing two models (see Table 12.1) was
suggested. Here, “min” indicates the model with the smaller BIC or posterior
probability.

12.7 Nonparametric Model Selection for General Models with Resampling

In contrast with the previous sections, the nonparametric approach discussed here
can be used for general statistical models, including multiple linear regression
models and Bayesian models. This approach is based on the resampling of data,
forming an extension of the k-fold CV discussed in Sect. 4.2.3. Due to the data-based
nature of the resampling method, this allows one to perform model assessment in
addition to model selection.

The idea of the resampling method is shown in Fig. 12.2. To perform model
selection (MS) and model assessment (MA), the data are split into three parts. One
part of the data is used for training, one for validation, and one for testing. The splits
of the data are obtained by randomization of the original data. These data are then
copied and split, as shown in Fig. 12.2. Specifically, the separation is similar to a
k-fold CV with one additional fold. In Fig. 12.2, the example shows a fivefold CV.
When discussing general resampling methods in Chap. 4, there was no model
selection involved because we assumed that only one model was being studied.
Hence, only training data and testing data were needed. This corresponded to model
assessment. However, to perform both model selection and model assessment, one
needs three different data sets. Importantly, in Fig. 12.2 one can see that the data
for both stages, MS and MA, are non-overlapping. That means no data point in the
testing data is ever used for training and validation and vice versa.
The example shown in Fig. 12.2 visualizes the usage of a fivefold CV; however,
any other resampling method can be used for model selection and model assessment
in a similar way. That means the MS part needs to be adopted as described for the
resampling methods in Chap. 4, while the MA part uses a separate proportion of the
data (similar to the holdout set method; see Sect. 4.2.1).
Formally, model selection operates as follows. For each split i (i ∈ {1, . . . , k}),
the parameters of model m (m ∈ {1, . . . , M}) are estimated using the training data,
and the prediction error is evaluated using the validation data; in other words,

Eval (i|m). (12.31)

After the last split, the errors are summarized by

Eval(m) = (1/k) Σ_{i=1}^{k} Eval(i|m).    (12.32)

This gives estimates of the prediction error for each model m. The best model can
then be selected as

mopt = argmin_m { Eval(m) }.    (12.33)

Using the optimal model, given by mopt , one now uses the test data to estimate the
generalization error for each split, as follows:

Etest (i|mopt ). (12.34)

It is important to note that this is only performed for the optimal model. These
results can be summarized as

Etest(mopt) = (1/k) Σ_{i=1}^{k} Etest(i|mopt).    (12.35)

In Chap. 18, we will formally define what we mean by the expected general-
ization error and discuss why this entity can only be approximated in reality; for
example, by means of resampling methods or model assessment.
Overall, a nonparametric model selection method is very flexible, but requires a
sufficiently large data set in order to split the data into k+1 folds, so as to perform the
MS and MA parts. This is the only assumption this method is based on. Given that
we are living in the big data era, for most contemporary data sets this should not be a
problem. However, a potential problem is the computational burden. Therefore, this
assumes that computational resources are available to execute an analysis multiple
times.
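The following is a minimal sketch of this procedure for two competing linear models, using a k-fold split for model selection and one separate holdout part for model assessment; all names, the simulated data, and the MSE criterion are assumptions, and the sketch simplifies Fig. 12.2 (where the test error is also averaged over splits):

# Sketch: k-fold CV for model selection (MS) plus a holdout for model assessment (MA)
set.seed(1)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
dat <- data.frame(y, x1, x2)

idx.test <- sample(n, n / 6)          # holdout used only for model assessment
dat.test <- dat[idx.test, ]
dat.ms   <- dat[-idx.test, ]          # data used for model selection

models <- list(m1 = y ~ x1, m2 = y ~ x1 + x2)
k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(dat.ms)))

# E_val(m): average validation error of each candidate model over the k splits
E.val <- sapply(models, function(f) {
  mean(sapply(1:k, function(i) {
    fit <- lm(f, data = dat.ms[folds != i, ])
    mean((dat.ms$y[folds == i] - predict(fit, newdata = dat.ms[folds == i, ]))^2)
  }))
})

m.opt <- names(which.min(E.val))      # model selection (Eq. 12.33)

# Model assessment: error of the selected model on the untouched test data
fit.opt <- lm(models[[m.opt]], data = dat.ms)
E.test  <- mean((dat.test$y - predict(fit.opt, newdata = dat.test))^2)
c(E.val, E.test = E.test)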

12.8 Summary

In this chapter, we discussed two types of model selection methods. The first type
were parametric, and the second type were nonparametric. The parametric methods,
for example, R 2 , Mallow’s Cp statistic, or AIC, are elegant and simple in their
application because models are selected based on an analytical expression. The
price to be paid for this elegance comes in the form of assumptions that need
to be made, which limit the application of these methods to specific statistical
models. For instance, R 2 and Mallow’s Cp statistic can only be used for multiple
linear regression models, and Bayes’ factors for Bayesian models. In contrast,
nonparametric resampling methods for model selection are non-elegant, without
any analytical expressions. However, the numerical nature of this approach is very
flexible and applicable to any kind of statistical model regardless of its nature.
In Fig. 12.3, we summarize the different model selection approaches. We high-
light two important characteristics of such methods. The first characteristic dis-
tinguishes methods in terms of data splitting, and the second does so in terms of
model complexity. Neither the best subset selection (Best) nor the forward stepwise
selection (FSS) nor the backward stepwise selection (BSS) applies data splitting,
but they use the entire data for evaluation. Furthermore, each of these approaches is
a two-step procedure that employs in its first step a measure that does not consider
the model complexity. For instance, in this step either the MSE or R 2 is used. Then,
in the second step a measure considering model complexity is used; for example,
the AIC, the BIC, or Cp .
Data splitting is typically based on resampling of the data, and in this chapter
we used cross-validation. Interestingly, CV can be used with or without model
complexity measures. For instance, regularized regression models, such as ridge
regression, LASSO, or elastic net, consider the complexity by varying the value of
λ (regularization parameter). Regularization is the topic of Chap. 13.

Fig. 12.3 Summary of different model selection approaches, organized by whether they use data splitting (no: Best, FSS, BSS; yes: CV) and whether the evaluation measure accounts for model complexity (no: MSE; yes: AIC, regularization). Here, the AIC stands for any criterion considering model complexity, e.g., the BIC or Cp, and regularization is any regularized regression model, e.g., LASSO or elastic net.

For model selection, cross-validation (CV) is the most practical and flexible
approach one can use [16, 182, 453]. It is conceptually simple, it is intuitive,
and it can be applied to any statistical model family regardless of its technical
details (for instance, to parametric and nonparametric models). Compared with other
approaches for model selection, cross-validation has the following advantages:
• Cross-validation is a computational method that is simple in its realization.
• Cross-validation makes few assumptions about the true underlying model.
• Compared with the AIC, the BIC, and the adjusted R 2 , cross-validation provides
a direct estimate of the prediction error.
• Every data point is used for both training and testing.

Learning Outcome 12: Model Selection

For finite data, model selection aims to rank candidate models and obtain the
same order as when using infinite data to estimate the expected generalization
error.

In summary, cross-validation, the AIC, and Cp all have the same goal — to find
a model that predicts best — and they all tend to choose similar models. However,
the BIC is quite different and tends to choose smaller models. The goal of the
BIC is also different because it tries to identify the true model. In general, smaller
models are easier to interpret, providing a better understanding of the underlying
process. Overall, cross-validation is the most general approach, and it can be used
for parametric as well as nonparametric models.

12.9 Exercises

1. What are the differences between best subset selection and stepwise selection?
2. What are the main differences between the AIC and the BIC, including their
respective strengths and weaknesses?
3. Discuss the definition of the Bayes’ factor for informative priors.
4. Why is the expected generalization error important in selecting the best model?
Hint: See the discussion of the expected generalization error in Chap. 18.
5. Select a prediction model for an application that does not allow one to use a
parametric criterion — for example, Mallow’s Cp statistic or AIC — and outline
the nonparametric approach that can be used in this case.
6. Compare the results from a Poisson regression model with those of a Zero-
inflated Poisson regression model using the Vuong test.
Part III
Advanced Topics
Chapter 13
Regularization

13.1 Introduction

In this chapter, we discuss extensions of the regression models introduced in


Chap. 11. Despite the success and importance of ordinary least squares (OLS)
regression models, high-dimensional data require a modification of this framework
in order to be to dealt with efficiently. For this reason, in recent years new regression
models have been introduced that extend classical regression models significantly.
These models are based on regularization, which is a concept introduced by
Tikhonov to deal with ill-posed inverse problems [43, 95, 468]. We will see
that depending on the mathematical formulation of the regularization, different
regression models can be derived. Perhaps the most prominent of these is the least
absolute shrinkage and selection operator (LASSO) model.
The concept underlying regularization is to modify the optimization function of
a (regression) model by using an additional parameterized term. This framework
is very general and not limited to regression models, and it can be applied to any
model optimization function; for example, for deep neural networks. In the context
of regression models, regularization can lead to two effects. First, the absolute value
of regression coefficients can be reduced. Second, some regression coefficients can
even be shrunk to zero. While both effects will generally improve the prediction
abilities and reduce overfitting of the model, only the latter performs a type of
model selection on the number of parameters of the regression model. Interestingly,
the shrinkage of regression coefficients is performed automatically by the model
and does not require manual intervention. This makes it a very desirable feature,
explaining part of the success of models like LASSO.
In this chapter, we discuss a number of different regularized regression models,
including ridge regression, non-negative garrote, LASSO, Dantzig selector, adaptive
LASSO, elastic net, and group LASSO [447, 467]. We will see that each one has
its own merits and is not necessarily making other models obsolete. Instead, the
selection needs to be guided by the application.


13.2 Preliminaries

We begin this chapter by briefly providing statistical preliminaries needed for the
regression models. First, we discuss some preprocessing steps used to standardize
the data for all regression models. Second, we discuss the data we are using to
demonstrate the differences between the various regression models.

13.2.1 Preprocessing and Norms

Let’s assume that we have some data of the form (xi , yi ) with i ∈ {1, . . . , n} where
n is the number of samples, xi = (Xi1 , . . . , Xip )T corresponds to the vector of
predictor variables for sample i, p is the number of predictors, and yi is the response
variable. We denote with y ∈ Rn the vector of response variables and with X ∈
Rn×p the predictor matrix. The vector β = (β1 , . . . , βp )T indicates the regression
coefficients.
The predictors and the response variable shall be standardized as follows:

x̄_j = (1/n) Σ_{i=1}^{n} X_ij = 0 for all j;    (13.1)

s̄_j² = (1/n) Σ_{i=1}^{n} X_ij² = 1 for all j;    (13.2)

ȳ = (1/n) Σ_{i=1}^{n} y_i = 0.    (13.3)

Here, x̄j and s̄j2 are the mean and variance of the predictor variables, and ȳ is the
mean of the response variable.
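In R, this standardization can be sketched as follows; note that scale() by default uses the 1/(n − 1) variance, so the scaling vector is computed here with the 1/n convention used above (all object names are assumptions):

# Sketch: center y and standardize X so that mean(X[, j]) = 0 and mean(X[, j]^2) = 1
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)

sd.n  <- apply(X, 2, function(x) sqrt(mean((x - mean(x))^2)))   # 1/n standard deviation
X.std <- scale(X, center = TRUE, scale = sd.n)
y.std <- y - mean(y)

colMeans(X.std)      # approximately 0 for all j
colMeans(X.std^2)    # equal to 1 (up to rounding) for all j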
To study the regularization of regression models, we need to solve optimization
problems, which are formulated in terms of norms. For this reason, we review in the
following the norms needed for the subsequent sections. For a real vector x ∈ Rn
and q ≥ 1, the Lq-norm is defined by


‖x‖_q = ( Σ_{i=1}^{n} |x_i|^q )^(1/q).    (13.4)

The special case q = 2 gives the L2-norm (also known as the Euclidean norm),
and the case q = 1 gives the L1-norm. Interestingly, for q < 1, Eq. 13.4 is still
defined, but it is no longer a norm in the mathematical sense.
We will revisit the L2-norm when discussing ridge regression and the L1-norm
for the LASSO. The infinity norm, also called the maximum norm, is defined by

‖x‖_∞ = max_i |x_i|.    (13.5)

This norm is used by the Dantzig selector.


For q = 0, one obtains the L0-norm (which is strictly not a norm) corresponding
to the number of non-zero elements; that is,

‖x‖_0 = #non-zero elements.    (13.6)
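These norms are simple to compute directly in R; a short sketch:

# Sketch: Lq-, L1-, L2-, infinity-, and "L0"-norms of a vector
lq.norm <- function(x, q) sum(abs(x)^q)^(1 / q)

x <- c(3, 0, -4, 0, 1)
lq.norm(x, 1)     # L1-norm
lq.norm(x, 2)     # L2-norm (Euclidean norm)
max(abs(x))       # infinity (maximum) norm
sum(x != 0)       # "L0-norm": number of non-zero elements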

13.2.2 Data

To illustrate the methods in this chapter, we use a simulated data set. Listing 13.1
presents the code used to generate the data. This allows us to find the true regression
model.
The simulated data consist of 20 uncorrelated predictor variables out of which
only the first 5 influence the response variable. The true values of the regression
coefficients are given by β = (1, 2, 3, −1, −2)T , whereas all the other coefficients
are zero.

In Listing 13.1, we set a seed, which allows us to reproduce the same “random”
data each time we run this code, since the seed initializes the pseudorandom number
generator in R. Hence, if one wants to generate a new random data set, one needs to
set a different seed.
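A hedged sketch of how such data could be generated; the seed, the noise level, and the object names are assumptions and need not match Listing 13.1 exactly:

# Sketch: n samples, p = 20 uncorrelated predictors, only the first 5 informative
set.seed(42)                                 # assumed seed
n <- 100
p <- 20
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- c(1, 2, 3, -1, -2, rep(0, p - 5))    # true regression coefficients
y <- as.numeric(X %*% beta + rnorm(n))       # response with additive noise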

13.2.3 R Packages for Regularization

Many of the following regression models, for example, the LASSO, adaptive
LASSO, and elastic net, can be implemented using the package glmnet, available
in R. This package provides an efficient implementation via the cyclical coordinate
descent method [178], and it has been demonstrated that even with thousands of
predictors and samples, the parameters can be estimated quickly [178]. So, the
package glmnet can be used not only for small data sets but also for large real-world
data.
To perform a group LASSO, we use the package gglasso [514], available in R.
However, there are alternative packages in R that could be used, such as oem [261,
512].

13.3 Ridge Regression

The motivation for improving OLS regression is that the estimates from such models
often have a low bias but a large variance. This relates to the prediction accuracy
of a model because it is known that either by shrinking the values of regression
coefficients or by setting some coefficients to zero, the accuracy of a prediction can
be improved [227]. Thus, by introducing some estimation bias, the variance can be
actually reduced.
Ridge regression was introduced by [248], and the corresponding model is formulated as follows:

β̂^RR = argmin_β { (1/2n) Σ_{i=1}^{n} ( y_i − Σ_j x_ij β_j )² + λ‖β‖₂² }    (13.7)
     = argmin_β { (1/2n) RSS(β) + λ‖β‖₂² }    (13.8)
     = argmin_β { (1/2n) ‖y − Xβ‖₂² + λ‖β‖₂² }.    (13.9)

Here, RSS(β) is the residual sum of squares (RSS), called the loss of the model, λ‖β‖₂² is the regularization term or penalty, and λ is the tuning or regularization
parameter. Specifically, the parameter λ controls the shrinkage of coefficients. The
L2-penalty in Eq. 13.25 is sometimes also called a Tikhonov regularization.
Ridge regression has an analytical solution, which is given by

β̂^RR(λ) = ( Xᵀ X + λ I_p )⁻¹ Xᵀ y.    (13.10)

Here, I p is the p × p identity matrix. A problem with OLS is that if rank(X) < p,
then XT X does not have an inverse. However, a non-zero regularization parameter
λ usually leads to a matrix XT X + λI p , for which an inverse exists.
In Fig. 13.1, we show an example of the simulated data from Listing 13.1.
Specifically, in Fig. 13.1a, b, we show the regression coefficients depending on λ,
because the solution in Eq. 13.10 depends on the tuning parameter.

Fig. 13.1 Results of the ridge regression model from Listing 13.2. Coefficient paths for the ridge regression model against log(λ) (a) and the L1-norm (b). (c) Mean-squared error against log(λ).

At the top of each figure, the numbers of non-zero regression coefficients are shown. One can see
that for increasing values of λ, the values of the regression coefficients decrease.
This is the shrinkage effect of the tuning parameter. Furthermore, one can see that
none of the coefficients become zero. Instead, all regression coefficients assume a
small but non-zero value. These observations are typical for general results from
ridge regression [227].
In Fig. 13.1c, we show results for the mean-squared error. Again, the numbers at
the top give the number of non-zero regression coefficients. The MSE can be used
to identify an optimal value of λ (see the vertical dashed lines). As one can see,
the MSE suggests a smaller value of the tuning parameter that indeed shrinks the
coefficients.
Overall, the advantage of a ridge regression is that it can reduce the variance at
the price of an increased bias. This can improve the prediction accuracy of a model,
which works best in situations where the OLS estimates have a high variance and
p < n. A disadvantage of ridge regression is that it does not shrink the coefficients to
zero. From this, it follows that ridge regression does not perform variable selection.

13.3.1 Example

Ridge regression can be performed using the package glmnet [177], available in R.
Listing 13.2 was used to obtain the preceding results.

The last two lines in Listing 13.2 give estimates for λ̂min and λ̂1se (discussed in
Sect. 13.5). Briefly, these values provide optimal estimates for λ. In order to use
these, such as for predictions given a new data set, one needs to define a new ridge
regression model by setting the “lambda” option. This is shown in Listing 13.3.
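A hedged sketch of what such an analysis could look like with glmnet (alpha = 0 selects the ridge penalty; the simulated X and y from the sketch in Sect. 13.2.2 are assumed and may differ from Listings 13.2 and 13.3):

library(glmnet)

# Ridge regression: alpha = 0 selects the L2 (ridge) penalty
fit.ridge <- glmnet(X, y, alpha = 0)
plot(fit.ridge, xvar = "lambda", label = TRUE)   # coefficient paths vs. log(lambda)

# Cross-validation to estimate lambda.min and lambda.1se
cv.ridge <- cv.glmnet(X, y, alpha = 0)
cv.ridge$lambda.min
cv.ridge$lambda.1se

# Refit with the selected lambda and predict, e.g., for the training data
fit.opt <- glmnet(X, y, alpha = 0, lambda = cv.ridge$lambda.1se)
head(predict(fit.opt, newx = X))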

13.4 Non-negative Garrote Regression

The next model we discuss is the non-negative garrote. Interestingly, the non-
negative garrote has been mentioned as a motivating factor for the introduction of
the LASSO [467]. For this reason, we discuss this model before the LASSO.
The non-negative garrote was introduced by [58], and it is defined by

β̂ = argmin_d { (1/2n) ‖y − Zd‖₂² + λ Σ_{j=1}^{p} d_j },    (13.11)

for d = (d1 , . . . , dp )T with dj > 0 for all j . Importantly, the regression model
is formulated for the scaled variables Z given by Zj = Xj β̂jOLS . That means the
model first estimates ordinary least squares parameter β̂jOLS for the unregularized
regression (Eq. 11.27) and then performs in a second step a regularized regression
for the scaled predictors Z.
The estimates of the non-negative garrote can be expressed with the OLS
regression coefficients and the regularization coefficients in the following way
[521]:

β̂_j^NNG(λ) = d_j(λ) β̂_j^OLS.    (13.12)

Breiman showed that the non-negative garrote consistently has a lower prediction
error than subset selection, and it is competitive with ridge regression except when
the true model has many small non-zero coefficients. A disadvantage of the non-
negative garrote is its explicit dependency on the OLS estimates [467].

13.5 LASSO

The LASSO (least absolute shrinkage and selection operator) was made popular
by R. Tibshirani in 1996 [467], but it had been studied in the literature before;
see, for example, [159, 417]. The LASSO is a regression model that performs

both regularization and variable selection to enhance the prediction accuracy and
interpretability of the regression model.
The LASSO estimate of β̂ is given by

β̂ = argmin_β { (1/2n) Σ_{i=1}^{n} ( y_i − Σ_j β_j x_ij )² }    (13.13)

subject to: ‖β‖₁ ≤ t.    (13.14)

Equation 13.13 is called the constrained form of the regression model. In Eq. 13.14,
t is a tuning parameter (also called regularization parameter or penalty parameter)
and ‖β‖₁ is the L1-norm (see Eq. 13.4).
One can show that Eq. 13.13 can be written in the Lagrange form, given by

β̂ = argmin_β { (1/2n) Σ_{i=1}^{n} ( y_i − Σ_j β_j x_ij )² + λ‖β‖₁ }    (13.15)
  = argmin_β { (1/2n) ‖y − Xβ‖₂² + λ‖β‖₁ }.    (13.16)

The relationship between both forms holds due to the duality and the KKT
(Karush-Kuhn-Tucker) conditions. Furthermore, for every t > 0 there exists a λ > 0
such that both equations lead to the same solution [227].
In general, the LASSO lacks a closed-form solution because the objective
function is not differentiable. However, it is possible to obtain closed-form solutions
for the special case of an orthonormal design matrix.
In the LASSO regression model, Eq. 13.16, λ is a parameter that needs to be
estimated. This is accomplished using cross-validation. Specifically, for each fold
Fk , the mean-squared error is estimated by

e(λ)_k = (1/#F_k) Σ_{j ∈ F_k} (y_j − ŷ_j)².    (13.17)

Here, #Fk is the number of samples in set Fk . Then the average over all K folds is
taken, leading to

CV(λ) = (1/K) Σ_{k=1}^{K} e(λ)_k.    (13.18)

This is called the cross-validation mean-squared error. To obtain an optimal λ from


CV (λ), two approaches are commonly used. The first approach estimates the λ that
minimizes the function CV (λ):

λ̂min = arg min CV (λ). (13.19)

The second approach first estimates λ̂min and then identifies the maximal λ that has
a cross-validation MSE (mean-squared error) smaller than CV (λ̂min ) + SE(λ̂min ):

λ̂_1se = max { λ : CV(λ) ≤ CV(λ̂_min) + SE(λ̂_min) }.    (13.20)

13.5.1 Example

In Listing 13.4, we present the R code used to analyze the simulated data. The results
of this analysis are visualized in Fig. 13.2. The first two rows in Fig. 13.2 show
coefficient paths for the LASSO regression model depending on log(λ) (top) and
the L1-norm (middle). One can see that the five regression coefficients for the true
predictors are nicely recovered, while the remaining (false) coefficients assume very
small values (left-hand side) before they vanish. At the bottom of the figure, we show
the mean-squared error, depending on log(λ). The vertical dashed lines correspond
to λ̂min and λ̂1se (see Listing 13.4 for numerical values corresponding to Fig. 13.2).

At the bottom of Listing 13.4, we show how to access important estimated entities, including λ̂_min, λ̂_1se, as well as the regression coefficients. As one can
see, the values of these estimates are needed, for example, to obtain the regression
coefficients of a particular model, as specified by λ̂1se .
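Since Listing 13.4 itself is not reproduced here, the following hedged sketch shows how such quantities are typically obtained with the glmnet package (an assumed choice of tooling); X and y denote the simulated predictors and response from Listing 13.1.

library(glmnet)

set.seed(1)
cv.fit <- cv.glmnet(X, y, alpha = 1)         # LASSO with 10-fold cross-validation
cv.fit$lambda.min                            # lambda_min (Eq. 13.19)
cv.fit$lambda.1se                            # lambda_1se (Eq. 13.20)
coef(cv.fit, s = "lambda.1se")               # coefficients of the lambda_1se model
plot(cv.fit)                                 # CV mean-squared error against log(lambda)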

13.5.2 Explanation of Variable Selection

From Fig. 13.2, one can see that decreasing values of λ lead to the shrinkage of the regression coefficients (see top and middle rows in Fig. 13.2), and some of these even become zero.

Fig. 13.2 Results for the LASSO from Listing 13.4. Shown are coefficient paths against log(λ) (top) and the L1-norm (middle). Bottom: Mean-squared error against log(λ).

To understand this behavior, we depict in Fig. 13.3 a
two-dimensional LASSO (Panel A) and ridge regression (Panel B) model. The
regularization term of each regression model is depicted in blue, corresponding to
the diamond shape for the L1-norm and the circle for the L2-norm. The solution of the optimization problem is given by the intersection of the ellipse and the boundary of the penalty shapes. These intersections are highlighted by a green point for the LASSO and a blue point for the ridge regression.

Fig. 13.3 Visualization of the difference between the L1-norm (a) used by the LASSO and the L2-norm (b) used by ridge regression; (c) solution for the orthonormal case.
To shrink a coefficient to zero, an intersection needs to occur alongside the two
coordinate axes. For the shown situation, this is only achieved using the LASSO
but not the ridge regression. In general, the probability of a LASSO’s shrinking a
coefficient to zero is much larger than that of a ridge regression’s doing so.
To understand this, it is helpful to look at the solution of the coefficients for the
orthonormal case, because in this case the solution for the LASSO can be found
analytically. The analytical solution is given by

\hat{\beta}_i^{LASSO}(\lambda; \mathrm{orth}) = \mathrm{sign}(\hat{\beta}_i^{OLS})\, S(\hat{\beta}_i^{OLS}, \lambda)   (13.21)

Here, S() is the soft-threshold operator, defined as

S(\hat{\beta}_i^{OLS}, \lambda) = \begin{cases} \hat{\beta}_i^{OLS} - \lambda, & \text{if } \hat{\beta}_i^{OLS} > \lambda; \\ 0, & \text{if } |\hat{\beta}_i^{OLS}| \le \lambda; \\ \hat{\beta}_i^{OLS} + \lambda, & \text{if } \hat{\beta}_i^{OLS} < -\lambda. \end{cases}   (13.22)

For the ridge regression, the orthonormal solution is given by

\hat{\beta}_i^{RR}(\lambda; \mathrm{orth}) = \frac{\hat{\beta}_i^{OLS}}{1 + \lambda}.   (13.23)

In Fig. 13.3c, we show Eq. 13.21 (green) and Eq. 13.23 (blue). As a reference, we
added the ordinary least square solution as a dashed diagonal line (black) because it
is just the identity mapping:

\hat{\beta}_i^{OLS}(\mathrm{orth}) = \hat{\beta}_i^{OLS}   (13.24)

As one can see, ridge regression leads to a change in the slope of the line and, hence,
a shrinkage of the coefficient. However, it does not lead to a zero coefficient except
for the point at the origin of the coordinate system. In contrast, the LASSO shrinks the coefficient to zero for $|\hat{\beta}_i^{OLS}| \le \lambda$.
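The soft-threshold operator and the two orthonormal solutions can be written compactly in R; the following lines are an illustration only (the function name and the value λ = 1 are our choices).

soft.threshold <- function(b, lambda) {
  sign(b) * pmax(abs(b) - lambda, 0)         # compact form of Eq. 13.22
}

b.ols   <- seq(-3, 3, by = 0.1)              # hypothetical OLS estimates
b.lasso <- soft.threshold(b.ols, lambda = 1) # zero for |b| <= 1 (Eq. 13.21)
b.ridge <- b.ols / (1 + 1)                   # Eq. 13.23 with lambda = 1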

13.5.3 Discussion

The key idea of the LASSO is to realize that the theoretically ideal penalty for
achieving sparsity is the L0-norm (i.e., $\|\beta\|_0$ = number of non-zero elements; see Eq. 13.6),
which is computationally intractable; however, this can be mimicked using the L1-
norm, which makes the optimization problem convex [481].
There are three major differences between ridge regression and the LASSO:
1. The non-differentiable corners of the L1-ball produce sparse models for suffi-
ciently large values of λ.
2. The lack of rotational invariance limits the use of the singular-value theory.
3. The LASSO has no analytic solution, making both computational and theoretical
results more difficult to obtain.
The first point implies that the LASSO is better than OLS for interpretation
purposes. With a large number of independent variables, we often would like
to identify a smaller subset of these variables that exhibit the strongest effects.
The sparsity of the LASSO is mainly counted as an advantage due to a simpler
interpretation, but it is important to highlight that the LASSO is not able to select
more than n variables.

13.5.4 Limitations

There are a number of limitations for the LASSO estimator, which cause problems
for variable selection in certain situations.
1. In the p > n case, the LASSO selects at most n variables before it saturates. This
could be a limiting factor if the true model consists of more than n variables.
2. The LASSO has no grouping property, which means it tends to select only one
variable from a group of highly correlated variables.
3. In the n > p case and with high correlations between predictors, it has been
observed that the prediction performance of the LASSO is inferior to that of the
ridge regression.

13.6 Bridge Regression

Bridge regression was suggested by Frank and Friedman [175]. It minimizes the RSS subject to a constraint depending on a parameter q:

\hat{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \big( y_i - \sum_j \beta_j x_{ij} \big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}   (13.25)

\qquad = \arg\min_{\beta} \Big\{ \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}.   (13.26)

The regularization term has the form of an Lq-norm, where q can assume any positive value; that is, q > 0. For the special case q = 2, one obtains ridge regression, and for q = 1 the LASSO. Although bridge regression was introduced in 1993, before the LASSO, the model had not been studied in detail at that time. This justifies the LASSO as a new method, because in [467] the authors presented a full analysis.

13.7 Dantzig Selector

Next, we discuss briefly the Dantzig selector [68]. This regression model was introduced particularly for the case where p is large (p ≫ n); in other words, when we have many more parameters than observations.
The regression model solves the following problem:

\hat{\beta} = \arg\min_{\beta} \Big\{ \big\| X^{T}(y - X\beta) \big\|_{\infty} + \lambda \|\beta\|_1 \Big\}.   (13.27)

Here, the $L_\infty$-norm is the maximum absolute value of the components of its argument. It is worth remarking that, in contrast with the LASSO, the residual $(y - X\beta)$ is multiplied by $X^T$ in Eq. 13.27. This term makes the solution rotation invariant.
An advantage of the Dantzig selector is that it is computationally simple, because
technically it can be reduced to a linear programming problem. This inspired
the name of the method, which pays tribute to George Dantzig for his seminal
work on the simplex method for linear programming [226]. As a consequence
of its computational efficiency, this regression model can be used for very high-
dimensional data for which the LASSO becomes burdensome.
The disadvantages of the Dantzig selector are similar to those of the LASSO
except that it can result in more than n non-zero coefficients when p > n [130].
Also, the Dantzig selector is sensitive to outliers because the L∞ norm is very
sensitive to outliers. This hampers the practical application of the model.

For a computational analysis of the Dantzig selector, the R package flare can be
used [312].
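A minimal call could look as follows; the exact interface of flare (the function slim() with method = "dantzig") is our assumption and should be checked against the package documentation.

library(flare)

fit.dantzig <- slim(X = X, Y = y, method = "dantzig", nlambda = 20)  # assumed interface
fit.dantzig$beta                             # coefficient paths over the lambda grid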

13.8 Adaptive LASSO

The adaptive LASSO, which was introduced in [528], is similar to the LASSO but
with an oracle procedure. An oracle procedure is one that has the following oracle
properties:
1. Consistency in variable selection
2. Asymptotic normality
Put simply, the oracle property means that a model performs as well as the true
underlying model, if this were known [531]. Specifically, the first property means
that a model selects all non-zero coefficients with probability one; that is, an oracle
identifies the correct subset of true variables. The second property means that non-
zero coefficients are estimated as in the true model, if this were known. Importantly,
it has been shown that the adaptive LASSO is an oracle procedure but the LASSO
is not [534].
The basic idea of the adaptive LASSO is to introduce weights for the penalty
of each regression coefficient. Specifically, the adaptive LASSO is a two-step
procedure. In the first step, a weight vector ŵ is estimated from OLS estimates
of β̂ init , and a connection between both is given by

1
ŵ = (13.28)
|β|ˆ
γ
init

Here, γ is a positive tuning parameter; that is, γ > 0.


Second, for the weight vector, w = (w1 , . . . , wp )T , the following weighted
LASSO is formulated:
 8
1   
n p
2
β̂ = arg min yi − βj xij + λ wj |βj | (13.29)
2n
i=1 j j =1
 
p 8
1 
= arg min  y − Xβ22 + λ wj |βj | . (13.30)
2n
j =1

It can be shown that for certain data-dependent weight vectors, the adaptive
LASSO has oracle properties. Typically, the values of β̂ init are chosen according
to the following cases:
• For the case where p is small (p ≪ n): $\hat{\beta}_{init} = \hat{\beta}_{OLS}$.
• For the case where p is large (p ≫ n): $\hat{\beta}_{init} = \hat{\beta}_{RR}$.

The adaptive LASSO penalty can be seen as an approximation of the Lq


penalties, with q = 1−γ . One advantage of adaptive LASSO is that, for appropriate
initial estimates, the criterion Eq. 13.29 is convex in β. Furthermore, if the initial
estimates are consistent, it has been shown in [534] that the method recovers the
true model under more general conditions than the LASSO.
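As an illustration of the two-step procedure (not the book's listing), the weights of Eq. 13.28 can be passed to a weighted LASSO via the penalty.factor argument of glmnet; the use of glmnet and the OLS initial estimates (appropriate for p < n) are our assumptions.

library(glmnet)

gamma     <- 1
beta.init <- coef(lm(y ~ X))[-1]             # initial estimates (here: OLS, p < n)
w         <- 1 / abs(beta.init)^gamma        # Eq. 13.28
fit.cv    <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)  # weighted LASSO (Eq. 13.29)
coef(fit.cv, s = "lambda.1se")               # adaptive LASSO coefficients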

13.8.1 Example

In Listing 13.4, we present an example in R that uses the adaptive LASSO to analyze
the simulated data. In Fig. 13.4, we show the results for γ = 1. The figures show
the coefficient paths depending on log(λ) (top) and the results for the mean-squared
error (bottom). One can observe the shrinking and selecting property of the adaptive
LASSO.

Fig. 13.4 Results for adaptive LASSO using Listing 13.4. Top: Coefficient paths against log(λ). Bottom: Mean-squared error against log(λ).

For the preceding analysis, we used γ = 1. However, γ is a tuning parameter that needs to be estimated from the data. We leave this as an exercise (see Exercise 3).

13.9 Elastic Net

The elastic net regression model was introduced in [535] to extend the LASSO
by improving some of its limitations, especially with respect to variable selec-
tion. Importantly, the elastic net encourages a grouping effect, keeping strongly
correlated predictors together in the model. In contrast, the LASSO tends to split
such groups, keeping only the strongest variable. Furthermore, the elastic net is
particularly useful in cases where the number of predictors (p) in a data set is much
larger than the number of observations (n). In such a case, the LASSO is not able to
select more than n predictors, but the elastic net has this capability.
Assuming standardized predictors and response, the elastic net solves the following problem:

\hat{\beta} = \arg\min_{\beta} \Big\{ \frac{1}{2n} \sum_{i=1}^{n} \big( y_i - \sum_j \beta_j x_{ij} \big)^2 + \lambda P_{\alpha}(\beta) \Big\}   (13.31)

\qquad = \arg\min_{\beta} \Big\{ \frac{1}{2n} \| y - X\beta \|_2^2 + \lambda P_{\alpha}(\beta) \Big\};   (13.32)

P_{\alpha}(\beta) = \alpha \|\beta\|_2^2 + (1 - \alpha) \|\beta\|_1   (13.33)

\qquad = \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big).   (13.34)

Here, Pα (β) is the elastic net penalty [535]. Pα (β) is a combination of the ridge
regression penalty, with α = 1, and the LASSO penalty, with α = 0. This form of
penalty turns out to be particularly useful when p > n or in situations where we
have many (highly) correlated predictor variables.
In the correlated case, it is known that ridge regression shrinks the regression
coefficients of the correlated predictors toward each other. In the extreme case of
k identical predictors, each predictor obtains the same estimates of the coefficients
[178]. From theoretical considerations, it is further known that the ridge regression
is optimal if there are many predictors and all have non-zero coefficients. LASSO,
on the other hand, is somewhat indifferent to very correlated predictors and will tend
to pick one and ignore the rest.
Interestingly, it is known that the elastic net with α = ε, for some very small ε >
0, performs similarly to the LASSO, but removes any degeneracies caused by the
presence of correlations between the predictors [178]. More generally, the penalty
family given by Pα (β) creates a non-trivial mixture between ridge regression and
the LASSO. For a given λ, if we decrease α from 1 to 0, the number of regression
coefficients, equal to zero, increases monotonically from 0 (full ridge regression
model) to the sparsity of the LASSO solution. Here, “sparsity” refers to the fraction
of regression coefficients equal to zero. For more detail, see Friedman et al. [178],
where an efficient implementation of the elastic net penalty for a variety of loss
functions was provided.

13.9.1 Example

In Listing 13.6, we present an example in R that uses the elastic net to analyze
the simulated data. To obtain an elastic net model, one needs to set a value for the
option “alpha.” For our analysis, we used α = 0.5. The results of this analysis are
visualized in Fig. 13.5. Here, the coefficient paths are shown, depending on log(λ)
(top), as is the mean-squared error, depending on log(λ) (bottom).

Since α is a parameter, one needs to optimize this value via model selection (see Exercise 4).
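A simple way to do this is a grid search over α, sketched below with the glmnet package (an assumption about the tooling). Note that glmnet uses the reversed convention, where alpha = 1 is the LASSO and alpha = 0 is ridge regression; at α = 0.5 both conventions coincide.

library(glmnet)

set.seed(1)
foldid <- sample(rep(1:10, length.out = length(y)))      # same folds for all alphas
alphas <- seq(0, 1, by = 0.1)
cv.err <- sapply(alphas, function(a) {
  min(cv.glmnet(X, y, alpha = a, foldid = foldid)$cvm)   # smallest CV MSE per alpha
})
alphas[which.min(cv.err)]                                # alpha with the lowest CV error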
Fig. 13.5 Results for the elastic net from Listing 13.6. Top: Coefficient paths against log(λ) for α = 0.5. Bottom: Mean-squared error against log(λ).

13.9.2 Discussion

The elastic net has been introduced to counteract the drawbacks of the LASSO and
ridge regression. The idea was to use a penalty for the elastic net that is based on
a combined penalty of the LASSO and ridge regression. The penalty parameter
α determines how much weight should be given to either the LASSO or ridge
regression. An elastic net with α = 1.0 performs a ridge regression, and an elastic
net with α = 0 performs the LASSO. Specifically, several studies [22, 476] showed
the following:
1. In the case of correlated predictors, the elastic net can result in lower mean-
squared errors compared to ridge regression and the LASSO.

2. In the case of correlated predictors, the elastic net selects all predictors, whereas
the LASSO selects one variable from a correlated group of variables but tends to
ignore the remaining correlated variables.
3. In the case of uncorrelated predictors, the additional ridge penalty brings little
improvement.
4. The elastic net identifies correctly a larger number of variables compared to the
LASSO (model selection).
5. The elastic net often has a lower false-positive rate compared to ridge regression.
6. In the case p > n, the elastic net can select more than n predictor variables,
whereas the LASSO selects at most n.
The last point means that the elastic net is capable of performing group selection
of variables, at least to a certain degree. To further improve this property, the group
LASSO has been introduced (see Sect. 13.10).
It can be shown that the elastic net penalty is a convex combination of the LASSO
penalty and the ridge penalty. Specifically, for all α ∈ (0, 1) the penalty function is
strictly convex. In Fig. 13.6, we visualize the effect of the tuning parameter, α, on the
regularization. As one can see, the elastic net penalty (in green) is located between
the LASSO penalty (in blue) and the ridge penalty (in purple).
The orthonormal solution of the elastic net is similar to that of the LASSO in Eq. 13.21. It is given by [535]

\hat{\beta}_i^{EN}(\lambda; \mathrm{orth}) = \mathrm{sign}(\hat{\beta}_i^{OLS})\, \frac{S(\hat{\beta}_i^{OLS}, \lambda_1)}{1 + \lambda_2},   (13.35)

with $S(\hat{\beta}_i^{OLS}, \lambda_1)$ defined as

S(\hat{\beta}_i^{OLS}, \lambda_1) = \begin{cases} \hat{\beta}_i^{OLS} - \lambda_1, & \text{if } \hat{\beta}_i^{OLS} > \lambda_1; \\ 0, & \text{if } |\hat{\beta}_i^{OLS}| \le \lambda_1; \\ \hat{\beta}_i^{OLS} + \lambda_1, & \text{if } \hat{\beta}_i^{OLS} < -\lambda_1. \end{cases}   (13.36)

Fig. 13.6 Visualization of the elastic net regularization (green) combining the L2-norm (purple) of ridge regression and the L1-norm (blue) of LASSO.
Here, the parameters λ1 and λ2 are connected to λ and α in Eq. 13.31 by

\alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2},   (13.37)

\lambda = \lambda_1 + \lambda_2,   (13.38)

resulting in the following alternative form of the elastic net:

\hat{\beta} = \arg\min_{\beta} \Big\{ \frac{1}{2n} \| y - X\beta \|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 \Big\}.   (13.39)

In contrast with the LASSO in Eq. 13.21, only the slope of the line for $|\hat{\beta}_i^{OLS}| > \lambda_1$ is different, due to the denominator $1 + \lambda_2$. That means the ridge penalty, controlled
by λ2 , performs a second shrinkage effect on the coefficients. Hence, an elastic
net performs a double shrinkage on the coefficients, one from the LASSO penalty
and one from the ridge penalty. So, from Eq. 13.35, one can also see the variable
selection property of the elastic net, which is similar to the LASSO.

13.10 Group LASSO

The last modern regression model we are discussing is the group LASSO, intro-
duced in [520]. The group LASSO is different from the other regression models
because it focuses on groups of variables instead of on individual variables.
There are many real-world application problems, for example, pathways of genes,
portfolios of stocks, or substage disorders of patients, that have substructures where
a set of predictors forms a group where the predictors simultaneously have either
non-zero or zero coefficients.
The various forms of group LASSO penalties are designed for such situations.
Let’s suppose that the p predictors are divided into G groups and pg is the num-
ber of predictors in group g ∈ {1, . . . , G}. The matrix Xg ∈ Rn×pg represents the
predictors corresponding to group g, and the corresponding regression coefficient
vector is given by β g ∈ Rpg .
The group LASSO solves the following convex optimization problem:

\hat{\beta} = \arg\min_{\beta} \Big\{ \frac{1}{2n} \big\| y - \sum_{g=1}^{G} X_g \beta_g \big\|_2^2 + \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \|\beta_g\|_2 \Big\}.   (13.40)

Here, the term $\sqrt{p_g}$ accounts for the varying group sizes. If $p_g = 1$ for all groups g,
then the group LASSO becomes the ordinary LASSO. If pg > 1, the group LASSO
works like the LASSO but on the group level, instead of the individual predictors.

13.10.1 Example

In Listing 13.7, we present an example in R that uses the group LASSO. The data
analyzed are from the simulated data in Listing 13.1, and for the group labels we
assumed {1, 1, 1, 2, 2, 3, 3, . . . , 3} for the 20 predictors.
The results of this analysis are shown in Fig. 13.7. The top figure shows the
coefficient paths depending on log(λ), and the bottom figure the mean-squared error
depending on log(λ). The coefficient paths are colored according to the three groups (group 1, blue; group 2, purple; and group 3, magenta). As one can see, either all
variables of a group are zero or none are. From Fig. 13.7 (top), it is clear that first
the coefficients of the variables 6-20 (in magenta) vanish, and then the variables 4
and 5 (in purple) do (see Listing 13.1 for this information).

According to the results from the mean-squared error in Fig. 13.7 (bottom), the
optimal solution involves only two groups, consisting of the first five variables.
We would like to remark that the group information for the variables may not
always be easy to obtain, or there may be even alternative groupings. Further-
more, one may consider the finding of the groups as a model selection problem.
Unfortunately, even for a moderate number of variables, this can quickly become
computationally challenging if one wants to do an exhaustive search.
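As an illustration (Listing 13.7 may use a different package), a group LASSO with the grouping described above could be fitted with the gglasso package; the interface shown here is our assumption.

library(gglasso)

grp    <- c(1, 1, 1, 2, 2, rep(3, 15))       # group labels {1,1,1,2,2,3,...,3}
cv.grp <- cv.gglasso(x = X, y = y, group = grp, loss = "ls")
coef(cv.grp, s = "lambda.1se")               # groupwise-sparse coefficients
plot(cv.grp)                                 # CV error against log(lambda)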
Fig. 13.7 Group LASSO. Top: Coefficient paths against log(λ). Bottom: Mean-squared error against log(λ).

13.10.2 Remarks

1. The group LASSO has either zero coefficients of all members of a group or non-
zero coefficients.
2. The group LASSO cannot achieve sparsity within a group.
3. The groups need to be predefined; that is, the regression model does not provide
a direct mechanism to obtain the grouping.
4. The groups are mutually exclusive (non-overlapping).
Finally, we just want to briefly mention that to overcome the limitation of the
group LASSO to obtain sparsity within a group (point (2)), the sparse group LASSO
has been introduced in [441], and the corresponding optimization problem reads

\hat{\beta} = \arg\min_{\beta} \Big\{ \frac{1}{2n} \big\| y - \sum_{g=1}^{G} X_g \beta_g \big\|_2^2 + (1 - \alpha)\lambda \sum_{g=1}^{G} \sqrt{p_g}\, \|\beta_g\|_2 + \alpha \lambda \|\beta\|_1 \Big\}.   (13.41)

Table 13.1 Summary of key features of the regularized regression models.


Method Analytical solution Variable selection Can select > n Grouping Oracle
Ridge regression Yes No Yes Yes No
Non-negative garrote No Yes No No No
LASSO No Yes No No No
Dantzig selector No Yes Yes No No
Adaptive LASSO No Yes No No Yes
Elastic net No Yes Yes Yes No
Group LASSO No Yes Yes Yes No

For α ∈ [0, 1], this is a convex optimization problem combining the group LASSO
penalty (with α = 0) with the LASSO penalty (with α = 1). Here, β ∈ Rp is the
complete coefficient vector.

13.11 Discussion

The modern regression models discussed in this chapter extend OLS regression.
In contrast to the OLS regression and ridge regression, all of these models are
computational in nature because the solution to the various regularizations can only
be found by means of numerical approaches.
In Table 13.1, we summarize key features of these regression models. A common
feature of all the extensions of OLS regression and ridge regression is that these
models perform variable selection (coefficient shrinkage to zero). This allows one to
obtain interpretable models because the smaller the number of variables in a model,
the easier it is to find plausible explanations. Considering this, the most satisfying
method is the adaptive LASSO because it possesses the oracle property, enabling
(under certain conditions) one to identify only the coefficients that are non-zero in
the true model.
In general, one considers data as high-dimensional if either (a) p is large or (b)
p > n [22, 270, 335]. Case (a) can be handled by all regression models, including
the OLS regression. However, case (b) is more difficult because it may require one
to select more variables than there are available samples. Only ridge regression,
Dantzig selector, elastic net, and the group LASSO are capable of this, while the
elastic net is particularly suitable for this situation.
Finally, the grouping of variables is useful; for example, in cases where variables
are highly correlated with each other. Again ridge regression, the elastic net, and the
group LASSO have this property, and the latter has been specifically introduced to
deal with this problem.
In Fig. 13.8, we show a numerical comparison of three models. The underlying
data are again for 20 uncorrelated covariates where only 5 contribute to the response
variable. Specifically, Fig. 13.8 shows the MSE depending on the sample size of the training data.

Fig. 13.8 Comparison of the MSE for three different models: LASSO, ridge regression, and multiple linear regression (MLR).

The results are averaged over 100 independent data sets. As one
can see, with an increasing sample size, the distance between the three models
becomes smaller, and even an unregularized multiple linear regression (MLR)
model performs satisfactorily. On the other hand, for smaller sample sizes, the
advantage of regularization becomes apparent.
This example demonstrates that the advantage of a model is data-dependent.
Therefore, this needs to be investigated on a case-by-case basis. However, this
makes a high-dimensional regression analysis nontrivial, requiring insights from
the analyst.

13.12 Summary

In this chapter, we discussed regularized regression models. Over the years, many
different regularization models have been introduced, where each addresses a
particular problem; hence, none of the methods dominates the others, and each has
specific strengths and weaknesses. In this chapter, we discussed ridge regression,
non-negative garrote regression, LASSO, Dantzig selector, adaptive LASSO, elastic
net, and the group LASSO. The LASSO is a very popular model that can be found
frequently in many applications, ranging from biology to psychology.

Learning Outcome 13: Regularization

Regularization is a mathematical concept that modifies the optimization function of a regression model. This can lead to the shrinkage of regression coefficients, which can even vanish.

Regularization is a powerful framework that can influence the optimization of regression coefficients. We have seen that depending on the mathematical
formulation, different models can be obtained. It is interesting to note that when
regression coefficients are shrunk to zero, regularization performs model selection
on the number of regression coefficients.

13.13 Exercises

1. Perform a regression analysis with the ridge regression model for the simulated
data in Listing 13.1.
• Reproduce the results in Fig. 13.1.
• Use λ̂min to define an optimal model, and make a prediction for the testing
and training data. Compare and discuss the results.
• Repeat this analysis for λ̂1se .
2. Perform a regression analysis with the LASSO model for the simulated data in
Listing 13.1.
• Reproduce the results in Fig. 13.2.
• Why do the values of λ̂min and λ̂1se change when the analysis is repeated?
Hint: See Chap. 4 and our discussion about k-fold CV.
3. Perform a regression analysis with the adaptive LASSO model for the simulated
data in Listing 13.1.
• Reproduce the results in Fig. 13.4.
• Estimate the optimal value of γ in Eq. 13.28. Hint: Formulate the analysis as
a model selection problem; see Chap. 12.
4. Perform a regression analysis with the elastic net model for the simulated data in
Listing 13.1.
• Reproduce the results in Fig. 13.5.
• Estimate the optimal value of α in Eq. 13.34. Hint: Formulate the analysis as
a model selection problem; see Chap. 12.
Chapter 14
Deep Learning

14.1 Introduction

Deep learning models are new estimation models from artificial intelligence (AI).
Recent breakthroughs in image analysis and speech recognition have generated a
massive interest in this field because applications seem possible in many other
domains that generate big data. But a downside is that the mathematical and
computational methodology underlying deep learning models is very challenging,
especially for interdisciplinary scientists.
In general, deep learning (DL) describes a family of learning algorithms rather
than a single method that can be used to learn complex prediction models; for exam-
ple, multilayer neural networks with many hidden units [303]. Importantly, DL has
received much attention [241] in recent years, and it has been successfully applied
to several application problems. For instance, a deep learning method sets the record
for the classification of handwritten digits of the Modified National Institute of
Standards and Technology database (MNIST) data set with an error rate of 0.21%
[492]. Further application areas that achieved remarkable results include image
recognition [294, 303], speech recognition [212], natural language understanding
[418], acoustic modeling [343], and computational biology [6, 310, 445, 446, 528].
Interestingly, models of artificial neural networks have been used since about
the 1950s [409]; however, the current wave of deep learning neural networks started
around 2006 [241]. A common characteristic of the many variants of supervised and
unsupervised deep learning models is that these models learn many layers of hidden
neurons; for example, using a restricted Boltzmann machine (RBM) in combination
with backpropagation and error gradients of the stochastic gradient descent [405].
Due to the heterogeneity of deep learning approaches, a comprehensive discussion
is very challenging, and for this reason many introductions usually aim at dedicated
subtopics. For instance, a bird’s eye view without detailed explanations can be
found in [303], whereas a historical summary with many detailed references was
provided in [425]. In addition, reviews are available for various application domains,


including image analysis [402, 434], speech recognition [519], natural language
processing [518], and biomedicine [69].
This chapter is organized as follows. In Sect. 14.2, we discuss major architec-
tures, distinguishing classical neural networks from deep neural networks. Then,
we discuss a number of deep neural networks in detail: deep feedforward neural
networks (in Sect. 14.3), convolutional neural networks (in Sect. 14.4), deep belief
networks (in Sect. 14.5), autoencoders (in Sect. 14.6), and long short-term memory
networks (in Sect. 14.7). In Sect. 14.8, we provide a discussion of important issues
that come up when learning neural network models. Finally, this chapter finishes
with a summary (Sect. 14.9). Throughout these sections, we provide a number of
numerical examples using R.

14.2 Architectures of Classical Neural Networks

Historically, artificial neural networks (ANNs) are mathematical models that have
been inspired by the functioning of the brain. However, the models we discuss in the
following sections do not aim at providing biologically realistic models. Instead, the
purpose of these models is to analyze data.

14.2.1 Mathematical Model of an Artificial Neuron

The basic entity of any neural network is the model of a neuron. In Fig. 14.1a, we
show such a model of an artificial neuron.
The basic idea of a neuron model is that an input, x, weighted by w, together with a bias, b, is summed up. The bias, b, is a scalar value, whereas the input x and the weights w are vector valued; that is, $x \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$, with $n \in \mathbb{N}$ corresponding to the dimension of the input. Note that the bias term is not always present, as it is sometimes omitted. The sum of these terms, that is, $z = w^T x + b$, then forms the argument of an activation function, φ, resulting in the output of the neuron model:

y = \phi(z) = \phi(w^T x + b).   (14.1)

Considering only the argument of φ, one obtains a linear discriminant function [500].

Fig. 14.1 (a) Representation of a mathematical artificial neuron model. The input to the neuron is summed up and filtered by activation function φ (for examples, see Table 14.1). (b) Simplified representation of an artificial neuron model. Only the key elements are depicted; that is, the input, the output, and the weights.

Table 14.1 An overview of frequently used activation functions for neuron models (φ(x), its derivative φ'(x), and the range of values).

Hyperbolic tangent: φ(x) = tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}); φ'(x) = 1 − φ(x)^2; values (−1, 1)
Sigmoid: φ(x) = S(x) = 1/(1 + e^{−x}); φ'(x) = φ(x)(1 − φ(x)); values (0, 1)
ReLU: φ(x) = R(x) = 0 for x < 0, x for x ≥ 0; φ'(x) = 0 for x < 0, 1 for x ≥ 0; values [0, ∞)
Heaviside function: φ(x) = H(x) = 0 for x < 0, 1 for x ≥ 0; φ'(x) = δ(x); values [0, 1]
Signum function: φ(x) = sgn(x) = −1 for x < 0, 0 for x = 0, 1 for x > 0; φ'(x) = 2δ(x); values [−1, 1]
Softmax: y_i = e^{x_i} / Σ_j^n e^{x_j}; ∂y_i/∂x_j = y_i(δ_{ij} − y_j); values (0, 1)
The activation function, φ (also known as a unit function or transfer function),
performs a nonlinear transformation of z. In Table 14.1, we give an overview of
frequently used activation functions.
The ReLU activation function, also called a rectified linear unit or rectifier [357],
is the most popular activation function for deep neural networks. Another useful
activation function is the softmax function [301], given by

y_i = \frac{e^{x_i}}{\sum_{j}^{n} e^{x_j}}.   (14.2)

Softmax maps an n-dimensional vector x into an n-dimensional vector y with the property $\sum_i y_i = 1$. Hence, the components of y represent probabilities for each
of the n elements. The softmax is often used in the final layer of a network. If the
Heaviside step function is used as activation function (see Table 14.1), the neuron
model is known as a perceptron [409].
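The activation functions in Table 14.1 are easy to write as plain R functions; the following short sketch is our own illustration.

relu    <- function(x) pmax(0, x)                            # rectified linear unit
sigmoid <- function(x) 1 / (1 + exp(-x))                     # logistic function
softmax <- function(x) { e <- exp(x - max(x)); e / sum(e) }  # numerically stable softmax

z <- c(-1.5, 0.3, 2.0)
relu(z); sigmoid(z); tanh(z)
sum(softmax(z))                                              # equals 1, as stated above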

In general, the neuron model depicted in Fig. 14.1a can be described more
simplistically as shown in Fig. 14.1b, where merely the input and output parts are
depicted.

14.2.2 Feedforward Neural Networks

To build neural networks (NNs), the neurons need to be connected with each other.
The simplest architecture of an NN is the feedforward structure. In Fig. 14.2a and
b, we show examples for a shallow and a deep architecture (discussed in detail in
Sect. 14.3).
In general, the depth of a network denotes the number of nonlinear transforma-
tions between the separating layers, whereas the dimensionality of a hidden layer,
that is, the number of hidden neurons, is called its width. For instance, the shallow
architecture in Fig. 14.2a has a depth of 2, whereas the architecture in Fig. 14.2b
has a depth of 4 (that is, total number of layers minus one [the input layer]). The
required value for the depth to justify calling a feedforward neural network (FFNN)
architecture "deep" is debatable, but architectures with more than two hidden layers
are commonly considered to be deep [33].
A feedforward neural network, also called a multilayer perceptron (MLP), can use linear or nonlinear activation functions [206]. Importantly, there are no cycles in the NN that would allow direct feedback. Equation 14.3 defines how the output of an MLP is obtained from the input [500]:

f(x) = \varphi^{(2)}\big( W^{(2)} \varphi^{(1)}( W^{(1)} x + b^{(1)} ) + b^{(2)} \big).   (14.3)

Fig. 14.2 Two examples of feedforward neural networks (FFNN). (a) A shallow FFNN. (b) A deep feedforward neural network (D-FFNN) with three hidden layers (see Sect. 14.3 for details about D-FFNN).

Equation 14.3 is the discriminant function of the neural network [500]. To find the optimal parameters of the model, one needs to define a learning rule. A
common approach is to define an error function (or cost function) together with an
optimization algorithm to find the optimal parameters by minimizing the error for
the training data.
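To illustrate Eq. 14.3, the following sketch implements the forward pass of an MLP with one hidden layer; the layer sizes, the random weights, and the sigmoid activation are arbitrary illustrative choices.

set.seed(1)
n.in <- 3; n.hid <- 4; n.out <- 2
W1 <- matrix(rnorm(n.hid * n.in), n.hid, n.in); b1 <- rnorm(n.hid)
W2 <- matrix(rnorm(n.out * n.hid), n.out, n.hid); b2 <- rnorm(n.out)
sigmoid <- function(x) 1 / (1 + exp(-x))

mlp.forward <- function(x) {
  h <- sigmoid(W1 %*% x + b1)                # phi^(1)(W^(1) x + b^(1))
  sigmoid(W2 %*% h + b2)                     # phi^(2)(W^(2) h + b^(2))
}
mlp.forward(c(0.5, -1.0, 2.0))               # output of the network for one input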

14.2.3 Recurrent Neural Networks

The family of recurrent neural network (RNN) models has two subclasses that
can be distinguished based on their signal-processing behavior. The first consists
of finite impulse recurrent networks (FRNs), and the second of infinite impulse
recurrent networks (IIRNs). The difference is that a FRN is given by a directed
acyclic graph (DAG) that can be unrolled in time and replaced with a feedforward
neural network, whereas an IIRN is a directed cyclic graph (DCG) for which such
an unrolling is not possible.

14.2.3.1 Hopfield Networks

A Hopfield network (HN) [253] is an example of an FRN. An HN is defined as a fully connected network consisting of McCulloch-Pitts neurons. A McCulloch-Pitts neuron is a binary model with an activation function given by

s = \mathrm{sgn}(x) = \begin{cases} +1 & \text{for } x \ge 0, \\ -1 & \text{for } x < 0. \end{cases}   (14.4)

The activity of the neurons $x_i$, that is,

x_i = \mathrm{sgn}\Big( \sum_{j=1}^{N} w_{ij} x_j - \theta_i \Big),   (14.5)

is updated either synchronously or asynchronously. To be precise, $x_j$ refers to $x_j^{t}$ and $x_i$ to $x_i^{t+1}$ (time progression).
Hopfield networks have been introduced to serve as a model of a content-
addressable ("associative") memory; that is, for storing patterns. In this case, it has
been shown that the weights are given by
w_{ij} = \sum_{k=1}^{P} t_i(k)\, t_j(k),   (14.6)

where P is the number of patterns, t(k) is the kth pattern, and $t_i(k)$ its ith component. From Eq. 14.6, one can see that the weights are symmetrical. An interesting question, in this context, is, "What is the maximal value of P or P/N?" The ratio P/N is called the network capacity (here, N is the total number of neurons). In [236], it was shown that the network capacity is ≈ 0.138.
It is interesting to note that the neurons in a Hopfield network cannot be
distinguished as input neurons, hidden neurons, or output neurons, because at the
beginning every neuron is an input neuron, during the processing, every neuron is a
hidden neuron, and at the end every neuron is an output neuron.
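The following small sketch (our own illustration) implements the Hebbian weight rule of Eq. 14.6 and the asynchronous update of Eq. 14.5 with thresholds θ_i = 0 to recover a stored pattern from a noisy version; the pattern and noise level are illustrative choices.

set.seed(1)
N <- 25
patterns <- matrix(sample(c(-1, 1), 2 * N, replace = TRUE), nrow = 2)  # P = 2 patterns
W <- t(patterns) %*% patterns                # Eq. 14.6 (outer-product rule)
diag(W) <- 0                                 # no self-connections

recall <- function(x, W, steps = 200) {
  for (s in 1:steps) {
    i <- sample(length(x), 1)                    # asynchronous update of one neuron
    x[i] <- ifelse(sum(W[i, ] * x) >= 0, 1, -1)  # Eq. 14.5 with theta_i = 0
  }
  x
}
noisy <- patterns[1, ] * sample(c(1, -1), N, replace = TRUE, prob = c(0.9, 0.1))
mean(recall(noisy, W) == patterns[1, ])      # fraction of correctly recovered bits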

14.2.3.2 Boltzmann Machine

A Boltzmann machine [240] can be described as a noisy Hopfield network because it uses a probabilistic activation function

p(s_i = 1) = \frac{1}{1 + \exp(-x_i)},   (14.7)

where $x_i$ is obtained as in Eq. 14.5. This model is important because it is one of the first neural networks that uses hidden units (latent variables). To learn the
weights, the contrastive divergence algorithm (see Algorithm 14.11) can be used
to train Boltzmann machines. Put simply, Boltzmann machines are neural networks
consisting of two layers — a visible layer and a hidden layer. Each edge between
the two layers is undirected, implying that information can flow in a bidirectional
way. The whole network is fully connected, which means that each neuron in the
network is connected to all other neurons via undirected edge (see Fig. 14.10a and
b).

14.2.4 Overview of General Network Architectures

There are many different network architectures that can be used as deep learning
models. In Table 14.2, we show an overview of some of the most popular deep
learning models, which can be found in the literature [33, 303].
It is interesting to note that some of the models in Table 14.2 are composed
by other networks. For instance, CDBNs are based on RBMs and CNNs [306];
DBMs are based on RBMs [415]; DBNs are based on RBMs and MLPs; dAEs are
stochastic autoencoders that can be stacked on top of each other to build stacked
denoising autoencoders (SdAEs).
14.3 Deep Feedforward Neural Networks 365

Table 14.2 An overview of some popular deep learning models, available learning algorithms
(unsupervised, supervised), and software implementations in R or Python.
Deep learning model Unsupervised Supervised Software
Autoencoder (AE)  Keras [80], R: dimRed [293], h2o [67],
RcppDL [292]
Convolutional deep   R & Python: TensorFlow [2], Keras [80],
belief network (CDBN) h2o [67]
Convolutional neural   R & Python: Keras [80] MXNet [76],
network (CNN) Tensorflow [2], h2O [67], fastai
(Python)[257]
Deep belief network   RcppDL (R) [292], Python: Caffee [269],
(DBN) Theano [464], Pytorch [379], R &
Python: TensorFlow [2], h2O [67]
Deep Boltzmann  Python: boltzmann-machines [52],
machine (DBM) pydbm [78]
Denoising autoencoder  Tensorflow (R, Python) [2], Keras (R,
(dA) Python) [80], RcppDL (R) [292]
Long short-term  RNN (R) [397], OSTSC (R) [114], Keras
memory (LSTM) (R and Python) [80], Lasagne (Python)
[111], BigDL (Python) [94], Caffe
(Python) [269]
Multilayer perceptron  SparkR (R) [484], RSNNS (R) [41],
(MLP) Keras (R and Python) [80], sklearn
(Python) [386], tensorflow (R and
Python) [2]
Recurrent neural  RSNNS (R) [41], RNN (R) [397], Keras
network (RNN) (R and Python) [80]
Restricted Boltzmann   RcppDL (R) [292], deepnet (R) [408],
machine (RBM) pydbm (Python) [78], sklearn (Python)
[78], Pylearn2 [204], TheanoLM [157]

In the following sections, we discuss the major core architectures of deep learning models in detail. Specifically, we discuss deep feedforward neural networks
(D-FFNN), convolutional neural networks (CNNs), deep belief networks (DBNs),
autoencoders (AEs), and long short-term memory networks (LSTMs).

14.3 Deep Feedforward Neural Networks

It can be proven that a feedforward neural network (FFNN), with one hidden
layer and a finite number of neurons in the hidden layer, can approximate any
continuous function on a compact subset of Rn [254]. This is called the universal
approximation theorem. The reason for using a FFNN with more than one hidden
layer is that the universal approximation theorem does not provide information on
how to learn such a network (how to estimate its parameters), which is a difficult

problem. A related issue that contributes to the difficulty of learning such networks
is that their width can become exponentially large. Interestingly, the universal
approximation theorem can also be proven for FFNN with many hidden layers and a
bounded number of hidden neurons [322] for which learning algorithms have been
found. Thus, D-FFNNs are used instead of shallow FFNNs for practical reasons of
learnability.
Formally, the idea of approximating an unknown function f ∗ can be written as
follows:

y = f^*(x) \approx f(x, w) \approx \phi(x^T w).   (14.8)

Here, f is a function from a specific family that depends on the parameters w, and φ is a nonlinear activation function for one layer. For many hidden layers, φ has the form

\phi = \phi^{(n)}\Big( \ldots \phi^{(2)}\big( \phi^{(1)}(x) \big) \ldots \Big).   (14.9)

Instead of guessing the correct family of functions from which f should be chosen,
D-FFNNs learn this function by approximating it via φ, which itself is approximated
by the n hidden layers.
Practically, the learning of the parameters of a D-FFNN (see Fig. 14.2b) can be
accomplished with the backpropagation algorithm, although, for computational
efficiency, nowadays the stochastic gradient descent is used [55]. The stochastic
gradient descent calculates a gradient for a set of randomly chosen training samples
(batch) and updates the parameters for this batch sequentially. This results in faster
learning. A drawback is an increase in imprecision. However, for data sets with a
large number of samples, the speed advantage outweighs this drawback.

14.3.1 Example: Deep Feedforward Neural Networks

In the following, we provide two examples of deep feedforward neural network


model implementations. The first example implements a D-FFNN from scratch,
using R, while the second example uses the Keras package. The first example is for
understanding the functioning of D-FFNN models and mainly performs backward
and forward propagation and readjusts the weights of the neurons. In contrast,
the second example shows that the Keras package allows one to simplify such
an analysis. As a note of caution, we would like to add that the Keras package
hides the complexity of the entire analysis in a black-box. Hence, in order to fully
understand a D-FFNN, it is advised to conduct first an analysis from scratch. See
also our discussion on the same issue in Sect. 10.8 on conducting hypothesis tests.
For the first example, we generate simulated data from a mixture of normal
distributions, giving data for a two-class classification problem. Listing 14.1
generates these data, and a visualization of the resulting data is shown in Fig. 14.3
(top).

Fig. 14.3 A visualization of the simulated data for a two-class classification problem generated in Listing 14.1 (top). Bottom: The training error of the D-FFNN in Listing 14.2 is shown in dependence on the number of iterations for training the model to optimize the weights of the neural network.

Implementing a D-FFNN from scratch requires three main functions, implemented in Listing 14.2. The function dnn.arch() allows an initialization of the
weights, the user-defined layers, and the number of neurons in each layer based
on input data and then generates a list object containing input data (X), weights
of the hidden layer (whi ), output ohi , and the data output y. For input and
each hidden layer, we add also a “bias” neuron. The functions dnn.ffwd() and
dnn.backprop() are for feedforward and backpropagation operations. The function
dnn.ffwd() uses the list object, generated using the function dnn.arch(), to perform
matrix multiplications between the previous layer’s output and the neurons’ weights
sequentially, until it generates the output for the last layer of the neural network.
The function dnn.backprop() takes the updated list object as input from the forward
operation. It calculates the mean square error (loss function) between output and
class labels, and then it calculates derivatives in the feedforward network to optimize
the weights of the neurons of each layer using a gradient descent approach.

The functions implemented in Listing 14.2 are used in Listing 14.1 to generate
the results shown in Fig. 14.3. Specifically, Fig. 14.3 (bottom) shows the training
error, depending on the iterations required for the D-FFNN to learn. As one can
see the D-FFNN converges quickly to a low error, and about 2000 iterations are
sufficient to achieve convergence of the network. Increasing the number of iterations
beyond 2000 does not harm but requires resources that will not lead to a better
model.
For the second example, we use the Keras package to implement the D-FFNN.
To use the functionality of this package, we need to install two libraries, Keras and
Reticulate. In Listing 14.3, we show how to install these packages.
As an example, we develop a simple D-FFNN to classify breast cancer data (class
1, “good prognosis” versus class 2, “bad prognosis”); see Chap. 5 for a heatmap of
the data. In Listing 14.4, we show how such a D-FFNN is defined using Keras.

The input breast cancer data contains 106 input features of different genes and two
output labels. We split the data randomly into two parts for training (80%) and
testing (20%). The initialized model contains an input layer of size 106 (neurons);
three hidden layers of size 128, 64, and 28; and two output neurons in the last
layer, respectively. The loss function for the D-FFNN is mean_squared_error, the
output values of the hidden layers are computed using the function relu(), and the
final output is computed using the function softmax(). The other hyperparameters
used for training the D-FFNN are a dropout rate (0.2, to avoid overfitting and for
error generalization), epoch (50), and batch size (30) for an iterative optimization.
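Since Listing 14.4 itself is not shown here, the following hedged sketch indicates how a model with the architecture described above could be defined with the Keras R interface; the optimizer choice and the one-hot encoding of the labels are our assumptions.

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(106)) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 28, activation = "relu") %>%
  layer_dense(units = 2, activation = "softmax")

model %>% compile(loss = "mean_squared_error",        # loss as described in the text
                  optimizer = "adam", metrics = "accuracy")

# x.train/x.test: 106 gene features; y.train/y.test: assumed one-hot encoded class labels
history <- model %>% fit(x.train, y.train,
                         epochs = 50, batch_size = 30,
                         validation_data = list(x.test, y.test))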
In Fig. 14.4, we show the loss and accuracy, depending on the epochs, for training
and testing (validation). It can be observed that the loss (respectively, the accuracy)
decreases (respectively, increases) for longer epochs. However, this is not the case
for the validation data. For the validation data, there is a discontinuation at moderate
epoch values. This indicates that going beyond a certain number of epochs leads to
deteriorating results. However, this effect is not too severe, as can be seen from the accuracy. Here it is important to note that the results are for one training and
one validation data set. Hence, it is problematic to make far-reaching interpretations
from this.
For this reason, we repeat the above analysis using cross-validation. The results
of 10-fold cross-validation for the loss and accuracy (training and validation) are
shown in Fig. 14.5. In this figure, we can see that the behavior of the training loss,
training accuracy and validation loss are similar to Fig. 14.4; however, the validation
accuracy is different showing a converging behavior. This indicates that the number
of epochs is appropriate to enable learning the D-FFNN. On a general note, we
would like to add that the reason for using cross-validation is to perform model
selection (see Chap. 12). In our case, different D-FFNN models are given by the
choices of hyperparameters, including network architecture, number of neurons per
layer, learning rate, batch size, epoch size, regularization constant, and activation
functions. Since our focus here is on the functioning of a D-FFNN, we do not
perform model selection but just highlight its importance in data science projects.

14.4 Convolutional Neural Networks

A convolutional neural network (CNN) is a special feedforward neural network


that utilizes convolution, ReLU, and pooling layers. Standard CNNs are usually
composed of several feedforward neural network layers, including convolution,
pooling, and fully connected layers.
Typically, in traditional artificial neural networks (ANNs), each neuron in a layer
is connected to all neurons in the next layer, and each connection is a parameter in
the network. This can result in a very large number of parameters. Instead of using
fully connected layers, a CNN uses a local connectivity between neurons; that is, a
neuron is only connected to nearby neurons in the next layer. This can significantly
reduce the total number of parameters in the network.
Fig. 14.4 The loss (top) and validation accuracy (bottom) of a D-FFNN trained with the breast cancer data. The results are obtained using Listing 14.4.

Furthermore, all the connections between local receptive fields and neurons
use a common set of weights, called a kernel. A kernel will be shared with all
the other neurons that connect to their local receptive fields, and the results of
these calculations between the local receptive fields and neurons using the same
kernel will be stored in a matrix called an activation map. The sharing property is
referred to as the weight sharing of CNNs [302]. Consequently, different kernels will
result in different activation maps, and the number of kernels can be adjusted with
hyperparameters. Thus, regardless of the total number of connections between the
neurons in a network, the total number of weights corresponds only to the size of the
local receptive field; that is, the size of the kernel. This is visualized in Fig. 14.6b,
where the total number of connections between the two layers is 9, but the size of
the kernel is only 3.
Fig. 14.5 Boxplots of loss and accuracy of the D-FFNN for training data (first and second rows) and validation data (third and fourth rows) for a tenfold cross-validation using the breast cancer data example. The results are obtained using Listing 14.5.

By combining weight sharing and the local connectivity property, a CNN is able
to handle data with high dimensions. See Fig. 14.6a for a visualization of a CNN
with three hidden layers. In Fig. 14.6a, the red edges highlight the locality property
of hidden neurons; that is, only very few neurons connect to the succeeding layers.
This locality property of CNNs makes the network sparse compared to a FFNN,
which is fully connected.

14.4.1 Basic Components of a CNN


14.4.1.1 Convolutional Layer

A convolutional layer is an essential part of building a convolutional neural network.


Similar to a hidden layer of an ordinary neural network, a convolutional layer has
the same goal, which is to convert the input into a representation of a more abstract
level. However, instead of using full connectivity, the convolutional layer uses local
connectivity to perform the calculations between inputs and the hidden neurons. A
convolutional layer uses at least one kernel to slide across the input, performing
a convolution operation between each input region and the kernel.

Fig. 14.6 (a) An example of a convolutional neural network. The red edges highlight the fact that hidden layers are connected in a "local" way; that is, only very few neurons are connected with the succeeding layers. (b) An example of shared weights and local connectivity in a CNN. The labels w1, w2, w3 indicate the assigned weight for each connection; three hidden nodes share the same set of weights w1, w2, w3 when connecting to three local patches.

The results are
stored in activation maps, which can be seen as the output of the convolutional layer.
Importantly, the activation maps can contain features extracted by different kernels.
Each kernel can act as a feature extractor and will share its weights with all neurons.
For the convolution process, some spatial arguments need to be defined in
order to produce activation maps of a certain size. Essential attributes include the
following:
1. Size of kernels (N). Each kernel has a window size, which is also referred to as
the receptive field. The kernel will perform a convolution operation with a region
matching its window size from the input and produces results in its activation
map.
2. Stride (S). This parameter defines the number of pixels the kernel will move for
the next position. If it is set to 1, each kernel will make convolution operations
around the input volume and then shift 1 pixel at a time until it reaches the
specified border of the input. Hence, the stride can be used to downsize the
dimension of the activation maps since the larger the stride the smaller the
activation maps.
3. Zero-padding (P). This parameter is used to specify how many zeros one wants
to pad around the border of the input. This is very useful for preserving the
dimensions of the input.
Fig. 14.7 Example of calculation of the values in the activation map: a 6 × 6 binary input matrix is convolved with a 3 × 3 kernel, producing a 4 × 4 activation map. Here, the stride is 1 and the zero-padding is 0. The kernel slides by 1 pixel at a time from left to right, starting from the top left position. After reaching the border, the kernel will move to the next row and repeat the process until the entire input matrix is covered. The red area indicates the local patch (receptive field) to be convoluted with the kernel, and the result is stored in the green field in the activation map.

These three parameters are the most common hyperparameters used to control the
output volume of a convolutional layer. Specifically, for an input of dimension
Winput × Hinput × Z, for the hyperparameters size of the kernel (N), stride (S), and
zero-padding (P), the dimension of the activation map, that is, Wout × Hout × D,
can be calculated as follows:
W_{out} = \frac{W_{input} - N + 2P}{S} + 1,
H_{out} = \frac{H_{input} - N + 2P}{S} + 1,   (14.10)
D = Z.
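A small helper implementing Eq. 14.10 could look as follows; the function name is an illustrative choice.

conv.out.dim <- function(W.in, H.in, Z, N, S, P) {
  c(W.out = (W.in - N + 2 * P) / S + 1,      # Eq. 14.10
    H.out = (H.in - N + 2 * P) / S + 1,
    D     = Z)
}
conv.out.dim(W.in = 6, H.in = 6, Z = 1, N = 3, S = 1, P = 0)  # Fig. 14.7 case: 4 x 4 x 1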

An example of how to calculate the result between an input matrix and a kernel
is depicted in Fig. 14.7.
The shared weights and the local connectivity help to reduce significantly the
total number of parameters of the network. For example, assuming that an input
has dimension 100 × 100 × 3, and that the convolutional layers and the number of
kernels is 2, and each kernel has a local receptive field of size 4, then the dimension
of each kernel is 4 × 4 × 3 (3 is the depth of the kernel, which will be the same as
the depth of the input volume). For 100 neurons in the layer, there will be in total
only 4 × 4 × 3 × 2 = 96 parameters in this layer because all 100 neurons will share
the same weights for each kernel. This considers only the number of kernels and the
size of the local connectivity, and does not depend on the number of neurons in the
layer.
In addition to the reduction of the number of parameters, shared weights and
local connectivity are important for processing images efficiently. The reason behind
this is that local convolutional operations on an image result in values that contain
certain characteristics of the image, because in images local values are generally

highly correlated and the statistics formed by the local values are often invariant
in the location [303]. Hence, using a kernel that shares the same weights can
detect patterns from all local regions in the image, and different kernels can extract
different types of patterns from the image.
A nonlinear activation function (for instance, ReLu, tanh, sigmoid, and so on) is
often applied to the values resulting from the convolutional operations between the
kernel and the input. These values are stored in the activation maps, which will later
be passed to the next layer of the network.

14.4.1.2 Pooling Layer

A pooling layer is usually inserted between a convolutional layer and the following layer. Pooling layers aim to reduce the dimension of the input by using some pre-specified pooling method, producing a smaller representation while conserving as much information as possible. A pooling layer is also able to introduce spatial invariance into the network [424], which can help to improve the generalization of the model. To perform pooling, a pooling layer uses a stride, a zero-padding, and a pooling window size as hyperparameters. The pooling layer scans the entire input using the specified pooling window size in the same manner as the kernel does in a convolutional layer. For instance, using a stride of 2, a window size of 2, and a zero-padding of 0 will halve the input dimensions.
There are many types of pooling methods, such as average-pooling and min-pooling, and some advanced pooling methods, such as fractional max pooling and stochastic pooling. The most commonly used pooling method is max pooling, as it has been shown to be superior in dealing with images by capturing invariances efficiently [424]. Max pooling extracts the maximum value within each specified sub-window across the activation map. Max pooling can be formulated as A_{i,j,k} = max(R_{i−n:i+n, j−n:j+n, k}), where A_{i,j,k} is the maximum activation value taken from the region R centered at index (i, j) in the kth activation map, with n determining the window size.
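As an illustration (a minimal sketch, not one of the book's listings), the following Keras call in R halves the spatial dimensions of a 28 × 28 input with a 2 × 2 max pooling window and stride 2:

library(keras)

# Max pooling with window 2 x 2 and stride 2 reduces a 28 x 28 feature map to 14 x 14
model <- keras_model_sequential() %>%
  layer_max_pooling_2d(pool_size = c(2, 2), strides = c(2, 2),
                       input_shape = c(28, 28, 1))

summary(model)   # output shape: (None, 14, 14, 1)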

14.4.1.3 Fully Connected Layer

A fully connected layer is the basic hidden layer unit in FFNN (see Sec. 14.2.2).
Interestingly, for traditional CNN architectures also, a fully connected layer is often
added between the penultimate layer and the output layer to further model nonlinear
relationships of the input features [294, 442, 459]. However, recently the benefit of
this has been questioned because it introduces many parameters, potentially leading
to overfitting [442]. As a result, more and more researchers have started to construct CNN architectures without such a fully connected layer, using other techniques, like max-over-time pooling [284, 316], to replace the role of the linear layers.

14.4.2 Important Variants of CNN

VGGNet VGGNet [442] was a pioneer in exploring how the depth of the network
influences the performance of a CNN. VGGNet was proposed by the Visual
Geometry Group and Google DeepMind, and the authors studied architectures with depths of up to 19 weight layers (compared to 8 for AlexNet [294]).
VGG19 extended the network from 8 weight layers (a structure proposed by
AlexNet) to 19 weight layers by adding 11 more convolutional layers. In total, the
parameters increased from 61 million to 144 million; however, the fully connected
layer takes up most of the parameters. According to their reported results, the error
rate dropped from 29.6 to 25.5 for top-1 val.error (percentage of times the classifier
did not give the correct class with the highest score) on the ILSVRC data set, and
from 10.4 to 8.0 for top-5 val.error (percentage of times the classifier did not include
the correct class among its top 5) using the ILSVRC data set from ILSVRC2014.
This indicates that a deeper CNN structure is able to achieve better results than
shallower networks. In addition, they stacked multiple 3x3 convolutional layers
without incorporating a pooling layer in between to replace a convolutional layer
with a larger filter size; for example, 7x7 or 11x11. They suggested that such an architecture achieves the same receptive fields as one composed of larger filter sizes. Consequently, two stacked 3x3 layers can learn features from a 5x5 receptive field, but with fewer parameters and more nonlinearity.
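The parameter saving can be checked with a quick calculation in R (a sketch assuming C input channels and C output channels and ignoring biases):

# Weights of two stacked 3x3 convolutional layers versus one 5x5 layer,
# assuming C input channels and C output channels (biases ignored)
C <- 64
two_3x3 <- 2 * (3 * 3 * C * C)   # 73728
one_5x5 <- 5 * 5 * C * C         # 102400
two_3x3 / one_5x5                # 0.72, i.e., about 28% fewer parameters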

GoogLeNet with Inception The most intuitive way to improve the performance
of a convolutional neural network is to stack more layers and add more parameters
to the layers [442]. However, this will impose two major problems. One is that too
many parameters will lead to overfitting, and the other is that the model becomes
hard to train.
GoogLeNet [459] was introduced by Google. Until the introduction of the incep-
tion network architecture, traditional state-of-the-art CNN architectures mainly
focused on increasing the size and depth of the neural network, which also
increased the computation cost of the network. In contrast, GoogLeNet introduced
an architecture to achieve state-of-the-art performance with a lightweight network
structure.
The idea underlying an inception network architecture is to keep the network as
sparse as possible while utilizing the fast matrix computation feature provided by a
computer. This idea led to the first inception structure; see Fig. 14.8.
As one can see in Fig. 14.8, several parallel layers, including 1x1 convolution
and 3x3 max pooling, operate at the same level on the input. Each tunnel (namely,
one separated sequential operation) has a different child layer that includes 3x3
convolutions, 5x5 convolutions, and a 1x1 convolution layer. All the results from
each tunnel are concatenated together at the output layer. In this architecture, a 1x1
convolution is used to downscale the input image while preserving input information
[316]. The authors argued that concatenating all the features extracted by different
filters corresponds to the idea that image information should be processed at
different scales and only the aggregated features should be sent to the next level.

[Figure 14.8: an inception block in which, starting from the input layer, parallel tunnels of 1x1 convolutions, 1x1 convolutions followed by 3x3 or 5x5 convolutions, and 3x3 max pooling followed by a 1x1 convolution are applied; their outputs are joined by filter concatenation.]
Fig. 14.8 Inception block structure. Here, multiple blocks are stacked on top of each other,
forming the input layer for the next block.

Hence, the next level can extract features from different scales. Moreover, this sparse
structure, introduced by an inception block, requires fewer parameters, and hence it
is much more efficient.
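A minimal sketch of such an inception block, written with the Keras functional API in R (the input shape and filter counts are illustrative assumptions, not the exact values used in GoogLeNet):

library(keras)

input <- layer_input(shape = c(28, 28, 192))

# Tunnel 1: 1x1 convolution
t1 <- input %>% layer_conv_2d(64, c(1, 1), padding = "same", activation = "relu")

# Tunnel 2: 1x1 convolution followed by 3x3 convolution
t2 <- input %>%
  layer_conv_2d(96, c(1, 1), padding = "same", activation = "relu") %>%
  layer_conv_2d(128, c(3, 3), padding = "same", activation = "relu")

# Tunnel 3: 1x1 convolution followed by 5x5 convolution
t3 <- input %>%
  layer_conv_2d(16, c(1, 1), padding = "same", activation = "relu") %>%
  layer_conv_2d(32, c(5, 5), padding = "same", activation = "relu")

# Tunnel 4: 3x3 max pooling followed by 1x1 convolution
t4 <- input %>%
  layer_max_pooling_2d(pool_size = c(3, 3), strides = c(1, 1), padding = "same") %>%
  layer_conv_2d(32, c(1, 1), padding = "same", activation = "relu")

# All tunnels are concatenated along the filter dimension
output <- layer_concatenate(list(t1, t2, t3, t4))
model  <- keras_model(input, output)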
By stacking the inception structure throughout the network, GoogLeNet won
first place in the classification task on ILSVRC2014, demonstrating the quality of
the inception structure. Subsequently, Inception v1, v2, v3, and the latest version
v4 were introduced. Each generation introduced some new features, making the
network faster, more lightweight, and more powerful.

ResNet In principle, CNNs with a deeper structure perform better than shallow
ones [442]. In theory, deeper networks have a better ability to represent high-
level features from the input and therefore improve the accuracy of predictions
[119]. However, one cannot simply stack more and more layers. In [230], the
authors observed that more layers can actually hinder the performance of the model.
Specifically, in their experiment, they considered a network A with N layers and a
network B with N + M layers, where the initial N layers had the same structure.
Interestingly, when training on the CIFAR-10 and ImageNet data sets, network B
showed a higher training error compared to network A. In theory, the extra M layers
should result in better performance, but instead they obtained higher errors, which
cannot be explained by overfitting. The reason for this is that the loss gets optimized toward a poor local minimum, an effect that differs from the vanishing gradient phenomenon. This is referred to as the degradation problem [230].
ResNet [230] was introduced to overcome the degradation problem of CNNs,
pushing the depth of a CNN to its limit. In [230], the authors proposed a novel
structure of a CNN that, in theory, can be extended to an infinite depth without
losing accuracy. In their paper, they proposed a deep residual learning framework
that consists of multiple residual blocks to address the degradation problem. The
structure of a residual block is shown in Fig. 14.9.
Instead of trying to learn the desired underlying mapping, H(x), with each set of a few stacked layers, the authors added an identity mapping of the input x from the input to the output of these layers, and then let the network learn the residual mapping F(x) = H(x) − x. After adding the identity mapping, the original mapping can be reformulated as H(x) = F(x) + x. The identity mapping is realized by making

Fig. 14.9 The structure of a residual block. Inside a block, there can be many weight layers. [The figure shows the input X passing through stacked weight layers with a ReLU in between to produce F(X), while an identity mapping carries X around these layers; the two paths are added to give F(X) + X, followed by a final ReLU.]

shortcut connections from the input node directly to the output node. This helps address the degradation problem as well as the vanishing (exploding) gradient issue of deep networks. In the extreme case, deeper layers can simply learn the identity map from the input to the output layer by driving the residuals to 0. This ensures that a deep network performs at least as well as a shallower one. In practice, the residuals are never exactly 0, which makes it possible for much deeper layers to always learn something new from the residuals and therefore produce better results. The implementation of ResNet pushed the depth of CNNs to 152 layers by stacking so-called residual blocks throughout the network. ResNet achieved the best result in the ILSVRC2015 classification competition with an error rate of 3.57%.
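A minimal sketch of a residual block in R with the Keras functional API (the input shape and layer sizes are illustrative assumptions; the identity shortcut is realized with layer_add):

library(keras)

input <- layer_input(shape = c(32, 32, 64))

# Two stacked weight layers computing F(X)
fx <- input %>%
  layer_conv_2d(64, c(3, 3), padding = "same", activation = "relu") %>%
  layer_conv_2d(64, c(3, 3), padding = "same")

# Identity shortcut: F(X) + X, followed by the final ReLU
output <- layer_add(list(fx, input)) %>%
  layer_activation("relu")

block <- keras_model(input, output)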

14.4.3 Example: CNN

In Listing 14.6, we present an example using a convolutional neural network in R.


For this example, we use again the Keras package in combination with the MNIST
(Modified National Institute of Standards and Technology database) data. MNIST is
a large data set providing thousands of images of handwritten digits (0–9) frequently
used as benchmark data.
For the MNIST data classification with a CNN, we use three convolutional
layers, each with a max pooling kernel of size 3 × 3 and three fully connected
layers. Specifically, it starts with a convolutional layer followed by a max pooling
layer, repeated twice, and then a convolutional layer is connected to three stacked
dense layers. Each dense layer is a fully connected layer, and the last of these layers is the output layer. The sizes of these three dense layers are 128, 64, and 10 neurons, respectively. We would like to note that for the training we added dropout layers, which randomly drop units (and their connections) between two consecutive layers to avoid overfitting. These dropout layers are included between the convolutional layers and
the dense layers. For the optimization of our loss function, which is the categorical
cross entropy, we use the Adam optimization algorithm. The implementation of this
CNN in R is shown in Listing 14.6.
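Since Listing 14.6 itself is referenced above, the following is only a minimal sketch of such an architecture in R with Keras; the filter counts, dropout rate, and training settings are illustrative assumptions and not necessarily those of Listing 14.6.

library(keras)

# Load and prepare the MNIST data
mnist   <- dataset_mnist()
x_train <- array_reshape(mnist$train$x / 255, c(60000, 28, 28, 1))
y_train <- to_categorical(mnist$train$y, 10)
x_test  <- array_reshape(mnist$test$x / 255, c(10000, 28, 28, 1))
y_test  <- to_categorical(mnist$test$y, 10)

# CNN: (conv + max pooling) twice, a third conv layer, dropout, and three dense layers
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(3, 3)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(3, 3)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(2, 2), activation = "relu") %>%
  layer_flatten() %>%
  layer_dropout(rate = 0.25) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(loss = "categorical_crossentropy",
                  optimizer = optimizer_adam(),
                  metrics = "accuracy")

model %>% fit(x_train, y_train, epochs = 5, batch_size = 128,
              validation_split = 0.1)

model %>% evaluate(x_test, y_test)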

Table 14.3 Sensitivity, specificity, precision, recall, and other scores of test data for a CNN classifying the MNIST data of Listing 14.6.

Class     Sensitivity  Specificity  Precision  Recall  F1    Prevalence  Detection Rate  Balanced Accuracy
Class: 0  0.99         1.00         1.00       0.99    0.99  0.10        0.10            1.00
Class: 1  0.99         1.00         0.99       0.99    0.99  0.11        0.11            1.00
Class: 2  0.99         1.00         0.99       0.99    0.99  0.10        0.10            0.99
Class: 3  0.99         1.00         1.00       0.99    0.99  0.11        0.10            0.99
Class: 4  0.99         1.00         0.99       0.99    0.99  0.10        0.10            0.99
Class: 5  1.00         1.00         0.98       1.00    0.99  0.09        0.09            1.00
Class: 6  0.99         1.00         0.99       0.99    0.99  0.10        0.10            0.99
Class: 7  0.99         1.00         0.99       0.99    0.99  0.10        0.10            1.00
Class: 8  0.99         1.00         0.99       0.99    0.99  0.10        0.10            0.99
Class: 9  0.99         1.00         0.99       0.99    0.99  0.10        0.10            0.99

The results of the analysis are shown in Table 14.3. The table shows various
error measures, including sensitivity, specificity, and F1-score, evaluating the perfor-
mance of the CNN model. Because this is a 10-class classification problem, the corresponding error measures for multi-class classification have to be used; see the discussion in Sect. 9.3.3.1. For reasons of completeness, we would like to add that the balanced accuracy is

    Balanced Accuracy = (Sensitivity + Specificity)/2                    (14.11)

and the detection rate is the same as the true positive rate (TPR).
One can see from the table that the obtained results are very good, indicating that learning the CNN from the used training data poses no problems. This is not surprising because about 60,000 data samples have been used for the training, allowing a nearly flawless classification.

14.5 Deep Belief Networks

A deep belief network (DBN) is a model that combines different types of neural
networks with each other to form a new neural network model. Specifically, DBNs
integrate restricted Boltzmann machines (RBMs) with deep feedforward neural
networks (D-FFNN). The RBMs form the input unit, whereas the D-FFNNs form

the output unit. Frequently, RBMs are stacked on top of each other, meaning that
more than one RBM is used sequentially. This adds to the depth of the DBN.
Due to the different natures of the networks RBM and D-FFNN, two different
types of learning algorithms are used. Practically, the RBMs are used to initialize
the model in an unsupervised way. Thereafter, a supervised method is applied for the
fine-tuning of the parameters [33]. In the following, we describe these two phases
of the training of a DBN in more detail.

14.5.1 Pre-training Phase: Unsupervised

Theoretically, neural networks can learn using only supervised methods. However,
in practice it was found that such a learning process can be very slow. For this
reason, unsupervised learning is used to initialize the model parameters. The
standard neural network learning algorithm (backpropagation) was initially able
to learn only shallow architectures. However, using an RBM for the unsupervised
initialization of the parameters, one obtains a more efficient training of the neural
network [241].
An RBM is a special type of Boltzmann machine (BM); see Sect. 14.2.3.2.
The difference between an RBM and a Boltzmann machine is that RBMs have
constraints in the connectivity of their structure [166]. Specifically, there can be
no connections between nodes in the same layer. For an example, see Fig. 14.10c.
The values of the neurons, v, in the visible layer are known, but the neuron values,
h, in the hidden layer are unknown. The parameters of the network are learned by
defining an energy function, E, of the model, which is then minimized.
Frequently, an RBM is used with binary values; that is, vi ∈ {0, 1} and hi ∈
{0, 1}. The energy function for such a network is given by [237]


    E(v, h) = − Σ_{i=1}^{m} a_i v_i − Σ_{j=1}^{n} b_j h_j − Σ_{i=1}^{m} Σ_{j=1}^{n} v_i h_j w_{i,j} ,                    (14.12)

where Θ = {a, b, W } is the set of model parameters.
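As a small illustration (a sketch, not taken from the book's listings), the energy of Eq. 14.12 can be computed in R as follows:

# Energy of an RBM configuration (Eq. 14.12) for binary vectors v and h,
# biases a and b, and weight matrix W of dimension m x n
rbm_energy <- function(v, h, a, b, W) {
  -sum(a * v) - sum(b * h) - as.numeric(t(v) %*% W %*% h)
}

set.seed(1)
m <- 4; n <- 3
v <- rbinom(m, 1, 0.5); h <- rbinom(n, 1, 0.5)
a <- rnorm(m); b <- rnorm(n); W <- matrix(rnorm(m * n), m, n)
rbm_energy(v, h, a, b, W)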


Each configuration of the system corresponds to a probability defined via the Boltzmann distribution, using the energy function of Eq. 14.12:

    p(v, h) = (1/Z) e^{−E(v,h)} .                    (14.13)
In Eq. 14.13, Z is the partition function given by

    Z = Σ_{v,h} e^{−E(v,h)} .                    (14.14)

Fig. 14.10 Examples of Boltzmann machines. (a) The neurons are arranged on a circle. (b) The
neurons are separated according to their type. Both Boltzmann machines are identical and differ
only in their visualization. (c) Transition from a Boltzmann machine (left) to a restricted Boltzmann
machine (right).

The probability that the network assigns to a visible vector v is obtained by summing over all possible hidden vectors:

    p(v) = (1/Z) Σ_h e^{−E(v,h)} .                    (14.15)

Maximum likelihood estimation (MLE) is used to estimate the optimal param-


eters of this probabilistic model [229]. For a training data set D = Dtrain =
{v1 , . . . , vl }, consisting of l patterns, and assuming that the patterns are i.i.d.
(independent and identically distributed), the log-likelihood function is given by

    L(θ) = ln L(θ|D) = ln Π_{i=1}^{l} p(v_i|θ) = Σ_{i=1}^{l} ln p(v_i|θ) .                    (14.16)

For simple cases, one may be able to find an analytical solution for Eq. 14.16 by solving ∂ ln L(θ|D)/∂θ = 0. However, usually the parameters need to be found numerically. For this, gradient ascent on the log-likelihood is a typical approach by which to estimate the optimal parameters:

    θ^(t+1) = θ^(t) + Δθ^(t) = θ^(t) + η ∂L(θ^(t))/∂θ^(t) − λθ^(t) + νΔθ^(t−1) .                    (14.17)
In Eq. 14.17, the constant, η, in front of the gradient is the learning rate, and
the first regularization term, −λθ (t) , is the weight decay. The weight decay is
used to constrain the optimization problem by penalizing large values of θ [237].
The parameter λ is also called the weight-cost. The second regularization term in
Eq. 14.17 is called the momentum. The purpose of the momentum is to speed up the
learning and reduce possible oscillations. Overall, this should stabilize the learning
process.
For the optimization, the Stochastic Gradient Ascent (SGA) is utilized, using
mini-batches. That means one randomly selects a number of samples, k, from the training set, where k is much smaller than the total sample size, and then estimates the gradient. The parameters, θ, are then updated for the mini-batch. This process is
repeated iteratively until an epoch is completed. An epoch is characterized by using
the whole training set once. A common problem can arise when using mini-batches
that are too large, because this can slow down the learning process considerably.
Frequently, k is chosen between 10 and 100 [237].
Before the gradient can be used, one needs to approximate the gradient in
Eq. 14.17. Specifically, the derivatives with respect to the parameters can be written
in the following form:



    ∂L(θ|v)/∂w_ij = p(H_j = 1|v) v_i − Σ_v p(v) p(H_j = 1|v) v_i ,
    ∂L(θ|v)/∂a_i  = v_i − Σ_v p(v) v_i ,                                             (14.18)
    ∂L(θ|v)/∂b_j  = p(H_j = 1|v) − Σ_v p(v) p(H_j = 1|v) .

In Eq. 14.18, H_j denotes the value of the hidden unit j, and p(v) is the probability defined in Eq. 14.15. For the conditional probability, one finds

    p(H_j = 1|v) = σ( Σ_{i=1}^{m} w_ij v_i + b_j ) ,                    (14.19)

and correspondingly

    p(V_i = 1|h) = σ( Σ_{j=1}^{n} w_ij h_j + a_i ) .                    (14.20)
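These conditional probabilities are easy to express in R (a sketch; W is the m × n weight matrix, a and b are the visible and hidden biases, and the helper names are introduced only for illustration):

sigmoid <- function(x) 1 / (1 + exp(-x))

# p(H_j = 1 | v) for all hidden units j (Eq. 14.19) and
# p(V_i = 1 | h) for all visible units i (Eq. 14.20)
p_h_given_v <- function(v, W, b) sigmoid(as.numeric(t(W) %*% v) + b)
p_v_given_h <- function(h, W, a) sigmoid(as.numeric(W %*% h) + a)

# One Gibbs sampling step starting from a visible vector v:
# h     <- rbinom(length(b), 1, p_h_given_v(v, W, b))
# v_new <- rbinom(length(a), 1, p_v_given_h(h, W, a))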

Using the preceding equations in the presented form would be inefficient because
these equations require a summation over all visible vectors. For this reason, the
contrastive divergence (CD) method is used to increase the speed for the estimation
of the gradient. In Fig. 14.11 A, we present the pseudocode of the CD algorithm.

A.
Input: RBM (with m visible and n hidden units) and mini-batch D (sample size k)
Output: Updates Δw_ij, Δa_i, Δb_j
for v ∈ D do
    v^(0) ← v
    for t = 0, ..., k − 1 do
        for j = 1, ..., n do: sample h_j^(t) ∼ p(h_j | v^(t))
        for i = 1, ..., m do: sample v_i^(t+1) ∼ p(v_i | h^(t))
    for i = 1, ..., m and j = 1, ..., n do
        Δw_ij ← Δw_ij + p(H_j = 1|v^(0)) v_i^(0) − p(H_j = 1|v^(k)) v_i^(k)
        Δa_i ← Δa_i + v_i^(0) − v_i^(k)
        Δb_j ← Δb_j + p(H_j = 1|v^(0)) − p(H_j = 1|v^(k))

B.
Input: Mini-batch D (sample size k)
Output: Updates Δb, Δw
for x ∈ D do
    a^(x,1) ← x
    for l ∈ {2, 3, ..., L} do                            (forward pass)
        z^(x,l) ← w^(l) a^(x,l−1) + b^(l)
        a^(x,l) ← ϕ(z^(x,l))
    δ^(x,L) ← ∇_a E · ϕ′(z^(x,L))                        (output-layer error)
    for l ∈ {L − 1, ..., 2} do                           (backward pass)
        δ^(x,l) ← ((w^(l+1))^T δ^(x,l+1)) · ϕ′(z^(x,l))
for l ∈ {L, L − 1, ..., 2} do
    Δb^(l) ← Δb^(l) + (1/k) Σ_x δ^(x,l)
    Δw^(l) ← Δw^(l) + (1/k) Σ_x δ^(x,l) (a^(x,l−1))^T

C.
Input: Parameters θ, η^+, η^−, Δ_max, Δ_min, Δ^(0), and epoch t
Output: Update Δθ
for each parameter θ do
    if ∂E^(t−1)/∂θ · ∂E^(t)/∂θ > 0 then
        Δ^(t) ← min(Δ^(t−1) · η^+, Δ_max)
        Δθ^(t) ← −sgn(∂E^(t)/∂θ) · Δ^(t)
    elseif ∂E^(t−1)/∂θ · ∂E^(t)/∂θ < 0 then
        Δ^(t) ← max(Δ^(t−1) · η^−, Δ_min)
        if E^(t) > E^(t−1) then θ^(t+1) ← θ^(t) − Δθ^(t−1)
        ∂E^(t)/∂θ ← 0
    elseif ∂E^(t−1)/∂θ · ∂E^(t)/∂θ = 0 then
        Δθ^(t) ← −sgn(∂E^(t)/∂θ) · Δ^(t−1)

Fig. 14.11 (a) Contrastive divergence k-step algorithm using Gibbs sampling. (b) Backpropaga-
tion algorithm. (c) iRprop+ algorithm.

Fig. 14.12 Visualizing the stacking of RBMs in order to learn the parameters Θ of a model in an
unsupervised way.

The CD uses Gibbs sampling to draw samples from conditional distributions


so that the next value depends only on the previous one. This generates a Markov
chain [225]. Asymptotically, for k → ∞ the distribution becomes the true stationary
distribution. Interestingly, k = 1 can already lead to satisfactory approximations for
the pre-training [71].
In general, pre-training of DBNs consists of stacking RBMs. That means the next
RBM is trained using the hidden layer of the previous RBM as a visible layer. This
initializes the parameters for each layer [238]. Interestingly, the order of this training
is not fixed. For instance, the last layer can be trained first, and then the remaining
ones [241]. In Fig. 14.12, we show an example of the stacking of RBMs.

14.5.2 Fine-Tuning Phase: Supervised

After the initialization of the parameters of the neural network, as described in the
previous step, they can be fine-tuned. For this step, a supervised learning approach
is used; that is, the labels of the samples, omitted in the pre-training phase, are now
utilized.
To learn the model, one minimizes an error function (also called a loss function
or sometimes an objective function). An example of such an error function is the
mean squared error (MSE).

    E = (1/2n) Σ_{i=1}^{n} ||o_i − t_i||²                    (14.21)

In Eq. 14.21, oi = φ(xi ) is the ith output from the network function φ : Rm →
Rn , given the ith input xi from the training set D = Dtrain = {(x1 , t1 ), . . . (xl , tl )},
where ti is the target output.
Similar to maximizing the log-likelihood function of an RBM (see Eq. 14.17), one uses gradient descent to find the parameters that minimize the error function as follows:

    θ^(t+1) = θ^(t) − Δθ^(t) = θ^(t) − η ∂E/∂θ^(t) − λθ^(t) + νΔθ^(t−1) .                    (14.22)
Here, the parameters (η, λ, and ν) have the same meaning as explained earlier.
Again, the gradient is typically not used for the entire training data D, but instead
smaller batches are used via the stochastic gradient descent (SGD).
Whereas the gradient of the RBM log-likelihood was approximated using the CD algorithm (see Fig. 14.11a), the gradient of the error function is obtained with the backpropagation algorithm [303]. Let us denote by a_i^(l) the activation of the ith unit in the lth layer (l ∈ {2, . . . , L}), by b_i^(l) the corresponding bias, and by w_ij^(l) the weight for the edge between the jth unit of the (l − 1)th layer and the ith unit of the lth layer. For the activation function, ϕ, the activation of the lth layer with the (l − 1)th layer as input is a^(l) = ϕ(z^(l)) = ϕ(w^(l) a^(l−1) + b^(l)).
Application of the chain rule leads to [370]:


    δ^(L) = ∇_a E · ϕ′(z^(L)) ,
    δ^(l) = ((w^(l+1))^T δ^(l+1)) · ϕ′(z^(l)) ,
    ∂E/∂b_i^(l) = δ_i^(l) ,                                                            (14.23)
    ∂E/∂w_ij^(l) = a_j^(l−1) δ_i^(l) .

In Eq. 14.23, the vector δ^(L) contains the errors of the output layer (L), whereas the vector δ^(l) contains the errors of the lth layer. Here, · indicates the element-wise
product of vectors. From this, the gradient of the error of the output layer is given
by
    ∇_a E = ( ∂E/∂a_1^(L) , . . . , ∂E/∂a_k^(L) ) .                    (14.24)

In general, the result depends on E. For instance, for the MSE we obtain ∂E/∂a_j^(L) = (a_j − t_j). As a result, the pseudocode for the backpropagation algorithm can be formulated as shown in Algorithm 14.11b [370]. The estimated gradients
from Algorithm 14.11b are then used to update the parameters (weights and biases)
via SGD (see Eq. 14.22). More updates are performed using mini-batches until all
training data have been used [444].

Fig. 14.13 The two stages of DBN learning. Left: The hidden layer (purple) of one RBM is the
input of the next RBM. For this reason, their dimensions are equal. Right: The two edges in fine-
tuning denote the two stages of the backpropagation algorithm — the input feedforwarding and the
error backpropagation. The orange layer indicates the output.

The resilient backpropagation algorithm (Rprop) is a modification of the


backpropagation algorithm and was originally introduced to speed up the basic
backpropagation (Bprop) algorithm [405]. There exist at least four different versions
of Rprop [263], and in Algorithm 14.11c the pseudocode for the iRprop+ algorithm (which improves Rprop with weight-backtracking) is shown [444].
As one can see in Algorithm 14.11c, iRprop+ uses information about the sign
of the partial derivative from the time step (t − 1) to make a decision regarding the
update of the parameter. Importantly, the results of comparisons have shown that the
iRprop+ algorithm is faster than Bprop [263].
It has been shown that the backpropagation algorithm with SGD can learn good
neural network models even without a pre-training stage, when the training data are
sufficiently large [303].
In Fig. 14.13, we show an example of the overall DBN learning procedure. The
left-hand side shows the pre-training phase and the right-hand side the fine-tuning.

14.6 Autoencoder

The next model we discuss is an autoencoder. An autoencoder is an unsupervised


neural network model used for representation learning; for example, feature selec-
tion or dimension reduction. A common property of autoencoders is that the size of
the input and output layers is the same, with a symmetric architecture [238]. The

Fig. 14.14 Visualizing the concept of autoencoder learning. The new learned encoding of the
input is represented in the code layer (shown in blue).

underlying idea is to learn a mapping from an input pattern x to a new encoding


c = h(x), which ideally gives an output pattern identical to the input pattern; that
is, x ≈ y = g(c). Hence, the encoding c, which usually has a lower dimension than
x, allows one to reproduce (or code for) x.
The construction of autoencoders is similar to that of DBNs. Interestingly, the
original implementation of an autoencoder [238] pre-trained only the first half of
the network with RBMs and then unrolled the network, creating, in this way, the
second part of the network. Similar to DBNs, a pre-training phase is followed by
a fine-tuning phase. In Fig. 14.14, an illustration of the learning process is shown.
Here, the coding layer corresponds to the new encoding, c, providing, for example,
a reduced dimension of x.
An autoencoder does not utilize labels; it is an unsupervised learning model. In
applications, the model has been successfully used for dimensionality reduction.
Autoencoders can achieve a much better two-dimensional representation of array
data when an adequate amount of data is available [238]. Importantly, Principal
Component Analysis implements a linear transformation, whereas autoencoders are nonlinear. Usually, this results in a better performance. We would like to highlight
that there are many extensions of these models; for example, sparse autoencoder,
denoising autoencoder, or variational autoencoder [104, 395, 485].

14.6.1 Example: Denoising and Variational Autoencoder

In the following, we present two examples of autoencoders for the MNIST data.
The first example is for a denoising autoencoder, and the second is for a variational
autoencoder.
The implementation of the denoising autoencoder is shown in Listing 14.7.
In general, a denoising autoencoder can be based on a FFNN, CNN or LSTM.

Fig. 14.15 Visualization of 10 randomly selected digits (test data) used for predicting the output
of an autoencoder. The original digits are shown in the top row and the noisy digits used for training
in the middle row, and the output of the denoising autoencoder (as produced by Listing 14.7) is
shown in the bottom row.

However, for our example we use convolutional layers for an encoder block
and de-convolutional layers for a decoder block. The encoder block has two
convolutional layers and one fully connected layer with four output nodes. The
decoder block is connected with the output of the encoder. The first three layers
are the transposed convolutional layers for transforming the encoded information in
the reverse direction to regenerate the input data. In Fig. 14.15, we show in the first
row 10 randomly selected digits/samples from MNIST. In the second row, we show
the same samples, but we added normal distributed noise with a mean of zero and
a variance of 0.1. These data are used as training data for the dA. Finally, the third
row shows the output of the denoising autoencoder (dA) itself. From this figure, we
see that the dA is capable of removing the noise from the samples to reconstruct the
true input images, although they have not been used for the training. This example
motivates the name of the model.
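Because Listing 14.7 itself is not reproduced here, the following is only a minimal sketch of a convolutional denoising autoencoder in R with Keras; the layer sizes, noise level, and training settings are illustrative assumptions and not necessarily those of Listing 14.7.

library(keras)

mnist   <- dataset_mnist()
x_train <- array_reshape(mnist$train$x / 255, c(60000, 28, 28, 1))

# Noisy inputs: add Gaussian noise with mean 0 and variance 0.1
x_noisy <- x_train + array(rnorm(length(x_train), mean = 0, sd = sqrt(0.1)),
                           dim = dim(x_train))

# Encoder: two convolutional layers and a small dense bottleneck with 4 output nodes
input <- layer_input(shape = c(28, 28, 1))
encoded <- input %>%
  layer_conv_2d(16, c(3, 3), strides = 2, padding = "same", activation = "relu") %>%
  layer_conv_2d(8,  c(3, 3), strides = 2, padding = "same", activation = "relu") %>%
  layer_flatten() %>%
  layer_dense(units = 4, activation = "relu")

# Decoder: dense layer followed by transposed convolutions back to 28 x 28 x 1
decoded <- encoded %>%
  layer_dense(units = 7 * 7 * 8, activation = "relu") %>%
  layer_reshape(c(7, 7, 8)) %>%
  layer_conv_2d_transpose(8,  c(3, 3), strides = 2, padding = "same", activation = "relu") %>%
  layer_conv_2d_transpose(16, c(3, 3), strides = 2, padding = "same", activation = "relu") %>%
  layer_conv_2d_transpose(1,  c(3, 3), padding = "same", activation = "sigmoid")

dae <- keras_model(input, decoded)
dae %>% compile(optimizer = "adam", loss = "mse")

# Train with noisy digits as input and clean digits as target
dae %>% fit(x_noisy, x_train, epochs = 5, batch_size = 128)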
The implementation of the variational autoencoder (VAE) is shown in List-
ing 14.8. The VAE implements an encoder block, a sampling layer, and a decoding block. The encoder block has four output nodes: two nodes for the latent space parameters z_mean ∈ R² and two nodes for z_var ∈ R². The sampling
block receives the output of the encoder block (which are four latent variables)
and generates samples from the latent variables as follows: z = z_mean +

exp(z_var/2) ∗ epsilon. Here, epsilon is a random variable drawn from a standard


normal distribution. This means that the output of the sampling layer is a random
variable. This is different from an AE, where the same input always results in the same output in a deterministic way. In contrast, a VAE gives for the same input an output that depends on the z generated by the sampling layer.
The sampling layer is created by combining the latent space layers. The decoder
block is merged with the sampling layer, one dense layer, and three transposed
convolutional layers. The implemented loss function considers two losses: the reconstruction loss and the latent loss. The reconstruction loss compares the input and output data of the variational autoencoder model. For the latent loss, the Kullback-Leibler (KL) divergence is used, which compares the latent vector output distribution with the standard normal distribution of zero mean and unit variance (N(0, I_{l/2}), where l is the latent space dimension).
Importantly, a latent output distribution other than a normal distribution leads in
general to a high loss. Furthermore, it has been found that a KL divergence aims to
avoid overfitting and to obtain a sufficient variation for generative properties of the
model.
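A sketch of how these two loss components and the sampling step can be written in plain R (assuming latent outputs z_mean and a log-variance z_log_var and using the common closed-form KL term for a diagonal Gaussian; this is not the exact code of Listing 14.8):

# Reconstruction loss (here: mean squared error between input x and output x_hat)
reconstruction_loss <- function(x, x_hat) {
  mean((x - x_hat)^2)
}

# KL divergence between N(z_mean, exp(z_log_var)) and the standard normal N(0, I)
kl_loss <- function(z_mean, z_log_var) {
  -0.5 * sum(1 + z_log_var - z_mean^2 - exp(z_log_var))
}

# Reparameterization trick used by the sampling layer: z = z_mean + exp(z_log_var / 2) * epsilon
sample_z <- function(z_mean, z_log_var) {
  epsilon <- rnorm(length(z_mean))
  z_mean + exp(z_log_var / 2) * epsilon
}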
The built-in functions sampling() and vae_loss(), used in Listing 14.8, are avail-
able in the Keras library. In Listing 14.8, we provide three objects corresponding to
an encoder (see “encoder” in the Listing), decoder (see “decoder” in the Listing),
and variational autoencoder (see “vae” in the Listing). The training process of data
samples of a VAE automatically updates the weights in these three blocks (encoder,
decoder, and variational autoencoder).
To understand the results from the VAE, we provide three visualizations. The
first visualization is for the latent space of the test data, which is the output from the
encoder. Here, we consider the two means from the output of the encoder model,
z_mean = {z1 , z2 }, which are z1 and z2 , and show them in the scatter plot in Fig.
14.16. The results in this figure have been generated by using Listing 14.10. In this
figure, the digits from 0 to 9 are highlighted by color (see the legend in Fig. 14.16
for the color-codes). This shows the projection of the distribution of the samples
into the latent space obtained from the output of the encoder for the test data.
From the colors of different distributions of digits one can see that the digits are
grouped together. Specifically, the digits 0, 1 and 7 separate nicely while all other
digits show some mixing. Overall, the digits are reasonably separated considering
the fact that we did not fine-tune all hyperparameters of the model. That means
to improve the performance of the model fine-tuning is required, for example, by
optimizing the dropout rate, the size of the latent space, or the number of layers in
the encoder and decoder.
The second visualization we provide demonstrates that one can use the latent
space for understanding a VAE. In order to show this, we generate two-dimensional
uniform numbers equally spaced between −6 and 6 (see Fig. 14.17 (top)) and use
these as surrogates for z1 and z2 within the boundaries of the latent space. These
results can be obtained using Listing 14.10. As one can see in Fig. 14.17 (top), there
is a transition from 0s (on the right-hand side of the figure) to 7s (on the bottom left)
Fig. 14.16 Visualization of the latent space of the test data encoded by the constructed encoder of
VAE shown in Example 14.10.

Listing 14.9: Prediction results by sampling from the latent space for the variational
autoencoder in Listing 14.8

and 1s (on the top left). For a comparison it is useful to look at Fig. 14.16 to see
that these digits are at similar locations. Hence, this shows how the values of z1 and
z2 can be used as samples from the latent space to generate images. This provides
just a different view on the results in Fig. 14.16, where one can see that certain parts
of the latent space are well organized with respect to the separation of the digits
whereas others are mixed-up. Overall, this also shows that the trained VAE can be
used as a generative model when using samples from the latent space as input.

Listing 14.10: Latent space of test data for the variational autoencoder in Listing 14.8

Listing 14.11: Prediction results by sampling from test data for the variational
autoencoder in Listing 14.8

The third visualization uses a trained VAE together with test data to predict novel
outputs/digits. This is generated using Listing 14.11. Here we randomly select 20
digits (from 10000 total test samples) for each digit and use a trained VAE to make a
prediction. The results of these predictions are shown in Fig. 14.17 (bottom). From
this figure, one can see that the reconstruction of the input test image is for most
instances correct; however, there are a few cases that lead to wrong predictions (for
instance, see first row sixth column and rows two and three).
To gain a deeper understanding of a VAE, we suggest the following analysis
for self-study. Use the same input digit multiple times for a trained VAE to
predict its output. What result would you expect? This analysis can be performed
with Listing 14.11 by selecting one input image. Hint: See Listing 14.8 and
study the working mechanism of the function “sampling,” which utilizes epsilon.
Furthermore, compare Listing 14.8 with Listing 14.7 for a denoising autoencoder to
see the difference between these two models.
Fig. 14.17 Two visualizations of the output of a trained VAE. Top: Shown is the output of a decoder
of a trained VAE by inputting data within the boundaries from the latent space. These results are
generated with Listing 14.9. Bottom: Visualization of predicted results of a trained VAE for 20
randomly selected input digits from 0 to 9.

14.7 Long Short-Term Memory Networks

The last model we discuss is the long short-term memory (LSTM) network. LSTM
networks were introduced by Hochreiter and Schmidhuber in 1997 [246]. LSTM
is a variant of an RNN that addresses a shortcoming of standard RNNs, which do not perform well, for example, when handling long-term dependencies [210]. Furthermore, LSTMs avoid the vanishing or exploding gradient problem
[190, 245]. In 1999, an LSTM with a forget gate that could reset the cell memory was
introduced. This improved the initial LSTM and became the standard structure of
LSTM networks [190]. In contrast with deep feedforward neural networks, LSTMs
contain feedback connections. Furthermore, they can process not only single data
points such as vectors or arrays, but also sequences of data. For this reason, LSTMs
are particularly useful for analyzing speech or video data.
Fig. 14.18 Left: A folded structure of an LSTM network model. Right: An unfolded structure of
an LSTM network model. xi is the input data at time i, and yi is the corresponding output (i is the
time step starting from (t − 1)). In this network, only yt+2 , activated by a softmax function, is the
final network output.

14.7.1 LSTM Network Structure with Forget Gate

Figure 14.18 shows an unrolled structure of an LSTM network model [495]. In


this model, the input and output are organized vertically, while the information is
delivered horizontally over time.
In a standard LSTM network, the basic entity is called an LSTM unit or a memory
block [190]. Each unit is composed of a cell, the memory part of the unit, and three
gates: an input gate, an output gate, and a forget gate (also called keep gate) [191].
An LSTM unit can remember values over arbitrary time intervals, and the three
gates control the flow of information through the cell. The central feature of an
LSTM cell is a part called the constant error carousel (CEC) [317]. In general, an
LSTM network is formed exactly like an RNN, except that the neurons in the hidden
layers are replaced by memory blocks.
In Fig. 14.19, we show a schematic description of an LSTM block with one cell.
In the following, we discuss some core concepts and technicalities of LSTMs. Let
W and U denote the weights and b the bias. Then, we have the following definitions:
• Input gate: A unit with sigmoidal function that controls the flow of information
into the cell. It receives its activation from both the output of the previous time
h(t−1) and the current input x (t) . Under the effect of the sigmoid function, an
input gate i t generates values between zero and one. Zero indicates it blocks the
information entirely, whereas values of one allow all the information to pass.

    i^t = σ(W^(ix) x^(t) + U^(ih) h^(t−1) + b^i)                    (14.25)

• Cell input layer: The cell input has a similar flow as the input gate, receiving
h(t−1) and x (t) as input. However, a tanh activation is used to squish input values
into a range between −1 and 1 (denoted by l t in Eq. 14.26).

    l^t = tanh(W^(lx) x^(t) + U^(lh) h^(t−1) + b^l)                    (14.26)


Fig. 14.19 Internal connectivity pattern of a standard LSTM unit (blue rectangle). The output
from the previous time step, h(t−1) and x (t) , are the input to the block at time t, and then the output
h(t) at time t will be an input to the same block in the next time step (t + 1).

• Forget gate: A unit with a sigmoidal function that determines which information
from previous steps of the cell should be memorized or forgotten. The forget
gate f t (see Eq. 14.27) assumes values between zero and one based on the inputs
h(t−1) and x (t) . In the next step, the Hadamard product of f t with old cell state
ct−1 is used to get the updated new cell state ct (see Eq. 14.28). In this case, a
value of zero means the gate is closed, so it will completely forget the information
of the old cell state ct−1 , whereas a value of one will make all information
memorable. Therefore, a forget gate has the right to reset the cell state if the
old information is considered meaningless.

    f^t = σ(W^(fx) x^(t) + U^(fh) h^(t−1) + b^f)                    (14.27)

• Cell state: A cell state stores the memory of a cell over a longer time period [339].
Each cell has a recurrently self-connected linear unit, which is called constant
error carousel (CEC) [246]. The CEC mechanism ensures that an LSTM network
does not suffer from the vanishing or exploding gradient problem [135]. The CEC
is regulated by a forget gate, which can also be used to reset the CEC. At time
t, the current cell state ct is updated by the previous cell state ct−1 , controlled
by the forget gate, and the product of the current input and the cell input; that is,
(i t ◦ l t ). Overall, Eq. 14.28 describes the combined update of a cell state:
    c^t = f^t ◦ c^(t−1) + i^t ◦ l^t .                    (14.28)

• Output gate: A unit with a sigmoidal function that controls the flow of information out of the cell. An LSTM uses the values of the output gate at time t (denoted by o^t) to gate the current cell state c^t, activated by a tanh function, and to obtain the final output vector h^(t) as follows (a minimal R sketch of one complete LSTM step is given after this list):

    o^t = σ(W^(ox) x^(t) + U^(oh) h^(t−1) + b^o) ,                    (14.29)
    h^t = o^t ◦ tanh(c^t) .                    (14.30)
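The following is a minimal sketch of Eqs. 14.25-14.30 in plain R (weight matrices, dimensions, and the function name lstm_step are illustrative assumptions; ◦ corresponds to the element-wise product):

sigmoid <- function(x) 1 / (1 + exp(-x))

# One LSTM step: given input x_t, previous output h_prev, and previous cell state c_prev,
# compute the new cell state c_t and output h_t according to Eqs. 14.25-14.30
lstm_step <- function(x_t, h_prev, c_prev, W, U, b) {
  i_t <- sigmoid(W$i %*% x_t + U$i %*% h_prev + b$i)   # input gate     (14.25)
  l_t <- tanh(   W$l %*% x_t + U$l %*% h_prev + b$l)   # cell input     (14.26)
  f_t <- sigmoid(W$f %*% x_t + U$f %*% h_prev + b$f)   # forget gate    (14.27)
  c_t <- f_t * c_prev + i_t * l_t                      # cell state     (14.28)
  o_t <- sigmoid(W$o %*% x_t + U$o %*% h_prev + b$o)   # output gate    (14.29)
  h_t <- o_t * tanh(c_t)                               # output         (14.30)
  list(h = h_t, c = c_t)
}

# Example with 3 input features and 2 hidden units
set.seed(1)
W <- lapply(setNames(vector("list", 4), c("i", "l", "f", "o")),
            function(x) matrix(rnorm(2 * 3), 2, 3))
U <- lapply(setNames(vector("list", 4), c("i", "l", "f", "o")),
            function(x) matrix(rnorm(2 * 2), 2, 2))
b <- lapply(setNames(vector("list", 4), c("i", "l", "f", "o")),
            function(x) rnorm(2))
out <- lstm_step(x_t = rnorm(3), h_prev = rep(0, 2), c_prev = rep(0, 2), W, U, b)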

14.7.2 Peephole LSTM

A peephole LSTM is a variant of an LSTM proposed in [189]. In contrast with a


standard LSTM, a peephole LSTM uses the cell state c, instead of h, to regulate
the forget gate, input gate, and output gate. In Fig. 14.20, we show the internal
connectivity of a peephole LSTM unit, where the red arrows represent the new
peephole connections.
The key difference between a peephole LSTM and a standard LSTM is that
the peephole LSTM’s forget gate f t , input gate i t , and output gate ot do not

Fig. 14.20 Internal connectivity of a peephole LSTM unit (blue rectangle). Here, x (t) is the input
to the cell at time t, and h(t) is its output. The red arrows are the new peephole connections added,
compared to the standard LSTM in Fig. 14.19.

use h(t−1) as input. Instead, these gates use the cell state ct−1 . To understand
the base idea behind a peephole LSTM, let’s assume that the output gate ot−1 in
a traditional LSTM network is closed. Then the output of the network h(t−1) at
time (t − 1) will be 0, according to Eq. 14.30, and in the next time step t, the
regulating mechanism for all three gates will only depend on the network input x^(t). Therefore, the historical information will be lost completely. A peephole
LSTM avoids this problem by using a cell state instead of the output, h, to control
the gates. The following equations describe a peephole LSTM formally:

    i^t = σ(W^(ix) x^(t) + U^(ic) c^(t−1) + b^i) ,                    (14.31)
    l^t = tanh(W^(lx) x^(t) + b^l) ,                    (14.32)
    f^t = σ(W^(fx) x^(t) + U^(fc) c^(t−1) + b^f) ,                    (14.33)
    o^t = σ(W^(ox) x^(t) + U^(oc) c^(t−1) + b^o) ,                    (14.34)
    c^t = f^t ◦ c^(t−1) + i^t ◦ l^t ,                    (14.35)
    h^t = o^t ◦ c^t .                    (14.36)

Aside from these main forms of LSTMs just described, there are further variants.
For instance, a bidirectional LSTM network (BLSTM) has been introduced in [211],
and can be used to access long-range context in both input directions. Furthermore,
in 2014, the concept of "gated recurrent unit," which is viewed as a simplified
version of LSTM [79], was proposed. In 2015, Wai-kin Wong and Wang-chun
Woo introduced a convolutional LSTM network (ConvLSTM) for precipitation
nowcasting [511]. There are further variants of LSTM networks; however, most
of them are designed for specific application domains without a clear performance
advantage.

14.7.3 Applications

LSTMs have a wide range of applications in text generation, text classification,


language translation, and image captioning [262, 486]. In Fig. 14.21, an LSTM
classifier model for text classification is shown. In this figure, the input of the LSTM
structure at each time step is a word-embedding vector Vi , which is a common
choice for text classification problems. A word-embedding technique maps the
words or phrases in the vocabulary to vectors consisting of real numbers. Some
common word-embedding techniques include word2vec, GloVe, FastText, and so
forth [530]. The output y_N is the corresponding output at the N-th time step, and y′_N is the final output after the softmax activation of y_N, which will determine the classification of the input text.
Fig. 14.21 An LSTM classifier model for text classification, where N is the sequence length of
the input text (the number of words), V1 to VN is a sequence of word embedding vectors used as
input to the model at different time steps, and y′_N is the final prediction result.

14.7.4 Example: LSTM

In this section, we discuss three numerical examples for different variations of


LSTMs; namely, for the following:
1. Time series forecasting
2. Prediction of multiple outputs from multivariate time series data
3. Automatic text generation, by training the BI-LSTM model with a large English
text corpus
The first example is for a time series forecasting model. For this, we use data
for the average global temperature. Specifically, we use monthly data from January
1880 until December 2020. The data are provided by NASA’s official website,
along with various types of geological and image data for the atmosphere of the
Earth. In the first step, we split the time series data into training and testing data.
Next, we prepare data that are used as input and output values. Specifically, we
use the temperatures of (t − n), (t − n + 1), . . . t months as input vector having
n components/features for predicting the temperature of the (t + 1)st month. For
our example, we use n = 6, that is, a lag of six consecutive months. In total, we
use 1542 months for training (shown in blue in Fig. 14.22) and the remaining 152
months for testing (shown in red in Fig. 14.22).
The implementation of the LSTM model is shown in Listing 14.12. Overall, the
LSTM model consists of two hidden LSTM layers with 50 units that connect to a
single output unit. For the loss function of the model we use the mean squared error
(MSE) function. Hyperparameters of the LSTM optimized during training are the

dropout rate, number of epochs, and batch size. Listing 14.12 shows all steps of the
training of the model and includes also a visualization of the prediction results.
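Since Listing 14.12 is not reproduced here, the following is only a minimal sketch of the described model in R with Keras; the synthetic series, data preparation, and training settings are assumptions made for illustration and do not reproduce the NASA data or the exact code of Listing 14.12.

library(keras)

# Synthetic monthly temperature series as a stand-in for the NASA data
set.seed(1)
temps <- sin(seq(0, 60, length.out = 1692)) + rnorm(1692, sd = 0.1)

# Build lagged input/output pairs: 6 consecutive months predict the next month
lag <- 6
n_samples <- length(temps) - lag
x <- array(NA_real_, dim = c(n_samples, lag, 1))
y <- numeric(n_samples)
for (i in seq_len(n_samples)) {
  x[i, , 1] <- temps[i:(i + lag - 1)]
  y[i] <- temps[i + lag]
}

# Two stacked LSTM layers with 50 units each and a single output unit
model <- keras_model_sequential() %>%
  layer_lstm(units = 50, return_sequences = TRUE, input_shape = c(lag, 1)) %>%
  layer_lstm(units = 50) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1)

model %>% compile(loss = "mse", optimizer = optimizer_adam())
model %>% fit(x, y, epochs = 20, batch_size = 32, validation_split = 0.1)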
The visualization of the predicted results is shown in Fig. 14.22. The temperature
values of the training data are shown in blue whereas the temperatures for test data
(actual temperatures) and predicted results are shown in red and green, respectively.
One can see that the predicted results approximately capture the trends of the test
temperature; however, there are also notable differences. There are two potential
reasons to explain these differences. First, the complexity of the Earth's weather patterns requires considering other factors not included in our model, for example, changing solar intensity over time. This means our LSTM is too simple to model all relevant factors. Second, our LSTM requires more fine-tuning of its hyperparameters; for example, we could choose a different architecture, use regularization for the optimization, or use a different loss function. Overall, given the complexity of the problem, which comes from climate science, the results from the LSTM are reasonably accurate. The LSTM can also be seen as a nonparametric regression model because the output is a numerical value and the model itself does not make functional assumptions regarding the output.
Fig. 14.22 Time series of the average global (monthly) temperature. The values shown in blue
were used for the training, whereas the red values were used for testing. The green values are the
predicted values by the LSTM model using Listing 14.12.

For the second example, we use again climate data but this time provided by the
Max Planck Institute for Biogeochemistry (Germany). The data set consists of 14
features, from which we remove the Kelvin temperature feature (Tpot (k)) because
the information in Kelvin temperature is the same as in Celsius temperature, which
we utilize. From the remaining features, we select the temperature (in deg Celsius
(C)) and the pressure as output variables, while the remaining 11 variables are used
as input. Hence, we have multiple outputs.
For this example, the LSTM model is developed to predict the (t + 1)th
temperature and pressure using 11 input features from the preceding time period
t − n, t − n + 1, . . . t, where n = 12. We first normalize the data and then create
an array structure to prepare input data for the LSTM model. In our example, we
use 12,000 samples for the training data and 2000 samples for the testing data.
Our model uses four layers in total — two LSTM layers, one dense layer which is
fully connected, and an output layer with two nodes. Specifically, the first LSTM
layer consists of 128 units, the second LSTM layer of 64 units, and the dense layer
consists of 32 units. We use the mean squared error (MSE) loss function to optimize
the model, that is, we optimize both outputs using the MSE. Listing 14.13 shows
the implementation of this LSTM model.
The predicted results for the pressure and temperature are shown in Fig. 14.23.
Importantly, here the colors have a different meaning compared to Fig. 14.22.
Specifically, for Fig. 14.23 (top), the actual temperatures, regardless of their use
for training or testing, are shown in red and the temperatures in blue and green
correspond to the predicted training and testing data, respectively. That means the
results shown in blue are for in-sample data and the results in green are for out-of-
sample data; see Sec. 4.6 for a discussion. For Fig. 14.23 (bottom) the colors are
different but have the same meaning as discussed above.

As one can see from the figure, the predictions of both outputs capture the trends
well for the in-sample data - shown in blue in Fig. 14.23 (top) and green in Fig. 14.23
(bottom). However, for the test data the differences become larger. This shows again
the complexity of time series predictions for data from climate science.
The third example of an LSTM model is a bidirectional LSTM for text genera-
tion. This model is implemented in Listing 14.14. For our analysis, we use a freely
available online ebook, The Republic, authored by Plato. For the preprocessing, we
convert all uppercase characters into lowercase, remove punctuation and digits, to
obtain a sequence of text. The next step is the tokenization of each word in a text
sequence; for that, we use the functions in the library tokenizers available in R.
Next, we need to prepare input and output data for training and testing. We use
{winput = wi−n , wi−n+1 , . . . wn } words in sequence as input features to predict the
(wn+1 )th word. The model we use consists of one embedding layer, one Bi-LSTM
layer, one LSTM layer, one dense layer fully connected, and one output layer where
the number of nodes equals the total number of tokens. The model is trained with a
small data set containing 50,000 training samples, using a cross-entropy loss.
From Fig. 14.24, one can see that the loss is reduced during training but the
overall improvement is moderate. This impression is confirmed when looking at
the accuracy values, which do not change much. As a reason for this behavior, we
would like to highlight that the multi-class classification problem learned by the
BI-LSTM is 11443-dimensional! This is of course an astonishing number of classes
for a (relatively) small sample size of 50,000 of our training data. Considering this,
the chance of a correct classification of a random classifier, that is, a classifier that
assigns instances randomly to one of the available classes, is 1/11443, which is about a factor of 1000 smaller than the accuracy achieved by our BI-LSTM.
To gain a deeper understanding of this BI-LSTM, we suggest modifying the model in Listing 14.14 by reducing the number of classes. By starting with a small number of classes and increasing it in a stepwise manner, we can study the effect of the number of classes on the performance of the BI-LSTM. Furthermore, it could be interesting to add further layers to increase the complexity of the neural network architecture.
Fig. 14.23 The plots show the temperature (top) and pressure (bottom) values for the second
example. The red curves correspond to the true values of the temperature and the pressure while
the predictions for in-sample data (training data) are shown in blue and green, respectively, and
the predictions for out-of-sample data (testing data) are shown in red and purple, respectively. The
results can be produced with the multi-output LSTM model implemented in Listing 14.13.

Overall, this example shows that a deep learning model cannot perform magic; rather, the combination of model and data determines the achievable performance.

Listing 14.14: BI-LSTM for automatic text generation



Fig. 14.24 The plots showing loss (top) and accuracy (bottom) for the training and validation data
of the text generation model, implemented in Listing 14.14.

14.8 Discussion

14.8.1 General Characteristics of Deep Learning

A property common to all deep learning models is that they perform the so-called
representation learning, which is also called feature learning. This describes a
model that learns new and better representations compared to the raw data. Impor-
tantly, deep learning models do not learn the final representation within one step
but rather multiple ones, corresponding to multilevel representation transformations
between the hidden layers [303].
Another common property of deep learning models is that the subsequent
transformations between layers are nonlinear (see Fig. 14.2). This increases the
expressive power of the model [122]. Furthermore, individual representations are
not designed manually, but are learned via training data [303]. This makes deep
learning models very flexible.

14.8.2 Explainable AI

Any model in data science can be categorized as either an inferential model or a


prediction model [61, 437] (see also Chap. 2). An inferential model does not only
make predictions but also provides an interpretable structure. Hence, it is a model of
the prediction process itself; for example, a causal model. In contrast, a prediction
model is merely a "black-box" model for making predictions.
The models discussed in this chapter aim neither to provide physiological
models of biological neurons nor to offer an interpretable structure. Instead, they
are prediction models. An example of a biologically motivated learning rule for
neural networks is the Hebbian learning rule [231]. Hebbian learning is a form of
unsupervised learning in neural networks that does not use global information about the error, as backpropagation does. Instead, only local information from adjacent neurons is used. There are many extensions of Hebb's basic learning rule that have been
introduced based on new biological insights; see, for example, [136].
Recently, there has been great interest in interpretable or explainable AI (XAI)
[44, 120]. Especially in the clinical and medical area, one would like to have
understandable decisions from statistical prediction models because these affect
patients [251]. The field is still in its infancy, but if meaningful interpretations of
general deep learning models could be found, this would certainly revolutionize the
field.
As a note, we would like to add that the distinction between an explainable AI
model and a non-explainable model is not well-defined. For instance, the sparse
coding model by [374] was shown to be similar to the coding of images in the
human visual cortex [469], and an application of this model can be found in [75],
where an unsupervised learning approach was used to learn an optimal sparse coding

dictionary for the classification of hyperspectral imagery (HSI) data. Some may
consider this model as an XAI model because of the similarity to the working
mechanism of the human cortex, whereas others may question this explanation.

14.8.3 Big Data versus Small Data

In statistics, the field of experimental design is concerned with assessing whether


the available sample sizes are sufficient to conduct a particular analysis (for a
practical example, see [455]). In contrast, for all methods discussed in this chapter,
we assumed that we were in the big data domain, implying sufficient samples. This
corresponds to the ideal case. However, we would like to point out that for practical
applications, one needs to assess this situation on a case-by-case basis to ensure
the available data (the sample sizes) are sufficient for using deep learning models.
Unfortunately, this issue is not well-represented in the current literature. As a rule
of thumb, deep learning models for image processing usually perform well for tens
of thousands of samples, but it is largely unclear how they perform in a small data
setting. It is left to the user to estimate learning curves of the generalization error
for a given model to avoid spurious results [144].
As an example to demonstrate this problem, we want to discuss an analysis
conducted in [155]. There, the influence of the sample size on the accuracy of the
classification of the EMNIST data was explored. Specifically, EMNIST (Extended
MNIST) [86] consists of 280,000 handwritten digits (240,000 training samples
and 40,000 test samples) for 10 balanced classes corresponding to the digits 0 to
9. A long short-term memory (LSTM) model for a 10-class classifier was used.
The model consisted of a four-layer network (three hidden layers and one fully
connected layer), and each hidden layer contained 200 neurons. It was found that in
order to achieve a classification error below 5%, more than 25,000 training samples
(images) are needed. In contrast, in [515] electronic health records (eHR) of patients
corresponding to text data have been analyzed for classifying disorder categories.
As a result, only hundreds of samples were needed to achieve F-scores over 0.75. A
similar order of magnitude for the sample size has been found for gene expression
data [446].
Overall, these results demonstrate that the number of samples needed for deep
learning models depends crucially on the data type. While image data seem very
demanding, other data types require much less data for a deep learning model to
perform well.

14.8.4 Advanced Models

Finally, we would like to emphasize that there are additional but more advanced
models of deep learning networks that are outside the core architectures. For

instance, deep learning and reinforcement learning have been combined to form
deep reinforcement learning [18, 234, 342]. Such models have found application in
problems from robotics and games to health care.
Another example of an advanced model is a graph CNN, which is particularly
suitable when data have the form of graphs [233, 510] (see also Chap. 2). Such
models have been used in natural language processing, recommender systems,
genomics, and chemistry [313, 516].

14.9 Summary

In this chapter, we provided an overview of deep learning models, including deep


feedforward neural networks, (D-FFNN), convolutional neural networks (CNNs),
deep belief networks (DBNs), autoencoders (AE), and long short-term memory
networks (LSTMs). These models can be considered the core architectures that cur-
rently dominate deep learning. In addition, we discussed related concepts needed for
a technical understanding of these models, such as restricted Boltzmann machines
and resilient backpropagation. Given the flexibility of network architectures that
allows a LEGO-like construction of new models, an unlimited number of neural
network models can be constructed using elements of the core architectural building
blocks discussed in this chapter.
Learning Outcome 14: Deep Learning

Deep learning models allow the realization of analysis models similar to


machine learning models. However, the estimated models assume the form
of neural networks, whose parameters can be learned more efficiently in many
practical situations.

We would like to highlight that deep learning does not establish a new learning
paradigm that could not be realized with other machine learning models (see
Chap. 17 for a detailed discussion of learning paradigms). Instead, the difference
is in the numerical estimation of such models where neural network architectures
turn out to provide an economic representation that allows an efficient estimation of
its parameters.

14.10 Exercises

1. Discuss and review the components of a mathematical model of an artificial


neuron.
2. Study the functional forms an activation function can provide. Discuss the
different activation functions in Table 14.1.

3. Implement a simple feedforward neural network using R.


4. Compare the characteristics of deep feedforward neural networks (D-FFNNs),
convolutional neural networks (CNNs), deep belief networks (DBNs), autoen-
coders (AE), and long short-term memory networks (LSTMs) with each other.
5. Repeat the analysis for the D-FFNN by studying the influence of the sample size
on the classification error.
6. Repeat the analysis for the LSTM model for the time series forecasting. Vary the
model parameters and study the effect on the results.
7. What is the difference between an inferential model and a prediction model?
8. How many samples are needed to learn a deep learning model? Can one provide
a generic answer, or does this depend on the situation? What influence does the
data type have on this?
Chapter 15
Multiple Testing Corrections

15.1 Introduction

When discussing statistical hypothesis testing in Chap. 10, we focused on the


underlying concept behind a hypothesis test and on its single application. Here,
“single” application means that the hypothesis test is applied only once. However,
high-dimensional data frequently make it necessary to apply a statistical hypothesis
test multiple times instead of just once. For instance, when analyzing genomic gene
expression data, one is interested in identifying the activity change for each gene.
Given that such data sets contain information for 10,000 to 20,000 genes, one
needs to apply a hypothesis test 10,000 to 20,000 times. Similar problems occur
in psychology when studying patients, or in web science when comparing different
marketing strategies.
In this chapter, we will see that the transition from one test to multiple tests is not
straightforward, but rather requires methodological extensions; otherwise, Type 1
errors (later we will see that there is more than one) will increase. Such approaches
are summarized under the term multiple testing procedures (MTPs) (or multiple
testing corrections (MTCs) or multiple comparisons (MCs)) [123, 131, 373]. In this
chapter, we discuss a number of different MTPs for controlling either the FWER
(family-wise error) or the FDR (false discovery rate).
When discussing statistical hypothesis testing in Chap. 10, we saw that there
are two errors one can encounter: Type 1 error and Type 2 error. Multiple testing
procedures can be evaluated based on these errors. For instance, the FDX (false
discovery exceedance [185]) or PFER (per family error rate [209]) are examples
of Type 1 errors, whereas the FNR (false-negative rate [184]) is a Type 2 error.
However, in practice, the FWER [53] and the FDR [35, 429] (which are both Type
1 errors) are the most popular ones, and they will be our focus in this chapter.
This chapter is organized as follows. In the next section, we present general
preliminaries and definitions required for the subsequent discussion of the MTPs.


Furthermore, we provide information about the practical usage of MTPs using


the statistical programming language R. Then, we examine the problem from both
theoretical and experimental perspectives. In Sect. 15.4, we discuss a categorization
of different MTPs, and in Sects. 15.5 and 15.6 we discuss methods to control the
FWER and the FDR. We finish by discussing the computational complexity of the
most important procedures.

15.2 Preliminaries

In this section, we briefly review some statistical preliminaries needed for the
models discussed in the following sections. First, we provide some definitions of
our formal setting, error measures, and different types of error control. Second, we
describe how to simulate correlated data that can be used for a comparative analysis
of different MTPs. For a practical realization of this, we also provide details about
implementation using the statistical programming language R.

15.2.1 Formal Setting

Let’s assume that we test m hypotheses where H1 , H2 , . . . Hm are the corresponding


null hypotheses and p1 , p2 . . . , pm the corresponding p-values. The p-values are
obtained from a comparison of a test statistic, ti , with a sampling distribution that
assumes Hi is true. Briefly, assuming two-sided tests, the p-values are given by

pi = P r(|ti | > |T (α)||Hi is true), (15.1)

where T (α) is a cutoff value determined by the value of the significance level α
of the individual tests. We indicate the reordered p-values in increasing order as
p(1) , p(2) . . . , p(m) with

p(1) ≤ p(2) · · · ≤ p(m) , (15.2)

and the corresponding reordered null hypotheses are H(1), H(2), . . . , H(m). When
the indices of the reordered p-values are explicitly needed, such as for the minP
procedure discussed in Sect. 15.5.6, these p-values are denoted by

pr1 ≤ pr2 ≤ · · · ≤ prm . (15.3)

In general, an MTP can be applied to p-values or cutoff values; however, corrections


of p-values are more common because one does not need to specify the type of

Fig. 15.1 A contingency table summarizing the outcome of m hypothesis tests.

                Decision
Truth           reject H0       accept H0
H0              N_1|0           N_0|0           m_0
H1              N_1|1           N_0|1           m − m_0
                R               m − R           m

the alternative hypothesis (that is, right-sided, left-sided, or two-sided), and in the
following we focus on these.
Depending on the application, the definition of a set of hypotheses to be
considered for a correction may not always be obvious. In contrast, in genomic
or finance applications, tests for genes or stocks provide such definitions naturally;
for example, as pathways or portfolios. In our context, such a set of hypotheses is
called a family. Hence, an MTP is applied to a family of hypothesis tests.
In Fig. 15.1 we summarize the possible outcome of the m hypothesis tests in a
contingency table (see Chap. 3). Here, we assumed that of the m tests, m0 are true
null hypotheses and m − m0 are false null hypotheses. Furthermore, R is the total
number of rejected null hypotheses of which N1|0 have been falsely rejected.
The MTCs we discuss in this chapter use the following error measures:

FWER = P r(N1|0 ≥ 1); (15.4)


FDR = E[FDP]; (15.5)
PFER = E[N1|0 ]; (15.6)
PCER = E[N_1|0] / m. (15.7)

Here, FWER is the family-wise error, which is the probability of making at least
one Type 1 error. Alternatively, it can be written as

FWER = 1 − P r(N1|0 = 0). (15.8)

The FDR is the false discovery rate. The FDR is the expectation value of the
false discovery proportion (FDP), defined as
FDP = N_1|0 / R   if R ≥ 1,
FDP = 0           if R = 0.    (15.9)

Finally, PFER is the per family error rate, which is the expected number of Type
1 errors, whereas PCER is the per comparison error rate, which is the average
number of expected Type 1 errors across all tests.

Definition 15.1 (Weak Control of FWER) A procedure is said to control the


FWER, in the weak sense, if the FWER is controlled at level α only when all null
hypotheses are true, i.e., when m0 = m [244].

Definition 15.2 (Strong Control of FWER) A procedure is said to control the


FWER, in the strong sense, if the FWER is controlled at level α for any config-
uration of null hypotheses.
Similar definitions, like those for weak and strong control of the FWER just stated,
can be formulated for the control of the FDR. In general, a strong control is superior
because it allows more flexibility regarding the valid configurations.
Formally, an MTP will be applied to the raw p-values p1 , p2 . . . , pm , and,
according to some method-specific rule,

pi ≤ ci , (15.10)

based on cutoff (or critical) values ci , a decision is made to either reject or accept a
null hypothesis. After the application of such an MTP, the problem can be restated
in terms of adjusted p-values; that is, p_1^adj, p_2^adj, . . . , p_m^adj. Typically, the adjusted
p-values are given as functions of the critical values. For instance, for a single-step
Bonferroni correction, the estimation of adjusted p-values corresponds to a
multiplication by a constant factor, p_i^adj = m p_i, where m is the total
number of hypotheses. A more complex example is given by the single-step minP
procedure that uses data-dependent factors [506].
In general, for stepwise procedures, the cutoff values ci are not constant as they
vary with the steps; that is, with the index i. This makes the estimation of the
adjusted p-values more complex. Alternatively, the adjusted p-values can be used
for making a decision based on the significance level of α, as follows:

p_i^adj ≤ α. (15.11)

For historical reasons, we want to mention a very influential conceptual idea that
inspired many MTPs and was introduced by Simes [440]. There, it was proved that
the FWER error is weakly controlled if we reject all null hypotheses when there
exists an index i with i ∈ {1, . . . , m}, such that


p(i) ≤ i α / m. (15.12)
That means the (original) Simes correction rejects either all m null hypotheses or
none. This makes the procedure practically not very useful because it does not
allow one to make statements about individual null hypotheses, but conceptually we
will find similar ideas in subsequent sections; see Sect. 15.6.1 about the Benjamini-
Hochberg procedure.

15.2.2 Simulations Using R

To compare MTPs with each other and identify the optimal correction for a given
problem, we describe a general framework that can be applied. Specifically, we show
how to generate multivariate normal data with certain correlation characteristics.
Since there are many perspectives possible regarding this, we provide two comple-
mentary perspectives and the corresponding practical realization using the statistical
programming language R [398]. Furthermore, we show how to apply MTPs in R.

15.2.3 Focus on Pairwise Correlations

In general, the population covariance between two random variables Xi and Xj is


defined by
 
σ_ij = cov(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)]. (15.13)

From these, the population correlation between Xi and Xj is defined as follows:

ρ_ij = cov(X_i, X_j) / (σ_ii σ_jj) = σ_ij / (σ_ii σ_jj). (15.14)

In matrix notation, the covariance matrix for a random vector X of dimension n is


given by
 
Σ = E[(X − μ_X)(X − μ_X)^T]. (15.15)

By utilizing the correlation matrix ρ, with elements given by Eq. 15.14, the
covariance matrix Σ can be written as

Σ = Dσ ρDσ , (15.16)

with

Dσ = diag(σ11 , . . . , σmm ). (15.17)

That means Dσ is a diagonal matrix.


Hence, by specifying the pairwise correlation between the covariates, the
corresponding covariance matrix can be obtained. This covariance matrix Σ can
then be used to generate multivariate normal data; that is, N(μ, Σ). To simulate
multivariate normal data with a mean vector of μ and a covariance matrix of Σ, one

can use the package mvtnorm [187, 188], available in R. An example is presented in
Listing 15.1.
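For illustration, a minimal sketch along these lines, assuming three covariates with unit variances and a common pairwise correlation of 0.7, could look as follows:

library(mvtnorm)

set.seed(123)
m     <- 3                                   # number of covariates
rho   <- 0.7                                 # assumed common pairwise correlation
R     <- matrix(rho, nrow = m, ncol = m)     # correlation matrix
diag(R) <- 1
sigma <- rep(1, m)                           # standard deviations of the covariates
Sigma <- diag(sigma) %*% R %*% diag(sigma)   # covariance matrix, see Eq. 15.16

X <- rmvnorm(n = 1000, mean = rep(0, m), sigma = Sigma)
round(cor(X), 2)                             # empirical correlations close to rho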

15.2.4 Focus on a Network Correlation Structure

The preceding way to generate multivariate normal data does not allow one to
control the causal structure among the covariates. It controls only the pairwise
correlations. However, for many applications, it is necessary to use a specific
correlation structure that is consistent with the underlying causal relations of the
covariates. For instance, in biology, the causal relations among genes are given by
underlying regulatory networks. In general, such a constrained covariance matrix
is given by a Gaussian graphical model (GGM). The generation of a consistent
covariance matrix is intricate, and the interested reader is referred to [152] for a
detailed discussion.
To simulate multivariate normal data for constrained covariance matrices, one
can use the R package mvgraphnorm [471]. An example is shown in Listing 15.2.

15.2.5 Application of Multiple Testing Procedures

For the correction of the p-values, one can use the function p.adjust(), which is
part of the core R distribution. This function provides the Bonferroni, Holm,
Hochberg, Hommel, Benjamini-Hochberg, and Benjamini-Yekutieli procedures; the
Šidák correction is not part of p.adjust() but can be computed directly via Eq. 15.39.
For the Benjamini-Krieger-Yekutieli and Blanchard-Roquain procedures, one can
use the functions multiple.down() and BlaRoq() from the R package mutoss [47].
For the SD maxT and SD minP, the package multtest [392] can be used (see the
Reference Manual for the complex setting of the functions’ arguments). Recently,

a much faster computational realization has been found for the Hommel procedure,
and it is included in the package hommel [334].

Listing 15.3: Application of MTPs to raw p-values given by p.values.
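A minimal sketch along these lines, assuming the raw p-values are stored in a numeric vector p.values and using only the adjustment methods provided by p.adjust(), is the following:

# Raw p-values from m individual hypothesis tests (here simulated for illustration)
set.seed(1)
p.values <- runif(1000)

p.bonf <- p.adjust(p.values, method = "bonferroni")   # FWER control
p.holm <- p.adjust(p.values, method = "holm")         # FWER control
p.bh   <- p.adjust(p.values, method = "BH")           # FDR control
p.by   <- p.adjust(p.values, method = "BY")           # FDR control

# Number of rejected null hypotheses at a significance level of alpha = 0.05
colSums(cbind(bonferroni = p.bonf, holm = p.holm, BH = p.bh, BY = p.by) <= 0.05)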

15.3 Motivation of the Problem

Before we discuss procedures for dealing with multiple testing corrections, we


present motivations that demonstrate the need for such a correction. First, we present
theoretical considerations that quantify formally the problem of testing multiple
hypotheses and the accompanied misinterpretations of the significance level of a
single hypothesis. Second, we provide an experimental example that demonstrates
these problems impressively.

15.3.1 Theoretical Considerations

Suppose that we are testing three null hypotheses H0 = {H1 , H2 , H3 } indepen-


dently, each for a significance level of α = 0.05. That means for each hypothesis
test Hi with i ∈ {1, 2, 3}, we are willing to accept a probability α of making a
false-positive decision, where α is defined by

α = P r(reject Hi |Hi is true). (15.18)

For these three hypotheses, we would like to know our combined error, or our
simultaneous error, in rejecting at least one hypothesis falsely; that is, we would
like to know

P r(reject at least one H0 |all H0 are true). (15.19)

To obtain this error, we need some auxiliary steps. Assuming the independence
of the null hypotheses, from the α’s of each hypothesis test, it follows that the
probability of accepting all three null hypotheses, H0 , is

Pr(accept all three H0 | all H0 are true) = (1 − α)^3. (15.20)

The reason for this is that 1 − α is the probability of accepting Hi when Hi is true;
that is,

1 − α = P r(accept Hi |Hi is true). (15.21)

Furthermore, because all three null hypotheses are independent of each other,
P r(accept all three H0 |all H0 are true) is just the product of these hypotheses:

Pr(accept all three H0 | all H0 are true) = ∏_{i=1}^{3} Pr(accept Hi | Hi is true)    (15.22)
                                          = (1 − α)^3.                                (15.23)

From this, we can obtain the probability of rejecting at least one H0 as follows:

Pr(reject at least one H0 | all H0 are true) = 1 − Pr(accept all three H0 | all H0 are true)
                                             = 1 − (1 − α)^3.    (15.24)

This is just the complement of the probability in Eq. 15.20. For a significance level
of α = 0.05, we can now calculate

P r(reject at least one H0 |all H0 are true) = 0.14. (15.25)




Fig. 15.2 Shown is the FWER = P r(reject at least one H0 |all H0 are true) against the number of
tests, with α = 0.05 for all tests. The inlay highlights the first ten tests.

That means that although we are only making an error of 5% by falsely rejecting Hi
for a single hypothesis, the combined error for all three tests is 14%.
In Fig. 15.2, we show the generalization of this result for m independent
hypothesis tests given by

FWER = Pr(reject at least one H0 | all H0 are true)    (15.26)
     = 1 − Pr(accept all m H0 | all H0 are true)       (15.27)
     = 1 − (1 − α)^m.                                  (15.28)

As one can see, the probability of rejecting at least one H0 falsely quickly
approaches 1. Here, the dashed red line indicates the number of tests for which
this probability is 95%. That means when testing 59 tests or more, we are almost
certain to make such a false rejection.
The inlay in Fig. 15.2 highlights the first ten tests to show that even with a
moderate number of tests, the FWER is much larger than the significance level of
an individual test. Ideally, one would like strong control of the FWER because this
guarantees control of all possible combinations of true null hypotheses.
These results demonstrate that the significance level of a single hypothesis can
be quite misleading with respect to the error from testing many hypotheses. For this
reason, different methods have been introduced to avoid this explosion in errors by
controlling them.
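This relation is easy to verify numerically; the following short calculation in R reproduces the combined error of 14% for three tests and the value of 59 tests indicated in Fig. 15.2:

alpha <- 0.05
m     <- 1:500
fwer  <- 1 - (1 - alpha)^m    # Eq. 15.28

fwer[3]                       # 0.142625, the combined error for three tests
min(which(fwer >= 0.95))      # 59, the number of tests at which the FWER reaches 95%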

15.3.2 Experimental Example

To demonstrate the practical importance of the problem, an experimental study


was presented in [40]. In their study, they used a post-mortem Atlantic salmon as
subject and showed “a series of photographs depicting human individuals in social
situations with a specified emotional valence, either socially inclusive or socially
exclusive. The salmon was asked to determine which emotion the individual in the
photo must have been experiencing” [40]. Using fMRI neuroimaging to monitor the
brain activity of the deceased salmon, they found that, out of 8064 voxels (brain areas),
16 were statistically significant when testing 8064 hypotheses without any multiple
testing correction.
Because the physiological state of the fish is clear (it is dead), the measured
activities correspond to Type 1 errors. They showed also that by applying multiple
correcting procedures these errors can be avoided. The purpose of their experimental
study was to highlight the severity of the multiple testing problem in general fMRI
neuroimaging studies [39] and the need for applying MTC procedures [367].
Importantly, the preceding problems are not limited to neuroimaging, as similar
problems have been reported in proteomics [115], transcriptomics [125], genomics
[199], Genome-wide association studies [349], finance [223], astrophysics [338],
and high-energy physics [91].

15.4 Types of Multiple Testing Procedures

In general, multiple testing procedures (MTPs) can be categorized in three different


ways:
1. Single-step versus stepwise approaches
2. Adaptive versus nonadaptive approaches
3. Marginal versus joint multiple testing procedures
In the following sections, we discuss each of these categories.

15.4.1 Single-Step versus Stepwise Approaches

Overall, there are three different types of MTPs commonly distinguished by the way
they conceptually compare p-values with critical values [117].
1. Single-step (SS) procedure
2. Step-up (SU) procedure
3. Step-down (SD) procedure
The SU and SD procedures are commonly referred to as stepwise procedures.

Assuming that we have ordered p-values, as given by Eq. 15.2, the procedures
are defined as follows:
Definition 15.3 (Single-Step (SS) Procedure) An SS procedure tests the condition

p(i) ≤ ci (15.29)

and rejects the null hypothesis i if this condition holds.


For an SS procedure, there is no order required for testing the conditions. Hence,
previous decisions are not taken into consideration. Furthermore, usually the critical
values ci are constant for all tests; that is, ci = c for all i.
Definition 15.4 (Step-Up (SU) Procedure) Conceptually, an SU procedure starts
from the least significant p-value, p(m) , and goes toward the most significant p-
value, p(1) , by testing successively if the following condition holds:

p(i) ≤ ci . (15.30)

For the first index, i ∗ , such that this condition holds, the procedure stops and rejects
all null hypotheses j with j ≤ i ∗ ; that is, the procedure rejects the null hypotheses.

H(1), H(2), . . . , H(i*). (15.31)

If such an index does not exist, the procedure does not reject any null hypotheses.
Formally, an SU procedure identifies the index
i* = max{ i ∈ {1, . . . , m} | p(i) ≤ ci }    (15.32)

for the critical values ci . Usually, the ci s are not constant but change with the index,
i.
Definition 15.5 (Step-Down (SD) Procedure) Conceptually, an SD procedure
starts from the most significant p-value, p(1) , and goes toward the least significant
p-value, p(m) , by testing successively if the following condition holds:

p(i) ≤ ci . (15.33)

For the first index i ∗ + 1 such that this condition does not hold, the procedure stops.
Then, it rejects all null hypotheses j with j ≤ i*; that is, it rejects the null hypotheses

H(1), H(2), . . . , H(i*). (15.34)

If such an index does not exist, the procedure does not reject any null hypotheses.
Formally, an SD procedure identifies the index

Fig. 15.3 An example visualizing the differences between an SU and an SD procedure. The
dashed red line corresponds to the critical values ci , and the blue points correspond to the rank-
ordered p-values. The green range indicates p-values identified using an SD procedure, whereas
the orange range indicates p-values identified using an SU procedure.

i* = max{ i ∈ {1, . . . , m} | p(j) ≤ cj for all j ∈ {1, . . . , i} }    (15.35)

for the critical values cj .


Regarding the meaning of both procedures, we want to make two remarks. First,
the direction, either “up” or “down,” is with respect to the significance of p-values.
That means an SU procedure steps toward more significant p-values (it steps up),
whereas an SD procedure steps toward less significant p-values (it steps down).
Second, the crucial difference between an SU procedure and an SD procedure is
that the latter is more strict, requiring all p-values below i ∗ to be significant as well,
whereas the former does not require this.
In Fig. 15.3, we visualize the working mechanisms of an SD and an SU
procedure. The dashed red line corresponds to the critical values ci , and the blue
points correspond to the rank-ordered p-values. Whenever a p-value is below
the dashed red line, its corresponding null hypothesis is rejected; otherwise it is
accepted. The green range indicates p-values identified using an SD procedure,
whereas the orange range indicates p-values identified using an SU procedure. As
one can see, an SU procedure is less conservative than an SD procedure because it
does not have the monotonicity requirement.

15.4.2 Adaptive versus Nonadaptive Approaches

Another way to categorize MTPs is by whether they estimate the number of null
hypotheses m0 from the data or not. The former type of procedure is called an
adaptive procedure (AD), and the latter are nonadaptive (NA) procedures [36, 419].
Specifically, adaptive MTPs estimate the number of null hypotheses m0 from a
given data set and then use this estimate for a multiple-test procedure. In contrast,
nonadaptive MTPs assume m0 = m.

15.4.3 Marginal versus Joint Multiple Testing Procedures

A third way to categorize MTPs is by whether they are using marginal or joint
distributions of the test statistics. Multivariate procedures enable one to take into
account the dependency structure in the data (among the test statistics), and hence
such MTPs can be more powerful than marginal procedures because the latter
just ignore this information. For instance, the dependency structure manifests as
a correlation structure, which can have a noticeable effect on the results.
Usually, procedures using joint distributions are based on resampling
approaches; for example, bootstrapping or permutations [124, 506]. Thus, they
are nonparametric methods, which require computational approaches.

15.5 Controlling the FWER

We start our presentation of MTPs with methods for controlling the FWER [431]. In
the following, we will discuss procedures from Šidák, Bonferroni, Holm, Hochberg,
Hommel, and Westfall-Young. This discussion emphasizes the working mechanisms
of these procedures. In Sect. 15.8, we present a summary of the underlying
assumptions on which the procedures rely.

15.5.1 Šidák Correction

The first MTP we discuss for controlling the family-wise error (FWER) was
introduced by Šidák [439]. Let’s say that we want to control the FWER at a level α.
If we reverse Eq. 15.28, we obtain an adjusted significance level given by

αS = 1 − (1 − α)^(1/m). (15.36)

This equation allows one to calculate, for every FWER of α and every m (number
of hypotheses), the corresponding adjusted significance level αS of the individual
hypotheses. A null hypothesis Hi is rejected if

pi ≤ αS (15.37)

holds. Hence, using αS (m), the FWER is controlled at level α.


The procedure given by Eq. 15.37 is called single-step Šidák correction. For
completeness, we also want to mention that there is a step-down Šidák correction
defined by

p_i ≤ 1 − (1 − α)^(1/(m−i+1)). (15.38)

From Eqs. 15.36 and 15.37, we can derive adjusted p-values for the single-step
Šidák correction, which are given by

p_i^adj = min{1 − (1 − p_i)^m, 1}. (15.39)

These adjusted p-values can alternatively be used to test for significance by


comparing them with the original significance level; that is,

p_i^adj ≤ α. (15.40)

In Fig. 15.4a, we show the adjusted significance level αS of the individual


hypotheses for a single-step Šidák correction, depending on the number of hypothe-
ses m for α = 0.05. As one can see, the adjusted significance level αS quickly
becomes more stringent for an increasing number of hypothesis tests m.
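For example, Eq. 15.36 can be evaluated directly in R:

alpha  <- 0.05
m      <- c(1, 10, 100)
alphaS <- 1 - (1 - alpha)^(1 / m)   # Eq. 15.36
round(alphaS, 5)                    # 0.05000 0.00512 0.00051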

15.5.2 Bonferroni Correction

The Bonferroni correction controls the family-wise error (FWER) under general
dependence [53]. From a Taylor expansion of Eq. 15.36 up to the linear term, we
obtain the following approximation:
αB = α / m. (15.41)
Using Boole’s inequality, one can elegantly show that this controls the FWER [199].
This is the adjusted Bonferroni significance level. We can use this adjusted
significance level to test every p-value and reject the null hypothesis Hi if

pi ≤ αB . (15.42)

Fig. 15.4 A: Single-step Šidák correction. Shown is αS in dependence on m for α = 0.05. B: Bonferroni correction. Shown is the FWER against m for α = 0.05.

From Eqs. 15.41 and 15.42, we can derive adjusted p-values, which are given by

p_i^adj = min{m p_i, 1}. (15.43)

These adjusted p-values can alternatively be used to test for significance by


comparing them with the original significance level; that is,

p_i^adj ≤ α. (15.44)

The corresponding result is shown in Fig. 15.4. Specifically, the FWER in


Eq. 15.28 is shown for the corrected significance level αB given by

FWER = 1 − (1 − αB)^m. (15.45)

As one can see, the FWER is controlled for all m because it is always below α =
0.05. Here, it is important to emphasize that the y-axis range to see the effect is only
from 0.048 to 0.05.

15.5.3 Holm Correction

A modified Bonferroni correction, called the Holm correction, was suggested in


[250]. In contrast with a Bonferroni or a Šidák correction, it is a sequential procedure
that tests ordered p-values. For this reason, it was also called “the sequentially
rejective Bonferroni test” [250].
Let’s denote

p(1) ≤ p(2) · · · ≤ p(m) (15.46)

the ordered sequence of p-values in increasing order. Then, the Holm correction
tests the following conditions in a step-down manner:
Step 1: Reject H(1) if p(1) ≤ α/m;        (15.47)
Step 2: Reject H(2) if p(2) ≤ α/(m−1);    (15.48)
Step 3: Reject H(3) if p(3) ≤ α/(m−2);    (15.49)
...
Step m: Reject H(m) if p(m) ≤ α/1.        (15.50)
If at any step the hypothesis, H(i) , is not rejected, the procedure stops, and all other
p-values, that is, p(i) , p(i+1) , . . . p(m) , are accepted. The preceding testing criteria
of the steps can be written in the following compact form:
p(i) ≤ α / (m − i + 1),    (15.51)

for i ∈ {1, . . . , m}. As one can see, the first step, i = 1, is exactly a Bonferroni
correction, and each following step is in the same spirit, but considers the changed
number of remaining tests. The optimal cutoff index of this SD procedure can be
identified as follows:
i* = max{ i ∈ {1, . . . , m} | p(j) ≤ α/(m − j + 1) for all j ∈ {1, . . . , i} }.    (15.52)

From this, the adjusted p-values of a Holm correction can be derived [131], and
they are given by
p(i)^adj = max_{j ≤ i} { min{ max_{k ∈ {1,...,j}} (m − k + 1) p(k), 1 } }.    (15.53)

The nested character of this formulation comes from the strict requirement of an
SD procedure that all p-values, p(i) , be significant with j ≤ i (see the j index
in Eq. 15.52). An alternative, more explicit form to obtain the adjusted p-values is
given by the following sequential formulation [118]:
p(i)^adj = min{m p(i), 1}                        if i = 1;
p(i)^adj = max{p(i−1)^adj, (m − i + 1) p(i)}     if i ∈ {2, . . . , m}.    (15.54)

A computational realization of a Holm correction is given by Algorithm 10.
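A minimal R sketch of the step-down rule in Eq. 15.51 (not a reproduction of Algorithm 10) is the following; the built-in call p.adjust(p, method = "holm") yields the corresponding adjusted p-values.

# Sketch: step-down Holm procedure applied to a vector of raw p-values
holm_reject <- function(p, alpha = 0.05) {
  m   <- length(p)
  ord <- order(p)                              # indices of p_(1) <= ... <= p_(m)
  reject <- logical(m)
  for (i in seq_len(m)) {
    if (p[ord[i]] <= alpha / (m - i + 1)) {    # condition of Eq. 15.51
      reject[ord[i]] <- TRUE
    } else {
      break                                    # stop; all remaining hypotheses are accepted
    }
  }
  reject
}

set.seed(1)
p <- c(runif(5, 0, 0.001), runif(5))           # five small and five large p-values
holm_reject(p)
all(holm_reject(p) == (p.adjust(p, method = "holm") <= 0.05))   # TRUE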

Similar to a Bonferroni correction, a Holm correction also does not require the
independence of the test statistics and provides strong control of the FWER. In
general, this procedure is more powerful than a Bonferroni correction.

15.5.4 Hochberg Correction

Another MTC that is formally very similar to the Holm correction is the Hochberg
correction [243], shown in Algorithm 11. The only difference is that it is a step-up
procedure.

The adjusted p-values of the Hochberg correction are given by [118]:


p(i)^adj = p(i)                                  if i = m;
p(i)^adj = min{p(i+1)^adj, (m − i + 1) p(i)}     if i ∈ {m − 1, . . . , 1}.    (15.55)

The Hochberg correction is an optimistic approach because it tests backward and


stops as soon as a p-value is significant at level α/(m − k + 1). The SU character
makes it more powerful, and hence the SU Hochberg procedure is more powerful
than the SD Holm procedure.

15.5.5 Hommel Correction

The next MTP we discuss, the Hommel correction [252], is far more complex than
the previous procedures. The method evaluates the set

i* = max{ i ∈ {1, . . . , m} | p(m−i+k) > (k/i) α for all k ∈ {1, . . . , i} },    (15.56)

and determines the maximum index such that this condition holds [19]. If such an
index does not exist, then we reject H(1), H(2), . . . , H(m); otherwise, we reject only
the null hypotheses with p-values such that

p < α / i*.    (15.57)
In Fig. 15.5, we visualize the testing condition for m = 7. In this figure, the
scaling factor, f(i, k) = k/i, is shown as a function of the indices i and k. Each red
point corresponds to an index pair (i, k). Figure 15.5a shows f (i, k) for fixed values
of i, whereas Fig. 15.5b shows f (i, k) for fixed values of k. This leads to the k and
i dependent curves shown in blue, for f (i fixed, k) and f (i, k fixed), respectively.
In both figures, the isolines shown as orange dashed lines connect points with a
constant index h = m − i + k. The points on each blue line (in both figures) have
indices h = m, h = m − 1, and so on, going from high to low values of f (i, k).
These indices are used for the rank-ordered p-values; that is,

p(h) = p(m−i+k) . (15.58)

According to Eq. 15.56, for each index i, there are i different values of k. In
Fig. 15.5a, these correspond to the points on the blue lines, whereas in Fig. 15.5b,
these correspond to the points in vertical columns. In Fig. 15.5b, we highlight just
one of these, shown as a green dashed line.
The following array in Eq. 15.59 shows, for each step of the procedure, the
corresponding conditions that need to be tested. At step 1, the index i = m. For
simplicity, we use the notation F(i, k) = f(i, k) α = kα/i. The procedure stops
only if all conditions for a step (for a column) are true, and i* then corresponds to the
value of i for this step. Otherwise, the procedure continues to the next step.


Fig. 15.5 The factor f (i, k) of the Hommel correction is shown as a function of i and k. (a) The
blue lines correspond to fixed values of i. (b) The blue lines correspond to fixed values of k. The
index h is the argument of the ordered p-values; that is, p(h) .

Step:  1                         2                             ...   m−1                 m
i:     m                         m−1                           ...   2                   1
       p(m)   > F(m, m)          p(m)   > F(m−1, m−1)          ...   p(m)   > F(2, 2)    p(m) > F(1, 1)
       p(m−1) > F(m, m−1)        p(m−1) > F(m−1, m−2)          ...   p(m−1) > F(2, 1)
       ...                       ...                                                     (15.59)
       p(3)   > F(m, 3)          p(3)   > F(m−1, 2)
       p(2)   > F(m, 2)          p(2)   > F(m−1, 1)
       p(1)   > F(m, 1)

As one can see from the triangular-shaped array, the number of conditions per
step decreases by one. Specifically, the smallest p-value from the previous step is
always dropped. This increases the probability that all conditions will hold from one
step to the next since the corresponding values of F (i, k) increase too. Specifically,
F (c, d) < F (c − 1, d) holds for all c since

dα/c < dα/(c − 1)    (15.60)

holds for all c. Hence, the smallest p-values tested per step are equivalent to
stringent decreasing conditions.
In algorithmic form, one can write the Hommel correction as shown in Algo-
rithm 12. In this form, the Hommel correction is less compact but easier to
understand.

It has been found that the Hommel procedure is more powerful than Bonferroni,
Holm, and Hochberg [19]. Finally, we want to mention that recently, a much
faster computational realization of the Hommel procedure was found [334]. This
algorithm has a linear time complexity and leads to an astonishing improvement,
thus allowing its application on millions of tests.

15.5.5.1 Examples

Let’s consider some numerical examples for m = 5. In this case, the general array
from Eq. 15.59 assumes the numerical values shown in Eq. 15.61:

Step:  1              2               3               4               5
i:     5              4               3               2               1
       p(5) > 0.05    p(5) > 0.05     p(5) > 0.05     p(5) > 0.05     p(5) > 0.05
       p(4) > 0.04    p(4) > 0.037    p(4) > 0.033    p(4) > 0.025
       p(3) > 0.03    p(3) > 0.025    p(3) > 0.016                                  (15.61)
       p(2) > 0.02    p(2) > 0.012
       p(1) > 0.01

• Example 1: p(1) = 0.011, p(2) = 0.021, p(3) = 0.031, p(4) = 0.41, p(5) =
0.051. In this case, i ∗ = 5 and α/i ∗ = 0.01. From this, it follows that no
hypothesis can be rejected.

• Example 2: p(1) = 0.009, p(2) = 0.021, p(3) = 0.031, p(4) = 0.41, p(5) =
0.051. In this case, i ∗ = 4 and α/i ∗ = 0.0125. From this, it follows that H(1)
can be rejected.
• Example 3: p(1) = 0.009, p(2) = 0.021, p(3) = 0.024, p(4) = 0.41, p(5) =
0.051. In this case, i ∗ = 3 and α/i ∗ = 0.016. From this, it follows that H(1) can
be rejected.
These examples should demonstrate that the application and outcome of a
Hommel correction are nontrivial.
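A direct, if inefficient, R sketch of the decision rule in Eqs. 15.56 and 15.57 can be used to reproduce these examples; in practice, p.adjust(p, method = "hommel") or the hommel package should be preferred.

# Sketch: Hommel decision rule following Eqs. 15.56 and 15.57
hommel_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  p_sorted <- sort(p)                          # p_(1) <= ... <= p_(m)
  holds <- function(i) {                       # condition p_(m-i+k) > k * alpha / i for all k
    k <- seq_len(i)
    all(p_sorted[m - i + k] > k * alpha / i)
  }
  candidates <- which(sapply(seq_len(m), holds))
  if (length(candidates) == 0) {
    return(rep(TRUE, m))                       # no such index: reject all null hypotheses
  }
  i_star <- max(candidates)
  p < alpha / i_star                           # reject H_i whenever p_i < alpha / i*
}

# Example 2 from above: i* = 4, so only the smallest p-value leads to a rejection
p <- c(0.009, 0.021, 0.031, 0.41, 0.051)
hommel_reject(p)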

15.5.6 Westfall-Young Procedure

For most real-world situations, the joint distribution of the test statistics is unknown.
Westfall and Young made seminal contributions by showing that, in this case,
resampling-based methods can be used under certain conditions to estimate p-
values without many theoretical assumptions [506]. However, in order to do this,
one needs to (1) access the data and (2) be able to resample the data such that
the resulting permutations allow one to estimate the null hypotheses for the test
statistics. The latter is usually possible for two-sample tests, but may be more
involved for other types of tests.
In particular, four such permutation-based methods have been introduced by
Westfall and Young [506]. Two of these are single-step procedures, and two are
step-down procedures. The single-step procedures are called single-step minP:
 
p̃_j = Pr( min_{l ∈ {1,...,m}} P_l ≤ p_j | H_0^C ),    (15.62)

and single-step maxT:


 
p̃_j = Pr( max_{l ∈ {1,...,m}} |T_l| ≥ t_j | H_0^C ).    (15.63)

Their adjusted p-values are given by Eqs. 15.62 and 15.63. Here, H0C is an
intersection of all true null hypotheses, Pl denotes unadjusted p-values from
permutations, and Tl denotes test statistics from permutations. The pj and tj are
the p-values and test statistics from the non-permuted data, respectively.
Without additional assumptions, single-step maxT and single-step minP provide
a weak control of the FWER. However, for subset pivotality, both procedures control
the FWER strongly [506]. Here, subset pivotality is a property of the distribution
of raw p-values, which holds if all subsets of p-values have an identical joint
distribution under the complete null distribution [123, 505, 506] (for a discussion
of an alternative and practically simpler and sufficient condition, see [198]).
Furthermore, the results from single-step maxT and single-step minP are similar
when the test statistics are identically distributed [125].

From a computational perspective, the single-step minP is more demanding than


the single-step maxT because it is based on p-values and not on test statistics. The
difference is that one can get a resampled value of a test statistic from one resampled
data set, whereas for a p-value, one needs a distribution of resampled test statistics,
which can only be obtained from many resampled data sets. This has been termed
double permutation [181].
The step-down procedures are called step-down minP:
p̃_{r_j} = max_{k ∈ {1,...,j}} Pr( min_{l ∈ {k,...,m}} P_{r_l} ≤ p_{r_k} | H_0^C ),    (15.64)

and step-down maxT:


p̃_{s_j} = max_{k ∈ {1,...,j}} Pr( max_{l ∈ {k,...,m}} |T_{s_l}| ≥ t_{s_k} | H_0^C ).    (15.65)

Their adjusted p-values are given by Eqs. 15.64 and 15.65. The indices rk and sk are
the ordered indices; that is, |ts1 | ≥ |ts2 | ≥ · · · ≥ |tsm | and pr1 ≤ pr2 ≤ · · · ≤ prm .
Interestingly, it can be shown that when the Pl are uniformly distributed in [0, 1],
the p-values in Eq. 15.64 correspond to those obtained from the Holm procedure
[181]. That means, in general, the step-down minP procedure is less conservative
than the Holm’s procedure. Again, the step-down minP is computationally more
demanding compared to the step-down maxT due to the required double permuta-
tions. Also, assuming the subset pivotality, both procedures have strong control of
the FWER [404].
The general advantage of using maxT and minP procedures over all the other
procedures discussed in this chapter is their potential use of the dependency struc-
ture among the test statistics. That means, when such a dependency (correlation) is
absent, there is no apparent need to use these procedures. However, most data sets
have some kind of dependency since the associated covariates are usually connected
with each other. In such situations, the maxT and minP procedures can lead to an
improved power.
In algorithmic form, the step-down maxT and step-down minP procedures can
be formulated as shown in Algorithms 13 and 14. For Step 2 in Algorithm 14, the
raw p-value pi,b is obtained using the same permutations from Step 1.
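The following R sketch only illustrates the resampling logic of the step-down maxT procedure for a two-sample comparison; the data matrix X, the binary group labels y, and the use of t-statistics are assumptions for this illustration, and the multtest package mentioned above provides full implementations.

# Sketch: permutation-based step-down maxT adjusted p-values (two-sample setting)
maxT_stepdown <- function(X, y, B = 1000) {
  m <- ncol(X)
  abs_t <- function(x, g) abs(t.test(x[g == g[1]], x[g != g[1]])$statistic)
  t_obs <- apply(X, 2, abs_t, g = y)
  ord   <- order(t_obs, decreasing = TRUE)     # |t_(s1)| >= ... >= |t_(sm)|
  t_ord <- t_obs[ord]
  count <- numeric(m)
  for (b in seq_len(B)) {
    t_perm <- apply(X, 2, abs_t, g = sample(y))[ord]   # permute the group labels
    u <- rev(cummax(rev(t_perm)))              # successive maxima (step-down structure)
    count <- count + (u >= t_ord)
  }
  p_adj <- cummax(count / B)                   # enforce monotonicity of the adjusted p-values
  p_adj[order(ord)]                            # return in the original variable order
}

set.seed(1)
X <- matrix(rnorm(40 * 5), nrow = 40)          # 40 samples, 5 variables under H0
y <- rep(0:1, each = 20)
round(maxT_stepdown(X, y, B = 200), 2)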

15.6 Controlling the FDR

Now we come to a second type of correction method. In contrast with the methods
discussed so far for controlling the FWER, the methods we are discussing in
the next sections are for controlling the FDR. That means these methods have a
different optimization goal. In Sect. 15.8, we present a summary of the underlying
assumptions on which the procedures rely.

15.6.1 Benjamini-Hochberg Procedure

The first method from this category of procedures for controlling the FDR is called
the Benjamini-Hochberg (BH) procedure [35]. The BH procedure can be considered
a breakthrough because it introduced a novel way of thinking to the community.
The procedure assumes ordered p-values, as in Eq. 15.46. Then, it identifies, using
a step-up procedure, the largest index k such that
p(i) ≤ i α / m    (15.68)
holds, and it rejects the null hypotheses H(1) , H(2) , . . . , H(k) . This can be formulated
in the following compact form:
k = max{ i ∈ {1, . . . , m} | p(i) ≤ i α / m }.    (15.69)
If no such index exists, then no hypothesis is rejected.
Conceptually, the BH procedure utilizes the Simes inequality [440]; see
Sect. 15.2.1. In algorithmic form, the BH procedure can be formulated as shown in
Algorithm 15.
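A minimal R sketch of this step-up rule (not a reproduction of Algorithm 15) is given below; the same decision follows from p.adjust(p, method = "BH") combined with the significance level α.

# Sketch: Benjamini-Hochberg step-up rule, see Eqs. 15.68 and 15.69
bh_reject <- function(p, alpha = 0.05) {
  m   <- length(p)
  ord <- order(p)
  below <- p[ord] <= (seq_len(m) / m) * alpha   # p_(i) <= i * alpha / m
  reject <- logical(m)
  if (any(below)) {
    k <- max(which(below))                      # largest index for which the condition holds
    reject[ord[seq_len(k)]] <- TRUE             # reject H_(1), ..., H_(k)
  }
  reject
}

set.seed(1)
p <- c(runif(20, 0, 0.002), runif(80))          # 20 small and 80 large p-values
sum(bh_reject(p))                               # number of rejected null hypotheses
all(bh_reject(p) == (p.adjust(p, method = "BH") <= 0.05))   # TRUE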

The adjusted p-values of the BH procedure [125] are given by


p(i)^adj = min_{j ∈ {i,...,m}} { min( m p(j) / j, 1 ) }.    (15.70)

In general, the BH procedure makes a good trade-off between false positives


and false negatives and works well for independent test statistics or positive
regression dependencies (denoted PRDS), which is a weaker assumption compared
to independence [37, 165, 419]. Generally, it is also more powerful than procedures
for controlling the FWER. The correlation assumptions imply that in the presence
of negative correlations, the control of the FDR is not always achieved. The BH
procedure can also suffer from a weak power, especially when testing a relatively
small number of hypotheses, because in such a situation it is similar to a Bonferroni
correction; see Fig. 15.6b.

15.6.1.1 Example

In Fig. 15.6, we show a numerical example for the Benjamini-Hochberg procedure.


In Fig. 15.6a, we show the rank-ordered p-values for m = 100. The dashed red
line corresponds to a significance level of α = 0.05, and the dashed green line
corresponds to the testing condition in Eq. 15.68.
In Fig. 15.6b, we zoom into the first 30 p-values. Here, we add a Bonferroni
correction, depicted by the dashed orange line at a value of α/m = 5e − 04. One
can see that the BH correction corresponds to a straight line that is always above the
Bonferroni correction. Hence, a BH is always less conservative than a Bonferroni
correction. As a result, for the shown p-values, we obtain 18 significant values for
the BH correction but only 3 significant values for the Bonferroni correction. One
can also see that using the uncorrected p-values with α = 0.05 gives additional
significant values in an uncontrolled manner beyond rank 18.

15.6.2 Adaptive Benjamini-Hochberg Procedure

A modified version of the BH procedure that estimates the proportion of true null
hypotheses, π0 = m0/m, from the data was introduced in [36]. For this reason, this procedure is called the
adaptive Benjamini-Hochberg procedure (adaptive BH).
The adaptive BH procedure modifies Eq. 15.68 by substituting α with α/π0 ,
which gives
p(i) ≤ i α / (π0 m) = i α / m0.    (15.71)

The procedure itself searches, in a step-up manner, the largest index k such that
k = max{ i ∈ {1, . . . , m} | p(i) ≤ i α / (π0 m) }.    (15.72)


Fig. 15.6 Example for the Benjamini-Hochberg procedure. The dashed green line corresponds
to the critical values given by Eq. 15.68. (a) Results for m = 100. (b) Zooming into the first 30
p-values.

If no such index exists, then no hypothesis is rejected; otherwise, the null hypothe-
ses, H(1) , H(2) , . . . , H(k) , are rejected.
The estimator for π0 is found as a result of an iterative search [314] based on

π̂0^BH(k) = (m − k + 1) / ((1 − p(k)) m).    (15.73)

Specifically, the optimal index k is found from


k = min{ i ∈ {2, . . . , m} | π̂0^BH(i) > π̂0^BH(i − 1) }.    (15.74)

The importance of this study does not lie in its practical usage but in the inspira-
tion it provided for many follow-up approaches that introduced new estimators for
π0 . In this chapter, we will provide some examples for this, such as when discussing
the BR-2S procedure and in the summary Sect. 15.8.

15.6.3 Benjamini-Yekutieli Procedure

To improve the BH procedure so as to deal with a dependency structure, a


modification called the Benjamini-Yekutieli (BY) procedure was introduced in [37].
The BY procedure also assumes ordered p-values, as in Eq. 15.46, and then it
identifies, in a stepwise procedure, the largest index k such that
p(k) ≤ k α / (m f(m))    (15.75)

holds, and it rejects the null hypotheses H_0^(1), H_0^(2), . . . , H_0^(k). It is important to
note that here the factor f(m) = Σ_{i=1}^{m} 1/i, which depends on the total number of
hypotheses, is introduced. This can be formulated in the following compact form:

k = max{ i ∈ {1, . . . , m} | p(i) ≤ i α / (m f(m)) }.    (15.76)

If no such index exists, then no hypothesis is rejected.


Since f(m) > 1 for all m, the product m f(m) can be seen as an effective
increase in the number of hypotheses to m′ = m f(m). Hence, the BY procedure
is very conservative, and can be even more conservative than a Bonferroni
correction. For instance, for m ∈ {100, 1000, 10,000, 100,000}, we obtain f(m) =
{5.2, 7.5, 9.8, 12.1}. The adjusted p-values of the BY procedure [125] are given by

p(i)^adj = min_{k ∈ {i,...,m}} { min( m f(m) p(k) / k, 1 ) }.    (15.77)
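The harmonic factor can be checked directly in R, and the BY adjustment itself is available via p.adjust():

f <- function(m) sum(1 / seq_len(m))            # harmonic factor f(m) of the BY procedure
sapply(c(100, 1000, 10000, 100000), f)          # approx. 5.2, 7.5, 9.8, 12.1

# p.adjust(p.values, method = "BY")             # BY-adjusted p-values, see Eq. 15.77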

It has been proved that the BY procedure controls the FDR in the strong sense
by
FDR ≤ (m0 / m) α = π0 α,    (15.78)
for any type of dependent data [37]. Since m0 ≤ m always, the FDR is controlled
either at level α (for m0 = m) or even below that level. A disadvantage of the BY
procedure is that it is less powerful than BH.

15.6.3.1 Example

In Fig. 15.7, we show a numerical example for the Benjamini-Yekutieli procedure.


Here, the BY correction corresponds to the dashed red line, which is always below
the BH correction (dashed green line), indicating that it is more conservative.
Interestingly, the line for the BY correction intersects with the Bonferroni
correction (dashed orange line) at rank 5 (see inlay). That means below this


Fig. 15.7 Example of the Benjamini-Yekutieli procedure for m = 100. Both figures show only a
subset of the results, up to ranks 30 and 10, respectively, in order to see the effect of a BY correction.

value the BY correction is more conservative, and it is less conservative after the
intersection. For the p-values in this example, the BY gives no significant results.
This indicates the potential problem with the BY procedure in practice because its
conservativeness can lead to no significant results at all.

15.6.4 Benjamini-Krieger-Yekutieli Procedure

Yet another modification of the BH procedure was introduced in [38]. This MTP
is an adaptive two-stage linear step-up method, called BKY (Benjamini-Krieger-
Yekutieli). Here, “adaptive” means that the procedure estimates the number of null
hypotheses from the data and uses this information to improve the power. This
approach is motivated by Eq. 15.78 and the dependency of the control on m0 .
Step 1 Use a BH procedure with α′ = α/(1 + α). Let r be the number of hypotheses
rejected. If r = 0, no hypothesis is rejected. If r = m, all m hypotheses are rejected.
In both cases, the procedure stops. Otherwise proceed.
Step 2 Estimate the number of null hypotheses by m̂0 = m − r.
Step 3 Use a BH procedure with α″ = m α′/m̂0 = α′/π̂0.
The BKY procedure utilizes the BH procedure twice: to estimate the number of
null hypotheses m̂0 in the first stage and to declare significance in the second stage.
The BKY procedure controls the FDR exactly at level α when tests are
independent. In [38], it has been shown that this procedure has higher power than
BH.

15.6.5 Blanchard-Roquain Procedure

A generalization of the Benjamini-Yekutieli procedure was introduced by Blanchard


and Roquain [46].

15.6.5.1 BR-1S Procedure

The first procedure, introduced in [46], is a one-stage adaptive step-up procedure


called BR-1S, independently proposed in [419]. Formally, the BR-1S procedure [46]
first defines an adaptive threshold by
t(i) = min{ λ, i α (1 − λ) / (m − i + 1) }    (15.79)

for λ ∈ (0, 1) and for all i ∈ {1, . . . , m}. Then, the largest index k is determined as
follows:
k = max{ i ∈ {1, . . . , m} | p(i) ≤ t(i) }.    (15.80)

If no such index exists, then no hypothesis is rejected; otherwise, all the null
hypotheses with p-values such that p(i) ≤ t(k) are rejected.
For the BR-1S procedure, it has been proved that the FDR is controlled by
FDR ≤ min{ λ, α (1 − λ) m }.    (15.81)

A brief calculation shows that both arguments of the preceding equations are equal,
for
λ(m) = α m / (1 + α m).    (15.82)

A further calculation shows that Eq. 15.82 is monotonically increasing in m, and
for m ≥ 2, we find λ(m) > α. That means one needs to choose λ values smaller
than the value on the right-hand side of Eq. 15.82 to be able to control the FDR [46].
Hence, a common choice for λ, in Eq. 15.81, is λ = α because this controls the FDR
at the α level; that is, FDR ≤ α.
For λ = α, the adaptive threshold simplifies and becomes
t(i) = α min{ 1, i (1 − α) / (m − i + 1) }.    (15.83)

For i ≤ (m + 1)/2, the adaptive threshold simplifies even further to



t(i) = α i (1 − α) / (m − i + 1).    (15.84)

15.6.5.2 BR-2S Procedure

The second procedure introduced in [46] is a two-stage adaptive plug-in procedure


called BR-2S, given by
Stage 1 Determine R(λ1), the number of rejections of the BR-1S(λ1) procedure.
Stage 2 Use α′ = α/π̂0 with

π̂0^BR = (m − R(λ1) + 1) / ((1 − λ2) m)   for λ2 ∈ (0, 1)    (15.85)

in the step-up procedure given by Eq. 15.72. That means the estimate for the
proportion of null hypotheses is used to find the largest index k such that
k = max{ i ∈ {1, . . . , m} | p(i) ≤ i α / (π̂0^BR m) }.    (15.86)

If no such index exists, then no hypothesis is rejected; otherwise, the null


hypotheses, H(1) , H(2) , . . . , H(k) , are rejected.
The BR-2S procedure depends on two parameters, denoted λ1 and λ2 . The first
parameter is for BR-1S in stage one, whereas the second is used to estimate the
proportion of null hypotheses in stage two. It has been proved in [46] that by setting
λ1 = α/(1 + α + 1/m) in step 1 of the BR-2S procedure one obtains FDR= λ. This
suggests setting λ = α in stage 2. The BR-1S and BR-2S procedures are proven to
control the FDR for arbitrary dependence.

15.7 Computational Complexity

When performing MTCs for high-dimensional data, the computation time required
by a procedure can have an influence on its selection. For this reason, we present in
this section a comparison of the computation time for different methods, depending
on the dimensionality of the data.
In the following, we apply seven MTPs, namely, Bonferroni, Holm,
Hochberg, Hommel (using two different implementations), Benjamini-Hochberg, and Benjamini-Yekutieli, to
p-values of varying size for m ∈ {100, 20000, 50000}. In Table 15.1 we show the
mean computation times in seconds averaged over ten independent runs.
One can see that there are large differences in the computation times. By far,
the slowest method is Hommel. For instance, correcting m = 50,000 p-values
takes over 400 times longer compared to a Bonferroni correction. This method

Table 15.1 Computational analysis for seven MTPs. Shown are the average computation times in
seconds for m tests.
Method                 Error control   m = 100        m = 20,000     m = 50,000
Bonferroni FWER 4.997253e-05 6.375313e-04 0.003246069
Holm FWER 8.559227e-05 1.908016e-03 0.004300451
Hochberg FWER 7.801056e-05 1.903248e-03 0.005757761
Hommel FWER 1.836157e-03 1.432532e+01 1.389162673
Benjamini-Hochberg FDR 9.551458e-05 1.955366e-03 0.004550409
Benjamini-Yekutieli FDR 8.130074e-05 2.041864e-03 0.004711628
Hommel∗ FWER 2.599716e-04 1.967573e-03 0.004542589

also has the worst scaling, which means practical applications need to take this
into consideration. This computational complexity could already be anticipated
based on our discussion in Sect. 15.5.5, because the Hommel correction is much
more involved than all the other procedures. However, the new algorithm in
[334] (indicated by ∗ ) leads to an astonishing improvement in the computational
complexity for this method.
Furthermore, from Table 15.1, there are essentially two groups of computational
times. In the group of the fastest methods, we have Bonferroni, Holm, Hochberg,
Benjamini-Hochberg, and Benjamini-Yekutieli, and the group of slowest methods
includes only Hommel. As one can see, applying MTPs to tens of thousands of
hypotheses (p-values) is feasible without problems.

15.8 Comparison

In this chapter, we discussed many multiple testing procedures for controlling the
FWER and the FDR, and we categorized them as follows:
1. Single-step versus stepwise approaches
2. Adaptive versus nonadaptive approaches
3. Marginal versus joint multiple testing procedures
When it comes to the practical application of an MTP, one needs to realize that to
select a method, there is more to it than the control of an error measure. Specifically,
while a given MTP may guarantee the control of an error measure, for example, the
FWER or the FDR, this does not inform us about the Type 2 error/power of the
procedure. This is important for practical applications because if one cannot reject
any null hypotheses, there is usually nothing to explore.
To find the optimal procedure for a given problem, the best approach is to conduct
simulation studies and compare different MTPs. Specifically, for a given data set,
one can diagnose its characteristics, such as by estimating the presence and the
structure of correlations, and then simulate data following these characteristics. This

Table 15.2 Summary of MTC procedures. PRDS stands for positive regression dependencies.

Method                       Error control  Procedure type  Error control type  Correlation assumed
Šidák                        FWER           Single-step     Strong              Non-negative
Šidák                        FWER           Step-down       Strong              Non-negative
Bonferroni                   FWER           Single-step     Strong              Any
Holm                         FWER           Step-down       Strong              Any
Hochberg                     FWER           Step-up         Strong              PRDS
Hommel                       FWER           Step-down       Strong              PRDS
maxT                         FWER           Single-step     Strong              Subset pivotality
minP                         FWER           Single-step     Strong              Subset pivotality
maxT                         FWER           Step-down       Strong              Subset pivotality
minP                         FWER           Step-down       Strong              Subset pivotality
Benjamini-Hochberg           FDR            Step-up         Strong              PRDS
Benjamini-Yekutieli          FDR            Step-up         Strong              Any
Benjamini-Krieger-Yekutieli  FDR            Step-up         Strong              Independence
BR-1S                        FDR            Step-up         Strong              Any
BR-2S                        FDR            Two-stage       Strong              Any

ensures that the simulations are problem-specific and adopt the characteristics of the
data as closely as possible.
The advantage of this approach is that the selection of an MTP is not based on
generic results from the literature but is tailored to your problem. The disadvantage
is the effort it takes to estimate, simulate, and compare the different procedures.
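A minimal sketch of such a problem-specific simulation is shown below; the correlation strength, effect size, and dimensions are illustrative choices and should be replaced by values estimated from the data at hand.

library(MASS)
set.seed(1)
m <- 200; m1 <- 20; n <- 25; rho <- 0.3
Sigma <- matrix(rho, m, m); diag(Sigma) <- 1     # equicorrelated covariance
mu <- c(rep(0.8, m1), rep(0, m - m1))            # the first m1 null hypotheses are false
x <- mvrnorm(n, mu = mu, Sigma = Sigma)
p <- apply(x, 2, function(v) t.test(v)$p.value)  # one-sample t-test per variable
sapply(c("bonferroni", "BH", "BY"), function(meth)
  mean(p.adjust(p, method = meth)[1:m1] < 0.05)) # fraction of true effects detected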
If such a simulation approach is not feasible, one needs to revert to results from
the literature. In Table 15.2, we show a summary of MTPs and some important
characteristics. Furthermore, from a multitude of simulation studies, the following
results have been found independently:
• Positive correlations (simulated data): BR is more powerful than BKY [46].
• General correlations (real data): BY has a higher PPV compared to BH [291].
• Positive correlations (simulated data): BKY is more powerful than BH [38].
• Positive correlations (simulated data): Hochberg, Holm, and Hommel do not
control the PFER for high correlations [174].
• General correlations (real data): SS MaxT has higher power compared to SS
minP [125, 311, 504].
• General correlations (real data): SS MaxT and SD MaxT can be more powerful
than Bonferroni, Holm, and Hochberg [125].
• Random correlations (simulated data): SD minP is more powerful than SD maxT
[311].

The earlier mentioned simulation studies considered all the correlation structures
in the data because this is of practical relevance. Since there is more than one
type of correlation structure, the possible number of different studies to consider
for all these different structures is huge. Specifically, one can assume homogeneous
or heterogeneous correlations. The former assumes that pair-wise correlations are
equal throughout the different pairs, whereas the latter assumes inequality. For
heterogeneous correlation structures, one can further assume a random structure
or a particular structure. For instance, a particular structure can be imposed from
an underlying network; for example, a gene regulatory network among genes [110]
(see also Chap. 5). Hence, for the simulation of such data, the covariance structure
needs to be consistent with a structure of the underlying network [152].

15.9 Summary

In this chapter, we discussed multiple testing corrections (MTCs) because they


provide important extensions to the framework of hypothesis testing (discussed in
Chap. 10) [186, 336, 389, 407]. As we have seen, there is a large number of different
methodological correction procedures allowing the control of the FWER or the FDR
[63, 116, 391, 452]. However, especially for correlated data, many more methods
can be expected in the coming years because high-dimensional data are usually
correlated, at least to some extent, and there is great potential to further improve
those methods by using tailored MTCs.
Learning Outcome 15: Multiple Testing Corrections

A multiple testing correction procedure aims to control the Type 1 error


(FWER or FDR) and at the same time tries to maximize the power of the
test.

In this chapter, we categorized MTPs as (1) single-step versus stepwise


approaches, (2) adaptive versus nonadaptive approaches, and (3) marginal versus
joint multiple testing procedures. While single-step procedures apply the same
constant correction to each test, stepwise procedures are variable in their correction,
and decisions also depend on previous steps. Furthermore, the latter procedures
are based on rank-ordered p-values of the tests, and they inspect these values in
either decreasing (step-down) or increasing (step-up) order of their significance. To
fully explain each of these concepts, we discussed procedures from all categories.
Specifically, we discussed single-step corrections of Šidák [439], Bonferroni [53],
and Westfall and Young [506]; stepwise procedures of Holm [250], Hochberg
[243], Hommel [252], Benjamini and Hochberg [35], Benjamini-Yekutieli [37],
and Westfall and Young (maxT and minP) [506]; and the multistage procedure of
Benjamini-Krieger-Yekutieli [38].

15.10 Exercises

1. Identify a problem in science or industry that requires the testing of multiple


hypotheses. Discuss for this problem the difference between testing one and
testing multiple hypotheses.
2. Reproduce the results shown in Fig. 15.2.
a. Calculate FWER = Pr(reject at least one H0 | all H0 are true) for α = 0.05.
b. Calculate FWER = Pr(reject at least one H0 | all H0 are true) for α = 0.001.
c. Calculate FWER = Pr(reject at least one H0 | all H0 are true) for α = 0.0001.
3. Discuss the Holm procedure summarized by Algorithm 10 by giving a numerical
example.
4. Suppose we perform ten hypothesis tests (e.g., testing the effect of ten marketing
campaigns compared to the current strategy) and obtain the following p-values:

0.0140, 0.2960, 0.9530, 0.0031, 0.1050, 0.6410, 0.7810, 0.9010, 0.0053, 0.4500

a. Which hypotheses are rejected when applying a Bonferroni correction?


b. Order the p-values.
c. Which hypotheses are rejected when applying a Holm correction?
5. Discuss the Benjamini-Hochberg procedure summarized by Algorithm 15 by
giving a numerical example.
Chapter 16
Survival Analysis

16.1 Introduction

The term “survival analysis” comprises a collection of longitudinal analysis
methods for studying time-to-event data. Here, the term “time” corresponds to the
duration until the occurrence of a particular event, while an “event” is a special
incident that assumes an application-specific meaning; for example, death, heart
attack, wear-out or failure of a product or equipment, divorce, violation of parole, or
bankruptcy of a company — to name just a few. It is this diversity in the meaning of
“event” that makes survival analysis widely applicable to many problems in various
fields. For instance, survival analysis is frequently utilized in biology, medicine,
engineering, marketing, social sciences, and behavioral sciences [11, 15, 64, 140,
214, 273, 356, 448, 527]. This interdisciplinary usage led to the synonymous use
of many alternative names for the field. For this reason, survival analysis is also
known as event history analysis (social sciences), reliability theory (engineering),
and duration analysis (economics).
There are two approaches that offered crucial contributions to the development
of modern survival analysis. The pioneers of the first approach are Kaplan and
Meier, who introduced an estimator for survival probabilities [276]. The second
approach was put forward by Cox, who introduced what is nowadays called the Cox
proportional hazard model (CPHM) [89], which is a regression model.
In this chapter, we discuss the theoretical basics of survival analysis, including
estimators for survival and hazard functions and the comparison of survival curves.
We discuss the Cox proportional hazard model (CPHM) in detail, as well as
approaches for testing the proportional hazard (PH) assumption. Furthermore, we
discuss stratified Cox models for cases where the PH assumption does not hold.
We will see that there are links to previously discussed topics in this book, such as
linear regression discussed in Chap. 11 and statistical hypothesis testing discussed in
Chap. 10. In fact, the CPHM is a regression model, and the comparison of survival
curves is conducted with the help of statistical hypothesis tests.


We start this chapter by examining the need for survival analysis via two
examples. The first is about the effect of chemotherapy on patients, and the second
is about the effect of medication on schizophrenia patients.

16.2 Motivation

To develop an intuitive understanding of survival analysis, let us discuss its basic


principles. When we speak about survival, we mean, in fact, probabilities, which
means that survival can be “measured” as a probability. Importantly, survival relates
to the membership in a group, where a group consists of a number of subjects,
and the survival probability is associated with each subject in this group. The
membership in a group is not constant; it can change. Such a change of membership
is initiated by an event. Particular examples of events are as follows:
• Death
• Relapse/recurrence
• Infection
• Suicide
• Agitation attack
• Crime
• Violation of parole
• Divorce
• Graduation
• Bankruptcy
• Malfunctioning or failure of a device
• Purchase
The event “death” is certainly the most severe example that can be given, which
also intuitively explains the name “survival analysis.” Importantly, survival analysis
is not limited to medical problems, but can also be applied to problems in social
sciences, engineering, or marketing, as illustrated by the examples in the preceding
list.

16.2.1 Effect of Chemotherapy: Breast Cancer Patients

In [315], the authors investigated the effect of neoadjuvant chemotherapy on triple-


negative breast cancer (TNBC) patients. TNBC is characterized by the lack of
expression of three genes; namely, estrogen receptor (ER), progesterone receptor
(PR), and human epidermal growth factor receptor 2 (HER2). To compare the
survival time of TNBC patients with non-TNBC people, the time was measured
from surgery (mastectomy) to death. As a result, the authors found that patients
with TNBC have a decreased survival compared to non-TNBC patients.

16.2.2 Effect of Medication: Agitation

In [308], the authors studied the effect of medications on individuals with


schizophrenia. Due to the complexity of this neuronal disorder, it is rather difficult
or even impossible to discern, from observing such individuals, how long the effect
of a medication lasts or the onset of an attack. Hence, measuring “time to an attack”
is in this case nontrivial because it is not directly observable. To accomplish this
task, the authors used the following experimental design for the study. At a certain
time, the patients are using either medication or a placebo administered by an
inhaler. Then, the patients are not allowed to reuse the inhaler for 2 h. After the 2 h,
everyone could use the inhaler as required. This allowed the easy measurement of
the time between the first and second usage of the inhaler, which was then used as
“time to event.” This was used to perform a survival analysis to assess the difference
between the medication and the placebo.
In contrast to the breast cancer example, the agitation example shows that the
time to event is not for all problems easy to obtain, but sometimes requires a clever
experimental design that enables its measurement.
Taken together, survival analysis examines and models the time for events to
occur and changes of the survival probability over time. Practically, one needs to
estimate these from subject data, which contains information about the time of
events. A factor that further complicates the analysis is the incomplete information
caused by censoring. Due to the central role of censoring for essentially all statistical
estimators that will be discussed in the following, we discuss the problem associated
with censoring in the next section.

16.3 Censoring

To perform a survival analysis, one needs to record the time to event ti for the
subjects i ∈ {1, . . . , N } of a group. This establishes so-called time-to-event data
(see Chap. 5) upon which a general survival analysis is based. However, this is
not always possible since we may have only partial information about the time to
an event. In these cases, the data are called censored. Specifically, a patient has a
censored survival time if the event has not yet occurred for this patient. This could
happen when
• A patient is a drop-out of a study; for example, stops attending the clinic for
follow-up examination
• The study has a fixed timeline and the event occurs after the cutoff time
• A patient withdraws from a study
The preceding censoring instances are called right censoring [309]. In Fig. 16.1, we
visualize the meaning of censoring. For instance, the subjects with the labels IDA
and IDC experience the event within the duration of the study, and this is indicated

[Figure 16.1 near here: timeline from the start to the end of the study for four
subjects IDA–IDD; observed events are marked by filled circles, unobserved events
by open circles, and censoring by an X.]

Fig. 16.1 A visualization of the meaning of right censoring.

by a full blue circle. In contrast, the subject IDB experiences the event after the
end of the study, and this is indicated by an open blue circle. Therefore, this event
is not observed during the study. The only useable (observable) information we
have is that at the end of the study, subject IDB did not yet experience the event.
Hence, the survival time of subject IDB is censored, as indicated by the red X. This
means that until the censoring event occurred (indicated by the red X), subject IDB
did not experience the event. Also, for subject IDD , we have a censored survival
time; however, for a different reason, since the study did not end yet. A possible
explanation for this censoring is that the subject did not attend follow-up visits after
the censoring event occurred (indicated by the red X). Formally, the censoring for
subject IDB is termed a fixed right censoring, whereas the censoring for subject IDD
is called a random right censoring.
There are further censoring types that can occur. For instance, a left censoring
occurs if the event is observed but not the beginning of the process. A typical
example of left censoring is an infection, since it is usually diagnosed at some time,
but it started before the diagnosis at an unknown point in time. In the following, we
will limit our focus to right-censored subjects. A summary of the different types of
censoring is provided in [305]:
• Type I censoring: All subjects begin and end the study at the same time
(fixed length of study). Examples of Type I censoring occur during laboratory
experiments.
• Type II censoring: All subjects begin the study at the same time, but the study
ends when a predetermined fixed number of subjects have experienced the event
(flexible length of study). Examples of Type II censoring also occur during
laboratory experiments.
• Type III censoring: The subjects enter the study at different times, but the length
of the study is fixed. Examples of Type III censoring occur during clinical trials.

16.4 General Characteristics of a Survival Function

A survival curve, denoted S(t), formulates the survival probability as a function of


time (t). The function S(t) is also referred to as the survival function or the survivor
function. Formally, S(t) is the probability that the random variable T is larger than
a specified time t; that is,

S(t) = P r(T > t). (16.1)

Since S(t) is defined for a group of subjects, S(t) can be interpreted as the
proportion of subjects having survived till t. Therefore, a naive estimator for S(t) is
given by

Snaive(t) = (#subjects surviving past t) / N,    (16.2)

where N is the total number of subjects. Eq. 16.1 is the population estimate of a
survival function.
Next, we will discuss various sample estimates of S(t), which can be numerically
evaluated from data. Put simply, the survival function gives the probability that a
subject (represented by T ) will survive past time t.
The survival function has the following properties:
• The range of time, t, is [0, ∞).
• S(t) is a non-increasing function; that is, S(t1 ) ≥ S(t2 ) for t1 ≤ t2 .
• At time t = 0, S(t) = 1; that is, the probability of surviving past time t = 0 is 1.
Since S(t) is derived from a probability distribution, there exists a probability
density f such that

S(t) = ∫_t^∞ f(τ) dτ.    (16.3)

Hence, differentiating S(t) with respect to t, we obtain

f(t) = − dS(t)/dt.    (16.4)

Furthermore, the expectation value of T is given by

μ = E[T] = ∫_0^∞ t f(t) dt.    (16.5)

Using Eq. 16.4 and integrating by parts, it can be shown that the survival function,
S(t), can be used to obtain the mean life expectation:

μ = E[T] = ∫_0^∞ S(t) dt.    (16.6)

16.5 Nonparametric Estimator for the Survival Function

In this section, we present two of the most popular nonparametric methods used to
estimate the survival function S(t).

16.5.1 Kaplan-Meier Estimator for the Survival Function

The Kaplan-Meier estimator [276] of a survival function, denoted SKM (t), is given
by
S_KM(t) = ∏_{i: ti < t} (ni − di)/ni = ∏_{i: ti < t} (1 − di/ni).    (16.7)

This estimator holds for all t > 0, and it only depends on the following two
parameters:
• ni : number of subjects at risk at time ti
• di : number of events at time ti
Here, ni corresponds to the number of subjects present at time ti . In contrast,
subjects who experienced the event or are censored are no longer present. The
difficult part of this estimator is the argument of the product, which considers only
events i that occur before time t; that is, ti < t. Hence, the survival curve SKM (t) at
time t considers all events that happened before t.
It is important to realize that when evaluating the Kaplan-Meier estimator
only the events occurring at {ti } are important. That means between two events,
for example, ti and ti+1 , the survival curve is constant. This allows a simple
reformulation of Eq. 16.7 to rewrite the Kaplan-Meier estimator using the following
recursive relationship:

S_KM(tk) = [(n_{k−1} − d_{k−1}) / n_{k−1}] · S_KM(t_{k−1}).    (16.8)

In Fig. 16.2, we show an example of the evaluation of the Kaplan-Meier estimator


that uses the recursive formulation given in Eq. 16.8. This example includes in total
five subjects and four events.

[Figure 16.2 near here: a Kaplan-Meier step function for five subjects with four
events at times t1 = 9, t2 = 13, t3 = 18, and t4 = 23 (numbers at risk n = 5, 4, 2, 1
and d = 1 event at each time); the estimated survival probability drops stepwise
from 1.0 to 0.8, 0.6, 0.3, and 0.]

Fig. 16.2 Numerical example for the Kaplan-Meier estimator. For each event, SKM is recursively
evaluated.
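The recursion can be reproduced with a few lines of R; the event times, numbers at risk, and event counts below are those shown in Fig. 16.2.

t_i <- c(9, 13, 18, 23)              # event times
n_i <- c(5, 4, 2, 1)                 # number at risk at each event time
d_i <- c(1, 1, 1, 1)                 # number of events at each event time
S_KM <- cumprod((n_i - d_i) / n_i)   # survival just after each event time
data.frame(t_i, n_i, d_i, S_KM)      # S_KM = 0.8, 0.6, 0.3, 0.0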

16.5.2 Nelson-Aalen Estimator for the Survival Function

In contrast to the Kaplan-Meier estimator, which is a direct estimator of S(t), the


Nelson-Aalen estimator [1, 360] provides an indirect estimator of S(t) through the
following direct estimate of the cumulative hazard function:

H_NA(t) = Σ_{i: ti ≤ t} di/ni = Σ_{i: ti ≤ t} h_{NA,i}(t).    (16.9)

From this, the Nelson-Aalen estimator of S(t), denoted S_NA(t), is obtained as
follows:

S_NA(t) = exp(− H_NA(t)) = exp(− Σ_{i: ti ≤ t} di/ni),    (16.10)

using the relation in Eq. 16.28.
In general, it can be shown that

S_KM(t) ≤ S_NA(t).    (16.11)
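For the same small example, the Nelson-Aalen estimate can be computed alongside the Kaplan-Meier estimate, which also illustrates the inequality in Eq. 16.11.

n_i <- c(5, 4, 2, 1)                 # number at risk at the four event times
d_i <- c(1, 1, 1, 1)                 # number of events at each event time
S_KM <- cumprod((n_i - d_i) / n_i)   # Kaplan-Meier, Eq. 16.7
S_NA <- exp(-cumsum(d_i / n_i))      # Nelson-Aalen, Eqs. 16.9 and 16.10
cbind(S_KM, S_NA)                    # S_KM <= S_NA in every row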

16.6 Comparison of Two Survival Curves

When we have more than one survival curve, we might be interested in comparing
them. Suppose that we have two survival curves that correspond to two different
groups of subjects; for example, one group received a medication, whereas the other
received a placebo.
Statistically, a comparison can be accomplished using a hypothesis test, and the
null and alternative hypotheses can be formulated as follows:

H0 : There is no difference in survival between (group 1) and (group 2).


H1 : There is a difference in survival between (group 1) and (group 2).
The most popular tests for comparing survival curves are as follows:
• Log-rank test
• Wilcoxon (Gehan) test (a special case of a weighted log-rank test)
The difference between these tests is that a log-rank test has more power than a
Wilcoxon test to detect late differences in the survival curves, whereas a Wilcoxon
test has more power than a log-rank test to detect early differences.

16.6.1 Log-Rank Test

The log-rank test, sometimes called the Mantel-Haenszel log-rank test, is a nonpara-
metric hypothesis test [329]. It makes the following assumptions:
• Censored and uncensored subjects have the same probability of the event
(censoring is non-informative).
• Kaplan-Meier curves of the two groups must not intersect (proportional hazards
assumption must hold).
• No particular distribution for the survival curve is assumed (distribution free).
Formally, the test is defined as follows: for each time t, estimate the expected
number of events for (group 1) and (group 2) as

e1t = (n1t / (n1t + n2t)) × (m1t + m2t),    (16.12)
e2t = (n2t / (n1t + n2t)) × (m1t + m2t),    (16.13)

where the indices “1” and “2” indicate groups one and two, respectively; nit and mit
are the number of subjects and the number of events in group i at time t, respectively.
The first term in the preceding equations can be interpreted as the probability of
selecting a subject from the group of interest; that is,

e1t = Pr(group 1) × (m1t + m2t),    (16.14)
e2t = Pr(group 2) × (m1t + m2t).    (16.15)

Using the auxiliary terms

E1 = Σ_t e1t,    (16.16)
E2 = Σ_t e2t,    (16.17)
O1 = Σ_t m1t,    (16.18)
O2 = Σ_t m2t,    (16.19)

we can define the following test statistic:

s = Σ_{i over groups} (Oi − Ei)² / Ei.    (16.20)

The statistic s follows a chi-square distribution with one degree of freedom.
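In R, the log-rank test is provided by the function survdiff() from the package survival, which reports the observed and expected numbers of events per group and the chi-square statistic of Eq. 16.20; the lung data and the grouping by sex are used here purely for illustration.

library(survival)
survdiff(Surv(time, status) ~ sex, data = lung)   # log-rank test for two groups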

16.7 Hazard Function

Next, we define the hazard function and its relationship with the survival function.
The hazard function is given by

h(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt.    (16.21)

The hazard function h(t), also called the hazard rate, can be interpreted as an
“instant probability” because the only individuals considered are those such that
T ≥ t and T < t + Δt for Δt → 0. Put simply, h(t) represents the chance you will
succumb to the event in the next instant (because Δt → 0), given that you survive
up to time t.
The hazard function, h(t), has the following properties:
• h(t) ≥ 0 for all t.
• h(t) has no upper bound.
• h(t) can assume any shape.
When h(t) = 0, it means that no event occurred within the time interval Δt.
The cumulative hazard function, denoted H (t), describes the accumulated risk
up to time t, and it is given by
H(t) = ∫_0^t h(τ) dτ.    (16.22)

The function H (t) can also be seen as the total amount of risk that has been accu-
mulated up to time t. The integration/summation over h(t) makes the interpretation
simpler, but some details are lost during this process.
There is an important relationship between h(t), f (t), and S(t), given by

h(t) = f(t) / S(t).    (16.23)

Therefore, the hazard, density, and survival functions are not independent from each
other. In the following, we derive this relationship.
From the definition of the hazard function, we have

h(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt    (16.24)
     = lim_{Δt→0} P(t ≤ T < t + Δt) / (P(T ≥ t) Δt)    (16.25)
     = lim_{Δt→0} P(t ≤ T < t + Δt) / (S(t) Δt)    (16.26)
     = f(t) / S(t).    (16.27)

The first step follows from the property of conditional probability, and the second
from the definition of the survival function. The last step shows the desired
relationship.
In addition to the preceding relationship, there is another important connection
between h(t) (or H (t)) and S(t), given by
S(t) = exp(− ∫_0^t h(τ) dτ) = exp(− H(t)).    (16.28)

From this, the inverse relationship from S(t) to h(t), given by

h(t) = − (dS(t)/dt) / S(t),    (16.29)

is derived as follows: Starting from

S(t) = ∫_t^∞ f(τ) dτ,    (16.30)

we have

1 − S(t) = ∫_0^t f(τ) dτ.    (16.31)

Differentiating both sides of Eq. 16.31 with respect to t gives

d(1 − S(t))/dt = − dS(t)/dt = f(t).    (16.32)

Using Eq. 16.23 for f (t), we obtain the desired relation in Eq. 16.29.
The difference between a hazard function and a survival function can be
summarized as follows:
• The hazard function focuses on failing.
• The survival function focuses on surviving.
By making some assumptions about h(t), one can obtain a parametric model.
The relationship between h(t) and S(t), given by Eq. 16.28, makes the survival
distribution a parametric model. Specific models that are frequently used include
the following:
• Weibull model
• Exponential model
• Log-logistic model
• Log-normal model
• Gamma model
Compared with a nonparametric model, parametric assumptions enable one to
model a survival function elegantly and in more detail. However, the major risk
associated with such an approach is to make assumptions that are not justified by
the data. In the following, we discuss four parametric models in detail.

16.7.1 Weibull Model

For the Weibull model, the hazard, survival, and density functions are given by
h(t) = λp(λt)^{p−1},    (16.33)
S(t) = exp(−(λt)^p),    (16.34)
f(t) = λp(λt)^{p−1} exp(−(λt)^p).    (16.35)

Here, λ > 0 is a rate parameter, and p > 0 is a shape parameter, allowing one
to control the behavior of the hazard function. Specifically, one can observe the
following:
• h(t) is monotonously decreasing when p < 1.
• h(t) is constant when p = 1.
• h(t) is monotonously increasing when p > 1.

The expected lifetime and its variance are given by

E[T] = (1/λ) Γ(1 + 1/p),    (16.36)
Var(T) = (1/λ²) Γ(1 + 2/p) − (1/λ²) Γ(1 + 1/p)².    (16.37)

Here, Γ is the gamma function defined by

Γ(x) = ∫_0^∞ t^{x−1} exp(−t) dt.    (16.38)
0

16.7.2 Exponential Model

For the exponential model, the hazard, survival, and density functions are given by

h(t) = λ,    (16.39)
S(t) = exp(−λt),    (16.40)
f(t) = λ exp(−λt).    (16.41)

The exponential model depends only on the rate parameter λ > 0.
The expected lifetime and its variance are given by

E[T] = 1/λ,    (16.42)
Var(T) = 1/λ².    (16.43)

16.7.3 Log-Logistic Model

For the log-logistic model, the hazard, survival, and density functions are given by
h(t) = λα(λt)^{α−1} (1 + (λt)^α)^{−1},    (16.44)
S(t) = (1 + (λt)^α)^{−1},    (16.45)
f(t) = λα(λt)^{α−1} (1 + (λt)^α)^{−2}.    (16.46)

The log-logistic model depends on two parameters: the rate parameter λ > 0 and the
shape parameter α > 0. Depending on α, one can distinguish the following different
behaviors of the hazard function:
• h(t) is monotonously decreasing from ∞ when α < 1.
• h(t) is monotonously decreasing from λ when α = 1.
• h(t) is first increasing and then decreasing when α > 1.
Specifically,
– h(t = 0) = 0.
– The maximum of h(t) is at t = (α − 1)^{1/α}/λ.

The expected lifetime and its variance (which exist for α > 1 and α > 2, respectively)
are given by

E[T] = (1/λ) (π/α) / sin(π/α),    (16.47)
Var(T) = (1/λ²) [ (2π/α)/sin(2π/α) − ((π/α)/sin(π/α))² ].    (16.48)

16.7.4 Log-Normal Model

For the log-normal model, the hazard, survival, and density functions are given by

h(t) = [α/(√(2π) t)] exp(−α² ln(λt)²/2) [1 − Φ(α ln(λt))]^{−1},    (16.49)
S(t) = 1 − Φ(α ln(λt)),    (16.50)
f(t) = [α/(√(2π) t)] exp(−α² ln(λt)²/2).    (16.51)

Since a normal distribution has two parameters, the log-normal model also has two
parameters: the mean μ ∈ R and the standard deviation σ > 0 of the underlying
normal distribution of ln(T). These parameters are obtained from the transformations
μ = −ln(λ) and σ = α^{−1}. The behavior of the hazard function is similar to that of
the log-logistic model for α > 1.
The expected lifetime and its variance are given by

E[T] = exp(μ + σ²/2),    (16.52)
Var(T) = (exp(σ²) − 1) exp(2μ + σ²).    (16.53)

16.7.5 Interpretation of Hazard Functions

In Fig. 16.3, we provide some examples of the four parametric models discussed.
From this figure, one can see that the hazard function can assume a variety of
different behaviors. The specific behavior of h(t) suggests the parametric model
to use for a particular problem. In Table 16.1, we summarize some examples of
characteristic hazard curves and their associated events.

[Figure 16.3 near here: a 4 × 3 panel of plots showing h(t), S(t), and f(t) for the
Weibull (p = 0.5, 1, 2, 3), exponential (λ = 0.5, 1, 2, 3), log-logistic
(α = 0.5, 1, 2, 3), and log-normal (μ = 0, 0.5, 1, 2) models.]

Fig. 16.3 Comparison of different parametric survival models.
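Curves like those in the Weibull hazard panel of Fig. 16.3 can be generated with a few lines of R; the rate λ = 1 and the shape values are illustrative choices.

lambda <- 1
t <- seq(0.01, 4, by = 0.01)
p_vals <- c(0.5, 1, 2, 3)
h <- sapply(p_vals, function(p) lambda * p * (lambda * t)^(p - 1))   # Eq. 16.33
matplot(t, h, type = "l", lty = 1, col = 1:4, ylim = c(0, 10),
        xlab = "time", ylab = "h(t)")                                # Weibull hazards
legend("topright", legend = paste0("p=", p_vals), col = 1:4, lty = 1)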



Table 16.1 Summary of characteristic hazard functions and their usage.


Hazard function behavior            Event                                              Parametric model
Constant                            Normal product                                     Weibull (p = 1)
Monotonously decreasing             Patient after surgery, stock market after crash,   Log-logistic (α < 1)
                                    or infant mortality
Monotonously (linear) increasing    Unsuccessful surgery, unsuccessful treatment,      Weibull (p = 2)
                                    or failure of a product
Humped                              Infection with tuberculosis (TB)                   Log-normal
U-shaped                            Heart transplant

16.8 Cox Proportional Hazard Model

So far, we have considered only models that did not include any covariates of the
subjects. Now, we include such covariates, and the resulting model is called the Cox
proportional hazard model (CPHM). The CPHM is a semiparametric regression
model that defines the hazard function as follows:

h(t, X) = h0 (t)exp(β1 X). (16.54)

Here, h0 (t) is called the baseline hazard. The baseline hazard can assume any
functional form. Examples of covariates are gender, smoking habit, or medication
intake.
Equation 16.54 may look like a special case because no constant β0 is included.
However, the following calculation shows that it is actually included in h0 (t) since

h(t, X) = h0 (t) exp(β0 + β1 X), (16.55)


= h0 (t) exp(β0 )exp(β1 X). (16.56)

We can generalize the preceding formulation for p covariates as follows:


h(t, X) = h0(t) exp(Σ_{i=1}^p βi Xi).    (16.57)

For X = 0, we obtain

h(t, X) = h0 (t), (16.58)

which is the hazard function defined in Eq. 16.21 without the influence of covariates.

The CPHM for p covariates does not make assumptions about the baseline hazard
h0 (t). However, the model assumes the following:
• Time independence of the covariates Xi
• Linearity in the covariates Xi
• Additivity
• Proportional hazard
The Cox proportional hazard regression model is a semiparametric model because
it does not make assumptions about h0 (t). However, it assumes a parametric form
for the effect of the predictors on the hazard.
In many situations, one is interested in the numerical estimates of the regression
coefficients βi rather than the shape of h(t, X) because this provides a summary of
the overall results.
To demonstrate this, let’s take the logarithm of the hazard ratio,

log(h(t, X)/h0(t)) = Σ_i βi Xi,    (16.59)

which is linear in Xi and βi. From this formulation, the connection to a linear
regression model is apparent. In other terms, this can be summarized as follows:

log HR0 = log(group hazard / baseline hazard) = Σ_i βi Xi.    (16.60)

Here, the group hazard corresponds to all effects of the covariates Xi , whereas the
baseline hazard excludes all such effects. Thus, the sum over all covariates is log
HR0 .
Let’s consider just one covariate, that is, p = 1, and suppose that this covariate
is the gender, which can assume the values X1 = 1 (female) and X1 = 0 (male).
Then, we obtain

log(hazard female / baseline hazard) = β1,    (16.61)
log(hazard male / baseline hazard) = 0.    (16.62)

By taking the difference, we obtain

log(hazard female / baseline hazard) − log(hazard male / baseline hazard)
  = log(hazard female / hazard male)    (16.63)
  = β1.    (16.64)

So, β1 is the log-hazard ratio of the hazard for females and males. This gives a
direct interpretation of the regression coefficient β1. Transforming both sides of the
preceding equation, we obtain the hazard ratio:

hazard female / hazard male = exp(β1).    (16.65)
For the preceding evaluation, we used the binary covariate gender as an example.
However, not all covariates are binary. In case of non-binary covariates, one can
use a difference of one unit, that is, X1 = x + 1 and X1 = x, to obtain a similar
interpretation for the regression coefficients.
A major advantage of the CPHM framework is that we can estimate the
parameters, βi , without having to estimate the baseline hazard function, h0 (t). This
implies that we also do not need to make parametric assumptions about h0 (t), thus
making the CPHM semiparametric.

16.8.1 Why Is the Model Called a Proportional Hazard Model?

To appreciate why the model is called a proportional hazard model, let’s consider
two individuals, m and n, for the same model. Specifically, for individuals m and n,
the hazards are given by

hm(t) = h0(t) exp(Σ_i βi Xmi),    (16.66)
hn(t) = h0(t) exp(Σ_i βi Xni).    (16.67)

Here, the covariates Xmi are from individual m and the covariates Xni are from
individual n. Taking the ratio of both hazards, we obtain the following hazard
ratio:

hm(t)/hn(t) = exp(Σ_i βi (Xmi − Xni)),    (16.68)

which is independent of the baseline hazard h0 (t) because it cancels out. Here, it
is important to note that the right-hand side is constant over time due to the time
independence of the coefficients and covariates. Let’s denote HR, the hazard ratio;
that is,
 
HR = exp(Σ_i βi (Xmi − Xni)).    (16.69)

A simple reformulation of Eq. 16.68 leads to

hm (t) = HR × hn (t). (16.70)

From this equation, it is clear that the hazard for the individual m is proportional to
the hazard for the individual n, and the proportion, HR, is time independent.

16.8.2 Interpretation of General Hazard Ratios

The validity of the proportional hazard (PH) assumption allows a simple summariza-
tion for comparing time-dependent hazards. Specifically, instead of taking a hazard
ratio between the hazards for two individuals, as in Eqs. 16.66 and 16.67, we can
take a hazard ratio of arbitrary hazards for conditions we want to compare. Let’s call
these conditions “treatment” and “control,” since these have an intuitive meaning in
a medical context, and let’s denote their corresponding hazards as follows:


h(t, X^treatment) = h0(t) exp(Σ_{i=1}^p βi Xi^treatment),    (16.71)
h(t, X^control) = h0(t) exp(Σ_{i=1}^p βi Xi^control).    (16.72)

Regardless of the potential complexity of the individual hazards, assuming the PH
assumption holds, their hazard ratio is constant over time:

HR(T vs C) = h(t, X^treatment) / h(t, X^control).    (16.73)

Hence, the effect of treatment and control over time is given by one real-valued
number. Here, it is important to emphasize that the ratio of the two hazards is given
by HR(T vs C) for any time point t, and thus it is not the integrated ratio over time.
Specifically, it gives the instantaneous relative risk, where the relative risk (RR)
quantifies the cumulative risk integrated over time.
In contrast with the comparison of survival curves for treatment and control —
for example, using a log-rank test, which gives us only a binary distinction — the
HR tells us something about the magnitude and the direction of this difference. Put
simply, the HR has the following interpretation:
• HR(T vs C) > 1: The treatment group experiences a higher hazard over the
control group ⇒ control group is favored.
• HR(T vs C) = 1: No difference between the treatment and the control group.
• HR(T vs C) < 1: The control group experiences a higher hazard over the
treatment group ⇒ treatment group is favored.

For instance, for HR(T vs C) = 1.3, the hazard of the treatment group is increased
by 30% compared to the control group, and for HR(T vs C) = 0.7, the hazard of the
treatment group is decreased by 30% compared to the control group [27].

16.8.3 Adjusted Survival Curves

A Cox proportional hazard model can be used to modify estimates for a survival
curve. Using Eq. 16.28, it follows that
S(t, X) = exp(− ∫_0^t h0(τ) exp(Σ_i βi Xi) dτ)    (16.74)
        = exp(− exp(Σ_i βi Xi) ∫_0^t h0(τ) dτ)    (16.75)
        = [exp(− ∫_0^t h0(τ) dτ)]^{exp(Σ_i βi Xi)}    (16.76)
        = S0(t)^{exp(Σ_i βi Xi)}.    (16.77)

In general, one can show that

S(t, X) ≤ S(t) (16.78)

since the survival probability is always smaller than 1 and the exponent is always
positive.
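In R, adjusted survival curves can be obtained by applying survfit() to a fitted coxph object; the lung data and the covariate sex are used here only as an illustration.

library(survival)
fit <- coxph(Surv(time, status) ~ sex, data = lung)
newdat <- data.frame(sex = c(1, 2))            # one covariate profile per row
plot(survfit(fit, newdata = newdat), col = c("blue", "red"),
     xlab = "time", ylab = "adjusted S(t, X)")
legend("topright", legend = c("sex = 1", "sex = 2"),
       col = c("blue", "red"), lty = 1)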

16.8.4 Testing the Proportional Hazard Assumption

In the preceding discussion, we assumed that the proportional hazard (PH) assump-
tion holds. In the following, we discuss three ways (two graphical and one
analytical) that can be used to evaluate the PH assumption.

16.8.4.1 Graphical Evaluation

The two graphical methods to assess the PH assumption perform a comparison for
each variable/covariate one at a time [285]. The underlying idea of both methods is
as follows:

I. Comparison of estimated ln(− ln) survival curves


II. Comparison of observed and predicted survival curves

Graphical Method I To illustrate this method, we need to take the ln(− ln) of
Eq. 16.77. This leads to

ln(− ln S(t, X)) = Σ_{i=1}^p βi Xi + ln(− ln S0(t)).    (16.79)

Using the expression in Eq. 16.79 to evaluate two individuals characterized by the
following specific covariates,

X1 = (X11, X12, . . . , X1p),    (16.80)
X2 = (X21, X22, . . . , X2p),    (16.81)

gives

ln(− ln S(t, X1)) − ln(− ln S(t, X2)) = Σ_{i=1}^p βi (X1i − X2i).    (16.82)

From Eq. 16.82, we can see that the difference between ln(− ln) of survival curves
for two individuals having different covariate values is a constant given by the right-
hand side.
To assess the PH assumption, we perform such a comparison for each covariate
one at a time. In case of categorical covariates, all values will be assessed. For
continuous covariates, we categorize them first, and then perform the comparison.
The reason for using Eq. 16.82 for each covariate one at a time and not for all at
once is that performing such a comparison covariate by covariate is more stringent.
From Eq. 16.82, it follows that survival curves cannot cross each other if the
hazards are proportional. Observation of such crossings leads to a clear violation of
the PH assumption.

Graphical Method II The underlying idea of this approach, to compare the


observed and expected survival curves to assess the PH assumption, is the graphical
analog of the goodness-of-fit (GOF) testing.
Here, observed survival curves are obtained from stratified estimates of KM
curves. The strata are derived from the categories of the covariates, and the expected
survival curves are obtained from a CPHM with adjusted survival curves, as given
by Eq. 16.77.
The comparison is performed similarly for the ln(− ln) survival curves; that is,
for each covariate one at a time. Then, the observed and expected survival curves
for each strata are plotted on the same graph for assessment. If, for each category
of the covariates, the observed and expected survival curves are close to each other,
then the PH assumption holds.

Kleinbaum [285] suggested assuming that the PH assumption holds unless there
is very strong evidence against it, namely:
• Survival curves cross and don’t look parallel over time.
• Log cumulative hazard curves cross and don’t look parallel over time.
• Weighted Schoenfeld residuals clearly increase or decrease over time; see
Sec. 16.8.4.2 (tested by a significant regression slope).
If the PH assumption doesn’t hold for a particular covariate, then we are getting
an average HR (averaged over the event times). In many cases, this is not necessarily
a bad estimate.

16.8.4.2 Goodness-of-Fit Test

To test the validity of the PH assumption, several statistical tests have been
suggested. However, the most popular one, introduced in [222], is a variation of
a test originally proposed in [426] based on the so-called Schoenfeld residuals.
To carry out this test, the following steps are performed for each covariate one at
a time:
1. Estimate a CPHM and obtain Schoenfeld residuals for each predictor/covariate.
2. Set up a reference vector containing the ranks of events. Specifically, the subject
with the first (earliest) event receives a value of 1, the next subject receives a
value of 2, and so on.
3. Perform a correlation test between the variables obtained in the first and second
steps. The null hypothesis tested is that the correlation coefficient between the
Schoenfeld residuals and the ranked event times is zero.
The Schoenfeld residual [426] for subject i and covariate k, experiencing the
event at ti , is given by

rik = Xik − X̄k (β, ti ). (16.83)

Here, Xik is the individual value for subject i, and X̄k (β, ti ) is the weighted average
of the covariate values for the subjects at risk at time ti , denoted R(ti ), and it is given
by

X̄k(β, ti) = Σ_{j∈R(ti)} Xjk wj(β, ti).    (16.84)

The weight function for all subjects at risk, given by R(ti), is

Pr(subject j fails at ti) = wj(β, ti) = exp(β^T Xj) / Σ_{l∈R(ti)} exp(β^T Xl).    (16.85)

The Schoenfeld residual in Eq. 16.83 is evaluated for the parameter vector β from a
fitted CPHM.
Overall, for each covariate k, this gives a vector

rk = (r1k , r2k , . . . , rnk ), (16.86)

which is compared with the vector of rank values through a correlation test.

16.8.5 Parameter Estimation of the CPHM via Maximum


Likelihood

So far, we have formulated the CPHM and utilized it in a number of different


settings. Now, we are dealing with estimating the regression coefficients β of the
model.
Conceptually, the values of the regression coefficients are obtained via maximum
likelihood (ML) estimates; that is, by finding the parameters β of our CPHM that
maximize L(β|data). Importantly, the CPHM does not specify the base hazard. This
implies that without explicitly specifying it, the full likelihood of the model cannot
be defined. For this reason, Cox proposed a partial likelihood.
The full likelihood for right-censored data, assuming no ties, would be composed
of two contributions: one for individuals observed to fail at time ti , contributing to
the density f (ti ), and the other for individuals censored at time ti , contributing to
the survival function S(ti ). The product of both defines the full likelihood, denoted
LF , given by
L_F = ∏_i f(ti)^{δi} S(ti)^{1−δi}.    (16.87)

Here, δi indicates censoring (δi = 1 if the event of subject i is observed and δi = 0
if it is censored). Using Eq. 16.23, we can rewrite LF using the hazard function:

L_F = ∏_i h(ti)^{δi} S(ti).    (16.88)

16.8.5.1 Case Without Ties

Assuming that there are no ties in the data, i.e., event times ti are unique, formally,
the Cox partial likelihood function [89, 90] is given by

L(β) = ∏_{ti uncensored} h0(t) exp(β^T Xi) / Σ_{j∈R(ti)} h0(t) exp(β^T Xj),    (16.89)

where R(ti ) is again the set containing the subjects at risk at ti . Here, again the
baseline hazard h0 (t) is not needed because it cancels out.
The solution of Eq. 16.89 is given by the coefficients β that maximize the
function L(β); that is,

β_ML = argmax_β L(β).    (16.90)

To obtain the coefficients βk, we solve the following system of equations:

∂L/∂βk = 0,  ∀ k.    (16.91)

Usually, this needs to be carried out numerically using computational optimization


methods. Practically, the log likelihood can be used to simplify the numerical
analysis because it converts the product term of the partial likelihood function into
a sum.

16.8.5.2 Case with Ties

As mentioned previously, the preceding given Cox partial likelihood is only valid
for data without ties. However, in practice, ties of events can occur. For this reason,
extensions are needed. Three of the most widely used extensions are exact methods
[89, 101, 275], the Breslow approximation [62], and the Efron approximation [127].
There are two types of exact methods. One assumes that the time is discrete,
while the other assumes continuous time. Due to the discrete nature of the time,
the former model is called exact discrete method [89]. This method assumes that
occurring ties are true ties and there exists no underlying ordering that would resolve
the ties. Formally, it has been shown that this can be described by a conditional logit
model that considers all possible combinations obtained for di tied subjects drawn
from all subjects at risk at ti . In contrast, Kalbfleisch and Prentice suggested an
exact method assuming continuous times. In this model, ties arise as a result of
imprecise measurement; for example, due to scheduled doctor visits. Hence, this
model assumes that there exists an underlying true ordering for all events, and the
partial likelihood needs to consider all possible orderings for resolving ties. This
involves considering all possible permutations (combinations) of tied events, leading
to an average likelihood [275, 466].
A major drawback of both exact methods is that they are very computationally
expensive due to the high number of combinations to be considered when there are
many ties. This means that the methods can even become computationally infeasi-
ble. For this reason, the following two methods, which provide approximations of
the exact partial likelihood and are computationally much faster, are preferred.
The first method is the Breslow approximation [62], given by

L_B(β) = ∏_{ti uncensored} exp(β^T S_i) / [Σ_{j∈R(ti)} exp(β^T Xj)]^{di}.    (16.92)

This approximation utilizes D(ti), the set of all subjects experiencing their event at
the same time ti, where di is the number of subjects given by di = |D(ti)|, and

S_i = Σ_{j∈D(ti)} Xj.    (16.93)

That means the set D(ti ) provides information about the tied subjects at time ti . It
is interesting to note that using the following simple identification
exp(β^T S_i) = ∏_{k∈D(ti)} exp(β^T Xk)    (16.94)

leads to an alternative formulation of the Breslow approximation:

L_B(β) = ∏_{ti uncensored} ∏_{k∈D(ti)} exp(β^T Xk) / [Σ_{j∈R(ti)} exp(β^T Xj)]^{di}.    (16.95)

Overall, the Breslow approximation looks similar to the Cox partial likelihood, with
minor adjustments. One issue with the Breslow method is that it considers each
of the events, at a given time, as distinct from each other, and it allows all failed
subjects to contribute the same weight to the risk set.
In contrast with the Breslow approximation, the Efron approximation allows each
of the members that fail at time ti to contribute partially (in a weighted way) to the
risk set. The Efron approximation [127] is given by
L_E(β) = ∏_{ti uncensored} ∏_{k∈D(ti)} exp(β^T Xk) /
         ∏_{j=1}^{di} [ Σ_{k∈R(ti)} exp(β^T Xk) − ((j − 1)/di) Σ_{k∈D(ti)} exp(β^T Xk) ].

Overall, when there are no ties in the data, all approximations give the same
results. Also, for a small number of ties, the differences are usually small. The
Breslow approximation works well when there are few ties but is problematic for a
large number. In general, the Efron approximation almost always works better; thus,
it is the preferred method. For this reason, it is the default in the function coxph()
available in R. Both the Breslow and Efron approximations give coefficients that are
biased toward zero.
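In R, the tie-handling method is selected via the ties argument of coxph(); the following sketch compares the Efron and Breslow approximations on the lung data, which is only an illustrative choice.

library(survival)
fit_efron   <- coxph(Surv(time, status) ~ sex, data = lung, ties = "efron")
fit_breslow <- coxph(Surv(time, status) ~ sex, data = lung, ties = "breslow")
c(efron = coef(fit_efron), breslow = coef(fit_breslow))   # usually very similar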

16.9 Stratified Cox Model

In Sect. 16.8.4, we discussed approaches for testing the PH assumption. In this


section, we show that a stratification of the Cox model is a way to deal with
covariates for which the PH assumption does not hold.
Let’s assume that we have p covariates for which the PH assumption holds,
except for one covariate. Furthermore, we assume that the violating covariate
assumes values in S different categories. If this variable is continuous, we need
to define S discrete categories and discretize it.
For this, we can specify a hazard function for each stratum s, given by

hs(t, X(s)) = h0,s(t) exp(β^T X(s)).    (16.96)

Here, X(s) ∈ R^p are the covariates for which the PH assumption holds, β ∈ R^p are
the regression coefficients, and s ∈ {1, . . . , S} indexes the different strata. We wrote the covariate as a function of
strata s to indicate that only subjects having values within this strata are used. Put
simply, the categories s are used to stratify the subjects into S groups for which a
Cox model is fitted.
For each of these strata-specific hazard functions, one can define a partial
likelihood function, Ls (β), in the same way as for the ordinary CPHM. The overall
partial likelihood function for all strata is then given by the product of the individual
likelihoods, as follows:

L(β) = ∏_{s=1}^S Ls(β).    (16.97)

We want to emphasize that the parameters β are constant across the different
strata; that is, we are fitting S different models but the covariate-dependent part
is identical for all of these models, and only the time-dependent baseline hazard
function is different. This feature of the stratified Cox model is called the no-
interaction property. This implies that the hazard ratios are the same for each
stratum.
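In R, a stratified Cox model is specified by wrapping the stratifying covariate in strata(); the sketch below stratifies the lung data by ph.ecog, which is an illustrative choice: each ECOG category receives its own baseline hazard, while the coefficient for sex is shared across strata.

library(survival)
fit_strat <- coxph(Surv(time, status) ~ sex + strata(ph.ecog), data = lung)
summary(fit_strat)   # one coefficient for sex, no coefficient for the strata variable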

16.9.1 Testing No-Interaction Assumption

A question that arises is whether it is justified to assume a no-interaction model for


a given data set. This question can be answered with a likelihood ratio (LR) test. To
achieve this, we need to specify the interaction model given by
 
hs(t, X(s)) = h0,s(t) exp(βs^T X(s)).    (16.98)

Practically, this can be done by introducing dummy variables. For S = 2 strata, we


need one dummy variable Z ∗ ∈ {0, 1}, leading to the following interaction model:

hs(t, X(s)) = h0,s(t) exp(β^T X(s) + β11(Z* × X1) + β21(Z* × X2) + · · · + βp1(Z* × Xp)).

For Z* = 0, we have

Coefficient for X1: β1,    (16.99)
Coefficient for X2: β2,    (16.100)
...    (16.101)
Coefficient for Xp: βp,    (16.102)

and for Z* = 1, we have

Coefficient for X1: β1 + β11,    (16.103)
Coefficient for X2: β2 + β21,    (16.104)
...    (16.105)
Coefficient for Xp: βp + βp1.    (16.106)

This shows that the coefficients differ for the two strata.
For S > 2 strata, we need to introduce S − 1 dummy variables Zj*, j ∈ {1, . . . , S − 1},
with Zj* ∈ {0, 1}. This gives

hs(t, X(s)) = h0,s(t) exp(β^T X(s) + β11(Z1* × X1) + β21(Z1* × X2) + · · · + βp1(Z1* × Xp)
                          + β12(Z2* × X1) + β22(Z2* × X2) + · · · + βp2(Z2* × Xp)
                          ...
                          + β1,S−1(Z*_{S−1} × X1) + β2,S−1(Z*_{S−1} × X2) + · · · + βp,S−1(Z*_{S−1} × Xp)).

In this way, we obtained from the no-interaction model (NIM) and the interaction
model (IM) the likelihoods to be used for the test statistic LR = −2 log LNIM +
2 log LIM . The statistic, LR, follows a chi-square distribution with p(S − 1) degrees
of freedom.
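A sketch of such a likelihood ratio test in R, again stratifying the lung data by ph.ecog for illustration, is shown below; subjects in the rare ph.ecog = 3 category are removed because it is too small to support the interaction model.

library(survival)
lung2 <- subset(lung, ph.ecog < 3)                                        # drop a tiny stratum
nim <- coxph(Surv(time, status) ~ sex + strata(ph.ecog), data = lung2)    # no interaction
im  <- coxph(Surv(time, status) ~ sex * strata(ph.ecog), data = lung2)    # with interaction
anova(nim, im)   # LR = -2 log L_NIM + 2 log L_IM with p(S - 1) degrees of freedom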

16.9.2 Case of Many Covariates Violating the PH Assumption

In the case where there is more than one covariate violating the PH assumption, there
is no elegant extension. Instead, the approach is usually situation-specific, requiring
the combination of all these covariates into a single covariate X∗ having S strata.
An additional problem is imposed by the presence of continuous covariates, which
requires discrete categorization. Both issues (large number of covariates violating
the PH assumption and continuous covariates) lead to a complicated situation,
making such an analysis very laborious. This is especially true for the testing of
the no-interaction assumption.

16.10 Survival Analysis Using R

In this section, we show how to perform a survival analysis using R. We provide


some scripts that enable one to obtain numerical results for different problems. To
demonstrate such an analysis, we use data from lung cancer patients provided in the
package survival [465].

16.10.1 Comparison of Survival Curves

In Listing 16.1, we show an example, based on the lung cancer data, that compares
the survival curves of female and male patients using the packages survival and
survminer [278, 465] available in R. The result of this analysis is shown in Fig. 16.4.
From the total number of available patients (228), we select 175 randomly. For the
selected patients, we estimate the Kaplan-Meier survival curves and compare them
using a log-rank test. The p-value from this comparison is p < 0.0001, which
means that, based on a significance level of α = 0.05, we need to reject the null
hypothesis “there is no difference in the survival curves for males and females.”
By setting options in the function ggsurvplot, we added to Fig. 16.4 information
about the number of subjects at risk in interval steps of 100 days (middle figure) and
the number of censoring events (bottom figure). This information is optional, but
one should always complement survival curves with this table because it provides
additional information about the data upon which the estimates are based.
Usually, it would be also informative to add confidence intervals to the survival
curves. This can be accomplished by setting the option conf.int to “TRUE” (not used
here to avoid an overloading of the presented information).
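A sketch along the lines of the analysis described above is given below; the random subsample of 175 patients, the seed, and the plot options are illustrative and need not match the book's Listing 16.1 exactly.

library(survival)
library(survminer)
set.seed(123)
lung_sub <- lung[sample(nrow(lung), 175), ]                       # random subsample
lung_sub$sex <- factor(lung_sub$sex, levels = c(1, 2),
                       labels = c("Male", "Female"))
fit <- survfit(Surv(time, status) ~ sex, data = lung_sub)         # Kaplan-Meier curves
ggsurvplot(fit, data = lung_sub, pval = TRUE,                     # log-rank p-value
           risk.table = TRUE, ncensor.plot = TRUE,                # risk and censoring tables
           break.time.by = 100, xlim = c(0, 1000))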

[Figure 16.4 near here: Kaplan-Meier survival curves for males and females over
1000 days (log-rank p < 0.0001), together with a number-at-risk table and the
number of censoring events in interval steps of 100 days.]

Fig. 16.4 The result of Listing 16.1. Top: The two survival curves for males (green) and females
(blue) are shown for a duration of 1000 days. Middle: The number of subjects at risk is shown in
interval steps of 100 days. Bottom: The number of censoring events is shown for the same interval
steps.

16.10.2 Analyzing a Cox Proportional Hazard Model

Here, we will illustrate how to perform the analysis of a CPHM. We will again
use the lung cancer data, and as covariate we will use the sex of the patients.
Listing 16.2 provides the steps of the analysis as well as the corresponding outputs.
In this model, the p-value of the regression coefficient is 0.000126, indicating a
statistical significance of this coefficient. Thus, the covariate sex makes a significant
contribution to the hazard.
The hazard ratio of female/male is 0.4788. That means the hazard for the group of
females is reduced by a factor of 0.4788 compared to the group of males; that is, it
is reduced by 52.12%.
Finally, the outputs of the statistical tests at the end of the listing provide
information about the overall significance of the model. The three tests assessed the
null hypothesis “all the regression coefficients are zero,” and they are asymptotically
equivalent. All three tests are significant, which indicates that the null hypothesis
needs to be rejected.
484 16 Survival Analysis

16.10.3 Testing the PH Assumption

Using the preceding fitted model, we can now test the PH assumption for
sex. Listing 16.3 provides the corresponding script and its outputs. Here, the
null hypothesis tested is “the correlation between the Schoenfeld residuals and
the ranked failure time is zero.” The test is not statistically significant for the
covariate sex. Thus, we can consider that the proportional hazard assumption
holds.

In Listing 16.4, we show how to obtain the Schoenfeld residuals, as discussed in


Sect. 16.8.4.2, and we also provide a visualization of the resulting residuals.
16.10 Survival Analysis Using R 485

Schoenfeld Individual Test p: 0.1253

10
Beta(t) for sex

−10

56 130 180 260 320 420 550 710


Time

Fig. 16.5 Visualization of the scaled Schoenfeld residuals of sex against the transformed time.

Figure 16.5 depicted the result of Listing 16.4. In this figure, the solid line is
a smoothing spline fit of the scaled Schoenfeld residuals against the transformed
time, and the dashed lines indicate ±2 standard errors. A systematic deviation from
a straight horizontal line would indicate a violation of the PH assumption, since for
a valid assumption the coefficient(s) do not vary over time. Overall, the solid line is
sufficiently straight to assume that the PH holds. Figure 16.5 shows also the p-value
of the result obtained using Listing 16.3.

16.10.4 Hazard Ratios

Finally, we present results for the full multivariate CPHM, with all seven available
covariates in the lung cancer data set as input for the model. Listing 16.5 gives the
corresponding code.
486 16 Survival Analysis

Hazard ratio

age (N=175) 1.00 0.856


(0.97 − 1.03)

sex (N=175) 0.36 <0.001 ***


(0.23 − 0.58)

ph.ecog (N=175) 2.10 0.006 **


(1.24 − 3.57)

ph.karno (N=175) 1.02 0.136


(0.99 − 1.05)

pat.karno (N=175) 0.99 0.093


(0.97 − 1.00)

meal.cal (N=175) 1.00 0.379


(1.00 − 1.00)

wt.loss (N=175) 0.98 0.042 *


(0.96 − 1.00)
# Events: 92; Global p−value (Log−Rank): 1.6674e−05
0.1 0.2 0.5 1 2
AIC: 704.57; Concordance Index: 0.68

Fig. 16.6 Forest plot of hazard ratios for a multivariate CPHM.

A convenient way to summarize the results is by using a forest plot, shown


in Fig. 16.6. This figure shows the hazard ratios for the seven covariates, where
the mean is represented by the square symbol and the confidence interval of the
estimates is represented by the horizontal line. The right-hand side shows the p-
values of the corresponding regression coefficients, which can also be obtained
using the function summary(res.cox). Overall, the covariate sex reduces the hazard,
whereas ph.ecog (measure of well-being according to ECOG performance score)
increases it. All other covariates are located around 1; that is, their effect is marginal.

16.11 Further Reading

As we have seen in this chapter, survival analysis is a very broad topic. To gain a
deeper understanding, we refer to [84, 133, 197, 285, 324].

16.12 Summary

Survival analysis, also known as event history analysis, is a very interesting topic
because the type of data that can be analyzed with such methods are different from
the other types of data discussed in this book. Although time-to-event data may
appear dull at first, the information that can be extracted from them is quite rich
and informative. The outcome variable of a survival analysis, namely, the “event,”
usually assumes a role of great interest; for example, death, bankruptcy, failure, or
crime. This enables a direct connection between a statistical analysis and events
16.13 Exercises 487

of apparent real-word interest. For this reason, it is not surprising that survival
analysis is used in diverse fields ranging from medicine and biology to economics
and marketing. Hence, every data scientist should be familiar with survival analysis.
Learning Outcome 16: Survival Analysis

Survival analysis is a collection of methods for analyzing time-to-event data


that can be flexibly applied to a wide range of problems. Such an analysis
provides distributional information about the time until an event occurs.

As we have seen in this chapter, for multivariate data a survival analysis can
quickly become complex and the interpretation of the results can be nontrivial; for
example, with respect to hazard ratios. Moreover, the testing of the proportional
hazard assumption or the analysis of the stratified Cox model can become complex
and labor intensive. The latter means that a diagnosis of the models, which demands
a thorough understanding by the analyst, is required. Overall, while the very basics
of survival analysis are rather intuitive, more advanced approaches are generally
quite involved.

16.13 Exercises

1. Discuss right censoring according to the information provided in Fig. 16.1.


2. Calculate the Kaplan-Meier estimator manually for the data provided in Fig. 16.2.
3. Load the pbc data from the library survival. Install and load the library survival
and then use the following command to load the data:
data(pbc, package=“survival”)
a. Compare the survival curves of female and male patients from the “pbc” data.
b. Perform an analysis of the CPHM using the covariate “sex” for the “pbc” data.
c. Test the PH assumption for the CPHM using the covariate “sex” for the “pbc”
data.
d. Visualize the results of the scaled Schoenfeld residuals of sex against the
transformed time.
e. Perform an analysis of the CPHM using all the covariates for the “pbc” data.
Chapter 17
Foundations of Learning from Data

17.1 Introduction

So far, we have pursued a practical approach to data science. This has included the
presentation of many examples using R, allowing us to reproduce the analysis of
essentially all topics discussed. Now, we are going a step backward to present a
theoretical perspective on learning from data.
In this chapter, we discuss two fundamental aspects of learning from data. The
first addresses the computational learning theory, and the second covers different
learning paradigms. In general, computational learning theory studies questions
of “learnability.” That means, given a task, natural questions include: How many
samples are needed to learn? Under what conditions is successful learning possible?
Is there a best learning algorithm under all conditions? Such questions can be
formalized with the probably approximately correct (PAC) learning framework,
which defines learnability formally. We will see that PAC can provide learnability
bounds for a finite hypothesis space, and by using the Vapnik-Chervonenkis (VC)
dimension, such results can be extended to an infinite hypothesis space.
These results will give us theoretical characterizations of the difficulty of
machine learning problems and the capabilities of certain models. The main results
of this part are summarized by the fundamental theorem of statistical learning,
which provides conditions for the PAC learnability of an infinite hypothesis space.
The second fundamental aspect addressed in this chapter is about the definition
of different learning paradigms in machine learning [9]. So far, we have focused on
supervised learning and unsupervised learning methods. Examples for supervised
learning are classification or regression methods (see Chaps. 9 and 11), whereas
clustering or dimension reduction methods are unsupervised (see Chaps. 7 and 8).
However, there are data sets (or combinations thereof) that have characteristics
that cannot be described by those two learning paradigms but instead require
advanced learning paradigms. In this chapter, we discuss seven modern learning
paradigms: semi-supervised learning, one-class classification, positive-unlabeled

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 489
F. Emmert-Streib et al., Elements of Data Science, Machine Learning, and Artificial
Intelligence Using R, https://doi.org/10.1007/978-3-031-13339-8_17
490 17 Foundations of Learning from Data

learning, few/one-shot learning, transfer learning, multi-task learning, and multi-


label learning. We will see that these learning paradigms open a new universe of
novel methods and approaches.

17.2 Computational and Statistical Learning Theory

The field of computational learning theory (COLT) studies whether a given


problem is efficiently learnable or not. That means it is not only important
to know under what circumstances a problem is learnable but that there is an
efficient computational realization of a learning algorithm. Hence, computational
considerations are also of importance. Most important, computational learning
theory will provide us with a formal definition of learnability due to Valiant. While
there have been earlier attempts to define learning formally, such as [201], Valiant’s
seminal work in [479] is generally considered the starting point of computational
learning theory.
A field that is closely related to COLT is statistical learning theory (SLT).
Compared to COLT, SLT is less concerned with computational aspects of learning.
Furthermore, it allows infinite hypothesis spaces, which can be dealt with using
the VC dimension, thanks to Vapnik and Chervonenkis. Interestingly, compared
to statistics, SLT does not assume that there exists a correct model that should be
estimated. Instead, SLT presumes that the correct model is entirely unknown and the
goal is to find a model with similar prediction capabilities. Despite such differences,
in the literature, the terms “computational learning theory” and “statistical learning
theory” are often used synonymously.

17.2.1 Probabilistic Learnability

Before we formalize the problem, let’s consider two particular classification prob-
lems. These examples will teach us valuable lessons regarding the feasibility of
learning.
Classification of Tumor Tissues Let’s assume that we have a number of images
of tumor tissues, e.g., from lung cancer. The tumors are labeled “b” if the tumor
sample (TS) is benign or “m” if it is malignant. Let’s denote the data set that contains
all of these tumor tissues as D; that is, D = {(TS1 , f1 ), . . . , (TSn , fn )} with fi ∈
{‘b , ‘m }. Suppose that we are presented with an additional tumor tissue, TSx . What
is the label, fx , of this tumor?
Usually, such a question is answered by a pathologist. A pathologist is a clinician
with an additional education in histology, which enables them to distinguish tissues
based on their morphological differences. In this sense, a pathologist serves as a
human classifier for tumor tissue. Now, suppose we present the preceding outlined
17.2 Computational and Statistical Learning Theory 491

problem to one thousand pathologists all over the world asking them to classify the
unlabeled tumor tissue. Since becoming a pathologist requires years of training, this
problem is generally considered difficult. Empirical studies show that, in general, the
agreement among pathologists is usually not 100% but much less; see, for example,
[451]. How much less depends on the particular problem/disease.
Classifying Boolean Patterns The next example is of a mathematical nature.
Suppose that we have a training  data set D consisting of three Boolean patterns
of the form
5 x = (x , x ); y , as shown on the left-hand side in Fig. 17.1; that
6
1 2
   
is, D = (0, 0); 0 , (0, 1); 1 , (1, 0); 1 . That means for each of these three
Boolean patterns, we know its corresponding class label y.
Now, the question is what Boolean function f , with

f : X → Y; (17.1)
x = (x1 , x2 ) ∈ X, y ∈ Y; (17.2)
X = {0, 1}2 , Y = {0, 1}. (17.3)

provides the best classification of two-dimensional Boolean patterns?


Let’s further assume that we have a finite set of 16 functions, f1 , . . . , f16 ,
available as a potential solution to this problem. Each of these 16 functions is
fully defined by its truth table, as shown on the right-hand side in Fig. 17.1. Using
the training data D, we see that only the functions f7 and f15 allow an error-free
learning of the training data because both functions lead to the correct classification
of the three input data points.
It is interesting to note that, based on the preceding assumptions, we have no
information that would allow us to distinguish f7 from f15 . Hence, both functions
are reasonable solutions to our classification problem. Nevertheless, only one of
them will be correct, because the unknown input pattern x = (1, 1) will result
in either y = 0 or y = 1 (shown in yellow in Fig. 17.1), but not in both! As a

hypothesis space
x 2 x1 y f 1 f 2 f3 f 4 f 5 f 6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16
0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
observed

0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
1 1 ? 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
unobserved

Fig. 17.1 Classification of two-dimensional Boolean patterns.


492 17 Foundations of Learning from Data

consequence, we cannot guarantee that we can select the correct function from the
provided information.
Summary From the first (real-world) example, we learn that there are classification
problems humans cannot solve. From the second (mathematical) example, we
learn that there are classification problems mathematics cannot solve. The first
observation may not surprise you; however, the latter one does because it shows
that despite the simplicity of the problem (two-dimensional Boolean patterns, no
noise, deterministic mapping from the input pattern to the class label), there is no
approach that could guard us against selecting the wrong function.
The dilemma from the preceding consideration comes from our implicit assump-
tion that “learning a problem” means that for every data set D and hypothesis
space we can always find an error-free solution. This is not possible. However,
if we amend our expectations in a probabilistic manner, we can find quantitative
approaches. Hence, the nature of computational and statistical learning theory is
to provide probabilistic statements in the form of bounds about the learnability of
a problem. Next, we will quantify the term “learnability” using appropriate error
measures.

17.2.2 Probably Approximately Correct (PAC) Learning

In this section, we provide a specific quantitative definition of learnability, called


probably approximately correct (PAC) or PAC learnability, thanks to Valiant [479].
There are also other definitions of learnability, which will be briefly discussed in the
following section.
The PAC model is based on the following definitions [344]: Let X be the space
of all possible instances or data points that can be observed, also called the input
space. A function

c:X→Y (17.4)

is called a concept mapping from X to the output space Y. For simplicity, the output
space shall be binary labels; that is, Y = {0, 1} or Y = {−1, +1}. Put simply, a
concept function provides a classification for the data points x ∈ X. The (possibly
infinite) set C containing all allowed concept functions c ∈ C is called the concept
space.
We assume that all instances x ∈ X are independently and identically distributed
(i.i.d.) according to a fixed but unknown distribution P ; that is, x ∼ P . Furthermore,
a data point x and its class label are fully defined by a distribution P over X and a
target concept t ∈ C by (x, t (x)).
The learning task is defined as follows: For a set of i.i.d. samples given by S(n) =
{x1 , . . . , xn |xi ∼ P ∀i} and labels {t (x1 ), . . . , t (xn )} for a target concept t, what is
the concept h ∈ H that minimizes the generalization error? Here, H is the set of
17.2 Computational and Statistical Learning Theory 493

hypotheses, called hypothesis space, accessible to the learning algorithm. Possible


cases for a hypothesis space are H = C or C ⊂ H .
Definition 17.1 (Generalization Error) The generalization error or risk of a
concept h ∈ H for a target concept t ∈ C and distribution P is given by
  : ;
R(h) = errorP (h, t) = Prx∼P h(x) = t (x) = Ex∼P Ih(x)=t (x) . (17.5)

This error measure is the probability of misclassifications of concept h for all


possible data x sampled from P given the true classifier t.
Alternatively, one can formulate the generalization error for the zero-one loss

L(h(x), t (x)) = Ih(x)=t (x) (17.6)

by
: ;
R(h) = errorP (h, t) = Ex∼P L(h(x), t (x)) . (17.7)

Due to the fact that in reality we have only a finite data sample to assess the
quality of h, the generalization error needs to be approximated by the empirical
risk, also called empirical error, given by

1
n
< = error
R(h) = S(n) (h, t) = Ih(x)=t (x) . (17.8)
n
i=1

Here, S(n) = {x1 , . . . , xn |xi ∼ P ∀i} is a data set of sample size n.


Taking the expectation value of Eq. 17.8 with respect to S(n) ∼ P , i.e., for
all data sets S(n) of size n sampled from P , one can show that this results in the
generalization error [344],
: ;
= S(n) (h, t) .
errorP (h, t) = ES(n)∼P error (17.9)

Now, we can give the definition of learnability according to a PAC model [341,
344].
Definition 17.2 (Probably Approximately Correct (PAC) Learning) Let C be a
concept class. We call C probably approximately correct (PAC) learnable if there
exists a prediction model M that outputs hS given S(n), H , and such that for every
target concept t ∈ C, for every distribution P over X, and for all 0 < δ, ε < 1/2, M
can learn a concept hS ∈ H such that
 
PrS(n)∼P errorP (hS , t) ≤ ε ≥ 1 − δ (17.10)

for all sample sizes n with n ≥ n0 .


494 17 Foundations of Learning from Data

Here, n0 is a polynomial that grows in 1/ε, 1/δ.


In the preceding definition, the prediction model M is also called a learning
algorithm because, given a data set S(n) and hypothesis space H as input, it can
learn the concept hS ∈ H ; that is,

M(S(n), H ) = hS , (17.11)

such that Eq. 17.10 holds. By “learn,” we mean a procedure that selects hS from all
available concepts in H . It is important to note that because the result of M depends
on S(n), which is a random sample from P , h is a random variable.
We would also like to note that there is no assumption regarding the underlying
distribution P from which the data are sampled. For this reason, PAC learning is a
distribution-free model.
For clarity, we would like to highlight that PAC learning gets its name from the
two-step estimation of the error given by
 
PrS(n)∼P errorP (hS , t) ≤ ε ≥ 1 − δ. (17.12)
1 23 4
approximately correct
1 23 4
probably

If there are additional properties of the runtime of the prediction model M, the
preceding definition can be extended.
Definition 17.3 (Efficient PAC Learnability) Suppose that all the conditions in
Definition 17.2 hold. If, in addition, the time complexity of M is poly(1/δ, 1/ε) for
a fixed polynomial “poly,” we call C efficiently PAC learnable.
If the hypothesis space, H , is finite and a learning algorithm M can find a
= S(n) (h, t) = 0, then h is called a consistent hypothesis
hypothesis h with error
and M a consistent learner. Based on this, one can define the set of all consistent
hypotheses, called version space V SH,S(n) .
Definition 17.4 (Version Space) For a target concept t, a hypothesis space H , and
a sample S(n), the following set is called a version space:

V SH,S(n) = {h ∈ H |h(x) = t (x)∀x ∈ S(n)}. (17.13)

For this special case of a consistent learner M, one can derive the following bound:
Theorem 17.1 (Finite Hypothesis Space with h Consistent) Let H be a finite
hypothesis space and M a learning algorithm that finds, for any target concept
t ∈ H and S(n), a consistent hypothesis hS ∈ V SH,S(n) . Then, for any ε, δ > 0,
Eq. 17.10 holds if n ≥ n0 with

1 1
n0 = log|H | + log . (17.14)
ε δ
17.2 Computational and Statistical Learning Theory 495

Here, |H | is the cardinality of the hypothesis space. This characterizes the sample
complexity.
Proof Let’s derive a bound for the probability that the true error, given a consistent
hypothesis hS , is larger than ε.
 
= S(n) (hS , t) = 0 ∧ errorP (hS , t) > ε ≤
PrS(n)∼P error (17.15)
 
PrS(n)∼P ∃h ∈ H : error
= S(n) (h, t) = 0 ∧ errorP (h, t) > ε ≤ (17.16)
|H |
  
= S(n) (hi , t) = 0 ∧ errorP (hi , t) > ε ≤
PrS(n)∼P error (17.17)
i=1
|H |
    
= S(n) (hi , t) = 0 = |H |PrS(n)∼P error
PrS(n)∼P error = S(n) (h, t) = 0
i=1
(17.18)

Now, we derive a bound for the last probability.


   
= S(n) (h, t) = 0 = PrS(n)∼P h(xi ) = t (xi )|∀xi ∈ S(n) (17.19)
PrS(n)∼P error

)
n  
= PrS(n)∼P h(xi ) = t (xi ) (17.20)
i=1
 n
= PrS(n)∼P h(x) = t (x) (17.21)
   n
= 1 − PrS(n)∼P h(x) = t (x) (17.22)
 n
= 1 − errorP (h, t) (17.23)

≤ (1 − ε)n (17.24)
 
≤ exp − εn (17.25)

The second-to-last follows from

errorP (h, t) > ε ⇒ (17.26)


1 − errorP (h, t) ≤ 1 − ε (17.27)

In summary, this leads to the following bound, according to Haussler:


 
= S(n) (hS , t) = 0 ∧ errorP (hS , t) > ε
PrS(n)∼P error (17.28)
496 17 Foundations of Learning from Data

 
≤ |H |exp − εn (17.29)

$
#
From the preceding proof, one can see that the probability in Eq. 17.28 is for a
consistent hypothesis hS . By extending the definition of the version space, we can
obtain a more intuitive understanding of this probability.
Definition 17.5 (ε-Exhausted Version Space) The version space V SH,S(n) is
called ε-exhausted with respect to a target concept t and P if every hypothesis h
in V SH,S(n) has a generalization error less than ε; that is, ∀h ∈ V SH,S(n) we have

errorP (h, t) < ε. (17.30)

From this definition, one can see that the probability in Eq. 17.28 is the probability
that the version space is not ε-exhausted. This means that at least one consistent
hypothesis has an error larger than ε. Furthermore, one can see that by varying ε
and n one can control the size of the ε-exhausted version space. For instance, by
increasing n for a fixed ε, one can shrink this space by removing hypotheses that
violate the assumptions because they have a larger error than allowed.
Let’s study an example with a PAC-learnable concept class to see how to apply
the preceding definitions.

17.2.2.1 Example: Rectangle Learning

Suppose that the space X is two-dimensional (x, y), with x, y ∈ R+ . The concept
class C consists of all axis-aligned rectangles defined by

+1 ≡ • if x is inside the rectangle;
f (x) = (17.31)
−1 ≡ • if x is outside the rectangle.

for x = (x, y) with x, y ∈ R+ .


The learning algorithm L is defined as follows: For a sample of size n given by
S(n) = {(x1 , y1 ), . . . , (xn , yn )}, determine

xl = min{x1 , . . . , xn }, (17.32)
xu = max{x1 , . . . , xn }, (17.33)
yl = min{y1 , . . . , yn }, (17.34)
yu = max{y1 , . . . , yn }. (17.35)

These coordinates define the lower, (xl , yl ), and upper, (xu , yu ), boundaries of a
rectangle.
17.2 Computational and Statistical Learning Theory 497

Fig. 17.2 The target concept


h is shown as a green xu target concept h
rectangle, and the best
solution f obtained by the
learning algorithm L, given f P(R)
yu
the data sample, is shown as a P(R ) = 
4
blue rectangle. The orange
rectangle R is one of the four
rectangles needed to derive a yl
bound for errorP (f, h).
xl xu R

In Fig. 17.2 we show a visualization of this problem, where the target concept
h is shown in green, and a blue rectangle is shown that corresponds to the concept
f as a result of the algorithm L. Furthermore, the boundaries of the concept f are
extended to the surrounding target concept h, visualized by dashed lines. Due to the
problem setting, the rectangle of f is always contained in or equal to h ∀S(n). That
means, in this setting, there can be no false positives, only false negatives.
Suppose that we have data S(n) such that f is contained in h, as shown in
Fig. 17.2. Then, the only difference in the classification performance between f
and h involves the data that are observed in the green area, because f declares such
data points as red when they are in fact blue. In the following, we derive a bound for
this error, called errorP (f, h).
So far we know that errorP (f, h) is equivalent to the error probabilities of the
green areas, which could themselves be approximated by the error probabilities of
the four overlapping rectangles surrounding f . In Fig. 17.2, we show just one of
these surrounding rectangles, called R. Unfortunately, these error probabilities are
unknown. For this reason, we will derive a bound for these probabilities, which are
finally given in Eqs. 17.2.2.1.
We start this derivation by supposing that errorP (f, h) > ε for a fixed ε > 0 and
focus, in the following, on R, because the argument for the other three rectangles is
analogous. For this fixed ε, we can identify another rectangle R by adjusting the xu
coordinate to xu such that

ε
errorP (R (xu ), h) = . (17.36)
4

This error can be interpreted as the probability that one data point falls inside R and
not outside; that is,
  ε
PrS(1)|P one data point falls inside R’ = errorP (R (x2 ), h) = . (17.37)
4
498 17 Foundations of Learning from Data

Hence, the complementary event has the probability


  ε
PrS(1)|P one data point falls outside R’ = 1 − errorP (R (x2 ), h) = 1 − .
4
(17.38)

Finally, the probability that R does not include any data point from a sample of size
n is
 ε n
PrS(n)∼P (R’ does not include any data point from S(n)) = 1 − (17.39)
4
thanks to Eq. 17.38 and the fact that the n data points are i.i.d samples. This
probability is important, and it will be utilized later.
Now, we need to distinguish between two cases: (1) errorP (f, h) ≤ ε and (2)
errorP (f, h) > ε. If (1) holds, there is nothing to show, and we are done. So, let’s
assume that (2) holds. However, in this case at least one R must be included in the
corresponding R so that
 
PrS(1)|P one data point falls inside R >
  ε
PrS(1)|P one data point falls inside R’ = (17.40)
4
in order to enable errorP (f, h) > ε. From this follows the bound for the negative
events, given by

PrS(n)∼P (R does not include any data point from S(n)) ≤


PrS(n)∼P (R’ does not include any data point from S(n)).

Putting everything together for the four rectangles Ri , we obtain

PrS(n)∼P (errorP (f, h) > ε) ≤


PrS(n)∼P (f does not include ∪4i=1 {Ri }) =
PrS(n)∼P (∪4i=1 {Ri does not include any data point from S(n)}) ≤
PrS(n)∼P (∪4i=1 {Ri does not include any data point from S(n)}) ≤
Pr(Ri ∩Rj =∅)≥0


4
PrS(n)∼P (Ri does not include any data point from S(n)) =
i=1
 ε n
4 1− ≤
4
 nε 
4exp − ≤ δ
4
17.2 Computational and Statistical Learning Theory 499

The penultimate inequality is justified because 1 − x ≤ exp(−x) for all x ∈ R.


From the last inequality, it follows that

4 4
n≥ ln . (17.41)
ε δ
This allows us to state the final result as follows: If Eq. 17.41 holds, then

Pr(errorP (f, h) ≤ ε) ≥ 1 − δ. (17.42)

Overall, this proves that rectangle learning is an example of a PAC-learnable


problem.

17.2.2.2 General Bound for a Finite Hypothesis Space H : The


Inconsistent Case

Earlier, we gave a bound for a finite hypothesis space H and the consistent case.
= S(n) (h, t) = 0 and t ∈ H . Now, we generalize this
The consistent case implies error
for a finite hypothesis space H for the inconsistent case. This means that t ∈ H , and
so the learning algorithm M can only find a hypothesis h ∈ H that minimizes the
training error. In this case, we call M an agnostic learner.
Theorem 17.2 (Finite Hypothesis Space H with h Inconsistent) Let H be a finite
hypothesis space, that is, |H | < ∞, and M a model that returns hS for a given target
concept t ∈ H and S(n). Then for any δ > 0, and any P over X, the following
inequality holds — at least with probability 1 − δ:

errorP (hS , t) ≤ e
rrorS(n) (hS , t) + ε (17.43)

with

ln|H | + ln 2δ
ε= (17.44)
2n

Alternatively, Theorem 17.2 can be formulated as follows:


. . 
PrS(n)∼P .errorP (hS , t) − error
= S(n) (hS , t). ≤ ε ≥ 1 − δ. (17.45)

The sample complexity for the preceding bound is

1  2
n0 = 2
log|H | + log , (17.46)
2ε δ
which is now quadratic in ε, in contrast with Eq. 17.14 (consistent case).
500 17 Foundations of Learning from Data

Proof Since the hypothesis space H is finite, let h1 , . . . , h|H | denote all possible
hypotheses/concepts. Then, we can write
. . 
= S(n) (hS , t). > ε ≤
PrS(n)∼P .errorP (hS , t) − error
 . . 
PrS(n)∼P ∃h ∈ H with .errorP (h, t) − error
= S(n) (h, t). > ε =
. .
PrS(n)∼P .errorP (h1 , t) − error
= S(n) (h1 , t). > ε ∪ . . .
. . 
∪ .errorP (h|C| , t) − error
= S(n) (h|C| , t). > ε ≤
|H |
 . . 
PrS(n)∼P .errorP (hi , t) − error
= S(n) (hi , t). > ε ≤
i=1
 
2|H |exp − 2nε2 = δ.

$
#
For the last step, we used Hoeffding’s inequality, given by
. . 
PrS(n)∼P .errorP (hi , t) − error
= S(n) (hi , t). > ε ≤ 2exp(−2nε2 ). (17.47)

We would like to note that neither Theorem 17.2 nor its proof makes any
assumption about the empirical risk error = S(n) (h, t), thus allowing the values
= S(n) (h, t) ≥ 0. As mentioned earlier, the case error
error = S(n) (h, t) = 0 is called
= S(n) (h, t) > 0 is called inconsistent. Put simply, the
consistent, and the case error
inconsistent case means that the hypothesis space H does not include a concept
h = t that would lead to zero empirical risk. Nevertheless, as shown, Theorem 17.2
provides a bound for the inconsistent case.
Theorem 17.2 assumes, however, a finite hypothesis space H . This is indeed
a severe limitation because many classification algorithms are able to realize an
infinite number of concepts, and hence they have an infinite hypothesis space. The
general question is whether learning, in the sense of Definition 17.2, from a finite
data sample S(n) with n < ∞, is also possible for |H | = ∞. The answer is yes,
and the axis-aligned rectangle problem in Sect. 17.2.2.1 provides again an example
for this. The general answer to this question will be given in the next section.

17.2.3 Vapnik-Chervonenkis (VC) Theory

A problem with the bounds derived so far is that these depend on the size of
the hypothesis space |H |; see Eqs. 17.14 and 17.46. So, they cannot be used for
an infinite hypothesis space. For this reason, in this section we will extend the
17.2 Computational and Statistical Learning Theory 501

preceding results to the case of an infinite hypothesis space. We will do this by using
two purely combinatorial entities: the growth function and the VC dimension. This
will lead to generalization bounds based on the VC dimension instead of on |H |.
We would like to remark that, alternatively, this could also be achieved using the
Rademacher complexity, which quantifies the richness of a family of functions to fit
random data; however, this can be NP-hard.
Before we can provide a formal definition of the VC dimension, we need to
introduce some auxiliary definitions that will clarify the following discussion. First,
we will formalize the meaning of the effective number of hypotheses.
Definition 17.6 (Dichotomies Generated by H ) Let S(n) = {x1 , . . . , xn |xi ∼
P ∀i}. The dichotomies generated by a hypothesis space, H , on S(n) are defined by

/ 0
H (S(n)) = (h(x1 ), . . . , h(xn ))|h ∈ H . (17.48)

That means the set H (S(n)) contains all the different N-tuples that are realizable by
the hypothesis space H .
Definition 17.7 (Growth Function) Let H be a hypothesis space. The mapping
H (n) : N → N, defined for any n ∈ N, by
. .
. .
H (n) = max .H (S(n)). (17.49)
S(n)∼P

is called the growth function.


The growth function gives, for each n, the maximum number of dichotomies that
can be generated by H . Since for any hypothesis space H , H (S(n)) ⊆ {−1, +1}n ,
the total number of different N-tuples is given by

H (n) ≤ 2n . (17.50)

Definition 17.8 The data sample S(n) = {x1 , . . . , xn } is called shattered by H if


H (n) = 2n .
In this case, the hypothesis space, H , realizes all possible dichotomies of S(n) ∼ P .
Definition 17.9 (VC Dimension) The Vapnik-Chervonenkis (VC) dimension of
hypothesis space H is the size of the largest set that can be shattered by H ; that is,
5 6
VC-dimension(H ) = max n : H (n) = 2n . (17.51)
n∈N

In the following, we denote the VC dimension by dV C . If sets of arbitrary size can


be shattered, then dV C = ∞. It is important to remark that a VC dimension of dV C
502 17 Foundations of Learning from Data

does not mean that any set can be shattered by H , but rather that there exists at least
one set such that this is possible.

17.2.3.1 Example: One-dimensional Intervals

Let X = R be the one-dimensional real axis, and a concept h is a closed interval


[a, b] defined by

0 if x ∈ [a, b];
h(x) = (17.52)
1 if x ∈ [a, b].

This defines the hypothesis space H of the problem. The points for m = 2, shown
in Fig. 17.3, can be shattered by the shown concepts. For this reason, dV C ≥ 2.
However, there are no three data points with label order (0, 1, 0) that could be
shattered by h, because that would require two intervals. However, our hypothesis
space, as defined earlier, does not allow this. In Fig. 17.3, we show the general
problem for m = 3 and this label order. Note that the absolute position of the three
data points is not important for this argument. The only argument that matters is the
order of the labels. Hence, for this problem dV C = 2.

17.2.3.2 Example: Axis-Aligned Rectangles

The following two results are important because they connect the growth function
with the VC dimension.
Theorem 17.3 (Sauer’s Lemma) Let H be a hypothesis space with VC dimension
dV C . Then, for all n ∈ N, the growth function is bound by the following inequality:

d  
 n
H (n) ≤ . (17.53)
i
i=0

Fig. 17.3 Examples for m=2: class labels


m = 2 and m = 3 for
[ ] 0,0
X = R, with intervals as
concept functions h. For the
case m = 3, no two intervals
[ ] 0,1
capable of realizing the
labeling 0, 1, 0, exist.
[ ] 1,0

[ ] 1,1

m=3:
[ ] [ ] 0,1,0
17.3 Importance of Bias for Learning 503

Corollary 17.1 Let H be a hypothesis space with VC dimension dV C . Then, for all
n ≥ dV C , we have
 en dV C
H (n) ≤ . (17.54)
dV C

Here, e corresponds to Euler’s number.

17.3 Importance of Bias for Learning

Another important aspect we want to discuss concerns the need of bias for learning.
This may sound like a contradiction at first, because bias will restrict our hypothesis
space and thus remove some flexibility of our model or learning algorithm. To
understand this better, we discuss first the no free lunch theorem.
The no free lunch (NFL) theorem [432] by Wolpert states that with a lack
of prior knowledge (or inductive bias), any learning algorithm may fail on some
learnable task. This implies the impossibility of obtaining meaningful bounds on
the error of a learning algorithm without prior assumptions and modeling.
Theorem 17.4 (No Free Lunch) Let M be a learning algorithm for a binary
classification task, with respect to a zero-one loss over a domain X. Let the sample
size, n, be any number smaller than |X|/2, and let S(n) be the training data. Then,
there exists a distribution P such that
• there exists a concept h = t ∈ H with errorP (h, t) = 0; and
• with a probability of at least 1/7 over S(n), we have errorP (M(S(n), H ), t) ≥
1/8.
This theorem states that although the hypothesis space, H , contains the target
concept t, given a finite training sample S(n), the learning algorithm M cannot find
t for a particular distribution P .
Wolpert and others provided additional theorems proving, for example, the
following results [508]:
• For any equally weighted overall measure, each algorithm will perform equally
well.
• Averaged over all target concepts t, two algorithms will perform equally.
• Averaged over all distributions P , two algorithms will perform equally.
From all these results, it follows that there is no one learning algorithm better
than any other for minimizing the expected error averaged over all possible tasks.
Thus, no universally best learning algorithm exists. This provides one explanation
for the plurality of methods one can find in data science.
We would like to highlight that the proof of the NFL theorems works because we
are averaging over all possible tasks. That means a possible improvement may result
from excluding unfavorable distributions that prohibit PAC learning. However, this
504 17 Foundations of Learning from Data

would require us to introduce a bias in learning. One form of bias frequently used
is Occam’s razor, also called the bias of simplicity [507]. In general, this means
that when there are multiple possible explanations/models given the same data, we
should choose the simplest one.
Another formulation of this problem can be given using the generalization
capabilities of a model. For instance, Schaffer argued that generalization (to unseen
data) is only possible if we have additional information about the problem besides
the training data [423]. This information could be domain knowledge about the
problem or knowledge about characteristics of different learning algorithms. How-
ever, regardless of the nature of the bias, it is generally believed that generalization
without bias is not possible [340].
To introduce a bias (or prior knowledge) for learning, one has, in general, the
following options [488]:
• Change the hypothesis space H .
• Change the probability distribution P .
• Change the input space X.
• Change the loss function.
It is important to emphasize that none of these measures will be able to eliminate
the results of the NFL theorems, but by introducing a bias in learning, the average
practical generalization’s accuracy can be increased for a particular case, rather than
for all cases [507].
In general, one can formalize a machine learning problem as either a parameter
optimization problem or a hypothesis search problem. In 1995, Vapnik proposed
another minimization criterion: the risk minimization principle. Algorithms like
support vector machines, instead of minimizing a function of the error, minimize the
margin, which is defined as the distance between the nearest examples (the support
vectors) and the decision surface.

17.4 Learning as Optimization Problem

So far we have neglected the problem of how a learning algorithm M selects a


hypothesis from the hypothesis space H . In this section, we present empirical risk
minimization (ERM) and structural risk minimization (SRM) as induction principles
for inferring an optimal hypothesis. We will see that both induction principles
formulate learning as an optimization problem.

17.4.1 Empirical Risk Minimization

Earlier, we discussed how the generalization error (risk) is not accessible to the
learning algorithm. For this reason, it cannot be directly used. Instead, we use the
17.4 Learning as Optimization Problem 505

<
empirical risk R(h) <
as an approximation for the risk. Specifically, by using R(h),
we try to find the hypothesis, which minimizes the empirical risk; that is,

<
hS = argmin R(h). (17.55)
h∈H

This induction principle is called the empirical risk minimization (ERM).


When introducing the empirical risk, one can show that its expectation value cor-
responds to the generalization error; see Eq. 17.9. Furthermore, using Hoeffding’s
inequality (see Exercise 2), one can show that the empirical risk provides a good
approximation for the generalization error. Thus, the empirical risk minimization
corresponds to the minimization of the risk in the limit.

17.4.2 Structural Risk Minimization

In Chap. 13, we discussed regularization as a principle to penalize the optimization


functional of (regression) models in order to perform an implicit form of model
selection. It is possible to extend this to the ERM, which is intended to deal with a
large sample size, by introducing a regularized ERM. This is called structural risk
minimization (SRM).
SRM, introduced by Vapnik and Chervonenkis, is an inductive principle for
model selection used for learning from finite training data. It provides a trade-off
between the quality of fitting the training data (empirical risk) and the complexity
of the hypothesis space (model complexity) via the VC dimension.
Let’s consider a sequence of nested hypothesis spaces H = ∪m i=1 Hi ,

H1 ⊂ H2 ⊂ H3 · · · ⊂ Hm (17.56)

with a growing VC dimension

dV C (1) < dV C (2) < · · · < dV C (n) (17.57)

and a training data set S(n). For instance, H may be the class of all polynomial
classifiers, where each Hi corresponds to the class of polynomial classifiers of
degree i. This way, Hi+1 represents more complex models than Hi , and the nested
hypotheses allow the expression of prior knowledge by specifying preferences over
hypotheses within H.
In general, the structural risk minimizes the empirical risk for each hypothesis
space Hi using a regularization term J (h, n) that considers the complexity (VC
dimension) of Hi , as follows:
5 6
< + λJ (h, n) .
hS = argmin R(h) (17.58)
h∈H
506 17 Foundations of Learning from Data

This is called the structural risk minimization (SRM).


For binary classification, the SRM is given by

5 
< + min dV C (i)log(2n/dV C (i)) + log(4/δ) 6
hS = argmin R(h) . (17.59)
h∈H h∈Hi n

On a general note, we would like to remark that our discussion of ERM and SRM
showed that a learning problem can be converted into an optimization problem. The
objective (or cost) function of the optimization problem is the empirical risk (and the
regularization term), and the domain of the learning algorithm M is the hypothesis
space H .

17.5 Fundamental Theorem of Statistical Learning

Finally, we are in a position to state the central result of statistical learning theory.
The fundamental theorem of statistical learning shows that the VC dimension
completely characterizes the PAC learnability of hypothesis classes of binary
classifiers. The fundamental theorem states that a hypothesis class is PAC learnable
if and only if its VC dimension is finite. The theorem also shows that if a problem
is PAC learnable, then uniform convergence holds, and therefore the problem is
learnable using the ERM rule [48, 432].
Theorem 17.5 (Fundamental Theorem of Statistical Learning) Let H be the
hypothesis space of functions from X to {0, 1} and let the loss function be the zero-
one loss. Then there exist constants C1 and C2 such that the following statements
are equivalent:
• H is PAC learnable with sample complexity

dV C + log(1/δ) dV C log(1/ε) + log(1/δ)


C1 ≤ m H ≤ C2 (17.60)
ε2 ε2
• The VC dimension of H , denoted by dV C , is finite.
• ERM is a successful PAC learner for H .
• H has the uniform convergence property.

Lemma 17.1 (Sauer’s Lemma) If V C − d(H ) = d, then even though H might be


infinite, when restricting it to a finite set CX, its “effective” size is only O(|C|d ).
The VC dimension determines the general outline of the growth function, which
in turn determines whether a class satisfies uniform convergence. This is equivalent
to agnostic PAC learning, which implies PAC learning. This implies a finite VC
dimension thanks to the no free lunch theorems.
17.7 Modern Machine Learning Paradigms 507

17.6 Discussion

Computational learning theory addresses the problem of finding optimal gener-


alization bounds for supervised learning. We presented two formalisms of this
framework: probably approximately correct (PAC) learning and the VC theory. Both
approaches are nonparametric and distribution-free.
We have seen that, based on certain assumptions, one can derive error bounds
and sample complexities to characterize learning in various situations. This could
be done for a finite hypothesis space (PAC learning) and an infinite hypothesis
space (VC dimension). A somewhat sobering result has been contributed by the
NFL theorem because it shows that there is no universally best learning algorithm.
Furthermore, we have seen that the ERM and SRM allow one to convert the learning
problem into an optimization problem to find an optimal hypothesis. Overall, this
led to the formulation of the fundamental theorem of statistical learning, which
summarizes and interconnects all those results.
The insights of the NFL theorem are intimately related to using the bias of
learning to improve the generalization abilities of a learning algorithm for unseen
data. Put simply, bias can be seen as a specialization in a model to obtain a
generalization in learning. This motivates data-driven models to achieve a better
generalization.
On a practical note, we would like to mention that boosting is a PAC-inspired
method, and SVM is based on minimizing the VC dimension. Hence, the finger-
prints of both frameworks can be found in practical methods, although neither is
intended to provide practical tools for analyzing data; rather, both are meant to
provide formal verification approaches.

17.7 Modern Machine Learning Paradigms

The second fundamental aspect we will discuss in this chapter is learning paradigms.
So far, we have discussed in this book many methods and algorithms from data
science. Interestingly, from the perspective of the underlying learning paradigm, all
of them fall into the realm of supervised learning (including the first part of this
chapter) and unsupervised learning. However, modern data sets (or combinations
thereof) can have characteristics that cannot be described by those two learning
paradigms. For this reason, advanced machine learning paradigms have been
introduced. In the following, we discuss seven modern learning paradigms.
1. Semi-supervised learning
2. One-class classification
3. Positive-unlabeled learning
4. Few/one-shot learning
5. Transfer learning
6. Multi-task learning
7. Multi-label learning
508 17 Foundations of Learning from Data

For clarity reasons, we want to mention that, according to Kuhn [70, 298], a
paradigm is generally characterized as follows:
A scientific paradigm is a set of concepts, patterns, or assumptions to which those in a
particular professional community are committed and which forms the basis of further
research.

To further highlight the importance of having a paradigm in science, the term


worldview has been suggested as a synonym to describe “a way of thinking about
and making sense of the complexities of the real world” [281, 382]. Since machine
learning is a scientific field, the preceding definition can be applied directly to define
the machine learning paradigms (for short, called learning paradigms) as used in
this chapter.
We will see that each of the modern learning paradigms has different require-
ments for the underlying data. So, they do not merely provide alternative algorithmic
or computational approaches for existing data characteristics but instead establish
new conceptual approaches.

17.7.1 Semi-supervised Learning

The idea of semi-supervised learning is to use both labeled and unlabeled data when
performing a supervised learning task [74].
Definition 17.10 A domain D consists of a feature space X and a marginal
probability distribution P (X), where X = {X1 , . . . , Xn } ∈ X; that is, D =
{X, P (X)}.

Definition 17.11 A task T consists of a label space Y and a prediction function


f (X), with f : X → Y; that is, T = {Y, f (X)}.
The definition of a domain is similar to that of supervised learning; however, the
resulting data are different. Specifically, for semi-supervised learning, there are two
parts of the data: a labeled part, DL = {(xi , yi )}ni=1
L
, with xi ∈ X and yi ∈ Y, and an
nU
unlabeled part DU = {(xj )}j =1 . This means that the available data are of the form
D = DL ∪ DU (see Fig. 17.4).
Formally, semi-supervised learning can be defined as follows:
Definition 17.12 Given a domain, D, with task, T, labeled data DL = {(xi , yi )}ni=1
L
nU
with xi ∈ X and yi ∈ Y, and unlabeled data DU = {(xj )}j =1 , semi-supervised
learning is the process of improving the prediction function, f , by utilizing the
labeled and unlabeled data.
17.7 Modern Machine Learning Paradigms 509

data training

Semi-supervised learning
Use positive, negative and unlabeled
instances for learning.

One-class classification
Use either only positive instances or
positive and unlabeled instances for
learning.

Positive-unlabeled learning
Use positive and unlabeled instances
for learning.
positive class:

negative class:

unlabeled class:

Fig. 17.4 Characterization of semi-supervised learning, one-class classification, and positive-


unlabeled learning. The class labels of instances are distinguished by the color; for instance, a
positive class is blue, a negative class is red, and an unlabeled class is brown.

17.7.1.1 Methodological Approaches

For semi-supervised learning, a broad variety of methods have been proposed.


However, they can be distinguished based on the following two key concepts [532]:
1. Inductive methods
2. Transductive methods
Both concepts are fundamentally different from each other, and the training
and prediction parts of such methods are largely different [180]. Put simply, these
concepts can be formulated as follows:
Induction is reasoning from observed training cases to general rules, which
are then applied to the test cases.
In contrast, transduction has the following meaning:
Transduction is reasoning from observed, specific (training) cases to specific
(test) cases.
It is important to note that this implies that transductive learning does not
distinguish between a training and testing step of a model. Instead, it uses both
the training and testing data for training the model, in contrast with inductive
learning. Consequently, transductive learning does not build a predictive model. For
this reason, to test a new instance, the model needs to be trained again for all the
available data. This is not necessary for inductive learning, because it leads to a
predictive model that can be used for new instances without retraining the model.
510 17 Foundations of Learning from Data

It is interesting to note that transductive learning is either explicitly or implicitly


graph-based because information has to be propagated between different data points,
which can be seen as nodes in a graph [318, 482]. A recent comprehensive review
of semi-supervised learning, including details about algorithmic realizations, can be
found in [482].

17.7.2 One-Class Classification

The idea of one-class classification (OCC) is to distinguish instances from one


particular class from those outside this class [351, 406, 462]. This is quite different
from ordinary classification, and for this reason, OCC has also been called outlier
detection, novelty detection, anomaly detection, or concept learning [266, 413].
OCC focuses on one particular class only.

17.7.2.1 Methodological Approaches

According to [282], one-class learning approaches can be categorized with respect


to the way they are using the training data. This allows one to distinguish
approaches by utilizing only positive data from approaches that learn from positive
and unlabeled data. The latter has been of widespread interest, and it is called
positive-unlabeled learning. Due to the importance of such methods, we discuss
this subcategory of one-class learning in the next section.
From a methodological point of view, there are two key concepts for one-class
classification that use only positive-labeled data [29]:
1. Density estimation
2. Boundary estimation
Density estimation methods estimate the density of the data points with a positive
label. A new instance is classified according to a threshold [461]. Meanwhile,
boundary estimation methods focus on setting boundaries around a small set of
points, called target points. Some examples of methods from this category utilize
support vector machines or neural networks [326, 422].
It is interesting to note that one-class classification that uses only positive-labeled
data for density estimation is conceptually similar to statistical hypothesis testing
[145]. However, methodologically, these approaches are different because OCC
is not based on the concept of a sampling distribution, which specifies not only
the estimation precisely but also the statistical interpretations thereof. In contrast,
OCC approaches to density estimation are more broad, and for this reason they vary
considerably in their interpretation.
17.7 Modern Machine Learning Paradigms 511

17.7.3 Positive-Unlabeled Learning

For positive-unlabeled learning, we face a classification problem when only labeled


instances of one class are available. In addition, we have unlabeled data, which can
come from any class, but their labels are unknown. For this reason, we have labeled
data from one class (termed as “positive”) complemented by unlabeled data. The
goal is to utilize these data for a classification task.
To obtain the data, we assume that np positive samples are randomly drawn from
the marginal distribution P (x|Y = +1) and ni unlabeled samples are randomly
np
drawn from P (x) [371], resulting in the two data sets Dp = {(xi , yi )}i=1 with
nu
xi ∈ X, yi ∈ Y, and Du = {(xi )}i=1 with xi ∈ X. Hence, in total we have the data
set D = Dp ∪ Du with n = np + nu samples. Furthermore, we assume also that for
xi ∈ Du , labels in Y exist, but are not observed.
Due to the lack of observable instances for the entire label space Y, the problem
is limited to a binary label space (simplifying the complexity).
Definition 17.13 The task T of positive-unlabeled learning consists of a label space
Y and a prediction function f (X) with f : X → Y, that is, T = {Y, f (X)}, where
the label space Y is binary, that is, |Y| = 2.
Based on this definition and the previous assumptions, positive-unlabeled learn-
ing can be formally defined as follows:
Definition 17.14 Given D = Dp ∪ Du , positive-unlabeled learning is the process
of improving the prediction function f of the binary task T by utilizing Dp and Du .
Such approaches exploit inductive and transductive learning approaches, both of
which adopt an iterative procedure to obtain reliable negative training data from
the unlabeled data [380]. An example of such an inductive PU (positive-unlabeled)
learning algorithm using bagging (also known as Bootstrap aggregating) SVM to
infer a gene regulatory network (GRN) is presented in [347].

17.7.3.1 Methodological Approaches

The main methodological approaches for positive-unlabeled learning can be distin-


guished as follows:
1. Two-step methods
2. Weighting methods
The two-step methods use the unlabeled data in step one to identify negative
instances and then use a traditional classifier in step two. The weighting methods
estimate real valued weights for the unlabeled data and then learn a classifier based
on these weights. The weights represent the likelihood, or conditional probability,
that an unlabeled instance belongs to a certain class. Hence, the problem is converted
into a (constrained) regression problem.
512 17 Foundations of Learning from Data

Recently, a generative adversarial network (GAN) - which is an advanced deep


neural network model - was introduced for PU learning called GenPU [256].
GenPu consists of a number of generators and discriminators similar to a minimax
game. These components simultaneously generate positive and negative samples
with realistic properties, which can then be used with a standard classifier. For
comprehensive reviews of positive-unlabeled learning, the reader is referred to
[31, 267, 526].

17.7.4 Few/One-Shot Learning

The idea of few/one-shot learning is to utilize a (large) training set for learning a
similarity function, which is then used in combination with a very small data set
containing only one or a few instances of unknown classes to make predictions
about these unknown classes [163, 274, 496].
Thus, few/one-shot learning utilizes semantic information from the training data
to deal with few/one instances of new classes that are unknown from the training
data. In Fig. 17.5, we summarize the idea of few/one-shot learning.
Few/one-shot learning utilizes three key components: (1) a labeled data set D, (2)
a support set DSu , and (3) a query q representing a new instance for which a class
label should be predicted. The labeled data D is given by D = {(xi , yi )}ni=1 with
xi ∈ X and yi ∈ Y and i ∈ {1, . . . , n}, where n is the sample size, X is the feature
space, and Y is the label space. If the cardinality of the label space is larger than two,
that is, |Y| > 2, then we have a multi-class classification problem, otherwise it is a
binary classification. The data set D serves as the training data to learn a similarity
function g. This similarity function will then be used for evaluating the similarity of
a query q to instances given in the support set Dsu . The support set Dsu is defined
as follows:
5 6S
Definition 17.15 A support set Dsu is a labeled data set Dsu = {(xi , yi )}ni=1 s
s=1
providing information about labeled instances of S classes with yi ∈ Y . For n1 =
· · · = nS = 1, one obtains one-shot learning, and for ni > 1 for all i ∈ {1, . . . , S}
with |ni | being small, few-shot learning is obtained. For n1 = · · · = nS = n, this is
called n-shot, S-way learning.
It is important to note that the label space of the support set Dsu and the training
data D are different, that is, Y = Y . Thus, the semantic transfer from the training
data is accomplished via the similarity function, and the support set serves as a
dictionary to look up the similarity with the query q. In this way, it is possible to
make predictions about new classes that were not in the training data.
The task that is important for few/one-shot learning is to learn a prediction
function, fsu : X → Y′, which maps into the classes given by Y′, instead of Y.
Fig. 17.5 Overview of few/one-shot learning. There are three key components: (1) the labeled data set D = {(xi, yi)}_{i=1}^{n} with xi ∈ X and yi ∈ Y, from which the similarity function g : (xi, xj) → R+ is learned; (2) the support set Dsu with Y′ ≠ Y; and (3) the query q ∈ X representing a new instance for which a class label should be predicted. For the testing, the prediction function fsu is used to evaluate the similarity between q and the instances in the support set Dsu.

Definition 17.16 The task Tsu for few/one-shot learning consists of an outcome space Y′ and a prediction function fsu(X) with fsu : X → Y′; that is, Tsu = {Y′, fsu(X)}.
The distinction between Y and Y′ may appear strange at first because it means
the classes of the training data and the testing data are different. So, how can one
learn from the instances provided by the training data for the testing data, when
the outcome spaces are entirely different? The trick of few/one-shot learning is to
assume that the similarity among instances in the training data and the testing data
are identical. Hence, learning such a similarity function in the form of the function
g allows one to learn from the training data for the testing data despite the fact that
Y′ ≠ Y.
We would like to remark that the preceding assumption about the similarity
among instances in the training data and the testing data determines the quality

of the outcome. Specifically, for infinitely large training data, it should be possible
to learn the similarity function g with high accuracy. However, in the case where
the similarity in the testing data is not captured by g, the prediction function fsu
will not be able to provide meaningful results. Strictly, this is true irrespective of
the sample size of the training data and the number of instances in the support set.
So, if the similarity assumption is violated, no learning occurs, even in the limit of
infinitely large sample sizes.
Based on the preceding definitions, few/one-shot learning can now be defined as
follows.
Definition 17.17 Given a training data set D and a support set Dsu , few/one-shot
learning is the process of improving a prediction function, fsu : X → Y′, for task
Tsu by utilizing D and Dsu .
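The prediction step can be illustrated with a minimal R sketch. Here, a fixed negative Euclidean distance is used merely as a stand-in for the similarity function g, which in practice would be learned from the training data D; the support set and the query are purely illustrative.

# One labeled example per previously unseen class (one-shot support set).
similarity <- function(a, b) -sqrt(sum((a - b)^2))  # stand-in for a learned g
support_x <- list(classA = c(0, 0), classB = c(5, 5), classC = c(-4, 3))
support_y <- names(support_x)

# Prediction function f_su: return the label of the most similar support instance.
predict_one_shot <- function(q) {
  sims <- sapply(support_x, similarity, b = q)
  support_y[which.max(sims)]
}

predict_one_shot(c(4.5, 4.8))   # returns "classB"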

17.7.4.1 Methodological Approaches

To establish a few/one-shot learning model, there are essentially two main concep-
tual approaches:
1. Semantic transfer via similarities
2. Semantic transfer via features
The semantic transfer via similarities means that knowledge extracted from the
training data is utilized for unknown classes by learning similarity concepts. An
example of this is the Siamese network used in [287]. There, the authors learn
an image verification task instead of predicting the classes of instances directly.
Conceptually, this means learning the similarity (or lack thereof) between pairs
of instances. This network is trained for D and then utilized with Dsu , where an
instance from Dsu , such as xsu,i , is used together with a query x. If x is similar
to xsu,i , then the predicted class is ysu,i . The semantic transfer via features was
suggested in [28]. The authors showed that the similarity of novel features to
existing features learned from training data can help in feature adaptation.
Recently, deep learning approaches have been used, for instance, in [487], where
a neural architecture, called Matching Networks, utilizing an augmented memory
that included an attention kernel, was introduced. Another example is Relation
Network (RelNet), introduced in [456]. RelNet learns an embedding and a deep
nonlinear distance metric with a convolutional neural network for comparing query
and sample items.

17.7.5 Transfer Learning

The basic idea of transfer learning (TL) is to utilize information from one task to
improve the learning of a second one [8]. To distinguish the two tasks from each

Fig. 17.6 Visualization of training and testing for transfer learning (top) and multi-task learning
(bottom). For transfer learning, task 1 is usually called source task and task 2 target task. A crucial
difference between transfer learning and multi-task learning is that for the latter all tasks are equal,
whereas the former focuses only on task 2 (the target task). Furthermore, it is important to note
that for multi-task learning, all tasks are evaluated independently from each other.

other, the former is called source task and the latter target task [377, 503, 533]. For
each task, we distinguish the corresponding domain and data. In Fig. 17.6, we show
a visualization of the underlying idea of transfer learning.
Similar to supervised learning (discussed earlier), for transfer learning we also
need the definition of a domain, D, and a task, T.
Definition 17.18 A domain D consists of a feature space X and a marginal
probability distribution P (X), where X = {X1 , . . . , Xn } ∈ X; that is, D =
{X, P (X)}.

Definition 17.19 A task T consists of a label space Y and a prediction function f(X) with f : X → Y; that is, T = {Y, f(X)}.
The prediction function f (X) is learned from a data set D = {(xi , yi )}ni=1 with xi ∈
X and yi ∈ Y for i ∈ {1, . . . , n}, where n is the sample size. Some machine learning
methods provide explicitly probabilistic estimates of f in the form of conditional
probability distributions; that is, f (X) = P (Y |X). So, this is a generalized form
of a prediction function because in the deterministic case this reduces to a delta
distribution δx,y given by

δx,yi = 1 if x = xi with (xi, yi) ∈ D, and δx,yi = 0 otherwise.    (17.61)

For transfer learning, one needs to distinguish between two kinds of domains and
tasks, which are called source domain, DS , and source task, TS , as well as target
domain, DT , and target task, TT , with corresponding source data, DS , and target
data, DT . From these, one can now formally define transfer learning.
Definition 17.20 Given a source domain, DS , with source task, TS , and target
domain, DT , with target task, TT , transfer learning is the process of improving
the prediction function, fT , of the target task using DS and TS .

The preceding definition is quite general in the sense that it does not specify various
aspects. Hence, specifying these leads to different subtypes of transfer learning. In
the following, we distinguish various subtypes from each other.
• Case DS = DT and TS = TT: This corresponds to the traditional machine learning setting when we learn fS from source data DS and continue the learning process with target data DT, where the resulting prediction function is renamed to fT. From this, it follows that transfer learning is obtained when DS ≠ DT or TS ≠ TT. Here, it is important to emphasize that the “or” between the two conditions yields three different cases.
• Case DS ≠ DT: Given that DS = {XS, PS(X)} and DT = {XT, PT(X)}, this can correspond to either XS ≠ XT or PS(X) ≠ PT(X).
  ◦ Homogeneous transfer learning: The case where the feature space of the source domain and target domain is the same — that is, XS = XT — is called homogeneous transfer learning.
  ◦ Heterogeneous transfer learning: The case where the feature space of the source domain and target domain is different — that is, XS ≠ XT — is called heterogeneous transfer learning.
  ◦ PS(X) ≠ PT(X): This is the case where the marginal distributions of the source domain and target domain are different.
• Case TS ≠ TT: Given that TS = {YS, fS(X)} and TT = {YT, fT(X)}, this can correspond to either YS ≠ YT or fS(X) ≠ fT(X).
  ◦ YS ≠ YT: This case means that the label spaces of the source task and the target task are different. For instance, this can be the result of there being a different number of classes in the source task and target task.
  ◦ fS(X) ≠ fT(X): Given that the prediction functions generalize to conditional probability distributions, this means PS(Y|X) ≠ PT(Y|X).

17.7.5.1 Methodological Approaches

For transfer learning, a variety of different perspectives have been suggested for the
categorization of this learning paradigm. For instance, one could assume a view with
respect to traditional paradigms, distinguishing between inductive, transductive, and
unsupervised transfer learning [377], or use a model-based view [533]. However, the
most common categorization is based on “what to transfer” [377]:
1. Feature-based TL
2. Parameter-based TL
3. Instance-based TL
4. Relational-based TL
1. For feature-based TL, good feature representations are learned from the source task, and they are assumed to be useful for the target task as well. Hence, in this case, the knowledge transfer between source task and target task is done via learning feature representations.
2. For parameter-based TL, some parameters or prior distributions of hyperparameters are transferred from the source task to the target task. This assumes a similarity of the source model and the target model. Unlike multi-task learning, where both the source and target tasks are learned simultaneously, for transfer learning, we may apply additional weighting to the loss of the target domain to improve overall performance; a minimal sketch of this idea is given after this list.
3. The idea of instance-based TL is to reuse parts of the instances from the source task for the target task. Usually, instances cannot be used directly; instead, this is accomplished via instance weighting.
4. Relational-based TL assumes that instances are not independent and identically distributed, but dependent. This implies that the underlying data form a sort of network, such as a transcription regulatory network or a social network.
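As an illustration of the parameter-based idea, the following minimal R sketch shrinks the target coefficients toward the coefficients estimated on the source data via a ridge-type penalty; the simulated data, the value of lambda, and all names are illustrative assumptions rather than a specific published method.

set.seed(1)
p <- 3
beta_true <- c(1.0, -2.0, 0.5)
Xs <- matrix(rnorm(500 * p), ncol = p)              # large source data set
ys <- as.vector(Xs %*% (beta_true + 0.2)) + rnorm(500)
beta_S <- coef(lm(ys ~ Xs - 1))                     # source parameters

Xt <- matrix(rnorm(20 * p), ncol = p)               # small target data set
yt <- as.vector(Xt %*% beta_true) + rnorm(20)

# Target fit shrunk toward beta_S:
# argmin_b ||yt - Xt b||^2 + lambda * ||b - beta_S||^2
lambda <- 5
beta_T <- solve(t(Xt) %*% Xt + lambda * diag(p),
                t(Xt) %*% yt + lambda * beta_S)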

17.7.6 Multi-Task Learning

The idea of multi-task learning (MTL) compared to transfer learning is threefold. First, instead of considering exactly two tasks, the source and target task, in multi-
task learning there can be m > 2 tasks. Second, these m tasks do not have one or
more dedicated targets, but all tasks are equally important. That means there are m
source tasks and m target tasks [72]. Third, MTL learns multiple related tasks jointly
by sharing useful information among related tasks.
Formally, multi-task learning can be described as follows:
Definition 17.21 Given m learning tasks, {Tk}_{k=1}^{m}, where all tasks or a subset of tasks are related, multi-task learning aims to improve each learning task Tk using information from some or all of the other models.
For clarity, we would like to emphasize that for each learning task Tk , there is
a corresponding domain Dk = {Xk , P (Xk )} and data set Dk given, from which
information can be utilized. In the following, we denote the data set of task k by
Dk = {(xki, yki)}_{i=1}^{nk} with xki ∈ Xk and yki ∈ Yk for i ∈ {1, . . . , nk}, where nk is
the sample size.
For MTL there is an important special case one needs to distinguish from the
general setting.
The case where xki = xli and nk = nl = n for all k, l ∈ {1, . . . , m} and i ∈ {1, . . . , n} is called multi-view learning. Therefore, in this case the x-values of the data Dk for all tasks are identical but can have different labels; that is, Yk ≠ Yl for all k, l ∈ {1, . . . , m}.

17.7.6.1 Methodological Approaches

For multi-task learning, there are three key methodological approaches used to study
such problems [522].

1. Feature-based MTL
2. Parameter-based MTL
3. Instance-based MTL
Feature-based MTL models assume that different tasks share the same or at
least similar features. This includes also methods that perform feature selection or
transformation of the original features.
Parameter-based MTL models utilize parameters between different models to
relate the learning between different tasks. Examples of this include methods
based on regularization or priors on model parameters. In general, this conceptual
approach is very diverse, with many different realizations.
Instance-based MTL models estimate weights for the membership of instances
in tasks and then use all instances to learn all tasks in a weighted manner. For
comprehensive reviews of multi-task learning, we refer the reader to [412, 449, 522].
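As a minimal illustration of the parameter-based approach, the following R sketch jointly fits m related ridge regressions whose coefficient vectors are shrunk toward their common mean; the simulated tasks, lambda, and the alternating scheme are illustrative assumptions and represent only one of many possible realizations.

set.seed(1)
m <- 3; p <- 4; n_k <- 30; lambda <- 2
tasks <- lapply(1:m, function(k) {
  X <- matrix(rnorm(n_k * p), ncol = p)
  beta_k <- c(1, -1, 0.5, 0) + rnorm(p, sd = 0.2)   # related task coefficients
  list(X = X, y = as.vector(X %*% beta_k) + rnorm(n_k))
})

B <- matrix(0, nrow = p, ncol = m)                  # one coefficient column per task
for (iter in 1:20) {                                # alternate task fits and mean update
  beta_bar <- rowMeans(B)                           # shared information across tasks
  for (k in 1:m) {
    X <- tasks[[k]]$X; y <- tasks[[k]]$y
    B[, k] <- solve(t(X) %*% X + lambda * diag(p),
                    t(X) %*% y + lambda * beta_bar)
  }
}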

17.7.7 Multi-Label Learning

The idea of multi-label learning (MLL) is to generalize the single-valued class labels of a traditional classification into label sets of variable size [194, 472, 525]. Therefore, the number of labels, as the outcome of a prediction function, is variable.
To formally define multi-label learning, we need to modify the definition of the
data set D. Specifically, for multi-label learning, D is defined as D = {(xi , Yi )}ni=1
with xi ∈ X and Yi ⊆ Y where Y = {L1 , . . . , Lq }. Here, Yi can assume any subset
of Y, which makes the size of such a set variable.
The goal of multi-label learning is to find a prediction function, f , that maps the
elements of D correctly. Formally, the task is defined as follows:
Definition 17.22 For multi-label learning, a task T consists of a label space Y and
a prediction function f(X) with f : X → 2^Y; that is, T = {Y, f(X)}.
Here, 2^Y corresponds to the power set of Y, which is the set of all subsets of Y.
From the preceding definition, one may wonder why 2^Y is not mapped to a multi-class problem. The reason for this can be visualized for Y = {y1, . . . , y20}. In this case, the size of the power set is 2^20 = 1,048,576. Hence, if we were to map the multi-label problem to a multi-class classification, one would have 1,048,576 different classes. However, this results in severe learning problems for
such a classifier. For this reason, multi-label learning tries to be more resourceful.

17.7.7.1 Methodological Approaches

For multi-label learning, there are two key conceptual approaches, allowing a
categorization of available methods as follows:
1. Problem transformation

2. Algorithm adaptation
Approaches based on problem transformation can be further subdivided into
(1) transformation to binary classification, (2) transformation to label ranking, and
(3) transformation to multi-class classification [194]. Such approaches convert a
multi-label learning problem, by means of transformations, into well-established
problem settings. Examples of these approaches include classifier chains [403],
which transform a multi-label learning problem into a binary classification task;
calibrated label ranking [179], which maps MLL into the task of label ranking; and
random k-labelsets [474], which transforms multi-label learning into the task of
multi-class classification. If the direct mapping of the label subsets to a multi-class classification is performed, this is called label powerset [473].
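As an illustration of the transformation to binary classification (often called binary relevance), the following minimal R sketch trains one independent logistic regression per label and predicts the label set of a new instance; the simulated data and all names are illustrative.

set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$L1 <- as.integer(d$x1 > 0)                       # each instance may carry
d$L2 <- as.integer(d$x2 > 0)                       # several labels at once
d$L3 <- as.integer(d$x1 + d$x2 > 1)
labels <- c("L1", "L2", "L3")

# One binary classifier per label.
models <- lapply(labels, function(L)
  glm(reformulate(c("x1", "x2"), response = L), family = binomial, data = d))

# Predicted label set of a new instance: all labels with probability > 0.5.
x_new <- data.frame(x1 = 0.8, x2 = 1.2)
probs <- sapply(models, predict, newdata = x_new, type = "response")
labels[probs > 0.5]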
From the definition of multi-label learning and the description of transformation
methods, one may wonder why 2Y is not always directly mapped to a multi-class
classification problem, because, theoretically, such a mapping is always possible.
However, there is a practical problem with this for large |Y|, as discussed in the
previous section.
Methods based on algorithm adaptation modify existing learning methods to
adapt them to the multi-label case. In [525], four approaches are distinguished: (1)
lazy learning (e.g., ML-kNN [524]), (2) decision tree (e.g., ML-DT [81]), (3) kernel
learning (e.g., Rank-SVM [134]), and (4) information-theoretic methods (e.g., CML
[193]).

17.8 Summary

In this chapter, we discussed two fundamental aspects of learning from data: the
first concerned computational learning theory and the second different definitions
of learning paradigms. We have seen that computational learning theory provides
a quantification of “learnability” and thus allows one to derive bounds on the
capabilities of learning algorithms. In contrast, the learning paradigms provided new
frameworks for embedding particular data types with specific properties.
Learning Outcome 17: Machine Learning Paradigms

Machine learning is not a closed field where everything has been discovered.
Instead, this field is under rapid development not only with respect to novel
methods but even entirely new learning paradigms.

Both of these frameworks are useful for providing an overview of learning from
data beyond particular models. While computational learning theory is important
for obtaining definite answers about principle learning capabilities, such as error
bounds, the definition of learning paradigms provides clarity about tasks and the
usage of data beyond a supervised learning paradigm. It can be expected that the

latter, especially, will lead to the development of many new learning algorithms in
the years to come because such definitions can spur the creativity to design novel
learning algorithms. Overall, the concepts discussed in this chapter provide valuable
thought patterns for learning from data.

17.9 Exercises

1. Show that Eq. 17.9 holds. Hint: Utilize the linearity of the expectation value and
the fact that the samples are drawn i.i.d.
2. Let X1, . . . , Xn be n independent and identically distributed (i.i.d.) random variables bounded by the intervals [ai, bi] — that is, ai ≤ Xi ≤ bi for all i — and X̄ = (1/n) ∑_{i=1}^{n} Xi. Then the following inequality holds, according to Hoeffding [247]:

Pr( |X̄ − E[X̄]| ≥ ε ) ≤ 2 exp( −2n²ε² / ∑_{i=1}^{n} (bi − ai)² ).    (17.62)

Show that this corresponds to the inequality in Eq. 17.47 by identifying the terms.
3. Compare transfer learning with supervised learning and discuss the differences.
How can one convert a transfer learning problem to a supervised learning
problem?
4. Compare multi-task learning with transfer learning and discuss the differences.
How can one convert a multi-task learning problem to a transfer learning
problem?
Chapter 18
Generalization Error and Model Assessment

18.1 Introduction

This chapter provides the conceptual roof for all the previous chapters. We started
this book with a discussion on general topics — for example, error measures and
resampling methods — that allow, on the one hand, an intuitive understanding and, on the other hand, a general application to a large variety of problems. Then, we
discussed important core methods that enable one to conduct different forms of data
analysis. Now, we return to a general topic; however, this time it is one of a more
abstract nature.
The central topic of this chapter is the generalization error. Put simply, the
generalization error provides a theoretical quantification of the performance of a
prediction model. Furthermore, it enables one to understand the bias-variance trade-
off, error-complexity curves, learning curves, and the connection of these concepts
with model selection and model assessment. Model selection and model assessment
were discussed in Chap. 12; however, in this chapter, we approach these problems
from a more fundamental perspective. In addition, other topics discussed in this
chapter, such as learning curves, are applicable to the methods discussed in Parts
II and III of this book. This underlines the fundamental character of the concepts
around the generalization error. As we will see, the concepts discussed in this
chapter can be used as guiding principles for general data science projects.
For the analysis of supervised learning models, such as regression or classifica-
tion methods [59, 82, 143, 221, 226, 421], where prediction errors can be estimated,
model selection and model assessment are key to finding the best model for a
given data set. Interestingly, regarding the definition of “best model,” there are
two complementary approaches with different underlying philosophies [112, 172].
One defines “best model” based on its predictiveness, and the other does so
based on its descriptiveness. The latter approach aims to identify the true model,
whose interpretation leads to a deeper understanding of the generated data and the
underlying processes that generated them.


18.2 Overall View of Model Diagnosis

Regardless of the statistical model under investigation, such as classification or regression, there are two basic questions one needs to address: (1) How do you
choose between competing models? and (2) How do you evaluate them? Both
questions aim to diagnose the models.
The preceding informal questions are formalized by the following two statistical
concepts [226]:
Model selection: Estimate the performance of different models in order to choose the best
model.
Model assessment: For the best model, estimate its generalization error.

Briefly, model selection refers to the process of optimizing a model family or a model candidate. This includes the selection of a model itself from a set of
potentially available models and the estimation of its parameters. The former relates
to deciding which regularization method (e.g., ridge regression, LASSO, or elastic
net) should be used, whereas the latter corresponds to estimating the parameters
of the selected model. Meanwhile, model assessment means the evaluation of the
generalization error (also called test error) of the finally selected model for an
independent data set. This task aims to estimate the “true prediction error” as could
be obtained from an infinitely large test data set. Both concepts are based on the
utilization of data to quantify numerically the properties of models.
For simplicity, let’s assume we are given a very large (or arbitrarily large) data
set D. The best approach for both problems would be to randomly divide the data
into three non-overlapping sets:
1. Training data set: Dtrain
2. Validation data set: Dval
3. Test data set: Dtest
By “very large data set,” we mean a situation where the sample sizes, that is, ntrain ,
nval , and ntest , for all three data sets are large without necessarily being infinite, such
that an increase in their sizes would not lead to changes in the model evaluation.
Formally, the relation between the three data sets can be written as follows:

D = Dtrain ∪ Dval ∪ Dtest (18.1)


∅ = Dtrain ∩ Dval (18.2)
∅ = Dtrain ∩ Dtest (18.3)
∅ = Dval ∩ Dtest (18.4)

Based on these data, the training set would be used to estimate or learn the
parameters of the models. This is called model fitting. The validation data would
be used to estimate a selection criterion for model selection, and the test data would
be used to estimate the generalization error of the final chosen model.
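A minimal R sketch of such a random, non-overlapping split could look as follows; the placeholder data set and the proportions 60/20/20 are illustrative.

set.seed(1)
D <- data.frame(x = rnorm(1000), y = rnorm(1000))   # placeholder data set
n <- nrow(D)
idx <- sample(seq_len(n))                           # random permutation of the indices
n_train <- floor(0.6 * n); n_val <- floor(0.2 * n)
D_train <- D[idx[1:n_train], ]
D_val   <- D[idx[(n_train + 1):(n_train + n_val)], ]
D_test  <- D[idx[(n_train + n_val + 1):n], ]
# The three sets are disjoint and their union is D (Eqs. 18.1-18.4).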

In practice, the situation is more complicated since D is typically not arbitrarily large. In the following sections, we discuss model assessment and model selection
in detail. The order of our discussion is reversed to the order one would perform
a practical analysis. However, to facilitate the understanding of the concepts, this
order is beneficial.

18.3 Expected Generalization Error

Let’s assume that we have a general model of the form

y = f (x, β) + ε (18.5)

mapping the input x to an output y, as defined by the function f . The mapping varies
by a noise term ε ∼ N(0, σ 2 ) representing, for example, measurement errors. We
want to approximate the true (but unknown) mapping function f by a model g that
depends on parameters β; that is,

ŷ = g(x, β̂(D)) = ĝ(x, D). (18.6)

Here, the parameters β are estimated from a training data set D (strictly denoted by
Dtrain ); hence, the parameters β are functions of the training set, i.e., β̂(D). The
“hat” indicates that the parameters β are estimated using the data D. As a shortcut,
we write ĝ(x, D) instead of g(x, β̂(D)).
Based on these entities, we can define the following measures to evaluate models:


SST = TSS = ∑_{i=1}^{n} (yi − ȳ)² = ‖Y − Ȳ‖²;    (18.7)

SSR = ESS = ∑_{i=1}^{n} (ŷi − ȳ)² = ‖Ŷ − Ȳ‖²;    (18.8)

SSE = RSS = ∑_{i=1}^{n} (ŷi − yi)² = ∑_{i=1}^{n} ei² = ‖Ŷ − Y‖².    (18.9)

Here, ȳ = (1/n) ∑_{i=1}^{n} yi is the mean value of the response variable, and ei = ŷi − yi are the residuals, whereas
• SST is the sum of squares total, also called total sum of squares (TSS).
• SSR is the sum of squares due to regression (variation explained by linear
model), also called the explained sum of squares (ESS).
524 18 Generalization Error and Model Assessment

• SSE is the sum of squares due to errors (unexplained variation), also called
residual sum of squares (RSS).
There is a remarkable property for the sum of squares, which is given by the
following:

SST = SSR + SSE,    (18.10)

where SST is the total deviation, SSR the deviation of the regression from the mean, and SSE the deviation from the regression.

This relationship is called partitioning of sum of squares [433].


Furthermore, to summarize the overall predictions of a model, the mean-squared
error (MSE), given by

MSE = SSE / n,    (18.11)
is a useful quantity.
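The following R sketch computes SST, SSR, SSE, and the MSE for a linear model fitted to simulated data and checks the partitioning of the sum of squares numerically; the data are illustrative.

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
y_hat <- fitted(fit)

SST <- sum((y - mean(y))^2)        # total sum of squares
SSR <- sum((y_hat - mean(y))^2)    # explained sum of squares
SSE <- sum((y_hat - y)^2)          # residual sum of squares
MSE <- SSE / length(y)

all.equal(SST, SSR + SSE)          # TRUE: partitioning of the sum of squares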
The general problem when dealing with predictions is that we would like to know
about the generalization abilities of our model. Specifically, for a given training
data set Dtrain , we can estimate the parameters of our model β, leading to estimates
g(x, β̂(Dtrain )) = ĝ(x, Dtrain ). Ideally, we would like to have that y ≈ ĝ(x, Dtrain )
for every data point (x, y) drawn from the underlying population; that is, (x, y) ∼
P . To assess this quantitatively, a loss function, or simply a loss, is defined. Frequent
choices for a loss are as follows:
• The absolute error, defined by
L(y, ĝ(x, Dtrain)) = |y − ĝ(x, Dtrain)|    (18.12)

• The squared error, defined by


L(y, ĝ(x, Dtrain)) = (y − ĝ(x, Dtrain))².    (18.13)

If one were to use only data points from a training set, that is, (x, y) ∈ Dtrain ,
to assess the loss, such estimates would usually be overly optimistic and lead to
much smaller errors, as if data points were used from all possible values; that is,
(x, y) ∼ P , where P is the distribution of all possible values. Formally, we can
write this as an expectation value with respect to distribution P , as follows:
 
Etest(Dtrain, ntrain) = EP [L(y, ĝ(x, Dtrain))].    (18.14)

The expectation value in Eq. 18.14 is called the generalization error of the model.
This error is also called out-of-sample error or simply test error. The latter name
emphasizes the important fact that test data are used for the evaluation of the

prediction error (as represented by the distribution P ) of the model, but training
data are used to learn its parameters (as indicated by Dtrain ).
From Eq. 18.14, one can see that we have an unwanted dependency on the
training set Dtrain . To remove this, we need to assess the generalization error of
the model, given by β̂(Dtrain ), by expressing the expectation value with respect to
all training sets; that is,
 
Etest(ntrain) = EDtrain [EP [L(y, ĝ(x, Dtrain))]].    (18.15)

This is the expected generalization error of the model, also called expected out-
of-sample error [3], which is no longer dependent on any particular estimates of
β̂(Dtrain ) via Dtrain . Hence, this error provides the desired assessment of a model
for its generalization capability.
It is important to emphasize that the training sets, Dtrain , are not infinitely
large but rather all have the same finite sample size ntrain . Hence, the expected
generalization error in Eq. 18.15 is only independent of a particular training set but
still depends on the size of these sets. In Sect. 18.6, we will explore this dependency
when discussing learning curves.
From the preceding derivation, one can see that the expected generalization
error in Eq. 18.15 is a population estimate. That means its evaluation is based on
expectation values over populations; namely, the population of all data points, P ,
and the population of all training data, Dtrain , of size ntrain . So, this is a theoretical
entity. When working with data, one requires an approximation of the population
estimate via a sample estimate. For such an approximation, the resampling methods
discussed in Chap. 4 can be used. This implies also that for model assessment
one wants to estimate the expected generalization error. Theoretically, the expected
generalization error is the desired measure for model assessment.
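The following R sketch shows how such a sample approximation can be obtained in a simulation: the test error of models fitted to many independent training sets of fixed size is averaged, where the true model, the loss, and all settings are illustrative.

set.seed(1)
f <- function(x) 1 + 2 * x                    # true (known) model for the simulation
gen_data <- function(n) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 1))
}

n_train <- 30
test <- gen_data(10000)                       # large test set approximates P
E_test <- replicate(200, {                    # expectation over training sets
  train <- gen_data(n_train)
  fit <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)   # squared-error loss
})
mean(E_test)                                  # approximates Eq. 18.15 for this n_train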
Although the expected generalization error cannot be evaluated in general, it can
be used to derive theoretical insights. Specifically, in the next section we will use the
expected generalization error to derive a decomposition known as a bias-variance
trade-off [183, 192, 288, 502].

18.4 Bias-Variance Trade-Off

In this section, we show how the expected generalization error of the model in
Eq. 18.15 can be used to derive an error decomposed into different components. This
decomposition is known as a bias-variance trade-off [183, 192, 288, 502], and the
three components are denoted bias, variance, and noise. We will see that the result
provides valuable insights for understanding the influence of the model complexity
on the prediction error.
In the following, we denote the training set by D to simplify the notation.
Furthermore, we write the expectation value with respect to the distribution P as

Ex,y , and not as EP as in Eq. 18.15, as a short form for EP (x,y) . This allows one to
apply the probability rule of the form

P(x, y) = P(y|x)P(x),    (18.16)

making the derivation more explicit, as we will see in Eqs. 18.26 and 18.23.
We start the derivation from the expected generalization error with the squared
error as the loss; that is,
EDtrain [EP [L(y, ĝ(x, Dtrain))]] = ED [Ex,y [(y − ĝ(x, D))²]]    (18.17)

ED [Ex,y [(y − ĝ(x, D))²]]
= ED [Ex,y [((y − ED[ĝ(x, D)]) + (ED[ĝ(x, D)] − ĝ(x, D)))²]]    (18.18)
= Ex,y [ED [(y − ED[ĝ(x, D)])²]] + Ex,y [ED [(ED[ĝ(x, D)] − ĝ(x, D))²]]
  + 2 Ex,y [ED [(y − ED[ĝ(x, D)]) (ED[ĝ(x, D)] − ĝ(x, D))]]    (18.19)
= Ex,y [(y − ED[ĝ(x, D)])²] + Ex,y [ED [(ED[ĝ(x, D)] − ĝ(x, D))²]]
  + 2 Ex,y [(y − ED[ĝ(x, D)]) ED [ED[ĝ(x, D)] − ĝ(x, D)]]    (18.20)
= Ex,y [(y − ED[ĝ(x, D)])²] + Ex,y [ED [(ED[ĝ(x, D)] − ĝ(x, D))²]]    (18.21)
= Ex,y [(y − ḡ(x))²] + Ex,y [ED [(ḡ(x) − ĝ(x, D))²]]    (18.22)
= Ex,y [(y − ḡ(x))²] + ED [Ex [Ey|x [(ḡ(x) − ĝ(x, D))²]]]    (18.23)
= Ex,y [(y − ḡ(x))²] + ED [Ex [(ḡ(x) − ĝ(x, D))²]]    (18.24)
= bias² + variance

In Eq. 18.20, the factor (y − ED[ĝ(x, D)]) is independent of the particular training set D and can be pulled out of the expectation ED; since ED[ED[ĝ(x, D)] − ĝ(x, D)] = 0, the cross-term vanishes, which yields Eq. 18.21.

In Eq. 18.23, we used the independence of the sampling processes for D and
(x, y) to change the order of the expectation values. This allowed us to evaluate the
conditional expectation value Ey|x because the argument is independent of y.
In Eq. 18.22, we used the following short form:
ḡ(x) = ED [ĝ(x, D)]    (18.25)

to write the expectation value of ĝ with respect to D giving a mean model ḡ over all
possible training sets D. Due to the fact that this expectation value integrates over
all possible values of D, the resulting ḡ(x) no longer depends on it.
By utilizing the conditional expectation value

Ex,y = Ex Ey|x , (18.26)

we can further analyze the first term of the preceding derivation, Ex,y [(y − ḡ(x))²], using the following relationship:

Ex,y [y] = Ex [Ey|x [y]] = Ex [ȳ(x)] = ȳ.    (18.27)

Here, it is important to note that ȳ(x) is a function of x, whereas ȳ is not, because the expectation value Ex integrates over all possible values of x. For clarity reasons,
we want to note that y actually means y(x), but to simplify the notation, we suppress
this argument so that the derivation is more readable.
Specifically, by utilizing this term, we obtain the following decomposition:
Ex,y [(y − ḡ(x))²] = Ex,y [(y − ED[ĝ(x, D)])²]    (18.28)
= Ex,y [((y − ȳ(x)) + (ȳ(x) − ED[ĝ(x, D)]))²]    (18.29)
= Ex,y [(y − ȳ(x))²] + Ex,y [(ȳ(x) − ED[ĝ(x, D)])²]
  + 2 Ex,y [(y − ȳ(x)) (ȳ(x) − ED[ĝ(x, D)])]    (18.30)
= Ex,y [(y − ȳ(x))²] + Ex,y [(ȳ(x) − ED[ĝ(x, D)])²]
  + 2 Ex [Ey|x [y − ȳ(x)] (ȳ(x) − ED[ĝ(x, D)])]    (18.31)
= Ex,y [(y − ȳ(x))²] + Ex,y [(ȳ(x) − ED[ĝ(x, D)])²]
  + 2 Ex [(ȳ(x) − ȳ(x)) (ȳ(x) − ED[ĝ(x, D)])]    (18.32)
= Ex,y [(y − ȳ(x))²] + Ex [(ȳ(x) − ED[ĝ(x, D)])²]    (18.33)
= Noise + Bias²

In Eq. 18.31, the factor (ȳ(x) − ED[ĝ(x, D)]) is independent of y and can be pulled out of the conditional expectation Ey|x; since Ey|x[y] = ȳ(x), the cross-term vanishes. Because the argument of the remaining second term is also independent of y, Ex,y can be replaced by Ex, which yields Eq. 18.33.

Taken together, we obtain the following combined result:


ED [Ex,y [(y − ĝ(x, D))²]]
= Ex,y [(y − ȳ(x))²] + Ex [ED [(ED[ĝ(x, D)] − ĝ(x, D))²]]
  + Ex [(ȳ(x) − ED[ĝ(x, D)])²]    (18.34)
= Ex,y [(y − ȳ(x))²] + Ex [ED [(ḡ(x) − ĝ(x, D))²]]
  + Ex [(ȳ(x) − ḡ(x))²]    (18.35)
= Noise + Variance + Bias²

• Noise: This term measures the variability within the data without considering
any model. The noise cannot be reduced, because it does not depend on the
training data D or g or any other parameter under our control. Hence, it is
a characteristic of the distribution P from which the data are drawn. For this
reason, this component is also called an irreducible error.
• Variance: This term measures the model variability with respect to the changes
in the training sets. This variance can be reduced by using less complex models
g. However, this can increase the bias (underfitting).
• Bias: This term measures the inherent error that you obtain from your model
even with an infinitely large training data set. The bias can be reduced by using
more complex models, g. However, this can increase the variance (overfitting).
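The following R sketch estimates the three components numerically for a polynomial regression of fixed degree by repeatedly drawing training sets from a known model; the true model corresponds to Eq. 18.40, and the remaining settings are illustrative.

set.seed(1)
f <- function(x) 25 + 0.5 * x + 4 * x^2 + 3 * x^3 + x^4   # true model
sigma <- 5; n_train <- 30; d <- 2                          # fixed model complexity
x_grid <- seq(-2, 2, length.out = 50)

preds <- replicate(500, {                      # predictions over many training sets
  x <- runif(n_train, -2, 2)
  y <- f(x) + rnorm(n_train, sd = sigma)
  fit <- lm(y ~ poly(x, degree = d, raw = TRUE))
  predict(fit, newdata = data.frame(x = x_grid))
})

g_bar    <- rowMeans(preds)                    # mean model over all training sets
bias2    <- mean((f(x_grid) - g_bar)^2)        # squared bias, averaged over x
variance <- mean(apply(preds, 1, var))         # variance, averaged over x
noise    <- sigma^2                            # irreducible error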
In Fig. 18.1, we show a visualization of the model assessment problem and its
interpretation based on the bias-variance trade-off. In Fig. 18.1a, the blue curve
corresponds to a model family — for example, a regression model with a fixed
number of covariates — and each point along this line corresponds to a particular
model obtained by estimating the parameters of the model from a data set. The dark
brown point corresponds to the true (but unknown) model and a data set generated
by this model. Specifically, this data set has been obtained in the error-free case;
that is, εi = 0 for all samples i. If another data set is generated from the true model,
this data set will vary to some extent because of the noise term εi , which is usually
not zero. This variation is indicated by the large (light) brown circle around the true
model.
In the case where the model family does not include the true model, there will
be a bias corresponding to the distance between the true model and the estimated
model indicated by the blue point along the curve of the model family. Specifically,
this bias is measured between the error-free data set generated by the true model
and the estimated model based on this data set. Also, the estimated model will have
some variability indicated by the (light) blue circle around the estimated model. This
corresponds to the variance of the estimated model.
Fig. 18.1 Insights from the bias-variance trade-off for model assessment. (a) The true (population) model y = f(x, β), a realization y = f(x, β) + ε, and the model family ŷ = g(x, β̂(D)), together with the corresponding bias, variance, and noise. (b) The true model generates training data, from which the model (β̂train) and the training error Etrain are estimated; test data are used for model assessment via Etest.

It is important to realize that there is no possibility of directly comparing the true model with the estimated model, because the true model is usually unknown. The
only situation where the true model is known is for simulation studies where all the
entities involved are known and we have perfect control over them. We will make
use of this in the next section when we show some numerical simulation examples.
In reality, however, the comparison between the true model and the estimated
model is carried out indirectly, via data that have been generated by the true model.
Hence, these data are serving two purposes. First, they are used to estimate the
parameters of the model. For this estimation, the training data are used. If one uses
the same training data to evaluate the prediction error of this model, the prediction
error is called training error

Etrain = Etrain (Dtrain ). (18.36)

Etrain is also called in-sample error.


Second, the data are used to assess the estimated model by quantifying its prediction
error. For this estimation, the test data are used. In this case, the prediction error is
called test error or out-of-sample error

Etest = Etest (Dtest ). (18.37)



In Fig. 18.1b, we summarize the usage of the different data sets, i.e., training data
and testing data, for model assessment.
It is important to note that a prediction error is always evaluated with respect
to a given data set. For this reason, we emphasized this explicitly in Eqs. 18.36
and 18.37. Unfortunately, this information is frequently omitted in the literature,
which can lead to some confusion in the meaning of the prediction error.
We want to emphasize that the training error is only defined as a sample estimate
but not as a population estimate, because the training data set is always finite. That
means the sample training error in Eq. 18.36 is estimated by
Etrain(ntrain) = (1/ntrain) ∑_{i=1}^{ntrain} L(yi, ĝ(xi, Dtrain)),    (18.38)

assuming that the sample size of the training data is ntrain with Dtrain = {(xi, yi)}_{i=1}^{ntrain}.
In contrast, the test error in Eq. 18.37 (expected generalization error) corresponds
to the population estimate given in Eq. 18.15. In practice, this needs to be approxi-
mated by a sample estimate, similar to Eq. 18.38, of the form

Etest(ntest, ntrain) = (1/ntest) ∑_{i=1}^{ntest} L(yi, ĝ(xi, Dtrain)),    (18.39)

for a test data set with ntest samples. For a finite size of the test data, the sample test
error depends not only on ntrain but also ntest .

18.5 Error-Complexity Curves

Before we apply the preceding results to an example, we want to go one step further
and discuss error-complexity curves. In general, error-complexity curves utilize
the preceding concepts to show the dependency of either the training error or the
test error on the complexity of a statistical model.
Definition 18.1 (Error-Complexity Curves for Training and Test Error) Error-
complexity curves show the training error and test error as functions of the model
complexity. The models underlying these curves are estimated from training data
with a fixed sample size.
Since any test error can be decomposed into a bias and a variance term, as we have
seen in Sect. 18.4, the preceding definition implies that error-complexity curves can
also be used to show the dependency of the bias-variance decomposition of the test
error on the model complexity.

Definition 18.2 (Error-Complexity Curves for Bias-Variance Decomposition) Error-complexity curves show the dependency of the bias-variance decomposition of the test error on the model complexity. The models underlying these curves are estimated from the
training data with a fixed sample size.
Overall, this means that error-complexity curves do not introduce a new concept
but rather apply the preceding results to different models with varying complexity.
In other words, by fixing the model complexity to a particular value, one obtains
results for the training and test error and the bias-variance decomposition.
Now, we can summarize all results obtained so far using an example.

18.5.1 Example: Linear Polynomial Regression Model

In this section, we show numerical examples for the preceding derivations. Because the preceding entities are all population estimates, we need to use simulations, since they enable us to generate arbitrarily large data sets. In the following, we
study linear polynomial regression models.
In Fig. 18.2, we show an example where the true model, depicted in blue,
corresponds to

f(x, β) = 25 + 0.5x + 4x² + 3x³ + x⁴,    (18.40)

where β = (25, 0.5, 4, 3, 1)T (see Eq. 18.5). The true model is a mixture of
polynomials of different degrees, where the highest degree is 4, corresponding to
a linear polynomial regression model. From this model, we generate training data
with a sample size of n = 30 (shown by black points), which are used to fit different
regression models.
The general model family we use for the regression model is given by


g(x, β) = ∑_{i=0}^{d} βi x^i = β0 + β1 x + · · · + βd x^d.    (18.41)

That means we are fitting linear polynomial regression models with a maximal
degree of d. The highest degree corresponds to the model complexity of the
polynomial family. For our analysis, we are using polynomials with degree d from
1 to 10 and fit these to the training data. The results of these regression analyses are
shown as red curves in Fig. 18.2a-j.
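A minimal R sketch of this type of analysis, with illustrative settings, is the following: training data are generated from the true model of Eq. 18.40, and linear polynomial regression models of degree d = 1, . . . , 10 are fitted.

set.seed(1)
f <- function(x) 25 + 0.5 * x + 4 * x^2 + 3 * x^3 + x^4
x <- runif(30, -2, 2)
train <- data.frame(x = x, y = f(x) + rnorm(30, sd = 5))

fits <- lapply(1:10, function(d)
  lm(y ~ poly(x, degree = d, raw = TRUE), data = train))

# For example, predictions of the degree-4 model on a grid:
x_grid <- seq(-2, 2, length.out = 100)
y_hat4 <- predict(fits[[4]], newdata = data.frame(x = x_grid))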
In Fig. 18.2a-j, the blue curves show the true model, the red curves the fitted
models, and the black points correspond to the training data. The results shown
correspond to individual model fits; that is, no averaging has been performed.
Furthermore, for all results the sample size of the training data was kept fixed
(varying sample sizes are studied in Sect. 18.6). Since the degree of the polynomial
indicates the complexity of the fitted model, the shown models correspond to

Fig. 18.2 Different examples for fitted linear polynomial regression models with varying degrees
d, ranging from 1 to 10. The model degree indicates the highest polynomial degree of the fitted
model. These models correspond to different model complexities, from low complexity (d = 1) to
high complexity (d = 10). The blue curves show the true model, the red curves the fitted models,
and the black points correspond to the training data. The results shown correspond to individual
fits; that is, no averaging has been performed. For all results, the sample size of the training data
was kept fixed.

different model complexities, from low-complexity (d = 1) to high-complexity (d = 10) models.
One can see that for both low and high degrees of the polynomials, there are clear
differences between the true model and the fitted models. However, these differences
have a different origin. For low-degree models, the differences come from the low

complexity of the models, which are not flexible enough to adapt to the variability
of the training data. Put simply, the model is too simple. This behavior corresponds
to an underfitting of the data (caused by high bias, as explained in detail later). In
contrast, for high degrees, the model is too flexible for the few available training
samples. In this case, the model is too complex for the training data. This behavior
corresponds to an overfitting of the data (caused by high variance, as explained in
detail later).

18.5.2 Example: Error-Complexity Curves

Based on the preceding model of linear polynomial regression, we can now show
error-complexity curves. Specifically, in Fig. 18.3, we show two different types of
results. The first type, shown in Fig. 18.3a, c, e, and f, corresponds to numerical
simulation results fitting a linear polynomial regression to training data, whereas
the second type, shown in Fig. 18.3b and d (highlighted using the dashed brown
rectangle), corresponds to idealized results that hold for general statistical models
beyond our studied examples. The numerical simulation results in Fig. 18.3a, c, e,
and f have been obtained from averaging over an ensemble of repeated model fits.
For all these fits, the sample size of the training data was kept fixed.
We start by discussing the results shown in Fig. 18.3a. These results provide the
error-complexity curves for the expected training and test errors for the different
polynomials. From Fig. 18.3a, one can see that the training error decreases with
increasing polynomial degree. In contrast, the test error is U-shaped. Intuitively,
it is clear that more complex models fit the training data better; however, there
should be an optimal model complexity, and going beyond should worsen the
prediction performance. The training error alone does not clearly reflect this, and
for this reason, estimates of the test error are needed. Figure 18.3b shows idealized
results for characteristic behavior of the training and test errors for general statistical
models.
In Fig. 18.3c, we show the decomposition of the test error into its noise, bias,
and variance components. The noise is constant for all polynomial degrees, whereas
the bias is monotonously decreasing and the variance is increasing. Note that this
behavior is generic beyond the shown examples. For this reason, we show, in
Fig. 18.3d, the idealized decomposition (neglecting the noise since it has a constant
contribution).
In Fig. 18.3e, we show the percentage breakdown of the noise, bias, and variance
for each polynomial degree. In this representation, the behavior of the noise is not
constant, since the decomposition is nonlinear for different complexity values of the
model. The numerical values of the percentage breakdown depend on the degree
of the polynomial and vary as shown. In Fig. 18.3f, we show the same results as in
Fig. 18.3e but without the noise part. From these representations, one can see that
simple models have a high bias and a low variance, whereas complex models have a
low bias and a high variance. This characterization is generic and not limited to the
particular model we studied.
Fig. 18.3 Error-complexity curves showing the prediction error (training and test error) against
the model complexity. The panels (a, c, e, and f) show numerical simulation results for a linear
polynomial regression model. The model complexity is expressed by the degree of the highest
polynomial. For the shown analysis, the training data set was fixed. (b) Idealized error curves
for general statistical models. (c) Decomposition of the expected generalization error (test error)
into noise, bias, and variance. (d) Idealized decomposition into bias and variance. (e) Percentage
breakdown of the noise, bias, and variance, shown in (c), relative to the polynomial degrees. (f)
Percentage breakdown of the bias and variance.

18.5.3 Interpretation of Error-Complexity Curves

From the idealized error-complexity curves in Fig. 18.3b, one can summarize and
clarify a couple of important terms.
We say that a model is overfitting if its test error is higher than the one of
a less complex model. That means, to decide whether a model is overfitting, it is
necessary to compare it with a simpler model. Hence, overfitting is detected from a
comparison, and it is not an absolute measure. In Fig. 18.3b, all models with a model
complexity larger than 3.5 are overfitting with respect to the best model’s having a
model complexity of copt = 3.5 and the lowest test error. One can formalize this by
defining an overfitting model as follows.
Definition 18.3 (Model Overfitting) A model with complexity c is said to be overfitting if for its test error the following holds:

Etest(c) − Etest(copt) > 0  ∀ c > copt,    (18.42)

with

copt = argmin_c {Etest(c)},    (18.43)
Etest(copt) = min_c {Etest(c)}.    (18.44)

From the bias-variance decomposition in Fig. 18.3d, one can see that an overfit-
ting model is characterized by

Overfitting : low bias and high variance (18.45)

Furthermore, from Fig. 18.3b, we can also see that for all these models, the
difference between the test error and the training error increases for increasing
complexity values; that is,
   
Etest (c) − Etrain (c) > Etest (c ) − Etrain (c ) , (18.46)

∀c > c and c, c > copt .

Similarly, we say a model is underfitting if its test error is higher than the one
of a more complex model. That means to decide whether a model is underfitting it is
necessary to compare it with a more complex model. In Fig. 18.3b, all models with

a model complexity smaller than 3.5 are underfitting with respect to the best model.
The formal definition of this can be given as follows:
Definition 18.4 (Model Underfitting) A model with complexity c is said to be
underfitting if for its test error the following holds:

Etest (c) − Etest (copt ) > 0, ∀c < copt . (18.47)

From the bias-variance decomposition in Fig. 18.3d, one can see that an under-
fitting model is characterized by

Underfitting : high bias and low variance (18.48)

Finally, the generalization capabilities of a model are assessed by comparing its test error with its training error. If the distance
between the test error and the training error is small (small gap), i.e.,

Etest (c) − Etrain (c) ≈ 0, (18.49)

then the model has good generalization capabilities [3]. From Fig. 18.3b, one can see
that models with c > copt have bad generalization capabilities. In contrast, models
with c < copt have good generalization capabilities but not necessarily a small error.
This makes sense considering that the sample size is kept fixed.
In Definition 18.5, we formally summarize these characteristics.
Definition 18.5 (Generalization) If for a model with complexity c the following
holds

Etest (c) − Etrain (c) < δ with δ ∈ R+ , (18.50)

then we say that the model has good generalization capabilities.


In practice, one needs to decide what is a reasonable value of δ because usually
δ = 0 is too strict. This makes the definition of generalization problem specific. Put simply, a model generalizes to new data if one can conclude from the training error to the test error because both have similar values.
Theoretically, as a result of increasing the sample size of the training data, we obtain

lim_{ntrain→∞} (Etest(c) − Etrain(c)) = 0,    (18.51)

for all model complexities c since Eqs. 18.38 and 18.39 become identical for an
infinitely large test data set; that is, ntest → ∞.
From the idealized decomposition of the test error shown in Fig. 18.3d, one can
see that a simple model with low variance and high bias has, in general, good

generalization capabilities. Whereas for a complex model, its variance is high and
its generalization capabilities are poor.

18.6 Learning Curves

The last concept we discuss in this chapter is called learning curves. A learning
curve shows the performance of a model for different sample sizes of the training
data [12, 13], where the performance of a model is measured by its prediction error.
To extract the most information, one needs to compare the learning curves of the
training error and the test error with each other. This leads to complementary infor-
mation to the error-complexity curves. Hence, learning curves play an important
role in model diagnosis but are not strictly considered part of model assessment
methods.
Definition 18.6 Learning curves show the training error and test error as functions
of the sample size of the training data. The models underlying these curves all have
the same complexity.
In the following, we first present numerical examples for learning curves for
linear polynomial regression models. Then, we discuss the behavior of idealized
learning curves that can correspond to any type of statistical model.

18.6.1 Example: Learning Curves for Linear Polynomial Regression Models

In Fig. 18.4, we show results for the linear polynomial regression models discussed
earlier. It is important to emphasize that each figure shows results for a fixed model
complexity but varying sample size of the training data. This is in contrast with the
results shown earlier (see Fig. 18.3), which varied in the model complexity but kept
the sample size of the training data fixed. We show six examples for six different
model degrees. The horizontal dashed red line corresponds to the optimal error,
Etest (copt ), attainable by the model family. The first two examples (Fig. 18.4a and
b) are qualitatively different from all the others because neither the training nor the
test error converges to Etest (copt ), but they are much higher. This is due to the high
bias of the models, because these models are too simple for the data.
Figure 18.4e exhibits a different extreme behavior. Here, for sample sizes of the
training data smaller than ≈60, we obtain very high test errors and a large difference
with the training error. This is due to the high variance of the models because these
models are too complex for the data. In contrast, Fig. 18.4c shows results for copt =
4, which are the best results obtainable for this model family and the data.
In general, learning curves can be used to answer the following two questions:

Fig. 18.4 Estimated learning curves for training and test errors for six linear polynomial regres-
sion models. The model degree indicates the highest polynomial degree of the fitted model, and
the horizontal dashed red line corresponds to the optimal error Etest (copt ) attainable by the model
family for the optimal model complexity copt = 4.

1. How much training data is needed?


2. How much bias and variance are present?
For (1): The learning curves can be used to predict the benefit one obtains by
increasing the number of samples in the training data.
• If the curve is slightly changing (increasing for training error and decreasing for
test error) → need larger sample size.

• If the curve is completely flattened out → sample size is sufficient.


• If the curve is rapidly changing → need much larger sample size.
This assessment is based on evaluating the tangent of a learning curve toward the
highest available sample size.
For (2): To study this point, one needs to generate several learning curves for
models of different complexities. From this, one obtains information about the
smallest attainable test error. In the following, we call this the optimal attainable
error Etest (copt ).
For a specific model, one evaluates its learning curves as follows:
• A model has high bias if the training and test errors converge to a value much
larger than Etest(copt). In this case, increasing the sample size of the training data
will not improve the results. This indicates an underfitting of the data because
the model is too simple. To improve the performance, one needs to increase the
complexity of the model.
• A model has high variance if the training and test errors are quite different from
each other; that is, there is a large gap between both. Here, the gap is defined
as Etest (n) − Etrain (n) for a sample size n of the training data. In this case, the
training data are fitted much better than the test data, indicating problems with the
generalization capabilities of the model. To improve the performance, the sample
size of the training data needs to be increased.
These assessments are based on evaluating the gap between the test error and the
training error for the highest available sample size of the training data.
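
As a small numerical illustration, the sketch below builds on the objects lc and n.grid from the learning-curve sketch in Sect. 18.6.1 and turns the two assessments into simple diagnostics: the slope of the test-error curve toward the largest sample size and the gap between test and training error. The threshold values and the assumed optimal error are arbitrary choices for illustration, not general recommendations.

# Simple diagnostics derived from an estimated learning curve (a sketch with assumed thresholds).
n.last <- length(n.grid)
slope  <- (lc[n.last, "test"] - lc[n.last - 1, "test"]) /
          (n.grid[n.last] - n.grid[n.last - 1])                # tangent of the test-error curve at the largest n
gap    <- lc[n.last, "test"] - lc[n.last, "train"]             # Etest(n) - Etrain(n): variance indicator
E.opt  <- 3^2                                                  # assumed Etest(copt): the noise variance of the simulation
excess <- lc[n.last, "test"] - E.opt                           # distance from the optimum: bias indicator
if (abs(slope) > 0.05)              cat("test error still changing: a larger training sample may help\n")
if (gap > 0.2 * lc[n.last, "test"]) cat("large gap between test and training error: high variance\n")
if (excess > 0.2 * E.opt)           cat("test error far above the optimum: high bias\n")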

18.6.2 Interpretation of Learning Curves

In Fig. 18.5, we show idealized learning curves for the four cases obtained from
combining high/low bias and high/low variance with each other. Specifically, the
first/second columns show low/high bias cases, and the first/second rows show
low/high variance cases. Figure 18.5a shows the ideal case where the model has
a low bias and a low variance. In this case, the training and test errors both converge
to the optimal attainable error Etest (copt ) that is shown as a dashed red line.
In Fig. 18.5b, a model with a high bias and a low variance is shown. In this case,
the training and test errors both converge to values that are distinct from the optimal
attainable error, and an increase in the sample size of the training data will not solve
this problem. The small gap between the training and test errors is indicative of a low
variance. A way to improve the performance is to increase the model complexity;
for instance, by allowing more free parameters or by using boosting approaches. This case is
the ideal example of an underfitting model.
In Fig. 18.5c, a model with a low bias and a high variance is shown. In this case,
the training and test errors both converge to the optimal attainable error. However,
the gap between the training and test errors is large, indicating a high variance.
Fig. 18.5 Idealized learning curves. The horizontal dashed red line corresponds to the optimal error Etest(copt) attainable by the model family. Shown are the following four cases: (a) bias, low; variance, low; (b) bias, high; variance, low (underfitting model); (c) bias, low; variance, high (overfitting model); (d) bias, high; variance, high. Each panel plots the error against the sample size of the training set.

To reduce this variance, the sample size of the training data needs to be increased
to possibly much larger values. The model complexity can also be reduced; for
example, by using regularization or bagging approaches. This case is the ideal
example of an overfitting model.
In Fig. 18.5d, a model with a high bias and a high variance is shown. This is
the worst-case example. To improve the performance, one needs to increase the
model complexity and possibly the sample size of the training data. This means that
improving such a model is the most demanding case.
The learning curves also allow an evaluation of the generalization capabilities of
a model. Only the low-variance cases have a small distance between the test error
and the training error, indicating the model has good generalization capabilities.
Hence, a model with a low variance has, in general, good generalization capabilities
irrespective of the bias. However, models with a high bias perform badly, and whether
they are still useful needs to be assessed on a case-by-case basis.

18.7 Discussion

In this chapter, we presented the expected generalization error, the bias-variance
decomposition, error-complexity curves, and learning curves. We discussed these
concepts theoretically and practically for model assessment, model selection, and
model diagnosis [73, 215, 390].
It is important to emphasize that the expected generalization error is a theoretical
entity defined as a population estimate. That means the entire population from which
the data are drawn needs to be available. Similarly, not only one training data set
is needed, but infinitely many. So, practically, the expected generalization error
is not attainable. Since the bias-variance decomposition is based on the expected
generalization error, this is also defined as a population estimate. Nevertheless, both
concepts are useful to derive insights; for example, by utilizing simulation studies.
In this chapter, we used linear polynomial regression models for such simulations.
For practical applications, sample estimates of the preceding entities need to be
obtained, such as from resampling methods.
For model assessment, the expected generalization error is the ultimate measure
because it provides the most generic and complete summary information about a
statistical model with respect to its prediction capabilities. However, its theoretical
nature means that, in practice, it can only be approximated by sample estimates, which
makes the concept cumbersome for the beginner because it is easy to get confused. For this
reason, we postponed its discussion until this chapter. Pedagogically, we think that
it is better to first gain some practical experience with prediction models and then
think about their theoretical foundations after a certain degree of understanding has
been achieved. As one can see, in this chapter the expected generalization error is not
the end point but the starting point, affecting many other concepts as exemplified by
the bias-variance decomposition, error-complexity curves, and learning curves. This
also has direct connections to model assessment and model selection. Hence, the
concept of the expected generalization error triggers an avalanche of other topics,
each one nontrivial in its own right.
As mentioned, error-complexity curves and learning curves can be seen as appli-
cations of the expected generalization error and the bias-variance decomposition.
While error-complexity curves are based on the bias-variance decomposition so as
to provide a functional dependency on the model complexity, learning curves are
based on the expected generalization error so as to study the dependency on the
sample size of the training data. Hence, both concepts provide dynamic insights into
the estimation capabilities of statistical models.
Interestingly, error-complexity curves can be used to study model selection. In
practical terms, model selection is the task of selecting the best statistical model
from a family of models for a given data set. In Chap. 12 (see also Chap. 13), we saw
that possible model selection problems include but are not limited to the following:

• Select predictor variables for linear regression models.
• Select among different regularization models, such as ridge regression, LASSO, or elastic net.
• Select the best classification method from a list of candidates, e.g., random forest, logistic regression, support vector machine, or neural networks.
• Select the number of neurons and hidden layers in a neural network.
The general problems one tries to counteract with model selection are the
overfitting and underfitting of data.
• Underfitting model: Such a model is characterized by high bias, low variance,
and a poor test error. In general, such a model is too simple.
• Best model: For such a model, the bias and variance are balanced, and the model
makes good predictions, i.e., the test error is low.
• Overfitting model: Such a model is characterized by low bias, high variance, and
a poor test error. In general, such a model is too complex.
It is important to realize that the preceding terms are defined for a given data set
with a certain sample size. Specifically, the error-complexity curves are estimated
from training data with fixed sample size, and, hence, these curves can change if the
sample size changes. In contrast, the learning curves investigate the dependency on
the sample size of the training data.
In Chap. 12, we discussed elegant methods for model selection, such as the AIC
or the BIC; however, the applicability of these methods depends on the availability
of analytical results for the models, usually based on their maximum likelihood.
Unfortunately, such results can often only be obtained for linear models, as seen
in Chap. 12, but may not be available for other types of models. So, for practical
applications, these methods are far less flexible compared to numerical resampling
methods.
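
To illustrate this contrast briefly, for linear polynomial regression models the AIC is directly available in R, whereas for arbitrary models one would fall back on resampling estimates such as cross-validation. The following sketch reuses the assumed gen.data() function from the learning-curve sketch in Sect. 18.6.1; the sample size and the range of candidate degrees are illustrative choices.

# Model selection by AIC for linear polynomial regression models (a sketch).
set.seed(3)
dat <- gen.data(150)                                           # assumed simulated data set
aic <- sapply(1:10, function(degree)
  AIC(lm(y ~ poly(x, degree, raw = TRUE), data = dat)))
which.min(aic)                                                 # model degree selected by the AIC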
The bias-variance trade-off, which provides a frequentist viewpoint of model complexity,
is particularly useful for practical problems where the true model is unknown or not accessible.
It offers a framework to think about a problem conceptually. Interestingly, the
balancing of bias and variance reflects the underlying philosophy of Occam’s razor
[202], which states that of two similarly performing models the simpler one should be chosen.
Importantly, for simulations the true model is known, and the decomposition into
noise, bias, and variance is feasible.

18.8 Summary

In the last chapter of this book, we discussed the expected generalization error as
the conceptual roof for many topics discussed in the previous chapters. We have
seen that the expected generalization error has a direct effect on model assessment
and model selection. Since these topics are also used for a wider purpose, such as
for selecting the best classification or regression model by means of resampling
methods, the expected generalization error penetrates essentially every aspect of
data science.
Learning Outcome 18: Expected Generalization Error

The expected generalization error is a theoretical entity defined as a population
estimate. That means the entire population from which the data are drawn
needs to be available. Similarly, not only one training data set is needed but
infinitely many. Hence, practically, the expected generalization error is not
attainable but needs to be estimated via sample estimates.

For practical applications, the most flexible approach that can be applied to any
type of statistical model is a resampling method (for instance, cross-validation).
Assuming that the computations can be completed within an acceptable time frame,
it is advised to base the decisions for model selection and model assessment on the
sample estimates of the error-complexity curves and the learning curves.
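
As a minimal illustration of this recommendation, the following sketch computes a k-fold cross-validation estimate of the test error for each candidate model complexity, i.e., a sample estimate of an error-complexity curve. It again uses the assumed gen.data() function from the learning-curve sketch in Sect. 18.6.1; in practice, dat would be one's own data frame, and the number of folds and the candidate degrees are illustrative choices.

# k-fold cross-validation as a sample estimate of the error-complexity curve (a sketch).
set.seed(2)
dat   <- gen.data(150)                                         # replace by your own data frame with columns x and y
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))              # random assignment of the samples to k folds
cv.error <- sapply(1:10, function(degree) {
  mean(sapply(1:k, function(i) {                               # average test error over the k held-out folds
    fit <- lm(y ~ poly(x, degree, raw = TRUE), data = dat[folds != i, ])
    mean((dat$y[folds == i] - predict(fit, newdata = dat[folds == i, ]))^2)
  }))
})
best.degree <- which.min(cv.error)                             # complexity with the smallest estimated test error
plot(1:10, cv.error, type = "b", log = "y",
     xlab = "model degree (complexity)", ylab = "CV estimate of the test error")
abline(v = best.degree, lty = 2, col = "red")

Note that the same fold assignment is reused for every candidate degree, which makes the comparison between the model complexities more direct.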

18.9 Outlook

Data science is a field under rapid development. For this reason, the goal of this
book was not to provide comprehensive coverage of all topics but to introduce
data science step by step, starting from the basics.
We want to finish this book by pointing out some additional topics to be learned
by the advanced data scientist. The following list provides advanced topics of great
interest to explore.
• Causal inference
• Deep reinforcement learning
• Digital twin
• Double descent
• Ensemble methods
– Bagging
– Boosting
• Generative adversarial networks
• Generative question answering
• Meta-analysis
• Network science
• Time series analysis
• Visual question answering
Most of these topics are fairly recent, for example, generative adversarial networks [92]
or deep reinforcement learning [18], while others, like causal inference [235], have been
around for decades (which does not mean that there are no novel developments in causal
inference). Despite the considerable differences among these topics, what they have in
common is that they are advanced, requiring a firm understanding of all the basics of
machine learning, artificial intelligence, and statistics as discussed in this book. Hence,
jumping right into these topics while skipping the basics will most likely lead to problems
in mastering a deeper understanding and appreciation of the underlying concepts.
With the introduction of ChatGPT, everyone is now aware that natural language
processing methods are of particular interest, and ChatGPT certainly shows an
impressive performance as a generative question-answering system [529]. Less
well known may be the digital twin [10], which enables new learning paradigms that
show great promise across all sciences and engineering. Nevertheless, despite all these
new developments, the many methods presented in this book provide the foundation
for all trends that could emerge next.

18.10 Exercises

1. Discuss the definition of the expected generalization error.
a. What is the meaning of EP in this definition?
b. What is the meaning of EDtrain in this definition?
2. What is the difference between the in-sample error and the out-of-sample error?
What data are used as the test data for these two errors?
3. Derive the bias-variance decomposition of the expected generalization error.
4. What is an error-complexity curve?
5. Estimate error-complexity curves for a polynomial regression model.
a. Assume the true polynomial regression model has a degree of 2.
b. Assume the true polynomial regression model has a degree of 5.
c. Assume the true polynomial regression model has a degree of 7.
d. Discuss the differences of these models.
6. Discuss overfitting and underfitting based on an error-complexity curve.
7. What influence do bias and variance have on the overfitting and underfitting of a
model?
8. What is a learning curve?
9. What data are needed for plotting learning curves?
References

1. O. Aalen, Nonparametric inference for a family of counting processes. Ann. Stat. 701–726
(1978).
2. M. Abadi, A. Agarwal, P. Barham, et al., Tensorflow: large-scale machine learning on
heterogeneous distributed systems (2016).
3. Y.S. Abu-Mostafa, M. Magdon-Ismail, H.-T. Lin, Learning from Data, vol. 4. (AMLBook,
New York, 2012).
4. K. Aho, D. Derryberry, T. Peterson, Model selection for ecologists: the worldviews of AIC
and BIC. Ecology 95(3), 631–636 (2014).
5. H. Akaike, A new look at the statistical model identification, in Selected Papers of Hirotugu
Akaike (Springer, Berlin, 1974), pp. 215–222.
6. B. Alipanahi, A. Delong, M.T. Weirauch, B.J. Frey, Predicting the sequence specificities of
DNA and RNA-binding proteins by deep learning. Nat. Biotechnol. 33(8), 831 (2015).
7. G. Altay, F. Emmert-Streib, Structural influence of gene networks on their inference: analysis
of C3NET. Biol. Direct 6, 31 (2011).
8. S. Bashath, N. Perera, S. Tripathi, K. Manjang, M. Dehmer, F.E. Streib, A data-centric review
of deep transfer learning with applications to text data. Inf. Sci. 585, 498–528 (Elsevier, 2022)
9. F. Emmert-Streib, M. Dehmer, Taxonomy of machine learning paradigms: A data-centric
perspective. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 12(5),
e1470 (Wiley Online Library, 2022)
10. F. Emmert-Streib, O. Yli-Harja, What is a digital twin? Experimental Design for a Data-
Centric Machine Learning Perspective in health. International journal of molecular sciences.
23(21), 13149 (MDPI, 2022)
11. M. Alvi, D. McArt, P. Kelly, et al., Comprehensive molecular pathology analysis of small
bowel adenocarcinoma reveals novel targets with potential clinical utility. Oncotarget 6(25),
20863–20874 (2015).
12. S.-I. Amari, A universal theorem on learning curves. Neural Netw. 6(2), 161–166 (1993).
13. S.-I. Amari, N. Fujita, S. Shinomoto, Four types of learning curves. Neural Comput. 4(4),
605–618 (1992).
14. V. Amrhein, S. Greenland, B. McShane, Scientists rise up against statistical significance.
Nature 567, 3055–3307 (2019).
15. S. Ancarani, C. Di Mauro, L. Fratocchi, et al., Prior to reshoring: a duration analysis of foreign
manufacturing ventures. Int. J. Prod. Eco. 169, 141–155 (2015).
16. S. Arlot, A. Celisse, et al., A survey of cross-validation procedures for model selection. Stat.
Surv. 4, 40–79 (2010).

17. A.B. Arrieta, N. Díaz-Rodríguez, J. Del Ser et al., Explainable artificial intelligence (XAI):
Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58,
82–115 (2020).
18. K. Arulkumaran, M.P. Deisenroth, M. Brundage, A.A. Bharath, Deep reinforcement learning:
a brief survey. IEEE Sig. Proces. Mag. 34(6), 26–38 (2017).
19. S.R. Austin, I. Dialsingh, N. Altman, Multiple hypothesis testing: a review. J. Indian Soc.
Agric. Stat. 68(2), 303–14 (2014).
20. J. Bacher, Clusteranalyse (Oldenbourg Verlag, Munich, 1996).
21. R. Baeza-Yates, B. Ribeiro-Neto (eds.), Modern Information Retrieval (Addison-Wesley,
Reading, 1999).
22. P. Bühlmann, S. Van De Geer, Statistics for High-Dimensional Data: Methods, Theory and
Applications (Springer Science & Business Media, Berlin, 2011).
23. P. Baldi, S. Brunak, Y. Chauvin, et al., Assessing the accuracy of prediction algorithms for
classification: an overview. Bioinformatics 16(5), 412–424 (2000).
24. H.U. Bao-Gang, W. Yong, Evaluation criteria based on mutual information for classifications
including rejected class. Acta Automat. Sin. 34(11), 1396–1403 (2008).
25. A.-L. Barabási, Network medicine—From obesity to the “Diseasome”. N. Engl. J. Med.
357(4), 404–407 (2007).
26. M. Baron, Probability and Statistics for Computer Scientists. (Chapman and Hall/CRC, Boca
Raton, 2013).
27. H. Barraclough, L. Simms, R. Govindan, Biostatistics primer: what a clinician ought to know:
hazard ratios. J. Thorac. Oncol. 6(6), 978–982 (2011).
28. E. Bart, S. Ullman, Cross-generalization: Learning novel classes from a single example by
feature replacement, in 2005 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’05), vol. 1 (IEEE, Piscataway, 2005), pp. 672–679.
29. A.M. Bartkowiak, Anomaly, novelty, one-class classification: a comprehensive introduction.
Int. J. Comput. Inf. Syst. Ind. Manag. Appl. 3(1), 61–71 (2011).
30. E.M.L. Beale, M.G. Kendall, D.W. Mann, The discarding of variables in multivariate analysis.
Biometrika 54(3–4), 357–366 (1967).
31. J. Bekker, J. Davis, Learning from positive and unlabeled data: a survey. Mach. Learn. 109(4),
719–760 (2020).
32. R. Bender, Introduction to the use of regression models in epidemiology, in Cancer
Epidemiology (Springer, Berlin, 2009), pp. 179–195.
33. Y. Bengio, et al., Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127
(2009).
34. D.J. Benjamin, J.O. Berger, Three recommendations for improving the use of p-values. Am.
Stat. 73(sup1), 186–191 (2019).
35. Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful
approach to multiple testing. J. R. Stat. Soc. B (Methodol.) 57, 125–133 (1995).
36. Y. Benjamini, Y. Hochberg, On the adaptive control of the false discovery rate in multiple
testing with independent statistics. J. Educat. Behav. Stat. 25(1), 60–83 (2000).
37. Y. Benjamini, D. Yekutieli, The control of the false discovery rate in multiple testing under
dependency. Ann. Stat. 29(4), 1165–1188 (2001).
38. Y. Benjamini, A.M. Krieger, D. Yekutieli, Adaptive linear step-up procedures that control the
false discovery rate. Biometrika 93(3), 491–507 (2006).
39. C.M. Bennett, G.L. Wolford, M.B. Miller, The principled control of false positives in
neuroimaging. Soc. Cogn. Affect. Neurosci. 4(4), 417–422 (2009).
40. C.M. Bennett, A.A. Baird, M.B. Miller, G.L. Wolford, Neural correlates of interspecies
perspective taking in the post-mortem atlantic salmon: an argument for proper multiple
comparisons correction. J. Serendipitous Unexpect. Results 1, 1–5 (2011).
41. C. Bergmeir, J.M. Benítez, Neural networks in R using the stuttgart neural network simulator:
RSNNS. J. Stat. Softw. 46(7), 1–26 (2012).
42. D.J. Biau, B.M. Jolles, R. Porcher, P value and the theory of hypothesis testing: an explanation
for new researchers. Clin. Orthop. Relat. Res. 468(3), 885–892 (2010).
43. P.J. Bickel, B. Li, Regularization in statistics. Test 15(2), 271–344 (2006).
44. O. Biran, C. Cotton, Explanation and justification in machine learning: a survey, in IJCAI-17
Workshop on Explainable AI (XAI), vol. 8 (2017), p. 1.
45. C.M. Bishop, Pattern Recognition and Machine Learning (Springer, Berlin, 2006).
46. G. Blanchard, É. Roquain, Adaptive false discovery rate control under independence and
dependence. J. Mach. Learn. Res. 10(Dec), 2837–2871 (2009).
47. G. Blanchard, T. Dickhaus, N. Hack, et al., μtoss-multiple hypothesis testing in an open
software system, in Proceedings of the First Workshop on Applications of Pattern Analysis
(2010), pp. 12–19.
48. A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Learnability and the vapnik-
chervonenkis dimension. J ACM 36(4), 929–965 (1989).
49. H.H. Bock, Automatische Klassifikation. Theoretische und praktische Methoden zur Grup-
pierung und Strukturierung von Daten. Studia Mathematica (Vandenhoeck & Ruprecht,
Göttingen, 1974).
50. D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structures
(Research Studies Press, Chichester, 1983).
51. D. Bonchev, D.H. Rouvray, Chemical Graph Theory: Introduction and Fundamentals.
Mathematical Chemistry (Abacus Press, London, 1991).
52. Y. Bondarenko, Boltzman-machines (2017).
53. E. Bonferroni, Teoria statistica delle classi e calcolo delle probabilita, in Pubblicazioni del R
Istituto Superiore di Scienze Economiche e Commerciali di Firenze (1936), pp. 3–62.
54. B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers,
in Proceedings of the Fifth Annual Workshop on Computational Learning Theory (1992),
pp. 144–152.
55. L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of
COMPSTAT’2010 (Springer, Berlin, 2010), pp. 177–186.
56. A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning
algorithms. Pattern Recogn. 30(7), 1145–1159 (1997).
57. L. Breiman, Statistics. With a view toward applications (Houghton Mifflin Co., Boston, 1973).
58. L. Breiman, Better subset regression using the nonnegative garrote. Technometrics 37(4),
373–384 (1995).
59. L. Breiman, Bagging predictors. Mach. Learn. 24(2), 123–140 (1996).
60. L. Breiman, J.H. Friedman, R.A. Olshen, Ch.J. Stone, Classification and regression trees.
(Routledge, Milton Park, 1999).
61. L. Breiman et al., Statistical modeling: the two cultures. Stat. Sci. 16(3), 199–231 (2001).
62. N. Breslow, Covariance analysis of censored survival data. Biometrics, 89–99 (1974).
63. C. Brunsdon, M. Charlton, An assessment of the effectiveness of multiple hypothesis testing
for geographical anomaly detection. Environ. Plann. B. Plann. Des. 38(2), 216–230 (2011).
64. N. Buckley, P. Haddock, R. De Matos Simoes, et al., A BRCA1 deficient, NFκB driven
immune signal predicts good outcome in triple negative breast cancer. Oncotarget 7(15),
19884–19896 (2016).
65. K.P. Burnham, D.R. Anderson, Multimodel inference: understanding AIC and BIC in model
selection. Sociol. Methods Res. 33(2), 261–304 (2004).
66. C.L. Byrne, The EM algorithm theory: theory, applications and related methods (2017).
https://faculty.uml.edu/cbyrne/AnEMbook.pdf. Last accessed 28 July 2021.
67. A. Candel, V. Parmar, E. LeDell, A. Arora, Deep learning with H2O (2015).
68. E. Candes, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n.
Ann. Stat. 35(6), 2313–2351 (2007).
69. C. Cao, F. Liu, H. Tan, et al., Deep learning and its applications in biomedicine. Genomics
Proteomics Bioinformatics 16(1), 17–32 (2018).
70. F. Capra, The web of life: a new scientific understanding of living systems (Anchor, South
Harpswell, 1996).
71. M.Á. Carreira-Perpiñán, G. Hinton, On contrastive divergence learning, in Proceedings of the
Tenth International Workshop on Artificial Intelligence and Statistics, PMLR (2005), pp. 33–
40.
72. R. Caruana, Multitask learning. Mach. Learn. 28(1), 41–75 (1997).


73. G.C. Cawley, N.L.C. Talbot, On over-fitting in model selection and subsequent selection bias
in performance evaluation. J. Mach. Learn. Res. 11(Jul), 2079–2107 (2010).
74. O. Chapelle, B. Schölkopf, A. Zien, Semi-supervised learning. Adaptive Computation and
Machine Learning (The MIT Press, Cambridge, 2006).
75. A.S. Charles, B.A. Olshausen, C.J. Rozell, Learning sparse codes for hyperspectral imagery.
IEEE J. Select. Topics Sig. Proces. 5(5), 963–978 (2011).
76. T. Chen, M. Li, Y. Li, et al., MXNet: a flexible and efficient machine learning library for
heterogeneous distributed systems (2015).
77. M.R. Chernick, R.A. LaBudde, An introduction to bootstrap methods with applications to R.
(John Wiley & Sons, Hoboken, 2014).
78. Chimera0. pydbm (2019).
79. K. Cho, B. Van Merriënboer, C. Gulcehre, et al., Learning phrase representations using RNN
encoder-decoder for statistical machine translation (2014). Preprint. arXiv:1406.1078.
80. F. Chollet, et al., Keras (2015). https://github.com/fchollet/keras.
81. A. Clare, R.D. King, Knowledge discovery in multi-label phenotype data, in European
conference on principles of data mining and knowledge discovery (Springer, Berlin, 2001),
pp. 42–53.
82. B. Clarke, E. Fokoue, H.H. Zhang, Principles and Theory for Data Mining and Machine
Learning (Springer, Dordrecht, 2009).
83. W.S. Cleveland, Data science: an action plan for expanding the technical areas of the field of
statistics. Int. Stat. Rev. 69(1), 21–26 (2001).
84. M. Cleves, W. Gould, W.W. Gould, et al., An introduction to survival analysis using stata
(Stata Press, College Station, 2008).
85. J. Cohen, A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46
(1960).
86. G. Cohen, S. Afshar, J. Tapson, A. van Schaik, EMNIST: an extension of MNIST to
handwritten letters (2017). Preprint. arXiv:1702.05373.
87. D. Cook, L.B. Holder, Mining graph data (Wiley-Interscience, Hoboken, 2007).
88. J.M. Cortina, W.P. Dunlap, On the logic and purpose of significance testing. Psychol. Methods
2(2), 161 (1997)
89. D.R. Cox, Regression models and life-tables. J. R. Stat. Soc. B. Methodol. 34(2), 187–202
(1972).
90. D.R. Cox, Partial likelihood. Biometrika 62(2), 269–276 (1975).
91. K. Cranmer, Statistical challenges for searches for new physics at the LHC, in Statistical
problems in particle physics, astrophysics and cosmology (World Scientific, Singapore, 2006),
pp. 112–123.
92. A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, A.A. Bharath, Generative
adversarial networks: An overview. IEEE Signal Process. Mag 35(1), 53–65 (2018, IEEE)
93. F. Crick, Central dogma of molecular biology. Nature 227, 561–563 (1970).
94. J. Dai, Y. Wang, X. Qiu, et al., BigDL: a distributed deep learning framework for big data
(2018).
95. A. Dasgupta, Y.V. Sun, I.R. König, et al., Brief review of regression-based and machine
learning methods in genetic epidemiology: the genetic analysis workshop 17 experience.
Genet. Epidemiol. 35(S1), S5–S11 (2011).
96. M. Dehmer, F. Emmert-Streib, Structural information content of networks: graph entropy
based on local vertex functionals. Comput. Biol. Chem. 32, 131–138 (2008).
97. M. Dehmer, F. Emmert-Streib, The structural information content of chemical networks. Z.
Naturforsch., A 63a, 155–158 (2008).
98. M. Dehmer, F. Emmert-Streib, Quantitative Graph Theory. Theory and Applications. (CRC
Press, Boca Raton, 2014).
99. M. Dehmer, F. Emmert-Streib, Frontiers in data science. Chapman & Hall/CRC, Big Data
Series. (Taylor & Francis Group, Milton Park, 2018).
100. M. Dehmer, A. Mowshowitz, A history of graph entropy measures. Inf. Sci. 1, 57–78 (2011).
101. D.M. DeLong, G.H. Guirguis, Y.C. So, Efficient computation of subset selection probabilities
with application to Cox regression. Biometrika 81(3), 607–611 (1994).
102. D. DeMers, G. Cottrell, Reducing the dimensionality of data with neural networks, in
Advances in neural information processing systems, vol. 5 (1993), pp. 580–587.
103. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the
EM algorithm (with discussion). J. R. Stat. Soc. B 39, 1–38 (1977).
104. J. Deng, Z. Zhang, E. Marchi, B. Schuller, Sparse autoencoder-based feature transfer learning
for speech emotion recognition, in 2013 Humaine Association Conference on Affective
Computing and Intelligent Interaction (IEEE, Piscataway, 2013), pp. 511–516.
105. S. Derksen, H.J. Keselman, Backward, forward and stepwise automated subset selection
algorithms: frequency of obtaining authentic and noise variables. Br. J. Math. Stat. Psychol.
45(2), 265–282 (1992).
106. G. Deuschl, C. Schade-Brittinger, P. Krack, et al., A randomized trial of deep-brain stimula-
tion for Parkinson’s disease. N. Engl. J. Med. 355(9), 896–908 (2006).
107. J. Devillers, A.T. Balaban, Topological indices and related descriptors in QSAR and QSPR
(Gordon and Breach Science Publishers, Amsterdam, 1999).
108. M.M. Deza, E. Deza, Encyclopedia of distances, 2nd ed. (Springer, Berlin, 2012).
109. R. de Matos Simoes, F. Emmert-Streib, Bagging statistical network inference from large-scale
gene expression data. PLoS ONE 7(3), e33624 (2012).
110. R. de Matos Simoes, M. Dehmer, F. Emmert-Streib, Interfacing cellular networks of S.
cerevisiae and E. coli: connecting dynamic and genetic information. BMC Genom. 14, 324
(2013).
111. S. Dieleman, J. Schlüter, C. Raffel, et al., Lasagne: first release (2015).
112. J. Ding, V. Tarokh, Y. Yang, Model selection techniques: an overview. IEEE Sig. Proces. Mag.
35(6), 16–34 (2018).
113. M.V. Diudea, I. Gutman, L. Jäntschi, Molecular topology (Nova Publishing, New York, 2001).
114. M. Dixon, D. Klabjan, L. Wei, OSTSC: over sampling for time series classification in R
(2017).
115. A.P. Diz, A. Carvajal-Rodríguez, D.O.F. Skibinski, Multiple hypothesis testing in proteomics:
a strategy for experimental work. Mol. Cell. Proteomics 10(3), M110.004374 (2011).
116. S. Döhler, Validation of credit default probabilities using multiple-testing procedures. J. Risk
Model Validat. 4(4), 59 (2010).
117. S. Döhler, G. Durand, E. Roquain, et al., New FDR bounds for discrete and heterogeneous
tests. Electron. J. Stat. 12(1), 1867–1900 (2018).
118. A. Dmitrienko, A.C. Tamhane, F. Bretz, Multiple testing problems in pharmaceutical
statistics. (CRC Press, Boca Raton, 2009).
119. J. Donahue, Y. Jia, O. Vinyals, et al., DeCAF: a deep convolutional activation feature for
generic visual recognition, in Proceedings of the 31st International Conference on Machine
Learning, PMLR (2014), pp. 647–655.
120. F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning (2017)
Preprint. arXiv:1702.08608.
121. N.R. Draper, H. Smith, Applied regression analysis, vol. 326. (John Wiley & Sons, Hoboken,
2014).
122. R.O. Duda, P.E. Hart, et al., Pattern classification (John Wiley & Sons, Hoboken, 2000).
123. S. Dudoit, M.J. van Der Laan, Multiple testing procedures with applications to genomics
(Springer Science & Business Media, Berlin, 2007).
124. S. Dudoit, M.J. van der Laan, Multiple testing procedures with applications to genomics.
(Springer, New York, 2007).
125. S. Dudoit, J.P. Shaffer, J.C. Boldrick, Multiple hypothesis testing in microarray experiments.
Stat. Sci. 18(1), 71–103 (2003).
126. P.K. Dunn, G.K. Smyth, Generalized linear models with examples in R (Springer, Berlin,
2018).
127. B. Efron, The efficiency of Cox’s likelihood function for censored data. J. Am. Stat. Assoc.
72(359), 557–565 (1977).
128. B. Efron, Nonparametric estimates of standard error: the jackknife, the bootstrap and other
methods. Biometrika 68(3), 589–599 (1981).
129. B. Efron, The jackknife, the bootstrap, and other resampling plans, vol. 38. (SIAM,
Philadelphia, 1982).
130. B. Efron, T. Hastie, R. Tibshirani, Discussion: the Dantzig selector: statistical estimation when
p is much larger than n. Ann. Stat. 35(6), 2358–2364 (2007).
131. B. Efron, Large-scale inference: empirical Bayes methods for estimation, testing, and
prediction (Cambridge University Press, Cambridge, 2010).
132. B. Efron, R.J. Tibshirani, An introduction to the bootstrap (Chapman and Hall/CRC, New
York, 1994).
133. S.A. ElHafeez, C. Torino, G. D’Arrigo, et al., An overview on standard statistical methods
for assessing exposure-outcome link in survival analysis (part II): the Kaplan-Meier analysis
and the Cox regression method. Aging Clin. Exp. Res. 24(3), 203–206 (2012).
134. A. Elisseeff, J. Weston, A kernel method for multi-labelled classification. Adv. Neural Inform.
Proces. Syst. 14 (2001).
135. N. Elsayed, A.S. Maida, M. Bayoumi, Reduced-gate convolutional LSTM using predictive
coding for spatiotemporal prediction (2018). Preprint. arXiv:1810.07251.
136. F. Emmert-Streib, A heterosynaptic learning rule for neural networks. Int. J. Mod. Phys. C
17(10), 1501–1520 (2006).
137. F. Emmert-Streib, M. Dehmer, Global information processing in gene networks: fault
tolerance, in Proceedings of the Bio-Inspired Models of Network, Information, and Computing
Systems, Bionetics 2007 (2007).
138. F. Emmert-Streib, M. Dehmer, Information processing in the transcriptional regulatory
network of yeast: functional robustness. BMC Syst. Biol. 3, 35 (2009).
139. F. Emmert-Streib, M. Dehmer (eds.), Analysis of microarray data: a network-based approach.
(Wiley VCH Publishing, Hoboken, 2010).
140. F. Emmert-Streib, M. Dehmer (eds.), Medical biostatistics for complex diseases (Wiley-
Blackwell, Weinheim, 2010).
141. F. Emmert-Streib, M. Dehmer, Identifying critical financial networks of the DJIA: towards a
network-based index. Complexity 16(1), 24–33 (2010).
142. F. Emmert-Streib, M. Dehmer, A machine learning perspective on personalized medicine: an
automatized, comprehensive knowledge base with ontology for pattern recognition. Mach.
Learn. Knowl. Extract. 1(1), 149–156 (2018).
143. F. Emmert-Streib, M. Dehmer, High-dimensional lasso-based computational regression
models: regularization, shrinkage, and selection. Mach. Learn. Knowl. Extract. 1(1), 359–383
(2019).
144. F. Emmert-Streib, M. Dehmer, Evaluation of regression models: model assessment, model
selection and generalization error. Mach. Learn. Knowl. Extract. 1(1), 521–551 (2019).
145. F. Emmert-Streib, M. Dehmer, Understanding statistical hypothesis testing: the logic of
statistical inference. Mach. Learn. Knowl. Extract. 1(3), 945–961 (2019).
146. F. Emmert-Streib, M. Dehmer, Defining data science by a data-driven quantification of the
community. Mach. Learn. Knowl. Extract. 1(1), 235–251 (2019).
147. F. Emmert-Streib, M. Dehmer, Large-scale simultaneous inference with hypothesis testing:
multiple testing procedures in practice. Mach. Learn. Knowl. Extract. 1(2), 653–683 (2019).
148. F. Emmert-Streib, S. Tripathi, R. de Matos Simoes, et al., The human disease network:
opportunities for classification, diagnosis and prediction of disorders and disease genes. Syst.
Biomed. 1(1), 1–8 (2013).
149. F. Emmert-Streib, M. Dehmer, Y. Shi, Fifty years of graph matching, network alignment and
network comparison. Inf. Sci. 346–347, 180–197 (2016).
150. F. Emmert-Streib, S. Moutari, M. Dehmer, The process of analyzing data is the emergent
feature of data science. Front. Genet. 7, 12 (2016).
151. F. Emmert-Streib, S. Tripathi, O. Yli-Harja, M. Dehmer, Understanding the world economy in
terms of networks: a survey of data-based network science approaches on economic networks.
Front. Appl. Math. Stat. 4, 37 (2018).
152. F. Emmert-Streib, S. Tripathi, M. Dehmer, Constrained covariance matrices with a biologically realistic structure: comparison of methods for generating high-dimensional Gaussian
graphical models. Front. Appl. Math. Stat. 5, 17 (2019).
153. F. Emmert-Streib, S. Moutari, M. Dehmer, Mathematical foundations of data science using
R. (Walter de Gruyter GmbH & Co KG, Berlin, 2020).
154. F. Emmert-Streib, O. Yli-Harja, M. Dehmer, Explainable artificial intelligence and machine
learning: A reality rooted perspective. WIREs Data Min. Knowl. Discov. 10, e1368 (2020).
155. F. Emmert-Streib, Z. Yang, H. Feng, et al., An introductory review of deep learning for
prediction models with big data. Frontiers Artificial Intelligence Appl. 3, 4 (2020).
156. F. Emmert-Streib, K. Manjang, M. Dehmer, et al., Are there limits in explainability of
prognostic biomarkers? scrutinizing biological utility of established signatures. Cancers
13(20), 5087 (2021).
157. S. Enarvi, M. Kurimo, Theanolm—an extensible toolkit for neural network language model-
ing (2016). CoRR, abs/1605.00942.
158. B.S. Everitt, S. Landau, M. Leese, D. Stah, Cluster Analysis, 5th ed. (Wiley-VCH, Weinheim,
2011).
159. J. Fan, J. Lv, A selective overview of variable selection in high dimensional feature space.
Stat. Sinica 20(1), 101 (2010).
160. J. Fan, F. Han, H. Liu, Challenges of big data analysis. Natl. Sci. Rev. 1(2), 293–314 (2014).
161. A. Farcomeni, A review of modern multiple hypothesis testing, with particular attention to
the false discovery proportion. Stat. Methods Med. Res. 17(4), 347–88 (2008).
162. T. Fawcett, An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874 (2006).
163. L. Fei-Fei, R. Fergus, P. Perona, One-shot learning of object categories. IEEE Trans. Pattern
Anal. Mach. Intell. 28(4), 594–611 (2006).
164. L. Fein, The role of the university in computers, data processing, and related fields. Commun.
ACM 2(9), 7–14 (1959).
165. J.A. Ferreira, A.H. Zwinderman, et al., On the benjamini-hochberg method. Ann. Stat. 34(4),
1827–1849 (2006).
166. A. Fischer, C. Igel, An introduction to restricted Boltzmann machines, in Progress in pattern
recognition, image analysis, computer vision, and applications. CIARP, ed. by L. Alvarez, M.
Mejail, L. Gomez, J. Jacobo. Lecture Notes in Computer Science (Springer, Berlin, 2012).
167. R.A. Fisher, On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc.
A 222, 309–368 (1922).
168. R.A. Fisher, Statistical methods for research workers (Genesis Publishing Pvt. Ltd., Delhi,
1925).
169. R.A. Fisher, The statistical method in psychical research, in Proceedings of the Society for
Psychical Research, vol. 39 (1929), pp. 189–192.
170. R.A. Fisher, The arrangement of field experiments (1926), in Breakthroughs in Statistics
(Springer, Berlin, 1992), pp. 82–91.
171. P. Flach, Machine learning: the art and science of algorithms that make sense of data.
(Cambridge University Press, New York, 2012).
172. M.R. Forster, Key concepts in model selection: Performance and generalizability. J. Math.
Psychol. 44(1), 205–231 (2000).
173. M.R. Forster, Predictive accuracy as an achievable goal of science. Philos. Sci. 69(S3), S124–
S134 (2002).
174. A.V. Frane, Are per-family type I error rates relevant in social and behavioral science? J. Mod.
Appl. Stat. Methods 14(1), 5 (2015).
175. L.E. Frank, J.H. Friedman, A statistical view of some chemometrics regression tools.
Technometrics 35(2), 109–135 (1993).
176. B.R. Frieden, Science from Fisher information: a unification. (Cambridge University Press,
Cambridge, 2004).
177. J. Friedman, T. Hastie, R. Tibshirani, glmnet: Lasso and elastic-net regularized generalized
linear models. R Packag. Ver. 1(4) (2009).
178. J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via
coordinate descent. J. Stat. Softw. 33(1), 1 (2010).
179. J. Fürnkranz, E. Hüllermeier, E. Loza Mencía, K. Brinker, Multilabel classification via
calibrated label ranking. Mach. Learn. 73(2), 133–153 (2008).
180. A. Gammerman, V. Vovk, V. Vapnik, Learning by transduction, in UAI’98: Proceedings of
the Fourteenth Conference on Uncertainty in Artificial Intelligence (1998), pp. 148–155.
181. Y.C. Ge, S. Dudoit, T.P. Speed, Resampling-based multiple testing for microarray data
analysis. Test 12(1), 1–77 (2003).
182. S. Geisser, The predictive sample reuse method with applications. J. Am. Stat. Assoc. 70(350),
320–328 (1975).
183. S. Geman, E. Bienenstock, R. Doursat, Neural networks and the bias/variance dilemma.
Neural Comput. 4(1), 1–58 (1992).
184. C. Genovese, L. Wasserman, Operating characteristics and extensions of the false discovery
rate procedure. J. R. Stat. Soc. Series B Stat. Methodol. 64(3), 499–517 (2002).
185. C.R. Genovese, L. Wasserman, Exceedance control of the false discovery proportion. J. Am.
Stat. Assoc. 101(476), 1408–1417 (2006).
186. C.R. Genovese, K. Roeder, L. Wasserman, False discovery control with p-value weighting.
Biometrika 93(3), 509–524 (2006).
187. A. Genz, F. Bretz, Computation of multivariate normal and t probabilities. Lecture Notes in
Statistics (Springer, Heidelberg, 2009).
188. A. Genz, F. Bretz, T. Miwa, et al., mvtnorm: multivariate normal and t distributions (2019). R
package version 1.0-9.
189. F.A. Gers, J. Schmidhuber, Recurrent nets that time and count, in Proceedings of the
IEEE- INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural
Computing: New Challenges and Perspectives for the New Millennium, vol. 3 (IEEE,
Piscataway, 2000), pp. 189–194.
190. F.A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: continual prediction with LSTM
(1999).
191. F.A. Gers, N.N. Schraudolph, J. Schmidhuber, Learning precise timing with LSTM recurrent
networks. J. Mach. Learn. Res. 3(Aug), 115–143 (2002).
192. P. Geurts, Bias vs variance decomposition for regression and classification, in Data mining
and knowledge discovery handbook (Springer, Berlin, 2009), pp. 733–746.
193. N. Ghamrawi, A. McCallum, Collective multi-label classification, in Proceedings of the 14th
ACM International Conference on Information and Knowledge Management (2005), pp. 195–
200.
194. E. Gibaja, S. Ventura, Multi-label learning: a review of the state of the art and ongoing
research. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 4(6), 411–444 (2014).
195. G. Gigerenzer, The superego, the ego, and the id in statistical reasoning, in A handbook for
data analysis in the behavioral sciences: methodological issues (1993), pp. 311–339.
196. S.G. Gilmour, The interpretation of Mallows’s c_p-statistic. Statistician, 49–56 (1996).
197. M.K. Goel, P. Khanna, J. Kishore, Understanding survival analysis: Kaplan-Meier estimate.
Int. J. Ayurveda Res. 1(4), 274 (2010).
198. J.J. Goeman, A. Solari, The sequential rejection principle of familywise error control. Ann.
Stat. 3782–3810 (2010).
199. J.J. Goeman, A. Solari, Multiple hypothesis testing in genomics. Stat. Med. 33(11), 1946–
1978 (2014).
200. K.-I. Goh, M.E. Cusick, D. Valle, et al., The human disease network. Proc. Natl. Acad. Sci.
104(21), 8685–8690 (2007).
201. E.M. Gold, Language identification in the limit. Inf. Contr. 10(5), 447–474 (1967).
202. I.J. Good, Explicativity: a mathematical theory of explanation with statistical applications.
Proc. R. Soc. Lond. A 354(1678), 303–330 (1977).
203. P.I. Good, Resampling Methods (Springer, Berlin, 2006).
204. I.J. Goodfellow, D. Warde-Farley, P. Lamblin, et al., Pylearn2: a machine learning research
library (2013).
205. I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., Generative adversarial nets, in Advances in neural information processing systems (2014), pp. 2672–2680.
206. I. Goodfellow, Y. Bengio, A. Courville, Deep learning (The MIT Press, Cambridge, 2016).
207. S. Goodman, A dirty dozen: twelve p-value misconceptions, in Seminars in hematology,
vol. 45 (Elsevier, Amsterdam, 2008), pp. 135–140.
208. R.A. Gordon, Regression analysis for the social sciences (Routledge, Milton Park, 2015).
209. A. Gordon, G. Glazko, X. Qiu, et al., Control of the mean number of false discoveries,
Bonferroni and stability of multiple testing. Ann. Appl. Stat. 1(1), 179–190 (2007).
210. A. Graves, Generating sequences with recurrent neural networks (2013). Preprint.
arXiv:1308.0850.
211. A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and
other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005).
212. A. Graves, A. Mohamed, G. Hinton, Speech recognition with deep recurrent neural net-
works, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (IEEE, Piscataway, 2013), pp. 6645–6649.
213. S. Greenland, S.J. Senn, K.J. Rothman, et al., Statistical tests, p values, confidence intervals,
and power: a guide to misinterpretations. Eur. J. Epidemiol. 31(4), 337–350 (2016).
214. S.R. Gross, B. O’Brien, C. Hu, E.H. Kennedy, Rate of false conviction of criminal defendants
who are sentenced to death. Proc. Natl. Acad. Sci. 111(20), 7230–7235 (2014).
215. I. Guyon, A. Saffari, G. Dror, G. Cawley, Model selection: beyond the bayesian/frequentist
divide. J. Mach. Learn. Res. 11(Jan), 61–87 (2010).
216. I. Hacking, Logic of statistical inference (Cambridge University Press, Cambridge, 2016).
217. M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques. J. Intel. Inf.
Syst. 17, 107–145 (2001).
218. J. Han, M. Kamber, Data mining: concepts and techniques (Morgan and Kaufmann Publish-
ers, Burlington, 2001).
219. F. Harary, Graph theory (Addison-Wesley Publishing Company, Reading, 1969).
220. J. Hardin, R. Hoerl, N.J. Horton, et al., Data science in statistics curricula: Preparing students
to ’think with data.’ Am. Stat. 69(4), 343–353 (2015).
221. F.E. Harrell, Regression modeling strategies (Springer, New York, 2001).
222. F.E. Harrell, K.L. Lee, Verifying assumptions of the Cox proportional hazards model,
in Proceedings of the Eleventh Annual SAS Users Group International Conference (SAS
Institute Inc., Cary, 1986), pp. 823–828.
223. C.R. Harvey, Y. Liu, Evaluating trading strategies. J. Portf. Manag. 40(5), 108–118 (2014).
224. T. Hastie, R. Tibshirani, J.H. Friedman, The elements of statistical learning. (Springer, Berlin,
2001).
225. T.J. Hastie, R.J. Tibshirani, J.H. Friedman, The elements of statistical learning: data mining,
inference, and prediction. Springer Series in Statistics (Springer, New York, 2009).
226. T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning: data mining,
inference and prediction (Springer, New York, 2009).
227. T. Hastie, R. Tibshirani, M. Wainwright, Statistical Learning with sparsity: the lasso and
generalizations (CRC Press, Boca Raton, 2015).
228. C. Hayashi, What is data science? Fundamental concepts and a heuristic example, in Data
science, classification, and related methods (Springer, Berlin, 1998), pp. 40–51.
229. H.O. Hayter, Probability and statistics for engineers and scientists, 4th ed. (Duxbury Press,
Belmont, 2012).
230. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
231. D.O. Hebb, The organization of behavior (Wiley, New York, 1949).
232. D. Helbing, The automation of society is next: How to survive the digital revolution (2015).
Available at SSRN 2694312.
233. M. Henaff, J. Bruna, Y. LeCun, Deep convolutional networks on graph-structured data (2015).
Preprint. arXiv:1506.05163.
234. P. Henderson, R. Islam, P. Bachman, et al., Deep reinforcement learning that matters, in
Thirty-Second AAAI Conference on Artificial Intelligence (2018).
235. M.A. Hernan, J.M. Robins, Causal Inference. Chapman & Hall/CRC Monographs
on Statistics & Applied Probab. (CRC Press, 2023). https://books.google.fi/books?id=_KnHIAAACAAJ
236. J. Hertz, A. Krogh, R.G. Palmer, Introduction to the theory of neural computation. (Addison-
Wesley, Boston, 1991).
237. G.E. Hinton, A practical guide to training restricted Boltzmann machines, in Neural networks:
tricks of the trade (Springer, Berlin, 2012), pp. 599–619.
238. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks.
Science 313, 504–507 (2006).
239. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks.
Science 313(5786), 504–507 (2006).
240. G.E. Hinton, T.J. Sejnowski, Optimal perceptual inference, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (Citeseer, 1983), pp. 448–453.
241. G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets. Neural
Comput. 18(7), 1527–1554 (2006).
242. D.C. Hoaglin, F. Mosteller, J.W. Tukey, Understanding robust and exploratory data analysis
(Wiley, New York, 1983).
243. Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance. Biometrika
75(4), 800–802 (1988).
244. J. Hochberg, A. Tamhane, Multiple comparison procedures (John Wiley & Sons, New York,
1987).
245. S. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and
problem solutions. Int. J. Uncertainty Fuzziness Knowledge Based Syst. 6(02), 107–116
(1998).
246. S. Hochreiter, J. Schmidhuber, Long short-term memory.Neural Comput. 9(8), 1735–1780
(1997).
247. W. Hoeffding, Probability inequalities for sums of bounded random variables. J. Am. Stat.
Assoc. 58(301), 13–30 (1963).
248. A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems.
Technometrics 12(1), 55–67 (1970).
249. P. Hogeweg, B. Hesper, Interactive instruction on population interactions. Comput. Biol. Med.
8(4), 319–327 (1978).
250. S. Holm, A simple sequentially rejective multiple test procedure. Scandinavian J. Stat., 65–70
(1979).
251. A. Holzinger, C. Biemann, C.S. Pattichis, D.B. Kell, What do we need to build explainable ai
systems for the medical domain? (2017). Preprint. arXiv:1712.09923.
252. G. Hommel, A stagewise rejective multiple test procedure based on a modified Bonferroni
test. Biometrika 75(2), 383–386 (1988)
253. J.J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Natl. Acad. Sci. USA 79, 2554–2558 (1982).
254. K. Hornik, Approximation capabilities of multilayer feedforward networks. Neural Netw.
4(2), 251–257 (1991).
255. H. Hotelling, Analysis of a complex of statistical variables into principal components. J. Educ.
Psychol. 24, 417–441 (1933).
256. M. Hou, B. Chaib-Draa, C. Li, Q. Zhao, Generative adversarial positive-unlabelled learning
(2017). Preprint. arXiv:1711.08054.
257. J. Howard, et al., fastai (2018). https://github.com/fastai/fastai.
258. B.G. Hu, Y. Wang, Evaluation criteria based on mutual information for classifications
including rejected class. Acta Automat. Sin. 34(11), 1396–1403 (2008).
259. R. Hubbard, R.A. Parsa, M.R. Luthy, The spread of statistical significance testing in
psychology: the case of the journal of applied psychology, 1917–1994. Theory Psychol. 7(4),
545–554 (1997).
260. W. Huber, V. Carey, L. Long, S. Falcon, R. Gentleman, Graphs in molecular biology. BMC
Bioinf. 8(Suppl 6), S8 (2007).
261. J.D. Huling, P.Z.G. Qian, Fast penalized regression and cross validation for tall data with the
oem package. J. Stat. Softw. (2018).
262. K. Hwang, W. Sung, Single stream parallelization of generalized LSTM-like RNNs on a
GPU, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (IEEE, Piscataway, 2015), pp. 1047–1051.
263. C. Igel, M. Hüsken, Improving the RPROP learning algorithm, in Proceedings of the Second
International ICSC Symposium on Neural Computation (NC 2000) (2000), pp. 115–121.
264. J.P.A. Ioannidis, Retiring significance: a free pass to bias. Nature 567(7749), 461–461 (2019).
265. A.K. Jain, R.C. Dubes, Algorithms for clustering data (Prentice-Hall Inc., Upper Saddle
River, 1988).
266. N. Japkowicz, Concept-learning in the absence of counter-examples: an autoassociation-
based approach to classification. Ph.D. Thesis. State University of New Jersey (1999).
267. K. Jaskie, A. Spanias, Positive and unlabeled learning algorithms and applications: a survey,
in 2019 10th International Conference on Information, Intelligence, Systems and Applications
(IISA) (IEEE, Piscataway, 2019), pp. 1–8.
268. E.T. Jaynes, Probability theory: the logic of science (Cambridge University Press, Cambridge,
2003).
269. Y. Jia, E. Shelhamer, J. Donahue, et al., Caffe: convolutional architecture for fast feature
embedding, in Proceedings of the 22Nd ACM International Conference on Multimedia, MM
’14 (ACM, New York, 2014), pp. 675–678.
270. I.M. Johnstone, D.M. Titterington, Statistical challenges of high-dimensional data. Philos.
Transact. A Math. Phys. Eng. Sci. 367(1906), 4237 (2009)
271. I.T. Jolliffe, Principal component analysis (Springer Science & Business Media, Berlin,
2002).
272. M.I. Jordan, Learning in Graphical Models (MIT Press, Cambridge, 1998).
273. E.-Y. Jung, C. Baek, J.-D. Lee, Product survival analysis for the app store. Market. Lett. 23(4),
929–941 (2012).
274. S. Kadam, V. Vaidya, Review and analysis of zero, one and few shot learning approaches, in
International Conference on Intelligent Systems Design and Applications (Springer, Berlin,
2018), pp. 100–112.
275. J.D. Kalbfleisch, R.L. Prentice, The Statistical Analysis of Failure Time Data, vol. 360. (John
Wiley & Sons, Hoboken, 2011).
276. E.L. Kaplan, P. Meier, Nonparametric estimation from incomplete observations. J. Am. Stat.
Assoc. 53(282), 457–481 (1958).
277. R.E. Kass, A.E. Raftery, Bayes factors. J. Am. Stat. Assoc. 90(430), 773–795 (1995).
278. A. Kassambara, M. Kosinski, P. Biecek, et al., survminer: drawing survival curves using
’ggplot2’ (2017). R package version 0.3.
279. R.L. Kaufman, Heteroskedasticity in regression: detection and correction, vol. 172 (Sage
Publications, Thousand Oaks, 2013).
280. L. Kaufman, P.J. Rousseeuw, Clustering by means of medoids (North Holland/Elsevier,
Amsterdam, 1987), pp. 405–416.
281. V. Kaushik, C.A. Walsh, Pragmatism as a research paradigm and its implications for social
work research. Soc. Sci. 8(9), 255 (2019).
282. S.S. Khan, M.G. Madden, One-class classification: taxonomy of study and review of
techniques. Knowl. Eng. Rev. 29(3), 345–374 (2014).
283. J.-H. Kim, Estimating classification error rate: Repeated cross-validation, repeated hold-out
and bootstrap. Comput. Stat. Data Anal. 53(11), 3735–3745 (2009).
284. Y. Kim, Convolutional neural networks for sentence classification (2014). Preprint.
arXiv:1408.5882.
285. D.G. Kleinbaum, M. Klein, Survival analysis: a self-learning text. Statistics for Biology and
Health (Springer, New York, 2005).
286. D.G. Kleinbaum, L.L. Kupper, Applied regression analysis and other multivariable methods.
(Duxbury Press, London, 1978).
287. G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recogni-
tion, in ICML deep learning workshop, Lille, vol. 2 (2015).
288. R. Kohavi, D.H. Wolpert, et al., Bias plus variance decomposition for zero-one loss functions,
in International Conference on Machine Learning, vol. 96 (1996), pp. 275–83.
289. R. Kohavi, et al., A study of cross-validation and bootstrap for accuracy estimation and
model selection, in International Joint Conference on Artificial Intelligence, Montreal, vol. 14
(1995). pp. 1137–1145.
290. D. Koller, N. Friedman, Probabilistic graphical models: principles and techniques (The MIT
Press, Cambridge, 2009).
291. I. Koo, S. Yao, X. Zhang, S. Kim, Comparative analysis of false discovery rate methods
in constructing metabolic association networks. J. Bioinform. Comput. Biol. 12, 1450018
(2014).
292. Q. Kou, Y. Sugomori, RcppDL (2014).
293. G. Kraemer, M. Reichstein, M.D. Mahecha, dimRed and coRanking–unifying dimensionality
reduction in R. R J. 10(1), 342–358 (2018). coRanking version 0.2.3.
294. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional
neural networks, in Advances in neural information processing systems (2012), pp. 1097–
1105.
295. D. Krstajic, L.J. Buturovic, D.E. Leahy, S. Thomas, Cross-validation pitfalls when selecting
and assessing regression and classification models. J. Cheminformat. 6(1), 10 (2014).
296. K.G. Kugler, L.A.J. Müller, A. Graber, M. Dehmer, Integrative network biology: Graph
prototyping for co-expression cancer networks. PLoS ONE 6, e22843 (2011).
297. J. Kuha, AIC and BIC: comparisons of assumptions and performance. Sociol. Methods Res.
33(2), 188–229 (2004).
298. T.S. Kuhn, The structure of scientific revolutions (University of Chicago Press, Chicago,
1970).
299. S. Lafon, A.B. Lee, Diffusion maps and coarse-graining: a unified framework for dimension-
ality reduction, graph partitioning, and data set parameterization. IEEE Trans. Pattern Anal.
Mach. Intell. 28(9), 1393–1403 (2006).
300. M. Lavine, M.J. Schervish, Bayes factors: what they are and what they are not. Am. Stat.
53(2), 119–122 (1999).
301. S. Lawrence, C. Giles, A. Tsoi, A. Back, Face recognition: a convolutional neural network
approach. IEEE Trans. Neural Netw. 8, 98–113 (1997).
302. Y. Lecun, Generalization and network design strategies, in Connectionism in perspective, ed.
by R. Pfeifer, Z. Schreter, F. Fogelman, L. Steels (Elsevier, Amsterdam, 1989).
303. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436 (2015).
304. D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization. Adv. Neural Inf.
Proces. Syst. 13, 556–562 (2001).
305. E.T. Lee, J. Wang, Statistical methods for survival data analysis, vol. 476 (John Wiley &
Sons, Hoboken, 2003).
306. H. Lee, P. Pham, Y. Largman, A.Y. Ng, Unsupervised feature learning for audio classification
using convolutional deep belief networks, in Advances in neural information processing
systems (2009), pp. 1096–1104.
307. E.L. Lehmann, The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or
two? J. Am. Stat. Assoc. 88(424), 1242–1249 (1993).
308. M.D. Lesem, T.K. Tran-Johnson, R.A. Riesenberg, et al., Rapid acute treatment of agitation in
individuals with schizophrenia: multicentre, randomised, placebo-controlled study of inhaled
loxapine. Br. J. Psychiatry 198(1), 51–58 (2011).
309. K.-M. Leung, R.M. Elashoff, A.A. Afifi, Censoring issues in survival analysis. Ann. Rev. Pub.
Health 18(1), 83–104 (1997)
310. M.K.K. Leung, H.Y. Xiong, L.J. Lee, B.J. Frey, Deep learning of the tissue-regulated splicing
code. Bioinformatics 30(12), i121–i129 (2014).
311. D. Li, T.D. Dye, Power and stability properties of resampling-based multiple testing
procedures with applications to gene oncology studies. Comput. Math. Methods Med. (2013).
312. X. Li, T. Zhao, X. Yuan, H. Liu, The flare package for high dimensional linear regression and
precision matrix estimation in R. J. Mach. Learn. Res. 16(1), 553–557 (2015).
313. R. Li, S. Wang, F. Zhu, J. Huang, Adaptive graph convolutional neural networks, in Thirty-
Second AAAI Conference on Artificial Intelligence (2018).
314. K. Liang, D. Nettleton, Adaptive and dynamic adaptive procedures for false discovery rate
control and estimation. J. R. Stat. Soc. Series B Stat. Methodol. 74(1), 163–182 (2012).
315. C. Liedtke, C. Mazouni, K.R. Hess, et al., Response to neoadjuvant therapy and long-term
survival in patients with triple-negative breast cancer. J. Clin. Oncol. 26(8), 1275–1281
(2008).
316. M. Lin, Q. Chen, S. Yan, Network in network (2013). Preprint. arXiv:1312.4400.
317. Z.C. Lipton, J. Berkowitz, C. Elkan, A critical review of recurrent neural networks for
sequence learning (2015). Preprint. arXiv:1506.00019.
318. W. Liu, J. Wang, S.-F. Chang, Robust and scalable graph-based semisupervised learning. Proc.
IEEE 100(9), 2624–2638 (2012).
319. W.-Y. Loh, Fifty years of classification and regression trees. Int. Stat. Rev. 82(3), 329–348
(2014).
320. J.S. Long, The origins of sex differences in science. Soc. Forces 68, 1297–1315 (1990).
321. M. Loukides, What is data science? (O’Reilly Media, Sebastopol, 2011).
322. Z. Lu, H. Pu, F. Wang, Z. Hu, L. Wang, The expressive power of neural networks: a view from
the width, in Advances in Neural Information Processing Systems (2017), pp. 6231–6239.
323. S.M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in Proceed-
ings of the 31st International Conference on Neural Information Processing Systems (2017),
pp. 4768–4777.
324. J. Li, S. Ma, Survival analysis in medicine and genetics (Chapman and Hall/CRC, Boca
Raton, 2013).
325. J.B. MacQueen, Some methods for classification and analysis of multivariate observations,
in Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability
(University of California Press, Berkeley, 1967), pp. 281–297.
326. L.M. Manevitz, M. Yousef, Document classification on neural networks using only positive
examples, in Proceedings of the 23rd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (2000), pp. 304–306.
327. K. Manjang, S. Tripathi, O. Yli-Harja, et al., Prognostic gene expression signatures of breast
cancer are lacking a sensible biological meaning. Sci. Rep. 11(1), 1–18 (2021).
328. R.N. Mantegna, Hierarchical structure in financial markets. Euro. Phys. J. B 11(1), 193–197
(1999).
329. N. Mantel, Evaluation of survival data and two new rank order statistics arising in its
consideration. Cancer Chemother. Rep. 50, 163–170 (1966).
330. A. Marshall, Principles of Economics (Macmillan, London, 1890).
331. B.W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage
lysozyme. Biochim. Biophys. Acta Protein Struct. Mol. Enzymol. 405(2), 442–451 (1975).
332. A.G. McKendrick, Applications of mathematics to medical problems. Proc. Edinb. Math. Soc.
44, 98–130 (1926).
333. G.J. McLachlan, T. Krishnan, The EM algorithm and extensions, 2nd ed. (Wiley, New York,
2008).
334. R.J. Meijer, T.J.P. Krebs, J.J. Goeman, Hommel’s procedure in linear time. Biom. J. 61(1),
73–82 (2019).
335. N. Meinshausen, B. Yu, et al., Lasso-type recovery of sparse representations for high-
dimensional data. Ann. Stat. 37(1), 246–270 (2009).
336. N. Meinshausen, M.H. Maathuis, P. Bühlmann et al., Asymptotic optimality of the Westfall-
Young permutation procedure for multiple testing under dependence. Ann. Stat. 39(6), 3369–
3391 (2011).
337. T. Mikolov, I. Sutskever, K. Chen, et al., Distributed representations of words and phrases
and their compositionality, in Advances in neural information processing systems (2013),
pp. 3111–3119.
338. C.J. Miller, C. Genovese, R.C. Nichol, et al., Controlling the false-discovery rate in
astrophysical data analysis. Astron. J. 122(6), 3492 (2001).
339. Y. Ming, Sh. Cao, R. Zhang, et al., Understanding hidden memories of recurrent neural
networks, in 2017 IEEE Conference on Visual Analytics Science and Technology (VAST)
(IEEE, Piscataway, 2017), pp. 13–24.
340. T.M. Mitchell, The need for biases in learning generalizations, in Readings in machine
learning ed. by J.W. Shavlik, T.G. Dietterich (Morgan Kaufman, Burlington, 1980), pp. 184–
191.
341. T. Mitchell, Machine learning (McGraw-Hill, New York, 1997).
342. V. Mnih, K. Kavukcuoglu, D. Silver, et al., Human-level control through deep reinforcement
learning. Nature 518(7540), 529 (2015).
343. A. Mohamed, G.E. Dahl, G. Hinton, Acoustic modeling using deep belief networks. IEEE
Trans. Audio Speech Lang. Proces. 20(1), 14–22 (2011).
344. M. Mohri, A. Rostamizadeh, A. Talwalkar, Foundations of machine learning. (MIT Press,
Cambridge, 2018).
345. I. Molina, J.G.I. Prat, F. Salvador, B. Treviño, E. Sulleiro, N. Serre, D. Pou, S. Roure, J.
Cabezos, L. Valerio, et al., Randomized trial of posaconazole and benznidazole for chronic
Chagas’ disease. N. Engl. J. Med. 370(20), 1899–1908 (2014).
346. A.M. Molinaro, R. Simon, R.M. Pfeiffer, Prediction error estimation: a comparison of
resampling methods. Bioinformatics 21(15), 3301–3307 (2005).
347. F. Mordelet, J.-P. Vert, A bagging SVM to learn from positive and unlabeled examples. Pattern
Recogn. Lett. 37, 201–209 (2014).
348. R.D. Morey, J.-W. Romeijn, J.N. Rouder, The philosophy of Bayes factors and the quantifi-
cation of statistical evidence. J. Math. Psychol. 72, 6–18 (2016).
349. V. Moskvina, K.M. Schmidt, On multiple-testing correction in genome-wide association
studies. Genet. Epidemiol. Off. Publ. Int. Genet. Epidemiol. Soc. 32(6), 567–573 (2008).
350. A. Mowshowitz, Entropy and the complexity of the graphs I: an index of the relative
complexity of a graph. Bull. Math. Biophys. 30, 175–204 (1968).
351. M.M. Moya, D.R. Hush, Network constraints and multi-objective optimization for one-class
classification. Neural Netw. 9(3), 463–474 (1996).
352. L. Mueller, K. Kugler, A. Graber, et al., Structural measures for network biology using
QuACN. BMC Bioinf. 12(1), 492 (2011).
353. L.A.J. Müller, M. Schutte, K.G. Kugler, M. Dehmer, QuACN: Quantitative Analysis of
Complex Networks (2012). R package version 1.6.
354. L.A.J. Müller, M. Dehmer, F. Emmert-Streib, Network-based methods for computational
diagnostics by means of R, in Computational Medicine (Springer, Berlin, 2012), pp. 185–
197.
355. D.J. Murdoch, Y.-L. Tsai, J. Adcock, P-values are random variables. Am. Stat. 62(3), 242–245
(2008).
356. D.W. Murray, A.J. Carr, C. Bulstrode, Survival analysis of joint replacements. J. Bone Joint
Surg. Br. 75(5), 697–704 (1993).
357. V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in
Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010),
pp. 807–814.
358. P. Naur, Concise survey of computer methods (1974).
359. A.A. Neath, J.E. Cavanaugh, The bayesian information criterion: background, derivation, and
applications. Wiley Interdiscip. Rev. Comput. Stat. 4(2), 199–203 (2012).
360. W. Nelson, Theory and applications of hazard plotting for censored failure data. Technomet-
rics 14(4), 945–966 (1972).
361. S. Newcomb, A generalized theory of the combination of observations so as to obtain the best
result. Am. J. Math. 8, 343–366 (1886).
362. M.E.J. Newman, Modularity and community structure in networks. Proc. Natl. Acad. Sci.
USA 103, 8577–8582 (2006).
363. J. Neyman, Su un teorema concernente le cosiddette statistiche sufficienti. Giorn. Ist. Ital. Att.
6, 320–334 (1935).
364. J. Neyman, E.S. Pearson, On the use and interpretation of certain test criteria for purposes of
statistical inference: part I. Biometrika, 175–240 (1928).
365. J. Neyman, E.S. Pearson, On the problem of the most efficient tests of statistical hypotheses.
Philos. Trans. R. Soc. Lond. A 231, 289–337 (1933).
366. A. Nichols, Causal inference with observational data. Stata J. 7(4), 507–541 (2007).
367. T. Nichols, S. Hayasaka, Controlling the familywise error rate in functional neuroimaging: a
comparative review. Stat. Methods Med. Res. 12(5), 419–446 (2003).
368. A.M. Nicholson, Generalization error estimates and training data valuation. Ph.D. Thesis,
California Institute of Technology (2002).
369. R.S. Nickerson, Null hypothesis significance testing: a review of an old and continuing
controversy. Psychol. Methods 5(2), 241 (2000).
370. M.A. Nielsen, Neural networks and deep learning (Determination Press, 2015).
371. G. Niu, M.C. du Plessis, T. Sakai, et al., Theoretical comparisons of positive-unlabeled
learning against positive-negative learning, in Advances in neural information processing
systems (2016), pp. 1199–1207.
372. T.W. Nix, J.J. Barnette, The data analysis dilemma: ban or abandon. A review of null
hypothesis significance testing. Res. Sch. 5(2), 3–14 (1998).
373. W.S. Noble, How does multiple testing correction work? Nat. Biotechnol. 27(12), 1135
(2009).
374. B.A. Olshausen, D.J. Field, Sparse coding with an overcomplete basis set: a strategy
employed by v1? Vis. Res. 37(23), 3311–3325 (1997).
375. Online Mendelian Inheritance in Man, OMIM (TM) (2007).
376. J. Oyelade, I. Isewon, F. Oladipupo, et al., Clustering algorithms: their application to gene
expression data. Bioinf. Biol. Insights 10, 237–253 (2016).
377. S.J. Pan, Q. Yang, A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10),
1345–1359 (2009).
378. O.A. Panagiotou, J.P.A. Ioannidis, Genome-Wide Significance Project. What should the
genome-wide significance threshold be? Empirical replication of borderline genetic associ-
ations. Int. J. Epidemiol. 41(1), 273–286 (2011).
379. A. Paszke, S. Gross, S. Chintala, et al., Automatic differentiation in pytorch (2017).
380. A.B. Patel, T. Nguyen, R.G. Baraniuk, A probabilistic framework for deep learning, in
NIPS’16: Proceedings of the 30th International Conference on Neural Information Process-
ing Systems (2016), pp. 2558–2566.
381. D.J. Patil, T.H. Davenport, Data scientist: the sexiest job of the 21st century. Harv. Bus.
Rev. (2012).
382. M.Q. Patton, Qualitative research & evaluation methods (SAGE Publications, Thousand
Oaks, 2002).
383. J. Pearl, M. Glymour, N.P. Jewell, Causal inference in statistics: A primer. (John Wiley &
Sons, Hoboken, 2016).
384. K. Pearson, Contributions to the mathematical theory of evolution, II: Skew variation
in homogeneous material. Trans. R. Philos. Soc. A 186, 343–414 (1895).
385. K. Pearson, On lines and planes of closest fit to systems of points in space. Philos. Mag. 2,
559–572 (1901).
386. F. Pedregosa, G. Varoquaux, A.G. Gramfort, et al., Scikit-learn: machine learning in Python.
J. Mach. Learn. Res. 12, 2825–2830 (2011).
387. H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-
dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell.
27(8), 1226–1238 (2005).
388. J.D. Perezgonzalez, Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing.
Front. Psychol. 6, 223 (2015).
389. D. Phillips, D. Ghosh, et al., Testing the disjunction hypothesis using Voronoi diagrams with
applications to genetics. Ann. Appl. Stat. 8(2), 801–823 (2014).
390. J. Piironen, A. Vehtari, Comparison of bayesian predictive methods for model selection. Stat.
Comput. 27(3), 711–735 (2017).
391. N. Pike, Using false discovery rates for multiple comparisons in ecology and evolution.
Methods Ecol. Evol. 2(3), 278–282 (2011).
392. K.S. Pollard, S. Dudoit, M.J. van der Laan, Multiple testing procedures: R multtest package
and applications to genomics. UC Berkeley Division of Biostatistics working paper series
(2004). Technical report, Working Paper 164. http://www.bepress.com/ucbbiostat/paper164.
393. J.C. Principe, D.X. Xu, Q. Zhao, J.W. Fisher, Learning from examples with information-
theoretic criteria. Signal Proces. Syst. 26(1–2), 61–77 (2000).
394. F. Provost, T. Fawcett, Data science and its relationship to big data and data-driven decision
making. Big Data 1(1), 51–59 (2013).
395. Y. Pu, Z. Gan, R. Henao, et al., Variational autoencoder for deep learning of images, labels
and captions, in Advances in neural information processing systems (2016), pp. 2352–2360.
396. J. Quackenbush, The human genome: The book of essential knowledge. Curiosity Guides
(Imagine Publishing, New York, 2011).
397. B. Quast, RNN: a recurrent neural network in R. Working Papers (2016).
398. R Development Core Team, R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna (2008). ISBN 3-900051-07-0.
399. A.E. Raftery, Bayesian model selection in social research. Sociol. Methodol. 111–163 (1995).
400. Y. Rahmatallah, F. Emmert-Streib, G. Glazko, Gene Sets Net Correlations Analysis
(GSNCA): a multivariate differential coexpression test for gene sets. Bioinformatics 30(3),
360–368 (2014).
401. Y. Rahmatallah, B. Zybailov, F. Emmert-Streib, G. Glazko, GSAR: bioconductor package for
gene set analysis in R. BMC Bioinf. 18(1), 61 (2017).
402. W. Rawat, Z. Wang, Deep convolutional neural networks for image classification: a compre-
hensive review. Neural Comput. 29(9), 2352–2449 (2017).
403. J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification.
Mach. Learn. 85(3), 333–359 (2011).
404. G.A. Rempala, Y. Yang, On permutation procedures for strong control in multiple testing with
gene expression data. Stat. Interf. 6(1) (2013).
405. M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the
rprop algorithm, in IEEE Inernational Conference on Neural Networks (1993).
406. O.Y. Rodionova, P. Oliveri, A.L. Pomerantsev, Rigorous and compliant approaches to one-
class classification. Chemom. Intell. Lab. Syst. 159, 89–96 (2016).
407. J.P. Romano, M. Wolf, et al., Balanced control of generalized error rates. Ann. Stat. 38(1),
598–633 (2010).
408. X. Rong, Deep learning toolkit in R (2014).
409. F. Rosenblatt, The perceptron, a perceiving and recognizing automaton project para. (Cornell
Aeronautical Laboratory, Buffalo, 1957).
410. B. Rost, C. Sander, Prediction of protein secondary structure at better than 70% accuracy. J.
Mol. Biol. 232(2), 584–599 (1993).
411. P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis. Comput. Appl. Math. 20, 53–65 (1987).
412. S. Ruder, An overview of multi-task learning in deep neural networks (2017). Preprint.
arXiv:1706.05098.
413. L. Ruff, R. Vandermeulen, N. Goernitz, et al., Deep one-class classification, in International
Conference on Machine Learning (2018), pp. 4393–4402.
414. J.S. Saczynski, S.E. Andrade, L.R. Harrold, et al., A systematic review of validated methods
for identifying heart failure using administrative data. Pharmacoepidemiol. Drug Saf. 21(S1),
129–140 (2012).
415. R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, in Proceedings of the Twelfth
International Conference on Artificial Intelligence and Statistics, PMLR (2009), pp. 448–455.
416. S. Santini, R. Jain, Similarity measures. IEEE Trans. Pattern Anal. Mach. Intell. 21(9), 871–
883 (1999).
417. F. Santosa, W.W. Symes, Linear inversion of band-limited reflection seismograms. SIAM J.
Sci. Stat. Comput. 7(4), 1307–1330 (1986).
418. R. Sarikaya, G. Hinton, A. Deoras, Application of deep belief networks for natural language
understanding. IEEE/ACM Trans. Audio Speech Lang. Proces. 22, 778–784 (2014).
419. S.K. Sarkar, On methods controlling the false discovery rate. Sankhyā Indian J. Stat. A, 135–
168 (2008).
420. A.G. Sawyer, J.P. Peter, The significance of statistical significance tests in marketing research.
J. Market. Res. 20(2), 122–133 (1983).
421. B. Schölkopf, A. Smola, Learning with kernels: support vector machines, regularization, optimization and beyond (The MIT Press, Cambridge, 2002).
422. B. Schölkopf, R.C. Williamson, A.J. Smola, et al., Support vector method for novelty
detection, in Advances in neural information processing systems, vol. 12 (Citeseer, 1999),
pp. 582–588.
423. C. Schaffer, A conservation law for generalization performance, in Machine learning
proceedings 1994 (Elsevier, Amsterdam, 1994), pp. 259–265.
424. D. Scherer, A. Müller, S. Behnke, Evaluation of pooling operations in convolutional
architectures for object recognition, in Artificial neural networks—ICANN 2010, ed. by K.
Diamantaras, W. Duch, L.S. Iliadis. Lecture Notes in Computer Science (Springer, Berlin,
2010).
425. J. Schmidhuber, Deep learning in neural networks: an overview. Neural Netw. 61, 85–117
(2015).
426. D. Schoenfeld, Partial residuals for the proportional hazards regression model. Biometrika
69(1), 239–241 (1982).
427. M. Schumacher, N. Holländer, W. Sauerbrei, Resampling and cross-validation techniques: a
tool to reduce bias caused by model building? Stat. Med. 16(24), 2813–2827 (1997).
428. G. Schwarz, et al., Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978).
429. T. Schweder, E. Spjøtvoll, Plots of p-values to evaluate many tests simultaneously. Biometrika
69(3), 493–502 (1982).
430. C. Seidel, Introduction to dna microarrays, in Analysis of Microarray data: a network-based
approach, ed. by F. Emmert-Streib, M. Dehmer (Wiley-VCH, Weinheim, 2008), pp. 1–26.
431. J.P. Shaffer, Multiple hypothesis testing.Ann. Rev. Psychol. 46(1), 561–584 (1995).
432. S. Shalev-Shwartz, S. Ben-David, Understanding machine learning: from theory to algo-
rithms (Cambridge University Press, Cambridge, 2014).
433. S. Sheather, A modern approach to regression with R (Springer Science & Business Media,
Berlin, 2009).
434. D. Shen, G. Wu, H.-I. Suk, Deep learning in medical image analysis. Ann. Rev. Biomed. Eng.
19, 221–248 (2017).
435. D.J. Sheskin, Handbook of parametric and nonparametric statistical procedures, 3rd ed. (RC
Press, Boca Raton, 2004).
436. D.J. Sheskin, Handbook of parametric and nonparametric statistical procedures (CRC Press,
Boca Raton, 2020).
437. G. Shmueli, et al., To explain or to predict? Stat. Sci. 25(3), 289–310 (2010).
438. D.V. Shridhar, E.B. Bartlett, R.C. Seagrave, Information theoretic subset selection for neural
network models. Comput. Chem. Eng. 22(4–5), 613–626 (1998).
439. Z. Šidák, Rectangular confidence regions for the means of multivariate normal distributions.
J. Am. Stat. Assoc. 62(318), 626–633 (1967).
440. R.J. Simes, An improved Bonferroni procedure for multiple tests of significance. Biometrika
73(3), 751–754 (1986).
441. N. Simon, J. Friedman, T. Hastie, R. Tibshirani, A sparse-group lasso. J. Comput. Graph. Stat.
22(2), 231–245 (2013).
442. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image
recognition, in International Conference on Learning Representations (2015).
443. D. Siroker, P. Koomen, A/B testing: the most powerful way to turn clicks into customers (John
Wiley & Sons, Hoboken, 2013).
444. J. Smolander, Deep learning classification methods for complex disorders. Master’s thesis, Tampere University of Technology (2016). https://dspace.cc.tut.fi/dpub/handle/123456789/23845.
445. J. Smolander, A. Stupnikov, G. Glazko, et al., Comparing biological information contained in
mRNA And non-coding RNAs for classification of lung cancer patients. BMC Cancer 19(1),
1176 (2019).
446. J. Smolander, M. Dehmer, F. Emmert-Streib, Comparing deep belief networks with support
vector machines for classifying gene expression data from complex disorders. FEBS Open
Bio. 9(7), 1232–1248 (2019).
447. Q. Song, An overview of reciprocal l1-regularization for high dimensional regression data.
Wiley Interdiscip. Rev. Comput. Stat. 10(1), e1416 (2018).
448. T. Sørlie, C.M. Perou, R. Tibshirani, et al., Gene expression patterns of breast carcinomas
distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci. 98(19), 10869–
10874 (2001).
449. S. Sosnin, M. Vashurina, M. Withnall, et al., A survey of multi-task learning methods in
chemoinformatics. Mol. Inf. 38(4), 1800108 (2019).
450. P. Spirtes, Introduction to causal inference. J. Mach. Learn. Res. 11(5) (2010).
451. A. Stang, H. Pohlabeln, K.M. Müller, et al., Diagnostic agreement in the histopathological
evaluation of lung cancer tissue in a population-based case-control study. Lung Cancer 52(1),
29–36 (2006).
452. J.R. Stevens, A. Al Masud, A. Suyundikov, A comparison of multiple testing adjustment
methods with block-correlation positively-dependent tests. PLoS One 12(4), e0176124
(2017).
453. M. Stone, Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc.
Ser. B Methodol. 111–147 (1974).
454. A. Strehl, J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining
multiple partitions. J. Mach. Learn. Res. 3(Dec), 583–617 (2002).
455. A. Stupnikov, S. Tripathi, R. de Matos Simoes, et al., samExploreR: exploring reproducibility
and robustness of RNA-seq results based on SAM files. Bioinformatics, 475 (2016).
456. F. Sung, Y. Yang, L. Zhang, et al., Learning to compare: relation network for few-shot
learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2018), pp. 1199–1208.
457. R.S. Sutton, A.G. Barto, Reinforcement learning (MIT Press, Cambridge, 1998).
458. M.R.E. Symonds, A. Moussalli, A brief guide to model selection, multimodel inference and
model averaging in behavioural ecology using Akaike’s information criterion. Behav. Ecol.
Sociobiol. 65(1), 13–21 (2011).
459. C. Szegedy, et al., Going deeper with convolutions, in 2015 IEEE Conference on Computer
Vision and Pattern Recognition CVPR (2015), pp. 1–9.
460. D. Szucs, J. Ioannidis, When null hypothesis significance testing is unsuitable for research: a
reassessment. Front. Hum. Neurosci. 11, 390 (2017).
461. L. Tarassenko, P. Hayton, N. Cerneaz, M. Brady, Novelty detection for the identification of
masses in mammograms (1995).
462. D.M.J. Tax, One-class classification: concept learning in the absence of counter-examples.
Ph.D. Thesis. Technische Universiteit Delft (2001).
463. J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear
dimensionality reductions. Science 290(5500), 2319–2323 (2000).
464. Theano Development Team, Theano: a Python framework for fast computation of mathemat-
ical expressions (2016). arXiv e-prints, abs/1605.02688.
465. T.M. Therneau, A package for survival analysis in S (2015). version 2.38.
466. T.M. Therneau, P.M. Grambsch, Modeling survival data: extending the Cox model (Springer
Science & Business Media, Berlin, 2013).
467. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58,
267–288 (1996).
468. A.N. Tikhonov, On the stability of inverse problems, in Doklady Akademii Nauk SSSR, vol. 39
(1943), pp. 195–198.
469. I. Tosic, P. Frossard, Dictionary learning. IEEE Sig. Proces. Mag. 28(2), 27–38 (2011).
470. N. Trinajstić, Chemical graph theory (CRC Press, Boca Raton, 1992).
471. S. Tripathi, F. Emmert-Streib, mvgraphnorm: multivariate Gaussian graphical models (2019).
R package version 1.0.0.
472. G. Tsoumakas, I. Katakis, Multi-label classification: an overview. Int. J. Data Warehouse.
Min. 3(3), 1–13 (2007).
473. G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, in Data mining and
knowledge discovery handbook (Springer, Berlin, 2009), pp. 667–685.
474. G. Tsoumakas, I. Katakis, I. Vlahavas, Random k-labelsets for multilabel classification. IEEE
Trans. Knowl. Data Eng. 23(7), 1079–1089 (2010).
475. J.W. Tukey, Exploratory data analysis (Addison-Wesley, New York, 1977).
476. G. Tutz, J. Ulbricht, Penalized regression with correlation-based penalty. Stat. Comput. 19(3),
239–253 (2009).
477. U.N. Umesh, R.A. Peterson, M.H. Sauber, Interjudge agreement and the maximum value of
kappa. Educ. Psychol. Meas. 49, 835–850 (1989).
478. I. Unal, Defining an optimal cut-point value in ROC analysis: an alternative approach. Comput. Math. Methods Med. (2017).
479. L.G. Valiant, A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984).
480. M.J. Van De Vijver, Y.D. He, L.J. Van’t Veer, et al., A gene-expression signature as a predictor
of survival in breast cancer. N. Engl. J. Med. 347(25), 1999–2009 (2002).
481. S. van de Geer, L1-regularization in high-dimensional statistical models (World Scientific,
Singapore, 2011), pp. 2351–2369.
482. J.E. Van Engelen, H.H. Hoos, A survey on semi-supervised learning. Mach. Learn. 109(2),
373–440 (2020).
483. V.N. Vapnik, The nature of statistical learning theory (Springer, Berlin, 1995).
484. S. Venkataraman, Z. Yang, D. Liu, et al., SparkR: scaling R programs with spark, in
Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16
(ACM, New York, 2016), pp. 1099–1104.
485. P. Vincent, H. Larochelle, I. Lajoie, et al., Stacked denoising autoencoders: Learning useful
representations in a deep network with a local denoising criterion. J. Mach. Learn. Res.
11(Dec), 3371–3408 (2010).
486. O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: a neural image caption generator,
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015),
pp. 3156–3164.
487. O. Vinyals, C. Blundell, T. Lillicrap, et al., Matching networks for one shot learning (2016).
Preprint. arXiv:1606.04080.
488. U. Von Luxburg, B. Schölkopf, Statistical learning theory: models, concepts, and results, in
Handbook of the history of logic, vol. 10 (Elsevier, Amsterdam, 2011), pp. 651–706.
489. S.I. Vrieze, Model selection and psychological theory: a discussion of the differences between
the Akaike information criterion (AIC) and the bayesian information criterion (BIC). Psychol.
Methods 17(2), 228 (2012).
490. Q.H. Vuong, Likelihood ratio tests for model selection and non-nested hypotheses. Econo-
metrica, 307–333 (1989).
491. H. Wallach, Evaluation metrics for hard classifiers. Technical report. Cambridge University
(2006).
492. L. Wan, M. Zeiler, S. Zhang, et al., Regularization of neural networks using DropConnect,
in Proceedings of the 30th International Conference on Machine Learning, PMLR (2013),
pp. 1058–1066.
493. J. Wand, X. Shen, Estimation of generalization error: random and fixed inputs. Stat. Sin.
16(2), 569 (2006).
494. Z. Wang, M. Gerstein, M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat.
Rev. Genet. 10, 57–63 (2009).
495. Y. Wang, M. Huang, L. Zhao, et al., Attention-based LSTM for aspect-level sentiment
classification, in Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing (2016), pp. 606–615.
496. Y. Wang, Q. Yao, J.T. Kwok, L.M. Ni, Generalizing from a few examples: a survey on few-
shot learning. ACM Comput. Surv. 53(3), 1–34 (2020).
497. J.H. Ward, Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58,
236–244 (1963).
498. R.L. Wasserstein, N.A. Lazar, et al., The ASA’s statement on p-values: context, process, and
purpose. Am. Stat. 70(2), 129–133 (2016).
499. R.L. Wasserstein, A.L. Schirm, N.A. Lazar, Moving to a world beyond p < 0.05. Am. Stat.
73(sup1), 1–19 (2019).
500. A.R. Webb, K.D. Copsey, Statistical pattern recognition, 3rd ed. (Wiley, Hoboken, 2011).
501. R. Wehrens, H. Putter, L.M.C. Buydens, The bootstrap: a tutorial. Chemom. Intell. Lab. Syst.
54(1), 35–52 (2000).
502. K. Weinberger, Lecture notes in machine learning (CS4780/CS5780) (2017). http://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote11.html
503. K. Weiss, T.M. Khoshgoftaar, D. Wang, A survey of transfer learning. J. Big Data 3(1), 9
(2016).
504. P.H. Westfall, On using the bootstrap for multiple comparisons. J. Biopharmaceut. Stat. 21(6),
1187–1205 (2011).
505. P.H. Westfall, J.F. Troendle, Multiple testing with minimal assumptions. Biom. J. 50(5), 745–755 (2008).
506. P.H. Westfall, S.S. Young, et al., Resampling-based multiple testing: examples and methods
for p-value adjustment, vol. 279. (John Wiley & Sons, Hoboken, 1993).
507. D.R. Wilson, T.R. Martinez, Bias and the probability of generalization, in Proceedings
Intelligent Information Systems. IIS’97 (IEEE, Piscataway, 1997), pp. 108–114.
508. D.H. Wolpert, The supervised learning no-free-lunch theorems. Soft Comput. Ind., 25–42
(2002).
509. S. Wright, Correlation and causation. J. Agricult. Res. 20, 557–585 (1921).
510. Z. Wu, S. Pan, F. Chen, et al., A comprehensive survey on graph neural networks (2019).
Preprint. arXiv:1901.00596.
511. S. Xingjian, Z. Chen, H. Wang, et al., Convolutional lstm network: A machine learning
approach for precipitation nowcasting, in Advances in neural information processing systems
(2015), pp. 802–810.
512. S. Xiong, B. Dai, J. Huling, P.Z.G. Qian, Orthogonalizing EM: a design-based least squares
algorithm. Technometrics 58, 285–293 (2016).
513. Y. Yang, Can the strengths of AIC and BIC be shared? A conflict between model identification
and regression estimation. Biometrika 92(4), 937–950 (2005).
514. Y. Yang, H. Zou, gglasso: group lasso penalized learning using a unified BMD algorithm. R
package version, 1 (2013).
515. Z. Yang, M. Dehmer, O. Yli-Harja, F. Emmert-Streib, Combining deep learning with token
selection for patient phenotyping from electronic health records. Sci. Rep. (2020).
516. L. Yao, C. Mao, Y. Luo, Graph convolutional networks for text classification, in Proceedings
of the AAAI Conference on Artificial Intelligence, vol. 33 (2019), pp. 7370–7377.
517. W.J. Youden, Index for rating diagnostic tests. Cancer 3(1), 32–35 (1950).
518. T. Young, D. Hazarika, S. Poria, E. Cambria, Recent trends in deep learning based natural
language processing. IEEE Comput. Intel. Mag. 13(3), 55–75 (2018).
519. D. Yu, J. Li, Recent progresses in deep learning based acoustic models. IEEE/CAA J.
Automat. Sin. 4(3), 396–409 (2017).
520. M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables. J. R.
Stat. Soc. Series B Stat. Methodol. 68(1), 49–67 (2006).
521. M. Yuan, Y. Lin, On the non-negative garrotte estimator. J. R. Stat. Soc. Ser. B Stat Methodol.
69(2), 143–161 (2007).
522. Y. Zhang, Q. Yang, An overview of multi-task learning. Natl. Sci. Rev. 5(1), 30–43 (2018).
523. Z. Zhang, H. Zha, Principal manifolds and nonlinear dimensionality reduction via local
tangent space alignment. SIAM J. Sci. Comput. 26(1), 313–338 (2004).
524. M.-L. Zhang, Z.-H. Zhou, ML-KNN: a lazy learning approach to multi-label learning. Pattern
Recogn. 40(7), 2038–2048 (2007).
525. M.-L. Zhang, Z.-H. Zhou, A review on multi-label learning algorithms. IEEE Trans. Knowl.
Data Eng. 26(8), 1819–1837 (2013).
526. B. Zhang, W. Zuo, Learning from positive and unlabeled examples: a survey, in 2008
International Symposiums on Information Processing (IEEE, Piscataway, 2008), pp. 650–
654.
527. W. Zhang, T. Ota, V. Shridhar, et al., Network-based survival analysis reveals subnetwork
signatures for predicting outcomes of ovarian cancer treatment. PLoS Comput. Biol. 9(3),
e1002975 (2013).
528. S. Zhang, J. Zhou, H. Hu, et al., A deep learning framework for modeling structural features
of RNA-binding protein targets. Nucleic Acids Res. 44(4), e32–e32 (2015).
529. R. Zhang, J. Guo, L. Chen, Y. Fan, X. Cheng, A review on question generation from natural language text. ACM Trans. Inf. Syst. 40(1), 1–43 (2021).
530. Y. Zhou, Sentiment classification with deep neural networks. Master’s thesis (2019).
531. N. Zhou, J. Zhu, Group variable selection via a hierarchical lasso and its oracle property
(2010). Preprint. arXiv:1006.2871.
532. X. Zhu, A.B. Goldberg, Introduction to semi-supervised learning. Synth. Lect. Artif. Intel.
Mach. Learn. 3(1), 1–130 (2009).
533. F. Zhuang, Z. Qi, K. Duan, et al., A comprehensive survey on transfer learning. Proc. IEEE
109(1), 43–76 (2020).
534. H. Zou, The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429
(2006).
535. H. Zou, T. Hastie, Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser.
B Stat Methodol. 67(2), 301–320 (2005).
536. C. Zuccaro, Mallows’ cp statistic and model selection in multiple linear regression. Market
Res. Soc. J. 34(2), 1–10 (1992).
Index

A
Accuracy, 34
Acquaintance network, 76
Activation function, 360
Activation map, 374
Adaptive Benjamini-Hochberg procedure, 446
Adaptive LASSO, 346
ade4, 170
Adjacency matrix, 76
Adjusted coefficient of determination, 312
Adjusted survival curves, 473
Agglomerative algorithms, 149
Akaike information criterion, 302
AlexNet, 380
Alternative hypothesis, 242
amap, 170
Analysis of Variance (ANOVA), 258
Approximately correct, 494
Arbuthnot, 239
Architecture, 83, 360–365, 372, 379, 380, 385, 391, 407, 412, 416–418, 514
Area under the receiver operator characteristic curve, 41–44
Artificial intelligence, 1–4, 10–12, 17, 23, 91, 359
Artificial neural network, 360
Artificial neuron, 360
Astrophysics, 430
Autoencoder, 365
Automatic interaction detection (AID), 222
Average linkage, 151
Average silhouette coefficient, 160
Axis-aligned rectangles, 502

B
Backpropagation algorithm, 366
Backward stepwise selection, 317
Bagging, 78, 511, 540, 543
Bagging conservative causal core network (BC3Net), 78
Balaban index, 154
Bayes’ factor (BF), 122, 325
Bayesian credible intervals, 118
Bayesian inference, 110
Bayesian information criterion, 314
Bayes’ theorem, 110
Benjamini-Hochberg procedure, 444
Benjamini-Krieger-Yekutieli procedure, 448
Benjamini-Yekutieli procedure, 447
Bernoulli distribution, 123
Best subset selection, 316
Beta distribution, 113
Biased estimator, 106
Bias for learning, 503
Bias-variance trade-off, 525
Bidirectional LSTM, 404
Big data, 417
Big questions, 4
Binary classification, 177
Biological cell, 73
Biology, 239
Bipartite network, 77
Blanchard-Roquain procedure, 449
Boltzmann distribution, 385
Boltzmann machine, 364
Bonferroni correction, 434
Boolean function, 491
Boolean patterns, 491
Boosting, 22, 507, 539
Bootstrap, 53, 54, 57, 59–61, 70, 127–128, 244, 255, 433
Bootstrap confidence interval, 127
Boxplot, 103
Breast cancer, 455
Breast cancer data, 371
Breiman, L., 232
Breslow approximation, 478
Business data, 9, 72, 85–86

C
Canberra, 142
CART, 222
Categorical predictor, 294
Categorical variables, 93, 191
Causal inference, 22, 543, 544
Causal model, 22–23, 27, 416
Censoring, 457
Central limit theorem, 246
chatGPT, 544
Chemical graph, 76
Chemotherapy, 455
Chervonenkis, A., 216
Chi-squared distribution, 304
Class, 214
Classification, 9, 20–22, 28, 30, 32–34, 36–38, 40–42, 46, 47, 49, 51, 54, 58, 61, 154–156, 174, 177, 191–237, 273, 306, 309, 359, 366, 368, 381, 382, 384, 404, 405, 411, 417, 489–492, 497, 500, 506, 509–512, 518, 519, 521, 522, 542
Climate science, 273
Cluster validation, 155
Clustering, 9, 22, 34, 74, 137–161, 489
Coefficient of determination (COD), 281
Coefficients, 153, 277–281, 284–285, 291, 293, 300, 303, 333, 336–339, 345–347, 349, 354, 357, 470, 476, 480, 483, 486
Cohen’s kappa, 39
Collinearity, 285
Complete linkage, 151
Complexity, 450
Computational learning theory (COLT), 490
Computer science, 2
Confidence interval, 124, 128
Conjugate priors, 112
Constant error carousel, 401
Contingency table, 28, 30–32, 37, 41, 45, 48, 49, 94, 157, 158, 191, 195, 196, 261–264, 423
Continuous bag-of-words (CBOW), 83
Contrastive divergence, 387
Controlling the FWER, 433
Convolutional layer, 376
Convolutional neural network, 372
Correlation coefficient, 143
Correlation distance, 141
Correlation matrix, 166
Correlation test, 259
Cosine similarity, 143
Covariance matrix, 166
Covariate, 276
Cox model, 455, 479–481, 487
Cox proportional hazard model, 455
Credible intervals, 118
Cross-validation (CV), 9, 50, 53–59, 66, 69, 70, 231–233, 327–329, 340, 341, 372, 375, 543
Cyclomatic number, 154

D
Dantzig selector, 345
Data, vii, 1–14, 18–19, 24–28, 60, 71–87, 92–104, 138–143, 215–221
Data analysis process, 5
Data cleaning, 93
Data consolidation, 93
Data preprocessing, 93
Data reduction, 93
Data science, vii, 2–7, 9–14, 17, 18, 20, 21, 27, 49, 54, 70–72, 91, 92, 135, 137, 138, 141, 143, 154, 161, 163, 177, 239, 271, 416, 489, 503, 507, 521, 543
Data transformation, 93
Data types, 1, 5, 6, 9, 18, 19, 71–87, 138, 139, 161, 192, 265, 266, 274, 417, 519
Davies-Bouldin index, 159
Decision boundaries, 205
Decision-making, 225
Decision node, 224
Decision surface, 504
Decision tree, 222
Decoding layer, 394
Deep belief network (DBN), 384
Deep feedforward neural networks (D-FFNN), 360, 365–372, 384, 400
Deep learning, 416
Deep neural networks, 20, 22, 333, 360, 361
Deep reinforcement learning, 418
Degree of freedom, 247
Dendrogram, 149
Denoising autoencoder, 392
Descriptive statistics, 92
Diagonal matrix, 166
Diameter of a cluster, 159
Diffusion maps, 164
Digital twin, 543, 544
Dimension reduction, 163
Directed cyclic graph, 363
Dirichlet, 297
Diseasome, 78
Distance, 42, 48, 63, 137–142, 144–152, 154, 155, 159, 160, 200, 202, 212, 214, 217–219, 356, 504, 514, 528, 536, 540
Distance measure, 139
Distance metric, 140
Divisive algorithms, 149
DNA, 73
DNA microarrays, 73
Document frequencies, 80
Domain, 193
Double decent, 543
Dummy variable, 294
Dunn index, 159
Duration analysis, 455
Durbin-Watson test, 286

E
e1071, 221
Early stopping, 234
Economics, 2, 273
Efficient PAC learnability, 494
Efron approximation, 478
Eigenvalue, 166
Eigenvector, 166
Elastic net, 348
Embedded methods, 186
Empirical cumulative distribution function (ECDF), 102
Empirical error, 493
Empirical risk, 493
Empirical risk minimization (ERM), 504
Encoder block, 394
Energy function, 385
Entropy, 186
epoch, 372
Error-complexity curves, 530
Error measures, 9, 24, 29–50, 194–196, 202, 205, 422, 492, 521
Error model, 45
Estimation, 9, 18, 49, 53–61, 69, 92, 98, 104–105, 111, 112, 116–118, 123–130, 134, 148, 206, 212, 213, 231, 244, 268, 276–280, 284, 299, 300, 302, 324, 336, 359, 386, 387, 418, 424, 476–478, 494, 510, 522, 529, 541
Euclidean norm, 334
Euclidian distance, 141
Euclidian space, 220
Event, 83–85, 98, 104, 455–458, 460–463, 469, 475, 476, 478, 486, 498
Event history analysis, 455
Expectation-maximization (EM) algorithm, 9, 21, 92, 129–134
Expected generalization error, 193
Expected out-of-sample error, 524
Experimental design, 417
Explainable AI (XAI), 23, 416–417
Explained sum of squares (ESS), 523
Explained variable, 276
Explanatory variable, 276
Exploratory data analysis (EDA), 7, 9, 74, 92–104, 134, 161
Exponential model, 466
External criteria, 156

F
FactoMineR, 170
Factorization, 168
False-discovery rate (FDR), 36, 421, 423
False-negative rate (FNR), 36
False omission rate (FOR), 36
False-positive rate (FPR), 36
Family-wise error (FWER), 421–424, 429, 433–445, 451, 453
Feature extraction, 164
Features, 138
Feature selection, 184
Feedforward neural network, 362
Few/one-shot learning, 507
Filter methods, 186
Finance, 430
Finite hypothesis space, 499
Finite impulse recurrent network, 363
Fisher information, 125
Fisher-Neyman factorization theorem, 108
Fisher, R.A., 239
Fisher Scoring, 302
Floor function, 114
fMRI, 430
Forget gate, 401
Forward stepwise selection, 317
Fowlkes-Mallows (FM) Index, 157
Friendship network, 77
Frobenius norm, 179
F-score, 157
Fully connected layer, 379
Fundamental errors, 29, 31, 33–38, 42, 45–50, 194–196, 202
Fundamental theorem of statistical learning, 489
G
Gamma distribution, 110
Gaussian graphical model, 426
Gaussian kernel, 177
Gene expression data, 73
Generalization error, 193
Generalized linear models (GLMs), 207
Generative adversarial network, 512
Generative question answering, 543, 544
Gene regulatory network (GRN), 77
Genes, 239
Genome-wide association studies, 430
Genomic data, 9, 72–74
Genomics, 418
Geometric distribution, 110
Gini index, 227
glmnet, 336
Goodness of split, 228
GoogLeNet, 380
Gradient descent, 180
Gradient loss, 397
Graph, 76
Graph CNN, 418
Graph entropy, 153
Graph kernel, 220
Greedy approximation, 187
Greedy optimization, 235
Group LASSO, 352
Growth function, 501

H
Hadamard product, 180
Hamiltonian theory, 27
Haussler, D., 495
Hazard function, 461
Hazard ratio, 472
Heatmap, 74
Heaviside function, 361
Hebbian learning, 416
Hessian matrix, 302
Heterogeneity, 143
Hidden layer, 362
Hierarchical clustering, 149
High-energy physics, 430
Hinge loss, 219
Histogram, 103
h2o, 170
Hochberg correction, 437
Hochreiter, S., 400
Hoeffding’s inequality, 500
Holdout set (HOS), 53–55, 58, 69, 327
Holm correction, 436
Hommel correction, 438
Hommel, G., 427
Homogeneity, 143
Homoscedasticity, 285
Hopfield network, 363
Hotelling’s t-squared test, 258
Hyperbolic tangent, 361
Hypergeometric distribution, 264
Hypergeometric test, 261
Hyperparameter, 372
Hyperplane, 216
Hypothesis space, 493
Hypothesis testing, 239

I
Identity, 139
Impurity function, 227
Indicator function, 194
Inferential model, 416
Input gate, 401
In-sample data, 62
Interactions, 292
Internal criteria, 158
Interquartile range, 99
Interval data, 265
Iris data, 151
iRprop, 387
Isomap, 164

J
Jaccard’s coefficient, 143

K
Kaplan-Meier estimator, 460
Keep gate, 401
Keras, 366
Kernel PCA, 175
Kernel trick, 220
K-fold CV, 53
Kidney data, 103
K-means clustering, 145
K-medoids clustering, 147
K-nearest neighbor classifier, 211
Kullback-Leibler divergence, 180
Kurtosis, 100

L
Lagrange multipliers, 218
Lagrangian, 218
Latent loss, 397
Latent space, 394
Layer, 138, 359, 362, 366, 368, 371, 374, 377, 382, 389, 408
Leaf node, 224
Learnability, 490
Learning algorithm, 494
Learning curves, 537
Least absolute shrinkage and selection operator (LASSO), 333
Least squares error, 277
Leave-one-out CV (LOO-CV), 55
Level of measurement, 86
Leverage point, 289
Likelihood, 110
Likelihood function, 123
Likelihood ratio, 480
Likelihood ratio test, 323
Linear classifier, 205
Linear discriminant analysis, 202
Linear kernel, 177
Linearly separable data, 216
Linear regression, 274
Linkage function, 151
Link function, 297–300, 302, 305, 306, 318
lklaR, 198
L0-norm, 335
L1-norm, 334
L2-norm, 334
Loadings, 169
Loadings of the principal components, 168
Logarithmic likelihood function, 124
Logistic function, 209
Logistic regression, 207
Logit, 101
Log-logistic model, 466
Log-normal model, 467
Log-rank test, 462
Long short-term memory (LSTM), 400
Loss function, 524

M
Machine learning (ML), vii, 1–4, 11, 12, 17, 19, 23, 37, 40, 79, 91, 93, 135, 163, 186, 191, 418, 489, 504, 507–519
Machine learning paradigm, 507
Magnitude-based information index, 155
Mahalanobis kernel, 220
Mallow’s Cp, 314
Mallow’s Cp statistic, 313–314, 328
Manhattan distance, 141
Marginal likelihood, 122
Marketing, 239
MASS, 204
Mathematics, 2–4, 91, 492
Matthews correlation coefficient, 37–39
Maximum a posteriori (MAP), 111
Maximum distance, 141
Maximum likelihood estimation (MLE), 9, 92, 123–129, 134, 198, 299, 300, 302, 386, 542
Maximum norm, 334
Maximum relevance, 187
Max pooling, 379
McCullagh, 297
McCulloch-Pitts neuron, 363
Meaning of life, 4
Mean squared error, 389
Measure of location, 94
Measure of scale, 98
Measures of shape, 93, 99–101
Medicine, 239
Mercer, 220
Meta-analysis, 543
Minimum distance, 141
Minimum redundancy and maximum relevance (MRMR), 188
Minkowski distance, 141
Misclassification cost, 232
mlbench, 170
MNIST, 359
Moby Dick, 79
Mode, 97
Model assessment (MA), 310
Model diagnosis, 522
Model identification, 310
Model selection, 122, 309
Monte Carlo, 118
mtcars, 294
Multi-class error measure, 195
Multicollinearity, 291
Multi-label learning (MLL), 507
Multiple linear regression, 283
Multiple testing corrections (MTC), 421
Multiple testing procedures (MTP), 421
Multi-task learning (MTL), 507
Multivariate analysis, 177
multtest, 426
mutoss, 426
mvgraphnorm, 426
mvtnorm, 426

N
Naive Bayes classifier, 197
Naive Bayesian classifier, 197–203, 237
Natural language processing (NLP), 83, 418
Nearest neighbor classifier, 191, 212, 237
Negative binomial distribution, 110
Negative binomial regression, 318
Negative exponential distribution, 110
Negative predictive value (NPV), 33
Nelder, 297
Nelson-Aalen estimator, 461
Network data, 9, 72, 74–79
Newton-Raphson method, 302
Neyman, J., 239
No free lunch (NFL), 503, 504
Nominal data, 265
Non-convex optimization, 180
Non-hierarchical clustering, 145
Nonlinear classifier, 177
Nonlinearities, 46, 273, 292, 306
Nonlinearly separable data, 218
Nonlinear support vector machines, 219
Non-negative garrote regression, 339
Non-negative matrix factorization (NNMF), 179
Normal distribution, 110
Normalized mutual information, 40–41, 49, 157
Null deviance, 300
Null hypothesis, 242

O
Occam’s razor, 504
Odds, 122
One-class classification, 507
One-hot document (OHD), 79
One-hot encoding (OHE), 79
One-sample test, 256
One standard error rule, 232
Online Mendelian Inheritance in Man (OMIM), 78
Optimization problem, 218
Oracle, 346
Ordinal data, 265
Ordinary least squares, 273
Orthogonal, 169
Outcome space, 193
Outliers, 290
Out-of-sample data, 62
Out-of-sample error, 313, 524, 529
Overdispersion, 302
Overfitting, 177

P
Parameter estimation, 21
Parametric bootstrap, 127
Partial gradients, 180
Partial likelihood, 476
Partition function, 385
Partitioning around medoids (PAM), 148
Part-of-speech (POS), 79
Pearson, E., 177
Peephole LSTM, 403
Percentile, 97
Perceptron, 361
Per comparison error rate, 423
Per family error rate, 423
Performance, 37, 42, 46, 175, 177, 186, 202, 227, 280, 291, 310, 344, 380, 381, 392, 411, 486, 497, 517, 522, 533, 537, 539
Permutation test, 266
PimaIndiansDiabetes, 170
Point estimation, 104–105, 118
Poisson distribution, 109
Poisson regression, 300
Polynomial kernel, 177
Polynomial of degree q, 220
Polynomial regression, 531
Pooling layer, 379
Population distribution, 245
Population mean, 66, 253
Positive predictive value (PPV), 33
Positive regression dependencies, 445
Positive-unlabeled learning, 507
Positivity, 139
Posterior distribution, 110
Posterior predictive density, 121
Power, 251
Precision, 34
Predictive model, 22–23, 27, 509
Predictor, 276
Principal component analysis (PCA), 164
Principal components, 164
Principal component space, 169
Prior, 397
Prior distribution, 110
Prior predictive density, 121
Probabilistic classifier, 216
Probabilistic learnability, 490
Probably approximately correct (PAC), 489
Programming, 3, 4, 9, 11, 12, 63, 269, 345, 422, 425
Property of data, 18
Property of optimization algorithm, 18
Property of the model, 18
Proportion, 97
Proportional hazard model, 471–476
Proportional hazard, 472
Protein, 72
Proteomics data, 74
Pruning, 226
Psychology, 239
p-value, 26, 241, 248–249, 251, 264, 268, 270, 279, 286, 288, 304, 305, 319, 320, 422, 424, 426, 430, 431, 434–442, 444–451, 453, 481, 483
Pythagoras’ theorem, 142

Q
Q-Q plot, 287
QuACN, 155
Quadratic problem, 168
Qualitative variable, 93
Quality of a fit, 285
Quantification, 29, 50, 118, 153, 240, 243, 273, 519, 521
Quantile, 255
Quantitative variable, 93
Quartile, 96
Quasi-Poisson regression, 321

R
Radial base function, 220
Randić index, 154
Rand index, 34
Random classifier, 44
Random variable, 25
Range, 99
Ratio data, 265
Recommender systems, 418
Rectangle learning, 496
Recurrent neural network (RNN), 363
Recursive partitioning, 224
Regressor, 276
Regular exponential class (REC), 109
Regularization, 291
Reliability theory, 455
ReLU, 361
Repeated holdout set, 58
Repeated k-fold CV, 53, 58
Representation learning, 416
Resampling, 326
Resampling methods, 3, 9, 17, 25, 50, 53–70, 326, 327, 541
Resampling without replacement, 57, 60
Resampling with replacement, 57, 60
Residual, 277
Residual deviance, 300
Residual standard error (RSE), 281
Residual sum of squares (RSS), 278, 336, 524
ResNet, 381
Restricted Boltzmann machine, 385
Resubstitution error, 231
Ridge regression, 336
Right censoring, 457
Risk, 493
RNA-seq, 73
rpart, 223

S
Salmon, 430
Sample complexity, 495
Sample covariance, 167
Sample mean, 66
Sample median, 95
Sample size, 92
Sample test error, 530
Sample training error, 530
Sample variance, 98
Sampling distribution, 244
Sampling from a distribution, 63
Sampling layer, 394
Sauer’s lemma, 506
Schmidhuber, J., 398
Schoenfeld residual, 475
Schwarz criterion, 315
Scree plot, 170
Semi-supervised learning, 507
Sensitivity, 33
Shapiro-Wilk test, 288
Shrinkage, 357
Šidák correction, 434
Sigmoid function, 220
Significance level, 247
Silhouette coefficient, 160
Similarity, 20, 74, 83, 137–141, 143, 161, 187, 212, 314, 417, 512, 513, 517
Similarity measure, 139, 140
Simple linear regression, 276
Single linkage, 151
Single-step, 430
Single-step maxT, 441
Single-step minP, 441
Singular value decomposition (SVD), 168
Skewness, 99
Slack variable, 218
Social sciences, 239
Softmax, 361
Spearman’s rank-order correlation, 259
Specificity, 33
Squared cosine, 170
Squared matrix, 76
Standard error, 9, 50, 54, 56–58, 66–68, 98, 233, 252, 254, 279–281, 291, 485
Statistical hypothesis testing, 239
Statistical inference, 91–135, 279
Statistical learning, 490–503, 506, 507
Statistical learning theory (SLT), 490
Statistical thinking, 23
Statistics, 2–4, 7, 9–12, 17, 18, 22, 23, 25, 59, 63, 86, 91–104, 134, 135, 191, 239, 240, 242, 246, 256, 258, 265, 273, 279, 379, 417, 433, 437, 441, 442, 445, 490
Stats, 170
Step-down maxT, 442
Step-down minP, 442
Step-down, 430
Step-up, 430
Stepwise selection, 316
Stochastic gradient descent, 366
Stratified Cox model, 479
Stratified k-fold CV, 58
Stride, 377
Strong control of FWER, 424
Structural risk minimization (SRM), 504
Student’s t-distribution, 257
Student’s t-test, 256
Subsampling, 61
Subset selection, 316, 339
Sufficiency, 107
Sum of squares due to errors (SSE), 524
Sum of squares due to regression (SSR), 523
Sum of squares total (SST), 523
Supervised learning, 18, 19, 29, 191–193, 235, 237, 273, 306, 389, 489, 507, 508, 515, 519, 521
Support vector machine (SVM), 216
Support vectors, 218
Survival analysis, 455
Survival curves, 455, 461–462, 472–475, 481
Survival function, 459
Symmetry, 139

T
Target concept, 497
Task, 193
Term frequency-inverse document frequency (TF-IDF), 79
Terminal node, 231
Test error, 529
Test statistic, 242
Text data, 9, 72, 79–83, 87, 417
Text generation, 404
Text representation, 80
Theta automatic interaction detection (THAID), 222
Time, 72, 84, 103, 131, 183, 377
Time series analysis, 543
Time series forecasting, 404
Time-to-event, 455
Time-to-event data, 9, 72, 83–85, 455, 457, 486
Topological index, 154
Topological information, 154
Topological information content, 154
Total sum of squares (TSS), 523
Training error, 529
Transcriptomics data, 74
Transfer learning, 507
Tree cost-complexity, 231
Trimmed Sample Mean, 95
Triple-negative breast cancer, 456
True model, 529
True-negative rate (TNR), 33
True-positive rate (TPR), 33
t-score, 247
t-transformation, 247
Tumors, 138, 490
Twitter, 78
Two-sample test, 257
Type 1 error, 247
Type 2 error, 250

U
Unbiased estimator, 105
Underfitting, 535
Unitary, 169
Universal approximation theorem, 365
Unsupervised learning, 9, 18, 19, 137, 161, 164, 179, 385, 392, 416, 489, 507
Update rule, 181

V
Vapnik-Chervonenkis (VC), 500
Vapnik, V., 216
Variance inflation factor, 291
Variational autoencoder, 392
VC dimension, 501
Version space, 496
VGGNet, 380
Visual question answering, 543
Vuong test, 322

W
Ward method, 151
Weak control of FWER, 424
Web science, 421
Weibull model, 465
Westfall-Young Procedure, 441
Wiener index, 154
Wilcoxon, 462
Word embedding, 79
word2vec, 83
Wrapper methods, 186

Y
Youden index, 43

Z
Zagreb index, 154
Zero-inflated Poisson model, 320
Zero-padding, 377
z-score, 247
z-transformation, 246