Machine Learning
A Bayesian and Optimization
Perspective
2nd Edition

Sergios Theodoridis
Department of Informatics and Telecommunications
National and Kapodistrian University of Athens
Athens, Greece
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong
Shenzhen, China
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
Copyright © 2020 Elsevier Ltd. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or any information storage and retrieval system, without permission in writing from the
publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our
arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found
at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may
be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described herein. In using such information or methods they should be
mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any
injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or
operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data


A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library

ISBN: 978-0-12-818803-3

For information on all Academic Press publications


visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner


Acquisitions Editor: Tim Pitts
Editorial Project Manager: Charlotte Rowley
Production Project Manager: Paul Prasad Chandramohan
Designer: Greg Harris
Typeset by VTeX
τo σπoινὰκι
For Everything
All These Years
Contents

About the Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi


Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxv
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxvii
CHAPTER 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Historical Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Artificial Intelligence and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Algorithms Can Learn What Is Hidden in the Data . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Typical Applications of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Multimodal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Autonomous Cars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Challenges for the Future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Machine Learning: Major Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Unsupervised and Semisupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Structure and a Road Map of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
CHAPTER 2 Probability and Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Probability and Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.4 Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.5 Transformation of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Examples of Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 Discrete Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Continuous Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.1 First- and Second-Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.2 Stationarity and Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.3 Power Spectral Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.4 Autoregressive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.5 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.5.1 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56


2.5.2 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.6 Stochastic Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Convergence Everywhere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Convergence Almost Everywhere . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Convergence in the Mean-Square Sense . . . . . . . . . . . . . . . . . . . . . . . 62
Convergence in Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Convergence in Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
CHAPTER 3 Learning in Parametric Modeling: Basic Concepts and Directions . . . . . . . . . 67
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Parameter Estimation: the Deterministic Point of View . . . . . . . . . . . . . . . . . . . 68
3.3 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Generative Versus Discriminative Learning . . . . . . . . . . . . . . . . . . . . 78
3.5 Biased Versus Unbiased Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.5.1 Biased or Unbiased Estimation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.6 The Cramér–Rao Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.7 Sufficient Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Inverse Problems: Ill-Conditioning and Overfitting . . . . . . . . . . . . . . . 91
3.9 The Bias–Variance Dilemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.9.1 Mean-Square Error Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.9.2 Bias–Variance Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.10 Maximum Likelihood Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.10.1 Linear Regression: the Nonwhite Gaussian Noise Case . . . . . . . . . . . . 101
3.11 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.11.1 The Maximum a Posteriori Probability Estimation Method . . . . . . . . . 107
3.12 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.13 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.14 Expected Loss and Empirical Risk Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Learnability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.15 Nonparametric Modeling and Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
CHAPTER 4 Mean-Square Error Linear Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.2 Mean-Square Error Linear Estimation: the Normal Equations . . . . . . . . . . . . . . 122
4.2.1 The Cost Function Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.3 A Geometric Viewpoint: Orthogonality Condition . . . . . . . . . . . . . . . . . . . . . . 124
4.4 Extension to Complex-Valued Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127


4.4.1 Widely Linear Complex-Valued Estimation . . . . . . . . . . . . . . . . . . . . 129
4.4.2 Optimizing With Respect to Complex-Valued Variables:
Wirtinger Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.5 Linear Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.6 MSE Linear Filtering: a Frequency Domain Point of View . . . . . . . . . . . . . . . . 136
Deconvolution: Image Deblurring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.7 Some Typical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.7.1 Interference Cancelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.7.2 System Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.7.3 Deconvolution: Channel Equalization . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.8 Algorithmic Aspects: the Levinson and Lattice-Ladder Algorithms . . . . . . . . . 149
Forward and Backward MSE Optimal Predictors . . . . . . . . . . . . . . . . 151
4.8.1 The Lattice-Ladder Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.9 Mean-Square Error Estimation of Linear Models . . . . . . . . . . . . . . . . . . . . . . . 158
4.9.1 The Gauss–Markov Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.9.2 Constrained Linear Estimation: the Beamforming Case . . . . . . . . . . . 162
4.10 Time-Varying Statistics: Kalman Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
CHAPTER 5 Online Learning: the Stochastic Gradient Descent Family of Algorithms . . . . . 179
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.2 The Steepest Descent Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.3 Application to the Mean-Square Error Cost Function . . . . . . . . . . . . . . . . . . . . 184
Time-Varying Step Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.3.1 The Complex-Valued Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
5.4 Stochastic Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Application to the MSE Linear Estimation . . . . . . . . . . . . . . . . . . . . . 196
5.5 The Least-Mean-Squares Adaptive Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 198
5.5.1 Convergence and Steady-State Performance of the LMS in Stationary
Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
5.5.2 Cumulative Loss Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
5.6 The Affine Projection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Geometric Interpretation of APA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Orthogonal Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
5.6.1 The Normalized LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
5.7 The Complex-Valued Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
The Widely Linear LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
The Widely Linear APA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
5.8 Relatives of the LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
The Sign-Error LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
The Least-Mean-Fourth (LMF) Algorithm . . . . . . . . . . . . . . . . . . . . . 215
Transform-Domain LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.9 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218


5.10 Adaptive Decision Feedback Equalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
5.11 The Linearly Constrained LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
5.12 Tracking Performance of the LMS in Nonstationary Environments . . . . . . . . . . 225
5.13 Distributed Learning: the Distributed LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.13.1 Cooperation Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.13.2 The Diffusion LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.13.3 Convergence and Steady-State Performance: Some Highlights . . . . . . 237
5.13.4 Consensus-Based Distributed Schemes . . . . . . . . . . . . . . . . . . . . . . . . 240
5.14 A Case Study: Target Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
5.15 Some Concluding Remarks: Consensus Matrix . . . . . . . . . . . . . . . . . . . . . . . . 243
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
CHAPTER 6 The Least-Squares Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.2 Least-Squares Linear Regression: a Geometric Perspective . . . . . . . . . . . . . . . . 254
6.3 Statistical Properties of the LS Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
The LS Estimator Is Unbiased . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Covariance Matrix of the LS Estimator . . . . . . . . . . . . . . . . . . . . . . . . 257
The LS Estimator Is BLUE in the Presence of White Noise . . . . . . . . 258
The LS Estimator Achieves the Cramér–Rao Bound for White
Gaussian Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Asymptotic Distribution of the LS Estimator . . . . . . . . . . . . . . . . . . . 260
6.4 Orthogonalizing the Column Space of the Input Matrix: the SVD Method . . . . 260
Pseudoinverse Matrix and SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
6.5 Ridge Regression: a Geometric Point of View . . . . . . . . . . . . . . . . . . . . . . . . . 265
Principal Components Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.6 The Recursive Least-Squares Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Time-Iterative Computations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Time Updating of the Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
6.7 Newton’s Iterative Minimization Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
6.7.1 RLS and Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
6.8 Steady-State Performance of the RLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.9 Complex-Valued Data: the Widely Linear RLS . . . . . . . . . . . . . . . . . . . . . . . . 277
6.10 Computational Aspects of the LS Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Cholesky Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Fast RLS Versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
6.11 The Coordinate and Cyclic Coordinate Descent Methods . . . . . . . . . . . . . . . . . 281
6.12 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
6.13 Total Least-Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
Geometric Interpretation of the Total Least-Squares Method . . . . . . . . 291
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296


References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
CHAPTER 7 Classification: a Tour of the Classics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
7.2 Bayesian Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
The Bayesian Classifier Minimizes the Misclassification Error . . . . . . 303
7.2.1 Average Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
7.3 Decision (Hyper)Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
7.3.1 The Gaussian Distribution Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
7.4 The Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
7.5 The Nearest Neighbor Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
7.6 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
7.7 Fisher’s Linear Discriminant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
7.7.1 Scatter Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
7.7.2 Fisher’s Discriminant: the Two-Class Case . . . . . . . . . . . . . . . . . . . . . 325
7.7.3 Fisher’s Discriminant: the Multiclass Case . . . . . . . . . . . . . . . . . . . . . 328
7.8 Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
7.9 Combining Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
No Free Lunch Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
Some Experimental Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
Schemes for Combining Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 335
7.10 The Boosting Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
The AdaBoost Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
The Log-Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
7.11 Boosting Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
CHAPTER 8 Parameter Learning: a Convex Analytic Path . . . . . . . . . . . . . . . . . . . . . . . . 351
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
8.2 Convex Sets and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
8.2.1 Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
8.2.2 Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
8.3 Projections Onto Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
8.3.1 Properties of Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
8.4 Fundamental Theorem of Projections Onto Convex Sets . . . . . . . . . . . . . . . . . . 365
8.5 A Parallel Version of POCS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
8.6 From Convex Sets to Parameter Estimation and Machine Learning . . . . . . . . . . 369
8.6.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
8.6.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
8.7 Infinitely Many Closed Convex Sets: the Online Learning Case . . . . . . . . . . . . 374
8.7.1 Convergence of APSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
8.8 Constrained Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
8.9 The Distributed APSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382


8.10 Optimizing Nonsmooth Convex Cost Functions . . . . . . . . . . . . . . . . . . . . . . . . 384
8.10.1 Subgradients and Subdifferentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
8.10.2 Minimizing Nonsmooth Continuous Convex Loss Functions: the Batch
Learning Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
8.10.3 Online Learning for Convex Optimization . . . . . . . . . . . . . . . . . . . . . 393
8.11 Regret Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Regret Analysis of the Subgradient Algorithm . . . . . . . . . . . . . . . . . . 398
8.12 Online Learning and Big Data Applications: a Discussion . . . . . . . . . . . . . . . . 399
Approximation, Estimation, and Optimization Errors . . . . . . . . . . . . . 400
Batch Versus Online Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
8.13 Proximal Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
8.13.1 Properties of the Proximal Operator . . . . . . . . . . . . . . . . . . . . . . . . . . 407
8.13.2 Proximal Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
8.14 Proximal Splitting Methods for Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 412
The Proximal Forward-Backward Splitting Operator . . . . . . . . . . . . . 413
Alternating Direction Method of Multipliers (ADMM) . . . . . . . . . . . . 414
Mirror Descent Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
8.15 Distributed Optimization: Some Highlights . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
CHAPTER 9 Sparsity-Aware Learning: Concepts and Theoretical Foundations . . . . . . . . . 427
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
9.2 Searching for a Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
9.3 The Least Absolute Shrinkage and Selection Operator (LASSO) . . . . . . . . . . . 431
9.4 Sparse Signal Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
9.5 In Search of the Sparsest Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
The ℓ2 Norm Minimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
The ℓ0 Norm Minimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
The ℓ1 Norm Minimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
Characterization of the ℓ1 Norm Minimizer . . . . . . . . . . . . . . . . . . . . 443
Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
9.6 Uniqueness of the ℓ0 Minimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
9.6.1 Mutual Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
9.7 Equivalence of ℓ0 and ℓ1 Minimizers: Sufficiency Conditions . . . . . . . . . . . . . . 451
9.7.1 Condition Implied by the Mutual Coherence Number . . . . . . . . . . . . . 451
9.7.2 The Restricted Isometry Property (RIP) . . . . . . . . . . . . . . . . . . . . . . . 452
9.8 Robust Sparse Signal Recovery From Noisy Measurements . . . . . . . . . . . . . . . 455
9.9 Compressed Sensing: the Glory of Randomness . . . . . . . . . . . . . . . . . . . . . . . . 456
Compressed Sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
9.9.1 Dimensionality Reduction and Stable Embeddings . . . . . . . . . . . . . . . 458
9.9.2 Sub-Nyquist Sampling: Analog-to-Information Conversion . . . . . . . . 460
9.10 A Case Study: Image Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
CHAPTER 10 Sparsity-Aware Learning: Algorithms and Applications . . . . . . . . . . . . . . . . . 473
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
10.2 Sparsity Promoting Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
10.2.1 Greedy Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
10.2.2 Iterative Shrinkage/Thresholding (IST) Algorithms . . . . . . . . . . . . . . 480
10.2.3 Which Algorithm? Some Practical Hints . . . . . . . . . . . . . . . . . . . . . . 487
10.3 Variations on the Sparsity-Aware Theme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
10.4 Online Sparsity Promoting Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
10.4.1 LASSO: Asymptotic Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
10.4.2 The Adaptive Norm-Weighted LASSO . . . . . . . . . . . . . . . . . . . . . . . . 502
10.4.3 Adaptive CoSaMP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
10.4.4 Sparse-Adaptive Projection Subgradient Method . . . . . . . . . . . . . . . . 505
10.5 Learning Sparse Analysis Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
10.5.1 Compressed Sensing for Sparse Signal Representation
in Coherent Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
10.5.2 Cosparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
10.6 A Case Study: Time-Frequency Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
Gabor Transform and Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
Time-Frequency Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Gabor Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
Time-Frequency Analysis of Echolocation Signals Emitted by Bats . . 519
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
CHAPTER 11 Learning in Reproducing Kernel Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . 531
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
11.2 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
11.3 Volterra, Wiener, and Hammerstein Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
11.4 Cover’s Theorem: Capacity of a Space in Linear Dichotomies . . . . . . . . . . . . . 536
11.5 Reproducing Kernel Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
11.5.1 Some Properties and Theoretical Highlights . . . . . . . . . . . . . . . . . . . . 541
11.5.2 Examples of Kernel Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
11.6 Representer Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
11.6.1 Semiparametric Representer Theorem . . . . . . . . . . . . . . . . . . . . . . . . 550
11.6.2 Nonparametric Modeling: a Discussion . . . . . . . . . . . . . . . . . . . . . . . 551
11.7 Kernel Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
11.8 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
11.8.1 The Linear ε-Insensitive Optimal Regression . . . . . . . . . . . . . . . . . . . 555
11.9 Kernel Ridge Regression Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
11.10 Optimal Margin Classification: Support Vector Machines . . . . . . . . . . . . . . . . . 562
11.10.1 Linearly Separable Classes: Maximum Margin Classifiers . . . . . . . . . 564


11.10.2 Nonseparable Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
11.10.3 Performance of SVMs and Applications . . . . . . . . . . . . . . . . . . . . . . . 574
11.10.4 Choice of Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
11.10.5 Multiclass Generalizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
11.11 Computational Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
11.12 Random Fourier Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
11.12.1 Online and Distributed Learning in RKHS . . . . . . . . . . . . . . . . . . . . . 579
11.13 Multiple Kernel Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
11.14 Nonparametric Sparsity-Aware Learning: Additive Models . . . . . . . . . . . . . . . 582
11.15 A Case Study: Authorship Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
CHAPTER 12 Bayesian Learning: Inference and the EM Algorithm . . . . . . . . . . . . . . . . . . . 595
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
12.2 Regression: a Bayesian Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
12.2.1 The Maximum Likelihood Estimator . . . . . . . . . . . . . . . . . . . . . . . . . 597
12.2.2 The MAP Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
12.2.3 The Bayesian Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
12.3 The Evidence Function and Occam’s Razor Rule . . . . . . . . . . . . . . . . . . . . . . . 605
Laplacian Approximation and the Evidence Function . . . . . . . . . . . . . 607
12.4 Latent Variables and the EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
12.4.1 The Expectation-Maximization Algorithm . . . . . . . . . . . . . . . . . . . . . 611
12.5 Linear Regression and the EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
12.6 Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
12.6.1 Gaussian Mixture Modeling and Clustering . . . . . . . . . . . . . . . . . . . . 620
12.7 The EM Algorithm: a Lower Bound Maximization View . . . . . . . . . . . . . . . . . 623
12.8 Exponential Family of Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . 627
12.8.1 The Exponential Family and the Maximum Entropy Method . . . . . . . 633
12.9 Combining Learning Models: a Probabilistic Point of View . . . . . . . . . . . . . . . 634
12.9.1 Mixing Linear Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
12.9.2 Mixing Logistic Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
CHAPTER 13 Bayesian Learning: Approximate Inference and Nonparametric Models . . . . . 647
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
13.2 Variational Approximation in Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . 648
The Mean Field Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
13.2.1 The Case of the Exponential Family of Probability Distributions . . . . . 653
13.3 A Variational Bayesian Approach to Linear Regression . . . . . . . . . . . . . . . . . . 655
Computation of the Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660
13.4 A Variational Bayesian Approach to Gaussian Mixture Modeling . . . . . . . . . . . 661


13.5 When Bayesian Inference Meets Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
13.6 Sparse Bayesian Learning (SBL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
13.6.1 The Spike and Slab Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671
13.7 The Relevance Vector Machine Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
13.7.1 Adopting the Logistic Regression Model for Classification . . . . . . . . . 672
13.8 Convex Duality and Variational Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
13.9 Sparsity-Aware Regression: a Variational Bound Bayesian Path . . . . . . . . . . . . 681
Sparsity-Aware Learning: Some Concluding Remarks . . . . . . . . . . . . 686
13.10 Expectation Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686
Minimizing the KL Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
The Expectation Propagation Algorithm . . . . . . . . . . . . . . . . . . . . . . . 688
13.11 Nonparametric Bayesian Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
13.11.1 The Chinese Restaurant Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
13.11.2 Dirichlet Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
13.11.3 The Stick Breaking Construction of a DP . . . . . . . . . . . . . . . . . . . . . . 697
13.11.4 Dirichlet Process Mixture Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 698
Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 699
13.11.5 The Indian Buffet Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
13.12 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
13.12.1 Covariance Functions and Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
13.12.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 712
13.12.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716
13.13 A Case Study: Hyperspectral Image Unmixing . . . . . . . . . . . . . . . . . . . . . . . . . 717
13.13.1 Hierarchical Bayesian Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
13.13.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
CHAPTER 14 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 731
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 731
14.2 Monte Carlo Methods: the Main Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
14.2.1 Random Number Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 733
14.3 Random Sampling Based on Function Transformation . . . . . . . . . . . . . . . . . . . 735
14.4 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
14.5 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 743
14.6 Monte Carlo Methods and the EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 745
14.7 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745
14.7.1 Ergodic Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
14.8 The Metropolis Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
14.8.1 Convergence Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 756
14.9 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
14.10 In Search of More Efficient Methods: a Discussion . . . . . . . . . . . . . . . . . . . . . 760
Variational Inference or Monte Carlo Methods . . . . . . . . . . . . . . . . . . 762
14.11 A Case Study: Change-Point Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762


Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
MATLAB® Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
CHAPTER 15 Probabilistic Graphical Models: Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 771
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 771
15.2 The Need for Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
15.3 Bayesian Networks and the Markov Condition . . . . . . . . . . . . . . . . . . . . . . . . . 774
15.3.1 Graphs: Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775
15.3.2 Some Hints on Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 779
15.3.3 d-Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 781
15.3.4 Sigmoidal Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
15.3.5 Linear Gaussian Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 786
15.3.6 Multiple-Cause Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 786
15.3.7 I-Maps, Soundness, Faithfulness, and Completeness . . . . . . . . . . . . . . 787
15.4 Undirected Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
15.4.1 Independencies and I-Maps in Markov Random Fields . . . . . . . . . . . . 790
15.4.2 The Ising Model and Its Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
15.4.3 Conditional Random Fields (CRFs) . . . . . . . . . . . . . . . . . . . . . . . . . . 794
15.5 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
15.5.1 Graphical Models for Error Correcting Codes . . . . . . . . . . . . . . . . . . . 797
15.6 Moralization of Directed Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
15.7 Exact Inference Methods: Message Passing Algorithms . . . . . . . . . . . . . . . . . . 799
15.7.1 Exact Inference in Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799
15.7.2 Exact Inference in Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
15.7.3 The Sum-Product Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
15.7.4 The Max-Product and Max-Sum Algorithms . . . . . . . . . . . . . . . . . . . 809
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 816
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
CHAPTER 16 Probabilistic Graphical Models: Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
16.2 Triangulated Graphs and Junction Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
16.2.1 Constructing a Join Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825
16.2.2 Message Passing in Junction Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
16.3 Approximate Inference Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 830
16.3.1 Variational Methods: Local Approximation . . . . . . . . . . . . . . . . . . . . 831
16.3.2 Block Methods for Variational Approximation . . . . . . . . . . . . . . . . . . 835
16.3.3 Loopy Belief Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 839
16.4 Dynamic Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 842
16.5 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
16.5.1 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
16.5.2 Learning the Parameters in an HMM . . . . . . . . . . . . . . . . . . . . . . . . . 852
16.5.3 Discriminative Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 855
16.6 Beyond HMMs: a Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856


16.6.1 Factorial Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856
16.6.2 Time-Varying Dynamic Bayesian Networks . . . . . . . . . . . . . . . . . . . . 859
16.7 Learning Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 859
16.7.1 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 860
16.7.2 Learning the Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867
CHAPTER 17 Particle Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 871
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 871
17.2 Sequential Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 871
17.2.1 Importance Sampling Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
17.2.2 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 873
17.2.3 Sequential Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 875
17.3 Kalman and Particle Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
17.3.1 Kalman Filtering: a Bayesian Point of View . . . . . . . . . . . . . . . . . . . . 878
17.4 Particle Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 881
17.4.1 Degeneracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885
17.4.2 Generic Particle Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 886
17.4.3 Auxiliary Particle Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
CHAPTER 18 Neural Networks and Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 901
18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 902
18.2 The Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 904
18.3 Feed-Forward Multilayer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 908
18.3.1 Fully Connected Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 912
18.4 The Backpropagation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 913
Nonconvexity of the Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 914
18.4.1 The Gradient Descent Backpropagation Scheme . . . . . . . . . . . . . . . . . 916
18.4.2 Variants of the Basic Gradient Descent Scheme . . . . . . . . . . . . . . . . . 924
18.4.3 Beyond the Gradient Descent Rationale . . . . . . . . . . . . . . . . . . . . . . . 934
18.5 Selecting a Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935
18.6 Vanishing and Exploding Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 938
18.6.1 The Rectified Linear Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 939
18.7 Regularizing the Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 940
Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 943
18.8 Designing Deep Neural Networks: a Summary . . . . . . . . . . . . . . . . . . . . . . . . . 946
18.9 Universal Approximation Property of Feed-Forward Neural Networks . . . . . . . 947
18.10 Neural Networks: a Bayesian Flavor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949
18.11 Shallow Versus Deep Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 950
18.11.1 The Power of Deep Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 951
18.12 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 956


18.12.1 The Need for Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 956
18.12.2 Convolution Over Volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
18.12.3 The Full CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 968
18.12.4 CNNs: the Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 971
18.13 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 976
18.13.1 Backpropagation Through Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 978
18.13.2 Attention and Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 982
18.14 Adversarial Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 985
Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
18.15 Deep Generative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 988
18.15.1 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 988
18.15.2 Pretraining Deep Feed-Forward Networks . . . . . . . . . . . . . . . . . . . . . 991
18.15.3 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 992
18.15.4 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 994
18.15.5 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 995
18.15.6 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1004
18.16 Capsule Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1007
Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1011
18.17 Deep Neural Networks: Some Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 1013
Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1013
Multitask Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1014
Geometric Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1015
Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016
18.18 A Case Study: Neural Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1017
18.19 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1023
Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1029
CHAPTER 19 Dimensionality Reduction and Latent Variable Modeling . . . . . . . . . . . . . . . . 1039
19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1040
19.2 Intrinsic Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1041
19.3 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1041
PCA, SVD, and Low Rank Matrix Factorization . . . . . . . . . . . . . . . . . 1043
Minimum Error Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1045
PCA and Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1045
Orthogonalizing Properties of PCA and Feature Generation . . . . . . . . 1046
Latent Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047
19.4 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1053
19.4.1 Relatives of CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1056
19.5 Independent Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
19.5.1 ICA and Gaussianity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
19.5.2 ICA and Higher-Order Cumulants . . . . . . . . . . . . . . . . . . . . . . . . . . . 1059
19.5.3 Non-Gaussianity and Independent Components . . . . . . . . . . . . . . . . . 1061
19.5.4 ICA Based on Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 1062
19.5.5 Alternative Paths to ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1065


The Cocktail Party Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1066
19.6 Dictionary Learning: the k-SVD Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 1069
Why the Name k-SVD? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1072
Dictionary Learning and Dictionary Identifiability . . . . . . . . . . . . . . . 1072
19.7 Nonnegative Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074
19.8 Learning Low-Dimensional Models: a Probabilistic Perspective . . . . . . . . . . . . 1076
19.8.1 Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1077
19.8.2 Probabilistic PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1078
19.8.3 Mixture of Factors Analyzers: a Bayesian View to Compressed Sensing . . . . . 1082
19.9 Nonlinear Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085
19.9.1 Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085
19.9.2 Graph-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1087
19.10 Low Rank Matrix Factorization: a Sparse Modeling Path . . . . . . . . . . . . . . . . . 1096
19.10.1 Matrix Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1096
19.10.2 Robust PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1100
19.10.3 Applications of Matrix Completion and Robust PCA . . . . . . . . . . . 1101
19.11 A Case Study: FMRI Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1103
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1107
MATLAB® Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1107
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1108
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1117
About the Author

Sergios Theodoridis is professor of machine learning and signal processing with the National and Kapodistrian University of Athens, Athens,
Greece and with the Chinese University of Hong Kong, Shenzhen, China.
He has received a number of prestigious awards, including the 2014 IEEE
Signal Processing Magazine Best Paper Award, the 2009 IEEE Compu-
tational Intelligence Society Transactions on Neural Networks Outstand-
ing Paper Award, the 2017 European Association for Signal Processing
(EURASIP) Athanasios Papoulis Award, the 2014 IEEE Signal Processing
Society Education Award, and the 2014 EURASIP Meritorious Service
Award. He has served as president of EURASIP and vice president for the
IEEE Signal Processing Society. He is a Fellow of EURASIP and a Life Fellow of IEEE. He is the
coauthor of the book Pattern Recognition, 4th edition, Academic Press, 2009 and of the book Introduc-
tion to Pattern Recognition: A MATLAB Approach, Academic Press, 2010.

Preface

Machine learning is a name that is gaining popularity as an umbrella and evolution for methods that
have been studied and developed for many decades in different scientific communities and under differ-
ent names, such as statistical learning, statistical signal processing, pattern recognition, adaptive signal
processing, image processing and analysis, system identification and control, data mining and infor-
mation retrieval, computer vision, and computational learning. The name “machine learning” indicates
what all these disciplines have in common, that is, to learn from data, and then make predictions. What
one tries to learn from data is their underlying structure and regularities, via the development of a
model, which can then be used to provide predictions.
To this end, a number of diverse approaches have been developed, ranging from optimization of cost
functions, whose goal is to optimize the deviation between what one observes from data and what the
model predicts, to probabilistic models that attempt to model the statistical properties of the observed
data.
The goal of this book is to approach the machine learning discipline in a unifying context, by pre-
senting major paths and approaches that have been followed over the years, without giving preference
to a specific one. It is the author’s belief that all of them are valuable to the newcomer who wants to
learn the secrets of this topic, from the applications as well as from the pedagogic point of view. As the
title of the book indicates, the emphasis is on the processing and analysis front of machine learning and
not on topics concerning the theory of learning itself and related performance bounds. In other words,
the focus is on methods and algorithms closer to the application level.
The book is the outgrowth of more than three decades of the author’s experience in research and
teaching various related courses. The book is written in such a way that individual (or pairs of) chapters
are as self-contained as possible. So, one can select and combine chapters according to the focus he/she
wants to give to the course he/she teaches, or to the topics he/she wants to grasp in a first reading. Some
guidelines on how one can use the book for different courses are provided in the introductory chapter.
Each chapter grows by starting from the basics and evolving to embrace more recent advances.
Some of the topics had to be split into two chapters, such as sparsity-aware learning, Bayesian learning,
probabilistic graphical models, and Monte Carlo methods. The book addresses the needs of advanced
graduate, postgraduate, and research students as well as of practicing scientists and engineers whose
interests lie beyond black-box approaches. Also, the book can serve the needs of short courses on spe-
cific topics, e.g., sparse modeling, Bayesian learning, probabilistic graphical models, neural networks
and deep learning.

Second Edition
The first edition of the book, published in 2015, covered advances in the machine learning area up to
2013–2014. These years coincide with the start of a real booming in research activity in the field of deep
learning that really reshaped our related knowledge and revolutionized the field of machine learning.
The main emphasis of the current edition was to, basically, rewrite Chapter 18. The chapter now covers
a review of the field, starting from the early days of the perceptron and the perceptron rule, until
the most recent advances, including convolutional neural networks (CNNs), recurrent neural networks
(RNNs), adversarial examples, generative adversarial networks (GANs), and capsule networks.

Also, the second edition covers in a more extended and detailed way nonparametric Bayesian meth-
ods, such as Chinese restaurant processes (CRPs) and Indian buffet processes (IBPs). It is the author’s
belief that Bayesian methods will gain in importance in the years to come. Of course, only time can
tell whether this will happen or not. However, the author’s feeling is that uncertainty is going to be
a major part of the future models and Bayesian techniques can be, at least in principle, a reasonable
start. Concerning the other chapters, besides the (omnipresent!) typos that have been corrected, changes
have been included here and there to make the text easier to read, thanks to suggestions by students,
colleagues, and reviewers; I am deeply indebted to all of them.
Most of the chapters include MATLAB® exercises, and the related code is freely available from
the book’s companion website. Furthermore, in the second edition, all the computer exercises are also
given in Python together with the corresponding code, which is also freely available via the website
of the book. Finally, some of the computer exercises in Chapter 18 that are related to deep learning,
and which are closer to practical applications, are given in Tensorflow.
The solutions manual as well as lecture slides are available from the book’s website for instructors.
In the second edition, all appendices have been moved to the website associated with the book, and
they are freely downloadable. This was done in an effort to save space in a book that is already more
than 1100 pages. Also, some sections dedicated to methods that were present in various chapters in the
first edition, which I felt do not constitute basic knowledge and current mainstream research topics,
while they were new and “fashionable” in 2015, have been moved, and they can be downloaded from
the companion website of the book.

Instructor site URL:

http://textbooks.elsevier.com/web/Manuals.aspx?isbn=9780128188033

Companion Site URL:


https://www.elsevier.com/books-and-journals/book-companion/9780128188033
Acknowledgments

Writing a book is an effort on top of everything else that must keep running in parallel. Thus, writing
is basically an early morning, after five, and over the weekends and holidays activity. It is a big effort
that requires dedication and persistence. This would not be possible without the support of a number of
people—people who helped in the simulations, in the making of the figures, in reading chapters, and
in discussing various issues concerning all aspects, from proofs to the structure and the layout of the
book.
First, I would like to express my gratitude to my mentor, friend, and colleague Nicholas Kaloupt-
sidis, for this long-lasting and fruitful collaboration.
The cooperation with Kostas Slavakis over the more recent years has been a major source of inspi-
ration and learning and has played a decisive role for me in writing this book.
I am indebted to the members of my group, and in particular to Yannis Kopsinis, Pantelis Bouboulis,
Simos Chouvardas, Kostas Themelis, George Papageorgiou, Charis Georgiou, Christos Chatzichristos,
and Emanuel Morante. They were next to me the whole time, especially during the difficult final stages
of the completion of the manuscript. My colleagues Aggelos Pikrakis, Kostas Koutroumbas, Dimitris
Kosmopoulos, George Giannakopoulos, and Spyros Evaggelatos gave a lot of their time for discussions,
helping in the simulations and reading chapters.
Without my two sabbaticals during the spring semesters of 2011 and 2012, I doubt I would have
ever finished this book. Special thanks go to all my colleagues in the Department of Informatics and
Telecommunications of the National and Kapodistrian University of Athens.
During my sabbatical in 2011, I was honored to be a holder of an Excellence Chair in Carlos III
University of Madrid and spent the time with the group of Anibal Figuieras-Vidal. I am indebted to
Anibal for his invitation and all the fruitful discussions and the bottles of excellent red Spanish wine we
had together. Special thanks go to Jerónimo Arenas-García and Antonio Artés-Rodríguez, who have
also introduced me to aspects of traditional Spanish culture.
During my sabbatical in 2012, I was honored to be an Otto Mønsted Guest Professor at the Technical
University of Denmark with the group of Lars Kai Hansen. I am indebted to him for the invitation and
our enjoyable and insightful discussions, as well as his constructive comments on chapters of the book
and the visits to the Danish museums on weekends. Also, special thanks go to Morten Mørup and the
late Jan Larsen for the fruitful discussions.
The excellent research environment of the Shenzhen Research Institute of Big Data of the Chinese
University of Hong Kong ignited the spark and gave me the time to complete the second edition of the
book. I am deeply indebted to Tom Luo, who offered me this opportunity and also introduced me to
the secrets of Chinese cooking.
A number of colleagues were kind enough to read and review chapters and parts of the book
and come back with valuable comments and criticism. My sincere thanks go to Tulay Adali, Kostas
Berberidis, Jim Bezdek, Soterios Chatzis, Gustavo Camps-Valls, Rama Chellappa, Taylan Cemgil
and his students, Petar Djuric, Paulo Diniz, Yannis Emiris, Mario Figuieredo, Georgios Giannakis,
Mark Girolami, Dimitris Gunopoulos, Alexandros Katsioris, Evaggelos Karkaletsis, Dimitris Katselis,
Athanasios Liavas, Eleftherios Kofidis, Elias Koutsoupias, Alexandros Makris, Dimitris Manatakis,

Elias Manolakos, Petros Maragos, Francisco Palmieri, Jean-Christophe Pesquet, Bhaskar Rao, George
Retsinas, Ali Sayed, Nicolas Sidiropoulos, Paris Smaragdis, Isao Yamada, Feng Yin, and Zhilin Zhang.
Finally, I would like to thank Tim Pitts, the Editor at Academic Press, for all his help.
Notation

I have made an effort to keep a consistent mathematical notation throughout the book. Although every
symbol is defined in the text prior to its use, it may be convenient for the reader to have the list of major
symbols summarized together. The list is presented below:
• Vectors are denoted with boldface letters, such as x.
• Matrices are denoted with capital letters, such as A.
• The determinant of a matrix is denoted as det{A}, and sometimes as |A|.
• A diagonal matrix with elements a1, a2, . . . , al in its diagonal is denoted as A = diag{a1, a2, . . . , al}.
• The identity matrix is denoted as I .
• The trace of a matrix is denoted as trace{A}.
• Random variables are denoted with roman fonts, such as x, and their corresponding values with
mathmode letters, such as x.
• Similarly, random vectors are denoted with roman boldface, such as x, and the corresponding values
as x. The same is true for random matrices, denoted as X and their values as X.
• Probability values for discrete random variables are denoted by capital P , and probability density
functions (PDFs), for continuous random variables, are denoted by lower case p.
• The vectors are assumed to be column-vectors. In other words,
$$
\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_l \end{bmatrix}, \qquad \text{or} \qquad \boldsymbol{x} = \begin{bmatrix} x(1) \\ x(2) \\ \vdots \\ x(l) \end{bmatrix}.
$$

That is, the ith element of a vector can be represented either with a subscript, xi, or as x(i).
• Matrices are written as
$$
X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1l} \\ \vdots & \vdots & \ddots & \vdots \\ x_{l1} & x_{l2} & \cdots & x_{ll} \end{bmatrix}, \qquad \text{or} \qquad X = \begin{bmatrix} X(1,1) & X(1,2) & \cdots & X(1,l) \\ \vdots & \vdots & \ddots & \vdots \\ X(l,1) & X(l,2) & \cdots & X(l,l) \end{bmatrix}.
$$

• Transposition of a vector is denoted as $x^T$ and the Hermitian transposition as $x^H$.
• Complex conjugation of a complex number is denoted as $x^*$, and also $\sqrt{-1} := j$. The symbol “:=” denotes definition.
• The sets of real, complex, integer, and natural numbers are denoted as R, C, Z, and N, respectively.
• Sequences of numbers (vectors) are denoted as $x_n$ ($\boldsymbol{x}_n$) or $x(n)$ ($\boldsymbol{x}(n)$), depending on the context.
• Functions are denoted with lower case letters, e.g., f , or in terms of their arguments, e.g., f (x) or
sometimes as f (·), if no specific argument is used, to indicate a function of a single argument, or
f (·, ·) for a function of two arguments and so on.
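Since the computer exercises of the book are also provided in Python, the following short sketch is added here purely as an illustration (it is not part of the book’s notation list) of how some of these conventions map onto NumPy objects; the array names x, X, and A are arbitrary.

import numpy as np

# A vector x with elements x_1, ..., x_l; reshaping makes the column
# orientation explicit when it matters.
x = np.array([1.0, 2.0, 3.0])
x_col = x.reshape(-1, 1)            # l x 1 column vector

# A matrix X with elements X(i, j) (0-based indices in code).
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Transposition X^T and Hermitian transposition X^H.
Xt = X.T
Xh = X.conj().T                     # identical to X.T for real-valued X

# diag{a_1, ..., a_l}, the identity matrix I, trace{A}, and det{A}.
A = np.diag([1.0, 2.0, 3.0])
I = np.eye(3)
tr_A = np.trace(A)
det_A = np.linalg.det(A)

The same quantities are directly available in MATLAB® via diag, eye, trace, and det.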

CHAPTER 1
INTRODUCTION
CONTENTS
1.1 The Historical Context................................................................................................ 1
1.2 Artificial Intelligence and Machine Learning..................................................................... 2
1.3 Algorithms Can Learn What Is Hidden in the Data............................................................... 4
1.4 Typical Applications of Machine Learning........................................................................ 6
Speech Recognition .......................................................................................... 6
Computer Vision............................................................................................... 6
Multimodal Data .............................................................................................. 6
Natural Language Processing ............................................................................... 7
Robotics ........................................................................................................ 7
Autonomous Cars ............................................................................................. 7
Challenges for the Future.................................................................................... 8
1.5 Machine Learning: Major Directions............................................................................... 8
1.5.1 Supervised Learning .................................................................................. 8
Classification .................................................................................................. 9
Regression..................................................................................................... 11
1.6 Unsupervised and Semisupervised Learning ..................................................................... 11
1.7 Structure and a Road Map of the Book ............................................................................ 12
References................................................................................................................... 16

1.1 THE HISTORICAL CONTEXT


During the period that covers, roughly, the last 250 years, humankind has lived and experienced three
transforming revolutions, which have been powered by technology and science. The first industrial
revolution was based on the use of water and steam and its origins are traced to the end of the 18th
century, when the first organized factories appeared in England. The second industrial revolution was
powered by the use of electricity and mass production, and its “birth” is traced back to around the turn
of the 20th century. The third industrial revolution was fueled by the use of electronics, information
technology, and the adoption of automation in production. Its origins coincide with the end of the
Second World War.
Although difficult for humans, including historians, to put a stamp on the age in which they them-
selves live, more and more people are claiming that the fourth industrial revolution has already started
and is fast transforming everything that we know and learned to live with so far. The fourth industrial
revolution builds upon the third one and is powered by the fusion of a number of technologies, e.g.,
computers and communications (internet), and it is characterized by the convergence of the physical,
digital, and biological spheres.
The terms artificial intelligence (AI) and machine learning are used and spread more and more to
denote the type of automation technology that is used in the production (industry), in the distribution of
goods (commerce), in the service sector, and in our economic transactions (e.g., banking). Moreover,
these technologies affect and shape the way we socialize and interact as humans via social networks,
and the way we entertain ourselves, involving games and cultural products such as music and movies.
A distinct qualitative difference of the fourth, compared to the previous industrial revolutions, is
that, before, it was the manual skills of humans that were gradually replaced by “machines.” In the
one that we are currently experiencing, mental skills are also replaced by “machines.” We now have
automatic answering software that runs on computers, fewer people are serving us in banks, and many
jobs in the service sector have been taken over by computers and related software platforms. Soon, we
are going to have cars without drivers and drones for deliveries. At the same time, new jobs, needs,
and opportunities appear and are created. The labor market is fast changing and new competences and
skills are and will be required in the future (see, e.g., [22,23]).
At the center of this historical happening, as one of the key enabling technologies, lies a discipline
that deals with data and whose goal is to extract information and related knowledge that is hidden in
it, in order to make predictions and, subsequently, take decisions. That is, the goal of this discipline is
to learn from data. This is analogous to what humans do in order to reach decisions. Learning through
the senses, personal experience, and the knowledge that propagates from generation to generation is
at the heart of human intelligence. Also, at the center of any scientific field lies the development of
models (often called theories) in order to explain the available experimental evidence. In other words,
data comprise a major source of learning.

1.2 ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING


The title of the book refers to machine learning, although the term artificial intelligence is used more
and more, especially in the media but also by some experts, to refer to any type of algorithms and
methods that perform tasks that traditionally required human intelligence. Being aware that definitions
of terms can never be exact and there is always some “vagueness” around their respective meanings,
I will still attempt to clarify what I mean by machine learning and in which aspects this term means
something different from AI. No doubt, there may be different views on this.
Although the term machine learning was popularized fairly recently, as a scientific field it is an
old one, whose roots go back to statistics, computer science, information theory, signal processing,
and automatic control. Examples of some related names from the past are statistical learning, pattern
recognition, adaptive signal processing, system identification, image analysis, and speech recognition.
What all these disciplines have in common is that they process data, develop models that are data-
adaptive, and subsequently make predictions that can lead to decisions. Most of the basic theories and
algorithmic tools that are used today had already been developed and known before the dawn of this
century. With a “small” yet important difference: the available data, as well as the computer power
prior to 2000, were not enough to use some of the more elaborate and complex models that had been
developed. The terrain started changing after 2000, in particular around 2010. Large data sets were
gradually created and the computer power became affordable to allow the use of more complex mod-
els. In turn, more and more applications adopted such algorithmic techniques. “Learning from data”
became the new trend and the term machine learning prevailed as an umbrella for such techniques.
Moreover, the big difference was made with the use and “rediscovery” of what is today known
as deep neural networks. These models offered impressive predictive accuracies that had never been
achieved by previous models. In turn, these successes paved the way for the adoption of such models in
a wide range of applications and also ignited intense research, and new versions and models have been
proposed. These days, another term that is catching up is “data science,” indicating the emphasis on
how one can develop robust machine learning and computational techniques that deal efficiently with
large-scale data.
However, the main rationale, which runs along the spine of all the methods that come under the machine
learning umbrella, remains the same and has been around for many decades. The main concept is
to estimate a set of parameters that describe the model, using the available data and, in the sequel,
to make predictions based on low-level information and signals. One may easily argue that there is
not much intelligence built in such approaches. No doubt, deep neural networks involve much more
“intelligence” than their predecessors. They have the potential to optimize the representation of their
low-level input information to the computer.
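As a minimal illustration of this rationale (a sketch added for this text rather than an example from the book; the data, the linear model, and the variable names are assumptions made only for the illustration), the following Python/NumPy snippet estimates the parameters of a simple linear model from noisy observations via least-squares and then uses the fitted model for prediction.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: outputs generated by a linear model (unknown to
# the learner) plus noise.
N = 200
X = rng.normal(size=(N, 3))                    # input observations
theta_true = np.array([0.5, -1.2, 2.0])        # parameters to be "learned"
y = X @ theta_true + 0.1 * rng.normal(size=N)  # observed outputs

# Learning step: estimate the parameters that minimize the squared deviation
# between what the model predicts and what is actually observed.
theta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Prediction step: use the learned parameters on a previously unseen input.
x_new = np.array([1.0, 0.0, -1.0])
y_pred = x_new @ theta_hat
print(theta_hat, y_pred)

Deep networks replace the fixed linear model with a learned, multilayer nonlinear one, but the same estimate-then-predict pattern underlies them as well.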
The term “representation” refers to the way in which related information that is hidden in the input
data is quantified/coded so that it can be subsequently processed by a computer. In the more technical
jargon, each piece of such information is known as a feature (see also Section 1.5.1). As discussed in
detail in Chapter 18, where neural networks (NNs) are defined and presented in detail, what makes these
models distinctly different from other data learning methods is their multilayer structure. This allows
for the “building” up of a hierarchy of representations of the input information at various abstraction
levels. Every layer builds upon the previous one and the higher in hierarchy, the more abstract the
obtained representation is. This structure offers to neural networks a significant performance advantage
over alternative models, which restrict themselves to a single representation layer. Furthermore, this
single-level representation was rather hand-crafted and designed by the users, in contrast to the deep
networks that “learn” the representation layers from the input data via the use of optimality criteria.
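To make the idea of stacked representation layers slightly more concrete, here is a small NumPy sketch of a forward pass through a toy feed-forward network (again an illustration added here, not code from the book; the layer sizes, the ReLU nonlinearity, and the random, untrained weights are assumptions): each layer transforms the representation produced by the layer below it into a new, more abstract one.

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Elementwise rectified linear unit (see Section 18.6.1).
    return np.maximum(z, 0.0)

# Raw, low-level input, e.g., flattened pixel intensities.
x = rng.normal(size=64)

# Each layer maps the previous representation to the next one. In a real
# network the weights W and biases b are learned by optimizing a cost
# function (backpropagation, Chapter 18); here they are placeholders.
layer_sizes = [64, 32, 16, 8]
h = x
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_out, n_in))
    b = np.zeros(n_out)
    h = relu(W @ h + b)      # representation at the next abstraction level

print(h.shape)               # (8,): the most abstract representation of x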
Yet, in spite of the previously stated successes, I share the view that we are still very far from what
an intelligent machine should be. For example, once trained (estimating the parameters) on one data
set, which has been developed for a specific task, it is not easy for such models to generalize to other
tasks. Although, as we are going to see in Chapter 18, advances in this direction have been made, we are
still very far from what human intelligence can achieve. When a child sees one cat, it readily recognizes
another one, even if this other cat has a different color or if it turns around. Current machine learning
systems need thousands of images with cats, in order to be trained to “recognize” one in an image. If a
human learns to ride a bike, it is very easy to transfer this knowledge and learn to ride a motorbike or
even to drive a car. Humans can easily transfer knowledge from one task to another, without forgetting
the previous one. In contrast, current machine learning systems lack such a generalization power and
tend to forget the previous task once they are trained to learn a new one. This is also an open field of
research, where advances have also been reported.
Furthermore, machine learning systems that employ deep networks can even achieve superhuman
prediction accuracies on data similar to those with which they have been trained. This is a significant
achievement, not to be underestimated, since such techniques can efficiently be used for dedicated
jobs; for example, to recognize faces, to recognize the presence of various objects in photographs, and
also to annotate images and produce text that is related to the content of the image. They can recognize
4 CHAPTER 1 INTRODUCTION

speech, translate text from one language to another, detect which music piece is currently playing in the
bar, and whether the piece belongs to the jazz or to the rock musical genre. At the same time, they can
be fooled by carefully constructed examples, known as adversarial examples, in a way that no human
would be fooled to produce a wrong prediction (see Chapter 18).
Concerning AI, the term “artificial intelligence” was first coined by John McCarthy in 1956 when
he organized the first dedicated conference (see, e.g., [20] for a short history). The concept at that time,
which still remains a goal, was whether one can build an intelligent machine, realized on software and
hardware, that can possess human-like intelligence. In contrast to the field of machine learning, the
concept for AI was not to focus on low-level information processing with emphasis on predictions, but
on the high-level cognitive capabilities of humans to reason and think. No doubt, we are still very far
from this original goal. Predictions are, indeed, part of intelligence. Yet, intelligence is much more than
that. Predictions are associated with what we call inductive reasoning. Yet what really differentiates
human from the animals intelligence is the power of the human mind to form concepts and create
conjectures for explaining data and more general the World in which we live. Explanations comprise
a high-level facet of our intelligence and constitute the basis for scientific theories and the creation of
our civilization. They are assertions concerning the “why” ’s and the “how” ’s related to a task, e.g.,
[5,6,11].
To talk about AI, at least as it was conceived by pioneers such as Alan Turing [16], systems should
have built-in capabilities for reasoning and giving meaning, e.g., in language processing, to be able
to infer causality, to model efficient representations of uncertainty, and, also, to pursue long-term
goals [8]. Possibly, towards achieving these challenging goals, we may have to understand and imple-
ment notions from the theory of mind, and also build machines that implement self-awareness. The
former psychological term refers to the understanding that others have their own beliefs and intentions
that justify their decisions. The latter refers to what we call consciousness. As a last point, recall that
human intelligence is closely related to feelings and emotions. As a matter of fact, the latter seem
to play an important part in the creative mental power of humans (e.g., [3,4,17]). Thus, in this more
theoretical perspective AI still remains a vision for the future.
The previous discussion should not be taken as an attempt to get involved with philosophical the-
ories concerning the nature of human intelligence and AI. These topics have comprised a field in their own right for more than 60 years, one that is much beyond the scope of this book. My aim was to make the newcomer
in the field aware of some views and concerns that are currently being discussed.
On the more practical front, in the early years, the term AI was used to refer to techniques built
around knowledge-based systems that sought to hard-code knowledge in terms of formal languages,
e.g., [13]. Computer “reasoning” was implemented via a set of logical inference rules. In spite of the
early successes, such methods seem to have reached a limit, see, e.g., [7]. It was the alternative path of
machine learning, via learning from data, that gave a real push into the field. These days, the term AI
is used as an umbrella to cover all methods and algorithmic approaches that are related to the machine
intelligence discipline, with machine learning and knowledge-based techniques being parts of it.
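For contrast with the data-driven path, the following toy Python sketch (an illustration added here; the facts and rules are invented for the example) conveys the flavor of hard-coded knowledge combined with forward chaining over logical inference rules.

# Hand-coded knowledge base: known facts and if-then rules.
facts = {"has_feathers", "lays_eggs"}
rules = [
    ({"has_feathers", "lays_eggs"}, "is_bird"),
    ({"is_bird"}, "can_fly"),      # brittle: exceptions must be coded by hand
]

# Forward chaining: repeatedly apply every rule whose premises are satisfied,
# until no new conclusion can be derived.
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(facts)   # now also contains 'is_bird' and 'can_fly'

The brittleness of such hand-crafted rules, whose exceptions must themselves be coded by hand, is one reason the learning-from-data path eventually prevailed.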

1.3 ALGORITHMS CAN LEARN WHAT IS HIDDEN IN THE DATA


It has already been emphasized that data lie at the heart of machine learning systems. Data are the
beginning. It is the information hidden in the data, in the form of underlying regularities, correlations,
the bone from half a calf’s head with the skin on (the butcher will do
this if desired), wash, roll, and bind it with a bit of tape or twine, and
lay it into a stewpan, with the bones and tongue; cover the whole
with the beef stock, and stew it for an hour and a half; then lift it into
a deep earthen pan and let it cool in the liquor, as this will prevent
the edges from becoming dry or discoloured. Take it out before it is
quite cold; strain, and skim all the fat carefully from the stock; and
heat five pints in a large clean saucepan, with the head cut into small
thick slices or into inch-squares. As quite the whole will not be
needed, leave a portion of the fat, but add every morsel of the skin to
the soup, and of the tongue also. Should the first of these not be
perfectly tender, it must be simmered gently till it is so; then stir into
the soup from six to eight ounces of fine rice-flour mixed with a
quarter-teaspoonful of cayenne, twice as much freshly pounded
mace, half a wineglassful of mushroom catsup,[31] and sufficient
cold broth or water to render it of the consistence of batter; boil the
whole from eight to ten minutes; take off the scum, and throw in two
glasses of sherry; dish the soup and put into the tureen some
delicately and well fried forcemeat-balls made by the receipt No. 1,
2, or 3, of Chapter VIII. A small quantity of lemon-juice or other acid
can be added at pleasure. The wine and forcemeat-balls may be
omitted, and the other seasonings of the soup a little heightened. As