Introduction to Data Science
A Python Approach to Concepts, Techniques and Applications
Second Edition
Undergraduate Topics in Computer Science
Series Editor
Ian Mackie, University of Sussex, Brighton, UK
Advisory Editors
Samson Abramsky, Department of Computer Science, University of Oxford, Oxford, UK
Chris Hankin, Department of Computing, Imperial College London, London, UK
Mike Hinchey, Lero—The Irish Software Research Centre, University of Limerick, Limerick, Ireland
Dexter C. Kozen, Department of Computer Science, Cornell University, Ithaca, NY, USA
Hanne Riis Nielson, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
Steven S. Skiena, Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
Iain Stewart, Department of Computer Science, Durham University, Durham, UK
Joseph Migga Kizza, Engineering and Computer Science, University of Tennessee at Chattanooga, Chattanooga, TN, USA
Roy Crole, School of Computing and Mathematics Sciences, University of Leicester, Leicester, UK
Elizabeth Scott, Department of Computer Science, Royal Holloway University of London, Egham, UK
‘Undergraduate Topics in Computer Science’ (UTiCS) delivers high-quality
instructional content for undergraduates studying in all areas of computing and
information science. From core foundational and theoretical material to final-year
topics and applications, UTiCS books take a fresh, concise, and modern approach
and are ideal for self-study or for a one- or two-semester course. The texts
are authored by established experts in their fields, reviewed by an international
advisory board, and contain numerous examples and problems, many of which
include fully worked solutions.
The UTiCS concept centers on high-quality, ideally and generally quite concise
books in softback format. For advanced undergraduate textbooks that are likely
to be longer and more expository, Springer continues to offer the highly regarded
Texts in Computer Science series, to which we refer potential authors.
Laura Igual · Santi Seguí
Introduction to Data Science
A Python Approach to Concepts, Techniques and Applications
Second Edition
Laura Igual
Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain

Santi Seguí
Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain

With Contribution by

Jordi Vitrià, Universitat de Barcelona, Barcelona, Spain
Eloi Puertas, Universitat de Barcelona, Barcelona, Spain
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

In this era, where a huge amount of information from different fields is gathered and stored, its analysis and the extraction of value have become one of the most attractive tasks for companies and society in general. Designing solutions to the new questions that have emerged from data requires multidisciplinary teams. Computer scientists, statisticians, mathematicians, biologists, journalists, and sociologists, as well as many others, are now working together to extract knowledge from data. This new interdisciplinary field is called data science. The pipeline of any data science project goes through asking the right questions, gathering data, cleaning data, generating hypotheses, making inferences, visualizing data, assessing solutions, and so on.
Target Audiences
Parts of the presented materials have been used in the postgraduate course on Data Science and Big Data at the Universitat de Barcelona. All contributing authors are involved in this course.
This book can be used in any introductory data science course. The problem-based approach adopted to introduce new concepts should be useful for beginners. The implemented code solutions for the different problems are a good set of exercises for students. Moreover, this code can serve as a baseline when students face bigger projects.
Supplemental Resources
This book is accompanied by a set of IPython Notebooks containing all the code necessary to solve the practical cases of the book. The Notebooks can be found in the following GitHub repository: https://github.com/DataScienceUB/introduction-datascience-python-book.
Acknowledgments
Contents

4 Statistical Inference
4.1 Introduction
4.2 Statistical Inference: The Frequentist Approach
4.3 Measuring the Variability in Estimates
4.3.1 Point Estimates
4.3.2 Confidence Intervals
4.4 Hypothesis Testing
4.4.1 Testing Hypotheses Using Confidence Intervals
4.4.2 Testing Hypotheses Using p-Values
4.5 But, Is the Effect E Real?
4.6 Conclusions
References
5 Supervised Learning
5.1 Introduction
5.2 The Problem
5.3 First Steps
5.4 What Is Learning?
5.5 Learning Curves
5.6 Training, Validation and Test
5.7 Two Learning Models
5.7.1 Generalities Concerning Learning Models
5.7.2 Support Vector Machines
5.7.3 Random Forest
5.8 Ending the Learning Process
5.9 A Toy Business Case
5.10 Conclusion
References
6 Regression Analysis
6.1 Introduction
6.2 Linear Regression
6.2.1 Simple Linear Regression Model
6.2.2 Model Evaluation
6.2.3 Practical Case 1: Sea Ice Data and Climate Change
6.2.4 Polynomial Regression Model
6.2.5 Regularization and Sparse Models
6.2.6 Practical Case 2: Boston Housing Data and Price Prediction
6.3 Logistic Regression
6.3.1 Practical Case 3: Winning or Losing Football Team
6.4 Conclusions
References
Index
Authors and Contributors
Contributors
Dr. Jordi Vitrià is a Full Professor at the Department of Mathematics and Com-
puter Science at the Universitat de Barcelona. He received his Ph.D. degree from
the Universitat Autònoma de Barcelona in 1990. Dr. Jordi Vitrià has published
more than 100 papers in SCI-indexed journals and has more than 30 years of expe-
rience in working on Computer Vision, Machine Learning, Causal Inference, and
Artificial Intelligence and their applications to several fields. He is now the leader of the "Data Science Group at Universitat de Barcelona", a multidisciplinary technology transfer unit that brings the results of scientific and technological research to the market and to society in general.
Dr. Eloi Puertas is an Assistant Professor in the Department of Mathematics and Computer Science at the Universitat de Barcelona. He received his Computer Science Engineering degree from the Universitat Autònoma de Barcelona (Spain) in 2002.
He received his Ph.D. degree from the Universitat de Barcelona (Spain) in 2014.
His areas of interest include artificial intelligence, software engineering, and data
science.
Dr. Petia Radeva is a Full Professor at the Universitat de Barcelona. She graduated
in Applied Mathematics and Computer Science in 1989 at the University of Sofia,
Bulgaria, and received her Ph.D. degree on Computer Vision for Medical Imag-
ing in 1998 from the Universitat Autònoma de Barcelona, Spain. She has been an ICREA Academia Researcher since 2015 and is head of the Consolidated Research Group "Artificial Intelligence and Biomedical Applications". Her present research interests include the development of learning-based approaches for computer vision, deep learning, data-centric data analysis, food data analysis, egocentric vision, and
data science.
Dr. Oriol Pujol is a Full Professor at the Department of Mathematics and Com-
puter Science at the Universitat de Barcelona. He received his Ph.D. degree from
the Universitat Autònoma de Barcelona (Spain) in 2004 for his work in machine
learning and computer vision. His areas of interest include machine learning,
computer vision, and data science.
Dr. Sergio Escalera is a Full Professor at the Department of Mathematics and
Computer Science at the Universitat de Barcelona. He received his Computer Science Engineering degree from the Universitat Autònoma de Barcelona (Spain) in 2003 and his Ph.D. degree from the Universitat Autònoma de Barcelona (Spain)
in 2008. His research interests include, among others, statistical pattern recogni-
tion and visual object recognition, with special interest in behavior analysis from
multi-modal data.
Francesc Dantí is an adjunct professor and system administrator in the Department of Mathematics and Computer Science at the Universitat de Barcelona. He received his Computer Science Engineering degree from the Universitat Oberta de Catalunya (Spain). His
particular areas of interest are HPC and grid computing, parallel computing, and
cybersecurity. Francesc Dantí is coauthor of Chap. 2.
1 Introduction to Data Science
You have, no doubt, already experienced data science in several forms. When you are
looking for information on the web by using a search engine or asking your mobile
phone for directions, you are interacting with data science products. Data science
has been behind resolving some of our most common daily tasks for several years.
Data science involves the application of scientific methods, algorithms, and sys-
tems to extract insights and knowledge from large volumes of data. It encompasses
various disciplines such as mathematics, statistics, computer science, and domain
expertise to analyze, interpret, and make informed decisions based on data.
Most of the scientific methods that power data science are not new and they have
been out there, waiting for applications to be developed, for a long time. Statistics is
an old science that stands on the shoulders of eighteenth-century giants such as Pierre
Simon Laplace (1749–1827) and Thomas Bayes (1701–1761). Machine learning has
made significant progress and can be considered a well-established discipline. While
it is relatively younger compared to some other branches of science and technology,
machine learning has gained prominence and achieved remarkable advancements in
recent years. Computer science changed our lives several decades ago and continues
to do so; but it cannot be considered new.
While data science itself encompasses scientific knowledge and methodologies,
its novelty and impact on society are indeed rooted in a disruptive change brought
about by the evolution of technology, specifically the concept of datification. Dat-
ification is the process of rendering into data aspects of the world that have never
been quantified before. At the personal level, the list of datified concepts is very long
and still growing: business networks, the lists of books we are reading, the films we
enjoy, the food we eat, our physical activity, our purchases, our driving behavior, and
so on. Even our thoughts are datified when we publish them on our favorite social
network. At the business level, companies are datifying semi-structured data that
were previously discarded: web activity logs, computer network activity, machinery
signals, etc. Nonstructured data, such as written reports, e-mails or voice recordings,
are now being stored not only for archive purposes but also to be analyzed.
The rise of datification and the vast availability of data offer significant benefits
and opportunities. However, it is crucial to acknowledge and address the potential
dangers and challenges that accompany this phenomenon. Some of the concerns
include privacy and security risks, data bias and discrimination, lack of transparency
and accountability, and the perpetuation of social and economic disparities.
However, datification is not the only ingredient of the data science revolution.
The other ingredient is the democratization of data analysis. Large companies such
as Google, IBM, or SAS were the only players in this field when data science had
no name. At the beginning of the century, the huge computational resources of
those companies allowed them to take advantage of datification by using analytical
techniques to develop innovative products and even to take decisions about their own
business.
Currently, there is a decreasing analytical divide between companies utilizing
advanced data analytics and the rest of the world, including both companies and
individuals. This trend can be attributed to the widespread adoption of open-source
tools. However, it is important to note that generative AI, a field that involves creating
AI models capable of generating content such as text, images, or videos, raises
concerns about democratization. The access and use of generative AI technologies
may not be equally distributed, potentially leading to further disparities and unequal
opportunities in the application of this technology.
Data science is commonly defined as a methodology by which actionable insights
can be inferred from data. This is a subtle but important difference with respect to
previous approaches to data analysis, such as business intelligence or exploratory
statistics. Performing data science is a task with an ambitious objective: the produc-
tion of beliefs informed by data and to be used as the basis of decision-making. In
the absence of data, beliefs are uninformed and decisions, in the best of cases, are
based on best practices or intuition. The representation of complex environments by
rich data opens up the possibility of applying all the scientific knowledge we have
regarding how to infer knowledge from data.
In general, data science allows us to adopt four different strategies to explore the
world using data:
Data science is definitely a cool and trendy discipline that routinely appears in the
headlines of very important newspapers and on TV stations. Data scientists are
presented in those forums as a scarce and expensive resource. As a result of this
situation, data science can be perceived as a complex and scary discipline that is
only accessible to a reduced set of geniuses working for major companies. The main
purpose of this book is to demystify data science by describing a set of tools and
techniques that allows a person with basic skills in computer science, mathematics,
and statistics to perform the tasks commonly associated with data science.
To this end, this book has been written under the following assumptions:
• Data science is a complex, multifaceted field that can be approached from sev-
eral points of view: ethics, methodology, business models, how to deal with big
data, data engineering, data governance, etc. Each point of view deserves a long
and interesting discussion, but the approach adopted in this book focuses on an-
alytical techniques, because such techniques constitute the core toolbox of every
data scientist and because they are the key ingredient in predicting future events,
discovering useful patterns, and probing the world.
• You have some experience with Python programming. For this reason, we do not
offer an introduction to the language. But even if you are new to Python, this should
not be a problem. Before reading this book you should start with any online Python
course. Mastering Python is not easy, but acquiring the basics is a manageable task
for anyone in a short period of time.
• Data science is about evidence-based storytelling and this kind of process requires
appropriate tools. The Python data science toolbox is one of the most developed environments for doing data science, though not the only one. You can easily install all you
need by using Anaconda1 : a free product that includes a programming language
(Python), an interactive environment to develop and present data science projects
(Jupyter notebooks), and most of the toolboxes necessary to perform data analysis.
• Learning by doing is the best approach to learn data science. For this reason all the
code examples and data in this book are available to download at https://github.
com/DataScienceUB/introduction-datascience-python-book.
• Data science deals with solving real-world problems. So all the chapters in the
book include and discuss practical cases using real data.
This book includes three different kinds of chapters. The first kind is about Python
extensions. Python was originally designed to have a minimum number of data ob-
jects (int, float, string, etc.); but when dealing with data, it is necessary to extend the
native set to more complex objects such as (numpy) numerical arrays or (pandas)
data frames. The second kind of chapter includes techniques and modules to per-
form statistical analysis and machine learning. Finally, there are some chapters that
describe several applications of data science, such as building recommenders or sen-
timent analysis. The composition of these chapters was chosen to offer a panoramic
view of the data science field, but we encourage the reader to delve deeper into these
topics and to explore those topics that have not been covered: big data analytics and
more advanced mathematical and statistical methods (e.g., Bayesian statistics).
1 https://www.anaconda.com/.
2 Data Science Tools
2.1 Introduction
In this chapter, first we introduce some of the cornerstone tools that data scientists
use. The toolbox of any data scientist, as for any kind of programmer, is an essential
ingredient for success and enhanced performance. Choosing the right tools can save
a lot of time and thereby allow us to focus on data analysis.
The most basic tool to decide on is which programming language we will use.
Many people use only one programming language in their entire life: the first and
only one they learn. For many, learning a new language is an enormous task that, if
at all possible, should be undertaken only once. The problem is that some languages
are intended for developing high-performance or production code, such as C, C++,
or Java, while others are more focused on prototyping code, among these the best
known are the so-called scripting languages: Ruby, Perl, and Python. So, depending
on the first language you learned, certain tasks will, at the very least, be rather tedious.
The main problem of being stuck with a single language is that many basic tools
simply will not be available in it, and eventually you will have either to reimplement
them or to create a bridge to use some other language just for a specific task.
In conclusion, you either have to be ready to change to the best language for each
task and then glue the results together, or choose a very flexible language with a rich
ecosystem (e.g., third-party open-source libraries). In this book we have selected
Python as the programming language.
Python1 is a mature programming language but it also has excellent properties for
newbie programmers, making it ideal for people who have never programmed before.
Some of the most remarkable of those properties are easy to read code, suppression
of non-mandatory delimiters, dynamic typing, and dynamic memory usage. Python
is an interpreted language, so the code is executed immediately in the Python con-
sole without needing the compilation step to machine language. Besides the Python
console (which comes included with any Python installation) you can find other inter-
active consoles, such as IPython,2 which give you a richer environment in which to
execute your Python code.
Currently, Python is one of the most flexible programming languages. One of its
main characteristics that makes it so flexible is that it can be seen as a multiparadigm
language. This is especially useful for people who already know how to program with
other languages, as they can rapidly start programming with Python in the same way.
For example, Java programmers will feel comfortable using Python as it supports
the object-oriented paradigm, or C programmers could mix Python and C code using
cython. Furthermore, for anyone who is used to programming in functional languages
such as Haskell or Lisp, Python also has basic statements for functional programming
in its own core library.
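The short sketch below is not from the book's examples; it simply illustrates how the object-oriented, imperative, and functional styles can coexist in plain Python, using only the standard library.

from functools import reduce

# Object-oriented style: a small class with state and behavior.
class Accumulator:
    def __init__(self):
        self.total = 0
    def add(self, value):
        self.total += value

# Imperative style: a plain loop over a list.
acc = Accumulator()
for x in [1, 2, 3, 4]:
    acc.add(x)
print(acc.total)  # 10

# Functional style: lambda, map, filter and reduce from the core library.
squares_of_evens = map(lambda x: x ** 2, filter(lambda x: x % 2 == 0, range(10)))
print(reduce(lambda a, b: a + b, squares_of_evens))  # 0 + 4 + 16 + 36 + 64 = 120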
In summary, in this book we have decided to use the Python language because it is a mature programming language, easy for newcomers, and can be used as a dedicated platform for data scientists, thanks to its large ecosystem of scientific libraries and its active and vibrant community. Furthermore, a lot of documentation and courses about Python already exist. Besides, Python is one of the languages that modern AI programming assistants, such as Copilot,3 support best. Other popular alternatives to Python for data scientists are R and MATLAB/Octave.
The Python community is one of the most active programming communities with a
huge number of developed toolboxes. The most popular basic Python toolboxes for
data scientists are NumPy, SciPy, Pandas, Matplotlib, and Scikit-Learn.
1 https://www.python.org/downloads/.
2 http://ipython.org/install.html.
3 https://github.com/features/copilot.
NumPy4 is the cornerstone toolbox for scientific computing with Python. NumPy
provides, among other things, support for multidimensional arrays with basic oper-
ations on them and useful linear algebra functions. Many toolboxes use the NumPy
array representations as an efficient basic data structure. Meanwhile, SciPy provides
a collection of numerical algorithms and domain-specific toolboxes, including sig-
nal processing, optimization, statistics, and much more. Another core toolbox is the
plotting library Matplotlib. This toolbox has many tools for data visualization.
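As a minimal, self-contained sketch (not one of the book's practical cases), the following lines show the kind of functionality these three toolboxes provide: array operations and linear algebra with NumPy, descriptive statistics with SciPy, and a quick plot with Matplotlib.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# NumPy: a multidimensional array with vectorized operations.
a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(a.shape)         # (2, 3)
print(a.mean(axis=0))  # column means: [2.5 3.5 4.5]

# NumPy linear algebra: solve the system Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(np.linalg.solve(A, b))  # [2. 3.]

# SciPy: descriptive statistics of a random sample.
sample = np.random.default_rng(0).normal(size=1000)
print(stats.describe(sample))

# Matplotlib: a histogram of the sample.
plt.hist(sample, bins=30)
plt.show()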
Scikit-learn5 is a machine learning library built on top of NumPy, SciPy, and Matplotlib. Scikit-learn offers simple and efficient tools for common tasks in data analysis such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. For deep learning tasks, other specific libraries exist, such as TensorFlow,6 one of the first open-source deep learning frameworks, used in many applications; Keras,7 a high-level neural network application programming interface; and PyTorch,8 another deep learning framework that is more flexible than the others.
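As an illustration only (the choice of the bundled iris dataset and of logistic regression is ours, not one of the book's case studies), a complete Scikit-learn workflow can be as short as this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small bundled dataset and split it into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a simple classifier and evaluate it on the held-out data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))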
Pandas9 provides high-performance data structures and data analysis tools. The key
feature of Pandas is a fast and efficient DataFrame object for data manipulation with
integrated indexing. The DataFrame structure can be seen as a spreadsheet which
offers very flexible ways of working with it. You can easily transform any dataset
in the way you want, by reshaping it and adding or removing columns or rows.
It also provides high-performance functions for aggregating, merging, and joining
datasets. Pandas also has tools for importing and exporting data from different for-
mats: comma-separated value (CSV), text files, Microsoft Excel, SQL databases,
and the fast HDF5 format. In many situations, the data you have in such formats will
not be complete or totally structured. For such cases, Pandas offers handling of miss-
ing data and intelligent data alignment. Furthermore, Pandas provides a convenient
Matplotlib interface.
4 http://www.scipy.org/scipylib/download.html.
5 http://www.scipy.org/scipylib/download.html.
6 https://www.tensorflow.org/.
7 https://keras.io/.
8 https://pytorch.org/.
9 http://pandas.pydata.org/getpandas.html.
Before we can get started on solving our own data-oriented problems, we will need to
set up our programming environment. The first question we need to answer concerns the Python language itself. Python is evolving continuously and it is important to keep it up to date with the latest version. Once we have installed the latest Python version, the next
thing to decide is whether we want to install the data scientist Python ecosystem in
our local system, or to use it directly from the cloud. For newbies, the second option
is recommended.
However, if a standalone installation is chosen, the Anaconda Python distribu-
tion10 is then a good option. The Anaconda distribution provides integration of
all the Python toolboxes and applications needed for data scientists into a single
directory without mixing it with other Python toolboxes installed on the machine.
It contains, of course, the core toolboxes and applications such as NumPy, Pandas,
SciPy, Matplotlib, Scikit-learn, IPython, Spyder, etc., but also more specific tools
for other related tasks such as data visualization, code optimization, and big data
processing.
For any programmer, and by extension, for any data scientist, the integrated devel-
opment environment (IDE) is an essential tool. IDEs are designed to maximize
programmer productivity. Thus, over the years this software has evolved in order to
make the coding task less complicated. Choosing the right IDE for each person is
crucial and, unfortunately, there is no “one-size-fits-all” programming environment.
The best solution is to try the most popular IDEs among the community and keep
whichever fits better in each case.
In general, the basic pieces of any IDE are three: the editor, the compiler (or
interpreter), and the debugger. Some IDEs can be used in multiple programming
languages, provided by language-specific plugins, such as Netbeans,11 Eclipse12 or
Visual Studio Code.13 Others are specific to only one language or even to a specific programming task. In the case of Python, there are a large number of specific IDEs, such as PyCharm,14 WingIDE,15 Spyder,16 and Eric.17
10 http://continuum.io/downloads.
11 https://netbeans.org/downloads/.
12 https://eclipse.org/downloads/.
13 https://code.visualstudio.com/.
14 https://www.jetbrains.com/pycharm/.
15 https://wingware.com/.
16 https://github.com/spyder-ide/spyder.
17 Eric https://eric-ide.python-projects.org/.
With the advent of web applications, a new generation of IDEs for interactive lan-
guages such as Python has been developed. Starting in the academic and e-learning communities, web-based IDEs were developed so that not only your code but also your whole environment and executions can be stored on a server.
Nowadays, such sessions are called notebooks and they are not only used in class-
rooms but also used to show results in presentations or on business dashboards.
The recent spread of such notebooks is mainly due to IPython. Since December
2011, IPython has been issued as a browser version of its interactive console, called
IPython notebook, which shows the Python execution results very clearly and con-
cisely by means of cells. Cells can contain content other than code. For example,
markdown (a wiki text language) cells can be added to introduce algorithms and text.
It is also possible to insert Matplotlib graphics to illustrate examples or even web
pages. Recently, some scientific journals have started to accept notebooks in order
to show experimental results, complete with their code and data sources. In this way,
experiments can become completely and absolutely replicable.
Since the project has grown so much, the IPython notebook has been separated from the IPython software and has become part of a larger project: Jupyter.18 Jupyter (for Julia, Python and R) aims to reuse the same web IDE (WIDE) for all these interpreted
languages and not just Python. All old IPython notebooks are automatically imported
to the new version when they are opened with the Jupyter platform; but once they
are converted to the new version, they cannot be used again in old IPython notebook
versions.
Nowadays, Jupyter notebooks are supported by major cloud computing providers.
For example, Google released Colaboratory, also known as Colab.19 With Colab you can open any Jupyter notebook from your Google Drive and execute it on remote Google servers. It is possible to use it for free, and even free GPU instances are available.
However, paid plans exist if more powerful machines are needed. In this book, all
the examples shown use Jupyter notebook style.
Besides libraries and frameworks, there are other resources that can be very useful
for data scientists in their everyday tasks:
• Data Sources: Access to high-quality data is crucial. Relevant data sets can be found in research data repositories, in open data databases such as governmental data portals, or through private companies' APIs (Application Programming Interfaces) that give access to their data. Websites like Zenodo,20 Eurostat,21 and academic data repositories such as the UCI22 data sets are excellent starting points.
18 http://jupyter.readthedocs.org/en/latest/install.html.
19 https://colab.research.google.com/.
• Version Control Systems: Tools like Git23 and platforms like Github24 are crucial for source code version control and collaboration on data science projects. Not only is code version control needed; data scientists also need to version experiments, results, data, and even models. For this purpose, data and experiment version control tools such as DVC25 exist.
• Online Courses and Tutorials: Websites like Coursera, edX, and Udacity offer
courses on data science topics, including machine learning, data analysis, and
more.
• GitHub Repositories: GitHub hosts numerous open-source data science projects
and repositories where you can find code, datasets, and useful resources.
• Data Science Competitions: Platforms like Kaggle host data science competitions
where you can practice your skills, learn from others, and even win prizes.
Throughout this book, we will come across many practical examples. In this chapter,
we will see a very basic example to help get started with a data science ecosystem
from scratch. To execute our examples, we will use Jupyter notebook, although any
other console or IDE can be used.
Once we have set up our programming environment, we can start by launching the
Jupyter notebook platform. We can start the Jupyter notebook platform by clicking
on the Jupyter Notebook icon installed by Anaconda in the start menu or on the
desktop.
The browser will immediately be launched displaying the Jupyter notebook home-
page, whose URL is https://localhost:8888/tree. Note that a special port is used;
by default it is 8888. As can be seen in Fig. 2.1, this initial page displays a tree view
of a directory. The root directory is the current user directory. If you prefer to use
some cloud computing provider like Google Colab, just log into the Jupyter notebook
service and start using it right away.
20 https://zenodo.org/.
21 https://ec.europa.eu/eurostat.
22 https://archive.ics.uci.edu/.
23 https://git-scm.com/.
24 https://github.com/.
25 https://dvc.org/.
Fig. 2.1 IPython notebook home page, displaying a home tree directory
Now, to start a new notebook, we only need to press the New button at the top right of the home page and select Python 3.
As can be seen in Fig. 2.2, a blank notebook is created called Untitled.
First of all, we are going to change the name of the notebook to something
more appropriate. To do this, just click on the notebook name and rename it:
DataScience-GetStartedExample.
Let us begin by importing those toolboxes that we will need for our program. In the
first cell we put the code to import the Pandas library as pd. This is for convenience;
every time we need to use some functionality from the Pandas library, we will write
pd instead of pandas. We will also import the two core libraries mentioned above:
the NumPy library as np and the Matplotlib library as plt.
In []:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
To execute just one cell, we press the ▶ (Run) button, click on Cell → Run, or press the keys Ctrl + Enter. While execution is underway, the header of the cell shows the * mark:
In [*]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
While a cell is being executed, no other cell can be executed. If you try to execute
another cell, its execution will not start until the first cell has finished its execution.
Once the execution is finished, the header of the cell will be replaced by the next
number of execution. Since this will be the first cell executed, the number shown will
be 1. If the process of importing the libraries is correct, no output cell is produced.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
For simplicity, other chapters in this book will avoid writing these imports.
The key data structure in Pandas is the DataFrame object. A DataFrame is basically
a tabular data structure, with rows and columns. Rows have a specific index to access
them, which can be any name or value. In Pandas, the columns are called Series,
a special type of data, which in essence consists of a list of several values, where
each value has an index. Therefore, the DataFrame data structure can be seen as
a spreadsheet, but it is much more flexible. To understand how it works, let us
see how to create a DataFrame from a common Python dictionary of lists. First,
we will create a new cell by clicking Insert → Insert Cell Below or pressing the keys Ctrl + M + B. Then, we write the following code:
In [2]:
data = {'year': [
    2010, 2011, 2012,
    2010, 2011, 2012,
    2010, 2011, 2012
    ],
    'team': [
    'FCBarcelona', 'FCBarcelona',
    'FCBarcelona', 'RMadrid',
    'RMadrid', 'RMadrid',
    'ValenciaCF', 'ValenciaCF',
    'ValenciaCF'
    ],
    'wins': [30, 28, 32, 29, 32, 26, 21, 17, 19],
    'draws': [6, 7, 4, 5, 4, 7, 8, 10, 8],
    'losses': [2, 3, 2, 4, 2, 5, 9, 11, 11]
    }
pd.DataFrame(data, columns = [
    'year', 'team', 'wins', 'draws', 'losses'
    ])
In this example, we use the pandas DataFrame object constructor with a dictionary
of lists as argument. The value of each entry in the dictionary is the name of the
column, and the lists are their values.
The DataFrame columns can be arranged at construction time by entering a key-
word columns with a list of the names of the columns ordered as we want. If the columns keyword is not present in the constructor, the columns will follow the order of the keys in the dictionary. Now, if we execute this cell, the result will be a table like this:
Out[2]: year team wins draws losses
0 2010 FCBarcelona 30 6 2
1 2011 FCBarcelona 28 7 3
2 2012 FCBarcelona 32 4 2
3 2010 RMadrid 29 5 4
4 2011 RMadrid 32 4 2
5 2012 RMadrid 26 7 5
6 2010 ValenciaCF 21 8 9
7 2011 ValenciaCF 17 10 11
8 2012 ValenciaCF 19 8 11
where each entry in the dictionary is a column. The index of each row is created
automatically taking the position of its elements inside the entry lists, starting from
0. Although it is very easy to create DataFrames from scratch, most of the time what
we will need to do is import chunks of data into a DataFrame structure, and we will
see how to do this in later examples.
Apart from DataFrame data structure creation, Pandas offers a lot of functions
to manipulate them. Among other things, it offers us functions for aggregation,
manipulation, and transformation of the data. In the following sections, we will
introduce some of these functions.
To illustrate how we can use Pandas in a simple real problem, we will start doing
some basic analysis of government data. For the sake of transparency, data pro-
duced by government entities must be open, meaning that they can be freely used,
reused, and distributed by anyone. An example of this is the Eurostat, which is the
home of European Commission data. Eurostat’s main role is to process and publish
comparable statistical information at the European level. The data in Eurostat are
provided by each member state and it is free to reuse them, for both noncommercial
and commercial purposes (with some minor exceptions).
Since the amount of data in the Eurostat database is huge, in our first study we
are only going to focus on data relative to indicators of educational funding by the
member states. Thus, the first thing to do is to retrieve such data from Eurostat.
Since open data have to be delivered in a plain text format, CSV (or any other
delimiter-separated value) formats are commonly used to store tabular data. In a
delimiter-separated value file, each line is a data record and each record consists of
one or more fields, separated by the delimiter character (usually a comma). Therefore,
the data we will use can be found, already processed, in the book's GitHub repository as the educ_figdp_1_Data.csv file.
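To make the format concrete, the short sketch below (not part of the Eurostat case study) writes the small football DataFrame built earlier in this chapter to a comma-separated file and reads it back; the file name football.csv is just an illustrative choice.

import pandas as pd

df = pd.DataFrame({'year': [2010, 2011, 2012],
                   'team': ['FCBarcelona', 'FCBarcelona', 'FCBarcelona'],
                   'wins': [30, 28, 32]})

# Each row becomes one line of the file; fields are separated by commas.
df.to_csv('football.csv', index=False)
print(open('football.csv').read())
# year,team,wins
# 2010,FCBarcelona,30
# ...

# read_csv recovers the same tabular structure as a DataFrame.
print(pd.read_csv('football.csv'))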
Reading
Let us start reading the data we downloaded. First of all, we have to create a new
notebook called Open Government Data Analysis and open it. Then, after
ensuring that the educ_figdp_1_Data.csv file is stored in the same directory
as our notebook directory, we will write the following code to read and show the
content:
In [1]:
edu = pd.read_csv('files/ch02/educ_figdp_1_Data.csv',
    na_values = ':',
    usecols = ["TIME","GEO","Value"])
edu
The read_csv function reads the file and returns its content as a DataFrame. Besides the name of the file, we add the na_values key argument, set to the character that represents "non available data" in the file. Normally, CSV files have a
header with the names of the columns. If this is the case, we can use the usecols
parameter to select which columns in the file will be used.
In this case, the DataFrame resulting from reading our data is stored in edu. The
output of the execution shows that the edu DataFrame size is 384 rows × 3 columns.
Since the DataFrame is too large to be fully displayed, three dots appear in the middle
of each column.
Besides this, Pandas also has functions for reading files with formats such as Excel,
HDF5, tabulated files, or even the content from the clipboard (read_excel(),
read_hdf(), read_table(), read_clipboard()). Whichever function
we use, the result of reading a file is stored as a DataFrame structure.
To see how the data looks, we can use the head() method, which shows just the
first five rows. If we use a number as an argument to this method, this will be the
number of rows that will be listed:
In [2]:
edu.head()
Similarly, the tail() method returns the last five rows by default.
In [3]:
edu.tail()
If we want to know the names of the columns or the names of the indexes, we
can use the DataFrame attributes columns and index respectively. The names of
the columns or indexes can be changed by assigning a new list of the same length to
these attributes. The values of any DataFrame can be retrieved as a Python array by
calling its values attribute.
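For instance, with the edu DataFrame loaded above, a quick sketch of these attributes would be (the exact output depends on the data read):

print(edu.columns)     # e.g. Index(['TIME', 'GEO', 'Value'], dtype='object')
print(edu.index)       # a RangeIndex over the row positions
print(edu.values[:2])  # the first two rows as a NumPy array

# Renaming the columns by assigning a new list of the same length:
edu.columns = ['Time', 'Geo', 'Value']
edu.columns = ['TIME', 'GEO', 'Value']  # restore the original names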
If we just want quick statistical information on all the numeric columns in a
DataFrame, we can use the function describe(). The result shows the count, the
mean, the standard deviation, the minimum and maximum, and the percentiles (by default, the 25th, 50th, and 75th) for all the values in each column or series.
In [4]:
edu.describe()
Selecting Data
If we want to select a single column of the DataFrame, we can do so by putting its name between square brackets. The result is a Series data structure, not a DataFrame:
In [5]:
edu['Value']
Out[5]: 0 NaN
1 NaN
2 5.00
3 5.03
4 4.95
... ...
380 6.10
381 6.81
382 6.85
383 6.76
Name: Value, dtype: float64
If we want to select a subset of rows from a DataFrame, we can do so by indicating
a range of rows separated by a colon (:) inside the square brackets. This is commonly
known as a slice of rows:
In [6]:
edu[10:14]
This instruction returns the slice of rows from the 10th to the 13th position. Note
that the slice does not use the index labels as references, but the position. In this case,
the labels of the rows simply coincide with the position of the rows.
If we want to select a subset of columns and rows using the labels as our references
instead of the positions, we can use loc indexing:
In [7]:
edu.loc[90:94, ['TIME','GEO']]
This returns all the rows between the indexes specified in the slice before the
comma, and the columns specified as a list after the comma. In this case, loc
references the index labels, which means that loc does not return the 90th to 94th
rows, but it returns all the rows between the row labeled 90 and the row labeled 94;
thus if the index 100 is placed between the rows labeled as 90 and 94, this row would
also be returned.
Filtering Data
Another way to select a subset of data is by applying Boolean indexing. This indexing
is commonly known as a filter. For instance, if we want to filter out those values less than or equal to 6.5, we can do it like this:
In [8]:
edu[edu['Value'] > 6.5].tail()
Boolean indexing uses the result of a Boolean operation over the data, returning
a mask with True or False for each row. The rows marked True in the mask will
be selected. In the previous example, the Boolean operation edu['Value'] > 6.5 produces a Boolean mask. When an element in the "Value" column is greater than 6.5, the corresponding value in the mask is set to True, otherwise it is set to False. Then, when this mask is applied as an index in edu[edu['Value'] > 6.5], the result is a filtered DataFrame containing only rows with values higher than 6.5. Of course, any of the usual Boolean operators can be used for filtering: < (less than), <= (less than or equal to), > (greater than), >= (greater than or equal to), == (equal to), and != (not equal to).
Pandas uses the special value NaN (not a number) to represent missing values. In
Python, NaN is a special floating-point value returned by certain operations when
one of their results ends in an undefined value. A subtle feature of NaN values is that
two NaN are never equal. Because of this, the only safe way to tell whether a value is
missing in a DataFrame is by using the isnull() function. Indeed, this function
can be used to filter rows with missing values:
In [9]:
edu[edu["Value"].isnull()].head()
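A quick sketch of this behavior, assuming the imports from the first cell of this chapter:

print(np.nan == np.nan)                # False: two NaN values never compare equal
print((edu['Value'] == np.nan).any())  # False, even though the column has missing values
print(edu['Value'].isnull().head())    # True exactly where a value is missing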
Manipulating Data
Once we know how to select the desired data, the next thing we need to know is how
to manipulate data. One of the most straightforward things we can do is to operate
with columns or rows using aggregation functions. Table 2.1 shows a list of the most
common aggregation functions. The result of all these functions applied to a row or
column is always a number. Meanwhile, if a function is applied to a DataFrame or a
selection of rows and columns, then you can specify if the function should be applied
to the rows for each column (setting the axis=0 keyword on the invocation of the
function), or it should be applied on the columns for each row (setting the axis=1
keyword on the invocation of the function).
In [10]:
edu.max(axis = 0)
Note that these are functions specific to Pandas, not the generic Python functions.
There are differences in their implementation. In Python, NaN values propagate
through all operations without raising an exception. In contrast, Pandas operations
exclude NaN values representing missing data. For example, the pandas max function
excludes NaN values, thus they are interpreted as missing values, while the standard
Python max function will take the mathematical interpretation of NaN and return it
as the maximum:
In [11]:
print("Pandas max function:", edu[’Value’].max())
print("Python max function:", max(edu[’Value’]))
Besides these aggregation functions, we can also apply arithmetic operations to all the values in a column; the operation is performed element-wise. For example, we can divide each value in the Value column by 100:
In [12]:
s = edu['Value']/100
s.head()
Out[12]: 0 NaN
1 NaN
2 0.0500
3 0.0503
4 0.0495
Name: Value, dtype: float64
More generally, we can apply any function to a DataFrame or Series by passing it as an argument to the apply method. For example, in the following code, we apply
the sqrt function from the NumPy library to perform the square root of each value
in the Value column.
In [13]:
s = edu["Value"].apply(np.sqrt)
s.head()
Out[13]: 0 NaN
1 NaN
2 2.236068
3 2.242766
4 2.224860
Name: Value, dtype: float64
If we need to design a specific function to apply, we can write an in-line function, commonly known as a λ-function. A λ-function is a function without a name. It is
only necessary to specify the parameters it receives, between the lambda keyword
and the colon (:). In the next example, only one parameter is needed, which will be
the value of each element in the Value column. The value the function returns will
be the square of that value.
In [14]:
s = edu["Value"].apply(lambda d: d**2)
s.head()
Out[14]: 0 NaN
1 NaN
2 25.0000
3 25.3009
4 24.5025
Name: Value, dtype: float64
Another basic manipulation operation is to set new values in our DataFrame. This
can be done directly using the assignment operator (=) over a DataFrame. For example, to
add a new column to a DataFrame, we can assign a Series to a selection of a column
that does not exist. This will produce a new column in the DataFrame after all the
others. You must be aware that if a column with the same name already exists, the
previous values will be overwritten. In the following example, we assign the Series
that results from dividing the Value column by the maximum value in the same
column to a new column named ValueNorm.
In [15]:
edu['ValueNorm'] = edu['Value']/edu['Value'].max()
edu.tail()
Now, if we want to remove this column from the DataFrame, we can use the drop
function; this removes the indicated rows if axis=0, or the indicated columns if
axis=1. In Pandas, all the functions that change the contents of a DataFrame, such
as the drop function, will normally return a copy of the modified data, instead of
overwriting the DataFrame. Therefore, the original DataFrame is kept. If you do not
want to keep the old values, you can set the keyword inplace to True. By default,
this keyword is set to False, meaning that a copy of the data is returned.
In [16]:
edu.drop('ValueNorm', axis = 1, inplace = True)
edu.head()
Instead, if what we want to do is to insert a new row at the bottom of the DataFrame,
we can use the Pandas concat function. This function receives a list of data frames as its argument, and returns a new data frame with their contents concatenated.
row at the bottom, we set the index to the maximum value of the index of the original
DataFrame plus one.
In [17]:
edu = pd.concat([edu, pd.DataFrame({'TIME': 2000, 'Value': 5.00,
    'GEO': 'a'}, index = [max(edu.index)+1])])
edu.tail()
Finally, if we want to remove this row, we need to use the drop function again.
Now we have to set the axis to 0, and specify the index of the row we want to remove.
Since we want to remove the last row, we can use the max function over the indexes
to determine which row it is.
In [18]:
edu.drop(max(edu.index), axis = 0, inplace = True)
edu.tail()
To remove NaN values, instead of the generic drop function, we can use the specific
dropna() function. If we want to erase any row that contains an NaN value, we
have to set the how keyword to any. To restrict it to a subset of columns, we can
specify it using the subset keyword.
In [19]:
eduDrop = edu.dropna(how = 'any', subset = ["Value"])
eduDrop.head()
If, instead of removing the rows containing NaN, we want to fill them with another
value, then we can use the fillna() method, specifying which value has to be
used. If we want to fill only some specific columns, we have to pass as argument to the fillna() function a dictionary with the names of the columns as keys and the values to be used for filling as values.
In [20]:
eduFilled = edu.fillna(value = {"Value": 0})
eduFilled.head()
Sorting
Another important functionality we will need when inspecting our data is to sort by
columns. We can sort a DataFrame by any column, using the sort_values function. If
we want to see the first five rows of data sorted in descending order (i.e., from the
largest to the smallest values) and using the Value column, then we just need to do
this:
In [21]:
edu.sort_values(by = 'Value', ascending = False,
    inplace = True)
edu.head()
Note that the inplace keyword means that the DataFrame will be overwritten,
and hence no new DataFrame is returned. If instead of ascending = False we
use ascending = True, the values are sorted in ascending order (i.e., from the
smallest to the largest values).
If we want to return to the original order, we can sort by an index using the
sort_index function and specifying axis=0:
In [22]:
edu.sort_index(axis = 0, ascending = True, inplace = True)
edu.head()
Grouping Data
Another very useful way to inspect data is to group it according to some criteria. For
instance, in our example it would be nice to group all the data by country, regardless
of the year. Pandas has the groupby function that allows us to do exactly this. The
value returned by this function is a special grouped DataFrame. To have a proper
DataFrame as a result, it is necessary to apply an aggregation function. Thus, this
function will be applied to all the values in the same group.
For example, in our case, if we want a DataFrame showing the mean of the values
for each country over all the years, we can obtain it by grouping according to country
and using the mean function as the aggregation method for each group. The result
would be a DataFrame with countries as indexes and the mean values as the column:
In [23]:
group = edu[["GEO", "Value"]].groupby('GEO').mean()
group.head()
Out[23]: Value
GEO
Austria 5.618333
Belgium 6.189091
Bulgaria 4.093333
Cyprus 7.023333
Czech Republic 4.16833
Rearranging Data
Up until now, our indexes have been just a numeration of rows without much meaning.
We can transform the arrangement of our data, redistributing the indexes and columns
for better manipulation of our data, which normally leads to better performance. We
can rearrange our data using the pivot_table function. Here, we can specify
which columns will be the new indexes, the new values, and the new columns.
For example, imagine that we want to transform our DataFrame to a spreadsheet-
like structure with the country names as the index, while the columns will be the
years starting from 2006 and the values will be the previous Value column. To do
this, first we need to filter out the data and then pivot it in this way:
In [24]:
filtered_data = edu[edu["TIME"] > 2005]
pivedu = pd.pivot_table(filtered_data, values = 'Value',
    index = ['GEO'],
    columns = ['TIME'])
pivedu.head()
Now we can use the new index to select specific rows by label, using the loc operator (older versions of Pandas used the now-removed ix indexer for the same purpose):
In [25]:
pivedu.loc[['Spain','Portugal'], [2006,2011]]
Ranking Data
Another useful visualization feature is to rank data. For example, we would like to
know how each country is ranked by year. To see this, we will use the pandas rank
function. But first, we need to clean up our previous pivoted table a bit so that it only
has real countries with real data. To do this, first we drop the Euro area entries and
shorten the Germany name entry, using the rename function and then we drop all
the rows containing any NaN, using the dropna function.
Now we can perform the ranking using the rank function. Note here that the
parameter ascending=False makes the ranking go from the highest values to
the lowest values. The Pandas rank function supports different tie-breaking methods,
specified with the method parameter. In our case, we use the first method, in
which ranks are assigned in the order the values appear in the array, so ties are broken by position and no gaps are left between ranks.
In [26]:
pivedu = pivedu.drop([
    'Euro area (13 countries)',
    'Euro area (15 countries)',
    'Euro area (17 countries)',
    'Euro area (18 countries)',
    'European Union (25 countries)',
    'European Union (27 countries)',
    'European Union (28 countries)'
    ],
    axis = 0)
pivedu = pivedu.rename(index = {'Germany (until 1990 former territory of the FRG)': 'Germany'})
pivedu = pivedu.dropna()
pivedu.rank(ascending = False, method = 'first').head()
If we want to make a global ranking taking into account all the years, we can
sum up all the columns and rank the result. Then we can sort the resulting values to
retrieve the top five countries for the last 6 years, in this way:
In [27]:
totalSum = pivedu.sum(axis = 1)
totalSum.rank(ascending = False, method = 'dense').sort_values().head()
Out[27]: GEO
Denmark 1
Cyprus 2
Finland 3
Malta 4
Belgium 5
dtype: float64
Notice that the method keyword argument in the rank function specifies how items that compare equal receive a ranking. In the case of dense, items that compare equal receive the same ranking number, and the next non-equal item receives the immediately following ranking number.
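As a quick illustration of the difference between the two tie-breaking methods, here is a toy Series (not part of the chapter's dataset):
# Toy example: compare the 'first' and 'dense' tie-breaking methods.
s = pd.Series([7, 3, 7, 1])
print(s.rank(ascending=False, method='first'))   # ties broken by order of appearance
print(s.rank(ascending=False, method='dense'))   # tied values share a rank, with no gaps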
Plotting
Pandas DataFrames and Series can be plotted using the plot function, which relies on the Matplotlib graphics library. For example, if we want to plot the accumulated values for each country over the last 6 years, we can take the Series obtained in the previous example and plot it directly by calling the plot function as shown in the next cell:
In [28]:
totalSum = pivedu.sum(axis = 1).sort_values(ascending = False)
totalSum.plot(kind = ’bar’, style = ’b’, alpha = 0.4,
title = "Total Values for Country")
Out[28]:
Note that if we want the bars ordered from the highest to the lowest value, we need to sort the values in the Series first. The parameter kind used in the plot function defines which kind of graphic will be used, in our case a bar graph. The parameter style refers to the style properties of the graphic; in our case, the color of the bars is set to b (blue). The alpha channel can be modified by adding the keyword parameter alpha with a value between 0 and 1, producing a more translucent plot. Finally, using the title keyword, the name of the graphic can be set.
It is also possible to plot a DataFrame directly. In this case, each column is treated as a separate Series. For example, instead of printing the accumulated value over the years, we can plot the value for each year.
In [29]:
my_colors = [’b’, ’r’, ’g’, ’y’, ’m’, ’c’]
ax = pivedu.plot(kind = ’barh’,
stacked = True,
color = my_colors)
ax.legend(loc = ’center left’, bbox_to_anchor = (1, .5))
Out[29]:
In this case, we have used a horizontal bar graph (kind=’barh’) stacking all the
years in the same country bar. This can be done by setting the parameter stacked
to True. The number of default colors in a plot is limited; thus, if you have more Series to show than default colors, you need to specify more colors, or otherwise the same set of colors will be reused. We can set a new set of colors using the keyword color
with a list of colors. Basic colors have a single-character code assigned to each, for
example, “b” is for blue, “r” for red, “g” for green, “y” for yellow, “m” for magenta,
and “c” for cyan. When several Series are shown in a plot, a legend is created for
identifying each one. The name for each Series is the name of the column in the
DataFrame. By default, the legend goes inside the plot area. If we want to change
this, we can use the legend function of the axis object (this is the object returned
when the plot function is called). By using the loc keyword, we can set the relative
position of the legend with respect to the plot. It can be a combination of right or
left and upper, lower, or center. With bbox_to_anchor we can set an absolute
position with respect to the plot, allowing us to put the legend outside the graph.
2.8 Conclusions
This chapter has been a brief introduction to the most essential elements of a programming environment for data scientists. The tutorial followed in this chapter is just a simple introduction to the Pandas library. For more advanced uses, refer to the library documentation.26
Acknowledgements This chapter was written by Eloi Puertas and Francesc Dantí.
26 https://pandas.pydata.org/docs/.
Descriptive Statistics
3
3.1 Introduction
Descriptive statistics applies the concepts, measures, and terms that are used to
describe the basic features of the samples in a study. These procedures are essential
to provide summaries about the samples as an approximation of the population.
Together with simple graphics, they form the basis of every quantitative analysis of
data. In order to describe the sample data and to be able to infer any conclusion, we
should go through several steps:
1. Data preparation: Given a specific example, we need to prepare the data for
generating statistically valid descriptions.
2. Descriptive statistics: This generates different statistics to describe and summa-
rize the data concisely and evaluate different ways to visualize them.
One of the first tasks when analyzing data is to collect and prepare the data in a format
appropriate for analysis of the samples. The most common steps for data preparation
involve the following operations.
1. Obtaining the data: Data can be read directly from a file or they might be obtained
by scraping the web.
2. Parsing the data: The right parsing procedure depends on what format the data
are in: plain text, fixed columns, CSV, XML, HTML, etc.
3. Cleaning the data: Survey responses and other data files are almost always incomplete. Sometimes, there are multiple codes for things such as "not asked", "did not know", and "declined to answer". And there are almost always errors. A simple strategy is to remove or ignore incomplete records.
4. Building data structures: Once you read the data, it is necessary to store them in
a data structure that lends itself to the analysis we are interested in. If the data fit
into the memory, building a data structure is usually the way to go. If not, usually
a database is built, which is an out-of-memory data structure. Most databases
provide a mapping from keys to values, so they serve as dictionaries.
Let us consider a public database called the “Adult” dataset, hosted on the UCI’s
Machine Learning Repository.1 It contains approximately 32,000 observations con-
cerning different financial parameters related to the US population: age, sex, marital
(marital status of the individual), country, income (Boolean variable: whether the per-
son makes more than $50,000 per annum), education (the highest level of education
achieved by the individual), occupation, capital gain, etc.
We will show that we can explore the data by asking questions like: “Are men
more likely to become high-income professionals than women, i.e., to receive an
income of over $50,000 per annum?”
1 https://archive.ics.uci.edu/ml/datasets/Adult.
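The cell that defines the chr_int helper and opens the data file is not shown in this excerpt; a minimal sketch consistent with the code below (the file path and the helper body are assumptions made only to keep the example self-contained) would be:
# Sketch of the omitted cell: path and helper body are assumptions.
import pandas as pd

file = open('files/ch03/adult.data', 'r')

def chr_int(a):
    # Convert a numeric string to int, returning 0 for non-numeric fields.
    if a.isdigit():
        return int(a)
    else:
        return 0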
data = []
for line in file:
    data1 = line.split(', ')
    if len(data1) == 15:
        data.append([chr_int(data1[0]), data1[1],
                     chr_int(data1[2]), data1[3],
                     chr_int(data1[4]), data1[5],
                     data1[6], data1[7], data1[8],
                     data1[9], chr_int(data1[10]),
                     chr_int(data1[11]),
                     chr_int(data1[12]),
                     data1[13], data1[14]
                     ])
df = pd.DataFrame(data)
df.columns = [
    'age', 'type_employer', 'fnlwgt',
    'education', 'education_num', 'marital',
    'occupation', 'relationship', 'race',
    'sex', 'capital_gain', 'capital_loss',
    'hr_per_week', 'country', 'income'
    ]
The shape attribute gives exactly the number of data samples (in rows, in this case) and features (in columns):
In [4]:
df.shape
Out[5]: country
? 583
Cambodia 19
Canada 121
China 75
Columbia 59
The first row shows the number of samples with unknown country, followed by the number of samples corresponding to the first countries in the dataset.
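The command that produced this output is not shown in this excerpt; a sketch that yields such per-country counts (the original cell may have used an equivalent expression) is:
# Sketch: number of samples per country, in the same spirit as the output above.
df.groupby('country').size().head()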
Let us split people according to their gender into two groups: men and women.
In [6]:
ml = df[(df.sex == 'Male')]
fm = df[(df.sex == 'Female')]
The data that come from performing a particular measurement on all the subjects
in a sample represent our observations for a single characteristic like country,
age, education, etc. These measurements and categories represent a sample
distribution of the variable, which in turn approximately represents the population
distribution of the variable. One of the main goals of exploratory data analysis is
to visualize and summarize the sample distribution, thereby allowing us to make
tentative assumptions about the population distribution.
The data in general can be categorical or quantitative. For categorical data, a simple
tabulation of the frequency of each category is the best non-graphical exploration
for data analysis. For example, we can ask ourselves what is the proportion of high-
income professionals in our database:
In [8]:
df1 = df[(df.income == '>50K\n')]
print('The rate of people with high income is: ',
      int(len(df1)/float(len(df))*100), '%.')
print('The rate of men with high income is: ',
      int(len(ml1)/float(len(ml))*100), '%.')
print('The rate of women with high income is: ',
      int(len(fm1)/float(len(fm))*100), '%.')
3.3.1.1 Mean
One of the first measurements we use to have a look at the data is to obtain sample statistics from the data, such as the sample mean [1]. Given a sample of n values, {x_i}, i = 1, . . . , n, the mean, μ, is the sum of the values divided by the number of values; that is,
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
2 We will use the following notation: X is a random variable; x is a column vector; x^T (the transpose of x) is a row vector; X (a boldface capital) is a matrix; and x_i is the ith element of a dataset.
In [9]:
print(’The average age of men is: ’,
ml[’age’].mean())
print(’The average age of women is: ’,
fm[’age’].mean())
ml_median_age = ml1[’age’].median()
fm_median_age = fm1[’age’].median()
print("Median age per men and women with high-income: ",
ml_median_age, fm_median_age)
That value, x_p, is the pth quantile, or the 100 × pth percentile. For example, a 5-number summary is defined by the values x_min, Q_1, Q_2, Q_3, x_max, where Q_1 is the 25th percentile, Q_2 is the 50th percentile, and Q_3 is the 75th percentile.
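Quantiles are directly available in pandas; a brief sketch of the 5-number summary of the men's age (using the ml Series defined above):
# Sketch: 5-number summary (min, Q1, Q2, Q3, max) of the men's age.
ml['age'].quantile([0, 0.25, 0.5, 0.75, 1])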
Summarizing data by just looking at their mean, median, and variance can be danger-
ous: very different data can be described by the same statistics. The best thing to do
is to validate the data by inspecting them. We can have a look at the data distribution,
which describes how often each value appears (i.e., what is its frequency).
The most common representation of a distribution is a histogram, which is a graph
that shows the frequency of each value. Let us show the age of working men and
women separately.
In [12]:
ml_age = ml[’age’]
ml_age.hist(histtype = ’stepfilled’, bins = 20)
In [13]:
fm_age = fm[’age’]
fm_age.hist(histtype = ’stepfilled’, bins = 10)
The output can be seen in Fig. 3.1. If we want to compare the histograms, we can
plot them overlapping in the same graphic as follows:
Fig. 3.1 Histogram of the age of working men (left) and women (right)
Fig. 3.2 Histogram of the age of working men (in ochre) and women (in violet) (left). Histogram of
the age of working men (in ochre), women (in blue) and their intersection (in violet) after samples
normalization (right)
In [14]:
import seaborn as sns
fm_age.hist(histtype = ’stepfilled’,
alpha = .5, bins = 20)
ml_age.hist(histtype = ’stepfilled’,
alpha = .5,
color = sns.desaturate("indianred", .75),
bins = 10)
The output can be seen in Fig. 3.2 (left). Note that we are visualizing the absolute
values of the number of people in our dataset according to their age (the abscissa of
the histogram). As a side effect, we can see that there are many more men in these
conditions than women.
We can normalize the frequencies of the histogram by dividing/normalizing by
.n, the number of samples. The normalized histogram is called the Probability Mass
Function (PMF).
In [15]:
fm_age.hist(density = True, histtype = ’stepfilled’,
alpha = .5, bins = 20)
ml_age.hist(density = True, histtype = ’stepfilled’,
alpha = .5, bins = 10,
color = sns.desaturate("indianred", .75))
This outputs Fig. 3.2 (right), where we can observe a comparable range of indi-
viduals (men and women).
The Cumulative Distribution Function (CDF), or just distribution function, de-
scribes the probability that a real-valued random variable . X with a given probability
distribution will be found to have a value less than or equal to .x. Let us show the
CDF of age distribution for both men and women.
Fig. 3.3 The CDF of the age of working male (in blue) and female (in red) samples
In [16]:
ml_age.hist(density = True, histtype = 'step',
            cumulative = True, linewidth = 3.5,
            bins = 20)
fm_age.hist(density = True, histtype = 'step',
            cumulative = True, linewidth = 3.5,
            bins = 20,
            color = sns.desaturate("indianred", .75))
The output can be seen in Fig. 3.3, which illustrates the CDF of the age distributions
for both men and women.
As mentioned before, outliers are data samples with a value that is far from the central
tendency. Different rules can be defined to detect outliers, as follows:
For example, in our case, we are interested in the age statistics of men versus
women with high incomes and we can see that in our dataset, the minimum age is 17
years and the maximum is 90 years. We can consider that some of these samples are due to errors or are not representative. Applying domain knowledge, we focus on ages from the median age (37, in our case) up to 72 and down to 22 years old, and we consider the rest as outliers.
In [17]:
df2 = df.drop(df.index[
    (df.income == '>50K\n') &
    ((df['age'] > df['age'].median() + 35) |
     (df['age'] < df['age'].median() - 15))
    ])
ml1_age = ml1['age']
fm1_age = fm1['age']
ml2_age = ml1_age.drop(ml1_age.index[
    (ml1_age > df['age'].median() + 35) |
    (ml1_age < df['age'].median() - 15)
    ])
fm2_age = fm1_age.drop(fm1_age.index[
    (fm1_age > df['age'].median() + 35) |
    (fm1_age < df['age'].median() - 15)
    ])
We can check how the mean and the median changed once the data were cleaned:
In [18]:
mu2ml = ml2_age.mean()
std2ml = ml2_age.std()
md2ml = ml2_age.median()
mu2fm = fm2_age.mean()
std2fm = fm2_age.std()
md2fm = fm2_age.median()
print("Men statistics:")
print("Mean:", mu2ml, "Std:", std2ml)
print("Median:", md2ml)
print("Min:", ml2_age.min(), "Max:", ml2_age.max())
print("Women statistics:")
print("Mean:", mu2fm, "Std:", std2fm)
print("Median:", md2fm)
print("Min:", fm2_age.min(), "Max:", fm2_age.max())
Fig. 3.4 The red shows the cleaned data without the considered outliers (in blue)
Figure 3.4 shows the outliers in blue and the rest of the data in red. Visually, we
can confirm that we removed mainly outliers from the dataset.
Next we can see that by removing the outliers, the difference between the popula-
tions (men and women) actually decreased. In our case, there were more outliers in
men than women. While the difference in the mean values before removing the outliers is 2.5, after removing them it decreases slightly to 2.44:
In [20]:
print(’The mean difference with outliers is: %4.2f.’
% (ml_age.mean() - fm_age.mean()))
print(’The mean difference without outliers is: %4.2f.’
% (ml2_age.mean() - fm2_age.mean()))
The results are shown in Fig. 3.5. One can see that the differences between male
and female values are slightly negative before age 42 and positive after it. Hence,
women tend to be promoted (receive more than 50 K) earlier than men.
Fig. 3.5 Differences in high-income earner men versus women as a function of age
For univariate data, the skewness is a statistic that measures the asymmetry of the set of n data samples, x_i:
$$g_1 = \frac{1}{n} \frac{\sum_i (x_i - \mu)^3}{\sigma^3}, \qquad (3.3)$$
where μ is the mean, σ is the standard deviation, and n is the number of data points.
A negative skewness indicates that the distribution "skews left" (it extends further to the left than to the right). One can easily see that the skewness of a normal distribution is zero, and any symmetric data must have a skewness of zero. Note that skewness can be affected by outliers! A simpler alternative is to look at the relationship between the mean μ and the median μ_1/2.
In [22]:
def skewness(x):
    res = 0
    m = x.mean()
    s = x.std()
    for i in x:
        res += (i - m) * (i - m) * (i - m)
    res /= (len(x) * s * s * s)
    return res
There are many other definitions for skewness that will not be discussed here. In
our case, if we check the Pearson’s skewness coefficient for both men and women,
we can see that the difference between them actually increases:
In [23]:
def pearson(x):
    return 3 * (x.mean() - x.median()) / x.std()
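The call that prints both coefficients is not shown in this excerpt; a minimal usage sketch (here applied to the cleaned age Series ml2_age and fm2_age, which is an assumption about the arguments used) would be:
# Sketch: Pearson's skewness coefficient for the cleaned samples.
print("Pearson's coefficient for men:", pearson(ml2_age))
print("Pearson's coefficient for women:", pearson(fm2_age))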
3.3.4.1 Discussions
After exploring the data, we obtained some apparent effects that seem to support our
initial assumptions. For example, the mean age for men in our dataset is 39.4 years;
while for women, it is 36.8 years. When analyzing the high-income salaries, the mean
age for men increased to 44.6 years; while for women, it increased to 42.1 years.
When the data were cleaned from outliers, we obtained mean age for high-income
men: 44.3, and for women: 41.8. Moreover, histograms and other statistics show the
skewness of the data and the fact that women used to be promoted a little bit earlier
than men, in general.
Fig. 3.6 Exponential CDF (left) and PDF (right) with .λ = 3.00
The normal CDF has no closed-form expression and its most common representation is the PDF:
$$\mathrm{PDF}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
The parameter σ defines the shape of the distribution. An example of the PDF of a normal distribution with μ = 6 and σ = 2 is given in Fig. 3.7.
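The book's plotting cell for this figure is not included in this excerpt; a possible way to reproduce such a curve with SciPy (a sketch under those parameter values) is:
# Sketch: PDF of a normal distribution with mean 6 and standard deviation 2.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(0, 12, 200)
plt.plot(x, norm.pdf(x, loc=6, scale=2))
plt.show()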
Fig. 3.8 Summed kernel functions around a random set of points (left) and the kernel density
estimate with the optimal bandwidth (right) for our dataset. Random data shown in blue, kernel
shown in black and summed function shown in red
In [24]:
x1 = np.random.normal(-1, 0.5, 15)
x2 = np.random.normal(6, 1, 10)
# np.r_ translates slice objects to concatenation along the first axis
y = np.r_[x1, x2]
x = np.linspace(min(y), max(y), 100)
Figure 3.8 (left) shows the result of the construction of the continuous function
from the kernel summarization.
In fact, the library SciPy3 implements a Gaussian kernel density estimation that
automatically chooses the appropriate bandwidth parameter for the kernel. Thus, the
final construction of the density estimate will be obtained by:
In [25]:
from scipy.stats import kde
density = kde.gaussian_kde(y)
xgrid = np.linspace(x.min(), x.max(), 200)
plt.hist(y, bins = 28, density = True)
plt.plot(xgrid, density(xgrid), ’r-’)
Figure 3.8 (right) shows the result of the kernel density estimate for our example.
3 http://www.scipy.org.
3.4 Estimation
An important aspect when working with statistical data is being able to use estimates
to approximate the values of unknown parameters of the dataset. In this section, we
will review different kinds of estimators (estimated mean, variance, standard score,
etc.).
In what follows, we will deal with point estimators, that is, single numerical estimates of parameters of a population.
3.4.1.1 Mean
Let us assume that we know that our data are coming from a normal distribution and
the random samples drawn are as follows:
{0.33, −1.76, 2.34, 0.56, 0.89}.
The question is: can we guess the mean μ of the distribution? One approximation is given by the sample mean, x̄. This process is called estimation and the statistic (e.g., the sample mean) is called an estimator. In our case, the sample mean is 0.472, and it seems a logical choice to represent the mean of the distribution. It is not so evident if we add a sample with a value of −465. In this case, the sample mean will be −77.11, which does not look like the mean of the distribution. The reason is that the last value seems to be an outlier compared to the rest of the sample. In order to avoid this effect, we can try first to remove outliers and then to estimate the mean; or we can use the sample median as an estimator of the mean of the distribution.
If there are no outliers, the sample mean x̄ minimizes the following mean squared error:
$$MSE = \frac{1}{n} \sum (\bar{x} - \mu)^2,$$
where n is the number of times we estimate the mean.
Let us compute the MSE of a set of random data:
In [26]:
NTests = 200
mu = 0.0
var = 1.0
err = 0.0
NPs = 1000
for i in range(NTests):
    x = np.random.normal(mu, var, NPs)
    err += (x.mean() - mu)**2
print('MSE: ', err/NTests)
3.4.1.2 Variance
If we ask ourselves what is the variance, σ², of the distribution of X, analogously we can use the sample variance as an estimator. Let us denote by σ̄² the sample variance estimator:
$$\bar{\sigma}^2 = \frac{1}{n} \sum (x_i - \bar{x})^2.$$
For large samples, this estimator works well, but for a small number of samples
it is biased. In those cases, a better estimator is given by:
$$\bar{\sigma}^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2.$$
Variables of data can express relations. For example, countries that tend to invest in
research also tend to invest more in education and health. This kind of relationship
is captured by the covariance.
3.4.2.1 Covariance
When two variables share the same tendency, we speak about covariance. Let us consider two series, {x_i} and {y_i}. Let us center the data with respect to their means: dx_i = x_i − μ_X and dy_i = y_i − μ_Y. It is easy to show that when {x_i} and {y_i} vary together, their deviations tend to have the same sign. The covariance is defined as the mean of the following products:
$$Cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} dx_i\, dy_i,$$
where n is the length of both sets. Still, the covariance itself is hard to interpret.
Fig. 3.9 Positive correlation between economic growth and stock market returns worldwide (left).
Negative correlation between the world oil production and gasoline prices worldwide (right)
(see Fig. 3.9). This means that one variable can predict the other very well. However, having ρ = 0 does not necessarily mean that the variables are not correlated! Pearson's correlation captures correlations of first order, but not nonlinear correlations. Moreover, it does not work well in the presence of outliers.
If we compute the ranks, they are: [1.0, 2.0, 3.0, 4.0, 5.0] and [2.0, 1.0, 3.0, 5.0, 4.0].
As the value of the Pearson's coefficient we get 0.28, which does not show much correlation between the sets. However, the Spearman's rank coefficient, capturing the correlation between the ranks, gives a final value of 0.80, confirming the correlation between the sets. As an exercise, you can compute the Pearson's and the Spearman's rank correlations for the different Anscombe configurations given in Fig. 3.10. Observe whether linear and nonlinear correlations can be captured by the Pearson's and the Spearman's rank correlations.
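Both coefficients can be computed directly with SciPy; a short sketch on illustrative arrays (not the sets used in the text) follows:
# Sketch: Pearson's and Spearman's rank correlations on illustrative data.
from scipy.stats import pearsonr, spearmanr

a = [1, 2, 3, 10, 50]
b = [2, 1, 4, 30, 35]
print('Pearson: ', pearsonr(a, b)[0])
print('Spearman:', spearmanr(a, b)[0])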
3.5 Conclusions
In this chapter, we have familiarized ourselves with the basic concepts and procedures
of descriptive statistics to explore a dataset. They help us to understand the experiment
or a dataset in detail and allow us to put the data in perspective. We introduced the
central measures of tendency such as the sample mean and median; and measures of
variability such as the variance and standard deviation. We have also discussed how
these measures can be affected by outliers. In order to go deeper into visualizing the
dataset, we have introduced histograms, quantiles, and percentiles.
In many situations, when the values are continuous variables, it is convenient to use
continuous distributions. The most common continuous distributions are the normal
and the exponential distributions. The advantage of most of them is that we can have
an explicit expression for their PDF and CDF, as well as the mean and variance in
terms of a closed formula. Also, we learned how, by using the kernel density, we can
obtain a continuous representation of the sample distribution. Finally, we discussed
how to estimate the correlation and the covariance of datasets, where two of the most
popular measures are the Pearson’s and the Spearman’s rank correlations, which are
affected in different ways by the outliers of the dataset.
Acknowledgements This chapter was co-written by Petia Radeva and Laura Igual.
References
1. A.B. Downey, Probability and Statistics for Programmers (O'Reilly Media, 2011). ISBN-10: 1449307116
2. Probability Distributions: Discrete vs. Continuous. http://stattrek.com/probability-distributions/discrete-continuous.aspx
Statistical Inference
4
4.1 Introduction
There is not only one way to address the problem of statistical inference. In fact,
there are two main approaches to statistical inference: the frequentist and Bayesian
approaches. Their differences are subtle but fundamental:
• In the case of the frequentist approach, the main assumption is that there is a
population, which can be represented by several parameters, from which we can
obtain numerous random samples.
Population parameters are fixed but they are not accessible to the observer. The
only way to derive information about these parameters is to take a sample of the
population, to compute the parameters of the sample, and to use statistical inference
techniques to make probable propositions regarding population parameters.
• The Bayesian approach is based on a consideration that data are fixed, not the result
of a repeatable sampling process, but parameters describing data can be described
probabilistically. To this end, Bayesian inference methods focus on producing
parameter distributions that represent all the knowledge we can extract from the
sample and from prior information about the problem.
In this chapter, to keep things simple, we will only deal with the first approach,
but we suggest the reader also explores the second approach as it is well worth it!
As we have said, the ultimate objective of statistical inference, if we adopt the fre-
quentist approach, is to produce probable propositions concerning population param-
eters from analysis of a sample. The most important classes of propositions are as
follows:
• Propositions about point estimates. A point estimate is a particular value that best
approximates some parameter of interest. For example, the mean or the variance
of the sample.
• Propositions about confidence intervals or set estimates. A confidence interval is
a range of values that best represents some parameter of interest.
• Propositions about the acceptance or rejection of a hypothesis.
Estimates produced by descriptive statistics are not equal to the truth, but they get better as more data become available. So, it makes sense to use them as central elements of our propositions and to measure their variability with respect to the sample size.
1 http://opendata.bcn.cat/.
2 Suppose that we draw all possible samples of a given size from a given population. Suppose further
that we compute the mean for each sample. The probability distribution of this statistic is called the
mean sampling distribution.
Fig. 4.1 Empirical distribution of the sample mean. In red, the mean value of this distribution
In Fig. 4.1, we can see the empirical sample distribution of the mean for s = 10,000 samples with n = 200 observations from our dataset. This empirical distribution has been built in the following way:
In [2]:
# population
df = accidents.to_frame()
N_test = 10000
elements = 200
# mean array of samples
means = [0] * N_test
# sample generation
for i in range(N_test):
    rows = np.random.choice(df.index.values, elements)
    sampled_df = df.loc[rows]
    means[i] = sampled_df.mean()
In general, given a point estimate from a sample of size .n, we define its sampling
distribution as the distribution of the point estimate based on samples of size .n
from its population. This definition is valid for point estimates of other population
parameters, such as the population median or population standard deviation, but we
will focus on the analysis of the sample mean.
The sampling distribution of an estimate plays an important role in understanding
the real meaning of propositions concerning point estimates. It is very useful to think
of a particular point estimate as being drawn from such a distribution.
Fig. 4.2 Mean sampling distribution by bootstrapping. In red, the mean value of this distribution
In [4]:
def meanBootstrap(X, numberb):
    import numpy as np
    x = [0] * numberb
    for i in range(numberb):
        sample = [X[_] for _ in np.random.randint(len(X),
                                                   size=len(X))]
        x[i] = np.mean(sample)
    return x
m = meanBootstrap(accidents, 10000)
print("Mean estimate:", np.mean(m))
A point estimate .Θ, such as the sample mean, provides a single plausible value for
a parameter. However, as we have seen, a point estimate is rarely perfect; usually
there is some error in the estimate. That is why we have suggested using the standard
error as a measure of its variability.
Instead of that, a next logical step would be to provide a plausible range of values
for the parameter. A plausible range of values for the sample parameter is called a
confidence interval.
We will base the definition of confidence interval on two ideas:
1. Our point estimate is the most plausible value of the parameter, so it makes sense
to build the confidence interval around the point estimate.
2. The plausibility of a range of values can be defined from the sampling distribution
of the estimate.
For the case of the mean, the Central Limit Theorem states that its sampling
distribution is normal:
Theorem 4.1 Given a population with a finite mean μ and a finite non-zero variance σ², the sampling distribution of the mean approaches a normal distribution with a mean of μ and a variance of σ²/n as n, the sample size, increases.
In this case, and in order to define an interval, we can make use of a well-known
result from probability that applies to normal distributions: roughly 95% of the time
our estimate will be within 1.96 standard errors of the true mean of the distribution.
If the interval spreads out 1.96 standard errors from a normally distributed point
estimate, intuitively we can say that we are roughly 95% confident that we have
captured the true parameter.
$$CI = [\Theta - 1.96 \times SE,\ \Theta + 1.96 \times SE]$$
In [5]:
m = accidents.mean()
se = accidents.std()/math.sqrt(len(accidents))
ci = [m - se*1.96, m + se*1.96]
print("Confidence interval:", ci)
This is how we would compute a 95% confidence interval of the sample mean
using bootstrapping:
1. Repeat the following steps for a large number, .s, of times:
a. Draw .n observations with replacement from the original data to create a boot-
strap sample or resample.
b. Calculate the mean for the resample.
2. Calculate the mean of your .s values of the sample statistic. This process gives
you a “bootstrapped” estimate of the sample statistic.
3. Calculate the standard deviation of your .s values of the sample statistic. This
process gives you a “bootstrapped” estimate of the SE of the sample statistic.
4. Obtain the 2.5th and 97.5th centiles of your .s values of the sample statistic.
In [6]:
m = meanBootstrap(accidents, 10000)
sample_mean = np.mean(m)
sample_se = np.std(m)
ci = [np.percentile(m,2.5), np.percentile(m,97.5)]
print("Confidence interval:", ci)
Fig. 4.3 This graph shows 100 sample means (green points) and its corresponding confidence
intervals, computed from 100 different samples of 100 elements from our dataset. It can be observed
that a few of them (those in red) do not contain the mean of the population (black horizontal line)
In 95% of the cases, when I compute the 95% confidence interval from this sample, the true
mean of the population will fall within the interval defined by these bounds: .±1.96 × S E.
We cannot say either that our specific sample contains the true parameter or that
the interval has a 95% chance of containing the true parameter. That interpretation
would not be correct under the assumptions of traditional statistics.
• . H0 : The mean number of daily traffic accidents is the same in 2010 and 2013
(there is only one population, one true mean, and 2010 and 2013 are just different
samples from the same population).
• . H A : The mean number of daily traffic accidents in 2010 and 2013 is different
(2010 and 2013 are two samples from two different populations).
We call . H0 the null hypothesis and it represents a skeptical point of view: the
effect we have observed is due to chance (due to the specific sample bias). . H A is the
alternative hypothesis and it represents the other point of view: the effect is real.
The general rule of frequentist hypothesis testing is: we will not discard . H0 (and
hence we will not consider . H A ) unless the observed effect is implausible under . H0 .
We can use the concept represented by confidence intervals to measure the plausi-
bility of a hypothesis.
We can illustrate the evaluation of the hypothesis setup by comparing the mean
rate of traffic accidents in Barcelona during 2010 and 2013:
In [7]:
data = pd.read_csv("files/ch04/ACCIDENTS_GU_BCN_2010.csv",
                   encoding='latin-1')
# Create a new column which is the date
data['Date'] = (data['Dia de mes'].apply(lambda x: str(x)) +
                '-' +
                data['Mes de any'].apply(lambda x: str(x)))
data2 = data['Date']
counts2010 = data['Date'].value_counts()
print('2010: Mean', counts2010.mean())

data = pd.read_csv("files/ch04/ACCIDENTS_GU_BCN_2013.csv",
                   encoding='latin-1')
# Create a new column which is the date
data['Date'] = (data['Dia de mes'].apply(lambda x: str(x)) +
                '-' +
                data['Mes de any'].apply(lambda x: str(x)))
data2 = data['Date']
counts2013 = data['Date'].value_counts()
print('2013: Mean', counts2013.mean())
This estimate suggests that in 2013 the mean rate of traffic accidents in Barcelona
was higher than it was in 2010. But is this effect statistically significant?
Based on our sample, the 95% confidence interval for the mean rate of traffic
accidents in Barcelona during 2013 can be calculated as follows:
In [8]:
n = len(counts2013)
mean = counts2013.mean()
s = counts2013.std()
ci = [mean - s*1.96/np.sqrt(n), mean + s*1.96/np.sqrt(n)]
print(’2010 accident rate estimate:’, counts2010.mean())
print(’2013 accident rate estimate:’, counts2013.mean())
print(’CI for 2013:’,ci)
If we use a 95% confidence interval to test a problem where the null hypothesis is true, we
will make an error whenever the point estimate is at least 1.96 standard errors away from the
population parameter. This happens about 5% of the time (2.5% in each tail).
We use a summary statistic of the data to help compute the p-value and evaluate the hypotheses.
Usually, if . P is less than 0.05 (the chance of a fluke is less than 5%) the result is
declared statistically significant.
It must be pointed out that this choice is rather arbitrary and should not be taken
as a scientific truth.
The goal of classical hypothesis testing is to answer the question, “Given a sample
and an apparent effect, what is the probability of seeing such an effect by chance?”
Here is how we answer that question:
• The first step is to quantify the size of the apparent effect by choosing a test statistic.
In our case, the apparent effect is a difference in accident rates, so a natural choice
for the test statistic is the difference in means between the two periods.
• The second step is to define a null hypothesis, which is a model of the system
based on the assumption that the apparent effect is not real. In our case, the null
hypothesis is that there is no difference between the two periods.
• The third step is to compute a . p-value, which is the probability of seeing the
apparent effect if the null hypothesis is true. In our case, we would compute the
difference in means, then compute the probability of seeing a difference as big, or
bigger, under the null hypothesis.
• The last step is to interpret the result. If the . p-value is low, the effect is said to
be statistically significant, which means that it is unlikely to have occurred by
chance. In this case we infer that the effect is more likely to appear in the larger
population.
1. Pool the two distributions together.
2. Repeatedly generate two samples of size n from the pooled data and compute the difference in their means.
3. Count how many differences are larger than the observed one.
In [10]:
# pooling distributions
x = counts2010
y = counts2013
pool = np.concatenate([x, y])
np.random.shuffle(pool)
# sample generation
import random
N = 10000  # number of samples
diff = np.zeros(N)  # float array, so the differences in means are not truncated
for i in np.arange(N):
    p1 = [random.choice(pool) for _ in np.arange(n)]
    p2 = [random.choice(pool) for _ in np.arange(n)]
    diff[i] = (np.mean(p1) - np.mean(p2))
In [11]:
# counting differences larger than the observed one
diff2 = np.array(diff)
w1 = np.where(diff2 > p)[0]
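The p-value is then the fraction of resampled differences that are at least as large as the observed one; a minimal sketch (assuming p holds the observed difference in means, computed in a cell not shown here):
# Sketch: p is assumed to be the observed difference in means.
pvalue = len(w1) / float(N)
print('p-value:', pvalue)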
If the relevant question is whether accident rates are different, then it makes sense
to test the absolute difference in means. This kind of test is called two sided because
it counts both sides of the distribution of differences.
Direct Approach
The formula for the standard error of the absolute difference in two means is similar
to the formula for other standard errors. Recall that the standard error of a single
mean can be approximated by:
$$SE_{\bar{x}_1} = \frac{\sigma_1}{\sqrt{n_1}}$$
The standard error of the difference of two sample means can be constructed from
the standard errors of the separate sample means:
$$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
This would allow us to define a direct test with the 95% confidence interval.
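As a sketch of this direct approach, using the counts2010 and counts2013 Series computed above:
# Sketch: 95% confidence interval for the difference of the two sample means.
mean_diff = counts2013.mean() - counts2010.mean()
se_diff = np.sqrt(counts2010.std()**2 / len(counts2010) +
                  counts2013.std()**2 / len(counts2013))
ci = [mean_diff - 1.96 * se_diff, mean_diff + 1.96 * se_diff]
print('Difference in means:', mean_diff)
print('95% CI for the difference:', ci)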
We do not yet have an answer for this question! We have defined a null hypothesis H_0 (the effect is not real) and we have computed the probability of the observed effect under the null hypothesis, which is P(E|H_0) (the p-value), where E is an effect as big as or bigger than the apparent effect.
We have stated that from the frequentist point of view, we cannot consider . H A
unless . P(E|H0 ) is less than an arbitrary value. But the real answer to this question
must be based on comparing . P(H0 |E) to . P(H A |E), not on . P(E|H0 )! One possi-
ble solution to these problems is to use Bayesian reasoning; an alternative to the
frequentist approach.
No matter how much data you have, you will still depend on intuition to decide
how to interpret, explain, and use that data. Data cannot speak by themselves. Data
scientists are interpreters, offering one interpretation of what the useful narrative
story derived from the data is, if there is one at all.
4.6 Conclusions
In this chapter we have seen how we can approach the problem of making probable
propositions regarding population parameters.
We have learned that in some cases, there are theoretical results that allow us to
compute a measure of the variability of our estimates. We have called this approach
the “traditional approach”. Within this framework, we have seen that the sampling
distribution of our parameter of interest is the most important concept when under-
standing the real meaning of propositions concerning parameters.
We have also learned that the traditional approach is not the only alternative. The
“computationally intensive approach”, based on the bootstrap method, is a relatively
new approach that, based on intensive computer simulations, is capable of computing
a measure of the variability of our estimates by applying a resampling method to
our data sample. Bootstrapping can be used for computing variability of almost any
function of our data, with its only downside being the need for greater computational
resources.
We have seen that propositions about parameters can be classified into three
classes: propositions about point estimates, propositions about set estimates, and
propositions about the acceptance or the rejection of a hypothesis. All these classes
are related; but today, set estimates and hypothesis testing are the most preferred.
Finally, we have shown that the production of probable propositions is not error free, even in the presence of big data. For this reason, data scientists cannot forget
that after any inference task, they must take decisions regarding the final interpretation
of the data.
Acknowledgements This chapter was co-written by Jordi Vitrià and Sergio Escalera.
Supervised Learning
5
5.1 Introduction
Machine learning involves coding programs that automatically adjust their perfor-
mance in accordance with their exposure to information in data [1]. This learning is
achieved via a parameterized model with tunable parameters that are automatically
adjusted according to different performance criteria. Machine learning can be con-
sidered a subfield of artificial intelligence (AI) and we can roughly divide the field
into the following three major classes.
1. Supervised learning: Algorithms which learn from a training set of labeled ex-
amples (exemplars) to generalize to the set of all possible inputs. Examples of
techniques in supervised learning: logistic regression, support vector machines,
decision trees, random forest, etc.
2. Unsupervised learning: Algorithms that learn from a training set of unlabeled
examples. Used to explore data according to some statistical, geometric or sim-
ilarity criterion. Examples of unsupervised learning include k-means clustering
and kernel density estimation. We will see more on this kind of techniques in
Chap. 7.
3. Reinforcement learning: Algorithms that learn via reinforcement from criticism
that provides information on the quality of a solution, but not on how to improve
it. Improved solutions are achieved by iteratively exploring the solution space.
– Given the results of a clinical test, e.g., does this patient suffer from diabetes?
– Given a magnetic resonance image, is it a tumor shown in the image?
– Given the past activity associated with a credit card, is the current operation
fraudulent?
– Given the description of an apartment, what is the expected market value of the
flat? What will the value be if the apartment has an elevator?
– Given the past records of user activity on Apps, how long will a certain client
be connected to our App?
– Given my skills and marks in computer science and maths, what mark will I
achieve in a data science course?
Observe that some problems can be solved using both regression and classification.
As we will see later, many classification algorithms are thresholded regressors. There
is a certain skill involved in designing the correct question and this dramatically
affects the solution we obtain.
In this chapter we use data from the Lending Club1 to develop our understanding
of machine learning concepts. The Lending Club is a peer-to-peer lending company.
It offers loans which are funded by other people. In this sense, the Lending Club
acts as a hub connecting borrowers with investors. The client applies for a loan of a
certain amount, and the company assesses the risk of the operation. If the application
is accepted, it may or may not be fully covered. We will focus on the prediction
of whether the loan will be fully funded, based on the scoring of and information
related to the application.
We will use the partial dataset of period 2007–2011. Framing the problem a little
bit more, based on the information supplied by the customer asking for a loan, we
want to predict whether it will be granted up to a certain threshold .thr . The attributes
we use in this problem are related to some of the details of the loan application, such as the amount of the loan applied for by the borrower, the monthly payment to be made by the borrower if the loan is accepted, the borrower's annual income, the number of
1 https://www.lendingclub.com/info/download-data.action.
incidences of delinquency in the borrower’s credit file, and interest rate of the loan,
among others.
In this case we would like to predict unsuccessful accepted loans. A loan applica-
tion is unsuccessful if the funded amount (funded_amnt) or the amount funded
by investors (funded_amnt_inv) falls far short of the requested loan amount
(loan_amnt). That is,
$$\frac{loan - funded}{loan} \geq 0.95.$$
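As a sketch, this criterion could be turned into a binary flag on a raw Lending Club table (loans is an assumed DataFrame with the columns named above; the chapter itself loads an already preprocessed dataset below):
# Sketch: flag applications whose funded amount falls far short of the request,
# mirroring the criterion above ('loans' and its columns are assumptions here).
ratio = (loans['loan_amnt'] - loans['funded_amnt_inv']) / loans['loan_amnt']
unsuccessful = ratio >= 0.95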
Note that in this problem we are predicting a binary value: either the loan is fully
funded or not. Classification is the natural choice of machine learning tools for
prediction with discrete known outcomes. According to the cardinality of the target
set, one usually distinguishes between binary classifiers when the target output only
takes two values, i.e., the classifier answers questions with a yes or a no; or multiclass
classifiers, for a larger number of classes. This issue is important in that not all
methods can naturally handle the multiclass setting.2
In a formal way, classification is regarded as the problem of finding a function h(x): R^d → K that maps an input space in R^d onto a discrete set of k target outputs or classes K = {1, . . . , k}. In this setting, the features are arranged as a vector x of d real-valued numbers.3
We can encode both target states in a numerical variable, e.g., a successful loan target can take value +1; and it is −1, otherwise.
Let us check the dataset,
In [1]:
import pickle
ofname = open(’./files/ch05/dataset_small.pkl’,’rb’)
# x stores input data and y target values
(x,y) = pickle.load(ofname, encoding="latin1")
2 Several well-known techniques such as support vector machines or adaptive boosting (adaboost)
are originally defined in the binary case. Any binary classifier can be extended to the multiclass case
in two different ways. We may either change the formulation of the learning/optimization process.
This requires the derivation of a new learning algorithm capable of handling the new modeling.
Alternatively, we may adopt ensemble techniques. The idea behind this latter approach is that we
may divide the multiclass problem into several binary problems; solve them; and then aggregate the
results. If the reader is interested in these techniques, it is a good idea to look for: one-versus-all,
one-versus-one, or error correcting output codes methods.
3 Many problems are described using categorical data. In these cases, either we need classifiers
that are capable of coping with this kind of data or we need to change the representation of those
variables into numerical values.
• Input data is structured in Numpy arrays. The size of the array is expected to be
[n_samples, n_features]:
$$\text{feature matrix}: \quad X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ x_{31} & x_{32} & \cdots & x_{3d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$
$$\text{label vector}: \quad \mathbf{y}^T = [y_1, y_2, y_3, \cdots, y_n]$$
The number of features must be fixed in advance. However, it can be very large (e.g., millions of features).
In [2]:
dims = x.shape[1]
N = x.shape[0]
print("dims: " + str(dims)+’, samples: ’+ str(N))
All objects in Scikit-learn share a uniform and limited API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions, and a transformer interface for converting data.
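The cells that produced the accuracy below are not included in this excerpt; a sketch consistent with the text that follows (a nearest neighbor classifier with eleven neighbors, fitted and scored on the full dataset) would be:
# Sketch of the omitted cells: fit an 11-nearest-neighbor classifier on all the
# data and measure its accuracy on that same data.
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier(n_neighbors=11)
knn.fit(x, y)
knn.score(x, y)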
Out[4]: 0.83164251207729467
It looks like a really good result. But how good is it? Let us first understand a little
bit more about the problem by checking the distribution of the labels.
Let us load the dataset and check the distribution of labels:
In [5]:
plt.pie(np.c_[np.sum(np.where(y == 1, 1, 0)),
np.sum(np.where(y == -1, 1, 0))][0],
labels = [’Not fully funded’,’Full amount’],
colors = [’r’, ’g’],shadow = False,
autopct = ’%.2f’ )
plt.gcf().set_size_inches((7, 7))
4 The term unbalanced describes the condition of data where the ratio between positives and negatives is a small value. In these scenarios, always predicting the majority class usually yields accurate performance, though it is not very informative. This kind of problem is very common when we want to model unusual events such as rare diseases, the occurrence of a failure in machinery, fraudulent credit card operations, etc. In these scenarios, gathering data from usual events is very easy but collecting data from unusual events is difficult and results in a comparatively small dataset.
as we will see later on. In particular, a very simple rule, such as always predicting the majority class, will give us good performance. In our problem, always predicting that the loan will be fully funded correctly predicts 81.57% of the samples. Observe that this value is very close to that obtained using the classifier.
Although accuracy is the most common metric for evaluating classifiers, there are cases when the business value of correctly predicting elements from one class is
different from the value for the prediction of elements of another class. In those
cases, accuracy is not a good performance metric and more detailed analysis is
needed. The confusion matrix enables us to define different metrics considering such
scenarios. The confusion matrix considers the concepts of the classifier outcome and
the actual ground truth or gold standard. In a binary problem, there are four possible
cases:
• True positives (TP): When the classifier predicts a sample as positive and it really
is positive.
• False positives (FP): When the classifier predicts a sample as positive but in fact
it is negative.
• True negatives (TN): When the classifier predicts a sample as negative and it really
is negative.
• False negatives (FN): When the classifier predicts a sample as negative but in fact
it is positive.
                                Gold Standard
                         Positive          Negative
Prediction   Positive       TP                FP        → Precision
             Negative       FN                TN        → Negative Predictive Value
                             ↓                 ↓
                        Sensitivity       Specificity
                          (Recall)
• Accuracy:
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
• Column-wise we find these two partial performance metrics:
  – Sensitivity or Recall:
$$\text{sensitivity} = \frac{TP}{\text{Real Positives}} = \frac{TP}{TP + FN}$$
  – Specificity:
$$\text{specificity} = \frac{TN}{\text{Real Negatives}} = \frac{TN}{TN + FP}$$
In [6]:
yhat = knn.predict(x)
TP = np.sum(np.logical_and(yhat == -1, y == -1))
TN = np.sum(np.logical_and(yhat == 1, y == 1))
FP = np.sum(np.logical_and(yhat == -1, y == 1))
FN = np.sum(np.logical_and(yhat == 1, y == -1))
print("TP: "+ str(TP), ", FP: "+ str(FP))
print("FN: "+ str(FN), ", TN: "+ str(TN))
Let us check the following example. Let us select a nearest neighbor classifier
with the number of neighbors equal to one instead of eleven, as we did before, and
check the training error.
In [8]:
#Train a classifier using .fit()
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(x,y)
yhat=knn.predict(x)
In [9]:
# Simulate a real case: Randomize and split data into
# two subsets PRC*100% for training and the rest
# (1-PRC)*100% for testing
perm = np.random.permutation(y.size)
PRC = 0.7
split_point = int(np.ceil(y.shape[0]*PRC))
X_train = x[perm[:split_point].ravel(),:]
y_train = y[perm[:split_point].ravel()]
X_test = x[perm[split_point:].ravel(),:]
y_test = y[perm[split_point:].ravel()]
As expected from the former experiment, we achieve a perfect score. Now let us
see what happens in the simulation with previously unseen data.
In [11]:
#Check on the test set
yhat=knn.predict(X_test)
print("TESTING STATS:")
print("classification accuracy:", metrics.accuracy_score(
yhat, y_test))
print("confusion matrix: \n"+ str(metrics.confusion_matrix(
yhat,y_test)))
Observe that each time we run the process of randomly splitting the dataset and training a classifier, we obtain a different performance. A good simulation for approximating the test error is to run this process many times and average the performances. Let us do this!5
In [12]:
PRC = 0.3
acc = np.zeros((10,))
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(x,
                                                        y, test_size=PRC)
    knn = neighbors.KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)
    yhat = knn.predict(X_test)
    acc[i] = metrics.accuracy_score(yhat, y_test)
acc.shape = (1, 10)
print("Mean expected error: " + str(np.mean(acc[0])))
print("Mean expected error: "+str(np.mean(acc[0])))
• In-sample error E_in: The in-sample error or training error is the error measured over all the observed data samples in the training set, i.e.,
$$E_{in} = \frac{1}{N} \sum_{i=1}^{N} e(\mathbf{x}_i, y_i)$$
5 sklearn allows us to easily automate the train/test splitting using the function
train_test_split(...).
Note that the definition of the instantaneous error e(x_i, y_i) is still missing. For example, in classification we could use the indicator function to account for a correctly classified sample as follows:
$$e(\mathbf{x}_i, y_i) = I[h(\mathbf{x}_i) = y_i] = \begin{cases} 1, & \text{if } h(\mathbf{x}_i) = y_i \\ 0, & \text{otherwise.} \end{cases}$$
Observe that:
$$E_{out} \geq E_{in}$$
Using the expected error on the test set, we can select the best classifier for
our application. This is called model selection. In this example we cover the most
simplistic setting. Suppose we have a set of different classifiers and want to select
the “best” one. We may use the one that yields the lowest error rate.
In [13]:
PRC = 0.1
acc_r=np.zeros((10,4))
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(x,
                                                        y, test_size=PRC)
    nn1 = neighbors.KNeighborsClassifier(n_neighbors=1)
    nn3 = neighbors.KNeighborsClassifier(n_neighbors=3)
    svc = svm.SVC()
    dt = tree.DecisionTreeClassifier()
    nn1.fit(X_train, y_train)
    nn3.fit(X_train, y_train)
    svc.fit(X_train, y_train)
    dt.fit(X_train, y_train)
    yhat_nn1 = nn1.predict(X_test)
    yhat_nn3 = nn3.predict(X_test)
    yhat_svc = svc.predict(X_test)
    yhat_dt = dt.predict(X_test)
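The lines that fill acc_r are not included in this excerpt; a completion consistent with its (10, 4) shape (and assuming svm, tree and metrics were imported earlier) would store each classifier's test accuracy inside the loop:
    # Sketch of the omitted lines (inside the for loop): store each accuracy.
    acc_r[i][0] = metrics.accuracy_score(yhat_nn1, y_test)
    acc_r[i][1] = metrics.accuracy_score(yhat_nn3, y_test)
    acc_r[i][2] = metrics.accuracy_score(yhat_svc, y_test)
    acc_r[i][3] = metrics.accuracy_score(yhat_dt, y_test)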
This process is one particular form of a general model selection technique called
cross-validation. There are other kinds of cross-validation, such as leave-one-out or
K-fold cross-validation.
Let us recall the two basic values defined in the last section. We talk of training error
or in-sample error, . E in , which refers to the error measured over all the observed data
samples in the training set. We also talk of test error or generalization error, . E out ,
as the error expected on unseen data [2].
We can empirically estimate the generalization error by means of cross-validation techniques and observe that:
$$E_{out} \geq E_{in}.$$
The goal of learning is to minimize the generalization error; but how can we
guarantee this minimization using only training data?
From the above inequality it is easy to derive a couple of very intuitive ideas.
• Additionally, we also want the training error behavior to track the generalization error so that if one minimizes the in-sample error, the out-of-sample error follows, i.e.,
$$E_{out} \approx E_{in}.$$
Let us simulate the effect of the number of examples on the training and test errors
for a given complexity. This curve is called the learning curve. We will focus for a
moment in a more simple case. Consider the toy problem in Fig. 5.3.
Let us take a classifier and vary the number of examples we feed it for training
purposes, then check the behavior of the training and test accuracies as the number
of examples grows. In this particular case, we will be using a decision tree with fixed
maximum depth.
Observing the plot in Fig. 5.4, we can see that:
• As the number of training samples increases, both errors tend to the same value.
• When we have few training data, the training error is very small but the test error
is very large.
Now check the learning curve when the degree of complexity is greater in Fig. 5.5.
We simulate this effect by increasing the maximum depth of the tree.
And if we put both curves together, we have the results shown in Fig. 5.6.
Although both show similar behavior, we can note several differences:
6 The reader should note that there are several bounds in machine learning to characterize the
Fig. 5.4 Learning curves (training and test errors) for a model with a high degree of complexity
• With a low degree of complexity, the training and test errors converge to the bias
sooner/with fewer data.
• Moreover, with a low degree of complexity, the error of convergence is larger than
with increased complexity.
The value both errors converge towards is also called the bias; and the differ-
ence between this value and the test error is called the variance. The bias/variance
decomposition of the learning curve is an alternative approach to the training and
generalization view.
Fig. 5.5 Learning curves (training and test errors) for a model with a low degree of complexity
Fig. 5.6 Learning curves (training and test errors) for models with a low and a high degree of
complexity
Let us now plot the learning behavior for a fixed number of examples with respect
to the complexity of the model. We may use the same data but now we will change
the maximum depth of the decision tree, which governs the complexity of the model.
Observe in Fig. 5.7 that as the complexity increases the training error is reduced;
but above a certain level of complexity, the test error also increases. This effect is
called overfitting. We may enact several cures for overfitting:
Fig. 5.7 Learning curves (training and test errors) for a fixed number of data samples, as the
complexity of the decision tree increases
These terms are added to the objective function. They trade off with the error
function in the objective and are governed by a hyperparameter. Thus, we still
have to select this parameter by means of model selection.
• We can use “ensemble techniques”. A third cure for overfitting is to use ensemble
techniques. The best known are bagging and boosting.
Going back to our problem, we have to select a model and control its complexity
according to the number of training data. In order to do this, we can start by using a
model selection technique [3]. We have seen model selection before when we wanted
to compare the performance of different classifiers. In that case, our best bet was to
select the classifier with the smallest E_out. Analogous to model selection, we may
think of selecting the best hyperparameters as choosing the classifier with the parameters
that perform best. Thus, we may select a set of hyperparameter values and use
cross-validation to select the best configuration.
The process of selecting the best hyperparameters is called validation. This intro-
duces a new set into our simulation scheme; we now need to divide the data we have
into three sets: training, validation, and test sets. As we have seen, the process of
assessing the performance of the classifier by estimating the generalization error is
called testing. And the process of selecting a model using the estimation of the gen-
eralization error is called validation. There is a subtle but critical difference between
the two and we have to be aware of it when dealing with our problem.
• Test data is used exclusively for assessing performance at the end of the process
and will never be used in the learning process.7
• Validation data is used explicitly to select the parameters/models with the best
performance according to an estimation of the generalization error. This is a form
of learning.
• Training data are used to learn the instance of the model from a model class.
In practice, we are just given training data, and in the most general case we
explicitly have to tune some hyperparameter. Thus, how do we select the different
splits?
How we do this will depend on the questions regarding the method that we want
to answer:
• Let us say that our customer asks us to deliver a classifier for a given problem. If
we just want to provide the best model, then we may use cross-validation on our
training dataset and select the model with the best performance. In this scenario,
when we return the trained classifier to our customer, we know that it is the one
that achieves the best performance. But if the customer asks about the expected
performance, we cannot say anything.
A practical issue: once we have selected the model, we use the complete training
set to train the final model.
• If we want to know about the performance of our model, we have to use unseen
data. Thus, we may proceed in the following way (a code sketch of this protocol is given after the list):
1. Split the original dataset into training and test data. For example, use 30% of
the original dataset for testing purposes. This data is held back and will only be
used to assess the performance of the method.
2. Use the remaining training data to select the hyperparameters by means of
cross-validation.
3. Train the model with the selected parameter and assess the performance using
the test dataset.
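The following is a hedged sketch of this three-step protocol, not the book's exact notebook cells; the data X, y and the grid of maximum depths are assumptions used only for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples = 1000, n_features = 10, random_state = 0)

# 1. Hold back 30% of the data; it is only used for the final assessment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.30, random_state = 0)

# 2. Select the hyperparameter by cross-validation on the training data
search = GridSearchCV(DecisionTreeClassifier(random_state = 0),
                      {"max_depth": list(range(1, 11))}, cv = 10)
search.fit(X_train, y_train)

# 3. GridSearchCV refits the best model on the whole training set (refit=True
#    by default); assess it once on the held-back test data
print("selected parameters:", search.best_params_)
print("test accuracy: %.3f" % search.score(X_test, y_test))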
7 This set cannot be used to select a classifier, model or hyperparameter; nor can it be used in any
decision process.
A practical issue: Observe that by splitting the data into three sets, the classifier
is trained with a smaller fraction of the data.
• If we want to make a good comparison of classifiers but we do not care about
the best parameters, we may use nested cross-validation. Nested cross-validation
runs two cross-validation processes. An external cross-validation is used to assess
the performance of the classifier and in each loop of the external cross-validation
another cross-validation is run with the remaining training set to select the best
parameters.
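In scikit-learn, nested cross-validation can be written by wrapping a GridSearchCV object (the inner loop, for parameter selection) inside cross_val_score (the outer loop, for assessment). This is only a sketch under assumed data X, y and an arbitrary grid.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples = 1000, n_features = 10, random_state = 0)

# Inner loop: hyperparameter selection; outer loop: performance assessment
inner = GridSearchCV(DecisionTreeClassifier(random_state = 0),
                     {"max_depth": list(range(1, 11))}, cv = 3)
outer_scores = cross_val_score(inner, X, y, cv = 5)

print("estimated accuracy: %.3f +/- %.3f"
      % (outer_scores.mean(), outer_scores.std()))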
If we want to select the best complexity of a decision tree, we can use tenfold cross-
validation checking for different complexity parameters. If we change the maximum
depth of the method, we obtain the results in Fig. 5.8.
In [14]:
# Create a 10-fold cross-validation set
from sklearn.model_selection import KFold

kf = KFold(n_splits = 10, shuffle = True, random_state = 0)
Checking Fig. 5.8, we can see that the best average accuracy is obtained by the
fifth model, a maximum depth of 6. Although we can report that the best accuracy
is estimated to be found with a complexity value of 6, we cannot say anything about
the value it will achieve. In order to have an estimation of that value, we need to run
the model on a new set of data that are completely unseen, both in training and in
model selection (the model selection value is positively biased). Let us put everything
together. We will be considering a simple train_test split for testing purposes and
then run cross-validation for model selection.
Fig. 5.8 Box plot showing accuracy for different complexities of the decision tree
In [15]:
# Train_test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
If we run the output of this code, we observe that the best accuracy is provided
by the fourth model. In this example it is a model with complexity 5.⁸ The selected
model achieves a success rate of 0.83423816 in validation. We then train the model
with the complete training set and verify its test accuracy.
In [16]:
# Train the model with the complete training set with the selected complexity
dt = tree.DecisionTreeClassifier(min_samples_leaf = 1,
                                 max_depth = C[np.argmax(np.mean(acc, axis = 0))])
dt.fit(X_train, y_train)
As expected, the value is slightly reduced; it achieves 0.82608. Finally, the model
is trained with the complete dataset. This will be the model used in exploitation and
we expect to achieve an accuracy rate of at least 0.82608.
In [17]:
# Train the final model
dt = tree.DecisionTreeClassifier(min_samples_leaf = 1,
max_depth = C[np.argmax(np.mean(acc, axis = 0))])
dt.fit(X, y)
8 This reduction in the complexity of the best model should not surprise us. Remember that com-
plexity and the number of examples are intimately related for the learning to succeed. By using a
test set we perform model selection with a smaller dataset than in the former case.
5.7 Two Learning Models
Let us return to our problem and check the performance of different models. There
are many learning models in the machine learning literature. However, in this short
introduction we focus on two of the most important and pragmatically effective
approaches9 : support vector machines (SVM) and random forests (RF).
Before going into some of the details of the models selected, let us check the com-
ponents of any learning algorithm. In order to be able to learn, an algorithm has to
define at least three components:
• The model class/hypothesis space defines the family of mathematical models that
will be used. The target decision boundary will be approximated from one element
of this space. For example, we can consider the class of linear models. In this case
our decision boundary will be a line if the problem is defined in R² and the model
class is the space of all possible lines in R².
Model classes define the geometric properties of the decision function. There are
different taxonomies but the best known are the families of linear and nonlinear
models. These families usually depend on some parameters; and the solution to a
learning problem is the selection of a particular set of parameters, i.e., the selection
of an instance of a model from the model class space. The model class space is
also called the hypothesis space.
The selection of the best model will depend on our problem and what we want
to obtain from the problem. The primary goal in learning is usually to achieve
the minimum error/maximum performance; but according to what else we want
from the algorithm, we can come up with different algorithms. Other common
desirable properties are interpretability, behavior when faced with missing data,
fast training, etc.
• The problem model formalizes and encodes the desired properties of the solution.
In many cases, this formalization takes the form of an optimization problem. In its
most basic instantiation, the problem model can be the minimization of an error
function. The error function measures the difference between our model and the
target. Informally speaking, in a classification problem it measures how “irritated”
we are when our model misses the right label for a training sample. For example,
in classification, the ideal error function is the 0–1 loss. This function takes value
1 when we incorrectly classify a training sample and zero otherwise. In this case,
we can interpret it by saying that we are only irritated by “one unit of irritation”
when one sample is misclassified.
9 These techniques have been shown to be two of the most powerful families for classification [4].
The problem model can also be used to impose other constraints on our solution,10
such as finding a smooth approximation, a model with a low degree of complexity,
a sparse solution, etc.
• The learning algorithm is an optimization/search method or algorithm that, given
a model class, fits it to the training data according to the error function. According
to the nature of our problem there are many different algorithms. In general, we
are talking about finding the minimum error approximation or maximum probable
model. In those cases, if the problem is convex/quasi-convex we will typically use
first- or second-order methods (i.e., gradient descent, coordinate descent, Newton’s
method, interior point methods, etc.). Other searching techniques such as genetic
algorithms or Monte Carlo techniques can be used if we do not have access to the
derivatives of the objective function.
SVM is a learning technique initially designed to fit a linear boundary between the
samples of a binary problem, ensuring the maximum robustness in terms of tolerance
to isotropic uncertainty. This effect is observed in Fig. 5.9. Note that the boundary
displayed has the largest distance to the closest point of both classes. Any other
separating boundary will have a point of a class closer to it than this one. The figure
also shows the closest points of the classes to the boundary. These points are called
support vectors. In fact, the boundary only depends on those points. If we remove
any other point from the dataset, the boundary remains intact. However, in general,
if any of these special points is removed the boundary will change.
Fig. 5.9 Support vector machine decision boundary and the support vectors
A first formulation of the separating hyperplane problem is:

minimize    1
subject to  y_i (a^T x_i + b) ≥ 1,  ∀ x_i ∈ D

11 Note the strict inequalities in the formulation. Informally, we can consider the smallest satisfied
constraint, and observe that the rest must be satisfied with a larger value. Thus, we can arbitrarily
set that value to 1 and rewrite the problem as

a^T s_i + b ≥ 1   and   a^T r_i + b ≤ −1.
The solution of this problem is not unique. Selecting the maximum margin hyperplane
requires us to add a new constraint to our problem. Remember from the
geometry of the hyperplane that the distance of any point x to the hyperplane π is given by:

d(x, π) = (a^T x + b) / ||a||_2.

Recall also that we want positive data to be beyond value 1 and negative data
below −1. Thus, what is the distance value we want to maximize?
The positive point closest to the boundary is at 1/||a||_2 and the negative point
closest to the boundary is also at 1/||a||_2. Thus, data points from different
classes are at least 2/||a||_2 apart.
Recall that our goal is to find the separating hyperplane with maximum margin,
i.e., with maximum distance between elements in the different classes. Thus, we can
complete the former formulation with our last requirement as follows:
minimize    ||a||_2 / 2
subject to  y_i (a^T x_i + b) ≥ 1,  ∀ x_i ∈ D
This formulation has a solution as long as the problem is linearly separable.
In order to deal with misclassifications, we are going to introduce a new set of
variables ξ_i that represent the amount of violation of the i-th constraint. If the
constraint is already satisfied, then ξ_i = 0, while ξ_i > 0 otherwise. Because ξ_i is
related to the errors, we would like to keep this amount as close to zero as possible.
This makes us introduce an element in the objective trade-off with the maximum
margin.
The new model becomes:
minimize    ||a||_2 / 2 + C Σ_{i=1}^{N} ξ_i
subject to  y_i (a^T x_i + b) ≥ 1 − ξ_i,  i = 1, …, N
            ξ_i ≥ 0,
where C is the trade-off parameter that roughly balances the margin and the misclassification rate. This formulation is also called the soft-margin SVM.
The larger the value of C, the more importance one gives to the error, i.e., the
method will be more accurate according to the data at hand, at the cost of being more
sensitive to variations of the data.
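The effect of C can be explored directly with scikit-learn's SVC. The snippet below is only an illustration on a synthetic, slightly overlapping binary problem; the values of C are arbitrary.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples = 500, n_features = 2, n_informative = 2,
                           n_redundant = 0, class_sep = 1.0, random_state = 0)

for C in (0.01, 1, 100):
    # Larger C penalizes margin violations more strongly
    scores = cross_val_score(SVC(kernel = "linear", C = C), X, y, cv = 5)
    print("C = %-6g  cross-validated accuracy = %.3f" % (C, scores.mean()))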
The decision boundary of most problems cannot be well approximated by a linear
model. In SVM, the extension to the nonlinear case is handled by means of kernel
theory. In a pragmatic way, a kernel can be seen as any function that captures the
similarity between any two samples in the training set. The kernel has to be a
positive semi-definite function. The best-known kernels are the following:

• Linear kernel: k(x_i, x_j) = x_i^T x_j
• Polynomial kernel: k(x_i, x_j) = (1 + x_i^T x_j)^p
• Radial Basis Function (RBF) kernel: k(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))
Note that selecting a polynomial or a Radial Basis Function kernel means that we
have to adjust a second parameter, p or σ, respectively. As a practical summary, the
SVM method will depend on two parameters (C, γ) that have to be chosen carefully
using cross-validation to obtain the best performance.
Random Forest (RF) is the other technique that is considered in this work. RF is
an ensemble technique. Ensemble techniques rely on combining different classifiers
using some aggregation technique, such as majority voting. As pointed out earlier,
ensemble techniques usually have good properties for combating overfitting. In this
case, the aggregation of classifiers using a voting technique reduces the variance of
the final classifier. This increases the robustness of the classifier and usually achieves
a very good classification performance. A critical issue in the ensemble of classifiers
is that for the combination to be successful, the errors made by the members of the
ensemble should be as uncorrelated as possible. This is sometimes referred to in the
literature as the diversity of the classifiers. As the name suggests, the base classifiers
in RF are decision trees.
Tackling the first question leads to different strategies for creating the decision trees.
However, most techniques share the axis-orthogonal hyperplane partition policy,
i.e., a threshold on a single feature; for example, in our problem, “Does the applicant
have a home mortgage?”. This is the key that allows the results of this method to be
interpreted. In decision trees, the second question is straightforward: each patch is
given the value of a label, e.g., the majority label, and all data falling in that part of
the space will be predicted as such.
The RF technique creates different trees over the same training dataset. The word
“random” in RF refers to the fact that only a subset of features is available to each
of the trees in its building process. The two most important parameters in RF are the
number of trees in the ensemble and the number of features each tree is allowed to
check.
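In scikit-learn these two parameters correspond to n_estimators and max_features of RandomForestClassifier. The following minimal sketch assumes data X, y; the particular values are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples = 1000, n_features = 20, random_state = 0)

rf = RandomForestClassifier(n_estimators = 200,      # number of trees
                            max_features = "sqrt",   # features examined per split
                            random_state = 0)
print("cross-validated accuracy: %.3f" % cross_val_score(rf, X, y, cv = 5).mean())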
With both techniques in mind, we are going to optimize and check the results using
nested cross-validation. Scikit-learn allows us to do this easily using several model
selection techniques. We will use a grid search, GridSearchCV (a cross-validation
using an exhaustive search over all combinations of parameters provided).
In [16]:
from sklearn.model_selection import KFold, GridSearchCV

parameters = {'C': [1e4, 1e5, 1e6],
              'gamma': [1e-5, 1e-4, 1e-3]}

N_folds = 5
kf = KFold(n_splits = N_folds, shuffle = True, random_state = 0)

acc = np.zeros((N_folds,))
i = 0
# We will build the predicted y from the partial
# predictions on the test set of each of the folds
yhat = y.copy()
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index, :], X[test_index, :]
    y_train, y_test = y[train_index], y[test_index]
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    clf = svm.SVC(kernel = 'rbf')
    clf = GridSearchCV(clf, parameters, cv = 3)
    clf.fit(X_train, y_train.ravel())
    X_test = scaler.transform(X_test)
    yhat[test_index] = clf.predict(X_test)

print(metrics.accuracy_score(yhat, y))
print(metrics.confusion_matrix(yhat, y))
The result obtained has a large error in the non-fully funded class (negative). This
is because the default scoring for cross-validation grid-search is mean accuracy. De-
pending on our business, this large error in recall for this class may be unacceptable.
There are different strategies for diminishing the impact of this effect. On the one
hand, we may change the default scoring and find the parameter setting that corre-
sponds to the maximum average recall. On the other hand, we could mitigate this
effect by imposing a different weight on an error on the critical class. For example,
we could look for the best parameterization such that one error on the critical class
is equivalent to one thousand errors on the noncritical class. This is important in
business scenarios where monetization of errors can be derived.
Consider that clients using our service yield a profit of 100 units per client (we will use
abstract units but keep in mind that this will usually be accounted in euros/dollars). We
design a campaign with the goal of attracting investors in order to cover all non-fully
funded loans. Let us assume that the cost of the campaign is α units per client. With
this policy we expect to keep our customers satisfied and engaged with our service,
so they keep using it. Analyzing the confusion matrix we can give precise meaning
to different concepts in this campaign. The real positive set (TP + FN) consists of
the number of clients that are fully funded. According to our assumption, each of
these clients generates a profit of 100 units. The total profit is 100 · (TP + FN). The
campaign to attract investors will be cast considering all the clients we predict are not
fully funded. These are those that the classifier predicts as negative, i.e., (FN + TN).
However, the campaign will only have an effect on the investors/clients that are
actually not funded, i.e., TN; and we expect to attract a certain fraction β of them.
After deploying our campaign, a simplified model of the expected profit is as follows:

100 · (TP + FN) − α · (TN + FN) + 100 · β · TN
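This profit model is easy to evaluate programmatically. The helper below is ours, not the book's, and the confusion-matrix entries and campaign parameters passed to it are arbitrary illustrative values.

def expected_profit(TP, FN, TN, alpha, beta):
    # Expected profit: 100*(TP + FN) - alpha*(TN + FN) + 100*beta*TN
    return 100 * (TP + FN) - alpha * (TN + FN) + 100 * beta * TN

# The campaign pays off when the expected profit exceeds the no-campaign
# baseline, which is simply 100*(TP + FN)
print(expected_profit(TP = 300, FN = 50, TN = 20, alpha = 10, beta = 0.6))
print(100 * (300 + 50))   # baseline without a campaign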
When optimizing the classifier for accuracy, we do not consider the business needs.
In this case, optimizing an SVM using cross-validation over different values of the
parameters C and γ, we obtain an accuracy of 85.60% and a confusion matrix with the
following values:

( 3371   590 )
(    6   173 )
If we check how the profit changes for different values of α and β, we obtain the plot
in Fig. 5.10. The figure shows two hyperplanes. The horizontal plane is the expected
profit if the campaign is not launched, i.e., 100 · (TP + FN). The other hyperplane
represents the profit of the campaign for different values of α and β using a particular
classifier. Remember that the cost of the campaign is given by α, and the success rate
of the campaign is represented by β. For the campaign to be successful we would
like to select values for both parameters so that the profit of the campaign is larger
than the cost of launching it. Observe in the figure that certain costs and attraction
rates result in losses.
We may launch different classifiers with different configurations and toy with
different weights (2, 4, 8, 16) for elements of different classes in order to bias the
Fig. 5.10 Surfaces for two different campaign and attraction factors. The horizontal plane corre-
sponds to the profit if no campaign is launched. The slanted plane is the profit for a certain confusion
matrix
classifier towards obtaining different values for the confusion matrix.12 The weights
define how much a misclassification in one class counts with respect to a misclassi-
fication in another. Figure 5.11 shows the different landscapes for different configu-
rations of the SVM classifier and RF.
In order to frame the problem, we consider a very successful campaign with a
60% investor attraction rate. We can ask several questions in this scenario.
Checking the values in Fig. 5.11, we find the results collected in Table 5.1. Observe
that the most profitable campaign at a 60% attraction rate corresponds to a classifier that considers
the cost of mistaking a sample from the non-fully funded class eight times larger
12 It is worth mentioning that another useful tool for visualizing the trade-off between true positives
and false positives in order to choose the operating point of the classifier is the receiver-operating
characteristic (ROC) curve. This curve plots the true positive rate/sensitivity/recall (TP/(TP+FN))
with respect to the false positive rate (FP/(FP+TN)).
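For reference, an ROC curve can be computed from any continuous classifier score with scikit-learn's roc_curve; the sketch below is illustrative only, using synthetic data and the SVM decision function as the score.

from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples = 1000, n_features = 10, random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

clf = SVC(kernel = "rbf").fit(X_train, y_train)
scores = clf.decision_function(X_test)       # continuous score per test sample

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("area under the ROC curve: %.3f" % auc(fpr, tpr))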
Fig. 5.11 3D surfaces of the profit obtained for different classifiers and configurations of retention
campaign cost and retention rate. a RF, b SVM with the same cost per class, c SVM with double
cost for the target class, d SVM with a cost for the target class equal to 4, e SVM with a cost for
the target class equal to 8, f SVM with a cost for the target class equal to 16
Table 5.1 Different configurations of classifiers and their respective profit rates and accuracies

Classifier       Max profit rate (%)   Profit rate at 60% (%)   Accuracy (%)
Random forest    4.41                  2.41                     87.87
than the one from the other class. Observe also that the accuracy in that case is much
worse than in other configurations.
The take-home idea of this section is that business needs are often not aligned with
the notion of accuracy. In such scenarios, the confusion matrix values have specific
meanings. This must be taken into account when tuning the classifier.
5.10 Conclusion
In this chapter we have seen the basics of machine learning and how to apply learning
theory in a practical case using Python. The example in this chapter is a basic one
in which we can safely assume the data are independent and identically distributed,
and that they can be readily represented in vector form. However, machine learning
may tackle many more different settings. For example, we may have different target
labels for a single example; this is called multilabel learning. Or, data can come
from streams or be time dependent; in these settings, sequential learning or sequence
learning can be the methods of choice. Moreover, each data example can be a non-
vector or have a variable size, such as a graph, a tree, or a string. In such scenarios
kernel learning or structural learning may be used. In recent years we have also seen
the revival of neural networks under the name of deep learning, achieving
impressive results in different domains such as computer vision and natural language
processing. Nonetheless, all of these methods will behave as explained in this chapter
and most of the lessons learned here can be readily applied to these techniques.
Acknowledgements This chapter was co-written by Oriol Pujol and Petia Radeva.
References
1. K. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press (2012)
2. C. Bishop, Pattern Recognition and Machine Learning, Springer (2006)
3. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Infer-
ence, and Prediction. Springer (2009)
4. M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers
to solve real world classification problems? J. Mach. Learn. Res. 15, 3133 (2014). http://jmlr.
org/papers/v15/delgado14a.html
Regression Analysis
6
6.1 Introduction
In this chapter, we introduce regression analysis and some of its applications in data
science. Regression is related to how to make predictions about real-world quantities
such as, for instance, the predictions alluded to in the following questions. How does
sales volume change with changes in price? How is sales volume affected by the
weather? How does the title of a book affect its sales? How does the amount of a
drug absorbed vary with the patient’s body weight; and does this relationship depend
on blood pressure? At what time should I go home to avoid traffic jams? What is the
chance of rain on the next two Mondays; and what is the expected temperature?
All these questions have a common structure: they ask for a response that can
be expressed as a combination of one or more (independent) variables (also called
covariates, predictors or regressors). The role of regression is to build a model to
predict the response from the covariates. This process involves the transition from
data to model.
More specifically, the model can be useful in different tasks, such as the following:
(1) understanding the behavior of data (the relation between the response and the
variables), (2) predicting data values (whether continuous or discrete), and (3) finding
important predictors.
In order to understand how a regression model can be suitable for tackling these
tasks, we will introduce three practical cases for which we use three real datasets and
solve different questions. These practical cases will motivate simple linear regression,
multiple linear regression, and logistic regression, as presented in the following
sections.
y = a_0 + a_1 x_1 + · · · + a_d x_d.    (6.1)

The variables a_i are termed the parameters or coefficients of the model, and the
parameter a_0 is called the intercept or the constant term.
This equation can be rewritten in a more compact (matrix) form: y = Xw,
where y = (y_1, y_2, …, y_n)^T is the vector of responses, X is the n × d matrix whose
j-th row contains the values (x_{j1}, …, x_{jd}) of the variables for the j-th sample, and
w = (a_1, a_2, …, a_d)^T is the vector of coefficients.
In the matrix form we can add a constant term by changing the matrix X to (1, X).
Linear regression is the technique for creating these linear models, and Eq. (6.1) is
called the multiple linear regression model since it has multiple (d) variables, in contrast
to the simple linear regression model, which has a single variable.
Fig. 6.1 Illustration of different simple linear regression models. Blue points correspond to a set
of random points sampled from a univariate normal (Gaussian) distribution. Red, green and yellow
lines are three different simple linear regression models
Ordinary least squares (OLS) is the simplest and most common estimator in which
the parameters (the a's) are chosen to minimize the square of the distance between the
predicted values and the actual values with respect to a_0, a_1:

||a_0 + a_1 x − y||_2^2 = Σ_{j=1}^{n} (a_0 + a_1 x_{1j} − y_j)².
This error expression is often called the sum of squared errors of prediction (SSE).
The SSE function is quadratic in the parameters, w, with a positive-definite Hessian,
and therefore this function possesses a unique global minimum at ŵ = (â_0, â_1).
The resulting model is represented as follows: ŷ = â_0 + â_1 x, where the hats on the
variables represent the fact that they are estimated from the data available.
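As a small numeric illustration (not one of the book's case studies), np.polyfit with degree 1 minimizes exactly this sum of squared errors and recovers â_0 and â_1 on synthetic data with known coefficients.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size = 100)
y = 2.0 + 0.5 * x + rng.normal(scale = 0.3, size = 100)   # true a0 = 2.0, a1 = 0.5

a1_hat, a0_hat = np.polyfit(x, y, deg = 1)   # highest-degree coefficient first
print("a0_hat = %.3f   a1_hat = %.3f" % (a0_hat, a1_hat))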
Note that with SSE we are concerned with the y-axis distance, since it does not
consider the error in the variables (x-axis). See Fig. 6.2 for an illustration of the
errors. The blue line is the result of the fitting and the black pointed lines correspond
to the vertical errors which are minimized.
There are alternative errors-in-variables regression models, such as total least
squares. We can also minimize other criteria, such as the sum of absolute differences
between the predicted values and the actual values (the sum of absolute errors, SAE):

Σ_{j=1}^{n} |a_0 + a_1 x_{1j} − y_j|,

with respect to a_0, a_1. In this case, distant values are penalized less, as can be seen in
Fig. 6.2 (orange line).
OLS is a popular approach for several reasons. It makes it computationally cheap
to calculate the coefficients. It is also easier to interpret than other more sophisticated
models. In situations where the main goal is to understand a simple model in detail,
rather than to estimate the response well, it can provide insight into what the model
Fig. 6.2 Fitting of a linear model minimizing the SSE (blue) and SAE (orange). The errors mini-
mized with OLS are depicted with black pointed lines
captures. Finally, in situations where there is a lot of noise, as in many real scenarios,
it may be hard to find the true functional form, so a constrained model can perform
quite well compared to a complex model which can be more affected by noise
(overfitting).
On the one hand, as already explained in the previous chapter, one of the key aspects
of supervised machine learning is model evaluation and validation. In the regression
problem there are two popular measures for evaluating the model fitting: the
coefficient of determination (R²) of the model and the mean squared error (MSE).
The coefficient R² is defined as:

R² = 1 − u/v.

The term u is the residual sum of squares:

u = Σ (y − ŷ)²,

where y is the observed response and ŷ is the predicted response. And the term v is
the total sum of squares:

v = Σ (y − ȳ)²,

where ȳ is the mean of the observed data.
The best possible score for R² is 1.0 (when u = 0); lower values are worse, and
R² is 0.0 when u = v. R² provides a measure of how well the observed data are
replicated by the model, based on the proportion of the total variance in the data that
is explained by the model.
The MSE is defined as:

MSE = (1/n) Σ (y − ŷ)².
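These definitions can be checked by hand against scikit-learn's metrics module. The sketch below reuses a synthetic simple-regression example; the variable names are ours.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size = 100)
y = 2.0 + 0.5 * x + rng.normal(scale = 0.3, size = 100)
a1_hat, a0_hat = np.polyfit(x, y, deg = 1)
y_hat = a0_hat + a1_hat * x

u = np.sum((y - y_hat) ** 2)        # residual sum of squares
v = np.sum((y - y.mean()) ** 2)     # total sum of squares
print("R^2 by hand:", 1 - u / v, "  sklearn:", r2_score(y, y_hat))
print("MSE by hand:", u / len(y), "  sklearn:", mean_squared_error(y, y_hat))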
On the other hand, as for any problem of supervised machine learning, after fitting
a linear regression model we can make predictions. When you evaluate
the predictive performance of your model, it is essential that the process is unbiased.
You cannot evaluate the predictive performance of a model with the same data you
used for the fitting; you need to evaluate the model with fresh data that has not been
seen by the model before. This can be accomplished by splitting your dataset into
training, (validation) and test sets before you use it.
Finally, let us try to answer the question: Could the coefficient R² of a model
be negative in some situation? The name R-squared can lead to confusion when
answering this question. In fact, R² can be negative if the model we fit is not correct,
for instance, if it does not use a constant/intercept term. In this case, we are assuming that
the model passes through the origin (0, 0), and therefore we are imposing a restriction on
the model that prevents us from obtaining a good fit. Note that R² cannot be negative when
a correct model (using a constant/intercept) is evaluated on the same data used
for the adjustment, since OLS will find a solution at least as good
as the mean of the observed data. However, R² can be negative when
we evaluate the model with data other than those used for the adjustment (the test set).
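The following illustrative snippet (synthetic data, not from the book) shows how forcing the model through the origin can make R² negative even on the data used for the fit.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size = (200, 1))
y = 10.0 - 0.1 * x[:, 0] + rng.normal(scale = 0.2, size = 200)  # large intercept

no_intercept = LinearRegression(fit_intercept = False).fit(x, y)
print("R^2 without intercept:", no_intercept.score(x, y))   # strongly negative
print("R^2 with intercept:   ", LinearRegression().fit(x, y).score(x, y))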
In this practical case, we pose the question: Is the climate really changing? More
concretely, we want to show the effect of climate change by determining whether
the amount of sea ice (area or extent) has decreased over the years. Sea ice area
refers to the total area covered by ice, whereas sea ice extent is the area of ocean with
at least 15% sea ice. Reliable measurement of sea ice edges began with the satellite
era in the late 1970s.
We will use the sea ice data from the National Snow & Ice Data Center,1 which
provides measurements of the area and extent of sea ice at the poles over the last
36 years. The center has given access to the archived monthly Sea Ice Index images
and data since 1979 [1]. The web-page instructions can be followed easily to access and
download the files.2 The ASCII data files tabulate sea ice extent and area (in millions
of square kilometers) by year for a given month.
In order to check whether there is an anomaly in the evolution of sea ice extent
over recent years, we want to build a simple linear regression model and analyze the
fitting; but before that we need to perform several processing steps. First, we read the
data, previously downloaded, and create a DataFrame (Pandas) as follows:
In [1]:
ice = pd.read_csv('files/ch06/SeaIce.txt',
                  delim_whitespace = True)
print('shape:', ice.shape)
1 https://nsidc.org/home.
2 https://nsidc.org/data/seaice_index/data-and-image-archive.
Next, we visualize the data. The lmplot() function from the Seaborn toolbox
is intended for exploring linear relationships of different forms in multidimensional
datasets. For instance, we can illustrate the relationship between the month of the
year (variable) and the extent (response) as follows:
In [3]:
import seaborn as sns
sns.lmplot(x = "mo", y = "extent", data = ice2,
           aspect = 2);
This outputs Fig. 6.3. We can observe a monthly fluctuation of the sea ice extent,
as would be expected for the different seasons of the year. We should normalize the
data before performing the regression analysis to avoid this fluctuation and be able to
study the evolution of the extent over the years. To capture the variation for a given
interval of time (month), we can compute the mean for the ith interval of time (using
the period from 1979 through 2014 for the mean extent) .μi , and subtract it from the
set of extent values for that month .{eij }. This value can be converted to a relative
percentage difference by dividing it by the total average (1979–2014) .μ, and then
multiplying by 100:
eij − μi
.ẽ j = 100 ∗ , i = 1, ..., 12.
i
μ
We implement this normalization and plot the relationship again as follows:
In [4]:
grouped = ice2.groupby('mo')
month_means = grouped.extent.mean()
month_variances = grouped.extent.var()
for i in range(12):
    ice2.extent.loc[ice2.mo == i+1] = 100 * (ice2.extent.loc[ice2.mo == i+1]
                                             - month_means[i+1]) / month_means.mean()
sns.lmplot(x = "mo", y = "extent", data = ice2, aspect = 2);
The new output is in Fig. 6.4. We now observe a comparable range of values for
all months.
Next, the normalized values can be plotted for the entire time series to analyze the
tendency. We compute the trend as a simple linear regression. We use the lmplot()
function for visualizing linear relationships between the year (variable) and the extent
(response).
In [5]:
sns.lmplot(x = "year", y = "extent", data = ice2,
aspect = 2);
This outputs Fig. 6.5 showing the regression model fitting the extent data. This
plot has two main components. The first is a scatter plot, showing the observed data
points. The second is a regression line, showing the estimated linear model relating
the two variables. Because the regression line is only an estimate, it is plotted with
a 95% confidence band to give an impression of the uncertainty in the model
(computed using bootstrapping).
In this figure, we can observe that the data show a long-term negative trend over
years. The negative trend can be attributed to global warming, although there is also
a considerable amount of variation from year to year.
Fig. 6.5 Regression model fitting sea ice extent data for all months by year using lmplot
Up until here, we have qualitatively shown the linear regression using a useful visu-
alization tool. We can also analyze the linear relationship in the data using the Scikit-
learn library, which allows a quantitative evaluation. As was explained in the previous
chapter, Scikit-learn provides an object-oriented interface centered around the con-
cept of an estimator. The sklearn.linear_model.LinearRegression es-
timator sets the state of the estimator based on the training data using the function
fit. Moreover, it allows the user to specify whether to fit an intercept term in the
object construction. This is done by setting the corresponding constructor arguments
of the estimator object as it is done in the next cell. During the fitting process,
the state of the estimator is stored in instance attributes that have a trailing under-
score (’_’). For example, the coefficients of a LinearRegression estimator are
stored in the attribute coef_. We fit a regression model using years as variables (.x)
and the extent values as the response (.y).
In [6]:
from sklearn.linear_model import LinearRegression
est = LinearRegression(fit_intercept = True)
x = ice2[[’year’]]
y = ice2[[’extent’]]
est.fit(x, y)
print("Coefficients:", est.coef_)
print("Intercept:", est.intercept_)
How well does the linear model fit the sea ice extent data over recent years? We can
perform this analysis for a particular month or for all months together, as done in the
following lines:
In [7]:
from sklearn import metrics
y_hat = est.predict(x)
print("MSE:", metrics.mean_squared_error(y, y_hat))
print("R^2:", metrics.r2_score(y, y_hat))
print("var:", y.var())
Given the model, we can also predict the extent value for the coming years. For
instance, the predicted extent for January 2025 can be computed as follows:
In [8]:
X = np.array([2025]).reshape(-1, 1)
y_hat = est.predict(X)
j = 1  # January
y_hat = (y_hat * month_means.mean() / 100) + month_means[j]
print("Prediction of extent for January 2025 "
      "(in millions of square km):", y_hat)
Out[8]: Prediction of extent for January 2025 (in millions of square km):
[[13.14449923]]
Multiple linear regression may seem a very simple model, but even when the response
depends on the variables in nonlinear ways, this model can still be used by considering
nonlinear transformations φ(·) of the variables:

y = a_1 φ(x_1) + · · · + a_d φ(x_d).
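A common way to apply such transformations in scikit-learn is to combine PolynomialFeatures with LinearRegression. The short sketch below is only illustrative, using synthetic data with a quadratic ground truth.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size = (300, 1))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(scale = 0.2, size = 300)

linear = LinearRegression().fit(x, y)
quadratic = make_pipeline(PolynomialFeatures(degree = 2),
                          LinearRegression()).fit(x, y)

print("R^2 linear:   ", linear.score(x, y))
print("R^2 quadratic:", quadratic.score(x, y))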
In this practical case we want to solve the question: Can we predict the price of a new
housing market given any of its attributes? We will use the Boston housing dataset
from Scikit-learn, which provides recorded measurements of 13 attributes of housing
markets around Boston, as well as the median house price. Once we load the dataset
(506 instances), the description of the dataset can easily be shown by printing the
field DESCR. The data (.x), feature names, and target (.y) are stored in other fields of
the dataset.
Task 1: We first consider the task of predicting median house values in the Boston
area using as the variable one of the attributes, for instance, LSTAT, defined as the
“proportion of lower status of the population”.
We compute the histogram of the prices and use the lmplot() function from
Seaborn visualization to show the linear relationships easily:
In [9]:
from sklearn import datasets
boston = datasets.load_boston()
X_boston, y_boston = boston.data, boston.target
features = boston.feature_names
print("Feature names: {}".format(boston.feature_names))
print("Shape of data: {} {}".format(X_boston.shape, y_boston.shape))
plt.hist(y_boston)
plt.xlabel('price (\$1000s)')
plt.ylabel('count')
df_boston = pd.DataFrame(boston.data, columns = boston.feature_names)
df_boston['PRICE'] = boston.target
sns.lmplot(x = "LSTAT", y = "PRICE", data = df_boston, aspect = 2);
Out[9]: Feature names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE'
'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
Shape of data: (506, 13) (506,)
Figure 6.6 shows the histogram of the prices. In Fig. 6.7, we can clearly see that
the relationship between PRICE and LSTAT is nonlinear, since the straight line is a
poor fit.
We can examine whether a better fit can be obtained by including higher-order
terms, for example, a quadratic model. The lmplot function allows us to easily change
the order of the model, as is done in the next code, which outputs Fig. 6.8, where we
observe a better fit.
Fig. 6.7 Scatter plot of Boston data (LSTAT versus price) and their linear relationship (using
lmplot)
Fig. 6.8 Scatter plot of Boston data (LSTAT versus price) and their polynomial relationship
(using lmplot with order 2)
In [10]:
sns.lmplot(x = "LSTAT", y = "PRICE", data = df_boston,
order = 2, aspect = 2);
Next, we want to analyze the correlation between the price and a subset of the
attributes: CRIM, per capita crime rate by town; INDUS, proportion of non-retail
business acres per town; NOX, nitric oxide concentrations (parts per 10 million);
RM, average number of rooms per dwelling; AGE, proportion of owner-occupied
units built prior to 1940; and LSTAT.
In [11]:
features2 = [’CRIM’, ’INDUS’, ’NOX’, ’RM’, ’AGE’, ’LSTAT’,
’PRICE’]
df_boston2 = df_boston[features2]
corrmat = df_boston2.corr()
sns.heatmap(corrmat, vmax=.8, square=True, annot=True)
Figure 6.9 shows a heat map representing the correlation between pairs of vari-
ables; specifically, the six variables selected and the price of houses. The color bar
shows the range of values used in the matrix. This plot is a useful way of summa-
rizing the correlation of several variables. It can be seen that LSTAT and RM are
the variables that are most correlated (negatively and positively, respectively) with
price.
Another good way to explore multiple variables is the scatter matrix from Pandas.
The scatter matrix is a grid of plots of multiple variables, one against the others,
illustrating the relationship of each variable with the rest. For the sake of visualization,
we do not consider all the variables, but just three: RM, AGE, and LSTAT in the
following code:
In [12]:
features3 = [’RM’,’AGE’,’LSTAT’, ’PRICE’]
df_boston3 = df_boston[features3]
pd.plotting.scatter_matrix(df_boston3, figsize=(12.0,12.0))
This code outputs Fig. 6.10, where we obtain visual information concerning the
density function for every variable, in the diagonal, as well as the scatter plots of the
data points for pairs of variables. In the last column, we can appreciate the relation
between the three variables selected and house prices. It can be seen that RM follows
a linear relation with price; whereas AGE does not. LSTAT follows a higher-order
relation with price. This plot gives us an indication of how good or bad every
attribute would be as a variable in a linear model.
Task 2: We now consider predictions with all variables by fitting multiple linear
regression models. We compare them and we analyze the coefficient values and how
good the prediction results are.
For the evaluation of the prediction power of the model with new samples, we
split the data into a training set and a testing set, and compute the linear regression
score, which returns the coefficient of determination . R 2 of the prediction. We also
calculate the MSE.
In [13]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
seed = 1 # Let’s make our predictions deterministic
X_boston,y_boston = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(
X_boston, y_boston, test_size=0.5, random_state=seed)
print("Train and test sizes of X: {} {}".format(
X_train.shape, X_test.shape))
print("Train and test sizes of y: {} {}".format(
y_train.shape, y_test.shape))
lr = LinearRegression()
lr.fit(X_train, y_train)
coefs_lr = pd.Series(np.abs(lr.coef_), features).sort_values()
y_test_pred = lr.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
r2score = lr.score(X_test, y_test)
print('\nIntercept and coefs:\n{} {}'.format(lr.intercept_, lr.coef_))
print('\nMSE: {}'.format(mse))
print('R^2 Score test: {}'.format(r2score))
In [16]:
from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler().fit(X_train)
X_train_norm = scalerX.transform(X_train)
X_test_norm = scalerX.transform(X_test)
print("\nBefore transformation:")
print(’Train: Max {}, min {}, mean {}, and var {}’.format(
np.max(X_train), np.min(X_train), np.mean(X_train),
np.var(X_train)))
print(’Test: Max {}, min {}, mean {}, and var {}’.format(
np.max(X_test), np.min(X_test), np.mean(X_test),
np.var(X_test)))
print("\nAfter transformation:")
print(’Train: Max {}, min {}, mean {}, and var {}’.format(
np.max(X_train_norm), np.min(X_train_norm),
np.mean(X_train_norm), np.var(X_train_norm)))
print(’Test: Max {}, min {}, mean {}, and var {}’.format(
np.max(X_test_norm), np.min(X_test_norm),
np.mean(X_test_norm), np.var(X_test_norm)))
Fig. 6.11 Coefficients of the variables before (first row) and after (second row) the standardization
In [17]:
from sklearn import linear_model

n_alphas = 100
alphas = np.logspace(-2, 2, n_alphas)

coefs_ridge = []
r2_ridge = []
for a in alphas:
    regr_ridge = linear_model.Ridge(alpha = a)
    regr_ridge.fit(X_train_norm, y_train)
    coefs_ridge.append(regr_ridge.coef_)
    r2_ridge.append(regr_ridge.score(X_test_norm, y_test))

coefs_lasso = []
r2_lasso = []
for a in alphas:
    regr_lasso = linear_model.Lasso(alpha = a, tol = 0.001)
    regr_lasso.fit(X_train_norm, y_train)
    coefs_lasso.append(regr_lasso.coef_)
    r2_lasso.append(regr_lasso.score(X_test_norm, y_test))
This outputs Fig. 6.12 which shows the coefficient values (first row), as well as
the regression scores (second row), as a function of the regularization parameter (.α).
Feature selection strategies. There are also other strategies for feature selection.
For instance, we can select the k .= 5 best features, according to the k highest scores,
using the function SelectKBest from Scikit-learn:
Fig. 6.12 Evolution of the coefficient values (first row) and the regression scores (second row) as
a function of the regularization parameter (.α)
In [18]:
import sklearn.feature_selection as fs
selector = fs.SelectKBest(score_func = fs.f_regression, k = 5)
X_new_train = selector.fit_transform(X_train, y_train)
X_new_test = selector.transform(X_test)
print('Non important variables: {}'.format(
    boston.feature_names[selector.get_support() == False]))
print('Relevant variables: {}'.format(
    boston.feature_names[selector.get_support()]))
Out[18]: Non important variables: [’CRIM’ ’ZN’ ’CHAS’ ’AGE’ ’DIS’ ’RAD’
’TAX’ ’B’]
Relevant variables: [’INDUS’ ’NOX’ ’RM’ ’PTRATIO’ ’LSTAT’]
Fig. 6.13 Relation between true (x-axis) and predicted (y-axis) prices
The set of selected features is now different, since the criterion has changed.
However, three of the most important features are among them: LSTAT, RM, and
PTRATIO.
Plots for the fitting evaluation. In order to evaluate the predictions we can analyze
two interesting plots: the plot of the predicted values against the true values, and the
residual plot (the residuals against the predicted values).
In both cases we can add a line to show what the perfect result would look like.
In [19]:
reg = LinearRegression()
reg.fit(boston.data, boston.target)
y_pred = reg.predict(boston.data)
plt.scatter(boston.target, y_pred, alpha = 0.3)
plt.plot([0, 50], [0, 50], '--')
plt.axis('tight')
plt.xlabel('True price (\$1000s)')
plt.ylabel('Predicted price (\$1000s)')
The output is in Fig. 6.13, where we can observe that the original prices are properly
estimated by the predicted ones, except for the higher values, around $50,000
(points in the top right corner). Therefore, the model seems to fit well except
for the more expensive houses, which is consistent with the histogram shown
in Fig. 6.6.
In [20]:
x_plot = plt.scatter(y_pred, (y - y_pred), alpha = 0.3)
plt.hlines(y = 0, xmin = 0, xmax = 50, linestyles = 'dashed')
plt.title('Residual plot')
plt.xlabel('$\hat y$')
plt.ylabel('$y - \hat y$')
Intuitively, if the model is good, then a plot of the residuals against the fitted values
(. ŷi ) should look like noise. However, this is not the case in Fig. 6.14 where there is
a visible pattern related with the expensive houses.
Statistical evaluation. Finally, it is worth noting that we can perform a statistical
evaluation of a linear regression with the OLS tools of the StatsModels package.3
This package is useful for studying several statistics concerning the regression model.
To know more about it, see the StatsModels documentation.
3 http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html.
Fig. 6.16 Linear regression (blue) versus logistic regression (red) for fitting a set of data (black
points) normally distributed across the 0 and 1 y-values
The input of the logistic function can take any value from minus infinity to plus
infinity, whereas the output is restricted to values between 0 and 1 and hence can be
interpreted as a probability.
The set of samples (X, y), illustrated as black points in Fig. 6.16, defines a fitting
problem suitable for a logistic regression. The blue and red lines show the fitting
result for linear and logistic models, respectively. In this case, a logistic model can
clearly explain the data; whereas a linear model cannot.
Now, we pose the question: What number of goals makes a football team the winner
or the loser? More concretely, we want to predict victory or defeat in a football
match when we are given the number of goals a team scores. To do this we consider
the set of results of the football matches from the Spanish league4 and we build a
classification model with it.
We first read the data file in a DataFrame and select the following columns
in a new DataFrame: HomeTeam,AwayTeam, FTHG (home team goals), FTAG
(away team goals), and FTR (H=home win, D=draw, A=away win). We then build
a .d-dimensional vector of variables with all the scores, .x, and a binary response
indicating victory or defeat, .y. For that, we create two extra columns containing W
the number of goals of the winning team and L the number of goals of the losing
team and we concatenate these data. Finally, we can compute and visualize a logistic
regression model to predict the discrete value (victory or defeat) using these data.
In [21]:
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('files/ch06/SP1.csv')
s = data[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']]

def my_f1(row):
    return max(row['FTHG'], row['FTAG'])

def my_f2(row):
    return min(row['FTHG'], row['FTAG'])

s['W'] = s.apply(my_f1, axis = 1)
s['L'] = s.apply(my_f2, axis = 1)

x1 = s['W'].values
y1 = np.ones(len(x1), dtype = int)
x2 = s['L'].values
y2 = np.zeros(len(x2), dtype = int)
x = np.concatenate([x1, x2])
x = x[:, np.newaxis]
y = np.concatenate([y1, y2])

logreg = LogisticRegression()
logreg.fit(x, y)

X_test = np.linspace(-5, 10, 300)

def lr_model(x):
    return 1 / (1 + np.exp(-x))

loss = lr_model(X_test * logreg.coef_ + logreg.intercept_).ravel()
X_test2 = X_test[:, np.newaxis]
losspred = logreg.predict(X_test2)

plt.scatter(x.ravel(), y, color = 'black', s = 100, zorder = 20, alpha = 0.03)
plt.plot(X_test, loss, color = 'blue', linewidth = 3)
plt.plot(X_test, losspred, color = 'red', linewidth = 3)
Figure 6.17 shows a scatter plot with transparency so we can appreciate the over-
lapping in the discrete positions of the total numbers of victories and defeats. It
also shows the fitting of the logistic regression model, in blue, and prediction of the
4 http://www.football-data.co.uk/mmz4281/1213/SP1.csv.
Fig. 6.17 Fitting of the logistic regression model (blue) and prediction of the logistic regression
model (red) for the Spanish football league results
logistic regression model, in red, for the Spanish football league results. With this
information we can estimate that the cutoff value is 1. This means that a team, in
general, has to score more than one goal to win.
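As a small illustrative addition (not one of the book's cells), the fitted logreg model from the previous cell can be queried directly for a few goal counts to confirm this cutoff:

goals = np.array([[0], [1], [2], [3]])
print(logreg.predict(goals))              # predicted defeat (0) or victory (1)
print(logreg.predict_proba(goals)[:, 1])  # estimated probability of victory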
6.4 Conclusions
In this chapter, we have focused on regression analysis and the different Python tools
that are useful for performing it. We have shown how regression analysis allows us
to better understand data by means of building a model from it. We have formally
presented four different regression models: simple linear regression, multiple linear
regression, polynomial regression, and logistic regression. We have also emphasized
the properties of sparse models in the selection of variables.
The different models have been used in three real problems dealing with different
types of datasets. In these practical cases, we solve different questions regarding
the behavior of the data, the prediction of data values (continuous or discrete), and
the importance of variables for the model. In the first case, we showed that there
is a decreasing tendency in the sea ice extent over the years, and we also predicted
the amount of ice for the next 20 years. In the second case, we predicted the price
of a market given a set of attributes and distinguished which of the attributes were
more important in the prediction. Moreover, we presented a useful way to show
the correlation between pairs of variables, as well as a way to plot the relationship
between pairs of variables. In the third case, we faced the problem of predicting
victory or defeat in a football match given the score of a team. We posed this problem
as a classification problem and solved it using a logistic regression model; and we
estimated the minimum number of goals a team has to score to win.
Acknowledgements This chapter was co-written by Laura Igual and Jordi Vitriá.
References
1. J. Maslanik, J. Stroeve, Near-Real-Time DMSP SSMIS Daily Polar Gridded Sea Ice Concentrations. Sea Ice Index Data: Monthly Sea Ice Extent and Area Data Files (1999, updated daily). https://doi.org/10.5067/U8C09DWVX9LM
2. D.A. Freedman, Statistical Models: Theory and Practice (2009)
Unsupervised Learning
7
7.1 Introduction
7.2 Clustering
• Examples within a cluster are similar (in this case, we speak of high intraclass
similarity).
• Examples in different clusters are different (in this case, we speak of low interclass
similarity).
When we denote data as similar and dissimilar, we should define a measure for this
similarity/dissimilarity. Note that grouping similar data together can help in discov-
ering new categories in an unsupervised manner, even when no sample category
labels are provided. Moreover, two kinds of inputs can be used for grouping:
(a) in similarity-based clustering, the input to the algorithm is an n × n dissimilarity
matrix or distance matrix;
(b) in feature-based clustering, the input to the algorithm is an n × D feature
matrix or design matrix, where n is the number of examples in the dataset and D the
dimensionality of each sample.
Similarity-based clustering allows easy inclusion of domain-specific similarity,
while feature-based clustering has the advantage that it is applicable to potentially
noisy data.
Therefore, several questions regarding the clustering process arise.
• What is a natural grouping among the objects? We need to define the “groupness”
and the “similarity/distance” between data.
• How can we group samples? What are the best procedures? Are they efficient?
Are they fast? Are they deterministic?
• How many clusters should we look for in the data? Shall we state this number
a priori? Should the process be completely data driven or can the user guide the
grouping process? How can we avoid “trivial” clusters? Should we allow final
clustering results to have very large or very small clusters? Which methods work
when the number of samples is large? Which methods work when the number of
classes is large?
• What constitutes a good grouping? What objective measures can be defined to
evaluate the quality of the clusters?
There is not always a single or optimal answer to these questions. It used to be said
that clustering is a “subjective” issue. Clustering will help us to describe, analyze,
and gain insight into the data, but the quality of the partition depends to a great extent
on the application and the analyst.
To speak of similar and dissimilar data, we need to introduce a notion of the similarity
of data. There are several ways of modeling similarity. A simple way to model
it is by means of a Gaussian kernel:

s(a, b) = e^{−γ d(a, b)},

where d(a, b) is a metric function and γ is a constant that controls the decay of the
function. Observe that when a = b, the similarity is maximum and equal to one. On
the contrary, when a is very different from b, the similarity tends to zero. This
modeling of the similarity function suggests that we can use the notion of distance
as a surrogate. The most widespread distance metric is the Minkowski distance:

d(a, b) = (Σ_{i=1}^{d} |a_i − b_i|^p)^{1/p},

where d(a, b) stands for the distance between two elements a, b ∈ R^d, d is the
dimensionality of the data, and p is a parameter.
The best-known instantiations of this metric are as follows:

• the Euclidean distance, obtained with p = 2;
• the Manhattan (or city block) distance, obtained with p = 1; and
• the max-distance (Chebyshev distance), obtained in the limit p → ∞, where d(a, b) = max_i |a_i − b_i|.
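As a quick illustration (not part of the book's examples), these distances can be computed with numpy for two arbitrary vectors a and b:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])

# p = 1 (Manhattan), p = 2 (Euclidean), p = inf (max-distance)
for p in (1, 2, np.inf):
    print("p =", p, "  d(a, b) =", np.linalg.norm(a - b, ord = p))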
When performing clustering, the question normally arises: How do we measure the
quality of the clustering result? Note that in unsupervised clustering, we do not have
groundtruth labels that would allow us to compute the accuracy of the algorithm. Still,
there are several procedures for assessing quality. We find two families of techniques:
those that allow us to compare clustering techniques, and those that check specific
properties of the clustering, for example its “compactness”.
• a is the number of pairs of elements in S that are in the same subset in both X and Y;
• b is the number of pairs of elements in S that are in different subsets in both X and Y;
• c is the number of pairs of elements in S that are in the same subset in X, but in different subsets in Y; and
• d is the number of pairs of elements in S that are in different subsets in X, but in the same subset in Y.
The Rand index is then defined as R = (a + b)/(a + b + c + d).
1 https://en.wikipedia.org/wiki/Rand_index.
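From these four counts, the Rand index is defined as the fraction of agreeing pairs, R = (a + b)/(a + b + c + d). Scikit-learn provides a chance-adjusted version of this score; a minimal usage sketch (the label lists are toy examples):

from sklearn import metrics

# 1.0: identical partitions up to a relabeling of the clusters.
print(metrics.adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))
# Negative values indicate agreement worse than random labeling.
print(metrics.adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))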
Two other useful scores are homogeneity (each predicted cluster contains only members of a single original class) and completeness (all members of a given original class are assigned to the same predicted cluster):
In [1]:
print("%.3f" % metrics.homogeneity_score([0, 0, 1, 1],
                                          [0, 0, 0, 0]))
Out[1]: 0.000
The homogeneity is 0 since the samples in the predicted cluster 0 come from
original cluster 0 and cluster 1.
In [2]:
print(metrics.completeness_score([0, 0, 1, 1],
[1, 1, 0, 0]))
Out[2]: 1.0
The completeness is 1 since all the samples from the original cluster with label 0
go into the same predicted cluster with label 1, and all the samples from the original
cluster with label 1 go into the same predicted cluster with label 0.
However, how can we define a measure that takes into account the completeness
as well as the homogeneity? The V-measure is the harmonic mean between the
homogeneity and the completeness defined as follows:
v = 2 \cdot \frac{\text{homogeneity} \cdot \text{completeness}}{\text{homogeneity} + \text{completeness}}.
Note that this metric does not depend on the absolute values of the labels: a
permutation of the class or cluster label values will not change the score in
any way. Moreover, the metric is symmetric with respect to switching between the
predicted and the original cluster labels. This is very useful for measuring the agreement
of two independent label assignment strategies applied to the same dataset, even
when the real groundtruth is not known. If class members are completely split across
different clusters, the assignment is totally incomplete, hence the V-measure is null:
In [3]:
print("%.3f" % metrics.v_measure_score([0, 0, 0, 0],
[0, 1, 2, 3]))
Out[3]: 0.000
In contrast, clusters that include samples from different classes destroy the homo-
geneity of the labeling, hence:
In [4]:
print("%.3f" % metrics.v_measure_score([0, 0, 1, 1],
[0, 0, 0, 0]))
Out[4]: 0.000
In summary, we can say that the advantages of the V-measure include that it
has bounded scores: 0.0 means the clustering is extremely bad; 1.0 indicates a per-
fect clustering result. Moreover, it can be interpreted easily: when analyzing the
V-measure, low completeness or homogeneity explain in which direction the clus-
tering is not performing well. Furthermore, we do not assume anything about the
cluster structure. Therefore, it can be used to compare clustering algorithms such
as K-means, which assume isotropic blob shapes, with results of other clustering
algorithms such as spectral clustering (see Sect. 7.2.3.2), which can find clusters
with “folded” shapes. As a drawback, the previously introduced metrics are not
normalized with regard to random labeling. This means that depending on the num-
ber of samples, clusters and groundtruth classes, a completely random labeling will
not always yield the same values for homogeneity, completeness and hence, the V-
measure. In particular, random labeling will not yield a zero score, and the scores tend
further from zero as the number of clusters increases. It can be shown that this prob-
lem can reliably be overcome when the number of samples is high, i.e., more than a
thousand, and the number of clusters is less than 10. These metrics require knowl-
edge of the groundtruth classes, while in practice this information is almost never
available or requires manual assignment by human annotators. Instead, as mentioned
before, these metrics can be used to compare the results of different clusterings.
Another way to assess the quality of a partition is the Silhouette coefficient, which
compares, for each sample, its intracluster distance a with its nearest-cluster distance b,2 and
is defined as (b − a)/max(a, b). It thus rewards partitions whose clusters are dense
and well separated; a logical feature when speaking about clusters. Furthermore, the
Silhouette coefficient is generally higher when clusters are compact.
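In Scikit-learn, the average Silhouette coefficient over all samples can be obtained directly; a minimal self-contained sketch on toy data (the blobs and the K-means call are our own example, not part of the case study):

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs, so the score should be close to 1.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print('Silhouette coefficient: %.3f'
      % metrics.silhouette_score(X, labels, metric='euclidean'))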
Within different clustering algorithms, one can find soft partition algorithms, which
assign a probability of the data belonging to each cluster, and also hard partition
2 The intracluster distance of sample i is the average distance from the sample to the other samples
in the same cluster, and the nearest-cluster distance is the average distance from the sample to the
samples in the nearest cluster to which sample i does not belong.
algorithms, where each data point is assigned precise membership of one cluster.
A typical example of a soft partition algorithm is the Mixture of Gaussians [1],
which can be viewed as a density estimator method that assigns a confidence or
probability to each point in the space. A Gaussian mixture model is a probabilistic
model that assumes all the data points are generated from a mixture of a finite
number of Gaussian distributions with unknown parameters. Generative unsupervised
clustering with a Gaussian mixture model is commonly fitted with the Expectation–Maximization
algorithm, and is therefore also known as EM clustering. Each point in the dataset has a soft assignment to the K clusters.
One can convert this soft probabilistic assignment into membership by picking out
the most likely clusters (those with the highest probability of assignment).
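A minimal sketch of this soft-to-hard conversion with Scikit-learn's GaussianMixture (the one-dimensional toy data and variable names are our own):

import numpy as np
from sklearn.mixture import GaussianMixture

# Two 1D Gaussian blobs stacked into a (200, 1) feature matrix.
data = np.concatenate([np.random.randn(100, 1),
                       5 + np.random.randn(100, 1)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
probs = gmm.predict_proba(data)   # soft assignment: one probability per cluster
labels = probs.argmax(axis=1)     # hard assignment: pick the most likely cluster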
An alternative to soft algorithms are the hard partition algorithms, which assign a
unique cluster value to each element in the feature space. According to the grouping
process of the hard partition algorithm, there are two large families of clustering
techniques:
• Partitional algorithms: these start with a random partition and refine it iteratively.
That is why sometimes these algorithms are called “flat” clustering. In this chapter,
we will consider two partitional algorithms in detail: K-means and spectral clus-
tering.
• Hierarchical algorithms: these organize the data into hierarchical structures, where
data can be agglomerated in the bottom-up direction, or split in a top-down manner.
In this chapter, we will discuss and illustrate agglomerative clustering.
where c_i is the set of points that belong to cluster i and μ_i is the center of cluster
c_i. The K-means clustering objective function uses the square of the Euclidean distance
d(x, μ_j) = ||x − μ_j||^2, which is also referred to as the inertia or within-cluster sum-
of-squares. This problem is not trivial to solve (in fact, it is an NP-hard problem), so
the algorithm can only hope to find the global minimum, and it may become stuck at a
suboptimal solution.
In other words, we may wonder whether the centroids should belong to the original
set of points:
\text{inertia} = \sum_{i=0}^{n} \min_{\mu_j \in c} \left( \|x_i - \mu_j\|^2 \right)    (7.2)
The K-means algorithm, also known as Lloyd's algorithm, is an iterative procedure
that searches for a solution of the K-means clustering problem and works as follows.
First, we need to decide the number of clusters, k. Then we apply the following
procedure:
1. Initialize the k centroids, e.g., by picking k samples at random.
2. Assign each sample to its nearest centroid.
3. Recompute each centroid as the mean of the samples assigned to it.
4. Repeat steps 2 and 3 until the assignments (and hence the centroids) no longer change.
Let us illustrate the algorithm in Python. First, we will create three sample distri-
butions:
In [5]:
import numpy as np

MAXN = 40
X = np.concatenate([
    1.25 * np.random.randn(MAXN, 2),
    5 + 1.5 * np.random.randn(MAXN, 2)])
X = np.concatenate([
    X, [8, 3] + 1.2 * np.random.randn(MAXN, 2)])
y = np.concatenate([np.ones((MAXN, 1)), 2 * np.ones((MAXN, 1))])
y = np.concatenate([y, 3 * np.ones((MAXN, 1))])
The sample distributions generated are shown in Fig. 7.1 (left). However, the algo-
rithm is not aware of their distribution. Figure 7.1 (right) shows what the algorithm
sees. Let us assume that we expect to have three clusters (.k = 3) and apply the
K-means command from the Scikit-learn library:
In [6]:
from sklearn.cluster import KMeans
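The fitting step itself is not shown in this excerpt; a minimal sketch of how the imported estimator would typically be applied to X with k = 3 (the variable name clf is our own):

clf = KMeans(init='k-means++', n_clusters=3, n_init=10)
clf.fit(X)
print(clf.labels_[:10])        # cluster index assigned to the first ten samples
print(clf.cluster_centers_)    # coordinates of the three centroids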
Fig. 7.1 Initial samples as generated (left), and samples seen by the algorithm (right)
• The inertia assumes that clusters are isotropic and convex, since the Euclidean
distance is applied, which is isotropic with regard to the different dimensions of
the data. However, we cannot expect that the data fulfill this assumption by default.
Hence, the K-means algorithm responds poorly to elongated clusters or manifolds
with irregular shapes.
• The algorithm may not ensure convergence to the global minimum. It can be
shown that K-means will always converge to a local minimum of the inertia
(Eq. (7.2)). It depends on the random initialization of the seeds, but some seeds
can result in a poor convergence rate, or convergence to suboptimal clustering.
To alleviate the problem of local minima, the K-means computation is often per-
formed several times, with different centroid initializations. One way to address
this issue is the k-means++ initialization scheme, which has been implemented
in Scikit-learn (use the init='k-means++' parameter). This scheme
initializes the centroids to be (generally) far from each other, thereby probably
leading to better results than purely random initialization.
• This algorithm requires the number of clusters to be specified. Different heuristics
can be applied to predetermine the number of seeds of the algorithm.
• It scales well to a large number of samples and has been used across a large range
of application areas in many different fields.
In summary, we can conclude that K-means has the advantages of allowing the
easy use of heuristics to select good seeds, the initialization of seeds by other methods,
and multiple initializations to be tried. However, it still cannot ensure that the local
minima problem is overcome; it is iterative and hence slow when there are a lot of
high-dimensional samples; and it tends to look for spherical clusters.
where α is a parameter. We expect two points from different clusters to be far away
from each other. However, two points from the same cluster can also be far apart, as long
as there is a sequence of points within the cluster that forms a “path” between them. Hence,
we define an affinity matrix A based on the similarity matrix S, where A contains
positive values and is symmetric. This can be done, for example, by applying a
k-nearest neighbor technique that builds a graph connecting just the k closest pairs of
data points. The symmetry comes from the fact that A_ij and A_ji give the similarity
between the same pair of points. Considering the affinity matrix, the clustering can
be seen as a graph partition problem, where connected graph components correspond
to clusters. The graph obtained by spectral clustering will be partitioned so that graph
edges connecting different clusters have low weights, and vice versa. Furthermore,
we define a degree matrix. D, where each diagonal value is the degree of the respective
graph node and all other elements are 0. Finally, we can compute the unnormalized
graph Laplacian (.U = D − A) and/or a normalized version of the Laplacian .(L), as
follows:
– Normalized Laplacian: L = D^{-1/2} U D^{-1/2}.
– Generalized Laplacian: L = D^{-1} U, which is related to a random walk on the
graph: a good partition is one in which the random walk seldom transitions from
one group to another.
If we assume that there are .k clusters, the next step is to find the .k small-
est eigenvectors, without considering the trivial constant eigenvector. Each row of
the matrix formed by the .k smallest eigenvectors of the Laplacian matrix defines
a transformation of the data .xi . Thus, in this transformed space, we can apply
K-means clustering in order to find the final clusters. If we do not know in advance
the number of clusters, .k, we can look for sudden changes in the sorted eigenvalues
of the matrix, .U , and keep the smallest ones.
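A minimal end-to-end sketch of this pipeline with NumPy and Scikit-learn (the toy two-moons data, the k-nearest-neighbor affinity and all variable names are our own):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

# Toy data and a symmetric k-nearest-neighbor affinity matrix A.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
A = kneighbors_graph(X, n_neighbors=10, include_self=False).toarray()
A = 0.5 * (A + A.T)                    # make the affinity matrix symmetric

D = np.diag(A.sum(axis=1))             # degree matrix
U = D - A                              # unnormalized graph Laplacian

# eigh returns eigenvalues (and eigenvectors) in ascending order.
eigvals, eigvecs = np.linalg.eigh(U)

# Embed each sample with the smallest non-trivial eigenvectors and cluster there.
k = 2
T = eigvecs[:, 1:k + 1]
labels = KMeans(n_clusters=k, n_init=10).fit_predict(T)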
When merging two clusters, a question naturally arises: How to measure the
similarity of two clusters? There are different ways to define this with different
results for the agglomerative clustering. The linkage criterion determines the metric
used for the cluster merging strategy:
• Ward linkage minimizes the sum of squared differences within all clusters. It is
thus a variance-minimizing approach and in this sense is similar to the K-means
objective function, but tackled with an agglomerative hierarchical approach.
• Complete (or maximum) linkage minimizes the maximum distance between observations of pairs of clusters.
• Average linkage minimizes the average of the distances between all observations of pairs of clusters.
Let us illustrate how the different linkages work with an example. Let us generate
three clusters as follows:
In [8]:
MAXN1 = 500
MAXN2 = 400
MAXN3 = 300
X1 = np.concatenate([
    2.25 * np.random.randn(MAXN1, 2),
    4 + 1.7 * np.random.randn(MAXN2, 2)])
X1 = np.concatenate([
    X1, [8, 3] + 1.9 * np.random.randn(MAXN3, 2)])
y1 = np.concatenate([
    np.ones((MAXN1, 1)),
    2 * np.ones((MAXN2, 1))])
y1 = np.concatenate([
    y1, 3 * np.ones((MAXN3, 1))]).ravel()
y1 = np.int_(y1)
labels_y1 = ['+', '*', 'o']
colors = ['r', 'g', 'b']
plt.show()
Fig. 7.3 Illustration of agglomerative clustering using different linkages: Ward, complete and aver-
age. The symbol of each data point corresponds to the original class generated and the color
corresponds to the cluster obtained
The results of the agglomerative clustering using the different linkages: complete,
average, and Ward are given in Fig. 7.3. Note that agglomerative clustering exhibits
“rich get richer” behavior that can sometimes lead to uneven cluster sizes, with
average linkage being the worst strategy in this respect and Ward linkage giving the
most regular sizes. Ward linkage is an attempt to form clusters that are as compact
as possible, since it considers inter- and intra-distances of the clusters. Meanwhile,
for non-Euclidean metrics, average linkage is a good alternative. However, average linkage
can produce very unbalanced clusters; it can even isolate a single data point in its own
cluster. This behavior would be useful if we want to detect outliers, but it may
be undesirable when two clusters are very close to each other, since it would tend to
merge them.
Agglomerative clustering can scale to a large number of samples when it is used
jointly with a connectivity matrix, but it is computationally expensive when no con-
nectivity constraints are added between samples: it considers all the possible merges
at each step.
Fig. 7.4 Illustration of agglomerative clustering without (top row) and with (bottom row) a con-
nectivity graph using the three linkages (from left to right): average, complete and Ward. The colors
correspond to the clusters obtained
Fig. 7.5 Comparison of the different clustering techniques (from left to right): K-means, spectral
clustering, and agglomerative clustering with average and Ward linkage on simple compact datasets.
In the first row, the expected number of clusters is .k = 2 and in the second row: .k = 4
connectivity graph and thus two of them avoid joining data points that belong to
different circles (except for Ward linkage, which attempts to form compact and isotropic
clusters).
Fig. 7.6 Comparison of the different clustering techniques (from left to right): K-means, spectral
clustering, and agglomerative clustering with average and Ward linkage on uniformly distributed
data. In the first row, the number of clusters assumed is .k = 2 and in the second row: .k = 4
Fig. 7.7 Comparison of the different clustering techniques (from left to right): K-means, spec-
tral clustering, and agglomerative clustering with average and Ward linkage on non-flat geometry
datasets. In the first row, the expected number of clusters is .k = 2 and in the second row: .k = 4
points as much as possible following the “rich get richer” rule. This results in a
second cluster of a small set of data. This behavior is observed in both cases: .k = 2
and .k = 4.
Regarding datasets with more complex geometry, like in the moon dataset (see
Fig. 7.7), K-means and Ward linkage agglomerative clustering attempt to construct
compact clusters and thus cannot separate the moons. Due to the connectivity con-
straint, the spectral clustering and the average linkage agglomerative clustering sep-
arated both moons in the case of k = 2, while in the case of k = 4, the average linkage
agglomerative clustering clusters most of the data correctly, separating some of the
noisy data points into two separate singleton clusters. In the case of spectral clustering,
looking for four clusters, the method splits each of the two moons into two
clusters.
In order to illustrate clustering with a real dataset, we will now analyze the indicators
of spending on education among the European Union member states, provided by
the Eurostat data bank.3 The data are organized by year (TIME) from 2002 until
2011 and country (GEO): [‘Albania’, ‘Austria’, ‘Belgium’, ‘Bulgaria’, etc.]. Twelve
indicators (INDIC_ED) of financing of education with their corresponding values
(Value) are given: (1) Expenditure on educational institutions from private sources
as % of gross domestic product (GDP), for all levels of education combined; (2)
Expenditure on educational institutions from public sources as % of GDP, for all
levels of government combined, (3) Expenditure on educational institutions from
public sources as % of total public expenditure, for all levels of education combined,
(4) Public subsidies to the private sector as % of GDP, for all levels of education
combined, (5) Public subsidies to the private sector as % of total public expenditure,
for all levels of education combined, etc. We can store the 12 indicators for a given
year (e.g., 2010) in a table. Figure 7.8 visualizes the first five countries in the table.
As we can observe, this is not a clean dataset, since there are values missing. Some
countries have very limited information and should be excluded. Other countries may
still not collect or have access to a few indicators. For these last cases, we can proceed
in two ways: a) fill in the gaps with some non-informative, non-biasing data; or b)
drop the features with missing values for the analysis. If we have many features and
only a few have missing values, then it is not very harmful to drop them. However, if
missing values are spread across most of the features, we eventually have to deal with
them. In our case, both options seem reasonable, as long as the number of missing
features for a country is not too large. We will proceed in both ways at the same time.
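A minimal sketch of the two strategies with pandas (here edu is assumed to be the DataFrame holding the 2010 indicator table, one row per country and one column per indicator; the names edufill and edudrop are our own, although edudrop matches the variable used later in the chapter):

# Option (a): fill each missing value with the mean of its indicator (column).
edufill = edu.fillna(edu.mean())

# Option (b): drop every indicator (column) that contains missing values.
edudrop = edu.dropna(axis=1)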
We apply both options: filling the gaps with the mean value of the feature, and
the dropping option, ignoring the indicators with missing values. Let us now apply
K-means clustering with k = 3 to both versions of the dataset.
Fig. 7.8 Expenditure on different educational indicators for the first 5 countries in the Eurostat
dataset
3 http://ec.europa.eu/eurostat.
Fig. 7.9 Clustering of the countries according to their educational expenditure using filled in (top
row) and dropped (bottom row) missing values
Fig. 7.10 Mean expenditure of the different clusters according to the 8 indicators of the indicators-
dropped dataset. Clustering with K-means where k .= 3
Comparing both clusterings, they may yield similar (up to label permutation) results, but they
will not necessarily always coincide. This is mainly due to two aspects: the random
initialization of the K-means clustering and the fact that each method works in a
different space (i.e., dropped data in 8D space versus filled-in data, working in 12D
space). Note that we should not consider the assigned absolute cluster value, since
it is irrelevant. The mean expenditure of the different clusters is shown by different
colors according to the 8 indicators of the indicators-dropped dataset (see Fig. 7.10).
So, without loss of generality, we continue analyzing the set obtained by dropping
missing values. Let us now check the clusters and check their profile by looking at
the centroids. Visualizing the eight values of the three clusters (see Fig. 7.10), we can
see that cluster 1 spends more on education for the 8 educational indicators, while
cluster 0 is the one with least resources invested in education.
Let us consider a specific country: e.g., Spain and its expenditure on education.
If we refine cluster 0 further and check how close members are from this cluster
to cluster 1, it may give us a hint as to a possible ordering. When visualizing the
distance to cluster 0 and 1, we can observe that Spain, while being from cluster 0, has
a smaller distance to cluster 1 (see Fig. 7.11). This should make us realize that using 3
clusters probably does not sufficiently represent the groups of countries. So we redo
the process, but applying k = 4: we obtain 4 clusters. This time, cluster 0 includes
the EU members with medium expenditure (Fig. 7.12). This reinforces the intuition
about Spain being a borderline case in the former clustering. The clusters obtained are as
follows:
Fig. 7.11 Distance of countries in cluster 0 to centroids of cluster 0 (in red) and cluster 1 (in blue)
Fig. 7.12 K-means applied to the Eurostat dataset grouping the countries into 4 clusters
We can repeat the process using the alternative clustering techniques and compare
their results. Let us first apply spectral clustering. The corresponding code will be
as follows:
Fig. 7.13 Spectral clustering applied to the European countries according to their expenditure on
education
In [11]:
from sklearn import cluster
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import euclidean_distances

X = StandardScaler().fit_transform(edudrop.values)
distances = euclidean_distances(edudrop.values)
spectral = cluster.SpectralClustering(
    n_clusters = 4, affinity = "nearest_neighbors")
spectral.fit(edudrop.values)
y_pred = spectral.labels_.astype(int)
The result of this spectral clustering is shown in Fig. 7.13. Note that in general,
the aim of spectral clustering is to obtain more balanced clusters. In this way, the
predicted cluster 1 merges clusters 2 and 3 of the K-means clustering, cluster 2
corresponds to cluster 1 of the K-means clustering, cluster 0 mainly shifts to cluster
2, and cluster 3 corresponds to cluster 0 of the K-means.
By applying agglomerative clustering, not only do we obtain different clusters, but
we can also see how the different clusters are formed. Thus, in some way it gives
us information on which are the most similar pairs of countries and clusters. The
corresponding code that applies the agglomerative clustering is as follows:
In [12]:
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X_train = edudrop.values
dist = pdist(X_train, 'euclidean')
linkage_matrix = linkage(dist, method = 'complete')
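The linkage matrix can then be rendered as a dendrogram such as the one in Fig. 7.14; a minimal sketch, assuming the country names are available as the index of the edudrop DataFrame:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix,
           labels=list(edudrop.index),   # country names on the leaves
           leaf_rotation=90)
plt.tight_layout()
plt.show()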
Figure 7.14 shows the construction of the clusters using complete linkage agglom-
erative clustering. Different cuts at different levels of the dendrogram allow us to
obtain different numbers of clusters.
To summarize, we can compare the results of the three clustering approaches. We
cannot expect the results to coincide, since the different approaches are based on
different criteria for constructing clusters. Nonetheless, we can still observe that, in
this case, K-means and the agglomerative approaches gave the same results (up to a
permutation of the cluster numbers, which is irrelevant), while spectral clustering
gave more evenly distributed clusters. This latter approach fused clusters 0 and 2 of
the agglomerative clustering into its cluster 1, and split cluster 3 of the agglomerative
clustering into its clusters 0 and 3. Note that these results could change when using
different distances between the data points.
Fig. 7.14 Agglomerative clustering applied to cluster European countries according to their expen-
diture on education
7.4 Conclusions
Acknowledgements This chapter was co-written by Petia Radeva and Oriol Pujol.
References
1. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Section 16.1. Gaussian mixture
models and k-means clustering, in Numerical Recipes: The Art of Scientific Computing, 3rd
edn. (Cambridge University Press, New York, 2007). ISBN 978-0-521-88068-8
2. M. Meilǎ, J. Shi, Learning segmentation by random walks, in Neural Information Processing
Systems 13 (NIPS 2000) (2001), pp. 873–879
3. G.J. Székely, M.L. Rizzo, Hierarchical clustering via joint between-within distances: extending
Ward's minimum variance method. J. Classif. 22, 151–183 (2005)
Network Analysis
8
8.1 Introduction
Network data are generated when we consider relationships between two or more
entities in the data, like the highways connecting cities, friendships between peo-
ple or their phone calls. In recent years, a huge amount of network data has been
generated and analyzed in different fields. For instance, in sociology there is inter-
est in analyzing blog networks, which can be built based on their citations, to look
for divisions in their structures between political orientations. Another example is
infectious disease transmission networks, which are built in epidemiological studies
to find the best way to prevent infection of people in a territory, by isolating cer-
tain areas. Other examples, studied in the field of technology, include interconnected
computer networks or power grids, which are analyzed to optimize their functioning.
We also find examples in academia, where we can build co-authorship networks and
citation networks to analyze collaborations among Universities.
Structuring data as networks can facilitate the study of the data for different goals;
for example, to discover the weaknesses of a structure. That could be the objective
of a biologist studying a community of plants and trying to establish which of its
properties promote quick transmission of a disease. A contrasting objective would be
to find and exploit structures that work efficiently for the transmission of messages
across the network. This may be the goal of an advertising agent trying to find the
best strategy for spreading publicity.
How to analyze networks and extract the features we want to study are some of the
issues we consider in this chapter. In particular, we introduce some basic concepts
related with networks, such as degree distribution, centrality measures, connected
components, ego-networks, and PageRank. We present some useful Python tools for
the analysis of networks and discuss some of the visualization options. In order to
motivate and illustrate the concepts, we perform social network analysis using real
data. We present a practical case based on a public dataset which consists of a set
of Facebook friendship networks.

8.2 Basic Definitions in Graphs

Graph is the mathematical term used to refer to a network. Thus, the field that
studies networks is called graph theory and it provides the tools necessary to analyze
networks. Leonhard Euler defined the first graph in 1735, as an abstraction of one of
the problems posed by mathematicians of the time regarding Konigsberg, a city with
two islands created by the River Pregel, which was crossed by seven bridges. The
problem was: is it possible to walk through the town of Konigsberg crossing each
bridge once and only once? Euler represented the land areas as nodes and the bridges
connecting them as edges of a graph and proved that the walk was not possible for
this particular graph.
A graph is defined as a set of nodes, which are an abstraction of any entities
(parts of a city, persons, etc.), and the connecting links between pairs of nodes called
edges or relationships. The edge between two nodes can be directed or undirected. A
directed edge means that the edge points from one node to the other and not the other
way round. An example of a directed relationship is “a person knows another person”.
An edge has a direction when person A knows person B, and not the reverse direction
Fig. 8.1 Simple undirected labeled graph with 5 nodes and 5 edges
if B does not know A (which is usual for many fans and celebrities). An undirected
edge means that there is a symmetric relationship. An example is “a person shook
hands with another person”; in this case, the relationship, unavoidably, involves both
persons and there is no directionality. Depending on whether the edges of a graph are
directed or undirected, the graph is called a directed graph or an undirected graph,
respectively.
The degree of a node is the number of edges that connect to it. Figure 8.1 shows
an example of an undirected graph with 5 nodes and 5 edges. The degree of node C
is 1, while the degree of nodes A, D and E is 2 and for node B it is 3. If a network is
directed, then nodes have two different degrees, the in-degree, which is the number
of incoming edges, and the out-degree, which is the number of outgoing edges.
In some cases, there is information we would like to add to graphs to model
properties of the entities that the nodes represent or their relationships. We could add
strengths or weights to the links between the nodes, to represent some real-world
measure. For instance, the length of the highways connecting the cities in a network.
In this case, the graph is called a weighted graph.
Some other elementary concepts that are useful in graph analysis are those we
explain in what follows. We define a path in a network to be a sequence of nodes
connected by edges. Moreover, many applications of graphs require shortest paths
to be computed. The shortest path problem is the problem of finding a path between
two nodes in a graph such that the length of the path (the number of followed edges
or the sum of the weights of the edges in the path) is minimized. In the example in
Fig. 8.1, the paths (C, A, B, E) and (C, A, B, D, E) are those between nodes C and
E. This graph is unweighted, so the shortest path between C and E is the one that
follows the fewer edges: (C, A, B, E).
A graph is said to be connected if for every pair of nodes there is a path between
them. A graph is fully connected or complete if each pair of nodes is connected by
an edge. A connected component or simply a component of a graph is a subset of its
nodes such that every node in the subset has a path to every other one. In the example
of Fig. 8.1, the graph has one connected component. A subgraph is a subset of the
nodes of a graph and all the edges linking those nodes. Any group of nodes can form
a subgraph.
Social network analysis processes social data structured in graphs. It involves the
extraction of several characteristics and graphics to describe the main properties of
the network. Some general properties of networks, such as the shape of the network
degree distribution (defined below) or the average path length, determine the type
of network, such as a small-world network or a scale-free network. A small-world
network is a type of graph in which most nodes are not neighbors of one another, but
most nodes can be reached from every other node in a small number of steps. This
is the so-called small-world phenomenon which can be interpreted by the fact that
strangers are linked by a short chain of acquaintances. In a small-world network,
people usually form communities or small groups where everyone knows every-
one else. Such communities can be seen as complete graphs. In addition, most of the
community members have only a few relationships with people outside that community.
However, some people are connected to a large number of communities. These may
be celebrities and such people are considered as the hubs that are responsible for
the small-world phenomenon. Many small-world networks are also scale-free net-
works. In a scale-free network, the node degree distribution follows a power law (a
relationship between two quantities x and y of the form y = x^n, where n is
a constant). The name scale-free comes from the fact that power laws have the same
functional form at all scales, i.e., their shape does not change on multiplication by a
scale factor. Thus, by definition, a scale-free network has many nodes with very few
connections and a small number of nodes with many connections. This structure is
typical of the World Wide Web and other social networks. In the following sections,
we illustrate this and other graph properties that are useful in social network analysis.
8.3.1 NetworkX
NetworkX 1 is a Python toolbox for the creation, manipulation and study of the
structure, dynamics and functions of complex networks. After importing the toolbox,
we can create an undirected graph with 5 nodes by adding the edges, as is done in
the following code. The output is the graph in Fig. 8.1.
1 https://networkx.org.
In [1]:
import networkx as nx

G = nx.Graph()
G.add_edge('A', 'B')
G.add_edge('A', 'C')
G.add_edge('B', 'D')
G.add_edge('B', 'E')
G.add_edge('D', 'E')
nx.draw_networkx(G)
For our practical case we consider data from the Facebook social network. In par-
ticular, we use the data Social circles: Facebook 2 from the Stanford Large Network
Dataset3 (SNAP) collection. The SNAP collection has links to a great variety of
networks such as Facebook-style social networks, citation networks or open com-
munities like Live Journal. The Facebook dataset consists of a network representing
friendship between Facebook users. The Facebook data was anonymized by replac-
ing the internal Facebook identifiers for each user with a new value. The network
corresponds to an undirected and unweighted graph that contains users of Facebook
(nodes) and their friendship relations (edges). This dataset is stored in a plain text
file with an edge list with one edge per line. Let us load the Facebook network and
start extracting the basic information from the graph, including the numbers of nodes
and edges, and the average degree:
In [2]:
fb = nx.read_edgelist("files/ch08/facebook_combined.txt")
fb_n, fb_k = fb.order(), fb.size()
fb_avg_deg = fb_k / fb_n
print('Nodes: ', fb_n)
print('Edges: ', fb_k)
print('Average degree: ', fb_avg_deg)
2 https://snap.stanford.edu/data/egonets-Facebook.html.
3 http://snap.stanford.edu/data/.
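The code that produces the degree distribution of Fig. 8.2 is not shown in this excerpt; a minimal sketch of how it could be obtained with NetworkX and Matplotlib:

import matplotlib.pyplot as plt

# Count how many nodes have each degree and plot the distribution on a
# log-log scale, where a power law shows up as a roughly straight line.
degrees = dict(fb.degree())
degree_values = sorted(set(degrees.values()))
histogram = [list(degrees.values()).count(d) for d in degree_values]

plt.loglog(degree_values, histogram, 'o')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.show()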
The degree distribution, shown in Fig. 8.2, follows a power law. Thus, we can say
that the Facebook network is a scale-free network.
Next, let us find out if the Facebook dataset contains more than one connected
component:
In [4]:
print('# connected components of Facebook network: ',
      nx.number_connected_components(fb))
Let us now prune the graph by removing node ‘0’ (arbitrarily selected) and compute the number of connected
components of the pruned version of the graph:
In [5]:
fb_prun = nx.read_edgelist(
"files/ch08/facebook_combined.txt")
fb_prun.remove_node(’0’)
print(’Remaining nodes:’, fb_prun.number_of_nodes())
print(’New # connected components:’,
nx.number_connected_components(fb_prun))
8.4 Centrality
The centrality of a node measures its relative importance within the graph. In this
section we focus on undirected graphs. Centrality concepts were first developed in
social network analysis. The first studies indicated that central nodes are probably
more influential, have greater access to information, and can communicate their
opinions to others more efficiently [1]. Thus, the applications of centrality concepts
in a social network include identifying the most influential people, the most informed
people, or the most communicative people. In practice, what centrality means will
depend on the application and the meaning of the entities represented as nodes in the
data and the connections between those nodes. Various measures of the centrality
of a node have been proposed. We present four of the best-known measures: degree
centrality, betweenness centrality, closeness centrality, and eigenvector centrality.
Degree centrality is defined as the number of edges of the node. So the more ties a
node has, the more central the node is. To achieve a normalized degree centrality of a
node, the measure is divided by the total number of graph nodes (.n) without counting
this particular one (.n − 1). The normalized measure provides proportions and allows
us to compare it among graphs. Degree centrality is related to the capacity of a node
to capture any information that is floating through the network. In social networks,
connections are associated with positive aspects such as knowledge or friendship.
Betweenness centrality quantifies the number of times a node is crossed along the
shortest path/s between any other pair of nodes in the graph (except pairs with the
node in question itself). For the normalized measure this number is divided by the
total number of shortest paths for every pair of nodes. Intuitively, if we think of a
public bus transportation network, the bus stop (node) with the highest betweenness
has the most traffic. In social networks, a person with high betweenness has more
power in the sense that more people depend on him/her to make connections with
other people or to access information from other people. Comparing this measure
with degree centrality, we can say that degree centrality depends only on the node’s
neighbors; thus it is more local than the betweenness centrality, which depends on
the connection properties of every pair of nodes. The equivalent measure exists for
edges. The betweenness centrality of an edge is the proportion of the shortest paths
between all node pairs which pass through it.
Closeness centrality tries to quantify the position a node occupies in the network
based on a distance calculation. The distance metric used between a pair of nodes
is defined by the length of its shortest path. The closeness of a node is inversely
proportional to the length of the average shortest path between that node and all the
other nodes in the graph. In this case, we interpret a central node as being close to,
and able to communicate quickly with, the other nodes in a social network.
Eigenvector centrality defines a relative score for a node based on its connections
and considering that connections from high centrality nodes contribute more to the
score of the node than connections from low centrality nodes. It is a measure of the
influence of a node in a network, in the following sense: it measures the extent to
which a node is connected to influential nodes. Accordingly, an important node is
connected to important neighbors.
Let us illustrate the centrality measures with an example. In Fig. 8.3, we show
an undirected star graph with .n = 8 nodes. Node C is obviously important, since
it can exchange information with more nodes than the others. The degree centrality
measures this idea. In this star network, node C has a degree centrality of 7, or 1 if
we consider the normalized measure, whereas all other nodes have a degree of 1,
or 1/7 if we consider the normalized measure. Another reason why node C is more
important than the others in this star network is that it lies between each of the other
pairs of nodes, and no other node lies between C and any other node. If node C
wants to contact F, C can do it directly; whereas if node F wants to contact B, it
must go through C. This gives node C the capacity to broker or prevent contact among
other nodes and to isolate nodes from information. Betweenness centrality captures
this idea. In this example, the betweenness centrality of node C is 21,
computed as (n − 1)(n − 2)/2, while the rest of the nodes have a betweenness of 0. The
final reason why we can say node C is superior in the star network is because C is
closer to more nodes than any other node is. In the example, node C is at a distance
of 1 from all other 7 nodes and each other node is at a distance 2 from all other nodes,
except C. So, node C has closeness centrality of 1/7, while the rest of nodes have a
closeness of 1/13. The normalized measures, computed by multiplying by n − 1, are 1
for C and 7/13 for the other nodes.
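These hand-computed values can be verified with NetworkX using its built-in star graph generator; a small sketch in which node 0 plays the role of node C:

import networkx as nx

star = nx.star_graph(7)   # one center (node 0) connected to 7 leaves

print(nx.degree_centrality(star)[0])       # 1.0 for the center
print(nx.betweenness_centrality(star)[0])  # 1.0, i.e., 21 normalized by (n-1)(n-2)/2
print(nx.closeness_centrality(star)[0])    # 1.0 for the center
print(nx.closeness_centrality(star)[1])    # 7/13, about 0.538, for a leaf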
An important concept in social network analysis is that of a hub node, which is
defined as a node with high degree centrality and betweenness centrality. When a
hub governs a very centralized network, the network can be easily fragmented by
removing that hub.
Coming back to the Facebook example, let us compute the degree centrality of
Facebook graph nodes. In the code below we show the user identifier of the 10 most
central nodes together with their normalized degree centrality measure. We also
show the degree histogram to extract some more information from the shape of the
distribution. It might be useful to represent distributions using a logarithmic scale. We
do that with the matplotlib.loglog() function. Figure 8.4 shows the degree
centrality histogram in linear and logarithmic scales, as computed in the box below.
In [7]:
degree_cent_fb = nx.degree_centrality(fb)
print('Facebook degree centrality: ',
      sorted(degree_cent_fb.items(),
             key = lambda x: x[1],
             reverse = True)[:10])
degree_hist = plt.hist(list(degree_cent_fb.values()), 100)
plt.loglog(degree_hist[1][1:],
           degree_hist[0], 'b', marker = 'o')
Fig. 8.4 Degree centrality histogram shown using a linear scale (left) and a log scale for both the
x- and y-axis (right)
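The betweenness, closeness, and eigenvector centralities referred to below are not computed in this excerpt; a sketch of how they would typically be obtained with NetworkX (betweenness on the full graph can take a long time):

# Helper to list the 10 most central nodes of a centrality dictionary.
def top10(cent):
    return sorted(cent.items(), key=lambda x: x[1], reverse=True)[:10]

print('Betweenness: ', top10(nx.betweenness_centrality(fb)))
print('Closeness:   ', top10(nx.closeness_centrality(fb)))
print('Eigenvector: ', top10(nx.eigenvector_centrality(fb)))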
As can be seen in the previous results, each measure gives a different ordering of
the nodes. The node ‘107’ is the most central node for degree (see box Out [7]),
betweenness, and closeness centrality, while it is not among the 10 most central nodes
for eigenvector centrality. The second most central node is different for closeness
and eigenvector centralities; while the third most central node is different for all four
centrality measures.
Another interesting measure is the current flow betweenness centrality, also called
random walk betweenness centrality, of a node. It can be defined as the probability
of passing through the node in question on a random walk starting and ending at
some node. In this way, the betweenness is not computed as a function of shortest
paths, but of any paths. This makes sense for some social networks where messages
may get to their final destination not by the shortest path, but by a random path, as
in the case of gossip floating through a social network for example.
Computing the current flow betweenness centrality can take a while, so we will
work with a trimmed Facebook network instead of the original one. In fact, we can
pose the question: What happens if we only consider the graph nodes with more than
the average degree of the network (21)? We can trim the graph using degree values.
To do this, in the next code, we define a function to trim the graph based on the
degree of the graph nodes. We set the threshold to 21 connections:
In [9]:
def trim_degree(graph, degree = 1):
    g = graph.copy()
    d = dict(nx.degree(g))
    for key, node_degree in list(d.items()):
        if node_degree <= degree:
            g.remove_node(key)
    return g

thr = 21.0
print('Degree threshold:', thr)
fb_trimed = trim_degree(fb, degree = thr)
The new graph is much smaller; we have removed almost half of the nodes (we
have moved from 4,039 to 2,226 nodes).
We first compute the standard betweenness centrality measure of the trimmed
Facebook network:
In [10]:
betweenness = nx.betweenness_centrality(fb_trimed)
print('Trimmed Facebook betweenness centrality: ',
      sorted(betweenness.items(),
             key = lambda x: x[1], reverse = True)[:10])
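The current flow betweenness centrality announced above can be computed analogously; a sketch, assuming the trimmed graph is connected (the NetworkX implementation requires a connected graph, so otherwise it should be applied to its largest connected component):

current_flow = nx.current_flow_betweenness_centrality(fb_trimed)
print('Trimmed Facebook current flow betweenness centrality: ',
      sorted(current_flow.items(),
             key = lambda x: x[1], reverse = True)[:10])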
In this section we focus on graph visualization, which can help in the network data
understanding and usability. The visualization of a network with a large amount
of nodes is a complex task. Different layouts can be used to try to build a proper
visualization. For instance, we can draw the Facebook graph using the random layout
(nx.random_layout), but this is a bad option, as can be seen in Fig. 8.5. Other
alternatives can be more useful. In the box below, we use the Spring layout, as it
is used in the default function (nx.draw), but with more iterations. The function
nx.spring_layout returns the position of the nodes using the Fruchterman-
Reingold force-directed algorithm. This algorithm distributes the graph nodes in
such a way that all the edges are more or less equally long and they cross each other
as few times as possible. In the next code, we also build the graph changing the size
of the nodes to that defined by their degree centrality. As can be seen in the code,
the degree centrality is normalized to values between 0 and 1, and multiplied by a
constant to make the sizes appropriate for the format of the figure:
In [11]:
pos_fb = nx.spring_layout(fb, iterations = 1000)
nsize = np.array([v for v in degree_cent_fb.values()])
cte = 2000
nsize = cte * (nsize - min(nsize)) / (max(nsize) - min(nsize))
nodes = nx.draw_networkx_nodes(fb, pos = pos_fb,
                               node_size = nsize,
                               node_color = 'red',
                               edgecolors = 'black')
edges = nx.draw_networkx_edges(fb, pos = pos_fb,
                               alpha = .1)
The resulting graph visualization is shown in Fig. 8.6. This illustration allows us
to understand the network better. Now we can distinguish several groups of nodes or
“communities” clearly in the graph. Moreover, the larger nodes are the more central
nodes, which are highly connected in the Facebook graph.
We can also use the betweenness centrality to define the size of the nodes. In this
way, we obtain a new illustration stressing the nodes with higher betweenness, which
are those with a large influence on the transfer of information through the network.
The new graph is shown in Fig. 8.7. As expected, the central nodes are now those
connecting the different communities.
Generally different centrality metrics will be positively correlated, but when they
are not, there is probably something interesting about the network nodes. For instance,
if you can spot nodes with high betweenness but relatively low degree, these are the
nodes with few links but which are crucial for network flow. We can also look for
the opposite effect: nodes with high degree but relatively low betweenness. These
nodes are those with redundant communication.
Changing the centrality measure to closeness, we obtain the graph in Fig. 8.8.
As can be seen, the central nodes are also different for this measure. With this or
other visualizations you will be able to discern different types of nodes. You can
probably see nodes with high closeness centrality but low degree; these are essential
nodes linked to a few important or active nodes. If the opposite occurs, if there are
nodes with high degree centrality but low closeness, these can be interpreted as nodes
embedded in a community that is far removed from the rest of the network. In other
examples of social networks, you could find nodes with high closeness centrality but
low betweenness; these are nodes near many people, but since there may be multiple
paths in the network, they are not the only ones to be near many people. Finally, it
is usually difficult to find nodes with high betweenness but low closeness, since this
would mean that the node in question monopolized the links from a small number
of people to many others.
8.4.2 PageRank
PageRank [2] is an algorithm originally designed to rank webpages: the World Wide Web is modeled as a
directed graph formed by nodes corresponding to the webpages and edges corre-
sponding to the hyperlinks. Intuitively, a hyperlink to a page counts as a vote of
support and a page has a high rank if the sum of the ranks of its incoming edges is
high. This considers both cases when a page has many incoming links and when a
page has a few highly ranked incoming links. Nowadays, a variant of the algorithm
is used by Google. It does not only use information on the number of edges pointing
into and out of a website, but uses many more variables.
We can describe the PageRank algorithm from a probabilistic point of view. The
rank of page P_i is the probability that a surfer on the Internet who starts visiting a
random page and follows links visits the page P_i. In more detail, we consider that the
weights assigned to the edges of a network by its transition matrix, M, are the
probabilities that the surfer goes from one webpage to another. We can understand the
rank computation as a random walk through the network. We start with an initial equal
probability for each page: v_0 = (1/n, ..., 1/n), where n is the number of nodes. Then
we can compute the probability that each page is visited after one step by applying
the transition matrix: v_1 = M v_0. The probability that each page will be visited after
k steps is given by v_k = M^k v_0. After several steps, the sequence converges to a
unique probabilistic vector v^*, which is the PageRank vector. The i-th element of this
vector is the probability that at each moment the surfer visits page P_i. We need a
nonambiguous definition of the rank of a page for any directed web graph. However,
on the Internet we can expect to find pages that do not contain outgoing links, and
this configuration leads to certain problems for the procedure just explained. In order
to overcome this problem, the algorithm fixes a positive constant p between 0 and
1 (a typical value for p is 0.85) and redefines the transition matrix of the graph as
R = (1 − p) M + p B, where B = (1/n) E, and E is the matrix whose entries are all
equal to one. Therefore, a node with no outgoing edges has probability p/n of moving
to any other node.
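A tiny NumPy illustration of the iteration v_k = M^k v_0 on a hypothetical three-page web (the column-stochastic transition matrix M is our own toy example; every page has outgoing links, so no dangling-node correction is needed):

import numpy as np

# Column j holds the probabilities of moving from page j to each page.
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
n = M.shape[0]

v = np.ones(n) / n       # v_0: the surfer starts at a random page
for _ in range(100):     # iterate v_{k+1} = M v_k
    v = M @ v
print(v)                 # approximates the rank vector of this toy graph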
Let us compute the PageRank vector of the Facebook network and use it to define
the size of the nodes, as was done in box In [11].
In [12]:
pr = nx.pagerank(fb, alpha = 0.85)
nsize = np.array([v for v in pr.values()])
nsize = cte * (nsize - min(nsize)) / (max(nsize) - min(nsize))
nodes = nx.draw_networkx_nodes(fb,
                               pos = pos_fb,
                               node_size = nsize,
                               node_color = 'red',
                               edgecolors = 'black')
edges = nx.draw_networkx_edges(fb,
                               pos = pos_fb,
                               alpha = .1)
The code above outputs the graph in Fig. 8.9, that emphasizes some of the nodes
with high PageRank. Looking carefully one can realize that there is one large node
per community.
8.5 Ego-Networks
In [13]:
ego_107 = nx.ego_graph(fb, '107')
print('# nodes of ego graph 107:', len(ego_107))
print('# nodes of ego graph 107 with radius up to 2:',
      len(nx.ego_graph(fb, '107', radius = 2)))
As can be seen, the size of the previously defined ego-network of node ‘107’ is
slightly different from that of the ego-network automatically computed using NetworkX.
This is due to the fact that the manual labeling does not necessarily correspond to the
subgraph of neighbors at a distance of 1.
We can now answer some other questions about the organization of the Facebook
network and compare the 10 different ego-networks among them. First, we can
compute which the most densely connected ego-network is from the total of 10. To
do that, in the code below, we compute the number of edges in every ego-network
and select the network with the maximum number:
In [15]:
import os

ego_ids = (0, 107, 348, 414, 686, 698,
           1684, 1912, 3437, 3980)
ego_sizes = np.zeros((10, 1))
i = 0
# Fill the 'ego_sizes' vector with the size (# edges) of
# the 10 ego-networks in ego_ids:
for id in ego_ids:
    G = nx.read_edgelist(
        os.path.join('files/ch08/facebook',
                     '{0}.edges'.format(id)),
        nodetype = int)
    ego_sizes[i] = G.size()
    i = i + 1

[i_max, j] = (ego_sizes == ego_sizes.max()).nonzero()
ego_max = ego_ids[i_max[0]]
print('The most densely connected ego-network is '
      'that of node:', ego_max)

G = nx.read_edgelist(
    os.path.join('files/ch08/facebook',
                 '{0}.edges'.format(ego_max)),
    nodetype = int)
G_n, G_k = G.order(), G.size()
print('Nodes: ', G_n)
print('Edges: ', G_k)
print('Average degree: ', G_k / G_n)
The most densely connected ego-network is that of node ‘1912’, which has an
average degree of 40. We can also compute which is the largest (in number of nodes)
ego-network, changing the measure of size from G.size() to G.order(). In
this case, we obtain that the largest ego-network is that of node ‘107’, which has
1,034 nodes and an average degree of 25.
Next let us work out how much intersection exists between the ego-networks in
the Facebook network. To do this, in the code below, we add a field ‘egonet’ for every
node and store an array with the ego-networks the node belongs to. Then, having the
length of these arrays, we compute the number of nodes that belong to 1, 2, 3, 4 and
more than 4 ego-networks:
In [16]:
# Add a field 'egonet' to the nodes of the whole Facebook network.
# Default value egonet = [], meaning that the node does not belong
# to any ego-network.
for i in fb.nodes():
    fb.nodes[str(i)]['egonet'] = []
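The loop that actually fills the 'egonet' field and the subsequent counting are not shown in this excerpt; a sketch, assuming the same files/ch08/facebook/<id>.edges files and the ego_ids tuple from box In [15]:

# For every ego-network, tag each of its member nodes with the ego id.
for id in ego_ids:
    G = nx.read_edgelist(
        os.path.join('files/ch08/facebook', '{0}.edges'.format(id)),
        nodetype = int)
    for n in G.nodes():
        fb.nodes[str(n)]['egonet'].append(id)

# Count how many ego-networks each node belongs to.
intersections = [len(fb.nodes[n]['egonet']) for n in fb.nodes()]
for k in (1, 2, 3, 4):
    print('# nodes in exactly', k, 'ego-networks:',
          sum(1 for x in intersections if x == k))
print('# nodes in more than 4 ego-networks:',
      sum(1 for x in intersections if x > 4))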
In [17]:
# Add a field 'egocolor' to the nodes of the whole Facebook network.
# Default value egocolor = 0, meaning that the node does not belong
# to any ego-network.
for i in fb.nodes():
    fb.nodes[str(i)]['egocolor'] = 0
However, the graph in Fig. 8.11 does not illustrate how much overlap there is
between the ego-networks. To do that, we can visualize the intersection between
ego-networks using a Venn or an Euler diagram. Both diagrams are useful in order to
see how networks are related. Figure 8.12 shows the Venn diagram of the Facebook
network. This powerful and complex graph cannot be easily built in Python tool-
boxes like NetworkX or Matplotlib. In order to create it, we have used a JavaScript
visualization library called D3.JS.4
4 https://d3js.org.
8.6 Community Detection

A community in a network can be seen as a set of nodes of the network that is densely
connected internally. The detection of communities in a network is a difficult task
since the number and sizes of communities are usually unknown [3].
Several methods for community detection have been developed. Here, we apply
one of the methods to automatically extract communities from the Facebook network.
We import the Community toolbox5 which implements the Louvain method for
community detection. In the code below, we compute the best partition and plot the
resulting communities in the whole Facebook network with different colors, as we
did in box In [17]. The resulting graph is shown in Fig. 8.13.
In [18]:
import community

partition = community.best_partition(fb)
print('# communities found:', max(partition.values()))
colors2 = [partition.get(node) for node in fb.nodes()]
nsize = np.array([v for v in degree_cent_fb.values()])
nsize = cte * (nsize - min(nsize)) / (max(nsize) - min(nsize))
nodes = nx.draw_networkx_nodes(
    fb, pos = pos_fb,
    cmap = plt.get_cmap('Paired'),
    node_color = colors2,
    node_size = nsize,
    edgecolors = 'black')
edges = nx.draw_networkx_edges(fb, pos = pos_fb, alpha = .1)
8.7 Conclusions
In this chapter, we have introduced network analysis and a Python toolbox (Net-
workX) that is useful for this analysis. We have shown how network analysis allows
us to extract properties from the data that would be hard to discover by other means.
Some of these properties are basic concepts in social network analysis, such as
centrality measures, which return the importance of the nodes in the network, or ego-
networks, which allow us to study the reach of the information a node can transmit or
have access to. The different concepts have been illustrated by a practical
case dealing with a Facebook network. In this practical case, we have resolved several
issues, such as, finding the most representative members of the network in terms of
the most “connected”, the most “circulated”, the “closest” or the most “accessible”
nodes to the others. We have presented useful ways of extracting basic properties of
the Facebook network, and studying its ego-networks and communities, as well as
comparing them quantitatively and qualitatively. We have also proposed several visu-
5 http://perso.crans.org/aynaud/communities/.
alizations of the graph to represent several measures and to emphasize the important
nodes with different meanings.
Acknowledgements This chapter was co-written by Laura Igual and Santi Seguí.
References
1. N. Friedkin, Structural bases of interpersonal influence in groups: a longitudinal case study. Am.
Sociol. Rev. 58(6), 861 (1993)
2. L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: bringing order to
the web (1999)
3. V.D. Blondel, J.-L. Guillaume, R. Lambiotte, R. Lefebvre, Fast unfolding of communities in
large networks. J. Stat. Mech.: Theory Exper. 2008(10) (2008)
Recommender Systems
9
9.1 Introduction
In this chapter, we will see what are recommender systems, how they work, and how
they can be implemented. We will also see the different paradigms of recommender
systems based on the information they use, as well as the output they produce. We
will consider typical questions that companies like Netflix or Amazon include in
their products: Which movie should I rent? Which TV should I buy? and we will
give some insights in order to deal with more complex questions: Which is the best
place for me and my family to travel to?
So, the first question we should answer is: What is a recommender system? It can
be defined as a tool designed to interact with large and complex information spaces,
and to provide information or items that are likely to be of interest to the user, in an
automated fashion. By complex information space we refer to the set of items, and
their characteristics, which the system recommends to the user, e.g., books, movies, or
city trips.
Nowadays, recommender systems are extremely common and are applied in a large variety of applications [1]. Perhaps the most popular examples are the movie recommender systems used by companies such as Netflix, the music recommenders in Pandora or Spotify, and the product recommendations from Amazon.com. However, the truth is that recommender systems are present in a huge variety of domains, such as movies, music, news, books, research papers, search queries, social tags, and products in general, but they are also present in more sophisticated products where personalization is critical, like recommender systems for restaurants, financial services, life insurance, online dating, and Twitter followers.
In this new era, where the quantity of information is huge, recommender systems are extremely useful in several domains. People cannot be experts in all the domains in which they are users, and they do not have enough time to spend looking for the perfect TV or book to buy. In particular, recommender systems are really interesting when dealing with the following issues:
There are several different ways to build a recommender system. However, most of
them take one of two basic approaches: content-based filtering (CBF) or collabora-
tive filtering (CF).
CBF methods are built around the following paradigm: “Show me more of the same as what I have liked”. So, this approach will recommend items which are similar to those the user liked before, and the recommendations are based on descriptions of the items and a profile of the user's preferences. The computation of the similarity between items is the most important part of these methods, and it is based on the content of the items themselves. As the content of an item can be very diverse, and it usually depends on the kind of items the system recommends, a range of sophisticated algorithms are usually used to abstract features from items. When dealing with textual information such as books or news, a widely used representation is tf–idf. The term tf–idf stands for term frequency–inverse document frequency; it is a numerical statistic that measures how important a word is to a document in a collection or corpus.
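As a quick, hedged illustration (not part of the book's own listings), scikit-learn's TfidfVectorizer can turn a small corpus into such a tf–idf representation; the toy documents below are made up for the example:

from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus (made-up documents, for illustration only)
docs = ["a fantastic science fiction film",
        "a boring film about fiction writers",
        "a fantastic documentary about science"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # one row per document, one column per term
print(vectorizer.get_feature_names_out())   # the vocabulary
print(tfidf.toarray().round(2))             # tf-idf weights per document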
An interesting content-based filtering system is Pandora.1 This music recommender system uses up to 400 song and artist properties in order to find songs similar to the original seed to recommend. These properties are a subset of the features studied by musicologists in The Music Genome Project, which describes a song in terms of its melody, harmony, rhythm, and instrumentation, as well as its form and the vocal performance.
1 http://www.pandora.com/.
CF methods are built around the following paradigm: “Tell me what is popular among my like-minded users”. This is a very intuitive paradigm, since it closely resembles what people usually do: ask for, or look at, the preferences of the people they trust. An important working hypothesis behind this kind of recommender is that similar users tend to like similar items. In order to exploit it, these approaches are based on collecting and analyzing a large amount of data related to the behavior, activities, or tastes of users, and predicting what users will like based on their similarity to other users. One of the main advantages of this type of system is that it does not need to “understand” the item it is recommending.
Nowadays, these methods are extremely popular because of their simplicity and the large amount of data available from users. The main drawbacks of this kind of method are the need for a user community, as well as the cold-start effect for new users in the community. The cold-start problem appears when the system cannot draw any inference, or an optimal one, for users (or items) because it has not yet obtained sufficient information about them.
CF can be of two types: user-based or item-based.
• User-based CF works like this: Find users similar to me and recommend what they liked. In this method, given a user U, we first find a set of other users, D, whose ratings are similar to the ratings of U, and then we use them to calculate a prediction for U.
• Item-based CF works like this: Find items similar to those that I previously liked. In item-based CF, we first build an item–item matrix that determines relationships between pairs of items; then, using this matrix and data on the current user U, we infer the user's taste. Typically, this approach is used in the domain: people who buy x also buy y. This is a really popular approach used by companies like Amazon. Moreover, one of the advantages of this approach is that items usually do not change much, so their similarities can be computed offline (a small illustrative sketch of such an item–item matrix is given after this list).
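A small, self-contained sketch (with made-up ratings, not from the MovieLens example used later) of how such an item–item relationship matrix can be built from a user–item table with pandas:

import pandas as pd

# tiny made-up ratings table: user_id, movie_id, rating
ratings = pd.DataFrame({'user_id':  [1, 1, 2, 2, 3, 3],
                        'movie_id': [10, 20, 10, 30, 20, 30],
                        'rating':   [5, 4, 4, 2, 5, 1]})
# users on the rows, items on the columns, missing ratings filled with 0
user_item = ratings.pivot_table(index='user_id', columns='movie_id',
                                values='rating', fill_value=0)
# correlation between item columns as a simple item-item similarity matrix
item_item = user_item.corr()
print(item_item.round(2))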
Both CBF and CF recommender systems require understanding the user preferences. Understanding how to model a user's preferences is a critical step, due to the variety of possible sources. It is not the same to deal with an application like the movie recommender from Netflix, where the users rate the movies with 1–5 stars, as to deal with a product recommender system like Amazon's, where usually the tracking information of the purchases is used. In the latter case, three values can be used: 0—not bought; 1—viewed; 2—bought.
The most common types of labels used to estimate the user preferences are:
In the following sections of this chapter, we only consider the numerical expression described as stars on a scale of 1 to 5.
In this section, we will play with a real dataset to implement a movie recommender system. We will build a user-based collaborative filtering system with the MovieLens dataset.
2 http://grouplens.org/datasets/movielens/.
Once you have downloaded and unzipped the file into a directory, you can create
a DataFrame with the following code:
In [1]:
import pandas as pd

# Load user data
u_cols = ['user_id', 'age', 'sex',
          'occupation', 'zip_code']
users = pd.read_csv('files/ch09/ml-100k/u.user',
                    sep='|', names=u_cols)
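The rest of the chapter also uses a data DataFrame that combines users, ratings, and movie titles. One possible way to build it from the same MovieLens files (the column names follow the dataset's README; treat this as a sketch) is:

# Load ratings and movie titles and merge everything into a single DataFrame
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('files/ch09/ml-100k/u.data',
                      sep='\t', names=r_cols)
m_cols = ['movie_id', 'title', 'release_date',
          'video_release_date', 'imdb_url']
movies = pd.read_csv('files/ch09/ml-100k/u.item',
                     sep='|', names=m_cols, usecols=range(5),
                     encoding='latin-1')
data = pd.merge(pd.merge(ratings, users), movies)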
If you explore the dataset in detail, you will see that it consists of:
• 100,000 ratings from 943 users of 1682 movies. Ratings are from 1 to 5.
• Each user has rated at least 20 movies.
• Simple demographic info for the users (age, gender, occupation, zip).
First of all, let’s build our naïve recommender: A random recommender system.
We can define our recommender system with a Python class. This class consists
of a constructor and three methods: fit, predict_score and predict_top.
In [2]:
class RandomRecommender():
    def fit(self, X_train): self.items = X_train.movie_id.unique()  # sketch: just remember the item catalog
    def predict_score(self, user_id, movie_id): return np.random.uniform(1, 5)
    def predict_top(self, user_id, at=5):  # random top-N (reconstructed body)
        recommended_items = np.random.choice(self.items, size=at, replace=False)
        return recommended_items
For the evaluation of the system, we define a function called evaluate. This
function estimates the score for all items in the test set (X_test) and compares
them with the real values using the RMSE, precision, recall and the Mean Average
Precision (MAP).
In [3]:
def evaluate(rec_object, train, test,
             at=5, thr_relevant=4):
    """ Performance evaluation """
    # RMSE evaluation
    ids_to_estimate = zip(test.user_id, test.movie_id)
    y_estimated = np.array([rec_object.predict_score(u, i)
                            if u in train.user_id.values else 3
                            for (u, i) in ids_to_estimate])
    y_real = test.rating.values
    rmse = compute_rmse(y_estimated, y_real)
    print("RMSE: {:.4f}".format(rmse))

    # Ranking evaluation: Precision, Recall and MAP at the top-"at" items
    cumulative_precision = 0.0
    cumulative_recall = 0.0
    cumulative_AP = 0.0
    num_eval = 0
    # (reconstructed loop) iterate over the users and their relevant test items
    for user_id in test.user_id.unique():
        relevant_items = test[(test.user_id == user_id) &
                              (test.rating >= thr_relevant)].movie_id.values
        if len(relevant_items) > 0:
            recommended_items = rec_object.predict_top(
                user_id, at=at)
            num_eval += 1
            cumulative_precision += precision(
                recommended_items, relevant_items)
            cumulative_recall += recall(recommended_items,
                                        relevant_items)
            cumulative_AP += AP(recommended_items,
                                relevant_items)
    cumulative_precision /= num_eval
    cumulative_recall /= num_eval
    MAP = cumulative_AP / num_eval
    print("Precision: {:.4f}, Recall: {:.4f}, MAP: {:.4f}".format(
        cumulative_precision, cumulative_recall, MAP))
We are ready to create the object and train it using training data. We can do it with
the following code:
In [4]:
random_recsys = RandomRecommender()
random_recsys.fit(X_train)
evaluate(random_recsys, X_train, X_test, at=50)
The random model has achieved an RMSE of 1.6241 and Precision, Recall and
MAP of 0.0%. These are the numbers we need to beat.
Prediction Function
The prediction function behind user-based CF will be based on the movie ratings from similar users. So, in order to recommend a movie, p, from a set of movies, P, to a given user, a, we first need to see the set of users, B, who have already seen p. Then, we need to see the taste similarity between the users in B and user a. The simplest prediction function for a user a and movie p can be defined as follows:

$$\mathrm{pred}(a, p) = \frac{\sum_{b \in B} \mathrm{sim}(a, b)\, r_{b,p}}{\sum_{b \in B} \mathrm{sim}(a, b)} \qquad (9.1)$$

where sim(a, b) is the similarity between user a and user b, B is the set of users in the dataset that have already seen p, and r_{b,p} is the rating of p by b.
Let us give an example (see Table 9.1). Imagine the system can only recommend one movie, since the rest have already been seen by the user. So, we only want to estimate the score corresponding to that movie. The movie has been seen by Paul, Alice, Marc, and Anne and scored 3, 3, 4, and 3, respectively. The similarity between user a and Paul, Alice, Marc, and Anne has been computed “somehow” (we will see later how we can compute it) and the values are 0.99, 0.38, 0.89 and 0.92, respectively. If we follow the previous equation, the estimated score is 3.41, as seen in Table 9.1.
The computation of the similarity between users is one of the most critical steps in CF algorithms. The basic idea behind the similarity computation between two users a and b is that we can first isolate the set P of items rated by both users, and then apply a similarity computation technique to determine the similarity.
The set of common_movies can be obtained with the following code:
In [6]:
# dataframe with the data from user 1
data_user_1 = data[data.user_id == 1]
# dataframe with the data from a second user (user 2 is used here as an example)
data_user_2 = data[data.user_id == 2]
# movies rated by both users
common_movies = set(data_user_1.movie_id).intersection(data_user_2.movie_id)
mask = (data_user_2.movie_id.isin(common_movies))
data_user_2 = data_user_2[mask]
print(data_user_2[['title', 'rating']].head())
Once the set of ratings for all movies common to the two users has been obtained,
we can compute the user similarity. Some of the most common similarity functions
used in CF methods are as follows:
Euclidean distance:

$$\mathrm{sim}(a, b) = \frac{1}{1 + \sqrt{\sum_{p \in P} (r_{a,p} - r_{b,p})^2}} \qquad (9.2)$$

Pearson correlation:

$$\mathrm{sim}(a, b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\;\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}} \qquad (9.3)$$

where $\bar{r}_a$ and $\bar{r}_b$ are the mean ratings of users a and b.

Cosine distance:

$$\mathrm{sim}(a, b) = \frac{a \cdot b}{|a| \cdot |b|} \qquad (9.4)$$
Now, the question is: which function should we use? The answer is that there is no fixed recipe, but there are some issues we can take into account when choosing the proper similarity function. On the one hand, Pearson correlation usually works better than Euclidean distance, since it is based more on the ranking than on the raw values. So, two users who usually like the same set of items, even though they rate on different scales, will come out as similar users with Pearson correlation but not with Euclidean distance. On the other hand, when dealing with binary/unary data, i.e., like versus not like or buy versus not buy, instead of scalar or real data like ratings, cosine distance is usually used.
Let us define a function to compute the similarity between two users:
In [7]:
from scipy.stats import pearsonr
from scipy.spatial.distance import euclidean

def user_sim(df, user_1, user_2, method='pearson'):
    # (reconstructed) ratings of the two users on the movies they share
    df_shared = pd.merge(df[df.user_id == user_1],
                         df[df.user_id == user_2], on='movie_id')
    if method == 'euclidean':
        return 1 / (1 + euclidean(df_shared['rating_x'], df_shared['rating_y']))
    if method == 'pearson':
        res = pearsonr(df_shared['rating_x'], df_shared['rating_y'])[0]
        if np.isnan(res):
            return 0
        return res
Figure 9.2 shows the correlation plots for user 1 versus user 8 and user 1 versus user 31. Each point in the plots corresponds to a pair of ratings given by the two users to the same movie. The bigger the dot, the larger the set of movies rated with the corresponding pair of values. We can observe in these plots that the ratings from user 1 are more correlated with the ratings from user 8 than with those from user 31. However, as we can observe in the following outputs, the Euclidean similarity between user 1 and user 31 is higher than between user 1 and user 8.
In [8]:
print("Euclidean Similarity",
      user_sim(data, 1, 8, method='euclidean'))
print("Pearson Similarity", user_sim(data, 1, 8))
print("Euclidean Similarity",
      user_sim(data, 1, 31, method='euclidean'))
print("Pearson Similarity", user_sim(data, 1, 31))
Evaluation
In order to validate the system, we will divide the dataset into two different sets: one called X_train, containing 80% of the data from each user; and another called X_test, with the remaining 20% of the data from each user. In the following code we create a function assign_to_set that creates a new column in the DataFrame indicating which set each sample belongs to.
In [9]:
def assign_to_set(df):
    sampled_ids = np.random.choice(
        df.index,
        size=np.int64(np.ceil(df.index.size * 0.2)),
        replace=False)
    df.loc[sampled_ids, 'for_testing'] = True
    return df

# (reconstructed) mark 20% of each user's ratings for testing, then split
data['for_testing'] = False
data = data.groupby('user_id', group_keys=False).apply(assign_to_set)
X_train = data[data.for_testing == False]
X_test = data[data.for_testing == True]
The resulting X_train and X_test sets have 79619 and 20381 ratings, respec-
tively.
Once the data is divided into these sets, we can build a model with the training set and evaluate its performance using the test set. To validate our methods we can use the standard Root Mean Square Error (RMSE):

$$RMSE = \sqrt{\frac{\sum (\hat{y} - y)^2}{n}} \qquad (9.5)$$

where $y$ is the real rating and $\hat{y}$ is the predicted rating.
In [10]:
def compute_rmse(y_pred, y_true):
""" Compute Root Mean Squared Error. """
return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))
Alternatively, we can use ranking measures such as Precision, Recall, or Average Precision (AP), computed over a set of recommended items:
In [11]:
def precision(recommended_items, relevant_items):
    is_relevant = np.in1d(recommended_items, relevant_items, assume_unique=True)
    precision = np.sum(is_relevant) / len(is_relevant)
    return precision
def recall(recommended_items, relevant_items):
    is_relevant = np.in1d(recommended_items, relevant_items, assume_unique=True)
    recall = np.sum(is_relevant) / len(relevant_items)
    return recall
def AP(recommended_items, relevant_items):
    is_relevant = np.in1d(recommended_items, relevant_items, assume_unique=True)
    # mean of precision@k over the relevant positions (reconstructed)
    p_at_k = is_relevant * np.cumsum(is_relevant) / (1 + np.arange(len(is_relevant)))
    ap_score = np.sum(p_at_k) / np.min([len(relevant_items), len(is_relevant)])
    return ap_score
Similarly, we can build our collaborative recommender system with a Python class. This class consists of a constructor and three methods: fit, predict_score and predict_top. In this case, fit will be in charge of computing and storing the user similarities in a Python dictionary. This is a really simple scheme, but quite expensive in terms of computation when dealing with a large dataset; we show one of the most basic schemes in order to illustrate it, and more complex algorithms can be used in order to improve the computational cost. Moreover, online strategies can be used when dealing with really dynamic problems. In predict_score, the score for a movie and a user is estimated.
In [12]:
class CollaborativeFiltering:
    '''Collaborative filtering model with three methods:
       fit, predict_score and predict_top'''
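A minimal, illustrative way to complete such a class, assuming the user_sim function defined earlier and the prediction rule of Eq. (9.1), could look as follows (a sketch written for clarity, not speed):

class CollaborativeFiltering:
    '''User-based CF sketch, built on the user_sim function defined above.'''
    def fit(self, X_train):
        # keep the training ratings and precompute pairwise user similarities
        self.data = X_train
        users = X_train.user_id.unique()
        self.sims = {u: {} for u in users}
        for i, u in enumerate(users):
            for v in users[i + 1:]:
                s = user_sim(X_train, u, v)
                self.sims[u][v] = self.sims[v][u] = s

    def predict_score(self, user_id, movie_id):
        # weighted average of the ratings of similar users who saw the movie (Eq. 9.1)
        seen = self.data[self.data.movie_id == movie_id]
        num = den = 0.0
        for _, row in seen.iterrows():
            s = self.sims.get(user_id, {}).get(row.user_id, 0)
            if s > 0:
                num, den = num + s * row.rating, den + s
        return num / den if den > 0 else 3  # neutral fallback when no similar user

    def predict_top(self, user_id, at=5):
        # rank the movies the user has not rated yet by their predicted score
        rated = set(self.data[self.data.user_id == user_id].movie_id)
        scores = [(self.predict_score(user_id, m), m)
                  for m in self.data.movie_id.unique() if m not in rated]
        top_items = [m for _, m in sorted(scores, reverse=True)[:at]]
        return top_items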
Now, the collaborative model can be trained and executed with the following lines:
In [13]:
cf_model = CollaborativeFiltering()
cf_model.fit(X_train)
evaluate(cf_model, X_train, X_test, at=50)
9.6 Conclusions
In this chapter, we have introduced what recommender systems are, how they work, and how they can be implemented in Python. We have seen that there are different types of recommender systems based on the information they use, as well as the output they produce. We have introduced content-based recommender systems and collaborative recommender systems, and we have seen the importance of defining the similarity function between items and users.
We have learned how recommender systems can be implemented in Python in order to answer questions such as: which movie should I rent? We have also discussed how recommender systems should be evaluated, using several online and offline metrics.
Finally, we have worked with a publicly available dataset from GroupLens in
order to implement and evaluate a collaborative recommendation system for movie
recommendations.
Acknowledgements This chapter was co-written by Santi Seguí and Eloi Puertas.
References
1. F. Ricci, L. Rokach, B. Shapira, in Recommender Systems Handbook (Springer, 2015)
2. G. Shani, A. Gunawardana, A survey of accuracy evaluation metrics of recommendation tasks.
J. Mach. Learn. Res. 10, 2935–2962 (2009)
10 Basics of Natural Language Processing
10.1 Introduction
In this chapter, we will go into the basics of Natural Language Processing (NLP) to
process text data with examples of application related to sentiment analysis. The term
sentiment analysis (or opinion mining) refers to the analysis from data of the attitude
of the subject with respect to a particular topic. This attitude can be a judgment
(appraisal theory), an affective state, or the intended emotional communication.
Generally, sentiment analysis is performed based on the processing of natural
language, the analysis of text and computational linguistics. Although data can come
from different data sources, in this chapter we will analyze sentiment in text data,
using two particular text data examples: one from film critics, where the text is highly
structured and maintains text semantics; and another example coming from social
networks (tweets in this case), where the text can show a lack of structure and users
may use (and abuse!) text abbreviations.
In the following sections, we will review some basic procedures of NLP that
are common to perform sentiment analysis. In particular, we will analyze the steps
required for data cleaning (that is, removing irrelevant text items not associated
with sentiment information), producing a general representation of the text, and
performing some statistical inference on the text represented to determine positive
and negative sentiments.
Although the scope of sentiment analysis may introduce many aspects to be analyzed, in this chapter, and for simplicity, we will analyze binary sentiment categorization problems. We will thus basically learn to classify positive against negative opinions from text data. The scope of sentiment analysis is broader, and it includes many aspects that make the analysis of sentiments a challenging task. Some interesting open issues in this topic are as follows:
• Identification of sarcasm: sometimes without knowing the personality of the per-
son, you do not know whether ‘bad’ means bad or good.
• Lack of text structure: in the case of Twitter, for example, it may contain abbre-
viations, and there may be a lack of capitals, poor spelling, poor punctuation, and
poor grammar, all of which make it difficult to analyze the text.
• Many possible sentiment categories and degrees: positive versus negative is a simple analysis; one would also like to identify the amount of hate inside the opinion, how much happiness, how much sadness, etc.
• Identification of the object of analysis: many concepts can appear in text, and how
to detect the object that the opinion is positive for and the object that the opinion is
negative for is an open issue. For example, if you say “She won him!”, this means
a positive sentiment for her and a negative sentiment for him, at the same time.
• Subjective text: another open challenge is how to analyze very subjective sentences
or paragraphs. Sometimes, even for humans it is very hard to agree on the sentiment
of these highly subjective texts.
In order to perform sentiment analysis, first we need to deal with some processing steps on the data. Next, we will apply the different steps to simple ‘toy’ sentences to understand each one better. Later, we will perform the whole process on larger datasets.
Given the input text data in cell [1], the main task of data cleaning is to remove
those characters considered as noise in the data mining process. For instance, comma
or colon characters. Of course, in each particular data mining problem different char-
acters can be considered as noise, depending on the final objective of the analysis. In
our case, we are going to consider that all punctuation characters should be removed,
including other non-conventional symbols. In order to perform the data cleaning pro-
cess and posterior text representation and analysis we will use the Natural Language
Toolkit (NLTK) library for the examples in this chapter.
In [1]:
raw_docs = ["Here are some very simple basic sentences.",
"They won’t be very interesting, I’m afraid.",
"The point of these examples is to _learn how basic text \
cleaning works_ on *very simple* data."]
The first step consists of defining a list with all word-vectors in the text. NLTK
makes it easy to convert documents-as-strings into word-vectors, a process called
tokenizing. See the example below.
In [2]:
from nltk.tokenize import word_tokenize
tokenized_docs = [word_tokenize(doc) for doc in raw_docs]
print(tokenized_docs)
Thus, for each line of text in raw_docs, the word_tokenize function will build the list of word-vectors. Now we can search the list for punctuation symbols, for instance, and remove them. There are many ways to perform this step. Let us see one possible solution using the String library.
In [3]:
import string
string.punctuation
Out[3]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
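A possible way to drop these symbols from the tokenized documents, using regular expressions in the same spirit as the BoW function shown later in this chapter (a sketch, not the book's exact cell):

import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    cleaned = [regex.sub(u'', token) for token in review]
    tokenized_docs_no_punctuation.append([t for t in cleaned if t != u''])
print(tokenized_docs_no_punctuation)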
One can see that punctuation symbols are removed, and those words that contained a punctuation symbol are kept (marked with an initial u, since they are unicode strings). If the reader wants more details, we recommend reading about the RE package1 for handling regular expressions.
Another important step in many data mining systems for text analysis consists of
stemming and lemmatizing. Morphology is the notion that words have a root form.
If you want to get to the basic term meaning of the word, you can try applying
a stemmer or lemmatizer. This step is useful to reduce the dictionary size and the
posterior high-dimensional and sparse feature spaces. NLTK provides different ways
of performing this procedure. In the case of running the porter.stem(word)
approach, the output is shown next.
In [5]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
#snowball = SnowballStemmer('english')
#wordnet = WordNetLemmatizer()

# (reconstructed loop) stem every token of every cleaned document
preprocessed_docs = []
for doc in tokenized_docs_no_punctuation:
    final_doc = []
    for word in doc:
        final_doc.append(porter.stem(word))
        #final_doc.append(snowball.stem(word))
        #final_doc.append(wordnet.lemmatize(word))
    preprocessed_docs.append(final_doc)
print(preprocessed_docs)
This kind of approach is very useful in order to reduce the exponential number of combinations of words with the same meaning and to match similar texts. Words such as “interest” and “interesting” will be converted into the same word, “interest”, making the comparison of texts easier, as we will see later.
1 https://docs.python.org/2/library/re.html.
Another very useful data cleaning procedure consists of removing HTML entities
and tags. Those may contain words and other symbols that were not removed by
applying the previous procedures, but that do not provide useful meaning for text
analysis and will introduce noise in our posterior text representation procedure. There
are many possibilities for removing these tags. Here we show another example using
the same NLTK package.
In [6]:
import nltk

test_string = ("<p>While many of the stories tugged at the heartstrings, "
               "I never felt manipulated by the authors. (Note: Part of the "
               "reason why I don't like the 'Chicken Soup for the Soul' series "
               "is that I feel that the authors are just dying to make the "
               "reader clutch for the box of tissues.)</a>")
print('Original text:')
print(test_string)
# note: recent NLTK versions remove clean_html and recommend BeautifulSoup instead
print('Cleaned text:', nltk.clean_html(test_string))
Cleaned text:
u"While many of the stories tugged at the heartstrings, I never
felt manipulated by the authors. (Note: Part of the reason why I
don’t like the "Chicken Soup for the Soul" series is that I feel
that the authors are just dying to make the reader clutch for the
box of tissues.)"
You can see that tags such as “<p>” and “</a>” have been removed. The reader is referred to the RE package documentation to learn more about how to use it for data cleaning and HTML parsing to remove tags.
In the previous section we have analyzed different techniques for data cleaning, stem-
ming, and lemmatizing, and filtering the text to remove other unnecessary tags for
posterior text analysis. In order to analyze sentiment from text, the next step consists
of having a representation of the text that has been cleaned. Although different rep-
resentations of text exist, the most common ones are variants of Bag of Words (BoW)
models [1]. The basic idea is to think about word frequencies. If we can define a
dictionary of possible different words, the number of different existing words will
define the length of a feature space to represent each text. See the toy example in
Fig. 10.1. Two different texts represent all the available texts we have in this case.
The total number of different words in this dictionary is seven, which will represent
the length of the feature vector. Then we can represent each of the two available texts
in the form of this feature vector by indicating the number of word frequencies, as
shown in the bottom of the figure. The last two rows will represent the feature vector
codifying each text in our dictionary.
Next, we will see a particular case of the bag of words, the vector space model of text: TF–IDF (term frequency–inverse document frequency). First, we need to count the terms per document, which gives the term frequency vector. See a code example below.
In [7]:
mydoclist = ['Mireia loves me more than Hector loves me',
             'Sergio likes me more than Mireia loves me',
             'He likes basketball more than football']

from collections import Counter
for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print(tf.items())
In [8]:
c = Counter()            # a new, empty counter
c = Counter('gallahad')  # a new counter from an iterable
Counter objects have a dictionary interface except that they return a zero count
for missing items instead of raising a KeyError.
In [9]:
c = Counter(['eggs', 'ham'])
c['bacon']
Out[9]: 0
Let us call this a first stab at representing documents quantitatively, just by their
word counts (also thinking that we may have previously filtered and cleaned the text
using previous approaches). Here we show an example for computing the feature
vector based on word frequencies.
In [10]:
def build_lexicon(corpus):
    # define a set with all possible words included in all the sentences or "corpus"
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def tf(term, document):
    return freq(term, document)

def freq(term, document):
    return document.split().count(term)

vocabulary = build_lexicon(mydoclist)
doc_term_matrix = []
print('Our vocabulary vector is [' +
      ', '.join(list(vocabulary)) + ']')
for doc in mydoclist:
    print('The doc is "' + doc + '"')
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(freq, 'd')
                                 for freq in tf_vector)
    print('The tf vector for Document %d is [%s]'
          % ((mydoclist.index(doc) + 1), tf_vector_string))
    doc_term_matrix.append(tf_vector)

print('All combined, here is our master document term matrix: ')
print(doc_term_matrix)
In [11]:
import math
import numpy as np

def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))

print('A regular old document term matrix: ')
print(np.matrix(doc_term_matrix))
print('\nA document term matrix with row-wise L2 norm:')
print(np.matrix(doc_term_matrix_l2))
You can see that we have scaled down the vectors so that each element lies between [0, 1]. This avoids a diminishing return on the informative value of a word massively used in a particular document: we scale down words that appear too frequently within a document.
Finally, we have one last task to perform. Just as not all words are equally valuable within a document, not all words are valuable across all documents. We can try reweighting every word by its inverse document frequency.
In [12]:
def numDocsContaining(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount += 1
    return doccount

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    return np.log(n_samples / (float(df)))

my_idf_vector = [idf(word, mydoclist) for word in vocabulary]
print('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
print('The inverse document frequency vector is [' +
      ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']')
Out[13]: [[ 0.40546511 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 1.09861229 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0.40546511 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 1.09861229 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.40546511 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0.40546511 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 1.09861229 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 1.09861229 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1.09861229 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]
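The matrix printed above places the idf values on the diagonal of a square matrix; the my_idf_matrix variable used in the next cell can be built like this (a minimal sketch, assuming the my_idf_vector computed before):

import numpy as np

def build_idf_matrix(idf_vector):
    # square matrix with the idf values on the diagonal
    idf_matrix = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_matrix, idf_vector)
    return idf_matrix

my_idf_matrix = build_idf_matrix(my_idf_vector)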
That means we can now multiply every term frequency vector by the inverse
document frequency matrix. Then, to make sure we are also accounting for words
that appear too frequently within documents, we will normalize each document using
the L2 norm.
In [14]:
doc_term_matrix_tfidf = []
# performing tf-idf matrix multiplication
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.dot(tf_vector, my_idf_matrix))

# normalizing
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))

print(vocabulary)
# np.matrix() just to make it easier to look at
print(np.matrix(doc_term_matrix_tfidf_l2))
It is sometimes useful to take significant bi-grams into account in the BoW model. Note that this example can be extended to n-grams. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, according to the application. The n-grams are typically collected from a text or speech corpus.
An n-gram of size 1 is referred to as a “uni-gram”; size 2 is a “bi-gram” (or, less commonly, a “digram”); size 3 is a “tri-gram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on. These n-grams can be introduced within the BoW model just by considering each different n-gram as a new position within the feature vector representation.
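For instance, scikit-learn's CountVectorizer can build such a representation with uni-grams and bi-grams in a couple of lines (a small illustration, not taken from the book's listings):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))  # uni-grams and bi-grams
bow = vectorizer.fit_transform(['He likes basketball more than football'])
print(vectorizer.get_feature_names_out())  # each bi-gram is a new feature position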
Python packages provide useful tools for analyzing text. The reader is referred to
the NLTK and Textblob package2 documentation for further details. Here, we will
perform all the previously presented procedures for data cleaning, stemming, and
representation and introduce some binary learning schemes to learn the text repre-
sentations in the feature space. The binary learning schemes will receive examples
for training positive and negative sentiment texts and we will apply them later to
unseen examples from a test set.
We will apply the whole sentiment analysis process in two examples. The first corresponds to the Large Movie Review dataset [2]. This is one of the largest publicly available datasets for sentiment analysis, and it includes more than 50,000 texts from movie reviews, including the ground truth annotation related to positive and negative movie reviews. As a proof of concept, for this example we use a subset of the dataset consisting of about 30% of the data.
The code re-uses part of the previous examples for data cleaning, and reads the training and testing data from the folders as provided by the authors of the dataset. Then, TF-IDF is computed, which performs all the steps mentioned previously for computing the feature space, normalization, and feature weights. Note that at the end of the script we
perform training and testing based on two different state-of-the-art machine learning
approaches: Naive Bayes and Support Vector Machines. It is beyond the scope of
this chapter to give details of the methods and parameters. The important point here
is that the documents are represented in feature spaces that can be used by different
data mining tools.
2 https://textblob.readthedocs.io/en/dev/.
In [15]:
import re
import string
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.classify import NaiveBayesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from unidecode import unidecode

def BoW(text):
    # Tokenizing text
    text_tokenized = [word_tokenize(doc) for doc in text]
    # Removing punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    tokenized_docs_no_punctuation = []
    for review in text_tokenized:
        new_review = []
        for token in review:
            new_token = regex.sub(u'', token)
            if not new_token == u'':
                new_review.append(new_token)
        tokenized_docs_no_punctuation.append(new_review)
    # Stemming and Lemmatizing
    porter = PorterStemmer()
    preprocessed_docs = []
    for doc in tokenized_docs_no_punctuation:
        final_doc = ''
        for word in doc:
            final_doc = final_doc + ' ' + porter.stem(word)
        preprocessed_docs.append(final_doc)
    return preprocessed_docs
In [16]:
print('Training and testing on training Naive Bayes')
gnb = GaussianNB()
y_pred = gnb.fit(trainData.todense(),
                 targetTrain).predict(trainData.todense())
print("Number of mislabeled training points out of a total %d points : %d"
      % (trainData.shape[0], (targetTrain != y_pred).sum()))
We can see that the training error of Naive Bayes on the selected data is 129/4313, while in testing it is 2087/6292. Interestingly, the training error using SVM is higher (1288/4313), but it provides a better generalization on the test set than Naive Bayes (1680/6292). Thus it seems that Naive Bayes produces more overfitting of the data (it picks up particular features that help to learn the training data but that do not transfer to the test set, thus reducing the generalization capability of the technique). However, note that this is a simple execution with standard methods on a subset of the dataset provided. More data, as well as many other aspects, will influence the performance. For instance, we could enrich our dictionary by introducing a list of already studied positive and negative words.3 For further details of the analysis of this dataset, the reader is referred to [2].
Finally, let us see another example of sentiment analysis, based on tweets. Although there is some work using larger sets of tweet data,4 here we present a reduced set of tweets which are analyzed as in the previous example of movie reviews. The main code remains the same except for the definition of the initial data.
In [17]:
textTrain = ['I love this sandwich.', 'This is an amazing place!',
             'I feel very good about these beers.', 'This is my best work.',
             'What an awesome view', 'I do not like this restaurant',
             'I am tired of this stuff.', 'I can not deal with this',
             'He is my sworn enemy!', 'My boss is horrible.']
targetTrain = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
preprocessed_docs = BoW(textTrain)
tfidf_vectorizer = TfidfVectorizer(min_df=1)
trainData = tfidf_vectorizer.fit_transform(preprocessed_docs)
In this scenario both learning strategies achieve the same recognition rates in both
training and test sets. Note that similar words are shared between tweets. In practice,
with real examples, tweets will include unstructured sentences and abbreviations,
making recognition harder.
10.5 Conclusions
In this chapter, we reviewed the basics of NLP applied to the problem of binary sentiment analysis from text data: data cleaning to remove irrelevant symbols, punctuation and tags; stemming in order to define the same root for different words with the same meaning in terms of sentiment; defining a dictionary of words (including n-grams); and representing text in terms of a feature space with the length of the dictionary. We have also seen codification in this feature space, based on normalized and weighted term frequencies. We have defined feature vectors that can be used by any machine learning technique in order to perform sentiment analysis (binary classification in the examples shown), and reviewed some useful Python packages, such as NLTK and Textblob, for sentiment analysis.
As discussed in the introduction of this chapter, we have only reviewed the sentiment analysis problem and described common procedures for addressing it as a binary classification problem. Several open issues can be addressed
in further research, such as the identification of sarcasm, a lack of text structure (as
in tweets), many possible sentiment categories and degrees (not only binary but also
multiclass, regression, and multilabel problems, among others), identification of the
object of analysis, or subjective text, to name a few.
The NLP tools described in this chapter can define a basis for dealing with those
more challenging problems. One recent example of current state-of-the-art research is the work of [3], where deep learning architectures are used for sentiment analysis. Deep learning strategies are currently a powerful tool in the fields of pattern recognition, machine learning, and computer vision, among others; the main deep learning strategies are based on neural network architectures. In the work of [3], a deep learning model builds up a representation of whole sentences based on the sentence structure, and it computes the sentiment based on how words form the meaning of longer phrases. In the methods explained in this chapter, n-grams are the only features that capture such semantics. Although texts that lack structure, such as tweets, would not benefit much from this technology, movie reviews, as well as most existing text mining problems, will. For further discussion in this field, the reader is referred to [4, 5].
Acknowledgements This chapter was co-written by Sergio Escalera and Santi Seguí.
References
1. Z. Ren, J. Yuan, J. Meng, Z. Zhang, IEEE Trans. Multimedia 15(5), 1110 (2013)
2. A.L. Maas, R.E. Daly, P.T. Pham, D. Huang, A.Y. Ng, C. Potts, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Association for Computational Linguistics, Portland, Oregon, USA, 2011), pp. 142–150. http://www.aclweb.org/anthology/P11-1015
3. R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, C. Potts, Conference on Empirical
Methods in Natural Language Processing (2013)
4. E. Cambria, B. Schuller, Y. Xia, C. Havasi, IEEE Intell. Syst. 28(2), 15 (2013)
5. B. Pang, L. Lee, Found. Trends Inf. Retr. 2(1–2), 1 (2008)
11 Deep Learning
11.1 Introduction
11.2 Perceptron
The Perceptron can be considered the simplest neural network model, and is usually referred to as the building block of artificial neural networks. Given an input vector x of n values $(x_1, x_2, \ldots, x_n)$, often called input features, this model outputs either 1 (yes) or 0 (no). Mathematically, the model is defined as:

$$y = \sigma(w \cdot x + b)$$

where y is the output of the model, w represents the weights associated with the input features, b is the bias term, and $\sigma$ is known as the activation function, which in the case of the original perceptron is the linear threshold unit defined as:

$$f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0,\\ 0 & \text{otherwise} \end{cases}$$
As you may observe, the Perceptron bears a strong resemblance to Logistic Regres-
sion. However, while Logistic Regression offers a probabilistic output, the Perceptron
produces an output determined by a strict threshold. This is one of the reasons why
Logistic Regression is typically favored over the Perceptron for linear classification
tasks.
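As a small illustration (the numbers are made up), the perceptron's forward pass can be written in a few lines of NumPy:

import numpy as np

def perceptron(x, w, b):
    # linear threshold unit: output 1 if w.x + b > 0, and 0 otherwise
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([0.5, -1.0, 2.0])   # input features (arbitrary example values)
w = np.array([1.0, 0.2, -0.4])   # weights
b = 0.1                          # bias
print(perceptron(x, w, b))       # prints 0, since w.x + b = -0.4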
Basic neurons, such as Perceptrons, can be assembled into more complex structures
by employing various sets of weights on the same data vector. This arrangement
creates what is known as a layer, and these layers can be stacked one on top of the
other to build a multilayer network.
It is important to notice that a multilayer neural network can be seen as a composition of matrix products (matrices represent weights) and non-linear activation functions. For the case of a 2-layer network, the outcome is:

$$y = \sigma\left(W_1\, \sigma(W_0 x + b_0) + b_1\right)$$

where $\sigma$ represents a vectorial version of the activation function and $W_i$ are the weights of each layer in matrix form.
Non-linear activation functions play a pivotal role in neural networks as they
inject non-linearity into the model, empowering neural networks to discern and
adapt to intricate non-linear patterns within data. The choice of activation function
depends on the specific problem, the architecture of the neural network, and empirical
testing. Contemporary neural networks predominantly rely on the ReLU function,
ReLU(x) = max(0, x), rather than sigmoid, for their activation needs. ReLU and its variants (Leaky ReLU, PReLU, etc.) are often preferred in many scenarios due to
their computational efficiency and success in deep learning applications. However,
the choice of activation function is a hyperparameter that can be tuned to optimize a
neural network for a particular task.
What is interesting about multilayer perceptrons (MLPs) is that it has been shown that even a 2-layer neural network, with a single hidden layer,1 containing a finite number of neurons can approximate any continuous function of $\mathbb{R}^n$. This fact makes neural networks a sound candidate to implement learning from data methods. The question is then: how to find the optimal parameters, $w = (W_i, b)$, to approximate a function that is implicitly defined by a set of samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$?
1. Start by initializing the weights and biases of the neural network randomly.
2. Perform a forward pass through the neural network with a minibatch2 of training
data. A predicted output is obtained for each input sample.
3. Calculate the loss (or error) between the predicted output and the actual target
values. Common loss functions include mean squared error (MSE) for regression
tasks and categorical cross-entropy for classification tasks.
4. Backpropagation: This is the key step in gradient descent. In backpropagation, the
gradient (partial derivatives) of the loss function with respect to each parameter
(weights and biases) in the network is computed. This is done by applying the
chain rule of calculus in reverse order, starting from the output layer and moving
backward through the network.
5. Adjust the parameters (weights and biases) in the direction that reduces the loss.
This is done by subtracting the gradient of the loss with respect to each parameter
from the current parameter value.
1 In the context of neural networks, a hidden layer is a layer of artificial neurons or nodes that sits
between the input layer and the output layer. It is called “hidden” because it does not directly interact
with the external data or the final output of the network. Instead, the hidden layer(s) are responsible
for processing and transforming the input data before passing it to the output layer.
2 A minibatch is a subset of the training dataset that is used during the training of machine learning
models, including neural networks. Instead of using the entire training dataset in a single step (which
is known as batch or full-batch training), training is typically performed in smaller, manageable
batches.
Steps 2 to 5 are repeated for multiple epochs3 or until the loss converges to a
satisfactory level.
Typically, training neural networks involves running multiple epochs, often for
tens or hundreds of epochs, depending on the complexity of the problem and the
convergence behavior of the model. The number of epochs is a hyperparameter that
needs to be tuned during the training process. It is essential to monitor the model’s
performance (e.g., training loss and validation accuracy) as the epochs progress to
determine when to stop training or if additional epochs are needed to improve the
model’s performance.
Before proceeding further, we highly encourage the reader to visit this interactive
site: http://playground.tensorflow.org. Take the opportunity to experiment with cre-
ating neural network architectures to tackle three distinct challenges: handling two
Gaussian data sets, concentric data, and X-or data (illustrated in Fig. 11.1). It is an
excellent way to gain hands-on experience and insights into neural network behavior
across various scenarios.
Upon closer examination of both non-linear scenarios–the concentric and X-or
problems–it becomes evident that, despite employing multiple layers with numerous
neurons in each layer, the utilization of a linear activation function proves insufficient
for resolving the issue. Conversely, by integrating a non-linear activation function
like sigmoid or ReLU, we can achieve a solution with just a single hidden layer.
3 In deep learning, an epoch is one complete pass through the entire training dataset during the
training phase of a neural network. During an epoch, the algorithm processes the entire dataset,
computes the loss, and updates the model’s parameters (weights and biases) once.
Let us see how we can implement our first neural network model. There are several
libraries we can use with Python, such as Tensorflow,4 PyTorch,5 and Keras,6 to
name a few. These libraries provide powerful tools for developing and training neural
networks, making them essential for machine learning and deep learning tasks.
In this chapter, we have chosen to work with Keras due to its simplicity and
versatility. Keras is a high-level neural networks library, written in Python and capable
of running on top of TensorFlow or PyTorch. It offers a user-friendly and intuitive
interface for building and training neural networks, making it an excellent choice
for developers who want to quickly prototype and experiment with deep learning
models.
In order to run the notebooks, we can install the TensorFlow library or use Google Colab,7 Google's free cloud service for AI developers. With Colab, we can develop deep learning applications on the GPU for free, using the GPU as a backend for up to 12 h at a time!
To get started with Google Colab:
• You need to sign up and apply for access before you start using Google Colab: https://colab.research.google.com/.
• Once you get access, you can upload notebooks using File -> Upload Notebook.
• To enable the GPU backend for your notebook, use Runtime -> Change runtime type -> Hardware Accelerator -> GPU.
We will start with a simple dataset: the concentric circles. We can create the data as
follows (Fig. 11.2):
4 https://www.tensorflow.org/.
5 https://pytorch.org/.
6 https://keras.io/.
7 https://colab.research.google.com/.
In [1]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=1000,
                    factor=.4,
                    noise=0.05)
# Let us visualize the dataset
reds = y == 0
blues = y == 1

plt.figure()
plt.scatter(X[reds, 0], X[reds, 1],
            c="red", s=20, edgecolor='k')
plt.scatter(X[blues, 0], X[blues, 1],
            c="blue", s=20, edgecolor='k')
plt.show()
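The cells below use X_train, y_train, X_test, and y_test. With the train_test_split helper already imported above, one way to obtain them (the 80/20 proportion is an assumption) is:

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)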
The fundamental data structure in Keras is the model, which serves as a means to
organize layers. The primary model type is the Sequential model, representing
a linear stack of layers.
Creating a MLP with two hidden layers is as straightforward as follows:
In [3]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=64, input_dim=2, activation="relu"))
model.add(Dense(units=64, activation="relu"))
model.add(Dense(units=1, activation="sigmoid"))
As we can see in Fig. 11.3, the first hidden layer consists of 192 parameters, while the second layer contains 4,160 parameters. These parameter counts are calculated as follows: for the first hidden layer, there are 2 × 64 connection weights plus an additional 64 biases, and for the second hidden layer, there are 64 × 64 connections plus an additional 64 biases.
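These figures can be checked directly in Keras with the summary method, which is presumably what produced the table shown in Fig. 11.3:

model.summary()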
Once the model looks good to you, configure its learning process with the compile() method as follows:
In [5]:
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
If you need to, you can further configure your optimizer as follows:
In [6]:
from tensorflow.keras.optimizers import Adam

model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=0.01),
              metrics=['accuracy'])
With your model compiled, we can now proceed to train it on your data in batches
as follows:
In [7]:
model.fit(X_train, y_train,
          batch_size=32,
          epochs=5,
          verbose=1)
We can evaluate the model’s performance on the test set using the following code:
In [8]:
loss_and_metrics = model.evaluate(X_test, y_test)
In this practical case we will work with a dataset called MedMNIST, which can be accessed at https://medmnist.com. This dataset is a comprehensive collection of standardized biomedical images, resembling the original MNIST dataset, and comprises 12 datasets of 2D images and 6 datasets of 3D images. All the images in this dataset have been pre-processed to dimensions 28 × 28 (for 2D images) or 28 × 28 × 28 (for 3D images) and come with corresponding classification labels. This design ensures that users do not require any prior background knowledge. Specifically, we will focus on the ‘BloodMNIST’ dataset.
To install this dataset, we can easily execute the following command:
In [10]:
pip install medmnist
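The dataset object used in the following cells exposes info and root attributes. One way it could have been created with the medmnist package (the exact call is an assumption, following the package's documented usage) is:

import medmnist
from medmnist import INFO

dataset_name = 'bloodmnist'
# look up the dataset class by name and download the training split
DataClass = getattr(medmnist, INFO[dataset_name]['python_class'])
dataset = DataClass(split='train', download=True)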
We can print some information from the dataset such as the description or the
class names as follows:
In [12]:
dataset.info
We load the dataset and the corresponding training, validation and test set as
follows:
In [13]:
import os
import numpy as np

dataset_name = 'bloodmnist'
data = np.load(os.path.join(dataset.root, dataset_name + '.npz'))
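The npz file stores one array per split. The X_train, y_train and related variables used below can then be taken from it; the key names below are the standard MedMNIST ones and are an assumption here:

X_train, y_train = data['train_images'], data['train_labels']
X_val, y_val = data['val_images'], data['val_labels']
X_test, y_test = data['test_images'], data['test_labels']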
After loading the data, we can check some information about the dataset, such as
the number of training examples and the dimensions of each sample as follows:
In [14]:
num_train, height, width, channels = X_train.shape
print("There are", num_train, "training examples.")
print("Each sample has a size of", height, "by", width, "pixels.")
Following this, we can proceed to determine the number of unique classes in the
dataset as follows:
In [15]:
num_classes = np.unique(y_train).shape[0]
With these steps, the data is correctly loaded, prepared, and ready for further
processing. To visualize some samples, we can use the following code (Fig. 11.4):
In [18]:
plt.figure(figsize=(16, 12))
for i in range(24):
    plt.subplot(4, 6, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(X_train[i].reshape((height, width, channels)),
               cmap=plt.cm.binary)
    plt.xlabel(dataset.info['label'][str(y_train[i][0])][:20])
plt.show()
We are now ready to create our first model. Let us create a simple MLP model. In the following example, we create an MLP model with two hidden layers, with 32 and 16 neurons respectively, and one output layer with as many neurons as the number of different categories in the dataset.
In [19]:
from keras.layers import Flatten

## DEFINE PARAMETERS
batch_size = 128
epochs = 20

## CREATE THE MODEL
model = Sequential()
model.add(Flatten(input_shape=[height, width, channels]))
model.add(Dense(32, activation='relu',
                input_shape=(height * width * channels,)))
model.add(Dense(16, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
We can check the model summary to see its architecture and the number of pa-
rameters as follows:
In [20]:
model.summary()
As we can observe, the created model has 75,960 parameters that require optimization. This is a substantial number. To optimize all these parameters, we must define both the loss function and the optimization method as follows:
In [21]:
## COMPILE MODEL -> loss + optimizer
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
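A training call like the following (a sketch; passing the validation arrays is an assumption, consistent with the val_loss and val_accuracy values reported in the log) produces the history object plotted later:

history_mlp = model.fit(X_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        validation_data=(X_val, y_val))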
Out[22]: Epoch 2/20 [==========] - loss: 1.5613 - accuracy: 0.5067 - val_loss: 1.2363 - val_accuracy: 0.5929
Epoch 4/20 [==========] - loss: 1.0670 - accuracy: 0.6476 - val_loss: 0.9538 - val_accuracy: 0.6688
Epoch 6/20 [==========] - loss: 0.9098 - accuracy: 0.6812 - val_loss: 0.8124 - val_accuracy: 0.7109
Epoch 8/20 [==========] - loss: 0.8120 - accuracy: 0.7140 - val_loss: 0.7944 - val_accuracy: 0.7015
Epoch 10/20 [==========] - loss: 0.7688 - accuracy: 0.7351 - val_loss: 0.7434 - val_accuracy: 0.7225
Epoch 12/20 [==========] - loss: 0.7527 - accuracy: 0.7337 - val_loss: 0.6861 - val_accuracy: 0.7547
Epoch 14/20 [==========] - loss: 0.7122 - accuracy: 0.7468 - val_loss: 0.6682 - val_accuracy: 0.7629
Epoch 16/20 [==========] - loss: 0.7241 - accuracy: 0.7476 - val_loss: 0.6360 - val_accuracy: 0.7710
Epoch 18/20 [==========] - loss: 0.6961 - accuracy: 0.7535 - val_loss: 0.6203 - val_accuracy: 0.7681
Epoch 20/20 [==========] - loss: 0.6508 - accuracy: 0.7717 - val_loss: 0.6033 - val_accuracy: 0.7745
When plotting the learning curves, as is done in the next cell, we can see (Fig. 11.5) that after 20 epochs the training and validation accuracies are still improving, without any sign of overfitting.
In [23]:
def plot_history(histories):
    fig, ax = plt.subplots(1, 2, figsize=(15, 7))
    for (history, name) in histories:
        ax[0].plot(history.history['accuracy'], 'o-.',
                   label=name + ' train')
        ax[0].plot(history.history['val_accuracy'], 'x-',
                   label=name + ' test')
    ax[0].set_title('model accuracy')
    ax[0].set_ylabel('accuracy')
    ax[0].set_xlabel('epoch')
    ax[0].legend(loc='upper left')

plot_history([(history_mlp, 'MLP')])
Finally, we evaluate the model on the test set in the next cell; as we can see, the results obtained are similar to those on the validation set.
In [24]:
score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
While MLPs are powerful neural network models, they have limitations when it
comes to processing raw image data. MLPs are structured in layers, with each neuron
in a layer receiving input from all neurons in the previous layer. This design is suitable
for certain problems, particularly those with a fixed number of relatively unstructured
parameters.
However, let's consider what happens when we feed raw image data, such as a 200 × 200 pixel image, into an MLP with 1024 neurons: the first hidden layer alone would require more than 40 million parameters. The situation quickly becomes unmanageable as image sizes grow toward those typically encountered in practical real-world applications.
A common solution is to downsample the images to a size that can be effectively processed using MLPs. However, if we directly downsample the image, we potentially lose a wealth of information; it would be great if we could somehow still do some useful processing of the image (without causing an explosion in the parameter count) prior to performing the downsampling.
It turns out that there is a very efficient way of pulling this off, and it takes advantage of the structure of the information encoded within an image: it is assumed that pixels that are spatially closer together will “cooperate” on forming a particular feature of interest much more than ones on opposite corners of the image. Also, if a particular (smaller) feature is found to be of great importance when defining an image's label, it will be equally important if this feature is found anywhere within the image, regardless of location.
Given a two-dimensional image I and a small matrix K of size h × w (known as a convolution kernel), which we assume encodes a way of extracting an interesting image feature, we compute the convolved image I ⊗ K by overlaying the kernel on top of the image in all possible positions, and recording the sum of element-wise products between the image and the kernel, as defined in the next equation:

output(x, y) = (I ⊗ K)(x, y) = Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} K(m, n) I(x − n, y − m)
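To make the definition concrete, here is a minimal NumPy sketch (not taken from the book's code cells) of this valid-mode 2D convolution for a single-channel image; deep learning libraries implement the same operation far more efficiently:

import numpy as np

def convolve2d(image, kernel):
    # Valid-mode 2D convolution of a single-channel image, following the
    # definition above: the kernel is flipped and slid over every position.
    M, N = kernel.shape
    H, W = image.shape
    flipped = kernel[::-1, ::-1]
    out = np.zeros((H - M + 1, W - N + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(flipped * image[i:i + M, j:j + N])
    return out

# Example: a 3x3 vertical-edge-like kernel applied to a random 28x28 "image"
image = np.random.rand(28, 28)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
print(convolve2d(image, kernel).shape)  # (26, 26)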
The convolution operator forms the fundamental basis of the convolutional layer of a CNN. The layer is completely specified by a certain number of kernels K, and it operates by computing the convolution of the output images of a previous layer with each of those kernels, afterwards adding the biases (one per output image). Finally, an activation function σ may be applied to all of the pixels of the output images.
Typically, the input to a convolutional layer will have d channels (e.g., red/green/blue in the input layer), in which case the kernels are extended to have this number of channels as well.
Note that, since all we are doing here is addition and scaling of the input pixels, the kernels may be learned from a given training dataset via gradient descent, exactly as the weights of an MLP. In fact, an MLP is perfectly capable of replicating a convolutional layer, but it would require a lot more training time (and data) to learn to approximate that mode of operation.
A typical CNN is built from the following types of layers:
• Input Layer: Although CNNs are versatile and not limited to images, they are commonly used for image data, which is typically represented as h × w × d (h = Height, w = Width, d = Channels). The input layer is the first layer and receives the input data.
• Convolutional Layer: This is the heart of a CNN and often represents multiple layers in the network. These are the layers that will perform most of the architecture’s computational work.
• Sub-sampling or Pooling Layer: This layer serves a dual purpose. First, it creates a smoother representation, reducing sensitivity to noise and variations in the data. Second, it reduces the size of the representation, which lowers the number of parameters and system complexity, helping to prevent overfitting. Sub-sampling typically involves resizing the input to a smaller output size, controlled by the stride parameter. A common approach is to use max-pooling with a filter size of 2 × 2 (or 4 × 4) and a stride of two pixels.
Now, let us take a look at a simple CNN to tackle our problem as follows:
In [26]:
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu',
                 input_shape=(height, width, channels)))
# Remaining layers reconstructed from the layer-by-layer description below
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(filters=16, kernel_size=(3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(8, activation='softmax'))
• First Convolutional Layer: This initial layer consists of 32 filters, each with dimensions of 3 × 3, and employs the ReLU activation function. Since it serves as the starting point in our network, we specify the dimensions of our input images. Consequently, the output shape of this layer consists of 32 new images, each with dimensions of 26 × 26. This output size is derived from our input images, which are of size 28 × 28 × 3, accounting for the three color channels (r, g, b). Notably, the filters in this layer are sized at 3 × 3 × 3 to match the input data’s three channels (r, g, b).
• Pooling Layer: Following the convolutional layer, there is a pooling layer with a pool size of 2, resulting in a halving of the input’s dimensions. Therefore, the output shape becomes 13 × 13 × 32.
• Second Convolutional Layer: This layer features 16 filters, each with dimensions of 3 × 3 × 32. Consequently, the output shape is 11 × 11 × 16.
• Output Layer: Finally, we have the output layer, which is a dense layer comprising 8 neurons, matching the number of categories in our problem. Note that we previously flattened the input data since the dense layer requires a 1D array.
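As a quick sanity check of these shapes (a small illustrative snippet, not one of the book's numbered cells): a valid 3 × 3 convolution reduces each spatial dimension by 2, and a 2 × 2 max-pooling halves it.

h = w = 28          # input images are 28 x 28 x 3
h = w = h - 3 + 1   # first 3x3 convolution  -> 26 x 26 x 32
h = w = h // 2      # 2x2 max-pooling        -> 13 x 13 x 32
h = w = h - 3 + 1   # second 3x3 convolution -> 11 x 11 x 16
print(h, w)         # 11 11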
As we can see with model.summary(), the number of parameters has been substantially reduced with respect to the MLP.
In [27]:
model.summary()
We need to reshape the training and test data into an image shape as follows:
In [29]:
X_train = X_train.reshape(num_train, height, width, channels)
X_test = X_test.reshape(num_test, height, width, channels)
Now, once the data is prepared, we can fit the model as follows:
In [30]:
history_cnn = model.fit(X_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        verbose=0,
                        validation_data=(X_test, y_test))
In the next cell we plot the learning curves to compare the performance of the MLP versus the CNN. In Fig. 11.6 we can observe that the CNN clearly outperforms the MLP.
In [31]:
plot_history([(history_mlp, 'MLP'),
              (history_cnn, 'CNN')])
As we can see, the CNN model has substantially outperformed the MLP, achieving
an accuracy of 85% compared to the MLP’s 75%.
Fig. 11.6 Learning curves for the MLP and CNN model
At this point, you might wonder if we can further improve the accuracy of our
CNN model. The answer is clear: yes, we can create a more complex model, either
deeper or with more neurons per layer, to enhance the performance. So, let’s try a
deeper model created as follows:
In [33]:
model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu',
                 input_shape=(height, width, channels)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(8, activation='softmax'))
With millions of parameters, neural networks have the capability to fit almost any
dataset perfectly. Nevertheless, this adaptability carries the potential for overfitting,
a situation in which the model memorizes the training data rather than generalizing
from it. To mitigate overfitting, several widely adopted techniques can be employed:
regularization, dropout and data augmentation.
Although regularizers can be applied to any layer, they are not usually used in convolutional layers, because convolutional layers already possess a built-in form of regularization through their architecture. A convolutional layer applies a local receptive field over the input data, which forces weight sharing among neurons. This weight sharing reduces the risk of overfitting by limiting the number of independent parameters in the network, making explicit regularization techniques less necessary.
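By contrast, explicit regularizers are commonly attached to dense layers. As an illustration (a minimal sketch, not a cell from the text; the factor 0.001 is an arbitrary choice), an L2 weight penalty can be added to a dense layer in Keras as follows:

from keras import regularizers
from keras.layers import Dense

# Dense layer with an L2 (weight-decay) penalty on its weights;
# the regularization factor 0.001 is illustrative only.
dense = Dense(64, activation='relu',
              kernel_regularizer=regularizers.l2(0.001))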
11.8.1.2 Dropout
Dropout is another effective technique for regularizing neural networks. During training, some neurons in a layer are randomly deactivated or “dropped out” at each iteration. This introduces an element of randomness and prevents any single neuron from becoming overly reliant on specific neurons in the preceding layer. Dropout promotes a more robust and generalizable model.
Below we can see how dropout can be used in Keras:
In [38]:
from keras.layers import Dropout

model = Sequential()
model.add(Conv2D(256, (3, 3), activation='relu',
                 input_shape=(height, width, channels)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(num_classes, activation='softmax'))
Data augmentation is a further way to combat overfitting: it artificially enlarges the training set by applying label-preserving transformations to the images. Typical transformations include:
• Rotation: Rotating the image by a certain angle, which can help the model become more invariant to the orientation of objects.
• Width and height shifts: Shifting the image horizontally or vertically, simulating
variations in object positioning within the frame.
• Shear: Distorting the image by shifting one part of the image relative to the other
along a given axis. This can simulate the effect of tilting or slanting an object.
• Zoom: Zooming in or out of the image, changing the scale and perspective.
• Horizontal flip: Flipping the image horizontally, which is useful for tasks where
left-to-right orientation does not matter.
In [39]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = Sequential()
model.add(Conv2D(256, (3, 3), activation='relu',
                 input_shape=(height, width, channels)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(num_classes, activation='softmax'))

# Compile the model
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
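The cell above only rebuilds and compiles the model; the configuration of the generator itself is not shown here. The following is a minimal sketch of how the transformations listed above could be wired in with ImageDataGenerator; the specific ranges (20 degrees, 0.1, and so on) are arbitrary illustrative choices:

# Illustrative augmentation settings; the exact ranges are not from the book
datagen = ImageDataGenerator(rotation_range=20,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             shear_range=0.1,
                             zoom_range=0.1,
                             horizontal_flip=True)

# Train on batches of randomly augmented training images
history_aug = model.fit(datagen.flow(X_train, y_train, batch_size=batch_size),
                        epochs=epochs,
                        validation_data=(X_test, y_test))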
11.9 Conclusions
In this chapter we have introduced the main concepts of neural networks and deep
learning. We have implemented the classical Multilayer Perceptron model and the
Convolutional Neural Networks model with Keras on top of TensorFlow. We have
shown how to construct an image classification model, bridging theory with hands-on
experience. Furthermore, to ensure effective model training and prevent overfitting,
we have introduced various techniques, such as regularization, dropout and data
augmentation.
Acknowledgements This chapter was co-written by Jordi Vitrià and Santi Seguí.
References
1. D.O. Hebb, The Organization of Behavior: A Neuropsychological Theory (Wiley, New York,
1949). ISBN 9780471367277
2. F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization
in the brain. Psychol. Rev. 65(6), 386–408 (1958). https://doi.org/10.1037/h0042519
3. D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating
errors. Nature 323(6088), 533–536 (1986)
4. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural
networks, in Advances in Neural Information Processing Systems 25 (2012)
12 Responsible Data Science
The field of data science carries significant social responsibility due to its growing
impact on individuals, communities, and society as a whole [1]. As data scientists,
we must recognize and embrace the significant social responsibility that comes with
our work. This responsibility extends beyond technical expertise and requires us to
consider the ethical, moral, and societal implications of our actions. By understanding
and addressing these responsibilities, we can ensure that data science is used in a
way that benefits society and respects individual rights.
Data science has the potential to profoundly impact individuals by influencing
their privacy, autonomy, and well-being. As data scientists, we must prioritize the
protection of personal data and respect individual privacy rights. We should handle
data ethically, ensuring that informed consent is obtained, and implementing robust
security measures to prevent unauthorized access or misuse of personal information.
By upholding these principles, we can help build trust between individuals and the
data-driven systems that affect their lives.
Data science also has the power to shape communities and their interactions. It
is crucial to acknowledge the potential for bias and discrimination in data analysis
and decision-making processes. We must actively work to identify and mitigate bias
in data sources, algorithms, and models, ensuring fairness and equity. By embracing
transparency and explaining the rationale behind our data-driven systems, we can
foster understanding, inclusion, and trust within communities.
Last but not least, data science has far-reaching implications for society as a
whole. It can influence critical areas such as healthcare, education, criminal justice,
and public policy. As data scientists, we must consider the broader social impact of
our work. Conducting social impact assessments helps us understand the potential
benefits and harms our data-driven systems may have on different segments of society.
By actively engaging with experts from diverse backgrounds and collaborating across
disciplines, we can address societal concerns, mitigate negative consequences, and
amplify the positive impact of data science on society.
All these concerns can be grouped into a novel area called “data ethics” [2].
“Data ethics” refers to the ethical considerations and principles that guide the respon-
sible and ethical use of data. It encompasses the moral and societal implications of
collecting, storing, analyzing, and sharing data, as well as the impact of data-driven
decisions and technologies on individuals, communities, and society as a whole.
The convergence of data analytics, a data-saturated environment, and inadequate
regulations has given rise to a complex ethical landscape in data practice. It has
become clear that ethical insights and guidelines need to be adaptable and contextual
[3]. Different sectors, such as industry, academia, and government, operate under
distinct circumstances and face unique ethical challenges. Ethics in data practice
should consider the specific needs, values, and legal requirements of each context.
In a commercial setting, data ethics must address concerns such as consumer
privacy, consent, and responsible data handling practices. Ethical guidelines should
encourage transparency, accountability, and fair treatment of customers and employees. Striking a balance between profit-driven objectives and ethical considerations
is crucial.
Ethical challenges in academic data practice encompass issues like research integrity, informed consent, and data sharing. Ethical guidelines should emphasize the
responsible use of data, protection of research participants’ rights, and rigorous data
management practices. Promoting open science and fostering collaboration while
respecting privacy and confidentiality are essential.
Ethical challenges in government data practice encompass issues of data security,
surveillance, and transparency. Ethical guidelines should focus on protecting citizens’ rights, ensuring data privacy and security, and promoting transparency in data
collection and use. Responsible governance, fairness, and accountability are critical
in this context.
While ongoing debates and evolving perspectives exist regarding the constituents of a comprehensive responsible data science standard [4], the following are widely acknowledged as crucial elements:
1. Principles: A set of core principles (lines you will never cross) that data-based
systems and products should follow.
2. Governance: Establishing a governance structure to oversee the development,
deployment, and use of data-based systems and products.
3. Transparency: Ensuring that the data-based system’s decisions and processes
are explainable and understandable.
4. Fairness: Ensuring that data-based systems and products are fair and unbiased
in their decision-making.
5. Privacy: Ensuring that data-based systems and products comply with privacy regulations and respect user data privacy.
6. Security: Ensuring that data-based systems and products are secure and protected from malicious actors.
7. Robustness and reliability: Confirming that data-based systems have the ability
to learn from humans and other systems and produce consistent and reliable
outputs.
8. Lawfulness, accountability and auditability: Putting an organizational structure
and policies in place that can help clearly determine who is responsible for the
output of data-based system decisions. Documenting decisions made by these
systems and allowing for an audit trail.
The responsibilities associated with responsible data science include various aspects, some of which require management involvement, infrastructure design, and
compliance considerations. However, there are specific responsibilities that lie within
the realm of data science practitioners. In the following sections, we will review these
technically-oriented responsibilities and their implications.
12.3.1 Transparency
Transparency refers to the degree to which the inner workings, decision-making processes, and underlying algorithms of a data science system are open, understandable, and explainable to users, developers, and other affected stakeholders. Here are some transparency questions that data scientists must address: What are the factors and values that influence the decisions made by algorithms? Can we understand the reasoning behind each decision? Can we assure that all relevant knowledge (and nothing
else) is reflected in the model?
Transparency is crucial for building trust, ensuring accountability, and facilitating
ethical and responsible use of these technologies [5].
Transparency can manifest in various ways:
12.3.1.1 Explainability
Explainability is the ability to express why a data-based system has arrived at a particular decision, recommendation or prediction. To develop this capability it is necessary to understand how the model works and the types of data used to train it. The “explanation” concept is itself somewhat ambiguous: it can mean different things to different people in different contexts [13], and we can obtain “explanations” from data-based systems by following different strategies.
First, explainability can be provided through local and global explanations. Local
explanations focus on explaining the decision or output for a specific instance or
input. This can involve highlighting the relevant features and explaining how they
influenced the decision. Global explanations provide a more general overview of
how the data-based system behaves and the patterns it has learned across different
instances.
Second, different stakeholders require different kinds of explanations. A developer needs to understand how the system is working in order to debug or improve it: to see what is working well or badly, and to get a sense of why. A customer or user needs a sense of what the system is doing and why, so as to anticipate what it might do in unforeseen circumstances and to build trust in the technology. Society at large needs to understand and become comfortable with the strengths and limitations of the system, overcoming a reasonable fear of the unknown. Finally, an individual user needs to understand why one particular prediction or decision was reached, to allow a check that the system worked appropriately and to enable a meaningful challenge (e.g., credit approval or criminal sentencing).
Getting algorithms to provide us with explanations about how a particular deci-
sion was made allows us to keep ‘meaningful human control’ over the decision. A
principle of explicability, then, is an ethical principle that should help bring us closer
to acceptable uses of algorithms.
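To make the distinction between local and global explanations more concrete, here is a minimal sketch (an assumed example, not taken from the text) using a scikit-learn classifier: permutation importance gives a global view of which features matter overall, while SHAP values [11] attribute the prediction of a single instance; X_train, y_train, X_val and y_val are placeholder arrays.

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import shap  # SHapley Additive exPlanations [11]

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Global explanation: average performance drop when each feature is shuffled
global_imp = permutation_importance(clf, X_val, y_val, n_repeats=10,
                                    random_state=0)
print(global_imp.importances_mean)

# Local explanation: per-feature contributions for one particular prediction
explainer = shap.TreeExplainer(clf)
local_contrib = explainer.shap_values(X_val[:1])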
Transparency can also be supported through documentation. Model documentation, often published in the form of “model cards” [14], typically covers the training data (its sources, characteristics, and potential biases), model performance (metrics and evaluation results indicating how well the model performs on various tasks or benchmarks), intended use and limitations (an explanation of the intended use cases for the model, as well as any known limitations, constraints, or edge cases where the model might not perform well), ethical considerations (a discussion of potential ethical implications associated with the model, including considerations related to fairness, bias, and potential societal impact), and documentation updates (information on when the model documentation was last updated, indicating the currency of the information provided).
12.3.2 Fairness
Fairness in data science refers to the goal of ensuring that data-based systems and
algorithms treat all individuals fairly and without bias or discrimination [15]. Addressing fairness is crucial to building trust in data-based systems, promoting social justice, and avoiding reinforcing existing biases or discrimination. It requires a multidisciplinary approach, involving ethics, law, computer science, and social sciences, to develop robust techniques and guidelines for creating fair and unbiased data-based systems.
Algorithmic fairness is a domain of research focused on comprehending and rectifying undesired biases in data and algorithms. Its objectives encompass three areas: investigating the origins of bias in data and algorithms, establishing and employing metrics to assess fairness, and advancing methodologies for data collection and modeling that aim to mitigate unfairness.
There are several causes of bias in data and algorithms that can contribute to unfair or discriminatory outcomes. These are called bad biases: biases that are ethically problematic because they shape the distribution of goods, services, risks, and opportunities, or even access to information, in harmful ways. Other kinds of bias can be benign, or even necessary for learning.
Causes of bad bias include:
1. Structural bias [16]: Structural bias refers to biases that are ingrained within
the underlying systems, processes, or institutions, leading to unequal treatment
or outcomes for certain groups of people. It is often rooted in societal, economic,
or historical factors and can manifest in various domains, including education,
employment, criminal justice, healthcare, and more. Structural bias produces implicitly biased datasets.
2. Biased Data Collection: Bias can be introduced during the data collection process. If the data collection methods are flawed or biased, the collected data may not be representative of the entire population, leading to skewed or incomplete information. More specifically, if the dataset lacks diversity and does not adequately represent various demographic groups, it can lead to biased outcomes. Underrepresentation of certain groups can result in inaccurate or unfair predictions or decisions for those groups. Human biases can also inadvertently influence the data during its collection and labeling.
Fairness metrics are commonly grouped into three families, depending on which quantities they compare across a protected group G+ and the remaining group G−:
1. Base rates approaches [17]: these use only the output Ŷ of the algorithm. For example, when the probabilities of a favorable outcome are equal for the two groups, we have a special type of fairness termed demographic, or statistical, parity: P(Ŷ = 1 | X ∈ G+) ∼ P(Ŷ = 1 | X ∈ G−). In a more general setting we can define demographic parity in terms of statistical independence: the protected attribute must be statistically independent of the outcome.
2. Accuracy-based approaches [15]: these use both the output Ŷ of the algorithm and the ground truth Y. Accuracy-based fairness warrants that various types of classification errors (e.g., true positives, false positives) are equal across groups. The case in which we ask that P(Ŷ = 1 | Y = 1, X ∈ G+) = P(Ŷ = 1 | Y = 1, X ∈ G−) (same true positive rate) is called equal opportunity. Depending on the type of classification errors considered, the achieved type of fairness takes different names. In the general case, this criterion is called separation: Ŷ must be independent of the protected characteristic, conditional on Y.
3. Calibration approaches [18]: these use the predicted probability S and the ground truth Y. In general, a classification algorithm is considered to be well calibrated if, when the algorithm predicts a set of individuals as having probability p of belonging to the positive class, then approximately a fraction p of this set are actual members of the positive class. In terms of fairness, intuitively, we would like the classifier to be equally well calibrated for both groups.
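The first two families above can be checked directly from a model's predictions. The following is a minimal sketch (an assumed example with hypothetical variable names) that computes the demographic parity gap and the equal opportunity (true positive rate) gap between a protected group G+ and the remaining group, for binary predictions:

import numpy as np

def group_fairness_gaps(y_true, y_pred, protected):
    # y_true, y_pred: binary arrays; protected: 1 if the instance belongs
    # to the protected group G+, 0 otherwise.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    protected = np.asarray(protected)
    g_pos, g_neg = protected == 1, protected == 0

    # Demographic parity: P(Y_hat = 1 | G+) vs P(Y_hat = 1 | G-)
    dp_gap = y_pred[g_pos].mean() - y_pred[g_neg].mean()

    # Equal opportunity: true positive rate per group
    tpr_pos = y_pred[g_pos & (y_true == 1)].mean()
    tpr_neg = y_pred[g_neg & (y_true == 1)].mean()
    return {'demographic_parity_gap': dp_gap,
            'equal_opportunity_gap': tpr_pos - tpr_neg}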
Causal approaches to fairness [19] refer to the use of causal reasoning and analysis in understanding and addressing fairness issues in AI and machine learning systems. These approaches aim to identify and mitigate unfairness by examining the causal relationships between variables and outcomes, rather than relying solely on statistical associations. There are several key concepts and methods within causal approaches to fairness, the most important being counterfactual fairness. This approach focuses on counterfactual analysis, which involves assessing the fairness of an algorithm by comparing the outcomes under different hypothetical scenarios. It considers what would have happened to individuals or groups if certain attributes (e.g., gender, race) were different while keeping other relevant factors constant.
Lastly, if any of these metrics indicate the presence of discrimination, we can apply several bias mitigation strategies [20]. These strategies can be categorized into three types: (i) pre-processing methods manipulate the data to eliminate bias before a machine learning (ML) model is able to incorporate these biases from the data, (ii) in-processing bias mitigation strategies manipulate the model to mitigate bias that appears during the training process, and (iii) post-processing methods alter the outcomes of a model, correcting the bias present in its output.
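As an illustration of the third category (a minimal sketch with hypothetical names, not a method prescribed by the text), a post-processing step can apply group-specific decision thresholds, chosen on a validation set so that a selected error rate is equalized across groups:

import numpy as np

def apply_group_thresholds(scores, protected, thr_protected, thr_other):
    # scores: predicted probabilities; protected: 1 for the protected group.
    # The two thresholds would be tuned on validation data to equalize,
    # for example, the true positive rate across groups.
    scores, protected = np.asarray(scores), np.asarray(protected)
    thresholds = np.where(protected == 1, thr_protected, thr_other)
    return (scores >= thresholds).astype(int)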
12.3.3 Robustness and Reliability
Robustness and reliability are critical factors in ensuring the effectiveness and trustworthiness of data-based systems. Robustness refers to the ability of a data-based
model to perform consistently and accurately in the face of perturbations from the
expected conditions. It involves designing models that are resilient to different types
of inputs, noise, or adversarial attacks. Robust models are less likely to produce
erroneous or misleading results in different scenarios:
for example, robust models can handle noisy or incomplete inputs (noise, errors, or missing data) without significant degradation in performance.
Reliable models, in addition, should satisfy two requirements:
1. They should accurately report uncertainty about their predictions (“know what they don’t know”). Uncertainty arises from incomplete or unknown information [23], which poses challenges for accurate predictions by models. Predictive uncertainty quantification enables models to calculate optimal decisions and assists practitioners in determining when to rely on the model’s predictions. This capability allows for graceful handling of situations where the model is likely to make incorrect predictions; a minimal sketch of this idea follows the list.
2. They should generalize robustly to new scenarios (distribution shift) [24]. Distri-
bution shifts can occur due to various factors, such as changes in the input data
characteristics, unseen scenarios, or shifts in the environment where the model
operates.
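As a minimal sketch of the first requirement (an assumed example; clf and X_new are placeholders), the entropy of the predicted class probabilities can serve as a simple proxy for predictive uncertainty, so that the most uncertain inputs can be deferred to a human:

import numpy as np

proba = clf.predict_proba(X_new)                 # shape: (n_samples, n_classes)
entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Flag the 5% most uncertain predictions for human review
uncertain = entropy > np.quantile(entropy, 0.95)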
12.4 Conclusions
This chapter has focused on the intersection of data science and ethics. This area
is of utmost importance in today’s data-driven world. Ethical considerations play
a crucial role in guiding responsible data science practices and ensuring that data-
driven technologies are developed and used in a manner that respects individuals’
rights, promotes fairness, and benefits society as a whole.
Data science professionals have a responsibility to incorporate ethical principles
into their work. This includes handling data with integrity, protecting privacy, addressing biases, and promoting transparency and explainability. They should actively engage with stakeholders, collaborate with experts from diverse fields, and stay informed about emerging ethical issues in data science.
Moreover, it is essential for organizations, policymakers, and society as a whole to
recognize the social responsibility of data science. Establishing robust frameworks,
guidelines, and regulations that govern data collection, usage, and privacy can help
create an environment where ethical data practices are encouraged and enforced.
Ultimately, by integrating ethics into data science, we can strive for a future where
data-driven technologies contribute positively to societal progress, uphold individual
rights, and enhance human well-being.
References
1. A. Spector, P. Norvig, C. Wiggins, J. Wing, Data Science in Context: Foundations, Chal-
lenges, Opportunities (Cambridge University Press, Cambridge, 2022). https://doi.org/10.
1017/9781009272230
2. L. Floridi, M. Taddeo, What is data ethics? Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci.
374(2083), 20160360 (2016)
3. A. Jobin, M. Ienca, E. Vayena, The global landscape of AI ethics guidelines. Nat. Mach. Intell.
1(9), 389–399 (2019)
4. L. Taylor, N. Purtova, What is responsible and sustainable data science? Big Data & Soc. 6(2),
2053951719858114 (2019)
5. W.J. Von Eschenbach, Transparency and the black box problem: why we do not trust AI. Philos.
& Technol. 34(4), 1607–1622 (2021)
6. N. Burkart, M.F. Huber, A survey on the explainability of supervised machine learning. J. Artif.
Intell. Res. 70, 245–317 (2021)
7. S. Garfinkel, J. Matthews, S.S. Shapiro, J.M. Smith, Toward algorithmic transparency and
accountability. Commun. ACM 60(9), 5–5 (2017)
8. M. Pushkarna, A. Zaldivar, O. Kjartansson, Data cards: purposeful and transparent dataset
documentation for responsible AI, in Proceedings of the 2022 ACM Conference on Fairness,
Accountability, and Transparency (2022), pp. 1776–1826
9. C. Rudin, Stop explaining black box machine learning models for high stakes decisions and
use interpretable models instead. Nature Mach. Intell. 1(5), 206–215 (2019)
10. C. Rudin, C. Chen, Z. Chen, H. Huang, L. Semenova, C. Zhong, Interpretable machine learning:
fundamental principles and 10 grand challenges. Stat. Surv. 16, 1–85 (2022)
11. S.M. Lundberg, S.I. Lee, A unified approach to interpreting model predictions, in Advances in
Neural Information Processing Systems 30 (2017)
12. M.T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any
classifier, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (2016), pp. 1135–1144
13. T. Miller, Explanation in artificial intelligence: insights from the social sciences. Artif. intell.
267, 1–38 (2019)
14. M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, ... , T. Gebru, Model
cards for model reporting, in Proceedings of the Conference on Fairness, Accountability, and
Transparency (2019), pp. 220–229
15. S. Mitchell, E. Potash, S. Barocas, A. D’Amour, K. Lum, Algorithmic fairness: choices, as-
sumptions, and definitions. Ann. Rev. Stat. Appl. 8, 141–163 (2021)
16. S.A. Friedler, C. Scheidegger, S. Venkatasubramanian, The (im) possibility of fairness: different
value systems require different mechanisms for fair decision making. Commun. ACM 64(4),
136–143 (2021)
17. R. Zemel, Y. Wu, K. Swersky, T. Pitassi, C. Dwork, Learning fair representations, in International Conference on Machine Learning (2013), pp. 325–333. PMLR
18. G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, K.Q. Weinberger, On fairness and calibration, in
Advances in Neural Information Processing Systems 30 (2017)
19. A.N. Carey, X. Wu, The causal fairness field guide: perspectives from social and formal sci-
ences. Front. Big Data 5, 892837 (2022)
20. S. Barocas, M. Hardt, A. Narayanan, Fairness and machine learning: limitations and opportu-
nities (2019). http://www.fairmlbook.org
21. P. Laskov, R. Lippmann, Machine learning in adversarial environments. Mach. Learn. 81,
115–119 (2010)
22. S. Fort, J. Ren, B. Lakshminarayanan, Exploring the limits of out-of-distribution detection,
Advances in Neural Information Processing Systems 34 (2021), pp. 7068–7081
23. J. Mena, O. Pujol, J. Vitria, A survey on uncertainty estimation in deep learning classification
systems from a Bayesian perspective. ACM Comput. Surv. (CSUR) 54(9), 1–35 (2021)
24. A. Subbaswamy, B. Chen, S. Saria, A unifying causal framework for analyzing dataset shift-
stable learning algorithms. J. Causal Inf. 10(1), 64–89 (2022)
Index
B
Backpropagation, 211, 213
Bag of words, 199, 200
Bootstrapping, 52, 55, 56, 58, 65

C
Centrality measures, 151, 158, 159, 161, 162, 165, 167, 173
Classification, 7, 67–69, 71, 74, 77, 87, 88, 91, 96, 119, 121, 122, 212, 213, 217, 219, 231, 239, 240
Clustering, 7, 67, 125–131, 133–149, 202
Collaborative filtering, 176–178, 184, 186
Community detection, 172, 173
Connected components, 151, 153, 156, 157
Content based recommender systems, 194
Convolutional Neural Networks, 211, 224–228
Correlation, 2, 47–50, 110, 111, 122, 187

D
Data augmentation, 229–231
Data distribution, 36
Data ethics, 234
Data science, 1–4, 10, 68, 99, 233–235, 237, 238, 241
Deep learning, 7, 96, 209, 211, 213–215, 231

E
Ego-networks, 151, 167–173

F
Fairness, 233, 234, 238–241
Frequentist approach, 51, 52, 64

H
Hierarchical clustering, 136, 148
Histogram, 33, 36, 37, 42, 50, 108, 109, 118, 156, 159, 160

K
K-means, 67, 129, 131, 132, 134, 135, 137, 140, 141, 143–149

L
Lemmatizing, 198, 199
Linear and polynomial regression, 107
Logistic regression, 67, 99, 119–122, 212

M
Machine learning, 1, 2, 4, 7, 10, 67–69, 79, 87, 91, 96, 102, 103, 125, 202, 205, 207, 209, 213, 215, 240, 241
Mean, 15, 23, 33–36, 39–43, 46–48, 50, 52–62, 76, 92, 102–104, 114, 182, 189, 193, 237
Multilayer Perceptron, 212, 213, 216, 221, 224–227, 231

O
Overfitting, 81, 82, 88, 91, 102, 107, 207, 223, 225, 229–231

P
Programming, 3–6, 8–10, 28, 211
P-value, 63
Python, 3–9, 12, 15, 18, 122, 132, 151, 154, 171, 173, 182, 190, 194, 200, 205, 209, 215

R
Recommender systems, 175, 176, 178, 179, 182, 184, 185, 190, 193, 194

S
Sentiment analysis, 4, 195, 196, 205, 209
Sparse model, 108, 113, 114, 122
Spectral clustering, 129, 131, 135, 140, 141, 145–149
Statistical inference, 51, 52, 55, 195
Supervised learning, 67, 125

T
Toolbox, 2–8, 104, 119, 154, 171, 173
Transparency, 2, 14, 121, 211, 233–236, 241

V
Variance, 34–36, 43, 46, 47, 50, 52, 56, 57, 80, 91, 102, 107, 114, 137