Zero To Deep Learning
Francesco Mosconi
September 7, 2018
Copyright © 2018 Francesco Mosconi. All rights reserved. Printed in the United States of America
Published by Fullstack.io
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and Catalit LLC was aware of a trademark claim,
the designations have been printed.
While every precaution has been taken in the preparation of this book, the publisher and author assume no
responsibility for errors or omissions, or for damages resulting from the use of the information contained
herein.
Table of Contents
0 Introduction
0.1 Welcome
0.2 Acknowledgements
1 Getting Started
1.2.3 Scikit-Learn
1.2.4 Keras
1.3 Exercises
1.3.1 Exercise 1
1.3.2 Exercise 2
1.3.3 Exercise 3
1.3.4 Exercise 4
2 Data Manipulation
2.6 Exercises
2.6.1 Exercise 1
2.6.2 Exercise 2
2.6.3 Exercise 3
2.6.4 Exercise 4
2.6.5 Exercise 5
3 Machine Learning
14 Appendix
Welcome
Welcome to this practical introduction to Deep Learning. Whether starting our journey with Machine
Learning or as a seasoned practitioner looking to add Deep Learning to our set of tools, this course will
help! We will gradually start introducing data and Machine Learning problems and then dive deeper into
how Neural Networks are built, trained and used. We will deal with tabular data, images, sound, text data
and video games and for each of these we will build and train the appropriate Neural Network.
By the end of the course, we will be able to recognize problems that can be solved using Deep Learning,
collect and organize data so that it can be used by Neural Networks, build and train models to solve a variety
of tasks and finally take advantage of the cloud to train even more powerful models.
Acknowledgements
This book would not exist without the help of many friends and colleagues who contributed in several ways.
Special thanks go to Ari and Nate from Fullstack for the continuous support and useful suggestions
throughout the project. Thanks to Nicolò for reading through the early version of the book and
contributing many corrections to make it more accessible to beginners. Thanks to Carlo for helping me
transform this book into a successful bootcamp. A huge thank you to François Chollet for inventing Keras
and making Deep Learning accessible to the world and to the Tensorflow developers for the amazing work
they are doing. Finally to Chiara, to my friends and to my family for all the emotional support through this
journey: thank you, I would not have finished this without all of you!
Francesco Mosconi
With 15 years of experience working with data, Francesco has been an instructor at General Assembly, The
Data Incubator, Udemy and many conferences including ODSC, TDWI, PyBay and AINext.
Formerly he was co-founder and Chief Data Officer at Spire, a YC-backed company that invented the first
consumer wearable device capable of continuously tracking respiration and physical activity.
Francesco started his career with a PhD in Biophysics, publishing a paper on DNA mechanics that currently
has over 100 citations and then turned his data skills to business helping small and large companies grow
their market through data analytics solutions.
He also started a series of workshops on Machine Learning and Deep Learning called Dataweekends,
training hundreds of students on these topics.
This book extends and improves the training program of Dataweekends and it provides a practical
foundation in Machine Learning and Deep Learning.
The first chapters review core concepts in Machine Learning and explain how Deep Learning expands the
field. We will build an intuition to recognize the kind of problems where Deep Learning really shines and
those where other techniques could provide better results.
Chapters 4-8 present the foundation of Deep Learning. By the end of Chapter 8, we’ll be able to use Fully
Connected, Convolutional, and Recurrent Neural Networks on your laptop and deal with a variety of input
sources including: tabular data, images, time series and text.
Chapters 9-12 build on core chapters to extend the reach of our skills both in depth and in width. We’ll learn
to improve the performance of our models as well as to use pre-trained models, piggybacking on the
shoulders of giants. We will also talk about how to use GPUs to speed up training as well as how to serve
predictions from your model.
This is a practical book. Everything we introduce is accompanied by code to experiment with and explore.
The code and notebooks accompanying the book are available through your purchase on your Gumroad
library. In order to follow along with the book, please be sure to download and unarchive the code on
your local computer.
Exercises
Before we go on, let’s spend a couple of words on exercises. They are a key part of this book and we suggest
working as follows:
1. Execute the code provided with the chapter, to get a sense of what it is doing.
2. Once we have run the provided code, we suggest working through the exercises. We begin
with easy exercises and build gradually towards more difficult ones.
If you find yourself stuck, here are some resources where you can look for help:
• look at the error message: understand which parts of it are important. The most important line in the
Python error message is the last one. It tells us the error type. The other lines (the traceback) give us
information about what caused the error to happen, so that we can go ahead and fix it.
• the internet: try pasting part of the error message in a search engine and see what you find. It is very
likely that someone has already encountered the same problem and a solution is available.
• Stack Overflow: this is a great knowledge base where people answer code-related questions. Very
often you can just search for the specific error message you got and find the answer here.
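To make the first suggestion concrete, here is a minimal sketch of how the last line of a Python error message can be isolated. The helper last_line_of_error and the function broken are hypothetical names introduced only for this illustration.

```python
import traceback


def last_line_of_error(func):
    """Run func and return the last line of the traceback it raises, or None."""
    try:
        func()
    except Exception:
        # format_exc() returns the same text Python would print for the error
        return traceback.format_exc().strip().splitlines()[-1]
    return None


def broken():
    # Hypothetical bug used for illustration: dividing by zero
    return 1 / 0


summary = last_line_of_error(broken)
print(summary)  # ZeroDivisionError: division by zero
```

The printed line contains the error type followed by a short message, which is usually the best text to paste into a search engine.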
Notation
You will find different fonts for different parts of the text:
to indicate practical suggestions and concepts that add to the material but are not strictly core.
$$ x_0 + x_1 + x_2 \qquad (1) $$
In this book we’ll try to find a balance between the two. It is an application focused book, with working
examples, coding exercises, solutions and real-world datasets. At the same time we won’t shy away from the
math when necessary. We’ll try to keep equations to a minimum, so here are a few symbols you may
encounter in the course of the book.
Sum
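Sometimes we will need to indicate the sum of many quantities at once. Instead of writing the sum explicitly
like in equation 2:
$$ x_0 + x_1 + x_2 + x_3 + x_4 + x_5 + \dots \qquad (2) $$
we can use the summation symbol, where the index i runs over all the terms:
$$ \sum_i x_i \qquad (3) $$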
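As a small illustration of the two ways of writing a sum, the sketch below computes the same total explicitly (as in equation 2) and with a loop over the index (as in equation 3). The values in x are made up for the example.

```python
x = [1.0, 2.5, 4.0, 0.5]  # hypothetical values x0 ... x3

# Explicit form, as in equation 2
explicit = x[0] + x[1] + x[2] + x[3]

# Summation form, as in equation 3: add up x_i for every index i
total = sum(x_i for x_i in x)

print(explicit, total)  # 8.0 8.0
```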
Partial derivatives
Sometimes we will need to indicate the rate of change of a function with respect to one of its arguments.
This is obtained through an operation called partial derivative and it’s indicated with the symbol ∂ like this:
$$ \frac{\partial f(x_1, x_2, \dots)}{\partial x_1} \qquad (4) $$
It means we are looking at how much f is changing for a unit of change in x1 when all the other variables are
kept fixed (find more about it on Wikipedia).
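One way to build intuition for a partial derivative is to estimate it numerically: nudge x1 by a small amount while keeping the other variables fixed, and measure how much f changes. The sketch below does this with a central difference. The function f, the helper partial_x1 and the step size h are assumptions chosen for illustration, not something used elsewhere in the book.

```python
def f(x1, x2):
    # Hypothetical function chosen for illustration
    return x1 ** 2 + 3 * x1 * x2


def partial_x1(func, x1, x2, h=1e-6):
    """Central-difference estimate of the partial derivative with respect to x1."""
    # x2 is kept fixed while x1 is moved by a small step h in both directions
    return (func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h)


approx = partial_x1(f, 2.0, 1.0)
exact = 2 * 2.0 + 3 * 1.0  # analytic result for this f: 2*x1 + 3*x2

print(approx, exact)
```

The numerical estimate should agree with the analytic value up to a small rounding error.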
Dot product
A lot of Deep Learning relies on a few common linear algebra concepts like vectors and matrices. In
particular, an operation we will frequently use is the dot product: A · B. If you’ve never seen this before, it may
be a great time to look up how it works on Wikipedia.
In any case do not worry too much about the math. We will introduce it gradually and only when needed,
so you’ll have time to get used to it.
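For readers who prefer code to formulas, here is a minimal sketch of the dot product between two vectors, computed by hand and with numpy's np.dot. The vectors A and B hold made-up values.

```python
import numpy as np

A = np.array([1, 2, 3])  # hypothetical vectors
B = np.array([4, 5, 6])

# Multiply matching elements and add up the results
by_hand = 1 * 4 + 2 * 5 + 3 * 6

# The same operation using numpy
with_numpy = np.dot(A, B)

print(by_hand, with_numpy)  # 32 32
```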
Python
This is not a book about Python. It’s a book about Deep Learning and how to build Deep Learning models
using Python. Therefore, some familiarity with Python is required in order to follow along.
One of the questions we often receive is “I don’t have any experience with Python, would I be able to follow the
course if I take a basic Python course before?”. The answer is YES, but let us help you with that.
This book focuses on data techniques and it assumes some familiarity with Python and programming
languages. It is designed to speed up your learning curve in deep learning giving you enough knowledge for
you to then be able to continue learning on your own.
Here are 2 resources to get you started:
• Learn Python the Hard Way: a great way to start learning without putting too much effort into it.
• Hacker Rank 30 days of code: a more problem-solving oriented resource.
In order to follow this course easily, you should be familiar with the following Python constructs:
Once you are comfortable with these you’ll be ready to take this course.
We will need to install Python and then we will also install a few libraries that allow us to perform Machine
Learning and Deep Learning experiments.
Miniconda Python
Anaconda Python is a great open source distribution of Python packages geared to data science. It is
prepackaged with a lot of useful tools and we encourage you to have a look at it. For this book we will not
need the full Anaconda distribution, but we will just install the required packages, so that we can keep space
requirements to a minimum.
TIP: if you already have Anaconda Python installed, just make sure conda is up to date by
running conda update conda in a terminal window.
We can do this by installing Miniconda, which includes Python and the Anaconda package installer conda.
Here are the steps to complete in order to install it:
If you’ve completed these steps successfully you can open a command prompt (how to do this will differ
depending on which OS you’re using) and type python. This will launch the Python interpreter and should
display something like the following:
Conda Environment
Environment creation
Now that we have installed Python, we need to download and install the packages required to run the code
of this book. We will do this in a few simple steps.
• Change directory to the folder you have downloaded from our book repository using the command
cd.
• Run the command conda env create to create the environment with all the required packages.
TIP: if the environment already exists, the create command will fail with an error. In that case, run the
command conda env update instead.
You will see a lot of lines on your terminal. What is going on? The conda package installer is creating an
environment, i.e. a special folder that contains the specific versions of each required package for the book.
The environment is specified in the file environment.yml, that looks like this:
name: ztdlbook
channels:
  - defaults
dependencies:
  - python=3.6
  - bz2file==0.98
  - cython==0.28.*
  - flask==1.0.*
  - gensim==3.4.*
  - h5py==2.8.*
  - jupyter==1.0.*
  - matplotlib==2.2.*
  - numpy==1.14.*
  - pandas==0.23.*
  - pillow==5.2.*
  - pip==10.0.*
  - pytest==3.7.*
  - scikit-learn==0.19.*
  - scipy==1.1.*
  - seaborn==0.9.*
  - twisted==18.7.*
  - pip:
    - PyHamcrest==1.9.*
    - tensorflow==1.10.1
    - keras==2.2.2
    - jupyter_contrib_nbextensions==0.5.0
    - tensorflow-serving-api==1.10.1
The package installer reads the environment file and downloads the correct versions of each package
including its dependencies. This is why you see a lot more packages being downloaded.
Once the environment is created you should see a message like the following (Mac/Linux):
Environment activation
(or the Windows equivalent). If you do that, you’ll notice that your command prompt changes and now
displays the environment name at the beginning, within parentheses, like this:
If your machine has an NVIDIA GPU you’ll want to install the GPU-enabled version of Tensorflow. In
Chapter 9 we cover cloud GPUs in detail and explain how to install all the required software. Assuming you
have already installed the NVIDIA drivers, CUDA and cuDNN, you can create the environment for this
book in a few simple steps:
1. Run: conda env create -f environment-gpu.yml
This is the same environment as the standard one, minus the tensorflow-serving-api package. That
package has standard Tensorflow as a dependency, and installing it together with its dependencies
would interfere with our tensorflow-gpu installation.
2. Once the environment is created, activate it: conda activate ztdlbook
3. Finally, install the tensorflow-serving-api package without dependencies:
pip install tensorflow-serving-api==1.10.1 --no-deps
Jupyter notebook
Now that we have installed Python and the required packages, let’s explore the code for this book. Code is
provided as notebooks. Jupyter Notebooks are documents that can contain live code, equations,
visualizations and explanatory text. They are very common in the data science community because they
allow for easy prototyping of ideas and fast iteration. Notebook files are opened and edited through the
Jupyter Notebook web application. Let’s launch it!
In the terminal, change directory to the course folder (if you haven’t already) and then type:
jupyter notebook
This will start the notebook server and open a window in your default browser, like the one shown here:
This is called the notebook dashboard and serves as a home page for the notebook. There are three tabs in
the dashboard:
Jupyter Notebook
• Files: this tab displays the notebooks and files in the current directory. By clicking on the
breadcrumbs or on sub-directories at the top of the notebook list, you can navigate your file system.
Additional files may be created or uploaded. To create a new notebook, click on the “New” button
at the top of the list and select a kernel from the dropdown. To upload a new file, click on the
“Upload” button and browse to the file on your computer. By selecting a notebook file, you can
perform several tasks, such as “Duplicate”, “Shutdown”, “View”, “Edit” or “Delete” it.
• Running: this tab displays the currently running Jupyter processes, either Terminals or Notebooks.
This tab is important for shutting down running notebooks: Notebooks remain running until you
explicitly shut them down, and closing the notebook’s page is not sufficient.
• Clusters: this tab displays parallel processes provided by IPython parallel. It requires further
activation, which is not necessary for the scope of this book.
If you are new to Jupyter Notebook, it may feel a little disorienting at first, especially if you are used to
working in an IDE. However, you’ll see that it’s actually quite easy to navigate your way around it. Let’s start
from how you open the first notebook.
and the notebooks forming the course will appear. Go ahead and click on the 00_Introduction.ipynb
notebook:
This will open a new tab where you should see the content of this chapter in the notebook. Now scroll down
to this point and feel free to continue reading from the screen if you prefer.
Course notebook
Let us summarize here a few very useful commands to get you started with Jupyter Notebook.
TIP: For a complete introduction to the Jupyter Notebook we encourage you to have a look
at the official documentation.
• Ctrl-ENTER executes the currently active cell and keeps the cursor on the same cell
• Shift-ENTER executes the currently active cell and moves the cursor to the next cell
• ESC enables the Command Mode. Try it. You’ll see the border of the cell change to blue. In
Command Mode you can press a single key and access many commands. For example, use:
– A to insert cell above the cursor
– B to insert cell below the cursor
– DD to delete the current cell
– F to open the find/replace dialogue
– Z to undo the last command
Finally you can use H to access the help dialog with all the keyboard shortcuts for both command and edit
mode:
Environment check
If you have followed the instructions this far you should be running the first notebook.
The next command cell makes sure that you are using the Python executable from within the course
environment and should evaluate without an error.
TIP: If you get an error, try the following:
1. Close this notebook.
2. Go to the terminal and stop Jupyter Notebook using: CTRL+C
3. Make sure that you have activated the environment; you should see a prompt like: (ztdlbook) $
4. (Optional) If you don’t see that prompt, activate the environment:
– Mac/Linux: conda activate ztdlbook
– Windows: activate ztdlbook
5. Restart Jupyter Notebook.
6. Re-open the first notebook in the course folder.
7. Re-run the next cell.
In [1]: import os
        import sys

        env_name = 'ztdlbook'
        p = sys.executable

        try:
            assert(p.find(env_name) != -1)
            print("Congrats! Your environment is correct!")
        except Exception as ex:
            print("It seems your environment is not correct.\n",
                  "Currently running Python from this path:\n",
                  p,
                  "\n",
                  "Please follow the instructions and retry.")
            raise ex
Python 3.6
The next line checks that you’re using Python 3.6.x from Anaconda and it should execute without any error.
If you get an error, go back to the previous step and make sure you created and activated the environment
correctly.
# Expected values, taken from the description above (Python 3.6.x from Anaconda)
python_version = '3.6'
distribution = 'Anaconda'
v = sys.version

try:
    assert(v.find(python_version) != -1)
    assert(v.find(distribution) != -1)
    print("Congrats! Your Python is correct!")
except Exception as ex:
    print("It seems your Python is not correct.\n",
          "Currently running this Python version:\n",
          v,
          "\n",
          "Please follow the instructions above\n",
          "and make sure you activated the environment.")
    raise ex
Jupyter
import jupyter

# Location of the installed jupyter module (assumed definition of j)
j = jupyter.__file__

try:
    assert(j.find('jupyter') != -1)
    assert(j.find(env_name) != -1)
    print("Congrats! You are using Jupyter from\n",
          "within the environment.")
except Exception as ex:
    print("It seems you are not using the correct\n",
          "version of Jupyter.\n",
          "Currently running Jupyter from this path:\n",
          j,
          "\n",
          "Please follow the instructions above\n",
          "and make sure you activated the environment.")
    raise ex
Other packages
Here we will check that all the packages are installed and have the correct versions. If everything is ok you
should see:
If there’s any issue here please make sure you have checked the previous steps.
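The check_version helper called in the next cell is not defined in this section. A minimal sketch of what such a helper could look like is shown here, assuming it simply compares the package's __version__ attribute with the expected version prefix. The name fake and its version number are hypothetical and only serve to try the function without importing a real package.

```python
from types import SimpleNamespace


def check_version(pkg, version):
    """Return True if pkg.__version__ matches the expected version prefix."""
    actual = getattr(pkg, "__version__", "unknown")
    # '1.14' should match '1.14' and '1.14.5', but not '1.1' or '1.140'
    ok = actual == version or actual.startswith(version + ".")
    status = "OK" if ok else "MISMATCH"
    print("{:<12} expected {}, found {}: {}".format(
        pkg.__name__, version, actual, status))
    return ok


# Hypothetical stand-in for an imported package
fake = SimpleNamespace(__name__="fakepkg", __version__="1.14.5")

result_match = check_version(fake, "1.14")
result_mismatch = check_version(fake, "1.1")
```

Comparing on the prefix followed by a dot avoids accepting 1.14.5 when version 1.1 is expected.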
import cython
import flask
import gensim
import h5py
import jupyter
import matplotlib
import numpy
import pandas
import PIL
import pip
import pytest
import sklearn
import scipy
import seaborn
import twisted
import hamcrest
import tensorflow
import keras
import tensorflow_serving

check_version(cython, '0.28')
check_version(flask, '1.0')
check_version(gensim, '3.4')
check_version(h5py, '2.8')
check_version(matplotlib, '2.2')
check_version(numpy, '1.14')
check_version(pandas, '0.23')
check_version(PIL, '5.2')
check_version(pip, '10.0')
check_version(pytest, '3.7')
check_version(sklearn, '0.19')
check_version(scipy, '1.1')
check_version(seaborn, '0.9')
check_version(twisted, '18.7')
check_version(hamcrest, '1.9')
check_version(tensorflow, '1.10')
check_version(keras, '2.2')
Congratulations! You have just verified that you have correctly set up your computer to run the code in this
book.
Troubleshooting installation
If for some reason you encounter errors while running the first notebook, the simplest solution is to delete
the environment and start from scratch again.
deactivate ztdlbook
• restart from environment creation and make sure that each step completes successfully.
Updating Conda
One thing you can also try is to update your conda executable. This may help if you already had Anaconda
installed on your system.
Image recognition
This is a very common application, consisting of determining whether or not an image contains some specific
objects, features, or activities. For example, the following image shows an object detection algorithm taken
from the Google Blog.
The trained model is able to identify the objects in the image. Similar algorithms can be applied to identify
faces, to detect diseases from a radiograph, or in self-driving cars, just to name a few examples.
Predictive modeling
Deep Learning may be applied for predictive purposes. Algorithms can be applied to the time series of a
certain quantity in order to forecast a future trend: for example, the energy consumption of a region,
the temperature over an area, the price of a stock, and so on. Neural Networks can also be used for
predicting demographics and election results or even earthquakes.
Object Detection
Language translation
Deep Learning can be applied for language translation. This approach uses a large artificial Neural Network
to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated
model. The following image represents an example of an instant visual translation, taken from the Google
Blog. This algorithm combines image recognition tasks with language translation ones.
Machine Translation
Recommender system
Recommender systems help the user find the right choice among the available possibilities. They are
everywhere and we use them every day: when we buy a book that Amazon recommends to us based on our
previous history, or when we listen to a song tailored to our taste on Spotify, or when we watch with the
family a movie recommended by Netflix, just to name some examples.
Automatic image captioning is the task where, given an image, the system can generate a caption that
describes the contents of the image. Once you can detect objects in photographs and generate labels for
those objects, you can turn those labels into a coherent sentence description. This is a sample of automatic
image caption generation taken from Andrej Karpathy and Li Fei-Fei at Stanford University.
Anomaly detection
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected
behavior. It has many applications in business, from intrusion detection (identifying strange patterns in
network traffic that could signal a hack) to system health monitoring (spotting a malignant tumor in an MRI
scan), and from fraud detection in credit card transactions to fault detection in operating environments.
This is a task where a Deep Learning model learns how to play a computer game, training to maximize some
goals. One famous example is AlphaGo, the algorithm developed by Google’s DeepMind that beat the
world champion at the game of Go.
First we are going to create these two sets of points, and then we will build a model that is able to separate
them. Although it’s a toy example, this is representative of many industry relevant problems where a binary
prediction is requested from the model.
Question: Can you think of any industry examples where we may want to predict a binary outcome?
Answer: detecting if an email is spam or not, detecting if a credit card transaction is legitimate or not,
predicting if a user is going to buy an item or not. . .
The primary goal of this exercise is to see that, with a few lines of code, we can define and train a
Deep Learning model. Do not worry if some of it is beyond your understanding yet: we’ll walk through it
in depth and see similar code throughout the rest of the book.
In the next chapters, we will be building more complex models and we will work with more interesting
datasets.
Numpy
At their core, Neural Networks are mathematical functions. The workhorse library used industry-wide is
numpy. numpy is a Python library that contains many mathematical functions, particularly around working
with arrays of numbers.
For instance, numpy contains functions for:
• vector math
• matrix math
• operations optimized for number arrays
While we’ll use higher-level libraries such as Keras a lot in this book, being familiar with and proficient in
using numpy is a core skill we’ll need to build (and evaluate) our networks. Also, while numpy is a
comprehensive library, there are only a few key functions that we will use over and over again. We’ll cover
each new function as it comes up, so let’s dive in and try out a few basic operations.
Basic Operations
The first thing we need to do to use numpy is import it into our workspace:
In [1]: import numpy as np
TIP: If you get an error message similar to the following one, don’t worry.
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-4ee716103900> in <module>()
----> 1 import numpy as np
Python error messages may not seem very easy to navigate, but they are actually quite simple to read.
To understand the error, we suggest reading from the bottom up. Usually the last line is the
most informative. Here it says: ImportError, so it looks like it didn’t find the module
called numpy. This could be due to many reasons, but it probably indicates that either you
didn’t install numpy or you didn’t activate the conda environment. Please refer back to the
installation section and make sure you have activated the environment before starting
jupyter notebook.
Here we’ve created a 1-dimensional array containing four numbers. We can evaluate a to see the current
values:
In [3]: a
Note that the type of a is numpy.ndarray. The documentation for this type is available here.
In [4]: type(a)
Out[4]: numpy.ndarray
TIP: Jupyter Notebook is a great interactive environment and it offers a lot of help when we
want to quickly check the documentation for an object. This can be accessed by appending
a question mark to any variable in our notebook.
For example, in the next cell, try typing:
a?
As you have noticed, this opens a pane in the bottom with the documentation for the
object a. This trick can be used with any object in the notebook. Pretty awesome! Press
escape to dismiss the panel at the bottom.
c = np.array([[[1, 2, 3],
               [4, 3, 6]],
              [[8, 5, 1],
               [5, 2, 7]],
              [[0, 4, 5],
               [8, 9, 1]],
              [[1, 2, 6],
               [3, 7, 4]]])
Again we can evaluate them to check that they are indeed what we expect them to be:
In [6]: b
In [7]: c
Out[7]: array([[[1, 2, 3],
                [4, 3, 6]],

               [[8, 5, 1],
                [5, 2, 7]],

               [[0, 4, 5],
                [8, 9, 1]],

               [[1, 2, 6],
                [3, 7, 4]]])
In mathematical terms, we can think of the 1-D array as a vector, the 2-D array as a matrix and the 3-D array
as a tensor of order 3.
Numpy arrays are objects, which means they have attributes and methods. A useful attribute is shape,
which tells us the number of elements in each dimension of the array:
In [8]: a.shape
Out[8]: (4,)
Python here tells us the object has 4 elements along the first axis. The trailing comma is needed in Python to
indicate that the object is a tuple with only one element.
Tab tricks
Trick 1: Tab completion In Jupyter Notebook we can type faster by using the tab key to complete any
variable name that has been created previously. So, for example, if somewhere along our code we have
created a variable called verylongclumsynamevariable and we need to use it again, we can simply start
typing ver and hit tab to see the possible completions, including our long and clumsy variable.
TIP: try to use meaningful and short variable names, to make your code more readable.
Trick 2: Methods & Attributes In Jupyter Notebook, we can hit tab after the dot . to know which
methods and attributes are accessible for a certain object. For example try typing
a.
and then hit tab. You will notice that a small window pops up with all the methods and attributes available
for the object a. This looks like:
This is very handy if we are looking for inspiration about what a can do.
Trick 3: Documentation pop-up Let’s go ahead and select a.argmax from the tab menu by hitting
enter. This is a method, and we can quickly find how it works. Let’s open a round parenthesis a.argmax(
and let’s hit SHIFT+TAB+TAB (that is TAB two times in a row while holding down SHIFT). This will open a
pop-up window with the documentation of the argmax function.
Here you can read how the function works, which inputs it requires and what outputs are returned. Pretty
nice!
In [9]: b.shape
Out[9]: (3, 4)
Since b is a 2-dimensional array, the attribute .shape has two elements, one for each of the 2 axes of the
matrix. In this particular case we have a matrix with 3 rows and 4 columns, or a 3x4 matrix.
In [10]: c.shape
Out[10]: (4, 2, 3)
c has 3 dimensions. Notice how the last element indicates the length of the innermost axis, in fact shape is a
tuple, whose elements are ordered from the outer list to the inner list.
TIP: Knowing how to navigate the shape of an ndarray is important. Most of the work we
will encounter later depends on it, from being able to perform a dot product between weights and inputs
in a model to correctly reshaping images when feeding them to a convolutional Neural
Network.
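The following sketch recaps how shape behaves for arrays with one, two and three dimensions, and why it matters for a dot product. The arrays are filled with zeros and ones purely for illustration, and the names weights and inputs are assumptions echoing the tip above.

```python
import numpy as np

a = np.zeros(4)          # 1-D array with 4 elements
b = np.zeros((3, 4))     # 2-D array: 3 rows, 4 columns
c = np.zeros((4, 2, 3))  # 3-D array: the outermost axis comes first

for name, arr in [("a", a), ("b", b), ("c", c)]:
    # ndim is the number of axes, size the total number of elements
    print(name, arr.ndim, arr.shape, arr.size)

# A dot product needs matching inner dimensions: (3, 4) with (4,) gives (3,)
weights = np.ones((3, 4))
inputs = np.ones(4)
out = weights.dot(inputs)
print(out.shape)
```

If the inner dimensions did not match, for example a (3, 4) array with a vector of 3 elements, numpy would raise a ValueError instead.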
Selection
Now that we know how to create arrays, we also need to know how to extract data out of them. You can
access elements of an array using the square brackets. For example, we can select the first element in a by
doing:
In [11]: a[0]
Out[11]: 1
Remember that numpy indices start from 0, so the n-th element of an array is found at index n-1.
For instance, the first element can be extracted by referencing the cell at a[0] and the second element at
a[1].
Uncomment the next line and select the second element of b.
Unlike accessing arrays in, say, JavaScript, numpy arrays have a powerful selection notation that you can use
to read data in a variety of ways.
For instance, we can use commas to select along multiple axes. For example, here’s how we can get the first
sub-element of the third element in c:
In [13]: c[2, 0]
What about selecting the first element of every row in b, i.e. the first column? That’s achieved with the : operator:
In [14]: b[:, 0]
: is the delimiter of the slice syntax to select a sub-part of a sequence, like: [begin:end].
In [15]: a[0:2]
Out[16]: array([[8]])
Select all the elements from the beginning excluding the last one:
Stride
We can also select regularly spaced elements by specifying a step size after a second :. For example, to select
the first and third element in a we can type:
In [20]: a[0:-1:2]
or, simply:
In [21]: a[::2]
where it is implicit that we want start and end to be the first and last element in the array.
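The selection patterns discussed in this section can be tried side by side in a short sketch. The values stored in a and b here are hypothetical stand-ins and are not necessarily the ones used in the notebook.

```python
import numpy as np

a = np.array([1, 3, 2, 4])       # hypothetical 1-D array with four numbers
b = np.arange(12).reshape(3, 4)  # hypothetical 3x4 array holding 0..11

first_two = a[0:2]       # begin:end, the end index is excluded
all_but_last = a[:-1]    # from the beginning, excluding the last element
every_other = a[0:-1:2]  # begin:end:step, picks the first and third element
same_stride = a[::2]     # same result with implicit begin and end
first_column = b[:, 0]   # all rows, first column

print(first_two, all_but_last, every_other, same_stride, first_column)
```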
Math
We’ll try to keep the math at a minimum here, but we do need to understand how the various operations
work in an array.
Math operators work element-wise, meaning that the mathematical operation is performed on all of the
elements and their corresponding element locations.
Addition works here by adding one[0] and two[0] together, then adding one[1] and two[1] together:
In [22]: 3 * a
In [23]: a + a
In [24]: a * a
In [25]: a / a
In [26]: a - a
In [27]: a + b
In [28]: a * b
Go ahead and play a little to make sure we understand how these work.
TIP: If you’re not familiar with the difference between element-wise multiplication and dot
product, check out these two links: Hadamard product and Matrix multiplication.
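The difference between element-wise operations and the dot product can be seen in a short sketch. The arrays u, w and m below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical arrays, used only for illustration
u = np.array([1, 2, 3])
w = np.array([10, 20, 30])
m = np.array([[1, 2, 3],
              [4, 5, 6]])

elementwise_sum = u + w     # adds u[0] + w[0], u[1] + w[1], ...
elementwise_prod = u * w    # multiplies position by position (Hadamard product)
broadcast_sum = m + u       # u is added to every row of m
dot = u.dot(w)              # 1*10 + 2*20 + 3*30, a single number
```

Adding a 1-D array to a 2-D array works here because the length of u matches the number of columns of m; numpy calls this broadcasting.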
As mentioned in the beginning, numpy is a very mature library that allows us to perform many operations
on arrays including:
We will introduce these different operations as needed. The curious reader is referred to this documentation
for more information.
Matplotlib
Another library we will use extensively is Matplotlib. The Matplotlib library is used to plot graphs so that we
can visualize our data. Visualization is an important step in Machine Learning. Throughout this book we
will use different kinds of plots in many situations, including
Let’s have a look at how to generate the most common plots available in matplotlib.
Above we’ve imported matplotlib.pyplot, which gives us access to the plotting functions.
Let’s set a few common parameters that define how plots will look. Their names are self-explanatory. In
future chapters we’ll bundle these into a configuration file.
rcParams['font.size'] = 14
rcParams['lines.linewidth'] = 2
rcParams['figure.figsize'] = (9, 6)
rcParams['axes.titlepad'] = 14
rcParams['savefig.pad_inches'] = 0.2
Plot
To plot some data with a line plot, we can just call the plot function on that data. We can try plotting our a
vector from above like so:
In [31]: plt.plot(a);
We can also render a scatter plot, by specifying a symbol to use to plot each point.
In the plot below, we show how 2-D Arrays are interpreted as tabular data, i.e. data arranged in a table with
rows and columns. Each row is interpreted as 1 data point and each column as a coordinate for that data
point.
If we plot b we will obtain 4 curves, one for each coordinate, with 3 points each.
Notice that each of the 4 lines in the graph corresponds to a column of b: the first line has a point
at (0, 8), one at (1, 4), and another at (2, 1), which are the values in the first column of b.
In [34]: b
b has the shape (3, 4). If we want to plot 3 lines with 4 points each we need to swap the rows with the
columns.
In [35]: b.transpose()
[6, 0, 2],
[1, 7, 9]])
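The effect of transposing on the number of curves can be checked from the shapes alone. This is a small sketch with a hypothetical array t that has the same shape as b, (3, 4), but made-up values.

```python
import numpy as np

# Hypothetical 2-D array with 3 rows and 4 columns (same shape as b)
t = np.arange(12).reshape(3, 4)

# plt.plot draws one curve per column: 4 curves with 3 points each
n_points, n_curves = t.shape

# After transposing, rows and columns are swapped: 3 curves with 4 points each
n_points_t, n_curves_t = t.transpose().shape
```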
Notice how we used a special marker and also added the line between points. matplotlib contains a
variety of functions that let us create detailed, sophisticated plots.
We don’t need to understand all of the operations up-front, but for an example of the power Matplotlib
provides, let’s look at a more complex plot example (don’t worry about understanding every one of these
functions, we’ll cover the ones we need later on):
TIP: This gallery gives more examples of what’s possible with Matplotlib.
Scikit-Learn
Scikit-learn is a wonderful library for many Machine Learning algorithms in Python. We will use it here to
generate some data:
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000,
                    noise=0.1,
                    factor=0.2,
                    random_state=0)
The make_circles function will generate two “rings” of data points, each with 2 coordinates. It will also
generate an array of labels, either 0 or 1.
TIP: label is an important term in Machine Learning, which will be explained in more detail when
we discuss classification problems. Basically, it is a number indicating the class the data point belongs to.
We assigned these to the variables X and y. This is a very common notation in Machine Learning.
• X indicates the input variable, and it is usually an array of dimension >= 2 with the outer index running
over the various data points in the set.
• y indicates the output variable, and it can be an array of dimension >= 1.
In this case our data will belong to either one circle or to the other and therefore our output variable will be
binary: either 0 or 1. In particular, the data points belonging to the inner circle will have a label of 1.
In [39]: X
In [40]: y[:10]
In [41]: X.shape
Out[41]: (1000, 2)
In [42]: y.shape
Out[42]: (1000,)
While we are able to investigate the individual points, it becomes a lot clearer if we plot the points visually
using matplotlib.
TIP: what does the X[y==0, 0] syntax do? Let’s break it down:
• X[ , ] is the multiple axis selection operator, so we will be selecting along rows and
columns in the 2D X array.
• X[:, 0] would select all the elements in the first column of X. If we interpret the 2
columns of X as being the coordinates along the 2 axes of the plot, we are selecting the
coordinates along the horizontal axis.
• y==0 returns a boolean array of the same length as y, with True at the locations
where y is equal to 0 and False in the remaining locations. By passing this boolean
array in the row selector, numpy will smartly choose only those rows in X for which
the boolean condition is True.
Thus X[y==0, 0] means: select all the data points corresponding to the label 0 and for
each of these select the first coordinate, then return all these in an array.
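The selection described in the TIP can be tried on a tiny example. The arrays X_demo and y_demo below are hypothetical stand-ins for X and y, with only 4 points so that the result is easy to read.

```python
import numpy as np

# Hypothetical points (one row per point) and their labels
X_demo = np.array([[0.1, 0.2],
                   [1.0, -0.9],
                   [-0.2, 0.1],
                   [0.9, 1.1]])
y_demo = np.array([1, 0, 1, 0])

mask = (y_demo == 0)          # boolean array: True where the label is 0
rows_label0 = X_demo[mask]    # all coordinates of the points with label 0
x_label0 = X_demo[mask, 0]    # only the first coordinate of those points
```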
Notice also how we are using the keywords color and marker to modify the aspect of our plot.
When we look at this plot we can see that points are spread on the plane in two concentric circles, the blue
dots forming a larger circle on the outside and the red crosses a smaller circle on the inside. Although this
dataset is artificially created, it’s representative of many situations where we want to separate two classes that are
not separable with a straight line.
For example, in the next chapters we will try to distinguish between fake and true banknotes or between
different classes of wine, and in all these cases the boundary between one class and the other will not be a
straight line.
In this toy example, we want to train a simple Neural Network model to learn to separate the blue circles
from the red crosses.
Keras
Keras is the Deep Learning library we will use throughout the book. It’s modular, well designed, and it has
been integrated by both Google and Microsoft to serve as the high level API for their Deep Learning
libraries (if you are not familiar with APIs, you can read about them on Wikipedia).
TIP: Do not worry about understanding every line of code of what follows. The rest of the
book is dedicated to walking through how to use Keras and Tensorflow (a very powerful
open-source ML library developed by Google), and so we’re not going to explain every
detail here. Here we’re going to demonstrate an overview of how to use Keras and we’ll
describe more details as the book progresses.
To train a model to tell the difference between red crosses and blue dots above, we have to perform the
following steps:
TIP: If this is the first time you train a Machine Learning model, do not worry, we will
repeat these steps many times throughout the book and we’ll have plenty of opportunities to
familiarize ourselves with them.
Let’s start with step one: defining a Neural Network model. The next 4 lines are all that’s necessary to do just
that.
Keras will interpret those lines and behind the scenes create a model in Tensorflow. You may have
noticed that the above cells informed us that Keras is “Using Tensorflow backend”. Keras is a
high level API specification that can work with several back-ends. For this course, we will use it with the
Tensorflow library as back-end.
The Neural Network below will take 2 inputs (the horizontal and vertical position of a data point in the plot
above) and return a single value: the probability that the point belongs to the “Red Crosses” in the inner
circle.
We start by creating an empty shell for our model. We do this using the Sequential class. This tells Keras
that we are planning to build our model sequentially, adding one component at a time. So we will start by
declaring the model to be a sequential model and then we will proceed adding elements to the model.
TIP: Keras also offers a functional API to build models. This is a bit more complex and we
will introduce it later in the book. Most of the models in this book will be built using the
Sequential API.
The next step is to add components to our model. We won’t explain the meaning of each of these lines now,
except to draw your attention to 2 facts:
1. We are specifying the input shape of our model input_shape=(2,) in the first line below, so that
our model will expect 2 input values for each data point.
2. We have one output value only which will give us the predicted probability for a point to be a blue dot
or a red cross. This is specified by the number 1 in the second line below.
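To get a feeling for what "2 inputs in, 1 probability out" means, here is a rough NumPy sketch of the kind of calculation such a network performs. It is not the Keras model from this chapter: the 4 hidden units, the tanh activation and the random, untrained weights are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_hidden = 4                              # assumed size of the hidden layer

# Random, untrained weights: training would adjust these values
W1 = rng.normal(size=(2, n_hidden))       # 2 inputs -> n_hidden units
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))       # n_hidden units -> 1 output
b2 = np.zeros(1)

# Three hypothetical points, each with 2 coordinates
points = np.array([[0.0, 0.0],
                   [1.0, -1.0],
                   [0.5, 0.2]])

hidden = np.tanh(points @ W1 + b1)        # shape (3, n_hidden)
prob = sigmoid(hidden @ W2 + b2)          # shape (3, 1), one value per point
```

Each row of prob is a number between 0 and 1, which is why the output can be read as a probability.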
Finally we need to compile the model, which will communicate to our backend (Tensorflow) the model
structure and how it will learn from examples. Again, don’t worry about knowing what optimizer and loss
function mean, we’ll have plenty of time to understand those.
In [47]: model.compile(optimizer=SGD(lr=0.5),
                       loss='binary_crossentropy',
                       metrics=['accuracy'])
Defining the model is like wiring up a circuit: the structure is in place, but the model is still an empty box
that has not learned anything. To make it useful, we’ll need to feed some example data to the model, so that
it can learn general rules to separate the red crosses from the blue dots.
This is done using the fit method. We’ll discuss this in great detail in the chapter on Machine Learning.
Epoch 1/20
1000/1000 [==============================] - 1s 1ms/step - loss: 0.6749 - acc: 0.6580
Epoch 2/20
1000/1000 [==============================] - 0s 59us/step - loss: 0.5820 - acc: 0.8280
Epoch 3/20
1000/1000 [==============================] - 0s 58us/step - loss: 0.4815 - acc: 0.8490
Epoch 4/20
1000/1000 [==============================] - 0s 59us/step - loss: 0.4157 - acc: 0.8640
Epoch 5/20
1000/1000 [==============================] - 0s 59us/step - loss: 0.3728 - acc: 0.8700
Epoch 6/20
1000/1000 [==============================] - 0s 60us/step - loss: 0.3407 - acc: 0.8700
Epoch 7/20
1000/1000 [==============================] - 0s 59us/step - loss: 0.2860 - acc: 0.8940
Epoch 8/20
The fit function just ran 20 rounds or passes over our data. Each round is called an epoch. At each epoch we
pass our data through the Neural Network and compare the known labels with the predictions from the
network and measure how accurate our net was.
After 20 epochs the accuracy of our model is 1 or close to 1, meaning that 100% (or close to 100%) of the
data points were predicted correctly. This means our predictions are spot-on.
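Accuracy itself is a simple quantity: the fraction of points for which the predicted class matches the known label. The sketch below uses hypothetical probabilities and labels, and assumes the usual 0.5 threshold to turn a probability into a class.

```python
import numpy as np

# Hypothetical predicted probabilities and known labels
probs = np.array([0.92, 0.10, 0.65, 0.30, 0.48])
labels = np.array([1, 0, 1, 1, 0])

predicted = (probs > 0.5).astype(int)      # class 1 if probability above 0.5
accuracy = (predicted == labels).mean()    # fraction of correct predictions
```

Here 4 of the 5 predictions match the labels, so the accuracy is 0.8.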
Decision Boundary
Now that our model is trained, we can feed it any pair of numbers and it will predict the probability that a
point situated on the 2D plane at those coordinates belongs to the group of red crosses. This is great because
it lets us check whether the model has correctly learned to draw a boundary between red crosses and blue
dots. One way to check this is to draw a grid on the 2D plane and calculate the probability predicted by the
model for every point on this grid. Let’s do it!
TIP: Don’t worry if you don’t yet understand everything in the following code. It is
important that you get the general idea.
Our data varies roughly between -1.5 and 1.5 along both axes, so let’s build a grid of equally spaced
horizontal lines and vertical lines between these 2 extremes.
We will start by building 2 arrays of equally spaced points between -1.5 and 1.5. The np.linspace
function does just that.
In [50]: hticks[:10]
Out[50]: array([-1.5 , -1.47, -1.44, -1.41, -1.38, -1.35, -1.32, -1.29, -1.26,
-1.23])
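A minimal sketch of this step is shown below. The number of points, 101, is an assumption: it is inferred from the 0.03 spacing visible in the output above and from the 10201 grid points mentioned later.

```python
import numpy as np

# 101 points between -1.5 and 1.5 gives a spacing of 3 / 100 = 0.03
# (the value 101 is inferred from the printed output, not taken from the original code)
hticks = np.linspace(-1.5, 1.5, 101)
vticks = np.linspace(-1.5, 1.5, 101)
```

Note that np.linspace includes both end points, which is why 101 points produce 100 equal intervals.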
Now let’s build a grid with all the possible pairs of points from hticks and vticks. The function
np.meshgrid does that.
In [52]: aa.shape
In [53]: aa
In [54]: bb
In [55]: plt.figure(figsize=(6,6))
plt.scatter(aa, bb, s=0.3, color='blue')
# highlight one horizontal series of grid points
plt.scatter(aa[50], bb[50], s=5, color='green')
# highlight one vertical series of grid points
plt.scatter(aa[:, 50], bb[:, 50], s=5, color='red');
The model expects a pair of values for each data point, so we have to re-arrange aa and bb into a single array
with 2 columns.
The ravel function flattens an N-dimensional array to a 1D array and the np.c_ class will help us combine
aa and bb into a single 2D array.
In [57]: ab.shape
Out[57]: (10201, 2)
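The grid construction can be sketched end to end as follows. As before, the 101 points per axis are an assumption inferred from the shapes printed above.

```python
import numpy as np

hticks = np.linspace(-1.5, 1.5, 101)    # assumed resolution
vticks = np.linspace(-1.5, 1.5, 101)

# aa holds the horizontal coordinate of every grid point, bb the vertical one
aa, bb = np.meshgrid(hticks, vticks)    # both have shape (101, 101)

# Flatten both grids and stack them side by side as 2 columns
ab = np.c_[aa.ravel(), bb.ravel()]      # shape (101 * 101, 2)
```

Each row of ab is one (horizontal, vertical) pair, which is the format the model expects.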
We have created an array with 10201 rows and 2 columns. These are all the points on the grid we drew
above. Now we can pass it to the model and obtain a probability prediction for each point in the grid.
In [58]: c = model.predict(ab)
In [59]: c
Out[59]: array([[1.5898737e-04],
[1.2135810e-04],
[9.3694063e-05],
...,
[3.0899048e-02],
[3.0890310e-02],
[3.0880447e-02]], dtype=float32)
Great! We have predictions from our model for all points on the grid, and they are all values between 0 and
1.
Let’s check that they are, in fact, between 0 and 1 by looking at the minimum and maximum
values:
In [60]: c.min()
Out[60]: 9.601998e-06
In [61]: c.max()
Out[61]: 0.9903734
Let’s reshape c so that it has the same shape as aa and bb. We need to do this so that we will be able to use it
to control the size of each dot in the next plot.
In [62]: c.shape
Out[62]: (10201, 1)
In [63]: cc = c.reshape(aa.shape)
cc.shape
Let’s see what they look like! We will redraw the grid, making the size of each dot proportional to the
probability predicted by the model that the point belongs to the group of red crosses.
Nice! We see that a dense cloud of points with high probability is found in the central region of the plot,
exactly where our red crosses are. We can draw the same data in a more appealing way using the
plt.contourf function with appropriate colors and transparency:
The last plot clearly shows the decision boundary of our model, i.e. the curve that separates the area
predicted to be red crosses from the area predicted to be blue dots.
Our model learned to distinguish the two classes perfectly! This is really promising, although the current
example was very simple.
Below are some exercises for you to practice with the commands and concepts we just introduced.
Exercises
Exercise 1
Exercise 2
• use plt.imshow() to display the array a as an image. Does it look like a checkerboard?
• display c, d and e using the same function, change the colormap to grayscale
• plot e using a line plot, assigning each row to a different data series. This should produce a plot with
noisy horizontal lines. You will need to transpose the array to obtain this.
• add title, axes labels, legend and a couple of annotations
Exercise 3
• encapsulate the code that calculates the decision boundary in a nice function called
plot_decision_boundary with the signature:
Exercise 4
• use the functions make_blobs and make_moons from scikit learn to generate new datasets with 2
classes
• plot the data to make sure you understand what has been generated
• re-train your model on each of these datasets
• display the decision boundary for each of these models
2 Data Manipulation
This chapter is about data.
In order to do Deep Learning effectively, we’ll need to be able to work with data of all shapes and sizes. At the
end of this section we will be able to explore data visually and do simple descriptive statistics using Python
and Pandas.
On the other hand, if we are developing a method to detect cancer from brain scans, we will deal with
image and video data. Very often these files will be large in size (or number) and possibly in complicated
formats.
If we are trying to detect a signal for trading stocks based on information in news articles, our data will
often be millions of text documents.
If we are translating spoken language to text, our input data will be sound files, etc.
Traditionally Machine Learning has been fairly good at dealing with “tabular” data, while “unstructured”
data such as text, sound and images, were each addressed with very complex, domain-specific techniques.
Deep Learning is particularly good at efficiently learning ways to represent such “unstructured” data,
and this is one of the reasons for its enormous success. Neural net models can be used to solve a translation
problem or an image classification problem, without worrying too much about the type of underlying data.
This is the first reason why Deep Learning is so popular: it can deal with many different types of data.
But before we can train models on our data, we need to gather the data and provide it to our networks in a
consistent format. Let’s take a look at a few different types of data and explore the tools we’ll be using to
process (and explore) them.
Tabular Data
The simplest data to feed to a Machine Learning model is so-called tabular data. It’s called tabular
because it can be represented in a table with rows and columns, very much like a spreadsheet. Let’s use an
example to define some common vocabulary that will be used throughout the book.
A row in a table corresponds to a data point, and it’s often referred to as a record. A record is a list of
attributes (extracted from a data point), which are often numbers, categories, or free-form text. These
attributes go by the name of features.
Features can be directly measurable or they can be inferred from other features. Think, for example, of the
number of times a user visited a website or the browser they used - both of these features can be directly
counted. We could also create a new feature from existing data such as the average time between two user
visits. The process of calculating new features is called feature engineering.
That said, not all features are equally informative. Some may be completely irrelevant for what we are
trying to do. For example, if we are trying to predict how likely a user is to buy our product, chances are that
their first name will have no predictive power. On the other hand, previous purchases may carry a lot
of information in terms of propensity to buy.
While traditionally a lot of emphasis has been placed on feature engineering (extracting or inventing “good”
features) and feature selection (keeping only the “good” features), Deep Learning solves this problem by
automatically figuring out the important features and building higher order combinations of simple features
deeper in the network.
This is another reason why Deep Learning is so popular: it automates the complicated process of feature
engineering.
1: Bishop, Christopher (2006). Pattern recognition and Machine Learning. Berlin: Springer. ISBN
0-387-31073-8.
We want to ask these questions early in order to decide how to proceed further without wasting time. For
example, if we have too few data points we may not have enough examples to train a Machine Learning
model. Our first step in that case will be to go out and gather more data. If we have missing data we need to
decide what to do about it. Do we delete the records with missing data or do we impute (create) the missing data?
And if we impute the data, how do we decide how to impute it? If we have many features but only a few of
them are not constant, we’d better eliminate the constant features first, because they will clearly have no
predictive power, and so on.
Python has a library that lets us address all these questions very easily: it’s called Pandas.
Pandas is an open source library that provides high-performance, easy-to-use data structures and data
analysis tools. It can load data from a multitude of sources including CSV, JSON, Excel, HTML, SQL and
many others (here you may find all the types of file that can be loaded, together with a short description).
Let’s start by loading a csv file.
TIP: A comma-separated values file (CSV) stores tabular data (numbers and text) in plain
text. Each line of the file is a data record, and each record consists of one or more fields,
separated by commas.
Before we do anything else, let’s also set a couple of common options that will help us contain the size
of the tables displayed. We configure pandas to show at most 13 rows of data in a dataframe and at most 11
columns. Bigger dataframes will be truncated with ellipses.
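A minimal sketch of how these two limits can be set is shown below. It uses the standard display.max_rows and display.max_columns options and may differ from the exact configuration cell used in the book.

```python
import pandas as pd

# Show at most 13 rows and 11 columns when displaying a DataFrame
pd.set_option("display.max_rows", 13)
pd.set_option("display.max_columns", 11)
```

The current value of any option can be read back with pd.get_option, for example pd.get_option("display.max_rows").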
Notice here that the display.latex.repr option is only set to True for the PDF version of the book, while it’s set
to False for the other versions. Starting from the next chapter we’ll group all the configurations in a single
script. Let’s now load the data from the titanic-train.csv file:
In [3]: df = pd.read_csv('../data/titanic-train.csv')
This is a famous dataset containing information about passengers of the Titanic, such as their name, age,
and if they survived.
pd.read_csv will read the CSV file and create a Pandas DataFrame object from it. A DataFrame is a
labeled, 2D data-structure, much like a spreadsheet.
Now that we have imported the Titanic data into a Pandas DataFrame object, we can inspect it. Let’s start by
peeking into the first few records to get a feel for how DataFrames work.
df.head() displays the first 5 lines of the DataFrame. We can see it as a table, with column names inferred
from the CSV file and an index, indicating the row it came from:
In [4]: df.head()
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
df.info() summarizes the content of the DataFrame, letting us know the index range, the number and
names of columns with their data type.
We also learn about missing entries. For example, notice that the Age column has a few null entries.
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
df.describe() summarizes the numerical columns with some basic stats: count, min, max, mean,
standard deviation etc.
In [6]: df.describe()
Out[6]:
This is very useful to compare the scale of different features and decide if we need to rescale some of them.
Indexing
We can get the fourth row of the DataFrame (numerical index 3) using df.iloc[3]
In [7]: df.iloc[3]
Out[7]:
3
PassengerId 4
Survived 1
Pclass 1
Name Futrelle, Mrs. Jacques Heath (Lily May Peel)
Sex female
Age 35
SibSp 1
Parch 0
Ticket 113803
Fare 53.1
Cabin C123
Embarked S
In [8]: df.loc[0:4,'Ticket']
Out[8]:
Ticket
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450
We can obtain the same result by selecting the first 5 elements of the column ‘Ticket’ with the .head()
command:
In [9]: df['Ticket'].head()
Out[9]:
Ticket
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450
Out[10]:
Embarked Ticket
0 S A/5 21171
1 C PC 17599
2 S STON/O2. 3101282
3 S 113803
4 S 373450
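The difference between position-based and label-based indexing is easy to miss, so here is a small sketch. The DataFrame small is rebuilt by hand from the first 5 rows displayed by df.head() above, restricted to three columns, so that the example does not need the CSV file. The last line is one possible way to select two columns at once; the variable names are hypothetical.

```python
import pandas as pd

# First 5 rows shown by df.head() above, restricted to three columns
small = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

fourth_row = small.iloc[3]                 # by position: row number 3
first_four = small.iloc[0:4]               # by position: the end is excluded
tickets = small.loc[0:4, "Ticket"]         # by label: both ends are included
two_cols = small[["Embarked", "Ticket"]]   # a list of names selects several columns
```

Notice that iloc[0:4] returns 4 rows while loc[0:4] returns 5, because label slices include the last label.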
Selections
Pandas is smart about indices and allows us to write expressions. For example, we can get the list of
passengers with Age over 70:
Out[11]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
To understand what this does, let’s break it down. df['Age'] > 70 returns a boolean Series of values that
are True when the Age is greater than 70 (and False otherwise). The length of this series is the same as that
of the whole DataFrame, as you can check by running:
Out[12]: 891
Passing this series to the [] operator selects only the rows for which the boolean series is True. In other
words, Pandas matches the index of the DataFrame with the index of the Series and selects only the rows for
which the condition is True.
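The two steps can be seen separately in the sketch below. The DataFrame ages is a hand-built sample: its rows and index values are copied from the outputs shown above (two rows from df.head() and three of the passengers over 70), so it is only a small subset of the full dataset.

```python
import pandas as pd

# A few rows copied from the outputs above, keeping their original index
ages = pd.DataFrame(
    {"Name": ["Braund, Mr. Owen Harris",
              "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
              "Goldschmidt, Mr. George B",
              "Connors, Mr. Patrick",
              "Barkworth, Mr. Algernon Henry Wilson"],
     "Age": [22.0, 35.0, 71.0, 70.5, 80.0]},
    index=[0, 3, 96, 116, 630])

mask = ages["Age"] > 70    # boolean Series, one value per row
over_70 = ages[mask]       # keeps only the rows where mask is True
```

The mask has the same length and the same index as ages, which is how Pandas knows which rows to keep.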
Out[13]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
We can use the & and | Python operators (which normally do bitwise and bitwise or, respectively) to
combine conditions. For example, the next statement returns the records of passengers 11 years old and with
5 siblings/spouses.
Out[14]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
59 60 0 3 Goodwin, Master. William Frederick male 11.0 5 2 CA 2144 46.9 NaN S
If we use an or operator, we’ll have passengers that are 11 years old or passengers with 5 siblings/spouses.
Out[15]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
59 60 0 3 Goodwin, Master. William Frederick male 11.0 5 2 CA 2144 46.9000 NaN S
71 72 0 3 Goodwin, Miss. Lillian Amy female 16.0 5 2 CA 2144 46.9000 NaN S
386 387 0 3 Goodwin, Master. Sidney Leonard male 1.0 5 2 CA 2144 46.9000 NaN S
480 481 0 3 Goodwin, Master. Harold Victor male 9.0 5 2 CA 2144 46.9000 NaN S
542 543 0 3 Andersson, Miss. Sigrid Elisabeth female 11.0 4 2 347082 31.2750 NaN S
683 684 0 3 Goodwin, Mr. Charles Edward male 14.0 5 2 CA 2144 46.9000 NaN S
731 732 0 3 Hassan, Mr. Houssein G N male 11.0 0 0 2699 18.7875 NaN C
802 803 1 1 Carter, Master. William Thornton II male 11.0 1 2 113760 120.0000 B96 B98 S
Again, we can use the query method to achieve the same result.
Out[16]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
59 60 0 3 Goodwin, Master. William Frederick male 11.0 5 2 CA 2144 46.9000 NaN S
71 72 0 3 Goodwin, Miss. Lillian Amy female 16.0 5 2 CA 2144 46.9000 NaN S
386 387 0 3 Goodwin, Master. Sidney Leonard male 1.0 5 2 CA 2144 46.9000 NaN S
480 481 0 3 Goodwin, Master. Harold Victor male 9.0 5 2 CA 2144 46.9000 NaN S
542 543 0 3 Andersson, Miss. Sigrid Elisabeth female 11.0 4 2 347082 31.2750 NaN S
683 684 0 3 Goodwin, Mr. Charles Edward male 14.0 5 2 CA 2144 46.9000 NaN S
731 732 0 3 Hassan, Mr. Houssein G N male 11.0 0 0 2699 18.7875 NaN C
802 803 1 1 Carter, Master. William Thornton II male 11.0 1 2 113760 120.0000 B96 B98 S
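Here is a compact sketch of combined conditions and of the equivalent query call. The DataFrame fam is a hand-built sample whose Age and SibSp values and index labels are copied from rows shown in the outputs above. The variable names are hypothetical.

```python
import pandas as pd

# Age and SibSp of a few passengers, copied from the outputs above
fam = pd.DataFrame({"Age": [22.0, 11.0, 16.0, 11.0, 11.0],
                    "SibSp": [1, 5, 5, 4, 0]},
                   index=[0, 59, 71, 542, 731])

# Parentheses are needed because & and | bind more tightly than ==
both = fam[(fam["Age"] == 11) & (fam["SibSp"] == 5)]
either = fam[(fam["Age"] == 11) | (fam["SibSp"] == 5)]

# The same "or" selection written as a query string
either_q = fam.query("Age == 11 or SibSp == 5")
```

Inside a query string we can write the words and and or, while in the bracket notation we have to use & and |.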
Unique Values
The unique method returns the distinct values found in a column. For example, we can use it to find
the possible ports of embarkation:
In [17]: df['Embarked'].unique()
Sorting
We can sort a DataFrame by any group of columns. For example, let’s sort people by Age, starting from the
oldest using the ascending flag. By default, ascending is set to True, which sorts by the youngest first. To
reverse the sort order, we set this value to False.
Out[18]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
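A minimal sketch of this sort, using the sort_values method, is shown below. The small DataFrame is built by hand from ages and index labels that appear in the outputs above, so it can be run without the CSV file.

```python
import pandas as pd

# Ages of a few passengers, copied from the outputs above
sample = pd.DataFrame({"Age": [22.0, 71.0, 70.5, 80.0, 74.0]},
                      index=[0, 96, 116, 630, 851])

youngest_first = sample.sort_values("Age")                  # default: ascending=True
oldest_first = sample.sort_values("Age", ascending=False)   # reversed order
```

The index labels travel with the rows, so after sorting we can still tell which passenger each age belongs to.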
Aggregations
Pandas also allows us to perform aggregations and group-by operations like we can do in SQL, and it can
reshuffle data into pivot tables like a spreadsheet application. This makes it very powerful for data
exploration and we strongly recommend a thorough look at its documentation if you are new to Pandas.
Here we will review only a few useful commands.
value_counts() counts how many instances of each value there are in a Series, sorting them in descending
order. We can use it to know how many people survived and how many died.
In [19]: df['Survived'].value_counts()
Out[19]:
Survived
0 549
1 342
In [20]: df['Pclass'].value_counts()
Out[20]:
Pclass
3 491
1 216
2 184
Like in a database, we can group data by column name and then aggregate them with some function. For
example, let’s count dead and alive passengers by class:
In [21]: df.groupby(['Pclass','Survived'])['PassengerId'].count()
Out[21]:
PassengerId
Pclass Survived
1 0 80
1 136
2 0 97
1 87
3 0 372
1 119
This is a very powerful tool. We can immediately see that almost 2/3 of passengers in first class survived,
compared to only about 1/4 of passengers in 3rd class!
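We can verify these fractions directly from the counts in the table above. The sketch below retypes the counts by hand; the column names died and survived are our own labels for Survived equal to 0 and 1.

```python
import pandas as pd

# Counts per class copied from the group-by output above
counts = pd.DataFrame({"died": [80, 97, 372],
                       "survived": [136, 87, 119]},
                      index=pd.Index([1, 2, 3], name="Pclass"))

# Fraction of survivors in each class
survival_rate = counts["survived"] / counts.sum(axis=1)
```

The resulting rates are roughly 0.63 for first class, 0.47 for second class and 0.24 for third class.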
We can look at an individual column’s min, max, mean and median to get some more information
about our numerical features. For example, the next line shows that the youngest passenger was less than
six months old:
In [22]: df['Age'].min()
Out[22]: 0.42
In [23]: df['Age'].max()
Out[23]: 80.0
In [24]: df['Age'].mean()
Out[24]: 29.69911764705882
In [25]: df['Age'].median()
Out[25]: 28.0
We can see if the mean age of survivors was different from the mean age of victims.
Out[26]:
Age
Survived
0 30.626179
1 28.343690
Although the mean age of survivors seems a bit lower, the difference between the 2 groups is small
compared to the spread of ages within each group, as we can see by looking at the standard deviation.
Out[27]:
Age
Survived
0 14.172110
1 14.950952
Merge
Pandas can perform join operations like we can do in SQL. This operation is called merge. For example, let’s
combine the 2 previous tables:
Out[28]:
Survived Age
0 0 31.0
1 1 28.0
Out[29]:
Survived Age
0 0 14.0
1 1 15.0
Out[30]:
Out[31]:
merge is incredibly powerful. We recommend reading more about its functionality in the Pandas documentation.
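As a minimal sketch of a merge, we can combine the two small tables displayed above, using the rounded values shown there. The suffixes _mean and _std are our own choice, added to tell the two Age columns apart after the join.

```python
import pandas as pd

# The two tables shown above (rounded mean and standard deviation of Age)
mean_age = pd.DataFrame({"Survived": [0, 1], "Age": [31.0, 28.0]})
std_age = pd.DataFrame({"Survived": [0, 1], "Age": [14.0, 15.0]})

# Join on the shared Survived column; suffixes rename the clashing Age columns
merged = pd.merge(mean_age, std_age, on="Survived",
                  suffixes=("_mean", "_std"))
```

The result has one row per value of Survived and the columns Survived, Age_mean and Age_std.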
Pivot Tables
Pandas has the ability to aggregate data into a pivot table, just like Microsoft Excel.
TIP: A pivot table is a table that summarizes data in another table, and is made by applying
an operation such as sorting, averaging, or summing to data in the first table. A trivial
example is a column of numbers as the first table, and the column average as a pivot table
with only one row and column.
For example, we can create a table which holds the count of the number of people who survived (or not) per
class:
In [32]: df.pivot_table(index='Pclass',
                        columns='Survived',
                        values='PassengerId',
                        aggfunc='count')
Out[32]:
Survived 0 1
Pclass
1 80 136
2 97 87
3 372 119
Correlations
Finally, Pandas can also calculate correlations between features, making it easier to spot redundant
information or uninformative columns.
For example, let’s check the correlation of a few columns with the Survived column. If it’s true that
women and children were saved first, we expect to see some correlation with Age and Sex, while we expect no
correlation with PassengerId.
Since the Sex column is a string, we first need to create an auxiliary (extra) IsFemale boolean column that
is set to True if the Sex is set to the string ‘female’.
Out[34]:
Survived
Pclass -0.338481
Age -0.077221
SibSp -0.035322
PassengerId -0.005007
Parch 0.081629
Fare 0.257307
IsFemale 0.543351
Survived 1.000000
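A table like the one above can be obtained along the following lines. This is a sketch on a small made-up DataFrame (`toy`), not the original cell; only the column names are taken from the text.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic table
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Age': [22.0, 38.0, 26.0, 35.0, 4.0, 54.0],
    'PassengerId': [1, 2, 3, 4, 5, 6],
})

# Auxiliary boolean column: True where Sex is the string 'female'
toy['IsFemale'] = toy['Sex'] == 'female'

# Correlation of each numeric column with Survived, sorted from lowest to highest
numeric = toy[['Age', 'PassengerId', 'IsFemale', 'Survived']].astype(float)
corr_w_surv = numeric.corr()['Survived'].sort_values()
print(corr_w_surv)

# Survived correlates perfectly with itself, so it ends up last and is easy to drop
print(corr_w_surv.iloc[:-1])
```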
Before looking at what these values mean, let’s peek ahead a little and look into Pandas plotting functionality.
We can use it to display the last result visually. Let’s import matplotlib and set a few plotting defaults:
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['font.size'] = 14
rcParams['lines.linewidth'] = 2
rcParams['figure.figsize'] = (9, 6)
rcParams['axes.titlepad'] = 14
rcParams['savefig.pad_inches'] = 0.2
Now let’s use pandas to plot the corr_w_surv dataframe. Notice that we will exclude the last row, which is
Survived itself:
Let’s interpret the graph above. The largest correlation with survival is being a woman. We also see that
people who paid a higher fare (probably corresponding to a higher class) had a higher chance of surviving.
The attribute Pclass is negatively correlated, meaning the higher the class number the lower the chance of
survival, which makes sense (first-class passengers were more likely to survive than third-class ones).
Age is also negatively correlated, though mildly, meaning that younger passengers were slightly more likely
to survive. Finally, as expected, PassengerId has essentially no correlation with survival.
We’ve barely scratched the surface of what Pandas can do in terms of data manipulation and data
exploration. Do refer to the mentioned documentation for a better understanding of its capabilities.
We can represent data visually in several ways, depending on the type of data and on what we are interested
in seeing.
We will create 4 data series:
• A stationary noisy sequence, centered around zero (data1).
• A sequence with larger noise, following a linearly increasing trend (data2).
• A sequence where the noise increases over time (data3).
• A sequence with somewhat intermediate noise, following a sinusoidal oscillatory pattern (data4).
In [39]: N = 1000
data1 = np.random.normal(0, 0.1, N)
data2 = (np.random.normal(1, 0.4, N) +
np.linspace(0, 1, N))
data3 = 2 + (np.random.random(N) *
np.linspace(1, 5, N))
data4 = (np.random.normal(3, 0.2, N) +
0.3 * np.sin(np.linspace(0, 20, N)))
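Now, let’s create a DataFrame object from all of our newly created data sequences. First we stack
the data using np.vstack and we transpose it, so that each sequence becomes a column:
data = np.vstack([data1, data2, data3, data4]).transpose()
cols = ['data1', 'data2', 'data3', 'data4']
df = pd.DataFrame(data, columns=cols)
df.head()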
Out[41]:
Even when we’ve been given a description of these four data sets, it’s really hard to understand what’s going
on by simply looking at the table of numbers. Instead, let’s look at this data visually.
Line Plot
Pandas plot function defaults to a line plot. This is a good choice if our data comes from an ordered series
of consecutive events (for example, the outside temperature in a city over the course of a year).
A line plot represents the values of data in sequential order, and makes it easy to spot trends like growth
over time or seasonal patterns.
Above, we’re using the plot method on the DataFrame. The same plot can be obtained by using
matplotlib.pyplot (and passing in the DataFrame df as an argument) like this:
In [43]: plt.plot(df)
plt.title('Line plot')
plt.legend(['data1', 'data2', 'data3', 'data4']);
Scatter plot
If data is not ordered, and we are looking for correlations between variables, a scatter plot is a better
choice. We can simply change the style of the line plot if we want to plot data in order:
or we can use the scatter plot kind, if we want to visualize one column against another:
In the above plot, we see that there is no correlation between data1 and data2 (which may be obvious
because data1 is flat random noise).
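The two scatter variants described above can be sketched as follows. The data is regenerated here with a seeded generator so that the example is self-contained; the `Agg` backend line is only an assumption for running outside a notebook and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 1000
toy = pd.DataFrame({
    'data1': rng.normal(0, 0.1, N),
    'data2': rng.normal(1, 0.4, N) + np.linspace(0, 1, N),
})

# Option 1: keep the sample order on the x axis, but draw dots instead of lines
ax1 = toy.plot(style='.', title='Line plot drawn with dots')

# Option 2: plot one column against the other
ax2 = toy.plot(kind='scatter', x='data1', y='data2', title='data1 vs data2')

# The correlation coefficient backs up what the scatter plot suggests
print(toy['data1'].corr(toy['data2']))
```

A coefficient close to zero is consistent with the shapeless cloud we see in the second plot.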
Histograms
Sometimes we are interested in knowing the frequency of occurrence of data, and not their order. In this
case we divide the range of data into buckets and ask how many points fall into each bucket. This is called a
histogram, and it represents the statistical distribution of our data.
This could look like a bell curve, or an exponential decay, or have a weird shape. By plotting the histogram
of a feature we might spot the presence of distinct sub-populations in our data and decide to deal with each
one separately.
In [46]: df.plot(kind='hist',
bins=50,
title='Histogram',
alpha=0.6);
Note that we lost all the temporal information contained in our data. For example, the oscillations in data4
are not visible any longer: all we see is a fairly wide bell-like distribution, where the sinusoidal
oscillations have been summed up in the histogram.
Cumulative Distribution
A close relative of a histogram is the cumulative distribution. This is useful to answer questions like: what
fraction of our sample falls below a certain value?
In [47]: df.plot(kind='hist',
bins=100,
title='Cumulative distributions',
density=True,
cumulative=True,
alpha=0.4);
Answers: 1. 100%. If you draw a vertical line that passes through 2 you will see that it
crosses the cumulative distribution for data1 at the high value of 1, which corresponds to
100%. 2. Approximately 50%. This can be seen by tracing a vertical line at 1.5 and checking
at what height it crosses the data2 distribution.
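The same fractions can be checked numerically, without reading them off the plot. The sketch below regenerates data1 and data2 with a seeded generator (an assumption, so the exact numbers differ slightly from the figure) and uses a small helper, `fraction_below`, which is our own name.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
data1 = rng.normal(0, 0.1, N)
data2 = rng.normal(1, 0.4, N) + np.linspace(0, 1, N)

def fraction_below(values, threshold):
    """Fraction of samples that are <= threshold (the empirical cumulative distribution)."""
    return np.mean(values <= threshold)

print(fraction_below(data1, 2))    # all of data1 lies far below 2
print(fraction_below(data2, 1.5))  # roughly half of data2 lies below 1.5
```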
Box plot
A box plot is a useful tool to compare several distributions. It is often used in biology and medicine to
compare the results of an experiment with a control group. For example, in the simplest form of a clinical
trial for a new drug, there will be 2 boxes, one for the population that took the drug and the other for the
population that took the placebo.
In [48]: df.plot(kind='box',
title='Boxplot');
What does this plot mean? In loose terms, it’s as if we were looking at the histogram plot from above. Each
box represents the key facts about the distribution of that particular data series. Let’s first get an intuition
about the information it shows. Later we will give a more formal definition.
Let’s start with the green horizontal line that cuts each box. It marks the center of the distribution,
which for a symmetric, single-peaked histogram coincides with the position of the peak. We can check that
for data1 the line is at 0, exactly like the very sharp peak of data1 in the histogram figure, and that for
data4 the green line is roughly at 3, exactly like the peak of the red histogram in the previous picture.
The box represents the bulk of the data, i.e. it gives us an idea of how fat and centered our distribution is
around the peak. We can see that the box in data3 is not centered around the green line, reflecting the fact
that the histogram in green is skewed. The whiskers give us an idea of the extension of the tails of the
distribution. Again, notice how the upper whisker of data3 extends to high values.
TIP: For the more statistically inclined readers, here are the formal definitions of the above
concepts:
• The green line is the median of our data, i.e. the value lying at the midpoint of the distribution.
• The box around it extends from the first quartile Q1 to the third quartile Q3, so it contains the
central 50% of the data. Notice how the boxes reproduce the actual size of the noise fluctuations
for data2 and data4.
• The whiskers above and below denote the range of data not considered outliers. By default they
extend to the most extreme data points that lie within [Q1 - 1.5*IQR, Q3 + 1.5*IQR], where
IQR = Q3 - Q1 is the interquartile range. Notice that these give us a clear indication that data3
is not symmetric around its median.
• The dots represent data that are considered outliers.
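To make these definitions concrete, here is a minimal sketch that computes the box plot ingredients by hand with NumPy. It regenerates a sequence like data3 with a seeded generator; the variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
data3 = 2 + rng.random(N) * np.linspace(1, 5, N)

# Quartiles and interquartile range
q1, median, q3 = np.percentile(data3, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside this interval are drawn as outlier dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data3[(data3 < lower_fence) | (data3 > upper_fence)]

# Whiskers end at the most extreme points that are still inside the fences
inside = data3[(data3 >= lower_fence) & (data3 <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3)
print(whisker_low, whisker_high, len(outliers))
```

For a skewed sequence like this one, the median sits closer to Q1 than to Q3, which is why the line is not in the middle of the box.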
TIP: In the previous TIP, we just introduced the concept of outliers. Outliers are data that
are distant from other observations. Outliers may be due for example to variability in the
measurement or they may indicate experimental errors. This is a fundamental concept in
Machine Learning, and we’ll have the chance to discuss it later.
Subplots
We can also combine these plots in a single figure using the subplots command:
# 2x2 grid of axes; the figure size is left at its default here
fig, ax = plt.subplots(2, 2)
df.plot(ax=ax[0][0],
        title='Line plot')
df.plot(ax=ax[0][1],
        style='o',
        title='Scatter plot')
df.plot(ax=ax[1][0],
        kind='hist',
        bins=50,
        title='Histogram')
df.plot(ax=ax[1][1],
        kind='box',
        title='Boxplot');
Pie charts
Pie charts are useful to visualize fractions of a total. For example, we could ask how much of data1 is
greater than 0.1:
Out[50]:
data1
False 831
True 169
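Counts like the ones above can be obtained by comparing the column with the threshold and counting the resulting True and False values. The sketch below uses a seeded stand-in for data1, so the exact counts will differ from the table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data1 = pd.Series(rng.normal(0, 0.1, 1000), name='data1')

# Boolean Series: True where the value exceeds 0.1
above = data1 > 0.1

# Number of False and True entries, largest count first
piecounts = above.value_counts()
print(piecounts)
```

Since roughly 16% of a gaussian lies more than one standard deviation above its mean, we expect the True slice to be the smaller one.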
In [51]: piecounts.plot(kind='pie',
                        figsize=(7, 7),
                        explode=[0, 0.15],
                        labels=['<= 0.1', '> 0.1'],
                        autopct='%1.1f%%',
                        shadow=True,
                        startangle=90,
                        fontsize=16);
Hexbin plot
Hexbin plots are useful to look at 2-D distributions. Let’s generate some new data for this plot.
In [53]: df.head()
Out[53]:
x y
0 1.795765 3.212868
1 -0.380623 -0.613050
2 3.183915 0.179148
3 3.315577 -0.039096
4 -2.252500 -0.930392
In [54]: df.plot();
This new data is a stack of two 2-D random sequences, the first one centered at (0, 0) and the second one
centered at (9, 9). Let’s see how the hexbin plot visualizes them.
The Hexbin plot is the 2-D extension of a histogram. It is built by covering the 2-D plane with a set of
hexagonal tiles, and then counting how many points end up in each tile. The color encodes the count. Since
we created this dataset with points sampled from 2 gaussian distributions, we expect to see tiles containing
more points near the centers of these two gaussians, which is what we observe above.
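A dataset and plot of this kind can be sketched as follows. The spreads of the two gaussians, the grid size and the name `hexdf` are assumptions chosen for illustration, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two 2-D gaussian clouds: one centered at (0, 0), the other at (9, 9)
cloud_a = rng.normal(0, 2, (1000, 2))
cloud_b = rng.normal(9, 3, (1000, 2))

# Stack them into a single table with columns x and y
hexdf = pd.DataFrame(np.vstack([cloud_a, cloud_b]), columns=['x', 'y'])

# Each hexagonal tile is colored by the number of points it contains
ax = hexdf.plot(kind='hexbin', x='x', y='y', gridsize=30, title='Hexbin plot')
```

A smaller gridsize gives larger tiles and a smoother but coarser picture, in the same way that fewer bins do for a histogram.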
We encourage you to have a look at this gallery to get some inspiration on visualizing your data.
Remember that the choice of visualization is strongly tied to the kind of data and the kind of question we
are asking.
Unstructured data
More often than not, data doesn’t come as a nice, well-formatted table. As we mentioned earlier, we could be
dealing with images, sound, text, movies, protein molecular structures, video games and many other types
of data.
The beauty of Deep Learning is that it can handle most of this data and learn optimal ways to represent it for
the task at hand.
Images
Let’s take images for example. We’ll use the PIL imaging library (which is referred to as Pillow for newer
versions).
Out[57]:
We can convert the image to a 3-D array using numpy. After all, an image can be seen as a table of pixels. For
each pixel, the values of red, green and blue are specified. So, our image is really a 3-dimensional table,
where rows and columns correspond to the pixel index and the depth corresponds to the color channel.
In [59]: imgarray.shape
The shape of the above array indicates (height, width, channels), because NumPy stores the pixel rows first.
While it’s quite easy to think of features when dealing with tabular data, it’s trickier when we deal with
images. We could imagine unrolling this image into a long list of numbers, walking along each of the 3
dimensions. If we did so, our dataset of images would again be a tabular dataset, with each row
corresponding to a particular image and each column corresponding to a specific pixel and color channel.
In [60]: imgarray.ravel().shape
Out[60]: (835200,)
However, not only did this procedure create 835200 features for our image, but by doing so we also lost most
of the useful information in the image. In other words, a single pixel in an image carries very little
information, while most of the information is contained in changes and correlations between nearby
pixels. Neural Networks can learn features from these through a technique called convolution, which we will
learn about later in this course.
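To see both points on a small scale, the sketch below builds a tiny made-up image (not the one loaded above), unrolls it, and then looks at differences between neighboring pixels. All names and sizes are assumptions for illustration.

```python
import numpy as np

# Hypothetical tiny image: 4 pixels tall, 6 pixels wide, 3 color channels
height, width, channels = 4, 6, 3
imgarray = np.zeros((height, width, channels), dtype=np.uint8)
imgarray[:, 3:, :] = 255  # left half black, right half white

# Unrolling gives one long feature vector per image
flat = imgarray.ravel()
print(imgarray.shape, flat.shape)

# Differences between horizontally adjacent pixels: non-zero only at the edge
gray = imgarray.mean(axis=2)
edges = np.diff(gray, axis=1)
print(edges)
```

Almost all entries of `edges` are zero. The only information in this image, the vertical edge, shows up in the change between two neighboring columns, not in any single pixel value.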
Sound
Now take sound. Digitally recorded sound is a long series of ordered numbers representing the sound wave.
Let’s load an example file.
This file is sampled at 44.1 kHz, which means 44100 samples per second. So, our short file (about 2.5
seconds long, since 110250 / 44100 = 2.5) contains over 100k samples:
In [65]: len(snd)
Out[65]: 110250
In [66]: snd
In [67]: plt.plot(snd)
plt.title('sms.wav as a Line Plot');
If each point in our dataset is a recorded sound, it is likely that each will have a different length. We could
still represent our data in tabular form by taking each consecutive sample as a feature and padding with
zeros the records that are shorter, but these extra zeros would carry no information (unless we had taken
great care to synchronize each file so that the sound started at the same sample number).
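The zero-padding idea can be sketched with a few made-up recordings of different lengths. The values and names below are hypothetical.

```python
import numpy as np

# Hypothetical recordings with different numbers of samples
sounds = [
    np.array([0.1, -0.2, 0.3]),
    np.array([0.5, 0.4]),
    np.array([0.0, 0.1, 0.2, 0.3, 0.4]),
]

# One row per recording, one column per sample, zeros where a recording is shorter
max_len = max(len(s) for s in sounds)
table = np.zeros((len(sounds), max_len))
for row, s in enumerate(sounds):
    table[row, :len(s)] = s

print(table)
```

The table is rectangular, but the trailing zeros in the shorter rows are just filler: they say nothing about the sound itself.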
Besides, sound information is carried in modulations of frequency, suggesting that the raw form may not be
the best to use. As we shall see, there are better ways to represent sound and to feed it to a Neural Network
for tasks like music recognition or speech-to-text.
Text data
Text documents pose similar challenges. If each datapoint is a document, we need to find a good
representation for it if we want to build a model that identifies it. We could use a dictionary of words and
count the relative frequencies of words, but with Neural Networks we can do better than this.
In general this is called the problem of representation, and Deep Learning is a great technique to tackle it!
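The dictionary-of-words representation mentioned above can be sketched in a few lines of plain Python. The helper name `relative_frequencies` and the sample sentence are our own, and real text would also need punctuation handling.

```python
from collections import Counter

def relative_frequencies(text):
    """Map each word to its share of the total word count in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

doc = "the cat sat on the mat"
freqs = relative_frequencies(doc)
print(freqs)
```

Note that this representation ignores word order entirely, which is one of the reasons we can hope to do better with Neural Networks.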
Feature Engineering
As we have seen, unstructured data does not look like tabular data. The traditional solution to connect the
two is feature engineering.
In feature engineering, an expert uses her domain knowledge to create features that correctly encapsulate
the relevant information from the unstructured data. Feature engineering is fundamental to the application
of Machine Learning, and it is both difficult and expensive.
For example, if we are training a Machine Learning model on a face recognition task from images, we could
use well tested existing methods to detect a face and measure the distance between key points like eyes,
mouth and nose. These distances would be the engineered features we would pass to the model being
trained.
Similarly, in the domain of speech recognition, features based on wavelets and Short Time Fourier
Transforms were the standard until not long ago.
Deep Learning disrupts feature engineering by learning the best features directly from the raw
unstructured data. This approach is not only very powerful but also much faster. This is a paradigm
shift: a more versatile technique takes over the role of the domain expert.
Exercises
Now it’s time to test what you’ve learned with a few exercises.
Exercise 1
Exercise 2
Exercise 3
• plot the histogram of the heights for males and for females on the same plot
• use alpha to control transparency in the plot command
• plot a vertical line at the mean of each population using plt.axvline()
• bonus: plot the cumulative distributions
Exercise 4
• plot the weights of the males and females using a box plot
Exercise 5
Since for the rest of the book we will be using terms like train_test_split or cross_validation, it makes
sense to introduce these first and then move on to explain Deep Learning.
This revolution has largely been possible thanks to the combination of 3 factors:
These same 3 factors are enabling the current Deep Learning and AI revolution. Deep Neural Networks
have been around for quite a while, but it wasn’t until relatively recently that we’ve had powerful enough
computers (and large enough datasets) to make good use of them. This has changed in the last few years,
and a lot of companies that used other Machine Learning techniques are now switching to Deep Learning.
Before we start studying Neural Networks, we need to make sure to have a shared understanding of
Machine Learning, so this chapter is a quick summary of its main concepts.
If you are already familiar with terms like Regression, Classification, Cross-Validation and Confusion
matrix, you may want to skim through this section quickly. However, make sure you understand cost
functions and parameter optimization as they are fundamental for everything that will follow!
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
While this course will primarily focus on supervised learning, it is important to understand the differences
between these types.
In Supervised Learning an algorithm learns from labeled data. For example, let’s say we are training an
image recognition algorithm to distinguish cats from dogs: each training datapoint will be the pair of an
image (training data) and a label, which specifies if the image is a cat or a dog. Similarly, if we are training a
translation engine, we will provide both input and output sentences, asking the algorithm to learn the
function that connects them.
Conversely, in Unsupervised Learning, data comes without labels, and the task is to find similar
datapoints in the dataset, in order to identify any underlying higher order structure. For example, in a
dataset containing the purchase preferences of ecommerce users, these users will likely form clusters with
similar purchase behavior in terms of amount spent, objects bought etc. We can think of these as different
“tribes” with different preferences. Once these tribes are identified, we can describe each data point (that is,
each user) in terms of the “tribe” it belongs to, gaining a deeper understanding of the data.
Finally, reinforcement learning is similar to Supervised Learning, but in this case the algorithm trains
an agent to act in an environment. The actions of the agent lead to outcomes that are attached to a score,
and the algorithm tries to maximize that score. Typical examples here are algorithms that learn to play
games, like Chess or Go. The main difference from Supervised Learning is that the algorithm does not
receive a label (score) for each action it takes. Instead, it needs to perform a sequence of
actions before it knows whether they led to a higher score.
In 2016 a program trained with reinforcement learning beat the world Go champion, marking a new
milestone in the race towards artificial intelligence.
Supervised Learning
Let’s dive into Supervised Learning by first reviewing some of its successful applications. Have you ever
noticed that email spam is practically non-existent any longer? This is thanks to Supervised
Learning.
In the early 2000s, mailboxes were plagued by tons of emails advertising pills, money making schemes and
other junk. The first step to get rid of these was to allow users to move spam emails into a
spam folder. This provided the training labels. With millions of users manually cataloguing spam, large email
providers like Google and Yahoo could quickly gather enough examples of what a spam mail looked like to
train a model that would predict the probability for a message to be spam.
This kind of model is called a binary classifier: a Machine Learning algorithm that learns to
distinguish between 2 classes, like true or false, spam or not spam, positive or negative sentiment, dead or
alive.
Binary classifiers trained with Supervised Learning are ubiquitous. Telecom companies use them to predict
if a user is about to churn and go to a competitor, so they know when and to whom to make an offer in
order to retain them.
Social media analytics companies use binary classifiers to judge the prevalent sentiment on their clients’
pages. If you are a celebrity, you receive millions of comments each time you post something on Facebook
or Twitter. How can you know if your followers were prevalently happy or angry at what you tweeted? A
sentiment analysis classifier can distinguish that for each single comment, and therefore give us the overall
reaction by aggregating over all comments.
Supervised learning is also used to predict continuous quantities, for example to forecast next month’s
retail sales or to predict how many cars there will be at a certain intersection in order to offer better
routes for car navigation. In this case the labels are not discrete like “true/false” or “black/blue/green”;
they take continuous values, like 68, 73, 71 if we’re trying to predict temperature.
Configuration File
As promised in the earlier chapter, from this chapter onwards we’ll bundle common packages and
configurations in a single config file that we load at the beginning of the chapter. Let’s go ahead and load it:
In [2]: print(msg)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option("display.max_rows", 13)
pd.set_option('display.max_columns', 11)
pd.set_option("display.latex.repr", True)
In [3]: exec(msg)
We have now loaded pyplot, pandas and numpy and we have set a few configuration parameters. Let’s also
load the configuration for matplotlib. For some reason this must be executed in a separate cell or it won’t
work:
Linear Regression
Let’s take a second look at the plot we drew in Exercise 2 of Section 2. As we know by now, it represents a
population of individuals. Each dot is a person, and the position of the dot on the chart is defined by two
coordinates: weight and height. Let’s plot it again:
In [5]: df = pd.read_csv('../data/weight-height.csv')
In [6]: df.head()
Out[6]:
plot_humans()
Can we tell if there is a pattern in how the dots are laid out on the graph or do they seem completely
randomly spread? Our visual cortex, a great pattern recognizer, tells us that there is a pattern: dots are
roughly spread around a straight diagonal line. This line seems to indicate the obvious: taller people are also
heavier on average.
Let’s sketch a line to represent this relation. We can plot this line “by hand”, without any learning, by
choosing the values of the two extremities. For example, let’s draw a segment that starts at the point (55, 75)
and ends at the point (78, 250). Note that plt.plot takes the list of x values first and the list of y values second.
In [8]: plot_humans()
# Here we plot the red line 'by hand' with fixed values
# We'll try to learn this line with an algorithm below
plt.plot([55, 78], [75, 250], color='red', linewidth=3);
Can we specify this relationship more precisely? Instead of guessing the position of the line, can we ask an
algorithm to find the best possible line to describe our data? The answer is yes! Let’s see how.
We are saying that weight (our target or label) is a linear function of height (our only feature).
Let’s assign variable names to our quantities. As we saw in the introduction, it is common to assign the letter
y to the labels (people’s weight in this case) and the letter X to the input features (only height in this case).
You may remember from high school math that a line in a 2D-space can be described by an equation
between X and y that involves only two parameters. One parameter controls the point where the line crosses
the vertical axis, the other controls the slope of the line. We can write the equation of a line in a 2D plane as:
ŷ = b + Xw (3.1)
where ŷ is pronounced y-hat. Let’s first convince ourselves that this indicates any possible line in the 2D
plane (with the exception of a perfectly vertical line).
If we choose b = 0 and w = 0, we obtain the equation ŷ = 0 for any value of X. This is the set of points that
form the horizontal line passing through zero.
If we start changing b, we will obtain ŷ = b, which is still a horizontal line, passing through the constant
point b. Finally, if we also change w the line will start to be inclined in some way.
So yes, any line in the 2D-plane, except a vertical line, will have its unique values for w and b.
To find a linear relation between X and y means to describe our labels y as a linear function of X plus some
small correction є:
y = b + Xw + є = ŷ + є (3.2)
It’s good to get used to distinguishing between the values of the output (y, our labels) and the values of
the predictions (ŷ).
In this chapter we are going to explain how an algorithm can find the perfect line to fit a dataset. Before
writing an algorithm it’s helpful to understand the dynamics of this line formula. So what we’re going to do
is draw a few plots where we change the values of b and w and see how they affect the position of the line in
the 2D plane. This will give us better insight when we try to automate this process.
Then let’s create an array of equally spaced x values between 55 and 80 (these are going to be the values of
height):
And let’s pass these values to the line function and calculate ŷ. Since both w and b are zero, we expect ŷ to
also be zero:
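The two steps just described can be sketched as follows. The function name `line` and the choice of 100 points are assumptions for this illustration (the output below happens to contain 100 values).

```python
import numpy as np

def line(x, w=0, b=0):
    """Equation of a line: slope w, intercept b."""
    return x * w + b

# Equally spaced heights between 55 and 80
x = np.linspace(55, 80, 100)

# With w = 0 and b = 0 every prediction is zero
yhat = line(x, w=0, b=0)
print(yhat[:5])
```

Calling the same function with other values, for example `line(x, w=2, b=50)`, gives the tilted lines discussed later in this section.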
Out[11]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In [12]: plot_humans()
So we’ve drawn a horizontal line as our model. This is not really a good model for our data! It would be a
good model if everyone in our population was floating in space and therefore measured 0 weight regardless
of their height. Fun, but not accurate for our chart! See how far the line is from our data.
If we let b vary, the horizontal line starts to move up or down, indicating constant weight b, regardless of the
value of x (the height).
In [13]: plot_humans()
This would be a good model only if we had a broken scale that always returned a fixed value, regardless of
who steps on it. Also not accurate.
Finally, if we vary w, the line starts to tilt, with w indicating the increment in weight corresponding to
an increment of 1 unit of height. For example, if w=1, that would imply that 1 pound is gained for each inch
of height.
In [14]: plot_humans()
So, to recap, we started from the intuitive observation that taller people are heavier and we decided to look
for a line function to predict the weight of a person as a function of the height.
Then we observed that any line in the 2D plane can be drawn by defining just two parameters, b and w, we
plotted a few such lines and compared them with our data. Now we need to find the values of such
parameters that correspond to the best line for our data.
Cost Function
In order to find the best possible linear model to describe our data, we need to define a criterion to evaluate
the “goodness” of a particular model.
In Supervised Learning we know the values of the labels. So we can compare the value predicted by the
hypothesis with the actual value of the label and calculate the error for each datapoint:
$$\epsilon_i = y_i - \hat{y}_i \qquad (3.3)$$
Remember that y_i is the actual value of the output while ŷ_i is our prediction. Also notice that we used a
subscript index i to indicate the i-th datapoint. Each of these differences is called a residual, and
together they are the residuals.
Note that in this definition a residual carries a sign: it will be positive if our hypothesis underestimates
the true weight and negative if it overestimates it.
Residuals
Since we don’t really care about the direction in which our hypothesis is wrong (we only care about the total
amount of being wrong), we can define the total error as the sum of the absolute values of the residuals:
$$\text{Total Error} = \sum_i |\epsilon_i| = \sum_i |y_i - \hat{y}_i| \qquad (3.4)$$
The total error is one possible example of what’s called a cost function. We have associated a well-defined
cost that can be calculated from our features and labels through the use of our hypothesis ŷ = h(x).
For reasons that will be apparent later, it is often preferable to use another cost function called Mean
Squared Error. This is defined as:
$$\mathrm{MSE} = \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2 \qquad (3.5)$$
Notice that since the square is never negative, the MSE is large when the residuals are large and small when
they are small, just like the total error. In that sense the two costs behave similarly. The Mean Squared
Error (or ‘mse’, for short) is preferred because it is smooth and, for a linear model, guaranteed to have a
global minimum, which is exactly what we are going to look for.
Now that we have both a hypothesis (linear model) and a cost function (mean squared error), we need to
find the combination of parameters b and w that minimizes such cost.
Remember, that cost is another way to say the ‘error amount’ of our prediction - we’re assigning a number to
how wrong our prediction is. We want to minimize this error (cost) because if our error was zero, that
means we predicted perfectly.
Let’s first define a helper function to calculate the MSE and then evaluate the cost for a few lines:
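A helper of this kind could look like the sketch below. The function names and the three made-up data points are assumptions; the formulas are the mean squared error and the sum of absolute residuals defined above.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(residuals ** 2)

def total_absolute_error(y_true, y_pred):
    """Sum of the absolute values of the residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.sum(np.abs(residuals))

def line(x, w=0, b=0):
    return x * w + b

# Hypothetical toy data: heights as a column (N, 1), weights as a flat vector (N,)
X_toy = np.array([[60.0], [68.0], [72.0]])
y_toy = np.array([150.0, 180.0, 200.0])

# ravel() flattens the (N, 1) predictions to shape (N,), matching y_toy.
# Without it, NumPy broadcasting would silently build an (N, N) matrix of differences.
cost_flat = mean_squared_error(y_toy, line(X_toy, w=0, b=0).ravel())
cost_tilted = mean_squared_error(y_toy, line(X_toy, w=2, b=50).ravel())
abs_tilted = total_absolute_error(y_toy, line(X_toy, w=2, b=50).ravel())
print(cost_flat, cost_tilted, abs_tilted)
```

On these toy points the tilted line has a much lower cost than the flat one, which is the same comparison we are about to make on the real dataset.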
Let’s also define inputs and outputs for our data. Our input is the height column. We will assign it to the
variable X:
In [16]: X = df[['Height']].values
X
Out[16]: array([[73.84701702],
[68.78190405],
[74.11010539],
...,
[63.86799221],
[69.03424313],
[61.94424588]])
In [17]: X.shape
Out[17]: (10000, 1)
This format will allow us to extend the linear regression to cases where we want to use more than one
column as input.
The outputs are a single array of values. What is the cost going to be for the horizontal line passing through
zero? We can calculate it as follows.
Out[19]: array([[0.],
[0.],
[0.],
...,
[0.],
[0.],
[0.]])
And then we calculate the cost, i.e. the mean squared error between these predictions and our true values:
Out[20]: 27093.83757456157
Notice that we flattened out the predictions so that they have the same shape as the output vector.
The cost is above 27,000. What does it mean? Is it bad? Is it good? It’s really hard to say because we don’t
have anything to compare it to. Different datasets will have very different numbers here depending on the
units of measure of the quantity we are predicting. So the value of the cost has very little meaning by itself.
What we need to do is compare this cost with that of another choice of b and w. Let’s increase w a little bit:
Out[21]: 1457.1224504786412
The total mse decreased from over 27000 to below 2000. This is good! It means our new hypothesis with
w=2 is less wrong than the one with w=0.
Out[22]: 708.9129575511095
Even better! As you can see we can keep changing b and w by small amounts and the value of the cost will
keep changing.
Of course, it’s going to take forever for us to find the best combination if we sit here and tweak numbers
until we find the best ones. A better way would be if we could write a program that would test all possible
values for us and then simply report to us the result.
Before we do that, let’s check a couple of other combinations of w and b. Let’s try to keep w fixed and vary
only b.
When w = 2, the cost as a function of b has a minimum value somewhere near 50.
The same would be true if we let w vary: there will be a value of w for which the cost is minimum. Since we
chose a cost function that is quadratic in b and w, there is a global minimum, corresponding to the
combination of parameters b and w that minimizes the mean squared error cost.
TIP: A quadratic function is a polynomial function in one or more variables in which the
highest-degree term is of the second degree. Because the MSE is an average of squares, it is a
quadratic that curves upwards (a bowl shape) in b and w. This is a very nice feature: it
guarantees that any minimum we find is the global one.
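To see the bowl shape in the simplest setting, the sketch below keeps w fixed and scans a grid of values for b. It uses made-up, noise-free data generated with w = 2 and b = 50 (an assumption chosen so that the answer is known in advance), not the weight-height dataset.

```python
import numpy as np

# Hypothetical noise-free data generated from a known line
X_toy = np.linspace(55, 80, 50)
y_toy = 2.0 * X_toy + 50.0

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Keep w fixed at 2 and evaluate the cost on a grid of b values
bs = np.linspace(0, 100, 101)
costs = np.array([mse(y_toy, 2.0 * X_toy + b) for b in bs])

best_b = bs[np.argmin(costs)]
print(best_b, costs.min())
```

Here the cost is simply (50 - b)^2, a parabola with a single minimum at b = 50. Scanning a grid like this works for one parameter, but it quickly becomes impractical when there are many parameters.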
Once our parameters w and b are set to the combination that minimizes the cost, we can say that the
model is trained over the training set.
Notice what just happened:
• We started with a hypothesis: height and weight are connected by a linear model that depends on
parameters.
• We defined a cost function: the mean squared error is calculated for each combination of b and w using
the training set features and labels.
• Finally, we minimized the cost: the model is trained when we have found the values of b and w that
minimize the cost over the training set.
Another way to say this is that we have turned the problem of training a Machine Learning model into a
minimization problem, where our cost defines a “landscape” made of valleys and peaks, and we are
looking for the global minimum.
This is great news, because there are plenty of techniques to look for the minimum value of a function.
TIP: We solved a Linear Regression problem using Gradient Descent. This was not really
necessary, since Linear Regression has an exact solution. We used this simple case to
introduce the Gradient Descent technique that we will use throughout the book to train
our Neural Networks.
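As a minimal sketch of the exact solution mentioned in the tip, the least-squares slope and offset of a one-feature linear regression can be computed directly from means and sums. The six data points below are made up for illustration.

```python
import numpy as np

# Made-up (height, weight) pairs, for illustration only
X = np.array([60.0, 62.0, 65.0, 68.0, 70.0, 73.0])
y = np.array([110.0, 128.0, 150.0, 172.0, 190.0, 212.0])

# Closed-form least squares for a single feature:
#   w = sum((X - mean(X)) * (y - mean(y))) / sum((X - mean(X))**2)
#   b = mean(y) - w * mean(X)
x_mean, y_mean = X.mean(), y.mean()
w = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
b = y_mean - w * x_mean

print("w = {:0.3f}, b = {:0.3f}".format(w, b))
```

No iteration is needed here, which is why Gradient Descent is not strictly necessary for this problem.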
Let’s see if we can use Keras to perform linear regression. We will start by importing a few elements to build
a model, as we did in chapter 1.
The model we need to build is super simple: it has one input and one output, connected by one parameter w
and one parameter b. Let’s do that! First, we initialize the model as a Sequential model. This is the
simplest way to define models in Keras, because we add layers one by one, starting from the input and
working our way towards the output.
Then we add a single element to our model: a linear operation with 1 input and 1 output, connected by the
two parameters w and b. In Keras this is done with the Dense class. In fact, from the documentation of
Dense we read:
    Dense implements the operation: output = activation(dot(input, kernel) + bias)
This is our linear model y = Xw + b, once we make the following substitutions:
output -> y
activation -> None
input -> X
kernel -> w
bias -> b
and notice that the dot product with a single input is just a multiplication. So Dense(1,
input_shape=(1,)) implements a linear function with 1 input and 1 output. Let’s add it to the model:
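To see what such a layer computes, here is a minimal NumPy stand-in for the Dense operation. It is an illustration only, not Keras code: dense_forward is a hypothetical helper, and the kernel and bias values are assumed.

```python
import numpy as np

def dense_forward(inputs, kernel, bias, activation=None):
    # output = activation(dot(input, kernel) + bias)
    z = np.dot(inputs, kernel) + bias
    return z if activation is None else activation(z)

X = np.array([[60.0], [70.0]])   # 2 samples, 1 feature
kernel = np.array([[7.7]])       # shape (1, 1): plays the role of w
bias = np.array([-350.0])        # shape (1,): plays the role of b

y = dense_forward(X, kernel, bias)
print(y.shape)
```

With one input and one output the kernel is a 1x1 matrix, so the dot product reduces to X * w.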
The .summary() method will tell us the number of parameters in our model:
In [27]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 1) 2
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________
As expected our model has 2 parameters: the bias or b and the kernel or weight or w. Let’s go ahead and
compile the model.
Compilation tells Keras what the cost function is (Mean Squared Error in this case) and what method we
choose to find the minimum value (the Adam method in this case).
TIP: If you have never seen an optimizer, don’t worry about it. We will explain it in detail
later in the book.
Now that we have compiled the model, let’s go ahead and train it on our data. To train a model we use the
method model.fit(X, y). This method requires the input data and the corresponding labels, and it trains the model
by minimizing the value of the cost function over the training data. As an additional parameter to the fit
method, we will pass the number of epochs. This is the number of times we want the training to loop over the
whole training dataset.
TIP: an Epoch in Deep Learning indicates one cycle over the training dataset. At the end of
an epoch the model has seen each pair of (input, output) once.
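The idea of an epoch can be sketched with a hand-written training loop. This is plain mini-batch gradient descent in NumPy, not the Adam optimizer that Keras uses here, and the data, learning rate and batch size are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data: y = 3x + 1 plus a little noise (assumed values)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y_true = 3.0 * X + 1.0 + rng.normal(0, 0.1, size=100)

w, b = 1.0, 0.0
learning_rate = 0.1
batch_size = 10
epochs = 40

for epoch in range(epochs):
    # One epoch = one full pass over the shuffled training set
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = (w * X[idx] + b) - y_true[idx]
        # Gradients of the mean squared error with respect to w and b
        w -= learning_rate * 2 * np.mean(error * X[idx])
        b -= learning_rate * 2 * np.mean(error)

print("w = {:0.2f}, b = {:0.2f}".format(w, b))
```

Each pass through the outer loop visits every (input, output) pair exactly once, which is what one epoch means.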
In this example we train the model for 40 epochs, which means we will cycle through the whole (X,
y_true) dataset 40 times.
Let’s see how well the model fits our data. We will store our predictions in a variable called y_pred and plot
them over the data:
In [31]: df.plot(kind='scatter',
x='Height',
y='Weight',
title='Weight and Height in adults')
plt.plot(X, y_pred, color='red');
The line is not perfectly where we would have liked it to be, but it seems to have captured the relationship
between Height and Weight quite well. We can inspect the parameters of the model to see what values our
training decided were optimal for b and w.
In [32]: W, B = model.get_weights()
In [33]: W
In [34]: B
Notice here that W is returned as a matrix, because in the general case of a Neural Network we could have
many more parameters. In this simple case of a linear regression our matrix has 1 row and 1 column and a
single number for the slope of our line, so let’s extract it.
In [35]: w = W[0, 0]
B is also a vector with just one entry, so we can extract that too:
In [36]: b = B[0]
The slope parameter w has a value near 7.7. This means that, for each 1-inch increase in height, people are on average 7.7
pounds heavier. The b parameter is roughly -350. This is called the offset or bias and corresponds to the weight
of an adult of zero height.
Since negative weight doesn’t actually make sense, we have to be careful about how we interpret this value.
Let’s find the minimum height that makes sense in this model. This will be the height that
produces a weight of zero, since negative weights are nonsense. Zero weight means y = 0, so now that we
have a model we can look for the value of X that corresponds to y = 0.
$$0 = Xw + b \tag{3.6}$$
$$X = \frac{-b}{w} \tag{3.7}$$
In [37]: -b/w
Out[37]: 46.033733
So this model only makes sense for people who are at least about 46 inches tall. If you are shorter than 46
inches, this model predicts you’d have a negative weight, which is obviously wrong.
Great! We have trained our first Supervised Learning model and have found the best combination of
parameters b and w. But is this really a good model? Can we trust it to predict the weight of new people that
were not part of the training data? In other words, will our model “generalize well” when offered new,
unknown data? Let’s see how we can answer that question.
R² coefficient of determination
First of all we need to define a sort of standard score, a number that will allow us to compare the goodness of
a model regardless of how many data points we used. We could compare losses, but the value of the loss is
ultimately arbitrary and dependent on the scale of the features, so we don’t want to use that. Instead, let’s use
the coefficient of determination R².
This coefficient can be defined for any model predicting continuous values (like regression) and it will give
some information about the goodness of fit. In the case of regression, the R² coefficient is a measure of how
well the regression model approximates the real data points. An R² of 1 indicates a regression line that
perfectly fits the data. If the line does not perfectly fit the data, the value of R² will decrease. A value of 0
means the model does no better than always predicting the mean of the labels, and a negative value means it does worse.
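As a minimal sketch of how this score is computed, R² compares the squared residuals of the model with those of a baseline that always predicts the mean. The helper name r2 and the toy values are ours, for illustration only.

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2(y_true, y_true)                         # predictions equal the labels
baseline = r2(y_true, np.full(4, y_true.mean()))     # always predict the mean
worse = r2(y_true, y_true[::-1])                     # predictions in reverse order

print(perfect, baseline, worse)
```

The three cases cover the range described above: a perfect fit, a model no better than the mean, and a model that is worse than the mean.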
We recall here Scikit-Learn, a Python package introduced in chapter 1 that contains many Machine Learning
algorithms and supporting functions, including the R² score. Let’s calculate it for the current model.
TIP: In the last command we used a Python format string to make numbers
more readable. In particular, we specified the format {:0.3f}. The brackets and the characters
within them (called format fields) are replaced with the objects passed into the
str.format() method. The integer after the : sets the minimum width of the field in
characters, 0 in this case. The 3 after the dot is the number of decimal digits, and f
stands for fixed-point notation of a floating point number.
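Here is a short sketch of the format field in action. The score value is made up, and the second line uses a minimum width of 8 to show what the integer after the colon does.

```python
score = 0.8473921   # made-up value, for illustration

text = "The R2 score is {:0.3f}".format(score)
padded = "[{:8.3f}]".format(score)   # pad the field to at least 8 characters

print(text)     # The R2 score is 0.847
print(padded)   # [   0.847]
```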
It’s not too far from 1, which means our regression is not too bad. It doesn’t answer the question about
generalization, though. How can we know if our model is going to generalize well?
Let’s go back to our dataset. What if, instead of using all of it to train our model, we held out a small fraction
of it, say a randomly sampled 20% of the points? We could train the model on the remaining 80% and use the 20% to
test how good the model actually is. This would be a good way to test if our model is overfitting.
Overfitting means that our model is just memorizing the answers instead of learning general rules about the
training examples. By withholding a test set, we can test our model on data never seen before. If it performs
just as well we can assume it will perform well on new data when deployed.
On the other hand, if our model has a good score on the training set but has a bad score on the test set, this
would mean it is not able to generalize to unseen examples, and therefore it’s not ready for deployment.
This is called a train/test split. It’s standard practice for Supervised Learning and there’s a convenient
Scikit-Learn function for it.
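To show what such a split does, here is a minimal NumPy sketch of a random 80/20 split. It is a stand-in for the Scikit-Learn function, not its implementation, and the data is synthetic (assumed values).

```python
import numpy as np

# Synthetic stand-in for the 10000-point dataset (assumed values)
rng = np.random.default_rng(0)
X = rng.uniform(55, 80, size=10000)
y = 7.7 * X - 350 + rng.normal(0, 10, size=10000)

test_size = 0.2
indices = rng.permutation(len(X))      # shuffle the row indices
n_test = int(len(X) * test_size)
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(len(X_train), len(X_test))
```

Shuffling before slicing matters: if the rows are sorted in any way, taking the last 20% without shuffling would give a test set that is not representative.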
In [42]: len(X_train)
Out[42]: 8000
In [43]: len(X_test)
Out[43]: 2000
Using train_test_split we split the data into two sets, the training set and the test set. Now we
can use each according to its name: we let the parameters of our models vary to minimize the cost over the
training set and then check the cost and the R² score over the test set. If things went well, these two should
be comparable, i.e. the model should perform well on new data. Let’s do that!
First, let’s train our model on the training data (notice the test data is not involved here):
Then let’s calculate predictions for both the train and test sets.
TIP: Note that unlike training, making predictions is a “read-only” operation and does not
change our model. We’re just making predictions.
Let’s calculate the mean squared error and the R² score for both. We will also import the
mean_squared_error function from Scikit-Learn, which does the same calculation as the function we
defined above, but is likely to be more robust.
r2 = r2_score(y_test, y_test_pred)
print("R2 score (Test set):\t{:0.3f}".format(r2))
It appears that both the loss and the R² score are comparable for the Train and Test set, which is great! If we
had obtained values that were significantly different, we would have had a problem. Generally speaking, the
model could perform a little worse on the test set because the test data hasn’t been seen before. If the performance on the
test set is significantly lower than on the training set, we are overfitting.
TIP: The test fraction does not need to be 20%. We could use 5%, 10%, 30%, 50% or
anything we like. Keep in mind that if we do not use enough data for testing, we may not
have a credible test of how well the model generalizes, while if we use too much testing
data, we make it harder for the model to learn because it is only exposed to a few examples.
Note that this is another reason to prefer an average cost (i.e. divided by the total number of sample points)
rather than a total cost. In this way, the cost will not depend on the size of the set used to calculate it and we
will therefore be able to compare costs obtained over sets of different sizes.
Congratulations! We have just encountered the three basic ingredients of a Neural Network: hypothesis
with parameters, cost function and optimization.
Classification
So far we have just learned about linear regression and how we can use it to predict a continuous target
variable. We have learned about formulating a hypothesis that depends on parameters and about optimizing
a cost to find the optimal values for such parameters.
We can apply the same framework to cases where the target variable is discrete and not continuous. All we
need to do is to adapt the hypothesis and the cost function.
Let’s see how that is done. Let’s imagine we are predicting whether a visitor on our website is going to buy a
product, based on how many seconds he/she spent on the product page. In this case, the outcome variable is
binary: the user either buys or doesn’t buy the product. How can we build a model with a binary outcome?
Let’s load some data and find out:
In [49]: df = pd.read_csv('../data/user_visit_duration.csv')
In [50]: df.head()
Out[50]:
Since the outcome variable can only assume a finite set of distinct values (only 0 and 1 in this case), this is a
classification problem, i.e. we are looking for a model that is capable of predicting to which class a data point
belongs.
TIP: There are many algorithms to solve a classification problem, including K-nearest
neighbors, decision trees, support vector machines and Naive Bayes classifiers. The
interested reader is referred to our other book on Machine Learning for an in-depth
explanation of each of them.
What happens if we use the same model we have just used to fit this data? Will the model refuse to work?
Will it converge? Will it give helpful predictions?
Let’s try it and see what happens. First we need to define our features and target variables.
Then we can use the exact same model we used before. We will simply re-initialize it by resetting the
parameter w to 1 and b to 0:
Then we fit the model on X and y for 200 epochs, suppressing the output with verbose=0:
As you can see, it doesn’t make much sense to use a straight line to predict an outcome
that can only be either 0 or 1. That said, the modification we need to apply to our model in order to make it
work is actually quite simple.
Logistic Regression
We will approach this problem with a method called Logistic Regression. Despite the name being
“regression”, this technique is actually useful to solve classification problems, i.e. problems where the
outcome is discrete.
The linear regression technique we have just learned predicts values on the real axis for each input data
point. Can we modify the form of the hypothesis so that we can predict the probability of an outcome? If
we can do that, for each input value our model would give us a value between 0 and 1. At that point we
could use p = 0.5 as our dividing criterion and assign every point predicted with probability less than 0.5 to
class 0, and every point predicted with probability more than 0.5 to class 1.
In other words, if we modify the regression hypothesis to allow for a nonlinear function between the domain
of our data and the interval [0, 1], we can use the same machinery to solve a classification problem.
There’s actually one additional point we will need to address, which is how to adapt the cost function. In
fact, since our labels are only the values 0 and 1 the Mean Squared Error is not the correct cost function to
use. We will see below how to define a cost that works in this case.
Let us first start by defining a nonlinear hypothesis. We need a nonlinear function that will map all of the
real axis into the interval [0, 1]. There are many such functions and we will see a few in the next chapters. A
simple, smooth and well-behaved function is the Sigmoid function:
$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{3.8}$$
The Sigmoid starts at values really close to 0 for large negative values of z. Then it gradually increases and near
z = 0 it smoothly transitions to values close to 1. Mathematically speaking, the sigmoid function is like a
smooth step function.
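As a quick check of this behavior, here is a minimal NumPy sketch of the sigmoid evaluated at a few values of z. The sample points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
values = sigmoid(z)

for zi, vi in zip(z, values):
    print("sigmoid({:5.1f}) = {:0.5f}".format(zi, vi))
```

The output is close to 0 on the far left, exactly 0.5 at z = 0, and close to 1 on the far right.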
Hypothesis
Using the sigmoid we can formulate the hypothesis for our classification problem as:
$$\text{Buy} = \frac{1}{1 + e^{-(\text{Time} \cdot w + b)}} \tag{3.9}$$
or
$$\hat{y} = \sigma(Xw + b) \tag{3.10}$$
We will encounter this function many times in this book. In fact it has been a very important function in
the early days of Neural Networks and it is still very important.
Notice that we have introduced two parameters, w and b, in our definition. The weight w controls the speed
of the transition between 0 and 1, while b shifts the position of the transition. Let’s plot a few
examples:
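import numpy as np
import matplotlib.pyplot as plt

# Assumed helpers: the cells that defined them are not shown here
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def line(x, w=0, b=0):
    return x * w + b

x = np.linspace(-10, 10, 100)  # assumed plotting range

plt.figure(figsize=(15, 5))

# Left panel: larger w gives a sharper transition
plt.subplot(121)
ws = [0.1, 0.3, 1, 3]
for w in ws:
    plt.plot(x, sigmoid(line(x, w=w)))
plt.legend(ws)
plt.title('Changing w')

# Right panel: b shifts the transition left or right
plt.subplot(122)
bs = [-5, 0, 5]
for b in bs:
    plt.plot(x, sigmoid(line(x, w=1, b=b)))
plt.legend(bs)
plt.title('Changing b');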
Cost function
Now that we have defined the hypothesis, we need to adapt the definition of the cost function so that it
makes sense for a binary classification problem. There are various options for this, similarly to the
regression case, including square loss, hinge loss and logistic loss.
As we shall see in chapter 5, Deep Learning models are trained by performing gradient descent
minimization of the cost function, which requires the cost function to be “minimizable” in the first place. In
mathematics we say that the cost function needs to be convex and differentiable.
One of the most commonly used cost functions in Deep Learning is the cross-entropy loss.
Let’s explore how it is calculated. We can define the cost for a single point as:
$$c_i = -y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i) \tag{3.11}$$
Notice that due to the binary nature of the outcome variable y, only one of the two terms is present at each
time. If the label $y_i$ is 0, then $c_i = -\log(1 - \hat{y}_i)$; if the label $y_i$ is 1, then $c_i = -\log(\hat{y}_i)$.
$$c_i = \begin{cases} -\log(\hat{y}_i) & \text{for } y_i = 1 \\ -\log(1 - \hat{y}_i) & \text{for } y_i = 0 \end{cases} \tag{3.12}$$
Let’s look at the first term first, which only contributes to the cost when $y_i = 1$. Remember that ŷ contains
the sigmoid function, so its negative logarithm is:
$$-\log(\hat{y}) = -\log(\sigma(z)) = \log(1 + e^{-z}) \tag{3.13}$$
What this means is that if z is large and positive, this quantity goes to zero, while if z is large and negative, it grows without bound:
In other words, when the label is 1 (y = 1), our predictions should also approach 1. Since our predictions are
obtained with the sigmoid, we want ŷ = σ(z) to approach 1 as well. This happens for very large values of z.
Therefore, the cost should be very small when z is large. On the other hand, if z is large and negative, the sigmoid goes to
zero and our prediction is wrong. That’s why the cost becomes increasingly large for negative values of z.
The same logic applies to the second term for when y = 0: it should push z to have negative values so that
the sigmoid goes to zero and our prediction is correct in this case.
Now that we have defined the cost for a single point, we can define the average cost as:
$$c = \frac{1}{N} \sum_i c_i \tag{3.14}$$
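The single-point cost and its average can be sketched in a few lines of plain Python. The function names and the sample probabilities are ours, for illustration only, and the predictions are assumed to lie strictly between 0 and 1.

```python
import math

def cross_entropy(y, y_hat):
    """Cost of one point with label y (0 or 1) and predicted probability y_hat."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def average_cost(labels, predictions):
    costs = [cross_entropy(y, y_hat) for y, y_hat in zip(labels, predictions)]
    return sum(costs) / len(costs)

confident_right = cross_entropy(1, 0.9)   # label 1, prediction close to 1
confident_wrong = cross_entropy(1, 0.1)   # label 1, prediction close to 0
c = average_cost([1, 0, 1], [0.9, 0.2, 0.6])

print("{:0.3f} {:0.3f} {:0.3f}".format(confident_right, confident_wrong, c))
```

A confident wrong prediction is penalized much more heavily than a confident right one, which is the behavior we want from the cost.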
Now that we have defined hypothesis and cost for the logistic regression case, we can go ahead and look for
the best parameters that minimize the cost, very much in the same way as we did for the linear regression
case.
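Before turning to Keras, the whole procedure can be sketched by hand with NumPy. This is plain full-batch gradient descent on the average cross-entropy, using the fact that its gradients are the mean of (ŷ - y) X for w and the mean of (ŷ - y) for b. The data is synthetic and the learning rate and number of steps are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: visitors who stay longer than about 2 minutes tend to buy
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=200)
y = (X + rng.normal(0, 0.5, size=200) > 2).astype(float)

w, b = 0.0, 0.0
learning_rate = 0.5

for step in range(5000):
    y_hat = sigmoid(w * X + b)
    # Gradients of the average cross-entropy cost
    w -= learning_rate * np.mean((y_hat - y) * X)
    b -= learning_rate * np.mean(y_hat - y)

y_hat = sigmoid(w * X + b)
accuracy = np.mean((y_hat > 0.5) == (y == 1))
print("w = {:0.2f}, b = {:0.2f}, accuracy = {:0.2f}".format(w, b, accuracy))
```

The predicted probability crosses 0.5 where Xw + b = 0, that is at X = -b/w, which is where the trained model places the boundary between the two classes.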
First let’s define a model in Keras. As we have seen above, Dense(1, input_shape=(1,)) implements a
linear function with one input and one output. The only change we need to perform is to add a sigmoid
function that takes the linear variable and maps it to the interval [0, 1]. In a way, it’s as if we were “wrapping”
the Dense layer with the sigmoid function.
Let’s first create a model like we did for the linear regression:
model = Sequential()
model.add(Dense(1, input_dim=1))
In [62]: model.add(Activation('sigmoid'))
In [63]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 1) 2
_________________________________________________________________
activation_1 (Activation) (None, 1) 0
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________
As you can see the model has two parameters, a weight and a bias, and it has a sigmoid activation function
as a second layer. We can convince ourselves that it’s a sigmoid by using the model to predict values for a
few z values:
Also notice that the weights in the model are initialized randomly, so your sigmoid may look different from
the one in the figure above.
TIP: Keras allows a more compact model specification by including the activation function
in the Dense layer definition. We can define the same model above by:
The next step is to compile the model like we did before to specify the cost function and the optimizer.
Keras offers several cost functions for classification. The cross-entropy for the binary classification case is
called binary_crossentropy so we will use this one now:
In [65]: model.compile(optimizer=SGD(lr=0.5),
loss='binary_crossentropy',
metrics=['accuracy'])
Accuracy
Notice that this time we also included an additional metric at compile time: accuracy. Accuracy is one of the
possible scores we can use to judge the quality of a classification model. It tells us what fraction of samples
are predicted in the correct class, so for example an accuracy of 80% or 0.8 means that 80 samples out of
100 are predicted correctly.
Epoch 1/25
100/100 [==============================] - 0s 1ms/step - loss: 0.6364 - acc: 0.5600
Epoch 2/25
100/100 [==============================] - 0s 75us/step - loss: 0.5843 - acc: 0.6400
Epoch 3/25
100/100 [==============================] - 0s 77us/step - loss: 0.5547 - acc: 0.7000
Epoch 4/25
100/100 [==============================] - 0s 76us/step - loss: 0.5450 - acc: 0.6900
Epoch 5/25
100/100 [==============================] - 0s 76us/step - loss: 0.5059 - acc: 0.8000
Epoch 6/25
100/100 [==============================] - 0s 74us/step - loss: 0.4946 - acc: 0.8100
Epoch 7/25
100/100 [==============================] - 0s 79us/step - loss: 0.4847 - acc: 0.7900
Epoch 8/25
100/100 [==============================] - 0s 74us/step - loss: 0.4809 - acc: 0.8100
Epoch 9/25
100/100 [==============================] - 0s 79us/step - loss: 0.4655 - acc: 0.8200
Epoch 10/25
100/100 [==============================] - 0s 77us/step - loss: 0.4522 - acc: 0.8300
Epoch 11/25
100/100 [==============================] - 0s 75us/step - loss: 0.4454 - acc: 0.8200
Epoch 12/25
100/100 [==============================] - 0s 73us/step - loss: 0.4475 - acc: 0.8000
Epoch 13/25
100/100 [==============================] - 0s 75us/step - loss: 0.4290 - acc: 0.8300
Epoch 14/25
100/100 [==============================] - 0s 75us/step - loss: 0.4240 - acc: 0.8300
Epoch 15/25
The model seems to have converged because the loss does not seem to improve in the last epochs. Let’s see
what the predictions look like:
temp = np.linspace(0, 4)
ax.plot(temp, model.predict(temp), color='orange')
plt.legend(['model', 'data']);
Great! The two parameters in our logistic regression have been tuned to best reproduce our data.
Notice that the logistic regression model predicts a probability. If we want to convert this to a binary
prediction we need to set a threshold. For example we could say that all points predicted to be 1 with p > 0.5
are set to 1 and the others are set to 0.
With this definition we can calculate the accuracy of our model as the number of correct predictions over
the total number of points. Scikit-Learn offers a ready-to-use function for this calculation called
accuracy_score:
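As a minimal sketch of what this calculation does, here are the thresholding step and the accuracy in plain NumPy rather than the Scikit-Learn call. The labels and probabilities are made up.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1])
y_prob = np.array([0.2, 0.6, 0.4, 0.7, 0.9])   # made-up predicted probabilities

# Points predicted with p > 0.5 go to class 1, the others to class 0
y_class_pred = (y_prob > 0.5).astype(int)

# Accuracy: fraction of predictions that match the true label
accuracy = np.mean(y_class_pred == y_true)

print(y_class_pred, accuracy)
```

Here three of the five points are classified correctly, so the accuracy is 0.6.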
Train/Test split
We can repeat the above steps using train/test split. Remember, we’re aiming for similar accuracies in the
train and test sets:
We need to reset the model, or it will retain the previous training. How do we do that? Our model only has 2
parameters, w and b, so we can just reset these two parameters to zero.
In [74]: params
In [76]: params
In [77]: model.set_weights(params)
And let’s check the accuracy score on training and test sets:
So, in this case the model is performing as well on the test set as on the training set. Good!
Overfitting
We are advancing quickly! This table recaps what we have learned so far:
Notice we have extended the models to datasets with multiple features using the vector notation:
In this case w is a weight vector of size M, where M is the number of features, while X is a matrix of size
NxM, where N is the number of records in our dataset.
We have also learned to split our data in two parts: a training set and a test set.
Now let’s talk about one thing to watch out for: overfitting!
Overfitting happens when our model learns the probability distribution of the training set too well and is
not able to generalize to the test set with the same performance. Think of this as learning things by heart
without really understanding them: in a new situation you will be lost and probably under-perform.
A very simple way to check for overfitting is to compare the cost and the performance metrics of the
training and test set. For example, let’s say we are performing a classification and we measure the fraction of
correct predictions, aka the accuracy, to be 99% for the training set and only 85% for the test set. This means
our model is not performing as well on the test set and we are therefore overfitting.
It is going to be very hard to overfit with a simple model with only one parameter, but as the number of
parameters increases, the likelihood of overfitting increases as well. We’ll need to watch out for our model
overfitting the dataset.
There are several actions we can take to minimize the risk of overfitting.
The first simple check is to make sure that our train/test split is performed correctly and both the train and
test sets are representative of the whole population of features and labels. Common errors include:
If the train/test split seems correct, it could be the case that our model has too much “freedom” and
therefore learns the training set by heart. This is usually the case if the number of parameters in the model is
comparable to or greater than the number of data points in the training set. In order to mitigate this we can
either reduce the complexity of the model, or use regularization, as we shall see later on in the book.
Cross Validation
Is train/test split the most efficient way to use our dataset? Even if we took great care in randomly splitting
our data, that’s only one of many possible ways to perform a split. What if we performed several different
train/test splits, checked the test score in each of them and finally averaged the scores? Not only would we
have a more precise estimate of the true accuracy, but we could also calculate the standard deviation of
the scores and therefore know the error on the accuracy itself.
This procedure is called cross-validation. There are many ways to perform cross-validation. The most
common is called K-fold cross-validation.
In K-fold cross validation the whole dataset is split into K equally sized random subsets. Then, each of the K
subsets gets to play the role of test set, while the others are aggregated back to form a training set. In this
way, we obtain K estimations of the model score, each calculated from a test set that does not overlap with
any of the other test sets.
Not only do we get a better estimate of the validation score, including its standard deviation, but we also
use each data point more efficiently, since each data point gets to play the role of both train and test.
These advantages do not come for free: we have to train the model K times, which takes longer and
consumes more resources than training it just one time. On the other hand, we can parallelize the training
over each fold, either by distributing them across processes or across different machines.
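The way K-fold builds its splits can be sketched with index arrays alone. This is an illustration of the idea, not the Scikit-Learn implementation, and k_fold_indices is a hypothetical helper name.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross validation."""
    rng = np.random.default_rng(seed)
    # Shuffle the indices, then cut them into k nearly equal folds
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=3))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx.tolist()), "test:", sorted(test_idx.tolist()))
```

Every data point appears in exactly one test fold, and the K test folds never overlap.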
Scikit-Learn offers cross-validation out of the box, but we’ll have to wrap our model in a way that can be
understood by Scikit-Learn. This is easy to do using a wrapper class called KerasClassifier.
    model.compile(optimizer=SGD(lr=0.5),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
We’ve just redefined the same model, but in a format that is compatible with Scikit-Learn. Let’s calculate the
cross validation score on a 3-fold (that is, K = 3) cross validation:
In [87]: scores
The cross validation produced 3 scores, 1 for each fold. We can average them and take their standard
deviation as a better estimation of our accuracy:
In [88]: m = scores.mean()
s = scores.std()
print("Cross Validation accuracy:",
"{:0.4f} ± {:0.4f}".format(m, s))
There are also other ways to perform a cross validation. Here we mention a few.
Stratified K-fold is similar to K-fold but it makes sure that the proportion of labels is preserved in the
folds. For example, if we are performing a binary classification and 40% of the data is labeled True and 60%
is labeled False, each of the folds will also contain 40% True labels and 60% False labels.
We can also perform cross validation by randomly selecting a test set of fixed size multiple times. In this
case, we do not need to make sure that the test sets are disjoint, so they may overlap in some points.
Finally it is worth mentioning Leave-One-Label-Out cross validation, or LOLO. LOLO is useful when our
data is organized in subgroups. For example, imagine we are building a model to recognize gestures from
phone accelerometer data. Our training dataset probably contains multiple recordings of different gestures
from different users. The labels we are trying to predict are the gestures.
With a simple cross validation, both our training and test sets would
contain recordings from all users. If we train the model in this way, we could very well end up with a good
test score, but we would have no idea about how the model would perform if a new user performed the
same gestures. In other words, the model could be overfitting over each user, and we would have no way of
knowing it.
In this case, it is better to split the data on the users, using the data from some of them as training, while
testing on the data from some other users. If the test score is good in this case, we can be fairly sure that the
model will perform well with a new user.
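Splitting on the users can be sketched as follows. The user names are made up, and holding out one user at a time is just one possible choice of grouping.

```python
import numpy as np

# One entry per recording: which (made-up) user produced it
users = np.array(["ann", "ann", "bob", "bob", "cat", "cat"])

splits = []
for held_out in np.unique(users):
    test_mask = users == held_out
    train_idx = np.where(~test_mask)[0]   # recordings from all the other users
    test_idx = np.where(test_mask)[0]     # recordings from the held-out user
    splits.append((train_idx, test_idx))
    print("test user:", held_out, "train:", train_idx, "test:", test_idx)
```

No user ever contributes recordings to both the training and the test set of the same split, so a good test score says something about users the model has never seen.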
Confusion Matrix
Is accuracy the best way to check the performance of our model? It surely tells us how well we are doing
overall, but it doesn’t give us any insight into the kinds of errors the model is making. Let’s see how we can do
better.
In the problem we just introduced, we are estimating the purchase probability from the time spent on a
page. This is a binary classification and we can be either right or wrong in the four ways represented here:
This table is called a confusion matrix and it gives a better view of what’s being predicted correctly and what’s
not.
Let’s look at the four cases one at a time. We could be right in predicting the purchase or right in predicting
the absence of a purchase. These are the True Positives and True Negatives. Summed together they amount
to the number of correct predictions we formulated. If we divide this number by the total number of data
points, we obtain the Accuracy of the model. In other words, the accuracy is the overall ratio of correct
predictions:
$$\text{Acc} = \frac{TP + TN}{\text{All}} \tag{3.16}$$
On the other hand, the model could be wrong in two ways:
1. It could predict buy when the person is actually not buying: this is a False Positive.
2. It could predict not buy when the person is actually buying: this is a False Negative.
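The four cases can be counted directly with NumPy. The labels and predictions below are made up, and the matrix is laid out with actual classes on the rows and predicted classes on the columns, which is also the convention Scikit-Learn uses.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # made-up labels
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # made-up predictions

tp = np.sum((y_true == 1) & (y_pred == 1))   # predicted buy, actually bought
tn = np.sum((y_true == 0) & (y_pred == 0))   # predicted not buy, did not buy
fp = np.sum((y_true == 0) & (y_pred == 1))   # predicted buy, did not buy
fn = np.sum((y_true == 1) & (y_pred == 0))   # predicted not buy, actually bought

# Rows: actual class (0, 1). Columns: predicted class (0, 1).
cm = np.array([[tn, fp],
               [fn, tp]])
accuracy = (tp + tn) / len(y_true)

print(cm)
print("Accuracy: {:0.3f}".format(accuracy))
```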
We define a short helper function to add column and row labels for nice display:
    df = pd.DataFrame(cm,
                      index=labels,
                      columns=pred_labels)
    return df
Out[91]:
Let’s stop here for a second. Let’s say that, if the model was predicting True, the user is offered to buy an
additional product at a discount. On which side would you rather the model be wrong? Would you like the
model to offer a discount to users with no intention of buying (False Positive) or would you rather it not
offer a discounted additional item to users who intend to buy (False Negative)?
What if, instead of predicting the purchase behavior from time spent on a page we were predicting the
likelihood to have cancer based on the value of a blood screening exam? Would you rather have a False
Positive or a False Negative in that case?
Most people would prefer a False Positive, and do an additional screening to make sure of the result, rather
than go home feeling safe and healthy while they are actually not. Would that be your choice too?
What if you were an (evil) health insurance company instead? Would you still choose to optimize the model
in the same way? A False Positive would be an additional cost to you, because the patient would go on to see
a specialist. Would you rather minimize False Positives in this case?
As you can see, there is no one correct answer. Different stakeholders will make opposite choices. This is to
say that the data scientist is not a neutral observer of a Machine Learning process. The choices he/she
makes fundamentally determine the outcome of the training!
False Positives and False Negatives are usually expressed in terms of two sister quantities: Precision and
Recall. Here they are:
Precision
We define precision as the ratio of True Positives to the total number of positive predictions:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{3.17}$$
Precision P will tend towards 1 when the number of False Positives goes to zero, i.e. when we do not raise
any false alerts and are thus “precise”. In that case, every positive prediction we make is correct.
Recall
On the other hand, recall is defined as the ratio of True Positives to the total number of actually positive
cases:
$$ Recall = \frac{TP}{TP + FN} \qquad (3.18) $$
Recall R will tend towards 1 when the number of False Negatives goes to zero, i.e. when we do not miss
any of the positive cases and thus “recall” all of them.
3.9. CONFUSION MATRIX 137
F1 Score
$$ F_1 = 2 \frac{P R}{P + R} \qquad (3.19) $$
F1 will be close to 1 if both precision and recall are close to 1, while it will go to zero if either of them is low.
In this sense the F1 score is a good way to make sure that both precision and recall are high.
The F1 score is the harmonic mean of precision and recall. The harmonic mean is an average suited to ratios.
There are also other F-scores, called F-beta scores, that weigh either precision or recall more heavily. You
can read about them on Wikipedia and in the Scikit-Learn documentation.
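To make the three definitions concrete, here is a minimal sketch in plain Python that computes them directly from confusion-matrix counts. The function name `precision_recall_f1` and the counts used are hypothetical and chosen only for illustration; they are not the counts behind the scores printed in this chapter.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Guard against empty denominators (no positive predictions or no positive cases)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 40 true positives, 10 false positives, 5 false negatives
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=5)
print("Precision:\t{:0.3f}".format(p))
print("Recall:\t\t{:0.3f}".format(r))
print("F1 Score:\t{:0.3f}".format(f1))
```

Note that True Negatives do not enter any of the three quantities, which is why they remain informative when the negative class is very large.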
f1 = f1_score(y, y_class_pred)
print("F1 Score:\t{:0.3f}".format(f1))
Precision: 0.811
Recall: 0.860
F1 Score: 0.835
Here, support means how many points were present in each class.
While these definitions hold true only for the binary classification case, we can still extend the confusion
matrix to the case where there are more than 2 classes.
In this case the element i, j of the matrix will tell us how many datapoints in class i have been predicted to be
in class j. This is a powerful way to see whether any of the classes are being confused with one another. If so,
we can isolate the data being misclassified and try to understand why.
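As a small illustration of the multiclass case, the sketch below builds the matrix by counting (actual, predicted) pairs. The helper name `confusion_matrix_counts` and the labels are made up for this example; Scikit-Learn's `confusion_matrix` function follows the same row/column convention.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Entry [i][j] counts the points of class i that were predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels for a 3-class problem
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2]

cm = confusion_matrix_counts(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
```

Correct predictions sit on the diagonal, so any large off-diagonal entry points to a pair of classes that the model tends to confuse.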
Feature Preprocessing
Categorical Features
Sometimes input data will be categorical, i.e. the feature values will be discrete classes instead of continuous
numbers. For example, in the weight/height dataset above, there’s a 3rd column called Gender which can
either be Male or Female. How can we convert this categorical data to numbers that can be consumed by
our model?
There are several ways to do it, the most common being One-Hot or Dummy encoding. In Dummy
encoding, we substitute the categorical column with a set of boolean columns, one for each category present
in the column. In the Male/Female example above, we would replace the Gender column with 2 columns
called Gender_Male and Gender_Female that would have binary values. Pandas offers a quick way to do
that:
In [96]: df = pd.read_csv('../data/weight-height.csv')
df.head()
3.10. FEATURE PREPROCESSING 139
Out[96]:
Out[97]:
Gender_Female Gender_Male
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
In this particular case, we only need one of the two columns, since we only have 2 classes, but if we had 3 or
more categories, then we would need to pass all the dummy columns to our model.
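To show what dummy encoding does under the hood, here is a minimal sketch in plain Python. The helper `one_hot` is hypothetical and written only for illustration; in practice the pandas `get_dummies` function performs this step on a DataFrame column.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# A made-up Gender column
genders = ["Male", "Male", "Female", "Male", "Female"]

categories, rows = one_hot(genders)
print(["Gender_" + category for category in categories])
for row in rows:
    print(row)
```

Each row contains exactly one 1, which is why, with only two categories, one of the two columns carries all the information.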
There are other ways to encode categorical information, including index encoding, the hashing trick and
embeddings. We will learn more about these later in the book.
Feature Transformations
As we will see in the exercises, Neural Network models are quite sensitive to the absolute size of the input
features. This means that passing in features with very large or very small values will not help our model
converge to a solution. An easy way to overcome this problem is to normalize the features to a number near
1.
We could change the unit of measurement. For example, in the Humans example we could divide the
height by 12 (go from inches to feet) and the weight by 100 (go from pounds to units of 100 pounds):
In [98]: df.head()
Out[98]:
In [100]: df.describe().round(2)
Out[100]:
As you can see our new features have values that are close to 1 in order of magnitude, which is good enough.
2) MinMax normalization
A second way to normalize features is to take the minimum value and the maximum value and rescale all
values to the interval [0, 1]. This can be done using the MinMaxScaler provided by sklearn like so:
from sklearn.preprocessing import MinMaxScaler

# Rescale each column so that its minimum maps to 0 and its maximum maps to 1
mms = MinMaxScaler()
df['Weight_mms'] = mms.fit_transform(df[['Weight']])
df['Height_mms'] = mms.fit_transform(df[['Height']])
df.describe().round(2)
Out[101]:
Our new features have a maximum value of 1 and a minimum value of 0, exactly as we wanted them.
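The arithmetic behind MinMax normalization is simple enough to sketch by hand. The function `min_max_scale` and the sample weights below are assumptions made for illustration, and the sketch assumes the values are not all identical (otherwise the denominator would be zero).

```python
def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(value - lo) / (hi - lo) for value in values]


# Made-up weights in pounds
weights = [64.7, 135.8, 161.2, 187.2, 270.0]

scaled = min_max_scale(weights)
print([round(value, 2) for value in scaled])
```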
3) Standard normalization
A third way to normalize large or small features is to subtract the mean and divide by the standard deviation.
from sklearn.preprocessing import StandardScaler

# Rescale each column to zero mean and unit standard deviation
ss = StandardScaler()
df['Weight_ss'] = ss.fit_transform(df[['Weight']])
df['Height_ss'] = ss.fit_transform(df[['Height']])
df.describe().round(2)
Out[102]:
         Height    Weight  Height (feet)  Weight (100 lbs)  Weight_mms  Height_mms  Weight_ss  Height_ss
count  10000.00  10000.00       10000.00          10000.00    10000.00    10000.00   10000.00   10000.00
mean      66.37    161.44           5.53              1.61        0.47        0.49       0.00       0.00
std        3.85     32.11           0.32              0.32        0.16        0.16       1.00       1.00
min       54.26     64.70           4.52              0.65        0.00        0.00      -3.01      -3.15
25%       63.51    135.82           5.29              1.36        0.35        0.37      -0.80      -0.74
50%       66.32    161.21           5.53              1.61        0.47        0.49      -0.01      -0.01
75%       69.17    187.17           5.76              1.87        0.60        0.60       0.80       0.73
max       79.00    269.99           6.58              2.70        1.00        1.00       3.38       3.28
After standard normalization, our new features have approximately zero mean and standard deviation of 1.
This is good in a linear model, because each feature is multiplied by a weight that the model has to find.
Since the weights are initialized to have values near 1, if the feature had a very large or very small scale, the
model would have to adjust the value of the weight enormously, just to account for the different scale. It is
therefore good practice to normalize our features before giving them to a Neural Network.
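Standard normalization can also be sketched in a few lines of plain Python. The helper `standardize` and the sample values are hypothetical; the sketch uses the population standard deviation (`statistics.pstdev`), which is also what StandardScaler uses.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(value - mean) / std for value in values]


# Made-up weights in pounds
weights = [64.7, 135.8, 161.2, 187.2, 270.0]

scaled = standardize(weights)
print("mean: {:0.2f}".format(abs(statistics.mean(scaled))))
print("std:  {:0.2f}".format(statistics.pstdev(scaled)))
```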
Note that we have just rescaled the units of our features, but their distribution is the same:
plt.tight_layout();
Alright! Time has come to apply what you’ve learned with some exercises.
Exercises
Exercise 1
You’ve just been hired at a real estate investment firm and they would like you to build a model for pricing
houses. You are given a dataset that contains data for house prices and a few features like number of
bedrooms, size in square feet and age of the house. Let’s see if you can build a model that is able to predict
the price. In this exercise we extend what we have learned about linear regression to a dataset with more
than one feature. Here are the steps to complete it:
– normalize the input features with one of the rescaling techniques mentioned above
– use a different value for the learning rate of your model
– use a different optimizer
• once you’re satisfied with training, check the $R^2$ on the test set
Exercise 2
Your boss was extremely happy with your work on the housing price prediction model and decided to
entrust you with a more challenging task. They’ve seen a lot of people leave the company recently and they
would like to understand why that’s happening. They have collected historical data on employees and they
would like you to build a model that is able to predict which employee will leave next. They would like a
model that is better than random guessing. They also prefer false negatives to false positives in this first
phase. Fields in the dataset include:
Your goal is to predict the binary outcome variable left using the rest of the data. Since the outcome is
binary, this is a classification problem. Here are some things you may want to try out:
• load the dataset at ../data/HR_comma_sep.csv, inspect it with .head(), .info() and .describe().
• establish a benchmark: what would be your accuracy score if you predicted that everyone stays?
• check if any feature needs rescaling. You may plot a histogram of the feature to decide which
rescaling method is more appropriate.
• convert the categorical features into binary dummy columns. You will then have to combine them
with the numerical features using pd.concat.
• do the usual train/test split with a 20% test size
• play around with learning rate and optimizer
• check the confusion matrix, precision and recall
• check if you still get the same results if you use a 5-Fold cross validation on all the data
• Is the model good enough for your boss?
As you will see in this exercise, this logistic regression model is not good enough to help your boss. In the
next chapter we will learn how to go beyond linear models.
While these techniques are very powerful, they also have some limitations.
For example, linear regression doesn’t work well when the relationship between features and output is not
linear, i.e. when it is not described by a straight line or a flat plane. Think, for instance, of the number of
active users of a web product or a social media platform. If the product is successful, the number of new users
added each month will grow, resulting in a nonlinear relationship between the number of users and time.
Similarly, logistic regression is incapable of separating classes that cannot be pulled apart by a flat boundary
(a line in 2D, a plane in 3D, a hyperplane if we have more than 3 features). This happens all the time, and you
may hear the term “not linearly separable” to describe two classes that cannot be separated by a flat
boundary. We saw an example of this in the first chapter when we tried to separate the blue dots from the
red crosses.
In general, the boundary between two classes is rarely linear, especially when dealing with interesting
classification problems with thousands of features. In order to extend regression and classification beyond
146 CHAPTER 4. DEEP LEARNING
the linear cases we need to use more complex models. Historically, computer scientists have invented many
techniques to extend beyond linear models including models such as Decision Trees, Support Vector
Machines, and Naive Bayes.
Deep Neural Networks bring together a unified framework to tackle all these cases: we can do linear and
nonlinear regression, classification, use them to generate new data, and much more!
In this chapter, we will introduce a notation for discussing Neural Networks and rewrite linear and logistic
regression using this notation. Finally, we work through stacking multiple nodes and create a deep network.
For the visual learners out there, this section is really helpful to “chalk-board” the algorithms we’re building.
Linear regression
Let’s look at linear regression. We have introduced linear regression in Chapter 3. As you may remember, it
refers to problems where we try to predict a number from a set of input features. Examples are: predicting
the price of a house, predicting the number of clicks a page will get or predicting the revenue a business will
generate in the future.
As usual, we will refer to the inputs in the problem using the variable x and to the outputs using the variable
y. So, for example, if we are trying to predict the price of a house from its size, x will be the size of the house
and y will be the price. The equation of linear regression is:
y = x.w + b (4.1)
and we can represent its operation as an artificial Neural Network like this:
This network has only 1 node, the output node, represented by the circle in the diagram. This node is
connected to the input feature x by a weight w. A second edge enters the node carrying the value of the
parameter b, which we will call bias.
Fantastic! We have a simple way to represent linear operations in a graph. Let’s extend the graph to multiple
input features. We encountered an example of multivariate regression problem in Exercise 1 of Chapter 3,
where we built a model to predict the price of a house as a function of 3 inputs: the size in square feet (x1 ),
the number of bedrooms (x2 ) and the age (x3 ) of the house. In that case we had 3 input features and the
model had 3 weights (w1 , w2 and w3 ) and 1 bias (b). We can extend our graph notation very simply to
accommodate for this case:
The output node here is connected to the N inputs through N weights and it is also connected to a bias
parameter. The equation is the same as before:
y = X.w + b (4.2)
but now X and w are arrays that contain more than one entry, multiplied using a dot product. So, what the
above equation really means is:
$$ y = x_1 w_1 + x_2 w_2 + \dots + x_N w_N + b $$
This is great! We can now visually represent linear regression with as many inputs as we like.
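The dot product above can be spelled out in a few lines of plain Python. The function name `predict_price` and every number below (features, weights and bias) are made up for illustration; a trained model would learn its own values.

```python
def predict_price(x, w, b):
    """Linear regression node: weighted sum of the inputs plus a bias."""
    return sum(x_j * w_j for x_j, w_j in zip(x, w)) + b


# Hypothetical house: size in square feet, number of bedrooms, age in years
x = [1500.0, 3.0, 20.0]
# Hypothetical weights (one per feature) and bias
w = [100.0, 5000.0, -500.0]
b = 20000.0

# 1500*100 + 3*5000 + 20*(-500) + 20000 = 175000
print(predict_price(x, w, b))
```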
Logistic regression
Linear regression gives us a linear relationship between the inputs and outputs, but what if we want a
binary answer instead of a continuous one? For instance, what if we want a yes/no answer? For
example, given a list of passengers on the Titanic, can we predict if a specific person would survive or not?
4.2. NEURAL NETWORK DIAGRAMS 149
Can you think of a way to change our equation so that we can allow for binary output?
This is where we use logistic regression. Just before we output the value, we’ll apply the sigmoid function, so
that the output can be read as a binary answer instead of a continuous one. As you may remember from
Chapter 3, the sigmoid function maps all real values to the interval [0, 1]. We can use the sigmoid to map the
output of the node (so far linear) into the interval [0, 1]. We will interpret the result as the probability of a
binary outcome.
TIP: if you need a refresher about the sigmoid you can check Chapter 3 as well as this nice
article on Wikipedia.
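As a sketch of how the sigmoid turns the linear node into a logistic regression node, consider the plain Python code below. The names `sigmoid` and `logistic_node`, as well as the weights and inputs, are hypothetical. The sigmoid is written in a form that avoids overflow for very negative inputs.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an equivalent form that cannot overflow
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_node(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    z = sum(x_j * w_j for x_j, w_j in zip(x, w)) + b
    return sigmoid(z)


# Made-up weights, bias and inputs
w, b = [2.0, -1.0], 0.5
for x in ([3.0, 1.0], [0.0, 0.5], [-3.0, 2.0]):
    print(x, round(logistic_node(x, w, b), 3))
```

A weighted sum of exactly zero gives a probability of 0.5, which is why the decision boundary of logistic regression sits where the linear part changes sign.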
Perceptron
Adding a sigmoid function is just a special case of adding what is called an activation function. Activation
Function is just a fancy name we give to the function that sits at the output of a node in a Neural Network.
There are many different types of activation functions, and we will encounter them later in this chapter. For
now, just know that they are important. For example, the first Neural Network ever invented can be
described by a diagram similar to that of the Logistic Regression, with just a different activation function.
This network is called the Perceptron.
The Perceptron is also a binary classifier, but instead of using a smooth sigmoid activation function, it uses
the step function:
$$ y = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4.4) $$
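A Perceptron with the step activation can be sketched in plain Python as follows. The weights and bias are hand-picked (not learned) so that the node behaves like a logical AND of two binary inputs; they are an assumption made purely for illustration.

```python
def step(z):
    """Step activation: 1 if the weighted sum is positive, 0 otherwise."""
    return 1 if z > 0 else 0


def perceptron(x, w, b):
    """Weighted sum of the inputs plus bias, followed by the step function."""
    z = sum(x_j * w_j for x_j, w_j in zip(x, w)) + b
    return step(z)


# Hand-picked parameters: the sum only exceeds 0 when both inputs are 1
w, b = [1.0, 1.0], -1.5
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(x, w, b) for x in inputs]
for x, y in zip(inputs, outputs):
    print(x, y)
```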
We could even simplify our diagram notation without losing information by including the bias and the
activation symbols in the node itself, like this:
Perceptron
Before we move on, let’s review each element in the diagram with an example. Let’s say our goal is to build a
model that predicts if a banknote is fake or real based on some of its properties (we’ll actually do this later in
the book).
Our inputs are the properties of the banknote we plan to use. These could be for example: length, height,
thickness, transparency, and so on. These input properties of the banknotes are also called features.
Our output is the prediction value, True or False, one or zero, that we hope our model will give us to tell us if
the note is real or not.
The architecture of our network is represented by the graph connecting input to output. In the simple graph
above our network consists of a single node performing a weighted sum of the input features.
Weights and biases are the parameters of the model. These parameters are the things we have control over
(in the beginning). These are what the machine learns in our Machine Learning algorithm. They are the
knobs that can be turned to change the model predictions.
During training, the network will attempt to find the best values for weights and biases, but the inputs
x1 , ..., x n , the outputs and the network architecture, are given and cannot be changed by the model (or us, for
that matter).
Now that we have established a symbolic notation that allows us to describe both linear regression and
logistic regression in a very compact and visual way, let’s see how we can expand the networks.
Deeper Networks
The simple networks above take multiple inputs and calculate their output as a weighted sum of the
inputs plus a couple of other things: a bias (usually just a small number, which lets the output be nonzero
even when all the inputs are zero) and, for the classification models, a nonlinear activation function.
The weighted sum of the input plus the bias is sometimes also called a linear combination of the input
features because it only involves sums and multiplications by parameters (no strange functions like
exponentials, cosines etc.).
Let’s see what happens when several Perceptrons are used in the same model.
We start by taking many Perceptrons, each connected by different weights to the same input nodes. Let’s
then calculate the output for each of the nodes, obtaining a bunch of different predictions, one for each of
the Perceptrons. This is called a fully connected layer, sometimes called a dense layer.
We can think of a dense layer as nothing more than many identical nodes connected to the same inputs
through independent weights, all operating in parallel.
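In code, a dense layer is just a loop over nodes that all read the same inputs. The sketch below is a plain Python illustration; the helper name `dense_layer`, the weights and the biases are made up. The weight `W[j][k]` connects input `j` to node `k`.

```python
def step(z):
    """Step activation used by the Perceptron."""
    return 1 if z > 0 else 0


def dense_layer(x, W, b, activation):
    """Apply every node of the layer to the same input vector x."""
    outputs = []
    for k in range(len(b)):
        # Each node has its own column of weights and its own bias
        z_k = sum(x[j] * W[j][k] for j in range(len(x))) + b[k]
        outputs.append(activation(z_k))
    return outputs


# Two inputs feeding three Perceptrons (made-up parameters)
x = [1.0, 2.0]
W = [[1.0, -1.0, 0.5],
     [0.5, 1.0, -1.0]]
b = [0.0, -0.5, 1.0]

# Weighted sums are 2.0, 0.5 and -0.5, so the step outputs are 1, 1 and 0
print(dense_layer(x, W, b, step))
```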
Nothing prevents us from using the output values of the dense layer as features (or inputs) for even more
Perceptrons. In other words, we can create a deeper fully connected Neural Network by stacking fully
connected layers on top of each other.
These fully connected layers are the root of Deep Learning and are used all the time.
Perceptrons with the same inputs are organized in layers, where a layer is just a group of Perceptrons that
receive the same inputs. As we will see later, creating a fully connected network in Keras is very easy: it’s
just a matter of adding more layers.
We can think of a Neural Network as a function (F), that takes an input value from the feature space and
outputs a value in the target space. This calculation, called Forward Pass is a composition of linear and
nonlinear steps.
For the math inclined reader, let’s look at how we can write the operations performed by a node in the first
layer. Each node in the first layer performs a linear transformation of the input features. Mathematically
speaking, it performs a weighted sum of the features and then adds a bias.
If we use the index k to enumerate the nodes in the first layer, we can write the weighted sum $z^{(1)}_k$ calculated
by that node as:
$$ z^{(1)}_k = \sum_j x_j w^{(1)}_{jk} + b^{(1)}_k \qquad (4.5) $$
where we have used the superscript (1) to indicate that the weights belong to the first layer, and the
subscript jk to indicate the weight multiplying the input feature at position j for the node at position k.
In the previous example of the price prediction of a house, the index j runs over the features, so for
example j = 1 locates the first feature, i.e. the size of the house in square feet ($x_1$).
If we consider all the input features as a vector $X = [x_1, x_2, x_3, \dots]$ and all the output sums of the first layer as
a vector $Z^{(1)} = [z^{(1)}_1, z^{(1)}_2, z^{(1)}_3, \dots]$, the above weighted sum can be written as a matrix multiplication of the
weight matrix $W^{(1)}$ with the input features:
TIP: if you are not familiar with vectors, matrices, and linear algebra you can keep going
and ignore this mathematical part. There is a more in-depth discussion of these concepts in
the next chapter. That said, linear algebra is a fundamental component of how Machine
Learning and Deep Learning work. So if you are completely foreign to these notions, you
may find it valuable to take a class or two on Youtube about vectors, matrices and their
operations.
$$ Z^{(1)} = X \cdot W^{(1)} + B^{(1)} = \sum_j x_j w^{(1)}_{jk} + b^{(1)}_k \qquad (4.6) $$
where the weights are arranged in a matrix W (1) whose rows run along the input features and whose
columns run along the nodes in the layer.
The nonlinear activation function will be applied to the weighted sum to yield the activation at the output.
For example, in the case of the Perceptron, we will apply the step function like this:
The activation vector $A^{(1)}$ has one element for each node in the first layer, and it becomes the input vector
to the second layer in the network. The second layer will take the output of the first and perform the exact
same calculation:
yielding a new activation vector A(2) with as many elements as the number of nodes in the second layer.
This is true for any of the layers: a layer takes the output of the previous layer and performs a linear
combination, followed by a nonlinear function. The nonlinear activation function is the most important
part of the transformation. If that were not present, a deep network would produce the same result as a
shallow network, and it wouldn’t be powerful at all.
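The claim that a deep network without nonlinear activations collapses to a shallow one can be checked numerically. The sketch below is plain Python with made-up weights: two linear layers applied in sequence give the same output as a single linear layer whose weights are the matrix product of the two, while inserting a ReLU between the layers breaks the equivalence.

```python
def linear_layer(x, W, b):
    """Linear part of a dense layer: x . W + b, with W[j][k] from input j to node k."""
    return [sum(x[j] * W[j][k] for j in range(len(x))) + b[k]
            for k in range(len(b))]


def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(A[i][m] * B[m][k] for m in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]


# Made-up parameters: 2 inputs -> 2 nodes -> 1 node
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, 0.0]
W2, b2 = [[3.0], [1.0]], [-2.0]
x = [2.0, 5.0]

# Two stacked linear layers, no activation in between
deep = linear_layer(linear_layer(x, W1, b1), W2, b2)

# Equivalent single layer: W = W1 . W2 and b = b1 . W2 + b2
W = matmul(W1, W2)
b = [v + b2_k for v, b2_k in zip(linear_layer(b1, W2, [0.0]), b2)]
shallow = linear_layer(x, W, b)

# Same stack, but with a ReLU between the two layers
hidden = [max(0.0, z) for z in linear_layer(x, W1, b1)]
nonlinear = linear_layer(hidden, W2, b2)

print(deep, shallow, nonlinear)
```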
Activation functions
We’ve looked at two nonlinear activation functions already:
These functions are applied to the output weighted sum calculated by a layer before we pass the values onto
the next layer or to output. They are the key element of Neural Networks. Activation functions are what
make Neural Networks so versatile and powerful! Besides sigmoid and step functions there are other
powerful options. Let’s look at a few more. First let’s load our common files:
Sigmoid and Step functions are easy to define using numpy (using their mathematical formulas):
def step(x):
    return x > 0
They both map the real axis onto the interval between 0 and 1 ([0, 1]), i.e. they are bounded:
They squash a large positive weighted sum to a value near 1 and a large negative weighted sum to a value near 0.
It’s as if each node was performing an independent classification of the input features and feeding the
binary outcome to the next layer.
Besides the sigmoid and step, other nonlinear activation functions are possible and will be used in this
book. Let’s look at a few of them:
Tanh
The hyperbolic tangent has a very similar shape to the sigmoid, but it is bounded and smoothly varying
between [−1, +1] instead of [0, 1], and is defined as:
$$ y = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (4.9) $$
The advantage of this is that negative values of the weighted sum are not forgotten by setting them to zero,
but are mapped to negative outputs. In practice tanh often makes the network learn faster than sigmoid or
step.
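The formula for tanh can be written out directly and compared with the version in Python's standard `math` module. This is only a sketch: the hand-written function `tanh_from_formula` is hypothetical, and because it calls `math.exp` directly it would overflow for inputs of very large magnitude, which a library implementation handles for us.

```python
import math


def tanh_from_formula(x):
    """Hyperbolic tangent computed from its definition with exponentials."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(x, round(tanh_from_formula(x), 4), round(math.tanh(x), 4))
```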
We can write the tanh function simply in Python as well, but we don’t have to. An efficient version of the
tanh function is available through numpy:
ReLU
$$ y = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4.10) $$
or simply:
$$ y = \max(0, x) \qquad (4.11) $$
Originally motivated from biology, it has been shown to be very effective and it is probably the most
4.3. ACTIVATION FUNCTIONS 157
popular activation function for deep Neural Networks. It offers two advantages.
1. If it’s implemented as an if statement (the former of the two formulations above), its calculation is
very fast, much faster than smooth functions like sigmoid and tanh.
2. Not being bounded on the positive axis, it can distinguish between two large values of the input sum,
which helps back-propagation converge faster.
Softplus
$$ y = \log(1 + e^{x}) \qquad (4.12) $$
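Both ReLU and softplus fit in a few lines of plain Python, sketched below. The function names are chosen for this example. Softplus is written in the equivalent form max(x, 0) + log(1 + e^(-|x|)), an implementation choice made here to avoid overflow for large inputs; it behaves like a smooth version of ReLU.

```python
import math


def relu(x):
    """Rectified linear unit: pass positive values through, clip negatives to zero."""
    return max(0.0, x)


def softplus(x):
    """log(1 + e^x), rewritten so that the exponential never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(x, relu(x), round(softplus(x), 4))
```

Far from zero the two functions nearly coincide, while around zero softplus bends smoothly where ReLU has a kink.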
SeLU
Finally, the SeLU activation function is a very recent development (see paper published in June 2017). The
name stands for scaled exponential linear unit and it’s implemented as:
$$ y = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha (e^{x} - 1) & \text{otherwise} \end{cases} \qquad (4.13) $$
On the positive axis it behaves like the rectified linear unit (ReLU), scaled by a factor λ. On the negative axis
it smoothly goes down to a negative value. This activation function, combined with a new regularization
technique called Alpha Dropout, offers better convergence properties than ReLU!
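The SeLU formula can be sketched in plain Python as follows. The constant names are chosen for this example, and the numerical values of λ and α are the ones derived in the paper that introduced SeLU; they are not stated in the text above, so treat them as an assumption of this sketch.

```python
import math

# Constants from the paper that introduced SeLU (assumed here, not given in the text)
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def selu(x):
    """Scaled exponential linear unit."""
    if x > 0:
        return SELU_LAMBDA * x
    return SELU_LAMBDA * SELU_ALPHA * (math.exp(x) - 1.0)


for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, round(selu(x), 4))
```

For very negative inputs the output levels off at the value -λα, which is the negative plateau described above.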
When creating a deep network, we will use one of these activation functions between one layer and the next,
in order to make the Neural Network nonlinear. These functions are the secret power of Neural Networks:
with nonlinearities at each layer they are able to approximate very complex functions.
Binary classification
Let’s work through classifying a binary dataset using a Neural Network. We’ll need a dataset to work with to
train our Neural Network. Let’s create an example dataset with two classes that are not separable with a
straight boundary, and let’s separate them with a fully connected Neural Network. First we import the
make_moons function from Scikit Learn:
And then we use it to generate a synthetic dataset with 1000 points and 2 classes:
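In [13]: from sklearn.datasets import make_moons

         # 1000 points arranged in two interleaving half circles, with a little noise
         X, y = make_moons(n_samples=1000,
                           noise=0.1,
                           random_state=0)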
In [15]: X.shape
Out[15]: (1000, 2)
To build our Neural Network, let’s import a few libraries from the Keras package:
Logistic Regression
Let’s first verify that a shallow model cannot separate the two classes. This is more for educational purposes
than anything else. We are going to build a model that we know is wrong, since it can only draw straight
boundaries. This model will not be able to separate our data correctly but we will then be able to extend it
and see the power of Neural Networks.
Let’s start by building a Logistic Regression model like we did in the previous chapter. We will create it using
the Sequential API, which is the simplest way to build models in Keras. We add a single Dense layer with 2
inputs, a single node and a sigmoid activation function:
Now we’ll add a single Dense layer with 2 inputs and we’ll use the sigmoid activation function here.
The arguments of the Dense layer definition map really well to our graph notation above.
Then we compile the model assigning the optimizer, the loss and any additional metric we would like to
include (like the accuracy in this case):
In [21]: model.compile(Adam(lr=0.05),
                       'binary_crossentropy',
                       metrics=['accuracy'])
• Adam(lr=0.05) is the optimizer, i.e. the algorithm that performs the actual learning. There are
many different optimizers, and we will explore them in detail in the next chapter. For now, know that
Adam is a very good one.
• binary_crossentropy is the loss or cost function. We have described it in detail in Chapter 3. For
binary classification problems, where we have a single output with a sigmoid activation, we need to
use the binary_crossentropy function. For multiclass classification, where we have multiple classes
with a softmax activation, we need to use categorical_crossentropy, as we’ll see below.
• metrics is just a list of additional metrics we’d like to calculate, in this case we add the accuracy of
our classification, i.e. the fraction of correct predictions as seen in Chapter 3.
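To give a feel for what the binary cross-entropy loss measures, here is a minimal plain Python sketch. The function below is written for illustration and is not the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Keep probabilities away from exactly 0 or 1 so that log() stays finite
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0]
print(binary_crossentropy(labels, [0.9, 0.1]))  # confident and correct: small loss
print(binary_crossentropy(labels, [0.5, 0.5]))  # undecided: log(2)
print(binary_crossentropy(labels, [0.1, 0.9]))  # confident and wrong: large loss
```

The loss punishes confident mistakes much more than undecided predictions, which is the behaviour we want the optimizer to push against.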
As we have seen in the previous chapter, we can now train the compiled model using our training data. The
model.fit(X, y) method does just that: it uses the training inputs X_train to generate predictions. It
then compares the predictions with the actual labels y_train through the cost function and
finally adapts the parameters to minimize that cost.
We will train our model for 200 epochs, which means our model will get to see the full training data
200 times. We also set verbose=0 to suppress printing during the training. Feel free to change it to
verbose=1 or verbose=2 if you want to monitor training as it progresses.
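The idea of an epoch can be illustrated with a hand-written training loop for logistic regression. This is a deliberately simplified sketch: it uses plain full-batch gradient descent instead of Adam, a tiny made-up dataset, and hypothetical helper names (`train`, `mean_loss`). It is meant to show the predict, compare, update cycle, not to reproduce what Keras does internally.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(x, w, b):
    return sigmoid(sum(x_j * w_j for x_j, w_j in zip(x, w)) + b)


def mean_loss(X, y, w, b):
    """Binary cross-entropy averaged over the dataset."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = min(max(predict(x_i, w, b), 1e-7), 1.0 - 1e-7)
        total -= y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total / len(X)


def train(X, y, epochs=200, lr=0.5):
    w, b = [0.0] * len(X[0]), 0.0
    history = []
    for _ in range(epochs):
        # One epoch: go through all the training data once
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x_i, y_i in zip(X, y):
            error = predict(x_i, w, b) - y_i
            grad_w = [g + error * x_j for g, x_j in zip(grad_w, x_i)]
            grad_b += error
        # Move the parameters a small step against the average gradient
        w = [w_j - lr * g / len(X) for w_j, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
        history.append(mean_loss(X, y, w, b))
    return w, b, history


# Tiny made-up dataset with two features and binary labels
X_train = [[0.0, 0.2], [0.3, 0.1], [0.2, 0.4], [0.8, 0.9], [0.9, 0.7], [1.0, 1.0]]
y_train = [0, 0, 0, 1, 1, 1]

w, b, history = train(X_train, y_train)
print("loss after first epoch: {:0.3f}".format(history[0]))
print("loss after last epoch:  {:0.3f}".format(history[-1]))
```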
Now that we have trained our model, we can evaluate its performance on the test data using the
.evaluate method. This takes the input features of the test data X_test and the labels of the test data y_test
and calculates the average loss and any other metric added during model.compile. In the present case
.evaluate will return 2 numbers, the loss (cost) and the accuracy:
We can print out the accuracy by retrieving the second element of the results:
The accuracy is better than random guessing, but it’s not 100%. Let’s see the boundary identified by the
logistic regression by plotting the boundary as a line:
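def plot_decision_boundary(model, X, y):
    # The grid bounds and resolution below are assumed; they are missing from this listing
    amin, bmin = X.min(axis=0) - 0.1
    amax, bmax = X.max(axis=0) + 0.1
    aa, bb = np.meshgrid(np.linspace(amin, amax, 101),
                         np.linspace(bmin, bmax, 101))
    ab = np.c_[aa.ravel(), bb.ravel()]
    # Predict a probability for every grid point and reshape it back to the grid
    c = model.predict(ab)
    cc = c.reshape(aa.shape)
    plt.figure(figsize=(12, 8))
    plt.contourf(aa, bb, cc, cmap='bwr', alpha=0.2)
    plt.plot(X[y==0, 0], X[y==0, 1], 'ob', alpha=0.5)
    plt.plot(X[y==1, 0], X[y==1, 1], 'xr', alpha=0.5)
    plt.legend(['0', '1'])

plot_decision_boundary(model, X, y)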
As you can see in the figure, since a shallow model like logistic regression is not able to draw curved
4.4. BINARY CLASSIFICATION 165
boundaries, the best it can do is align the boundary so that most of the blue dots fall in the blue region and
most of the red crosses fall in the red region.
Deep model
The word deep in Deep Learning has changed meaning over time. Initially it was used to refer to networks
that had more than a single layer. As the field progressed and more and more complex models were
invented, the word shifted to meaning networks with hundreds of layers and billions of parameters. In this
book we will use the original meaning and call “deep” any model with more than one layer, so let’s add a few
layers and create our first “deep” model.
This model has 3 layers. The first layer has 4 nodes, with 2 inputs and a relu activation function. The 4
values at the output of the first layer will be fed into a second layer with 2 nodes and a relu activation
function and finally the 2 outputs of this layer will be fed into the third layer, which is also our output layer.
This only has 1 node and a sigmoid activation function, so that the output values are constrained between 0
and 1.
We can build this network in Keras very easily. All we have to do is add more layers to the Sequential
model, specifying the number of nodes and the activation for each of them using the .add() method. Let’s
start with the first layer:
This is very similar to what we did above, except that now this Dense layer has 4 nodes instead of 1. How
many parameters are there in this layer? There are 12 parameters, 2 weights for each of the nodes (2*4) plus 1
bias for each of the nodes (4).
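The same counting rule applies to every Dense layer: one weight per (input, node) pair plus one bias per node. The short sketch below applies it to the three layers of the model described above; the helper name `dense_param_count` is made up for this example.

```python
def dense_param_count(n_inputs, n_nodes):
    """Weights (n_inputs * n_nodes) plus one bias per node."""
    return n_inputs * n_nodes + n_nodes


# 2 input features -> 4 nodes -> 2 nodes -> 1 output node
layer_sizes = [2, 4, 2, 1]
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [12, 10, 3]
print(sum(counts))  # 25
```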
Let’s now add a second layer after the first one, with 2 nodes:
Notice that we didn’t have to specify the input_dim parameter, because Keras is smart and automatically
matches it with the output size of the previous layer.
In [29]: model.compile(Adam(lr=0.05),
                       'binary_crossentropy',
                       metrics=['accuracy'])
The input_dim parameter is the number of dimensions in our input data points. In this case, each point is
described by two numbers, so the input dimension is equal to 2 (for the first Dense() layer). Dense(1) is
the output layer. Here we are classifying 2 classes, blue dots and red crosses, and therefore it’s a binary
classification and we are predicting a single number: the probability of being in the class of the red crosses.
Let’s train it and see how it performs, using the .fit() method again:
We’ll use a couple handy functions from the sklearn.metrics package, the accuracy_score() and
confusion_matrix() functions. First of all let’s see what classes our model predicts using the
.predict_classes() method:
This is different from the .predict() method because it returns the actual predicted class instead of the
predicted probability of each class.
In [33]: y_train_pred[:3]
Out[33]: array([[1],
[1],
[0]], dtype=int32)
In [34]: y_train_prob[:3]
Out[34]: array([[0.9999733 ],
[0.999522 ],
[0.00104994]], dtype=float32)
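The relationship between the two outputs above can be sketched in plain Python: a predicted class is obtained by comparing the predicted probability with a threshold. The threshold of 0.5 and the helper name `to_classes` are assumptions of this sketch; the probabilities are the three values printed above.

```python
def to_classes(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]


# The first three predicted probabilities shown above
y_prob = [0.9999733, 0.999522, 0.00104994]

print(to_classes(y_prob))  # [1, 1, 0]
```

Raising or lowering the threshold is one way to trade False Positives against False Negatives, as discussed in Chapter 3.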
Let’s compare the predicted classes with the actual classes on both the training and the test set. First, let’s
import the accuracy_score and the confusion_matrix functions from sklearn:
Let’s check the accuracy score for both the training set and the test set:
In [37]: plot_decision_boundary(model, X, y)
plt.title("Decision Boundary for Fully Connected");
As you can see, our network learned to separate the two classes with a zig-zag boundary, which is typical of
the ReLU activation.
TIP: if your model has not learned to separate the data well, just re-initialize the model
and re-train it. As you’ll see later in this book, the model is initialized randomly and this
may have a huge effect on its ability to effectively learn.
Let’s try building our model again, but use a different activation this time. If we used the tanh function
instead, we’d have obtained a smoother boundary:
plot_decision_boundary(model, X, y)
Adding depth to our model allows us to separate two classes with a boundary of arbitrary shape. The
complexity of the boundary profile is given by the number of nodes and layers we add to the network. The
more we add, the more parameters our network will learn. This is really powerful because we can always
add more layers if we want to be able to capture more complex boundaries.
Deep Learning models can have from as few as a few hundred parameters to as many as a few billion. As models
get more parameters, they also need more data, so to train a model with millions of parameters we will
likely need tens of millions of data points. This also requires substantial computational resources, as we shall
see later on.
Multiclass classification
Neural Networks can be easily extended to cases where the output is not a single value.
In the case of regression, this means that the output is a vector, while in the case of classification, it means
we have more than one class we’d like to separate.
For example, if we are doing image recognition, we may have several classes for all the objects we’d like to
distinguish (e.g. cat, dog, mouse, bird, etc. ). Instead of having a single output Yes/No, we allow the network
to predict multiple values.
Similarly, for a self driving car, we may want our network to predict the direction of the trajectory the car
should take, which means both the speed and the steering angle. This would be a regression with multiple
outputs at the same time. The extension is trivial in the case of regression: we add as many output nodes as
needed and minimize the mean squared error on the whole vector output.
The case of classification requires a little more discussion, because we need to carefully choose the activation
function. In fact, when we are predicting discrete output we could be in one of two cases:
Let’s consider the example of email classification. We would like to use our Machine Learning model to
organize a large pool of emails sitting in our inbox. We could choose two ways to organize them.
Tags
One way to arrange our emails would be to add tags to each email to specify the content. We could have a
tag for Work, a tag for Personal, but also a tag for Has_Picture or Has_Attachment. These tags are not
mutually exclusive. Each one is independent from the others and a single email could carry multiple tags.
The extension of the Neural Network to this case is also pretty straightforward, because we will perform an
independent logistic regression on each tag. Just like in the case of the regression, all we have to do is add
multiple sigmoid output nodes and we are done.
A different case is if we decided to arrange our emails in folders, for example: Work, Personal, Spam etc.,
and move each email to the corresponding folder. In this case, each email can only be in one folder. If it’s in
folder Work, it is automatically not in folder Personal. In this case, we cannot use independent sigmoids,
we need to use an activation function that will normalize the output so that if a node predicts a high
probability, all the others will predict a low probability and the sum of all the probabilities will add up to one.
Mathematically, the softmax function is a generalization of the logistic function that does just that:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \dots, K. \tag{4.14}$$
When we deal with mutually exclusive classes, we always have to apply the softmax function to the last
layer.
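As a small illustration, the softmax formula can be sketched in a few lines of NumPy. The function name `softmax` and the example scores are made up for this sketch. Subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1D array of scores (illustrative helper, not a Keras API)."""
    z = np.asarray(z, dtype=float)
    # subtracting the max does not change the result but avoids overflow in exp
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # made-up scores for 3 folders
probs = softmax(scores)
print(probs)        # three positive numbers, largest for the largest score
print(probs.sum())  # adds up to 1 (up to floating point rounding)
```

Raising one score lowers the probabilities of all the other classes, which is exactly the behavior we want for mutually exclusive folders.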
The Iris dataset is a classic dataset used in Machine Learning. It describes 3 species of flowers, with 4
features each, so it’s a great example for a Multiclass classification. Let’s see how Multiclass classification’s
done using keras and the Iris dataset. First of all let’s load the data.
In [41]: df = pd.read_csv('../data/iris.csv')
In [42]: df.head()
Out[42]:
We need to do a bit of massaging of the data, separate the input features from the target column containing
the species.
First of all let’s create a feature matrix X where we store the first 4 columns:
Out[43]:
Let’s also create a target column, where we encode the labels in alphabetical order. We need to do this
because Machine Learning models do not understand string values like setosa or versicolor. We will
first look at the unique values contained in the species column:
And then build a dictionary where we assign an index to each target name in alphabetical order:
Now we can use the .map method to create a new Series from the species column, where each of the
entries is replaced using target_dict:
In [46]: y= df['species'].map(target_dict)
y.head()
Out[46]:
species
0 0
1 0
2 0
3 0
4 0
Now y is a number indicating the class (0, 1, 2). In order to use this with Neural Networks, we need to
perform one last step: we will expand it to 3 binary dummy columns. We could use the
pandas.get_dummies function to do this, but Keras also offers an equivalent function, so let’s use that
instead:
Let’s check out what the data looks like by looking at the first 5 values:
In [49]: y_cat[:5]
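To make the expansion into dummy columns concrete, here is a NumPy-only sketch of the same idea. It is not the Keras function mentioned above, and the names `labels` and `one_hot` are made up for the example.

```python
import numpy as np

labels = np.array([0, 0, 1, 2, 1])  # made-up class indices
n_classes = 3

# row i of the identity matrix has a single 1 in column i, so indexing it
# with the labels yields one binary column per class
one_hot = np.eye(n_classes)[labels]
print(one_hot)
```

Each row contains exactly one 1, in the column of the corresponding class.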
Now we create a train and test split, with a 20% test size. We’ll pass the values of the X dataframe because
keras doesn’t like pandas dataframes. Also notice that we introduce 2 more parameters:
• stratify, to which we pass the labels, to make sure that we preserve the ratio of labels in each set, i.e. we want each set
to be composed of one third of each flower type.
• random_state = 0 sets the seed of the random number generator so that we all get the same
results.
This is a shallow model, equivalent of a Logistic Regression with 3 classes instead of two.
The output of the model is a matrix with 3 columns, one for each class, containing the predicted
probabilities. The columns follow the same order as the columns of the y_train
array:
Which class does our network think each flower is? We can obtain the predicted class with the np.argmax function,
which finds the index of the maximum value in an array:
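As a small self-contained illustration, the sketch below applies np.argmax to a made-up matrix of predicted probabilities (the numbers are invented, not the output of the model above).

```python
import numpy as np

# made-up predicted probabilities: one row per flower, one column per class
y_pred = np.array([[0.80, 0.15, 0.05],
                   [0.10, 0.30, 0.60],
                   [0.20, 0.70, 0.10]])

# axis=1 picks, for each row, the column with the highest probability
y_pred_class = np.argmax(y_pred, axis=1)
print(y_pred_class)  # [0 2 1]
```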
Let’s check the classification report and confusion matrix that we have described in Chapter 3.
To create a classification report, we’ll run the classification_report() function, passing it the test classes
(the list that we created before of the correct labels for each datum) and the y_pred_class (the list we just
obtained of the predicted classes).
We get the confusion matrix by running the confusion_matrix() method passing it the same arguments
as the classification report:
Out[57]:
Recall that the confusion matrix tells us how many examples from one class are predicted in each class. It’s
almost perfect, with the exception of one point in class virginica which gets predicted in class
versicolor. Let’s inspect the data visually to check why. Our data has 4 features, so we need to decide how
to plot it. We could choose 2 features and plot just those:
In [59]: plt.scatter(X.loc[y==0,'sepal_length'],
X.loc[y==0,'petal_length'])
plt.scatter(X.loc[y==1,'sepal_length'],
X.loc[y==1,'petal_length'])
plt.scatter(X.loc[y==2,'sepal_length'],
X.loc[y==2,'petal_length'])
plt.xlabel('sepal_length')
plt.ylabel('petal_length')
plt.legend(targets)
plt.title("The Iris Dataset");
Classes virginica and versicolor are slightly overlapping, which could explain why our model couldn’t
separate them too well. Is it true for every feature? We’ll check that with a very cool visualization library
called Seaborn. Seaborn improves Matplotlib with additional plots, for example the pairplot, which plots
all possible pairs of features in a scatter plot:
As you can see virginica and versicolor overlap in all the features, which can explain why our model
confuses them. Keep in mind that we used a shallow model to separate them instead of a deeper one.
Conclusion
In this chapter we have introduced fully connected deep Neural Networks and seen how they can be used to
solve linear and nonlinear regression and classification problems. In the exercises we will apply them to
predict the onset of diabetes in a population.
Exercises
Exercise 1
The Pima Indians dataset is a very famous dataset distributed by UCI and originally collected from the
National Institute of Diabetes and Digestive and Kidney Diseases. It contains data from clinical exams for
women aged 21 and above of Pima Indian heritage. The objective is to predict, based on diagnostic
measurements, whether a patient has diabetes.
1. Load the ../data/diabetes.csv dataset, use pandas to explore the range of each feature
• For each feature draw a histogram. Bonus points if you draw all the histograms in the same figure.
• Explore correlations of features with the outcome column. You can do this in several ways, for
example using the sns.pairplot we used above or drawing a heatmap of the correlations.
• Do features need standardization? If so what standardization technique will you use? MinMax?
Standard?
• Prepare your final X and y variables to be used by a ML model. Make sure you define your target
variable well. Will you need dummy columns?
Exercise 2
Build a fully connected NN model that predicts diabetes. Follow these steps:
1. split your data in a train/test with a test size of 20% and a random_state = 22
• define a sequential model with at least one inner layer. You will have to make choices for the following
things:
– what is the size of the input?
– how many nodes will you use in each layer?
– what is the size of the output?
– what activation functions will you use in the inner layers?
– what activation function will you use at output?
– what loss function will you use?
– what optimizer will you use?
• fit your model on the training set, using a validation_split of 0.1
• test your trained model on the test data from the train/test split
• check the accuracy score, the confusion matrix and the classification report
Exercise 3
Compare your work with the results presented in this notebook. Are your Neural Network results better or
worse than the results obtained by traditional Machine Learning techniques?
• Try training a Support Vector Machine or a Random Forest model on the exact same train/test split.
Is the performance better or worse?
• Try restricting your features to only 4 features like in the suggested notebook. How does model
performance change?
Exercise 4
Tensorflow playground is a web based Neural Network demo. It is really useful to develop an intuition
about what happens when you change architecture, activation function or other parameters. Try playing
with it for a few minutes. You don’t need to understand the meaning of every knob and button in the page,
just get a sense for what happens if you change something. In the next chapter we’ll explore these things in
more detail.
Deep Learning Internals
5
This is a special chapter
In the last chapter we introduced the Perceptron with weights, biases and activation functions and fully
connected Neural Networks. This chapter is a bit different from all the other chapters and it is meant for the
reader who is interested in understanding the inner workings of a Neural Network.
In this chapter we learn about gradient descent and backpropagation. This is surely much more technical and
abstract than the rest of the book. There are mathematical formulas, weird symbols, derivatives, gradients
and much more. We will try to make these concepts as intuitive and simple as possible, but these are
complex topics and it is not possible to introduce them fully without going into some level of detail.
Let us first tell you: you don’t NEED to read this chapter. This book is meant for the developer and
practitioner that is interested in applying Neural Networks to solve great problems. As such, all the previous
and following chapters are focused on the implementation of Neural Networks and their practical
application to several problems. This chapter is different from all the others: you will not learn new
applications here, you will not learn new commands or tricks, nor will we introduce any new Neural
Network architecture.
All this chapter does is explain what happens when you run the function model.fit, i.e. break down how a
Neural Network is trained. As we have already seen in chapters 3 and 4, after we define the model
architecture we usually take 2 more steps:
1. we .compile the model specifying the optimizer and the cost function
2. we .fit the model for a certain number of epochs using the training data
These two operations are executed by Keras for us and we don’t have to worry about them too much.
However, I’m sure you’ve been wondering why we choose a particular optimizer at compilation or what is
actually happening during training.
In our opinion it is important to learn this for a few reasons. First of all, understanding these concepts
allows us to demystify what’s actually happening under the hood with our network. Neural Networks are not
magic, and knowing these concepts can give us a better ability to judge where we can use them to solve
problems and where we cannot. Secondly, knowing the internal mechanisms increases our ability to
understand which parameters can be tweaked and which optimization algorithms to choose.
So, let us re-iterate this once again: feel free to skip this chapter if your main goal is to learn how to use
Keras and to apply Neural Networks. You won’t find new code here, mostly a lot of maths and formulas.
On the other hand, if your goal is to understand how things actually work, then go ahead and read it.
Chances are you will find the answers to some of your questions in this chapter.
Finally, if you are already familiar with derivatives and college calculus, you can probably skim through
large portions of this chapter quite quickly.
All that said, let’s start by introducing derivatives and gradients. First let’s import our usual libraries. By now
you should be very familiar with all of them, but if in doubt about what they do, check back to Chapter 2 where
we introduced them:
Derivatives
As the name suggests a derivative is a function that derives from another function.
Let’s start with an example. Imagine you are driving on the highway. As time goes by you mark your
position along the highway, filling a table of values as a function of time. If your speed is 60 miles an hour,
every minute your position will be increased by 1 mile.
Let’s indicate your position as a function of time with the variable x(t). Let’s create an array of 10 minutes,
called t and an array of your positions called x:
In [3]: t = np.arange(10)
x = np.arange(10)
Now, let’s make a plot to see the distance traveled as a function of time.
The derivative x′(t) of this function is the rate of change in position with respect to time. In this example it
is the speed of your car indicated by the speedometer. In the example just mentioned, the derivative is a
constant value of 60 miles per hour, or 1 mile per minute. Let’s create an array containing the speed at each
moment in time:
In general, the derivative x ′ (t) is itself a function of t that tells us the rate of change of the original function
x(t) at each point in time. This is why it is called a derivative. It can also be written explicitly as:
$$x'(t) := \frac{dx}{dt}(t) \tag{5.1}$$
Where the fraction $\frac{dx}{dt}$ indicates the ratio between a small change in x and the small change in t that produced it. Let’s look
at a case where the derivative is not constant. Consider an arbitrary curve f (t). Let’s first create a slightly
bigger time array:
Then let’s take an arbitrary function and let’s apply it to the array t. We will use the sine function, but that’s
just an example, any function would do:
In [8]: f = np.sin(t)
plt.plot(t, f)
plt.title("Sine Function");
At each point along the curve f (t), the derivative f ′ (t) is equal to the rate of change in the function.
Finite differences
How do we calculate the value of the derivative at a particular point in t? We can calculate its approximate
value with the method of finite differences:
$$\frac{df}{dt}(t_i) \approx \frac{\Delta f}{\Delta t}(t_i) = \frac{f(t_i) - f(t_{i-1})}{t_i - t_{i-1}} \tag{5.2}$$
We can calculate the value of the approximate derivative of the above function by using the function
np.diff that calculates the difference between consecutive elements in an array:
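A minimal sketch of this calculation follows. The time array used here is an assumption (360 points between 0 and 2π), chosen only for illustration. The name `dfdt` matches the one used in the plotting cells of this section.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 360)  # assumed time array
f = np.sin(t)

# np.diff(a)[i] = a[i+1] - a[i], so this is the ratio of consecutive differences
dfdt = np.diff(f) / np.diff(t)

print(len(t), len(dfdt))  # dfdt has one element fewer than t

# the exact derivative of sin is cos: compare at the midpoints of each interval
t_mid = (t[1:] + t[:-1]) / 2
print(np.max(np.abs(dfdt - np.cos(t_mid))))  # small approximation error
```

Because the differences are taken between consecutive elements, the resulting array is one element shorter than the original, which is why the plots pair it with a sliced time array.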
In [10]: plt.plot(t, f)
plt.plot(t[1:], dfdt)
plt.legend(['f', 'dfdt'])
plt.axhline(0, color='black')
plt.title("Sine Function and its first derivative");
If we read the figure from left to right, we notice that the value of the derivative is negative when the original
curve is going downhill and it is positive when the original curve is going uphill. Finally, if we’re at a
minimum or at a maximum the derivative is 0 because the original curve is flat.
Let’s define a simple helper function to plot the tangent line to our curve, i.e. the line that “just touches” the
curve at that point:
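def plot_tangent(i):
    # plot the curve and its finite-difference derivative
    plt.plot(t, f)
    plt.plot(t[:-1], dfdt)
    plt.legend(['f', '$\\frac{df}{dt}$'])
    plt.axhline(0)

    # point of tangency and slope at index i
    ti = t[i]
    fi = f[i]
    dfdti = dfdt[i]

    # assumed completion: mark the point and draw a short tangent segment
    plt.plot(ti, fi, 'o')
    plt.plot([ti - 1, ti + 1], [fi - dfdti, fi + dfdti])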
We can use this helper function to display the relationship between the inclination (slope) of our tangent
line and the value of the derivative function. As you can see, a positive derivative corresponds to an uphill
tangent line, while a negative derivative corresponds to a downhill tangent line.
In [12]: plt.figure(figsize=(14,5))
plt.subplot(131)
plot_tangent(15)
plt.title("Positive Derivative / Upwards")
plt.subplot(132)
plot_tangent(89)
plt.title("Zero Derivative / Flat")
plt.subplot(133)
plot_tangent(175)
plt.title("Negative Derivative / Downwards")
plt.tight_layout();
Although the finite differences method is useful to calculate the numerical value of a derivative, we don’t
need to use it. In fact, derivatives of simple functions are well known and we don’t need to calculate them.
Calculus is the branch of mathematics that deals with all this. For our purposes we will simply summarize
here a few common functions and their derivatives:
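For instance, the following are standard results from calculus, where a and n are constants. This is a short illustrative list of well-known derivatives:

```latex
\begin{array}{ll}
f(t) = a      & f'(t) = 0 \\
f(t) = a\,t   & f'(t) = a \\
f(t) = t^n    & f'(t) = n\,t^{n-1} \\
f(t) = e^{t}  & f'(t) = e^{t} \\
f(t) = \ln t  & f'(t) = 1/t \\
f(t) = \sin t & f'(t) = \cos t \\
f(t) = \cos t & f'(t) = -\sin t
\end{array}
```

Together with the fact that the derivative of a sum is the sum of the derivatives, these rules are enough to differentiate any polynomial.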
When our function has more than one input variable, we need to specify which variable we are using for
derivation. For example, let’s say we are measuring our elevation on a mountain as a function of our
position. Our GPS position is defined by two variables: longitude and latitude, and therefore the elevation
depends on two variables: y = f (x1 , x2 ).
We can calculate the rate of change in elevation with respect to x1 , and the rate of change with respect to x2
independently. These are called partial derivatives, because we only consider the change with respect to
one variable. We will indicate them with a “curly d” symbol:
$$\frac{\partial f}{\partial x_1}, \quad \frac{\partial f}{\partial x_2}$$
If we are on top of a hill, the fastest route downhill will not necessarily be along the north-south or
east-west direction: it will be in whatever direction the hill descends most steeply.
In the two dimensional plane of x1 and x2 , the direction of the most abrupt change will be a 2-dimensional
vector whose components are the partial derivatives with respect to each variable. We call this vector the
Gradient, and we indicate it with an inverted triangle called del or nabla: ∇.
The gradient is an operation that takes a function of multiple variables and returns a vector. The
components of this vector are all the partial derivatives of the function. Since the partial derivatives are
functions of all variables, the gradient too is a function of all variables. To be precise, it is a vector function.
For each point (x1 , x2 ), the gradient returns the vector in the direction of maximum steepness in the
graph of the original function. If we want to go downhill, all we have to do is walk in the direction opposite
to the gradient. This will be our strategy for minimizing cost functions.
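The idea can be sketched numerically. The `elevation` function below is a made-up hill, and `gradient` is a hypothetical helper that approximates the two partial derivatives with finite differences. Neither comes from a library.

```python
def elevation(x1, x2):
    """Made-up smooth hill: highest at (0, 0), lower everywhere else."""
    return 100.0 - x1 ** 2 - 3.0 * x2 ** 2

def gradient(f, x1, x2, h=1e-6):
    """Approximate the two partial derivatives with central finite differences."""
    df_dx1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    df_dx2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return df_dx1, df_dx2

x1, x2 = 1.0, 2.0
g1, g2 = gradient(elevation, x1, x2)
print(g1, g2)  # close to the exact partial derivatives -2*x1 and -6*x2

# take a small step opposite to the gradient: the elevation decreases
step = 0.01
new_x1, new_x2 = x1 - step * g1, x2 - step * g2
print(elevation(x1, x2), elevation(new_x1, new_x2))
```

Note that the step moves further along x2 than along x1, because the hill is steeper in that direction.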
So we have an operation, the gradient, which takes a function of multiple variables and returns a vector in
the direction of maximum steepness. Pretty cool!
Why is this neat? Why is it important? Well, it turns out that we can use this idea to train our networks.
Backpropagation intuition
Now that we have defined the gradient, let’s talk about backpropagation.
Backpropagation is a core concept in Machine Learning. The next several sections are dedicated to working
through the math of backpropagation. As said at the beginning of this chapter, it is not necessary to
understand the maths in order to be able to build and apply a Deep Learning model. However, the math is
not very hard and with a little bit of exercise you’ll be able to see that there is no mystery behind how Neural
Networks function.
At a high-level, the backpropagation algorithm is a Supervised Learning method for training our networks.
It uses the error between the model prediction and the true labels to modify the model weights in order to
reduce the error in the next iteration.
The starting point for backpropagation is the Cost Function we have introduced in Chapter 3.
Let’s consider a generic cost function of a network with just one weight, let’s call this function J(w). For
every value of the weight w, the function calculates a value of the cost J(w).
The figure shows this situation for the case of a Linear Regression. As seen in Chapter 3 different lines
correspond to different values of w. In the figure we represented them with different colors. Each line
produces a different cost, here represented with a dot of different color and our goal is to find the value of w
that corresponds to the minimum of J(w).
The case of linear regression is easily solved with a bit of algebra, but how do we deal with the general case of
a network with millions of weights and biases? The cost function J now depends on millions of parameters
and it is not obvious how to search for a minimum value.
What is clear is that the shape of such a cost function is not going to be a smooth parabola like the one in the
figure, and we will need a way to navigate a very complex landscape in search for a minimum value.
Let’s say we are sitting at a particular point w0 with corresponding cost J(w0 ), how do we move towards
lower costs?
We would like to move in the direction of decreasing J(w) until we reach the minimum, but we can only
use local information. How do we decide where to go?
As we have seen when we talked about descending from a hill, the derivative indicates the slope of the curve at each
point. So, in order to move towards lower values, we need to calculate the derivative at w0 and then change
our position by subtracting the value of the derivative from the value of our starting position w0 .
Programmatically speaking, we take one step in the direction in which the cost decreases:
Weight update
$$w_0 \rightarrow w_0 - \frac{dJ}{dw}(w_0) \tag{5.3}$$
Let’s check that this does move us towards lower values on the vertical axis.
If we are at w0 like in the figure, the slope of the curve is negative and thus the quantity $-\frac{dJ}{dw}(w_0)$ is positive.
So, the value of w0 will increase, moving us towards the right on the horizontal axis.
The corresponding value on the vertical axis will decrease and we successfully moved towards a lower value
of the function J(w).
Vice versa, if we were to start at a point w0 where the value of the slope is positive, we would subtract the
positive quantity $\frac{dJ}{dw}(w_0)$, so the change in w0 is now negative. This would move w0 to the left, and the corresponding values
on the vertical axis would still decrease.
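Both cases can be checked with a tiny example. The cost function below is invented for this sketch, and it is gentle enough that a full step does not jump past the minimum.

```python
def J(w):
    return 0.25 * (w - 3.0) ** 2  # made-up cost with its minimum at w = 3

def dJdw(w):
    return 0.5 * (w - 3.0)        # derivative of the cost above

results = []
for w0 in (1.0, 5.0):             # one point on each side of the minimum
    w1 = w0 - dJdw(w0)            # update rule (5.3)
    results.append((w0, w1, J(w0), J(w1)))
    print(w0, "->", w1, " cost:", J(w0), "->", J(w1))
```

Starting on the left (negative slope) the weight increases, starting on the right (positive slope) it decreases, and in both cases the cost goes down.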
Learning Rate
The update rule we have just introduced needs one more modification. As it is, it suffers from two problems. If the
cost function is very flat, the derivative will be very small and with the current update rule, we will
move very slowly towards the minimum. Vice versa, if the cost function is very steep, the derivative will
be very large and we might end up jumping beyond the minimum.
A simple solution to both problems is to introduce a tunable knob that allows us to decide how big a step
to take along the direction indicated by the gradient. This is called the learning rate, and we will indicate it with
the Greek letter η:
$$w_0 \rightarrow w_0 - \eta \frac{dJ}{dw}(w_0)$$
If we choose a small learning rate, we will move by tiny steps, if we choose a large learning rate, we will
move by large steps.
However, we must be careful. If the learning rate is too large, we will actually run away from the solution. At
each new step we move towards the direction of the minimum, but since the step is too large, we overshoot
and go beyond the minimum, at which point we reverse course and repeat, going further and further away.
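The three regimes can be seen on a toy example. The cost J(w) = w² and the helper `descend` are assumptions made for this sketch. The learning rates are picked to show slow progress, good progress and divergence.

```python
def dJdw(w):
    return 2.0 * w  # derivative of the made-up cost J(w) = w**2, minimum at w = 0

def descend(w, lr, steps=20):
    """Apply the update w -> w - lr * dJ/dw a fixed number of times."""
    for _ in range(steps):
        w = w - lr * dJdw(w)
    return w

w_start = 1.0
for lr in (0.01, 0.1, 1.1):
    print(lr, descend(w_start, lr))
```

With the smallest learning rate we are still far from the minimum after 20 steps, with the intermediate one we get very close to it, and with the largest one each step overshoots by more than the previous one, so we end up much further away than where we started.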
Gradient descent
This way of looking for the minimum of a function is called Gradient Descent and it is the idea behind
backpropagation. Given a function, we can move towards its minimum by following the path indicated by
its derivative, or in the case of multiple variables, indicated by the gradient.
For a Neural Network, we define a cost function that depends on the values of the parameters, and we find
the values of the parameters by minimizing such cost through gradient descent.
Minimizing a cost function is how we optimize our networks. In fact, it’s the backbone of a lot of
different Machine Learning and Deep Learning techniques.
All we are really doing is taking the cost function, calculating its partial derivatives with respect
to each parameter, and then using the update rule to decrease the cost. We do this by subtracting the
gradient, scaled by the learning rate, from the parameters themselves. This is what’s called a parameter update.
We know that the gradient is a function that indicates the direction of maximum steepness. We also know
that we can move towards the minimum of a function by taking consecutive steps in the direction opposite to the
gradient at each point we visit.
Let’s see this with a programming example. We’ll use an invented cost function. Let’s start by defining an
array x with 100 points in the interval [-4, 4]:
Then let’s define an invented cost function J(w) that depends on w in some weird way.
Using the table of derivatives presented earlier we can also quickly calculate its derivative.
$$\frac{dJ}{dw}(w) = -30.0\,w + 1.5\,w^2 + 4\,w^3 \tag{5.5}$$
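The derivative in (5.5) determines J(w) only up to an additive constant. The sketch below assumes one such cost function (the constant `c` is an arbitrary choice that does not affect the derivative) and checks it against the derivative numerically.

```python
import numpy as np

x = np.linspace(-4, 4, 100)  # 100 points in the interval [-4, 4]

c = 0.0  # arbitrary additive constant (assumption: it drops out of the derivative)

def J(w):
    # a polynomial whose derivative is the expression in (5.5)
    return c - 15.0 * w**2 + 0.5 * w**3 + w**4

def dJdw(w):
    return -30.0 * w + 1.5 * w**2 + 4 * w**3

# sanity check: a central finite difference of J should agree with dJdw
h = 1e-5
numeric = (J(x + h) - J(x - h)) / (2 * h)
print(np.max(np.abs(numeric - dJdw(x))))  # small
```

Both functions accept NumPy arrays, so they can be evaluated on the whole array `x` at once, as the plotting cell does.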
In [16]: plt.subplot(211)
plt.plot(x, J(x))
plt.title("J(w)")
plt.subplot(212)
plt.plot(x, dJdw(x))
plt.axhline(0, color='black')
plt.title("dJdw(w)")
plt.xlabel("w")
plt.tight_layout();
Now let’s find the minimum value of J(w) by gradient descent. The function we have chosen has two
minima, one is a local minimum, the other is the global minimum. If we apply plain gradient descent we
will stop at the minimum that is nearest to where we started. Let’s keep this in mind for later.
In [17]: w0 = -4
$$w_0 \rightarrow w_0 - \eta \frac{dJ}{dw}(w_0) \tag{5.6}$$
In [18]: lr = 0.001
In [19]: step = lr * dJdw(w0)
step
Out[19]: -0.112
In [20]: w0 - step
Out[20]: -3.888
In [21]: iterations = 30
w = w0
ws = [w]
for i in range(iterations):
    step = lr * dJdw(w)
    w -= step
    ws.append(w)
ws = np.array(ws)
Let’s visualize our descent, zooming in the interesting region of the curve:
As you can see, we proceed with small steps towards the minimum, and there we stop. Try to modify the
starting point and re-run the code above to fully understand how this works.
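As a sketch of that experiment, the code below runs the same update rule from two different starting points. The helper `gradient_descent` and the larger number of iterations are choices made for this illustration.

```python
def dJdw(w):
    return -30.0 * w + 1.5 * w**2 + 4 * w**3  # derivative from (5.5)

def gradient_descent(w, lr=0.001, iterations=2000):
    """Repeat the update w -> w - lr * dJ/dw and return the final weight."""
    for _ in range(iterations):
        w -= lr * dJdw(w)
    return w

w_left = gradient_descent(-4.0)   # start on the left of the curve
w_right = gradient_descent(4.0)   # start on the right of the curve
print(w_left, w_right)            # roughly -2.93 and 2.56
```

The two runs end in two different minima: plain gradient descent stops at the minimum nearest to the starting point, whether or not it is the global one.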
Remember that a Neural Network is just a function that connects our inputs X to our outputs y. We’ll refer
to this function as ŷ = f (X). This function depends on a set of weights w that modulate the output of a layer
when transferring it to the next layer, and on a set of biases b.
Also, remember that we defined a cost J( ŷ, y) = J( f (X, w, b), y) that is calculated over the training set. So,
for fixed training data, the cost J is a function of the parameters w and b.
The best model is the one that minimizes the cost. We can therefore use gradient descent on the cost
function to update the values of the parameters w and b. The gradient will tell us in which direction to
update our parameters, and it is crucial to learning the optimal values of our network parameters.
First we calculate the gradient with respect to each weight (and bias), $\frac{\partial J}{\partial w}$, and then we update each weight
using the learning rate we have just introduced: $w_0 \rightarrow w_0 - \eta \frac{\partial J}{\partial w}$.
All we need to do at this point is to learn how to calculate the gradient $\frac{\partial J}{\partial w}$.
In this section we will work through the calculation of the gradient for a very simple Neural Network. We
are going to use equations and maths. As said previously, feel free to skim through this part if you’re focused
on applications, you can always come back later to go deeper in the subject. We will start with a network
with only one input, one inner node and one output. This will make our calculations easier to follow.
In order to make the math easier to follow we will break down this graph and highlight the operations
involved:
Starting from the left, the input is multiplied with the first weight w (1) , then the bias b(1) is added and the
sigmoid activation function is applied. This completes the first layer. Then we multiply the output of the first
layer by the second weight w (2) , we add the second bias b(2) and we apply another sigmoid activation
function. This gives us the output ŷ. Finally we use the output ŷ and the labels y to calculate the cost J.
5.7. THE MATH OF BACKPROPAGATION 197
Forward Pass
Let’s formalize the operations described above with math. The forward pass equations are written as follows:
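Based on the description of the graph, these operations can be written with the same symbols used in the text. This is a reconstruction from the surrounding description, so treat the exact layout as an assumption:

```latex
\begin{aligned}
z^{(1)} &= x\, w^{(1)} + b^{(1)} \\
a^{(1)} &= \sigma(z^{(1)}) \\
z^{(2)} &= a^{(1)} w^{(2)} + b^{(2)} \\
\hat{y} = a^{(2)} &= \sigma(z^{(2)}) \\
J &= J(\hat{y}, y)
\end{aligned}
```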
The input-sum z (1) is obtained through a linear transformation of the input x with weight w (1) and bias b(1) .
In this case we only have one input, so there really is no weighted “sum”, but we still call it input-sum to
remind ourselves of the general case where multiple inputs and multiple weights are present.
The activation a(1) is obtained by applying the sigmoid function to the input-sum z (1) . This is indicated by
that letter σ (pronounced sigma). A similar set of equations holds for the second layer with input-sum z (2)
and activation a (2) , which is equivalent to our predicted output in this case.
The cost function J is a function of the true labels y and the predicted values ŷ, which depend on all the
parameters of the network.
The equations described above allow us to calculate the prediction of the network for a given input and the
cost associated with such prediction. Now we want to calculate the gradients in order to update the weights
and biases and reduce the cost.
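The forward pass of this tiny network can be sketched in plain Python. The function names, the input, the label and the parameter values are made up for the example, and the squared-error cost is an assumption, since the text leaves the form of J open.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, y, w1, b1, w2, b2):
    """Forward pass of the 1-input, 1-inner-node, 1-output network."""
    z1 = x * w1 + b1       # input-sum of the first layer
    a1 = sigmoid(z1)       # activation of the first layer
    z2 = a1 * w2 + b2      # input-sum of the second layer
    y_hat = sigmoid(z2)    # prediction (activation of the second layer)
    cost = 0.5 * (y_hat - y) ** 2  # squared-error cost, assumed for this sketch
    return z1, a1, z2, y_hat, cost

# made-up input, label and parameters
z1, a1, z2, y_hat, cost = forward(x=1.0, y=1.0, w1=0.5, b1=-0.5, w2=2.0, b2=-1.0)
print(a1, y_hat, cost)  # 0.5 0.5 0.125
```

The intermediate values are returned as well, because the backward pass reuses them.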
Weight updates
Our goal is to calculate the derivative of the cost function with respect to the parameters of the model,
i.e. weights and biases. Let’s start by calculating the derivative of the cost function with respect to w (2) , the
last weight used by the network.
$$\frac{\partial J}{\partial w^{(2)}} \tag{5.13}$$
w (2) appears inside z (2) , which is itself inside the sigmoid function, so we need a way to calculate the
derivative of a nested function.
The technique is actually pretty easy and it’s called chain rule. If you need a refresher of how it works, we
have an example of this in the Appendix.
We can look at the graph above to determine which terms will appear in the chain rule and see that J
depends on w (2) through ŷ and z (2) .
If we apply the chain rule we see that this derivative is the product of three terms.
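$$\frac{\partial J}{\partial w^{(2)}} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w^{(2)}} \tag{5.14}$$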
Wow! This may look pretty complicated to you! Wouldn’t it be nice to have a simplified notation for all this?
It turns out we can introduce this simpler notation, following the course by Roger Grosse at the University of
Toronto.
In particular we will use a long line over a variable to indicate the derivative of the cost function with respect
to that variable. E.g.:
$$\overline{w^{(2)}} := \frac{\partial J}{\partial w^{(2)}} \tag{5.15}$$
Besides being easier to read, this notation emphasizes the fact that those derivatives are evaluated at a
certain point, i.e. they are numbers, not functions.
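$$\overline{w^{(2)}} = \overline{z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w^{(2)}} = \overline{\hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w^{(2)}} \tag{5.16}$$
And we can start to see why it is called backpropagation: in order to calculate $\overline{w^{(2)}}$ we will need to
first calculate the derivatives of the terms that follow $w^{(2)}$ in the graph, and then propagate their
contributions back to calculate $\overline{w^{(2)}}$.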
Step 1: $\overline{\hat{y}} = \frac{\partial J}{\partial \hat{y}}$
The first term is just the derivative of the cost function with respect to ŷ. This term will depend on the exact
form of the cost function, but it is well defined, and it can be calculated for a given training set. For example,
in the case of the Mean Squared Error $\frac{1}{2}(\hat{y} - y)^2$ this term is simply $(\hat{y} - y)$.
Looking at the graph above, we can highlight in red the terms involved in the calculation of $\overline{\hat{y}}$, which are only
the labels and the predictions:
∂J
Step 2: z (2) = ∂z (2)
As noted before, the chain rule tells us that z (2) is the product of the derivative of the sigmoid with the term
we just calculated ŷ:
∂ ŷ
z (2) = ŷ = ŷ σ ′ (z (2) ) (5.17)
∂z (2)
Since we have already calculated ŷ we don’t need to calculate it again, the only term we need is the derivative
200 CHAPTER 5. DEEP LEARNING INTERNALS
of the sigmoid. This is easy to calculate and we’ll just indicate it with σ ′ .
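For reference, here is a short worked derivation of that derivative. It assumes the standard logistic sigmoid $\sigma(z) = 1/(1 + e^{-z})$, whose definition is not restated in this section:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}
           = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
           = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
```

Under that assumption, once the forward pass has produced $\hat{y} = \sigma(z^{(2)})$, the derivative is available at almost no extra cost as $\hat{y}(1 - \hat{y})$.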
Step 3: $\overline{w^{(2)}} = \frac{\partial J}{\partial w^{(2)}}$
$$\overline{w^{(2)}} = \overline{z^{(2)}} \, \frac{\partial z^{(2)}}{\partial w^{(2)}} \qquad (5.18)$$
Since we have already calculated $\overline{z^{(2)}}$ we only need to calculate $\frac{\partial z^{(2)}}{\partial w^{(2)}}$,
which is equal to $a^{(1)}$.
So we have:
$$\overline{w^{(2)}} = \overline{z^{(2)}} \, a^{(1)} \qquad (5.19)$$
This last formula is really interesting because it tells us that the update to the weight $w^{(2)}$ is proportional to
the input $a^{(1)}$ received by that weight. We can write it as $\overline{w^{(2)}} = \delta^{(2)} a^{(1)}$,
where $\delta^{(2)}$ is calculated using parts of the network that are downstream with respect to $w^{(2)}$ and it
corresponds to the derivative of the cost with respect to the input sum $z^{(2)}$.
The important aspect here is that $\delta^{(2)}$, i.e. $\overline{z^{(2)}}$, is a constant, representing the downstream contribution of
the network to the error.
Using the same procedure we can calculate the corrections to the bias $b^{(2)}$ as well:
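Step 4: $\overline{b^{(2)}} = \frac{\partial J}{\partial b^{(2)}}$
$$\overline{b^{(2)}} = \overline{z^{(2)}} \, \frac{\partial z^{(2)}}{\partial b^{(2)}} = \overline{z^{(2)}} \qquad (5.20)$$
since $\frac{\partial z^{(2)}}{\partial b^{(2)}} = 1$.
Following a similar procedure we can keep propagating the error back and calculate the corrections to $w^{(1)}$
and $b^{(1)}$. Proceeding backwards, the next term we need to calculate is $\overline{a^{(1)}}$.
Step 5: $\overline{a^{(1)}} = \frac{\partial J}{\partial a^{(1)}}$
Looking at the formulas for the forward pass we notice that $a^{(1)}$ appears inside $z^{(2)}$, so we apply the chain
rule and obtain:
$$\overline{a^{(1)}} = \overline{z^{(2)}} \, \frac{\partial z^{(2)}}{\partial a^{(1)}} = \overline{z^{(2)}} \, w^{(2)} \qquad (5.21)$$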
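At this point the calculation of the other terms is mechanical, and we will just summarize them all here:
$$\begin{aligned}
\overline{\hat{y}} &= \frac{\partial J}{\partial \hat{y}} && (5.22) \\
\overline{z^{(2)}} &= \overline{\hat{y}} \, \sigma'(z^{(2)}) && (5.23) \\
\overline{b^{(2)}} &= \overline{z^{(2)}} && (5.24) \\
\overline{w^{(2)}} &= \overline{z^{(2)}} \, a^{(1)} && (5.25) \\
\overline{a^{(1)}} &= \overline{z^{(2)}} \, w^{(2)} && (5.26) \\
\overline{z^{(1)}} &= \overline{a^{(1)}} \, \sigma'(z^{(1)}) && (5.27) \\
\overline{b^{(1)}} &= \overline{z^{(1)}} && (5.28) \\
\overline{w^{(1)}} &= \overline{z^{(1)}} \, x && (5.29)
\end{aligned}$$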
As you can see each term relies on previously calculated terms, which means we don’t have to calculate them
twice. This is why it’s called backpropagation: because the error terms are propagated back starting from
the cost function and walking along the network graph in reverse order.
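The chain of terms above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a network with one input, one hidden node and one output, sigmoid activations, the Mean Squared Error cost $\frac{1}{2}(\hat{y} - y)^2$, and made-up values for the parameters. The helper names (`forward`, `backward`, `numerical_grad`) are hypothetical. The finite-difference check at the end is not part of backpropagation; it is there only to confirm that the propagated terms agree with a brute-force estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    # Forward pass of the 1-1-1 network
    z1 = x * w1 + b1
    a1 = sigmoid(z1)
    z2 = a1 * w2 + b2
    y_hat = sigmoid(z2)
    return z1, a1, z2, y_hat

def cost(y, y_hat):
    # Mean Squared Error with the 1/2 factor
    return 0.5 * (y_hat - y) ** 2

def backward(x, y, w1, b1, w2, b2):
    # Each "bar" term reuses the one computed just before it
    z1, a1, z2, y_hat = forward(x, w1, b1, w2, b2)
    y_hat_bar = y_hat - y                      # dJ/dy_hat for the MSE cost
    z2_bar = y_hat_bar * y_hat * (1 - y_hat)   # times sigma'(z2)
    b2_bar = z2_bar
    w2_bar = z2_bar * a1
    a1_bar = z2_bar * w2
    z1_bar = a1_bar * a1 * (1 - a1)            # times sigma'(z1)
    b1_bar = z1_bar
    w1_bar = z1_bar * x
    return {"w1": w1_bar, "b1": b1_bar, "w2": w2_bar, "b2": b2_bar}

# Illustrative values, chosen arbitrarily
params = dict(w1=0.5, b1=-0.3, w2=1.2, b2=0.1)
x, y = 0.8, 1.0
grads = backward(x, y, **params)

def numerical_grad(name, eps=1e-6):
    # Central finite difference on one parameter at a time
    plus = dict(params)
    plus[name] += eps
    minus = dict(params)
    minus[name] -= eps
    j_plus = cost(y, forward(x, **plus)[-1])
    j_minus = cost(y, forward(x, **minus)[-1])
    return (j_plus - j_minus) / (2 * eps)

for name in params:
    print(name, grads[name], numerical_grad(name))
```

The two printed columns should agree to several decimal places, which is a handy sanity check whenever you derive gradients by hand.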
Wow! What a journey! Congratulations! You have completed the hardest part. We hope this was insightful
and useful. In the next section we will extend these calculations to fully connected networks where there are
many nodes for each layer. As you will see, it’s basically the same thing, only we will deal with matrices
instead of just numbers.
Fully connected backpropagation
In a fully connected network, each layer contains several nodes, and each node is connected to all of the
nodes in the previous and in the next layers. The weights in layer $l$ can be organized in a matrix $W^{(l)}$ whose
elements are identified by two indices $j$ and $k$. The index $k$ indicates the receiving node and the index $j$
indicates the emitting node. So, for example, the weight connecting node 5 in layer 2 to node 4 in layer 3 is
going to be indicated as $w^{(3)}_{54}$.
The input sum at layer $l$ and node $k$, $z^{(l)}_k$, is the weighted sum of the activations of layer $l - 1$ plus the bias
term of layer $l$:
Forward Pass
$$\begin{aligned}
&\;\;\vdots && (5.31) \\
z^{(l)}_k &= \sum_j a^{(l-1)}_j w^{(l)}_{jk} + b^{(l)}_k && (5.32) \\
a^{(l)}_k &= \sigma(z^{(l)}_k) && (5.33) \\
&\;\;\vdots && (5.34)
\end{aligned}$$
The activations $a^{(l)}_k$ are obtained by applying the sigmoid function to the input sums $z^{(l)}_k$ coming out from
node $k$ at layer $l$.
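Let’s indicate the last layer with the capital letter $L$. The equations for the output are:
$$\begin{aligned}
z^{(L)}_s &= \sum_r a^{(L-1)}_r w^{(L)}_{rs} + b^{(L)}_s \\
\hat{y}_s &= \sigma(z^{(L)}_s) \\
J &= \sum_s J(y_s, \hat{y}_s)
\end{aligned}$$
The cost function $J$ is a function of the true labels $y$ and the predicted values $\hat{y}$, which depend on all the
parameters of the network. We indicated it with a sum to include the case where more than one output node
is present.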
If the above formulas are hard to read in maths, here’s a code version of them. We allocate an array W with
random values for the weights $w^{(l)}_{jk}$. In this particular example, imagine a set of weights connecting a layer
with 4 units to a layer with 2 units:
We also need an array for the biases, with as many elements as there are units in the receiving layer, i.e. 2:
The output of the layer with 4 elements is represented by the array a, whose elements are $a^{(l-1)}_j$:
In [26]: z = np.dot(a, W) + b
In [27]: z
z is indexed by the letter k. There are 2 entries, one for each of the units in the receiving layer. Similarly you
can write code examples for the other equations.
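The cells that allocate `W`, `b` and `a` are not shown above, so here is a minimal self-contained sketch of the same computation. The random values, the seed and the use of `np.random.default_rng` are assumptions made for illustration only. The explicit double loop spells out the sum over $j$ for each receiving node $k$, so that you can compare it with the one-line `np.dot` version.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility

W = rng.normal(size=(4, 2))      # W[j, k]: emitting node j, receiving node k
b = rng.normal(size=2)           # one bias per receiving node
a = rng.normal(size=4)           # activations of the emitting layer

# Vectorised input sum, as in the cell above
z = np.dot(a, W) + b

# Same quantity written as the explicit sum over j for each k
z_loop = np.zeros(2)
for k in range(2):
    for j in range(4):
        z_loop[k] += a[j] * W[j, k]
    z_loop[k] += b[k]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Activations of the receiving layer
a_next = sigmoid(z)

print(z, z_loop, a_next)
```

Both `z` and `z_loop` have 2 entries, one per receiving node, and they hold the same numbers.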
Backpropagation
Although they may seem a bit more complicated, the only thing that changed is that now each node takes
multiple inputs, each with its own weight and so the input sums z are actually summing up the
contributions of the nodes in the previous layer.
The backpropagation formulas are calculated as before. Here is a summary of all of the terms:
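$$\begin{aligned}
\overline{\hat{y}_s} &= \frac{\partial J}{\partial \hat{y}_s} && (5.40) \\
\overline{z^{(L)}_s} &= \overline{\hat{y}_s} \, \sigma'(z^{(L)}_s) && (5.41) \\
\overline{b^{(L)}_s} &= \overline{z^{(L)}_s} && (5.42) \\
\overline{w^{(L)}_{rs}} &= \overline{z^{(L)}_s} \, a^{(L-1)}_r && (5.43) \\
&\;\;\vdots && (5.44) \\
&\;\;\vdots && (5.45) \\
\overline{a^{(l)}_k} &= \sum_m w^{(l+1)}_{km} \, \overline{z^{(l+1)}_m} && (5.46) \\
\overline{z^{(l)}_k} &= \overline{a^{(l)}_k} \, \sigma'(z^{(l)}_k) && (5.47) \\
\overline{b^{(l)}_k} &= \overline{z^{(l)}_k} && (5.48) \\
\overline{w^{(l)}_{jk}} &= \overline{z^{(l)}_k} \, a^{(l-1)}_j && (5.49)
\end{aligned}$$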
These equations are equivalent to the ones for the unidimensional case, with only one major difference.
The term $\overline{a^{(l)}_k}$, indicating the change in cost due to the activation at node $k$ in layer $l$, needs to take into
account all the errors in the nodes downstream at layer $l + 1$. Since the activation $a^{(l)}_k$ is part of the input of
each node in the next layer $l + 1$, we have to apply the chain rule to each of them and sum all their
contributions together.
Everything else is pretty much the same as the unidimensional case, with just a bunch of indices to keep
track of.
Matrix Notation
We can simplify the above notation a bit by using vectors and matrices to indicate all the ingredients in the
network.
Forward Pass
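$$\begin{aligned}
&\;\;\vdots && (5.51) \\
\mathbf{z}^{(l)} &= \mathbf{a}^{(l-1)} \mathbf{W}^{(l)} + \mathbf{b}^{(l)} && (5.52) \\
\mathbf{a}^{(l)} &= \sigma(\mathbf{z}^{(l)}) && (5.53) \\
&\;\;\vdots && (5.54)
\end{aligned}$$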
Backpropagation
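$$\begin{aligned}
&\;\;\vdots && (5.56) \\
\overline{\mathbf{a}^{(l)}} &= \mathbf{W}^{(l+1)\,T} \, \overline{\mathbf{z}^{(l+1)}} && (5.57) \\
\overline{\mathbf{z}^{(l)}} &= \overline{\mathbf{a}^{(l)}} \odot \sigma'(\mathbf{z}^{(l)}) && (5.58) \\
\overline{\mathbf{b}^{(l)}} &= \overline{\mathbf{z}^{(l)}} && (5.59) \\
\overline{\mathbf{W}^{(l)}} &= \mathbf{a}^{(l-1)} \, \overline{\mathbf{z}^{(l)}}^{\,T} && (5.60) \\
&\;\;\vdots && (5.61)
\end{aligned}$$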
The circled dot $\odot$ indicates the element-wise product, also called the Hadamard product, whereas
two matrices written next to each other indicate that a matrix multiplication is taking place.
1. Forward pass: we calculate the input-sum and activation of each neuron proceeding from input to
output.
2. We calculate the error signal of the final layer, by obtaining the gradient of the cost function with
respect to the outputs of the network. This expression will depend on the training data and training
labels, as well as the chosen cost function, but it is well defined for given training data and cost.
3. We propagate the error backwards at each operation by taking into account the error signals at the
outputs affected by that operation as well as the kind of operation performed by that specific node.
4. We proceed back till we get to the weights multiplying the input, at which point we are done.
A couple of observations:
- The gradient of the cost function with respect to the weights is a matrix with the same shape as the weight matrix.
- The gradient of the cost function with respect to the biases is a vector with the same shape as the biases.
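The steps and the two shape observations above can be checked with a small NumPy sketch. Everything here is an assumption made for illustration: a 4-3-2 network with sigmoid activations, a squared-error cost summed over the outputs, random parameters, and a single input stored as a 1-D array. The code follows the index form of the equations directly, with weights stored as `W[j, k]`. Where transposes land in matrix notation depends on whether vectors are treated as rows or columns, so the sketch avoids the issue by using `np.outer` and a plain matrix-vector product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z1 = x @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    y_hat = sigmoid(z2)
    return z1, a1, z2, y_hat

def cost(y, y_hat):
    # Squared error summed over the output nodes
    return 0.5 * np.sum((y_hat - y) ** 2)

def backward(x, y, W1, b1, W2, b2):
    z1, a1, z2, y_hat = forward(x, W1, b1, W2, b2)
    y_hat_bar = y_hat - y                     # error signal of the final layer
    z2_bar = y_hat_bar * y_hat * (1 - y_hat)  # element-wise (Hadamard) product
    b2_bar = z2_bar
    W2_bar = np.outer(a1, z2_bar)             # entry [r, s] = a1[r] * z2_bar[s]
    a1_bar = W2 @ z2_bar                      # sum over all downstream nodes
    z1_bar = a1_bar * a1 * (1 - a1)
    b1_bar = z1_bar
    W1_bar = np.outer(x, z1_bar)
    return W1_bar, b1_bar, W2_bar, b2_bar

rng = np.random.default_rng(1)
x = rng.normal(size=4)
y = np.array([1.0, 0.0])
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

W1_bar, b1_bar, W2_bar, b2_bar = backward(x, y, W1, b1, W2, b2)

def numerical_grad_W1(eps=1e-6):
    # Brute-force check: perturb one weight at a time
    grad = np.zeros_like(W1)
    for j in range(W1.shape[0]):
        for k in range(W1.shape[1]):
            Wp, Wm = W1.copy(), W1.copy()
            Wp[j, k] += eps
            Wm[j, k] -= eps
            jp = cost(y, forward(x, Wp, b1, W2, b2)[-1])
            jm = cost(y, forward(x, Wm, b1, W2, b2)[-1])
            grad[j, k] = (jp - jm) / (2 * eps)
    return grad

print(W1_bar.shape, W1.shape, b1_bar.shape, b1.shape)
print(np.max(np.abs(W1_bar - numerical_grad_W1())))
```

The gradients come out with the same shapes as the parameters they belong to, and the last printed number (the largest disagreement with the brute-force estimate) should be tiny.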
Congratulations! You’ve now gone through the backpropagation algorithm and hopefully see that it’s just
many matrix multiplications. The bigger the network, the bigger your matrices will be and so the more
expensive the matrix multiplications. We will go back to this in a few sections. For now, give yourself a pat on the
back: Neural Networks have no more mysteries for you!
Gradient descent
How do backpropagation and gradient descent work in practice in Deep Learning? Let’s use a real world
dataset to explore how this is done in detail.
Let’s say you’ve just been hired by the government for a very important task. A group of counterfeiters is
using fake banknotes and this is creating all sorts of problems. Luckily your colleague Agent Jones managed
to get hold of a stack of fake banknotes and bring them to the lab for inspection. You’ve scanned true and
fake notes and extracted four spectral features. Let’s build a classifier that can distinguish them.
[Figure: Banknotes]
In [28]: df = pd.read_csv('../data/banknotes.csv')
df.head()
Out[28]:
The four features come from the images (see UCI database for details) and they are like a fingerprint of each
image. Another way to look at it is to say that feature engineering has already been done and we have now 4
numbers representing the relevant properties of each image. The class column indicates if a banknote is
true or fake, with 0 indicating true and 1 indicating fake.
In [29]: df['class'].value_counts()
Out[29]:
class
0 762
1 610
We can also calculate the fraction of the larger class by dividing its count (the first row above) by the total number of rows:
In [30]: df['class'].value_counts()[0]/len(df)
Out[30]: 0.5553935860058309
The larger class amounts to about 55% of the total, so if we build a model it needs to have an accuracy higher
than 55% in order to be useful.
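The same baseline can be sketched without pandas. The label list below is a stand-in built from the class counts printed above (762 and 610), not the real column. The idea is that a model that always predicts the most frequent class is right exactly as often as that class occurs.

```python
from collections import Counter

# Stand-in labels reproducing the class counts shown above
labels = [0] * 762 + [1] * 610

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)
```

Any classifier we build should beat this number, otherwise it has learned nothing beyond the class frequencies.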
Let’s use seaborn.pairplot for a quick visual inspection of the data. First we load the library:
Then we plot the whole dataset using a pairplot like we did for the Iris flower dataset in the previous
chapters. A pairplot allows us to look at how pairs of features are correlated, as well as how each feature is
correlated with the labels. Also, a pairplot displays the histogram of each feature along the diagonal and we
can use the hue parameter to color the data using the labels. Pretty nice!
We can see from the plot that the two sets of banknotes seem quite well separable. In other words, the orange
and the blue scatters do not completely overlap. This suggests that we will manage to build a
good classifier and bust the counterfeiters.
Let’s start by building a reference model using Scikit-Learn. As we have seen in Chapter 3,
Scikit-Learn is a great Machine Learning library for Python. It implements many classical algorithms like
Decision Trees, Support Vector Machines, Random Forest and more. It also has many preprocessing and
model evaluation routines, so we strongly encourage you to learn to use it well.
For the purpose of this Chapter, we would like a model that trains fast, that does not require too much
pre-processing and feature engineering and that is known to give good results.
Luckily for us such a model exists and it’s called Random Forest.
Random Forest
Random Forest is an ensemble learning method for classification, regression and other tasks, that operates
by constructing a multitude of decision trees at training time and outputting the class that is the mode of the
classes (classification) or mean prediction (regression) of the individual trees. You can think of it as a
Decision Tree on steroids!
For the purpose of this Chapter it is not fundamental that you understand the internals of how the Random
Forest classifier works. The important point here is that it’s a model that works quite well and so we will use
it for comparison.
Once the model, the features X and the labels y are defined, we are ready to train the model. In order to be
quick and effective in judging the performance of our model we will use a 3-fold cross validation as done
many times in Chapter 3. First we load the cross_val_score function:
And then we run it with the model, features and labels as arguments. This function will return 3 values for
the test accuracy, one for each of the 3 folds.
In [37]: cross_val_score(model, X, y)
The Random Forest model seems to work really well on this dataset. We obtain an accuracy score higher
than 99% with a 3-fold cross-validation. This is really good and it also shows us how in some cases
traditional ML methods are very fast and effective solutions.
We can also get the score on a fixed train/test split in order to compare it later with a Neural Network
based model.
Let’s train our model and check the accuracy score now:
Out[40]: 0.9927184466019418
Let’s build a Logistic Regression model in Keras and train it. As we have seen, the parameters of the model
are updated using the gradient calculated from the cost function evaluated on the training data.
In principle, we could feed the training data one point at a time: for each pair of features and label, we calculate
the cost and the gradient and update the weights accordingly. This is called Stochastic Gradient Descent
(also SGD). Once our model has seen each training data point once, we say that an Epoch has completed, and we
start again from the first training pair with the following epoch. Let’s manually run one epoch on this simple
model.
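To make the idea of "one point at a time" concrete before turning to Keras, here is a NumPy sketch of one epoch of SGD on a logistic regression. It is an illustration under stated assumptions, not the Keras implementation: the data is synthetic (200 points with 4 features, generated from made-up weights `true_w`), the learning rate of 0.01 is arbitrary, and the gradient uses the known result that, for a sigmoid output with binary cross-entropy, the derivative of the cost with respect to the input sum is simply $\hat{y} - y$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Synthetic stand-in for the four banknote features and the labels
X = rng.normal(size=(200, 4))
true_w = np.array([1.5, -2.0, 0.5, 1.0])   # hypothetical generating weights
y = (X @ true_w > 0).astype(float)

w = np.ones(4)    # weights initialised to one
b = 0.0
lr = 0.01         # illustrative learning rate
tiny = 1e-12      # avoids log(0)

losses = []
for i in range(len(X)):                 # one epoch: every point exactly once
    x_i, y_i = X[i], y[i]
    p = sigmoid(x_i @ w + b)            # prediction for this single point
    loss = -(y_i * np.log(p + tiny) + (1 - y_i) * np.log(1 - p + tiny))
    losses.append(loss)
    z_bar = p - y_i                     # dJ/dz for sigmoid + cross-entropy
    w -= lr * z_bar * x_i               # weight update, proportional to the input
    b -= lr * z_bar                     # bias update

print(len(losses), w, b)
```

The list `losses` holds one value per training point, and it will typically jump around a lot from one update to the next.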
Notice that since this is a Logistic Regression we will only have one Dense layer, with an output of 1 and a
sigmoid activation function. By now you should be very familiar with all this, but in case you have doubts
you may go back to Chapter 4 where we explained Dense layers in more detail.
And then let’s define the model. We will initialize the weights to one for this time, using the
kernel_initializer parameter. It’s not a good initialization, but it will guarantee that we all get the same
results without any artifacts due to random initialization:
Then we compile the model as usual. Notice that, since we only have 1 output node with a sigmoid
activation, we will have to use the binary_crossentropy loss, also introduced in Chapter 4.
We compile the model using the sgd optimizer, which stands for Stochastic Gradient Descent. We will
discuss this optimizer along with other more powerful ones later in this chapter, so stay tuned.
Finally, we will compile the model requesting that the accuracy metric is also calculated at each iteration.
In [43]: model.compile(optimizer='sgd',
loss='binary_crossentropy',
metrics=['accuracy'])
Finally we save the initial weights so that we can always reset the model to this starting point.
The method .train_on_batch performs a single gradient update over one batch of samples, so we can use
it to train the model on a single data point at a time and then visualize how the loss changes at each point.
Normally we train models one batch at a time, passing several points at once and calculating the average
gradient correction. The next plot will make it very clear why.
Let’s train the model one point at a time first, for one epoch, i.e. passing all of the training data once:
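In [45]: losses = []
         # one epoch: visit every training point exactly once
         idx = range(len(X_train))
         for i in idx:
             # slicing with i:i+1 keeps the batch dimension (a batch of size 1)
             loss, _ = model.train_on_batch(X_train[i:i+1],
                                            y_train[i:i+1])
             losses.append(loss)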
Let’s plot the losses we have just calculated. As you will see the value of the loss changes greatly from one
update to the next:
In [46]: plt.plot(losses)
plt.title('Binary Crossentropy Loss, One Epoch')
plt.xlabel('Data point index')
plt.ylabel('Loss');
As you can see in the plot, passing one data point at a time results in a very noisy estimation of the gradient.
We can improve the estimation of the gradient by averaging the gradients over a few points contained in a
mini-batch.
Common choices for the mini-batch size are 16, 32, 64, 128, 256, generally powers of 2. With mini-batch
gradient descent, we do N/B weight updates per epoch, with N equal to the number of points in the
training set and B equal to the number of points in a mini-batch.
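Two small things are worth seeing in numbers: how many updates an epoch contains, and why averaging helps. The sketch below uses only the standard library and made-up values (N = 1000 points, B = 16, and per-point "gradients" simulated as 1 plus Gaussian noise); the helper `minibatch_slices` is hypothetical. When N is not a multiple of B the last mini-batch is simply smaller, so the number of updates is N/B rounded up.

```python
import random
import statistics

def minibatch_slices(n_points, batch_size):
    # (start, stop) index pairs covering the data in consecutive chunks
    return [(start, min(start + batch_size, n_points))
            for start in range(0, n_points, batch_size)]

N, B = 1000, 16
slices = minibatch_slices(N, B)
print(len(slices), slices[-1])   # number of updates per epoch, last chunk

# Simulated noisy per-point gradient estimates around a true value of 1.0
random.seed(0)
point_grads = [1.0 + random.gauss(0, 1) for _ in range(N)]

# Averaging within each mini-batch gives a much less noisy estimate
batch_grads = [statistics.mean(point_grads[s:e]) for s, e in slices]

print(statistics.stdev(point_grads), statistics.stdev(batch_grads))
```

The spread of the mini-batch averages is several times smaller than the spread of the single-point estimates, which is why the loss curve becomes smoother.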
In [47]: model.set_weights(weights)
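In [48]: B = 16
         batch_losses = []
         for i in range(0, len(X_train), B):  # consecutive chunks of B points
             loss, _ = model.train_on_batch(X_train[i:i+B],
                                            y_train[i:i+B])
             batch_losses.append(loss)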
Now let’s plot the losses calculated with mini-batch gradient descent over the losses calculated at each point.
As you will see, the loss decreases in a much smoother fashion:
The mini-batch method is what keras automatically does for us when we invoke the .fit method. When
we run model.fit we can specify the number of epochs and the batch_size, like we have been doing
many times:
In [50]: model.set_weights(weights)
Now that we’ve trained the model, we can evaluate its performance on the test set using the
model.evaluate method. This is somewhat equivalent to the model.score method in Scikit-Learn. It
returns a list with the loss and all the other metrics we passed when we executed model.compile.
With 20 epochs of training the logistic regression model does not perform as well as the Random Forest
model yet. Let’s see how we can improve it. One direction that we can explore to improve a model is to tune
the hyperparameters. We will start from the most obvious one, which is the Learning Rate.
Learning Rates
Let’s explore what happens to the performance of our model if we change the learning rate. We can do this
with a simple loop where we perform the following steps:
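In [53]: from keras.optimizers import SGD

         dflist = []
         # learning rates matching the "Done" lines printed below
         learning_rates = [0.01, 0.05, 0.1, 0.5]
         for lr in learning_rates:
             model.compile(loss='binary_crossentropy',
                           optimizer=SGD(lr=lr),
                           metrics=['accuracy'])
             # restart every run from the same saved weights
             model.set_weights(weights)
             # fit arguments are not shown in the text; batch_size and epochs are assumptions
             h = model.fit(X_train, y_train, batch_size=16,
                           epochs=10, verbose=0)
             dflist.append(pd.DataFrame(h.history,
                                        index=h.epoch))
             print("Done: {}".format(lr))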
Done: 0.01
Done: 0.05
Done: 0.1
Done: 0.5
We can concatenate all our results in a single DataFrame for easy visualization using the pd.concat function along
the columns axis.
In [55]: historydf
Out[55]:
And we can add information about the learning rate in a secondary column index using the
pd.MultiIndex class.
historydf.columns = idx
In [57]: historydf
Out[57]:
Now we can display the behavior of loss and accuracy as a function of the learning rate.
In [58]: ax = plt.subplot(211)
hxs = historydf.xs('loss', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Loss")
ax = plt.subplot(212)
hxs = historydf.xs('acc', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Accuracy")
plt.xlabel("Epochs")
plt.tight_layout();
As expected a small learning rate gives a much slower decrease in the loss. Another hyperparameter we can
try to tune is the Batch Size. Let’s see how changing batch size affects the convergence of the model.
Batch Sizes
Let’s loop over increasing batch sizes from a single point up to 128.
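In [59]: dflist = []
         # batch sizes matching the "Done" lines printed below
         batch_sizes = [1, 4, 16, 32, 64, 128]
         for batch_size in batch_sizes:
             model.compile(loss='binary_crossentropy',
                           optimizer='sgd',
                           metrics=['accuracy'])
             # restart every run from the same saved weights
             # (assumed, as in the learning rate loop)
             model.set_weights(weights)
             h = model.fit(X_train, y_train,
                           batch_size=batch_size,
                           verbose=0, epochs=20)
             dflist.append(pd.DataFrame(h.history,
                                        index=h.epoch))
             print("Done: {}".format(batch_size))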
Done: 1
Done: 4
Done: 16
Done: 32
Done: 64
Done: 128
Like we did above we can arrange the results in a Pandas Dataframe for easy display. Notice how we are
using the pd.MultiIndex.from_product function to create a multi-index for the columns so that the
data is organized by batch size and by metric.
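The arrangement just described can be sketched on toy data. The numbers in `histories` are made up, and only two batch sizes and three epochs are used to keep the output small; the level names `batch_size` and `metric` follow the table shown below. The calls themselves (`pd.concat`, `pd.MultiIndex.from_product`, `DataFrame.xs`) are standard pandas.

```python
import pandas as pd

# Made-up per-epoch histories for two batch sizes
histories = {
    16: {'loss': [0.90, 0.60, 0.45], 'acc': [0.55, 0.70, 0.80]},
    32: {'loss': [1.10, 0.85, 0.70], 'acc': [0.50, 0.60, 0.68]},
}
batch_sizes = list(histories.keys())

# One DataFrame per run, indexed by epoch
dflist = [pd.DataFrame(histories[b], index=range(3)) for b in batch_sizes]

# Put the runs side by side, along the columns axis
historydf = pd.concat(dflist, axis=1)

# Two-level column index: (batch size, metric), in the same order as the columns
metrics_reported = list(dflist[0].columns)
idx = pd.MultiIndex.from_product([batch_sizes, metrics_reported],
                                 names=['batch_size', 'metric'])
historydf.columns = idx

# Select one metric across all batch sizes
loss_by_batch = historydf.xs('loss', axis=1, level='metric')
print(historydf)
print(loss_by_batch)
```

The product index works because `pd.concat` keeps the runs in list order and each run has its metrics in the same order, so the generated (batch size, metric) pairs line up with the existing columns.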
In [61]: historydf
Out[61]:
batch_size 1 4 16 32 64 128
metric loss acc loss acc loss acc loss acc loss acc loss acc
0 0.409446 0.869792 1.121512 0.623958 2.652906 0.250000 3.320561 0.186458 3.818151 0.127083 4.147772 0.105208
1 0.091748 0.972917 0.209368 0.929167 0.970648 0.543750 1.955286 0.330208 2.814991 0.238542 3.452607 0.160417
2 0.069933 0.980208 0.142254 0.959375 0.477751 0.797917 1.192059 0.463542 2.192038 0.307292 2.940897 0.209375
3 0.058504 0.986458 0.116638 0.972917 0.326435 0.881250 0.756515 0.647917 1.718532 0.353125 2.557043 0.273958
4 0.052077 0.984375 0.101401 0.971875 0.256029 0.912500 0.538703 0.767708 1.338816 0.423958 2.244987 0.298958
5 0.047357 0.985417 0.092057 0.976042 0.215487 0.927083 0.421322 0.827083 1.045915 0.516667 1.977632 0.323958
6 0.044281 0.986458 0.084392 0.977083 0.189522 0.939583 0.350943 0.872917 0.832616 0.604167 1.738991 0.354167
7 0.041591 0.988542 0.078510 0.977083 0.170528 0.944792 0.303833 0.896875 0.683029 0.685417 1.523876 0.387500
8 0.039047 0.986458 0.073547 0.978125 0.156586 0.952083 0.269729 0.906250 0.577486 0.742708 1.331938 0.415625
9 0.038182 0.988542 0.069614 0.982292 0.145689 0.955208 0.244082 0.917708 0.500669 0.791667 1.166540 0.472917
10 0.036937 0.988542 0.066473 0.983333 0.136822 0.965625 0.224280 0.927083 0.443691 0.819792 1.022474 0.519792
11 0.035518 0.988542 0.063356 0.982292 0.129631 0.967708 0.208520 0.931250 0.399797 0.840625 0.902191 0.562500
12 0.034495 0.987500 0.060760 0.984375 0.123468 0.969792 0.195154 0.935417 0.364787 0.864583 0.804595 0.616667
13 0.033401 0.988542 0.058514 0.986458 0.118133 0.972917 0.184068 0.939583 0.336677 0.884375 0.724410 0.654167
14 0.032379 0.988542 0.056788 0.983333 0.113554 0.971875 0.174797 0.942708 0.313450 0.890625 0.655693 0.702083
15 0.032016 0.988542 0.054849 0.985417 0.109539 0.975000 0.166773 0.945833 0.293672 0.897917 0.598723 0.730208
16 0.031432 0.987500 0.053285 0.985417 0.105950 0.978125 0.159736 0.950000 0.276935 0.903125 0.551560 0.757292
17 0.031232 0.988542 0.051734 0.986458 0.102629 0.979167 0.153550 0.951042 0.262416 0.907292 0.511463 0.784375
18 0.030454 0.989583 0.050314 0.986458 0.099699 0.977083 0.148115 0.955208 0.249905 0.914583 0.477414 0.803125
19 0.029632 0.987500 0.049410 0.986458 0.096997 0.980208 0.143223 0.957292 0.238602 0.921875 0.448099 0.818750
In [62]: ax = plt.subplot(211)
hxs = historydf.xs('loss', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Loss")
ax = plt.subplot(212)
hxs = historydf.xs('acc', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Accuracy")
plt.xlabel("Epochs")
plt.tight_layout();
Smaller batches allow for more updates in a single epoch. On the other hand, they take much longer to run a
single epoch, so there’s a trade-off between speed of training (measured as number of gradient updates)
and speed of convergence (measured as number of epochs). In practice a batch size of 16 or 32 data points is
often used.
A recent research article suggests starting with a small batch size and then increasing it gradually. We encourage
you to try and experiment with that strategy as well.
Optimizers
The optimizer is the algorithm used internally by Keras to update the weights and move the model towards
lower values of the cost function. Keras implements several optimizers that go by fancy names like SGD,
Adam, RMSProp and many more. Despite these smart sounding names, the optimizers are all variations of
the same concept, which is Stochastic Gradient Descent, or SGD.
SGD is so fundamental that we have invented an acronym to help you remember it. If you find it hard to
remember Stochastic Gradient Descent, just think Simply Go Down, which is exactly what SGD does!
TIP: In the next pages you will find some mathematical symbols when the algorithms are
explained. We highlighted the algorithms pseudo-code parts with a blue box like this:
Here’s the algorithm
Feel free to skim through these if maths is not your favourite thing, you’ll find a practical
comparison of optimizers just after this section.
Let’s begin our discovery of optimizers with a review of the SGD algorithm. SGD only needs one
hyper-parameter: the learning rate. Once we know the learning rate, we proceed in a loop by:
SGD
– Extract a random batch from the training set, with corresponding training labels
– Evaluate the average cost function $J(y, \hat{y})$ using the points in the batch
– Evaluate the gradient $g = \nabla_w J(w)$ using the points in the batch and the current value of the
parameters $w$
– Apply the update rule: $w \rightarrow w - \eta g$
The stop rule could be a fixed number of updates or epochs, as well as a condition on the amount of change in
the cost function. For example, we could decide to stop the training loop if the value of the cost is no longer
changing appreciably.
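The loop and the stop rule can be sketched in plain Python on a toy problem. All specifics here are assumptions for illustration: a one-parameter linear model fitted to synthetic data generated as y = 3x plus noise, a squared-error cost, a learning rate of 0.1, batches of 16 points, and a stop rule that compares the cost on the full data set at the end of consecutive epochs. The function names `cost_and_grad` and `sgd` are hypothetical.

```python
import random

random.seed(0)

# Synthetic data: y = 3x + small noise
xs = [random.uniform(-1.0, 1.0) for _ in range(256)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.1)) for x in xs]

def cost_and_grad(w, points):
    # Average squared-error cost (with the 1/2 factor) and its gradient
    n = len(points)
    cost = sum(0.5 * (w * x - y) ** 2 for x, y in points) / n
    grad = sum((w * x - y) * x for x, y in points) / n
    return cost, grad

def sgd(data, w=0.0, lr=0.1, batch_size=16, max_epochs=200, tol=1e-4):
    updates_per_epoch = len(data) // batch_size
    prev_cost, _ = cost_and_grad(w, data)
    for epoch in range(max_epochs):
        for _ in range(updates_per_epoch):
            batch = random.sample(data, batch_size)   # extract a random batch
            _, g = cost_and_grad(w, batch)            # gradient on the batch
            w = w - lr * g                            # update rule
        cost, _ = cost_and_grad(w, data)
        if abs(prev_cost - cost) < tol:               # stop rule
            break
        prev_cost = cost
    return w, epoch + 1

w_fit, epochs_run = sgd(data)
print(w_fit, epochs_run)
```

The fitted `w_fit` should land close to 3, and the loop stops as soon as one more epoch no longer changes the cost by more than `tol`, or after `max_epochs` at the latest.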
Momentum
In recent years, several improvements have been proposed to this formula. In particular, we would like to
have a
A first improvement of the SGD is to add momentum. Momentum means that we accumulate the gradient
corrections in a variable v called velocity, that basically serves as a smoothed version of the gradient.
• Like SGD, choose an initial vector of parameters w, a learning rate η and a momentum parameter µ
• Repeat until stop rule:
Applying momentum is like saying: if you are going down in a direction, then you should keep going more
or less in that direction minus a small correction given by the new gradients. It’s as if instead of walking
downhill, we would roll down like a ball. The name comes from physics, in case you’re curious.
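Here is a minimal sketch of momentum in plain Python. It assumes the classical formulation, in which the velocity is updated as v → µv − ηg and the parameters as w → w + v; the exact form used in the algorithm box is an assumption here. The cost is a made-up quadratic bowl that is ten times steeper in one direction than in the other, and full gradients are used instead of batches so that the run is deterministic.

```python
def grad(w):
    # Gradient of J(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)
    return [w[0], 10.0 * w[1]]

def cost(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def run(mu, lr=0.01, steps=100):
    w = [1.0, 1.0]
    v = [0.0, 0.0]   # velocity: a running accumulation of gradient corrections
    for _ in range(steps):
        g = grad(w)
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return cost(w)

plain_cost = run(mu=0.0)      # mu = 0 reduces to the plain update rule
momentum_cost = run(mu=0.9)

print(plain_cost, momentum_cost)
```

With the same learning rate and the same number of steps, the run with momentum ends at a noticeably lower cost on this toy problem, because the accumulated velocity keeps it moving along the shallow direction.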
AdaGrad
SGD and SGD + momentum keep the learning rate constant for each parameter. This can be problematic if
the parameters are sparse (i.e. most of them are zero except a few).
Adaptive algorithms like AdaGrad overcome this problem by accumulating the square of the gradient into a
normalization variable for each of the parameters. The result of this is that each parameter will have a
personalized learning rate. Parameters whose gradient is large will have a learning rate that decreases fast,
while parameters that have small gradients will keep a large learning rate.
• Like SGD, choose an initial vector of parameters $w$, a learning rate $\eta$, a small constant $\delta = 10^{-7}$ to
avoid division by 0
• Repeat until stop rule:
Let’s break down the above equation for the update so that we understand it fully. Both the accumulation
step and the update step are computed element by element, so we can focus on a single parameter.
• For a single parameter $w_i$, $g \odot g$ is equivalent to $g_i^2$, so we are accumulating the square of the gradient
in a variable $r_i$ for each parameter.
• $\frac{\eta}{\delta + \sqrt{r}} \odot g$ may look a bit daunting at first, so let’s break it down. $\eta$ is the learning rate, no surprises
here. For a single parameter $w_i$ we are dividing the value of the gradient $g_i$ by the square root of the
accumulated square gradients $r_i$. If the gradients are large we will be dividing by a large quantity. On
the other hand, if the gradients are small, we will be dividing by a small quantity. This yields a
practically constant update step size, multiplied by the learning rate. The $\delta$ in the denominator is a
numerical regularization constant, so that we do not risk dividing by zero if $r$ becomes too small.
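The two bullet points can be checked with a small sketch in plain Python. It assumes the accumulation step r → r + g ⊙ g described above, followed by the update $w \rightarrow w - \frac{\eta}{\delta + \sqrt{r}} \odot g$. The gradient values (one large, one small) and the learning rate are made up for illustration, and `adagrad_step` is a hypothetical helper name.

```python
import math

def adagrad_step(w, r, g, lr=0.1, delta=1e-7):
    # Accumulate the squared gradient, element by element
    r = [ri + gi * gi for ri, gi in zip(r, g)]
    # Each parameter gets its own effective learning rate lr / (delta + sqrt(r_i))
    w = [wi - lr / (delta + math.sqrt(ri)) * gi
         for wi, ri, gi in zip(w, r, g)]
    return w, r

w = [0.0, 0.0]
r = [0.0, 0.0]
g = [10.0, 0.1]    # one large and one small gradient

w1, r1 = adagrad_step(w, r, g)
print(w1)          # both parameters move by roughly the learning rate

w2, r2 = adagrad_step(w1, r1, g)
step1 = [abs(b - a) for a, b in zip(w, w1)]
step2 = [abs(b - a) for a, b in zip(w1, w2)]
print(step1, step2)   # the second step is smaller: r keeps growing
```

Even though the two gradients differ by a factor of 100, the first update has practically the same size for both parameters. The second update is smaller, since the accumulated squares in `r` can only grow.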
RMSProp is also adaptive, but it lets us choose the fraction of squared gradients to accumulate, using an
Exponentially Weighted Moving Average (or EWMA) decay in the accumulation formula. If you’re not
familiar with how EWMA works, we strongly encourage you to review the Appendix. EWMA is the most
important algorithm of your life!
• Like SGD, choose an initial vector of parameters $w$, a learning rate $\eta$, a small constant $\delta = 10^{-7}$ to
avoid division by zero and an EWMA mixing factor $\rho$ between 0 and 1, also called the decay rate
• Repeat until stop rule:
– Same 3 steps as SGD (get batch, evaluate cost, evaluate gradient)
– Accumulate EWMA of the square of the gradient: $r \rightarrow \rho r + (1 - \rho) g \odot g$
– Same update rule as AdaGrad
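The steps listed above can be sketched for a single parameter in plain Python. The constant gradient of 2, the learning rate of 0.01 and the decay rate of 0.9 are made-up values, and `rmsprop_step` is a hypothetical helper name. With a constant gradient the EWMA settles at the squared gradient instead of growing forever, so the step size settles at the learning rate.

```python
import math

def rmsprop_step(w, r, g, lr=0.01, rho=0.9, delta=1e-7):
    # EWMA of the squared gradient
    r = rho * r + (1 - rho) * g * g
    # Same update rule as AdaGrad, using the EWMA instead of the full sum
    w = w - lr / (delta + math.sqrt(r)) * g
    return w, r

w, r = 0.0, 0.0
steps = []
for _ in range(200):
    w_new, r = rmsprop_step(w, r, g=2.0)
    steps.append(w - w_new)   # size of this update
    w = w_new

print(r, steps[0], steps[-1])
```

The accumulator `r` converges to 4, the square of the gradient, and the update size converges to the learning rate. The very first steps are larger because `r` starts from zero and has not yet caught up, which is the effect that a bias correction is designed to address.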
Finally, let’s introduce Adam. This algorithm improves upon RMSProp by applying EWMA to the gradient
update as well as the square of the gradient.
• Like SGD, choose an initial vector of parameters $w$, a learning rate $\eta$, a small constant $\delta = 10^{-7}$ to
avoid division by zero and two EWMA mixing factors $\rho_1$ and $\rho_2$ between 0 and 1 (usually chosen as 0.9
and 0.999 respectively)
• Repeat until stop rule:
– Same 3 steps as SGD (get batch, evaluate cost, evaluate gradient)
– Accumulate EWMA of the gradient: $v \rightarrow \rho_1 v + (1 - \rho_1) g$
– Accumulate EWMA of the square of the gradient: $r \rightarrow \rho_2 r + (1 - \rho_2) g \odot g$
– Correct bias 1: $\hat{v} = \frac{v}{1 - \rho_1^t}$
– Correct bias 2: $\hat{r} = \frac{r}{1 - \rho_2^t}$
– Compute update: $\Delta w = \eta \frac{1}{\delta + \sqrt{\hat{r}}} \odot \hat{v}$
– Apply the update rule: $w \rightarrow w - \Delta w$
This formula may also appear to be a bit complicated, so let’s walk through it step by step.
• EWMA is applied to both the gradient and its square. We are taking inspiration from both the
momentum and the RMSProp formulas.
• The only other novelty is the bias correction. We take the current value of the accumulated quantity
and divide it by $(1 - \rho^t)$. Since both decay rates are almost 1, this normalization factor is very small initially,
and it increases towards 1 as time goes by. This seems to work really well in practice.
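The effect of the bias correction is easiest to see with a constant gradient. The sketch below implements the listed steps for a single parameter in plain Python; the gradient value, the starting point and the learning rate of 0.001 are made up, and `adam_step` is a hypothetical helper name. The step counter `t` starts at 1, otherwise the correction would divide by zero.

```python
import math

def adam_step(w, v, r, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-7):
    v = rho1 * v + (1 - rho1) * g          # EWMA of the gradient
    r = rho2 * r + (1 - rho2) * g * g      # EWMA of the squared gradient
    v_hat = v / (1 - rho1 ** t)            # bias correction 1
    r_hat = r / (1 - rho2 ** t)            # bias correction 2
    w = w - lr * v_hat / (delta + math.sqrt(r_hat))
    return w, v, r

w, v, r = 1.0, 0.0, 0.0
history = [w]
for t in range(1, 4):                      # t counts updates, starting from 1
    w, v, r = adam_step(w, v, r, g=5.0, t=t)
    history.append(w)

print(history)
```

Both accumulators start at zero, so after one step `v` is only a tenth of the gradient and `r` a thousandth of its square. Dividing by $(1 - \rho^t)$ undoes exactly that shortfall, and each of the first updates has a size very close to the learning rate, regardless of the size of the gradient.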
In summary, we have seen a few of the most popular optimization algorithms. You are probably wondering
how to choose the best one. Unfortunately there is no best one, and each of them performs better in some
conditions. What is true though, is that a good choice of the hyper-parameters is key for an algorithm to
perform well, and we encourage you to familiarize yourself with one algorithm and understand the effects of
changing its hyper-parameters.
Let’s compare the performance of a few optimizers in keras. Optimizers are available in the
keras.optimizers module, so let’s start by importing them:
We then set the learning rate to be the same for each of them and run the training for 5 epochs each:
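In [64]: from keras.optimizers import SGD, Adam, Adagrad, RMSprop

         dflist = []
         opts = ['SGD(lr=0.01)',
                 'SGD(lr=0.01, momentum=0.3)',
                 'SGD(lr=0.01, momentum=0.3, nesterov=True)',
                 'Adam(lr=0.01)',
                 'Adagrad(lr=0.01)',
                 'RMSprop(lr=0.01)']
         for opt_name in opts:
             # each entry is a string, so eval turns it into an optimizer instance
             model.compile(loss='binary_crossentropy',
                           optimizer=eval(opt_name),
                           metrics=['accuracy'])
             model.set_weights(weights)
             # batch_size is an assumption; the text only states 5 epochs
             h = model.fit(X_train, y_train, batch_size=16,
                           epochs=5, verbose=0)
             dflist.append(pd.DataFrame(h.history,
                                        index=h.epoch))
             print("Done: ", opt_name)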
Done: SGD(lr=0.01)
Done: SGD(lr=0.01, momentum=0.3)
ax = plt.subplot(121)
hxs = historydf.xs('loss', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Loss")
ax = plt.subplot(122)
hxs = historydf.xs('acc', axis=1, level='metric')
hxs.plot(ylim=(0,1), ax=ax)
plt.title("Accuracy")
plt.xlabel("Epochs")
plt.tight_layout();
As you can see, in this particular case, some optimizers converge a lot faster than others. This could be due
to the particular combination of hyper-parameters chosen as well as to their better performance on this
particular problem. We encourage you to try out different optimizers on your problems, as well as trying
different hyper-parameter combinations.
Initialization
So far we have explored the effect of learning rate, batch size and optimizers on the speed of convergence of
a model. We have compared their effect starting from the same set of initial weights. What if
we initialized the weights in a different way and kept everything else fixed? This may seem unimportant but it
turns out that the initialization is actually critical. A model may not converge at all for some initializations
and converge really quickly for others. While we don’t really understand this fully, we
have a few heuristic strategies available that we can test, looking for the best one for our specific problem.
keras offers the possibility to initialize the weights in several ways including:
• Zeros, ones, constant: all weights are initialized to zero, to one or to a constant value. Generally these
are not good choices, because all the nodes in a layer start out identical, which leaves the model uncertain
on which parameters to optimize first.
Initialization strategies try to “break the symmetry” by assigning random values to the parameters. The
range and type of the random distribution can vary and several initialization schemes have been proposed:
• Random uniform: each weight receives a random value chosen with uniform probability from a fixed
interval (in Keras the default interval is [-0.05, 0.05]).
• Lecun_uniform: like the above, but the values are drawn in the interval [-limit, limit], where limit is
$\sqrt{3 / \#\text{inputs}}$ and # inputs indicates the number of inputs in the weight tensor for a specific layer.
• Normal: each weight receives a random value drawn from a normal distribution with mean 0 and
a fixed standard deviation (0.05 by default in Keras).
• He_normal: like the previous one, but with standard deviation $\sigma = \sqrt{2 / \#\text{in}}$.
• Glorot_normal: like the previous one, but with standard deviation $\sigma = \sqrt{2 / (\#\text{in} + \#\text{out})}$.
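To make these formulas concrete, here is a minimal numpy sketch that samples a weight matrix under three of the schemes above. The helper names and the layer sizes (4 inputs, 10 outputs) are chosen for illustration only. Note also that the Keras versions of he_normal and glorot_normal draw from a truncated normal distribution, which this sketch approximates with a plain normal.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility

def lecun_uniform(n_in, n_out):
    # uniform in [-limit, limit] with limit = sqrt(3 / n_in)
    limit = np.sqrt(3.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out):
    # normal with standard deviation sqrt(2 / n_in)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

def glorot_normal(n_in, n_out):
    # normal with standard deviation sqrt(2 / (n_in + n_out))
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

n_in, n_out = 4, 10  # illustrative layer sizes
W_lecun = lecun_uniform(n_in, n_out)
W_he = he_normal(n_in, n_out)
W_glorot = glorot_normal(n_in, n_out)

print(W_lecun.shape)
print(np.abs(W_lecun).max() <= np.sqrt(3.0 / n_in))
```

The wider the layer, the smaller the spread of the initial weights, which keeps the sum computed by each node at a comparable scale.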
You can read more about them here. In order to see the effect of initialization we’ll use a deeper network
with more than just 5 weights.
In [68]: dflist = []

inits = ['zeros', 'ones', 'uniform', 'lecun_uniform',
         'normal', 'he_normal', 'glorot_normal']

for init in inits:
    K.clear_session()

    model = Sequential()
    model.add(Dense(10, input_shape=(4,),
                    kernel_initializer=init,
                    activation='tanh'))
    model.add(Dense(10, kernel_initializer=init,
                    activation='tanh'))
    model.add(Dense(10, kernel_initializer=init,
                    activation='tanh'))
    model.add(Dense(1, kernel_initializer=init,
                    activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='sgd',
                  metrics=['accuracy'])

    # training call: the arguments used here are an assumption
    h = model.fit(X_train, y_train, batch_size=16,
                  epochs=5, verbose=0)

    dflist.append(pd.DataFrame(h.history,
                               index=h.epoch))
    print("Done: ", init)
Done: zeros
Done: ones
Done: uniform
Done: lecun_uniform
Done: normal
Done: he_normal
Done: glorot_normal
historydf.columns = idx
plt.figure(figsize=(15, 5))
ax = plt.subplot(121)
xs = historydf.xs('loss', axis=1, level='metric')
xs.plot(ylim=(0,1), ax=ax, style=styles)
plt.title("Loss")
ax = plt.subplot(122)
xs = historydf.xs('acc', axis=1, level='metric')
xs.plot(ylim=(0,1), ax=ax, style=styles)
plt.title("Accuracy")
plt.xlabel("Epochs")
plt.tight_layout();
As you can see some initializations don’t even converge, while some do converge rather quickly.
Initialization of the weights plays a very important role in large models, so it is important to try a couple of
different initialization schemes in order to get the best results.
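In [71]: K.clear_session()
model = Sequential()
# layer sizes follow the summary below; the 'relu' activation is an assumption
model.add(Dense(2, input_shape=(4,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.01),
              metrics=['accuracy'])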
We then set the model weights to some random values. In order to get reproducible results, the random values
are given for this particular run:
model.set_weights(weights)
In [75]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 2) 10
_________________________________________________________________
dense_2 (Dense) (None, 1) 3
=================================================================
Total params: 13
Trainable params: 13
Non-trainable params: 0
_________________________________________________________________
5.13. INNER LAYER REPRESENTATION 233
The model only has 2 dense layers. One connecting the input to the 2 inner nodes and one connecting these
2 inner nodes to the output. The list of layers is accessible as an attribute of the model:
In [76]: model.layers
and the inputs and outputs of each layer are also accessible as attributes. Let’s take the input of the first layer
and the output of the first layer, the one with 2 nodes:
These variables refer to objects from the Keras backend. This is Tensorflow by default but it can be switched to
other backends if needed.
In [78]: inp
In [79]: out
Both the input and the output are Tensorflow tensors. In the next chapter we will learn more about Tensors,
so don’t worry about them for now.
Notice that features_function is a function itself, so K.function is a function that returns a function.
In [81]: features_function
We can apply this function to the test data. Notice that the function expects a list of inputs and returns a list
of outputs. Since our inputs list only has one element, so will the output list and we can extract the outputs
by taking the first element:
The output tensor contains as many points as X_test each represented by 2 numbers, the output values of
the 2 nodes in the first layer:
In [83]: features.shape
Out[83]: (412, 2)
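Under the hood, these features are just the activations of the first Dense layer. The following numpy sketch reproduces the computation with stand-in data and random weights. The array `X_demo`, the weights `W1` and `b1`, and the tanh activation are all assumptions made for illustration, not the values of the trained model.

```python
import numpy as np

rng = np.random.RandomState(42)

X_demo = rng.normal(size=(412, 4))   # stand-in for X_test: 412 points, 4 features
W1 = rng.normal(size=(4, 2))         # kernel of the first layer: 4 inputs -> 2 nodes
b1 = np.zeros(2)                     # bias of the first layer: one value per node

# output of the first layer: one 2D point per input row
features_demo = np.tanh(X_demo @ W1 + b1)

print(features_demo.shape)   # (412, 2)
print(W1.size + b1.size)     # 10, the parameter count of dense_1 in the summary
```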
We can plot the data as a scatter plot, and we can see how the network has learned to represent the data in 2
dimensions in such a way that the next layer can separate the 2 classes more easily:
Let’s plot the output of the second-to-last layer at each epoch in a training loop. First we re-initialize the
model:
In [85]: model.set_weights(weights)
Then we create a K.function between the input and the output of layer 0:
Then we train the model one epoch at a time, plotting the 2D representation of the data as it comes out from
layer 0:
In [87]: plt.figure(figsize=(15,10))
plt.tight_layout();
236 CHAPTER 5. DEEP LEARNING INTERNALS
As you can see, at the beginning the network has no notion of the difference between the two classes. As the
training progresses, the network learns to represent the data in a 2 dimensional space where the 2 classes are
linearly separable, so that the final layer (which is basically a logistic regression) can easily separate them
with a straight line.
This chapter was surely more intense and theoretical than the previous ones, but we hope it gave you a
thorough understanding of the inner workings of a Neural Network and of what you can do to
improve its performance.
Exercises
Exercise 1
You’ve just been hired at a wine company and they would like you to help them build a model that predicts
the quality of their wine based on several measurements. They give you a dataset with wine:
• choose the cost function, what will you use? Mean Squared Error? Binary Cross-Entropy? Categorical
Cross-Entropy?
• choose an optimizer
• choose a value for the learning rate, you may want to try with several values
• choose a batch size
• train your model on all the data using a validation_split=0.2. Can you converge to 100%
validation accuracy?
• what’s the minimum number of epochs to converge?
• repeat the training several times to verify how stable your results are
Exercise 2
Since this dataset has 13 features we can only visualize pairs of features like we did in the Paired plot. We
could however exploit the fact that a Neural Network is a function to extract 2 high level features to
represent our data.
Exercise 3
Keras functional API. So far we’ve always used the Sequential model API in Keras. However, Keras also
offers a Functional API, which is much more powerful. You can find its documentation here. Let’s see how
we can leverage it.
Exercise 4
Keras offers the possibility to call a function at each epoch. These are Callbacks, and their documentation is
here. Callbacks allow us to add some neat functionality. In this exercise we’ll explore a few of them.
• Split the data into train and test sets with a test_size = 0.3 and random_state=42
• Reset and recompile your model
• train the model on the train data using validation_data=(X_test, y_test)
• Use the EarlyStopping callback to stop your training if the val_loss doesn’t improve
• Use the ModelCheckpoint callback to save the trained model to disk once training is finished
• Use the TensorBoard callback to output your training information to a /tmp/ subdirectory
Convolutional Neural Networks
6
Intro
In the previous chapter we dove into Deep Learning, we built our first real model, and hopefully demystified
a lot of the complicated stuff. Now it’s time to start applying Deep Learning to a kind of data where it really
shines: images!
At the root of it, what is an image anyway? The information in an image is encoded by the relations between
nearby pixels. A slightly darker or lighter image still contains the same information. Similarly, it doesn’t
matter where an object is positioned exactly, in order to recognize it. Convolutional Neural Networks
(CNN), as we will discover in this chapter, are able to encode a lot of information about relations between
nearby pixels. This makes them great tools to work with images, as well as with sequences like sounds and
movies.
In this section we will learn what convolutions are, how they can be used to filter images, and how
Convolutional Neural Nets work. By the end of this section, we will train our first CNN to recognize
handwritten digits. We will also introduce the core concept of Tensor. Are you ready? Let’s go!
humans quickly recognize a cat, whereas the computer just sees a bunch of pixels and has no prior
notion of what a cat is, nor that a cat is represented in this image. It may seem magic that Neural Networks
are able to solve the image classification problem well, but we hope that, by the end of this chapter,
how they do it will be quite clear!
Picture of a cat
In order to understand why it is so difficult for a computer to classify objects in images let’s start from how
images are represented, and in particular let’s start with a black and white image.
A black and white image can be represented as a grid of points, each point with a binary value. These points
on the grid are called pixels, and in a black and white image they only carry two possible values: 0 and 1.
Let’s create a random Black and White image with Python. As always, we start by importing the usual
libraries. By now they should be familiar, but if you need a reminder have a look at Chapter 1:
Let’s use the np.random.binomial function to generate a 10x10 square matrix of random zeros and ones.
With a single trial and a success probability of 0.5, np.random.binomial() will give us an approximately
equal amount of zeros and ones. We will use the argument size=(10, 10) to specify that we want an array
with 2 axes, each with 10 positions:
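A minimal sketch of this step could look as follows. We assume a single trial (n=1) and a probability p=0.5, which is what yields roughly as many zeros as ones; the exact arguments of the original cell may differ.

```python
import numpy as np

# n=1 trial with success probability p=0.5: each pixel is either 0 or 1
bw = np.random.binomial(1, 0.5, size=(10, 10))
```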
In [4]: bw
As promised, it’s a random set of zeros and ones. We can also use the function
matplotlib.pyplot.imshow to visualize it as an image. Let’s do it:
Awesome! We have just learned how to create a Black and White image with Python. Let’s now generate a
grayscale image.
To generate a grayscale image we simply allow the pixels to carry values that are intermediate between 0 and
1. Actually, since we do not really care about infinite possible shades of gray, we normally use unsigned
integers with 8 bits, i.e. the numbers from 0 to 255.
A 10x10 grayscale image with 8-bit resolution is a grid of numbers, each of which is an integer between 0
and 255.
Let’s draw one such image. In this case we will use the np.random.randint function, which generates
random integers uniformly distributed between a low (inclusive) and a high (exclusive) extreme. Here’s a
snippet from the documentation:
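In practice the call might look like the sketch below. Keep in mind that in np.random.randint the high value is exclusive, so we pass 256 in order to allow the value 255. The exact arguments of the original cell are an assumption.

```python
import numpy as np

# integers in [0, 256), i.e. from 0 to 255 included
gs = np.random.randint(0, 256, size=(10, 10))
```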
In [7]: gs
Out[7]: array([[168, 92, 106, 133, 44, 167, 228, 37, 157, 63],
[ 78, 100, 28, 35, 200, 165, 30, 101, 151, 113],
[193, 121, 88, 233, 163, 179, 48, 35, 78, 235],
[108, 243, 46, 13, 58, 230, 156, 67, 169, 89],
[255, 19, 211, 74, 52, 85, 185, 55, 114, 102],
[109, 37, 92, 235, 16, 141, 61, 144, 95, 72],
[ 62, 96, 171, 39, 104, 221, 144, 116, 95, 136],
[216, 116, 31, 228, 94, 233, 6, 121, 206, 23],
[ 62, 31, 73, 238, 225, 21, 193, 53, 227, 64],
[188, 75, 38, 213, 188, 213, 92, 73, 80, 230]])
As expected it’s a 10x10 grid of random integers between 0 and 255. Let’s visualize it as an image:
Wonderful! In image classification problems we have to think of images as the input into the algorithm.
Therefore, this 2D array with 100 numbers corresponds to one data point in a classification task. How could
we train a Machine Learning algorithm on such data? Let’s say we have many such gray-scale images
representing handwritten digits. How do we feed them to a Machine Learning model?
MNIST
The MNIST database is a very famous dataset of handwritten digits and it has become a benchmark for
image recognition algorithms. It consists of 70000 images of 28 by 28 pixels, each representing a
handwritten digit.
TIP: Think of how many real world applications involve recognition of handwritten digits:
- zipcodes
- tax declarations
- student tests
- . . .
6.2. MACHINE LEARNING ON IMAGES WITH PIXELS 245
Keras has its own built-in dataset for MNIST, so we will load it from there using the load_data function.
Let’s check the shape of the arrays of the data we received for the training and test sets:
In [11]: X_train.shape
In [12]: X_test.shape
The loaded data is a numpy array of order 3. It’s like a 3-dimensional matrix, whose elements are identified
by 3 indices. We’ll discuss these more in detail later in this chapter.
For now, it is sufficient to know that the first index (running from 0 to 59999 for X_train) locates a specific
image in the dataset, while the other two indices locate a certain pixel in the image, i.e. they run from 0 to
the height and width of the image minus one (0 to 27 here).
For instance, we can select the first image in the training set and take a look at its shape by using the first
index:
Notice that with the gray colormap, zeros are displayed as black pixels while higher numbers are displayed
as lighter pixels.
Pixels as features
So far our input datasets have always been 2D tabular sets, where table columns refer to different features
and each data point occupies a row. In this case, each data point is itself a 2D table (an image) and so we
need to decide how to map it to features.
The simplest way to feed images to a Machine Learning algorithm is to use each pixel in the image as an
individual feature. If we do this, we will have 28 × 28 = 784 independent features, each one being an integer
between 0 and 255, and our dataset will become tabular once again. Each row in the tabular dataset will
represent a different image, and each of the 784 columns will represent a specific pixel.
The reshape method of a numpy array allows us to reshape any array to a new shape. For example, let’s
reshape the training dataset to be a tabular dataset with 60000 rows and 784 columns:
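As a sketch, the reshape looks like this. To keep the snippet runnable without downloading MNIST we use a stand-in array of zeros with the same shape as X_train (the name `X_train_demo` is hypothetical); with the real data you would call reshape directly on X_train.

```python
import numpy as np

# stand-in with the same shape and dtype as the MNIST training images
X_train_demo = np.zeros((60000, 28, 28), dtype=np.uint8)

# one row per image, one column per pixel
X_train_flat_demo = X_train_demo.reshape(60000, 28 * 28)
print(X_train_flat_demo.shape)   # (60000, 784)
```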
We can check that the operation worked by printing the shape of X_train_flat:
In [16]: X_train_flat.shape
Wonderful! Another valid syntax for reshape is to just specify the size of the dimensions we care about and
let the method figure out the other dimension, like this:
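A minimal sketch of this second syntax, again on a stand-in array (the name `X_test_demo` and its contents are hypothetical; only the shape mirrors the MNIST test set):

```python
import numpy as np

# stand-in with the same shape as the MNIST test images
X_test_demo = np.zeros((10000, 28, 28), dtype=np.uint8)

# -1 asks numpy to infer the number of rows from the total size
X_test_flat_demo = X_test_demo.reshape(-1, 28 * 28)
print(X_test_flat_demo.shape)   # (10000, 784)
```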
In [18]: X_test_flat.shape
Great! Now we have 2 tabular datasets like the ones we are familiar with. The features contain values
between 0 and 255:
In [19]: X_train_flat.min()
Out[19]: 0
In [20]: X_train_flat.max()
Out[20]: 255
As already seen in Chapter 3, Neural Network models are quite sensitive to the absolute size of the input
features, and hence they like features that are normalized to be somewhat near 1.
We should rescale the values of our features to be between 0 and 1. Let’s do it by dividing them by 255.
Notice that we first need to convert the data type to float32: under the hood numpy arrays are implemented
in C and are therefore strongly typed, so an array of 8-bit integers cannot hold fractional values.
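The conversion and rescaling can be sketched on a tiny stand-in array (the name `X_flat_demo` and its values are hypothetical):

```python
import numpy as np

# stand-in for a few flattened images with 8-bit pixel values
X_flat_demo = np.array([[0, 128, 255],
                        [64, 32, 16]], dtype=np.uint8)

# convert to float first, then rescale to the [0, 1] interval
X_sc_demo = X_flat_demo.astype('float32') / 255.0

print(X_sc_demo.dtype)                    # float32
print(X_sc_demo.min(), X_sc_demo.max())   # 0.0 1.0
```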
Great! We now have 2D data that we can use to train a fully connected Neural Network!
Multiclass output
Since our goal is to recognize a digit contained in an image, our final output is a class label between 0 and 9.
Let’s inspect y_train to look at the target values we want to train our network to learn:
In [22]: y_train
We can use the np.unique method to check what are the unique values for the labels, these should be the
digits from 0 to 9:
In [23]: np.unique(y_train)
Since there are 10 possible output classes, this is a Multiclass classification problem where the outputs are
mutually exclusive. As we have learned in Chapter 4, we need to convert the labels to a matrix of binary
columns. In doing so, we communicate to the network that the labels are distinct and it should learn to
predict the probability that an image corresponds to a specific label.
In other words, our goal is to build a network with 784 inputs and 10 outputs, like the one represented in this
figure:
so that for a given input image the network learns to indicate to which label it corresponds. Therefore we
need to make sure that the shape of the label array matches the output of our network.
We can convert our labels to binary arrays using the to_categorical utility function from keras. Let’s
import it
Let’s double check what’s going on. As we have seen before, the first element of X_train is a handwritten
number 5. So the corresponding label should be a 5.
In [26]: y_train[0]
Out[26]: 5
In [27]: y_train_cat[0]
Out[27]: array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], dtype=float32)
As you can see, this is an array of 10 numbers, zero everywhere except at position 5 (remember we start
counting from 0) indicating which of the 10 classes our image should be classified as.
As seen in Chapter 3 for features and in Chapter 4 for labels, this type of encoding is called one-hot
encoding, meaning we encode classes as an array with as many elements as the number of distinct classes,
zero everywhere except for a 1 at the corresponding class.
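To see what this encoding does, here is a small numpy equivalent. It is only a sketch of the idea, not the implementation of to_categorical, and the labels in `labels_demo` are hypothetical.

```python
import numpy as np

labels_demo = np.array([5, 0, 4, 1, 9])   # hypothetical digit labels
n_classes = 10

# row i of the identity matrix is the one-hot vector for class i
onehot_demo = np.eye(n_classes, dtype='float32')[labels_demo]

print(onehot_demo[0])       # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
print(onehot_demo.shape)    # (5, 10)
```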
Great! Finally let’s check the shape of y_train_cat. This should have as many rows as we have training
examples and 10 columns for the 10 binary outputs:
In [28]: y_train_cat.shape
In [29]: y_test_cat.shape
Fantastic! We can now train a fully connected Neural Network using everything we’ve learned in the previous
chapters.
To build our network, let’s import the usual Keras classes as seen in Chapter 1. Once again we build a
Sequential model, i.e. we add the layers one by one, using fully connected layers, i.e. Dense:
Now let’s build the model. As we have done in Chapter 4, we will build this network layer by layer, making
sure that the sizes of the inputs and outputs match.
1. We specify the size of the input in the definition of the first layer through the parameter
input_dim=784.
• The choice of the number of layers and the number of nodes per layer is arbitrary. Feel free to
experiment with different architectures and observe:
• The last layer added to the stack is also the output layer. This may be sometimes confusing, so make
sure that the number of nodes in the last layer in the stack corresponds to the number of categories in
your dataset
• The last layer has a Softmax activation function. As seen in Chapter 4, this is needed when
the classes are mutually exclusive. In this case, an image of a digit cannot be of 2 different digits at the
same time, and we need to let the model know about it.
• Finally, the model is compiled using the categorical_crossentropy loss, which is the correct one
for classifications with many mutually exclusive classes.
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
In [32]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 512) 401920
_________________________________________________________________
dense_2 (Dense) (None, 256) 131328
_________________________________________________________________
dense_3 (Dense) (None, 128) 32896
_________________________________________________________________
dense_4 (Dense) (None, 32) 4128
_________________________________________________________________
dense_5 (Dense) (None, 10) 330
=================================================================
Total params: 570,602
Trainable params: 570,602
Non-trainable params: 0
_________________________________________________________________
As you can see, the model has about half a million parameters, namely 570,602.
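We can verify these numbers by hand: each Dense layer has one weight for every input/output pair plus one bias per node. The short sketch below redoes the arithmetic, taking the layer sizes from the summary (the variable names are ours).

```python
# number of nodes in each layer, input first (sizes taken from the summary)
layer_sizes = [784, 512, 256, 128, 32, 10]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_params = n_in * n_out + n_out   # weights + one bias per node
    total += n_params
    print(n_in, '->', n_out, ':', n_params)

print('Total:', total)   # Total: 570602
```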
Let’s train it on our data for 10 epochs with 128 images per batch. We will need to pass the scaled and
reshaped inputs and outputs.
Also, let’s use a validation_split of 0.1, meaning we will train the model on 90% of the training data,
and evaluate its performance on the remaining 10%. This is like an internal train/test split done on the
training data. It’s useful when we plan to change the network and tune its architecture to maximize its ability
to generalize. We will keep the actual test set for a final check once we have committed to the best
architecture.
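To get a feeling for the numbers involved, here is a sketch of the split. Keras holds out the last fraction of the samples for validation, before any shuffling; the variable names below are ours.

```python
n_samples = 60000          # size of the MNIST training set
validation_split = 0.1

# index separating the training part from the held-out validation part
split_at = int(n_samples * (1 - validation_split))

n_train = split_at
n_val = n_samples - split_at
print(n_train, n_val)   # 54000 6000
```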
The model seems to be doing very well on the training data (as we can see by the acc output).
Let’s check if it is overfitting, i.e. if it is just memorizing the answers instead of learning general rules about
the training examples.
TIP: if you need to refresh your knowledge of overfitting have a look at Chapter 3 as well as
this Wikipedia article.
Let’s plot the history of the accuracy and compare the training accuracy with the validation accuracy.
In [34]: plt.plot(h.history['acc'])
plt.plot(h.history['val_acc'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');
We already notice that while the training accuracy increases, the validation accuracy does not seem to
increase as well. Let’s check the performance on the test set:
Out[35]: 0.9813
Out[36]: 0.9954666666666667
The performance on the test set is lower than the performance on the training set.
TIP: one question you may have is “When is a difference between the test and train scores
significant?”. We can answer this question by running a cross-validation to see what the
standard deviation of each score is. Then we can compare the difference between the two
scores with the standard deviation and see if it is much greater than the
statistical fluctuations of each score.
This difference between the train and test scores may indicate we are overfitting.
This makes sense, because the model is trained using the individual pixels as features. This implies that two
images which are similar but slightly rotated or shifted have completely different features.
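A tiny example makes the problem visible. In the sketch below (a toy 5x5 image of our own, not taken from MNIST) the same vertical bar is drawn in two neighboring positions; once flattened, the two feature vectors have no active pixel in common.

```python
import numpy as np

# a 5x5 image with a vertical bar in column 1
img_a = np.zeros((5, 5))
img_a[:, 1] = 1

# the same bar shifted one pixel to the right
img_b = np.roll(img_a, shift=1, axis=1)

# as flat feature vectors the two images do not overlap at all
overlap = np.dot(img_a.ravel(), img_b.ravel())
print(overlap)   # 0.0
```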
In order to go beyond “pixels as features” we need to extract better features from the images.
It is legitimate to wonder if there is a better way to extract information from images, and there is.
The process of going from an image to a vector of pixels is just the simplest case of feature extraction from
an image. There are many other methods to extract features from images, including Fourier transforms,
Wavelet transforms, Histograms of oriented gradients (HOG) and many others. These are all methods
that take an image as input and return a vector of numbers that can be used as features.
The banknotes dataset we used in the previous chapter is an example of features extracted from images with
these methods.
Although really powerful, these methods require very deep domain knowledge and each was developed
over time to solve a specific problem in image recognition. It would be great if we could avoid using these
special methods and just learn the best features from the image problem itself.
Feature extraction
This is a general issue with feature engineering: identifying features that correctly represent the type of
information we are trying to capture from a rich data point (like an image) is a time-consuming and
complex effort, often involving several Ph.D. students doing their thesis on it.
Let’s consider an image more in detail. What makes an image different from a vector of numbers is that the
values of pixels are correlated both horizontally and vertically. It’s the 2D pattern that carries the
information about what’s represented in the image and these 2D patterns, like for example horizontal and
vertical contrast lines, are specific to an image or to a set of images. It would be great to have a technique
that is able to capture them automatically.
Additionally, if all we care about is recognizing an object, we should strive to be insensitive to the position of
the object in the image, and our features should rely more on local patterns of pixels arranged in the form of
the object, than on the position of such pixels on the grid.
The mathematical operation that allows us to look for local patterns is called convolution. But before we
learn about it, we have to take a moment and learn about Tensors.
6.3. BEYOND PIXELS AS FEATURES 257
Images as tensors
In this section we talk about tensors. Tensors were originally introduced in Physics and they are a very
powerful mathematical tool. They are so important that Einstein used them to describe general relativity.
Yeah! That’s right, space-time curvature and gravity are described by tensors!
Despite this, the tensors used in Machine Learning are not the same as the ones used in physics. In
Machine Learning, tensor is really just a synonym for multi-dimensional array. This is somewhat misleading and
has generated a bit of a debate (see here), but we will follow the mainstream convention and use the
word tensors to refer to multi-dimensional arrays.
In this sense the order or rank of a tensor refers to the number of axes in the array.
TIP: people tend to use the word dimension to indicate the rank of a tensor (number of
axes) as well as the length of a specific axis. We will call the former rank or order, saving
the word dimension for the latter. More on this later, however.
You may wonder why you should learn about tensors. The answer is, they allow you to apply Machine
Learning to multi-dimensional data like images, movies, text and so on. Tensors are a great way to extend
our skills beyond the tabular datasets we’ve used so far!
Let’s start with scalars. Scalars are just numbers, everyday numbers we are used to. They have no dimension.
In [37]: 5
Out[37]: 5
Vectors can be thought of as lists of numbers. The number of elements in a vector is also called vector
length and sometimes number of dimensions. As already seen many times, in Python we can create
vectors using the np.array constructor:
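For example, a vector with four entries could be created as in this sketch (the values are hypothetical; only the length matters here):

```python
import numpy as np

v = np.array([1, 3, 2, 1])   # hypothetical values
v.shape
```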
Out[38]: (4,)
TIP: In our terminology this is a vector of dimension 4, which is still a tensor of order 1,
since it only has 1 axis.
The numbers in the list are the coordinates of a point in a space with the same number of dimensions as the
number of entries in the list.
Going up one level, we encounter tensors of order 2, which are called matrices. Matrices are tables of
numbers with rows and columns, i.e. they have 2 axes.
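A matrix with 2 rows and 4 columns could be created as in this sketch (again, the values are hypothetical):

```python
import numpy as np

M = np.array([[1, 3, 2, 1],
              [0, 2, 4, 3]])   # hypothetical values: 2 rows, 4 columns
M.shape
```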
Out[39]: (2, 4)
The first axis of M has length 2, which is the number of rows in the matrix, the second axis has length 4 and it
corresponds to the columns in the matrix.
A grayscale image, as we saw, is a 2D matrix where each entry in the matrix corresponds to a pixel.
TIP: notice that plt.imshow takes care of normalizing the values of the matrix so that they
can be displayed in gray-scale.
Notice also that a matrix can be thought of as a list of vectors of the same length, each representing a row of
pixels, in the same way as a vector can be seen as a list of scalars. So if we extract the first element of the
matrix, this is a vector:
In [41]: M[0].shape
Out[41]: (4,)
This recursive construction allows us to organize them in the larger family of tensors. Tensors can be
understood as nested lists of objects of the previous order, all with the same shape.
So for example, a tensor of order 3 can be thought of as an array of matrices, which are tensors of order two.
Since all of these matrices have the exact same number of rows and columns, the tensor is actually like a
cuboid of numbers.
Each number is located by the row, the column and the depth where it’s stored.
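The following sketch builds a small order-3 tensor with numpy and shows how the nesting works (the name `T` and its contents are chosen for illustration):

```python
import numpy as np

# a stack of 3 matrices, each with 2 rows and 4 columns
T = np.arange(24).reshape(3, 2, 4)

print(T.ndim)        # 3, the order of the tensor
print(T.shape)       # (3, 2, 4)
print(T[0].shape)    # (2, 4): each element along the first axis is a matrix
print(T[1, 0, 2])    # 10: a single number, located by three indices
```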
The shape of a tensor tells us how many objects there are when counting along a particular axis. So for
example, a vector has only one axis and a matrix has two axes, indicating the number of rows and the
number of columns. Since most of the data (images, sounds, texts, etc.) we will use are stored as tensors, it is
very important to know the dimensions of these objects for a proper use.
Colored images
A colored image is actually a set of gray-scale images, each corresponding to a primary color channel. So, in
the case of RGB, we have three channels (Red, Green and Blue), each containing the pixels of the image in
that particular channel.
Tensors
Multi-dimensional array
This image is an order three tensor and there are two major ordering conventions. If we think of the image
as a list of three single color images, then the axis order will be channel first, then height, and then width.
On the other hand, we can also think of the tensor as an order two list of vector pixels, where each pixel
contains three numbers, one for each of the colors. This is called “channel last” and it’s the convention used
in the rest of the book.
Let’s create and display a random color image by creating a list of random pixels between 0 and 255:
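A minimal sketch of this step, assuming a 4x4 image in the channel last convention (the size is our guess based on the output shown next):

```python
import numpy as np

# 4x4 pixels, 3 color values per pixel (channel last), each in [0, 255]
img = np.random.randint(0, 256, size=(4, 4, 3)).astype('uint8')

# moving the last axis to the front gives the channel first ordering
img_channel_first = img.transpose(2, 0, 1)
print(img.shape, img_channel_first.shape)   # (4, 4, 3) (3, 4, 4)
```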
[[170, 3, 53],
[146, 251, 17],
[180, 211, 82],
[ 60, 177, 27]]], dtype=uint8)
Now let’s display it as a figure, showing the full image next to each of its three color channels.
plt.subplot(221)
plt.imshow(img)
plt.subplot(222)
plt.imshow(img[:, :, 0], cmap='Reds')
plt.title("Red channel")
plt.subplot(223)
plt.imshow(img[:, :, 1], cmap='Greens')
plt.title("Green channel")
plt.subplot(224)
plt.imshow(img[:, :, 2], cmap='Blues')
plt.title("Blue channel")
plt.tight_layout()
6.4. CONVOLUTIONAL NEURAL NETWORKS 263
Pause here for a second and observe how the colors of the pixels in the colored image reflect the
combination of the colors in the three channels.
Now that we know how to represent images using tensors, we are ready to introduce convolutional Neural
Networks.
TIP: If you’d like to know a bit more about Tensors and how they work, we displayed a few
operations in the Appendix.
TIP: if you need a refresher about convolutions and how they work, have a look at the
Appendix.
For the purpose of this chapter, all we need to know is that an image can be convolved with a filter or
kernel, which is basically a smaller image. The convolution of an image with a kernel generates a new
image, also called a feature map, whose pixels represent the “degree of matching” of the corresponding
receptive field with the kernel.
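The idea can be sketched in a few lines of numpy. The function below is a naive illustration written for clarity, not the way Keras implements convolutions; the function name, the toy image and the kernel are our own. Like convolutional layers in Deep Learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image and record the degree of matching.

    No padding is used, so the feature map is smaller than the image.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            receptive_field = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(receptive_field * kernel)
    return feature_map

# a 5x5 image with a vertical edge between columns 1 and 2
image = np.zeros((5, 5))
image[:, 2:] = 1

# a kernel that responds to dark-to-light vertical edges
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

fmap = convolve2d_valid(image, kernel)
print(fmap.shape)   # (3, 3)
print(fmap)
```

The feature map is large where the receptive field contains the edge and zero where the image is flat, which is exactly the “degree of matching” described above.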
So, if we take many filters and arrange them in a convolutional layer the output of the convolution of an
image will be as many feature maps (convolved images) as there are filters. Since all of these images have the
same size, we can arrange them in a tensor, where the number of channels corresponds to the number of
filters used. In fact, let’s use tensors to describe everything: inputs, layers and outputs.
We can arrange the input data as a tensor of order four. A single image is an order-3 tensor as we know, but
since we have many input images in a batch, and they all have the same size, we might as well stack them in
an order-4 tensor where the first axis indicates the number of samples.
So the 4 axes are respectively: number of images in the batch, height of the image, width of the image and
number of color channels in the image.
For example, in the MNIST training dataset we have 60000 images, each with 28x28 pixels and only one color
channel, because they are grayscale. This gives us an order-4 tensor with the shape (60000, 28, 28, 1).
Similarly, we can stack the filters in the convolutional layer as an order-4 tensor. We will use the first two
axes for the height and the width of the filter. The third axis will correspond to the number of color
channels in the input, while the last axis is for the number of nodes in the layer, i.e. the number of different
filters we are going to learn. This is sometimes also called the number of output channels; you’ll soon see why.
Let’s do an example where we build a convolutional layer with four 3x3 filters. The order-4 tensor has a
shape of (3, 3, 1, 4), i.e. four filters of 3x3 pixels, each with a single input color channel.
When we convolve each input image with the convolutional layer, we still obtain an order-4 tensor.
The first axis is still the number of images in the batch or in the dataset. The other three axes are for the
image height, width and number of color channels in the output. Notice that this is also the number of
filters in the layer, four in the case of this example.
Notice that since the output is an order-4 tensor, we could feed it to a new convolutional layer, provided we
make sure to match the number of channels correctly.
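The bookkeeping of shapes can be summarized in a small helper. This is a sketch under two assumptions: the convolution uses no padding and moves the filter one pixel at a time. The function name is ours, and the second layer with 8 filters is a hypothetical example.

```python
def conv_output_shape(input_shape, filters_shape):
    """Shape of the output of a convolution with no padding and stride 1.

    input_shape:   (n_images, height, width, n_channels)
    filters_shape: (filter_height, filter_width, n_channels, n_filters)
    """
    n, h, w, c = input_shape
    fh, fw, fc, n_filters = filters_shape
    assert c == fc, "input channels must match the channels of the filters"
    return (n, h - fh + 1, w - fw + 1, n_filters)

# MNIST images convolved with four 3x3 filters
print(conv_output_shape((60000, 28, 28, 1), (3, 3, 1, 4)))
# (60000, 26, 26, 4)

# the output can feed a second layer, as long as the channels match
print(conv_output_shape((60000, 26, 26, 4), (3, 3, 4, 8)))
# (60000, 24, 24, 8)
```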
Convolutional Layers
Convolutional layers are available in keras.layers.Conv2D. Let’s apply a convolutional layer to an image
and see what happens.
In [48]: img.shape
A convolutional layer wants an order-4 tensor as input, so first of all we need to reshape our image so that it
has 4 axes and not 2.
We can just add 1 axis of length 1 for the color channel (which is a grayscale pixel value between 0 and 255)
and 1 axis of length 1 for the dataset index.
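As a sketch of this reshaping step, assuming a random 512x512 array as a stand-in for the actual image (the variable name img_tensor is the one used later in the text):

```python
import numpy as np

# stand-in for the grayscale image: pixel values between 0 and 255
img = np.random.randint(0, 256, size=(512, 512))

# axes: (dataset index, height, width, color channel)
img_tensor = img.reshape((1, 512, 512, 1))

# the same result can be obtained by inserting new axes explicitly
same_tensor = img[np.newaxis, :, :, np.newaxis]

print(img_tensor.shape)   # (1, 512, 512, 1)
```

Reshaping does not change the pixel values, it only changes how they are indexed.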
Let’s start by applying a large flat filter of size 11x11 pixels. This should result in a blurring of the image
because the pixels are averaged.
We only need a single filter, so we will specify 1 for filters and (11, 11) for the kernel_size. We will also
initialize all the weights to one by using kernel_initializer='ones'. Finally we will need to pass the input
shape, since this is the first layer in the network. This is the shape of a single image, which in this case is
(512, 512, 1).
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 502, 502, 1) 122
=================================================================
Total params: 122
Trainable params: 122
Non-trainable params: 0
_________________________________________________________________
We have a model with one convolutional layer, so the number of parameters is equal to 11 x 11 + 1 where
the +1 comes from the bias term. We can apply the convolution to the image by running a forward pass:
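If you want to see what this forward pass computes without Keras, here is a rough NumPy equivalent. It assumes weights all equal to one, a bias of zero and a random stand-in for the image; flat_filter_valid is a hypothetical helper name:

```python
import numpy as np

def flat_filter_valid(img, size=11, bias=0.0):
    """Sum every size x size receptive field (all weights equal to one)."""
    out_h = img.shape[0] - size + 1
    out_w = img.shape[1] - size + 1
    out = np.full((out_h, out_w), bias, dtype=float)
    # add up the shifted copies of the image, one per kernel weight
    for di in range(size):
        for dj in range(size):
            out += img[di:di + out_h, dj:dj + out_w]
    return out

img = np.random.randint(0, 256, size=(512, 512)).astype(float)
blurred = flat_filter_valid(img)

n_params = 11 * 11 * 1 * 1 + 1    # kernel weights plus one bias
print(blurred.shape, n_params)    # (502, 502) 122
```

Each output pixel is the sum of 121 input pixels, which is proportional to their average. This is why the result looks like a blurred version of the input.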
TIP: try to change the initialization of the convolutional layer to something else. Then
re-run the convolution and notice how the output image changes.
Great! We have just demonstrated that the convolution with a kernel will produce a new image, whose pixels
will be a combination of the original pixels in a receptive field and the values of the weights in the kernel.
These weights are not chosen by the user: they are learned by the network through backpropagation! This
allows a Neural Network to adapt and learn the patterns that are relevant to solving the task.
There are two additional arguments to consider when building a convolutional layer with keras: padding
and stride.
Padding
If you’ve been paying attention, you may have noticed that the convolved image is slightly smaller than the
original image:
In [54]: img_pred_tensor.shape
This is due to the default setting of padding='valid' in the Conv2D layer and it has to do with how we
treat the data at the boundaries. Each pixel in the convolved image is the result of the contraction of the
receptive field with the kernel. Since in this case the kernel has a size of 11x11, if we start at the top left corner
and slide to the right there are only 512 - 11 + 1 = 502 possible positions for the receptive field. In other
words we lose 5 pixels on the right and 5 pixels on the left, and similarly 5 pixels at the top and 5 at the bottom.
If we would like to preserve the image size, we need to offset the first receptive field so that its center falls on
the top left corner of the input image. We can fill the empty parts with zeros. This is called padding.
In keras we have two padding modes:
• valid, which means no padding
• same, which means pad to keep the same image size.
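Here is a small NumPy sketch of what zero padding does to the sizes involved. It assumes an odd kernel size, a stride of 1 and a random stand-in image; np.pad is only used for illustration and is not what Keras calls internally:

```python
import numpy as np

img = np.random.rand(512, 512)     # stand-in for the input image
kernel_size = 11

# for an odd kernel we add (kernel_size - 1) / 2 zeros on every side
pad = (kernel_size - 1) // 2
padded = np.pad(img, pad, mode="constant", constant_values=0)

valid_size = img.shape[0] - kernel_size + 1      # no padding
same_size = padded.shape[0] - kernel_size + 1    # after zero padding

print(padded.shape, valid_size, same_size)       # (522, 522) 502 512
```

With 5 zeros added on each side, the first receptive field is centered on the top left pixel of the original image and the output has the same size as the input.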
model.predict(img_tensor).shape
Awesome! We know how padding works. Why use padding? We can use padding if we think that the pixels
at the border contain useful information to solve the classification task.
Padding
Stride
The stride is the number of pixels that separate one receptive field from the next. It’s like the step
size in the convolution. A stride of (1, 1) means we slide the receptive field by one pixel at a time, both
horizontally and vertically. Looking at the figure:
Stride
The input image has size 6x6, the filter (not shown) is 3x3 and so is the receptive field. If we perform a
convolution with no padding and stride of 1, the output image will lose one pixel on each side, resulting in a
4x4 image. Increasing the stride means skipping a few pixels between one receptive field and the next, so for
example a stride of (3, 3), will produce an output image of 2x2.
We can also use strides of different lengths in the two directions, which will produce a rectangular convolved
output image.
Finally, if we don’t want to lose the borders during the convolution we can pad the image with zeros and
obtain an image with the same size as the input.
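The sizes quoted above follow from a simple formula. Here is a sketch of it along one axis; conv_output_size is a hypothetical helper of ours, written to follow the usual valid/same conventions:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Length of the convolved output along one axis."""
    if padding == "same":
        # enough zeros are added so that only the stride shrinks the image
        return -(-input_size // stride)    # ceil(input_size / stride)
    # valid: count the positions where the receptive field fits entirely
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(6, 3, stride=1))                    # 4
print(conv_output_size(6, 3, stride=3))                    # 2
print(conv_output_size(6, 3, stride=1, padding="same"))    # 6
# different strides along the two axes give a rectangular output
print((conv_output_size(6, 3, stride=1),
       conv_output_size(6, 3, stride=3)))                  # (4, 2)
```

The same function reproduces the 502 pixels we obtained earlier for a 512 pixel wide image and an 11x11 kernel.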
The default value for the stride is 1 to the right and 1 down, but we can jump by larger amounts, for example
if the image resolution is too high.
This will produce output images that are smaller. For example, let’s jump by 5 pixels in both directions in our
example:
small_img_tensor = model.predict(img_tensor)
small_img_tensor.shape
The image is still present, but its resolution is now much lower. We can also choose asymmetric strides, if we
believe the image has more resolution in one direction than another:
asym_img_tensor = model.predict(img_tensor)
asym_img_tensor.shape
Pooling layers
Pooling reduces the size of the image by discarding some information. For example, max-pooling only
preserves the maximum value in a patch and stores it in the new image, while discarding the values in the
other pixels.
Also, pooling patches usually do not overlap, so that the size of the image is actually reduced.
If we apply pooling to the feature maps, we end up with smaller feature maps, that still retain the highest
matches of our convolutional filters with the input.
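To see the operation on actual numbers, here is a minimal NumPy sketch of max-pooling with non-overlapping 2x2 patches on a single feature map. The function max_pool_2d and the small example array are assumptions made for illustration:

```python
import numpy as np

def max_pool_2d(feature_map, pool=2):
    """Max-pooling with non-overlapping pool x pool patches."""
    h, w = feature_map.shape
    # drop the rows and columns that do not fill a complete patch
    h, w = h - h % pool, w - w % pool
    patches = feature_map[:h, :w].reshape(h // pool, pool, w // pool, pool)
    # keep only the maximum value in each patch
    return patches.max(axis=(1, 3))

fm = np.array([[1, 2, 0, 1],
               [3, 4, 1, 0],
               [0, 0, 5, 6],
               [1, 0, 7, 8]])

print(max_pool_2d(fm))
# [[4 1]
#  [1 8]]
```

The 4x4 feature map becomes a 2x2 one, and each output value is the strongest match found in the corresponding patch.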
Let’s add a MaxPooling2D layer in a simple network (containing this single layer):
Max-pooling layers are useful in object recognition tasks. Since pixels in feature maps represent the
“degree of matching” of a filter with a receptive field, keeping the max keeps the highest matching feature.
On the other hand, if we are also interested in the location of a particular match, then we shouldn’t be using
max-pooling, because location information will be lost in the pooling operation.
Thus, for example, if we are using a convolutional Neural Network to read the state of a video game from a
frame, we need to know the exact positions of the players, and using max-pooling is not recommended.
Finally, GlobalMaxPooling2D calculates the global max over the whole image, so it returns a single value
for each channel of the image:
Out[64]: (1, 1)
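In NumPy terms, global max-pooling is simply a max over the height and width axes. A quick sketch, using random tensors as stand-ins for the feature maps:

```python
import numpy as np

# one image, 512x512 pixels, one channel
single_channel = np.random.rand(1, 512, 512, 1)
print(single_channel.max(axis=(1, 2)).shape)   # (1, 1)

# with four channels we get one value per channel
four_channels = np.random.rand(1, 512, 512, 4)
print(four_channels.max(axis=(1, 2)).shape)    # (1, 4)
```

The first axis still indexes the images, while the second one indexes the channels.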
Final architecture
Convolutional, pooling and activation layers can be stacked together. The output of one layer can be fed into
the next, resulting in a feature extraction pipeline that will gradually transform an image into a tensor with
more channels and fewer pixels:
Convolutional stack
The value of each “pixel” in the last feature map is influenced by a large region of the original image and it
will have learned to recognize complex patterns.
In fact, that’s the beauty of stacking convolutional layers. The first layers will learn patterns of pixels in the
original image, while deeper layers will learn more complex patterns that are combinations of the simpler
patterns.
In practice, early layers will specialize to recognize contrast lines in different orientations, while deeper
layers will combine those contrast lines to recognize parts of objects. The typical example of this is the face
recognition task where middle layers recognize facial features like eyes, noses and mouths while deeper
nodes specialize on individual faces.
The convolutional stack behaves like an optimized feature extraction pipeline that is trained to optimally
solve the task at hand.
In order to complete the pipeline and solve the classification task we can pipe the output of the feature
extraction pipeline into a fully connected final stack of layers.
We will need to unroll the output tensor into a long vector like we did initially for the MNIST data, and
connect this vector to the labels using a fully connected network.
We can also stack multiple fully connected layers if we want. Our final network is like a pancake of many
layers, the convolutional part dealing with feature extraction and the fully connected part handling the
classification.
Flatten layer
The deeper we go in the network, the richer and more specific the matched patterns are, and so the more
robust the classification will be.
Let’s build our first convolutional Neural Network to classify the MNIST data. First of all we need to reshape
the data as order-4 tensors. We will store the reshaped data into new variables called X_train_t and
X_test_t.
In [66]: X_train_t.shape
Notice that between the convolutional layers and the fully connected layers we will need Flatten to
reshape the feature maps into feature vectors.
In order to speed up the convergence we initialize the convolutional weights drawing from a random
normal distribution. Later in the book we will discuss initializations in more detail.
Also notice that we need to pass input_shape=(28, 28, 1) to let the model know our input images are
grayscale 28x28 images:
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_5 (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 13, 13, 32) 0
_________________________________________________________________
activation_1 (Activation) (None, 13, 13, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 5408) 0
_________________________________________________________________
dense_6 (Dense) (None, 64) 346176
_________________________________________________________________
dense_7 (Dense) (None, 10) 650
=================================================================
Total params: 347,146
Trainable params: 347,146
Non-trainable params: 0
_________________________________________________________________
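We can check the parameter counts in this summary by hand. The sketch below assumes 3x3 kernels in the convolutional layer, which is consistent with the 320 parameters and the 26x26 output reported above; the helper names are ours:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights of every filter plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(n_inputs, n_units):
    """One weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

conv = conv2d_params(3, 3, 1, 32)      # 320
flat = 13 * 13 * 32                    # 5408 values after Flatten
dense_1 = dense_params(flat, 64)       # 346176
dense_2 = dense_params(64, 10)         # 650

print(conv + dense_1 + dense_2)        # 347146
```

Pooling, activation and flatten layers have no parameters, and almost all of the weights sit in the first fully connected layer.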
This model has about 347k parameters, that’s almost half of the fully connected model we designed at the
beginning of this chapter. Let’s train it for 5 epochs. Notice that we pass the tensor data we created above:
In [70]: plt.plot(h.history['acc'])
plt.plot(h.history['val_acc'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');
The convolutional model achieved a better performance on the MNIST data in fewer epochs. Overfitting is
also reduced, because the model is learning to combine spatial patterns instead of learning the exact values
of the pixels.
Beyond images
Convolutional networks are great on all data types where the order matters. For example, they can be used
on sound files using spectrograms. Spectrograms represent sound as an image where the vertical axis
corresponds to the frequency bands, while the horizontal axis indicates the time. We can feed spectrograms
to a convolutional layer and treat them like images. Some of the most famous speech recognition engines use
this technique.
Similarly, we can map a sentence of text onto an image where the vertical axis indicates the word index in a
vocabulary, and the horizontal axis is for the position in the sentence.
Although they are very powerful, CNNs are not useful at all in some cases. Since they are good at capturing
spatial patterns, they are of no use when such local patterns do not exist. This is the case when data is a 2D
table coming from a database collecting user data. Each row corresponds to a user and each column to a
feature, but there is no special order in either columns or rows.
In other words we can swap the order of the rows or the columns without altering the information
contained in the table. In a case like this, a CNN is completely useless and it should not be used.
Conclusion
In this chapter we’ve finally introduced convolutional Neural Networks as a tool to efficiently extract
features from images and more generally from spatially correlated data.
Convolutional networks are ubiquitous in object recognition tasks, widely used in robotics, self-driving
cars, advertising, and many more fields.
Exercise
Exercise 1
You’ve been hired by a shipping company to overhaul the way they route mail, parcels and packages. They
want to build an image recognition system capable of recognizing the digits in the zipcode on a package, so
that it can be automatically routed to the correct location. You are tasked to build the digit recognition
system. Luckily, you can rely on the MNIST dataset for the initial training of your model!
Build a deep convolutional Neural Network with at least two convolutional and two pooling layers before
the fully connected layer:
Exercise 2
Pleased with your performance with the digit recognition task, your boss decides to challenge you with a
harder task. Their online branch allows people to upload images to a website that generates and prints a
postcard that is shipped to its destination. Your boss would like to know what images people are uploading
to the site in order to provide targeted advertising on the same page, so he asks you to build an image
recognition system capable of recognizing a few objects. Luckily for you, there’s a ready-made dataset with a
collection of labeled images. This is the CIFAR-10 dataset, a very famous dataset that contains images for 10
different categories:
• airplane
• automobile
• bird
• cat
• deer
• dog
• frog
• horse
• ship
• truck
In this exercise we will reach the limit of what you can achieve on your laptop. In later chapters we will learn
how to leverage GPUs to speed up training.
Here’s what you have to do:
• load the cifar10 dataset using keras.datasets.cifar10.load_data()
• display a few images, see how hard/easy it is for you to recognize an object with such low resolution
• check the shape of X_train, does it need reshape?
• check the scale of X_train, does it need rescaling?
• check the shape of y_train, does it need reshape?
• build a model with the following architecture, and choose the parameters and activation functions for
  each of the layers:
    • conv2d
    • conv2d
    • maxpool
    • conv2d
    • conv2d
    • maxpool
    • flatten
    • dense
    • output
• compile the model and check the number of parameters
• attempt to train the model with the optimizer of your choice. How fast does training proceed?
• If training is too slow, feel free to stop it and read ahead. In the next chapters you’ll learn how to use GPUs to
Machine Learning on Time Series requires a bit more caution than usual, since we need to avoid leaking
future information into the training. We will therefore start this chapter talking about Time Series and
Sequence problems in general. Then we will introduce RNNs and in particular two famous architectures:
LSTMs and GRUs (for the latter, have a look at Exercise 2).
This chapter contains both practical and theoretical parts with some math. Like we did in chapter 5, let us
first tell you: you don’t NEED to read the math in this chapter. This book is meant for the developer and
practitioner that is interested in applying Neural Networks to solve great problems. We provide the math for
the curious and we will make sure to highlight which sections can be skipped at a first read.
Time Series
Time series are everywhere. Examples of time series are the values of a stock, music, text, events on your
app, video games, which are sequences of actions, and in general any quantity monitored over time that
generates a sequence of values.
A time series is an ordered sequence of data points, and it can be univariate or multivariate.
A univariate time series is nothing but a sequence of scalars. Examples of this are temperature values
through the day or the number of times per minute your app was downloaded.
284 CHAPTER 7. TIME SERIES AND RECURRENT NEURAL NETWORKS
A time Series
A time series could also take values in a vector space, in which case it is a multivariate time series.
Examples of vector time series are the velocity of a car as a function of time or an audio file recorded in stereo,
which has two channels.
Machine Learning can be applied to time series to solve several problems including forecasting, anomaly
detection and pattern recognition.
Forecasting refers to predicting future samples in a sequence. In a way, this problem is a regression problem
because we are predicting a continuous quantity using features derived from the time series and most likely
it is a nonlinear regression.
Anomaly detection refers to identifying deviations from a regular pattern. This problem can be approached
in two ways: if we know the anomalies we are looking for, we can approach it as a classification problem. If
we do not know the anomalies we would just train a model to forecast future values (regression) and then
compare the predicted value and the actual signal. In this case, anomalies are located where the model
prediction is very different from the actual signal.
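The second approach can be sketched in a few lines of plain Python. The function flag_anomalies, the toy values and the threshold are all assumptions chosen for illustration; in practice the predictions would come from a trained forecasting model:

```python
def flag_anomalies(actual, predicted, threshold):
    """Return the positions where prediction and signal disagree strongly."""
    return [i for i, (a, p) in enumerate(zip(actual, predicted))
            if abs(a - p) > threshold]

actual = [1.0, 1.1, 0.9, 3.0, 1.0]       # measured signal
predicted = [1.0, 1.0, 1.0, 1.0, 1.0]    # output of a forecasting model

print(flag_anomalies(actual, predicted, threshold=0.5))   # [3]
```

Choosing the threshold is part of the problem: too low and we flag ordinary noise, too high and we miss real anomalies.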
In all these cases we must use particular care because the data is ordered in time and we need to avoid
leaking future information in the features used by the model. This is particularly true for model validation.
If we split the time series into training and test sets we cannot just pick a random split from the time series.
We need to split the data in time: all the test data should come after the training data.
This is particularly true with any data related to human activity, where daily, weekly, monthly and yearly
periodicities are found.
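A time-based split like the one described above can be as simple as cutting the sequence at a given position. A minimal sketch, where time_split and the toy data are hypothetical:

```python
def time_split(series, train_fraction=0.8):
    """Split an ordered sequence so that all test data comes after the training data."""
    split = int(len(series) * train_fraction)
    return series[:split], series[split:]

hourly_sales = list(range(100))     # stand-in for 100 ordered measurements
train, test = time_split(hourly_sales)

print(len(train), len(test))        # 80 20
print(train[-1], test[0])           # 79 80
```

Notice that we never shuffle the data: the last training point comes right before the first test point.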
Think for example of retail sales. A dataset with hourly sales from a shop will have regular patterns during
the day, with periods of higher customer flow and periods of lower customer flow, as well as during the week.
Depending on the type of goods we may find higher or lower sales during the weekend. Special dates, like
Black Friday or sales days, will appear as anomalies in these regular patterns and should be easy to catch. In
these cases, it is a good idea to either remove these periodicities beforehand or to add the relevant time
interval as an input feature.
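As an example of the second option, here is a sketch that derives a few time features from a timestamp using only the standard library. The function time_features and the choice of features are our own assumptions:

```python
from datetime import datetime

def time_features(timestamp):
    """Extract simple periodic features from a timestamp."""
    return {
        "hour": timestamp.hour,              # daily periodicity
        "weekday": timestamp.weekday(),      # weekly periodicity, Monday is 0
        "month": timestamp.month,            # yearly periodicity
        "is_weekend": timestamp.weekday() >= 5,
    }

print(time_features(datetime(2018, 9, 7, 14)))
# {'hour': 14, 'weekday': 4, 'month': 9, 'is_weekend': False}
```

Features like these let the model learn that, for instance, sales at 2pm on a Friday behave differently from sales at 2pm on a Sunday.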
The file sequence_classification.csv.bz2 contains a set of 4000 curves. Let’s load it and look at a few
rows and columns:
Out[3]:
TIP: this is the first time that we load a compressed file, i.e. a file stored in a format that
saves storage space. Pandas can directly load compressed files saved in several formats, for
example a bz2 file in this case. Have a look at the documentation for further details and
to discover all the supported formats.
In [4]: df.info()
7.2. TIME SERIES CLASSIFICATION 287
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Columns: 201 entries, anomaly to t_199
dtypes: bool(1), float64(200)
memory usage: 6.1 MB
Each row in the dataset is a curve, the labels for anomalies are given in the first column (in this case, we just
have two, True and False).
In [5]: df['anomaly'].value_counts()
Out[5]:
anomaly
True 2000
False 2000
As we can see, 2000 curves present anomalies, while the other 2000 do not. Let’s create the X and y arrays
and plot the first 4 curves.
In [7]: plt.plot(X[:4].transpose())
plt.legend(y[:4])
plt.title("Curves with Anomalies");
1. We could use the values of the curves as features (that is 200 points) and feed them to a fully
connected network.
2. We could engineer features from the curves, like statistical quantities, differences and Fourier
coefficients and feed those to a Neural Network.
3. We could use a 1D convolutional network to automatically extract patterns from the curves.
TIP: if you had to guess, which of the three approaches seems most promising?
First of all, we will perform a train/test split. In this case we do not need to worry about the order in time
because the sequences are given to us without any information about their absolute time. For all we know
they could be independent measurements of the same phenomenon.
Now let’s split the data into the training and test sets:
Let’s load our layers from keras so we can build our fully connected network:
Let’s also clear the backend of any data it’s holding on to with clear_session() on the backend:
In [11]: K.clear_session()
Finally, let’s build our model with our Keras layers, using the 200 points as features. This process should be
pretty familiar at this point:
Next, let’s train the model to fit against our training set.
In [14]: plt.plot(h.history['acc'])
plt.plot(h.history['val_acc'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');
This model does not perform well at all: its accuracy is around 50%, which is no better than random
guessing since the two classes are balanced.
This is easy to understand. In fact, the anomaly can be located anywhere along the curve, making it difficult
for a Neural Network to learn about its presence from amplitude values.
Let’s try to extract some features from the curves. We will limit ourselves to:
TIP: feel free to add more features like higher order statistical moments or Fourier
coefficients.
First, let’s build a new DataFrame eng_f whose two columns contain the features std and std_diff.
eng_f.head()
Out[16]:
std std_diff
0 0.260902 0.023511
1 0.249588 0.030286
2 0.304086 0.023464
3 0.302908 0.030531
4 0.286405 0.066638
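As a rough sketch of how such features could be computed with NumPy, assuming that std is the standard deviation of each curve and std_diff is the standard deviation of the differences between consecutive points (the original definitions may differ), and using random data in place of the real curves:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))     # stand-in: 5 curves with 200 points each

# spread of the values along each curve
std = X.std(axis=1)
# spread of the point-to-point differences along each curve
std_diff = np.diff(X, axis=1).std(axis=1)

eng_features = np.column_stack([std, std_diff])
print(eng_features.shape)         # (5, 2)
```

Each curve of 200 points is summarized by just 2 numbers that do not depend on where along the curve a pattern appears.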
Let’s clear out the backend for any memory we’ve already used:
In [18]: K.clear_session()
Next, let’s train a fully connected model. As we have already seen many times, the first layer depends on the
number of input features, 2 in this case, and the last layer depends on the output, a binary classification 0/1
in this context (notice that the last layer is the same as in the previous model, since we didn’t change our
output). The inner layers, only one in this model, depend on the researcher’s preferences.
In [21]: plt.plot(h.history['acc'])
plt.plot(h.history['val_acc'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');
This model is already much better than the previous one, but can we do better? Let’s try with the third
approach, i.e. a 1D convolutional network to automatically extract patterns from the curves.
As we know by now, convolutional layers are good for recognizing spatial patterns. In this case we know the
anomaly spans across a dozen points along the curve, so we should be able to capture it if we cascade a few
Conv1D layers with filter size of 3.
TIP: the filter size, 3 in this case, is an arbitrary choice. In the Appendix we explain how a
convolution with a filter size equal to 3 helps identify patterns in the 1D sequence.
Cascading multiple layers with small filters allows us to learn longer patterns.
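To get a feeling for how far a cascade of small filters can see, here is a sketch that counts how many input points influence a single output value. It assumes a stride of 1 and no pooling between the layers; receptive_field is a hypothetical helper:

```python
def receptive_field(n_layers, kernel_size=3):
    """Input points seen by one output value after n stacked convolutions."""
    # every layer extends the view by (kernel_size - 1) points
    return 1 + n_layers * (kernel_size - 1)

for n_layers in range(1, 7):
    print(n_layers, receptive_field(n_layers))
# 1 3
# 2 5
# 3 7
# 4 9
# 5 11
# 6 13
```

Under these assumptions, six layers with filters of size 3 are enough to cover a pattern that spans a dozen points. Pooling layers in between make the receptive field grow faster.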
Furthermore, since the anomaly can appear anywhere along the curve, MaxPooling1D introduced in
Chapter 6 may help to reduce the sensitivity to the exact location.
Finally we will need to include a few nonlinear activations, a Flatten layer (seen in Chapter 6), and one or
more fully connected layers. Let’s do it!
In [24]: K.clear_session()
Next, let’s build the model with our layers, considering again the 200 points as input:
model.add(Conv1D(16, 3))
model.add(MaxPool1D())
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
Conv1D requires the input data to have shape (N_samples, N_timesteps, N_features) so we need to add
a dummy dimension to our data.
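A sketch of this step with NumPy, assuming a random array in place of the real curves and a hypothetical variable name X_t for the result:

```python
import numpy as np

X = np.random.rand(8, 200)        # stand-in: 8 curves with 200 time steps

# add a last axis of length 1: a single feature per time step
X_t = X[:, :, np.newaxis]

print(X_t.shape)                  # (8, 200, 1)
```

The values are unchanged, each time step now simply holds a vector with one element.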
Now let’s plot the accuracy of our model using our 1D convolutional Neural Network:
In [28]: plt.plot(h.history['acc'])
plt.plot(h.history['val_acc'])
plt.legend(['Training', 'Validation'])
plt.title('Accuracy')
plt.xlabel('Epochs');
This model is the best so far, and it required no feature engineering. We just reasoned about the type of
patterns we were looking for and chose the most appropriate Neural Network architecture to detect them.
This is very powerful!
Sequence Problems
Time series problems can be extended to consider general problems involving sequences. In other words,
we can consider a time series as a particular type of sequence, where every element of the sequence is
associated with a time. But, in general, we may have sequences of elements not associated with a specific
time: for example, a word can be thought of as a sequence of characters, or a sentence as a sequence of words.
Similarly, the interactions of a user in an app form a sequence of events, and it is a very common use case to
try to classify such a sequence or to predict what the next event is going to be.
More generally, we are going to introduce here a few general scenarios involving sequences, that will stretch
our application of Machine Learning to new problems.
1-to-1
The simplest Machine Learning problem involving a sequence is the 1-to-1 problem. All the Machine
Learning problems we have encountered so far are of this type: linear regression, classification, and
convolutional Neural Networks for image or sequence classification. For each input we have one label, for
each image in MNIST we have a digit label, for each user we have a purchase, for each banknote we have
a label of real or fake.
In all of these cases the Machine Learning model learns a stateless function to connect a given input to a
given output.
In the case of sequences, we can expand our framework to allow for the model to make use of past values of
the input and of the output. Let’s see how.
1-to-many
The 1-to-many problem starts like the 1-to-1 problem. We have an input and the model generates an output.
After the first output is generated, it is fed back to the network as a new input, and the network generates a
new output. We can continue like this indefinitely and therefore generate an arbitrary sequence of outputs.
A typical example of this situation is image captioning: a single image in input generates as output a text
description of arbitrary length.
TIP: a text description can be thought of as a sequence of either words or characters. Every
single word or character is indeed an element of the sequence.
many-to-1
The many-to-1 problem reverses the situation. We feed multiple inputs to the network and at each step we
also feed the network output back into the network, until we reach the end of the input sequence. At this
point we look at the network output.
Text sentiment analysis falls in this category. We associate a single output sentiment label (positive or
negative) to a string of text of arbitrary length in input.
asynchronous many-to-many
In the asynchronous many-to-many case, we have a sequence in input and a sequence in output. The
model first learns to encode an input sequence of arbitrary length into the internal state. Then, when the
sequence ends, the model starts to generate a new sequence.
The typical application for this setup is language translation, where an input sentence in a language, for
example English, is translated to an output sentence in a different language, for example Italian. In order to
complete the task correctly the model has to “listen” to the whole input sentence first. Once the sentence is
finished, the model goes ahead and translates that into the new sentence.
synchronous many-to-many
Finally, there’s the synchronous many-to-many case, where the network outputs a value at each input,
considering both the input and its previous state. Video frame classification falls in this category because
for each frame we produce a label using the information from the frame but also the information from the
state of the network.
Recurrent Neural Networks can deal with all these sequence problems because their connections form a
directed cycle. In other words they are able to retain state from one iteration to the next by using their own
output as input for the next step. This is similar to infinite impulse response (IIR) filters in signal processing.
In programming terms, this is like running a fixed program with certain inputs and some internal variables.
Viewed this way, RNNs can be thought of as networks that learn generic programs.
In fact, RNNs are Turing-Complete, which means they can simulate arbitrary programs! We can think of
feed-forward Neural Networks as approximating arbitrary functions and recurrent Neural Networks as
approximating arbitrary programs. This makes them very powerful.
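To make the idea of retaining state concrete, here is a minimal NumPy sketch of a simple recurrent layer. The update rule with a tanh activation is one common choice that we assume here, and the weights are random rather than learned:

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) along a sequence."""
    h = np.zeros(W_h.shape[0])             # initial state
    states = []
    for x_t in inputs:
        # the new state depends on the current input and the previous state
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
n_features, n_hidden, n_steps = 3, 4, 10
W_x = rng.normal(size=(n_hidden, n_features))
W_h = rng.normal(size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)
sequence = rng.normal(size=(n_steps, n_features))

states = rnn_forward(sequence, W_x, W_h, b)
print(states.shape)        # (10, 4): one state per step, synchronous many-to-many
print(states[-1].shape)    # (4,): only the final state, many-to-1
```

The same loop covers several of the scenarios above: we keep all the states if we need an output at every step, or only the last one if we need a single output for the whole sequence.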
Time Series Forecasting
The previous dataset however was quite special for a number of reasons. First of all, each sample sequence
in the dataset had exactly the same duration, each curve included exactly 200 time steps. Secondly, we had
no information about the order of the samples and so we considered them as independent measurements
and performed train/test split in the usual way.
Both these conditions are not generally present when dealing with forecasting problems on time series or
text data. In fact, a time series can have arbitrary length and it usually comes with a timestamp, indicating
the absolute time of each sample.
Let’s load a new dataset and let’s see how recurrent networks can help in this case.
In [31]: df.head(3)
Out[31]:
        Date  Hour  Total Ontario  Northwest  Northeast  Ottawa  East  Toronto  Essa  Bruce  Southwest  Niagara  West  Tot Zones  diff
0  01-May-03     1          13702        809       1284     965   765     4422   622     41       2729      617  1611      13865   163
1  01-May-03     2          13578        825       1283     923   752     4340   602     43       2731      615  1564      13678   100
2  01-May-03     3          13411        834       1277     910   751     4281   591     45       2696      596  1553      13534   123
In [32]: df.tail(3)
Out[32]:
              Date  Hour  Total Ontario  Northwest  Northeast  Ottawa  East  Toronto  Essa  Bruce  Southwest  Niagara  West  Tot Zones   diff
119853  2016/12/31    22          15195        495       1476    1051  1203     5665  1045     72       2986      465  1334      15790    595
119854  2016/12/31    23          14758        495       1476    1051  1203     5665  1045     72       2986      465  1334      15790  1,032
119855  2016/12/31    24          14153        495       1476    1051  1203     5665  1045     72       2986      465  1334      15790  1,637
The dataset contains hourly electricity demands for different zones of Ontario, Canada, and it runs from
May 2003 to December 2016. Let’s create a pd.DatetimeIndex using the Date and Hour columns.
Let’s run this function over our data to generate the datetime index value for each row:
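A function of this kind could look like the following sketch, written with the standard library only. It assumes the two date formats visible in the rows above, English month abbreviations, and that the Hour column counts hours from midnight, so that hour 24 falls on midnight of the next day; combine_date_hour is a hypothetical name:

```python
from datetime import datetime, timedelta

def combine_date_hour(date_str, hour):
    """Combine a date string and an hour (1 to 24) into a single timestamp."""
    for fmt in ("%d-%b-%y", "%Y/%m/%d"):
        try:
            day = datetime.strptime(date_str, fmt)
            break
        except ValueError:
            continue                      # try the next format
    else:
        raise ValueError("unknown date format: " + date_str)
    return day + timedelta(hours=int(hour))

print(combine_date_hour("01-May-03", 1))      # 2003-05-01 01:00:00
print(combine_date_hour("2016/12/31", 24))    # 2017-01-01 00:00:00
```

Applying it to every row gives one timestamp per row, which can then be used as the index of the DataFrame.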
In [35]: idx.head()
Out[35]:
0
0 2003-05-01 01:00:00
1 2003-05-01 02:00:00
2 2003-05-01 03:00:00
3 2003-05-01 04:00:00
4 2003-05-01 05:00:00
In [36]: df = df.set_index(idx)
TIP: the function set_index() returns a new DataFrame whose index (row labels) has
been set to the values of one or more existing columns. Unless you use the
inplace=True argument this does not alter the DataFrame, it simply returns a different
version. That’s why we overwrite the original df variable.
Now that we have set the index, let’s select and plot the Total Ontario column:
Great! The time series seems quite regular! This looks promising for forecasting. Let’s split the data in time
on January 1st, 2014. We will use data before that date as training data and data after that date as test.
Now we copy the data to a pair of new Pandas data frames that only contain the Total Ontario data up to
the split date (train) and after the split date (test).
TIP: We use the .copy() command here because the .loc indexing command may return
a view on the data instead of a copy. This could be a problem later on when we do other
selections or manipulations of the data.
Let’s plot the data. We will use the matplotlib plotting function that is automatically aware of index with
dates and times and assign a label to each plot so that we can display them with a legend:
In [40]: plt.figure(figsize=(15,5))
plt.plot(train, label='train')
plt.plot(test, label='test')
plt.legend()
plt.title("Energy Consumption in Ontario 2003 - 2017");
We’ve already seen in Chapter 3 that Neural Network models are quite sensitive to the absolute size of the
input features. This means that passing in features with very large or very small values will not help our
model converge to a solution. Hence, we should rescale the data before anything else.
Notice that there’s a huge drop somewhere in 2003. We shouldn’t use that as the minimum for our analysis,
since it is clearly an outlier.
We will rescale the data in such a way that most of it is close to 1. We can achieve this by subtracting 10000,
which shifts everything down and then dividing by 5000.
TIP: feel free to adjust these values as you prefer, or to try out other scaling methods like
the MinMaxScaler or the StandardScaler. The important thing is to get our data close
to 1 in size, not exactly between 0 and 1.
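As a small sketch of this rescaling, the snippet below applies the shift and the division to four raw values and then inverts the transformation. The function names rescale and unscale are hypothetical; the raw numbers were obtained by inverting the first scaled values printed in this section.

```python
import numpy as np

offset, scale = 10000, 5000   # shift and divisor chosen in the text

def rescale(demand):
    """Shift the demand down by the offset, then divide by the scale."""
    return (np.asarray(demand, dtype=float) - offset) / scale

def unscale(scaled):
    """Invert rescale, going back to the original units."""
    return np.asarray(scaled, dtype=float) * scale + offset

raw = np.array([13702.0, 13578.0, 13411.0, 13501.0])
scaled = rescale(raw)
print(scaled)
```

Keeping the inverse around is handy later on, when predictions need to be reported in the original units.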
Let’s look at the first four dates and demand just to make sure our data is in the expected region of where we
think it should be.
In [42]: train_sc[:4]
Out[42]:
Total Ontario
2003-05-01 01:00:00 0.7404
2003-05-01 02:00:00 0.7156
2003-05-01 03:00:00 0.6822
2003-05-01 04:00:00 0.7002
In [43]: plt.figure(figsize=(15,5))
plt.plot(train_sc, label='train')
plt.plot(test_sc, label='test')
plt.legend()
plt.title("Energy Consumption Scaled Data");
We are finally ready to build a predictive model. Our target is going to be the value of the demand at a
given hour, and to start we will use the demand at the previous hour as the only feature.
In [44]: X_train = train_sc[:-1].values
y_train = train_sc[1:].values
X_test = test_sc[:-1].values
y_test = test_sc[1:].values
Now we have our training data as well as testing data mapped out. Let’s move on to model building.
Let’s train a fully connected network on this data and see that it is not able to forecast the next value from the
previous one.
The network will have a single input (the previous hour’s value) and a single output.
We can see this as a simple regression problem, since we want to establish a connection between two
continuous variables.
TIP: if you need a refresher on what a regression is and why it makes sense to use it here,
have a look at Chapter 3 where we used a Linear regression to predict the weight of
individuals given their height.
Since we want to predict a continuous variable, the output of the network does not need an activation
function and we will use the mean_squared_error as loss function, which is a standard error metric in
regression models.
Let’s clear the backend of any held memory first, as we have done many times when building a new model:
In [45]: K.clear_session()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 24) 48
_________________________________________________________________
dense_2 (Dense) (None, 12) 300
_________________________________________________________________
dense_3 (Dense) (None, 6) 78
_________________________________________________________________
dense_4 (Dense) (None, 1) 7
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
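The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic, assuming a single input feature as stated in the text; the function name dense_params is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [1, 24, 12, 6, 1]  # input size followed by the units per layer
counts = [dense_params(a, b)
          for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
print(counts, sum(counts))
```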
In this case, before fitting the built Neural Networks, we load the EarlyStopping callback, to halt the
training if it is not improving.
TIP: a callback is a set of functions to be applied at each epoch during the training. We have
already encountered them in Exercise 4 of Chapter 5. You can pass a list of callbacks to the
.fit() method, and in this specific case we use the EarlyStopping callback to stop the
training if no progress is observed. According to the documentation, monitor defines the
quantity to be monitored (the loss, i.e. the mean_squared_error, in this case) and patience defines
the number of epochs with no improvement after which the training will be stopped.
In particular we will set the EarlyStopping callback to monitor the value of the loss and stop the training
loop with a patience=1 if that does not improve. Without this callback, the training will be stuck on a
fixed loss without improving, and it will keep running for all the requested epochs (go ahead and try to confirm that!).
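To make the role of patience concrete, here is a minimal sketch of the stopping rule in plain Python. It is an illustration of the idea only, not the Keras implementation, and the function name and the list of losses are made up.

```python
def epochs_run(losses, patience=1):
    """Return how many epochs run before stopping, given per-epoch losses."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # The monitored quantity improved: remember it, reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: stop once we have waited `patience` epochs
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses)

losses = [0.50, 0.20, 0.15, 0.15, 0.15, 0.15]
print(epochs_run(losses, patience=1))
```

With patience=1 the loop stops at the first epoch that fails to improve on the best loss seen so far; a larger patience tolerates more flat epochs before giving up.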
Now we can launch the training, using this callback to monitor its progress.
Our dataset has over 100k points so we can choose large batches.
The model stopped improving quite quickly. Feel free to experiment with other architectures and other
activation functions. Let’s see how our model is doing. We can generate the predictions on the test set by
running model.predict.
In [51]: plt.figure(figsize=(15,5))
plt.plot(y_test, label='y_test')
plt.plot(y_pred, label='y_pred')
plt.legend()
plt.title("True VS Pred Test set, Fully Connected");
They seem to overlap pretty well. Is it so? Let’s zoom in and watch more closely. We will do this by using the
plt.xlim function that sets the boundaries of the horizontal axis in a plot. Feel free to choose other values
in order to inspect other regions of the plot. Also notice that we have lost the date labels when we created
the data, but this is not a problem: we can always bring them back from the original series if we need them.
In [52]: plt.figure(figsize=(15,5))
plt.plot(y_test, label='y_test')
plt.plot(y_pred, label='y_pred')
plt.legend()
plt.xlim(1200,1300)
plt.title("Zoom True VS Pred Test set, Fully Connected");
Let’s measure the mean squared error (a.k.a. our loss) and the $R^2$ score on the test set. As seen in
Chapter 3, if the $R^2$ score is far from 1.0, that is a sign of a bad regression.
TIP: If you need a refresher about Mean Squared Error and $R^2$ score, how they are defined
and used, take a look at Chapter 3.
print("MSE: {:0.3}".format(mse))
print("R2: {:0.3}".format(r2))
MSE: 0.0149
R2: 0.933
In this case, however, the $R^2$ score is quite high, which would lead us to think the model is quite good.
If you inspect the graph closely, you will realize that the network has just learned to repeat the same value it
receives in input!
This is not forecasting at all; in other words, the model has no real predictive power. It behaves like a parrot
that repeats the previous hour’s value for the current hour. In this particular case, since the curve is varying smoothly, the
differences between one hour and the next are small and the $R^2$ score is still pretty close to 1. However, the
model is not anticipating any future value and so it would be quite useless for forecasting.
This behavior is not surprising. After all, the only input feature our model knew was the value of the time
series in the previous period, so it makes sense that the best it could do was to learn to repeat such value as
prediction for what would come next.
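The effect is easy to reproduce without any network. The sketch below builds a synthetic smooth series with a 24-step period (made-up data, not the Ontario demand), uses the previous value as the "prediction", and computes the mean squared error and the $R^2$ score from their usual definitions. The function names are hypothetical.

```python
import numpy as np

def mse_score(y_true, y_pred):
    """Mean of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def r2_score_np(y_true, y_pred):
    """One minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic smooth series: 30 "days" of 24 steps each
t = np.arange(24 * 30)
series = 1.0 + 0.5 * np.sin(2 * np.pi * t / 24)

y_true = series[1:]
y_parrot = series[:-1]   # the "prediction" is just the previous value
print(mse_score(y_true, y_parrot), r2_score_np(y_true, y_parrot))
```

On a curve this smooth the parrot scores an $R^2$ above 0.9 even though it anticipates nothing, which is why a high score alone is not evidence of forecasting power.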
Vanilla RNN
As we introduced, Recurrent Neural Networks are able to maintain an internal state using feedback loops.
Let’s see how we could build a simple RNN.
The Vanilla Recurrent Neural Network can be built as a fully connected Neural Network if we unroll the
time axis.
Ignoring the output of the network for the time being, let’s focus on the recurrent aspect. The network is
recurrent because its internal state $h$ at time $t$ is obtained by mixing the current input $x_t$ with the previous value
of the internal state $h_{t-1}$:
$$h_t = \tanh(w\,h_{t-1} + u\,x_t) \tag{7.1}$$
At each instant of time, the simple RNN is behaving as a fully connected network with two inputs: the
current input $x_t$ and the previous output $h_{t-1}$.
TIP: Notice that for now we are using a network with a single input and a single output, so
both x and h are numbers. Later we will extend the notation to networks with multiple
input and multiple recurrent units in a layer. As you will see the extension is quite simple.
Notice that only two weights are involved: the weight $w$ multiplying the previous value of the output and the
weight $u$ multiplying the current input. By the way, doesn’t this formula remind you of the
Exponentially Weighted Moving Average (or EWMA)?
TIP: we have already mentioned EWMA in Chapter 5 and it is explained in the appendix.
Just as a reminder, it’s a simple smoothing algorithm that follows the formula:
$$y_t = (1 - \alpha)\,y_{t-1} + \alpha\,x_t \tag{7.2}$$
It is not exactly the same, because there is a tanh and the two weights are independent, but it does look
similar: it’s a linear mixing of the past output with the present input, followed by a nonlinear activation
function.
Also notice that the weights do not depend on time. The network is learning the best values of its two
weights which are fixed in time.
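A minimal sketch of this recurrence for a single unit is shown below, assuming the update $h_t = \tanh(w\,h_{t-1} + u\,x_t)$ with no bias term, as described in the text. The function name and the input sequence are made up for illustration.

```python
import math

def simple_rnn_scalar(xs, w, u, h0=0.0):
    """Run h_t = tanh(w * h_{t-1} + u * x_t) over a sequence of numbers."""
    h = h0
    states = []
    for x in xs:
        # The same two weights are reused at every time step
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states

xs = [1.0, 0.0, 0.0, 0.0]   # a single impulse followed by zeros
states_demo = simple_rnn_scalar(xs, w=0.5, u=1.0)
print(states_demo)
```

After the impulse the state keeps shrinking at every step: with a recurrent weight below 1, the memory of a past input fades.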
We can build deep recurrent Neural Networks by stacking recurrent layers onto one another. We feed the
input to a first layer and then feed the output of that layer into a second layer and so on. Also we can add
multiple recurrent units in each layer. Each unit is receiving inputs from all the units in the previous layer
(or the input) as well as all the units in the same layer at the previous time:
If we have multiple layers we will need to make sure that an earlier layer returns the whole sequence of
outputs to the next layer. This is achieved in Keras using the return_sequences=True optional argument
when defining a layer. We will see an example of this in Exercise 1.
Keras implements Vanilla Recurrent Layers with the layers.SimpleRNN class. Let’s try it out on our
forecasting problem. First of all we import it.
Input shape
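Recurrent layers expect their input as a tensor of order three, with shape (batch_size, timesteps, input_dim),
but so far we have used only a tensor of order two for our data. Let’s stop for a second and think about how
to reshape our data, because there’s more than one way. Our input data right now has a shape of: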
In [56]: X_train.shape
Out[56]: (93551, 1)
So it’s like a matrix with a single column. We want to add an additional dimension to the tensor so that the
data is a tensor of order three. There are many ways of doing this, but a very simple one is to simply add a
None axis like this:
In [57]: X_train_t = X_train[:, None]
X_test_t = X_test[:, None]
y_train_t = y_train[:]
y_test_t = y_test[:]
In [58]: X_train_t.shape
Out[58]: (93551, 1, 1)
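The snippet below illustrates the None axis on a tiny array and compares it with two equivalent spellings. The array is a stand-in for X_train, not the real data.

```python
import numpy as np

X = np.arange(5.0).reshape(-1, 1)      # shape (5, 1), like X_train
X_none = X[:, None]                    # insert a new axis in position 1
X_newaxis = X[:, np.newaxis]           # np.newaxis is an alias for None
X_reshape = X.reshape(-1, 1, 1)        # explicit reshape to order three

print(X.shape, X_none.shape)
```

All three results have the same shape and contents, so the choice between them is a matter of taste.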
Good! We reshaped the data to have one additional axis as requested. Now let’s think about the batches. If
we randomly sample this data and give the model batches of 1 point in input with the corresponding label in
output the model will not leverage the fact that the data is part of a sequence and therefore produce results
that are very similar to the ones of the Fully Connected network.
Instead, we’d like to leverage the fact that all the data comes in a sequence. This can be done by feeding the
data sequentially to the network. In other words we don’t want to randomly sample batches from the
sequence, we want to feed the data one by one sequentially, while maintaining the state of the network
between one point and the next. This can be achieved by setting the stateful=True argument in the layer,
but it requires that the size of our data is exactly a multiple of the batch size.
Since we want to feed the points one by one we will choose a batch_size=1. Let’s do it!
Now let’s create a SimpleRNN with one layer with 6 nodes. This means that there are 6 recurrent units in our
layer. The principle is the same as above, only each of these units will receive a 6 dimensional vector as
recurrent input from the past, together with the single value of the actual input.
TIP: The number of nodes here is arbitrary. We could choose to put many more nodes, but
that would result in a bigger model which is slower to train. We have noticed that with 6
nodes results are already acceptable and hence we choose that value.
Notice that since we are using the stateful=True flag we will need to pass the batch_input_shape to
the first layer.
We will use the Adam optimizer (which is one of the most efficient and robust optimizers, as seen in Chapter
5), adopting a small value for the learning rate, since the SimpleRNN can sometimes be unstable.
In [60]: K.clear_session()
Now let’s build the model. We will pass a batch_input_shape=(1, 1, 1) because we read the data one
point at a time. Also we will set the input weights to one. We do this in order to reduce the variability in the
results obtained. As you’ll see, this model is quite unstable.
TIP: if you get a result that is very different from the one of the book, go ahead and
re-initialize the model. It may be just a case of bad luck with the starting point in the
minimization.
model.compile(optimizer=Adam(lr=0.0005),
loss='mean_squared_error')
Now we can fit the data. Since we are maintaining states between points, we shall pass the data in order using
the shuffle=False flag and batch_size=1. Also, we run the training for a single epoch. In our
experiments this should be sufficient to get decent results:
Epoch 1/1
93551/93551 [==============================] - 243s 3ms/step - loss: 0.2061
Let’s plot a small part of our predictive model to compare train and test.
Notice that despite our initialization, the model converges to different solutions at each training run.
• Sometimes you will get a graph that looks very similar to the Fully Connected result, with no predictivity at
all.
• Sometimes you will get a graph that looks noisy while being closer to the actual data in the sharp decays,
meaning some forecasting power is actually achieved.
• Sometimes the network will get stuck and give nonsense results.
Feel free to change the number of layers, nodes, optimizer and learning rate to see if you can get better
results. You will notice that this model is very prone to diverging away from a small value of the loss, which
is not ideal at all.
print("MSE: {:0.3}".format(mse))
print("R2: {:0.3}".format(r2))
MSE: 0.252
R2: -0.13
All in all this model does not seem to be much better than the Fully Connected one and it is also quite
unstable. The problem lies with the fact that the SimpleRNN actually has a short memory and cannot learn
really long-term patterns.
Let’s see why this happens and how we can fix it.
In order to fully understand how recurrent networks work and why our simple implementation fails we will
need a little bit of maths. As we suggested in Chapter 5, feel free to skip this section entirely if you
just want to get to the working model. You can always come back to it later on, if you are curious about how
a recurrent network works.
Vanishing Gradients
Let’s start from the equation of backpropagation through time. We will ignore the output of the network for
now and focus on the recurrent part. This is also called an encoder network, since the output is
discarded.
Encoder Network
This network is encountered in many cases, for example when solving many-to-1 problems like sentiment
analysis or asynchronous many-to-many problems like machine translation.
$$z_t = w\,h_{t-1} + u\,x_t \tag{7.3}$$
$$h_t = \phi(z_t) \tag{7.4}$$
We can now use the overline notation introduced in Chapter 5 to study the backpropagation through time.
If we assume to have already backpropagated from the output all the way back to the error signal $\overline{h}_T$, we can
write the backpropagation relations as:
$$\overline{h}_t = \overline{z}_{t+1}\,w \tag{7.6}$$
$$\overline{z}_t = \overline{h}_t\,\phi'(z_t) \tag{7.7}$$
Let’s focus our attention on $\overline{h}_t$, and let’s propagate back all the way to $\overline{h}_0$:
$$\overline{h}_0 = w\,\overline{z}_1 \tag{7.9}$$
$$= w\,\overline{h}_1\,\phi'(z_1) \tag{7.10}$$
$$= w^2\,\overline{h}_2\,\phi'(z_1)\,\phi'(z_2) \tag{7.11}$$
$$\dots = w^T\,\overline{h}_T\,\phi'(z_1)\,\phi'(z_2) \dots \phi'(z_T) \tag{7.12}$$
Now remembering the definition of $\overline{h} = \frac{\partial J}{\partial h}$ we can write:
$$\overline{h}_0 = \overline{h}_T\,\frac{\partial h_T}{\partial h_0} \tag{7.14}$$
which implies:
$$\frac{\partial h_T}{\partial h_0} = w^T\,\phi'(z_1)\,\phi'(z_2) \dots \phi'(z_T) \tag{7.15}$$
Now let’s stop for a second and focus on ϕ′ (z). For most activation functions (sigmoid, tanh, relu) this
quantity is bounded. This is easily seen looking at the graph of the derivative of these functions:
x = np.linspace(-10, 10, 1000)  # plotting range (an assumed choice)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def relu(x):
    cond = x > 0
    return cond * x
Let’s plot the sigmoid, the tanh, and the relu activation functions along with their derivatives.
In [66]: plt.figure(figsize=(12,8))
plt.subplot(321)
plt.plot(x, sigmoid(x))
plt.title('Sigmoid')
plt.subplot(322)
plt.plot(x[1:], np.diff(sigmoid(x))/np.diff(x))
plt.title('Derivative of Sigmoid')
plt.subplot(323)
plt.plot(x, np.tanh(x))
plt.title('Tanh')
plt.subplot(324)
plt.plot(x[1:], np.diff(np.tanh(x))/np.diff(x))
plt.title('Derivative of Tanh')
plt.subplot(325)
plt.plot(x, relu(x))
plt.title('Relu')
plt.subplot(326)
plt.plot(x[1:], np.diff(relu(x))/np.diff(x))
plt.title('Derivative of Relu')
plt.tight_layout()
All the derivatives take values between 0 and 1, i.e. they are bounded. We can use this fact to rewrite the last
equation as:
$$\frac{\partial h_T}{\partial h_0} = w^T\,\phi'(z_1)\,\phi'(z_2) \dots \phi'(z_T) \leq w^T \tag{7.16}$$
This means that the derivative of the last output with respect to the first output is less than or equal to $w^T$.
At this point the vanishing gradient problem should be evident. If $w < 1$ the propagation through time is
suppressed at each additional time step. This means that an input point that is 3 steps
back in time will contribute to the gradient with a term smaller than $w^3$. If, for example, $w = 0.1$, the
previous point will contribute with less than 10%, the one before with less than 1%, the one before that with less than
0.1% and so on. You can see that their contributions quickly disappear.
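The bound can be checked numerically. The sketch below, which assumes $\phi = \tanh$ and uses made-up weights and inputs, compares the analytic product $w^T \phi'(z_1) \dots \phi'(z_T)$ with a finite-difference estimate of the derivative of $h_T$ with respect to $h_0$, and prints $w^T$ next to them.

```python
import math

def run(h0, xs, w, u):
    """Return h_T and the pre-activations z_1..z_T for phi = tanh."""
    h, zs = h0, []
    for x in xs:
        z = w * h + u * x
        zs.append(z)
        h = math.tanh(z)
    return h, zs

w, u = 0.5, 1.0
xs = [0.3, -0.2, 0.1, 0.4, -0.1]
h_T, zs = run(0.0, xs, w, u)

# Analytic value: w^T times the product of tanh'(z_t) = 1 - tanh(z_t)^2
analytic = w ** len(xs)
for z in zs:
    analytic *= 1.0 - math.tanh(z) ** 2

# Central finite difference with respect to h_0
eps = 1e-6
numeric = (run(eps, xs, w, u)[0] - run(-eps, xs, w, u)[0]) / (2 * eps)

print(analytic, numeric, w ** len(xs))
```

The two estimates agree, and both stay below $w^T$ because every factor $\phi'(z_t)$ is at most 1.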
Let’s take a peek at what this looks like visually. First, let’s create a decay function that we’ll use to create our
plots.
plt.title("$w^T$")
plt.xlabel("Time steps T")
plt.legend(['w = {}'.format(w) for w in ws]);
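A decay function along these lines could look like the sketch below. The name decay, the list ws_demo and the range of time steps are assumptions for illustration and may differ from the values used to draw the figure.

```python
import numpy as np

def decay(w, T):
    """Upper bound w**T on the gradient after T time steps (for w >= 0)."""
    return w ** np.asarray(T, dtype=float)

ws_demo = [0.1, 0.4, 0.7, 0.9]   # illustrative recurrent weights
steps = np.arange(0, 21)
for w_demo in ws_demo:
    # Show the bound after 1, 5 and 20 time steps
    print(w_demo, decay(w_demo, steps)[[1, 5, 20]])
```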
The error signal quickly goes to zero if the recurrent weight is smaller than 1. This suggests that the recurrent
model is only able to capture short time dependencies, but longer dependencies are rendered useless.
Similarly, it can be shown that when $w$ is greater than a certain threshold, the gradient will exponentially
explode over time (notice that in the gradient we have $w^T$), rendering the backpropagation unstable.
It would appear as if we are stuck with a model that either does not converge at all or it quickly forgets about
the past. How can we solve this?
LSTMs were designed to overcome the problems of simple Recurrent networks by allowing the network to
store data in a sort of memory that can be accessed at later times. LSTM units are a bit more complicated
than the nodes we have seen so far, so let’s take our time to understand what this means and how they work.
Again, feel free to skip this section at first and come back to it later on.
We will start from an intuitive description of how LSTM works and we will gradually approach the
mathematical formulas. We will do this because the formulas for the LSTM can be daunting at first, so it is
good to break them down and learn them gradually.
At the core of the LSTM is the internal state $c_t$. This can be thought of as an internal conveyor belt that
carries information from one time step to the next. In the general case, this is a vector, with as many entries
as the number of units in the LSTM layer. The LSTM unit will store information in this vector and this
information will be available for retrieval later on.
At each time step the LSTM unit receives two inputs: the current input $x_t$ and the output from the previous
time step $h_{t-1}$. These two inputs are concatenated, in order to create a single set of input features.
For example, if the input vector has length 3 (i.e. there are 3 input features) and the output vector has length
2 (i.e. there are 2 output features), the concatenated vector has now 5 features, 3 coming from the input
vector and 2 coming from the output vector.
The next step is to apply 4 different simple Neural Network layers to these concatenated features along 4
parallel branches. Each branch takes a copy of the features and multiplies them by an independent set of
weights and a different activation function.
TIP: You may be wondering why 4 and not 3 or 5. The reason is simple: one branch is the
one that will process the data similarly to the Vanilla RNN, i.e. it will take the past and the
present, weight them and send them through a tanh activation function. The other three
branches will control operations that we call gates. As you will see, these gates control how
past and present information are recorded in the internal state. Other kinds of recurrent
units, like GRU, use a different number of gates, so 4 is specific to the LSTM architecture.
Notice that the weights here are actually weight matrices. The number of rows in the weight matrix
corresponds to the number of features, while the number of columns corresponds to the number of output
features, i.e. the number of nodes in the LSTM layer.
After the matrix multiplication with the weights, the results are passed through 4 independent nonlinear
activation functions.
Three of these are sigmoids, yielding the output vectors with values between 0 and 1. These 3 outputs take
the name of gates, because they control the flow of information. The last one is not a sigmoid, it is a tanh.
Let’s now look at the role of each of these nonlinear outputs. We start from the bottom one. This is called the
forget gate.
The role of the forget gate is to mediate how much of the internal state vector will be kept and passed
through to the future times. Since the value of this gate comes from a dense layer followed by a sigmoid, the
LSTM node is learning which fraction of the past data to retain and which fraction to forget.
Notice that the ⊙ operator implies we are multiplying $f_t$ and $c_{t-1}$ elementwise. This fact also means they are
vectors of the same length.
Let’s look at the gate mediated by $w_i$. This gate is the input gate and it mediates how much of the input to
keep. However, it’s not the plain input concatenated vector, it’s a vector that went through the tanh layer. We
call it $g_t$.
The resulting vector $i_t \odot g_t$ is added to the fraction of internal state that had been retained through the
forget gate.
The new internal state $c_t$ is the result of these two operations: forgetting a bit of the past state and adding
some new elements coming from the input and the past output.
Now that we have the update rules for the internal state, let’s see how the output state is calculated from the
internal state. One last tanh operation is applied to the internal state $c_t$.
Now let’s look at the output gate $o_t$. This gate mediates the output of the tanh, only allowing part of it to
pass through as the output of the unit.
There we go! It looks complex, but that’s because it’s one of the most complicated units in Neural Networks.
We’ve just dissected the LSTM block that has revolutionised our ability to tackle problems with long term
dependencies. For example, LSTM blocks have been successfully used to learn the structure of language, to
produce code from text descriptions, to translate between language pairs and so on.
For the sake of completeness we will write here the equations of the LSTM, though it’s not so important that
you learn them: Keras has them implemented in a conveniently available LSTM layer!
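As a rough sketch of those equations, the snippet below implements a single LSTM time step in NumPy, using the gate names from the text ($f_t$, $i_t$, $o_t$, $g_t$) and the sizes from the earlier example (3 input features, 2 units). The dictionary layout of the weights, the random initialization and the function names are assumptions made for illustration; Keras organizes its weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps a gate name to a matrix of shape (n_features + n_units, n_units),
    b maps a gate name to a bias vector of shape (n_units,).
    """
    v = np.concatenate([h_prev, x_t])     # concatenated features
    f = sigmoid(v @ W["f"] + b["f"])      # forget gate
    i = sigmoid(v @ W["i"] + b["i"])      # input gate
    o = sigmoid(v @ W["o"] + b["o"])      # output gate
    g = np.tanh(v @ W["g"] + b["g"])      # candidate values
    c_t = f * c_prev + i * g              # new internal state
    h_t = o * np.tanh(c_t)                # new output
    return h_t, c_t

rng = np.random.default_rng(0)
n_features, n_units = 3, 2
W = {k: rng.normal(size=(n_features + n_units, n_units)) for k in "fiog"}
b = {k: np.zeros(n_units) for k in "fiog"}

h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in rng.normal(size=(4, n_features)):   # a short random sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)
```

Each of the four branches sees the same 5 concatenated features but has its own weight matrix, and the products with the gates are elementwise.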
LSTM forecasting
Enough with math and theory! Let’s try to use an LSTM and see if we get a better result on our forecasting
problem.
In [70]: K.clear_session()
Now let’s build our model using the LSTM layer.
TIP: according to the documentation, the LSTM layer accepts many arguments. In this
case we create a layer with 6 recurrent nodes, like the 6 units we had in our SimpleRNN
layer, and we will set batch_input_shape=(1, 1, 1) and stateful=True, i.e. the last
state for each data point will be used as the initial state for the next data point.
Like we did above, we will use the Adam optimizer with a small learning rate and initialize the input weights
to one:
model = Sequential()
model.add(LSTM(6,
               batch_input_shape=(1, 1, 1),
               kernel_initializer='ones',
               stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error',
              optimizer=Adam(lr=0.0005))
Now let’s train our model. In doing so, we will use the X_train_t and y_train_t set for the training, the
already specified batch_size=1, because we feed the data one point at a time, and shuffle=False to pass
the data in order. We train the model for 2 epochs.
Epoch 1/2
93551/93551 [==============================] - 349s 4ms/step - loss: 0.0407
Epoch 2/2
74185/93551 [======================>...] - ETA: 1:12 - loss: 0.0156
Notice that the LSTM takes much longer to train than SimpleRNN. This is because it has many more weights
to adjust. In a future chapter we will learn how to use GPUs in order to speed up the training.
To examine the effectiveness of our model, and like we did before, we can plot a small part of the time series
and compare our predictions with the true values:
This should look better than what we have obtained previously, but even in this case we see that the ability
of the network to forecast is limited. As done for the other models, let’s also check the Mean Squared Error
and the $R^2$ score, for an objective evaluation of the error:
print("MSE: {:0.3}".format(mse))
print("R2: {:0.3}".format(r2))
MSE: 0.013
R2: 0.942
Improving forecasting
In all the models used so far we fed the data sequentially to our recurrent unit, one point at a time. This is
not the only way in which we can train a recurrent layer. We can also use the rolling windows approach.
Instead of taking a single previous point as input, we can use a set of points, going back in time for a
window. This will allow us to feed data to the network in larger batches, speeding up training and hopefully
improving convergence.
We will reformat our input tensor X to have the following shape: (N_windows, window_len, 1). By
doing this we treat the time series as if it were composed of many independent windows of fixed length and
we can treat each window as an individual data point. This has the advantage of allowing us to shuffle the
windows in our train and test data.
Let’s start by defining the window size. We’ll take a window of 24 periods, i.e. the data from the previous
day. You can always adjust this later on if you wish:
Rolling windows
In [75]: window_len = 24
Next we’ll use the .shift method of a pandas DataFrame to create lagged copies of our original time series.
Note that we will start from the train_sc and test_sc vectors we’ve defined earlier. Let’s double check
that they still contain what we need:
In [76]: train_sc.head()
Out[76]:
Total Ontario
2003-05-01 01:00:00 0.7404
2003-05-01 02:00:00 0.7156
2003-05-01 03:00:00 0.6822
2003-05-01 04:00:00 0.7002
2003-05-01 05:00:00 0.8020
To create the lagged data we define a helper function create_lagged_Xy_win that creates an input matrix
X with lags going from start_lag to start_lag + window_len - 1 and an output vector y with the
unaltered values.
Let’s do it:
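In [77]: def create_lagged_Xy_win(data, start_lag, window_len=1):
             # First lagged column: the series shifted by start_lag steps
             X = data.shift(start_lag).copy()
             X.columns = ['T_{}'.format(start_lag)]
             if window_len > 1:
                 for s in range(1, window_len):
                     col_ = 'T_{}'.format(start_lag + s)
                     X[col_] = data.shift(start_lag + s)
             # Drop the initial rows that have no complete history
             X = X.dropna()
             idx = X.index
             y = data.loc[idx]
             return X, y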
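To see what the helper produces, the sketch below runs it on a six-point toy series. The function body is copied from the text; its signature (argument order and the default for window_len) is an assumption, since the def line is not shown, and the toy data is made up.

```python
import pandas as pd

def create_lagged_Xy_win(data, start_lag=1, window_len=1):
    # Body as in the text; the signature is assumed
    X = data.shift(start_lag).copy()
    X.columns = ['T_{}'.format(start_lag)]
    if window_len > 1:
        for s in range(1, window_len):
            col_ = 'T_{}'.format(start_lag + s)
            X[col_] = data.shift(start_lag + s)
    X = X.dropna()
    idx = X.index
    y = data.loc[idx]
    return X, y

# Single-column toy frame standing in for train_sc
toy = pd.DataFrame({'value': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
X_toy, y_toy = create_lagged_Xy_win(toy, start_lag=1, window_len=3)
print(X_toy)
print(y_toy)
```

The first three rows are dropped because they lack a full window of history, and in each remaining row T_1 holds the most recent past value while T_3 holds the oldest.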
Now we use the function on the train and test data. We will use start_lag=1 so that we can compare the
results with our previous results:
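In [78]: start_lag=1
window_len=24
X_train, y_train = create_lagged_Xy_win(train_sc, start_lag=start_lag, window_len=window_len)
X_test, y_test = create_lagged_Xy_win(test_sc, start_lag=start_lag, window_len=window_len)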
In [79]: X_train.head()
Out[79]:
T_1 T_2 T_3 T_4 T_5 T_6 T_7 T_8 T_9 T_10 T_11 T_12 T_13 T_14 T_15 T_16 T_17 T_18 T_19 T_20 T_21 T_22 T_23 T_24
2003-05-02 01:00:00 0.8694 1.0414 1.3096 1.5408 1.6008 1.5228 1.5486 1.6096 1.6290 1.6036 1.5976 1.6236 1.6242 1.6342 1.6382 1.6074 1.5536 1.3524 1.0226 0.8020 0.7002 0.6822 0.7156 0.7404
2003-05-02 02:00:00 0.7742 0.8694 1.0414 1.3096 1.5408 1.6008 1.5228 1.5486 1.6096 1.6290 1.6036 1.5976 1.6236 1.6242 1.6342 1.6382 1.6074 1.5536 1.3524 1.0226 0.8020 0.7002 0.6822 0.7156
2003-05-02 03:00:00 0.7218 0.7742 0.8694 1.0414 1.3096 1.5408 1.6008 1.5228 1.5486 1.6096 1.6290 1.6036 1.5976 1.6236 1.6242 1.6342 1.6382 1.6074 1.5536 1.3524 1.0226 0.8020 0.7002 0.6822
2003-05-02 04:00:00 0.6914 0.7218 0.7742 0.8694 1.0414 1.3096 1.5408 1.6008 1.5228 1.5486 1.6096 1.6290 1.6036 1.5976 1.6236 1.6242 1.6342 1.6382 1.6074 1.5536 1.3524 1.0226 0.8020 0.7002
2003-05-02 05:00:00 0.7018 0.6914 0.7218 0.7742 0.8694 1.0414 1.3096 1.5408 1.6008 1.5228 1.5486 1.6096 1.6290 1.6036 1.5976 1.6236 1.6242 1.6342 1.6382 1.6074 1.5536 1.3524 1.0226 0.8020
In [80]: y_train.head()
Out[80]:
Total Ontario
2003-05-02 01:00:00 0.7742
2003-05-02 02:00:00 0.7218
2003-05-02 03:00:00 0.6914
2003-05-02 04:00:00 0.7018
2003-05-02 05:00:00 0.7904
As you can see, to predict the value 0.7742 that appears in y at 2003-05-02 01:00:00, in X we have the
previous values, going back in time from 0.8694 (previous hour) to 1.0414 (two hours before) and so on.
In order to feed this data to a recurrent model we need to reshape it as a tensor of order three with the shape
(batch_size, timesteps, input_dim). We are still dealing with a univariate time series, so
input_dim=1, while timesteps is going to be 24, the number of timesteps in the window. This is easy to do
using the .reshape method from numpy.
We will get numpy arrays using the .values attribute. We have already checked that the data is shifted
correctly, so it’s not a problem to throw away the index and the column names:
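In [81]: X_train_t = X_train.values.reshape(-1, window_len, 1)
X_test_t = X_test.values.reshape(-1, window_len, 1)
y_train_t = y_train.values
y_test_t = y_test.values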
In [82]: X_train_t.shape
Yes! We have correctly reshaped the tensor. Note here that if we had multiple time series, we could have
bundled them together in an input vector along the last axis.
Let’s build a new recurrent model. This time we will not need to use the stateful=True directive because
some history is already included in the input data. For the same reason we will use input_shape instead of
batch_input_shape.
Also, since we will use batches of more than one point, and each point contains a lot of history, the model
convergence will be a lot more stable. Therefore we can increase the learning rate a lot without risking that
the model becomes unstable.
In [83]: K.clear_session()
model = Sequential()
# 6 LSTM units reading windows of shape (window_len, 1), as described in the text
model.add(LSTM(6, input_shape=(window_len, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error',
              optimizer=Adam(lr=0.05))
Let’s go ahead and train our model using a batch of size 256 for 5 epochs. This may take some time. Later in
the book we will learn how to speed it up using GPUs. For now take advantage of this time with a little
break. You deserve it!
Epoch 1/5
93528/93528 [==============================] - 13s 134us/step - loss: 0.0709
Epoch 2/5
93528/93528 [==============================] - 12s 128us/step - loss: 0.0595
Epoch 3/5
93528/93528 [==============================] - 12s 128us/step - loss: 0.0594
Epoch 4/5
93528/93528 [==============================] - 12s 129us/step - loss: 0.0593
Epoch 5/5
93528/93528 [==============================] - 12s 129us/step - loss: 0.0588
Let’s generate the predictions and compare them with the actual values:
This model trained considerably faster than the previous ones and its predictions should look much better
than the previous models. First of all the model seems to have learned the temporal pattern much better than
the other models: it’s not simply repeating the input like a parrot, it’s genuinely trying to predict the future.
Also, the curves look quite close to one another, which is a great sign!
TIP: Try to re-initialize and re-train the model if the loss of your model does not reach 0.05
and the above figure does not look like this:
One problem with recurrent models is that they tend to get stuck in local minima and be
sensitive to initialization. Also, keep in mind that we chose only 6 units in this network,
which is probably small for this problem.
Conclusion
Well done! You have completed the chapter on Time Series and Recurrent Neural Networks. Let’s recap
what we have learned.
• We learned how to classify time series of a fixed length using both fully connected and convolutional
Neural Networks
• We learned about recurrent Neural Networks and about how they allow us to approach new problems
with sequences, including generating a sequence of arbitrary length and learning from sequences of
arbitrary length
• We trained a fully connected network to forecast future values in a sequence
• We took a deep dive into recurrent Neural Networks, in particular into the Long Short-Term
Memory network, to see what advantages they bring
• Finally we trained an LSTM model to forecast values using both a single point as well as a window of
past data
In the exercises we will explore a couple of extensions of what we have done and we will try to predict the
price of Bitcoin from its historical value!
Exercises
Exercise 1
Your manager at the power company is quite satisfied with the work you’ve done predicting the electric load
of the next hour and would like to push it further. He is curious to know if your model can predict the load
on the next day or even on the next week instead of the next hour.
• Go ahead and use the helper function create_lagged_Xy_win we created above to generate new X
and y pairs where the start_lag is 36 hours or even further. You may want to extend the window
size to a little longer than a day.
• Train your best model on this data. You may have to use more than one layer. In which case,
remember to use the return_sequences=True argument in all layers except for the last one so that
they pass sequences to one another.
• Check the goodness of your model by comparing it with test data as well as looking at the $R^2$ score.
Exercise 2
Gated Recurrent Units (GRU) are a more modern and simpler implementation of a cell that retains longer-term
memory.
Keras makes them available in keras.layers.GRU. Try swapping the LSTM layer with a GRU layer and
re-train the model. Does its performance improve on the 36 hours lag task?
Exercise 3
Does a fully connected model work well using windows? Let’s find out! Try to train a fully connected model
on the lagged data with windows, which will probably train much faster:
• reshape the input data back to an Order-2 tensor, i.e. eliminate the 3rd axis
• build a fully connected model with one or more layers
• train the fully connected model on the windowed data. Does it work well? Is it faster to train?
Exercise 4
You have heard so much talk about Bitcoin and how it is growing that you decide to put your newly
acquired Deep Learning skills to the test by trying to beat the market. The idea is simple: if we could predict
what Bitcoin is going to do in the future, we could trade and profit using that knowledge.
The simplest formulation of this forecasting problem is to try to predict if the price of Bitcoin is going to
go up or down in the future, i.e. we can frame the problem as a binary classification that answers the
question: is Bitcoin going up?
• Check out the data using df.head(). Notice that the dataset contains the close, high, low, open
for 30-minute intervals, which means: the last, highest, lowest and first amounts of US Dollars
people were willing to exchange Bitcoin for during those 30 minutes. The dataset also contains
Volume values, that we shall ignore, and a weighted average value, which is what we will use to
build the labels.
• Convert the date column to a datetime object using pd.to_datetime and set it as index of the
DataFrame.
• Plot the value of df['close'] to inspect the data. You will notice that it’s not periodic at all and it has
an overall enormous upward trend, so we will need to transform the data into a more stationary
timeseries. We will use percentage changes, i.e. we will look at relative movements in the price
instead of absolute values.
• Create a new dataset df_percent with percent changes using the formula:
$$v_t = 100 \times \frac{x_t - x_{t-1}}{x_{t-1}} \qquad (7.25)$$
this is what we will use next.
• Inspect df_percent and notice that it contains both infinity and nan values. Drop the null
values and replace the infinity values with zero.
• Split the data at January 1st 2017, using the data before then as training and the data after that as
test.
• Use the window method to create an input training tensor X_train_t with the shape
(n_windows, window_len, n_features). This is the main part of the exercise, since you’ll have to
make a few choices and be careful not to leak information from the future. In particular you will
have to:
– decide the window_len you want to use
– decide which features you’d like to use as input (don’t use weightedAverage, since we’ll
need it for the output).
– decide what lag you want to introduce between the last timestep in your input window and
the timestep of the output.
– You can start from the create_lagged_Xy_win function we defined in Chapter 7, but you
will have to modify it to work with numpy arrays because Pandas DataFrames are only good
with 1 feature.
• Create a binary outcome variable that is 1 when train['weightedAverage'] >= 0 and 0
otherwise. This is going to be our label.
• Repeat the same operations on the test data
• Create a model to work with this data. Make sure the input layer has the right input_shape and
the output layer has 1 node with a Sigmoid activation function. Also make sure to use the
binary_crossentropy loss and to track the accuracy of the model.
• Train the model on the training data
• Test the model on the test data. Is the accuracy better than a baseline guess? Are you going to be
rich?
Again disclaimer: past performance is no guarantee of future results. This is not investment
advice.
8 Natural Language Processing and Text Data
In this chapter we will learn a few techniques to approach problems involving text. This is a very important
topic since text data is very common.
We will start by introducing text data and some use cases of Machine Learning and Deep Learning applied
to text prediction. Then we will explore the traditional approach to text problems: the Bag of Words (BOW)
approach.
This topic will take us to explore how to extract features from text. We will introduce new techniques to do
this as well as a couple of new Python packages specifically designed to deal with text data.
We will explore the limitations of the BOW approach and see how Neural Networks can help to overcome
them. In particular we will look at embeddings to encode text and at how they can be used in Keras. Let’s
get started!
Use cases
As noted in the introduction, text data is encountered in many applications. Let’s take a look at a few of
them. Spam Detection is probably the one we are all familiar with. It is a text classification problem where
we try to distinguish legitimate documents from extraneous documents.
Spam detection can be applied to email spam, SMS spam, IM spam and in general any corpus of messages.
The problem is usually presented as a binary classification where one has two sets of documents: the spam
messages and the “ham” messages, i.e. the legitimate messages that we would like to keep.
Imagine you are a rock star tweeting about your latest album. Millions of people will reply to your tweet and
it will be impossible for you to read all of the messages from your fans. You would like to capture the overall
sentiment of your fan base and see if they are happy about what you tweeted.
Sentiment analysis does that by classifying a piece of text as positive or negative in regard to the overall
sentiment. If you know the sentiment for each tweet it’s easy to draw conclusions like: “74% of your fans
responded positively to your tweet”.
Sentiment Analysis is widely applied to many fields including stock trading, e-commerce reviews, customer
service and in general any website or application where users are allowed to submit free-form text
comments.
Text problems
Extending beyond classification problems, we can consider regression problems involving text, for example
extracting a score, a price, or any other metric starting from a text document. An example of this would be
estimating the number of followers your tweet will generate based on its text content or predicting the
number of downloads your application will get based on the content of a blog article.
All the above problems are traditional Machine Learning problems where text is the input to the problem.
Text can also be the output of a Machine Learning problem. For example, Machine Translation involves
converting text from one language to another. It is a supervised, many-to-many, sequence learning problem,
where pairs of sentences in two languages are fed to a model that learns to generate the output sequence (for
example a sentence in English), given a certain input sequence (the corresponding sentence in Italian).
Machine translation is an example of a whole category of Machine Learning problems involving text:
problems involving automatic text generation. Another famous example in this category is that of
Language Modeling.
In Language Modeling a corpus of documents (see next section for a proper definition) is fed sequentially to
a model. The model will learn the probability distribution of a certain word to appear after a sentence. The
model is then sampled randomly and is capable of producing sentences that resemble the properties of the
corpus. Using this approach people had models produce new sonnets from Shakespeare, new chapters of
Harry Potter, new episodes of popular novels and so on.
Since Language Modeling works on sequences, we can also build character level models that learn the
syntax of our input corpus. In this way we can produce syntactically accurate markup languages like HTML,
Wiki, LaTeX and even C! See the wonderful article by Andrej Karpathy for a few examples of this.
It is clear that text is involved in many useful applications. So let’s see how to prepare text documents for
Machine Learning.
Text Data
Text data is usually a collection of articles or documents. Linguists call this collection a corpus to indicate
that it’s coherent and organized. For example we could be dealing with the corpus of patents from our
company or with a corpus of articles from a news platform.
The first thing we are going to learn is how to load text data using Scikit-Learn. We will build a simple Spam
detector to separate SMS spam from legitimate SMS messages. The data comes from the UCI SMS
Spam collection, but it has been re-organized and re-compressed.
sms
|-- ham
| |-- msg_000.txt
| |-- msg_001.txt
| |-- msg_003.txt
| +-- ...
|
+-- spam
|-- msg_002.txt
|-- msg_005.txt
|-- msg_008.txt
+-- ...
First, let’s import the zipfile package from Python so that we can extract the data into folders:
zipfile lets us work with zipped folders directly from our workspace. Have a look at the documentation for
further details. Here we use it to extract the data so that we can load it later:
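As an illustration of the extraction step, here is a minimal sketch that uses only the standard library. The archive name `sms_demo.zip` and the temporary folders are hypothetical stand-ins created on the fly, not the chapter's actual files:

```python
import os
import tempfile
import zipfile

# Build a tiny throwaway archive so the example is self-contained.
# The file name below is hypothetical; the chapter uses its own dataset archive.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, 'sms_demo.zip')

with zipfile.ZipFile(archive_path, 'w') as zf:
    zf.writestr('sms/ham/msg_000.txt', 'see you at lunch')
    zf.writestr('sms/spam/msg_002.txt', 'free entry, text now')

# Extract every member of the archive into the target folder.
target_dir = os.path.join(workdir, 'data')
with zipfile.ZipFile(archive_path, 'r') as zf:
    zf.extractall(target_dir)

extracted = sorted(os.listdir(os.path.join(target_dir, 'sms')))
print(extracted)
```

Applied to the chapter's dataset archive, the same kind of call would produce the sms folder discussed next.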
This last operation created a folder called sms inside the data folder. Let’s look at its content. The os
module contains many functions to interact with the host system. Let’s import it:
In [3]: import os
And let’s use the command os.listdir to look at the content of the folder:
In [4]: os.listdir('../data/sms')
As expected there are two subfolders: ham and spam. We can count how many files they contain with the
help of the following little function that lists the content of path and uses a filter to only count files.
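A minimal sketch of such a function could look like the following. The name `count_files` and the temporary demo folder are assumptions made for illustration:

```python
import os
import tempfile

def count_files(path):
    """Count regular files (not subfolders) directly inside `path`."""
    entries = os.listdir(path)
    # Keep only the entries that are files, then count them.
    files = filter(lambda name: os.path.isfile(os.path.join(path, name)),
                   entries)
    return len(list(files))

# Small self-check on a temporary folder with two files and one subfolder.
demo_dir = tempfile.mkdtemp()
for name in ('msg_000.txt', 'msg_001.txt'):
    with open(os.path.join(demo_dir, name), 'w') as f:
        f.write('hello')
os.mkdir(os.path.join(demo_dir, 'subfolder'))

n_files = count_files(demo_dir)
print(n_files)
```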
Let’s use this function to count the number of files in the folders:
Out[7]: 4825
Out[8]: 747
We have 4825 ham files and 747 spam files. We can use these numbers to establish a baseline for our
classification efforts:
If we always predicted the larger class, i.e. we never predicted spam, we would be correct 86.6% of the time.
Our model needs to score higher than that to be of any help.
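The baseline follows directly from the two file counts. A short sketch of the arithmetic (the variable names are ours):

```python
n_ham = 4825   # files counted in the ham folder
n_spam = 747   # files counted in the spam folder

# Accuracy of a classifier that always predicts the larger class (ham).
baseline_acc = n_ham / (n_ham + n_spam)
print('Baseline accuracy: {:.3f}'.format(baseline_acc))
```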
Let’s also look at a couple of examples of our messages for each class:
In [11]: read_file('../data/sms/ham/msg_000.txt')
Out[11]: 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Ci
In [12]: read_file('../data/sms/ham/msg_001.txt')
In [13]: read_file('../data/sms/spam/msg_002.txt')
Out[13]: "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 t
In [14]: read_file('../data/sms/spam/msg_005.txt')
Out[14]: "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun y
As expected, spam messages look quite different from ham messages. In order to start building a spam
detection model, let’s first load all the data into a dataset.
Scikit Learn offers a function to load text data from folders for classification purposes. Let’s use the
load_files function from sklearn.datasets package:
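To show how load_files maps folders to classes, here is a small sketch on a temporary folder tree that mimics the ham/spam layout. The messages, the folder location and the variable name `demo` are made up for the example:

```python
import os
import tempfile
from sklearn.datasets import load_files

# Build a tiny folder tree with one subfolder per class.
root = tempfile.mkdtemp()
samples = {'ham': ['see you at lunch', 'ok, call me later'],
           'spam': ['free entry, text now']}
for label, messages in samples.items():
    os.makedirs(os.path.join(root, label))
    for i, text in enumerate(messages):
        fname = os.path.join(root, label, 'msg_{:03d}.txt'.format(i))
        with open(fname, 'w') as f:
            f.write(text)

# Each subfolder name becomes a class; `encoding` decodes bytes to str.
demo = load_files(root, encoding='utf-8')
print(demo.target_names)
print(len(demo.data), len(demo.target))
```

The returned object exposes the documents in `.data` and the integer class of each document in `.target`, with the class names listed in `.target_names`.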
In [17]: type(data)
Out[17]: sklearn.utils.Bunch
In [18]: data.keys()
Following the documentation, let’s assign the data.data and data.target to two variables docs and y.
In [20]: docs[:5]
Out[20]: ['Hi Princess! Thank you for the pics. You are very pretty. How are you?',
"Hello my little party animal! I just thought I'd buzz you as you were with your frien
'And miss vday the parachute and double coins??? U must not know me very well...',
'Maybe you should find something else to do instead???',
'What year. And how many miles.']
In [21]: y = data.target
In [22]: y[:5]
Before we do anything else, let’s save the data we have loaded as a DataFrame, just in case we need to reload
it later. As usual we import our common files:
Out[25]:
message spam
0 Hi Princess! Thank you for the pics. You are v... 0
1 Hello my little party animal! I just thought I... 0
2 And miss vday the parachute and double coins??... 0
3 Maybe you should find something else to do ins... 0
4 What year. And how many miles. 0
Pandas lets us save a DataFrame to a variety of different formats, including Excel, CSV and JSON. We will
export to CSV:
In [26]: df.to_csv('../data/sms_spam.csv',
index=False,
encoding='utf8')
A Machine Learning algorithm is not able to deal with text as it is. Instead, we need to extract features from
the text!
Let’s begin with a naive solution and gradually build up to a more complex one. The simplest way to build
features from text is to use the counts of certain words that we assume to carry information about the
problem.
For example, spam messages often offer something for free or give a link to some service. Since these are
SMS messages, this link will likely be a number. With these two ideas in mind, let’s build a very simple
classifier that uses only two features:
Notice that our text contains uppercase and lowercase words, so as a preprocessing step let’s convert
everything to lowercase so we don’t include meaningless features.
In [28]: docs_lower[:5]
Out[28]: ['hi princess! thank you for the pics. you are very pretty. how are you?',
"hello my little party animal! i just thought i'd buzz you as you were with your frien
'and miss vday the parachute and double coins??? u must not know me very well...',
'maybe you should find something else to do instead???',
'what year. and how many miles.']
We can define a simple helper function that counts the occurrences of a particular word in a sentence:
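One possible version of this helper is sketched below. The function name `count_word` and the toy messages are assumptions, and the split on single spaces is a deliberately naive tokenization:

```python
def count_word(word, sentence):
    """Count how many space-separated tokens of `sentence` equal `word`."""
    tokens = sentence.split(' ')
    return len([t for t in tokens if t == word])

docs_demo = ['free entry in a free comp', 'hi princess! thank you']
free_counts = [count_word('free', d) for d in docs_demo]
print(free_counts)
```

Applying such a helper to every lowercase message yields a column of counts like the free column shown next.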
In [31]: df.head()
Out[31]:
free
0 0
1 0
2 0
3 0
4 0
Similarly let’s build a helper function that counts the numerical characters in a sentence using the re package:
In [32]: import re
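A minimal sketch of a digit counter based on a regular expression follows. The name `count_numbers` and the sample sentences are hypothetical:

```python
import re

def count_numbers(sentence):
    """Count the numerical characters (0-9) in `sentence`."""
    # findall returns one match per digit, so its length is the digit count.
    return len(re.findall('[0-9]', sentence))

samples = ['text fa to 87121 now', 'what year. and how many miles.']
digit_counts = [count_numbers(s) for s in samples]
print(digit_counts)
```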
In [35]: df.head()
Out[35]:
free num_char
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
Spam classification
Notice that most messages don’t contain our special features, so we don’t expect any model to work super
well in this case, but let’s try to build one anyways. First, let’s import the train_test_split function from
sklearn as well as the usual Sequential model and the Dense layer:
Now let’s define the helper function that follows the usual process that we repeated several times in the
previous chapters:
• Train/test split
• Model definition
• Model training
• Model evaluation on test set
We will use a simple Logistic Regression model to start, to make things simple and quick:
if not model:
    model = Sequential()
    model.add(Dense(1, input_dim=X.shape[1],
                    activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
h = model.fit(X_train, y_train,
              epochs=epochs,
              verbose=0)
Executing the function with our values, we’ll capture the result in a variable we’ll call res:
Despite our initial skepticism, this dataset is easy to separate! In fact, it is so easy that two very simple
features (the counts of the word free and the count of numerical characters) already achieve a much better
accuracy score than the baseline, which was 0.866.
We can extend the simple approach of the previous model in a few ways:
• We could build a vocabulary with more than just one word, and build a feature for each of them
which counts how many times that word appears.
• We could filter out common English words.
Scikit Learn has a transformer that does exactly these two things: it’s called CountVectorizer. Let’s
import it from sklearn.feature_extraction.text:
Let’s plan on using the top 3000 most common words in the corpus; this is going to be our vocabulary size:
Then we can initialize the vectorizer. Here we have to use the additional argument
stop_words='english' that tells the vectorizer to ignore common English stop words. We do this
because we are ranking features starting from the most common word. If we didn’t ignore common words,
we would end up with word features like “if ”, “and”, “of ” etc. at the top of our list, since these words are just
very common in the English language. However, these words do not carry much meaning about spam and
by ignoring them we get word features that are more specific to our corpus.
We are also going to ignore decoding errors using the decode_error='ignore' argument:
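To make the effect of these arguments concrete, here is a small sketch on a three-message toy corpus. The corpus, the tiny `max_features=3` and the `_demo` variable names are ours; the chapter uses a vocabulary of 3000 words on the real messages:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Free prize, claim your free prize now',
          'Are you coming home later',
          'Claim the prize and win']

# Keep only the 3 most frequent terms after dropping English stop words.
vect_demo = CountVectorizer(decode_error='ignore',
                            stop_words='english',
                            max_features=3)
X_demo = vect_demo.fit_transform(corpus)

# vocabulary_ maps each kept term to its column index.
vocab_demo = sorted(vect_demo.vocabulary_)
print(vocab_demo)
print(X_demo.toarray())
```

Words such as "your", "the" and "and" never make it into the vocabulary, so the three available slots go to terms that are specific to the corpus.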
Notice that it also allows for automatic lowercase conversion. You can check which stop words are used with
the .get_stop_words() method. Let’s look at a few of them:
In [44]: stop_words[:10]
Out[44]: ['ourselves',
'seemed',
'cannot',
'across',
'those',
'not',
'twenty',
'whole',
'were',
'with']
Now that we have created the vectorizer, let’s apply it to our corpus:
In [45]: X = vect.fit_transform(docs)
X
X is a sparse matrix i.e. a matrix in which most of the elements are 0. This makes sense since most messages
are short and they will only contain a few of the 3000 words in our feature list. The X matrix has 5572 rows
(i.e. the total number of sms) and 3000 columns (i.e. the total number of selected words) but only 37142
non-zero entries (less than 1% of the total).
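We can verify how sparse the matrix is with a quick calculation based on the numbers quoted above (the variable names are ours):

```python
n_rows, n_cols = 5572, 3000   # messages x vocabulary size
n_nonzero = 37142             # non-zero entries reported for X

density = n_nonzero / (n_rows * n_cols)
print('Non-zero fraction: {:.2%}'.format(density))
```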
In order to use it for Machine Learning we will convert it to a dense matrix, which we can do by calling
todense() on the object:
In [46]: Xd = X.todense()
TIP: be careful with converting sparse matrices to dense. If you are dealing with large
datasets you will quickly run out of memory with all those zeros. In those cases we do
on-the-fly conversion to dense of each batch during Stochastic Gradient Descent.
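The on-the-fly conversion mentioned in the tip can be sketched with a simple generator. The function name `dense_batches` and the random toy matrix are assumptions for illustration, and wiring such a generator into a training loop is left out:

```python
import numpy as np
from scipy import sparse

def dense_batches(X_sparse, y, batch_size):
    """Yield (X_batch, y_batch) pairs, densifying one batch at a time."""
    n_samples = X_sparse.shape[0]
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        # Only this slice is converted, so memory use scales with batch_size.
        yield X_sparse[start:end].toarray(), y[start:end]

# Toy data: 10 samples, 50 features, about 5% non-zero entries.
X_toy = sparse.random(10, 50, density=0.05, format='csr', random_state=0)
y_toy = np.arange(10) % 2

batch_shapes = [xb.shape
                for xb, yb in dense_batches(X_toy, y_toy, batch_size=4)]
print(batch_shapes)
```

Only one dense batch lives in memory at any time, while the full dataset stays in its compact sparse form.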
In [48]: vocab[:10]
Out[48]: ['00',
'000',
'02',
'0207',
'02073162414',
'03',
'04',
'05',
'06',
'07123456789']
In [49]: vocab[-10:]
Out[49]: ['yogasana', 'yor', 'yr', 'yrs', 'yummy', 'yun', 'yunny', 'yuo', 'yup', 'zed']
Let’s use the helper function we’ve defined above to train a model on the new features:
The accuracy on the test set is not much higher than that of our simple model. However, we can use this model to
look for feature importances, i.e. to identify words whose weight is high when predicting spam or not. Let’s
recover the trained model from the res object returned by our custom function:
Then let’s put the weights in a Pandas Series, indexed by the vocabulary:
In [53]: w_ = model.get_weights()[0].ravel()
vocab_weights = pd.Series(w_, index=vocab)
In [54]: vocab_weights.sort_values(ascending=False).head(20)
Out[54]:
0
txt 0.567846
claim 0.529718
free 0.529599
18 0.520037
uk 0.514894
mobile 0.508967
reply 0.499052
www 0.497820
150p 0.491184
service 0.486691
1000 0.440858
prize 0.437020
stop 0.414112
com 0.406652
50 0.405576
urgent 0.393387
rate 0.387784
16 0.386934
ringtone 0.380585
text 0.369661
Not surprisingly, we find words like www, claim, prize, free etc. Similarly we can look at the bottom
20 words:
In [55]: vocab_weights.sort_values(ascending=False).tail(20)
Out[55]:
0
lt -0.448238
lol -0.451866
tell -0.455773
doing -0.459791
lor -0.470848
need -0.476604
home -0.487474
later -0.488070
yeah -0.490425
think -0.491266
like -0.503484
oh -0.504350
good -0.512226
sorry -0.516341
got -0.517523
going -0.522831
come -0.550252
da -0.552738
ll -0.581713
ok -0.634840
and see that they are pretty common legitimate words like sorry, ok, lol, etc. If we were spammers we could
take advantage of this information and craft messages that attempt to fool these simple features by using a
lot of words like sorry or ok. This is a typical Adversarial Machine Learning scenario, where the target is
constantly trying to beat the model.
In any case, it’s pretty clear that this dataset is an easy one. So let’s load a new dataset and learn a few more
tricks!
Word frequencies
In the previous spam classification problem we used a CountVectorizer transformer from Scikit Learn to
produce a sparse matrix with term counts of the top 3000 words. Using the absolute counts was ok because
the corpus was formed by sms messages, whose length is capped at 160 characters. In the general case, using
absolute counts may be a problem if we deal with documents of uneven length. Think of a brief email versus
a long article, both about the topic of AI. The word AI will appear in both documents, but it will likely be
repeated more times in the long article. Using the counts would lead us to think that the article is more
about AI than the short email, while it’s simply a longer text. We can account for that using the term
frequency instead of the count, i.e. by dividing the counts by the length of the document. Using term
frequencies is already an improvement, but we can do even better.
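The difference between counts and term frequencies is easy to see on a toy example. The two made-up documents and the helper name `term_frequency` below are ours:

```python
def term_frequency(word, document):
    """Occurrences of `word` divided by the number of tokens in `document`."""
    tokens = document.lower().split()
    return tokens.count(word) / len(tokens)

short_email = 'ai is great'
long_article = 'ai is great ' + 'and this long article mentions ai again ' * 3

# Raw counts favor the longer text...
count_short = short_email.split().count('ai')
count_long = long_article.split().count('ai')

# ...while term frequencies account for document length.
tf_short = term_frequency('ai', short_email)
tf_long = term_frequency('ai', long_article)

print(count_short, count_long)
print(round(tf_short, 3), round(tf_long, 3))
```

The long article mentions the word more often in absolute terms, yet the short email has the higher term frequency.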
In fact, there will be some words that are common in every document. These could be common English stop
words (like: a, and, if, on, etc.), but they could also be words that are common across the specific corpus. For
example, if we are trying to sort a corpus of patents by topics, it is clear that words like: patent, application,
grant and similar legal terms will be common across the whole corpus and not really indicative of the
particular topic of each of the documents in the corpus.
We want to normalize our term frequencies with a term inversely proportional to the fraction of documents
containing that term, i.e. we want to use an inverse document frequency.
These features go by the name of TF-IDF i.e. term frequency–inverse document frequency, which is also
available as a vectorizer in Scikit Learn.
or using maths:
$$\text{tf-idf}(w, d) = \frac{\text{tf}(w, d)}{\text{df}(w, d)} \qquad (8.1)$$
where w is a word, d is a document, tf stands for “term frequency” and df for “document frequency”.
As you can read in the Wikipedia article, there are several ways to improve the above TF-IDF formula,
using different regularization schemes. Scikit Learn implements it as follows:
$$\text{tf-idf}(w, d) = \text{tf}(w, d) \times \left( \log\left( \frac{1 + n_d}{1 + \text{df}(w, d)} \right) + 1 \right) \qquad (8.2)$$
where $n_d$ is the total number of documents and the regularized logarithm takes care of words that are
extremely rare or extremely common.
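To get a feeling for the regularized logarithm, we can evaluate the inverse document frequency factor of equation (8.2) for a very common and a very rare word. This is a plain Python sketch with a hypothetical corpus size, not Scikit Learn's own code:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """log((1 + n_d) / (1 + df)) + 1, the factor multiplying tf in (8.2)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # hypothetical corpus size
idf_common = smoothed_idf(n_docs, doc_freq=1000)  # word in every document
idf_rare = smoothed_idf(n_docs, doc_freq=1)       # word in a single document

print(idf_common)          # the log term vanishes, leaving 1.0
print(round(idf_rare, 3))
```

A word that appears in every document keeps its plain term frequency, while a rare word gets boosted, and the added ones keep the ratio inside the logarithm finite even when a word appears in no document. Keep in mind that Scikit Learn's TfidfVectorizer also normalizes each row by default, so its output will not match this factor alone.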
Sentiment classification
Let’s load a new dataset, containing reviews from the popular website Rotten Tomatoes:
In [56]: df = pd.read_csv('../data/movie_reviews.csv')
df.head()
Out[56]:
Let’s take a peek into the data we loaded, see how many reviews we have and a few other pieces of
information.
In [57]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14072 entries, 0 to 14071
Data columns (total 3 columns):
title 14072 non-null object
review 14072 non-null object
vote 14072 non-null object
dtypes: object(3)
memory usage: 329.9+ KB
Let’s look at the division between the fresh votes, the rotten, and the none votes.
Out[58]:
vote
fresh 0.612067
rotten 0.386299
none 0.001634
As you can see, the dataset contains reviews about famous movies and a judgment of rotten VS fresh,
which is the class we will try to predict.
First of all, we notice that a small number of reviews do not have a class, so let’s eliminate those few rows
from the dataset. We’ll do this by selecting all the votes that are not none:
Out[60]:
vote
fresh 0.613069
rotten 0.386931
Label encoding
Notice that the labels are strings, and we need to convert them to 0 and 1 in order to use them for
classification. We could do this in many ways, one way is to use the LabelEncoder from Scikit Learn. It is a
transformer that will look at the unique values present in our label column and encode them to numbers
from 0 to N − 1, where N is the number of classes, in our case 2.
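Here is a small sketch of the encoder on a handful of made-up labels (the `_demo` names are ours). The classes are stored in sorted order, so fresh is mapped to 0 and rotten to 1:

```python
from sklearn.preprocessing import LabelEncoder

votes = ['fresh', 'rotten', 'fresh', 'rotten', 'fresh']

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(votes)

print(list(le_demo.classes_))  # classes are stored in sorted order
print(list(y_demo))
```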
In [62]: le = LabelEncoder()
In [63]: y = le.fit_transform(df['vote'])
In [64]: y[:10]
Let’s initialize a TfidfVectorizer to look at the top 10000 words in the corpus, excluding English stop words:
vect = TfidfVectorizer(decode_error='ignore',
stop_words='english',
max_features=vocab_size)
In [67]: X = vect.fit_transform(df['review'])
In [68]: X
This generates a sparse matrix with 14049 rows and 10000 columns. This is still small enough to be
converted to dense and passed to our model evaluation function. Let’s call todense() on the object to
convert it to a dense matrix:
In [69]: Xd = X.todense()
Let’s train our model. We will use a higher number of epochs in this case, to ensure convergence with the
larger dataset:
The accuracy on the test set is much lower than the last value of accuracy obtained on the training set (last
line printed during training), therefore the model is overfitting. This is not unexpected given the large
number of features. Despite the overfitting, the test score is still higher than the 61.3% accuracy obtained by
always predicting the larger class.
Text as a sequence
The bag of words approach is very crude. It does not take into account context, i.e. each word is treated as
an independent feature, regardless of its position in the sentence. This is particularly bad for tasks like
sentiment analysis where negations could be present (“This movie was not good”) and the overall sentiment
may not be carried by any particular word.
In order to go beyond the bag of words approach we need to treat text as a sequence instead of just looking
at frequencies. In order to do this, we will proceed to:
1. create a vocabulary, indexed starting from the most frequent word and then continuing in decreasing
order.
2. convert the sentences to sequences of integer indices using the vocabulary
3. feed the sequences to a Neural Network in order to perform the sentiment classification
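The first two steps can be sketched in plain Python before we turn to a library. The toy reviews and variable names below are made up, and starting the indices at 1 (leaving 0 free for padding) is a design choice of this sketch:

```python
from collections import Counter

reviews = ['the movie is great', 'the plot is thin', 'great cast great movie']

# 1. Vocabulary: rank words by frequency, most frequent first.
#    Index 0 is left unused so that it can later serve as padding.
counts = Counter(word for review in reviews for word in review.split())
word_to_index = {word: rank
                 for rank, (word, _) in enumerate(counts.most_common(),
                                                  start=1)}

# 2. Convert each sentence to a sequence of integer indices.
toy_sequences = [[word_to_index[w] for w in review.split()]
                 for review in reviews]

print(word_to_index)
print(toy_sequences)
```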
Keras has a preprocessing Tokenizer that allows us to create a vocabulary and convert the sentences using
it. Let’s load it:
Let’s initialize the Tokenizer. We will use the same vocabulary size of 10000 used in the previous task:
In [73]: vocab_size
Out[73]: 10000
We can fit the tokenizer on our reviews using the function .fit_on_texts. We will pass the column of the
dataframe df that contains the reviews:
In [75]: tokenizer.fit_on_texts(df['review'])
Great! The tokenizer has finished its job, so let’s give a look at some of its attributes.
The .document_count gives us the number of documents used to build the vocabulary:
In [76]: tokenizer.document_count
Out[76]: 14049
These are the 14049 reviews left in the dataset after we removed the ones without a vote. The .num_words
attribute gives us the number of features in the vocabulary. These should be 10000:
In [77]: tokenizer.num_words
Out[77]: 10000
Finally, we can retrieve the word index by calling .word_index, which returns a dictionary mapping each word to
its integer index. Let’s look at the first 10 words in it:
In [78]: list(tokenizer.word_index)[:10]
Out[78]: ['the', 'a', 'and', 'of', 'to', 'is', 'in', 'it', 'that', 'as']
As you can see this is not sorted alphabetically, but in decreasing order of frequency starting from the most
common word. Let’s use the tokenizer to convert our reviews to sequences. For instance, “The movie is
great” translates to the sequence 1539:
sequences is a list of lists. Each of the inner lists is one of the reviews:
In [80]: sequences[:3]
Out[80]: [[36,
1764,
7,
1058,
800,
3,
1765,
9,
27,
151,
268,
8,
21,
2,
9088,
3879,
5881,
115,
3,
101,
20,
22,
17,
360],
[1, 610, 38, 801, 49],
[2, 1012, 347, 225, 9, 24, 107, 14, 564, 21, 1, 354, 7122]]
Let’s just double check that the conversion is correct by converting the first list back to text. We will need to
use the reverse index -> word map:
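Building the reverse map amounts to inverting a dictionary. A minimal sketch with a hypothetical four-word index, shaped like a word -> index dictionary:

```python
# Hypothetical miniature word index (word -> integer index).
word_index_demo = {'the': 1, 'movie': 2, 'is': 3, 'great': 4}

# Invert the mapping so that we can look words up by their index.
index_to_word = {index: word for word, index in word_index_demo.items()}

sequence = [1, 2, 3, 4]
decoded = ' '.join(index_to_word[i] for i in sequence)
print(decoded)
```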
Out[82]: 'So ingenious in concept, design and execution that you could watch it on a postage sta
Out[83]: 'so ingenious in concept design and execution that you could watch it on a postage stam
The two sentences are almost identical, however notice a couple of things:
Now that we have sequences of numbers, we can organize them into a matrix with one review per row and
one word per column. Since not all sequences have the same length, we will need to pad the short ones with
zeros.
Out[84]: 49
The longest review contains 49 words. Let’s pad every other review to 49 using the pad_sequences
function from Keras:
pad_sequences operates on the sequences by padding and truncating them. Let’s set the maxlen
parameter to the value we already found:
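To see what padding does, here is a hand-written sketch that pads and truncates at the start of each sequence, which is also the default behavior of pad_sequences. The function `pad_left` is our own illustration, not the Keras implementation:

```python
def pad_left(seqs, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly `maxlen` items."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]  # keep at most the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy = [[1, 610, 38, 801, 49], [36, 1764, 7]]
maxlen_demo = max(len(s) for s in toy)
padded_toy = pad_left(toy, maxlen_demo)
print(padded_toy)
```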
In [87]: X.shape
X has 14049 rows (i.e. the number of samples, reviews in this case) and 49 columns (i.e. the words of the
longest review). Let’s print out the first few reviews:
In [88]: X[:4]
Out[88]: array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 36, 1764, 7, 1058, 800, 3, 1765, 9,
27, 151, 268, 8, 21, 2, 9088, 3879, 5881, 115, 3,
101, 20, 22, 17, 360],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 610, 38, 801, 49],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 2, 1012, 347, 225, 9, 24, 107, 14,
564, 21, 1, 354, 7122],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
16, 1190, 2, 822, 3, 485, 42, 121, 144, 285, 1,
1678, 4, 13, 742, 724]], dtype=int32)
Can we feed this matrix to a Machine Learning model? Let’s think about it for a second. If we treat this
matrix as tabular data, it would mean each column represents a feature. But what feature? Columns in the
matrix correspond to the position of the word in a sentence and so there’s absolutely no reason why two
words appearing at the same position would carry coherent information about sentiment.
Also, the numbers here are the indices of our words in a vocabulary, so their actual value is not a quantity,
it’s their rank in order of frequency in the vocabulary. In other words, word number 347 is not 347 times as
large as the word at index 1, it’s just the word that appears at index 347 in the vocabulary.
In fact, the correct way to think of this data is to recognize that each number in X really represents an index
in a vector with length vocab_size, i.e. a vector with 10000 entries. These are the actual features, i.e. all the
words in our vocabulary.
So, this matrix is really a shorthand for an order-3 sparse tensor whose three axes are (sentence, position
along sentence, word feature index). The first axis would locate the sentence in the dataset and it
corresponds to the row axis of our X matrix. The second axis would locate the word position in the sentence
and it corresponds to the column axis of our X matrix. The third axis would locate the word in the
vocabulary and it corresponds to actual value in the entry of the matrix.
It looks like one way to feed this data to a Neural Network would be to expand the X matrix to a 1-hot
encoded order-3 tensor with 0s and 1s, and then feed this tensor to our network, for example to a Recurrent
layer with input_shape=(49, 10000). This would be equivalent to feeding each review as 10000 parallel time
series of length 49, whose elements are almost all zeros: at each time step only the series corresponding to
the word at that position in the sentence has value 1.
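The expansion described above can be sketched with NumPy on a tiny example. The helper name `one_hot_sequences` and the toy matrix are our own, and note that in this naive version the padding index 0 is expanded like any other word. The last lines simply count the entries the full tensor would have for the shapes quoted in the text.

```python
import numpy as np

def one_hot_sequences(X, vocab_size):
    """Expand an integer matrix (sentence, position) into a 0/1 tensor
    with axes (sentence, position, word index)."""
    X = np.asarray(X)
    tensor = np.zeros(X.shape + (vocab_size,), dtype=np.int8)
    rows, cols = np.indices(X.shape)
    tensor[rows, cols, X] = 1            # a single 1 per (sentence, position)
    return tensor

X_toy = np.array([[0, 2, 1],
                  [3, 3, 0]])
T = one_hot_sequences(X_toy, vocab_size=4)
print(T.shape)           # (2, 3, 4)
print(T.sum(axis=-1))    # all ones: exactly one active word per position

# Number of entries of the full tensor for 14049 reviews, 49 positions, 10000 words
n_entries = 14049 * 49 * 10000
print(n_entries)         # 6884010000
```

Almost 7 billion entries, nearly all of them zeros, is a lot to hold in memory for fewer than 15 thousand short reviews.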
While this encoding works, it is not really memory efficient. Besides that, representing each word along a
different orthogonal axis in a 10000-dimensional vector space doesn’t capture any information on how that
word is used in its context. How can we improve this situation?
One idea would be to insert a fully connected layer to compress our input space from the very large sparse
space of 10000 words in our vocabulary to a much smaller dense space, for example with just 32 axes. In this
new space each word is represented by a dense vector, whose entries are floating point numbers instead of
all 0s and a single 1.
This is really cool, because now we can feed our sequences of much smaller dense vectors to a recurrent
network in order to complete the sentiment classification task, i.e. we are treating the sentiment
classification problem as a Sequence Classification problem like the ones encountered in Chapter 7.
Furthermore, since the dense vector is obtained through a fully connected layer, we can jointly train the
fully connected layer and the recurrent layer allowing the fully connected layer to find the best
representation for the words in order to help the recurrent layer achieve its task.
Embeddings
In practice we never actually go through the burden of converting the word indices to 1-hot vectors and
then back to dense vectors. We use an Embedding layer. In this layer we specify the output dimension,
i.e. the length of the dense vector, and the layer holds an independent set of that many weights for each of
the words in the vocabulary. So, for example, if the vocabulary is 10000 words and we specify an output
dimension of 100, the embedding layer will carry 1000000 weights, 100 for each of the words in the vocabulary.
The input numbers are interpreted as indices, each selecting one set of 100 weights, i.e. they are
treated as indices in a phantom sparse space, saving us from converting the data to 1-hot and then
converting it back to dense.
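A small NumPy sketch can show why the lookup is equivalent to the 1-hot route. The weight matrix `W` below is random stand-in data, not a trained embedding, and the word index 347 is just an example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, output_dim = 10000, 100
W = rng.normal(size=(vocab_size, output_dim)).astype("float32")  # stand-in weights
print(W.size)                       # 1000000 weights in total

word_index = 347
one_hot = np.zeros(vocab_size, dtype="float32")
one_hot[word_index] = 1.0

via_matmul = one_hot @ W            # 1-hot vector times the dense weight matrix
via_lookup = W[word_index]          # selecting the row directly
print(np.allclose(via_matmul, via_lookup))   # True
```

The lookup reads 100 numbers, while the matrix product touches all one million weights to get the same result.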
Let’s see how to include this in our own network. Let’s load the Embedding layer from Keras:
Let’s see how it works by creating a network with a single such layer that maps a feature space of 100 words
to an output dense space of only 2 dimensions:
The network above assumes the input will be made of numbers between 0 and 99. These are interpreted as
the indices of the single non-zero entry in a 100-dimensional 1-hot vector. Sequences of such indices will be
interpreted as sequences of such vectors and will be transformed to sequences of 2-dimensional dense vectors,
since 2 is the dimension of the output space.
[Figure: Embedding vectors]
Let’s feed a single sequence of a few indices and perform a forward pass:
The embedding layer turned the sequence of five numbers into a sequence of five 2-dimensional vectors.
Since we have not trained our Embedding layer yet, these are just the initial weight vectors corresponding to
each word, so for example, word 0 corresponds to the weights [-0.03977597, -0.01466479]. Notice how
these appear both on the first row and on the fourth row, exactly as one would expect since word 0
appears at the first and fourth positions in our five-word sentence.
Similarly, if we feed a batch of a few sequences of indices, we will obtain a batch of a few sequences of vectors,
i.e. a tensor of order three, with axes (sentence, position in sentence, embedding):
[[-0.02983077, -0.01654588],
[-0.02862899, -0.04024762],
[ 0.03696802, 0.01382831],
[ 0.04779704, -0.03472906],
[ 0.00352927, -0.00599594]],
[[-0.03610412, 0.0394091 ],
[ 0.03755711, -0.01945177],
[-0.01093823, -0.01441556],
[-0.04975789, -0.00622205],
[-0.0464707 , 0.03933306]]], dtype=float32)
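The shapes discussed above can be reproduced with a NumPy stand-in for the layer. The weights and most of the indices below are arbitrary; only the fact that word 0 sits at the first and fourth positions is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.uniform(-0.05, 0.05, size=(100, 2))    # 100 words, 2-dimensional embeddings

single = np.array([0, 81, 1, 0, 79])           # one five-word sequence
vectors = W[single]
print(vectors.shape)                           # (5, 2)
print(np.array_equal(vectors[0], vectors[3]))  # True: word 0 appears twice

batch = rng.integers(0, 100, size=(3, 5))      # 3 sequences of 5 indices each
print(W[batch].shape)                          # (3, 5, 2)
```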
Let’s start from the train/test split. As done several times in the book we set random_state=0 so that we all
get the same train/test split.
TIP: Setting the random state is useful when you want to have repeatable random splits.
In [94]: X.shape
Recurrent model
Let’s build our model as we did in the previous chapter. First, let’s import the LSTM layer from Keras:
Next, let’s build up our model. We’ll create our Embedding layer followed by our LSTM layer and the regular
Dense and Activation layers after that:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
Let’s train our model using the fit() function. We will train the model on batches of 128 reviews for 8
epochs with a 20% validation split:
The model seems to be doing much better on the training set than any of the previous models based on Bag
of Words, since it achieves an accuracy greater than 95% in only 10 epochs. On the other hand, the
validation accuracy seems to be consistently lower, which indicates probable overfitting. Let’s evaluate the
model on the test set in order to verify the ability of our model to generalize:
Out[98]: 0.7386848847648204
Ouch! The test score is not much better than the score obtained by our BOW model. This means the model
is overfitting. Let’s plot the training history:
As you can see, after a few epochs the validation accuracy stops improving while the training accuracy
keeps improving.
We can also look at the loss and notice that the validation loss does not decrease after a certain point, while
the training loss does.
In [101]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, 49, 16) 160000
_________________________________________________________________
lstm_1 (LSTM) (None, 32) 6272
_________________________________________________________________
dense_4 (Dense) (None, 1) 33
=================================================================
Total params: 166,305
Trainable params: 166,305
Non-trainable params: 0
_________________________________________________________________
The model is quite big compared to the size of the dataset. We have over 160 thousand parameters to classify
fewer than 15 thousand short reviews. This is not a good situation and overfitting is expected. In the exercises
we will repeat the sentiment prediction on a larger corpus of reviews and see if we can get better results.
We will also learn another way to reduce overfitting later in the book, when we discuss pre-trained
models.
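As a sanity check, the parameter counts in the summary can be reproduced by hand. The formulas below are the standard ones for embedding, LSTM and dense layers; the helper names are ours.

```python
def embedding_params(vocab_size, output_dim):
    # one vector of length output_dim per word
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # weights plus one bias per unit
    return input_dim * units + units

counts = [embedding_params(10000, 16), lstm_params(16, 32), dense_params(32, 1)]
print(counts, sum(counts))   # [160000, 6272, 33] 166305
```

The embedding alone accounts for 160000 of the 166305 parameters, so the vocabulary size and the embedding dimension are the first knobs to turn if we want a smaller model.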
The basic idea is to apply to text the same approach we used to improve forecasting in the time series
prediction of the last chapter.
We will start from a corpus of text, split it into short, fixed-size windows, i.e. sub-sentences with a few
characters, and then train a model to predict the next character after each window.
Windows of text
Let’s give an example by designing an RNN to generate names of babies. We will use this corpus as training
data, which contains thousands of names.
We start by loading all the names from ../data/names.txt. We also convert the names to lowercase and
append a \n character to each of them, so that the model can learn to predict the end of a name.
with open('../data/names.txt') as f:
    names = f.readlines()
names = [n.lower().strip() + '\n' for n in names]
In [103]: names[:3]
We need to collect all of the distinct characters in our corpus and build a vocabulary that translates between
each character and its assigned index (and vice versa). We could do this using the Tokenizer from Keras,
but it is so simple that we can do it by hand using a Python set:
vocab_size = len(chars)
In [105]: vocab_size
Out[105]: 28
In [106]: chars
Out[106]: {'\n',
'-',
'a',
'b',
'c',
'd',
'e',
'f',
'g',
'h',
'i',
'j',
'k',
'l',
'm',
'n',
'o',
'p',
'q',
'r',
's',
't',
'u',
'v',
'w',
'x',
'y',
'z'}
Now let’s create two dictionaries, one to go from characters to indices and the other to go back from indices to
characters. We’ll use these two dictionaries a bit later.
Character sequences
We can use the vocabulary created above to translate each name in names to its number format in
int_names. We will achieve this using a nested list comprehension where we iterate on names and for each
name we iterate on characters:
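The vocabulary, the two dictionaries and the nested list comprehension can be sketched on a toy list of names as follows. The name `char_to_inds` is our own choice (`inds_to_char` is the name used later in this chapter), and we sort the characters only to make the mapping reproducible, so the indices will not match the ones printed in the text.

```python
names = ['aamir\n', 'aaron\n', 'mary-ann\n']    # toy stand-in for the names file

chars = set(''.join(names))                     # distinct characters in the corpus
vocab_size = len(chars)

# sorted() is used here only so that the indices are the same on every run
char_to_inds = {c: i for i, c in enumerate(sorted(chars))}
inds_to_char = {i: c for c, i in char_to_inds.items()}

# outer loop over names, inner loop over the characters of each name
int_names = [[char_to_inds[c] for c in name] for name in names]

print(vocab_size)      # 9
print(int_names[0])    # [2, 2, 4, 3, 7, 0]
```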
Now each name has been converted to a sequence of integers, for example, the first name:
In [109]: names[0]
Out[109]: 'aamir\n'
8.3. SEQUENCE GENERATION AND LANGUAGE MODELING 375
In [110]: int_names[0]
Great! Now we want to create short sequences of a few characters and try to predict the next one. We will do this
by cutting up names into input sequences of length maxlen and using the following character as the training
label. Let’s start with maxlen = 3:
In [111]: maxlen = 3
name_parts = []
next_chars = []
name_parts is a list with short fragments of names (three characters). Let’s take a look at the first elements:
In [112]: name_parts[:4]
Out[112]: [[2, 2, 19], [2, 19, 0], [19, 0, 27], [2, 2, 27]]
next_chars is a list with single entries, each representing the next character:
In [113]: next_chars[:4]
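A minimal sketch of the windowing step, written as a function so it can be tried on a single name. The encoding of 'aamir\n' reuses the indices visible in the output above (a=2, m=19, i=0, r=27); the index 5 for the '\n' character is made up for this example.

```python
def make_windows(int_names, maxlen=3):
    """Slide a window of length maxlen over each name; the label is the character after it."""
    name_parts, next_chars = [], []
    for name in int_names:
        for i in range(len(name) - maxlen):
            name_parts.append(name[i:i + maxlen])
            next_chars.append(name[i + maxlen])
    return name_parts, next_chars

aamir = [2, 2, 19, 0, 27, 5]       # 'aamir\n', with a made-up index for '\n'
parts, nexts = make_windows([aamir], maxlen=3)
print(parts)    # [[2, 2, 19], [2, 19, 0], [19, 0, 27]]
print(nexts)    # [0, 27, 5]
```

Names with maxlen characters or fewer produce no windows, since there is no following character to predict.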
As a last step we convert the nested list of name_parts to an array. We can do this using the same
pad_sequences function used earlier in this chapter. This takes the nested list and converts it to an array,
trimming the longer sequences and padding the shorter sequences:
In [115]: X.shape
Out[115]: (32016, 3)
Now let’s deal with the labels. We can use the to_categorical function to 1-hot encode the targets. Let’s
import it from keras.utils:
Now let’s create our categories from the next_chars using this function. Notice that we let Keras know
how many characters are in the vocabulary by setting num_classes=vocab_size in the second argument
of the function:
In [118]: y.shape
i.e. we have 32016 characters, each represented by a 1-hot encoded vector of vocab_size length.
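For intuition, 1-hot encoding the labels amounts to picking rows of an identity matrix. The sketch below is a NumPy stand-in, not the Keras function, and `one_hot_labels` is a hypothetical helper name.

```python
import numpy as np

def one_hot_labels(labels, num_classes):
    """Row i of the result is all zeros except for a 1 at position labels[i]."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

y_toy = one_hot_labels([0, 27, 5], num_classes=28)
print(y_toy.shape)             # (3, 28)
print(y_toy.argmax(axis=1))    # [ 0 27  5]
```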
Recurrent Model
We will need to set up an embedding layer for the input, one or more recurrent layers and a final dense layer
with softmax activation to predict the next character. We can design the model using the Sequential API as
usual or we can start to practice with the Functional API, which we will use more often later on. This API is
much more powerful than the Sequential API we used so far, because it allows us to build models that can
have more than one processing branch. It is good to start approaching it on a simple case, so that we will be
more familiar with it when we use it on larger and more complex models.
Let’s import the Model class from keras and the Input layer:
Using the Functional API, each layer is treated as a function, which receives the output of the previous layer
and returns an output to the next. When we specify a model in this way, we need to start from an Input
layer with the correct shape.
Since we have padded our name subsequences to a length of 3, we’ll create an Input layer with shape (3,):
TIP: remember that the trailing comma is needed in Python to distinguish a tuple with one
element from a simple number within parentheses.
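A two-line check makes the difference visible:

```python
not_a_tuple = (3)      # just the integer 3 in parentheses
a_tuple = (3,)         # a tuple with one element

print(type(not_a_tuple).__name__, type(a_tuple).__name__)   # int tuple
print(len(a_tuple))                                         # 1
```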
In [121]: inputs
It’s a Tensorflow tensor with shape=(?, 3), i.e. it will accept batches of data with 3 features, exactly as we
want. Next we create the Embedding layer, with input dimension equal to the vocabulary size (i.e. 28) and
output dimension equal to 5.
Next we will use this layer as a function, i.e. we will pass the inputs tensor to it and save the output tensor
to a temporary variable called h (for hidden).
In [123]: h = emb(inputs)
Note that we could have achieved the previous two operations in a single line by writing:
h = Embedding(input_dim=vocab_size, output_dim=5)(inputs)
Following this style we define the next layer to be an LSTM layer with 8 units and we reuse the h variable for
its output:
In [124]: h = LSTM(8)(h)
Finally we create the output layer, a Dense layer with as many nodes as vocab_size and with a Softmax
activation function:
Now that we have created all the layers we need and connected their inputs and outputs, let’s create a model.
This is done using the Model class that needs to know what the inputs and outputs of the model are:
From here onwards we proceed in an identical way to what we’ve been doing with the Sequential API. We
compile the model for a classification problem:
In [127]: model.compile(loss='categorical_crossentropy',
                        optimizer='adam',
                        metrics=['accuracy'])
and now we are ready to train it. We will let the training run for at least 10 epochs. While the model trains,
let us reflect on a couple of questions:
Let’s train our model by using the fit() function. We will run the training for 20 epochs:
Epoch 1/20
32016/32016 [==============================] - 8s 265us/step - loss: 2.6411 -
acc: 0.2466
Epoch 2/20
32016/32016 [==============================] - 8s 240us/step - loss: 2.4984 -
acc: 0.2535
Epoch 3/20
32016/32016 [==============================] - 8s 240us/step - loss: 2.3988 -
acc: 0.2724
Epoch 4/20
32016/32016 [==============================] - 8s 241us/step - loss: 2.3453 -
acc: 0.2787
Epoch 5/20
32016/32016 [==============================] - 8s 242us/step - loss: 2.3163 -
acc: 0.2814
Epoch 6/20
32016/32016 [==============================] - 8s 242us/step - loss: 2.2936 -
acc: 0.2885
Epoch 7/20
32016/32016 [==============================] - 8s 243us/step - loss: 2.2736 -
acc: 0.2958
Epoch 8/20
32016/32016 [==============================] - 8s 242us/step - loss: 2.2565 -
acc: 0.2992
Epoch 9/20
32016/32016 [==============================] - 8s 243us/step - loss: 2.2419 -
acc: 0.3010
Epoch 10/20
32016/32016 [==============================] - 8s 242us/step - loss: 2.2291 -
acc: 0.3030
Epoch 11/20
32016/32016 [==============================] - 8s 243us/step - loss: 2.2172 -
acc: 0.3046
Epoch 12/20
32016/32016 [==============================] - 8s 240us/step - loss: 2.2056 -
acc: 0.3081
Epoch 13/20
32016/32016 [==============================] - 8s 241us/step - loss: 2.1948 -
acc: 0.3106
Epoch 14/20
32016/32016 [==============================] - 8s 241us/step - loss: 2.1850 -
acc: 0.3158
Epoch 15/20
32016/32016 [==============================] - 8s 240us/step - loss: 2.1756 -
acc: 0.3189
Epoch 16/20
32016/32016 [==============================] - 8s 239us/step - loss: 2.1676 -
acc: 0.3223
Epoch 17/20
32016/32016 [==============================] - 8s 240us/step - loss: 2.1603 -
acc: 0.3254
Epoch 18/20
32016/32016 [==============================] - 8s 240us/step - loss: 2.1537 -
acc: 0.3295
Epoch 19/20
32016/32016 [==============================] - 8s 241us/step - loss: 2.1478 -
acc: 0.3320
Epoch 20/20
32016/32016 [==============================] - 8s 240us/step - loss: 2.1420 -
acc: 0.3348
Great! The model has finished training. Getting above 30% accuracy is a good result in this case. The reason
is that we are trying to predict the next character after a sequence of three characters, but there is no unique
solution to this prediction problem.
Think for example of the 3 characters and. How many names are there in the dataset that start with and?
From this example we see that while r is the most frequent answer, it’s not the only one. Other letters could
come after the letters and in our training set.
By training the model on the truncated sequences we are effectively teaching our model a probability
distribution over our vocabulary. Using the example above, given the sequence of characters ['a', 'n',
'd'] the model is learning that the character r follows 11 times out of 15, i.e. it has a probability of about 0.733,
while the characters e, i, o, y each follow 1 time out of 15, i.e. each has a probability of about 0.067.
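The arithmetic behind these probabilities is just a normalized count. The counts below are the ones quoted in the text, not recomputed from the names file.

```python
from collections import Counter

# how often each character follows 'and', as quoted in the text
next_after_and = Counter({'r': 11, 'e': 1, 'i': 1, 'o': 1, 'y': 1})
total = sum(next_after_and.values())
probs = {c: n / total for c, n in next_after_and.items()}

print(total)                    # 15
print(round(probs['r'], 3))     # 0.733
print(round(probs['e'], 3))     # 0.067
```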
TIP: For the math inclined reader, the model is learning to predict the probability
$p(c_t \mid c_{t-3}\, c_{t-2}\, c_{t-1})$, where the index $t$ indicates the position of a character in the name.
$p(A \mid B)$ is the conditional probability of A given B. This is the probability that A will
happen when B has already happened.
Since the vocabulary size is 28, if the next character were predicted using a random uniform
distribution over the vocabulary, on average we would predict correctly only 1 time every 28 trials, which
would give an accuracy of about 3.6%. We get to an accuracy of about 30%, which is almost 10x higher than random.
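The comparison with the random baseline can be checked with a couple of lines, using the accuracy reported at the last epoch of the training log above:

```python
vocab_size = 28
baseline = 1 / vocab_size          # accuracy of uniform random guessing
final_acc = 0.3348                 # accuracy at epoch 20 in the log above

print(round(100 * baseline, 1))        # 3.6 (percent)
print(round(final_acc / baseline, 1))  # 9.4 (times better than random)
```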
Now that the model is trained, we can use it to produce new names that should at least sound like English
names. We can sample the model by feeding in a few letters and using the model’s prediction for the next
letter. Then we feed the model’s prediction back in to get the next letter, etc.
First of all, let’s define a helper function called sample. This function has to take an array of probabilities
$p = [p_i]_{i \in \text{vocab}}$ for the characters in the vocabulary and return the index of a character, with probabilities
according to $p$. This means that if a character has a high probability, its index will be returned more often
than a character with a low probability.
The multinomial distribution is a generalization of the binomial distribution that can help us in this case.
It is implemented in Numpy and its documentation reads:
This description says that if our experiment has three possible outcomes with probabilities [0.25, 0.7,
0.05], a single multinomial experiment will return an array of length three, where all the entries will be
zero except one, that will be a 1, corresponding to the randomly chosen outcome for that experiment. If we
were to repeat the experiments multiple times, the frequencies of each outcome would tend towards the
assigned probabilities.
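We can try this out directly. The sketch below uses NumPy's newer `Generator` interface with a fixed seed so the run is repeatable; the draw counts are our own choice.

```python
import numpy as np

rng = np.random.default_rng(42)
p = [0.25, 0.7, 0.05]

one_draw = rng.multinomial(1, p)
print(one_draw)                  # a length-3 array with a single 1

many = rng.multinomial(1, p, size=10000)
print(many.mean(axis=0))         # frequencies close to [0.25, 0.7, 0.05]
```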
In fact, we are going to generalize this a bit more, introducing a parameter called diversity that rescales the
probabilities. For high values of the diversity, the logarithms of the probabilities are squashed towards zero,
so the differences between them shrink and we approach the random uniform distribution. When the diversity
is low, the most likely characters will be selected even more often, approaching a deterministic character generator.
Let’s create the sample function, which accepts a list of probabilities and a diversity argument that allows us
to rescale them:
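One possible way to write such a function is sketched below. It assumes the rescaling is done by dividing the log-probabilities by the diversity and renormalizing, which is a common choice; the split into a `rescale` helper, the clipping constant and the optional `rng` argument are our own additions for illustration.

```python
import numpy as np

def rescale(p, diversity=1.0):
    """Divide log-probabilities by `diversity` and renormalize."""
    p = np.clip(np.asarray(p, dtype="float64"), 1e-12, None)   # avoid log(0)
    logits = np.log(p) / diversity
    weights = np.exp(logits - logits.max())                    # numerically stable
    return weights / weights.sum()

def sample(p, diversity=1.0, rng=None):
    """Return the index of one outcome drawn from the rescaled distribution."""
    rng = rng or np.random.default_rng()
    draw = rng.multinomial(1, rescale(p, diversity))
    return int(np.argmax(draw))

p = [0.25, 0.65, 0.10]
print(rescale(p, 1.0))     # unchanged
print(rescale(p, 0.1))     # almost all the mass on index 1
print(rescale(p, 100.0))   # close to uniform
```

Keeping the rescaling separate from the random draw makes the effect of the diversity easy to inspect without any randomness involved.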
Let’s make sure we understand how this function works with an example. Let’s define the probabilities of 3
outcomes (you may think of these as win-lose-draw) where the first one happens 25% of the time, the
second one 65% of the time and the last one only 10% of the time.
Drawing samples from this probability distribution we would expect to pull out index 1 (the second
outcome) about 65% of the time and so on. Let’s sample 100 times:
In [133]: Counter(draws)
As you can see our results reflect the actual probabilities, with some statistical fluctuations.
Great! Now that we can sample from the vocabulary, let’s generate a few names. We will start from an input
seed of three letters and then iterate in a loop the following steps:
• Use the seed to predict the probability distribution for next characters.
• Sample the distribution using the sample function.
• Append the next character to the seed.
• Shift the input window by one to include the last character appended.
• Repeat.
The loop ends either when a termination character is reached or when a pre-defined length is reached.
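The steps listed above can be sketched in plain Python. To keep the example self-contained we replace the trained network with `toy_predict`, a hypothetical stand-in that returns a fixed distribution over a tiny vocabulary, and we sample with the standard library instead of the sample function.

```python
import random

VOCAB = ['\n', 'a', 'e', 'i', 'l']

def toy_predict(window):
    """Stand-in for the trained model: one probability per entry of VOCAB."""
    if window.endswith('i'):
        return [0.0, 0.0, 1.0, 0.0, 0.0]    # after 'i' always predict 'e'
    return [1.0, 0.0, 0.0, 0.0, 0.0]        # otherwise always end the name

def generate(seed, predict, maxlen=3, max_name_len=20, rng=None):
    rng = rng or random.Random(0)
    out = seed
    while len(out) < max_name_len:                     # pre-defined maximum length
        window = out[-maxlen:]                         # last maxlen characters
        probs = predict(window)                        # distribution over VOCAB
        c = rng.choices(VOCAB, weights=probs, k=1)[0]  # sample the next character
        out += c                                       # the window shifts on the next pass
        if c == '\n':                                  # termination character
            break
    return out

print(repr(generate('ali', toy_predict)))   # 'alie\n'
```

With a real model, `predict` would encode the window as indices, run a forward pass and return the predicted probabilities.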
Let’s go ahead and build this function step by step. Let’s set up the seed of our name to be something like
ali.
In order to build the name, let’s create an output list (we’ll call x) to store our output, setting the length to
that of the maximum length of the name we want to generate:
Let’s use a variable we’ll call stop to stop the loop if our network predicts the '\n' character as the next
character and set it to False:
c = inds_to_char[sample(preds)]
out += c
if c == '\n':
stop = True
out
Out[137]: 'alie\n'
The network produced a few characters and then stopped. Now let’s wrap these steps in a function that
encapsulates this entire process in a single method. Let’s call our function complete_name. This function
will take an input seed of three letters and run through the previous steps to predict the next character.
Parameters
----------
seed : string
The start of the name to sample
Returns
-------
out : string
'''
out = seed
stop = False
c = inds_to_char[sample(preds, diversity)]
out += c
if c == '\n':
stop = True
else:
if max_name_len is not None:
if len(out) > max_name_len - 1:
stop = True
return out
Nice! Now that we have a function to complete names, let’s predict a few names that start as jen:
jenaann
jene
jenie
jenne
jenn
jen
jenta
jenanl
jena
jenen
Not bad! Let’s play with the diversity parameter to understand what it does. If we set the diversity to be high,
we get random sequences of characters:
jenzycmz-oapqwoeqliw
jenclwypkloqvtkpbsgi
jentspmwikqfxojlopdf
jenjwdi
jenwhyrq
jenjooogynah-yzk-qjl
jenkkdpakbi-
jenemsfy-lwkcrutvzsy
jen
jensyletawbk
TIP: since the sample function involves logarithms and exponentials, it accumulates
numerical errors very quickly. It would be better to build a model that predicts logits
instead of probabilities, but Keras does not allow us to do that.
jen
jen
jen
jen
jen
jen
jen
jen
jen
jen
Awesome! We now know how to build a language model! Go ahead and unleash your powers on your
author of choice and start producing new poems or stories. The model we built has a memory of 3
characters, so it won’t exactly be “Shakespeare” when it tries to produce sentences. In order to have a model
producing correct sentences in English we would need to train it with a much larger corpus and with longer
windows of text. For example, a memory of 20-25 characters is long enough to generate English-looking
text.
In the next section we will extend our skills to build a language translation model.
Sequence-to-sequence (Seq2Seq) models take a sequence as input and return a new sequence as output.
They are very common in language translation, where the input sequence is a sentence in the first language
and the output sequence is the translation in the second language.
There is a great article by Francois Chollet on the Keras Blog on how to build them in Keras. We strongly
encourage you to read it!
In this chapter we have approached text problems from a variety of angles and hopefully inspired you to dig
deeper into this domain.
8.4. EXERCISES 387
Exercises
Exercise 1
For our Spam detection model we used a CountVectorizer with a vocabulary size of 3000. Was this the
best size? Let’s find out:
Exercise 2
Keras provides a large dataset of movie reviews extracted from the Internet Movie Database for sentiment
analysis purposes. This dataset is much larger than the one we have used, and it’s already encoded as
sequences of integers. Let’s put what we have learned to good use and build a sentiment classifier for movie
reviews:
• decide what size of vocabulary you are going to use and set the vocab_size variable
• import the imdb module from keras.datasets
Bonus points: can you convert the sentences back to their original text form? You should look at
imdb.get_word_index() to download the word index:
Training with GPUs
9
In this chapter, we will learn how to leverage Graphics Processing Units (GPUs) to speed up training of
our models. If a model trains faster, we can do more experiments and therefore arrive at good solutions
more quickly. Also, leveraging cloud GPUs has become so easy by now that it would be a pity not to take
advantage of this opportunity. Only a few years ago, training a deep Neural Network using a GPU was a skill
that demanded very sophisticated knowledge and a lot of money. Nowadays, we can train a model on many GPUs
at a relatively affordable cost.
We will start this chapter by introducing what a GPU is, where it can be found, what kinds of GPUs are
available and why they are so useful to do Deep Learning. Then we will review several cloud providers of
GPUs and guide you through how to use them. Once we have a working cloud instance with one or more
GPUs we will compare training a model with and without a GPU, to appreciate the speedup especially with
Convolutional Neural Networks. We will then extend training to multiple GPUs and introduce a few ways
to use multiple GPUs in Keras.
This chapter is a bit different from the other chapters as there will be less Python code and more links to
external documentation and services. Also, while we will do our best to have the most up to date guide to
currently existing providers, it is important that you understand how fast the landscape is evolving. During
the course of the past 6 months each of the providers presented introduced newer and easier ways to access
cloud GPUs, making the previous documentation obsolete. Thus, it is important that you understand the
principles of why accelerated hardware helps and when. If you do this, it will be easy to adapt to new ways of
doing things when they come out. All that said, let’s get started!
today widely used for other purposes like Machine Learning acceleration.
The term GPU became popular in 1999, when Nvidia - still a major player in the field today - marketed the
GeForce 256 as “the world’s first GPU”. In 2002, ATI Technologies, a competitor of Nvidia, coined the term
“visual processing unit” or VPU with the release of the Radeon 9700. The following picture shows the
original GeForce 256 (left side) and the GeForce GTX 1080 (right), one of the latest released and most
powerful graphics cards on the market.
In 2006, Nvidia came out with a high level language called CUDA (Compute Unified Device Architecture),
which helps software developers and engineers write programs for graphics processors in a high level
language – an approach termed GPGPU (General-Purpose computing on Graphics Processing Units).
CUDA is a language that gives direct access to the GPU’s virtual instruction set and parallel computational
elements, for the execution of compute kernels. This was probably one of the most significant changes in
the way researchers and developers interacted with GPUs.
But why are GPUs, originally developed for video games graphics, so useful for Deep Learning?
As you already know, training a Neural Network requires several operations, many of which involve large
matrix multiplications. Matrix multiplications are performed in the forward pass, when inputs (or
activations) and weights are multiplied (see Chapter 5 if you need a refresher on the math). Matrix
multiplications are also performed during back-propagation, when the error is propagated back through the
network to adjust the values of the weights. In practice, training a Neural Network mostly consists of matrix
multiplications. Consider for example VGG16, a frequently used convolutional Neural Network for image
classification proposed by K. Simonyan and A. Zisserman: it has approximately 140 million parameters.
Using a CPU, it would take weeks to train this model and perform all the matrix multiplications.
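To get a feeling for the amount of work involved, here is the forward pass of a single fully connected layer written as a matrix multiplication in NumPy. The batch and layer sizes are illustrative values chosen by us, not those of any specific network.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 128, 4096, 1000        # illustrative sizes
X = rng.normal(size=(batch, n_in)).astype("float32")   # inputs or activations
W = rng.normal(size=(n_in, n_out)).astype("float32")   # weights
b = np.zeros(n_out, dtype="float32")                   # biases

Z = X @ W + b                               # forward pass of one dense layer
print(Z.shape)                              # (128, 1000)
print(batch * n_in * n_out)                 # 524288000 multiply-adds
```

That is over half a billion multiply-adds for one layer and one batch, and every one of them is independent of the others within the product, which is exactly the kind of work that can be spread over many cores.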
GPUs dramatically decrease the time needed for matrix multiplication, offering 10 to 100 times
more computational power than traditional CPUs. There are several reasons why they make this
computational speed-up possible, well discussed in this article.
Summarizing the article, GPUs are made of thousands of cores, far more than CPUs, so they not only allow for
highly parallel operations, but are also ideal when it comes to fetching very large amounts of memory. The best
GPUs can fetch up to 750GB/s, which is huge compared with the best CPUs, which can handle only up to 50GB/s
9.2. CLOUD GPU PROVIDERS 391
memory bandwidth. Of course, dedicated GPUs, specifically designed for High Performance Computing
and Deep Learning, are more performant (and expensive) than gaming GPUs, but the latter, usually
available in everyday laptops, are still a good starting option!
The following picture shows a comparison between CPU and GPU performance (source: Nvidia). The left
image shows that, using the same Neural Network for image detection, the Fermi GPU is able to process
more than 10 times the number of images processed (per second) by an Intel 4 core CPU. The right image
shows that 16 GPU-accelerated servers are able to handle a Neural Network more than 6 times bigger than
the one handled by 1000 CPU servers.
Other companies offering cloud GPUs are Microsoft Azure Cloud and IBM. Also, a few startups have started
to offer Deep Learning optimized cloud instances that are often cheaper and easier to access. In this chapter
we will review Floydhub, Pipeline.ai and Paperspace.
Regardless of the cloud provider, if you have a Linux box with an NVIDIA GPU it is not hard to equip it to
run tensorflow-gpu and a Jupyter Notebook.
Google Colab
The easiest way to give GPU acceleration a try is to use Google Colab, also known as Colaboratory. Besides
being so easy, Colab is also free to use (you only need a Google account), which makes it perfect to try out
GPU acceleration.
Colaboratory is a research tool for Machine Learning education and research. It’s a Jupyter Notebook
environment that requires no setup to use: you can create and share Jupyter notebooks with others without
having to download, install, or run anything on your own computer other than a browser. It works with
392 CHAPTER 9. TRAINING WITH GPUS
most major browsers, and is most thoroughly tested with desktop versions of Chrome and Firefox.
This welcome notebook provides you with the fundamental information to start working with Colab. In
addition to all the classic operations in Jupyter you can change the notebook settings to enable GPU support:
Once you’ve done that, you can run this code to verify that a GPU is available:
import tensorflow as tf
tf.test.gpu_device_name()
Pipeline AI
The next best option to try a GPU for free in the cloud is the service offered by PipelineAI. The PipelineAI
service enables data scientists to rapidly train, test, optimize, deploy, and scale models in production directly
from a Jupyter Notebook or command-line interface. It provides a platform that simplifies the workflow
and lets the user focus only on the essential Machine Learning aspects.
The login process to use PipelineAI is quite simple and straightforward:
1. Sign up at PipelineAI.
2. Once you are successfully logged in, you should see the following dashboard. You can either launch a new
notebook or directly type commands in a terminal.
3. Alternatively, you can use some of the already available resources, accessible from the left menu. For
example, you can have a look at the 01a_Explore_GPU.ipynb notebook, under notebooks >
00_GPU_Workshop.
PipelineAI is not only a platform providing GPU-powered Jupyter Notebooks: it allows you to do much
more, such as monitoring the training of the algorithms, evaluating the results of your model, comparing
the performance of different models, browsing among stored models, etc. The following picture shows
some of the available tools, but have a look at all the options available in the community edition.
To better understand the potential of PipelineAI, we encourage you to take this tour. PipelineAI is under
active development. You can follow its GitHub repository.
Floydhub
FloydHub is another easy and cheap option to access GPUs in the cloud. FloydHub is a platform for training
and deploying Deep Learning and AI applications. FloydHub comes with fully configured CPU and GPU
environments ready to use for Deep Learning. It includes CUDA, cuDNN and popular frameworks like
Tensorflow, PyTorch, and Keras. Take a look at the documentation for a more extended explanation of its
features.
This tutorial explains how to start a Jupyter Notebook on FloydHub:
1. Create an account on FloydHub.
2. Install floyd-cli on your computer.
4. From your terminal, use floyd-cli to initialize the project (be sure to use the name you gave the
project in step 3).
TIP: if this is the first time you run floyd it will ask you to log in. Just type floyd login
and follow the instructions provided.
and open your FloydHub web page. Here you’ll see a View button that will direct you to a Jupyter Notebook.
The notebook is running on FloydHub’s GPU servers.
Once you’re finished with your work you can stop the Jupyter Notebook with the cancel button. Make sure
to save your results by downloading the notebook before you terminate it:
Paperspace
Paperspace is a platform to access a virtual desktop in the cloud. In particular, the Gradient service allows you to
explore, collaborate, share code and data using Jupyter Notebooks, and submit tasks to the Paperspace GPU
cloud.
It is a suite of tools specifically designed to accelerate cloud AI and Machine Learning. Gradient
includes a powerful job runner (that can even run on the new Google TPUs!) that allows you to work on your
local machine and submit “jobs” to the cloud to be processed, as well as first-class support for
containers and Jupyter notebooks, and a new set of language integrations. Discover
more about this service by reading this blog post.
The procedure to run a Jupyter Notebook within Paperspace is similar to what we have seen so far for
the other GPU services:
3. Create a Jupyter Notebook to create your models. (Credit card information on the billing page is
required to enable all functionality.)
Paperspace is much more general than simply a hosted Jupyter Notebook service with GPU enabled. Since
Paperspace gives you a full virtual desktop (both Linux and Windows, as shown in the following picture),
you can install any other applications you need, from 3D rendering software to video editing and more.
AWS provides a Deep Learning AMI ready to use with all the NVIDIA drivers pre-installed as well as most
Deep Learning frameworks and Python packages. It’s not free but it’s sufficiently simple and versatile to use.
We can quickly launch Amazon EC2 instances pre-installed with popular Deep Learning frameworks such
as Apache MXNet and Gluon, TensorFlow, Microsoft Cognitive Toolkit, Caffe, Caffe2, Theano, Torch,
PyTorch, Chainer, and Keras to train sophisticated, custom AI models, experiment with new algorithms, or
to learn new skills and techniques.
In order to use any AWS service, we need to open an account. Several resources are available for a free trial
period, as described in the official web page. After the trial period ends, keep in mind that the service
will charge you. Also, keep in mind that GPU instances are not included in the free tier, so you will incur
charges if you complete the next steps.
Follow this procedure to spin up a GPU enabled machine on AWS with the Deep Learning AMI:
1. Access the AWS console and select EC2 from the Compute menu.
3. Scroll the page and select an Amazon Machine Image (AMI). The Deep Learning AMI is a good
option to start. It comes in 2 flavors: Ubuntu and Amazon Linux. Both are good and we recommend
you use the flavor you are more comfortable with. Also note that there are both a Deep Learning AMI
and a Deep Learning AMI Basic. The Basic AMI has only GPU drivers installed but no Deep Learning
software. The full AMI comes pre-packaged with a ton of useful packages including Tensorflow,
Keras, Pytorch, MXNet, CNTK and more. We recommend you use this one to start.
4. Choose an instance type from the menu. Roughly speaking, instance types are listed in ascending
order of computational power and storage space.
Here’s a summary table of AWS GPU instances. Read the documentation for a detailed description of
every instance type.
Once you have chosen the instance go through the other steps:
and finally launch your instance with a key pair you own. Let’s assume it’s called your-key.pem.
You should now be able to see the newly created instance in the dashboard, and you are now ready to
connect with it.
Finally, take a look at the Tutorials and Examples section to better understand how to use the Deep Learning
AMI service offered by AWS.
Once your Instance state is running you are ready to connect to it. We are going to do that from a terminal.
We will use the ssh key we have generated and we will also route remote port 8888 to the local port 8888 so
that we get to access Jupyter Notebook. Go ahead and type:
TIP: if you get a message that says your key is not protected, you need to restrict the
permissions of your key so that only your user can access it. You can do that by executing the
command: chmod 600 your-key.pem.
Once you’re connected you should see a screen like the following, where all the environments are listed:
We will go ahead and activate the tensorflow_py36 environment with the command:
This command launches Jupyter in a way that will not stop if you disconnect from the instance. The
final step is to retrieve the Jupyter address: http://localhost:8888/?token=<your-token>. You
will find it in the nohup.out file:
tail nohup.out
Copy it and paste it into your browser. If you’ve done everything correctly you should see a screen
like this one:
To connect to the AWS EC2 Deep Learning AMI from Windows, similar steps must be followed, but in this
case it is convenient to use PuTTY, an SSH client specifically developed for the Windows platform. After
installing PuTTY on your machine, the procedure to connect to the cloud instance is as follows:
Once you are connected follow the same steps as for the Linux case.
It is really important that once you are done with your experiments you turn off the instance in order to
avoid unnecessary costs. Just go to your AWS console and either Stop or Terminate the instance, by choosing an
action from the Actions menu:
AWS also supports a command line interface, the AWS CLI, that allows you to perform the same operations from the
terminal. If you’d like to try it you can install it using the command:
from your terminal. Once you have installed it, you need to add configuration credentials. First you’ll have
to set up an IAM user in the IAM console, then run the following configuration command:
aws configure
Make sure to choose a region that provides a copy of the Deep Learning AMI.
As explained in the AWS CLI guide, the output format can be json:
or text:
Once configured you can start your Deep Learning instance with the following command:
• <DL-AMI-ID-for-your-region>: the AMI ID for the Deep Learning AMI in the AWS region
you’ve chosen
• <instance-type>: the type of instance, like g2.2xlarge, p3.16xlarge etc.
• <your-ssh-key-name>: the name of your ssh key. You must have this on your disk.
• <subnet-id>: the subnet id, you can find this when you launch an instance from the web interface.
• <security-group-id>: the security group id, you can find this when you launch an instance from
the web interface as well.
• <a-name-tag>: a name for your instance, so that you can easily retrieve it by name
You can query the status of your launch with the command:
and remember to stop or terminate the instance when you are done, for example using this command:
AWS Sagemaker
AWS Sagemaker is an AWS managed solution that allows you to perform all the steps involved in a Deep
Learning pipeline. In fact, on Sagemaker you can define, train and deploy a Machine Learning model in just
a few steps.
Sagemaker provides an integrated Jupyter Notebook instance that can be used to access data stored in other
AWS services, explore it, clean it and analyze it as well as to define a Machine Learning model. It also
provides common Machine Learning algorithms that are optimized to run efficiently against extremely large
data in a distributed environment.
Detailed information about this service can be found in the official documentation.
4. Assign a Notebook instance name, for example “my_first_notebook” and click the button Create
notebook instance. Sagemaker offers several types of instances, including a cheap option for
development of your notebook, a CPU-heavy instance if your code requires a lot of CPUs and a
GPU-enabled instance if you need it. Notice that the instance types available for the notebook
instance are different from the ones available for model training and deployment.
Once you’re done developing your model, Sagemaker allows you to export, train and deploy the model in a few
easy steps. Please refer to the User guide for more information on these steps.
Although we reviewed in detail the solutions offered by Amazon AWS, both Google Cloud and Microsoft
Azure offer similarly priced GPU-enabled cloud instances. We invite you to check their offerings here:
• Google Cloud
• Microsoft Azure
If you’d like to start from scratch on a barebone Linux machine with a GPU, here are the steps you will need to
follow:
1. Install the NVIDIA CUDA drivers. CUDA is a parallel computing platform and programming model that
gives direct access to the GPU’s virtual instruction set and parallel computational elements, for the
execution of compute kernels.
2. Download and install cuDNN. cuDNN is an NVIDIA library built on CUDA that implements many
common Neural Network algorithms.
3. Install Miniconda. Miniconda is a minimal installation of Python and the conda package manager
that we will use to install other packages.
4. Install common packages: conda install pip numpy pandas scikit-learn scipy
matplotlib seaborn h5py. This command will install the packages in the base environment.
5. Install Tensorflow compiled with GPU support: pip install tensorflow-gpu.
6. (Optional) Install Keras: pip install keras.
TIP: the code that follows relies on functionality that is specific to Tensorflow. This implies
that we cannot change backend.
Let’s start by comparing training speed on a CPU vs a GPU for a convolutional Neural Network. We will
train this on the CIFAR10 data that we also encountered in Chapter 6. Let’s load the usual packages,
Numpy, Pandas and Matplotlib:
First we load the data using a helper function that also rescales it and expands the labels to binary
categories. If you’re unfamiliar with these steps we recommend you review Chapter 3, Chapter 4 and
Chapter 6 where they are repeated multiple times and explained in detail.
Next we define a function that creates the convolutional model. By now you should be familiar with every
line of code that follows, but just as a reminder, we create a Sequential model adding layers in sequence,
like pancakes in a stack. The layers in this network are:
• 2D Convolutional layer with 32 filters, each of size 3x3 and ReLU activation. Notice that in the first
layer we also specify the input shape of (32, 32, 3) which means our images are 32x32 pixels with 3
colors: RGB.
• 2D Convolutional layer with 32 filters, each of size 3x3 and ReLU activation. We add a second
convolutional layer immediately after the first to effectively convolve over larger regions in the input
image.
• 2D Max Pooling layer with a pool size of 2x2. This will cut in half the height and the width of our
feature maps, so each feature map holds 4 times fewer values and the calculations that follow are
correspondingly cheaper.
• Flatten layer to go from the order 4 tensors used by convolutional layers to an order 2 tensor suitable
for fully connected networks.
• Fully connected layer with 512 nodes and a ReLU activation
• Output layer with 10 nodes and a Softmax activation
If you need to review these concepts make sure to check out Chapter 6 for more details.
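To make the effect of each layer concrete, here is a minimal sketch in plain Python of the shape and parameter arithmetic for the layers listed above. It assumes, as in the model definition of this chapter, that only the first convolution pads its input and that all convolutions use a stride of 1; the helper names (`conv2d`, `max_pool`, `dense`) are ours and are not part of Keras.

```python
# Shape and parameter arithmetic for the layers listed above.
# Assumption: the first convolution pads its input ('same'), the second does not.

def conv2d(shape, filters, kernel=3, same_padding=False):
    """Return (output_shape, n_params) for a stride-1 2D convolution."""
    height, width, channels = shape
    if not same_padding:
        height, width = height - kernel + 1, width - kernel + 1
    n_params = kernel * kernel * channels * filters + filters  # weights + biases
    return (height, width, filters), n_params

def max_pool(shape, pool=2):
    """Return the output shape of a non-overlapping max pooling layer."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)

def dense(n_inputs, n_nodes):
    """Return the number of parameters of a fully connected layer."""
    return n_inputs * n_nodes + n_nodes

shape = (32, 32, 3)                                # 32x32 RGB image
shape, p1 = conv2d(shape, 32, same_padding=True)   # (32, 32, 32)
shape, p2 = conv2d(shape, 32)                      # (30, 30, 32)
shape = max_pool(shape)                            # (15, 15, 32)
flat = shape[0] * shape[1] * shape[2]              # values fed to the Dense layer
p3 = dense(flat, 512)
p4 = dense(512, 10)
total = p1 + p2 + p3 + p4

print(shape, flat, total)
```

Almost all of the parameters sit in the first fully connected layer, which is why the pooling step before the Flatten layer matters so much for speed.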
We also compile the model for a classification problem using the Categorical Crossentropy loss function
and the RMSProp optimizer. These are explained in detail in Chapter 5.
Notice also that we import the time module to track the performance of our model:
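from time import time
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# NOTE: the function name is assumed; the original definition line is missing
def convolutional_model():
    t0 = time()
    model = Sequential()
    model.add(Conv2D(32, (3, 3),
                     padding='same',
                     input_shape=(32, 32, 3),
                     kernel_initializer='normal',
                     activation='relu'))
    model.add(Conv2D(32, (3, 3), activation='relu',
                     kernel_initializer='normal'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    # classification setup described above
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    # report how long it took to build and compile the model
    print("{:0.3f} seconds.".format(time() - t0))
    return model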
Now we are ready to do a comparison between the CPU training time and the GPU training time. We can
force Tensorflow to create the model on the CPU with the context setter with tf.device('cpu:0').
First let’s import Tensorflow:
Now let’s compare the model with a model living on the GPU. We use a similar context setter: with
tf.device('gpu:0'):
acc: 0.3874
9.329 seconds.
As you can see training on the GPU is much faster than on the CPU. Also notice that the second epoch runs
much faster than the first one. The first epoch also includes the time to transfer the model to the GPU, while
for the following ones the model has already been transferred to the GPU. Pretty cool!
NVIDIA-SMI
We can check that the GPU is actually being utilized using nvidia-smi. The NVIDIA System Management
Interface is a tool that allows us to check the operation of our GPUs. To better understand how it works,
have a look at the documentation.
In order to use the NVIDIA System Management Interface:
1. Open a new terminal from the Jupyter interface
Multiple GPUs
If your machine has more than one GPU you can use multiple GPUs to speed up your training even more.
This can be done in 2 ways:
• distributing different batches to different GPUs, also called data parallelization.
• distributing different parts of the model to different GPUs, also called model parallelization.
Let’s have a look at them in detail.
Data Parallelization
Keras makes it really easy to parallelize training by distributing data across multiple GPUs through the
recently introduced multi_gpu_model command. Let’s import it from keras.utils:
9.4. MULTIPLE GPUS 415
TIP: if you’re on floydhub the keras version is probably earlier than the one we are using in
the book. If you don’t find keras.utils.multi_gpu_model try with
and let’s distribute it over 2 GPUs (this will only work if you have at least 2 GPUs on your machine):
In [16]: model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
Finally we can train the model in the exact same way as we did before. Notice that the multi_gpu_model
documentation explains how a batch is divided to the GPUs:
This also means that if we want to maximize GPU utilization we want to increase the batch size by a factor
equal to the number of GPUs, so we will use batch_size=1024*NGPU.
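The arithmetic behind this choice can be sketched in a few lines of plain Python. The helper `split_batch` below is hypothetical (it is not a Keras function), and the way it spreads a remainder across GPUs is our own assumption, made only so that the sub-batch sizes always add up to the full batch.

```python
def split_batch(batch_size, n_gpus):
    """Sizes of the sub-batches when one batch is divided across n_gpus."""
    base, remainder = divmod(batch_size, n_gpus)
    # Assumption: any remainder is spread over the first GPUs.
    return [base + 1 if i < remainder else base for i in range(n_gpus)]

NGPU = 2

# With an unchanged batch size each GPU only receives half of the samples.
print(split_batch(1024, NGPU))

# Scaling the batch size by the number of GPUs keeps 1024 samples per GPU.
print(split_batch(1024 * NGPU, NGPU))
```

In other words, multiplying the batch size by the number of GPUs keeps each GPU as busy as it was in the single-GPU case.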
Since with 2 GPUs each epoch takes only a few seconds, let’s run the training for a few more epochs:
and let’s plot the history like we’ve done many times in this book:
In [19]: pd.DataFrame(h.history).plot()
plt.ylim(0, 1.1)
plt.axhline(1, color='black');
9.5. CONCLUSION 417
As you can see, with 30 epochs the model seems to be still improving. Having multiple GPUs allowed us to
iterate fast and explore the performance of a powerful convolutional model in a very short time. Cool!
Conclusion
In this chapter we have seen how GPUs can easily be used to train faster on larger data. Before you move on
to the next chapter make sure to terminate all instances or you’ll incur charges!
Exercises
Exercise 1
In Exercise 2 of Chapter 8 we introduced a model for sentiment analysis of the IMDB dataset provided in
Keras.
Exercise 2
Model parallelism is a technique that is used for very large models that cannot fit in the memory of a single
GPU. While this is not the case for the model we developed in Exercise 1, it is still possible to distribute
the model across multiple GPUs using the with context setter. Define a new model with the following
architecture:
1. Embedding
2. LSTM
3. LSTM
4. LSTM
5. Dense
Place layers 1 and 2 on the first GPU, layers 3 and 4 on the second GPU and the final Dense layer on the CPU.
Congratulations! We’ve traveled very far along this Deep Learning journey together! We have learned about
fully connected, convolutional and recurrent architectures and we applied them to a variety of problems,
from image recognition to sentiment analysis.
One question we haven’t really answered yet is what to do when a model is not performing well. This is very
common for Deep Learning models: we train a model and the performance on the test set is disappointing.
This chapter is about a few techniques to improve a model in that situation. We will start by introducing
Learning Curves, a tool that is useful to decide if more data is needed. Then we will introduce several
regularization techniques that may be useful to fight Overfitting. Some of these techniques have been
invented very recently.
Finally, we will discuss data augmentation, which is useful in some cases, e.g. when the input data is made
of images. We will conclude the chapter with a brief part on hyperparameter optimization. This is a vast
topic, that can be approached in several ways which we’ll look into.
420 CHAPTER 10. PERFORMANCE IMPROVEMENT
Learning curves
The first tool we present is the Learning Curve. A learning curve plots the behavior of the training and
validation scores as a function of how much training data we fed to the model.
Let’s load a simple dataset and explore how to build a learning curve. We will use the digits dataset from
Scikit Learn, which is quite small. First of all we import the load_digits function and use it:
Now let’s create a variable called digits and assign it the result of calling load_digits():
In [6]: X.shape
X is an array of 1797 images that have been unrolled as feature vectors of length 64.
In [7]: y.shape
Out[7]: (1797,)
In order to see the images we can always reshape them to the original 8x8 format. Let’s plot a few digits:
10.1. LEARNING CURVES 421
TIP: the function tight_layout automatically adjusts subplot params so that the
subplot(s) fit into the figure area. See the Documentation for further details.
Since digits is a Scikit Learn Bunch object, it has a property with the description of the data (in the DESCR
key). Let’s print it out:
In [9]: print(digits.DESCR)
Notes
-----
Data Set Characteristics:
:Number of Instances: 5620
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998
This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
References
----------
- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionalityreduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.
From digits.DESCR we find that the input is made of integers in the range (0,16). Let’s check that it’s true
by calculating the minimum and maximum values of X:
In [10]: X.min()
Out[10]: 0.0
In [11]: X.max()
Out[11]: 16.0
In [12]: X.dtype
Out[12]: dtype('float64')
As previously seen in Chapter 3, it’s a good practice to rescale the input so that it’s close to 1. Let’s do this by
dividing by the maximum possible value (16.0):
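As a quick illustration of this rescaling, here is a minimal NumPy sketch. The small array `X_demo` is made up and only stands in for the real `X`; the variable names are ours.

```python
import numpy as np

# Made-up stand-in for X: pixel intensities in the range 0..16.
X_demo = np.array([[0.0, 4.0, 16.0],
                   [8.0, 12.0, 2.0]])

# Divide by the maximum possible value so that every entry lies in [0, 1].
X_demo_sc = X_demo / 16.0

print(X_demo_sc.min(), X_demo_sc.max())
```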
In [14]: y[:20]
Although it could appear that the digits are sorted, actually they are not:
As seen in Chapter 3, let’s convert them to 1-hot encoding, to substitute the categorical column with a set of
boolean columns, one for each category. First, let’s import the to_categorical method from
keras.utils:
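To see what 1-hot encoding produces, here is a minimal NumPy sketch of the idea. It is an illustration only, not the Keras implementation: the helper `one_hot` and the labels in `y_demo` are made up for this example.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

y_demo = np.array([0, 3, 9, 3])              # made-up digit labels
y_demo_cat = one_hot(y_demo, num_classes=10)

print(y_demo_cat.shape)                      # one row per label, one column per class
print(y_demo_cat.argmax(axis=1))             # argmax recovers the original labels
```

Each row contains a single 1 in the column of its class, so taking the argmax along the columns inverts the encoding.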
Now we can split the data into a training and a test set. Let’s import the train_test_split function and
call it against our data and the target categories:
We will split the data with a 70/30 ratio and we will use a random_state here, so that we all get the exact
same train/test split. We will also use the option stratify, to require that the classes keep the same
proportions in both sets, i.e. about 10% for each class (we already introduced this concept in Chapter 3 for
the stratified K-fold cross validation).
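The idea behind a stratified split can be sketched with the standard library alone. This is a simplified illustration under our own assumptions, not the algorithm Scikit Learn uses: the helper `stratified_split` and the made-up labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split sample indices so every class keeps the same share in both parts."""
    rng = random.Random(seed)            # fixed seed: the same split on every run
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])    # this class's share of the test set
        train_idx.extend(indices[n_test:])   # the rest goes to the training set
    return train_idx, test_idx

# Made-up labels: 10 classes with 100 samples each.
labels = [digit for digit in range(10) for _ in range(100)]
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

Because the split is done class by class, every class contributes the same fraction of its samples to the test set.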
Let’s double check that we have balanced the classes correctly. Since y_test is now a 1-hot encoded vector,
we need first to recover the corresponding digits. We can do this using the function argmax:
In [21]: y_test_classes
Out[21]: array([1, 4, 5, 6, 9, 1, 2, 2, 2, 0, 7, 5, 4, 8, 6, 6, 8, 2, 0, 9, 7, 3,
9, 1, 3, 5, 2, 2, 9, 9, 8, 9, 7, 6, 1, 3, 1, 4, 7, 6, 7, 3, 5, 0,
1, 1, 7, 5, 4, 6, 0, 5, 8, 9, 0, 5, 4, 5, 3, 5, 5, 6, 5, 4, 9, 6,
5, 9, 6, 5, 7, 6, 6, 3, 0, 8, 4, 4, 3, 2, 9, 7, 2, 7, 9, 8, 8, 0,
1, 7, 2, 3, 3, 5, 5, 6, 0, 4, 3, 7, 1, 4, 1, 9, 0, 5, 3, 8, 9, 6,
4, 9, 2, 9, 2, 0, 6, 7, 8, 1, 9, 2, 8, 6, 3, 6, 5, 1, 3, 6, 2, 3,
0, 6, 5, 5, 9, 2, 8, 1, 0, 1, 4, 5, 1, 0, 3, 0, 0, 9, 8, 9, 2, 2,
5, 8, 1, 9, 3, 7, 6, 8, 7, 3, 1, 2, 5, 1, 1, 6, 3, 9, 6, 9, 8, 9,
9, 8, 9, 9, 8, 8, 4, 7, 6, 2, 6, 4, 3, 4, 4, 3, 8, 5, 4, 8, 3, 1,
3, 4, 1, 0, 7, 8, 7, 5, 0, 6, 0, 1, 8, 7, 0, 0, 3, 4, 8, 9, 4, 4,
1, 1, 2, 1, 9, 2, 7, 7, 6, 9, 2, 9, 6, 0, 5, 2, 4, 4, 4, 6, 4, 0,
1, 8, 3, 4, 0, 5, 9, 0, 2, 0, 0, 1, 3, 2, 8, 1, 6, 1, 1, 9, 2, 7,
8, 3, 8, 2, 1, 3, 3, 0, 7, 8, 6, 7, 1, 4, 8, 2, 1, 4, 2, 6, 0, 6,
0, 1, 0, 8, 0, 6, 5, 1, 6, 6, 9, 2, 9, 2, 8, 5, 9, 4, 3, 9, 2, 9,
7, 9, 1, 3, 0, 3, 9, 2, 6, 1, 0, 0, 6, 3, 5, 0, 0, 3, 8, 0, 3, 0,
7, 7, 6, 1, 8, 8, 7, 2, 7, 5, 8, 5, 3, 7, 8, 2, 5, 4, 5, 1, 5, 7,
5, 6, 4, 0, 6, 7, 1, 1, 6, 4, 0, 4, 0, 1, 3, 4, 4, 4, 5, 4, 5, 5,
4, 3, 7, 9, 1, 1, 4, 7, 2, 0, 2, 9, 7, 8, 4, 8, 2, 4, 8, 7, 9, 4,
8, 0, 7, 0, 6, 5, 4, 2, 3, 5, 3, 5, 7, 7, 4, 1, 3, 0, 1, 1, 8, 6,
5, 1, 8, 0, 0, 3, 7, 7, 4, 9, 0, 4, 6, 9, 0, 7, 9, 2, 9, 2, 9, 6,
6, 5, 4, 5, 7, 3, 7, 7, 5, 2, 2, 7, 8, 9, 3, 3, 2, 6, 3, 6, 2, 1,
7, 4, 8, 0, 8, 2, 4, 3, 7, 6, 3, 5, 7, 9, 3, 7, 9, 5, 3, 7, 7, 6,
4, 8, 0, 8, 4, 6, 8, 4, 1, 7, 6, 5, 9, 3, 4, 5, 9, 8, 2, 3, 2, 5,
6, 4, 9, 1, 5, 9, 8, 2, 6, 1, 3, 1, 0, 7, 5, 2, 8, 1, 5, 2, 2, 3,
0, 0, 7, 8, 5, 2, 3, 5, 2, 6, 1, 3])
There are many ways to count the number of each digit; the simplest is to temporarily wrap the array in a
Pandas Series and use the .value_counts() method:
In [22]: pd.Series(y_test_classes).value_counts()
Out[22]:
0
5 55
3 55
1 55
9 54
7 54
6 54
4 54
0 54
2 53
8 52
Great! Our classes are balanced, with around 54 samples per class. Let’s quickly train a model to classify
these digits. First we load the necessary libraries:
We create a small, fully connected network with 64 inputs, a single inner layer with 16 nodes and 10 outputs
with a Softmax activation function:
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
Let’s also save the initial weights so that we can always re-start from the same initial configuration:
Now we fit the model on the training data for 100 epochs:
The model converged and we can evaluate the final training and test accuracies:
TIP: The character _ means that part of the function result is deliberately ignored, and
that variable can be thrown away.
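A minimal sketch of this convention, using a hypothetical function `evaluate_demo` that stands in for any call returning a pair of values:

```python
def evaluate_demo():
    """Hypothetical stand-in for a call that returns (loss, accuracy)."""
    return 0.25, 0.9

# Keep only the second value; the loss is bound to `_` and simply not used.
_, accuracy = evaluate_demo()
print(accuracy)
```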
The performance on the test set is lower than the performance on the training set, which indicates the
model is overfitting.
TIP: Overfitting is a fundamental concept in Machine Learning and Deep Learning. If you
are not familiar with it, have a look at Chapter 3.
Before we start playing with different techniques to reduce overfitting, it is legitimate to ask if we simply
don’t have enough data to solve the problem.
This is a very common situation: you collect data with labels, you train a model, and the model does not
perform as well as you hoped.
What should you do at that point? Should you collect more data? Or should you invest time in searching for
better features or a different model?
With the little information we have, it is hard to know which of these alternatives is more likely to help.
What is sure, on the other hand, is that all these alternatives carry a cost. For example, let’s say you think
that more data is what you need.
Collecting more labeled data could be as cheap and simple as downloading a new dataset from your source,
or it could be as involved and complex as coordinating with the data collection team at your company,
hiring contractors to label the new data, and so on. In other words, the time and cost associated with new
data collection strongly vary and need to be assessed case by case.
If, on the other hand, you decided to experiment with new features and model architectures, this could be as
simple as adding a few layers and nodes to your model, or as complex as an R&D team dedicating several
months to discovering new features for your particular dataset. Again, the actual cost of this option strongly
depends on your particular use case.
The learning curve is a tool we can use to answer that question. Here is how we build it.
First, we set X_test aside. Then, we take increasingly large fractions of X_train and use them to train
the model. For each of these fractions, we fit the model, then we evaluate the model on this fraction and on
the test set. When the training fraction is small, we expect the model to overfit the training data and perform
quite poorly on the test set.
As we gradually take more training data, the model should improve and learn to generalize better, i.e. the
test score should increase. We proceed like this until we have used all our training data.
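The procedure just described can be sketched end to end on a toy problem. To keep the example fast and self-contained we make several assumptions: the data is synthetic (two overlapping Gaussian blobs), the "model" is a simple nearest-centroid classifier written in NumPy instead of a Neural Network, and the fractions are chosen arbitrarily. Only the structure of the loop is the point here.

```python
import numpy as np

rng = np.random.RandomState(0)

def make_data(n_per_class):
    """Two overlapping Gaussian blobs in 2D (synthetic stand-in for real data)."""
    X0 = rng.normal(loc=-1.0, scale=1.5, size=(n_per_class, 2))
    X1 = rng.normal(loc=+1.0, scale=1.5, size=(n_per_class, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    order = rng.permutation(len(y))
    return X[order], y[order]

def fit_centroids(X, y):
    """'Train' a nearest-centroid classifier: one mean vector per class."""
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, X, y):
    """Fraction of samples whose nearest centroid matches their label."""
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((distances.argmin(axis=1) == y).mean())

X_train, y_train = make_data(200)
X_test, y_test = make_data(100)      # set aside, never used for fitting

fractions = [0.1, 0.4, 0.7, 1.0]     # arbitrary choice for this sketch
train_scores, test_scores = [], []
for fraction in fractions:
    n = int(len(X_train) * fraction)
    X_frac, y_frac = X_train[:n], y_train[:n]
    centroids = fit_centroids(X_frac, y_frac)          # fit on the fraction only
    train_scores.append(accuracy(centroids, X_frac, y_frac))
    test_scores.append(accuracy(centroids, X_test, y_test))

for fraction, tr, te in zip(fractions, train_scores, test_scores):
    print("{:.0%} of training data: train={:.3f}, test={:.3f}".format(fraction, tr, te))
```

Plotting `train_scores` and `test_scores` against the training sizes gives the learning curve.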
At this point two cases are possible. If it looks like the test performance stopped increasing with the size of
the training set, we probably reached the maximum performance of our model. In this case we should invest
time in looking for a better model to improve the performance.
In the second case, it could seem that the test error would continue to decrease if only we had access to more
training data. If that’s the case, we should probably go out looking for more labeled data first and then worry
about changing the model.
So, now you know how to answer the big question of more data or better model: use a learning curve.
Let’s draw one together. First we take increasing fractions of the training data using the function
np.linspace.
TIP: np.linspace returns evenly spaced numbers over a specified interval. In this case we
are creating 4 fractions, from 10% to 90% of the data.
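Here is a minimal sketch of how such fractions translate into training sizes. We assume the endpoints 0.1 and 0.9 mentioned in the tip and a training set of 1257 samples, which is what a 70/30 split of the 1797 digits leaves for training; converting the fractions to integer sizes is our own illustration.

```python
import numpy as np

n_train = 1257                              # 1797 digits minus the 540 test samples
fractions = np.linspace(0.1, 0.9, 4)        # assumed endpoints: 10% and 90%
train_sizes = (n_train * fractions).astype(int)

print(fractions)
print(train_sizes)
```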
Then we loop over the train sizes, and for each train_size we do the following:
Let’s first work through a single case (i.e. the first train_size in our train_sizes array); we’ll then
reuse that work to iterate over the full list of train_sizes.
In [31]: train_scores = []
test_scores = []
Now let’s take a fraction of the training data using the train_test_split function as we usually would:
In [33]: model.set_weights(initial_weights)
Now we can train our model using the fit function, as normal:
With our model trained, let’s evaluate it over our training set and save it into the train_scores variable
from above:
It’s kind of silly to do this manually for every train_size entry. Instead, let’s iterate over them and build up
our train_scores and test_scores variables:
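In [37]: train_scores = []
         test_scores = []

         for train_size in train_sizes:
             # take a fraction of the training data (split arguments assumed)
             X_train_frac, _, y_train_frac, _ = train_test_split(
                 X_train, y_train, train_size=train_size)

             # restart from the same initial weights, then train
             model.set_weights(initial_weights)
             h = model.fit(X_train_frac, y_train_frac,
                           verbose=0,
                           epochs=100)

             # score on the training fraction and on the test set
             r = model.evaluate(X_train_frac, y_train_frac,
                                verbose=0)
             train_scores.append(r[-1])
             e = model.evaluate(X_test, y_test, verbose=0)
             test_scores.append(e[-1])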
Let’s plot the training score and the test score as a function of increasing training size:
Judging from the curve, it appears the test score would keep improving if we added more data. This is the
indication we were looking for. If on the other hand the test score was not improving, it would have been
more promising to improve the model first and only then go look for more data if needed.
Reducing Overfitting
Sometimes it’s not easy to go out and look for more data. It could be time consuming and expensive. There
are a few ways to improve a model and reduce its propensity to overfit without requiring additional data.
These fall into the big family of Regularization techniques.
The general idea here is the following. By now you should be familiar with the idea that the complexity of a
model is somewhat represented by the number of parameters the model has. In simple terms, a model with
many layers and many nodes is more complex than a model with a single layer and few nodes. More
complexity gives the model more freedom to learn nuances in our training data. This is what makes Neural
Networks so powerful.
On the other hand, the more freedom a model has, the more likely it will be to overfit on the training data,
losing the ability to generalize. We could try to reduce the model freedom by reducing the model
complexity, but this would not always be a great idea as it would make the model less able to pick up subtle
patterns in our data.
A different approach would be to keep the model very complex, but change something else in the model in
order to push it towards less complex solutions. In other words, instead of removing the complexity
completely, we allow the model to choose complex solutions, but we push the model towards simpler, more
10.2. REDUCING OVERFITTING 433
regular, solutions. This is what regularization is about. Regularization refers to techniques to keep the
complexity of a model from spinning out of control.
Let’s review a few ways to regularize a model, and to ease our comparison we will define a few helper
functions.
First, let’s define a helper function to repeat the training several times. This helper function will be useful to
average out any statistical fluctuations in the model behavior due to the random initialization of the weights.
We will reset the backend at each iteration in order to save memory and erase any previous training.
And then let’s define the repeat_train helper function. This function expects an already created
model_fn as input, i.e. a function that returns a model, and it repeats the following process a number of
times specified by the input repeats:
K.clear_session()
model = model_fn()
h = model.fit(X_train, y_train,
validation_data=(X_test, y_test),
verbose=verbose,
batch_size=batch_size,
epochs=epochs)
• retrieve the accuracy of the model on training data (acc) and test data (val_acc) and append the
results to the histories array
histories.append([h.history['acc'], h.history['val_acc']])
Finally, the repeat_train function calculates the average history along with its standard deviation and
returns them.
"""
Repeatedly train a model on (X_train, y_train),
averaging the histories.
Parameters
----------
model_fn : a function with no parameters
Function that returns a Keras model
Returns
-------
mean, std : np.array, shape: (epochs, 2)
mean : array contains the accuracy
and validation accuracy history averaged
over the different training runs
std : array contains the standard deviation
over the different training runs of
accuracy and validation accuracy history
"""
histories = []
histories = np.array(histories)
print()
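The averaging step described above can be sketched without Keras. In the hedged example below, train_once is a hypothetical stand-in for "create a model with model_fn and fit it": it only has to return a dictionary with the per-epoch lists 'acc' and 'val_acc', like h.history. The names repeat_and_average and fake_training, and the layout of the returned lists, are our assumptions and not the book's code.

```python
import random
import statistics


def repeat_and_average(train_once, repeats=3):
    """Call train_once() `repeats` times and average the histories.

    train_once is a hypothetical stand-in for building and fitting a model.
    It must return a dict with the per-epoch lists 'acc' and 'val_acc'.
    """
    histories = []
    for repeat in range(repeats):
        h = train_once()
        histories.append([h['acc'], h['val_acc']])

    means, stds = [], []
    for metric in range(2):                    # 0: acc, 1: val_acc
        runs = [run[metric] for run in histories]
        per_epoch = list(zip(*runs))           # group the values by epoch
        means.append([statistics.mean(e) for e in per_epoch])
        stds.append([statistics.pstdev(e) for e in per_epoch])
    return means, stds


def fake_training(epochs=5):
    """Toy history: accuracy rises with the epoch, plus random noise."""
    acc = [0.5 + 0.10 * e + random.uniform(-0.02, 0.02) for e in range(epochs)]
    val = [0.5 + 0.08 * e + random.uniform(-0.02, 0.02) for e in range(epochs)]
    return {'acc': acc, 'val_acc': val}


random.seed(0)
means, stds = repeat_and_average(fake_training, repeats=5)
print([round(m, 3) for m in means[0]])   # averaged training accuracy
print([round(s, 3) for s in stds[0]])    # its spread across the runs
```

The standard deviation is what tells us whether a difference between two models is larger than the run-to-run fluctuations caused by random initialization.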
The repeat_train function expects an already created model_fn as input. Hence, let’s define a new
function that will create a fully connected Neural Network with 3 inner layers. We’ll call this function
base_model, since we will use this basic model for further comparison:
TIP: Notice that this model is quite big for the problem we are trying to solve. We
purposefully make the model big, so that there are lots of parameters and it can overfit
easily.
Now we train the base (non-regularized) model 5 times using the repeat_train helper
function:
0 1 2 3 4
We can plot the histories for training and test. First, let’s define an additional helper function
plot_mean_std(), which plots the average history as a line and adds a colored area around it
corresponding to +/- 1 standard deviation:
Then, let’s plot the results obtained by training the base model 5 times:
Overfitting in this case is evident, with the test score saturating at a lower value than the training score.
Model Regularization
Remember the Cost Function we introduced in Chapter 3? The main goal of the cost function is to
make sure that the predictions of the model are close to the correct labels.
Regularization works by modifying the original cost function C with an additional term λCr , that somehow
penalizes the complexity of the model:
C ′ = C + λCr (10.1)
The original cost function C would decrease as the model predictions got closer and closer to the actual
labels. In other words, the gradient descent algorithm for the original cost would push the parameters to the
region of parameter space that would give the best predictions on the training data. In complex models with
many parameters, this could result in overfitting because of all the freedom the model had.
438 CHAPTER 10. PERFORMANCE IMPROVEMENT
The new penalty Cr pushes the model to be “simple”, in other words it grows with the parameters of the
model, but it is completely unrelated to the goodness of the prediction.
Weight Regularization
The total cost C ′ is a combination of the two terms and therefore the model will have to try to generate the
best predictions possible, while retaining simplicity. In other words, the gradient descent algorithm is now
solving a constrained minimization problem, where some regions of the parameter space are too expensive
to be used for a solution.
The hyper-parameter λ controls the relative strength of the regularization, and it is ours to choose.
But how do we implement Cr in practice? There are several ways to do it. Weight Regularization assigns a
penalty proportional to the size of the weights, for example:
$$C_w = \sum_w |w| \tag{10.2}$$
or:
$$C_w = \sum_w w^2 \tag{10.3}$$
The first one is called l1-regularization and it is the sum of the absolute values of the weights. The second
one is called l2-regularization and it is the sum of the squares of the weights. While they both
suppress complexity, their effect is different.
l1-regularization pushes most weights to be exactly zero, with the exception of a few that will be non-zero.
In other words, the net effect of l1-regularization is to make the weight matrix sparse.
l2-regularization, on the other hand, suppresses weights quadratically. Suppressing weights quadratically
means that any weight larger than the rest will have a much greater contribution to Cr and therefore to the
overall cost. The net effect of this is to keep all weights small, without forcing any of them to be exactly zero.
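To make the difference between the two penalties concrete, here is a small sketch in plain Python. The function names and the numbers are ours and purely illustrative. The last two helpers use a one-weight toy problem (an assumption made for illustration, not the book's model): minimizing (w - a)**2 plus the penalty has a closed-form solution, which shows why l1 produces exact zeros while l2 only shrinks.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the weights (equation 10.2)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of the squares of the weights (equation 10.3)."""
    return sum(w * w for w in weights)


def total_cost(data_cost, weights, lam, penalty):
    """C' = C + lambda * C_r (equation 10.1)."""
    return data_cost + lam * penalty(weights)


def shrink_l1(a, lam):
    """Minimizer of (w - a)**2 + lam * abs(w): small weights become 0."""
    if abs(a) <= lam / 2:
        return 0.0
    return a - lam / 2 if a > 0 else a + lam / 2


def shrink_l2(a, lam):
    """Minimizer of (w - a)**2 + lam * w**2: every weight is scaled down."""
    return a / (1 + lam)


weights = [3.0, -0.5, 0.1]
print(l1_penalty(weights), l2_penalty(weights))
print(total_cost(1.0, weights, lam=0.1, penalty=l2_penalty))
print([shrink_l1(w, 0.4) for w in weights])   # the smallest weight is zeroed
print([shrink_l2(w, 0.4) for w in weights])   # all weights shrink, none is zero
```

In the l2 penalty the weight 3.0 contributes 9.0 out of a total of about 9.26, which is what "suppressing weights quadratically" means in practice.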
Similarly to weight regularization, Bias Regularization and Activity Regularization penalize the cost
function with a term proportional to the size of the biases and to the activations respectively.
Let’s compare the behavior of our base model with a model with exactly the same architecture but endowed
with the l2 weight regularization.
We start by defining a helper function that creates a model with weight regularization: we start from the
function base_model, and we create the function regularized_model, adding the
kernel_regularizer option to each layer. First of all let’s import keras’s l2 regularizer function:
model = Sequential()
model.add(Dense(1024,
input_shape=(64,),
activation='relu',
kernel_regularizer=reg))
model.add(Dense(1024,
activation='relu',
kernel_regularizer=reg))
model.add(Dense(1024,
activation='relu',
kernel_regularizer=reg))
model.add(Dense(10, activation='softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model
Now we compare the results of no regularization and l2-regularization. Let’s repeat the training 3 times.
0 1 2
TIP: Notice that, since we didn’t specify the number of times to train the model, it will
repeat the training according to the default parameter, i.e. 3 times.
Let’s now compare the performance of the weight regularized model with our base model. We will also plot
a dashed line at the maximum test accuracy obtained by the base model:
plot_mean_std(m_train_reg, s_train_reg)
plot_mean_std(m_test_reg, s_test_reg)
plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')
With this particular dataset, weight regularization does not seem to improve the model performance.
This holds at least within the small number of epochs we are running. Regularization might help if we
let the training run much longer, but we don’t know for sure, and finding out could cost a lot of time and
compute.
It’s however good to know that this technique exists and keep it in mind as one of the options to try. In
practice, weight regularization has been superseded by more modern regularization techniques such as
Dropout and Batch Normalization.
Dropout
In other words, during the training phase each unit has a non-zero probability of not emitting its output to the
next layer. This prevents units from co-adapting too much.
Let’s reflect on this for a second. Apparently we are damaging the network by dropping a fraction of the
units with non-zero probability during training time. We are crippling the network and making it a lot
harder for it to learn. This is clearly counter-intuitive! Why are we weakening our network?
Dropout
It turns out the underlying principle is actually quite common in Machine Learning: we make the network
less stable so that the solution found during training is more general, more robust, and more resilient to
failure. Another way to look at this is to say that we are adding noise at training time, so that the network
will need to learn more general patterns that are resistant to noise.
The technique has similarities with ensemble techniques, because it’s as if, during training, the network
sampled from many different “thinned” networks, where a fraction of the nodes are not working. At test
time, dropout is turned off and a single network with smaller weights is used. This technique has been
shown to improve the performance of Neural Networks on Supervised Learning tasks in vision, speech
recognition, document classification, and many others.
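The mechanism can be sketched in a few lines of plain Python. This is a hedged illustration of the "inverted" variant of dropout, in which the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to be rescaled at test time. The paper's original formulation instead scales the weights down at test time, as described above. The function name and signature are ours.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout on a list of activations.

    During training each activation is dropped with probability `rate`
    and the survivors are scaled by 1 / (1 - rate). At test time the
    input is returned unchanged.
    """
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
x = [1.0] * 10000
out = dropout(x, rate=0.5, rng=rng)

dropped = sum(1 for v in out if v == 0.0) / len(out)
print(round(dropped, 2))               # close to the dropout rate
print(round(sum(out) / len(out), 2))   # the average activation is preserved
print(dropout(x[:3], rate=0.5, training=False))
```

Each call draws a different random mask, which is the sense in which training samples from many "thinned" networks.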
We strongly encourage you to read the paper if you want to fully understand how dropout is implemented.
On the other hand, if you are eager to apply it, you’ll be happy to hear that Dropout is implemented in Keras
as a layer, so all we need to do is to add it between the layers. We’ll import it first:
And then we define a dropout_model, again starting from the base_model and adding the dropout layers.
We’ve tested several configurations and we’ve found that with this dataset good results can be obtained with
a dropout rate of 10% at the input and 50% in the inner layers. Feel free to experiment with different
numbers and see what results you get.
TIP: according to the Documentation, in the Dropout layer the argument rate is a float
between 0 and 1, that gives the fraction of the input units to drop.
model = Sequential()
model.add(Dropout(input_rate, input_shape=(64,)))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(rate))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(rate))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(rate))
model.add(Dense(10, activation='softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model
0 1 2
Next, let’s plot the accuracy of the dropout model against the base model:
plot_mean_std(m_train_dro, s_train_dro)
plot_mean_std(m_test_dro, s_test_dro)
plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')
Nice! Adding Dropout to our model pushed our test score above the base model for the first time (although
not by much)! This is great because we didn’t have to add more data. Also, notice how the training score is
lower than the test score, which indicates the model is not overfitting, and there seems to be even more
room for improvement if we run the training for more epochs!
The Dropout paper also mentions the use of a global constraint to further improve the behavior of a
Dropout network. Constraints can be added in Keras through the kernel_constraint parameter
available in the definition of a layer. Following the paper, let’s see what happens if we impose a max_norm
constraint on the weights of the model. According to the Documentation, this constrains the weights
incident to each hidden unit to have a norm no larger than a given constant, which the user specifies
with the argument c. Equivalently, the sum of the squares of those weights cannot exceed c squared.
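As a rough illustration of what such a constraint does after each weight update, the sketch below rescales a weight vector whenever its Euclidean norm exceeds c. Bounding the norm by c is the same as bounding the sum of the squared weights by c squared. The helper clip_to_max_norm is a hypothetical name of ours, written in plain Python, and is not the Keras implementation.

```python
import math


def clip_to_max_norm(weights, c):
    """Rescale `weights` so that their Euclidean norm is at most c."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= c:
        return list(weights)          # already inside the allowed region
    return [w * c / norm for w in weights]


incoming = [3.0, 4.0]                 # norm 5
clipped = clip_to_max_norm(incoming, c=2.0)
print(clipped)
print(clip_to_max_norm([0.3, 0.4], c=2.0))   # small weights are untouched
```

Note that the direction of the weight vector is preserved and only its length is capped, which is different from the l1 and l2 penalties that act on the cost function.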
Let’s define a new model function dropout_max_norm, that has both dropout and the max_norm
constraint:
model = Sequential()
model.add(Dropout(input_rate, input_shape=(64,)))
model.add(Dense(1024, activation='relu',
kernel_constraint=max_norm(c)))
model.add(Dropout(rate))
model.add(Dense(1024, activation='relu',
kernel_constraint=max_norm(c)))
model.add(Dropout(rate))
model.add(Dense(1024, activation='relu',
kernel_constraint=max_norm(c)))
model.add(Dropout(rate))
model.add(Dense(10, activation='softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model
0 1 2
plot_mean_std(m_train_dmn, s_train_dmn)
plot_mean_std(m_test_dmn, s_test_dmn)
plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')
In this particular case the Max Norm constraint does not seem to produce results that are qualitatively
different from the simple Dropout, but there may be datasets where this constraint helps make the network
converge to a better result.
Batch Normalization
Batch Normalization was introduced in 2015 as an even better regularization technique, as described in this
paper. The authors of the paper started from the observation that training of deep Neural Networks is slow
because the distribution of the inputs to a layer changes during training, as the parameters of the previous
layers change. Since the inputs to a layer are the outputs of the previous layer, and these are determined by
the parameters of the previous layer, as training proceeds the distribution of the output may drift, making it
harder for the next layer to adapt.
The authors’ solution to this problem is to introduce a normalization step between layers, which takes the
output values for the current batch and normalizes them by removing the mean and dividing by the standard
deviation. They observe that their technique allows us to use much higher learning rates and to be less careful
about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout.
Let’s walk through the batch algorithm with a small code example. First we calculate the mean and standard
deviation of the batch:
mu_B = X_batch.mean()
std_B = X_batch.std()
Finally we rescale the batch with 2 parameters γ and β that are learned during training:
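TIP: Using math notation, the complete algorithm for Batch Normalization is the following.
Given a mini-batch $B = \{x_1, \dots, x_m\}$:
$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i \tag{10.4}$$
$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \tag{10.5}$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{10.6}$$
$$y_i = \gamma \hat{x}_i + \beta \tag{10.7}$$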
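The four steps of the algorithm (batch mean, batch variance, normalization, rescaling with γ and β) can be written as a short plain-Python function for a batch of scalar values. This is a minimal sketch of the training-time computation only. The function name, the default values of gamma, beta and eps, and the use of a plain list are our assumptions; in a real layer γ and β are learned and the statistics are computed per feature.

```python
import math


def batch_norm(x_batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Batch-normalize a list of scalars (equations 10.4 to 10.7)."""
    m = len(x_batch)
    mu = sum(x_batch) / m                                   # batch mean
    var = sum((x - mu) ** 2 for x in x_batch) / m           # batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in x_batch]
    return [gamma * xh + beta for xh in x_hat]              # scale and shift


x_batch = [1.0, 2.0, 3.0, 4.0, 5.0]
y = batch_norm(x_batch)
print([round(v, 3) for v in y])

mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(round(mean_y, 6), round(var_y, 3))   # approximately 0 and 1
```

The small constant eps only guards against a division by zero when all the values in a batch are identical.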
Batch Normalization is very powerful, and Keras makes it available as a layer too, as described in the
Documentation. One important thing to note is that BN needs to be applied before the nonlinear activation
function. Let’s see how it’s done. First we load the BatchNormalization and Activation layers:
Then we define again a new model function batch_norm_model that adds Batch Normalization to our fully
connected network defined in the base_model:
Returns
-------
model : a compiled keras model
"""
model = Sequential()
model.add(Dense(1024, input_shape=(64,)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(1024))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(1024))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(10))
model.add(BatchNormalization())
model.add(Activation('softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model
Batch Normalization seems to work better with smaller batches, so we will run the repeat_train function
with a smaller batch_size.
Since smaller batches mean more weight updates at each epoch we will also run the training for fewer epochs.
We have 1257 points in the training set. Previously, we used batches of 256 points, which gives 5 weight
updates per epoch, and a total of 200 updates in 40 epochs. If we reduce the batch size to 32, we will have 40
updates at each epoch, so we should run the training for only 5 epochs.
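The arithmetic in this paragraph is easy to check. The snippet below assumes, as is usual, that the last incomplete batch of an epoch still counts as one weight update, so the number of updates per epoch is the ceiling of the number of samples divided by the batch size. The helper name is ours.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the training set."""
    return math.ceil(n_samples / batch_size)


n_train = 1257

old_updates = updates_per_epoch(n_train, 256)     # 5 updates per epoch
total_updates = old_updates * 40                  # 200 updates in 40 epochs
new_updates = updates_per_epoch(n_train, 32)      # 40 updates per epoch
equivalent_epochs = total_updates // new_updates  # 5 epochs

print(old_updates, total_updates, new_updates, equivalent_epochs)
```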
We will actually run it a bit longer in order to see the effectiveness of Batch Normalization. 10-15 epochs will
suffice to bring the model accuracy to a much higher value on the test set.
0 1 2
Let’s plot the results and compare with the base model:
plot_mean_std(m_train_bn, s_train_bn)
plot_mean_std(m_test_bn, s_test_bn)
plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')
Awesome! With the addition of Batch Normalization, the model converged to a solution that is better able
to generalize on the Test set, i.e. it is overfitting a lot less than the base solution.
Data augmentation
Another very powerful technique to improve the performance of a model without requiring the collection
of new data is Data Augmentation. Let’s consider the problem of image recognition, and to make things
practical, let’s consider this nice picture of a squirrel:
If your goal was to recognize the animal in this picture, you would still be able to solve the task effectively
even if we distorted the image or rotated it. In fact there’s a variety of transformations that we could apply to
the image, without altering its information content, including:
• rotation
• shift (up, down, left, right)
• shear
• zoom
• flip (vertical, horizontal)
• rescale
• color correction and changes
• partial occlusion
10.3. DATA AUGMENTATION 451
Image of a squirrel
All these transformations would not destroy the information contained in the image. They would just
change the absolute values of the pixels. A human would still be able to recognize a rotated squirrel or a
shifted panda, very much like you can still recognize your friends after all the filters they apply to their
selfies. This property means a good image recognition algorithm should also be resilient to these kinds of
transformations.
If we apply these transformations to an image in our training dataset, we can generate a practically unlimited
number of variations of that image, giving us access to a much larger synthetic training dataset. This process is
what data augmentation is about: generating new labeled data points starting from existing data through the
use of valid transformations.
Although the example we provided is in the domain of image recognition, the same process can be applied
to augment other kinds of data, for example speech samples for a speech recognition task. Given a certain
sound file we can change its speed and pitch, add background noise, add silences to generate variations of
the speech snippet that would still be perfectly understood by a human.
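To see what "generating new labeled data points" means at the pixel level, here is a toy sketch in plain Python, where an image is just a list of rows. The three transformations and the helper names are ours, chosen for brevity; real pipelines apply random, continuous versions of them. The key point is that every variant keeps the label of the original image.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def shift_right(img, pixels, fill=0):
    """Shift every row right by `pixels`, filling the gap with `fill`."""
    width = len(img[0])
    return [[fill] * pixels + row[:width - pixels] for row in img]


def rotate_90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img, label):
    """Return the original image plus three variants, all with the same label."""
    variants = [img, flip_horizontal(img), shift_right(img, 1), rotate_90(img)]
    return [(variant, label) for variant in variants]


img = [[1, 0, 0],
       [1, 0, 0],
       [1, 1, 0]]

for variant, label in augment(img, 'squirrel'):
    print(label, variant)
```

Which transformations are "valid" depends on the task: mirroring a squirrel is harmless, while mirroring a handwritten digit may turn it into something that is no longer that digit.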
Let’s see how Keras allows us to do it easily for images. We need to load the ImageDataGenerator object:
This class creates a generator that can apply all sorts of variations to an input image. Let’s initialize it with a
few parameters:
• We’ll set the rescale factor to 1/255 to normalize pixel values to the interval [0, 1]
• We’ll set the width_shift_range and height_shift_range to ±10% of the total range
• We’ll set the rotation_range to ±20 degrees
• We’ll set the shear_range to ±0.3 degrees
• We’ll set the zoom_range
See the Documentation for a complete overview of all the available arguments.
The next step is to create an iterator that will generate images with the image data generator. We basically
need to tell it where our training data are. Here we use the method flow_from_directory, which is useful
when we have images stored in a directory, and we tell it to produce target images of size 128x128. The input
folder structure needs to be:
top/
    class_0/
    class_1/
    ...
Where top is the folder we will flow from, and the images are organized into one subfolder for each class.
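The folder layout is what carries the labels: each subfolder name becomes a class. The sketch below builds such a tree in a temporary directory and derives a mapping from class name to integer index by sorting the subfolder names. As far as we know this mirrors the alphanumeric ordering Keras uses, but treat that as an assumption and check the class_indices attribute of the iterator in practice. The helper name is ours.

```python
import os
import tempfile


def class_indices_from_directory(top):
    """Map each subfolder of `top` to an integer label, in sorted order."""
    classes = sorted(
        name for name in os.listdir(top)
        if os.path.isdir(os.path.join(top, name))
    )
    return {name: index for index, name in enumerate(classes)}


with tempfile.TemporaryDirectory() as top:
    # create top/class_1 and top/class_0, plus a stray file that is ignored
    for name in ['class_1', 'class_0']:
        os.makedirs(os.path.join(top, name))
    open(os.path.join(top, 'notes.txt'), 'w').close()
    mapping = class_indices_from_directory(top)

print(mapping)
```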
for i in range(16):
    img, label = train_gen.next()
    plt.subplot(4, 4, i+1)
    plt.imshow(img[0])
Great! In all of the images the squirrel is still visible and from a single image we have generated 16 different
images that we can use for training!
Let’s apply this technique to our digits and see if we can improve the score on the test set. We will use
slightly less dramatic transformations and also fill the empty space with zeros along the border.
We will need to reshape our data into tensors with 4 axes, in order to use it with the ImageDataGenerator,
so let’s do it:
We can use the method .flow to flow directly from a dataset. We will need to provide the labels as well.
Notice that by default the .flow method generates a batch of 32 images with corresponding labels:
In [69]: imgs.shape
Out[69]: (32, 8, 8, 1)
As you can see the digits are deformed, due to the very low resolution of the images. Will this help our
network or confuse it? Let’s find out!
We will need a model that is able to deal with a tensor input, since the images are now tensors of order 4.
Luckily, it’s very simple to adapt our base model to have a Flatten layer as input:
model.add(Dense(1024, activation='relu'))
model.add(Dense(1024, activation='relu'))
model.add(Dense(1024, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile('adam', 'categorical_crossentropy',
metrics=['accuracy'])
return model
We also need to define a new repeat_train_generator function that allows us to train a model from a
generator. We can take the original repeat_train function and modify it. We will follow the same
procedure used in repeat_train, with 2 differences:
1. We’ll define a generator that yields batches from X_train_t using the image data generator
h = model.fit(X_train, y_train,
validation_data=(X_test, y_test),
verbose=verbose,
batch_size=batch_size,
epochs=epochs)
h = model.fit_generator(train_gen,
steps_per_epoch=steps_per_epoch,
epochs=epochs,
validation_data=(X_test_t, y_test),
verbose=verbose)
Notice that, since we are now feeding variations of the data in the training set the concept of an epoch
becomes blurry. When does an epoch terminate if we flow random variations of the training data? The
model.fit_generator function allows us to define how many steps_per_epoch we want. We will use
the value of 5, with a batch_size of 256 like in most of the examples above.
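A generator that produces random variations never runs out, so an "epoch" is simply the number of batches we decide to draw from it. The plain-Python sketch below makes this concrete with an endless generator of noisy batches. The generator, its noise model and the integer "samples" are toy assumptions of ours, standing in for the image data generator.

```python
import itertools
import random


def endless_batches(samples, batch_size, rng):
    """Yield randomly perturbed batches forever, like a data generator."""
    while True:
        batch = [rng.choice(samples) for _ in range(batch_size)]
        # a small random perturbation stands in for a real augmentation
        yield [x + rng.uniform(-0.1, 0.1) for x in batch]


rng = random.Random(0)
gen = endless_batches(list(range(1257)), batch_size=256, rng=rng)

steps_per_epoch = 5
epoch = list(itertools.islice(gen, steps_per_epoch))   # one "epoch"
seen = sum(len(batch) for batch in epoch)

print(len(epoch), seen)   # 5 batches, 1280 samples
```

With 5 steps of 256 samples the model sees 1280 samples per epoch, roughly the size of the 1257-point training set, which keeps the comparison with the earlier experiments fair.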
Parameters
----------
Returns
-------
mean, std : np.array, shape: (epochs, 2)
mean : array contains the accuracy
and validation accuracy history averaged
over the different training runs
std : array contains the standard deviation
over the different training runs of
accuracy and validation accuracy history
"""
# generator that flows batches from X_train_t
train_gen = digit_idg.flow(X_train_t, y=y_train,
batch_size=batch_size)
histories = []
h.history['val_acc']])
print(repeat, end=" ")
histories = np.array(histories)
print()
0 1 2
plot_mean_std(m_train_gen, s_train_gen)
plot_mean_std(m_test_gen, s_test_gen)
plt.axhline(m_test_base.max(),
linestyle='dashed',
color='black')
As you can see, the Data Augmentation process improved the performance of the model on our test set.
This makes a lot of sense, because by feeding variations of the input data as training we have made the
model more resilient to changes in the input features.
Hyperparameter optimization
One final note on hyper-parameter optimization. Neural Network models have a lot of hyper-parameters.
These are things like:
• model architecture
    - number of layers
    - type of layers
    - number of nodes
    - activation functions
    - . . .
• optimizer parameters
    - optimizer type
    - learning rate
    - momentum
    - . . .
• training parameters
    - batch size
    - learning rate scheduling
    - number of epochs
    - . . .
These parameters are called Hyper-parameters because they define the training experiment and the model is
not allowed to change them while training. That said, they turn out to be really important in determining
the success of a model in solving a particular problem.
The topic of hyper-parameter tuning is really vast and we don’t have space to cover it but we want to
mention here a few tools that can help you achieve optimal hyper-parameter tuning.
Hyperopt is a Python library that can perform generalized hyper-parameter tuning using a technique called
Bayesian Optimization.
Hyperas is a library that connects Hyperopt and Keras, making it easy to run parallel trainings of a keras
AWS SageMaker and Google Cloud ML offer options for spawning parallel training experiments with
different hyper-parameter combinations.
Determined.ai and Pipeline.ai also offer this feature as part of their cloud training platform.
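To give a flavor of what these tools automate, here is a minimal sketch of random search, the simplest tuning strategy. It is not Bayesian Optimization and it does not use the API of any of the libraries mentioned above. The search space, the toy_score function (a made-up stand-in for "train a model and return its validation accuracy") and all names are illustrative assumptions.

```python
import random

search_space = {
    'n_layers': [1, 2, 3],
    'n_nodes': [64, 256, 1024],
    'learning_rate': [0.1, 0.01, 0.001],
    'batch_size': [32, 256],
}


def sample_config(space, rng):
    """Pick one value at random for every hyper-parameter."""
    return {name: rng.choice(values) for name, values in space.items()}


def toy_score(config):
    """Made-up score that peaks at learning_rate=0.01 and n_layers=2."""
    return (-abs(config['learning_rate'] - 0.01)
            - 0.1 * abs(config['n_layers'] - 2))


def random_search(space, score_fn, n_trials, rng):
    """Evaluate n_trials random configurations and keep the best one."""
    best_config, best_score = None, float('-inf')
    for _ in range(n_trials):
        config = sample_config(space, rng)
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(
    search_space, toy_score, n_trials=20, rng=random.Random(0))
print(best_config, best_score)
```

Each trial is independent of the others, which is why this kind of search is easy to run in parallel on the platforms listed above.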
Exercises
Exercise 1
This is a long and complex exercise, that should give you an idea of a real world scenario. Feel free to look at
the solution if you feel lost. Also, feel free to run this on a GPU.
First of all download and unpack the male/female pictures from here into a subfolder of the ../data folder.
These images and labels were obtained from Crowdflower.
Your goal is to build an image classifier that will recognize the gender of a person from pictures.
We started our journey from the basics of Data Manipulation and Machine Learning, and then we
introduced Deep Learning and Neural Networks. We learned about Deep Learning internals and the math
that makes Neural Networks function. Then we explored more complex architectures like Convolutional
Neural Networks for Images and Recurrent Neural Networks for Time Series and for Text Data. Finally, we
learned how to train our models on GPUs to speed up training and how to improve a model if it’s overfitting.
With Chapter 10 we conclude the part of the book that deals with understanding how Neural Networks
work and how they can be trained and we shift gears to more recent applications.
Many of the techniques we learned are only a few years old, yet the field of Deep Learning is evolving really
fast and in the last few years many new techniques have been invented and discovered.
This chapter walks through how to piggyback on the shoulders of giants. In fact, we will learn how to use
pre-trained networks, i.e. networks that have already been trained on a similar task, and adapt them to the
task we would like to perform.
These are often very large networks, with tens of millions of parameters, that have been trained on very large
datasets. It would cost us a lot of computing power to re-train them from scratch. Luckily, we don’t need to
do that. Let’s see how.
462 CHAPTER 11. PRETRAINED MODELS FOR IMAGES
In [3]: df = pd.read_csv('../data/sports.csv')
In [4]: df.head()
Out[4]:
As you can see, the dataset contains 3 columns: the image url, the class, and the label confidence.
Let’s first have a look at how many classes there are using the .value_counts() method on the
df['class'] column. This method groups the entries in the column by value and counts how many
occurrences there are in each group:
In [5]: df['class'].value_counts()
Out[5]:
class
Cross-country skiing 1003
Beach volleyball 1002
Formula racing 1001
11.1. RECOGNIZING SPORTS FROM IMAGES 463
There are 3 classes, with approximately a thousand images each. This is not enough examples to train a
convolutional network from scratch. We’ll need to use transfer learning to solve the problem.
Before we dive into it, let’s prepare train and test datasets and let’s download all the images to disk. We first
import the train_test_split function from Scikit-Learn:
Then we split the dataframe df into 70% train and 30% test. Notice that we stratify the split according to the
class distribution, i.e. we make sure that the train set (and the test set) is composed of 1/3 skiing, 1/3 beach
volleyball and 1/3 formula racing.
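Stratification simply means splitting each class separately, so that both parts keep the original class proportions. The plain-Python sketch below illustrates the idea on label counts matching the ones shown above. It is not Scikit-Learn's implementation, and the function name, the seed and the rounding rule are our assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_indices, test_indices), splitting class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)   # this class's test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = (['Cross-country skiing'] * 1003
          + ['Beach volleyball'] * 1002
          + ['Formula racing'] * 1001)

train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both counters show roughly one third of the images in each class, which is exactly what the stratify option is meant to guarantee.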
Now that we have set up our datasets, we need to download the images. Let’s define a helper function that
checks if an image has already been downloaded and downloads it if it doesn’t exist. We import the os
module to be able to create folders and files:
In [8]: import os
We also load the urlretrieve function (from the urllib.request module), which allows us to download an
image from a url:
Args:
save_dir: Path where images are saved
Let’s test our function on the first item in the dataframe. Let’s retrieve the first url:
Out[11]: 'https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/7ad/a7b/7ada7b21d671
Out[13]: 0
The function returns 1, since the image was not there. Notice that if we run it again, this time the function
will return 0:
Out[14]: 0
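A helper with this behavior could look like the sketch below. It follows the description in the text (return 1 when the image is downloaded, 0 when it is already on disk) and the argument order used when the function is submitted to the executor. The file layout (save_dir/label/file name) and the extra retrieve parameter are our assumptions. The latter is there only so that the example can run offline with a fake downloader, and the example URL is a placeholder.

```python
import os
import tempfile
from urllib.request import urlretrieve


def maybe_download_image(save_dir, url, label, retrieve=urlretrieve):
    """Download url into save_dir/label/ unless the file is already there.

    Returns 1 if the image was downloaded, 0 if it already existed.
    """
    label_dir = os.path.join(save_dir, label)
    os.makedirs(label_dir, exist_ok=True)     # safe to call from many threads
    path = os.path.join(label_dir, url.split('/')[-1])
    if os.path.exists(path):
        return 0
    retrieve(url, path)
    return 1


def fake_retrieve(url, path):
    """Offline stand-in for urlretrieve: writes a placeholder file."""
    with open(path, 'wb') as f:
        f.write(b'not a real image')


with tempfile.TemporaryDirectory() as save_dir:
    url = 'https://example.com/images/squirrel.jpg'
    first = maybe_download_image(save_dir, url, 'animals',
                                 retrieve=fake_retrieve)
    second = maybe_download_image(save_dir, url, 'animals',
                                  retrieve=fake_retrieve)

print(first, second)
```

Saving each image under a folder named after its class produces the one-subfolder-per-class layout that flow_from_directory expects.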
Now we need to download all the images in the dataframe. We could simply loop over the rows, but this can
be tediously slow. Instead we’ll resort to asynchronous downloading and we’ll start many threads to
download the images concurrently. To do this we need to import the ThreadPoolExecutor from the
concurrent.futures library, as well as the as_completed function:
The as_completed function is an iterator over the given futures that yields each as it completes. With these
two components, let’s build another helper function that distributes all the urls to a ThreadPoolExecutor
and runs them in parallel.
Args:
save_dir: Path where images are saved
image_urls: A list of image urls
image_labels: A list of image labels
max_workers: Concurrent threads (default=20)
"""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # we build a dictionary with the futures
        # as keys and the urls as values
        future_to_url = {}
        for url, label in zip(image_urls, image_labels):
            k = executor.submit(maybe_download_image,
                                save_dir, url, label)
            future_to_url[k] = url
        # handle each result as soon as its download completes
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                result = future.result()
                print(result, end='')
            except Exception as ex:
                print('%r exception: %s' % (url, ex))
Let’s run the get_images function on the train dataset. We’ll save them in a sports/train folder inside
../data:
In [18]: get_images(train_path,
train_df['image_url'].values,
train_df['class'].values)
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000
Similarly we’ll save the test images in a sports/test folder inside ../data:
In [20]: get_images(test_path,
test_df['image_url'].values,
test_df['class'].values)
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000
Now that we have downloaded the data, we are ready to tackle transfer learning.
Keras applications
Keras offers many pre-trained models in the keras.applications module. All of them are models
trained for image classification on the Imagenet dataset and they have different architectures. Here we
summarize their main properties:
As you can see, some of them have a large memory footprint (over 500 MB), while some others trade a
bit of accuracy for a smaller footprint that makes them perfect to run on a mobile phone.
• we can partially retrain them and adapt them to classify new objects, using only a few input images
and a laptop (no need for GPU)!
Let’s go ahead and explore both these applications. First of all we’re going to load the image module from
keras.preprocessing, which will allow us to load images from disk:
Now let’s load an image from the ones we have downloaded previously. Let’s define the input_path:
Jupyter notebook can display images inline, so let’s have a look at it:
In [24]: img
Out[24]:
11.2. KERAS APPLICATIONS 469
What type of Python object is img? We can check it with the type function and see that it’s a PIL Image
object.
In [25]: type(img)
Out[25]: PIL.Image.Image
Now let’s convert it to a numpy array so that we can feed it to the model. We will use the img_to_array
function from the image module we’ve just loaded:
Now the image is an order-3 tensor with 299 pixels in height and width and 3 color channels for RGB:
In [27]: img_array.shape
Keras convolutional models require an input with 4 axes, i.e. an order-4 tensor, where the first axis locates
the image in the dataset. In this case we only have one image, but we can still think of it as the first element in
an order-4 array. We can add this “dummy” dimension with the np.expand_dims function:
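A minimal sketch of this step, using a zero-filled placeholder in place of the real image (the names dummy_image and dummy_batch are made up for illustration):

```python
import numpy as np

# Placeholder standing in for one 299x299 RGB image: (height, width, channels)
dummy_image = np.zeros((299, 299, 3), dtype="float32")

# Insert a new axis in position 0: the "batch" axis that indexes images
dummy_batch = np.expand_dims(dummy_image, axis=0)

print(dummy_image.shape)  # (299, 299, 3)
print(dummy_batch.shape)  # (1, 299, 299, 3)
```

The data itself is unchanged; only the shape gains a leading axis of length 1.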
Let’s double check that the shape of this new tensor is the one we want:
In [29]: img_tensor.shape
We can create a pre-trained model simply by creating an instance of Xception with the
weights='imagenet' parameter. This command will download the pre-trained weights and create a
model with the Xception architecture.
TIP: note that it could take a few minutes to download the weights.
In [32]: model.summary()
________________________________________________________________________________
__________________
Layer (type) Output Shape Param # Connected to
11.3. PREDICT CLASS WITH PRE-TRAINED XCEPTION 471
================================================================================
==================
input_1 (InputLayer) (None, None, None, 3 0
________________________________________________________________________________
__________________
block1_conv1 (Conv2D) (None, None, None, 3 864 input_1[0][0]
________________________________________________________________________________
__________________
block1_conv1_bn (BatchNormaliza (None, None, None, 3 128
block1_conv1[0][0]
________________________________________________________________________________
__________________
block1_conv1_act (Activation) (None, None, None, 3 0
block1_conv1_bn[0][0]
________________________________________________________________________________
__________________
block1_conv2 (Conv2D) (None, None, None, 6 18432
block1_conv1_act[0][0]
________________________________________________________________________________
__________________
block1_conv2_bn (BatchNormaliza (None, None, None, 6 256
block1_conv2[0][0]
________________________________________________________________________________
__________________
block1_conv2_act (Activation) (None, None, None, 6 0
block1_conv2_bn[0][0]
________________________________________________________________________________
__________________
block2_sepconv1 (SeparableConv2 (None, None, None, 1 8768
block1_conv2_act[0][0]
________________________________________________________________________________
__________________
block2_sepconv1_bn (BatchNormal (None, None, None, 1 512
block2_sepconv1[0][0]
________________________________________________________________________________
__________________
block2_sepconv2_act (Activation (None, None, None, 1 0
block2_sepconv1_bn[0][0]
________________________________________________________________________________
__________________
block2_sepconv2 (SeparableConv2 (None, None, None, 1 17536
block2_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block2_sepconv2_bn (BatchNormal (None, None, None, 1 512
block2_sepconv2[0][0]
________________________________________________________________________________
__________________
conv2d_1 (Conv2D) (None, None, None, 1 8192
block1_conv2_act[0][0]
________________________________________________________________________________
__________________
block2_pool (MaxPooling2D) (None, None, None, 1 0
block2_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
batch_normalization_1 (BatchNor (None, None, None, 1 512 conv2d_1[0][0]
________________________________________________________________________________
__________________
block4_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block4_sepconv2_bn (BatchNormal (None, None, None, 7 2912
block4_sepconv2[0][0]
________________________________________________________________________________
__________________
conv2d_3 (Conv2D) (None, None, None, 7 186368 add_2[0][0]
________________________________________________________________________________
__________________
block4_pool (MaxPooling2D) (None, None, None, 7 0
block4_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
batch_normalization_3 (BatchNor (None, None, None, 7 2912 conv2d_3[0][0]
________________________________________________________________________________
__________________
add_3 (Add) (None, None, None, 7 0
block4_pool[0][0]
batch_normalization_3[0][0]
________________________________________________________________________________
__________________
block5_sepconv1_act (Activation (None, None, None, 7 0 add_3[0][0]
________________________________________________________________________________
__________________
block5_sepconv1 (SeparableConv2 (None, None, None, 7 536536
block5_sepconv1_act[0][0]
________________________________________________________________________________
__________________
block5_sepconv1_bn (BatchNormal (None, None, None, 7 2912
block5_sepconv1[0][0]
________________________________________________________________________________
__________________
block5_sepconv2_act (Activation (None, None, None, 7 0
block5_sepconv1_bn[0][0]
________________________________________________________________________________
__________________
block5_sepconv2 (SeparableConv2 (None, None, None, 7 536536
block5_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block5_sepconv2_bn (BatchNormal (None, None, None, 7 2912
block5_sepconv2[0][0]
________________________________________________________________________________
__________________
block5_sepconv3_act (Activation (None, None, None, 7 0
block5_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
block5_sepconv3 (SeparableConv2 (None, None, None, 7 536536
block5_sepconv3_act[0][0]
________________________________________________________________________________
__________________
block5_sepconv3_bn (BatchNormal (None, None, None, 7 2912
block5_sepconv3[0][0]
________________________________________________________________________________
__________________
add_4 (Add) (None, None, None, 7 0
block5_sepconv3_bn[0][0]
add_3[0][0]
________________________________________________________________________________
__________________
block6_sepconv1_act (Activation (None, None, None, 7 0 add_4[0][0]
________________________________________________________________________________
__________________
block6_sepconv1 (SeparableConv2 (None, None, None, 7 536536
block6_sepconv1_act[0][0]
________________________________________________________________________________
__________________
block6_sepconv1_bn (BatchNormal (None, None, None, 7 2912
block6_sepconv1[0][0]
________________________________________________________________________________
__________________
block6_sepconv2_act (Activation (None, None, None, 7 0
block6_sepconv1_bn[0][0]
________________________________________________________________________________
__________________
block6_sepconv2 (SeparableConv2 (None, None, None, 7 536536
block6_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block6_sepconv2_bn (BatchNormal (None, None, None, 7 2912
block6_sepconv2[0][0]
________________________________________________________________________________
__________________
block6_sepconv3_act (Activation (None, None, None, 7 0
block6_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
block6_sepconv3 (SeparableConv2 (None, None, None, 7 536536
block6_sepconv3_act[0][0]
________________________________________________________________________________
__________________
block6_sepconv3_bn (BatchNormal (None, None, None, 7 2912
block6_sepconv3[0][0]
________________________________________________________________________________
__________________
add_5 (Add) (None, None, None, 7 0
block6_sepconv3_bn[0][0]
add_4[0][0]
________________________________________________________________________________
__________________
block7_sepconv1_act (Activation (None, None, None, 7 0 add_5[0][0]
________________________________________________________________________________
__________________
block7_sepconv1 (SeparableConv2 (None, None, None, 7 536536
block7_sepconv1_act[0][0]
________________________________________________________________________________
__________________
block7_sepconv1_bn (BatchNormal (None, None, None, 7 2912
block7_sepconv1[0][0]
________________________________________________________________________________
__________________
block7_sepconv2_act (Activation (None, None, None, 7 0
block7_sepconv1_bn[0][0]
________________________________________________________________________________
__________________
block7_sepconv2 (SeparableConv2 (None, None, None, 7 536536
block7_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block7_sepconv2_bn (BatchNormal (None, None, None, 7 2912
block7_sepconv2[0][0]
________________________________________________________________________________
__________________
block7_sepconv3_act (Activation (None, None, None, 7 0
block7_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
block7_sepconv3 (SeparableConv2 (None, None, None, 7 536536
block7_sepconv3_act[0][0]
________________________________________________________________________________
__________________
block7_sepconv3_bn (BatchNormal (None, None, None, 7 2912
block7_sepconv3[0][0]
________________________________________________________________________________
__________________
add_6 (Add) (None, None, None, 7 0
block7_sepconv3_bn[0][0]
add_5[0][0]
________________________________________________________________________________
__________________
block8_sepconv1_act (Activation (None, None, None, 7 0 add_6[0][0]
________________________________________________________________________________
__________________
block8_sepconv1 (SeparableConv2 (None, None, None, 7 536536
block8_sepconv1_act[0][0]
________________________________________________________________________________
__________________
block8_sepconv1_bn (BatchNormal (None, None, None, 7 2912
block8_sepconv1[0][0]
________________________________________________________________________________
__________________
block8_sepconv2_act (Activation (None, None, None, 7 0
block8_sepconv1_bn[0][0]
________________________________________________________________________________
__________________
block8_sepconv2 (SeparableConv2 (None, None, None, 7 536536
block8_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block8_sepconv2_bn (BatchNormal (None, None, None, 7 2912
block8_sepconv2[0][0]
________________________________________________________________________________
__________________
block8_sepconv3_act (Activation (None, None, None, 7 0
block8_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
block8_sepconv3 (SeparableConv2 (None, None, None, 7 536536
block8_sepconv3_act[0][0]
________________________________________________________________________________
__________________
block8_sepconv3_bn (BatchNormal (None, None, None, 7 2912
block8_sepconv3[0][0]
________________________________________________________________________________
__________________
__________________
block10_sepconv2 (SeparableConv (None, None, None, 7 536536
block10_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block10_sepconv2_bn (BatchNorma (None, None, None, 7 2912
block10_sepconv2[0][0]
________________________________________________________________________________
__________________
block10_sepconv3_act (Activatio (None, None, None, 7 0
block10_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
block10_sepconv3 (SeparableConv (None, None, None, 7 536536
block10_sepconv3_act[0][0]
________________________________________________________________________________
__________________
block10_sepconv3_bn (BatchNorma (None, None, None, 7 2912
block10_sepconv3[0][0]
________________________________________________________________________________
__________________
add_9 (Add) (None, None, None, 7 0
block10_sepconv3_bn[0][0]
add_8[0][0]
________________________________________________________________________________
__________________
block11_sepconv1_act (Activatio (None, None, None, 7 0 add_9[0][0]
________________________________________________________________________________
__________________
block11_sepconv1 (SeparableConv (None, None, None, 7 536536
block11_sepconv1_act[0][0]
________________________________________________________________________________
__________________
block11_sepconv1_bn (BatchNorma (None, None, None, 7 2912
block11_sepconv1[0][0]
________________________________________________________________________________
__________________
block11_sepconv2_act (Activatio (None, None, None, 7 0
block11_sepconv1_bn[0][0]
________________________________________________________________________________
__________________
block11_sepconv2 (SeparableConv (None, None, None, 7 536536
block11_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block11_sepconv2_bn (BatchNorma (None, None, None, 7 2912
block11_sepconv2[0][0]
________________________________________________________________________________
__________________
block11_sepconv3_act (Activatio (None, None, None, 7 0
block11_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
block11_sepconv3 (SeparableConv (None, None, None, 7 536536
block11_sepconv3_act[0][0]
________________________________________________________________________________
__________________
block11_sepconv3_bn (BatchNorma (None, None, None, 7 2912
block11_sepconv3[0][0]
________________________________________________________________________________
__________________
add_10 (Add) (None, None, None, 7 0
block11_sepconv3_bn[0][0]
add_9[0][0]
________________________________________________________________________________
__________________
block12_sepconv1_act (Activatio (None, None, None, 7 0 add_10[0][0]
________________________________________________________________________________
__________________
block12_sepconv1 (SeparableConv (None, None, None, 7 536536
block12_sepconv1_act[0][0]
________________________________________________________________________________
__________________
block12_sepconv1_bn (BatchNorma (None, None, None, 7 2912
block12_sepconv1[0][0]
________________________________________________________________________________
__________________
block12_sepconv2_act (Activatio (None, None, None, 7 0
block12_sepconv1_bn[0][0]
________________________________________________________________________________
__________________
block12_sepconv2 (SeparableConv (None, None, None, 7 536536
block12_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block12_sepconv2_bn (BatchNorma (None, None, None, 7 2912
block12_sepconv2[0][0]
________________________________________________________________________________
__________________
block12_sepconv3_act (Activatio (None, None, None, 7 0
block12_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
block12_sepconv3 (SeparableConv (None, None, None, 7 536536
block12_sepconv3_act[0][0]
________________________________________________________________________________
__________________
block12_sepconv3_bn (BatchNorma (None, None, None, 7 2912
block12_sepconv3[0][0]
________________________________________________________________________________
__________________
add_11 (Add) (None, None, None, 7 0
block12_sepconv3_bn[0][0]
add_10[0][0]
________________________________________________________________________________
__________________
block13_sepconv1_act (Activatio (None, None, None, 7 0 add_11[0][0]
________________________________________________________________________________
__________________
block13_sepconv1 (SeparableConv (None, None, None, 7 536536
block13_sepconv1_act[0][0]
________________________________________________________________________________
__________________
block13_sepconv1_bn (BatchNorma (None, None, None, 7 2912
block13_sepconv1[0][0]
________________________________________________________________________________
__________________
block13_sepconv2_act (Activatio (None, None, None, 7 0
block13_sepconv1_bn[0][0]
________________________________________________________________________________
__________________
block13_sepconv2 (SeparableConv (None, None, None, 1 752024
block13_sepconv2_act[0][0]
________________________________________________________________________________
__________________
block13_sepconv2_bn (BatchNorma (None, None, None, 1 4096
block13_sepconv2[0][0]
________________________________________________________________________________
__________________
conv2d_4 (Conv2D) (None, None, None, 1 745472 add_11[0][0]
________________________________________________________________________________
__________________
block13_pool (MaxPooling2D) (None, None, None, 1 0
block13_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
batch_normalization_4 (BatchNor (None, None, None, 1 4096 conv2d_4[0][0]
________________________________________________________________________________
__________________
add_12 (Add) (None, None, None, 1 0
block13_pool[0][0]
batch_normalization_4[0][0]
________________________________________________________________________________
__________________
block14_sepconv1 (SeparableConv (None, None, None, 1 1582080 add_12[0][0]
________________________________________________________________________________
__________________
block14_sepconv1_bn (BatchNorma (None, None, None, 1 6144
block14_sepconv1[0][0]
________________________________________________________________________________
__________________
block14_sepconv1_act (Activatio (None, None, None, 1 0
block14_sepconv1_bn[0][0]
________________________________________________________________________________
__________________
block14_sepconv2 (SeparableConv (None, None, None, 2 3159552
block14_sepconv1_act[0][0]
________________________________________________________________________________
__________________
block14_sepconv2_bn (BatchNorma (None, None, None, 2 8192
block14_sepconv2[0][0]
________________________________________________________________________________
__________________
block14_sepconv2_act (Activatio (None, None, None, 2 0
block14_sepconv2_bn[0][0]
________________________________________________________________________________
__________________
avg_pool (GlobalAveragePooling2 (None, 2048) 0
block14_sepconv2_act[0][0]
________________________________________________________________________________
__________________
predictions (Dense) (None, 1000) 2049000 avg_pool[0][0]
================================================================================
==================
Total params: 22,910,480
Trainable params: 22,855,952
Non-trainable params: 54,528
________________________________________________________________________________
__________________
Wow! What a huge model! Notice that it has almost 23 million parameters and many convolutional layers
stacked on top of one another. Let’s test the pre-trained model on the task of recognizing an image without
any additional training.
We will need to pre-process the image so that it has the correct format for the network. Luckily for us, the
keras.applications.xception module also contains a preprocess_input function that is exactly
what we need. Let’s load it:
TIP: we apply it to a copy because the function alters the argument itself and we don’t want
to alter the original version of the image.
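To make the tip concrete, here is a rough NumPy sketch of this kind of pre-processing. It assumes that the Xception preprocess_input maps pixel values from [0, 255] to [-1, 1] by dividing by 127.5 and subtracting 1; the helper scale_in_place is a hypothetical stand-in written for illustration, not the Keras function itself:

```python
import numpy as np

def scale_in_place(x):
    """Hypothetical stand-in: rescale pixels from [0, 255] to [-1, 1] in place."""
    x /= 127.5  # modifies the array passed in, values now in [0, 2]
    x -= 1.0    # shift to [-1, 1]
    return x

pixels = np.array([0.0, 127.5, 255.0], dtype="float32")

# Passing a copy protects the original array from the in-place update
scaled = scale_in_place(np.copy(pixels))

print(scaled.min(), scaled.max())  # the scaled values span -1 to 1
print(pixels.max())                # the original still holds 255
```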
img_scaled is scaled such that the minimum value is -1 and the maximum value is 1. We can pass it to the
model and generate a prediction:
What do our predictions look like? What is the output of the model? Let’s first look at the shape of the
preds object:
In [36]: preds.shape
preds is a vector with 1000 entries. This makes sense: they are the probabilities associated with each of the
1000 classes of objects in the Imagenet dataset.
To recap, our pre-trained Xception model takes an image as input and returns a softmax classification
output with 1000 classes. To interpret the prediction we will load the decode_predictions function:
11.4. TRANSFER LEARNING 481
and apply it to the preds vector to get the top three most likely labels for the photo.
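Picking the most likely labels amounts to sorting the probabilities. Here is a small sketch with a made-up 5-class probability vector (the real preds has 1000 columns, and decode_predictions additionally maps each index to a human-readable name):

```python
import numpy as np

# Made-up probabilities for a single image over 5 classes; shape (1, 5)
fake_preds = np.array([[0.05, 0.60, 0.10, 0.20, 0.05]])

# Indices of the 3 largest probabilities, most likely first
top3 = np.argsort(fake_preds[0])[::-1][:3]

for idx in top3:
    print(idx, fake_preds[0, idx])
```

Here the three indices printed are 1, 3 and 2, in order of decreasing probability.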
Not bad! Our model thinks it’s very likely that the photo is about volleyball, which is not far from beach volley
at all! How awesome is that?! Now let’s do even better: let’s repurpose the model so that it will work with
exactly the 3 categories of pictures we have. This is called Transfer Learning.
Transfer Learning
Transfer Learning consists of leveraging a pre-trained model to solve a similar task. In this case, we’re going
to use a network that was trained on Imagenet and re-purpose it to solve the sport image classification task.
By using a pre-trained network we don’t need to train it completely from scratch, a great advantage both in
terms of computing power required and in terms of amount of data needed.
We will be able to adapt a very large network like Xception, which has more than 20 million parameters, using
a laptop and a few thousand images. This is an incredibly powerful technique! Let’s see how it’s done.
First of all we’re going to set a value for the img_size = 299. This is the correct input size for Xception and
it corresponds to the size of images it was originally trained on.
We reload the Xception model, but this time we include a couple more arguments besides the weights.
First of all we specify include_top=False. This option says we don’t want the full model, but only the
convolutional part. If you remember what we’ve learned in Chapter 6 on CNNs, convolutional models are
composed of a cascade of convolutional layers that yield more and more specialized feature maps.
At some point the feature maps are flattened to an array which is fed to a Dense layer (or a series of Dense
layers) and finally to the output of the classification. Here we want to load all the layers of Xception up to the
layer before the last fully connected one (the top layer). The reason we want to do this is simple to explain. We
want to use the pre-trained model as a giant pre-processing layer that takes an image as input and returns a
few thousand high-level features (we will see these are called bottleneck features). We will then use these
high-level features to perform a standard classification with only the classes of images present in our dataset,
i.e. the 3 sports.
When loading the Xception model, we also specify the input_shape and an additional parameter
pooling='avg'. This last one specifies how we’d like to go from the order-4 tensor of the feature maps to
the order-2 tensor that goes in the fully connected top. pooling='avg' means we’re going to apply a Global
Average Pooling layer at the end.
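Global Average Pooling itself is a simple operation: each feature map is replaced by its mean value. Here is a NumPy sketch on random numbers, assuming that the final feature maps for a 299x299 input have a 10x10 spatial size and 2048 channels (the array below is random data, not real Xception output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the last convolutional output: (batch, height, width, channels)
feature_maps = rng.random((1, 10, 10, 2048))

# Average over the two spatial axes, keeping the batch and channel axes
pooled = feature_maps.mean(axis=(1, 2))

print(feature_maps.shape)  # (1, 10, 10, 2048)
print(pooled.shape)        # (1, 2048)
```

The result is an order-2 tensor with one row per image and one column per feature map, which is the shape a Dense layer expects.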
Now that we’ve loaded the base model, we’re going to complete the model with a couple of dense layers.
First we load the Sequential model and the Dense and Dropout layers:
First we pass the whole base_model as the first layer. This will take an image as input, process it with the
pre-trained Xception weights, and pass an array of numbers to the next layer. Then we’ll add a fully
connected layer with 256 nodes and a ReLU activation, then Dropout and finally the output layer with 3
nodes and a Softmax. Remember that we have only 3 classes:
• Beach volleyball
• Cross-country skiing
• Formula racing
that are mutually exclusive, i.e. a picture is only about one of the 3 sports, so our output needs to have 3
nodes with a Softmax:
In [43]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
xception (Model) (None, 2048) 20861480
_________________________________________________________________
dense_1 (Dense) (None, 256) 524544
_________________________________________________________________
dropout_1 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 3) 771
=================================================================
Total params: 21,386,795
Trainable params: 21,332,267
Non-trainable params: 54,528
_________________________________________________________________
Wow! This model still has 20+ million parameters. Now here’s the trick: we’re going to set most of them
to be frozen, i.e. backpropagation will not touch them at all! This is obtained by setting the .trainable
attribute of a layer to False. Since we’ve added the base_model as the first layer, we’ll only need to set that
flag:
In [45]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
xception (Model) (None, 2048) 20861480
_________________________________________________________________
dense_1 (Dense) (None, 256) 524544
_________________________________________________________________
dropout_1 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 3) 771
=================================================================
Total params: 21,386,795
Trainable params: 525,315
Non-trainable params: 20,861,480
_________________________________________________________________
Of the total number of parameters only about half a million are now trainable, the ones that belong to the 2 dense
layers we’ve added after the base_model. This seems a much more tractable model than the original one!
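The trainable count can be checked by hand: a Dense layer with n inputs and m nodes has n*m weights plus m biases. Here is a quick check using the layer sizes shown in the summary (the helper dense_params is ours, written for illustration):

```python
def dense_params(n_inputs, n_nodes):
    """Number of parameters of a fully connected layer: weights + biases."""
    return n_inputs * n_nodes + n_nodes

dense_1 = dense_params(2048, 256)  # 2048 features from the base model -> 256 nodes
dense_2 = dense_params(256, 3)     # 256 nodes -> 3 classes

print(dense_1)            # 524544
print(dense_2)            # 771
print(dense_1 + dense_2)  # 525315
```

These numbers match the Param # column for the two Dense layers and the Trainable params line of the summary.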
In [46]: model.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
Data augmentation
Since we don’t have much data available, we’ll use the trick of data augmentation learned in Chapter 10.
This consists of generating variations of an image with transformations such as zoom, rotation and shear. Let’s
load the ImageDataGenerator:
Let’s set a batch_size = 32. The choice of this number is somewhat arbitrary, but since we have 3 classes
only, a batch of 32 images will contain on average about 10 images for each class. This seems a good number
of examples to learn something from:
In [48]: batch_size = 32
Now let’s create an instance of ImageDataGenerator that applies transformations to the training set. It will
do the following operations:
11.5. DATA AUGMENTATION 485
The train_datagen object contains the instructions for the transformation we want to apply, now we need
to tell it where to source the images from. We’ll use the very convenient .flow_from_directory method,
specifying the path of the training images, the target size and the batch size:
As you can see, it found 2100 images with 3 classes. This is because the images are organized in 3 subfolders
of the train path:
train_path
|- Beach volleyball
|- img1
|- img2
|- ...
|- Cross-country skiing
|- img1
|- img2
|- ...
|- Formula racing
|- img1
|- img2
|- ...
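The class indices follow from the folder names. To our understanding, .flow_from_directory sorts the subfolder names alphanumerically and numbers them starting from 0. The sketch below reproduces that mapping with the standard library only, using a temporary directory as a stand-in for train_path:

```python
import os
import tempfile

class_names = ["Formula racing", "Beach volleyball", "Cross-country skiing"]

with tempfile.TemporaryDirectory() as tmp_train_path:
    # Create one (empty) subfolder per class
    for name in class_names:
        os.makedirs(os.path.join(tmp_train_path, name))

    # Sort the subfolder names and assign consecutive integer labels
    subfolders = sorted(
        d for d in os.listdir(tmp_train_path)
        if os.path.isdir(os.path.join(tmp_train_path, d))
    )
    class_indices = {name: i for i, name in enumerate(subfolders)}

print(class_indices)
# {'Beach volleyball': 0, 'Cross-country skiing': 1, 'Formula racing': 2}
```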
Now let’s also create a generator for the test images. Note that the test_path must have the same
subfolders in order to be compatible with the training set. We will not apply any transformation to test
images and we will flow them as they are. The reason for this is to have reproducible test results:
We are now ready to train the model with our generator. Since we are generating images with a generator,
the concept of Epoch is no longer well defined. For this reason we will specify how many steps of update an
epoch includes.
Let’s see: we have 2100 images in the training folder and we feed batches of 32 images. This means that with
approximately 65 steps we have sent roughly as many images as there are in the training set. Let’s do this: we
train the model for 1 epoch with 65 update steps:
In [53]: model.fit_generator(
train_generator,
steps_per_epoch=65,
epochs=1)
Epoch 1/1
65/65 [==============================] - 48s 733ms/step - loss: 0.5263 - acc:
0.7945
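The number of steps used above comes from simple arithmetic on the figures quoted in the text (2100 training images, batches of 32):

```python
import math

n_train_images = 2100
batch_size = 32

exact_steps = n_train_images / batch_size             # 65.625
steps_floor = n_train_images // batch_size            # 65: slightly less than one full pass
steps_ceil = math.ceil(n_train_images / batch_size)   # 66: at least as many images as the set

print(exact_steps, steps_floor, steps_ceil)
print(steps_floor * batch_size)  # 2080 images sent in 65 steps
```

With 65 steps the model sees 2080 images, which is roughly one pass over the training set.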
It took a little bit of time but it looks really good. In a single epoch we have re-purposed a huge
convolutional Neural Network that can now perform image recognition on new classes of images, never
before encountered. Cool!
Let’s assess the accuracy of our model with the .evaluate_generator method on the test set:
In [54]: model.evaluate_generator(test_generator)
Not bad at all, considering that it only trained for 1 epoch. Also, let’s check the prediction on our original
image of the volleyball player:
In [55]: model.predict_classes(img_tensor)
Out[55]: array([2])
The correct class for this image is class 0, Beach volleyball. We can check that by looking at the class_indices
defined in the train_generator:
In [56]: train_generator.class_indices
Awesome! We’ve performed transfer learning for the first time and we’ve reused a giant pre-trained model
for our goal. This was really good, but it did take a bit of a long time to train. Can we speed that up? The
answer is yes! Let’s introduce bottleneck features.
Bottleneck features
Let’s stop for a second and think back about what we’ve just done. First we’ve loaded a large convolutional
network, whose weights have been pre-trained on the Imagenet problem.
Then we’ve used the convolutional part of this network as the first layer of a new network, followed by fully
connected layers. Since its weights are frozen, the convolutional part of the network is basically acting as a
feature extractor, that extracts a vector of features from the given input image.
These features are then fed to the fully connected layers, which perform the classification. The training process
is still slow because all of the convolutions are applied to the image at each training step.
Bottleneck features
On the other hand, since the weights are frozen, we could take a different approach. We could pre-process
all the images once: send them through the convolutional part of the network and extract a feature vector
for each of them. We could use this dataset of feature vectors to train a fully connected network for the
classification.
Another way to look at this process is to say that we are using the pre-trained network as a feature extraction
pipeline, not dissimilar from traditional pipelines involving Wavelets, Histograms and so on. The difference
here is that the bottleneck features are obtained through a network that’s been trained on image
classification on millions of images and are therefore optimized for that task.
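The idea can be sketched with plain NumPy, independently of Keras. In this toy example the frozen convolutional part is replaced by a fixed random projection followed by a ReLU. This is an assumption made purely for illustration, and the names frozen_base and W_frozen as well as the toy shapes are hypothetical. Because the weights never change, the features can be computed once and reused at every epoch:

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy stand-ins: 20 "images" with 12 raw values each (hypothetical sizes)
images = rng.rand(20, 12)

# Frozen weights of the pretend feature extractor: never updated
W_frozen = rng.randn(12, 5)

def frozen_base(x):
    """Stand-in for the frozen convolutional part: projection + ReLU."""
    return np.maximum(x @ W_frozen, 0)

# Slow approach: the same features are recomputed at every epoch
for epoch in range(3):
    features_each_epoch = frozen_base(images)

# Bottleneck approach: compute the features once and cache them
bottlenecks = frozen_base(images)

print(bottlenecks.shape)  # one row per image, one column per feature
```

Since the extractor is frozen, the cached features are identical to the ones recomputed at each epoch, so nothing is lost by computing them only once.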
Let’s start by wrapping the ImageDataGenerator and the .flow_from_directory method in a single
function. This function takes the input_path of the images and a couple of other parameters and returns a
generator ready to receive all the images in the input_path. We’ll feed this generator to the
base_model.predict_generator method, which will return the values of the last layer before the output
layer of the full model (remember we loaded the model with the parameter include_top=False). Also
notice that we will set shuffle=False so that the image order is the same as that contained in
generator.classes and we can later use it.
preprocessing_function=preprocess_input)
return generator
Let’s use this function to generate the bottlenecks for the training images:
Let’s also recover the training labels from the same generator.
Depending on your system, generating bottlenecks may take a long time, and having one or more GPUs
available will surely speed up the process. Since they are quite small, the code repository already contains a
saved version of these bottlenecks.
Now that we have created the bottlenecks let’s check what they look like. Let’s start from the shape of the
tensors:
In [60]: bottlenecks_train.shape
Train bottlenecks are a matrix with as many rows as there are images in the training set and with a number
of columns equal to the number of outputs of the base model’s last layer, i.e. the 2048 features produced by
the GlobalAveragePooling layer of Xception. Let’s plot a few of them to see what they look like. Let’s get a bunch
of images and labels from the train generator:
And let’s create a list with the label names for each of the images in the batch. We will do this in two steps.
First let’s create a label map with the label names corresponding to the class indices:
Then let’s use the label_map to convert the labels into the corresponding label names:
label_names[:10]
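The two steps can be sketched in plain Python. The class names and the labels below are made up for illustration; only the format of class_indices (a dictionary from class name to integer index) follows what the generator provides:

```python
# Hypothetical mapping from class name to index, in the format
# returned by a generator's class_indices attribute
class_indices = {'class_a': 0, 'class_b': 1, 'class_c': 2}

# Step 1: invert the mapping so that indices point to names
label_map = {index: name for name, index in class_indices.items()}

# Step 2: convert a (made-up) list of integer labels to names
labels = [0, 0, 2, 1, 2]
label_names = [label_map[label] for label in labels]

print(label_names[:10])
```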
Great. Now let’s generate bottleneck features for the images in the batch:
Will bottlenecks of images with the same label be similar? Let’s take a look at them on a plot and see if
there’s any pattern we can recognize. We will create a figure with three plots, one for each of the three
classes, and plot the values of the bottlenecks:
plt.xlim(0, 2050)
plt.tight_layout()
Hmm, although the 3 plots look somewhat different, it’s hard to tell if there’s anything interesting. Let’s
zoom in to the range 980-1010:
plt.xlim(980, 1010)
plt.tight_layout()
Here it’s clearer that several of the Beach Volleyball bottlenecks have a high spike at features 984, 990 and
1001, while the other two sports do not have those peaks. Bottlenecks are like fingerprints of an image:
features extracted through convolutions that encode the content of the image.
Now that we understand a little more about what bottleneck features are, we can save them to disk once and for all.
We can then experiment with fully connected architectures that take the bottlenecks as input data. Let’s
save the bottlenecks and the labels as numpy arrays. We will use the gzip library for efficiency.
fname_ = '../data/sports/bottlenecks_test.npy.gz'
X_test = np.load(gzip.open(fname_, 'rb'))
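A save and load round trip through a gzip-compressed file can be sketched as follows. The array and the file name are made up, and a temporary folder is used instead of the ../data/sports/ path:

```python
import gzip
import os
import tempfile

import numpy as np

# Made-up stand-in for the bottleneck matrix
bottlenecks = np.arange(12, dtype='float32').reshape(3, 4)

# Hypothetical file location in a temporary folder
fname = os.path.join(tempfile.mkdtemp(), 'bottlenecks_demo.npy.gz')

# Save: np.save writes the array into the gzip-compressed file
with gzip.open(fname, 'wb') as f:
    np.save(f, bottlenecks)

# Load: np.load reads it back through the decompressing file object
with gzip.open(fname, 'rb') as f:
    restored = np.load(f)

print(restored.shape)
```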
We can check the shape of the train data, and verify that it’s a matrix:
In [75]: X_train.shape
fname_ = '../data/sports/labels_test.npy'
y_test = np.load(open(fname_, 'rb'))
In [77]: y_train.shape
Out[77]: (2100,)
It looks like we have to one-hot encode the labels, so let’s do that using the
keras.utils.to_categorical function that we have used many times in the book:
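To see what the encoded labels look like, here is a NumPy-only sketch of one-hot encoding. It is not the Keras implementation, just an equivalent construction on made-up labels:

```python
import numpy as np

# Made-up integer labels for a 3-class problem
y = np.array([0, 2, 1, 2])
n_classes = 3

# Row i of the identity matrix is the one-hot vector of class i,
# so indexing the identity with the labels encodes all of them
y_cat = np.eye(n_classes)[y]

print(y_cat)
```

Each row contains a single 1 in the column of its class, so taking the argmax along the columns recovers the original labels.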
Now we are finally ready to train a fully connected network on the bottleneck features. Let’s import the
Sequential model and Dense and Dropout layers.
We’ll define the model using the Sequential API. We will build a very simple model: a Dropout layer as
input, which receives the bottleneck features, a small Dense hidden layer, a second Dropout layer and a Dense
output layer. The output layer must have 3 nodes because there are 3 classes and it must have a softmax
activation function because the classes are mutually exclusive. Feel free to change the model definition to
something else if you’d like, keeping in mind that we only have a few thousand training data points so giving
the model too much freedom may lead to overfitting.
Notice that instead of adding the layers like we did in other parts of the book we can pass a list of layers to
the model constructor:
Let’s compile the model with our preferred optimizer using the categorical_crossentropy loss:
In [82]: fc_model.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
In [83]: fc_model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dropout_2 (Dropout) (None, 2048) 0
_________________________________________________________________
dense_3 (Dense) (None, 16) 32784
_________________________________________________________________
dropout_3 (Dropout) (None, 16) 0
_________________________________________________________________
dense_4 (Dense) (None, 3) 51
=================================================================
Total params: 32,835
Trainable params: 32,835
Non-trainable params: 0
_________________________________________________________________
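The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input/output pair plus one bias per output node, while Dropout layers have no parameters. The helper name dense_params is ours:

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output node of a Dense layer."""
    return n_inputs * n_outputs + n_outputs

hidden = dense_params(2048, 16)   # dense_3 in the summary
output = dense_params(16, 3)      # dense_4 in the summary
total = hidden + output           # Dropout layers add no parameters

print(hidden, output, total)
```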
This simple model has a little under 33,000 parameters, so it will be very fast to train. Let’s train it for a few
epochs:
In [85]: plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Accuracy')
plt.legend(['train', 'test'])
plt.xlabel('Epochs');
This model trained really fast and the accuracy on the test set is higher than the accuracy on the training set,
which is a really good sign. This is the value of bottleneck features: we can use them as proxies for our
original images and train a simple model on them, even on a laptop.
Image search
Imagine the following situation: we have a dataset of images and we would like to find the images most
similar to a specific one. For example, we have a collection of pictures on our laptop and we’d like to find all
the pictures of a friend.
Solving this problem requires the definition of a distance measure between images, so that, given an image,
we can look for images that are close to it. This is hard to do using the raw pixels as features because, as we
have seen many times, images with similar content may look completely different on every single pixel.
On the other hand, since bottleneck features capture high level features from the images, we can exploit them to
locate similar images. We will do this using the DistanceMetric class from Scikit-Learn. Let’s start by
importing it:
and let’s load an instance of the Euclidean metric, which is the usual vector difference distance:
We used it to define the Mean Squared Error in Chapter 3. It is obtained as the square root of the sum of the
squares of the differences along each coordinate:
$$ d(x', x) = \sqrt{(x' - x)^2} = \sqrt{\sum_i (x'_i - x_i)^2} \qquad (11.1) $$
it is calculated as:
Out[90]: 3.1622776601683795
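As a library-independent check of the formula, here is a minimal sketch in plain Python. The two points are chosen only so that the distance equals the square root of 10; they are not necessarily the ones used to produce the output above:

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Illustrative points: the differences are 1 and 3, so d = sqrt(1 + 9)
d = euclidean([0, 0], [1, 3])
print(d)
```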
Now that we have defined the euclidean distance metric we can calculate the pairwise distances between all
the bottlenecks and then use that for our image search engine.
Let’s take a few images as example. Let’s get the training images from the training set using the bottleneck
data generator:
And let’s display a few of them. Note that since our images have been normalized during pre-processing,
their pixels are now values between -1 and 1. plt.imshow requires floating point images to have values
between 0 and 1 so we will need to add 1 and divide by 2 each pre-processed image in order to display it
correctly. Let’s define a helper function that does that.
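The rescaling itself is a one-liner. The sketch below shows it on a made-up array; the function name rescale_for_display is hypothetical, and a helper like imshow_scaled would presumably pass the rescaled array to plt.imshow:

```python
import numpy as np

def rescale_for_display(img):
    """Map pixel values from [-1, 1] to [0, 1]."""
    return (img + 1) / 2

# Made-up 2x2 RGB "image" with values in [-1, 1]
img = np.array([[[-1.0, 0.0, 1.0], [0.5, -0.5, 0.0]],
                [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]])

scaled = rescale_for_display(img)
print(scaled.min(), scaled.max())
```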
In [93]: plt.subplot(1, 3, 1)
imshow_scaled(images[0])
plt.subplot(1, 3, 2)
imshow_scaled(images[1])
plt.subplot(1, 3, 3)
imshow_scaled(images[900])
The first two are images of beach volleyball while the third one is of skiing.
The distance between the first and the second image is:
Out[94]: 6.608993
while the distance between the first and the third is:
Out[95]: 10.53551
As you can see, the first and the third image are further apart, which makes sense since the last one is very
different from the previous two. We will proceed now to calculate all the distances between all of the images
in the training set using their bottleneck features.
11.7. TRAIN A FULLY CONNECTED ON BOTTLENECKS 499
We will do this calculation using the .pairwise method of the euclidean distance object we have created
previously. We could also do a double for loop over bottlenecks and calculate the distances manually,
however, this is more efficient:
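To make the comparison concrete, here is a NumPy sketch of both approaches on made-up data. The vectorized version relies on broadcasting; it illustrates the idea and is not how Scikit-Learn implements .pairwise internally:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(6, 4)   # 6 made-up "bottlenecks" with 4 features each

# Manual version: double for loop over all pairs
n = len(X)
dist_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist_loop[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))

# Vectorized version: broadcasting builds all differences at once
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
dist_vec = np.sqrt((diff ** 2).sum(axis=-1))

print(dist_vec.shape)
```

Both versions produce the same square matrix, with zeros on the diagonal and symmetric entries.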
In [97]: bn_dist.shape
Since we have 2100 images in the training set, the pairwise matrix is a square symmetric matrix that contains
all pairwise distances. Let’s visualize it to understand it a little bit better:
Notice a couple of things about this matrix:
- The darker a pixel, the closer the two corresponding images are.
- The matrix is symmetric with respect to the diagonal, which makes sense since the distance between image 1
  and image 2 is the same as the distance between image 2 and image 1.
- The diagonal is the darkest of all, which also makes sense, since an image is identical to itself and therefore
  has a distance of zero from itself.
- Three blocks are clearly distinguishable along the diagonal, although a little fuzzy. This makes a lot of
  sense, because images are sorted by class and generally speaking all the images in a class are expected to be
  more similar to one another than to images in other classes.
Notice that we have obtained these distances using the bottlenecks from the pre-trained model, no
additional training needed. Awesome!
Let’s put this to use in our search engine! Let’s take an image:
In [99]: imshow_scaled(images[0])
Let’s look for the top 9 closest images. All we have to do is select the row in the bn_dist matrix
corresponding to the index of the image selected, which is zero in this case. We will wrap this with a Pandas
Series so that we can use the indices later:
In [101]: dist_from_sel.sort_values().head(9)
Out[101]:
0
0 0.000000
431 5.348794
15 5.404700
32 5.787818
636 5.809691
648 5.847935
429 5.848355
48 5.952395
656 5.967599
Let’s display these. We will display 9 images, in a grid of 3x3. Let’s make this configurable by defining a few
parameters:
In [102]: n_rows = 3
n_cols = 3
n_images = n_rows * n_cols
Now let’s take the top 9 images with the shortest distance from our original image. This can be done by
sorting the values in dist_from_sel and then using the .head command that retrieves the first elements
in the series:
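The same retrieval can be sketched with NumPy alone, using argsort in place of the Pandas sort_values and head calls. The distance matrix below is made up and much smaller than bn_dist:

```python
import numpy as np

# Made-up symmetric distance matrix for 5 items
bn_dist_demo = np.array([[0.0, 2.0, 5.0, 1.0, 4.0],
                         [2.0, 0.0, 3.0, 2.5, 6.0],
                         [5.0, 3.0, 0.0, 4.5, 1.5],
                         [1.0, 2.5, 4.5, 0.0, 3.5],
                         [4.0, 6.0, 1.5, 3.5, 0.0]])

selected = 0   # index of the query item
k = 3          # how many neighbours to retrieve

# Row `selected` holds the distances from the query to every item
row = bn_dist_demo[selected]

# argsort returns indices ordered from the smallest distance up
closest = np.argsort(row)[:k]
furthest = np.argsort(row)[::-1][:k]

print(closest, furthest)
```

The first retrieved index is always the query itself, since its distance from itself is zero. Reversing the order gives the items furthest from the query.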
Now let’s loop over the index of the retrieved images and plot the images:
Nice! The first image displayed is the one we had selected, and as you can see the other ones are all very
similar! Let’s try again with another image. We will define a function to make things easy:
plt.figure(figsize=(10, 10))
i = 1
for idx in retrieved.index:
plt.subplot(n_rows, n_cols, i)
imshow_scaled(images[idx])
if i == 1:
plt.title('Selected image')
else:
plt.title("Dist: {:0.4f}".format(retrieved[idx]))
i += 1
plt.tight_layout()
In [106]: image_search(900)
In [107]: image_search(1600)
In [108]: image_search(100)
Notice that we can also sort the distances in reverse order and find the images which are the furthest away
from a selected image. E.g., for this image:
In [109]: imshow_scaled(images[0])
Which clearly have very little in common with the above image!
In conclusion, Keras offers several pre-trained models for images that can be used for a variety of tasks,
including image recognition, transfer learning and image similarity search.
Exercises
Exercise 1
Exercise 2
Choose another pre-trained model from the ones provided at https://keras.io/applications/ and use it to
predict the same image. Do the predictions match?
Exercise 3
The Keras documentation shows how to fine-tune the Inception V3 model by unfreezing some of the
convolutional layers. Try reproducing the results of the documentation on our dataset using the Xception
model and unfreezing some of the top convolutional layers.
12 Pretrained Embeddings for Text
In the last part of Chapter 8 we introduced the concept of Embeddings. These are dense vectors that
represent words and are often used as a starting point when approaching NLP problems like language
translation or sentiment analysis. The word vectors we introduced in Chapter 8 were trained together with
the rest of the model and therefore were specific to the particular problem we were trying to solve.
For example, when we trained our Recurrent Model for the movie review sentiment analysis task, the
embedding layer was the first layer in a Sequential model, followed by a recurrent layer and a classification
head for the sentiment prediction. The weights of the word vectors in the embedding layer were learned
together with the weights of the recurrent layer and the classification layer. The single task of predicting the
sentiment of a movie review would provide a value for the loss which would then propagate back through
the network to adjust both the recurrent and the embedding weights.
This approach has two limitations:
1) since the embedding layer has many weights (e.g., for a vocabulary of 10k words, each embedded
with 100 numbers, we have 1M weights), we need a lot of data for this model to generalize well and
avoid overfitting
2) since the embeddings are trained on the sentiment analysis task, they will work well on that task but
will not necessarily learn general properties of the semantic space. In other words, we will not be able
to use those same embeddings for a completely different NLP task, like translation.
To overcome these two limitations, researchers have proposed a different approach to building more general
embeddings. These approaches try to capture the meaning of a word in a language and build generic
embeddings where words with similar meanings are represented by similar vectors. Although this may
seem crazy at first, it works well in practice.
512 CHAPTER 12. PRETRAINED EMBEDDINGS FOR TEXT
In this chapter, we will see a couple of different famous embeddings, and we will use them to do fun
operations with text.
“Unsupervised”-“supervised learning”
How can we train a generic embedding layer that encapsulates the meaning of words? We’ll have to resort to
a trick that is often used when training very large Neural Networks. This is often referred to as
“Unsupervised Learning” although, as we shall see, it is actually a special case of Supervised Learning where
the labels are not generated by humans.
Let’s think back to the sequence generation example we introduced in Chapter 8. There we built a network
that learned to predict the most likely letter after a sequence of 3 letters, using a corpus of English baby
names. The same approach can be used to build a model that is trained to predict the most likely word after
a sequence of words. For example, if trained using a corpus of songs from John Lennon, the model should
be able to learn that “heaven” should be the most likely word after the words “imagine there’s no”. This task
is called language modeling because what the model learns is the structure of a language.
The output of this model is a Softmax over the vocabulary of the language, while the input is a sequence of
words that will be encoded as vectors by the input embedding layer of the language model. Since the model
is essentially solving a forecasting task (predict the most likely words after a sequence of words), we are still
in the domain of Supervised Learning. The labels, however, are contained in the corpus of text itself.
Consider for example this excerpt from the song Imagine by John Lennon:
...
"""
12.2. GLOVE EMBEDDINGS 513
From this text we can build the following pairs of inputs and labels:
Sequence            | Label
imagine there’s no  | heaven
there’s no heaven   | it’s
no heaven it’s      | easy
heaven it’s easy    | if
it’s easy if        | you
easy if you         | try
...                 | ...
Both the inputs and the labels are obtained from the same corpus by simply sliding a window of fixed length
and asking the model to predict the word coming immediately after the window.
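The sliding window can be sketched in a few lines of plain Python. The function name window_pairs is ours, and the text is the same excerpt used in the table above:

```python
def window_pairs(words, window=3):
    """Slide a fixed-length window and pair it with the next word."""
    pairs = []
    for i in range(len(words) - window):
        sequence = words[i:i + window]
        label = words[i + window]
        pairs.append((sequence, label))
    return pairs

text = "imagine there's no heaven it's easy if you try"
pairs = window_pairs(text.split(), window=3)

for sequence, label in pairs:
    print(' '.join(sequence), '->', label)
```

No human labeling is involved: every label is simply the word that follows the window in the corpus.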
This generic forecasting approach is amazingly powerful! We are no longer limited by our ability to label
data. We can use any text, literally we could use the whole of Wikipedia, and train a very generic language
model that attempts to predict the next word in a sequence. The embeddings of this model must be more
generic than the ones trained on the sentiment problem!
Starting with this intuition, that you can obtain labels from the text itself, researchers have invented several
approaches to train generic embeddings. We will mention here a few of the most famous and show you
where to find them and how to use them.
Let’s start with a very common set of embeddings called GloVe, which stands for Global Vectors for Word
Representation.
GloVe embeddings
In [2]: with open('common.py') as fin:
exec(fin.read())
In the data/embeddings folder we provide a download script that downloads and extracts GloVe
embeddings from: https://nlp.stanford.edu/projects/glove/. Here is its content:
Go ahead and run the script to retrieve the glove.6B embeddings. Let’s take a look at them. First let’s
define a path variable:
In [7]: line
Out[7]: 'the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.65
As you can see the line contains the word the as first element, followed by 50 space-separated floating point
numbers, which form the word vector. Let’s define a parse function that parses the line and returns the word
and the vector as a numpy array.
We should take care of removing the trailing \n character at the end, then split the line at spaces, which will
return a list. Finally we’ll take the first element in the list as word and the remaining values as the vector.
Here’s the parse function:
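A minimal sketch of such a function, following the three steps just described, could look like this. The name parse_line is hypothetical, and the example line is shortened to 3 made-up numbers instead of 50:

```python
import numpy as np

def parse_line(line):
    """Split a GloVe text line into (word, vector)."""
    values = line.rstrip('\n').split(' ')
    word = values[0]
    vector = np.array(values[1:], dtype='float32')
    return word, vector

# Shortened line with only 3 numbers instead of 50
word, vector = parse_line('the 0.418 0.24968 -0.41242\n')
print(word, vector)
```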
Now that we have defined a parse function, let’s use it to load the word embeddings. We will use a Python
dictionary for the embeddings and one for the word index. Let’s create two empty dictionaries:
In [9]: embeddings = {}
word_index = {}
Let’s also create an empty list for the inverted index that will map numbers to words:
In [10]: word_inverted_index = []
Now we can loop over the lines in the file, parse each line and store it in the embeddings and word index
dictionary. We will enumerate the lines as we proceed with the loop so that we can also retrieve their
numeric index.
Let’s do it:
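The loop can be sketched on a tiny in-memory stand-in for the embeddings file. The three lines of fake data and the _demo variable names are assumptions for illustration; the real file has 400000 lines with 50 numbers each:

```python
import io
import numpy as np

# Made-up stand-in for the embeddings file: 3 words, 2 numbers each
fake_file = io.StringIO("the 0.1 0.2\n, 0.3 0.4\ngood 0.5 0.6\n")

embeddings_demo = {}
word_index_demo = {}
word_inverted_index_demo = []

for i, line in enumerate(fake_file):
    values = line.rstrip('\n').split(' ')
    word, vector = values[0], np.array(values[1:], dtype='float32')
    embeddings_demo[word] = vector          # word -> vector
    word_index_demo[word] = i               # word -> line number
    word_inverted_index_demo.append(word)   # line number -> word

print(word_index_demo['good'], word_inverted_index_demo[2])
```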
Let’s check a few entries in the indexes we built. For example, using word_index, we can retrieve the line
number at which the word good appears:
In [12]: word_index['good']
Out[12]: 219
Using the word_inverted_index we can do the reverse i.e. given a line number, find the corresponding
word:
In [13]: word_inverted_index[219]
Out[13]: 'good'
The embeddings dictionary contains the actual word vectors, so for example, the word vector
corresponding to the word good is the following:
In [14]: embeddings['good']
How many components does this vector have? Let’s check its length:
Out[15]: 50
In [16]: plt.plot(embeddings['good']);
It doesn’t tell us much on its own, but we can compare the word vectors of a few words and see what they
look like. Let’s plot a few numbers, like two, three and four, and a few animals, like cat, dog and rabbit. As you
will see, the numbers look very similar to one another, and the animals are distinctly different from the
numbers:
In [17]: plt.subplot(211)
plt.plot(embeddings['two'])
plt.plot(embeddings['three'])
plt.plot(embeddings['four'])
plt.title("A few numbers")
plt.ylim(-2, 5)
plt.subplot(212)
plt.plot(embeddings['cat'])
plt.plot(embeddings['dog'])
plt.plot(embeddings['rabbit'])
plt.title("A few animals")
plt.ylim(-2, 5)
plt.tight_layout()
This is reminiscent of Bottleneck features we’ve encountered in Chapter 11. Each word has been encoded
with a vector with 50 numbers and words that carry similar semantic value will be encoded with similar
vectors. As we shall see, GloVe vectors are built by looking at co-occurrence of words, so the above plots tell
us that numbers like two, three and four are often found nearby similar words, which makes sense. I can
say: “I ran two miles” or “I ran three miles” and both sentences make sense, while I cannot say “I ran cat
miles”. Two and three can be found in similar contexts and therefore are encoded as similar vectors.
Let’s see how many words are contained in our GloVe embeddings by checking the len of the embeddings
variable:
Out[18]: 400000
There are 400000 words in the embeddings, which will cover most of our needs.
TIP: If you are curious to know more about how GloVe embeddings are built, we
encourage you to read the original paper or take a look at the original source code.
Next we are going to arrange our pre-trained embeddings as a giant matrix of shape (vocabulary_size,
embedding_size). We will do this in 2 steps. First let’s create a zero matrix with the correct shape:
Then let’s iterate over the items in our word_index dictionary and let’s assign each vector in the
embeddings dictionary to a line in the matrix. For example, we know from above that the word good has
index 219. This means we will assign its vector to the row in the matrix corresponding to index 219 (i.e. the
220-th row). Let’s do it:
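Both steps can be sketched on a miniature vocabulary. The words, vectors and _demo names below are made up; the real matrix has shape (400000, 50):

```python
import numpy as np

# Made-up miniature embeddings: 3 words, 2 dimensions
embeddings_demo = {'the': np.array([0.1, 0.2]),
                   'good': np.array([0.5, 0.6]),
                   'cat': np.array([0.7, 0.8])}
word_index_demo = {'the': 0, 'good': 1, 'cat': 2}

vocabulary_size = len(embeddings_demo)
embedding_size = len(embeddings_demo['the'])

# Step 1: zero matrix of shape (vocabulary_size, embedding_size)
embedding_weights = np.zeros((vocabulary_size, embedding_size))

# Step 2: copy each word vector into the row given by its index
for word, index in word_index_demo.items():
    embedding_weights[index, :] = embeddings_demo[word]

print(embedding_weights)
```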
Now that we have our pre-trained weights arranged in a matrix, we can create an Embedding layer in Keras,
passing the weights as initialization. We will specify the input_dim to be equal to the vocabulary_size
and the output_dim to be equal to the embedding size. Then we’ll use the parameter weights to pass a
list of weights, in this case just the embedding weights. Finally we’ll set both the mask_zero and the
trainable flags to False.
Let’s stop for a second and make sure we understand the last 2 flags. Here’s the documentation of
mask_zero:
In our case, we have used the index 0 in the vocabulary for the word the, as we can check in the
word_inverted_index:
In [23]: word_inverted_index[0]
Out[23]: 'the'
so we have to set mask_zero to False. Had we started enumerating the word vectors from 1, we could have
reserved the 0 value for padding, which, as the doc says, is useful when using recurrent layers.
The trainable=False flag simply tells Keras that this layer is not trainable, i.e. its weights cannot be
changed during training. We used this earlier when using pre-trained models for images in Chapter 11.
Notice that simply passing the matrix to the Embedding constructor is not enough. We need to put this
layer in a model in order for Keras to actually create a Tensorflow graph with it. Let’s do it:
Now that we have created a model we can check that the embedding layer does use the pre-trained weights.
Let’s check the embeddings for the word cat. Here are the original values we loaded from the file:
In [25]: embeddings['cat']
Now let’s retrieve the index of the word cat and let’s pass it to the model.predict method. First we retrieve
the index of the word cat using the word_index:
In [27]: cat_index
Out[27]: 5450
Now that we have the index, we run model.predict on a double nested list containing the single index of
the word cat:
12.4. GENSIM 521
TIP: we need to use a list here because the predict method expects as input an integer
matrix of size (batch, input_length), as explained in the documentation and here we have a
batch of 1 point with a sequence of 1 word.
In [28]: model.predict([[cat_index]])
As you can see the method returns exactly the same values, so we have successfully initialized a Keras model
with a pre-trained embedding.
Gensim
Gensim is a topic modelling library in Python that contains a lot of functions related to extracting meaning
and manipulating text. Let’s import it and have some fun with word embeddings:
In order to load GloVe embeddings using Gensim we need to convert them into the appropriate format.
Luckily for us Gensim has a function for that. We just need to import the glove2word2vec script:
and then run it. We first set input and output paths:
Now that we have loaded the vectors into a Gensim model, we have access to a lot of functionality. For
example, we can quickly find the words most similar to a given word.
Here’s how we look for the 5 closest words to the word good:
The .most_similar method allows for both a list of positive and negative words. Feel free to play with
the list of words to get a feel for how they affect the output. The closest words to good are words that can
appear in the same context as good, so it’s quite obvious that we should get similar adjectives like better or
adverbs like really and always.
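Gensim ranks words by the cosine similarity of their vectors. The ranking can be sketched with NumPy on made-up 2-dimensional vectors; the numbers and the function name most_similar_demo are purely illustrative and do not come from GloVe:

```python
import numpy as np

# Made-up 2-d vectors, for illustration only
vectors = {'good': np.array([1.0, 0.9]),
           'better': np.array([0.9, 1.0]),
           'really': np.array([1.0, 0.5]),
           'table': np.array([-1.0, 0.2])}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar_demo(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar_demo('good'))
```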
Word Analogies
Since word vectors are vectors, we can do any vector operation with them, including addition, subtraction
and dot products. For example we can perform operations between words like:
where the vector result is a perfectly valid vector in the embedding space. Using the .most_similar
method, we can look for the 3 vectors closest to result. Can you guess which vector will be the closest?
If you guessed queen, which is the feminine counterpart of king, you guessed right. Let’s see it in action:
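As a library-independent illustration, the same arithmetic can be sketched with made-up 2-dimensional vectors in which one axis loosely stands for royalty and the other for gender. The numbers are hand-picked so that the analogy works exactly, which real embeddings only do approximately:

```python
import numpy as np

# Made-up 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {'king': np.array([1.0, 1.0]),
           'queen': np.array([1.0, -1.0]),
           'man': np.array([0.0, 1.0]),
           'woman': np.array([0.0, -1.0]),
           'apple': np.array([-1.0, 0.1])}

result = vectors['king'] - vectors['man'] + vectors['woman']

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank the words not used in the query by similarity to `result`
candidates = [w for w in vectors if w not in ('king', 'man', 'woman')]
closest = max(candidates, key=lambda w: cosine(result, vectors[w]))

print(result, closest)
```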
Another way to look at this is to say that the vector queen - king is similar to the vector woman - man.
This is often represented with this famous picture:
In this figure we imagine an embedding space that has only 2 axes (instead of 50 or 300) and represent the
words as points in the embedding space. The arrows represent the vector differences between the words.
Since this chart can be useful to understand how the model is representing the semantic space, it’s legitimate
to ask if we can visualize all of the GloVe words in a similar chart using a dimensionality reduction technique.
The answer to this question is yes, and we can actually leverage Tensorboard for this. We’ll see how in the
next section.
Visualization
Tensorboard also contains a projector that allows us to explore word embeddings visually. Let’s save our
word embeddings in a tensorflow model and let’s visualize them in Tensorboard. First we need to create an
output folder. We’ll use the /tmp/ztdl_models/embeddings/ folder for output. Let’s first create it. We
will need the os module:
In [38]: import os
Then let’s define a path variable that we’ll use later too:
Let’s also load the rmtree function from shutil so that we can delete the directory if it already exists:
In [42]: os.makedirs(model_dir)
For the purposes of this visualization we will limit our Embedding layer to the 4000 most frequent
words in the GloVe set. Let’s set a variable called n_viz to 4000:
Now we create a new embedding layer, with only 4000 x 50 weights. Notice that we still pass the
mask_zero=False parameter since our first vector, corresponding to the index 0, is the word the:
Let’s stick this layer into a Sequential model so that the Tensorflow graph gets populated:
In [47]: word_embeddings
As you can see it is a tf.Variable. As explained in the documentation, a variable maintains state in the graph
across calls to run(). Variables are used in Tensorflow precisely to contain the values of the weights of a network.
We now would like to use tensorboard to visualize this data. The most recent documentation on how to do
this is a bit cryptic, so we refer to an earlier version of the documentation which is more helpful.
We need to accomplish three things, which are independent:
- save the embedding tensor
- save a file with the words (metadata)
- save a configuration file that binds the metadata to the embedding tensor
Let’s start by saving the model. The code that follows is specific to Tensorflow. First we load the Keras
backend and tensorflow itself:
Out[51]: '/tmp/ztdl_models/embeddings/model.ckpt-1'
This operation creates a few files in the model_dir folder, as you can see with the os.listdir command:
In [52]: os.listdir(model_dir)
Out[52]: ['checkpoint',
'model.ckpt-1.data-00000-of-00001',
'model.ckpt-1.index',
'model.ckpt-1.meta']
These files contain the weights, but have no information about which word corresponds to each vector.
What we need is a metadata.tsv file with the list of words. We can easily create it by looping over the
indexes from 0 to n_viz adding one word per line to the file:
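A minimal sketch of this loop, writing to a temporary folder instead of model_dir and using a made-up inverted index with a small n_viz:

```python
import os
import tempfile

# Made-up inverted index and a small n_viz for illustration
word_inverted_index_demo = ['the', ',', '.', 'of', 'to', 'and', 'in']
n_viz_demo = 6

# Hypothetical output folder (a temporary directory)
out_dir = tempfile.mkdtemp()
fname = os.path.join(out_dir, 'metadata.tsv')

# One word per line, in the same order as the rows of the matrix
with open(fname, 'w', encoding='utf-8') as fout:
    for idx in range(n_viz_demo):
        fout.write(word_inverted_index_demo[idx] + '\n')

with open(fname, encoding='utf-8') as fin:
    print(fin.read())
```

The order matters: line i of the file must hold the word whose vector sits in row i of the embedding tensor.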
You can check the content of this file and see that it contains one word per line like:
the
,
.
of
to
and
...
12.5. VISUALIZATION 527
In [55]: print(config)
embeddings {
tensor_name: "embedding_2/embeddings:0"
metadata_path: "metadata.tsv"
}
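TIP: Note that you could also use the following code to generate and save the same config
file in a programmatic way. Since the config file is so simple, we opted for the explicit
solution above.
# Projector plugin location in TensorFlow 1.x
from tensorflow.contrib.tensorboard.plugins import projector
config = projector.ProjectorConfig()
emb_config = config.embeddings.add()
emb_config.tensor_name = word_embeddings.name
emb_config.metadata_path = 'metadata.tsv'
# Write the projector config file into model_dir (TensorFlow 1.x API)
projector.visualize_embeddings(tf.summary.FileWriter(model_dir), config)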
tensorboard --logdir=/tmp/ztdl_models/embeddings/
http://localhost:6006/#projector
We should see the word embedding projector spinning. Using the Search tab on the right, let’s look for a
specific word, for example the word network, and see which words are closest to it. You should see something
like this:
Tensorflow projector
Word2Vec
Word2Vec is a set of word vectors introduced by Google in 2013. These vectors also try to encapsulate the
meaning of a word by looking at its context, i.e. the words that precede it and follow it.
You can find a detailed tutorial on how Word2Vec vectors are built in the Tensorflow Tutorials page. The
two main useful ideas are the Skip-gram Model and the noise-contrastive estimation (NCE) loss. Let’s take a
look at these in a bit more detail.
12.6. OTHER PRE-TRAINED EMBEDDINGS 529
Skip-grams
Skip-grams are simply pairs of words that appear in context. Let’s consider the first 2 lines of the song
Imagine:
and let’s focus on the word heaven. If we choose a context of -2, +2 words, we see that the following words
appear in the context of heaven:
- there’s
- no
- it’s
- easy
We could therefore try to build a model that takes a word as input and tries to predict the probability that
another word in the dictionary appears in its context by using input/output pairs like:
INPUT OUTPUT
heaven there’s
heaven no
heaven it’s
heaven easy
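These pairs can be generated mechanically. Here is a plain Python sketch with a context of -2, +2 words; the function name skipgram_pairs is ours:

```python
def skipgram_pairs(words, window=2):
    """Pair every word with the words at most `window` positions away."""
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

words = "imagine there's no heaven it's easy if you try".split()
pairs = skipgram_pairs(words, window=2)

heaven_pairs = [p for p in pairs if p[0] == 'heaven']
print(heaven_pairs)
```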
We could use a Softmax over the whole dictionary and eventually learn these probabilities. However this
would require a huge amount of data, since the dictionary size is huge.
Imagine solving a slightly different problem where instead of having a word as input and a word as output,
we have a pair as input and a binary label as output. We can use the pairs above as positive examples, since
they are actual pairs found in the training text, and we can build fake pairs that have the same first word and
a random second word. Our data will look like this:
INPUT OUTPUT
(heaven, there’s) 1
(heaven, no) 1
(heaven, it’s) 1
(heaven, easy) 1
(heaven, cat) 0
(heaven, brain) 0
(heaven, swimming) 0
(heaven, chair) 0
... ...
We have constructed negative pairs by randomly choosing words from the dictionary. This model learns to
predict the probability that, given a first word, the second word is in its context, which is essentially the same
problem as before. However, this model is much faster to train, because we are only choosing a small set of
negative examples at each batch instead of having the whole dictionary.
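The pair construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `skipgram_pairs` and `add_negative_samples` are hypothetical, the tokens are the lowercased words of the two lines discussed in the text, and the "noise" words are the ones shown in the table. A real Word2Vec implementation samples negatives from a frequency-based distribution rather than uniformly.

```python
import random


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within +/- window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def add_negative_samples(pairs, vocabulary, n_negative=1, seed=0):
    """Label true pairs with 1 and add random fake pairs labeled 0."""
    rng = random.Random(seed)
    positives = set(pairs)
    labeled = [(pair, 1) for pair in pairs]
    for center, _ in pairs:
        drawn = 0
        while drawn < n_negative:
            fake = (center, rng.choice(vocabulary))
            # reject fakes that are actually true pairs or trivial repeats
            if fake not in positives and fake[1] != center:
                labeled.append((fake, 0))
                drawn += 1
    return labeled


tokens = "imagine there's no heaven it's easy if you try".split()
# extra words that never appear in the text (taken from the table above)
vocabulary = sorted(set(tokens)) + ["cat", "brain", "swimming", "chair"]

heaven_pairs = [p for p in skipgram_pairs(tokens) if p[0] == "heaven"]
labeled = add_negative_samples(heaven_pairs, vocabulary, n_negative=1)
```

Here `heaven_pairs` contains the four positive pairs listed in the first table, and `labeled` adds one randomly drawn negative pair for each of them.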
These two tricks allow us to train Word2Vec quite easily. In our case we will not even train it, but rely on a
pre-trained version of these embeddings that has been trained on data from Google News.
TIP: if you want to learn more about Word2Vec we encourage you to read the Wikipedia
page and to go through the tutorial mentioned earlier.
FastText
FastText is a library for efficient learning of word representations and sentence classification developed by
Facebook Research. It is open-source, free and lightweight and it allows users to learn text representations
and text classifiers. Here is the Github repository and here you can read the blog post announcing its
publication.
1) The fastText vector of a word is built from the vectors of the character substrings contained in that word.
This makes it possible to build vectors even for misspelled words or concatenations of words.
2) fastText has been designed to work on a variety of languages by taking advantage of each language’s
morphological structure. This means pre-trained vectors are available for many other languages
besides English.
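To make the first point more concrete, here is a minimal sketch of how a word can be broken into character substrings. It is not the fastText implementation: the function names are hypothetical, and the choices of wrapping the word in `<` and `>` boundary markers and using substrings of 3 to 6 characters are assumptions modeled on the published description of the method. The real library additionally hashes the substrings and sums their learned vectors.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def shared_fraction(word_a, word_b):
    """Fraction of word_a's n-grams that also occur in word_b."""
    grams_a = set(char_ngrams(word_a))
    grams_b = set(char_ngrams(word_b))
    return len(grams_a & grams_b) / len(grams_a)


trigrams = char_ngrams("where", n_min=3, n_max=3)
overlap = shared_fraction("network", "netwrok")  # misspelled on purpose
```

Because the misspelled word still shares several substrings with the correct one, a vector built from substring vectors can land close to the vector of the intended word.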
You can download the pre-trained English vectors here. You can download the pre-trained vectors for many
other languages here.
In the exercises we will use these vectors and compare their results with GloVe and Word2Vec.
Exercises
Exercise 1
Compare the representations of Word2Vec, GloVe and FastText. In the data/embeddings folder we
provided you with two additional scripts to download FastText and Word2Vec. Go ahead and download
each of them into the data/embeddings folder. Then load each of the 3 embeddings in a separate Gensim model
and complete the following steps:
1. define a list of words containing the following words: ‘good’, ‘bad’, ‘fast’, ‘tensor’, ‘teacher’, ‘student’.
2. create a function called get_top_5(words, model) that retrieves the top 5 most similar words to
the list of words and compare what the 3 different embeddings give you
3. apply the same function to each word in the list separately and compare the lists of the 3 embeddings.
Note that loading the vector may take several minutes depending on your computer.
Exercise 2
The Reuters Newswire topic classification dataset is a dataset of 11,228 newswires from Reuters, labeled over
46 topics. This dataset is provided in the keras.datasets module and it’s easy to use.
Let’s compare the performance of a model using pre-trained embeddings with a model using random
embeddings on the topic classification task.
In this chapter we want to achieve a few goals. We want to explain at a high level how to think about the
deployment process, highlighting the issues involved, and outlining the possible choices. This chapter will
help you understand the decisions you’ll need to make when deploying a model, as well as equip you with a list
of resources that you can tap into, according to your needs.
Then we will show you two ways of deploying a model: a simple Flask application using a Python server and
a more general deployment using Tensorflow Serving. These are not the only two ways to deploy a model
and we’ll make sure to point you to additional resources, companies and products that simplify the
management of the model deployment cycle.
1. Data Collection
2. Data Processing
534 CHAPTER 13. SERVING DEEP LEARNING MODELS
3. Model Development
4. Model Evaluation
5. Model Exporting
6. Model Deployment
7. Model Monitoring
These steps are part of a continuous deployment cycle: we never finish improving our models and learning
from new data. After deploying our first model, we will deploy a second version, a third, and so on. The
performance of each new version will be compared to the performance of the previous one. Traffic will
gradually be shifted towards the new model, as happens with any other release in a continuous
integration setup.
The 7 steps are not mutually exclusive; they happen in parallel. In other words, while you are working on
developing (3) and evaluating (4) the current version of a model, the previous version of the model is
already being monitored (7) and additional data and labels are being collected (1) and processed (2).
Data Collection
Throughout this book we used datasets based on files. These were either tabular files (CSV, Excel) or folders
containing images or documents. In the real world there is usually a process involving data collection,
where data is stored in a database or in a distributed file system for later use.
Depending on the type of data, on its frequency and on its size, you will design different collection and
storage systems.
As a first case let’s consider a bank that would like to train a model to decide which people can receive a
loan. The inputs to the model are things like:
• user information
• account activity
• past loans
• history of credit etc.
This information will typically be stored in several tables in a database. We will be able to create a dataset to
train our model by simply joining data from a few tables of the database. Furthermore, we can probably
work with a sample of the whole data as a starting point, especially if our first model is not going to be
personalized to each user. This “snapshot” of the world, a dataset we extracted at some point in time, is
going to be valid for at least a few days, if not for a few weeks or even months. That is to say, the general
lending behavior of a population will surely evolve over time, but it will do so quite slowly, not from one day
to the next. In other words, the statistical distribution of our users is stationary, or
quasi-stationary, i.e. it does not change (or changes only very slowly) over time.
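As an illustration of building such a snapshot by joining a few tables, here is a small self-contained sketch using Python’s built-in sqlite3 module. The table names, columns and values are entirely hypothetical and only stand in for the kind of user and loan tables a bank might have.

```python
import sqlite3

# in-memory database with two hypothetical tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, age INTEGER, income REAL);
CREATE TABLE past_loans (loan_id INTEGER PRIMARY KEY, user_id INTEGER,
                         amount REAL, repaid INTEGER);
INSERT INTO users VALUES (1, 34, 52000.0), (2, 45, 81000.0), (3, 29, 39000.0);
INSERT INTO past_loans VALUES (10, 1, 5000.0, 1), (11, 2, 12000.0, 1),
                              (12, 2, 3000.0, 0);
""")

# one row per user: user features joined with aggregated loan history
query = """
SELECT u.user_id, u.age, u.income,
       COUNT(l.loan_id) AS n_loans,
       COALESCE(AVG(l.repaid), 0) AS repaid_rate
FROM users AS u
LEFT JOIN past_loans AS l ON l.user_id = u.user_id
GROUP BY u.user_id, u.age, u.income
ORDER BY u.user_id
"""
snapshot = conn.execute(query).fetchall()
conn.close()
```

The resulting `snapshot` is a fixed list of rows, one per user, that could be written to a file and used for training exactly like the file-based datasets used in this book.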
These facts allow us to train a model on a file, maybe a large file, but a fixed snapshot, like we did throughout
this book, and use that model in production. Once we have trained and evaluated our model, we will deploy
it to our branches and the managers will have an “AI helper” to decide whether to issue a loan or not.
Let’s consider a very different case now. Let’s say we want to build a system that, based on the actions of a
user in our web application, decides which advertisement to show. In this case the input will be a series of
events in the app. Both the app and the ads inventory will change much more frequently, due to new
feature releases, new clients etc. In this case we will re-train our model much more frequently, possibly
every day, using the most recent data.
• the kind of data: is it files, documents, images, text, numbers in a table?
• the amount of data: how many new data points do we collect per day? 100? 1000? 1 million? 1 billion?
The data collection and storage process will change dramatically based on that.
Modern Machine Learning products usually work as a continuous pipeline, where a model is continuously
learning from new data and every now and then a snapshot of the most current model is saved and used in
production for inference.
Labels
As we know very well by now, in order to train a model with supervision, we need labels. Here too, there can
be many different scenarios.
In some cases, we may not have those labels at all. For example, let’s say we are training an algorithm to
recognize offensive pictures in our user-generated content. We will need to collect a sample of pictures and
have human supervisors manually label the offensive pictures with labels such as “violence”, “nudity”, etc.
This labeling process will be slow and costly, but it will be necessary before we can proceed with any training.
Additionally, if we randomly sample our images, there will likely be very few offensive images, which would
make labeling very slow because our human supervisors would receive mostly normal pictures. This is why
most websites implement a button for users to report offensive content. This will effectively triage the
pictures, bringing the offensive ones to the surface so that human supervisors can review them and generate
labels accordingly.
On the opposite end, if we are training a model to advertise products, i.e. to predict the likelihood that a
user will click on a certain product, the labels, i.e. the past clicks, are automatically recorded by our system.
In general, the win-win strategy for label generation is when the product manager can design a product in
such a way that labels are automatically generated by the users or the process. Examples of this win-win
approach are:
• tagging your friends on Facebook => labels for face recognition algorithm
• recording purchase actions on Amazon => labels for recommendation of other products
• “flag this post” button on Craigslist => labels for fraud/spam detection algorithm
• captchas that ask you to recognize street signs => labels for image recognition algorithm
Data Processing
This process is usually referred to as ETL (Extract, Transform, Load) in enterprise settings. It involves going
from the raw data in your data store to features ready for consumption by the Machine Learning model.
At this stage you will focus on operations like data cleaning, data imputation, feature extraction and feature
engineering. Once again, this will depend on your specific situation but it is important to keep in mind that
when NULL values are present, i.e. when some data is missing, we need to stop and ask ourselves why such
data is missing. Discussion of the different cases of missing data is beyond the scope of this book, but we
invite you to read this Wikipedia article on missing data to be cognizant of the issues involved when dealing
with missing data.
Other data processing steps may involve generating features, augmenting the data, one-hot encoding, using
pre-trained models for feature extraction etc.
13.1. THE MODEL DEVELOPMENT CYCLE 537
Model Development
Model development is deeply interconnected with the data processing step. This is where we focus on
deciding which model to apply.
Again, the choices will depend on the particular situation you are in. It is important that you keep in mind a
general approach to this phase: keep your feedback loop as rapid as possible. It’s not a mystery that a rapid
feedback loop is a great strategy in software development (e.g. Agile development). In Machine Learning
this is just as true, so if you are considering two options to improve your model and one takes 1 hour to test
and the other 1 week to test, you should absolutely choose the former over the latter.
Let’s look at one example. Let’s say we have some indication that our model would improve if we had more
data, but we are also not sure that we have chosen the right architecture for the model. In some cases, getting
more data could be as simple as running a new SQL query to extract a few more million data points from
our database. In other cases it could be much more complicated, involving manual label generation
with a team of supervisors, which would likely require days if not weeks of delay.
On the other hand, if re-training the model takes minutes or even just a few hours we could spin up a new
copy of the model with a different architecture and train it quickly. If training the model takes 1 week that’s
clearly not an easy option.
We will have to take all these factors into account when developing the model and choosing where to start
first.
Model Evaluation
Once we have decided what model architecture we are going to use, we need to train the model on the data.
The majority of this book has been dedicated to this process, so you should be pretty familiar with terms like
train/test splitting, cross-validation, and hyper-parameter tuning. It is important that at this stage you know
what baseline you are measuring against and what metric you are going to use. If this is your first model
attempt, you are probably comparing the performance with a dummy model (i.e. one that always predicts
the average label or the majority class). On the other hand, if you have previously deployed other models,
you will compare the model performance with that of the previous model.
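A dummy baseline of this kind takes only a few lines. The sketch below is an assumption-laden illustration: the function name and the toy label lists are hypothetical, and it covers only the classification case (always predicting the majority class seen in training).

```python
from collections import Counter


def majority_class_baseline(y_train, y_test):
    """Accuracy of a dummy model that always predicts the majority class."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    correct = sum(1 for label in y_test if label == majority_label)
    return majority_label, correct / len(y_test)


# toy labels, for illustration only
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0]

baseline_label, baseline_accuracy = majority_class_baseline(y_train, y_test)
```

Any trained model should beat `baseline_accuracy` by a clear margin before it is worth deploying.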
You will have to consider what overall goal you are trying to achieve. In the case of a binary classification
problem, you will consider metrics like precision and recall to evaluate if your model has lots of false
positives or false negatives. The choice you make will depend on your business goal and your data. For
example, if you are deploying a model that predicts patient sickness you will try to avoid false negatives,
because you wouldn’t want to leave any screened patient with a false impression that they are healthy when
they are not. On the other hand, if you are developing a system for flagging spam, you will focus more
on avoiding false positives, which would route legitimate emails to the spam folder, creating a bad user
experience.
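To make the two metrics concrete, here is a minimal sketch that computes precision and recall from lists of true and predicted labels. The function name and the toy labels are hypothetical. Low recall signals many false negatives (the concern in the patient example), while low precision signals many false positives (the concern in the spam example).

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class marked as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# toy labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
```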
If you plan to do hyper-parameter tuning, it is also very important that you split your total dataset into
three parts:
• training
• validation
• test
The training data will be used to train the model. The validation data will serve as “test” data for
hyper-parameter tuning: for each new combination of hyper-parameters you will train the model on the
training portion and validate the model on the validation portion. This split is like having two nested
training loops. The inner training loop will choose the weights and biases of your network, the outer
training loop will choose the hyper-parameters like learning rate, batch size, number of layers etc.
The test set will never be seen by any of these models until the end. Once you have chosen the best
hyper-parameters and you have trained the best model on that data, only then, you will test your trained
model on the test set to get a sense of how well your model is going to perform with out-of-sample data,
i.e. an indication of how well your model is going to do when deployed.
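The two nested loops can be sketched with a tiny model in plain Python. Everything here is an assumption made for illustration: the synthetic 1-D data, the logistic model trained by stochastic gradient descent, the 300/100/100 split and the list of candidate learning rates. The point is only the structure: the inner loop fits the weights on the training set, the outer loop picks the hyper-parameter on the validation set, and the test set is used once at the very end.

```python
import math
import random


def make_data(n, rng):
    """Hypothetical 1-D data: label is 1 when x plus some noise is positive."""
    data = []
    for _ in range(n):
        x = rng.uniform(-3, 3)
        y = 1 if x + rng.gauss(0, 0.5) > 0 else 0
        data.append((x, y))
    return data


def train(data, learning_rate, epochs=50):
    """Inner loop: fit weight and bias of a logistic model by SGD."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= learning_rate * (p - y) * x
            b -= learning_rate * (p - y)
    return w, b


def accuracy(params, data):
    w, b = params
    hits = sum(1 for x, y in data if (w * x + b > 0) == (y == 1))
    return hits / len(data)


rng = random.Random(0)
data = make_data(500, rng)
train_set, val_set, test_set = data[:300], data[300:400], data[400:]

# outer loop: choose the hyper-parameter using the validation set only
candidates = [0.001, 0.01, 0.1]
best = None
for lr in candidates:
    params = train(train_set, lr)
    val_acc = accuracy(params, val_set)
    if best is None or val_acc > best[0]:
        best = (val_acc, lr, params)

best_val_acc, best_lr, best_params = best
# the test set is touched only once, after all choices have been made
test_acc = accuracy(best_params, test_set)
```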
Model Exporting
Once the model has been trained and evaluated, it is time to get it ready for serving. What happens at this
stage will depend on many requirements including desired latency, footprint, the device you are planning to
use for serving and many more.
At one end of the spectrum, this step is as simple as saving the trained model to disk as is. The trained
model is composed of 2 parts, the model architecture and the trained weights. This method is perfect when
we use Keras to build our model and plan to serve it as part of a Python/Flask application. It is not
optimized at all, but if all we care about is building a proof of concept and we don’t need to support high
traffic, then this can be fast to execute. This is the first method we will explore in this chapter.
On the other end of the spectrum are large scale deployments. If we are planning to use our model in a high
availability production environment, we need to make sure it is optimized for serving predictions within the
constraints required by our application.
For example, if we plan to use a model to make a real-time decision on which ad to serve or which product
to recommend to a user visiting our website, we will have very stringent latency requirements, usually a few
tens of milliseconds at most, and this will influence the choices we make when designing the model as well
as when saving it.
The topic of model optimization is vast and it requires tools that go beyond the scope of this book. We will
therefore limit ourselves to pointing out what kind of optimizations are possible and where to look for
information about them.
• stripping away all operations that are not needed for inference. The Tensorflow graph underlying our
Keras model contains all the operations required for training as well, including the gradient
calculations and the optimizer. None of these is relevant at inference time and we should strip them
away from the graph. Tensorflow has a Graph Transform Tool that includes a lot of options to check
out. Common cases covered by the tool are:
• low level compilation of Tensorflow operations using the Accelerated Linear Algebra compiler (XLA).
This compiler optimizes the operations for the specific platform that we are going to use to deploy and
it can help in the following areas:
In addition to model optimization, inference performance can be improved by choosing the hardware
platform that is best suited to our model. Currently Tensorflow supports CPU, GPU and TPU training and
inference. In the coming years we’ll see a flourishing of hardware platforms dedicated to Deep Learning
model training and serving, which will bring additional options to the table.
In this chapter we’ll see how to save a Keras model in Tensorflow format, so that all the above tools can be
applied.
Model Deployment
Model deployment refers to how we are going to make our model available to the rest of the world. In
several cases, this is a Python/Flask application that simply loads the model to memory and then runs
model.predict when requested. This is the first method we will explore in this chapter. It is a great way
to deploy a proof of concept in situations where we do not need a high throughput.
The natural extension of this method is to containerize the Flask app with Docker so that we can replicate
the model multiple times and adjust our model to the load requested by our application. While this works, it
is not the recommended solution when scaling out operations. Tensorflow offers its own server which is the
preferred way to deploy models at scale.
That’s why in this chapter we will also go through a minimal deployment with Tensorflow Serving.
Tensorflow Serving is a powerful package developed with large deployments in mind. We will introduce it
and guide you to more resources if you need to scale out operations with your models.
More generally, a new model will be deployed in parallel to an existing model and its performance will be
validated with live traffic in a classic A/B test scenario, where traffic is only partially routed to the new model
and its performance is monitored against that of the old model for some time before completely adopting it
and phasing out the old one.
This strategy is why deployment must also include monitoring of the model performance.
Model Monitoring
Last but not least, when we deploy a model we want to monitor its performance. It is important to sample
the predictions of the model and send a few of them to human supervision in order to verify their quality. In
other words, label collection never ends. We need to keep measuring the performance of our model against
a known set of labels.
In some cases this process is automated, for example when our model predicts future values of a
time series, e.g. the price of a stock for trading purposes. In this case, as soon as we get the next value of the
time series we can immediately compare it with the prediction from our model and monitor its quality in
real-time.
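For the time series case, this kind of monitoring can be sketched as a rolling error computed over the most recent predictions. The class name, the window size, the alert threshold and the toy values below are all hypothetical choices made for illustration.

```python
from collections import deque


class RollingErrorMonitor:
    """Track the mean absolute error over the most recent predictions."""

    def __init__(self, window=100, alert_threshold=1.0):
        self.errors = deque(maxlen=window)  # old errors drop out automatically
        self.alert_threshold = alert_threshold

    def update(self, prediction, actual):
        """Record a new prediction once its true value becomes available."""
        self.errors.append(abs(prediction - actual))
        return self.mean_error()

    def mean_error(self):
        return sum(self.errors) / len(self.errors)

    def needs_attention(self):
        return self.mean_error() > self.alert_threshold


monitor = RollingErrorMonitor(window=3, alert_threshold=1.0)
stream = [(10.0, 10.5), (11.0, 10.0), (12.0, 15.0), (13.0, 16.0)]
for predicted, observed in stream:
    monitor.update(predicted, observed)
```

After the four updates only the last three errors are kept, so a model whose quality degrades over time is flagged quickly instead of being masked by older, better predictions.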
In other cases, where labels are generated by human supervisors, we need to keep sending data to a QA team
that will generate new labels for them. We can then compare the predictions with the labels and decide how
to improve the model on the cases where it failed.
This process never ends: we can always come up with better models. However, we should not be
discouraged by this. As British mathematician George E. P. Box said: “All models are wrong; some models
are useful”. You can reap enormous benefits from a model that is not perfect.
Let’s deploy our first model. We will build an API that can predict the location of a user based on the
strength of WiFi signals detected.
Data exploration
In [3]: df = pd.read_csv('../data/wifi_location.csv')
Let’s quickly inspect the data to get a sense for what we have:
In [4]: df.head()
Out[4]:
It looks like we have 7 features, presumably the strengths of the wifi signals coming from 7 different access
points. There’s also a column called location that will be our label. Let’s see how many locations there are
in the dataset:
In [5]: df['location'].value_counts()
Out[5]:
location
3 500
2 500
1 500
0 500
Great! The dataset is balanced and it has 500 examples of each location! Since we have only 2000 points
total, we can plot the features and take a look at them:
From the plot we can clearly see that the wifi signal strengths are different in the 4 locations and therefore
we can hope to be able to predict the location of a person based on these features. To reinforce
this point, let’s do a pairplot using Seaborn and let’s color the data by location:
13.2. DEPLOY A MODEL TO PREDICT INDOOR LOCATION 543
It is very clear that the 4 locations are quite well defined and therefore we can hope to train a good model.
Let’s do that! First let’s define our usual X and y arrays of features and labels:
Then we’ll split our data into training and test sets, using a 25% test split.
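A possible version of these two steps is sketched below. To keep the snippet self-contained it builds a synthetic stand-in for the dataframe: the feature column names (a1 to a7) and the random values are assumptions, while the location column, the 2000 rows and the 25% test fraction follow the text. The random_state value is also an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in for the wifi dataset: 7 signal strengths + location
rng = np.random.RandomState(0)
feature_names = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7']  # hypothetical names
df = pd.DataFrame(rng.randint(-90, -30, size=(2000, 7)),
                  columns=feature_names)
df['location'] = np.repeat([0, 1, 2, 3], 500)

# features and labels as Numpy arrays
X = df.drop('location', axis=1).values
y = df['location'].values

# hold out 25% of the points for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```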
Now let’s build a fully connected model using the Functional API in Keras. Let’s import the Model class as
well as a few layers:
In particular notice that we’ll use the BatchNormalization layer right after the input, since our features
take large negative values, which may slow the convergence of our model. Let’s build a fully connected
model with the following architecture:
• Input
• Batch Normalization
• Fully connected inner layer with 50 nodes and a ReLU activation
• Fully connected inner layer with 30 nodes and a ReLU activation
• Fully connected inner layer with 10 nodes and a ReLU activation
• Output layer with 4 nodes and a Softmax activation
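The architecture listed above could be written with the Functional API roughly as follows. This is a sketch, not necessarily the exact code used in the book: it assumes the Keras bundled with Tensorflow (`tensorflow.keras`), whereas the book’s environment may import from the standalone `keras` package, and the intermediate variable names are arbitrary.

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, BatchNormalization

# 7 input features, one per wifi access point
inputs = Input(shape=(7,))
x = BatchNormalization()(inputs)
x = Dense(50, activation='relu')(x)
x = Dense(30, activation='relu')(x)
x = Dense(10, activation='relu')(x)
# 4 output nodes, one per location
outputs = Dense(4, activation='softmax')(x)

model = Model(inputs=inputs, outputs=outputs)
```

With these layer sizes the parameter count works out to 28 + 400 + 1530 + 310 + 44 = 2,312, of which 14 (the moving mean and variance of the batch normalization layer) are non-trainable.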
Let’s display a model summary and make sure that we have built exactly what we wanted:
In [14]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 7) 0
_________________________________________________________________
batch_normalization_1 (Batch (None, 7) 28
_________________________________________________________________
dense_1 (Dense) (None, 50) 400
_________________________________________________________________
dense_2 (Dense) (None, 30) 1530
_________________________________________________________________
dense_3 (Dense) (None, 10) 310
_________________________________________________________________
dense_4 (Dense) (None, 4) 44
=================================================================
Total params: 2,312
Trainable params: 2,298
Non-trainable params: 14
_________________________________________________________________
Great! Now we can compile the model. Notice that since our labels are not one-hot encoded we should use
the sparse_categorical_crossentropy instead of the usual categorical_crossentropy:
In [15]: model.compile('adam',
'sparse_categorical_crossentropy',
metrics=['accuracy'])
We are now ready to train the model. Let’s train it for 40 epochs, using the test data to validate the
performance:
As we have done several times in the book, we can display the history of training leveraging Pandas plotting
capabilities:
In [17]: pd.DataFrame(h.history).plot()
plt.ylim(0, 1);
The training graph looks very good. The model has converged to almost perfect accuracy and there is no
sign of overfitting. This is great! We are ready to export the model for deployment.
Keras offers several ways to export a model. The simplest way is to save the model architecture and the
weights as separate files. You can read more about the various ways of saving a model here. Let’s
start by importing the os, json and shutil packages:
Next we define the output path to save our model. This path will be composed of three parts:
sub_path = 'flask'
version = 1
Out[21]: '/tmp/ztdl_models/wifi/flask/1'
Next we create the export path. We delete it first and then re-create it as an empty path:
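The delete-and-recreate step can be sketched with the standard library alone. The helper name `make_export_path` and the variable `base_path` are hypothetical; the sub_path and version values follow the text. The base folder is built from the system temporary directory, which on Linux resolves to /tmp and therefore gives the path shown above.

```python
import os
import shutil
import tempfile


def make_export_path(base_path, sub_path, version):
    """Build base/sub_path/version and make sure it is an empty folder."""
    export_path = os.path.join(base_path, sub_path, str(version))
    # delete a previous export, if any, then re-create the empty folder
    if os.path.exists(export_path):
        shutil.rmtree(export_path)
    os.makedirs(export_path)
    return export_path


base_path = os.path.join(tempfile.gettempdir(), 'ztdl_models', 'wifi')
export_path = make_export_path(base_path, sub_path='flask', version=1)
```

Calling the helper a second time with the same arguments wipes whatever was saved there before, so use a new version number to keep older exports.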
Now we are ready to save the model. Let’s have a look at the json description of the model:
In [23]: json.loads(model.to_json())
Nice! The whole model is specified in a few lines! To save it we’ll open a model.json file and then write to it
the json version of the model:
Next we save the weights. We do this with the .save_weights method of the model:
Let’s check the content of the export_path using the os.listdir command:
In [26]: os.listdir(export_path)
As you can see there are 2 files, the json description of the model and the weights. Great! Let’s see how one
would re-load these into a new model. First we need to import the model_from_json function:
The loaded model has random weights, as we can verify by generating predictions on the test set and then
comparing them with the labels. Notice that since the model was defined using the functional API, there is
no .predict_classes method. Let’s use the .predict method to obtain the probabilities for each class:
To retrieve the predicted classes we need to use the argmax function from Numpy:
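As a small illustration of this step, the snippet below applies argmax to a hand-written array of probabilities. The values are made up; in the notebook the array would come from the .predict method of the loaded model, with one row per test point and one column per location.

```python
import numpy as np

# made-up probabilities for 3 points over 4 classes (each row sums to 1)
probas = np.array([[0.10, 0.20, 0.60, 0.10],
                   [0.05, 0.05, 0.10, 0.80],
                   [0.70, 0.10, 0.10, 0.10]])

# index of the largest probability along each row = predicted class
preds = np.argmax(probas, axis=1)
```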
Out[30]: array([2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 3,
2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 3,
3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3,
3, 3, 2, 2, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 3, 2,
3, 3, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2,
3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 2, 3, 2, 3, 2, 2,
3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 3, 3, 3, 2, 2, 3,
3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 2, 2, 3, 3, 2, 2, 2, 2, 3, 3, 2,
3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3,
2, 2, 2, 2, 2, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 2,
3, 2, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3,
2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3,
3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 2,
3, 3, 3, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3,
3, 3, 2, 3, 3, 3, 2, 2, 3, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 2,
3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 2,
2, 2, 3, 3, 2, 2, 3, 3, 2, 2, 2, 2, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3,
3, 2, 3, 3, 3, 3, 3, 2, 2, 2, 3, 2, 3, 2, 3, 3, 3, 2, 2, 2, 2, 2,
3, 3, 2, 2, 2, 3, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 3, 3, 3, 2, 2, 3,
3, 3, 2, 2, 3, 3, 3, 2, 3, 2, 2, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3,
2, 2, 2, 3, 3, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 3])
Finally we can check the accuracy of these predictions using the accuracy_score from Scikit-Learn:
Out[32]: 0.274
As expected, this model is not trained. Let’s load the weights now:
Out[34]: 0.978
Great! The model is now using the trained weights, so we can use it for inference in deployment.
Notice that this model is not trainable. If you tried to run the command:
loaded_model.fit(X_train, y_train)
you would get an error. As the message explains, in order to train the model we need to compile it first,
i.e. add to the graph all the operations concerning gradient calculation, loss calculation and optimizer.
We don’t need any of this for deployment, so let’s not compile the model.
WARNING: The simple script we run here is not meant for production use. Please make
sure to read how to deploy a Flask app to production in the Flask documentation.
As the documentation says, Flask is a microframework for Python based on Werkzeug, Jinja 2, and
good intentions. And before you ask: It’s BSD licensed!
It’s a very common choice for simple websites, APIs, and in general web development. We’ll use it here
to load our model in a very simple application that will be launched from a script.
Here we will go through the commands that compose the script and tell you how to run it from the
shell. Let’s get started.
TIP: if you have installed the most recent version of our ztdlbook environment file, you
should already have flask installed. Otherwise go back to Chapter 1 and check the
instructions on how to create or update the environment.
13.3. A SIMPLE DEPLOYMENT WITH FLASK 553
The script will first import the Flask and request classes. Flask is the main app, while request will be
used to collect the data received by the app.
We also import tensorflow which we’ll need when loading the model:
Then we define a few global variables, like the path of the model, the model and the tensorflow graph. These
last two are initialized as None; they will be assigned later.
The next step is to define a load_model function that loads the model from the export_path like we did
before:
# global variables
global loaded_model
global graph
# load weights
loaded_model.load_weights(join(export_path, 'weights.h5'))
The second function we define is a preprocess function that can be used to perform any normalization,
feature engineering or other preprocessing. In the current scenario we use this function to convert the data
from json to a Numpy array.
Next we define a function called predict, which performs the following operations:
@app.route('/', methods=["POST"])
This decorator tells Flask that this function should be called when a POST request is received at the / route.
For more information on how this is done in Flask, please make sure to check the extensive documentation.
data = request.data
# print in backend
print("Received data:", data)
print("Predicted labels:", preds)
return jsonify(preds.tolist())
Finally we complete the script with an if statement that runs the app in debug mode:
if __name__ == "__main__":
print("* Loading model and starting Flask server...")
load_model()
app.run(host='0.0.0.0', debug=True)
Please note that this is not the preferred mode to run a flask app. Please refer to the documentation for more
information.
Full script
Let’s take a look at the whole script using the cat shell command.
TIP: if this doesn’t work on your system, simply open the script in your favorite text editor:
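import os
import json
import numpy as np
# NOTE: the three imports below and every line marked "reconstructed" were
# missing from this listing. They are restored from the description in the
# text and assume Keras 2 running on a Tensorflow 1.x backend.
import tensorflow as tf
from flask import Flask, request, jsonify
from keras.models import model_from_json

loaded_model = None
graph = None
app = Flask(__name__)


def load_model(export_path):
    """
    Load model and tensorflow graph
    into global variables.
    """
    # global variables
    global loaded_model
    global graph
    # load model architecture (reconstructed)
    with open(os.path.join(export_path, 'model.json')) as fin:
        loaded_model = model_from_json(fin.read())
    # load weights
    loaded_model.load_weights(os.path.join(export_path, 'weights.h5'))
    # store the graph the model lives in (reconstructed)
    graph = tf.get_default_graph()


def preprocess(data):
    """
    Generic function for normalization
    and feature engineering.
    Convert data from json to numpy array.
    """
    res = json.loads(data)
    return np.array(res['data'])


@app.route('/', methods=["POST"])
def predict():
    """
    Generate predictions with the model
    when receiving data as a POST request
    """
    if request.method == "POST":
        # get data from the request
        data = request.data
        # convert the json payload to a numpy array (reconstructed)
        processed = preprocess(data)
        # predict probabilities inside the stored graph (reconstructed)
        with graph.as_default():
            probas = loaded_model.predict(processed)
        # most likely class for each row (reconstructed)
        preds = np.argmax(probas, axis=1)
        # print in backend
        print("Received data:", data)
        print("Predicted labels:", preds)
        return jsonify(preds.tolist())


if __name__ == "__main__":
    from sys import argv
    print("* Loading model and starting Flask server...")
    # the export path can be passed as the first command line argument
    if len(argv) > 1:
        export_path = argv[1]
    else:
        export_path = '/tmp/ztdl_models/wifi/flask/1/'
    load_model(export_path)
    app.run(host='0.0.0.0', debug=True)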
python 13_flask_serve_model.py
Make sure to check the Flask documentation if you encounter any issues with the above steps.
Now that the server is running, let’s send some data to it and get predictions. We can test the application
with a simple curl request like:
[
0,
2,
2
]
What did we just do? We have sent the wifi signal detected by 3 mobile phones and obtained their location.
The first one is in zone 0 and the other two are in zone 2. Great!
We can also ping our API using Python from the notebook by importing the requests module:
In [46]: data
Finally we send a POST request to the api_url with our data in JSON format. We collect the request response
into a response variable:
In [49]: response
If you see: <Response [200]> it means the request worked. Let’s check the response we obtained:
In [50]: response.json()
Out[50]: [0, 2, 2, 1, 3]
In [51]: y_test[:5]
The deployed model is working pretty well! Very nice! There are many options to host your deployed model,
including:
• hosting the Flask app on AWS, GCloud, Azure
• deploying it on Heroku
• deploying it on Floydhub
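The notebook request described above can be sketched roughly as follows. This is a minimal sketch, not the book's original cells: it assumes the Flask app is listening on Flask's default port 5000, and the names api_url, build_payload and post_rows as well as the signal values are illustrative. The only detail taken from the script is that preprocess reads the samples from the "data" key of the JSON body.

```python
import json

# Assumed address of the local Flask server (5000 is Flask's default port).
api_url = "http://localhost:5000"


def build_payload(rows):
    """Serialize feature rows into the JSON body read by preprocess():
    an object with the samples stored under the "data" key."""
    return json.dumps({"data": rows})


def post_rows(url, rows):
    """POST the rows to the API and return the decoded JSON response."""
    # Imported here so that build_payload can be used without requests installed.
    import requests
    response = requests.post(url, data=build_payload(rows))
    response.raise_for_status()
    return response.json()


# Hypothetical wifi signal strengths, 7 access points per sample.
data = [[-62, -58, -59, -59, -67, -80, -77],
        [-49, -53, -50, -48, -67, -78, -88]]

payload = build_payload(data)
print(payload)
# With the server running, the predictions would be obtained with:
# post_rows(api_url, data)
```

Keeping the payload construction separate from the network call makes it easy to check the request body without a running server.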
The chapter continues by introducing a different way to export and deploy a model, which leverages
Tensorflow Serving. This is the preferred way for larger production deployments.
Tensorflow Serving can accommodate both small and large deployments, and it is built for production. It is
not as simple as Flask, and here we will barely scratch the surface of what’s possible with it. If you are
serious about using it, we strongly recommend you take a look at the Architecture overview where many
concepts like Servables, Managers and Sources are explained.
In this part of the book, we will just show you how to export a model for serving and how to ping a
Tensorflow serving server. We will leave the full installation of Tensorflow serving for the end of the chapter.
Installation is strongly dependent on the system you are using and is well documented.
Let’s get started by exporting the model for Tensorflow Serving. First we define an export path:
Notice that we can bump up the version number if we save a new model later on. Like before, we can
combine these:
Out[53]: '/tmp/ztdl_models/wifi/tfserving/1'
The SavedModelBuilder class provides functionality to build a SavedModel instance protocol buffer.
Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing
structured data – think XML, but smaller, faster, and simpler.
We’ll use this instance to save our model. Before we get that far, let’s create the instance:
The builder allows multiple meta graphs (i.e. TF models) to be saved as part of a single language-neutral
SavedModel, while sharing variables and assets. To build a SavedModel, the first meta graph must be saved
with variables. Subsequent meta graphs will simply be saved with their graph definitions.
Each meta graph added to the SavedModel must be annotated with tags. The tags provide a means to
identify the specific meta graph to load and restore, along with the shared set of variables and assets.
Together with tags we will need to pass the session and a signature map. This is a dictionary that contains
the methods that can be called when we call the model from the front-end. We’ll need to create at least one
signature to be able to use the model from within Tensorflow serving.
Now let’s define a signature that references the model.input tensors as input and the model.output as
output. By doing this we are tying the graph of our Keras model to the tensorflow serving builder.
Next let’s retrieve the Tensorflow Session, which is the class that represents the connection between our
Python client program and the C++ runtime. Keras makes the session available through the backend
module. Let’s import it:
And then let’s retrieve the session using the .get_session method:
We are now ready to use the builder to save our model. Let’s do that:
In [61]: builder.add_meta_graph_and_variables(
sess=sess,
tags=[tf.saved_model.tag_constants.SERVING],
signature_def_map={'predict': signature})
As we stated earlier, the first meta graph must be tagged, so we tagged our graph with the SERVING tag.
Finally let’s also write the SavedModel protocol buffer to disk using the builder.save() method:
In [62]: builder.save()
Out[62]: b'/tmp/ztdl_models/wifi/tfserving/1/saved_model.pb'
In [63]: os.listdir(export_path)
As you can see there’s a protocol buffer file saved_model.pb which contains the serialized model and a
folder called variables. This folder contains the weights of the trained model that we have saved with the
builder:
Inference with Tensorflow Serving using Docker and the Rest API
By far the easiest way to get tensorflow serving up and running is to use the pre-built Docker image as
explained in the documentation.
TIP: If you are new to Docker this may be a bit unfamiliar and complicated. Feel free to
either skip this section or read more about Docker and how it works in the wonderful
documentation.
Assuming you have docker installed and running on your machine, let’s pull the tensorflow/serving
docker image:
Next let’s run the docker container with the following command:
docker run \
-v /tmp/ztdl_models/wifi/tfserving/:/models/wifi \
-e MODEL_NAME=wifi \
-e MODEL_PATH=/models/wifi \
-p 8500:8500 \
-p 8501:8501 \
-t tensorflow/serving
• -v: This bind mounts a volume, basically it tells docker to map the internal path /models/wifi to
the /tmp/ztdl_models/wifi/tfserving/ in our host computer.
• -e: Sets environment variables, in this case we set the MODEL_NAME and MODEL_PATH variables
• -p: Publishes a container’s port to the host. In this case we are publishing port 8500 (default gRPC)
and 8501 (default REST).
• -t: Allocate a pseudo-TTY
• tensorflow/serving is the name of the image we are running.
Since Tensorflow 1.8, Tensorflow serving comes with both gRPC and REST endpoints by default, so we can
test our running server by simply using curl. The correct command for this is:
Go ahead and run that in a shell, you should receive an output that looks similar to the following:
{
"predictions": [[0.997524, 1.19462e-05, 0.00171472, 0.000749083],
[3.40611e-06, 0.00262853, 0.997005, 0.000363284],
[2.52653e-05, 0.00507444, 0.993813, 0.00108718]
]
}
Wonderful! You have just run your first model using Tensorflow Serving and Docker.
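The same REST exchange can be sketched in Python. This is a hedged illustration: the URL assumes the port (8501) and model name (wifi) from the docker command, the body uses Tensorflow Serving's row ("instances") format with the predict signature defined during the export, and the helper names build_rest_body and decode_predictions are made up for this example. The response text is the sample output shown above.

```python
import json

# Assumed REST endpoint: port 8501 and model name "wifi" from the docker command.
rest_url = "http://localhost:8501/v1/models/wifi:predict"


def build_rest_body(rows, signature_name="predict"):
    """JSON request body in Tensorflow Serving's row ("instances") format."""
    return json.dumps({"signature_name": signature_name, "instances": rows})


def decode_predictions(response_text):
    """Return, for each row, the index of the class with the highest probability."""
    predictions = json.loads(response_text)["predictions"]
    return [max(range(len(p)), key=p.__getitem__) for p in predictions]


# The sample response printed by the server in the text above.
sample_response = """
{
  "predictions": [[0.997524, 1.19462e-05, 0.00171472, 0.000749083],
                  [3.40611e-06, 0.00262853, 0.997005, 0.000363284],
                  [2.52653e-05, 0.00507444, 0.993813, 0.00108718]]
}
"""

labels = decode_predictions(sample_response)
print(labels)  # [0, 2, 2]
```

Taking the most likely class of each row recovers the same zones returned by the Flask API for these three samples.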
To stop the server, first press CTRL+C to exit from the tty session. Then run:
docker container ls
to list all the containers currently running. The output should look similar to this:
Then stop the container with docker stop followed by its container ID. You can always restart it later if you need it.
Tensorflow serving can also receive data serialized as protocol buffers, so we will need to do a little bit more
work in order to use our server for predictions.
First of all let’s create a prediction service. We’ll need to import the insecure_channel from grpc:
Next let’s create an insecure channel to localhost (or to your server) on port 8500, which is the port we chose
for tensorflow serving:
In [67]: channel
Through this channel we’ll be able to perform RPCs. Next we are going to create an instance of
PredictionServiceStub from tensorflow_serving.apis.prediction_service_pb2_grpc. Notice
that most of the documentation you can find online is outdated and uses the legacy beta API. We are using
the most recent version of the gRPC API:
We are almost ready to send data to our server. The last thing we need to do is convert the data to protocol
buffers. Let’s use the same data we have used for the Flask example:
In [70]: data
Let’s convert this data to Protocol Buffers. First we import the make_tensor_proto function from
Tensorflow:
Then we use it to serialize our data. Notice that we will need wrap our data in a numpy array since it was
passed as a list to the Flask application:
In [74]: data_pb
As you can see it’s a binary file, with a text header. In the header we can read the data type and the tensor
shape, while the values have been converted to binary values. Now that we have prepared the data, we are
ready to create an instance of PredictRequest, which is a class in
tensorflow_serving.apis.predict_pb2:
When we started our tensorflow serving server, we specified wifi as the model name, so let’s use wifi as
the model name for the request:
In [79]: request.inputs['inputs'].CopyFrom(data_pb)
In [80]: request
Out[80]: model_spec {
name: "wifi"
signature_name: "predict"
}
inputs {
key: "inputs"
value {
dtype: DT_FLOAT
tensor_shape {
dim {
size: 5
}
dim {
size: 7
}
}
tensor_content: "\000\000x\302\000\000h\302\000\000l\302\000\000l\302\000\000\206\3
}
}
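To see what the tensor_content field holds, we can decode the first few bytes shown above by hand. This is a small sketch that assumes the values are stored as raw little-endian 32-bit floats, which matches the DT_FLOAT dtype on typical hardware. Only the first 16 bytes are used, since the printed line is truncated.

```python
import struct

# First 16 bytes of the tensor_content field shown above (octal escapes).
raw = b"\000\000x\302\000\000h\302\000\000l\302\000\000l\302"

# Assumption: raw little-endian float32 values, 4 bytes each.
values = struct.unpack("<4f", raw)
print(values)  # (-62.0, -58.0, -59.0, -59.0)
```

These are the first four signal strengths of the first sample, stored in binary form.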
Great! Now let’s pass the request to the Stub.future method which will invoke the underlying RPC
asynchronously. This method returns an object that is both a Call for the RPC and a Future. In the event of
RPC completion, the returned Call-Future’s result value will be the response message of the RPC. Should the
event terminate with non-OK status, the returned Call-Future’s exception value will be an RpcError.
Out[82]: outputs {
key: "outputs"
value {
dtype: DT_FLOAT
tensor_shape {
dim {
size: 5
}
dim {
size: 4
}
}
float_val: 0.9939389228820801
float_val: 0.00017752652638591826
float_val: 0.000844559574034065
float_val: 0.005038977600634098
float_val: 2.9748796805506572e-05
float_val: 0.006226719357073307
float_val: 0.9936287999153137
float_val: 0.00011463184637250379
float_val: 0.0005520731210708618
float_val: 0.00699754199013114
float_val: 0.9918016791343689
float_val: 0.0006487205973826349
float_val: 2.8535052933875704e-06
float_val: 0.9599703550338745
float_val: 0.03942705690860748
float_val: 0.0005996639374643564
float_val: 3.383963189662609e-07
float_val: 1.808920160328853e-06
float_val: 1.0559380037022947e-08
float_val: 0.9999978542327881
}
}
model_spec {
name: "wifi"
version {
value: 1
}
signature_name: "predict"
}
Wonderful! Our Tensorflow server returned the predicted probabilities. We can convert them back to a
familiar numpy array using the make_ndarray from tensorflow.contrib.util:
In [85]: scores
and we can compare this with the local model we still have in memory:
In [87]: model.predict(np.array(data)).argmax(axis=1)
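As a plain numpy illustration of this conversion, the sketch below takes the float_val entries printed in the response above, arranges them according to the reported tensor shape (5, 4) and picks the most likely class for each row. It does not use make_ndarray. It only mimics the reshaping step by hand, and the variable names are chosen for this example.

```python
import numpy as np

# float_val entries copied from the response shown above (row-major order).
float_val = [
    0.9939389228820801, 0.00017752652638591826, 0.000844559574034065, 0.005038977600634098,
    2.9748796805506572e-05, 0.006226719357073307, 0.9936287999153137, 0.00011463184637250379,
    0.0005520731210708618, 0.00699754199013114, 0.9918016791343689, 0.0006487205973826349,
    2.8535052933875704e-06, 0.9599703550338745, 0.03942705690860748, 0.0005996639374643564,
    3.383963189662609e-07, 1.808920160328853e-06, 1.0559380037022947e-08, 0.9999978542327881,
]

# The response reports a tensor_shape of 5 rows (samples) by 4 columns (classes).
scores = np.array(float_val).reshape(5, 4)
labels = scores.argmax(axis=1)
print(labels)  # [0 2 2 1 3]
```

The labels agree with the ones returned by the Flask API for the same five samples.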
Wonderful! We have successfully retrieved predictions from a Tensorflow serving server. This barely
scratches the surface of what’s possible with Tensorflow Serving. If you are serious about bringing your
models to production we strongly encourage you to read the Documentation as well as to complete the
Basic Tutorial and the Advanced Tutorial.
The installation of Tensorflow Serving depends on the system you are using. The installation guide describes
2 methods: one involves installing Bazel and compiling Tensorflow Serving, the other leverages the
Python packages. We will choose the second method, which is simpler, and show you how to complete it on
Ubuntu Linux 16.04 with system Python.
Next we install the Model Server. This requires adding the TensorFlow Serving distribution URI as a
package source (one time setup):
curl https://storage.googleapis.com/tensorflow-serving-apt/\
tensorflow-serving.release.pub.gpg | sudo apt-key add -
Now you are ready to serve your models. Simply start the Model Server by running:
tensorflow_model_server --port=8500 \
--model_name=wifi \
--model_base_path=/tmp/ztdl_models/wifi/tfserving/
This command specifies the port to use for serving, the model_name we are giving to this model and the
base path where the version folders are saved. If this command runs correctly you should see log messages
from Tensorflow serving indicating that it has found the model:
If the machine where you installed the model server is not your local machine, you’ll need to pack your
local model and upload it to your remote machine, for example using tar & scp:
cd /tmp/ztdl_models/wifi
tar -czvf tfserving.tgz tfserving
scp tfserving.tgz <your-ubuntu-user>@<your-remote-ip>:~/
mkdir -p /tmp/ztdl_models/wifi/
tar -xvzf tfserving.tgz -C /tmp/ztdl_models/wifi/
Finally start the server on the remote machine and make sure that port 8500 is accessible (either by setting
firewall rules or by tunnelling through SSH).
If you want to run the model server on your local machine and/or optimize the performance of the model
server, you’ll need to install Bazel and compile tensorflow serving. We encourage you to take a look at the
Documentation for Bazel and for Tensorflow Serving Setup.
Assuming you have completed the installation of Tensorflow Serving on your system (see the end of this
Chapter for information), you just need to run the command:
Exercises
Exercise 1
Let’s deploy an image recognition API using Tensorflow Serving. The main difference from the API we have
deployed in this chapter is that we will have to deal with how to pass an image to the model through
tensorflow serving. Since this chapter focuses on deployment, we will take a shortcut and deploy a
pre-trained model that uses Imagenet. In particular we will deploy the Xception model. If you are unsure
about how to use a pre-trained model, please go back to Chapter 11 for a refresher.
Exercise 2
The above method of serving a pre-trained model has an issue: we are doing pre-processing and prediction
decoding on the client side. This is actually not a best practice, because it requires the client to be aware of
what kind of pre-processing and decoding functions the model needs.
We would like a server that takes the image as it is and returns a string with the name of the object in the
image.
The easy way to do this is to use the Flask app implementation we have shown in this chapter and move
pre-processing and decoding to the server side.
Go ahead and build a Flask version of the API that takes an image URL as a JSON string, applies
pre-processing, runs and decodes the prediction and returns a string with the response.
curl -d 'http://bit.ly/2wb7uqN' \
-H "Content-Type: application/json" \
-X POST http://localhost:5000
"king_penguin"
Disclaimer: this script is not meant for production purposes. Retrieving a file from a URL is not secure
and you should avoid building an API that retrieves a file from a URL provided from the client. Here
we used the url retrieval trick in order to make the curl command shorter.
14 Appendix
Throughout the book we use several mathematical concepts drawn from linear algebra and calculus. In this
appendix we review them in a little more detail. This is meant for the curious reader and it’s not
necessary in order to complete the book.
Matrix multiplication
We have introduced matrices in chapter 1. As you know an N × M matrix is an array of numbers organized in
N rows and M columns.
Matrices are multiplied with the same rule as the dot product. Two matrices A and B can be multiplied if the
number of columns of the first is equal to the number of rows of the second. If A is 2x3 and B is 3x2, they
can be multiplied and the resulting matrix will have shape 2x2 if we do A.B and 3x3 if we do B.A.
The figure below shows how the elements of this matrix are calculated:
Let’s create some matrices in numpy using 2D arrays and check this formula:
B = np.array([[0, 1],
[2, 3],
[4, 5]])
C = np.array([[0, 1],
[2, 3],
[4, 5],
[0, 1],
[2, 3],
[4, 5]])
print("A is a {} matrix".format(A.shape))
print("B is a {} matrix".format(B.shape))
print("C is a {} matrix".format(C.shape))
A is a (2, 3) matrix
B is a (3, 2) matrix
C is a (6, 2) matrix
The matrix product in Numpy is a function called dot. We can access it as a method of an array:
In [4]: A.dot(B)
or as a function in Numpy:
In [5]: np.dot(A, B)
In [6]: B.dot(A)
Or, using the np.dot() version, we get the same as these two methods are functionally equivalent:
In [7]: np.dot(B, A)
We can also perform the matrix multiplication C.dot(A), however, matrix multiplications are only possible
along axes with the same length. So, for example, we cannot perform the multiplication A.dot(C).
In [8]: C.dot(A)
For example, uncomment the next line to get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-9c5b5a616184> in <module>()
1 # uncomment the next line to get an error
----> 2 A.dot(C)
In [9]: # A.dot(C)
TIP: remember this ValueError for mismatching shapes. It’s very common when building
Neural Networks.
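The shape rules discussed above can be checked in a few lines. In this sketch B and C have the same values as in the text, while the values of the 2x3 matrix A are a hypothetical choice made only so that the example is self-contained.

```python
import numpy as np

# Hypothetical 2x3 matrix; only its shape matters for this check.
A = np.arange(6).reshape(2, 3)
# Same B (3x2) and C (6x2) as in the text.
B = np.arange(6).reshape(3, 2)
C = np.vstack([B, B])

print(A.dot(B).shape)  # (2, 2)
print(B.dot(A).shape)  # (3, 3)
print(C.dot(A).shape)  # (6, 3)

# A has 3 columns but C has 6 rows, so this product is not defined.
try:
    A.dot(C)
    mismatch = False
except ValueError as err:
    mismatch = True
    print("ValueError:", err)
```

The output shape is always (rows of the first matrix, columns of the second), and the inner dimensions must agree.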
Chain rule
Univariate functions
Consider the function $h(x) = \log(2 + \cos(x))$. How do we calculate the derivative of this function with
respect to x? This function is a composition of the two functions $f(g) = \log(g)$ and $g(x) = 2 + \cos(x)$.
We can calculate the derivative of h with respect to x by applying the chain rule:
$$\frac{dh(x)}{dx} = \frac{d}{dx} f(g(x)) = \frac{df}{dg} \cdot \frac{dg}{dx} \tag{14.4}$$
$$\frac{d}{dx}\left(2 + \cos(x)\right) = -\sin(x) \qquad \frac{d}{dg}\log(g) = \frac{1}{g} \tag{14.5}$$
So finally, the derivative of our nested function h of x is the product of the two derivatives:
$$\frac{d}{dx}\log(2 + \cos(x)) = \frac{-\sin(x)}{2 + \cos(x)} \tag{14.6}$$
Code example
Let’s define all the above functions and verify that the derivative of h calculated with the chain rule is
equivalent to the derivative calculated with the np.diff function.
def f(x):
    return np.log(x)
def g(x):
    return 2 + np.cos(x)
def h(x):
    return f(g(x))
def df(x):
    return 1/x
def dg(x):
    return -np.sin(x)
def dh(x):
    return df(g(x)) * dg(x)
plt.subplot(211)
plt.plot(x, dh(x))
plt.legend(['using the chain rule'])
plt.subplot(212)
plt.plot(x[:-1], np.diff(h(x))/np.diff(x))
plt.legend(['using np.diff'])
The two results are the same (up to numerical errors), as expected!
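The same comparison can also be made numerically instead of visually. The sketch below assumes a grid x between 0 and 10, which is an arbitrary choice for this example, and compares the finite differences computed with np.diff against the chain rule result (14.6) evaluated at the midpoints of the grid.

```python
import numpy as np


def h(x):
    # h(x) = log(2 + cos(x)), the nested function from the text
    return np.log(2 + np.cos(x))


def dh(x):
    # derivative obtained with the chain rule, equation (14.6)
    return -np.sin(x) / (2 + np.cos(x))


# Assumed grid for this example.
x = np.linspace(0, 10, 10001)

# np.diff gives the slope between consecutive points, which approximates
# the derivative at the midpoint of each interval.
midpoints = (x[:-1] + x[1:]) / 2
numeric = np.diff(h(x)) / np.diff(x)

max_error = np.max(np.abs(numeric - dh(midpoints)))
print(max_error)
```

The maximum discrepancy shrinks as the grid gets finer, which is the expected behaviour of a finite difference approximation.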
Multivariate functions
The chain rule can be easily extended to the case where f has multiple functions as arguments:
We simply distribute the chain rule to all the arguments that depend on x:
$$\frac{dh(x)}{dx} = \frac{\partial f}{\partial g} \cdot \frac{dg}{dx} + \frac{\partial f}{\partial k} \cdot \frac{dk}{dx} \tag{14.8}$$
Notice that here we are using the partial derivative symbol ∂ which simply means we are taking the
derivative with respect to one of the variables while keeping all the others fixed.
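Formula (14.8) can be verified numerically on a concrete case. The functions below are a hypothetical choice made for this sketch and are not taken from the book: f(g, k) = g · k with g(x) = sin(x) and k(x) = x², so that h(x) = x² sin(x).

```python
import numpy as np


def g(x):
    return np.sin(x)


def k(x):
    return x ** 2


def h(x):
    # f(g, k) = g * k, a hypothetical choice for this example
    return g(x) * k(x)


def dh(x):
    df_dg = k(x)          # partial derivative of f with respect to g
    df_dk = g(x)          # partial derivative of f with respect to k
    dg_dx = np.cos(x)
    dk_dx = 2 * x
    # equation (14.8): sum of the contributions of both arguments
    return df_dg * dg_dx + df_dk * dk_dx


x = np.linspace(0, 3, 3001)
midpoints = (x[:-1] + x[1:]) / 2
numeric = np.diff(h(x)) / np.diff(x)

max_error = np.max(np.abs(numeric - dh(midpoints)))
print(max_error)
```

Each argument of f contributes one term, and the sum of the two terms matches the finite difference estimate.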
The EWMA is the most important algorithm of your life. We often use this joke in classes to get the
attention of our students. Although this may or may not be true in your particular case, it is true that this
algorithm crops up everywhere, from financial time series to signal processing to Neural Networks.
Different domains name it in different ways but it’s actually always the same thing and it’s worth knowing
how it works in detail.
Let us have a look at how it works. Let’s say we have a sequence of ordered datapoints. These could be the
values of a stock, temperature measurements, anything that is measured in a sequence.
If this data is noisy, we may want to reduce the noise in order to obtain a more accurate estimation of the
underlying actual values. One easy way to remove noise from a time series is to perform a rolling average
or moving average: you wait to accumulate a certain number of observations and use their average as
the estimation of the current value. This method works, but it requires holding the past values in a memory
buffer and constantly updating that buffer when a new data point of the sequence arrives. So if we want to
average over a long window, we have to keep the whole window in memory, and we also cannot calculate
the first average until we have observed at least as many points as the window contains (unless we pad with
zeros).
Rolling averages are available in Pandas through the .rolling() method. Let’s plot a few examples:
EWMA differs from the moving average because it only requires knowledge of the current value of the
data and of the previous value of the EWMA itself.
Let’s indicate the values of our sequence as $x_0, x_1, x_2, \ldots, x_n$. We can calculate the value of the
corresponding EWMA recursively as:
$$y_0 = x_0 \tag{14.9}$$
$$y_n = (1 - \alpha)\, y_{n-1} + \alpha\, x_n \tag{14.10}$$
The two extreme cases of this formula are α = 0, in which case the value of $y_n$ will remain fixed to $x_0$ forever,
and α = 1, in which case $y_n$ will be exactly tracking the value of $x_n$.
If α is between 0 and 1, the EWMA will smooth the signal reducing its fluctuations. Let’s walk through an
example with α = 0.9 to clarify how it works.
When the first point $x_0$ comes in, the EWMA is set to be equal to the raw data, so $y_0 = x_0$.
Then, the second raw value $x_1$ comes in. We take 90% of it and add it to 10% of the previous value of the
moving average $y_0$:
$$y_1 = 0.1\, y_0 + 0.9\, x_1 = 0.1\, x_0 + 0.9\, x_1$$
So, the value of the EWMA will be almost equal to the new value $x_1$, with only a 10% contribution from the
initial value $x_0$.
Then, the third point $x_2$ comes in. Again, we take 90% of its value and add it to 10% of the current EWMA
value $y_1$:
$$y_2 = 0.1\, y_1 + 0.9\, x_2 = 0.01\, x_0 + 0.09\, x_1 + 0.9\, x_2$$
This third value is dominated by the most recent point, but it also contains smaller contributions from
the two previous points.
and here’s y4 :
As you can see the value of y4 is influenced by all the previous values of x in an exponentially decreasing
fashion.
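Unrolling the recursion (14.10) step by step makes the exponentially decreasing weights explicit. The following is a worked expansion under the definitions (14.9) and (14.10), with the numeric weights shown for the α = 0.9 example:

```latex
\begin{aligned}
y_4 &= (1-\alpha)\, y_3 + \alpha\, x_4 \\
    &= \alpha\, x_4 + \alpha(1-\alpha)\, x_3 + \alpha(1-\alpha)^2\, x_2
       + \alpha(1-\alpha)^3\, x_1 + (1-\alpha)^4\, x_0 \\
    &= 0.9\, x_4 + 0.09\, x_3 + 0.009\, x_2 + 0.0009\, x_1 + 0.0001\, x_0
       \qquad (\alpha = 0.9)
\end{aligned}
```

The weights add up to 1, and each step back in the sequence multiplies the weight by a further factor of (1 − α).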
We can continue playing this game at each new point, and all we need to keep in memory is the previous
value of the EWMA y n−1 until we have mixed it with the current raw value of the signal x n .
1. We only keep the last values of the EWMA in memory, no need for a buffer.
2. We can calculate it from the beginning of the sequence instead of waiting to accumulate some values.
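A direct translation of (14.9) and (14.10) into code shows both properties. This is a minimal sketch in plain Python: the function name ewma and the sample signal are made up for this example, and the only state carried from one step to the next is the previous smoothed value.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average following (14.9) and (14.10)."""
    smoothed = []
    previous = None
    for x in values:
        if previous is None:
            # y_0 = x_0: the first output is available immediately
            previous = x
        else:
            # y_n = (1 - alpha) * y_{n-1} + alpha * x_n
            previous = (1 - alpha) * previous + alpha * x
        smoothed.append(previous)
    return smoothed


# Made-up data for illustration.
signal = [1.0, 2.0, 3.0, 2.0, 1.0]

print(ewma(signal, alpha=0.0))  # [1.0, 1.0, 1.0, 1.0, 1.0]
print(ewma(signal, alpha=1.0))  # [1.0, 2.0, 3.0, 2.0, 1.0]
print(ewma(signal, alpha=0.5))  # [1.0, 1.5, 2.25, 2.125, 1.5625]
```

The first two calls reproduce the extreme cases α = 0 (the output stays at the first value) and α = 1 (the output tracks the raw data).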
This formula is very popular and goes under different names in different domains. Statisticians would call it
an autoregressive integrated moving average model, ARIMA(0,1,1), with no constant term. Signal
processing people would call it a first order Infinite Impulse Response (IIR) filter, but it’s the same thing.
The idea is simple. Each new value of the smoothed sequence is a weighted sum of two terms: its own previous
value and the current new value of the sequence. The ratio of the mixing is controlled by the parameter α
(pronounced alpha): very large values will skew the mix towards the raw data, with very little smoothing,
while very small values will skew the mix towards the previous smoothed value, and therefore give very
strong smoothing.
You can notice a couple of things when comparing this plot with the previous one:
1. the smoothed curves start immediately, we don’t have to wait in order to calculate the EWMA
This algorithm is simple and beautiful, and you will encounter it in many places, beyond optimizers for
neural nets.
Tensors
Let’s create a couple of test tensors. We will create a tensor A of order 4 and a tensor B of order 2:
In [16]: A
[[2, 7, 3, 8, 2],
[7, 4, 7, 7, 2],
[0, 4, 1, 5, 6],
[3, 3, 7, 9, 4]],
[[0, 0, 8, 7, 7],
[7, 2, 2, 5, 6],
[1, 1, 6, 5, 5],
[3, 2, 2, 3, 9]]],
[[[1, 1, 4, 4, 2],
[6, 6, 1, 0, 0],
[2, 0, 1, 8, 7],
[4, 1, 7, 2, 5]],
[[6, 4, 7, 6, 9],
[2, 4, 2, 7, 2],
[5, 3, 2, 6, 4],
[5, 2, 8, 3, 9]],
[[3, 6, 9, 5, 0],
[4, 8, 7, 8, 4],
[9, 7, 9, 7, 6],
[2, 9, 1, 4, 9]]]])
In [17]: B
In [18]: A[0, 1, 0, 3]
Out[18]: 8
Tensors can be multiplied by a scalar, and their shape remains the same:
In [19]: A2 = 2 * A
A2
[[[ 2, 2, 8, 8, 4],
[12, 12, 2, 0, 0],
[ 4, 0, 2, 16, 14],
[ 8, 2, 14, 4, 10]],
Out[20]: True
We can also add tensors of the same shape element by element to obtain a third tensor with the same shape:
In [21]: A + A2
One of the most important operations between tensors is the product. If we think about the product between
two scalars, we have no doubts about how to perform it. If we think about two vectors $a = \{a_i\}$ and
$b = \{b_i\}$, we can perform different types of product (for example the dot product or the cross product).
Here we focus on the so-called dot product. The dot product p between a and b is given by:
$$p = \sum_i a_i b_i$$
The operation consists of summing up the products of the components of the two vectors. As you may
observe, the result of the dot product between two vectors is a scalar, which is an entity of lower order
than the two factors. For this reason, this operation is also called contraction.
A similar operation can also be performed between two tensors of higher order, if the two tensors have an
axis with the same length. In this case we can perform a dot product (or a contraction) along that axis. The
shape of the resulting tensor depends on the shapes of the original two tensors that got contracted.
Let’s see a couple of examples. Here are the shapes of A and B. A has order 4, B has order 2:
In [22]: A.shape
Out[22]: (2, 3, 4, 5)
In [23]: B.shape
Out[23]: (2, 3)
Since both A and B have a first axis of length 2, we can perform a tensor dot product along the first axis using
the tensordot function from numpy. In order to perform this product we have to specify not only the two
arguments A and B, but also that we want to perform the operation along the first axis in each of the 2
tensors. This can be done through the argument axes=([0], [0]), as explained in the np.tensordot
documentation.
In [25]: T.shape
Out[25]: (3, 4, 5, 3)
Interesting! Can you see what happened? T has four axes, i.e. it has order 4, $T = \{t_{jkln}\}$. We can calculate
that by thinking how many free indices remained in A and B after the contraction on axis 0. The elements of
A are indicated by four indices $A = \{a_{ijkl}\}$, the elements of B are indicated by two indices $B = \{b_{mn}\}$.
Mathematically, the tensor product performs the operation:
$$T = \{t_{jkln}\} = \sum_i a_{ijkl}\, b_{in}$$
so, the elements of the resulting tensor T are located by 4 indices: 3 coming from the tensor A and 1 coming
from the tensor B.
Let’s do another example. What will be the shape of the tensor product of A and B if we contract along the
first 2 axes? First of all we have to check that the first two axes have the same length. Then we have to change
the argument into axes=([0, 1], [0, 1]).
In [27]: T.shape
Out[27]: (4, 5)
Since both axis 0 and axis 1 have been contracted, the only remaining 2 indices come from axis 2 and 3 of
tensor A. This yields a tensor of order 2.
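Both contractions can be reproduced and cross-checked against their index formulas with np.einsum. The tensors in this sketch are filled with random integers, an assumption made for the example, since only their shapes (2, 3, 4, 5) and (2, 3) matter here.

```python
import numpy as np

# Random integer tensors with the same shapes as A and B in the text.
rng = np.random.RandomState(0)
A = rng.randint(10, size=(2, 3, 4, 5))
B = rng.randint(10, size=(2, 3))

# Contract along the first axis of each tensor: sum over i of a_ijkl * b_in
T1 = np.tensordot(A, B, axes=([0], [0]))
print(T1.shape)  # (3, 4, 5, 3)

# Contract along the first two axes: sum over i and j of a_ijkl * b_ij
T2 = np.tensordot(A, B, axes=([0, 1], [0, 1]))
print(T2.shape)  # (4, 5)

# The same contractions written explicitly with index notation.
same1 = np.array_equal(T1, np.einsum('ijkl,in->jkln', A, B))
same2 = np.array_equal(T2, np.einsum('ijkl,ij->kl', A, B))
print(same1, same2)  # True True
```

The einsum strings mirror the summation formulas: repeated indices are summed over, and the remaining free indices determine the shape of the result.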
Wonderful! We have learned to perform a few operations using tensors! While this may seem really abstract
and removed from the practical applications of Deep Learning, actually it is not. We need to understand
how to arrange our data using tensors if we want to leverage Neural Networks with their full potential.
Now that we know how to operate with tensors, it is time to dive into convolutions!
We will start from 1D convolutions and then extend the definition to 2D arrays.
Convolutions
In chapter 6 we introduced Convolutional Neural Networks and we went a bit fast when talking about
convolutions. Let’s introduce convolutions here in a bit more detail.
In [28]: a_ = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
a = np.array(a_, dtype='float32')
a
Out[28]: array([0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
dtype=float32)
v is a short array of only two elements, while a is a longer array of several elements. The general question we
are looking to answer is how similar the two arrays are. Since they are not the same length we cannot
perform a dot product between the two. We can, however, define two operations involving a and v: the
correlation and the convolution. These operations try to gauge the similarity between the two arrays,
acknowledging the fact that they don’t have the same length and performing a sort of “rolling dot product”.
In both cases we start from the left side of a and we take a short sub-array with the same length as v, in this
case 2 numbers. In Machine Learning this sub-array is called the receptive field.
Then we perform a tensor dot product of v with the receptive field of a, i.e. we multiply the elements of v by
the elements of the sub-array and we sum the products. We then store the result as the first element of our
result array c.
Then we shift the window in a by one number and again perform a product between the new sub-array and
v, also summing at the end. This second value gets stored in the result array as well.
We can continue shifting the window and performing dot products until we reach the end of the array a and
no more shifting is possible.
The difference between convolution and correlation is that in a convolution the array v is flipped before the
multiplication.
For a more precise mathematical definition of the correlation and convolution of a with v we invite the reader
to consult the many detailed resources that can be found online.
Now, let’s see how we can perform these operations with Numpy. The functions np.correlate and
np.convolve perform the correlation and convolution of 1D arrays:
Out[30]: array([ 0., 0., 0., 0., 1., 0., 0., 0., 0., -1., 0., 0., 0.,
0.], dtype=float32)
Out[31]: array([ 0., 0., 0., 0., -1., 0., 0., 0., 0., 1., 0., 0., 0.,
0.], dtype=float32)
Looking at the plot we notice that both convolution and correlation have spikes when there’s a jump in the
array a. Our short filter v looks for jumps in a and the resulting convolution array represents how similar
each window in a is with v.
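The rolling dot product described above can be written out explicitly and compared with the two Numpy functions. The array a is the one defined in the text. The filter v = [-1, 1] is an assumption inferred from the two outputs shown above, and mode='valid' is chosen so that only windows fully inside a are used.

```python
import numpy as np

a = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype='float32')
# Assumed filter, inferred from the outputs shown above.
v = np.array([-1, 1], dtype='float32')

# Correlation: slide v along a and take a dot product at each position.
corr = np.correlate(a, v, mode='valid')
manual = np.array([np.dot(a[i:i + len(v)], v)
                   for i in range(len(a) - len(v) + 1)])

# Convolution: the same operation with the filter flipped.
conv = np.convolve(a, v, mode='valid')
flipped = np.correlate(a, v[::-1], mode='valid')

print(corr)   # +1 where a jumps up, -1 where it jumps down
print(conv)   # same spikes with opposite sign
print(np.array_equal(corr, manual), np.array_equal(conv, flipped))
```

The two checks confirm that the correlation is the rolling dot product and that the convolution is the correlation with a flipped filter.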
TIP: Since in a convolutional Neural Network the filter v is learned during the training
process, it makes no difference whether we flip the filter or not. In fact, if we perform the
flip, the network will simply learn flipped weights. For this reason, convolutional layers in a
Neural Network are actually calculating correlations. We still call them convolutional
layers, but the operation performed is actually a correlation. In what follows we will keep
talking about convolutions, but we’ll keep in mind that flipping the array is not actually
necessary in practice.
2D Convolution
We can easily extend the 1D convolution to 2D convolutions using 2D arrays instead of 1D.
A contains a pattern in the shape of an “X”. For the sake of simplicity, we will also rescale the values of the
array so that the minimum value is -1 and the maximum is +1, but the same concepts apply for different
ranges of values.
Out[33]: array([[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
[-1., -1., 1., -1., -1., -1., -1., -1., 1., -1., -1.],
[-1., -1., -1., 1., -1., -1., -1., 1., -1., -1., -1.],
[-1., -1., -1., -1., 1., -1., 1., -1., -1., -1., -1.],
[-1., -1., -1., -1., -1., 1., -1., -1., -1., -1., -1.],
[-1., -1., -1., -1., 1., -1., 1., -1., -1., -1., -1.],
[-1., -1., -1., 1., -1., -1., -1., 1., -1., -1., -1.],
[-1., -1., 1., -1., -1., -1., -1., -1., 1., -1., -1.],
[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.]])
Let’s display A with a gray colormap (because we’re working with numbers between -1 and 1):
Out[34]: Text(0.5,1,'A')
Out[36]: Text(0.5,1,'V')
The 2D convolution can be calculated with the convolve2d function from scipy.signal. Let’s import
the function from scipy first:
Out[38]: array([[ 5., 1., 1., 3., 3., 3., 5., 1., 1.],
[ 1., 7., -1., 1., 3., 5., -1., 3., 1.],
[ 1., -1., 9., -1., 3., -1., 1., -1., 5.],
[ 3., 1., -1., 9., -3., 1., -1., 5., 3.],
[ 3., 3., 3., -3., 5., -3., 3., 3., 3.],
[ 3., 5., -1., 1., -3., 9., -1., 1., 3.],
[ 5., -1., 1., -1., 3., -1., 9., -1., 1.],
[ 1., 3., -1., 5., 3., 1., -1., 7., 1.],
[ 1., 1., 5., 3., 3., 3., 1., 1., 5.]])
The convolved array is obtained by taking the filter V, flipping it on both axes and then multiplying it with a
3x3 patch in the image. Then we shift the patch to the right and repeat. We start at the first patch on the top
left of the image A[0:3, 0:3], multiply this patch with V_rev element by element, then sum all the values.
We are effectively contracting the flipped 2D tensor V with the patch over both axes:
Out[40]: array(5.)
This produces the first pixel in the output convolution. We then shift to the right by one pixel in A and repeat
the contraction operation:
Out[41]: array(1.)
We can continue doing this and accumulate the result in a new 2D array.
Functionally, we can do manually the exact same thing that convolve2d does, although in practice we
never need to do this thanks to scipy:
for i in range(out_h):
    for j in range(out_w):
        patch_ij = A[i:i+win_h, j:j+win_w]
        try:
            res[i, j] = np.tensordot(patch_ij, V_rev)
        except Exception as ex:
            print(i, j)
            print(patch_ij)
            print(V)
            raise ex
np.allclose(res, C)
Out[42]: True
TIP: the function np.allclose returns True if two arrays are element-wise equal within a
tolerance. See the documentation for details.
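The setup lines for the loop are not shown here, so the following is a self-contained sketch of the same comparison. The arrays A and V are assumptions rebuilt from the outputs printed above (an X-shaped pattern on a -1 background and a diagonal 3x3 filter); they are not copied from the original cells. The loop contracts each 3x3 patch with the flipped filter and checks the result against convolve2d.

```python
import numpy as np
from scipy.signal import convolve2d

# Assumed inputs, rebuilt from the printed outputs:
# an 11x11 image with an X drawn in +1 on a -1 background,
# and a 3x3 filter with +1 on the main diagonal and -1 elsewhere.
A = -np.ones((11, 11))
for k in range(2, 9):
    A[k, k] = 1
    A[k, 10 - k] = 1
V = 2 * np.eye(3) - 1

# Reference result from scipy
C = convolve2d(A, V, mode='valid')

# Manual version: flip the filter on both axes, then slide it over the image
V_rev = V[::-1, ::-1]
win_h, win_w = V.shape
out_h = A.shape[0] - win_h + 1
out_w = A.shape[1] - win_w + 1
res = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch_ij = A[i:i + win_h, j:j + win_w]
        # tensordot with its default axes=2 contracts both axes: multiply and sum
        res[i, j] = np.tensordot(patch_ij, V_rev)

print(np.allclose(res, C))
```

With these assumed inputs the first two entries of the top row come out as 5 and 1, in agreement with Out[40] and Out[41].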
Notice that we can rescale the product by the number of elements in the filter V, which is 9, to obtain:
In [43]: C_resc = C / 9
C_resc.round(2)
Out[43]: array([[ 0.56, 0.11, 0.11, 0.33, 0.33, 0.33, 0.56, 0.11, 0.11],
[ 0.11, 0.78, -0.11, 0.11, 0.33, 0.56, -0.11, 0.33, 0.11],
[ 0.11, -0.11, 1. , -0.11, 0.33, -0.11, 0.11, -0.11, 0.56],
[ 0.33, 0.11, -0.11, 1. , -0.33, 0.11, -0.11, 0.56, 0.33],
[ 0.33, 0.33, 0.33, -0.33, 0.56, -0.33, 0.33, 0.33, 0.33],
[ 0.33, 0.56, -0.11, 0.11, -0.33, 1. , -0.11, 0.11, 0.33],
[ 0.56, -0.11, 0.11, -0.11, 0.33, -0.11, 1. , -0.11, 0.11],
[ 0.11, 0.33, -0.11, 0.56, 0.33, 0.11, -0.11, 0.78, 0.11],
[ 0.11, 0.11, 0.56, 0.33, 0.33, 0.33, 0.11, 0.11, 0.56]])
Four pixels in the resulting convolution are exactly equal to 1, corresponding to a perfect match of the filter
with the image at those locations. The other pixels have smaller values, indicating only a partial match.
Convolutions can be used to apply filters to images, for example to blur them or to detect edges. Let’s have a
look at one example. We load an example image from keras.datasets.mnist:
In [48]: img.shape
Let’s filter this image with 3x3 kernels that recognize lines:
f2 = np.array([[ 0, 0, 1],
               [ 0, 1, 0],
               [ 1, 0, 0]])
f4 = np.array([[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]])
Let’s see what these kernels look like visually with the method imshow():
plt.subplot(222)
plt.imshow(f2, cmap='gray')
plt.title('filter f2')
plt.subplot(223)
plt.imshow(f3, cmap='gray')
plt.title('filter f3')
plt.subplot(224)
plt.imshow(f4, cmap='gray')
plt.title('filter f4')
plt.tight_layout()
plt.show()
Now let’s run the 2D convolution on the image and see what these convolutions produce:
plt.subplot(221)
res = convolve2d(img, f1, mode='valid')
plt.imshow(res, cmap='gray')
plt.title('filtered with f1')
plt.subplot(222)
res = convolve2d(img, f2, mode='valid')
plt.imshow(res, cmap='gray')
plt.title('filtered with f2')
plt.subplot(223)
res = convolve2d(img, f3, mode='valid')
plt.imshow(res, cmap='gray')
plt.title('filtered with f3')
plt.subplot(224)
res = convolve2d(img, f4, mode='valid')
plt.imshow(res, cmap='gray')
plt.title('filtered with f4')
plt.tight_layout()
plt.show()
14.5. BACKPROPAGATION FOR RECURRENT NETWORKS 605
Great! We have seen how convolutions can be used to filter images. Each pixel in the filtered image is the
result of a tensor contraction of the filter with a patch in the original image. In this respect, the convolution
is the operation that allows us to leverage the fact that information is related to spatial patterns of nearby
pixels.
TIP: If you’ve ever used an image program like Adobe Photoshop, convolutions like these are
how its image filters are created.
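To see the effect of a line-detecting kernel without downloading MNIST, here is a minimal sketch on a synthetic image. The 8x8 image (dark left half, bright right half) is an assumption made up for illustration; the kernel has the same values as f4 above.

```python
import numpy as np
from scipy.signal import convolve2d

# Synthetic 8x8 image (illustrative assumption): left half 0, right half 1
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Vertical-line kernel, same values as f4
f4 = np.array([[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]])

res = convolve2d(img, f4, mode='valid')
print(res.shape)
print(res[0])
```

The filtered image is zero wherever the patch is uniform and non-zero only in the columns whose 3x3 patch straddles the vertical edge, which is what we mean when we say the kernel "recognizes" vertical lines.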
$$
\begin{aligned}
z_t &= w\,h_{t-1} + u\,x_t && (14.24) \\
h_t &= \phi(z_t) && (14.25) \\
r_t &= v\,h_t && (14.26) \\
\hat{y}_t &= \phi(r_t) && (14.27)
\end{aligned}
$$
where we replaced the tanh activation function with a generic activation ϕ and allowed for different weights
on the recurrent relation and the output relation.
[Figure: recurrent_2.png]
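Here an overline denotes the derivative of the cost J with respect to the overlined quantity, as defined in
(14.29):
$$
\begin{aligned}
\overline{\hat{y}} &= \frac{\partial J}{\partial \hat{y}} && (14.29) \\
\overline{r_t} &= \overline{\hat{y}}\,\phi'(r_t) && (14.30) \\
\overline{h_t} &= \overline{r_t}\,v + \overline{z_{t+1}}\,w && (14.31) \\
\overline{z_t} &= \overline{h_t}\,\phi'(z_t) && (14.32) \\
\overline{u} &= \sum_{t=0}^{T} \overline{z_t}\,x_t && (14.33) \\
\overline{v} &= \sum_{t=0}^{T} \overline{r_t}\,h_t && (14.34) \\
\overline{w} &= \sum_{t=0}^{T} \overline{z_{t+1}}\,h_t && (14.35)
\end{aligned}
$$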
As you can see, these relations are very similar to the fully connected backpropagation relations we saw in
Chapter 5, with one big difference: the updates to the weights require a summation over the contributions
from all time steps.
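The relations above can be checked numerically on a tiny example. The sketch below is written under several assumptions that are not in the text: scalar weights w, u and v, tanh as the activation ϕ, an initial state of zero before the first step, and half the summed squared error as the cost J. The names forward, backward and cost are made up for this illustration.

```python
import numpy as np

def forward(x, w, u, v):
    """Forward pass of the scalar recurrent network, with phi = tanh."""
    T = len(x)
    z, h, r, y_hat = (np.zeros(T) for _ in range(4))
    h_prev = 0.0  # assumed initial state
    for t in range(T):
        z[t] = w * h_prev + u * x[t]
        h[t] = np.tanh(z[t])
        r[t] = v * h[t]
        y_hat[t] = np.tanh(r[t])
        h_prev = h[t]
    return z, h, r, y_hat

def cost(y_hat, y):
    # Assumed cost: half the summed squared error
    return 0.5 * np.sum((y_hat - y) ** 2)

def dtanh(a):
    return 1.0 - np.tanh(a) ** 2

def backward(x, y, w, u, v):
    """Gradients of the cost with respect to u, v and w."""
    z, h, r, y_hat = forward(x, w, u, v)
    T = len(x)
    y_bar = y_hat - y             # dJ/dy_hat for the assumed cost
    r_bar = y_bar * dtanh(r)
    h_bar = np.zeros(T)
    z_bar = np.zeros(T)
    z_bar_next = 0.0              # nothing flows back from beyond the last step
    for t in reversed(range(T)):
        h_bar[t] = r_bar[t] * v + z_bar_next * w
        z_bar[t] = h_bar[t] * dtanh(z[t])
        z_bar_next = z_bar[t]
    u_bar = np.sum(z_bar * x)
    v_bar = np.sum(r_bar * h)
    w_bar = np.sum(z_bar[1:] * h[:-1])  # pairs z_bar at t+1 with h at t
    return u_bar, v_bar, w_bar

rng = np.random.RandomState(0)
x = rng.normal(size=6)
y = np.tanh(rng.normal(size=6))
w, u, v = 0.5, -0.3, 0.8
u_bar, v_bar, w_bar = backward(x, y, w, u, v)
print(u_bar, v_bar, w_bar)
```

Each weight gradient is a sum over all time steps, which is the difference with the fully connected case. Comparing these values with finite differences of the cost is a quick way to convince yourself the recursion is right.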
Getting Started Exercises Solutions
15
In [1]: with open('../course/common.py') as fin:
    exec(fin.read())
Exercise 1
Let’s practice a little bit with numpy.
In [5]: a
Out[5]: array([[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.]])
In [7]: b
In [9]: c
Out[9]: array([[ 5., 0., 5., 0., 5., 0., 5., 0., 5., 0.],
[ 0., 6., 0., 6., 0., 6., 0., 6., 0., 6.],
[ 7., 0., 7., 0., 7., 0., 7., 0., 7., 0.],
[ 0., 8., 0., 8., 0., 8., 0., 8., 0., 8.],
[ 9., 0., 9., 0., 9., 0., 9., 0., 9., 0.],
[ 0., 10., 0., 10., 0., 10., 0., 10., 0., 10.],
[11., 0., 11., 0., 11., 0., 11., 0., 11., 0.],
[ 0., 12., 0., 12., 0., 12., 0., 12., 0., 12.],
[13., 0., 13., 0., 13., 0., 13., 0., 13., 0.],
[ 0., 14., 0., 14., 0., 14., 0., 14., 0., 14.]])
In [10]: c.mean(axis=0)
In [11]: c.mean(axis=1)
In [12]: c.std(axis=0)
In [13]: c.std(axis=1)
In [14]: d = c[c>0].reshape(10, 5)
In [15]: d
In [17]: e = d + noise
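Several of the cells that build these arrays are not shown, so here is one possible way to obtain arrays with the same printed values. The definitions of b and noise are assumptions: b is taken to be a 10x10 array whose row i is filled with 5 + i (so that a * b reproduces Out[9]), and the noise scale of 0.1 is arbitrary.

```python
import numpy as np

# Checkerboard of ones and zeros, matching Out[5]
a = np.zeros((10, 10))
a[::2, ::2] = 1
a[1::2, 1::2] = 1

# Hypothetical b: row i is filled with the value 5 + i
b = np.arange(5, 15).reshape(10, 1) * np.ones((1, 10))
c = a * b

# Keep only the non-zero entries: 5 per row, so the result is 10x5
d = c[c > 0].reshape(10, 5)

# Assumed noise: Gaussian with a small standard deviation
noise = np.random.normal(0, 0.1, size=d.shape)
e = d + noise

print(c.mean(axis=1))
```

Boolean indexing walks the array row by row, which is why each row of d holds a single repeated value.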
Exercise 2
Practice plotting with matplotlib
• use plt.imshow() to display the array a as an image, does it look like a checkerboard?
• display c, d and e using the same function, change the colormap to grayscale
• plot e using a line plot, assigning each row to a different data series. This should produce a plot with
noisy horizontal lines. You will need to transpose the array to obtain this.
• add title, axes labels, legend and a couple of annotations
In [18]: plt.imshow(a)
In [22]: plt.plot(e.transpose())
plt.title("Noisy lines")
plt.xlabel("the x axis")
plt.ylabel("the y axis")
plt.annotate("The light blue line",
             xy=(1, 14), xytext=(0, 12.3),
             arrowprops={"arrowstyle": '-|>'},
             fontsize=12);
Exercise 3
Reuse your code
Encapsulate the code that calculates the decision boundary in a nice function called
plot_decision_boundary with the signature:
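The signature is not reproduced here, but the solutions that follow call the function as plot_decision_boundary(model, X, y). A minimal sketch along those lines is given below. It assumes only that model has a predict method that accepts an array of 2D points; the grid size, colors and the ThresholdModel stand-in are illustrative choices, not the book's code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs outside a notebook
import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y):
    # Build a grid that covers the data with a small margin
    hticks = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 101)
    vticks = np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 101)
    aa, bb = np.meshgrid(hticks, vticks)
    ab = np.c_[aa.ravel(), bb.ravel()]

    # Predict on every grid point and reshape back to the grid
    c = np.asarray(model.predict(ab)).reshape(aa.shape)

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.contourf(aa, bb, c, cmap="bwr", alpha=0.2)
    ax.plot(X[y == 0, 0], X[y == 0, 1], "ob", alpha=0.5, label="0")
    ax.plot(X[y == 1, 0], X[y == 1, 1], "xr", alpha=0.5, label="1")
    ax.legend()
    return ax

class ThresholdModel:
    """Stand-in for a trained model: predicts 1 when x0 + x1 > 0."""
    def predict(self, X):
        return (X[:, 0] + X[:, 1] > 0).astype(float)

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
ax = plot_decision_boundary(ThresholdModel(), X, y)
```

Any trained Keras model with a single sigmoid output can be passed in place of ThresholdModel, since it also exposes predict.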
Exercise 4
Practice retraining the model on different data.
• Use the functions make_blobs and make_moons from Scikit-Learn to generate new datasets with 2
classes
• plot the data to make sure you understand what has been generated
• Re-train your model on each of these datasets
• Display the decision boundary for each of these models
X, y = make_circles(n_samples=1000,
                    noise=0.1,
                    factor=0.2,
                    random_state=0)
In [27]: plot_decision_boundary(model, X, y)
X, y = make_blobs(n_samples=1000,
                  centers=2,
                  random_state=0)
metrics=['accuracy'])
model.fit(X, y, epochs=30, verbose=0);
In [30]: plot_decision_boundary(model, X, y)
X, y = make_moons(n_samples=1000,
                  noise=0.1,
                  random_state=0)
In [33]: plot_decision_boundary(model, X, y)
Data Manipulation Exercises Solutions
16
In [1]: with open('../course/common.py') as fin:
    exec(fin.read())
Exercise 1
• load the dataset: ../data/international-airline-passengers.csv
• inspect it using the .info() and .head() commands
• use the function pd.to_datetime() to change the column type of ‘Month’ to a datetime type (you
can find the doc here:
http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_datetime.html)
• set the index of df to be a datetime index using the column ‘Month’ and the df.set_index() method
• choose the appropriate plot and display the data
• choose appropriate scale
• label the axes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 2 columns):
Month 144 non-null object
Thousand Passengers 144 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.3+ KB
In [5]: df.head()
Out[5]:
df['Month'] = pd.to_datetime(df['Month'])
df = df.set_index('Month')
In [7]: df.head()
Out[7]:
Thousand Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
df.plot();
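The steps above can be tried on a small inline sample without the CSV file. The five rows below reuse the values shown in Out[7]; the 'YYYY-MM' text format of the Month column is an assumption about how the file stores dates.

```python
import io
import pandas as pd

# Small inline sample standing in for the CSV file (Month format is assumed)
csv_text = """Month,Thousand Passengers
1949-01,112
1949-02,118
1949-03,132
1949-04,129
1949-05,121
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)                       # Month is read as a generic object column

df['Month'] = pd.to_datetime(df['Month'])
df = df.set_index('Month')
print(df.index)                        # now a DatetimeIndex
```

Once the index is a DatetimeIndex, df.plot() places the dates on the horizontal axis automatically.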
Exercise 2
• load the dataset: ../data/weight-height.csv
• inspect it
• plot it using a scatter plot with Weight as a function of Height
• plot the male and female populations with 2 different colors on a new scatter plot
• remember to label the axes
Out[9]:
In [10]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
Gender 10000 non-null object
Height 10000 non-null float64
Weight 10000 non-null float64
dtypes: float64(2), object(1)
memory usage: 234.5+ KB
In [11]: df.describe()
Out[11]:
Height Weight
count 10000.000000 10000.000000
mean 66.367560 161.440357
std 3.847528 32.108439
min 54.263133 64.700127
25% 63.505620 135.818051
50% 66.318070 161.212928
75% 69.174262 187.169525
max 78.998742 269.989699
In [12]: df['Gender'].value_counts()
Out[12]:
Gender
Male 5000
Female 5000
In [15]: # method 2
mfmap = {'Male': 'blue', 'Female': 'red'}
df['Gendercolor'] = df['Gender'].map(mfmap)
df.head()
Out[15]:
In [16]: df.plot(kind='scatter',
                 x='Height',
                 y='Weight',
                 c=df['Gendercolor'],
                 alpha=0.3,
                 title='Male & Female Populations');
In [17]: # method 3
fig, ax = plt.subplots()
ax.plot(males['Height'], males['Weight'], 'ob',
        females['Height'], females['Weight'], 'or',
        alpha=0.3)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Male & Female Populations');
Exercise 3
• plot the histogram of the heights for males and for females on the same plot
• use alpha to control transparency in the plot command
• plot a vertical line at the mean of each population using plt.axvline()
• bonus: plot the cumulative distributions
In [18]: males['Height'].plot(kind='hist',
                              bins=50,
                              range=(50, 80),
                              alpha=0.3,
                              color='blue')
females['Height'].plot(kind='hist',
                       bins=50,
                       range=(50, 80),
                       alpha=0.3,
                       color='red')
plt.title('Height distribution')
plt.legend(["Males", "Females"])
plt.xlabel("Height (in)")
plt.axvline(males['Height'].mean(),
            color='blue', linewidth=2)
plt.axvline(females['Height'].mean(),
            color='red', linewidth=2);
In [19]: males['Height'].plot(kind='hist',
                              bins=200,
                              range=(50, 80),
                              alpha=0.3,
                              color='blue',
                              cumulative=True,
                              density=True)
females['Height'].plot(kind='hist',
                       bins=200,
                       range=(50, 80),
                       alpha=0.3,
                       color='red',
                       cumulative=True,
                       density=True)
plt.title('Height distribution')
plt.legend(["Males", "Females"])
plt.xlabel("Height (in)")
plt.axhline(0.8)
plt.axhline(0.5)
plt.axhline(0.2);
Exercise 4
• plot the weights of the males and females using a box plot
• which one is easier to read?
• (remember to put in titles, axes and legends)
In [21]: dfpvt.head()
Out[21]:
In [22]: dfpvt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 2 columns):
Female 5000 non-null float64
Male 5000 non-null float64
dtypes: float64(2)
memory usage: 234.4 KB
In [23]: dfpvt.plot(kind='box')
plt.title('Weight Box Plot')
plt.ylabel("Weight (lbs)");
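The cell that builds dfpvt is not shown. The printed info (10000 rows, with 5000 non-null values in each of the Female and Male columns) is consistent with a pivot on the Gender column, so here is a sketch of that idea on a tiny synthetic frame. The data values are made up, and the use of pivot is an assumption about how dfpvt was created.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the weight-height data (values are made up)
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'Gender': ['Male'] * 5 + ['Female'] * 5,
    'Weight': np.r_[rng.normal(187, 20, 5), rng.normal(136, 19, 5)],
})

# One column per gender; rows keep the original index, so each row
# has a value in one column and NaN in the other
dfpvt = df.pivot(columns='Gender', values='Weight')
print(dfpvt.count())
```

The box plot ignores the NaN entries, so each box summarizes only the rows that belong to that gender.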
Exercise 5
• load the dataset: ../data/titanic-train.csv
• learn about scattermatrix here: http://pandas.pydata.org/pandas-docs/stable/visualization.html
• display the data using a scattermatrix
In [24]: df = pd.read_csv('../data/titanic-train.csv')
df.head()
Out[24]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
Exercise 1
You’ve just been hired at a real estate investment firm and they would like you to build a model for pricing
houses. You are given a dataset that contains data for house prices and a few features like number of
bedrooms, size in square feet and age of the house. Let’s see if you can build a model that is able to predict
the price. In this exercise we extend what we have learned about linear regression to a dataset with more
than one feature. Here are the steps to complete it:
634 CHAPTER 17. MACHINE LEARNING EXERCISES SOLUTIONS
– normalize the input features with one of the rescaling techniques mentioned above
– use a different value for the learning rate of your model
– use a different optimizer
• once you’re satisfied with training, check the R² score on the test set
Out[3]:
plt.tight_layout()
In [9]: # split the data into train and test with a 20% test size
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2)
In [10]: # train the model on the training set and check its
# accuracy on training and test set
# how's your model doing? Is the loss growing smaller?
model.fit(X_train, y_train, epochs=20, verbose=0);
In [11]: df.describe()
Out[11]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
r_ = r2_score(y_train, y_train_pred)
print("R2 score on Train set is:\t{:0.3f}".format(r_))
r_ = r2_score(y_test, y_test_pred)
print("R2 score on Test set is:\t{:0.3f}".format(r_))
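For reference, the R² score is one minus the ratio between the residual sum of squares and the total sum of squares. The following sketch computes it by hand with numpy; the function name r2 and the four sample values are only for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # what the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
score = r2(y_true, y_pred)
print("R2 score is:\t{:0.3f}".format(score))
```

A model that always predicts the mean of y_true scores 0 and a perfect model scores 1, so a negative score on the test set means the model is doing worse than a constant.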
Exercise 2
Your boss was extremely happy with your work on the housing price prediction model and decided to
entrust you with a more challenging task. They’ve seen a lot of people leave the company recently and they
would like to understand why that’s happening. They have collected historical data on employees and they
would like you to build a model that is able to predict which employee will leave next. They would like a
model that is better than random guessing. They also prefer false negatives to false positives, in this first
phase. Fields in the dataset include:
Your goal is to predict the binary outcome variable left using the rest of the data. Since the outcome is
binary, this is a classification problem. Here are some things you may want to try out:
• load the dataset at ../data/HR_comma_sep.csv, inspect it with .head(), .info() and .describe().
• Establish a benchmark: what would be your accuracy score if you predicted that everyone stays?
• Check if any feature needs rescaling. You may plot a histogram of the feature to decide which
rescaling method is more appropriate.
• convert the categorical features into binary dummy columns. You will then have to combine them
with the numerical features using pd.concat.
• do the usual train/test split with a 20% test size
• play around with learning rate and optimizer
• check the confusion matrix, precision and recall
• check if you still get the same results if you use a 5-Fold cross validation on all the data
• Is the model good enough for your boss?
As you will see in this exercise, this logistic regression model is not good enough to help your boss. In the
next chapter we will learn how to go beyond linear models.
df = pd.read_csv('../data/HR_comma_sep.csv')
In [19]: df.head()
Out[19]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
In [20]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level 14999 non-null float64
last_evaluation 14999 non-null float64
number_project 14999 non-null int64
average_montly_hours 14999 non-null int64
time_spend_company 14999 non-null int64
Work_accident 14999 non-null int64
left 14999 non-null int64
promotion_last_5years 14999 non-null int64
sales 14999 non-null object
salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
In [21]: df.describe()
Out[21]:
df.left.value_counts() / len(df)
Out[22]:
left
0 0.761917
1 0.238083
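These fractions give the benchmark directly: a model that predicts "stay" for everyone is right about 76% of the time. The sketch below makes that explicit with numpy. The class counts (11428 and 3571) are an assumption derived from the fractions above and the 14999 rows reported by df.info().

```python
import numpy as np

# Assumed class counts, consistent with 14999 rows and the fractions above
y = np.r_[np.zeros(11428, dtype=int), np.ones(3571, dtype=int)]

# Benchmark "model": predict that nobody leaves
y_pred = np.zeros_like(y)

benchmark = (y_pred == y).mean()
recall_leave = ((y_pred == 1) & (y == 1)).sum() / (y == 1).sum()

print("Benchmark accuracy: {:0.3f}".format(benchmark))
print("Recall on leavers:  {:0.3f}".format(recall_leave))
```

The benchmark never identifies a single leaver, which is why accuracy alone is not enough here and the exercise also asks for the confusion matrix, precision and recall.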
In [24]: df['average_montly_hours_100'] = \
    df['average_montly_hours']/100.0
In [25]: df['time_spend_company'].plot(kind='hist');
In [27]: df.columns
In [28]: X = pd.concat([df[['satisfaction_level',
                            'last_evaluation',
                            'number_project',
                            'time_spend_company',
                            'Work_accident',
                            'promotion_last_5years',
                            'average_montly_hours_100']],
                        df_dummies], axis=1).values
y = df['left'].values
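The cell that defines df_dummies is not shown. A common way to build it is pd.get_dummies on the two categorical columns, as sketched below on a three-row stand-in. The sample rows and the 'technical' department value are made up, and the use of get_dummies is an assumption about the original cell.

```python
import pandas as pd

# Three-row stand-in for the HR data (values are illustrative)
df = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11],
    'sales': ['sales', 'technical', 'sales'],
    'salary': ['low', 'medium', 'high'],
})

# One binary column per category; dtype=float keeps the final array numeric
df_dummies = pd.get_dummies(df[['sales', 'salary']], dtype=float)
print(df_dummies.columns.tolist())

X = pd.concat([df[['satisfaction_level']], df_dummies], axis=1).values
print(X.shape)
```

On the full dataset the same approach gives the 20 input columns used by the model (input_dim=20), provided the two categorical columns expand into 13 dummy columns next to the 7 numerical ones.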
In [29]: X.shape
model = Sequential()
model.add(Dense(1, input_dim=20, activation='sigmoid'))
model.compile(Adam(lr=0.5),
              'binary_crossentropy',
              metrics=['accuracy'])
Epoch 1/1
11999/11999 [==============================] - 1s 78us/step - loss: 0.5358 -
acc: 0.7637
pretty_confusion_matrix(y_test, y_test_pred,
labels=['Stay', 'Leave'])
Out[36]:
In [39]: # check if you still get the same results if you use a 5-Fold cross validation on all t
def build_logistic_regr():
    model = Sequential()
    model.add(Dense(1, input_dim=20, activation='sigmoid'))
    model.compile(Adam(lr=0.5),
                  'binary_crossentropy',
                  metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn=build_logistic_regr,
                        epochs=10, verbose=0)
In [42]: scores
No, the model is not good enough for my boss, since it performs no better than the benchmark.
Deep Learning Exercises Solutions
18
In [1]: with open('../course/common.py') as fin:
    exec(fin.read())
Exercise 1
The Pima Indians dataset is a very famous dataset distributed by UCI and originally collected by the
National Institute of Diabetes and Digestive and Kidney Diseases. It contains data from clinical exams for
women aged 21 and above of Pima Indian heritage. The objective is to predict, based on diagnostic
measurements, whether a patient has diabetes.
• Load the ../data/diabetes.csv dataset, use pandas to explore the range of each feature
• For each feature draw a histogram. Bonus points if you draw all the histograms in the same figure.
• Explore correlations of features with the outcome column. You can do this in several ways, for
example using the sns.pairplot we used above or drawing a heatmap of the correlations.
• Do features need standardization? If so, what standardization technique will you use? MinMax?
Standard?
• Prepare your final X and y variables to be used by a ML model. Make sure you define your target
variable well. Will you need dummy columns?
In [4]: df = pd.read_csv('../data/diabetes.csv')
df.head()
Out[4]:
In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies 768 non-null int64
Glucose 768 non-null int64
BloodPressure 768 non-null int64
SkinThickness 768 non-null int64
Insulin 768 non-null int64
BMI 768 non-null float64
DiabetesPedigreeFunction 768 non-null float64
Age 768 non-null int64
Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [9]: df.describe()
Out[9]:
In [12]: sc = StandardScaler()
X = sc.fit_transform(df.drop('Outcome', axis=1))
y = df['Outcome'].values
y_cat = to_categorical(y)
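To make the two transformations concrete, here is the same arithmetic written with plain numpy on a made-up 4x2 array. It is a sketch of what the scaler and the one-hot encoding compute, not a replacement for StandardScaler or to_categorical.

```python
import numpy as np

# Made-up features on very different scales, and a binary target
X_raw = np.array([[1.0, 200.0],
                  [2.0, 220.0],
                  [3.0, 240.0],
                  [4.0, 260.0]])
y = np.array([0, 1, 1, 0])

# Standardization: subtract the column mean, divide by the column
# standard deviation (population std, i.e. numpy's default ddof=0)
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# One-hot encoding: row i of the identity matrix encodes class i
y_cat = np.eye(2)[y]

print(X.mean(axis=0), X.std(axis=0))
print(y_cat)
```

After standardization every column has zero mean and unit standard deviation, so no single feature dominates the weighted sums of the first layer just because of its units.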
Exercise 2
Build a fully connected NN model that predicts diabetes. Follow these steps:
• Split your data into train and test sets with a test size of 20% and random_state = 22
• define a sequential model with at least one inner layer. You will have to make choices for the following
things:
In [14]: X.shape
Out[14]: (768, 8)
Out[21]: 0.7272727272727273
Exercise 3
Compare your work with the results presented in this notebook. Are your Neural Network results better or
worse than the results obtained by traditional Machine Learning techniques?
• Try training a Support Vector Machine or a Random Forest model on the exact same train/test split.
Is the performance better or worse?
• Try restricting your features to only 4 features like in the suggested notebook. How does model
performance change?
================================================================================
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
================================================================================
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
--------------------------------------------------------------------------------
Accuracy score: 0.721
Confusion Matrix:
[[89 11]
[32 22]]
================================================================================
GaussianNB(priors=None)
--------------------------------------------------------------------------------
Accuracy score: 0.708
Confusion Matrix:
[[87 13]
[32 22]]
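The accuracy scores above can be recovered from the confusion matrices, and the same matrices also give precision and recall. The sketch below assumes the Scikit-Learn convention, where rows are the true classes and columns are the predicted classes; the function name summarize is made up for this illustration.

```python
import numpy as np

def summarize(cm):
    """Accuracy, precision and recall for a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]] (rows: true, columns: predicted)."""
    cm = np.asarray(cm)
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

svc_cm = [[89, 11], [32, 22]]
nb_cm = [[87, 13], [32, 22]]

for name, cm in [("SVC", svc_cm), ("GaussianNB", nb_cm)]:
    acc, prec, rec = summarize(cm)
    print("{}: accuracy {:0.3f}, precision {:0.3f}, recall {:0.3f}".format(
        name, acc, prec, rec))
```

Both models find the same 22 of the 54 positive cases in the test set, so their recall is identical; they differ only in how many negative cases they flag by mistake.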
Deep Learning Internals Exercises Solutions
19
In [1]: with open('../course/common.py') as fin:
    exec(fin.read())
Exercise 1
You’ve just been hired at a wine company and they would like you to help them build a model that predicts
the quality of their wine based on several measurements. They give you a dataset with wine
In [3]: df = pd.read_csv('../data/wines.csv')
In [4]: df.head()
Out[4]:
Class Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanins Color_intensity Hue OD280-OD315_of_diluted_wines Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
In [5]: y = df['Class']
In [6]: y.value_counts()
Out[6]:
Class
2 71
1 59
3 48
In [8]: y_cat.head()
Out[8]:
1 2 3
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
In [10]: X.shape
In [14]: sc = StandardScaler()
In [17]: K.clear_session()
model = Sequential()
model.add(Dense(5, input_shape=(13,),
                kernel_initializer='he_normal',
                activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(RMSprop(lr=0.1),
              'categorical_crossentropy',
              metrics=['accuracy'])
model.fit(Xsc, y_cat.values,
          batch_size=8,
          epochs=10,
          verbose=0,
          validation_split=0.2);
Exercise 2
Since this dataset has 13 features we can only visualize pairs of features like we did in the Paired plot. We
could however exploit the fact that a Neural Network is a function to extract 2 high level features to
represent our data.
In [18]: K.clear_session()
model = Sequential()
model.add(Dense(8, input_shape=(13,),
                kernel_initializer='he_normal',
                activation='tanh'))
model.add(Dense(5, kernel_initializer='he_normal',
                activation='tanh'))
model.add(Dense(2, kernel_initializer='he_normal',
                activation='tanh'))
model.add(Dense(3, activation='softmax'))
model.compile(RMSprop(lr=0.05),
              'categorical_crossentropy',
              metrics=['accuracy'])
model.fit(Xsc, y_cat.values,
          batch_size=16,
          epochs=20,
          verbose=0);
In [19]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 8) 112
_________________________________________________________________
dense_2 (Dense) (None, 5) 45
_________________________________________________________________
dense_3 (Dense) (None, 2) 12
_________________________________________________________________
dense_4 (Dense) (None, 3) 9
=================================================================
Total params: 178
Trainable params: 178
Non-trainable params: 0
_________________________________________________________________
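The parameter counts in the summary follow from a simple rule: a Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases. A short sketch of that arithmetic for the layer sizes used above (the helper name dense_params is made up):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# 13 input features, then layers with 8, 5, 2 and 3 units
layer_sizes = [13, 8, 5, 2, 3]
params = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(params)

print(params)
print(total)
```

The per-layer values and the total agree with the Param # column and the Total params line printed by model.summary().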
In [23]: features.shape
Out[23]: (178, 2)
Exercise 3
Keras functional API. So far we’ve always used the Sequential model API in Keras. However, Keras also
offers a Functional API, which is much more powerful. You can find its documentation here. Let’s see how
we can leverage it.
In [26]: K.clear_session()
inputs = Input(shape=(13,))
x = Dense(8, kernel_initializer='he_normal',
          activation='tanh')(inputs)
x = Dense(5, kernel_initializer='he_normal',
          activation='tanh')(x)
second_to_last = Dense(2, kernel_initializer='he_normal',
                       activation='tanh')(x)
outputs = Dense(3, activation='softmax')(second_to_last)
model.compile(RMSprop(lr=0.05),
              'categorical_crossentropy',
              metrics=['accuracy'])
Exercise 4
Keras offers the possibility to call a function at each epoch. These are Callbacks, and their documentation is
here. Callbacks allow us to add some neat functionality. In this exercise we’ll explore a few of them.
• Split the data into train and test sets with a test_size = 0.3 and random_state=42
• Reset and recompile your model
• train the model on the train data using validation_data=(X_test, y_test)
• Use the EarlyStopping callback to stop your training if the val_loss doesn’t improve
• Use the ModelCheckpoint callback to save the trained model to disk once training is finished
• Use the TensorBoard callback to output your training information to a /tmp/ subdirectory
• Watch the next video for an overview of tensorboard
In [36]: K.clear_session()
inputs = Input(shape=(13,))
x = Dense(8, kernel_initializer='he_normal',
          activation='tanh')(inputs)
x = Dense(5, kernel_initializer='he_normal',
          activation='tanh')(x)
model.compile(RMSprop(lr=0.05),
              'categorical_crossentropy',
              metrics=['accuracy'])
Exercise 1
You’ve been hired by a shipping company to overhaul the way they route mail, parcels and packages. They
want to build an image recognition system capable of recognizing the digits in the zipcode on a package, so
that it can be automatically routed to the correct location. You are tasked to build the digit recognition
system. Luckily, you can rely on the MNIST dataset for the initial training of your model!
Build a deep convolutional Neural Network with at least two convolutional and two pooling layers before
the fully connected layer.
666 CHAPTER 20. CONVOLUTIONAL NEURAL NETWORKS EXERCISES SOLUTIONS
In [7]: X_train.shape
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32) 0
_________________________________________________________________
activation_1 (Activation) (None, 13, 13, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 11, 11, 32) 9248
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 32) 0
_________________________________________________________________
activation_2 (Activation) (None, 5, 5, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 800) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 51264
_________________________________________________________________
dense_2 (Dense) (None, 10) 650
=================================================================
Total params: 61,482
Trainable params: 61,482
Non-trainable params: 0
_________________________________________________________________
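The shapes and parameter counts in this summary can be reproduced with a few lines of arithmetic. The sketch below assumes a 28x28x1 input, 3x3 kernels without padding and 2x2 pooling, which is consistent with the printed numbers; the helper names are made up for the illustration.

```python
def conv2d_valid(h, w, c_in, filters, k=3):
    """Output shape and parameter count of a k x k convolution without padding."""
    out_shape = (h - k + 1, w - k + 1, filters)
    params = (k * k * c_in + 1) * filters   # one bias per filter
    return out_shape, params

def maxpool(h, w, c, pool=2):
    """Output shape of a pool x pool max pooling layer (no parameters)."""
    return (h // pool, w // pool, c)

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

shape = (28, 28, 1)                       # assumed MNIST input shape
shape, p1 = conv2d_valid(*shape, filters=32)
shape = maxpool(*shape)
shape, p2 = conv2d_valid(*shape, filters=32)
shape = maxpool(*shape)

flat = shape[0] * shape[1] * shape[2]     # size after Flatten
p3 = dense_params(flat, 64)
p4 = dense_params(64, 10)
total = p1 + p2 + p3 + p4

print(shape, flat)
print(p1, p2, p3, p4, total)
```

Most of the parameters sit in the first fully connected layer, even though the convolutional layers do most of the feature extraction.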
Exercise 2
Pleased with your performance on the digit recognition task, your boss decides to challenge you with a
harder task. Their online branch allows people to upload images to a website that generates and prints a
postcard that is shipped to its destination. Your boss would like to know what images people are uploading to
the site in order to provide targeted advertising on the same page, so he asks you to build an image
recognition system capable of recognizing a few objects. Luckily for you, there’s a ready-made dataset with a
collection of labeled images. This is the CIFAR-10 dataset, a very famous dataset that contains images for 10
different categories:
• airplane
• automobile
• bird
• cat
• deer
• dog
• frog
• horse
• ship
• truck
In this exercise we will reach the limit of what you can achieve on your laptop. In later chapters we will learn
how to leverage GPUs to speed up training.
Here’s what you have to do:
• load the cifar10 dataset using keras.datasets.cifar10.load_data()
• display a few images, see how hard/easy it is for you to recognize an object with such low resolution
• check the shape of X_train, does it need reshape?
• check the scale of X_train, does it need rescaling?
• check the shape of y_train, does it need reshape?
• build a model with the following architecture, and choose the parameters and activation functions for
each of the layers:
– conv2d
– conv2d
– maxpool
– conv2d
– conv2d
– maxpool
– flatten
– dense
– output
• compile the model and check the number of parameters
• attempt to train the model with the optimizer of your choice. How fast does training proceed?
• If training is too slow, feel free to stop it and read ahead. In the next chapters you’ll learn how to use
GPUs to
In [14]: X_train.shape
In [15]: plt.imshow(X_train[1]);
In [17]: y_train.shape
Out[17]: (50000, 1)
In [19]: y_train_cat.shape
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(10, activation='softmax'))
In [21]: model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop',
                       metrics=['accuracy'])
In [22]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_3 (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
conv2d_4 (Conv2D) (None, 30, 30, 32) 9248
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 15, 15, 32) 0
_________________________________________________________________
conv2d_5 (Conv2D) (None, 15, 15, 64) 18496
_________________________________________________________________
conv2d_6 (Conv2D) (None, 13, 13, 64) 36928
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 6, 6, 64) 0
_________________________________________________________________
flatten_2 (Flatten) (None, 2304) 0
_________________________________________________________________
dense_3 (Dense) (None, 512) 1180160
_________________________________________________________________
dense_4 (Dense) (None, 10) 5130
=================================================================
Total params: 1,250,858
Trainable params: 1,250,858
Non-trainable params: 0
_________________________________________________________________
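The parameter counts in the summary above can be checked by hand. The sketch below assumes 3x3 kernels and a 3-channel input, which is what the printed numbers imply; the helper names are ours and are not part of Keras.

```python
# Reproduce the "Param #" column of the summary with plain arithmetic.
# Assumption: every Conv2D layer uses a 3x3 kernel and the input is RGB.

def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """One weight per kernel cell and input channel, plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_features, units):
    """One weight per input feature, plus one bias per unit."""
    return (in_features + 1) * units

layer_params = [
    conv2d_params(3, 3, 3, 32),     # conv2d_3
    conv2d_params(3, 3, 32, 32),    # conv2d_4
    conv2d_params(3, 3, 32, 64),    # conv2d_5
    conv2d_params(3, 3, 64, 64),    # conv2d_6
    dense_params(6 * 6 * 64, 512),  # dense_3, fed by the 2304 flattened features
    dense_params(512, 10),          # dense_4
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

Pooling and flatten layers have no weights, so they do not appear in the list. Almost all of the parameters sit in the first dense layer.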
Exercise 1
Your manager at the power company is quite satisfied with the work you’ve done predicting the electric load
of the next hour and would like to push it further. He is curious to know if your model can predict the load
on the next day or even on the next week instead of the next hour.
• Go ahead and use the helper function create_lagged_Xy_win we created above to generate new X
and y pairs where the start_lag is 36 hours or even further. You may want to extend the window
size to a little longer than a day.
• Train your best model on this data. You may have to use more than one layer. In which case,
remember to use the return_sequences=True argument in all layers except for the last one so that
they pass sequences to one another.
• Check the goodness of your model by comparing it with test data as well as looking at the R² score.
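To see what start_lag and window_len do to the data, here is a minimal sketch on plain Python lists. It is not the create_lagged_Xy_win helper itself: lagged_windows is a hypothetical name, and we assume the first feature is the series shifted by start_lag, as the shift-based fragment of the helper suggests.

```python
def lagged_windows(series, start_lag, window_len):
    """Pair each target series[t] with window_len past values.

    The most recent value in each window is start_lag steps before t,
    so nothing from the gap between window and target leaks into X.
    """
    X, y = [], []
    first_valid = start_lag + window_len - 1  # earlier rows would need missing data
    for t in range(first_valid, len(series)):
        row = [series[t - start_lag - s] for s in range(window_len)]
        X.append(row)
        y.append(series[t])
    return X, y

hourly_load = list(range(10))  # toy data standing in for the load series
X, y = lagged_windows(hourly_load, start_lag=2, window_len=3)
print(X[0], y[0])
```

With start_lag=36 and window_len=72 the same logic gives each target 72 hourly values, the latest of which is 36 hours old.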
In [3]: df = pd.read_csv('../data/ZonalDemands_2003-2016.csv.bz2',
compression='bz2',
engine='python')
674 CHAPTER 21. TIME SERIES AND RECURRENT NEURAL NETWORKS EXERCISES SOLUTIONS
    if window_len > 1:
        for s in range(1, window_len):
            col_ = 'T_{}'.format(start_lag + s)
            X[col_] = data.shift(start_lag + s)
    X = X.dropna()
    idx = X.index
    y = data.loc[idx]
    return X, y
In [8]: start_lag=36
window_len=72
y_train_t = y_train.values
y_test_t = y_test.values
In [11]: K.clear_session()
model = Sequential()
model.add(LSTM(12, input_shape=(window_len, 1),
kernel_initializer='normal',
return_sequences=True))
model.add(LSTM(6, kernel_initializer='normal'))
model.add(Dense(1))
model.compile(optimizer=Adam(lr=0.05),
loss='mean_squared_error')
Epoch 1/5
93445/93445 [==============================] - 73s 782us/step - loss: 0.2414
Epoch 2/5
93445/93445 [==============================] - 71s 755us/step - loss: 0.1333
Epoch 3/5
93445/93445 [==============================] - 71s 765us/step - loss: 0.1157
Epoch 4/5
93445/93445 [==============================] - 71s 764us/step - loss: 0.1110
Epoch 5/5
93445/93445 [==============================] - 72s 768us/step - loss: 0.0986
Let’s compare the predictions on the test set. We will plot a few days of data and put vertical bars to mark an
interval of 36 hours:
plt.plot(y_test_t, label='y_test')
plt.plot(y_pred, label='y_pred')
plt.legend()
plt.xlim(1100,1500)
plt.axvline(1300)
plt.axvline(1336);
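The exercise also asks for the R² score. As a reminder of what that number measures, here is a minimal sketch of the formula on toy values. The function name r2 and the demo numbers are ours; in practice you would more likely call the r2_score function from Scikit-Learn on y_test_t and y_pred.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true_demo = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true_demo, y_true_demo)  # predictions equal to the targets
mean_only = r2(y_true_demo, [2.5] * 4)  # always predicting the mean
print(perfect, mean_only)
```

A model that always predicts the mean scores 0, so any useful forecast has to score above that.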
Exercise 2
Try swapping the LSTM layer with a GRU layer and re-train the model. Does its performance improve on
the 36 hours lag task?
In [15]: K.clear_session()
model = Sequential()
model.add(GRU(12, input_shape=(window_len, 1),
kernel_initializer='normal',
return_sequences=True))
model.add(GRU(6, kernel_initializer='normal'))
model.add(Dense(1))
model.compile(optimizer=Adam(lr=0.05),
loss='mean_squared_error')
Epoch 1/5
93445/93445 [==============================] - 60s 640us/step - loss: 0.1378
Epoch 2/5
93445/93445 [==============================] - 58s 624us/step - loss: 0.0783
Epoch 3/5
93445/93445 [==============================] - 58s 624us/step - loss: 0.0718
Epoch 4/5
93445/93445 [==============================] - 58s 618us/step - loss: 0.0728
Epoch 5/5
93445/93445 [==============================] - 58s 621us/step - loss: 0.0676
GRU not only trains faster, but also seems to reach a better performance than LSTM on this task.
Exercise 3
Does a fully connected model work well using windows? Let’s find out! Try to train a fully connected model
on the lagged data with windows, which will probably train much faster:
• reshape the input data back to an Order-2 tensor, i.e. eliminate the 3rd axis
• build a fully connected model with one or more layers
• train the fully connected model on the windowed data. Does it work well? Is it faster to train?
model.compile(optimizer='adam', loss='mean_squared_error')
Exercise 4
Predicting the price of Bitcoin from historical data.
You have heard so much talk about Bitcoin and how it is growing that you decide to put your newly
acquired Deep Learning skills to the test by trying to beat the market. The idea is simple: if we could predict
what Bitcoin is going to do in the future, we could trade and profit using that knowledge.
The simplest formulation of this forecasting problem is to try to predict if the price of Bitcoin is going to
go up or down in the future, i.e. we can frame the problem as a binary classification that answers the
question: is Bitcoin going up?
• Check out the data using df.head(). Notice that the dataset contains the close, high, low and open
values for 30-minute intervals, that is, the last, highest, lowest and first amounts of US Dollars
people were willing to exchange Bitcoin for during those 30 minutes. The dataset also contains
Volume values, which we shall ignore, and a weighted average value, which is what we will use to
build the labels.
• Convert the date column to a datetime object using pd.to_datetime and set it as index of the
DataFrame.
• Plot the value of df['close'] to inspect the data. You will notice that it’s not periodic at all and it has
an overall enormous upward trend, so we will need to transform the data into a more stationary
timeseries. We will use percentage changes, i.e. we will look at relative movements in the price
instead of absolute values.
• Create a new dataset df_percent with percent changes using the formula:
$$v_t = 100 \times \frac{x_t - x_{t-1}}{x_{t-1}}$$
this is what we will use next.
• Inspect df_percent and notice that it contains both infinity and nan values. Drop the null
values and replace the infinity values with zero.
• Split the data at January 1st 2017, using the data before then as training and the data after that as
test.
• Use the window method to create an input training tensor X_train_t with the shape
(n_windows, window_len, n_features). This is the main part of the exercise, since you’ll have to
make a few choices and be careful not to leak information from the future. In particular you will
have to:
– decide the window_len you want to use
– decide which features you’d like to use as input (don’t use weightedAverage, since we’ll
need it for the output).
– decide what lag you want to introduce between the last timestep in your input window and
the timestep of the output.
– You can start from the create_lagged_Xy_win function we defined in Chapter 7, but you
will have to modify it to work with numpy arrays because Pandas DataFrames are only good
with 1 feature.
Again disclaimer: past performance is no guarantee of future results. This is not investment
advice.
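The percent change step and the cleaning step can be sketched on a plain list of prices. This is only an illustration with made-up numbers and hypothetical function names; on the actual DataFrame the pandas pct_change method is the natural tool, and it is division by a zero price that produces the infinite values.

```python
import math

def percent_change(values):
    """v_t = 100 * (x_t - x_{t-1}) / x_{t-1}, with nan for the first step."""
    changes = [math.nan]  # the first timestep has no previous value
    for prev, curr in zip(values, values[1:]):
        if prev == 0:
            # mimic floating point division by zero: 0/0 is nan, x/0 is +/-inf
            changes.append(math.nan if curr == 0 else math.copysign(math.inf, curr))
        else:
            changes.append(100.0 * (curr - prev) / prev)
    return changes

def clean(changes):
    """Drop the null values, then replace the infinite values with zero."""
    no_nan = [v for v in changes if not math.isnan(v)]
    return [0.0 if math.isinf(v) else v for v in no_nan]

prices = [100.0, 110.0, 0.0, 5.0, 5.0]  # toy prices, not taken from the dataset
raw = percent_change(prices)
cleaned = clean(raw)
print(raw, cleaned)
```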
In [22]: df = pd.read_json('../data/poloniex_usdt_btc.json.gz',
compression='gzip')
In [24]: df['close'].plot();
21.4. EXERCISE 4 681
In [26]: df_percent.head()
Out[26]:
train = df_percent.loc[:split_date].copy()
test = df_percent.loc[split_date:].copy()
In [30]: start_lag = 1
window_len = 36
In [32]: X_train_t.shape
In [33]: y_train_t.shape
Out[33]: (23956,)
In [34]: K.clear_session()
model = Sequential()
model.add(GRU(24, input_shape=(window_len, 4),
kernel_initializer='normal',
return_sequences=True))
model.add(GRU(18, kernel_initializer='normal',
return_sequences=True))
model.add(GRU(12, kernel_initializer='normal'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(lr=0.05),
loss='binary_crossentropy',
metrics=['accuracy'])
In [35]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru_1 (GRU) (None, 36, 24) 2088
_________________________________________________________________
gru_2 (GRU) (None, 36, 18) 2322
_________________________________________________________________
gru_3 (GRU) (None, 12) 1116
_________________________________________________________________
dense_1 (Dense) (None, 1) 13
=================================================================
Total params: 5,539
Trainable params: 5,539
Non-trainable params: 0
_________________________________________________________________
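The counts in this summary follow from the layer sizes. The sketch below assumes the classic GRU formulation with three sets of weights (update gate, reset gate and candidate state) and a single bias vector per set, which reproduces the printed numbers. GRU variants with extra bias terms would give different counts. The helper names are ours.

```python
def gru_params(input_dim, units):
    """Three weight sets, each with input weights, recurrent weights and a bias."""
    return 3 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input feature, plus one bias per unit."""
    return (input_dim + 1) * units

gru_counts = [
    gru_params(4, 24),   # gru_1: 4 input features per timestep
    gru_params(24, 18),  # gru_2: fed by the 24 outputs of gru_1
    gru_params(18, 12),  # gru_3: fed by the 18 outputs of gru_2
]
total = sum(gru_counts) + dense_params(12, 1)
print(gru_counts, total)
```

Notice that the window length of 36 does not enter the count: the same weights are reused at every timestep.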
Epoch 1/20
23956/23956 [==============================] - 7s 291us/step - loss: 0.6866 -
acc: 0.5474
Epoch 2/20
23956/23956 [==============================] - 6s 231us/step - loss: 0.6740 -
acc: 0.5669
Epoch 3/20
23956/23956 [==============================] - 6s 230us/step - loss: 0.6578 -
acc: 0.6137
Epoch 4/20
23956/23956 [==============================] - 5s 229us/step - loss: 0.6543 -
acc: 0.6145
Epoch 5/20
23956/23956 [==============================] - 6s 232us/step - loss: 0.6515 -
acc: 0.6191
Epoch 6/20
23956/23956 [==============================] - 6s 231us/step - loss: 0.6611 -
acc: 0.6088
Epoch 7/20
23956/23956 [==============================] - 5s 228us/step - loss: 0.6813 -
acc: 0.5672
Epoch 8/20
23956/23956 [==============================] - 6s 232us/step - loss: 0.6870 -
acc: 0.5502
Epoch 9/20
23956/23956 [==============================] - 5s 227us/step - loss: 0.6800 -
acc: 0.5711
Epoch 10/20
23956/23956 [==============================] - 5s 229us/step - loss: 0.6749 -
acc: 0.5859
Epoch 11/20
23956/23956 [==============================] - 5s 228us/step - loss: 0.6758 -
acc: 0.5904
Epoch 12/20
23956/23956 [==============================] - 6s 230us/step - loss: 0.6685 -
acc: 0.6049
Epoch 13/20
23956/23956 [==============================] - 5s 228us/step - loss: 0.6732 -
acc: 0.5915
Epoch 14/20
23956/23956 [==============================] - 5s 228us/step - loss: 0.6796 -
acc: 0.5754
Epoch 15/20
23956/23956 [==============================] - 5s 228us/step - loss: 0.6783 -
acc: 0.5792
Epoch 16/20
23956/23956 [==============================] - 5s 227us/step - loss: 0.6699 -
acc: 0.5933
Epoch 17/20
23956/23956 [==============================] - 5s 228us/step - loss: 0.6674 -
acc: 0.5971
Epoch 18/20
23956/23956 [==============================] - 5s 227us/step - loss: 0.6675 -
acc: 0.6008
Epoch 19/20
23956/23956 [==============================] - 5s 229us/step - loss: 0.6651 -
acc: 0.6007
Epoch 20/20
23956/23956 [==============================] - 5s 226us/step - loss: 0.6639 -
acc: 0.6036
Out[38]:
0
True 0.527326
False 0.472674
Natural Language Processing and Text Data Exercises Solutions
22
Exercise 1
For our Spam detection model we used a CountVectorizer with a vocabulary size of 3000. Was this the
best size? Let’s find out.
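Before running the experiment, it helps to recall what the vocabulary size controls. The sketch below is a simplified stand-in for a count vectorizer, not Scikit-Learn's implementation: it splits on whitespace, keeps the max_features most frequent words, and breaks ties alphabetically (an assumption of ours). All names are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs, max_features):
    """Keep the max_features most frequent words across all documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = sorted(word for word, _ in ranked[:max_features])
    return {word: i for i, word in enumerate(kept)}

def vectorize(doc, vocabulary):
    """Count the occurrences of each vocabulary word; ignore all other words."""
    row = [0] * len(vocabulary)
    for word in doc.lower().split():
        if word in vocabulary:
            row[vocabulary[word]] += 1
    return row

docs = ["free prize call now", "call me now", "free free prize"]
vocab = build_vocabulary(docs, max_features=3)
row = vectorize("free call call tomorrow", vocab)
print(vocab, row)
```

A small vocabulary discards rare words such as "tomorrow" here, while a large one adds many columns that are almost always zero. The exercise measures where that trade-off lands for the spam data.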
688 CHAPTER 22. NATURAL LANGUAGE PROCESSING AND TEXT DATA EXERCISES SOLUTIONS
In [3]: df = pd.read_csv('../data/sms_spam.csv')
df.head()
Out[3]:
message spam
0 Hi Princess! Thank you for the pics. You are v... 0
1 Hello my little party animal! I just thought I... 0
2 And miss vday the parachute and double coins??... 0
3 Maybe you should find something else to do ins... 0
4 What year. And how many miles. 0
Train/Test split on the messages, notice that we use Numpy Arrays, not Pandas Dataframes:
vect.fit(docs_train)
X_train_sparse = vect.transform(docs_train)
X_train = X_train_sparse.todense()
X_test_sparse = vect.transform(docs_test)
X_test = X_test_sparse.todense()
input_dim = X_train.shape[1]
model = Sequential()
model.add(Dense(1, input_dim=input_dim, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
for v in sizes:
    i, tra, tea = train_for_vocab_size(v)
    idx.append(i)
    train_accs.append(tra)
    test_accs.append(tea)
resdf['Test'] = test_accs
and let’s plot the results using a logarithmic scale for the x axis. Remember that our benchmark accuracy
is 86.6%, so we will add a baseline at that level:
Exercise 2
Keras provides a large dataset of movie reviews extracted from the Internet Movie Database for sentiment
analysis purposes. This dataset is much larger than the one we have used, and it’s already encoded as
sequences of integers. Let’s put what we have learned to good use and build a sentiment classifier for movie
reviews:
• decide what size of vocabulary you are going to use and set the vocab_size variable
• import the imdb module from keras.datasets
• load the train and test sets using num_words=vocab_size
• check the data you have just loaded, they should be sequences of integers
• pad the sequences to a fixed length of your choice. You will need to:
Bonus points: can you convert back the sentences to their original text form? You should look at
imdb.get_word_index() to download the word index:
In [11]: vocab_size=20000
In [14]: X_train.shape
Out[14]: (25000,)
In [15]: X_train[0][:10]
Out[15]: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
Let’s use a maximum review length of 80 words. This seems long enough to express an opinion about the
movie:
In [16]: maxlen = 80
We will pad sequences using the default padding='pre' and truncating='pre' parameters.
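To make the effect of these two parameters concrete, here is a minimal sketch of 'pre' padding and 'pre' truncating on plain lists. The function pad_pre is a hypothetical stand-in that assumes a positive maxlen; it is not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic padding='pre' and truncating='pre' for a single sequence."""
    truncated = sequence[-maxlen:]                  # 'pre' truncating drops items from the front
    padding = [value] * (maxlen - len(truncated))
    return padding + truncated                      # 'pre' padding puts the zeros in front

short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short, long)
```

With these settings a long review keeps its last 80 words, and a short review ends right at the last timestep the recurrent layer sees.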
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
TIP: in the above model we have used dropout, which has not yet been formally
introduced. For now just know that it’s a technique aimed at reducing overfitting.
In [22]: model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 100) 2000000
_________________________________________________________________
lstm_1 (LSTM) (None, 64) 42240
_________________________________________________________________
dense_14 (Dense) (None, 1) 65
=================================================================
Total params: 2,042,305
Trainable params: 2,042,305
Non-trainable params: 0
_________________________________________________________________
Let’s train the model for a couple of epochs. If you run this model on your laptop it may take a few minutes
for each epoch:
Not bad! We have a sentiment analysis model that we can unleash on the social media of our choice. Time
to go to an investor and raise money! Not quite, but it’s nice to see how easy it has become to build a model
that would have been unthinkable just a few years ago.
Finally, for the bonus question. Let’s get the word index:
and let’s create the reverse index. Notice that the documentation of imdb.load_data reads:
"""
Signature: imdb.load_data(path='imdb.npz', num_words=None, skip_top=0, maxlen=None, seed=113, st
Docstring:
Loads the IMDB dataset.
Also, following the documentation let’s add the start character and the out-of-vocabulary character:
We can then apply the reverse index to recover the text of a review:
Out[29]: "start_char this film was just brilliant casting location scenery story direction every
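The index shift is the part that is easy to get wrong, so here is a minimal sketch with a tiny excerpt of the word index. The five entries are inferred from the encoded review and the decoded text shown above. The offset of 3 and the reserved values 1 (start) and 2 (out of vocabulary) are the defaults of imdb.load_data, and the name padding_char is our own label for index 0.

```python
# Tiny excerpt of the word index, inferred from the review shown above.
word_index = {"this": 11, "was": 13, "film": 19, "just": 40, "brilliant": 527}

INDEX_FROM = 3  # actual words are shifted by this offset in the encoded data

reverse_index = {idx + INDEX_FROM: word for word, idx in word_index.items()}
reverse_index[0] = "padding_char"  # hypothetical label for the padding value
reverse_index[1] = "start_char"
reverse_index[2] = "oov_char"

def decode(encoded):
    """Map integers back to words, falling back to oov_char for unknown indices."""
    return " ".join(reverse_index.get(i, "oov_char") for i in encoded)

decoded = decode([1, 14, 22, 16, 43, 530])
print(decoded)
```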
Exercise 1
In Exercise 2 of Chapter 8 we introduced a model for sentiment analysis of the IMDB dataset provided in
Keras.
698 CHAPTER 23. TRAINING WITH GPUS EXERCISES SOLUTIONS
In [11]: NGPU = 2
In [13]: model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
Exercise 2
Model parallelism is a technique that is used for very large models that cannot fit in the memory of a single
GPU. While this is not the case for the model we developed in Exercise 1, it is still possible to distribute
the model across multiple GPUs using the with context setter. Define a new model with the following
architecture:
1. Embedding
2. LSTM
3. LSTM
4. LSTM
5. Dense
Place layers 1 and 2 on the first GPU, layers 3 and 4 on the second GPU and the final Dense layer on the CPU.
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
10.114 seconds.
Compiling the model...
0.041 seconds.
Exercise 1
This is a long and complex exercise, that should give you an idea of a real world scenario. Feel free to look at
the solution if you feel lost. Also, feel free to run this on a GPU.
First of all download and unpack the male/female pictures from here into a subfolder of the ../data folder.
These images and labels were obtained from Crowdflower.
Your goal is to build an image classifier that will recognize the gender of a person from pictures.
704 CHAPTER 24. PERFORMANCE IMPROVEMENT EXERCISES SOLUTIONS
• Define also a test generator, whose only purpose is to rescale the pixels by 1./255
• use the function flow_from_directory to generate batches from the train and test folders. Make
sure you set the target_size to 64x64.
• Use the model.fit_generator function to fit the model on the batches generated from the
ImageDataGenerator. Since you are streaming and augmenting the data in real time you will have to
decide how many batches make an epoch and how many epochs you want to run
• Train your model (you should get to at least 85% accuracy)
• Once you are satisfied with your training, check a few of the misclassified pictures.
• Read about human bias in Machine Learning datasets
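Deciding how many batches make an epoch is simple arithmetic. In the sketch below the image count and batch size are hypothetical placeholders, while 600 is the steps_per_epoch value used in this solution.

```python
import math

def steps_for_one_pass(n_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(n_images / batch_size)

n_train_images = 10000  # hypothetical, count the files in your train folder
batch_size = 32         # hypothetical, use the value passed to flow_from_directory

one_pass = steps_for_one_pass(n_train_images, batch_size)
images_seen = 600 * batch_size  # images drawn per epoch when steps_per_epoch=600
print(one_pass, images_seen)
```

Since the generator augments images on the fly, an epoch longer than one pass over the data does not repeat the exact same pictures.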
In [3]: %%bash
if [ ! -d ../data/male_female ]; then
    A=https://www.zerotodeeplearning.com/
    B=media/z2dl/45bzty/
    C=male_female.tgz
    wget $A$B$C -O male_female.tgz
    tar -xzvf male_female.tgz --directory ../data/
    rm male_female.tgz
fi
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
return model
Out[7]: ['/job:localhost/replica:0/task:0/device:GPU:0',
'/job:localhost/replica:0/task:0/device:GPU:1']
In [10]: model.summary()
________________________________________________________________________________
__________________
Layer (type) Output Shape Param # Connected to
================================================================================
==================
conv2d_1_input (InputLayer) (None, 64, 64, 3) 0
________________________________________________________________________________
__________________
lambda_1 (Lambda) (None, 64, 64, 3) 0
conv2d_1_input[0][0]
________________________________________________________________________________
__________________
lambda_2 (Lambda) (None, 64, 64, 3) 0
conv2d_1_input[0][0]
________________________________________________________________________________
__________________
sequential_1 (Sequential) (None, 1) 352129 lambda_1[0][0]
lambda_2[0][0]
________________________________________________________________________________
__________________
dense_2 (Concatenate) (None, 1) 0
sequential_1[1][0]
sequential_1[2][0]
================================================================================
==================
Total params: 352,129
Trainable params: 351,809
Non-trainable params: 320
________________________________________________________________________________
__________________
In [11]: model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
In [12]: batch_size = 16
test_gen = ImageDataGenerator(rescale=1./255)
test = test_gen.flow_from_directory(
data_path + '/test', target_size=(64, 64),
batch_size=batch_size * ncopies,
class_mode='binary')
24.1. EXERCISE 1 707
In [15]: test.class_indices
In [17]: model.fit_generator(train,
steps_per_epoch=600,
epochs=3);
Epoch 1/3
600/600 [==============================] - 59s 99ms/step - loss: 0.6250 - acc:
0.6959
Epoch 2/3
600/600 [==============================] - 58s 97ms/step - loss: 0.4662 - acc:
0.7706
Epoch 3/3
600/600 [==============================] - 58s 96ms/step - loss: 0.4241 - acc:
0.7938
In [18]: model.evaluate_generator(test)
In [19]: X_test = []
y_test = []
for ts in islice(test, 50):
    X_test.append(ts[0])
    y_test.append(ts[1])
X_test = np.concatenate(X_test)
y_test = np.concatenate(y_test)
In [20]: y_test
Out[22]: array([ 3, 7, 12, 24, 25, 27, 28, 29, 43, 45, 49,
52, 54, 57, 76, 80, 89, 90, 99, 101, 102, 109,
115, 116, 117, 120, 131, 139, 153, 155, 157, 162, 170,
172, 177, 182, 193, 195, 199, 202, 204, 205, 210, 212,
213, 216, 225, 226, 228, 236, 245, 246, 255, 260, 267,
273, 287, 289, 292, 293, 301, 303, 310, 311, 314, 336,
347, 351, 354, 359, 368, 375, 380, 381, 384, 387, 390,
402, 407, 408, 409, 411, 431, 442, 446, 449, 462, 464,
467, 468, 481, 484, 486, 488, 491, 492, 493, 496, 497,
499, 503, 507, 509, 512, 531, 532, 533, 546, 550, 555,
564, 566, 570, 571, 580, 584, 604, 606, 607, 608, 616,
618, 620, 624, 631, 636, 640, 641, 645, 652, 660, 668,
669, 676, 680, 681, 684, 685, 692, 693, 695, 710, 724,
726, 730, 732, 740, 743, 747, 789, 797, 803, 806, 811,
814, 817, 818, 819, 824, 840, 841, 842, 847, 858, 867,
870, 877, 879, 886, 887, 895, 897, 900, 908, 913, 914,
919, 921, 935, 942, 943, 945, 949, 956, 959, 961, 965,
968, 978, 984, 986, 991, 996, 1018, 1020, 1037, 1042, 1085,
1089, 1093, 1095, 1104, 1105, 1106, 1108, 1111, 1112, 1118, 1120,
1122, 1139, 1142, 1143, 1150, 1159, 1166, 1167, 1173, 1186, 1191,
1196, 1197, 1206, 1208, 1210, 1221, 1222, 1223, 1239, 1243, 1244,
1251, 1254, 1256, 1257, 1258, 1273, 1275, 1277, 1287, 1289, 1305,
1307, 1313, 1316, 1326, 1333, 1335, 1337, 1340, 1341, 1345, 1348,
1365, 1372, 1373, 1374, 1377, 1398, 1399, 1407, 1410, 1411, 1417,
1418, 1429, 1430, 1435, 1436, 1444, 1448, 1450, 1452, 1454, 1455,
1474, 1480, 1485, 1490, 1497, 1500, 1503, 1525, 1530, 1531, 1533,
1536, 1540, 1546, 1551, 1566, 1567, 1571, 1580, 1581, 1583, 1585,
1592])
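An array of indices like the one above marks the test images where the predicted class differs from the label. Here is a minimal sketch of that comparison on toy values, assuming a sigmoid output thresholded at 0.5. The function name and the demo numbers are ours.

```python
def misclassified_indices(y_true, y_prob, threshold=0.5):
    """Indices where the thresholded prediction disagrees with the label."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

y_true_demo = [0, 1, 1, 0, 1]
y_prob_demo = [0.1, 0.4, 0.9, 0.7, 0.8]  # made-up sigmoid outputs
wrong = misclassified_indices(y_true_demo, y_prob_demo)
print(wrong)
```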
i = 1
pred = label_to_class[int(y_pred[idx])]
plt.title("Label: {} Pred: {}".format(label, pred))
i += 1
plt.tight_layout()
Exercise 1
Use a pre-trained model on a different image.
712 CHAPTER 25. PRETRAINED MODELS FOR IMAGES EXERCISES SOLUTIONS
img_tensor = np.expand_dims(
image.img_to_array(img), axis=0)
In [8]: img
Out[8]:
25.2. EXERCISE 2 713
Exercise 2
Choose another pre-trained model from the ones provided at https://keras.io/applications/ and use it to
predict the same image. Do the predictions match?
Exercise 3
The Keras documentation shows how to fine-tune the Inception V3 model by unfreezing some of the
convolutional layers. Try reproducing the results of the documentation on our dataset using the Xception
model and unfreezing some of the top convolutional layers.
In [21]: x = base_model.output
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(3, activation='softmax')(x)
In [24]: model.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
In [26]: batch_size = 32
In [29]: model.fit_generator(
train_generator,
steps_per_epoch=65,
epochs=1)
Epoch 1/1
65/65 [==============================] - 47s 730ms/step - loss: 0.5013 - acc:
0.8046
0 input_3
1 block1_conv1
2 block1_conv1_bn
3 block1_conv1_act
4 block1_conv2
5 block1_conv2_bn
6 block1_conv2_act
7 block2_sepconv1
8 block2_sepconv1_bn
9 block2_sepconv2_act
10 block2_sepconv2
11 block2_sepconv2_bn
12 conv2d_5
13 block2_pool
14 batch_normalization_5
15 add_13
16 block3_sepconv1_act
17 block3_sepconv1
18 block3_sepconv1_bn
19 block3_sepconv2_act
20 block3_sepconv2
21 block3_sepconv2_bn
22 conv2d_6
23 block3_pool
24 batch_normalization_6
25 add_14
26 block4_sepconv1_act
27 block4_sepconv1
28 block4_sepconv1_bn
29 block4_sepconv2_act
30 block4_sepconv2
31 block4_sepconv2_bn
32 conv2d_7
33 block4_pool
34 batch_normalization_7
35 add_15
36 block5_sepconv1_act
37 block5_sepconv1
38 block5_sepconv1_bn
39 block5_sepconv2_act
40 block5_sepconv2
41 block5_sepconv2_bn
42 block5_sepconv3_act
43 block5_sepconv3
44 block5_sepconv3_bn
45 add_16
46 block6_sepconv1_act
47 block6_sepconv1
48 block6_sepconv1_bn
49 block6_sepconv2_act
50 block6_sepconv2
51 block6_sepconv2_bn
52 block6_sepconv3_act
53 block6_sepconv3
54 block6_sepconv3_bn
55 add_17
56 block7_sepconv1_act
57 block7_sepconv1
58 block7_sepconv1_bn
59 block7_sepconv2_act
60 block7_sepconv2
61 block7_sepconv2_bn
62 block7_sepconv3_act
63 block7_sepconv3
64 block7_sepconv3_bn
65 add_18
66 block8_sepconv1_act
67 block8_sepconv1
68 block8_sepconv1_bn
69 block8_sepconv2_act
70 block8_sepconv2
71 block8_sepconv2_bn
72 block8_sepconv3_act
73 block8_sepconv3
74 block8_sepconv3_bn
75 add_19
76 block9_sepconv1_act
77 block9_sepconv1
78 block9_sepconv1_bn
79 block9_sepconv2_act
80 block9_sepconv2
81 block9_sepconv2_bn
82 block9_sepconv3_act
83 block9_sepconv3
84 block9_sepconv3_bn
85 add_20
86 block10_sepconv1_act
87 block10_sepconv1
88 block10_sepconv1_bn
89 block10_sepconv2_act
90 block10_sepconv2
91 block10_sepconv2_bn
92 block10_sepconv3_act
93 block10_sepconv3
94 block10_sepconv3_bn
95 add_21
96 block11_sepconv1_act
97 block11_sepconv1
98 block11_sepconv1_bn
99 block11_sepconv2_act
100 block11_sepconv2
101 block11_sepconv2_bn
102 block11_sepconv3_act
103 block11_sepconv3
104 block11_sepconv3_bn
105 add_22
106 block12_sepconv1_act
107 block12_sepconv1
108 block12_sepconv1_bn
109 block12_sepconv2_act
110 block12_sepconv2
111 block12_sepconv2_bn
112 block12_sepconv3_act
113 block12_sepconv3
114 block12_sepconv3_bn
115 add_23
116 block13_sepconv1_act
117 block13_sepconv1
118 block13_sepconv1_bn
119 block13_sepconv2_act
120 block13_sepconv2
121 block13_sepconv2_bn
122 conv2d_8
123 block13_pool
124 batch_normalization_8
125 add_24
126 block14_sepconv1
127 block14_sepconv1_bn
128 block14_sepconv1_act
129 block14_sepconv2
130 block14_sepconv2_bn
131 block14_sepconv2_act
132 global_average_pooling2d_1
split_layer = 126
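The split index separates the layers that stay frozen from the ones we fine-tune. The sketch below shows the pattern on stand-in objects rather than on a real Keras model: we assume 133 layers, matching indices 0 to 132 in the listing above. With a real model you would loop over model.layers in the same way and compile the model again afterwards.

```python
from types import SimpleNamespace

# Stand-ins for Keras layers: only the trainable flag matters for this sketch.
layers = [SimpleNamespace(name="layer_{}".format(i), trainable=True)
          for i in range(133)]

split_layer = 126

for layer in layers[:split_layer]:
    layer.trainable = False  # keep the pre-trained weights fixed
for layer in layers[split_layer:]:
    layer.trainable = True   # fine-tune from block14_sepconv1 onwards

n_trainable = sum(layer.trainable for layer in layers)
print(n_trainable)
```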
In [33]: model.fit_generator(
train_generator,
steps_per_epoch=65,
epochs=1);
Epoch 1/1
65/65 [==============================] - 47s 730ms/step - loss: 0.2432 - acc:
0.9110
Pretrained Embeddings for Text Exercises Solutions
26
In [1]: with open('../course/common.py') as fin:
exec(fin.read())
Exercise 1
Compare the representations of Word2Vec, Glove and FastText. In the data/embeddings folder we
provided you with two additional scripts to download FastText and Word2Vec. Go ahead and download
each of them into the data/embeddings. Then load each of the 3 embeddings in a separate Gensim model
and complete the following steps:
1. define a list of words containing the following words: ‘good’, ‘bad’, ‘fast’, ‘tensor’, ‘teacher’, ‘student’.
2. create a function called get_top_5(words, model) that retrieves the top 5 most similar words to
the list of words and compare what the 3 different embeddings give you
3. apply the same function to each word in the list separately and compare the lists of the 3 embeddings.
720 CHAPTER 26. PRETRAINED EMBEDDINGS FOR TEXT EXERCISES SOLUTIONS
Note that loading the vector may take several minutes depending on your computer.
good
W2V : ['great', 'bad', 'terrific', 'decent', 'nice']
Glove : ['better', 'really', 'always', 'sure', 'something']
FastText: ['bad', 'excellent', 'decent', 'nice', 'great']
bad
W2V : ['good', 'terrible', 'horrible', 'Bad', 'lousy']
fast
W2V : ['quick', 'rapidly', 'Fast', 'quickly', 'slow']
Glove : ['slow', 'faster', 'pace', 'turning', 'better']
FastText: ['slow', 'rapid', 'quick', 'Fast', 'faster']
tensor
W2V : ['uniaxial', 'τ ', 'θ ', 'φ', 'wavefunction']
Glove : ['scalar', 'tensors', 'coefficients', 'coefficient', 'formula_12']
FastText: ['tensors', 'Tensor', 'stress-energy', 'pseudotensor', 'tensorial']
teacher
W2V : ['teachers', 'Teacher', 'guidance_counselor', 'elementary',
'PE_teacher']
Glove : ['student', 'graduate', 'teaching', 'taught', 'teaches']
FastText: ['teachers', 'educator', 'Teacher', 'student', 'pupil']
student
W2V : ['students', 'Student', 'teacher', 'stu_dent', 'faculty']
Glove : ['teacher', 'students', 'teachers', 'graduate', 'school']
FastText: ['students', 'teacher', 'Student', 'university', 'graduate']
print(analogy)
print("W2V : ", word_analogy(
w2v_model, thing, is_to, like))
print("Glove : ", word_analogy(
glove_model, thing, is_to, like))
man:king=woman:queen
W2V : ['queen', 'monarch', 'princess']
Glove : ['queen', 'throne', 'prince']
FastText: ['queen', 'monarch', 'princess']
france:paris=germany:berlin
W2V : ['berlin', 'german', 'lindsay_lohan']
Glove : ['berlin', 'frankfurt', 'vienna']
FastText: ['berlin', 'munich', 'dresden']
teacher:teach=student:learn
W2V : ['educate', 'learn', 'teaches']
Glove : ['students', 'teachers', 'teaching']
FastText: ['learn', 'educate', 'attend']
cat:kitten=dog:?
W2V : ['puppy', 'pup', 'pit_bull']
Glove : ['puppy', 'rottweiler', 'retriever']
FastText: ['puppy', 'puppies', 'pup']
english:friday=italiano:?
W2V : ['noche', 'fatto', 'la_versione']
Glove : ['exxonmobil', 'eni', 'newmont']
FastText: ['dopo', 'meglio', 'lavoro']
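The idea behind these analogies is vector arithmetic followed by a nearest neighbor search. The sketch below illustrates it with made-up 2-dimensional vectors. It is not the word_analogy function used above, and real embeddings have hundreds of dimensions, so treat the names and numbers as assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_analogy(vectors, thing, is_to, like, topn=1):
    """Answer thing:is_to = like:? by ranking words near is_to - thing + like."""
    target = [b - a + c for a, b, c in
              zip(vectors[thing], vectors[is_to], vectors[like])]
    candidates = [w for w in vectors if w not in (thing, is_to, like)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target),
                    reverse=True)
    return ranked[:topn]

toy_vectors = {  # made-up vectors: first axis "royalty", second axis "gender"
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}
answer = toy_analogy(toy_vectors, "man", "king", "woman")
print(answer)
```

When an embedding has not learned the relevant direction, as in the english:friday=italiano:? example, the nearest neighbors of the target vector are essentially unrelated words.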
Exercise 2
The Reuters Newswire topic classification dataset is a dataset of 11,228 newswires from Reuters, labeled over
46 topics. This dataset is provided in the keras.datasets module and it’s easy to use.
Let’s compare the performance of a model using pre-trained embeddings with a model using random
embeddings on the topic classification task.
In [14]: vocab_size=20000
Out[20]: 'start_char oov_char oov_char said as a result of its december acquisition of space co
Out[21]: 2376
In [22]: maxlen=100
fixed_emb_layer = Embedding(vocab_size,
embedding_size,
weights=[reuters_emb_weights],
mask_zero=True,
trainable=False,
input_length=maxlen)
model.compile(loss='sparse_categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
h = model.fit(X_train_pad, y_train,
batch_size=32,
epochs=5,
validation_split=0.1)
return h, model
In [31]: pd.DataFrame(history.history).plot();
In [33]: pd.DataFrame(history.history).plot();
26.2. EXERCISE 2 727
Serving Deep Learning Models Exercises Solutions
27
In [1]: with open('../course/common.py') as fin:
exec(fin.read())
Exercise 1
Let’s deploy an image recognition API using Tensorflow Serving. The main difference from the API we have
deployed in this chapter is that we will have to deal with how to pass an image to the model through
Tensorflow Serving. Since this chapter focuses on deployment, we will take a shortcut and deploy a
pre-trained model that uses Imagenet. In particular we will deploy the Xception model. If you are unsure
about how to use a pre-trained model, please go back to Chapter 11 for a refresher.
730 CHAPTER 27. SERVING DEEP LEARNING MODELS EXERCISES SOLUTIONS
In [3]: import os
from os.path import join
import shutil
import tensorflow as tf
import keras.backend as K
from tensorflow.python.saved_model.builder \
import SavedModelBuilder
from tensorflow.python.saved_model.signature_def_utils \
import predict_signature_def
from tensorflow_serving.apis.prediction_service_pb2_grpc \
import PredictionServiceStub
from tensorflow_serving.apis.predict_pb2 \
import PredictRequest
from tensorflow.contrib.util import make_tensor_proto
from tensorflow.contrib.util import make_ndarray
In [5]: K.set_learning_phase(0)
27.1. EXERCISE 1 731
In [11]: builder.add_meta_graph_and_variables(
             sess=sess,
             tags=[tf.saved_model.tag_constants.SERVING],
             signature_def_map={'predict': signature})
In [12]: builder.save()
Out[12]: b'/tmp/ztdl_models/xception/tfserving/1/saved_model.pb'
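The output path above ends in `1/saved_model.pb`: TensorFlow Serving looks for numbered version subdirectories under the model's base path. Here is a minimal sketch of how such a path can be put together. The helper name `versioned_export_dir` is hypothetical, and the base path and model name are simply the ones visible in the output above.

```python
from os.path import join


def versioned_export_dir(base_path, model_name, version):
    """Build <base_path>/<model_name>/tfserving/<version> (hypothetical helper)."""
    version = int(version)
    if version < 0:
        raise ValueError("version must be a non-negative integer")
    return join(base_path, model_name, "tfserving", str(version))


# Version 1 of the Xception export used in this exercise
export_dir = versioned_export_dir("/tmp/ztdl_models", "xception", 1)
print(export_dir)
```

Exporting a retrained model to version `2` would place it next to version `1`, so the old version stays on disk while the new one is picked up.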
Start Server
docker run \
    -v /tmp/ztdl_models/xception/tfserving/:/models/xception \
    -e MODEL_NAME=xception \
    -e MODEL_BASE_PATH=/models \
    -p 8502:8500 \
    -p 8503:8501 \
    -t tensorflow/serving
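The command maps the container's gRPC port 8500 to host port 8502 and its REST port 8501 to host port 8503. The solution in this chapter talks to the gRPC port, but as a rough sketch, this is how a client could address the REST endpoint instead. The function names are hypothetical, and we assume the signature name `predict` and an input key called `inputs`.

```python
import json

REST_HOST_PORT = 8503  # host port mapped to the container's REST port 8501


def predict_url(model_name, host="localhost", port=REST_HOST_PORT):
    """URL of the TensorFlow Serving REST predict endpoint for a model."""
    return "http://{}:{}/v1/models/{}:predict".format(host, port, model_name)


def predict_payload(batch, signature_name="predict", input_key="inputs"):
    """JSON body in the columnar ("inputs") format of the REST API.

    `batch` is a nested list, e.g. an image batch converted with .tolist().
    The signature name and input key are assumptions about the exported model.
    """
    return json.dumps({"signature_name": signature_name,
                       "inputs": {input_key: batch}})


url = predict_url("xception")
body = predict_payload([[[[0.0, 0.0, 0.0]]]])  # dummy 1x1x1x3 "image" batch
print(url)
```

For a real 299x299 image the JSON body gets large, which is one reason gRPC is often the more convenient transport for images.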
In [22]: request.inputs['inputs'].CopyFrom(data_pb)
Decode predictions
In [27]: preds
Out[27]: 'king_penguin'
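Decoding amounts to ranking the class probabilities and looking up the label of the best one. The sketch below illustrates the idea in plain Python with a made-up three-class example. `decode_top_k` is a hypothetical helper, not the Keras function, which returns `(class_id, class_name, score)` tuples for each sample.

```python
def decode_top_k(probs, labels, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy example: labels and scores are made up for illustration
labels = ["king_penguin", "albatross", "goose"]
probs = [0.91, 0.06, 0.03]
top = decode_top_k(probs, labels, k=2)
print(top[0][0])
```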
Exercise 2
The above method of serving a pre-trained model has an issue: pre-processing and prediction
decoding both happen on the client side. This is not a best practice, because it requires the client to know
which pre-processing and decoding functions the model needs.
We would like a server that takes the image as it is and returns a string with the name of the object in the
image.
The easy way to do this is to use the Flask app implementation we have shown in this chapter and move
pre-processing and decoding to the server side.
Go ahead and build a Flask version of the API that takes an image URL as a JSON string, applies
pre-processing, runs and decodes the prediction, and returns a string with the response.
python 13_flask_serve_xception.py
curl -d 'http://bit.ly/2wb7uqN' \
    -H "Content-Type: application/json" \
    -X POST http://localhost:5000
"king_penguin"
Disclaimer: this script is not meant for production purposes. Retrieving a file from a URL is not secure,
and you should avoid building an API that retrieves a file from a URL provided by the client. Here
we used the URL retrieval trick to make the curl command shorter.
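The curl call above can also be issued from Python. The sketch below uses only the standard library. The function names are hypothetical, and we assume the Flask server is listening on `http://localhost:5000` as in the curl example.

```python
from urllib.request import Request, urlopen


def build_request(image_url, api_url="http://localhost:5000"):
    """Build the same POST request as the curl command (URL as raw body)."""
    return Request(api_url,
                   data=image_url.encode("utf-8"),
                   headers={"Content-Type": "application/json"},
                   method="POST")


def classify(image_url, api_url="http://localhost:5000"):
    """Send the request and return the decoded response body."""
    with urlopen(build_request(image_url, api_url)) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network; classify() does.
req = build_request("http://bit.ly/2wb7uqN")
print(req.get_method(), req.full_url)
```

With the server running, `classify("http://bit.ly/2wb7uqN")` sends the request and returns whatever the API responds with.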
import os
import json
import numpy as np
import tensorflow as tf
from urllib.request import urlretrieve
from keras.preprocessing import image
from keras.applications.xception import Xception
from keras.applications.xception import preprocess_input
from keras.applications.xception import decode_predictions
import keras.backend as K
from flask import Flask, request, jsonify
loaded_model = None
graph = None
app = Flask(__name__)
def load_model():
    """
    Load model and tensorflow graph
    into global variables.
    """
    # global variables
    global loaded_model
    global graph
    loaded_model = Xception(weights='imagenet')
    # keep a handle on the graph the model was built in (TF 1.x API)
    graph = tf.get_default_graph()
def preprocess(data):
    url = data.decode('utf-8')
    img, img_tensor = load_image_from_url(url)
    img_scaled = preprocess_input(img_tensor)
    return img_scaled
@app.route('/', methods=["POST"])
def predict():
    """
    Generate predictions with the model
    when receiving data as a POST request
    """
    if request.method == "POST":
        # get url from the request
        data = request.data
        # print in backend
        print("Received data:", data)
        print("Predicted labels:", result)
        return jsonify(result)
if __name__ == "__main__":
    print("* Loading model and starting Flask server...")
    load_model()
    app.run(host='0.0.0.0', debug=True)
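The disclaimer earlier in this exercise warns against fetching arbitrary client-supplied URLs. If you do experiment with this pattern, one way to reduce (not remove) the risk is to check the URL against an allowlist before retrieving it. This is a minimal sketch: `is_allowed_url` and the host `images.example.com` are hypothetical, and the check does not protect against redirects.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_HOSTS = {"images.example.com"}  # hypothetical allowlist of image hosts


def is_allowed_url(url):
    """Return True only for http(s) URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    return (parsed.scheme in ALLOWED_SCHEMES
            and parsed.hostname in ALLOWED_HOSTS)


print(is_allowed_url("https://images.example.com/penguin.jpg"))
print(is_allowed_url("file:///etc/passwd"))
```

A check like this could run at the top of `preprocess`, rejecting the request before any file is downloaded.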