02-DL-Deep Learning For Image Data (Convnets) 02
Having "few" samples can mean anywhere from a few hundreds to a few tens of thousands of images. As a practical example, we will focus on
classifying images as "dogs" or "cats", in a dataset containing 4000 pictures of cats and dogs (2000 cats, 2000 dogs). We will use 2000 pictures for
training, 1000 for validation, and finally 1000 for testing.
In this section, we will review one basic strategy to tackle this problem: training a new model from scratch on what little data we have. We will start by
naively training a small convnet on our 2000 training samples, without any regularization, to set a baseline for what can be achieved. This will get us to
a classification accuracy of 71%. At that point, our main issue will be overfitting. Then we will introduce data augmentation, a powerful technique for
mitigating overfitting in computer vision. By leveraging data augmentation, we will improve our network to reach an accuracy of 82%.
In the next section, we will review two more essential techniques for applying deep learning to small datasets: doing feature extraction with a pre-trained
network (this will get us to an accuracy of 90% to 93%), and fine-tuning a pre-trained network (this will get us to our final accuracy of 95%). Together,
these three strategies -- training a small model from scratch, doing feature extraction with a pre-trained model, and fine-tuning a pre-trained model --
will constitute your future toolbox for tackling the problem of doing computer vision with small datasets.
You will sometimes hear that deep learning only works when lots of data is available. However, what constitutes "lots" of samples is relative -- relative
to the size and depth of the network you are trying to train, for starters. It isn't possible to train a convnet to solve a complex problem with just a few
tens of samples, but a few hundred can potentially suffice if the model is small and well-
regularized and if the task is simple. Because convnets learn local, translation-invariant features, they are very data-efficient on perceptual problems.
Training a convnet from scratch on a very small image dataset will still yield reasonable results despite a relative lack of data, without the need for any
custom feature engineering. You will see this in action in this section.
But what's more, deep learning models are by nature highly repurposable: you can take, say, an image classification or speech-to-text model trained on
a large-scale dataset and then reuse it on a significantly different problem with only minor changes. Specifically, in the case of computer vision, many pre-
trained models (usually trained on the ImageNet dataset) are now publicly available for download and can be used to bootstrap powerful vision models
out of very little data. That's what we will do in the next section.
For now, let's get started by getting our hands on the data.
The pictures are medium-resolution color JPEGs. They look like this:
[Sample cat and dog pictures from the dataset]
Unsurprisingly, the cats vs. dogs Kaggle competition in 2013 was won by entrants who used convnets. The best entries could achieve up to 95%
accuracy. In our own example, we will get fairly close to this accuracy (in the next section), even though we will be training our models on less than 10%
of the data that was available to the competitors. The original dataset contains 25,000 images of dogs and cats (12,500 from each class) and is 543 MB
(compressed). After downloading and uncompressing it, we will create a new dataset containing three subsets: a training set with 1000 samples of
each class, a validation set with 500 samples of each class, and finally a test set with 500 samples of each class.
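The snippet below is a minimal sketch of this splitting step. The paths (original_dataset_dir, base_dir) are assumptions for a Colab-style setup, and the file naming (cat.0.jpg, dog.0.jpg) follows the Kaggle download; adjust them to your environment.

import os, shutil

# Assumed paths: where the Kaggle archive was unpacked, and where the
# small dataset should be created. Adjust as needed.
original_dataset_dir = '/content/train'
base_dir = '/content/cats_and_dogs_small'

splits = {'train': range(0, 1000),
          'validation': range(1000, 1500),
          'test': range(1500, 2000)}
for split, indices in splits.items():
    for category in ('cats', 'dogs'):
        split_dir = os.path.join(base_dir, split, category)
        os.makedirs(split_dir, exist_ok=True)
        for i in indices:
            fname = '{}.{}.jpg'.format(category[:-1], i)  # e.g. 'cat.0.jpg'
            shutil.copyfile(os.path.join(original_dataset_dir, fname),
                            os.path.join(split_dir, fname))

train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')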
As a sanity check, let's count how many pictures we have in each split (train/validation/test):
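A minimal way to do this count, assuming the base_dir layout from the splitting sketch above:

# Count the images in each split/class directory
for split in ('train', 'validation', 'test'):
    for category in ('cats', 'dogs'):
        n = len(os.listdir(os.path.join(base_dir, split, category)))
        print('total {} {} images: {}'.format(split, category, n))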
Let's also display one of these pictures:

import matplotlib.image as mpimg
import matplotlib.pyplot as plt

img = mpimg.imread(img_path)  # img_path: path to any one of the JPEGs above
plt.imshow(img)
So we have indeed 2000 training images, and then 1000 validation images and 1000 test images. In each split, there is the same number of samples
from each class: this is a balanced binary classification problem, which means that classification accuracy will be an appropriate measure of success.
We will reuse the same general convnet structure as before: a stack of alternated Conv2D (with relu activation) and MaxPooling2D layers. However,
since we are dealing with bigger images and a more complex problem, we will make our network accordingly larger: it will have one more
Conv2D + MaxPooling2D stage. This serves both to augment the capacity of the network, and to further reduce the size of the feature maps, so that
they aren't overly large when we reach the Flatten layer. Here, since we start from inputs of size 150x150 (a somewhat arbitrary choice), we end up
with feature maps of size 7x7 right before the Flatten layer.
Note that the depth of the feature maps is progressively increasing in the network (from 32 to 128), while the size of the feature maps is decreasing
(from 148x148 to 7x7). This is a pattern that you will see in almost all convnets.
Since we are attacking a binary classification problem, we end the network with a single unit (a Dense layer of size 1) and a sigmoid
activation. This unit encodes the probability that the network is looking at one class or the other.
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
                        input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
Let's take a look at how the dimensions of the feature maps change with every successive layer:
In [10]: model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 148, 148, 32) 896
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 74, 74, 32) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 72, 72, 64) 18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 36, 36, 64) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 34, 34, 128) 73856
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 17, 17, 128) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 15, 15, 128) 147584
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 7, 7, 128) 0
_________________________________________________________________
flatten (Flatten) (None, 6272) 0
_________________________________________________________________
dense (Dense) (None, 512) 3211776
_________________________________________________________________
dense_1 (Dense) (None, 1) 513
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0
_________________________________________________________________
For our compilation step, we'll go with the RMSprop optimizer as usual. Since we ended our network with a single sigmoid unit, we will use binary
crossentropy as our loss (as a reminder, check out the table in Chapter 4, section 5 for a cheatsheet on what loss function to use in various situations).
from tensorflow.keras import optimizers

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),  # lr is a legacy alias for learning_rate
              metrics=['acc'])
Data preprocessing
As you already know by now, data should be formatted into appropriately pre-processed floating point tensors before being fed into our network.
Currently, our data sits on a drive as JPEG files, so the steps for getting it into our network are roughly:

1. Read the picture files.
2. Decode the JPEG content into RGB grids of pixels.
3. Convert these into floating point tensors.
4. Rescale the pixel values (between 0 and 255) to the [0, 1] interval, since neural networks prefer to deal with small input values.
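To make these steps concrete, here is a hedged manual version for a single image, using Keras' image helpers (img_path is a hypothetical path to one of our JPEGs):

from tensorflow.keras.preprocessing import image

img = image.load_img(img_path, target_size=(150, 150))  # read and decode the JPEG, resized
x = image.img_to_array(img)   # float32 tensor of shape (150, 150, 3)
x /= 255.0                    # rescale [0, 255] -> [0, 1]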
It may seem a bit daunting, but thankfully Keras has utilities to take care of these steps automatically. Keras has a module with image processing helper
tools, located at keras.preprocessing.image . In particular, it contains the class ImageDataGenerator which allows you to quickly set up Python
generators that can automatically turn image files on disk into batches of pre-processed tensors. This is what we will use here.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale all images by 1/255
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    # This is the target directory
    train_dir,
    # All images will be resized to 150x150
    target_size=(150, 150),
    batch_size=20,
    # Since we use binary_crossentropy loss, we need binary labels
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary',
    # Keep a fixed order so predictions can be matched to filenames later
    shuffle=False)
Let's take a look at the output of one of these generators: it yields batches of 150x150 RGB images (shape (20, 150, 150, 3) ) and binary labels
(shape (20,) ). 20 is the number of samples in each batch (the batch size). Note that the generator yields these batches indefinitely: it just loops
endlessly over the images present in the target folder. For this reason, we need to break the iteration loop at some point.
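A quick way to verify these shapes, drawing a single batch and then breaking out of the loop:

for data_batch, labels_batch in train_generator:
    print('data batch shape:', data_batch.shape)      # (20, 150, 150, 3)
    print('labels batch shape:', labels_batch.shape)  # (20,)
    break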
Let's fit our model to the data using the generator. We do it using the fit_generator method, the equivalent of fit for data generators like ours (in
recent versions of Keras, fit itself accepts generators and fit_generator is deprecated). It expects as its first argument a Python generator that will
yield batches of inputs and targets indefinitely, like ours does. Because the data is being generated endlessly, the fitting process needs to know how
many batches to draw from the generator before declaring an epoch over. This is the role of the steps_per_epoch argument: after having drawn
steps_per_epoch batches from the generator, i.e. after having run for steps_per_epoch gradient descent steps, the fitting process will go to the
next epoch. In our case, batches contain 20 samples each, so it will take 100 batches until we see our target of 2000 samples.
When using fit_generator , one may pass a validation_data argument, much like with the fit method. Importantly, this argument is allowed
to be a data generator itself, but it could be a tuple of Numpy arrays as well. If you pass a generator as validation_data , then this generator is
expected to yield batches of validation data endlessly, and thus you should also specify the validation_steps argument, which tells the process
how many batches to draw from the validation generator for evaluation.
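Here is a sketch of the fitting call, consistent with the 5-epoch logs shown below; since fit accepts generators in this Keras version, we use it directly:

history = model.fit(
    train_generator,
    steps_per_epoch=100,   # 100 batches of 20 = 2000 training samples
    epochs=5,
    validation_data=validation_generator,
    validation_steps=50)   # 50 batches of 20 = 1000 validation samples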
Epoch 1/5
100/100 [==============================] - 95s 947ms/step - loss: 0.6519 - acc: 0.6185 - val_loss: 0.6311 - val_acc: 0.6330
Epoch 2/5
100/100 [==============================] - 94s 945ms/step - loss: 0.6075 - acc: 0.6550 - val_loss: 0.5970 - val_acc: 0.6910
Epoch 3/5
100/100 [==============================] - 94s 943ms/step - loss: 0.5664 - acc: 0.7090 - val_loss: 0.5729 - val_acc: 0.7010
Epoch 4/5
100/100 [==============================] - 94s 945ms/step - loss: 0.5436 - acc: 0.7170 - val_loss: 0.5963 - val_acc: 0.6650
Epoch 5/5
100/100 [==============================] - 94s 945ms/step - loss: 0.5141 - acc: 0.7425 - val_loss: 0.5647 - val_acc: 0.7140
Let's plot the loss and accuracy of the model over the training and validation data during training:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.legend(); plt.title('Training and validation accuracy')
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.legend(); plt.title('Training and validation loss')
plt.show()
In [21]: TEST_SIZE = 5

# Run the trained model over the (un-shuffled) validation generator,
# then display the first TEST_SIZE predictions alongside their images.
probabilities = model.predict(validation_generator)

for index, probability in enumerate(probabilities[:TEST_SIZE]):
    print(index, probability)
    image_path = os.path.join(validation_dir, validation_generator.filenames[index])
    img = mpimg.imread(image_path)
    plt.imshow(img)
    # Sigmoid output near 1 means class 1 ('dogs'), near 0 means class 0 ('cats')
    if probability[0] > 0.5:
        plt.title("%.2f" % (probability[0] * 100) + "% dog")
    else:
        plt.title("%.2f" % ((1 - probability[0]) * 100) + "% cat")
    plt.show()
0 [0.669663]
1 [0.4211577]
2 [0.7281866]
3 [0.2058427]
4 [0.80925035]
These plots are characteristic of overfitting. Our training accuracy increases linearly over time, until it reaches nearly 100%, while our validation
accuracy stalls at 70-72%. Our validation loss reaches its minimum after only five epochs then stalls, while the training loss keeps decreasing linearly
until it reaches nearly 0.
Because we only have relatively few training samples (2000), overfitting is going to be our number one concern. You already know about a number of
techniques that can help mitigate overfitting, such as dropout and weight decay (L2 regularization). We are now going to introduce a new one, specific
to computer vision, and used almost universally when processing images with deep learning models: data augmentation.
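As a preview, data augmentation in Keras is typically configured through the same ImageDataGenerator class we used above. The sketch below shows a plausible configuration; the specific parameter values are illustrative, not tuned:

datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,       # randomly rotate images by up to 40 degrees
    width_shift_range=0.2,   # randomly translate horizontally (fraction of width)
    height_shift_range=0.2,  # randomly translate vertically (fraction of height)
    shear_range=0.2,         # random shearing transformations
    zoom_range=0.2,          # randomly zoom inside pictures
    horizontal_flip=True)    # randomly flip half the images horizontally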