Lecture 6: Computer vision with deep learning
Agenda
Computer vision with deep learning:
1. Classification
2. Image augmentation
3. Transfer learning / fine-tuning
4. Object detection
5. Semantic segmentation
Some of the main computer vision tasks.
Each of them requires a different neural network architecture.
Classification
Convolutional neural networks
Image augmentation
The lack of data is often the main factor limiting the performance of deep learning models.
―――
Credits: DeepAugment, 2020.
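As a concrete illustration, here is a minimal sketch of an augmentation pipeline with torchvision.transforms; the particular transformations and parameter values are illustrative choices, not the ones used by DeepAugment.

```python
# A minimal augmentation pipeline sketch using torchvision.transforms;
# the transformations and their parameters are illustrative choices.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop, then resize
    transforms.RandomHorizontalFlip(),                    # flip with probability 0.5
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=15),                # small random rotations
    transforms.ToTensor(),
])
```

Because the transformations are sampled on the fly, each epoch effectively sees new variants of the same underlying images, which acts as a regularizer.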
Pre-trained models
Training a model on natural images, from scratch, takes days or weeks.
Many models trained on ImageNet are publicly available for download. These
models can be used as feature extractors or for smart initialization.
The models themselves should be considered as generic and re-usable assets.
Transfer learning
Take a pre-trained network, remove the last layer(s), and treat the rest of
the network as a fixed feature extractor.
Train a model from these features on a new task.
This often works better than handcrafted feature extraction for natural images, or
than training from the data of the new task only.
―――
Credits: Mormont et al, Comparison of deep transfer learning strategies for digital pathology, 2018.
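A minimal sketch of this recipe in PyTorch, assuming torchvision's pre-trained ResNet-50 and a hypothetical 10-class task (the weights argument follows the torchvision >= 0.13 API):

```python
# Use a ResNet-50 pre-trained on ImageNet as a fixed feature extractor.
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V2")
for param in model.parameters():
    param.requires_grad = False                     # freeze the feature extractor
model.fc = nn.Linear(model.fc.in_features, 10)      # new head, trainable by default
# Only the parameters of model.fc are then passed to the optimizer.
```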
Fine-tuning
Same as for transfer learning, but also fine-tune the weights of the pre-trained
network by continuing backpropagation.
All or only some of the layers can be tuned.
―――
Credits: Dive Into Deep Learning, 2020.
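A possible fine-tuning sketch, under the same assumptions as above, where the pre-trained layers are trained with a smaller learning rate than the new head (the 10x ratio is an illustrative choice):

```python
# Fine-tune all layers, with a reduced learning rate for the backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, 10)
backbone = [p for name, p in model.named_parameters() if not name.startswith("fc")]
optimizer = torch.optim.SGD(
    [{"params": backbone, "lr": 1e-4},              # pre-trained layers: small steps
     {"params": model.fc.parameters(), "lr": 1e-3}],  # new head: larger steps
    momentum=0.9,
)
# Training then proceeds as usual; backpropagation updates all layers.
```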
In the case of models pre-trained on ImageNet, transferred/fine-tuned networks
usually work even when the input images for the new task are not photographs of
objects or animals, such as biomedical images, satellite images or paintings.
―――
Credits: Matthia Sabatelli et al, Deep Transfer Learning for Art Classification Problems, 2018.
Object detection
The simplest strategy to move from image classification to object detection is to
classify local regions, at multiple scales and locations.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Intersection over Union (IoU)
A standard performance indicator for object detection is to evaluate the
intersection over union (IoU) between a predicted bounding box $\hat{B}$ and an
annotated bounding box $B$,
$$\text{IoU}(B, \hat{B}) = \frac{\text{area}(B \cap \hat{B})}{\text{area}(B \cup \hat{B})}.$$
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
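In code, this amounts to a few lines; a minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2) tuples:

```python
# Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Intersection rectangle (empty if the boxes do not overlap).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~ 0.143
```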
Mean Average Precision (mAP)
If $\text{IoU}(B, \hat{B})$ is larger than a fixed threshold (usually $\frac{1}{2}$), then the predicted
bounding box is valid (true positive), and wrong otherwise (false positive).
TP and FP values are accumulated for all thresholds on the predicted confidence.
The area under the resulting precision-recall curve is the average precision for the
considered class.
Recall that Precision = TP / all detections and that Recall = TP / all ground truths.
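A sketch of this computation, assuming each detection has already been matched against the ground truth (is_tp flags detections whose IoU exceeds the threshold); the mAP is then the mean of this quantity over classes:

```python
# Average precision from detections sorted by decreasing confidence.
import numpy as np

def average_precision(confidences, is_tp, n_ground_truths):
    order = np.argsort(-np.asarray(confidences))          # most confident first
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    precision = tp / (tp + fp)
    recall = tp / n_ground_truths
    # Area under the precision-recall curve (rectangle rule on recall steps).
    return np.sum(np.diff(np.concatenate([[0.0], recall])) * precision)

print(average_precision([0.9, 0.8, 0.6], [True, False, True], 2))  # ~0.833
```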
The sliding window approach evaluates a classifier at a large number of locations and
scales.
OverFeat
The complexity of the sliding window approach
was mitigated in the pioneering OverFeat network
(Sermanet et al, 2013) by adding a regression
head to predict the object bounding box
$(x, y, w, h)$.
For training, the convolutional layers are fixed
and the regression network is trained using an $\ell_2$
loss between the predicted and the true
bounding box for each example.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
The classifier head outputs a class and a confidence for each location and scale pre-
defined from a coarse grid. Each window is resized to fit the input dimensions
of the classifier.
―――
Credits: Sermanet et al, 2013.
The regression head then predicts the location of the object with respect to each
window.
―――
Credits: Sermanet et al, 2013.
These bounding boxes are finally merged with an ad-hoc greedy procedure to
produce the final predictions over a small number of objects.
―――
Credits: Sermanet et al, 2013.
The OverFeat architecture can be adapted to object detection by adding a
"background" class to the object classes.
Negative samples are taken in each scene either at random or by selecting the ones
with the worst misclassification.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Although OverFeat is one of the earliest successful networks for object detection,
its architecture comes with several drawbacks, such as the cost of evaluating many
windows at multiple scales and the separate training of its classification and
regression heads.
YOLO
YOLO (You Only Look Once; Redmon et al, 2015) models detection as a regression
problem.
It divides the image into an $S \times S$ grid and for each grid cell predicts $B$ bounding
boxes, confidence scores for those boxes, and $C$ class probabilities. These predictions
are encoded as an $S \times S \times (5B + C)$ tensor.
―――
Credits: Redmon et al, 2015.
For S = 7, B = 2, C = 20, the network predicts a vector of size 30 for each cell.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
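To make the encoding concrete, a shape-only sketch for S = 7, B = 2, C = 20 (random values stand in for real network outputs):

```python
# Decompose the S x S x (5B + C) YOLO output tensor.
import torch

S, B, C = 7, 2, 20
out = torch.randn(S * S * (5 * B + C))        # raw network output (size 7*7*30)
out = out.reshape(S, S, 5 * B + C)
boxes = out[..., :5 * B].reshape(S, S, B, 5)  # (x, y, w, h, confidence) per box
class_probs = out[..., 5 * B:]                # C class probabilities per cell
print(boxes.shape, class_probs.shape)         # (7, 7, 2, 5) and (7, 7, 20)
```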
The network predicts class scores and bounding-box regressions, and although the
output comes from fully connected layers, it has a 2D structure.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
During training, YOLO makes the assumption that any of the $S \times S$ cells contains
at most (the center of) a single object. We define, for every image, cell index
$i = 1, ..., S \times S$, predicted box $j = 1, ..., B$ and class index $c = 1, ..., C$:
- $\mathbb{1}_i^\text{obj}$ is 1 if there is an object in cell $i$, and 0 otherwise;
- $\mathbb{1}_{i,j}^\text{obj}$ is 1 if there is an object in cell $i$ and predicted box $j$ is the most fitting one, and 0 otherwise;
- $p_{i,c}$ is 1 if there is an object of class $c$ in cell $i$, and 0 otherwise;
- $x_i, y_i, w_i, h_i$ is the annotated bounding box (defined only if $\mathbb{1}_i^\text{obj} = 1$, and relative in location and scale to the cell);
- $c_{i,j}$ is the IoU between the predicted box and the ground-truth target.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
The training procedure first computes on each image the values of the $\mathbb{1}_{i,j}^\text{obj}$'s and $c_{i,j}$'s,
and then does one step to minimize the multi-part loss function
$$\begin{aligned}
&\lambda_\text{coord} \sum_{i=1}^{S \times S} \sum_{j=1}^{B} \mathbb{1}_{i,j}^\text{obj} \left( (x_i - \hat{x}_{i,j})^2 + (y_i - \hat{y}_{i,j})^2 + (\sqrt{w_i} - \sqrt{\hat{w}_{i,j}})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_{i,j}})^2 \right) \\
&+ \lambda_\text{obj} \sum_{i=1}^{S \times S} \sum_{j=1}^{B} \mathbb{1}_{i,j}^\text{obj} (c_{i,j} - \hat{c}_{i,j})^2 + \lambda_\text{noobj} \sum_{i=1}^{S \times S} \sum_{j=1}^{B} (1 - \mathbb{1}_{i,j}^\text{obj}) \, \hat{c}_{i,j}^2 \\
&+ \lambda_\text{classes} \sum_{i=1}^{S \times S} \mathbb{1}_i^\text{obj} \sum_{c=1}^{C} (p_{i,c} - \hat{p}_{i,c})^2
\end{aligned}$$
where $\hat{p}_{i,c}$, $\hat{x}_{i,j}$, $\hat{y}_{i,j}$, $\hat{w}_{i,j}$, $\hat{h}_{i,j}$ and $\hat{c}_{i,j}$ are the network outputs.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
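A possible PyTorch sketch of this loss for one image; all shapes and argument names are hypothetical, and the targets are assumed precomputed (with x, y, w, h zero-filled for cells without an object):

```python
# box_pred: (S*S, B, 5) with (x, y, w, h, c) per box; class_pred: (S*S, C);
# target: (S*S, 4) annotated boxes; obj_ij: (S*S, B) fitting-box indicator;
# obj_i: (S*S,) object indicator; c_iou: (S*S, B) IoUs; p: (S*S, C) class targets.
import torch

def yolo_loss(box_pred, class_pred, target, obj_ij, obj_i, c_iou, p,
              l_coord=5.0, l_obj=1.0, l_noobj=0.5, l_classes=1.0):
    xh, yh, wh, hh, ch = box_pred.unbind(-1)              # each of shape (S*S, B)
    x, y, w, h = (target[:, k:k + 1] for k in range(4))   # (S*S, 1), broadcast over B
    # Localization terms; square roots reduce the weight of large boxes.
    loc = ((x - xh) ** 2 + (y - yh) ** 2
           + (w.sqrt() - wh.clamp(min=0).sqrt()) ** 2
           + (h.sqrt() - hh.clamp(min=0).sqrt()) ** 2)
    loss = l_coord * (obj_ij * loc).sum()
    # Confidence: towards the IoU for fitting boxes, towards 0 elsewhere.
    loss = loss + l_obj * (obj_ij * (c_iou - ch) ** 2).sum()
    loss = loss + l_noobj * ((1 - obj_ij) * ch ** 2).sum()
    # Classification, only for cells containing an object.
    loss = loss + l_classes * (obj_i[:, None] * (p - class_pred) ** 2).sum()
    return loss
```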
Training YOLO relies on many engineering choices that illustrate well how involved
deep learning is in practice:
- use a quadratic loss not only for the bounding box coordinates, but also for the confidence and the class scores;
- reduce the weight of large bounding boxes by using the square roots of the sizes in the loss;
- reduce the importance of empty cells by weighting the confidence-related loss on them less;
- use data augmentation with scaling, translation and HSV transformations.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
YOLO in New York
Redmon, 2017.
SSD
The Single Shot MultiBox Detector (SSD; Liu et al, 2015) improves upon YOLO by
using a fully-convolutional architecture and multi-scale feature maps.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Region-based CNNs
An alternative strategy to having a huge predefined set of box proposals, as in
OverFeat or YOLO, is to rely on region proposals first extracted from the image.
R-CNN
This architecture is made of four parts:
1. a selective search algorithm that extracts region proposals from the image;
2. a pre-trained CNN, truncated before its output layer, that computes features for each proposal;
3. one SVM per class, trained on these features, that predicts the class of each region;
4. a linear regression model that refines the bounding box coordinates.
―――
Credits: Dive Into Deep Learning, 2020.
Selective search (Uijlings et al, 2013) looks at the image through windows of
different sizes, and for each size tries to group together adjacent pixels that are
similar by texture, color or intensity.
Fast R-CNN
The main performance bottleneck of an R-
CNN model is the need to independently
extract features for each proposed region.
Fast R-CNN uses the entire image as input
to the CNN for feature extraction, rather
than each proposed region.
Fast R-CNN introduces RoI pooling for
producing feature vectors of fixed size
from region proposals of different sizes.
―――
Credits: Dive Into Deep Learning, 2020.
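A small sketch of RoI pooling using torchvision.ops.roi_pool, assuming a feature map with stride 16 with respect to the input image:

```python
# Pool a region proposal into a fixed-size feature map.
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 32, 32)   # CNN features of the whole image
# One region proposal per row: (batch_index, x1, y1, x2, y2) in image coordinates.
rois = torch.tensor([[0.0, 64.0, 64.0, 256.0, 192.0]])
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([1, 256, 7, 7]) -- fixed size, whatever the RoI
```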
Faster R-CNN
The performance of both R-CNN and Fast R-CNN is tied to the quality of the
region proposals from selective search.
Faster R-CNN replaces selective search with a region proposal network.
This network reduces the number of proposed regions generated, while
ensuring precise object detection.
―――
Credits: Dive Into Deep Learning, 2020.
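torchvision ships a pre-trained Faster R-CNN; a minimal inference sketch (the weights argument assumes torchvision >= 0.13):

```python
# Run a pre-trained Faster R-CNN on a dummy image.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)           # a dummy RGB image in [0, 1]
with torch.no_grad():
    predictions = model([image])          # list with one dict per image
print(predictions[0]["boxes"].shape)      # detected boxes, with "labels" and "scores"
```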
YoloV2, Yolo 9000, SSD Mobilenet, Faster RCNN NasNet compar…
Take-home messages
- One-stage detectors (YOLO, SSD, RetinaNet, etc.) are fast for inference but are usually not the most accurate object detectors.
- Two-stage detectors (Fast R-CNN, Faster R-CNN, R-FCN, Light-Head R-CNN, etc.) are usually slower but are often more accurate.
- All networks depend on lots of engineering decisions.
Segmentation
Semantic segmentation is the task of partitioning an image into regions of different
semantic categories.
These semantic regions label and predict objects at the pixel level.
―――
Credits: Dive Into Deep Learning, 2020.
Fully convolutional networks
The historical approach to image segmentation was to define a measure of
similarity between pixels, and to cluster groups of similar pixels. Such approaches
account poorly for semantic content.
―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
―――
Credits: CS231n, Lecture 11, 2018.
Transposed convolution
The convolution and pooling layers introduced so far often reduce the input width
and height, or keep them unchanged.
Semantic segmentation requires predicting a value for each pixel, and therefore
needs to increase the input width and height.
Fully connected layers could be used for that purpose but would face the same
limitations as before (spatial specialization, too many parameters).
Ideally, we would like layers that implement the inverse of convolutional and
pooling layers.
Transposed convolution
A transposed convolution is a convolution where the implementations of the
forward and backward passes are swapped.
For example, for a $3 \times 3$ kernel, a $2 \times 2$ input $v(x)$ and a $4 \times 4$ output $v(h)$,
$$U^T v(x) = \begin{bmatrix}
1 & 0 & 0 & 0 \\
4 & 1 & 0 & 0 \\
1 & 4 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
4 & 1 & 4 & 1 \\
3 & 4 & 1 & 4 \\
0 & 3 & 0 & 1 \\
3 & 0 & 1 & 0 \\
3 & 3 & 4 & 1 \\
1 & 3 & 3 & 4 \\
0 & 1 & 0 & 3 \\
0 & 0 & 3 & 0 \\
0 & 0 & 3 & 3 \\
0 & 0 & 1 & 3 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} 2 \\ 1 \\ 4 \\ 4 \end{bmatrix}
=
\begin{bmatrix} 2 \\ 9 \\ 6 \\ 1 \\ 6 \\ 29 \\ 30 \\ 7 \\ 10 \\ 29 \\ 33 \\ 13 \\ 12 \\ 24 \\ 16 \\ 4 \end{bmatrix}
= v(h)$$
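As a quick sanity check, the example above can be reproduced with PyTorch's conv_transpose2d, reading the 3x3 kernel off the columns of the matrix:

```python
# Verify the matrix example numerically with a transposed convolution.
import torch
import torch.nn.functional as F

x = torch.tensor([[2.0, 1.0], [4.0, 4.0]]).reshape(1, 1, 2, 2)
u = torch.tensor([[1.0, 4.0, 1.0],
                  [1.0, 4.0, 3.0],
                  [3.0, 3.0, 1.0]]).reshape(1, 1, 3, 3)
h = F.conv_transpose2d(x, u)
print(h.reshape(4, 4))
# tensor([[ 2.,  9.,  6.,  1.],
#         [ 6., 29., 30.,  7.],
#         [10., 29., 33., 13.],
#         [12., 24., 16.,  4.]])
```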
FCNs for segmentation
The simplest design of a fully convolutional network for semantic segmentation
consists in a convolutional network producing a coarse feature map, followed by a
$1 \times 1$ convolution mapping the features to per-class scores and a transposed
convolution upsampling the scores back to the input resolution, as in the sketch below.
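A minimal sketch along these lines, assuming a ResNet-18 backbone (stride 32) and a hypothetical 21-class task; the kernel size, stride and padding of the transposed convolution are chosen so that it upsamples exactly by 32:

```python
# A bare-bones fully convolutional network for segmentation.
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = nn.Sequential(*list(resnet18(weights="DEFAULT").children())[:-2])
head = nn.Sequential(
    nn.Conv2d(512, 21, kernel_size=1),  # features -> per-class scores
    nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32, padding=16),  # x32 up
)
x = torch.rand(1, 3, 320, 480)
scores = head(backbone(x))
print(scores.shape)  # torch.Size([1, 21, 320, 480]) -- one score map per class
```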
―――
Credits: Noh et al, 2015.
UNet
The UNet architecture builds upon the previous FCN architecture: a contracting path is connected to an expanding path by skip connections, which reinject the fine spatial details lost during downsampling.
―――
Credits: Ronneberger et al, 2015.
Mask R-CNN
Mask R-CNN extends the Faster R-CNN model to instance segmentation: in addition to the class and bounding box of each detected object, it predicts a binary segmentation mask.
―――
Credits: He et al, 2017.
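As with Faster R-CNN, torchvision provides a pre-trained Mask R-CNN; a minimal inference sketch (weights argument as in torchvision >= 0.13):

```python
# Run a pre-trained Mask R-CNN on a dummy image.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
with torch.no_grad():
    out = model([torch.rand(3, 480, 640)])[0]
# One mask per detection, in addition to boxes, labels and scores.
print(out["boxes"].shape, out["masks"].shape)  # masks: (N, 1, 480, 640)
```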
Mask RCNN - COCO - instance segmentation
Some final comments
For detection and semantic segmentation, there is a heavy use of transfer
learning and fine-tuning: large networks trained on classification problems
are re-used.
Take-home message
The models themselves, as much as the source code of the algorithm that
produced them, or the training data, are generic and re-usable assets.
Thank you!