
Deep Learning

Lecture 6: Computer vision

Prof. Stéphane Gaïffas


https://stephanegaiffas.github.io

Agenda
Computer vision with deep learning:

1. Classification
2. Image augmentation
3. Transfer learning / fine-tuning
4. Object detection
5. Semantic segmentation

Some of the main computer vision tasks.
Each of them requires a different neural network architecture.

Classification
Convolutional neural networks

Convolutional neural networks combine convolution, pooling and fully
connected layers.
They achieve state-of-the-art results for spatially structured data, such as
images, sound or text.

For classification:

the activation of the output layer is a softmax producing a vector p
in the simplex of probability estimates P[Y = c ∣ x] for c = 1, …, C, where
C is the number of classes and x is the input image;

the loss function is the cross-entropy loss.
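A minimal PyTorch sketch of this setup (the architecture and the class count C = 10 are placeholders, not the lecture's exact model):

```python
import torch
import torch.nn as nn

C = 10  # hypothetical number of classes
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, C),                  # output layer: C logits
)

x = torch.randn(8, 3, 32, 32)          # a batch of input images
logits = model(x)                      # shape (8, C)
probs = torch.softmax(logits, dim=1)   # points in the probability simplex
# nn.CrossEntropyLoss applies log-softmax + negative log-likelihood internally
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, C, (8,)))
```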

Image augmentation
The lack of data is the biggest limit for the performance of deep learning models.

Image augmentation is a form of data augmentation for images:

Collecting more data is usually expensive and laborious.
Synthesizing data is complicated and may not represent the true distribution.
Augmenting the data with base transformations is simple and efficient (e.g., as
demonstrated with AlexNet).
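A sketch of such base transformations with torchvision (the exact pipeline below is an assumption, in the spirit of AlexNet-style crops and flips):

```python
import torchvision.transforms as T

# Random crops, flips and color jitter, applied on the fly during training,
# so each epoch sees a different random variant of every image.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
])
```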

―――
Credits: DeepAugment, 2020.
Image augmentation

―――
Credits: DeepAugment, 2020.
Pre-trained models
Training a model on natural images, from scratch, takes days or weeks.
Many models trained on ImageNet are publicly available for download. These
models can be used as feature extractors or for smart initialization.
The models themselves should be considered as generic and re-usable assets.

Transfer learning
Take a pre-trained network, remove the last layer(s) and then treat the rest of
the network as a fixed feature extractor.
Train a model from these features on a new task.
Often better than handcrafted feature extraction for natural images, or better
than training from the data of the new task only.
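A minimal sketch with an ImageNet pre-trained ResNet-18 as the frozen extractor (the backbone choice and the 5-class head are placeholders):

```python
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad = False           # freeze: the features are not updated
# replace the last layer with a head for the new task (here 5 classes)
backbone.fc = nn.Linear(backbone.fc.in_features, 5)
# Only backbone.fc has trainable parameters now; train it on the new task.
```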

―――
Credits: Mormont et al, Comparison of deep transfer learning strategies for digital pathology, 2018.
Fine-tuning

Same as for transfer learning, but also fine-tune the weights of the pre-trained
network by continuing backpropagation.
All or only some of the layers can be tuned.
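A hedged sketch of fine-tuning (the learning rates and the 5-class head are placeholders; a common choice is smaller steps on the pre-trained layers than on the freshly initialized head):

```python
import torch
import torchvision.models as models

net = models.resnet18(weights="IMAGENET1K_V1")
net.fc = torch.nn.Linear(net.fc.in_features, 5)  # hypothetical 5-class task
# all layers stay trainable; backpropagation continues through the whole network
optimizer = torch.optim.SGD([
    {"params": [p for n, p in net.named_parameters() if not n.startswith("fc")],
     "lr": 1e-4},                                 # pre-trained layers: small steps
    {"params": net.fc.parameters(), "lr": 1e-2},  # new head: larger steps
], momentum=0.9)
```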

―――
Credits: Dive Into Deep Learning, 2020.
In the case of models pre-trained on ImageNet, transferred/fine-tuned networks
usually work even when the input images for the new task are not photographs of
objects or animals, such as biomedical images, satellite images or paintings.

―――
Credits: Matthia Sabatelli et al, Deep Transfer Learning for Art Classification Problems, 2018.
Object detection

The simplest strategy to move from image classification to object detection is to
classify local regions, at multiple scales and locations.

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Intersection over Union (IoU)
A standard performance indicator for object detection is to evaluate the
intersection over union (IoU) between a predicted bounding box $\hat{B}$ and an
annotated bounding box $B$,

$$\mathrm{IoU}(B, \hat{B}) = \frac{\mathrm{area}(B \cap \hat{B})}{\mathrm{area}(B \cup \hat{B})}.$$

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Mean Average Precision (mAP)
If $\mathrm{IoU}(B, \hat{B})$ is larger than a fixed threshold (usually 1/2), then the predicted
bounding box is valid (true positive) and wrong otherwise (false positive).

TP and FP values are accumulated for all thresholds on the predicted confidence.
The area under the resulting precision-recall curve is the average precision for the
considered class.

The mean over the classes is the mean average precision.

Recall that Precision = TP / all detections and that Recall = TP / all ground truths.
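A sketch of the computation for one class, assuming each detection has already been matched and labeled TP or FP (real toolkits additionally interpolate the precision envelope):

```python
import numpy as np

def average_precision(scores, is_tp, n_ground_truths):
    # sort detections by decreasing confidence, then accumulate TP/FP counts
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    recall = tp / n_ground_truths
    precision = tp / (tp + fp)
    # area under the precision-recall curve (rectangular rule)
    return float(np.sum(np.diff(np.concatenate([[0.0], recall])) * precision))
```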
The sliding window approach evaluates a classifier at a large number of locations
and scales.

This approach is usually very computationally expensive, as performance directly
depends on the resolution and number of the windows fed to the classifier (the
more the better, but also the more costly).

OverFeat
The complexity of the sliding window approach was mitigated in the pioneering
OverFeat network (Sermanet et al, 2013) by adding a regression head to predict
the object bounding box (x, y, w, h).
For training, the convolutional layers are fixed and the regression network is
trained using an ℓ2 loss between the predicted and the true bounding box for
each example.

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
The classifier head outputs a class and a confidence for each location and scale
pre-defined from a coarse grid. Each window is resized to fit the input dimensions
of the classifier.

―――
Credits: Sermanet et al, 2013.
The regression head then predicts the location of the object with respect to each
window.

―――
Credits: Sermanet et al, 2013.
These bounding boxes are finally merged with an ad hoc greedy procedure to
produce the final predictions over a small number of objects.

―――
Credits: Sermanet et al, 2013.
The OverFeat architecture can be adapted to object detection by adding a
"background" class to the object classes.

Negative samples are taken in each scene either at random or by selecting the ones
with the worst misclassification.

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Although OverFeat is one of the earliest successful networks for object detection,
its architecture comes with several drawbacks:

it is a disjoint system (2 disjoint heads with their respective losses, ad hoc
merging procedure);
it optimizes for localization rather than detection;
it cannot reason about global context and thus requires significant
post-processing to produce coherent detections.

YOLO

YOLO (You Only Look Once; Redmon et al, 2015) models detection as a regression
problem.

It divides the image into an S × S grid and for each grid cell predicts B bounding
boxes, confidence scores for those boxes, and C class probabilities. These
predictions are encoded as an S × S × (5B + C) tensor.

―――
Credits: Redmon et al, 2015.
For S = 7, B = 2, C = 20, the network predicts a vector of size 30 for each cell.
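A sketch of how this tensor can be indexed (the cell indices are arbitrary):

```python
import torch

S, B, C = 7, 2, 20
out = torch.zeros(S, S, 5 * B + C)     # the (7, 7, 30) prediction tensor
cell = out[3, 4]                       # the 30-vector of one grid cell:
boxes = cell[:5 * B].reshape(B, 5)     # B boxes, each (x, y, w, h, confidence)
class_probs = cell[5 * B:]             # C conditional class probabilities
```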

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
The network predicts class scores and bounding-box regressions, and although the
output comes from fully connected layers, it has a 2D structure.

Unlike sliding window techniques, YOLO is therefore capable of reasoning
globally about the image when making predictions.
It sees the entire image during training and test time, so it implicitly encodes
contextual information about classes as well as their appearance.

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
During training, YOLO makes the assumption that each of the S × S cells contains
at most (the center of) a single object. We define, for every image, cell index
i = 1, ..., S × S, predicted box j = 1, ..., B and class index c = 1, ..., C:

$\mathbf{1}^{\text{obj}}_{i}$ is 1 if there is an object in cell i, and 0 otherwise;

$\mathbf{1}^{\text{obj}}_{i,j}$ is 1 if there is an object in cell i and predicted box j is the most fitting one,
and 0 otherwise;

$p_{i,c}$ is 1 if there is an object of class c in cell i, and 0 otherwise;

$x_i, y_i, w_i, h_i$ is the annotated bounding box (defined only if $\mathbf{1}^{\text{obj}}_{i} = 1$, and relative
in location and scale to the cell);

$c_{i,j}$ is the IoU between the predicted box and the ground-truth target.

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
The training procedure first computes on each image the values of the $\mathbf{1}^{\text{obj}}_{i,j}$'s
and $c_{i,j}$'s, and then does one step to minimize the multi-part loss function

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=1}^{S\times S}\sum_{j=1}^{B} \mathbf{1}^{\text{obj}}_{i,j}
\left[(x_i-\hat{x}_{i,j})^2+(y_i-\hat{y}_{i,j})^2
+\left(\sqrt{w_i}-\sqrt{\hat{w}_{i,j}}\right)^2
+\left(\sqrt{h_i}-\sqrt{\hat{h}_{i,j}}\right)^2\right]\\
&\quad+\lambda_{\text{obj}} \sum_{i=1}^{S\times S}\sum_{j=1}^{B} \mathbf{1}^{\text{obj}}_{i,j}\,(c_{i,j}-\hat{c}_{i,j})^2
+\lambda_{\text{noobj}} \sum_{i=1}^{S\times S}\sum_{j=1}^{B} \left(1-\mathbf{1}^{\text{obj}}_{i,j}\right)\hat{c}_{i,j}^{\,2}\\
&\quad+\lambda_{\text{classes}} \sum_{i=1}^{S\times S} \mathbf{1}^{\text{obj}}_{i} \sum_{c=1}^{C}(p_{i,c}-\hat{p}_{i,c})^2
\end{aligned}
$$

where $\hat{p}_{i,c}$, $\hat{x}_{i,j}$, $\hat{y}_{i,j}$, $\hat{w}_{i,j}$, $\hat{h}_{i,j}$ and $\hat{c}_{i,j}$ are the network outputs.

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Training YOLO relies on many engineering choices that illustrate well how involved
deep learning is in practice:

pre-train the first 20 convolutional layers on ImageNet classification;

use 448 × 448 inputs for detection, instead of 224 × 224;

use Leaky ReLUs for all layers;

add dropout after the first fully connected layer;

normalize the bounding box parameters in [0, 1];

use a quadratic loss not only for the bounding box coordinates, but also for the
confidence and the class scores;

reduce the weight of large bounding boxes by using the square roots of the sizes in
the loss;

reduce the importance of empty cells by weighting the confidence-related
loss on them less;

data augmentation with scaling, translation and HSV transformation.

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
YOLO in New York

Redmon, 2017.

SSD
The Single Shot MultiBox Detector (SSD; Liu et al, 2015) improves upon YOLO by
using a fully-convolutional architecture and multi-scale feature maps.

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Region-based CNNs
An alternative strategy to having a huge predefined set of box proposals, as in
OverFeat or YOLO, is to rely on region proposals first extracted from the image.

The main family of architectures following this principle are region-based
convolutional neural networks:

(Slow) R-CNN (Girshick et al, 2014)
Fast R-CNN (Girshick et al, 2015)
Faster R-CNN (Ren et al, 2015)
Mask R-CNN (He et al, 2017)

R-CNN
This architecture is made of four parts:

1. Selective search is performed on the input image to select multiple high-quality
region proposals.
2. A pre-trained CNN, truncated before its output layer, is used as a feature
extractor. Each proposed region is resized to the input dimensions required by the
network, and a forward pass outputs features for the proposal.
3. The features are fed to an SVM for predicting the class.
4. The features are fed to a linear regression model for predicting the bounding
box.

―――
Credits: Dive Into Deep Learning, 2020.
Selective search (Uijlings et al, 2013) looks at the image through windows of
different sizes, and for each size tries to group together adjacent pixels that are
similar by texture, color or intensity.

Fast R-CNN
The main performance bottleneck of an R-CNN model is the need to independently
extract features for each proposed region.
Fast R-CNN uses the entire image as input to the CNN for feature extraction,
rather than each proposed region.
Fast R-CNN introduces RoI pooling for producing feature vectors of fixed size
from region proposals of different sizes.
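torchvision ships this operator; a sketch with placeholder feature-map and box values:

```python
import torch
from torchvision.ops import roi_pool

# One shared feature map for the whole image, many variable-size regions,
# each pooled to a fixed 7×7 grid of features.
features = torch.randn(1, 256, 32, 32)           # CNN output for the entire image
rois = torch.tensor([[0, 4.0, 4.0, 20.0, 16.0],  # (batch_index, x1, y1, x2, y2)
                     [0, 0.0, 0.0, 31.0, 31.0]])
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0)
# pooled.shape == (2, 256, 7, 7): a fixed-size feature vector per proposal
```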

―――
Credits: Dive Into Deep Learning, 2020.
Faster R-CNN

The performance of both R-CNN and Fast R-CNN is tied to the quality of the
region proposals from selective search.
Faster R-CNN replaces selective search with a region proposal network.
This network reduces the number of proposed regions generated, while
ensuring precise object detection.
―――
Credits: Dive Into Deep Learning, 2020.
YOLO (v2) vs YOLO 9000 vs SSD Mobilenet vs Faster RCNN NasNet (video comparison)

Take-home messages
One-stage detectors (YOLO, SSD, RetinaNet, etc) are fast for inference but are
usually not the most accurate object detectors.
Two-stage detectors (Fast R-CNN, Faster R-CNN, R-FCN, Light head R-CNN,
etc) are usually slower but are often more accurate.
All networks depend on lots of engineering decisions.

Segmentation

Semantic segmentation is the task of partitioning an image into regions of different
semantic categories.

These semantic regions label and predict objects at the pixel level.

―――
Credits: Dive Into Deep Learning, 2020.
Fully convolutional networks
The historical approach to image segmentation was to define a measure of
similarity between pixels, and to cluster groups of similar pixels. Such approaches
account poorly for semantic content.

The deep-learning approach re-casts semantic segmentation as pixel classification,
and re-uses networks trained for image classification by making them fully
convolutional (FCNs).

―――
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
―――
Credits: CS231n, Lecture 11, 2018.
―――
Credits: CS231n, Lecture 11, 2018.
Transposed convolution
The convolution and pooling layers introduced so far often reduce the input width
and height, or keep them unchanged.

Semantic segmentation requires predicting values for each pixel, and therefore
needs to increase the input width and height.
Fully connected layers could be used for that purpose but would face the same
limitations as before (spatial specialization, too many parameters).
Ideally, we would like layers that implement the inverse of convolutional and
pooling layers.

Transposed convolution
A transposed convolution is a convolution where the implementations of the
forward and backward passes are swapped.

Given a convolutional kernel u, with U the matrix such that the associated
convolution computes $v(y) = U\, v(x)$ on flattened inputs:

the forward pass is implemented as $v(h) = U^T v(x)$ with appropriate
reshaping, thereby effectively up-sampling an input $v(x)$ into a larger one;

the backward pass is computed by multiplying the loss by $U$ instead of $U^T$.

Transposed convolutions are also referred to as deconvolutions (but this is
misleading...).

[Diagram: the input x is flattened, multiplied by $U^T$, and reshaped into the
up-sampled output h.]
Transposed convolution

With the flattened 2 × 2 input $v(x) = (2, 1, 4, 4)^T$, the product $U^T v(x) = v(h)$
gives the flattened 4 × 4 output (here U is the 4 × 16 matrix of the 3 × 3 kernel
u = [[1, 4, 1], [1, 4, 3], [3, 3, 1]] convolved over a 4 × 4 input):

$$
\begin{pmatrix}
1&0&0&0\\
4&1&0&0\\
1&4&0&0\\
0&1&0&0\\
1&0&1&0\\
4&1&4&1\\
3&4&1&4\\
0&3&0&1\\
3&0&1&0\\
3&3&4&1\\
1&3&3&4\\
0&1&0&3\\
0&0&3&0\\
0&0&3&3\\
0&0&1&3\\
0&0&0&1
\end{pmatrix}
\begin{pmatrix}2\\1\\4\\4\end{pmatrix}
=
\begin{pmatrix}2\\9\\6\\1\\6\\29\\30\\7\\10\\29\\33\\13\\12\\24\\16\\4\end{pmatrix}
$$
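The same computation can be checked with PyTorch's conv_transpose2d (the kernel and input below are the ones reconstructed from the example above):

```python
import torch
import torch.nn.functional as F

# the 2×2 input is up-sampled to 4×4 by the 3×3 kernel u
x = torch.tensor([[2., 1.], [4., 4.]]).reshape(1, 1, 2, 2)
u = torch.tensor([[1., 4., 1.],
                  [1., 4., 3.],
                  [3., 3., 1.]]).reshape(1, 1, 3, 3)
h = F.conv_transpose2d(x, u)
# h[0, 0]:
# [[ 2.,  9.,  6.,  1.],
#  [ 6., 29., 30.,  7.],
#  [10., 29., 33., 13.],
#  [12., 24., 16.,  4.]]
```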
FCNs for segmentation
The simplest design of a fully convolutional network for
semantic segmentation consists in:

using a (pre-trained) convolutional network for downsampling and extracting
image features;

replacing the dense layers with a 1 × 1 convolution layer to transform the
number of channels into the number of categories;

upsampling the feature map to the size of the input image by using one (or
several) transposed convolution layer(s).

Contrary to fully connected networks, the dimensions of the output of a fully
convolutional network are not fixed: they directly depend on the dimensions of
the input, which can be images of arbitrary sizes.
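A minimal sketch of such a design (the ResNet-18 backbone and the layer sizes are assumptions, not the lecture's exact model):

```python
import torch.nn as nn
import torchvision.models as models

n_classes = 21
# pre-trained network, truncated before global pooling: (N,3,H,W) -> (N,512,H/32,W/32)
backbone = nn.Sequential(*list(models.resnet18(weights="IMAGENET1K_V1").children())[:-2])
head = nn.Sequential(
    nn.Conv2d(512, n_classes, kernel_size=1),        # channels -> categories
    nn.ConvTranspose2d(n_classes, n_classes,         # ×32 up-sampling back to
                       kernel_size=64, stride=32,    # the input resolution
                       padding=16),
)
# x: (N, 3, H, W) -> backbone -> head: (N, n_classes, H, W)
```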

―――
Credits: Noh et al, 2015.
UNet
The UNet architecture builds upon the previous FCN architecture.

It consists in symmetric contraction and expansion paths, along with a
concatenation of high-resolution features from the contracting path to the
upsampled features from the expanding path. These connections allow for
localization.
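A sketch of one expansion step with its concatenation (the channel sizes are placeholders):

```python
import torch
import torch.nn as nn

# up-sample, then concatenate the matching high-resolution features from
# the contracting path before convolving again
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
conv = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())

x = torch.randn(1, 128, 16, 16)        # features from the expanding path
skip = torch.randn(1, 64, 32, 32)      # features from the contracting path
x = up(x)                              # (1, 64, 32, 32)
x = conv(torch.cat([skip, x], dim=1))  # concatenation enables localization
```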

―――
Credits: Ronneberger et al, 2015.
Mask R-CNN

Mask R-CNN extends the Faster R-CNN model for semantic segmentation.

The RoI pooling layer is replaced with an RoI alignment layer.
It branches off to an FCN for predicting a segmentation mask.
Object detection combined with mask prediction enables instance
segmentation.
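torchvision also ships RoI alignment; a sketch with placeholder values:

```python
import torch
from torchvision.ops import roi_align

# Like RoI pooling, but samples the feature map with bilinear interpolation
# instead of snapping to the grid, preserving the pixel-level correspondence
# a mask head needs.
features = torch.randn(1, 256, 32, 32)
rois = torch.tensor([[0, 4.7, 5.3, 20.1, 16.9]])  # fractional coordinates are fine
aligned = roi_align(features, rois, output_size=(14, 14),
                    spatial_scale=1.0, sampling_ratio=2)
# aligned.shape == (1, 256, 14, 14)
```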

―――
Credits: He et al, 2017.
Mask RCNN - COCO - instance segmentation

Some final comments
For detection and semantic segmentation, there is heavy use of transfer
learning and fine-tuning: re-use of large networks trained on classification
problems.

Tons of engineering, many crucial details.

Take-home message

The models themselves, as much as the source code of the algorithm that
produced them, or the training data, are generic and re-usable assets.

Transfer learning is crucial, but somewhat under-studied.

There is no comparably successful transfer learning outside of deep learning.

Thank you!

