Military AI-Week 05-AI in Computer Vision
AI in Computer Vision
Presentation Outline
❖ Basic understanding of Convolutional Neural Networks (CNNs)
❖ Usage of CNNs in computer vision (CV)
➢ Image Classification
➢ Semantic Segmentation
➢ Object Detection
Discrete cross-correlation: 2-D example
Can be viewed as a “sliding window” operation: the 3x3 kernel is placed over every 3x3 window of the 5x5 input, the overlapping values are multiplied element-wise, and the products are summed to give one output value.
Input (5x5):
1 0 0 1 2
0 0 0 3 0
0 1 2 1 1
1 1 3 0 0
3 0 0 0 1
Kernel (3x3):
0 0 1
0 2 0
1 1 0
Output (3x3):
1 4 11
4 11 5
7 7 1
For example, the top-right output value is 1x2 + 2x3 + 1x2 + 1x1 = 11 (only the non-zero kernel entries contribute).
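The sliding-window computation can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no padding and a stride of 1; the function name cross_correlate2d is chosen here for illustration and is not from the slides.

```python
def cross_correlate2d(image, kernel):
    """Valid-mode 2-D cross-correlation (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):          # top row of the current window
        row = []
        for j in range(out_w):      # left column of the current window
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Input and kernel from the 2-D example in the slides.
image = [
    [1, 0, 0, 1, 2],
    [0, 0, 0, 3, 0],
    [0, 1, 2, 1, 1],
    [1, 1, 3, 0, 0],
    [3, 0, 0, 0, 1],
]
kernel = [
    [0, 0, 1],
    [0, 2, 0],
    [1, 1, 0],
]

result = cross_correlate2d(image, kernel)
for row in result:
    print(row)
# [1, 4, 11]
# [4, 11, 5]
# [7, 7, 1]
```

A 5x5 input and a 3x3 kernel give a 3x3 output because the window fits in 5 - 3 + 1 = 3 positions along each axis.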
Problem Definition: ANN vs CNN (Convolutional Neural Network)
ANN (MLP) versus CNN in Image Analysis
[Figure: ANN architecture (hidden layer, classification/output layer) next to a simple CNN architecture]
MLP Problem in Image Analysis
https://towardsdatascience.com/simple-introduction-to-convolutional-neural-networks-cdf8d3077bac
CNN in Image Analysis
In images:
• Pixel position and neighborhood have semantic meanings.
• Elements of interest can appear anywhere in the image.
CNNs:
• Analyze the influence of nearby pixels using a kernel.
• Extract features of the image; the result is called a feature map.
https://towardsdatascience.com/simple-introduction-to-convolutional-neural-networks-cdf8d3077bac
Visualizing convolution activations
• Feature map
• Lower resolution after pooling
• Not as easy to interpret…
Visualizing learned convolutions
Learned convolutions from first layer of AlexNet:
• Different orientations
• Different spatial frequencies
• Different colour contrasts
Krizhevsky et al. Imagenet classification with deep convolutional neural networks. NIPS 2012.
Biological connections
Several principles of ConvNets were inspired by elements of neuroscience models of the primary …
• Different scales
• Different orientations
• Different colour contrasts
Image credit: Wandell. Foundations of Vision: https://foundationsofvision.stanford.edu/
DTS AI 2019
Notable difference ANN vs CNN
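• CNNs are used mainly for pattern recognition within images, which lets us encode image-specific features into the architecture.
• A plain ANN struggles with the computational complexity of image data, because every neuron in the first hidden layer needs one weight per input value.
• Ex. MNIST 28 x 28 greyscale input: 28 x 28 x 1 = 784 weights per neuron in the first hidden layer.
• 64 x 64 colour input: 64 x 64 x 3 = 12288 weights per neuron in the first hidden layer.
Just increase the number of hidden layers? No, a huge ANN causes problems:
• We do not have unlimited computational power and time to train it.
• The network is likely to overfit: it fits the training data too closely and generalizes poorly.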
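The weight counts above, and the saving obtained from a shared convolution kernel, can be checked with a short calculation. This is a sketch only: the 3x3 kernel size and the 32 filters are hypothetical values chosen for illustration, not figures from the slides.

```python
def dense_weights_per_neuron(height, width, channels):
    """Weights one fully connected neuron needs to see the whole image."""
    return height * width * channels


def conv_layer_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter; independent of the image size."""
    return kernel_size * kernel_size * in_channels * filters + filters


mnist = dense_weights_per_neuron(28, 28, 1)        # 784
colour_64 = dense_weights_per_neuron(64, 64, 3)    # 12288

# Hypothetical convolutional layer: 32 filters of size 3x3 on an RGB input.
conv = conv_layer_params(3, 3, 32)                 # 896

print(mnist, colour_64, conv)
```

A single dense neuron on the 64 x 64 colour image already needs more weights than the entire hypothetical convolutional layer, because the convolution reuses the same small kernel at every image position.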
Sample Architecture
CNN Visualization
https://www.youtube.com/watch?v=f0t-OCG79-U
Usage of CNN in Image Processing
CNNs for Image Classification
Image Classification
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep
convolutional neural networks." Advances in neural information processing systems. 2012.
GoogLeNet
• Key points in GoogLeNet:
• Bigger and deeper than previous CNN architectures.
• Uses the Inception module, which lets GoogLeNet use parameters more efficiently:
• Better performance despite roughly 10 times fewer parameters (about 6 million) than AlexNet (about 60 million).
• The Inception module proposed in the GoogLeNet architecture performs feature extraction using a combination of 1x1, 3x3, and 5x5 filters applied in parallel, to cover patterns of various sizes in the image.
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2015.
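To see why mixing filter sizes can save parameters, the sketch below counts the weights of three parallel branches (1x1, 3x3, 5x5) and compares them with a single 5x5 layer that produces the same number of feature maps. It is a simplified illustration: the channel counts are hypothetical, and the pooling branch and 1x1 reduction layers of the actual Inception module are left out.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


in_channels = 192  # hypothetical number of input feature maps

# Hypothetical branches: name -> (kernel size, number of filters).
branches = {"1x1": (1, 64), "3x3": (3, 128), "5x5": (5, 32)}

# The branch outputs are concatenated along the channel axis.
out_channels = sum(filters for _, filters in branches.values())

branch_weights = {
    name: conv_weights(k, in_channels, filters)
    for name, (k, filters) in branches.items()
}
inception_style_total = sum(branch_weights.values())

# Baseline: one 5x5 layer producing the same number of output feature maps.
single_5x5_total = conv_weights(5, in_channels, out_channels)

print(out_channels, inception_style_total, single_5x5_total)
```

With these assumed sizes the parallel branches produce 224 feature maps with 387072 weights, against 1075200 weights for the single 5x5 layer.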
GoogLeNet’s Inception Module
Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2015.
VGGNet
• Deeper and larger (has more parameters).
• VGGNet was the runner-up of the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC); the winner was GoogLeNet.
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image
recognition." arXiv preprint arXiv:1409.1556 (2014).
Residual Network
• Note: At the beginning of training, a residual unit (RU) is an identity function (or almost an identity function), because at initialization almost all weights in its convolutional layers are close to 0 (zero), so the output is dominated by the skip connection that copies the input. This makes RUs easy and fast to train.
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
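The identity behaviour of a residual unit at initialization can be illustrated with a toy example. This is a sketch under simplifying assumptions: a small weighted sum stands in for the convolutional layers, and the function names and weight values are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(x, weights):
    # Stand-in for a convolutional layer: each output is a weighted sum of the inputs.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def plain_unit(x, weights):
    """Ordinary layer: output depends only on the weighted sum."""
    return relu(linear(x, weights))


def residual_unit(x, weights):
    """Residual unit: the input is added back through the skip connection."""
    fx = linear(x, weights)
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0, 3.0]
near_zero = [[0.001] * 3 for _ in range(3)]   # weights close to 0, as at initialization

plain_out = plain_unit(x, near_zero)          # close to [0, 0, 0]
residual_out = residual_unit(x, near_zero)    # close to x

print(plain_out)
print(residual_out)
```

With near-zero weights the plain unit outputs almost nothing, while the residual unit passes its input through almost unchanged, so the signal can flow through a deep stack from the first training step.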
Extreme Inception (Xception)
Convolutional layer
• Kernels with size > 1 perform computations at the spatial and cross-channel levels.
• A kernel with a size of 1x1 only performs cross-channel computation.
Depth-wise separable convolutional layer
• Performs spatial computation for each input channel/feature map separately.
Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the
IEEE conference on computer vision and pattern recognition. 2017.
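Separating the spatial step from the cross-channel step reduces the number of weights considerably. The comparison below is a sketch that assumes the depth-wise step is followed by a 1x1 cross-channel convolution; the channel counts (64 in, 128 out) are hypothetical.

```python
def standard_conv_weights(kernel_size, in_channels, out_channels):
    """One kernel_size x kernel_size x in_channels kernel per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_weights(kernel_size, in_channels, out_channels):
    """Depth-wise spatial filters (one per input channel) plus a 1x1 cross-channel layer."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


standard = standard_conv_weights(3, 64, 128)     # 73728
separable = separable_conv_weights(3, 64, 128)   # 8768

print(standard, separable)
```

For these assumed sizes the separable layer uses roughly 8 times fewer weights (biases ignored).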
Xception: Extreme Inception Module
• 3x3 depth-wise convolutional layer
• 1x1 convolutional layer (cross-channel only)
Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the
IEEE conference on computer vision and pattern recognition. 2017.
Squeeze-and-Excitation Network (SENet)
• The Squeeze-and-Excitation Network (SENet) is the winner of the ILSVRC 2017 challenge; it proposes the squeeze-and-excitation (SE) block.
• The SE block recalibrates the feature maps output by the previous layer by multiplying each one by an additional scaling weight.
• The weights that produce the scaling values are learned inside the SE block during training; the block then computes and applies the scaling values during testing.
• The SE block does not pay attention to spatial patterns; it focuses on which feature maps tend to be active together during training (strongly correlated feature maps).
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
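The recalibration performed by an SE block can be sketched in plain Python: average each feature map (squeeze), pass the averages through two small dense layers (excitation), and multiply each feature map by its resulting scaling value. The weights below are hypothetical fixed numbers; in a real network they are learned.

```python
import math


def squeeze(feature_maps):
    """Global average pooling: one number per feature map, spatial layout ignored."""
    return [
        sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
        for fm in feature_maps
    ]


def dense(x, weights, activation):
    return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def se_block(feature_maps, w_reduce, w_expand):
    z = squeeze(feature_maps)
    hidden = dense(z, w_reduce, lambda v: max(0.0, v))                        # ReLU
    scales = dense(hidden, w_expand, lambda v: 1.0 / (1.0 + math.exp(-v)))    # sigmoid
    scaled = [
        [[s * v for v in row] for row in fm]
        for fm, s in zip(feature_maps, scales)
    ]
    return scaled, scales


# Two 2x2 feature maps and hypothetical weights (2 -> 1 -> 2 units).
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
w_reduce = [[0.5, -0.5]]
w_expand = [[2.0], [-2.0]]

scaled, scales = se_block(feature_maps, w_reduce, w_expand)
print(scales)
```

Each scaling value lies between 0 and 1, so the block can only damp feature maps relative to one another; the spatial pattern inside each map is left unchanged.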
SENet: Squeeze-and-Excitation Block
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
CNNs for Semantic Segmentation
(Semantic) Segmentation
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
Fully Convolutional Networks (FCN)
• In the CNN architecture for classification tasks, the last layers are usually fully-connected layers.
• In Fully Convolutional Networks (FCN), all layers that have weights to train (trainable layers) are convolutional layers.
• A strategy is needed to restore the spatial resolution that has been reduced by the down-sampling (strided convolution and pooling) layers.
• We can restore the image dimensions in the last layers by using up-sampling or transposed convolutional layers.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic
segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
Recovering Spatial Resolution
• Up-sampling can be done by replicating the values in the feature maps or by interpolating (e.g., bilinear).
• It can also be done with a transposed convolutional layer, where the feature maps are first expanded by inserting zeros (0) between their values and a regular convolution is then applied.
• For more precise results, we can add skip connections from earlier layers.
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
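Two of the up-sampling options can be sketched on a tiny feature map: replicating values (nearest-neighbour up-sampling) and the zero-insertion step that precedes the convolution in a transposed convolutional layer. The function names are illustrative, and the convolution that follows the zero insertion is omitted.

```python
def upsample_nearest(feature_map, factor=2):
    """Replicate every value factor x factor times."""
    output = []
    for row in feature_map:
        wide = [v for v in row for _ in range(factor)]
        output.extend([list(wide) for _ in range(factor)])
    return output


def insert_zeros(feature_map, stride=2):
    """Spread the values apart and fill the gaps with zeros."""
    h, w = len(feature_map), len(feature_map[0])
    output = [[0] * ((w - 1) * stride + 1) for _ in range((h - 1) * stride + 1)]
    for i in range(h):
        for j in range(w):
            output[i * stride][j * stride] = feature_map[i][j]
    return output


fm = [[1, 2],
      [3, 4]]

print(upsample_nearest(fm))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
print(insert_zeros(fm))
# [[1, 0, 2], [0, 0, 0], [3, 0, 4]]
```

Replication has no trainable weights, whereas the convolution applied after zero insertion is learned, which is why a transposed convolutional layer can adapt its up-sampling to the task.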
Pixel-to-Pixel Segmentation
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical
image segmentation." International Conference on Medical image computing and computer-assisted
intervention. Springer, Cham, 2015.
CNNs for Object Detection
Object Detection
• In object detection, we also want information about the location / position of the desired object. This is commonly called localization.
• In practice, localization is done using bounding boxes, where each object has a label in a tuple (images_id, (class_label, bounding_boxes)).
• The bounding box itself has 4 values: the coordinates of the center of the object (x, y) and the size of the bounding box (height, width).
• Convolutional networks (or other machine learning methods) can be trained for localization by regressing the values (x, y, height, width) using the mean squared error (MSE) loss function.
• Object detection performance can be evaluated using Intersection over Union (IoU): the area of overlap between the predicted and the target bounding box divided by the area of their union.
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
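IoU for two bounding boxes in the (x, y, height, width) format described above can be computed as follows. This is a minimal sketch: the helper names are illustrative, and (x, y) is assumed to be the box center.

```python
def to_corners(box):
    """Convert (x_center, y_center, height, width) to (x1, y1, x2, y2)."""
    x, y, h, w = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (2.0, 2.0, 2.0, 2.0)   # covers x 1..3, y 1..3
target = (3.0, 3.0, 2.0, 2.0)      # covers x 2..4, y 2..4

print(iou(predicted, target))      # overlap 1, union 7
```

An IoU of 1 means the boxes coincide and 0 means they do not overlap; a detection is typically counted as correct when its IoU with the target exceeds a chosen threshold.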
Object Detection’s Sliding Window
• A common approach to object detection is to slide a window over the entire image.
• Divide the image into a grid (example: 6x8 as shown on the side).
• Run a CNN on a sliding window (example: large black 3x3 rectangle) over all parts of the image.
• Draw and save a bounding box each time the CNN detects an object (example: red square). The same object is usually detected several times, at slightly different window positions.
• Run the non-maximum suppression method to keep 1 bounding box for each detected object.
• Objects in the image can have various sizes (small, large). Therefore, the sliding window can be repeated with different sizes (e.g., 2x2 → 3x3 → 4x4).
Non-maximum suppression: https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
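Non-maximum suppression can be sketched as a greedy loop: keep the highest-scoring box, discard every remaining box that overlaps it too much, and repeat. This is one common variant, not necessarily the one used in the cited sources; the corner format (x1, y1, x2, y2), the scores, and the 0.5 threshold are assumptions made for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (score, box). Returns the detections that survive."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest remaining score
        kept.append(best)
        # Drop boxes that overlap the kept box too much (same object).
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.9, (10, 10, 50, 50)),
    (0.8, (12, 12, 52, 52)),      # nearly the same box as the first one
    (0.7, (100, 100, 140, 140)),  # a different object
]

kept = non_max_suppression(detections)
print([score for score, _ in kept])   # [0.9, 0.7]
```

The second detection is removed because its IoU with the first is about 0.82, above the threshold, while the third box does not overlap the first and is kept as a separate object.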
CNNs for Object Detection: R-CNN Model Family
• R-CNN (2014): Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
• Uses 3 modules: (1) Region Proposal: selective search (i.e., a non-CNN method for proposing bounding boxes), (2) Feature Extraction: AlexNet (CNN), and (3) Classifier: SVM (a non-CNN method for object classification).
• Fast R-CNN (2015): Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE international conference on computer vision. 2015.
• Replaces the SVM with a single CNN for object classification and bounding box regression; region proposals still come from selective search.
• Faster R-CNN (2015): Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
• Uses 2 modules: (1) Region Proposal Network (a CNN for deciding which bounding boxes will be classified) and (2) Fast R-CNN (a CNN for object classification and bounding box regression update).
CNN for Object Detection: YOLO & SSD
• You Only Look Once (YOLO) (2016): Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
• Has only one CNN for both localization (bounding box) and object classification.
• Single Shot MultiBox Detector (SSD) (2016): Liu, Wei, et al. "SSD: Single shot multibox detector." European conference on computer vision. Springer, Cham. 2016.
• Like YOLO, SSD has only 1 CNN for object detection.
• Difference between SSD and YOLO: SSD can detect smaller objects and computes faster than the original YOLO.
• Thanks to its speed, SSD can be used in real-time object detection systems.
YOLO vs. SSD
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016.
Liu, Wei, et al. "SSD: Single shot multibox detector." European conference on
computer vision. Springer, Cham. 2016.
Other Architectures for Object Detection
• SSD and YOLO have several model variants that are faster and more accurate for real-time systems.
For example: SSD300, SSD500, YOLOv2, YOLOv3, YOLO9000.
• Mask R-CNN (2017): Object detection and instance segmentation (a pixel mask for each detected object) at the same time.
He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE international conference on computer vision. 2017.
Practical use of CNN in CV
Computer Vision in Construction
https://www.youtube.com/watch?v=l1LYg9NCKWY
Computer Vision in Aerial Surveillance
https://www.youtube.com/watch?v=9tLCFbupeOI
Computer Vision in Autonomous Cars
https://www.youtube.com/watch?v=HS1wV9NMLr8
Computer Vision in Security
https://www.youtube.com/watch?v=HHXRqCGCRCs
Computer Vision in Military
https://www.youtube.com/watch?v=g0zxmO6qlD8
Summary
• We have studied various forms of CNN architectural models for 3 types of tasks
in computer vision, namely image classification, semantic segmentation, and
object detection.
• We can see how readily the convolutional layer can be applied to a wide variety of architectures and tasks.
• We can also see how convolution can be modified to meet the needs of a specific task (e.g., inception module, residual unit, squeeze-and-excitation block).
• Understanding the data, the problem, and the assumptions associated with the task will greatly facilitate the development of new CNN models.
• Plenty of working code and examples are available on the internet.
• https://keras.io/examples/vision/
Thank You