
AI in Computer Vision

Military Artificial Intelligence


Prof. Dr. Eng. Wisnu Jatmiko, S.T., M.Kom.
Dr. Ario Yudo Husodo, S.T., M.T.
Grafika Jati, S.Kom., M.Kom.
© Fasilkom UI - 2023
Discussion Topic

AI in Computer Vision
Presentation Outline
❖ Basic understanding of the Convolutional Neural Network (CNN)
❖ Usage of CNNs in CV
➢ Image Classification
➢ Semantic Segmentation
➢ Object Detection

❖ Practical use of CNN in CV


Basic Knowledge: The Convolution Operation
The Convolution Operation

[Figure: comparison of convolution and cross-correlation]

Image by Cmglee - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=20206883
Discrete cross-correlation: 2-D example
Can be viewed as a "sliding window" operation: a 3x3 kernel slides over every 3x3 patch of the 5x5 input, and each output value is the sum of the element-wise products of the kernel and the patch it covers.

Input (5x5):

1 0 0 1 2
0 0 0 3 0
0 1 2 1 1
1 1 3 0 0
3 0 0 0 1

Kernel (3x3):

0 0 1
0 2 0
1 1 0

Output (3x3), often called the feature map:

1  4 11
4 11  5
7  7  1
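To make the sliding-window mechanics concrete, here is a minimal NumPy sketch of valid-mode 2-D cross-correlation (the function name and loop structure are illustrative, not from the slides):

import numpy as np

def cross_correlate_2d(image, kernel):
    """Valid-mode 2-D cross-correlation: slide the kernel over every
    patch of the image and sum the element-wise products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[1, 0, 0, 1, 2],
                  [0, 0, 0, 3, 0],
                  [0, 1, 2, 1, 1],
                  [1, 1, 3, 0, 0],
                  [3, 0, 0, 0, 1]])
kernel = np.array([[0, 0, 1],
                   [0, 2, 0],
                   [1, 1, 0]])

print(cross_correlate_2d(image, kernel))
# [[ 1.  4. 11.]
#  [ 4. 11.  5.]
#  [ 7.  7.  1.]]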
Problem Definition: ANN vs. CNN (Convolutional Neural Network)
ANN (MLP) versus CNN for Image Analysis

[Figure: an ANN (MLP) architecture with hidden layers, alongside a simple CNN architecture ending in a classification/output layer]
MLP Problems in Image Analysis

• The number of weights rapidly becomes unmanageable for large images: a 224 x 224 pixel image with 3 color channels already gives over 150,000 weights per first-layer neuron.
• MLPs react differently to an input image and its shifted version; spatial information is lost.
• They do not scale well for images.
• They ignore the information carried by pixel position and correlation with neighbors.
• They cannot handle translations.

https://towardsdatascience.com/simple-introduction-to-convolutional-neural-networks-cdf8d3077bac
CNN in Image Analysis

CNNs do:
• Analyze the influence of nearby pixels using a kernel.
• Extract features of the image, called feature maps.
• Exploit the fact that pixel position and neighborhood have semantic meaning.
• Exploit the fact that elements of interest can appear anywhere in the image.

https://towardsdatascience.com/simple-introduction-to-convolutional-neural-networks-cdf8d3077bac
Visualizing convolution activations

[Figure: input image and its feature maps at successive layers; the resolution is lower after pooling, and deeper-layer activations are not as easy to interpret]

Image credit: http://cs231n.github.io/convolutional-networks/
Visualizing learned convolutions

Learned convolutions from the first layer of AlexNet show filters tuned to different orientations, different spatial frequencies, and different colour contrasts.

Krizhevsky et al. Imagenet classification with deep convolutional neural networks. NIPS 2012.
Biological connections

Several principles of ConvNets were inspired by elements of neuroscience models of the primary and secondary visual cortex, including:

• Limited receptive field (at least at early stages)
• Filters tuned to different scales, orientations, spatial frequencies and colour contrasts

Image credit: Wendell. Foundations of Vision: https://foundationsofvision.stanford.edu/
Notable Differences: ANN vs. CNN
• CNNs work in the field of pattern recognition within images: they encode image-specific features directly into the architecture.
• This reduces the complexity required to process image data.
• Example: a 28 x 28 MNIST image fed to an ANN needs 28 x 28 x 1 = 784 weights per first-layer neuron.
• A 64 x 64 RGB image needs 64 x 64 x 3 = 12,288 weights per first-layer neuron.

With 3 channels (the RGB case), the number of weights rapidly becomes unmanageable for large images (see the comparison sketched below).

Why not just increase the number of hidden layers? Because a huge ANN:
• requires enormous computational power and time to train
• will likely overfit (either too general or too over-engineered)
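To make the weight counts concrete, a hedged Keras sketch (the layer sizes of 100 neurons and 32 kernels are illustrative assumptions):

import tensorflow as tf

# MLP: every one of the 64 x 64 x 3 = 12,288 inputs is wired to every neuron.
mlp = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 3)),
    tf.keras.layers.Dense(100, activation="relu"),  # 12,288 weights per neuron
])

# CNN: 32 small 3x3x3 kernels are shared across all spatial positions.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu",
                           input_shape=(64, 64, 3)),
])

print(mlp.count_params())  # (12,288 + 1) * 100 = 1,228,900
print(cnn.count_params())  # (3*3*3 + 1) * 32 = 896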
Sample Architecture
CNN Visualization

https://www.youtube.com/watch?v=f0t-OCG79-U
Usage of CNN in Image Processing
CNNs for Image Classification
Image Classification

• When the system is given an image of a handwritten number (from 0 to 9), can the system classify the image of that number?
• A few things to keep in mind:
• The location/position of the number (localization) is not important in image classification.
• Generally, classes that are not included in the training set are not taken into account in the performance measure.
• Image classification performance can be evaluated using confusion-matrix measurements (e.g., accuracy, precision, recall), F1-score, ROC curves, etc.
LeNet-5

• MNIST images are 28x28 but are zero-padded into 32x32.
• Uses the tanh activation function on every layer (except the classification layer).
• Uses 5x5 convolution kernels and average pooling.
• Uses a Radial Basis Function (RBF) layer as the non-linear function of the classification layer.
• Demo: http://yann.lecun.com/exdb/lenet/

LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
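A hedged Keras sketch of a LeNet-5-style network follows (layer sizes match the 1998 paper; replacing the original RBF output layer with a softmax layer is an assumption made for simplicity):

import tensorflow as tf
from tensorflow.keras import layers

# LeNet-5-style network: 32x32 input, 5x5 convolutions, average pooling,
# tanh activations; a softmax output replaces the original RBF layer.
lenet5 = tf.keras.Sequential([
    layers.Conv2D(6, kernel_size=5, activation="tanh",
                  input_shape=(32, 32, 1)),               # C1: 6 maps, 28x28
    layers.AveragePooling2D(pool_size=2),                 # S2: 14x14
    layers.Conv2D(16, kernel_size=5, activation="tanh"),  # C3: 16 maps, 10x10
    layers.AveragePooling2D(pool_size=2),                 # S4: 5x5
    layers.Conv2D(120, kernel_size=5, activation="tanh"), # C5: 120 maps, 1x1
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),                  # F6
    layers.Dense(10, activation="softmax"),               # output: digits 0-9
])
lenet5.summary()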
AlexNet

• Very similar to LeNet-5, with some differences:
• Bigger and deeper.
• Stacks multiple convolutional layers directly, without pooling layers in between.
• Uses the Rectified Linear Unit (ReLU) as the non-linear activation function.
• Uses several regularization techniques:
• Dropout layer with a 50% dropout rate
• Data augmentation (i.e., random shifting, horizontal flipping, and changing lighting conditions)

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
GoogLeNet
• Key points of GoogLeNet:
• Bigger and deeper than previous CNN architectures.
• Uses the inception module, which lets GoogLeNet use its parameters more effectively:
• Better performance despite far fewer parameters (6 million) compared to AlexNet (60 million).
• The inception module proposed in the GoogLeNet architecture performs feature extraction using a combination of 1x1, 3x3, and 5x5 filters to cover various patterns in the image.

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
GoogLeNet's Inception Module

• In GoogLeNet's inception module, a 1x1 kernel is used for dimension reduction.
• Why is a 1x1 kernel important?
• Remember that convolution is cross-channel: it computes over all input channels (RGB for the image) or feature maps (for input from a hidden layer).
• If the number of kernels used is smaller than the number of feature maps in the input, the 1x1 kernel acts as a bottleneck layer that performs dimension reduction.
• The kernel combinations [1x1, 3x3] and [1x1, 5x5] function as cross-channel convolution (for 1x1) and cross-channel + spatial convolution (for 3x3, 5x5). A sketch of such a module follows below.

Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
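A minimal Keras sketch of an inception-style module; using the filter counts of GoogLeNet's inception (3a) block here is an assumption for illustration:

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1x1, f3x3_reduce, f3x3, f5x5_reduce, f5x5, f_pool):
    """Inception-style module: parallel 1x1, [1x1 -> 3x3], [1x1 -> 5x5],
    and [pool -> 1x1] branches, concatenated along the channel axis.
    The 1x1 'reduce' convolutions act as bottlenecks for dimension
    reduction."""
    b1 = layers.Conv2D(f1x1, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3x3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3x3, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(f5x5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5x5, 5, padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(f_pool, 1, padding="same", activation="relu")(b4)
    return layers.Concatenate()([b1, b2, b3, b4])

# Example: the filter counts of GoogLeNet's "inception (3a)" block.
inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 28, 28, 256)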
VGGNet
• Deeper and larger (has more parameters).
• VGGNet was the runner-up of the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC); the winner was GoogLeNet.

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
Residual Network

• The Residual Network (ResNet) proposes a residual unit (RU) that allows us to use very deep CNNs (deepest version: 152 trained layers).
• Within each RU there is a skip connection (or shortcut connection) which makes the signal flow more easily in the forward and backward passes. A sketch of an RU follows below.
• Google's proposed Inception-v4 architecture combines the ideas of GoogLeNet and ResNet.
• Note: at the beginning of training, an RU is an identity function (or almost an identity function), because almost all weights in its convolutional layers start close to 0 (zero). This makes RUs easy and fast to train.

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
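A hedged Keras sketch of a basic residual unit (the two-convolution variant; the input/output shapes in the usage lines are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

def residual_unit(x, filters, stride=1):
    """Basic residual unit: two 3x3 convolutions plus a skip connection.
    When the output shape differs from the input (stride > 1 or a new
    channel count), a 1x1 convolution projects the shortcut to match."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride,
                                 padding="same")(x)
        shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Add()([y, shortcut])  # the skip connection
    return layers.ReLU()(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_unit(inputs, filters=128, stride=2)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 28, 28, 128)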
Extreme Inception (Xception)

• Extreme Inception (Xception) from Google proposes a depth-wise separable convolutional layer to strengthen the inception module.

Regular convolutional layer:
• Kernels with size > 1 perform computations at both the spatial and cross-channel levels.
• A kernel with a size of 1x1 performs only cross-channel computation.

Depth-wise separable convolutional layer:
• Performs the spatial computation for each input channel/feature map separately.

Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
Xception: Extreme Inception Module

• In general, the depth-wise separable convolutional layer uses fewer parameters, less memory, and less computation than a regular convolutional layer (see the comparison sketched below).
• Recommended for use in deeper layers (i.e., after multiple convolutional layers).

[Figure: a 3x3 depth-wise convolutional layer followed by a 1x1 convolutional layer (cross-channel only)]

Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
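To illustrate the parameter savings, Keras exposes the depth-wise separable convolution as SeparableConv2D; the shapes below are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

# A per-channel 3x3 spatial convolution followed by a 1x1 cross-channel one,
# compared against a regular 3x3 convolution with the same output channels.
inputs = tf.keras.Input(shape=(32, 32, 64))

regular = layers.Conv2D(128, 3, padding="same")(inputs)
separable = layers.SeparableConv2D(128, 3, padding="same")(inputs)

print(tf.keras.Model(inputs, regular).count_params())
# (3*3*64 + 1) * 128 = 73,856
print(tf.keras.Model(inputs, separable).count_params())
# 3*3*64 + (64 * 128) + 128 = 8,896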
Squeeze-and-Excitation Network (SENet)
• The Squeeze-and-Excitation Network (SENet) is the winner of the ILSVRC 2017 challenge, proposing a squeeze-and-excitation (SE) block.
• The SE block recalibrates the feature maps output by the previous layer by giving them additional scaling weights.
• The scaling values used in SENet are learned by the SE module during the training process and applied during testing.
• The SE block pays no attention to spatial patterns; it focuses on which feature maps are most closely related during training (strongly correlated feature maps).

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
SENet: Squeeze-and-Excitation Block

• E.g., the input to an SE block is 256 feature maps.
• Squeeze:
• Global average pooling: the output is 256 values of much smaller dimension (usually 1 neuron per feature map).
• Dense connection: encode/embed the data with a fully connected layer (e.g., down to 16 neurons from the initial 256).
• Excitation: decode the embedded data back to the initial number of feature maps (from 16 back to 256 neurons), one scaling weight per feature map. A sketch follows below.

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
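A minimal Keras sketch of an SE block for the 256-feature-map example (the reduction ratio of 16 matches the slide's 256 → 16 embedding; the spatial size is an illustrative assumption):

import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio=16):
    """Squeeze-and-excitation block: squeeze each feature map to one number
    (global average pooling), embed through a small dense layer, then decode
    to one sigmoid scaling weight per feature map and rescale the input."""
    channels = x.shape[-1]                             # e.g. 256 feature maps
    s = layers.GlobalAveragePooling2D()(x)             # squeeze: (batch, 256)
    s = layers.Dense(channels // ratio, activation="relu")(s)  # embed: 16
    s = layers.Dense(channels, activation="sigmoid")(s)        # excite: 256
    s = layers.Reshape((1, 1, channels))(s)            # broadcastable scaling
    return layers.Multiply()([x, s])                   # recalibrated maps

inputs = tf.keras.Input(shape=(14, 14, 256))
outputs = se_block(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 14, 14, 256)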
CNNs for Semantic Segmentation
(Semantic) Segmentation

• Image segmentation performance can be evaluated using precision, recall, F1-score (Dice similarity coefficient), ROC curve, etc.

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
Fully Convolutional Networks (FCN)
• In CNN architectures for classification tasks, the last layers are usually fully-connected layers.
• In Fully Convolutional Networks (FCN), all layers that have weights to train (trainable layers) are convolutional layers.
• A strategy is needed to restore the spatial resolution that has been reduced by the convolutional layers.
• We can restore the image dimensions in the last layers by using up-sampling or transpose convolutional layers.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
Recovering Spatial Resolution
• Up-sampling can be done by replicating the values in the feature maps, or by interpolating (e.g., bilinear interpolation).
• It can also be done with a transpose convolutional layer, where the feature maps are expanded by inserting 0 (zero) values and then a convolution operation is performed as usual. For more precise results, we can add a skip connection from an earlier layer. All three up-sampling options are sketched below.

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
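A short Keras sketch of the three up-sampling options just described (the shapes are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

# Replication, bilinear interpolation, and a trainable transpose convolution,
# each doubling an 8x8 feature map to 16x16.
inputs = tf.keras.Input(shape=(8, 8, 64))

nearest = layers.UpSampling2D(size=2)(inputs)  # replicate the values
bilinear = layers.UpSampling2D(size=2, interpolation="bilinear")(inputs)
learned = layers.Conv2DTranspose(64, kernel_size=3, strides=2,
                                 padding="same")(inputs)  # trainable

for t in (nearest, bilinear, learned):
    print(t.shape)  # (None, 16, 16, 64) in each case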
Pixel-to-Pixel Segmentation

• The results of (semantic) segmentation from low-resolution feature maps are very imprecise because there is very little spatial information left.
• Therefore, we can combine (spatial) information from shallower feature maps that have more spatial information.
• This merging, usually done using a skip connection, is also very useful for speeding up the training process.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
U-Net

[Figure: the U-shaped architecture, with the encoder on the left, the decoder on the right, skip connections between them, and a bottleneck at the bottom]

• U-Net uses encoder, decoder, bottleneck, and skip-connection strategies simultaneously (a toy sketch follows below).
• It can be said that U-Net performs domain mapping from the source domain (image) to the target domain (semantic segmentation).
• The bottleneck can be considered a latent representation containing the most important features needed to perform semantic segmentation of the original image.
• U-Net is very popular in medical image analysis.

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
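A toy Keras sketch of the U-Net idea, with only two levels (the real network is deeper and the original paper uses unpadded convolutions); all shapes here are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

# Encoder, bottleneck, decoder, with skip connections concatenating encoder
# features into the decoder at matching resolutions.
inputs = tf.keras.Input(shape=(128, 128, 1))

e1 = conv_block(inputs, 32)                        # encoder level 1 (128x128)
e2 = conv_block(layers.MaxPooling2D(2)(e1), 64)    # encoder level 2 (64x64)

b = conv_block(layers.MaxPooling2D(2)(e2), 128)    # bottleneck (32x32)

d2 = layers.Conv2DTranspose(64, 2, strides=2)(b)   # decoder level 2 (64x64)
d2 = conv_block(layers.Concatenate()([d2, e2]), 64)    # skip connection
d1 = layers.Conv2DTranspose(32, 2, strides=2)(d2)  # decoder level 1 (128x128)
d1 = conv_block(layers.Concatenate()([d1, e1]), 32)    # skip connection

outputs = layers.Conv2D(1, 1, activation="sigmoid")(d1)  # per-pixel mask
model = tf.keras.Model(inputs, outputs)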
CNNs for Object Detection
Object Detection
• In object detection, we also want information about the location/position of the desired object. This is commonly called localization.
• In practice, localization is done using bounding boxes, where each object has a label in a tuple (image_id, (class_label, bounding_box)).
• The bounding box itself has 4 values: the coordinates of the center of the object (x, y) and the size of the bounding box (height, width).
• Convolutional networks (or other machine learning methods) can be optimized for object detection by regressing the values (x, y, height, width) using the mean squared error (MSE) loss function.
• Meanwhile, object detection performance can be evaluated using Intersection over Union (IoU), sketched below.

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
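A small, self-contained sketch of the IoU computation for boxes in the slide's (x, y, height, width) center format (the example boxes are made up for illustration):

def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x, y, height, width)
    format, where (x, y) is the center of the box, as on the slide."""
    def to_corners(box):
        x, y, h, w = box
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2

    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)

    # Overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - intersection)
    return intersection / union if union > 0 else 0.0

print(iou((10, 10, 10, 10), (12, 12, 10, 10)))  # 64 / 136 ≈ 0.47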
Object Detection's Sliding Window
• Object detection is usually done using a sliding window that covers the entire image:
• Divide the image into several parts (example: a 6x8 grid, as shown in the figure).
• Run the CNN with a sliding window (example: the large black 3x3 rectangle) over all parts of the image.
• Draw and save 1 bounding box for each object detected by the CNN (example: the red rectangle).
• Run the non-maximum suppression method to keep 1 bounding box per detected object (a sketch of NMS follows below).
• Objects in the image can have various sizes (small, large). Therefore, the sliding-window pass can be repeated with different window sizes (e.g., 2x2 → 3x3 → 4x4).

Non-maximum suppression: https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c
Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.
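A greedy NMS sketch, reusing the iou() helper from the previous example (the scores and boxes are made up for illustration):

def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    that overlaps it by more than iou_threshold, and repeat.
    `detections` is a list of (score, box) pairs; boxes use the same
    (x, y, height, width) center format as the iou() sketch above."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while detections:
        best = detections.pop(0)
        kept.append(best)
        detections = [d for d in detections
                      if iou(best[1], d[1]) <= iou_threshold]
    return kept

detections = [(0.9, (10, 10, 10, 10)),  # strongest detection of an object
              (0.8, (11, 11, 10, 10)),  # overlaps the first one: suppressed
              (0.7, (40, 40, 10, 10))]  # a separate object: kept
print(non_max_suppression(detections))  # keeps the 0.9 and 0.7 detections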
CNNs for Object Detection: R-CNN Model Family
• R-CNN (2014): Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
• Uses 3 modules: (1) Region Proposal: selective search (i.e., a non-CNN method for proposing bounding boxes), (2) Feature Extraction: AlexNet (a CNN), and (3) Classifier: SVM (a non-CNN method for object classification).
• Faster R-CNN (2015): Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
• Uses 2 modules: (1) a Region Proposal Network (a CNN that decides which bounding boxes will be classified) and (2) a Fast R-CNN detector (a CNN for object classification and bounding-box regression; Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE international conference on computer vision. 2015).
CNNs for Object Detection: YOLO & SSD
• You Only Look Once (YOLO) (2016): Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
• Has only one CNN for localization (bounding boxes) and object classification.
• Single Shot MultiBox Detector (SSD) (2016): Liu, Wei, et al. "SSD: Single shot multibox detector." European conference on computer vision. Springer, Cham. 2016.
• Like YOLO, SSD has only 1 CNN for object detection.
• Difference between SSD and YOLO: SSD can detect smaller objects and computes faster than YOLO.
• Due to the faster computation, SSD can be used in real-time object detection systems.
YOLO vs. SSD

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016.
Liu, Wei, et al. “SSD: Single shot multibox detector." European conference on
computer vision. Springer, Cham. 2016.
Other Architectures for Object Detection
• SSD and YOLO have various model variants that are faster and better for real-time systems. For example: SSD300, SSD500, YOLOv2, YOLOv3, YOLO9000.
• Mask R-CNN (2017): object detection and instance segmentation at the same time.
He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE international conference on computer vision. 2017.
Practical use of CNN in CV
Computer Vision in Construction

https://www.youtube.com/watch?v=l1LYg9NCKWY
Computer Vision Aerial Surveillance

https://www.youtube.com/watch?v=9tLCFbupeOI
Computer Vision Autonomous Car

https://www.youtube.com/watch?v=HS1wV9NMLr8
Computer Vision in Security

https://www.youtube.com/watch?v=HHXRqCGCRCs
Computer Vision in Military

https://www.youtube.com/watch?v=g0zxmO6qlD8
Summary
• We have studied various CNN architectural models for 3 types of tasks in computer vision, namely image classification, semantic segmentation, and object detection.
• We have seen how the convolutional layer is very easy to implement for a wide variety of architectures and tasks.
• We have also seen how convolution can be modified to meet the needs of a specific task (e.g., inception module, residual unit, squeeze-and-excitation block).
• Understanding the data and the problems and assumptions associated with the task will greatly facilitate the development of new CNN models.
• A lot of working code and examples are available on the internet:
• https://keras.io/examples/vision/
Thank You
