
Tutorial 4: Dataset preparation for image classification tasks

&
Loading and evaluation of pre-trained image classification models

CHEN JIELIN
Department of Architecture, National University of Singapore
Image Classification Task

Image Classification is a fundamental task in vision recognition that aims to understand and categorize an image as a
whole under a specific label, and it typically pertains to single-object images.

https://paperswithcode.com/task/image-classification
https://vitalflux.com/difference-binary-multi-class-multi-label-classification/#:~:text=Multiclass%20Classification%20is%20where%20each,labels%20to%20each%20data%20sample.
Image Classification Task: Binary vs Multi-Class vs Multi-Label

https://medium.com/@saugata.paul1010/a-detailed-case-study-on-multi-label-classification-with-machine-learning-algorithms-and-72031742c9aa
Image Classification Task: Binary vs Multi-Class vs Multi-Label

To ensure only one class is selected each time, we apply the Softmax activation function at the last layer: the sum of the probabilities of all classes is 1.

In the case where multi-label classification is needed, we use multiple Sigmoids on the last layer and thus learn a separate distribution for each class: the probability of each class is independent of the others.

https://www.analyticsvidhya.com/blog/2021/07/demystifying-the-difference-between-multi-class-and-multi-label-classification-problem-statements-in-deep-learning/
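The contrast is easy to see in code. Below is a minimal PyTorch sketch (PyTorch is an assumption here; any framework behaves the same way) applying both activations to the same raw scores:

    import torch

    logits = torch.randn(1, 5)      # raw scores for 5 classes from the last layer

    # Multi-class: softmax makes the 5 probabilities compete and sum to 1
    multi_class_probs = torch.softmax(logits, dim=1)
    print(multi_class_probs.sum())  # tensor(1.)

    # Multi-label: an independent sigmoid per class; probabilities need not sum to 1
    multi_label_probs = torch.sigmoid(logits)
    print(multi_label_probs)        # each value lies in (0, 1), independently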
Existing open-sourced large-scale datasets for image classification task
Some example datasets for image classification

ImageNet: contains 14,197,122 annotated images. The average image resolution on ImageNet is 469x387 pixels. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld.

https://paperswithcode.com/datasets?task=image-classification
Existing open-sourced large-scale datasets for image classification task
Some example datasets for image classification

CIFAR-10: 60,000 32x32 color images, each labelled with one of 10 mutually exclusive classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are 6,000 images per class, with 5,000 training and 1,000 testing images per class.

https://paperswithcode.com/datasets?task=image-classification
Existing open-sourced large-scale datasets for image classification task
Some example datasets for image classification

MNIST: a large collection of handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples. Each image is a 28x28 (784-pixel) handwritten digit from "0" to "9". Each pixel value is a grayscale integer between 0 and 255.

https://paperswithcode.com/datasets?task=image-classification
Existing open-sourced large-scale datasets for image classification task
Some example datasets for image classification

Fashion-MNIST: a dataset of 28x28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST shares the same image size, data format, and training/testing split structure with the original MNIST.
https://paperswithcode.com/datasets?task=image-classification
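All four datasets above are bundled with torchvision, so they can be downloaded and inspected in a few lines; a minimal sketch (the ./data directory is an arbitrary choice, and MNIST or FashionMNIST can be swapped in for CIFAR10):

    import torchvision
    import torchvision.transforms as transforms

    transform = transforms.ToTensor()
    train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                             download=True, transform=transform)
    test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                            download=True, transform=transform)
    print(len(train_set), len(test_set))  # 50000 10000
    image, label = train_set[0]
    print(image.shape)                    # torch.Size([3, 32, 32])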
For classification tasks, we need to split the original dataset into
subsets for training and testing
(This also applies to all discriminative tasks/supervised learning)

There are different practical methods for splitting the dataset; the most common ratios are 80:20, 70:30, or 90:10, depending on the size of your original dataset.

Adopted from “https://labelyourdata.com/articles/machine-learning-and-training-data”
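As a minimal sketch, an 80:20 split with scikit-learn's train_test_split on toy stand-in data (the array shapes are illustrative only):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy stand-ins for real images and labels
    images = np.random.rand(1000, 32, 32, 3)
    labels = np.random.randint(0, 10, size=1000)

    # 80:20 split; stratify keeps the class proportions equal in both subsets
    X_train, X_test, y_train, y_test = train_test_split(
        images, labels, test_size=0.2, random_state=42, stratify=labels)
    print(X_train.shape, X_test.shape)  # (800, 32, 32, 3) (200, 32, 32, 3)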


Feature Engineering: One-Hot Encoding for categorical or string-based
labels

Since every machine learning or deep learning model requires exact mathematical and statistical computation, which can only be performed on numerical data types, we need to convert our data into int/float values when building such models.

One-Hot Encoding is frequently used for this purpose. It is a process by which categorical or string-based labels are converted into 1s and 0s. New columns are introduced in this process; the number of columns depends on the number of categorical values in the original column.

There is no correlation between the values of any pair of data points in the newly generated columns, which is desirable, as it is normally assumed that all data points are mutually independent of each other.
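A minimal sketch of one-hot encoding string labels with pandas and PyTorch (the label values here are hypothetical examples):

    import pandas as pd
    import torch
    import torch.nn.functional as F

    # String labels -> integer codes (alphabetical: cinema=0, houses=1, school=2)
    labels = pd.Series(["houses", "school", "cinema", "houses"])
    indices = torch.tensor(labels.astype("category").cat.codes.values,
                           dtype=torch.long)

    # Integer codes -> one column of 1s/0s per category
    one_hot = F.one_hot(indices, num_classes=3)
    print(one_hot)
    # tensor([[0, 1, 0],   <- "houses"
    #         [0, 0, 1],   <- "school"
    #         [1, 0, 0],   <- "cinema"
    #         [0, 1, 0]])  <- "houses"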
Annotated Image Database of Architecture (AIDA)
● A repository of building imagery with high diversity and high coverage, retrieved from the professional architectural website Archdaily®

● Each image is annotated with ground-truth architectural category labels and scene labels

● 14,659 images and 25 architecture categories; the number of images in each architectural category of each scene class varies from 20 to 1,400

● 11,730 images from the dataset are randomly selected for training and 2,929 for testing

Chen, J., Stouffs, R., & Biljecki, F. (2021). Hierarchical (multi-label) architectural image recognition and classification. In PROJECTIONS, Proceedings of the 26th International Conference of the Association for Computer-Aided Architectural Design Research in Asia
(CAADRIA) 2021 (pp. 161-170).
Hands-on Exercise of Image Dataset Preparation
Architectural Image Classification
Models using AIDA

[Figure: hierarchical classification pipeline. An input image is first assigned a scene category (outdoor, indoor, or street-level), then an architectural category (houses, school, ..., cinema).]

Chen, J., Stouffs, R., & Biljecki, F. (2021). Hierarchical (multi-label) architectural image recognition and classification. In PROJECTIONS, Proceedings of the 26th International Conference of the Association for Computer-Aided Architectural Design Research in Asia
(CAADRIA) 2021 (pp. 161-170).
Architectural Image Classification Models using AIDA

[Figure: model pipeline based on a Convolutional Neural Network.]

Chen, J., Stouffs, R., & Biljecki, F. (2021). Hierarchical (multi-label) architectural image recognition and classification. In PROJECTIONS, Proceedings of the 26th International Conference of the Association for Computer-Aided Architectural Design Research in Asia
(CAADRIA) 2021 (pp. 161-170).
Convolutional Neural Network (CNN)

[Figure: a typical CNN, alternating convolution and pooling layers followed by fully connected layers.]

In simple terms, what a CNN does is extract the features of an image and convert them into a lower-dimensional representation without losing their essential characteristics. In a CNN, the hidden layers include one or more layers that perform convolutions, letting the model learn its own feature engineering through convolution kernels. As a convolution kernel slides along the input matrix of a layer, the convolution operation generates a feature map, which in turn contributes to the input of the next layer. This is followed by other layers such as pooling layers or fully connected layers.

Pooling layers are used to reduce the spatial size of the feature maps while preserving important information. This reduces the computational cost of the network and helps to prevent overfitting.

A fully connected layer is applied on the feature map at the end to map the learned features to a chosen number of classes.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4), 541-551.
https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d
https://www.superannotate.com/blog/guide-to-convolutional-neural-networks
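A minimal PyTorch sketch of such a network, with illustrative layer sizes rather than any specific architecture from the slides:

    import torch
    import torch.nn as nn

    class SmallCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # feature extraction
                nn.ReLU(),
                nn.MaxPool2d(2),                             # 32x32 -> 16x16
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),                             # 16x16 -> 8x8
            )
            # Fully connected layer maps the learned features to the classes
            self.classifier = nn.Linear(32 * 8 * 8, num_classes)

        def forward(self, x):
            x = self.features(x)
            return self.classifier(torch.flatten(x, 1))

    model = SmallCNN()
    print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])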
Convolutional Neural Network (CNN)

Image Kernels explained visually:
https://setosa.io/ev/image-kernels/

A convolution kernel is a small matrix used to apply effects like the ones you might find in Photoshop, such as blurring, sharpening, or embossing. In CNNs, kernels are used for feature extraction, and the process is referred to as "convolution". The values of the kernels are iteratively updated during the training of a CNN model.
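A minimal sketch applying one such hand-written kernel with PyTorch's conv2d (the values below are the classic 3x3 sharpening filter; in a trained CNN they would be learned instead):

    import torch
    import torch.nn.functional as F

    # Classic 3x3 sharpening kernel, shaped (out_channels, in_channels, H, W)
    kernel = torch.tensor([[ 0., -1.,  0.],
                           [-1.,  5., -1.],
                           [ 0., -1.,  0.]]).reshape(1, 1, 3, 3)
    image = torch.rand(1, 1, 8, 8)          # batch, channel, height, width
    sharpened = F.conv2d(image, kernel, padding=1)
    print(sharpened.shape)                  # torch.Size([1, 1, 8, 8])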

Convolutional Neural Network (CNN)

Stride denotes how many steps the kernel moves at each step of the convolution; a stride size of one moves the kernel one position at a time.

To keep the output dimensions the same as the input, we use padding. Padding is the process of adding zeros around the input matrix symmetrically; after applying padding, the output has the same dimensions as the original input.
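The effect of stride and padding on the output size can be checked directly; a minimal sketch with an 8x8 input and a 3x3 kernel:

    import torch
    import torch.nn.functional as F

    image = torch.rand(1, 1, 8, 8)
    kernel = torch.rand(1, 1, 3, 3)

    # No padding: the 3x3 kernel can only slide to 6x6 positions
    print(F.conv2d(image, kernel).shape)                       # torch.Size([1, 1, 6, 6])

    # Zero-padding of 1 on each side keeps the output at the input size
    print(F.conv2d(image, kernel, padding=1).shape)            # torch.Size([1, 1, 8, 8])

    # A stride of 2 halves the spatial resolution
    print(F.conv2d(image, kernel, padding=1, stride=2).shape)  # torch.Size([1, 1, 4, 4])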

ResNets

Classical CNNs are not able to scale to a large number of layers, as they face the "vanishing gradient" problem (with too many layers, repeated multiplications eventually shrink the gradient until it "disappears"). ResNets provide a solution to the vanishing gradient problem by adding "skip connections" between every two or three layers.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
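A minimal sketch of a residual block in PyTorch (batch normalization, which real ResNet blocks include, is omitted for brevity; the channel count is illustrative):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels=64):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.conv2(self.relu(self.conv1(x)))
            return self.relu(out + x)  # skip connection: the input bypasses both convs

    block = ResidualBlock()
    print(block(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 64, 8, 8])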
ResNeXt

ResNeXt is an alternative model based on the ResNet design, which adds another dimension, cardinality, in the form of the number of independent paths in a block; increasing the cardinality has been found to be a more effective way of gaining accuracy than making the network deeper or wider.

[Figure: a block of ResNet (left) vs. a block of ResNeXt with cardinality = 32 (right).]

Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1492-1500).
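In code, the 32 parallel paths are typically implemented as a single grouped convolution; a minimal sketch (channel counts are illustrative):

    import torch
    import torch.nn as nn

    # groups=32 is the "cardinality" dimension: 32 independent 4-channel paths
    grouped_conv = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32)
    print(grouped_conv(torch.randn(1, 128, 8, 8)).shape)  # torch.Size([1, 128, 8, 8])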
DenseNet

A ResNet variation that attempts to resolve the issue of vanishing gradients by creating more connections. The authors of DenseNet ensured the maximum flow of information between the network layers by connecting each layer directly to all the others: every layer obtains additional inputs from all preceding layers and passes its own feature maps on to all subsequent layers.

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700-4708).
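A minimal sketch of this dense-connectivity pattern (the "growth rate" of 16 and the other sizes are illustrative choices, not DenseNet's actual configuration):

    import torch
    import torch.nn as nn

    class TinyDenseBlock(nn.Module):
        def __init__(self, in_channels=32, growth=16, num_layers=3):
            super().__init__()
            # Each layer's input width grows by `growth` channels per earlier layer
            self.layers = nn.ModuleList(
                nn.Conv2d(in_channels + i * growth, growth, 3, padding=1)
                for i in range(num_layers))

        def forward(self, x):
            features = [x]
            for layer in self.layers:
                # Every layer sees the concatenation of ALL previous feature maps
                out = torch.relu(layer(torch.cat(features, dim=1)))
                features.append(out)
            return torch.cat(features, dim=1)

    block = TinyDenseBlock()
    print(block(torch.randn(1, 32, 8, 8)).shape)  # torch.Size([1, 80, 8, 8])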
Interpreting performance of image classification models:
Gradient-weighted Class Activation Mapping (Grad-CAM)

Visual explanations: making Convolutional Neural Network (CNN)-based models more transparent by visualizing the regions of the input that are "important" for the models' predictions.

https://medium.com/@mohamedchetoui/grad-cam-gradient-weighted-class-activation-mapping-ffd72742243a
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision (pp. 618-626).
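A bare-bones sketch of the Grad-CAM computation itself, using forward/backward hooks (the resnet18 backbone and its layer4 stage are assumptions chosen for illustration; in practice the tutorial's own trained models would be substituted):

    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.resnet18(weights="DEFAULT").eval()  # downloads ImageNet weights
    activations, gradients = [], []
    layer = model.layer4                               # last conv stage
    layer.register_forward_hook(lambda m, i, o: activations.append(o))
    layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    image = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed photo
    score = model(image)[0].max()         # score of the predicted class
    score.backward()

    # Global-average-pool the gradients to get one weight per channel,
    # then take the ReLU of the weighted sum of activation channels
    weights = gradients[0].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations[0]).sum(dim=1))
    cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear")
    print(cam.shape)  # torch.Size([1, 1, 224, 224]): heatmap to overlay on the image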
Using Grad-CAM to interpret the performance of an architectural image
classification model

Chen, J., Stouffs, R., & Biljecki, F. (2021). Hierarchical (multi-label) architectural image recognition and classification. In PROJECTIONS, Proceedings of the 26th International Conference of the Association for Computer-Aided Architectural Design Research in Asia (CAADRIA) 2021 (pp. 161-170).
Hands-on Exercise of Image Classification Model
Loading and Evaluation
Assignment 3: Individual work (10% of final grade)

For this assignment, you can choose one of the two following tasks:

1. Choose a target design website suitable for collecting image datasets for training an image classification model related to your field of design practice (architecture, landscape architecture, industrial design, etc.), and construct an image dataset using one of the data crawling and pre-processing approaches introduced in Tutorials 3 & 4. Write a report of at least 500 words (one or two paragraphs), with screenshots or illustrations of your image dataset construction process.

2. Use the two pre-trained architectural image classification models (densenet161 and resnext101) to classify the same 15 randomly selected architectural images from the test set of AIDA. Analyse and compare the classification results of the two models in terms of accuracy, and use the Grad-CAM visualization tool to interpret model performance (a model-loading sketch is shown after this section). Write a report based on your analysis; it should contain at least 500 words (one or two paragraphs), with comparison charts and the corresponding Grad-CAM visualization results.

Please note that the next assignment will allow you to train your own image classifier model, either
based on the AIDA dataset or the image dataset from task 1 above.
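For task 2, a minimal sketch of loading one of the pre-trained models (assumptions: the course-provided checkpoints are torchvision state dicts with a 25-class head, and the checkpoint filename below is hypothetical):

    import torch
    from torchvision import models

    # densenet161 exposes its head as .classifier; resnext101_32x8d uses .fc instead
    model = models.densenet161()
    model.classifier = torch.nn.Linear(model.classifier.in_features, 25)
    model.load_state_dict(torch.load("aida_densenet161.pth", map_location="cpu"))
    model.eval()

    with torch.no_grad():
        image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed test image
        predicted_class = model(image).argmax(dim=1)
    print(predicted_class)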
Assignment assessment criteria (10% of final grade)

Completeness: Make sure your report is complete with respect to the assignment requirements.

Critical thinking is expected: If you choose the second task, we will look into the width and depth of your thinking concerning the performance of the classification models in terms of design practice.
