3 - Feature Extraction and Images Classification - Part 3
Faculty of Technology
Computer Science Department – Master IATI
Dropout: a regularization technique used in artificial neural networks during training. It involves randomly "dropping out"
(setting to zero) a proportion of the neurons in the network during each training iteration.
A dropout rate of 0.5 in a neural network layer randomly deactivates 50% of its neurons during training.
A dropout rate greater than 0.5 deactivates more than half of the neurons.
A dropout rate less than 0.5 deactivates fewer neurons.
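As a sketch of how a dropout layer can be implemented (the 1/(1 − rate) rescaling, known as inverted dropout, is a common implementation detail not stated above):

```python
import numpy as np

def dropout(x, rate=0.5, training=True, rng=None):
    """Randomly zero a fraction `rate` of activations during training.

    Uses "inverted dropout": surviving activations are scaled by
    1/(1 - rate) so the expected activation is unchanged at test time.
    """
    if not training or rate == 0.0:
        return x  # dropout is disabled at inference time
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(x.shape) >= rate   # keep each unit with probability 1 - rate
    return x * mask / (1.0 - rate)

activations = np.ones(10)
dropped = dropout(activations, rate=0.5)   # each entry is either 0.0 or 2.0
```

At inference (`training=False`) the layer is an identity, which is why the rescaling during training matters.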
Early stopping: monitoring the performance of a model on a separate validation dataset during training.
When the validation loss starts to increase consistently or stagnates while the training loss continues to
decrease, it indicates that the model is overfitting and further training may not lead to better generalization
performance (early stopping is needed).
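A minimal sketch of this monitoring logic (the `patience` threshold of 3 epochs is an illustrative choice, not from the notes):

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved
    for `patience` consecutive epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # epochs since the last improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
# validation loss improves, then starts rising: overfitting begins
stops = [stopper.step(l) for l in [1.0, 0.8, 0.7, 0.75, 0.9]]
```

In practice one also saves the model weights whenever `best` improves, and restores them when stopping.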
Bag of Features (BoF): a technique that uses feature extraction methods such as SIFT, SURF, and ORB to extract local features
from the image and cluster them into visual words.
BoF is a powerful technique for extracting and representing features in a compact and meaningful way. But it is
computationally expensive and requires large amounts of training data.
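Once a codebook of visual words has been learned (e.g. by k-means over training descriptors), each image is represented by a histogram of word occurrences. A sketch of that representation step, assuming the codebook is already given:

```python
import numpy as np

def bof_histogram(descriptors, codebook):
    """Assign each local descriptor (e.g. from SIFT/ORB) to its nearest
    visual word in `codebook` and return the normalized word histogram."""
    # pairwise squared distances, shape (n_descriptors, n_words)
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d.argmin(axis=1)   # nearest visual word for each descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# toy example: 4 descriptors, codebook of 2 visual words (2-D for readability;
# real SIFT descriptors are 128-D)
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
desc = np.array([[0.1, 0.2], [9.8, 10.1], [0.3, -0.1], [10.2, 9.9]])
hist = bof_histogram(desc, codebook)
```

The resulting fixed-length histogram can then be fed to any standard classifier.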
HOG
First used for person detection [Dalal and Triggs, CVPR 2005]
Calculates the distribution of gradients (intensity gradients or edge directions) within localized regions of an image
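The core computation, sketched for a single cell (a full HOG descriptor also applies block normalization and concatenates many cells, which is omitted here):

```python
import numpy as np

def cell_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of gradient orientations for one
    cell: the building block of a HOG descriptor (simplified)."""
    gy, gx = np.gradient(patch.astype(float))      # intensity gradients
    magnitude = np.hypot(gx, gy)
    # unsigned orientation in [0, 180), as in Dalal & Triggs
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    bins = (angle / (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())  # vote by magnitude
    return hist

# a vertical edge produces purely horizontal gradients (orientation 0)
patch = np.tile([0., 0., 1., 1.], (4, 1))
h = cell_histogram(patch)
```

All the edge energy of this patch falls into the first orientation bin, illustrating how the histogram summarizes edge directions.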
Machine learning typically works with structured and semi-structured data, while deep learning works with structured and
unstructured data alike.
Deep learning algorithms can perform complex operations efficiently, while classical machine learning algorithms cannot.
Machine learning algorithms use labeled sample data to extract patterns, while deep learning accepts large volumes
of data as input and analyzes the input data to extract features from an object.
The performance of machine learning algorithms stops improving once the amount of data grows large; to keep improving
the model's performance with more data, deep learning is needed.
Neural networks
A neural network is a system modeled on the human brain, consisting of an input layer,
one or more hidden layers, and an output layer. Data is fed as input to the neurons; the
information is transferred to the next layer using appropriate weights and biases, and the
output is the final value predicted by the network.
Each neuron in a neural network performs the following operations: it computes a weighted sum of its inputs, adds a bias,
and applies an activation function to the result.
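These operations can be written in a few lines (sigmoid is used here as an illustrative activation; any other activation works the same way):

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: weighted sum of inputs plus bias,
    passed through an activation function (sigmoid here)."""
    z = np.dot(w, x) + b                 # weighted sum + bias
    return 1.0 / (1.0 + np.exp(-z))     # sigmoid activation

out = neuron(x=np.array([1.0, 2.0]), w=np.array([0.5, -0.25]), b=0.0)
```

Here the weighted sum is 0.5·1 − 0.25·2 = 0, so the sigmoid outputs 0.5.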
Cost function
The cost function is one of the significant components of a neural network. The cost value measures the difference between the
network's predicted output and the actual output from a set of labeled training data. The lowest cost is obtained by
adjusting the weights and biases iteratively throughout the training process.
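A common choice of cost function is the mean squared error; a minimal example (the notes do not fix a particular cost function, so MSE is used for illustration):

```python
import numpy as np

def mse_cost(predicted, actual):
    """Mean squared error: average squared difference between the
    network's predictions and the true labels."""
    return np.mean((predicted - actual) ** 2)

cost = mse_cost(np.array([0.9, 0.2]), np.array([1.0, 0.0]))
```

The squared errors are 0.01 and 0.04, so the cost is 0.025; training adjusts weights and biases to drive this value down.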
Classification Model
Invariance and equivariance:
Invariance: the property of a machine learning model where the output remains unchanged despite certain
transformations applied to the input data (e.g., the class label of a translated or rescaled object stays the same).
Equivariance: the property where the output transforms in the same way as the input (e.g., translating the input
image translates a segmentation map by the same amount).
Accurate classification:
• The model should accurately classify images regardless of their position or scale within the image.
• This is achieved by training with diverse datasets and using data augmentation.
• Techniques like image translation and scaling help the model generalize to various object positions and sizes.
Segmentation Model
Accurate segmentation:
• The model can identify objects in an image regardless of their position, size, or scale.
• This requires training with diverse datasets and using data augmentation techniques.
• Techniques like image translation and scaling can be applied to ensure accurate segmentation.
Figure 1.
Course: Information Analysis and Processing Year: 2023-2024
Image classification using deep learning
1D convolution operation:
Convolutional layers are network layers based on the convolution operation. In 1D, a convolution transforms an input
vector x into an output vector z so that each output 𝑧𝑖 is a weighted sum of nearby inputs.
The same weights are used at every position and are collectively called the convolution kernel or filter. The size of the
region over which inputs are combined is termed the kernel size. For a kernel size of three, we have:
z_i = ω1·x_{i−1} + ω2·x_i + ω3·x_{i+1}
where ω1, ω2, ω3 are the kernel weights.
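A direct translation of this weighted sum into code (as is conventional in deep learning, no kernel flip is performed, so this is strictly a cross-correlation):

```python
import numpy as np

def conv1d(x, kernel):
    """1-D convolution: each output z_i is a weighted sum of nearby
    inputs; the same kernel weights are reused at every position."""
    k = len(kernel)
    return np.array([np.dot(kernel, x[i:i + k])
                     for i in range(len(x) - k + 1)])

# a [1, 0, -1] kernel responds to local changes in the signal
z = conv1d(np.array([1., 2., 3., 4., 5.]), np.array([1., 0., -1.]))
```

Note that without padding the output is shorter than the input; padding is discussed below.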
1D convolution operation:
Stride, kernel size, and dilation.
a) Stride: with a stride of two, we evaluate the
kernel at every other position, so the first output z1 is
computed from a weighted sum centered at x1, the second output z2 from a weighted sum
centered at x3, and so on; the output is roughly half the length of the input.
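A small sketch of a strided convolution, evaluating the kernel only every `stride` positions:

```python
import numpy as np

def conv1d_strided(x, kernel, stride=2):
    """Valid 1-D convolution evaluated every `stride` positions, so the
    output is roughly 1/stride the length of the input."""
    k = len(kernel)
    positions = range(0, len(x) - k + 1, stride)   # skip every other position
    return np.array([np.dot(kernel, x[i:i + k]) for i in positions])

# averaging kernel over an input of length 8 yields 3 outputs with stride 2
z = conv1d_strided(np.arange(8.), np.ones(3) / 3, stride=2)
```

Striding is a cheap way to downsample the representation inside the network.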
1D convolution operation:
Dilation:
d) Dilation skips input values within the kernel window, effectively increasing the receptive
field of the neurons. It helps capture broader contextual information while reducing
computational cost compared to using a larger dense kernel.
Effective kernel size: accounts for the impact of dilation on the receptive field:
effective kernel size = kernel size + (kernel size − 1) × (dilation rate − 1)
For a kernel of size 3 with dilation rate 2, the effective size is 3 + 2 × 1 = 5.
Figure 3
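A sketch of a dilated convolution: a size-3 kernel with dilation rate 2 spans 5 input positions (the effective kernel size) while still using only 3 weights:

```python
import numpy as np

def conv1d_dilated(x, kernel, dilation=2):
    """Valid 1-D convolution with gapped kernel taps."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # effective kernel size (5 here)
    out = []
    for i in range(len(x) - span + 1):
        taps = x[i:i + span:dilation]      # every `dilation`-th input value
        out.append(np.dot(kernel, taps))
    return np.array(out)

z = conv1d_dilated(np.arange(6.), np.array([1., 1., 1.]), dilation=2)
```

The first output sums x0, x2, x4 and the second sums x1, x3, x5: the intermediate inputs are skipped, exactly as described above.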
1D convolution operation:
Padding:
In the convolution operation, each output is computed by taking a weighted sum of the previous, current, and subsequent positions
in the input.
Question: how do we deal with the first output (where there is no previous input) and the final output (where there is no
subsequent input)?
Solution 1: pad the edges of the input with new values (adding zeros around the edges).
Zero padding assumes the input is zero outside its valid range (figure 4).
Advantage: preserves spatial dimensions during convolution operations, enabling better
extraction of features from the edges and corners of the input images.
Solution 2: discard the output positions where the kernel exceeds the range of input
positions. These "valid" convolutions have the advantage of introducing no extra information
at the edges of the input.
Disadvantage: the representation decreases in size.
Figure 4
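The two solutions can be compared side by side ("same" = zero padding, "valid" = discard edge positions):

```python
import numpy as np

def conv1d(x, kernel, padding="same"):
    """1-D convolution with either zero padding ('same': output length
    equals input length) or no padding ('valid': the output shrinks)."""
    k = len(kernel)
    if padding == "same":
        p = (k - 1) // 2               # padding size = (kernel size - 1) / 2
        x = np.concatenate([np.zeros(p), x, np.zeros(p)])
    return np.array([np.dot(kernel, x[i:i + k])
                     for i in range(len(x) - k + 1)])

x = np.array([1., 2., 3., 4., 5.])
same = conv1d(x, np.ones(3), padding="same")     # length 5; edges assume zeros
valid = conv1d(x, np.ones(3), padding="valid")   # length 3; no invented values
```

The "same" output keeps the spatial size but its edge values are computed partly from the assumed zeros, while the "valid" output uses only real inputs at the cost of shrinking.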
1D convolution operation:
Padding size = (kernel size − 1) / 2
Convolution layers:
A convolutional layer computes its output by convolving the input, adding a bias β, and passing each
result through an activation function a[•]. With kernel size three, stride one, and dilation rate one, the
i-th hidden unit h_i would be computed as:
h_i = a[β + ω1·x_{i−1} + ω2·x_i + ω3·x_{i+1}]
ReLU: f(x) = max(0, x)
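Putting the three steps together (valid convolution here, to keep the sketch short):

```python
import numpy as np

def conv_layer(x, weights, beta):
    """Convolutional layer: valid size-3 convolution, plus bias beta,
    then ReLU: h_i = max(0, beta + w1*x[i-1] + w2*x[i] + w3*x[i+1])."""
    z = np.array([np.dot(weights, x[i:i + 3]) for i in range(len(x) - 2)])
    return np.maximum(0.0, z + beta)     # ReLU clips negative values to zero

h = conv_layer(np.array([1., -2., 3., -4.]), np.array([1., 1., 1.]), beta=0.5)
```

The second hidden unit's pre-activation is negative, so ReLU sets it to zero; this clipping is one reason information is lost in a single convolution (see channels below).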
Channels:
When applying a single convolution, information will inevitably be lost due to averaging nearby inputs and ReLU activation
clipping.
It is usual to compute several convolutions in parallel. Each convolution produces a new set of hidden variables, termed a
feature map or channel.
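A sketch of parallel convolutions: each kernel produces its own channel of hidden variables from the same input (the two example kernels are illustrative choices):

```python
import numpy as np

def conv_channels(x, kernels):
    """Apply several convolutions in parallel to the same input; each
    kernel produces one feature map (channel)."""
    k = kernels.shape[1]
    windows = np.array([x[i:i + k] for i in range(len(x) - k + 1)])
    return windows @ kernels.T           # shape: (positions, channels)

kernels = np.array([[1., 0., -1.],      # edge-like filter
                    [1., 1., 1.]])      # smoothing filter
maps = conv_channels(np.array([1., 2., 3., 4.]), kernels)
```

Because each channel retains different information, stacking several of them compensates for what any single convolution discards.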
2D convolution operation
Convolutional networks are more usually applied to 2D image data. The convolutional kernel is now a 2D object.
A 3×3 kernel Ω ∈ ℝ^(3×3) applied to a 2D input comprising elements x_ij
computes a single layer of hidden units h_ij as:
h_ij = a[β + Σ(m=1..3) Σ(n=1..3) ω_mn · x_{i+m−2, j+n−2}]
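The double sum over the 3×3 neighborhood translates directly into code (valid convolution, bias and activation omitted for brevity):

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2-D convolution: each hidden unit h_ij is a weighted sum
    of the kernel-sized input patch around position (i, j)."""
    kh, kw = kernel.shape
    H = x.shape[0] - kh + 1
    W = x.shape[1] - kw + 1
    h = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            h[i, j] = np.sum(kernel * x[i:i + kh, j:j + kw])
    return h

x = np.arange(16.).reshape(4, 4)
h = conv2d(x, np.ones((3, 3)) / 9)   # 3x3 averaging kernel
```

A 4×4 input with a 3×3 kernel yields a 2×2 output, each entry being the mean of one 3×3 patch.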
2D convolution operation
Receptive field:
The receptive field of a hidden unit in the network is the region of the original input that feeds into it.
Consider a convolutional network where each convolutional layer has kernel size three. The hidden units in the
first layer take a weighted sum of the three closest inputs, so have receptive fields of size three. The units in the
second layer take a weighted sum of the three closest positions in the first layer, which are themselves weighted
sums of three inputs.
The receptive field in a convolutional neural network with two 3x3 convolutional (conv) layers. In the 2nd conv layer, every
pixel has a 5x5 field of view.
With stride 1:
receptive field of the next layer = kernel size + current receptive field − 1
With stride 2:
receptive field of the next layer = kernel size + (current receptive field − 1) × 2
Number of layers:  1   2   3
Receptive field:   3   5   7
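The growth rules above can be applied layer by layer; a small sketch reproducing the table (the input itself is taken to have receptive field 1):

```python
def next_receptive_field(kernel_size, current_rf, stride=1):
    """Receptive-field growth rules as stated in the notes."""
    if stride == 1:
        return kernel_size + current_rf - 1
    return kernel_size + (current_rf - 1) * stride

rf = 1          # each input pixel sees only itself
fields = []
for _ in range(3):                               # three 3x3 conv layers
    rf = next_receptive_field(3, rf, stride=1)
    fields.append(rf)
```

With stride 1 and kernel size 3, the receptive field grows by 2 per layer: 3, 5, 7, matching the table.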
Assuming that the input size is 224 × 224 with kernel size 3:
Overview: