
2024 IEEE International Conference on Industrial Technology (ICIT), 25-27 March 2024, Bristol, UK

DOI: 10.1109/ICIT58233.2024.10540817

Reading and understanding house numbers for delivery robots using the "SVHN Dataset"

1st Omkar Pradhan, SATM, Cranfield University, Cranfield, UK, [email protected]
2nd Dr. Gilbert Tang, SATM, Cranfield University, Cranfield, UK, [email protected]
3rd Christos Makris, Ocado Technology, Hatfield, UK, [email protected]
4th Radhika Gudipati, Ocado Technology, Hatfield, UK, [email protected]

Abstract— Detecting street house numbers in complex environments is a challenging robotics and computer vision task that could be valuable in enhancing the accuracy of delivery robots' localisation. The development of this technology also has positive implications for address parsing and postal services. This project focuses on building a robust and efficient system that deals with the complexities associated with detecting house numbers in street scenes. The models in this system are trained on Stanford University's SVHN (Street View House Numbers) dataset. Fine-tuning YOLO's (You Only Look Once) nano model yielded an effective detection range from 1.02 meters to 4.5 meters. The optimum allowance for angle of tilt was ±15°. The inference resolution was obtained to be 2160 × 1620, with an inference delay of 35 milliseconds.

Index Terms—Artificial Intelligence, Character Recognition, Computer Vision, Object Detection, YOLO, SVHN.

1. Introduction

In the dynamic field of artificial intelligence (AI) development, rapid advancements in computational capabilities and sophisticated algorithms propel the creation of AI and machine learning (ML) models, meticulously designed for precise object categorization to emulate human cognitive abilities. Across industries, the widespread adoption of AI is driven by its unparalleled accuracy and minimal downtime, contributing transformative benefits in fields such as disease diagnosis, fraud detection, and autonomous vehicles. This project explores multiple facets of machine learning and machine vision, with applications extending to various domains, notably the optimization of delivery robots. The fusion of AI and computer vision equips these robots with the capability to navigate complex environments, efficiently recognize and handle objects, and seamlessly interact with their surroundings. This integration enhances the precision of object and number detection, fostering the evolution of smart, efficient, and adaptive delivery systems. The project objectives are as follows:

1) Literature review: this part develops the baseline of the work by reviewing existing research on the topic.
2) Pre-processing: the image data is prepared before any machine learning processing.
3) Deciding the optimum AI model to train for the best results: this part compares multiple candidate AI models against the following factors:
   a) inference time;
   b) efficiency gains from tuning the hyper-parameters;
   c) real-time object detection analysis.
4) Hardware implementation: the model is deployed with a camera and performs inference on a live camera feed.

1.1. Related works

State-of-the-art computational techniques can match human-level accuracy in pattern recognition and object detection when tested in a controlled environment, but this gap widens as we move towards complex scenarios. Dealing with number identification in real-world environments requires robust systems. [1] used a convolutional network architecture to deal with the issue: instead of max pooling, Lp pooling was implemented, multi-stage features were used, and training was performed with stochastic gradient descent (SGD). With this architecture, pattern recognition accuracy improved to 94.97%.

To determine the most suitable machine learning model for a specific application, researchers referred to various papers that summarized findings. In a study by [2], five machine learning models (Neural Network, K-Nearest Neighbour, Random Forest, Decision Tree, and bagging with gradient boost) were compared using the MNIST dataset. They applied multiple pre-processing techniques and found that Neural Networks achieved the highest accuracy at 95.73%, but struggled with poorly written digits. [3] also used the MNIST dataset, comparing Linear SVM, Multilayer Perceptron, and Convolutional Neural Networks (CNN).

They considered execution time, complexity, accuracy rate, epochs, and hidden layers. SVM provided the fastest results for simple data, while CNN excelled in accuracy and efficiency for more complex data. [4] and [5] also trained on MNIST, using ensemble systems, TFE-SVM, C-NN, and Large C-NN+. Their results reinforced that neural networks, particularly CNN, consistently outperform the other models in terms of efficiency and accuracy.

[6] conducted a classification analysis using 18 Deep Neural Network (D-NN) models, including ResNets, DenseNets, MobileNets, NASNets, VGG Nets, and AlexNets. They trained these models on the ImageNet dataset, generating adversarial images from a random sample of 1000 images. Evaluation criteria included attack success rate, distortion using the l2 and l∞ norms, CLEVER scores, and transferability, providing a comprehensive understanding of model performance. [7], in a separate study, compared VGG16, VGG19, and ResNet50 models trained on a custom dataset of 6000 images with five classes. The resulting accuracies were 0.9667, 0.9707, and 0.9733 respectively.

Object detection combines localization and classification and is often implemented using CNN architectures. In [8], Faster-RCNN, YOLOv5, and SSD were compared using an automobile training dataset. Faster-RCNN demonstrated better accuracy but was unsuitable for real-time applications due to its two-stage nature, making YOLO the top performer. In another comparison by [9] in 2021, YOLOv6 was pitted against SSD, with SSD outperforming YOLO in terms of Frames Per Second (FPS) and achieving a higher mean Average Precision (mAP) score.

[10] delved into the SVHN (Street View House Numbers) dataset from Stanford, consisting of 73,257 images in the training set and 26,032 in the test set. An additional SVHN extra dataset, with 531,131 less complex images, introduces potential model bias. Feature learning, using methods such as Histogram of Oriented Gradients (HOG) and Sauvola binarization, is employed to detect features in these complex images. Post-processing involves comparing results from algorithms such as HOG binary features, K-means, and Stacked Sparse Auto-Encoders. The findings favor the K-means-based system, although a notable challenge is the continuing failure of the binarization algorithm to separate characters from their surrounding backgrounds.

1.2. Research Methodology

The study followed a general methodology that involved collecting relevant papers in the AI field, laying the foundation for the research direction. Once the basics were established, specific models were chosen for the project. Data analysis and preparation were conducted to facilitate model training. During the training and testing phases, the selected models were individually trained and fine-tuned for improved performance. Challenges encountered at each stage were addressed through the study of GitHub libraries and literature. This iterative process continued until the desired results were achieved.

2. Implementation

To successfully implement the training, the data was prepared in the format required for training the specific model (also known as pre-processing), as required for using YOLOv8. The processes of data pre-processing, augmentation, and parameter tuning are defined in subsections 2.1 through 2.5.

2.1. Bounding Box Labels

As stated in Section 1.1, the study converged on YOLO as the optimum model to begin with. According to Ultralytics, a YOLO model is trained with a training and a validation dataset. Each image in the bifurcated dataset needs to be accompanied by a text file defining the classes and the bounding box information (e.g., if an image is named 'xyz.jpg', the text file should be 'xyz.txt'). The information in the text file must be in exactly the form required to train the YOLO model, as shown in Table 1.

TABLE 1. TEXT FILE FORMAT

C1   CBB1_X_norm   CBB1_Y_norm   W1_norm   H1_norm
C2   CBB2_X_norm   CBB2_Y_norm   W2_norm   H2_norm
...
Cn   CBBn_X_norm   CBBn_Y_norm   Wn_norm   Hn_norm

where:
Cn          = nth class number
CBBn_X_norm = normalised centre x-coordinate of the nth bounding box
CBBn_Y_norm = normalised centre y-coordinate of the nth bounding box
Wn_norm     = normalised width of the nth bounding box
Hn_norm     = normalised height of the nth bounding box

Stanford University's SVHN dataset comes in two formats: one with all images formatted to 32x32 pixels, heavily cropped to show only a single number, and another with images in a general form including parts of the environment. The 'digitStruct.mat' file provides data on bounding boxes and classes. To create individual text files for each image, a MATLAB code was developed. This code reads the image dimensions, extracts information from the mat file, and normalizes and formats the data according to the required specifications.
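The conversion was implemented in MATLAB in this work; purely as an illustration, the normalisation it performs can be sketched in Python as below, assuming the bounding boxes have already been extracted from 'digitStruct.mat' (e.g., with h5py) into plain dictionaries. The function name and dictionary layout are assumptions, not the authors' code.

# Sketch: write one YOLO label file (Table 1 format) per SVHN image.
# Assumes boxes were already read out of digitStruct.mat, e.g. via h5py,
# into dicts like {"label": 2, "left": 43, "top": 7, "width": 19, "height": 32}.
from pathlib import Path
from PIL import Image

def write_yolo_label(image_path: str, boxes: list[dict], out_dir: str) -> None:
    img_w, img_h = Image.open(image_path).size          # image dimensions
    lines = []
    for b in boxes:
        cls = int(b["label"]) % 10                      # SVHN stores '0' as label 10
        x_c = (b["left"] + b["width"] / 2) / img_w      # normalised centre x
        y_c = (b["top"] + b["height"] / 2) / img_h      # normalised centre y
        w, h = b["width"] / img_w, b["height"] / img_h  # normalised size
        lines.append(f"{cls} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    out = Path(out_dir) / (Path(image_path).stem + ".txt")  # xyz.jpg -> xyz.txt
    out.write_text("\n".join(lines))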
To determine the total number of instances of each class appearing across the training and validation sets, a graph is plotted.

Figure 1. Total number of instances of classes in training and validation

The graph shows that instance counts are highest for the number '1' and decrease steadily through the number '9', whose count is roughly equal to that of the number '0'.
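The counting behind Figure 1 can be sketched in a few lines of Python over the label files described in Section 2.1; the folder path below is an assumed placeholder.

# Sketch: count how often each class appears across the label files,
# i.e. the data behind Figure 1.
from collections import Counter
from pathlib import Path

counts = Counter()
for label_file in Path("train/labels").glob("*.txt"):
    for line in label_file.read_text().splitlines():
        counts[int(line.split()[0])] += 1      # first field is the class id
print(dict(sorted(counts.items())))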

2.2. YAML file

The YAML file contains crucial details about the dataset, specifying the locations of the training and testing datasets and defining the classes. Path indicates the absolute location of the dataset folder. Train and Val denote the relative locations of the training and validation datasets. Names assigns an encoding to each class (e.g., 7 represents the encoded value for 'H'). These encodings are reflected in the Cn column of the Bounding Box Labels file, detailed in Section 2.1.
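For illustration, a dataset YAML in the usual Ultralytics form might look like the sketch below; the path and split names are assumptions consistent with Section 2.3, and the class list shown is the plain digit mapping rather than the paper's exact encoding.

# svhn.yaml -- assumed dataset configuration (illustrative sketch)
path: /home/user/datasets/svhn   # Path: absolute location of the dataset folder
train: train/images              # Train: relative location of training images
val: val/images                  # Val: relative location of validation images
names: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']   # Names: class encodings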
2.3. Structure of the Files

To train the model, the dataset with its labels and images needs to be organised in a specific structure, shown in Figure 2. The absolute path of the file directory is to be given in the YAML file, as explained in Section 2.2. File names matter greatly when training YOLO models with the Ultralytics library: the training and validation splits must each contain folders named 'Images' and 'Labels', because the Ultralytics library searches for folders with exactly those names. The names of the train and validation folders themselves can be user-defined, provided the exact names and locations are given in the YAML file.

Figure 2. Structure of the files
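The layout depicted in Figure 2 follows the usual Ultralytics convention; one layout satisfying these constraints, with assumed user-defined split names, is:

svhn_dataset/
├── train/
│   ├── images/   (xyz.jpg, ...)
│   └── labels/   (xyz.txt, ...)
└── val/
    ├── images/
    └── labels/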

2.4. Hyper-parameters

During the training of a model, multiple hyper-parameters can be varied to examine the impact of their variation on inference. The hyper-parameters studied and tuned in this work are shown in Table 2. The models were tuned and analysed using different combinations of these values.

TABLE 2. TUNED HYPER-PARAMETERS

Hyper-parameter    Values
Optimiser          Auto, SGD, RAdam, Adamax
Pretrained         True / False
Degrees            ±5, ±15, ±45
Imgsz              240, 320, 640, 720, 1080
Model              YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8x

3. Analysis and Discussion

Before hyper-parameter tuning, the initial step involves identifying the best-performing models for the SVHN dataset using their default configurations. This preliminary performance analysis aims to streamline the process by excluding slower or less efficient models. The training and testing procedures are carried out on the hardware and software specifications detailed in Table 3.

TABLE 3. HARDWARE SPECIFICATIONS

Hardware             Specification
CPU                  Intel i5-11400H, 6 cores / 12 threads
GPU                  Nvidia RTX 3050 Laptop GPU, 4 GB
RAM                  32 GB DDR4 3200 MHz
Max Power            TDP 180 W

Software             Specification
OS                   Ubuntu 22.04.2 Jammy Jellyfish
IDE                  VS Code
Python version       3.10
CUDA version         11.8.0
cuDNN version        8.6.0.163
TensorFlow version   2.12.0

3.1. Default Model Performance Analysis

To decide which model works best, multiple aspects need to be considered. To gauge the overall behaviour of the different versions of the YOLOv8 models, parameters such as training time, precision vs. epochs, mAP vs. epochs, and recall vs. epochs were compared.
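A minimal sketch of this default-configuration comparison, written against the Ultralytics Python API, is shown below; 'svhn.yaml' is the assumed dataset file from Section 2.2, and the printed metrics come from the library's validation results object.

# Sketch: train each YOLOv8 variant with default settings and compare
# validation precision, recall, and mAP.
from ultralytics import YOLO

for variant in ("yolov8n.pt", "yolov8s.pt", "yolov8m.pt", "yolov8x.pt"):
    model = YOLO(variant)                     # pretrained checkpoint
    model.train(data="svhn.yaml", epochs=20)  # default hyper-parameters
    m = model.val()                           # metrics on the validation split
    print(variant, f"P={m.box.mp:.3f}  R={m.box.mr:.3f}  mAP50={m.box.map50:.3f}")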

Comparing the models, YOLOv8x emerges as the top performer, but for tuning purposes YOLOv8n is selected as the best model. This decision is based on YOLOv8n having precision levels almost equivalent to YOLOv8x while being significantly lighter. This choice reduces computational expense and enables implementation on mobile systems.

Figure 3. Precision vs Epochs

Precision = TP / (TP + FP)    (1)

The top three performers, YOLOv8x, YOLOv8n, and YOLOv8m, were examined further. The study delved deeper into these models, analyzing Precision-Confidence and Recall-Confidence curves to gain a comprehensive understanding of their overall performance.

Figure 4. Precision-Confidence Curve of YOLOv8x, YOLOv8n, YOLOv8m

Precision-Confidence curves for the three models revealed a consistent lag in precision for the digit '1', with the widest gap in YOLOv8x, narrowing in YOLOv8n, and reducing further in YOLOv8m. As confidence levels increased, the gap diminished, and all models peaked at approximately 82% confidence. Despite '1' having the highest number of training instances, its lower precision is puzzling. To address this, a detailed analysis of the confusion matrix is recommended for insight into class predictions, including true and false predictions across all classes.

Figure 5. Confusion Matrix for YOLOv8x

In the case of YOLOv8m, it was observed that 20% of the time it identified the number '1' in the background. This indicates that the misclassification problem relates more to errors in the dataset than to how the models are trained.

3.2. Hyper-Parameters Tuning Analysis

YOLOv8n was trained with a fixed epoch count of 20 and a batch size of -1, optimizing CUDA memory usage. A fixed positive batch size can either slow down training or exceed memory capacity; setting it to -1 lets the system dynamically determine how many images are trained on simultaneously, optimizing memory usage without trial and error.
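A sketch of how this sweep can be expressed with the Ultralytics API is given below, varying one hyper-parameter family at a time as in the analysis that follows; 'svhn.yaml' is again the assumed dataset file.

# Sketch: tune one hyper-parameter family at a time, always with
# batch=-1 so the batch size is fitted to CUDA memory automatically.
from ultralytics import YOLO

def run(**overrides):
    model = YOLO("yolov8n.pt")
    model.train(data="svhn.yaml", epochs=20, batch=-1, **overrides)
    return model.val()

for opt in ("auto", "SGD", "RAdam", "Adamax"):  # optimiser sweep (Table 2)
    run(optimizer=opt)
for deg in (5, 15, 45):                         # tilt-angle sweep
    run(degrees=deg)
for size in (240, 320, 640, 720, 1080):         # training-resolution sweep
    run(imgsz=size)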
Figure 6. Precision-recall-mAP comparison of Nano models for different Optimisers

When plotting Precision, Recall, and mAP, the curve enclosing the largest area indicates the best results. In this scenario, the Default configuration consistently delivers the best results: an optimiser with collectively high Precision, Recall, and mAP outperforms the alternatives and incurs fewer overall losses. Therefore, the Default (Auto) configuration is the most suitable choice in this case.

Enabling tilt for the models revealed that as the tilt angle increased, losses also increased. The model with no rotation had the lowest loss, the model with a 45° tilt had the highest, and the 15° tilt fell in between.

Figure 7. Precision-recall-mAP comparison of Nano models for different tilt angles

Training models at various training resolutions, ranging from 240 to 1080 pixels in width, led to an unexpected outcome: the model trained at a 240-pixel resolution performed the best. This contradicts the assumption that a higher training resolution leads to better results, which typically holds only when the actual image resolution is sufficiently high; it is crucial for the training resolution to closely match the actual image resolution.

Figure 8. Precision-recall-mAP comparison of Nano models for different training resolutions

To investigate this, a code was developed to identify the highest and lowest image resolutions in the training dataset, as well as the average resolution across all images. The analysis revealed that the average image resolution is 128 pixels wide and 50 pixels high. This indicates that training the model at a resolution higher than the actual image resolution introduces significant noise, leading to increased losses and reduced accuracy. Therefore, taking resolution into account, the optimal model is the one trained at a 240-pixel width.
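The survey code itself is not reproduced in the paper; a minimal sketch of what it computes, with an assumed dataset path, could be:

# Sketch: find the smallest, largest, and average image resolution
# across the training images.
from pathlib import Path
from PIL import Image

sizes = [Image.open(p).size for p in Path("train/images").glob("*.png")]
widths, heights = zip(*sizes)
print("min :", min(widths), "x", min(heights))    # per-axis minima
print("max :", max(widths), "x", max(heights))    # per-axis maxima
print("mean:", sum(widths) // len(sizes), "x", sum(heights) // len(sizes))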
3.3. Inference Timings and Min/Max Distance

To evaluate the real-time performance of the models, all models trained with the various hyper-parameters discussed in Section 3.2 were configured for inference. The recorded metrics include inference time and the closest and farthest distances at which classes are successfully classified. These measurements were collected while altering the inference resolution across 240, 320, 640, 720, 1080, 2160, and 3840-pixel widths.

The models were assessed for effective classification and localization distances through lab experiments. A complex-font image with the numbers 0 to 9 was affixed to a movable structure, and a webcam, paired with a laser-guided distance meter, ensured accurate measurements. Multiple experiments with varying hyper-parameters and inference resolutions were conducted; results were tabulated, and final outcomes were derived by averaging data from three repetitions of each experiment.
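As an illustration, the per-resolution timing measurement can be sketched with the speed fields that Ultralytics attaches to each result; the checkpoint and source paths below are assumptions.

# Sketch: record inference delay while sweeping the inference resolution.
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # assumed checkpoint path
for width in (240, 320, 640, 720, 1080, 2160, 3840):
    # Ultralytics attaches per-stage timings (ms) to every result.
    result = model.predict("frame.jpg", imgsz=width, verbose=False)[0]
    print(width, f"inference: {result.speed['inference']:.1f} ms")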
RAdam demonstrated standout performance in the optimiser evaluations, achieving a maximum inference distance of 7.389 meters and a lower bound of 1.396 meters at an inference resolution of 3840 pixels, albeit with a noticeable 100-millisecond inference lag. Re-evaluation at an inference resolution of 2160 pixels yielded a range from 0.967 to 5.483 meters. Optimal results were obtained with a tilt angle of 15°, aligning with the previous comparisons, offering flexibility within ±15° and an inference distance spanning 1.11 to 3.171 meters. Similarly, the 240-pixel training resolution echoed the earlier findings, providing the best performance with a distance range of 1.1 to 4.9 meters.

Figure 9. Inference distance for YOLOv8n with RAdam Optimiser

3.4. Compound Hyper-parameters

The analysis in Section 3.3 indicates that a training resolution of 240 pixels width and the RAdam optimizer produce optimal results within their respective segments. A tilt angle of 15° is chosen as a balance between robustness and the reduction in inference distance. Subsequently, a model is trained with these optimal hyper-parameter settings: a training resolution of 240 pixels width, the RAdam optimizer, and a ±15° tilt angle.
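With those settings fixed, the compound training run reduces to a single call, sketched here with the assumed dataset file:

# Sketch: the compound configuration from Section 3.4 in a single run.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(data="svhn.yaml", epochs=20, batch=-1,
            imgsz=240, optimizer="RAdam", degrees=15)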
Figure 10. Confusion matrix for YOLOv8 with compound tuning

Figure 11. Precision-Confidence curve and Recall-Confidence curve for YOLOv8 with compound tuning

Figure 12. Precision-recall-mAP comparison of the compound-tuned model with the individually tuned models

Figure 13. Inference distance for YOLOv8n with Compound Model Tuning

4. Results and challenges

4.1. Results

After the successful completion of tuning, YOLOv8n was selected with the hyper-parameters defined in Table 4.

TABLE 4. FINALISED HYPER-PARAMETERS

Hyper-parameter    Value
Imgsz (Training)   240
Optimiser          RAdam
Degrees            15

This model gave the optimum results during the analysis, with an effective inference distance of 1.02 to 4.5 meters and an average latency of 32 milliseconds.

4.2. Challenges

4.2.1. Configuration errors. Setting up the environment demands careful configuration. Neglecting to ensure compatibility among the versions of PyTorch, cuDNN, CUDA, and TensorFlow can lead to a chain of failures and a potentially malfunctioning setup. Installing the latest version of each package with the pip command is not recommended, as it can result in compatibility issues or failure to recognize specific packages. To address this, it is advisable to follow the comprehensive compatibility list provided by TensorFlow to verify the versions.
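As a hedge against such mismatches, the installed stack can be sanity-checked against the Table 3 versions at start-up; the sketch below assumes the PyTorch build was installed for CUDA 11.8.

# Sketch: verify the environment matches the Table 3 versions before
# training, rather than trusting whatever `pip install` resolved.
import tensorflow as tf
import torch

assert tf.__version__.startswith("2.12"), tf.__version__
assert torch.version.cuda == "11.8", torch.version.cuda  # CUDA build used by PyTorch
assert torch.cuda.is_available(), "GPU not visible to PyTorch"
print("cuDNN version:", torch.backends.cudnn.version())  # expect 8.6.x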
4.2.2. Classification of Number '1'. In Section 3.2, an average image resolution of 128 x 50 pixels was determined. Training involves convolution and pooling layers that preserve features while reducing dimensions. The unclassified portion of each image represents the background class, and at such low resolutions the narrow digit '1' can be reduced to a background-like feature, causing misclassification and lower precision. As a result, '1' is detected later, or only at closer proximity, than the other classes, impacting overall performance. A proposed solution is outlined in Section 6.

5. Conclusion

The successful implementation of house number recognition in a complex environment was achieved using the SVHN (Street View House Numbers) dataset. A robust system was developed that uses a webcam to detect and localize numbers with 90% accuracy over an inference distance of up to 4.5 meters.

6. Future work

6.0.1. Recognition. To address the inference issue related to the number '1', a potential solution is to build a custom image dataset by gathering images of the number '1' from various open sources and incorporating them, with their corresponding bounding box information, into the training dataset. This would make it possible to assess whether adding more number '1' images resolves the issue, and whether doing so has any unintended consequences for the results of the other classes.

6.0.2. Model Fine Tuning. Much finer tuning of the models can be done by changing the parameters more gradually, resulting in a more extensive analysis.

6.0.3. Delivery Robots. After further fine-tuning, the trained algorithm can be implemented on actual robots, such as delivery robots, to test the system in practice. The fusion of GPS and number recognition can then be implemented and analysed for the improvement it brings to the localisation accuracy of the systems currently in use.

References

[1] P. Sermanet, S. Chintala, and Y. LeCun, "Convolutional neural networks applied to house numbers digit classification," Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), pp. 3288–3291, 2012.

[2] S. Chen, R. Almamlook, Y. Gu, and L. Wells, "Offline handwritten digits recognition using machine learning," European Conference on Computer Vision (ECCV), 2018.

[3] R. Dixit, R. Kushwah, and S. Pashine, "Handwritten digit recognition using machine and deep learning algorithms," International Journal of Computer Applications, vol. 176, pp. 27–33, Jul. 2020.

[4] A. Shrivastava, I. Jaggi, S. Gupta, and D. Gupta, "Handwritten digit recognition using machine learning: A review," 2019 2nd International Conference on Power Energy, Environment and Intelligent Control (PEEIC), pp. 322–326, 2019.

[5] R. Karakaya and S. Çakar, "Handwritten digit recognition using machine learning," Sakarya University Journal of Science, vol. 25, Oct. 2020.

[6] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao, "Is robustness the cost of accuracy? – A comprehensive study on the robustness of 18 deep image classification models," Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part XII, pp. 644–661, Sep. 2018.

[7] S. Mascarenhas and M. Agarwal, "A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for image classification," 2021 International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications (CENTCON), vol. 1, pp. 96–99, Nov. 2021.

[8] J.-a. Kim, J.-Y. Sung, and S.-H. Park, "Comparison of Faster-RCNN, YOLO, and SSD for real-time vehicle type recognition," 2020 IEEE International Conference on Consumer Electronics – Asia (ICCE-Asia), pp. 1–4, 2020.

[9] M. Shetty, "A review on deep learning object detection: YOLO vs SSD," International Journal of Advanced Research in Science, Communication and Technology (IJARSCT), vol. 5, 2021.

[10] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.