DOI: 10.1109/ICIT58233.2024.10540817
Reading and understanding house numbers for delivery robots using the "SVHN Dataset"

1st Omkar Pradhan, SATM, Cranfield University, Cranfield, UK, [email protected]
2nd Dr. Gilbert Tang, SATM, Cranfield University, Cranfield, UK, [email protected]
3rd Christos Makris, Ocado Technology, Hatfield, UK, [email protected]
4th Radhika Gudipati, Ocado Technology, Hatfield, UK, [email protected]

2024 IEEE International Conference on Industrial Technology (ICIT), 25-27 March 2024, Bristol, UK
Abstract—Detecting street house numbers in complex environments is a challenging robotics and computer vision task that could be valuable in enhancing the accuracy of delivery robots' localisation. The development of this technology also has positive implications for address parsing and postal services. This project focuses on building a robust and efficient system that deals with the complexities associated with detecting house numbers in street scenes. The models in this system are trained on Stanford University's SVHN (Street View House Numbers) dataset. By fine-tuning YOLO's (You Only Look Once) nano model, the system achieves an effective detection range from 1.02 meters to 4.5 meters. The optimum allowance for angle of tilt was ±15°. The optimum inference resolution was 2160 × 1620, with an inference delay of 35 milliseconds.

Index Terms—Artificial Intelligence, Character Recognition, Computer Vision, Object Detection, YOLO, SVHN.

2) Pre-processing: the image data is prepared before any machine learning processing.

3) Deciding the optimum AI model to train for the best results: this part compares the results of multiple candidate AI models against the following factors.

a) Considering the inference time

b) Increasing efficiency by tuning the hyper-parameters

c) Real-time object detection analysis

4) Hardware implementation: in this part, the model is implemented with a camera, running inference on the live camera feed.

1.1. Related works
epochs, and hidden layers. SVM provided the fastest results for simple data, while CNN excelled in accuracy and efficiency for more complex data. [4] and [5] also trained on MNIST, using ensemble systems, TFE-SVM, C-NN, and Large C-NN+. Their results reinforced that neural networks, particularly CNNs, consistently outperform other models in terms of efficiency and accuracy.

[6] conducted a classification analysis using 18 Deep Neural Network (D-NN) models, including ResNets, DenseNets, MobileNets, NASNets, VGG Nets, and AlexNets. They trained these models on the ImageNet dataset, generating adversarial images from a random sample of 1000 images. Evaluation criteria included attack success rate, distortion under the l2 and l∞ norms, CLEVER scores, and transferability, providing a comprehensive understanding of model performance. [7], in a separate study, compared VGG16, VGG19, and ResNet50 models trained on a custom dataset of 6000 images with five classes; the resulting accuracies were 0.9667, 0.9707, and 0.9733 respectively.

Object detection combines localization and classification and is often implemented using CNN architectures. In [8], Faster-RCNN, YOLOv5, and SSD were compared on an automobile training dataset. Faster-RCNN demonstrated better accuracy but was unsuitable for real-time applications due to its two-stage nature, making YOLO the top performer. In another comparison by [9] in 2021, YOLOv6 was pitted against SSD, with SSD outperforming YOLO in terms of Frames Per Second (FPS) and achieving a higher mean Average Precision (mAP) score.

[10] delved into the SVHN (Street View House Numbers) dataset from Stanford, consisting of 73,257 images in the training set and 26,032 in the test set. An additional SVHN extra dataset, with 531,131 less complex images, introduces potential model bias. Feature learning, utilizing methods such as Histogram of Oriented Gradients (HOG) and Sauvola binarization, is employed to detect features in these complex images. Post-processing involves comparing results from algorithms such as HOG binary features, K-means, and Stacked Sparse Auto-Encoders. The findings favor the K-means-based system, although a notable challenge is the continuous failure of the binarization algorithm to separate characters from their surrounding backgrounds.

1.2. Research Methodology

The study followed a general methodology that involved collecting relevant papers in the AI field, laying the foundation for the research direction. Once the basics were established, specific models were chosen for the project. Data analysis and preparation were conducted to facilitate model training. During the training and testing phases, the selected models were individually trained and fine-tuned for improved performance. Challenges encountered at each stage were addressed through the study of GitHub libraries and literature. This iterative process continued until the desired results were achieved.

2. Implementation

To implement the training successfully, the data was prepared in the format required by the specific model (a step also known as pre-processing), as required for YOLOv8. The processes of data pre-processing, augmentation and parameter tuning are described in subsections 2.1, 2.2, 2.3, 2.4 and 2.5.

2.1. Bounding Box Labels

As stated in section 1.1, the study converged on YOLO as the optimum model to begin with. According to Ultralytics, a YOLO model is trained with a training and a validation dataset. Each image in the split dataset must be accompanied by a text file that defines the classes and bounding box information (for example, if an image is named 'xyz.jpg', the corresponding text file must be named 'xyz.txt'). The information in the text file must be in exactly the form required to train the YOLO model, as shown in Table 1.
TABLE 1. Text File Format

C1   CBB1_X_norm   CBB1_Y_norm   W1_norm   H1_norm
C2   CBB2_X_norm   CBB2_Y_norm   W2_norm   H2_norm
...
Cn   CBBn_X_norm   CBBn_Y_norm   Wn_norm   Hn_norm

where Cn is the nth class number, CBBn_X_norm is the normalised X-coordinate of the nth bounding box centre, CBBn_Y_norm is the normalised Y-coordinate of the nth bounding box centre, Wn_norm is the normalised width of the nth bounding box, and Hn_norm is the normalised height of the nth bounding box.

Stanford University's SVHN dataset comes in two formats: one with all images formatted to 32x32 pixels, heavily cropped to show only a single number, and another with images in a general form including parts of the environment. The 'digitstruct.mat' file provides the bounding box and class data. To create an individual text file for each image, a MATLAB script was developed. This script reads the image dimensions, extracts the information from the mat file, and normalises and formats the data according to the required specification.
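For illustration, the normalisation step described above can be written in a few lines. The sketch below is a Python rendering of that conversion (the project's own implementation is the MATLAB script mentioned above); the function name and the example values are made up for demonstration.

```python
def to_yolo_line(cls, left, top, width, height, img_w, img_h):
    """Convert one pixel-space box to a normalised YOLO label line.

    cls           : integer class index (digit)
    left, top     : top-left corner of the box, in pixels
    width, height : box size, in pixels
    img_w, img_h  : image size, in pixels
    """
    cx = (left + width / 2) / img_w    # normalised box-centre x
    cy = (top + height / 2) / img_h    # normalised box-centre y
    w = width / img_w                  # normalised box width
    h = height / img_h                 # normalised box height
    return f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: a digit '7' at (30, 12) sized 22x40 px inside a 120x60 px image
print(to_yolo_line(7, 30, 12, 22, 40, 120, 60))
```

Each such line is appended to the image's 'xyz.txt' file, one line per digit in the image.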
To determine how many instances of each class appear across the training and validation sets, a graph of class frequencies was plotted.
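That class-frequency graph can be reproduced with a short script. The sketch below is a Python illustration that counts how often each digit class appears across the YOLO label files; the label-directory paths are assumptions, not the paths used in the project.

```python
from collections import Counter
from pathlib import Path

def count_class_instances(label_dirs):
    """Count class occurrences across all YOLO label files in the given folders."""
    counts = Counter()
    for label_dir in label_dirs:
        for label_file in Path(label_dir).glob("*.txt"):
            for line in label_file.read_text().splitlines():
                if line.strip():
                    counts[int(line.split()[0])] += 1  # first field is the class index
    return counts

# Assumed layout: train/labels and valid/labels hold the per-image .txt files
counts = count_class_instances(["train/labels", "valid/labels"])
for cls in sorted(counts):
    print(f"class {cls}: {counts[cls]} instances")
```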
file directory is to be given in the YAML file, as explained in section 2.2. The names of the files matter the most when training YOLO models with the Ultralytics library: there must be folders named 'Images' and 'Labels' in both the training and validation sets, because the Ultralytics library searches for folders with exactly those names. The names of the train and validation folders themselves can be user-defined, provided the exact names and locations are defined in the YAML file.
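A minimal sketch of such a layout and dataset YAML, in the style of the Ultralytics documentation, is given below. The folder names, the file name 'svhn.yaml', and the class-name list are illustrative assumptions rather than the exact files used in this project; the lowercase image/label folder names follow current Ultralytics releases.

```python
from pathlib import Path

# Assumed layout (the train/valid folder names are user-defined, but each must
# contain the image and label sub-folders the library searches for):
#   dataset/
#     train/images/*.png   train/labels/*.txt
#     valid/images/*.png   valid/labels/*.txt
dataset_yaml = """\
path: dataset          # dataset root directory
train: train/images    # training images (labels are located alongside)
val: valid/images      # validation images
names: ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
"""
Path("svhn.yaml").write_text(dataset_yaml)
```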
2.4. Hyper-parameters

YOLOv8n was trained with a fixed epoch count of 20 and a batch size of -1, which optimises CUDA memory usage. A fixed positive batch size can either slow down training or exceed memory capacity; by setting it to -1, the system dynamically determines how many images are trained simultaneously, optimising memory usage without trial and error.
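With those settings, a training run along the following lines would apply the fixed epoch count and the automatic batch size. This is a sketch against the Ultralytics Python API; 'svhn.yaml' is the assumed dataset file sketched earlier, not a file provided by the project.

```python
from ultralytics import YOLO

# Fine-tune the pretrained nano weights on the SVHN-style dataset.
model = YOLO("yolov8n.pt")
model.train(
    data="svhn.yaml",  # assumed dataset YAML sketched earlier
    epochs=20,         # fixed epoch count used in this study
    batch=-1,          # let the library pick a batch size that fits CUDA memory
)
```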
3.1. Default Model Performance Analysis

To decide which model works best, multiple aspects need to be considered. To assess the overall behaviour of the different YOLOv8 variants, parameters such as training time, precision vs epochs, mAP vs epochs, and recall vs epochs were compared.

Figure 3. Precision vs Epochs

Precision = TP / (TP + FP)     (1)

Comparing the models, YOLOv8x emerges as the top performer, but for tuning purposes YOLOv8n is selected as the best model. This decision is based on YOLOv8n having precision levels almost equivalent to YOLOv8x while being significantly lighter in size, which reduces computational expense and enables implementation on mobile systems.

The top three performers, YOLOv8x, YOLOv8n, and YOLOv8m, were examined further. The study analysed their Precision-Confidence and Recall-Confidence curves to gain a comprehensive understanding of their overall performance.

The precision-confidence curves for the three models revealed a consistent lag in precision for the digit '1', with the widest gap in YOLOv8x, narrowing in YOLOv8n, and reducing further in YOLOv8m. As confidence levels increased, the gap diminished, and all models peaked at approximately 82% confidence. Despite '1' having the highest number of training instances, its lower precision is puzzling. To address this, a detailed analysis of the confusion matrix is recommended, giving insight into class predictions, including true and false predictions across all classes.
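One way to carry out that per-class analysis, once each detection has been matched to a ground-truth digit (for example by IoU), is to tabulate the matched labels in a confusion matrix. The sketch below uses scikit-learn and assumes the matched label lists are already available; the example values are made up.

```python
from sklearn.metrics import confusion_matrix

# true_digits / pred_digits: class indices of ground-truth boxes and of the
# detections matched to them (matching by IoU is assumed to happen upstream).
true_digits = [1, 1, 7, 3, 1, 9]
pred_digits = [1, 7, 7, 3, 1, 9]

cm = confusion_matrix(true_digits, pred_digits, labels=list(range(10)))
print(cm)  # rows: true class, columns: predicted class
```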
When plotting Precision, Recall, and mAP, the configuration whose curves enclose the largest area gives the optimum results. In this scenario, the Default configuration consistently delivers the best results: the optimiser with collectively high Precision, Recall, and mAP outperforms the alternative, resulting in fewer overall losses. Therefore, the Default (Auto) configuration is the most suitable choice in this case.

Enabling tilt for the models revealed that as the tilt angle increased, losses also increased. The model with no rotation had the lowest loss, the model with a 45° tilt had the highest, and the 15° tilt fell in between.

Figure 8. Precision-recall-mAP comparison of Nano models for different training resolutions
Figure 9. Inference distance for YOLOv8n with RAdam Optimiser
Figure 13. Inference distance for YOLOv8n with Compound Model Tuning
4. Results and challenges

4.1. Results

After the successful completion of the tuning, YOLOv8n was selected with the hyperparameters defined in Table 4.

TABLE 4. Finalised Hyper-parameters

Hyperparameter       Value
Imgsz (training)     240
Optimiser            RAdam
Degrees              15
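A training call with these finalised values might look like the sketch below. It reuses the assumed 'svhn.yaml' dataset file from section 2, and passes the RAdam optimiser and the rotation augmentation through the Ultralytics training arguments; it is an illustration, not the project's exact script.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="svhn.yaml",    # assumed dataset YAML from section 2
    epochs=20,
    batch=-1,            # automatic batch sizing, as in section 2.4
    imgsz=240,           # finalised training image size (the library may round
                         # this up to a multiple of the model stride)
    optimizer="RAdam",   # finalised optimiser
    degrees=15,          # finalised rotation augmentation range (plus/minus 15 degrees)
)
```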
be implemented and analysed for the enhancement in the
accuracy of the current systems in use.
This model gave the optimum results during the analysis
with the effective inference distance of 1.02 meters to 4.5
meters with an average latency of 32 milliseconds.
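The live-feed behaviour described in the hardware-implementation stage can be checked with a short loop such as the one below. It is a sketch using OpenCV and the Ultralytics API; the weights path and camera index are assumptions for illustration.

```python
import time

import cv2
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # assumed path to the tuned weights
cap = cv2.VideoCapture(0)                          # assumed camera index

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    start = time.time()
    results = model(frame, verbose=False)          # detect digits in the live frame
    latency_ms = (time.time() - start) * 1000.0
    print(f"latency: {latency_ms:.1f} ms, boxes detected: {len(results[0].boxes)}")

cap.release()
```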
4.2. Challenges

4.2.1. Configuration errors. Setting up the environment demands careful configuration. Neglecting to ensure compatibility among the versions of PyTorch, cuDNN, CUDA, and TensorFlow can lead to a chain of failures and potentially a malfunctioning setup. It is not recommended to install the latest version with the pip command, as this can result in compatibility issues or failure to recognise specific packages. To address this, it is advisable to follow the comprehensive compatibility list provided by TensorFlow to verify the versions.
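A quick way to check that the installed versions line up before training is to print what PyTorch was built against and compare it with the CUDA and cuDNN versions on the machine. This is a small sketch, not a substitute for the compatibility tables mentioned above.

```python
import torch

# Print the versions PyTorch was built with, to compare against the
# CUDA/cuDNN toolchain installed on the machine.
print("torch:", torch.__version__)
print("CUDA (built with):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
```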
6. Future work

6.0.1. Recognition. To address the inference issue related to the number '1', a potential solution is to create a custom image dataset by gathering images of the number '1' from various open sources and incorporating them into the training dataset, along with their corresponding bounding box information. This approach can be used to assess whether adding more number '1' images to the dataset resolves the issue, and whether it has any unintended consequences for the results of the other classes.

6.0.2. Model Fine Tuning. Much finer tuning of the models can be done by changing the parameters more gradually, resulting in a more extensive analysis.

6.0.3. Delivery Robots. After further fine-tuning, the trained algorithm can be implemented on actual robots, such as delivery robots, to check the practical implementation of the system. The fusion of GPS and recognition can be implemented and analysed for the enhancement in accuracy over the systems currently in use.

References

[1] P. Sermanet, S. Chintala, and Y. LeCun, "Convolutional neural networks applied to house numbers digit classification," Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), pp. 3288-3291, 2012.

[2] S. Chen, R. Almamlook, Y. Gu, and L. Wells, "Offline handwritten digits recognition using machine learning," European Conference on Computer Vision (ECCV), 2018.

[3] R. Dixit, R. Kushwah, and S. Pashine, "Handwritten digit recognition using machine and deep learning algorithms," International Journal of Computer Applications, vol. 176, pp. 27-33, Jul. 2020.

[4] A. Shrivastava, I. Jaggi, S. Gupta, and D. Gupta, "Handwritten digit recognition using machine learning: A review," 2019 2nd International Conference on Power Energy, Environment and Intelligent Control (PEEIC), pp. 322-326, 2019.

[5] R. Karakaya and S. Çakar, "Handwritten digit recognition using machine learning," Sakarya University Journal of Science, vol. 25, Oct. 2020.

[6] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao, "Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models," in Computer Vision, ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XII, pp. 644-661, Sep. 2018.

[7] S. Mascarenhas and M. Agarwal, "A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for image classification," in 2021 International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications (CENTCON), vol. 1, pp. 96-99, Nov. 2021.

[8] J.-a. Kim, J.-Y. Sung, and S.-H. Park, "Comparison of Faster-RCNN, YOLO, and SSD for real-time vehicle type recognition," 2020 IEEE International Conference on Consumer Electronics - Asia (ICCE-Asia), pp. 1-4, 2020.

[9] M. Shetty, "A review on deep learning object detection: YOLO vs SSD," International Journal of Advanced Research in Science, Communication and Technology (IJARSCT), vol. 5, 2021.

[10] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.