
Volume 9, Issue 1, January – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Attention-Based Automated Pallet Racking Damage Detection
Mujadded Al Rabbani Alif
Department of Computer Science
Huddersfield University
Huddersfield, HD1 3DH, United Kingdom

Abstract:- Pallet racking systems are shelves specifically intended to hold palletised items, and they are essential for the safe and effective handling of products in warehouses. These shelves are susceptible to damage from a variety of sources, including wear and tear and collisions, which can jeopardise their structural integrity and put workers and stored items at risk. It is critical to identify faulty pallet racking quickly to avoid mishaps, product loss, and interruptions to business operations. Pallet racking upkeep and routine inspections, however, can be expensive and prone to human error. This research study proposes Pallet-Net, a deep learning technique that employs an attention-based convolutional neural network (CNN) to automatically detect faulty pallet racking. The proposed technique uses attention mechanisms to concentrate on the damaged areas of a pallet racking image, making it easier to locate and identify damage. Pallet-Net precisely categorises the racking as either damaged or undamaged by learning the discriminative properties of these zones. Compared to previous studies, the proposed approach provides strong robustness and accuracy in locating and recognising damaged areas in pallet racking photos, obtaining 97.64% overall accuracy with 98% precision, 98% recall, and a 98% F1 score. Recent deep learning models such as the Vision Transformer (ViT) and Compact Convolutional Transformer (CCT) are also analysed and compared against the proposed architecture.

Keywords:- Pallet Racking Systems; Logistics; Material Handling; Structural Integrity; Deep Learning; Attention Mechanisms; Convolutional Neural Networks; Image Classification; Spatial Transformer Network; Vision Transformer; Compact Convolutional Transformer.

I. INTRODUCTION

Pallet racking refers to a system of storage racks specifically designed to organise and efficiently store goods in warehouses and storage facilities. It consists of vertical frames, horizontal beams, and various supporting components that create multiple levels of storage space, and it is pivotal in efficiently storing and organising goods. These systems provide vertical storage solutions that maximise space utilisation and enable easy access to stored items without requiring additional floor space. By providing an organised storage solution, pallet racking allows efficient inventory control, easy product identification, and streamlined picking operations, reducing time and effort. However, pallet racking systems are susceptible to damage over time due to various factors, including collisions, overloading, improper handling, wear and tear, incorrect installation or maintenance, and external forces. Accidental collisions with forklifts or other equipment can result in bending, distortion, or misalignment of structural components such as upright frames and horizontal beams. Exceeding the weight capacity of the racks can lead to structural strain, compromising stability and potentially causing collapse. Inadequate handling practices and improper placement or removal of pallets can exert excessive force on the racking system, resulting in impact damage. Wear and tear from continuous loading and unloading, environmental conditions, and friction can gradually weaken the rack's structural integrity, leading to rust, corrosion, or deterioration. Additionally, external forces like earthquakes, extreme weather conditions, or impacts from heavy objects can threaten the integrity of pallet racking systems [1], compromising their structural integrity and posing significant risks to personnel and stored goods. Detecting and addressing damaged pallet racking in a timely manner is essential to prevent accidents, minimise product loss, and ensure the smooth operation of warehouse logistics.

The conventional approach to identifying damaged pallet racking relies heavily on manual inspections carried out by trained personnel. During these inspections, the racking system is visually examined for indications of damage, such as bent components, cracks, or misalignments. While this technique serves as a starting point for detection, it has several shortcomings. First, manual inspections are time-consuming and demanding, particularly in large-scale warehouses or facilities with numerous racks, causing delays and disruptions to daily operations. Secondly, the subjectivity of visual assessments introduces the possibility of human error, leading to missed or misidentified damage. The interpretation of damage severity may also vary among individuals, further affecting the consistency of detection results. Furthermore, manual inspections may not effectively detect subtle signs of damage or potential structural weaknesses that could result in accidents or failures in the future. Additionally, these inspections offer limited quantitative data for analysing and tracking the overall health and condition of the racking system. In summary, the manual inspection approach for detecting damaged pallet racking lacks the efficiency, consistency, and comprehensive insight needed for effective maintenance and risk management.

IJISRT24JAN241 www.ijisrt.com 728


Recent advances in computer vision and machine perception using deep learning promise automated solutions in diverse areas, including healthcare [2,3], renewable energy [4], and industrial quality inspection [5]. In this research, we leverage such techniques to develop automated damage detection for warehouse pallet racking systems, critical but susceptible components of inventory storage. The field has witnessed dramatic progress through sophisticated deep neural network architectures such as convolutional neural networks (CNNs) [6] and recurrent neural networks (RNNs) [7], enabling unprecedented performance in computer vision, language processing, and speech recognition. Landmark CNN models, including VGGNet [8], ResNet [9], Inception [10], RCNN [11], and Fast RCNN [12], aided by expanding datasets and GPU computing, have achieved remarkable accuracy in classification, detection, and generative modelling of images. Novel deep approaches like YOLOv7 [13], GANs [14] and vision transformers [17] further extend these abilities. Building upon such advances, we propose a tailored CNN methodology employing visual attention to focus selectively on racking damage cues, learning highly discriminative representations and enabling precise automated identification. By pursuing robust, accurate, and computationally efficient perception, this research aims to promote safety and efficiency in the automated monitoring of warehouse storage environments.

This paper puts forward Pallet-Net, a novel computational method leveraging attention-driven convolutional neural networks (CNNs) to automatically classify warehouse pallet racking systems as damaged or undamaged. Explicitly focusing visual attention on areas indicative of damage facilitates precise localisation and identification of distorted, cracked, or misaligned rack structures from images. Our tailored CNN architecture, trained on pallet rack datasets, learns to extract highly discriminative damage characteristics, enabling reliable automated decisions on rack integrity. We extensively evaluate Pallet-Net against recent methods, including vision transformers and compact convolutional transformers. Experiments demonstrate state-of-the-art classification accuracy, with precision and recall exceeding 97% on held-out test data and computational efficiency amenable to real-time monitoring. By reliably automating visual assessments that currently require laborious manual inspection, this research promises significantly enhanced safety and reduced downtime and risk in modern warehouse environments. The results further inform the future incorporation of attention-based deep learning in related structural health monitoring applications.

The remainder of this paper is organised as follows. Section 2 discusses related work on object detection and classification using deep learning. Section 3 presents the methodology, including dataset collection, network architectures, the training process, and ensemble learning techniques. Section 4 describes the experimental setup, evaluation metrics, and results. Section 5 compares our proposed solution with other existing solutions, and Section 6 concludes the paper.

II. RELATED WORK

A. Pallet Racking Inspection Methods
Numerous studies have been conducted on detecting and evaluating damaged pallet racking systems. The conventional approach entails manual inspections by trained personnel who visually examine the racking systems for indications of damage, such as bent or distorted components, cracks, or misalignments. However, these inspections can be laborious, time-consuming, and susceptible to human error. To overcome these limitations, various automated inspection techniques have been researched. In a recent study, Hong-Hu Zhu et al. [18] examined the increasing usage of innovative sensing technologies in civil infrastructure and their advantages in construction, operation, maintenance, and upgrading. They highlighted various facets of innovative sensing technologies and their utilisation in civil infrastructure, such as innovative mechanisms and devices, on-site implementation, supporting technologies and methodologies, and real-life examples. In another recent research paper, Hussain et al. [19] presented a self-governing system for inspecting storage racks using the MobileNetV2-SSD architecture. The proposed system is intended for distribution centres, warehouses, and retail store facilities; it achieves a mean average precision of 92.7% and can extend its coverage to higher-level racking with the help of a forklift cage. The authors compiled the first racking dataset for this study based on actual pallet racking images from various operational warehouses. Furthermore, they plan to improve the solution by including several damage detection classes and collaborating with SEMA to develop a defect detection architecture.

Moreover, Chuan-Zi Dong et al. [20] provided an overview of computer vision–based structural health monitoring (CV-SHM) at local and global levels for element, crack, delamination, displacement, vibration, modal identification, load factor estimation, and structural identification. The authors described CV-SHM as an excellent complement to conventional SHM due to its advantages, such as non-contact measurements, long-distance data collection, low cost, and reduced labour with minimal interference or intrusion into the daily operation of structures. Hussain et al. [21] introduced a framework centred on the YOLOv7 architecture in a different study. The framework includes a domain variance modelling mechanism to address data scarcity, resulting in a mean average precision of 91.1%. This solution offers a non-invasive approach to defect detection that differs from conventional sensor-oriented methods and can potentially reduce client costs.

B. Object Detection and Classification with DL
Deep learning (DL) has shown remarkable success in various computer vision tasks, including object detection and classification. Numerous studies have explored the application of CNNs for accurate and efficient detection and classification of objects in images.

Techniques such as Faster R-CNN, YOLO (You Only Look Once), and SSD (Single Shot MultiBox Detector) have been widely adopted for object detection. Zaidi et al. [22] published a study on object detection methods in the modern world, which also covers contemporary lightweight classification models used on edge devices. The need for lightweight models that can be deployed on mobile and embedded systems is increasing, and the study shows how various object detectors have developed. Similarly, Liu et al. [23] review deep learning methods for detecting small objects in images, including challenges and solutions, practical techniques, and related research areas. The paper compares the performance of leading deep learning methods, including YOLOv3, Faster R-CNN, and SSD, on three large benchmark datasets of small objects. The experimental results show that while the detection accuracy of these deep learning methods on small objects was low, Faster R-CNN performed the best, with YOLOv3 a close second. Finally, Yang et al. [24] propose a real-time tiny-part defect detection method for manufacturing using deep learning algorithms. The authors establish a correlation model between the part system's detection capability coefficient and the conveyor's moving speed and propose a defect detection algorithm based on a single shot detector (SSD) network and deep learning. The paper also addresses the problem of missed detection using an industrial real-time detection platform and a missed detection algorithm based on intermediate variables. These methods leverage CNNs to extract image features and employ region proposal mechanisms or anchor-based approaches to identify object bounding boxes.

In the context of object classification, CNN architectures like AlexNet, ResNet, and EfficientNet have been widely used. These models leverage deep convolutional layers to capture hierarchical features and achieve high classification accuracy. Akinosho et al. [25] compare the performance of edge detection algorithms and deep convolutional neural networks (DCNNs). The authors analyse a dataset of 19 concrete images and compare the relative performance of six typical edge detection schemes and the AlexNet DCNN architecture in different modes. The edge detection methods accurately detected 53-79% of cracked pixels but produced residual noise in the final binary images, whereas the DCNN accurately labelled pictures with 99% accuracy and detected much finer cracks than the edge detection methods.

Similarly, Weimer et al. [26] explore the use of DCNNs for defect detection in industrial inspection instead of manually engineering features. According to the authors, DCNNs automatically generate powerful features through hierarchical learning strategies from massive training data with minimal human interaction. The proposed approach is tested on a dataset with 12 different classification categories of visual defects occurring on heavily textured backgrounds, with excellent results and low false alarm rates. Liu et al. [27] explore the application of robots in intelligent supply chains and digital logistics to perform efficient operations, energy conservation, and emission reduction in warehousing and sorting. The researchers established an image recognition model using a convolutional neural network (CNN) to identify and classify goods by simulating a human hand grasping an object.

C. Attention Mechanisms in CNNs
Attention mechanisms allow neural networks to emulate biological perception and cognition by selectively prioritising the most task-relevant visual information. This targeted focus on salient environmental cues drives enhanced efficiency and accuracy even where critical visual signatures occupy just a fraction of the full sensory space. Tang et al.'s [28] manufacturing damage classification framework first combined spatial attention with convolutional neural networks (CNNs) to achieve 93.3% accuracy, significantly improving on previous approaches lacking such selective computational focus. Follow-up research by Su et al. [29] validated complementary attention mechanisms for noise suppression and improved solar cell defect identification. In agricultural applications, Shahi et al. [30] integrated CNN features with attention-based modules to enable automated fruit classification as the first stage of precision harvesting. Collectively, these works presage automation across tedious, inconsistent manual structural monitoring tasks spanning the warehouse, manufacturing, solar, and agricultural sectors. Building upon these latest developments at the intersection of computational perception and selective focus, we propose an attention-driven CNN methodology to reliably detect hazardous pallet racking distortions in inventory storage environments. Our approach learns highly discriminative damage characteristics to match or enhance human visual assessments, while integrated attention filters out task-irrelevant cues. More broadly, research into such biomimetic selectivity and efficiency gains continues to advance a new generation of intelligent systems endowed with heightened situational awareness for reliable autonomous decision support. As these technologies fundamentally disrupt sectors centred upon human evaluation, policy and regulation must proactively address emerging societal impacts.

Although several image classification techniques have been utilised to classify damaged pallet racking, deep learning methods have recently gained significant attention. Attention mechanisms can be a valuable tool to improve the accuracy and robustness of these models. Recent deep learning models such as the Vision Transformer and the Compact Convolutional Transformer have shown potential in improving the speed and accuracy of image classification.

III. METHODOLOGY

A. Dataset
Effective machine learning relies on comprehensive and representative datasets encompassing real-world complexity; for our pallet-racking damage detection system, labelled images depicting various distortion types are needed to train computational models and quantify generalisability. We gathered on-site photos of bent beams, cracked uprights, misalignments and other visible defects from Tile Easy and Lamteks warehouses. As public pallet racking datasets remain unavailable, this collection provides an essential bootstrap capturing the noise, occlusion and variability that challenge unaided human assessments. While expanding sample diversity and quantity would further enhance robustness, these initial images enable the development of a rigorous methodology that

assesses authentic damage manifestations rather than simulated data.

Fig 1 Pallet Racking Dataset Samples: (A) Normal and (B) Damaged

Fig 2 The Effects of Data Augmentation on the Training and Test Images

Data collection leveraged an iPhone 8 12MP camera, selected for sensor fidelity matching our targeted Raspberry Pi edge deployment. Images simulate views from a forklift-mounted rack inspection system, withstanding volatility from motion, occlusion and variable lighting. A human operator holding the smartphone camera towards the storage racking emulates the precise positional dynamics and visual perspective of automated on-vehicle assessments. This contextual data gathering aims to furnish models with representations crucial for a smooth transition from the laboratory to materials handling environments. Additionally, mobile visual data promises scalability via crowdsourcing to rapidly expand sample diversity in future work.

Figure 1 exemplifies dataset diversity across undamaged and damaged pallet racking images. Healthy warehouse storage infrastructure constitutes relatively simple classification tasks targeting racking components against static backgrounds (Fig. 1A). However, distortion severity varies extensively among damaged samples (Fig. 1B), challenging human evaluation consistency, especially for subtle cases. The centre image depicts a rack leg crack that could easily elude unaided visual assessment, compared to the obvious deformation on the right. All samples embed environmental context, including occlusion, variable lighting and noise. Augmentation must, therefore, balance class distinction and resolution preservation with realistic domain complexity to enable effective model generalisation. Overall, these images capture the multi-scale damage phenomena, ambiguity and scene diversity demanding selective, context-aware computational focus - an ideal testbed to advance attention-based automated inspection.

By collecting and curating this initial dataset, we aim to provide a foundation for training and evaluating the proposed autonomous racking inspection mechanism using CNN models with attention. The dataset offers a diverse range of normal and damaged racking images, enabling the model to learn and generalise patterns associated with different racking conditions.

This preliminary dataset establishes an essential benchmark for developing and evaluating automated pallet-racking assessment systems using attention-focused computational perception. Despite sample size constraints, the images capture real-world diversity across damage modes and environmental variability. More broadly, benchmarking on authentic anomalies rather than simulated data should enhance model generalisation to the complexities of deployable structural monitoring.

B. Data Augmentation
Data augmentation enables the artificial expansion of limited training sets to enhance model generalisation - mimicking the diversity of real-world phenomena from limited samples. Popular techniques add noise or apply transformations like rotation while retaining core semantics. Such expanded sets curb overfitting, improve resilience to previously unseen inputs, and strengthen the mapping from images to damage phenotypes learned during training. We leverage Keras' [31] flexible ImageDataGenerator toolkit, which has become a vital utility across deep learning applications owing to its simplicity and built-in transforms. Although constrained generalisation demands eventually surpass synthetic expansion alone, augmentation grants valuable bootstrapping for developing rigorous defect detection from scarce racks lacking comprehensive historical assessments.

Effective automation requires resilience across damage modes, environments and operating conditions. We augment the pallet racking data with techniques including brightness adjustment, rotation, zooming and shear transformations (Fig. 2). This expanded, distorted sample diversity compels models to generalise rather than memorise, improving deployable decision-making amid complex warehouses far beyond the constrained training distributions.
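An augmentation pipeline of this kind can be configured with Keras' ImageDataGenerator roughly as follows. This is a sketch only: the specific ranges below are illustrative assumptions, not the settings reported in this paper.

```python
# Sketch of an augmentation pipeline using Keras' ImageDataGenerator.
# The numeric ranges are illustrative assumptions, not the paper's settings.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=True,             # feature-wise centring (Equation 2)
    featurewise_std_normalization=True,  # feature-wise normalisation (Equation 1)
    rotation_range=15,                   # random rotation, in degrees
    zoom_range=0.2,                      # random zoom in/out by up to 20%
    shear_range=10.0,                    # shear angle, in degrees
    brightness_range=(0.8, 1.2),         # random brightness adjustment
    width_shift_range=0.1,               # horizontal shift, fraction of width
    height_shift_range=0.1,              # vertical shift, fraction of height
    fill_mode="reflect",                 # fill gaps by reflecting adjacent pixels
)

# Feature-wise statistics must be computed on the training set first:
# datagen.fit(train_images)
# model.fit(datagen.flow(train_images, train_labels, batch_size=32), epochs=...)
```

Note that the feature-wise options only take effect after `datagen.fit(...)` has computed the dataset mean and standard deviation on the training images.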

 Feature-Wise Normalization:
Input consistency is crucial for model convergence. Feature-wise normalisation (Equation 1) rescales inputs to constrain variability, transforming dimensions to standard normal distributions with zero mean and unit variance. This harmonised representation attenuates the influence of noise and distortions, so computational focus targets the underlying damage morphology rather than incidental data properties. Ultimately, learning intrinsically invariant causal markers promises improved generalisation.

x_norm = (x − μ) / σ    (1)

where σ is the standard deviation, μ is the mean value of the feature throughout the dataset, and x is the input feature. Every feature dimension is normalised separately, guaranteeing that every feature has a mean of zero and a standard deviation of one. Applying feature-wise normalisation during data augmentation improves the model's capacity to tolerate differences in the distribution of input features, effectively reducing the effect of variations in brightness, contrast, or intensity between different photographs.

 Feature-Wise Centre:
Another efficient technique for augmenting data in deep learning to enhance model performance and generalisation is feature-wise centring. In this method, the mean value of each feature dimension is subtracted from the corresponding input feature, as defined by Equation (2).

x_centred = x − μ    (2)

where μ is the feature's average value over the dataset and x is the input feature. During data augmentation, feature-wise centring decreases the model's sensitivity to the mean value of features, lessening the impact of differences in brightness or intensity levels across samples [33]. By centring the features, the model can more effectively identify the relative differences and patterns linked to broken pallet racking, without being affected by overall shifts in the input data.

 Shearing:
Shearing is a widely used deep learning data augmentation technique that modifies input data geometrically. It involves skewing or tilting pictures along a certain axis to distort them. The shearing transformation is expressed by Equation (3).

[x_sheared, y_sheared]ᵀ = [[1, shear_factor], [0, 1]] [x, y]ᵀ    (3)

 Others:
We have included several widely used data augmentation strategies in addition to the ones described above.

 Zoom:
This augmentation applies a certain zoom factor to the supplied image. It trains the model to recognise and categorise broken pallet racking at various sizes and scales, mimicking the variation in object sizes and distances found in the real world.

 Rotation:
The supplied picture is transformed via rotation in this augmentation. This method improves the model's capacity to handle multiple viewing angles by helping it identify faulty pallet racking from a variety of viewpoints or orientations. The rotation is computed with the matrix in Equation (4).

H = [[cos θ, −sin θ], [sin θ, cos θ]]    (4)

 Brightness:
This augmentation adjusts the input image's brightness level. Brightness changes strengthen the model's resistance to various lighting scenarios and help guarantee correct classification even under fluctuating illumination.

 Width and Height Shift:
This augmentation shifts the picture horizontally or vertically. It trains the model to identify broken pallet racking even when it is only partially visible or positioned differently in the picture.

 Fill Mode (Reflect):
This augmentation takes care of any voids or regions left by previous augmentations, such as rotation or shifting. By reflecting the adjacent pixels, it fills in the empty pixels and keeps the image whole.

By combining these data augmentation methods, we are able to produce an enhanced dataset that depicts variations in damaged pallet racking. Furthermore, the enhanced dataset greatly increases the diversity and number of training samples, which aids the model's generalisation and classification performance [34].

Colourisation imparts limited semantic insight for structural damage classification, instead obstructing the perception of subtle depth or texture distortions with incidental hue variations. We deploy grayscale transformation, a technique shown by Li et al. [35] to improve dermatological anomaly detection models, to similarly enhance rack damage cognition. Eliminating RGB colour space dimensionality focuses computations exclusively on luminance-linked cues while enabling simplified model architectures.
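For concreteness, Equations (1)–(4) can be written directly in NumPy; the function names here are ours, introduced only for illustration.

```python
import numpy as np

def featurewise_normalize(x):
    # Equation (1): subtract the dataset mean and divide by the
    # standard deviation, independently for each feature dimension.
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return (x - mu) / sigma

def featurewise_center(x):
    # Equation (2): subtract the dataset mean per feature dimension.
    return x - x.mean(axis=0)

def shear_points(pts, shear_factor):
    # Equation (3): apply the shear matrix [[1, s], [0, 1]] to (x, y) points.
    H = np.array([[1.0, shear_factor],
                  [0.0, 1.0]])
    return pts @ H.T

def rotate_points(pts, theta):
    # Equation (4): apply the rotation matrix [[cos θ, -sin θ], [sin θ, cos θ]].
    H = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return pts @ H.T
```

In an image pipeline these matrices are applied to pixel coordinates (with interpolation and a fill mode for uncovered pixels); the point-wise form above shows only the geometry.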

Fig 3 Pallet-Net Architecture

Grayscale's single channel mitigates inter-channel correlation, which would otherwise impose ineffective representational constraints on convolutional filter learning. Attenuating colour information steers models towards crucial shape and morphology factors rather than superficial chromatic tendencies counterproductive to generalisable decisions. Overall, this restrictive representation learning approach filters out rack imaging noise to improve accuracy, exploiting intrinsic intensity patterns correlated with damage while discarding nuisance colour variation. More broadly, task-specific dimensionality reduction that isolates primary explanatory factors epitomises efficient biological perception for accelerated anomaly cognition. This bio-inspired sparsity simultaneously enhances model performance and computational efficiency, which is crucial for embedded structural monitoring.

The damaged pallet racking categorisation work has been conducted consistently using an image size of 112x112 pixels throughout our investigation. This calculated choice was made with a number of factors in mind in an effort to increase our model's precision and effectiveness. First, maintaining a constant picture size guaranteed consistency in the input data fed into the model for both training and inference. Because of this constancy, our model was able to acquire and derive significant characteristics from photographs of damaged pallet racking regardless of the images' initial size, and it also made comparing and analysing the various photographs in the dataset easier.

Additionally, the 112x112 pixel standard picture size contributed to a decrease in memory use and computational complexity [36]. Furthermore, the key characteristics and details of the damaged pallet racking were preserved at this size: it struck a compromise between lessening the computing load and collecting enough spatial information. Finally, this scale ensured the model could pick up on essential patterns, textures, and structural details connected to broken pallet racking without adding superfluous detail or excessive noise that might compromise categorisation accuracy.

Robust automation centres on harmonising model simplicity, efficiency and real-world performance. Greyscale conversion and augmentation synergistically filter pallet-racking data complexity down to the core factors explicating damage morphology. By eliminating incidental colour variation while exposing models to an expanded, distorted sample distribution, we steered computation towards intrinsic intensity patterns predictive of actionable rack defects. Meanwhile, consistent image resizing removes confusing variability that might inhibit convolutional filter convergence. Together, these techniques significantly enhanced classification accuracy by reducing the burden of memorisation and overfitting intrinsic to limited data. More broadly, such complexity reduction through domain knowledge infusion epitomises efficient biological perception - discarding sensory noise to amplify causal signatures. Our methodology thus demonstrates how even modest datasets can fuel deployable decision automation so long as data curation targets explanatory factors using time-tested bio-inspiration.

C. Detailed Description of the CNN Architectures
As a consequence of our study, a unique CNN architecture called Pallet-Net, with an integrated attention mechanism, was created with racking inspection in mind. This design efficiently separates pallet racking that is damaged from that which is not. We have also experimented with additional state-of-the-art deep learning architectures, namely a custom Compact Convolutional Transformer (CCT) and a custom Vision Transformer (ViT), based on current findings.

 Pallet-Net
Pallet-Net's architecture consists of three CNN connections operating in parallel, with distinct 3x3, 5x5, and 7x7 kernel sizes; the architecture is displayed in Figure 3. In each connection, the input goes through convolutional, batch normalisation, and max-pooling operations. The final feature maps obtained from the three connections are concatenated and then run through a dense layer with one unit and a SoftMax activation function.

Fig 4 Image Flattening into 1 Dimension in a Layer

Pallet-Net's concatenation technique served as the foundation for the attention mechanism, which generated attention weights for each feature map that the parallel connections created. As seen in Figure 4, the attention output that results from multiplying the attention weights by the
representation learning heuristics tailored to the application. concatenated feature maps is shaped into a 1-dimensional array
Our image standardisation, grayscale conversion and data via a flattened layer.
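As a minimal sketch of the preprocessing described above (grayscale conversion, 112x112 standardisation and intensity scaling), the following NumPy-only routine illustrates the idea. The BT.601 luma weights and the nearest-neighbour resize are illustrative stand-ins, not necessarily the exact operations used in our pipeline:

```python
import numpy as np

def preprocess(image, size=112):
    """Grayscale-convert and resize an RGB image to (size, size, 1) in [0, 1].
    Nearest-neighbour resizing keeps the sketch dependency-free; a production
    pipeline would typically use OpenCV or tf.image instead."""
    gray = image[..., :3] @ np.array([0.299, 0.587, 0.114])   # ITU-R BT.601 luma
    rows = np.arange(size) * gray.shape[0] // size            # sample source rows
    cols = np.arange(size) * gray.shape[1] // size            # sample source cols
    resized = gray[np.ix_(rows, cols)]                        # nearest-neighbour grid
    return (resized / 255.0).astype(np.float32)[..., None]    # add channel axis

# Example: a synthetic 3-channel "photo" of arbitrary resolution
frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
x = preprocess(frame)
print(x.shape)  # (112, 112, 1)
```

Whatever the original capture resolution, every image reaches the network in the same (112, 112, 1) shape the input layer expects.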

IJISRT24JAN241 www.ijisrt.com 733
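The concatenation, attention-weighting and flattening pipeline described for Pallet-Net can be traced numerically as follows. The branch outputs and the dense projection below are random placeholders with assumed shapes, not the trained model; the sketch only shows how the concatenated feature maps are reweighted and flattened:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative outputs of the three parallel branches (3x3, 5x5, 7x7 kernels),
# each assumed to yield a 28x28 map with 32 channels after pooling.
p1, p2, p3 = (rng.standard_normal((28, 28, 32)) for _ in range(3))

c = np.concatenate([p1, p2, p3], axis=-1)          # concatenate branches: 28x28x96
w_dense = rng.standard_normal((c.shape[-1], c.shape[-1])) * 0.01  # placeholder weights
w = softmax(c @ w_dense)                           # dense + softmax attention weights
o = c * w                                          # element-wise reweighting
f = o.reshape(-1)                                  # flatten for the dense layers

print(c.shape, f.shape)  # (28, 28, 96) (75264,)
```

The exact shape of the dense projection producing the attention weights is an assumption here; the point is that the attention output preserves the concatenated map's shape before flattening.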


Two fully connected layers with 512 and 256 neurons respectively complete Pallet-Net. Each layer makes use of batch normalisation and the ReLU activation function, which was chosen for its straightforward mathematical formulation, given in Equation (5).

f(x) = max(0, x) (5)

For the purpose of automatically identifying and categorising broken pallet racking, the Pallet-Net architecture with an integrated attention mechanism is a useful method. The model performs better because the attention mechanism creates attention weights for each feature map that the parallel connections produce. The accuracy and resilience of the model are greatly enhanced by the simultaneous convolutional connections and the attention method. The suggested model may be represented by Equations (6)-(12):

C = Concatenate(P1, P2, P3) (6)

W = Softmax(Dense(C)) (7)

O = C ⊙ W (8)

F = Flatten(O) (9)

H1 = Dense(512, ReLU)(F) (10)

H2 = Dense(256, ReLU)(H1) (11)

Y = Dense(n_classes, Softmax)(H2) (12)

Here, X represents the input layer of shape (image_size, image_size, 1); P1, P2 and P3 are the outputs of the three parallel connections; C is their concatenated output; W denotes the attention weights calculated using a dense layer with softmax activation; O is the attention output derived by element-wise multiplication of C and W; F is the flattened output of O; H1 and H2 are the first and second fully connected layers; and Y is the output layer with one neuron per class. This formulation provides a succinct mathematical depiction of the customised attention-based CNN model by symbolically representing the model's layers and processes. The following is a detailed description of Pallet-Net's architecture:

 Input Layer:
The input layer's shape is (112, 112, 1), so the model accepts grayscale images of size 112x112.

 Parallel Convolutional Layers:
The network has three convolutional layers arranged in parallel. Each layer has 32 filters, with kernel sizes of 3x3, 5x5 and 7x7 respectively. The ReLU activation function is applied to all layers. Every parallel connection has 2x2 max-pooling and batch normalisation layers.

 Concatenation:
This node concatenates the outputs from the three parallel connections.

 Attention Mechanism:
Using a dense layer of one unit with softmax activation, we apply an attention mechanism to the concatenated output to derive attention weights. The element-wise product of the concatenated output and the attention weights yields the final attention output. Equations (13) and (14) represent the attention mechanism.

E = Softmax((Q·Wq)(K·WK)^T / √dk) (13)

C = E(V·Wv) (14)

Here, the network or decoder's current state is represented by the query vector Q; the input or encoder states are represented by the set of key vectors K and the set of value vectors V; the attention matrix E holds the importance weights assigned to the input states; and the context vector C is calculated as the weighted sum of the value vectors. In this formula, the scaled dot product between the query and key vectors is passed through the Softmax function to obtain the attention matrix E, whose entries represent each input state's weight or relevance. The attention matrix is then multiplied by the values to give the context vector C.

 Flatten Layer:
The attention output is flattened so that it conforms to the input expected by the subsequent dense layers.

 Fully Connected Layers:
Pallet-Net has two fully connected dense layers that come after the convolutional layers and the attention component. The first layer has 512 units and the second 256 units, both with ReLU activation. For regularisation, a batch normalisation layer is also included in each layer.

 Output Layer:
To generate the output, the final layer is dense with three units and SoftMax activation.

 Custom Vision Transformer (ViT)
For benchmarking, a specially designed Vision Transformer (ViT) built on the transformer architecture was also trained. Transformers have demonstrated remarkable performance in a range of computer vision applications, such as image classification. The ViT model captures both local and global dependencies in the image while processing image patches efficiently using self-attention mechanisms. The design of the model is based on the transformer architecture described in [17]. Figure 5 illustrates the leading architecture of this concept, which is as follows:




Fig 5 Custom Vision Transformer architecture.

 Patch Embedding Layer:
The image is initially processed by a patch embedding layer. The main goals at this point are extracting 16x16 patches from the input image and applying a dense projection to each patch. Each patch also receives a class token and positional embedding, which help the model learn about the spatial and semantic elements of the image.

 Transformer Encoder Layer:
After that, the patch embedding layer's output is routed through several transformer layers. The model consists of eight Transformer layers, each with a multi-head self-attention mechanism and a feed-forward neural network. Thanks to the multi-head attention mechanism, the feed-forward neural network assists in obtaining higher-level information from the many image patches to which the model has assigned weights. Several dense layers with dropout rates of 0.3 and 0.2 follow the Transformer encoder layers. In the end, this yields a softmax classifier.

 Compact Convolutional Transformer (CCT)
A third model, a Custom Compact Convolutional Transformer (CCT), was developed for benchmarking. By using both local and global information, this model enhances feature extraction through the use of compact convolutions and the transformer architecture. Unlike the custom ViT model, it performs image classification via convolutional feature extraction followed by transformer-based processing. The Custom Compact Convolutional Transformer model was created based on a study report [37]. As seen in Figure 6, there are two main building blocks in the CCT model: Convolutional Tokenization, and Transformer with Sequence Pooling.

 Convolutional Tokenizer:
This block's job is to extract features from the supplied image. A series of convolutional layers with a kernel size of 3, a stride of 1, and a padding of 1 are used to accomplish this, followed by a pooling operation. The result of this method is a collection of patches, each of which represents a distinct area of the image. These patches are then passed to the Transformer Encoder block for further processing. The function may be expressed using Equation (15) [37].

x0 = MaxPool(ReLU(Conv2d(x))) (15)

Here x ∈ R^(H×W×C) is an image or feature map, where C is the number of channels, W the width, and H the height.

 Transformer with Sequence Pooling:
The Transformer Encoder, the initial component of this layer, attempts to comprehend the connections between the various patches that the Convolutional Tokenizer block extracts. It contains 64 projection dimensions, eight transformer layers, and four attention heads. A Multi-Head Attention mechanism is used in this block to help the model focus on particular regions of the image while taking the entire image into consideration. The output is then sent to Sequence Pooling, which pools across the token sequence using an attention-based methodology. This change results in a minor reduction in computation since fewer tokens are being transmitted.

A feed-forward network (MLP) is also used by the Transformer Encoder block to analyse the characteristics that the Multi-Head Attention mechanism has retrieved. The transformer encoder's transformer units are configured to 128. Finally, several FC layers are applied to the Transformer Encoder block's output in order to obtain the final categorisation. The FC layers comprise a dense layer with 512 units, another dense layer with 256 units, and a third dense layer with three units. A SoftMax activation function is used in the last layer to output the probability of each class.

IV. EXPERIMENTAL RESULTS

A. Experimental Setup
A laptop with an AMD Ryzen 9 5900HX CPU, 16 GB of DDR4 RAM, and an NVIDIA GeForce RTX 3070 GPU with 8 GB of GDDR6 memory was used for the research described in this paper. The Python programs were created using the Matplotlib [39], Pandas [40], and Keras [38] DL libraries.

B. Data Partition
It was essential to separate our dataset into three subsets for training, validation, and testing in order to train our model effectively. Initially, we designated 80% of our images for training, while the remaining 20% were kept exclusively for testing. However, in order to achieve the best accuracy and lower the chance of overfitting, we further divided our training set into two subgroups: eighty percent of the training images were used for the actual model training, while twenty percent formed the validation set. With this method, we were able to keep a close eye on our model's performance and modify our training regimen as needed. Table 1 shows the resulting number of images in each subset.

Table 1 Different Subsets of the Dataset
Data Samples
Training 836
Validation 238
Testing 127
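A generic index-based version of the 80/20 split with a further 80/20 validation carve-out might look as follows. Note that for 1201 samples it yields 769/192/240, which differs slightly from the counts in Table 1, so the exact sampling procedure used for our dataset is assumed rather than reproduced here:

```python
import numpy as np

def partition(n_samples, test_frac=0.2, val_frac=0.2, seed=42):
    """Shuffle indices, hold out test_frac for testing, then split the
    remaining pool into training and validation sets."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_test = int(n_samples * test_frac)
    test, pool = idx[:n_test], idx[n_test:]
    n_val = int(len(pool) * val_frac)
    val, train = pool[:n_val], pool[n_val:]
    return train, val, test

train, val, test = partition(1201)
print(len(train), len(val), len(test))  # 769 192 240
```

Splitting by shuffled indices rather than by file order guards against any ordering bias in the way the images were collected.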



C. Model Hyperparameters and Training Setup
Table 2 displays the hyperparameters used to train each model. Models were permitted up to 150 epochs of training. An Early Stopping method was included to avoid overfitting: it monitors the model's validation accuracy and halts training when the accuracy stops improving by at least 1e-4 for a continuous period of 15 epochs. Additionally, the best weights are restored during training, guaranteeing that the model weights from the epoch that performed best on the validation set are used for the final assessment.

Table 2 Standard Hyperparameters across all Models
Hyperparameter Name Hyperparameter Value
Batch Size 32
Learning Rate 0.001
Weight Decay 0.001
Optimizer Adam

To enhance optimisation and prevent overfitting, the TensorFlow library's CosineDecay function is used as the learning rate scheduler during training [41]. To find the ideal value, the learning rate is first set at 0.001 and then progressively reduced in small steps; sharp overshooting and a severe drop in accuracy were noted if the initial learning rate was greater than this value. This method ensures the model starts out with a high learning rate, allowing it to converge quickly, and then progressively lowers the learning rate over time to fine-tune the model. During training, the optimiser's learning rate is updated via the LearningRateScheduler callback.

D. Evaluation
In our experimental setup, we averaged the class-wise scores using the unweighted mean in order to assess the models' performance. We used a number of performance criteria that are widely accepted in the community to evaluate the efficacy of our strategy. The metrics listed below were used.

 Accuracy
This indicator, which shows the percentage of correctly identified cases, assesses how accurate the model's predictions are overall. This measure was computed using Equation (16).

Accuracy = (TP + TN) / (TP + FP + TN + FN) (16)

 F1-score
The model's equilibrium between recall and precision is gauged by the F1-score. Equation (17) is used to assess the model's accuracy in classifying both positive and negative events.

F1 = 2TP / (2TP + FP + FN) (17)

 Sensitivity (True Positive Rate)
The percentage of accurate positive predictions among all positive occurrences is known as sensitivity. Equation (18) is utilised to calculate the model's accuracy in identifying positive cases.

Sensitivity = TP / (TP + FN) (18)

 False Positive Rate (FPR)
The FPR quantifies the fraction of wrongly predicted positive instances out of all negative cases. It gauges the model's propensity to mistakenly identify negative situations as positive. Equation (19) represents the FPR formula.

FPR = FP / (FP + TN) (19)

We employ the community-standard performance criteria of Equations (16)-(19) [41] to assess the precision of our models in identifying cases of defective pallet racking, in terms of false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN). By using these indicators, we are able to compare our results with previous methods and gain a thorough, industry-accepted understanding of the performance of our model.

Fig 7 Pallet-Net, ViT and CCT: epochs vs training accuracy

Fig 8 Pallet-Net, ViT and CCT: epochs vs training loss
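The training controls above, cosine learning-rate decay from 0.001 and patience-based early stopping on validation accuracy, can be sketched framework-free as follows. The paper itself uses TensorFlow's CosineDecay and Keras callbacks; these helper functions only mirror that behaviour:

```python
import math

def cosine_decay(step, total_steps, initial_lr=1e-3):
    """Cosine-annealed learning rate from initial_lr down to zero."""
    t = min(step, total_steps) / total_steps
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * t))

def early_stop(val_accuracies, patience=15, min_delta=1e-4):
    """Return the epoch at which training halts: the first epoch where
    validation accuracy has not improved by min_delta for `patience`
    consecutive epochs, else the final epoch."""
    best, best_epoch = -float("inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best + min_delta:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_accuracies) - 1

print(round(cosine_decay(0, 150), 6))    # 0.001 at the first epoch
print(round(cosine_decay(150, 150), 6))  # 0.0 at the last epoch
```

In the real setup the weights from the best-scoring epoch would also be restored, which the sketch records implicitly via `best_epoch`.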



E. Results
We ran a number of tests to evaluate the effects of the suggested Pallet-Net. Table 3 shows the training duration, parameters, recall, F1-score, precision, and accuracy of our suggested design in comparison with well-known modern architectures, namely the Compact Convolutional Transformer (CCT) and Vision Transformer (ViT). Additionally, the models' training accuracies and training losses are shown in Figures 7 and 8.

Automated pallet racking classification proves non-trivial, with the Vision Transformer (ViT) architecture demonstrating limited accuracies of around 34% F1 despite low parametrisation and training costs. ViT's inability to aptly capture subtle damage morphology cues limits precision and recall alike. In contrast, the Compact Convolutional Transformer (CCT) better balances representational complexity and training efficiency, achieving improved 87% F1 classification performance at marginally higher resource overheads. CCT's embedded convolutional feature extraction likely accounts for enhanced localisation of damage signatures within the broader rack imagery context, improving positive and negative instance prediction consistency. Ultimately, our proposed attention-augmented Convolutional Neural Network (CNN) significantly outperforms both baseline approaches, reaching 98% F1-scores, by dedicating a majority of representational capacity towards hierarchical damage characteristics cognition. The additional parameters enable discerning highly complex and variable pallet-racking distortion topologies amid clutter. Our evaluations reaffirm target-specific selectivity as the cornerstone of efficient biological perception and intelligence, which is now gaining traction in biomimetic automated monitoring. Deliberate representation skewing towards explanatory factors, rather than blanket resource scaling, continues to drive innovation.

Fig 9 Confusion Matrix of Proposed Model

Table 3 Quantitative Examination and Comparative Analysis of Model Performance on the Test Dataset
Model Training Time Total Params F1 Score Recall Precision Accuracy
ViT 03m55s 296066 34% 49% 26% 52%
CCT 05m29s 240451 87% 87% 87% 87%
AttentionCNN 06m24s 154279331 98% 98% 98% 98%

In order to evaluate how well the suggested Pallet-Net model performed against the real damage categories, we also created the confusion matrix shown in Figure 9. The matrix is a 2x2 table showing the true positives and true negatives of the classification results alongside the false positives and false negatives. Correctly categorised data lie on the diagonal of the confusion matrix, and incorrectly classified data on the off-diagonal components. According to the confusion matrix, Pallet-Net correctly recognised 66 of the 67 actual damaged racking images as damaged, with just one damaged image misclassified as normal. Of the 60 actual normal racking images, Pallet-Net misidentified two as damaged but correctly identified the remaining 58 as normal. With an overall accuracy of 97.64%, the suggested model thus achieves a high level of accuracy overall.

Fig 10 Correctly Classified Cases and their Attention Heatmap via Grad-CAM

In Pallet-Net, we have used Gradient-weighted Class Activation Mapping (Grad-CAM) visualisations to identify the important areas of the input images in order to assess how well the feature extraction method worked. A Grad-CAM depiction of our architecture is shown in Figure 10. Our investigation shows that although Pallet-Net's attention mechanism successfully distinguishes between damaged and undamaged racking by identifying the critical structures of the racking, it occasionally focuses on the pallet region and other non-salient image regions, which could lead to incorrect classification. As a potential remedy, we suggest applying preprocessing methods, such as filtering out unnecessary or irrelevant areas, to improve Pallet-Net's accuracy.
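Equations (16)-(19) reduce to simple counting over the confusion matrix. Applied to the test-set counts reported above, and treating "damaged" as the positive class (an assumption about the labelling), they reproduce the stated accuracy:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, F1, sensitivity and FPR per Equations (16)-(19)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return accuracy, f1, sensitivity, fpr

# Confusion-matrix counts from Section IV-E: 66 damaged images classified
# correctly, 1 missed, 2 normal images flagged as damaged, 58 correct normals.
acc, f1, tpr, fpr = metrics(tp=66, fp=2, tn=58, fn=1)
print(f"{acc:.4f} {f1:.4f} {tpr:.4f} {fpr:.4f}")  # 0.9764 0.9778 0.9851 0.0333
```

The 0.9764 figure matches the 97.64% overall accuracy quoted for the 127-image test set.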

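Grad-CAM's core computation, weighting the last convolutional activations by their spatially pooled gradients, can be sketched independently of any framework. The activations and gradients below are random placeholders; in practice they would come from the trained Pallet-Net via automatic differentiation:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from a conv layer's activations (H, W, K)
    and the gradients of the class score with respect to them."""
    weights = gradients.mean(axis=(0, 1))                    # global-average-pool grads
    cam = np.maximum((activations * weights).sum(-1), 0.0)   # weighted sum + ReLU
    return cam / cam.max() if cam.max() > 0 else cam         # normalise to [0, 1]

rng = np.random.default_rng(1)
acts = rng.standard_normal((14, 14, 32))    # placeholder feature maps
grads = rng.standard_normal((14, 14, 32))   # placeholder class-score gradients
heatmap = grad_cam(acts, grads)
print(heatmap.shape, float(heatmap.max()))  # (14, 14) 1.0
```

Upsampled to the 112x112 input, such a heatmap highlights the regions driving the prediction, which is how the pallet-region fixation noted above was diagnosed.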


It is clear from the technical evaluation that Pallet-Net performs better in terms of classification accuracy than the ViT and CCT models. The models with more parameters and longer training sessions achieve greater performance because they develop a deeper comprehension of the intricate linkages present in the damaged pallet racking images.

V. SOLUTION COMPARISON

Automated analytical workflows aim to balance accuracy, efficiency and accessibility for real-world damage monitoring integration. Farahnakian et al.'s [42] segmentation methodology demonstrates a leading 93.45% precision, but on highly constrained datasets given its intensive resource demands. Conversely, Hussain et al.'s [21] YOLOv7 detector attains 91.1% accuracy on thousands of samples, although at the cost of additional bounding box annotations and computations. Recent focus has shifted to streamlined classification via lightweight architectures [43], reaching 96% accuracy under reasonable resource profiles. However, despite higher expenses, MobileNet-powered detection [19] remains competitive, indicating scenarios where detail surmounts efficiency.

Our attention-based classifier achieves an accuracy of 97.63% on over a thousand pallet rack images lacking supplementary bounding boxes, setting new state-of-the-art performance. Compared to prevailing techniques, our methodology promises a pragmatic balance of damage cognisance, computational frugality and real-world validity for scalable rack monitoring autonomy. More broadly, the comparative analysis spotlights representational selectivity as the lingering bottleneck for pervasive intelligence. While sheer analytical muscle continues steadily improving, deliberate dimensionality pruning to amplify explanatory factors over superfluous imagery traits remains crucial but underexplored. As domains such as biomarker discovery already underscore, sparsity frequently surpasses scale for unravelling complex phenomena. Our evaluations reaffirm this motif: superior cognition arises from compact, causal models rather than indiscriminate resource intensification.

Table 4 Systematic Evaluation and Comparative Analysis with Prior Research in the Field
Research Domain Dataset Size Detector Accuracy
[43] Image Classification 1723 Custom CNN 96%
[42] Segmentation 75 Mask RCNN 93.45%
[19] Object Detection 19717 MobileNet 92.7%
[21] Object Detection 2094 YOLOv7 91.1%
Proposed Image Classification 1201 Attention CNN 97.63%

In summary, compared to other studies on automated racking inspection, Pallet-Net, the suggested attention-based CNN architecture, offers better accuracy and a simpler processing pipeline. It provides a more dependable and effective way to identify and categorise damage to pallet racking, allowing the warehouse sector to operate with more efficiency, lower costs, and higher safety.

VI. CONCLUSION

This research pioneers Pallet-Net, an attention-focused convolutional neural network (CNN) architecture achieving state-of-the-art automated pallet racking damage detection at 97.64% accuracy. We systematically enhance representation learning using grayscale conversion, image resizing and data augmentation that exposes models to real-world environmental complexity while steeping them specifically in damage morphology. Our tailored CNN then develops hierarchical damage characterisations amplified by integrated attention mechanisms highlighting spatial irregularities. Comprehensive evaluations against contemporary Vision Transformer and Compact Convolutional Transformer architectures reaffirm attention's efficacy for potent yet selective rack cognition. Pallet-Net promises efficient automation unattained by blanket computational scaling or human visual assessment alone. More broadly, it epitomises an awareness amplification motif gaining traction across biomedicine, manufacturing, and more: seemingly boundless societal challenges are increasingly yielding not to brute analytical force but to deliberate, causal representations distilling phenomena down to their essence. As datasets now expand worldwide, scalable intelligence will arise from carefully tuned filters revealing what truly matters.

This research thus pioneers automated pallet racking assessment via selective deep learning, surpassing constrained human visual scrutiny. Pallet-Net exemplifies augmented cognition, not brute analytical force alone, achieving previously unattained warehouse visibility. Our framework promises enhanced safety, efficiency and risk attenuation beyond current practice. We acknowledge sample size limitations, among other resource constraints typical of initial investigations. Ongoing efforts will enrich representations and explore modern architectures. Ultimately, damage detection applies the selectivity gaining prominence from healthcare to renewables. Embedded intelligence that amplifies the most explanatory cues in environments otherwise overwhelming human operators must emerge. As automation broadly displaces specialised operators and sensors, next-generation methods embedding extracted wisdom into key processes promise democratised situation awareness, benefiting society widely. Scalable and reliable intelligence resides in deliberate representations: the essence revealed matters more than the resources invested. Our research manifests this new paradigm centred on awareness rather than just analysis.



REFERENCES

[1]. C. Bernuzzi, M. Simoncelli, An advanced design procedure for the safe use of steel storage pallet racks in seismic zones, Thin-Walled Structures 109 (2016) 73–87.
[2]. M. Hussain, H. Al-Aqrabi, M. Munawar, R. Hill, S. Parkinson, Exudate regeneration for automated exudate detection in retinal fundus images, IEEE Access (2022). doi:10.1109/access.2022.3205738.
[3]. B. A. Aydin, M. Hussain, R. Hill, H. Al-Aqrabi, Domain modelling for a lightweight convolutional network focused on automated exudate detection in retinal fundus images, in: 2023 9th International Conference on Information Technology Trends (ITT), IEEE, 2023, pp. 145–150.
[4]. Zahid, M. Hussain, R. Hill, H. Al-Aqrabi, Lightweight convolutional network for automated photovoltaic defect detection, in: 2023 9th International Conference on Information Technology Trends (ITT), IEEE, 2023, pp. 133–138.
[5]. M. Hussain, H. Al-Aqrabi, M. Munawar, R. Hill, Feature mapping for rice leaf defect detection based on a custom convolutional architecture, Foods 11 (23) (2022) 3914. doi:10.3390/foods11233914.
[6]. K. O'Shea, R. Nash, An introduction to convolutional neural networks, arXiv preprint arXiv:1511.08458 (2015).
[7]. D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning internal representations by error propagation, Report, California Univ San Diego La Jolla Inst for Cognitive Science (1985).
[8]. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[9]. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[10]. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1–9.
[11]. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 580–587.
[12]. R. Girshick, Fast R-CNN, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[13]. C.-Y. Wang, A. Bochkovskiy, H.-Y. M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, arXiv preprint arXiv:2207.02696 (2022).
[14]. A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, A. A. Bharath, Generative adversarial networks: An overview, IEEE Signal Processing Magazine 35 (1) (2018) 53–65.
[15]. F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, Q. He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 (1) (2020) 43–76.
[16]. X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, Pre-trained models: Past, present and future, AI Open 2 (2021) 225–250.
[17]. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[18]. H.-H. Zhu, F. Dai, Z. Zhu, T. Guo, X.-W. Ye, Smart sensing technologies and their applications in civil infrastructures 2016 (2016).
[19]. M. Hussain, T. Chen, R. Hill, Moving toward smart manufacturing with an autonomous pallet racking inspection system based on MobileNetV2, Journal of Manufacturing and Materials Processing 6 (4) (2022) 75.
[20]. C.-Z. Dong, F. N. Catbas, A review of computer vision–based structural health monitoring at local and global levels, Structural Health Monitoring 20 (2) (2021) 692–743.
[21]. M. Hussain, H. Al-Aqrabi, M. Munawar, R. Hill, T. Alsboui, Domain feature mapping with YOLOv7 for automated edge-based pallet racking inspections, Sensors 22 (18) (2022) 6927.
[22]. S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, B. Lee, A survey of modern deep learning based object detection models, Digital Signal Processing (2022) 103514.
[23]. Y. Liu, P. Sun, N. Wergeles, Y. Shang, A survey and performance evaluation of deep learning methods for small object detection, Expert Systems with Applications 172 (2021) 114602.
[24]. J. Yang, S. Li, Z. Wang, G. Yang, Real-time tiny part defect detection system in manufacturing using deep learning, IEEE Access 7 (2019) 89278–89291.
[25]. T. D. Akinosho, L. O. Oyedele, M. Bilal, A. O. Ajayi, M. D. Delgado, O. O. Akinade, A. A. Ahmed, Deep learning in the construction industry: A review of present status and future innovations, Journal of Building Engineering 32 (2020) 101827.
[26]. D. Weimer, B. Scholz-Reiter, M. Shpitalni, Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection, CIRP Annals 65 (1) (2016) 417–420.
[27]. H. Liu, L. Zhou, J. Zhao, F. Wang, J. Yang, K. Liang, Z. Li, Deep-learning-based accurate identification of warehouse goods for robot picking operations, Sustainability 14 (13) (2022) 7781.
[28]. Z. Tang, E. Tian, Y. Wang, L. Wang, T. Yang, Nondestructive defect detection in castings by using spatial attention bilinear convolutional neural network, IEEE Transactions on Industrial Informatics 17 (1) (2020) 82–89.



[29]. B. Su, H. Chen, P. Chen, G. Bian, K. Liu, W. Liu,
Deep learning-based solar-cell manufacturing defect
detection with complementary attention network,
IEEE Transactions on Industrial Informatics 17 (6)
(2020) 4084–4095.
[30]. T. B. Shahi, C. Sitaula, A. Neupane, W. Guo, Fruit
classification using attention-based mobilenetv2 for
industrial applications, Plos one 17 (2) (2022)
e0264586.
[31]. F. Chollet, Building powerful image classification models using very little data, Keras Blog 5 (2016) 90–95.
[32]. D. Singh, B. Singh, Feature wise normalization: An
effective way of normalizing data, Pattern
Recognition 122 (2022) 108307.
[33]. Al-Sadi, A.-A. M. Hana’Al-Theiabat, M. Al-Ayyoub,
The inception team at vqa-med 2020: Pretrained vgg
with data augmentation for medical vqa and vqg, in:
CLEF (Working Notes), 2020.
[34]. Z. Hussain, F. Gimenez, D. Yi, D. Rubin, Differential
data augmentation techniques for medical imaging
classification tasks, in: AMIA annual symposium
proceedings, Vol. 2017, American Medical
Informatics Association, 2017, p. 979.
[35]. L.-F. Li, X. Wang, W.-J. Hu, N. N. Xiong, Y.-X. Du,
B.-S. Li, Deep learning in skin disease image
recognition: A review, IEEE Access 8 (2020) 208264–
208280.
[36]. M. A. R. Alif, S. Ahmed, M. A. Hasan, Isolated bangla
handwritten character recognition with convolutional
neural network, in: 2017 20th International conference
of computer and information technology (ICCIT),
IEEE, 2017, pp. 1–6.
[37]. A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, H. Shi, Escaping the big data paradigm with compact transformers, arXiv preprint arXiv:2104.05704 (2021).
[38]. N. Ketkar, N. Ketkar, Introduction to keras, Deep
learning with python: a hands-on introduction (2017)
97–111.
[39]. E. Bisong, Matplotlib and seaborn, Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners (2019) 151–165.
[40]. W. McKinney, pandas: a foundational python library
for data analysis and statistics, Python for high
performance and scientific computing 14 (9) (2011)
1–9.
[41]. D. So, Q. Le, C. Liang, The evolved transformer, in:
International conference on machine learning, PMLR,
2019, pp. 5877–5886.
[42]. F. Farahnakian, L. Koivunen, T. Mäkilä, J. Heikkonen, Towards autonomous industrial warehouse inspection, in: 2021 26th International Conference on Automation and Computing (ICAC), IEEE, 2021, pp. 1–6.
[43]. M. Hussain, R. Hill, Custom lightweight
convolutional neural network architecture for
automated detection of damaged pallet racking in
warehousing & distribution centers, IEEE Access
(2023).

