
Received 9 July 2024, accepted 29 July 2024, date of publication 5 August 2024, date of current version 20 August 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3438112

MixSegNet: A Novel Crack Segmentation Network Combining CNN and Transformer

YANG ZHOU¹, RAZA ALI² (Senior Member, IEEE), NORRIMA MOKHTAR¹, SULAIMAN WADI HARUN¹, AND MASAHIRO IWAHASHI³ (Senior Member, IEEE)
¹Department of Electrical Engineering, Faculty of Engineering, Universiti Malaya, Kuala Lumpur 50603, Malaysia
²Department of Electrical Engineering, Faculty of Information and Communication Technology (FICT), Balochistan University of Information Technology, Engineering and Management Sciences (BUITEMS), Quetta 87300, Pakistan
³Department of Electrical, Electronics and Information Engineering, Nagaoka University of Technology, Nagaoka 940-2188, Japan

Corresponding author: Norrima Mokhtar ([email protected])


This work was supported in part by JSPS KAKENHI under Grant 24K02975.

ABSTRACT In the domain of road inspection and structural health monitoring, precise crack identification
and segmentation are essential for structural safety and disaster prediction. Traditional image processing
technologies encounter difficulties in detecting cracks due to their morphological diversity and complex
background noise. This results in low detection accuracy and poor generalization. To overcome these
challenges, this paper introduces MixSegNet, a novel deep learning model that enhances crack recognition
and segmentation by integrating multi-scale features and deep feature learning. MixSegNet integrates
convolutional neural networks (CNNs) and transformer architectures to enhance the detection of small cracks
through the extraction and fusion of fine-grained features. Comparative evaluations against mainstream
models, including LRASPP, U-Net, Deeplabv3, Swin-UNet, AttuNet, and FCN, demonstrate that MixSegNet
achieves superior performance on open-source datasets. Specifically, the model achieved a precision of
95.2%, a recall of 88.2%, an F1 score of 91.5%, and a mean intersection over union (mIoU) of 84.8%,
thereby demonstrating its effectiveness and reliability for crack segmentation tasks.

INDEX TERMS Crack segmentation network, crack images, convolutional neural network, transformer
model, image processing, deep learning, self-attention mechanism.

© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.

I. INTRODUCTION
Crack identification occupies a vital position in the field of structural health monitoring because it is directly related to the safety and reliability of building structures. With the development of technology, crack detection methods have gradually transformed from traditional manual inspection to automatic identification using modern technologies such as advanced image processing, artificial intelligence, and machine learning [1]. These methods not only improve identification accuracy and efficiency, but also enable potential structural problems to be discovered at an early stage, enabling preventive maintenance and extending the life of the building. Cracks may be caused by a variety of factors, including structural aging, environmental erosion, excessive loads, and natural disasters. If these cracks are not discovered and treated in time, they may lead to a decrease in structural performance and even threaten personnel safety. Therefore, developing effective crack detection and identification systems is crucial to ensure structural safety.

Traditional crack detection methods, such as threshold techniques [2], demonstrate limited adaptability. To address this, Yang et al. [3] introduced a novel approach utilizing a fully convolutional network (FCN), enhancing the detection process. This technique employs single-pixel-width skeletons for crack segmentation, allowing for the detailed analysis of crack features, such as topology, length, and width, thus offering critical indicators for practical assessments. However, the scarcity of training data for crack segmentation presents a challenge. In response, König et al. [4] developed a method to streamline the annotation process for semantic segmentation of surface cracks. They utilized a U-Net architecture based on a fully convolutional
network, optimized for small datasets through patch-based training, leading to unprecedented results on various crack datasets. Ren et al. [5] explored the application of deep fully convolutional networks for concrete crack detection in tunnel images, proposing CrackSegNet, an advanced network for comprehensive crack segmentation. This innovation improves feature extraction, aggregation, and resolution reconstruction, significantly boosting segmentation performance. Kang et al. [6] introduced an automated method combining Faster R-CNN and a modified TuFF algorithm for precise crack detection, localization, and quantification, overcoming the limitations posed by varying environmental conditions. Similarly, Lau et al. [7] applied convolutional neural networks to segmenting pavement crack images, marking a significant advancement in the field. Liu et al. [8] proposed a two-step convolutional neural network method for enhanced crack detection and segmentation. Following this, Guan et al. [9] aimed to refine the accuracy and speed of 3D crack segmentation models, pushing the boundaries of current methodologies. Ali et al. [10] proposed an additive attention gate-based network architecture called Crack Segmentation Network-II (CSN-II).

A. RESEARCH GAP
Recent research has led to further improvements in various aspects of crack segmentation. Wang et al. [11] introduced a lightweight crack segmentation network based on knowledge distillation. Liu et al. [12] presented an upgraded CrackFormer network for pavement crack segmentation; this network achieved higher accuracy with fewer floating-point operations (FLOPs) and parameters than previous methods. Wu et al. [13] developed a lightweight MobileNetV2-DeepLabV3 network for enhanced precision in dam crack width measurement. Yao et al. [14] developed a CrackResU-Net model with a pyramid region attention module for pixel-level pavement crack recognition. Lin et al. [15] proposed DeepCrackAT, a framework for crack segmentation based on learning multi-scale crack features. Tang et al. [16] introduced a novel lightweight concrete crack segmentation method based on DeeplabV3+ that reduces the number of model parameters and enhances segmentation accuracy. Chen et al. [17] introduced a dynamic semantic segmentation algorithm with an encoder-crossor-decoder structure for pixel-level building crack segmentation. Li et al. [18] concentrated on crack segmentation in asphalt pavement using an enhanced YOLOv5s model. Moreover, Sohaib et al. [19] proposed an ensemble approach for robust automated crack detection and segmentation in concrete structures, achieving high precision and a high intersection over union score. Collectively, these studies contribute to the advancement of crack segmentation algorithms, addressing various challenges and improving the accuracy and efficiency of crack detection and segmentation. However, the aforementioned models fail to fully leverage the respective strengths of CNN and Transformer. Therefore, we propose the MixSegNet model as a means of enhancing the accuracy of crack segmentation.

II. RELATED WORK
A. SEMANTIC SEGMENTATION
Semantic segmentation is a further refinement of the classification problem: it requires pixel-level classification and therefore places higher demands on architectures and algorithms. Semantic segmentation is now widely used across computer vision, with applications in satellite imagery [20], medicine [21], material science [22], and meteorology [23], which underscores how crucial the technology is. FCN (Fully Convolutional Networks) [24], first proposed by Jonathan Long, Evan Shelhamer, and Trevor Darrell in 2015, aims to classify each pixel in an image into the corresponding category. The core idea of FCN is to replace the fully connected layers of a traditional convolutional neural network with convolutional layers, so that the network can accept input images of any size and output a spatial map of corresponding size. This spatial map can be applied directly to pixel-level prediction tasks.
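To make this idea concrete, the following minimal PyTorch sketch (our illustration, not code from [24]; the layer sizes and the name TinyFCN are arbitrary) shows how a 1×1 convolutional classifier head yields a per-pixel class map whose spatial size follows the input:

```python
import torch
import torch.nn as nn

# Minimal sketch of the core FCN idea: an all-convolutional network has no
# fully connected layer, so any input size yields a class map of matching
# spatial layout. Layer sizes here are illustrative, not from the paper.
class TinyFCN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # The 1x1 convolution plays the role of the per-pixel classifier that
        # replaces the fully connected layer of a classification CNN.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.backbone(x))

# Works for any spatial size: a 3x256x256 image gives a num_classes x 256 x 256 map.
logits = TinyFCN()(torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 2, 256, 256])
```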
U-Net [25], a deep learning model specifically designed for medical image segmentation, was initially introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015. The architectural design of U-Net is particularly well-suited to tasks that require high-precision localization, such as the segmentation of organs and tissues within medical imagery. The model is named for its unique "U"-shaped structure, which effectively combines shallow (high-resolution) features and deep (high-level semantic) features to improve segmentation accuracy. LRASPP (Lite Reduced Atrous Spatial Pyramid Pooling) [26] is a deep learning architecture optimized for mobile devices and edge computing, especially for semantic segmentation tasks. It is an improved and simplified design based on the original ASPP (Atrous Spatial Pyramid Pooling) and the DeepLab series of models. ASPP captures multi-scale information by using different dilation rates in parallel convolutional layers, thereby improving the model's ability to understand different areas of the image; LRASPP aims to reduce the computational complexity and number of parameters to suit environments with limited computing resources. DeepLabv3 [27] is an advanced deep learning architecture designed specifically for image semantic segmentation tasks. It is the third version of the DeepLab series, developed by Liang-Chieh Chen and others, and aims to further improve segmentation accuracy in complex image scenes. The core contributions of DeepLabv3 include an improved atrous spatial pyramid pooling (ASPP) module and the systematic application of atrous convolution. These features enable the model to effectively capture multi-scale information and handle objects of different sizes in images. AttuNet [28] is a recently proposed semantic segmentation architecture. It is an improved version of U-Net. It better

integrates shallow and deep semantic information through a special attention module.

B. ATTENTION MECHANISM
As Transformers [29], based on self-attention mechanisms, have gained ground in NLP in recent years, more and more researchers have tried to use Transformers in vision models. The ViT (Vision Transformer) [30] model is a deep learning model based on the Transformer architecture, specially designed for image recognition tasks. It was originally proposed in 2020 by Alexey Dosovitskiy and others at Google Research. By partitioning images directly into serially arranged patches and subsequently processing these serialized image patches with Transformers, ViT has demonstrated performance that matches or even surpasses the state-of-the-art on multiple image recognition tasks, providing results comparable to those of convolutional neural network (CNN) models. However, the original ViT cannot be used directly for semantic segmentation; it can only be used for classification tasks. Subsequently, many variants based on the ViT model have been proposed for semantic segmentation.
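The patch-partitioning step ViT performs can be illustrated with a short PyTorch sketch (our illustration under common ViT defaults of 16×16 patches and a 768-dimensional embedding; not code from the original paper):

```python
import torch
import torch.nn as nn

# Illustrative sketch: ViT partitions an image into fixed-size patches and
# linearly embeds each one, turning the image into a token sequence that a
# standard Transformer encoder can process.
class PatchEmbedding(nn.Module):
    def __init__(self, patch: int = 16, in_ch: int = 3, dim: int = 768):
        super().__init__()
        # A strided convolution is the usual trick: one kernel application
        # per patch is exactly a linear projection of that patch.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                     # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768]) -> 14x14 patches of 16x16 pixels
```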
C. SEGMENTATION BASED ON ATTENTION MECHANISM
The SETR (Semantic Segmentation Transformer) [31] model is a deep learning model specially designed for semantic segmentation tasks. It combines the powerful capabilities of Transformer with the advantages of traditional semantic segmentation methods. SETR was originally proposed by Zheng et al. in 2020. Its core idea is to apply Transformer to the global feature extraction of images and classify directly at the pixel level, an idea borrowed from the NLP field; it marked the first time Transformers were widely applied to semantic segmentation tasks in computer vision. Swin Transformer [32] is a deep learning model designed on the Transformer architecture and optimized for computer vision tasks. It was proposed by Ze Liu et al. of Microsoft Research in 2021. The core innovation of Swin Transformer is the introduction of a hierarchical Transformer structure, which effectively manages global dependencies and computational complexity in images by using variable window sizes, allowing the model to process large-scale image data more efficiently. SegFormer [33] is an advanced deep learning model designed for semantic segmentation tasks, which combines the power of Transformer with the efficiency of convolutional neural networks (CNNs). Introduced by Xie et al. in 2021, SegFormer achieves precise segmentation of objects of varying sizes within images by incorporating a lightweight Transformer encoder and an efficient multi-scale feature fusion strategy, all while preserving the model's efficiency and adaptability. The Swin-UNet [34] model is a deep learning model that combines the characteristics of the Swin Transformer and U-Net architectures. It is specially designed for fine segmentation tasks such as medical image segmentation. It was proposed by Hao Chen et al. and aims to utilize the hierarchical self-attention mechanism of Swin Transformer to capture the details and contextual information of the image, while achieving high-precision pixel-level segmentation through the encoder-decoder structure of U-Net. Through this combination, Swin-UNet aims to improve the model's ability to understand details and structures in complex images such as medical images, thereby improving the accuracy and efficiency of segmentation.

The models described above are either based on convolutional neural networks (CNNs) or transformers, or combine the advantages of both. However, in the context of crack segmentation, where fine segmentation of the scene and the ability to deal with a variety of background noise, environmental impacts, and other factors are paramount, the aforementioned models are not optimal. Consequently, this research paper proposes the MixSegNet model as a solution to address the shortcomings of existing models on more complex segmented images. In summary, the main contributions of this article include: (1) We adopt a structure similar to U-Net, using the innovative UC Block module to obtain more detail while increasing the receptive field, and enhancing it through the proposed multi-scale fusion module for crack segmentation (Fuse Block). Ablation experiments show that all the proposed modules, including the parallel CNN and Transformer architecture, help the model combine multi-scale features more effectively and generate more accurate crack segmentation masks. (2) The developed model demonstrates satisfactory segmentation accuracy on a benchmark dataset (cracks-APCGAN [28]).

III. METHODOLOGY
In our research, the MixSegNet crack segmentation model uses two major deep learning technologies, the convolutional neural network (CNN) and the Transformer, to take full advantage of their respective strengths and achieve highly accurate crack image segmentation. CNN is a powerful deep learning tool specifically designed to process data with a grid structure (such as images). In MixSegNet, we use a CNN to extract local and low-level features from images, taking advantage of its excellent spatial feature extraction capabilities. The advantage of CNN is that it can automatically learn basic features such as edges and textures through convolutional layers, and capture more complex image features through deep network structures, providing a solid foundation for accurate crack segmentation. Transformer technology is based on the self-attention mechanism, can process sequence data, and is particularly good at capturing long-range dependencies. In the MixSegNet model, we introduce the Transformer to complement the limitations of CNN, especially in understanding the global context of images and capturing long-range dependencies. The advantage of the Transformer is that it can dynamically weigh the importance of each part of the image through the self-attention mechanism, thereby better understanding the global structure of the image. This is particularly important for the crack segmentation task, where crack identification requires not only the accurate extraction of local features, but also a comprehensive understanding of their location and morphology in the entire image. By combining the benefits of CNN and Transformer, MixSegNet is able to simultaneously leverage the advantages of CNN in extracting powerful local features and of Transformer in understanding the global context. This combination not only improves the accuracy of crack segmentation, but also enhances the adaptability and generalisation of the model to different crack types and complex backgrounds. Our method enables an efficient exchange of information between CNN and Transformer through a carefully designed network structure, ensuring the model's excellent performance in the crack segmentation task.

FIGURE 1. The proposed MixSegNet framework.

A. MIXSEGNET
As seen in Figure 1, the overall architecture is divided into two parts, an Encoder and a Decoder. The Encoder is itself divided into three parts: the top part is the tandem CNN architecture, the middle part is the Fuse Block, and the bottom part is an architecture that uses part of the Swin Transformer [32] model (for more details, refer to the original paper). The Decoder is the tandem CNN module that accepts the output of the Fuse Block module, recovers the feature map step by step, and extracts the segmentation we need from it. Given a 3 × 256 × 256 image, we first pass it through the upper part of the Encoder (the consecutive blue blocks in the figure, which we call UC Blocks and explain in detail later) and the lower part at the same time, in parallel, so that we obtain hierarchical feature maps. The layered features from the CNN and the Transformer are then fused by the proposed Fuse Block module. The most important use of the Fuse Block is in fusing the features extracted by the Swin Transformer with the features extracted by the CNN branch. With this feature fusion technique, global and local features can be captured better, and segmentation accuracy can be improved. Finally, these layered features are passed into the successive Decoder stages (the consecutive purple blocks in the figure, which we call Cat Blocks and explain in detail later), which in turn recover the mask information of the image step by step.
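The dataflow just described can be summarised in the following schematic PyTorch sketch. Every module here is a stand-in (the names MixSegNetSketch, cnn_branch, swin_branch, fuse, and decoder are ours): the real UC Block, Swin Transformer stages, Fuse Block, and Cat Block decoder are detailed in the paper and the subsections below.

```python
import torch
import torch.nn as nn

# Schematic of the described dataflow with stand-in modules; this is not
# the actual MixSegNet implementation, only the parallel-branch skeleton.
class MixSegNetSketch(nn.Module):
    def __init__(self, ch: int = 64, num_classes: int = 1):
        super().__init__()
        self.cnn_branch = nn.Conv2d(3, ch, 3, padding=1)             # stands in for stacked UC Blocks
        self.swin_branch = nn.Conv2d(3, ch, 7, stride=1, padding=3)  # stands in for Swin Transformer stages
        self.fuse = nn.Conv2d(2 * ch, ch, 1)                         # stands in for the Fuse Block
        self.decoder = nn.Conv2d(ch, num_classes, 1)                 # stands in for the Cat Block decoder

    def forward(self, x):
        local_feats = self.cnn_branch(x)     # local, fine-grained features
        global_feats = self.swin_branch(x)   # global context features
        fused = self.fuse(torch.cat([local_feats, global_feats], dim=1))
        return self.decoder(fused)           # per-pixel crack mask logits

mask = MixSegNetSketch()(torch.randn(1, 3, 256, 256))
print(mask.shape)  # torch.Size([1, 1, 256, 256])
```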


FIGURE 2. UC block.

1) UC BLOCK
As shown in Figure 2, this is our proposed UC Block module. It differs from the previous approach of changing the image width and height through max pooling to obtain features at different levels: we first enlarge the input image to more than twice its size through an upsampling operation, and then obtain a larger receptive field through oversized convolutions (in this research, the size of the convolution is 1.5 times the size of the image input to the UC Block, and the dilation rate is 6). Although this incurs a computational burden, we reduce it by using a dilated depthwise separable convolution. At the same time, we retain the conventional max-pooling path to acquire detailed features. With the UC Block, we therefore acquire feature maps with larger receptive fields and detailed information.

The formula for upsampling using nearest-neighbor interpolation is given in Equation 1:

U(x, y) = F(⌊x/s⌋, ⌊y/s⌋)    (1)

U(x, y) represents the value at coordinates (x, y) in the upsampled feature map, F is the original feature map, s is the scaling factor for upsampling, and ⌊·⌋ denotes the floor function. This formula indicates that for each point (x, y) in the output feature map, we find the value in the original feature map F at the point closest to (x/s, y/s) and use it as the new pixel value.

Dilated depthwise separable convolution consists of two main steps. The dilated convolution is calculated as in Equation 2:

G[i, j] = Σ_{k,l} F[i + r·k, j + r·l] · K[k, l]    (2)

G[i, j] is the output feature map, F is the input feature map, K is the convolutional kernel, r is the dilation rate (in this paper, we take the value 6), and i, j denote positions in the output feature map, while k, l denote positions in the convolutional kernel.

The depthwise separable convolution is calculated as in Equations 3 and 4:

H[i, j, m] = Σ_{k,l} G[i + k, j + l] · D_m[k, l]    (3)

O[i, j, n] = Σ_m H[i, j, m] · P_{mn}    (4)

H is the feature map after the depthwise convolution, D_m is the m-th depthwise convolutional kernel, O is the final output feature map, and P is the pointwise convolutional kernel.
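A hedged PyTorch sketch of these ingredients follows. The kernel size, channel counts, and the name UCBlockSketch are our own illustrative assumptions; the paper specifies only the dilation rate of 6 and a convolution roughly 1.5 times the input size. The depthwise convolution (groups equal to channels) realises Equations 2 and 3, and the 1×1 pointwise convolution realises Equation 4.

```python
import torch
import torch.nn as nn

# Sketch of the UC Block ingredients named above, under stated assumptions:
# nearest-neighbour upsampling (Eq. 1), a dilated depthwise convolution
# (Eqs. 2-3, dilation 6 as in the paper) followed by a 1x1 pointwise
# convolution (Eq. 4), plus the conventional max-pool detail branch.
class UCBlockSketch(nn.Module):
    def __init__(self, in_ch: int = 32, out_ch: int = 64, dilation: int = 6):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=7,
                                   padding=3 * dilation, dilation=dilation,
                                   groups=in_ch)   # one kernel per channel (Eq. 3)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # mixes channels (Eq. 4)
        self.pool = nn.MaxPool2d(kernel_size=2)    # detail branch from the original design
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        wide = self.pointwise(self.depthwise(self.up(x)))  # enlarged receptive field
        fine = self.proj(self.pool(x))                     # pooled detail features
        # Bring both branches to a common resolution before combining.
        wide = nn.functional.adaptive_avg_pool2d(wide, fine.shape[-2:])
        return wide + fine

out = UCBlockSketch()(torch.randn(1, 32, 64, 64))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```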
FIGURE 3. Cat Block. Fuse Block. Start Block. End Block.

2) CAT BLOCK
The Cat Block is shown in Figure 3. This module accepts inputs from two sources: on the one hand, the features fused by the Fuse Block, and on the other, the features processed by the UC Block. In this fusion module we first concatenate the two and then further fuse the feature maps with convolutional layers. In this way, we obtain feature maps that carry both the strengths of the CNN and the features from the Transformer, which can directly compute the dependency between any two positions in a sequence through the self-attention mechanism and is therefore efficient at capturing long-distance dependency information. The integration of this mechanism enhances the model's effectiveness in handling long-range dependencies, enabling it to capture more complex data patterns. Since crack segmentation is a dense prediction task, the Fuse Block module allows us to obtain feature maps with more detail while maintaining accuracy at large scales.
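As a rough sketch of this role (our illustration; the name CatBlockSketch and the channel sizes are assumptions), one Cat Block stage can be written as:

```python
import torch
import torch.nn as nn

# Hedged sketch of the Cat Block role described above: concatenate the
# Fuse Block output with the UC Block skip features, then let plain
# convolutions merge them while restoring resolution step by step.
class CatBlockSketch(nn.Module):
    def __init__(self, fuse_ch: int = 128, skip_ch: int = 64, out_ch: int = 64):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.merge = nn.Sequential(
            nn.Conv2d(fuse_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, fused, skip):
        fused = self.up(fused)               # step-by-step mask recovery
        x = torch.cat([fused, skip], dim=1)  # "cat" = channel concatenation
        return self.merge(x)

out = CatBlockSketch()(torch.randn(1, 128, 32, 32), torch.randn(1, 64, 64, 64))
print(out.shape)  # torch.Size([1, 64, 64, 64])
```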


B. MODEL TRAINING DETAILS
1) DATASETS
In this study, we have opted to utilize the secondary open-source dataset cracks-APCGAN [28], which has recently been supplemented with additional data from the DeepCrack [35] dataset. This choice was made in light of the open-source nature of the DeepCrack dataset, which we have found to be a valuable resource in our research. cracks-APCGAN was built with APCGAN-based data enhancement: the principle is to generate more similar images with a GAN trained on the training set and then further enhance the training dataset by manually annotating the GAN-generated images. The enhanced training dataset greatly benefited the training process in the original paper, so we chose cracks-APCGAN as our benchmark dataset.

2) LOSS FUNCTION
The loss function plays a crucial role in deep learning as a measure of the difference between the predicted and actual values of the model. During training, the main purpose of the loss function is to guide model learning and adjust the model parameters by minimising the loss value, making the model predictions more accurate. The loss function not only affects the efficiency and effectiveness of model training, but also determines whether the model can effectively learn the complex patterns and structures in the data. Therefore, choosing an appropriate loss function is crucial for the performance optimisation of deep learning models. Three common loss functions are listed below and analysed one by one.

BCE_loss = −(1/N) Σ_{i=1}^{N} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]    (5)

The binary cross-entropy (BCE) loss function is defined in Equation 5. N is the total number of samples, representing all the samples considered when computing the loss. y_i is the actual label of the i-th sample, which can be 0 or 1, representing the two categories in binary classification. p_i is the model's predicted probability that the i-th sample belongs to class 1, with a value between 0 and 1. log denotes the natural logarithm, used to measure the discrepancy between the predicted probabilities and the actual labels. This formula averages the discrepancies between the predicted probabilities and the actual labels across all samples to obtain the overall loss value.

WBCE_loss = −(1/N) Σ_{i=1}^{N} [ w_pos · y_i log(p_i) + w_neg · (1 − y_i) log(1 − p_i) ]    (6)

The weighted binary cross-entropy (WBCE) loss function [25] is defined in Equation 6. N, y_i, p_i, and log are as in Equation 5; w_pos and w_neg are the weights for the positive and negative classes, respectively, and are used to handle class imbalance by adjusting the loss contribution of each class. This formula computes the weighted average of the discrepancies between the predicted probabilities and the actual labels across all samples, thus obtaining the overall loss value.

Focal_loss = −(1/N) Σ_{i=1}^{N} [ α y_i (1 − p_i)^γ log(p_i) + (1 − α)(1 − y_i) p_i^γ log(1 − p_i) ]    (7)

The focal loss function [36] is defined in Equation 7. N, y_i, p_i, and log are as above. α is a weighting factor for the positive class, used to address class imbalance by weighting the importance of positive and negative examples differently; its value can be obtained by counting the proportion of positive and negative classes in the dataset. γ is the focusing parameter, a hyperparameter that adjusts the rate at which easy examples are down-weighted, allowing the model to focus more on hard, misclassified examples. (1 − p_i)^γ and p_i^γ are factors that adjust the contribution of each sample to the loss based on the prediction confidence; they reduce the loss for well-classified examples, thereby focusing the model's learning on hard examples. This formula reduces the loss contribution from easy examples and increases the influence of hard examples, improving model performance on difficult classification tasks by modulating the effect of each sample based on its prediction confidence and actual class.

In the case of crack segmentation, the default and most commonly used choice is the cross-entropy loss, applied pixel by pixel. This loss evaluates the class prediction for each pixel independently and averages over all pixels. However, it can be biased by an unbalanced dataset, causing the majority class to dominate. The weighted cross-entropy loss was introduced to overcome this problem when the dataset is unbalanced. As a further improvement, the focal loss technique was introduced by changing the structure of the cross-entropy loss: when the focal loss is applied to accurately classified samples, the scaling factor weights them down. This ensures that more difficult samples are emphasised and that pronounced imbalances do not bias the overall computation. Therefore, in this paper we have chosen the focal loss as the loss function.
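A direct transcription of Equation 7 into PyTorch looks as follows (a sketch; the α and γ values shown are common defaults, not hyperparameters reported in this paper):

```python
import torch

# Transcription of Equation 7; alpha and gamma here are the usual defaults.
def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """p: predicted probabilities in (0, 1); y: binary labels {0, 1}."""
    eps = 1e-7
    p = p.clamp(eps, 1.0 - eps)  # numerical safety for the logarithms
    pos = alpha * y * (1.0 - p) ** gamma * torch.log(p)
    neg = (1.0 - alpha) * (1.0 - y) * p ** gamma * torch.log(1.0 - p)
    return -(pos + neg).mean()

probs = torch.sigmoid(torch.randn(4, 1, 256, 256))   # e.g. model outputs
labels = torch.randint(0, 2, (4, 1, 256, 256)).float()
print(focal_loss(probs, labels))
```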

3) OPTIMIZER
In our crack segmentation task, we selected AdamW [37] as the optimisation algorithm to better adjust model weights and mitigate overfitting. The AdamW optimiser is a variant of the Adam optimiser that primarily improves model generalisation by modifying the weight decay strategy. Traditional L2 regularisation methods may not be effective in adaptive learning rate optimisation algorithms, as such algorithms automatically adjust the update step for each parameter, potentially conflicting with the goals of L2 regularisation. In contrast, AdamW decouples weight decay from the optimiser's adaptive learning rate adjustments, allowing weight decay to operate independently of the adaptive learning rate mechanism and thereby implementing regularisation more effectively.

In this study, we used an initial learning rate of 6 × 10⁻⁵, chosen on the basis of experience and the results of several experiments. This learning rate is intended to strike a balance between convergence speed and stability during training, avoiding excessively large update steps early in training that could prevent the model from settling on an optimal solution. The AdamW optimiser allows us to finely control the learning rate for each parameter in an adaptive manner while using weight decay to suppress overfitting, providing strong support for deep learning models in crack segmentation tasks.

Furthermore, AdamW's weight decay mechanism lets us manage model complexity more effectively and prevent overfitting, which is particularly important for tasks such as crack segmentation that require a high level of detail and precision. Considering both training efficiency and model generalisation capability, we are confident that the choice of the AdamW optimiser and its configuration provides optimal training results for our crack segmentation model.
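A minimal sketch of this setup follows; the weight-decay value is our assumption, as the paper reports only the initial learning rate:

```python
import torch

# Optimiser setup described above: AdamW with the stated initial learning
# rate of 6e-5. The weight_decay value is an illustrative assumption.
model = torch.nn.Conv2d(3, 1, 3, padding=1)  # placeholder for MixSegNet
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=1e-2)

# AdamW applies decay directly to the weights ("decoupled"), instead of
# folding an L2 penalty into the adaptive gradient as classic Adam + L2 would.
loss = model(torch.randn(1, 3, 64, 64)).mean()
loss.backward()
optimizer.step()
```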

IV. RESULTS AND DISCUSSION
A. EVALUATION METRICS
In Table 1, we first define the variables commonly used in the evaluation metrics. In crack segmentation, the commonly used evaluation indicators are as follows:

• Precision in crack segmentation assesses the ratio of correctly predicted positive areas to all areas predicted as positive, highlighting the accuracy of positive-class predictions. Precision is calculated as in Equation 8.

Precision = Σ T_p / (Σ T_p + Σ F_p)    (8)

• Recall in crack segmentation quantifies the fraction of true positive areas correctly identified, reflecting the model's sensitivity to actual positives. Recall is calculated as in Equation 9.

Recall = Σ T_p / (Σ T_p + Σ F_n)    (9)

• The F1 score in crack segmentation is the harmonic mean of precision and recall, balancing both metrics. It evaluates the model's accuracy and sensitivity, with higher values indicating better overall performance. The F1 score is calculated as in Equation 10.

F1 = (2 · Precision · Recall) / (Precision + Recall)    (10)

• mIoU (mean Intersection over Union) in semantic segmentation calculates the average IoU, i.e., the overlap versus total area of predicted and true regions, across all classes. It assesses the model's segmentation accuracy, with higher mIoU indicating better performance. The mIoU is calculated as in Equation 11.

mIoU = (1/C) Σ_{c=1}^{C} [ Σ T_p / (Σ T_p + Σ F_p + Σ F_n) ]    (11)

TABLE 1. Definitions in crack segmentation.
TABLE 1. Definitions in crack segmentation.
function and an adaptive learning strategy to address the
problem of category imbalance in the dataset. The focal
loss function can reduce the weight of easy to classify
samples, allowing the model to focus more on crack regions
that are difficult to segment. The adaptive learning strategy
B. RESULTS further optimises the training process and ensures the model’s
In this paper, we describe a novel deep learning model for performance in a variety of complex scenarios and conditions.
crack segmentation, MixSegNet, and provide an in-depth As illustrated in Figure 4, this study randomly selected
analysis of its performance. As shown in Table 2, MixSeg- seven images from diverse scenes and employed distinct
Net outperforms current mainstream segmentation models, segmentation models to illustrate the outcomes. The segmen-
including LRASPP, FCN, DeepLabV3, U-Net, AttuNet, tation outcomes depicted in the figure demonstrate that the
and Swin-UNet, in a number of key metrics. Specifically, results produced by the MixSegNet model align with those
MixSegNet attains a precision of 0.952, a recall of 0.882, presented in Table 2. Additionally, the MixSegNet model
an F1 score of 0.915, and a mean intersection over union exhibits consistent and continuous segmentation outcomes
(mIoU) scores of 0.848. These results show that MixSeg- when compared to other models. This consistency can be
Net has a significant leading performance on the crack attributed to the fact that MixSegNet integrates the strengths
segmentation task. When analysing these results in more of CNN and Transformer. A comparison of the details of
detail, we can see that MixSegNet is only slightly higher in the various models reveals that the MixSegNet model also
precision by 0.001 compared to U-Net, but the improvement maintains a leading level of detail processing, which is crucial
in recall is even more significant, being 0.040 higher than for the refined crack segmentation scene.
that of U-Net. This suggests that MixSegNet is able to In summary, MixSegNet’s innovative design and strategy
maintain a high level of detection accuracy while avoiding set a new performance benchmark for the crack segmentation
missing actual cracks. In addition, the F1 score, which is task. Its outstanding performance bodes well for the model’s
a reconciled average of precision and recall, also shows wide application and far-reaching impact on future crack
that MixSegNet outperforms all compared models on this segmentation. We are excited about MixSegNet’s ability to
metric, further demonstrating its superiority in correctly handle complex problems and look forward to seeing its
identifying and segmenting cracks. MixSegNet also performs performance in real-world applications.
well on the mIoU metric, outperforming the second highest
model, AttuNet, by 0.012, demonstrating better consistency C. DISCUSSION
and overall performance in the crack segmentation task. Table 3 shows the results of the ablation experiments
mIoU is an important metric for assessing the quality of performed on the MixSegNet model, where the contribution
a model’s segmentation, and its high value underscores of each part to the overall model performance is verified by
MixSegNet’s reliability and robustness in the crack detection incrementally adding UC Block and Transformer modules.

111542 VOLUME 12, 2024


C. DISCUSSION
TABLE 3. Ablation experiment.

Table 3 shows the results of the ablation experiments performed on the MixSegNet model, where the contribution of each part to the overall model performance is verified by incrementally adding the UC Block and Transformer modules. Ablation experiments are a method of assessing the importance of model components by removing or adding specific sections and observing changes in model performance. The results of these experiments are analysed in detail below. The base model serves as a frame of reference, with precision, recall, F1 score, and mean intersection over union (mIoU) of 0.931, 0.836, 0.881, and 0.813, respectively. This model already performs well on its own, providing a solid foundation for adding new modules. Adding the UC Block module to the base model improves all performance metrics: precision to 0.935, recall to 0.858, F1 score to 0.895, and mIoU to 0.827. This shows that the UC Block module plays a key role in improving crack segmentation performance; the significant increase in recall in particular suggests that the addition of the UC Block helps the model reduce missed crack detections. When the Transformer module is added to the base model, the recall rate improves from 0.836 to 0.863, showing the effectiveness of the Transformer module in capturing the global information of the crack image and understanding the relationship between the crack and the background. However, the precision decreased slightly to 0.928, which may be because the Transformer module's emphasis on global features leads to some local noise being misidentified. Nevertheless, the slight improvement in F1 and mIoU confirms the positive contribution of the Transformer module to the model. By combining the UC Block and Transformer modules into MixSegNet, the model improved significantly on all metrics: precision reached a maximum of 0.952, recall reached a maximum of 0.882, and the F1 score and mIoU reached 0.915 and 0.848, respectively. This all-round performance improvement fully demonstrates the positive impact of combining the UC Block and Transformer modules, especially on recall and mIoU, which reflects the model's efficiency in crack detection and accuracy in segmenting the crack region. The contribution of each component to the MixSegNet model was thus experimentally demonstrated: the UC Block module significantly improved the recall rate, showing that it is effective in avoiding missed crack detection, while the Transformer module enhances the global understanding of the model and also improves recall. When these two modules are combined, they work in synergy to significantly improve the overall performance of the model, especially in terms of precision and recall, enabling MixSegNet to excel in the field of crack segmentation. These results validate the effectiveness of our proposed model design and provide a powerful new tool for crack segmentation tasks.

Although the results indicate that MixSegNet has achieved a leading level of performance, it is important to note that the model combines the CNN and Transformer architectures, which inevitably increases the computational complexity. However, the focus of this study is on high-precision segmentation, and computational complexity is not the primary consideration. In future research, we will optimize the computational efficiency of the model and improve its real-time performance. The current work focuses on improving the accuracy of crack segmentation; subsequent work will apply it in real drone scenarios to realize real-time crack segmentation warnings using a drone and an onboard computer.

V. CONCLUSION
This research proposes an innovative crack segmentation model, MixSegNet, which represents a major breakthrough in the field of crack segmentation. MixSegNet not only improves the perceptual capability of the model, but also strengthens the capture of details and the maintenance of long-range dependencies by combining an innovative UC Block with a parallel CNN and Transformer design. This unique two-pronged approach effectively overcomes the limitations of previous single-architecture designs and achieves significant improvements in key performance metrics: 95.2% precision, 88.2% recall, 91.5% F1 score, and 84.8% mIoU. The performance advantages of MixSegNet are fully demonstrated by comparing it to the existing state-of-the-art models LRASPP, FCN, DeepLabV3, U-Net, AttuNet, and Swin-UNet. The model not only improves on all indices, but also shows better generalisation ability in the experiments, predicting its wide applicability and potential value in practical applications. In future research, we plan to extend the scope of application of MixSegNet, improve its generalisation ability, and verify its robustness by testing it on more diverse and complex datasets. At the same time, we will work on optimising the computational efficiency of the model to meet real-time processing requirements and applications in real industrial scenarios. We will also explore the potential of MixSegNet in cross-domain image segmentation tasks such as medical image analysis and remote sensing image processing. Improving the interpretability of the model and adapting it to small-sample learning environments, to maintain excellent performance in data-constrained situations, will also be a focus of our future work. With these efforts, we expect to open new avenues for research and practical applications of crack segmentation.

REFERENCES
[1] S. Zhou, C. Canchila, and W. Song, ''Deep learning-based crack segmentation for civil infrastructure: Data types, architectures, and benchmarked performance,'' Autom. Construct., vol. 146, Feb. 2023, Art. no. 104678.
[2] A. Akagic, E. Buza, S. Omanovic, and A. Karabegovic, ''Pavement crack detection using OTSU thresholding for image segmentation,'' in Proc. 41st Int. Conv. Inf. Commun. Technol., Electron. Microelectron. (MIPRO), Aug. 2018, pp. 1092–1097.
[3] X. Yang, H. Li, Y. Yu, X. Luo, T. Huang, and X. Yang, ''Automatic pixel-level crack detection and measurement using fully convolutional network,'' Comput.-Aided Civil Infrastruct. Eng., vol. 33, no. 12, pp. 1090–1109, Dec. 2018.
[4] J. König, M. David Jenkins, P. Barrie, M. Mannion, and G. Morison, ''A convolutional neural network for pavement surface crack segmentation using residual connections and attention gating,'' in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 1460–1464.
[5] Y. Ren, J. Huang, Z. Hong, W. Lu, J. Yin, L. Zou, and X. Shen, ''Image-based concrete crack detection in tunnels using deep fully convolutional networks,'' Construct. Building Mater., vol. 234, Feb. 2020, Art. no. 117367.
[6] D. Kang, S. S. Benipal, D. L. Gopal, and Y.-J. Cha, ''Hybrid pixel-level concrete crack segmentation and quantification across complex backgrounds using deep learning,'' Autom. Construct., vol. 118, Oct. 2020, Art. no. 103291.
[7] S. L. H. Lau, E. K. P. Chong, X. Yang, and X. Wang, ''Automated pavement crack segmentation using U-Net-based convolutional neural network,'' IEEE Access, vol. 8, pp. 114892–114899, 2020.
[8] J. Liu, X. Yang, S. Lau, X. Wang, S. Luo, V. C. Lee, and L. Ding, ''Automated pavement crack detection and segmentation based on two-step convolutional neural network,'' Comput.-Aided Civil Infrastruct. Eng., vol. 35, no. 11, pp. 1291–1305, Nov. 2020.
[9] J. Guan, X. Yang, L. Ding, X. Cheng, V. C. S. Lee, and C. Jin, ''Automated pixel-level pavement distress detection based on stereo vision and deep learning,'' Autom. Construct., vol. 129, Sep. 2021, Art. no. 103788.
[10] R. Ali, J. H. Chuah, M. S. A. Talip, N. Mokhtar, and M. A. Shoaib, ''Crack segmentation network using additive attention gate—CSN-II,'' Eng. Appl. Artif. Intell., vol. 114, Sep. 2022, Art. no. 105130.
[11] W. Wang, C. Su, G. Han, and H. Zhang, ''A lightweight crack segmentation network based on knowledge distillation,'' J. Building Eng., vol. 76, Oct. 2023, Art. no. 107200.
[12] H. Liu, J. Yang, X. Miao, C. Mertz, and H. Kong, ''CrackFormer network for pavement crack segmentation,'' IEEE Trans. Intell. Transp. Syst., vol. 1, no. 1, pp. 1–13, Aug. 2023.
[13] Z. Wu, Y. Tang, B. Hong, B. Liang, and Y. Liu, ''Enhanced precision in dam crack width measurement: Leveraging advanced lightweight network identification for pixel-level accuracy,'' Int. J. Intell. Syst., vol. 2023, pp. 1–16, Sep. 2023.
[14] H. Yao, Y. Liu, H. Lv, J. Huyan, Z. You, and Y. Hou, ''Encoder–decoder with pyramid region attention for pixel-level pavement crack recognition,'' Comput.-Aided Civil Infrastruct. Eng., vol. 39, no. 10, pp. 1490–1506, May 2024.
[15] Q. Lin, W. Li, X. Zheng, H. Fan, and Z. Li, ''DeepCrackAT: An effective crack segmentation framework based on learning multi-scale crack features,'' Eng. Appl. Artif. Intell., vol. 126, Nov. 2023, Art. no. 106876.
[16] C. Tang, S. Jiang, H. Li, D. Huang, X. Huang, and Y. Xiong, ''Lightweight concrete crack segmentation method based on Deeplabv3+,'' in Proc. 3rd Int. Conf. Comput. Vis. Pattern Anal., Aug. 2023, pp. 399–404.
[17] Y. Chen, S. Dong, B. Hu, Q. Liu, and Y. Qu, ''A dynamic semantic segmentation algorithm with encoder-crossor-decoder structure for pixel-level building cracks,'' Meas. Sci. Technol., vol. 35, no. 2, Feb. 2024, Art. no. 025139.
[18] Z. Li, C. Yin, and X. Zhang, ''Crack segmentation extraction and parameter calculation of asphalt pavement based on image processing,'' Sensors, vol. 23, no. 22, p. 9161, Nov. 2023.
[19] M. Sohaib, S. Jamil, and J.-M. Kim, ''An ensemble approach for robust automated crack detection and segmentation in concrete structures,'' Sensors, vol. 24, no. 1, p. 257, Jan. 2024.
[20] Y. A. Lumban-Gaol, A. Rizaldy, and A. Murtiyoso, ''Comparison of deep learning architectures for the semantic segmentation of slum areas from satellite images,'' in Proc. Int. Archives Photogramm., Remote Sens. Spatial Inf. Sci., 2023, pp. 1439–1444.


[21] E. Kot, Z. Krawczyk, K. Siwek, L. Królicki, and P. Czwarnowski, ''Deep learning-based framework for tumour detection and semantic segmentation,'' Bull. Polish Acad. Sci. Tech. Sci., vol. 69, Mar. 2021, Art. no. 136750.
[22] S. Agarwal, A. Sawant, M. Faisal, S. E. Copp, J. Reyes-Zacarias, Y.-R. Lin, and S. J. Zinkle, ''Application of a deep learning semantic segmentation model to helium bubbles and voids in nuclear materials,'' Eng. Appl. Artif. Intell., vol. 126, Nov. 2023, Art. no. 106747.
[23] L. Fan and C. Zhou, ''Cloud-to-ground and intra-cloud nowcasting lightning using a semantic segmentation deep learning network,'' Remote Sens., vol. 15, no. 20, p. 4981, Oct. 2023.
[24] J. Long, E. Shelhamer, and T. Darrell, ''Fully convolutional networks for semantic segmentation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[25] O. Ronneberger, P. Fischer, and T. Brox, ''U-Net: Convolutional networks for biomedical image segmentation,'' in Proc. 18th Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), 2015, pp. 234–241.
[26] A. Howard, M. Sandler, B. Chen, W. Wang, L.-C. Chen, M. Tan, G. Chu, V. Vasudevan, Y. Zhu, R. Pang, H. Adam, and Q. Le, ''Searching for MobileNetV3,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1314–1324.
[27] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, ''Rethinking atrous convolution for semantic image segmentation,'' 2017, arXiv:1706.05587.
[28] T. Zhang, D. Wang, A. Mullins, and Y. Lu, ''Integrated APC-GAN and AttuNet framework for automated pavement crack pixel-level segmentation: A new solution to small training datasets,'' IEEE Trans. Intell. Transp. Syst., vol. 24, no. 4, pp. 4474–4481, Apr. 2023.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, ''Attention is all you need,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–20.
[30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, ''An image is worth 16×16 words: Transformers for image recognition at scale,'' 2021, arXiv:2010.11929.
[31] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. S. Torr, and L. Zhang, ''Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6877–6886.
[32] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, ''Swin Transformer: Hierarchical vision transformer using shifted windows,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[33] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, ''SegFormer: Simple and efficient design for semantic segmentation with transformers,'' in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, Dec. 2021, pp. 12077–12090.
[34] H. Cao and Y. Wang, ''Swin-Unet: Unet-like pure transformer for medical image segmentation,'' in Computer Vision–ECCV Workshops. Cham, Switzerland: Springer, 2022, pp. 205–218.
[35] Y. Liu, J. Yao, X. Lu, R. Xie, and L. Li, ''DeepCrack: A deep hierarchical feature learning architecture for crack segmentation,'' Neurocomputing, vol. 338, pp. 139–153, Apr. 2019.
[36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, ''Focal loss for dense object detection,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2999–3007.
[37] I. Loshchilov and F. Hutter, ''Decoupled weight decay regularization,'' 2019, arXiv:1711.05101.

YANG ZHOU received the B.Eng. degree in automation from Hubei Engineering University. He is currently pursuing the M.Eng.Sc. degree with the Faculty of Engineering, Universiti Malaya, under the supervision of Dr. Mokhtar from the Department of Electrical Engineering. During his previous projects involving machine learning applications, he gained expertise in various computer vision and natural language processing models and techniques, such as convolutional neural networks (CNNs), transformers, model fine-tuning, and data augmentation. His research interests include computer vision, deep learning, and image segmentation, with a focus on crack segmentation using imbalanced data.

RAZA ALI (Senior Member, IEEE) received the B.S. degree in telecommunication engineering from Balochistan University of Information Technology, Engineering and Management Sciences (BUITEMS), Quetta, Pakistan, the M.S. degree in electrical engineering (communication) from UET, Lahore, and the Ph.D. degree from the University of Malaya, Malaysia, in 2022. During the Ph.D. studies, he was associated with the VIP Laboratory, University of Malaya. He is currently an Assistant Professor with the Faculty of Information and Communication Technology, BUITEMS. His research interests include signal processing, computer vision, machine learning, and deep learning.

NORRIMA MOKHTAR received the B.Eng. degree in electrical engineering from the University of Malaya, in 2000, and the M.Eng. degree from Oita, Japan, in 2006. After working two years with the international telecommunication industry with an attachment at Echo Broadband GmbH, she managed to secure a Panasonic Scholarship, which required intensive screening at the national level, in 2002. To date, she has successfully supervised seven Ph.D. and four M.Eng.Sc. students (by research). She is the author and co-author of more than 50 publications in international journals and proceedings in Sensors, Automation, IEEE TRANSACTIONS ON IMAGE PROCESSING, Human-Computer Interface, Brain-Computer Interface, UAV, and Robotics. She received financial support from a Panasonic Scholarship for her M.Eng. degree. She is active as a reviewer for many reputable journals and several international conferences.

SULAIMAN WADI HARUN received the B.E. degree in electrical and electronics system engineering from Nagaoka University of Technology, Japan, in 1996, and the M.Sc. and Ph.D. degrees in photonics technology from the University of Malaya, in 2001 and 2004, respectively. He was an Adjunct Professor at Airlangga University, Indonesia, and Ton Duc Thang University, Vietnam. He has nearly 20 years of research experience in the development of optical fiber devices, including fiber amplifiers, fiber lasers, and fiber optic sensors. He was also involved in exploiting new nanomaterials, such as graphene, carbon nanotubes, black phosphorus, and topological insulators, for various fiber lasers and sensor applications. He has received about ten research grants of value over RM4M from the Ministry of Education and the Ministry of Science, Technology, and Innovation. He has published more than 700 articles in ISI journals, and his papers have been cited more than 7000 times with an H-index of 37, showing the impact on the community. He is a fellow of the Malaysian Academy of Science and the Founder and Honorary Advisor of the Optical Society of Malaysia. He received the prestigious Malaysian Rising Star Award from the Ministry of Higher Education, in 2016, for his contribution to international collaboration.

MASAHIRO IWAHASHI (Senior Member, IEEE) received the B.Eng., M.Eng., and D.Eng. degrees in electrical engineering from Tokyo Metropolitan University, Tokyo, Japan, in 1988, 1990, and 1996, respectively. In 1990, he joined Nippon Steel Company Ltd. Since 1993, he has been with Nagaoka University of Technology, Nagaoka, Japan, where he is currently a Professor with the Department of Electrical, Electronics, and Information Engineering. His research interests include digital signal processing, multirate systems, and image compression. He is a Senior Member of IEICE and a member of the Asia-Pacific Signal and Information Processing Association and the Institute of Image Information and Television Engineers.
