1. Introduction
Wood-based panels are among the most commonly used materials in the manufacturing industry due to their cost-effectiveness, stability, and ease of processing, leading to widespread applications in construction, furniture manufacturing, and decoration. However, defects inevitably arise during industrial production, and product quality control necessitates the accurate detection and localization of surface defects on wood-based panels to minimize raw material waste and reduce production costs. Furthermore, product quality issues directly impact the reputation of factories and market share. Therefore, rapidly and accurately detecting surface defects has become a key research priority in the industrial sector [1].
The collection and transportation of high-quality data, along with its encoding, are crucial for defect detection tasks [2,3,4]. However, due to the susceptibility of sensors to environmental factors such as dust, water mist, and lighting, images of wood-based panel surfaces often fail to accurately reflect the true condition of the panels. Especially under significant noise interference, image quality may be severely compromised, affecting defect detection and recognition. Specifically, detecting surface defects on wood-based panels faces several challenges. First, there is blur between the defects and the background. In complex industrial production environments, the grayscale values of surface defects on wood-based panels are often very similar to those of the surrounding background areas. During the detection process, it is challenging for detectors to effectively separate the defects from the background, leading to ambiguity between the defects and the background. Second, there is variability in defect size. The size of various defects can differ significantly. Detectors often struggle to balance between large and small defects, resulting in feature confusion among multi-scale objects and impacting overall detection performance. Third, there is significant intra-class variance. Defects within the same category may exhibit vastly different appearances. This high intra-class variance presents challenges for detectors in extracting and distinguishing defect features, reducing detection accuracy.
Initially, surface defect detection was primarily performed manually. However, manual inspection is limited by human capacity, resulting in low accuracy and lengthy processing times, which significantly impacts production efficiency [5]. With the rapid advancement of machine vision technology, various algorithms have been extensively applied in defect detection, broadly categorized into traditional methods and deep-learning-based approaches [6]. Traditional methods generally rely on handcrafted features for defect detection [7,8]. These methods encounter two major issues, as follows: First, handcrafted features often lack robustness, particularly under poor lighting conditions or high noise interference, making it challenging to achieve satisfactory detection results. Second, the reliance on prior knowledge for designing handcrafted features constrains the potential for enhancing detection performance, leading to significant limitations in traditional methods.
In recent years, with the rapid development of machine learning and deep learning technologies, models based on these techniques have been widely applied across various fields [9,10,11,12,13,14]. Convolutional Neural Networks (CNNs), with their powerful feature extraction capabilities, can adaptively identify defect features and achieve high-precision detection, even in complex environments, demonstrating strong robustness [15,16,17]. Dlamini et al. [18] proposed an automatic surface defect detection system based on MobileNetV2 and Feature Pyramid Networks (FPN), which performed exceptionally well on printed circuit boards. Zhang et al. [19] designed a lightweight defect detection network that uses efficient downsampling methods to extract richer defect features and employs multi-scale aggregation networks and efficient attention modules to mitigate interference from complex backgrounds, achieving commendable performance. Jiang et al. [20] introduced a multi-scale attention module that efficiently integrates high-level semantic information with low-level feature information, generating complete feature maps and improving fault recognition accuracy. Song et al. [21] proposed an encoder–decoder network for steel surface defect detection, utilizing a powerful attention mechanism in the encoder to extract rich multi-dimensional features and a channel weighting module in the decoder to integrate feature maps. Su et al. [22] designed a dynamic transformer model based on semantic alignment for steel defect detection. This network introduces a local attention mechanism to eliminate noise blur between defects and backgrounds while dynamically adjusting encoding blocks, shortening inference time while maintaining high accuracy. Dong et al. [23] combined spatial attention with channel attention, achieving good results in pavement defect detection. Su et al. [24] constructed a novel complementary attention network that utilizes spatial and channel features to dynamically suppress background noise, achieving high precision in defect detection for electroluminescence images of solar cells. Cao et al. [25] proposed a deep feature fusion pixel-level segmentation network for surface defect detection that aggregates multi-scale feature maps through multi-level feature fusion and uses a branch decoder to recover defect details, improving defect segmentation accuracy. However, in wood-based panel surface images, these deep learning methods still face challenges due to the ambiguity between defects and backgrounds as well as significant variations in defect sizes. They often encounter misclassification issues when handling blurry and small defects, and achieving an optimal balance between efficiency and accuracy remains difficult.
We re-examined the issue of low separability between low-contrast defects and backgrounds, identifying that the root cause lies in the fact that, when the grayscale values of defects and backgrounds are very close, deep learning methods based on spatial domain processing often struggle to effectively separate the foreground and the background in feature extraction modules. This leads to blurred boundaries of target features, particularly under complex noise interference, making it difficult for the network to distinguish between the target and the background. As a result, the detector fails to accurately extract true defect features, leading to false positives and missed detections. However, when processing images in the frequency domain, the sensitivity to contrast changes is lower, and the frequency domain offers significant advantages over spatial domain processing in terms of noise suppression and global feature extraction. Based on this observation, we propose a feature decoupling and denoising method using multi-axis frequency domain weighting. We convert low-contrast defects and backgrounds to the frequency domain using 2D discrete Fourier transform (2D DFT) and examine their signal intensity differences. As shown in Figure 1a,b, there is a significant difference in signal intensity between the target and the background. This indicates that low-contrast defects, which are difficult to distinguish in the spatial domain, exhibit higher separability in the frequency domain. Additionally, the frequency domain is more effective at removing image noise, enhancing the features of interest, and suppressing background information (as shown in Figure 1c). Leveraging this characteristic, we use frequency domain signal transformation, renowned for its global feature extraction and noise suppression capabilities, to separate the target from the background while suppressing background noise. CNNs, known for their local perception and texture feature extraction, are then employed to capture detailed feature information. By combining spatial and frequency domain transformations, we enhance the feature extraction and spatial discriminative capabilities of the backbone network, enabling the model to effectively differentiate subtle differences between the target and the background.
To address the challenge of defect detection in wood-based panels under complex interference conditions, we propose a method called FDADNet, which is based on frequency domain transformation and adaptive dynamic downsampling. We have designed a Multi-axis Frequency Domain Weighted Information Representation Module (MFDW) for feature separation and developed an Adaptive Dynamic Convolution (ADConv) module for downsampling, aimed at enhancing the feature representation of small and weak defects. By serially integrating these two modules, we have created a powerful feature extraction framework. We believe that these modules achieve information complementarity and functional synergy in several ways. First, convolutional transformations in the spatial domain are effective at handling local spatial features, while frequency domain signal transformations excel at capturing global frequency characteristics. The combination of these approaches enhances the model’s ability to express features across different dimensions. Second, Adaptive Dynamic Convolution enables the model to flexibly adapt to varying inputs, while the Multi-axis Frequency Domain Weighted Information Representation Module strengthens the expression of specific features. This integration improves the model’s detection accuracy and robustness in complex scenarios. Finally, although Adaptive Dynamic Convolution introduces additional parameters, the Multi-axis Frequency Domain Weighted Information Representation Module (MFDW) divides feature maps into four regions and performs feature transformations across the different domains and axes, thus maintaining a low parameter count and computational demand. This design achieves a good balance between efficiency and accuracy. Additionally, due to the current lack of open-source data, we have constructed a dataset of surface defects on wood-based panels, named WBP-DET. Our main contributions include the following:
1. Due to the limited availability of datasets for wood-based panel defect detection, we established a wood-based panel defect detection dataset (WBP-DET), which includes surface defects such as Glue Spot, Oil Stains, Chalk, and Scratch.
2. We introduced frequency domain signal transformation into the defect detection task and have proposed a method for surface defect detection in wood-based panels based on frequency domain transformation and adaptive dynamic downsampling (FDADNet). This approach utilizes frequency domain signal transformation in the feature extraction phase to handle the target and the background, providing strong noise resistance and discrimination capabilities.
3. We developed a Multi-axis Frequency Domain Weighted Information Representation Module (MFDW) for feature extraction. The MFDW focuses on global frequency domain features, applying weighted transformations to different frequency domain signal characteristics to enhance the separability between the target and the background. Gaussian filtering is used to suppress background noise, reduce noise accumulation, and enhance the feature representation of objects.
4. We designed Adaptive Dynamic Convolution (ADConv) for downsampling feature maps. ADConv can flexibly compress and enhance the features of different categories, increasing the semantic gap between the features and thereby reducing feature confusion among multi-scale objects.
3. Methods
In this section, we will first introduce the overall architecture and workflow of the network. Then, in Section 3.1 and Section 3.2, we will discuss the operation mechanisms and functions of the two key modules, MFDW and ADConv. Finally, in Section 3.3, we will introduce the loss function that we used.
Based on the integration of frequency domain and spatial domain transformations, we designed the FDADNet. As illustrated in Figure 2, the network is composed of the following three main components: the backbone, the neck, and the heads. In the backbone, we introduce the Multi-axis Frequency Domain Weighted Module (MFDW) as an efficient feature aggregation unit to extract and refine multi-scale image features, facilitating both frequency domain and spatial domain information extraction. Subsequently, these features (P3, P4, P5) are passed to the neck for multi-scale feature fusion and information response. The neck enhances global semantic information through a top-down feature pyramid while integrating local texture information via a bottom-up path. Finally, three detection heads perform object classification and bounding box regression on the fused features. This multi-level detection approach enhances the model’s ability to discriminate between the targets of different scales. In the backbone, each stage consists of a combination of one ADConv and one MFDW for progressive feature extraction. After four stages, the spatial resolution of the features is gradually reduced to 1/4, 1/8, 1/16, and 1/32 of the original image, while the number of channels increases progressively to $2C_1$, $4C_1$, $8C_1$, and $16C_1$. During the feature fusion stage, high-level feature maps are aligned with low-level feature maps via 2× upsampling, while low-level feature maps are fused with high-level feature maps via 2× downsampling. This design helps to fully leverage feature information at multiple scales, thereby enhancing detection accuracy and robustness.
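The stage-wise shapes described above can be tabulated with a short sketch. The helper name `backbone_shapes` and the base width `c1 = 16` are assumptions made only for illustration; the text gives the ratios but not the value of $C_1$.

```python
def backbone_shapes(height, width, c1):
    """Return (channels, height, width) after each of the four backbone stages.

    Strides 4, 8, 16, 32 and widths 2*c1 ... 16*c1 follow the text;
    c1 is a hypothetical base width, not a value taken from the paper.
    """
    shapes = []
    for stage in range(4):
        stride = 4 * 2 ** stage            # 4, 8, 16, 32
        channels = 2 ** (stage + 1) * c1   # 2*c1, 4*c1, 8*c1, 16*c1
        shapes.append((channels, height // stride, width // stride))
    return shapes


for i, (c, h, w) in enumerate(backbone_shapes(640, 640, c1=16), start=1):
    print(f"stage {i}: C={c}, H={h}, W={w}")
```

For a 640 × 640 input, the last three stages yield 80 × 80, 40 × 40, and 20 × 20 maps, which is the usual stride-8/16/32 layout that P3, P4, and P5 presumably denote here.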
3.1. Multi-Axis Frequency Domain Weighted Information Representation Module
In the field of wood-based panel defect detection, recent methods often emphasize extracting spatial domain information, frequently overlooking the significance of frequency domain information. In the spatial domain, the edges between the objects and the backgrounds are often blurred, and backgrounds frequently contain substantial noise, making it challenging to extract clean and complete object features. In contrast, in the frequency domain, different objects exhibit distinct frequency signals, and the feature spaces of the objects and backgrounds are more dispersed. By extracting both spatial and frequency domain information simultaneously, the model can achieve enhanced perceptual capabilities. However, despite the efforts in previous studies [39,40,41] to utilize frequency domain information in deep learning, if effective denoising is not performed in the frequency domain, significant noise will persist in the subsequent spatial domain feature extraction process, leading to noise accumulation and adversely affecting detection performance. To address this issue, we propose the Multi-axis Frequency Domain Weighted Information Representation Module (MFDW). This module not only extracts multi-axis frequency domain information, but also effectively suppresses noise. By performing denoising in the frequency domain, the backbone can extract more complete object features, thereby reducing misdetections and missed detections caused by blurred boundaries and noise interference.
The structure of the efficient feature aggregation module MFDW and the ELAN [48] module is illustrated in Figure 3. The MFDW module consists of convolution, split, a Signal Extraction Module (SEM), and concatenation operations. For a given input feature map $X \in \mathbb{R}^{C_{in} \times H \times W}$, where $C_{in}$, $H$, and $W$ denote the number of input channels, height, and width, respectively, the input feature map $X$ is first processed by a convolution operation to adjust the number of channels to $C_{out}$. The feature map is then split along the channel dimension into two parts. One of these parts is processed by the SEM to extract the frequency domain signal features. After processing, the two split parts of the feature map, along with the output from SEM, are concatenated, and the concatenated feature map is then processed by another convolution to adjust the number of channels. This design ensures that important spatial domain information is preserved during the transformation process, maximizing the retention and utilization of all useful information from the input image.
Assuming that the feature map input to SEM is $X'$, to obtain the multi-axis frequency domain information, we split $X'$ along the channel dimension into four equal parts and input each part into a different branch. The specific process is described in Equation (1), as follows:
$$x_1, x_2, x_3, x_4 = \mathrm{Split}(X') \tag{1}$$
where $\mathrm{Split}(\cdot)$ denotes the operation of splitting along the channel dimension; and $x_1$, $x_2$, and $x_3$ correspond to the first three branches, which are passed along the H–W axis, C–W axis, and C–H axis, respectively, into the Adaptive Frequency Filters (AFF) for feature weighting and noise suppression. Multi-branch processing calculates the frequency domain features of each branch in parallel, enhancing the signal perception capability of the module and significantly improving computational efficiency. Next, we convert the feature map to the frequency domain using 2D DFT, transforming the spatial domain pixel features into frequency domain signal features. In the frequency domain, we use learnable multi-axis weights to perform weighted transformations on the signal information. Through weighted processing, we enhance the differences between the target and background signals in the frequency domain, achieving feature optimization and enhancement. Subsequently, we apply Gaussian filtering in the frequency domain to suppress background noise, improving the signal-to-noise ratio and making the target features clearer and more realistic. Finally, we convert the frequency domain information back to the spatial domain using 2D inverse discrete Fourier transform (2D IDFT) for subsequent processing. The AFF process is represented in Equations (2) and (3), as follows:
$$\hat{x}_i = W_i \odot \mathcal{F}_i(x_i), \quad i \in \{1, 2, 3\} \tag{2}$$
$$\tilde{x}_i = \mathcal{F}_i^{-1}\big(G(\hat{x}_i)\big) \tag{3}$$
where $W_i$ and $\mathcal{F}_i$ represent the learnable weights and the 2D DFT corresponding to the respective axes; $\odot$ denotes the element-wise product; $G$ signifies Gaussian filtering for denoising; and $\mathcal{F}_i^{-1}$ is the corresponding 2D IDFT. For the fourth branch, we utilize depthwise separable convolution (DWConv) to capture the local features. Finally, the feature maps of the four branches are concatenated along the channel dimension, restored to the same size as the input feature map, and the final output is generated through residual connection with the input feature map, as shown in Equations (4) and (5).
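$$\tilde{x}_4 = \mathrm{DW}(x_4) \tag{4}$$
$$Y = \mathrm{Concat}(\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_4) + X' \tag{5}$$
where $\mathrm{DW}$ represents DWConv, $\mathrm{Concat}$ denotes concatenation of the feature maps along the channel dimension, $\tilde{x}_1$, $\tilde{x}_2$, and $\tilde{x}_3$ are the AFF outputs of the first three branches, and $Y$ is the output of the SEM. MFDW efficiently extracts and enhances the multi-axis frequency domain signals of the target while effectively suppressing background noise. Compared to traditional convolution operations, it requires fewer parameters and computational resources.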
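The four-branch flow of the SEM can be sketched in NumPy as a rough illustration, not as the authors' implementation. Several details are assumptions: the learnable weights are plain real arrays of the same shape as each branch, the Gaussian filter is a low-pass mask with a hypothetical width `sigma`, and a plain depthwise 3 × 3 convolution stands in for DWConv. All function names are illustrative.

```python
import numpy as np


def gaussian_mask(shape, axes, sigma):
    """Gaussian low-pass mask over the two transformed axes, broadcastable to `shape`."""
    dist2 = np.zeros((1, 1, 1))
    for ax in axes:
        freq = np.fft.fftfreq(shape[ax])          # normalized frequencies in [-0.5, 0.5)
        view = [1, 1, 1]
        view[ax] = shape[ax]
        dist2 = dist2 + freq.reshape(view) ** 2   # accumulate squared frequency radius
    return np.exp(-dist2 / (2.0 * sigma ** 2))


def aff(x, axes, weight, sigma):
    """DFT over `axes`, element-wise weighting, Gaussian filtering, inverse DFT."""
    spec = np.fft.fft2(x, axes=axes)
    spec = weight * spec
    spec = gaussian_mask(x.shape, axes, sigma) * spec
    return np.fft.ifft2(spec, axes=axes).real


def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 convolution with zero padding; kernels has shape (C, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def sem(x, weights, dw_kernels, sigma=0.25):
    """x: (C, H, W) with C divisible by 4; returns an array of the same shape."""
    x1, x2, x3, x4 = np.split(x, 4, axis=0)
    branches = [
        aff(x1, (1, 2), weights[0], sigma),   # H-W axis
        aff(x2, (0, 2), weights[1], sigma),   # C-W axis
        aff(x3, (0, 1), weights[2], sigma),   # C-H axis
        depthwise_conv3x3(x4, dw_kernels),    # local branch (stand-in for DWConv)
    ]
    return np.concatenate(branches, axis=0) + x   # residual connection


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
weights = [np.ones((2, 16, 16)) for _ in range(3)]
identity = np.zeros((2, 3, 3))
identity[:, 1, 1] = 1.0
y = sem(x, weights, identity)
print(y.shape)
```

With unit weights, an identity depthwise kernel, and a very wide Gaussian, every branch passes its input through unchanged, so the output reduces to twice the input. This is a convenient sanity check before training any of the weights.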
3.2. Adaptive Dynamic Convolution
Typically, an image containing defects may simultaneously present both large and small defects, and detectors often face the challenge of balancing performance when handling targets of different scales. This challenge arises mainly because, as the image undergoes successive layers of feature extraction and spatial transformation, the spatial resolution of the feature map gradually decreases, causing the features of small defects to be more easily lost. Meanwhile, large defects generally exhibit stronger responses in the image, and these strong response areas can overshadow or merge with the subtle semantic information of small defects, thereby reducing the discernibility of the small defects. To address this issue, we propose the Adaptive Dynamic Convolution Module (ADConv) as the downsampling module of the network. ADConv adaptively enhances the feature representation of small target defects by enlarging the feature distance between different scales in the transformed space, thereby improving the distinction between the features of objects at different scales and alleviating feature confusion among multi-scale targets.
As shown in Figure 4, for the input $X$ and the weight tensor $W$, the traditional convolution operation is expressed in Equation (6) as follows:
$$Y = W * X \tag{6}$$
where $Y$ is the output and $*$ represents the convolution operation. For simplicity, we omit the bias operation. However, traditional convolution operations use fixed kernels, which limits their adaptability to input features. To allow the kernel weights to dynamically adjust according to specific input features, we create a map to modify the kernel weights, as shown in Equations (7)–(9), as follows:
$$\alpha = \sigma\big(\mathrm{FC}(\mathrm{GAP}(X))\big) \tag{7}$$
$$W' = \alpha \odot W \tag{8}$$
$$Y = W' * X \tag{9}$$
where $\alpha$ represents the dynamic coefficient, $\mathrm{FC}$ stands for the fully connected layer (FCL), $\mathrm{GAP}$ is global average pooling, $\sigma$ is the activation function, and $X$, $W$, and $Y$ are the input, kernel weights, and output of the convolution. We aim to enhance the representation capability of small defects using ADConv, thereby increasing the semantic gap between the features of different scales and achieving balanced attention to multi-scale targets. To this end, we associate the dynamic coefficient with the input features. Specifically, we first perform global average pooling on the input feature map $X$ along the channel dimension, aggregating the global information into a vector. Then, we use FCL and an activation function to generate the dynamic coefficient. This dynamic coefficient is element-wise multiplied with the convolution kernel weights, and the weighted convolution kernel is then used to convolve the input feature map. ADConv effectively improves the downsampling module’s ability to enhance and retain small defect features while preventing the dominance of large defect features. This method introduces more learnable parameters while maintaining low FLOPs, preserving the integrity of large defect information and significantly improving the feature representation of small defects, thus enhancing overall detection performance.
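The pooling, fully connected layer, and kernel re-weighting described above might look as follows in NumPy. This is a sketch under stated assumptions rather than the authors' ADConv: the activation is taken to be a sigmoid, the dynamic coefficient has one entry per output channel, and the downsampling convolution is a 3 × 3 kernel with stride 2. The names `adconv`, `conv2d`, `fc_weight`, and `fc_bias` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv2d(x, kernel, stride=1, pad=1):
    """Naive convolution. x: (C_in, H, W); kernel: (C_out, C_in, k, k)."""
    c_out, _, k, _ = kernel.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h_out = (xp.shape[1] - k) // stride + 1
    w_out = (xp.shape[2] - k) // stride + 1
    out = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = xp[:, i * stride:i * stride + k, j * stride:j * stride + k]
            out[:, i, j] = np.tensordot(kernel, patch, axes=3)
    return out


def adconv(x, kernel, fc_weight, fc_bias, stride=2):
    """Input-dependent re-weighting of the kernel, followed by a strided convolution."""
    pooled = x.mean(axis=(1, 2))                     # global average pooling -> (C_in,)
    alpha = sigmoid(fc_weight @ pooled + fc_bias)    # dynamic coefficient -> (C_out,)
    dynamic_kernel = alpha[:, None, None, None] * kernel
    return conv2d(x, dynamic_kernel, stride=stride, pad=1), alpha


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
kernel = rng.standard_normal((6, 4, 3, 3))
fc_w = rng.standard_normal((6, 4))
fc_b = np.zeros(6)
y, alpha = adconv(x, kernel, fc_w, fc_b)
print(y.shape, alpha.round(3))
```

Because the coefficient only rescales the kernel, the extra cost is one pooled vector and one small matrix product per input, which is in line with the statement that parameters grow while FLOPs stay low.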
3.3. Loss Function
The loss function of the network consists of two main components [28], the object category loss and the bounding box loss, as follows:
$$L = \lambda_1 L_{cls} + \lambda_2 L_{box}$$
where $\lambda_1$ and $\lambda_2$ are coefficients used to adjust the weights of these two loss components. The object classification loss $L_{cls}$ is calculated using binary cross-entropy (BCE) loss, while the bounding box loss $L_{box}$ combines CIoU loss with Distribution Focal Loss (DFL).
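As a rough illustration of how these terms combine, the sketch below implements BCE and the standard CIoU loss for a single box in plain Python. The DFL term is omitted for brevity, and the weights passed to `total_loss` are hypothetical values, not the coefficients used in the experiments.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def ciou_loss(box, gt):
    """1 - CIoU for two boxes given as (x1, y1, x2, y2)."""
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter)
    # Squared centre distance over the squared diagonal of the enclosing box.
    rho2 = ((box[0] + box[2] - gt[0] - gt[2]) / 2.0) ** 2 \
         + ((box[1] + box[3] - gt[1] - gt[3]) / 2.0) ** 2
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term.
    v = (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / (1.0 - iou + v) if v > 0 else 0.0
    return 1.0 - (iou - rho2 / c2 - alpha * v)


def total_loss(cls_loss, box_loss, lambda_cls, lambda_box):
    """Weighted sum of the classification and box terms."""
    return lambda_cls * cls_loss + lambda_box * box_loss


cls_term = bce(0.9, 1.0)
box_term = ciou_loss((10.0, 10.0, 50.0, 40.0), (12.0, 8.0, 48.0, 42.0))
print(total_loss(cls_term, box_term, lambda_cls=0.5, lambda_box=7.5))
```

The box term is zero for a perfect match and exceeds 1 for disjoint boxes, because the centre-distance penalty still applies when the IoU is zero.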
4. Experiments
In this section, we will introduce the dataset, experimental details, and the results and conclusions of the experiments. Specifically, Section 4.1 will present the WBP-DET dataset, Section 4.2 will detail the experimental parameters and evaluation metrics for our method, Section 4.3 will discuss the hyperparameters of the comparison methods, Section 4.4 will showcase the results of the comparative experiments, Section 4.5 will analyze the generalization of our network, Section 4.6 will display the results of the ablation experiments and discuss the roles of the modules, and Section 4.7 will provide the experimental conclusions.
4.1. WBP-DET Dataset
Currently, there are few publicly available benchmark datasets for defect detection in wood-based panels. To facilitate future research, we introduce a new dataset for surface defect detection in wood-based panels called the WBP-DET dataset. The dataset was collected in 2024 by Sun Qin in Wuhan, featuring infrared thermal imaging with a resolution of 2048 × 4300. The WBP-DET dataset includes the following five common types of surface defects in wood-based panels: Glue Spot (GS), Oil Stains (OS), Chalk (Ch), Scratch (Sc), and Other Defects (OD). Since the original images have a high resolution, they have been cropped to 512 × 512 for research purposes. Additionally, Figure 5 illustrates the distribution of ground truth bounding boxes, showing a variety of defect shapes. The defects in the WBP-DET dataset are randomly distributed, reflecting the real-world conditions of surface defect detection in wood-based panels.
After cropping, we obtained the WBP-DET dataset, which contains 1793 images for wood-based panel surface defect detection. Figure 6 shows some examples of the defects on wood-based panel surfaces. The Glue Spot defects are typically small, black, circular spots with a gray level similar to that of the background; Oil Stains are usually irregular black shapes with varying scales and uneven feature distribution; Chalk defects are more prominent, generally appearing as white arcs; Scratch defects are black diagonal lines with significant differences in length and width and have lower contrast; and Other Defects have no distinct characteristics. These images are divided into training, validation, and test sets in an 8:1:1 ratio. Table 1 provides detailed information about the dataset. The WBP-DET dataset will be made publicly available for researchers at the following address: https://github.com/LazyShark2001/FDADNet (accessed on 3 August 2024).
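An 8:1:1 split of 1793 images can be produced in several ways, since 1793 is not divisible by 10. The sketch below shows one simple convention, in which the test set absorbs the rounding remainder. The function name, the seed, and the rounding rule are assumptions, so the resulting counts need not match those reported in Table 1.

```python
import random


def split_indices(n_images, ratios=(8, 1, 1), seed=0):
    """Shuffle image indices and cut them into train/val/test by integer ratios."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n_images * ratios[0] // total
    n_val = n_images * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(1793)
print(len(train), len(val), len(test))  # 1434 179 180
```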
4.2. Implementation Details
Considering that wood-based panel images often contain numerous small defects, we normalized the image size to 640 × 640 to balance the model detection accuracy and computational complexity. This normalization also eases deployment of the model on edge devices. All of the experiments were conducted on a 16 GB NVIDIA RTX 4060 Ti GPU (Maxsun, Wuhan, China) with PyTorch 2.0.1. For each dataset, we set the training, validation, and test set ratios to 8:1:1. To ensure fairness in model comparisons, all ablation and comparative experiments were conducted without using pre-trained weights, unless otherwise specified. The other training parameters are detailed in Table 2.
In this study, we used the most common metrics for defect detection to evaluate the methods, including precision (P), recall (R), and mean average precision (mAP) [49]. Additionally, to assess the model’s complexity and size, we considered the number of floating-point operations (FLOPs) and the number of parameters (Params), which are crucial for deploying the network on edge devices. The formulas for the relevant evaluation metrics are as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
$$AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{c}\sum_{i=1}^{c} AP_i$$
where True Positives (TP) and True Negatives (TN) represent the correct predictions, False Positives (FP) and False Negatives (FN) represent the incorrect predictions, P stands for precision, R stands for recall, and $c$ denotes the number of classes.
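These metrics are straightforward to compute once detections have been matched to ground truth. The following sketch uses all-point interpolation for the area under the precision-recall curve. That interpolation scheme is an assumption, since the text does not say which variant was used, and the function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts; 0.0 when a denominator is empty."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Area under the PR curve; `recalls` must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


print(precision_recall(tp=8, fp=2, fn=2))           # (0.8, 0.8)
print(average_precision([0.5, 1.0], [1.0, 0.5]))    # 0.75
```

The subscript in $mAP_{50}$ conventionally means that a detection counts as a true positive when its IoU with a ground-truth box is at least 0.5.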
4.3. Comparison Methodology
The experimental details and hyperparameter settings for the comparison methods are provided in Table 3. It is important to note that, except for Faster R-CNN, which scales both the width and the height of the images to within the range of [800, 1333], the input image size for all other networks is resized to 640 × 640. Due to the large number of parameters in the Faster R-CNN and RTDETR models, and the relatively small dataset size, training from randomly initialized weights resulted in slower convergence. Therefore, we loaded the official pretrained weights for these two models during training, while the other models were trained from randomly initialized weights. We checked that the models did not overfit by monitoring the accuracy curves on the validation set. Through multiple experiments, we determined an appropriate number of epochs for each network. Additionally, for each model, we saved the training weights from the final epoch and used them to evaluate on the test set. The test set results were then compared across models. Apart from our method, the other methods were reproduced using the MMDetection and Ultralytics frameworks. When training the comparative methods, we modified only the epochs and batch size parameters, while all other hyperparameters were retained as the default settings of their respective frameworks.
4.4. Comparison with State-of-the-Art Models
To evaluate the proposed FDADNet comprehensively, we conducted both quantitative and visual analyses against eight state-of-the-art object detection methods. These models are categorized into CNN-based and Transformer-based models. The CNN-based models include Faster R-CNN [31], YoloV5 [50], YoloX [51], YoloV7 [27], YoloV8 [28], YoloV10 [30], and RTMDet [26], while RTDETR [37] falls under the Transformer-based models.
Visual Analysis: We visualize the detection results of the different models in Figure 7. Specifically, we selected representative complex scenes, including low-contrast targets, complex background targets, and small defect targets. The visual results indicate that the CNN-based methods, such as Faster R-CNN, RTMDet, YoloV8, and YoloV10, perform exceptionally well in local perception, effectively detecting targets with clear edges, such as the Chalk defect shown in Figure 7c. However, these methods struggle to accurately distinguish between targets and backgrounds, especially for weak-featured, low-contrast targets in noisy environments (e.g., Figure 7a,b). For small-scale defect targets (e.g., Figure 7d,f), the Transformer-based RTDETR may overlook the edge details around small defects while capturing global information. Although the self-attention mechanism of Transformer-based models effectively captures global features, it may lack sensitivity to the subtle features of small defects, leading to false positives and missed detections. Additionally, for defects with significant inter-class shape and scale differences and uneven feature distribution (e.g., Figure 7e), Faster R-CNN with anchor-based detection and YoloV10 with dual detection heads may encounter false positives when dealing with dense and uneven defects. In contrast, FDADNet generates more accurate detection boxes, effectively detecting all defects, even in noisy backgrounds, and distinguishing between the foreground and the background with high precision, demonstrating an outstanding performance in detecting small defect targets.
Quantitative Analysis: Table 4 presents the quantitative comparison results. Our FDADNet not only achieves the best detection performance, with 79.6% improvement in $mAP_{50}$, but also demonstrates an advantage in model parameters and FLOPs (4.5 M and 6.2 G) compared to other SOTA methods. Due to the presence of numerous low-contrast defects in the images (such as Glue Spot and Scratch), most spatial-convolution-based methods perform poorly in detecting these defects. Although RTMDet mitigates the issue of large morphological differences in Scratch defects through dynamic label assignment and large-kernel depthwise convolution, its ability to perceive details for low-contrast small defects (such as Glue Spot) remains insufficient. On the other hand, RTDETR, based on spatial transformers, performs well with large-scale targets (such as Oil Stains, Scratch, and Other Defects) but often overlooks local feature information when detecting small defects (such as Glue Spot), leading to missed detections. Transformer models rely on self-attention mechanisms to capture global information, which provides an advantage in identifying large defects. However, for small defects, the self-attention mechanism may fail to adequately focus on minute details, and the position encoding performs poorly on fine-grained spatial information, resulting in reduced detection accuracy for small defects. This neglect of local features and positional information discrepancies are likely the main reasons for RTDETR’s poor performance in small defect detection. In contrast, FDADNet not only excels in detecting low-contrast targets, but also effectively balances the detection performance for both large and small objects. As shown in Figure 8, we plotted the confusion matrix for the results of each network. In the figure, the values for our network are concentrated on the diagonal, with fewer false detections. This indicates that our network maintains a high level of accuracy for each target.
4.5. Generalization Analysis
To further validate the generalization ability of our model, we conducted experiments on several defect datasets, including GC10-DET [
52], APDDD [
53], and NEU-DET [
54]. These datasets feature surface defects in steel and aluminum from industrial scenarios, commonly used to assess the performance of defect detection models.
Table 5 provides detailed information about these datasets. We compared our model with the current mainstream single-stage lightweight detection models, as shown in
Table 6. The results indicate that our model achieved the best detection performance on these three datasets, with mAP
50 reaching 56.8%, 66.3%, and 76%, respectively. Additionally, in terms of parameters and FLOPs, our model ranks favorably among the mainstream single-stage lightweight detectors. Our approach demonstrates strong generalization capability across various industrial defect datasets and effectively addresses complex defect detection tasks.
4.6. Ablation Study
To validate the effectiveness of each module, we designed a series of ablation experiments. In these experiments, the baseline is obtained by replacing ADConv with conventional convolution and MFDW with ELAN.
Frequency Domain Module Analysis: As shown in
Table 7 (comparing the first and second rows of results), the baseline model performs well in detecting high-contrast defects with clear contours (such as Chalk and Other Defects). However, it performs poorly on low-contrast and small defects (like Glue Spot, Oil Stains, and Scratch). Adding the MFDW module significantly improves the detection performance of Glue Spot, Oil Stains, and Scratch. This enhancement is due to the MFDW module’s capability to process image signal information in the frequency domain, allowing for a more precise detection of defects that are close in grayscale to the background, thus reducing missed detections of low-contrast defects. Additionally, this approach significantly reduces the model’s parameter count and computational load (parameters from 3.01 M to 2.78 M and FLOPs from 8.1 G to 7.4 G), lowering deployment and inference costs and making the model more competitive in practical applications.
ADConv Module Analysis: As shown in
Table 7 (first and third row results), replacing the downsampling module in the baseline network with ADConv significantly improves the performance in detecting small defects (e.g., Glue Spot), with mAP
50 increasing from 61.2% to 71%. This improvement is mainly because Glue Spot, as a small defect, has relatively weak features that can be overshadowed by larger defect features. ADConv, through dynamic convolutional downsampling, provides more refined processing of features with different attributes. This dynamic convolutional downsampling mechanism not only effectively preserves the edge texture details of small defects, but also adapts to the feature attributes, enhancing the expression capability of multi-scale defect information. Although ADConv introduces additional parameters (increasing from 3.01 M to 4.47 M), the overall computation (FLOPs) actually decreases (from 8.1 G to 6.9 G) due to the omission of subsequent computation steps such as BatchNorm. This design optimizes computational complexity, improves inference efficiency, and is advantageous for deploying and running the network on edge devices.
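The cost figures quoted in the two analyses above can be put on a common scale by expressing them as relative changes with respect to the baseline. The sketch below uses only the parameter and FLOPs values stated in the text; the helper name and the dictionary layout are illustrative assumptions.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before * 100.0


# Values quoted in the ablation text: parameters in millions, FLOPs in giga.
baseline = {"params_M": 3.01, "flops_G": 8.1}
variants = {
    "MFDW": {"params_M": 2.78, "flops_G": 7.4},
    "ADConv": {"params_M": 4.47, "flops_G": 6.9},
}

changes = {
    name: {key: relative_change(baseline[key], cost[key]) for key in baseline}
    for name, cost in variants.items()
}

for name, delta in changes.items():
    print(f"{name}: params {delta['params_M']:+.1f}%, FLOPs {delta['flops_G']:+.1f}%")
```

On these figures, MFDW lowers both parameters and FLOPs by roughly 8 to 9%, whereas ADConv raises the parameter count by about 48% while lowering FLOPs by about 15%.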
Compatibility of MFDW and ADConv: As shown in
Table 7 (first and fourth row results), when both MFDW and ADConv are used together, there is a significant performance improvement, especially in detecting low-contrast defects (such as Glue Spot, Oil Stains, and Scratch), compared to the baseline model. The detection capabilities for Glue Spot and Oil Stains are significantly enhanced. This indicates that MFDW and ADConv work synergistically, with good compatibility. ADConv and MFDW process features in spatial and frequency domains, respectively, complementing and enhancing the information across different dimensions. This combination provides the model with strong feature representation capabilities across various dimensions, thereby improving defect detection.
Necessity of Gaussian Filtering: As shown in
Table 7 (results from the fourth and sixth rows), after incorporating Gaussian filtering for denoising, the detection performance for Glue Spot, Oil Stains, Scratch, and Chalk defects has significantly improved, achieving leading levels. This enhancement is primarily due to the complementary advantages and synergistic effects of the MFDW and ADConv methods in feature processing. Gaussian filtering removes the noise around the target before converting the features back to the spatial domain, significantly reducing background noise in the feature maps. This allows ADConv to focus more effectively on extracting the local features of the defects without interference from background noise, thereby improving the accuracy of bounding box predictions. In practice, when applying Gaussian filtering to feature maps, the goal is to preserve as much of the original image information as possible while suppressing background noise. As shown in
Figure 9a, for the WBP-DET dataset, Gaussian filter standard deviations of 20 and 40 remove noise effectively but preserve less of the image's texture information. Conversely, standard deviations of 80 and 100 retain more texture information, but the denoising effect is insufficient and noise interference remains. A standard deviation of 60 strikes a balance by preserving rich image texture details while effectively suppressing noise. As illustrated in
Figure 9b and
Table 8, with a Gaussian filter standard deviation of 60, the resulting images retain rich edge information and effectively remove noise, enabling the subsequent spatial domain convolution operations to focus more precisely on the target features, achieving optimal detection performance.
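The trade-off governed by the standard deviation can be illustrated with a minimal NumPy sketch of Gaussian low-pass filtering in the frequency domain. This is not the MFDW implementation: the function names, the synthetic input, and the interpretation of the standard deviation as a radius measured in frequency bins are assumptions made for illustration.

```python
import numpy as np


def gaussian_lowpass_mask(height, width, sigma):
    """Centered Gaussian low-pass mask; sigma is measured in frequency bins."""
    v = np.arange(height) - height // 2
    u = np.arange(width) - width // 2
    dist_sq = v[:, None] ** 2 + u[None, :] ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))


def frequency_gaussian_filter(feature_map, sigma):
    """FFT -> center the spectrum -> apply Gaussian mask -> inverse FFT."""
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    mask = gaussian_lowpass_mask(*feature_map.shape, sigma)
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real  # imaginary part is numerical noise for real input


# Synthetic stand-in for a feature map: a bright square plus Gaussian noise.
rng = np.random.default_rng(0)
clean = np.zeros((256, 256))
clean[96:160, 96:160] = 1.0
noisy = clean + 0.2 * rng.standard_normal(clean.shape)

for sigma in (20, 40, 60, 80, 100):
    out = frequency_gaussian_filter(noisy, sigma)
    retained = np.sum(out ** 2) / np.sum(noisy ** 2)
    print(f"sigma={sigma}: fraction of signal energy retained = {retained:.3f}")
```

Under this assumed formulation, a smaller standard deviation gives a narrower pass band, so more high-frequency content (noise and fine texture alike) is removed, while a larger one keeps more of both. This matches the qualitative trend described for Figure 9a.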
Why MFDW is Used Only in the Feature Extraction Stage: As shown in
Table 7 (fifth and sixth rows), extending the use of MFDW from the backbone to both the backbone and neck stages decreases detection performance for low-contrast defects (such as Glue Spot, Oil Stains, and Scratch). Although Gaussian filtering effectively removes noise, it also discards some image detail. MFDW in the backbone extracts relatively complete feature information and significantly reduces background noise interference. The neck stage, however, processes the feature maps output by the backbone, which have already been denoised. Applying MFDW there yields little further denoising and increases the risk of losing substantial semantic information through detail loss, particularly in high-level feature maps. Therefore, applying MFDW only in the backbone network maximizes noise suppression while preserving object information integrity, thus optimizing detection performance.
4.7. Experimental Conclusions
The results of the comparative and generalization experiments indicate that our method effectively detects small and low-contrast defects while still performing excellently when generalized to other tasks. This demonstrates that combining the spatial and frequency domains for feature extraction enhances feature encoding and adaptability and provides superior stability. Additionally, the results of the ablation experiments show that the frequency domain module has strong feature extraction capabilities for low-contrast targets, and that Gaussian filtering, when used to eliminate background noise, enables the detector to better detect small defects and targets with blurred boundaries. Furthermore, the ADConv and MFDW modules complement and enhance information from different dimensions, exhibiting strong feature representation capabilities.
5. Conclusions
Addressing the ambiguity between defects and backgrounds, as well as the variability in defect sizes, is crucial for detecting surface defects in wood-based panels. This paper proposes a defect detection method, FDADNet, based on frequency domain transformation and adaptive dynamic downsampling. Specifically, we designed the MFDW module to tackle noise accumulation and boundary blurring issues in feature extraction. MFDW enhances the separability of signals between targets and backgrounds in the frequency domain. Additionally, Gaussian filtering is used to suppress noise, reducing its impact on feature representation and further enhancing the expression of target features. Furthermore, we introduced the ADConv module for image downsampling to address the variability in defect sizes. ADConv adaptively compresses and reinforces feature maps, enabling the flexible enhancement of targets of different scales. This mechanism allows both large and small defect features to be adaptively enhanced in the transformed space, reducing feature confusion between multi-scale objects and improving the discriminability of small defects, thus achieving a more balanced detection performance. Moreover, we established a new dataset for defect detection, WBP-DET. Compared to the current mainstream object detection methods, our model achieves the highest detection accuracy on the WBP-DET dataset, particularly excelling in detecting small and low-contrast defects. Additionally, our method demonstrates significant advantages in terms of parameter count and computational complexity. Our approach also performs excellently across three other mainstream industrial material defect detection datasets. The outstanding performance of FDADNet makes it highly suitable for practical applications in complex industrial scenarios. In future work, we will further explore the potential of frequency domain techniques and expand their application to other industrial inspection tasks.