3.1. Overview
In the setting of ADA for semantic segmentation, we have a set of labeled source-domain data, $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, where $y_i^s$ is the pixel-wise annotation for image $x_i^s$ and $N_s$ is the number of images in the source domain, and the target-domain dataset $\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{N_t}$, where $y_i^t$ is the target active label that is initialized as $\varnothing$ and $N_t$ is the number of images in the target domain. We aim to learn a segmentation network $G$ parameterized by $\theta$ that performs well on the target domain using minimal annotations. Typically, the segmentation model, $G$, consists of a feature extractor, $E$, and a classifier, $F$, with the relationship $G = F \circ E$. Models trained exclusively with labeled data from the source domain often produce suboptimal results when applied to the target domain. For effective knowledge transfer, the prevailing self-training paradigm creates pseudo-labels, $\hat{y}^t$, for target inputs, $x^t$, and subsequently optimizes the cross-entropy loss. However, the resulting performance still falls short of a fully supervised model. This is because pseudo-labels tend to be imprecise, and only those pixels exceeding a specific confidence level are selected for further training. Moreover, persistent inter-domain differences mean that the reliability of the pixels the model itself deems confident remains doubtful. To tackle this issue, we introduce a straightforward but effective AL method that assists domain adaptation by identifying image regions with sharp inter-domain differences. Furthermore, an enhanced self-training strategy is proposed for filtering unfavorable samples from the source domain.
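For concreteness, the following is a minimal PyTorch sketch of the confidence-thresholded pseudo-labeling described above; the threshold `tau` and the ignore index 255 are illustrative assumptions, not values fixed by our method.

```python
import torch
import torch.nn.functional as F

def generate_pseudo_labels(logits: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Confidence-thresholded pseudo-labels for self-training.

    logits: (B, C, H, W) raw scores from the segmentation model G.
    Returns per-pixel class indices, with low-confidence pixels set to an
    ignore index (255) so they are excluded from the cross-entropy loss.
    """
    probs = F.softmax(logits, dim=1)      # (B, C, H, W)
    conf, pseudo = probs.max(dim=1)       # (B, H, W) each
    pseudo[conf < tau] = 255              # drop unreliable pixels
    return pseudo
```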
The comprehensive structure of ABSNet comprises three stages: (1) Initially, train the model with a UDA technique as a warm-up step. (2) Execute multi-prototype active region selection (MARS) to determine the representative region of each image based on inter-domain similarity. (3) Re-train the ADA segmentation model through source-weighted class-balanced self-training (SCBS) with the target-domain-specific knowledge. The overview is shown in Figure 3, in which a ResNet50 is used as the feature extractor of the segmentation model. The MARS module is a forward inference procedure in which the weight parameters of the segmentation network are frozen. Each image in the source-domain data, $\mathcal{D}_s$, is fed into the model to obtain the source features, and these pixel-level feature vectors are clustered to obtain the source prototypes. The MARS module then estimates the source-domain data distribution and picks a few active samples within the labeling budget, $B$, for the target domain, guided by the prototypes from the source domain. Conversely, the SCBS module is a backward training procedure using the back-propagation algorithm. The source- and target-domain images are simultaneously fed into the segmentation network, where the source and target features are derived from the output of the feature extractor, and the source and target predictions are derived from the decoder of the segmentation network (e.g., the Atrous Spatial Pyramid Pooling decoder [39]). We retrain the segmentation model by optimizing two segmentation losses (e.g., cross-entropy loss). The SCBS module screens out hard instances or sensitive samples in the source domain by considering the relationship between the source domain and the target-domain centroids.
3.2. Multi-Prototype Active Region Selection
Region Generation. To optimize labeling efforts for unlabeled target images, our initial step involves decomposing each image into superpixels. Superpixels serve as image primitives, automatically extracting object boundaries to some extent by grouping similar pixels in the image. Subsequently, we employ an AL strategy to select informative superpixel regions. Previous research [40] has shown that, unlike traditional polygon-based labeling, superpixel annotation offers a more advantageous labeling scheme. In line with this, each superpixel region in our experiments is assigned only one class label. To simulate annotation by an oracle, our work utilizes the ground truth of the target-domain dataset.
We adopt the off-the-shelf SEEDS algorithm [41], a clustering-based algorithm for superpixel generation. For a training image, $x_i^t$, the set of obtained superpixels is denoted as:

$$S_i = \left\{ s_i^1, s_i^2, \dots, s_i^k \right\},\qquad(1)$$

where an image is divided into $k$ superpixels. For the $N_t$ images in the target domain, the set of obtained superpixels is denoted as:

$$S = \left\{ S_1, S_2, \dots, S_{N_t} \right\}.\qquad(2)$$

Therefore, the total achievable superpixels amount to $k \times N_t$. A labeling budget, $B$, is established, which is much less than the potential labeling volume within the target-domain data and is quantifiable as follows:

$$B = \alpha \cdot k \cdot N_t,\qquad(3)$$

where $\alpha \ll 1$ denotes the proportion of active labels.
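As an illustration of this region-generation step, the sketch below uses the SEEDS implementation from opencv-contrib (`cv2.ximgproc`); the superpixel count `k`, image count `n_t`, and ratio `alpha` are placeholder values, not settings fixed by the paper.

```python
import cv2
import numpy as np

def generate_superpixels(image_bgr: np.ndarray, k: int = 100) -> np.ndarray:
    """Decompose one target image into roughly k SEEDS superpixels.

    Requires opencv-contrib-python for the ximgproc module.
    Returns an (H, W) int32 map assigning each pixel a region id.
    """
    h, w, c = image_bgr.shape
    seeds = cv2.ximgproc.createSuperpixelSEEDS(w, h, c, k, 4)  # k regions, 4 levels
    seeds.iterate(image_bgr, 10)   # refine region boundaries for 10 iterations
    return seeds.getLabels()       # region ids in [0, k)

# Labeling budget over the whole target set, following Equation (3):
k, n_t, alpha = 100, 500, 0.05     # illustrative values
budget = int(alpha * k * n_t)      # B superpixel regions to annotate
```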
Multi-prototype Domain Density Estimation. Previous ADA techniques typically rely on labeling by predictive uncertainty or on measuring the similarity between data pairs with Euclidean distances. However, these may be sub-optimal for RSIs, which exhibit strong intra-class variation. To effectively label the most informative target-domain data, we employ a soft clustering approach, a Gaussian mixture model (GMM) [30], to estimate the density of the source-domain features, and we evaluate the likelihood of target-domain samples being associated with a specific class. Leveraging the GMMs, we improve the measurement of the domain gap and augment specific knowledge of the target domain to tailor the model more effectively to target-domain data, focusing on selecting samples that exhibit the largest domain gap.
We utilize a GMM to match source feature distributions due to its capability to model complex distributions and estimate probability densities. GMMs measure the probability of a sample belonging to a cluster using probability densities, and the use of multiple weighted Gaussian components enables generalization to non-Gaussian training data. To acquire multiple prototypes of each class and the density distribution of the source domain, we utilize the feature extractor, $E$, to extract a feature map $f_i^s = E(x_i^s) \in \mathbb{R}^{N \times H \times W}$ from each source image, $x_i^s$, where $N$ is the channel number of the feature map and there are $H \times W$ pixel-level feature vectors in each feature map. We consider the pixel features that are correctly classified:

$$\mathcal{F}^c = \left\{ f_{i,j}^s \;\middle|\; \hat{y}_{i,j}^s = \bar{y}_{i,j}^s = c \right\},\qquad(4)$$

where $\hat{y}_i^s$ represents the prediction made by the initial warm-up model and $\bar{y}_i^s$ is the corresponding ground truth of $x_i^s$. (To align with the feature map's dimensions, $\hat{y}_i^s$ is taken from the Softmax output before the decoder of the model performs up-sampling, sized $H \times W$, whereas $\bar{y}_i^s$ is derived by down-sampling the ground truth to $H \times W$.) Thus, $f_{i,j}^s$ denotes the pixel feature with index $j$, and $\mathcal{F}^c$ denotes the feature set of class $c$ in the source domain.
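A sketch of how the correctly classified pixel features of Equation (4) could be collected, assuming the encoder E returns a (B, N, h, w) feature map and the classifier F produces per-class scores on the same grid; both names and shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_feature_sets(encoder, classifier, images, labels, num_classes):
    """Collect correctly classified pixel features per class (Equation (4)).

    encoder/classifier: E and F of the warm-up model; images: (B, 3, H, W);
    labels: (B, H, W) ground truth at full resolution.
    """
    feats = encoder(images)                                  # (B, N, h, w)
    preds = classifier(feats).softmax(dim=1).argmax(dim=1)   # (B, h, w)
    # Down-sample the ground truth to the feature map's h x w grid.
    gt = F.interpolate(labels[:, None].float(), size=feats.shape[-2:],
                       mode="nearest").squeeze(1).long()     # (B, h, w)
    sets = {}
    for c in range(num_classes):
        mask = (preds == c) & (gt == c)                      # correct pixels of class c
        sets[c] = feats.permute(0, 2, 3, 1)[mask]            # (n_c, N) feature vectors
    return sets
```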
We employ GMMs to characterize class-specific data distributions for each feature set within $\{\mathcal{F}^c\}_{c=1}^{C}$. Formally, the GMM for class $c$, represented as the weighted sum of $K$ Gaussian distributions, can be expressed as follows:

$$p(f \mid c) = \sum_{k=1}^{K} \pi_k^c \, \mathcal{N}\!\left(f \mid \mu_k^c, \Sigma_k^c\right), \quad \mathcal{N}(f \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(f-\mu)^{\top}\Sigma^{-1}(f-\mu)\right),\qquad(5)$$

where $f \in \mathcal{F}^c$ is a feature vector of class $c$ in the source domain, $\mu_k^c$ and $\Sigma_k^c$ represent the mean vector and covariance matrix of the $k$-th Gaussian distribution, respectively, and $\pi_k^c$ denotes the mixture weight, with $\sum_{k=1}^{K}\pi_k^c = 1$. $d$ is the dimension of the feature vector $f$. In the GMM expression, $K$ prototypes are obtained, where each prototype is characterized by the mean vector that describes the center of its distribution in the feature space. In addition, the covariance matrix indicates whether the distribution expands or compresses in different directions, thus revealing the orientation and strength of the distribution's shape. We employ the Expectation-Maximization algorithm [30] to solve the GMM equations iteratively, and random sampling is applied to restrict the number of features to no more than 300,000 per class to avoid excessive memory usage.
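The per-class density estimation can be sketched with scikit-learn's `GaussianMixture`, which runs the EM algorithm internally; the number of prototypes `K` is an assumed hyperparameter here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(feature_sets, n_prototypes=3, max_feats=300_000, seed=0):
    """Fit one GMM per class on source pixel features (Equation (5)).

    feature_sets: dict {class_id: (n_c, d) array}; features are randomly
    subsampled to at most max_feats per class, as stated in the text.
    """
    rng = np.random.default_rng(seed)
    gmms = {}
    for c, feats in feature_sets.items():
        if len(feats) > max_feats:
            idx = rng.choice(len(feats), size=max_feats, replace=False)
            feats = feats[idx]
        # Full covariance captures the shape/orientation of each prototype.
        gmms[c] = GaussianMixture(n_components=n_prototypes,
                                  covariance_type="full").fit(feats)
    return gmms
```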
Active Target Region Selection Against Source Density. In the context of cross-domain active learning, we suggest that the segmentation network benefits more from target samples that exhibit greater dissimilarity to the source domain. To quantify dissimilarity, we calculate the probability value of target-domain samples under the source-domain density distribution. This computation serves as a measure of the significance of unlabeled target-domain samples for the cross-domain adaptive model. The complete process of our MARS is depicted in Figure 4.
Given a target image, $x_i^t$, we extract the feature map $f_i^t = E(x_i^t)$ using the model's encoder. Then, for a target pixel feature $f_{i,j}^t$ with index $j$, the maximum probability value under the GMMs of the source domain with $C$ categories is defined as the inter-domain similarity from the target pixel feature to the source domain:

$$\mathcal{P}_{i,j} = \max_{c \in \{1,\dots,C\}} p\!\left(f_{i,j}^t \mid c\right).\qquad(6)$$

Next, since we consider a superpixel as the smallest labeling unit of an active sample, we obtain the inter-domain similarity of a superpixel region by computing the mean of $\mathcal{P}_{i,j}$ within that superpixel region. For the target image, $x_i^t$, its superpixel regions are represented as in Equation (1), and the inter-domain similarity of the $k$ superpixels of this image is expressed as:

$$\mathcal{P}\!\left(s_i^m\right) = \frac{1}{\left|s_i^m\right|} \sum_{j \in s_i^m} \mathcal{P}_{i,j}, \quad m = 1, \dots, k,\qquad(7)$$

where $|s_i^m|$ is the number of pixels in superpixel $s_i^m$. Eventually, for all $N_t$ target-domain images, there are a total of $k \times N_t$ superpixels undergoing active region selection based on their inter-domain similarities, and the formula is as follows:

$$\mathcal{A} = \mathop{\arg\min}_{\mathcal{A} \subset S,\; |\mathcal{A}| = B} \; \sum_{s \in \mathcal{A}} \mathcal{P}(s).\qquad(8)$$

That is, with the labeling budget, $B$, we rank these regions by their similarity $\mathcal{P}(s)$ in ascending order and select the top $B$ regions with the lowest similarity as active samples.

Intuitively, the similarity definition in Equation (7) assigns target-domain samples (i.e., superpixel regions) to the closest category in the source domain. Utilizing inter-domain similarity enables us to identify target-domain samples that differ significantly from the entire source domain. By labeling these samples with their true categories, we acquire target-domain-specific information that is challenging to learn from pseudo-labels alone. Note that, for these active samples, we uniformly assign a class label to all image pixels within the same superpixel region, accepting that superpixels may not delineate edges exactly. This approach significantly reduces the labeling cost of active samples.
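A sketch of the MARS selection loop under the assumptions above. `score_samples` returns log-densities; since the logarithm is monotonic, taking the per-pixel maximum over class GMMs in the log domain ranks pixels exactly as Equation (6) does, and averaging log-densities within a region is used here as a numerically stable surrogate for the mean density of Equation (7).

```python
import numpy as np

def select_active_regions(gmms, target_feats, target_spx, budget):
    """Rank all target superpixels by inter-domain similarity and return
    the `budget` regions with the lowest similarity (Equation (8)).

    gmms: dict {class_id: fitted sklearn GaussianMixture} (source domain).
    target_feats: list of (h*w, d) pixel-feature arrays, one per image.
    target_spx: list of (h*w,) superpixel-id arrays on the same grid.
    Returns a list of (image_index, superpixel_id) active samples.
    """
    scored = []
    for i, (feats, spx) in enumerate(zip(target_feats, target_spx)):
        logp = np.stack([g.score_samples(feats) for g in gmms.values()])
        sim = logp.max(axis=0)                           # per-pixel max over classes
        for s in np.unique(spx):
            scored.append((sim[spx == s].mean(), i, s))  # region-level similarity
    scored.sort(key=lambda t: t[0])                      # ascending similarity
    return [(i, s) for _, i, s in scored[:budget]]
```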
3.3. Source-Weighted Class-Balanced Self-Training
In our study, the segmentation model, guided by the proposed AL strategy, is equipped to grasp specific knowledge of the target domain. However, unlike conventional self-training in UDA, which learns consistent category features across both domains, our model may encounter disruption if the source data are trained on without any filtering. This interference may arise from source-domain samples that deviate significantly from the target-domain characteristics. Hence, we present a method that degrades source-domain samples by utilizing the clustering centers of the target domain, mitigating their impact on domain adaptation through adaptive weighting of the source-domain training loss. The detailed procedure is illustrated in Figure 5, where source- and target-domain images are trained simultaneously by optimizing two cross-entropy (CE) losses. We incorporate the distance from the source-domain pixels to the target domain and the class-wise average entropy to obtain the weighting map, which is used in the optimization of the source-domain CE loss.
Target Clustering Center Generation. To maximize the utilization of a priori knowledge in the target-domain data, we employ the fine-tuned model guided by active labels to compute the pixel features of unlabeled target-domain samples by $f_i^t = E(x_i^t)$, and we obtain the set of target-domain features $\mathcal{F}^t = \{f_{i,j}^t\}$. As the target-domain data are unlabeled, we apply the unsupervised clustering algorithm K-means to this feature set to derive center anchor features that effectively represent the distribution of the target data. Specifically, we organize them into $V$ clusters, aiming to minimize the error:

$$\min \sum_{v=1}^{V} \sum_{f \in \mathcal{F}_v^t} \left\| f - a_v \right\|_2^2,\qquad(9)$$

where $\|\cdot\|_2^2$ represents the squared Euclidean distance and $a_v$ is the centroid of cluster $v$:

$$a_v = \frac{1}{\left|\mathcal{F}_v^t\right|} \sum_{f \in \mathcal{F}_v^t} f,\qquad(10)$$

where $|\mathcal{F}_v^t|$ indicates the quantity of features assigned to cluster $v$. The centroids $\{a_v\}_{v=1}^{V}$ are then employed to assess the domain adaptability of the source samples during the self-training process. Notably, because the clustering centers of the initial fine-tuned model may be biased with respect to the actual target distribution, we dynamically update the centroids during self-training: after a specific number of iterations, we conduct inference on the target-domain training images, and the new pixel features are used to update the centers $\{a_v\}_{v=1}^{V}$.
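The centroid generation can be sketched with scikit-learn's `KMeans`; the cluster count `V` is an assumed hyperparameter, and the function is simply re-run on freshly extracted features to refresh the centers during self-training.

```python
import numpy as np
from sklearn.cluster import KMeans

def target_centroids(target_feats: np.ndarray, v: int = 10, seed: int = 0):
    """Cluster unlabeled target pixel features into V center anchors
    (Equations (9)-(10)).

    target_feats: (n, d) array of pixel features extracted by E.
    Returns the (V, d) centroids a_v.
    """
    km = KMeans(n_clusters=v, n_init=10, random_state=seed).fit(target_feats)
    return km.cluster_centers_
```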
Source Weighting Class-Balanced Factor. To filter out challenging source samples or those with insufficient contribution to domain adaptation, we assess each sample's level of contribution by measuring its feature distance to the target centroids. The initial step involves computing the distance of the source feature to the target domain as follows:

$$d_{i,j}^s = \min_{v \in \{1,\dots,V\}} \left\| f_{i,j}^s - a_v \right\|_2^2,\qquad(11)$$

where $f_i^s = E(x_i^s)$ is the feature map of the source image, $x_i^s$, extracted by the encoder, $E$; $j$ is the index on the feature map, $f_i^s$; and $d_{i,j}^s$ represents the nearest distance of the source feature from the target centers.
Based on this distance, we define the source weighting factor of the source image, $x_i^s$, at each pixel $j$ as:

$$w_{i,j}^s = \exp\!\left(-\frac{d_{i,j}^s}{\bar{d}}\right),\qquad(12)$$

where $\bar{d}$ denotes the average distance of $d_{i,j}^s$ over all source images and is used for normalization in the equation. During training, $\bar{d}$ is computed via an exponential moving average as follows:

$$\bar{d} \leftarrow \lambda\,\bar{d} + (1-\lambda)\,\overline{d^s},\qquad(13)$$

where $\overline{d^s}$ is the mean of $d_{i,j}^s$ over the current batch and $\lambda$ denotes the smoothing parameter that avoids large fluctuations.
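A sketch of the per-pixel weighting map of Equations (11)-(13), assuming the exponential weighting form of Equation (12); the EMA coefficient `lam` is a placeholder value.

```python
import torch

def source_weight_map(src_feats, centroids, d_bar, lam=0.99):
    """Per-pixel source weighting factor (Equations (11)-(13)).

    src_feats: (B, N, h, w) source feature maps from E.
    centroids: (V, N) target anchors a_v.
    d_bar: running mean distance (scalar tensor), updated by EMA.
    Returns the (B, h, w) weight map and the updated d_bar.
    """
    b, n, h, w = src_feats.shape
    flat = src_feats.permute(0, 2, 3, 1).reshape(-1, n)    # (B*h*w, N)
    # Squared Euclidean distance to the nearest target centroid, Eq. (11).
    d = torch.cdist(flat, centroids).pow(2).min(dim=1).values
    d_bar = lam * d_bar + (1 - lam) * d.mean()             # EMA, Eq. (13)
    weights = torch.exp(-d / d_bar)                        # Eq. (12)
    return weights.reshape(b, h, w), d_bar
```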
Additionally, as illustrated in Figure 1, there is a noticeable difference in category distribution between the domains, which impacts the model's training on the target domain, especially for categories that are dominated by the source domain and have few samples in the target domain. To address this issue, we utilize the prediction entropy of the segmentation model to measure the training difficulty of the target-domain data on each category; this avoids repeated training on samples from categories with a larger proportion in the source domain, while paying more attention to hard-to-learn classes in the target domain.
Hence, we employ the class-wise average entropy, $\hat{H}_c$, to achieve class-balanced self-training. As depicted in Figure 5, we acquire the Softmax output of the fine-tuned model on the target image, $x_i^t$, to compute the entropy for each image. This entropy is then combined with the pseudo-label to calculate the average entropy under each class. Specifically, we use the pretrained segmentation model to output per-pixel classification predictions for the target-domain training images. Let the classification probability prediction for a certain pixel be $p = (p_1, p_2, \dots, p_C)$, where $C$ is the number of categories and $p_i$ represents the probability that the pixel belongs to category $i$; we then use the entropy formula $H = -\sum_{i=1}^{C} p_i \log p_i$ to obtain the prediction entropy for each pixel. Finally, we calculate the average entropy, $\hat{H}_c$, for each category by aggregating the pixel-level prediction entropy according to the classification categories of the pseudo-labels.
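A sketch of the class-wise average entropy computation; max-normalizing the per-class entropies to [0, 1] is our assumption for the "normalized value" used below.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_average_entropy(logits, pseudo_labels, num_classes, eps=1e-8):
    """Class-wise average prediction entropy on the target domain.

    logits: (B, C, H, W) model outputs on target training images;
    pseudo_labels: (B, H, W) with 255 marking ignored pixels.
    Returns a (C,) tensor, max-normalized to [0, 1] (assumed normalization).
    """
    p = F.softmax(logits, dim=1)
    ent = -(p * torch.log(p + eps)).sum(dim=1)     # per-pixel entropy H
    avg = torch.zeros(num_classes)
    for c in range(num_classes):
        mask = pseudo_labels == c
        if mask.any():
            avg[c] = ent[mask].mean()              # average H over class-c pixels
    return avg / avg.max().clamp(min=eps)
```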
Finally, we adjust the source weighting factor by multiplying the inter-domain nearest distance by the class-wise average entropy, $\hat{H}_c$, as follows:

$$\tilde{w}_{i,j}^s = \exp\!\left(-\frac{d_{i,j}^s \cdot \hat{H}_c}{\bar{d}}\right),\qquad(14)$$

where $\hat{H}_c$ is the normalized value of the class-wise average entropy corresponding to the ground-truth category of pixel $j$. In this way, the class-wise average entropy limits the degree of degradation of the source-domain samples on that category, thus mitigating the effects of inter-domain category imbalance. Then, we adjust the loss function for the source data by applying the class-balanced weighting factor, and the cross-entropy loss for the source image $x_i^s$ is expressed as:

$$\mathcal{L}_{ce}^s = -\frac{1}{HW}\sum_{j=1}^{HW} \tilde{w}_{i,j}^s \sum_{c=1}^{C} y_{i,j}^{s,(c)} \log p_{i,j}^{s,(c)},\qquad(15)$$

where $\tilde{w}_{i,j}^s$, $y_{i,j}^s$, and $p_{i,j}^s$ denote the source weighting factor, the (one-hot) ground truth, and the predicted probability on the $j$-th pixel, respectively.
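To close the loop, a sketch of the weighted source CE loss of Equation (15); PyTorch's `cross_entropy` with `reduction="none"` yields the per-pixel loss to be modulated by the adjusted weight map of Equation (14).

```python
import torch
import torch.nn.functional as F

def weighted_source_ce(logits, labels, weight_map, ignore_index=255):
    """Source-domain CE loss modulated by the adjusted per-pixel weights
    (Equations (14)-(15)).

    logits: (B, C, H, W); labels: (B, H, W) long; weight_map: (B, H, W).
    """
    per_pixel = F.cross_entropy(logits, labels, reduction="none",
                                ignore_index=ignore_index)   # (B, H, W)
    valid = labels != ignore_index
    return (weight_map * per_pixel)[valid].mean()
```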