
Journal Pre-proof

A deep reinforcement active learning method for multi-label image classification

Qing Cai, Ran Tao, Xiufen Fang, Xiurui Xie, Guisong Liu

PII: S1077-3142(25)00074-8
DOI: https://doi.org/10.1016/j.cviu.2025.104351
Reference: YCVIU 104351
To appear in: Computer Vision and Image Understanding
Received date : 2 August 2024
Revised date : 27 March 2025
Accepted date : 27 March 2025

Please cite this article as: Q. Cai, R. Tao, X. Fang et al., A deep reinforcement active learning
method for multi-label image classification. Computer Vision and Image Understanding (2025),
doi: https://doi.org/10.1016/j.cviu.2025.104351.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the
addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive
version of record. This version will undergo additional copyediting, typesetting and review before it
is published in its final form, but we are providing this version to give early visibility of the article.
Please note that, during the production process, errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.

© 2025 Elsevier Inc. All rights are reserved, including those for text and data mining, AI training,
and similar technologies.

Highlights
A Deep Reinforcement Active Learning Method for Multi-Label Image Classification
Qing Cai, Ran Tao, Xiufen Fang, Xiurui Xie, Guisong Liu

• Propose a multi-label reinforcement active learning module, which combines reinforcement learning with active learning to reduce the expected error in multi-label image classification.
• Consider the correlation between labels, and dynamically select the samples that are most informative to the current model during the training process.
• Compared with state-of-the-art models, the proposed model outperforms existing approaches on various tasks.

A Deep Reinforcement Active Learning Method for Multi-Label Image Classification

Qing Cai a,b, Ran Tao a,b, Xiufen Fang c,∗, Xiurui Xie c and Guisong Liu a,b

a School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, 611130, Sichuan, China
b Complex Laboratory of New Finance and Economics, Southwestern University of Finance and Economics, Chengdu, 611130, Sichuan, China
c University of Electronic Science and Technology of China, Chengdu, 611731, Sichuan, China

∗ Corresponding author. [email protected] (Q. Cai); [email protected] (R. Tao); [email protected] (X. Fang); [email protected] (X. Xie); [email protected] (G. Liu). ORCID(s): 0009-0008-4495-2958 (Q. Cai)

ARTICLE INFO

Keywords: Active Learning, Multi-label Image Classification, Deep Learning, Reinforcement Learning

ABSTRACT

Active learning is a widely used method for addressing the high cost of sample labeling in deep learning models and has achieved significant success in recent years. However, most existing active learning methods focus only on single-label image classification and have limited application in the context of multi-label images. To address this issue, we propose a novel multi-label active learning approach based on a reinforcement learning strategy. The proposed approach introduces a reinforcement active learning framework that accounts for the expected error reduction in multi-label images, making it adaptable to multi-label classification models. Additionally, we develop a multi-label reinforcement active learning module (MLRAL), which employs an actor-critic strategy and the proximal policy optimization (PPO) algorithm. Our state and reward functions consider multi-label correlations to accurately evaluate the potential impact of unlabeled samples on the current model state. We conduct experiments on various multi-label image classification tasks, including VOC 2007, MS-COCO, NUS-WIDE and ODIR. We also compare our method with multiple classification models, and experimental results show that our method outperforms existing approaches on various tasks, demonstrating the superiority and effectiveness of the proposed method.

1. Introduction

Obtaining large amounts of unlabeled data is relatively easy Wu, Sheng, Zhang, Li, Dadakova, Swisher, Cui and Zhao (2020). However, training a deep learning model requires a significant number of labeled samples, which is expensive due to the need for human experts to perform manual labeling. The annotation of multi-label data is much more laborious than that of single-label samples, since it needs to consider both the presence and absence of each possible label. Different from single-label classification, multi-label classification requires specialized metrics like the F1 score, Jaccard index, or Hamming loss to measure performance Zhang and Zhou (2013). Besides, single-label models are unable to capture label dependencies and handle multi-output predictions, which makes them fail in multi-label contexts Liu, Wang, Shen and Tsang (2021b). Active learning methods have been proposed to address this issue: they train a high-performing model by selecting only a small number of useful samples for labeling Wang, Yan, Zhang, Cao, Yang and Ng (2020b).

Figure 1: Through an active learning strategy, samples are selected from the unlabeled set and added to the labeled set after being labeled by experts. Then, the model is trained using the labeled sample set.

Active learning has been proposed as a solution to address the issue of high sample labeling costs Ren, Xiao, Chang, Huang, Li, Gupta, Chen and Wang (2022). The process of active learning is illustrated in Fig. 1, wherein unlabeled samples are selected from a sample pool and added to the labeled data set after annotation to train a more effective classification model. In recent years, active learning has demonstrated considerable success in various applications Yuan, Chang, Liu, Yang, Wang, Shu, He and Shi (2023); Yoo and Kweon (2019); Zhao, Qin, Feng, Zhu, Sun, Li and Jia (2023); Sener and Savarese (2017); Gal, Islam and Ghahramani (2017). However, most of these works focus on single-label image classification tasks, and only a limited number of studies have applied active learning to more complex and expensive multi-label image tasks.

Multiple labels are commonly present in real-life images across various applications, such as medical image recognition Chen, Li, Lu and Zhang (2020), scene understanding

Zhu, Liu, Liu, Ge, Liu and Cao (2023), and human attribute recognition Li, Chen, Chen and Tang (2016), among other realistic scenarios. With the rapid development of deep learning, recent studies Liu, Zhang, Yang, Su and Zhu (2021a); Chen, Wei, Wang and Guo (2019) have introduced methods such as the Graph Convolutional Network (GCN) and the Transformer Zhou, Dou, Su, Hu and Zheng (2023); Carion, Massa, Synnaeve, Usunier, Kirillov and Zagoruyko (2020), which have demonstrated superior performance on large-scale multi-label image datasets Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick (2014); Chua, Tang, Hong, Li, Luo and Zheng (2009). However, compared to traditional machine learning methods, deep multi-label classification models require a larger number of labeled data samples for training. Currently, the labeling of multi-label data must be performed by human experts, which is an expensive process. Moreover, the annotation of complex images, such as medical images, can be extremely time-consuming Wang, Yan, Zhang, Cao, Yang and Ng (2020a).

Existing multi-label active learning research can be classified based on the method used for querying unlabeled sample labels. The majority of multi-label active learning strategies are designed to query all labels of the selected unclassified samples Wang, Zhang and Guo (2012); Gui, Lu and Yu (2021); Li and Guo (2013). Some multi-label active learning strategies have been developed to query the correlation between sample-label pairs Vasisht, Damianou, Varma and Kapoor (2014); Huang, Chen and Zhou (2015); Chen, Wang, Lu and Wang (2022), i.e., to analyze whether a particular label is related to the selected example. However, the strategy of querying all labels of an instance may result in redundant information and require more annotations in real-world problems with a large number of labels Reyes, Morell and Ventura (2018). Furthermore, these active learning strategies often rely on specific hand-designed heuristics, which is a limitation: manually designed sample selection strategies lack generalization ability Hu, Guo, Cordy, Xie, Ma, Papadakis and Le Traon (2021).

To address this issue, researchers have introduced the concept of reinforcement learning into active learning Epshteyn, Vogel and DeJong (2008); Wang et al. (2020a); Ménard, Domingues, Jonsson, Kaufmann, Leurent and Valko (2021). Reinforcement learning is a classic artificial intelligence method that has been extensively used in game AI and robotics. It constructs an intelligent agent that learns the optimal strategy by interacting with the environment and receiving reward signals. The agent adjusts its behavior through trial and error to maximize the cumulative long-term reward Milani, Topin, Veloso and Fang (2024). During active learning, the external environment of the entire system changes each time the labeled sample set is updated. A reinforcement learning algorithm can dynamically perceive changes in the external environment and adjust its actions, which is why reinforcement learning (RL) is needed in active learning for multi-label classification. With the development of deep learning in recent years, several deep reinforcement learning algorithms based on DQN Volodymyr, Mnih, Koray, Kavukcuoglu, David, Silver, Andrei, A, Rusu and Joel (2015) and actor-critic have been derived, leading to some notable achievements. Reinforcement learning has also been explored in multi-label classification tasks, where it is employed to predict the labels of multi-label images Chen, Wang, Li and Lin (2018). However, there is currently no research that uses reinforcement learning to choose valuable samples in multi-label classification tasks. Furthermore, recent studies have utilized deep reinforcement learning in the development of active learning strategies for single-label image classification tasks Wang et al. (2020a).

In this study, we propose a novel reinforcement active learning framework and algorithm for multi-label image classification tasks. In contrast to traditional multi-label active learning strategies that rely on hand-designed heuristics Wang et al. (2020a), we model the active learning process as a Markov decision process and leverage deep reinforcement learning to dynamically evaluate the effectiveness of unlabeled samples for the current model, based on the classifier's state. Moreover, we incorporate the correlation between labels into our strategy to learn the latest sampling policy and select unlabeled samples that are most informative for the current classification model. Specifically, we adopt the actor-critic method to generate and evaluate data selection decisions and employ the proximal policy optimization (PPO) algorithm Schulman, Wolski, Dhariwal, Radford and Klimov (2017) to train our model.

The main contributions of this work are summarized as follows:

1. A reinforcement active learning framework is proposed, which combines reinforcement learning with active learning to reduce the expected error in multi-label image classification. This framework can be used to dynamically evaluate the usefulness of samples for improving model performance and address the challenge of training classifiers in a changing environment.

2. A reinforcement active learning module is proposed which combines the use of an actor-critic strategy with proximal policy optimization (PPO). This algorithm not only takes into account the correlation between labels, but also dynamically selects samples that are most useful and informative to the current model during the training process.

The remainder of this paper is organized as follows. Section 2 provides an overview of related works. Section 3 presents our active learning method, the multi-label reinforcement active learning module (MLRAL). Section 4 introduces the datasets and relevant settings, and presents the analysis of experimental results. Finally, we summarize the main research contents and experimental findings of this work.

2. Related Works

2.1. Multi-label active learning

Active learning employs a sampling strategy to select informative samples from unlabeled data for labeling, resulting in improved model performance under the same labeling

budget. In the field of multi-label classification, several studies have been conducted on active learning. For instance, Reyes et al. Reyes et al. (2018) propose CVIRS, which is used to solve rank aggregation problems and category vector inconsistency. CVIRS selects one sample for labeling at a time; after multiple iterations, a batch of unlabeled samples can be chosen for labeling. However, when applied to datasets like MS-COCO Lin et al. (2014), the time complexity of CVIRS can be extremely high, and the selected samples may be redundant, leading to information redundancy.

Another active learning strategy is the batch mode approach, in which a batch of unlabeled examples is selected by considering the diversity of the selected examples in each iteration. To date, few works have focused on batch mode multi-label active learning Wu et al. (2020). The existing batch mode methods have high computational complexity and are challenging to apply to large-scale datasets. Consequently, existing multi-label active learning strategies are designed based on traditional classification models such as BR-SVM Li, Wang and Sung (2004), rendering them unsuitable for updated deep multi-label classification models like Query2Label Liu et al. (2021a).

Figure 2: Process of the actor-critic algorithm.

2.2. Reinforcement active learning

In recent years, reinforcement learning has achieved significant success Wang, Wang, Liang, Zhao, Huang, Xu, Dai and Miao (2022). With the advancement of deep learning, deep reinforcement learning has also shown strong capabilities in decision-making problems. Deep Q-learning employs deep neural networks to extend traditional Q-learning to learn Q-value functions Volodymyr et al. (2015). The Deep Deterministic Policy Gradient (DDPG) algorithm is proposed to adapt deep Q-learning to continuous action spaces Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver and Wierstra (2015). The DDPG algorithm is based on the actor-critic framework, as illustrated in Fig. 2. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm expands upon the DDPG approach by introducing a pair of critics for a single actor, resulting in enhanced performance and efficiency Fujimoto, Hoof and Meger (2018). The Proximal Policy Optimization (PPO) algorithm diverges from the conventional policy gradient technique, as it proposes a novel objective function that facilitates small-batch updates over multiple iterations. This not only simplifies algorithm training, but also improves sample complexity Schulman et al. (2017). Active learning researchers have also explored the use of actor-critic methods for dynamic sampling strategies in single-label medical image classification: J. Wang et al. Wang et al. (2020a) applied the DDPG algorithm to train a model that achieved a superior sampling strategy and improved the effectiveness of model training.

This paper introduces a novel reinforcement active learning framework and algorithm for multi-label image classification tasks. Our proposed approach differs from existing multi-label active learning strategies and conventional reinforcement active learning methods. The main differences and advantages of our work are reflected in the following two aspects. Firstly, most multi-label active learning methods are based on deep Bayesian networks and graph neural networks, which are difficult to make interact and evolve with the environment automatically; our work proposes a reinforcement active learning framework that can update the selection strategy dynamically based on the environment. Secondly, different from conventional reinforcement active learning methods that are only applied to single-label classification, our method proposes the first reinforcement active learning method for multi-label classification. It takes into account the correlation between labels to dynamically select useful samples, and clips the gradient to prevent invalid policy update deviation. Specifically, we incorporate deep reinforcement learning to dynamically assess the usefulness of a sample for the current model based on the classifier's state. Furthermore, we consider the inter-label relevance to develop a sampling strategy, enabling us to select unlabeled samples that are most effective for the current classification model.

Figure 3: Illustration of our proposed MLRAL method. The left part of the figure represents the use of the PPO algorithm to update the actor network and the critic network through the replay buffer, and the right part reflects the operation process of the MLRAL.

3. Our Method

In Section 3.1, we provide an overview of the proposed reinforcement active learning framework. Then we introduce the multi-label reinforcement active learning (MLRAL) algorithm and how it is trained in Section 3.3.

3.1. Framework

In this section, we introduce the proposed reinforcement active learning framework. As shown in Fig. 3, this framework includes two parts: the basic classifier and the reinforcement active learning model. To train a multi-label image classifier, we utilize both labeled samples (i.e., (X_l, Y_l)) and an unlabeled sample pool (i.e., X_u) that will be selected and annotated during the active learning procedure. During the learning process, an actor network is trained to select the most informative samples from the unlabeled pool based on the current state and a learned policy. Next, an annotator labels the selected samples, thereby increasing the amount of labeled training data for gradually updating the classifier. Finally, a critic network is trained to evaluate the effectiveness of the actor network's sample selection in improving the classifier's performance. By utilizing a deep reinforcement learning approach to train both the actor and critic networks, our proposed method selects and annotates the most informative samples, leading to a more effective classifier and improved classification performance.

3.2. Classification module

Our reinforcement active learning framework is compatible with various conventional classification networks, including ML-GCN Chen et al. (2019), Query2Label Liu et al. (2021a), and other models that have demonstrated strong performance in multi-label image classification tasks. As with these conventional approaches, we define the classification loss as:

$$L = -\frac{1}{C}\sum_{c=1}^{C}\Big[y^{c}\log\big(\sigma(\hat{y}^{c})\big) + (1-y^{c})\log\big(1-\sigma(\hat{y}^{c})\big)\Big], \qquad (1)$$

where c is the class index, C is the number of classes, y^c and ŷ^c are the ground-truth and predicted labels of class c, respectively, and σ(·) is the sigmoid function.
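The per-class binary cross-entropy of Eq. (1) corresponds to the standard multi-label loss available in common deep learning libraries. The following minimal PyTorch sketch illustrates it; the function and variable names are ours and not part of the authors' implementation.

```python
import torch
import torch.nn.functional as F

def multilabel_bce_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # binary_cross_entropy_with_logits applies the sigmoid internally and
    # averages over both the batch and the class dimension, matching the
    # 1/C-averaged binary cross-entropy of Eq. (1) up to batch averaging.
    return F.binary_cross_entropy_with_logits(logits, targets.float())

# Example: a batch of 4 images with 20 candidate labels (e.g., VOC 2007).
logits = torch.randn(4, 20)                 # classifier outputs y_hat
targets = torch.randint(0, 2, (4, 20))      # multi-hot ground-truth labels y
loss = multilabel_bce_loss(logits, targets)
```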
3.3. Reinforcement active learning module

To guide the actor model in selecting samples, we propose the multi-label reinforcement active learning algorithm (MLRAL). Specifically, we define the key reinforcement learning terms, such as action, state, and reward, and demonstrate how to train the reinforcement learning network model. To address the issue that the optimization of the policy gradient update is prone to errors, our work adopts a policy clipping strategy to prevent policy update errors. In summary, the main novelties of our work are:

(1) Different from conventional reinforcement active learning methods that are only applied to single-label classification, our method proposes the first reinforcement active learning framework for multi-label classification, addressing the issue of aligning the multi-label spaces with the action and state spaces in reinforcement learning.

(2) A reinforcement active learning module is proposed that combines actor-critic strategies with multi-label classification. It takes into account the correlation between labels to dynamically select useful samples, and clips the gradient to prevent invalid policy update deviation.


3.3.1. State

To select more valuable samples for improving the performance of the classification model, we use the predictions of the current classifier to design the state matrix S ∈ (0, 1]^{N×C}, where N is the number of all unlabeled training samples and C is the total number of categories. The (i, j)-th element of S is defined as:

$$S_{i,j} = \sigma(y_{u_i} = j \mid x_{u_i}; \theta_c), \qquad (2)$$

where x_{u_i} is an unlabeled training sample, y_{u_i} is the corresponding unknown label of sample i, σ(·) is the sigmoid function, and θ_c represents the classifier used in the framework.
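In practice, the state of Eq. (2) is simply the stack of the current classifier's sigmoid outputs over the unlabeled pool. A minimal sketch, assuming a standard PyTorch classifier and a data loader over the unlabeled pool that yields image batches (hypothetical names):

```python
import torch

@torch.no_grad()
def compute_state(classifier, unlabeled_loader, device="cpu"):
    """Stack the classifier's sigmoid outputs over the whole unlabeled pool."""
    classifier.eval()
    rows = []
    for images, _ in unlabeled_loader:           # pool labels are unknown and ignored
        logits = classifier(images.to(device))   # shape (batch, C)
        rows.append(torch.sigmoid(logits).cpu())
    return torch.cat(rows, dim=0)                # state matrix S of shape (N, C)
```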
3.3.2. Action

The parameters of the actor network are defined as θ_a. The actor network takes the state S as input, and the output action selects samples from the unlabeled training dataset for labeling. We define the action as a vector a ∈ (0, 1)^N, in which each element corresponds to an unlabeled sample. We adopt the sigmoid function as the activation function for each element, and the learned policy π(S; θ_a) generates an action a from state S. After computing the action vector, we sort all candidate samples except the samples already selected in descending order, and select the top K samples with the highest values for annotation. The unlabeled samples chosen by the current action and the labels provided by the annotators are denoted as (X_{U_K}, Y_{U_K}). The updated training dataset is denoted as D_i = (X_L, Y_L) ∪ {(X_{U_K}, Y_{U_K})}_i, where i represents the current number of training rounds.
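The selection step can be sketched as follows, under the assumption that `actor` maps the state to one sigmoid score per unlabeled sample (the interface is illustrative, not the authors' exact code):

```python
import torch

def select_top_k(actor, state, already_selected, k):
    scores = actor(state).detach().clone()              # action vector a in (0, 1)^N
    if already_selected:                                 # never re-query a selected sample
        scores[list(already_selected)] = float("-inf")
    return torch.topk(scores, k).indices.tolist()        # indices of the K chosen samples
```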
3.3.3. State Transition

The samples selected by the action are annotated and merged into the labeled data. Then we minimize the loss function of Eq. (1) on the latest dataset (X_L, Y_L) to update the parameters of the classifier θ_c. Finally, the updated classifier is used to obtain a new state matrix S′ according to Eq. (2).

3.3.4. Reward

To enhance the classifier's performance, MLRAL incentivizes the actor network to prioritize samples that are most likely to be misclassified. To achieve this, we propose a novel reward function that takes into account the predicted values obtained from the classifier and the ground-truth labels provided by the annotators. Our algorithm performs Gaussian normalization on the reward results to make the features of different dimensions numerically comparable; the normalized rewards reflect the contribution of different features to model performance on the same scale. Specifically, for a selected sample x_{s_i}, we define k_{ij} as the true label for category j given by the labeling expert, and k̂_{ij} as the predicted label of the sample for category j obtained by the classifier. The reward function for sample x_{s_i} is:

$$r(S, a) = \frac{1}{K}\sum_{i=1}^{K}\sum_{j=1}^{C}\Big(\sigma(y_{s_{ij}} = \hat{k}_{ij} \mid x_{s_i}; \theta_c) - \sigma(y_{s_{ij}} = k_{ij} \mid x_{s_i}; \theta_c)\Big). \qquad (3)$$

The actor network in PPO achieves sample selection by prioritizing samples that are most likely to improve model performance. It calculates the difference between the ground-truth label and the predicted probability value for each category, and then sums these differences across all categories. If the difference is small, it indicates that the predicted probability values are close to the true labels. As in Eq. (3), a higher reward r corresponds to a larger gap between the predicted values and the ground-truth labels, which suggests that these samples are more likely to improve the model performance. Consequently, the actor network is incentivized to prioritize the selection of these misclassified samples during the learning process.
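One way to read Eq. (3) is that a correctly predicted category contributes zero, while a misclassified category contributes the probability gap between the predicted and the true label. The sketch below follows that reading and adds a simple standardisation mirroring the Gaussian normalization mentioned above; the exact normalization details are an assumption.

```python
import torch

def batch_reward(probs: torch.Tensor, true_labels: torch.Tensor) -> torch.Tensor:
    """Reward of Eq. (3) for K selected samples (probs, true_labels: K x C)."""
    pred_labels = (probs >= 0.5).float()                      # predicted labels k_hat
    prob_of_pred = torch.where(pred_labels == 1, probs, 1 - probs)
    prob_of_true = torch.where(true_labels == 1, probs, 1 - probs)
    return (prob_of_pred - prob_of_true).sum(dim=1).mean()    # (1/K) sum_i sum_j (...)

def gaussian_normalize(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardise a collection of rewards to zero mean and unit variance."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```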

Figure 4: The network structure of the PPO, which consists of an actor network and a critic network. The input of the actor network is the state s, and the output is the probability P_i of the action. The input of the critic network is the state, and the output is the value V(s) of the state.

3.3.5. Reinforcement training

In the reinforcement training process, we adopt the Proximal Policy Optimization (PPO) algorithm Schulman et al. (2017) to optimize the reward function, as shown in Fig. 4. Since the action space of our task is discrete, the input of the actor network is the state S, and the output is the probability P_i of the action. The critic network also takes the state S as input, and its output is the value of the state, V(s). PPO is an improvement of the policy gradient algorithm, whose core idea is to optimize the policy π(S; θ_a) to improve the reward. The loss function of the policy gradient is:

$$L(\theta_a) = \sum_{t=1}^{T}\log\big(\pi(S_t; \theta_a)\big)\, f(S_t, a_t), \qquad (4)$$

where f(S_t, a_t) is the evaluation of the action a_t in the state S_t. There are many ways to calculate f(S_t, a_t), such as the advantage function, which is defined as:

$$A^{\pi}(S_t, a_t) = Q^{\pi}(S_t, a_t) - V^{\pi}(S_t), \qquad (5)$$

where π is the policy, Q is the Q-value function, and V is the value function. The advantage function is an estimate of the dominance of action a in state S compared to the average value.

The estimate of the advantage function may not be completely accurate, and the data can be biased in the policy gradient method, which may result in policy update deviation. To address this issue, the PPO algorithm incorporates the concept from Schulman, Levine, Abbeel, Jordan and Moritz (2015) and simplifies it. The objective function of the proxy objective method is defined as:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\Big], \qquad (6)$$

where r_t(θ) is the probability ratio between the current and old policies, Â_t is an estimator of the advantage function at timestep t, and ε is a hyperparameter; clip denotes clipping the update range of the current policy π(S; θ_a), i.e., PPO limits the update ratio of π(S; θ_a) to within 1 − ε and 1 + ε. In MLRAL, the actor network is optimized according to Eq. (6). When the deviation between the current strategy and the old strategy exceeds ε, the PPO algorithm clips the gradient to prevent policy update deviation; after clipping, the gradient becomes 0, which prevents further parameter updates in the neural network. In addition, the critic network is responsible for estimating the value of the state S, and its loss L(θ_cr) is defined as follows:

$$L(\theta_{cr}) = \big(V_{\theta_{cr}}(s_t) - V_t^{targ}\big)^2, \qquad (7)$$

where V_{θ_cr} is the output of the critic network, which is an estimate of the state value under the policy, and V_t^{targ} is the trajectory-based estimate of the action value, representing the sum of discounted future rewards.
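A compact sketch of Eqs. (6) and (7), together with simple three-layer fully connected actor and critic networks of the kind described in Section 4.2, is given below. The hidden width, the tensor interfaces, and the log-probability inputs are assumptions for illustration; this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the (flattened) state to one selection probability per sample."""
    def __init__(self, state_dim, n_samples, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_samples), nn.Sigmoid())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Maps the (flattened) state to a scalar state value V(s)."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, s):
        return self.net(s)

def ppo_losses(new_logp, old_logp, advantage, value, value_target, eps=0.2):
    """Clipped surrogate of Eq. (6) and the squared critic loss of Eq. (7)."""
    ratio = torch.exp(new_logp - old_logp)                    # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    actor_loss = -torch.min(ratio * advantage, clipped * advantage).mean()
    critic_loss = (value - value_target).pow(2).mean()        # Eq. (7)
    return actor_loss, critic_loss
```

The clipped minimum keeps the surrogate objective from rewarding policy updates that move the ratio r_t(θ) outside [1 − ε, 1 + ε], which is the gradient-clipping behaviour described above.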
Algorithm 1 outlines the proposed MLRAL algorithm. Here, the classification network is denoted as Θ_c and all unlabeled samples are denoted as U, where N represents the number of unlabeled samples. To start the training process, K_0 samples are randomly selected for labeling, creating the initial labeled training set D_0, and the classification model θ_c is trained using D_0. Subsequently, an active learning model such as the actor network in MLRAL selects samples to obtain a new labeled sample set D_1, completing one iteration. The feedback from the classification network is used during training to filter out samples that are more suitable for the current model, making an active learning model like the actor network in MLRAL suitable for different classification models.

Algorithm 1 Multi-label reinforcement active learning
Input: unlabeled training data U.
Initialize: Randomly select samples from U and construct the initial labeled training data D_0.
1: for epoch i do
2:   Compute the state S according to Eq. (2).
3:   Select K unlabeled training samples based on the actor: a = π(S; θ_a).
4:   Annotate these K samples X_{U_K} to get {(X_{U_K}, Y_{U_K})}_i.
5:   Update the labeled training data D_i = (X_L, Y_L) ∪ {(X_{U_K}, Y_{U_K})}_i.
6:   Train the classifier Θ_c^i based on D_i.
7:   Calculate the state S′ based on Eq. (2), and the reward r based on Eq. (3).
8:   Save the sample (S, a, S′, r) into the replay buffer.
9:   for each training sample for the actor and critic do
10:    Update the critic by Eq. (7).
11:    Update the actor by Eq. (6).
12:  end for
13: end for
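The control flow of Algorithm 1 can be summarised in code as follows. Every helper is a user-supplied callable (hypothetical interfaces that wrap the steps sketched earlier), so only the loop structure is fixed here, not any concrete implementation.

```python
def mlral_loop(compute_state, select_samples, annotate, train_classifier,
               compute_reward, ppo_update, num_epochs, replay_buffer):
    labeled_indices = set()
    for epoch in range(num_epochs):
        S = compute_state()                              # step 2, Eq. (2)
        chosen = select_samples(S, labeled_indices)      # step 3, a = pi(S; theta_a)
        annotate(chosen)                                 # step 4, expert labels
        labeled_indices.update(chosen)                   # step 5, update D_i
        train_classifier()                               # step 6, minimise Eq. (1)
        S_next = compute_state()                         # step 7, new state S'
        r = compute_reward(chosen)                       # step 7, reward of Eq. (3)
        replay_buffer.append((S, chosen, S_next, r))     # step 8
        for transition in replay_buffer:                 # steps 9-12
            ppo_update(transition)                       # Eqs. (6)-(7)
```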
4. Experiments

In this section, we evaluate the effectiveness of the proposed MLRAL method on various multi-label datasets. We first introduce the datasets and implementation details of the experiments, including the network structure and hyper-parameter settings, and then conduct a series of comparison experiments on different datasets and analyze the results to demonstrate the superiority of our approach.

4.1. Datasets

VOC 2007: The PASCAL Visual Object Classes Challenge (VOC 2007) is a classic multi-label image classification dataset, which includes 9,963 images and 20 categories Everingham, Van Gool, Williams, Winn and Zisserman (2010). There are 5,011 images in the training set and 4,952 images in the testing set. The training set is used to train the model, and the effect of the model is verified on the testing set.

MS-COCO: MS-COCO is a multi-label image recognition dataset constructed and maintained by Microsoft, which contains 82,081 training images and 40,504 testing images Lin et al. (2014). There are a total of 80 categories, and each image has approximately 2.9 labels. The number of images in this dataset is large, and the number of labels per image varies greatly, which makes it challenging to accurately classify the images in this dataset.

NUS-WIDE: NUS-WIDE is a multi-label dataset constructed by the National University of Singapore, which contains 269,648 samples and a total of 5,018 unique labels Chua et al. (2009). Six types of low-level features are extracted from these images, including a 64-D color histogram, a 144-D color correlogram, a 73-D edge direction histogram, a 128-D wavelet texture, 225-D blockwise color moments, and a 500-D bag of words based on SIFT descriptions, together with ground-truths for 81 categories that can be used for evaluation.

Ocular Disease Intelligent Recognition (ODIR): ODIR is a structured ophthalmic database of 5,000 patients with age, color fundus photographs of the left and right eyes, and doctors' diagnostic keywords Dugas, Jared, Jorge and Cukierski (2015). Trained human readers labeled the annotations with quality-control management. Patients are classified into eight labels: Normal (N), Diabetes (D), Glaucoma (G), Cataract (C), Age-related Macular Degeneration (A), Hypertension (H), Pathological Myopia (M) and Other diseases (O).
Figure 5: Example images of the four datasets: (a) VOC 2007, (b) MS-COCO, (c) NUS-WIDE, (d) ODIR.

4.2. Experimental Settings

In our experiments, we utilize high-accuracy multi-label classification models, including ML-GCN Chen et al. (2019) and Query2Label Liu et al. (2021a). The actor network and the critic network in deep reinforcement learning are composed of three fully connected layers with the same structure. The Adam optimizer is used for training with a learning rate of 0.001. The delay factor is set to γ = 0.99, and the exploration probability is set to λ = 0.05. We use a batch size of 16. In addition, we implement a noise exploration mechanism in the PPO algorithm, and run a set of experiments to obtain the optimal parameters; the resulting model accuracy and confidence are reported in Table 1. We have tested the training time of our method on MS-COCO with 12,000 samples on two Nvidia 3090 GPUs. The results show that our method requires 105 epochs and takes 7.5 hours in the training stage, and takes about 0.02 seconds per image in inference applications.

Table 1: Choice of the exploration probability λ (mAP %).
Dataset      λ = 0.01     λ = 0.05     λ = 0.1
MS-COCO      76.9±0.98    79.0±1.34    77.8±1.66
VOC 2007     89.7±0.58    91.0±0.66    90.6±1.54
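For reference, the settings above can be collected into a small configuration sketch. The grouping into a dictionary, the hidden width of 256, and the illustrative state and action sizes are our own assumptions; only the learning rate, delay factor, exploration probability, batch size, and three-layer structure come from the text.

```python
import torch
import torch.nn as nn

cfg = dict(lr=1e-3, gamma=0.99, explore_prob=0.05, batch_size=16, hidden=256)

def three_layer_mlp(in_dim, out_dim, hidden):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

# Illustrative dimensions only; the real state/action sizes depend on the pool.
actor = nn.Sequential(three_layer_mlp(in_dim=80, out_dim=1000, hidden=cfg["hidden"]),
                      nn.Sigmoid())
critic = three_layer_mlp(in_dim=80, out_dim=1, hidden=cfg["hidden"])
actor_opt = torch.optim.Adam(actor.parameters(), lr=cfg["lr"])
critic_opt = torch.optim.Adam(critic.parameters(), lr=cfg["lr"])
```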
4.3. Comparison on different datasets

In this section, we evaluate the effectiveness of our proposed method by comparing it with the random sampling method on the benchmark datasets VOC 2007, MS-COCO, NUS-WIDE and ODIR. We begin with the VOC 2007 dataset, where we initially select 600 images as the labeled dataset; in each sampling cycle, our active learning model selects 200 samples from the remaining unlabeled pool. For the MS-COCO dataset, we set the initial labeled dataset size to 2,000 images, and select an additional 1,000 images via active learning in each subsequent cycle. For the NUS-WIDE dataset, we use 50,000 images as the initial labeled dataset, and sample an additional 5,000 images in each iteration; in each iteration, the target module is trained for 150 epochs. For the ODIR dataset, we use 3,000 images as the initial labeled dataset, and sample an additional 500 images in each iteration; in each iteration, the target module is trained for 120 epochs. We then compare the performance of our model trained on the actively selected samples with that of the model trained on randomly selected samples, using the same volume of data.

The results of the mean average precision (mAP) comparison on the four datasets are displayed in Fig. 6. Our method achieves the greatest improvement in the multi-label classification model on the VOC 2007 dataset when the number of labeled samples reaches 2,000: the mAP increases from 55.10 % to 78 % compared with random sampling when using our model. The results on the MS-COCO dataset are shown in Fig. 6b. When the labeled sample size reaches 10,000, our method achieves an accuracy improvement of approximately 3 % compared to random sampling. According to the results in Fig. 6c, our method shows an improvement of approximately 2.3 % in accuracy compared to random sampling when the labeled sample size reaches 100,000 for the NUS-WIDE dataset. The error bars in Fig. 6 represent the variance of our method and the variance of random sampling. The results on the ODIR dataset are shown in Table 2; our method achieves an accuracy improvement of approximately 4.4 % compared to random sampling. Our method also has lower variance than random sampling, indicating that MLRAL is more stable across multiple training sessions.

4.4. Comparison on different classification models

To demonstrate the versatility of our method, we evaluate its performance on three state-of-the-art multi-label classification models: ASL Ben-Baruch, Ridnik, Zamir, Noy, Friedman, Protter and Zelnik-Manor (2020), Query2Label Liu et al. (2021a), and ML-GCN Chen et al. (2019).


Figure 6: Comparison results on different datasets (mAP % versus the number of labeled samples, random selection vs. our method): (a) accuracy on VOC 2007, (b) accuracy on MS-COCO, (c) accuracy on NUS-WIDE.

Table 2: Comparison results on different datasets (mAP %).

Methods          MS-COCO                        VOC 2007                       NUS-WIDE                         ODIR
                 8000-samples   10000-samples   2000-samples   3000-samples    90000-samples   100000-samples   3000-samples   4500-samples
ML-GCN-Random    56.4±3.41      60.5±3.08       55.1±2.61      74.0±2.40       45.2±2.07       47.6±1.86        55.4±1.73      59.3±1.63
ML-GCN-Active    62.2±2.48      63.6±1.44       78.6±1.44      83.4±1.08       48.2±1.52       50.7±1.33        58.6±1.44      63.8±1.21
Q2L-Random       77.2±1.60      79.1±1.45       86.4±1.12      90.6±0.91       47.5±2.18       50.4±1.71        61.7±2.04      65.4±1.59
Q2L-Active       79.8±1.31      81.2±1.14       87.3±1.01      91.0±0.66       50.8±1.43       53.2±1.35        64.9±1.24      69.2±1.18
ASL-Random       29.7±2.79      31.6±1.95       84.3±1.35      89.2±0.86       46.8±1.88       49.6±1.64        57.5±1.79      61.7±1.68
ASL-Active       31.9±2.37      33.9±1.81       86.4±1.08      89.7±0.75       49.0±1.76       52.3±1.52        61.2±1.77      66.2±1.49
lP
We conduct experiments using the same settings as in the previous experiment, testing our proposed active learning method on the MS-COCO, VOC 2007, and NUS-WIDE datasets. The results of the comparison are presented in Table 2. It can be observed that our method outperforms the conventional models on all three datasets when tested with the ASL Ben-Baruch et al. (2020), Query2Label Liu et al. (2021a), and ML-GCN Chen et al. (2019) models.

On the VOC 2007 dataset, our active learning method shows a more significant performance improvement on the ML-GCN model compared to the Q2L and ASL models. This can be attributed to the fact that the Q2L and ASL models are well pre-trained on the ImageNet dataset, while the ML-GCN model only uses the parameters of certain layers of a ResNet pre-trained on ImageNet. This suggests that active learning is particularly effective when the model has a shortage of training samples. Additionally, our results demonstrate that active learning does not have a negative impact even if the model has a well-trained initial state.

4.5. Comparison on different reinforcement learning algorithms

In order to provide further evidence of the effectiveness of different reinforcement learning algorithms, we test and compare our framework with PPO and DDPG Lillicrap et al. (2015). To ensure a fair comparison, we use the same network structure and experimental settings as in the previous experiments, with the only difference being the use of different reinforcement learning algorithms.

Table 3: MLRAL with DDPG and PPO on three datasets (mAP %).
Methods        VOC 2007    MS-COCO    NUS-WIDE
DDPG-Active    76±1.28     62±2.11    48±1.43
PPO-Active     79±1.08     64±1.44    51±1.33

The results in Table 3 demonstrate that the DDPG algorithm requires setting many hyperparameters for exploring the environment, and its training process is often slow and unstable Wang, Li, Lei, Yang and Pei (2024). PPO is improved from TRPO Schulman et al. (2015): it simplifies the calculation of the trust region of TRPO, resulting in stable training and easier parameter adjustment compared to DDPG. Therefore, the PPO algorithm is more stable than DDPG. We observe a decrease in accuracy of 3 % on VOC 2007 when using the DDPG approach to train the actor and critic networks. Furthermore, PPO outperforms DDPG by 2 % and 3 % on the MS-COCO and NUS-WIDE datasets, respectively, demonstrating its effectiveness in the active learning framework. Compared with another multi-label active learning method, HIN Xie, Tian, Luo, Liu, Wu and Qin (2023), our method achieves better performance on the VOC 2007, MS-COCO and NUS-WIDE datasets. The results are shown in Table 4.

Figure 7: Comparison of the training rounds on MS-COCO and NUS-WIDE: (a) results for MS-COCO, (b) results for NUS-WIDE.

Figure 8: Training on the imbalanced dataset VOC 2007: (a) the number of samples selected in different categories before and after training, (b) the accuracy on different categories before and after active learning.

Figure 9: Feature visualization for the MLRAL on various layers.

Figure 10: Correlation of a batch of samples using VOC 2007.

Table 4: Comparison with other multi-label active learning methods (mAP %).
Methods                  VOC 2007     MS-COCO     NUS-WIDE
HIN Xie et al. (2023)    86.1±1.08    70.3±1.31   53.6±1.20
MLRAL                    91.0±0.66    79±1.34     53.8±1.33

4.6. Comparison of the convergence speed

To further demonstrate the effectiveness of our method, we conduct experiments to evaluate the convergence speed on different datasets. The comparison results are presented in Fig. 7, which clearly show that our active learning method, MLRAL, leads to faster convergence during model training. As shown in Fig. 7(a), when the size of the training set is 10,000, the samples we actively select cause the model to converge at 80 epochs, while random sampling requires 130 epochs or more. For NUS-WIDE, we set the size of the training set to 100,000, and the samples we actively select cause the model to converge at approximately 100 epochs, while random sampling requires at least 140 epochs to converge. As the MS-COCO and NUS-WIDE datasets have many categories and large data volumes, the improvement


of the model accuracy decreases as the number of labeled samples increases. In general, the selected samples have obvious advantages in shortening the training time.

Figure 11: Visualization of the learned interdependent classifiers with our MLRAL method using VOC 2007: (a) t-SNE at the beginning of training, (b) t-SNE after training with our method.

4.7. Comparison with Transformer-based method

To further demonstrate the effectiveness of the proposed active learning strategy in a Transformer-based method, we apply one of the latest transformer-based backbones Lin, Cheng, Wu and Shen (2022) in the comparison. As in previous experiments, we compare our method with random-based active learning methods, and we also compare it with the conventional ResNet101 backbone.

As the comparison results in Fig. 12 show, the accuracy of our active learning method outperforms that of the random-based one in the transformer-based model, with the mAP increasing from 72.92 % to 74.21 %. This further proves the effectiveness of the proposed method. In addition, the mAP of the transformer-based method is higher than that of the ResNet-based one, which also demonstrates the effectiveness of the transformer structure, especially on relatively larger training datasets.

Figure 12: Comparison results on the transformer-based model.

4.8. Further Analysis

Training on an imbalanced dataset At the beginning of this experiment, we chose 3,000 examples from VOC 2007, and the dataset is imbalanced (including 2,563 persons, 112 sheep, 342 sofas, and so on). After 120 epochs, the number of samples selected in the person category increases by 11.3 % (from 2,462 to 2,741), while that of the sheep category increases by 41.3 % (from 116 to 164). The result shows that our model tends to select samples from categories with a lower proportion in the imbalanced dataset, as shown in Fig. 8(a). The experimental results in Fig. 8(b) show that our method achieves a 4.9 % accuracy improvement for the person class, which has a relatively large sample size, and an 11.6 % improvement for the sheep class, which has a very small sample size. This shows that our algorithm yields a greater improvement in recognition accuracy for categories that make up a small proportion of the data.

Feature visualization To further analyze the learned features, we plot the feature maps shown in Fig. 9, which show the feature representations at different layers of the MLRAL. The first subfigure represents the input features, and the remaining subfigures visualize the intermediate layers of the classifier. As with conventional models, we can observe that the features become increasingly abstract as the network layers deepen.

Heatmap To illustrate the correlation of labels between samples, we randomly select a batch of data from VOC 2007 and generate the corresponding heatmap shown in Fig. 10. The heatmap shows the correlation between the categories, with larger values indicating greater correlation. This highlights the importance of considering label correlation in the algorithm and also reflects the challenge of multi-label classification.
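A heatmap of this kind can be produced directly from the multi-hot annotation vectors of a batch. The sketch below uses a simple pairwise co-occurrence count as the correlation measure and random placeholder labels; both choices are assumptions for illustration, not the exact statistic used for Fig. 10.

```python
import numpy as np
import matplotlib.pyplot as plt

labels = np.random.randint(0, 2, size=(64, 20))   # stand-in for a batch of VOC annotations
cooccurrence = labels.T @ labels                   # (C, C) pairwise co-occurrence counts

plt.imshow(cooccurrence, cmap="viridis")
plt.colorbar(label="co-occurrence count")
plt.xlabel("category index")
plt.ylabel("category index")
plt.title("Label correlation of a batch of samples")
plt.savefig("label_heatmap.png")
```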
Table 5: Algorithm complexity analysis.
Method                 FLOPs in training stage    FLOPs in inference stage    mAP (%)
Random-based method    7.96G                      7.96G                       60.2
Our method             8.16G                      7.96G                       64.1

Algorithm complexity analysis We test the computational FLOPs of the model on MS-COCO with 12,000 image samples and an input image size of 224×224. The FLOPs in both the training and inference stages are shown in Table 5, which shows that our method requires more FLOPs than the random-based method in the
training stage, because of the use of the reinforcement active learning module. The two methods have the same FLOPs in inference, because the enhanced active strategy is not used during the inference stage. Besides, our method achieves better performance than the random-based method.

Limitations From Fig. 8, we see that the class "bus" still has relatively low accuracy even after active learning. This is mainly because it has features similar to other classes such as "car" and at the same time has a lower amount of data. Thus, how to improve the accuracy of confusing classes with small data volumes is still a limitation of our work, which will be further explored in future works.

t-SNE visualization We utilize t-SNE to visualize the performance of the classifier learned by our proposed MLRAL method. The validation is conducted on VOC 2007, and the results are presented in Fig. 11. Initially, the labels appear independent; after the training, they are clustered based on the correlation between them. This visualization further emphasizes the importance of considering label associations in the multi-label classification task.

5. Conclusions

In this paper, we introduce a novel multi-label reinforcement active learning method, MLRAL. We utilize deep reinforcement learning to learn a policy that selects samples for annotation in a dynamic manner. Then, we train the model using a proximal policy optimization algorithm under the actor-critic paradigm. This approach allows the algorithm to dynamically select valuable samples for the current state during training, making it applicable to a wide range of classification models. Our extensive experiments on different datasets, models, and algorithms demonstrate the superiority of our method. In future work, we plan to explore the application of reinforcement active learning in more complex scenarios and to further improve the performance of active learning models in the field of multi-label image classification. We will also explore active learning methods on more complex models, such as Transformer-based multi-label classification methods Zhou et al. (2023); Zhao, Zhu, He, Cao and Dai (2024). Besides, better sample selection and computational acceleration strategies, such as self-supervised learning, contrastive active learning, hardware acceleration (distributed training, TPU/GPU clusters), and software-hardware co-optimization (edge computing and model compression), will also be explored in future works.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China, NO. 62376228.

References

Ben-Baruch, E., Ridnik, T., Zamir, N., Noy, A., Friedman, I., Protter, M., Zelnik-Manor, L., 2020. Asymmetric loss for multi-label classification. arXiv:2009.14119.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers, in: European conference on computer vision, Springer. pp. 213–229.
Chen, B., Li, J., Lu, G., Zhang, D., 2020. Lesion location attention guided network for multi-label thoracic disease classification in chest x-rays. IEEE Journal of Biomedical and Health Informatics 24, 2016–2027.
Chen, S., Wang, R., Lu, J., Wang, X., 2022. Stable matching-based two-way selection in multi-label active learning with imbalanced data. Information Sciences 610, 281–299.
Chen, T., Wang, Z., Li, G., Lin, L., 2018. Recurrent attentional reinforcement learning for multi-label image recognition, in: Proceedings of the AAAI conference on artificial intelligence.
Chen, Z.M., Wei, X.S., Wang, P., Guo, Y., 2019. Multi-label image recognition with graph convolutional networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5177–5186.
Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y., 2009. Nus-wide: a real-world web image database from national university of singapore, in: Proceedings of the ACM international conference on image and video retrieval, pp. 1–9.
Dugas, E., Jared, Jorge, Cukierski, W., 2015. Diabetic retinopathy detection. https://kaggle.com/competitions/diabetic-retinopathy-detection. Kaggle.
Epshteyn, A., Vogel, A., DeJong, G., 2008. Active reinforcement learning, in: Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008.
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88, 303–338.
Fujimoto, S., Hoof, H.V., Meger, D., 2018. Addressing function approximation error in actor-critic methods.
Gal, Y., Islam, R., Ghahramani, Z., 2017. Deep bayesian active learning with image data, in: International conference on machine learning, PMLR. pp. 1183–1192.
Gui, X., Lu, X., Yu, G., 2021. Cost-effective batch-mode multi-label active learning. Neurocomputing 463, 355–367.
Hu, Q., Guo, Y., Cordy, M., Xie, X., Ma, W., Papadakis, M., Le Traon, Y., 2021. Towards exploring the limitations of active learning: An empirical study, in: 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE. pp. 917–929.
Huang, S.J., Chen, S., Zhou, Z.H., 2015. Multi-label active learning: Query type matters, in: Twenty-Fourth International Joint Conference on Artificial Intelligence.
Li, X., Guo, Y., 2013. Active learning with multi-label svm classification, in: Twenty-Third International Joint Conference on Artificial Intelligence.
Li, X., Wang, L., Sung, E., 2004. Multilabel svm active learning for image classification, in: 2004 International Conference on Image Processing, ICIP'04, IEEE. pp. 2207–2210.
Li, Y., Chen, H., Chen, C.L., Tang, X., 2016. Human attribute recognition by deep hierarchical contexts, in: European Conference on Computer Vision.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2015. Continuous control with deep reinforcement learning. Computer Science.
Lin, H., Cheng, X., Wu, X., Shen, D., 2022. Cat: Cross attention in vision transformer, in: 2022 IEEE international conference on multimedia and expo (ICME), IEEE. pp. 1–6.
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft coco: Common objects in context, in: European conference on computer vision, Springer. pp. 740–755.
Liu, S., Zhang, L., Yang, X., Su, H., Zhu, J., 2021a. Query2label: A simple transformer way to multi-label classification. arXiv:2107.10834.
Liu, W., Wang, H., Shen, X., Tsang, I.W., 2021b. The emerging trends of multi-label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 7955–7974.
Ménard, P., Domingues, O.D., Jonsson, A., Kaufmann, E., Leurent, E., Valko, M., 2021. Fast active learning for pure exploration in reinforcement learning, in: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, PMLR.
Milani, S., Topin, N., Veloso, M., Fang, F., 2024. Explainable reinforcement learning: A survey and comparative review. ACM Computing Surveys 56, 1–36.
Ren, P., Xiao, Y., Chang, X., Huang, P., Li, Z., Gupta, B.B., Chen, X., Wang, X., 2022. A survey of deep active learning. ACM Comput. Surv.
Reyes, O., Morell, C., Ventura, S., 2018. Effective active learning strategy for multi-label learning. Neurocomputing 273, 494–508.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P., 2015. Trust region policy optimization, in: International conference on machine learning, PMLR. pp. 1889–1897.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms.
Sener, O., Savarese, S., 2017. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489.
Vasisht, D., Damianou, A., Varma, M., Kapoor, A., 2014. Active learning for sparse bayesian multilabel classification, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 472–481.
Volodymyr, Mnih, Koray, Kavukcuoglu, David, Silver, Andrei, A, Rusu, Joel, 2015. Human-level control through deep reinforcement learning. Nature.
Wang, J., Li, W., Lei, C., Yang, M., Pei, Y., 2024. An efficient and robust gradient reinforcement learning: Deep comparative policy. Journal of Intelligent & Fuzzy Systems, 1–16.
Wang, J., Yan, Y., Zhang, Y., Cao, G., Yang, M., Ng, K., 2020a. Deep reinforcement active learning for medical image classification, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part I 23, Springer. pp. 33–42.
Wang, J., Yan, Y., Zhang, Y., Cao, G., Yang, M., Ng, M.K., 2020b. Deep reinforcement active learning for medical image classification, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part I 23, Springer. pp. 33–42.
Wang, P., Zhang, P., Guo, L., 2012. Mining multi-label data streams using ensemble-based active learning, in: Proceedings of the 2012 SIAM international conference on data mining, SIAM. pp. 1131–1140.
Wang, X., Wang, S., Liang, X., Zhao, D., Huang, J., Xu, X., Dai, B., Miao, Q., 2022. Deep reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems 35, 5064–5078.
Wu, J., Sheng, V.S., Zhang, J., Li, H., Dadakova, T., Swisher, C.L., Cui, Z., Zhao, P., 2020. Multi-label active learning algorithms for image classification: Overview and future promise. ACM Computing Surveys (CSUR) 53, 1–35.
Xie, X., Tian, M., Luo, G., Liu, G., Wu, Y., Qin, K., 2023. Active learning in multi-label image classification with graph convolutional network embedding. Future Generation Computer Systems 148, 56–65.
Yoo, D., Kweon, I.S., 2019. Learning loss for active learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 93–102.
Yuan, D., Chang, X., Liu, Q., Yang, Y., Wang, D., Shu, M., He, Z., Shi, G., 2023. Active learning for deep visual tracking. IEEE Transactions on Neural Networks and Learning Systems.
Zhang, M.L., Zhou, Z.H., 2013. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26, 1819–1837.
Zhao, C., Qin, B., Feng, S., Zhu, W., Sun, W., Li, W., Jia, X., 2023. Hyperspectral image classification with multi-attention transformer and adaptive superpixel segmentation-based active learning. IEEE Transactions on Image Processing 32, 3606–3621.
Zhao, J., Zhu, J., He, J., Cao, G., Dai, C., 2024. Multi-label classification of retinal diseases based on fundus images using resnet and transformer. Medical & Biological Engineering & Computing 62, 3459–3469.
Zhou, W., Dou, P., Su, T., Hu, H., Zheng, Z., 2023. Feature learning network with transformer for multi-label image classification. Pattern Recognition 136, 109203.
Zhu, X., Liu, J., Liu, W., Ge, J., Liu, B., Cao, J., 2023. Scene-aware label graph learning for multi-label image classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1473–1482.


Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.

☐ The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests:

