Self-Paced Active Learning: Query the Right Thing at the Right Time
Ying-Peng Tang, Sheng-Jun Huang∗
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
Collaborative Innovation Center of Novel Software Technology and Industrialization
Nanjing 211106, China
{tangyp, huangsj}@nuaa.edu.cn
method is proposed next, followed by the experimental study. At last, we conclude this work.

of predicted labels, which could be unstable when the model is trained with very limited labeled data.
One challenge here is that the ground-truth labels of the selected data are unknown before querying. Inspired by (Wang and Ye 2013), we consider the upper bound of the risk by taking $\hat{y}_j = -\mathrm{sign}(f(x_j))$ as the pseudo label of $x_j \in U$. Then we have the following expected loss with self-paced weights:

$$\ell(f, w, v) = \sum_{i=1}^{n_l} (y_i - f(x_i))^2 + \sum_{j=1}^{n_u} v_j w_j \big( f(x_j)^2 + 2|f(x_j)| + 1 \big). \quad (3)$$

Obviously, by optimizing the above formulation, an informative instance with a small $|f(x_j)|$ will receive a large $w_j$. In other words, uncertain instances will be preferred in the active selection.

Then, with the most representative instances selected, the distributions of the labeled and unlabeled data should be close after the query, so that the model trained on the queried instances generalizes well to unseen data coming from the same distribution. We implement $h(\cdot)$ based on Maximum Mean Discrepancy (MMD) (Borgwardt et al. 2006; Gretton et al. 2006), which is a commonly used method for estimating the difference between two distributions. Formally, we have:

$$h(L \cup Q, U \setminus Q) = \mathrm{MMD}_\phi^2(L \cup Q, U \setminus Q)
= \bigg\| \frac{1}{n_l + b} \Big( \sum_{x_i \in L} \phi(x_i) + \sum_{x_j \in U} w_j \phi(x_j) \Big)
- \frac{1}{n_u - b} \sum_{x_j \in U} (1 - w_j) \phi(x_j) \bigg\|_{\mathcal{H}}^2, \quad (4)$$

where $\phi: \mathcal{X} \rightarrow \mathcal{H}$ is a mapping from the feature space to the Reproducing Kernel Hilbert Space (RKHS), and $w_j$ serves as an indicator variable, relaxed here to a continuous value in $[0, 1]$.

According to the discussion in previous work, $\mathrm{MMD}_\phi^2(p, q)$ vanishes if $p = q$. Therefore, our purpose is to ensure that the selected instances lead to a small value of the above formulation.
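To make this criterion concrete, the following is a minimal NumPy sketch (not from the authors' code; the function and argument names are ours) of the weighted MMD$^2$ in Eq. 4, computed with the kernel trick:

    import numpy as np

    def weighted_mmd_sq(K, L_idx, U_idx, w, b):
        """Weighted MMD^2 of Eq. 4 for a candidate weight vector w.

        K     : (n, n) kernel matrix over all instances
        L_idx : indices of the labeled set L
        U_idx : indices of the unlabeled pool U
        w     : query weights in [0, 1], one per instance in U
        b     : batch size
        """
        n_l, n_u = len(L_idx), len(U_idx)
        # Coefficients of the mean-embedding difference: labeled points get
        # 1/(n_l+b); unlabeled points get w_j/(n_l+b) - (1-w_j)/(n_u-b).
        c = np.zeros(K.shape[0])
        c[L_idx] = 1.0 / (n_l + b)
        c[U_idx] = w / (n_l + b) - (1.0 - w) / (n_u - b)
        # ||sum_i c_i phi(x_i)||_H^2 = c^T K c
        return float(c @ K @ c)

A small value indicates that querying according to $w$ keeps the augmented labeled set and the remaining pool distributionally close.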
With derivations similar to those in (Chattopadhyay et al. 2012), we can obtain a simpler formulation of the above problem:

$$\min_w h(L \cup Q, U \setminus Q) = \min_w \; w^T K_1 w + k w, \quad (5)$$

where

$$K_1 = \frac{1}{(n_u - b)^2} K_{UU}, \qquad
k = \frac{2}{(n_l + b)(n_u - b)} \mathbf{1}_{n_l} K_{LU} - \frac{2}{(n_u - b)^2} \mathbf{1}_{n_u} K_{UU},$$

$K$ is the kernel matrix and $K_{AB}$ denotes the sub-matrix of $K$ between sets $A$ and $B$; $\mathbf{1}_{n_l}$ and $\mathbf{1}_{n_u}$ are vectors with all elements being 1. By minimizing the above formulation, representative instances will receive large $w_j$.

Next we discuss how to implement the self-paced regularizer $g(\cdot)$, whose role is to control the optimization of the weight vector $v$ so that easy instances receive large $v_j$. Here we simply employ the strategy used in (Jiang et al. 2014):

$$g(v) = \frac{1}{2} \|v\|_2^2 - \sum_{j=1}^{n_u} v_j. \quad (6)$$

Note that $\lambda$ in Eq. 1 is the pace parameter. When $\lambda$ is small at the early stage, only a small subset of easy examples with small losses will be utilized. With more instances queried, the model becomes stronger, and harder examples can be involved as $\lambda$ iteratively increases during the learning process. It will be shown in the next subsection that, by minimizing the above formulation, easy examples receive large values of $v_j$.

Lastly, with the commonly used $\ell_2$ norm for controlling the model complexity, i.e., $\Omega(f) = \|f\|^2$, we can rewrite the objective function in Eq. 1 as follows:

$$\min_{f, w, v} \; \sum_{i=1}^{n_l} (y_i - f(x_i))^2
+ \sum_{j=1}^{n_u} \Big[ v_j w_j (\hat{y}_j - f(x_j))^2 + \lambda \big( \tfrac{1}{2} v_j^2 - v_j \big) \Big]
+ \mu (w^T K_1 w + k w) + \gamma \|f\|^2 \quad (7)$$
$$\text{s.t.} \quad w_j \in [0, 1], \; v_j \in [0, 1] \quad \forall j = 1 \cdots n_u.$$

As a result, we formulate the active selection procedure as a concise optimization problem, which incorporates easiness, informativeness and representativeness into a unified framework for self-paced active learning. Next, we discuss the optimization strategy of our method.
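For clarity, the sketch below shows how the terms of Eq. 7 combine for fixed quantities; it is purely illustrative (all names are ours) and assumes the model outputs and the pseudo labels have already been computed:

    import numpy as np

    def spal_objective(f_L, y_L, f_U, y_hat_U, w, v, K1, k, lam, mu, gamma, f_norm_sq):
        """Value of the objective in Eq. 7 for given f, w and v."""
        supervised = np.sum((y_L - f_L) ** 2)                 # loss on labeled data
        self_paced = np.sum(v * w * (y_hat_U - f_U) ** 2      # weighted pseudo-label loss
                            + lam * (0.5 * v ** 2 - v))       # self-paced regularizer g(v)
        representative = mu * (w @ K1 @ w + k @ w)            # MMD term of Eq. 5
        return supervised + self_paced + representative + gamma * f_norm_sq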
Optimization

We use an alternating optimization strategy (Bezdek and Hathaway 2003) to optimize the objective function in Eq. 7.

Optimize f with fixed v and w. Firstly, we introduce the method to optimize $f$ with $v$ and $w$ fixed. For simplicity, $f$ is implemented in the kernel form $f(x_i) = \sum_{x_k \in L} \theta_k k(x_k, x_i)$, where $k(\cdot, \cdot)$ is the kernel function. The task is then to learn $\theta$, which leads to the following optimization problem:

$$\min_\theta \; \sum_{i=1}^{n_l} \Big( y_i - \sum_{x_k \in L} \theta_k k(x_k, x_i) \Big)^2
+ \sum_{j=1}^{n_u} \Big[ v_j w_j \Big( \sum_{x_k \in L} \theta_k k(x_k, x_j) \Big)^2
+ 2 v_j w_j \Big| \sum_{x_k \in L} \theta_k k(x_k, x_j) \Big| \Big]
+ \gamma \theta^T K_{LL} \theta. \quad (8)$$

The alternating direction method of multipliers (ADMM) (Boyd et al. 2011) is employed to solve this problem. There are three key steps when performing ADMM on Eq. 8. Firstly, we construct an auxiliary variable $z$. Then, the augmented Lagrangian of the original function is constructed. Finally, we alternately optimize the original variable $\theta$, the auxiliary variable $z$, and the dual variable $\delta$ of the augmented Lagrangian. In the following we discuss the three steps in detail.

For the auxiliary variable, we let $z_j = \sum_{x_k \in L} \theta_k k(x_k, x_j)$ for each $x_j \in U$. Note that, for efficiency in optimizing $\theta$, we filter out the less important samples whose weight $w_j \cdot v_j$ is below a specified small threshold. Then the optimization problem can be rewritten as:

$$\min_\theta \; \sum_{i=1}^{n_l} \Big( y_i - \sum_{x_k \in L} \theta_k k(x_k, x_i) \Big)^2 + \gamma \theta^T K_{LL} \theta
+ \sum_{j=1}^{n_u} \Big[ v_j w_j z_j^2 + 2 v_j w_j |z_j| \Big] \quad (9)$$
$$\text{s.t.} \quad z_j - \sum_{x_k \in L} \theta_k k(x_k, x_j) = 0 \quad \forall j = 1 \cdots n_u.$$

The augmented Lagrangian is:

$$\sum_{i=1}^{n_l} \Big( y_i - \sum_{x_k \in L} \theta_k k(x_k, x_i) \Big)^2
+ \sum_{j=1}^{n_u} \Big[ v_j w_j z_j^2 + 2 v_j w_j |z_j|
+ \delta_j \Big( z_j - \sum_{x_k \in L} \theta_k k(x_k, x_j) \Big)
+ \frac{\rho}{2} \Big( z_j - \sum_{x_k \in L} \theta_k k(x_k, x_j) \Big)^2 \Big]
+ \gamma \theta^T K_{LL} \theta, \quad (10)$$

where $\rho$ is a parameter of ADMM.

Finally, denote by $\circ$ the element-wise product of vectors and by $(\cdot)_+$ the operation that sets the negative entries of its argument to 0, and let $y_l = [y_1, \cdots, y_{n_l}]^T$, $\eta = v \circ w$, $\epsilon = [\epsilon_1, \cdots, \epsilon_{n_u}]^T$ with $\epsilon_j = \eta_j + \frac{\rho}{2}$. Then we obtain the following updating rules:

$$\theta^{k+1} = A^{-1} r^T, \qquad
z^{k+1} = \mathrm{diag}(\epsilon)^{-1} \zeta, \qquad
\delta^{k+1} = \delta^k + \rho \big( z^{k+1} - K_{LU}^T \theta^{k+1} \big), \quad (11)$$

where

$$A = K_{LL} K_{LL}^T + \frac{\rho}{2} K_{LU} K_{LU}^T + \gamma K_{LL}, \qquad
r = y_l^T K_{LL}^T + \frac{1}{2} (\delta^k)^T K_{LU}^T + \frac{\rho}{2} (z^k)^T K_{LU}^T,$$
$$\zeta = \arg\min_\zeta \; \frac{1}{2} \|\zeta - o\|_2^2 + \sum_{j=1}^{n_u} \xi_j |\zeta_j|
= \mathrm{sign}(o) \circ (|o| - \xi)_+,$$
$$o = \frac{1}{2} \mathrm{diag}(\epsilon)^{-1} \big( \rho \cdot K_{LU}^T \theta^{k+1} - \delta^k \big), \qquad
\xi = \mathrm{diag}(\epsilon)^{-1} \eta.$$
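A compact NumPy sketch of one round of these updates is given below. It follows the rules in Eq. 11 as reconstructed above and is meant only as an illustration: the variable names are ours, K_LL and K_LU denote the labeled-labeled and labeled-unlabeled kernel blocks, and no claim is made that this matches the authors' implementation.

    import numpy as np

    def soft_threshold(o, xi):
        # Elementwise soft-thresholding sign(o) * max(|o| - xi, 0),
        # i.e., the closed-form zeta-update in Eq. 11.
        return np.sign(o) * np.maximum(np.abs(o) - xi, 0.0)

    def admm_step(theta, z, delta, K_LL, K_LU, y_l, eta, rho, gamma):
        """One ADMM round for the f-subproblem (Eq. 8), following Eq. 11."""
        eps = eta + rho / 2.0                                   # epsilon_j = eta_j + rho/2
        # theta-update: a regularized least-squares system A theta = r.
        A = K_LL @ K_LL.T + (rho / 2.0) * K_LU @ K_LU.T + gamma * K_LL
        r = K_LL @ y_l + 0.5 * K_LU @ delta + (rho / 2.0) * K_LU @ z
        theta = np.linalg.solve(A, r)
        # z-update: soft-thresholding in the rescaled variable zeta.
        o = 0.5 * (rho * K_LU.T @ theta - delta) / eps
        zeta = soft_threshold(o, eta / eps)
        z = zeta / eps
        # dual update on the constraint z_j = sum_k theta_k k(x_k, x_j).
        delta = delta + rho * (z - K_LU.T @ theta)
        return theta, z, delta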
Optimize w with fixed f and v. To optimize $w$ with $f$ and $v$ fixed, Eq. 7 becomes:

$$\min_{w^T v = b, \; w \in [0,1]^{n_u}} \; \sum_{j=1}^{n_u} \big[ v_j w_j (\hat{y}_j - f(x_j))^2 \big]
+ \mu (w^T K_1 w + k w). \quad (12)$$

By denoting $c = \mu k + a^T$, where $a_j = v_j \big( f(x_j)^2 + 2|f(x_j)| \big)$, the above function can be further rewritten as:

$$\min_{w^T v = b, \; w \in [0,1]^{n_u}} \; w^T (\mu K_1) w + c w. \quad (13)$$

This is a quadratic programming problem, and can be efficiently solved with an existing toolbox.
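For example, the QP in Eq. 13 can be posed directly in a modeling tool. The paper solves it with CVX and MOSEK; the sketch below uses CVXPY instead, purely as an illustration (function and variable names are ours):

    import cvxpy as cp
    import numpy as np

    def solve_w(K1, c, v, b, mu):
        """Solve Eq. 13: min_w w^T (mu*K1) w + c w  s.t.  w^T v = b, 0 <= w <= 1."""
        n_u = len(c)
        P = mu * K1 + 1e-8 * np.eye(n_u)      # small ridge; K1 is PSD up to numerics
        w = cp.Variable(n_u)
        objective = cp.Minimize(cp.quad_form(w, P) + c @ w)
        constraints = [w @ v == b, w >= 0, w <= 1]
        cp.Problem(objective, constraints).solve()
        return np.clip(w.value, 0.0, 1.0)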
Optimize v with fixed f and w. Finally, when optimizing $v$ with $f$ and $w$ fixed, we have the following problem:

$$\min_{v \in [0,1]^{n_u}} \; \sum_{j=1}^{n_u} v_j \tilde{\ell}_j + \lambda \Big( \frac{1}{2} v_j^2 - v_j \Big), \quad (14)$$

where $\tilde{\ell}_j = w_j (\hat{y}_j - f(x_j))^2 \; \forall j = 1 \cdots n_u$. With the linear soft weighting regularizer $g(v)$, this problem has a closed-form solution for $v_j$:

$$v_j^* = \begin{cases} -\dfrac{\tilde{\ell}_j}{\lambda} + 1 & \tilde{\ell}_j < \lambda \\ 0 & \tilde{\ell}_j \ge \lambda. \end{cases} \quad (15)$$

It can be observed that the weight $v$ is updated based on the current losses of the instances. By adopting the self-paced regularizer $g(v)$, the solution for $v_j$ is inversely proportional to its weighted loss $\tilde{\ell}_j$; thus the easily learned samples with smaller losses receive higher values of $v_j$. The pace parameter $\lambda$ can be taken as a threshold that filters out over-complex instances. Note that when $\lambda = \infty$, all entries of $v$ will be 1; in this case, our method degenerates to an active learning approach that does not consider easiness.
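The closed form in Eq. 15 is straightforward to implement; a minimal NumPy version is shown below (illustrative names):

    import numpy as np

    def update_v(losses, lam):
        """Closed-form solution of Eq. 15.

        losses : weighted losses l~_j = w_j * (y_hat_j - f(x_j))**2
        lam    : pace parameter lambda
        """
        v = 1.0 - losses / lam           # -l~_j / lambda + 1 when l~_j < lambda
        v[losses >= lam] = 0.0           # over-complex instances are filtered out
        return np.clip(v, 0.0, 1.0)

As lambda grows across iterations, instances with larger losses gradually re-enter with nonzero $v_j$.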
We summarize the framework of SPAL in Algorithm 1. At each iteration, $f$, $w$ and $v$ are optimized alternately until convergence. Instances with high potential value are identified by the optimized $w_j$, while instances that are easy for the current model receive large $v_j$. We thus select the instances with the largest $v_j \cdot w_j$ to ensure that they not only have high potential value for improving the model, but also can be fully utilized by the current model. After updating the model with $L \cup Q$, we evaluate the performance on the test set.

Algorithm 1 The SPAL Algorithm
1: Input:
2:    Training set L and U;
3: Initializing:
4:    Initialize v = 1_{n_u}, w = 1_{n_u};
5: Repeat until convergence:
6:    Update f by solving Eq. 8 through ADMM;
7:    Update w by solving Eq. 13;
8:    Update v by solving Eq. 14;
9: Q ← top b instances of U with largest v_j · w_j values;
10: U = U \ Q; L = L ∪ Q;
11: Train the model based on L.
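A schematic Python version of this procedure is sketched below, with the outer loop over query rounds made explicit. It is not the authors' code: the subproblem solvers are passed in as placeholders, the labels are assumed to live in a dict, and the linear pace schedule follows the experimental settings described later.

    import numpy as np

    def spal_loop(X, y, L_idx, U_idx, b, n_rounds, lam0, lam_pace,
                  update_f, update_w, update_v, oracle, n_inner=20):
        """Schematic outer loop of Algorithm 1 (illustrative only).
        update_f / update_w / update_v solve the subproblems of
        Eqs. 8, 13 and 14; oracle(indices) returns {index: label}."""
        lam = lam0
        L_idx, U_idx = list(L_idx), list(U_idx)
        for t in range(n_rounds):
            v = np.ones(len(U_idx))                       # start with all instances "easy"
            w = np.ones(len(U_idx))
            for _ in range(n_inner):                      # alternate until (approximate) convergence
                f = update_f(X, y, L_idx, U_idx, v, w)    # Eq. 8 via ADMM
                w = update_w(X, f, L_idx, U_idx, v, b)    # Eq. 13, a QP
                v = update_v(X, f, L_idx, U_idx, w, lam)  # Eq. 15, closed form
            # Query the b instances that are both valuable and usable right now.
            top = np.argsort(v * w)[::-1][:b]
            queried = [U_idx[i] for i in top]
            y.update(oracle(queried))                     # y: index -> label dict
            L_idx += queried
            U_idx = [i for i in U_idx if i not in queried]
            lam += lam_pace                               # linearly increase the pace parameter
        return L_idx, U_idx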
Table 1: Datasets used in the experiments.

Dataset       thyroid  antivirus  clean1  tictactoe  image  krvskp  phoneme  gisette  phishing
# Instances   215      373        476     958        2086   3196    5404     7000     11055

We randomly sample 40% of the instances as the test set, and use the remaining 60% for training. Further, 5% of the training set is used as the initially labeled data, while the remaining instances constitute the unlabeled pool for active selection. The data partition is repeated randomly 10 times. We fix the batch size b = 5 for all methods.

Note that ASPL adds two batches of instances into L in each iteration, with half from querying and half from prediction. This causes the end point of ASPL to be reached earlier than that of the other methods, so we also stop the other methods early to ensure that the numbers of queried instances are the same. For the relatively large datasets, we report the performance of the early stage to demonstrate that, at a specific training stage, over-complex examples may be less useful than easy ones for improving the model. It is thus important to query the right thing at the right time.

The parameters of BMDR are set to the values recommended in their paper: the regularization weight γ = 0.1 and the trade-off parameter µ = 1000. ASPL targets a specific application and cannot be applied to binary classification directly, so we simplify it to select two batches of samples with the same batch size: the most uncertain instances for querying, and the most confident instances for assigning predicted labels. For the proposed method SPAL, we fix µ = 0.1 and γ = 0.1. For the SPL parameter λ, we initialize it with a value selected from {0.1, 0.01}, and follow the method used in (Lin et al. 2018) to update it linearly by a small fixed value. In our experiments, we fix λ_pace = 0.01 for all datasets. Specifically, we have the following updating rule for λ at the t-th iteration:

$$\lambda_t = \lambda_{\mathrm{initial}} + (t - 1) \cdot \lambda_{\mathrm{pace}}.$$

CVX (Grant and Boyd 2014) and MOSEK (http://www.mosek.com/) are used to solve the QP problem. We follow (Wang and Ye 2013) and employ a regularized linear model as the classification model for all methods.

Performance comparison

We plot the average accuracy curves of the proposed SPAL and the compared methods with an increasing number of queried instances in Figure 1. To further validate the significance of our method, we also conduct paired t-tests at the 95 percent significance level when 20%, 40%, 60%, 80% and 100% of the preset number of queries is reached. We present the win/tie/loss counts of SPAL versus the other methods in Table 2.

Figure 1: Performance comparison.

Table 2: Win/Tie/Loss counts of SPAL versus the other methods with 20%, 40%, 60%, 80%, 100% of the preset number of queries, based on paired t-tests at the 95 percent significance level.

Dataset     Random   BMDR     ASPL     In All
thyroid     4/1/0    5/0/0    2/3/0    11/4/0
antivirus   4/1/0    5/0/0    0/5/0    9/6/0
clean1      4/1/0    1/4/0    3/2/0    8/7/0
tictactoe   4/1/0    4/1/0    1/4/0    9/6/0
image       4/1/0    4/1/0    4/1/0    12/3/0
krvskp      5/0/0    5/0/0    2/3/0    12/3/0
phoneme     5/0/0    4/1/0    0/5/0    9/6/0
gisette     5/0/0    3/2/0    5/0/0    13/2/0
phishing    5/0/0    5/0/0    1/4/0    11/4/0
In All      40/5/0   36/9/0   18/27/0  94/41/0

It can be observed from the figure that the proposed SPAL approach outperforms the other methods in most cases. When comparing with BMDR, our method is always superior, which implies that considering the easiness of instances can save labeling cost by filtering out instances that are over-complex for the classification model. ASPL works well on some datasets but fails on others. Note that, in addition to the queried instances, ASPL also adds a batch of instances with predicted labels; this is why its performance is less stable, since the predicted labels can be unreliable when the model is not well trained. As expected, the random strategy is usually the worst one.

Table 2 shows that our method can outperform the baseline methods significantly in most cases. Note that, although ASPL achieves better performance than random and BMDR by using extra self-annotated instances in model training, it still runs the risk that the labeled instances may not be fully utilized by the model. We believe this is the reason why our method can outperform the others.

Study on different initially labeled ratios

In this subsection, we further perform experiments with different ratios of initially labeled data to examine the performance of the compared approaches. Specifically, we compare the methods when 1%, 5%, 10% and 20% of the training set is initially labeled, while the other settings remain unchanged. Because of the space limitation, we report the average value of the accuracy curve instead of plotting the whole curve. For each case, the best result and its comparable performances are highlighted in boldface based on paired t-tests at the 95 percent significance level. The means and standard deviations of the accuracies are presented in Table 3.
Table 3: Influence of different initially labeled ratios (mean ± std). The best performance and its comparable performances based on paired t-tests at the 95 percent significance level are highlighted in boldface.

Dataset    Labeled ratio   SPAL             Random           BMDR             ASPL
thyroid    1%              0.879 ± 0.019    0.854 ± 0.020    0.837 ± 0.016    0.783 ± 0.137
           5%              0.929 ± 0.016    0.899 ± 0.035    0.889 ± 0.032    0.920 ± 0.015
           10%             0.933 ± 0.014    0.913 ± 0.030    0.916 ± 0.023    0.931 ± 0.018
           20%             0.935 ± 0.013    0.929 ± 0.022    0.928 ± 0.015    0.938 ± 0.013
antivirus  1%              0.973 ± 0.009    0.956 ± 0.013    0.921 ± 0.025    0.964 ± 0.009
           5%              0.982 ± 0.007    0.970 ± 0.011    0.954 ± 0.017    0.978 ± 0.008
           10%             0.985 ± 0.007    0.976 ± 0.012    0.963 ± 0.018    0.984 ± 0.009
           20%             0.987 ± 0.008    0.980 ± 0.011    0.976 ± 0.013    0.986 ± 0.008
clean1     1%              0.708 ± 0.030    0.687 ± 0.031    0.681 ± 0.025    0.654 ± 0.026
           5%              0.749 ± 0.034    0.719 ± 0.032    0.725 ± 0.022    0.727 ± 0.038
           10%             0.768 ± 0.036    0.742 ± 0.029    0.739 ± 0.028    0.760 ± 0.026
           20%             0.788 ± 0.024    0.775 ± 0.025    0.776 ± 0.020    0.780 ± 0.028
tictactoe  1%              0.763 ± 0.013    0.727 ± 0.019    0.722 ± 0.024    0.756 ± 0.027
           5%              0.786 ± 0.017    0.748 ± 0.021    0.761 ± 0.022    0.775 ± 0.020
           10%             0.810 ± 0.015    0.766 ± 0.024    0.749 ± 0.027    0.803 ± 0.011
           20%             0.839 ± 0.010    0.794 ± 0.027    0.769 ± 0.028    0.838 ± 0.013
image      1%              0.933 ± 0.007    0.896 ± 0.011    0.916 ± 0.007    0.909 ± 0.013
           5%              0.944 ± 0.008    0.916 ± 0.009    0.929 ± 0.006    0.933 ± 0.010
           10%             0.953 ± 0.007    0.928 ± 0.009    0.938 ± 0.005    0.949 ± 0.007
           20%             0.961 ± 0.004    0.941 ± 0.007    0.947 ± 0.004    0.960 ± 0.005
krvskp     1%              0.942 ± 0.004    0.899 ± 0.012    0.910 ± 0.005    0.936 ± 0.003
           5%              0.964 ± 0.004    0.928 ± 0.010    0.937 ± 0.006    0.962 ± 0.004
           10%             0.974 ± 0.004    0.943 ± 0.009    0.949 ± 0.008    0.972 ± 0.004
           20%             0.980 ± 0.003    0.956 ± 0.006    0.958 ± 0.006    0.979 ± 0.004
phoneme    1%              0.829 ± 0.007    0.808 ± 0.008    0.812 ± 0.009    0.823 ± 0.012
           5%              0.844 ± 0.008    0.826 ± 0.008    0.829 ± 0.006    0.841 ± 0.007
           10%             0.851 ± 0.004    0.837 ± 0.007    0.838 ± 0.008    0.850 ± 0.005
           20%             0.861 ± 0.006    0.845 ± 0.007    0.845 ± 0.007    0.859 ± 0.005
gisette    1%              0.945 ± 0.003    0.930 ± 0.005    0.931 ± 0.005    0.927 ± 0.003
           5%              0.947 ± 0.003    0.942 ± 0.004    0.943 ± 0.005    0.933 ± 0.005
           10%             0.951 ± 0.004    0.946 ± 0.004    0.947 ± 0.003    0.935 ± 0.003
           20%             0.952 ± 0.003    0.950 ± 0.003    0.950 ± 0.004    0.938 ± 0.003
phishing   1%              0.934 ± 0.003    0.919 ± 0.007    0.918 ± 0.006    0.931 ± 0.002
           5%              0.937 ± 0.003    0.924 ± 0.009    0.930 ± 0.004    0.935 ± 0.003
           10%             0.936 ± 0.003    0.925 ± 0.008    0.926 ± 0.008    0.934 ± 0.003
           20%             0.935 ± 0.003    0.927 ± 0.006    0.927 ± 0.007    0.932 ± 0.004
We can observe that SPAL achieves the best performance in most cases, and in the few cases where our method is not the best, it is comparable to the best performance. These results imply that our method is rather stable and can outperform the others under different ratios of initially labeled data. Table 3 shows that the ASPL method prefers a larger initially labeled ratio. Note that ASPL uses more training data than the other approaches, because it adds two batches, one from querying and one from prediction. When there is more labeled data, the model prediction is more reliable, and thus ASPL can benefit more from the extra pseudo labels. For BMDR, its performance is still worse than ours even under different initial ratios of labeled data, which implies that it is important to further consider the easiness of instances even when they have high potential value.

In addition, we also observe in the table that the proposed method SPAL favors the cases with less labeled data. One possible reason is that many examples are over-difficult for a simple model, and thus we need a self-paced strategy that selects the easy ones at such an early learning stage to obtain cost-effective queries. We believe this is an advantage, because active learning is especially important when labeled data is limited.

Conclusion

In this paper, we propose a novel batch mode active learning approach, SPAL, to query the right thing at the right time. On one hand, informativeness and representativeness are considered such that the selected instances have high potential value for improving the model; on the other hand, easiness is exploited to make sure that the potential value can be fully utilized by the model. These two aspects are incorporated into a unified framework of self-paced active learning. Experiments show that our method is superior to state-of-the-art batch mode active learning methods. In the future, we plan to further examine the effectiveness of the proposed framework when the easiness of instances is known.
References

Basu, S., and Christensen, J. 2013. Teaching classification boundaries to humans. In Proceedings of the 27th AAAI Conference on Artificial Intelligence.

Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 41–48.

Bezdek, J. C., and Hathaway, R. J. 2003. Convergence of alternating optimization. Neural, Parallel, and Scientific Computations 11(4):351–368.

Borgwardt, K. M.; Gretton, A.; Rasch, M. J.; Kriegel, H.; Schölkopf, B.; and Smola, A. J. 2006. Integrating structured biological data by kernel maximum mean discrepancy. In Proceedings of the 14th International Conference on Intelligent Systems for Molecular Biology, 49–57.

Boyd, S. P.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122.

Chattopadhyay, R.; Wang, Z.; Fan, W.; Davidson, I.; Panchanathan, S.; and Ye, J. 2012. Batch mode active sampling based on marginal probability distribution matching. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 741–749.

Dasgupta, S., and Hsu, D. J. 2008. Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning, 208–215.

Grant, M., and Boyd, S. 2014. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx.

Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Schölkopf, B.; and Smola, A. J. 2006. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, 513–520.

Hoi, S. C. H.; Jin, R.; Zhu, J.; and Lyu, M. R. 2008. Semi-supervised SVM batch mode active learning for image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition.

Huang, S., and Zhou, Z. 2013. Active query driven by uncertainty and diversity for incremental multi-label learning. In IEEE 13th International Conference on Data Mining, 1079–1084.

Huang, S.; Jin, R.; and Zhou, Z. 2014. Active learning by querying informative and representative examples. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(10):1936–1949.

Jiang, L.; Meng, D.; Mitamura, T.; and Hauptmann, A. G. 2014. Easy samples first: Self-paced reranking for zero-example multimedia search. In Proceedings of the ACM International Conference on Multimedia, 547–556.

Khan, F.; Mutlu, B.; and Zhu, X. 2011. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems, 1449–1457.

Kumar, M. P.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, 1189–1197.

Lee, Y. J., and Grauman, K. 2011. Learning the easy things first: Self-paced visual category discovery. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, 1721–1728.

Lin, L.; Wang, K.; Meng, D.; Zuo, W.; and Zhang, L. 2018. Active self-paced learning for cost-effective and progressive face identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(1):7–19.

Ma, F.; Meng, D.; Xie, Q.; Li, Z.; and Dong, X. 2017. Self-paced co-training. In Proceedings of the 34th International Conference on Machine Learning, 2275–2284.

Settles, B. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison.

Seung, H. S.; Opper, M.; and Sompolinsky, H. 1992. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, 287–294.

Supancic, J. S., and Ramanan, D. 2013. Self-paced learning for long-term tracking. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2379–2386.

Tang, J.; Zha, Z.; Tao, D.; and Chua, T. 2012. Semantic-gap-oriented active learning for multilabel image annotation. IEEE Transactions on Image Processing 21(4):2354–2360.

Tang, Y.; Yang, Y.; and Gao, Y. 2012. Self-paced dictionary learning for image classification. In Proceedings of the 20th ACM Multimedia Conference, 833–836.

Wang, Z., and Ye, J. 2013. Querying discriminative and representative samples for batch mode active learning. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 158–166.

Wang, K.; Wang, Y.; Zhao, Q.; Meng, D.; and Xu, Z. 2017. SPLBoost: An improved robust boosting algorithm based on self-paced learning. arXiv preprint arXiv:1706.06341.

Xu, C.; Tao, D.; and Xu, C. 2015. Multi-view self-paced learning for clustering. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 3974–3980.

Yan, Y., and Huang, S. 2018. Cost-effective active learning for hierarchical multi-label classification. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2962–2968.

Zhang, D.; Meng, D.; Zhao, L.; and Han, J. 2016. Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, 3538–3544.

Zhang, D.; Meng, D.; and Han, J. 2017. Co-saliency detection via a self-paced multiple-instance learning framework. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(5):865–878.

Zhu, J.; Wang, H.; Tsou, B. K.; and Ma, M. Y. 2010. Active learning with sampling by uncertainty and density for data annotations. IEEE Transactions on Audio, Speech & Language Processing 18(6):1323–1331.