Citation: Tieliang Gong, Qian Zhao, Deyu Meng, Zongben Xu. Why Curriculum Learning & Self-paced Learning Work in Big/Noisy Data: A Theoretical Perspective[J]. Big Data and Information Analytics, 2016, 1(1): 111-127. doi: 10.3934/bdia.2016.1.111
Recently, curriculum learning (CL) [2] and self-paced learning (SPL) [12] have been attracting increasing attention in machine learning and computer vision. Both learning paradigms are inspired by the learning principle underlying the cognitive process of humans/animals, which generally starts with the easier aspects of a learning task and then gradually takes more complex examples into consideration.
Since its introduction, multiple variations of the CL/SPL learning regime, such as self-paced reranking [8], self-paced learning with diversity [9], and self-paced curriculum learning [10], have been proposed to further enhance its capability. Its effectiveness has also been extensively validated in various machine learning and computer vision tasks, including object detector adaptation [20], dictionary learning [19], long-term tracking [18] and matrix factorization [23]. In particular, this paradigm was integrated into the system developed by the CMU Informedia team, which achieved the leading performance in the challenging semantic query (SQ)/zero-example (0Ex) tasks of the TRECVID MED/MER competition organized by NIST in 2014 [22]. As indicated by the initial work [2] along this line, two advantages of CL/SPL learning have been empirically substantiated, especially under big/noisy data scenarios [12,8,9,10,1,11]: improved generalization and faster convergence.
Albeit with superior performance in applications, the rationality of the CL/SPL regime has only been intuitively explained through its cognitive analogy, and a sound theory revealing the mechanism behind its effectiveness is still lacking. Specifically, current CL/SPL methods need to iteratively solve varying optimization problems under gradually increasing pace parameters [12,8,9,10], yet no theoretical argument has been presented to clarify where these methods converge to and which objective they intrinsically solve.
To address this issue, this work initiates a learning theory for CL/SPL and provides an insightful explanation of the mechanism behind the effectiveness of this line of learning schemes. The main contributions of this paper can be summarized in the following aspects.
Different from traditional learning theory, which assumes similar training and test distributions, a new theory is formalized to understand the learning problem under the assumption that there exists a deviation between the training and test/target distributions. This is in fact the case often encountered in the era of big data. Nowadays, in various learning tasks such as object recognition, event detection and user behavior analysis, learners need to acquire massive data sources for training. In general, these massive data are collected and annotated from company users (e.g., the Netflix database1), the web (e.g., the LFW database2), or through crowdsourcing (e.g., the ImageNet database3). The subjective understanding of any annotator inevitably deviates more or less from the objective oracle knowledge underlying the data. This naturally induces a deviation between the training distribution (accumulated from the knowledge of all involved annotators) and the true target one, especially in the ambiguously annotated regions. This inspires us to formulate this learning problem and investigate its learning theory.
2. http://vis-www.cs.umass.edu/lfw/
Under the premise of the proposed learning theory, the insight of CL/SPL can be rationally explained. In particular, the theory clarifies that the CL/SPL regime actually attempts to minimize an upper bound of the expected risk under the target distribution, purely from data generated from the deviated training distribution. Specifically, easy samples in CL/SPL correspond to those in the high-confidence annotated area of the training distribution, which is consistent with the high-confidence region of the target distribution (where annotators can easily confirm and agree). Complex ones, however, are more likely to be located in the ambiguously annotated regions, corresponding to the more deviated area between the training and target distributions (where annotators easily become uncertain or even wrongly cognize the data). Thus, starting training from easy samples in CL/SPL actually simulates learning from the high-confidence target region, while gradually including complex ones means that the samples residing in ambiguous training regions then come to be involved. Through this process, the faithful information delivered by the high-confidence/easy samples tends to soundly guide the learning towards the expected target, while being less hampered by the low-confidence/complex samples that deviate relatively more from the target. This naturally yields the advantages of SPL, i.e., better generalization to the target and faster convergence in a sound manner, as compared to the traditional learning mode, which considers or even emphasizes unreliable low-confidence samples throughout the learning process.
Besides, based on the proposed theory, we construct a new CL/SPL learning scheme based on random sampling. This new scheme better complies with the deduced upper bound of the expected risk on the target distribution, and thus can be more faithfully explained by our theory. We also substantiate the effectiveness of the proposed learning scheme by experiments on synthetic and real data.
The rest of this paper is organized as follows. Section 2 briefly reviews the related work on CL/SPL. Section 3 introduces the new learning problem and our motivations. Section 4 establishes the main learning theory for this problem and clarifies its intrinsic relationship to CL/SPL. The SPL algorithm with random sampling is constructed in Section 5 and evaluated by experiments in Section 6. The paper is then concluded with future research directions.
Inspired by the learning principle of humans/animals, [2] formulated the curriculum learning paradigm. Its core idea is to iteratively involve samples into learning in sequence, where easy samples are learned first and more complex ones are gradually included when the learner is ready for them. These sample sequences, gradually included from easy to complex, are called curricula, learned at different stages of the training process. Specifically, [2] formalized the CL problem as follows. Let $z$ be a random variable representing an example and $P_{\mathrm{train}}(z)$ the training distribution. Let $0 \le W_\lambda(z) \le 1$ be the weight applied to example $z$ at stage $\lambda \in [0,1]$ of the curriculum, with $W_1(z) = 1$. The pace distribution at stage $\lambda$ is then defined as

$$Q_\lambda(z) \propto W_\lambda(z) P_{\mathrm{train}}(z), \qquad (1)$$

such that $\int_Z Q_\lambda(z)\,dz = 1$. The sequence $\{Q_\lambda\}$ constitutes a curriculum if both the entropy of $Q_\lambda$ and the weights $W_\lambda(z)$ monotonically increase with $\lambda$, so that $Q_1(z) = P_{\mathrm{train}}(z)$.
To make the CL idea more implementable in applications, [12] first formulated the key principle of CL as a concise optimization model named SPL. The SPL model includes a weighted loss term on all samples and a general SPL regularizer imposed on the sample weights. By sequentially optimizing the model with a gradually increasing pace parameter on the SPL regularizer, more samples can be automatically included into training, from easy to complex, in a purely self-paced way. [8] and [23] further built a guideline to construct a rational SPL regularizer, and formalized the SPL model as the following optimization problem:

$$\min_{w,\, v\in[0,1]^n} \sum_{i=1}^n v_i L(y_i, f(x_i, w)) + r(v; \lambda), \qquad (2)$$

where $L(y_i, f(x_i, w))$ denotes the loss of the $i$-th training sample $(x_i, y_i)$ under the model $f(\cdot, w)$ with parameter $w$, $v = (v_1, \ldots, v_n)^T$ are the sample weights, and $r(v; \lambda)$ is the self-paced regularizer with pace parameter $\lambda$. The problem is typically solved by alternating optimization: with $w$ fixed, the weights $v$ admit a closed-form solution; with $v$ fixed, minimizing the weighted loss over $w$ reduces to a standard weighted empirical risk minimization.
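To make the alternating scheme concrete, the sketch below implements it with the "hard" regularizer $r(v;\lambda) = -\lambda\sum_i v_i$ from [12], whose closed-form weight update is the indicator $v_i = \mathbb{1}[L_i < \lambda]$. The logistic base learner and the geometric pace schedule are our illustrative assumptions, not the exact setup of any of the cited works.

```python
# A minimal sketch of the alternating minimization of Eq. (2), assuming the
# "hard" regularizer r(v; lambda) = -lambda * sum(v_i) from [12], whose
# closed-form update is v_i = 1 if loss_i < lambda else 0. The logistic
# base learner and the geometric pace schedule are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

def spl_train(X, y, lam=0.5, mu=1.3, n_rounds=10):
    model = LogisticRegression().fit(X, y)           # warm start on all data
    for _ in range(n_rounds):
        # Per-sample negative log-likelihood under the current model.
        proba = model.predict_proba(X)
        col = np.searchsorted(model.classes_, y)     # column of each true label
        loss = -np.log(np.clip(proba[np.arange(len(y)), col], 1e-12, None))
        v = (loss < lam).astype(float)               # closed-form weight step
        if v.sum() >= 2:                             # need selected samples to refit
            model = LogisticRegression().fit(X, y, sample_weight=v)
        lam *= mu                                    # grow the pace: admit harder samples
    return model
```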
In this paper, we attempt to explore the underlying reason for these successful applications of CL/SPL. To the best of our knowledge, this is the first theoretical explanation for this newly emerging methodology.
Current learning tasks typically require collecting a massive data set for training. Such a large magnitude makes it feasible to obtain the expected data only through crowdsourcing, especially for supervised learning tasks. This often introduces a large amount of samples that are ambiguous (or complex in CL/SPL terminology) for general users into the obtained data, as illustrated in Figure 1, which shows typical "hard" samples from the SIN4 and Pascal VOC5 data sets, and samples returned by the Google image search engine6. The reason is that any participant has his/her own specific viewpoint on a problem as compared to most others, and there is thus inevitably a deviation from each collector/annotator's subjective understanding to the objective oracle knowledge of the problem. This naturally leads to the problem that the training distribution, $P_{\mathrm{train}}$, accumulated from the annotations of all participants, deviates from the true target distribution $P_{\mathrm{target}}$ underlying the learning problem.
4. http://www.ee.columbia.edu/ln/dvmm/a-TRECVID/

5. http://host.robots.ox.ac.uk/pascal/VOC/
Albeit deviated, useful information under $P_{\mathrm{train}}$ is still preserved: in the high-confidence regions, where annotators easily reach agreement, $P_{\mathrm{train}}$ remains consistent with $P_{\mathrm{target}}$, while the deviation mainly concentrates in the low-confidence region located around the boundaries between categories.
In small/clean sample cases, such a low-confidence region typically contains few samples, due to both its small density and the small total number of samples. Thus it tends to be configured as a blank "margin" area. By finding a classification surface that maximizes this margin, the decision boundary can always be effectively located [21]. Under practical big/noisy data, however, such a margin tends to be very hard to anchor. Both the relatively high density of marginal samples (caused by noise/outliers) and the large data cardinality (caused by big data) tend to fill the margin, and heavy noise/outliers can even seriously mislead the margin location. This might explain the failure cases of traditional margin-emphasizing algorithms such as SVM [21] and Adaboost [7] in some real-data applications [8,9].
It is thus rational to place more emphasis on the high-confidence (i.e., easy) samples rather than the low-confidence (i.e., complex) ones in certain real-data cases, instead of treating the former as non-support-vectors and ignoring their role in learning. This constitutes the basic methodology underlying CL/SPL, which better complies with the human learning process. Such a high-confidence-sample-emphasizing idea has also been employed to build never-ending machine learning systems that acquire the ability to extract structured information from unstructured data [4,15] by persistently picking up high-confidence samples in each iteration.
In sum, our argument is that in real big/noisy data scenarios, both learning theories and implementation methods need to be handled from new viewpoints. In theory, instead of being similar to the training distribution as conventionally assumed [5,6], the target distribution often deviates from it, especially in the low-confidence regions; and in implementation, the high-confidence samples, i.e., the traditional non-support-vectors, should be given more emphasis in learning, as the CL/SPL methodology suggests.
In the following, we provide some preliminary theoretical results on this new learning setting, and deliver a rational theoretical explanation for the working mechanism underlying the CL/SPL methodology.
In this work we mainly investigate the binary classification problem. Following the classic setting of learning theory, our aimed learning problem is as follows. Let $\mathcal{X} \subseteq \mathbb{R}^d$ denote the input space, $\mathcal{Y} = \{-1, +1\}$ the output space, and $Z = \mathcal{X} \times \mathcal{Y}$. The goal is to learn a classifier $f$ minimizing the expected risk under the target distribution,

$$R(f) := \int_Z L_f(z)\, P_{\mathrm{target}}(x|y)\, P_{\mathrm{target}}(y)\, dz,$$

where $L_f(z) = L(y, f(x))$ denotes the loss of $f$ on the sample $z = (x, y)$. Given training samples $\{z_i = (x_i, y_i)\}_{i=1}^n$, the empirical risk is defined as

$$R_{\mathrm{emp}}(f) = \frac{1}{n}\sum_{i=1}^n L_f(z_i). \qquad (3)$$

We assume equal class priors throughout, i.e., $P_{\mathrm{target}}(y=1) = P_{\mathrm{target}}(y=-1) = \frac{1}{2}$; different from the classic setting, however, the training samples are generated from a training distribution $P_{\mathrm{train}}$ that deviates from $P_{\mathrm{target}}$.
We first formulate the relationship between the target and training distributions. Following the curriculum formulation (1), we assume

$$P_{\mathrm{target}}(x) = \frac{1}{\alpha^*} W_{\lambda^*}(x) P_{\mathrm{train}}(x), \qquad (4)$$

where $0 \le W_{\lambda^*}(x) \le 1$ and $\alpha^* = \int W_{\lambda^*}(x) P_{\mathrm{train}}(x)\, dx$ is the normalizing constant7.

7We thus have $0 < \alpha^* \le 1$.
Eq. (4) can be equivalently reformulated as
$$P_{\mathrm{train}}(x) = \alpha^* P_{\mathrm{target}}(x) + (1-\alpha^*) E(x), \qquad (5)$$

where $E(x)$ denotes the deviation distribution

$$E(x) = \frac{1}{1-\alpha^*}\big(1 - W_{\lambda^*}(x)\big) P_{\mathrm{train}}(x).$$

Here it is easy to see that $E(x)$ is a valid probability density: $E(x) \ge 0$ since $W_{\lambda^*}(x) \le 1$, and $\int E(x)\,dx = \frac{1}{1-\alpha^*}(1 - \alpha^*) = 1$ by the definition of $\alpha^*$.
We can then construct the following curriculum sequence for our theoretical evaluation:
$$Q_\lambda(x) = \alpha_\lambda P_{\mathrm{target}}(x) + (1-\alpha_\lambda) E(x), \qquad (6)$$

where $\alpha_\lambda$ monotonically decreases from $1$ to $\alpha^*$ as the pace parameter $\lambda$ increases, so that $Q_\lambda$ varies from $P_{\mathrm{target}}$ to $P_{\mathrm{train}}$. It is easy to verify that (6) can be written in the curriculum form (1),

$$Q_\lambda(x) \propto W_\lambda(x) P_{\mathrm{train}}(x),$$

where the weighting function is

$$W_\lambda(x) \propto \frac{\alpha_\lambda P_{\mathrm{target}}(x) + (1-\alpha_\lambda) E(x)}{\alpha^* P_{\mathrm{target}}(x) + (1-\alpha^*) E(x)},$$

with the proportionality constant chosen such that $0 \le W_\lambda(x) \le 1$.

Note that the initial stage of this CL process sets $\alpha_\lambda = 1$, i.e., $Q_\lambda = P_{\mathrm{target}}$, so that learning starts from the region consistent with the target, while the final stage sets $\alpha_\lambda = \alpha^*$, i.e., $Q_\lambda = P_{\mathrm{train}}$, where all training samples are involved.
By taking (6) as the pace distribution, we present some theoretical results on the CL/SPL strategy. These results will help us gain useful insights into this interesting learning scheme.
First we need some preliminary definitions.
Definition 4.1. Let $G$ be a family of functions mapping from $Z$ to $[0,1]$, and $S = (z_1, \ldots, z_m)$ a fixed sample of size $m$. The empirical Rademacher complexity of $G$ with respect to $S$ is defined as

$$\hat{R}_m(G) = E_\sigma\left[\sup_{g\in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\right], \qquad (7)$$

where $\sigma = (\sigma_1, \ldots, \sigma_m)$, with the $\sigma_i$'s independent uniform random variables taking values in $\{-1, +1\}$ (Rademacher variables). The Rademacher complexity of $G$ is the expectation of the empirical one over samples of size $m$ drawn i.i.d. from a distribution $P$:

$$R_m(G) = E_{S\sim P^m}\big[\hat{R}_S(G)\big]. \qquad (8)$$
Definition 4.2. The Kullback-Leibler (KL) divergence between two probability densities $p$ and $q$ on $\Omega$ is defined as

$$D_{KL}(p\,\|\,q) = \int_\Omega p(x)\log\frac{p(x)}{q(x)}\,dx. \qquad (9)$$
Based on the above definitions, we can estimate the generalization error bound for CL/SPL learning under the curriculum $Q_\lambda$. We first need the following lemmas.
Lemma 4.3. (Bretagnolle-Huber inequality) Let $p$ and $q$ be two probability densities. Then

$$\int |p(x) - q(x)|\, dx \le 2\sqrt{1 - \exp\{-D_{KL}(p\,\|\,q)\}}. \qquad (10)$$
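As a quick illustration of Lemma 4.3, the snippet below numerically checks inequality (10) for two 1-D Gaussians, using the closed-form KL divergence between Gaussians; the particular means and variances are arbitrary choices.

```python
# A numeric sanity check of Lemma 4.3 (Bretagnolle-Huber) on two 1-D
# Gaussians, for which KL(N(m1,s1^2) || N(m2,s2^2)) has the closed form
# log(s2/s1) + (s1^2 + (m1-m2)^2) / (2*s2^2) - 1/2. Parameters are arbitrary.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

m1, s1, m2, s2 = 0.0, 1.0, 1.5, 1.2
kl = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
l1_dist, _ = quad(lambda x: abs(norm.pdf(x, m1, s1) - norm.pdf(x, m2, s2)),
                  -np.inf, np.inf)
bound = 2 * np.sqrt(1 - np.exp(-kl))
print(f"integral |p - q| = {l1_dist:.4f}  <=  bound = {bound:.4f}")
```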
Lemma 4.4. [16] Let $H$ be a family of functions taking values in $[0,1]$, with samples drawn i.i.d. from a distribution $P$. Then for any $\delta > 0$, with probability at least $1-\delta$ over a sample of size $m$, the following holds for all $f \in H$:

$$R(f) \le R_{\mathrm{emp}}(f) + R_m(H) + \sqrt{\frac{\ln(1/\delta)}{2m}}. \qquad (11)$$

In addition, with probability at least $1-\delta$ we have

$$R(f) \le R_{\mathrm{emp}}(f) + \hat{R}_m(H) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}. \qquad (12)$$
Lemma 4.5. Suppose the hypothesis space is $H = \{x \mapsto \mathrm{sgn}(\langle w, x\rangle) : \|w\| \le B\}$ and the samples satisfy $\|x_i\| \le R$ for $i = 1, \ldots, m$. Then

$$\hat{R}_m(H) \le \frac{BR}{\sqrt{m}}. \qquad (13)$$
Proof.

$$\begin{aligned} \hat{R}_m(H) &= \frac{1}{m} E_\sigma\left[\sup_{\|w\|\le B}\sum_{i=1}^m \sigma_i\, \mathrm{sgn}(\langle w, x_i\rangle)\right] \le \frac{1}{m} E_\sigma\left[\sup_{\|w\|\le B}\sum_{i=1}^m \sigma_i \langle w, x_i\rangle\right] \\ &= \frac{1}{m} E_\sigma\left[\sup_{\|w\|\le B}\Big\langle w, \sum_{i=1}^m \sigma_i x_i\Big\rangle\right] \le \frac{B}{m} E_\sigma\left[\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|\right] \le \frac{B}{m}\left(E_\sigma\left[\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|^2\right]\right)^{1/2} \\ &= \frac{B}{m}\left(E_\sigma\left[\sum_{i,j=1}^m \sigma_i\sigma_j \langle x_i, x_j\rangle\right]\right)^{1/2} = \frac{B}{m}\left(\sum_{i=1}^m \|x_i\|^2\right)^{1/2} \le \frac{BR}{\sqrt{m}}, \end{aligned}$$

where the third inequality follows from Jensen's inequality, and the second-to-last equality holds since $E_\sigma[\sigma_i\sigma_j] = 0$ for $i \ne j$.
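The supremum in the linear step of this proof is attained at $w = B u/\|u\|$ with $u = \sum_i \sigma_i x_i$, which makes the empirical Rademacher complexity of the linear class exactly computable. The following sketch (with arbitrary synthetic data) verifies the $BR/\sqrt{m}$ bound by Monte Carlo over the Rademacher variables.

```python
# Monte Carlo check of Lemma 4.5: for the linear class {<w,.> : ||w|| <= B}
# the supremum in (7) equals (B/m) * ||sum_i sigma_i x_i||, so its
# expectation can be estimated directly and compared with B*R/sqrt(m).
# Data dimensions and B are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
m, d, B = 200, 5, 2.0
X = rng.normal(size=(m, d))
R = np.linalg.norm(X, axis=1).max()                  # radius of the sample
sigma = rng.choice([-1.0, 1.0], size=(10_000, m))    # Rademacher draws
rad_hat = B / m * np.linalg.norm(sigma @ X, axis=1).mean()
print(rad_hat, "<=", B * R / np.sqrt(m))             # bound (13)
```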
Then we give the main results of this work.
Theorem 4.6. Suppose the hypothesis space $H$ consists of functions taking values in $[0,1]$ under the 0-1 loss, and the training samples are drawn i.i.d. from the pace distribution $Q_\lambda$ in (6), containing $m^+$ positive and $m^-$ negative samples with $m^* = \min\{m^+, m^-\}$. Then for any $\delta > 0$, with probability at least $1-\delta$, the following holds for all $f \in H$:

$$\begin{aligned} R(f) \le{}& \frac{1}{2} R^+_{\mathrm{emp}}(f) + \frac{1}{2} R^-_{\mathrm{emp}}(f) + \frac{1}{2} R_{m^+}(H) + \frac{1}{2} R_{m^-}(H) + \sqrt{\frac{\ln(1/\delta)}{m^*}} \\ &+ (1-\alpha_\lambda)\sqrt{1-\exp\{-D_{KL}(P^+_{\mathrm{target}} \,\|\, E^+)\}} + (1-\alpha_\lambda)\sqrt{1-\exp\{-D_{KL}(P^-_{\mathrm{target}} \,\|\, E^-)\}}, \end{aligned} \qquad (14)$$

and

$$\begin{aligned} R(f) \le{}& \frac{1}{2} R^+_{\mathrm{emp}}(f) + \frac{1}{2} R^-_{\mathrm{emp}}(f) + \frac{1}{2} \hat{R}_{m^+}(H) + \frac{1}{2} \hat{R}_{m^-}(H) + 3\sqrt{\frac{\ln(2/\delta)}{m^*}} \\ &+ (1-\alpha_\lambda)\sqrt{1-\exp\{-D_{KL}(P^+_{\mathrm{target}} \,\|\, E^+)\}} + (1-\alpha_\lambda)\sqrt{1-\exp\{-D_{KL}(P^-_{\mathrm{target}} \,\|\, E^-)\}}, \end{aligned} \qquad (15)$$

where $R^\pm_{\mathrm{emp}}(f)$ denote the empirical risks on the positive/negative samples, and $P^\pm_{\mathrm{target}}$, $Q^\pm_\lambda$ and $E^\pm$ denote the class-conditional target, pace and deviation distributions, respectively.
Proof. We first rewrite the expected risk as

$$R(f) = \int_Z L_f(z)\, P_{\mathrm{target}}(x|y)\, P_{\mathrm{target}}(y)\, dz = \frac{1}{2}\int_{X^+} L_f(x,y)\, P_{\mathrm{target}}(x|y=1)\, dx + \frac{1}{2}\int_{X^-} L_f(x,y)\, P_{\mathrm{target}}(x|y=-1)\, dx := \frac{1}{2}\big(R^+(f) + R^-(f)\big).$$

The empirical risk tends not to approximate the expected risk well due to the inconsistency between $Q_\lambda$ and $P_{\mathrm{target}}$. We thus decompose the deviation between them as

$$\begin{aligned} &\frac{1}{2}\big(R^+(f)+R^-(f)\big) - \frac{1}{2}\big(R^+_{\mathrm{emp}}(f)+R^-_{\mathrm{emp}}(f)\big) \\ &= \frac{1}{2}\big[R^+(f) - E_{Q^+_\lambda}(f) + E_{Q^+_\lambda}(f) - R^+_{\mathrm{emp}}(f)\big] + \frac{1}{2}\big[R^-(f) - E_{Q^-_\lambda}(f) + E_{Q^-_\lambda}(f) - R^-_{\mathrm{emp}}(f)\big] := S_1 + S_2. \end{aligned} \qquad (16)$$

Let $A_1 = \frac{1}{2}\big[R^+(f) - E_{Q^+_\lambda}(f)\big]$ and $A_2 = \frac{1}{2}\big[E_{Q^+_\lambda}(f) - R^+_{\mathrm{emp}}(f)\big]$, so that $S_1 = A_1 + A_2$, and define $B_1$, $B_2$ analogously for $S_2$.

We first focus on the estimation of $A_1$. Since the 0-1 loss is bounded by 1,

$$\begin{aligned} A_1 &\le \frac{1}{2}\int_{X^+}\big(P^+_{\mathrm{target}}(x) - Q^+_\lambda(x)\big)\, dx = \frac{1}{2}\int_{X^+}\big(P^+_{\mathrm{target}}(x) - \alpha_\lambda P^+_{\mathrm{target}}(x) - (1-\alpha_\lambda)E^+(x)\big)\, dx \\ &= \frac{1}{2}(1-\alpha_\lambda)\int_{X^+}\big(P^+_{\mathrm{target}}(x) - E^+(x)\big)\, dx \le (1-\alpha_\lambda)\sqrt{1-\exp\{-D_{KL}(P^+_{\mathrm{target}}\,\|\,E^+)\}}. \end{aligned} \qquad (17)$$

The last inequality is obtained by Lemma 4.3. For the estimation of $A_2$, applying Lemma 4.4 to the $m^+$ positive samples drawn from $Q^+_\lambda$ gives, with probability at least $1-\delta$,

$$A_2 \le \frac{1}{2} R_{m^+}(H) + \frac{1}{2}\sqrt{\frac{\ln(1/\delta)}{2m^+}}. \qquad (18)$$

In the similar way, we can bound $B_1$ and $B_2$ as

$$B_1 \le (1-\alpha_\lambda)\sqrt{1-\exp\{-D_{KL}(P^-_{\mathrm{target}}\,\|\,E^-)\}}, \qquad (19)$$

and

$$B_2 \le \frac{1}{2} R_{m^-}(H) + \frac{1}{2}\sqrt{\frac{\ln(1/\delta)}{2m^-}}. \qquad (20)$$

By taking $m^* = \min\{m^+, m^-\}$ and combining the bounds (17)-(20) through the union bound, inequality (14) follows. For (15), we further use the following relation between the Rademacher complexity and its empirical counterpart [16]: with probability at least $1-\delta/2$,

$$R_m(H) \le \hat{R}_m(H) + \sqrt{\frac{\ln(2/\delta)}{2m}}. \qquad (21)$$

By replacing $R_{m^+}(H)$ and $R_{m^-}(H)$ in (14) with their empirical counterparts via (21), the bound (15) is obtained.

The proof is then completed.
Note that the error bounds established above are based on the 0-1 loss and are hard to optimize directly. We thus further deduce another bound under the commonly utilized hinge loss.
Corollary 1. Suppose $H = \{x \mapsto \langle w, x\rangle : \|w\| \le B\}$ with $\|x\| \le R$, let $g \in H$, and let $\phi(t) = \max(0, 1-t)$ denote the hinge loss. Then for any $\delta > 0$, with probability at least $1-\delta$,

$$\begin{aligned} R(\mathrm{sgn}(g)) \le{}& \frac{1}{2m^+}\sum_{i=1}^{m^+}\phi(y_i g(x_i)) + \frac{1}{2m^-}\sum_{i=1}^{m^-}\phi(y_i g(x_i)) + \frac{RB}{\sqrt{m^*}} + 3\sqrt{\frac{\ln(1/\delta)}{m^*}} \\ &+ (1-\alpha_\lambda)\sqrt{1-\exp\{-D_{KL}(P^+_{\mathrm{target}}\,\|\,E^+)\}} + (1-\alpha_\lambda)\sqrt{1-\exp\{-D_{KL}(P^-_{\mathrm{target}}\,\|\,E^-)\}}. \end{aligned} \qquad (22)$$
Proof. Applying Lemma 4.5 to Eq. (15), together with the fact that the hinge loss is an upper bound of the 0-1 loss, the result directly follows.
Note that there are three components in the upper bound (22) of the expected risk under $P_{\mathrm{target}}$: the empirical risk on the involved training samples; the estimation error terms $\frac{RB}{\sqrt{m^*}} + 3\sqrt{\ln(1/\delta)/m^*}$, which decrease as more samples are included; and the divergence terms weighted by $(1-\alpha_\lambda)$, which grow as the pace distribution $Q_\lambda$ drifts from $P_{\mathrm{target}}$ toward $P_{\mathrm{train}}$. The optimal pace thus corresponds to a tradeoff between the first two components and the last one.
This theory reveals the following insights underlying the CL/SPL process. The "easy-to-complex" property of the curriculum $Q_\lambda$ guarantees that the early learning stages are dominated by high-confidence samples, for which $Q_\lambda$ stays close to $P_{\mathrm{target}}$ and the divergence terms in the bound remain small; as the pace grows, the newly included complex samples reduce the estimation error at the price of a larger divergence from the target.
When we only have samples $\{z_i\}_{i=1}^n$ drawn from the training distribution $P_{\mathrm{train}}$, the pace distribution $Q_\lambda$ cannot be accessed directly and has to be approximated from these samples.

First, let's approximate $Q_\lambda$ by a weighted empirical distribution $\hat{Q}_\lambda$, which puts probability mass proportional to a weight $v_i(\lambda) \in [0,1]$ on each training sample $z_i$, with larger weights assigned to samples located in higher-confidence regions.

Instead of minimizing the empirical risk $R_{\mathrm{emp}}(f)$ computed on the raw training samples, we then minimize the expected loss under $\hat{Q}_\lambda$:

$$\min_w E_{\hat{Q}_\lambda}\left(\frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i, w))\right) = E_{\hat{Q}_\lambda} L(y, f(x, w)) \;\Leftrightarrow\; \min_w \sum_i v_i(\lambda) L(y_i, f(x_i, w)), \qquad (23)$$

where the first expectation is taken with respect to $\hat{Q}_\lambda$, and the equivalence holds since $\hat{Q}_\lambda$ places mass proportional to $v_i(\lambda)$ on sample $z_i$.
A useful cue for judging whether the label confidence of a sample is high or low is its learning error: a high-confidence sample tends to be located inside the region of its category, and thus always incurs a small training error, and vice versa. From this understanding, Eq. (23) exactly corresponds to current SPL models [8,23,10], which fit these weight values to accord with similar requirements by supplementing a self-paced regularizer on $v$.
In this sense, we might explain the effectiveness of previous SPL models by the following insight. Based on our theoretical results, this learning scheme tends to learn from the deviated training information to discover the ground-truth knowledge of the target distribution, by learning in a sound manner from high-confidence/easy/small-loss samples to low-confidence/complex/large-loss ones. Throughout this process, it intrinsically tries to minimize an upper bound of the expected risk on the target distribution, terminating at a properly compromised pace. This fully complies with the experience of its real implementations in multiple applications [8,9,23].
Note that current SPL models are all deterministic, while the empirical risk in the upper bound (22) is calculated on randomly generated samples. We thus build a new SPL algorithm using a random sampling mechanism. The core idea is to approximate the pace distribution $Q_\lambda$ by randomly sampling training examples according to their weights $v_i(\lambda)$.
The implementation details are as follows. At each iteration, we first compute the losses of all training samples based on the current model. Then we solve the following optimization problem to form weights on all samples:

$$\min_v \sum_{i=1}^n v_i L(y_i, f(x_i, w)) + r(v, \lambda), \qquad (24)$$

where $r(v, \lambda)$ denotes the self-paced regularizer with pace parameter $\lambda$. The obtained weights are then normalized into a sampling distribution over the training set, from which a subset of samples is randomly drawn to train a new model.
Algorithm 1 Self-Paced Learning with Random Sampling (RS-SPL)

Input: training data $\{(x_i, y_i)\}_{i=1}^n$, initial pace parameter $\lambda$, step size $\mu > 1$

Output: model parameter $w$

1: Train a model on the entire training set to obtain the loss of each sample

2: repeat

3: Solve (24) to obtain the weights $v$

4: Normalize $v$ into a sampling distribution over the training samples

5: Draw a subset of samples according to this distribution

6: Train a new model on the drawn subset and update the sample losses

7: If the stopping criterion is not met, increase the pace parameter $\lambda \leftarrow \mu\lambda$

8: until stopping criteria satisfied
There are many choices for the self-paced regularizer $r(v, \lambda)$ [8,23]. In this work we adopt the following one:

$$r(v, \lambda) = -\gamma\sum_{i=1}^n \log\left(v_i + \frac{1}{\lambda}\gamma\right), \qquad (25)$$

where $\gamma$ is a parameter. Under this regularizer, the weights solving (24) take the closed form

$$v_i(\lambda) = \begin{cases} \dfrac{1}{\log\gamma}\log\big(L(y_i, f(x_i, w)) + \gamma\big), & L(y_i, f(x_i, w)) < \lambda, \\[4pt] 0, & L(y_i, f(x_i, w)) \ge \lambda. \end{cases}$$
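Putting Algorithm 1 and the weight rule above together, a minimal sketch could look as follows. The linear SVM base learner, the hinge-loss computation, the validation-based acceptance test, and the geometric pace schedule $\lambda \leftarrow \mu\lambda$ are our illustrative assumptions rather than the exact implementation evaluated in Section 6.

```python
# A sketch of Algorithm 1 (RS-SPL) using the closed-form weights above with
# gamma in (0,1), so that v decays from 1 at zero loss down to 0. Labels are
# assumed to lie in {-1, +1}. Base learner, acceptance test, and pace
# schedule are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

def spl_weights(loss, lam, gamma=0.5):
    """Closed-form solution of (24) under the regularizer (25)."""
    v = np.log(loss + gamma) / np.log(gamma)         # = 1 at loss 0, decreasing
    return np.where(loss < lam, np.clip(v, 0.0, 1.0), 0.0)

def rs_spl(X, y, X_val, y_val, lam=0.3, mu=1.2, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    model = LinearSVC().fit(X, y)                    # step 1: train on all data
    best, best_acc = model, model.score(X_val, y_val)
    for _ in range(n_iter):                          # steps 2-8
        loss = np.maximum(0.0, 1.0 - y * model.decision_function(X))
        v = spl_weights(loss, lam)
        if v.sum() > 0:
            p = v / v.sum()                          # step 4: normalize into a distribution
            idx = rng.choice(len(y), size=len(y), p=p)    # step 5: random draw
            if len(np.unique(y[idx])) == 2:          # need both classes to refit
                model = LinearSVC().fit(X[idx], y[idx])   # step 6: retrain
                acc = model.score(X_val, y_val)
                if acc > best_acc:                   # step 7: keep the better model
                    best, best_acc = model, acc
        lam *= mu                                    # enlarge the pace
    return best
```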
In this section, we conducted experiments on synthetic and real classification datasets. The linear SVM, implemented by LibSVM [3], is utilized as the comparison method.
We first give a synthetic example to illustrate the behavior of the proposed RS-SPL algorithm. The data were generated as follows: two 2-D Gaussian distributions, each associated with one class, were specified as the target distribution. The training distribution further mixes in another two 2-D Gaussian distributions, each centered at the low-density area of the target distribution of the corresponding class, to enforce the deviation. We generated training samples from this deviated training distribution and test samples from the target distribution.
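A sketch of this synthetic setup is given below; the means, covariances and mixing proportion are illustrative assumptions, as the exact parameters are not specified here. Feeding (X_train, y_train) to the rs_spl sketch above and scoring on (X_test, y_test) reproduces the qualitative behavior described next.

```python
# A sketch of the synthetic setup just described: each class's target is a
# 2-D Gaussian, and the training distribution mixes in a deviation Gaussian
# centered in the low-density area of that class, as in Eq. (5). All
# parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def sample_class(n, target_mean, dev_mean, alpha=0.8):
    """n points: a fraction alpha from the class's target Gaussian, the
    remainder from its deviation Gaussian."""
    n_tgt = rng.binomial(n, alpha)
    cov = 0.5 * np.eye(2)
    tgt = rng.multivariate_normal(target_mean, cov, size=n_tgt)
    dev = rng.multivariate_normal(dev_mean, cov, size=n - n_tgt)
    return np.vstack([tgt, dev])

n = 500
# Deviation components sit toward the opposite class, distorting the margin.
X_train = np.vstack([sample_class(n, [2, 2], [-1, -1]),
                     sample_class(n, [-2, -2], [1, 1])])
y_train = np.hstack([np.ones(n), -np.ones(n)])
# Test samples come from the target distribution only (alpha = 1).
X_test = np.vstack([sample_class(n, [2, 2], [0, 0], alpha=1.0),
                    sample_class(n, [-2, -2], [0, 0], alpha=1.0)])
y_test = y_train.copy()
```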
In order to understand the behavior of RS-SPL, we applied Algorithm 1 to this synthetic data and plotted in Figure 4 the selected samples and the learned separating hyperplane during the SPL process. It can be observed that samples from the high-density region of the training distribution are selected first. As the SPL iteration continues, more and more samples with comparatively high confidence are included for training the classifier, and the separating hyperplane tends to be learned more accurately. However, when "hard" samples, i.e., the deviated samples, are included at the later stages of SPL, the learned hyperplane tends to become disordered. This behavior is also substantiated by the accuracy tendency on the test data shown in Figure 5. These results coincide with the SPL learning theory developed in Section 4, which asserts that the optimal expected risk tends to be achieved as a tradeoff between the better approximation capability of increasingly more samples and the worse generalization caused by the divergence of the pace distribution from the target.
We also applied the proposed method to 5 real-world classification datasets: magic8, image, waveform, ringnorm and twonorm9. The numbers of instances and features of each dataset are summarized in Table 1.
Dataset | # Instances | # Features
magic | |
waveform | |
image | |
ringnorm | |
twonorm | |
8. http://archive.ics.uci.edu/ml/datasets.html

9. http://www.raetschlab.org/Members/raetsch/benchmark
We randomly split each dataset into two subsets of equal size for training and testing, respectively. Then we applied the proposed RS-SPL algorithm to train an SVM classifier on the training set, and evaluated its performance in terms of classification accuracy on the test set. The parameters for SVM and RS-SPL were selected via hold-out validation on the training set. We averaged the performance for each dataset over 50 runs, as summarized in Table 2. As a comparison, we also include the results of the batch-trained SVM. We can see that the proposed RS-SPL algorithm improves the classification accuracy over batch training, which validates its effectiveness.
Dataset | Batch Train | SPL Train
magic | |
waveform | |
image | |
ringnorm | |
twonorm | |
We have presented a theoretical explanation for the working insight underlying the CL/SPL paradigm. Specifically, we clarify that the insight of the CL/SPL strategy is to learn the knowledge of the target distribution from samples generated from a training distribution that deviates from the target. We have also argued that such a learning problem tends to arise in real big data scenarios, due to the bias between the subjective understanding of data collectors/annotators and the objective oracle knowledge underlying the data. Besides, our theory suggests the importance in learning of high-confidence/easy samples, which are generally taken as non-support-vectors in traditional learning methods and whose role is more or less underestimated. We further designed a new SPL algorithm with random sampling, which better complies with our theory, and verified its effectiveness by experiments on synthetic and real data.
Our future research includes designing feasible termination conditions for the CL/SPL iteration based on our theory, and deriving the theory under unequal class priors between the positive and negative classes.
[1] S. Basu and J. Christensen, Teaching classification boundaries to humans, Proceedings of the 27th AAAI Conference on Artificial Intelligence, 2013.

[2] Y. Bengio, J. Louradour, R. Collobert and J. Weston, Curriculum learning, Proceedings of the 26th International Conference on Machine Learning, (2009), 41-48.

[3] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2 (2011), 1-27. Software available from: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[4] X. Chen, A. Shrivastava and A. Gupta, NEIL: Extracting visual knowledge from web data, Proceedings of the IEEE International Conference on Computer Vision, (2013), 1409-1416.

[5] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc., 39 (2002), 1-49.

[6] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, New York, NY, USA, 2007.

[7] Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, Proceedings of the 13th International Conference on Machine Learning, 1996.

[8] L. Jiang, D. Y. Meng, T. Mitamura and A. Hauptmann, Easy samples first: Self-paced reranking for multimedia search, Proceedings of the ACM International Conference on Multimedia, (2014), 547-556.

[9] L. Jiang, D. Y. Meng, S. Yu, Z. Z. Lan, S. G. Shan and A. Hauptmann, Self-paced learning with diversity, Advances in Neural Information Processing Systems 27, 2014.

[10] L. Jiang, D. Y. Meng, Q. Zhao, S. G. Shan and A. Hauptmann, Self-paced curriculum learning, Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.

[11] F. Khan, X. Zhu and B. Mutlu, How do humans teach: On curriculum learning and teaching dimension, Advances in Neural Information Processing Systems 24, 2011.

[12] M. Kumar, B. Packer and D. Koller, Self-paced learning for latent variable models, Advances in Neural Information Processing Systems 23, 2010.

[13] M. Kumar, H. Turki, D. Preston and D. Koller, Learning specific-class segmentation from diverse data, Proceedings of the IEEE International Conference on Computer Vision, 2011.

[14] Y. Lee and K. Grauman, Learning the easy things first: Self-paced visual category discovery, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2011), 1721-1728.

[15] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves and J. Welling, Never-ending learning, Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.

[16] M. Mohri, A. Rostamizadeh and A. Talwalkar, Foundations of Machine Learning, The MIT Press, Cambridge, Massachusetts, 2012.

[17] E. Ni and C. Ling, Supervised learning with minimal effort, Advances in Knowledge Discovery and Data Mining, 6119 (2010), 476-487.

[18] J. Supancic and D. Ramanan, Self-paced learning for long-term tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[19] Y. Tang, Y. B. Yang and Y. Gao, Self-paced dictionary learning for image classification, Proceedings of the ACM International Conference on Multimedia, (2012), 833-836.

[20] K. Tang, V. Ramanathan, F. Li and D. Koller, Shifting weights: Adapting object detectors from image to video, Advances in Neural Information Processing Systems 25, 2012.

[21] V. Vapnik, Statistical Learning Theory, Wiley-Interscience, New York, 1998.

[22] S. Yu, L. Jiang, Z. Mao, X. J. Chang, X. Z. Du, C. Gan, Z. Z. Lan, Z. W. Xu, X. C. Li, Y. Cai, A. Kumar, Y. Miao, L. Martin, N. Wolfe, S. C. Xu, H. Li, M. Lin, Z. G. Ma, Y. Yang, D. Y. Meng, S. G. Shan, P. D. Sahin, S. Burger, F. Metze, R. Singh, B. Raj, T. Mitamura, R. Stern and A. Hauptmann, CMU-Informedia@TRECVID 2014 Multimedia Event Detection (MED), TRECVID Video Retrieval Evaluation Workshop, 2014.

[23] Q. Zhao, D. Y. Meng, L. Jiang, Q. Xie, Z. B. Xu and A. Hauptmann, Self-paced matrix factorization, Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.