1. Introduction
Along with the rapid development of Artificial Intelligence (AI) in the past few years, Semi-Supervised Learning (SSL) has attracted widespread attention in the machine learning field [1,2,3,4,5,6]. Various SSL methods have been proposed to achieve promising performance by exploiting both labeled and unlabeled samples. Different from Supervised Learning (SL), SSL employs different assumptions or strategies to explore the unlabeled samples. The following assumptions are often used: (1) smoothness; (2) cluster; (3) manifold; (4) disagreement. SSL has shown superiority over SL in some practical applications, such as object detection [7,8], image classification [9,10,11], speech recognition [12,13], etc. Although SSL methods can achieve promising performance, they fail to consider the harmful effect of the unlabeled samples on classification performance. Previous studies [3,14,15,16,17,18] have verified through theoretical and empirical analysis that risky unlabeled samples can lead to bad consequences. That is, the unlabeled samples may be risky, in which case SSL is inferior to SL. If the risky unlabeled samples cannot be safely explored, the application scope of SSL will be limited to some extent. Therefore, it is worthwhile to design Safe SSL (S3L) which never performs worse than SL.
Up to now, researchers have proposed a few S3L methods which consider the risk of the unlabeled samples. On the whole, different approaches have been proposed to conservatively explore the risky unlabeled samples. In our opinion, these approaches can be summarized as follows: (1) Risky unlabeled samples are first identified through SSL and SL and then conservatively explored by SSL. (2) The risk degrees are first computed and then embedded into some SSL methods. (3) Multiple SSL classifiers are simultaneously learned to decrease the degeneration probability.
In the first strategy, some methods tried to identify the risky unlabeled samples through SSL and SL. If an unlabeled sample was risky, it was classified by SL; otherwise, it was classified by SSL. Thus the final classifier was trained on the labeled samples and on the unlabeled ones with pseudo labels predicted by SL or SSL. In particular, Li and Zhou [19] presented the S3VM_us method, where a hierarchical clustering method was used to select the risky unlabeled samples. The selected samples were then predicted by SVM and the remaining ones by transductive SVM (TSVM) [20]. Hence, S3VM_us should have a smaller degeneration probability than TSVM. Li et al. [21] built a large margin approach named LargE margin grAph quality juDgement (LEAD). LEAD firstly constructed multiple graphs and then performed Graph-based SSL (GSSL) to yield the predictions of the labeled and unlabeled samples. The risky unlabeled samples were identified by an SVM which took the predictions of the multiple GSSL learners as input features. Finally, the pseudo labels of these risky samples were predicted by SVM and the rest were labeled according to the average predictions of the multiple GSSL learners.
In the second strategy, regularization approaches were used to exploit the risky unlabeled samples. Wang and Chen [22] introduced Safety-Aware SSCCM (SA-SSCCM), which was extended from the semi-supervised classification method based on class membership (SSCCM). SA-SSCCM utilized an ℓ2-norm based loss function to restrict the predictions of the unlabeled samples to be close to those of SL (i.e., Least-Square SVM (LS-SVM)). The performance of SA-SSCCM was never significantly inferior to that of LS-SVM and seldom significantly inferior to that of SSCCM. Gan et al. [15] proposed Risk-based Safe Laplacian Regularized Least Squares (RsLapRLS), which assigned risk degrees to different unlabeled samples and embedded a risk-based regularization term into LapRLS to reduce the risk. Wang et al. [23] proposed safe LS_S3VM based on the Adjusted Cluster Assumption (ACA-S3VM), which discussed the negative effect of an inappropriate model assumption (e.g., the cluster assumption). In this method, the unlabeled samples lying near the cluster boundary were found by unsupervised clustering and cautiously explored.
Different from the aforementioned two strategies, which learn one single classifier, the third strategy tried to simultaneously learn multiple semi-supervised classifiers or regressors. Li and Zhou [24] developed safe semi-supervised SVMs (S4VMs). S4VMs constructed multiple low-density separators simultaneously to reduce the risk of identifying a poor separator with unlabeled samples. The performance of S4VMs was never significantly inferior to that of SVM. Furthermore, in order to extend S4VMs to multi-class problems, Thiago et al. [25] proposed a hierarchical bottom-up S4VMs tree scheme to take advantage of S4VMs. Similar to S4VMs, Gan et al. [26] presented a Safety-aware GSSL (SaGSSL) method which tried to learn a safe classifier by constructing multiple graphs. The experimental results demonstrated that SaGSSL could adaptively select good graphs from the candidates and reduce the risk of inappropriate graphs for GSSL. Beyond classification, Li et al. [27] proposed SAFE semi-supervised Regression (SAFER), which was designed to deal with regression problems. SAFER aimed to learn a safe regressor from multiple semi-supervised ones and obtained the desired performance.
Although many different S3L methods have been proposed to safely explore the unlabeled samples and have obtained better performance than SL and SSL, they have the following shortcomings: (1) The risk degrees of the unlabeled samples are given by a pre-defined approach; (2) The harm of the labeled samples (e.g., mislabeled ones) to learning performance is not considered.
To overcome these shortcomings, we present an ℓ1-norm based S3L algorithm which can simultaneously achieve the safe exploitation of the labeled and unlabeled samples. In the proposed algorithm, we utilize an ℓ1-norm based loss function to form the objective function of S3L and use an effective iterative optimization technique to obtain an optimal solution. In each iteration, the weights of the labeled and unlabeled samples are adaptively estimated. The risky samples receive small weights and thus have a small impact on training the final classifier. Hence, the algorithm is expected to alleviate the negative influence of both labeled and unlabeled samples. The main contributions of the proposed algorithm include: (1) The risk of both labeled and unlabeled samples can be reduced by adaptively estimating the weights through the ℓ1 norm; (2) An effective optimization algorithm is introduced to solve the proposed problem. This work is an extended version of our work in [28]. In comparison to [28], we give more details of the existing related methods and the proposed algorithm. Additionally, extensive experiments are performed on more datasets to show the effectiveness of the proposed algorithm, including result discussion and parameter analysis.
The remainder of the paper is organized as follows. In Section 2, we first review the background knowledge (i.e., LapRLS and RsLapRLS). In Section 3, we present the motivation and the details of the proposed algorithm. Section 4 reports results on several UCI and benchmark datasets to verify the effectiveness of the proposed algorithm. Finally, we present conclusions and some future work in Section 5.
2. Background knowledge
2.1. Laplacian regularized least squares
Suppose a dataset X={(x1,y1),⋯,(xl,yl),xl+1,⋯,xn} with l labeled samples and u=n−l unlabeled ones, where xi∈RD and yi∈{−1,1}. To explore the unlabeled samples, LapRLS [29] constructed a local graph W, which was used to approximate the geometric structure of the whole sample set. The graph W was constructed as:
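(a sketch assuming the heat-kernel weighting with width δ, which is often paired with LapRLS; binary 0/1 weights are another common choice, and the weighting in [29] may differ)

$$ W_{ij} = \begin{cases} \exp\!\left(-\dfrac{\|x_i-x_j\|^2}{2\delta^2}\right), & x_i \in N_p(x_j) \ \text{or} \ x_j \in N_p(x_i),\\[4pt] 0, & \text{otherwise}, \end{cases} $$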
here Np(xi) denotes the set of the p nearest neighbors of xi.
After obtaining the graph, LapRLS built a graph-based regularization term and embedded it into RLS. The term was written as:
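(a reconstruction using the standard graph-Laplacian identity, which follows directly from the definitions below)

$$ f^{T} L f = \frac{1}{2}\sum_{i,j=1}^{n} W_{ij}\,\bigl(f(x_i)-f(x_j)\bigr)^{2}, $$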
here f=[f(x1),⋯,f(xn)]T, and L is the graph Laplacian defined as L=D−W with Dii=∑jWij.
The objective function of LapRLS was then written as:
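(a reconstruction following the usual LapRLS formulation in [29]; the normalization constants in front of each term may differ from the original)

$$ \min_{f\in\mathcal{H}_K}\ \sum_{i=1}^{l}\bigl(y_i - f(x_i)\bigr)^{2} + \gamma_A\,\|f\|_K^{2} + \gamma_I\, f^{T} L f, $$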
where ‖⋅‖K is the norm defined in HK which is a Reproducing Kernel Hilbert Space (RKHS) associated with a Mercer kernel K:X×X→R. γA and γI are regularization parameters. When γI is equal to 0, LapRLS will boil down to RLS.
According to the Representer Theorem [29], the decision function was represented as
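(the standard kernel expansion over all n samples)

$$ f^{*}(x) = \sum_{i=1}^{n} \alpha_i^{*}\, k(x_i, x), $$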
where α∗ denoted the optimal value of α and k(⋅,⋅) was the Mercer kernel.
Using this expansion, Eq (2.3) was rewritten as:
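Substituting f = Kα, so that the outputs on the labeled samples are Klα, gives (a reconstruction consistent with the definitions that follow):

$$ \min_{\alpha\in\mathbb{R}^{n}}\ \|Y - K_l\alpha\|_2^{2} + \gamma_A\,\alpha^{T} K \alpha + \gamma_I\,\alpha^{T} K L K \alpha, $$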
where Y=[y1,⋯,yl]T was the vector of given labels. The Gram matrix K was calculated using a Mercer kernel, with entries Kij=k(xi,xj). Kl consisted of the first l rows of K and Ku of the last u rows.
By setting the derivative of Eq (2.5) to zero, we can obtain the following solution:
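Under the reconstructed form of Eq (2.5) above, the closed form is:

$$ \alpha^{*} = \bigl(K_l^{T} K_l + \gamma_A K + \gamma_I K L K\bigr)^{-1} K_l^{T} Y. $$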
2.2. Risk-based safe laplacian regularized least squares
In order to achieve the safe exploitation of the unlabeled samples, RsLapRLS utilized an ℓ2 norm to define the loss function. The loss function constrained the outputs of the unlabeled samples to stay close to those of RLS. Hereafter, we denote the supervised classifier obtained by RLS by g(x). The objective function of RsLapRLS was given as:
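(a reconstruction consistent with the description above and with the matrix form in Eq (2.10) below)

$$ \min_{f\in\mathcal{H}_K}\ \sum_{i=1}^{l}\bigl(y_i - f(x_i)\bigr)^{2} + \gamma_A\,\|f\|_K^{2} + \gamma_I\, f^{T} L f + \lambda \sum_{j=l+1}^{n} s_j\,\bigl(f(x_j) - g(x_j)\bigr)^{2}, $$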
where λ was a regularization parameter and sj was the risk degree of the unlabeled sample xj. Specifically, RsLapRLS became RLS when γI and λ were both 0, and became LapRLS when λ=0.
From the above equation, the output of RsLapRLS was a balance between that of RLS and LapRLS.
A risk degree estimation equation was pre-defined for the unlabeled sample xj.
here, the consistency (cs(xj)) and confidence (cf(xj)) were calculated as:
where ŷj and ȳj were respectively the outputs of LapRLS and RLS.
cs(xj) denoted the prediction consistency of xj between LapRLS and RLS. If the predictions were consistent, cs(xj) was equal to 1, otherwise -1. cf(xj) represented the prediction confidence or difference of xj between LapRLS and RLS.
As shown in Eqs (2.8) and (2.9), the risk degree sj was small when cs(xj)=1 and cf(xj)>0, or cs(xj)=1 and cf(xj)=0. In this case, the unlabeled sample xj should be exploited through the semi-supervised approach. Otherwise, the risk degree sj was large and the output of xj should approach that of RLS, as enforced by the last term in Eq (2.7).
According to Eq (2.4), the objective function Eq (2.7) was then rewritten as:
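In matrix form, with f = Kα as before (a reconstruction consistent with the reconstructed Eq (2.7)):

$$ \min_{\alpha}\ \|Y - K_l\alpha\|_2^{2} + \gamma_A\,\alpha^{T} K \alpha + \gamma_I\,\alpha^{T} K L K \alpha + \lambda\,(K_u\alpha - \bar{Y})^{T} S\,(K_u\alpha - \bar{Y}), $$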
where Ȳ=[g(xl+1),⋯,g(xn)]T was the vector of RLS outputs and S was a diagonal matrix with Sjj=sj+l, i.e., the risk degree of the j-th unlabeled sample.
The derivative of Eq (2.10) with respect to α could be obtained and set to zero. The solution of RsLapRLS was then given as:
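Carrying out this calculation on the reconstructed Eq (2.10) yields:

$$ \alpha^{*} = \bigl(K_l^{T} K_l + \gamma_A K + \gamma_I K L K + \lambda K_u^{T} S K_u\bigr)^{-1}\bigl(K_l^{T} Y + \lambda K_u^{T} S \bar{Y}\bigr). $$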
3. Proposed algorithm
3.1. Motivation
As discussed in Section 1, the existing S3L methods mainly consider the harm of the unlabeled samples. Firstly, the risk degrees of the unlabeled samples are defined in advance in S3L. If inappropriate risk degrees are assigned to the unlabeled samples, the unlabeled samples will not be safely exploited. Therefore, it is important to investigate an adaptive weight computing method. Secondly, the risk of the labeled samples should also be considered; in some cases, samples may be mislabeled by the experts. In the next subsection, we present how to solve the above problems using the ℓ1 norm.
3.2. Formulation and solution
For convenience, the objective function of RsLapRLS is written as:
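In this compact notation, Eq (2.7) reads (a reconstruction equivalent to the reconstructed Eq (2.7); the Hadamard product collects the risk-weighted squared deviations):

$$ J(f) = \|Y - f_l\|_2^{2} + \gamma_A\,\|f\|_K^{2} + \gamma_I\, f^{T} L f + \lambda\, s^{T}\bigl((f_u - \bar{Y}) \circ (f_u - \bar{Y})\bigr), $$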
where fl and fu respectively denote the outputs on the labeled and unlabeled samples, s=[sl+1,⋯,sn] denotes the risk degrees of the unlabeled samples, and A∘B represents the Hadamard product of the vectors A and B.
In this paper, an ℓ1 norm is used in place of the ℓ2 norm. Hence, the objective function of the proposed algorithm is written as:
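Replacing the squared losses with absolute deviations gives (a reconstruction consistent with the definitions given after Eq (3.3); note that the pre-defined risk degrees s no longer appear, since the weights will be estimated adaptively below):

$$ \min_{f\in\mathcal{H}_K}\ \sum_{i=1}^{l}\bigl|f(x_i) - y_i\bigr| + \gamma_A\,\|f\|_K^{2} + \gamma_I\, f^{T} L f + \lambda \sum_{j=l+1}^{n}\bigl|f(x_j) - g(x_j)\bigr|. $$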
Equation (3.2) can be written as:
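Merging the two ℓ1 terms into a single weighted sum:

$$ J(f) = \sum_{i=1}^{n} r_i\,\bigl|f(x_i) - \tilde{y}_i\bigr| + \gamma_A\,\|f\|_K^{2} + \gamma_I\, f^{T} L f, $$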
where ỹi=yi and ri=1 if i∈{1,⋯,l}, and ỹi=g(xi) and ri=λ if i∈{l+1,⋯,n}.
The derivative of Eq (3.3) with respect to f(xi) is obtained as:
where μi = 1/(2|f(xi)−ỹi| + ε), with ε a small positive constant that avoids division by zero.
If μi is fixed, Eq (3.4) can be considered as the derivative with respect to f(xi) of the following function:
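Namely, the weighted least-squares surrogate (a reconstruction consistent with Eqs (3.3) and (3.4)):

$$ \tilde{J}(f) = \sum_{i=1}^{n} r_i\,\mu_i\,\bigl(f(x_i) - \tilde{y}_i\bigr)^{2} + \gamma_A\,\|f\|_K^{2} + \gamma_I\, f^{T} L f. $$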
As is known, a closed-form solution of Eq (3.5) can be obtained. Since μi depends on f(xi), an alternating optimization approach is utilized to solve the optimization problem (3.3). In each iteration, f and μi are updated in turn. Because the computation formula of μi has already been given, we next discuss how to calculate f when μi is known.
Based on Eq (2.4), Eq (3.5) can be denoted as:
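With f = Kα, this reads:

$$ \min_{\alpha}\ (K\alpha - \tilde{Y})^{T} R\,(K\alpha - \tilde{Y}) + \gamma_A\,\alpha^{T} K \alpha + \gamma_I\,\alpha^{T} K L K \alpha, $$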
where Ỹ=[ỹ1,⋯,ỹn]T and R is a diagonal matrix with Rii=μiri.
The derivative of Eq (3.6) with respect to α is obtained and set to zero, which yields the following equation:
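Assuming K is invertible, so that a common factor of K can be dropped from the optimality condition, the update has the closed form:

$$ \alpha^{*} = \bigl(R K + \gamma_A I + \gamma_I L K\bigr)^{-1} R\,\tilde{Y}. $$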
It is worth pointing out that the weight μi reflects the risk degree of the labeled and unlabeled samples. If the difference between the prediction and the given label of a labeled sample is large, the labeled sample may be mislabeled and risky; in this case, the corresponding μi will be small. Meanwhile, an unlabeled sample may be safe if its output approaches that of RLS (i.e., g(x)), and the weight μi is then large. Otherwise, the unlabeled sample is regarded as risky and the weight μi will be small. Hence the risky labeled and unlabeled samples have a small impact on training the final classifier. Overall, the weighting is expected to yield a safe exploitation of both kinds of samples.
The iterative framework of the proposed algorithm is shown in Algorithm 1.
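To make the iteration concrete, a minimal NumPy sketch of this scheme is given below. It is an illustration under stated assumptions, not the authors' reference implementation: the function name, the weight initialization, the convergence tolerance and max_iter are hypothetical choices, and the Gram matrix K and graph Laplacian L are assumed to be precomputed.

```python
import numpy as np

def l1_safe_ssl(K, L, y_labeled, g_unlabeled, gamma_A, gamma_I, lam,
                eps=1e-6, max_iter=15):
    """Sketch of the proposed l1-norm S3L iteration.

    K : (n, n) Gram matrix;  L : (n, n) graph Laplacian;
    y_labeled : (l,) labels in {-1, +1};
    g_unlabeled : (u,) outputs of the supervised classifier g(x)
    on the unlabeled samples.  Returns the coefficient vector alpha.
    """
    l, u = len(y_labeled), len(g_unlabeled)
    n = l + u
    # Targets y_tilde and fixed weights r_i, as defined after Eq (3.3).
    y_tilde = np.concatenate([y_labeled, g_unlabeled])
    r = np.concatenate([np.ones(l), lam * np.ones(u)])

    mu = np.ones(n)          # initial adaptive weights (an assumption)
    I = np.eye(n)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        R = np.diag(r * mu)
        # Closed-form update of alpha, following Eq (3.7).
        alpha_new = np.linalg.solve(R @ K + gamma_A * I + gamma_I * L @ K,
                                    R @ y_tilde)
        f = K @ alpha_new
        # Adaptive weight update from Eq (3.4): risky samples get small mu.
        mu = 1.0 / (2.0 * np.abs(f - y_tilde) + eps)
        if np.linalg.norm(alpha_new - alpha) < 1e-6:
            break
        alpha = alpha_new
    return alpha
```

A prediction on a new sample x then follows the kernel expansion f(x)=∑i αi k(xi,x) of Eq (2.4).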
3.3. The convergence analysis
Next, we will give a convergence analysis of Algorithm 1. Eq (3.5) can be written as
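(a reconstruction of Eq (3.5) in terms of α, consistent with Eq (3.6))

$$ \tilde{J}(\alpha) = \sum_{i=1}^{n} r_i\,\mu_i\,\bigl(K_i\alpha - \tilde{y}_i\bigr)^{2} + \gamma_A\,\alpha^{T} K \alpha + \gamma_I\,\alpha^{T} K L K \alpha, $$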
here Ki is the i-th row of the matrix K.
First, the optimal solution α of the optimization problems in Eqs (3.8) and (3.6) can be obtained through Eq (3.7). Hence, in the t-th iteration, α(t) is the optimal solution of Eq (3.8). We have
Since μi = 1/(2|f(xi)−ỹi|), the above inequality can be written as
Because of the elementary inequality |a| − a²/(2|b|) ≤ |b| − b²/(2|b|), which holds for any a and any b ≠ 0 since it is equivalent to (|a| − |b|)² ≥ 0, and because ri is a nonnegative constant, we have
Through Eqs (3.10) and (3.12), the following inequality is achieved.
From Eqs (3.2), (3.3) and (3.13), one can see that J(f) is monotonically decreasing and nonnegative. Therefore, Algorithm 1 converges.
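For completeness, the key step can be sketched as follows (a reconstruction; superscripts index iterations, Ω(f) = γA‖f‖²K + γI fᵀLf, and the small ε in μi is dropped). Since α(t) minimizes Eq (3.8) under the weights from the previous iterate, Eqs (3.9) and (3.10) amount to

$$ \sum_{i} r_i \frac{\bigl(f^{(t)}(x_i)-\tilde{y}_i\bigr)^{2}}{2\,\bigl|f^{(t-1)}(x_i)-\tilde{y}_i\bigr|} + \Omega\bigl(f^{(t)}\bigr) \;\le\; \sum_{i} r_i \frac{\bigl(f^{(t-1)}(x_i)-\tilde{y}_i\bigr)^{2}}{2\,\bigl|f^{(t-1)}(x_i)-\tilde{y}_i\bigr|} + \Omega\bigl(f^{(t-1)}\bigr), $$

and adding the summed elementary inequality of Eqs (3.11) and (3.12) to both sides yields Eq (3.13), i.e., J(f^(t)) ≤ J(f^(t−1)).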
4. Experimental analysis
Next, we perform experiments to show the behavior of the proposed algorithm. The datasets used are selected from the UCI* repository and the benchmark datasets in [1]. Firstly, each dataset is randomly divided into two subsets: one containing 10 or 100 labeled samples and the other containing the remaining unlabeled ones. This division process is repeated 30 times for the UCI datasets and 12 times for the benchmark datasets. The details of each dataset are described in Table 1. In the experiments, the following methods are used:
*http://archive.ics.uci.edu/ml/
● SL: RLS, SVM, LS-SVM.
● SSL: LapRLS, TSVM, SSCCM.
● S3L: S3VM_us, S4VMs, SA-SSCCM, RsLapRLS.
For the UCI datasets, the parameters C1 and C2 in S4VMs are fixed to 1 and 0.1, respectively, and λ1, λ2 and η in SA-SSCCM are set to 100, 1 and 1, respectively. For the benchmark datasets, C1 and C2 in S4VMs are set to 100 and 0.1, respectively, and λ1, λ2 and η in SA-SSCCM are set to 100, 0.1 and 1, respectively. ϵ in S3VM_us is set to 0.1. γl in RLS is fixed to 0.05, and p in LapRLS, RsLapRLS and the proposed algorithm is set to 5. γA, γI and λ in LapRLS, RsLapRLS and the proposed algorithm are obtained by leave-one-out cross validation in the case of 10 labeled samples and by 5-fold cross validation in the case of 100 labeled samples. γA and γI are selected from the set {10−6,10−4,10−2,1,10,100} and λ is selected from {10−6,10−4,10−2,0.1,0.5,0.9,1,5,9}. The linear and Gaussian kernels are used to calculate the Gram matrix K. In the Gaussian kernel, the width parameter δ is set to the average distance between samples in the case of 10 labeled samples, and is selected by 5-fold cross validation from {0.25δ,0.5δ,δ,2δ,4δ} (δ being the average distance between samples) in the case of 100 labeled samples.
Tables 2 and 3 report the classification results of the different methods. The third-to-last row reports the average accuracies of the different methods. The second-to-last row reports the win/tie/loss (W/T/L) counts where SSL performs better than, comparable to, or worse than the corresponding SL. The last row reports the W/T/L counts where S3L performs better than, comparable to, or worse than the corresponding SSL. From these tables, the following observations can be made:
(a) According to the average accuracy, the proposed algorithm performs better than the other SL and SSL methods. On the one hand, this demonstrates that the unlabeled samples can generally boost the learning performance; on the other hand, it shows that the proposed algorithm can be used to learn an effective semi-supervised classifier.
(b) In terms of the W/T/L counts, different SSL methods perform worse than the corresponding SL methods in some cases. Specifically, LapRLS performs worse than RLS in 18 out of 32 cases. This shows the negative influence of the unlabeled samples on the learning performance.
(c) In terms of the W/T/L counts, the S3L methods are never significantly inferior to the corresponding SL methods. In particular, RsLapRLS and the proposed algorithm perform better than or comparable to RLS in all cases. This indicates that the proposed algorithm can reduce the risk of the unlabeled samples.
(d) Finally, the proposed algorithm is superior to and more robust than RsLapRLS in all cases in terms of the average accuracy and standard deviation. We can conclude that the proposed ℓ1-norm strategy is effective in dealing with the harm of the labeled samples. Since the weights of both labeled and unlabeled samples are adaptively estimated, it further alleviates the negative impact caused by the artificial risk degree assignment in RsLapRLS.
Additionally, we carry out experiments to analyze the convergence of the proposed algorithm. Figures 1 and 2 plot the convergence curves and the corresponding testing accuracy at each iteration on the G241n and USPS datasets. From the two figures, one can see that the proposed algorithm usually converges within 15 iterations. This shows that the proposed algorithm is efficient and converges quickly.
Furthermore, we give a parameter analysis of the impact of λ on the performance of the proposed algorithm. The parameter λ reflects the importance of the regularizer constraining the outputs of the risky unlabeled samples. Its value is selected among {10−6,10−4,10−2,0.1,0.5,0.9,1,5,9}. Figures 3–6 report the accuracies of the proposed algorithm as the parameter λ changes. From these figures, one can see that the proposed algorithm obtains the best results over a wide range of λ values, which extends the practicability of the proposed algorithm to some extent.
5. Conclusions
On the whole, we have investigated an ℓ1-norm based S3L algorithm. This algorithm utilizes the ℓ1 norm instead of the ℓ2 norm to define the objective function, which can effectively reduce the risk of both labeled and unlabeled samples. In comparison to RsLapRLS, which uses the ℓ2 norm, the performance of the proposed algorithm is better and more robust. However, the proposed algorithm is designed for two-class problems, while multi-class problems often occur in applications (e.g., face recognition). Additionally, compared to label information, pairwise constraints are more easily collected. Hence, designing pairwise-constraint S3L methods and solving multi-class problems will be our future work.
Acknowledgments
This work is supported by the High-level Talents Fund of Hubei University of Technology under grant No. GCRC2020016, Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province under grant No. 2020E10010-02, Open Research Fund Program of State Key Laboratory of Biocatalysis and Enzyme Engineering No. SKLBEE2020020, National Natural Science Foundation of China under grant No. 61601162, Shenzhen Institute of Artificial Intelligence and Robotics for Society.