
In this study, we introduce a robust safe semi-supervised learning framework tailored for high-dimensional data classification. The framework comprises several key components. First, we construct a risk regularization term to gauge the uncertainty and potential risks associated with unlabeled samples in semi-supervised learning. Second, we define a new non-second-order statistical measure in the kernel space, termed Cp-Loss, which is symmetric, bounded, and non-negative, and effectively suppresses the influence of noise points and outliers on model performance. Building on this learning framework, we develop a robust safe semi-supervised extreme learning machine (RS3ELM) and derive its generalization bound using Rademacher complexity. The output weight matrix of RS3ELM is optimized via a fixed-point iteration technique, and we theoretically analyze the convergence and computational complexity of RS3ELM. Empirical results on multiple benchmark datasets demonstrate the effectiveness of RS3ELM in comparison with several state-of-the-art semi-supervised learning models.
Citation: Jun Ma, Xiaolong Zhu. Robust safe semi-supervised learning framework for high-dimensional data classification[J]. AIMS Mathematics, 2024, 9(9): 25705-25731. doi: 10.3934/math.20241256
With the exponential growth of computing technology and increased access to diverse data sources, the availability of information has skyrocketed. To effectively extract the relevant information from this vast pool, machine learning has emerged as a crucial tool. In supervised learning, a large amount of labeled data is used to train a model, which is then employed to make predictions on unlabeled data. However, when labeled data are scarce, the trained model may overfit and generalize poorly. In numerous practical scenarios, there exists a considerable amount of unlabeled data alongside only a small subset of labeled data, since labeling data can be a laborious and costly process. Utilizing the limited labeled data together with the vast pool of unlabeled data to improve performance therefore poses a significant challenge. Over the past decade, semi-supervised learning (SSL) has emerged as a highly effective learning framework, showcasing remarkable success in both theory and practical applications [1,2]. Obtaining labeled samples is often a challenging and expensive task, while acquiring unlabeled samples tends to be easier and more cost-effective in many real-world problems. SSL tackles this issue by leveraging different assumptions to establish connections between labeled and unlabeled samples. Among these assumptions, the manifold assumption [1] has gained significant traction and is widely adopted in practice. It assumes that the data reside on a low-dimensional manifold, facilitating the exploration and utilization of unlabeled data for improved learning outcomes. For instance, Belkin et al. [1] introduced two algorithms, namely Laplacian regularized least squares (Lap-RLS) and Laplacian support vector machines (Lap-SVM), which utilize manifold regularization to effectively harness information from unlabeled samples; these algorithms have demonstrated promising results. However, recent studies have highlighted a potential concern: unlabeled samples can be unreliable and may diminish the performance of SSL [3,4,5], which limits the practical applicability of SSL to some extent [6,7]. Consequently, there is a need to develop safe semi-supervised learning (SaSSL) methods that are guaranteed never to perform worse than their supervised learning (SL) counterparts trained only on labeled samples [8,9,10,11,12]. In recent years, numerous outstanding safe semi-supervised learning methods have been proposed [13,14,15,16].
Extreme learning machines (ELM) [17,18,19] have been extensively researched as single hidden layer feedforward networks (SLFNs). ELM has gained traction in the field of machine learning due to its simple architecture, low computational requirements, and wide applicability [20,21,22]. Additionally, ELM addresses drawbacks associated with conventional neural networks, including issues like local minima, imprecise learning, and slow convergence. Moreover, ELM offers a unified learning framework that caters to various applications such as regression, binary, and multiclass classification [23]. Recently, a semi-supervised ELM algorithm based on manifold regularization has been proposed to effectively leverage both labeled and unlabeled samples [24]. Although the results of ELM have been promising, it is worth mentioning that existing algorithms typically employ the mean square error (MSE) criterion as the cost function, while the impact of sample noise and outliers remains understudied. In MSE-based criteria, an equal penalty is assigned to all samples. However, samples with outliers or non-Gaussian noise tend to exhibit larger errors, leading to higher penalties for these samples. Consequently, the generalization performance suffers in the presence of non-Gaussian noise or outliers. This issue is particularly critical in semi-supervised learning, where misclassifications of labeled data easily affect nearby unlabeled data. Recently, correntropy has emerged as a novel criterion for addressing non-Gaussian noise and outliers [25,26,27,28]. Correntropy, a local similarity measure in kernel space, offers an effective mechanism to mitigate the impact of noise and outliers by assigning lower weights to data outside the local neighborhood, thereby enhancing robustness. In the areas of resilient learning and signal processing, such as adaptive filtering [26], state estimation [27], and principal component analysis [28], correntropy-based criteria have shown promising results [29,30,31,32]. Correntropy-based criteria have also been used in a number of extreme learning machine (ELM) algorithms, including [33,34,35,36]. In order to obtain robust learning performance when addressing outliers, Chen et al. [29] introduced the kernel mean p-power error (KMPE), a non-second-order statistical metric in kernel space. The regularized correntropy criterion (RCC) was developed by Xing and Wang to help ELM handle noise or outliers in the training data [30]. In parallel, Yang et al. [37] proposed a new correntropy-based semi-supervised ELM technique, known as RC-SSELM. According to experimental findings, RC-SSELM performs impressively in semi-supervised learning scenarios with non-Gaussian noise and outliers. Chen et al. [32] established the maximum mixture correntropy criterion (MMCC) for ELM to achieve robust generalization performance, presenting the mixture correntropy approach, which employs a combination of two or more kernel functions, to enhance the performance of ELM. A new semi-supervised ELM algorithm based on the MMCC optimization strategy was proposed by Yang et al. [35], which increases the algorithm's effectiveness and adaptability for handling large and complex outliers. Experimental results on several benchmark datasets were reported to demonstrate the efficacy of MC-SSELM.
The algorithm's superiority in terms of performance and adaptability was also shown through comparisons with a number of cutting-edge semi-supervised learning methods.
It can be observed that none of the existing correntropy-based semi-supervised ELM algorithms has focused on the uncertainty and potential risks of the unlabeled samples in the semi-supervised learning process. Inspired by the above research, in this paper we first propose a robust and safe semi-supervised learning framework. Then, on the basis of this learning framework, a robust safe semi-supervised extreme learning machine (RS3ELM) is proposed. In more specific terms, the main contributions of this paper are as follows:
(1) Targeting uncertainty and potential risk of unlabeled samples in semi-supervised learning, a risk regularization term is constructed, which can better handle the uncertainty and potential risks, thereby improving the security and reliability of the learning algorithm.
(2) Based on the theory of correntropy, a new robust loss function is defined in the kernel space, which is called the p-power C-loss. The influence of noise points and outliers on the performance of the model can be effectively suppressed by this loss function.
(3) A robust and safe semi-supervised learning framework is established based on the risk regularization term and the p-power C-loss.
(4) A robust safe semi-supervised extreme learning machine (RS3ELM) for pattern classification is proposed based on this learning framework.
(5) The generalization bound of RS3ELM is given by using Rademacher complexity. The fixed-point iteration method is used to solve RS3ELM, and the computational complexity and convergence of the algorithm are analyzed from a theoretical point of view.
(6) Experimental results show that RS3ELM performs significantly better than existing semi-supervised ELM learning algorithms on multiple benchmark datasets.
The remainder of this paper is organized as follows. In Section 2, we review related work, including supervised learning, extreme learning machines (ELM), semi-supervised learning, and the semi-supervised extreme learning machine (SS-ELM). In Section 3, we detail our algorithm. Experimental results and analysis are presented in Section 4. Finally, the main conclusions of this work are summarized in Section 5, where we also discuss future developments.
Let $T_l=\{x_i,y_i\}_{i=1}^{l}$ be the training set, where $l$ is the number of training samples, $x_i\in\mathbb{R}^{n}$, and $y_i\in\{-1,+1\}$ $(i=1,\ldots,l)$. Suppose that $f(\cdot)$ represents the decision function, and let $\mathcal{H}_K$ denote the reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel $K$. For a general decision function $f$, the basic supervised learning framework can be expressed as
\begin{equation} \min\limits_{f\in\mathcal{H}_K}\frac{1}{l}\sum\limits_{i=1}^{l}\rho_{loss}(x_i,y_i,f(x_i))+\gamma_A\|f\|_{\mathcal{H}}^{2}, \end{equation} | (2.1) |
where $\rho_{loss}(\cdot)$ is any loss function, such as the hinge loss $\max\{0,1-y_if(x_i)\}$; $\|f\|_{\mathcal{H}}^{2}$ is the RKHS norm penalty and represents the complexity of functions in the RKHS $\mathcal{H}_K$.
Under the assumption that $L$ is the number of neurons in the hidden layer, the output function of ELM [17,18] is
\begin{equation} Y=H\beta, \end{equation} | (2.2) |
where $\beta=[\beta_1,\beta_2,\ldots,\beta_L]^{T}$ is the vector of output weights between the hidden layer of $L$ nodes and the output node, $Y=[y_1,y_2,\ldots,y_l]^{T}$, and $H$ is the output matrix of the hidden layer defined as
\begin{equation*} H=\begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1)\\ \vdots & \ddots & \vdots\\ h_1(x_l) & \cdots & h_L(x_l) \end{bmatrix}, \end{equation*}
where $h_i(x)=G(a_i,b_i,x)=a_i\cdot x+b_i$, $i=1,\ldots,L$ ($a_i$ and $b_i$ can be randomly generated according to a continuous probability distribution). The primal regularization ELM framework can be expressed as
\begin{equation} \min\limits_{\beta}\Psi(\beta)=C\|H\beta-Y\|^{2}+\|\beta\|^{2}, \end{equation} | (2.3) |
where $C$ is a penalty coefficient for the errors in the training process. Then, the output weight vector $\beta$ can be obtained by
\begin{equation} \beta^{*}=\begin{cases} \left(H^{T}H+\frac{I_L}{C}\right)^{-1}H^{T}Y, & l\geq L,\\ H^{T}\left(HH^{T}+\frac{I_l}{C}\right)^{-1}Y, & l\leq L, \end{cases} \end{equation} | (2.4) |
where $I_L$ is an identity matrix of dimension $L$ and $I_l$ is an identity matrix of dimension $l$.
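To make the closed-form solution (2.4) concrete, the following NumPy sketch trains a basic regularized ELM. It only illustrates the formulas above; the tanh activation, the uniform initialization of the random weights, and the function names are our own assumptions, not the paper's implementation.

```python
import numpy as np

def train_elm(X, Y, L=200, C=1.0, rng=None):
    """Minimal regularized ELM sketch following Eqs (2.2)-(2.4).

    X: (l, n) training inputs; Y: (l, m) targets (e.g., one-hot labels).
    Returns the random hidden-layer parameters and the output weights beta.
    """
    rng = np.random.default_rng(rng)
    l, n = X.shape
    a = rng.uniform(-1.0, 1.0, size=(n, L))   # random input weights (assumption)
    b = rng.uniform(-1.0, 1.0, size=(1, L))   # random biases (assumption)
    H = np.tanh(X @ a + b)                    # hidden-layer output matrix, shape (l, L)

    if l >= L:   # Eq (2.4), first case
        beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ Y)
    else:        # Eq (2.4), second case
        beta = H.T @ np.linalg.solve(H @ H.T + np.eye(l) / C, Y)
    return a, b, beta

def predict_elm(X, a, b, beta):
    return np.tanh(X @ a + b) @ beta          # Eq (2.2): Y = H beta
```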
Let $T=T_l\cup T_u=\{x_i,y_i\}_{i=1}^{l}\cup\{x_i\}_{i=l+1}^{l+u}$ be a semi-supervised training dataset, where $x_i\in\mathbb{R}^{n}$ and $y_i\in\{-1,+1\}$; $T_l$ denotes the set of $l$ labeled samples, $T_u$ denotes the set of $u$ unlabeled samples, and $n=l+u$. For a general decision function $f$, the general semi-supervised learning framework can be expressed as the following optimization problem:
\begin{equation} \min\limits_{f\in\mathcal{H}_K}\frac{1}{l}\sum\limits_{i=1}^{l}\rho_{loss}(x_i,y_i,f(x_i))+\gamma_A\|f\|_{\mathcal{H}}^{2}+\gamma_I\|f\|_{I}^{2}, \end{equation} | (2.5) |
where $\rho_{loss}(\cdot)$ is some loss function, $\|f\|_{\mathcal{H}}^{2}$ is the RKHS norm penalty and represents the complexity of functions in the RKHS $\mathcal{H}_K$, $\gamma_A$ and $\gamma_I$ are nonnegative regularization parameters, and $\|f\|_{I}^{2}$ is the manifold regularizer (MR) [1], whose empirical form is
\begin{equation} \|f\|_{I}^{2}=\frac{1}{(l+u)^{2}}\sum\limits_{i,j=1}^{l+u}W_{ij}(f(x_i)-f(x_j))^{2}=\frac{1}{(l+u)^{2}}\mathit{\boldsymbol{f}}^{T}\mathit{\boldsymbol{L}}\mathit{\boldsymbol{f}}, \end{equation} | (2.6) |
where $\mathit{\boldsymbol{L}}=D-W$ is the graph Laplacian; $D$ is the diagonal degree matrix of $W$ given by $D_{ii}=\sum_{j=1}^{l+u}W_{ij}$ and $D_{ij}=0$ for $i\neq j$; and the normalizing coefficient $\frac{1}{(l+u)^{2}}$ is the natural scale factor for the empirical estimate of the Laplace operator. The weight matrix $W$ may be defined by $k$ nearest neighbors or graph kernels as follows:
\begin{equation} W_{ij}=\begin{cases} \exp\left(-\frac{\|x_i-x_j\|_{2}^{2}}{2\sigma^{2}}\right), & \text{if } x_i\in N_k(x_j) \text{ or } x_j\in N_k(x_i),\\ 0, & \text{otherwise}, \end{cases} \end{equation} | (2.7) |
where $N_k(x_i)$ denotes the set of $k$ nearest neighbors of $x_i$.
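A minimal sketch of how the weight matrix of Eq (2.7) and the Laplacian $\mathit{\boldsymbol{L}}=D-W$ could be assembled is given below; the brute-force neighbor search and the symmetrization step are illustrative choices, not taken from the paper.

```python
import numpy as np

def graph_laplacian(X, k=5, sigma=1.0):
    """Build the kNN Gaussian weight matrix W of Eq (2.7) and L = D - W."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(D2, np.inf)              # exclude self-neighbors

    W = np.zeros((n, n))
    knn = np.argsort(D2, axis=1)[:, :k]       # indices of the k nearest neighbors
    for i in range(n):
        for j in knn[i]:
            w = np.exp(-D2[i, j] / (2.0 * sigma**2))
            W[i, j] = W[j, i] = w             # symmetrize: x_i in N_k(x_j) or vice versa

    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian L = D - W
    return W, L
```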
Consequently, the primal problem of the semi-supervised extreme learning machine (SS-ELM), obtained by introducing a manifold regularization term into (2.3), is
\begin{equation} \min\limits_{\beta}\Psi(\beta)=C\|H_l\beta-Y\|^{2}+\|\beta\|^{2}+\lambda\mathrm{Tr}(\beta^{T}H_n^{T}\mathit{\boldsymbol{L}}H_n\beta), \end{equation} | (2.8) |
where $C$ and $\lambda$ are regularization parameters and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. Thus, we have
\begin{equation} \beta^{*}=\begin{cases} \left(I_L+H_l^{T}CH_l+\lambda H_n^{T}\mathit{\boldsymbol{L}}H_n\right)^{-1}H_l^{T}CY, & n\geq L,\\ H_l^{T}\left(I_n+CH_lH_l^{T}+\lambda \mathit{\boldsymbol{L}}H_nH_n^{T}\right)^{-1}CY, & n\leq L, \end{cases} \end{equation} | (2.9) |
where $I_L$ and $I_n$ are identity matrices of dimension $L$ and $n$, respectively, and $C$ is a diagonal matrix whose entries are $C_{jj}=\frac{C}{l_{t_j}}$, where $l_{t_j}$ is the number of labeled training samples belonging to the class of the $j$-th labeled sample, $j=1,2,\ldots,l$.
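Continuing the sketch, the SS-ELM output weights of Eq (2.9) (the $n\geq L$ branch) could be computed as follows; the interface and the way the class-dependent penalty matrix $C$ is passed in are assumptions made for illustration.

```python
import numpy as np

def train_ss_elm(Hl, Hn, Y, L_graph, class_counts, C0=1.0, lam=0.1):
    """Sketch of the SS-ELM output weights, Eq (2.9), n >= L branch.

    Hl: (l, L) hidden outputs of labeled samples; Hn: (n, L) of all samples.
    Y:  (l, m) label matrix; class_counts[j]: #labeled samples in the class of sample j.
    """
    l, L = Hl.shape
    C = np.diag(C0 / np.asarray(class_counts, dtype=float))   # C_jj = C0 / l_{t_j}
    A = np.eye(L) + Hl.T @ C @ Hl + lam * Hn.T @ L_graph @ Hn
    beta = np.linalg.solve(A, Hl.T @ (C @ Y))
    return beta
```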
In this section, we recall the definition of correntropy [25] and briefly introduce some of its properties.
Definition 1. [25] Given two random variables X and Y, the correntropy is defined as:
\begin{equation} V(X,Y)=\mathbb{E}\left[\langle\Phi(X),\Phi(Y)\rangle_{\mathcal{H}}\right]=\int\langle\Phi(x),\Phi(y)\rangle_{\mathcal{H}}\,dF_{XY}(x,y)=\mathbb{E}[\kappa(X,Y)], \end{equation} | (2.10) |
where $\mathbb{E}[\cdot]$ is the expectation operator, $F_{XY}(x,y)$ is the joint distribution function, and $\Phi(x)=\kappa(x,\cdot)$ is a nonlinear mapping induced by a Mercer kernel $\kappa(\cdot)$, which transforms $x$ from the original space to a functional Hilbert space (or kernel space) $\mathcal{H}$ equipped with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ satisfying $\langle\Phi(x),\Phi(y)\rangle_{\mathcal{H}}=\kappa(x,y)$. It is obvious that $V(X,Y)=\mathbb{E}[\kappa(X,Y)]$. Throughout this article, unless otherwise noted, the kernel function is a Gaussian kernel, given by
\begin{equation} \kappa(x,y)=\kappa_{\sigma}(x-y)=\exp\left(-\frac{(x-y)^{2}}{2\sigma^{2}}\right) \end{equation} | (2.11) |
with σ being the kernel bandwidth.
In practice, if the given samples are $\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$ but the joint probability density function of the data is unknown, the correntropy can be approximated by
\begin{equation} V_{m,\sigma}(X,Y)=\frac{1}{m}\sum\limits_{i=1}^{m}\kappa_{\sigma}(x_i-y_i). \end{equation} | (2.12) |
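The sample estimator (2.12) is straightforward to evaluate; a minimal sketch:

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Sample estimate of correntropy, Eq (2.12): mean Gaussian kernel of the errors."""
    e = np.asarray(x) - np.asarray(y)
    return np.mean(np.exp(-e**2 / (2.0 * sigma**2)))

# e.g., correntropy([1.0, 2.0, 3.0], [1.1, 1.9, 3.2], sigma=0.5)
```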
Property 1. [25] Correntropy is symmetric, i.e., $V_{\sigma}(X,Y)=V_{\sigma}(Y,X)$.
Property 2. [25] The value of correntropy is positive and bounded, and the upper bound is reached if and only if $X=Y$.
Property 3. [25] Assume that the probability density function of the sample $\{(x_i,y_i),i=1,\ldots,m\}$ is $f_{X,Y}(x,y)$, and define the error random variable $E=X-Y$. Let $f_{E,\sigma}(e)$ be the Parzen estimate of the error probability density function obtained from the data sample $\{e_i=x_i-y_i,\ i=1,\ldots,m\}$. Then $V_{m,\sigma}(X,Y)$ equals the value of $f_{E,\sigma}(e)$ evaluated at the point $e=0$:
\begin{equation} V_{m,\sigma}(X,Y)=f_{E,\sigma}(0). \end{equation} | (2.13) |
Definition 2. [33,34] Given two random variables X and Y, the C-Loss C(X,Y) is defined by
\begin{equation} C(X,Y)=\frac{1}{2}\mathbb{E}\left[\|\Phi(X)-\Phi(Y)\|_{\mathcal{H}}^{2}\right]=\frac{1}{2}\mathbb{E}\left[2\kappa_{\sigma}(0)-2\kappa_{\sigma}(X-Y)\right]=\mathbb{E}\left[1-\kappa_{\sigma}(X-Y)\right]. \end{equation} | (2.14) |
It holds that $C(X,Y)=1-V(X,Y)$, so the minimization of the C-Loss is equivalent to the maximization of correntropy. The MCC has recently attracted increasing attention due to its robustness to large outliers [33,34,35,36].
In this section, we systematically describe our approach. Typically, semi-supervised learning leverages a large volume of unlabeled data alongside a small amount of labeled data. However, research indicates that the inclusion of unlabeled samples can sometimes degrade the performance of semi-supervised classifiers due to the inherent uncertainties and risks associated with unlabeled data. Moreover, manifold regularization-based semi-supervised learning frameworks often consider only the local manifold geometry of the samples, while the global information is overlooked. Our core methodology involves constructing a risk degree regularization term to assess the uncertainty and potential risk of unlabeled data during the semi-supervised learning process. Intuitively, the risk associated with unlabeled samples is low when they contribute positively to semi-supervised learning; conversely, when the risk is high, a supervised classifier should be used to predict their labels. Additionally, we introduce a novel non-second-order statistical measure in the kernel space, referred to as Cp-Loss. The Cp-Loss is symmetric, bounded, and non-negative, which effectively reduces the impact of outliers and noise on the model's performance. Furthermore, we propose a robust safe semi-supervised extreme learning machine (RS3ELM) based on this framework. We derive the generalization bound of RS3ELM using Rademacher complexity, the output weight matrix of RS3ELM is optimized through a fixed-point iteration technique, and our analysis also covers the convergence and computational complexity of RS3ELM.
In this section, we will provide a more detailed explanation of our algorithm. Our algorithm is based on two main methods: the SL method and the collaborative representation-based classification (CRC) method. Initially, we utilize the SL method and CRC method to reconstruct the unlabeled samples. This involves using the labeled samples to train a model that can reconstruct the features of the unlabeled samples. By doing this, we aim to capture the underlying structure and patterns in the unlabeled data. Furthermore, to assess the risk of the unlabeled samples, we compare the original and reconstructed versions of the unlabeled samples. By examining how well the reconstructed samples match the original ones, we obtain a measure of the risk associated with the unlabeled data. To incorporate these risk measures into our learning framework, we define risk-based regularization terms. These regularization terms serve as constraints that guide the learning process. By embedding these terms into the semi-supervised learning framework, we ensure that the resulting model strikes a balance between supervised and semi-supervised learning approaches. Therefore, the outputs of our learning framework represent a compromise between the information obtained from the available labeled samples and the reconstruction-based risk assessment of the unlabeled samples. This approach allows us to leverage the benefits of both supervised and semi-supervised learning, resulting in a more robust and effective learning algorithm.
To begin with, we train an ELM classifier using the labeled samples $X_l$. Using this trained classifier, we can then make predictions $y_u$ for the unlabeled samples $x_u$; these predictions are subsequently used to assess the risk associated with each unlabeled sample. Next, we employ the collaborative representation-based classification (CRC) method to reconstruct the unlabeled samples. This involves utilizing the labeled samples $X_l^{y_u}$ from the same class $y_u$ to reconstruct the unlabeled samples. The objective function of CRC can be defined as follows:
\begin{equation} \min\limits_{\delta_j}\left(1-\exp\left(-\frac{\|x_j-X_l^{y_j}\delta_j\|^{2}}{2\sigma^{2}}\right)\right)+\lambda\|\delta_j\|^{2}, \end{equation} | (3.1) |
where λ is a regularization parameter.
Let $L_1(\delta_j)=\frac{\|x_j-X_l^{y_j}\delta_j\|^{2}}{2\sigma^{2}}+\lambda\|\delta_j\|^{2}$ and $L_2(\delta_j)=\frac{\|x_j-X_l^{y_j}\delta_j\|^{2}}{2\sigma^{2}}-1+\exp\left(-\frac{\|x_j-X_l^{y_j}\delta_j\|^{2}}{2\sigma^{2}}\right)$. Thus, the optimization problem (3.1) can be rewritten as
\begin{equation} \min\limits_{\delta_j}L_1(\delta_j)-L_2(\delta_j). \end{equation} | (3.2) |
Clearly, the functions $L_1(\delta_j)$ and $L_2(\delta_j)$ are both convex. This means that the optimization problem given by Eq (3.2) is a DC (difference of convex) programming problem, where the differentiable convex component is $L_2(\delta_j)$. By utilizing DC programming techniques, we can find the optimal value of $\delta_j$, denoted as $\delta_j^{*}$. With this optimal value, we can reconstruct the unlabeled samples using the following expression:
\begin{equation} \bar{x}_j=(X_l^{y_j})^{T}\delta_j^{*}. \end{equation} | (3.3) |
Moreover, we apply the ELM classifier to classify the reconstructed samples $\bar{x}_j$ and obtain the predicted labels $\bar{y}_j$.
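The reconstruction step (3.1)–(3.3) can be illustrated with a simplified DCA-style iteration that repeatedly linearizes the convex component $L_2$ and solves the resulting ridge-regression-like subproblem. The fixed iteration count, the initialization at zero, and the convention that the labeled samples of the predicted class are stacked as columns are our assumptions; the paper only states that DC programming techniques are used.

```python
import numpy as np

def crc_reconstruct(x, X_cls, sigma=1.0, lam=0.1, n_iter=20):
    """Simplified DCA sketch for problem (3.1)-(3.2).

    x: (d,) unlabeled sample; X_cls: (d, m) labeled samples of the predicted class
       stacked as columns, so the reconstruction is X_cls @ delta.
    """
    d, m = X_cls.shape
    delta = np.zeros(m)
    A = X_cls.T @ X_cls / sigma**2 + 2.0 * lam * np.eye(m)     # Hessian of L1
    for _ in range(n_iter):
        resid = X_cls @ delta - x
        r = resid @ resid / (2.0 * sigma**2)
        # gradient of the differentiable convex part L2 at the current iterate
        g = (X_cls.T @ resid / sigma**2) * (1.0 - np.exp(-r))
        # minimize L1(delta) - <g, delta>: a ridge-regression-like linear system
        delta = np.linalg.solve(A, X_cls.T @ x / sigma**2 + g)
    return X_cls @ delta        # reconstructed sample, cf. Eq (3.3)
```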
Definition 3. Denote $y_j$ and $\bar{y}_j$ as the predictions of $x_j$ and $\bar{x}_j$, respectively. The degree of risk (RD) of the unlabeled samples can be defined as follows:
\begin{equation} r_j=\begin{cases} \exp\left\{-\frac{\|x_j-\bar{x}_j\|^{2}}{\sigma}\right\}, & \text{if } y_j=\bar{y}_j,\\ \exp\left\{\frac{\|x_j-\bar{x}_j\|^{2}}{\sigma}\right\}, & \text{otherwise.} \end{cases} \end{equation} | (3.4) |
Based on Eq (3.4), when the two predictions are identical ($y_j=\bar{y}_j$), the unlabeled sample can be considered safe; in this case, as the reconstruction error increases, $r_j$ is reduced to limit any potential risk. On the other hand, if the predictions differ ($y_j\neq\bar{y}_j$), the unlabeled sample is potentially risky; in such instances, $r_j$ is increased with the error to counteract this risk.
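A direct transcription of the risk degree in Eq (3.4) might look as follows (single-sample version, illustrative only):

```python
import numpy as np

def risk_degree(x, x_bar, y_pred, y_bar_pred, sigma=1.0):
    """Risk degree of one unlabeled sample, Eq (3.4)."""
    err = np.sum((np.asarray(x) - np.asarray(x_bar))**2) / sigma
    return np.exp(-err) if y_pred == y_bar_pred else np.exp(err)
```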
Definition 4. Let $g(x)$ and $f(x)$ be the outputs of the supervised classifier and the semi-supervised classifier, respectively. The risk-based regularization term $\Xi_R(f)$ aims to find a trade-off between $g(x)$ and $f(x)$ and can be expressed as follows:
\begin{equation} \Xi_R(f)=\sum\limits_{j=l+1}^{n}r_j\|f(x_j)-g(x_j)\|^{2}. \end{equation} | (3.5) |
Here, $r_j$ represents the risk degree of the unlabeled sample $x_j$.
Based on the above analysis, it is evident that ΞR(f) plays a crucial role in determining the degree of disparity between supervised learning and semi-supervised learning.
Definition 5. The Cp-Loss between two random variables X and Y can be defined as follows:
\begin{eqnarray} C_p(X,Y)& = &2^{-p}\mathbb{E}\left[\|\Phi(X)-\Phi(Y)\|_{\mathcal{H}}^{2p}\right]=2^{-p}\mathbb{E}\left[\left(\|\Phi(X)-\Phi(Y)\|_{\mathcal{H}}^{2}\right)^{p}\right]\\ & = &2^{-p}\mathbb{E}\left[\left(2\kappa_{\sigma}(0)-2\kappa_{\sigma}(X-Y)\right)^{p}\right]=\mathbb{E}\left[\left(1-\kappa_{\sigma}(X-Y)\right)^{p}\right]. \end{eqnarray} | (3.6) |
Here, p>0 is the power parameter.
Remark 1. If p=1, then the Cp-Loss will degenerate into the C-Loss. In other words, C-Loss is a special case of Cp-Loss.
Remark 2. Given $N$ samples $\{x_i,y_i\}_{i=1}^{N}$, the empirical Cp-Loss can be easily obtained as
\begin{equation} \hat{C}_p(X,Y)=\frac{1}{N}\sum\limits_{i=1}^{N}\left(1-\kappa_{\sigma}(x_i-y_i)\right)^{p}. \end{equation} | (3.7) |
The curves of our proposed loss function under different parameters are shown in Figure 1.
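For reference, the empirical Cp-Loss of Eq (3.7) is equally simple to evaluate; setting p = 1 in this sketch recovers the empirical C-Loss, in line with Remark 1:

```python
import numpy as np

def cp_loss(x, y, sigma=1.0, p=1.5):
    """Empirical Cp-Loss, Eq (3.7); p = 1 reduces to the C-Loss of Eq (2.14)."""
    e = np.asarray(x) - np.asarray(y)
    return np.mean((1.0 - np.exp(-e**2 / (2.0 * sigma**2)))**p)
```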
Our Cp(X,Y) loss has the following interesting properties:
Property 4. Symmetry: $C_p(X,Y)=C_p(Y,X)$.
Property 5. Non-negative boundedness: $0\leq C_p(X,Y)<1$, and the lower bound is attained if and only if $X=Y$.
Property 6. When $\sigma\rightarrow\infty$, we have
\begin{equation} C_p(X,Y)\approx(2\sigma^{2})^{-p}\mathbb{E}\left[\|X-Y\|^{2p}\right]. \end{equation} | (3.8) |
Property 7. When $p\rightarrow 0$, we have
\begin{equation} C_p(X,Y)\approx 1+p\,\mathbb{E}\left[\log\left(1-\kappa_{\sigma}(X-Y)\right)\right]. \end{equation} | (3.9) |
We can define the robust safe semi-supervised learning framework as follows:
\begin{equation} \min\limits_{f\in\mathcal{H}_K}\left\{\hat{C}_p(x_i,y_i,f(x_i))+\gamma_A\|f\|_{\mathcal{H}}^{2}+\gamma_I\|f\|_{I}^{2}+\gamma_R\Xi_R(f)\right\}. \end{equation} | (3.10) |
In Eq (3.10), the regularization parameters $\gamma_A>0$, $\gamma_I>0$, and $\gamma_R>0$ are used. The first three terms in Eq (3.10) are responsible for finding the semi-supervised classifier. The last term controls the trade-off between supervised and semi-supervised learning. The objective function in Eq (3.10) possesses the following characteristics:
(1) The empirical risk is defined in the first term of Eq (3.10), which evaluates how well the model fits the training samples.
(2) The second term of Eq (3.10) represents the structural risk, which ensures the generalization capability of the model and prevents it from overfitting.
(3) The third term of Eq (3.10) introduces a joint regularization term. This term not only utilizes discriminative information more effectively but also explores the local geometric structure of the samples to enhance the classification performance. Under this joint regularization, samples that are close to each other on the manifold are encouraged to share the same class label.
(4) The fourth term of Eq (3.10) incorporates risk-based regularization. This term controls the trade-off between supervised and semi-supervised learning. The choice of risk degrees determines how the unlabeled samples are utilized.
Building upon our robust safe semi-supervised learning framework, we introduce a novel approach named robust safe semi-supervised extreme learning machine (RS3ELM), which is described as follows:
\begin{equation} \min\ \hat{C}_p(Y_l,H_l\beta)+\lambda\|\beta\|_{F}^{2}+\gamma\mathrm{Tr}(\beta^{T}H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\alpha\sum\limits_{j=l+1}^{l+u}r_j\|f(x_j)-g(x_j)\|^{2}. \end{equation} | (3.11) |
In this framework, we incorporate three regularization parameters: λ, γ, and α. These parameters contribute to the definition of semi-supervised classifiers, while the last term is responsible for ensuring a suitable balance between ELM and SS-ELM.
\begin{eqnarray} \min\limits_{\beta}\Gamma(\beta)& = &\hat{C}_p(H_l\beta,T)+\lambda\|\beta\|_{F}^{2}+\gamma\mathrm{Tr}(\beta^{T}H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\alpha\sum\limits_{j=l+1}^{l+u}r_j\|f(x_j)-g(x_j)\|^{2}\\ & = &\frac{1}{l}\sum\limits_{i=1}^{l}\left(1-\kappa_{\sigma}(e_i)\right)^{p}+\lambda\|\beta\|_{F}^{2}+\gamma\mathrm{Tr}(\beta^{T}H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\alpha(H_u\beta-H_u\beta_{ELM})^{T}R(H_u\beta-H_u\beta_{ELM})\\ & = &\frac{1}{l}\sum\limits_{i=1}^{l}\left(1-\exp\left(-\frac{e_i^{2}}{2\sigma^{2}}\right)\right)^{p}+\lambda\|\beta\|_{F}^{2}+\gamma\mathrm{Tr}(\beta^{T}H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\alpha(H_u\beta-H_u\beta_{ELM})^{T}R(H_u\beta-H_u\beta_{ELM}). \end{eqnarray} | (3.12) |
In this formulation, $\beta_{ELM}$ denotes the optimal solution obtained from ELM, $H_u$ the hidden layer output matrix of the unlabeled samples, and $R$ a diagonal matrix with entries $R_{jj}=r_{j+l}$. The derivative of Eq (3.12) with respect to $\beta$ satisfies
\begin{eqnarray} \frac{\partial\Gamma}{\partial\beta} = 0&\Rightarrow&\frac{1}{l}\sum\limits_{i=1}^{l}\left[-\frac{p}{\sigma^{2}}(1-\kappa_{\sigma}(e_i))^{p-1}\kappa_{\sigma}(e_i)e_ih_i^{T}\right]+2\lambda\beta+2\gamma(H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+2\alpha H_u^{T}R(H_u\beta-H_u\beta_{ELM}) = 0\\ &\Rightarrow&\sum\limits_{i=1}^{l}\left[-(1-\kappa_{\sigma}(e_i))^{p-1}\kappa_{\sigma}(e_i)e_ih_i^{T}\right]+\frac{2\sigma^{2}l\lambda}{p}\beta+\frac{2\sigma^{2}l\gamma}{p}(H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\frac{2\sigma^{2}l\alpha}{p}H_u^{T}R(H_u\beta-H_u\beta_{ELM}) = 0\\ &\Rightarrow&\sum\limits_{i=1}^{l}\left(\varphi(e_i)h_i^{T}h_i\beta-\varphi(e_i)t_ih_i^{T}\right)+\hat{\lambda}\beta+\hat{\gamma}(H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\hat{\alpha}H_u^{T}R(H_u\beta-H_u\beta_{ELM}) = 0\\ &\Rightarrow&\sum\limits_{i=1}^{l}\left(\varphi(e_i)h_i^{T}h_i\beta\right)+\hat{\lambda}\beta+\hat{\gamma}(H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\hat{\alpha}H_u^{T}R(H_u\beta) = \sum\limits_{i=1}^{l}\left(\varphi(e_i)t_ih_i^{T}\right)+\hat{\alpha}H_u^{T}RH_u\beta_{ELM}. \end{eqnarray} | (3.13) |
Thus, we can get
\begin{equation} \beta=\left[H^{T}AH+\hat{\lambda}I+\hat{\gamma}H_n^{T}\mathit{\boldsymbol{L}}H_n+\hat{\alpha}H_u^{T}RH_u\right]^{-1}\left(H^{T}AT+\hat{\alpha}H_u^{T}RH_u\beta_{ELM}\right), \end{equation} | (3.14) |
where $\hat{\lambda}=\frac{2\sigma^{2}l\lambda}{p}$, $\hat{\gamma}=\frac{2\sigma^{2}l\gamma}{p}$, $\hat{\alpha}=\frac{2\sigma^{2}l\alpha}{p}$, $h_i$ is the $i$-th row of $H$, and $A$ is a diagonal matrix with diagonal elements $A_{ii}=\varphi(e_i)=(1-\kappa_{\sigma}(e_i))^{p-1}\kappa_{\sigma}(e_i)$.
The optimal solution can thus be expressed as $\beta=[H^{T}AH+\hat{\lambda}I+\hat{\gamma}H_n^{T}\mathit{\boldsymbol{L}}H_n+\hat{\alpha}H_u^{T}RH_u]^{-1}(H^{T}AT+\hat{\alpha}H_u^{T}RH_u\beta_{ELM})$. However, this expression does not provide a closed-form solution, because the right-hand side depends on the weight vector $\beta$ through the error term $e_i=t_i-h_i\beta$. Therefore, it is essentially a fixed-point equation, and the optimal solution can be obtained using a fixed-point iterative algorithm.
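A possible NumPy sketch of this fixed-point iteration is given below. It follows Eq (3.14): at each step the residuals are recomputed, the diagonal matrix $A$ is refreshed, and the linear system is re-solved. Initializing at $\beta_{ELM}$, the parameter names lam_h, gam_h, alp_h (for $\hat{\lambda}$, $\hat{\gamma}$, $\hat{\alpha}$), and the stopping tolerance are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rs3elm_beta(Hl, Hu, Hn, T, L_graph, R, beta_elm,
                sigma=1.0, p=1.5, lam_h=0.1, gam_h=0.1, alp_h=0.1,
                n_iter=10, tol=1e-3):
    """Sketch of the fixed-point iteration solving Eq (3.14).

    Hl/Hu/Hn: hidden-layer outputs of labeled/unlabeled/all samples;
    T: (l, m) labeled targets; R: diagonal risk matrix; beta_elm: supervised ELM weights.
    """
    L = Hl.shape[1]
    beta = beta_elm.copy()
    # the beta-independent part of the system matrix and right-hand side
    M_fix = lam_h * np.eye(L) + gam_h * Hn.T @ L_graph @ Hn + alp_h * Hu.T @ R @ Hu
    rhs_fix = alp_h * Hu.T @ R @ (Hu @ beta_elm)
    for _ in range(n_iter):
        E = T - Hl @ beta                                       # residuals e_i (row-wise)
        k = np.exp(-np.sum(E**2, axis=1) / (2.0 * sigma**2))    # kappa_sigma(e_i)
        a = (1.0 - k)**(p - 1) * k                              # diagonal entries of A
        HlA = Hl.T * a                                          # = Hl^T A (broadcast)
        beta_new = np.linalg.solve(HlA @ Hl + M_fix, HlA @ T + rhs_fix)
        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```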
Given a test set $X_{new}$, we can calculate its corresponding hidden layer output matrix $H_{new}$, and the prediction result can be obtained as follows:
\begin{equation} Y=H_{new}\beta^{*}. \end{equation} | (3.15) |
Based on the above discussion, our algorithm will be presented in Algorithm 1.
Algorithm 1 RS3ELM
Input: $l$ labeled examples $\{(x_i,y_i)\}_{i=1}^{l}$; $u$ unlabeled examples $\{x_i\}_{i=1}^{u}$; regularization parameters $\gamma_A$, $\gamma_I$, and $\gamma_R$.
Output: The decision function $f^{*}(x)=\sum_{i=1}^{l+u}\beta_i^{*}K(x_i,x)$ of RS3ELM.
1: Learn the ELM and predict the output $y_u$ of each unlabeled instance $x_u$ using ELM;
2: calculate $\bar{x}_u$ through (3.3);
3: compute the risk degree of the unlabeled samples using Eq (3.4);
4: construct the data adjacency graph $G$ with $(l+u)$ nodes using $k$ nearest neighbors, and construct the weight matrix $W$;
5: construct $\mathit{\boldsymbol{L}}$ using $W$;
6: initiate an ELM network of $L$ hidden neurons with random input weights and biases, and calculate the output matrices of the hidden neurons $H_l$, $H_u$, and $H_n$;
7: compute the output weights $\beta^{*}$ using Eq (3.14).
Theorem 1. [38] Based on Huber's robust estimation theory, a loss function $\Phi(Z)$ is called a robust loss function if it satisfies the following conditions:
● $\Phi(Z)\geq 0$ and $\Phi(0)=0$;
● $\forall Z\in\mathbb{R}$, $\Phi(Z)=\Phi(-Z)$;
● $\forall Z\geq 0$, $\Phi'(Z)\geq 0$;
● $\Phi(Z)$ is second-order differentiable in $\mathbb{R}^{+}$, and $\Phi''(0^{+})>0$;
● $\Phi(\sqrt{Z})$ is concave in $\mathbb{R}^{+}$.
If these conditions hold, then there exists a convex function $\Psi(q)$ such that
\begin{equation} \Phi(Z)=\inf\limits_{q>0}\left\{\frac{1}{2}qZ^{2}+\Psi(q)\right\},\quad \forall Z. \end{equation} | (3.16) |
When $Z$ is fixed, a minimum solution $q^{*}$ exists on the right-hand side of (3.16):
\begin{equation} \inf\limits_{q>0}\left\{\frac{1}{2}qZ^{2}+\Psi(q)\right\}=\frac{1}{2}q^{*}Z^{2}+\Psi(q^{*}), \end{equation} | (3.17) |
where $q^{*}=\begin{cases} \frac{\Phi'(Z)}{Z}, & Z>0,\\ \Phi''(0^{+}), & Z=0,\\ \frac{\Phi'(-Z)}{-Z}, & Z<0. \end{cases}$
Proposition 1. For the function $G(z)=\exp\left(-\frac{\|z\|^{2}}{2\sigma^{2}}\right)$, it is possible to find a convex function $\psi:\mathbb{R}\rightarrow\mathbb{R}$ such that the following equation holds:
\begin{equation} G(z)=\sup\limits_{\vartheta\in\mathbb{R}^{-}}\left(\vartheta\frac{\|z\|^{2}}{2\sigma^{2}}-\psi(\vartheta)\right). \end{equation} | (3.18) |
In the above equation, $\mathbb{R}^{-}$ represents the set of negative real numbers, and $\sup(\cdot)$ denotes the supremum. Additionally, for a fixed $z$, the supremum is attained at $\vartheta=-G(z)$.
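For concreteness, one admissible construction for Proposition 1 in the Gaussian case (our own illustration via a standard conjugate/half-quadratic argument, not taken from the paper) is
\begin{equation*} \psi(\vartheta)=-\vartheta\log(-\vartheta)+\vartheta,\qquad \vartheta\in\mathbb{R}^{-}. \end{equation*}
Writing $a=\frac{\|z\|^{2}}{2\sigma^{2}}$ and setting the derivative of $\vartheta a-\psi(\vartheta)$ with respect to $\vartheta$ to zero gives $a+\log(-\vartheta)=0$, i.e., $\vartheta^{*}=-e^{-a}=-G(z)$, and substituting back,
\begin{equation*} \vartheta^{*}a-\psi(\vartheta^{*})=-ae^{-a}+ae^{-a}+e^{-a}=e^{-a}=G(z), \end{equation*}
so the supremum in (3.18) is indeed attained at $\vartheta=-G(z)$, as stated.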
To simplify the expression, let us define the objective function of the proposed method as follows:
\begin{equation} \Gamma(\beta)=\min\limits_{\beta}\frac{1}{l}\sum\limits_{i=1}^{l}\left(1-\exp\left(-\frac{e_i^{2}}{2\sigma^{2}}\right)\right)^{p}+\lambda\|\beta\|_{F}^{2}+\gamma\mathrm{Tr}(\beta^{T}H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\alpha(H_u\beta-H_u\beta_{ELM})^{T}R(H_u\beta-H_u\beta_{ELM}). \end{equation} | (3.19) |
According to Proposition 1, we can rewrite (3.19) as
\begin{equation} \hat{\Gamma}(\beta,\vartheta)=\min\limits_{\beta,\vartheta}\frac{1}{l}\sum\limits_{i=1}^{l}\left(1-\vartheta_i\frac{\|t_i^{T}-h_i\beta\|^{2}}{2\sigma^{2}}+\psi(\vartheta_i)\right)^{p}+\lambda\|\beta\|_{F}^{2}+\gamma\mathrm{Tr}(\beta^{T}H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\alpha(H_u\beta-H_u\beta_{ELM})^{T}R(H_u\beta-H_u\beta_{ELM}), \end{equation} | (3.20) |
where $\vartheta=(\vartheta_1,\vartheta_2,\cdots,\vartheta_l)$.
Thus, given a fixed β, the local optimal solution of (3.20) is
\begin{equation} \vartheta_i=(1-G(e_i))^{p-1}G(e_i), \end{equation} | (3.21) |
where $e_i=t_i^{T}-h_i\beta$, $i=1,2,\cdots,l$.
Then, we have $\Gamma(\beta)=\min_{\vartheta}\hat{\Gamma}(\beta,\vartheta)$, and $\beta$ can be derived as
\begin{equation} \beta=\arg\min\limits_{\beta}\frac{1}{l}\sum\limits_{i=1}^{l}\left(1-\vartheta_i\frac{\|t_i^{T}-h_i\beta\|^{2}}{2\sigma^{2}}\right)^{p}+\lambda\|\beta\|_{F}^{2}+\gamma\mathrm{Tr}(\beta^{T}H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\alpha(H_u\beta-H_u\beta_{ELM})^{T}R(H_u\beta-H_u\beta_{ELM})-\mathrm{Const}, \end{equation} | (3.22) |
where Const is a positive constant. Setting the derivative of (3.22) with respect to $\beta$ to zero yields $\sum_{i=1}^{l}(\varphi(e_i)h_i^{T}h_i\beta)+\hat{\lambda}\beta+\hat{\gamma}(H_n^{T}\mathit{\boldsymbol{L}}H_n\beta)+\hat{\alpha}H_u^{T}R(H_u\beta)=\sum_{i=1}^{l}(\varphi(e_i)t_ih_i^{T})+\hat{\alpha}H_u^{T}RH_u\beta_{ELM}$, where $\varphi(e_{i}) = (1-\kappa_{\sigma}(e_{i}))^{p-1}\kappa_{\sigma}(e_{i})$. Therefore, the output weight vector $\beta$ can be calculated by the following formula
\begin{equation*} \mathit{\boldsymbol{\beta}} = [\mathit{\boldsymbol{H}}^{T}\mathit{\boldsymbol{AH}}+\hat{\lambda}\mathit{\boldsymbol{I}}+\hat{\gamma}\mathit{\boldsymbol{H}}_{n}^{T}\mathit{\boldsymbol{L}}\mathit{\boldsymbol{H}}_{n}+\hat{\alpha}\mathit{\boldsymbol{H}}_{u}^{T}\mathit{\boldsymbol{R}}\mathit{\boldsymbol{H}}_{u}]^{-1} (\mathit{\boldsymbol{H}}^{T}\mathit{\boldsymbol{AT}}+\hat{\alpha}\mathit{\boldsymbol{H}}_{u}^{T}\mathit{\boldsymbol{R}}\mathit{\boldsymbol{H}}_{u}\mathit{\boldsymbol{\beta}}_{ELM}), \end{equation*}
where $\mathit{\boldsymbol{A}}$ is a diagonal matrix with diagonal elements $A_{ii} = \varphi(e_{i}) = (1-\kappa_{\sigma}(e_{i}))^{p-1}\kappa_{\sigma}(e_{i})$.
Assume that at iteration k , there is
\begin{equation} \vartheta_{i}^{k} = (1-G(t_{i}^{T}-h_{i}\mathit{\boldsymbol{\beta}}^{k-1}))^{p-1}G(t_{i}^{T}-h_{i}\mathit{\boldsymbol{\beta}}^{k-1}), i = 1, 2, \cdots, l . \end{equation} | (3.23) |
\begin{equation} \beta^{k} = \arg\min\limits_{\mathit{\boldsymbol{\beta}}}\hat{\Gamma}(\mathit{\boldsymbol{\beta}}, \vartheta^{k}). \end{equation} | (3.24) |
Therefore, we can obtain
\begin{equation} \Gamma(\beta^{k}) = \min\limits_{\vartheta}\hat{\Gamma}(\mathit{\boldsymbol{\beta}}^{k}, \vartheta) = \hat{\Gamma}(\mathit{\boldsymbol{\beta}}^{k}, \vartheta^{k+1}) . \end{equation} | (3.25) |
For a fixed \mathit{\boldsymbol{\beta}}^{k} , \hat{\Gamma}(\beta^{k}, \vartheta^{k+1}) is the minimum value with respect to \vartheta . Therefore,
\begin{eqnarray} \hat{\Gamma}(\mathit{\boldsymbol{\beta}}^{k}, \vartheta^{k+1}) &\leq& \hat{\Gamma}(\mathit{\boldsymbol{\beta}}^{k}, \vartheta^{k}), \end{eqnarray} | (3.26) |
\begin{eqnarray} \hat{\Gamma}(\mathit{\boldsymbol{\beta}}^{k}, \vartheta^{k+1}) &\leq&\hat{\Gamma}(\mathit{\boldsymbol{\beta}}^{k-1}, \vartheta^{k}), \end{eqnarray} | (3.27) |
\begin{eqnarray} \Gamma(\mathit{\boldsymbol{\beta}}^{k-1}) & = &\hat{\Gamma}(\mathit{\boldsymbol{\beta}}^{k-1}, \vartheta^{k}). \end{eqnarray} | (3.28) |
Therefore, we have
\begin{equation} \Gamma(\mathit{\boldsymbol{\beta}}^{k})\leq \Gamma(\mathit{\boldsymbol{\beta}}^{k-1}). \end{equation} | (3.29) |
In the RS3ELM algorithm, the computational complexity primarily resides in the matrix update ( \mathit{\boldsymbol{R}} ), weight vector update ( \beta ), graph Laplacian construction, and the computation of the nearest neighbors ( k ). The main computational cost of each iteration is determined by the \mathcal{O}(L^{3}) computation required to update \beta .
Thus, the approximate computational complexity of the RS3ELM algorithm is given by \mathcal{O}(T\cdot(L^{3}+(l + u)^{2}\log(l + u)+2(l+u)^{2})) . Based on our experimental results, a satisfactory choice for the iteration count T is 10 .
In this section, we provide theoretical bounds for the generalization error of the proposed method based on the Rademacher complexity. Let us define the empirical Rademacher complexity of function class \mathcal{F} , denoted by \hat{E}_n(\mathcal{F}) , for a sample \{x_1, \ldots, x_n\} generated by a distribution D , where \mathcal{F} is a real-valued function class with domain \mathcal{X} .
\begin{equation} \hat{E}_{n}(\mathcal{F}) = \mathbb{E}_{\mathit{\boldsymbol{\sigma}}}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{2}{n}\sum\limits_{i = 1}^{n}\sigma_{i}f(x_{i})\right|\right]. \end{equation} | (3.30) |
The expectation is taken over \mathit{\boldsymbol{\sigma}} = (\sigma_{1}, \sigma_{2}, \ldots, \sigma_{n})^{T} , where \sigma_i \in \{-1, +1\} are independent uniform random variables. The Rademacher random variables satisfy \mathit{\boldsymbol{P}}\{\sigma_i = -1\} = \mathit{\boldsymbol{P}}\{\sigma_i = +1\} = \frac{1}{2} .
Based on the above, we can express the Rademacher complexity of \mathcal{F} , denoted by E_n(\mathcal{F}) , as follows:
\begin{equation} E_{n}(\mathcal{F}) = \mathbb{E}_{\mathit{\boldsymbol{x}}}[\hat{E}_{n}(\mathcal{F})] = \mathbb{E}_{\mathit{\boldsymbol{x\sigma}}}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{2}{n}\sum\limits_{i = 1}^{n}\sigma_{i}f(x_{i})\right|\right]. \end{equation} | (3.31) |
Here, the expectation is taken over the joint distribution of \mathit{\boldsymbol{x}} and \mathit{\boldsymbol{\sigma}} .
Theorem 2. [37] Let \mathcal{F}:\mathcal{Z} = \mathcal{X}\times \mathcal{Y}\mapsto[-1, +1] be a class of functions. Let n samples be drawn independently from a distribution D . Then, with a probability of at least 1-\theta , for every f\in\mathcal{F} , we have
\begin{equation} \mid Err(f)-\hat{Err}(f)\mid\leq\hat{E}_{n}(\mathcal{F})+3\sqrt{\frac{\ln(\frac{2}{\theta})}{2n}}, \end{equation} | (3.32) |
where Err(f) and \hat{Err}(f) denote the expected error and the empirical error of f , respectively.
If we can compute the empirical Rademacher complexity \hat{E}_{n}(\mathcal{F}) for the function class \mathcal{F} , then this bound is applicable to a wide range of learning algorithms. Additionally, for kernelized algorithms, it is relatively simple to bound the empirical Rademacher complexity using the trace of the kernel matrix.
Theorem 3. [37] For a kernel \mathit{\boldsymbol{K}}:\mathcal{X}\times \mathcal{X}\mapsto\mathbb{R} and a sample \{x_{1}, \ldots, x_{n}\} from \mathcal{X} , the empirical Rademacher complexity of the class \mathcal{F}(B) satisfies the following condition: If the norm \|f\|_{\mathcal{H}}\leq B , then
\begin{equation} \hat{E}_{n}(\mathcal{F}(B))\leq\frac{2B}{n}\sqrt{\sum\limits_{i = 1}^{n}\mathit{\boldsymbol{K}}(x_{i}, x_{i})}. \end{equation} | (3.33) |
If for all x\in\mathcal{X} , we have \mathit{\boldsymbol{K}}(x, x)\leq T^{2} and \mathit{\boldsymbol{K}} is a standard kernel, the inequality above can be rewritten as
\begin{equation} \hat{E}_{n}(\mathcal{F}(B))\leq\frac{2B}{n}\sqrt{\sum\limits_{i = 1}^{n}\mathit{\boldsymbol{K}}(x_{i}, x_{i})}\leq2B\sqrt{\frac{T^{2}}{n}}. \end{equation} | (3.34) |
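As an illustrative special case (our own worked instance, not stated in the paper): for the Gaussian kernel of Eq (2.11) we have \mathit{\boldsymbol{K}}(x,x)=\kappa_{\sigma}(0)=1 for every x , so T = 1 and the bound of Theorem 3 becomes
\begin{equation*} \hat{E}_{n}(\mathcal{F}(B))\leq\frac{2B}{n}\sqrt{\sum\limits_{i = 1}^{n}\mathit{\boldsymbol{K}}(x_{i}, x_{i})} = \frac{2B}{n}\sqrt{n} = \frac{2B}{\sqrt{n}}, \end{equation*}
i.e., the empirical Rademacher complexity decays at a rate of O(1/\sqrt{n}) in the sample size.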
Theorem 4. (Generalization bound of RS3ELM) Suppose Err(f) and \hat{Err}(f) are the expected error and the empirical error of RS3ELM, l and u are the sets of labeled and unlabeled examples, respectively, and n = l+u . If the unlabeled sample is risk-free and \mathit{\boldsymbol{K}}(x, x)\leq T^2 , then for every \theta\in(0, 1) , with probability at least 1-\theta , the generalization error of RS3ELM is given by
\begin{eqnarray} |Err(f)-\hat{Err}(f)| &\leq& 2T\sqrt{\frac{n}{\lambda n^{2}+\gamma\pi_{1}}\left[1-\exp(-\frac{1}{2\sigma^{2}})\right]^{p}} + 3\sqrt{\frac{\ln(\frac{2}{\theta})}{2n}}. \end{eqnarray} | (3.35) |
Proof. To utilize Theorem 3, we need to determine the value of B in the inequality (3.33), which serves as an upper bound for \|f\|^{2}_{\mathcal{H}} . Assuming the unlabeled samples are safe, the objective function of RS3ELM in the regenerative Hilbert space can be reformulated as
\begin{equation} \Phi(f) = \sum\limits_{i = 1}^{l}Loss(y_{i}, f(x_{i}))+\frac{\lambda}{2}\|f\|^{2}_{\mathcal{H}}+\frac{\gamma}{2n^{2}}\mathit{\boldsymbol{f}}^{T}\mathit{\boldsymbol{Lf}} . \end{equation} | (3.36) |
Assuming that f^{\ast} = \arg\min_{f\in\mathcal{H}}\Phi(f) is the solution to (3.36), we have \Phi(f^{\ast})\leq\Phi(\mathit{\boldsymbol{0}}) . As a result, we can derive additional information
\begin{equation} \frac{\lambda}{2}\|f^{\ast}\|^{2}_{\mathcal{H}}+\frac{\gamma}{2n^{2}}\mathit{\boldsymbol{f}}^{T}\mathit{\boldsymbol{Lf}}\leq\Phi(\mathit{\boldsymbol{0}}) . \end{equation} | (3.37) |
Let \pi_1 < \pi_2 < \cdots < \pi_r be the non-zero eigenvalues of the Laplacian matrix \mathit{\boldsymbol{L}} , where r is the rank of \mathit{\boldsymbol{L}} ; here \pi_1 and \pi_r denote the smallest and largest non-zero eigenvalues, respectively. Then
\begin{equation} \pi_{1}\|f^{\ast}\|^{2}_{\mathcal{H}}\leq\mathit{\boldsymbol{f}}^{T}\mathit{\boldsymbol{Lf}}\leq\pi_{r}\|f^{\ast}\|^{2}_{\mathcal{H}}, \end{equation} | (3.38) |
and
\begin{eqnarray} \left(\frac{\lambda}{2}+\frac{\gamma\pi_{1}}{2n^{2}}\right)\|f^{\ast}\|^{2}_{\mathcal{H}}&\leq&\frac{\lambda}{2}\|f^{\ast}\|^{2}_{\mathcal{H}}+\frac{\gamma}{2n^{2}}\mathit{\boldsymbol{f}}^{T}\mathit{\boldsymbol{Lf}}\\ &\leq&\left(\frac{\lambda}{2}+\frac{\gamma\pi_{r}}{2n^{2}}\right)\|f^{\ast}\|^{2}_{\mathcal{H}} . \end{eqnarray} | (3.39) |
By combining the inequalities (3.37) and (3.39), we can derive the following resulting inequalities
\begin{equation} \left(\frac{\lambda}{2}+\frac{\gamma\pi_{1}}{2n^{2}}\right)\|f^{\ast}\|^{2}_{\mathcal{H}}\leq\Phi(\mathit{\boldsymbol{0}}) . \end{equation} | (3.40) |
Hence, to restrict the search range of f^{\ast} , we can confine it within a radius R = \sqrt{\frac{\Phi(\mathit{\boldsymbol{0}})}{\left(\frac{\lambda}{2}+\frac{\gamma\pi_{1}}{2n^{2}}\right)}} inside the ball. Let \mathcal{H}_R = \{f\in\mathcal{H}: \|f\|_{\mathcal{H}}\leq R\} denote the sphere with radius R in the regenerative Hilbert space \mathcal{H} , and the generalized version of RS3ELM can be expressed as:
\begin{eqnarray} |Err(f)-\hat{Err}(f)| &\leq&\hat{E}_{n}(\mathcal{H}_{R}) + 3\sqrt{\frac{\ln(\frac{2}{\theta})}{2n}}, \end{eqnarray} | (3.41) |
where
\begin{equation} \hat{E}_{n}(\mathcal{H}_{R})\leq 2R\sqrt{\frac{T^{2}}{n}}. \end{equation} | (3.42) |
Suppose the output weight vector \mathit{\boldsymbol{\beta}} = (0, 0, \ldots, 0)^{T} is chosen such that the last two terms in Eq (3.36) become zero. Then
\begin{equation} \Phi(\mathit{\boldsymbol{0}}) = \sum\limits_{i = 1}^{l}Loss(y_{i}, f(\mathit{\boldsymbol{0}})) = \left[1-\exp(-\frac{1}{2\sigma^{2}})\right]^{p} . \end{equation} | (3.43) |
By substituting (3.43) into (3.40), we can get
\begin{equation} \left(\frac{\lambda}{2}+\frac{\gamma\pi_{1}}{2n^{2}}\right)\|f^{\ast}\|^{2}_{\mathcal{H}}\leq\left[1-\exp(-\frac{1}{2\sigma^{2}})\right]^{p}. \end{equation} | (3.44) |
The inequality (3.44) can be equivalently written as
\begin{equation} \|f^{\ast}\|^{2}_{\mathcal{H}}\leq\frac{2n^{2}}{\lambda n^{2}+\gamma\pi_{1}}\left[1-\exp(-\frac{1}{2\sigma^{2}})\right]^{p}. \end{equation} | (3.45) |
Obviously, the radius R is
\begin{equation} R = \sqrt{\frac{2n^{2}}{\lambda n^{2}+\gamma\pi_{1}}\left[1-\exp(-\frac{1}{2\sigma^{2}})\right]^{p}}. \end{equation} | (3.46) |
By substituting Eq (3.46) into Eq (3.42), we can derive the following inequality.
\begin{equation} \hat{E}_{n}(\mathcal{H}_{R})\leq2T\sqrt{\frac{n}{\lambda n^{2}+\gamma\pi_{1}}\left[1-\exp(-\frac{1}{2\sigma^{2}})\right]^{p}}. \end{equation} | (3.47) |
By substituting Eq (3.47) into Eq (3.41), we can easily obtain an approximation for the generalization error bound of RS3ELM:
\begin{eqnarray} |Err(f)-\hat{Err}(f)| &\leq&2T\sqrt{\frac{n}{\lambda n^{2}+\gamma\pi_{1}}\left[1-\exp(-\frac{1}{2\sigma^{2}})\right]^{p}} + 3\sqrt{\frac{\ln(\frac{2}{\theta})}{2n}}. \end{eqnarray} | (3.48) |
In this section, we conducted experiments on nine benchmark datasets to validate the effectiveness of the RS3ELM algorithm. We compared it with the following algorithms: SS-ELM [24], Lap-SVM [1], Lap-RLS [2], RC-SSELM [36], and MC-SSELM [35]. For the experiments, we used the following parameter settings:
● SS-ELM: The number of hidden nodes L was chosen from the set \{100,500, 1000, 2000\} . The regularization parameters C and \lambda were chosen from the set \{10^{-5}, 10^{-3}, 10^{-1}, 10^{1}, 10^{3}, 10^{5}\} .
● RC-SSELM: The regularization parameters C , \lambda , and the number of hidden nodes L were chosen within the same range as SS-ELM. The Gaussian kernel parameter \sigma was chosen from the set \{1, 3, 5, 7, 9\} .
● Lap-SVM and Lap-RLS: The Gaussian kernel parameter \sigma was chosen from the set \{2^{-10}, 2^{-6}, 2^{-2}, 2^{2}, 2^{6}\} . The regularization parameters \gamma_{I} and \gamma_{A} were chosen from the set \{10^{-6}, 10^{-4}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\} .
● MC-SSELM: The regularization parameters C , \lambda , and the number of hidden nodes L were chosen within the same range as SS-ELM. The Gaussian kernel parameters \sigma_{1} and \sigma_{2} were chosen from the set \{2^{-5}, 2^{-3}, 2^{-1}, 2^{1}, 2^{3}, 2^{5}\} . The maximum number of iterations K was set to 50, and the tolerable error \varepsilon was set to 10^{-3} . The variable center values were \{-3, -1, 1, 3\} , and the weight parameter \alpha was chosen from the set \{0.1, 0.2, 0.3, 0.4, 0.5\} .
● RS3ELM: The maximum number of iterations K was set to 50, and the tolerable error \varepsilon was set to 10^{-3} . The Gaussian kernel parameter \sigma was chosen from the set \{2^{-10}, 2^{-6}, 2^{-2}, 2^{2}, 2^{6}\} . The regularization parameters \gamma , \lambda , and \alpha were chosen from the set \{10^{-6}, 10^{-4}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\} .
To ensure a fair comparison, we performed a grid search to find the optimal parameters for all algorithms, and the algorithm with the best performance was selected.
To assess the classification performance of all the algorithms, we used the traditional accuracy index (ACC). The ACC is defined as follows:
\begin{equation} \begin{aligned} ACC = \frac{TP + TN}{TP + FN + TN + FP} \end{aligned}. \end{equation} | (4.1) |
In this equation, TP represents the number of true positives, TN represents the number of true negatives, FN represents the number of false negatives, and FP represents the number of false positives. A higher ACC value indicates a better model performance. Additionally, to compare the computational time of the algorithms, we recorded their running time, including both training and testing, on all the datasets used.
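Expressed as code, the metric of Eq (4.1) is simply (illustrative):

```python
def accuracy(tp, tn, fp, fn):
    """Classification accuracy of Eq (4.1)."""
    return (tp + tn) / (tp + tn + fp + fn)

# e.g., accuracy(48, 45, 5, 2) == 0.93
```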
In the experiment, the main criterion for evaluating the performance of all the algorithms was the average classification accuracy, obtained through Monte Carlo cross-validation (MCCV) [39] and grid search. All the datasets used were approximately balanced, and the following steps were taken:
● Random division: The datasets were randomly divided into training sets and test sets, with a ratio of approximately 7:3.
● Repetition: This process of random division was repeated 10 times.
● Averaging: The results from the 10 repetitions were averaged to obtain a more reliable measure of performance.
● Normalization: To ensure fair comparison, all the datasets were normalized to the interval [0, 1] . Details of the datasets are presented in Table 1.
ID | Datasets | Samples | Features | Class |
1 | Breast Cancer | 569 | 30 | 2 |
2 | wine | 178 | 13 | 2 |
3 | COIL20 | 1440 | 1024 | 20 |
4 | Diabetic | 1151 | 19 | 2 |
6 | Carcinom | 174 | 9182 | 11 |
7 | Heart disease | 270 | 13 | 2 |
8 | Lung | 203 | 3312 | 5 |
9 | Protein | 1483 | 56 | 10 |
● Recording: The average classification accuracy, denoted by ACC, was obtained as the average of the 10 test results.
By following these steps, we aimed to obtain objective experimental results that accurately assessed the performance of the algorithms. All algorithms used in the experiments were implemented as MATLAB scripts. To ensure a fair comparison, we employed the MATLAB toolbox for quadratic programming (QP) to solve all the quadratic programming problems embedded in the algorithms. The implementation was carried out on a PC with the following specifications: Intel(R) Core(TM) i7-8700 processor (3.20 GHz) and 16 GB of RAM. The operating system used was Windows 10, and MATLAB version 2014a was used for the implementation.
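The MCCV evaluation protocol described above can be sketched as follows; the helper train_and_score is a hypothetical callback that fits a classifier on the training split and returns its test accuracy, and the function name and signature are our own.

```python
import numpy as np

def mccv_accuracy(X, y, train_and_score, n_splits=10, train_ratio=0.7, rng=None):
    """Monte Carlo cross-validation: random 7:3 splits repeated n_splits times,
    reporting the mean test accuracy of the supplied train_and_score callback."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        cut = int(train_ratio * n)
        tr, te = idx[:cut], idx[cut:]
        scores.append(train_and_score(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(scores))
```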
In this section, we aim to demonstrate the robustness of our method by conducting experiments on a two-dimensional "XOR" dataset. The dataset is generated by perturbing points from two intersecting lines, \mbox{Class 1:}\ y_{i} = 0.7x_{i}+\eta and \mbox{Class 2:}\ y_{i} = -0.3x_{i}+\eta . Each instance is randomly assigned Gaussian noise, following a normal distribution with mean 0 and standard deviation 0.2. Both classes consist of 1000 sample points. We evaluate our method on both the original "XOR" dataset and a contaminated version, which includes a few outliers. The experimental results are presented in Table 2.
Lap-SVM | Lap-RSL | SS-ELM | RC-SSELM | MC-SSELM | RS3ELM | |
Datasets | ACC±std(%) | ACC±std(%) | ACC±std(%) | ACC±std(%) | ACC±std(%) | ACC±std(%) |
Time (s) | Time (s) | Time (s) | Time (s) | Time (s) | Time (s) | |
Without outliers | 81.78±1.35 | 80.12±1.75 | 80.28±1.47 | 81.29±1.69 | 81.47±1.56 | 81.55±1.34 |
0.891 | 0.836 | 0.465 | 0.557 | 0.847 | 1.062 | |
With outliers | 77.35±1.86 | 78.86±1.26 | 79.56±1.57 | 80.13±1.43 | 80.68±1.54 | 80.72±1.78 |
0.891 | 0.836 | 0.465 | 0.557 | 0.847 | 1.062 |
The experimental results indicate that all methods perform well in classifying the synthetic dataset without outliers, with Lap-SVM exhibiting the highest classification accuracy. However, the scenario changes when the dataset is contaminated with outliers: Lap-SVM, Lap-RLS, and SS-ELM then perform significantly worse than the other algorithms. The classification accuracies of Lap-SVM, Lap-RLS, SS-ELM, RC-SSELM, MC-SSELM, and RS3ELM are presented in Table 2. It is evident from the experiments that our method, RS3ELM, outperforms the other five classifiers in terms of accuracy when classifying the synthetic dataset with outliers. In terms of training time, however, RS3ELM does not exhibit a notable advantage: despite achieving similar accuracy, it requires more computational time than MC-SSELM, which suggests that its internal optimization is more complex or requires more iterations to converge. These results nevertheless highlight the robustness of RS3ELM.
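For completeness, synthetic data of the kind described above could be generated along the following lines; the x-range, the outlier box, and the label assignment of outliers are illustrative assumptions.

```python
import numpy as np

def make_xor_lines(n_per_class=1000, noise_std=0.2, n_outliers=0, rng=None):
    """Two intersecting noisy lines: class 1: y = 0.7 x + eta, class 2: y = -0.3 x + eta,
    with eta ~ N(0, noise_std^2); optionally contaminated with a few random outliers."""
    rng = np.random.default_rng(rng)
    x1 = rng.uniform(-1.0, 1.0, n_per_class)
    x2 = rng.uniform(-1.0, 1.0, n_per_class)
    c1 = np.column_stack([x1, 0.7 * x1 + rng.normal(0, noise_std, n_per_class)])
    c2 = np.column_stack([x2, -0.3 * x2 + rng.normal(0, noise_std, n_per_class)])
    X = np.vstack([c1, c2])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
    if n_outliers > 0:                      # contaminated version: far-away points
        out = rng.uniform(-3.0, 3.0, size=(n_outliers, 2))
        X = np.vstack([X, out])
        y = np.hstack([y, rng.choice([-1.0, 1.0], n_outliers)])
    return X, y
```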
In this section, we evaluate the performance of our proposed method in comparison with the other algorithms in a noise-free environment. Table 3 presents the experimental results for all algorithms using their optimal parameters. For the Lap-SVM, Lap-RLS, SS-ELM, RC-SSELM, and MC-SSELM algorithms, the experimental settings and results are taken from the literature [35], to which we refer the reader for further details. From Table 3, it is evident that our proposed method achieves comparable performance to the other five algorithms in the noise-free experiment. In general, the ELM-based methods demonstrate better performance than the SVM-based methods in most cases.
Lap-SVM | Lap-RSL | SS-ELM | RC-SSELM | MC-SSELM | RS3ELM | |
Datasets | ACC±S(%) | ACC±S(%) | ACC±S(%) | ACC± S(%) | ACC± S(%) | ACC± S(%) |
1 | 90.09±1.83 | 91.08±1.62 | 91.15±1.19 | 91.97±1.22 | 90.41±0.43 | 90.53±0.67 |
2 | 94.87±3.39 | 95.73±1.96 | 98.72±0.00 | 98.72±0.00 | 100±0.00 | 97.87±1.33 |
3 | 96.15±0.98 | 94.43±0.20 | 96.17±0.56 | 96.31±0.34 | 96.62±0.40 | 97.05±0.94 |
4 | 70.72±1.61 | 70.53±1.49 | 72.35±1.02 | 72.58±0.60 | 73.45±0.54 | 73.21±1.12 |
5 | 96.97±0.52 | 95.15±1.05 | 96.91±0.50 | 96.73±1.52 | 97.64±0.73 | 97.89±0.84 |
6 | 68.60±9.65 | 65.22± 2.51 | 75.94 ±4.30 | 78.26 ± 1.02 | 78.33±2.34 | 79.17±1.04 |
7 | 81.18±3.28 | 82.55±1.22 | 82.47 ± 0.26 | 83.65 ±0.49 | 83.78 ±0.43 | 83.81±0.54 |
8 | 90.53±5.70 | 93.00±1.43 | 91.36±1.51 | 91.85±1.41 | 92.17±1.29 | 92.22 ±1.26 |
9 | 89.95 ±1.20 | 89.20±2.02 | 89.59 ±1.51 | 89.89 ± 0.70 | 90.23 ± 0.46 | 90.38± 0.53 |
To assess the efficacy of our proposed approach, we utilized nine benchmark datasets sourced from the UCI machine learning repository: http://archive.ics.uci.edu/ml/datasets.html. In order to ensure consistency, we standardized these datasets, thereby constraining the feature values to the range of [0, 1] . Subsequently, we conducted two types of experiments on these standardized datasets.
In this section, we initially conducted experiments on datasets containing 10% and 30% outliers to assess the robustness of the proposed method. These outliers were generated by randomly selecting 10% and 30% of the labeled training samples and applying the label-inversion technique. (In robustness testing, label inversion is used to test whether a model can withstand label noise: flipping a certain percentage of labels at random in the training data helps to determine how well the model generalizes despite noisy conditions.) The experimental settings and results of the Lap-SVM, Lap-RLS, SS-ELM, RC-SSELM, and MC-SSELM algorithms were adopted from the literature [35]. The optimal parameters were used in these experiments, and the results are presented in Tables 4 and 5. Observing the tables, it is evident that the performance of all the algorithms declined as the label noise increased. Moreover, the models utilizing correntropy-based loss functions outperformed the other methods, indicating their effectiveness in handling outliers.
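The label-inversion corruption described above amounts to flipping a random fraction of the binary labels, e.g.:

```python
import numpy as np

def flip_labels(y, ratio=0.1, rng=None):
    """Label-inversion noise: flip the labels of a random `ratio` fraction of the
    labeled training samples (binary labels in {-1, +1})."""
    rng = np.random.default_rng(rng)
    y_noisy = np.asarray(y).copy()
    idx = rng.choice(len(y_noisy), size=int(ratio * len(y_noisy)), replace=False)
    y_noisy[idx] = -y_noisy[idx]
    return y_noisy
```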
Lap-SVM | Lap-RSL | SS-ELM | RC-SSELM | MC-SSELM | RS3ELM | |
Datasets | ACC±S(%) | ACC±S(%) | ACC±S(%) | ACC± S(%) | ACC± S(%) | ACC± S(%) |
1 | 88.48±0.98 | 87.86±0.57 | 89.59±0.53 | 89.89±1.47 | 90.41±0.36 | 90.26±0.43 |
2 | 95.30±3.23 | 93.16±4.12 | 97.44±1.28 | 97.69±1.07 | 99.74±0.51 | 96.88±1.49 |
3 | 93.45±0.20 | 91.72±1.08 | 92.07±0.40 | 93.72±0.26 | 95.21±0.33 | 95.56±0.41 |
4 | 67.25±1.61 | 69.76±0.67 | 69.04±0.43 | 69.04±0.56 | 72.00±0.68 | 72.47±0.64 |
5 | 96.67±0.52 | 95.76±0.52 | 96.36±1.57 | 97.09±0.41 | 96.36±0.57 | 96.08±1.12 |
6 | 67.15 ± 4.66 | 70.53 ± 3.65 | 75.94± 2.43 | 77.97 ± 4.51 | 78.65 ± 2.23 | 78.79 ± 2.28 |
7 | 80.78±2.78 | 81.96±0.90 | 79.88± 0.77 | 83.53 ±0.72 | 83.62 ±0.77 | 83.78 ±0.52 |
8 | 91.77±1.43 | 91.36±2.47 | 89.14±2.95 | 90.12±2.62 | 90.19±2.55 | 90.23±2.47 |
9 | 86.14±0.65 | 82.40±2.83 | 86.70±0.30 | 86.85±0.48 | 87.21±0.32 | 87.52±0.28 |
Lap-SVM | Lap-RSL | SS-ELM | RC-SSELM | MC-SSELM | RS3ELM | |
Datasets | ACC±S(%) | ACC±S(%) | ACC±S(%) | ACC± S(%) | ACC± S(%) | ACC± S(%) |
1 | 85.25±3.90 | 86.74±3.72 | 89.37±0.86 | 89.52±1.27 | 89.59±1.21 | 89.61±1.17 |
2 | 88.89±3.92 | 90.60±1.48 | 82.05±4.53 | 93.33±0.57 | 95.64±0.63 | 94.72±0.69 |
3 | 85.06±2.65 | 85.06±2.49 | 86.28±0.50 | 86.41±0.61 | 87.86±0.30 | 89.73±1.09 |
4 | 60.39±1.17 | 62.22±3.83 | 64.41±0.63 | 65.28±0.80 | 66.96±0.76 | 66.79±0.52 |
5 | 93.94±3.19 | 94.55±0.91 | 95.27±1.49 | 96.36±0.91 | 97.27±0.57 | 95.06±1.17 |
6 | 54.11±1.67 | 64.25±0.84 | 68.99±4.18 | 71.01±5.42 | 71.54±3.45 | 71.89±3.32 |
7 | 78.43 ±3.78 | 79.61±1.89 | 75.53± 0.67 | 80.00± 0.59 | 80.41± 1.03 | 80.47± 0.74 |
8 | 88.07± 2.85 | 86.42±1.23 | 87.41±5.19 | 88.15±2.07 | 88.33±2.11 | 88.52±2.15 |
9 | 78.03± 3.19 | 81.46 ±1.60 | 81.46±0.35 | 83.07±0.71 | 83.17±0.85 | 83.03±0.67 |
In order to further validate the robustness of the proposed method, this study also conducted experiments in a feature-noise environment. The feature-noise dataset was generated by randomly selecting 30% of the feature values of each sample and setting them to 0, thus creating a new experimental dataset. The experimental settings and results of the Lap-SVM, Lap-RLS, SS-ELM, RC-SSELM, and MC-SSELM algorithms were obtained from the literature [35]. The experimental settings and parameter choices were consistent with the previous experiments. The results of this experiment, obtained with the optimal parameters, are presented in Table 6. From Table 6, it can be observed that the proposed method performs comparably to other related algorithms. To summarize, the proposed method demonstrates effectiveness in terms of robustness and significantly enhances the performance of the model.
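Similarly, the feature-noise corruption can be sketched as zeroing a random 30% of each sample's feature values (illustrative):

```python
import numpy as np

def zero_features(X, ratio=0.3, rng=None):
    """Feature noise: for every sample, set a random `ratio` fraction of its
    feature values to 0."""
    rng = np.random.default_rng(rng)
    X_noisy = np.asarray(X, dtype=float).copy()
    n_zero = int(ratio * X_noisy.shape[1])
    for i in range(X_noisy.shape[0]):
        idx = rng.choice(X_noisy.shape[1], size=n_zero, replace=False)
        X_noisy[i, idx] = 0.0
    return X_noisy
```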
Lap-SVM | Lap-RSL | SS-ELM | RC-SSELM | MC-SSELM | RS3ELM | |
Datasets | ACC±S(%) | ACC±S(%) | ACC±S(%) | ACC± S(%) | ACC± S(%) | ACC± S(%) |
1 | 85.75±6.11 | 87.86±0.86 | 87.66±2.30 | 88.40±1.45 | 89.52±1.04 | 90.11±1.24 |
2 | 81.37±2.23 | 82.35±2.56 | 80.35±1.22 | 82.59±1.42 | 81.76±0.37 | 82.23±1.19 |
3 | 97.01±0.74 | 94.44±4.12 | 97.18±1.07 | 97.95±1.15 | 98.21±0.63 | 98.33±1.14 |
4 | 59.71±1.53 | 59.23±0.89 | 59.36±0.38 | 61.62±0.76 | 62.26±0.22 | 62.05±1.06 |
5 | 95.45±1.82 | 94.55±1.57 | 96.00±1.04 | 96.00±0.50 | 96.36±1.29 | 96.55±1.26 |
6 | 30.92 ±7.15 | 34.30 ± 10.88 | 34.49 ± 5.27 | 40.58±3.40 | 39.94±3.76 | 40.71±2.66 |
7 | 81.37±2.23 | 82.35±2.56 | 80.35±1.22 | 82.59±1.42 | 82.74±1.28 | 82.79±1.24 |
8 | 79.01±0.00 | 78.60 ±0.71 | 79.01 ±0.00 | 79.01±0.00 | 79.01±0.00 | 79.01±0.00 |
9 | 79.21±1.17 | 80.71±0.19 | 80.26 ±1.35 | 81.27±0.88 | 81.41±0.27 | 81.45±0.31 |
To evaluate the convergence of the proposed algorithm, experiments are conducted on multiple datasets including wine, breast cancer, diabetic, and G50C datasets. The optimal parameters are selected, and the experimental procedure is consistent with the noise experiment. The learning outcomes are presented in Figure 2. As observed from Figure 2, it is evident that the objective function value of RS3ELM progressively decreases with each iteration, converging to a stable value in fewer than 10 iterations. Hence, the proposed algorithm RS3ELM demonstrates convergence.
The RS3ELM algorithm incorporates several parameters, namely p and \sigma in the C_{p}-Loss function, as well as the tradeoff coefficients for the three terms in (3.11), which are \lambda , \gamma , and \alpha . For simplicity, we assume that \gamma , \lambda , and \alpha are equal in value. To assist in parameter selection, we examine the impact of p , \sigma , \lambda , \gamma , and \alpha on the performance of RS3ELM.
In this experiment, we investigated the impact of p and \alpha on the performance of RS3ELM while keeping \sigma = 2^{-2} , \lambda = 10^{-2} , and \gamma = 10^{-1} fixed. We varied \alpha over \{10^{-4}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\} and p over \{0.5, 1.0, 1.5, 2.0, 2.5, 3.0\} . The accuracy as a function of p and \alpha is shown in Figure 3. Although the classification performance on the wine, Breast Cancer, Diabetic, and G50C datasets was only mildly influenced by changes in p and \alpha , Figure 3 still suggests suitable choices for these parameters. In general, outliers tend to distort the learned separating hyperplane; RS3ELM can mitigate this influence by adjusting the parameter p , indicating that an appropriate choice of p makes RS3ELM less sensitive to outliers.
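The parameter study above amounts to a small grid search over p and \alpha with the remaining parameters held fixed. A minimal sketch follows, assuming a hypothetical wrapper `train_and_score(X, y, p, alpha, sigma, lam, gamma)` that trains RS3ELM with the given parameters and returns validation accuracy; the wrapper and function names are illustrative, not part of the paper.

```python
import itertools

# Candidate grids used in the sensitivity analysis.
P_GRID = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ALPHA_GRID = [1e-4, 1e-2, 1e-1, 1e0, 1e1, 1e2]

def grid_search(X, y, train_and_score, p_grid=P_GRID, alpha_grid=ALPHA_GRID,
                sigma=2**-2, lam=1e-2, gamma=1e-1):
    """Return the (p, alpha, accuracy) triple with the best validation accuracy."""
    best = (None, None, float("-inf"))
    for p, alpha in itertools.product(p_grid, alpha_grid):
        acc = train_and_score(X, y, p=p, alpha=alpha, sigma=sigma, lam=lam, gamma=gamma)
        if acc > best[2]:
            best = (p, alpha, acc)
    return best
```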
In this paper, we propose a robust safe framework for semi-supervised learning that ensures the reliable exploitation of unlabeled samples and is resistant to noise and outliers. To implement this framework, we introduce a robust safe semi-supervised extreme learning machine (RS3ELM), which is solved with a fixed-point iterative algorithm. We theoretically analyze the computational complexity and convergence of RS3ELM and derive a generalization error bound based on Rademacher complexity. Experimental results on multiple benchmark datasets confirm the robustness and classification accuracy of RS3ELM compared with related methods, so the proposed approach can be applied effectively to robust classification problems. Note, however, that the method currently handles only binary classification tasks; extending it to other learning tasks without sacrificing classification accuracy or robustness is a valuable direction for future research.
Jun Ma and Xiaolong Zhu: Algorithm development, software creation, numerical example preparation, original draft writing, review and editing of the manuscript. All authors have read and approved the final version of the manuscript for publication.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work was supported in part by the Fundamental Research Funds for the Central Universities of North Minzu University (No. 2021JCYJ07 and No. 2023ZRLG01), in part by the National Natural Science Foundation of China (No. 62366001 and No. 12361062), in part by the Key Research and Development Program of Ningxia (Introduction of Talents Project) (No. 2022BSB03046), and in part by the Natural Science Foundation of Ningxia Province (No. 2023AAC02053).
The authors declare that they have no conflict of interest and no relevant financial or non-financial interests to disclose.
[1] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res., 7 (2006), 2399–2434. https://doi.org/10.5555/1248547.124863
[2] O. Chapelle, B. Scholkopf, A. Zien, Semi-supervised learning, IEEE T. Neural Networ. https://doi.org/10.1109/TNN.2009.2015974
[3] T. Yang, C. E. Priebe, The effect of model misspecification on semi-supervised classification, IEEE T. Pattern Anal., 33 (2011), 2093–2103. https://doi.org/10.1109/TPAMI.2011.45
[4] Y. F. Li, Z. H. Zhou, Towards making unlabeled data never hurt, IEEE T. Pattern Anal., 37 (2015), 175–188. https://doi.org/10.1109/TPAMI.2014.2299812
[5] Y. T. Li, J. T. Kwok, Z. H. Zhou, Towards safe semi-supervised learning for multivariate performance measures, In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 30 (2016), 1816–1822. https://doi.org/10.1609/aaai.v30i1.10282
[6] Y. Wang, S. Chen, Z. H. Zhou, New semi-supervised classification method based on modified cluster assumption, IEEE T. Neural Networ., 23 (2011), 689–702. https://doi.org/10.1609/aaai.v25i1.7920
[7] Y. Wang, S. Chen, Safety-aware semi-supervised classification, IEEE T. Neural Networ., 24 (2013), 1763–1772. https://doi.org/10.1109/TNNLS.2013.2263512
[8] M. Kawakita, J. Takeuchi, Safe semi-supervised learning based on weighted likelihood, Neural Networks, 53 (2014), 146–164. https://doi.org/10.1016/j.neunet.2014.01.016
[9] H. Gan, Z. Luo, M. Meng, Y. Ma, Q. She, A risk degree-based safe semi-supervised learning algorithm, Int. J. Mach. Learn. Cyb., 7 (2015), 85–94. https://doi.org/10.1007/s13042-015-0416-8
[10] H. Gan, Z. Luo, Y. Sun, X. Xi, N. Sang, R. Huang, Towards designing risk-based safe Laplacian regularized least squares, Expert Syst. Appl., 45 (2016), 1–7. https://doi.org/10.1016/j.eswa.2015.09.017
[11] H. Gan, Z. Li, Y. Fan, Z. Luo, Dual learning-based safe semi-supervised learning, IEEE Access, 6 (2017), 2615–2621. https://doi.org/10.1109/access.2017.2784406
[12] H. Gan, Z. Li, W. Wu, Z. Luo, R. Huang, Safety-aware graph-based semi-supervised learning, Expert Syst. Appl., 107 (2018), 243–254. https://doi.org/10.1016/j.eswa.2018.04.031
[13] N. Sang, H. Gan, Y. Fan, W. Wu, Z. Yang, Adaptive safety degree-based safe semi-supervised learning, Int. J. Mach. Learn. Cyb., 10 (2018), 1101–1108. https://doi.org/10.1007/s13042-018-0788-7
[14] Y. Y. Wang, Y. Meng, Z. Fu, H. Xue, Towards safe semi-supervised classification: Adjusted cluster assumption via clustering, Neural Process. Lett., 46 (2017), 1031–1042. https://doi.org/10.1007/s11063-017-9607-5
[15] H. Gan, G. Li, S. Xia, T. Wang, A hybrid safe semi-supervised learning method, Expert Syst. Appl., 149 (2020), 1–9. https://doi.org/10.1016/j.eswa.2020.113295
[16] Y. T. Li, J. T. Kwok, Z. H. Zhou, Towards safe semi-supervised learning for multivariate performance measures, In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 30 (2016), 1816–1822. https://doi.org/10.1609/aaai.v30i1.10282
[17] G. B. Huang, Q. Y. Zhu, C. K. Siew, Extreme learning machine: Theory and applications, Neurocomputing, 70 (2006), 489–501. https://doi.org/10.1016/j.neucom.2005.12.126
[18] Y. Cheng, D. Zhao, Y. Wang, G. Pei, Multi-label learning with kernel extreme learning machine autoencoder, Knowl.-Based Syst., 178 (2019), 1–10. https://doi.org/10.1016/j.knosys.2019.04.002
[19] X. Huang, Q. Lei, T. Xie, Y. Zhang, Z. Hu, Q. Zhou, Deep transfer convolutional neural network and extreme learning machine for lung nodule diagnosis on CT images, Knowl.-Based Syst., 204 (2020), 106230. https://doi.org/10.1016/j.knosys.2020.106230
[20] J. Ma, L. Yang, Y. Wen, Q. Sun, Twin minimax probability extreme learning machine for pattern recognition, Knowl.-Based Syst., 187 (2020), 104806. https://doi.org/10.1016/j.knosys.2019.06.014
[21] C. Yuan, L. Yang, Robust twin extreme learning machines with correntropy-based metric, Knowl.-Based Syst., 214 (2021), 106707. https://doi.org/10.1016/j.knosys.2020.106707
[22] Y. Li, Y. Wang, Z. Chen, R. Zou, Bayesian robust multi-extreme learning machine, Knowl.-Based Syst., 210 (2020), 106468. https://doi.org/10.1016/j.knosys.2020.106468
[23] H. Pei, K. Wang, Q. Lin, P. Zhong, Robust semi-supervised extreme learning machine, Knowl.-Based Syst., 159 (2018), 203–220. https://doi.org/10.1016/j.knosys.2018.06.029
[24] G. Huang, S. Song, J. N. D. Gupta, C. Wu, Semi-supervised and unsupervised extreme learning machines, IEEE T. Cybernetics, 44 (2014), 2405. https://doi.org/10.1109/tcyb.2014.2307349
[25] W. Liu, P. P. Pokharel, J. C. Principe, Correntropy: Properties and applications in non-Gaussian signal processing, IEEE T. Signal Proces., 55 (2007), 5286–5298. https://doi.org/10.1109/tsp.2007.896065
[26] N. Masuyama, C. K. Loo, F. Dawood, Kernel Bayesian ART and ARTMAP, Neural Networks, 98 (2018), 76–86. https://doi.org/10.1016/j.neunet.2017.11.003
[27] X. Liu, B. Chen, H. Zhao, J. Qin, J. Cao, Maximum correntropy Kalman filter with state constraints, IEEE Access, 5 (2017), 25846–25853. https://doi.org/10.1109/access.2017.2769965
[28] B. Chen, X. Liu, H. Zhao, J. C. Principe, Maximum correntropy Kalman filter, Automatica, 76 (2017), 70–77. https://doi.org/10.1016/j.automatica.2016.10.004
[29] B. Chen, X. Lei, W. Xin, Q. Jing, N. Zheng, Robust learning with kernel mean p-power error loss, IEEE T. Cybernetics, 48 (2018), 2101–2113. https://doi.org/10.1109/tcyb.2017.2727278
[30] H. Xing, X. Wang, Training extreme learning machine via regularized correntropy criterion, Neural Comput. Appl., 23 (2013), 1977–1986. https://doi.org/10.1007/s00521-012-1184-y
[31] Z. Yuan, X. Wang, J. Cao, H. Zhao, B. Chen, Robust matching pursuit extreme learning machines, Sci. Programming, 1 (2018), 1–10. https://doi.org/10.1155/2018/4563040
[32] B. Chen, X. Wang, N. Lu, S. Wang, J. Cao, J. Qin, Mixture correntropy for robust learning, Pattern Recogn., 79 (2018), 318–327. https://doi.org/10.1016/j.patcog.2018.02.010
[33] G. Xu, B. G. Hu, J. C. Principe, Robust C-loss kernel classifiers, IEEE T. Neur. Net. Lear., 29 (2018), 510–522. https://doi.org/10.1109/tnnls.2016.2637351
[34] A. Singh, R. Pokharel, J. Principe, The C-loss function for pattern classification, Pattern Recogn., 47 (2014), 441–453. https://doi.org/10.1016/j.patcog.2013.07.017
[35] J. Yang, J. Cao, A. Xue, Robust maximum mixture correntropy criterion-based semi-supervised ELM with variable center, IEEE T. Circuits-II, 67 (2020), 3572–3576. https://doi.org/10.1109/tcsii.2020.2995419
[36] J. Yang, J. Cao, T. Wang, A. Xue, B. Chen, Regularized correntropy criterion based semi-supervised ELM, Neural Networks, 122 (2020), 117–129. https://doi.org/10.1016/j.neunet.2019.09.030
[37] P. L. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: Risk bounds and structural results, In: Conference on Computational Learning Theory & European Conference on Computational Learning Theory, Berlin/Heidelberg: Springer, 2001, 224–240. https://doi.org/10.1007/3-540-44581-1_15
[38] P. J. Huber, Robust estimation of a location parameter, Ann. Math. Stat., 35 (1964), 73–101. https://doi.org/10.1214/aoms/1177703732
[39] Q. S. Xu, Y. Z. Liang, Monte Carlo cross validation, Chemometr. Intell. Lab., 56 (2001), 1–11. https://doi.org/10.1016/s0169-7439(00)00122-2