1. Introduction
Data classification is an important data mining technique with applications in practically all aspects of modern life, and it has long been recognized as a central topic in machine learning and data mining.
We begin by reviewing the history of various mathematical models and related techniques used for this purpose. The convex bilevel optimization problem plays an important role in real-world applications; in particular, it can be applied to data classification, see for example [1,2,3,4]. The convex bilevel optimization problem consists of a constrained minimization problem, known as the outer level,
where H is a real Hilbert space, ϕ:H→R is a strongly convex differentiable function, and Λ is the nonempty set of minimizers of the inner level given by
where φ:Rm→R is a convex differentiable function such that ∇φ is Lφ-Lipschitzian and ψ∈Γ0(H), the set of proper lower semicontinuous convex functions from H to R. Problems (1.1) and (1.2) together are referred to as a bilevel optimization problem.
Furthermore, the solution of (1.2) can be restated as the problem of finding û∈Λ such that
Parikh and Boyd [5] introduced the proximal gradient technique for solving (1.3); that is, û is a solution of (1.3) if and only if û∈F(T), where T is the prox-grad mapping defined by
for t>0, and F(T) is the set of fixed points of T. It is well known that if t∈(0,2/Lφ), then T is nonexpansive and F(T)=argmin_{u∈Rm}{φ(u)+ψ(u)}. We also note that the set of all common fixed points of Tn=prox_{cnψ}(I−cn∇φ) is the set of minimizers of the inner level problem (1.2).
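To make the prox-grad mapping concrete, the following is a minimal Python/NumPy sketch for the common case ψ = ‖·‖1, whose proximitor is soft-thresholding; the quadratic φ, the data, and the step size are illustrative choices, not taken from the paper.

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_map(u, grad_phi, t):
    """The prox-grad mapping T = prox_{t*psi}(I - t*grad(phi)) with psi = ||.||_1."""
    return prox_l1(u - t * grad_phi(u), t)

# Illustrative inner problem: phi(u) = 0.5*||Au - b||^2, psi(u) = ||u||_1.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad_phi = lambda u: A.T @ (A @ u - b)
L_phi = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad(phi)
t = 1.0 / L_phi                        # any t in (0, 2/L_phi)

u = np.zeros(2)
for _ in range(200):                   # fixed-point iteration u <- T(u)
    u = prox_grad_map(u, grad_phi, t)
# u is now (approximately) a fixed point of T, i.e., a minimizer of phi + psi.
```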
Furthermore, u∈F(T) is a solution of problem (1.1) if u satisfies the condition
Hereafter, we give some background on iterative methods for finding a fixed point of the nonexpansive mapping T, that is, for finding a point u∗∈C such that Tu∗=u∗.
Let H be a real Hilbert space with norm ‖⋅‖ and inner product ⟨⋅,⋅⟩, and C be a nonempty closed convex subset of H. One of the most popular iterative methods for finding a fixed point of a nonexpansive mapping is the Mann iteration, which was first introduced by Mann [6]. Later, Reich [7] modified it to the general version
where u1∈H and {λn} is a real sequence in [0,1]. He proved the weak convergence of (1.4) under the condition ∑_{n=1}^{∞} λn(1−λn)=∞.
Later, Halpern [8] introduced an iterative method known as the Halpern iteration for finding a fixed point of nonexpansive mappings in real Hilbert spaces. His algorithm was given in the following form:
where u0,u1∈C and {λn}⊂[0,1]. Under some conditions on {λn}, he established a strong convergence theorem for (1.5) when u0=0. Later, Reich [9] extended the Halpern iteration (1.5) to uniformly smooth Banach spaces.
In 1974, by modifying the Mann iteration, Ishikawa [10] introduced the Ishikawa iteration process as follows:
where u1∈H and {λn},{δn}⊂[0,1].
In 2000, Moudafi [11] introduced a viscosity approximation method for nonexpansive mappings, defined as
where u1∈H,{λn}⊂[0,1], and f is a contraction mapping. He proved that, under certain conditions, {un} generated by (1.7) converges strongly to x∈F(T).
By modifying the Ishikawa iteration, Agarwal et al. [12] presented the S-iteration process as follows:
where {λn},{δn}⊂[0,1] and u1 is arbitrarily chosen. Furthermore, they demonstrated that the S-iteration converges faster than the Mann and Ishikawa iterations.
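Since the displayed formulas (1.4)–(1.8) are not reproduced above, the sketch below records the textbook forms of these classical schemes, assuming their standard definitions (coefficient conventions vary slightly across papers); T is any nonexpansive mapping and f any contraction.

```python
def mann_step(T, u, lam):            # (1.4): u_{n+1} = lam_n*u_n + (1 - lam_n)*T(u_n)
    return lam * u + (1 - lam) * T(u)

def halpern_step(T, u0, u, lam):     # (1.5): u_{n+1} = lam_n*u_0 + (1 - lam_n)*T(u_n)
    return lam * u0 + (1 - lam) * T(u)

def ishikawa_step(T, u, lam, delta): # (1.6): a two-step Mann-type scheme
    v = delta * u + (1 - delta) * T(u)
    return lam * u + (1 - lam) * T(v)

def viscosity_step(T, f, u, lam):    # (1.7): u_{n+1} = lam_n*f(u_n) + (1 - lam_n)*T(u_n)
    return lam * f(u) + (1 - lam) * T(u)

def s_iteration_step(T, u, lam, delta):  # (1.8): S-iteration of Agarwal et al.
    v = delta * u + (1 - delta) * T(u)
    return lam * T(u) + (1 - lam) * T(v)

# Toy usage: T halves toward its fixed point 0, f is a 0.5-contraction.
T = lambda x: 0.5 * x
f = lambda x: 0.5 * x + 0.5
x = 10.0
for n in range(1, 50):
    x = viscosity_step(T, f, x, 1.0 / (n + 1))  # x approaches F(T) = {0}
```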
Now, we give some background on iterative methods for finding a common fixed point of a countable family of nonexpansive mappings {Tn}.
Aoyama et al. [13] introduced a Halpern-type iteration
where {λn}⊂[0,1] and u1,u∈C are arbitrarily chosen. Further, they showed that, under some conditions on {λn}, un→x∈⋂_{n=1}^{∞}F(Tn).
Thereafter, Takahashi [14] introduced the iteration process
where {λn}⊂[0,1], and established a strong convergence theorem for (1.10) under some constraints on {λn}.
In 2010, Klin-eam and Suantai [15] introduced the following algorithm:
where {λn}⊂[0,1] and u1∈C, and showed that, under certain conditions, {un} generated by (1.11) converges strongly to a common fixed point of {Tn}.
Polyak [16] developed an inertial methodology for improving the convergence behavior of iterative methods. Since then, inertial techniques have frequently been employed to accelerate such methods; a well-known example is the fast iterative shrinkage-thresholding algorithm (FISTA), defined as follows:
where u1=v0∈Rm, t1=1, and T=prox_{λg}(I−λ∇f) for λ>0. FISTA was introduced by Beck and Teboulle [17], who applied it to image restoration problems and showed that its performance was better than that of existing methods in the literature.
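A minimal FISTA sketch, assuming the standard update of [17] with T = prox_{λg}(I − λ∇f) supplied as a callable; the momentum sequence t_n is the classical one.

```python
import numpy as np

def fista(T, u1, iters=500):
    """FISTA: u_n = T(v_{n-1}) with Nesterov extrapolation, t_1 = 1, v_0 = u_1 (cf. [17])."""
    u_prev, v, t = u1.copy(), u1.copy(), 1.0
    for _ in range(iters):
        u = T(v)                                       # forward-backward step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = u + ((t - 1.0) / t_next) * (u - u_prev)    # inertial extrapolation
        u_prev, t = u, t_next
    return u_prev
```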
A new accelerated viscosity algorithm (NAVA) was proposed by Puangpee and Suantai [18] for finding a common fixed point of {Tn}. It was defined as follows:
where u0,u1∈H, and {σn},{λn},{δn}, and {γn}⊂(0,1). Moreover, they obtained a strong convergence theorem of (1.13) under certain control conditions.
Polyak [19] also highlighted that multi-step inertial methods can accelerate optimization approaches, although neither the convergence nor the rate of convergence of such multi-step inertial methods was established in [19].
After that, Q. L. Dong et al. [20] presented the general inertial Mann algorithm as follows:
for each n≥1, where {θn}⊂[0,θ],{ζn}⊂[0,ζ] with θ1=ζ1=0, and θ,ζ∈[0,1).
From here on, we describe some direct methods for solving problem (1.1), namely the Bilevel Gradient Sequential Averaging Method (BiG-SAM) and the inertial Bilevel Gradient Sequential Averaging Method (iBiG-SAM).
In 2017, Sabach and Shtern [21] presented the BiG-SAM process (Algorithm 1) as follows:
They showed that un→u, where u is a solution of (1.1) and (1.2).
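Algorithm 1 itself appears as a display that is not reproduced here. As a rough sketch based on the description in [21], BiG-SAM averages an inner-level prox-grad step with an outer-level gradient step; the exact step-size rules and parameter choices of [21] may differ from this simplified form.

```python
def big_sam(T, grad_outer, u1, s, lam, iters=500):
    """Sketch of BiG-SAM [21] (not a verbatim transcription of Algorithm 1).
    T: inner-level prox-grad mapping; grad_outer: gradient of the outer objective phi;
    s: outer step size; lam: callable n -> lambda_n in (0, 1)."""
    u = u1.copy()
    for n in range(1, iters + 1):
        y = T(u)                       # inner level: prox-grad step
        z = u - s * grad_outer(u)      # outer level: gradient step on phi
        u = lam(n) * z + (1 - lam(n)) * y   # sequential averaging
    return u
```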
Later, Shehu et al. [22] introduced iBiG-SAM (Algorithm 2) by utilizing an inertial technique with BiG-SAM as follows:
They proved a strong convergence theorem for Algorithm 2 under the following assumption:
Assumption 1.1. Suppose {λn}_{n=1}^{∞}⊂(0,1) and {ϵn}_{n=1}^{∞} are positive sequences satisfying the following conditions:
(1) lim_{n→∞} λn = 0 and ∑_{n=1}^{∞} λn = ∞.
(2) ϵn = o(λn), i.e., lim_{n→∞}(ϵn/λn) = 0. (For example, λn = 1/(n+1) and ϵn = 1/(n+1)² satisfy both conditions.)
Motivated by ongoing research in this area, we are interested in introducing a new accelerated algorithm for solving convex bilevel optimization problems and applying it to solve data classification problems.
The paper is organized as follows: Section 2 contains some fundamental definitions and helpful lemmas. The main results are presented in Section 3, where we propose a new accelerated algorithm for solving convex bilevel optimization problems and prove a strong convergence theorem for it. In Section 4, we apply our main result to data classification problems. Finally, Section 5 concludes the paper.
2. Materials and methods
Throughout this paper, let C be a nonempty closed convex subset of a real Hilbert space H, and let T:C→C be a mapping. Let the strong and weak convergence of {un} to u∈H be denoted by un→u and un⇀u, respectively. A point u∈C is said to be a fixed point of T if Tu=u, and the set of all fixed points of T is denoted by F(T).
A set C is said to be convex if αu+(1−α)v∈C for all u,v∈C and α∈[0,1].
Definition 2.1. Let f:H→R̄. Then, the function f is convex on C if
Definition 2.2. A function f:H→R is strongly convex with constant σ>0 if for any u,v∈H and λ∈[0,1],
Definition 2.3. For a scalar-valued function f:Rm→R, the derivative of f at ū is denoted by ∇f(ū)∈Rm and is defined as
A function f is differentiable if it is differentiable at every u∈Rm.
Definition 2.4. Let f:Rm→R be convex differentiable. The gradient of f at u, denoted by ∇f(u), is defined by
Hereafter, we will recall some important definitions, lemmas, and propositions that will be used to prove our main results.
Definition 2.5. If there exists τ≥0 such that
then T:C→C is said to be Lipschitzian.
In the above inequality, if 0≤τ<1, T is called a contraction, and if τ=1, T is called nonexpansive. It is known that F(T) is closed and convex if T is nonexpansive.
Definition 2.6. Let u∈H. An element u∗∈C is said to be a metric projection of u on C if
and u∗ is denoted by PCu.
The function PC is called the metric projection of H onto C, and it is well known that PC is nonexpansive. Moreover,
holds for all u∈H and v∈C. More information and properties of PC can be found in [23].
For finding a common fixed point of a family of nonexpansive mappings {Tn}, we need some important conditions, one of which is the NST-condition introduced by Nakajo et al. [24].
Let {Tn} and 𝒯 be two families of nonexpansive mappings of H into itself with ∅≠F(𝒯)⊂⋂_{n=1}^{∞}F(Tn), where F(𝒯) is the set of all common fixed points of each T∈𝒯. We say that {Tn} satisfies NST-condition (I) with 𝒯 if, for each bounded sequence {un},
In particular, if 𝒯={T}, then {Tn} is said to satisfy NST-condition (I) with T.
Definition 2.7. Let ψ∈Γ0(H) and t>0. The proximitor of tψ at v∈H, denoted by proxtψ(v), is defined as
The forward-backward operator T of φ and ψ with respect to t is defined by T:=prox_{tψ}(I−t∇φ). Furthermore, if t∈(0,2/Lφ), where Lφ is the Lipschitz constant of ∇φ, it is well known that T is nonexpansive.
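As a quick numerical illustration of this fact, the sketch below checks nonexpansiveness of T = prox_{tψ}(I − t∇φ) on random pairs of points for a small LASSO-type problem; all problem data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad_phi = lambda u: A.T @ (A @ u - b)          # phi(u) = 0.5*||Au - b||^2
L_phi = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of grad(phi)
t = 1.9 / L_phi                                 # any t in (0, 2/L_phi)

prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)  # prox of t*||.||_1
T = lambda u: prox(u - t * grad_phi(u))         # forward-backward operator

for _ in range(1000):
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(T(u) - T(v)) <= np.linalg.norm(u - v) + 1e-12
```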
The following lemma is required to prove our main results.
Lemma 2.8. [25,27] The following hold for all u,w∈H and any real number λ∈[0,1]:
(1) ‖λu+(1−λ)w‖2=λ‖u‖2+(1−λ)‖w‖2−λ(1−λ)‖u−w‖2;
(2) ‖u±w‖2=‖u‖2±2⟨u,w⟩+‖w‖2;
(3) ‖u+w‖2≤‖u‖2+2⟨w,u+w⟩.
The following equality holds for all u,v,w∈H by utilizing Lemma 2.8 (1):
where α,β,γ∈[0,1] with α+β+γ=1.
Lemma 2.9. [26] Let ψ∈Γ0(H), and let φ:H→R be convex differentiable such that ∇φ is Lφ-Lipschitzian with Lφ>0. Let {cn}⊂(0,2/Lφ) and c∈(0,2/Lφ) be such that cn→c. Define Tn:=prox_{cnψ}(I−cn∇φ); then {Tn} satisfies NST-condition (I) with T, where T:=prox_{cψ}(I−c∇φ).
Lemma 2.10. [18] Let T be a nonexpansive mapping, and {Tn} a family of nonexpansive mappings such that ∅≠F(T)⊂⋂_{n=1}^{∞}F(Tn). If {Tn} satisfies NST-condition (I) with T, then for any subsequence {nk} of {n}, {Tnk} also satisfies NST-condition (I) with T.
Proposition 2.11. [21] Let ϕ be a strongly convex differentiable function from Rm into R with parameter σ>0 such that ∇ϕ is Lϕ-Lipschitzian. Define Ts:=I−s∇ϕ, where I is the identity mapping. Then, Ts is a contraction for all s≤2/(Lϕ+σ), that is,
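As a quick sanity check, take ϕ(u)=½‖u‖² so that σ=Lϕ=1 and Ts=(1−s)I; the proposition's bound becomes |1−s| ≤ √(1−s) for s≤2/(Lϕ+σ)=1, which indeed holds:

```python
import numpy as np

s = np.linspace(1e-6, 1.0, 1000)             # s <= 2/(L + sigma) = 1 in this example
tau = np.sqrt(1 - 2 * s * 1 * 1 / (1 + 1))   # sqrt(1 - 2*s*sigma*L/(sigma+L)) = sqrt(1-s)
assert np.all(np.abs(1 - s) <= tau + 1e-12)  # contraction factor of T_s = (1-s)I
```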
Lemma 2.12. [28] Let T:H→H be a nonexpansive mapping with F(T)≠∅. Then, I−T is demiclosed at zero, that is,
for any sequence {un}⊂H such that un⇀u∈H.
Lemma 2.13. [29,30] Let {pn},{ξn} be sequences of nonnegative real numbers, {αn} a sequence in [0,1], and {qn} a sequence of real numbers such that
for all n∈N. If the following conditions hold,
(1) ∑_{n=1}^{∞} αn = ∞;
(2) ∑_{n=1}^{∞} ξn < ∞;
(3) lim sup_{n→∞} qn ≤ 0;
then lim_{n→∞} pn = 0.
Lemma 2.14. [31] Let {ϑn} be a sequence of real numbers that does not decrease at infinity, in the sense that there exists a subsequence {ϑnk} such that ϑnk<ϑnk+1 for all k∈N. Define the sequence {π(n)}n≥n0 by
where n0∈N is such that {j≤n0:ϑj<ϑj+1}≠∅. Then, the following hold:
(1) π(n0)≤π(n0+1)≤⋯ and π(n)→∞;
(2) ϑ_{π(n)}≤ϑ_{π(n)+1} and ϑn≤ϑ_{π(n)+1} for all n≥n0.
3. Results
In this section, we propose a new accelerated algorithm for finding a common fixed point of a family of nonexpansive mappings in H by combining the two-step inertial technique with the viscosity approximation method. We then establish a strong convergence theorem under suitable conditions.
To do this, we start by introducing a new two-step inertial algorithm for estimating a solution for a common fixed point problem (Algorithm 3).
Throughout this section, let {Tn} be a family of nonexpansive mappings on H into itself. Let f be a τ-contraction mapping on H with τ∈(0,1), {ηn}⊂(0,∞), and {λn},{δn},{ιn}⊂(0,1).
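Algorithm 3 itself is given as a display that is not reproduced in this text. Purely as orientation, the sketch below shows the general shape of a two-step inertial viscosity iteration built from these ingredients (inertial extrapolation wn from un, un−1, un−2, a Tn step, and viscosity averaging with f). The precise updates, the role of vn, and the adaptive choice of θn and ζn from ηn in Algorithm 3 may differ, so treat this strictly as an assumption-laden schematic.

```python
def two_step_inertial_viscosity(T_seq, f, u0, u1, theta, zeta, lam, delta, iota,
                                iters=500):
    """Schematic only -- NOT the exact Algorithm 3 of this paper.
    T_seq(n): nonexpansive mapping T_n; f: contraction;
    theta, zeta, lam, delta, iota: callables n -> parameter value."""
    u_mm, u_m, u = u0.copy(), u0.copy(), u1.copy()   # u_{n-2}, u_{n-1}, u_n
    for n in range(1, iters):
        # two-step inertial extrapolation
        w = u + theta(n) * (u - u_m) + zeta(n) * (u_m - u_mm)
        # convex combination of a viscosity term, a T_n step, and w itself
        u_next = (lam(n) * iota(n) * f(u) + delta(n) * T_seq(n)(w)
                  + (1 - lam(n) * iota(n) - delta(n)) * w)
        u_mm, u_m, u = u_m, u, u_next
    return u
```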
Next, we prove a strong convergence theorem of Algorithm 3.
Theorem 3.1. Let T:H→H be a nonexpansive mapping with F(T)≠∅. Assume that ∅≠F(T)⊂⋂_{n=1}^{∞}F(Tn) and that {Tn} satisfies NST-condition (I) with T. Let {un} be the sequence generated by Algorithm 3, and assume the following additional conditions hold:
(1) lim_{n→∞} ηn = 0,
(2) lim_{n→∞} ιn = 0 and ∑_{n=1}^{∞} ιn = ∞,
(3) 0<a<λn for some a∈R,
(4) 0<b<δn<λn+δn<c<1 for some b,c∈R,
Then, the sequence {un} converges strongly to u∈F(T), where u=PF(T)f(u).
Proof. Let u∈F(T) such that u=PF(T)f(u). First, we show that {un} is bounded. According to the definitions of vn and wn, we obtain
and
We also know from (3.1) and (3.2) that
In accordance with Assumption (1) and the definitions of θn and ζn, we have
Then, there exist positive constants M1,M2 such that
From (3.3), we have
where ξ=sup_n{1−(1−τ)λnιn}. As a result, {un} is bounded, and so are {vn}, {wn}, {f(un)}, and {Tnvn}.
Using Lemma 2.8 (2), we also have
Using Lemma 2.8 (3) and (3.4), we have
Since
and
as n→∞, there exist positive constants M3,M4 such that
It follows from (3.5) that
where M5=max{supn‖un−u‖,M3,M4}. From (3.6), we set
and
Hence, we obtain
After that, we examine the following two cases:
Case 1. Assume there exists n0∈N such that the sequence {‖un−u‖}n≥n0 is nonincreasing. Then {‖un−u‖} converges, since it is bounded below by 0. By Assumptions (2) and (3), we infer that ∑_{n=1}^{∞}αn=∞. Then, using Lemma 2.13, we assert that
Indeed, by (3.2), we have
By Lemma 2.8 (1), (3.4), and (3.8), we have
This implies that
Assumptions (2) and (4), together with the convergence of {‖un−u‖} and the facts that θn‖un−un−1‖→0 and |ζn|⋅‖un−1−un−2‖→0, imply that
Since {Tn} satisfies NST-condition (I) with T, we obtain
From the definitions of vn and wn, we have
We can conclude from (3.11) and Assumption (2) that
By definition of un+1, we have
which implies
We can also conclude the following fact from the definition of vn:
Hence,
Set
Then, there is a subsequence {wnk} of {wn} such that
Since {wnk} is bounded, there exists a subsequence {wn′k} of {wnk} such that wn′k⇀w∈H. Without loss of generality, we may assume that wnk⇀w and that (3.20) holds.
From (3.14) we conclude that vnk⇀w, and hence w∈F(T) by Lemma 2.12. Furthermore, using u=PF(T)f(u) and (2.1), we obtain:
Hence,
which implies lim sup_{n→∞} qn ≤ 0, using θn‖un−un−1‖→0 and |ζn|⋅‖un−1−un−2‖→0.
Using Lemma 2.13, we can conclude that un→u.
Case 2. Assume that for every n0∈N, the sequence {‖un−u‖}n≥n0 is not nonincreasing. Define
Then, there is a subsequence {ϑnk} of {ϑn} such that ϑnk<ϑnk+1 for all k∈N. Define π:{n:n≥n0}→N by
For any n≥n0, we have ϑπ(n)≤ϑπ(n)+1 by Lemma 2.14, that is
As in Case 1, by applying (3.23) we obtain δπ(n)(1−λπ(n)−δπ(n))‖vπ(n)−Tπ(n)vπ(n)‖2
which implies
Similar to the proof in Case 1, we get
and
as n→∞, and so
As in Case 1, we then demonstrate that lim sup_{n→∞}⟨f(u)−u,wπ(n)−u⟩≤0. Set
There exists a subsequence {wπ(t)} of {wπ(n)} such that wπ(t)⇀w∈H and
By Lemma 2.10, {Tπ(t)} satisfies NST-condition (I) with T. Due to inequality (3.24), ‖vπ(t)−Tπ(t)vπ(t)‖→0, and we obtain
As in Case 1, we can conclude from (3.25) that vπ(t)⇀w, and w∈F(T). Using u=PF(T)f(u) and (2.1), we obtain
Then,
Since ϑπ(n)≤ϑπ(n)+1, and from (3.6) along with (1−τ)λπ(n)ιπ(n)>0, we obtain
From θπ(n)λπ(n)‖uπ(n)−uπ(n)−1‖→0,|ζπ(n)|λπ(n)‖uπ(n)−1−uπ(n)−2‖→0, and (3.34), we obtain
and so ‖uπ(n)−u‖→0 as n→∞.
This implies by (3.29) that ‖uπ(n)+1−u‖→0 as n→∞. From Lemma 2.14 (2), we get ϑn≤ϑπ(n)+1, that is,
Therefore, un→u. □
To solve problem (1.1), we make the following assumptions:
Assumption 3.2. Let Φ be the set of all solutions of problem (1.1) where
(1) ϕ:Rm→R is strongly convex with parameter σϕ>0,
(2) ϕ is a continuously differentiable function such that ∇ϕ is Lipschitz continuous with constant Lϕ.
For solving the problem (1.2), we assume:
Assumption 3.3. Let Λ be the nonempty set of minimizers of problem (1.2).
(1) φ:Rm→R is convex and continuously differentiable, and ∇φ is Lipschitz continuous with constant Lφ,
(2) ψ∈Γ0(Rm).
Next, we will present an algorithm (Algorithm 4) for solving problem (1.1).
Theorem 3.4. Let ϕ be a function satisfying Assumption 3.2, and let φ and ψ be functions satisfying Assumption 3.3. Let {cn}⊂(0,2/Lφ) and c∈(0,2/Lφ) be such that cn→c as n→∞. Let {un} be a sequence generated by Algorithm 4 under the same conditions as in Theorem 3.1. Then, un→u∈Φ.
Proof. Set Tn=prox_{cnψ}(I−cn∇φ), T=prox_{cψ}(I−c∇φ), and f=I−s∇ϕ. We know that Tn and T are nonexpansive mappings, and from Lemma 2.9 that {Tn} satisfies NST-condition (I) with T. According to Proposition 2.11, f is a contraction with constant τ=√(1−2sσϕLϕ/(σϕ+Lϕ)) for s≤2/(Lϕ+σϕ). Theorem 3.1 then shows that un→u∈F(T), where u=PF(T)f(u). We next claim that u∈Φ. Using (2.1), we have for any v∈F(T)
Therefore, u is a solution of problem (1.1). □
4. Application in data classification
In this section, we utilize our algorithm as a machine learning algorithm for data classification of Parkinson's disease and diabetes, and compare its effectiveness with BiG-SAM and iBiG-SAM.
Let {(xk,tk)∈Rn×Rm:k=1,2,…,s} be a training set with s samples, where xk is an input and tk the corresponding target. The mathematical model of single-hidden-layer feedforward neural networks (SLFNs) is given by
where ok is the output of the extreme learning machine (ELM) for SLFNs, h is the number of hidden nodes, g is an activation function, bj is the bias, and αj and ωj are the weight vectors connecting the j-th hidden node with the output and input nodes, respectively.
The hidden-layer output matrix, denoted by H, is given by
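In code, H can be formed as in the sketch below (the displayed formula for H follows in the original); the sigmoid activation and Gaussian random initialization are typical ELM choices, used here purely for illustration.

```python
import numpy as np

def hidden_layer_matrix(X, h, rng=None):
    """H[k, j] = g(omega_j . x_k + b_j): rows index samples, columns hidden nodes."""
    rng = rng or np.random.default_rng(42)
    W = rng.standard_normal((X.shape[1], h))   # input weights omega_j (random, then fixed)
    b = rng.standard_normal(h)                 # biases b_j (random, then fixed)
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid activation g
```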
The goal of standard SLFNs is to approximate these s samples with zero error, that is, ∑_{k=1}^{s}‖ok−tk‖=0. Then, there exist αj, ωj, and bj such that
We could derive the following simple equation from the s equations above:
where u=[α_1^T,⋯,α_h^T]^T and T=[t_1^T,⋯,t_s^T]^T.
To train the ELM, it is necessary only to calculate a u that satisfies (4.1) with random ωj and bj. If H has a pseudo-inverse H+, then u=H+T is a solution of (4.1). Otherwise, we can obtain a solution in terms of the least squares problem, that is,
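In NumPy, this least-squares solution can be computed directly; the data below are illustrative stand-ins for a real training set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))                    # 100 samples, 8 features (illustrative)
T_mat = rng.integers(0, 2, (100, 1)).astype(float)   # binary targets (illustrative)
W, b = rng.standard_normal((8, 30)), rng.standard_normal(30)
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # hidden-layer output matrix

u = np.linalg.pinv(H) @ T_mat                        # u = H^+ T
# Equivalent and numerically preferable:
u_ls, *_ = np.linalg.lstsq(H, T_mat, rcond=None)
```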
In machine learning, model fitness plays an essential role in training accuracy. An overfitted model cannot be used to predict unknown data; to avoid this, we employ the most common regularization technique, known as the least absolute shrinkage and selection operator (LASSO). It is formulated as
where ‖⋅‖1 is the l1-norm defined by ‖(x1,…,xn)‖1=∑_{i=1}^{n}|xi|, and λ>0 is a regularization parameter. We can cast problem (4.3) as problem (1.2) by setting φ(u):=‖Hu−T‖_2^2 and ψ(u):=λ‖u‖1. For problem (1.1), we set ϕ(u):=(1/2)‖u‖_2^2, with Lϕ=1 and σϕ=1.
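Putting the pieces together, a plain proximal gradient baseline for the LASSO problem (4.3) might look as follows; this is not Algorithm 4 (which additionally uses the inertial and viscosity steps of Section 3), and all data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((100, 30))                 # hidden-layer output matrix (illustrative)
T_mat = rng.standard_normal((100, 1))              # targets (illustrative)
lam_reg = 1e-5                                     # LASSO regularization parameter

grad_phi = lambda u: 2 * H.T @ (H @ u - T_mat)     # gradient of ||Hu - T||_2^2
L_phi = 2 * np.linalg.norm(H, 2) ** 2              # Lipschitz constant matching this gradient
c = 1.0 / L_phi                                    # step size in (0, 2/L_phi)
prox = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam_reg, 0.0)

u = np.zeros((30, 1))
for _ in range(500):
    u = prox(u - c * grad_phi(u), c)               # u <- prox_{c*psi}(u - c*grad(phi)(u))
```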
In this experiment, we aim to classify the Parkinson's disease and diabetes datasets from UCI and Kaggle, respectively.
Parkinson's disease dataset. [33] This dataset contains 195 examples, each with 22 features. We classify two classes of data in this dataset.
Diabetes dataset. [34] This dataset contains 768 examples, each with 8 features. We classify two classes of data in this dataset.
In this experiment, we establish the default settings by selecting the best-performing choice for each parameter of every algorithm, as follows:
(1) For the inner level: ∇φ(u)=2H^T(Hu−T) and Lφ=λmax(H∗H), the maximum eigenvalue of H∗H.
(2) For Algorithm 1 (BiG-SAM) and Algorithm 2 (iBiG-SAM):
(3) For Algorithm 4 (our algorithm):
(4) For all algorithms:
● Regularization parameter: λ=10−5.
● Hidden nodes: h=30.
● n=500,α=3, and s=0.01.
● 10-fold cross-validation.
This experiment uses the Parkinson's disease and diabetes datasets. We compare the effectiveness of Algorithms 1, 2, and 4 at the 500th iteration, as shown in Tables 1 and 2.
The results in Tables 1 and 2 reveal that Algorithm 4 achieves better classification accuracy than the other algorithms.
5. Conclusions
In this paper, we provide a new two-step inertial accelerated algorithm. First, we analyze the convergence behavior of this algorithm and establish a strong convergence theorem under relevant conditions. Next, we utilize our algorithm as a machine learning algorithm to solve data classification problems for some noncommunicable diseases and compare its efficacy with BiG-SAM and iBiG-SAM. We find that our algorithm outperforms BiG-SAM and iBiG-SAM in terms of accuracy. In future work, we plan to employ the proposed algorithm as a machine learning algorithm for the prediction and classification of noncommunicable diseases using data collected from the Sriphat Medical Center, Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand. We also aim to build new innovations in the form of web applications, mobile applications, and computer systems for data prediction and classification of noncommunicable diseases; these will benefit hospitals, communities, and citizens in screening for and preventing such diseases.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This work was partially supported by Chiang Mai University and Fundamental Fund 2024 (FF030/2567), Chiang Mai University.
The first author would like to thank CMU Presidential Scholarship for the financial support.
Conflict of interest
All authors declare no conflicts of interest in this paper.