1. Introduction
Nonnegative matrix factorization (NMF) [13,25,26,31,32] is a representative linear dimensionality reduction method for nonnegative data and has become a widely used tool. Generally speaking, the basic NMF problem can be stated as follows: given an m×n nonnegative matrix V=(Vij) with Vij≥0 and a pre-assigned positive integer r<min(m,n), NMF seeks two nonnegative matrices W∈Rm×r+ and H∈Rr×n+ such that

V ≈ WH. (1.1)

A common way to solve NMF (1.1) is the least-squares formulation

min_{W≥0, H≥0} (1/2)‖V−WH‖²F, (1.2)

where ‖⋅‖F is the Frobenius norm.
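For concreteness, the following minimal MATLAB sketch evaluates the objective of (1.2) and the relative error ‖V−WH‖F/‖V‖F reported later in the experiments; V, W, and H are assumed to be given nonnegative matrices of compatible sizes.

% Objective value of (1.2) and relative approximation error
fval   = 0.5 * norm(V - W*H, 'fro')^2;           % least-squares NMF objective
relerr = norm(V - W*H, 'fro') / norm(V, 'fro');  % relative error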
The projected Barzilai-Borwein gradient (PBB) method is a popular and powerful approach to solving (1.2); it builds on the step size proposed by Barzilai and Borwein [3]. Recently, the literature [8,33,34,42,43] has shown that the PBB method is very effective for optimization problems of the type considered in [17,18,19,20,21,22]. Because of its simplicity and numerical efficiency, the PBB approach has attracted much attention in different fields and has been successfully applied to NMF (see [16,23,24,27,28]). Since W and H play completely symmetric roles, this paper primarily considers updating the matrix W with the PBB method. Let Hk denote the approximate value of H after its kth update, and let

f(W,Hk) = (1/2)‖V−WHk‖²F, so that the subproblem for W reads

min_{W≥0} f(W,Hk). (1.3)
At each step for solving (1.3), there are three different updates:
where LkW>0, ∇f(W)=∇Wf(W,Hk).
The original cost function (1.4a) is the most frequently used form in the PBB method for NMF and has been studied extensively [4,16,27,28,46]. However, the major disadvantage of (1.4a) is that it is not strongly convex, so one can only expect algorithms to find a stationary point rather than a global or even a local minimizer. To overcome this drawback, a proximal modification of the cost function (1.4a) was proposed in [23,24], namely the proximal cost function (1.4b).
The proximal cost function (1.4b) has been combined with the PBB method for NMF in [23,24]. Since (1.4b) is a strongly convex quadratic program bounded below by zero, the subproblem (1.4b) has a unique minimizer. In [23], the authors proposed a quadratically regularized nonmonotone PBB method for (1.4b) and established global convergence under mild conditions. This method was revisited in [24], where it was shown that the monotone PBB method converges globally to a stationary point of (1.3), and numerical experiments indicated that the monotone PBB method can outperform the nonmonotone one in some circumstances. Nevertheless, most existing PBB gradient methods for (1.4a) and (1.4b) tend to converge slowly because of the nonnegativity constraints. This motivates us to develop new and faster NMF algorithms.
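For concreteness, a proximal modification of this kind typically augments the cost with a quadratic distance to the current iterate; a generic form of such a proximal cost function (an illustration only, since the display of (1.4b) is not reproduced here) is

min_{W≥0} f(W,Hk) + (λ/2)‖W−Wk‖²F,  λ>0,

which is strongly convex in W and therefore admits a unique minimizer, consistent with the discussion above.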
In this article, we first enhance the active set identification technique of [6] so that it can reliably identify the active constraints in NMF. We then present a modified nonmonotone line search technique to improve the efficiency of the nonmonotone line search. Using the active set identification technique and the improved nonmonotone line search, we propose a globally convergent gradient-based method for (1.4c) within the alternating nonnegative least squares framework. To accelerate the algorithm, we use the Barzilai-Borwein step size and a larger step-size strategy. Finally, numerical experiments on synthetic and image datasets verify the effectiveness of the proposed method.
This article is organized as follows. In Section 2, we propose an efficient NMF algorithm and establish its global convergence. Experimental results are reported in Section 3. Finally, Section 4 concludes the paper.
2. An active set nonmonotone PBB algorithm
2.1. Main algorithm
In this section, we propose an efficient algorithm for solving the NMF subproblem (1.3) and establish its global convergence. To set up the main result of this section, we first recall some known properties of the objective function f(W,Hk).
Lemma 1. [15] The following two statements hold.
(i) The objective function f(W,Hk) of (1.3) is convex.
(ii) The gradient
is Lipschitz continuous with the constant LW=‖Hk(Hk)T‖2.
For ease of exposition, we focus on (1.4c) and rewrite it as
where U≥0 is a fixed matrix.
It can be seen from Lemma 1 that φ(U,W) is strictly convex in W for each fixed U. In each iteration, we first solve the following strongly convex quadratic minimization problem to compute the point Zt:
Since the objective function of problem (2.2) is strongly convex, this problem has a unique closed-form solution:
where the operator P[X] projects all of the negative entries of X to zero.
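The following MATLAB sketch illustrates the projection operator P and a closed-form solution of projected-gradient type; the subproblem form assumed in the comment is illustrative, since the displays of (2.2) and (2.3) are not reproduced here.

% P[X] sets the negative entries of X to zero. Under the assumed subproblem
%   min_{Z>=0} <G, Z - W> + (1/(2*alpha))*||Z - W||_F^2,
% with G = grad f(W) and step size alpha > 0, the unique solution is:
P = @(X) max(X, 0);
Z = P(W - alpha * G);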
Let Wt+1=Zt+Dt, where Dt stands for the search direction. However, the convergence of {Wt+1} cannot be ensured in general. Therefore, researchers have proposed a globalization strategy based on the modified Armijo line search [45]: we seek a step size λt such that
where M>0. Because of the maximum function, good function values generated in some iterations may be discarded; moreover, the numerical performance can depend heavily on the choice of M in some circumstances (see [7]).
To overcome these shortcomings, we present a modified nonmonotone line search rule. Our line search proceeds as follows: for a given iterate Zt and a search direction Dt at Zt, we select ηt∈[ηmin,ηmax], where 0<ηmin<ηmax<1, and γt∈[γmin,γmax(1−ηmax)], where 0<γmax<1 and 0<γmin<(1−ηmax)γmax, and find a λt satisfying the following inequality:
where St is defined as
Similar to M in (2.4), the choice of ηt in (2.6) plays a key role in controlling the degree of nonmonotonicity (see [14]). Hence, to enhance the efficiency of the nonmonotone line search, Ahookhosh et al. [1] selected a varying value of the parameter ηt via a simple formula. Later, Nosratipour et al. [30] argued that ηt should be related to an appropriate criterion measuring the distance to the optimal solution, and they defined ηt by
However, we found that if the iterate sequence {Zt} is trapped in a narrow curved valley, this can lead to ∇f(Zt)=0 and hence ηt=0, so the nonmonotone line search reduces to the standard Armijo line search, which can be inefficient, producing very short or zigzagging steps. To overcome this shortcoming, we suggest the following ηt:
Clearly, |f(Zt)−f(Zt−1)| is large when the function value decreases rapidly; then ηt is also large and the nonmonotone strategy is stronger. Conversely, as f(Zt) approaches the optimal value, |f(Zt)−f(Zt−1)| tends to zero, so ηt also tends to zero and the nonmonotone rule weakens toward the monotone rule.
Finally, let
where λt is the step size obtained by the nonmonotone line search (2.5).
It is well known from [12] that a larger step size can significantly accelerate the convergence rate of an algorithm. Hence, by adding a relaxation factor s to the update rule (2.9) for Wt+1, we modify (2.9) as
for a relaxation factor s>1. Numerical experiments in Section 3.4 show that the best choice of the parameter s in (2.10) is s=1.7.
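The following MATLAB sketch shows one plausible realization of the modified nonmonotone line search together with the relaxed update (2.10). Since the displays of (2.5)-(2.9) are not reproduced here, the recurrence for St, the formula for ηt, and the acceptance test below are assumptions modeled on standard nonmonotone line searches rather than the exact rules of the paper.

% One nonmonotone backtracking step followed by the relaxed update (2.10).
% f: function handle; Z: current point; D: search direction; G = grad f(Z);
% S_prev = S_{t-1}; f_prev = f(Z_{t-1}); eta_min, eta_max, gam, beta, s given.
fZ  = f(Z);
d   = abs(fZ - f_prev);
eta = min(eta_max, max(eta_min, d/(d + 1)));     % assumed form of (2.8): large when
                                                 % f drops fast, small near a solution
S   = eta*S_prev + (1 - eta)*fZ;                 % assumed reference value S_t
lam = 1;                                         % initial trial step size
while f(Z + lam*D) > S + gam*lam*sum(sum(G.*D))  % assumed Armijo-type test (2.5)
    lam = beta*lam;                              % backtrack with 0 < beta < 1
end
W_next = Z + s*lam*D;                            % relaxed update (2.10), s > 1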
As observed in [6,43], active set methods can enhance the efficiency of locally convergent algorithms and reduce the computational cost. Therefore, we introduce an active set identification technique to approximate the correct support of the solution. In our context, the active set is the subset of zero components of Z∗. Following the idea proposed in [6], we define the active set ˉA as the index set corresponding to the zero components, while the inactive set ˉF is the support of Z∗.
Definition 1. Let Ω={Z∈Rm×r : Zij≥0} and let Z∗ be a stationary point of (1.3). We define the active set as follows:
We further define an inactive set ˉF which is a complementary set of ˉA,
where I={11,12,⋯,1r,21,22,⋯,2r,⋯,m1,m2,⋯,mr}.
Then, for any Z∈Ω, we approximate ˉA and ˉF by A(Zt) and F(Zt), respectively, as follows:
where αt is the BB step size. Similar to Proposition 3.1 in [6], if strict complementarity holds at Z∗, then A(Zt) coincides with the active set whenever Zt is close enough to Z∗, namely A(Zt)=ˉA and F(Zt)=ˉF.
In order to get a good estimate of the active set, the active set is further subdivided into two sets
and
where c>0 is a small constant. Clearly, A2(Zt) is the index set of variables that approximately satisfy the first-order necessary conditions, so it is reasonable to use the scaled projected gradient as the descent direction in the corresponding subspace. Since A1(Zt) is the index set of variables that violate the first-order necessary conditions, to define a reasonable search direction we further subdivide A1(Zt) into two subsets
and
We take the direction 0 for variables with indices in ˉA1(Zt) and the direction −Zt for variables with indices in ˜A1(Zt) to update the corresponding components. Thus, we can write this direction in the following compact form:
where αt is the BB step size.
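The MATLAB sketch below assembles a direction of this shape. Because the membership rules (2.13)-(2.18) are not reproduced here, the logical masks are illustrative stand-ins consistent with the verbal description: direction 0 on ˉA1(Zt), −Zt on ˜A1(Zt), and a scaled projected gradient on A2(Zt)∪F(Zt).

% Z: current iterate; G = grad f(Z); alpha: BB step size; c: small constant.
A  = (Z <= c);                   % assumed active-set estimate: near-zero components
F  = ~A;                         % estimated inactive set F(Z_t)
A2 = A & (abs(min(G,0)) <= c);   % approximately satisfies the first-order conditions
A1 = A & ~A2;                    % violates the first-order conditions
A1bar   = A1 & (Z == 0);         % assumed split of A1 (cf. (2.17), (2.18))
A1tilde = A1 & (Z >  0);
Dpg = max(Z - alpha*G, 0) - Z;   % scaled projected gradient direction
D   = zeros(size(Z));            % direction 0 on indices in A1bar
D(A2 | F)  = Dpg(A2 | F);        % projected-gradient part on A2 and F
D(A1tilde) = -Z(A1tilde);        % direction -Z_t on A1tilde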
Based on the above discussion, we present an active-set-strategy-based nonmonotone projected Barzilai-Borwein gradient method, outlined in Algorithm 1. A similar procedure can be followed for updating H.
To keep things simple, we define the scaled projected gradient direction as
for each α>0 and W≥0. The following Lemma 2 is essential to our proofs.
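Since the display defining Dα(W) is not reproduced here, we assume the common form Dα(W)=P[W−α∇f(W)]−W, which in MATLAB reads:

D_alpha = max(W - alpha*G, 0) - W;   % P[W - alpha*grad f(W)] - W, with G = grad f(W)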
Lemma 2. [2] For each α∈(0,αmax] and W≥0,
(i) ⟨∇f(W), Dα(W)⟩ ≤ −(1/α)‖Dα(W)‖² ≤ −(1/αmax)‖Dα(W)‖²,
(ii) W is a stationary point of (1.3) if and only if Dα(W)=0.
The following lemma states that Dt=0 if and only if the iterate Zt is a stationary point of problem (1.3).
Lemma 3. Let Dt be calculated by (2.19). Then Dt=0 if and only if Zt is a stationary point of problem (1.3).
Proof. Suppose first that (Dt)ij=0. Clearly, (Zt)ij satisfies the stationarity conditions of problem (1.3) when ij∈ˉA1(Zt). If ij∈˜A1(Zt), we have
The above inequality implies that (∇f(Zt))ij≥0. By the Karush-Kuhn-Tucker (KKT) conditions, (Zt)ij satisfies the stationarity conditions of problem (1.3). In the case (Dt)ij=0 with ij∈A2(Zt)∪F(Zt), it follows from (ii) of Lemma 2 that (Zt)ij satisfies the stationarity conditions of problem (1.3).
Conversely, suppose that Zt is a stationary point of (1.3). From the KKT conditions, (2.13), and (2.14), we have
By the definition of (Dt)ij, we have (Dt)ij=0 for all ij∈A1(Zt). Then, from (ii) of Lemma 2, (Dt)ij=0 for all ij∈A2(Zt). Therefore, (Dt)ij=0 for all ij∈A(Zt). For the remaining case, since (∇f(Zt))ij=0 for ij∈F(Zt) and (Zt)ij is a feasible point, the definition of (Dt)ij gives (Dt)ij=0 for all ij∈F(Zt).□
The lemma below shows that when Zt is not a stationary point of problem (1.3), Dt is a descent direction of f at Zt.
Lemma 4. Given the sequence {Zt} produced by Algorithm 1, we have
Proof. By (2.19), we know
If ij∈ˉA1(Zt), it is obvious that
holds.
If ij∈A2(Zt)∪F(Zt), from (i) of Lemma 2, we have
Thus, it remains to prove that
If (Dt)ij=0, inequality (2.25) holds trivially. If (Dt)ij≠0, then for all ij∈˜A1(Zt), from (2.17) we have
which leads to
The above deduction implies that inequality (2.25) holds for ij∈˜A1(Zt). Combining (2.23), (2.24), and (2.26), we conclude that (2.22) holds. □
The following lemma is borrowed from Lemma 3 of [23].
Lemma 5. [23] Let {Zt} and {Wt} be generated by Algorithm 1. Then
We now establish a useful property of our line search.
Lemma 6. Let {Zt} and {Wt} be generated by Algorithm 1. Then
Proof. Based on the definition of St and (2.20), we have
Since 1−ηt−1>0, it follows that f(Wt)−St−1≤0, i.e., f(Wt)≤St−1.
Therefore, if ηt−1≠0, from (2.6), we have
where the last inequality follows from (2.29). Thus, (2.30) indicates
In addition, if ηt−1=0, we have f(Wt)=St.□
It follows from Lemma 6 that
In addition, for any initial iterate W0≥0, the sequences {Zt} and {Wt} generated by Algorithm 1 are both contained in the level set L(W0).
Again from Lemma 6, the following theorem is easily obtained.
Theorem 1. Assume that the level set L(W0) is bounded. Then the sequence {St} is convergent.
Proof. See Corollary 2.2 in [1]. □
Next, we show that the line search (2.5) is well-defined.
Theorem 2. Let {Zt} and {Wt} be generated by Algorithm 1. Then step 5 of Algorithm 1 is well-defined.
Proof. We show that the line search terminates after finitely many steps. To derive a contradiction, suppose that no λt satisfying (2.20) exists. Then, for all sufficiently large positive integers m, by Lemmas 5 and 6, we have
From Lemma 4, we have,
According to the mean-value theorem, there is a θt∈(0,1) such that
namely,
Letting m→∞, we get
Since 0<γt/(1−ηt)<1<s, it follows that ⟨∇f(Zt),Dt⟩≥0. This contradicts the fact that ⟨∇f(Zt),Dt⟩≤0, with strict inequality when Dt≠0 (Lemma 4). Therefore, step 5 of Algorithm 1 is well-defined.□
2.2. Convergence analysis
In this subsection, we prove the global convergence of ANMPBB. The following result shows that the step size λt is bounded below by a positive constant; this lower bound is indispensable for ensuring the global convergence of the proposed algorithm.
Lemma 7. Let λt be the step size generated by Algorithm 1. If Wt+1 is not a stationary point of (1.3), then there exists a constant ˜λ>0 such that λt≥˜λ.
Proof. If the resulting step size λt does not satisfy (2.20), namely,
where the second inequality follows from Lemma 4 and the final inequality from Lemmas 5 and 6. Thus,
By the mean-value theorem, there exists θ∈(0,1) such that
Substituting the last inequality, obtained from (2.33), into (2.32), we get
From ηt−1∈[ηmin,ηmax] and γt∈[γmin,γmax(1−ηmax)], we have
Hence, there exists a ˜λ such that λt≥˜λ.□
Lemma 8. Let {Wt} be generated by Algorithm 1 and assume that the level set L(W0) is bounded. Then
(i)
(ii) there exists a positive constant δ such that
Proof. (i) By the definition of St+1, for t≥1, we have
Since ηmax∈(0,1) and ηt∈[ηmin,ηmax] for all t,
According to Theorem 1, as t→∞,
which implies that
(ii) From (2.5), we have
where δ=γmin˜λ(1−ηmin)/αmax.□
The global convergence of Algorithm 1 is established by the following theorem.
Theorem 3. Let {Zt} and {Wt} be generated by Algorithm 1. Then
Proof. According to (ii) of Lemma 8, we have
Based on (i) of Lemma 8 and letting t→∞, we obtain
□
Combining Theorem 3, Lemma 3, and (2.10), we obtain the main convergence result as follows.
Theorem 4. Assume that the level set L(W0) is bounded. Then any accumulation point of the sequence {Wt} generated by Algorithm 1 is a stationary point of (1.3).
2.3. Complexity analysis
Clearly, at each iteration the main cost of ANMPBB lies in checking condition (2.20) and computing the gradient. Hence, the time complexity per iteration is O(mnr) + #sub-iterations × O(tmr² + tnr²), where t is the number of trials of the nonmonotone line search procedure.
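As a rough illustration with hypothetical sizes m=2000, n=1000, r=20: the gradient cost O(mnr) is about 2000×1000×20 = 4×10⁷ flops per iteration, whereas a single line-search trial costs O(mr²+nr²) = (2000+1000)×400 ≈ 1.2×10⁶ flops, so the gradient computation dominates unless the line search needs many trials.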
3. Numerical experiments
In this section, using synthetic datasets and real-world datasets (the ORL and Yale image datasets*), we present numerical experiments comparing the performance of ANMPBB with that of five other efficient methods: NeNMF [15], the projected BB method APBB2 [16], QRPBB [23], hierarchical alternating least squares (HALS) [5], and the block coordinate descent (BCD) method [44]. All reported numerical results were obtained using MATLAB v8.1 (R2013a) on a Lenovo laptop.
*Both ORL and Yale image datasets in MATLAB format are available at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html
3.1. Stopping criterion
According to the Karush-Kuhn-Tucker (KKT) conditions of the constrained optimization problem, (Wk,Hk) is a stationary point of NMF (1.2) if and only if ∇PWf(W,H)=0 and ∇PHf(W,H)=0 are simultaneously satisfied. Here
and ∇PHf(Wk,Hk) is defined analogously. Hence, in our numerical experiments we employ the following stopping criterion, which is also used in [29]:
where ϵ>0 is a tolerance. When employing the stopping criterion (3.1), one must pay attention to the scaling degrees of freedom of the NMF solution, as discussed in [11].
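A MATLAB sketch of this test follows. The componentwise definition of the projected gradient is the common one (cf. [29]); the relative scaling by the initial projected gradient norm is an assumption, since the display of (3.1) is not reproduced here.

% Gradients of f(W,H) = 0.5*||V - W*H||_F^2
gradW = W*(H*H') - V*H';
gradH = (W'*W)*H - W'*V;
% Projected gradients: full gradient on positive entries, clipped on zeros
PgW = gradW;  PgW(W <= 0) = min(gradW(W <= 0), 0);
PgH = gradH;  PgH(H <= 0) = min(gradH(H <= 0), 0);
pgn  = norm([PgW(:); PgH(:)]);       % projected gradient norm (pgn in the tables)
stop = (pgn <= epsilon * pgn0);      % pgn0: value at the initial point (assumed)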
3.2. Synthetic data
In this subsection, the ANMPBB method and the other three ANLS-based methods are first tested on synthetic datasets. Since the matrix V in this test is a low-rank matrix, we write it as V=LR and generate L and R using the MATLAB commands max(0,randn(m,r)) and max(0,randn(r,n)), respectively.
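In MATLAB, the synthetic data are thus generated exactly as described; the dimensions below are illustrative placeholders.

m = 1000; n = 500; r = 20;   % illustrative problem sizes
L = max(0, randn(m, r));     % nonnegative factor L
R = max(0, randn(r, n));     % nonnegative factor R
V = L * R;                   % low-rank nonnegative test matrix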
For ANMPBB, we adopt the following parameters in the subsequent experiments:
These settings are identical to those of APBB2 and QRPBB. We take s=1.7 for ANMPBB; the reason for this choice of relaxation factor is given in Section 3.4. We take tol=10−8 for all compared algorithms. In addition, for ANMPBB, ηt is updated by formula (2.8). The maximum number of iterations of all algorithms is set uniformly to 50,000. All other parameters of APBB2, NeNMF, and QRPBB are kept at their default values.
For each problem considered, we randomly generated 10 different starting values, and the average results over these starting points are presented in Table 1. The item iter denotes the number of iterations required until the termination condition (3.1) is met. The item niter denotes the total number of sub-iterations for solving W and H. ‖V−WkHk‖F/‖V‖F is the relative error, ‖[∇PHf(Wk,Hk),∇PWf(Wk,Hk)]‖F is the final value of the projected gradient norm, and CPU time (in seconds) measures performance.
Table 1 clearly shows that all methods met the convergence condition within a reasonable number of iterations, and that our ANMPBB requires the least execution time and the fewest sub-iterations among all methods, particularly on large-scale problems.
The ANMPBB method is closely related to the QRPBB method, and the hierarchical ALS (HALS) algorithm, which uses coordinate descent to solve the subproblems in NMF, is widely regarded as one of the most effective NMF methods. We therefore further examine ANMPBB, QRPBB, HALS, and BCD. Figures 1–3 compare these four methods on eight randomly generated problems contaminated with independent Gaussian noise at a signal-to-noise ratio of 30 dB. All methods are terminated when the stopping criterion (3.1) with ϵ=10−8 is satisfied or the number of iterations exceeds 30. Figure 1 plots the objective function value against the number of iterations. From Figure 1, we conclude that, for most of the test problems, ANMPBB decreases the objective function much faster than the other three methods within 30 iterations. This may be because ANMPBB exploits an efficient modified nonmonotone line search, uses a good active set prediction strategy, and adds a relaxation factor s to the update rules for Wt+1 and Ht+1; hence ANMPBB significantly outperformed the other three methods. Figure 2 shows the relative residual error versus the number of iterations, and Figure 3 shows the relative residual error versus CPU time. The results in Figures 2 and 3 are consistent with those in Figure 1.
3.3. Image data
The ORL image dataset is a collection of 400 face images of 40 individuals, 10 per person. The dataset includes variations in lighting, facial expressions (open/closed eyes, smiling or not), and facial details (glasses or no glasses). Some subjects were photographed at different times. All images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). The images are represented by the rows of the matrix V, so V has 400 rows and 1024 columns.
The Yale face dataset was created at the Yale Center for Computational Vision and Control. It consists of 165 gray-scale images of 15 individuals, with 11 images per person. The images were captured under different lighting conditions (left-light, center-light, right-light), with various facial expressions (calm, cheerful, sorrowful, surprised, and blinking), and with or without glasses. The images are represented by the rows of the matrix V, so V has 165 rows and 1024 columns.
For all datasets, we ran the algorithms from several randomly generated starting points with ϵ=10−8 in (3.1); the maximum number of iterations (maxit) for all algorithms was set to 50,000, and the average results are presented in Table 2. From Table 2, we conclude that the QRPBB method converged in fewer iterations and less CPU time than APBB2 and NeNMF, while our ANMPBB method required about one-third of the CPU time of QRPBB to reach the prescribed tolerance. Although the residuals of ANMPBB were not the smallest among all algorithms on all databases, the pgn values show that the solutions produced by ANMPBB were closer to a stationary point.
3.4. The importance of relaxation factor s
In the following, we present experimental results illustrating the importance of the relaxation factor s in the update rules for Wt+1 and Ht+1. We run ANMPBB with different values of s, namely s=0.1,0.3,0.7,1.0,1.3,1.7,1.9, on the same synthetic data as in Section 3.2.
We set the maximum number of iterations to 30; the other parameters take the same values as in Section 3.2. Figure 4 shows the relative residual error versus run time. From Figure 4, we see that the relaxation factor fails to accelerate convergence when s<1, whereas increasing s significantly accelerates convergence when 1<s<2. For ANMPBB, s=1.7 yields the fastest convergence among the tested values; hence s=1.7 is used in all experiments with ANMPBB.
4. Conclusions
In this paper, an active set identification technique was proposed, together with an improved nonmonotone line search rule designed to enhance the efficiency of the nonmonotone line search, in which a new parameter formula attempts to control the degree of nonmonotonicity and thus increase the likelihood of finding the global minimum. On the basis of the alternating nonnegative least squares framework, a globally convergent gradient-based NMF method was proposed using the modified line search and the active set identification technique. In addition, the Barzilai-Borwein step size and a larger step-size technique were utilized to accelerate convergence. Finally, numerical results showed that our algorithm is a promising tool for NMF.
NMF is an important linear dimensionality reduction technique for nonnegative data and has found numerous applications in data analysis, such as various clustering tasks [9,10,35,36,37]. In future work, we therefore plan to extend the application of our algorithm to multi-class clustering problems. Another direction for future research is to extend the proposed algorithm to solve real-world problems [38,39,40,41].
Author contributions
Xiaoping Xu: Software, formal analysis, resources, visualization; Jinxuan Liu: Methodology, validation, data curation, project administration; Wenbo Li: Conceptualization, validation, writing original draft preparation, supervision, funding acquisition; Yuhan Xu: Conceptualization, validation, investigation; Fuxiao Li: writing original draft preparation, writing review and editing, project administration. All authors have read and approved the final version of the manuscript for publication.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Grant No. 12201492).
Conflict of interest
The authors declare that they have no conflicts of interest.