
Trace norm regularization is a widely used approach for learning low rank matrices. A standard optimization strategy is based on formulating the problem as one of low rank matrix factorization which, however, leads to a non-convex problem. In practice this approach works well, and it is often computationally faster than standard convex solvers such as proximal gradient methods. Nevertheless, it is not guaranteed to converge to a global optimum, and the optimization can be trapped at poor stationary points. In this paper we show that it is possible to characterize all critical points of the non-convex problem. This allows us to provide an efficient criterion to determine whether a critical point is also a global minimizer. Our analysis suggests an iterative meta-algorithm that dynamically expands the parameter space and allows the optimization to escape any non-global critical point, thereby converging to a global minimizer. The algorithm can be applied to problems such as matrix completion or multitask learning, and our analysis holds for any random initialization of the factor matrices. Finally, we confirm the good performance of the algorithm on synthetic and real datasets.
Citation: Carlo Ciliberto, Massimiliano Pontil, Dimitrios Stamos. Reexamining low rank matrix factorization for trace norm regularization. Mathematics in Engineering, 2023, 5(3): 1-22. doi: 10.3934/mine.2023053
Learning low rank matrices is a problem of broad interest in machine learning and statistics, with applications ranging from collaborative filtering [1,2], to multitask learning [3], to computer vision [4], and many more. A principled approach to tackle this problem is via suitable convex relaxations. Perhaps the most successful strategy in this sense is provided by trace (or nuclear) norm regularization [3,5,6,7]. However, solving the corresponding optimization problem is computationally expensive for two main reasons. First, many commonly used algorithms require the computation of the proximity operator (e.g., [8]) which entails performing the singular value decomposition at each iteration. Second, and most importantly, the space complexity of convex solvers grows with the matrix size, which makes it prohibitive to employ them for large scale applications.
Due to the above shortcomings, practical algorithms for low rank matrix completion often use an explicit low rank matrix factorization to reduce the number of variables (see e.g., [2,9] and references therein). In particular, a reduced variational form of the trace norm is used [7]. The resulting problem is however non-convex, and popular methods such as alternating minimization or alternating gradient descent may get stuck at poor stationary points. Recent studies [10,11] have shown that under certain conditions on the data generation process (underlying low rank model, RIP, etc.) a particular version of the non-convex problem can be solved efficiently. However, such conditions are not verifiable in real applications, and the problem of finding a global solution remains open.
In this paper we characterize the critical points of the non-convex problem and provide an efficient criterion to determine whether a critical point is also a global minimizer. Our analysis is constructive and suggests an iterative meta-algorithm that dynamically expands the parameter space to escape any non-global critical point, thereby converging to a global minimizer. We highlight the potential of the proposed meta-algorithm by comparing its computational and statistical performance to two state of the art methods [9,12].
The paper is organized as follows. In Sec. 2 we introduce the trace norm regularization problem and basic notions used throughout the paper. In Sec. 2.2 we present the low rank matrix factorization approach and set the main questions addressed in the paper. In Sec. 3 we present our analysis of the critical points of the low rank matrix factorization problem and the associated meta-algorithm. In Sec. 4 we report results from numerical experiments using our method. The Appendix contains proofs of the results stated in the main body of the paper, together with further auxiliary results and empirical observations.
In this work we study trace norm regularized problems of the form
$$\operatorname*{minimize}_{W\in\mathbb{R}^{n\times m}} f_\lambda(W), \qquad f_\lambda(W) = \ell(W) + \lambda\|W\|_* \tag{2.1}$$
where ℓ:Rn×m→R is a twice differentiable convex function with Lipschitz continuous gradient, ‖W‖∗ denotes the trace norm of a matrix W∈Rn×m, namely the sum of the singular values of W, and λ is a positive parameter. Examples of relevant learning problems that can be formulated as Eq (2.1) are:
Matrix completion. In this setting we wish to recover a matrix Y∈Rn×m from a small subset of its entries. A typical choice for ℓ is the square error, ℓ(W)=‖M⊙(Y−W)‖2F, where ‖⋅‖F is the Frobenius norm, ⊙ denotes the Hadamard product (i.e., the entry-wise product) between two matrices and M∈Rn×m is a binary matrix used to "mask" the entries of Y that are not available.
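As a concrete illustration, the masked square error and its gradient can be written in a few lines of NumPy; this sketch and its function name are our own, not from the paper:

```python
import numpy as np

def completion_loss_grad(W, Y, M):
    """Masked square error ell(W) = ||M . (Y - W)||_F^2 and its gradient.

    M is a binary mask; since M . M = M, the gradient with respect to W
    is 2 M . (W - Y).
    """
    R = M * (W - Y)
    return float(np.sum(R * R)), 2.0 * R
```

Only the observed entries (where M is one) contribute to the loss and to the gradient, which is what makes the completion problem well posed on partial data.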
Multi-task learning. Here, the columns $w_1,\ldots,w_m$ of matrix $W$ are interpreted as the regression vectors of different learning tasks. Given $m$ datasets $(x_{ij}, y_{ij})_{i=1}^{n_j}$ with $x_{ij}\in\mathbb{R}^n$ and $y_{ij}\in\mathbb{R}$, for $j=1,\ldots,m$, we choose $\ell(W) = \sum_{j=1}^{m}\frac{1}{n_j}\sum_{i=1}^{n_j}\bar\ell(y_{ij}, w_j^\top x_{ij})$, where $\bar\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ is a prescribed loss function (e.g., the square or the logistic loss).
Other examples which are captured by problem (2.1) include collaborative filtering with attributes [13] and multiclass classification [5].
Problem (2.1) can be solved by first-order optimization methods such as the proximal forward-backward (PFB) splitting [8]. Given a starting point W0∈Rn×m, PFB produces a sequence (Wk)k∈N with
$$W_{k+1} = \operatorname{prox}_{\gamma\lambda\|\cdot\|_*}\!\big(W_k - \gamma\nabla\ell(W_k)\big) \tag{2.2}$$
where γ>0 and for any convex function ϕ:Rn×m→R, the associated proximity operator at W∈Rn×m is defined as $\operatorname{prox}_\phi(W) = \operatorname{argmin}\{\phi(Z) + \tfrac{1}{2}\|W - Z\|_F^2 : Z\in\mathbb{R}^{n\times m}\}$. In particular, the proximity operator of the trace norm at W∈Rn×m corresponds to performing a soft-thresholding on the singular values of W. That is, assuming a singular value decomposition (SVD) W=UΣV⊤, where U∈Rn×r, V∈Rm×r have orthonormal columns, Σ=diag(σ1,…,σr), with σ1≥⋯≥σr>0 and r=rank(W), we have
$$\operatorname{prox}_{\gamma\lambda\|\cdot\|_*}(W) = U\operatorname{diag}\!\big(h_{\gamma\lambda}(\sigma_1),\ldots,h_{\gamma\lambda}(\sigma_r)\big)V^\top \tag{2.3}$$
where hγλ is the soft-thresholding operator, defined for σ≥0 as hγλ(σ)=max(0,σ−γλ).
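To make Eqs (2.2) and (2.3) concrete, here is a minimal NumPy sketch of one PFB iteration; the helper names are our own:

```python
import numpy as np

def prox_trace_norm(W, tau):
    """prox of tau * ||.||_* : soft-threshold the singular values of W (Eq 2.3)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def pfb_step(W, grad_ell, gamma, lam):
    """One proximal forward-backward iteration (Eq 2.2)."""
    return prox_trace_norm(W - gamma * grad_ell(W), gamma * lam)
```

Each call performs one full SVD of an n×m matrix, which is exactly the per-iteration bottleneck discussed next.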
PFB guarantees that for a suitable choice of the descent step γ (e.g., γ<2/L with L the Lipschitz constant of the gradient of ℓ), the sequence fλ(Wk) converges to the global minimum of the problem at a rate of O(1/k) [8] (faster rates can be achieved using accelerated versions of the algorithm [14]). However, at each iteration PFB performs the SVD of an n×m matrix via Eqs (2.2) and (2.3), which requires O(min(n,m)nm) operations, a procedure that becomes prohibitively expensive for large values of n and m. Other methods, such as those based on the Frank-Wolfe procedure (e.g., [15]), alleviate this cost but require more iterations. More importantly, these methods need to store the iterates Wk in memory, imposing an O(nm) space complexity, a major bottleneck for large scale applications. However, trace norm regularization is typically applied to problems where the solution of problem (2.1) is assumed to be low-rank, namely of the form AB⊤ for some A∈Rn×r, B∈Rm×r and r≪min(n,m). It would therefore be ideal to have an optimization method capable of capturing this aspect and consequently reducing the memory requirement to O(r(m+n)) by keeping track of the two factor matrices A and B throughout the optimization, rather than their product. This is the idea behind factorization-based methods, which have been observed to achieve remarkable performance in practice and are the subject of our investigation in this work.
Factorization methods build on the so-called variational form of the trace norm. Specifically, the trace norm of a matrix W∈Rn×m can be characterized as (see e.g., [16] or Lemma 9 in the Appendix)
$$\|W\|_* = \frac{1}{2}\inf\big\{\|A\|_F^2 + \|B\|_F^2 : r\in\mathbb{N},\ A\in\mathbb{R}^{n\times r},\ B\in\mathbb{R}^{m\times r},\ W = AB^\top\big\}, \tag{2.4}$$
with the infimum always attained for r=rank(W). The above formulation leads to the following "factorized" version of the original optimization problem (2.1)
$$\operatorname*{minimize}_{A\in\mathbb{R}^{n\times r},\,B\in\mathbb{R}^{m\times r}} g_{\lambda,r}(A,B), \qquad g_{\lambda,r}(A,B) = \ell(AB^\top) + \frac{\lambda}{2}\big(\|A\|_F^2 + \|B\|_F^2\big) \tag{2.5}$$
where r∈N is now a further hyperparameter of the problem. Clearly, fλ and gλ,r are tightly related, and a natural question is whether minimizing the latter allows us to recover a solution of the original problem. The following well-known result (of which we provide a proof in the Appendix for completeness) shows that the two problems are indeed equivalent (for sufficiently large r).
Proposition 1 (Equivalence between problems (2.1) and (2.5)). Let W∗∈Rn×m be a global minimizer of fλ in Eq (2.1) with r∗=rank(W∗). Then, for every r≥r∗, every global minimizer (A∗,B∗) of gλ,r is such that
$$g_{\lambda,r}(A_*,B_*) = f_\lambda(A_*B_*^\top) = f_\lambda(W_*). \tag{2.6}$$
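A quick numerical sanity check of the variational form (2.4): for A=UΣ^{1/2} and B=VΣ^{1/2} built from the SVD of W (the attaining pair of Lemma 9), the half sum of squared Frobenius norms matches the trace norm. Variable names here are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))  # a rank-3 matrix

U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = int(np.sum(s > 1e-10))                 # numerical rank of W
A = U[:, :r] * np.sqrt(s[:r])              # A = U Sigma^{1/2}
B = Vt[:r].T * np.sqrt(s[:r])              # B = V Sigma^{1/2}

trace_norm = s.sum()
half_frob = 0.5 * (np.linalg.norm(A, "fro") ** 2 + np.linalg.norm(B, "fro") ** 2)
assert np.allclose(W, A @ B.T) and np.isclose(trace_norm, half_frob)
```

The check also illustrates why the infimum is attained at r = rank(W): each column of A and B carries exactly one singular value's worth of squared norm.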
The above proposition implies that for sufficiently large values of r in Eq (2.5), the minimization of fλ and gλ,r are equivalent. Therefore, we can minimize fλ by finding a global minimizer of gλ,r. This is a well-known approach to trace norm regularization (see e.g., [1,2]) and can be extremely advantageous in practice. Indeed, we can leverage a large body of smooth optimization literature to solve such a factorized problem [17]. As an example, if we apply Gradient Descent (GD) from a starting point (A0,B0), we obtain the sequence (Ak,Bk)k∈N with
$$A_{k+1} = A_k - \gamma\big(\nabla\ell(A_kB_k^\top)B_k + \lambda A_k\big), \qquad B_{k+1} = B_k - \gamma\big(\nabla\ell(A_kB_k^\top)^\top A_k + \lambda B_k\big). \tag{2.7}$$
This approach is much more appealing than PFB from a computational perspective because: 1) for small values of r, the iterations in Eq (2.7) are extremely fast, since they mainly consist of matrix products and therefore require only O(nmr) operations; 2) the space complexity of GD is O(r(n+m)), which may be remarkably smaller than the O(nm) of PFB for small values of r; 3) even if for large values of r, e.g., r=min(n,m), every iteration has the same time complexity as PFB, we do not need to perform expensive operations such as the SVD of an n×m matrix at every iteration. This dramatically reduces computation times in practice.
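The GD update of Eq (2.7) takes only a few lines; note the simultaneous update (Bk+1 uses Ak, not Ak+1). The function name is our own:

```python
import numpy as np

def factored_gd_step(A, B, grad_ell, gamma, lam):
    """One gradient step on g_{lambda,r} (Eq 2.7).

    grad_ell(W) returns the gradient of ell at W; the n x m gradient is
    only ever multiplied by the thin factors, giving O(nmr) cost per step.
    """
    G = grad_ell(A @ B.T)
    A_new = A - gamma * (G @ B + lam * A)
    B_new = B - gamma * (G.T @ A + lam * B)
    return A_new, B_new
```

For a small enough step size γ, each iteration decreases the factorized objective away from critical points, exactly as for GD on any smooth function.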
The strategy of minimizing gλ,r instead of fλ was originally proposed in [7] and its empirical advantages have been extensively documented in previous literature, e.g., [2]. However, this approach opens important theoretical and practical questions that have not been addressed by previous work:
● How to choose r? By Prop. 1 we know that for suitably large values of r the minimization of gλ,r and fλ are equivalent. However, a lower bound for such an r cannot be recovered analytically from the functional itself, and so it is not clear how to choose r in practice.
● Global convergence. The function gλ,r is not jointly convex in the two variables A and B. This opens the question of whether GD (or other optimization methods) converge to a global minimizer for sufficiently large values of r.
Investigating such issues is the main focus of this work.
In this section we study the questions outlined above and provide a meta-algorithm to minimize the function gλ,r while incrementally searching for a rank r for which Prop. 1 is verified. Our analysis builds upon the following key points:
● (Prop. 2) We characterize all critical points of gλ,r, namely those points to which iterative optimization methods applied to Eq (2.5) (e.g., GD) could in principle converge.
● (Thm. 3) We derive an efficient criterion to determine whether a critical point of gλ,r is a global minimizer (typically an NP-hard problem for non-convex functions).
● (Prop. 4) We show that for any critical point (A,B) of gλ,r which is not a global minimizer, it is always possible to constructively find a descent direction for gλ,r+1 from the point ([A0],[B0])∈Rn×(r+1)×Rm×(r+1).
● (Thm. 5) By combining the above results we show that for r≥min(n,m), every critical point of gλ,r is either a global minimizer or a so-called strict saddle point, namely a point where the Hessian of the target function has at least one direction of negative curvature. We can then appeal to [18] to show that descent methods such as GD avoid strict saddle points and hence converge to a global minimizer.
The above discussion suggests a natural "meta-algorithm" (which is presented more formally in Sec. 3.3) to address the minimization of fλ via gλ,r while increasing r incrementally:
1) Initialize r=1. Choose A′0∈Rn and B′0∈Rm.
2) Starting from (A′r−1,B′r−1), converge to a critical point (Ar,Br) for gλ,r.
3) If (Ar,Br) satisfies our criterion for global optimality (see Thm. 3) stop, otherwise:
4) Perform a step in a descent direction for gλ,r+1 from ([Ar0],[Br0]) to a point (A′r,B′r), with A′r∈Rn×(r+1), B′r∈Rm×(r+1); increase r to r+1 and go back to Step 2.
From our analysis in the following, the procedure above is guaranteed to stop after at most min(n,m) iterations. However, Prop. 1, together with our criterion for global optimality, suggests that this meta-algorithm could stop much earlier if fλ admits a low-rank minimizer (which is indeed the case in our experiments). This has two main advantages: 1) by exploring candidate ranks incrementally, we can expect significantly faster computations and convergence if our optimality criterion activates for r≪min(n,m); and 2) we automatically recover the rank of a minimizer of fλ without the need to perform expensive operations such as the SVD.
Remark 1. The meta-algorithm considered in this paper is related to the optimization strategy recently proposed in [19], where the authors study convex problems for which a non-convex "factorized" formulation exists, including the setting considered in this work as a special case. However, by adopting such a general perspective, the resulting optimization strategy is less effective when applied to the specific minimization of gλ,r. In particular: 1) the optimality criterion derived in [19] is only a sufficient but not necessary condition; 2) the upper bound on r is much larger than the one provided in this work, i.e., r=nm rather than r=min(n,m); 3) convergence guarantees to a global optimum cannot be provided.
By focusing exclusively on the question of minimizing fλ via its factorized form gλ,r, and by leveraging the specific structure of the problem, it is instead possible to provide a further analysis of the behavior of the proposed meta-algorithm.
Since gλ,r is a non-convex smooth function, in principle we can expect optimization algorithms based on first or second order methods to converge only to critical points, i.e., points (A∗,B∗) for which ∇gλ,r(A∗,B∗)=0. The following result provides a characterization of such critical points and plays a key role in our analysis.
Proposition 2 (Characterization of critical points of gλ,r). Let (A∗,B∗)∈Rn×r×Rm×r be a critical point of gλ,r. Let s≤min(n,m) and let U∈Rn×s and V∈Rm×s be two matrices with orthonormal columns corresponding to the left and right singular vectors of ∇ℓ(A∗B⊤∗)∈Rn×m with singular value equal to λ. Then, there exists C∈Rs×r, such that A∗=UC and B∗=−VC.
This result is instrumental in deriving a necessary and sufficient condition to determine whether a stationary point of gλ,r is actually a global minimizer. Indeed, since fλ is convex, we can leverage the first order condition for global optimality, which states that the matrix W∗=A∗B⊤∗ is a global minimizer of fλ (and, by Prop. 1, (A∗,B∗) of gλ,r) if and only if the zero matrix belongs to the subdifferential of fλ at W∗ (see e.g., [8]). Studying this inclusion leads to the following theorem.
Theorem 3 (A criterion for global optimality). Let (A∗,B∗) be a critical point of gλ,r. Then A∗B⊤∗ is a minimizer for fλ if and only if ‖∇ℓ(A∗B⊤∗)‖≤λ.
This result provides a natural strategy to determine whether a descent method minimizing gλ,r has converged to a global minimizer: we evaluate the operator norm of the gradient, denoted ‖∇ℓ(A∗B⊤∗)‖, and check whether it exceeds λ. For large matrices this operation can be performed efficiently using approximation methods, e.g., power iteration [20]. Note that in general it is NP-hard to determine whether a critical point of a non-convex function is actually a global minimizer [21]; it is only because of the relation with the convex function fλ that in this case such a check can be performed in polynomial time.
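For the check of Thm. 3, the operator norm can be approximated without a full SVD. A minimal power-iteration sketch follows; the naming is ours, and we use a fixed iteration count rather than a convergence test:

```python
import numpy as np

def operator_norm(G, n_iter=100, seed=0):
    """Approximate the largest singular value of G by power iteration on G^T G."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    for _ in range(n_iter):
        v = G.T @ (G @ v)       # one step of power iteration on G^T G
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(G @ v))

def passes_global_criterion(A, B, grad_ell, lam, tol=1e-6):
    """Thm 3 at a critical point (A, B): ||grad ell(A B^T)|| <= lambda."""
    return operator_norm(grad_ell(A @ B.T)) <= lam + tol
```

Each power iteration costs two matrix-vector products, so the criterion can be checked cheaply even for large matrices.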
In this section, we observe that for any critical point of gλ,r which is not a global minimizer, it is always possible either to find a direction along which to escape from it or, alternatively, to increase r by one and find a descent direction for gλ,r+1. This strategy is suggested by the following result.
Proposition 4 (Escape direction from critical points). With the same notation as in Prop. 2, assume rank(A∗)=rank(B∗)<r and ‖∇ℓ(A∗B⊤∗)‖=μ>λ. Then, (A∗,B∗) is a so-called strict saddle point for gλ,r, namely the Hessian of gλ,r at (A∗,B∗) has at least one negative eigenvalue. In particular, there exists q∈Rr such that A∗q=0, B∗q=0 and, if u∈Rn and v∈Rm are the left and right singular vectors of ∇ℓ(A∗B⊤∗) with singular value equal to μ, then gλ,r decreases locally at (A∗,B∗) along the direction (uq⊤,−vq⊤).
A direct consequence of the result above is that an optimization strategy can remain trapped only at global minimizers of gλ,r or at critical points (A∗,B∗) for which A∗ and B∗ have full rank (since we can always escape from rank deficient ones). If the latter happens, Prop. 4 suggests the strategy adopted in this work, namely to increase the problem dimension to r+1 and consider the "inflated" point ([A∗0],[B∗0]). Indeed, at such a point, gλ,r+1 attains the same value as gλ,r(A∗,B∗), and it is straightforward to verify that ([A∗0],[B∗0]) is still a critical point of gλ,r+1. Since the matrices [A∗0] and [B∗0] now have rank <r+1, we can apply Prop. 4 to find a direction along which gλ,r+1 decreases. This procedure will stop for r>min(n,m), since rank(A∗)≤min(n,m)<r, and we can therefore apply Prop. 4 to always escape from critical points until we reach a global minimizer (this fact can actually be improved to hold also for r=min(n,m), as we see in the following).
Note however that in general, if the number of non-global critical points is infinite, such a strategy is not guaranteed to converge to the global minimizer. However, since by Prop. 4 every such critical point is a strict saddle point, we can leverage previous results from the non-convex optimization literature (see [18] and references therein) to prove the following result.
Theorem 5 (Convergence to global minimizers of gλ,r). Let r≥min(n,m). Then the set of starting points (A0,B0) for which GD does not converge to a global minimizer of gλ,r has measure zero.
In particular, Thm. 5 suggests initializing the optimization method used to minimize gλ,r by applying a small perturbation to the initial point (A0,B0) via additive noise, sampled from a distribution that is absolutely continuous with respect to the Lebesgue measure on Rn×r×Rm×r (e.g., a Gaussian). This perturbation guarantees that the initial point does not lie in the set of points from which GD converges to a strict saddle point, and therefore that the meta-algorithm considered in this work converges to a global minimizer. We make this statement more precise in the next section.
We can now formally present the meta-algorithm outlined at the beginning of Sec. 3 to find a solution of the trace norm regularization problem (2.1) by minimizing gλ,r in Eq (2.5) for increasing values of r.
Algorithm 1 META-ALGORITHM
Input: λ>0; ϵconv>0 convergence tolerance; ϵcrit>0 global criterion tolerance.
Initialize: Set r=1. Sample A′0∈Rn and B′0∈Rm randomly.
For r=1 to min(n,m):
    (Ar,Br) = OPTIMIZATIONALGORITHM(A′r−1, B′r−1, gλ,r, ϵconv)
    If ‖∇ℓ(ArB⊤r)‖ ≤ λ+ϵcrit: Break
    (A′r,B′r) = ([Ar u],[Br v]) with u∈Rn, v∈Rm sampled randomly
    r = r+1
End
Return: (Ar,Br)
Algorithm 1 proceeds by iteratively applying the descent method OPTIMIZATIONALGORITHM (e.g., GD) to minimize gλ,r while increasing the estimated rank r one step at a time. Whenever the optimization algorithm converges to a critical point (Ar,Br) of gλ,r (within a certain tolerance ϵconv), Algorithm 1 verifies whether the global optimality criterion has been activated (again within a certain tolerance ϵcrit). If this is the case, (Ar,Br) is a global minimizer and we stop the algorithm. Otherwise we "inflate" the two factor matrices by one column each and repeat the procedure. The new column is initialized randomly, since by Prop. 4 we know that we will not converge again to ([Ar0],[Br0]), because it is not full rank. A more refined approach would be to choose u and −v to be the singular vectors of ∇ℓ(ArB⊤r) associated with the largest singular value. For the sake of brevity we provide an example of such a strategy in the Appendix. Note, however, that we still need to apply a random perturbation to the step in the descent direction in order to invoke Thm. 5 and be guaranteed not to converge to strict saddle points. As a direct corollary to Thm. 5 we have
Corollary 6. Algorithm 1 with GD as OPTIMIZATIONALGORITHM converges to a point (A∗,B∗) such that A∗B⊤∗ is a global minimizer for fλ with probability 1.
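To make the procedure concrete, here is a compact and simplified Python rendition of Algorithm 1 with plain GD as the inner solver. Step size, iteration budget, tolerance and initialization scales are illustrative choices of ours, the inner loop runs for a fixed budget rather than to a tolerance ϵconv, and the optimality check uses a full SVD rather than power iteration for brevity:

```python
import numpy as np

def meta_algorithm(grad_ell, n, m, lam, gamma=0.05, gd_steps=600,
                   eps_crit=1e-3, seed=0):
    """Simplified sketch of Algorithm 1: grow the rank r until Thm 3 activates."""
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((n, 1))        # start with r = 1
    B = 0.1 * rng.standard_normal((m, 1))
    for r in range(1, min(n, m) + 1):
        for _ in range(gd_steps):                # inner descent on g_{lambda,r} (Eq 2.7)
            G = grad_ell(A @ B.T)
            A, B = (A - gamma * (G @ B + lam * A),
                    B - gamma * (G.T @ A + lam * B))
        # Thm 3: global optimality criterion on the operator norm of the gradient.
        if np.linalg.svd(grad_ell(A @ B.T), compute_uv=False)[0] <= lam + eps_crit:
            return A, B, r
        # Inflate: append a small randomly initialized column to each factor.
        A = np.hstack([A, 0.01 * rng.standard_normal((n, 1))])
        B = np.hstack([B, 0.01 * rng.standard_normal((m, 1))])
    return A, B, min(n, m)
```

On a toy rank-one problem with ℓ(W)=½‖W−Y‖²F, the sketch stops at r=1 and recovers the soft-thresholded solution, illustrating the early-stopping behavior discussed above.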
In this section, we touch upon the question of convergence rates for optimization schemes applied to problem (2.5). For simplicity, we focus on GD, but our result generalizes to other first order methods. We provide upper bounds on the number of iterations required to guarantee that the GD iterates are ϵ-close to a critical point of the target function. By standard convex analysis results (see [22]) it is known that GD applied to a differentiable convex function has a sublinear convergence rate of O(1/ϵ), comparable to that of PFB. However, since gλ,r is non-convex, here we need to rely on more recent results on the application of first order methods to functions satisfying the so-called Kurdyka-Łojasiewicz (KL) inequality [23,24].
Definition 7 (Kurdyka-Łojasiewicz inequality). A differentiable function g:Rd→R is said to satisfy the Kurdyka-Łojasiewicz inequality at a critical point x∗∈Rd if there exist a neighborhood U⊆Rd and constants γ,ϵ>0 and α∈[0,1) such that, for all x∈U∩{x:g(x∗)<g(x)<g(x∗)+ϵ},
$$\gamma\,|g(x) - g(x_*)|^{\alpha} < \|\nabla g(x)\|_F. \tag{3.1}$$
The KL inequality measures how large the gradient of the function is in a neighborhood of a critical point. This allows us to derive convergence rates for GD methods that depend on the constant α. In particular, as a corollary to [18] (see also [23]) we obtain the following result.
Corollary 8 (Convergence rate of gradient descent). Let (Ak,Bk)k∈N be a sequence produced by the GD method applied to gλ,r. If gλ,r satisfies the KL inequality for some α∈[0,1), then there exist a critical point (A∗,B∗) of gλ,r and constants C>0, b∈(0,1) such that
$$\|(A_k,B_k) - (A_*,B_*)\|_F^2 \leq \begin{cases} C\,b^k & \text{if } \alpha\in(0,1/2], \\[2pt] C\,k^{-\frac{1-\alpha}{2\alpha-1}} & \text{if } \alpha\in(1/2,1). \end{cases} \tag{3.2}$$
Furthermore, if α=0 convergence is achieved in a finite number of steps.
This result shows that depending on the constant α∈[0,1) appearing in the KL inequality, we can expect different convergence rates for GD applied to problem (2.5). Although it is challenging to identify this constant, or even to provide an upper bound for it in specific instances, the class of functions satisfying the KL inequality is extremely large and includes both analytic and semi-algebraic functions, see e.g., [25]. In the Appendix, we argue that if a function ℓ:Rn×m→R is semi-algebraic, then the function ℓr:Rn×r×Rm×r→R defined by ℓr(A,B)=ℓ(AB⊤) is also semi-algebraic. Therefore, in order to apply Cor. 8 it is sufficient that the error ℓ and the regularizer are semi-algebraic, a property verified by many commonly used error functions (or the associated loss functions, e.g., square, logistic) and regularizers (in particular the squared Frobenius norm).
We report an empirical analysis of the meta-algorithm studied in this work. All experiments were conducted on an Intel Xeon E5-2697 V3 2.60GHz CPU with 32GB RAM. We consider a matrix completion setting, with the loss ℓ(W)=‖M⊙(Y−W)‖2F, where Y is the matrix that we aim to recover and M a binary matrix masking entries not available at training time. Below we briefly describe the datasets used.
Synthetic. We generated a 100×100 matrix Y=AB⊤+E as a rank 10 matrix product of two 100×10 matrices A,B plus additive noise E; the entries for A,B and E were independently sampled from a standard normal distribution.
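The synthetic setup can be reproduced along the following lines; the 20% observation rate matches the experiment described later in this section, and the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 100, 100, 10
A = rng.standard_normal((n, r))
B = rng.standard_normal((m, r))
E = rng.standard_normal((n, m))
Y = A @ B.T + E                                # rank-10 signal plus Gaussian noise
M = (rng.random((n, m)) < 0.2).astype(float)   # mask revealing ~20% of the entries
```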
Movielens. This dataset [26] consists of different datasets for movie recommendation of increasing size. They all comprise a number of ratings (from 1 to 5) given by n users on a database of m movies, which are recorded as a matrix Y∈Rn×m with missing entries. In this work we considered three such datasets of increasing size, namely Movielens 100k (ml100k) with 100 thousand ratings from n=943 users on m=1682 movies, Movielens 1m (ml1m) with ∼1 million ratings, n=6040 users and m=3900 movies, and Movielens 10m (ml10m), with ∼10 million ratings, n=71567 users and m=10681 movies.
The result in Thm. 3 provides a necessary and sufficient criterion to determine whether Algorithm 1 has achieved a global minimum of fλ. A natural question is whether, in practice, such a criterion is satisfied at a much smaller rank r than the one at which we are guaranteed convergence, namely r=min(n,m). To address this question we compared the solution achieved by our approach with the one obtained by iterative soft thresholding (ISTA) (or proximal forward backward, see e.g., [8]) on both synthetic and real datasets. Figure 1 reports the value of r at which our meta-algorithm satisfied the criterion for global optimality and compares it with the rank of the ISTA solution for different values of λ. For the Synthetic dataset (Figure 1 Left) we considered only 20% of the generated matrix Y entries for the optimization of fλ. For the real dataset (Figure 1 Right) we considered ml100k and sampled 50% of each user's available ratings. For both the synthetic and the real experiments our meta-algorithm recovers the same rank as that found by ISTA. However, our algorithm reaches this rank incrementally, exploiting the low rank factorization. As we will see in the next section, this results in much faster convergence in practice.
We compared the performance of our meta-algorithm with two state of the art methods, Active Newton (ALT) [12] and Alternating Least Squares Soft Impute (ALS-SI) [9], on the three Movielens datasets. We used 50%, 25% and 25% of each user's ratings for training, validation and testing, and repeated our experiments across 5 separate trials to account for statistical variability in the sampling. Test error was measured in terms of the Normalized Mean Absolute Error (NMAE), namely the mean (entry-wise) absolute error on the test set, normalized by the maximum discrepancy max(Yij)−min(Yij) between entries in Y. As a reference for the behavior of the different methods, Figure 2 reports, on a single trial, the decrease of fλ on the training data and the NMAE on the test set with respect to time, for the best λ chosen by validation. All methods were run until convergence and attained the same value of the objective function and the same test error in all our trials. However, as can be noticed, our meta-algorithm and ALS-SI seem to attain a much lower test error during earlier iterations. To investigate this aspect further, Table 1 reports results in terms of time, test error and estimated rank attained on average across the 5 trials by the different methods at the iteration with smallest validation error. As can be noticed, our meta-algorithm is generally on par with its competitors in terms of test error while being considerably faster and recovering low-rank solutions. This highlights an interesting byproduct of the meta-algorithm considered in this work: by exploring candidate ranks incrementally, the method can find potentially better factorizations than trace norm regularization, both in terms of test error and estimated rank. This fact can be empirically observed also for different values of λ, as we report in the Appendix.
Table 1. Average (over 5 trials) test NMAE, time and estimated rank at the iteration with smallest validation error.

|        |        ml100k        |         ml1m         |        ml10m         |
|        | NMAE   time(s)  rank | NMAE   time(s)  rank | NMAE    time(s) rank |
| ALT    | 0.2165    97     93  | 0.1806   4133   179  | 0.1670  205023  225  |
| ALS-SI | 0.1956    40     16  | 0.1749    832    31  | 0.1648   51205   36  |
| Ours   | 0.1959     2     11  | 0.1751     39    25  | 0.1659    3150   41  |
We studied the convergence properties of low rank factorization methods for trace norm regularization. Key to our study is a necessary and sufficient condition for global optimality, which can be applied to any critical point of the non-convex problem. This condition, together with a detailed analysis of the critical points, led us to propose a meta-algorithm for trace norm regularization that incrementally expands the number of factors used by the non-convex solver. Although algorithms of this kind have been studied empirically for years, our analysis provides a fresh look and novel insights which can be used to confirm whether a global solution has been reached. Numerical experiments indicated that our optimality condition is useful in practice and that the meta-algorithm is competitive with state-of-the-art solvers. In the future it would be valuable to study improvements to our analysis, which would allow us, on the one hand, to derive precise rates of convergence for specific solvers used within the meta-algorithm and, on the other hand, to study additional conditions under which our global optimality criterion is guaranteed to activate as soon as the number of factors exceeds the rank of the trace norm regularization minimizer.
We kindly thank Andreas Argyriou, Kevin Lai and Saverio Salzo for their helpful comments, corrections and suggestions. This work was supported in part by EPSRC grant EP/P009069/1.
The authors declare no conflict of interest.
Here we collect some auxiliary results and we provide proofs of the results stated in the main body of the paper.
The first lemma establishes the variational form for the trace norm; its proof can be found in [16].
Lemma 9 (Variational form of the trace norm). For every W∈Rn×m and r∈N let Fr(W) = {(A,B)∈Rn×r×Rm×r : AB⊤ = W}. Let k = rank(W) and let σ1(W)≥⋯≥σk(W)>0 be the k nonzero singular values of W. Then
‖W‖∗ = ∑_{i=1}^k σi(W) = (1/2) inf{ ‖A‖F² + ‖B‖F² | (A,B)∈Fr(W), r∈N }.
Furthermore, if W = UΣV⊤ is a singular value decomposition (SVD) of W, with Σ = diag(σ1(W),…,σk(W)), the infimum is attained for r = rank(W), A = UΣ^{1/2}, and B = VΣ^{1/2}.
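Lemma 9 is easy to check numerically: for a random low rank W, the factors A = UΣ^{1/2}, B = VΣ^{1/2} built from the SVD attain the variational bound (a small NumPy sanity check, not part of the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))  # rank-3, 6x5

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U @ np.diag(np.sqrt(s))      # A = U Sigma^{1/2}
B = Vt.T @ np.diag(np.sqrt(s))   # B = V Sigma^{1/2}

assert np.allclose(A @ B.T, W)   # (A, B) factorizes W
# (1/2)(||A||_F^2 + ||B||_F^2) equals the trace norm sum_i sigma_i(W)
assert np.isclose(0.5 * (np.linalg.norm(A)**2 + np.linalg.norm(B)**2), s.sum())
```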
Recall that if ϕ:Rd→R∪{+∞} is a proper convex function, its sub-differential at x is the set
∂ϕ(x) = { u : ϕ(x) + ⟨u, y−x⟩ ≤ ϕ(y), for all y∈dom(ϕ) }.
The elements of ∂ϕ(x) are called the sub-gradients of ϕ at x.
Let On be the set of n×n orthogonal matrices. A norm ‖⋅‖:Rn×m→[0,∞) is called orthogonally invariant if, for every U∈On, V∈Om and W∈Rn×m, we have ‖UWV‖=‖W‖ or, equivalently, ‖W‖=g(σ(W)), where g is a symmetric gauge function (SGF), that is, a norm invariant to permutations and sign changes. An important example of orthogonally invariant norms is given by the p-Schatten norms, ‖W‖=‖σ(W)‖p, where ‖⋅‖p is the ℓp-norm of a vector. In particular, for p∈{1,2,∞} we obtain the trace, Frobenius, and spectral norms, respectively.
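As an illustration (a hypothetical helper assuming only NumPy), the p-Schatten norms for p ∈ {1, 2, ∞} can be computed directly from the singular values:

```python
import numpy as np

def schatten_norm(W, p):
    """p-Schatten norm of W: the l_p norm of the vector of singular values."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return np.linalg.norm(sigma, p)

W = np.array([[3., 0.], [0., 4.]])                # singular values (4, 3)
assert np.isclose(schatten_norm(W, 1), 7.0)       # trace norm
assert np.isclose(schatten_norm(W, 2), 5.0)       # Frobenius norm
assert np.isclose(schatten_norm(W, np.inf), 4.0)  # spectral norm
```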
The following result is due to [27, Cor. 2.5].
Lemma 10. If ‖⋅‖:Rn×m→[0,∞) is an orthogonally invariant norm and g is the associated SGF, then for every W∈Rn×m it holds that
∂‖W‖ = { U diag(μ) V⊤ : U∈On, V∈Om, μ∈∂g(σ), W = U diag(σ) V⊤ }.
For convenience of the reader, we restate the results presented in the main body of the paper.
Proposition 1 (Equivalence between problems (2.1) and (2.5)). Let W∗∈Rn×m be a global minimizer of fλ in Eq (2.1) with r∗=rank(W∗). Then, for every r≥r∗, every global minimizer (A∗,B∗) of gλ,r is such that
gλ,r(A∗,B∗)=fλ(A∗B∗⊤)=fλ(W∗). | (2.6) |
Proof. Let W∗∈Rn×m be a minimizer of fλ of rank r∗ = rank(W∗) and let UΣV⊤ be a singular value decomposition of W∗, with U∈Rn×r∗ and V∈Rm×r∗ having orthonormal columns and Σ∈Rr∗×r∗ diagonal with positive diagonal entries. Define A∗ = UΣ^{1/2}∈Rn×r∗ and B∗ = VΣ^{1/2}∈Rm×r∗. By construction ‖W∗‖∗ = (1/2)(‖A∗‖F² + ‖B∗‖F²) and therefore
fλ(W∗)=fλ(A∗B⊤∗)=gλ,r∗(A∗,B∗). |
Now, we prove that (A∗,B∗) is a minimizer of gλ,r∗. Suppose by contradiction that there exists a pair A1∈Rn×r∗, B1∈Rm×r∗ such that gλ,r∗(A1,B1) < gλ,r∗(A∗,B∗). Define
(ˉA1, ˉB1) = argmin_{AB⊤ = A1B1⊤} ‖A‖F² + ‖B‖F².
Then by Lemma 9 we have
‖ˉA1ˉB1⊤‖∗ = (1/2)(‖ˉA1‖F² + ‖ˉB1‖F²) ≤ (1/2)(‖A1‖F² + ‖B1‖F²)
and therefore
fλ(ˉA1ˉB⊤1)=gλ,r∗(ˉA1,ˉB1)≤gλ,r∗(A1,B1)<gλ,r∗(A∗,B∗)=fλ(W∗) |
which is clearly not possible, since W∗ is a global minimizer of fλ.
Proposition 2 (Characterization of critical points of gλ,r). Let (A∗,B∗)∈Rn×r×Rm×r be a critical point of gλ,r. Let s≤min(n,m) and let U∈Rn×s and V∈Rm×s be two matrices with orthonormal columns corresponding to the left and right singular vectors of ∇ℓ(A∗B∗⊤)∈Rn×m with singular value equal to λ. Then, there exists C∈Rs×r, such that A∗=UC and B∗=−VC.
Proof. Let A∗∈Rn×r, B∗∈Rm×r be the matrices corresponding to a critical point of gλ,r. We let
∇ℓ(A∗B∗⊤)=λU1V⊤1+U2Σ2V⊤2+U3Σ3V⊤3 | (B.1) |
be the SVD of the gradient of ℓ at A∗B∗⊤, split into blocks: the first block collects the singular values equal to λ, Σ2 is the diagonal matrix formed by the singular values strictly larger than λ, and Σ3 is the diagonal matrix formed by the singular values strictly smaller than λ (including those equal to zero). For each i = 1,2,3 we denote by si the number of columns of Ui∈Rn×si and Vi∈Rm×si. Recall that the matrices [U1 U2 U3]∈Rn×n and [V1 V2 V3]∈Rm×m are both orthogonal.
Taking the derivatives of Eq (2.5) w.r.t. A and B and setting them to zero gives the following optimality conditions for the critical points
∇ℓ(A∗B∗⊤)B∗+λA∗=0 | (B.2) |
∇ℓ(A∗B∗⊤)⊤A∗+λB∗=0. | (B.3) |
Solving Eq (B.3) for B∗ and replacing it in Eq (B.2) yields
(−(1/λ²)∇ℓ(A∗B∗⊤)∇ℓ(A∗B∗⊤)⊤ + In)A∗ = 0. | (B.4)
By Eq (B.1), ∇ℓ(A∗B∗⊤)∇ℓ(A∗B∗⊤)⊤ = λ²U1U1⊤ + U2Σ2²U2⊤ + U3Σ3²U3⊤. Using this in Eq (B.4) and rearranging gives
(In − U1U1⊤ − U2(Σ2²/λ²)U2⊤ − U3(Σ3²/λ²)U3⊤)A∗ = 0
which we rewrite as
(U2(I−Σ22λ2)U⊤2+U3(I−Σ23λ2)U⊤3)A∗=0. |
Therefore, the columns of A∗ must be in the range of U1 (i.e., orthogonal to U2 and U3), namely
A∗=U1C | (B.5) |
for some C∈Rs1×r. Similarly we derive that
B∗=V1D | (B.6) |
for some D∈Rs1×r. Combining Eq (B.1) with Eq (B.6) we obtain that
∇ℓ(A∗B∗⊤)B∗ = λU1D.
Using this equation, Eq (B.5) and Eq (B.6) we rewrite Eq (B.2) as λU1D+λU1C=0. This implies that D=−C and, so, B∗=−V1C.
Theorem 3 (A criterion for global optimality). Let (A∗,B∗) be a critical point of gλ,r. Then A∗B∗⊤ is a minimizer for fλ if and only if ‖∇ℓ(A∗B∗⊤)‖≤λ.
Proof. Let W∗ = A∗B∗⊤. Since fλ is convex, W∗ is a minimizer of fλ if and only if 0∈∂fλ(W∗) or, equivalently,
−(1/λ)∇ℓ(W∗) ∈ ∂‖W∗‖∗. | (B.7)
Let Z = −(1/λ)∇ℓ(W∗). As a special case of Lemma 10, Z∈∂‖W∗‖∗ holds if and only if there exists a simultaneous singular value decomposition W∗ = U diag(σ(W∗)) V⊤, Z = U diag(σ(Z)) V⊤ with σ(Z) ∈ ∂‖σ(W∗)‖1, where σ(Z) denotes the vector of singular values of Z arranged in non-increasing order. Recall that
∂‖σ‖1 = { z∈Rm : zi = 1 if σi ≠ 0, and zi∈[−1,1] otherwise }.
Using the same notation of Prop. 2, consider the SVD
Z=−U1V⊤1−U2Σ2λV⊤2−U3Σ3λV⊤3. |
By Prop. 2 we have that A∗ = U1C and B∗ = −V1C, with C∈Rs1×r. Now, let ˉr = rank(C) ≤ r and let C = PΓQ⊤ be the SVD of C, with P∈Rs1×ˉr and Q∈Rr×ˉr matrices with orthonormal columns and Γ∈Rˉr×ˉr diagonal with positive diagonal elements. Denote by ˜P = [P P⊥]∈Rs1×s1 the orthogonal matrix obtained by completing P with a matrix P⊥∈Rs1×(s1−ˉr) with orthonormal columns such that P⊤P⊥ = 0. Moreover, define ˜Γ∈Rs1×s1 as
˜Γ = diag(Γ, 0(s1−ˉr)×(s1−ˉr)), i.e., the matrix with Γ in its top-left ˉr×ˉr block and zeros elsewhere.
Then,
W∗ = −U1CC⊤V1⊤ = (−U1P)Γ²(V1P)⊤ = (−U1˜P)˜Γ²(V1˜P)⊤ = (−˜U1)˜Γ²˜V1⊤
with ˜U1 = U1˜P and ˜V1 = V1˜P. Note that since ˜P is orthogonal, ˜U1˜V1⊤ = U1V1⊤. Moreover, U2, U3 have columns orthogonal to ˜U1 and V2, V3 have columns orthogonal to ˜V1. Consequently,
Z = −U1V1⊤ − U2(Σ2/λ)V2⊤ − U3(Σ3/λ)V3⊤ = −˜U1˜V1⊤ − U2(Σ2/λ)V2⊤ − U3(Σ3/λ)V3⊤
is an alternative singular value decomposition for Z. Therefore, W∗ and Z have a simultaneous singular value decomposition and we can conclude that σ(Z)∈∂‖σ(W∗)‖1 if and only if s2=0, namely ‖∇ℓ(W∗)‖≤λ as desired.
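In practice, Theorem 3 gives a cheap certificate: compute ∇ℓ at W = AB⊤ and compare its spectral norm with λ. A sketch for the square loss ℓ(W) = (1/2)‖W−Y‖F² (the function names and tolerance below are our own illustrative choices):

```python
import numpy as np

def is_global(A, B, grad_loss, lam, tol=1e-8):
    """Theorem 3 criterion: a critical point (A, B) of g_{lambda,r} yields a
    global minimizer W = A B^T of f_lambda iff the spectral norm of
    grad ell(W) is at most lambda."""
    G = grad_loss(A @ B.T)
    return np.linalg.svd(G, compute_uv=False)[0] <= lam + tol

# Square-loss example: f_lambda is minimized by singular value soft-thresholding.
Y = np.diag([3.0, 1.0])
grad_loss = lambda W: W - Y          # gradient of 1/2 ||W - Y||_F^2
lam = 2.0

A_opt = np.array([[1.0], [0.0]])     # A_opt B_opt^T = diag(1, 0): the shrunk SVD of Y
B_opt = np.array([[1.0], [0.0]])
assert is_global(A_opt, B_opt, grad_loss, lam)

A0 = np.zeros((2, 1)); B0 = np.zeros((2, 1))   # (0, 0) is also a critical point...
assert not is_global(A0, B0, grad_loss, lam)   # ...but not global: ||grad ell(0)|| = 3 > lam
```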
Proposition 4 (Escape direction from critical points). With the same notation as Prop. 2, assume rank(A∗) = rank(B∗) < r and ‖∇ℓ(A∗B∗⊤)‖ = μ > λ. Then, (A∗,B∗) is a so-called strict saddle point for gλ,r, namely the Hessian of gλ,r at (A∗,B∗) has at least one negative eigenvalue. In particular, there exists q∈Rr such that A∗q = 0, B∗q = 0 and, if u∈Rn and v∈Rm are the left and right singular vectors of ∇ℓ(A∗B∗⊤) corresponding to the singular value μ, then gλ,r decreases locally at (A∗,B∗) along the direction (uq⊤,−vq⊤).
Proof. By Prop. 2, A∗ = U1C and B∗ = −V1C, where U1 and V1 are the matrices of left and right singular vectors of ∇ℓ(A∗B∗⊤) and C∈Rs×r, for s = rank(A∗) = rank(B∗). Taking the SVD C = PΓQ⊤, we rewrite
A∗=U1PΓQ⊤,B∗=−V1PΓQ⊤. | (B.8) |
Since A∗ and B∗ are rank deficient and have the same null space, we can choose a unit vector q∈Rr such that A∗q = B∗q = 0. Let u2 and v2 be left and right singular vectors of ∇ℓ(A∗B∗⊤) associated with the singular value μ.
We consider a perturbation of the objective function along the direction (−u2q⊤, v2q⊤). We have that
L(γ) = ℓ((A∗ − γu2q⊤)(B∗ + γv2q⊤)⊤) + (λ/2)(‖A∗ − γu2q⊤‖F² + ‖B∗ + γv2q⊤‖F²) | (B.9)
= ℓ(A∗B∗⊤ − γ²u2v2⊤) + (λ/2)(‖A∗‖F² + ‖B∗‖F²) + λγ². | (B.10)
Thus, we have that
L′(γ)=⟨−2γ∇ℓ(A∗B∗⊤−γ2u2v⊤2),u2v⊤2⟩+2γλ | (B.12) |
=−2γu⊤2∇ℓ(A∗B∗⊤−γ2u2v⊤2)v2+2γλ. | (B.13) |
Consequently
L″(0)=2(λ−u⊤2∇ℓ(A∗B∗⊤)v2)=2(λ−μ)<0 |
and the result follows.
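The descent claimed by Prop. 4 can be verified numerically. Below, (A∗,B∗) = (0,0) is a rank-deficient critical point of gλ,r for the square loss, and the objective strictly decreases along the direction built from the top singular pair of ∇ℓ(0) (an illustrative NumPy check, with our own choice of data and step size):

```python
import numpy as np

n, m, r, lam = 4, 3, 2, 1.0
rng = np.random.default_rng(1)
Y = 3.0 * rng.standard_normal((n, m))

def g(A, B):
    """g_{lambda,r}(A, B) for the square loss ell(W) = 1/2 ||W - Y||_F^2."""
    return (0.5 * np.linalg.norm(A @ B.T - Y)**2
            + 0.5 * lam * (np.linalg.norm(A)**2 + np.linalg.norm(B)**2))

A0, B0 = np.zeros((n, r)), np.zeros((m, r))   # critical point: grad ell(0) B0 + lam A0 = 0
U, s, Vt = np.linalg.svd(-Y)                  # grad ell(0) = -Y
u, mu, v = U[:, 0], s[0], Vt[0, :]
assert mu > lam                               # Prop. 4 applies: ||grad ell(0)|| > lambda

q = np.zeros(r); q[0] = 1.0                   # unit q with A0 q = B0 q = 0
gamma = 0.1
# The objective strictly decreases along (-u q^T, v q^T)
assert g(A0 - gamma * np.outer(u, q), B0 + gamma * np.outer(v, q)) < g(A0, B0)
```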
Theorem 5 (Convergence to global minimizers of gλ,r). Let r≥min(n,m). Then the set of starting points (A0,B0) for which GD does not converge to a global minimizer of gλ,r has measure zero.
Proof. We have shown in Prop. 4 that every critical point (A,B) of gλ,r with r≥min(n,m) is either a global minimizer or a strict saddle point, namely one at which the Hessian ∇²gλ,r(A,B) has at least one negative eigenvalue. Moreover, since the error function ℓ is twice differentiable with Lipschitz continuous gradient, so is gλ,r. Let Lr>0 be the Lipschitz constant of ∇gλ,r. We are then in the hypotheses of Thm. 4.1 in [18], which states that if (Ak,Bk)k∈N is obtained by GD with step size 0<α<1/Lr and initial point (A0,B0) sampled uniformly at random, then for any strict saddle point (A∗,B∗) of gλ,r,
Prob(limk→+∞(Ak,Bk)=(A∗,B∗))=0 |
which implies the desired result and corresponds to Cor. 6.
Finally, we show that if the spectral norm of the gradient at a critical point exceeds λ only slightly, then the value of the objective function is close to the global minimum.
Proposition 11. Let (A∗,B∗) be a critical point of gλ,r and let W∗=A∗B⊤∗. If ‖∇ℓ(W∗)‖≤λ+ϵ with ϵ∈[0,λ] then
fλ(W∗) ≤ min_W fλ(W) + ϵ(‖W∗‖∗ + ℓ(0)/λ).
Proof. We write
∇ℓ(W∗) = λU1V1⊤ + U2Σ2V2⊤ + U3Σ3V3⊤ = λ(U1V1⊤ + U2V2⊤) + U3Σ3V3⊤ − U2(λI − Σ2)V2⊤. | (B.14)
Let Z = U2(λI − Σ2)V2⊤. By assumption the singular values in Σ2 lie in (λ, λ+ϵ], hence ‖Z‖ ≤ ϵ and, by the simultaneous SVD argument in the proof of Theorem 3, −Z∈∂fλ(W∗). That is, for every W∈Rn×m,
fλ(W∗) + ⟨−Z, W−W∗⟩ ≤ fλ(W). | (B.15)
In turn this implies that
fλ(W∗)≤⟨Z,W−W∗⟩+fλ(W). |
Since the minimizer of fλ can be constrained to lie in the set {W : ‖W‖∗ ≤ ℓ(0)/λ} (indeed fλ(W) ≤ fλ(0) = ℓ(0) at any minimizer, so λ‖W‖∗ ≤ ℓ(0)), we conclude from Eq (B.15) that
fλ(W∗) ≤ min_W fλ(W) + ‖Z‖(‖W∗‖∗ + ℓ(0)/λ) ≤ min_W fλ(W) + ϵ(‖W∗‖∗ + ℓ(0)/λ).
We expand on the discussion in Sec. 3.3, where we formally introduced the meta-algorithm considered in this work. Specifically, we observe that the formulation of Algorithm 1 does not exploit the result in Prop. 4, which provides an explicit descent direction for gλ,r+1 from the inflated point ([A∗ 0],[B∗ 0]), where (A∗,B∗) is a stationary point of gλ,r. For completeness, Algorithm 2 reports a variant of the meta-algorithm that makes use of this escape direction. We point out that in our experiments we did not observe any statistically significant difference with the original version.
We discuss here the details of Algorithm 2. By Prop. 4 we know that there exists γ>0 such that
gλ,r+1([A∗ 0] + γ[0n×r u], [B∗ 0] + γ[0m×r −v]) < gλ,r(A∗,B∗)
where u∈Rn,v∈Rm are respectively the left and right singular vectors associated to the largest singular value μ of ∇ℓ(A∗B⊤∗). In Algorithm 2 these are provided by the routine LARGESTSINGULARPAIR.
We can find a new point on the direction specified above by trying to approximate the minimizer of
γ∗ = argmin_{γ∈[0,1]} gλ,r+1([A∗ 0] + γ[0n×r u], [B∗ 0] + γ[0m×r −v])
for instance by considering a set of candidates γ1,…,γM∈(0,1] (e.g., the geometric grid γi = γ0^i for some γ0∈(0,1) and i = 1,…,M) and choosing the one leading to the lowest value of gλ,r+1. In Algorithm 2 we refer to this procedure as LINESEARCH, given its analogy with the standard line search approaches often used in optimization. Notice, however, that such methods typically rely on a stopping criterion depending on the norm of the gradient ‖∇gλ,r+1([A∗ 0],[B∗ 0])‖F in order to guarantee that the function decreases significantly. We cannot replicate this strategy, since in our case ∇gλ,r+1([A∗ 0],[B∗ 0]) = 0.
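A minimal version of the LINESEARCH routine described above can be sketched as follows (the geometric grid γi = γ0^i and the interface are our own interpretation of the procedure, not the paper's code):

```python
import numpy as np

def line_search(g, A_pad, B_pad, u, v, gamma0=0.5, M=20):
    """Grid search over candidate steps {gamma0^i : i = 1..M}: move the last
    (zero) column of the inflated point along (+gamma u, -gamma v) and keep
    the step with the lowest objective value. No gradient-based stopping rule
    is available, since the gradient vanishes at the inflated point."""
    best_gamma, best_val = 0.0, g(A_pad, B_pad)
    for i in range(1, M + 1):
        gamma = gamma0 ** i
        A = A_pad.copy(); A[:, -1] += gamma * u
        B = B_pad.copy(); B[:, -1] -= gamma * v
        val = g(A, B)
        if val < best_val:
            best_gamma, best_val = gamma, val
    return best_gamma
```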
As a final observation, we point out (as already noted in Sec. 3.3) that it is necessary to perturb the point ([A∗ γu],[B∗ −γv]), e.g., by adding some noise, in order to guarantee that it does not belong to the measure-zero set of initial points, mentioned in Thm. 5, from which the optimization converges to strict saddle points of gλ,r.
Algorithm 2 META-ALGORITHM WITH EXPLICIT ESCAPE FROM STATIONARY POINTS
Input: λ>0, ϵconv>0 convergence tolerance, ϵcrit>0 global criterion tolerance, σ>0 noise parameter.
Initialize: sample A′0∈Rn and B′0∈Rm randomly.
For r = 1 to min(n,m):
    (Ar,Br) = OPTIMIZATIONALGORITHM(A′r−1, B′r−1, gλ,r, ϵconv)
    If ‖∇ℓ(ArBr⊤)‖ ≤ λ+ϵcrit: Break
    [u,μ,v] = LARGESTSINGULARPAIR(∇ℓ(ArBr⊤))
    ˆγ = LINESEARCH([Ar 0], [Br 0], u, v)
    Perturb u = u+η with η∼N(0,σIn×n)
    Perturb v = v+η′ with η′∼N(0,σIm×m)
    (A′r, B′r) = ([Ar ˆγu], [Br −ˆγv])
End
Return (Ar,Br)
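To make the overall procedure concrete, here is a compact sketch of the meta-algorithm in the variant without the explicit escape direction: gradient descent as inner solver, Theorem 3's criterion as stopping rule, and a small fresh random column in place of LINESEARCH. All interface names, step sizes and tolerances are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def meta_algorithm(grad_loss, shape, lam, eps_crit=1e-3, eps_conv=1e-8,
                   step=0.05, max_inner=20000, seed=0):
    """Incrementally grow the factorization rank r: solve g_{lambda,r} by
    gradient descent, stop as soon as ||grad ell(A B^T)|| <= lambda + eps_crit
    (Theorem 3), otherwise append a small random column to A and B."""
    rng = np.random.default_rng(seed)
    n, m = shape
    A = 0.01 * rng.standard_normal((n, 1))
    B = 0.01 * rng.standard_normal((m, 1))
    for r in range(1, min(n, m) + 1):
        for _ in range(max_inner):               # inner solver: plain GD on g_{lambda,r}
            G = grad_loss(A @ B.T)
            gA, gB = G @ B + lam * A, G.T @ A + lam * B
            if max(np.abs(gA).max(), np.abs(gB).max()) < eps_conv:
                break
            A, B = A - step * gA, B - step * gB
        if np.linalg.svd(grad_loss(A @ B.T), compute_uv=False)[0] <= lam + eps_crit:
            break                                # global optimality certificate holds
        A = np.hstack([A, 0.01 * rng.standard_normal((n, 1))])   # inflate r -> r + 1
        B = np.hstack([B, 0.01 * rng.standard_normal((m, 1))])
    return A, B
```

For the square loss with Y = diag(3,1) and λ = 2, the returned AB⊤ approaches the soft-thresholded solution diag(1,0), a rank-one matrix, so the loop stops at r = 1.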
We extend here the discussion on the KL inequality (Def. 7) and the corresponding convergence results reviewed in Sec. 3.4. In the special case of the problem of optimizing gλ,r considered in this work, we recalled in Cor. 8 that if gλ,r satisfies the KL inequality, we can expect GD to exhibit polynomial rates of convergence. A natural question is when gλ,r satisfies such an inequality. To provide insight on this issue, we consider the result in [24] showing that the KL inequality is satisfied by semi-algebraic functions. We recall that a set S⊆Rd is said to be semi-algebraic if there exist finitely many polynomials pkh, qkh:Rd→R such that
S = ⋃_{k=1}^K ⋂_{h=1}^H { x∈Rd | pkh(x) = 0, qkh(x) ≤ 0 }.
A function f:Rd→R is said to be semi-algebraic if its graph
graphf={(x,t)|x∈Rd,t∈R,f(x)=t} |
is semi-algebraic.
Note that a variety of error functions ℓ:Rn×m→R typically used in machine learning and matrix factorization problems are semi-algebraic (e.g., the square loss). Interestingly, if ℓ is semi-algebraic, then ℓr:Rn×r×Rm×r→R defined by ℓr(A,B) = ℓ(AB⊤) is semi-algebraic as well. Indeed, by the definition of semi-algebraic function, we have that
graphℓ={(X,t)|X∈Rn×m,t∈R,ℓ(X)=t} |
is semi-algebraic; therefore there exist pkh, qkh:Rn×m×R→R such that
graph ℓ = ⋃_{k=1}^K ⋂_{h=1}^H { (X,t) | pkh(X,t) = 0, qkh(X,t) ≤ 0 }.
Now, denote by Xij the (i,j)-th entry of X. For X = AB⊤ with A∈Rn×r and B∈Rm×r we have Xij = ∑_{s=1}^r AisBjs, namely Xij = mij(A,B), with mij:Rn×r×Rm×r→R a real polynomial. Let us denote by m:Rn×r×Rm×r→Rn×m the matrix-valued function with (i,j)-th entry m(A,B)ij = mij(A,B). Since every pkh and qkh is a polynomial in the entries of X, the compositions pkh∘m and qkh∘m are still polynomials, and therefore the graph of ℓr
graph ℓr = { (A,B,t) | A∈Rn×r, B∈Rm×r, t∈R, ℓ(AB⊤) = t } = ⋃_{k=1}^K ⋂_{h=1}^H { (A,B,t) | pkh(m(A,B),t) = 0, qkh(m(A,B),t) ≤ 0 }
is semi-algebraic, which implies that ℓr is a semi-algebraic function.
Going back to the question of whether gλ,r(A,B) = ℓ(AB⊤) + (λ/2)(‖A‖F² + ‖B‖F²) is semi-algebraic, we can now conclude that it is sufficient to assume the error function ℓ to be semi-algebraic. Indeed, it is well known (see e.g., [24]) that the squared Frobenius norm of a matrix is semi-algebraic and that a finite sum of semi-algebraic functions is still semi-algebraic. Therefore we can reformulate Cor. 8 in terms of the error ℓ only, namely
Corollary 12 (Convergence rate of gradient descent). Let (Ak,Bk)k∈N be a sequence produced by the GD method applied to gλ,r. If ℓ is semi-algebraic, then gλ,r satisfies the KL inequality for some constant α∈[0,1), and there exist a critical point (A∗,B∗) of gλ,r and constants C>0, b∈(0,1) such that
‖(Ak,Bk)−(A∗,B∗)‖F² ≤ C b^k if α∈(0,1/2], and ‖(Ak,Bk)−(A∗,B∗)‖F² ≤ C k^{−(1−α)/(2α−1)} if α∈(1/2,1). | (D.1)
Furthermore, if α=0 convergence is achieved in a finite number of steps.
This last section provides more comparative experiments between the meta-algorithm and the two state-of-the-art solvers, as well as a comparison with the algorithm in [19].
In Sec. 4 we reported on the performance of the three methods considered for λ chosen by validation as the parameter leading to the lowest Normalized Mean Absolute Error (NMAE). For completeness, here we report the same experiments for a range of candidate values of λ.
Figures 3 and 4 report, respectively, the value of the objective function fλ and the test error (NMAE) for the three methods considered in our experiments. Interestingly, we observe a pattern similar to the one obtained for the optimal λ, with our method exhibiting performance comparable, in terms of both time and test error, to the state-of-the-art competitors for most of the values of λ considered.
[1] | J. D. M. Rennie, N. Srebro, Fast maximum margin matrix factorization for collaborative prediction, In: ICML '05: Proceedings of the 22nd international conference on machine learning, 2005,713–719. http://doi.org/10.1145/1102351.1102441 |
[2] | Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer, 42 (2009), 30–37. http://doi.org/10.1109/MC.2009.263 |
[3] | A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, Mach. Learn., 73 (2008), 243–272. http://doi.org/10.1007/s10994-007-5040-8 |
[4] | Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, J. Malick, Large-scale image classification with trace-norm regularization, In: 2012 IEEE conference on computer vision and pattern recognition, 2012, 3386–3393. http://doi.org/10.1109/CVPR.2012.6248078 |
[5] | Y. Amit, M. Fink, N. Srebro, S. Ullman, Uncovering shared structures in multiclass classification, In: ICML '07: Proceedings of the 24th international conference on machine learning, 2007, 17–24. http://doi.org/10.1145/1273496.1273499 |
[6] | F. R. Bach, Consistency of trace norm minimization, The Journal of Machine Learning Research, 9 (2008), 1019–1048. |
[7] | N. Srebro, J. D. M. Rennie, T. S. Jaakkola, Maximum-margin matrix factorization, In: NIPS'04: Proceedings of the 17th international conference on neural information processing systems, 2004, 1329–1336. |
[8] | H. H. Bauschke, P. L. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, New York, NY: Springer, 2011. http://doi.org/10.1007/978-1-4419-9467-7 |
[9] | T. Hastie, R. Mazumder, J. D. Lee, R. Zadeh, Matrix completion and low-rank SVD via fast alternating least squares, The Journal of Machine Learning Research, 16 (2015), 3367–3402. |
[10] | R. Ge, J. D. Lee, T. Ma, Matrix completion has no spurious local minimum, In: Advances in neural information processing systems 29, 2016, 2973–2981. |
[11] | S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optimality of local search for low rank matrix recovery, In: NIPS'16: Proceedings of the 30th international conference on neural information processing systems, 2016, 3880–3888. |
[12] | C.-J. Hsieh, P. Olsen, Nuclear norm minimization via active subspace selection, In: ICML'14: Proceedings of the 31st international conference on international conference on machine learning, 2014,575–583. |
[13] | J. Abernethy, F. Bach, T. Evgeniou, J.-P. Vert, A new approach to collaborative filtering: operator estimation with spectral regularization, The Journal of Machine Learning Research, 10 (2009), 803–826. |
[14] | A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), 183–202. http://doi.org/10.1137/080716542 |
[15] | M. Dudik, Z. Harchaoui, J. Malick, Lifted coordinate descent for learning with trace-norm regularization, In: Proceedings of the fifteenth international conference on artificial intelligence and statistics, 2012,327–336. |
[16] | G. J. O. Jameson, Summing and nuclear norms in Banach space theory, Cambridge University Press, 1987. http://doi.org/10.1017/CBO9780511569166 |
[17] | D. P. Bertsekas, Nonlinear programming, Athena scientific Belmont, 1999. |
[18] | J. D. Lee, M. Simchowitz, M. I. Jordan, B. Recht, Gradient descent only converges to minimizers, In: 29th Annual conference on learning theory, 2016, 1246–1257. |
[19] | B. D. Haeffele, R. Vidal, Global optimality in tensor factorization, deep learning, and beyond, 2015, arXiv: 1506.07540. |
[20] | D. P. Woodruff, Sketching as a tool for numerical linear algebra, Now Foundations and Trends, 2014. http://doi.org/10.1561/0400000060 |
[21] | K. G. Murty, S. N. Kabadi, Some NP-complete problems in quadratic and nonlinear programming, Mathematical Programming, 39 (1987), 117–129. http://doi.org/10.1007/BF02592948 |
[22] | S. Boyd, L. Vandenberghe, Convex optimization, Cambridge University Press, 2004. |
[23] | H. Attouch, J. Bolte, P. Redont, A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality, Math. Oper. Res., 35 (2010), 438–457. http://doi.org/10.1287/moor.1100.0449 |
[24] | J. Bolte, A. Daniilidis, O. Ley, L. Mazet, Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity, Trans. Amer. Math. Soc., 362 (2010), 3319–3363. http://doi.org/10.1090/S0002-9947-09-05048-X |
[25] | H. Attouch, J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming, 116 (2009), 5–16. http://doi.org/10.1007/s10107-007-0133-5 |
[26] | F. M. Harper, J. A. Konstan, The movielens datasets: history and context, ACM Trans. Interact. Intel. Syst., 5 (2016), 1–19. http://doi.org/10.1145/2827872 |
[27] | A. S. Lewis, The convex analysis of unitarily invariant matrix functions, J. Convex Anal., 2 (1995), 173–183. |