1.
Introduction
A mainstay of scientific computing is solving large linear systems of the form
where A\in\mathbb{R}^{m\times n}, b\in\mathbb{R}^{m} and x\in\mathbb{R}^{n}.
In what follows, we assume that the linear system is consistent and consider only the least-norm solution x_*. For linear systems of extremely large size, iterative methods are often the only option. These methods date back to the early 19th century and were initiated by the work of Gauss; an explosion of activity has been sparked ever since by demands from engineering and the sciences. Among the many candidates, the Kaczmarz algorithm, proposed in 1937 by Stefan Kaczmarz, is one that remains very appealing today [22]. Built on BLAS level-1 subroutines, it has successfully found its way into a variety of applications in image reconstruction [16], computed tomography [20] and signal processing [9], to mention a few. In its simplest form, the classical Kaczmarz algorithm is given as
where A^{(i)} denotes the ith row of the matrix A, b^{(i)} represents the ith entry of the vector b and (A^{(i)})^T stands for the transpose of A^{(i)}. At each iteration, the current iterate x_k is projected onto the hyperplane A^{(i_k)}x = b^{(i_k)}, where the working row in (1.2) is selected cyclically via i_k = (k \bmod m) + 1.
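To make the projection step concrete, the following MATLAB sketch carries out the cyclic sweep just described; the function name, the fixed iteration count maxit and the absence of a stopping test are our own illustrative choices rather than part of the original algorithm statement.

% A minimal sketch of the classical (cyclic) Kaczmarz iteration, assuming that
% A has no zero rows; maxit and the initial guess x are supplied by the user.
function x = kaczmarz_cyclic(A, b, x, maxit)
    m = size(A, 1);
    for k = 0:maxit-1
        ik = mod(k, m) + 1;                          % cyclic row selection
        Ai = A(ik, :);                               % working row A^(ik)
        x  = x + ((b(ik) - Ai*x) / (Ai*Ai')) * Ai';  % project onto A^(ik) x = b^(ik)
    end
end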
The philosophy behind the classical Kaczmarz algorithm is to apply sequences of orthogonal projections onto hyperplanes. Its convergence is guaranteed; however, the convergence rate is not easy to analyze. For more than a decade, the classical Kaczmarz algorithm fell into oblivion. Fortunately, it resurfaced sporadically in the works [8,15,30]. In 1970, it was rediscovered in the field of electron microscopy and X-ray photography, where it is termed the algebraic reconstruction technique [16]. A breakthrough in the research on Kaczmarz-type algorithms is the randomized Kaczmarz algorithm, which adopts a non-uniform selection rule that picks the working row with probability proportional to the squared norm of the rows [29]. Agaskar et al. [1] give a complete characterization of the randomized Kaczmarz algorithm, including an exact formula for the mean squared error and the annealed error exponent that portrays its decay rate. Several approaches for determining working rows have been proposed since then; for instance, Bai and Wu propose a greedy randomized Kaczmarz (GRK) method which selects the working row by applying a residual rule [5,6]. Gower et al. analyze adaptive sampling rules in the sketch-and-project setting, with an emphasis on the distance rule [17]. Griebel and Oswald investigate the maximal residual rule, which greedily chooses the subspace with the largest residual norm for the next update [19]. Eldar and Needell consider a Johnson-Lindenstrauss-type projection when implementing the maximal distance rule for solving linear systems [13]. More about selection rules for working rows and convergence analysis can be found in [3,4,7,31,34] and the references therein.
While the selection rules for working rows are of great importance in the implementation of Kaczmarz-type algorithms, the idea of exploiting blocks also matters. To put it simply, the block Kaczmarz method is an iterative scheme for solving linear systems that, at each iteration step, projects the current iterate onto the solution space of a subset of the constraints [27]. In [14], the coefficient matrix is partitioned into blocks of rows whose generalized inverses are then applied in Jacobi or successive overrelaxation iterative schemes. Needell and Tropp [27] put forward a block Kaczmarz algorithm that uses a randomized control scheme to choose the subset per iteration. It paves the way for the research on "row paving" and is the first block Kaczmarz method with an expected linear rate of convergence that is expressed in terms of the geometric properties of the matrix and its submatrices. By using the block Meany inequality, Bai and Liu [2] improve some existing results and develop new convergence theorems for row-action iterative schemes such as the block Kaczmarz and Householder-Bauer methods for solving large linear systems and least-squares problems. Necoara presents a family of randomized block Kaczmarz algorithms that involve a subset of the constraints and extrapolated step sizes [26]. Niu and Zheng investigate a block Kaczmarz algorithm that incorporates a greedy row-selection rule [28]. To capture large block entries of the residual vector, Liu and Gu propose a greedy randomized block Kaczmarz algorithm for solving consistent linear systems [24]. Inspired by the work of [5], Miao and Wu present a greedy randomized average block Kaczmarz method for solving linear systems [25]. Motivated by the Gaussian Kaczmarz method [18] and the greedy rule for selecting working rows in [5,6], Chen and Huang consider a fast deterministic block Kaczmarz method [10]. Based on the Kaczmarz-Motzkin method, Zhang and Li develop several efficient block Kaczmarz-Motzkin variants by exploiting different sampling strategies [36,37]. Recently, block Kaczmarz-type algorithms have been generalized to least squares problems and linear systems with multiple right-hand sides; for instance, Du et al. [12] present a randomized extended average block Kaczmarz algorithm that is suitable for solving least squares problems, and Wu and Chang propose semi-randomized block Kaczmarz methods with simple random sampling for linear systems with multiple right-hand sides [32].
In this work, we propose an efficient variant of the greedy block Kaczmarz (VGBK) algorithm for solving consistent linear systems of the form (1.1). It proceeds with a partition of the rows, which can dramatically reduce the computational complexity per iteration. Numerical experiments show that the new algorithm is competitive with some existing block Kaczmarz-type methods. Throughout this work, we denote by \|A\|_F the Frobenius norm of a matrix and by \|\cdot\|_2 the 2-norm of a matrix or vector if there is no ambiguity. The notation A^{\dagger} denotes the Moore-Penrose pseudoinverse. The symbol (\cdot)^T stands for the transpose of a matrix or vector. The smallest nonzero singular value and the largest singular value of a matrix are represented by \sigma_{\min}(\cdot) and \sigma_{\max}(\cdot), respectively. Unless otherwise stated, matrices are denoted by capital letters, while vectors are usually represented by lowercase ones; capital letters with superscripts are occasionally used to indicate a vector, e.g., A^{(i_k)} for the i_kth row of a matrix A. Scalars are usually signified by Greek letters, except for indices such as m, n, i, j, k, s and p. The set \{1,\ldots,m\} is abbreviated as [m]. An empty set is denoted by \varnothing. An index set is usually indicated by a calligraphic capital letter such as \mathcal{I}, whose cardinality reads as |\mathcal{I}|.
This work is organized as follows. In Section 2, we review three related GBK-type algorithms. In Section 3, we first clarify the motivation for this paper and then develop a new GBK algorithm, together with an analysis of its convergence rate and complexity. In Section 4, we present some numerical experiments to justify the effectiveness of the proposed algorithm. In Section 5, some conclusions and future work are given.
2.
The greedy block Kaczmarz algorithm and its efficient variants
In this section, we recapitulate the GBK algorithm [28], the fast deterministic block Kaczmarz (FDBK) algorithm [10] and the fast greedy block Kaczmarz (FGBK) algorithm [33]. These three algorithms facilitate the delineation of our new algorithm in Section 3.
2.1. The greedy block Kaczmarz algorithm
In the classical Kaczmarz algorithm, the current iterate x_k is projected onto a single hyperplane. In its block counterpart, however, the current iterate x_k is projected onto the intersection of several hyperplanes, i.e.,
where \mathcal{I}_k^{G} is the set of selected row indices [14,27]. In general, block Kaczmarz algorithms take advantage of BLAS level-2 operations and benefit from exploiting multiple rows simultaneously.
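As a rough illustration, a single block step may be written in MATLAB as follows; the index set idx plays the role of \mathcal{I}_k^{G} and is assumed to be supplied by some selection rule, and lsqminnorm is used merely as one convenient way of applying the pseudoinverse.

% Sketch of a single block Kaczmarz step: project the current iterate x onto the
% solution set of the selected block of equations A(idx,:)*x = b(idx).
function x = block_kaczmarz_step(A, b, x, idx)
    idx  = idx(:);                               % row indices of the selected block
    Ablk = A(idx, :);
    x = x + lsqminnorm(Ablk, b(idx) - Ablk*x);   % min-norm solve, i.e., pinv(Ablk)*(b(idx) - Ablk*x)
end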
In [28], the block technique is integrated with the idea of maximal distance, which yields the GBK algorithm. In that setting, the index set \mathcal{I}_k^{G} collects the rows for which the squared distance from the current iterate to the hyperplane A^{(i_k)}x = b^{(i_k)} exceeds \varepsilon_k, a parameter varying from zero to the maximal squared distance. In what follows, we do not elaborate on the algorithmic implementation, but refer to Algorithm 1 for details.
The convergence result of the GBK algorithm is detailed in the following theorem.
Theorem 1. For the consistent linear system (1.1), the iterative sequence \{x_k\}_{k=0}^{\infty}, as generated by Algorithm 1, converges to the unique least-norm solution x_* = A^{\dagger}b. For k\geq 0, the norm of the error satisfies
where \alpha\in(0,1], \beta_k = \dfrac{\alpha\|A\|_F^{2}\|A_{\mathcal{I}_k^{G}}\|_F^{2}}{\sigma_{\max}^{2}(A_{\mathcal{I}_k^{G}})\left(\|A\|_F^{2}-\|A_{\mathcal{I}_{k-1}^{G}}\|_F^{2}\right)} with \beta_0 = \dfrac{\alpha\|A_{\mathcal{I}_0^{G}}\|_F^{2}}{\sigma_{\max}^{2}(A_{\mathcal{I}_0^{G}})}, and \sigma_{\min}(\cdot) and \sigma_{\max}(\cdot) denote the smallest nonzero and largest singular values of a matrix, respectively.
2.2. A fast deterministic block Kaczmarz algorithm
Another way to implement the block Kaczmarz algorithm is to make use of the Gaussian Kaczmarz method [18,p. 1677] defined by
where g\in\mathbb{R}^{m} is a random Gaussian vector with zero mean and identity covariance matrix.
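For reference, one step of the Gaussian Kaczmarz method may be sketched in MATLAB as below; the function and variable names are ours, and the update is written in the sketch-and-project form x_{k+1} = x_k + \dfrac{g^{T}(b-Ax_k)}{\|A^{T}g\|_2^{2}}A^{T}g.

% Sketch of one Gaussian Kaczmarz step: a fresh Gaussian vector g aggregates all
% rows of A and all entries of b into a single combined equation.
function x = gaussian_kaczmarz_step(A, b, x)
    g = randn(size(A, 1), 1);                    % g ~ N(0, I)
    v = A' * g;                                  % aggregated direction A^T g
    x = x + ((g' * (b - A*x)) / (v' * v)) * v;
end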
Motivated by the Gaussian Kaczmarz method and the greedy rule for selecting working rows in [5,6], Chen and Huang propose the FDBK algorithm [10]. When updating the iterate, recall from Algorithm 1 that one needs to compute the Moore-Penrose pseudoinverse of submatrices of A. In FDBK, however, one only needs to compute a linear combination of the columns of A^T. The algorithmic implementation of FDBK is presented in Algorithm 2; see [10, Algorithm 1] for more detail.
The convergence result of the FDBK algorithm is wrapped up in Theorem 2.
Theorem 2. Suppose that the linear system (1.1) is consistent and that no row of A has all entries equal to zero. For any initial guess x_0 in the column space of A^T, the iterative sequence \{x_k\}_{k=0}^{\infty}, as generated by Algorithm 2, converges to the unique least-norm solution x_* = A^{\dagger}b, with the error satisfying
where
with
and
2.3. The fast greedy block Kaczmarz algorithm
As explained in the previous two subsections, the GBK algorithm selects hyperplanes that lie at a relatively large distance from the current iterate, while the FDBK algorithm follows the Gaussian Kaczmarz practice of making use of all rows of A and b simultaneously.
In order to speed up the convergence of the GBK and FDBK algorithms, Xiao et al. [33] propose an FGBK algorithm that exploits the advantages of the GBK and FDBK algorithms. In particular, a parameter p is introduced to allow for more flexibility in implementing the algorithm. More algorithmic details can be found in Algorithm 3.
The convergence result of the FGBK algorithm is characterized by the following theorem [33].
Theorem 3. Suppose that the linear system (1.1) is consistent. For any initial guess x_0 in the column space of A^T, the iterative sequence \{x_k\}_{k=0}^{\infty}, as generated by Algorithm 3, converges to the unique minimum norm solution x_* = A^{\dagger}b, with the error satisfying
where \beta_k(\alpha,p) = \dfrac{\alpha^{2/p}}{\sum_{i\in[m]\setminus\mathcal{I}_{k-1}^{FG}}\|A^{(i)}\|_2^{p}}\cdot\dfrac{\sum_{i\in\mathcal{I}_{k}^{FG}}\|A^{(i)}\|_2^{p}}{\sigma_{\max}^{2}(A_{\mathcal{I}_{k}^{FG}})}, \alpha\in(0,1] and p\in[1,+\infty).
Remark 1. We are in a position to give some remarks on the origins of the above-mentioned block Kaczmarz-type algorithms. In [5], Bai and Wu propose a novel GRK algorithm that outperforms the prototypical randomized Kaczmarz algorithm in many scenarios. The GRK algorithm exploits a greedy selection strategy that assigns higher probabilities to rows corresponding to large entries of the residual vector. As shown in Algorithms 1–3, the GBK, FDBK and FGBK algorithms fall into the category of block Kaczmarz-type algorithms. In fact, they can be regarded as extensions of the GRK algorithm with some effective modifications. For instance, the GBK algorithm combines the maximum-distance rule with a modified index set originating from the GRK algorithm. Following the line of the GRK algorithm, the FDBK algorithm employs the same index set as GRK and incorporates the idea of the Gaussian Kaczmarz iteration.
3.
An efficient variant of the greedy block Kaczmarz algorithm
In this section, we first clarify the rationale for developing the new algorithm, and then give an in-depth analysis of its algorithmic implementation as well as the theoretical results. Recall from Section 2 that the two existing algorithms GBK and FDBK offer two different ways to incorporate block techniques: the GBK algorithm takes advantage of rows selected from an index set, while the FDBK algorithm avoids the computation of the Moore-Penrose pseudoinverse and updates the current iterate in a manner analogous to that of the Gaussian Kaczmarz method. In actual computations, however, both algorithms require one to find the maximum squared distance to m hyperplanes; see Step 2 in Algorithms 1 and 2. This may pose a challenge to the numerical performance when the size of A is huge. We now change tack: intuitively, it is desirable to seek the maximum distance over merely a small fraction of the hyperplanes, since the cost of computing the maximum distance is then reduced. It is therefore interesting to investigate the case in which these desirable attributes are merged. Motivated by this intuition, we construct an efficient variant of the greedy block Kaczmarz (VGBK) algorithm for solving the large linear system (1.1). A complete coverage of its algorithmic implementation is displayed in Algorithm 4.
Now, let us present the details of the VGBK algorithm. To reduce the computational complexity in computing the maximum distance, a user-prescribed row partition of the set [m] is imposed at the outset, namely,
where \tau_j\subset[m] is the subset comprising the row indices of the jth block for j = 1,\ldots,s. As for the choice of the partition before the iteration (Step 1 in Algorithm 4), several candidates are available; see, for instance, [21,27,35]. In the numerical experiments, we adopt a simple approach by partitioning [m] with \tau_j = [j:s:m] for j = 1,\ldots,s, where the MATLAB colon operator ":" is used to create a vector with evenly spaced increments. It should be stressed that this practice is not necessarily optimal, but it turns out to be quite successful as compared with other block Kaczmarz algorithms; see the numerical examples in Section 4. The following toy example gives the reader a sense of the appearance of the partitioning. Assume that m = 11 and s = 3. It follows from the above discussion that
The corresponding partitions of A and b can thus be formed; for example, if A\in\mathbb{R}^{11\times 5}, then A_{\tau_3}\in\mathbb{R}^{3\times 5}, assembled from the rows A(3,:), A(6,:) and A(9,:), is the third submatrix (block) of the partition.
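The partition above is straightforward to generate; a possible MATLAB fragment, with m and s as in the toy example, reads as follows.

% Illustrative construction of the row partition tau_1, ..., tau_s via the MATLAB
% colon operator; with m = 11 and s = 3 it reproduces the toy example, e.g., tau{3} = [3 6 9].
m = 11;  s = 3;
tau = cell(s, 1);
for j = 1:s
    tau{j} = j:s:m;                              % indices j, j+s, j+2s, ...
end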
After the partition, we have at hand s submatrices A_{\tau_1},\ldots,A_{\tau_s} that together make up the coefficient matrix A. Assume that there are no rows of A with all entries zero; thus \|A_{\tau_j}\|_2\neq 0 for j = 1,\ldots,s. We select cyclically the index of the working submatrix A_{\tau_{j_k}} and vector b_{\tau_{j_k}} with j_k = (k \bmod s) + 1. It turns out that the maximum squared distance
is now computed over the subset \tau_{j_k} instead of [m]. Then, a set \mathcal{I}_k is used to collect the indices i_k of the rows when
for \alpha\in(0,1]. It is easy to verify that \mathcal{I}_k is a subset of \tau_{j_k} and is non-empty. Inspired by the Gaussian Kaczmarz method, we update the iterate x_{k+1} by
where c_k = \sum_{i_k\in\mathcal{I}_k}\left(b_{\tau_{j_k}}^{(i_k)}-A_{\tau_{j_k}}^{(i_k)}x_k\right)e_{i_k}, with e_{i_k} being the i_kth canonical basis vector in \mathbb{R}^{|\tau_{j_k}|}.
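Putting these pieces together, the kth VGBK iteration can be sketched in MATLAB as follows. Since the thresholding rule and the update formula are only summarized in words above, the fragment encodes our reading of Algorithm 4 (cyclic block choice, keeping the rows whose squared distance is at least \alpha times the maximum within the block, and a Gaussian-Kaczmarz-style update driven by c_k); it is a sketch under these assumptions rather than a verbatim transcription of Algorithm 4.

% One VGBK iteration (k counted from 0), assuming the partition tau{1},...,tau{s}
% has been built beforehand (tau is a cell array of row-index vectors), alpha is in
% (0,1] and A has no zero rows.
function x = vgbk_step(A, b, x, tau, k, alpha)
    s    = numel(tau);
    jk   = mod(k, s) + 1;                    % cyclic choice of the working block
    rows = tau{jk}(:);                       % row indices of the working block
    Ablk = A(rows, :);  bblk = b(rows);
    r    = bblk - Ablk*x;                    % block residual
    d    = (r.^2) ./ sum(Ablk.^2, 2);        % squared distances to the block's hyperplanes
    sel  = d >= alpha * max(d);              % rows collected in the index set I_k
    ck   = zeros(numel(rows), 1);
    ck(sel) = r(sel);                        % c_k keeps the selected residual entries
    v    = Ablk' * ck;                       % direction A_{tau_jk}^T * c_k
    x    = x + ((ck' * r) / (v' * v)) * v;   % Gaussian-Kaczmarz-style update
end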
Remark 2. We are ready to give some remarks. As introduced in Section 2, the novelty of the related FGBK algorithm lies in the choice of the exponent p, a parameter encountered in \varepsilon_k and the index set. At each iteration, however, the FGBK algorithm requires one to compute the pth power of the maximum distance to m hyperplanes, which may be increasingly time-consuming as the size of A increases; see Algorithm 3 for more details. Such difficulty is circumvented in our VGBK algorithm since one only needs to figure out the maximum distance over a much smaller subset \tau_{j_k} per iteration. Numerical experiments in Section 4 also corroborate that the VGBK algorithm improves upon the FGBK method in terms of CPU time.
The theorem below establishes the convergence result of the VGBK algorithm.
Theorem 4. For the consistent linear system (1.1), assume that there are no zero rows in A. Then, the iterative sequence \{x_k\}_{k = 0}^{\infty}, as generated by Algorithm 4, converges to the unique least-norm solution x_{*} = A^{\dagger}b for any initial guess x_0 that is in the column space of A^T. Moreover, the norm of the error satisfies
where \tau_{j_k} is the partition block chosen at the kth iteration, \mathcal{I}_k is the index set defined in Algorithm 4, \alpha\in (0, 1] and \omega_k\in (0, 1] for k = 0, 1, \ldots.
Proof. The proof is partly inspired by [10,Theorem 3.1]. It follows from Steps 6 and 7 of Algorithm 4 that
where c_{k} = \sum\limits_{i_k\in \mathcal{I}_{k}}\left(b^{(i_k)}-A^{(i_k)}x_{k}\right)e_{i_k}\in\mathbb{R}^{|\tau_{j_{k}}|}. Subtracting x_* from both sides of (3.2) yields
where P_{k} = \dfrac{A_{\tau_{j_{k}}}^{T}c_{k}c _{k}^{T}A_{\tau_{j_{k}}}}{\|A_{\tau_{j_{k}}}^{T}c_{k}\|_{2}^{2}}\in\mathbb{R}^{n\times n} is an orthogonal projection matrix, namely, P_k^T = P_k and P_k^2 = P_k . From (3.2) and (3.3), we have
which implies that x_{k+1}-x_{*} is orthogonal to x_{k+1}-x_{k} .
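For completeness, the projection property of P_k claimed above follows from a one-line computation, using c_{k}^{T}A_{\tau_{j_{k}}}A_{\tau_{j_{k}}}^{T}c_{k} = \|A_{\tau_{j_{k}}}^{T}c_{k}\|_{2}^{2}:

P_k^{2} = \dfrac{A_{\tau_{j_{k}}}^{T}c_{k}\left(c_{k}^{T}A_{\tau_{j_{k}}}A_{\tau_{j_{k}}}^{T}c_{k}\right)c_{k}^{T}A_{\tau_{j_{k}}}}{\|A_{\tau_{j_{k}}}^{T}c_{k}\|_{2}^{4}} = \dfrac{A_{\tau_{j_{k}}}^{T}c_{k}c_{k}^{T}A_{\tau_{j_{k}}}}{\|A_{\tau_{j_{k}}}^{T}c_{k}\|_{2}^{2}} = P_k, \qquad P_k^{T} = P_k.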
Computing the squared 2-norm of x_{k+1}-x_{*} in (3.3), together with the use of the Pythagorean theorem, renders
Let |\mathcal{I}_{k}| be the number of elements in the index set \mathcal{I}_{k} and E_{k}\in \mathbb{R}^{|\tau_{j_{k}}|\times |\mathcal{I}_{k}|} be a matrix consisting of the canonical basis vectors e_{i_k}\in\mathbb{R}^{|\tau_{j_{k}}|} for i_k\in\mathcal{I}_{k} . Therefore, we get
and
In a nutshell, \hat{c}_k and A_{\mathcal{I}_{k}} are condensed counterparts of c_k and A_{\tau_{j_{k}}}, obtained by keeping only the rows indexed by the set \mathcal{I}_{k}. A straightforward computation using (3.5) and (3.6) yields
where \sigma_{\max}^{2}(A_{\mathcal{I}_{k}}) is the square of the largest singular value of A_{\mathcal{I}_{k}}. Besides, it follows from the definition of c_{k} and (3.7) that
Combining (3.7)–(3.9), we have
Moreover, \|b_{\tau_{j_{k}}}-A_{\tau_{j_{k}}}x_{k}\|_{2}^{2} can be recast as
or, equivalently,
Denote u_k = x_k-x_* and u_k = u_{k1}+u_{k2} , where u_{k1} is in the column space of A_{\tau_{j_{k}}}^T and u_{k2} is in the subspace orthogonal to the column space of A_{\tau_{j_{k}}}^T , i.e., A_{\tau_{j_{k}}} u_{k2} = 0 . Let \|u_{k1}\|_2^2 = \omega_k\|u_{k}\|_2^2 with \omega_k\in(0, 1] . Therefore,
Multiplying both sides of (3.11) by \alpha, we obtain
where \sigma_{\min}^{2}(A_{\tau_{j_{k}}}) is the square of the smallest nonzero singular value of A_{\tau_{j_{k}}}.
Substituting the inequality (3.12) into (3.10) gives
Finally, it follows from (3.4) and (3.13) that
which completes the proof. □
Remark 3. Some remarks are in order. Since \|A_{\mathcal{I}_k}\|_F^2\geq\sigma_{\max}^2(A_{\mathcal{I}_k}) , it follows that
where \alpha\in(0, 1] and \omega_k\in(0, 1] . Therefore, the VGBK algorithm converges to the least-norm solution under the assumption of Theorem 4. A special case of the VGBK algorithm exists when the number of blocks s = 1 , i.e., A_{\tau_{j_{k}}} = A . If so, then \omega_k = 1 ; thus (3.14) is simplified to
More generally, if \sigma_{\min}^{2}(A_{\tau_{j_{k}}})/\|A_{\tau_{j_{k}}}\|_{F}^2\geq\sigma_{\min}^{2}(A)/\|A\|_{F}^2 holds in (3.14), then the inequality (3.14) reduces to
which resembles those error bounds given in Theorems 1–3. It should be noted that the condition \sigma_{\min}^{2}(A_{\tau_{j_{k}}})/\|A_{\tau_{j_{k}}}\|_{F}^2\geq\sigma_{\min}^{2}(A)/\|A\|_{F}^2 is not too stringent; for instance, if A is a matrix of normally distributed random numbers, then \|A_{\tau_{j_{k}}}\|_{F}^2/\|A\|_{F}^2\approx s^{-1} for sufficiently large m , where s is the number of blocks and m is the number of rows in A . Consequently, the condition \sigma_{\min}^{2}(A_{\tau_{j_{k}}})/\|A_{\tau_{j_{k}}}\|_{F}^2\geq\sigma_{\min}^{2}(A)/\|A\|_{F}^2 holds as long as \sigma_{\min}^{2}(A_{\tau_{j_{k}}})/\sigma_{\min}^{2}(A)\gtrsim s^{-1} . In the numerical experiments, s is always chosen as s = \lfloor 0.8\%m\rfloor or s = \lfloor 4\%m\rfloor , which implies that s^{-1} approaches 0 if m is large enough. In the implementation of the VGBK algorithm, we only work on a small fraction of A and b , instead of exploiting the full information of A and b as done in the GBK, FDBK and FGBK methods. For this reason, the VGBK algorithm generally requires more iteration steps than these three algorithms to reach the required accuracy. Nevertheless, this does not necessarily imply that the VGBK algorithm is numerically inferior to the three counterparts. On the contrary, as we will explain soon, the VGBK algorithm consumes far less computational cost than the other three algorithms per iteration step, which can readily reduce the total computing time. Numerical examples in the next section also validate that the VGBK algorithm enjoys noticeable gains in CPU time as compared to the other three algorithms.
Before ending this section, we investigate the computational overhead of VGBK at each iteration step. The complexities of the block Kaczmarz algorithms outlined in Section 2, namely the GBK, FDBK and FGBK algorithms, are also analyzed for comparison. For each method, the total computational overhead per iteration consists of the cost of computing the pseudoinverse of a matrix, the cost of finding the maximum of a vector, the cost of updating the iterate, and other costs. To keep the exposition simple, we assume that the coefficient matrix A in (1.1) is dense and that p = 2 in the FGBK algorithm. The results are tabulated in Table 1.
The second column of Table 1 indicates whether the implementation of the corresponding method involves computing a pseudoinverse. We see that the FDBK, FGBK and VGBK algorithms are pseudoinverse-free, while the GBK method involves computing the pseudoinverse of a submatrix of A. The computation involving the pseudoinverse of a matrix is expensive and often constitutes the bulk of the total computational cost; a standard workaround is to use iterative methods such as CGLS [27]. In Table 1, we do not specify the method for applying the pseudoinverse in GBK, as we only qualitatively point out that the GBK algorithm is not pseudoinverse-free; a detailed analysis can be made available once the method is prescribed, e.g., CGLS [28, p. 5]. The third column shows the size of the vector over which a maximum has to be found. Take the FDBK algorithm for example: we need to compute \max_{1\leq i\leq m}\{{|b^{(i)}-A^{(i)}x_{k}|^{2}}/{\|A^{(i)}\|_{2}^{2}}\} at Step 2 of Algorithm 2, which is equivalent to finding the maximum of an m-dimensional vector. The fourth column displays the FLOPS needed when updating the iterate x_k. The last column collects the FLOPS not counted in the previous three columns; for instance, in the case of FDBK, it includes the cost of computing the parameter \varepsilon_k, determining the index set \mathcal{I}_k^{FD} and forming the vector c_k.
In the VGBK algorithm, recall that \mathcal{I}_k \subset \tau_{j_{k}} and \tau_{j_{k}}\subset [m]. It follows that |\mathcal{I}_k| < |\tau_{j_{k}}|\ll m. Furthermore, it generally holds that |\mathcal{I}_k|\ll\min\{|\mathcal{I}^{G}_k|, |\mathcal{I}^{FD}_k|, |\mathcal{I}^{FG}_k|\}, since \mathcal{I}_k is a subset of \tau_{j_{k}}, which itself is a small subset of [m], while \mathcal{I}^{G}_k, \mathcal{I}^{FD}_k and \mathcal{I}^{FG}_k are subsets of [m]. Therefore, the total cost per iteration of the VGBK algorithm can be much less than that of its three counterparts. To see this intuitively, consider the number of blocks s = \lfloor 0.8\%m\rfloor, which is a practical choice for the overdetermined linear systems in Section 4. Each block \tau_{j_{k}} then has about m/s\approx 125 rows on average. In light of this, the computational cost per iteration of the VGBK algorithm roughly amounts to 1/s of those of the GBK, FDBK and FGBK methods, which greatly reduces the computing time and offsets the extra iterations caused by the initial partitioning. This advantage becomes more pronounced when m is large, since 1/s = 1/\lfloor 0.8\%m\rfloor goes to 0 as m increases; see Section 4 for more details.
4.
Numerical experiments
In this section, some numerical experiments are presented to verify the effectiveness of the VGBK algorithm in comparison with three block Kaczmarz counterparts, i.e., GBK [28], FDBK [10] and FGBK [33]. All experiments were run on a laptop with an AMD R5-5600H CPU @ 3.30 GHz and 16 GB RAM, running MATLAB R2022a on a Windows 11 system. The numerical performance of the four algorithms is assessed by means of the CPU time in seconds (CPU) and the number of iteration steps (IT). To demonstrate the effectiveness, we report the speed-up values of VGBK against GBK, FDBK and FGBK, which are respectively defined by
As stated in Section 1, we are only concerned with the solution of the consistent linear system (1.1) in this work. To this end, the right-hand side vector b is set to Ax_* in all examples, where x_{*} is generated by the MATLAB built-in function \texttt{randn}. The coefficient matrices A are either generated by \texttt{randn} or taken from real-world applications [11,23]; see Table 2 for detailed properties of these matrices. In this way, we are guaranteed to work with a consistent linear system (1.1). The initial guess is x_0 = 0. All computations are terminated once the relative solution error (RSE) at the current iterate satisfies \text{RSE} < 10^{-6} or the number of iteration steps exceeds the maximum of 200000, where
In Algorithm 4, recall that the implementation of VGBK involves two parameters, i.e., the number of blocks s and the tuning parameter \alpha . Numerical experiments show that the VGBK algorithm often offers sound performance with appropriate choices of s and \alpha . The experimental parameters in the GBK, FDBK, FGBK and VGBK algorithms are set as follows:
I. GBK [28]: The parameter \alpha in Algorithm 1 is chosen to be
which is recommended in the original work on the GBK [28].
II. FDBK [10]: Algorithm 2 is parameter-free.
III. FGBK [33]: In [33], FGBK( p ) with p = 2 requires the least CPU time for most tests. Therefore, we set p = 2 . Besides, the parameter \alpha is set to be 0.05 or 0.1 , which are preferred in [33].
IV. VGBK: The tuning parameter \alpha and the number of blocks s are required in Algorithm 4. We set s = \lfloor 0.8\%m\rfloor (overdetermined linear system), s = \lfloor 4\%m\rfloor (underdetermined linear system) and \alpha = 0.1 , the reason for which is given in Examples 1–2. Here, \lfloor \cdot \rfloor denotes the floor function.
The purpose of this section is two-fold. In Section 4.1, we look into choices of the two parameters s and \alpha for various types of test matrices. In Section 4.2, the numerical performance of the VGBK algorithm is appraised by comparing it with that of the GBK, FDBK and FGBK algorithms for abundant experiments.
4.1. Choices of s and \alpha
A theoretical identification of the optimal values of s and \alpha in the VGBK method, if such values exist, is not available for now, and we shall not pursue it here. Instead, we attempt to determine practical choices of s and \alpha experimentally in this subsection. In Example 1, we empirically analyze the influence of s on VGBK. In Example 2, we tackle the problem of choosing \alpha.
Example 1. This example serves to show how the number of blocks s impacts the numerical performance of the VGBK algorithm. For the moment, \alpha is selected to be 0.1 ; numerical experiments in Example 2 indicate that setting \alpha = 0.1 turns out to be practical. We consider linear system (1.1) with thin ( m\geq n ) or fat ( m\leq n ) coefficient matrices A which are either generated by the MATLAB function \texttt{randn} or adapted from the SuiteSparse Matrix Collection [23]. Numerical results with varying values of s are illustrated in Figures 1–3.
We are now in a position to give some remarks. For either thin or fat coefficient matrices, as demonstrated in Figures 1–3, the CPU time (though with wiggles) exhibits a roughly V-shaped curve, while the number of IT generally increases as s increases. This can be explained as follows. As for the CPU time, if s is small, or, put another way, if each submatrix/vector associated with \tau_{j} in the partition is of large size, then the computational overhead per iteration tends to increase, which in turn may consume more CPU time. On the contrary, if s is large, i.e., each submatrix/vector indexed by \tau_{j} is of small size, then only a small linear system is handled per iteration; more iterations are required in exchange for such inexpensive computations, which can drive up the total CPU time. This explains why the CPU time first drops and then bounces up as s grows. The scenario for the number of IT differs: small values of s often lead to large blocks which retain most of the information of the original linear system, and computations with such large blocks tend to be less demanding in the number of iterations, which is in line with the observations reported in Figures 1–3.
In practice, it is often the CPU time that counts. Therefore, we restrict ourselves to diminishing the CPU time here. As portrayed in Figures 1–3, one needs to strike a balance when choosing the number of blocks s. To this end, we pinpoint the "experimental minimum" of the CPU time by tracking the MATLAB data tips. Take the thin matrix \texttt{abtaha1} as an example; the first subplot in Figure 3 illustrates that the "experimental minimum" of the CPU time is achieved at s = 121, which is approximately 0.82\% of the row size of \texttt{abtaha1} (m = 14596). For all experiments in this example, the percentage s/m ranges from 0.37\% to 1.14\% for thin matrices, while that for fat matrices ranges from 2.10\% to 5.71\%. Heuristically, any value in the intervals [\lfloor 0.37\%m\rfloor, \lfloor 1.14\%m\rfloor] and [\lfloor 2.10\%m\rfloor, \lfloor 5.71\%m\rfloor], respectively, can be a reasonable choice for thin and fat matrices, where \lfloor \cdot\rfloor is the floor function and m is the row size of A. In light of this, we pick two intermediate values by setting s = \lfloor 0.8\%m\rfloor for overdetermined linear systems and s = \lfloor 4\%m\rfloor for underdetermined ones in Examples 2–5.
Example 2. In this example, we examine how the tuning parameter \alpha influences the numerical performance of the VGBK algorithm. In Algorithm 4, if the value of \alpha is large, then only a small fraction of rows are extracted from A and b to update the iterate x_{k+1}. To compensate for the possible loss of information, more iterations (and thus more CPU time) are required to reach the prescribed accuracy; see Steps 5–7 in Algorithm 4. Conversely, if \alpha is chosen to be small, more rows are retained, which amounts to updating the iterate x_{k+1} with a relatively large linear system. As a result, the computational overhead per iteration increases, which also results in more CPU time. Such an observation tallies with Figures 4–6 for various test matrices. Therefore, one should exercise caution in picking the value of \alpha. As shown in Figures 4–6, the values of \alpha that entail the least CPU time for either thin or fat matrices lie in some neighborhood of 0.1. For this reason, the choice \alpha = 0.1 can be practical for the VGBK algorithm, and we use \alpha = 0.1 in all subsequent experiments. It should be stressed that 0.1 is also a recommended choice for \alpha in the FGBK algorithm [33, Section 3].
4.2. Comparison of block Kaczmarz algorithms
In this subsection, the effectiveness of the VGBK algorithm is justified by comparing it with the GBK, FDBK and FGBK algorithms. As in Section 4.1, we consider the consistent linear system (1.1) with coefficient matrices generated by the MATLAB built-in function \texttt{randn} or adapted from the SuiteSparse Matrix Collection [23]. The structure of this subsection is as follows. In Example 3, we consider solving the overdetermined linear system (1.1). In Example 4, we consider solving the underdetermined linear system (1.1). In Example 5, we touch upon solving linear system (1.1) with large coefficient matrices from real-world problems.
Example 3. In this example, we are concerned with solving linear systems with thin coefficient matrices generated by the MATLAB built-in function \texttt{randn}. As recommended in Examples 1 and 2, we set s = \lfloor 0.8\%m\rfloor and \alpha = 0.1 for the VGBK algorithm. The number of IT and the CPU time for the GBK, FDBK, FGBK and VGBK algorithms are tabulated in Table 3. A rough conclusion is that FDBK, FGBK and VGBK outperform their prototype GBK in terms of CPU time; see, e.g., the 10000\times 5000 case, where the speed-up of VGBK against GBK is up to 13.8089. Moreover, VGBK surpasses all of its block Kaczmarz peers regarding CPU time, albeit with more iteration steps. This can be interpreted as follows: in VGBK, once the partition is applied, we only need to work on a small linear system with just \lfloor 0.8\%m\rfloor rows, which dramatically reduces the computational cost (and thus the CPU time) per iteration. In general, the gain in CPU time becomes more pronounced as m grows, which suggests that the VGBK algorithm is suitable for solving large overdetermined linear systems. Finally, we explain how the comparison with the FGBK method is made: speed-up3 is computed from the smaller of the two FGBK CPU times (over \alpha = 0.05 and \alpha = 0.1); for example, speed-up3 = 2.2912 is the ratio of 3.4606 to 1.5104 in the 10000\times 5000 case.
Example 4. In this example, we consider solving linear systems with fat coefficient matrices generated by \texttt{randn}. The number of IT and the CPU time for the GBK, FDBK, FGBK and VGBK algorithms are listed in Table 4. We choose s = \lfloor 4\%m\rfloor and \alpha = 0.1 for VGBK, as suggested in Examples 1 and 2. For all test matrices, VGBK outperforms the other three algorithms regarding CPU time, with the speed-up ranging from 1.0887 to 18.0378. Besides, the speed-up becomes more noticeable as m increases; for instance, the speed-up of VGBK against GBK reaches as high as 18.0378 for the 12000\times15000 case. It should be noted that FGBK delivers performance comparable to that of VGBK when m is relatively small. As m increases, however, the VGBK algorithm has a remarkable advantage over FGBK; see the results for speed-up3 in Table 4.
Example 5. In this example, we compare the performance of the VGBK algorithm with those of GBK, FDBK and FGBK for solving linear systems with coefficient matrices from the SuiteSparse Matrix Collection, whose properties are listed in Table 2. The corresponding numerical results are displayed in Tables 5 and 6. The effectiveness of VGBK is again verified in terms of CPU time; for instance, the speed-ups of VGBK against GBK, FDBK and FGBK range from 2.0647 to 15.0433, from 2.2771 to 7.2730, and from 1.2906 to 9.3184, respectively. In this sense, the VGBK algorithm is very appealing for solving linear systems from real-world problems.
5.
Conclusions
By imposing a row partitioning, we put forth the VGBK algorithm for solving consistent linear systems. We establish theoretical results that characterize the convergence of the proposed algorithm. Numerical experiments demonstrate that the new algorithm offers good performance for a broad class of problems and outperforms other block Kaczmarz-type counterparts in terms of CPU time for large linear systems. The choice of the partitioning approach is crucial for the performance of the proposed algorithm, and investigating it further constitutes important future work.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
The authors are grateful to Miss Xiang-Xiang Chen of Shanghai Maritime University for helpful discussions. The authors are grateful to the referees for their insightful comments and valuable suggestions, which greatly improved the original version of this paper.
This work is supported by the National Natural Science Foundation of China (No. 12271342).
Conflict of interest
The authors declare that they have no conflict of interest.