
Convolution is a basic and important operation in convolutional neural networks. How to bound the convolutional layers during neural network training is currently a popular research topic. Each convolutional layer is represented by a tensor, which corresponds to a structured transformation matrix. The objective is to ensure that the singular values of each transformation matrix are bounded around 1 by changing the entries of the tensor. We propose three new regularization terms for a convolutional kernel tensor and derive the gradient descent algorithm for each penalty function. Numerical examples are presented to demonstrate the effectiveness of the algorithms.
Citation: Pei-Chang Guo. New regularization methods for convolutional kernel tensors[J]. AIMS Mathematics, 2023, 8(11): 26188-26198. doi: 10.3934/math.20231335
Convolutional neural networks (CNNs) are an important class of deep learning models and have been applied successfully to image understanding in recent years. The use of CNNs is now the dominant approach for almost all recognition and detection tasks [8]. Despite this great success, training deep convolutional networks remains difficult both theoretically and practically. It has been shown that exploiting orthogonality to regularize convolutional layers can improve the stability and performance of CNNs and alleviate the issue of unstable gradients [2,4,9,16,17,21,24]. In this paper, we propose three new regularization terms for convolutional layers and derive the gradient descent algorithm for each penalty function.
First we introduce some notation used in this paper. The symbol ∗ denotes the convolution operation in neural networks. vec(X) denotes the vectorization of X: when X is a matrix, vec(X) is the column vector obtained by stacking the columns of X on top of one another; when X is a tensor, vec(X) is the column vector obtained by stacking the columns of the flattening of X along the first index (see [7] for the flattening of a tensor). The notation ⌈⋅⌉ denotes the ceiling function, i.e., rounding up to the nearest integer. For a matrix A, σmax(A) and σmin(A) denote the largest and smallest singular values, respectively.
Tensors are an important concept in many disciplines [5,15]; they can represent multi-relational data or nonlinear relationships. In CNNs, convolution is a basic and important operation, and it is represented by a tensor. Each convolution operation is associated with a structured linear transformation matrix. Given a convolutional kernel tensor K, Y=K∗X is mathematically equivalent to
$$\mathrm{vec}(Y)=M\,\mathrm{vec}(X),\qquad(1.1)$$
where M is the structured transformation matrix.
In the field of deep learning, there exist different forms of convolution arithmetic arising from different choices of strides and padding patterns [6]. In this paper, without loss of generality, the "same" convolution with unit strides is used to introduce our method. For the one-channel case, a convolutional kernel is represented by a matrix K∈Rk×k and the input is a matrix X∈RN×N; the output Y∈RN×N is then computed by
$$Y_{r,s}=(K*X)_{r,s}=\sum_{p=1}^{k}\sum_{q=1}^{k}X_{r-m+p,\,s-m+q}\,K_{p,q},$$
where m=⌈k/2⌉ and Xi,j=0 if i≤0 or i>N, or if j≤0 or j>N.
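As an illustration (a sketch of ours, not code from the paper; the variable names are assumptions), the following MATLAB lines evaluate this formula for k=3 by explicit loops over a zero-padded input and check the result against MATLAB's conv2, which flips the kernel (hence the rot90):

% Illustrative sketch: one-channel 'same' convolution, k = 3, m = ceil(3/2) = 2.
N = 6; k = 3; m = ceil(k/2);
K = rand(k,k); X = rand(N,N);
Xp = zeros(N+2,N+2); Xp(2:N+1,2:N+1) = X;    % zero padding of width 1 around X
Y = zeros(N,N);
for r = 1:N
  for s = 1:N
    % Y(r,s) = sum over p,q of X(r-m+p, s-m+q)*K(p,q); padded indices are shifted by 1
    Y(r,s) = sum(sum( Xp(r-m+1+(1:k), s-m+1+(1:k)) .* K ));
  end
end
Y2 = conv2(X, rot90(K,2), 'same');           % built-in check; conv2 flips the kernel, rot90 undoes it
disp(norm(Y - Y2, 'fro'))                    % agreement up to rounding error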
In deep convolutional networks, multi-channel convolutions are more common, and a convolutional kernel is represented by a 4-dimensional tensor. For a kernel tensor K∈Rk×k×g×h and an input represented by a 3-dimensional tensor X∈RN×N×g, the output Y=K∗X, Y∈RN×N×h, is given by
$$Y_{r,s,c}=(K*X)_{r,s,c}=\sum_{d=1}^{g}\sum_{p=1}^{k}\sum_{q=1}^{k}X_{r-m+p,\,s-m+q,\,d}\,K_{p,q,d,c},$$
where m=⌈k/2⌉ and Xi,j,d=0 if i≤0 or i>N, or if j≤0 or j>N.
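The multi-channel formula is simply a sum of one-channel convolutions over the input channels; a small MATLAB sketch (ours, for illustration only) makes this explicit:

% Illustrative sketch: multi-channel 'same' convolution with unit strides, k = 3.
N = 6; k = 3; g = 2; h = 3;
K = rand(k,k,g,h); X = rand(N,N,g);
Y = zeros(N,N,h);
for c = 1:h                  % output channels
  for d = 1:g                % input channels
    % slice K(:,:,d,c) is convolved with input channel d and accumulated into output channel c
    Y(:,:,c) = Y(:,:,c) + conv2(X(:,:,d), rot90(K(:,:,d,c),2), 'same');
  end
end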
Deep neural networks are usually layered. The singular values of the Jacobian of a layer bound the factors by which the norms of forward-propagated and backpropagated signals change. In the backward direction, if the singular values of the layers are all close to zero or all significantly larger than 1, gradient vanishing or gradient exploding, respectively, will occur; these are fundamental obstacles for training deep networks [8,11,17]. In the forward direction, if the singular values of the layers are all bounded, the computations are more stable, the generalization error can be bounded, and the robustness to adversarial examples can be improved [1,4,13,19,20,25]. Therefore, it is desirable to constrain the operator norms of network layers. The stability and Hopf bifurcation of some delayed neural networks have also been investigated [14,22,23]. Convolutional layers are important components of CNNs. In this paper we give three new regularization terms for the singular values of convolutional layers and develop the corresponding gradient descent algorithms; thus, we can modify the singular values of M in (1.1) as desired by changing the entries of K.
In the field of deep learning, many papers have studied how to enforce orthogonality or spectral norm regularization on the weights of a neural network [2,4,16,24]. Unlike [2,4,16,24] and the references therein, we handle convolutions differently: those works reshape the kernel K∈Rk×k×g×h into an h×(gk²) matrix and enforce the constraint directly on that matrix, whereas we enforce the constraint on the transformation matrix M associated with the convolutional kernel tensor K. In [17], the authors project a convolutional layer onto an operator-norm ball and confirm through numerical experiments that this is an effective regularizer. Although the projection method in [17] can effectively prevent the singular values of the transformation matrix from being large, it cannot prevent the singular values from being too small. In [10,21], regularization methods are given to make the transformation matrix nearly orthogonal, where the largest and smallest singular values are modified simultaneously.
In this paper we present new regularization methods for the convolutional kernel tensor K. We make two main contributions. First, the proposed regularization terms can decrease the largest singular value and increase the smallest singular value of convolutional layers independently. The regularization is therefore more flexible and targeted, depending on the practical need during the training process. Existing methods have no clear impact on the singular values of the transformation matrix M, cannot effectively prevent the singular values from becoming too small, or can only simultaneously decrease the largest singular value and increase the smallest singular value [10,16,17,21,24]. Second, we give formulae for the partial derivatives of the proposed penalty functions with respect to the convolutional kernel tensor. These are first-order perturbation results, revealing how each entry of a convolutional kernel tensor affects the singular values of the associated structured transformation matrix.
The rest of the paper is organized as follows: In Section 2, as a warm-up, we handle the one-channel case in which the kernel K is a k×k matrix. We propose the penalty functions and give the formulas for computing partial derivatives. In Section 3, we handle the multi-channel case, where the kernel is represented by a tensor K∈Rk×k×g×h. We also propose the penalty functions and give the gradient descent algorithms. In Section 4, we present numerical results to show that the proposed methods are effective. In Section 5, some conclusions and discussions are given.
For the one-channel case, the convolutional kernel is a k×k matrix, and there exist one input channel and one output channel. Suppose that the convolutional kernel K is a 3×3 matrix and the input data matrix is N×N; we show the form of the associated structured transformation matrix. Here,
$$K=\begin{pmatrix}k_{11}&k_{12}&k_{13}\\k_{21}&k_{22}&k_{23}\\k_{31}&k_{32}&k_{33}\end{pmatrix}.$$
For Y=K∗X, the linear transformation matrix M satisfies the equation vec(Y)=Mvec(X), so we can get the linear transformation matrix M as
$$M=\begin{pmatrix}A_{0}&A_{-1}&0&0&\cdots&0\\A_{1}&A_{0}&A_{-1}&\ddots&\ddots&\vdots\\0&A_{1}&A_{0}&\ddots&\ddots&0\\0&\ddots&\ddots&\ddots&A_{-1}&0\\\vdots&\ddots&\ddots&A_{1}&A_{0}&A_{-1}\\0&\cdots&0&0&A_{1}&A_{0}\end{pmatrix},\qquad(2.1)$$
where
$$A_{0}=\begin{pmatrix}k_{22}&k_{32}&0&0&\cdots&0\\k_{12}&k_{22}&k_{32}&\ddots&\ddots&\vdots\\0&k_{12}&k_{22}&\ddots&\ddots&0\\0&\ddots&\ddots&\ddots&k_{32}&0\\\vdots&\ddots&\ddots&k_{12}&k_{22}&k_{32}\\0&\cdots&0&0&k_{12}&k_{22}\end{pmatrix},\qquad A_{-1}=\begin{pmatrix}k_{23}&k_{33}&0&0&\cdots&0\\k_{13}&k_{23}&k_{33}&\ddots&\ddots&\vdots\\0&k_{13}&k_{23}&\ddots&\ddots&0\\0&\ddots&\ddots&\ddots&k_{33}&0\\\vdots&\ddots&\ddots&k_{13}&k_{23}&k_{33}\\0&\cdots&0&0&k_{13}&k_{23}\end{pmatrix},$$
$$A_{1}=\begin{pmatrix}k_{21}&k_{31}&0&0&\cdots&0\\k_{11}&k_{21}&k_{31}&\ddots&\ddots&\vdots\\0&k_{11}&k_{21}&\ddots&\ddots&0\\0&\ddots&\ddots&\ddots&k_{31}&0\\\vdots&\ddots&\ddots&k_{11}&k_{21}&k_{31}\\0&\cdots&0&0&k_{11}&k_{21}\end{pmatrix}.$$
For this case, the N2×N2 matrix M is a doubly blocked banded Toeplitz matrix, i.e., a banded block Toeplitz matrix with its blocks represented as banded Toeplitz matrices. For the details about Toeplitz matrices, we recommend the references [3,12]. We use T to represent the set of all matrices with the same structure as M in (2.1), i.e., doubly blocked banded Toeplitz matrices with a fixed bandwidth.
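For concreteness, M in (2.1) can be assembled from Kronecker products of shift matrices; the following MATLAB sketch is an illustration of ours (not the implementation used for the experiments) and verifies vec(Y)=Mvec(X):

% Illustrative sketch: assemble M in (2.1) for a 3 x 3 kernel K and an N x N input.
N = 6; K = rand(3,3); X = rand(N,N);
E = @(d) diag(ones(N-abs(d),1), d);                 % N x N matrix with ones on diagonal d
A0  = K(1,2)*E(-1) + K(2,2)*E(0) + K(3,2)*E(1);     % diagonal blocks of M
Am1 = K(1,3)*E(-1) + K(2,3)*E(0) + K(3,3)*E(1);     % superdiagonal blocks (A_{-1})
A1  = K(1,1)*E(-1) + K(2,1)*E(0) + K(3,1)*E(1);     % subdiagonal blocks (A_1)
M = kron(E(0),A0) + kron(E(1),Am1) + kron(E(-1),A1);
Y = conv2(X, rot90(K,2), 'same');                   % the same convolution with zero padding
disp(norm(Y(:) - M*X(:)))                           % vec(Y) = M*vec(X) up to rounding error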
From the structure of M, we see that one entry of K corresponds to more than one entry of M: the value of Kp,q appears at several (i,j) positions of M. In this section, we use S to denote this index set; that is, mij=Kp,q for each (i,j)∈S and mij≠Kp,q for each (i,j)∉S.
Given a matrix M, the square of its Frobenius norm, ‖M‖²F, is the sum of squares of all the entries of M; it also equals the sum of squares of all the singular values of M [7]. We will use (1/2)‖M‖²F as the regularization term for the convolutional kernel K to prevent the singular values from being too large, and we derive the formula for ∂((1/2)‖M‖²F)/∂Kp,q. We first give the following simple lemma, which will be useful in the derivation.
Lemma 2.1. For A=[aij]∈Rn×n, the partial derivative of the square of its Frobenius norm with respect to the entry aij is ∂‖A‖²F/∂aij=2aij; equivalently, ∂‖A‖²F/∂A=2A.
Proof. Since ‖A‖²F=∑i,j a²ij, differentiating with respect to aij gives
$$\frac{\partial\|A\|_F^2}{\partial a_{ij}}=2a_{ij},$$
that is, ∂‖A‖²F/∂A=2A.
As we have seen, one entry of K corresponds to more than one entry of M. For the entry Kp,q, the index set S gives its locations in M. By the chain rule, to obtain ∂‖M‖²F/∂Kp,q we compute ∂‖M‖²F/∂mij for all (i,j)∈S and take the sum. We summarize this analysis in the following theorem.
Theorem 2.1. Let M∈Rn×n be the structured transformation matrix associated with the kernel K∈Rk×k. Given (p,q), if S denotes the set of all indices (i,j) such that mij=Kp,q, it holds that
$$\frac{1}{2}\frac{\partial\|M\|_F^2}{\partial K_{p,q}}=\sum_{(i,j)\in S}m_{ij}.\qquad(2.2)$$
Proof. As seen from the structure of M, each entry of K corresponds to more than one entry of M. Given (p,q), since S denotes the set of all indices (i,j) such that mij=Kp,q, combining Lemma 2.1 with the chain rule gives
$$\frac{1}{2}\frac{\partial\|M\|_F^2}{\partial K_{p,q}}=\frac{1}{2}\sum_{(i,j)\in S}\frac{\partial\|M\|_F^2}{\partial m_{ij}}=\sum_{(i,j)\in S}m_{ij}.$$
As noted above, the squared Frobenius norm of a matrix equals the sum of squares of its singular values. Formula (2.2) can be used to implement a gradient descent algorithm for (1/2)‖M‖²F, so we can change the entries of the convolutional kernel K to make the singular values of M smaller.
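A possible MATLAB realization of (2.2) is sketched below (our own illustration; build_M(K,N) denotes a hypothetical function wrapping the construction from the earlier sketch). Since M depends linearly on the entries of K, building M from the indicator of Kp,q yields the 0/1 pattern of the index set S.

% Illustrative sketch of (2.2); build_M(K,N) is the assumed assembler of M from K.
N = 6; K = rand(3,3);
M = build_M(K, N);
G = zeros(3,3);
for p = 1:3
  for q = 1:3
    Epq = zeros(3,3); Epq(p,q) = 1;
    B = build_M(Epq, N);              % 0/1 mask: ones exactly on the index set S for K(p,q)
    G(p,q) = sum(sum(B .* M));        % (1/2)*d||M||_F^2 / dK(p,q), i.e., the sum of m_ij over S
  end
end
% one gradient descent step on (1/2)*||M||_F^2 would then be K = K - lambda*G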
In this subsection, we show how to increase the smallest singular value of M by modifying the entries of K. To compute ∂σmin(M)/∂Kp,q, we need the following lemma, which is the perturbation analysis result for a simple singular value of a matrix; see [18] for the details.
Lemma 2.2. Let A=[aij]∈Rm×m and let σ be a simple singular value of A, with u and v the normalized left and right singular vectors associated with σ, respectively. Then ∂σ/∂aij=u(i)v(j), i.e., ∂σ/∂A=uvT.
The value of Kp,q appears at the positions (i,j)∈S of the matrix M. Therefore, we can use the chain rule and Lemma 2.2 to obtain the next theorem.
Theorem 2.2. For the one-channel convolutional kernel K∈Rk×k, let M∈Rn×n be the structured transformation matrix. Assume that σmin(M) is simple and σmin(M)>0, and let u, v be the normalized left and right singular vectors associated with σmin(M). Given (p,q), if S denotes the set of all indices (i,j) such that mij=Kp,q, we have
$$\frac{\partial\sigma_{\min}(M)}{\partial K_{p,q}}=\sum_{(i,j)\in S}u(i)\,v(j).\qquad(2.3)$$
Proof. Each entry of K corresponds to more than one entry of M. Given (p,q), S denotes the set of all indices (i,j) such that mij=kp,q. Combining Lemma 2.2 with the chain rule formula for the derivative, we get (2.3).
Formula (2.3) can be used to implement gradient descent for the penalty function −σmin(M), i.e., gradient ascent on σmin(M). We can then modify the entries of K to increase σmin(M).
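Formula (2.3) admits the same mask-based evaluation (again assuming the hypothetical helper build_M from the sketches above); u and v come from a full SVD of M:

% Illustrative sketch of (2.3); build_M(K,N) is the assumed assembler of M from K.
N = 6; K = rand(3,3);
M = build_M(K, N);
[U,~,V] = svd(M);                     % singular values in decreasing order
u = U(:,end); v = V(:,end);           % vectors associated with sigma_min(M) (M is square here)
G = zeros(3,3);
for p = 1:3
  for q = 1:3
    Epq = zeros(3,3); Epq(p,q) = 1;
    B = build_M(Epq, N);              % 0/1 mask of the index set S for K(p,q)
    G(p,q) = sum(sum(B .* (u*v')));   % sum of u(i)*v(j) over (i,j) in S
  end
end
% one gradient ascent step on sigma_min(M) would then be K = K + lambda*G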
Now we can combine Theorems 2.1 and 2.2 to keep the singular values of M from being either too large or too small. As noted, ‖M‖²F is the sum of squares of all singular values of M; if M is n×n, it is the sum of squares of n singular values. We may therefore choose (1/2)‖M‖²F−nσmin(M) as the regularization term to ensure that the singular values of M are neither large nor small. This leads to the next theorem.
Theorem 2.3. Let M∈Rn×n be the structured transformation matrix corresponding to the one-channel convolutional kernel K∈Rk×k. Assume that σmin(M)>0 and that σmin(M) is simple, and let u, v be the normalized left and right singular vectors of M associated with σmin(M). Given (p,q), if S denotes the set of all indices (i,j) such that mij=Kp,q, we have
$$\frac{\partial}{\partial K_{p,q}}\Big(\frac{1}{2}\|M\|_F^2-n\,\sigma_{\min}(M)\Big)=\sum_{(i,j)\in S}\big(m_{ij}-n\,u(i)v(j)\big).\qquad(2.4)$$
Proof. Combining (2.2) with (2.3), we can get (2.4).
For the case of multi-channel convolution, the convolutional kernel is represented by a tensor K∈Rk×k×g×h. The tensor X∈RN×N×g denotes the input, where element Xi,j,d is the value of the input unit within channel d at row i and column j. Entries of Y=K∗X,Y∈RN×N×h are computed according to
$$Y_{r,s,c}=(K*X)_{r,s,c}=\sum_{d=1}^{g}\sum_{p=1}^{k}\sum_{q=1}^{k}X_{r-m+p,\,s-m+q,\,d}\,K_{p,q,d,c},$$
where Xi,j,d=0 if i≤0 or i>N, or if j≤0 or j>N. A direct calculation shows that the structured transformation matrix M satisfying vec(Y)=Mvec(X) has the form
$$M=\begin{pmatrix}M^{(1)(1)}&M^{(1)(2)}&\cdots&M^{(1)(g)}\\M^{(2)(1)}&M^{(2)(2)}&\cdots&M^{(2)(g)}\\\vdots&\vdots&\ddots&\vdots\\M^{(h)(1)}&M^{(h)(2)}&\cdots&M^{(h)(g)}\end{pmatrix},\qquad(3.1)$$
where M(c)(d)∈T, i.e., M(c)(d) is an N²×N² doubly blocked banded Toeplitz matrix. M(c)(d) corresponds to the slice K:,:,d,c, which is convolved with the d-th input channel to produce the c-th output channel.
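The block matrix (3.1) can be assembled channel by channel from the one-channel construction; the sketch below is our own illustration, with build_M denoting the hypothetical one-channel assembler from Section 2 and MATLAB's column-major X(:) providing the vectorization:

% Illustrative sketch: assemble M in (3.1) for K in R^{k x k x g x h}, k = 3.
N = 6; k = 3; g = 2; h = 3;
K = rand(k,k,g,h); X = rand(N,N,g);
M = zeros(h*N^2, g*N^2);
for c = 1:h
  for d = 1:g
    % block row c, block column d is the one-channel matrix of the slice K(:,:,d,c)
    M((c-1)*N^2+1:c*N^2, (d-1)*N^2+1:d*N^2) = build_M(K(:,:,d,c), N);
  end
end
Y = zeros(N,N,h);                     % multi-channel convolution for comparison
for c = 1:h
  for d = 1:g
    Y(:,:,c) = Y(:,:,c) + conv2(X(:,:,d), rot90(K(:,:,d,c),2), 'same');
  end
end
disp(norm(Y(:) - M*X(:)))             % vec(Y) = M*vec(X) up to rounding error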
In this section, we use Ωp,q,z,y to denote the set of all indices (i,j) such that mij=Kp,q,z,y; that is, mij=Kp,q,z,y for each (i,j)∈Ωp,q,z,y and mij≠Kp,q,z,y for each (i,j)∉Ωp,q,z,y.
We can generalize the results for one-channel convolution to the multi-channel case, as summarized in the following theorems.
Theorem 3.1. For the convolutional kernel K∈Rk×k×g×h, let M be the associated structured transformation matrix defined in (3.1). Given (p,q,z,y), if Ωp,q,z,y is the set of all indices (i,j) such that mij=Kp,q,z,y, it holds that
$$\frac{1}{2}\frac{\partial\|M\|_F^2}{\partial K_{p,q,z,y}}=\sum_{(i,j)\in\Omega_{p,q,z,y}}m_{ij}.\qquad(3.2)$$
The proof of Theorem 3.1 follows from Lemma 2.1 as in Theorem 2.1; it is omitted here.
Theorem 3.2. For the convolutional kernel K∈Rk×k×g×h, let M be the associated structured transformation matrix defined in (3.1). Assume that σmin(M) is simple and σmin(M)>0, and let u, v be the normalized left and right singular vectors associated with σmin(M). Given (p,q,z,y), if Ωp,q,z,y is the set of all indices (i,j) such that mij=Kp,q,z,y, it holds that
$$\frac{\partial\sigma_{\min}(M)}{\partial K_{p,q,z,y}}=\sum_{(i,j)\in\Omega_{p,q,z,y}}u(i)\,v(j).\qquad(3.3)$$
The proof of Theorem 3.2 follows from Lemma 2.2 as in Theorem 2.2; it is omitted here.
Theorem 3.3. For the convolutional kernel K∈Rk×k×g×h, let M be the associated structured transformation matrix defined in (3.1). Assume that σmin(M) is simple and σmin(M)>0, and let u, v be the normalized left and right singular vectors associated with σmin(M). Given (p,q,z,y), if Ωp,q,z,y is the set of all indices (i,j) such that mij=Kp,q,z,y, it holds that
$$\frac{\partial}{\partial K_{p,q,z,y}}\Big(\frac{1}{2}\|M\|_F^2-\min(g,h)N^2\,\sigma_{\min}(M)\Big)=\sum_{(i,j)\in\Omega_{p,q,z,y}}\big(m_{ij}-\min(g,h)N^2\,u(i)v(j)\big).\qquad(3.4)$$
Here, min(g,h) denotes the smaller of g and h.
Proof. Combining (3.2) with (3.3), we can get (3.4).
We now present the detailed gradient descent algorithms for the three penalty functions; in Algorithm 3.3, min(g,h) denotes the smaller of g and h.
Algorithm 3.1. Gradient descent algorithm for Rα(K)=(1/2)‖M‖²F
(1) Input: a convolutional kernel tensor K∈Rk×k×g×h, step size λ, and input size N×N×g.
(2) If σmax(M) is large:
(3) Compute G = [∂((1/2)‖M‖²F)/∂Kp,q,z,y] for p,q=1,…,k, z=1,…,g, y=1,…,h by (3.2);
(4) Update K=K−λG;
(5) End
Algorithm 3.2. Gradient descent algorithm for Rα(K)=−σmin(M)
(1) Input: a convolutional kernel tensor K∈Rk×k×g×h, step size λ, and input size N×N×g.
(2) If σmin(M) is small:
(3) Compute G = [−∂σmin(M)/∂Kp,q,z,y] for p,q=1,…,k, z=1,…,g, y=1,…,h by (3.3);
(4) Update K=K−λG;
(5) End
Algorithm 3.3. Gradient descent algorithm for Rα(K)=(1/2)‖M‖²F−min(g,h)N²σmin(M)
(1) Input: a convolutional kernel tensor K∈Rk×k×g×h, step size λ, and input size N×N×g.
(2) While not converged:
(3) Compute G = [∂((1/2)‖M‖²F−min(g,h)N²σmin(M))/∂Kp,q,z,y] for p,q=1,…,k, z=1,…,g, y=1,…,h by (3.4);
(4) Update K=K−λG;
(5) End
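As a concrete illustration, a minimal MATLAB sketch of Algorithm 3.3 is given below. It assumes a hypothetical helper build_M_multi that assembles M in (3.1) from a kernel tensor (e.g., by wrapping the block assembly shown after (3.1)); the step size, iteration count, and stopping rule are illustrative, and the 0/1 masks are precomputed once because M is linear in the entries of K.

% Illustrative sketch of Algorithm 3.3 for K in R^{k x k x g x h}.
N = 10; k = 3; g = 1; h = 3; lambda = 1e-5; maxit = 100;
K = rand(k,k,g,h);
r = min(g,h)*N^2;                         % number of singular values of M
B = cell(k,k,g,h);                        % precompute masks: M is linear in K
for p = 1:k, for q = 1:k, for z = 1:g, for y = 1:h
  E = zeros(k,k,g,h); E(p,q,z,y) = 1;
  B{p,q,z,y} = sparse(build_M_multi(E, N));   % pattern of the index set Omega_{p,q,z,y}
end, end, end, end
for it = 1:maxit
  M = build_M_multi(K, N);
  [U,~,V] = svd(M);
  u = U(:,r); v = V(:,r);                 % singular vectors of sigma_min(M)
  D = M - r*(u*v');                       % entrywise data for the gradient in (3.4)
  G = zeros(k,k,g,h);
  for p = 1:k, for q = 1:k, for z = 1:g, for y = 1:h
    G(p,q,z,y) = full(sum(sum(B{p,q,z,y} .* D)));   % sum of (m_ij - r*u(i)*v(j)) over Omega
  end, end, end, end
  K = K - lambda*G;                       % gradient descent step
end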
We performed numerical experiments using MATLAB R2016b on a laptop with a 3.0 GHz processor and 16 GB of memory. M denotes the transformation matrix corresponding to the convolutional kernel tensor. The values of σmax(M) and σmin(M) versus the number of iteration steps (denoted "iter") are used to show the effectiveness of the proposed algorithms. We randomly generated multi-channel convolutional kernels using the following MATLAB commands:
rand('state',1); K = rand(k,k,g,h);
We considered K∈R3×3×g×h with different values of g and h, i.e., kernels of different sizes with 3×3 filters. For each kernel, we used 20×20×g as the size of the input data. We then minimized the three different penalty functions using Algorithms 3.1–3.3, respectively.
Regarding the choice of the step size λ, although we have no theoretical result, we have a good rule of thumb: according to our numerical experiments, λ=1e−5 is suitable for Algorithms 3.1 and 3.3, and λ=1e−4 is suitable for Algorithm 3.2. Regarding the process to obtain Ωp,q,z,y, i.e., the set of all indices (i,j) such that mij=Kp,q,z,y, we first generated the structured matrix A corresponding to the single entry Kp,q,z,y and then used the MATLAB command find(A) to obtain the row and column subscripts of each nonzero element of A. Besides, at each iteration step we used the MATLAB commands norm(M) and cond(M) to compute the largest and smallest singular values of the new transformation matrix (norm(M) returns σmax(M), and σmin(M) follows as norm(M)/cond(M)).
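A sketch of this index-set construction (with the same hypothetical helper build_M_multi and illustrative values for p, q, z, y):

% Illustrative sketch: obtain Omega_{p,q,z,y} via find, as described above.
N = 20; k = 3; g = 1; h = 3; K = rand(k,k,g,h);
M = build_M_multi(K, N);                  % hypothetical assembler of the transformation matrix
p = 1; q = 2; z = 1; y = 3;               % an example kernel entry
E = zeros(k,k,g,h); E(p,q,z,y) = 1;       % indicator of the entry K(p,q,z,y)
A = build_M_multi(E, N);                  % its nonzeros sit exactly on Omega_{p,q,z,y}
[I, J] = find(A);                         % row and column subscripts of the index set
sum(M(sub2ind(size(M), I, J)))            % the sum over Omega_{p,q,z,y} appearing in (3.2)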
We present the results for 3×3×3×1 and 3×3×1×3 kernels in the following figures. Figure 1 shows how the largest singular value of M changes under Algorithm 3.1: as the number of iterations increases, σmax(M) decreases. Figure 2 shows how the smallest singular value of M changes under Algorithm 3.2: as the number of iterations increases, σmin(M) increases. Figure 3 shows the changes of σmax(M) and σmin(M) under Algorithm 3.3: as the number of iterations increases, σmax(M) (left axis) decreases while σmin(M) (right axis) increases. These changes confirm that the three proposed algorithms are effective. In the training of deep neural networks, practitioners can decide which algorithm to use based on their knowledge of the specific network architecture.
Numerical experiments were also performed on other randomly generated examples, including random kernels with each entry uniformly distributed on [0,1]. The convergence of σmax(M) and σmin(M) was similar to that shown in the figures presented in this paper.
In this paper, we have provided new methods to modify the singular values of convolutional kernel tensors. From the perspective of linear algebra, each convolution operation corresponds to a structured transformation matrix. We combined tools from linear algebra with the chain rule for derivatives to obtain the new regularization methods. New regularization terms for convolutional kernels have been proposed, and the gradient descent algorithms for these regularization terms have been provided. The methods are shown to be effective in modifying the singular values of convolutional kernel tensors.
The author declares he has not used Artificial Intelligence (AI) tools in the creation of this article.
This work was supported by the National Natural Science Foundation of China (Grant No. 12001504) and the Fundamental Research Funds for the Central Universities (Grant No. 2652019320).
The author declares no conflict of interest.
[1] P. L. Bartlett, D. J. Foster, M. Telgarsky, Spectrally-normalized margin bounds for neural networks, Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, 6241–6250.
[2] A. Brock, T. Lim, J. M. Ritchie, N. Weston, Neural photo editing with introspective adversarial networks, arXiv, 2017. https://doi.org/10.48550/arXiv.1609.07093
[3] R. H. F. Chan, X. Jin, An introduction to iterative Toeplitz solvers, SIAM Press, 2007.
[4] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, N. Usunier, Parseval networks: improving robustness to adversarial examples, Proceedings of the 34th International Conference on Machine Learning, 70 (2017), 854–863.
[5] W. Ding, Y. Wei, Theory and computation of tensors: multi-dimensional arrays, Academic Press, 2016. https://doi.org/10.1016/C2014-0-04764-8
[6] V. Dumoulin, F. Visin, A guide to convolution arithmetic for deep learning, arXiv, 2018. https://doi.org/10.48550/arXiv.1603.07285
[7] G. H. Golub, C. F. Van Loan, Matrix computations, Johns Hopkins University Press, 2013. https://doi.org/10.56021/9781421407944
[8] I. J. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT Press, 2016.
[9] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv, 2015. https://doi.org/10.48550/arXiv.1412.6572
[10] P. C. Guo, Q. Ye, On the regularization of convolutional kernels in neural networks, Linear Multilinear Algebra, 70 (2022), 2318–2330. https://doi.org/10.1080/03081087.2020.1795058
[11] J. F. Kolen, S. C. Kremer, Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, Wiley-IEEE Press, 2001. https://doi.org/10.1109/9780470544037.ch14
[12] X. Q. Jin, Developments and applications of block Toeplitz iterative solvers, Springer Science & Business Media, 2003.
[13] J. Kovačević, A. Chebira, An introduction to frames, Now Publishers Inc., 2008.
[14] P. Li, Y. Lu, C. Xu, J. Ren, Insight into Hopf bifurcation and control methods in fractional order BAM neural networks incorporating symmetric structure and delay, Cognit. Comput., 2023. https://doi.org/10.1007/s12559-023-10155-2
[15] L. H. Lim, Tensors in computations, Acta Numer., 30 (2021), 555–764. https://doi.org/10.1017/S0962492921000076
[16] T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adversarial networks, arXiv, 2018. https://doi.org/10.48550/arXiv.1802.05957
[17] H. Sedghi, V. Gupta, P. M. Long, The singular values of convolutional layers, arXiv, 2018. https://doi.org/10.48550/arXiv.1805.10408
[18] G. W. Stewart, Matrix algorithms, SIAM Publications Library, 2001. https://doi.org/10.1137/1.9780898718058
[19] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, et al., Intriguing properties of neural networks, arXiv, 2013. https://doi.org/10.48550/arXiv.1312.6199
[20] Y. Tsuzuku, I. Sato, M. Sugiyama, Lipschitz-Margin training: scalable certification of perturbation invariance for deep neural networks, Adv. Neural Inf. Process., 31 (2018), 6542–6551.
[21] J. Wang, Y. Chen, R. Chakraborty, S. X. Yu, Orthogonal convolutional neural networks, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. https://doi.org/10.1109/CVPR42600.2020.01152
[22] C. Xu, Z. Liu, P. Li, J. Yan, L. Yao, Bifurcation mechanism for fractional-order three-triangle multi-delayed neural networks, Neural Process. Lett., 2022. https://doi.org/10.1007/s11063-022-11130-y
[23] C. Xu, W. Zhang, Z. Liu, L. Yao, Delay-induced periodic oscillation for fractional-order neural networks with mixed delays, Neurocomputing, 488 (2022), 681–693. https://doi.org/10.1016/j.neucom.2021.11.079
[24] Y. Yoshida, T. Miyato, Spectral norm regularization for improving the generalizability of deep learning, arXiv, 2017. https://doi.org/10.48550/arXiv.1705.10941
[25] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning (still) requires rethinking generalization, Commun. ACM, 64 (2021), 107–115. https://doi.org/10.1145/3446776