Research article

Improved convergence rates for some kernel random forest algorithms

  • Received: 11 October 2023 Revised: 01 February 2024 Accepted: 17 March 2024 Published: 20 March 2024
  • Random forests are notable learning algorithms introduced by Breiman in 2001. They are widely used for classification and regression tasks and their mathematical properties are under ongoing research. We consider a specific class of random forest algorithms related to kernel methods, the so-called Kernel Random Forests (KeRF). In particular, we investigate thoroughly two explicit algorithms, designed independently of the data set, the centered KeRF and the uniform KeRF. In the present article, we provide an improvement in the rate of convergence for both algorithms and we explore the related reproducing kernel Hilbert space defined by the explicit kernel of the centered random forest.

    Citation: Iakovidis Isidoros, Nicola Arcozzi. Improved convergence rates for some kernel random forest algorithms[J]. Mathematics in Engineering, 2024, 6(2): 305-338. doi: 10.3934/mine.2024013




Random forests are a class of non-parametric statistical machine learning algorithms used for regression and classification tasks. Random forest algorithms can handle sparse, high-dimensional tasks with high accuracy while avoiding overfitting. In particular, random forests are considered to be among the most accurate classes of learning algorithms for general tasks. They are routinely used in many fields including bio-informatics [15], economics [30], biology [10], and linguistics [18].

The most widely used random forest algorithm was introduced by Breiman [13], who was inspired by the work on random subspaces of Ho [19] and on geometrical feature selection of Amit and Geman [2]. In Breiman's random forest, the trees are grown based on the classification and regression trees (CART) procedure, where both splitting directions and training sets are randomized. Although only a few parameters need to be tuned [16,21], their mathematical properties are still an area of active research [6,22]. A significant distinction among random forest algorithms lies in the way each individual tree is constructed and, in particular, in the dependence of each tree on the data set. Some researchers consider random forests whose construction is independent of the data set [7,14,25].

In 2012, Biau [5] studied a random forest model proposed by Breiman, whose construction is independent of the data set, called in the literature the centered random forest. In [5], an upper bound on the rate of consistency of the algorithm and its adaptation to sparsity were proven. More precisely, regarding the first item, for a data set of $n$ samples in a space of dimension $d$, the convergence rate was $O\left(n^{-\frac{1}{\frac{4}{3}d\log 2+1}}\right)$. In 2021, Klusowski [20] improved the rate of convergence to $O\left(\left(n\log^{\frac{d-1}{2}}n\right)^{-\left(\frac{1+\delta}{d\log 2+1}\right)}\right)$, where $\delta$ is a positive constant that depends on the dimension $d$ of the feature space and converges to zero as $d$ approaches infinity. In addition, in the same paper, Klusowski proved that the rate of convergence of the algorithm is sharp, although it fails to reach the minimax rate of consistency over the class of Lipschitz functions, $O\left(n^{-\frac{2}{d+2}}\right)$ [29]. There is also important work on the consistency of algorithms that depend on the data [23,27,28]. For a comprehensive overview of both theoretical and practical aspects of random forests, see [8], which surveys the subject up to 2016.

An important tool for the analysis and algorithmic manipulation of random forests is provided by kernel methods. Breiman [11] observed the connection between kernel theory and random forests by showing the equivalence between tree construction and kernel action. Later this was formalized by Geurts et al. in [17]. In the same direction, Scornet [26] defined the kernel random forest (KeRF) by modifying the original algorithm, and provided theoretical and practical results. In particular, in this important work, Scornet provided explicit kernels for some generalizations of the algorithms, their rates of consistency, and comparisons with the corresponding random forests. Furthermore, Arnould et al. [3] investigated the trade-off between interpolation of several random forest algorithms and their consistency properties.

In the first part of the paper, we provide the notation and the definitions of the centered and uniform random forests and their corresponding kernel-related formulations. In addition, we improve the rate of consistency for the centered KeRF algorithm. Let $k\geq1$ be the depth of the trees used to estimate the target variable $Y$ (see Section 2 for definitions and notation).

Theorem 1. Suppose that $X=(X_1,\dots,X_d)$ and $Y$ are related by $Y=m(X)+\epsilon$, where: $\epsilon$ is a zero mean Gaussian noise with finite variance, independent of $X$; $X$ is uniformly distributed in $[0,1]^d$; and $m$ is a regression function, which we assume to be Lipschitz. Then, there exists a constant $\tilde{C}$ such that, for every $n>1$ and for every $x\in[0,1]^d$,

$$\mathbb{E}\left(\tilde{m}^{Cen}_{\infty,n}(x)-m(x)\right)^2\leq \tilde{C}\, n^{-\left(\frac{1}{1+d\log 2}\right)}(\log n).$$

Here, $m(x)=\mathbb{E}[Y|X=x]$ is the predicted value of $Y$ for $X=x\in[0,1]^d$, while $\tilde{m}^{Cen}_{\infty,n}(x)$ is the estimate for $m$ provided by the kernel version of the centered random forest algorithm.

Similarly, with $\tilde{m}^{Un}_{\infty,n}(x)$ playing for the uniform KeRF algorithm the role $\tilde{m}^{Cen}_{\infty,n}(x)$ had above, we have:

Theorem 2. Let $X$, $Y$, $m$, and $\epsilon$ be as in Theorem 1, with $Y=m(X)+\epsilon$. Then there exists a constant $\tilde{C}$ such that, for every $n>1$ and for every $x\in[0,1]^d$,

$$\mathbb{E}\left(\tilde{m}^{Un}_{\infty,n}(x)-m(x)\right)^2\leq \tilde{C}\, n^{-\left(\frac{1}{1+\frac{3}{2}d\log 2}\right)}(\log n).$$

Moreover, in Section 4, we provide numerical examples and experiments concerning the tuning parameter $k$, the tree depth of the two kernel-based random forest algorithms, by comparing the $L^2$ errors obtained for different values of $k$ under specific assumptions on the data set.

In the final part of the article, we consider the reproducing kernel $K$ used in the centered KeRF algorithm per se. It is rewarding to look at it as defined on the finite Abelian group $\mathbb{Z}_2^{kd}$, where, as above, $d$ is the dimension of the vector $X$ and $k$ is the depth of the tree. By using elementary Fourier analysis on groups, we obtain several equivalent expressions for $K$ and its group transform, we characterize the functions belonging to the corresponding reproducing kernel Hilbert space (RKHS) $H_K$, we derive results on multipliers, and we obtain bounds for the dimension of $H_K$, which is much smaller than one might expect.

A usual problem in machine learning is, based on $n$ observations of a random vector $(X,Y)\in\mathcal{X}\times\mathbb{R}\subseteq\mathbb{R}^d\times\mathbb{R}$, to estimate the function $m(x)=\mathbb{E}(Y|X=x)$. In classification problems, $Y$ ranges over a finite set. In particular, we assume that we are given a training sample $\mathcal{D}_n=\{(X_1,Y_1),\dots,(X_n,Y_n)\}$ of independent random variables with a shared joint distribution $\mathbb{P}_{X,Y}$, where $X_i\in[0,1]^d$ and $Y_i\in\mathbb{R}$ for every $i=1,\dots,n$. The goal is to use the data set to construct an estimate $m_n:\mathcal{X}\subseteq[0,1]^d\to\mathbb{R}$ of the function $m$. Our convergence rate requires an a priori assumption on the regularity of the function $m$. Following [26], we suppose that $m$ belongs to the class of $L$-Lipschitz functions,

$$|m(x)-m(x')|\leq L\,\|x-x'\|.$$

Here, as in [26], we consider on $\mathbb{R}^d$ the distance

$$\|x-x'\|=\sum_{j=1}^d|x_j-x_j'|.$$

    Next, we provide the general random forest framework by defining firstly the notion of a random tree. Additionally, we present two specific variations of the original random forest algorithm, namely, the centered and uniform random forest algorithms.

Let us assume that $\Theta_i$, for $i=1,\dots,M$, is a collection of independent random variables, distributed as $\Theta$. The random variables $\Theta_i$ are used to sample the training set or to select the positions for splitting. The detailed construction in the case of the centered random forest is given in the Appendix.

Definition 1. For the $j$-th tree in the forest, the predicted value at $x$ is denoted by

$$m_{n,\Theta_j,\mathcal{D}_n}(x)=\sum_{i=1}^n\frac{\mathbb{1}_{X_i\in A_{n,\Theta_j,\mathcal{D}_n}(x)}\,Y_i}{N_{n,\Theta_j,\mathcal{D}_n}(x)},$$

where $A_{n,\Theta_j,\mathcal{D}_n}(x)$ is the cell containing $x$ and $N_{n,\Theta_j,\mathcal{D}_n}(x)$ is the number of data points that fall into the cell containing $x$.

For a fixed $x\in[0,1]^d$, the value of the tree is the empirical mean of the $Y_i$ in the unique cell containing $x$, which, one hopes, is a good guess for the target value corresponding to $x$.

A random forest is a finite collection (average) of independent, finite random trees $\{\Theta_1,\dots,\Theta_M\}$.

Definition 2. The finite $M$-forest is

$$m_{M,n}(x)=\frac{1}{M}\sum_{j=1}^M m_{n,\Theta_j,\mathcal{D}_n}(x).$$

From a modeling point of view, we let $M\to\infty$ and consider the infinite forest estimate

$$m_{\infty,n,\mathcal{D}_n}(x)=\mathbb{E}_\Theta\left(m_{n,\Theta,\mathcal{D}_n}(x)\right).$$

The convergence holds almost surely, conditionally on $\mathcal{D}_n$, by the law of large numbers; see [12] and [25, Theorem 3.1].

    In 2016, Scornet [26] introduced kernel methods in the random forest world (KeRF), producing a kernel-based algorithm, together with estimates on how this compares with the old one, described above.

    To understand the intuition behind KeRF construction, we reformulate the random forest algorithm.

For all $x\in[0,1]^d$,

$$m_{M,n}(x)=\frac{1}{M}\sum_{j=1}^M\left(\sum_{i=1}^n\frac{\mathbb{1}_{X_i\in A_{n,\Theta_j,\mathcal{D}_n}(x)}\,Y_i}{N_{n,\Theta_j,\mathcal{D}_n}(x)}\right).$$

Therefore, we can define the weights of the observations $Y_i$ as

$$W_{i,j,n}(x)=\frac{\mathbb{1}_{X_i\in A_{n,\Theta_j,\mathcal{D}_n}(x)}}{N_{n,\Theta_j,\mathcal{D}_n}(x)}.$$

Hence, it is clear that the weights depend heavily on the number of points in each cell. A way to overcome this nuisance is to consider simultaneously all the tree cells containing $x$, as the tree is randomly picked in the forest.

For all $x\in[0,1]^d$,

$$\tilde{m}_{M,n,\Theta_1,\dots,\Theta_M}(x)=\frac{\sum_{j=1}^M\sum_{i=1}^n Y_i\,\mathbb{1}_{X_i\in A_{n,\Theta_j}(x)}}{\sum_{j=1}^M N_{n,\Theta_j}(x)}.$$

    This way, empty cells do not affect the computation of the prediction function of the algorithm.

It is proven in [26] that this estimator indeed admits a kernel representation.

Proposition 1. [26, Proposition 1] For all $x\in[0,1]^d$, almost surely it holds that

$$\tilde{m}_{M,n,\Theta_1,\dots,\Theta_M}(x)=\frac{\sum_{i=1}^n K_{M,n}(x,X_i)\,Y_i}{\sum_{i=1}^n K_{M,n}(x,X_i)},$$

where

$$K_{M,n}(x,z)=\frac{1}{M}\sum_{i=1}^M\mathbb{1}_{x\in A_{n,\Theta_i,\mathcal{D}_n}(z)}$$

is the proximity function of the $M$-forest.
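Proposition 1 is a purely algebraic identity that holds for any family of random partitions playing the role of the trees. The following minimal Python sketch (our own construction: the partitions below are shifted grids on $[0,1]^d$, not Breiman's trees, and all names are ours) checks it numerically.

```python
import numpy as np

# Minimal numerical check of Proposition 1 (a sketch).
rng = np.random.default_rng(1)
n, d, M = 300, 2, 40
X, Y, x = rng.random((n, d)), rng.random(n), rng.random(d)

shifts = rng.random((M, d))                       # one random grid shift per "tree"
labels = lambda t, j: tuple(((t + shifts[j]) % 1.0 * 4).astype(int))  # 4^d cells
in_cell = np.array([[labels(Xi, j) == labels(x, j) for Xi in X] for j in range(M)])

kerf_aggregated = (in_cell * Y).sum() / in_cell.sum()   # aggregated form, display above
K = in_cell.mean(axis=0)                                # proximity K_{M,n}(x, X_i)
kerf_kernel = (K @ Y) / K.sum()                         # kernel form of Proposition 1
assert np.isclose(kerf_aggregated, kerf_kernel)
```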

    The infinite random forest arises as the number of trees tends to infinity.

Definition 3. The infinite KeRF is defined as

$$\tilde{m}_{\infty,n}(x)=\lim_{M\to\infty}\tilde{m}_{M,n}(x,\Theta_1,\dots,\Theta_M).$$

The kernel extends to the infinite random forest as well.

Proposition 2. [26, Proposition 2] Almost surely, for all $x,y\in[0,1]^d$,

$$\lim_{M\to\infty}K_{M,n}(x,y)=K_n(x,y),$$

where

$$K_n(x,y)=\mathbb{P}_\Theta\left(x\in A_n(y,\Theta)\right)$$

is the probability that $x$ and $y$ belong to the same cell in the infinite forest.

In this paper, we say that an estimator $m_n$ of $m$ is consistent if the following $L^2$-type convergence holds:

$$\mathbb{E}\left(m_n(x)-m(x)\right)^2\to0$$

as $n\to\infty$.

In the centered and uniform forest algorithms, the way the data set $\mathcal{D}_n$ is partitioned is independent of the data set itself.

    The centered forest is designed as follows:

1) Fix $k\in\mathbb{N}$.

2) At each node of each individual tree, choose a coordinate uniformly from $\{1,2,\dots,d\}$.

    3) Split the node at the midpoint of the interval of the selected coordinate.

Repeat steps 2) and 3) $k$ times. At the end, we have $2^k$ leaves, or cells. A toy example of this iterative process for $k=1,2$ is given in Figures 1 and 2. Our estimate at a point $x$ is obtained by averaging the $Y_i$ corresponding to the $X_i$ in the cell containing $x$; a minimal code sketch of this construction is given below.
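The following short Python sketch (names and interface are ours) tracks the cell of the partition containing a given point $x$ through the $k$ centered splits described in steps 1)–3).

```python
import numpy as np

def centered_cell(x, k, rng):
    """Track the cell of [0,1]^d containing x through k centered splits
    (a sketch of steps 1)-3); rng supplies the uniform coordinate choices)."""
    d = len(x)
    low, high = np.zeros(d), np.ones(d)
    for _ in range(k):
        j = rng.integers(d)                 # step 2): coordinate chosen uniformly
        mid = 0.5 * (low[j] + high[j])      # step 3): split at the midpoint
        if x[j] < mid:
            high[j] = mid
        else:
            low[j] = mid
    return low, high                        # one of the 2^k leaves

rng = np.random.default_rng(0)
low, high = centered_cell(np.array([0.3, 0.7]), k=2, rng=rng)
```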

Figure 1.  Centered algorithm with tree level $k=1$, with the convention that 1 corresponds to the $x$ axis and 2 to the $y$ axis.
    Figure 2.  Centered algorithm with tree level $k=2$, with the convention that 1 corresponds to the $x$ axis and 2 to the $y$ axis.

Scornet [26] introduced the corresponding kernel-based centered random forest, providing explicitly the proximity kernel function.

Proposition 3. The centered random forest kernel with parameter $k\in\mathbb{N}$ has the following multinomial expression [26, Proposition 5]:

$$K_k^{Cen}(x,z)=\sum_{\substack{k_1,\dots,k_d\geq0\\k_1+\dots+k_d=k}}\frac{k!}{k_1!\cdots k_d!}\left(\frac{1}{d}\right)^k\prod_{j=1}^d\mathbb{1}_{\lceil2^{k_j}x_j\rceil=\lceil2^{k_j}z_j\rceil},$$

where $K_k^{Cen}$ is the kernel of the corresponding centered random forest.
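For small $k$ and $d$, the kernel of Proposition 3 can be evaluated by direct summation over the compositions of $k$; the Python sketch below (our own, exponential in $k$ and $d$) does exactly that.

```python
import numpy as np
from itertools import product
from math import factorial, ceil

def centered_kernel(x, z, k):
    """K_k^{Cen}(x, z) of Proposition 3 by direct summation over all
    (k_1, ..., k_d) with k_1 + ... + k_d = k."""
    d = len(x)
    total = 0.0
    for ks in product(range(k + 1), repeat=d):
        if sum(ks) != k:
            continue
        weight = factorial(k) / np.prod([factorial(kj) for kj in ks]) / d**k
        same_cell = all(ceil(2**kj * xj) == ceil(2**kj * zj)
                        for kj, xj, zj in zip(ks, x, z))
        total += weight * same_cell
    return total

# e.g. centered_kernel([0.3, 0.7], [0.31, 0.69], k=3)
```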

The uniform random forest, introduced by Biau et al. [7], is another toy model of Breiman's random forest, in the same spirit as the centered random forest. The algorithm forms a partition of $[0,1]^d$ as follows:

1) Fix $k\in\mathbb{N}$.

2) At each node of each individual tree, choose a coordinate uniformly from $\{1,2,\dots,d\}$.

3) Split at a point chosen uniformly at random on the side of the cell corresponding to the selected coordinate.

Repeat steps 2) and 3) $k$ times. At the end, we have $2^k$ leaves. Our final estimate at a point $x$ is obtained by averaging the $Y_i$ corresponding to the $X_i$ in the cell containing $x$.

Again, Scornet [26, Proposition 6] derived the corresponding kernel for the kernel-based uniform random forest.

Proposition 4. The corresponding proximity kernel of the uniform KeRF with parameter $k\in\mathbb{N}$ and $x\in[0,1]^d$ has the following form:

$$K_k^{Un}(0,x)=\sum_{\substack{k_1,\dots,k_d\geq0\\k_1+\dots+k_d=k}}\frac{k!}{k_1!\cdots k_d!}\left(\frac{1}{d}\right)^k\prod_{m=1}^d\left(1-x_m\sum_{j=0}^{k_m-1}\frac{(-\ln x_m)^j}{j!}\right),$$

with the convention that

$$\sum_{j=0}^{-1}\frac{(-\ln x_m)^j}{j!}=0,$$

and by continuity we can extend the kernel also to vectors with vanishing components.

Unfortunately, it is very hard to obtain a general formula for $K_k^{Un}(x,y)$, but we consider instead a translation-invariant uniform KeRF:

$$\tilde{m}^{Un}_{\infty,n}(x)=\frac{\sum_{i=1}^n Y_i\,K_k^{Un}(0,|X_i-x|)}{\sum_{i=1}^n K_k^{Un}(0,|X_i-x|)},$$

where $|X_i-x|$ denotes the vector of componentwise absolute differences.
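Analogously, the kernel of Proposition 4, and with it the translation-invariant uniform KeRF above, can be evaluated by brute force for small $k$ and $d$; the Python sketch below (our own names; it assumes nonvanishing components, which holds almost surely) illustrates this.

```python
import numpy as np
from itertools import product
from math import factorial, log

def uniform_kernel_at_zero(x, k):
    """K_k^{Un}(0, x) of Proposition 4 by direct summation; the components
    of x are assumed to lie in (0, 1]."""
    d = len(x)
    total = 0.0
    for ks in (c for c in product(range(k + 1), repeat=d) if sum(c) == k):
        weight = factorial(k) / np.prod([factorial(kj) for kj in ks]) / d**k
        term = 1.0
        for km, xm in zip(ks, x):
            # sum_{j=0}^{k_m - 1} (-ln x_m)^j / j!, the empty sum being 0
            s = sum((-log(xm))**j / factorial(j) for j in range(km))
            term *= 1.0 - xm * s
        total += weight * term
    return total

# Translation-invariant uniform KeRF estimate at a point x:
rng = np.random.default_rng(0)
X, Y, x = rng.random((50, 2)), rng.random(50), np.array([0.4, 0.6])
w = np.array([uniform_kernel_at_zero(np.abs(Xi - x), k=3) for Xi in X])
estimate = (w @ Y) / w.sum()
```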

In this section, after providing some measure-concentration type results [9], we improve the rate of consistency of the centered KeRF algorithm. The following lemmata provide the inequalities needed to derive upper bounds for averages of i.i.d. random variables. Lacking a reference, for completeness we provide detailed proofs of these lemmata. Moreover, throughout this article all random variables are real-valued, $\|X\|_{L^p}:=\left(\mathbb{E}|X|^p\right)^{1/p}$, and $\|X\|_\infty:=\inf\{t\geq0:\ \mathbb{P}(|X|\leq t)=1\}$.

Lemma 1. Let $X_1,\dots,X_n$ be a sequence of real, independent and identically distributed random variables with $\mathbb{E}(X_i)=0$. Assume also that there is a uniform bound for the $L^1$-norm and the supremum norm, i.e., $\mathbb{E}(|X_i|)\leq C$ and $\|X_i\|_\infty\leq CM$ for every $i=1,\dots,n$. Then, for every $t\in(0,1)$,

$$\mathbb{P}\left(\left\{\left|\sum_{i=1}^n X_i\right|\geq nt\right\}\right)\leq 2e^{-\tilde{C}_C\frac{t^2n}{M}}$$

for some positive constant $\tilde{C}_C$ that depends only on $C$.

Proof. For $|x|\leq1$ one has that $e^x\leq1+x+x^2$. By using the hypothesis, for every $\lambda\leq\frac{1}{CM}$,

$$e^{\lambda X_i}\leq1+\lambda X_i+(\lambda X_i)^2\ \Rightarrow\ \mathbb{E}e^{\lambda X_i}\leq1+\lambda^2\mathbb{E}(X_i)^2\leq1+\lambda^2\|X_i\|_{1}\|X_i\|_\infty\leq1+\lambda^2C^2M\leq e^{\lambda^2C^2M}.$$

By the independence of the random variables $X_i$,

$$\mathbb{E}e^{\lambda\sum_{i=1}^nX_i}=\prod_{i=1}^n\mathbb{E}e^{\lambda X_i}\leq\prod_{i=1}^ne^{\lambda^2C^2M}=e^{n\lambda^2C^2M}.$$

Therefore, by Markov's inequality,

$$\mathbb{P}\left(\left\{\sum_{i=1}^nX_i\geq nt\right\}\right)\leq e^{-\lambda tn}\,\mathbb{E}e^{\lambda\sum_{i=1}^nX_i}\leq e^{n\lambda^2C^2M-\lambda tn}.$$

Finally, if $C\geq\frac14$ we choose $\lambda=\frac{t}{2C^2M}$; otherwise, we choose $\lambda=\frac{t}{16CM}$. In both cases,

$$\mathbb{P}\left(\left\{\sum_{i=1}^nX_i\geq nt\right\}\right)\leq e^{-\tilde{C}_C\frac{t^2n}{M}}.$$

By replacing $X_i$ with $-X_i$ we conclude the proof.

Lemma 2. Let $X_1,\dots,X_n$ be a sequence of non-negative, independent and identically distributed random variables with $\mathbb{E}(X_i)\leq2$ and $\|X_i\|_\infty\leq M$ for every $i=1,\dots,n$. Let also $\epsilon_1,\dots,\epsilon_n$ be a sequence of independent random variables following a normal distribution with zero mean and finite variance $\sigma^2$. We assume moreover that the $\epsilon_i$ are independent of the $X_i$ for every $i=1,\dots,n$. Then, for every $t\in(0,1)$,

$$\mathbb{P}\left(\left|\frac1n\sum_{i=1}^n\epsilon_iX_i\right|\geq t\right)\leq2\exp\left(-C\frac{t^2n}{M}\right),$$

with the positive constant $C$ depending only on $\sigma$.

Proof. For a positive $\lambda$,

\begin{align*}
\mathbb{P}\left(\frac1n\sum_{i=1}^n\epsilon_iX_i\geq t\right)&=\mathbb{P}\left(\exp\left(\frac\lambda n\sum_{i=1}^n\epsilon_iX_i\right)\geq\exp(\lambda t)\right)\\
&\leq\exp(-\lambda t)\,\mathbb{E}\exp\left(\frac\lambda n\sum_{i=1}^n\epsilon_iX_i\right)&&\text{(by Chebyshev's inequality)}\\
&=\exp(-\lambda t)\prod_{i=1}^n\mathbb{E}\exp\left(\frac\lambda n\epsilon_iX_i\right)&&\text{(by independence)}\\
&=\exp(-\lambda t)\prod_{i=1}^n\left(1+\sum_{k=2}^\infty\frac{\lambda^k\,\mathbb{E}X_i^k\,\mathbb{E}\epsilon_i^k}{n^kk!}\right)\\
&\leq\exp(-\lambda t)\prod_{i=1}^n\left(1+\frac2M\sum_{k=2}^\infty\frac{\lambda^kM^k\,\mathbb{E}\epsilon_i^k}{n^kk!}\right)\\
&=\exp(-\lambda t)\prod_{i=1}^n\left(1+\frac2M\left(\mathbb{E}\exp\left(\frac{\lambda M}{n}\epsilon_i\right)-1\right)\right)\\
&\leq\exp(-\lambda t)\prod_{i=1}^n\left(1+\frac2M\left(\exp\left(\frac{\lambda^2\sigma^2M^2}{n^2}\right)-1\right)\right)\\
&=\exp(-\lambda t)\exp\left(\sum_{i=1}^n\log\left(1+\frac2M\left(\exp\left(\frac{\lambda^2\sigma^2M^2}{n^2}\right)-1\right)\right)\right)\\
&\leq\exp(-\lambda t)\exp\left(\frac{2n}M\left(\exp\left(\frac{\lambda^2\sigma^2M^2}{n^2}\right)-1\right)\right)\\
&\leq\exp(-\lambda t)\exp\left(\frac{2n}M\cdot\frac{2\lambda^2\sigma^2M^2}{n^2}\right)&&\left(\text{for }\lambda\leq\frac n{\sigma M}\right)\\
&=\exp\left(-\lambda t+\frac{4M\lambda^2\sigma^2}n\right).
\end{align*}

Finally, we select $\lambda=\frac{tn}{8M\sigma^2}$ when $\sigma\geq\frac18$ and $\lambda=\frac{tn}{M\sigma}$ when $\sigma\leq\frac18$, obtaining

$$\mathbb{P}\left(\frac1n\sum_{i=1}^n\epsilon_iX_i\geq t\right)\leq\exp\left(-C\frac{t^2n}{M}\right).$$

Replacing $\epsilon_i$ with $-\epsilon_i$, we conclude the proof.

Theorem 3.1. Let $Y=m(X)+\epsilon$, where $\epsilon$ is a zero mean Gaussian noise with finite variance, independent of $X$. Assume also that $X$ is uniformly distributed in $[0,1]^d$ and that $m$ is a Lipschitz function. Then there exists a constant $\tilde{C}$ such that, for every $n>1$ and every $x\in[0,1]^d$,

$$\mathbb{E}\left(\tilde{m}^{Cen}_{\infty,n}(x)-m(x)\right)^2\leq\tilde{C}\,n^{-\left(\frac{1}{1+d\log2}\right)}(\log n).$$

Proof. Following the notation in [26], let $x\in[0,1]^d$ and $\|m\|_\infty=\sup_{x\in[0,1]^d}|m(x)|$. By the construction of the algorithm,

$$\tilde{m}^{Cen}_{\infty,n}(x)=\frac{\sum_{i=1}^nY_iK_k(x,X_i)}{\sum_{i=1}^nK_k(x,X_i)}.$$

Let

$$A_n(x)=\frac1n\sum_{i=1}^n\left(\frac{Y_iK_k(x,X_i)-\mathbb{E}\left(YK_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right),$$
$$B_n(x)=\frac1n\sum_{i=1}^n\left(\frac{K_k(x,X_i)-\mathbb{E}\left(K_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right),$$

and

$$M_n(x)=\frac{\mathbb{E}\left(YK_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}.$$

Hence, we can reformulate the estimator as

$$\tilde{m}^{Cen}_{\infty,n}(x)=\frac{M_n(x)+A_n(x)}{B_n(x)+1}.$$

Let $t\in\left(0,\frac12\right)$ and let $\mathcal{C}_t(x)$ be the event $\left\{|A_n(x)|,|B_n(x)|\leq t\right\}$. Then

$$\mathbb{E}\left(\tilde{m}^{Cen}_{\infty,n}(x)-m(x)\right)^2=\mathbb{E}\left(\tilde{m}^{Cen}_{\infty,n}(x)-m(x)\right)^2\mathbb{1}_{\mathcal{C}_t(x)}+\mathbb{E}\left(\tilde{m}^{Cen}_{\infty,n}(x)-m(x)\right)^2\mathbb{1}_{\mathcal{C}^c_t(x)}\leq\mathbb{E}\left(\tilde{m}^{Cen}_{\infty,n}(x)-m(x)\right)^2\mathbb{1}_{\mathcal{C}^c_t(x)}+c_1\left(1-\frac1{2d}\right)^{2k}+c_2t^2,$$

where the last inequality was obtained in [26, p. 1496]. Moreover, in [26],

$$\mathbb{E}\left(\tilde{m}^{Cen}_{\infty,n}(x)-m(x)\right)^2\mathbb{1}_{\mathcal{C}^c_t(x)}\leq c_3(\log n)\left(\mathbb{P}\left(\mathcal{C}^c_t(x)\right)\right)^{\frac12}.$$

In order to find the rate of consistency, we need a bound for the probability $\mathbb{P}\left(\mathcal{C}^c_t(x)\right)$. Obviously,

$$\mathbb{P}\left(\mathcal{C}^c_t(x)\right)\leq\mathbb{P}(|A_n(x)|>t)+\mathbb{P}(|B_n(x)|>t).$$

    We will work separately to obtain an upper bound for both probabilities.

Proposition 5. Let

$$\tilde{X}_i=\frac{K_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}-1,\qquad i=1,\dots,n,$$

a sequence of i.i.d. random variables. Then, for any $t\in(0,1)$,

$$\mathbb{P}\left(\left\{\left|\sum_{i=1}^n\tilde{X}_i\right|\geq nt\right\}\right)=\mathbb{P}\left(|B_n(x)|\geq t\right)\leq2e^{-\tilde{C}_1\frac{t^2n}{2^k}}$$

for some positive constant $\tilde{C}_1$.

Proof. It is easy to verify that $\mathbb{E}\tilde{X}_i=0$, and

$$|\tilde{X}_i|=\left|\frac{K_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}-1\right|\leq\frac{K_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}+1,$$

hence $\mathbb{E}|\tilde{X}_i|\leq2$.

Finally,

$$\|\tilde{X}_i\|_\infty=\sup|\tilde{X}_i|=\sup\left|\frac{K_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}-1\right|\leq\frac{1}{\mathbb{E}\left(K_k(x,X)\right)}\sup K_k(x,X_i)+1\leq2^k+1\leq2^{k+1}.$$

By Lemma 1,

$$\mathbb{P}\left(\left\{\left|\sum_{i=1}^n\tilde{X}_i\right|\geq nt\right\}\right)=\mathbb{P}\left(|B_n(x)|\geq t\right)\leq2e^{-\tilde{C}_1\frac{t^2n}{2^k}}.$$

We also need a bound for $\mathbb{P}(|A_n(x)|>t)$, where

$$A_n(x)=\frac1n\sum_{i=1}^n\left(\frac{Y_iK_k(x,X_i)-\mathbb{E}\left(YK_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right).$$

Proposition 6. Let

$$\tilde{Z}_i=\frac{Y_iK_k(x,X_i)-\mathbb{E}\left(YK_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}$$

for $i=1,\dots,n$. Then, for every $t\in(0,1)$,

$$\mathbb{P}\left(\left\{\left|\sum_{i=1}^n\tilde{Z}_i\right|\geq nt\right\}\right)=\mathbb{P}\left(|A_n(x)|\geq t\right)\leq4e^{-C\frac{t^2n}{2^k}},$$

for some constant $C$ depending only on $\sigma$ and $\|m\|_\infty$.

Proof. Since $\epsilon$ is centered and independent of $X$, $\mathbb{E}\left(\epsilon K_k(x,X)\right)=0$, so

\begin{align*}
A_n(x)&=\frac1n\sum_{i=1}^n\left(\frac{Y_iK_k(x,X_i)-\mathbb{E}\left(YK_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right)\\
&=\frac1n\sum_{i=1}^n\left(\frac{m(X_i)K_k(x,X_i)-\mathbb{E}\left(m(X)K_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right)+\frac1n\sum_{i=1}^n\left(\frac{\epsilon_iK_k(x,X_i)-\mathbb{E}\left(\epsilon K_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right)\\
&=\frac1n\sum_{i=1}^n\left(\frac{m(X_i)K_k(x,X_i)-\mathbb{E}\left(m(X)K_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right)+\frac1n\sum_{i=1}^n\frac{\epsilon_iK_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}.
\end{align*}

Therefore,

$$\mathbb{P}\left(|A_n(x)|\geq t\right)\leq\mathbb{P}\left(\left|\frac2n\sum_{i=1}^n\frac{m(X_i)K_k(x,X_i)-\mathbb{E}\left(m(X)K_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right|\geq t\right)+\mathbb{P}\left(\left|\frac2n\sum_{i=1}^n\frac{\epsilon_iK_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}\right|\geq t\right).$$

Let

$$Z_i=\frac{2\left(m(X_i)K_k(x,X_i)-\mathbb{E}\left(m(X)K_k(x,X)\right)\right)}{\mathbb{E}\left(K_k(x,X)\right)},$$

a sequence of i.i.d. random variables. It is easy to verify that the $Z_i$ are centered and

$$|Z_i|=2\left|\frac{m(X_i)K_k(x,X_i)-\mathbb{E}\left(m(X)K_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right|\leq2\|m\|_\infty\frac{K_k(x,X_i)+\mathbb{E}\left(K_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}.$$

Hence, $\mathbb{E}|Z_i|\leq4\|m\|_\infty$. Finally,

$$\|Z_i\|_\infty=\sup|Z_i|=2\sup\left|\frac{m(X_i)K_k(x,X_i)-\mathbb{E}\left(m(X)K_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right|\leq2\|m\|_\infty\left(2^k+1\right)\leq4\|m\|_\infty2^k.$$

By Lemma 1,

$$\mathbb{P}\left(\left|\frac2n\sum_{i=1}^n\frac{m(X_i)K_k(x,X_i)-\mathbb{E}\left(m(X)K_k(x,X)\right)}{\mathbb{E}\left(K_k(x,X)\right)}\right|\geq t\right)\leq2e^{-C'\frac{nt^2}{2^k}}.$$

Furthermore, let

$$\tilde{W}_i=\frac{2\epsilon_iK_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}$$

for $i=1,\dots,n$, a sequence of independent and identically distributed random variables. We can verify that, for every $i=1,\dots,n$,

$$\mathbb{E}\left(\frac{2K_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}\right)=2.$$

Finally,

$$\sup\left|\frac{2K_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}\right|\leq\frac{2}{\mathbb{E}\left(K_k(x,X)\right)}\sup K_k(x,X_i)\leq2^{k+1}.$$

By Lemma 2 it is clear that

$$\mathbb{P}\left(\left|\frac2n\sum_{i=1}^n\frac{\epsilon_iK_k(x,X_i)}{\mathbb{E}\left(K_k(x,X)\right)}\right|\geq t\right)\leq2e^{-C_2\frac{nt^2}{2^k}}.$$

We conclude the proposition by observing that

$$\mathbb{P}\left(|A_n(x)|\geq t\right)\leq4e^{-\min\{C_2,C'\}\frac{nt^2}{2^k}}.$$

Finally, let us compute the rate of consistency of the centered KeRF algorithm. By Propositions 5 and 6, one has that

$$\left(\mathbb{P}\left(\mathcal{C}^c_t(x)\right)\right)^{\frac12}\leq\left(\mathbb{P}(|A_n(x)|>t)+\mathbb{P}(|B_n(x)|>t)\right)^{\frac12}\leq c_3e^{-c_4\frac{nt^2}{2^k}},$$

for some constants $c_3,c_4$ independent of $k$ and $n$.

Thus,

$$\mathbb{E}\left(\tilde{m}_{\infty,n}(x)-m(x)\right)^2\leq c_1\left(1-\frac1{2d}\right)^{2k}+c_2t^2+c_3\log n\,e^{-c_4\frac{t^2n}{2^k}}.$$

We compute the minimum of the right-hand side of the inequality for $t\in(0,1)$:

$$2c_2t-2tc_4\log n\,c_3\frac{n}{2^k}e^{-c_4\frac{t^2n}{2^k}}=0\iff e^{-c_4\frac{t^2n}{2^k}}=\frac{c_2}{c_3c_4}\frac{2^k}{n\log n}\quad\text{and}\quad t^2=\frac1{c_4}\frac{2^k}{n}\log\left(\frac{c_3c_4}{c_2}\frac{n\log n}{2^k}\right).$$

Hence, the inequality becomes

$$\mathbb{E}\left(\tilde{m}_{\infty,n}(x)-m(x)\right)^2\leq c_1\left(1-\frac1{2d}\right)^{2k}+\frac{c_2}{c_4}\frac{2^k}{n}\log\left(\frac{c_3c_4}{c_2}\frac{n\log n}{2^k}\right)+c_3\log n\,\frac{c_2}{c_3c_4}\frac{2^k}{n\log n}=c_1\left(1-\frac1{2d}\right)^{2k}+\frac{c_2}{c_4}\frac{2^k}{n}\log\left(e\,\frac{c_3c_4}{c_2}\frac{n\log n}{2^k}\right).$$

For every $\epsilon_n\in(0,2]$ it holds that $\log x\leq\frac1{\epsilon_n}x^{\epsilon_n}$. Then one has that

$$\mathbb{E}\left(\tilde{m}_{\infty,n}(x)-m(x)\right)^2\leq c_1\left(1-\frac1{2d}\right)^{2k}+\frac{c_2}{c_4\epsilon_n}\left(\frac{ec_3c_4}{c_2}\right)^{\epsilon_n}\left(\frac{2^k}{n}(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}\right)^{1-\epsilon_n}.$$

We pick

$$k=c(d)\log_2\left(\frac{n}{(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}}\right),$$

thus

$$\frac{c_2}{c_4\epsilon_n}\left(\frac{ec_3c_4}{c_2}\right)^{\epsilon_n}\left(\frac{2^k}{n}(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}\right)^{1-\epsilon_n}\leq\frac{c}{\epsilon_n}\,n^{(c(d)-1)(1-\epsilon_n)}(\log n)^{\epsilon_n(1-c(d))},$$

for a constant $c$ independent of $n$, and

$$c_1\left(1-\frac1{2d}\right)^{2k}=c_1\left(1-\frac1{2d}\right)^{2c(d)\log_2\frac{n}{(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}}}=c_1\,2^{2c(d)\log_2\left(1-\frac1{2d}\right)\log_2\frac{n}{(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}}}=c_1\,n^{2c(d)\log_2\left(1-\frac1{2d}\right)}\,(\log n)^{-\frac{2\epsilon_nc(d)}{1-\epsilon_n}\log_2\left(1-\frac1{2d}\right)}.$$

In order to balance the two terms, we choose

$$c(d)=\frac{\epsilon_n-1}{2\log_2\left(1-\frac1{2d}\right)-(1-\epsilon_n)}.$$

Finally, using $\log_2\left(1-\frac1{2d}\right)\leq-\frac1{2d\log2}$,

\begin{align*}
c_1\,n^{2c(d)\log_2\left(1-\frac1{2d}\right)}\,(\log n)^{-\frac{2\epsilon_nc(d)}{1-\epsilon_n}\log_2\left(1-\frac1{2d}\right)}&=c_1\,n^{\frac{2(\epsilon_n-1)\log_2\left(1-\frac1{2d}\right)}{2\log_2\left(1-\frac1{2d}\right)-(1-\epsilon_n)}}\,(\log n)^{-\frac{\epsilon_n}{1-\epsilon_n}\cdot\frac{2(\epsilon_n-1)\log_2\left(1-\frac1{2d}\right)}{2\log_2\left(1-\frac1{2d}\right)-(1-\epsilon_n)}}\\
&\leq c_1\,n^{-\left(\frac{1-\epsilon_n}{1+(1-\epsilon_n)d\log2}\right)}\,(\log n)^{\frac{\epsilon_n}{1+d\log2\,(1-\epsilon_n)}},
\end{align*}

and, for the second term, with the same arguments,

$$\frac{\tilde c}{\epsilon_n}\,n^{(c(d)-1)(1-\epsilon_n)}(\log n)^{\epsilon_n(1-c(d))}\leq\frac{\tilde c}{\epsilon_n}\,n^{-\left(\frac{1-\epsilon_n}{1+(1-\epsilon_n)d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+d\log2\,(1-\epsilon_n)}},$$

for a constant $\tilde c$ independent of $\epsilon_n$. Hence,

$$\mathbb{E}\left(\tilde{m}^{Cen}_{\infty,n}(x)-m(x)\right)^2\leq\frac{C}{\epsilon_n}\,n^{-\left(\frac{1-\epsilon_n}{1+(1-\epsilon_n)d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+d\log2\,(1-\epsilon_n)}},$$

and consequently,

\begin{align*}
\frac{C}{\epsilon_n}\,n^{-\left(\frac{1-\epsilon_n}{1+(1-\epsilon_n)d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+d\log2\,(1-\epsilon_n)}}&=\frac{C}{\epsilon_n}\,n^{-\left(\frac1{1+d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+d\log2\,(1-\epsilon_n)}}\times n^{\frac{\epsilon_n}{(1+d\log2)(1+(1-\epsilon_n)d\log2)}}\\
&=\frac{C}{\epsilon_n}\,n^{-\left(\frac1{1+d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+d\log2\,(1-\epsilon_n)}}\times(\log n)^{\frac{\log n}{\log\log n}\cdot\frac{\epsilon_n}{(1+d\log2)(1+(1-\epsilon_n)d\log2)}}.
\end{align*}

Finally, we finish the proof by selecting $\epsilon_n=\frac1{\log n}$, so that $\frac1{\epsilon_n}=\log n$ while the remaining powers of $\log n$ stay bounded, and

$$\mathbb{E}\left(\tilde{m}^{Cen}_{\infty,n}(x)-m(x)\right)^2\leq\tilde{C}\,n^{-\left(\frac1{1+d\log2}\right)}(\log n).$$

Theorem 4. Let $Y=m(X)+\epsilon$, where $\epsilon$ is a zero mean Gaussian noise with finite variance, independent of $X$. Assume also that $X$ is uniformly distributed in $[0,1]^d$ and that $m$ is a Lipschitz function. Then there exists a constant $\tilde{C}$ such that, for every $n>1$ and every $x\in[0,1]^d$,

$$\mathbb{E}\left(\tilde{m}^{Un}_{\infty,n}(x)-m(x)\right)^2\leq\tilde{C}\,n^{-\left(\frac1{1+\frac32d\log2}\right)}(\log n).$$

Proof. By arguing with the same reasoning as in the proof for the centered random forest, we can verify that

$$\left(\mathbb{P}\left(\mathcal{C}^c_t(x)\right)\right)^{\frac12}\leq\left(\mathbb{P}(|A_n(x)|>t)+\mathbb{P}(|B_n(x)|>t)\right)^{\frac12}\leq c_3e^{-c_4\frac{nt^2}{2^k}},$$

for some constants $c_3,c_4$ independent of $k$ and $n$. The rate of consistency for the uniform KeRF is obtained by minimizing the right-hand side of the inequality

$$\mathbb{E}\left(\tilde{m}^{Un}_{\infty,n}(x)-m(x)\right)^2\leq c_1\left(1-\frac1{3d}\right)^{2k}+c_2t^2+c_3\log n\,e^{-c_4\frac{t^2n}{2^k}}.$$

We compute the minimum of the right-hand side of the inequality for $t\in(0,1)$:

$$2c_2t-2tc_4\log n\,c_3\frac{n}{2^k}e^{-c_4\frac{t^2n}{2^k}}=0\iff e^{-c_4\frac{t^2n}{2^k}}=\frac{c_2}{c_3c_4}\frac{2^k}{n\log n}\quad\text{and}\quad t^2=\frac1{c_4}\frac{2^k}{n}\log\left(\frac{c_3c_4}{c_2}\frac{n\log n}{2^k}\right).$$

Hence, the inequality becomes

$$\mathbb{E}\left(\tilde{m}^{Un}_{\infty,n}(x)-m(x)\right)^2\leq c_1\left(1-\frac1{3d}\right)^{2k}+\frac{c_2}{c_4}\frac{2^k}{n}\log\left(\frac{c_3c_4}{c_2}\frac{n\log n}{2^k}\right)+c_3\log n\,\frac{c_2}{c_3c_4}\frac{2^k}{n\log n}=c_1\left(1-\frac1{3d}\right)^{2k}+\frac{c_2}{c_4}\frac{2^k}{n}\log\left(e\,\frac{c_3c_4}{c_2}\frac{n\log n}{2^k}\right).$$

For every $\epsilon_n\in(0,2]$ it holds that $\log x\leq\frac1{\epsilon_n}x^{\epsilon_n}$. Then one has that

$$\mathbb{E}\left(\tilde{m}^{Un}_{\infty,n}(x)-m(x)\right)^2\leq c_1\left(1-\frac1{3d}\right)^{2k}+\frac{c_2}{c_4\epsilon_n}\left(\frac{ec_3c_4}{c_2}\right)^{\epsilon_n}\left(\frac{2^k}{n}(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}\right)^{1-\epsilon_n}.$$

We pick

$$k=c(d)\log_2\left(\frac{n}{(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}}\right).$$

Therefore,

$$\frac{c_2}{c_4\epsilon_n}\left(\frac{ec_3c_4}{c_2}\right)^{\epsilon_n}\left(\frac{2^k}{n}(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}\right)^{1-\epsilon_n}\leq\frac{c}{\epsilon_n}\,n^{(c(d)-1)(1-\epsilon_n)}(\log n)^{\epsilon_n(1-c(d))},$$

for a constant $c$ independent of $n$, and

$$c_1\left(1-\frac1{3d}\right)^{2k}=c_1\left(1-\frac1{3d}\right)^{2c(d)\log_2\frac{n}{(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}}}=c_1\,2^{2c(d)\log_2\left(1-\frac1{3d}\right)\log_2\frac{n}{(\log n)^{\frac{\epsilon_n}{1-\epsilon_n}}}}=c_1\,n^{2c(d)\log_2\left(1-\frac1{3d}\right)}\,(\log n)^{-\frac{2\epsilon_nc(d)}{1-\epsilon_n}\log_2\left(1-\frac1{3d}\right)}.$$

In order to balance the two terms, we choose

$$c(d)=\frac{\epsilon_n-1}{2\log_2\left(1-\frac1{3d}\right)-(1-\epsilon_n)}.$$

Finally, using $\log_2\left(1-\frac1{3d}\right)\leq-\frac1{3d\log2}$,

\begin{align*}
c_1\,n^{2c(d)\log_2\left(1-\frac1{3d}\right)}\,(\log n)^{-\frac{2\epsilon_nc(d)}{1-\epsilon_n}\log_2\left(1-\frac1{3d}\right)}&=c_1\,n^{\frac{2(\epsilon_n-1)\log_2\left(1-\frac1{3d}\right)}{2\log_2\left(1-\frac1{3d}\right)-(1-\epsilon_n)}}\,(\log n)^{-\frac{\epsilon_n}{1-\epsilon_n}\cdot\frac{2(\epsilon_n-1)\log_2\left(1-\frac1{3d}\right)}{2\log_2\left(1-\frac1{3d}\right)-(1-\epsilon_n)}}\\
&\leq c_1\,n^{-\left(\frac{2(1-\epsilon_n)}{2+(1-\epsilon_n)3d\log2}\right)}\,(\log n)^{\frac{2\epsilon_n}{2+3d\log2\,(1-\epsilon_n)}}\\
&=c_1\,n^{-\left(\frac{1-\epsilon_n}{1+(1-\epsilon_n)\frac32d\log2}\right)}\,(\log n)^{\frac{\epsilon_n}{1+\frac32d\log2\,(1-\epsilon_n)}},
\end{align*}

and, for the second term, with the same arguments,

$$\frac{\tilde c}{\epsilon_n}\,n^{(c(d)-1)(1-\epsilon_n)}(\log n)^{\epsilon_n(1-c(d))}\leq\frac{\tilde c}{\epsilon_n}\,n^{-\left(\frac{1-\epsilon_n}{1+(1-\epsilon_n)\frac32d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+\frac32d\log2\,(1-\epsilon_n)}},$$

for a constant $\tilde c$ independent of $\epsilon_n$. Hence,

$$\mathbb{E}\left(\tilde{m}^{Un}_{\infty,n}(x)-m(x)\right)^2\leq\frac{C}{\epsilon_n}\,n^{-\left(\frac{1-\epsilon_n}{1+(1-\epsilon_n)\frac32d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+\frac32d\log2\,(1-\epsilon_n)}},$$

and consequently,

\begin{align*}
\frac{C}{\epsilon_n}\,n^{-\left(\frac{1-\epsilon_n}{1+(1-\epsilon_n)\frac32d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+\frac32d\log2\,(1-\epsilon_n)}}&=\frac{C}{\epsilon_n}\,n^{-\left(\frac1{1+\frac32d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+\frac32d\log2\,(1-\epsilon_n)}}\times n^{\frac{\epsilon_n}{\left(1+\frac32d\log2\right)\left(1+(1-\epsilon_n)\frac32d\log2\right)}}\\
&=\frac{C}{\epsilon_n}\,n^{-\left(\frac1{1+\frac32d\log2}\right)}(\log n)^{\frac{\epsilon_n}{1+\frac32d\log2\,(1-\epsilon_n)}}\times(\log n)^{\frac{\log n}{\log\log n}\cdot\frac{\epsilon_n}{\left(1+\frac32d\log2\right)\left(1+(1-\epsilon_n)\frac32d\log2\right)}}.
\end{align*}

Finally, we finish the proof by selecting $\epsilon_n=\frac1{\log n}$, so that $\frac1{\epsilon_n}=\log n$ while the remaining powers of $\log n$ stay bounded, and

$$\mathbb{E}\left(\tilde{m}^{Un}_{\infty,n}(x)-m(x)\right)^2\leq\tilde{C}\,n^{-\left(\frac1{1+\frac32d\log2}\right)}(\log n).$$

In the following section, we summarize the rates of convergence for the centered KeRF and the uniform KeRF, and we compare them with the minimax rate of convergence over the class of Lipschitz functions [29]. In addition, we provide numerical simulations in which we compare the $L^2$ error for different choices of the tree depth. All experiments were performed with the software Python (https://www.python.org/), mainly with the numpy library; random sets uniformly distributed in $[0,1]^d$ were created for various examples of the dimension $d$ and the function $Y$. For every experiment the set is divided into a training set (80%) and a testing set (20%); then the $L^2$ error, $\sum_{X_i\in\text{test set}}\left(\tilde{m}(X_i)-Y_i\right)^2$, and the standard deviation of the error are computed.
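The following Python sketch reproduces the protocol just described for the centered KeRF. It is our own minimal implementation: the infinite KeRF kernel $K^{Cen}_k$ of Proposition 3 is approximated by Monte Carlo over the multinomial coordinate counts (with $M$ samples playing the role of trees), and the target function is the one of Figure 5, reading $\mathcal{N}(0,\frac12)$ as variance $\frac12$.

```python
import numpy as np

def mc_centered_kerf(X_tr, Y_tr, X_te, k, M, rng):
    """Centered KeRF prediction with the kernel of Proposition 3 approximated
    by Monte Carlo over the multinomial coordinate counts (a sketch)."""
    d = X_tr.shape[1]
    num, den = np.zeros(len(X_te)), np.zeros(len(X_te))
    for _ in range(M):
        counts = np.bincount(rng.integers(d, size=k), minlength=d)  # splits per coordinate
        cells_tr = np.ceil(2.0**counts * X_tr)      # dyadic cell labels, training points
        cells_te = np.ceil(2.0**counts * X_te)      # dyadic cell labels, test points
        same = (cells_te[:, None, :] == cells_tr[None, :, :]).all(axis=2)
        num += same @ Y_tr
        den += same.sum(axis=1)
    return np.where(den > 0, num / np.maximum(den, 1), Y_tr.mean())

rng = np.random.default_rng(0)
n, d = 2000, 2
X = rng.random((n, d))
Y = X[:, 0]**2 + np.exp(X[:, 1]**2) + rng.normal(0.0, np.sqrt(0.5), n)  # target of Figure 5 (assumed)

split = int(0.8 * n)                                    # 80% training / 20% testing
Y_hat = mc_centered_kerf(X[:split], Y[:split], X[split:], k=8, M=200, rng=rng)
l2_error = np.sum((Y_hat - Y[split:])**2)               # L^2 error on the test set
```

Repeating this computation over several sample sizes, depths $k$, and independent draws gives curves analogous to those reported in the figures below.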

For the centered KeRF we compare three different values of the tree depth, which from the theory provide different convergence rates: first, the choice of $k$ in [26, Theorem 1], which provides the previous convergence rate; second, the selection of $k$ delivered by Theorem 1; and, third, the case where the estimator, in probability, interpolates the data set, but the known convergence rate is slow, $O\left((\log n)^{-\frac{d-5}{6}}\right)$ for feature-space dimension $d>5$ [3, Theorem 4.1].

For the uniform KeRF, we compare the two values of the tree depth derived from [26] and from Theorem 2; nevertheless, it is not known whether the uniform KeRF algorithm converges when the estimator interpolates the data set. Of course, in practice, since real data might violate the assumptions of the theorems, one should use cross-validation to tune the parameters of the algorithms.

    Comparing the rates of consistency for centered KeRF and the depth of the corresponding trees:

● Scornet's rate of convergence [26, Theorem 1]:

$$n^{-\left(\frac1{d\log2+3}\right)}(\log n)^2,\quad\text{with }k=\frac1{\log2+\frac3d}\log\left(\frac{n}{\log^2n}\right).$$

● New rate of convergence (a numerical comparison of the two depth choices is sketched after this list):

$$n^{-\left(\frac1{1+d\log2}\right)}(\log n),\quad\text{with }k=\frac{\frac1{\log n}-1}{2\log_2\left(1-\frac1{2d}\right)-\left(1-\frac1{\log n}\right)}\log_2\left(\frac{n}{(\log n)^{\frac{1/\log n}{1-1/\log n}}}\right).$$

● Minimax rate of consistency over the class of Lipschitz functions [29]: $n^{-\frac2{d+2}}$.
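As a quick illustration of the two depth formulas above, the following Python sketch (our own; both values would be rounded to an integer tree depth in practice) evaluates them for a given $n$ and $d$:

```python
import numpy as np

def k_scornet(n, d):
    """Depth prescribed by [26, Theorem 1]."""
    return (1.0 / (np.log(2) + 3.0 / d)) * np.log(n / np.log(n)**2)

def k_new(n, d):
    """Depth prescribed by Theorem 1, with epsilon_n = 1 / log(n)."""
    eps = 1.0 / np.log(n)
    c_d = (eps - 1.0) / (2.0 * np.log2(1.0 - 1.0 / (2 * d)) - (1.0 - eps))
    return c_d * np.log2(n / np.log(n)**(eps / (1.0 - eps)))

n, d = 10_000, 5
print(round(k_scornet(n, d)), round(k_new(n, d)))   # the new rate requires deeper trees
```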

    Thus, theoretically, the improved rate of consistency is achieved when trees grow at a deeper level compared with the parameter selection in [26, Theorem 1].

As is evident from Figure 3, the improvement in the convergence rate is more significant in low-dimensional feature spaces. The constant $\tilde{C}=\tilde{C}(d)$ of Theorem 1 depends on the dimension $d$ of the space. The convergence rates in the literature do not attempt to give a good estimate for $\tilde{C}$, and they are significant for fixed values of $d$ only.

    Figure 3.  Plot of the exponents of n, for the previous rate of convergence for the centered KeRF algorithm, the new rate of convergence, and the optimal over the class of the Lipschitz functions.

Finally, we note that Klusowski's rate of convergence in [20], $O\left(\left(n\log^{\frac{d-1}2}n\right)^{-\left(\frac{1+\delta}{d\log2+1}\right)}\right)$, where $\delta$ is a positive constant that depends on the dimension $d$ of the feature space and converges to zero as $d$ approaches infinity, is sharp and better than the rate $O\left(n^{-\left(\frac1{1+d\log2}\right)}\log n\right)$ of Theorem 1 for small values of $d$. For large values of $n$ and $d$ the two estimates are essentially the same; but, for now, we do not know whether the rate of convergence of the centered KeRF can be improved in general.

    Comparing the rates of convergence for uniform KeRF and the depth of the corresponding trees:

● Scornet's rate of convergence [26, Theorem 2]:

$$n^{-\left(\frac2{3d\log2+6}\right)}(\log n)^2,\quad\text{with }k=\frac1{\log2+\frac3d}\log\left(\frac{n}{\log^2n}\right).$$

● New rate of convergence:

$$n^{-\left(\frac2{3d\log2+2}\right)}(\log n),\quad\text{with }k=\frac{\frac1{\log n}-1}{2\log_2\left(1-\frac1{3d}\right)-\left(1-\frac1{\log n}\right)}\log_2\left(\frac{n}{(\log n)^{\frac{1/\log n}{1-1/\log n}}}\right).$$

● Minimax rate of convergence over the class of Lipschitz functions [29]: $n^{-\frac2{d+2}}$.

    Thus, theoretically, as in the case of centered random KeRF, the improved rate of consistency is achieved when trees grow at a deeper level compared with the parameter selection in [26, Theorem 2].

The same considerations on the dependence of the constant $\tilde{C}$ on $d$ that we made for the centered KeRF hold in the uniform KeRF case, as is evident from Figure 4. Moreover, as of now, it is still unknown to us whether the convergence rate of the uniform KeRF can be improved.

    Figure 4.  Plot of the exponents of n, for the previous rate of convergence for the uniform KeRF algorithm, the new rate of convergence, and the optimal over the class of the Lipschitz functions.

Finally, numerical simulations of the $L^2$ error of the centered KeRF approximation for three different values of $k$ are reported in Figure 5, with the standard deviation of the errors in Figure 6. In the Appendix, more simulations and plots, for different target functions and for both algorithms, are illustrated.

Figure 5.  Plot of the $L^2$ error of the centered KeRF approximation for three different values of $k$, for the function $Y=X_1^2+e^{X_2^2}+\epsilon$ with $\epsilon\sim\mathcal{N}\left(0,\frac12\right)$, against different data set sizes.
    Figure 6.  Plot of the standard deviation of the errors of the centered KeRF approximation for three different values of $k$, for the function $Y=X_1^2+e^{X_2^2}+\epsilon$ with $\epsilon\sim\mathcal{N}\left(0,\frac12\right)$, against different data set sizes.

    In this paper, we have obtained improved rates of convergence for two kernel-based random forests, the centered KeRF and the uniform KeRF. In addition, we provided numerical simulations for both algorithms concerning the parameters of the methods. Finally, we explored the reproducing kernel Hilbert space related to the centered KeRF providing bounds for the dimension of the aforementioned space.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    The first author was funded by the PhD scholarship PON-DOT1303154-3, the second author was partially funded by project GNAMPA 2022 "HOLOMORPHIC FUNCTIONS IN ONE AND SEVERAL VARIABLES" and by Horizon H2020-MSCA-RISE-2017 project 777822 "GHAIA".

    All authors declare no conflicts of interest in this paper.

In this last section, we provide more experiments on low-dimensional regression examples with additive noise, regarding the centered and the uniform KeRF. In particular, we calculate and compare the $L^2$ errors and the standard deviations against different sample sizes, for different values of the parameter $k$ of the estimator. Moreover, in the following subsection, we study the RKHS produced by the kernel

$$K_k^{Cen}(x,z)=\sum_{\substack{k_1,\dots,k_d\geq0\\k_1+\dots+k_d=k}}\frac{k!}{k_1!\cdots k_d!}\left(\frac1d\right)^k\prod_{j=1}^d\mathbb{1}_{\lceil2^{k_j}x_j\rceil=\lceil2^{k_j}z_j\rceil},$$

defined on the abelian group $\mathbb{Z}_2^{kd}$. To conclude, we recall some notation for finite abelian groups, necessary to define the aforementioned RKHS and estimate its dimension.

We provide here an alternative description of the centered random forest algorithm, where the dyadic tiling of the hypercube motivates us to define the kernel on the abelian group $\mathbb{Z}_2^{kd}$. First, we define a random tree $\Theta$. Start with a random variable $\Theta_0$, uniformly distributed in $\{1,\dots,d\}$, and split

$$I:=[0,1]^d=I_0^{\Theta_0}\cup I_1^{\Theta_0},$$

where

$$I_l^{\Theta_0}=[0,1]\times\dots\times\left[\frac l2,\frac{l+1}2\right]\times\dots\times[0,1],$$

and where, for $l=0,1$, the splitting is performed in the $\Theta_0$-th coordinate. Then choose random variables $\Theta_{1,l}$ ($l=0,1$), distributed as $\Theta_0$, and split each

$$I_l^{\Theta_0}=I_{l,0}^{\Theta_0,\Theta_1}\cup I_{l,1}^{\Theta_0,\Theta_1},$$

where, as before, the splitting is performed at the $\Theta_{1,l}$-th coordinate, and $I_{l,0}^{\Theta_0,\Theta_1}$ is the lower half of $I_l^{\Theta_0}$. Iterate the same procedure $k$ times. In order to do that, we need random variables $\Theta_{j;\eta_0,\dots,\eta_j}$, with $\eta_l\in\{1,\dots,d\}$ and $j=1,\dots,k$. We assume that all such random variables are independent. It is useful to think of $\Theta=\{\Theta_{j;\eta_0,\dots,\eta_j}\}$ as indexed by a $d$-adic tree and, in fact, we refer to $\Theta$ as a random tree in $[0,1]^d$. We call cells, or leaves, each of the $2^k$ rectangles into which $[0,1]^d$ is split at the end of the $k$-th subdivision.

In Figure A1 we show the $L^2$ error of the centered KeRF approximation for a three-dimensional feature space, and in Figure A2 the standard deviation of the errors.

Figure A1.  Plot of the $L^2$ error of the centered KeRF approximation for three different values of $k$, for the function $Y=X_1^2+\frac1{e^{X_2^2}}+e^{X_3^2}+\epsilon$ with $\epsilon\sim\mathcal{N}(0,0.5)$, against different data set sizes.
    Figure A2.  Plot of the standard deviation of the errors of the centered KeRF approximation for three different values of $k$, for the function $Y=X_1^2+\frac1{e^{X_2^2}}+e^{X_3^2}+\epsilon$ with $\epsilon\sim\mathcal{N}(0,0.5)$, against different data set sizes.

Figures A3 and A4 show the $L^2$ error and the standard deviation for the uniform KeRF in a two-dimensional feature space, and Figures A5 and A6 present a three-dimensional example.

Figure A3.  Plot of the $L^2$ error of the uniform KeRF approximation for two different values of $k$, for the function $Y=X_1^2+e^{X_2^2}+\epsilon$ with $\epsilon\sim\mathcal{N}(0,0.5)$, against different data set sizes.
    Figure A4.  Plot of the standard deviation of the errors of the uniform KeRF approximation for two different values of $k$, for the function $Y=X_1^2+e^{X_2^2}+\epsilon$ with $\epsilon\sim\mathcal{N}(0,0.5)$, against different data set sizes.
    Figure A5.  Plot of the $L^2$ error of the uniform KeRF approximation for two different values of $k$, for the function $Y=X_1^2+\frac1{e^{X_3^2}+e^{X_2^2}}+\epsilon$ with $\epsilon\sim\mathcal{N}(0,0.5)$, against different data set sizes.
    Figure A6.  Plot of the standard deviation of the errors of the uniform KeRF approximation for two different values of $k$, for the function $Y=X_1^2+\frac1{e^{X_3^2}+e^{X_2^2}}+\epsilon$ with $\epsilon\sim\mathcal{N}(0,0.5)$, against different data set sizes.

    B.1. The Fourier analysis of the kernel

The kernel of the centered KeRF defines an RKHS structure $H_K$ on a set $\Gamma$ having $2^{kd}$ points [4]. In fact, $\Gamma$ has a group structure, and Fourier analysis can be used. Much research has been done in RKHS theory (see [1]), and in this section we study the structure of $H_K$ in itself. A priori, $H_K$ might have any dimension less than or equal to $\sharp\Gamma$. We show that its dimension is in fact much lower than that, a fact which is somewhat surprising and, we believe, interesting in itself. Furthermore, we prove that there are no nonconstant multipliers in the space $H_K$. For completeness, we provide definitions and notation on Fourier analysis on Abelian groups in Appendix C.

We identify every real number $x\in[0,1]$ with its binary expansion $x=0.x_1x_2x_3\dots$, with $x_i\in\{0,1\}$ for $i\in\mathbb{N}$. Here we consider the group

\begin{equation} G = \mathbb{Z}_2^{kd}\ni x = \left(x^j_i\right)_{\substack{j=1,\dots,d\\ i=1,\dots,k}} = (x^1|x^2|\dots|x^d) = \begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_k \end{pmatrix}, \end{equation} (B.1)

where $x^j$ denotes the $j$-th column and $x_i$ the $i$-th row of the $k\times d$ binary matrix $x$.

The kernel $K:\Gamma\times\Gamma\to\mathbb{C}$ corresponding to the kernel $K^{Cen}_k$ is

    \begin{eqnarray} K(a, b)& = &\sum\limits_{\genfrac{}{}{0pt}{}{{l\in\mathbb{N}^d}}{{|l| = k}}}\frac{1}{d^k}\binom{k}{l}\prod\limits_{j = 1}^d\chi\left(a^j_1 = b^j_1, \dots, a^j_{k_j} = b^j_{k_j}\right)\\ & = &\sum\limits_{\genfrac{}{}{0pt}{}{{l\in\mathbb{N}^d}}{{|l| = k}}}\frac{1}{d^k}\binom{k}{l}\prod\limits_{j = 1}^d\prod\limits_{i = 1}^{k_j}\chi\left(a^j_i = b^j_i\right)\\ & = &\varphi(a-b), \end{eqnarray} (B.2)

where $\binom{k}{l}$ is the multinomial coefficient, $\chi(E)$ denotes the characteristic function of the event $E$, and $a-b$ is the difference in the group $\mathbb{Z}_2^{kd}$. Incidentally, (B.2) shows that the kernel $K$ can be viewed as a convolution kernel on the appropriate group structure. For the last equality, we consider, for each number in $(0,1]$, its binary representation whose digits are not eventually vanishing. The fact that $0$ does not have such a representation is irrelevant, since the probability that one of the coordinates of the data vanishes is zero.

We now compute the inverse Fourier transform $\mu=\check{\varphi}$. We know that $\sharp(\Gamma)=2^{kd}$, and that the characters of $\mathbb{Z}_2^{kd}$ have the form

\begin{equation} \gamma_a(x) = (-1)^{a\cdot x}, \ x\in \mathbb{Z}_2^{kd}, \ a\in \widehat{\mathbb{Z}_2^{kd}}, \ a\cdot x = a^1_1 x^1_1+\dots+a^d_k x^d_k. \end{equation} (B.3)

    Hence,

\begin{eqnarray} 2^{kd}d^k\mu(x)& = &d^k\sum\limits_{a\in\Gamma}\varphi(a)\gamma_a(x)\\ & = &d^k\sum\limits_{a\in\Gamma}\varphi(a)(-1)^{a\cdot x}\\ & = &\sum\limits_{a\in\Gamma}\sum\limits_{\genfrac{}{}{0pt}{}{{l\in\mathbb{N}^d}}{{|l| = k}}}\binom{k}{l}\prod\limits_{j = 1}^d\prod\limits_{i = 1}^{k_j}\chi\left(a^j_i = 0\right) (-1)^{a\cdot x}\\ & = &\sum\limits_{a\in\Gamma}\sum\limits_{\genfrac{}{}{0pt}{}{{l\in\mathbb{N}^d}}{{|l| = k}}}\binom{k}{l}\prod\limits_{j = 1}^d(-1)^{\tilde{a}_j^{k_j}\cdot\tilde{x}_j^{k_j}}\prod\limits_{i = 1}^{k_j}\left[\chi\left(a^j_i = 0\right)(-1)^{a^j_i x^j_i}\right]\\ &\ &\text{where }\tilde{a}_j^{k_j} = \begin{pmatrix} a^j_{k_j+1}\\ \dots\\ a^j_k \end{pmatrix}\text{ is the lower,}\ (k-k_j) -\text{dimensional} \\ &\ &\text{part of the column}\ a^j , \\ & = &\sum\limits_{\genfrac{}{}{0pt}{}{{l\in\mathbb{N}^d}}{{|l| = k}}}\binom{k}{l}\prod\limits_{j = 1}^d(-1)^{\tilde{a}_j^{k_j}\cdot\tilde{x}_j^{k_j}}\prod\limits_{i = 1}^{k_j}\chi\left(a^j_i = 0\right)\\ & = &\sum\limits_{\genfrac{}{}{0pt}{}{{l\in\mathbb{N}^d}}{{|l| = k}}}\binom{k}{l}\sum\limits_{\genfrac{}{}{0pt}{}{{a\in\Gamma}}{{a^1_1 = \dots a^1_{k_1} = a^2_1 = \dots = a^d_{k_d} = 0}}}\prod\limits_{j = 1}^d(-1)^{\tilde{a}_j^{k_j}\cdot\tilde{x}_j^{k_j}}. \end{eqnarray} (B.4)

    The last expression vanishes exactly when for all l , there are some 1\le j\le d , and some k_j+1\le i\le k such that x^j_i = 1 , due to the presence of the factor (-1)^{a^j_i x^j_i} = (-1)^{a^j_i} which takes values \pm1 on summands having, two by two, the same absolute values.

    If, on the contrary, there is l such that for all 1\le j\le d , and k_j+1\le i\le k , we have that x^j_i = 0 , then \mu(x)\ne0 . Since |l| = k and there are kd binary digits involved in the expression of x , the latter occurs exactly when the binary matrix representing x has a large lower region in which all entries are 0 . More precisely, the number of vanishing entries must be at least

\begin{equation} (k-k_1)+\dots+(k-k_d) = (d-1)k. \end{equation} (B.5)

    The number N(d, k) of such matrices is the dimension of H_K , the Hilbert space having K as a reproducing kernel.

    Next, we prove some estimates for the dimension of the RKHS.

    We summarize the main items in the following statement.

    Theorem B.1. Let K:\Gamma\times\Gamma\to\mathbb{C} be the kernel in (B.2), K(a, b) = \varphi(a-b) , and let

\begin{equation} E_K = \text{supp}(\check{\varphi})\subseteq G. \end{equation} (B.6)

    Then,

    (i) As a linear space, H_K = L_{E_K} , where

    \begin{eqnarray} E_K& = &\{x = (x^1|\dots|x^d):\ x^j_i = 0 \mathit{\text{for}}k_j+1\le i\le k, \ \mathit{\text{for some}}\ l\\ & = &(k_1, \dots, k_d)\in\mathbb{N}^d\ \mathit{\text{with}}\ k_1+\dots+k_d = k\}; \end{eqnarray} (B.7)

    (ii) For x\in E_K ,

    \begin{equation} \check{\varphi}(x) = \frac{1}{2^{k}d^k}\sum\limits_{\genfrac{}{}{0pt}{}{{l\in\mathbb{N}^d, \ |l| = k}}{{x^j_i = 0 \mathit{\text{for}}k_j+1\le i\le k}}}\binom{k}{l}. \end{equation} (B.8)

To obtain the expression in (B.8), we used the fact that

\sharp\{a:\ a^1_1 = \dots = a^1_{k_1} = a^2_1 = \dots = a^d_{k_d} = 0\} = 2^{(d-1)k}.
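A brute-force check of Theorem B.1 for small $k$ and $d$ can be carried out numerically; in the Python sketch below (our own names) we compute $\varphi$ from (B.2) with $b=0$, take its inverse transform, and compare support and values with (B.7) and (B.8).

```python
import numpy as np
from itertools import product
from math import factorial

k, d = 2, 2
Gamma = [np.array(bits).reshape(k, d) for bits in product((0, 1), repeat=k * d)]
comps = [c for c in product(range(k + 1), repeat=d) if sum(c) == k]
multinom = lambda c: factorial(k) // np.prod([factorial(cj) for cj in c])

def phi(a):
    # phi(a) = sum_{|l|=k} d^{-k} binom(k;l) prod_j chi(a^j_1 = ... = a^j_{k_j} = 0)
    return sum(multinom(c) * all(not a[:cj, j].any() for j, cj in enumerate(c))
               for c in comps) / d**k

signs = np.array([[(-1)**int((a * x).sum() % 2) for x in Gamma] for a in Gamma])
mu = signs @ np.array([phi(a) for a in Gamma]) / 2**(k * d)   # inverse transform of phi

def valid_l(x):  # the compositions l admissible for x, as in (B.7) and (B.8)
    return [c for c in comps if all(not x[cj:, j].any() for j, cj in enumerate(c))]

for x, mu_x in zip(Gamma, mu):
    ls = valid_l(x)
    if ls:                                           # x in E_K: value given by (B.8)
        assert np.isclose(mu_x, sum(multinom(c) for c in ls) / (2**k * d**k))
    else:                                            # x outside E_K: mu vanishes
        assert np.isclose(mu_x, 0.0)
```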

    B.2. Some properties of H_K

Linear relations. Among all functions $\psi:\Gamma\to\mathbb{C}$, those belonging to $H_K$ (i.e., those belonging to $L_{E_K}$) are characterized by a set of linear equations:

\begin{equation} 0 = 2^{kd}d^k\mu(x) = \sum\limits_{\genfrac{}{}{0pt}{}{{l\in\mathbb{N}^d, \ |l| = k}}{{x^j_i = 0 \text{ for }k_j+1\le i\le k}}}\binom{k}{l}\text{ for }x\notin E_K. \end{equation} (B.9)

    Multipliers. A multiplier of H_K is a function m:\Gamma\to\mathbb{C} such that m\psi\in H_K whenever \psi\in H_K .

    Proposition B.1. The space H_K has no nonconstant multiplier.

In particular, it does not enjoy the complete Pick property, which has been the subject of intensive research for the past twenty-five years [1].

    Proof. The space H_K coincides as a linear space with L_{E_K} . Let \Lambda_E = \check{L_E} , which is spanned by \{\delta_x:\ x\in E\} . Observe that, since 0 = (0|\dots|0)\in E_K , the constant functions belong to H_K , hence, any multiplier m of H_K lies in H_K ; m = m\cdot 1\in H_K .

Suppose $m$ is not constant. Then, $\check{m}(a)\ne0$ for some $a\in E_K$, $a\ne0$; fix such an $a$. Since $H_K\ni m\cdot\widehat{\delta_x}$ for all $x$ in $E_K$, and $m\cdot\widehat{\delta_x} = \widehat{\check{m}\ast\delta_x}$, we have that the support of $\check{m}\ast\delta_x$ is contained in $E_K$. Now,

    \check{m}\ast\delta_x(y) = \check{m}(x-y),

    hence, we have that, for any x in E_K , y = x-a lies in E_K as well. This forces a = 0 , hence m to be constant.

    B.3. Bounds for dimension and generating functions

    Theorem B.2. For fixed d\ge 1 , we have the estimates:

    \begin{equation} \mathit{\text{dim}}(H_K)\sim\frac{2^{k-d+1}k^{d-1}}{(d-1)!}, \ \mathit{\text{hence}}\ \frac{\mathit{\text{dim}}(H_K)}{2^{kd}}\sim\frac{k^{d-1}}{2^{d-1}(d-1)!2^{k(d-1)}}. \end{equation} (B.10)

Proof. For $x\in E_K$ and $j=1,\dots,d$, let $l_j$ be the position of the lowest nonzero entry of the $j$-th column of $x$ (with $l_j=0$ if the column vanishes), so that $x\in E_K$ exactly when

$$0\leq l_1+l_2+\dots+l_d=m\leq k,$$

where $m$ is a parameter. Let also $\lambda=\sharp\{j:\ l_j\geq1\}$ be the number of nonvanishing columns, so that $0\leq m\leq k$ and $0\leq\lambda\leq d\wedge m$. Since a column with $l_j\geq1$ leaves the $l_j-1$ entries above its lowest one free, there are $2^{m-\lambda}$ matrices with a prescribed admissible vector $(l_1,\dots,l_d)$, and our goal is to obtain a bound for

$$N(k,d)=\sum_{m=0}^k\sum_{\lambda=0}^{d\wedge m}2^{m-\lambda}\binom{d}{\lambda}\,\sharp\left\{(l_1,\dots,l_\lambda):\ l_1+\dots+l_\lambda=m,\ l_j\geq1\right\}.$$

Let $A(m,\lambda)$ be the coefficient of $x^m$ in the power series

\begin{align*} (x+x^{2}+\dots+x^{m}+\dots)^{\lambda} & = \left(x(1+x+x^{2}+\dots)\right)^{\lambda}\\ & = x^{\lambda}(1+x+\dots)^{\lambda}\\ & = \frac{x^{\lambda}}{(1-x)^{\lambda}}, \end{align*}

so that $2^{m}A(m,\lambda)$ is the coefficient of $x^m$ in $\frac{(2x)^{\lambda}}{(1-2x)^{\lambda}}$ and, therefore, $2^{m-\lambda}A(m,\lambda)$ is the coefficient of $x^m$ in $\frac{x^{\lambda}}{(1-2x)^{\lambda}}$. Consider now the inner sum: $B(m,d):=\sum_{\lambda=0}^{d\wedge m}\binom{d}{\lambda}2^{m-\lambda}A(m,\lambda)$ is the coefficient of $x^m$ in

\begin{align*} \sum\limits_{\lambda = 0}^{d}{ d \choose \lambda} \left( \frac{x}{1-2x} \right)^{\lambda} & = \left( 1+ \frac{x}{1-2x}\right)^d\\ & = \left(\frac{1-x}{1-2x}\right)^d. \end{align*}

Summing over $m=0,\dots,k$, by the same combinatorial argument we are looking for the $k$-th coefficient of the function

$$f(x) = \frac{1}{1-x}\left(\frac{1-x}{1-2x}\right)^d.$$
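Before turning to the asymptotic estimate, one can sanity-check this generating function numerically: its $k$-th Taylor coefficient should equal the number of matrices in $E_K$ counted above. The Python sketch below (our own; the closed form for the coefficient simply expands $(1-x)^{d-1}(1-2x)^{-d}$) verifies this for a few small pairs $(k,d)$.

```python
import numpy as np
from itertools import product
from math import comb

def dim_HK_bruteforce(k, d):
    """Count the k x d binary matrices in E_K, i.e., those admitting some
    l = (k_1, ..., k_d) with |l| = k and x^j_i = 0 for all i > k_j."""
    comps = [c for c in product(range(k + 1), repeat=d) if sum(c) == k]
    return sum(any(all(not x[kj:, j].any() for j, kj in enumerate(c)) for c in comps)
               for x in (np.array(b).reshape(k, d) for b in product((0, 1), repeat=k * d)))

def dim_HK_series(k, d):
    """k-th Taylor coefficient of f(x) = (1 - x)^{d-1} / (1 - 2x)^d at 0."""
    return sum((-1)**i * comb(d - 1, i) * 2**(k - i) * comb(k - i + d - 1, d - 1)
               for i in range(min(d - 1, k) + 1))

assert all(dim_HK_bruteforce(k, d) == dim_HK_series(k, d)
           for k, d in [(2, 1), (1, 2), (2, 2), (3, 2), (2, 3)])
```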

Back to the estimate: let $a_{k}$ be the $k$-th coefficient of the power series of $f$ centered at $z = 0$. For $0<r<\frac12$, Cauchy's estimate gives $|a_k|\leq r^{-k}\max_{|z|=r}|f(z)|$, and

    \max\limits_{ |z| = r } | f(z)| = \max\limits_{ |z| = r } \bigg| \frac{(1-z)^{d-1}}{(1-2z)^{d}} \bigg| = \max\limits_{ |z| = r }\bigg| \frac{1}{1-z} \bigg(\frac{1-z}{1-2z}\bigg)^{d} \bigg| \leq 2 \max\limits_{ \theta \in ( -\pi, \pi) } \bigg| \frac{1-re^{i \theta}}{1-2re^{i \theta}} \bigg| ^{d}.

    After some calculations since r is fixed one has that the maximum is achieved for \theta = 0. So

    \max\limits_{ |z| = r } | f(z)| \leq 2 (\frac{1-r}{1-2r})^{d} .

    Our estimation finally becomes:

\begin{align*} |a_{k}| &\leq \frac{2 \big( \frac{1-r}{1-2r} \big)^{d} }{r^{k}}\\ & = \frac{2(1-r)^{d}}{r^{k} (1-2r)^{d} }\\ & \lesssim k^{d} 2^{k}\left(\frac{1}{2} + \frac{1}{2k}\right)^{d} \quad \quad \left( \text{since } r = \frac{1}{2}\left( 1-\frac{1}{k}\right)\text{ and }\left(1-\tfrac1k\right)^{-k}\to e \right)\\ & = k^{d} \left(1+ \frac{1}{k}\right)^{d} 2^{k-d}. \end{align*}

    Thus an estimate for the dimension of H_K is

    \frac{|a_k|}{2^{kd}}\lesssim \frac{k^{d} (1+ \frac{1}{k})^{d} 2^{k(1-d)}}{2^d}.

    Another estimate about the dimension of H_K . For f(z) = \sum_{n = 0}^\infty a_n z^n , we have

    |a_n|\le\frac{\max\{|f\left(r e^{i t}\right)|:\ |t|\le\pi\}}{r^n}.

    Consider the function

    f(z) = \frac{(1-z)^{d-1}}{(1-2z)^d}

    in |z| < 1/2 and let r = \frac{1-1/k}{2} . Then,

    \begin{eqnarray*} |a_k|&\le&\frac{(3/2)^{d-1}}{(1/k)^d(1-1/k)^k2^{-k}}\\ &\le&(3/2)^{d-1}2^ke k^d. \end{eqnarray*}

    Thus,

    \frac{|a_k|}{2^{kd}}\lesssim\frac{k^d (3/2)^d}{2^{k(d-1)}}.

    Recursively working out the generating function one gets the estimates in (B.10).

    Following the notation of [24], we recall here the basic notions of Fourier theory for a finite, abelian group G , which we employed above. Here, G is endowed with its counting measure. The dual group \Gamma = \widehat{G} of G is populated by labels a for homomorphisms \gamma_a:G\to\mathbb{T} = \{e^{i t}:\ t\in\mathbb{R}\} . Given a function f:G\to\mathbb{C} , its Fourier transform \widehat{f}:\Gamma\to\mathbb{C} is defined as

    \begin{equation} \widehat{f}(a) = \sum\limits_{x\in G}f(x)\overline{\gamma_a(x)}. \end{equation} (C.1)

    We make \Gamma into a (finite), additive group by setting

    \gamma_{a+b} = \gamma_a\cdot\gamma_b, \text{ and }\gamma_x(a): = \gamma_a(x).

    It turns out that they have the same number of elements, \sharp(G) = \sharp(\Gamma) . Some basic properties are:

    \begin{eqnarray*} f(x)& = &\frac{1}{\sharp(\Gamma)}\sum\limits_{a\in \Gamma}\widehat{f}(a)\gamma_a(x)\text{ (inverse Fourier transform) }\\ \sum\limits_{x\in G}|f(x)|^2& = &\frac{1}{\sharp(\Gamma)}\sum\limits_{a\in \Gamma}|\widehat{f}(a)|^2\text{ (Plancherel) }\\ \widehat{f\ast g}& = &\widehat{f}\cdot\widehat{g}, \end{eqnarray*}

    where

    \begin{equation} (f\ast g)(x) = \sum\limits_{y\in G}f(x-y)g(y). \end{equation} (C.2)

    We write

    \begin{equation} \check{\varphi}(x) = \sharp(\Gamma)^{-1}\sum\limits_{a\in\Gamma}\varphi(a)\gamma_a(x), \text{ so that }\widehat{\check{\varphi}} = \varphi. \end{equation} (C.3)

    The unit element of convolution in G is \delta_0 .

    In the other direction, for \varphi, \psi:\Gamma\to\mathbb{C} we define

    \begin{equation} (\varphi\ast\psi)(a) = \frac{1}{\sharp(\Gamma)}\sum\limits_{b\in\Gamma}\varphi(a-b)\psi(b), \end{equation} (C.4)

    and similarly to above, \widehat{\check{\varphi}\check{\psi}} = \varphi\ast\psi . The unit element on convolution in \Gamma is \sharp(\Gamma)\delta_0 .
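The identities above are easy to test on a toy group; the following Python sketch (our own) takes $G=\mathbb{Z}_2^3$, builds the characters $\gamma_a(x)=(-1)^{a\cdot x}$, and checks the inversion formula and $\widehat{f\ast g}=\widehat{f}\cdot\widehat{g}$.

```python
import numpy as np
from itertools import product

m = 3
G = list(product((0, 1), repeat=m))                    # the group Z_2^m
idx = {x: i for i, x in enumerate(G)}
chars = np.array([[(-1)**(np.dot(a, x) % 2) for x in G] for a in G])  # gamma_a(x)

rng = np.random.default_rng(0)
f, g = rng.random(len(G)), rng.random(len(G))

fhat = chars @ f                      # (C.1): hat f(a) = sum_x f(x) gamma_a(x)
assert np.allclose(chars.T @ fhat / len(G), f)         # inverse Fourier transform

def convolve(f, g):                   # (C.2): (f * g)(x) = sum_y f(x - y) g(y)
    h = np.zeros(len(G))
    for x in G:
        for y in G:
            diff = tuple((xi - yi) % 2 for xi, yi in zip(x, y))
            h[idx[x]] += f[idx[diff]] * g[idx[y]]
    return h

assert np.allclose(chars @ convolve(f, g), fhat * (chars @ g))   # hat(f*g) = hat f . hat g
```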

    A function \varphi on \Gamma is positive definite if

    \sum\limits_{a, b\in\Gamma}^n c(a) \overline{c(b)}\varphi(b-a)\ge0.

    Theorem C.1. [Bochner's theorem] A function \varphi:\Gamma\to\mathbb{C} is positive definite if and only if there exists \mu:G\to\mathbb{R}_+ such that \varphi = \widehat{\mu} .

    The theorem holds in great generality, and its proof in the finite group case is elementary. We include it because it highlights the relationship between the measure \mu on G and the positive definite function (the kernel) \varphi .

Proof. For the 'if' part, assume $\varphi = \widehat{\mu}$ with $\mu:G\to\mathbb{R}_+$. Then

    \begin{eqnarray} \sharp(\Gamma)^{-2}\sum\limits_{a, b\in\Gamma}\widehat{\mu}(b-a) c(a)\overline{c(b)}& = &\sum\limits_{x\in G}\sharp(\Gamma)^{-2}\sum\limits_{a, b\in\Gamma}\mu(x)\overline{\gamma_{b-a}(x)}c(a)\overline{c(b)}\\ & = &\sum\limits_{x\in G}\sharp(\Gamma)^{-2}\sum\limits_{a, b\in\Gamma}\mu(x)\overline{\gamma_{b}(x)}\overline{c(b)}\gamma_a(x)c(a)\\ & = &\sum\limits_{x\in G}\mu(x)\left|\sharp(\Gamma)^{-1}\sum\limits_{a\in\Gamma}c(a)\gamma_a(x)\right|^2\\ & = &\sum\limits_{x\in G}\mu(x)\left|\check{c}(x)\right|^2\ge0. \end{eqnarray} (C.5)

For the 'only if' part: since, for all $b$ in $\Gamma$,

    \begin{eqnarray} \mu(x)\sharp(\Gamma)& = &\sum\limits_{a\in\Gamma}\varphi(a)\gamma_x(a) \\ & = &\sum\limits_{a\in\Gamma}\varphi(a-b)\gamma_x(a-b)\\ & = &\sum\limits_{a\in\Gamma}\varphi(a-b)\gamma_x(a)\overline{\gamma_x(b)}, \end{eqnarray} (C.6)

    we have

    \begin{equation} \mu(x)\sharp(\Gamma)^2 = \sum\limits_{a, b\in\Gamma}\varphi(a-b)\gamma_x(a)\overline{\gamma_x(b)}\ge0, \end{equation} (C.7)

    by the assumption.

    We now come to reproducing kernels on \Gamma which are based on positive definite functions \varphi:\Gamma\to\mathbb{R}_+ . Set

\begin{equation} K(a, b) = \varphi(a-b) = K_b(a), \ K:\Gamma\times\Gamma\to\mathbb{C}, \end{equation} (C.8)

    and set

\begin{equation} H_K = \text{span}\{K_b:\ b\in\Gamma\}\ni\sum\limits_{b\in\Gamma}c(b)K_b, \end{equation} (C.9)

    where H_K is the Hilbert space having K as reproducing kernel. We wish to have a more precise understanding of it.

We start by expressing the norm of an element of $H_K$ in several equivalent ways:

    \begin{eqnarray} \left\|\sum\limits_{b\in\Gamma}c(b)K_b\right\|_{H_K}^2& = &\sum\limits_{a, b\in\Gamma}\overline{c(a)}c(b)\langle K_b, K_a\rangle\\ & = &\sum\limits_{a, b\in\Gamma}\overline{c(a)}c(b)K(a, b) = \sum\limits_{a, b\in\Gamma}\overline{c(a)}c(b)\widehat{\mu}(a-b)\\ & = &\sum\limits_{a, b\in\Gamma}\overline{c(a)}c(b)\sum\limits_{x\in G}\mu(x)\gamma_{b-a}(x)\\ & = &\sum\limits_{x\in G}\mu(x)\sum\limits_{a, b\in\Gamma}\overline{c(a)}c(b)\gamma_{b}(x)\overline{\gamma_a(x)}\\ & = &\sum\limits_{x\in G}\mu(x)\left|\sum\limits_{b\in\Gamma}c(b)\gamma_b(x)\right|^2\\ & = &\sharp(\Gamma)^2\sum\limits_{x\in G}\mu(x)\left|\check{c}(x)\right|^2\\ & = &\sharp(\Gamma)^2\sum\limits_{x\in G}\left|\mu(x)^{1/2}\check{c}(x)\right|^2. \end{eqnarray} (C.10)

    In other terms,

    \begin{equation} \sharp(\Gamma)^{-1}\sum\limits_{b\in\Gamma}c(b)K_b\mapsto \check{c} \end{equation} (C.11)

    is an isometry of H_K onto L^2(\mu) . This will become important later, when we verify that for our kernels \text{supp}(\mu) is sparse in G . In fact, \text{dim}(H_K) = \sharp(\text{supp}(\mu)) .

    Corollary C.1. As a linear space, H_K is determined by \mathit{\text{supp}}(\mu) :

    \psi\in H_K\ \mathit{\text{if and only if}}\ \mathit{\text{supp}}\ (\check{\psi})\subseteq\mathit{\text{supp}}(\mu).

    Let E\subseteq G . We denote

    \begin{equation} L_E = \{G\xrightarrow{\psi}\mathbb{C}:\ \text{supp}(\check{\psi})\subseteq E\}. \end{equation} (C.12)

Next, we look for the natural orthonormal system provided by the Fourier isometry (C.11). For $x\in G$, let $\check{c_x} = \mu(x)^{-1/2}\delta_x$: the set $\{\check{c_x}:\ x\in E: = \text{supp}(\mu)\}$ is an orthonormal system for $L^2(\mu)$, and so $\{e_x:\ x\in E\}$ is an orthonormal basis for $H_K$, where

    \begin{eqnarray} c_x(b) = \sum\limits_{y\in G}\mu(x)^{-1/2}\delta_x(y)\overline{\gamma_b(y)} = \mu(x)^{-1/2}\overline{\gamma_b(x)}, \end{eqnarray} (C.13)

    and

    \begin{eqnarray} e_x(a)& = &\sharp(\Gamma)^{-1}\sum\limits_{b\in\Gamma}c_x(b)K_b(a)\\ & = &\frac{\mu(x)^{-1/2}}{\sharp(\Gamma)}\sum\limits_{b\in\Gamma}K_b(a)\overline{\gamma_b(x)}\\ & = &\frac{\mu(x)^{-1/2}}{\sharp(\Gamma)}\sum\limits_{b\in\Gamma}\varphi(a-b)\overline{\gamma_b(x)}\\ & = &\frac{\mu(x)^{-1/2}}{\sharp(\Gamma)}\sum\limits_{b\in\Gamma}\varphi(a-b)\gamma_{a-b}(x)\overline{\gamma_a(x)}\\ & = &\mu(x)^{-1/2}\mu(x)\overline{\gamma_a(x)}\\ & = &\mu(x)^{1/2}\overline{\gamma_a(x)}. \end{eqnarray} (C.14)

    Let's verify that we obtain the reproducing kernel from the o.n.b. as expected,

\begin{eqnarray} \sum\limits_{x\in E}e_x(a)\overline{e_x(b)}& = &\sum\limits_{x\in E}\mu(x)\overline{\gamma_x(a)}\gamma_x(b)\\ & = &\sum\limits_{x\in E}\mu(x)\overline{\gamma_x(a-b)}\\ & = &\widehat{\mu}(a-b)\\ & = &\varphi(a-b). \end{eqnarray} (C.15)

    Remark C.1. Since any finite, abelian group can be written as the direct product of cyclic groups,

    \begin{equation} G = \bigoplus\limits_{l = 1}^L \mathbb{Z}_{m_l}, \end{equation} (C.16)

    its dual \Gamma can be written in the same way, because \widehat{\mathbb{Z}_m}\equiv\mathbb{Z}_m . From the Fourier point of view, the only difference is that, if on G we consider the counting measure, then on \Gamma we consider normalized counting measure, as we did above.



    [1] J. Agler, J. E. McCarthy, Pick interpolation and Hilbert function spaces, 2002.
    [2] Y. Amit, D. Geman, Shape quantization and recognition with randomized trees, Neural Comput., 9 (1997), 1545–1588. https://doi.org/10.1162/neco.1997.9.7.1545 doi: 10.1162/neco.1997.9.7.1545
    [3] L. Arnould, C. Boyer, E. Scornet, Is interpolation benign for random forest regression? Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 206 (2023), 5493–5548.
    [4] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc., 68 (1950), 337–404.
    [5] G. Biau, Analysis of a random forests model, J. Mach. Learn. Res., 13 (2012), 1063–1095.
    [6] G. Biau, L. Devroye, On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification, J. Multivariate Anal., 101 (2010), 2499–2518. https://doi.org/10.1016/j.jmva.2010.06.019 doi: 10.1016/j.jmva.2010.06.019
    [7] G. Biau, L. Devroye, G. Lugosi, Consistency of random forests and other averaging classifiers, J. Mach. Learn. Res., 9 (2008), 2015–2033.
    [8] G. Biau, E. Scornet, A random forest guided tour, TEST, 25 (2016), 197–227. https://doi.org/10.1007/s11749-016-0481-7 doi: 10.1007/s11749-016-0481-7
    [9] S. Boucheron, G. Lugosi, P. Massart, Concentration inequalities: a nonasymptotic theory of independence, Oxford University Press, 2013.
    [10] A. Boulesteix, S. Janitza, J. Kruppa, I. R. König, Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics, Wiley Interdiscip. Rev.: Data Min. Knowl. Discovery, 2 (2012), 493–507. https://doi.org/10.1002/widm.1072 doi: 10.1002/widm.1072
    [11] L. Breiman, Some infinity theory for predictor ensembles, Technical Report 579, Statistics Dept. UCB, 2000.
    [12] L. Breiman, Random forests, Mach. Learn., 45 (2001), 5–32. https://doi.org/10.1023/A:1010933404324 doi: 10.1023/A:1010933404324
    [13] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and regression trees, 1 Ed., New York: Chapman and Hall/CRC, 1984. https://doi.org/10.1201/9781315139470
    [14] M. Denil, D. Matheson, N. Freitas, Consistency of online random forests, Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, 2013, 1256–1264.
    [15] D. J. Dittman, T. M. Khoshgoftaar, A. Napolitano, The effect of data sampling when using random forest on imbalanced bioinformatics data, 2015 IEEE International Conference on Information Reuse and Integration, IEEE, 2015,457–463. https://doi.org/10.1109/IRI.2015.76
    [16] R. Genuer, J. M. Poggi, C. Tuleau, Random forests: some methodological insights, arXiv, 2008. https://doi.org/10.48550/arXiv.0811.3619
    [17] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Mach. Learn., 63 (2006), 3–42. https://doi.org/10.1007/s10994-006-6226-1 doi: 10.1007/s10994-006-6226-1
    [18] S. T. Gries, On classification trees and random forests in corpus linguistics: some words of caution and suggestions for improvement, Corpus Linguist. Ling. Theory, 16 (2019), 617–647. https://doi.org/10.1515/cllt-2018-0078 doi: 10.1515/cllt-2018-0078
    [19] T. K. Ho, Random decision forests, Proceedings of 3rd International Conference on Document Analysis and Recognition, IEEE, 1 (1995), 278–282. https://doi.org/10.1109/ICDAR.1995.598994
    [20] J. M. Klusowski, Sharp analysis of a simple model for random forests, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 130 (2021), 757–765.
    [21] A. Liaw, M. Wiener, Classification and regression by randomForest, R News, 2 (2002), 18–22.
    [22] Y. Lin, Y. Jeon, Random forests and adaptive nearest neighbors, J. Amer. Stat. Assoc., 101 (2006), 578–590. https://doi.org/10.1198/016214505000001230 doi: 10.1198/016214505000001230
    [23] L. K. Mentch, G. Hooker, Ensemble trees and CLTs: statistical inference for supervised learning, arXiv, 2014. https://doi.org/10.48550/arXiv.1404.6473
    [24] W. Rudin, Fourier analysis on groups, Courier Dover Publications, 2017.
    [25] E. Scornet, On the asymptotics of random forests, J. Multivariate Anal., 146 (2016), 72–83. https://doi.org/10.1016/j.jmva.2015.06.009 doi: 10.1016/j.jmva.2015.06.009
    [26] E. Scornet, Random forests and kernel methods, IEEE Trans. Inf. Theory, 62 (2016), 1485–1500. https://doi.org/10.1109/TIT.2016.2514489 doi: 10.1109/TIT.2016.2514489
    [27] E. Scornet, G. Biau, J. P. Vert, Consistency of random forests, Ann. Statist., 43 (2015), 1716–1741. https://doi.org/10.1214/15-AOS1321 doi: 10.1214/15-AOS1321
    [28] S. Wager, Asymptotic theory for random forests, arXiv, 2014. https://doi.org/10.48550/arXiv.1405.0352
    [29] Y. Yang, A. Barron, Information-theoretic determination of minimax rates of convergence, Ann. Statist., 27 (1999), 1564–1599.
    [30] J. Yoon, Forecasting of real GDP growth using machine learning models: gradient boosting and random forest approach, Comput. Econ., 57 (2021), 247–265. https://doi.org/10.1007/s10614-020-10054-w doi: 10.1007/s10614-020-10054-w
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)