
Multiscale regression on unknown manifolds

• We consider the regression problem of estimating functions on $\mathbb{R}^D$ but supported on a $d$-dimensional manifold $\mathcal{M}\subset\mathbb{R}^D$ with $d\ll D$. Drawing ideas from multi-resolution analysis and nonlinear approximation, we construct low-dimensional coordinates on $\mathcal{M}$ at multiple scales, and perform multiscale regression by local polynomial fitting. We propose a data-driven wavelet thresholding scheme that automatically adapts to the unknown regularity of the function, allowing for efficient estimation of functions exhibiting nonuniform regularity at different locations and scales. We analyze the generalization error of our method by proving finite sample bounds in high probability on rich classes of priors. Our estimator attains optimal learning rates (up to logarithmic factors) as if the function was defined on a known Euclidean domain of dimension $d$, instead of an unknown manifold embedded in $\mathbb{R}^D$. The implemented algorithm has quasilinear complexity in the sample size, with constants linear in $D$ and exponential in $d$. Our work therefore establishes a new framework for regression on low-dimensional sets embedded in high dimensions, with fast implementation and strong theoretical guarantees.

    Citation: Wenjing Liao, Mauro Maggioni, Stefano Vigogna. Multiscale regression on unknown manifolds[J]. Mathematics in Engineering, 2022, 4(4): 1-25. doi: 10.3934/mine.2022028




    High-dimensional data challenge classical statistical models and require new understanding of tradeoffs in accuracy and efficiency. The seemingly quantitative fact of the increase of dimension has qualitative consequences in both methodology and implementation, demanding new ways to break what has been called the curse of dimensionality. On the other hand, the presence of inherent nonuniform structure in the data calls into question linear dimension reduction techniques, and motivates a search for intrinsic learning models. In this paper we explore the idea of learning and exploiting the intrinsic geometry and regularity of the data in the context of regression analysis. Our goal is to build low-dimensional representations of high dimensional functions, while ensuring good generalization properties and fast implementation. In view of the complexity of the data, we allow interesting features to change from scale to scale and from location to location. Hence, we will develop multiscale methods, extending classical ideas of multi-resolution analysis beyond regular domains and to the random sample regime.

In regression, the problem is to estimate a function from a finite set of random samples. The minimax mean squared error (MSE) for estimating functions in the Hölder space $\mathcal{C}^s([0,1]^D)$, $s>0$, is $O(n^{-2s/(2s+D)})$, where $n$ is the number of samples. The exponential dependence of the minimax rate on $D$ manifests the curse of dimensionality in statistical learning, as $n = O(\varepsilon^{-(2s+D)/s})$ points are generally needed to achieve accuracy $\varepsilon$. This rate is optimal (in the minimax sense), unless further structural assumptions are made [28,32]. If the samples concentrate near a $d$-dimensional set with $d\ll D$, and the function belongs to a nonuniform smoothness space $\mathcal{B}_S$ with $S>s$, we may hope to find estimators converging at rate $O(n^{-2S/(2S+d)})$. In this quantified sense, we may break the curse of dimensionality by adapting to the intrinsic dimension and regularity of the problem.

A possible approach to this problem is based on first performing dimension reduction, and then regression in the reduced space. Linear dimension reduction methods include principal component analysis (PCA) [24,25,39], for data concentrating on a single subspace, or subspace clustering [8,9,18,36,47], for a union of subspaces. Going beyond linear models, we encounter isomap [43], locally linear embedding [40], local tangent space alignment [49], Laplacian eigenmaps [2], Hessian eigenmaps [15] and diffusion maps [12]. Besides the classical principal component regression [26], in [33] diffusion maps are used for nonparametric regression, expanding the unknown function over the eigenfunctions of a kernel-based operator. It is proved that, when data lie on a $d$-dimensional manifold, the MSE converges at rate $O(n^{-1/O(d^2)})$. This rate depends only on the intrinsic dimension, but does not match the minimax rate in the Euclidean space. If infinitely many unlabeled points are sampled, so that the eigenfunctions are exactly computed, the MSE can achieve optimal rates for Sobolev functions with smoothness parameter at least 1. Similar results hold for regression with Laplacian eigenmaps [50].

Some regression methods have been shown to automatically adapt to the intrinsic dimension and perform as well as if the intrinsic domain were known. Results in this direction have been established for local linear regression [4], $k$-nearest neighbors [29], and kernel regression [31], where optimal rates depending on the intrinsic dimension were proved for functions in $\mathcal{C}^2$, $\mathcal{C}^1$, and $\mathcal{C}^s$ with $s\le 1$, respectively. Kernel methods such as kernel ridge regression are also known to adapt to the intrinsic dimension [41,48], while suitable variants of regression trees have been proved to attain intrinsic yet suboptimal learning rates [30]. On the other hand, dyadic partitioning estimates with piecewise polynomial regression can cover the whole scale of spaces $\mathcal{C}^s$, $s>0$ [21], and be combined with wavelet thresholding techniques to optimally adapt to broader classes of nonuniform regularity [5,6]. However, such estimators are cursed by the ambient dimension $D$, due to the exponential cardinality of a dyadic partition of the $D$-dimensional hypercube.

    This paper aims at generalizing dyadic partitioning estimates [5,6] to predict functions supported on low-dimensional sets, with optimal performance guarantees and low computational cost. We tie together ideas in classical statistical learning [20,21,45], multi-resolution analysis [12,13,38], and nonlinear approximation [11,16,17]. Our main tool is geometric multi-resolution analysis (GMRA) [1,34,35,37], which is a multiscale geometric approximation scheme for point clouds in high dimensions concentrating near low-dimensional sets. Using GMRA we learn low-dimensional local coordinates at multiple scales, on which we perform a multiscale regression estimate by fitting local polynomials. Inspired by wavelet thresholding techniques [5,6,11], we then compute differences between estimators at adjacent scales, and retain the locations where such differences are large enough. This empirically reveals where higher resolution is required to attain a good approximation, generating a data-driven partition which adapts to the local regularity of the function.

    Our approach has several distinctive features:

    (i) it is multiscale, and is therefore well-suited for data sets containing variable structural information at different scales;

    (ii) it is adaptive, allowing the function to have localized singularities or variable regularity;

    (iii) it is entirely data-driven, that is, it does not require a priori knowledge about the regularity of the function, and rather learns it automatically from the data;

    (iv) it is provable, with strong theoretical guarantees of optimal performance on large classes of priors;

    (v) it is efficient, having straightforward implementation, minor parameter tuning, and low computational cost.

We will prove that, for functions supported on a $d$-dimensional manifold and belonging to a rich model class characterized by a smoothness parameter $S$, the MSE of our estimator converges at rate $O((\log n/n)^{2S/(2S+d)})$. This model class contains classical Hölder continuous functions, but further accounts for potential nonuniform regularity. Our results show that, up to a logarithmic factor, we attain the same optimal learning rate as if the function was defined on a known Euclidean domain of dimension $d$, instead of an unknown manifold embedded in $\mathbb{R}^D$. In particular, the rate of convergence depends on the intrinsic dimension $d$ and not on the ambient dimension $D$. In terms of computations, all the constructions above can be realized by algorithms of complexity $O(n\log n)$, with constants linear in the ambient dimension $D$ and exponential in the intrinsic dimension $d$.

    The remainder of this paper is organized as follows. We conclude this section by defining some general notation and formalizing the problem setup. In Section 2 we review geometric multi-resolution analysis. In Section 3 we introduce our multiscale regression method, establish the performance guarantees, and discuss the computational complexity of our algorithms. The proofs of our results are collected in Section 4, with some technical details postponed to Appendix A.

Notation. $f\lesssim g$ and $f\gtrsim g$ mean that there exists a positive constant $C$, independent of any variable upon which $f$ and $g$ depend, such that $f\le Cg$ and $f\ge Cg$, respectively. $f\asymp g$ means that both $f\lesssim g$ and $f\gtrsim g$ hold. The cardinality of a set $A$ is denoted by $\#A$. For $x\in\mathbb{R}^D$, $\|x\|$ denotes the Euclidean norm and $B_r(x)$ denotes the Euclidean ball of radius $r$ centered at $x$. Given a subspace $V\subseteq\mathbb{R}^D$, we denote its dimension by $\dim(V)$ and the orthogonal projection onto $V$ by $\operatorname{Proj}_V$. Let $f,g:\mathcal{M}\to\mathbb{R}$ be two functions, and let $\rho$ be a probability measure supported on $\mathcal{M}$. We define the inner product of $f$ and $g$ with respect to $\rho$ as $\langle f,g\rangle := \int_{\mathcal{M}} f(x)g(x)\,d\rho$. The $L^2$ norm of $f$ with respect to $\rho$ is $\|f\| := \big(\int_{\mathcal{M}}|f(x)|^2\,d\rho\big)^{\frac12}$. Given $n$ i.i.d. samples $\{x_i\}_{i=1}^n$ of $\rho$, the empirical $L^2$ norm of $f$ is $\|f\|_n := \big(\frac1n\sum_{i=1}^n|f(x_i)|^2\big)^{\frac12}$. The $L^\infty$ norm of $f$ is $\|f\|_\infty := \operatorname{ess\,sup}|f|$. We denote probability and expectation by $\mathbb{P}$ and $\mathbb{E}$, respectively. For a fixed $M>0$, $T_M$ is the truncation operator defined by $T_M(x) := \min(|x|,M)\operatorname{sign}(x)$. We denote by $\mathbf{1}_{j,k}$ the indicator function of an indexed set $C_{j,k}$ (i.e., $\mathbf{1}_{j,k}(x)=1$ if $x\in C_{j,k}$, and $0$ otherwise).
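For concreteness, here is a minimal NumPy sketch (our own illustration, not part of the paper) of the truncation operator $T_M$ and the empirical $L^2$ norm just defined.

import numpy as np

def truncate(values, M):
    # T_M(x) := min(|x|, M) sign(x), i.e., clipping to the interval [-M, M].
    return np.clip(values, -M, M)

def empirical_l2_norm(f_values):
    # ||f||_n := ( (1/n) sum_i |f(x_i)|^2 )^(1/2), with f_values = (f(x_1), ..., f(x_n)).
    return np.sqrt(np.mean(np.abs(f_values) ** 2))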

Setup. We consider the problem of estimating a function $f:\mathcal{M}\to\mathbb{R}$ given $n$ samples $\{(x_i,y_i)\}_{i=1}^n$, where

● $\mathcal{M}$ is an unknown Riemannian manifold of dimension $d$ isometrically embedded in $\mathbb{R}^D$, with $d\ll D$;

● $\rho$ is an unknown probability measure supported on $\mathcal{M}$;

● $\{x_i\}_{i=1}^n$ are independently drawn from $\rho$;

● $y_i = f(x_i) + \zeta_i$;

● $f$ is bounded, with $\|f\|_\infty \le M$;

● $\{\zeta_i\}_{i=1}^n$ are i.i.d. sub-Gaussian random variables with sub-Gaussian norm $\|\zeta_i\|_{\psi_2}\le\sigma^2$, independent of the $x_i$'s.

We wish to construct an estimator $\widehat f$ of $f$ minimizing the mean squared error

\mathrm{MSE} := \mathbb{E}\,\|f-\widehat f\|^2 = \mathbb{E}\int_{\mathcal{M}}|f(x)-\widehat f(x)|^2\,d\rho.

Geometric multi-resolution analysis (GMRA) is an efficient tool to build low-dimensional representations of data concentrating on or near a low-dimensional set embedded in high dimensions. To keep the presentation self-contained, we summarize here the main ideas, and refer the reader to [1,34,37] for further details. Given a probability measure $\rho$ supported on a $d$-dimensional manifold $\mathcal{M}\subset\mathbb{R}^D$, GMRA performs the following steps:

(1). Construct a multiscale tree decomposition $\mathcal{T}$ of $\mathcal{M}$ into nested cells $\mathcal{T} := \{C_{j,k}\}_{k\in\mathcal{K}_j,\, j\in\mathbb{Z}}$, where $j$ represents the scale and $k$ the location. Here $\mathcal{K}_j$ is a location index set.

(2). Compute a local principal component analysis on each $C_{j,k}$. Let $c_{j,k}$ be the mean of $x$ on $C_{j,k}$, and $V_{j,k}$ the $d$-dimensional principal subspace of $C_{j,k}$. Define $P_{j,k}(x) := c_{j,k} + \operatorname{Proj}_{V_{j,k}}(x - c_{j,k})$.
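As an illustration of step (2), a minimal NumPy sketch (function and variable names are ours, not the authors') of the empirical center, principal subspace and affine projector for a single cell could read as follows.

import numpy as np

def local_pca(X_cell, d):
    # X_cell: (n_jk, D) array of the samples falling in the cell C_{j,k}; d: intrinsic dimension.
    c = X_cell.mean(axis=0)                   # empirical center \hat c_{j,k}
    Y = X_cell - c
    Sigma = Y.T @ Y / X_cell.shape[0]         # empirical covariance \hat Sigma_{j,k}
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    V = eigvecs[:, -d:]                       # \hat V_{j,k}: top-d principal directions
    projector = lambda x: c + V @ (V.T @ (x - c))   # affine projector \hat P_{j,k}(x)
    return c, V, eigvals[::-1], projector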

An ideal multiscale tree decomposition should satisfy assumptions (A1)–(A5) below for all integers $j\ge j_{\min}$:

(A1) For every $k\in\mathcal{K}_j$ and $k'\in\mathcal{K}_{j+1}$, either $C_{j+1,k'}\subseteq C_{j,k}$ or $\rho(C_{j+1,k'}\cap C_{j,k})=0$. The children of $C_{j,k}$ are the cells $C_{j+1,k'}$ such that $C_{j+1,k'}\subseteq C_{j,k}$. We assume that $1\le a_{\min}\le\#\{C_{j+1,k'}: C_{j+1,k'}\subseteq C_{j,k}\}\le a_{\max}$ for all $k\in\mathcal{K}_j$ and $j\ge j_{\min}$. Also, for every $C_{j,k}$, there exists a unique $k'\in\mathcal{K}_{j-1}$ such that $C_{j,k}\subseteq C_{j-1,k'}$. We call $C_{j-1,k'}$ the parent of $C_{j,k}$.

(A2) $\rho(\mathcal{M}\setminus\bigcup_{k\in\mathcal{K}_j}C_{j,k})=0$, i.e., $\Lambda_j := \{C_{j,k}\}_{k\in\mathcal{K}_j}$ is a partition of $\mathcal{M}$, up to negligible sets.

(A3) There exists $\theta_1>0$ such that $\#\Lambda_j\le 2^{jd}/\theta_1$.

(A4) There exists $\theta_2>0$ such that, if $x$ is drawn from $\rho$ conditioned on $C_{j,k}$, then $\|x-c_{j,k}\|\le\theta_2 2^{-j}$ almost surely.

(A5) Let $\lambda_1^{j,k}\ge\lambda_2^{j,k}\ge\cdots\ge\lambda_D^{j,k}$ be the eigenvalues of the covariance matrix $\Sigma_{j,k}$ of $\rho|_{C_{j,k}}$, defined in Table 1. Then:

Table 1.  Objects of GMRA. $V_{j,k}$ and $\widehat V_{j,k}$ are the eigenspaces associated with the largest $d$ eigenvalues of $\Sigma_{j,k}$ and $\widehat\Sigma_{j,k}$, respectively.
oracles | empirical counterparts
$\rho(C_{j,k})$ | $\widehat\rho(C_{j,k}) := \frac{\widehat n_{j,k}}{n}$, $\ \widehat n_{j,k} := \#\{x_i : x_i\in C_{j,k}\}$
$c_{j,k} := \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}} x\,d\rho$ | $\widehat c_{j,k} := \frac{1}{\widehat n_{j,k}}\sum_{x_i\in C_{j,k}} x_i$
$\Sigma_{j,k} := \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}}(x-c_{j,k})(x-c_{j,k})^T\,d\rho$ | $\widehat\Sigma_{j,k} := \frac{1}{\widehat n_{j,k}}\sum_{x_i\in C_{j,k}}(x_i-\widehat c_{j,k})(x_i-\widehat c_{j,k})^T$
$V_{j,k} := \operatorname{argmin}_{\dim V=d}\frac{1}{\rho(C_{j,k})}\int_{C_{j,k}}\|x-c_{j,k}-\operatorname{Proj}_V(x-c_{j,k})\|^2\,d\rho$ | $\widehat V_{j,k} := \operatorname{argmin}_{\dim V=d}\frac{1}{\widehat n_{j,k}}\sum_{x_i\in C_{j,k}}\|x_i-\widehat c_{j,k}-\operatorname{Proj}_V(x_i-\widehat c_{j,k})\|^2$
$P_{j,k}(x) := c_{j,k}+\operatorname{Proj}_{V_{j,k}}(x-c_{j,k})$ | $\widehat P_{j,k}(x) := \widehat c_{j,k}+\operatorname{Proj}_{\widehat V_{j,k}}(x-\widehat c_{j,k})$


(i) there exists $\theta_3>0$ such that, for every $j\ge j_{\min}$ and $k\in\mathcal{K}_j$, $\lambda_d^{j,k}\ge\theta_3 2^{-2j}/d$;

(ii) there exists $\theta_4\in(0,1)$ such that $\lambda_{d+1}^{j,k}\le\theta_4\lambda_d^{j,k}$.

These are natural properties for multiscale partitions generalizing dyadic partitions to non-Euclidean domains [10]. (A1) establishes that the cells constitute a tree structure. (A2) says that the cells at scale $j$ form a partition. (A3) guarantees that there are at most $2^{jd}/\theta_1$ cells at scale $j$. (A4) ensures that the diameter of all cells at scale $j$ is bounded by $2^{-j}$, up to a uniform constant. (A5)(i) assumes that the best rank-$d$ approximation to the covariance of a cell is close to the covariance matrix of a $d$-dimensional Euclidean ball, while (A5)(ii) assumes that the cell has significantly larger variance in $d$ directions than in all the remaining ones.

Since all cells at scale $j$ have similar diameter, $\Lambda_j$ is called a uniform partition. A master tree $\mathcal{T}$ is a tree satisfying the properties above. A proper subtree $\widetilde{\mathcal{T}}$ of $\mathcal{T}$ is a collection of nodes of $\mathcal{T}$ with the following properties: the root node is in $\widetilde{\mathcal{T}}$; if a node is in $\widetilde{\mathcal{T}}$, then its parent is also in $\widetilde{\mathcal{T}}$. Any finite proper subtree $\widetilde{\mathcal{T}}$ is associated with a unique partition $\Lambda = \Lambda(\widetilde{\mathcal{T}})$ consisting of its outer leaves, by which we mean those nodes that are not in $\widetilde{\mathcal{T}}$, but whose parent is.
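The outer-leaf construction is easy to implement. The following sketch (a hypothetical dictionary-based tree representation, not from the paper) returns the partition associated with a proper subtree.

def outer_leaves(children, proper_subtree, root):
    # children: dict mapping each node to the list of its children in the master tree.
    # proper_subtree: set of nodes containing the root and closed under taking parents.
    # Returns the outer leaves: nodes not in the subtree whose parent is in the subtree.
    partition = []
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child in proper_subtree:
                stack.append(child)        # keep descending inside the proper subtree
            else:
                partition.append(child)    # child is an outer leaf
    return partition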

In practice, the master tree $\mathcal{T}$ is not given. We will construct one by an application of the cover tree algorithm [3] (see [34, Algorithm 3]). In order to make the samples for tree construction and function estimation independent from each other, we split the data in half and use one subset to construct the tree and the other one for local PCA and regression. From now on we index the training data as $\{(x_i,y_i)\}_{i=1}^{2n}$, and split them as $\{(x_i,y_i)\}_{i=1}^{2n} = \{(x_i,y_i)\}_{i=1}^{n}\cup\{(x_i,y_i)\}_{i=n+1}^{2n}$. Running [34, Algorithm 3] on $\{x_i\}_{i=n+1}^{2n}$, we construct a family of cells $\{\widehat C_{j,k}\}_{k\in\mathcal{K}_j,\,j_{\min}\le j\le j_{\max}}$ which satisfies (A1)–(A4) with high probability if $\rho$ is doubling*; furthermore, if $\mathcal{M}$ is a $\mathcal{C}^s$, $s\in(1,\infty)$, $d$-dimensional closed Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and $\rho$ is the volume measure on $\mathcal{M}$, then (A5) is satisfied as well:

*$\rho$ is doubling if there exists $C_1>1$ such that $C_1^{-1}r^d\le\rho(\mathcal{M}\cap B_r(x))\le C_1 r^d$ for any $x\in\mathcal{M}$ and $r>0$; $C_1$ is called the doubling constant of $\rho$. See also [10,14].

Proposition 1 (Proposition 14 in [34]). Assume $\rho$ is a doubling probability measure on $\mathcal{M}$ with doubling constant $C_1$. Then, the $\widehat C_{j,k}$'s constructed from [34, Algorithm 3] satisfy:

(a1) (A1) with $a_{\max} = C_1^2(24)^d$ and $a_{\min} = 1$;

(a2) let $\widehat{\mathcal{M}} = \bigcap_{j=j_{\min}}^{j_{\max}}\bigcup_{k\in\mathcal{K}_j}\widehat C_{j,k}$; for any $\nu>0$,

\mathbb{P}\left\{\rho(\mathcal{M}\setminus\widehat{\mathcal{M}}) > \frac{28\nu\log n}{3n}\right\} \le 2n^{-\nu};

(a3) (A3) with $\theta_1 = C_1^{-1}4^{-d}$;

(a4) (A4) with $\theta_2 = 3$.

If additionally $\mathcal{M}$ is a $\mathcal{C}^s$, $s\in(1,\infty)$, $d$-dimensional closed Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and $\rho$ is the volume measure on $\mathcal{M}$, then

(a5) (A5) is satisfied when $j$ is sufficiently large.

Since there are finitely many training points, the constructed master tree has a finite number of nodes. We first build a tree whose leaves contain a single point, and then prune it to the largest subtree whose leaves contain at least $d$ training points. This pruned tree associated with the $\widehat C_{j,k}$'s is called the data master tree, and denoted by $\mathcal{T}_n$. The $\widehat C_{j,k}$'s cover $\widehat{\mathcal{M}}$, which represents the part of $\mathcal{M}$ that has been explored by the data.

Even though assumption (A2) is not exactly satisfied, we claim that (a2) is sufficient for our performance guarantees, for example in the case where $\|f\|_\infty\le M$. Indeed, simply estimating $f$ on $\mathcal{M}\setminus\widehat{\mathcal{M}}$ by $0$, for any $\nu>0$ we have

\mathbb{P}\left\{\int_{\mathcal{M}\setminus\widehat{\mathcal{M}}} f^2\,d\rho \ge \frac{28M^2\nu\log n}{3n}\right\} \le 2n^{-\nu} \qquad\text{and}\qquad \mathbb{E}\int_{\mathcal{M}\setminus\widehat{\mathcal{M}}} f^2\,d\rho \le \frac{56M^2\nu\log n}{3n^{1+\nu}}.

In view of these bounds, the rate of convergence on $\mathcal{M}\setminus\widehat{\mathcal{M}}$ is faster than the ones we will obtain on $\widehat{\mathcal{M}}$. We will therefore assume (A2), thanks to (a2). Also, it may happen that conditions (A3)–(A5) are satisfied at the coarsest scales with very poor constants $\theta$. Nonetheless, it will be clear that in all that follows we may discard a few coarse scales, and only work at scales that are fine enough and for which (A3)–(A5) truly capture in a quantitative way the local geometry of $\mathcal{M}$. Since regression is performed on an independent subset of data, we can assume, by conditioning, that the $\widehat C_{j,k}$'s are given and satisfy the required assumptions. To keep the notation simple, from now on we will use $C_{j,k}$ instead of $\widehat C_{j,k}$, and $\mathcal{M}$ in place of $\widehat{\mathcal{M}}$, with a slight abuse of notation.

Besides the cover tree, there are other methods that can be applied in practice to obtain multiscale partitions, such as METIS [27], used in [1], iterated PCA [42], and iterated $k$-means. These methods can be computationally more efficient than the cover tree, but lead to partitions where properties (A1)–(A5) are not guaranteed to hold.

After constructing the multiscale tree $\mathcal{T}$, GMRA computes a collection of affine projectors $\{P_j:\mathbb{R}^D\to\mathbb{R}^D\}_{j\ge j_{\min}}$. The main objects of GMRA, in their population and sample versions, are summarized in Table 1. Given a suitable partition $\Lambda\subset\mathcal{T}$, $\mathcal{M}$ can be approximated by the piecewise linear set $\{P_{j,k}(C_{j,k})\}_{C_{j,k}\in\Lambda}$.

Given a multiscale tree decomposition $\{C_{j,k}\}_{j,k}$ and training samples $\{(x_i,y_i)\}_{i=1}^n$, we construct a family $\{\widehat f_{j,k}\}_{j,k}$ of local estimates of $f$ in two stages: first we compute local coordinates on $C_{j,k}$ using the GMRA outlined above, and then we estimate $f|_{C_{j,k}}$ by fitting a polynomial of order $\ell$ on such coordinates. A global estimator $\widehat f_\Lambda$ is finally obtained by summing the local estimates over a suitable partition $\Lambda$. Our regression method is specified in Algorithm 1. The explicit constructions of the constant ($\ell=0$) and linear ($\ell=1$) local estimators are detailed in Table 2.

Table 2.  Constant and linear local estimators. The truncation in $[\Lambda^{j,k}]_d$ regularizes the least squares problem, which is ill-posed due to the small eigenvalues $\{\lambda_l^{j,k}\}_{l=d+1}^D$.
oracles | empirical counterparts
piecewise constant ($\ell=0$)
$g^0_{j,k}(x) := \bar y_{j,k} := \frac{1}{\rho(C_{j,k})}\int_{C_{j,k}} y\,d\rho$ | $\widehat g^0_{j,k}(x) := \widehat y_{j,k} := \frac{1}{\widehat n_{j,k}}\sum_{x_i\in C_{j,k}} y_i$
$f^0_{j,k}(x) := T_M[g^0_{j,k}(x)]$ | $\widehat f^0_{j,k}(x) := T_M[\widehat g^0_{j,k}(x)]$
piecewise linear ($\ell=1$)
$g^1_{j,k}(x) := [\pi_{j,k}(x)^T\ \ 2^{-j}]\,\beta_{j,k}$ | $\widehat g^1_{j,k}(x) := [\widehat\pi_{j,k}(x)^T\ \ 2^{-j}]\,\widehat\beta_{j,k}$
$\beta_{j,k} := \begin{bmatrix}[\Lambda^{j,k}]_d^{-1} & 0\\ 0 & 2^{2j}\end{bmatrix}\frac{1}{\rho(C_{j,k})}\int_{C_{j,k}} y\begin{bmatrix}\pi_{j,k}(x)\\ 2^{-j}\end{bmatrix}d\rho$ | $\widehat\beta_{j,k} := \begin{bmatrix}[\widehat\Lambda^{j,k}]_d^{-1} & 0\\ 0 & 2^{2j}\end{bmatrix}\frac{1}{\widehat n_{j,k}}\sum_{x_i\in C_{j,k}} y_i\begin{bmatrix}\widehat\pi_{j,k}(x_i)\\ 2^{-j}\end{bmatrix}$
$[\Lambda^{j,k}]_d := \operatorname{diag}(\lambda_1^{j,k},\ldots,\lambda_d^{j,k})$ | $[\widehat\Lambda^{j,k}]_d := \operatorname{diag}(\widehat\lambda_1^{j,k},\ldots,\widehat\lambda_d^{j,k})$
$f^1_{j,k}(x) := T_M[g^1_{j,k}(x)]$ | $\widehat f^1_{j,k}(x) := T_M[\widehat g^1_{j,k}(x)]$

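A minimal NumPy sketch of the empirical local estimators of Table 2 on a single cell (our own illustration; it assumes $\widehat c_{j,k}$ and $\widehat V_{j,k}$ have been computed by local PCA as in the earlier sketch, and uses the sample variances of the local coordinates as the top-$d$ eigenvalues).

import numpy as np

def local_estimator(X_cell, y_cell, c_hat, V_hat, j, M, order=1):
    # Empirical local estimator \hat f^ell_{j,k} of Table 2, for ell = 0 (constant) or 1 (linear).
    if order == 0:
        y_bar = y_cell.mean()                              # \hat y_{j,k}
        return lambda x: np.clip(y_bar, -M, M)             # T_M[\hat g^0_{j,k}]
    Pi = (X_cell - c_hat) @ V_hat                          # coordinates \hat pi_{j,k}(x_i)
    lam_d = (Pi ** 2).mean(axis=0)                         # top-d eigenvalues of \hat Sigma_{j,k}
    scale = 2.0 ** (-j)
    Z = np.hstack([Pi, np.full((Pi.shape[0], 1), scale)])  # rows [pi(x_i)^T, 2^{-j}]
    r = (y_cell[:, None] * Z).mean(axis=0)                 # (1/n_jk) sum_i y_i [pi(x_i); 2^{-j}]
    beta = np.concatenate([1.0 / lam_d, [scale ** -2]]) * r  # \hat beta_{j,k}
    def f_hat(x):
        z = np.concatenate([V_hat.T @ (x - c_hat), [scale]])
        return np.clip(z @ beta, -M, M)                    # T_M[\hat g^1_{j,k}(x)]
    return f_hat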

In order to analyze the performance of our method, we introduce the oracle estimator $f^\ell_\Lambda$ based on the distribution $\rho$, defined by

\pi_{j,k}: C_{j,k}\to\mathbb{R}^d,\quad \pi_{j,k}(x) := V_{j,k}^T(x-c_{j,k}), \qquad p^\ell_{j,k} := \operatorname{argmin}_{p\in\mathcal{P}^\ell}\int_{C_{j,k}}|y - p\circ\pi_{j,k}(x)|^2\,d\rho,

f^\ell_{j,k} := T_M[p^\ell_{j,k}\circ\pi_{j,k}], \qquad f^\ell_\Lambda := \sum_{C_{j,k}\in\Lambda} f^\ell_{j,k}\,\mathbf{1}_{j,k},

and split the MSE into a bias and a variance term:

\mathbb{E}\|f-\widehat f^\ell_\Lambda\|^2 \le 2\underbrace{\|f-f^\ell_\Lambda\|^2}_{\text{bias}^2} + 2\underbrace{\mathbb{E}\|f^\ell_\Lambda-\widehat f^\ell_\Lambda\|^2}_{\text{variance}}.   (1)

The bias term is a deterministic approximation error, and will be handled by assuming suitable regularity models for $\rho$ and $f$ (see Definitions 1 and 3). The variance term quantifies the stochastic error arising from finite-sample estimation, and will be bounded using concentration inequalities (see Proposition 2). The role of $\Lambda$, encoded in its size $\#\Lambda$, is crucial to balance (1). We will discuss two possible choices: uniform partitions in Section 3.1, and partitions adaptive at multiple scales in Section 3.2.

Algorithm 1 GMRA regression
Input: training data $\{(x_i,y_i)\}_{i=1}^{2n}$, intrinsic dimension $d$, bound $M$, polynomial order $\ell$, approximation type (uniform or adaptive).
Output: multiscale tree decomposition $\mathcal{T}_n$, partition $\Lambda$, piecewise $\ell$-order polynomial estimator $\widehat f^\ell_\Lambda$.
1: construct a multiscale tree $\mathcal{T}_n$ by [34, Algorithm 3] on $\{x_i\}_{i=n+1}^{2n}$;
2: compute centers $\widehat c_{j,k}$ and subspaces $\widehat V_{j,k}$ by empirical GMRA on $\{x_i\}_{i=1}^n$;
3: define coordinates $\widehat\pi_{j,k}$ on $C_{j,k}$:
\widehat\pi_{j,k}: C_{j,k}\to\mathbb{R}^d,\quad \widehat\pi_{j,k}(x) := \widehat V_{j,k}^T(x-\widehat c_{j,k});
4: compute local estimators $\widehat g^\ell_{j,k}$ by solving the following least squares problems over the space $\mathcal{P}^\ell$ of polynomials of degree $\ell$:
\widehat p^\ell_{j,k} := \operatorname{argmin}_{p\in\mathcal{P}^\ell}\frac{1}{\widehat n_{j,k}}\sum_{i=1}^n|y_i - p\circ\widehat\pi_{j,k}(x_i)|^2\,\mathbf{1}_{j,k}(x_i), \qquad \widehat g^\ell_{j,k} := \widehat p^\ell_{j,k}\circ\widehat\pi_{j,k};   (2)
5: truncate $\widehat g^\ell_{j,k}$ by $M$:
\widehat f^\ell_{j,k} := T_M[\widehat g^\ell_{j,k}];
6: construct a uniform (see Section 3.1) or adaptive (see Section 3.2) partition $\Lambda$;
7: define the global estimator $\widehat f^\ell_\Lambda$ by summing the local estimators over the partition $\Lambda$:
\widehat f^\ell_\Lambda := \sum_{C_{j,k}\in\Lambda}\widehat f^\ell_{j,k}\,\mathbf{1}_{j,k}.

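For completeness, a small sketch of how the global estimator in step 7 is evaluated at a query point, assuming a hypothetical membership test contains(cell, x) playing the role of the indicator $\mathbf{1}_{j,k}(x)$ (e.g., membership in the corresponding cover tree cell).

def predict(x, partition, local_estimators, contains):
    # Evaluate \hat f_Lambda(x) = sum_{C_{j,k} in Lambda} \hat f_{j,k}(x) 1_{j,k}(x).
    # partition: list of cell identifiers (j, k); local_estimators: dict (j, k) -> fitted local fit.
    for cell in partition:
        if contains(cell, x):
            return local_estimators[cell](x)
    return 0.0  # x lies outside the explored region \hat M; estimate f by 0 there, as in the text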

A first natural choice for $\Lambda$ is a uniform partition $\Lambda_j := \{C_{j,k}\}_{k\in\mathcal{K}_j}$, $j\ge j_{\min}$. At scale $j$, $f$ is estimated by $\widehat f^\ell_{\Lambda_j} = \sum_{k\in\mathcal{K}_j}\widehat f^\ell_{j,k}\mathbf{1}_{j,k}$. The bias $\|f - f^\ell_{\Lambda_j}\|$ decays at a rate depending on the regularity of $f$, which can be quantified as follows:

Definition 1 (model class $\mathcal{A}^\ell_s$). A function $f:\mathcal{M}\to\mathbb{R}$ is in the class $\mathcal{A}^\ell_s$ for some $s>0$ with respect to the measure $\rho$ if

|f|_{\mathcal{A}^\ell_s} := \sup\limits_{\mathcal{T}}\ \sup\limits_{j\ge j_{\min}}\ \frac{\|f - f^\ell_{\Lambda_j}\|}{2^{-js}} < \infty,

where $\mathcal{T}$ ranges over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).

    We capture the case where the bias is roughly the same on every cell with the following definition:

Definition 2 (model class $\mathcal{A}^{\ell,\infty}_s$). A function $f:\mathcal{M}\to\mathbb{R}$ is in the class $\mathcal{A}^{\ell,\infty}_s$ for some $s>0$ with respect to the measure $\rho$ if

|f|_{\mathcal{A}^{\ell,\infty}_s} := \sup\limits_{\mathcal{T}}\ \sup\limits_{j\ge j_{\min}}\ \sup\limits_{k\in\mathcal{K}_j}\ \frac{\|(f - f^\ell_{j,k})\mathbf{1}_{j,k}\|}{2^{-js}\sqrt{\rho(C_{j,k})}} < \infty,

where $\mathcal{T}$ ranges over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).

Clearly $\mathcal{A}^{\ell,\infty}_s \subseteq \mathcal{A}^\ell_s$. These classes contain uniformly regular functions on manifolds, such as Hölder functions.

Example 1. Let $\mathcal{M}$ be a closed smooth $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and let $\rho$ be the volume measure on $\mathcal{M}$. Consider a function $f:\mathcal{M}\to\mathbb{R}$ and a smooth chart $(U,\phi)$ on $\mathcal{M}$. The function $\widetilde f:\phi(U)\to\mathbb{R}$ defined by $\widetilde f(v) = f\circ\phi^{-1}(v)$ is called the coordinate representation of $f$. Let $\lambda = (\lambda_1,\ldots,\lambda_d)$ be a multi-index with $|\lambda| := \lambda_1+\cdots+\lambda_d = \ell$. The $\ell$-order $\lambda$-derivative of $f$ is defined as

\partial^\lambda f(x) := \partial^\lambda(f\circ\phi^{-1})(\phi(x)).

Hölder functions $\mathcal{C}^{\ell,\alpha}$ on $\mathcal{M}$ with $\ell\in\mathbb{N}$ and $\alpha\in(0,1]$ are defined as follows: $f\in\mathcal{C}^{\ell,\alpha}$ if the $\ell$-order derivatives of $f$ exist, and

|f|_{\mathcal{C}^{\ell,\alpha}} := \max\limits_{|\lambda|=\ell}\ \sup\limits_{x\ne z}\ \frac{|\partial^\lambda f(x)-\partial^\lambda f(z)|}{d(x,z)^\alpha} < \infty,

$d(x,z)$ being the geodesic distance between $x$ and $z$. We will always assume to work at sufficiently fine scales, at which $d(x,z)\asymp\|z-x\|_{\mathbb{R}^D}$. Note that $\mathcal{C}^{\ell,1}$ is the space of $\ell$-times continuously differentiable functions on $\mathcal{M}$ with Lipschitz $\ell$-order derivatives. We have $\mathcal{C}^{\ell,\alpha}\subseteq\mathcal{A}^{\ell,\infty}_{\ell+\alpha}$ with $|f|_{\mathcal{A}^{\ell,\infty}_{\ell+\alpha}} \le \theta_2^{\ell+\alpha}\,d^\ell\,|f|_{\mathcal{C}^{\ell,\alpha}}/\ell!$. The proof is in Appendix A.

Example 2. Let $\mathcal{M}$ be a smooth closed Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and let $\rho$ be the volume measure on $\mathcal{M}$. Let $\Omega\subset\mathcal{M}$ be such that $\Gamma := \partial\Omega$ is a smooth and closed $d_\Gamma$-dimensional submanifold with finite reach. Let $g = a\mathbf{1}_\Omega + b\mathbf{1}_{\mathcal{M}\setminus\Omega}$ for some $a,b\in\mathbb{R}$, where $\mathbf{1}_S$ denotes the indicator function of a set $S$. Then $g\in\mathcal{A}^\ell_{(d-d_\Gamma)/2}$ for every $\ell = 0,1,2,\ldots$; however, $g\notin\mathcal{A}^{\ell,\infty}_s$ for any $s>0$. The proof is in Appendix A.

The reach of $\mathcal{M}$ is an important global characteristic of $\mathcal{M}$. Let $\mathcal{D}(\mathcal{M}) := \{y\in\mathbb{R}^D : \exists!\ x\in\mathcal{M}\ \text{s.t.}\ \|x-y\| = \inf_{z\in\mathcal{M}}\|z-y\|\}$ and $\mathcal{M}_r := \{y\in\mathbb{R}^D : \inf_{x\in\mathcal{M}}\|x-y\| < r\}$. Then $\operatorname{reach}(\mathcal{M}) := \sup\{r\ge 0 : \mathcal{M}_r\subseteq\mathcal{D}(\mathcal{M})\}$. See also [19].

When we take uniform partitions $\Lambda = \Lambda_j$ in (1), the squared bias satisfies

\|f - f^\ell_{\Lambda_j}\|^2 \le |f|^2_{\mathcal{A}^\ell_s}\,2^{-2js}

whenever $f\in\mathcal{A}^\ell_s$, which decreases as $j$ increases. On the other hand, Proposition 2 shows that the variance at scale $j$ satisfies

\mathbb{E}\|f^\ell_{\Lambda_j} - \widehat f^\ell_{\Lambda_j}\|^2 \le O\!\left(\frac{j\,2^{jd}}{n}\right),

which increases as $j$ increases. Choosing the optimal scale $j^\star$ in the bias-variance tradeoff, we obtain the following rate of convergence for uniform estimators:

Theorem 1. Suppose $\|f\|_\infty\le M$ and $f\in\mathcal{A}^\ell_s$ for $\ell\in\{0,1\}$ and $s>0$. Let $j^\star$ be chosen such that

2^{-j^\star} := \mu\left(\frac{\log n}{n}\right)^{\frac{1}{2s+d}}

for $\mu>0$. Then there exist positive constants $c := c(\theta_1,d,\mu)$ and $C := C(\theta_1,d,\mu)$ for $\ell=0$, or $c := c(\theta_1,\theta_2,\theta_3,d,\mu)$ and $C := C(\theta_1,\theta_2,\theta_3,d,\mu)$ for $\ell=1$, such that:

(a) for every $\nu>0$ there is $c_\nu>0$ such that

\mathbb{P}\left\{\|f - \widehat f^\ell_{\Lambda_{j^\star}}\| > \left(|f|_{\mathcal{A}^\ell_s}\mu^s + c_\nu\right)\left(\frac{\log n}{n}\right)^{\frac{s}{2s+d}}\right\} \le Cn^{-\nu},

where $c_\nu := c_\nu(\nu,\theta_1,d,M,\sigma,s,\mu)$ for $\ell=0$, and $c_\nu := c_\nu(\nu,\theta_1,\theta_2,\theta_3,d,M,\sigma,s,\mu)$ for $\ell=1$;

(b) $\mathbb{E}\|f - \widehat f^\ell_{\Lambda_{j^\star}}\|^2 \le \left(|f|^2_{\mathcal{A}^\ell_s}\mu^{2s} + c\max(M^2,\sigma^2)\right)\left(\frac{\log n}{n}\right)^{\frac{2s}{2s+d}}$.

Theorem 1 is proved in Section 4. Note that the rate depends on the intrinsic dimension $d$ instead of the ambient dimension $D$. Moreover, the rate is optimal (up to logarithmic factors) at least in the case of $\mathcal{C}^{\ell,\alpha}$ functions on $\mathcal{M}$, as discussed in Example 1.

Theorem 1 is not fully satisfactory for two reasons: (i) the choice of the optimal scale requires knowledge of the regularity of the unknown function; (ii) no uniform scale can be optimal if the regularity of the function varies at different locations and scales. We thus propose an adaptive estimator which learns near-optimal partitions from data, without knowing the possibly nonuniform regularity of the function. Adaptive partitions are selected by a criterion that determines whether or not a cell should be picked. The quantities involved in this selection are summarized in Table 3, along with their empirical versions.

Table 3.  Local approximation difference between scales.
oracles | empirical counterparts
$W^\ell_{j,k} := (f^\ell_{\Lambda_j} - f^\ell_{\Lambda_{j+1}})\mathbf{1}_{j,k}$ | $\widehat W^\ell_{j,k} := (\widehat f^\ell_{\Lambda_j} - \widehat f^\ell_{\Lambda_{j+1}})\mathbf{1}_{j,k}$
$\Delta^\ell_{j,k} := \|W^\ell_{j,k}\|$ | $\widehat\Delta^\ell_{j,k} := \|\widehat W^\ell_{j,k}\|_n$


$\Delta^\ell_{j,k}$ plays the role of the magnitude of a wavelet coefficient in typical wavelet thresholding constructions, and reduces to it in the case of Haar wavelets on Euclidean domains with dyadic partitioning. It measures the local difference in approximation between two consecutive scales: a large $\Delta^\ell_{j,k}$ suggests a significant reduction of error if we refine $C_{j,k}$ to its children. Intuitively, we should truncate the master tree to the subtree including the nodes where this quantity is large. However, if too few samples exist in a node, then the empirical counterpart $\widehat\Delta^\ell_{j,k}$ cannot be trusted. We thus proceed as follows. We set a threshold $\tau_n$ decreasing in $n$, and let $\widehat{\mathcal{T}}_n(\tau_n)$ be the smallest proper subtree of $\mathcal{T}_n$ containing all $C_{j,k}$'s for which $\widehat\Delta^\ell_{j,k}\ge\tau_n$. Crucially, $\tau_n$ may be chosen independently of the regularity of $f$ (see Theorem 2). We finally define our adaptive partition $\widehat\Lambda_n(\tau_n)$ as the partition associated with the outer leaves of $\widehat{\mathcal{T}}_n(\tau_n)$. The procedure is summarized in Algorithm 2.

Algorithm 2 Adaptive partition
Input: training data $\{(x_i,y_i)\}_{i=1}^n$, multiscale tree decomposition $\mathcal{T}_n$, local $\ell$-order polynomial estimates $\{\widehat f^\ell_{j,k}\}_{j,k}$, threshold parameter $\kappa$.
Output: adaptive partition $\widehat\Lambda_n(\tau_n)$.
1: compute the approximation difference $\widehat\Delta^\ell_{j,k}$ on every node $C_{j,k}\in\mathcal{T}_n$;
2: set the threshold $\tau_n := \kappa\sqrt{(\log n)/n}$;
3: select the smallest proper subtree $\widehat{\mathcal{T}}_n(\tau_n)$ of $\mathcal{T}_n$ containing all $C_{j,k}$'s with $\widehat\Delta^\ell_{j,k}\ge\tau_n$;
4: define the adaptive partition $\widehat\Lambda_n(\tau_n)$ associated with the outer leaves of $\widehat{\mathcal{T}}_n(\tau_n)$.

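A compact sketch of Algorithm 2 under a dictionary-based tree interface (our own hypothetical representation): cells whose empirical difference $\widehat\Delta_{j,k}$ exceeds $\tau_n$ are kept together with all their ancestors, so that the selected set is a proper subtree, and the adaptive partition is formed by its outer leaves.

import numpy as np

def adaptive_partition(children, parent, root, delta_hat, n, kappa):
    # delta_hat: dict mapping each node (j, k) to the precomputed \hat Delta_{j,k}.
    tau_n = kappa * np.sqrt(np.log(n) / n)        # threshold tau_n = kappa sqrt(log n / n)
    subtree = {root}                              # smallest proper subtree containing all
    for node, delta in delta_hat.items():         # nodes with \hat Delta_{j,k} >= tau_n
        if delta >= tau_n:
            while node is not None and node not in subtree:
                subtree.add(node)
                node = parent.get(node)           # walk up to keep the subtree proper
    partition = []                                # outer leaves of the selected subtree
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child in subtree:
                stack.append(child)
            else:
                partition.append(child)
    return partition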

To provide performance guarantees for our adaptive estimator, we need to define a proper model class based on oracles. Given any master tree $\mathcal{T}$ satisfying assumptions (A1)–(A5) and a threshold $\tau>0$, we let $\mathcal{T}(\tau)$ be the smallest subtree of $\mathcal{T}$ containing all the cells $C_{j,k}$ with $\Delta^\ell_{j,k}\ge\tau$. The partition made of the outer leaves of $\mathcal{T}(\tau)$ is denoted by $\Lambda(\tau)$.

Definition 3 (model class $\mathcal{B}^\ell_s$). A function $f:\mathcal{M}\to\mathbb{R}$ is in the class $\mathcal{B}^\ell_s$ for some $s>0$ with respect to the measure $\rho$ if

|f|^p_{\mathcal{B}^\ell_s} := \sup\limits_{\mathcal{T}}\ \sup\limits_{\tau>0}\ \tau^p\,\#\mathcal{T}(\tau) < \infty, \qquad p = \frac{2d}{2s+d},

where $\mathcal{T}$ varies over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).

In general, the truncated tree $\mathcal{T}(\tau)$ grows as the threshold $\tau$ decreases. For elements of $\mathcal{B}^\ell_s$, we have control on the growth rate, namely $\#\mathcal{T}(\tau)\lesssim\tau^{-p}$. In the classical case of dyadic partitions of the Euclidean space with uniform measure, $\mathcal{B}_s$ is well understood as a nonlinear approximation space containing a scale of Besov spaces [11]. The class $\mathcal{B}^\ell_s$ is indeed rich, and contains in particular $\mathcal{A}^{\ell,\infty}_s$, while additionally capturing functions of nonuniform regularity.

Lemma 1. $\mathcal{A}^{\ell,\infty}_s \subseteq \mathcal{B}^\ell_s$. If $f\in\mathcal{A}^{\ell,\infty}_s$, then $f\in\mathcal{B}^\ell_s$ and $|f|_{\mathcal{B}^\ell_s} \le (a_{\min}/\theta_1)^{\frac{2s+d}{2d}}\,|f|_{\mathcal{A}^{\ell,\infty}_s}$.

    The proof is given in Appendix A.

Example 3. Let $g$ be the function in Example 2. Then $g\in\mathcal{B}^\ell_{d(d-d_\Gamma)/(2d_\Gamma)}$ for every $\ell = 0,1,2,\ldots$ Notice that $g\in\mathcal{A}^\ell_{(d-d_\Gamma)/2}$, so $g$ has a larger regularity parameter $s$ in the $\mathcal{B}^\ell_s$ model than in the $\mathcal{A}^\ell_s$ model.

We will also need a quasi-orthogonality condition ensuring that the functions $W^\ell_{j,k}$, representing the approximation difference between two scales, are almost orthogonal across scales.

Definition 4. We say that $f$ satisfies quasi-orthogonality of order $\ell$ with respect to the measure $\rho$ if there exists a constant $B_0>0$ such that, for any proper subtree $\mathcal{S}$ of any tree $\mathcal{T}$ satisfying assumptions (A1)–(A5),

\Big\|\sum\limits_{C_{j,k}\in\mathcal{T}\setminus\mathcal{S}} W^\ell_{j,k}\Big\|^2 \le B_0 \sum\limits_{C_{j,k}\in\mathcal{T}\setminus\mathcal{S}}\|W^\ell_{j,k}\|^2.

The following lemma shows that $f\in\mathcal{B}^\ell_s$, along with quasi-orthogonality, implies a certain approximation rate of $f$ by $f^\ell_{\Lambda(\tau)}$ as $\tau\to 0$. The proof is given in Appendix A.

Lemma 2. If $f\in\mathcal{B}^\ell_s\cap(L^\infty\cup\mathcal{A}^\ell_t)$ for some $s,t>0$, and $f$ satisfies quasi-orthogonality of order $\ell$, then

\|f - f^\ell_{\Lambda(\tau)}\|^2 \le B_{s,d}\,|f|^p_{\mathcal{B}^\ell_s}\,\tau^{2-p} \le B_{s,d}\,|f|^2_{\mathcal{B}^\ell_s}\,\#\Lambda(\tau)^{-\frac{2s}{d}}, \qquad p = \frac{2d}{2s+d},

with $B_{s,d} := \frac{B_0\,2^p}{1-2^{-(2-p)}}$.

    The main result of this paper is the following performance analysis of adaptive estimators, which is proved in Section 4.

Theorem 2. Let $\ell\in\{0,1\}$ and $M>0$. Suppose $\|f\|_\infty\le M$ and $f$ satisfies quasi-orthogonality of order $\ell$. Set $\tau_n := \kappa\sqrt{\log n/n}$. Then:

(a) For every $\nu>0$ there exists $\kappa_\nu := \kappa_\nu(a_{\max},\theta_2,\theta_3,d,M,\sigma,\nu)>0$ such that, whenever $f\in\mathcal{B}^\ell_s$ for some $s>0$ and $\kappa\ge\kappa_\nu$, there are $r,C>0$ such that

\mathbb{P}\left\{\|f - \widehat f^\ell_{\widehat\Lambda_n(\tau_n)}\| > r\left(\frac{\log n}{n}\right)^{\frac{s}{2s+d}}\right\} \le Cn^{-\nu}.

(b) There exists $\kappa_0 := \kappa_0(a_{\max},\theta_2,\theta_3,d,M,\sigma)$ such that, whenever $f\in\mathcal{B}^\ell_s$ for some $s>0$ and $\kappa\ge\kappa_0$, there is $\bar C>0$ such that

\mathbb{E}\|f - \widehat f^\ell_{\widehat\Lambda_n(\tau_n)}\|^2 \le \bar C\left(\frac{\log n}{n}\right)^{\frac{2s}{2s+d}}.

Here $r$ depends on $\theta_2$, $\theta_3$, $a_{\max}$, $d$, $M$, $s$, $|f|_{\mathcal{B}^\ell_s}$, $\sigma$, $B_0$, $\nu$, $\kappa$; $C$ depends on $\theta_2$, $\theta_3$, $a_{\max}$, $a_{\min}$, $d$, $s$, $|f|_{\mathcal{B}^\ell_s}$, $\kappa$; $\bar C$ depends on $\theta_2$, $\theta_3$, $a_{\max}$, $a_{\min}$, $d$, $M$, $s$, $|f|_{\mathcal{B}^\ell_s}$, $B_0$, $\kappa$.

    Theorem 2 is more satisfactory than Theorem 1 for two reasons: (i) the same rate is achieved for a richer model class; (ii) the estimator does not require a priori knowledge of the regularity of the function, since the choice of κ is independent of s.

For a given accuracy $\varepsilon$, in order to achieve $\mathrm{MSE}\le\varepsilon^2$, the number of samples we need is $n_\varepsilon \approx (1/\varepsilon)^{\frac{2s+d}{s}}\log(1/\varepsilon)$. When $s$ is unknown, we can determine $s$ as follows: we fix a small $n_0$, and run Algorithm 2 with $2n_0, 4n_0, \ldots, 2^j n_0, \ldots$ samples. For each sample size, we evenly split the data into a training set, used to build the adaptive estimator, and a test set, used to evaluate the MSE. According to Theorem 2, the MSE scales as $(\log n/n)^{\frac{2s}{2s+d}}$. Therefore, the slope in the log-log plot of the MSE versus $n$ gives an approximation of $2s/(2s+d)$. This could be formalized by a suitable adaptation of Lepski's method.
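A minimal sketch of this slope-based estimate of $s$ (the mse_values array is assumed to come from the train/test splits described above).

import numpy as np

def estimate_smoothness(sample_sizes, mse_values, d):
    # Fit log(MSE) against log(log n / n); under Theorem 2 the slope approximates 2s/(2s+d).
    sample_sizes = np.asarray(sample_sizes, dtype=float)
    x = np.log(np.log(sample_sizes) / sample_sizes)
    y = np.log(np.asarray(mse_values, dtype=float))
    slope, _ = np.polyfit(x, y, 1)
    return d * slope / (2.0 * (1.0 - slope))   # invert slope = 2s/(2s+d) for s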

    The computational cost of Algorithms 1 and 2 may be split as follows:

Tree construction. The cover tree is an online algorithm where a single-point insertion or removal takes cost at most $O(\log n)$. The total computational cost of the cover tree algorithm is $C^d D n\log n$, where $C>0$ is a constant [3].

Local PCA. At every scale $j$, we perform local PCA on the training data restricted to $C_{j,k}$, for every $k\in\mathcal{K}_j$, using the randomized PCA algorithm [22]. Recall that $\widehat n_{j,k}$ denotes the number of training points in $C_{j,k}$. The cost of local PCA at scale $j$ is on the order of $\sum_{k\in\mathcal{K}_j} Dd\,\widehat n_{j,k} = Ddn$, and there are at most $c\log n$ scales, where $c>0$ is a constant, which gives a total cost of $cDdn\log n$.

Multiscale regression. Given $\widehat n_{j,k}$ training points on $C_{j,k}$, computing the low-dimensional coordinates $\widehat\pi_{j,k}(x_i)$ for all $x_i\in C_{j,k}$ costs $Dd\,\widehat n_{j,k}$, and solving the linear least squares problem (2), where the matrix is of size $\widehat n_{j,k}\times d$, costs at most $\widehat n_{j,k}d^2$. Hence, constructing the $\ell$-order polynomials at scale $j$ takes $\sum_{k\in\mathcal{K}_j}(Dd\,\widehat n_{j,k} + d^2\,\widehat n_{j,k}) = (Dd+d^2)n$, and there are at most $c\log n$ scales, which sums up to $c(Dd+d^2)n\log n$.

Adaptive approximation. We need to compute the coefficients $\widehat\Delta^\ell_{j,k}$ for every $C_{j,k}$, which costs $2(Dd+d)\,\widehat n_{j,k}$ on $C_{j,k}$, and $2c(Dd+d)n\log n$ for the whole tree.

In summary, the total cost of constructing GMRA adaptive estimators of order $\ell$ is

C^d D n\log n + (4Dd + d^2 + d)\,c\,n\log n.

    The cost scales linearly with the number of samples n up to a logarithmic factor, and linearly with the ambient dimension D.

    We analyze the error of our estimator by a bias-variance decomposition as in (1). We present the variance estimate in Section 4.1, the proofs for uniform approximations in Section 4.2, and for adaptive approximations in Section 4.3.

    The following proposition bounds the variance of 0-order (piecewise constant) and 1st-order (piecewise linear) estimators over an arbitrary partition Λ.

Proposition 2. Suppose $\|f\|_\infty\le M$ and let $\ell\in\{0,1\}$. For any partition $\Lambda$, let $f^\ell_\Lambda$ and $\widehat f^\ell_\Lambda$ be the optimal approximation and the empirical estimator of order $\ell$ on $\Lambda$, respectively. Then, for every $\eta>0$,

(a) \mathbb{P}\left\{\|f^\ell_\Lambda - \widehat f^\ell_\Lambda\| > \eta\right\} \le \begin{cases} C_0\,\#\Lambda\,\exp\left(-\frac{n\eta^2}{c_0\max(M^2,\sigma^2)\,\#\Lambda}\right) & \text{for } \ell = 0 \\ C_1 d\,\#\Lambda\,\exp\left(-\frac{n\eta^2}{c_1\max(d^4 M^2,\, d^2\sigma^2)\,\#\Lambda}\right) & \text{for } \ell = 1, \end{cases}

(b) \mathbb{E}\|f^\ell_\Lambda - \widehat f^\ell_\Lambda\|^2 \le \begin{cases} \frac{c_0\max(M^2,\sigma^2)\,\#\Lambda\,\log(C_0\#\Lambda)}{n} & \text{for } \ell = 0 \\ \frac{c_1\max(d^4 M^2,\, d^2\sigma^2)\,\#\Lambda\,\log(C_1 d\,\#\Lambda)}{n} & \text{for } \ell = 1, \end{cases}

    for some absolute constants c_0, C_0 and some c_1, C_1 depending on \theta_2, \theta_3 .

    Proof. Since f_\Lambda ^\ell and \widehat{f}_\Lambda ^\ell are bounded by M , we define \Lambda ^{-} : = \{ C_{j, k}\in\Lambda :\rho(C_{j, k}) \le \frac{\eta^2}{4M^2\#\Lambda } \} , and observe that

    \sum\limits_{C_{j, k}\in \Lambda ^{-}} \| (f_\Lambda ^\ell - \widehat{f}_\Lambda ^\ell) {{\mathbf{1}}_{j,k}} \|^2 \le \eta^2 .

    We then restrict our attention to \Lambda ^{+} : = \Lambda \setminus \Lambda ^{-} and apply Lemma 5 with t = \frac{\eta}{\sqrt{\rho(C_{j, k})\#\Lambda }} . This leads to (a), while (b) follows from (a) by integrating over \eta > 0 .

Proof of Theorem 1. Notice that \#\Lambda _j \le 2^{jd}/\theta_1 by (A3). By choosing j^\star such that 2^{-j^\star} = \mu \left(\frac{\log n}{n}\right)^{\frac{1}{2s+d}} for some \mu > 0 , we have

    \|f - f_{\Lambda _{j^\star}}^\ell\| \le |f|_{\mathcal{A}_{s}^{\ell}} 2^{-j^\star s} \le |f|_{\mathcal{A}_{s}^{\ell}} \mu^s \left( \frac{\log n}{n}\right)^{\frac{s}{2s+d}}.

    Moreover, by Proposition 2,

    \begin{align*} \mathbb{P}\left\{ \|f^\ell_{\Lambda _{j^\star}} -\widehat{f}^\ell_{\Lambda _{j^\star}} \|\ge c_\nu \left(\frac{\log n}{n}\right)^{\frac{s}{2s+d}}\right\} & \le \begin{cases} \frac{C_0}{\theta_1 ~~\mu^d} (\log n)^{-\frac{d}{2s+d}} n^{- \left(\frac{\theta_1 ~~\mu^d ~~c_\nu^2}{c_0\max(M^2, \sigma^2)} -\frac{d}{2s+d} \right)} & (\ell = 0) \\ \frac{C_1 d}{\theta_1 ~~\mu^d} (\log n)^{-\frac{d}{2s+d}} n^{- \left(\frac{\theta_1 ~~\mu^d ~~c_\nu^2}{c_1\max(d^4 ~~M^2, d^2~~\sigma^2)} -\frac{d}{2s+d} \right)} & (\ell = 1) \end{cases} \ \le C n^{-\nu} \end{align*}

    provided that \frac{\theta_1 ~~\mu^d ~~c_\nu^2}{c_0\max(M^2, \sigma^2)} -\frac{d}{2s+d} > \nu for \ell = 0 and \frac{\theta_1 ~~\mu^d ~~c_\nu^2}{c_1\max(d^4 ~~M^2, d^2~~\sigma^2)} -\frac{d}{2s+d} > \nu for \ell = 1 .

    We begin by defining several objects of interest:

    \mathcal{T}_n : the data master tree whose leaves contain at least d points of training data. It can be viewed as the part of a multiscale tree that our training data have explored. Notice that

    \#\mathcal{T}_n \le \sum\limits_{j = 0}^\infty {{a}_{\text{min}}}^{-j} \frac n d = \frac{{{a}_{\text{min}}}}{{{a}_{\text{min}}}-1}\frac n d \le {{a}_{\text{min}}} \frac n d .

    \mathcal{T} : a complete multiscale tree containing \mathcal{T}_n .

\mathcal{T} can be viewed as the union of \mathcal{T}_n and some empty cells, mostly at fine scales with high probability, that our data have not explored.

    \mathcal{T}(\tau) : the smallest subtree of \mathcal{T} which contains \{C_{j, k} \in \mathcal{T} \, :\, \Delta_{j, k}^{\ell} \ge \tau\} .

    \mathcal{T}_n(\tau) : = \mathcal{T}(\tau) \cap \mathcal{T}_n .

    \widehat{\mathcal{T}}_n(\tau) : the smallest subtree of \mathcal{T}_n which contains \{C_{j, k} \in \mathcal{T}_n\, :\, \widehat{\Delta}_{j, k} \ge \tau\} .

    \Lambda (\tau) : the adaptive partition associated with \mathcal{T}(\tau) .

    \Lambda _n(\tau) : the adaptive partition associated with \mathcal{T}_n(\tau) .

    \widehat{\Lambda }_n(\tau) : the adaptive partition associated with \widehat{\mathcal{T}}_n(\tau) .

    ● Suppose \mathcal{T}^0 and \mathcal{T}^1 are two subtrees of \mathcal{T} . If \Lambda ^0 and {\Lambda }^1 are two adaptive partitions associated with \mathcal{T}^0 and \mathcal{T}^1 respectively, we denote by \Lambda ^0 \vee \Lambda ^1 and \Lambda ^0 \wedge \Lambda ^1 the partitions associated to the trees \mathcal{T}^0 \cup \mathcal{T}^1 and \mathcal{T}^0 \cap \mathcal{T}^1 respectively.

    ● Let b = 2{{a}_{\text{max}}}+5 where {{a}_{\text{max}}} is the maximal number of children that a node has in \mathcal{T}_n .

    Inspired by the analysis of wavelet thresholding procedures [5,6], we split the error into four terms,

    \| f - \widehat{f}^\ell_{\widehat{\Lambda }_n(\tau_n)} \| \le e_1 + e_2 + e_3 +e_4 ,

    where

    \begin{alignat*} {2} & e_1 : = \| f - f^\ell_{\widehat{\Lambda }_n(\tau_n)\vee \Lambda _n(b\tau_n)} \| && \qquad e_2 : = \| f^\ell_{\widehat{\Lambda }_n(\tau_n)\vee \Lambda _n(b\tau_n)} - f^\ell_{\widehat{\Lambda }_n(\tau_n)\wedge \Lambda _n(\tau_n/b)} \| \\ & e_3 : = \| f^\ell_{\widehat{\Lambda }_n(\tau_n)\wedge \Lambda _n(\tau_n/b)} - \widehat{f}^\ell_{\widehat{\Lambda }_n(\tau_n)\wedge \Lambda _n(\tau_n/b)} \| && \qquad e_4 : = \| \widehat{f}^\ell_{\widehat{\Lambda }_n(\tau_n)\wedge \Lambda _n(\tau_n/b)} - \widehat{f}^\ell_{\widehat{\Lambda }_n(\tau_n)} \| . \end{alignat*}

The goal of the splitting above is to handle the bias and variance separately, as well as to deal with the fact that the partition built from those C_{j, k} such that \widehat{\Delta}_{j, k} \ge \tau_n does not coincide with the partition which would be chosen by an oracle based on those C_{j, k} such that \Delta_{j, k}^{\ell} \ge \tau_n . This is accounted for by the terms e_2 and e_4 , which correspond to those C_{j, k} such that \widehat{\Delta}_{j, k} is significantly larger or smaller than \Delta_{j, k}^{\ell} respectively, and which will be proved to be small in probability. The e_1 and e_3 terms correspond to the bias and variance of oracle estimators based on partitions obtained by thresholding the unknown oracle change in approximation \Delta_{j, k}^{\ell} .

    Since \widehat{\Lambda }_n(\tau_n)\vee \Lambda _n(b\tau_n) is a finer partition than \Lambda _n(b\tau_n) , we have

    e_1 \le \| f - f^\ell_{\Lambda _n(b\tau_n)} \| \le \| f - f^\ell_{\Lambda (b\tau_n)} \|+ \| f^\ell_{\Lambda (b\tau_n)} - f^\ell_{\Lambda _n(b\tau_n)} \| = : e_{11} + e_{12}\, .

    The e_{11} term is treated by a deterministic estimate based on the model class \mathcal{B}_{s}^{\ell} : by Lemma 2 we have

e_{11}^2 \le B_{s, d} |f|_{\mathcal{B}_{s}^{\ell}}^p (b\kappa)^{2-p} (\log n / n)^{\frac{2s}{2s+d}} .

    The term e_{12} accounts for the error on the cells that have not been explored by our training data, which is small:

    \begin{align*} \mathbb{P} \{ e_{12} > 0 \} &\le \mathbb{P} \{ \exists \ C_{j, k} \in \mathcal{T}(b\tau_n) \setminus \mathcal{T}_n(b\tau_n) \} \\ & = \mathbb{P} \{ \exists \ C_{j, k} \in \mathcal{T}(b\tau_n) : \Delta_{j, k}^{\ell} \ge b\tau_n ~\text{ and }~\widehat{\rho}(C_{j, k}) < d/n \} \\ &\le \sum\limits_{C_{j, k}\in\mathcal{T}(b\tau_n)} \mathbb{P} \{\Delta_{j, k}^{\ell} \ge b\tau_n ~\text{ and }~ \widehat{\rho}(C_{j, k}) < d/n \} . \end{align*}

    According to (4), we have (\Delta_{j, k}^{\ell})^2 \le 4\|f\|_\infty^2 \rho(C_{j, k}) . Then every C_{j, k} with \Delta_{j, k}^{\ell} \ge b\tau_n satisfies \rho(C_{j, k}) \gtrsim \frac{b^2\kappa^2}{\|f\|_\infty^2} (\log n / n) . Hence, provided that n satisfies \frac{b^2\kappa^2}{\|f\|_\infty^2} \log n \gtrsim 2d , we have

    \begin{align*} \mathbb{P} \{\Delta_{j, k}^{\ell} \ge b\tau_n ~\text{ and }~ \widehat{\rho}(C_{j, k}) < d/n \} &\le \mathbb{P} \left\{ |\rho(C_{j, k}) - \widehat{\rho}(C_{j, k})| \ge \frac{1}{2} \rho(C_{j, k}) ~\text{ and }~ \rho(C_{j, k}) \gtrsim \frac{b^2\kappa^2}{\|f\|_\infty^2} \frac{\log n}{n} \right\} \\ & \le 2n^{- \frac{3b^2\kappa^2}{28\|f\|_\infty^2 }}, \end{align*}

    where the last inequality follows from Lemma 3(b). Therefore, by Definition 3 we obtain

    \begin{align*} \mathbb{P} \{ e_{12} > 0 \} &\lesssim \#\mathcal{T}(b\tau_n) n^{- \frac{3b^2\kappa^2}{28\|f\|_\infty^2}} \le |f|_{\mathcal{B}_{s}^{\ell}}^{p} (b\tau_n)^{-p} n^{- \frac{3b^2\kappa^2}{28\|f\|_\infty^2} } \le |f|_{\mathcal{B}_{s}^{\ell}}^{p} (b\kappa)^{-p} n^{- \left( \frac{3b^2\kappa^2}{28\|f\|_\infty^2} - 1 \right)} \le |f|_{\mathcal{B}_{s}^{\ell}}^{p} (b\kappa)^{-p} n^{-\nu} \end{align*}

    as long as \frac{3b^2\kappa^2}{28\|f\|_\infty^2} - 1 > \nu . To estimate \mathbb{E} e_{12}^2 , we observe that, thanks to Lemma 6,

    e_{12}^2 = \sum\limits_{C_{j, k}\in\Lambda _n(b\tau_n)\setminus\Lambda (b\tau_n)} \ \sum\limits_{\substack{C_{j', k'}\in\Lambda (b\tau_n) \\ C_{j', k'}~~\subset C_{j, k}}} \| (f^\ell_{j, k} - f^\ell_{j', k'}) \mathbf{1}_{j', k'} \|^2 \lesssim M^2 .

    Hence, by choosing \nu = 1 > \tfrac{2s}{2s+d} we get

    \mathbb{E} e_{12}^2 \lesssim \|f\|_\infty^2 \mathbb{P} \{e_{12} > 0\} \lesssim \|f\|_\infty^2 |f|_{\mathcal{B}_{s}^{\ell}}^{p} (b\kappa)^{-p} (\log n / n)^{\frac{2s}{2s+d}} .

The term e_3 is the variance term which can be estimated by Proposition 2 with \Lambda = \widehat{\Lambda }_n(\tau_n)\wedge \Lambda _n(\tau_n/b) . We plug in \eta = r (\log n / n)^{\frac{s}{2s+d}} . Bounding \#\Lambda by \# \Lambda _n(\tau_n/b) \le \#\Lambda _n \le n /d (as our data master tree has at least d points in each leaf) outside the exponential, and by \# \Lambda _n(\tau_n/b) \le \# \Lambda (\tau_n/b) \le |f|_{\mathcal{B}_{s}^{\ell}}^{p} (\tau_n/b)^{-p} inside the exponential, we get the following estimates for e_3 :

\mathbb{P} \left\{ e_3 > r \left( \frac{\log n}{n}\right)^{\frac{s}{2s+d}} \right\} \le \begin{cases} C_0 n^{ 1- \frac{ r^2 \kappa^{p}}{c_0 b^{p} |f|_{\mathcal{B}_{s}^{\ell}}^{p} \max\{M^2, \sigma^2\} }} & (\ell = 0) \\ C_1 n^{ 1 - \frac{ r^2 \kappa^{p}}{c_1 b^{p} |f|_{\mathcal{B}_{s}^{\ell}}^{p} \max\{d^4 M^2, d^2\sigma^2\} }} & (\ell = 1) , \end{cases}

    where C_0 = C_0(\theta_2, \theta_3, {{a}_{\text{max}}}, d, s, |f|_{\mathcal{B}_{s}^{\ell}}, \kappa) and C_1 = C_1(\theta_2, \theta_3, {{a}_{\text{max}}}, d, s, |f|_{\mathcal{B}_{s}^{\ell}}, \kappa) . We obtain \mathbb{P}\{e_3 > r(\log n /n)^{\frac{s}{2s+d}}\} \le C n^{-\nu} as long as r is chosen large enough to make the exponent smaller than -\nu .

To estimate \mathbb{E} e_3^2 , we apply Proposition 2 again, with \#\Lambda \le |f|_{\mathcal{B}_{s}^{\ell}}^{p} (b/\kappa)^{p} (\log n / n)^{-\frac{d}{2s+d}} , obtaining

    \mathbb{E} e_3^2 \le \bar{C} (\log n / n)^{\frac{2s}{2s+d}}\, .

    Next we estimate e_2 and e_4 . Since \widehat{\mathcal{T}}_n(\tau_n) \cap \mathcal{T}_n(\tau_n/b) ~~\subseteq \widehat{\mathcal{T}}_n(\tau_n) \cup \mathcal{T}_n(b\tau_n) and \mathcal{T}_n(b\tau_n) ~~\subseteq \mathcal{T}_n(\tau_n/b) , we have e_2 > 0 if and only if there is a C_{j, k} \in \mathcal{T}_n such that either C_{j, k} is in \widehat{\mathcal{T}}_n(\tau_n) but not in \mathcal{T}_n(\tau_n/b) , or C_{j, k} is in \mathcal{T}_n(b\tau_n) but not in \widehat{\mathcal{T}}_n(\tau_n) . This means that either \widehat{\Delta}_{j, k} \ge \tau_n but \Delta_{j, k}^{\ell} < \tau_n/b , or \Delta_{j, k}^{\ell} \ge b\tau_n but \widehat{\Delta}_{j, k} < \tau_n . As a consequence,

    \mathbb{P} \{ e_2 > 0 \} \le \sum\limits_{C_{j, k}\in\mathcal{T}_n} \mathbb{P} \left\{ \widehat{\Delta}_{j, k} \ge \tau_n \ \text{and} \ \Delta_{j, k}^{\ell} < \tau_n/b \right\} + \sum\limits_{C_{j, k}\in\mathcal{T}_n} \mathbb{P} \left\{ \Delta_{j, k}^{\ell} \ge b\tau_n \ \text{and} \ \widehat{\Delta}_{j, k} < \tau_n \right\} ,

    and analogously

    \mathbb{P} \{ e_4 > 0 \} \le \sum\limits_{C_{j, k}\in\mathcal{T}_n} \mathbb{P} \left\{ \widehat{\Delta}_{j, k} \ge \tau_n \ \text{and} \ \Delta_{j, k}^{\ell} < \tau_n/b \right\} .

    We can now apply Lemma 7: we use (b) with \eta = \tau_n/b , and (a) with \eta = \tau_n . We obtain that

    \mathbb{P}\{e_2 > 0 \} + \mathbb{P}\{e_4 > 0\} \le \begin{cases} C({{a}_{\text{min}}}, d) n^{1-\frac{\kappa^2}{c_0 ~~b^2 \max\{M^2, \sigma^2\}}} & (\ell = 0) \\ C(\theta_2, \theta_3, {{a}_{\text{min}}}, d) n^{1- \frac{\kappa^2}{c_1 b^2 \max\{ d^4 ~~M^2, d^2~~\sigma^2\}}} & (\ell = 1). \end{cases}

    We have \mathbb{P}\{e_2 > 0 \} + \mathbb{P}\{ e_4 > 0\} \le C n^{-\nu} provided that \kappa is chosen such that the exponents are smaller than -\nu .

    We are left to deal with the expectations. As for e_{2} , Lemma 6 implies e_2 \lesssim M , which gives rise to, for \nu = 1 > \frac{2s}{2s+d} ,

    \mathbb{E} e_2^2 \lesssim M^2 \mathbb{P}\{e_2 > 0\} \le C M^2 (\log n / n)^{\frac{2s}{2s+d}}\, .

The same bound holds for e_4 , which concludes the proof of Theorem 2.

This section contains the main concentration inequalities of the empirical quantities around their oracles. For piecewise linear estimators, some quantities used in Lemma 5 are decomposed in Table 4. All proofs are collected in Appendix A.

    Table 4.  Decomposition of piecewise linear estimators into quantities used in Lemma 5.
    oracles empirical counterparts
    {f}_{j, k}(x) = T_M \left([(x-c_{j, k})^T \ \ 2^{-j}] {Q}_{j, k} {r}_{j, k}\right) \widehat{f}_{j, k}(x) =T_M\left([(x-\widehat{c}_{j, k})^T \ \ 2^{-j}] \widehat{Q}_{j, k} \widehat{r}_{j, k} \right)
    {Q}_{j, k} := \left[ \begin{matrix} [\Sigma_{j, k}]_d^{\dagger} & 0\\ 0 & 2^{2j}\end{matrix} \right] \widehat{Q}_{j, k} := \left[ \begin{matrix}[\widehat{\Sigma}_{j, k}]_d^{\dagger} & 0\\ 0 & 2^{2j}\end{matrix} \right]
    [\Sigma_{j, k}]_d^{\dagger} := V_{j, k} [{{\Lambda }^{j,k}}]_d^{-1} V_{j, k}^T [\widehat{\Sigma}_{j, k}]_d^{\dagger} := {{\widehat{V}}_{j, k}} {[{{\widehat{\Lambda }}^{j,k}}]_d}^{-1} ({{\widehat{V}}_{j, k}})^T
{r}_{j, k}:=\frac{1}{\rho(C_{j, k})} \int_{C_{j, k}} y \begin{bmatrix} x-c_{j, k}\\ 2^{-j}\end{bmatrix} d\rho \qquad \widehat{r}_{j, k}:= \frac{1}{\widehat{n}_{j,k}} \sum\limits_{x_i \in C_{j, k}} y_i \begin{bmatrix} x_i-\widehat{c}_{j, k}\\ 2^{-j}\end{bmatrix}


    Lemma 3. For every t > 0 we have:

    (a) \mathbb{P}\left\{\left|\rho(C_{j, k})-\widehat{\rho}(C_{j, k})\right| > t \right\} \le 2\exp\left(\frac{-3nt^2}{6\rho(C_{j, k}) +2t}\right) ;

    (b) setting t = \frac 1 2 \rho(C_{j, k}) in (a) yields \mathbb{P}\left\{|\rho(C_{j, k})-\widehat{\rho}(C_{j, k})| > \frac{1}{2}\rho(C_{j, k}) \right\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) ;

    (c) \mathbb{P}\left\{\|c_{j, k}-\widehat{c}_{j, k}\| > t \right\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) + 8\exp\left(-\frac{3 n\rho(C_{j, k}) t^2}{12\theta_2^2 2^{-2j}+4\theta_2 2^{-j}t} \right) ;

    (d) \mathbb{P} \{\|\Sigma_{j, k}-\widehat{\Sigma}_{j, k}\| > t \} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) + \left(\frac{4\theta_2^2 }{\theta_3}d+8\right) \exp\left(\frac{-3 n\rho(C_{j, k}) t^2}{96\theta_2^4 2^{-4j} +16\theta_2^2 2^{-2j}t} \right).

    Lemma 4. We have:

    (a) \mathbb{P} \{\|{Q}_{j, k} - \widehat{Q}_{j, k}\| > \frac{48}{\theta_3^2}d^2~~ 2^{4j} \| \Sigma_{j, k} - \widehat{\Sigma}_{j, k} \|\} \\ \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) + \left(4\frac{\theta_2^2 }{\theta_3}d+10\right) \exp\left(- \frac{n \rho(C_{j, k})}{512 (\theta_2^2/\theta_3)^2 d^2~~ + \frac{64}{3} (\theta_2^2/\theta_3)d}\right) .

    (b) \mathbb{P} \{\|\widehat{Q}^{j, k}\| > \frac{2}{\theta_3} d\ 2^{2j}\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) + \left(4\frac{\theta_2^2 }{\theta_3}d+10\right) \exp\left(-\tfrac{n\rho(C_{j, k})}{128(\theta_2^2/\theta_3)^2d^2~~ + \frac{32}{3}(\theta_2^2/\theta_3)d}\right).

    (c) Suppose f is in L^\infty . For every t > 0 , we have

    \begin{aligned}[t] & \mathbb{P}\left\{ \|{r}_{j, k} -\widehat{r}_{j, k}\| > t \right\} \begin{aligned}[t] & \le 2\exp\left(-\tfrac{3}{28}n\rho(C_{j, k})\right) + 8 \exp\left(\tfrac{- n\rho(C_{j, k}) t^2}{4 \langle\theta_2\rangle^2 \|f\|_\infty^22^{-2j}+2 \langle\theta_2\rangle \|f\|_\infty 2^{-j} t} \right) + 2 \exp\left(- c\tfrac{n\rho(C_{j, k})t^2}{\theta_2^2 \|\zeta\|_{\psi_2}^2 2^{-2j}} \right) \end{aligned} \\ \end{aligned}

    where c is an absolute constant.

    Lemma 5. Suppose f is in L^\infty . For every t > 0 , we have

    \mathbb{P} \left\{ \| f_{j, k}^\ell - \widehat{f}_{j, k}^\ell \|_\infty > t \right\} \le \begin{cases} C_0 \left[\exp\left(-\frac{n\rho(C_{j, k})}{c_0}\right) + \exp\left(- \frac{n\rho(C_{j, k}) t^2}{c_0 (\|f\|_\infty^2 + \|f\|_\infty t)}\right) + \exp\left(-\tfrac{n\rho(C_{j, k})t^2}{c_0\|\zeta\|_{\psi_2}^2}\right) \right] & \mathit{{for~~ \ell = 0}} \\ C_1 d \left[\exp\left(- \frac{n\rho(C_{j, k})}{c_1d^2~~} \right) + \exp\left(- \frac{n\rho(C_{j, k})t^2}{c_1 d^4 (\|f\|_\infty^2 + \|f\|_\infty t)}\right) + \exp\left(- \frac{n\rho(C_{j, k})t^2}{c_1 d^2~~ \|\zeta\|_{\psi_2}^2}\right) \right] & \mathit{{for~~ \ell = 1, }} \end{cases}

    where c_0, C_0 are absolute constants, c_0' depends on \theta_2 , and c_1, C_1 depend on \theta_2, \theta_3 .

    Lemma 6. Suppose f \in L^\infty . For every C_{j, k} \in \mathcal{T} and C_{j', k'} ~~\subset C_{j, k} ,

    \| {f}_{j, k} - f_{j', k'} \|_\infty \le 2 M , \qquad \| \widehat{f}_{j, k} - \widehat{f}_{j', k'} \|_\infty \le 2M .

    Lemma 7. Suppose f is in L^\infty . For every \eta > 0 and any \gamma > 1 , we have

(a) \mathbb{P}\left\{\widehat{\Delta}_{j, k}^{\ell} < \eta \ \text{and} \ \Delta_{j, k}^{\ell} \ge (2 a_{\max}+5)\, \eta\right\} \le \begin{cases} C_{0} \exp\left(-\frac{n \eta^{2}}{c_{0} \max\{\|f\|_{\infty}^{2},\, \|\zeta\|_{\psi_{2}}^{2}\}}\right) & \text{for } \ell = 0 \\ C_{1} d \exp\left(-\frac{n \eta^{2}}{c_{1} \max\{d^{4}\|f\|_{\infty}^{2},\, d^{2}\|\zeta\|_{\psi_{2}}^{2}\}}\right) & \text{for } \ell = 1; \end{cases}

(b) \mathbb{P}\left\{\Delta_{j, k}^{\ell} < \eta \ \text{and} \ \widehat{\Delta}_{j, k}^{\ell} \ge (2 a_{\max}+5)\, \eta\right\} \le \begin{cases} C_{0} \exp\left(-\frac{n \eta^{2}}{c_{0} \max\{\|f\|_{\infty}^{2},\, \|\zeta\|_{\psi_{2}}^{2}\}}\right) & \text{for } \ell = 0 \\ C_{1} d \exp\left(-\frac{n \eta^{2}}{c_{1} \max\{d^{4}\|f\|_{\infty}^{2},\, d^{2}\|\zeta\|_{\psi_{2}}^{2}\}}\right) & \text{for } \ell = 1; \end{cases}

    C_0, c_0 depend on {{a}_{\text{max}}} ; c_0' depends on {{a}_{\text{max}}}, \theta_2 ; C_1, c_1 depend on {{a}_{\text{max}}}, \theta_2, \theta_3 .

    This work was partially supported by NSF-DMS-125012, AFOSR FA9550-17-1-0280, NSF-IIS-1546392. The authors thank Duke University for donating computing equipment used for this project.

    The authors declare no conflict of interest.

    Example 1. Let f \in \mathcal{C}^{\ell, \alpha} . The local estimator {f}_{j, k}^\ell minimizes \|(f - p){{\mathbf{1}}_{j,k}}\| over all possible polynomials p of order less than or equal to \ell . Thus, in particular, we have \|(f - {f}_{j, k}^\ell){{\mathbf{1}}_{j,k}}\| \le \| (f - p) {{\mathbf{1}}_{j,k}} \| where p is equal to the \ell -order Taylor polynomial of f at some z \in C_{j, k} . Hence, for x \in C_{j, k} there is \xi \in \mathcal{M} \cap B_{\theta_2 2^{-j}}(z) such that

    \begin{align*} | f(x) - p(x) | &\le \sum\limits_{|\lambda | = \ell} \frac{1}{\lambda !} | \partial^\lambda f(\xi) - \partial^{\lambda } f(z) | | x - z |^\lambda \le |f|_{\mathcal{C}^{\ell, \alpha}} \| \xi - z \|^\alpha \sum\limits_{|\lambda | = \ell} \frac{1}{\lambda !} | x - z |^\lambda \\ & \le \frac{d^\ell}{\ell!} |f|_{\mathcal{C}^{\ell, \alpha}} \| \xi - z \|^\alpha \| x - z \|^\ell \le \theta_2^{\ell+\alpha} \frac{d^\ell}{\ell!} |f|_{\mathcal{C}^{\ell, \alpha}} 2^{-j(\ell+\alpha)} . \end{align*}

    Therefore, for every j and k \in \mathcal{K}_j , we have

    \begin{equation*} \|(f - {f}_{j, k}^\ell){{\mathbf{1}}_{j,k}}\|^2 \le \theta_2^{2(\ell+\alpha) } (\frac{d^\ell}{\ell!})^2 |f|_{\mathcal{C}^{\ell, \alpha}}^2 2^{-2j(\ell+\alpha)} \rho(C_{j, k}) . \end{equation*}

    Examples 2 and 3. For polynomial estimators of any fixed order \ell = 0, 1, \ldots , g^\ell_{j, k} - g{{\mathbf{1}}_{j,k}} = 0 when C_{j, k} \cap \Gamma = \varnothing , and g^\ell_{j, k} - g{{\mathbf{1}}_{j,k}} = {O}(1) when C_{j, k} \cap \Gamma \neq \varnothing . At the scale j , \rho(C_{j, k}) \approx 2^{-jd} and \rho(\cup\{C_{j, k}: C_{j, k}\cap \Gamma \neq \varnothing\}) \approx 2^{-j(d-d_\Gamma)}\rho(\Gamma) . Therefore,

    \|g^\ell_{\Lambda _j}-g\| \le {O}(\sqrt{2^{-j(d-d_\Gamma)}}) = {O}(2^{-j (d-d_\Gamma)/{2}}),

    which implies g\in \mathcal{A}^{\ell}_{(d-d_\Gamma)/{2}} .

    In adaptive approximations, \Delta_{j, k}^{\ell} = 0 when C_{j, k} \cap \Gamma = \varnothing . When C_{j, k} \cap \Gamma \neq \varnothing , \Delta_{j, k}^{\ell} = \|g^{\ell}_{j, k} - \sum\limits_{C_{j+1, k'}~~\subset C_{j, k}}g^{\ell}_{j+1, k'}\| \lesssim \sqrt{\rho(C_{j, k})} \lesssim 2^{-jd/2} . Given any fixed threshold \tau > 0 , in the truncated tree \mathcal{T}(\tau) , the leaf nodes intersecting with \Gamma satisfy 2^{-jd/2} \gtrsim \tau . In other words, around \Gamma the tree is truncated at a coarser scale than j^\star such that 2^{-j^\star} = {O}(\tau^{\frac 2 d}) . The cardinality of \mathcal{T}(\tau) is dominated by the nodes intersecting with \Gamma , so

    \#\mathcal{T}(\tau) \lesssim \frac{\rho(\Gamma) 2^{-j^\star(d-d_\Gamma)}}{2^{-j^\star d}} = \rho(\Gamma)2^{j^\star d_\Gamma} \lesssim \tau^{-\frac{2d_\Gamma}{d}},

    which implies p = 2d_\Gamma/d . We conclude that g\in \mathcal{B}^{\ell}_s with s = \frac{d(2-p)}{2p} = \frac{d}{d_\Gamma} (d-d_\Gamma)/2.

    Lemma 1. By definition, we have \|(f-{f}_{j, k}^\ell){{\mathbf{1}}_{j,k}}\| \le |f|_{\mathcal{A}_{s}^{\ell, \infty}} 2^{-js} \sqrt{\rho(C_{j, k})} as long as f \in \mathcal{A}_{s}^{\ell, \infty} . By splitting (\Delta^\ell_{j, k})^2 \le 2\|(f - {f}_{j, k}^\ell){{\mathbf{1}}_{j,k}}\|^2 + 2\sum\limits_{k':C_{j+1, k'}~~\subset C_{j, k}} \|(f - f^\ell_{j+1, k'})\mathbf{1}_{j+1, k'}\|^2 , we get

    (\Delta^\ell_{j, k})^2 \le 4 |f|^2_{\mathcal{A}_{s}^{\ell, \infty}} 2^{-2js}\rho(C_{j, k}).

    In the selection of adaptive partitions, every C_{j, k} with \Delta_{j, k}^{\ell} \ge \tau must satisfy \rho(C_{j, k}) \ge 2^{2js} (\tau/|f|_{\mathcal{A}_{s}^{\ell, \infty}})^2 . Under the additional assumption \rho(C_{j, k}) \le \theta_0 2^{-jd} (which holds when the measure \rho is doubling), we have

    \begin{equation} \Delta_{j, k}^{\ell} \ge \tau \Longrightarrow 2^{-j} \ge \left(\frac{\tau}{|f|_{\mathcal{A}_{s}^{\ell, \infty}}}\right)^{\frac{2}{2s+d}}. \end{equation} (3)

    Therefore, every cell in \Lambda (\tau) is at a scale no finer than j^\star , where j^\star is determined by equality in (3). Using (A3) we thus get

    \begin{align*} &\tau^{p} \# \mathcal{T}(\tau) \le \tau^{p} {{a}_{\text{min}}} \#\Lambda _{j^\star} \le \theta_1^{-1} \tau^{p} {{a}_{\text{min}}} 2^{j^\star d} \le \frac{{{a}_{\text{min}}} |f|_{\mathcal{A}_{s}^{\ell, \infty}}^{\frac{2d}{2s+d}}}{\theta_1} \end{align*}

    which yields the result.

    Proof of Lemma 2. For any partition \Lambda \subset \mathcal{T} , denote by \Lambda ^{l} the l -th generation refinement of \Lambda , so that \Lambda ^0 = \Lambda and \Lambda ^{l+1} consists of the children of the cells in \Lambda ^l . We first prove that \lim_{l\to\infty} f^\ell_{\Lambda ^l} = f in L^2(\mathcal{M}) . Suppose first that f \in L^\infty . Notice that \| f^\ell_{\Lambda ^l} - f \| \le \| f_{\Lambda ^l}^0 - f \| . By the Lebesgue differentiation theorem, f_{\Lambda ^l}^0 \to f almost everywhere. Since f is bounded, f_{\Lambda ^l}^0 is uniformly bounded, hence f_{\Lambda ^l}^0 \to f in L^2(\mathcal{M}) by the dominated convergence theorem. In the case where f \in \mathcal{A}_t^\ell , taking the uniform partition \Lambda _{j(l)} at the coarsest scale j(l) appearing in \Lambda ^l , we have \|f - f^\ell_{\Lambda ^l}\| \le \|f - f^\ell_{\Lambda _{j(l)}} \| \lesssim 2^{-j(l)t} , and therefore f^\ell_{\Lambda ^l} \to f in L^2(\mathcal{M}) .

    Now, setting \Lambda = \Lambda (\tau) and \mathcal{S} : = \mathcal{T}(\tau)\setminus\Lambda , by Definitions 3 and 4 we get

    \begin{align*} \| f^\ell_{\Lambda } - f \|^2 & = \left\| \sum\limits_{l = 0}^{L-1} \left(f^\ell_{\Lambda ^l} - f^\ell_{\Lambda ^{l+1}}\right) + f^\ell_{\Lambda ^L} - f \right\|^2 = \left\| \sum\limits_{l = 0}^\infty (f^\ell_{\Lambda ^l} - f^\ell_{\Lambda ^{l+1}}) \right\|^2 \\ & = \left\| \sum\limits_{C_{j, k}\in\mathcal{T}\setminus\mathcal{S}} W_{j, k}^\ell \right\|^2 \le B_0 \sum\limits_{C_{j, k} \in \mathcal{T}\setminus \mathcal{S}} \|W_{j, k}^\ell\|^2 \\ & = B_0 \sum\limits_{l = 0}^\infty \sum\limits_{\Delta_{j, k}^{\ell} \in [2^{-(l+1)}\tau, 2^{-l}\tau)} (\Delta_{j, k}^{\ell})^2 \le B_0 \sum\limits_{l = 0}^\infty 2^{-2 l} \tau^2 \, \#\mathcal{T}(2^{-(l+1)}\tau) \\ & \le B_0 \sum\limits_{l = 0}^\infty 2^{-2 l} \tau^2 \, |f|_{\mathcal{B}_{s}^{\ell}}^p \left(2^{-(l+1)}\tau\right)^{-p} = B_0 2^p |f|_{\mathcal{B}_{s}^{\ell}}^p \tau^{2-p} \sum\limits_{l = 0}^\infty 2^{-(2-p) l} , \end{align*}

    which yields the first inequality in Lemma 2. The second inequality follows by observing that 2-p = \frac{2s}{d} p and |f|_{\mathcal{B}_{s}^{\ell}}^p \tau^{2-p} = |f|_{\mathcal{B}_{s}^{\ell}}^2 (|f|_{\mathcal{B}_{s}^{\ell}}^{-p} \tau^p)^{\frac{2s}{d}} \le |f|_{\mathcal{B}_{s}^{\ell}}^2 \left[\#\mathcal{T}(\tau)\right]^{-\frac{2s}{d}} by Definition 3.
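    For completeness, the exponent identity used in the last sentence can be checked directly from the relation s = d(2-p)/(2p) recorded in the proof of Examples 2 and 3 above:

    \begin{align*} s = \frac{d(2-p)}{2p} \Longleftrightarrow p = \frac{2d}{2s+d} \Longleftrightarrow 2 - p = \frac{4s}{2s+d} = \frac{2s}{d}\cdot\frac{2d}{2s+d} = \frac{2s}{d}\, p . \end{align*}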

    Proof of Lemma 3. See [34].

    Proof of Lemma 4. (a). Thanks to [23,Theorem 3.2] and assumption (A5), we have

    \begin{align*} \| {Q}_{j, k} - \widehat{Q}_{j, k} \| & = \| [\Sigma_{j, k}]_d^{\dagger} - [\widehat{\Sigma}_{j, k}]_d^{\dagger} \| \le 3 \tfrac{\| \Sigma_{j, k} - \widehat{\Sigma}_{j, k} \|}{({\lambda }_{d}^{j,k} - \lambda_{d+1}^{j,k} - \|\Sigma_{j, k} - \widehat{\Sigma}_{j, k}\|)^2} \le 3\tfrac{\| \Sigma_{j, k} - \widehat{\Sigma}_{j, k} \|}{\left( \frac{\theta_3}{2d}2^{-2j} - \|\Sigma_{j, k} - \widehat{\Sigma}_{j, k}\|\right)^2} . \end{align*}

    Hence, the bound follows applying Lemma 3(d) with t = \frac{\theta_3}{4d}2^{-2j} .
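    The stability of the truncated pseudo-inverse behind (a) can be explored numerically. The sketch below is an illustration on synthetic matrices (the helper truncated_pinv, the chosen spectrum and the perturbation sizes are assumptions made here, and both sides of the displayed bound are simply printed for comparison rather than asserted):

        import numpy as np

        rng = np.random.default_rng(0)
        D, d = 10, 3

        # Symmetric PSD matrix with eigenvalues 5, 4, 3 followed by a small bulk, so that
        # lambda_d - lambda_{d+1} is a genuine spectral gap.
        U, _ = np.linalg.qr(rng.standard_normal((D, D)))
        eigs = np.concatenate([np.array([5.0, 4.0, 3.0]), 0.05 * rng.random(D - d)])
        Sigma = U @ np.diag(eigs) @ U.T

        def truncated_pinv(A, d):
            """Rank-d truncated pseudo-inverse [A]_d^dagger of a symmetric matrix."""
            w, V = np.linalg.eigh(A)
            idx = np.argsort(w)[::-1][:d]                      # top-d eigenpairs
            return V[:, idx] @ np.diag(1.0 / w[idx]) @ V[:, idx].T

        gap = eigs[d - 1] - np.max(eigs[d:])                   # lambda_d - lambda_{d+1}

        for eps in [1e-3, 1e-2, 1e-1]:
            E = rng.standard_normal((D, D))
            E = eps * (E + E.T) / 2.0                          # small symmetric perturbation
            err = np.linalg.norm(E, 2)                         # spectral norm ||Sigma - Sigma_hat||
            lhs = np.linalg.norm(truncated_pinv(Sigma, d) - truncated_pinv(Sigma + E, d), 2)
            rhs = 3.0 * err / (gap - err) ** 2
            print(f"||E|| = {err:.3e}   ||Q - Qhat|| = {lhs:.3e}   3||E||/(gap-||E||)^2 = {rhs:.3e}")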

    (b). Observe that \|\widehat{Q}_{j, k}\| \le \|[\widehat{\Sigma}_{j, k}]_d^{\dagger}\| = (\hat{\lambda }_{d}^{j,k})^{-1} . Moreover, \hat{\lambda }_{d}^{j,k} \ge {\lambda }_{d}^{j,k} - |{\lambda }_{d}^{j,k} - \hat{\lambda }_{d}^{j,k}| \ge \frac{\theta_3}{d}2^{-2j} - \|\Sigma_{j, k} - \widehat{\Sigma}_{j, k}\| by assumption (A5). Thus, using Lemma 3(d) with t = \frac{\theta_3}{2d}2^{-2j} yields the result.

    (c). We condition on the event that \widehat{n}_{j,k} \ge \frac{1}{2} \mathbb{E}\widehat{n}_{j,k} = \frac{1}{2}n\rho(C_{j, k}) , whose complement occurs with probability at most 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) by Lemma 3(b). The quantity \| {r}_{j, k} - \widehat{r}_{j, k} \| is bounded by A + B + C + D , where

    \begin{align*} &A : = \left\| \frac{1}{\widehat{n}_{j,k}} \sum\limits_{i = 1}^{n} \left( f(x_i) \begin{bmatrix} x_i - c_{j, k} \\ 2^{-j} \end{bmatrix} - \frac{1}{\rho(C_{j, k})} \int_{C_{j, k}} f(x) \begin{bmatrix} x - c_{j, k} \\ 2^{-j} \end{bmatrix} d\rho \right) {{\mathbf{1}}_{j,k}}(x_i) \right\| \\ &B : = \left\| \frac{1}{\widehat{n}_{j,k}} \sum\limits_{i = 1}^{n} f(x_i) \begin{bmatrix} c_{j, k} - \widehat{c}_{j, k} \\ 0 \end{bmatrix} {{\mathbf{1}}_{j,k}}(x_i) \right\| \\ &C : = \left\| \frac{1}{\widehat{n}_{j,k}} \sum\limits_{i = 1}^{n} \zeta_i \begin{bmatrix} x_i - c_{j, k} \\ 2^{-j} \end{bmatrix} {{\mathbf{1}}_{j,k}}(x_i) \right\| \\ &D : = \left\| \frac{1}{\widehat{n}_{j,k}} \sum\limits_{i = 1}^{n} \zeta_i \begin{bmatrix} c_{j, k} - \widehat{c}_{j, k} \\ 0 \end{bmatrix} {{\mathbf{1}}_{j,k}}(x_i) \right\| . \end{align*}

    Each term of the sum in A has expectation 0 and bound 2\langle\theta_2\rangle \|f\|_\infty 2^{-j} . Thus, applying the Bernstein inequality [44,Corollary 7.3.2] we obtain

    \mathbb{P} \{ A > t \} \le 8 \exp\left( - c \tfrac{n\rho(C_{j, k})t^2}{\langle\theta_2\rangle^2 \|f\|_\infty^2 2^{-2j} + \langle\theta_2\rangle \|f\|_\infty 2^{-j} t} \right) .

    B is bounded by \|f\|_\infty \|c_{j, k} - \widehat{c}_{j, k}\| so that, using Lemma 3(c) with t replaced by t / \|f\|_\infty , we get

    \mathbb{P} \{ B > t \} \le 2\exp\left(-\tfrac{3}{28}n\rho(C_{j, k})\right) + 8\exp\left(-c\tfrac{n\rho(C_{j, k}) t^2}{\theta_2^2 \|f\|_\infty^2 2^{-2j} + \theta_2 \|f\|_\infty 2^{-j}t} \right) .

    To estimate C we appeal to [7,Theorem 3.1,Remark 4.2]. For X \in \mathbb{R}^n , take G(X) : = \|MX\| with M : = \left[\begin{array}{ccc} x_{1}-c_{j, k} & \cdots & x_{n}-c_{j, k} \\ 2^{-j} & \cdots & 2^{-j} \end{array}\right] . Then |\partial_i G(X)| \le \|x_i - c_{j, k}\| \le \theta_2 2^{-j} . Now let X = (\zeta_1 {{\mathbf{1}}_{j,k}}(x_1), \dots, \zeta_n {{\mathbf{1}}_{j,k}}(x_n))^T , so that C = G(X) / \widehat{n}_{j,k} . Since the \zeta_i 's are independent, [7,Remark 4.2] applies, and it yields \mathbb{P} \left\{ G(X) > t \right\} \le 2 \exp\left(- \frac{t^2}{2 \sigma^2} \right) , where \sigma^2 = \sum_{i = 1}^n \|\partial_iG\|_\infty^2 \|\zeta_i\|_{\psi_2}^2 {{\mathbf{1}}_{j,k}}(x_i) \le \widehat{n}_{j,k} \theta_2^2 2^{-2j} \|\zeta\|_{\psi_2}^2 , and thus

    \mathbb{P} \left\{ C > t \right\} \le 2 \exp\left( - \tfrac{n\rho(C_{j, k})t^2}{2 \theta_2^2 \|\zeta\|_{\psi_2}^2 2^{-2j}} \right) .

    We are left with D . This term is smaller than \|c_{j, k} - \widehat{c}_{j, k}\| \left| \frac{1}{\widehat{n}_{j,k}} \sum_{i = 1}^{n} \zeta_i {{\mathbf{1}}_{j,k}}(x_i) \right| , where, by Lemma 3(c), \|c_{j, k} - \widehat{c}_{j, k}\| \le \theta_2 2^{-j} with probability higher than 1 - 10 \exp\left(- \frac{3}{28} n \rho(C_{j, k}) \right) . Hence, by the standard sub-Gaussian tail inequality [46,Proposition 5.10] we have

    \mathbb{P} \{ D > t \} \le 10 \exp\left( - \tfrac{3}{28} n \rho(C_{j, k}) \right) + e \exp\left( - c \tfrac{n\rho(C_{j, k})t^2}{\theta_2^2 \|\zeta\|_{\psi_2}^2 2^{-2j}} \right) .

    This completes the proof.

    Proof of Lemma 5. If \ell = 0 , then \| {f}_{j, k}^\ell - \widehat{f}_{j, k}^\ell \|_\infty = | y_{j,k} - \widehat{y}_{j,k} | , which is less than

    \left| \frac{1}{\widehat{n}_{j,k}}\sum\limits_{i = 1}^{n} \left( f(x_i) - \frac{1}{\rho(C_{j, k})} \int_{C_{j, k}} f(x) d\rho(x) \right) {{\mathbf{1}}_{j,k}}(x_i) \right| + \left| \frac{1}{\widehat{n}_{j,k}}\sum\limits_{i = 1}^{n} \zeta_i {{\mathbf{1}}_{j,k}}(x_i) \right| .

    Each addend in the first term has expectation 0 and bound 2 \|f\|_\infty , and therefore we can apply the standard Bernstein inequality [44,Theorem 1.6.1]. As for the second term, we use the standard sub-Gaussian tail inequality [46,Proposition 5.10]. This yields the bounds for \ell = 0 .
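    As a quick Monte Carlo illustration of the \ell = 0 case (with a synthetic f , uniform design, Gaussian noise and a fixed cell, all chosen here for the example and not taken from the paper), the empirical cell average concentrates around the population cell average at the rate 1/\sqrt{n\rho(C_{j, k})} suggested by the Bernstein and sub-Gaussian tails:

        import numpy as np

        rng = np.random.default_rng(0)

        f = lambda x: np.sin(2.0 * np.pi * x)                  # bounded test function on [0, 1]
        a, b = 0.25, 0.5                                       # the "cell" C_{j,k}, so rho(C_{j,k}) = 0.25
        y_cell = f(np.linspace(a, b, 100001)).mean()           # population average of f over the cell
        sigma = 0.1                                            # noise level of the (sub-)Gaussian zeta_i

        for n in [1_000, 4_000, 16_000, 64_000]:
            errs = []
            for _ in range(100):
                x = rng.uniform(0.0, 1.0, size=n)              # x_i ~ rho = Unif[0, 1]
                zeta = sigma * rng.standard_normal(n)          # Gaussian noise, a sub-Gaussian example
                in_cell = (x >= a) & (x < b)
                y_hat = np.mean(f(x[in_cell]) + zeta[in_cell]) # empirical cell average \hat y_{j,k}
                errs.append(abs(y_hat - y_cell))
            print(f"n = {n:6d}   mean |y_hat - y| = {np.mean(errs):.4f}   "
                  f"1/sqrt(n rho) = {1.0 / np.sqrt(n * (b - a)):.4f}")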

    For \ell = 1 , we have

    \begin{align*} &|{f}_{j, k}(x)-\widehat{f}_{j, k}(x)| \le | \begin{bmatrix}(x-c_{j, k})^T & 2^{-j}\end{bmatrix} {Q}_{j, k} {r}_{j, k} - \begin{bmatrix}(x-\widehat{c}_{j, k})^T & 2^{-j}\end{bmatrix} \widehat{Q}_{j, k} \widehat{r}_{j, k} | \\ \le \ &| \begin{bmatrix}(c_{j, k}-\widehat{c}_{j, k})^T & 0 \end{bmatrix} {Q}_{j, k}{r}_{j, k} | + | \begin{bmatrix}(x-\widehat{c}_{j, k})^T & 2^{-j} \end{bmatrix} ({Q}_{j, k}{r}_{j, k} - \widehat{Q}_{j, k}\widehat{r}_{j, k}) | \\ \le \ &\| c_{j, k} - \widehat{c}_{j, k} \| \ \|{Q}_{j, k}\| \ \|{r}_{j, k}\| + \|\begin{bmatrix}(x-\widehat{c}_{j, k})^T & 2^{-j}\end{bmatrix}\| \ \|{Q}_{j, k}{r}_{j, k} - \widehat{Q}_{j, k}\widehat{r}_{j, k} \| \\ \le \ &\| c_{j, k} - \widehat{c}_{j, k} \| \ \|{Q}_{j, k}\| \ \|{r}_{j, k}\| + \|\begin{bmatrix}(x-\widehat{c}_{j, k})^T & 2^{-j}\end{bmatrix}\| \ \left( \| {Q}_{j, k} - \widehat{Q}_{j, k} \| \ \|{r}_{j, k}\| + \|\widehat{Q}_{j, k}\| \|{r}_{j, k}-\widehat{r}_{j, k}\| \right) \\ \lesssim \ &\tfrac{\theta_2^2}{\theta_3} \left( d \|f\|_\infty 2^j \| c_{j, k} - \widehat{c}_{j, k} \| + d^2 M 2^{2j} \| \Sigma_{j, k} - \widehat{\Sigma}_{j, k} \| + d 2^{j} \|{r}_{j, k}-\widehat{r}_{j, k}\|\right), \end{align*}

    where the last inequality holds with high probability thanks to Lemma 4(a)(b). Thus, applying Lemma 3(c)(d) and Lemma 4(c) with t replaced by \frac{t}{\theta d M 2^j} , \frac{t}{\theta d^2 M 2^{2j}} and \frac{t}{\theta d 2^j} , we obtain the desired result.

    Proof of Lemma 6. The claim follows simply by truncation.

    Proof of Lemma 7. We start with (a). Defining \overline{\Delta }_{j,k}^{{\ell}} : = \| W_{j, k}^\ell \|_n , we have

    \begin{align*} & \mathbb{P} \left\{ \widehat{\Delta}_{j, k} < \eta \ ~\text{ and }~ \ \Delta_{j, k}^{\ell} \ge (2{{a}_{\text{max}}}+5) \eta \right\} \\ \le \ & \mathbb{P} \left\{ \widehat{\Delta}_{j, k} < \eta \ ~\text{ and }~ \ \overline{\Delta }_{j,k}^{{\ell}} \ge ({{a}_{\text{max}}}+2)\eta \right\} + \mathbb{P} \left\{ \overline{\Delta }_{j,k}^{{\ell}} < ({{a}_{\text{max}}}+2)\eta \ ~\text{ and }~ \ \Delta_{j, k}^{\ell} \ge (2{{a}_{\text{max}}}+5)\eta \right\} \\ \le \ & \mathbb{P} \left\{ | \overline{\Delta }_{j,k}^{{\ell}} - \widehat{\Delta}_{j, k}^{{\ell}} | \ge (1+{{a}_{\text{max}}})\eta \right\} + \mathbb{P} \left\{ | \Delta_{j, k}^{\ell} - 2\overline{\Delta }_{j,k}^{{\ell}} | \ge \eta \right\} . \end{align*}

    To bound the first probability, observe that

    \begin{align*} | \overline{\Delta }_{j,k}^{{\ell}} - \widehat{\Delta}_{j, k} | &\le \| W_{j, k}^\ell - \widehat{W}_{j, k}^\ell \|_n \le \| {f}_{j, k}^\ell - \widehat{f}_{j, k}^\ell \|_n + \sum\limits_{C_{j+1, k'} \subset C_{j, k}} \| f^\ell_{j+1, k'} - \widehat{f}^\ell_{j+1, k'} \|_n \\ &\le \|{f}_{j, k}^\ell - \widehat{f}_{j, k}^\ell\|_\infty \sqrt{\widehat{\rho}(C_{j, k})} + \sum\limits_{C_{j+1, k'} \subset C_{j, k}} \|f^\ell_{j+1, k'} - \widehat{f}^\ell_{j+1, k'}\|_\infty \sqrt{\widehat{\rho}(C_{j+1, k'})} , \end{align*}

    so that

    \begin{align*} & \mathbb{P} \left\{ | \overline{\Delta }_{j,k}^{{\ell}} - \widehat{\Delta}_{j, k} | \ge (1+{{a}_{\text{max}}})\eta \right\} \\ \le \ & \mathbb{P} \left\{ \| {f}_{j, k}^\ell - \widehat{f}_{j, k}^\ell \|_\infty \sqrt{\widehat{\rho}(C_{j, k})} \ge \eta \right\} + \sum\limits_{C_{j+1, k'} \subset C_{j, k}} \mathbb{P} \left\{ \| f^\ell_{j+1, k'} - \widehat{f}^\ell_{j+1, k'}\|_\infty \sqrt{\widehat{\rho}(C_{j+1, k'})} \ge \eta \right\} . \end{align*}

    We now condition on the event that |\rho(C_{j, k}) - \widehat{\rho}(C_{j, k})| \le \frac{1}{2}\rho(C_{j, k}) , which entails \widehat{\rho}(C_{j, k}) \le \frac{3}{2} \rho(C_{j, k}) , and apply Lemma 5 with t \lesssim \eta/\sqrt{\rho(C_{j, k})} . The probability of the complementary event is bounded by Lemma 3(b). To get rid of the remaining \rho(C_{j, k}) 's inside the exponentials, we lower bound \rho(C_{j, k}) as follows. We have

    \begin{equation} (\Delta_{j, k}^{\ell})^2 \le 4\|f\|_\infty^2 \rho(C_{j, k}) . \end{equation} (4)

    Thus, \Delta_{j, k}^{\ell} \ge (2{{a}_{\text{max}}}+5)\eta implies \rho(C_{j, k}) \ge \frac{(2{{a}_{\text{max}}}+5)^2\eta^2}{4\|f\|_\infty^2} . Therefore, we obtain

    \begin{align*} \mathbb{P} \{ | \overline{\Delta }_{j,k}^{{\ell}} - \widehat{\Delta}_{j, k} | \ge (1+{{a}_{\text{max}}})\eta \} \le \begin{cases} C_0 \exp\left( - \tfrac{n\eta^2}{c_0 \max\{\|f\|_\infty^2, \|\zeta\|_{\psi_2}^2\}} \right) & (\ell = 0) \\ C_1 d \exp\left( - \tfrac{n\eta^2}{c_1 \max\{d^4\|f\|_\infty^2, d^2\|\zeta\|_{\psi_2}^2\}} \right) & (\ell = 1) , \end{cases} \end{align*}

    where C_0, c_0 depend on {{a}_{\text{max}}} , and C_1, c_1 depend on {{a}_{\text{max}}}, \theta_2, \theta_3 .

    Next we estimate \mathbb{P} \left\{ \Delta_{j, k}^{\ell} - 2\overline{\Delta }_{j,k}^{{\ell}} \ge \eta \right\} by [21,Theorem 11.2]. Notice that for all x\in \mathcal{M} , |W_{j, k}^\ell(x)| \lesssim \|f\|_\infty . If x \notin C_{j, k} , then W_{j, k}(x) = 0 , otherwise there is k' such that x \in C_{j+1, k'} \subset C_{j, k} . In such a case, |W_{j, k}(x)| = | {f}_{j, k}(x) - f_{j+1, k'}(x) | , and the claim follows from Lemma 6. Thus, [21,Theorem 11.2] gives us

    \mathbb{P} \left\{ \Delta_{j, k}^{\ell} - 2\overline{\Delta }_{j,k}^{{\ell}} \ge \eta \right\} \lesssim \exp\left( - \tfrac{n\eta^2}{c \|f\|_\infty^2 } \right) ,

    where c is an absolute constant.

    Let us turn to (b). We first observe that

    \begin{equation} \widehat{\Delta}_{j, k} \lesssim M \sqrt{\widehat{\rho}(C_{j, k})} . \end{equation} (5)

    To see this, note again that \widehat{W}_{j, k}(x) \ne 0 only when x \in C_{j+1, k'} \subset C_{j, k} for some k' , in which case |\widehat{W}_{j, k}(x)| = | \widehat{f}_{j, k}(x) - \widehat{f}_{j+1, k'}(x) | and we can apply Lemma 6. Now note that b = 2{{a}_{\text{max}}}+5 . We have

    \begin{align*} \mathbb{P} \left\{\Delta_{j, k}^{\ell} < \eta \ ~\text{ and }~ \ \widehat{\Delta}_{j, k} \ge b\eta \right\} &\le \mathbb{P} \left\{\Delta_{j, k}^{\ell} < \eta , \ \widehat{\Delta}_{j, k} \ge b\eta \ ~\text{ and }~ \ \rho(C_{j, k}) \ge \frac{b^2\eta^2}{2\|f\|_\infty^2} \right\} \\ &+ \mathbb{P} \left\{ \rho(C_{j, k}) < \frac{b^2\eta^2}{2\|f\|_\infty^2} \ ~\text{ and }~ \ \widehat{\rho}(C_{j, k}) \ge \frac{b^2\eta^2}{\|f\|_\infty^2} \right\} \\ &+ \mathbb{P} \left\{ \widehat{\rho}(C_{j, k}) < \frac{b^2\eta^2}{\|f\|_\infty^2} \ ~\text{ and }~ \ \widehat{\Delta}_{j, k} \ge b\eta \right\} . \end{align*}

    The first probability can be estimated as in part (a). Thanks to Lemma 3(a), the second probability is bounded by

    \mathbb{P}\left\{ \rho(C_{j, k}) < \frac{b^2 \eta^2}{2\|f\|_\infty^2} \ ~\text{ and }~ \ |\widehat{\rho}(C_{j, k})-\rho(C_{j, k})| > \frac{b^2 \eta^2}{2\|f\|_\infty^2} \right\} \lesssim \exp\left(-\tfrac{b^2 n\eta^2}{c\|f\|_\infty^2 }\right)

    for an absolute constant c . Finally, the third probability is zero thanks to (5).



    [1] W. K. Allard, G. Chen, M. Maggioni, Multi-scale geometric methods for data sets II: geometric multi-resolution analysis, Appl. Comput. Harmon. Anal., 32 (2012), 435-462. doi: 10.1016/j.acha.2011.08.001
    [2] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput., 15 (2003), 1373-1396. doi: 10.1162/089976603321780317
    [3] A. Beygelzimer, S. Kakade, J. Langford, Cover trees for nearest neighbor, In: Proceedings of the 23rd international conference on Machine learning, 2006, 97-104.
    [4] P. J. Bickel, B. Li, Local polynomial regression on unknown manifolds, Lecture Notes-Monograph Series, 54 (2007), 177-186.
    [5] P. Binev, A. Cohen, W. Dahmen, R. A. DeVore, Universal algorithms for learning theory part II: Piecewise polynomial functions, Constr. Approx., 26 (2007), 127-152.
    [6] P. Binev, A. Cohen, W. Dahmen, R. A. DeVore, V. N. Temlyakov, Universal algorithms for learning theory part I: Piecewise constant functions, J. Mach. Learn. Res., 6 (2005), 1297-1321.
    [7] V. Buldygin, E. Pechuk, Inequalities for the distributions of functionals of sub-Gaussian vectors, Theor. Probability and Math. Statist., 80 (2010), 25-36.
    [8] G. Chen, G. Lerman, Spectral Curvature Clustering (SCC), Int. J. Comput. Vis., 81 (2009), 317-330.
    [9] G. Chen, M. Maggioni, Multiscale geometric and spectral analysis of plane arrangements, In: IEEE Conference on Computer Vision and Pattern Recognition, 2011, 2825-2832.
    [10] M. Christ, A T(b) theorem with remarks on analytic capacity and the Cauchy integral, Colloq. Math., 60/61 (1990), 601-628.
    [11] A. Cohen, W. Dahmen, I. Daubechies, R. A. DeVore, Tree approximation and optimal encoding, Appl. Comput. Harmon. Anal., 11 (2001), 192-226. doi: 10.1006/acha.2001.0336
    [12] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, et al., Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps, PNAS, 102 (2005), 7426-7431.
    [13] I. Daubechies, Ten lectures on wavelets, SIAM, 1992.
    [14] D. Deng, Y. Han, Harmonic analysis on spaces of homogeneous type, Springer, 2008.
    [15] D. L. Donoho, C. Grimes, Hessian eigenmaps: locally linear embedding techniques for high-dimensional data, PNAS, 100 (2003), 5591-5596.
    [16] D. L. Donoho, J. M. Johnstone, Ideal spatial adaptation by wavelet shrinkage, Biometrika, 81 (1994), 425-455.
    [17] D. L. Donoho, J. M. Johnstone, Adapting to unknown smoothness via wavelet shrinkage, J. Am. Stat. Assoc., 90 (1995), 1200-1224.
    [18] E. Elhamifar, R. Vidal, Sparse subspace clustering, In: IEEE Conference on Computer Vision and Pattern Recognition, 2009, 2790-2797.
    [19] H. Federer, Curvature measures, T. Am. Math. Soc., 93 (1959), 418-491.
    [20] J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning, Springer, 2001.
    [21] L. Györfi, M. Kohler, A. Krzyżak, H. Walk, A distribution-free theory of nonparametric regression, Springer, 2002.
    [22] N. Halko, P. G. Martinsson, J. A. Tropp, Finding structure with randomness: stochastic algorithms for constructing approximate matrix decompositions, SIAM Rev., 53 (2011), 217-288.
    [23] P. C. Hansen, The truncated SVD as a method for regularization, BIT Numer. Math., 27 (1987), 534-553.
    [24] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24 (1933), 417-441.
    [25] H. Hotelling, Relations between two sets of variates, Biometrika, 28 (1936), 321-377.
    [26] I. T. Jolliffe, A note on the use of principal components in regression, J. R. Stat. Soc. C Appl. Stat., 31 (1982), 300-303.
    [27] G. Karypis, V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput., 20 (1999), 359-392.
    [28] T. Klock, A. Lanteri, S. Vigogna, Estimating multi-index models with response-conditional least squares, Electron. J. Stat., 15 (2021), 589-629.
    [29] S. Kpotufe, k-NN regression adapts to local intrinsic dimension, In: Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011, 729-737.
    [30] S. Kpotufe, S. Dasgupta, A tree-based regressor that adapts to intrinsic dimension, J. Comput. Syst. Sci., 78 (2012), 1496-1515.
    [31] S. Kpotufe, V. K. Garg, Adaptivity to local smoothness and dimension in kernel regression, In: Advances in Neural Information Processing Systems 26 (NIPS 2013), 2013, 3075-3083.
    [32] A. Lanteri, M. Maggioni, S. Vigogna, Conditional regression for single-index models, 2020, arXiv: 2002.10008.
    [33] A. B. Lee, R. Izbicki, A spectral series approach to high-dimensional nonparametric regression, Electron. J. Stat., 10 (2016), 423-463.
    [34] W. Liao, M. Maggioni, Adaptive geometric multiscale approximations for intrinsically low-dimensional data, J. Mach. Learn. Res., 20 (2019), 1-63.
    [35] W. Liao, M. Maggioni, S. Vigogna, Learning adaptive multiscale approximations to data and functions near low-dimensional sets, In: IEEE Information Theory Workshop (ITW), 2016, 226-230.
    [36] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, In: Proceedings of the 26th International Conference on Machine Learning, 2010, 663-670.
    [37] M. Maggioni, S. Minsker, N. Strawn, Multiscale dictionary learning: Non-asymptotic bounds and robustness, J. Mach. Learn. Res., 17 (2016), 1-51.
    [38] S. Mallat, A wavelet tour of signal processing, 2 Eds., Academic Press, 1999.
    [39] K. Pearson, On lines and planes of closest fit to systems of points in space, Philos. Mag., 2 (1901), 559-572.
    [40] S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, 290 (2000), 2323-2326.
    [41] I. Steinwart, D. R. Hush, C. Scovel, Optimal rates for regularized least squares regression, In: The 22nd Annual Conference on Learning Theory, 2009.
    [42] A. Szlam, Asymptotic regularity of subdivisions of Euclidean domains by iterated PCA and iterated 2-means, Appl. Comput. Harmon. Anal., 27 (2009), 342-350.
    [43] J. B. Tenenbaum, V. D. Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, 290 (2000), 2319-2323.
    [44] J. A. Tropp, User-friendly tools for random matrices: An introduction, NIPS version, 2012.
    [45] A. B. Tsybakov, Introduction to nonparametric estimation, Springer, 2009.
    [46] R. Vershynin, Introduction to the non-asymptotic analysis of random matrices, In: Compressed sensing, Cambridge University Press, 2012,210-268.
    [47] R. Vidal, Y. Ma, S. Sastry, Generalized principal component analysis (GPCA), IEEE T. Pattern Anal., 27 (2005), 1945-1959.
    [48] G. B. Ye, D. X. Zhou, Learning and approximation by Gaussians on Riemannian manifolds, Adv. Comput. Math., 29 (2008), 291-310. doi: 10.1007/s10444-007-9049-0
    [49] Z. Zhang, H. Zha, Principal manifolds and nonlinear dimension reduction via local tangent space alignment, SIAM J. Sci. Comput., 26 (2002), 313-338.
    [50] X. Zhou, N. Srebro, Error analysis of Laplacian eigenmaps for semi-supervised learning, In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, 901-908.
  • This article has been cited by:

    1. Andreas Oslandsbotn, Željko Kereta, Valeriya Naumova, Yoav Freund, Alexander Cloninger, StreaMRAK a streaming multi-resolution adaptive kernel algorithm, 2022, 426, 00963003, 127112, 10.1016/j.amc.2022.127112
    2. Alessandro Lanteri, Mauro Maggioni, Stefano Vigogna, Conditional regression for single-index models, 2022, 28, 1350-7265, 10.3150/22-BEJ1482
    3. Abiy Tasissa, Pranay Tankala, James M. Murphy, Demba Ba, K-Deep Simplex: Manifold Learning via Local Dictionaries, 2023, 71, 1053-587X, 3741, 10.1109/TSP.2023.3322820
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)