1. Introduction
High-dimensional data challenge classical statistical models and require new understanding of tradeoffs in accuracy and efficiency. The seemingly quantitative fact of the increase of dimension has qualitative consequences in both methodology and implementation, demanding new ways to break what has been called the curse of dimensionality. On the other hand, the presence of inherent nonuniform structure in the data calls into question linear dimension reduction techniques, and motivates a search for intrinsic learning models. In this paper we explore the idea of learning and exploiting the intrinsic geometry and regularity of the data in the context of regression analysis. Our goal is to build low-dimensional representations of high dimensional functions, while ensuring good generalization properties and fast implementation. In view of the complexity of the data, we allow interesting features to change from scale to scale and from location to location. Hence, we will develop multiscale methods, extending classical ideas of multi-resolution analysis beyond regular domains and to the random sample regime.
In regression, the problem is to estimate a function from a finite set of random samples. The minimax mean squared error (MSE) for estimating functions in the Hölder space C^s([0,1]^D), s>0, is O(n^{-2s/(2s+D)}), where n is the number of samples. The exponential dependence of the minimax rate on D manifests the curse of dimensionality in statistical learning, as n = O(ε^{-(2s+D)/s}) points are generally needed to achieve accuracy ε. This rate is optimal (in the minimax sense), unless further structural assumptions are made [28,32]. If the samples concentrate near a d-dimensional set with d≪D, and the function belongs to a nonuniform smoothness space B_S, with S>s, we may hope to find estimators converging in O(n^{-2S/(2S+d)}). In this quantified sense, we may break the curse of dimensionality by adapting to the intrinsic dimension and regularity of the problem.
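To put the two exponents side by side, the following back-of-the-envelope comparison suppresses all constants and uses the illustrative values s = S = 1, D = 100, d = 5, ε = 10^{-1}, which are not tied to any result below:

\[
n_{\mathrm{ambient}} \approx \varepsilon^{-\frac{2s+D}{s}} = 10^{102},
\qquad
n_{\mathrm{intrinsic}} \approx \varepsilon^{-\frac{2S+d}{S}} = 10^{7}.
\]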
A possible approach to this problem is based on first performing dimension reduction, and then regression in the reduced space. Linear dimension reduction methods include principal component analysis (PCA) [24,25,39], for data concentrating on a single subspace, or subspace clustering [8,9,18,36,47], for a union of subspaces. Going beyond linear models, we encounter isomap [43], locally linear embedding [40], local tangent space alignment [49], Laplacian eigenmaps [2], Hessian eigenmap [15] and diffusion map [12]. Besides the classical Principal Component Regression [26], in [33] the diffusion map is used for nonparametric regression, expanding the unknown function over the eigenfunctions of a kernel-based operator. It is proved that, when the data lie on a d-dimensional manifold, the MSE converges in O(n^{-1/O(d^2)}). This rate depends only on the intrinsic dimension, but does not match the minimax rate in the Euclidean space. If infinitely many unlabeled points are sampled, so that the eigenfunctions are exactly computed, the MSE can achieve optimal rates for Sobolev functions with smoothness parameter at least 1. Similar results hold for regression with the Laplacian eigenmaps [50].
Some regression methods have been shown to automatically adapt to the intrinsic dimension and perform as well as if the intrinsic domain were known. Results in this direction have been established for local linear regression [4], k-nearest neighbors [29], and kernel regression [31], where optimal rates depending on the intrinsic dimension were proved for functions in C^2, C^1, and C^s with s≤1, respectively. Kernel methods such as kernel ridge regression are also known to adapt to the intrinsic dimension [41,48], while suitable variants of regression trees have been proved to attain intrinsic yet suboptimal learning rates [30]. On the other hand, dyadic partitioning estimates with piecewise polynomial regression can cover the whole scale of spaces C^s, s>0 [21], and can be combined with wavelet thresholding techniques to optimally adapt to broader classes of nonuniform regularity [5,6]. However, such estimators are cursed by the ambient dimension D, due to the exponential cardinality of a dyadic partition of the D-dimensional hypercube.
This paper aims at generalizing dyadic partitioning estimates [5,6] to predict functions supported on low-dimensional sets, with optimal performance guarantees and low computational cost. We tie together ideas in classical statistical learning [20,21,45], multi-resolution analysis [12,13,38], and nonlinear approximation [11,16,17]. Our main tool is geometric multi-resolution analysis (GMRA) [1,34,35,37], which is a multiscale geometric approximation scheme for point clouds in high dimensions concentrating near low-dimensional sets. Using GMRA we learn low-dimensional local coordinates at multiple scales, on which we perform a multiscale regression estimate by fitting local polynomials. Inspired by wavelet thresholding techniques [5,6,11], we then compute differences between estimators at adjacent scales, and retain the locations where such differences are large enough. This empirically reveals where higher resolution is required to attain a good approximation, generating a data-driven partition which adapts to the local regularity of the function.
Our approach has several distinctive features:
(i) it is multiscale, and is therefore well-suited for data sets containing variable structural information at different scales;
(ii) it is adaptive, allowing the function to have localized singularities or variable regularity;
(iii) it is entirely data-driven, that is, it does not require a priori knowledge about the regularity of the function, and rather learns it automatically from the data;
(iv) it is provable, with strong theoretical guarantees of optimal performance on large classes of priors;
(v) it is efficient, having straightforward implementation, minor parameter tuning, and low computational cost.
We will prove that, for functions supported on a d-dimensional manifold and belonging to a rich model class characterized by a smoothness parameter S, the MSE of our estimator converges at rate O((log n/n)^{2S/(2S+d)}). This model class contains classical Hölder continuous functions, but further accounts for potential nonuniform regularity. Our results show that, up to a logarithmic factor, we attain the same optimal learning rate as if the function were defined on a known Euclidean domain of dimension d, instead of an unknown manifold embedded in R^D. In particular, the rate of convergence depends on the intrinsic dimension d and not on the ambient dimension D. In terms of computations, all the constructions above can be realized by algorithms of complexity O(n log n), with constants linear in the ambient dimension D and exponential in the intrinsic dimension d.
The remainder of this paper is organized as follows. We conclude this section by defining some general notation and formalizing the problem setup. In Section 2 we review geometric multi-resolution analysis. In Section 3 we introduce our multiscale regression method, establish the performance guarantees, and discuss the computational complexity of our algorithms. The proofs of our results are collected in Section 4, with some technical details postponed to Appendix A.
Notation. f≲g and f≳g mean that there exists a positive constant C, independent of any variable upon which f and g depend, such that f≤Cg and f≥Cg, respectively. f≍g means that both f≲g and f≳g hold. The cardinality of a set A is denoted by #A. For x∈R^D, ‖x‖ denotes the Euclidean norm and B_r(x) denotes the Euclidean ball of radius r centered at x. Given a subspace V ⊂ R^D, we denote its dimension by dim(V) and the orthogonal projection onto V by Proj_V. Let f,g:M→R be two functions, and let ρ be a probability measure supported on M. We define the inner product of f and g with respect to ρ as ⟨f,g⟩ := ∫_M f(x)g(x)dρ. The L² norm of f with respect to ρ is ‖f‖ := (∫_M |f(x)|² dρ)^{1/2}. Given n i.i.d. samples {x_i}_{i=1}^n of ρ, the empirical L² norm of f is ‖f‖_n := ((1/n) ∑_{i=1}^n |f(x_i)|²)^{1/2}. The L^∞ norm of f is ‖f‖_∞ := ess sup |f|. We denote probability and expectation by P and E, respectively. For a fixed M>0, T_M is the truncation operator defined by T_M(x) := min(|x|,M) sign(x). We denote by 1_{j,k} the indicator function of an indexed set C_{j,k} (i.e., 1_{j,k}(x)=1 if x∈C_{j,k}, and 0 otherwise).
Setup. We consider the problem of estimating a function f:M→R given n samples {(xi,yi)}ni=1, where
● M is an unknown Riemannian manifold of dimension d isometrically embedded in RD, with d≪D;
● ρ is an unknown probability measure supported on M;
● {xi}ni=1 are independently drawn from ρ;
● yi=f(xi)+ζi;
● f is bounded, with ‖f‖∞≤M;
● {ζ_i}_{i=1}^n are i.i.d. sub-Gaussian random variables with sub-Gaussian norm ‖ζ_i‖_{ψ_2} ≤ σ, independent of the x_i's.
We wish to construct an estimator f̂ of f minimizing the mean squared error MSE(f̂) := E‖f − f̂‖², where ‖·‖ is the L² norm with respect to ρ and the expectation is taken over the training samples.
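As a purely illustrative instance of this setup (not part of the theory or of the algorithms below), the following sketch samples points on a d-dimensional sphere isometrically embedded in R^D and generates noisy responses; all function names and the choice of f are our own.

import numpy as np

def sample_setup(n, d=2, D=20, sigma=0.1, seed=0):
    """Toy instance of the regression setup: x_i uniform on a d-dimensional
    unit sphere embedded in the first d+1 coordinates of R^D, and
    y_i = f(x_i) + zeta_i with Gaussian (hence sub-Gaussian) noise."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, d + 1))
    X = np.zeros((n, D))
    X[:, : d + 1] = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    f = lambda x: np.sin(3 * x[:, 0]) + x[:, 1] ** 2              # a bounded smooth response
    y = f(X) + sigma * rng.standard_normal(n)
    return X, y

X, y = sample_setup(n=1000)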
2. Geometric multi-resolution analysis
Geometric multi-resolution analysis (GMRA) is an efficient tool to build low-dimensional representations of data concentrating on or near a low-dimensional set embedded in high dimensions. To keep the presentation self-contained, we summarize here the main ideas, and refer the reader to [1,34,37] for further details. Given a probability measure ρ supported on a d-dimensional manifold M ⊂RD, GMRA performs the following steps:
(1). Construct a multiscale tree decomposition T of M into nested cells T:={Cj,k}k∈Kj,j∈Z, where j represents the scale and k the location. Here Kj is a location index set.
(2). Compute a local principal component analysis on each C_{j,k}. Let c_{j,k} be the conditional mean of x on C_{j,k}, and V_{j,k} the d-dimensional principal subspace of C_{j,k}. Define the affine projection P_{j,k}(x) := c_{j,k} + Proj_{V_{j,k}}(x − c_{j,k}).
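For concreteness, step (2) on a single cell can be sketched as follows; the points of the cell and the intrinsic dimension d are assumed given, and the function name is ours, not taken from the GMRA references.

import numpy as np

def local_pca(points, d):
    """Local PCA on the points of a cell C_{j,k}: returns the cell mean c_{j,k},
    an orthonormal basis of the top-d principal subspace V_{j,k}, and the
    affine projector P_{j,k}(x) = c_{j,k} + Proj_{V_{j,k}}(x - c_{j,k})."""
    c = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - c, full_matrices=False)  # principal directions
    V = Vt[:d].T                                               # D x d orthonormal basis
    def P(x):
        return c + (x - c) @ V @ V.T                           # affine projection onto c + V_{j,k}
    return c, V, P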
An ideal multiscale tree decomposition should satisfy assumptions (A1)–(A5) below for all integers j≥j_min:
(A1) For every k∈Kj and k′∈Kj+1, either Cj+1,k′ ⊆Cj,k or ρ(Cj+1,k′∩Cj,k)=0. The children of Cj,k are the cells Cj+1,k′ such that Cj+1,k′ ⊆Cj,k. We assume that 1≤amin≤#{Cj+1,k′:Cj+1,k′ ⊆Cj,k}≤amax for all k∈Kj and j≥jmin. Also, for every Cj,k, there exists a unique k′∈Kj−1 such that Cj,k ⊆Cj−1,k′. We call Cj−1,k′ the parent of Cj,k.
(A2) ρ(M∖⋃k∈KjCj,k)=0, i.e., Λj:={Cj,k}k∈Kj is a partition of M, up to negligible sets.
(A3) There exists θ1>0 such that #Λj≤2jd/θ1.
(A4) There exists θ2>0 such that, if x is drawn from ρ conditioned on Cj,k, then ‖x−cj,k‖≤θ22−j almost surely.
(A5) Let λ_1^{j,k} ≥ λ_2^{j,k} ≥ … ≥ λ_D^{j,k} be the eigenvalues of the covariance matrix Σ_{j,k} of ρ|_{C_{j,k}}, defined in Table 1. Then:
(i) there exists θ_3>0 such that, for every j≥j_min and k∈K_j, λ_d^{j,k} ≥ θ_3 2^{-2j}/d;
(ii) there exists θ_4∈(0,1) such that λ_{d+1}^{j,k} ≤ θ_4 λ_d^{j,k}.
These are natural properties for multiscale partitions generalizing dyadic partitions to non-Euclidean domains [10]. (A1) establishes that the cells constitute a tree structure. (A2) says that the cells at scale j form a partition. (A3) guarantees that there are at most 2^{jd}/θ_1 cells at scale j. (A4) ensures that the diameter of every cell at scale j is bounded by 2^{-j}, up to a uniform constant. (A5)(i) assumes that the best rank-d approximation to the covariance of a cell is close to the covariance matrix of a d-dimensional Euclidean ball, while (A5)(ii) assumes that the cell has significantly larger variance in d directions than in all the remaining ones.
Since all cells at scale j have similar diameter, Λj is called a uniform partition. A master tree T is a tree satisfying the properties above. A proper subtree ˜T of T is a collection of nodes of T with the properties: the root node is in ˜T; if a node is in ˜T, then its parent is also in ˜T. Any finite proper subtree ˜T is associated with a unique partition Λ=Λ(˜T) consisting of its outer leaves, by which we mean those nodes that are not in ˜T, but whose parent is.
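The outer-leaf partition can be read off directly from the tree structure; the sketch below assumes each node stores a list of its children (a hypothetical attribute), and only illustrates the definition.

def outer_leaves(root, proper_subtree):
    """Return the partition associated with a finite proper subtree: the nodes
    that are not in the subtree but whose parent is."""
    partition, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            if child in proper_subtree:
                stack.append(child)       # keep descending inside the subtree
            else:
                partition.append(child)   # outer leaf
    return partition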
In practice, the master tree T is not given. We will construct one by an application of the cover tree algorithm [3] (see [34, Algorithm 3]). In order to make the samples for tree construction and function estimation independent from each other, we split the data in half and use one subset to construct the tree and the other one for local PCA and regression. From now on we index the training data as {(x_i,y_i)}_{i=1}^{2n}, and split them as {(x_i,y_i)}_{i=1}^{2n} = {(x_i,y_i)}_{i=1}^{n} ∪ {(x_i,y_i)}_{i=n+1}^{2n}. Running [34, Algorithm 3] on {x_i}_{i=n+1}^{2n}, we construct a family of cells {Ĉ_{j,k}}_{k∈K_j, j_min ≤ j ≤ j_max} which satisfies (A1)–(A4) with high probability if ρ is doubling*; furthermore, if M is a C^s, s∈(1,∞), d-dimensional closed Riemannian manifold isometrically embedded in R^D, and ρ is the volume measure on M, then (A5) is satisfied as well:
*ρ is doubling if there exists C_1>1 such that C_1^{-1} r^d ≤ ρ(M ∩ B_r(x)) ≤ C_1 r^d for any x∈M and r>0; C_1 is called the doubling constant of ρ. See also [10,14].
Proposition 1 (Proposition 14 in [34]). Assume ρ is a doubling probability measure on M with doubling constant C1. Then, the ˆCj,k's constructed from [34,Algorithm 3] satisfy:
(a1) (A1) with a_max = C_1^2 (24)^d and a_min = 1;
(a2) let M̂ = ⋃_{j=j_min}^{j_max} ⋃_{k∈K_j} Ĉ_{j,k}; for any ν>0,
(a3) (A3) with θ_1 = C_1^{-1} 4^{-d};
(a4) (A4) with θ_2 = 3.
If additionally M is a Cs,s∈(1,∞), d-dimensional closed Riemannian manifold isometrically embedded in RD, and ρ is the volume measure on M, then
(a5)(A5) is satisfied when j is sufficiently large.
Since there are finitely many training points, the constructed master tree has a finite number of nodes. We first build a tree whose leaves contain a single point, and then prune it to the largest subtree whose leaves contain at least d training points. This pruned tree associated with the Ĉ_{j,k}'s is called the data master tree, and denoted by T_n. The Ĉ_{j,k}'s cover M̂, which represents the part of M that has been explored by the data.
Even though assumption (A2) is not exactly satisfied, we claim that (a2) is sufficient for our performance guarantees, for example in the case where ‖f‖∞≤M. Indeed, simply estimating f on M∖ˆM by 0, for any ν>0 we have
In view of these bounds, the rate of convergence on M∖M̂ is faster than the ones we will obtain on M̂. We will therefore assume (A2), thanks to (a2). Also, it may happen that conditions (A3)–(A5) are satisfied at the coarsest scales with very poor constants θ. Nonetheless, it will be clear that in all that follows we may discard a few coarse scales, and only work at scales that are fine enough and for which (A3)–(A5) truly capture in a quantitative way the local geometry of M. Since regression is performed on an independent subset of the data, we can assume, by conditioning, that the Ĉ_{j,k}'s are given and satisfy the required assumptions. To keep the notation simple, from now on we will use C_{j,k} instead of Ĉ_{j,k}, and M in place of M̂, with a slight abuse of notation.
Besides cover trees, there are other methods that can be applied in practice to obtain multiscale partitions, such as METIS [27], used in [1], iterated PCA [42], and iterated k-means. These methods can be computationally more efficient than cover trees, but lead to partitions where the properties (A1)–(A5) are not guaranteed to hold.
After constructing the multiscale tree T, GMRA computes a collection of affine projectors {Pj:RD→RD}j≥jmin. The main objects of GMRA in their population and sample version are summarized in Table 1. Given a suitable partition Λ ⊂T, M can be approximated by the piecewise linear set {Pj,k(Cj,k)}Cj,k∈Λ.
3. Multiscale polynomial regression
Given a multiscale tree decomposition {Cj,k}j,k and training samples {(xi,yi)}ni=1, we construct a family {ˆfj,k}j,k of local estimates of f in two stages: first we compute local coordinates on Cj,k using GMRA outlined above, and then we estimate f|Cj,k by fitting a polynomial of order ℓ on such coordinates. A global estimator ˆfℓΛ is finally obtained by summing the local estimates over a suitable partition Λ. Our regression method is specified in Algorithm 1. The explicit constructions of the constant (ℓ=0) and linear (ℓ=1) local estimators are detailed in Table 2.
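As an illustration of the two stages on a single cell (not the exact Algorithm 1, whose constructions are detailed in Table 2), the sketch below projects the cell's points onto the d local GMRA coordinates and then solves a least-squares problem for ℓ ∈ {0, 1}; the inputs c and V are the cell mean and principal basis from the local PCA step.

import numpy as np

def fit_local_estimator(points, y, c, V, order=1):
    """Fit a constant (order=0) or linear (order=1) polynomial on the local
    coordinates V^T (x - c) of a cell, and return a predictor for that cell."""
    Z = (points - c) @ V                                 # n_{j,k} x d local coordinates
    if order == 0:
        coef = np.array([y.mean()])                      # piecewise-constant fit
    else:
        A = np.hstack([np.ones((len(y), 1)), Z])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # linear least squares
    def predict(x):
        z = np.atleast_2d(x) @ V - c @ V                 # local coordinates of query points
        if order == 0:
            return np.full(z.shape[0], coef[0])
        return coef[0] + z @ coef[1:]
    return predict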
In order to analyze the performance of our method, we introduce the oracle estimator fℓΛ based on the distribution ρ, defined by
and split the MSE into a bias and a variance term:
The bias term is a deterministic approximation error, and will be handled by assuming suitable regularity models for ρ and f (see Definitions 1 and 3). The variance term quantifies the stochastic error arising from finite-sample estimation, and will be bounded using concentration inequalities (see Proposition 2). The role of Λ, encoded in its size #Λ, is crucial to balance (1). We will discuss two possible choices: uniform partitions in Section 3.1, and adaptive partitions at multiple scales in Section 3.2.
3.1. Uniform partitions
A first natural choice for Λ is a uniform partition Λ_j := {C_{j,k}}_{k∈K_j}, j≥j_min. At scale j, f is estimated by f̂^ℓ_{Λ_j} = ∑_{k∈K_j} f̂^ℓ_{j,k} 1_{j,k}. The bias ‖f − f^ℓ_{Λ_j}‖ decays at a rate depending on the regularity of f, which can be quantified as follows:
Definition 1 (model class Aℓs). A function f:M→R is in the class Aℓs for some s>0 with respect to the measure ρ if
where T ranges over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).
We capture the case where the bias is roughly the same on every cell with the following definition:
Definition 2 (model class Aℓ,∞s). A function f:M→R is in the class Aℓ,∞s for some s>0 with respect to the measure ρ if
where T ranges over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).
Clearly Aℓ,∞s ⊂Aℓs. These classes contain uniformly regular functions on manifolds, such as Hölder functions.
Example 1. Let M be a closed smooth d-dimensional Riemannian manifold isometrically embedded in RD, and let ρ be the volume measure on M. Consider a function f:M→R and a smooth chart (U,ϕ) on M. The function ˜f:ϕ(U)→R defined by ˜f(v)=f∘ϕ−1(v) is called the coordinate representation of f. Let λ=(λ1,…,λd) be a multi-index with |λ|:=λ1+…+λd=ℓ. The ℓ-order λ-derivative of f is defined as
Hölder functions Cℓ,α on M with ℓ∈N and α∈(0,1] are defined as follows: f∈Cℓ,α if the ℓ-order derivatives of f exist, and
d(x,z) being the geodesic distance between x and z. We will always assume to work at sufficiently fine scales at which d(x,z) ≍ ‖x−z‖_{R^D}. Note that C^{ℓ,1} is the space of ℓ-times continuously differentiable functions on M with Lipschitz ℓ-order derivatives. We have C^{ℓ,α} ⊂ A^{ℓ,∞}_{ℓ+α} with |f|_{A^{ℓ,∞}_{ℓ+α}} ≤ θ_2^{ℓ+α} d^ℓ |f|_{C^{ℓ,α}}/ℓ!. The proof is in Appendix A.
Example 2. Let M be a smooth closed Riemannian manifold isometrically embedded in R^D, and let ρ be the volume measure on M. Let Ω ⊂ M be such that Γ := ∂Ω is a smooth and closed d_Γ-dimensional submanifold with finite reach†. Let g = a 1_Ω + b 1_{Ω^∁} for some a,b∈R, where 1_S denotes the indicator function of a set S. Then g ∈ A^ℓ_{(d−d_Γ)/2} for every ℓ=0,1,2,…; however, g ∉ A^{ℓ,∞}_s for any s>0. The proof is in Appendix A.
† The reach of M is an important global characteristic of M. Let D(M):={y∈RD:∃! x∈M s.t. ‖x−y‖=infz∈M‖z−y‖}, Mr:={y∈RD:infx∈M‖x−y‖<r}. Then reach(M):=sup{r≥0:Mr ⊂D(M)}. See also [19].
When we take uniform partitions Λ=Λj in (1), the squared bias satisfies
whenever f∈Aℓs, which decreases as j increases. On the other hand, Proposition 2 shows that the variance at the scale j satisfies
which increases as j increases. Choosing the optimal scale j⋆ in the bias-variance tradeoff, we obtain the following rate of convergence for uniform estimators:
Theorem 1. Suppose ‖f‖∞≤M and f∈Aℓs for ℓ∈{0,1} and s>0. Let j⋆ be chosen such that
for μ>0. Then there exist positive constants c:=c(θ1,d,μ) and C:=C(θ1,d,μ) for ℓ=0, or c:=c(θ1,θ2,θ3,d,μ) and C:=C(θ1,θ2,θ3,d,μ) for ℓ=1, such that:
(a) for every ν>0 there is cν>0 such that
where cν:=cν(ν,θ1,d,M,σ,s,μ) for ℓ=0, and cν:=cν(ν,θ1,θ2,θ3,d,M,σ,s,μ) for ℓ=1;
(b) E‖f − f̂^ℓ_{Λ_{j⋆}}‖² ≤ (|f|²_{A^ℓ_s} μ^{s} + c max(M²,σ²)) (log n/n)^{2s/(2s+d)}.
Theorem 1 is proved in Section 4. Note that the rate depends on the intrinsic dimension d instead of the ambient dimension D. Moreover, the rate is optimal (up to logarithmic factors) at least in the case of Cℓ,α functions on M, as discussed in Example 1.
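The exponent in Theorem 1 can be anticipated by a short heuristic balance of the two terms in the bias-variance decomposition, with all constants suppressed:

\[
\underbrace{2^{-2js}}_{\text{squared bias}} \;\asymp\; \underbrace{\frac{\#\Lambda_j \log n}{n} \;\asymp\; \frac{2^{jd}\log n}{n}}_{\text{variance}}
\quad\Longrightarrow\quad
2^{-j^\star} \asymp \left(\frac{\log n}{n}\right)^{\frac{1}{2s+d}},
\qquad
2^{-2j^\star s} \asymp \left(\frac{\log n}{n}\right)^{\frac{2s}{2s+d}}.
\]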
3.2. Adaptive partitions
Theorem 1 is not fully satisfactory for two reasons: (i) the choice of the optimal scale requires knowledge of the regularity of the unknown function; (ii) no uniform scale can be optimal if the regularity of the function varies at different locations and scales. We thus propose an adaptive estimator which learns near-optimal partitions from the data, without knowing the possibly nonuniform regularity of the function. Adaptive partitions are selected by a criterion that determines whether or not a cell should be picked. The quantities involved in this selection are summarized in Table 3, along with their empirical versions.
Δ^ℓ_{j,k} plays the role of the magnitude of a wavelet coefficient in typical wavelet thresholding constructions, and reduces to it in the case of Haar wavelets on Euclidean domains with dyadic partitions. It measures the local difference in approximation between two consecutive scales: a large Δ^ℓ_{j,k} suggests a significant reduction of the error if we refine C_{j,k} into its children. Intuitively, we should truncate the master tree to the subtree containing the nodes where this quantity is large. However, if too few samples fall in a node, then the empirical counterpart Δ̂_{j,k} cannot be trusted. We thus proceed as follows. We set a threshold τ_n decreasing in n, and let T̂_n(τ_n) be the smallest proper subtree of T_n containing all the C_{j,k}'s for which Δ̂_{j,k} ≥ τ_n. Crucially, τ_n may be chosen independently of the regularity of f (see Theorem 2). We finally define our adaptive partition Λ̂_n(τ_n) as the partition associated with the outer leaves of T̂_n(τ_n). The procedure is summarized in Algorithm 2.
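A minimal sketch of the selection rule (with a hypothetical parent attribute on tree nodes, and not the exact Algorithm 2): compute Δ̂_{j,k} on every node of the data master tree, keep the nodes above the threshold, and close the kept set under taking parents so that it forms the smallest proper subtree; its outer leaves, computed as in the sketch of Section 2, give Λ̂_n(τ_n).

import numpy as np

def adaptive_subtree(nodes, delta_hat, kappa, n):
    """Smallest proper subtree of the data master tree containing every node
    whose empirical coefficient delta_hat[node] exceeds tau_n = kappa*sqrt(log(n)/n)."""
    tau_n = kappa * np.sqrt(np.log(n) / n)
    kept = set()
    for node in nodes:
        if delta_hat[node] >= tau_n:
            while node is not None and node not in kept:
                kept.add(node)        # add the node and all of its ancestors
                node = node.parent
    return kept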
To provide performance guarantees for our adaptive estimator, we need to define a proper model class based on oracles. Given any master tree T satisfying assumptions (A1)–(A5) and a threshold τ>0, we let T(τ) be the smallest subtree of T containing all the cells C_{j,k} with Δ^ℓ_{j,k} ≥ τ. The partition made of the outer leaves of T(τ) is denoted by Λ(τ).
Definition 3 (model class Bℓs). A function f:M→R is in the class Bℓs for some s>0 with respect to the measure ρ if
where T varies over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).
In general, the truncated tree T(τ) grows as the threshold τ decreases. For elements of B^ℓ_s, we have control on the growth rate, namely #T(τ) ≲ |f|^p_{B^ℓ_s} τ^{-p} with p = 2d/(2s+d). In the classical case of dyadic partitions of the Euclidean space with the uniform measure, B^ℓ_s is well understood as a nonlinear approximation space containing a scale of Besov spaces [11]. The class B^ℓ_s is indeed rich, and contains in particular A^{ℓ,∞}_s, while additionally capturing functions of nonuniform regularity.
Lemma 1. A^{ℓ,∞}_s ⊂ B^ℓ_s. If f ∈ A^{ℓ,∞}_s, then f ∈ B^ℓ_s and |f|_{B^ℓ_s} ≤ (a_min/θ_1)^{(2s+d)/(2d)} |f|_{A^{ℓ,∞}_s}.
The proof is given in Appendix A.
Example 3. Let g be the function in Example 2. Then g ∈ B^ℓ_{d(d−d_Γ)/(2d_Γ)} for every ℓ=0,1,2,…. Notice that g ∈ A^ℓ_{(d−d_Γ)/2}, so g has a larger regularity parameter s in the B^ℓ_s model than in the A^ℓ_s model.
We will also need a quasi-orthogonality condition ensuring that the functions Wℓj,k representing the approximation difference between two scales are almost orthogonal across scales.
Definition 4. We say that f satisfies quasi-orthogonality of order ℓ with respect to the measure ρ if there exists a constant B_0>0 such that, for any proper subtree S of any tree T satisfying assumptions (A1)–(A5),
The following lemma shows that f∈Bℓs, along with quasi-orthogonality, implies a certain approximation rate of f by fℓΛ(τ) as τ→0. The proof is given in Appendix A.
Lemma 2. If f∈Bℓs∩(L∞∪Aℓt) for some s,t>0, and f satisfies quasi-orthogonality of order ℓ, then
with B_{s,d} := B_0 2^p ∑_{ℓ≥0} 2^{−ℓ(2−p)}.
The main result of this paper is the following performance analysis of adaptive estimators, which is proved in Section 4.
Theorem 2. Let ℓ∈{0,1} and M>0. Suppose ‖f‖_∞ ≤ M and f satisfies quasi-orthogonality of order ℓ. Set τ_n := κ√(log n/n). Then:
(a) For every ν>0 there exists κν:=κν(amax,θ2,θ3,d,M,σ,ν)>0 such that, whenever f∈Bℓs for some s>0 and κ≥κν, there are r,C>0 such that
(b) There exists κ0:=κ0(amax,θ2,θ3,d,M,σ) such that, whenever f∈Bℓs for some s>0 and κ≥κ0, there is ˉC>0 such that
Here r depends on θ2, θ3, amax, d, M, s, |f|Bℓs, σ, B0, ν, κ; C depends on θ2, θ3, amax, amin, d, s, |f|Bℓs, κ; ˉC depends on θ2, θ3, amax, amin, d, M, s, |f|Bℓs, B0, κ.
Theorem 2 is more satisfactory than Theorem 1 for two reasons: (i) the same rate is achieved for a richer model class; (ii) the estimator does not require a priori knowledge of the regularity of the function, since the choice of κ is independent of s.
For a given accuracy ε, in order to achieve MSE ≲ ε², the number of samples we need is n_ε ≳ (1/ε)^{(2s+d)/s} log(1/ε). When s is unknown, we can estimate it as follows: we fix a small n_0, and run Algorithm 2 with 2n_0, 4n_0, …, 2^j n_0, … samples. For each sample size, we evenly split the data into a training set used to build the adaptive estimator, and a test set used to evaluate the MSE. According to Theorem 2, the MSE scales as (log n/n)^{2s/(2s+d)}. Therefore, the slope in the log-log plot of the MSE versus n gives an approximation of −2s/(2s+d). This could be formalized by a suitable adaptation of Lepski's method.
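The slope extraction amounts to a one-line least-squares fit in log-log coordinates; the sketch below assumes arrays of sample sizes and of measured test MSEs from such runs, and ignores the logarithmic factor in the rate.

import numpy as np

def estimate_regularity(sample_sizes, mse_values, d):
    """Fit log(MSE) against log(n); by Theorem 2 the slope approximates
    -2s/(2s+d) (up to the log factor), from which s can be recovered."""
    slope, _ = np.polyfit(np.log(sample_sizes), np.log(mse_values), 1)
    rate = -slope                         # estimate of 2s/(2s+d)
    s_hat = rate * d / (2 * (1 - rate))   # invert rate = 2s/(2s+d)
    return rate, s_hat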
3.3. Computational complexity
The computational cost of Algorithms 1 and 2 may be split as follows:
Tree construction. The cover tree is an online data structure in which a single-point insertion or removal costs at most O(log n). The total computational cost of the cover tree algorithm is C^d D n log n, where C>0 is a constant [3].
Local PCA. At every scale j, we perform local PCA on the training data restricted to C_{j,k} for every k∈K_j using the randomized PCA algorithm [22]. Recall that n̂_{j,k} denotes the number of training points in C_{j,k}. The cost of local PCA at scale j is of order ∑_{k∈K_j} Dd n̂_{j,k} = Ddn, and there are at most c log n scales, where c>0 is a constant, which gives a total cost of cDdn log n.
Multiscale regression. Given n̂_{j,k} training points on C_{j,k}, computing the low-dimensional coordinates π̂_{j,k}(x_i) for all x_i∈C_{j,k} costs Dd n̂_{j,k}, and solving the linear least squares problem (2), where the matrix is of size n̂_{j,k} × d^ℓ, costs at most n̂_{j,k} d^{2ℓ}. Hence, constructing the ℓ-order polynomials at scale j takes ∑_{k∈K_j} (Dd n̂_{j,k} + d^{2ℓ} n̂_{j,k}) = (Dd + d^{2ℓ})n, and there are at most c log n scales, which sums up to c(Dd + d^{2ℓ}) n log n.
Adaptive approximation. We need to compute the coefficients Δ̂_{j,k} for every C_{j,k}, which costs 2(Dd + d^ℓ) n̂_{j,k} on C_{j,k}, and 2c(Dd + d^ℓ) n log n for the whole tree.
In summary, by summing the four contributions above, the total cost of constructing GMRA adaptive estimators of order ℓ is of order (C^d D + Dd + d^{2ℓ}) n log n.
The cost scales linearly with the number of samples n up to a logarithmic factor, and linearly with the ambient dimension D.
4. Proofs
We analyze the error of our estimator by a bias-variance decomposition as in (1). We present the variance estimate in Section 4.1, the proofs for uniform approximations in Section 4.2, and for adaptive approximations in Section 4.3.
4.1. Variance estimate
The following proposition bounds the variance of 0-order (piecewise constant) and 1st-order (piecewise linear) estimators over an arbitrary partition Λ.
Proposition 2. Suppose ‖f‖∞≤M and let ℓ∈{0,1}. For any partition Λ, let fℓΛ and ˆfℓΛ be the optimal approximation and the empirical estimators of order ℓ on Λ, respectively. Then, for every η>0,
(a) \mathbb{P}\left\{\|f_\Lambda^\ell - \widehat{f}_\Lambda^\ell\| > \eta\right\} \le \begin{cases} C_0\, \#\Lambda\, \exp\left(-\frac{n\eta^2}{c_0\max(M^2,\sigma^2)\,\#\Lambda}\right) & \text{for } \ell = 0,\\ C_1 d\, \#\Lambda\, \exp\left(-\frac{n\eta^2}{c_1\max(d^4 M^2,\, d^2\sigma^2)\,\#\Lambda}\right) & \text{for } \ell = 1; \end{cases}
(b) \mathbb{E}\|f_\Lambda^\ell - \widehat{f}_\Lambda^\ell\|^2 \le \begin{cases} c_0\max(M^2,\sigma^2)\, \frac{\#\Lambda\,\log(C_0\#\Lambda)}{n} & \text{for } \ell = 0,\\ c_1\max(d^4 M^2,\, d^2\sigma^2)\, \frac{\#\Lambda\,\log(C_1 d\,\#\Lambda)}{n} & \text{for } \ell = 1, \end{cases}
for some absolute constants c_0, C_0 and some c_1, C_1 depending on \theta_2, \theta_3 .
Proof. Since f_\Lambda ^\ell and \widehat{f}_\Lambda ^\ell are bounded by M , we define \Lambda ^{-} : = \{ C_{j, k}\in\Lambda :\rho(C_{j, k}) \le \frac{\eta^2}{4M^2\#\Lambda } \} , and observe that
We then restrict our attention to \Lambda ^{+} : = \Lambda \setminus \Lambda ^{-} and apply Lemma 5 with t = \frac{\eta}{\sqrt{\rho(C_{j, k})\#\Lambda }} . This leads to (a), while (b) follows from (a) by integrating over \eta > 0 .
4.2. Proof of Theorem 1
Notice that \#\Lambda _j \le 2^{jd}/\theta_1 by (A3). By choosing j^\star such that 2^{-j^\star} = \mu \left(\frac{\log n}{n}\right)^{\frac{1}{2s+d}} for some \mu > 0 , we have
Moreover, by Proposition 2,
provided that \frac{\theta_1 \mu^d c_\nu^2}{c_0\max(M^2, \sigma^2)} - \frac{d}{2s+d} > \nu for \ell = 0 and \frac{\theta_1 \mu^d c_\nu^2}{c_1\max(d^4 M^2, d^2\sigma^2)} - \frac{d}{2s+d} > \nu for \ell = 1 .
4.3. Proof of Theorem 2
We begin by defining several objects of interest:
● \mathcal{T}_n : the data master tree whose leaves contain at least d points of training data. It can be viewed as the part of a multiscale tree that our training data have explored. Notice that
● \mathcal{T} : a complete multiscale tree containing \mathcal{T}_n .
\mathcal{T} can be viewed as the union of \mathcal{T}_n and some empty cells, mostly at fine scales with high probability, that our data have not explored.
● \mathcal{T}(\tau) : the smallest subtree of \mathcal{T} which contains \{C_{j, k} \in \mathcal{T} \, :\, \Delta_{j, k}^{\ell} \ge \tau\} .
● \mathcal{T}_n(\tau) : = \mathcal{T}(\tau) \cap \mathcal{T}_n .
● \widehat{\mathcal{T}}_n(\tau) : the smallest subtree of \mathcal{T}_n which contains \{C_{j, k} \in \mathcal{T}_n\, :\, \widehat{\Delta}_{j, k} \ge \tau\} .
● \Lambda (\tau) : the adaptive partition associated with \mathcal{T}(\tau) .
● \Lambda _n(\tau) : the adaptive partition associated with \mathcal{T}_n(\tau) .
● \widehat{\Lambda }_n(\tau) : the adaptive partition associated with \widehat{\mathcal{T}}_n(\tau) .
● Suppose \mathcal{T}^0 and \mathcal{T}^1 are two subtrees of \mathcal{T} . If \Lambda ^0 and {\Lambda }^1 are two adaptive partitions associated with \mathcal{T}^0 and \mathcal{T}^1 respectively, we denote by \Lambda ^0 \vee \Lambda ^1 and \Lambda ^0 \wedge \Lambda ^1 the partitions associated to the trees \mathcal{T}^0 \cup \mathcal{T}^1 and \mathcal{T}^0 \cap \mathcal{T}^1 respectively.
● Let b = 2a_{\max}+5 , where a_{\max} is the maximal number of children that a node has in \mathcal{T}_n .
Inspired by the analysis of wavelet thresholding procedures [5,6], we split the error into four terms,
where
The goal of the splitting above is to handle the bias and variance separately, as well as to deal with the fact that the partition built from those C_{j, k} such that \widehat{\Delta}_{j, k} \ge \tau_n does not coincide with the partition which would be chosen by an oracle based on those C_{j, k} such that \Delta_{j, k}^{\ell} \ge \tau_n . This is accounted for by the terms e_2 and e_4 , which correspond to those C_{j, k} such that \widehat{\Delta}_{j, k} is significantly larger or smaller than \Delta_{j, k}^{\ell} respectively, and which will be proved to be small in probability. The terms e_1 and e_3 correspond to the bias and variance of oracle estimators based on partitions obtained by thresholding the unknown oracle change in approximation \Delta_{j, k}^{\ell} .
Since \widehat{\Lambda }_n(\tau_n)\vee \Lambda _n(b\tau_n) is a finer partition than \Lambda _n(b\tau_n) , we have
The e_{11} term is treated by a deterministic estimate based on the model class \mathcal{B}_{s}^{\ell} : by Lemma 2 we have
The term e_{12} accounts for the error on the cells that have not been explored by our training data, which is small:
According to (4), we have (\Delta_{j, k}^{\ell})^2 \le 4\|f\|_\infty^2 \rho(C_{j, k}) . Then every C_{j, k} with \Delta_{j, k}^{\ell} \ge b\tau_n satisfies \rho(C_{j, k}) \gtrsim \frac{b^2\kappa^2}{\|f\|_\infty^2} (\log n / n) . Hence, provided that n satisfies \frac{b^2\kappa^2}{\|f\|_\infty^2} \log n \gtrsim 2d , we have
where the last inequality follows from Lemma 3(b). Therefore, by Definition 3 we obtain
as long as \frac{3b^2\kappa^2}{28\|f\|_\infty^2} - 1 > \nu . To estimate \mathbb{E} e_{12}^2 , we observe that, thanks to Lemma 6,
Hence, by choosing \nu = 1 > \tfrac{2s}{2s+d} we get
The term e_3 is the variance term, which can be estimated by Proposition 2 with \Lambda = \widehat{\Lambda }_n(\tau_n)\wedge \Lambda _n(\tau_n/b) . We plug in \eta = r (\log n / n)^{\frac{s}{2s+d}} . Bounding \#\Lambda by \# \Lambda _n(\tau_n/b) \le \#\Lambda _n \le n /d (as our data master tree has at least d points in each leaf) outside the exponential, and by \# \Lambda _n(\tau_n/b) \le \# \Lambda (\tau_n/b) \le |f|_{\mathcal{B}_{s}^{\ell}}^{p} (\tau_n/b)^{-p} inside the exponential, we get the following estimates for e_3 :
where C_0 = C_0(\theta_2, \theta_3, {{a}_{\text{max}}}, d, s, |f|_{\mathcal{B}_{s}^{\ell}}, \kappa) and C_1 = C_1(\theta_2, \theta_3, {{a}_{\text{max}}}, d, s, |f|_{\mathcal{B}_{s}^{\ell}}, \kappa) . We obtain \mathbb{P}\{e_3 > r(\log n /n)^{\frac{s}{2s+d}}\} \le C n^{-\nu} as long as r is chosen large enough to make the exponent smaller than -\nu .
To estimate \mathbb{E} e_3^2 , we apply again Proposition 2, now with \#\Lambda \le |f|_{\mathcal{B}_{s}^{\ell}}^{p} (b/\kappa)^{p} (\log n / n)^{-\frac{d}{2s+d}} , obtaining
Next we estimate e_2 and e_4 . Since \widehat{\mathcal{T}}_n(\tau_n) \cap \mathcal{T}_n(\tau_n/b) \subseteq \widehat{\mathcal{T}}_n(\tau_n) \cup \mathcal{T}_n(b\tau_n) and \mathcal{T}_n(b\tau_n) \subseteq \mathcal{T}_n(\tau_n/b) , we have e_2 > 0 if and only if there is a C_{j, k} \in \mathcal{T}_n such that either C_{j, k} is in \widehat{\mathcal{T}}_n(\tau_n) but not in \mathcal{T}_n(\tau_n/b) , or C_{j, k} is in \mathcal{T}_n(b\tau_n) but not in \widehat{\mathcal{T}}_n(\tau_n) . This means that either \widehat{\Delta}_{j, k} \ge \tau_n but \Delta_{j, k}^{\ell} < \tau_n/b , or \Delta_{j, k}^{\ell} \ge b\tau_n but \widehat{\Delta}_{j, k} < \tau_n . As a consequence,
and analogously
We can now apply Lemma 7: we use (b) with \eta = \tau_n/b , and (a) with \eta = \tau_n . We obtain that
We have \mathbb{P}\{e_2 > 0 \} + \mathbb{P}\{ e_4 > 0\} \le C n^{-\nu} provided that \kappa is chosen such that the exponents are smaller than -\nu .
We are left to deal with the expectations. As for e_{2} , Lemma 6 implies e_2 \lesssim M , which gives, for \nu = 1 > \frac{2s}{2s+d} ,
The same bound holds for e_4 , which concludes the proof of Theorem 2.
4.4. Basic concentration inequalities
This section contains the main concentration inequalities relating the empirical quantities to their oracles. For piecewise linear estimators, some quantities used in Lemma 5 are decomposed in Table 4. All proofs are collected in Appendix A.
Lemma 3. For every t > 0 we have:
(a) \mathbb{P}\left\{\left|\rho(C_{j, k})-\widehat{\rho}(C_{j, k})\right| > t \right\} \le 2\exp\left(\frac{-3nt^2}{6\rho(C_{j, k}) +2t}\right) ;
(b) setting t = \frac 1 2 \rho(C_{j, k}) in (a) yields \mathbb{P}\left\{|\rho(C_{j, k})-\widehat{\rho}(C_{j, k})| > \frac{1}{2}\rho(C_{j, k}) \right\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) ;
(c) \mathbb{P}\left\{\|c_{j, k}-\widehat{c}_{j, k}\| > t \right\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) + 8\exp\left(-\frac{3 n\rho(C_{j, k}) t^2}{12\theta_2^2 2^{-2j}+4\theta_2 2^{-j}t} \right) ;
(d) \mathbb{P} \{\|\Sigma_{j, k}-\widehat{\Sigma}_{j, k}\| > t \} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) + \left(\frac{4\theta_2^2 }{\theta_3}d+8\right) \exp\left(\frac{-3 n\rho(C_{j, k}) t^2}{96\theta_2^4 2^{-4j} +16\theta_2^2 2^{-2j}t} \right).
Lemma 4. We have:
(a) \mathbb{P} \left\{\|Q_{j, k} - \widehat{Q}_{j, k}\| > \frac{48}{\theta_3^2} d^2\, 2^{4j} \|\Sigma_{j, k} - \widehat{\Sigma}_{j, k}\|\right\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) + \left(4\frac{\theta_2^2}{\theta_3}d + 10\right) \exp\left(-\frac{n\rho(C_{j, k})}{512(\theta_2^2/\theta_3)^2 d^2 + \frac{64}{3}(\theta_2^2/\theta_3) d}\right) .
(b) \mathbb{P} \left\{\|\widehat{Q}_{j, k}\| > \frac{2}{\theta_3} d\, 2^{2j}\right\} \le 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) + \left(4\frac{\theta_2^2}{\theta_3}d + 10\right) \exp\left(-\frac{n\rho(C_{j, k})}{128(\theta_2^2/\theta_3)^2 d^2 + \frac{32}{3}(\theta_2^2/\theta_3) d}\right) .
(c) Suppose f is in L^\infty . For every t > 0 , we have
\mathbb{P}\left\{ \|r_{j, k} - \widehat{r}_{j, k}\| > t \right\} \le 2\exp\left(-\tfrac{3}{28}n\rho(C_{j, k})\right) + 8 \exp\left(\tfrac{- n\rho(C_{j, k}) t^2}{4 \langle\theta_2\rangle^2 \|f\|_\infty^2 2^{-2j}+2 \langle\theta_2\rangle \|f\|_\infty 2^{-j} t} \right) + 2 \exp\left(- c\tfrac{n\rho(C_{j, k})t^2}{\theta_2^2 \|\zeta\|_{\psi_2}^2 2^{-2j}} \right),
where c is an absolute constant.
Lemma 5. Suppose f is in L^\infty . For every t > 0 , we have
where c_0, C_0 are absolute constants, c_0' depends on \theta_2 , and c_1, C_1 depend on \theta_2, \theta_3 .
Lemma 6. Suppose f \in L^\infty . For every C_{j, k} \in \mathcal{T} and C_{j', k'} \subset C_{j, k} ,
Lemma 7. Suppose f is in L^\infty . For every \eta > 0 and any \gamma > 1 , we have
(a) \mathbb{P}\left\{\widehat{\Delta}_{j, k}^{\ell} < \eta \ \&\ \Delta_{j, k}^{\ell} \ge (2a_{\max}+5)\eta\right\} \le \begin{cases} C_0 \exp\left(-\frac{n\eta^2}{c_0 \max\{\|f\|_\infty^2,\, \|\zeta\|_{\psi_2}^2\}}\right) & \text{for } \ell = 0,\\ C_1 d \exp\left(-\frac{n\eta^2}{c_1 \max\{d^4\|f\|_\infty^2,\, d^2\|\zeta\|_{\psi_2}^2\}}\right) & \text{for } \ell = 1; \end{cases}
(b) \mathbb{P}\left\{\Delta_{j, k}^{\ell} < \eta \ \&\ \widehat{\Delta}_{j, k}^{\ell} \ge (2a_{\max}+5)\eta\right\} \le \begin{cases} C_0 \exp\left(-\frac{n\eta^2}{c_0 \max\{\|f\|_\infty^2,\, \|\zeta\|_{\psi_2}^2\}}\right) & \text{for } \ell = 0,\\ C_1 d \exp\left(-\frac{n\eta^2}{c_1 \max\{d^4\|f\|_\infty^2,\, d^2\|\zeta\|_{\psi_2}^2\}}\right) & \text{for } \ell = 1; \end{cases}
C_0, c_0 depend on a_{\max} ; c_0' depends on a_{\max}, \theta_2 ; C_1, c_1 depend on a_{\max}, \theta_2, \theta_3 .
Acknowledgements
This work was partially supported by NSF-DMS-125012, AFOSR FA9550-17-1-0280, NSF-IIS-1546392. The authors thank Duke University for donating computing equipment used for this project.
Conflict of interest
The authors declare no conflict of interest.
A. Additional proofs
Example 1. Let f \in \mathcal{C}^{\ell, \alpha} . The local estimator {f}_{j, k}^\ell minimizes \|(f - p){{\mathbf{1}}_{j,k}}\| over all possible polynomials p of order less than or equal to \ell . Thus, in particular, we have \|(f - {f}_{j, k}^\ell){{\mathbf{1}}_{j,k}}\| \le \| (f - p) {{\mathbf{1}}_{j,k}} \| where p is equal to the \ell -order Taylor polynomial of f at some z \in C_{j, k} . Hence, for x \in C_{j, k} there is \xi \in \mathcal{M} \cap B_{\theta_2 2^{-j}}(z) such that
Therefore, for every j and k \in \mathcal{K}_j , we have
Examples 2 and 3. For polynomial estimators of any fixed order \ell = 0, 1, \ldots , g^\ell_{j, k} - g{{\mathbf{1}}_{j,k}} = 0 when C_{j, k} \cap \Gamma = \varnothing , and g^\ell_{j, k} - g{{\mathbf{1}}_{j,k}} = {O}(1) when C_{j, k} \cap \Gamma \neq \varnothing . At the scale j , \rho(C_{j, k}) \approx 2^{-jd} and \rho(\cup\{C_{j, k}: C_{j, k}\cap \Gamma \neq \varnothing\}) \approx 2^{-j(d-d_\Gamma)}\rho(\Gamma) . Therefore,
which implies g\in \mathcal{A}^{\ell}_{(d-d_\Gamma)/{2}} .
In adaptive approximations, \Delta_{j, k}^{\ell} = 0 when C_{j, k} \cap \Gamma = \varnothing . When C_{j, k} \cap \Gamma \neq \varnothing , \Delta_{j, k}^{\ell} = \|g^{\ell}_{j, k} - \sum_{C_{j+1, k'}\subset C_{j, k}} g^{\ell}_{j+1, k'}\| \lesssim \sqrt{\rho(C_{j, k})} \lesssim 2^{-jd/2} . Given any fixed threshold \tau > 0 , in the truncated tree \mathcal{T}(\tau) , the leaf nodes intersecting with \Gamma satisfy 2^{-jd/2} \gtrsim \tau . In other words, around \Gamma the tree is truncated at a scale coarser than j^\star , where 2^{-j^\star} = {O}(\tau^{\frac 2 d}) . The cardinality of \mathcal{T}(\tau) is dominated by the nodes intersecting \Gamma , so
which implies p = 2d_\Gamma/d . We conclude that g\in \mathcal{B}^{\ell}_s with s = \frac{d(2-p)}{2p} = \frac{d}{d_\Gamma} (d-d_\Gamma)/2.
Lemma 1. By definition, we have \|(f-{f}_{j, k}^\ell)\mathbf{1}_{j,k}\| \le |f|_{\mathcal{A}_{s}^{\ell, \infty}} 2^{-js} \sqrt{\rho(C_{j, k})} as long as f \in \mathcal{A}_{s}^{\ell, \infty} . By splitting (\Delta^\ell_{j, k})^2 \le 2\|(f - {f}_{j, k}^\ell)\mathbf{1}_{j,k}\|^2 + 2\sum_{k':C_{j+1, k'}\subset C_{j, k}} \|(f - f^\ell_{j+1, k'})\mathbf{1}_{j+1, k'}\|^2 , we get
In the selection of adaptive partitions, every C_{j, k} with \Delta_{j, k}^{\ell} \ge \tau must satisfy \rho(C_{j, k}) \ge 2^{2js} (\tau/|f|_{\mathcal{A}_{s}^{\ell, \infty}})^2 . With the extra assumption \rho(C_{j, k}) \le \theta_0 2^{-jd} (true when the measure \rho is doubling), we have
Therefore, every cell in \Lambda (\tau) will be at a coarser scale than j^\star with j^\star satisfying (3). Using (A3) we thus get
which yields the result.
Lemma 2. For any partition \Lambda \subset \mathcal{T} , denote by \Lambda ^{l} the l -th generation partition such that \Lambda ^0 = \Lambda and \Lambda ^{l+1} consists of the children of \Lambda ^l . We first prove that \lim_{l\to\infty} f^\ell_{\Lambda ^l} = f in L^2(\mathcal{M}) . Suppose f \in L^\infty . Notice that \| f^\ell_{\Lambda ^l} - f \| \le \| f_{\Lambda ^l}^0 - f \| . As a result of the Lebesgue differentiation theorem, f_{\Lambda ^l}^0 \to f almost everywhere. Since f is bounded, f_{\Lambda ^l}^0 is uniformly bounded, hence f_{\Lambda ^l}^0 \to f in L^2(\mathcal{M}) by the dominated convergence theorem. In the case where f \in \mathcal{A}_t^\ell , taking the uniform partition \Lambda _{j(l)} at the coarsest scale of \Lambda ^l , denoted by j(l) , we have \|f - f^\ell_{\Lambda ^l}\| \le \|f - f^\ell_{\Lambda _{j(l)}} \| \lesssim 2^{-j(l)t} , and therefore f^\ell_{\Lambda ^l} \to f in L^2(\mathcal{M}) .
Now, setting \Lambda = \Lambda (\tau) and \mathcal{S} : = \mathcal{T}(\tau)\setminus\Lambda , by Definitions 3 and 4 we get
which yields the first inequality in Lemma 2. The second inequality follows by observing that 2-p = \frac{2s}{d} p and |f|_{\mathcal{B}_{s}^{\ell}}^p \tau^{2-p} = |f|_{\mathcal{B}_{s}^{\ell}}^2 (|f|_{\mathcal{B}_{s}^{\ell}}^{-p} \tau^p)^{\frac{2s}{d}} \le |f|_{\mathcal{B}_{s}^{\ell}}^2 \#\left[\mathcal{T}(\tau)\right]^{-\frac{2s}{d}} by Definition 3.
Lemma 3. See [34].
Lemma 4. (a). Thanks to [23,Theorem 3.2] and assumption (A5), we have
Hence, the bound follows by applying Lemma 3(d) with t = \frac{\theta_3}{4d}2^{-2j} .
(b). Observe that \|\widehat{Q}_{j, k}\| \le \|[\widehat{\Sigma}_{j, k}]_d^{\dagger}\| = (\hat{\lambda }_{d}^{j,k})^{-1} . Moreover, \hat{\lambda }_{d}^{j,k} \ge {\lambda }_{d}^{j,k} - |{\lambda }_{d}^{j,k} - \hat{\lambda }_{d}^{j,k}| \ge \frac{\theta_3}{d}2^{-2j} - \|\Sigma_{j, k} - \widehat{\Sigma}_{j, k}\| by assumption (A5). Thus, using Lemma 3(d) with t = \frac{\theta_3}{2d}2^{-2j} yields the result.
(c). We condition on the event that \widehat{n}_{j,k} \ge \frac{1}{2} \mathbb{E}\widehat{n}_{j,k} = \frac{1}{2}n\rho(C_{j, k}) , whose complement occurs with probability lower than 2\exp\left(-\frac{3}{28}n\rho(C_{j, k})\right) by Lemma 3(b). The quantity \| {r}_{j, k} - \widehat{r}_{j, k} \| is bounded by A + B + C + D with
Each term of the sum in A has expectation 0 and bound 2\langle\theta_2\rangle \|f\|_\infty 2^{-j} . Thus, applying the Bernstein inequality [44,Corollary 7.3.2] we obtain
B is bounded by \|f\|_\infty \|c_{j, k} - \widehat{c}_{j, k}\| so that, using Lemma 3(c) with t replaced by t / \|f\|_\infty , we get
To estimate C we appeal to [7, Theorem 3.1, Remark 4.2]. For X \in \mathbb{R}^n , take G(X) := \|MX\| with M := \left[\begin{array}{ccc} x_{1}-c_{j, k} & \cdots & x_{n}-c_{j, k} \\ 2^{-j} & \cdots & 2^{-j} \end{array}\right] . Then |\partial_i G(X)| \le \|x_i - c_{j, k}\| \le \theta_2 2^{-j} . Now let X = (\zeta_1 \mathbf{1}_{j,k}(x_1), \dots, \zeta_n \mathbf{1}_{j,k}(x_n))^T , so that C = G(X) / \widehat{n}_{j,k} . Since the \zeta_i 's are independent, [7, Remark 4.2] applies, and it yields \mathbb{P} \left\{ G(X) > t \right\} \le 2 \exp\left(- \frac{t^2}{2 \sigma^2} \right) , where \sigma^2 = \sum_{i = 1}^n \|\partial_iG\|_\infty^2 \|\zeta_i\|_{\psi_2}^2 \mathbf{1}_{j,k}(x_i) \le \widehat{n}_{j,k} \theta_2^2 2^{-2j} \|\zeta\|_{\psi_2}^2 , and thus
We are left with D . This term is smaller than \|c_{j, k} - \widehat{c}_{j, k}\| \left| \frac{1}{\widehat{n}_{j,k}} \sum_{i = 1}^{n} \zeta_i {{\mathbf{1}}_{j,k}}(x_i) \right| , where, by Lemma 3(c), \|c_{j, k} - \widehat{c}_{j, k}\| \le \theta_2 2^{-j} with probability higher than 1 - 10 \exp\left(- \frac{3}{28} n \rho(C_{j, k}) \right) . Hence, by the standard sub-Gaussian tail inequality [46,Proposition 5.10] we have
This completes the proof.
Lemma 5. If \ell = 0 , then \| f_{j, k}^\ell - \widehat{f}_{j, k}^\ell \|_\infty = | y_{j,k} - \widehat{y}_{j,k} | , which is less than
Each addend in the first term has expectation 0 and bound 2 \|f\|_\infty , and therefore we can apply the standard Bernstein inequality [44,Theorem 1.6.1]. As for the second term, we use the standard sub-Gaussian tail inequality [46,Proposition 5.10]. This yields the bounds for \ell = 0 .
For \ell = 1 , we have
where the last inequality holds with high probability thanks to Lemma 4(a)(b). Thus, applying Lemma 3(c)(d) and Lemma 4(c) with t replaced by \frac{t}{\theta d M 2^j} , \frac{t}{\theta d^2 M 2^{2j}} and \frac{t}{\theta d 2^j} , we obtain the desired result.
Lemma 6. Follows simply by truncation.
Lemma 7. We start with (a). Defining \overline{\Delta }_{j,k}^{{\ell}} : = \| W_{j, k}^\ell \|_n we have
The first quantity can be bounded by
so that
We now condition on the event that |\rho(C_{j, k}) - \widehat{\rho}(C_{j, k})| \le \frac{1}{2}\rho(C_{j, k}) , which entails \widehat{\rho}(C_{j, k}) \le \frac{3}{2} \rho(C_{j, k}) , and apply Lemma 5 with t \lesssim \eta/\sqrt{\rho(C_{j, k})} . The probability of the complementary event is bounded by Lemma 3(b). To get rid of the remaining \rho(C_{j, k}) 's inside the exponentials, we lower bound \rho(C_{j, k}) as follows. We have
Thus, \Delta_{j, k}^{\ell} \ge (2a_{\max}+5)\eta implies \rho(C_{j, k}) \ge \frac{(2a_{\max}+5)^2\eta^2}{4\|f\|_\infty^2} . Therefore, we obtain that
where C_0, c_0 depend on a_{\max} , and C_1, c_1 depend on a_{\max}, \theta_2, \theta_3 .
Next we estimate \mathbb{P} \left\{ \Delta_{j, k}^{\ell} - 2\overline{\Delta }_{j,k}^{{\ell}} \ge \eta \right\} by [21, Theorem 11.2]. Notice that for all x\in \mathcal{M} , |W_{j, k}^\ell(x)| \lesssim \|f\|_\infty . If x \notin C_{j, k} , then W_{j, k}(x) = 0 , otherwise there is k' such that x \in C_{j+1, k'} \subset C_{j, k} . In such a case, |W_{j, k}(x)| = | {f}_{j, k}(x) - f_{j+1, k'}(x) | , and the claim follows from Lemma 6. Thus, [21, Theorem 11.2] gives us
where c is an absolute constant.
Let us turn to (b). We first observe that
To see this, note again that \widehat{W}_{j, k}(x) \ne 0 only when x \in C_{j+1, k'} \subset C_{j, k} for some k' , in which case |\widehat{W}_{j, k}(x)| = | \widehat{f}_{j, k}(x) - \widehat{f}_{j+1, k'}(x) | and we can apply Lemma 6. Now recall that b = 2a_{\max}+5 . We have
The first probability can be estimated similarly to how we did for (a). Thanks to Lemma 3(a), the second probability is bounded by
for an absolute constant c . Finally, the third probability is zero thanks to (5).