
Single-Index Models in the High Signal Regime

Ashwin Pananjady⋆ and Dean P. Foster†,‡

⋆Department of Electrical Engineering and Computer Sciences, UC Berkeley
†Department of Statistics, Wharton School, University of Pennsylvania

‡Amazon NYC

Abstract

A single-index model is given by y = g∗(〈x, θ∗〉) + ε: The scalar response y depends on the covariate vector x both through an unknown (vector) parameter θ∗ as well as an unknown, non-parametric, univariate link function g∗ ∈ G. We study the problem of recovering the parameter θ∗ from i.i.d. samples of the model when the covariates are drawn from a normal distribution. Our focus is on leveraging information about the (known) function class G in order to design a procedure that adapts to the noise level in the problem, thereby reducing the bias of parameter estimation. We show that when given access to a natural “labeling oracle”, our procedure recovers the underlying parameter at a rate that depends explicitly on how well we are able to estimate a suitably defined “inverse” link function. Both the procedure and its analysis framework are flexible, admitting any black-box estimator for the inverse link function. The resulting rate of parameter estimation significantly improves upon the risk of classical semi-parametric procedures whenever consistent estimates of the inverse link function can be obtained. When the function class G is appropriately structured and empirical risk minimization is used to estimate the inverse function, we provide quantitative upper bounds on the risk that depend on natural complexity measures of the class of inverse functions.

We specialize our framework to the case where G is a sub-class of monotone single-index models, showing a computationally efficient, end-to-end algorithm that achieves very fast rates of parameter estimation in the regime in which the signal-to-noise ratio in the problem is large. We also pay particular attention to parameter identifiability in the noiseless model, deriving sharper upper bounds as well as information-theoretic lower bounds. Consequences for the (real) phase retrieval problem are also discussed.

1 Introduction

In classical non-parametric regression, we are interested in modeling the relationship between a p-dimensional covariate x and a scalar response y through a function f : Rp → R that satisfies some regularity conditions. However, standard non-parametric function classes in high dimensions are extremely expressive and require prohibitively many samples—exponential in the dimension—to learn (see e.g., Tsybakov [Tsy09]). A popular dimensionality reduction technique is to assume that the model is in fact semi-parametric, and that the function f is given by the composition of a lower-dimensional function h : Rk → R with a linear model. Formally, we have

f(x) = h(〈θ1, x〉, 〈θ2, x〉, . . . , 〈θk, x〉),

where the p-dimensional regression coefficients θ1, . . . , θk span a k-dimensional subspace, with k ≪ p. Such models are called multi-index models, since the functional relationship can be captured by a few indices that represent particular directions of the covariate space.


In this paper, we focus on the special case of the multi-index model where k = 1, which results in the single-index model

y = g∗(〈θ∗, x〉) + ε, (1)

or SIM for short. Here g∗ is a univariate, non-parametric link function, θ∗ ∈ Rp is the salient linear predictor, and ε is a random variable independent of everything else that captures the noise in the modeling process. The model (1) between the covariate and response should be seen as one of the most basic forms of non-linear dimensionality reduction, and as a step towards the broader goal of representation learning or feature engineering. In order to facilitate a concrete theoretical study, we also assume in this paper that the covariates x are drawn from a normal distribution, and that the noise ε is sub-Gaussian with parameter σ; these are standard assumptions in many parts of the literature [Li92; Bri83; DH18].

As stated, the single-index model (1) is classical, and there is an extensive body of literature spanning the statistics, econometrics, and geometric functional analysis communities that is dedicated to studying many aspects of this model. We provide an extensive survey of this literature in Section 1.2 to follow. For now, let us focus on the recent paper by Plan and Vershynin [PV16], which studies the problem under further geometric constraints on the parameter θ∗ and represents, to an extent, the state-of-the-art progress on this problem. Upon analyzing a moment-based method—whose roots go back to the classical work of Brillinger [Bri83]—for recovering the “signal” θ∗, they point out that it is not necessary to explicitly model the non-linear link function g∗. To quote portions of their text:

“This leads to the intriguing conclusion that in the high noise regime, an unknown non-linearity in the observations does not significantly reduce one’s ability to determine the signal... even when the non-linearity is not explicitly modeled.”

This surprising claim is somewhat counter-intuitive: after all, obtaining a “good” model for the function g∗ should help in the estimation task, and this intuition has largely guided the extensive sub-field of generalized linear modeling [MN89], in which we assume the function g∗ is known exactly. More generally, there ought to exist a trade-off between the approximation and estimation errors for this class of problems: on the one hand, we incur a certain approximation error (or bias) by treating the unknown function as linear, and our estimation error (or variance) behaves as though the true model is linear. The results of Plan and Vershynin—and of many other preceding papers in this general area—are intriguing because they show that for a large enough noise level, and provided that the function g∗ is not “orthogonal” to the class of linear functions, a biased estimator for the parameter achieves error that is optimal up to a constant factor, since the bias is of a smaller order than the variance.

On the other hand, one could instead ask what happens in the low noise, or high signal regime¹, when the errors made due to modeling the non-linear function as linear are no longer of the same order as the noise. Indeed, such a question is motivated by applications in which we often have significant side-information that allows us to posit some function class G to which g∗ belongs. By building better models for the non-linearity, it would stand to reason that the bias can be reduced and finally eliminated when G ∋ g∗. The major motivation for this paper is to understand this

¹A natural measure of signal-to-noise ratio in the problem is given by ‖θ∗‖/σ. We set ‖θ∗‖ = 1 for identifiability in the single-index model, and so the low noise regime in which σ → 0 corresponds to high signal-to-noise ratio. Thus, we use the terms ‘low noise’ and ‘high signal’ interchangeably in this paper.


[Figure 1 graphic: left panel plots the link function g∗(z) against z; right panel plots the parameter estimation error ‖θ̂ − θ∗‖² against the noise level σ on log-log axes, comparing ADE with Algorithm 2.]

Figure 1: Left: The unknown, monotone function g∗(z) = sgn(z) · log(1 + |z|) used in the simulation. We collected i.i.d. samples from the single-index model defined by this function, corrupted by Gaussian noise of variance σ². Right: Simulations of the error of parameter estimation plotted against the noise level σ for (a) in red, the standard Average Derivative Estimator (ADE) [Bri83; PV16] and (b) in blue, our refined estimator from Algorithm 2 that employs the least squares estimator over monotone functions as the non-parametric estimate of the ‘inverse’ function. In this experiment, we set p = 20 and n = 5000, and the errors are averaged over 50 independent runs of the respective algorithms. Further details of the experiment are provided in Appendix D.

phenomenon in quantitative terms. We study this issue in the context of parameter recovery, i.e., recovering θ∗ from n i.i.d. samples drawn from the model. From a statistical perspective, we would like to derive precise bounds on the recovery error as a function of both the dimension p and the sample size n. In addition, we would like to be able to accomplish the estimation task in a computationally efficient manner, and by using fine-grained properties about the function class G. For a comparison of our approach and motivation with those of related work, see Section 1.2. Overall, our approach formalizes a complementary notion to that articulated by Plan and Vershynin above. In particular, we show that when g∗ ∈ G, then leveraging certain structural properties of the class G through a natural, iterative algorithm can lead to uniformly faster rates of estimation for all noise levels. In particular, significant gains are obtainable in the high signal regime.

To foreshadow our results, let us illustrate in a simulation the quantitative benefit of using our iterative framework for a sample link function. Figure 1 plots the performance of our procedure along with a classical semiparametric estimate as a function of the noise parameter σ. In particular, while standard algorithms see a large error floor even as σ → 0, our estimator achieves asymptotically better error in the high signal (or low noise) regime, while remaining competitive with the classical approach even for larger values of σ. It is also worth noting that even so, the error achieved by our estimator plateaus for small values of σ, leading to a non-zero error floor of the problem. This motivates our study of the special case σ = 0, which we show serves as a proxy for all values of σ that are “sufficiently small”.
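For readers who wish to reproduce the flavor of this experiment, the following minimal NumPy sketch generates data from the single-index model of Figure 1; the random seed and the particular value of σ are illustrative choices, not specifications from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma = 20, 5000, 0.25                          # p and n as in Figure 1; sigma is one point on the sweep
theta_star = rng.standard_normal(p)
theta_star /= np.linalg.norm(theta_star)              # ||theta*|| = 1 for identifiability
g_star = lambda z: np.sign(z) * np.log1p(np.abs(z))   # the monotone link used in the simulation
x = rng.standard_normal((n, p))                       # Gaussian covariates
y = g_star(x @ theta_star) + sigma * rng.standard_normal(n)
```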

In the literature on statistical learning theory, the low-noise regime has received considerable attention for empirical risk minimization applied to regression problems (e.g., [Men14; Men18; LM18; LM17]). In particular, the first paper in this series was written by Mendelson [Men14], who noted that classical analyses of ERM are often very conservative in the low noise (or what is referred to in learning theory as the nearly-realizable) setting. He proposed a new “small-ball”


method of analysis to derive rates for the problem that are usually much faster when the model is nearly-realizable. Our motivation should be viewed as analogous but as applied to semi-parametric regression². The low-noise regime has also been extensively studied in the literature on statistical signal processing, but as applied to specific models such as phase retrieval (in which g∗ is the absolute value or square function) and its relatives³. For the related (noisy) matrix completion problem, the recent paper [CCF+19] provides an analysis of a popular convex relaxation method in the low-noise regime via a delicate analysis of a non-convex optimization algorithm.

We now discuss our contributions in a little bit more detail in Section 1.1, before providing a survey of related work and applications in Section 1.2.

1.1 Contributions and Organization

Our approach to performing estimation under the single-index model is based on leveraging fine-grained structure in the function g∗, and we formalize the notion of structure that we require by assuming access to a certain “labeling” oracle that provides information about the “inverse” model y ↦ E[〈x, θ∗〉 | y]. Loosely speaking, the labeling oracle helps us narrow our investigation to regions of the domain of the function g∗ on which this conditional expectation is easy to reason about. This provides, in broad terms, a program for estimation under the semi-parametric model (1) via a reduction to non-parametric regression over the inverse function class.

In order to illustrate this intuition and as a warm-up exercise, we implement such an oracle for the phase retrieval problem in Section 2 and reduce the problem to linear regression. This leads to a simple algorithm for phase retrieval that achieves optimal parameter estimation rates. With this intuition for phase retrieval in hand, we present in Section 3 the precise labeling oracle that we require for general single-index models, and provide a flexible procedure for parameter estimation. This procedure assumes access to the labeling oracle and solves a non-convex problem via an iterative algorithm. It requires, as input, any non-parametric function estimation oracle over the inverse function class. Under standard assumptions on the observation model, our general result (Theorem 1) provides guarantees on the error attained by such an iterative algorithm for general SIMs as a function of the error rate of the non-parametric estimator provided as input to the procedure. We then leverage the vast literature on empirical risk minimization (ERM) for non-parametric function estimation in order to establish Theorem 2, which shows upper bounds on parameter estimation in terms of natural measures of complexity of the class of inverse functions.

In order to illustrate a concrete application of our framework, we consider a sub-class of monotone SIMs for which a labeling oracle can be implemented efficiently, and with no additional computational effort. This leads to a procedure with end-to-end guarantees for this class, which we present as Corollary 1. This result provides a sharper parameter estimate than classical procedures, and the gains are particularly significant when σ → 0. Accordingly, we investigate the noiseless case of the general problem in some more detail in Section 4, and show that a slight modification of our general framework can be used to prove even faster estimation rates in this setting—this

²Indeed, Mendelson’s general techniques are also applicable to this problem via the reduction that we establish; see Appendix B for a discussion.

³Indeed, in the phase retrieval problem, we now have a concrete understanding of how the guarantees of moment methods can be significantly sharpened via a variety of methods in order to achieve optimal rates of parameter estimation in the low noise (or even noiseless) regime [CLS15; CLM16; MM18; CC15; YCS14; LAL19]. In Section 2, we provide what may be viewed as another such method for the phase retrieval problem, and use it to motivate our technique for general SIMs.


result is presented in Theorem 3. Once again, we apply this result to monotone SIMs in order to obtain Corollary 2, and complement our upper bounds with an identifiability lower bound for this class of SIMs in Proposition 3.

1.2 Related work

Single-index models have seen a concrete theoretical treatment across multiple, related communities. The classical viewpoint emerges from the statistics community, in which these have been studied under the broader umbrella of semi-parametric estimation; the latter is broadly applied in microeconomics, finance, and the social sciences. Index models, in particular, have been used as a general-purpose, non-linear dimensionality reduction tool. We refer the interested reader to the books by Bickel et al. [BKR+93] and Li and Racine [LR07] for a broad overview of classical methods for semi-parametric estimation, their applications, and associated guarantees. In the context of the single-index model, a well known estimator for the index vector is the semi-parametric maximum likelihood estimator (SMLE) [Hor09], which solves the full-blown M-estimation problem, finding the (function, index) pair that maximizes the likelihood of the observed samples. The SMLE is known to have excellent statistical properties in the asymptotic regime where the ambient dimension is fixed and the number of samples goes to infinity—in particular, a parameter estimate obtained as a result of running these procedures is often “√n-consistent”—and succeeds with minimal assumptions on the covariate distribution [Rob88; Kos08]. In addition to the SMLE, other influential approaches include gradient-based estimators [HJP+01; DJS08], moment-based estimators [Bri83; Li92], and slicing estimators [Li91], which have driven a lot of progress in the deployment of semi-parametric models in practice. There has also been recent interest in studying some of these procedures under weak covariate assumptions [BB18]. Indeed, our general approach can be viewed as a more refined version of slicing; this is discussed in detail in Remark 2. We also note the recent work of Dudeja and Hsu [DH18], which falls under this broad umbrella and analyzes the single-index model with Gaussian covariates by expressing the unknown function in the Hermite polynomial basis. Their estimators may be viewed as higher-order moment methods, and they propose efficient, gradient-based algorithms to compute them.

There has also been a lot of recent interest in applying the double (or de-biased) machine learning approach to semi-parametric models [CGS+17; CCD+18; CNR18], especially in the high-dimensional regime [CNS+18; CGS+17]. These papers are motivated by the fact that semi-parametric estimation is a natural lens through which to view estimation problems with nuisance components, when the statistician is only interested in some target component; examples of such problems span the diverse fields of treatment effect estimation [KSB+19], policy learning [AW17], and domain adaptation [CMM10]. The classical notion of Neyman orthogonality [Ney59; Ney79] has re-emerged as a natural and flexible condition under which to study these problems [CEI+16]. We do not survey this literature in detail, but refer the reader to the recent paper by Foster and Syrgkanis [FS19], which provides a general treatment of problems in this space. Focusing on proving excess risk bounds for problems with a nuisance component, these results show that a natural one-step meta-algorithm that splits samples between estimating the nuisance component and the target component (or parameter) is able to achieve oracle excess risk bounds in some settings. In particular, they show that if a Neyman orthogonality condition is satisfied and the class of nuisance components is not too large when compared to the target class, then oracle risk bounds⁴ are achievable.

⁴That is, the excess risk of estimating the target is of the same order as the risk attainable if the nuisance component were known exactly.


The generality of these results is striking: they apply to a general class of problems, general loss functions, and general data distributions, thereby providing a broad framework for the study of such models. Notably, the results are also reduction-based, in that they allow the statistician to use any procedure for estimation of the target and nuisance components, and derive bounds that depend on the rates at which these components can be estimated. In this last respect, our treatment is similar; however, our focus should be viewed as being complementary to this general theory. Some salient differences are worth highlighting: First, and foremost, we are interested primarily in understanding the rates of estimation as a function of the noise level in the problem, which was not the focus of these recent results. In particular, any one-step meta-algorithm will no longer be optimal (even in the special case of SIMs) over all noise levels. Second, we are interested in the rates of parameter estimation as in the semi-parametric literature, and this requires us to impose stronger covariate assumptions. Finally, by specializing our model class to single-index models, we are able to simultaneously address issues of computational efficiency, statistical optimality, and adaptivity to the noise level.

A second perspective on single-index models emerges from the statistical signal processing literature⁵—or more broadly, the literature on geometric functional analysis and linear inverse problems—in which we are interested in imposing additional structure on the underlying parameter θ∗. While the application of geometric functional analysis to linear inverse problems is a relatively recent endeavour, the literature in this general space is already quite formidable; examples of results here can be found in the papers [PV16; PVY17; YWC+15; NWL16; YWL+16; GRW+15; YBL17; TAH15; TR17]. The focus in this area is on recovering the underlying “signal” θ∗ at a rate that depends optimally on the properties of the set to which the signal belongs. This literature often places stronger assumptions on the measurements or covariates—often Gaussian, although some extensions to sub-Gaussian settings are available (e.g. [MPT07]). Many of the algorithms in this space are based on convex relaxations, but in the case where there is no structure on θ∗, they reduce to more classical moment-based estimators. As mentioned in Section 1, a representative result in this space is that of Plan and Vershynin, which shows that provided the unknown link function has a non-zero “projection” onto the class of linear functions, a constrained variant of Brillinger’s Average Derivative Estimator (ADE) [Bri83] recovers the true parameter at the optimal rate for large noise levels; in particular, this error rate depends precisely on the geometric properties of the set to which θ∗ belongs. Extensions of this result are also available for cases when g∗ is an even function [TR17], and are based on constrained versions of the Principal Hessian Directions (PHD) algorithm [Li92]. Besides convex relaxation approaches, there are also non-convex approaches to problems in this space; for example, Yang et al. [YYF+17] study a two-step non-convex optimization procedure for SIMs based on the thresholded Wirtinger flow algorithm [CLM16], and show that this algorithm is able to obtain a parameter estimate at the optimal (s log p)/n rate for s-sparse vectors θ∗ under moment conditions on the link function.

Given that we specialize some of our results in the sequel to the class of monotone SIMs, let us now discuss some prior work in this space. The design of efficient algorithms for monotone single-index models was the focus of much work in the machine learning community [KS09; KKS+11], where these models were introduced in order to account for mis-specification in generalized linear models with known link functions. The algorithms here—Isotron [KS09] and variants [KKS+11]—

⁵Our division of related work under these two broad headings is somewhat arbitrary; the motivations of some of the papers listed in the geometric functional analysis literature were statistical, and vice versa.


are inspired by the Perceptron algorithm and run variants of the stochastic gradient method. They obtain bounds on the excess risk incurred by the algorithm, showing bounds that are typically non-parametric. These models have also seen a more recent appearance in the literature on shape-constrained estimation, in which index models and their relatives have emerged as natural means to alleviate the curse of dimensionality [CS16; KPS17; BDJ16]. Broadly speaking, these papers analyze the consistency of the global SMLE for their respective problems, and propose heuristic algorithms—without provable guarantees—that solve this non-convex problem by alternating projection procedures. It should be noted that in the absence of smoothness assumptions, there are a multitude of technical obstacles that must be overcome in order to show that the SMLE is even consistent. The monotone single-index model, in particular, has been analyzed in recent papers by Balabdaoui et al. [BDJ16] and Groeneboom and Hendrickx [GH19]. In addition to providing fine-grained guarantees for the SMLE (e.g., the limiting distribution of the regression estimate at a point [GJW01], or the prediction error of the “bundled” function g∗(〈θ∗, ·〉)), these papers also provide guarantees for the ADE approach, and their guarantees hold under minimal assumptions on the underlying link function.

Having discussed the lay of the land, let us now put our contributions in context. In spite of the vast literature on single-index models, some important and fundamental questions remain unaddressed. In particular, our focus is on simultaneously tackling the following issues:

• Leveraging structure in the class of link functions: Moment and slicing based estimators, which form the foundation for the investigation of SIMs in the literature on linear inverse problems, completely ignore any fine-grained structure in the true function g∗. As alluded to earlier, they simply require g∗ to obey certain moment conditions, and do not attempt to model it in any way. This leads to a “bias” in these estimators that becomes significant in the high signal regime, and indicates that better models for g∗ can be leveraged to reduce this bias.

• Adapting to the noise level: As alluded to in the introduction, none of the computationally efficient estimators of θ∗ obtain a provably optimal error bound as a function of the noise variance σ². In particular, the performance of estimators in the low noise setting is near-identical to their performance in the constant-noise setting. Take, for example, the recent results of Babichev and Bach [BB18] and Dudeja and Hsu [DH18], which show a bound of the form

‖θ̂ − θ∗‖² ≲ (σ² + c) · p/n   (2)

for their respective estimators, provided the function g∗ satisfies certain conditions. The ≲ notation in these bounds hides logarithmic factors in the pair (p, n), and the constant c in this bound is some problem dependent constant that is strictly positive for any non-linear g∗. The analysis of Yang et al. [YYF+17] posits additional structure on the underlying parameter θ∗ and improves the rate of the estimate (i.e., the dimension p in the bound (2) is replaced by a geometric quantity, but the (σ² + c) term persists). Clearly, these bounds exhibit the same behavior for both large and small σ, and this is a limitation of these approaches that we would like to address. Adaptivity to noise variance is only achievable when we are able to drive the bias of the problem to zero at a faster rate by positing a good model for the function g∗.

• Computational efficiency: The SMLE, for instance, solves a non-convex problem to optimality and is NP-hard to compute for many non-parametric function classes. Variants of the


SMLE are able to avoid some statistical issues with the SMLE, but they are still computationally intractable.

• Dependence on the dimension: Since a large portion of the semi-parametric literature is classical, the dependence on the covariate dimension p is seldom made explicit. In many cases, this dependence is much worse than the linear dependence on p that we expect for parametric models.

1.3 Notation

We largely use capital letters X, Y, etc. to denote random variables/vectors, and small letters to denote their realizations, usually with the sample index xi, yi, etc. We reserve the notation Z for the standard Gaussian distribution, where the dimension can be inferred from context. Boldface capital letters X, W, etc. are used to denote matrices; we let Xi denote the i-th column of X. We let X† denote the Moore-Penrose pseudoinverse of a (tall) matrix X. We let Id denote the d × d identity matrix.

For a positive integer n, let [n] := {1, 2, . . . , n}. For a finite set S, we use |S| to denote its cardinality. For two sequences {an}_{n=1}^{∞} and {bn}_{n=1}^{∞}, we write an ≲ bn if there is a universal constant C such that an ≤ Cbn for all n ≥ 1. The relation an ≳ bn is defined analogously, and we use an ∼ bn to signify that the relations an ≲ bn and an ≳ bn hold simultaneously. We use c, C, c1, c2, . . . to denote universal constants that may change from line to line. We use the notation ‖v‖ to denote the ℓ2 norm of a vector v unless otherwise specified. We also denote the p-dimensional unit sphere by Sp−1 = {v ∈ Rp : ‖v‖ = 1}.

We deliberately eschew measure-theoretic considerations. Throughout, we write conditional expectations assuming that they exist. For a pair of continuous random variables (U, V) and a scalar u, we use the shorthand E[V | U = u] to denote the standard conditional expectation E[V | u].

2 Warm-Up: Phase Retrieval via Linear Regression

In order to build intuition for our general framework, let us first illustrate how using specific structure in the function g∗ can help us improve the estimation rate. In this section, we work with the phase retrieval model with n i.i.d. samples

yi = |〈xi, θ∗〉|+ εi, (3)

where the absolute value function forms the (known) scalar link function g∗, and the unit-norm⁶ parameter θ∗ ∈ Sp−1 is fixed⁷. As before, we assume the covariate xi is drawn from a standard Gaussian distribution, and that εi is zero-mean and σ-sub-Gaussian, chosen independently of xi.

Moment methods that are variants of Li’s PHD procedure [Li92] are often used to provide initializations for this problem, and there is a large body of literature on how one might obtain “optimal” initializations according to various criteria [MM18; LAL19]. Furthermore, there is an

⁶In principle, we can drop the restriction that ‖θ∗‖ = 1 for this section since the link function is known, but we keep this restriction to avoid confusion. Our results in this section carry over to the general case by scaling appropriately.

⁷Contrast this with the ‘universal’ setting in which the parameter can be chosen with knowledge of the realized covariates [CC15; Wal18].


even larger literature dedicated to using these initializations and refining them further with other algorithms [GS18; NJS13; CLS15; Wal18; GPG+19]. In this section, we provide what may be viewed as another such method, showcasing that the isolation of samples that fall in easily “invertible” regions of g∗ can immediately reduce phase retrieval to a linear regression problem.

Specifically, suppose that we are given a unit-norm parameter estimate (an initialization) θ0 that is “close” to the true parameter θ∗. Then it stands to reason that the quantity sgn(〈xi, θ0〉) will be a good proxy for the true latent variable sgn(〈xi, θ∗〉) provided the magnitude |〈xi, θ0〉| is large. With an additional tuning parameter λ, which we use to reason about the degree of closeness alluded to above, this heuristic intuition can be turned into a natural procedure that relies on isolating (or labeling) a set of samples S ⊆ [n] such that sgn(〈xi, θ∗〉) can be determined with high probability for all i ∈ S. Once these latent signs are determined via the labeling step, the problem reduces to a linear regression problem on the restricted set of samples S; we call this the ‘inversion’ step of the algorithm. Our label-then-invert, or LTI-Phase algorithm, is presented formally as Algorithm 1.

Algorithm 1: LTI-Phase: Label-Then-Invert algorithm for phase retrieval

Input: Data {xi, yi}_{i=1}^{n} drawn from the model (3); initial parameter estimate θ0 that is statistically independent of the data; scalar λ > 0.

Output: Final parameter estimate θLTI(λ).

1 (Labeling step): Let I+ = {z ∈ R | z ≥ λ} and I− = {z ∈ R | z ≤ −λ} denote two intervals. Form the scalar quantities 〈θ0, xi〉 for all i ∈ [n], and denote the indices of the samples i such that 〈θ0, xi〉 falls in the region I+ and I− by S+ and S−, respectively.

2 Modify the covariates by setting xi ← −xi for all i ∈ S−, and leave the other set of covariates the same by setting xi ← xi for all i ∈ S+.

3 (Inversion step): Form a design matrix X with rows given by xi⊤ for each i ∈ S+ ∪ S−. Collect the corresponding responses into the vector y, and compute the least squares fit θLTI(λ) = X†y.

4 Return θLTI(λ).

While we have proposed a one-shot solution to the linear system in step 3 of the algorithm, this step can be implemented by a linear system solver of choice, or even just by gradient descent on the least squares loss function L(θ) = ‖y − Xθ‖². While Proposition 1 to follow will be stated for an exact solution, most iterative algorithms to invert linear systems are able to provide linear convergence to any pre-specified tolerance ε; we ignore the contribution of this numerical error to the final rate.
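The labeling and inversion steps are short enough to express directly in code. The following NumPy sketch is our illustration of Algorithm 1 under the Gaussian-covariate model (3); the function name and the one-shot least squares solve are implementation choices, not prescriptions from the paper.

```python
import numpy as np

def lti_phase(x, y, theta0, lam):
    """Label-then-invert sketch for phase retrieval, y_i = |<x_i, theta*>| + eps_i.

    x: (n, p) covariates; y: (n,) responses; theta0: initial unit-norm estimate; lam: threshold.
    """
    proj = x @ theta0                        # scalar quantities <theta0, x_i>
    s_plus, s_minus = proj >= lam, proj <= -lam
    keep = s_plus | s_minus                  # labelled samples S+ and S-
    signs = np.where(s_plus, 1.0, -1.0)      # proxy for the latent sign sgn(<x_i, theta*>)
    x_mod = x[keep] * signs[keep, None]      # flip the covariates indexed by S-
    # inversion step: ordinary least squares on the labelled samples
    theta_lti, *_ = np.linalg.lstsq(x_mod, y[keep], rcond=None)
    return theta_lti
```

On synthetic data generated as in the sketch following Figure 1 (with g∗ replaced by the absolute value), the whole procedure amounts to a single least squares solve once a sufficiently accurate initialization θ0 is available.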

We are now ready to state our result for phase retrieval; we let c1(λ) = (Pr{|Z| ≥ λ})⁻¹, where Z denotes a standard Gaussian random variable.

Proposition 1. Suppose λ ≥ 1/5, and that the initialization θ0 satisfies ‖θ0 − θ∗‖ ≤ (50 log(8n/δ))^{−1/2}. There is an absolute constant C such that if n ≥ C · c1(λ) {c1(λ) log(C/δ) ∨ p(1 + λ²)²}, then the parameter estimate returned by Algorithm 1 satisfies

‖θLTI(λ) − θ∗‖² ≤ 6 c1(λ) · σ² · (p + log(1/δ))/n,

with probability greater than 1 − δ.


Our proof of the proposition is reduction-based; the analysis follows mostly from classical results on the (non-)asymptotic risk under the linear model. However, we require control on the spectrum of the (conditional) covariate matrix; note that the labeling step changes the covariate distribution on the selected samples.

The flexibility of such a reduction-based framework can also be used to derive asymptotic normality guarantees on the estimate θLTI(λ) by using standard results for the linear model. Asymptotic normality is much more challenging to establish for iterative algorithms⁸ that represent the state of the art in this problem [CC15; Wal18], and to the best of our knowledge, asymptotic normality is only known for the (uncomputable) global LSE (see, e.g., van de Geer’s thesis [Gee88] for general results of this form). The flexibility of the reduction-based approach is also evident from our treatment of single-index models in Section 3 to follow.

Let us conclude this section with a few specific comments regarding Proposition 1. First, it is worth mentioning that in the noiseless regime, there are many sophisticated algorithms tailored to this problem, including Wirtinger flow [CC15], amplitude flow [WGE18], and PhaseMax [GS18], and they all require an initialization θ0 (also referred to as an anchor vector) that is a small constant distance from the true parameter θ∗. We go one step further, requiring a slightly better initialization θ0 satisfying ‖θ0 − θ∗‖ = O((log n)^{−1/2}), since this allows us to guarantee correctness of the labeling step with high probability.

Second, we note that while Proposition 1 is stated for the Gaussian covariate distribution, this is mainly for convenience; Remark 7 in the proof clarifies that similar guarantees hold provided (a) the covariates are sub-Gaussian, and (b) a mild condition holds on the second moment matrix of the covariates when we condition on a certain direction of the covariate space. These assumptions are satisfied, for instance, by distributions obeying the small-ball property, including log-concave distributions (see, e.g., the papers [DR17] and [GPG+19] for similar weakenings of distributional assumptions on the covariates).

Finally, we stress that our primary goal in this section was not to provide a ‘better’ algorithm for phase retrieval, but to use phase retrieval as a natural special case in which to develop intuition for solving single-index models. In that regard, our algorithm demonstrates that it is beneficial to have access to “labelled” examples for which the univariate function g∗ is well-behaved—in this case, linear. This motivates our definition of the labeling oracle introduced in the next section.

3 Methodology and main result for SIMs

We now turn to the single-index model, which is the main focus of the paper. Throughout, we suppose that n samples are drawn i.i.d. from the observation model

yi = g∗(〈xi, θ∗〉) + εi, (4)

once again assuming that xi ⊥⊥ εi, and that the xi are drawn i.i.d. from N(0, Ip). We also assume that the noise εi is drawn from a σ-sub-Gaussian distribution and that the unknown parameter θ∗ ∈ Sp−1 has unit norm. Assumptions on both the covariates and noise can be relaxed for subsets of our results, and we allude to this in Section 6. The univariate function is assumed to satisfy the inclusion g∗ ∈ G for

⁸The Principal Hessian Directions, or PHD algorithm, is a moment method for this problem for which guarantees of asymptotic normality can be established via classical techniques [VW96]; however, the variance of the resulting normal distribution is non-zero even when σ = 0.


some non-parametric function class G. Our procedure for parameter estimation in SIMs requires two natural oracles, which we introduce first.

3.1 Oracles: Labeling and Inverse Regression

The first oracle is the labeling oracle; it may be helpful for the reader to view this oracle as a black-box implementation of step 1 of the LTI-Phase procedure presented in Algorithm 1.

Labeling Oracle: Such an oracle outputs:

• A closed interval I ⊆ R, and a set of labeled samples S = {i : 〈xi, θ∗〉 ∈ I}. Let W denote the truncation of the random variable 〈X, θ∗〉 on this interval, having density

fW(w) = φ(w) / ∫_{x∈I} φ(x) dx   if w ∈ I,   and   fW(w) = 0 otherwise,

where φ denotes the standard Gaussian density. Denote by PY the induced distribution on the response Y = g∗(W) + ε, and let Y denote its sample space.

• A closed, convex⁹ set H corresponding to the function class

H ⊇ {y ↦ E[W | y] : y = g(W) + ε, g ∈ G}.

In words, this is a class of functions mapping R → I that contains all conditional expectations under the “inverse” model. We use the shorthand h∗ to denote the conditional expectation—which we refer to hereafter as the “inverse function”—formed when our observations are generated according to the link function g∗.

Note that in principle, outputting a set S via such a labeling oracle requires knowledge of the true parameter θ∗, which we are trying to estimate! But as we saw in Section 2, there are problems for which the set S of labeled samples can be computed in a data-dependent manner with high probability; in the sequel, we show an example of a class of single-index models for which this is also true. For now, assume that such a labeling oracle exists and let N = |S| be the effective sample size that we work with. Note that N is, in principle, a random variable, but it will be helpful to think of it as a fixed integer for the rest of this section.

The “spirit” of the labeling oracle is similar to that of step 1 in Algorithm 1: to provide a region on which the “inverse” function is easy to reason about. With the labeling oracle in hand, note that the random variable W may be viewed as being generated according to the model

W = h∗(Y ) + ξ, (5)

where ξ = W | Y − E[W | Y] is uncorrelated with h∗(Y) by definition, and may be viewed as zero-mean noise. In the sequel, we use the convenient notation ξ(y) = [W | Y = y] − E[W | Y = y] to indicate that ξ depends on the realization y. The sample space of W is I, and when the noise ε is supported on the entire real line, the sample space of Y is Y = R. We emphasize that in spite of how the

⁹If the set H is not convex, then it suffices to work with its convex hull. More generally, we only require the set to be star-shaped around h∗, and if not, we can work with the star hull centered at h∗.


labeling oracle above has been defined, we do not assume that we have access to realizations of the pair of random variables (W, ξ); one should view the observation model (5) simply as an analysis device.

The second oracle that we require is a non-parametric regression oracle over the function class H. In Algorithm 1, this was achieved by the linear estimator in step 3.

Inverse regression oracle: Our overall algorithm uses, as a black-box, an estimation procedure A over the function class H. Given k i.i.d. samples drawn from a generic non-parametric regression model over the class H, the procedure A : (R × R)^k → H uses these samples to compute a function h ∈ H that optimizes some measure of fit to these samples. We place no restrictions (besides measurability) on such a procedure; our main result depends on the properties of the procedure through its “rate” function, introduced in Assumption 3.

With these two oracles in hand, we are now ready to present our procedure for parameter estimation in general SIMs. We denote the covariate distribution post-truncation (i.e., the distribution on the samples S) by P^I_X.

3.2 Reducing SIMs to regression: a meta-algorithm and its analysis

Our procedure is based on a natural alternating minimization principle applied iteratively for T steps. We begin by partitioning the N samples into 2T equal parts¹⁰. Denote such a partition by D1, . . . , D2T; each of these sets has size N/(2T) by construction. Our algorithm runs for T iterations; at each iteration, we use two of these data sets. Let us briefly describe iteration t of the algorithm, which uses the data sets D2t+1 and D2t+2.

On the first data set, we run the non-parametric procedure A on the set of pairs {(yi, 〈xi, θt〉)}_{i∈D2t+1}, and form a function estimate ht+1 ∈ H such that ht+1 = A({(yi, 〈xi, θt〉)}_{i∈D2t+1}). In particular, we treat our current linear prediction 〈xi, θt〉 as a noisy observation of the true function evaluated at the point yi. This is our minimization in the space of functions H, through which we obtain an estimate of h∗. In order to intuitively reason about whether this step is sensible, consider the special case θt = θ∗. Here, the non-parametric procedure A obtains samples from the model h∗(yi) + ξi for each i ∈ D2t+1; these are simply noisy observations of the true function, and A is designed precisely to denoise these samples. On the other hand, if θt is close to θ∗, then we obtain samples from a similar model, but with some additional noise—our analysis will make this precise—that vanishes provided θt converges to θ∗.

With the function estimate ht+1 in hand, we now turn to the second data set and run a linear regression. In particular, we regress {ht+1(yi)}_{i∈D2t+2} on the covariates {xi}_{i∈D2t+2} and obtain the linear parameter estimate θ̃t+1. Finally, we output the normalized parameter estimate θt+1 = θ̃t+1/‖θ̃t+1‖ at the end of this iteration. Note that once again, one can reason about how sensible our linear regression step is by specializing to the case ht+1 = h∗; here, h∗(yi) is effectively a noisy sample of 〈xi, θ∗〉, and so we expect the linear regression to return an estimate that is close to θ∗. When ht+1 ≠ h∗, this, once again, introduces additional noise in our observation process which vanishes when our function estimate ht+1 converges to the true function h∗.

¹⁰We assume that N is a multiple of 2T for simplicity.


With this intuition—made concrete in the proof—we are then able to relate the error of parameter estimation at the next time step with the error at the current time step, and iterating this bound allows us to improve upon the error of the initializer θ0. A formal description of the entire procedure is provided as Algorithm 2.

Algorithm 2: The LTI-SIM meta-algorithm with sample-splitting for the two regressions

Input: Data of N samples {xi, yi}_{i∈S} returned by the labeling oracle; non-parametric regression procedure A; initial parameter θ0; number of iterations T.

Output: Final parameter estimate θT.

1 Initialize t ← 0. Split the data into 2T equal portions indexed by D1, . . . , D2T.

repeat

2 Form the function estimate ht+1 ∈ H by computing

ht+1 = A({(yi, 〈xi, θt〉)}_{i∈D2t+1}).   (6)

3 Letting Xt+1 denote the (N/2T) × p matrix with rows {xi}_{i∈D2t+2}, and stacking up the responses {ht+1(yi)}_{i∈D2t+2} in a vector v, compute

θ̃t+1 = X†t+1 v.

4 Compute the normalized parameter θt+1 = θ̃t+1/‖θ̃t+1‖, and update t ← t + 1.

until t = T;

5 Return θT.

Note that we use two separate samples for the sub-steps of the algorithm in order to ensure that ht+1 is independent of the samples used to perform the linear regression. In Section 4 to follow, we introduce and analyze a variant of the algorithm without sample-splitting in the special case σ = 0.
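To convey the shape of the meta-algorithm in code, here is a NumPy sketch of the iteration; it is our illustration, not the paper's reference implementation. The regression oracle A is passed in as a black-box callable, and the crude binned-mean regressor below is only a placeholder (any estimator over the inverse class H, for instance a least squares fit over monotone functions as in Figure 1, can be substituted).

```python
import numpy as np

def lti_sim(x, y, theta0, fit_inverse, T, rng=np.random.default_rng(0)):
    """LTI-SIM sketch: alternate inverse-function regression and linear regression.

    x, y: the N labelled samples returned by the labeling oracle; theta0: initial unit vector;
    fit_inverse: black-box oracle mapping (y_vals, w_vals) -> callable estimate of h*;
    T: number of iterations.  The data are split into 2T disjoint folds, two per iteration.
    """
    folds = np.array_split(rng.permutation(len(y)), 2 * T)
    theta = theta0 / np.linalg.norm(theta0)
    for t in range(T):
        d1, d2 = folds[2 * t], folds[2 * t + 1]
        # step 2: treat <x_i, theta_t> as noisy observations of h*(y_i) and fit h_{t+1}
        h_hat = fit_inverse(y[d1], x[d1] @ theta)
        # step 3: linear regression of h_{t+1}(y_i) on the covariates x_i
        theta_tilde, *_ = np.linalg.lstsq(x[d2], h_hat(y[d2]), rcond=None)
        # step 4: re-normalize to the unit sphere
        theta = theta_tilde / np.linalg.norm(theta_tilde)
    return theta

def binned_mean_regressor(y_vals, w_vals, n_bins=20):
    """A crude inverse-regression oracle: a piecewise-constant fit obtained by binning on y."""
    edges = np.quantile(y_vals, np.linspace(0, 1, n_bins + 1))
    bin_of = lambda q: np.clip(np.searchsorted(edges, q, side="right") - 1, 0, n_bins - 1)
    idx = bin_of(y_vals)
    means = np.array([w_vals[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return lambda y_new: means[bin_of(y_new)]
```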

Remark 1 (LTI-SIM as alternating minimization). Note that for X ∼ P^I_X, the observations obey the relation

〈θ∗, X〉 = h∗(Y) + ξ,

where ξ may be viewed as “noise” in the inverse problem. Thus, for a data set D ⊆ S, it is reasonable to construct the loss function

L_D(θ, h) := (1/|D|) Σ_{i∈D} (〈θ, xi〉 − h(yi))²,

and minimize it over the pair (θ, h) in order to obtain some measure of fit to the samples in the data set. However, this minimization is rendered non-convex by the constraint that the returned θ must be unit norm. Thus, step 2 of the LTI-SIM procedure may be viewed¹¹ as minimizing this loss function over the function class H, and steps 3 and 4 in conjunction as performing a minimization in parameter space.

¹¹This is particularly true if the procedure A performs least squares, as in Theorem 2 to follow.


Remark 2 (Comparison with slicing estimators). Slicing estimators [Li91; BB18] are based on the observation that for spherically symmetric distributions, the conditional moments¹² E[X⊗k | Y] capture properties of the true parameter θ∗. For instance, when k = 1, classical calculations show that under mild assumptions on g∗, the vector E[X | Y] aligns with the vector θ∗ for almost every realization of Y. Thus, we may construct estimates of this conditional expectation from samples by slicing over y values, and this leads to a √n-consistent estimate for the parameter and is similar in many respects to the ADE procedure [Bri83]. However, even when σ = 0, the randomness in the covariate X introduces noise in the empirical expectation, and so the error cannot decay at a rate faster than 1/√n even in this noiseless case.

Algorithm 2 is also based on reasoning about a first-order conditional expectation, but relies on a model, provided by the labeling oracle, of further structure in the function y ↦ E[W | Y = y]. Intuitively, modeling this higher-order structure in conjunction with the first-order conditional expectation allows us to considerably refine the slicing estimate in an iterative fashion. The original slicing estimator can thus be used to provide a natural initialization θ0 for our procedure.
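For Gaussian covariates, such a first-order initializer is a one-line computation. The sketch below is our illustration of the classical ADE/first-moment estimate, valid when the moment E[g∗′(〈x, θ∗〉)] is non-zero; it is not a procedure specified in the paper.

```python
import numpy as np

def ade_initializer(x, y):
    """Average derivative / first-moment estimate of the index direction.

    For Gaussian covariates, Stein's identity gives E[y x] proportional to theta*,
    so the normalized empirical moment is a sqrt(n)-consistent initializer theta_0.
    """
    v = x.T @ y / len(y)
    return v / np.linalg.norm(v)
```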

While our methodology is well-defined for any single-index model in which we have access to a labeling oracle, our theoretical analysis of the algorithm requires the following assumptions.

Assumption 1. The Gaussian volume of the set I is greater than κ, i.e., Pr{Z ∈ I} ≥ κ for Z ∼ N(0, 1).

Such an assumption is natural, and guarantees that we have a large enough “effective sample size”, with N growing directly proportional to the true sample size n. For our next assumption, we require the following definition of a sub-Gaussian norm, which is a standard notion [VW96; Ver10].

Definition 1 (Sub-Gaussian norm). The L2-Orlicz norm of a scalar random variable U is given by

‖U‖ψ2 = inf{t > 0 | E[exp(U²/t²)] ≤ 2}.

We also refer to this as the sub-Gaussian norm, and a random variable with sub-Gaussian norm bounded by σ is said to be σ-sub-Gaussian.

Assumption 2. The noise of the inverse problem has sub-Gaussian norm ρ_σ uniformly for all y ∈ R. Specifically, ρ_σ is a positive scalar such that

‖ξ(y)‖ψ2 ≤ ρ_σ for all y ∈ Y.

Remark 3. Assumption 2 can be weakened in multiple ways. Firstly, the requirement that the noise be uniformly sub-Gaussian y-everywhere can be replaced with a requirement that it only holds over all y that can be realized with high probability. More generally, the sub-Gaussian assumption is not really required for our main result and can be weakened to allow for heavy-tailed noise—see Appendix A for such an extension to noise with bounded second moment.

Finally, it can be verified that if the function g∗ is invertible on the interval I and σ = 0, then Assumption 2 is trivially satisfied with ρ_0 = 0, since without noise, we have E[W | Y = y] = (g∗)⁻¹(y), and so ξ(y) = 0 almost surely. The next assumption requires that our inverse regression procedure output a useful function estimate.

¹²The notation v⊗k represents the tensor product of order k.


Assumption 3. Suppose we have k samples {yi, wi}_{i=1}^{k} drawn i.i.d. from the observation model

wi = h∗(yi) + ξi + zi,   (7)

where the pair (yi, ξi) is drawn from a joint distribution PY,ξ such that E[ξ | Y = y] = 0 for each scalar y, the RV zi is additional zero-mean, ρ-sub-Gaussian noise that is independent of the pair (yi, ξi), and h∗ ∈ H is an unknown function to be estimated. Suppose {ỹi}_{i=1}^{k} are k fresh samples, each drawn i.i.d. from the distribution PY. Then the procedure A({(yi, wi)}_{i=1}^{k}) returns a function h satisfying

(1/k) Σ_{i=1}^{k} (h(ỹi) − h∗(ỹi))² ≤ R^A_k(h∗, PY,ξ; ρ², δ)

with probability greater than 1 − δ.

Through Assumption 3, we quantify the “quality” of the non-parametric procedure A through its population rate function R^A_k. Indeed, computing these rate functions for specific non-parametric regression procedures is one of the principal goals of statistical learning theory [BM02; Tsy09]. Note that unlike standard definitions of such a rate function, we allow the rate R^A_k to depend explicitly both on the underlying function h∗, and on the joint distribution of the noise and design points PY,ξ. In the sequel, we visit settings in which the latter dependence can be removed if Assumption 2 also holds.

With these assumptions in place, we are now ready to state our main theorem. In the statement of the theorem, we track the error at time t by ∆t = sin² ∠(θt, θ∗); note that since each estimate θt has unit norm, there are absolute constants (c, C) such that

c min{‖θt − θ∗‖², ‖θt + θ∗‖²} ≤ ∆t ≤ C min{‖θt − θ∗‖², ‖θt + θ∗‖²},

whence ∆t captures the squared ℓ2 error of parameter estimation up to a sign¹³. We also use the shorthand N̄ := N/2T and νt = cos ∠(θt, θ∗) = √(1 − ∆t) for convenience, and denote by P∗_{Y,ξ} the joint distribution of the random variables (Y, ξ) in the model (5). The shorthand c · P∗_{Y,ξ} denotes the joint distribution of the scaled random variables (cY, cξ) in the model (5). Finally, recall the definition of the function h∗ from the model (5).

Theorem 1. Suppose that Assumptions 1, 2, and 3 hold, and that the iterates θ0, . . . , θT are generated by Algorithm 2. Then there is a pair of absolute constants (c1, c2) such that for each t = 0, . . . , T − 1, if

∆t ≤ 99/100,   R^A_N̄(νt h∗; νt P∗_{Y,ξ}, ∆t, δ/3) + ρ_σ² ≤ c1 κ²,   and   N̄ ≥ c2 max{p, κ⁻² log²(1/κ) log(c2/δ)},   (8a)

then we have

∆t+1 ≤ c2 {R^A_N̄(νt h∗; νt P∗_{Y,ξ}, ∆t, δ/3) + ρ_σ²} · (p + log(4/δ))/N̄   (8b)

with probability exceeding 1 − δ. Moreover, on this event, if ∠(θt, θ∗) ≤ π/2, then ∠(θt+1, θ∗) ≤ π/2.

¹³While the sign ambiguity is inherent to even link functions g∗, it can otherwise be eliminated by assuming that θ0 forms an acute angle with θ∗.


The conditions (8a) present in the theorem warrant some discussion. The theorem requires that the iterate at time t satisfy ∆t ≤ 99/100; the value of this constant is not important, and can be replaced with any other absolute constant¹⁴ less than 1. The second condition

R^A_N̄(νt h∗; νt P∗_{Y,ξ}, ∆t, δ/3) + ρ_σ² ≤ c1 κ²

implies (qualitatively) that we are in the low noise regime with ρ_σ bounded above by an absolute constant. This is the regime in which we expect any gains to occur over classical semi-parametric estimators, and in that sense, the condition should not be viewed as restrictive. Finally, the sample size condition N̄ ≥ c2 p is also natural, and a consequence of the fact that we would like the linear regression step in the algorithm to return a unique solution. The accompanying technical condition N̄ ≥ c2 κ⁻² log²(1/κ) log(c2/δ) ensures that the matrix Xt+1 is well-conditioned.

Moving on to the theorem's conclusion, first note that it applies to any non-parametric estimation procedure that we use, and significant gains are obtained whenever the error rate R^A_{N̄}(ν_t h^*; ν_t P^*_{Y,ξ}, ∆_t, δ/3) is small. In particular, if R^A_{N̄}(ν_t h^*; ν_t P^*_{Y,ξ}, ∆_t, δ/3) = o(1), then running just one step of the procedure already obtains a better guarantee than that of classical estimators (cf. equation (2), with n ≡ N). To obtain a final guarantee (which will typically be even sharper), the inequality needs to be applied iteratively; we do so in deriving Corollary 1 to follow. Finally, since the theorem applies to only one step of the iterative procedure, it is worth noting that the error ∆_t of the previous step acts as the noise variance encountered by the non-parametric estimation procedure. This is what allows us to bootstrap the result and obtain a final rate. In the next section, we derive corollaries of the main theorem for a specific choice of the procedure A.

3.3 Broad implications for regression procedures based on M-estimation

It is useful to particularize Theorem 1 to the case where A corresponds to the empirical risk minimization (ERM) algorithm over the function class H, which is a special case of M-estimation. Since we are interested in performing ERM on i.i.d. samples drawn from the model (7), let us introduce it in this context. Given k i.i.d. samples {(y_i, w_i)}_{i=1}^k drawn from this model, the ERM algorithm estimating the unknown non-parametric function h^* ∈ H returns the function

    ĥ_ERM ∈ argmin_{h ∈ H} (1/k) Σ_{i=1}^k ( w_i − h(y_i) )²,

where we have chosen the squared loss given our assumption that the noise is sub-Gaussian. Note that this estimator exists since the function class H is closed and convex. The estimate is also random, due to both the randomness in the "design points" y_1, . . . , y_k and in the noise. Let us now discuss how one might bound the error rate of this algorithm with high probability.
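For a general class H the ERM computation depends on the structure of the class; as a hedged illustration only, the following Python sketch computes a least-squares ERM over the linear span of a small, hypothetical basis (not the monotone class used later in the paper).

    import numpy as np

    def erm_least_squares(y, w, basis):
        """Least-squares ERM over the linear span of the given basis functions:
        returns h minimizing (1/k) * sum_i (w_i - h(y_i))^2 over that span."""
        Phi = np.column_stack([phi(y) for phi in basis])     # k x d design matrix
        coef, *_ = np.linalg.lstsq(Phi, w, rcond=None)
        return lambda t: np.column_stack([phi(t) for phi in basis]) @ coef

    # Illustrative usage with a quadratic basis and synthetic data.
    rng = np.random.default_rng(0)
    y = rng.standard_normal(200)
    w = 0.5 * y + 0.1 * rng.standard_normal(200)
    h_erm = erm_least_squares(y, w, basis=[lambda t: np.ones_like(t), lambda t: t, lambda t: t**2])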

A classical result in the study of the ERM algorithm [BM02; Kol11] is that the rate function is governed by the local population Rademacher complexity of the function class being estimated over. Let us first define a more general version of this quantity, valid for an arbitrary function class F mapping R → R, with k i.i.d. samples from our model (7). Let y_1^k = (y_1, . . . , y_k) denote the tuple of k i.i.d. design points drawn from the distribution P_Y, and let η = (η_1, . . . , η_k) denote k i.i.d. Rademacher random variables drawn independently of everything else. Then the population Rademacher complexity of the function class is given by

    R_k(F) := E_{η, y_1^k} [ sup_{f ∈ F} | (1/k) Σ_{i=1}^k η_i f(y_i) | ].        (9a)

The Rademacher complexity defined in equation (9a) depends only on the function class F and design points y_1, . . . , y_k, but not on the specific noise in the problem. In order to reason about how the noise affects the estimation procedure in our specific context, it is useful to also introduce another measure of complexity of the function class. Consider the model (7), and denote by ξ̄_i := ρ̄^{−1}(ξ_i + z_i) the rescaled noise in the i-th sample of our observations. Use the shorthand ρ̄ := √(ρ²_σ + ρ²), and let ξ̄ := (ξ̄_1, . . . , ξ̄_k). Then the noise complexity of the function class F (footnote 15) is defined as

    G_k(F; y_1^k) := E_{ξ̄} [ sup_{f ∈ F} | (1/k) Σ_{i=1}^k ξ̄_i · f(y_i) | ].        (9b)

Footnote 15: This is typically known as the Gaussian complexity when the noise is Gaussian, but we prefer the more general nomenclature at this stage of our development.

Note that in contrast to our definition of the population Rademacher complexity (9a), we no longer take an expectation over the random samples y_1, . . . , y_k in equation (9b), and so the noise complexity should be viewed as a random variable when the samples y_1, . . . , y_k are random.
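Both complexities in (9a)–(9b) are expected suprema of a random linear functional over the class; for a finite sub-family of functions they are easy to estimate by Monte Carlo. The sketch below is a minimal illustration of this, assuming only that the function values at the design points have been tabulated; it is not a computation over the full (infinite) class.

    import numpy as np

    def empirical_complexity(func_values, noise_sampler, n_trials=500, rng=None):
        """Monte Carlo estimate of E[ sup_f | (1/k) sum_i eps_i f(y_i) | ] over a
        finite family of functions. func_values has shape (num_funcs, k);
        noise_sampler(rng, k) draws the multiplier vector eps."""
        rng = np.random.default_rng(rng)
        k = func_values.shape[1]
        sups = []
        for _ in range(n_trials):
            eps = noise_sampler(rng, k)
            sups.append(np.max(np.abs(func_values @ eps)) / k)
        return float(np.mean(sups))

    # Multipliers for the Rademacher complexity (9a) and a Gaussian analogue of (9b).
    rademacher = lambda rng, k: rng.choice([-1.0, 1.0], size=k)
    gaussian = lambda rng, k: rng.standard_normal(k)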

It is also useful to define the norms

    ‖f‖²_2 := E_{Y∼P_Y}[ f²(Y) ]    and    ‖f‖²_k := (1/k) Σ_{i=1}^k f²(y_i);        (10)

once again, the second norm should be viewed as random when the samples y_1, . . . , y_k are random. For either norm ‖·‖, let B(‖·‖; t) denote the norm-ball of radius t centered at zero. Also define the shifted function class

    H_{h_0} = { h − h_0 | h ∈ H },

where we have the equivalence H ≡ H_0.

With these definitions in place, analyses of the ERM algorithm rely on finding fixed points of certain local complexity measures, which we now define for our specific function class H_{h^*}. For each positive integer k and pair of positive constants (γ_1, γ_2), define the quantities

    τ_k(h^*; γ_1) := inf{ τ > 0 : R_k( H_{h^*} ∩ B(‖·‖_2; τ) ) ≤ τ²/γ_1 },    and        (11a)
    μ_k(h^*, y_1^k; γ_2) := inf{ μ > 0 : G_k( H_{h^*} ∩ B(‖·‖_k; μ); y_1^k ) ≤ μ²/γ_2 }.        (11b)

Note that the functional μ_k depends on the noise ξ̄, while the functional τ_k does not. Let us provide some motivation for these complexity measures. A natural way to measure the error of the ERM is via its fixed-design loss

    ‖ĥ_ERM − h^*‖²_k := (1/k) Σ_{i=1}^k ( ĥ_ERM(y_i) − h^*(y_i) )²,        (12a)

where y_1, . . . , y_k are precisely the k i.i.d. samples generated from the model (7) using which the ERM procedure is computed; consequently, the estimate ĥ_ERM is not independent of the randomness in these samples. The noise complexity (9b) and associated critical inequality (11b) are useful in bounding this quantity. However, we are interested in controlling the error measured by the random variable

    (1/k) Σ_{i=1}^k ( ĥ_ERM(ỹ_i) − h^*(ỹ_i) )²,        (12b)

with fresh samples {ỹ_i}_{i=1}^k drawn from the distribution P_Y. A natural question is whether the error measures defined in equation (12) are close to each other. The functional τ_k defined in equation (11a) provides such a measure of closeness for an appropriate value of the scalar γ_1. In particular, a uniform law of large numbers holds in this problem, as a consequence of which both error metrics are in fact close to the expected error ‖ĥ_ERM − h^*‖²_2.
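The critical radii (11a)–(11b) are defined as fixed points of a localized complexity, and for star-shaped classes the map τ ↦ (complexity at radius τ)/τ is non-increasing, so the infimum is a single crossing point that can be found by bisection. The sketch below assumes access to some numerical proxy `complexity_at(tau)` (for instance, the Monte Carlo estimator above on a finite sub-family); it is an illustration, not part of the paper's algorithm.

    import numpy as np

    def critical_radius(complexity_at, gamma, lo=1e-8, hi=1e2, iters=200):
        """Smallest tau > 0 with complexity_at(tau) <= tau**2 / gamma, by bisection.
        Assumes tau -> complexity_at(tau)/tau is non-increasing (star-shaped class),
        so the excess complexity_at(tau) - tau**2/gamma crosses zero exactly once."""
        excess = lambda tau: complexity_at(tau) - tau ** 2 / gamma
        if excess(hi) > 0:
            return np.inf                       # no crossing inside the bracket
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if excess(mid) > 0 else (lo, mid)
        return hi

    # Toy example: complexity growing like sqrt(tau / k), roughly as in Lemma 4 below.
    tau_star = critical_radius(lambda t: 0.5 * np.sqrt(t / 1000.0), gamma=1.0)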

With this lengthy setup complete, we are finally ready to state our assumption on the function class H. For simplicity, we assume that the function class is uniformly bounded (footnote 16).

Assumption 4 (Bounded function class). There is a positive constant b such that for all h ∈ H, we have ‖h‖_∞ ≤ b.

The following proposition states a bound on the rate of the ERM algorithm in terms of the complexity functions defined above, with the shorthand H^* ≡ H_{h^*}. Recall that the observations are corrupted by sub-Gaussian noise with parameter ρ̄.

Proposition 2 (Theorems 14.1 and 13.5 of Wainwright [Wai19]). (a) Suppose that Assumption 4 holds, and that we observe k samples from the model (7). Then there are absolute constants (c_1, c_2) such that for each scalar u ≥ τ_k(h^*; b), we have

    | ‖f‖²_2 − ‖f‖²_k | ≤ (1/2) ‖f‖²_2 + (1/2) u²

uniformly for all functions f ∈ H^*, with probability exceeding 1 − c_2 exp( −c_1 k u²/b² ).

(b) Suppose that Assumption 2 holds. Then there are absolute constants (c_1, c_2) such that for each u ≥ μ_k(h^*, y_1^k; 2ρ̄), the fixed-design loss (12a) of the ERM algorithm run on k samples from the model (7) satisfies

    Pr{ ‖ĥ_ERM − h^*‖²_k ≥ 16 u μ_k(h^*, y_1^k; 2ρ̄) } ≤ c_2 exp( −c_1 k u · μ_k(h^*, y_1^k; 2ρ̄) / ρ̄² ).

Applying this proposition leads to the following consequence of our main result Theorem 1 when the procedure A corresponds to the ERM algorithm. The proof follows straightforwardly by combining Theorem 1 and Proposition 2, but we provide it in Section 5.3 for completeness. We nevertheless state the result as a theorem for stylistic reasons, using the shorthand ρ_{σ,t} := √( ∆_t + ν²_t ρ²_σ ) for convenience.

Footnote 16: This assumption can be relaxed to require that for some p ≥ q ≥ 2 and all h ∈ H with ‖h‖_2 ≤ 1 we have E[h^p(Y)] ≤ b^{p−q} E[h^q(Y)] when Y ∼ P_Y. We do not pursue this extension here, and direct the reader to Wainwright [Wai19] and Mendelson [Men14] for details.


Theorem 2. Suppose that Assumptions 1, 2 and 4 hold. Also suppose that the iterates θ_0, . . . , θ_T are generated by running Algorithm 2 with procedure A corresponding to the ERM algorithm. There are absolute constants (c_1, c_2) such that for each t = 0, . . . , T − 1, if ∆_t ≤ 99/100,

    ( τ²_{N̄}(ν_t h^*; b) ∨ b²/N̄ ) + ( μ²_{N̄}(ν_t h^*, {y_i}_{i∈D_{2t+1}}; 2ρ_{σ,t}) ∨ ρ²_{σ,t}/N̄ ) log(c_2/δ) + ρ²_σ ≤ c_1 κ²,    and
    N̄ ≥ c_2 max{ p, κ^{−2} log²(1/κ) log(c_2/δ) },

then

    ∆_{t+1} ≤ c_2 [ ( τ²_{N̄}(ν_t h^*; b) ∨ b²/N̄ ) + ( μ²_{N̄}(ν_t h^*, {y_i}_{i∈D_{2t+1}}; 2ρ_{σ,t}) ∨ ρ²_{σ,t}/N̄ ) log(c_2/δ) + ρ²_σ ] · ( (p + log(4/δ)) / N̄ )        (13)

with probability exceeding 1 − δ.

It is worth making a few remarks on the theorem. Note that once again, we have stated the result only for one step of our iterative algorithm; in order to produce a guarantee on the 'final' iterate we will have to recurse this bound, and subsequently, bound (i) the number of iterations T required to reach a fixed point of the recursive relation (13), and (ii) the error of the fixed point of the recursion. To provide a qualitative answer to point (i), first note that for most non-parametric function classes, the RHS of equation (13) is always strictly positive for each finite N, so that we can never hope to show an exact recovery guarantee for the algorithm. In other words, zero is not a fixed point of the error recursion. Typical arguments used in analyses of many non-parametric regression problems show that

    τ²_N ∼ (C_1/N)^{λ_1}    and    μ²_N ∼ (C_2/N)^{λ_2},        (14)

where λ_1 and λ_2 are two fixed constants in the unit interval that depend on the regression problem, and the constants (C_1, C_2) depend on the remaining quantities that parametrize each of these complexity functions. In this case, it suffices to apply the error recursion for T_0 = O(log log(N)) iterations in order to arrive within a constant multiplicative factor of the fixed point (footnote 17), where the constants absorbed by this asymptotic notation depend on the other parameters of the problem and the scalars (λ_1, λ_2). For a specific illustration of this phenomenon, see Corollary 1 to follow.
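The double-logarithmic iteration count is easy to see numerically: iterating a one-step bound of the shape (13)–(14) contracts extremely quickly to its fixed point. The short Python sketch below uses entirely hypothetical constants and exponents (λ_1 = λ_2 = 2/3) purely to illustrate this behavior; it is not a simulation of the actual algorithm.

    def recurse(step, delta0, T):
        d = delta0
        for _ in range(T):
            d = step(d)
        return d

    N, p, rho_sq = 10_000, 20, 1e-4          # hypothetical problem sizes
    step = lambda d: ((1.0 / N) ** (2 / 3) + ((d + rho_sq) / N) ** (2 / 3) + rho_sq) * (p / N)
    for T in (1, 2, 3, 4, 8):
        print(T, recurse(step, delta0=0.5, T=T))   # stabilizes after very few iterations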

The abstract bound (14) also provides a qualitative answer to point (ii) above: taking N → ∞, we see immediately that the fixed point has error bounded by a quantity (o(1) + ρ²_σ) · (p/N). Comparing such an error bound with equation (2), we verify what was already alluded to after the statement of Theorem 1: when ρ_σ ≪ σ, using a consistent ERM estimator improves the rate of parameter estimation uniformly for all noise levels.

It is also helpful to state a consequence of the bound (13) when ρ_σ = 0; this is achieved for noiseless SIMs if the function g^* is invertible on the interval I. Let m^*_{N̄,δ} denote the value of m satisfying the fixed point relation

    m = c_2 μ²_{N̄}( h^*, y_1^k; 2√m ) · (p/N̄) · log(c_2/δ).

Footnote 17: Since zero is not a fixed point, the number of iterations required to ensure convergence to within a multiplicative factor of the fixed point is finite; this is in contrast to problems for which we would like to guarantee exact recovery [NJS13], and crucially, bounds the number of resampling steps required by the algorithm.


Then assuming that the error recursion converges to its fixed point, we have the bound

    ∆_T ≤ C ( (p/N̄) · τ²_{N̄}(b) + m^*_{N̄,δ} )        (15)

for the 'final' iterate of our algorithm. Once again, it is worth noting that the final error in the noiseless case is strictly better than the p/N rate if the complexity term τ_{N̄} decays with N; Corollary 1 provides an example of such a phenomenon.

Remark 4 (Sharpness in the noiseless regime). The bound (15) is unlikely to be the sharpest bound one can prove in general SIMs for the noiseless case. There are other analyses of the ERM tailored to capture the correct "version space" of the noiseless non-parametric regression problem, and these will be sharper than the bounds presented above. For a more in-depth discussion, see Appendix B.

3.4 Consequences for monotone SIMs

In this section, we apply the general result given by Theorem 2 to the case where the link function g^* is monotone. Throughout, suppose that we have n i.i.d. samples drawn from the SIM (4), where the noise distribution is Gaussian of variance σ². We also make a further assumption on the link function g^*; we require some additional notation in order to state it. Let c_{n,δ} = √(2 log(8n/δ)), and recall that the set of sub-differentials of a function g at the point x is given by

    ∂g(x) = { y ∈ R : g(z) ≥ g(x) + y · (z − x) for all z ∈ R }.

For a pair of reals a < b, we say that a ≤ ∂g(x) ≤ b if each element y in the set of sub-differentials obeys the inclusion y ∈ [a, b]. With these definitions in place, we make the following assumption on the link function g^*.

Assumption 5. The function g^* is continuous with 0 < m ≤ ∂g^*(z) ≤ M < ∞ for all z ∈ [−c_{n,δ}, c_{n,δ}].

Link functions employed in generalized linear models largely satisfy Assumption 5 (footnote 18), and more generally, the class of SIMs satisfying Assumption 5 has been extensively studied as a generalization of GLMs [KS09; KKS+11]. Note that in contrast to general SIMs, the invertibility of the true function makes this class comparatively easier to handle. Let us now specify the two oracles that we require.

Labeling oracle: For this class of SIMs, the labeling oracle is trivial to implement. Simply output:

• The interval I = [−cn,δ, cn,δ],

• All n samples of the SIM, and

• The function class

    H = { h : R → I | h non-decreasing },

which is a convex set by definition. In Lemma 2 in the proof section, we show that this class contains, with high probability, all the appropriate conditional expectations that we hope to model.

Footnote 18: In some cases, it may be necessary to choose the tuple (m, M) to be functions of n and δ (e.g., for the logistic link function), but these will typically be functions that decrease/increase sub-polynomially in n.


Non-parametric inverse regression procedure: We let A correspond to the ERM procedure over the function class H defined above. In this special case, the algorithm can be implemented in near-linear time via the pool adjacent violators algorithm [BBB+72; GW84].
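For concreteness, a minimal Python implementation of pool adjacent violators is sketched below; it computes the least-squares projection of a response vector onto the non-decreasing cone, which is exactly the ERM over the monotone class once the samples have been sorted by their y-values (the range constraint h : R → I can then be enforced by clipping). Library routines such as sklearn's IsotonicRegression provide the same computation.

    import numpy as np

    def pava(values, weights=None):
        """Pool adjacent violators: least-squares isotonic (non-decreasing) fit to
        `values`, assuming the corresponding design points are already sorted."""
        values = np.asarray(values, dtype=float)
        weights = np.ones_like(values) if weights is None else np.asarray(weights, dtype=float)
        blocks = []                              # each block: [mean, total_weight, length]
        for v, w in zip(values, weights):
            blocks.append([v, w, 1])
            # Merge adjacent blocks while they violate monotonicity.
            while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
                m2, w2, l2 = blocks.pop()
                m1, w1, l1 = blocks.pop()
                wt = w1 + w2
                blocks.append([(m1 * w1 + m2 * w2) / wt, wt, l1 + l2])
        return np.concatenate([np.full(l, m) for m, _, l in blocks])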

With the labeling oracle and inverse regression procedure specified, it remains to verify the various technical assumptions required to apply Theorem 2. Let κ_0 = M/m denote a natural notion of conditioning in the problem. In Lemma 3, we show that Assumption 2 holds with

    ρ_σ ≤ ρ_mono := C ( σ² c_{n,δ} √(κ²_0 − 1) + (σ/m) ( log(3κ_0) ∨ c_{n,δ} ) ).        (16)

When σ is small, i.e., in our regime of interest, we have ρ_mono ≲ σ · c_{n,δ}, where the notation hides problem-dependent factors. Another special case is when M = m and g^*(z) = mz a.e.; here, we have ρ_mono = C (σ/m) c_{n,δ}, and σ/m is the right proxy for the noise-to-signal ratio in linear models.

Bounds on the complexity terms are provided in Lemma 4, and Assumption 4 holds trivially with b = c_{n,δ}. We are thus led to the following corollary of Theorem 2, in which we use the shorthand n̄ = n/2T for convenience.

Corollary 1. Suppose that Assumption 5 holds, and that the labeling oracle and regression procedure are given by the discussion above. Then there is a tuple of absolute constants (c_1, c_2, c_3, c_4) such that for each t = 0, 1, . . . , T − 1, if

    n̄ ≥ c_2 p,    ∆_t ≤ 99/100,    and    ρ_mono ≤ c_1,

then

    ∆_{t+1} ≤ c_2 [ (log n̄ / n̄)^{2/3} + ( (∆_t + ρ²_mono) / n̄ )^{2/3} + ρ²_mono (p/n̄) log(c_2/δ) ] · log(c_2 n̄/δ)        (17a)

with probability exceeding 1 − δ. Consequently, if in addition we have n ≥ c_2 p log² n, then when c_3 log(log n) ≤ T ≤ c_4 log(log n), we obtain

    ∆_T ≤ c_2 · p log n · [ ( (log n · log(log n)) / n )^{5/3} + ρ²_mono · (log n · log(log n)) / n ]        (17b)

with probability exceeding 1 − c_2 n^{−9}.

Once again, a few comments are in order. First, note that by our discussion above, the bound (17b) recovers the correct behavior in a linear model up to a poly-logarithmic factor. Second, note the following consequence of the bound (17b) in order to facilitate a more transparent discussion. Assuming the initial angle made by θ_0 with θ^* is acute, we have

    ‖θ_T − θ^*‖² ≲ σ² (p/n)    if σ ≥ n^{−1/3},    and    ‖θ_T − θ^*‖² ≲ p/n^{5/3}    otherwise,        (18)

where the ≲ notation above ignores both problem-dependent constants that depend on the pair (m, M), as well as logarithmic factors in n. Comparing the bounds (2) and (18), we see immediately that the estimation bias is significantly reduced, and this comparison helps explain the behavior seen in Figure 1.

While Corollary 1 clearly provides a guarantee that is significantly better than classical estimators when σ is small, it is worth noting that it is derived as a consequence of Theorem 2, which may not be the sharpest possible result obtainable when, for instance, σ = 0. In the next section, we take a slightly different route towards understanding the zero-noise setting, by designing a slightly different procedure that is motivated by analysis considerations.

4 Identifiability and the noiseless case

We now investigate whether the scaling predicted by Theorem 2 is improvable in the noiseless case; the assumption σ = 0 will be made throughout this section. As before, we assume access to both a labeling oracle and an inverse regression oracle. For analytical ease, we now study a slight variant of Algorithm 2, presented as Algorithm 3 for general SIMs; the only difference here is that we perform both the non-parametric regression and the least squares fit on the same set of samples in every iteration.

Algorithm 3: The LTI-SIM meta-algorithm without sample-splitting for the two regressions

Input: Data of N samples {(x_i, y_i)}_{i∈S} returned by the labeling oracle; non-parametric regression procedure A; initial parameter θ_0; number of iterations T.
Output: Final parameter estimate θ_T.
1. Initialize t ← 0. Split the data into T equal portions indexed by D_1, . . . , D_T.
repeat
2. Form the function estimate ĥ_{t+1} ∈ H by computing

       ĥ_{t+1} = A( { (y_i, 〈x_i, θ_t〉) }_{i∈D_{t+1}} ).        (19)

3. Letting X_{t+1} denote the (N/T) × p matrix with rows {x_i}_{i∈D_{t+1}}, and stacking up the responses {ĥ_{t+1}(y_i)}_{i∈D_{t+1}} in a vector v, compute

       θ̂_{t+1} = X†_{t+1} v.

4. Compute the normalized parameter θ_{t+1} = θ̂_{t+1}/‖θ̂_{t+1}‖.
until t = T;
5. Return θ_T.

Algorithm 3 is arguably more natural than Algorithm 2 since it makes more efficient use of the samples within the alternating minimization update (see Remark 1). Let us now state a guarantee for this algorithm in the case where A corresponds to the ERM algorithm. This result uses Proposition 2(b); recall the functional μ_k defined in equation (11b).
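As a hedged illustration of how the alternating structure of Algorithm 3 looks in code, the Python sketch below specializes the two regressions to the monotone case of Section 3.4 (isotonic inverse regression followed by least squares and normalization). It omits the labeling-oracle step and uses illustrative names; it is a sketch of the update, not the paper's full procedure.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def lti_sim_monotone(X, y, theta0, T, rng=None):
        """Alternating update in the spirit of Algorithm 3 for monotone SIMs."""
        theta = np.asarray(theta0, dtype=float)
        theta /= np.linalg.norm(theta)
        splits = np.array_split(np.random.default_rng(rng).permutation(len(y)), T)
        for t in range(T):
            idx = splits[t]
            iso = IsotonicRegression(out_of_bounds="clip")
            h_vals = iso.fit_transform(y[idx], X[idx] @ theta)        # step 2: inverse regression
            theta, *_ = np.linalg.lstsq(X[idx], h_vals, rcond=None)   # step 3: linear regression
            theta /= np.linalg.norm(theta)                            # step 4: normalize
        return theta

With a monotone link and a reasonable initialization, a handful of iterations already suffices in practice, consistent with the O(log log n) iteration counts discussed above.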

Theorem 3. Consider the noiseless case σ = 0, and assume that ρ_0 = 0. Suppose that Assumption 1 holds, and that we run Algorithm 3 with A corresponding to the ERM algorithm. There is a pair of absolute constants (c_1, c_2) such that if ∆_t ≤ 99/100, N̄ ≥ c_2 max{p, log(c_2/δ)}, and

    ( μ²_{N̄}(ν_t h^*, {y_i}_{i∈D_{t+1}}; 2√∆_t) ∨ ∆_t/N̄ ) log(c_2/δ) ≤ c_1,

then

    ∆_{t+1} ≤ c_2 · ( μ²_{N̄}(ν_t h^*, {y_i}_{i∈D_{t+1}}; 2√∆_t) ∨ ∆_t/N̄ ) log(c_2/δ)        (20)

with probability exceeding 1 − δ.

It is worth comparing Theorem 3 with Theorem 2 specialized to σ = 0. The conclusion of Theorem 3 does not have the p/N̄ contraction factor present in inequality (8b) of Theorem 2, and in this sense, is somewhat weaker than Theorem 2. However, it has one distinct advantage, in that the bound only depends on the noise complexity functional μ_{N̄}. As we illustrate in Corollary 2, this facilitates a sharper analysis that further reduces the error floor when σ = 0.

4.1 Sharpening the bound for monotone SIMs

Let us once again use the particular example of the monotone SIM to provide a concrete rate for the noiseless case. We employ the same labeling oracle as before, and also the ERM algorithm as our inverse regression procedure. In stating the following result, we use the shorthand n̄ = n/2T for consistency.

Corollary 2. Consider the noiseless case σ = 0. Suppose that Assumption 5 holds, and that the labeling oracle and regression procedure are as in Section 3.4. Then there is a tuple of absolute constants (c_1, c_2, c_3, c_4) such that

    ∆_{t+1} ≤ c_2 · c²_{n,δ} ( ∆_t log n̄ / n̄ )^{2/3}        (21a)

for all t = 0, 1, . . . , T − 1 with probability exceeding 1 − δ. Consequently, when c_3 log(log n) ≤ T ≤ c_4 log(log n), we obtain

    ∆_T ≤ c_2 log²(log n) · (log⁵ n) / n²        (21b)

with probability exceeding 1 − c_2 n^{−9}.

Corollary 2 shows that in the noiseless case, we can obtain significantly faster rates than those guaranteed by Corollary 1, especially in high dimensions, as seen by the following simplification:

    ‖θ_T − θ^*‖² ≲ 1/n²    if σ = 0,        (22)

which ignores poly-logarithmic factors. An important question here is whether there is a result that unifies both Corollaries 1 and 2 (or equivalently, the bounds (18) and (22)) and provides a (continuously varying) rate for all noise levels. On a related note, is there a fundamental limit of parameter estimation under the noiseless model, and does the bound (22) capture it? This question leads to our next and final result.


4.2 A lower bound on identifiability

In this section, we set σ = 0 and study the identifiability of monotone SIMs satisfying Assumption 5. A fundamental question here is whether one can in fact obtain an exact parameter estimate under this model; after all, Corollary 2 only shows that we can estimate the parameter at rate n^{−2} with n samples, which is non-zero for all finite n. The following proposition shows that there does indeed exist an information-theoretic lower bound that precludes exact recovery for any finite sample size.

Proposition 3. Suppose that 2 ≤ p ≤ n/2. There is an absolute constant c and another constant C′_{m,M} that depends only on the pair (m, M) such that the following holds. There are two function-parameter pairs (g^*, θ^*) and (g, θ) such that g^* and g both satisfy Assumption 5, the parameters θ^* and θ both have unit norm, and we have

    ‖θ^* − θ‖² ≥ C′_{m,M} · p / (n log n)⁴    and    g^*(〈θ^*, x_i〉) = g(〈θ, x_i〉) for all i ∈ [n]        (23)

with probability exceeding 1 − c (log n)^{−2}.

A few points are worth noting. The constant C′_{m,M} is made explicit in the proof, and, as expected, satisfies C′_{m,M} = 0 whenever m = M. Second, note that while Proposition 3 provides a family of results, one for each value of the pair (p, n), the strongest bound (for any fixed n) is obtained by setting p = n/2. In this case, we obtain a lower bound of the order Ω(n^{−3}), which still exhibits a gap to the upper bound O(n^{−2}) proved in Corollary 2. Closing this gap is an interesting open problem. On a related note, we stress that the proof of Proposition 3 is constructive: we exhibit explicit constructions of the pairs (g^*, θ^*) and (g, θ) that satisfy the statement of the proposition. Doing so involves bounding the spacings between n i.i.d. samples generated from a Gaussian distribution, and the slack between the lower bound of Proposition 3 and the corresponding upper bounds of Corollaries 1 and 2 arises due to the fact that the minimum spacing between these points is a factor of O(n) smaller than the average spacing.

5 Proofs

In this section, we provide proofs of our main results. We begin by proving Theorem 1, and then derive the various corollaries stated in the main text. Proposition 1, though stated first in the main text, is proved in Section 5.8.

A few notes to the reader. Throughout our proofs, we assume that n is greater than some universal constant; the complementary case can be handled by appropriately modifying the constants in the proofs. Often, we work with the random variables defining a model (denoted by capital letters) before instantiating the model on samples (denoted by lowercase letters). Finally, we use c, c_1, c′, . . . to denote universal constants whose values may change from line to line.

5.1 Proof of Theorem 1

Each covariate is given by p i.i.d. random variables X = (X_1, X_2, . . . , X_p). Assume wlog, by the rotational invariance of the Gaussian distribution, that θ^* = e_1, so that 〈X, θ^*〉 = X_1. Recall the random variable W given by the truncation of X_1 to the interval I. Recall the function h^*, given by

    h^*(y) = E[W | Y = y]    for each y ∈ R.

Also recall the (unobservable) model (5) given by

    W = h^*(y) + ξ(y),

for each fixed value of y, where ξ(y) = [W | Y = y] − E[W | Y = y] denotes noise that obeys E[ξ(y)] = 0 for each y by definition, and is ρ_σ-sub-Gaussian for each y ∈ R by Assumption 2. Finally, recall that we denoted the covariate distribution post-truncation by P^I_X. Note that in each sample, we also observe p − 1 other covariates X_2, . . . , X_p, each drawn from a standard Gaussian that is independent of everything else. Let α_t = ∠(θ_t, θ^*) and note that for X ∼ P^I_X, we have

    〈X, θ_t〉 = cos(α_t) W + sin(α_t) X̃,

where X̃ ∼ N(0, 1) is some linear combination of the random variables X_2, . . . , X_p (and therefore independent of W). Recall the shorthand ν_t = cos(α_t) and ∆_t = sin²(α_t). Suppose for the rest of this proof that α_t is acute, so that sin(α_t) = √∆_t; the complementary case is similar, provided we work with the angle ∠(θ_t, −θ^*) instead. With this setup at hand, we are now ready to prove the theorem. We organize the proof by providing error guarantees for the two sub-steps of Algorithm 2, and then putting together the pieces.

Error due to non-parametric regression: The procedure A is given N̄ = N/2T samples drawn from the observation model

    〈x_i, θ_t〉 = ν_t · h^*(y_i) + ν_t · ξ_i + √∆_t · x̃_i    for i ∈ D_{2t+1}.

By the star-shaped nature of H, we have ν_t · h^* ∈ H, so that this is now a non-parametric regression model where we observe N̄ i.i.d. evaluations of the true function ν_t h^* corrupted by noise. In particular, comparing with the model (7), we have ρ² = ∆_t. By Assumption 3, the procedure A uses these samples to then return a function ĥ_{t+1} ∈ H that satisfies, for each δ ∈ (0, 1), the inequality

    (1/N̄) Σ_{i∈D_{2t+2}} ( ĥ_{t+1}(y_i) − ν_t h^*(y_i) )² ≤ R^A_{N̄}(ν_t h^*; ν_t P^*_{Y,ξ}, ∆_t, δ)

with probability exceeding 1 − δ.

Error due to linear regression: We now show that performing the linear regression step leads to an error contraction by a multiplicative factor roughly p/N̄. For this, we require the following lemma. For a vector v, we let v_i denote its i-th entry, and let v_{\i} = v − v_i · e_i denote the vector with its i-th entry zeroed out.

Lemma 1. Suppose we are given a matrix X = (X_1, . . . , X_p) ∈ R^{n×p}, where the columns X_2, . . . , X_p are drawn i.i.d. from N(0, I_n). Also suppose we are given the n-dimensional vector y = τ X_1 + z for some scalar τ and some vector z ∈ R^n that is fixed independently of the random vectors X_2, . . . , X_p. Then there is an absolute constant c such that if n ≥ c max{p, log(c/δ)}, then the estimate β̂ = X†y obeys the inequalities

    ‖β̂_{\1}‖² ≤ 16 · ( (p + log(4/δ)) / n² ) · ‖z‖²,    and        (24a)
    β̂²_1 ≥ τ²/2 − 3 ‖z‖² / ‖X_1‖²,        (24b)

with probability exceeding 1 − 3δ/4. Moreover, on this event, if τ > √32 · ‖z‖/‖X_1‖, then β̂_1 > 0.
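Before using the lemma, a quick Monte Carlo check of the scalings it predicts may be helpful; the constants and sizes below are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, tau = 4000, 50, 0.8
    x1 = rng.standard_normal(n)                   # the "signal" column X_1
    W = rng.standard_normal((n, p - 1))           # independent Gaussian columns X_2, ..., X_p
    z = 0.05 * rng.standard_normal(n)             # a vector fixed independently of W
    X = np.column_stack([x1, W])
    beta = np.linalg.pinv(X) @ (tau * x1 + z)
    print(beta[0])                                            # close to tau, cf. (24b)
    print(np.sum(beta[1:] ** 2), (p / n**2) * np.sum(z**2))   # same order, cf. (24a)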

We prove this lemma at the end of the section. Let us now use it to provide an error guarantee on our problem. For a fresh draw of the pair (X, Y) with marginals X ∼ P^I_X and Y ∼ P_Y, we have

    ν_t W = ν_t 〈X, θ^*〉 = ν_t h^*(Y) + ν_t ξ = ĥ_{t+1}(Y) + ν_t ξ + ( ν_t h^*(Y) − ĥ_{t+1}(Y) ),

so that rearranging yields

    ĥ_{t+1}(Y) = ν_t 〈X, θ^*〉 − ν_t ξ − ( ν_t h^*(Y) − ĥ_{t+1}(Y) ).        (25)

Notably, all random variables on the RHS are functions only of the tuple (W, ε) and hence independent of the random variables X_2, . . . , X_p.

The linear regression step is performed on the samples i ∈ D_{2t+2}. It is therefore helpful to instantiate the model (25) on these samples, and write

    ĥ_{t+1}(y_i) = ν_t 〈x_i, θ^*〉 − [ ν_t ξ_i + ( ν_t h^*(y_i) − ĥ_{t+1}(y_i) ) ] =: ν_t 〈x_i, θ^*〉 − ξ′_i    for i ∈ D_{2t+2};

crucially, due to sample-splitting across the two sub-steps of the algorithm, we have ensured that the function estimate ĥ_{t+1} can be regarded as fixed, since it is independent of the samples i ∈ D_{2t+2}. Step 3 of the algorithm models the value ĥ_{t+1}(y_i) as a linear response to the covariates x_i. In particular, stack the covariates {x_i}_{i∈D_{2t+2}} in a matrix and the responses {ĥ_{t+1}(y_i)}_{i∈D_{2t+2}} in a vector, and let ξ′ ∈ R^{N̄} denote the corresponding "noise" vector. Recall the condition N̄ ≥ c max{p, log(c/δ)} assumed in Theorem 1, and recall that our regression estimate obtained as a result of step 3 of Algorithm 2 was denoted by θ̂_{t+1}. Applying Lemma 1 yields, with probability exceeding 1 − 3δ/4, the implications

    ‖θ̂_{t+1}‖² sin²(α_{t+1}) ≤ 16 · ‖ξ′‖² · ( (p + log(4/δ)) / N̄² ),    and        (26a)
    ‖θ̂_{t+1}‖² cos²(α_{t+1}) ≥ ν²_t / 2 − 3 ‖ξ′‖² / Σ_{i∈D_{2t+2}} 〈x_i, θ^*〉²,        (26b)

where we have used the fact that the scalar τ in the lemma is equal to ν_t.

Putting together the pieces: We are finally ready to put together the pieces. Applying the Cauchy–Schwarz inequality yields

    ‖ξ′‖² / N̄ ≤ 2 · [ (1/N̄) Σ_{i∈D_{2t+2}} ν²_t ξ²_i + (1/N̄) Σ_{i∈D_{2t+2}} ( ν_t h^*(y_i) − ĥ_{t+1}(y_i) )² ].

By Assumption 2, each random variable ξ_i is ρ_σ-sub-Gaussian, so that

    Pr{ (1/N̄) Σ_{i∈D_{2t+2}} ξ²_i ≥ ρ²_σ ( 1 + √( log(6/δ)/N̄ ) ) } ≤ δ/3        (27)

for each N̄ ≥ log(6/δ). On the other hand, Assumption 3 guarantees that we have

    Pr{ (1/N̄) Σ_{i∈D_{2t+2}} ( ν_t h^*(y_i) − ĥ_{t+1}(y_i) )² ≥ R^A_{N̄}(ν_t h^*; ν_t P^*_{Y,ξ}, ∆_t, δ/3) } ≤ δ/3.

Putting together the pieces, we have

    ‖ξ′‖² / N̄ ≤ C ( ν²_t ρ²_σ + R^A_{N̄}(ν_t h^*; ν_t P^*_{Y,ξ}, ∆_t, δ/3) )        (28a)

with probability greater than 1 − 2δ/3. Additionally, Lemma 11 from the appendix guarantees that provided N̄ ≥ c_1 κ^{−2} log²(1/κ) log(4/δ), we have

    Σ_{i∈D_{2t+2}} 〈x_i, θ^*〉² = Σ_{i∈D_{2t+2}} w²_i ≥ (1/2) κ² N̄        (28b)

with probability exceeding 1 − δ/4. Now note that we have ∆_t ≤ 99/100 by assumption, which guarantees the relation ν_t ≥ 1/10. Thus, on the intersection of the two events defined in inequality (28), inequality (26) yields

    tan²(α_{t+1}) ≤ C ( ρ²_σ + R^A_{N̄}(ν_t h^*; ν_t P^*_{Y,ξ}, ∆_t, δ/3) ) · ( (p + log(4/δ)) / N̄ ),

where we have also used that the condition ρ²_σ + R^A_{N̄}(ν_t h^*; ν_t P^*_{Y,ξ}, ∆_t, δ/3) ≤ c_1 holds for a small enough constant c_1, to ensure that the RHS of inequality (26b) is bounded below by a universal positive constant. Finally, noting the elementary inequality sin² α ≤ tan² α concludes the proof.

5.2 Proof of Lemma 1

Our proof of this lemma proceeds from first principles; we note that similar proofs are used to bound the variance inflation factor (VIF) in linear models (see, e.g., the book [NKN+96]). Use the more convenient notation x = X_1 and W = (X_2, . . . , X_p), so that the matrix is given by X = [ x  W ], and

    X^⊤X = [ ‖x‖²    x^⊤W ;  W^⊤x    W^⊤W ].

Note that for a general (invertible) symmetric matrix, the partial LDU decomposition of the inverse can be written as

    [ a  b^⊤ ; b  C ]^{−1} = [ 1  0 ; −C^{−1}b  I ] · [ (a − b^⊤C^{−1}b)^{−1}  0 ; 0  C^{−1} ] · [ 1  −b^⊤C^{−1} ; 0  I ],

where I denotes the identity matrix of appropriate dimension. Applying this to the matrix X^⊤X and using the shorthand P_W = W(W^⊤W)^{−1}W^⊤ for the projection matrix onto the range of the matrix W, we may write the pseudoinverse of X as

    X† = [ 1  0 ; −(W^⊤W)^{−1}W^⊤x  I ] · [ (‖x‖² − x^⊤P_W x)^{−1}  0 ; 0  (W^⊤W)^{−1} ] · [ 1  −x^⊤W(W^⊤W)^{−1} ; 0  I ] · [ x^⊤ ; W^⊤ ].

Now for an arbitrary vector v ∈ R^n, write 〈x, v〉 := x^⊤v; then we have

    [ 1  −x^⊤W(W^⊤W)^{−1} ; 0  I ] · [ x^⊤ ; W^⊤ ] v = [ x^⊤v − x^⊤P_W v ; W^⊤v ],

so that putting together the pieces yields

    X†v = [ 1  0 ; −(W^⊤W)^{−1}W^⊤x  I ] · [ (‖x‖² − x^⊤P_W x)^{−1} · (x^⊤v − x^⊤P_W v) ; W†v ].

Now using the shorthand P⊥_W = I − P_W for the projection matrix onto the orthogonal complement of the range of W, we have

    X†v = [ 1  0 ; −(W^⊤W)^{−1}W^⊤x  I ] · [ (x^⊤P⊥_W x)^{−1} · (x^⊤P⊥_W v) ; W†v ] = [ τ_v ; −W†x · τ_v + W†v ],

where we have let τ_v := (x^⊤P⊥_W x)^{−1} · (x^⊤P⊥_W v) for convenience.

Note that the above derivation holds for each v ∈ R^n. We are interested in a vector that can be written as v = τx + z. In this case, we have

    X†v = X†(τx) + X†z = τ e_1 + X†z = [ τ_z + τ ; −W†x · τ_z + W†z ].
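As a quick numerical sanity check on the block formula just derived (with illustrative dimensions), one can verify in Python that the block expression agrees with the pseudoinverse computed directly:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 6
    x = rng.standard_normal(n)
    W = rng.standard_normal((n, p - 1))
    v = rng.standard_normal(n)
    X = np.column_stack([x, W])
    P_perp = np.eye(n) - W @ np.linalg.pinv(W)          # projector onto range(W)^perp
    tau_v = (x @ P_perp @ v) / (x @ P_perp @ x)
    block = np.concatenate([[tau_v], np.linalg.pinv(W) @ (v - tau_v * x)])
    assert np.allclose(np.linalg.pinv(X) @ v, block)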

Up to this point, all of our steps were deterministic; we now use the fact that W is a standard Gaussian random matrix. In particular, letting w_1 denote the first row of the matrix W, and for any vector u fixed independently of W, we have

    ‖W†u‖² =_d ‖(W^⊤W)^{−1}W^⊤ e_1‖² ‖u‖²
           = ‖(W^⊤W)^{−1} w_1‖² ‖u‖²
           ≤ |||(W^⊤W)^{−1}|||²_op ‖w_1‖² ‖u‖²
           ≤ [ 1 / ( √n − √p − √(log(4/δ)) )² ]² · (p + log(4/δ)) · ‖u‖²,        (i)

where step (i) holds with probability exceeding 1 − δ/4 by tail bounds for χ² random variables and the minimum singular value of a Gaussian random matrix [Ver10].

We now use the assumption n ≥ c max{p, log(4/δ)} for a large enough constant c to obtain the inequality

    ‖W†u‖² ≤ 2 · ( (p + log(4/δ)) / n² ) · ‖u‖²,

which holds for each fixed vector u with probability exceeding 1 − δ/4. Moreover, we have u^⊤P⊥_W u = ‖P⊥_W u‖². Putting together the pieces and using the Cauchy–Schwarz inequality, the following sequence of bounds holds with probability exceeding 1 − δ:

    ‖β̂_{\1}‖² / 4 ≤ ( (p + log(4/δ)) / n² ) · ( τ²_z ‖x‖² + ‖z‖² )
                  = ( (p + log(4/δ)) / n² ) · ( (x^⊤P⊥_W z)² / ‖P⊥_W x‖⁴ · ‖x‖² + ‖z‖² )
                  ≤ ( (p + log(4/δ)) / n² ) · ( ‖z‖² / ‖P⊥_W x‖² · ‖x‖² + ‖z‖² ),        (ii)

where step (ii) uses the Cauchy–Schwarz inequality and the symmetry of the matrix P⊥_W to obtain |x^⊤P⊥_W z| ≤ ‖P⊥_W x‖ ‖z‖.

Now note that since x ⊥⊥ W, we have ‖P⊥_W x‖² / ‖x‖² =_d ‖P⊥_W e_1‖², which is the squared norm of a unit-norm n-dimensional vector projected onto a random (n − p + 1)-dimensional subspace. By well-known results (see, e.g., Dasgupta and Gupta [DG03]), this quantity is bounded below by an absolute constant (say 1/3) with probability exceeding 1 − δ/4 under our assumption n ≥ c max{p, log(4/δ)}. Putting together the pieces once again with this assumption, we have

    ‖β̂_{\1}‖² / 4 ≤ ( (p + log(4/δ)) / n² ) · ( 3 ‖z‖² + ‖z‖² )

with probability exceeding 1 − 3δ/4.

To lower bound the signal term, we once again use the Cauchy–Schwarz inequality to obtain

    β̂²_1 ≥ τ²/2 − τ²_z ≥ τ²/2 − 3 ‖z‖² / ‖x‖².

This concludes the proof.

Remark 5. Lemma 1 illustrates the role of the approximation error in the problem, and can be used to reason about variants of classical semi-parametric estimators. In particular, if E[g^*(Z)Z] = μ ≠ 0, then we may write g^*(X_1) = μX_1 + Z̃, where Z̃ is uncorrelated with X_1 due to the orthogonality properties of Hermite polynomials [Gra49]. Treating E|Z̃| as the approximation error (which is a constant for any non-linear g^*), we see that Lemma 1 guarantees that regressing our observations g^*(X_1) + ε on the covariates X_1, . . . , X_p yields an estimate with error (p/n)(σ² + E|Z̃|) (cf. equation (2)). The goal of first performing non-parametric regression to obtain a function estimate ĥ is to significantly reduce the approximation error of the problem.
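The first Hermite coefficient μ and the resulting residual are easy to visualize numerically; the sketch below uses a hypothetical non-linear link purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    Z = rng.standard_normal(1_000_000)
    g = np.exp                                  # illustrative non-linear link
    mu = np.mean(g(Z) * Z)                      # first Hermite coefficient E[g(Z) Z]
    resid = g(Z) - mu * Z                       # part of g(Z) uncorrelated with Z
    print(mu)                                   # approx e^{1/2} for this choice of g
    print(np.mean(np.abs(resid)))               # approximation error incurred by plain least squares
    print(np.corrcoef(Z, resid)[0, 1])          # close to zero, as Remark 5 asserts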

5.3 Proof of Theorem 2

First, we use Proposition 2 to provide a bound on the rate function R^{ERM}_k(ν_t h^*; ρ_{σ,t}, δ/3). Using the notation of Assumption 3, let ĥ denote the function estimate obtained as a result of running the ERM on k samples from the model (7). For a sufficiently large constant c_2, set

    u = c_2 · √( log(c_2/δ) ) · ( τ_k(ν_t h^*; b) ∨ b/√k )        (29)


in Proposition 2(a), and consider the function f := ĥ − ν_t h^* ∈ H_{ν_t h^*}. This yields the bound

    ‖ĥ − ν_t h^*‖²_2 ≤ 2 ‖ĥ − ν_t h^*‖²_k + c_2 · ( τ²_k(ν_t h^*; b) ∨ b²/k ) log(c_2/δ)

with probability exceeding 1 − δ/9. Moreover, applying the same result to the set of fresh samples ỹ_1, . . . , ỹ_k (note that our definition of the norm, etc. would have to change, but the same result applies), we have

    (1/k) Σ_{i=1}^k ( ĥ(ỹ_i) − ν_t h^*(ỹ_i) )² ≤ (3/2) ‖ĥ − ν_t h^*‖²_2 + (1/2) u²;

choosing u according to equation (29) and putting together the pieces implies the bound

    (1/k) Σ_{i=1}^k ( ĥ(ỹ_i) − ν_t h^*(ỹ_i) )² ≤ 3 ‖ĥ − ν_t h^*‖²_k + c_2 · ( τ²_k(ν_t h^*; b) ∨ b²/k ) log(c_2/δ)

with probability exceeding 1 − 2δ/9.

Finally, we bound ‖ĥ − ν_t h^*‖²_k using Proposition 2(b). Setting

    u = c_2 · log(c_2/δ) · ( μ_k(ν_t h^*, y_1^k; 2ρ_{σ,t}) ∨ ρ²_{σ,t} / ( k · μ_k(ν_t h^*, y_1^k; 2ρ_{σ,t}) ) )

and simplifying, we obtain

    ‖ĥ − ν_t h^*‖²_k ≤ c_2 ( μ²_k(ν_t h^*, y_1^k; 2ρ_{σ,t}) ∨ ρ²_{σ,t}/k ) log(c_2/δ)

with probability exceeding 1 − δ/9. Putting together the pieces, we have shown the bound

    R^{ERM}_k(ν_t h^*; ρ_{σ,t}, δ/3) ≤ c_2 [ ( τ²_k(ν_t h^*; b) ∨ b²/k ) + ( μ²_k(ν_t h^*, y_1^k; 2ρ_{σ,t}) ∨ ρ²_{σ,t}/k ) ] log(c_2/δ).

Substituting this expression into Theorem 1 by setting k = N̄ completes the proof, since at iteration t of the iterative algorithm, we have y_1^k = {y_i}_{i∈D_{2t+1}}.

5.4 Proof of Corollary 1

As mentioned in the discussion, this corollary follows from Theorem 2, and so it suffices, in addition to establishing Assumptions 1, 2, and 4, to bound the complexity functions τ_k and μ_k. All of these steps are presented in the following lemmas. Recall the value ρ_mono defined in equation (16), and our shorthand c_{n,δ} = √(2 log(2n/δ)).

Lemma 2. For a non-decreasing function g : R → R, consider the observation model

    Y = g(X) + σZ,        (30)

with Z ∼ N(0, 1) and X drawn from some Lebesgue measurable distribution. Then the function h(y) = E[X | Y = y] exists a.e., and is non-decreasing.


Lemma 3. Suppose that in the monotone single-index model (30), the link function g satisfies Assumption 5, and the covariate distribution is given by a Gaussian truncated to the interval [−c_{n,δ}, c_{n,δ}]. Then Assumption 2 holds with ρ_σ ≤ ρ_mono.

The next two lemmas are stated assuming that the function class H is given by

    H(b) = { h : R → [−b, b] | h non-decreasing }        (31)

for some positive real number b. Recall our notation for the shifted function class around h^*, given by H_{h^*} = { h − h^* | h ∈ H }. In the following lemmas, we also assume that h^* ∈ H.

Lemma 4. For each function h^* ∈ H, integer k, sequence of samples y_1^k, and scalar γ, we have

    τ²_k(h^*; b) ≤ c_2 b² ( log k / k )^{2/3},    and        (32a)
    μ²_k(h^*, y_1^k; γ) ≤ c_2 ( γ² b log k / k )^{2/3}        (32b)

for a sufficiently large constant c_2.

Remark 6. Note that we have not been particularly careful about the exact logarithmic factor in the bounds (32), since there are other logarithmic terms present in the final bound of Corollary 1. However, we do note that it is likely that these bounds can be sharpened to remove the logarithmic factor appearing on the RHS.

We prove these lemmas at the end of the section. Taking them as given for the moment, let us establish Corollary 1. Begin by defining the event

    E = { |〈x_i, θ^*〉| ≤ c_{n,δ} for all i ∈ [n] },

and noting that Pr{E} ≥ 1 − δ/4 by standard Gaussian tail bounds. We work on this event for the rest of the proof, so that Lemma 2 guarantees that the class H(c_{n,δ}) contains the function y ↦ E[〈X, θ^*〉 | Y = y]. Now, recall our shorthand ρ_{σ,t} = √( ∆_t + ν²_t ρ²_σ ), and suppose that the pair (∆_t, δ) satisfies the inequalities

    ∆_t ≤ 99/100    and
    ( τ²_{n̄}(ν_t h^*; b) ∨ b²/n̄ ) + ( μ²_{n̄}(ν_t h^*, {y_i}_{i∈D_{2t+1}}; 2ρ_{σ,t}) ∨ ρ²_{σ,t}/n̄ ) log(c_2/δ) + ρ²_mono ≤ c_1.        (33)

The second condition is satisfied for large enough n, and by the assumption that ρ_mono is bounded above by a small enough constant. The tuple (μ_{n̄}, τ_{n̄}, b) is chosen according to Lemma 4. Then, applying Theorem 2 and substituting the bounds guaranteed by Lemmas 3 and 4 yields the guarantee

    ∆_{t+1} ≤ c_2 [ c²_{n,δ} (log n̄/n̄)^{2/3} + ( c_{n,δ} (∆_t + ρ²_mono) log n̄ / n̄ )^{2/3} + ρ²_mono (p/n̄) log(c_2/δ) ]
            = c_2 [ (log n̄/n̄)^{2/3} + ( (∆_t + ρ²_mono)/n̄ )^{2/3} + ρ²_mono (p/n̄) log(c_2/δ) ] · c²_{n,δ}

with probability at least 1 − δ, where the reader should recall that the values of the absolute constants may change from line to line. This establishes the bound (17a).

It remains to translate this guarantee into a bound on the final iterate (17b). Toward that end, set δ = n^{−10}, and note that c²_{n,δ} ∼ log n to obtain the simplified one-step guarantee

    ∆_{t+1} ≤ c_2 [ (log n̄/n̄)^{2/3} + ( (∆_t + ρ²_mono)/n̄ )^{2/3} + ρ²_mono (p/n̄) log² n ],

which holds for each iteration t on the corresponding event E_t. On E_t and under the assumption n ≥ Cσ²(κ²_0 − 1) p log² n, it can be verified that ∆_{t+1} satisfies condition (33) for a large enough constant C. Consequently, the argument can be applied iteratively; for an integer value T_0 to be determined shortly, condition on the event ∩_{i=0}^{T_0} E_i. By the union bound, this event occurs with probability exceeding 1 − T_0 n^{−10}. Abusing notation slightly, let ρ_mono now denote the same quantity but with this value of δ substituted.

Now choose an integer value T satisfying C log log n ≤ T ≤ T_0 for a large enough absolute constant C and any T_0 ≤ n. Let us apply Lemma 10 in the appendix with the substitutions

    C_1 = c_2 (p/n̄) log² n [ (log n̄/n̄)^{2/3} + ρ²_mono ],
    C_2 = c_2 (p/n̄) log² n,    and
    C_3 = ρ²_mono,

and note that γ = 2/3 and ∆_0 ≤ 1 by definition. Then by choosing C large enough, we can ensure that T is large enough to satisfy the condition required by Lemma 10. Consequently, we have

    ∆_T ≤ c · [ (p/n̄) log² n ( (log n̄/n̄)^{2/3} + ρ²_mono ) + (p/n̄) log² n · ( ρ²_mono/n̄ )^{2/3} + (p³/n̄⁵) log⁶ n ]
        ≤ c · [ (p log² n/n̄) (log n̄/n̄)^{2/3} + (p log² n/n̄) ρ²_mono + (p log² n/n̄) ( ρ²_mono/n̄ )^{2/3} ]        (i)        (34)
              \_______ T_1 _______/           \____ T_2 ____/            \________ T_3 ________/

where in step (i), we have used the condition n ≳ p to obtain the bound

    (p³/n̄⁵) log⁶ n ≲ (p log² n/n̄) (log n̄/n̄)^{2/3}.

Finally, some algebra reveals that if T_2 ≤ T_3, then T_3 ≤ T_1, and so we may drop the term T_3 from the bound by changing the absolute constant, and this concludes the proof. We note that the poly-logarithmic factors in the final bound have not been optimized.

It remains to prove the various lemmas.

5.4.1 Proof of Lemma 2

We use f_X and f_Y to denote the marginal densities of the pair (X, Y). The notation f_{X,Y} is used to denote their joint density, and f_{X|Y} denotes the conditional density of X given Y. We use φ(·) to denote the standard Gaussian PDF, and 𝒳 to denote the support of X. We have

    h(y) := E[X | Y = y] = ∫_𝒳 x f_{X,Y}(x, y) dx / f_Y(y) = ∫_𝒳 x f_X(x) φ_σ(y − g(x)) dx / f_Y(y).

Now note that we have (d/dy) f_{X,Y}(x, y) = −( (y − g(x))/σ² ) f_{X,Y}(x, y) for each y; this follows by differentiating the Gaussian PDF. Further, note that f_Y(y) = ∫_𝒳 f_{X,Y}(x, y) dx, so we have

    σ² h′(y) = (1/f_Y(y)²) · ( −∫_𝒳 ∫_𝒳 x (y − g(x)) f_{X,Y}(x, y) f_{X,Y}(x̄, y) dx dx̄ + ∫_𝒳 ∫_𝒳 x (y − g(x̄)) f_{X,Y}(x, y) f_{X,Y}(x̄, y) dx dx̄ )
             = (1/f_Y(y)²) · ∫_𝒳 ∫_𝒳 x ( g(x) − g(x̄) ) f_{X,Y}(x, y) f_{X,Y}(x̄, y) dx dx̄.        (35)

Since the same statement holds with the roles of x and x̄ interchanged, we also have

    σ² h′(y) = (1/f_Y(y)²) · ∫_𝒳 ∫_𝒳 x̄ ( g(x̄) − g(x) ) f_{X,Y}(x, y) f_{X,Y}(x̄, y) dx dx̄.        (36)

Summing equations (35) and (36) yields

    2σ² h′(y) = (1/f_Y(y)²) · ∫_𝒳 ∫_𝒳 ( x − x̄ )( g(x) − g(x̄) ) f_{X,Y}(x, y) f_{X,Y}(x̄, y) dx dx̄ ≥ 0,

where the inequality follows from the monotonicity of g, which ensures that (x − x̄)(g(x) − g(x̄)) is non-negative.
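The conclusion of Lemma 2 is also easy to check numerically; the following Monte Carlo sketch uses an illustrative monotone link and bins the responses to approximate the conditional mean E[X | Y = y].

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(500_000)
    y = np.tanh(x) + 0.5 * rng.standard_normal(x.size)      # monotone link, illustrative
    edges = np.quantile(y, np.linspace(0, 1, 41))
    bins = np.clip(np.digitize(y, edges[1:-1]), 0, 39)
    cond_means = np.array([x[bins == b].mean() for b in range(40)])
    print(np.all(np.diff(cond_means) > -1e-2))               # non-decreasing up to Monte Carlo error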

5.4.2 Proof of Lemma 3

Recall that our (forward) observation model on the set of labeled samples is given by

    Y = g^*(W) + σZ,

where W is a standard Gaussian truncated to the interval [−c_{n,δ}, c_{n,δ}], the link function g^* : R → R is monotone, and Z is a standard normal independent of everything else. In addition, Assumption 5 also implies that the bounds

    m |a − b| ≤ |g^*(a) − g^*(b)| ≤ M |a − b|        (37)

hold for each pair of scalars (a, b).

We now split the rest of the proof into two cases.


Case g^*(−c_{n,δ}) ≤ y ≤ g^*(c_{n,δ}): In this case, note that the function g^{−1} is uniquely defined. Let us use the shorthand

    τ̲ ≡ √(1 + σ²/M²)  ≤  τ̄ ≡ √(1 + σ²/m²),

and note that both of these quantities are equal to 1 when σ = 0. It is also convenient to define

    σ̄ = √( σ² / (m² + σ²) ),    σ̲ = √( σ² / (M² + σ²) ),
    μ̄(y) = g^{−1}(y) / (1 + σ²/m²),    and    μ̲(y) = g^{−1}(y) / (1 + σ²/M²).

Once again, it is useful to keep in mind that we have σ̄ ≈ σ/m and μ̄(y) ≈ μ̲(y) ≈ g^{−1}(y) in the small-σ regime. Finally, let φ_τ denote the density of a zero-mean Gaussian with standard deviation τ, and let φ ≡ φ_1. Let Φ denote the CDF of the standard Gaussian.

We require the following lemma about the joint density of the pair (W, Y).

Lemma 5. For each g^*(−c_{n,δ}) ≤ y ≤ g^*(c_{n,δ}), we have

    f_{W,Y}(w, y) ≤ (τ̄ σ̄ / σ) φ_τ̄( g^{−1}(y) ) φ_σ̄( w − μ̄(y) ) κ^{−1} 1{ w ∈ [−c_{n,δ}, c_{n,δ}] },    and        (38a)
    f_Y(y) ≥ ( τ̲ σ̲ / (3σ) ) φ_τ̲( g^{−1}(y) ) κ^{−1}.        (38b)

The proof of Lemma 5 is postponed to the end of the subsection. Taking it as given for the moment, let us complete the proof of Lemma 3. Let us use the shorthand W_y ≡ [W | Y = y]. Lemma 5 yields the tail bound

    Pr( |W_y − μ̄| ≥ t σ̄ ) ≤ 3 (τ̄ σ̄)/(τ̲ σ̲) · φ_τ̄(g^{−1}(y)) / φ_τ̲(g^{−1}(y)) · Φ(−t) ∧ 1
                          ≤ 3 (τ̄ σ̄)/(τ̲ σ̲) · φ_τ̄(g^{−1}(y)) / φ_τ̲(g^{−1}(y)) · exp(−t²/2) ∧ 1
                          = 3 (σ̄/σ̲) exp( ( (g^{−1}(y))² / 2 ) · ( τ̲^{−2} − τ̄^{−2} ) ) exp(−t²/2) ∧ 1
                          ≤ 3 (M/m) exp( ( (g^{−1}(y))² / 2 ) · ( τ̲^{−2} − τ̄^{−2} ) ) exp(−t²/2) ∧ 1,        (i)

where in step (i), we have used the relation √( (σ² + M²)/(σ² + m²) ) ≤ M/m. Further substituting the values of the pair (τ̲, τ̄), we have

    Pr( |W_y − μ̄| > t σ̄ ) ≤ 3 (M/m) exp( ( (g^{−1}(y))² / 2 ) · σ̄ σ̲ (M² − m²)/(Mm) ) exp(−t²/2) ∧ 1
                          ≤ 3 (M/m) exp( ( (g^{−1}(y))² / 2 ) · σ² (M² − m²) ) exp(−t²/2) ∧ 1,        (ii)


where in step (ii), we have used the fact that when ρ_mono ≤ c_1, we have σ ≤ Cm. We now make note of the following series of inequalities, which holds for each tuple of positive scalars (K, t) satisfying K ≥ 1:

    e^K e^{−t²/2} ∧ 1 ≤ exp{ ( K − t²/2 ) ∧ 0 } = exp{ K ( 1 − t²/(2K) ) ∧ 0 } ≤ exp{ ( 1 − t²/(2K) ) ∧ 0 } ≤ exp{ 1 − t²/(2K) },        (i)

where step (i) uses the fact that K ≥ 1. Also define the positive scalar

    Γ(y) := √( ( (g^{−1}(y))² / 2 ) · σ² (M² − m²) + log(3M/m) )

for convenience, and note that we have

    sup_{ y ∈ [g^*(−c_{n,δ}), g^*(c_{n,δ})] } Γ(y) ≤ Γ̄ := √( ( c²_{n,δ} / 2 ) · σ² (M² − m²) + log(3M/m) ).

Putting together the pieces, we have

    Pr( |W_y − μ̄| / σ̄ ≥ t ) ≤ exp( 1 − t² / (2Γ²(y)) ),

and applying Lemma 5.5 of Vershynin [Ver10] yields the inequality

    ‖ξ(y)‖_{ψ₂} ≤ C ‖W_y − μ̄‖_{ψ₂} ≤ C σ̄ Γ(y) ≤ C ( σ/m ∧ 1 ) · Γ̄,        (ii)

where step (ii) follows from the fact that centering does not change the sub-Gaussian constant by more than a constant factor (see, e.g., Lemma 2.6.8 of Vershynin [Ver18]). Finally, using once again the fact that σ ≤ Cm and applying the elementary inequality √(a + b) ≤ √a + √b, which holds for any pair of positive reals (a, b), establishes the result for this case.

Case 2: When y ∉ [g^*(−c_{n,δ}), g^*(c_{n,δ})], we proceed by showing that the desired sub-Gaussian constant is still less than C (σ̄ Γ̄ ∨ σ̄ c_{n,δ}) for an absolute constant C. Let ȳ := g^*(c_{n,δ}), and first consider the case y ≥ ȳ. Define the (non-negative) random variable W′ := c_{n,δ} − W_y, and use the notation W ≡ W_y for simplicity. First, note that it suffices to show a bound on the smallest positive γ such that

    E[ e^{(W′)²/γ²} | Y = y ] ≤ 2,

since centering at the mean only affects the sub-Gaussian constant by a constant factor [Ver18].


Then the ratio between the conditional densities when Y = y and Y = ȳ is given by

    f_{W|Y}(w | y) / f_{W|Y}(w | ȳ) = C_{y,ȳ} · f_{W,Y}(w, y) / f_{W,Y}(w, ȳ)
                                    = C_{y,ȳ} · φ_σ(y − g^*(w)) / φ_σ(ȳ − g^*(w))
                                    = C_{y,ȳ} · exp( −( (y − g^*(w))² − (ȳ − g^*(w))² ) / (2σ²) ),

where C_{y,ȳ} is a positive parameter that depends on the pair (y, ȳ) but is independent of the scalar w. This yields the further chain of bounds

    f_{W|Y}(w | y) / f_{W|Y}(w | ȳ) = C_{y,ȳ} · exp( −(y − ȳ)( (y + ȳ)/2 − g^*(w) ) / σ² )
                                    = C′_{y,ȳ} · exp( (y − ȳ) g^*(w) / σ² ),

for a different positive constant C′_{y,ȳ} that is independent of w. Since y ≥ ȳ, the likelihood ratio is non-decreasing in w. Equivalently, the likelihood ratio decreases as the quantity c_{n,δ} − w increases.

Consequently, the random variables (c_{n,δ} − W)² and f_{W|Y}(W | y) / f_{W|Y}(W | ȳ) are negatively correlated, and for each γ > 0, we have

    E[ e^{(W′)²/γ²} | Y = y ] = E[ e^{γ^{−2}(c_{n,δ} − W)²} · f_{W|Y}(W | y)/f_{W|Y}(W | ȳ) | Y = ȳ ]
        ≤ E[ e^{γ^{−2}(c_{n,δ} − W)²} | Y = ȳ ] · E[ f_{W|Y}(W | y)/f_{W|Y}(W | ȳ) | Y = ȳ ]
        = E[ e^{γ^{−2}(c_{n,δ} − W)²} | Y = ȳ ]
        ≤ exp( γ^{−2} ( c_{n,δ} − μ̄(ȳ) )² ) · E[ exp( γ^{−2} ( μ̄(ȳ) − W )² ) | Y = ȳ ],

up to constant adjustments in the exponent. Finally, note that we have ( c_{n,δ} − μ̄(ȳ) )² = c²_{n,δ} ( σ²/(m² + σ²) )² ≤ c²_{n,δ} σ̄². On the other hand, by case 1 of the proof, the expectation term in the last display is bounded by a constant provided γ² ≥ σ̄² Γ̄². By adjusting the constant factors, we can ensure that E[ e^{γ^{−2}(W′)²} | Y = y ] ≤ 2 provided γ² ≥ C σ̄² ( Γ̄² ∨ c²_{n,δ} ), and this completes the proof for the case y ≥ ȳ. An identical argument holds when y ≤ g^*(−c_{n,δ}), and combining the two cases yields the lemma.

Proof of Lemma 5: Recall the notation κ = Pr{Z ∈ I}, so that the density of the random variable W is given by

    f_W(w) = κ^{−1} φ(w) 1{w ∈ I};

we use the shorthand κ(w) := κ^{−1} 1{w ∈ I} for convenience.


Let us begin by deriving the joint density. We have

    f_{W,Y}(w, y) = f_{Y|W}(y | w) f_W(w) = φ_σ( y − g(w) ) φ(w) κ(w)
                  ≤ φ_σ( m ( g^{−1}(y) − w ) ) φ(w) κ(w)        (i)
                  = ( 1/(2πσ) ) exp( −( g^{−1}(y) − w )² / (2σ²/m²) − w²/2 ) κ(w),

where step (i) follows from equation (37). Completing the square and performing some more algebra leads to the relation

    f_{W,Y}(w, y) ≤ ( 1/(2πσ) ) exp( − ( g^{−1}(y) )² / ( 2(1 + σ²/m²) ) ) exp( −(1 + σ²/m²) ( w − g^{−1}(y)/(1 + σ²/m²) )² / (2σ²/m²) ) κ(w)
                  = ( 1/(2πσ) ) exp( −( g^{−1}(y) )² / (2τ̄²) ) exp( −( w − μ̄(y) )² / (2σ̄²) ) κ(w)
                  = ( τ̄ σ̄ / σ ) φ_τ̄( g^{−1}(y) ) φ_σ̄( w − μ̄(y) ) κ(w),

and this proves inequality (38a).

and this proves inequality (38a).

We now turn to establishing bound (38b). We have

fY (y) =

∫IfW,Y (w, y)dw

=

∫φσ(y − g(x)) φ(w)κ(w)dw

(ii)

≥∫φσ(M(g−1(y)− w)) φ(w)κ(w)dw,

where step (ii) once again follows from equation (37). Completing the square similarly to aboveand performing some more algebra yields

fY (y) ≥∫

1

2πσexp

(− (g−1(y))2

2(1 + σ2/M2)

)exp

−(1 + σ2/M2)(w − g−1(y)

1+σ2/M2

)2

2σ2/M2

κ(w)dw

=τ σ

σφτ(g−1(y)

) ∫φσ(w − µ(y)

)κ(w)dw.


The following sequence of relations then completes the proof:

    κ ∫ φ_σ̲( w − μ̲(y) ) κ(w) dw = ∫_{−c_{n,δ}}^{c_{n,δ}} φ_σ̲( w − μ̲(y) ) dw
        ≥ ∫_{−c_{n,δ}}^{μ̲(y)} φ_σ̲( w − μ̲(y) ) dw
        = ∫_0^{μ̲(y) + c_{n,δ}} φ_σ̲(w) dw
        ≥ ∫_0^2 φ(z) dz > 1/3,

where we have used the inequalities 0 ≤ μ̲(y) ≤ c_{n,δ}, c_{n,δ} ≥ 2, and σ̲ ≤ 1.

5.4.3 Proof of Lemma 4

The proofs of both claims in this lemma are based on the following result, which bounds the expected supremum of the associated empirical process. In it, we let ν = (ν_1, . . . , ν_k) denote a sequence of i.i.d. 1-sub-Gaussian random variables that is independent of everything else.

Lemma 6. For each function h^* ∈ H, sequence of samples y_1, . . . , y_k, and scalar ϑ, we have

    E_ν [ sup_{ h ∈ H, ‖h − h^*‖_k ≤ ϑ } | (1/k) Σ_{i=1}^k ν_i · (h − h^*)(y_i) | ] ≤ c_2 √( b ϑ ( log(b/ϑ) ∨ 1 ) / k ).

A variant of this claim can be found, for instance, in van de Geer [Van90], but we provide the proof at the end of this section for completeness. Taking the lemma as given for the moment, let us establish the two bounds. For convenience, we use the shorthand H^* ≡ H_{h^*}.

Proof of claim (32a): We must establish a bound on the (localized) population Rademacher complexity of the function class H, which contains functions that are uniformly bounded by b. It is helpful to work instead with the empirical Rademacher complexity, which, for an abstract function class F, takes the form

    R̂_k(F) := E_η [ sup_{f ∈ F} | (1/k) Σ_{i=1}^k η_i f(y_i) | ].        (39)

Note that we no longer take an expectation over the design points, and so this complexity measure should be viewed as a random variable when the samples y_1, . . . , y_k are random. Recall the norm ‖·‖_k defined in equation (10), and let τ̂_k(h^*; γ) denote the smallest positive solution to the (empirical) critical equality

    τ²/γ = R̂_k( H^* ∩ B(‖·‖_k; τ) )        (40)

for some positive scalar γ. Since the function class H^* is 2b-bounded and star-shaped around 0, a slight modification of Proposition 14.25 of Wainwright (see also the discussion surrounding equations (14.6)–(14.8)) guarantees that there is an absolute constant c such that

    τ²_k(h^*; b) ≤ c · τ̂²_k(h^*; b)

with probability exceeding 1 − exp{ −c_1 k τ̂²_k(h^*; b)/b² }. Consequently, it suffices to bound the (random) quantity τ̂_k; we dedicate the rest of this proof to such a bound. Applying Lemma 6, we are looking for the smallest strictly positive solution to the inequality

    τ²/b ≤ c_2 ( b τ log(b/τ) / k )^{1/2},

and solving with equality yields the bound.

Proof of claim (32b): By definition of the functional μ_k and Lemma 6, we are looking for the smallest strictly positive solution to the inequality

    μ²/γ ≤ c_2 ( b μ log(b/μ) / k )^{1/2},

and solving with equality yields the claimed bound.

Proof of Lemma 6: Since we are interested in bounding the sub-Gaussian complexity over the class of bounded monotone functions, we appeal to arguments based on metric entropy bounds and chaining [VW96]. Let us provide some background for this method, starting with the definition of the covering number of a set in a metric space.

Definition 2 (Covering number). An ε-cover of a set T with respect to a metric ρ is a set {θ¹, θ², . . . , θ^N} ⊂ T such that for each θ ∈ T, there exists some i ∈ [N] such that ρ(θ, θ^i) ≤ ε. The ε-covering number N(ε, T, ρ) is the cardinality of the smallest ε-cover.

The logarithm of the covering number is referred to as the metric entropy of a set. It is well known that the sub-Gaussian complexities of sets can be bounded via their metric entropies, and we employ this approach below. View the samples y_1, . . . , y_k as fixed in our particular problem, and use the shorthand B_n(ϑ; H^*) := { h ∈ H | ‖h − h^*‖_k ≤ ϑ }. Then we have the upper bound in terms of Dudley's entropy integral (see Theorem 5.22 of Wainwright [Wai19]):

    E_ν [ sup_{ h ∈ H, ‖h − h^*‖_k ≤ ϑ } | (1/k) Σ_{i=1}^k ν_i · (h − h^*)(y_i) | ] ≤ (16/√k) ∫_0^ϑ √( log N( t, B_n(ϑ; H^*), ‖·‖_k ) ) dt.

It is a classical fact that we have the bound

    log N( t, B_n(ϑ; H^*), ‖·‖_k ) ≤ (c b / t) max{ log(b/t), 1 }

for some absolute constant c (see, e.g., Example 2.1(i) of van de Geer [Van90]), where b is the uniform bound on functions in H. Substituting this bound and simplifying yields the claim.


5.5 Proof of Theorem 3

The proof of this theorem closely parallels that of Theorem 1 with ξ = 0, and so we only sketch the parts of the proof that are different. In the noiseless case, we may take h^*(y) = g^{−1}(y), where g^{−1} denotes the inverse of g^*. Thus, we have

    W = h^*(Y),

where Y ∼ P_Y, and in each sample, we also observe p − 1 other covariates X_2, . . . , X_p, each drawn from a standard Gaussian that is independent of everything else.

Once again, let α_t = ∠(θ_t, θ^*) and recall the notation ν_t from the proof of Theorem 1. Note that for X ∼ P^I_X, we have

    〈X, θ_t〉 = cos(α_t) W + sin(α_t) X̃,

where X̃ ∼ N(0, 1) is some linear combination of the random variables X_2, . . . , X_p (and therefore independent of W).

The procedure A is given N̄ = N/T samples drawn from the observation model

    〈x_i, θ_t〉 = ν_t · h^*(y_i) + √∆_t · x̃_i    for i ∈ D_{t+1}.

By the star-shaped nature of $\mathcal{H}$ around $0$, we have $\nu_t \cdot h^* \in \mathcal{H}$, so that this is now a non-parametric regression model corrupted by Gaussian noise of variance $\Delta_t$. Setting
\[
u = \frac{c_2}{\mu_{\widetilde{N}}\big(\nu_t h^*, \{y_i\}_{i \in \mathcal{D}_{t+1}}; 2\sqrt{\Delta_t}\big)} \left( \mu^2_{\widetilde{N}}\big(\nu_t h^*, \{y_i\}_{i \in \mathcal{D}_{t+1}}; 2\sqrt{\Delta_t}\big) \vee \frac{\Delta_t}{\widetilde{N}} \right) \log\left(\frac{c_2}{\delta}\right)
\]
in Proposition 2(b), the procedure $\mathcal{A}$ then returns a function $h_{t+1} \in \mathcal{H}$ that satisfies, for each $\delta \in (0, 1)$, the inequality
\[
\frac{1}{\widetilde{N}} \sum_{i \in \mathcal{D}_{t+1}} \big(h_{t+1}(y_i) - \nu_t h^*(y_i)\big)^2 \le c_2 \cdot \left( \mu^2_{\widetilde{N}}\big(\nu_t h^*, \{y_i\}_{i \in \mathcal{D}_{t+1}}; 2\sqrt{\Delta_t}\big) \vee \frac{\Delta_t}{\widetilde{N}} \right) \log\left(\frac{c_2}{\delta}\right)
\]
with probability exceeding $1 - \delta$. It remains to show that the linear regression step, also carried out on the same data set $\mathcal{D}_{t+1}$, satisfies a similar error bound. Toward that end, denote by $P_{t+1}(v)$ the projection of a vector $v$ onto the subspace spanned by the collection $\{x_i\}_{i \in \mathcal{D}_{t+1}}$. Recall that $\widehat{\theta}$ was the solution to the linear regression problem in step 3 of the algorithm, and note that we have $\nu_t h^*(y_i) = \nu_t \langle x_i, \theta^* \rangle$. In the rest of the proof, we abuse notation and use the shorthand $h_{t+1}$ and $h^*$ to denote the $\widetilde{N}$-dimensional vectors formed by stacking the evaluations of the two functions at the sample points. Consequently,

\[
\sum_{i \in \mathcal{D}_{t+1}} \langle x_i, \widehat{\theta} - \nu_t \theta^* \rangle^2 = \|P_{t+1}(h_{t+1}) - P_{t+1}(\nu_t h^*)\|^2 \overset{(i)}{\le} \|h_{t+1} - \nu_t h^*\|^2 \le c_2 \widetilde{N} \cdot \left( \mu^2_{\widetilde{N}}\big(\nu_t h^*, \{y_i\}_{i \in \mathcal{D}_{t+1}}; 2\sqrt{\Delta_t}\big) \vee \frac{\Delta_t}{\widetilde{N}} \right) \log\left(\frac{c_2}{\delta}\right),
\]


where step (i) follows from the non-expansiveness of the projection operation onto a convex set (which in this case is given by a subspace). Finally, it remains to relate the error $\Delta_{t+1}$ to the left-most quantity above. In order to do so, recall the matrix $X_{t+1}$ defined by stacking up the covariates $\{x_i\}_{i \in \mathcal{D}_{t+1}}$ as its rows, and write
\[
\|\widehat{\theta} - \nu_t \theta^*\|^2 \le \lambda^{-1}_{\min}\left(\frac{X_{t+1}^\top X_{t+1}}{\widetilde{N}}\right) \cdot \frac{1}{\widetilde{N}} \sum_{i \in \mathcal{D}_{t+1}} \langle x_i, \widehat{\theta} - \nu_t \theta^* \rangle^2.
\]
By assumption, we have $\widetilde{N} \ge c \max\{p, \log(c/\delta)\}$ for a large enough constant $c$. Thus, applying Lemma 12 in the appendix yields that for a large enough constant $C$, the inequality
\[
\lambda^{-1}_{\min}\left(\frac{X_{t+1}^\top X_{t+1}}{\widetilde{N}}\right) \le C
\]
holds with probability exceeding $1 - \delta$, where we have also used the fact that the quantity $\gamma$ in the lemma is greater than an absolute constant. Putting together the pieces, we thus obtain the bound
\[
\|\widehat{\theta} - \nu_t \theta^*\|^2 \le c_2 \cdot \left( \mu^2_{\widetilde{N}}\big(\nu_t h^*, \{y_i\}_{i \in \mathcal{D}_{t+1}}; 2\sqrt{\Delta_t}\big) \vee \frac{\Delta_t}{\widetilde{N}} \right) \log\left(\frac{c_2}{\delta}\right).
\]
To complete the proof, note that $\nu_t \ge 1/10$ and that the theorem guarantees that the condition
\[
\left( \mu^2_{\widetilde{N}}\big(\nu_t h^*, \{y_i\}_{i \in \mathcal{D}_{t+1}}; 2\sqrt{\Delta_t}\big) \vee \frac{\Delta_t}{\widetilde{N}} \right) \log\left(\frac{c_2}{\delta}\right) \le c_1
\]
holds for a small enough constant $c_1$. Thus, applying Lemma 13 from the appendix, we obtain the final result.

5.6 Proof of Corollary 2

The proof of Corollary 1 already contains most of the main ideas; the only difference here is that we are only interested in a bound on the noise complexity functional $\mu$. In particular, applying Lemma 4 to the function class $\mathcal{H}(c_{n,\delta})$ defined in equation (31), we have
\[
\mu^2_n\big(h^*, \{y_i\}_{i \in \mathcal{D}_{t+1}}; 2\sqrt{\Delta_t}\big) \le c_2 c^2_{n,\delta} \left(\frac{4 \Delta_t \log n}{n}\right)^{2/3}
\]
for each $h^* \in \mathcal{H}(c_{n,\delta})$ and collection of samples $\{y_i\}_{i \in \mathcal{D}_{t+1}}$. Furthermore, we also have $\frac{\Delta_t}{n} \le \left(\frac{\Delta_t}{n}\right)^{2/3}$.

Substituting these expressions into Theorem 3 yields the one-step inequality (21a). In order to obtain inequality (21b), we recursively apply this inequality as before. In particular, set $\delta = n^{-10}$, and note that $c^2_{n,\delta} \sim \log n$ to obtain the simplified one-step guarantee
\[
\Delta_{t+1} \le c_2 \log n \left(\frac{\Delta_t \log n}{n}\right)^{2/3},
\]
which holds for each iteration $t$ on the corresponding event $\mathcal{E}_t$. For an integer value $T_0$ to be determined shortly, condition on the event $\cap_{i=0}^{T_0} \mathcal{E}_i$; by the union bound, this event occurs with probability exceeding $1 - T_0 n^{-10}$.


Now choosing a value $C \log\log n \le T \le T_0$ for a large enough absolute constant $C$ and any $T_0 \le n$, let us apply Lemma 10 in the appendix with the substitutions $C_1 = C_3 = 0$ and $C_2 = c_2 \log n \cdot \log^{2/3} n$, and note that $\gamma = 2/3$ and $\Delta_0 \le 1$ by definition. Then by choosing $C$ large enough, we can ensure that $T$ is large enough to satisfy the condition required by Lemma 10. Consequently, we have
\[
\Delta_T \le c_2 \frac{\log^3 n \cdot \log^2 n}{n^2}. \tag{41}
\]

Substituting the value of n completes the proof.

5.7 Proof of Proposition 3

Our proof is constructive: given a function-parameter pair $(g^*, \theta^*)$ defining a SIM, we show that the noiseless covariate-response pairs $\{x_i, y_i\}_{i=1}^n$ generated from this model can also be perfectly fit by a second function-parameter pair $(g, \theta)$, such that $\|\theta^* - \theta\|^2$ obeys the claimed lower bound. We require the following technical lemmas for the proof; the proofs of these lemmas are postponed to Sections 5.7.1 and 5.7.2, respectively. In stating both lemmas, we assume that $\theta^*$ is some fixed unit norm vector in $\mathbb{R}^p$ and that $x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_p)$. Also, define the $(p-1)$-dimensional subspace $K^\perp_{\theta^*} := \{v \in \mathbb{R}^p : \langle v, \theta^* \rangle = 0\}$, and let $\mathcal{S}_k \subseteq 2^{[n]}$ denote the family of all subsets of $[n]$ that have size less than or equal to $k$.

Lemma 7. For each $k < p < n/2$, there is an event that occurs with probability greater than $1 - k\binom{n}{k}^{-1}$ such that on this event, we have the following. For each subset $S \in \mathcal{S}_k$, there exists a unit norm vector $v_S \in K^\perp_{\theta^*}$ such that
\[
\langle x_i, v_S \rangle = 0 \quad \text{for all } i \in S, \quad \text{and} \quad |\langle x_i, v_S \rangle| \le \sqrt{10 k \log(n/k)} \quad \text{for all } i \notin S.
\]

Lemma 8. For each fixed $\delta > 0$, we have
\[
\#\left\{ i : \min_{j \ne i} |\langle \theta^*, x_i \rangle - \langle \theta^*, x_j \rangle| \le \delta \right\} \le \delta n^2 \log n
\]
with probability greater than $1 - \frac{2e}{\log^2 n}$.

Taking these lemmas as given, let us now establish the proposition. Let $p \ge 2$, set $\delta = \frac{p}{2 n^2 \log n}$, and let
\[
S_\delta = \left\{ i : \min_{j \ne i} |\langle \theta^*, x_i \rangle - \langle \theta^*, x_j \rangle| \le \delta \right\}
\]
represent the indices of samples that have poor spacing, in that there is an adjacent point at distance less than $\delta$. By Lemma 8, there is an event $\mathcal{E}_1$ occurring with probability greater than $1 - \frac{2e}{\log^2 n}$ on which we have $|S_\delta| \le p/2$. Lemma 8 also clearly guarantees (by setting $\delta = (n \log n)^{-2}$) that there is an event $\mathcal{E}_1'$ occurring with probability greater than $1 - \frac{2e}{\log^2 n}$ on which we have
\[
\min_{i \ne j} |\langle \theta^*, x_i \rangle - \langle \theta^*, x_j \rangle| \ge (n \log n)^{-2},
\]


so that the minimum spacing is of this order.

Furthermore, by Lemma 7, there is another event $\mathcal{E}_2$ occurring with probability exceeding $1 - k\binom{n}{k}^{-1}$ on which there exists a unit vector $v_{S_\delta}$ such that
\[
\langle x_i, v_{S_\delta} \rangle = 0 \quad \text{for all } i \in S_\delta, \quad \text{and} \quad |\langle x_i, v_{S_\delta} \rangle| \lesssim \sqrt{p \log(n/p)} \quad \text{for all } i \notin S_\delta.
\]
Finally, there is an event $\mathcal{E}_3$ occurring with probability exceeding $1 - n^{-10}$ on which $\max_{i \in [n]} |\langle x_i, \theta^* \rangle| \le \sqrt{5 \log n}$. Condition on the event $\mathcal{E}_1 \cap \mathcal{E}_1' \cap \mathcal{E}_2 \cap \mathcal{E}_3$ for the rest of the proof, and note that operationally, we have established the existence of a unit vector $v_{S_\delta}$ that allows us to perturb the well-spaced points $\{\langle \theta^*, x_i \rangle\}_{i \in S_\delta^c}$ in a controlled fashion while leaving the poorly-spaced points in $S_\delta$ unchanged. Set $g^*(x) = \frac{m+M}{2} x$ to be a linear function.

Let us now construct the pair $(g, \theta)$. For a small enough positive constant $C'_{m,M}$ chosen such that $\min\{C'_{m,M}, (C'_{m,M})^2\} \le \left(\frac{M-m}{4m}\right)$, define the scalar $r := \frac{C'_{m,M} \sqrt{p}}{n^2 \log^2 n} \le 1$, and let $\Delta = r \cdot v_{S_\delta}$. Let
\[
\theta = \frac{\Delta + \theta^*}{\|\Delta + \theta^*\|} = \frac{\Delta + \theta^*}{\sqrt{1 + r^2}}.
\]

By construction, for each $i \in S_\delta$, we have
\begin{align}
|\langle \theta, x_i \rangle - \langle \theta^*, x_i \rangle| = \left(1 - \frac{1}{\sqrt{1+r^2}}\right) |\langle \theta^*, x_i \rangle| &\overset{(i)}{\le} \left(1 - \frac{1}{\sqrt{1+r^2}}\right) \sqrt{5 \log n} \notag \\
&\overset{(ii)}{\lesssim} r^2 \sqrt{5 \log n} \notag \\
&\lesssim (C'_{m,M})^2 \frac{p}{n^4 \log^3 n} \tag{42} \\
&\lesssim \frac{(C'_{m,M})^2}{10} \frac{1}{(n \log n)^2}, \tag{43}
\end{align}
where step (i) is a result of conditioning on $\mathcal{E}_3$, and step (ii) follows from the Taylor approximation of the function $f(w) = \sqrt{1 + w^2}$ around the point $w = 0$, and the bound $r^2 \le 1$.

On the other hand, for each $i \in S_\delta^c$, we have
\begin{align}
|\langle \theta, x_i \rangle - \langle \theta^*, x_i \rangle| &\le \left(1 - \frac{1}{\sqrt{1+r^2}}\right) |\langle \theta^*, x_i \rangle| + \frac{1}{\sqrt{1+r^2}} |\langle \Delta, x_i \rangle| \notag \\
&\overset{(iii)}{\lesssim} r^2 \sqrt{5 \log n} + r \sqrt{p \log(n/p)} \notag \\
&\lesssim C'_{m,M} \frac{p}{4 n^2 \log n}, \tag{44}
\end{align}
where step (iii) follows from conditioning on $\mathcal{E}_3$ and the choice of the vector $v_{S_\delta}$ according to Lemma 7.

Now note that since we have conditioned on the event $\mathcal{E}_1'$, all points in $S_\delta$ have spacing at least $(n \log n)^{-2}$, and by definition, all points in $S_\delta^c$ have spacing at least $\frac{p}{2 n^2 \log n}$. Putting together the pieces, we have shown that
\[
\langle \theta^*, x_i \rangle \le \langle \theta^*, x_j \rangle \implies \langle \theta, x_i \rangle \le \langle \theta, x_j \rangle
\]


for all pairs $i, j \in [n]$. Now define a function $g$ on the points $\{\langle \theta, x_i \rangle\}_{i=1}^n$ by setting
\[
g(\langle \theta, x_i \rangle) = g^*(\langle \theta^*, x_i \rangle) = y_i,
\]
and by interpolating linear functions between adjacent points. This function can also be trivially extended to the entire domain $\mathbb{R}$ by simply extending the lines that define its left and right extremes. Such a function is, by definition, monotone; so it only remains to verify that $g$ satisfies Assumption 5. Since $g^*$ was the linear function with slope $(M + m)/2$, the slope $\tau_i$ of the linear interpolant (in the definition of $g$) between the (ordered) points $i$ and $i+1$ must satisfy
\[
\frac{(m+M) s_i}{2(s_i + \psi_i)} \le \tau_i \le \frac{(m+M) s_i}{2(s_i - \psi_i)}, \tag{45}
\]
where $s_i$ denotes the spacing between these points, and $\psi_i$ the magnitude of the perturbation, which we bounded in equations (43) and (44). Finally, these two equations and the inequality $\min\{C'_{m,M}, (C'_{m,M})^2\} \le \left(\frac{M-m}{4m}\right)$ yield that the ratio is bounded as
\[
\frac{\psi_i}{s_i} \le \left(\frac{M - m}{2m}\right),
\]
and substituting into equation (45) yields the bound $m \le \tau_i \le M$, thereby verifying that $g$ satisfies Assumption 5.

We have thus shown the existence of two distinct pairs $(g^*, \theta^*)$ and $(g, \theta)$ that interpolate our observations exactly, and with $\|\theta^* - \theta\|^2 \ge r^2/2 \gtrsim (C'_{m,M})^2 \frac{p}{n^4 \log^4 n}$. This completes the proof.

It remains to prove Lemmas 7 and 8.

5.7.1 Proof of Lemma 7

Let $k$ be some fixed integer obeying $k < p < n/2$. Given $n$ vectors $\{x_i\}_{i=1}^n$ sampled i.i.d. from a standard $p$-dimensional Gaussian distribution, consider some fixed subset $S \subseteq [n]$ of size $s < p$. There exists a subspace of dimension $p - s - 1$ such that for any vector $v$ in this subspace, we have $\langle v, x_i \rangle = 0$ for all $i \in S$ and $\langle v, \theta^* \rangle = 0$. Choose a fixed unit vector $v_S$ in this subspace. Since the points $\{x_i\}_{i \in S^c}$ are chosen independently of the points in $S$, we have
\[
\langle v_S, x_i \rangle \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1) \quad \text{for all } i \in S^c,
\]
so that for all $t \ge 1$, we have
\[
\Pr\left\{ \max_{i \in S^c} |\langle v_S, x_i \rangle|^2 \ge t \right\} \le (n - s) e^{-ct}.
\]
Choosing $t = 10 k \log(n/k)$ yields the bound
\[
\Pr\left\{ \max_{i \in S^c} |\langle v_S, x_i \rangle| \ge \sqrt{10 k \log(n/k)} \right\} \le \binom{n}{k}^{-2}.
\]

Since this bound was shown for each fixed subset, taking a union bound over all
\[
|\mathcal{S}_k| = \sum_{s \le k} \binom{n}{s} \overset{(iv)}{\le} k \binom{n}{k}
\]
subsets proves the lemma; here, we have used the inequality $k \le n/2$ in step (iv).
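The construction in this proof is easy to visualize numerically. The snippet below is a minimal sketch (not part of the argument; the sample sizes, subset, and seed are arbitrary illustrative choices): it builds such a vector $v_S$ for one fixed subset $S$ by taking a unit vector in the null space of the matrix stacking $\theta^*$ and $\{x_i\}_{i \in S}$, and then checks the two displayed properties.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 50, 10                       # sample size, dimension, subset size (k < p < n/2)
theta_star = np.eye(p)[0]                   # a fixed unit-norm direction
X = rng.standard_normal((n, p))             # rows x_i drawn i.i.d. from N(0, I_p)

S = np.arange(k)                            # one fixed subset of size k
A = np.vstack([theta_star, X[S]])           # constraints <v, theta*> = 0 and <v, x_i> = 0 for i in S

# v_S: a unit vector in the null space of A (the null space has dimension at least p - k - 1 > 0)
_, _, Vt = np.linalg.svd(A)
v_S = Vt[-1]                                # a right singular vector orthogonal to every row of A

print("max |<x_i, v_S>| over i in S:    ", np.abs(X[S] @ v_S).max())   # ~ 0 up to round-off
print("max |<x_i, v_S>| over i not in S:", np.abs(np.delete(X, S, axis=0) @ v_S).max())
print("threshold sqrt(10 k log(n/k)):   ", np.sqrt(10 * k * np.log(n / k)))
```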


5.7.2 Proof of Lemma 8

Fix a positive scalar $\delta$. For each $i \in [n]$, define the indicator random variables
\[
\zeta_i = \mathbf{1}\left\{ \min_{j \ne i} |\langle \theta^*, x_i \rangle - \langle \theta^*, x_j \rangle| \le \delta \right\}.
\]
We are interested in showing a high probability bound on the quantity $\sum_{i=1}^n \zeta_i$; we do so by Markov's inequality.

Computing the expectation of $\zeta_i$, we have
\[
\mathbb{E}[\zeta_i] = \Pr\left\{ \min_{j \ne i} |\langle \theta^*, x_i \rangle - \langle \theta^*, x_j \rangle| \le \delta \right\} \overset{(i)}{\le} n \Pr\{|\langle \theta^*, x_1 \rangle - \langle \theta^*, x_2 \rangle| \le \delta\} \overset{(ii)}{\le} n e \delta,
\]
where step (i) follows by symmetry and the union bound, and step (ii) from the standard $\chi^2$ tail bound
\[
\Pr\{\chi^2 \le t\} \le e \sqrt{t} \quad \text{for all } t \ge 0.
\]
Putting together the pieces, we have shown that
\[
\mathbb{E}\left[\sum_{i=1}^n \zeta_i\right] \le n^2 e \delta;
\]
finally, applying Markov's inequality yields the required result.
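As a quick numerical sanity check of the lemma (purely illustrative, and not used anywhere in the argument; the values of $n$ and $\delta$ below are arbitrary), one can simulate the projections $\langle \theta^*, x_i \rangle$, which are i.i.d. standard Gaussians, and count the poorly-spaced indices directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta = 2000, 1e-4
w = np.sort(rng.standard_normal(n))          # sorted projections <theta*, x_i> ~ N(0, 1)

# For sorted points, the nearest neighbour of each point is one of its two adjacent points.
gaps = np.diff(w)
left = np.concatenate(([np.inf], gaps))      # gap to the left neighbour
right = np.concatenate((gaps, [np.inf]))     # gap to the right neighbour
count = np.sum(np.minimum(left, right) <= delta)

print("poorly spaced indices:     ", count)
print("bound delta * n^2 * log(n):", delta * n ** 2 * np.log(n))
```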

5.8 Proof of Proposition 1

Recall that we work with the phase retrieval model given by
\[
y_i = |\langle x_i, \theta^* \rangle| + \epsilon_i, \quad \text{for } i = 1, 2, \ldots, n.
\]
As a result of the labeling step, we select all samples $i$ such that $|\langle x_i, \theta_0 \rangle| \ge \lambda$. Denote the set of samples by $S$ and the random number of samples by $N$. Note that we have $\mathbb{E}[N] = n/c_1(\lambda)$, and
\[
\Pr\left\{ |N - \mathbb{E}[N]| \ge \frac{\mathbb{E}[N]}{2} \right\} \le 2 \exp\left(-\frac{n}{4 c_1^2(\lambda)}\right) \overset{(i)}{\le} \frac{\delta}{4}, \tag{46}
\]
where step (i) holds provided $n \ge C \cdot c_1^2(\lambda) \log\left(\frac{C}{\delta}\right)$ for a large enough constant $C$. Condition on $N$ for the remainder of the proof. Let $P_\lambda$ denote the Gaussian distribution truncated to the region $\{x : |\langle x, \theta_0 \rangle| \ge \lambda\}$. A simple calculation reveals that for a random variable $\xi \sim P_\lambda$, we have
\[
\mathbb{E}[\xi] = 0 \quad \text{and} \quad \Sigma_\lambda := \mathbb{E}[\xi \xi^\top] = \mathbb{E}[Z Z^\top \mid |\langle Z, \theta_0 \rangle| \ge \lambda] = I + c_2(\lambda) \theta_0 \theta_0^\top,
\]
where $Z$ denotes a standard Gaussian variate and the final step follows from the rotational invariance of the Gaussian coupled with the definition $c_2(\lambda) := \mathbb{E}_{Z \sim \mathcal{N}(0,1)}[Z^2 \mid |Z| \ge \lambda]$. Furthermore, the following lemma bounds the singular values of a random matrix whose rows are drawn from the distribution $P_\lambda$. We prove this lemma at the end of this section.


Lemma 9. Suppose that $\xi_1, \ldots, \xi_m \in \mathbb{R}^p$ are $m$ random vectors drawn i.i.d. from the distribution $P_\lambda$, and let $\widehat{\Sigma}_m = \frac{1}{m} \sum_{i=1}^m \xi_i \xi_i^\top$. Then, we have
\[
|||\widehat{\Sigma}_m - \Sigma_\lambda|||_{\mathrm{op}} \le c_1 (1 + \lambda^2) \left( \sqrt{\frac{p}{m}} + \frac{p}{m} + \delta \right)
\]
with probability exceeding $1 - 2 \exp(-c m \min\{\delta, \delta^2\})$.

It will also be helpful to define, for each $n$, the event
\[
\mathcal{E}_n = \left\{ \max_{i \in [n]} |\langle x_i, \theta_0 - \theta^* \rangle| \le \frac{1}{5} \right\}.
\]
Since $\|\theta_0 - \theta^*\| \le \left(50 \log\left(\frac{8n}{\delta}\right)\right)^{-1/2}$, standard bounds on the maxima of Gaussian random variables ensure that we have $\Pr\{\mathcal{E}_n^c\} \le \frac{\delta}{4}$. On the event $\mathcal{E}_n$, we have
\[
\langle x_i, \theta_0 \rangle - \frac{1}{5} \le \langle x_i, \theta^* \rangle \le \langle x_i, \theta_0 \rangle + \frac{1}{5},
\]
so that the further relation $\lambda \ge 1/5$ implies that
\[
\mathrm{sgn}(\langle x_i, \theta^* \rangle) = \mathrm{sgn}(\langle x_i, \theta_0 \rangle) \quad \text{for all } i \in S. \tag{47}
\]

With these definitions in place, we are now in a position to establish the proposition. First, note that the matrix $X$ has $N$ rows drawn from the distribution $P_\lambda$. Let us apply Lemma 9 to these samples; provided $N \ge C \max\{(1 + \lambda^2)^2 p, \log(C/\delta)\}$, we have
\[
|||\widehat{\Sigma}_N - \Sigma_\lambda|||_{\mathrm{op}} \le \frac{1}{2}
\]
with probability exceeding $1 - \delta/4$. Call this event $\mathcal{E}$. Note that $X^\top X = N \widehat{\Sigma}_N$. Putting together the pieces with the definition of the matrix $\Sigma_\lambda$, we have
\[
\mathrm{tr}\big((X^\top X)^{-1}\big) \le \frac{2}{N} \cdot p, \qquad |||X^\dagger|||_F \, |||X^\dagger|||_{\mathrm{op}} \le \frac{2}{N} \cdot \sqrt{p}, \quad \text{and} \quad |||X^\dagger|||_{\mathrm{op}}^2 \le \frac{2}{N}, \tag{48}
\]
where we have used the fact that the matrix $\Sigma_\lambda$ has $p - 1$ eigenvalues that are unity, and one eigenvalue that is $1 + c_2(\lambda) > 1$.

The event $\mathcal{E} \cap \mathcal{E}_n$ occurs with probability exceeding $1 - \frac{\delta}{2}$. On this event, we have $y = X\theta^* + \epsilon$ entry-wise, where $\epsilon \in \mathbb{R}^N$ represents the noise variables on the samples indexed by $S$. The linear regression estimate can then be explicitly written as
\[
\widehat{\theta}_{\mathrm{LTI}}(\lambda) = X^\dagger y = X^\dagger X \theta^* + X^\dagger \epsilon = \theta^* + X^\dagger \epsilon,
\]


so that in order to bound the error $\|\widehat{\theta}_{\mathrm{LTI}}(\lambda) - \theta^*\|^2$, it suffices to bound the quantity $\|X^\dagger \epsilon\|^2$. Crucially, since the labeling procedure is independent of the noise variables $\epsilon$, we have $X \perp\!\!\!\perp \epsilon$, and furthermore, the noise $\epsilon$ is entry-wise $\sigma$-sub-Gaussian. Applying the Hanson-Wright inequality (see, e.g., the papers [RV13; HKZ12]) yields the bound
\[
\Pr\left\{ \frac{\|X^\dagger \epsilon\|^2}{\sigma^2} \ge \mathrm{tr}\big((X^\top X)^{-1}\big) + |||X^\dagger|||_F \, |||X^\dagger|||_{\mathrm{op}} \sqrt{t} + |||X^\dagger|||_{\mathrm{op}}^2\, t \right\} \le e^{-t} \quad \text{for each } t \ge 0.
\]

Substituting bounds on the various norms from equation (48), we have
\[
\Pr\left\{ \frac{\|X^\dagger \epsilon\|^2}{\sigma^2} \ge \frac{2p}{N} + \frac{2}{N}\sqrt{pt} + \frac{2}{N} t \right\} \le e^{-t}.
\]
Using the AM-GM inequality on the second term then yields
\[
\Pr\left\{ \frac{\|X^\dagger \epsilon\|^2}{\sigma^2} \ge \frac{3p}{N} + \frac{3t}{N} \right\} \le e^{-t},
\]
so that the choice $t = \log(2/\delta)$ yields the required result in conjunction with a union bound.
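For intuition, the following is a minimal simulation of the estimator analyzed in this proof (a sketch only: the constants, the noise level, and the way $\theta_0$ is generated below are arbitrary illustrative choices; the sign of $\langle x_i, \theta_0 \rangle$ is used as a surrogate for $\mathrm{sgn}(\langle x_i, \theta^* \rangle)$, as justified by equation (47) on the events of the proof).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma = 4000, 20, 0.5, 0.1
theta_star = np.eye(p)[0]
theta0 = theta_star + 0.05 * rng.standard_normal(p)            # a crude initial estimate
theta0 /= np.linalg.norm(theta0)

X = rng.standard_normal((n, p))
y = np.abs(X @ theta_star) + sigma * rng.standard_normal(n)    # y_i = |<x_i, theta*>| + eps_i

# Labeling step: keep samples with |<x_i, theta0>| >= lambda; for these, sgn(<x_i, theta0>)
# is used as a proxy for sgn(<x_i, theta*>), so the problem becomes linear regression.
keep = np.abs(X @ theta0) >= lam
signs = np.sign(X[keep] @ theta0)
X_lab = signs[:, None] * X[keep]                               # sign-corrected covariates
y_lab = y[keep]                                                # now y_lab ~ X_lab @ theta* + noise

theta_hat, *_ = np.linalg.lstsq(X_lab, y_lab, rcond=None)      # the estimate X^dagger y
print("||theta_hat - theta*||:", np.linalg.norm(theta_hat - theta_star))
```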

Remark 7. As is clear from the proof, our assumption of Gaussian covariates was only used in Lemma 9 and in order to bound the maxima of i.i.d. collections of random variables. The second property clearly holds more generally for sub-Gaussian covariates, and the first property also holds for random variables satisfying a certain small-ball condition. See the papers [DR17; GPG+19] for weakenings of this type for related problems.

5.8.1 Proof of Lemma 9

Let $\xi$ denote a random variable drawn according to the distribution $P_\lambda$. By rotation invariance of the Gaussian distribution, we may assume without loss of generality that the first entry of $\xi$ is drawn from the (two-sided) truncated univariate Gaussian having density
\[
f_1(x) = \begin{cases} c_1(\lambda) \cdot \phi(x) & \text{if } |x| \ge \lambda, \\ 0 & \text{otherwise,} \end{cases}
\]
and that the remaining $p - 1$ entries are i.i.d. standard normal, independent of the first entry.

We begin with a useful reduction argument. Define the vector of independent Rademacher random variables $\eta = (\eta_1, \ldots, \eta_n)$ via the relation $\eta_i = \mathrm{sgn}(\xi_{i,1})$, and the modified covariates $\widetilde{\xi}_i = \eta_i \xi_i$. Note that we have $\widetilde{\xi}_i \widetilde{\xi}_i^\top = \xi_i \xi_i^\top$ point-wise, which implies in particular that
\[
\mathbb{E}[\widetilde{\xi}_i \widetilde{\xi}_i^\top] = \Sigma_\lambda \quad \text{for each } i \in [n].
\]
Moreover, the random variables $\widetilde{\xi}_1, \ldots, \widetilde{\xi}_n$ are drawn i.i.d. from a distribution $\widetilde{P}_\lambda$ supported on $\mathbb{R}^p$ such that its first entry is a (one-sided) truncated Gaussian with density $f_2(x) = 2 f_1(|x|)$, and the remaining $p - 1$ entries are i.i.d. standard Gaussians independent of the first entry.

We now claim that the distribution $\widetilde{P}_\lambda$ is sub-Gaussian with $\psi_2$ constant (see, e.g., Vershynin [Ver10]) bounded by $c_3 + 2\lambda^2$ for an absolute constant $c_3$. Taking this claim as


given for the moment, the proof of the lemma is immediate, since applying well-known results (e.g., Remark 5.40 of Vershynin [Ver10] or Theorem 6.5 of Wainwright [Wai19]) yields the bound
\[
\Pr\left\{ |||\widehat{\Sigma}_n - \Sigma_\lambda|||_{\mathrm{op}} \ge c_2 (c_3 + 2\lambda^2) \left( \sqrt{\frac{p}{n}} + \frac{p}{n} + \delta \right) \right\} \le 2 \exp\left(-c_1 n \min\{\delta, \delta^2\}\right),
\]
where $c_1$ and $c_2$ are absolute constants. This completes the proof of the lemma; it remains to establish the claimed sub-Gaussianity. First, let $\mu_\lambda \in \mathbb{R}^p$ denote the mean of the distribution $\widetilde{P}_\lambda$; its $i$-th entry is given by
\[
\mu_{\lambda,i} = \begin{cases} \frac{1}{\sqrt{2\pi}} \cdot c_1(\lambda) e^{-\lambda^2/2} & \text{if } i = 1, \\ 0 & \text{otherwise.} \end{cases}
\]
Here, we have performed an explicit computation in the case $i = 1$ by noting that
\[
\int_\lambda^\infty z \phi(z)\, dz = \frac{1}{\sqrt{2\pi}} \int_\lambda^\infty z e^{-z^2/2}\, dz = \frac{1}{\sqrt{2\pi}} e^{-\lambda^2/2}.
\]
Applying standard bounds on the Gaussian tail probability, we have $\|\mu_\lambda\| \le \sqrt{2}\lambda$. Furthermore, the random variable $\widetilde{\xi} - \mu_\lambda$ is zero-mean by definition, and given by a Gaussian truncated to a convex set. Hence, it is strongly log-concave, with sub-Gaussian parameter bounded by an absolute constant $c$ (see, e.g., Ledoux, or Mao et al. [MPW18] for a simple proof in one dimension). Combining the boundedness of the mean with the sub-Gaussianity of the centered random variable completes the proof.

6 Discussion

Our work provides a general-purpose method by which the bias in performing parameter estimation under the class of single-index models can be significantly reduced; crucially, this involves leveraging properties of the function class to which the non-parametric link function belongs. Our approach should be viewed as reduction-based: given an appropriate labeling oracle, we are able to reduce the problem to performing non-parametric regression over a suitably defined inverse function class. Our analysis is black-box and also reduction-based, in that it allows any non-parametric function estimator, and derives a final rate of parameter estimation depending on the rate of the non-parametric estimator. We particularized this framework to the case where the non-parametric function estimator was given by least squares, or empirical risk minimization.

To illustrate this general framework, we derived end-to-end parameter estimation guarantees for a sub-class of monotone single-index models, improving upon the rates of classical semi-parametric estimators. Owing to the reduction in bias, this improvement is particularly stark as the noise level $\sigma \to 0$. In particular, when the model is noiseless, we showed a sharpened rate for the problem using a slightly different analysis method adapted to a natural variant of the procedure. In addition, we showed an information-theoretic identifiability limit for the problem of parameter estimation in monotone SIMs.

The generality of our framework raises many interesting questions. For instance, are there other classes of SIMs for which a labeling oracle is implementable in a computationally efficient manner? In a short companion paper [PF19], we show that such an oracle can indeed be implemented for smooth convex SIMs, but doing so requires us to weaken our assumptions somewhat, and significant additional technical effort. For general single-index models, our work effectively reduces the problem of parameter estimation to an implementation of a labeling oracle, and shows how one might derive a concrete rate in the presence of such an oracle.

Another important assumption that was made in our paper was that of Gaussian covariates. Strictly speaking, this assumption can be weakened slightly provided the noise in the "nuisance" directions (corresponding to directions of covariate space that are orthogonal to $\langle X, \theta^* \rangle$) is well-behaved under conditioning. A rigorous extension to this class of covariates is an interesting open problem, and is likely to significantly broaden the scope of our results. It is worth noting that the score function trick, which has been employed fruitfully along with other algorithms for estimation in index models and their relatives [BB18; SJA16], is likely to once again be useful in extending our procedures to arbitrary covariate distributions. In particular, since our procedure may be viewed as a more refined method of performing slicing, it would be interesting to see if the results of Babichev and Bach [BB18] are useful in extending it to general covariate distributions.

Finally, there is the question, regarded as widely important in the statistical signal processing literature, of how these approaches should be modified when the true parameter satisfies $\theta^* \in K$ for some (typically convex) set $K \subseteq \mathbb{R}^p$. Is it sufficient to perform the linear regression step in our algorithms under this additional restriction? What are the rates achieved by such a procedure in the high signal regime?

Acknowledgements

Part of this work was performed when AP was visiting Amazon NYC. AP also acknowledges support from NSF CCF-1704967. We thank Lee Dicker, Venkat Anantharam, Bodhisattva Sen, and Adityanand Guntuboyina for helpful discussions. Thanks also to Vidya Muthukumar for providing helpful comments that improved the presentation of the manuscript.

References

[AW17] S. Athey and S. Wager. “Efficient policy learning”. In: arXiv preprint arXiv:1702.02896(2017).

[BB18] D. Babichev and F. Bach. “Slice inverse regression with score functions”. In: ElectronicJournal of Statistics 12.1 (2018), pp. 1507–1543.

[BBB+72] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical infer-ence under order restrictions. The theory and application of isotonic regression. WileySeries in Probability and Mathematical Statistics. John Wiley & Sons, London-NewYork-Sydney, 1972, pp. xii+388.

[BDJ16] F. Balabdaoui, C. Durot, and H. Jankowski. “Least squares estimation in the mono-tone single index model”. In: arXiv preprint arXiv:1610.06026 (2016).

[BKR+93] P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and adaptiveestimation for semiparametric models. Johns Hopkins University Press Baltimore,1993.

[BM02] P. L. Bartlett and S. Mendelson. “Rademacher and Gaussian complexities: Riskbounds and structural results”. In: Journal of Machine Learning Research 3.Nov(2002), pp. 463–482.


[Bri83] D. R. Brillinger. “A generalized linear model with “Gaussian” regressor variables”.In: A Festschrift for Erich L. Lehmann. Wadsworth Statist./Probab. Ser. Wadsworth,Belmont, CA, 1983, pp. 97–114.

[CC15] Y. Chen and E. Candes. “Solving random quadratic systems of equations is nearlyas easy as solving linear systems”. In: Advances in Neural Information ProcessingSystems. 2015, pp. 739–747.

[CCD+18] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, andJ. Robins. Double/debiased machine learning for treatment and structural parameters.2018.

[CCF+19] Y. Chen, Y. Chi, J. Fan, C. Ma, and Y. Yan. “Noisy matrix completion: Understandingstatistical guarantees for convex relaxation via nonconvex optimization”. In: arXivpreprint arXiv:1902.07698 (2019).

[CEI+16] V. Chernozhukov, J. C. Escanciano, H. Ichimura, W. K. Newey, and J. M. Robins. “Lo-cally robust semiparametric estimation”. In: arXiv preprint arXiv:1608.00033 (2016).

[CGS+17] V. Chernozhukov, M. Goldman, V. Semenova, and M. Taddy. “Orthogonal machinelearning for demand estimation: High dimensional causal inference in dynamic panels”.In: arXiv preprint arXiv:1712.09988 (2017).

[CLM16] T. T. Cai, X. Li, and Z. Ma. “Optimal rates of convergence for noisy sparse phaseretrieval via thresholded Wirtinger flow”. In: The Annals of Statistics 44.5 (2016),pp. 2221–2251.

[CLS15] E. J. Candes, X. Li, and M. Soltanolkotabi. “Phase retrieval via Wirtinger flow: Theoryand algorithms”. In: IEEE Transactions on Information Theory 61.4 (2015), pp. 1985–2007.

[CMM10] C. Cortes, Y. Mansour, and M. Mohri. “Learning bounds for importance weighting”.In: Advances in neural information processing systems. 2010, pp. 442–450.

[CNR18] V. Chernozhukov, W. Newey, and J. Robins. “Double/de-biased machine learningusing regularized Riesz representers”. In: arXiv preprint arXiv:1802.08667 (2018).

[CNS+18] V. Chernozhukov, D. Nekipelov, V. Semenova, and V. Syrgkanis. “Plug-in regularizedestimation of high-dimensional parameters in nonlinear semiparametric models”. In:arXiv preprint arXiv:1806.04823 (2018).

[CS16] Y. Chen and R. J. Samworth. “Generalized additive and index models with shapeconstraints”. In: Journal of the Royal Statistical Society: Series B (Statistical Method-ology) 78.4 (2016), pp. 729–754.

[DG03] S. Dasgupta and A. Gupta. “An elementary proof of a theorem of Johnson and Lin-denstrauss”. In: Random Structures & Algorithms 22.1 (2003), pp. 60–65.

[DH18] R. Dudeja and D. Hsu. “Learning Single-Index Models in Gaussian Space”. In: Con-ference On Learning Theory. 2018, pp. 1887–1930.

[DJS08] A. S. Dalalyan, A. Juditsky, and V. Spokoiny. “A new algorithm for estimating theeffective dimension-reduction subspace”. In: Journal of Machine Learning Research9.Aug (2008), pp. 1647–1678.


[DR17] J. C. Duchi and F. Ruan. “Solving (most) of a set of quadratic equalities: Compositeoptimization for robust phase retrieval”. In: arXiv preprint arXiv:1705.02356 (2017).

[FS19] D. J. Foster and V. Syrgkanis. “Orthogonal statistical learning”. In: arXiv preprintarXiv:1901.09036 (2019).

[Gee88] S. A. van de Geer. Regression analysis and empirical processes. Vol. 45. CWI Tract.Stichting Mathematisch Centrum, Centrum voor Wiskunde en Informatica, Amster-dam, 1988, pp. vi+161. isbn: 90-6196-330-3.

[GH19] P. Groeneboom and K. Hendrickx. “Estimation in monotone single-index models”.In: Statistica Neerlandica 73.1 (2019), pp. 78–99.

[GJW01] P. Groeneboom, G. Jongbloed, and J. A. Wellner. “Estimation of a convex function:characterizations and asymptotic theory”. In: The Annals of Statistics 29.6 (2001),pp. 1653–1698.

[GPG+19] A. Ghosh, A. Pananjady, A. Guntuboyina, and K. Ramchandran. “Max-Affine Re-gression: Provable, Tractable, and Near-Optimal Statistical Estimation”. In: arXivpreprint arXiv:1906.09255 (2019).

[Gra49] H. Grad. “Note on N-dimensional hermite polynomials”. In: Communications on Pureand Applied Mathematics 2.4 (1949), pp. 325–330.

[GRW+15] R. Ganti, N. Rao, R. M. Willett, and R. Nowak. “Learning single index models inhigh dimensions”. In: arXiv preprint arXiv:1506.08910 (2015).

[GS18] T. Goldstein and C. Studer. “PhaseMax: Convex phase retrieval via basis pursuit”.In: IEEE Transactions on Information Theory (2018).

[GW84] S. J. Grotzinger and C. Witzgall. “Projections onto order simplexes”. In: Appl. Math.Optim. 12.3 (1984), pp. 247–270. issn: 0095-4616. doi: 10.1007/BF01449044. url:https://doi.org/10.1007/BF01449044.

[GZ84] E. Gine and J. Zinn. “Some limit theorems for empirical processes”. In: The Annalsof Probability 12.4 (1984), pp. 929–989.

[HJP+01] M. Hristache, A. Juditsky, J. Polzehl, and V. Spokoiny. “Structure adaptive approachfor dimension reduction”. In: The Annals of Statistics 29.6 (2001), pp. 1537–1566.

[HKZ12] D. Hsu, S. Kakade, and T. Zhang. “A tail inequality for quadratic forms of subgaussianrandom vectors”. In: Electronic Communications in Probability 17 (2012).

[Hor09] J. L. Horowitz. Semiparametric and nonparametric methods in econometrics. Vol. 12.Springer, 2009.

[KKS+11] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. “Efficient learning of general-ized linear and single index models with isotonic regression”. In: Advances in NeuralInformation Processing Systems. 2011, pp. 927–935.

[Kol11] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Re-covery Problems. Vol. 2033. Springer Science & Business Media, 2011.

[Kos08] M. R. Kosorok. Introduction to empirical processes and semiparametric inference.Springer, 2008.

[KPS17] A. K. Kuchibhotla, R. K. Patra, and B. Sen. “Efficient estimation in convex singleindex models”. In: arXiv preprint arXiv:1708.00145 (2017).


[KS09] A. T. Kalai and R. Sastry. “The Isotron Algorithm: High-Dimensional Isotonic Re-gression.” In: COLT. Citeseer. 2009.

[KSB+19] S. R. Kunzel, J. S. Sekhon, P. J. Bickel, and B. Yu. “Metalearners for estimating het-erogeneous treatment effects using machine learning”. In: Proceedings of the NationalAcademy of Sciences 116.10 (2019), pp. 4156–4165.

[LAL19] W. Luo, W. Alghamdi, and Y. M. Lu. “Optimal spectral initialization for signal recov-ery with applications to phase retrieval”. In: IEEE Transactions on Signal Processing67.9 (2019), pp. 2347–2356.

[Li91] K.-C. Li. “Sliced inverse regression for dimension reduction”. In: Journal of the Amer-ican Statistical Association 86.414 (1991), pp. 316–327.

[Li92] K.-C. Li. “On principal Hessian directions for data visualization and dimension reduc-tion: Another application of Stein’s lemma”. In: Journal of the American StatisticalAssociation 87.420 (1992), pp. 1025–1039.

[LM17] G. Lecue and S. Mendelson. “Regularization and the small-ball method II: complexitydependent error rates”. In: The Journal of Machine Learning Research 18.1 (2017),pp. 5356–5403.

[LM18] G. Lecue and S. Mendelson. “Regularization and the small-ball method i: sparse re-covery”. In: The Annals of Statistics 46.2 (2018), pp. 611–641.

[LR07] Q. Li and J. S. Racine. Nonparametric econometrics: theory and practice. PrincetonUniversity Press, 2007.

[Men14] S. Mendelson. “Learning without concentration”. In: Conference on Learning Theory.2014, pp. 25–39.

[Men18] S. Mendelson. “Learning without concentration for general loss functions”. In: Prob-ability Theory and Related Fields 171.1-2 (2018), pp. 459–502.

[MM18] M. Mondelli and A. Montanari. “Fundamental limits of weak recovery with appli-cations to phase retrieval”. In: Foundations of Computational Mathematics (2018),pp. 1–71.

[MN89] P. McCullagh and J. A. Nelder. Generalized linear models. Monographs on Statisticsand Applied Probability. Second edition [of MR0727836]. Chapman & Hall, London,1989, pp. xix+511. isbn: 0-412-31760-5. doi: 10.1007/978-1-4899-3242-6. url:https://doi.org/10.1007/978-1-4899-3242-6.

[MPT07] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. “Reconstruction and subgaus-sian operators in asymptotic geometric analysis”. In: Geometric and Functional Anal-ysis 17.4 (2007), pp. 1248–1282.

[MPW18] C. Mao, A. Pananjady, and M. J. Wainwright. “Towards Optimal Estimation of Bivari-ate Isotonic Matrices with Unknown Permutations”. In: arXiv preprint arXiv:1806.09544(2018).

[Ney59] J. Neyman. “Optimal asymptotic tests of composite hypotheses”. In: Probability andstatistics (1959), pp. 213–234.

[Ney79] J. Neyman. “C(α) tests and their use”. In: Sankhya Ser. A 41.1-2 (1979), pp. 1–21.issn: 0581-572X.


[NJS13] P. Netrapalli, P. Jain, and S. Sanghavi. “Phase retrieval using alternating minimiza-tion”. In: Advances in Neural Information Processing Systems. 2013, pp. 2796–2804.

[NKN+96] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied linear statis-tical models. Vol. 4. Irwin Chicago, 1996.

[NWL16] M. Neykov, Z. Wang, and H. Liu. “Agnostic estimation for misspecified phase retrievalmodels”. In: Advances in Neural Information Processing Systems. 2016, pp. 4089–4097.

[PF19] A. Pananjady and D. P. Foster. “The sample complexity of learning convex single-index models”. In: preparation (2019).

[PV16] Y. Plan and R. Vershynin. “The generalized lasso with non-linear observations”. In:IEEE Transactions on information theory 62.3 (2016), pp. 1528–1537.

[PVY17] Y. Plan, R. Vershynin, and E. Yudovina. “High-dimensional estimation with geometricconstraints”. In: Information and Inference: A Journal of the IMA 6.1 (2017), pp. 1–40.

[Rob88] P. M. Robinson. “Root-N-consistent semiparametric regression”. In: Econometrica:Journal of the Econometric Society (1988), pp. 931–954.

[RV13] M. Rudelson and R. Vershynin. “Hanson-Wright inequality and sub-gaussian concen-tration”. In: Electronic Communications in Probability 18 (2013).

[SJA16] H. Sedghi, M. Janzamin, and A. Anandkumar. “Provable tensor methods for learningmixtures of generalized linear models”. In: Artificial Intelligence and Statistics. 2016,pp. 1223–1231.

[TAH15] C. Thrampoulidis, E. Abbasi, and B. Hassibi. “Lasso with non-linear measurementsis equivalent to one with linear measurements”. In: Advances in Neural InformationProcessing Systems. 2015, pp. 3420–3428.

[TR17] C. Thrampoulidis and A. S. Rawat. “The PhaseLift for Non-quadratic Gaussian Mea-surements”. In: arXiv preprint arXiv:1712.03638 (2017).

[Tsy09] A. B. Tsybakov. Introduction to nonparametric estimation. Revised and extended fromthe 2004 French original. Translated by Vladimir Zaiats. 2009.

[Van90] S. Van de Geer. “Estimating a regression function”. In: The Annals of Statistics(1990), pp. 907–924.

[Ver10] R. Vershynin. “Introduction to the non-asymptotic analysis of random matrices”. In:arXiv preprint arXiv:1011.3027 (2010).

[Ver18] R. Vershynin. High-dimensional probability: An introduction with applications in datascience. Vol. 47. Cambridge University Press, 2018.

[VW96] A. W. van der Vaart and J. A. Wellner. Weak convergence and empirical processes.Springer Series in Statistics. With applications to statistics. Springer-Verlag, NewYork, 1996, pp. xvi+508. isbn: 0-387-94640-3. doi: 10.1007/978-1-4757-2545-2.url: https://doi.org/10.1007/978-1-4757-2545-2.

[Wai19] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Vol. 48.Cambridge University Press, 2019.


[Wal18] I. Waldspurger. “Phase Retrieval With Random Gaussian Sensing Vectors by Alter-nating Projections”. In: IEEE Transactions on Information Theory 64.5 (May 2018),pp. 3301–3312. issn: 0018-9448. doi: 10.1109/TIT.2018.2800663.

[WGE18] G. Wang, G. B. Giannakis, and Y. C. Eldar. “Solving systems of random quadraticequations via truncated amplitude flow”. In: IEEE Transactions on Information The-ory 64.2 (2018), pp. 773–794.

[YBL17] Z. Yang, K. Balasubramanian, and H. Liu. “High-dimensional non-Gaussian singleindex models via thresholded score function estimation”. In: International Conferenceon Machine Learning. 2017, pp. 3851–3860.

[YCS14] X. Yi, C. Caramanis, and S. Sanghavi. “Alternating minimization for mixed linearregression”. In: International Conference on Machine Learning. 2014, pp. 613–621.

[YWC+15] X. Yi, Z. Wang, C. Caramanis, and H. Liu. “Optimal linear estimation under unknownnonlinear transform”. In: Advances in neural information processing systems. 2015,pp. 1549–1557.

[YWL+16] Z. Yang, Z. Wang, H. Liu, Y. Eldar, and T. Zhang. “Sparse nonlinear regression:Parameter estimation under nonconvexity”. In: International Conference on MachineLearning. 2016, pp. 2472–2481.

[YYF+17] Z. Yang, L. F. Yang, E. X. Fang, T. Zhao, Z. Wang, and M. Neykov. “Misspec-ified Nonconvex Statistical Optimization for Phase Retrieval”. In: arXiv preprintarXiv:1712.06245 (2017).

Appendix

A Weakening assumptions on the noise in the inverse problem

In this section, we present a version of our main result when the "noise" of the inverse problem only has a finite second moment instead of being sub-Gaussian. It is conceivable that in many problems, a second moment bound may be easier to establish than a sub-Gaussian bound.

Assumption 6. There is a positive scalar $\rho_\sigma$ such that
\[
\mathbb{E}[\xi^2(y)] \le \rho_\sigma^2 \quad \text{for all } y \in \mathcal{Y}.
\]

Under the above assumption, the following version of our main result follows.

Theorem 4. Suppose that Assumptions 1, 3, and 6 hold, and that the iterates $\theta_0, \ldots, \theta_T$ are generated by Algorithm 2. Then there is a pair of absolute constants $(c_1, c_2)$ such that for each $t = 0, \ldots, T-1$, if
\[
\Delta_t \le \frac{99}{100}, \qquad \mathcal{R}^{\mathcal{A}}_{N}\big(\nu_t h^*; \nu_t \mathbb{P}^*_{Y, \xi, \Delta_t}, \delta/3\big) + \frac{\rho_\sigma^2}{\delta} \le c_1 \kappa^2, \quad \text{and} \quad N \ge c_2 \max\left\{ p, \; \kappa^{-2} \log^2(1/\kappa) \log\left(\frac{c_2}{\delta}\right) \right\}, \tag{49a}
\]
then we have
\[
\Delta_{t+1} \le c_2 \left\{ \mathcal{R}^{\mathcal{A}}_{N}\big(\nu_t h^*; \nu_t \mathbb{P}^*_{Y, \xi, \Delta_t}, \delta/3\big) + \frac{\rho_\sigma^2}{\delta} \right\} \cdot \left( \frac{p + \log(4/\delta)}{N} \right) \tag{49b}
\]
with probability exceeding $1 - \delta$. Moreover, on this event, if $\angle(\theta_t, \theta^*) \le \pi/2$, then $\angle(\theta_{t+1}, \theta^*) \le \pi/2$.


Proof. The proof proceeds in exactly the same fashion as that of Theorem 1, so we only sketch the differences. In particular, replace equation (27) in that proof with Markov's inequality
\[
\Pr\left\{ \frac{1}{N} \sum_{i \in \mathcal{D}_{2t+2}} \xi_i^2 \ge \frac{3}{\delta} \rho_\sigma^2 \right\} \le \delta/3.
\]
The rest of the proof holds as is, and following the exact same steps proves the theorem.

Notably, the only respect in which Theorem 4 is different from Theorem 1 is in its dependence on the parameter $\delta$. In principle, a family of results that interpolate between these two theorems can be proved depending on the tail behavior of the noise: the lighter the tail, the milder the dependence of the result on the parameter $\delta$.

B Sharpening Theorem 2 using Mendelson’s method

While we have presented one particular analysis of the ERM algorithm in Section 3.3 in terms of the complexity functionals $\tau_k$ and $\mu_k$, Mendelson, in a series of papers [Men14; Men18], proposed an alternative analysis by defining two other complexity functionals. Among other things, he showed that his functionals more accurately capture the behaviour of the ERM algorithm in the low-noise limit. Let us briefly present his results in our context. For each positive integer $k$, scalar $\delta' \in (0, 1)$, and pair of positive constants $(\gamma_1, \gamma_2)$, define the quantities
\begin{align}
t_k(h^*; \gamma_1) &:= \inf\left\{ t > 0 : \mathcal{R}_k\big(\mathcal{H}^* \cap \mathbb{B}(\|\cdot\|_2; t)\big) \le \frac{t}{\gamma_1} \right\}, \quad \text{and} \tag{50a} \\
u_k(h^*, y_1^k; \gamma_2) &:= \inf\left\{ u > 0 : \Pr\left\{ \mathcal{G}_k\big(\mathcal{H}^* \cap \mathbb{B}(\|\cdot\|_2; u)\big) \ge \frac{u^2}{\gamma_2} \right\} \le \delta' \right\}, \tag{50b}
\end{align}
where
\[
\mathcal{G}_k(\mathcal{F}) := \mathbb{E}_{\eta, \xi}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{k} \sum_{i=1}^k \eta_i \xi_i \cdot f(y_i) \right| \right]
\]

denotes the symmetrized noise complexity of the problem$^{19}$, with $\eta = (\eta_1, \ldots, \eta_k)$ once again representing $k$ i.i.d. Rademacher random variables drawn independently of everything else.

These two quantities parallel the terms $(\tau_k, \mu_k)$ defined in equation (11), with two crucial differences. First, note that for a fixed scalar $\gamma_1$, the term $t_k(\gamma_1)$ will be of a much higher order than the corresponding term $\tau_k(\gamma_1)$ as $k$ grows. Note that both these terms are independent of the noise in the problem and effectively determine the error floor of our procedure. As argued by Mendelson, the term $t_k$ captures the correct size of the "version space" of the problem, which, roughly speaking, quantifies how far apart two functions can be if they agree on the set of (random) design points.

The fixed point $u_k$ differs from its counterpart $\mu_k$ in that the noise complexity is computed by taking a supremum over the (random) set of functions $\mathbb{B}(\|\cdot\|_2; u)$. This makes this quantity particularly difficult to evaluate, but at the very least, it has the favorable property that $u_k = 0$ when the noise level $\sigma = 0$.

$^{19}$The noise complexity (9b) can be bounded by the symmetrized noise complexity up to a universal constant factor, via the Giné-Zinn symmetrization theorem [GZ84].


The following proposition is a mild restatement of Theorem 3.1 of Mendelson [Men14] under Assumption 4.

Proposition 4 (Theorem 3.1 of Mendelson [Men14]). Suppose that Assumptions 2 and 4 hold. Then, there are absolute constants $(c_1, c_2)$ such that for all $\delta \ge \exp\{-c_1 k / b^4\}$, the rate function of the ERM algorithm run on $k$ samples from the model (7) satisfies
\[
\mathcal{R}^{\mathrm{ERM}}_k(h^*; \mathbb{P}^*_{Y,\xi}, \rho^2, \delta) \le 16 \cdot \big( t_k^2(c_2 b^2) + u_k^2(256 c_2 b^2; \rho, \delta/8) \big).
\]

Remark 8. As mentioned above, we expect Proposition 4 to be sharper than Proposition 2, especially in the low noise regime, since $t_k \ll \tau_k$. This was one of Mendelson's major contributions, in addition to weakening the conditions required on the function class to certain "small-ball" assumptions. Our statement of Proposition 4 ignores the latter extension.

Applying this proposition leads to the following consequence of our main result, Theorem 1. Once again, recall the shorthand $\widetilde{N} = N/(2T)$.

Theorem 5. Suppose that Assumptions 1, 2 and 4 hold, and that $\widetilde{N} \ge 2p$. Let $\mathcal{A}$ correspond to the ERM algorithm. There are absolute constants $(c_1, c_2)$ such that if
\[
\delta \ge \exp\{-c_1 \widetilde{N}/b^4\}, \quad \text{and} \quad t^2_{\widetilde{N}}(c_2 b^2) + u^2_{\widetilde{N}}\big(256 c_2 b^2; \sqrt{\Delta_t + \rho_\sigma^2}, \delta/(8\widetilde{N})\big) + \rho_\sigma^2 \le c_1 \kappa^2,
\]
then we have
\[
\Delta_{t+1} \le c_2 \left\{ t^2_{\widetilde{N}}(c_2 b^2) + u^2_{\widetilde{N}}\big(256 c_2 b^2; \sqrt{\Delta_t + \rho_\sigma^2}, \delta/(8\widetilde{N})\big) + \rho_\sigma^2 \right\} \cdot \frac{p}{\widetilde{N}} \tag{51}
\]
with probability exceeding $1 - \delta - 2\exp\{-c_1 \widetilde{N}\}$, for each $t = 0, \ldots, T - 1$.

While this theorem is also likely to be useful for applications, we do not state one since the quantity $u_k$ was difficult to compute for our class of monotone SIMs satisfying Assumption 5.

C Technical lemmas

We now collect some technical lemmas that were used in the proofs of our main results.

C.1 A recursion formula

We present a general recursion formula that is used to bound the error in multiple proofs.

Lemma 10. Consider any sequence of positive reals $\{a_i\}_{i \ge 0}$ satisfying the sequence of inequalities
\[
a_{t+1} \le C_1 + C_2 \left(\frac{a_t + C_3}{n}\right)^\gamma \quad \text{for each integer } t \ge 0,
\]
where the tuple $(C_1, C_2, C_3)$ represents arbitrary positive scalars, $n$ represents a positive integer, and we have the inclusion $\gamma \in (0, 1)$. Define the shorthand $\rho := \left(\frac{1}{2C_2}\right)^{(1-\gamma)^{-1}} a_0$. Then there is an absolute constant $c$ such that for all $T \ge \log_{\gamma^{-1}} \max\left\{ \log n^{\gamma(1-\gamma)^{-1} \vee 1}, \; \log \rho \right\}$, we have
\[
a_T \le c \left\{ C_1 + C_2 \left(\frac{C_3}{n}\right)^\gamma + (2C_2)^{(1-\gamma)^{-1}} \cdot n^{-\gamma(1-\gamma)^{-1}} \right\}.
\]


Proof. First, note that for two positive scalars $a$ and $b$ and $\gamma \in (0, 1]$, we have $(a+b)^\gamma \le a^\gamma + b^\gamma$. Thus, a consequence of the recursive inequality above is the relation
\[
a_{t+1} \le C_1 + C_2 \left(\frac{C_3}{n}\right)^\gamma + C_2 \left(\frac{a_t}{n}\right)^\gamma \le 2 \max\left\{ C_1 + C_2 \left(\frac{C_3}{n}\right)^\gamma, \; C_2 \left(\frac{a_t}{n}\right)^\gamma \right\}.
\]
Since the first term above is a constant, it suffices to provide upper bounds on the recursion
\[
b_{t+1} \le 2 C_2 \left(\frac{b_t}{n}\right)^\gamma \quad \text{with the initial condition } b_0 = a_0.
\]

We now claim that for all $t \ge 1$, the following upper bound holds:
\begin{align}
b_t &\le (2C_2)^{(1-\gamma)^{-1}(1 - \gamma^t)} \cdot n^{-\gamma(1-\gamma)^{-1}(1 - \gamma^t)} \cdot b_0^{\gamma^t} \tag{52} \\
&= (2C_2)^{(1-\gamma)^{-1}} \cdot n^{-\gamma(1-\gamma)^{-1}} \cdot \left(n^{\gamma(1-\gamma)^{-1}}\right)^{\gamma^t} \cdot \left( \left(\frac{1}{2C_2}\right)^{(1-\gamma)^{-1}} b_0 \right)^{\gamma^t}, \tag{53}
\end{align}
where equality (53) follows by computation. Taking this claim as given for the moment, note that if $t \ge \log_{\gamma^{-1}} \log x$ for a scalar $x \ge 1$, then we have $\gamma^t \le (\log x)^{-1}$. Also note that for each $x > 1$, we have $x^{(\log x)^{-1}} = e$ by definition. We now split the proof into two cases, using the shorthand $\rho := \left(\frac{1}{2C_2}\right)^{(1-\gamma)^{-1}} b_0$.

Case 1; $\rho \le 1$: In this case, it suffices to take $t \ge t_0 := \log_{\gamma^{-1}} \log n^{\gamma(1-\gamma)^{-1} \vee 1}$, in which case we have
\[
b_t \le e\, (2C_2)^{(1-\gamma)^{-1}} \cdot n^{-\gamma(1-\gamma)^{-1}}.
\]
Case 2; $\rho > 1$: Now take $t \ge t_0 \vee \log_{\gamma^{-1}} \log \rho$, where $t_0$ was defined in case 1 above. Then, we again have
\[
b_t \le e^2\, (2C_2)^{(1-\gamma)^{-1}} \cdot n^{-\gamma(1-\gamma)^{-1}}.
\]
Combining the two cases with the setup above completes the proof of the lemma.

It remains to establish claim (52), for which we use an inductive argument. The base case follows from the one-step definition of the recursion. Assuming the induction hypothesis, that the claim is true for some positive $t$, and evaluating the recursion yields
\[
b_{t+1} \le (2C_2) \left(\frac{b_t}{n}\right)^\gamma \le (2C_2) \cdot (2C_2)^{\gamma(1-\gamma)^{-1}(1 - \gamma^t)} \times \left(\frac{1}{n}\right)^\gamma \cdot \left(\frac{1}{n}\right)^{\gamma \cdot \gamma(1-\gamma)^{-1}(1 - \gamma^t)} \times b_0^{\gamma^{t+1}} = (2C_2)^{(1-\gamma)^{-1}(1 - \gamma^{t+1})} \cdot \left(\frac{1}{n}\right)^{\gamma(1-\gamma)^{-1}(1 - \gamma^{t+1})} \cdot b_0^{\gamma^{t+1}},
\]
thereby establishing the induction.
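The lemma is also easy to check numerically. The snippet below is purely illustrative (the constants are arbitrary choices): it iterates the recursion with equality, which is the worst case, and compares the result against the claimed bound.

```python
import numpy as np

C1, C2, C3, n, gamma, a0 = 0.01, 2.0, 0.5, 1000, 2.0 / 3.0, 1.0

a = a0
for _ in range(100):                         # comfortably larger than the T required by the lemma
    a = C1 + C2 * ((a + C3) / n) ** gamma    # the recursion, taken with equality

bound = C1 + C2 * (C3 / n) ** gamma + (2 * C2) ** (1 / (1 - gamma)) * n ** (-gamma / (1 - gamma))
print(f"a_T = {a:.5f},  claimed bound (up to the absolute constant c) = {bound:.5f}")
```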


C.2 Properties of truncated Gaussians

Let $\Phi(\cdot)$ denote the $p$-dimensional standard Gaussian PDF. For $a < b$, let $m_2(a, b)$ denote the second moment of a univariate standard Gaussian truncated to lie in the interval $[a, b]$, and let $\gamma = \min\{1, m_2(a, b)\}$. Finally, let $\kappa$ denote the Gaussian volume of the interval $[a, b]$.

Lemma 11. Let $w_1, w_2, \ldots, w_n$ denote i.i.d. draws from a Gaussian truncated to the interval $[a, b]$. There is a pair of absolute constants $(c_1, c_2)$ such that if $\kappa^2 n \ge c_1 \log^2(1/\kappa)$, then
\[
\Pr\left\{ \frac{1}{n} \sum_{i=1}^n w_i^2 \le \frac{\kappa^2}{2} \right\} \le c_1 \exp\left( -c_2 \frac{n \kappa^3}{\log^2(1/\kappa)} \right).
\]

Proof. The proof follows immediately from Lemma 4 of Ghosh et al. [GPG+19]. In particular, a slight modification of their lemma, specialized (in their notation) to $d = 1$ and with $n\kappa$ samples, yields the following claim. There is a pair of universal constants $(c_1, c_2)$ such that if $\kappa^2 n \ge c_1 \log^2(1/\kappa)$, then
\[
\Pr\left\{ \frac{1}{n} \sum_{i=1}^n w_i^2 \le \frac{\kappa^2}{2} \right\} \le \exp\left( -\frac{n \kappa^3}{\log^2(1/\kappa)} \right).
\]
Adjusting the constant factors completes the proof.

Lemma 12. Consider a matrix $X$ consisting of $n \ge p$ i.i.d. rows drawn from the distribution
\[
g(x) = \frac{\mathbf{1}\{x_1 \in [\ell, r]\}}{\Pr\{x_1 \in [\ell, r]\}} \cdot \Phi(x)
\]
for each $x \in \mathbb{R}^p$. Then for all $t \ge 0$, we have
\[
\Pr\left\{ \sigma_{\min}\big(X^\top X / n\big) \le \gamma - c\sqrt{p/n} - \frac{t}{\sqrt{n}} \right\} \le e^{-t^2/2}.
\]

Proof. Let $Y \sim g$ denote the truncated random variable. We claim that
\[
\mathbb{E}[Y Y^\top] \succeq \gamma I, \quad \text{and that } Y \text{ is sub-Gaussian with parameter at most } 2.
\]
Given this claim, the proof of the lemma follows immediately by applying Remark 5.40 of Vershynin [Ver10]. Proving the claim is also straightforward. Indeed, for any $v \in \mathbb{R}^p$, we have
\[
\mathbb{E}\langle Y, v \rangle^2 = v_1^2 \mathbb{E}[Y_1^2] + \sum_{i \ne 1} v_i^2 = v_1^2 m_2(a, b) + \sum_{i \ne 1} v_i^2.
\]
Minimizing the above expression over unit norm $v$ yields
\[
\inf_{v : \|v\| = 1} \mathbb{E}\langle Y, v \rangle^2 = \min\{1, m_2(a, b)\} = \gamma.
\]
In order to show that the random vector $Y$ is sub-Gaussian, it suffices to show that $\langle Y, v \rangle$ is $2$-sub-Gaussian for each unit vector $v$. Since the truncation operation only influences the one-dimensional random variable $Y_1$, it suffices to show that $Y_1$ is $2$-sub-Gaussian. Once again, we invoke a standard truncation lemma by symmetrization (e.g., Appendix A.4 of Mao et al. [MPW18]), which yields the result.
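As a numerical illustration of the lemma (not used in any proof; the truncation interval, dimensions, and seed below are arbitrary), one can generate the first coordinate by rejection sampling from the truncated Gaussian and compare the minimum eigenvalue of $X^\top X/n$ with $\gamma = \min\{1, m_2(\ell, r)\}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, ell, r = 20000, 10, -0.5, 0.5          # truncation interval [ell, r] for the first coordinate

# Rejection sampling from the univariate standard Gaussian truncated to [ell, r]
first = np.empty(0)
while first.size < n:
    z = rng.standard_normal(4 * n)
    first = np.concatenate([first, z[(z >= ell) & (z <= r)]])
first = first[:n]

X = np.column_stack([first, rng.standard_normal((n, p - 1))])
lam_min = np.linalg.eigvalsh(X.T @ X / n).min()
m2 = np.mean(first ** 2)                     # empirical second moment of the truncated coordinate
print("lambda_min(X^T X / n):", lam_min, "  gamma = min{1, m2}:", min(1.0, m2))
```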


Figure 2: Center of the circle denotes the point $cv$; the circle denotes the valid set $\mathbb{B}(cv; \tau)$ of $u$. Clearly, the $u$ that makes the largest angle with $v$ is given by the tangent to the circle (in blue).

C.3 Angles and norms

The following lemma collects an elementary fact about angles between vectors and distances between their scaled counterparts.

Lemma 13. Given a unit norm vector $v$ and a pair of positive scalars $(c, \tau)$ obeying the relation $\tau \le c$, suppose that a vector $u$ satisfies
\[
\|u - cv\|^2 \le \tau^2. \tag{54}
\]
Then, we have
\[
\sin \angle(u, v) \le \frac{\tau}{c}.
\]

Proof. We provide a simple proof by picture in Figure 2. In particular, denoting the ball of radius $r$ centered at $x$ by $\mathbb{B}(x; r)$, condition (54) is equivalent to the inclusion $u \in \mathbb{B}(cv; \tau)$. Clearly, the vector $u$ that maximizes the angle between $u$ and $v$ is given by the tangent to this ball from the origin. In this particular case, we have
\[
\sin \angle(u, v) = \frac{\tau}{c},
\]
and this establishes the proof.

D Details of the experiment used to produce Figure 1

We used Python for our simulation. As mentioned in the caption to Figure 1, our simulation is carried out for a 20-dimensional problem with a total of $n = 5000$ samples. We set $\theta^* = e_1$. In order to produce the ADE estimate, we used all 5000 samples to generate the unit-norm parameter
\[
\theta_{\mathrm{ADE}}(n) = \left( \frac{1}{n} \sum_{i=1}^n y_i x_i \right) \Big/ \left\| \frac{1}{n} \sum_{i=1}^n y_i x_i \right\|.
\]


In order to run Algorithm 2, we must specify (a) an initializer $\theta_0$, (b) a non-parametric function estimator, and (c) the number of iterations $T$ for which our algorithm should be run. To be fair, we begin by splitting our samples into two parts, the first with $n_1 = 200$ samples, and use the initializer $\theta_0 = \theta_{\mathrm{ADE}}(n_1)$. We then split the remaining $n_2 = 4800$ samples into $T = 2$ equal parts. We then ran Algorithm 2 with the isotonic regression estimator (for this, we used the IsotonicRegression package$^{20}$ available in scikit-learn, which in turn uses the pool adjacent violators algorithm to produce a function fit) for two iterations, and returned the estimate $\theta_2$. Note that due to the further sample-split within Algorithm 2, the isotonic regression fit is computed on 1200 samples per iteration.

$^{20}$We chose the clipped estimator, which returns a fit that is valid for all real numbers by extrapolating the returned function by a flat line on both its extremes.
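For concreteness, the following is a compressed sketch of this simulation (the link function $g^*(t) = t^3$, the noise level, and the seed are arbitrary illustrative choices; the further sample split inside Algorithm 2 and the plotting code used to produce Figure 1 are omitted).

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n, p, sigma, T = 5000, 20, 0.1, 2
theta_star = np.eye(p)[0]

X = rng.standard_normal((n, p))
y = (X @ theta_star) ** 3 + sigma * rng.standard_normal(n)    # an illustrative monotone SIM

def ade(X, y):
    """Average derivative estimate, normalized to unit norm."""
    v = (y[:, None] * X).mean(axis=0)
    return v / np.linalg.norm(v)

n1 = 200
theta = ade(X[:n1], y[:n1])                                   # initializer theta_0

for idx in np.array_split(np.arange(n1, n), T):               # T = 2 folds of the remaining samples
    Xf, yf = X[idx], y[idx]
    # Non-parametric step: fit the projections <x_i, theta> as a monotone function of y_i,
    # using the clipped isotonic estimator (pool adjacent violators under the hood).
    iso = IsotonicRegression(out_of_bounds="clip")
    w_hat = iso.fit(yf, Xf @ theta).predict(yf)
    # Linear regression step: regress the fitted values on the covariates, then re-normalize.
    theta, *_ = np.linalg.lstsq(Xf, w_hat, rcond=None)
    theta /= np.linalg.norm(theta)

print("final estimate's angle to theta*:", np.arccos(np.clip(theta @ theta_star, -1.0, 1.0)))
```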
