arXiv:1105.1475v2 [stat.ME] 15 May 2011
PIVOTAL ESTIMATION OF NONPARAMETRIC FUNCTIONS VIA
SQUARE-ROOT LASSO
ALEXANDRE BELLONI, VICTOR CHERNOZHUKOV, AND LIE WANG
Abstract. In a nonparametric linear regression model we study a variant of LASSO, called √LASSO, which does not require knowledge of the scaling parameter σ of the noise or bounds for it. This work derives new finite-sample upper bounds for the prediction norm rate of convergence, ℓ1-rate of convergence, ℓ∞-rate of convergence, and sparsity of the √LASSO estimator. A lower bound for the prediction norm rate of convergence is also established.
In many non-Gaussian noise cases, we rely on moderate deviation theory for self-normalized sums and on new data-dependent empirical process inequalities to achieve Gaussian-like results provided log p = o(n^{1/3}), improving upon results derived in the parametric case that required log p ≲ log n.
In addition, we derive finite-sample bounds on the performance of ordinary least squares (OLS) applied to the model selected by √LASSO, accounting for possible misspecification of the selected model. In particular, we provide mild conditions under which the rate of convergence of OLS post-√LASSO is not worse than that of √LASSO.
We also study two extreme cases: parametric noiseless and nonparametric unbounded variance. √LASSO has interesting theoretical guarantees in these two extreme cases. In the parametric noiseless case, unlike LASSO, √LASSO is capable of exact recovery. In the unbounded variance case it can still be consistent since its penalty choice does not depend on σ.
Finally, we conduct Monte Carlo experiments which show that the empirical performance of √LASSO is very similar to the performance of LASSO when σ is known. We also emphasize that √LASSO can be formulated as a convex programming problem and its computational burden is similar to that of LASSO. We provide theoretical and empirical evidence of that.
Date: First Version: February 4, 2009. Last Revision: May 17, 2011.
1. Introduction
We consider the problem of recovering a nonparametric regression function, where the underlying function of interest has an unknown functional form in basic covariates. To be more specific, we consider the nonparametric regression model:
yi = f(zi) + σǫi, i = 1, . . . , n, (1.1)
where yi are the outcomes, zi are vectors of fixed basic covariates, ǫi are i.i.d. noises, f is
the regression function, and σ is a scaling parameter. Our goal is to recover the regression
function f . To achieve this goal, we use linear combinations of regressors xi = P (zi) to
approximate f , where P (zi) is a p-vector of transformations of zi. We are interested in the
high dimension low sample size case, in which we potentially have p ≥ n, to attain a flexible
functional form. In particular, we are interested in a sparse model over the regressors xi to
describe the regression function.
Now the model is written as
yi = x′iβ0 + ui, ui = ri + σǫi.
In the above expression, ri := fi − x′iβ0 is the approximation error, where fi = f(zi).
Assume that the cardinality of the support of coefficient β0 is
s := |T | = ‖β0‖0,
where T = supp(β0). It is well known that ordinary least squares is generally inconsistent when p > n. However, the sparsity assumption makes it possible to estimate these models effectively by searching for approximately the right set of regressors. In particular, ℓ1-based penalization methods have played a central role in this question. Many papers have studied the estimation of high-dimensional mean regression models with the ℓ1-norm acting as a penalty function [6, 11, 15, 19, 34, 39, 38]. In these references, under an appropriate choice of penalty level, it was demonstrated that the ℓ1-penalized least squares estimators achieve the rate σ√(s/n)·√(log p), which is very close to the oracle rate σ√(s/n) achievable when the true model is known. We refer to [4, 6, 8, 9, 7, 13, 16, 17, 24, 34] for many other developments and a more detailed review of the existing literature.
An important ℓ1-based estimator proposed in [30] is the LASSO estimator, defined as

β̂L ∈ arg min_{β∈ℝp} Q(β) + (λ/n)‖β‖1,   (1.2)
where, for observations of a response variable yi and regressors xi, Q(β) = En[(yi − xi′β)²] = (1/n)∑_{i=1}^n (yi − xi′β)², ‖β‖1 = ∑_{j=1}^p |βj|, and λ is the penalty level. We note that the LASSO estimator minimizes a convex function. Therefore, from a computational complexity perspective, (1.2) is a computationally efficient (polynomial time) alternative to exhaustive combinatorial search over all possible submodels.
The performance of LASSO relies heavily on the penalization parameter λ which should
majorate the non-negligible spurious correlation between the noise terms and the large
number of additional regressors. The typical choice of λ is proportional to the unknown
scaling parameter σ of the noise (typically the standard deviation). Simple upper bounds for
σ can be derived based on the empirical variance of the response variable. However, upper
bounds on σ can lead to unnecessary over regularization which translates into larger bias
and slower rates of convergence. Moreover, such over regularization can lead to the exclusion
of relevant regressors from the selected model harming post model selection estimators.
In this paper, we have three sets of main results. The first contribution is to study a variant of (1.2), called √LASSO, which does not require knowledge of σ or bounds on it but is still computationally attractive. The √LASSO estimator is defined as

β̂ ∈ arg min_{β∈ℝp} √Q(β) + (λ/n)‖β‖1.   (1.3)
In the parametric model studied in [5], the choice of the penalty parameter becomes pivotal given the covariates and the distribution of the error term. In contrast, in the nonparametric setting we need to account for the impact of the approximation error to derive a practical and theoretically justified choice of penalty level. We rely on moderate deviation theory for self-normalized sums and on data-dependent empirical process inequalities to achieve Gaussian-like results in many non-Gaussian cases provided log p = o(n^{1/3}), improving upon results derived in the parametric case that required log p ≲ log n; see [5]. We perform a thorough non-asymptotic theoretical analysis of the choice of the penalty parameter.
The second set of contributions is to derive upper bounds for the prediction norm rate of convergence, ℓ1-rate of convergence, ℓ∞-rate of convergence, and sparsity of the √LASSO estimator. A lower bound on the rate of convergence in the prediction norm is also established. Furthermore, we also study two extreme cases: (i) parametric noiseless and (ii)
nonparametric unbounded variance. √LASSO has interesting theoretical guarantees in these two extreme cases. In the parametric noiseless case, for a wide range of penalty levels, √LASSO achieves exact recovery, in sharp contrast to LASSO. In the nonparametric unbounded variance case, the √LASSO estimator can still be consistent since its penalty choice does not depend on the standard deviation of the noise. We develop the necessary modifications of the oracle definition and penalty level, and derive finite-sample bounds for the case in which the noise has a Student's t-distribution with 2 degrees of freedom.
The third contribution aims to remove the potentially significant bias towards zero introduced by the ℓ1-norm regularization employed in (1.3). We consider the post-model-selection estimator that applies ordinary least squares (OLS) regression to the model T̂ selected by √LASSO. Formally, set

T̂ = supp(β̂) = {j ∈ {1, . . . , p} : |β̂j| > 0},

and define the OLS post-√LASSO estimator β̃ as

β̃ ∈ arg min_{β∈ℝp} √Q(β) : βj = 0 if j ∈ T̂c.   (1.4)
It follows that if the model selection works perfectly (i.e., T̂ = T) then the OLS post-√LASSO estimator is simply the oracle estimator, whose properties are well known. Unfortunately, perfect model selection might be unlikely for many designs of interest. This is usually the case in a nonparametric setting. Thus, we are also interested in the properties of OLS post-√LASSO when T̂ ≠ T, including cases where T ⊈ T̂.
Finally, we emphasize that√LASSO can be formulated as a convex programming problem
which allows us to rely on many efficient algorithmic implementations to compute the
estimator (interior point methods [31, 32], first order methods [1, 2], and coordinatewise
methods). Importantly, the computation cost of√LASSO is similar to LASSO. We conduct
Monte carlo experiments which show that the empirical performance of√LASSO is very
similar to the performance of LASSO when σ is known.
2. Nonparametric Regression Model
Recall the nonparametric regression model:
yi = f(zi) + σǫi, ǫi ∼ F0, E[ǫi] = 0, E[ǫi²] = 1, i = 1, . . . , n.   (2.5)
The model can also be written as
yi = x′iβ0 + ui, ui = ri + σǫi. (2.6)
In many applications of interest there is no exact sparse model or, due to noise, it might
be inefficient to rely on an exact model. However, there might be a sparse model that yields
a good approximation to the true regression function f in equation (2.5). In this case, the
target linear combination is given by any vector β0 that solves the following “oracle” risk
minimization problem:
min_{β∈ℝp} En[(fi − xi′β)²] + σ² ‖β‖0/n,   (2.7)

where En[zi] = (1/n)∑_{i=1}^n zi, fi = f(zi), and the corresponding cardinality of its support T = supp(β0) is

s := |T| = ‖β0‖0.
The oracle balances the approximation error En[(fi − xi′β)²] with the variance term σ²‖β‖0/n, where the latter is determined by the complexity of the model, i.e., the number of non-zero coefficients of β. The average squared error from approximating fi by xi′β0 is denoted by

cs² := En[ri²] = En[(fi − xi′β0)²],

so that cs² + σ²s/n is the optimal value of (2.7). In general the support of the best sparse approximation T = supp(β0) is unknown since we do not observe fi.
We consider the case of fixed design, namely we treat the covariate values x1, . . . , xn as
fixed. Without loss of generality, we normalize the covariates so that

En[xij²] = 1 for j = 1, . . . , p.   (2.8)
We summarize the previous setting in the following condition.
Condition 1 (ASM). We have data {(yi, zi) : i = 1, . . . , n} that for each n obey the
regression model (2.5), which admits the approximately sparse form (2.6) with β0 defined
by (2.7). The regressors xi = P (zi) are normalized as in (2.8).
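In practice the normalization (2.8) can always be imposed by rescaling each column of the design matrix; a small sketch (assuming a numpy array `X` holding the design):

```python
import numpy as np

def normalize_columns(X):
    """Rescale each column of X so that En[x_ij^2] = (1/n)*sum_i x_ij^2 = 1,
    as in (2.8).  Returns the rescaled design and the scale factors."""
    scale = np.sqrt((X ** 2).mean(axis=0))   # sqrt(En[x_ij^2]) per column
    return X / scale, scale
```

Coefficients estimated on the rescaled design must be divided by `scale` to return to the original parameterization.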
The main focus of the literature is on deriving rate-of-convergence results in the prediction norm, which measures the accuracy of predicting the true regression function over the design points x1, . . . , xn: ‖δ‖2,n := √(En[(xi′δ)²]). It follows that the least squares criterion Q(β) = En[(yi − xi′β)²] satisfies

Q(β) − Q(β0) = ‖β − β0‖2,n² − 2En[(σǫi + ri)xi′(β − β0)].   (2.9)
Thus, Q(β) − Q(β0) provides noisy information about ‖β − β0‖2,n², and the discrepancy is controlled by the noise and approximation terms 2En[σǫixi] and 2En[rixi]. Moreover, the estimation of β0 in the prediction norm allows us to bound the empirical risk via the triangle inequality:

√(En[(xi′β̂ − fi)²]) ≤ ‖β̂ − β0‖2,n + cs.   (2.10)
2.1. Conditions on the Gram Matrix. It is known that the Gram matrix En[xixi′] plays an important role in the analysis of estimators in this setup. In our case, the smallest eigenvalue of the Gram matrix is zero if p > n, which creates potential identification problems. Thus, to restore identification, one needs to restrict the type of deviation vectors δ from β0 that we will consider. It will be important to consider vectors δ that belong to the restricted set ∆c̄ defined as

∆c̄ = {δ ∈ ℝp : ‖δTc‖1 ≤ c̄‖δT‖1, δ ≠ 0}, for c̄ ≥ 1.
We will state the bounds in terms of the following restricted eigenvalue of the Gram matrix En[xixi′]:

κc̄ := min_{δ∈∆c̄} √s ‖δ‖2,n / ‖δT‖1.   (2.11)

The restricted eigenvalue can depend on n and T, but we suppress this dependence in our notation. The restricted eigenvalue (2.11) is a variant of the restricted eigenvalue introduced in Bickel, Ritov and Tsybakov [6].
Next consider the minimal and maximal m-sparse eigenvalues of the Gram matrix:

φmin(m) := min_{‖δTc‖0≤m, δ≠0} ‖δ‖2,n²/‖δ‖2²,  and  φmax(m) := max_{‖δTc‖0≤m, δ≠0} ‖δ‖2,n²/‖δ‖2².   (2.12)
They also play an important role in the analysis. Moreover, sparse eigenvalues provide a simple sufficient condition to bound restricted eigenvalues. Indeed, following [6], we can bound κc̄ from below by

κc̄ ≥ max_{m>0} √(φmin(m)) (1 − √(φmax(m)/φmin(m)) · c̄ · √(s/m)).

Thus, if the m-sparse eigenvalues are bounded away from zero and from above,

0 < k ≤ φmin(m) ≤ φmax(m) ≤ k′ < ∞, for all m ≤ 4(k′/k)²c̄²s,   (2.13)

then κc̄ ≥ √(φmin(4(k′/k)²c̄²s))/2. We note that (2.13) only requires the eigenvalues of certain "small" m×m submatrices of the large p×p Gram matrix to be bounded from above and below. Many sufficient conditions for (2.13) are provided by [6], [39], and [19]. Bickel, Ritov, and Tsybakov [6] and others also provide different sets of sufficient primitive conditions for κc̄ to be bounded away from zero.
3. Pivotal Penalty Level for √LASSO
Here we propose a pivotal, data-driven choice for the regularization parameter value λ
in the case that the distribution of the disturbances is known up to a scaling factor (in the
traditional case we would have unknown variance but normality of ǫi). Since the objective
function in the optimization problem (1.3) is not pivotal in either small or large samples,
finding a pivotal λ appears to be difficult a priori. However, the insight is to consider
the gradient at the true parameter that summarizes the noise in the estimation problem.
We note that the principle of setting λ to dominate the score of the criterion function is
motivated by [6]’s choice of penalty level for LASSO. In fact, this is a general principle
that carries over to other convex problems and that leads to the near-oracle performance of
ℓ1-penalized estimators.
The key quantity determining the choice of the penalty level for √LASSO is the score

S := En[xi(σǫi + ri)] / √(En[(σǫi + ri)²])  if En[(σǫi + ri)²] > 0,  and S := 0 otherwise.
The score S summarizes the estimation noise and approximation error in our problem, and we may set the penalty level λ/n to dominate this term. For efficiency reasons, we set λ/n at the smallest level that dominates the estimation noise, namely we choose the smallest λ such that

λ ≥ cΛ, for Λ := n‖S‖∞,   (3.14)
with a high probability, say 1 − α, where Λ is the maximal score scaled by n, and c > 1 is a theoretical constant of [6] to be stated later. The event (3.14) implies that

β̂ − β0 ∈ ∆c̄, where c̄ = (c + 1)/(c − 1),

so that rates of convergence can be attained using the restricted eigenvalue κc̄.
It is worth pointing out that in the parametric case, ri = 0, i = 1, . . . , n, the score S
does not depend on the unknown scaling parameter σ or the unknown true parameter value
β0. Therefore, the score is pivotal with respect to these parameters. Moreover, under the
additional classical normality assumption, namely F0 = Φ, the score is in fact completely
pivotal, conditional on X. This means that in principle we know the distribution of S in
this case, or at least we can compute it by simulation, see [5].
However, in the nonparametric case the design X also determines the unknown approximation errors ri. To achieve an implementable choice of penalty level, we will consider the following random variable:

Λ := (n‖En[σǫixi]‖∞ + σ√n) / √(En[σ²ǫi²]) = (n‖En[ǫixi]‖∞ + √n) / √(En[ǫi²]).   (3.15)
Λ also does not depend on the unknown scaling parameter σ or the unknown true parameter value β0, and therefore it is pivotal with respect to these parameters. We will show that

if √(En[σ²ǫi²]) ≤ (1 + un)√(En[(σǫi + ri)²]), then the maximal score obeys n‖S‖∞ ≤ (1 + un)Λ.   (3.16)

It will follow that the condition above is satisfied with high probability even for un = o(1).
We propose two choices for the penalty level λ. For 0 < α < 1, c > 1, and un ≥ 0, define:

exact: λ = (1 + un) · c · Λ(1 − α|X),
asymptotic: λ = (1 + un) · c · √n Φ⁻¹(1 − α/2p),   (3.17)

where Φ(·) is the cumulative distribution function of a standard Gaussian random variable, and Λ(1 − α|X) := (1 − α)-quantile of Λ|X, ǫi ∼ F0. We can accurately approximate the latter by simulation and the former by numerical integration.
The parameter 1 − α is a confidence level which guarantees near-oracle performance with probability at least 1 − α; we recommend 1 − α = 95%. The constant c > 1 is the slack parameter in (3.14) used in [6]; we recommend c = 1.1. The parameter un is intended to account for the approximation errors so as to achieve (3.16); we recommend un = 0.05. These recommendations are valid in both finite and large samples under the conditions stated below. They are also supported by the finite-sample experiments reported in Section 5.
The exact option is applicable when we know the distribution of the errors F0, for example in the classical Gaussian errors case. As stated previously, we can approximate the quantiles Λ(1 − α|X) by simulation; therefore, we can implement the exact option. The asymptotic option is applicable when F0 and the design X satisfy some moment conditions stated below. We recommend using the exact option, when applicable, since it is better tailored to the given design X and sample size n. However, the asymptotic option is trivial to compute and it often provides a very similar penalty level.
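For illustration, both options in (3.17) can be computed in a few lines; the sketch below assumes Gaussian errors for the exact option (simulating the quantile of Λ|X in (3.15), as suggested above) and uses the stdlib normal quantile for the asymptotic option. Function and variable names are ours, not the paper's:

```python
import numpy as np
from statistics import NormalDist

def penalty_asymptotic(n, p, alpha=0.05, c=1.1, u_n=0.05):
    """Asymptotic option in (3.17):
    (1 + u_n) * c * sqrt(n) * Phi^{-1}(1 - alpha/(2p))."""
    return (1 + u_n) * c * np.sqrt(n) * NormalDist().inv_cdf(1 - alpha / (2 * p))

def penalty_exact(X, alpha=0.05, c=1.1, u_n=0.05, n_sim=1000, seed=0):
    """Exact option in (3.17): (1 + u_n) * c * (1 - alpha)-quantile of
    Lambda | X in (3.15), simulated here with eps_i ~ N(0, 1)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    lam_draws = np.empty(n_sim)
    for k in range(n_sim):
        eps = rng.standard_normal(n)
        # Lambda = (n * ||En[eps_i x_i]||_inf + sqrt(n)) / sqrt(En[eps_i^2])
        num = n * np.max(np.abs(X.T @ eps / n)) + np.sqrt(n)
        lam_draws[k] = num / np.sqrt(np.mean(eps ** 2))
    return (1 + u_n) * c * np.quantile(lam_draws, 1 - alpha)
```

For example, with n = 100 and p = 50 the asymptotic choice is (1.05)(1.1)·10·Φ⁻¹(0.9995) ≈ 38; the exact option carries the additional √n approximation-error term from (3.15), so it is somewhat larger.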
3.1. Analysis of the Penalty Choices. In this section we formally analyze the penalty choices described in (3.17). In particular, we are interested in establishing bounds on the probability of (3.14) occurring.

In order to achieve Gaussian-like behavior for non-Gaussian disturbances we have to rely on some moment conditions for the noise and some restrictions on the growth of p relative to n, and also consider α that is either bounded away from zero or approaches zero not too rapidly. In this section we focus on the following set of conditions.
Condition M'. There exists a finite constant q > 8 such that the disturbance obeys sup_{n≥1} E_{F0}[|ǫ|^q] < ∞, and the design X obeys sup_{n≥1} max_{1≤j≤p} En[|xij|^q] < ∞.
Condition R'. Let rn = (α⁻¹(log n) Cq E[|ǫ|^{q∨4}])^{1/q}/n^{1/4} < 1/2, and define the constants

c1 = 1 + 4√(log(6p(log n)/α)/n) · (3E[ǫi^q]α/log n)^{4/q} · max_{1≤j≤p}(En[xij⁸])^{1/4}  and  c2 = 1/(1 − 3^{1/q} rn).

Also, assume that n^{1/6} ≥ (Φ⁻¹(1 − α/2p) + 1) max_{1≤j≤p}(En[|xij|³] E[|ǫi|³])^{1/3}. Let un ≥ 0 be such that un Φ⁻¹(1 − α/2p) ≥ 4/(1 − rn) and un ≥ 4(√(c1c2) − 1).
The growth condition depends on the number of bounded moments q of the regressors and of the noise term. Under condition M' and fixed α, condition R' is satisfied if log p = o(n^{1/3}). This is asymptotically less restrictive than the condition log p ≤ (q − 2) log n required in [5]. However, condition M' is more stringent than some conditions in [5], so neither set of conditions dominates the other. (Due to similarities, we defer the result under the conditions considered in [5] to the appendix.)
The main insight of the new analysis is the use of the theory of moderate deviations for self-normalized sums. Theorems 1 and 2 show that the options in (3.17) implement the regularization event λ > cΛ in the non-Gaussian case with asymptotic probability at least 1 − α. Theorem 3 bounds the magnitude of the penalty level λ for the exact option.
Theorem 1 (Coverage of the Exact Penalty Level). Suppose that conditions ASM, M' and R' hold. Then, the exact option in (3.17) implements λ > cΛ with probability at least

1 − α(1 + 1/log n) − 4(1 + un)/(n un).
Theorem 2 (Coverage of the Asymptotic Penalty Level). Suppose that conditions ASM, M' and R' hold. Then, the asymptotic option in (3.17) implements λ > cΛ with probability at least

1 − α(1 + A/ℓn³ + 2/log n) − 20/(n(un ∧ 1)),

where ℓn = n^{1/6}/[(Φ⁻¹(1 − α/2p) + 1) max_{1≤j≤p}(En[|xij|³] E[|ǫi|³])^{1/3}], and A is a universal constant.
Theorem 3 (Bound on the Exact Penalty Level). Suppose that conditions ASM, M' and R' hold. Then, the approximate score Λ(1 − α|X) in (3.15) satisfies

Λ(1 − α|X) ≤ 2√n + √n Φ⁻¹(1 − α/2p) √(c1c2) (1 + √(2 log(3 + 3A/ℓn³))/Φ⁻¹(1 − α/2p)),

where ℓn = n^{1/6}/[(Φ⁻¹(1 − α/2p) + 1) max_{1≤j≤p}(En[|xij|³] E[|ǫi|³])^{1/3}], A is a universal constant, and Φ⁻¹(1 − α/2p) ≤ √(2 log(p/α)).
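The final inequality above is the standard Gaussian tail bound; for completeness, a one-line derivation:

```latex
\text{For } t \ge 0,\qquad 1-\Phi(t) \;\le\; \tfrac12\, e^{-t^2/2}.
\text{ Taking } t=\sqrt{2\log(p/\alpha)} \text{ gives }
1-\Phi(t) \;\le\; \tfrac12\, e^{-\log(p/\alpha)} \;=\; \frac{\alpha}{2p},
\text{ hence, by monotonicity of } \Phi^{-1},\quad
\Phi^{-1}\!\left(1-\alpha/2p\right) \;\le\; \sqrt{2\log(p/\alpha)}.
```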
Under conditions on the growth of p relative to n, these theorems establish that many nice properties of the penalty level in the Gaussian case continue to hold in many non-Gaussian cases. The following corollary summarizes the asymptotic behavior of the penalty choices in (3.17).

Corollary 1. Suppose that conditions ASM, M' and R' hold, and λ is chosen following either the exact or the asymptotic rule in (3.17). If α/p = o(1), log(p/α) = o(n^{1/3}), and 1/α = o(n/log p), then there exists un = o(1) such that λ satisfies

P(λ ≥ cΛ|X) ≥ 1 − α(1 + o(1)) and λ ≤ (1 + o(1)) c √(2n log(p/α)).
4. Finite-Sample Analysis of √LASSO

Next we establish several finite-sample results regarding the √LASSO estimator. We highlight several differences between the √LASSO analysis conducted here and the traditional analysis of LASSO. First, we do not assume Gaussian or sub-Gaussian noise. Second, the value of the scaling parameter σ is unknown, and the penalty level λ does not depend on it. Third, a mild side condition is required to hold, which ensures that the penalty does not overrule the identification. Fourth, most of the analysis of this section is conditional not only on the covariates x1, . . . , xn, but also on the noise ǫ1, . . . , ǫn, through the event λ ≥ cΛ. Therefore, by choosing λ as in (3.17) the event λ ≥ cΛ occurs with high probability and the stated results hold.
4.1. Finite-Sample Bounds on Different Norms. We begin with a finite-sample bound for the prediction norm which is similar to the bound obtained by the LASSO estimator that knows σ.
Theorem 4 (Finite-Sample Bounds on the Estimation Error). Under condition ASM, let c > 1, c̄ = (c + 1)/(c − 1), and suppose that λ obeys the growth restriction λ√s < nκc̄. If λ ≥ cΛ, then

‖β̂ − β0‖2,n ≤ [2(1 + 1/c)/(1 − (λ√s/(nκc̄))²)] · √Q(β0) · λ√s/(nκc̄).
We recall that the choice of λ does not depend on the scaling parameter σ. The impact of σ on the bound above comes through the factor

√Q(β0) ≤ σ√(En[ǫi²]) + cs.

Thus, this result leads to the same rate of convergence as in the case of the LASSO estimator that knows σ, since En[ǫi²] concentrates around one under (2.5) by the law of large numbers.
As mentioned before, the analysis of √LASSO raises several issues different from those of LASSO, and so the proof of Theorem 4 is different. In particular, we need to invoke the additional growth restriction λ√s < nκc̄, which is not present in the LASSO analysis that treats σ as known. This is required because the introduction of the square root removes the quadratic growth that would eventually dominate the ℓ1 penalty for large enough deviations from β0. The condition ensures that the penalty is not too large, so that identification of β0 is still possible. Note, however, that when this side condition fails and σ is bounded away from zero, LASSO itself is not guaranteed to be consistent, since its rate of convergence is typically given by σλ√s/[nκc̄].
Also, the event λ ≥ cΛ accounts for the approximation errors r1, . . . , rn. This has two implications. First, the impact of cs on the estimation of β0 is diminished by a factor of λ√s/[nκc̄]. Second, despite the approximation errors, we have β̂ − β0 ∈ ∆c̄. This is in contrast to analyses that relied on λ ≥ cn‖En[ǫixi]‖∞ instead; see [6, 3]. We build on the latter to establish the ℓ1- and ℓ∞-rates of convergence.
Theorem 5 (ℓ1-rate of convergence). Under condition ASM, if λ ≥ cΛ, for c > 1 and c̄ := (c + 1)/(c − 1), then

‖β̂ − β0‖1 ≤ (1 + c̄) √s ‖β̂ − β0‖2,n / κc̄.
Theorem 6 (ℓ∞-rate of convergence). Under condition ASM, if λ ≥ cΛ, for c > 1 and c̄ = (c + 1)/(c − 1), then

‖β̂ − β0‖∞ ≤ (1 + 1/c) λ√Q(β0)/n + (λ²/n² + ‖En[xixi′] − I‖∞) ‖β̂ − β0‖1.
Regarding the ℓ∞-rate, since we have ‖·‖∞ ≤ ‖·‖1, the result is meaningful for nearly orthogonal designs, for which ‖En[xixi′] − I‖∞ is small. In fact, near orthogonality also allows us to bound the restricted eigenvalue κc̄ from below: [6] and [16] have established that if for some u ≥ 1 we have ‖En[xixi′] − I‖∞ ≤ 1/(u(1 + c̄)s), then κc̄ ≥ √(1 − 1/u).
We close this subsection by establishing a relative finite-sample bound on the estimation of Q(β0) by Q(β̂), under the assumptions of Theorem 4.

Theorem 7 (Relative Estimation of Q(β0)). Under condition ASM, if λ ≥ cΛ and λ√s < nκc̄, for c > 1 and c̄ := (c + 1)/(c − 1), we have

−(4c̄/c) · [(λ√s/(nκc̄))²/(1 − (λ√s/(nκc̄))²)] · √Q(β0) ≤ √Q(β̂) − √Q(β0) ≤ 2(1 + 1/c) · [(λ√s/(nκ1))²/(1 − (λ√s/(nκ1))²)] · √Q(β0).
Thus, if in addition λ√s = o(nκc̄) holds, Theorem 7 establishes that √Q(β̂) = (1 + o(1))√Q(β0). The quantity √Q(β̂) is particularly relevant for √LASSO since it appears in the first-order condition, which is the key to establishing sparsity properties.
4.2. Finite-Sample Bounds Relating Sparsity and Prediction Norm. In this section we investigate sparsity properties and lower bounds on the rate of convergence in the prediction norm of the √LASSO estimator. It turns out these results are connected via the first-order condition. We start with a technical lemma.
Lemma 1 (Relating Sparsity and Prediction Norm). Under condition ASM, let T̂ = supp(β̂) and m̂ = |T̂ \ T|. For any λ > 0 we have

(λ/n) √Q(β̂) √|T̂| ≤ √|T̂| ‖En[xi(σǫi + ri)]‖∞ + √(φmax(m̂)) ‖β̂ − β0‖2,n.
The proof of the lemma above relies on the optimality conditions, which imply that the selected support has binding dual constraints. Intuitively, for any selected component, there is a shrinkage bias which introduces a bound on how close the estimated coefficient can be to the true coefficient. Based on the inequality above and Theorem 7, we establish the following result.
Theorem 8 (Lower Bound on Prediction Norm). Under condition ASM, if λ ≥ cΛ and λ√s < nκc̄, where c > 1 and c̄ := (c + 1)/(c − 1), and letting T̂ = supp(β̂) and m̂ = |T̂ \ T|, we have

‖β̂ − β0‖2,n ≥ [λ√|T̂| / (n√(φmax(m̂)))] · √Q(β0) · [1 − 1/c − (4c̄/c) · (λ√s/(nκc̄))²/(1 − (λ√s/(nκc̄))²)].
In the case of LASSO, as derived in [17], the lower bound does not contain the term √Q(β0), since the impact of the scaling parameter σ is accounted for in the penalty level λ. Thus, under condition ASM, the lower bounds for LASSO and √LASSO are very close.

Next we proceed to bound the size of the selected support T̂ = supp(β̂) for the √LASSO estimator relative to the size s of the support of the oracle estimator β0.
Theorem 9 (Sparsity Bound for √LASSO). Under condition ASM, let β̂ denote the √LASSO estimator, T̂ = supp(β̂), and m̂ := |T̂ \ T|. If λ ≥ cΛ and λ√s ≤ nκc̄ρ, where 2ρ² ≤ (c − 1)/(c − 1 + 4c̄), for c > 1 and c̄ = (c + 1)/(c − 1), we have

m̂ ≤ s · [min_{m∈M} φmax(m)] · (4c̄/κc̄)²,

where M = {m ∈ ℕ : m > s φmax(m) · 2(4c̄/κc̄)²}.
The slightly more stringent side condition ensures that the right-hand side of the bound in Theorem 8 is positive. Asymptotically, under mild conditions on the design matrix, for example

φmax(s log n)/φmin(s log n) ≲ 1,

the event (3.14) and the side condition s log(p/α) = o(n) imply that, for n large enough, the size of the selected model is of the same order of magnitude as that of the oracle model, namely m̂ ≲ s.
4.3. Finite-Sample Bounds on the Estimation Error of OLS post-√LASSO. Based on the model selected by the √LASSO estimator, T̂ := supp(β̂), we consider the OLS estimator restricted to these data-driven selected components. If model selection works perfectly (as it will under some rather stringent conditions), then this estimator is simply the oracle estimator and its properties are well known. However, of more interest is the case when model selection does not work perfectly, as occurs for many designs of interest in applications. The following theorem establishes bounds on the prediction error of the OLS post-√LASSO estimator. The analysis accounts for the data-driven choice of components and for possibly having a misspecified model (i.e., T ⊈ T̂). The analysis builds upon the sparsity and rate bounds for the √LASSO estimator, and on a data-dependent empirical process inequality.
Theorem 10 (Performance of OLS post-√LASSO). Suppose condition ASM holds, let c > 1, c̄ = (c + 1)/(c − 1), and m̂ = |T̂ \ T|. If λ ≥ cΛ occurs with probability at least 1 − α, and λ√s ≤ nκ1ρ1 for some ρ1 < 1, then for C ≥ 1, with probability at least 1 − α − 1/C² − 1/[9C² log p], we have

‖β̃ − β0‖2,n ≤ [Cσ/√(φmin(m̂))] √(s/n) + 2cs + √(m̂ log p/n) · 24Cσ √(1 ∨ max_{j=1,...,p} En[xij²ǫi²]) / √(φmin(m̂)) + 1{T ⊈ T̂} · (λ√s/(nκ1)) √Q(β0) · [4(1 + 1/c)/(1 − ρ1²)] · (1 + (1 + 1/c)ρ1²/(1 − ρ1²)).
We note that the random term in the bound above can be controlled in a variety of ways. For example, under conditions M' and R', if log p = o(n^{1/3}), Lemma 10 establishes that

max_{j=1,...,p} En[xij²ǫi²] = 1 + oP(1).
The following corollary summarizes the performance of OLS post-√LASSO under commonly used designs.

Corollary 2 (Asymptotic Performance of OLS post-√LASSO). Suppose conditions ASM, M' and R' hold, let c > 1, c̄ = (c + 1)/(c − 1) and m̂ = |T̂ \ T|. If λ is set as the exact or asymptotic choice in (3.17), α = o(1), φmax(s log n)/φmin(s log n) ≲ 1, s log(p/α) = o(n), and log p = o(n^{1/3}), we have

‖β̃ − β0‖2,n ≲P cs + σ√(s log p/n).

Moreover, if m̂ = o(s) and T ⊆ T̂ with probability going to 1,

‖β̃ − β0‖2,n ≲P cs + σ√(o(s) log p/n) + σ√(s/n),

and if T̂ = T with probability going to 1, we have

‖β̃ − β0‖2,n ≲P cs + σ√(s/n).
Under the conditions of the corollary above, the upper bounds on the rates of convergence of √LASSO and OLS post-√LASSO coincide. This occurs despite the fact that √LASSO may in general fail to correctly select the oracle model T as a subset, that is, T ⊈ T̂. Nonetheless, there is a class of well-behaved models in which the OLS post-√LASSO rate improves upon the rate achieved by √LASSO. More specifically, this occurs if m̂ = oP(s) and T ⊆ T̂ with probability going to 1, or in the case of perfect model selection, when T̂ = T with probability going to 1. Results on LASSO's model selection performance derived in Wainwright [38], Candès and Plan [10], and Belloni and Chernozhukov [3] can be extended to √LASSO based on Theorems 6 and 7. Moreover, under mild conditions, we know from Theorem 8 that the upper bounds on the rates found for √LASSO are sharp, i.e., the rate of convergence cannot be faster than σ√(log p)√(s/n). Thus the use of the post-model-selection estimator leads to a strict improvement in the rate of convergence for these well-behaved models.
4.4. Extreme Cases: Parametric Noiseless and Nonparametric Unbounded Variance. In this section we show that the robustness advantage of √LASSO over LASSO extends to two extreme cases as well: σ + cs = 0 and E[ǫi²] = ∞. Since the traditional choice of the penalty level λ for LASSO depends on σ√(E[ǫi²]), the setting λ = σ√(E[ǫi²]) · 2c√n Φ⁻¹(1 − α/2p) cannot be directly applied in either case. In contrast, the penalty of √LASSO is indirectly self-normalized by the factor √Q(β̂), which is well defined in both cases.
4.4.1. Parametric Noiseless Case. The analysis developed in the previous section immediately covers the case σ = 0 if cs > 0. The case in which cs = 0 as well, so that Q(β0) = 0, allows for exact recovery under less stringent restrictions.

Theorem 11 (Exact Recovery in the Parametric Noiseless Case). Under condition ASM, let σ = 0 and cs = 0. Suppose that λ > 0 obeys the growth restriction λ√s < nκ1. Then we have β̂ = β0.
It is worth mentioning that for any λ > 0, unless β0 = 0, LASSO cannot achieve exact
recovery. Moreover, it is not obvious how to properly set the penalty level for LASSO even
if we knew a priori that it is a parametric noiseless model. In contrast, square-root lasso
intrinsically adjusts the penalty λ by a factor of
√Q(β). Under mild conditions Theorem
7 ensures that
√Q(β) =
√Q(β0) = 0 which allows for the perfect recovery. Also note that
the lower bound derived in Theorem 8 becomes trivially zero.
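As a sanity check on the mechanism behind Theorem 11, the sketch below (illustrative only; the tiny design, the value of λ, and the grid of perturbations are our own choices, not the paper's) builds a noiseless model y = Xβ0 and verifies that the √LASSO objective √(Q̂(β)) + (λ/n)‖β‖1 strictly increases under coordinate perturbations of β0 whenever λ/n < 1: the fit term grows linearly in the perturbation size while the penalty changes by at most (λ/n) times it.

```python
import math

def sqrt_lasso_objective(X, y, beta, lam):
    # sqrt(Q(beta)) + (lam/n)*||beta||_1, with Q(beta) = En[(y_i - x_i'beta)^2]
    n = len(y)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(len(beta))) for i in range(n)]
    q = sum(r * r for r in resid) / n
    return math.sqrt(q) + (lam / n) * sum(abs(b) for b in beta)

# Tiny noiseless design (sigma = 0, c_s = 0): y = X beta0 exactly.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, -1.0],
     [1.0, 1.0, 0.0],
     [-1.0, 0.0, 1.0]]
beta0 = [2.0, -1.0, 0.0]
y = [sum(X[i][j] * beta0[j] for j in range(3)) for i in range(len(X))]

lam = 1.0  # any lam with lam/n < 1 works for this check (here n = 4)
base = sqrt_lasso_objective(X, y, beta0, lam)
# Perturbing any single coordinate raises the objective: the residual term grows
# like |t|*sqrt(En[x_j^2]) while the penalty changes by at most (lam/n)*|t|.
for j in range(3):
    for t in (0.5, -0.5):
        pert = list(beta0)
        pert[j] += t
        assert sqrt_lasso_objective(X, y, pert, lam) > base
```

At β0 itself the fit term vanishes, so the objective reduces to the pure penalty (λ/n)‖β0‖1, which is why exact recovery is possible here while LASSO's squared-loss objective always trades off some shrinkage for any λ > 0.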
4.4.2. Nonparametric Unbounded Variance. Next we turn to the unbounded variance case. Although the analysis of the prediction norm of the estimator goes through with no change, one needs to redefine the model and the oracle properly. Importantly, the exact choice of λ is still valid, but the asymptotic penalty choice of λ no longer applies. In order to accommodate unbounded variance, the main insight is to interpret σ as a scaling factor (and not the standard deviation), and to define the effective standard deviation σ̄ as

σ̄ := σ√(En[ǫi^2]) < ∞.

That allows us to properly define the oracle in finite samples. However, σ̄ → ∞ with probability one since the noise has infinite variance. Therefore, meaningful estimation of β0 is still possible provided that σ̄ does not diverge too quickly. In this case, the oracle estimator needs to be redefined as any solution to

min_{0≤k≤p∧n} min_{‖β‖0=k} En[(fi − xi'β)^2] + σ̄^2 k/n.   (4.18)
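To see why this redefinition is needed, the following simulation sketch (our own illustration; the t(2) generator and the sample sizes are assumptions, not the paper's experiment) draws t(2) noise and computes the effective standard deviation σ̄ = σ√(En[ǫi^2]): it is finite for every fixed n, yet it admits no uniform bound in n since E[ǫi^2] = ∞.

```python
import math
import random

random.seed(0)

def draw_t2():
    # Student-t with 2 degrees of freedom: Z / sqrt(V/2), with V ~ chi-squared(2);
    # a chi-squared(2) variable is Exponential with mean 2.
    z = random.gauss(0.0, 1.0)
    v = 0.0
    while v == 0.0:
        v = random.expovariate(0.5)
    return z / math.sqrt(v / 2.0)

sigma = 1.0  # a scaling factor here, not a standard deviation
second_moments = []
for n in (100, 10_000, 200_000):
    en_eps2 = sum(draw_t2() ** 2 for _ in range(n)) / n
    second_moments.append(en_eps2)
    sigma_bar = sigma * math.sqrt(en_eps2)  # effective standard deviation
    assert math.isfinite(sigma_bar) and sigma_bar > 0.0
# E[eps^2] = infinity for t(2), so En[eps^2] is finite for each fixed n but
# drifts upward (roughly like log n) rather than converging to a constant.
```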
Regarding the choice of the penalty parameter λ, we stress that the asymptotic option no longer applies and we cannot achieve Gaussian-like rates. However, the exact option for λ is still valid. In fact, the scaling parameter σ still cancels out. By inspection of the proof, Lemma 7 can be adjusted to yield ‖En[xi ri]‖∞ ≤ σ̄/√n. That motivates the definition of Λ as

Λ = n ‖En[ǫi xi]/√(En[ǫi^2])‖∞ + √n ≥ n ‖En[(σǫi + ri)xi]/√(En[σ^2 ǫi^2])‖∞.

Thus, the rate of convergence will be affected by how fast En[ǫi^2] diverges and λ/n goes to zero. That is, the final rates will depend on the particular tail properties of the distribution of the noise. The next result establishes finite-sample bounds in the case ǫi ∼ t(2), i = 1, . . . , n.
Theorem 12 (√LASSO prediction norm for ǫi ∼ t(2)). Consider a nonparametric regression model with data {(yi, xi) : i = 1, . . . , n} such that En[xij^2] = 1 for every j = 1, . . . , p, the ǫi ∼ t(2) are i.i.d., and β0 is defined as any solution to (4.18). Let x̄ = max_{1≤j≤p, 1≤i≤n} |xij|, set λ = c(1 + un)Λ(1 − α − 64/√n | X), and suppose λ√s < nκ_c. Then, for any τ ∈ (0, 1), with probability at least 1 − α − τ − 64/√n − 48(1 + un) log(4n/τ)/(n un log n), we have

‖β̂ − β0‖_{2,n} ≤ [2(1 + 1/c)/(1 − (λ√s/[nκ_c])^2)] · (cs + σ(√log(4n/τ) + 2√(2/τ))) · λ√s/(nκ_c),

where

λ ≤ c(1 + un) [ 4x̄ √(2n log(4p/α)) (√log(4n/α) + 2√(2/α)) / √(log(n)/24) + √n ].

Asymptotically, if 1/α = o(log n) and s log(p/α) = o(nκ_c), the result above yields that, with probability 1 − α(1 + o(1)),

‖β̂ − β0‖_{2,n} ≲ x̄ (cs + σ√(log n)) √(s log(p)/n),

where the scaling factor σ < ∞ is fixed. Thus, despite the infinite variance of the noise in the t(2) case, for bounded designs the √LASSO rate of convergence differs from the Gaussian case only by a √(log n) factor.
5. Empirical Performance of√LASSO Relative to LASSO
Next we proceed to evaluate the empirical performance of√LASSO relative to LASSO.
In particular we discuss their computational burden and their estimation performance.
5.1. Computational Performance of√LASSO Relative to LASSO. Since model se-
lection is particularly relevant in high-dimensional problems, the computational tractability
of the optimization problem associated with√LASSO is an important issue. It will follow
that the optimization problem associated with √LASSO can be cast as a tractable conic programming problem. Conic programming consists of the following optimization problem:

min_x c(x)  s.t.  A(x) = b,  x ∈ K,

where K is a cone, c is a linear functional, A is a linear operator, and b is an element of the codomain of A. We are particularly interested in the case where K is also convex. Convex conic programming problems have greatly extended the scope of applications of linear programming problems¹ in several fields, including optimal control, learning theory, eigenvalue optimization, combinatorial optimization, and many others. Under mild regularity conditions, the duality theory for conic programs has been fully developed and allows for the characterization of optimality conditions via dual variables, much like in linear programming.
In the past two decades, the study of the computational complexity of conic programming and the development of efficient algorithms for it have played a central role in the optimization community. In particular, for the case of self-dual cones, which encompasses the non-negative orthant, second-order cones, and the cone of positive semi-definite matrices, interior point methods have been highly specialized. A sound theoretical foundation, establishing polynomial computational complexity [22, 23], and efficient software implementations [33] have made large instances of these problems computationally tractable. More recently, first-order methods have also been proposed to approximately solve even larger instances of structured conic problems [20, 21, 18].
It follows that (1.3) can be written as a conic programming problem whose relevant cone is self-dual. Letting Q_{n+1} := {(t, v) ∈ IR × IR^n : t ≥ ‖v‖} denote the second-order cone in IR^{n+1}, we can recast (1.3) as the following conic program:

min_{t,v,β+,β−}  t/√n + (λ/n) Σ_{j=1}^p (βj+ + βj−)
s.t.  vi = yi − xi'β+ + xi'β−,  i = 1, . . . , n,
      (t, v) ∈ Q_{n+1},  β+ ≥ 0,  β− ≥ 0.   (5.19)

¹The relevant cone in linear programming is the non-negative orthant, min_w {c'w : Aw = b, w ∈ IR^k_+}.
Conic duality immediately yields the following dual problem:

max_a  En[yi ai]
s.t.  |En[xij ai]| ≤ λ/n,  j = 1, . . . , p,
      ‖a‖ ≤ √n.   (5.20)

From a statistical perspective, the dual variables represent the normalized residuals. Thus the dual problem maximizes the correlation between the dual variable a and the outcome, subject to the constraint that a is approximately uncorrelated with the regressors. It follows that these dual variables play a role in deriving necessary conditions for a component β̂j to be non-zero and therefore in sparsity bounds.
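The equivalence between the √LASSO objective and the conic program (5.19) can be checked numerically: at the natural feasible point t = ‖v‖, vi = yi − xi'β, β = β+ − β−, the conic objective coincides with √(Q̂(β)) + (λ/n)‖β‖1. The sketch below (with arbitrary random data of our own choosing) verifies this identity.

```python
import math
import random

random.seed(1)
n, p, lam = 30, 5, 2.0
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
beta = [random.gauss(0, 1) for _ in range(p)]

# Original sqrt-LASSO objective: sqrt(En[(y_i - x_i'beta)^2]) + (lam/n)*||beta||_1
resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
obj_direct = math.sqrt(sum(r * r for r in resid) / n) + (lam / n) * sum(abs(b) for b in beta)

# Conic representation (5.19): split beta = beta_plus - beta_minus, set
# v_i = y_i - x_i'beta, and take t = ||v|| so that (t, v) lies in the
# second-order cone Q_{n+1}.
beta_plus = [max(b, 0.0) for b in beta]
beta_minus = [max(-b, 0.0) for b in beta]
v = resid
t = math.sqrt(sum(vi * vi for vi in v))
obj_conic = t / math.sqrt(n) + (lam / n) * sum(bp + bm for bp, bm in zip(beta_plus, beta_minus))

# t/sqrt(n) = sqrt(Q(beta)) and sum(beta+ + beta-) = ||beta||_1, so the two match.
assert abs(obj_direct - obj_conic) < 1e-9
```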
The fact that √LASSO can be formulated as a convex conic programming problem allows several computational methods tailored to conic problems to be used to compute the √LASSO estimator. In this section we compare three different methods to compute √LASSO with their counterparts for LASSO. We note that these methods have different initializations and stopping criteria that could impact the running times significantly. Therefore we do not aim to compare the different methods with one another; instead, we focus on comparing the performance of each method on LASSO and on √LASSO, since the same initialization and stopping criteria are used.
Table 1 illustrates that the average computational times to solve the LASSO and √LASSO optimization problems are comparable. Table 1 also reinforces the typical behavior of these methods. As the size of the problem increases, the running time of the interior point method grows faster than that of the first-order methods. Simple componentwise methods are particularly effective when the solution is highly sparse, which is the case for the parametric design used here. We emphasize that the performance of each method depends on the particular design and on the choice of λ.
5.2. Estimation Performance of √LASSO Relative to LASSO. In this section we use Monte Carlo experiments to assess the finite-sample performance of the following estimators:
                     Componentwise   First Order   Interior Point
n = 100, p = 500
  LASSO                  0.2173         10.99           2.545
  √LASSO                 0.3268          7.345          1.645
n = 200, p = 1000
  LASSO                  0.6115         19.84          14.20
  √LASSO                 0.6448         19.96           8.291
n = 400, p = 2000
  LASSO                  2.625          84.12         108.9
  √LASSO                 2.687          77.65          62.86

Table 1. In these instances we had s = 5 and σ = 1, and each value was computed by averaging over 100 simulations.
• the (infeasible) LASSO, which knows σ (which is unknown outside the experiments),
• OLS post LASSO, which applies OLS to the model selected by (infeasible) LASSO,
• √LASSO, which does not know σ, and
• OLS post √LASSO, which applies OLS to the model selected by √LASSO.
We set the penalty level for LASSO as the standard choice in the literature, λ = 2cσ√n Φ^{-1}(1 − α/2p), and for √LASSO according to the asymptotic option, λ = c√n Φ^{-1}(1 − α/2p), with 1 − α = .95 and c = 1.1 for both estimators. (The results obtained using the exact option are similar to those with the penalty level set according to the asymptotic option, so we only report the results for the latter.)
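These penalty choices are straightforward to compute from the normal quantile; a minimal sketch (the function names are ours):

```python
import math
from statistics import NormalDist

def lasso_penalty(n, p, sigma, alpha=0.05, c=1.1):
    # standard LASSO choice: lam = 2*c*sigma*sqrt(n)*Phi^{-1}(1 - alpha/(2p))
    return 2 * c * sigma * math.sqrt(n) * NormalDist().inv_cdf(1 - alpha / (2 * p))

def sqrt_lasso_penalty(n, p, alpha=0.05, c=1.1):
    # asymptotic sqrt-LASSO choice: lam = c*sqrt(n)*Phi^{-1}(1 - alpha/(2p));
    # note that this choice is free of the unknown sigma
    return c * math.sqrt(n) * NormalDist().inv_cdf(1 - alpha / (2 * p))

lam_lasso = lasso_penalty(n=100, p=500, sigma=1.0)
lam_sqrt = sqrt_lasso_penalty(n=100, p=500)
# With sigma = 1, the two choices differ exactly by the factor 2*sigma.
assert abs(lam_lasso - 2 * lam_sqrt) < 1e-9
```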
We use the linear regression model stated in the introduction as the data-generating process, with either standard normal or t(4) errors:

(a) ǫi ∼ N(0, 1) or (b) ǫi ∼ t(4)/√2,

so that E[ǫi^2] = 1 in either case. We set the regression function as

f(xi) = xi'β*0, where β*0j = 1/j^{3/2}, j = 1, . . . , p.   (5.21)

The scaling parameter σ varies between 0.25 and 5. For a fixed design, as the scaling parameter σ increases, the number s of non-zero components in the oracle vector decreases. The number of regressors is p = 500, the sample size is n = 100, and we used 100 simulations for each design. We generate regressors as xi ∼ N(0, Σ) with the Toeplitz correlation matrix Σjk = (1/2)^{|j−k|}.
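One convenient way to generate rows with the Toeplitz correlation Σjk = (1/2)^{|j−k|} is an AR(1) recursion; this is our own illustration of the design, not necessarily how the paper's simulations were implemented:

```python
import math
import random

random.seed(2)

def toeplitz_ar1_row(p, rho=0.5):
    # x_j = rho*x_{j-1} + sqrt(1 - rho^2)*z_j gives Corr(x_j, x_k) = rho^{|j-k|},
    # i.e. the Toeplitz matrix Sigma_{jk} = (1/2)^{|j-k|} when rho = 1/2.
    x = [random.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        x.append(rho * x[-1] + math.sqrt(1 - rho * rho) * random.gauss(0.0, 1.0))
    return x

n, p = 20000, 5
X = [toeplitz_ar1_row(p) for _ in range(n)]

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ca = [u - ma for u in a]
    cb = [u - mb for u in b]
    num = sum(u * v for u, v in zip(ca, cb))
    return num / math.sqrt(sum(u * u for u in ca) * sum(v * v for v in cb))

def col(j):
    return [row[j] for row in X]

# Empirical correlations should be close to (1/2)^{|j-k|}.
assert abs(corr(col(0), col(1)) - 0.5) < 0.05
assert abs(corr(col(0), col(2)) - 0.25) < 0.05
```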
[Figure 1 about here. Left panel: ǫi ∼ N(0,1); right panel: ǫi ∼ t(4); legend: LASSO, √LASSO, OLS post LASSO, OLS post √LASSO.]

Figure 1. The average empirical risk of the estimators as a function of the scaling parameter σ.
[Figure 2 about here. Left panel: ǫi ∼ N(0,1); right panel: ǫi ∼ t(4); legend: LASSO, √LASSO, OLS post LASSO, OLS post √LASSO.]

Figure 2. The norm of the bias of the estimators as a function of the scaling parameter σ.
[Figure 3 about here. Left panel: ǫi ∼ N(0,1); right panel: ǫi ∼ t(4); legend: LASSO and OLS post LASSO, √LASSO and OLS post √LASSO.]

Figure 3. The average number of regressors selected as a function of the scaling parameter σ.
We present the results of the computational experiments for designs (a) and (b) in Figures 1, 2, and 3. The left plot of each figure reports the results for normal errors, and the right plot reports the results for t(4) errors. For each model, the figures show the following quantities as a function of the scaling parameter σ for each estimator β̂:
• Figure 1 – the average empirical risk, E[En[(xi'(β̂ − β0))^2]],
• Figure 2 – the norm of the bias, ‖E[β̂ − β0]‖, and
• Figure 3 – the average number of regressors selected, E|support(β̂)|.
Figure 1, left panel, shows the empirical risk for the Gaussian case. We see that, for a wide range of the scaling parameter σ, LASSO and √LASSO perform similarly in terms of empirical risk, although standard LASSO somewhat outperforms √LASSO. At the same time, OLS post LASSO slightly outperforms OLS post √LASSO for larger signal strengths. This is expected, since √LASSO over-regularizes relative to LASSO in order to simultaneously estimate σ (it essentially uses √(Q̂(β̂)) as an estimate of σ). In the nonparametric model considered here, the coefficients are not well separated from zero. These two issues combined lead to a smaller selected support.
Overall, the empirical performance of √LASSO and OLS post √LASSO achieves its goal. Despite not knowing σ, √LASSO performs comparably to the standard LASSO that knows σ. These results are in close agreement with our theoretical results, which state that the upper bounds on the empirical risk for √LASSO asymptotically approach the analogous bounds for standard LASSO.
Figures 2 and 3 provide additional insight into the performance of the estimators. On the one hand, Figure 2 shows that the finite-sample differences in empirical risk between LASSO and √LASSO arise primarily from √LASSO having a larger bias than standard LASSO. This bias arises because √LASSO uses an effectively heavier penalty. Figure 3 shows that this heavier penalty translates into √LASSO selecting a smaller support than LASSO on average.
Finally, Figure 1, right panel, shows the empirical risk for the t(4) case. We see that the
results for the Gaussian case carry over to the t(4) case. In fact, the performance of LASSO
and√LASSO under t(4) errors nearly coincides with their performance under Gaussian
errors. This is exactly what is predicted by our theoretical results.
Appendix A. Probability Inequalities
A.1. Moment Inequalities. We begin with the Rosenthal and von Bahr–Esseen inequalities.

Lemma 2 (Rosenthal Inequality). Let X1, . . . , Xn be independent zero-mean random variables. Then for r ≥ 2,

E[ |Σ_{i=1}^n Xi|^r ] ≤ C(r) max{ Σ_{i=1}^n E[|Xi|^r], (Σ_{i=1}^n E[Xi^2])^{r/2} }.
Corollary 3 (Rosenthal LLN). Let r ≥ 2, and consider the case of independent and identically distributed zero-mean variables Xi with E[Xi^2] = 1 and E[|Xi|^r] bounded by C. Then for any ℓn > 0,

Pr( |Σ_{i=1}^n Xi|/n > ℓn n^{−1/2} ) ≤ 2C(r)C/ℓn^r,

where C(r) is a constant depending only on r.
Remark. To verify the corollary, note that by Rosenthal's inequality, E[|Σ_{i=1}^n Xi|^r] ≤ C(r)C n^{r/2}. By the Markov inequality,

P( |Σ_{i=1}^n Xi|/n > c ) ≤ C(r)C n^{r/2}/(c^r n^r) ≤ C(r)C/(c^r n^{r/2}),

so the corollary follows with c = ℓn n^{−1/2}. We refer to [25] for complete proofs.
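A quick Monte Carlo illustration of the corollary (our own sketch, using Gaussian Xi, which have all moments bounded): the empirical frequency of the event |Σi Xi|/n > ℓn n^{−1/2} drops rapidly as ℓn grows, consistent with the ℓn^{−r} bound.

```python
import math
import random

random.seed(3)
n, reps = 50, 5000

def exceed_freq(ell):
    # empirical frequency of |sum X_i|/n > ell * n^{-1/2} for i.i.d. N(0,1) X_i,
    # which the Rosenthal LLN bounds by a constant times ell^{-r}
    count = 0
    for _ in range(reps):
        s = sum(random.gauss(0.0, 1.0) for _ in range(n))
        if abs(s) / n > ell / math.sqrt(n):
            count += 1
    return count / reps

f2, f3 = exceed_freq(2.0), exceed_freq(3.0)
# The tail frequency decays quickly as the threshold ell grows.
assert f3 < f2 < 0.1
```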
Lemma 3 (von Bahr–Esseen Inequality). Let X1, . . . , Xn be independent zero-mean random variables. Then for 1 ≤ r ≤ 2,

E[ |Σ_{i=1}^n Xi|^r ] ≤ (2 − n^{−1}) · Σ_{k=1}^n E[|Xk|^r].

We refer to [36] for proofs.
Corollary 4 (von Bahr–Esseen LLN). Let r ∈ [1, 2], and consider the case of independent and identically distributed zero-mean variables Xi with E|Xi|^r bounded by C. Then for any ℓn > 0,

Pr( |Σ_{i=1}^n Xi|/n > ℓn n^{−(1−1/r)} ) ≤ 2C/ℓn^r.

Remark. By the Markov and von Bahr–Esseen inequalities,

Pr( |Σ_{i=1}^n Xi|/n > c ) ≤ E[|Σ_{i=1}^n Xi|^r]/(c^r n^r) ≤ (2n − 1)E[|Xi|^r]/(c^r n^r) ≤ 2C/(c^r n^{r−1}),

which implies the corollary with c = ℓn n^{−(1−1/r)}.
A.2. Moderate Deviations for Sums of Independent Random Variables. Next we consider the Slastnikov–Rubin–Sethuraman moderate deviation theorem.

Let Xni, i = 1, . . . , kn; n ≥ 1, be a double sequence of row-wise independent random variables with E[Xni] = 0 and E[Xni^2] < ∞ for i = 1, . . . , kn; n ≥ 1, and let Bn^2 = Σ_{i=1}^{kn} E[Xni^2] → ∞ as n → ∞. Let

Fn(x) = Pr( Σ_{i=1}^{kn} Xni < x Bn ).
Lemma 4 (Slastnikov, Theorem 1.1). If for sufficiently large n and some positive constant c,

Σ_{i=1}^{kn} E[ |Xni|^{2+c^2} ρ(|Xni|) log^{−(1+c^2)/2}(3 + |Xni|) ] ≤ g(Bn) Bn^2,

where ρ(t) is a slowly varying function monotonically growing to infinity and g(t) = o(ρ(t)) as t → ∞, then

1 − Fn(x) ∼ 1 − Φ(x),  Fn(−x) ∼ Φ(−x),  n → ∞,

uniformly in the region 0 ≤ x ≤ c√(log Bn^2).
Corollary 5 (Slastnikov, Rubin–Sethuraman). If q > c^2 + 2 and

Σ_{i=1}^{kn} E[|Xni|^q] ≤ K Bn^2,

then there is a sequence γn → 1 such that

| (1 − Fn(x) + Fn(−x)) / (2(1 − Φ(x))) − 1 | ≤ γn − 1 → 0,  n → ∞,

uniformly in the region 0 ≤ x ≤ c√(log Bn^2).

Remark. Rubin and Sethuraman derived the corollary for x = t√(log Bn^2) for fixed t. Slastnikov's result adds uniformity and relaxes the moment assumption. We refer to [28] for proofs.
A.3. Moderate Deviations for Self-Normalized Sums. We shall be using the following
result – Theorem 7.4 in [12].
Let X1, . . . , Xn be independent, mean-zero variables, and let

Sn = Σ_{i=1}^n Xi,  Vn^2 = Σ_{i=1}^n Xi^2.

For 0 < δ ≤ 1 set

Bn^2 = Σ_{i=1}^n E[Xi^2],  Ln,δ = Σ_{i=1}^n E[|Xi|^{2+δ}],  dn,δ = Bn / Ln,δ^{1/(2+δ)}.

Then, uniformly in 0 ≤ x ≤ dn,δ,

Pr(Sn/Vn ≥ x) / Φ̄(x) = 1 + O(1) ((1 + x)/dn,δ)^{2+δ},
Pr(Sn/Vn ≤ −x) / Φ(−x) = 1 + O(1) ((1 + x)/dn,δ)^{2+δ},

where the terms O(1) are bounded in absolute value by a universal constant A, and Φ̄ := 1 − Φ.
Application of this result gives the following lemma:
Lemma 5 (Moderate Deviations for Self-Normalized Sums). Let X1,n, . . . , Xn,n be a triangular array of i.i.d. zero-mean random variables. Suppose that

Mn = (E[X1,n^2])^{1/2} / (E[|X1,n|^3])^{1/3} > 0

and that n^{1/6} Mn/ℓn ≥ 1 for some ℓn → ∞. Then, uniformly in 0 ≤ x ≤ n^{1/6} Mn/ℓn − 1, the quantities

Sn,n = Σ_{i=1}^n Xi,n,  Vn,n^2 = Σ_{i=1}^n Xi,n^2

obey

| Pr(|Sn,n/Vn,n| ≥ x) / (2Φ̄(x)) − 1 | ≤ A/ℓn^3 → 0.

Proof. This follows by applying the quoted theorem to the i.i.d. case with δ = 1 and dn,1 = n^{1/6} Mn. The stated error bound follows from the triangle inequality and the conditions on ℓn and Mn. □
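The practical content of Lemma 5 can be illustrated by simulation (our own sketch, with centered exponential noise, which is skewed but has a finite third moment): the self-normalized ratio Sn/Vn has nearly Gaussian tails even though the summands are far from normal.

```python
import math
import random

random.seed(4)
n, reps = 200, 4000
hits = 0
for _ in range(reps):
    # skewed, mean-zero summands: Exponential(1) - 1
    x = [random.expovariate(1.0) - 1.0 for _ in range(n)]
    s = sum(x)
    v = math.sqrt(sum(xi * xi for xi in x))
    if abs(s / v) >= 1.96:
        hits += 1
freq = hits / reps
# Despite the skewness, the self-normalized ratio S_n/V_n has nearly Gaussian
# tails: P(|S_n/V_n| >= 1.96) should sit close to 2*(1 - Phi(1.96)) = 0.05.
assert 0.01 < freq < 0.12
```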
A.4. Data-dependent Probabilistic Inequality. In this section we derive a data-dependent probability inequality for empirical processes indexed by a finite class of functions. In what follows, for a random variable X let q(X, 1 − τ) denote its (1 − τ)-quantile. For a class of functions F we define ‖X‖_F = sup_{f∈F} |f(X)|. Also, for random variables Z1, . . . , Zn and a function f, define ‖f‖_{Pn,2} = √(En[f(Zi)^2]), Gn(f) = (1/√n) Σ_{i=1}^n {f(Zi) − E[f(Zi)]}, and Gn°(f) = (1/√n) Σ_{i=1}^n εi f(Zi), where the εi are independent Rademacher random variables.

In order to prove a bound on tail probabilities of a general separable empirical process, we need to go through a symmetrization argument. Since we use a data-dependent threshold, we need an appropriate extension of the classical symmetrization lemma to allow for this. Let us call a threshold function x : IR^n ↦ IR k-sub-exchangeable if, for any v, w ∈ IR^n and any vectors ṽ, w̃ created by the pairwise exchange of components of v with components of w, we have x(ṽ) ∨ x(w̃) ≥ [x(v) ∨ x(w)]/k. Several functions satisfy this property; in particular, x(v) = ‖v‖ with k = √2, constant functions with k = 1, and x(v) = ‖v‖∞ with k = 1. The following result generalizes the standard symmetrization lemma for probabilities (Lemma 2.3.7 of [35]) to the case of a random threshold x that is sub-exchangeable. The proof of Lemma 6 can be found in [4].
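The sub-exchangeability claims for ‖·‖ and ‖·‖∞ are easy to verify numerically; the sketch below checks them on random pairs of vectors and random pairwise exchanges.

```python
import math
import random

random.seed(5)

def pairwise_exchange(v, w, mask):
    # where mask[i] is True, the i-th components of v and w trade places
    vt = [w[i] if mask[i] else v[i] for i in range(len(v))]
    wt = [v[i] if mask[i] else w[i] for i in range(len(v))]
    return vt, wt

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

def norminf(v):
    return max(abs(t) for t in v)

for _ in range(200):
    v = [random.gauss(0, 1) for _ in range(8)]
    w = [random.gauss(0, 1) for _ in range(8)]
    mask = [random.random() < 0.5 for _ in range(8)]
    vt, wt = pairwise_exchange(v, w, mask)
    # x(v) = ||v|| is sqrt(2)-sub-exchangeable: the sum of squared norms is
    # preserved by exchanges, so the max can drop by at most a factor sqrt(2).
    assert max(norm2(vt), norm2(wt)) >= max(norm2(v), norm2(w)) / math.sqrt(2) - 1e-12
    # x(v) = ||v||_inf is 1-sub-exchangeable: the pooled set of components is
    # unchanged, so the max over both vectors is exactly preserved.
    assert max(norminf(vt), norminf(wt)) >= max(norminf(v), norminf(w)) - 1e-12
```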
Lemma 6 (Symmetrization with Data-dependent Thresholds). Consider arbitrary independent stochastic processes Z1, . . . , Zn and arbitrary functions µ1, . . . , µn : F ↦ IR. Let x(Z) = x(Z1, . . . , Zn) be a k-sub-exchangeable random variable and, for any τ ∈ (0, 1), let qτ denote the τ-quantile of x(Z), with p̄τ := P(x(Z) ≤ qτ) ≥ τ and pτ := P(x(Z) < qτ) ≤ τ. Then

P( ‖Σ_{i=1}^n Zi‖_F > x0 ∨ x(Z) ) ≤ (4/p̄τ) P( ‖Σ_{i=1}^n εi(Zi − µi)‖_F > [x0 ∨ x(Z)]/(4k) ) + pτ,

where x0 is any constant such that inf_{f∈F} P( |Σ_{i=1}^n Zi(f)| ≤ x0/2 ) ≥ 1 − p̄τ/2.
Theorem 13 (Maximal Inequality for Empirical Processes). Let

qD(F, 1 − τ) = sup_{f∈F} q(|Gn(f)|, 1 − τ) ≤ sup_{f∈F} √(var_P(Gn(f))/τ),

and consider the data-dependent quantity

en(F, Pn) = √(2 log |F|) sup_{f∈F} ‖f‖_{Pn,2}.

Then, for any C ≥ 1 and τ ∈ (0, 1), we have

sup_{f∈F} |Gn(f)| ≤ qD(F, 1 − τ/2) ∨ 4√2 C en(F, Pn)

with probability at least 1 − τ − 4 exp(−(C^2 − 1) log |F|)/τ.
Proof. Step 1. (Main Step.) In this step we prove the main result. First, recall that en(F, Pn) := √(2 log |F|) sup_{f∈F} ‖f‖_{Pn,2}, and note that sup_{f∈F} ‖f‖_{Pn,2} is √2-sub-exchangeable by Step 2 below. By the symmetrization Lemma 6 we obtain

P{ sup_{f∈F} |Gn(f)| > 4√2 C en(F, Pn) ∨ qD(F, 1 − τ/2) } ≤ (4/τ) P{ sup_{f∈F} |Gn°(f)| > C en(F, Pn) } + τ.

Thus a union bound yields

P{ sup_{f∈F} |Gn(f)| > 4√2 C en(F, Pn) ∨ qD(F, 1 − τ/2) } ≤ τ + (4|F|/τ) sup_{f∈F} P{ |Gn°(f)| > C en(F, Pn) }.   (A.22)
We then condition on the values of Z1, . . . , Zn, denoting the conditional probability measure by Pε. Conditional on Z1, . . . , Zn, by the Hoeffding inequality the symmetrized process Gn° is sub-Gaussian with respect to the L2(Pn) norm, namely, for f ∈ F, Pε{|Gn°(f)| > x} ≤ 2 exp(−x^2/[2‖f‖_{Pn,2}^2]). Hence, we can bound

Pε{ |Gn°(f)| ≥ C en(F, Pn) | Z1, . . . , Zn } ≤ 2 exp(−C^2 en(F, Pn)^2/[2‖f‖_{Pn,2}^2]) ≤ 2 exp(−C^2 log |F|).

Taking the expectation over Z1, . . . , Zn does not affect the bound on the right-hand side. Plugging this bound into relation (A.22) yields the result.
Step 2. (Auxiliary calculations.) To establish that sup_{f∈F} ‖f‖_{Pn,2} is √2-sub-exchangeable, let Z̃ and Ỹ be created by exchanging any components of Z with the corresponding components of Y. Then

√2 ( sup_{f∈F} ‖f‖_{Pn(Z̃),2} ∨ sup_{f∈F} ‖f‖_{Pn(Ỹ),2} ) ≥ ( sup_{f∈F} ‖f‖_{Pn(Z̃),2}^2 + sup_{f∈F} ‖f‖_{Pn(Ỹ),2}^2 )^{1/2}
≥ ( sup_{f∈F} En[f(Z̃i)^2] + En[f(Ỹi)^2] )^{1/2} = ( sup_{f∈F} En[f(Zi)^2] + En[f(Yi)^2] )^{1/2}
≥ ( sup_{f∈F} ‖f‖_{Pn(Z),2}^2 ∨ sup_{f∈F} ‖f‖_{Pn(Y),2}^2 )^{1/2} = sup_{f∈F} ‖f‖_{Pn(Z),2} ∨ sup_{f∈F} ‖f‖_{Pn(Y),2}. □
Corollary 6 (Data-Dependent Probability Inequality). Let the ǫi be i.i.d. random variables such that E[ǫi] = 0 and E[ǫi^2] = σ^2 for i = 1, . . . , n. Conditional on x1, . . . , xn ∈ IR^p, we have that for any C ≥ 1, with probability at least 1 − 1/[9C^2 log p],

‖En[xi ǫi]‖∞ ≤ 24C √(log(p)/n) max_{j=1,...,p} { √(En[ǫi^2 xij^2]) ∨ √(σ^2 En[xij^2]) }.

Proof of Corollary 6. Consider the separable empirical process induced by ‖En[xi ǫi]‖∞, i.e. the class of functions F = {ǫi xij : j ∈ {1, . . . , p}}, so that √n ‖En[xi ǫi]‖∞ = sup_{f∈F} |Gn(f)|. Define the data-dependent quantity

en(F, Pn) = √(2 log p) max_{j=1,...,p} √(En[ǫi^2 xij^2]).

Then, by Theorem 13, for any constant C ≥ 1,

sup_{f∈F} |Gn(f)| ≤ qD(F, 1 − τ/2) ∨ 4√2 C en(F, Pn)

with probability at least 1 − τ − 4 exp(−(C^2 − 1) log p)/τ. Picking τ = 1/[2C^2 log p], we have by Chebyshev's inequality

qD(F, 1 − τ/2) ≤ max_{j=1,...,p} √(E[ǫi^2 xij^2]) / √(τ/2) = 2C √(log p) max_{j=1,...,p} √(E[ǫi^2 xij^2]).

Setting C ≥ 3, we have with probability at least 1 − 1/[C^2 log p]

sup_{f∈F} |Gn(f)| ≤ ( (6C/3) √(log p) max_{j=1,...,p} √(E[ǫi^2 xij^2]) ) ∨ ( (24C/3) √(log p) max_{j=1,...,p} √(En[ǫi^2 xij^2]) ).

(Note that if p ≤ 2 the statement is trivial since the claimed probability bound is vacuous.) □
A.5. Bounds via Symmetrization. Next we proceed to use symmetrization arguments to bound the empirical process. Let ‖f‖_{Pn,2} = √(En[f(Xi)^2]), Gn(f) = √n En[f(Xi) − E[f(Xi)]], and for a random variable Z let q(Z, 1 − τ) denote its (1 − τ)-quantile.

Theorem 14 (Maximal Inequality via Symmetrization). Let

qS(F, 1 − τ) = q( sup_{f∈F} ‖f‖_{Pn,2}, 1 − τ ).

Then, for any τ ∈ (0, 1) and δ ∈ (0, 1), we have

sup_{f∈F} |Gn(f)| ≤ 4√(2 log(2|F|/δ)) qS(F, 1 − τ)

with probability at least 1 − τ − δ.
Proof. Let en(F) = √(2 log(2|F|/δ)) qS(F, 1 − τ) and define the event E = { sup_{f∈F} √(En[f^2]) ≤ qS(F, 1 − τ) }; by definition P(E) ≥ 1 − τ. By the symmetrization Lemma 2.3.7 of [35] we obtain

P{ sup_{f∈F} |Gn(f)| > 4en(F) } ≤ 4 P{ sup_{f∈F} |Gn°(f)| > en(F) } ≤ 4 P{ sup_{f∈F} |Gn°(f)| > en(F) | E } + τ.

Thus a union bound yields

P{ sup_{f∈F} |Gn(f)| > 4en(F) } ≤ τ + |F| sup_{f∈F} P{ |Gn°(f)| > en(F) | E }.   (A.23)
We then condition on the values of Z1, . . . , Zn and on E, denoting the conditional probability measure by Pε. Conditional on Z1, . . . , Zn, by the Hoeffding inequality the symmetrized process Gn° is sub-Gaussian with respect to the L2(Pn) norm, namely, for f ∈ F, Pε{|Gn°(f)| > x} ≤ 2 exp(−x^2/[2‖f‖_{Pn,2}^2]). Hence, we can bound

Pε{ |Gn°(f)| ≥ en(F) | Z1, . . . , Zn, E } ≤ 2 exp(−en(F)^2/[2‖f‖_{Pn,2}^2]) ≤ 2 exp(−log(2|F|/δ)).

Taking the expectation over Z1, . . . , Zn does not affect the bound on the right-hand side. Plugging in this bound yields the result. □
Appendix B. Technical Lemmas
We begin with two technical lemmas that control the impact of the approximation errors
in the analysis.
Lemma 7. Under condition ASM we have

‖En[xi ri]‖∞ ≤ min{ σ/√n, cs }.

Proof. First note that for every j = 1, . . . , p we have |En[xij ri]| ≤ √(En[xij^2] En[ri^2]) = cs. Next, by the definition of β0 in (2.7), for j ∈ T we have En[xij(fi − xi'β0)] = En[xij ri] = 0, since β0 is a minimizer over the support of β0. For j ∈ T^c we have that for any t ∈ IR,

En[(fi − xi'β0)^2] + σ^2 s/n ≤ En[(fi − xi'β0 − t xij)^2] + σ^2 (s + 1)/n.

Therefore, for any t ∈ IR we have

−σ^2/n ≤ En[(fi − xi'β0 − t xij)^2] − En[(fi − xi'β0)^2] = −2t En[xij(fi − xi'β0)] + t^2 En[xij^2].

Minimizing the right-hand side over t, with the minimum attained at t* = En[xij(fi − xi'β0)] (using En[xij^2] = 1), we obtain −σ^2/n ≤ −(En[xij(fi − xi'β0)])^2, or equivalently, |En[xij(fi − xi'β0)]| ≤ σ/√n. □
Lemma 8. Assume that condition ASM holds and that there is q > 2 such that E[|ǫ|^q] < ∞. Then we have

P( √(En[σ^2 ǫi^2]) > (1 + un)√(En[(σǫi + ri)^2]) | X ) ≤ min_{v∈(0,1)} { ψ(v) + 2(1 + un)/[(1 − v) un n] },

where ψ(v) := (Cq E[|ǫ|^{q∨4}])/(v^q n^{q/4}) ∧ (2 E[|ǫ|^q])/(n^{1∧(q/2−1)} v^{q/2}).
Proof. Let an = [(1 + un)^2 − 1]/(1 + un)^2. We have that

P( En[σ^2 ǫi^2] > (1 + un)^2 En[(σǫi + ri)^2] | X ) = P( 2En[σǫi ri] < −cs^2 − an En[σ^2 ǫi^2] | X ).   (B.24)

By Lemma 9 we have

Pr( √(En[ǫi^2]) < 1 − v | F0 = F ) ≤ Pr( |En[ǫi^2] − 1| > v | F0 = F ) ≤ ψ(v).

Thus,

P( En[σ^2 ǫi^2] > (1 + un)^2 En[(σǫi + ri)^2] | X ) ≤ ψ(v) + P( 2En[σǫi ri] < −cs^2 − an σ^2(1 − v) | X ).

Since E[(2En[σǫi ri])^2] = 4σ^2 cs^2/n, by the Chebyshev inequality we have

P( √(En[σ^2 ǫi^2]) > (1 + un)√(En[(σǫi + ri)^2]) | X ) ≤ ψ(v) + (4σ^2 cs^2/n)/(cs^2 + an σ^2(1 − v))^2 ≤ ψ(v) + 2(1 + un)/[(1 − v) un n].

The result follows by minimizing over v ∈ (0, 1). □
Next we focus on controlling the deviations of the empirical second moment of the noise.

Lemma 9. Assume that condition ASM holds and that there is q > 2 such that E[|ǫ|^q] < ∞. Then there is a constant Cq, depending only on q, such that for v > 0 we have

Pr( |En[ǫi^2] − 1| > v | F0 = F ) ≤ ψ(v) := (Cq E[|ǫi|^{q∨4}])/(v^q n^{q/4}) ∧ (2 E[|ǫi|^q])/(n^{1∧(q/2−1)} v^{q/2}).

Proof. The bound follows by applying either Rosenthal's inequality [26] to the variables ǫi^2 − 1 in the case q > 4, or the von Bahr–Esseen inequality [37] in the case 2 < q ≤ 4. □
Lemma 10. Under conditions ASM, M' and R', we have that, with probability at least 1 − τ1 − τ2,

max_{1≤j≤p} En[xij^2(ǫi^2 − 1)] ≤ 4√(2 log(2p/τ1)/n) (E[ǫi^q]/τ2)^{4/q} max_{1≤j≤p} (En[xij^8])^{1/4}.
Proof. For a random variable Z, let q(Z, 1 − α) denote its (1 − α)-quantile, and let

q̄(1 − τ) = q( max_{1≤j≤p} En[xij^4 ǫi^4], 1 − τ ) ≤ max_{1≤j≤p} √(En[xij^8]) √(q(En[ǫi^8], 1 − τ)).

In particular, by Theorem 14, we have

P( max_{1≤j≤p} En[xij^2(ǫi^2 − 1)] > 4√(2 log(2p/τ1)/n) √(q̄(1 − τ2)) ) ≤ τ1 + τ2.

Under Condition M', by the Markov inequality with k = q/8 > 1, we have q(En[ǫi^8], 1 − τ) ≤ (E[ǫi^q])^{8/q}/τ^{8/q}. □
Lemma 11. Under conditions ASM, M' and R', assume that

n^{1/6} ≥ ℓn max_{1≤j≤p} ( En[|xij|^3] E[|ǫi|^3] )^{1/3}.

For τ ∈ (0, 1), let the constants c1, c2 > 1 satisfy

(i) c1 ≥ 1 + 4√(2 log(6p/τ)/n) (3E[ǫi^q]/τ)^{4/q} max_{1≤j≤p} (En[xij^8])^{1/4};
(ii) c2 ≥ 1/( 1 − (3 Cq E[|ǫi|^{q∨4}])^{1/q} / (τ^{1/q} n^{1/4}) ).

Then there is a universal constant A such that, uniformly in t ≥ 0 with

t + 1 ≤ n^{1/6}/[ℓn max_{1≤j≤p} (En[|xij|^3] E[|ǫi|^3])^{1/3}],

we have

P( ‖√n En[xi ǫi]/√(En[ǫi^2])‖∞ > √(c1 c2) t ) ≤ 2p Φ̄(t)(1 + A/ℓn^3) + τ.
Proof. Let t̄ = 1/[ℓn max_{1≤j≤p}(En[|xij|^3] E[|ǫi|^3])^{1/3}] > 0 and τ1 = τ2 = τ3 = τ/3. Then

P( ‖√n En[xi ǫi]/√(En[ǫi^2])‖∞ > √(c1 c2) t ) ≤ P( max_{1≤j≤p} √n |En[xij ǫi]| / √(En[xij^2 ǫi^2]) > t ) + P( max_{1≤j≤p} En[xij^2 ǫi^2] / En[ǫi^2] > c1 c2 ).

By Lemma 5 above, provided that t ≤ t̄ n^{1/6} − 1, there is a universal constant A such that

P( max_{1≤j≤p} √n |En[xij ǫi]| / √(En[xij^2 ǫi^2]) > t ) ≤ p max_{1≤j≤p} P( √n |En[xij ǫi]| / √(En[xij^2 ǫi^2]) > t ) ≤ 2p Φ̄(t)(1 + A/ℓn^3).

Next note that

P( max_{1≤j≤p} En[xij^2 ǫi^2] / En[ǫi^2] > c1 c2 ) ≤ P( max_{1≤j≤p} En[xij^2(ǫi^2 − 1)] > c1 − 1 ) + P( En[ǫi^2] < 1/c2 ).

By Lemma 9, there is a constant Cq such that

P( En[ǫi^2] < 1/c2 ) ≤ Cq E[|ǫi|^{q∨4}] / ( ([c2 − 1]/c2)^q n^{q/4} ) ≤ τ3,

where the last inequality follows from the choice of c2. Finally, by Lemma 10 and the choice of c1, we have

P( max_{1≤j≤p} En[xij^2(ǫi^2 − 1)] > c1 − 1 ) ≤ τ1 + τ2.

Since τ1 + τ2 + τ3 = τ, the result follows. □
Appendix C. Results under Conditions M and R
Condition M. There exists a finite constant q > 2 such that the disturbance obeys sup_{n≥1} E_{F0}[|ǫ|^q] < ∞, and the design X obeys sup_{n≥1} max_{1≤j≤p} En[|xij|^q] < ∞.

Condition R. We have that p ≤ α n^{η(q−2)/2}/2 for some constant 0 < η < 1, that cs^2 Φ^{-1}(1 − α/2p) ≤ σ^2, and that Φ^{-1}(1 − α/2p)^2 α^{−2/q} n^{−[(1−2/q)∧1/2]} log n ≤ 1/3. For convenience we also assume that Φ^{-1}(1 − α/2p) ≥ 1.
Conditions M and R were considered in [5] in the parametric setting, where the analysis relied on concentration of measure and moderate deviation bounds. Below we provide the results that validate the choice of λ in the current nonparametric setting, which is not covered in [5].
Theorem 15 (Properties of the Penalty Level under Conditions M and R). Suppose that conditions M and R hold, and let un = 2/Φ^{-1}(1 − α/2p). Then there is a constant C(q), depending only on q, and a deterministic sequence γn converging to 1, such that

(i) the exact option in (3.17) implements λ > cΛ̄ with probability at least

1 − α − C(q)( E[|ǫ|^{q∨4}]/n^{q/4} ∧ E[|ǫ|^q]/n^{1∧(q/2−1)} ) − 16/n;

(ii) the asymptotic option in (3.17) implements λ > cΛ̄ with probability at least

1 − αγn − C(q)( E[|ǫ|^{q∨4}]/n^{q/4} ∧ E[|ǫ|^q]/n^{1∧(q/2−1)} ) − 25/n − 24γn Φ^{-1}(1 − α/2p)^2 log(n) / ( α^{2/q} n^{[(1−2/q)∧1/2]} );

(iii) the approximate score Λ(1 − α|X) in (3.15) satisfies

Λ(1 − α(1 + C(q) log^{−q/2} n) | X) ≤ √n Φ^{-1}(1 − α/2p)(1 + √(2 log γn)/tn)/(1 − rn − 1/tn)
≤ √(2n log(p/α)) (1 + √(2 log γn)/tn)/(1 − rn − 1/tn),

where rn = α^{−2/q} n^{−[(1−2/q)∧1/2]} log n and tn = Φ^{-1}(1 − α/2p).
Proof of Theorem 15. Under conditions R and M, and using un = 2/Φ^{-1}(1 − α/2p), by Lemma 8 we have, with probability at least 1 − ψ(v) − 4cs^2(1 + un)/[n un(2 + un) σ^2 (1 − v)^2],

√(En[σ^2 ǫi^2]) ≤ (1 + un)√(En[(σǫi + ri)^2]).

Then, using (3.16), with at least the same probability we have Λ̄ ≤ (1 + un)Λ. Hence statement (i) holds by the definition of Λ(1 − α|X), setting v = 1/2, and condition R, which ensures cs^2 Φ^{-1}(1 − α/2p) ≤ σ^2.

To show statement (ii), we define tn = Φ^{-1}(1 − α/2p) and rn = α^{−2/q} n^{−[(1−2/q)∧1/2]} log n. Note that un = 2/tn, tn ≥ 1, and tn^2 rn ≤ 1/3.

By the same argument used to establish (i), under our conditions we have Λ̄ ≤ (1 + un/4)Λ with probability at least

1 − ψ(rn) − 25/n.

Thus it suffices to prove that, conditional on X, Λ ≤ (1 + un/2)√n Φ^{-1}(1 − α/2p) with probability at least 1 − αγn − α·12γn(exp(tn^2 rn) − 1) − ψ(rn).
Then, for any F = Fn and X = Xn that obey Condition M,

Pr(Λ > (1 + un/2)√n tn | X)
≤_{(1)} p max_{1≤j≤p} Pr( |n^{1/2} En[xij ǫi]| + 1 > (1 + 1/tn) tn (1 − rn) | X ) + Pr( √(En[ǫi^2]) < 1 − rn )
≤_{(2)} p max_{1≤j≤p} Pr( |n^{1/2} En[xij ǫi]| > tn(1 − rn − rn/tn) | X ) + ψ(rn),

where (1) holds by the union bound and (2) holds by Lemma 9.
Next note that, by Condition M and Slastnikov's theorem on moderate deviations [29] (see also [27]), we have that, uniformly in 0 ≤ |t| ≤ k√(log n) for some k^2 < q − 2, for a sequence γn → 1, and uniformly in 1 ≤ j ≤ p,

| Pr( n^{1/2}|En[xij ǫi]| > t | X ) / (2Φ̄(t)) − 1 | ≤ γn − 1.

We apply this result for t = tn(1 − rn − rn/tn) ≤ √(2 log(2p/α)) ≤ √(η(q − 2) log n), for η < 1, by assumption. Note that we apply Slastnikov's theorem to n^{−1/2}|Σ_{i=1}^n zi,n| for zi,n = xij ǫi, where we allow the design X, the law F, and the index j to be (implicitly) indexed by n. Slastnikov's theorem then applies provided

sup_{n, j≤p} En[|xij|^q] E_{Fn}[|ǫ|^q] < ∞,

which is implied by our Condition M, where we used that the design is fixed, so that the ǫi are independent of the xij. Thus we obtain the moderate deviation result uniformly in 1 ≤ j ≤ p and for any sequence of distributions F = Fn and designs X = Xn that obey Condition M.
Pr(Λ > (1 + un/2)√n tn | X) ≤ 2p Φ̄(tn(1 − rn − rn/tn)) γn + ψ(rn)
= 2p Φ̄(tn) γn + 2p γn ∫_{tn(1−rn−rn/tn)}^{tn} φ(u) du + ψ(rn)
≤ αγn + 2p γn [φ(tn(1 − rn − rn/tn)) − φ(tn)] / [tn(1 − rn − rn/tn)] + ψ(rn)
≤ αγn + 2p γn (φ(tn)/tn) (exp(tn^2 rn) − 1)/(1 − rn − rn/tn) + ψ(rn)
≤ αγn + 2p γn · 12 Φ̄(tn)(exp(tn^2 rn) − 1) + ψ(rn)
≤ αγn + α·24 γn tn^2 rn + ψ(rn),

where we used that Φ̄(t) ≤ φ(t)/t and φ(t)/t ≤ 2Φ̄(t) for t ≥ 1, that rn ≤ 1/3, and that exp(m) ≤ 1 + 2m for 0 ≤ m ≤ 1/3.
To show statement (iii) of the lemma, let ν' > (1 + √(2 log γn)/tn)/(1 − rn − 1/tn). Then

Pr(Λ > ν'√n tn | X) = Pr( √n ‖En[xi ǫi]‖∞ + 1 > ν' tn √(En[ǫi^2]) | X )
≤ Pr( √n ‖En[xi ǫi]‖∞ + 1 > ν' tn (1 − rn) | X ) + Pr( √(En[ǫi^2]) < 1 − rn )
≤ Pr( √n ‖En[xi ǫi]‖∞ > ν' tn (1 − rn − 1/tn) | X ) + ψ(rn)
≤ Pr( √n ‖En[xi ǫi]‖∞ > tn + √(2 log γn) | X ) + ψ(rn),

where the second step follows by Lemma 9 and the last step by the choice of ν'. Thus the bound

Pr( √n ‖En[xi ǫi]‖∞ > tn + √(2 log γn) | X ) ≤ 2p γn Φ̄(tn + √(2 log γn)) ≤ α

follows analogously to the proof of statement (ii); we omit the details for brevity. Therefore, under our conditions, Λ(1 − α − ψ(rn) | X) ≤ ν'√n Φ^{-1}(1 − α/2p). Note that by construction ψ(rn) ≤ αC(q) log^{−q/2} n. The last bound follows from standard tail bounds for Gaussian random variables. □
Appendix D. Proofs of Section 3
Proof of Theorem 1. The exact option in (3.17) is λ = (1 + un) c Λ(1 − α|X), so that

P(cΛ̄ ≥ λ | X) = P(Λ̄ ≥ (1 + un)Λ(1 − α|X) | X)
≤ P(Λ ≥ Λ(1 − α|X) | X) + P(Λ̄ ≥ (1 + un)Λ | X)
≤ α + P(Λ̄ ≥ (1 + un)Λ | X).

By (3.16) and Lemma 8 with v = rn, the last term satisfies

P(Λ̄ ≥ (1 + un)Λ | X) ≤ P( √(En[σ^2 ǫi^2]) ≥ (1 + un)√(En[(σǫi + ri)^2]) | X )
≤ ψ(rn) + 2(1 + un)/[n(1 − rn)un] ≤ α/log n + 4(1 + un)/(n un),

by the choice rn = (α^{−1} log(n) Cq E[|ǫ|^{q∨4}])^{1/q}/n^{1/4} < 1/2 under condition R'. □
Proof of Theorem 2. Let $t_n = \Phi^{-1}(1-\alpha/2p)$ and recall $r_n = \big(\alpha^{-1}\log n\,C_q\mathrm{E}[|\epsilon|^{q\wedge4}]\big)^{1/q}/n^{1/4} < 1/2$ under Condition R'. Next, note that for $u_n \ge 0$ it follows that $1+u_n \ge (1+u_n/2)\big(1+[u_n\wedge1]/4\big)$, so that
$$P\big(\bar\Lambda \ge (1+u_n)\sqrt n\,t_n\mid X\big) \le P\big(\Lambda \ge (1+u_n/2)\sqrt n\,t_n\mid X\big) + P\Big(\bar\Lambda \ge \Big(1+\tfrac{u_n\wedge1}{4}\Big)\Lambda\,\Big|\,X\Big).$$
Regarding the second term, by (3.16) and Lemma 8 with $v = r_n$, it follows that
$$P\Big(\bar\Lambda \ge \Big(1+\tfrac{u_n\wedge1}{4}\Big)\Lambda\,\Big|\,X\Big) \le P\Big(\sqrt{\mathrm{E}_n[\sigma^2\epsilon_i^2]} \ge \Big(1+\tfrac{u_n\wedge1}{4}\Big)\sqrt{\mathrm{E}_n[(\sigma\epsilon_i+r_i)^2]}\,\Big|\,X\Big) \le \psi(r_n) + \frac{2\big(1+\tfrac{u_n\wedge1}{4}\big)}{n(1-r_n)\,[u_n\wedge1]/4} \le \frac{\alpha}{\log n} + \frac{20}{n(u_n\wedge1)}.$$
Next note that
$$\Pr\big(\Lambda > (1+u_n/2)\sqrt n\,t_n\mid X\big) \le \Pr\left(\frac{n^{1/2}\|\mathrm{E}_n[x_i\epsilon_i]\|_\infty}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > (1+u_n/4)\,t_n\,\Big|\,X\right) + \Pr\left(\frac{\sqrt n}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > \sqrt n\,t_nu_n/4\,\Big|\,X\right).$$
To control the second term above, note that under Conditions M' and R' we have $4/[u_nt_n] \le 1-r_n$, so that by Lemma 9
$$\Pr\left(\frac{\sqrt n}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > \frac{\sqrt n\,t_nu_n}{4}\,\Big|\,X\right) = \Pr\left(\sqrt{\mathrm{E}_n[\epsilon_i^2]} < \frac{4}{u_nt_n}\,\Big|\,X\right) \le \psi(r_n) \le \alpha/\log n.$$
Next, under Conditions M' and R', letting $c_1, c_2$ be as in Lemma 11 with $\tau = \alpha/\log n$, we have $(1+u_n/4) \ge \sqrt{c_1c_2}$. Thus, since $t_n + 1 \le n^{1/6}\big/\max_{1\le j\le p}\big(\mathrm{E}_n[|x_{ij}|^3]\,\mathrm{E}[|\epsilon_i|^3]\big)^{1/3}$ by Condition R', Lemma 11 yields
$$\Pr\left(\frac{n^{1/2}\|\mathrm{E}_n[x_i\epsilon_i]\|_\infty}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > (1+u_n/4)\,t_n\,\Big|\,X\right) \le 2p\,\bar\Phi(t_n)\left(1+\frac{A}{\ell_n^3}\right) + \frac{\alpha}{\log n} \le \alpha\left(1+\frac{A}{\ell_n^3}+\frac{1}{\log n}\right). \qquad\Box$$
Proof of Theorem 3. Under Conditions M' and R', note that by the triangle inequality and by Lemma 11 with $\tau = \alpha/\log n$ we have
$$\Pr\big(\Lambda > (\nu+\nu')\sqrt n\,t_n\mid X\big) \le \Pr\left(\frac{n^{1/2}\|\mathrm{E}_n[x_i\epsilon_i]\|_\infty}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > \nu t_n\,\Big|\,X\right) + \Pr\Big(\sqrt{\mathrm{E}_n[\epsilon_i^2]} < 1/[\nu' t_n]\,\Big|\,X\Big) \le 2p\,\bar\Phi\big(\nu t_n/\sqrt{c_1c_2}\big)(1+A/\ell_n^3) + 2\alpha/\log n < \alpha,$$
provided that: $\nu > \sqrt{c_1c_2}\big(1+\sqrt{2\log(3+3A/\ell_n^3)}/t_n\big)$, so that $2p\,\bar\Phi\big(\nu t_n/\sqrt{c_1c_2}\big)(1+A/\ell_n^3) \le \alpha/3$; $\nu' \ge 1/[(1-r_n)t_n]$, so that $\psi\big((\nu' t_n-1)/[\nu' t_n]\big) < \alpha/\log n$; and $\log n > 2$. □
Appendix E. Proofs of Section 4
Proof of Theorem 4. We prove the upper bound on the prediction norm in two steps.
Step 1. In this step we show that $\delta = \hat\beta - \beta_0 \in \Delta_{\bar c}$ under the prescribed penalty level. By definition of $\hat\beta$,
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \le \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\hat\beta\|_1 \le \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big), \tag{E.25}$$
where the last inequality holds because
$$\|\beta_0\|_1 - \|\hat\beta\|_1 = \|\beta_{0T}\|_1 - \|\hat\beta_T\|_1 - \|\hat\beta_{T^c}\|_1 \le \|\delta_T\|_1 - \|\delta_{T^c}\|_1. \tag{E.26}$$
Note that using the convexity of $\sqrt{\hat Q}$, $-S \in \partial\sqrt{\hat Q}(\beta_0)$, and the condition $\lambda > cn\|S\|_\infty$, we have
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \ge -S'\delta \ge -\|S\|_\infty\|\delta\|_1 \tag{E.27}$$
$$\ge -\frac{\lambda}{cn}\big(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\big). \tag{E.28}$$
Combining (E.25) with (E.28) we obtain
$$-\frac{\lambda}{cn}\big(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\big) \le \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big), \tag{E.29}$$
that is,
$$\|\delta_{T^c}\|_1 \le \frac{c+1}{c-1}\,\|\delta_T\|_1 = \bar c\,\|\delta_T\|_1, \quad\text{i.e. } \delta \in \Delta_{\bar c}. \tag{E.30}$$
Step 2. In this step we derive bounds on the estimation error. We shall use the following relations:
$$\hat Q(\hat\beta) - \hat Q(\beta_0) = \|\delta\|_{2,n}^2 - 2\,\mathrm{E}_n[(\sigma\epsilon_i + r_i)x_i'\delta], \tag{E.31}$$
$$\hat Q(\hat\beta) - \hat Q(\beta_0) = \Big(\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)}\Big)\Big(\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)}\Big), \tag{E.32}$$
$$2\big|\mathrm{E}_n[(\sigma\epsilon_i + r_i)x_i'\delta]\big| \le 2\|\mathrm{E}_n[(\sigma\epsilon_i + r_i)x_i]\|_\infty\|\delta\|_1 \le 2\sqrt{\hat Q(\beta_0)}\,\lambda\|\delta\|_1/[cn], \tag{E.33}$$
$$\|\delta_T\|_1 \le \frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}}\quad\text{for }\delta\in\Delta_{\bar c}, \tag{E.34}$$
where the first inequality in (E.33) holds by Hölder's inequality and the inequality in (E.34) holds by the definition of $\kappa_{\bar c}$.
Note that by (E.31), if $\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)} = 0$, we have $\|\delta\|_{2,n} = 0$ and we are done. So we can assume $\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)} > 0$. Using (E.25) and (E.31)–(E.34) we obtain
$$\|\delta\|_{2,n}^2 \le \frac{2\lambda}{cn}\sqrt{\hat Q(\beta_0)}\,\|\delta\|_1 + \Big(\sqrt{\hat Q(\hat\beta)}+\sqrt{\hat Q(\beta_0)}\Big)\frac{\lambda}{n}\left(\frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}} - \|\delta_{T^c}\|_1\right). \tag{E.35}$$
Also, using (E.25) and (E.34) we obtain
$$\sqrt{\hat Q(\hat\beta)} \le \sqrt{\hat Q(\beta_0)} + \frac{\lambda}{n}\left(\frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}} - \|\delta_{T^c}\|_1\right) \le \sqrt{\hat Q(\beta_0)} + \frac{\lambda}{n}\,\frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}}. \tag{E.36}$$
Combining inequalities (E.36) and (E.35), we obtain
$$\|\delta\|_{2,n}^2 \le \frac{2\lambda}{cn}\sqrt{\hat Q(\beta_0)}\,\|\delta\|_1 + 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n} + \left(\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n}\right)^2 - 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda}{n}\|\delta_{T^c}\|_1.$$
Simplifying the expression further we obtain
$$\|\delta\|_{2,n}^2 \le \frac{2\lambda}{cn}\sqrt{\hat Q(\beta_0)}\,\|\delta_T\|_1 + 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n} + \left(\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n}\right)^2,$$
and then, using (E.34), we obtain
$$\left[1 - \left(\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\right)^2\right]\|\delta\|_{2,n}^2 \le 2\left(\frac{1}{c}+1\right)\sqrt{\hat Q(\beta_0)}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n}.$$
Provided that $\lambda\sqrt s/(n\kappa_{\bar c}) < 1$, solving the inequality above yields the bound stated in the theorem. □
Proof of Theorem 5. Let $\delta := \hat\beta - \beta_0$. Under the stated condition on $\lambda$, we have $\delta \in \Delta_{\bar c}$. Thus,
$$\|\delta\|_1 \le (1+\bar c)\|\delta_T\|_1 \le (1+\bar c)\,\frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}},$$
by the restricted eigenvalue condition. □
Proof of Theorem 6. Let $\delta := \hat\beta - \beta_0$. We have that
$$\|\delta\|_\infty \le \|\mathrm{E}_n[x_ix_i'\delta]\|_\infty + \|\mathrm{E}_n[x_ix_i'\delta] - \delta\|_\infty.$$
By the first-order optimality conditions for $\hat\beta$ and the assumption on $\lambda$,
$$\|\mathrm{E}_n[x_ix_i'\delta]\|_\infty \le \|\mathrm{E}_n[x_i(y_i - x_i'\hat\beta)]\|_\infty + \|S\|_\infty\sqrt{\hat Q(\beta_0)} \le \frac{\lambda\sqrt{\hat Q(\hat\beta)}}{n} + \frac{\lambda\sqrt{\hat Q(\beta_0)}}{cn}.$$
Next, let $e_j$ denote the $j$th canonical direction; then for each $j$
$$\big|\mathrm{E}_n[e_j'x_ix_i'\delta] - \delta_j\big| \le \big|\mathrm{E}_n[e_j'(x_ix_i' - I)\delta]\big| \le \|\delta\|_1\,\|\mathrm{E}_n[x_ix_i' - I]\|_\infty.$$
Therefore, using the optimality of $\hat\beta$, which implies $\sqrt{\hat Q(\hat\beta)} \le \sqrt{\hat Q(\beta_0)} + (\lambda/n)\|\delta\|_1$, we have
$$\|\delta\|_\infty \le \left(\sqrt{\hat Q(\hat\beta)} + \frac{\sqrt{\hat Q(\beta_0)}}{c}\right)\frac{\lambda}{n} + \|\mathrm{E}_n[x_ix_i'] - I\|_\infty\|\delta\|_1 \le \left(1+\frac{1}{c}\right)\frac{\lambda\sqrt{\hat Q(\beta_0)}}{n} + \left(\frac{\lambda^2}{n^2} + \|\mathrm{E}_n[x_ix_i'] - I\|_\infty\right)\|\delta\|_1. \qquad\Box$$
Proof of Theorem 7. We have $\delta := \hat\beta - \beta_0 \in \Delta_{\bar c}$ under the condition $\lambda \ge c\Lambda$, by Step 1 in the proof of Theorem 4.
First we establish the upper bound. By optimality of $\hat\beta$,
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \le \frac{\lambda}{n}\big(\|\beta_0\|_1 - \|\hat\beta\|_1\big) \le \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big).$$
Thus, if $\delta\notin\Delta_1$ we have $\hat Q(\hat\beta) \le \hat Q(\beta_0)$. So we can assume $\delta\in\Delta_1$ in the bound above, so that $\|\delta_T\|_1 \le \sqrt s\,\|\delta\|_{2,n}/\kappa_1$. Thus, setting $\rho_1 = \lambda\sqrt s/[n\kappa_1]$, we have $\|\delta\|_{2,n} \le 2(1+1/c)\rho_1\sqrt{\hat Q(\beta_0)}\big/[1-\rho_1^2]$ if $\delta\in\Delta_1$, by inspection of the proof of Theorem 4.
To establish the lower bound, by convexity of $\sqrt{\hat Q}$ and the condition $\lambda \ge cn\|S\|_\infty$, we have
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \ge -S'\delta \ge -\frac{\lambda\|\delta\|_1}{cn}.$$
Thus, combining these bounds with Theorems 4 and 5 and letting $\rho := \lambda\sqrt s/[n\kappa_{\bar c}] < 1$, we obtain
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \ge -\frac{1+\bar c}{c}\,\frac{\rho^2}{1-\rho^2}\cdot2\left(1+\frac{1}{c}\right)\sqrt{\hat Q(\beta_0)} = -\frac{4\bar c}{c}\,\frac{\rho^2}{1-\rho^2}\,\sqrt{\hat Q(\beta_0)},$$
where the last equality follows from simplifications. □
Proof of Lemma 1. First note that by strong duality we have
$$\mathrm{E}_n[y_i\hat a_i] = \frac{\|Y - X\hat\beta\|}{\sqrt n} + \frac{\lambda}{n}\sum_{j=1}^p|\hat\beta_j|.$$
Since $\mathrm{E}_n[x_{ij}\hat a_i]\,\hat\beta_j = \lambda|\hat\beta_j|/n$ for every $j = 1,\dots,p$, we have
$$\mathrm{E}_n[y_i\hat a_i] = \frac{\|Y - X\hat\beta\|}{\sqrt n} + \sum_{j=1}^p\mathrm{E}_n[x_{ij}\hat a_i]\,\hat\beta_j = \frac{\|Y - X\hat\beta\|}{\sqrt n} + \mathrm{E}_n\Big[\hat a_i\sum_{j=1}^p x_{ij}\hat\beta_j\Big].$$
Rearranging the terms, we have $\mathrm{E}_n\big[(y_i - x_i'\hat\beta)\hat a_i\big] = \|Y - X\hat\beta\|/\sqrt n$.
If $\|Y - X\hat\beta\| = 0$, we have $\sqrt{\hat Q(\hat\beta)} = 0$ and the statement of the lemma trivially holds. If $\|Y - X\hat\beta\| > 0$, then since $\|\hat a\| \le \sqrt n$ the equality can hold only for
$$\hat a = \sqrt n\,(Y - X\hat\beta)\big/\|Y - X\hat\beta\| = (Y - X\hat\beta)\big/\sqrt{\hat Q(\hat\beta)}.$$
Next, note that for any $j\in\hat T$ we have $\mathrm{E}_n[x_{ij}\hat a_i] = \mathrm{sign}(\hat\beta_j)\,\lambda/n$. Therefore,
$$\sqrt{\hat Q(\hat\beta)}\,\sqrt{|\hat T|}\,\lambda = \big\|\big(X'(Y - X\hat\beta)\big)_{\hat T}\big\| \le \big\|\big(X'(Y - X\beta_0)\big)_{\hat T}\big\| + \big\|\big(X'X(\beta_0 - \hat\beta)\big)_{\hat T}\big\| \le \sqrt{|\hat T|}\,n\,\|\mathrm{E}_n[x_i(\sigma\epsilon_i + r_i)]\|_\infty + n\sqrt{\phi_{\max}(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n},$$
where we used
$$\big\|\big(X'X(\hat\beta - \beta_0)\big)_{\hat T}\big\| \le \sup_{\|\alpha_{T^c}\|_0\le\hat m,\,\|\alpha\|\le1}\big|\alpha'X'X(\hat\beta - \beta_0)\big| \le \sup_{\|\alpha_{T^c}\|_0\le\hat m,\,\|\alpha\|\le1}\|\alpha'X'\|\,\|X(\hat\beta - \beta_0)\| = \sup_{\|\alpha_{T^c}\|_0\le\hat m,\,\|\alpha\|\le1}\sqrt{\alpha'X'X\alpha}\;\|X(\hat\beta - \beta_0)\| = n\sqrt{\phi_{\max}(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n}. \qquad\Box$$
Proof of Theorem 8. We can assume that $\sqrt{\hat Q(\beta_0)} > 0$; otherwise the result is trivially true.
In the event $\lambda \ge c\Lambda$, by Lemma 1,
$$\left(\sqrt{\frac{\hat Q(\hat\beta)}{\hat Q(\beta_0)}} - \frac{1}{c}\right)\frac{\lambda}{n}\sqrt{\hat Q(\beta_0)}\,\sqrt{|\hat T|} \le \sqrt{\phi_{\max}(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n}. \tag{E.37}$$
Under the condition $\lambda\sqrt s < n\kappa_{\bar c}\rho$, $\rho < 1$, we have by Theorem 7 that
$$\left(1 - \frac{\rho^2}{1-\rho^2}\,\frac{4\bar c}{c} - \frac{1}{c}\right)\frac{\lambda\sqrt{\hat Q(\beta_0)}}{n\sqrt{\phi_{\max}(\hat m)}}\,\sqrt{|\hat T|} \le \|\hat\beta - \beta_0\|_{2,n}. \qquad\Box$$
Proof of Theorem 9. We can assume that $\sqrt{\hat Q(\beta_0)} > 0$; otherwise the result follows by Theorem 11, which establishes $\hat\beta = \beta_0$.
In the event $\lambda \ge c\Lambda$, by Lemma 1,
$$\left(\sqrt{\frac{\hat Q(\hat\beta)}{\hat Q(\beta_0)}} - \frac{1}{c}\right)\lambda\sqrt{\hat Q(\beta_0)}\,\sqrt{|\hat T|} \le n\sqrt{\phi_{\max}(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n}. \tag{E.38}$$
Under the condition $\lambda\sqrt s \le n\kappa_{\bar c}\rho$, $\rho < 1$, we have by Theorem 4 and Theorem 7 that
$$\left(1 - \frac{\rho^2}{1-\rho^2}\,\frac{4\bar c}{c} - \frac{1}{c}\right)\lambda\sqrt{\hat Q(\beta_0)}\,\sqrt{|\hat T|} \le n\sqrt{\phi_{\max}(\hat m)}\cdot2\left(1+\frac{1}{c}\right)\sqrt{\hat Q(\beta_0)}\,\frac{1}{1-\rho^2}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}.$$
Since $\rho^2 < (c-1)/(c-1+4\bar c)$ we have
$$\sqrt{|\hat T|} \le \frac{\sqrt s}{\kappa_{\bar c}}\cdot\frac{2\sqrt{\phi_{\max}(\hat m)}\,(c+1)}{c-1-\rho^2(c-1+4\bar c)} = \sqrt s\cdot\sqrt{\phi_{\max}(\hat m)}\cdot\frac{2\bar c}{\kappa_{\bar c}}\left(1 + \frac{\rho^2(c-1+4\bar c)}{c-1-\rho^2(c-1+4\bar c)}\right).$$
Under the condition $2\rho^2 \le (c-1)/(c-1+4\bar c)$, the expression above simplifies to
$$|\hat T| \le s\,\phi_{\max}(\hat m)\left(\frac{4\bar c}{\kappa_{\bar c}}\right)^2 \tag{E.39}$$
since $\rho^2(c-1+4\bar c) \le c-1-\rho^2(c-1+4\bar c)$.
Consider any $m\in\mathcal M$, and suppose $\hat m > m$. Then by the sublinearity of sparse eigenvalues
$$\hat m \le s\left\lceil\frac{\hat m}{m}\right\rceil\phi_{\max}(m)\left(\frac{4\bar c}{\kappa_{\bar c}}\right)^2.$$
Thus, since $\lceil k\rceil < 2k$ for any $k \ge 1$, we have $\hat m < s\cdot2\phi_{\max}(m)(4\bar c/\kappa_{\bar c})^2$, which violates the condition defining $\mathcal M$ and $s$. Therefore we must have $\hat m \le m$. In turn, applying (E.39) once more with $\hat m \le m$ we obtain $\hat m \le s\,\phi_{\max}(m)(4\bar c/\kappa_{\bar c})^2$. The result follows by minimizing the bound over $m\in\mathcal M$. □
Lemma 12 (Performance of a generic second-step estimator). Let $\hat\beta$ be any first-step estimator with support $\hat T$, define
$$B_n := \hat Q(\hat\beta) - \hat Q(\beta_0),\qquad C_n := \hat Q(\beta_{0\hat T}) - \hat Q(\beta_0),\qquad D_{\hat m} := \sup_{\|\delta_{T^c}\|_0\le\hat m,\,\|\delta\|_{2,n}>0}\left|\mathrm{E}_n\!\left[\frac{\sigma\epsilon_i x_i'\delta}{\|\delta\|_{2,n}}\right]\right|,$$
and let $\tilde\beta$ be the second-step estimator. Then, for $\hat m := |\hat T\setminus T|$, we have
$$\|\tilde\beta - \beta_0\|_{2,n} \le D_{\hat m} + 2c_s + \sqrt{(B_n)_+\wedge(C_n)_+}.$$
Proof of Lemma 12. Let $\delta := \tilde\beta - \beta_0$. By definition of the second-step estimator, it follows that $\hat Q(\tilde\beta) \le \hat Q(\hat\beta)$ and $\hat Q(\tilde\beta) \le \hat Q(\beta_{0\hat T})$. Thus,
$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \big(\hat Q(\hat\beta) - \hat Q(\beta_0)\big)\wedge\big(\hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)\big) \le B_n\wedge C_n.$$
Moreover, since
$$\big|\hat Q(\tilde\beta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2\big| \le \frac{|S'\delta|}{\|\delta\|_{2,n}}\,\|\delta\|_{2,n} + 2c_s\|\delta\|_{2,n},$$
we have
$$\|\delta\|_{2,n}^2 \le \left(\frac{|S'\delta|}{\|\delta\|_{2,n}} + 2c_s\right)\|\delta\|_{2,n} + B_n\wedge C_n.$$
Solving this inequality and using the definition of $D_{\hat m}$, we obtain the stated result:
$$\|\delta\|_{2,n} \le D_{\hat m} + 2c_s + \sqrt{(B_n)_+\wedge(C_n)_+}. \qquad\Box$$
Lemma 13 (Control of $B_n$ and $C_n$ for √LASSO). Under Condition ASM, if $\lambda \ge c\Lambda$ and $\rho_1 = \lambda\sqrt s/[n\kappa_1] < 1$, we have that
$$B_n\wedge C_n \le 1\{T\not\subseteq\hat T\}\cdot\hat Q(\beta_0)\,\frac{4(1+1/c)\rho_1^2}{1-\rho_1^2}\left(1 + \frac{(1+1/c)\rho_1^2}{1-\rho_1^2}\right).$$
Proof of Lemma 13. First note that $C_n = 0$ if $T\subseteq\hat T$.
Let $\delta = \hat\beta - \beta_0$. By optimality of $\hat\beta$ for (1.3), we have
$$\hat Q(\hat\beta) - \hat Q(\beta_0) \le \Big(\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)}\Big)\frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big) \le 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big) + \frac{\lambda^2}{n^2}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big)_+^2.$$
If $\|\delta_{T^c}\|_1 > \|\delta_T\|_1$, we have $B_n \le 0$. Otherwise, $\|\delta_T\|_1 \le \sqrt s\,\|\delta\|_{2,n}/\kappa_1$.
Moreover, under $\|\delta_{T^c}\|_1 \le \|\delta_T\|_1$ we have that $\|\delta\|_{2,n} \le 2(1+1/c)\sqrt{\hat Q(\beta_0)}\,\rho_1/(1-\rho_1^2)$, for $\lambda\sqrt s \le n\kappa_1\rho_1$, $\rho_1 < 1$, by Theorem 4. Thus,
$$\hat Q(\hat\beta) - \hat Q(\beta_0) \le 2\hat Q(\beta_0)\,\frac{2(1+1/c)\rho_1^2}{1-\rho_1^2} + \rho_1^2\,\frac{4(1+1/c)^2\hat Q(\beta_0)\,\rho_1^2}{(1-\rho_1^2)^2}. \qquad\Box$$
Lemma 14 (Control of $D_{\hat m}$). Suppose Condition ASM holds. Then, uniformly over $\hat m \le p$, with probability at least $1 - 1/C^2 - 1/[9C^2\log p]$,
$$D_{\hat m} \le \frac{C\sigma}{\sqrt{\phi_{\min}(\hat m)}}\sqrt{\frac{s}{n}} + \frac{24C\sigma}{\sqrt{\phi_{\min}(\hat m)}}\sqrt{1\vee\max_{j=1,\dots,p}\mathrm{E}_n\big[x_{ij}^2\epsilon_i^2\big]}\;\sqrt{\frac{\hat m\log p}{n}}.$$
Proof of Lemma 14. By the triangle inequality we have
$$D_{\hat m} := \sup_{\|\delta_{T^c}\|_0\le\hat m,\,\|\delta\|_{2,n}>0}\left|\mathrm{E}_n\!\left[\frac{\sigma\epsilon_i x_i'\delta}{\|\delta\|_{2,n}}\right]\right| \le \sup_{\|\delta_{T^c}\|_0\le\hat m,\,\|\delta\|_{2,n}>0}\left|\mathrm{E}_n\!\left[\frac{\sigma\epsilon_i x_{iT}'\delta}{\|\delta\|_{2,n}}\right]\right| + \sup_{\|\delta_{T^c}\|_0\le\hat m,\,\|\delta\|_{2,n}>0}\left|\mathrm{E}_n\!\left[\frac{\sigma\epsilon_i x_{iT^c}'\delta}{\|\delta\|_{2,n}}\right]\right| =: D^T_{\hat m} + D^{T^c}_{\hat m}.$$
To bound $D^T_{\hat m}$, note that $\|\delta\|_{2,n} \ge \sqrt{\phi_{\min}(\hat m)}\,\|\delta\|$ and $x_{iT}'\delta = x_{iT}'\delta_T$, so that
$$D^T_{\hat m} \le \sigma\,\|\mathrm{E}_n[\epsilon_i x_{iT}]\|\,\frac{\|\delta_T\|}{\sqrt{\phi_{\min}(\hat m)}\,\|\delta\|} \le \frac{\sigma\,\|\mathrm{E}_n[\epsilon_i x_{iT}]\|}{\sqrt{\phi_{\min}(\hat m)}}.$$
We proceed to bound the second moment of $\|\mathrm{E}_n[\epsilon_i x_{iT}]\|$. Since $\mathrm{E}[\epsilon_i] = 0$ and $\mathrm{E}[\epsilon_i^2] = 1$,
$$\mathrm{E}\big[\|\mathrm{E}_n[\epsilon_i x_{iT}]\|^2\big] = \mathrm{E}\big[\mathrm{E}_n[\epsilon_i x_{iT}]'\,\mathrm{E}_n[\epsilon_i x_{iT}]\big] = \mathrm{trace}\big(\mathrm{E}_n[x_{iT}x_{iT}']/n\big) = s/n.$$
Therefore, by Chebyshev's inequality, with probability at least $1 - 1/C^2$,
$$D^T_{\hat m} \le \frac{C\sigma}{\sqrt{\phi_{\min}(\hat m)}}\sqrt{\frac{s}{n}}.$$
Next, to bound $D^{T^c}_{\hat m}$, note that $\mathrm{supp}(\delta) \subseteq T\cup\hat T$. Thus we have $\|\delta\|_{2,n} \ge \sqrt{\phi_{\min}(\hat m)}\,\|\delta\|$ and
$$\mathrm{E}_n[\epsilon_i x_{iT^c}'\delta] = \mathrm{E}_n[\epsilon_i x_{iT^c}'\delta_{T^c}] \le \|\mathrm{E}_n[\epsilon_i x_{iT^c}]\|_\infty\,\|\delta_{T^c}\|_1 \le \|\mathrm{E}_n[\epsilon_i x_{iT^c}]\|_\infty\,\sqrt{\hat m}\,\|\delta_{T^c}\|.$$
Thus we have
$$D^{T^c}_{\hat m} \le \sqrt{\hat m}\,\sigma\,\|\mathrm{E}_n[\epsilon_i x_{iT^c}]\|_\infty\big/\sqrt{\phi_{\min}(\hat m)}.$$
Finally, by Corollary 6, with probability at least $1 - 1/[9C^2\log p]$, we have
$$\|\mathrm{E}_n[\epsilon_i x_{iT^c}]\|_\infty \le 24C\sqrt{1\vee\max_{j=1,\dots,p}\mathrm{E}_n\big[x_{ij}^2\epsilon_i^2\big]}\;\sqrt{\frac{\log p}{n}}. \qquad\Box$$
Proof of Theorem 10. The theorem follows from Lemma 12, applying Lemma 13 to bound $B_n\wedge C_n$ and Lemma 14 to bound $D_{\hat m}$. □
Proof of Theorem 11. Note that because $\sigma = 0$ and $c_s = 0$, we have $\sqrt{\hat Q(\beta_0)} = 0$ and $\sqrt{\hat Q(\hat\beta)} = \|\hat\beta - \beta_0\|_{2,n}$. Thus, by optimality of $\hat\beta$ we have
$$\|\hat\beta - \beta_0\|_{2,n} + \frac{\lambda}{n}\|\hat\beta\|_1 \le \frac{\lambda}{n}\|\beta_0\|_1.$$
Thus $\|\hat\beta\|_1 \le \|\beta_0\|_1$, and $\delta = \hat\beta - \beta_0$ satisfies $\|\delta_{T^c}\|_1 \le \|\delta_T\|_1$. This implies that
$$\|\delta\|_{2,n} \le \frac{\lambda}{n}\|\delta_T\|_1 \le \frac{\lambda\sqrt s}{n\kappa_1}\|\delta\|_{2,n}.$$
Since $\lambda\sqrt s < n\kappa_1$, we have $\|\delta\|_{2,n} = 0$. In turn, $0 = \sqrt s\,\|\delta\|_{2,n} \ge \kappa_1\|\delta_T\|_1 \ge \kappa_1\|\delta\|_1/2$. This shows that $\delta = 0$, i.e. $\hat\beta = \beta_0$. □
Lemma 15. Consider $\epsilon_i \sim t(2)$. Then, for $\tau\in(0,1)$, we have that:
(i) $P\big(\mathrm{E}_n[\epsilon_i^2] \ge 2\sqrt2/\tau + \log(4n/\tau)\big) \le \tau/2$.
(ii) Let $\bar x := \max_{i,j}|x_{ij}|$ and $0 < a < 1/6$. Then, with probability at least $1 - \tau - \frac{1}{n^{1/2}(1/6-a)^2}$, we have
$$\left\|\frac{\mathrm{E}_n[x_i\epsilon_i]}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}}\right\|_\infty \le \frac{4\bar x\sqrt{2\log(4p/\tau)}\,\sqrt{\log(4n/\tau)+2\sqrt2/\tau}}{\sqrt n\,\sqrt{a\log n}}.$$
(iii) For $u_n \ge 0$ and $0 < a < 1/6$, we have
$$P\Big(\sqrt{\mathrm{E}_n[\sigma^2\epsilon_i^2]} > (1+u_n)\sqrt{\mathrm{E}_n[(\sigma\epsilon_i+r_i)^2]}\Big) \le \frac{4\sigma^2c_s^2\log(4n/\tau)}{n\big(c_s^2+[u_n/(1+u_n)]\sigma^2a\log n\big)^2} + \frac{1}{n^{1/2}(1/6-a)^2} + \tau/2.$$
Proof of Lemma 15. To show (i) we will establish a bound on $q(\mathrm{E}_n[\epsilon_i^2], 1-\tau)$. Recall that for a $t(2)$ random variable the cumulative distribution function and the density function are given by
$$F(x) = \frac{1}{2}\left(1 + \frac{x}{\sqrt{2+x^2}}\right)\qquad\text{and}\qquad f(x) = \frac{1}{(2+x^2)^{3/2}}.$$
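As a quick numerical sanity check of these two formulas (not part of the proof), one can verify that integrating $f$ recovers $F$; the sketch below is ours and uses only stdlib Python:

```python
import math

def t2_cdf(x):
    # F(x) = (1/2) * (1 + x / sqrt(2 + x^2)), the t(2) distribution function
    return 0.5 * (1.0 + x / math.sqrt(2.0 + x * x))

def t2_pdf(x):
    # f(x) = 1 / (2 + x^2)^(3/2), the t(2) density
    return (2.0 + x * x) ** -1.5

def integrate_pdf(x, lo=-200.0, steps=200000):
    # trapezoidal rule: the integral of f over (lo, x) approximates F(x)
    # up to the (tiny) tail mass below lo and the discretization error
    h = (x - lo) / steps
    total = 0.5 * (t2_pdf(lo) + t2_pdf(x))
    for k in range(1, steps):
        total += t2_pdf(lo + k * h)
    return total * h
```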
For any truncation level $t_n \ge 2$ we have
$$\mathrm{E}\big[\epsilon_i^21\{\epsilon_i^2\le t_n\}\big] = 2\int_0^{\sqrt2}\frac{x^2\,dx}{(2+x^2)^{3/2}} + 2\int_{\sqrt2}^{\sqrt{t_n}}\frac{x^2\,dx}{(2+x^2)^{3/2}} \le 2\int_0^{\sqrt2}\frac{x^2\,dx}{2^{3/2}} + 2\int_{\sqrt2}^{\sqrt{t_n}}\frac{x^2\,dx}{x^3} \le \log t_n,$$
$$\mathrm{E}\big[\epsilon_i^41\{\epsilon_i^2\le t_n\}\big] \le 2\int_0^{\sqrt2}\frac{x^4\,dx}{2^{3/2}} + 2\int_{\sqrt2}^{\sqrt{t_n}}\frac{x^4\,dx}{x^3} \le t_n,$$
$$\mathrm{E}\big[\epsilon_i^21\{\epsilon_i^2\le t_n\}\big] \ge 2\int_0^1\frac{x^2\,dx}{3^{3/2}} + 2\int_1^{\sqrt2}\frac{x^2\,dx}{4^{3/2}} + \frac{2}{2\sqrt2}\int_{\sqrt2}^{\sqrt{t_n}}\frac{dx}{x} \ge \frac{\log t_n}{2\sqrt2}. \tag{E.40}$$
Also, because $1-\sqrt{1-v} \le v$ for every $0\le v\le1$,
$$P\big(|\epsilon_i|^2 > t_n\big) = 1 - \sqrt{\frac{t_n}{2+t_n}} \le \frac{2}{2+t_n}. \tag{E.41}$$
Thus, by setting $t_n = 4n/\tau$ and $t = 2\sqrt2/\tau$, we have by [14], relation (7.5),
$$P\Big(\big|\mathrm{E}_n[\epsilon_i^2] - \mathrm{E}[\epsilon_i^21\{\epsilon_i^2\le t_n\}]\big| \ge t\Big) \le \frac{\mathrm{E}\big[\epsilon_i^41\{\epsilon_i^2\le t_n\}\big]}{nt^2} + nP\big(|\epsilon_i^2| > t_n\big) \le \frac{t_n}{nt^2} + \frac{2n}{2+t_n} \le \tau/2. \tag{E.42}$$
Thus (i) is established, and we have
$$q\Big(\max_{1\le j\le p}\mathrm{E}_n[x_{ij}^2\epsilon_i^2],\ 1-\tau/2\Big) \le \bar x^2\,q\big(\mathrm{E}_n[\epsilon_i^2],\ 1-\tau/2\big) \le \bar x^2\big(\log(4n/\tau) + 2\sqrt2/\tau\big).$$
To show (ii), using the bound above and Theorem 14, we have that with probability $1-\tau$
$$\|\mathbb{G}_n[x_i\epsilon_i]\|_\infty \le 4\bar x\sqrt{2\log(4p/\tau)}\,\sqrt{\log(4n/\tau) + 2\sqrt2/\tau}.$$
Moreover, for $0 < a < 1/6$, we have
$$\begin{aligned}
P\big(\mathrm{E}_n[\epsilon_i^2] \le a\log n\big) &\le P\big(\mathrm{E}_n[\epsilon_i^21\{\epsilon_i^2\le n^{1/2}\}] \le a\log n\big)\\
&\le P\Big(\big|\mathrm{E}_n[\epsilon_i^21\{\epsilon_i^2\le n^{1/2}\}] - \mathrm{E}[\epsilon_i^21\{\epsilon_i^2\le n^{1/2}\}]\big| \ge (1/6)\log n - a\log n\Big)\\
&\le \frac{1}{n^{1/2}(1/6-a)^2}
\end{aligned}\tag{E.43}$$
by Chebyshev's inequality and since $\mathrm{E}\big[\epsilon_i^21\{\epsilon_i^2\le n^{1/2}\}\big] \ge (1/6)\log n$. Thus, with probability at least $1-\tau-\frac{1}{n^{1/2}(1/6-a)^2}$, we have
$$\left\|\frac{\mathrm{E}_n[x_i\epsilon_i]}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}}\right\|_\infty \le \frac{4\bar x\sqrt{2\log(4p/\tau)}\,\sqrt{\log(4n/\tau)+2\sqrt2/\tau}}{\sqrt n\,\sqrt{a\log n}}.$$
To establish (iii), let $a_n = \big[(1+u_n)^2-1\big]/(1+u_n)^2 = u_n(2+u_n)/(1+u_n)^2 \ge u_n/(1+u_n)$, and note that by (E.40), (E.42), and (E.43) we have
$$\begin{aligned}
P\Big(\sqrt{\mathrm{E}_n[\sigma^2\epsilon_i^2]} > (1+u_n)\sqrt{\mathrm{E}_n[(\sigma\epsilon_i+r_i)^2]}\Big) &= P\big(-2\sigma\mathrm{E}_n[\epsilon_ir_i] > c_s^2 + a_n\mathrm{E}_n[\sigma^2\epsilon_i^2]\big)\\
&\le P\big(-2\sigma\mathrm{E}_n[\epsilon_ir_i1\{\epsilon_i^2\le t_n\}] > c_s^2 + a_n\sigma^2a\log n\big) + P\big(\mathrm{E}_n[\epsilon_i^2]\le a\log n\big) + nP\big(\epsilon_i^2 > t_n\big)\\
&\le \frac{4\sigma^2c_s^2\log t_n}{n\big(c_s^2+a_n\sigma^2a\log n\big)^2} + \frac{1}{n^{1/2}(1/6-a)^2} + \tau/2. \qquad\Box
\end{aligned}$$
Proof of Theorem 12. The result follows from Theorem 4 and Lemma 15, which provide bounds for $\lambda$ and for $\sqrt{\hat Q(\beta_0)} \le c_s + \sigma\sqrt{\mathrm{E}_n[\epsilon_i^2]}$. Note the simplification
$$\frac{4\sigma^2c_s^2\log t_n}{n\big(c_s^2 + a_n\sigma^2a\log n\big)^2} \le \frac{2\log t_n}{n\,a_n\,a\log n}.$$
The result follows with $t_n = 4n/\tau$, $a = 1/24$, and $a_n = u_n/[1+u_n]$. □
Appendix F. Comparing Computational Methods for LASSO and √LASSO
Below we discuss in more detail the application of these methods to lasso and √LASSO. For each method, the similarities between the lasso and √LASSO formulations derived below provide theoretical justification for their similar computational performance.
Interior point methods. Interior point method (IPM) solvers typically focus on solving conic programming problems in standard form,
$$\min_w\ c'w\ :\ Aw = b,\ w\in K. \tag{F.44}$$
The main difficulty of the problem arises because the conic constraint will be binding at the optimal solution. IPMs regularize the objective function of the optimization with a barrier function, so that the optimal solution of the regularized problem naturally lies in the interior of the cone. By steadily scaling down the barrier function, an IPM creates a sequence of solutions that converges to the solution of the original problem (F.44).
In order to formulate the optimization problem associated with the lasso estimator as a conic programming problem (F.44), we let $\beta = \beta^+ - \beta^-$, and note that for any vector $v\in\mathbb{R}^n$ and any scalar $t\ge0$,
$$v'v \le t \quad\text{is equivalent to}\quad \big\|(v,(t-1)/2)\big\| \le (t+1)/2.$$
Thus the lasso optimization problem can be cast as
$$\begin{aligned}
\min_{t,\beta^+,\beta^-,a_1,a_2,v}\quad & \frac{t}{n} + \frac{\lambda}{n}\sum_{j=1}^p\big(\beta^+_j + \beta^-_j\big)\\
\text{s.t.}\quad & v = Y - X\beta^+ + X\beta^-,\\
& t = -1 + 2a_1,\qquad t = 1 + 2a_2,\\
& (v, a_2, a_1)\in Q^{n+2},\quad t\ge0,\quad \beta^+\in\mathbb{R}^p_+,\quad \beta^-\in\mathbb{R}^p_+.
\end{aligned}$$
The √LASSO optimization problem can be cast similarly, but without the auxiliary variables $a_1, a_2$:
$$\begin{aligned}
\min_{t,\beta^+,\beta^-,v}\quad & \frac{t}{\sqrt n} + \frac{\lambda}{n}\sum_{j=1}^p\big(\beta^+_j + \beta^-_j\big)\\
\text{s.t.}\quad & v = Y - X\beta^+ + X\beta^-,\\
& (v,t)\in Q^{n+1},\quad \beta^+\in\mathbb{R}^p_+,\quad \beta^-\in\mathbb{R}^p_+.
\end{aligned}$$
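Both conic formulations rest on the rotated-cone identity $v'v \le t \iff \|(v,(t-1)/2)\| \le (t+1)/2$ used for lasso above; a quick numerical check of this equivalence (our sketch, with function names that are ours):

```python
import numpy as np

def in_rotated_cone(v, t):
    # membership in the second-order cone constraint ||(v, (t-1)/2)|| <= (t+1)/2
    return np.sqrt(v @ v + ((t - 1.0) / 2.0) ** 2) <= (t + 1.0) / 2.0

rng = np.random.default_rng(0)
for _ in range(1000):
    v = rng.normal(size=3)
    t = rng.uniform(0.0, 10.0)
    # agreement with the quadratic constraint v'v <= t
    assert in_rotated_cone(v, t) == (v @ v <= t)
```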
First-order methods. The new generation of first-order methods focuses on structured convex problems that can be cast as
$$\min_w\ f(A(w)+b) + h(w)\qquad\text{or}\qquad \min_w\ h(w)\ :\ A(w)+b\in K,$$
where $f$ is a smooth function and $h$ is a structured function, possibly non-differentiable or taking extended values, that admits an efficient proximal operator; see [1]. By combining projections and (sub)gradient information, these methods construct a sequence of iterates with strong theoretical guarantees. Recently these methods have been specialized to conic problems, which include lasso and √LASSO. It is well known that the same optimization problem admits several different formulations, and the particular choice can substantially affect computational running times. We focus on simple formulations for lasso and √LASSO.
Lasso is cast as
$$\min_w\ f(A(w)+b) + h(w)$$
where $f(\cdot) = \|\cdot\|^2/n$, $h(\cdot) = (\lambda/n)\|\cdot\|_1$, $A = X$, and $b = -Y$. The proximal subproblem to be solved at every iteration, given a current point $\beta^k$, is
$$\beta(\beta^k) = \arg\min_\beta\ -2\,\mathrm{E}_n[x_i(y_i - x_i'\beta^k)]'\beta + \frac{\mu}{2}\|\beta-\beta^k\|^2 + \frac{\lambda}{n}\|\beta\|_1.$$
It follows that the minimization in $\beta$ above is separable and can be solved by soft-thresholding as
$$\beta_j(\beta^k) = \mathrm{sign}\Big(\beta^k_j + 2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta^k)]/\mu\Big)\,\max\Big\{\Big|\beta^k_j + 2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta^k)]/\mu\Big| - \lambda/[n\mu],\ 0\Big\}.$$
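A minimal Python sketch of this soft-thresholding step (our illustration; `mu` plays the role of the prox parameter µ, and all names are ours):

```python
import numpy as np

def soft_threshold(v, kappa):
    # componentwise soft-thresholding: sign(v) * max(|v| - kappa, 0)
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_prox_step(beta_k, X, Y, lam, mu):
    # one proximal-gradient step for E_n[(y_i - x_i'beta)^2] + (lam/n)||beta||_1:
    # linearize the smooth part at beta_k, take a gradient step, soft-threshold
    n = X.shape[0]
    grad = -2.0 * X.T @ (Y - X @ beta_k) / n   # gradient of E_n[(y_i - x_i'beta)^2]
    return soft_threshold(beta_k - grad / mu, lam / (n * mu))
```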
For √LASSO the "conic form" is given by
$$\min_w\ h(w)\ :\ A(w)+b\in K.$$
Letting $Q^{n+1} = \{(z,t)\in\mathbb{R}^n\times\mathbb{R} : t\ge\|z\|\}$ and $h(w) = h(\beta,t) = t/\sqrt n + (\lambda/n)\|\beta\|_1$, we have
$$\min_{\beta,t}\ \frac{t}{\sqrt n} + \frac{\lambda}{n}\|\beta\|_1\ :\ A(\beta,t)+b\in Q^{n+1},$$
where $b = (-Y',0)'$ and $A(\beta,t) = (\beta'X', t)'$.
In the associated dual problem, the dual variable $z\in\mathbb{R}^n$ is constrained to satisfy $\|z\|\le1/\sqrt n$ (the corresponding dual variable associated with $t$ is set to $1/\sqrt n$ to obtain a finite dual value). Thus, after adding the proximal term around the current point $\beta^k$, we obtain
$$\max_{\|z\|\le1/\sqrt n}\ \inf_\beta\ \frac{\lambda}{n}\|\beta\|_1 + \frac{\mu}{2}\|\beta-\beta^k\|^2 + z'(Y - X\beta).$$
Given iterates $\beta^k, z^k$, as in the case of lasso the minimization in $\beta$ is separable and can be solved by soft-thresholding as
$$\beta_j(\beta^k,z^k) = \mathrm{sign}\big(\beta^k_j + (X'z^k/\mu)_j\big)\,\max\big\{\big|\beta^k_j + (X'z^k/\mu)_j\big| - \lambda/[n\mu],\ 0\big\}.$$
The dual projection accounts for the constraint $\|z\|\le1/\sqrt n$ and solves
$$z(\beta^k,z^k) = \arg\min_{\|z\|\le1/\sqrt n}\ \frac{\theta_k}{2t_k}\|z-z^k\|^2 - (Y - X\beta^k)'z,$$
which yields
$$z(\beta^k,z^k) = \frac{z^k + (t_k/\theta_k)(Y - X\beta^k)}{\big\|z^k + (t_k/\theta_k)(Y - X\beta^k)\big\|}\,\min\left\{\frac{1}{\sqrt n},\ \big\|z^k + (t_k/\theta_k)(Y - X\beta^k)\big\|\right\}.$$
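The z-update above is a gradient step in the residual direction followed by a radial projection onto the ball of radius $1/\sqrt n$; a small Python sketch (ours):

```python
import numpy as np

def dual_z_update(z_k, beta_k, X, Y, t_k, theta_k):
    # unconstrained minimizer of the quadratic part: a step along the residual
    w = z_k + (t_k / theta_k) * (Y - X @ beta_k)
    nw = np.linalg.norm(w)
    if nw == 0.0:
        return w
    # radial projection onto the Euclidean ball ||z|| <= 1/sqrt(n)
    n = X.shape[0]
    return (w / nw) * min(1.0 / np.sqrt(n), nw)
```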
Componentwise search. A common approach to solving unconstrained multivariate optimization problems is to (i) pick a component, (ii) fix all remaining components, (iii) minimize the objective function along the chosen component, and loop over steps (i)–(iii) until convergence is achieved. This is particularly attractive in cases where the minimization over a single component can be done very efficiently; its simple implementation also contributes to the widespread use of this approach.
Consider the following lasso optimization problem:
$$\min_{\beta\in\mathbb{R}^p}\ \mathrm{E}_n[(y_i - x_i'\beta)^2] + \frac{\lambda}{n}\sum_{j=1}^p\gamma_j|\beta_j|.$$
Under standard normalization assumptions we would have $\gamma_j = 1$ and $\mathrm{E}_n[x_{ij}^2] = 1$ for $j = 1,\dots,p$. Below we describe the rule that sets the value of $\beta_j$ optimally, given fixed values of the remaining variables; it is well known that the lasso optimization problem has a closed-form solution for minimizing over a single component.
For a current point $\beta$, let $\beta_{-j} = (\beta_1,\beta_2,\dots,\beta_{j-1},0,\beta_{j+1},\dots,\beta_p)'$.
• If $2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] > \lambda\gamma_j/n$, the optimal choice for $\beta_j$ is
$$\beta_j = \big(2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] - \lambda\gamma_j/n\big)\big/\big(2\,\mathrm{E}_n[x_{ij}^2]\big).$$
• If $2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] < -\lambda\gamma_j/n$, the optimal choice for $\beta_j$ is
$$\beta_j = \big(2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] + \lambda\gamma_j/n\big)\big/\big(2\,\mathrm{E}_n[x_{ij}^2]\big).$$
• If $2\big|\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]\big| \le \lambda\gamma_j/n$, we set $\beta_j = 0$.
This simple method is particularly attractive when the optimal solution is sparse, which is typically the case of interest under choices of penalty level that dominate the noise, such as $\lambda \ge cn\|S\|_\infty$.
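The componentwise rules above translate directly into a coordinate-descent loop; a minimal Python sketch under the normalization $\mathrm{E}_n[x_{ij}^2] = 1$ and $\gamma_j = 1$ (the names and fixed iteration count are ours):

```python
import numpy as np

def lasso_cd(X, Y, lam, n_iter=100):
    # componentwise minimization of E_n[(y_i - x_i'beta)^2] + (lam/n)||beta||_1,
    # assuming the columns are normalized so that E_n[x_ij^2] = 1
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            beta[j] = 0.0                             # current point is beta_{-j}
            g = 2.0 * X[:, j] @ (Y - X @ beta) / n    # 2 E_n[x_ij (y_i - x_i'beta_{-j})]
            if g > lam / n:
                beta[j] = (g - lam / n) / 2.0
            elif g < -lam / n:
                beta[j] = (g + lam / n) / 2.0
    return beta
```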
Despite the additional square root, which makes the criterion function non-separable, it turns out that the componentwise minimization for √LASSO also has a closed-form solution. Consider the following optimization problem:
$$\min_{\beta\in\mathbb{R}^p}\ \sqrt{\mathrm{E}_n[(y_i - x_i'\beta)^2]} + \frac{\lambda}{n}\sum_{j=1}^p\gamma_j|\beta_j|.$$
As before, under standard normalization assumptions we would have $\gamma_j = 1$ and $\mathrm{E}_n[x_{ij}^2] = 1$ for $j = 1,\dots,p$. Below we describe the rule that sets the value of $\beta_j$ optimally, given fixed values of the remaining variables.
• If $\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] > (\lambda/n)\gamma_j\sqrt{\hat Q(\beta_{-j})}$, we have
$$\beta_j = \frac{\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]}{\mathrm{E}_n[x_{ij}^2]} - \frac{\lambda\gamma_j}{\mathrm{E}_n[x_{ij}^2]}\cdot\frac{\sqrt{\hat Q(\beta_{-j}) - \mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]^2/\mathrm{E}_n[x_{ij}^2]}}{\sqrt{n^2 - \lambda^2\gamma_j^2/\mathrm{E}_n[x_{ij}^2]}}.$$
• If $\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] < -(\lambda/n)\gamma_j\sqrt{\hat Q(\beta_{-j})}$, we have
$$\beta_j = \frac{\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]}{\mathrm{E}_n[x_{ij}^2]} + \frac{\lambda\gamma_j}{\mathrm{E}_n[x_{ij}^2]}\cdot\frac{\sqrt{\hat Q(\beta_{-j}) - \mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]^2/\mathrm{E}_n[x_{ij}^2]}}{\sqrt{n^2 - \lambda^2\gamma_j^2/\mathrm{E}_n[x_{ij}^2]}}.$$
• If $\big|\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]\big| \le (\lambda/n)\gamma_j\sqrt{\hat Q(\beta_{-j})}$, we have $\beta_j = 0$.
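Analogously, the closed-form single-coordinate update yields the following √LASSO coordinate-descent sketch (ours; we take $\gamma_j = 1$ and assume $\lambda < n\sqrt{\mathrm{E}_n[x_{ij}^2]}$ so the denominator $n^2 - \lambda^2/\mathrm{E}_n[x_{ij}^2]$ stays positive):

```python
import numpy as np

def sqrt_lasso_cd(X, Y, lam, n_iter=100):
    # componentwise minimization of sqrt(E_n[(y_i - x_i'beta)^2]) + (lam/n)||beta||_1
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            beta[j] = 0.0
            r = Y - X @ beta                 # residuals at beta_{-j}
            A = X[:, j] @ r / n              # E_n[x_ij (y_i - x_i'beta_{-j})]
            B = X[:, j] @ X[:, j] / n        # E_n[x_ij^2]
            Q = r @ r / n                    # hat Q(beta_{-j})
            if abs(A) <= (lam / n) * np.sqrt(Q):
                continue                     # optimal beta_j is zero
            shrink = (lam / B) * np.sqrt(max(Q - A * A / B, 0.0)) \
                     / np.sqrt(n * n - lam * lam / B)
            beta[j] = A / B - np.sign(A) * shrink
    return beta
```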
References
[1] Stephen Becker, Emmanuel Candes, and Michael Grant. Templates for convex cone problems with
applications to sparse signal recovery. ArXiv, 2010.
[2] Stephen Becker, Emmanuel Candes, and Michael Grant. Tfocs v1.0 user guide (beta). 2010.
[3] A. Belloni and V. Chernozhukov. Post-ℓ1-penalized estimators in high-dimensional linear regression
models. arXiv:[math.ST], 2009.
[4] A. Belloni and V. Chernozhukov. ℓ1-penalized quantile regression for high dimensional sparse models.
accepted at the Annals of Statistics, 2010.
[5] A. Belloni, V. Chernozhukov, and L. Wang. Square-root-lasso: Pivotal recovery of sparse signals via
conic programming. arXiv:[math.ST], 2010.
[6] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals
of Statistics, 37(4):1705–1732, 2009.
[7] F. Bunea, A. Tsybakov, and M. H. Wegkamp. Sparsity oracle inequalities for the lasso. Electronic
Journal of Statistics, 1:169–194, 2007.
[8] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation and sparsity via ℓ1 penalized least
squares. In Proceedings of 19th Annual Conference on Learning Theory (COLT 2006) (G. Lugosi and
H. U. Simon, eds.), pages 379–391, 2006.
[9] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. The Annals of
Statistics, 35(4):1674–1697, 2007.
[10] E. Candes and Y. Plan. Near-ideal model selection by l1 minimization. Ann. Statist., 37(5A):2145–2177,
2009.
[11] E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Ann.
Statist., 35(6):2313–2351, 2007.
[12] V. H. de la Pena, T. L. Lai, and Q.-M. Shao. Self-Normalized Processes. Springer, 2009.
[13] J. Fan and J. Lv. Sure independence screening for ultra-high dimensional feature space. Journal of the
Royal Statistical Society Series B, 70(5):849–911, 2008.
[14] W. Feller. An Introduction to Probability Theory and Its Applications, Volume II. Wiley Series in Probability and Mathematical Statistics, 1966.
[15] V. Koltchinskii. Sparsity in penalized empirical risk minimization. Ann. Inst. H. Poincaré Probab. Statist., 45(1):7–57, 2009.
[16] K. Lounici. Sup-norm convergence rate and sign concentration property of lasso and dantzig estimators.
Electron. J. Statist., 2:90–102, 2008.
[17] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task
learning. arXiv:0903.1468v1 [stat.ML], 2010.
[18] Z. Lu. Gradient based method for cone programming with application to large-scale compressed sensing.
Technical Report, 2008.
[19] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data.
Annals of Statistics, 37(1):2246–2270, 2009.
[20] Y. Nesterov. Smooth minimization of non-smooth functions, mathematical programming. Mathematical
Programming, 103(1):127–152, 2005.
[21] Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related
problems. Mathematical Programming, 109(2-3):319–344, 2007.
[22] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society
for Industrial and Applied Mathematics (SIAM), Philadelphia, 1993.
[23] J. Renegar. A Mathematical View of Interior-Point Methods in Convex Optimization. Society for In-
dustrial and Applied Mathematics (SIAM), Philadelphia, 2001.
[24] M. Rosenbaum and A. B. Tsybakov. Sparse recovery under matrix uncertainty. arXiv:0812.2818v1
[math.ST], 2008.
[25] H. P. Rosenthal. On the subspaces of lp (p > 2) spanned by sequences of independent random variables.
Israel J. Math., 9:273–303, 1970.
[26] Haskell P. Rosenthal. On the subspaces of Lp (p > 2) spanned by sequences of independent random
variables. Israel J. Math., 8:273–303, 1970.
[27] Herman Rubin and J. Sethuraman. Probabilities of moderate deviations. Sankhya Ser. A, 27:325–346, 1965.
[28] A. D. Slastnikov. Limit theorems for moderate deviation probabilities. Theory of Probability and its
Applications, 23:322–340, 1979.
[29] A. D. Slastnikov. Large deviations for sums of nonidentically distributed random variables. Teor. Veroy-
atnost. i Primenen., 27(1):36–46, 1982.
[30] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58:267–288,
1996.
[31] K. C. Toh, M. J. Todd, and R. H. Tutuncu. Sdpt3 — a matlab software package for semidefinite
programming. Optimization Methods and Software, 11:545–581, 1999.
[32] K. C. Toh, M. J. Todd, and R. H. Tutuncu. On the implementation and usage of sdpt3 – a matlab
software package for semidefinite-quadratic-linear programming, version 4.0. Handbook of semidefinite,
cone and polynomial optimization: theory, algorithms, software and applications, Edited by M. Anjos
and J.B. Lasserre, June 2010.
[33] R. H. Tutuncu, K. C. Toh, and M. J. Todd. SDPT3 — a MATLAB software package
for semidefinite-quadratic-linear programming, version 3.0. Technical report, 2001. Available at
http://www.math.nus.edu.sg/∼mattohkc/sdpt3.html.
[34] S. A. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics,
36(2):614–645, 2008.
[35] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Series in
Statistics, 1996.
[36] Bengt von Bahr and Carl-Gustav Esseen. Inequalities for the rth absolute moment of a sum of random
variables, 1 ≤ r ≤ 2. Ann. Math. Statist, 36:299–303, 1965.
[38] M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-
constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183–2202,
May 2009.
[39] C.-H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional linear
regression. Ann. Statist., 36(4):1567–1594, 2008.