arXiv:1105.1475v2 [stat.ME] 15 May 2011
PIVOTAL ESTIMATION OF NONPARAMETRIC FUNCTIONS VIA
SQUARE-ROOT LASSO
ALEXANDRE BELLONI, VICTOR CHERNOZHUKOV, AND LIE WANG
Abstract. In a nonparametric linear regression model we study a variant of LASSO, called √LASSO, which does not require knowledge of the scaling parameter σ of the noise or bounds for it. This work derives new finite-sample upper bounds for the prediction norm rate of convergence, ℓ1-rate of convergence, ℓ∞-rate of convergence, and sparsity of the √LASSO estimator. A lower bound for the prediction norm rate of convergence is also established.
In many non-Gaussian noise cases, we rely on moderate deviation theory for self-normalized sums and on new data-dependent empirical process inequalities to achieve Gaussian-like results provided log p = o(n^{1/3}), improving upon results derived in the parametric case that required log p ≲ log n.
In addition, we derive finite-sample bounds on the performance of ordinary least squares (OLS) applied to the model selected by √LASSO, accounting for possible misspecification of the selected model. In particular, we provide mild conditions under which the rate of convergence of OLS post-√LASSO is not worse than that of √LASSO.
We also study two extreme cases: parametric noiseless and nonparametric unbounded variance. √LASSO has interesting theoretical guarantees in these two extreme cases. In the parametric noiseless case, unlike LASSO, √LASSO is capable of exact recovery. In the unbounded variance case it can still be consistent since its penalty choice does not depend on σ.
Finally, we conduct Monte Carlo experiments which show that the empirical performance of √LASSO is very similar to the performance of LASSO when σ is known. We also emphasize that √LASSO can be formulated as a convex programming problem and its computational burden is similar to that of LASSO. We provide theoretical and empirical evidence of that.
Date: First Version: February 4, 2009. Last Revision: May 17, 2011.
1. Introduction
We consider the problem of recovering a nonparametric regression function, where the underlying function of interest has an unknown functional form in basic covariates. To be more specific, we consider the nonparametric regression model:
yi = f(zi) + σǫi, i = 1, . . . , n, (1.1)
where yi are the outcomes, zi are vectors of fixed basic covariates, ǫi are i.i.d. noises, f is
the regression function, and σ is a scaling parameter. Our goal is to recover the regression
function f . To achieve this goal, we use linear combinations of regressors xi = P (zi) to
approximate f , where P (zi) is a p-vector of transformations of zi. We are interested in the
high dimension low sample size case, in which we potentially have p ≥ n, to attain a flexible
functional form. In particular, we are interested in a sparse model over the regressors xi to
describe the regression function.
Now the model is written as
yi = x′iβ0 + ui, ui = ri + σǫi.
In the above expression, ri := fi − x′iβ0 is the approximation error, where fi = f(zi).
Assume that the cardinality of the support of coefficient β0 is
s := |T | = ‖β0‖0,
where T = supp(β0). It is well known that ordinary least squares is generally inconsistent when p > n. However, the sparsity assumption makes it possible to estimate these models effectively by searching for approximately the right set of regressors. In particular, ℓ1-based penalization methods have played a central role in this question. Many papers have studied the estimation of high-dimensional mean regression models with the ℓ1-norm acting as a penalty function [6, 11, 15, 19, 34, 39, 38]. In these references, under an appropriate choice of penalty level, it was demonstrated that the ℓ1-penalized least squares estimators achieve the rate σ√(s/n)·√(log p), which is very close to the oracle rate σ√(s/n) achievable when the true model is known. We refer to [4, 6, 8, 9, 7, 13, 16, 17, 24, 34] for many other developments and a more detailed review of the existing literature.
An important ℓ1-based estimator proposed in [30] is the LASSO estimator, defined as

β̂L ∈ arg min_{β∈ℝp} Q(β) + (λ/n)‖β‖1,   (1.2)
where, for observations of a response variable yi and regressors xi, Q(β) = En[(yi − xi′β)²] = (1/n)∑_{i=1}^n (yi − xi′β)², ‖β‖1 = ∑_{j=1}^p |βj|, and λ is the penalty level. We note that the LASSO estimator minimizes a convex function. Therefore, from a computational complexity perspective, (1.2) is a computationally efficient (polynomial time) alternative to exhaustive combinatorial search over all possible submodels.
The performance of LASSO relies heavily on the penalization parameter λ which should
majorate the non-negligible spurious correlation between the noise terms and the large
number of additional regressors. The typical choice of λ is proportional to the unknown
scaling parameter σ of the noise (typically the standard deviation). Simple upper bounds for
σ can be derived based on the empirical variance of the response variable. However, upper
bounds on σ can lead to unnecessary over regularization which translates into larger bias
and slower rates of convergence. Moreover, such over regularization can lead to the exclusion
of relevant regressors from the selected model harming post model selection estimators.
In this paper, we have three sets of main results. The first contribution is to study a variant of (1.2), called √LASSO, which does not require knowledge of σ or bounds on it but is still computationally attractive. The √LASSO estimator is defined as

β̂ ∈ arg min_{β∈ℝp} √Q(β) + (λ/n)‖β‖1.   (1.3)
In the parametric model studied in [5], the choice of the penalty parameter becomes pivotal given the covariates and the distribution of the error term. In contrast, in the nonparametric setting we need to account for the impact of the approximation error to derive a practical and theoretically justified choice of penalty level. We rely on moderate deviation theory for self-normalized sums and on data-dependent empirical process inequalities to achieve Gaussian-like results in many non-Gaussian cases provided log p = o(n^{1/3}), improving upon results derived in the parametric case that required log p ≲ log n; see [5]. We perform a thorough non-asymptotic theoretical analysis of the choice of the penalty parameter.
The second set of contributions is to derive upper bounds for the prediction norm rate of convergence, ℓ1-rate of convergence, ℓ∞-rate of convergence, and sparsity of the √LASSO estimator. A lower bound on the rate of convergence in the prediction norm is also established. Furthermore, we also study two extreme cases: (i) parametric noiseless and (ii)
nonparametric unbounded variance. √LASSO has interesting theoretical guarantees in these two extreme cases. In the parametric noiseless case, for a wide range of penalty levels, √LASSO achieves exact recovery, in sharp contrast to LASSO. In the nonparametric unbounded variance case, the √LASSO estimator can still be consistent since its penalty choice does not depend on the standard deviation of the noise. We develop the necessary modifications of the oracle definition and penalty level, and derive finite-sample bounds for the case in which the noise has a Student's t-distribution with 2 degrees of freedom.
The third contribution aims to remove the potentially significant bias towards zero introduced by the ℓ1-norm regularization employed in (1.3). We consider the post-model-selection estimator that applies ordinary least squares (OLS) regression to the model T̂ selected by √LASSO. Formally, set

T̂ = supp(β̂) = {j ∈ {1, . . . , p} : |β̂j| > 0},

and define the OLS post-√LASSO estimator β̃ as

β̃ ∈ arg min_{β∈ℝp} √Q(β) : βj = 0 if j ∈ T̂c.   (1.4)
It follows that if the model selection works perfectly (i.e., T̂ = T) then the OLS post-√LASSO estimator is simply the oracle estimator, whose properties are well known. Unfortunately, perfect model selection might be unlikely for many designs of interest. This is usually the case in a nonparametric setting. Thus, we are also interested in the properties of OLS post-√LASSO when T̂ ≠ T, including cases where T ⊈ T̂.
Finally, we emphasize that√LASSO can be formulated as a convex programming problem
which allows us to rely on many efficient algorithmic implementations to compute the
estimator (interior point methods [31, 32], first order methods [1, 2], and coordinatewise
methods). Importantly, the computation cost of√LASSO is similar to LASSO. We conduct
Monte carlo experiments which show that the empirical performance of√LASSO is very
similar to the performance of LASSO when σ is known.
2. Nonparametric Regression Model
Recall the nonparametric regression model:
yi = f(zi) + σǫi, ǫi ∼ F0, E[ǫi] = 0, E[ǫi²] = 1, i = 1, . . . , n.   (2.5)
The model can also be written as
yi = x′iβ0 + ui, ui = ri + σǫi. (2.6)
In many applications of interest there is no exact sparse model or, due to noise, it might
be inefficient to rely on an exact model. However, there might be a sparse model that yields
a good approximation to the true regression function f in equation (2.5). In this case, the
target linear combination is given by any vector β0 that solves the following “oracle” risk
minimization problem:
min_{β∈ℝp} En[(fi − xi′β)²] + σ² ‖β‖0/n,   (2.7)

where En[zi] = (1/n)∑_{i=1}^n zi, fi = f(zi), and the corresponding cardinality of its support T = supp(β0) is

s := |T| = ‖β0‖0.
The oracle balances the approximation error En[(fi − xi′β)²] with the variance term σ²‖β‖0/n, where the latter is determined by the complexity of the model, i.e., the number of non-zero coefficients of β. The average squared error from approximating fi by xi′β0 is denoted by

cs² := En[ri²] = En[(fi − xi′β0)²],

so that cs² + σ²s/n is the optimal value of (2.7). In general the support of the best sparse approximation T = supp(β0) is unknown since we do not observe fi.
We consider the case of fixed design, namely we treat the covariate values x1, . . . , xn as
fixed. Without loss of generality, we normalize the covariates so that

En[xij²] = 1 for j = 1, . . . , p.   (2.8)
We summarize the previous setting in the following condition.
Condition 1 (ASM). We have data {(yi, zi) : i = 1, . . . , n} that for each n obey the
regression model (2.5), which admits the approximately sparse form (2.6) with β0 defined
by (2.7). The regressors xi = P (zi) are normalized as in (2.8).
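In practice the normalization (2.8) can always be imposed by rescaling each column of the design matrix; a small sketch (assuming a numpy array `X` holding the design):

```python
import numpy as np

def normalize_columns(X):
    """Rescale each column of X so that En[x_ij^2] = (1/n)*sum_i x_ij^2 = 1,
    as in (2.8).  Returns the rescaled design and the scale factors."""
    scale = np.sqrt((X ** 2).mean(axis=0))   # sqrt(En[x_ij^2]) per column
    return X / scale, scale
```

Coefficients estimated on the rescaled design must be divided by `scale` to return to the original parameterization.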
The main focus of the literature is on deriving rate-of-convergence results in the prediction norm, which measures the accuracy of predicting the true regression function over the design points x1, . . . , xn: ‖δ‖2,n := √(En[(xi′δ)²]). It follows that the least squares criterion Q(β) = En[(yi − xi′β)²] satisfies

Q(β) − Q(β0) = ‖β − β0‖2,n² − 2En[(σǫi + ri)xi′(β − β0)].   (2.9)
Thus, Q(β) − Q(β0) provides noisy information about ‖β − β0‖2,n², and the discrepancy is controlled by the noise and approximation terms 2En[σǫixi] and 2En[rixi]. Moreover, the estimation of β0 in the prediction norm allows us to bound the empirical risk via the triangle inequality:

√(En[(xi′β̂ − fi)²]) ≤ ‖β̂ − β0‖2,n + cs.   (2.10)
2.1. Conditions on the Gram Matrix. It is known that the Gram matrix En[xixi′] plays an important role in the analysis of estimators in this setup. In our case, the smallest eigenvalue of the Gram matrix is zero if p > n, which creates potential identification problems. Thus, to restore identification, one needs to restrict the type of deviation vectors δ from β0 that we will consider. It will be important to consider vectors δ that belong to the restricted set ∆c̄ defined as

∆c̄ = {δ ∈ ℝp : ‖δTc‖1 ≤ c̄‖δT‖1, δ ≠ 0}, for c̄ ≥ 1.
We will state the bounds in terms of the following restricted eigenvalue of the Gram matrix En[xixi′]:

κc̄ := min_{δ∈∆c̄} √s ‖δ‖2,n / ‖δT‖1.   (2.11)

The restricted eigenvalue can depend on n and T, but we suppress this dependence in our notation. The restricted eigenvalue (2.11) is a variant of the restricted eigenvalue introduced in Bickel, Ritov and Tsybakov [6].
Next consider the minimal and maximal m-sparse eigenvalues of the Gram matrix:

φmin(m) := min_{‖δTc‖0≤m, δ≠0} ‖δ‖2,n²/‖δ‖2²,  and  φmax(m) := max_{‖δTc‖0≤m, δ≠0} ‖δ‖2,n²/‖δ‖2².   (2.12)
They also play an important role in the analysis. Moreover, sparse eigenvalues provide a simple sufficient condition to bound restricted eigenvalues. Indeed, following [6], we can bound κc̄ from below by

κc̄ ≥ max_{m>0} √(φmin(m)) (1 − √(φmax(m)/φmin(m)) · c̄ · √(s/m)).

Thus, if the m-sparse eigenvalues are bounded away from zero and from above,

0 < k ≤ φmin(m) ≤ φmax(m) ≤ k′ < ∞, for all m ≤ 4(k′/k)²c̄²s,   (2.13)

then κc̄ ≥ √(φmin(4(k′/k)²c̄²s))/2. We note that (2.13) only requires the eigenvalues of certain "small" m×m submatrices of the large p×p Gram matrix to be bounded from above and below. Many sufficient conditions for (2.13) are provided by [6], [39], and [19]. Bickel, Ritov, and Tsybakov [6] and others also provide different sets of sufficient primitive conditions for κc̄ to be bounded away from zero.
3. Pivotal Penalty Level for √LASSO
Here we propose a pivotal, data-driven choice for the regularization parameter value λ
in the case that the distribution of the disturbances is known up to a scaling factor (in the
traditional case we would have unknown variance but normality of ǫi). Since the objective
function in the optimization problem (1.3) is not pivotal in either small or large samples,
finding a pivotal λ appears to be difficult a priori. However, the insight is to consider
the gradient at the true parameter that summarizes the noise in the estimation problem.
We note that the principle of setting λ to dominate the score of the criterion function is
motivated by [6]’s choice of penalty level for LASSO. In fact, this is a general principle
that carries over to other convex problems and that leads to the near-oracle performance of
ℓ1-penalized estimators.
The key quantity determining the choice of the penalty level for √LASSO is the score

S := En[xi(σǫi + ri)] / √(En[(σǫi + ri)²])  if En[(σǫi + ri)²] > 0,  and S := 0 otherwise.
The score S summarizes the estimation noise and approximation error in our problem, and we may set the penalty level λ/n to dominate this term. For efficiency reasons, we set λ/n at the smallest level that dominates the estimation noise, namely we choose the smallest λ such that

λ ≥ cΛ, for Λ := n‖S‖∞,   (3.14)
with a high probability, say 1 − α, where Λ is the maximal score scaled by n, and c > 1 is a theoretical constant of [6] to be stated later. The event (3.14) implies that

β̂ − β0 ∈ ∆c̄, where c̄ = (c + 1)/(c − 1),

so that rates of convergence can be attained using the restricted eigenvalue κc̄.
It is worth pointing out that in the parametric case, ri = 0, i = 1, . . . , n, the score S
does not depend on the unknown scaling parameter σ or the unknown true parameter value
β0. Therefore, the score is pivotal with respect to these parameters. Moreover, under the
additional classical normality assumption, namely F0 = Φ, the score is in fact completely
pivotal, conditional on X. This means that in principle we know the distribution of S in
this case, or at least we can compute it by simulation, see [5].
However, in the nonparametric case the design X also determines the unknown approximation errors ri. To achieve an implementable choice of penalty level, we will consider the following random variable:

Λ := (n‖En[σǫixi]‖∞ + σ√n) / √(En[σ²ǫi²]) = (n‖En[ǫixi]‖∞ + √n) / √(En[ǫi²]).   (3.15)
Λ also does not depend on the unknown scaling parameter σ or the unknown true parameter value β0, and therefore it is pivotal with respect to these parameters. We will show that

if √(En[σ²ǫi²]) ≤ (1 + un)√(En[(σǫi + ri)²]), then the maximal score obeys n‖S‖∞ ≤ (1 + un)Λ.   (3.16)

It will follow that the condition above is satisfied with high probability even for un = o(1).
We propose two choices for the penalty level λ. For 0 < α < 1, c > 1, and un ≥ 0, define:

exact: λ = (1 + un) · c · Λ(1 − α|X),
asymptotic: λ = (1 + un) · c · √n Φ⁻¹(1 − α/2p),   (3.17)

where Φ(·) is the cumulative distribution function of a standard Gaussian random variable, and Λ(1 − α|X) := (1 − α)-quantile of Λ|X, ǫi ∼ F0. We can accurately approximate the latter by simulation and the former by numerical integration.
The parameter 1 − α is a confidence level which guarantees near-oracle performance with probability at least 1 − α; we recommend 1 − α = 95%. The constant c > 1 is the slack parameter in (3.14) used in [6]; we recommend c = 1.1. The parameter un is intended to account for the approximation errors so as to achieve (3.16); we recommend un = 0.05. These recommendations are valid in both finite and large samples under the conditions stated below. They are also supported by the finite-sample experiments reported in Section 5.
The exact option is applicable when we know the distribution of the errors F0, for example in the classical Gaussian errors case. As stated previously, we can approximate the quantiles Λ(1 − α|X) by simulation; therefore, we can implement the exact option. The asymptotic option is applicable when F0 and the design X satisfy some moment conditions stated below. We recommend using the exact option, when applicable, since it is better tailored to the given design X and sample size n. However, the asymptotic option is trivial to compute and it often provides a very similar penalty level.
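For illustration, both options in (3.17) can be computed in a few lines; the sketch below assumes Gaussian errors for the exact option (simulating the quantile of Λ|X in (3.15), as suggested above) and uses the stdlib normal quantile for the asymptotic option. Function and variable names are ours, not the paper's:

```python
import numpy as np
from statistics import NormalDist

def penalty_asymptotic(n, p, alpha=0.05, c=1.1, u_n=0.05):
    """Asymptotic option in (3.17):
    (1 + u_n) * c * sqrt(n) * Phi^{-1}(1 - alpha/(2p))."""
    return (1 + u_n) * c * np.sqrt(n) * NormalDist().inv_cdf(1 - alpha / (2 * p))

def penalty_exact(X, alpha=0.05, c=1.1, u_n=0.05, n_sim=1000, seed=0):
    """Exact option in (3.17): (1 + u_n) * c * (1 - alpha)-quantile of
    Lambda | X in (3.15), simulated here with eps_i ~ N(0, 1)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    lam_draws = np.empty(n_sim)
    for k in range(n_sim):
        eps = rng.standard_normal(n)
        # Lambda = (n * ||En[eps_i x_i]||_inf + sqrt(n)) / sqrt(En[eps_i^2])
        num = n * np.max(np.abs(X.T @ eps / n)) + np.sqrt(n)
        lam_draws[k] = num / np.sqrt(np.mean(eps ** 2))
    return (1 + u_n) * c * np.quantile(lam_draws, 1 - alpha)
```

For example, with n = 100 and p = 50 the asymptotic choice is (1.05)(1.1)·10·Φ⁻¹(0.9995) ≈ 38; the exact option carries the additional √n approximation-error term from (3.15), so it is somewhat larger.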
3.1. Analysis of the Penalty Choices. In this section we formally analyze the penalty choices described in (3.17). In particular, we are interested in establishing bounds on the probability of (3.14) occurring.

In order to achieve Gaussian-like behavior for non-Gaussian disturbances we have to rely on some moment conditions for the noise and some restrictions on the growth of p relative to n, and also consider α that is either bounded away from zero or approaches zero not too rapidly. In this section we focus on the following set of conditions.
Condition M'. There exists a finite constant q > 8 such that the disturbance obeys sup_{n≥1} E_{F0}[|ǫ|^q] < ∞, and the design X obeys sup_{n≥1} max_{1≤j≤p} En[|xij|^q] < ∞.
Condition R'. Let rn = (α⁻¹(log n) Cq E[|ǫ|^{q∨4}])^{1/q}/n^{1/4} < 1/2, and define the constants

c1 = 1 + 4√(log(6p(log n)/α)/n) · (3E[ǫi^q]α/log n)^{4/q} · max_{1≤j≤p}(En[xij⁸])^{1/4}  and  c2 = 1/(1 − 3^{1/q} rn).

Also, assume that n^{1/6} ≥ (Φ⁻¹(1 − α/2p) + 1) max_{1≤j≤p}(En[|xij|³] E[|ǫi|³])^{1/3}. Let un ≥ 0 be such that un Φ⁻¹(1 − α/2p) ≥ 4/(1 − rn) and un ≥ 4(√(c1c2) − 1).
The growth condition depends on the number of bounded moments q of the regressors and of the noise term. Under condition M' and fixed α, condition R' is satisfied if log p = o(n^{1/3}). This is asymptotically less restrictive than the condition log p ≤ (q − 2) log n required in [5]. However, condition M' is more stringent than some conditions in [5], so neither set of conditions dominates the other. (Due to similarities, we defer the result under the conditions considered in [5] to the appendix.)
The main insight of the new analysis is the use of the theory of moderate deviations for self-normalized sums. Theorems 1 and 2 show that the options in (3.17) implement the regularization event λ > cΛ in the non-Gaussian case with asymptotic probability at least 1 − α. Theorem 3 bounds the magnitude of the penalty level λ for the exact option.
Theorem 1 (Coverage of the Exact Penalty Level). Suppose that conditions ASM, M' and R' hold. Then, the exact option in (3.17) implements λ > cΛ with probability at least

1 − α(1 + 1/log n) − 4(1 + un)/(n un).
Theorem 2 (Coverage of the Asymptotic Penalty Level). Suppose that conditions ASM, M' and R' hold. Then, the asymptotic option in (3.17) implements λ > cΛ with probability at least

1 − α(1 + A/ℓn³ + 2/log n) − 20/(n(un ∧ 1)),

where ℓn = n^{1/6}/[(Φ⁻¹(1 − α/2p) + 1) max_{1≤j≤p}(En[|xij|³] E[|ǫi|³])^{1/3}], and A is a universal constant.
Theorem 3 (Bound on the Exact Penalty Level). Suppose that conditions ASM, M' and R' hold. Then, the approximate score Λ(1 − α|X) in (3.15) satisfies

Λ(1 − α|X) ≤ 2√n + √n Φ⁻¹(1 − α/2p) √(c1c2) (1 + √(2 log(3 + 3A/ℓn³))/Φ⁻¹(1 − α/2p)),

where ℓn = n^{1/6}/[(Φ⁻¹(1 − α/2p) + 1) max_{1≤j≤p}(En[|xij|³] E[|ǫi|³])^{1/3}], A is a universal constant, and Φ⁻¹(1 − α/2p) ≤ √(2 log(p/α)).
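The final inequality above is the standard Gaussian tail bound; for completeness, a one-line derivation:

```latex
\text{For } t \ge 0,\qquad 1-\Phi(t) \;\le\; \tfrac12\, e^{-t^2/2}.
\text{ Taking } t=\sqrt{2\log(p/\alpha)} \text{ gives }
1-\Phi(t) \;\le\; \tfrac12\, e^{-\log(p/\alpha)} \;=\; \frac{\alpha}{2p},
\text{ hence, by monotonicity of } \Phi^{-1},\quad
\Phi^{-1}\!\left(1-\alpha/2p\right) \;\le\; \sqrt{2\log(p/\alpha)}.
```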
Under conditions on the growth of p relative to n, these theorems establish that many nice properties of the penalty level in the Gaussian case continue to hold in many non-Gaussian cases. The following corollary summarizes the asymptotic behavior of the penalty choices in (3.17).

Corollary 1. Suppose that conditions ASM, M' and R' hold, and λ is chosen following either the exact or the asymptotic rule in (3.17). If α/p = o(1), log(p/α) = o(n^{1/3}), and 1/α = o(n/log p), then there exists un = o(1) such that λ satisfies

P(λ ≥ cΛ|X) ≥ 1 − α(1 + o(1)) and λ ≤ (1 + o(1)) c √(2n log(p/α)).
4. Finite-Sample Analysis of √LASSO

Next we establish several finite-sample results regarding the √LASSO estimator. We highlight several differences between the √LASSO analysis conducted here and the traditional analysis of LASSO. First, we do not assume Gaussian or sub-Gaussian noise. Second, the value of the scaling parameter σ is unknown, and the penalty level λ does not depend on it. Third, a mild side condition is required to hold, which ensures that the penalty does not overrule the identification. Fourth, most of the analysis of this section is conditional not only on the covariates x1, . . . , xn, but also on the noise ǫ1, . . . , ǫn, through the event λ ≥ cΛ. Therefore, by choosing λ as in (3.17) the event λ ≥ cΛ occurs with high probability and the stated results hold.
4.1. Finite-Sample Bounds on Different Norms. We begin with a finite-sample bound for the prediction norm which is similar to the bound obtained by the LASSO estimator that knows σ.
Theorem 4 (Finite-Sample Bounds on the Estimation Error). Under condition ASM, let c > 1, c̄ = (c + 1)/(c − 1), and suppose that λ obeys the growth restriction λ√s < nκc̄. If λ ≥ cΛ, then

‖β̂ − β0‖2,n ≤ [2(1 + 1/c)/(1 − (λ√s/(nκc̄))²)] · √Q(β0) · λ√s/(nκc̄).
We recall that the choice of λ does not depend on the scaling parameter σ. The impact of σ on the bound above comes through the factor

√Q(β0) ≤ σ√(En[ǫi²]) + cs.

Thus, this result leads to the same rate of convergence as in the case of the LASSO estimator that knows σ, since En[ǫi²] concentrates around one under (2.5) by the law of large numbers.
As mentioned before, the analysis of √LASSO raises several issues different from those of LASSO, and so the proof of Theorem 4 is different. In particular, we need to invoke the additional growth restriction λ√s < nκc̄, which is not present in the LASSO analysis that treats σ as known. This is required because the introduction of the square root removes the quadratic growth that would eventually dominate the ℓ1 penalty for large enough deviations from β0. The condition ensures that the penalty is not too large, so that identification of β0 is still possible. Note, however, that when this side condition fails and σ is bounded away from zero, LASSO itself is not guaranteed to be consistent, since its rate of convergence is typically given by σλ√s/[nκc̄].
Also, the event λ ≥ cΛ accounts for the approximation errors r1, . . . , rn. This has two implications. First, the impact of cs on the estimation of β0 is diminished by a factor of λ√s/[nκc̄]. Second, despite the approximation errors, we have β̂ − β0 ∈ ∆c̄. This is in contrast to analyses that relied on λ ≥ cn‖En[ǫixi]‖∞ instead; see [6, 3]. We build on the latter to establish the ℓ1- and ℓ∞-rates of convergence.
Theorem 5 (ℓ1-rate of convergence). Under condition ASM, if λ ≥ cΛ, for c > 1 and c̄ := (c + 1)/(c − 1), then

‖β̂ − β0‖1 ≤ (1 + c̄) √s ‖β̂ − β0‖2,n / κc̄.
Theorem 6 (ℓ∞-rate of convergence). Under condition ASM, if λ ≥ cΛ, for c > 1 and c̄ = (c + 1)/(c − 1), then

‖β̂ − β0‖∞ ≤ (1 + 1/c) λ√Q(β0)/n + (λ²/n² + ‖En[xixi′] − I‖∞) ‖β̂ − β0‖1.
Regarding the ℓ∞-rate, since we have ‖·‖∞ ≤ ‖·‖1, the result is meaningful for nearly orthogonal designs, for which ‖En[xixi′] − I‖∞ is small. In fact, near orthogonality also allows us to bound the restricted eigenvalue κc̄ from below: [6] and [16] have established that if for some u ≥ 1 we have ‖En[xixi′] − I‖∞ ≤ 1/(u(1 + c̄)s), then κc̄ ≥ √(1 − 1/u).
We close this subsection by establishing a relative finite-sample bound on the estimation of Q(β0) by Q(β̂), under the assumptions of Theorem 4.

Theorem 7 (Relative Estimation of Q(β0)). Under condition ASM, if λ ≥ cΛ and λ√s < nκc̄, for c > 1 and c̄ := (c + 1)/(c − 1), we have

−(4c̄/c) · [(λ√s/(nκc̄))²/(1 − (λ√s/(nκc̄))²)] · √Q(β0) ≤ √Q(β̂) − √Q(β0) ≤ 2(1 + 1/c) · [(λ√s/(nκ1))²/(1 − (λ√s/(nκ1))²)] · √Q(β0).
Thus, if in addition λ√s = o(nκc̄) holds, Theorem 7 establishes that √Q(β̂) = (1 + o(1))√Q(β0). The quantity √Q(β̂) is particularly relevant for √LASSO since it appears in the first-order condition, which is the key to establishing sparsity properties.
4.2. Finite-Sample Bounds Relating Sparsity and Prediction Norm. In this section we investigate sparsity properties and lower bounds on the rate of convergence in the prediction norm of the √LASSO estimator. It turns out these results are connected via the first-order condition. We start with a technical lemma.
Lemma 1 (Relating Sparsity and Prediction Norm). Under condition ASM, let T̂ = supp(β̂) and m̂ = |T̂ \ T|. For any λ > 0 we have

(λ/n) √Q(β̂) √|T̂| ≤ √|T̂| ‖En[xi(σǫi + ri)]‖∞ + √(φmax(m̂)) ‖β̂ − β0‖2,n.
The proof of the lemma above relies on the optimality conditions, which imply that the selected support has binding dual constraints. Intuitively, for any selected component, there is a shrinkage bias which introduces a bound on how close the estimated coefficient can be to the true coefficient. Based on the inequality above and Theorem 7, we establish the following result.
Theorem 8 (Lower Bound on Prediction Norm). Under condition ASM, if λ ≥ cΛ and λ√s < nκc̄, where c > 1 and c̄ := (c + 1)/(c − 1), and letting T̂ = supp(β̂) and m̂ = |T̂ \ T|, we have

‖β̂ − β0‖2,n ≥ [λ√|T̂| / (n√(φmax(m̂)))] · √Q(β0) · [1 − 1/c − (4c̄/c) · (λ√s/(nκc̄))²/(1 − (λ√s/(nκc̄))²)].
In the case of LASSO, as derived in [17], the lower bound does not contain the term √Q(β0), since the impact of the scaling parameter σ is accounted for in the penalty level λ. Thus, under condition ASM, the lower bounds for LASSO and √LASSO are very close.

Next we proceed to bound the size of the selected support T̂ = supp(β̂) for the √LASSO estimator relative to the size s of the support of the oracle estimator β0.
Theorem 9 (Sparsity Bound for √LASSO). Under condition ASM, let β̂ denote the √LASSO estimator, T̂ = supp(β̂), and m̂ := |T̂ \ T|. If λ ≥ cΛ and λ√s ≤ nκc̄ρ, where 2ρ² ≤ (c − 1)/(c − 1 + 4c̄), for c > 1 and c̄ = (c + 1)/(c − 1), we have

m̂ ≤ s · [min_{m∈M} φmax(m)] · (4c̄/κc̄)²,

where M = {m ∈ ℕ : m > s φmax(m) · 2(4c̄/κc̄)²}.
The slightly more stringent side condition ensures that the right-hand side of the bound in Theorem 8 is positive. Asymptotically, under mild conditions on the design matrix, for example

φmax(s log n)/φmin(s log n) ≲ 1,

the event (3.14) and the side condition s log(p/α) = o(n) imply that, for n large enough, the size of the selected model is of the same order of magnitude as that of the oracle model, namely m̂ ≲ s.
4.3. Finite-Sample Bounds on the Estimation Error of OLS post-√LASSO. Based on the model selected by the √LASSO estimator, T̂ := supp(β̂), we consider the OLS estimator restricted to these data-driven selected components. If model selection works perfectly (as it will under some rather stringent conditions), then this estimator is simply the oracle estimator and its properties are well known. However, of more interest is the case when model selection does not work perfectly, as occurs for many designs of interest in applications. The following theorem establishes bounds on the prediction error of the OLS post-√LASSO estimator. The analysis accounts for the data-driven choice of components and for possibly having a misspecified model (i.e., T ⊈ T̂). The analysis builds upon the sparsity and rate bounds for the √LASSO estimator, and on a data-dependent empirical process inequality.
Theorem 10 (Performance of OLS post-√LASSO). Suppose condition ASM holds, let c > 1, c̄ = (c + 1)/(c − 1), and m̂ = |T̂ \ T|. If λ ≥ cΛ occurs with probability at least 1 − α, and λ√s ≤ nκ1ρ1 for some ρ1 < 1, then for C ≥ 1, with probability at least 1 − α − 1/C² − 1/[9C² log p], we have

‖β̃ − β0‖2,n ≤ [Cσ/√(φmin(m̂))] √(s/n) + 2cs + √(m̂ log p/n) · 24Cσ √(1 ∨ max_{j=1,...,p} En[xij²ǫi²]) / √(φmin(m̂)) + 1{T ⊈ T̂} · (λ√s/(nκ1)) √Q(β0) · [4(1 + 1/c)/(1 − ρ1²)] · (1 + (1 + 1/c)ρ1²/(1 − ρ1²)).
We note that the random term in the bound above can be controlled in a variety of ways. For example, under conditions M' and R', if log p = o(n^{1/3}), Lemma 10 establishes that

max_{j=1,...,p} En[xij²ǫi²] = 1 + oP(1).
The following corollary summarizes the performance of OLS post-√LASSO under commonly used designs.

Corollary 2 (Asymptotic Performance of OLS post-√LASSO). Suppose conditions ASM, M' and R' hold, let c > 1, c̄ = (c + 1)/(c − 1) and m̂ = |T̂ \ T|. If λ is set as the exact or asymptotic choice in (3.17), α = o(1), φmax(s log n)/φmin(s log n) ≲ 1, s log(p/α) = o(n), and log p = o(n^{1/3}), we have

‖β̃ − β0‖2,n ≲P cs + σ√(s log p/n).

Moreover, if m̂ = o(s) and T ⊆ T̂ with probability going to 1,

‖β̃ − β0‖2,n ≲P cs + σ√(o(s) log p/n) + σ√(s/n),

and if T̂ = T with probability going to 1, we have

‖β̃ − β0‖2,n ≲P cs + σ√(s/n).
Under the conditions of the corollary above, the upper bounds on the rates of convergence of √LASSO and OLS post-√LASSO coincide. This occurs despite the fact that √LASSO may in general fail to correctly select the oracle model T as a subset, that is, T ⊈ T̂. Nonetheless, there is a class of well-behaved models in which the OLS post-√LASSO rate improves upon the rate achieved by √LASSO. More specifically, this occurs if m̂ = oP(s) and T ⊆ T̂ with probability going to 1, or in the case of perfect model selection, when T̂ = T with probability going to 1. Results on LASSO's model selection performance derived in Wainwright [38], Candès and Plan [10], and Belloni and Chernozhukov [3] can be extended to √LASSO based on Theorems 6 and 7. Moreover, under mild conditions, we know from Theorem 8 that the upper bounds on the rates found for √LASSO are sharp, i.e., the rate of convergence cannot be faster than σ√(log p)√(s/n). Thus the use of the post-model-selection estimator leads to a strict improvement in the rate of convergence for these well-behaved models.
4.4. Extreme Cases: Parametric Noiseless and Nonparametric Unbounded Variance. In this section we show that the robustness advantage of √LASSO over LASSO extends to two extreme cases as well: σ + cs = 0 and E[ǫi²] = ∞. Since the traditional choice of the penalty level λ for LASSO depends on σ√(E[ǫi²]), the setting λ = σ√(E[ǫi²]) · 2c√n Φ⁻¹(1 − α/2p) cannot be directly applied in either case. In contrast, the penalty of √LASSO is indirectly self-normalized by the factor √Q(β̂), which is well defined in both cases.
4.4.1. Parametric Noiseless Case. The analysis developed in the previous section immediately covers the case σ = 0 if cs > 0. The case in which cs = 0 as well, so that Q(β0) = 0, allows for exact recovery under less stringent restrictions.

Theorem 11 (Exact Recovery in the Parametric Noiseless Case). Under condition ASM, let σ = 0 and cs = 0. Suppose that λ > 0 obeys the growth restriction λ√s < nκ1. Then we have β̂ = β0.
It is worth mentioning that for any λ > 0, unless β0 = 0, LASSO cannot achieve exact
recovery. Moreover, it is not obvious how to properly set the penalty level for LASSO even
if we knew a priori that it is a parametric noiseless model. In contrast, square-root lasso
intrinsically adjusts the penalty λ by a factor of
√Q(β). Under mild conditions Theorem
7 ensures that
√Q(β) =
√Q(β0) = 0 which allows for the perfect recovery. Also note that
the lower bound derived in Theorem 8 becomes trivially zero.
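As a sanity check on the mechanism behind Theorem 11, the sketch below (illustrative only; the tiny design, the value of λ, and the grid of perturbations are our own choices, not the paper's) builds a noiseless model y = Xβ0 and verifies that the √LASSO objective √(Q̂(β)) + (λ/n)‖β‖1 strictly increases under coordinate perturbations of β0 whenever λ/n < 1: the fit term grows linearly in the perturbation size while the penalty changes by at most (λ/n) times it.

```python
import math

def sqrt_lasso_objective(X, y, beta, lam):
    # sqrt(Q(beta)) + (lam/n)*||beta||_1, with Q(beta) = En[(y_i - x_i'beta)^2]
    n = len(y)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(len(beta))) for i in range(n)]
    q = sum(r * r for r in resid) / n
    return math.sqrt(q) + (lam / n) * sum(abs(b) for b in beta)

# Tiny noiseless design (sigma = 0, c_s = 0): y = X beta0 exactly.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, -1.0],
     [1.0, 1.0, 0.0],
     [-1.0, 0.0, 1.0]]
beta0 = [2.0, -1.0, 0.0]
y = [sum(X[i][j] * beta0[j] for j in range(3)) for i in range(len(X))]

lam = 1.0  # any lam with lam/n < 1 works for this check (here n = 4)
base = sqrt_lasso_objective(X, y, beta0, lam)
# Perturbing any single coordinate raises the objective: the residual term grows
# like |t|*sqrt(En[x_j^2]) while the penalty changes by at most (lam/n)*|t|.
for j in range(3):
    for t in (0.5, -0.5):
        pert = list(beta0)
        pert[j] += t
        assert sqrt_lasso_objective(X, y, pert, lam) > base
```

At β0 itself the fit term vanishes, so the objective reduces to the pure penalty (λ/n)‖β0‖1, which is why exact recovery is possible here while LASSO's squared-loss objective always trades off some shrinkage for any λ > 0.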
4.4.2. Nonparametric Unbounded Variance. Next we turn to the unbounded variance case. Although the analysis of the prediction norm of the estimator goes through with no change, one needs to redefine the model and the oracle properly. Importantly, the exact choice of λ is still valid, but the asymptotic penalty choice of λ no longer applies. In order to accommodate unbounded variance, the main insight is to interpret σ as a scaling factor (and not the standard deviation), and to define the effective standard deviation σ̄ as

σ̄ := σ√(En[ǫi^2]) < ∞.

That allows us to properly define the oracle in finite samples. However, σ̄ → ∞ with probability one since the noise has infinite variance. Therefore, meaningful estimation of β0 is still possible provided that σ̄ does not diverge too quickly. In this case, the oracle estimator needs to be redefined as any solution to

min_{0≤k≤p∧n} min_{‖β‖0=k} En[(fi − xi'β)^2] + σ̄^2 k/n.   (4.18)
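To see why this redefinition is needed, the following simulation sketch (our own illustration; the t(2) generator and the sample sizes are assumptions, not the paper's experiment) draws t(2) noise and computes the effective standard deviation σ̄ = σ√(En[ǫi^2]): it is finite for every fixed n, yet it admits no uniform bound in n since E[ǫi^2] = ∞.

```python
import math
import random

random.seed(0)

def draw_t2():
    # Student-t with 2 degrees of freedom: Z / sqrt(V/2), with V ~ chi-squared(2);
    # a chi-squared(2) variable is Exponential with mean 2.
    z = random.gauss(0.0, 1.0)
    v = 0.0
    while v == 0.0:
        v = random.expovariate(0.5)
    return z / math.sqrt(v / 2.0)

sigma = 1.0  # a scaling factor here, not a standard deviation
second_moments = []
for n in (100, 10_000, 200_000):
    en_eps2 = sum(draw_t2() ** 2 for _ in range(n)) / n
    second_moments.append(en_eps2)
    sigma_bar = sigma * math.sqrt(en_eps2)  # effective standard deviation
    assert math.isfinite(sigma_bar) and sigma_bar > 0.0
# E[eps^2] = infinity for t(2), so En[eps^2] is finite for each fixed n but
# drifts upward (roughly like log n) rather than converging to a constant.
```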
Regarding the choice of the penalty parameter λ, we stress that the asymptotic option no longer applies and we cannot achieve Gaussian-like rates. However, the exact option for λ is still valid. In fact, the scaling parameter σ still cancels out. By inspection of the proof, Lemma 7 can be adjusted to yield ‖En[xi ri]‖∞ ≤ σ̄/√n. That motivates the definition of Λ as

Λ = n ‖En[ǫi xi]/√(En[ǫi^2])‖∞ + √n ≥ n ‖En[(σǫi + ri)xi]/√(En[σ^2 ǫi^2])‖∞.

Thus, the rate of convergence will be affected by how fast En[ǫi^2] diverges and λ/n goes to zero. That is, the final rates will depend on the particular tail properties of the distribution of the noise. The next result establishes finite-sample bounds in the case ǫi ∼ t(2), i = 1, . . . , n.
Theorem 12 (√LASSO prediction norm for ǫi ∼ t(2)). Consider a nonparametric regression model with data {(yi, xi) : i = 1, . . . , n} such that En[xij^2] = 1 for every j = 1, . . . , p, the ǫi ∼ t(2) are i.i.d., and β0 is defined as any solution to (4.18). Let x̄ = max_{1≤j≤p, 1≤i≤n} |xij|, set λ = c(1 + un)Λ(1 − α − 64/√n | X), and suppose λ√s < nκ_c. Then, for any τ ∈ (0, 1), with probability at least 1 − α − τ − 64/√n − 48(1 + un) log(4n/τ)/(n un log n), we have

‖β̂ − β0‖_{2,n} ≤ [2(1 + 1/c)/(1 − (λ√s/[nκ_c])^2)] · (cs + σ(√log(4n/τ) + 2√(2/τ))) · λ√s/(nκ_c),

where

λ ≤ c(1 + un) [ 4x̄ √(2n log(4p/α)) (√log(4n/α) + 2√(2/α)) / √(log(n)/24) + √n ].

Asymptotically, if 1/α = o(log n) and s log(p/α) = o(nκ_c), the result above yields that, with probability 1 − α(1 + o(1)),

‖β̂ − β0‖_{2,n} ≲ x̄ (cs + σ√(log n)) √(s log(p)/n),

where the scaling factor σ < ∞ is fixed. Thus, despite the infinite variance of the noise in the t(2) case, for bounded designs the √LASSO rate of convergence differs from the Gaussian case only by a √(log n) factor.
5. Empirical Performance of√LASSO Relative to LASSO
Next we proceed to evaluate the empirical performance of√LASSO relative to LASSO.
In particular we discuss their computational burden and their estimation performance.
5.1. Computational Performance of√LASSO Relative to LASSO. Since model se-
lection is particularly relevant in high-dimensional problems, the computational tractability
of the optimization problem associated with√LASSO is an important issue. It will follow
that the optimization problem associated with √LASSO can be cast as a tractable conic programming problem. Conic programming consists of the following optimization problem:

min_x c(x)  s.t.  A(x) = b,  x ∈ K,

where K is a cone, c is a linear functional, A is a linear operator, and b is an element of the codomain of A. We are particularly interested in the case where K is also convex. Convex conic programming problems have greatly extended the scope of applications of linear programming problems¹ in several fields, including optimal control, learning theory, eigenvalue optimization, combinatorial optimization, and many others. Under mild regularity conditions, the duality theory for conic programs has been fully developed and allows for the characterization of optimality conditions via dual variables, much like in linear programming.
In the past two decades, the study of the computational complexity of conic programming and the development of efficient algorithms for it have played a central role in the optimization community. In particular, for the case of self-dual cones, which encompasses the non-negative orthant, second-order cones, and the cone of positive semi-definite matrices, interior point methods have been highly specialized. A sound theoretical foundation, establishing polynomial computational complexity [22, 23], and efficient software implementations [33] have made large instances of these problems computationally tractable. More recently, first-order methods have also been proposed to approximately solve even larger instances of structured conic problems [20, 21, 18].
It follows that (1.3) can be written as a conic programming problem whose relevant cone is self-dual. Letting Q_{n+1} := {(t, v) ∈ IR × IR^n : t ≥ ‖v‖} denote the second-order cone in IR^{n+1}, we can recast (1.3) as the following conic program:

min_{t,v,β+,β−}  t/√n + (λ/n) Σ_{j=1}^p (βj+ + βj−)
s.t.  vi = yi − xi'β+ + xi'β−,  i = 1, . . . , n,
      (t, v) ∈ Q_{n+1},  β+ ≥ 0,  β− ≥ 0.   (5.19)

¹The relevant cone in linear programming is the non-negative orthant, min_w {c'w : Aw = b, w ∈ IR^k_+}.
Conic duality immediately yields the following dual problem:

max_a  En[yi ai]
s.t.  |En[xij ai]| ≤ λ/n,  j = 1, . . . , p,
      ‖a‖ ≤ √n.   (5.20)

From a statistical perspective, the dual variables represent the normalized residuals. Thus the dual problem maximizes the correlation between the dual variable a and the outcome, subject to the constraint that a is approximately uncorrelated with the regressors. It follows that these dual variables play a role in deriving necessary conditions for a component β̂j to be non-zero and therefore in sparsity bounds.
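The equivalence between the √LASSO objective and the conic program (5.19) can be checked numerically: at the natural feasible point t = ‖v‖, vi = yi − xi'β, β = β+ − β−, the conic objective coincides with √(Q̂(β)) + (λ/n)‖β‖1. The sketch below (with arbitrary random data of our own choosing) verifies this identity.

```python
import math
import random

random.seed(1)
n, p, lam = 30, 5, 2.0
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
beta = [random.gauss(0, 1) for _ in range(p)]

# Original sqrt-LASSO objective: sqrt(En[(y_i - x_i'beta)^2]) + (lam/n)*||beta||_1
resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
obj_direct = math.sqrt(sum(r * r for r in resid) / n) + (lam / n) * sum(abs(b) for b in beta)

# Conic representation (5.19): split beta = beta_plus - beta_minus, set
# v_i = y_i - x_i'beta, and take t = ||v|| so that (t, v) lies in the
# second-order cone Q_{n+1}.
beta_plus = [max(b, 0.0) for b in beta]
beta_minus = [max(-b, 0.0) for b in beta]
v = resid
t = math.sqrt(sum(vi * vi for vi in v))
obj_conic = t / math.sqrt(n) + (lam / n) * sum(bp + bm for bp, bm in zip(beta_plus, beta_minus))

# t/sqrt(n) = sqrt(Q(beta)) and sum(beta+ + beta-) = ||beta||_1, so the two match.
assert abs(obj_direct - obj_conic) < 1e-9
```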
The fact that √LASSO can be formulated as a convex conic programming problem allows several computational methods tailored to conic problems to be used to compute the √LASSO estimator. In this section we compare three different methods to compute √LASSO with their counterparts for LASSO. We note that these methods have different initializations and stopping criteria that could impact the running times significantly. Therefore we do not aim to compare the different methods with one another; instead, we focus on comparing the performance of each method on LASSO and on √LASSO, since the same initialization and stopping criteria are used.
Table 1 illustrates that the average computational times to solve the LASSO and √LASSO optimization problems are comparable. Table 1 also reinforces the typical behavior of these methods. As the size of the problem increases, the running time of the interior point method grows faster than that of the first-order methods. Simple componentwise methods are particularly effective when the solution is highly sparse, which is the case for the parametric design used here. We emphasize that the performance of each method depends on the particular design and on the choice of λ.
5.2. Estimation Performance of √LASSO Relative to LASSO. In this section we use Monte Carlo experiments to assess the finite-sample performance of the following estimators:
                     Componentwise   First Order   Interior Point
n = 100, p = 500
  LASSO                  0.2173         10.99           2.545
  √LASSO                 0.3268          7.345          1.645
n = 200, p = 1000
  LASSO                  0.6115         19.84          14.20
  √LASSO                 0.6448         19.96           8.291
n = 400, p = 2000
  LASSO                  2.625          84.12         108.9
  √LASSO                 2.687          77.65          62.86

Table 1. In these instances we had s = 5 and σ = 1, and each value was computed by averaging over 100 simulations.
• the (infeasible) LASSO, which knows σ (which is unknown outside the experiments),
• OLS post LASSO, which applies OLS to the model selected by (infeasible) LASSO,
• √LASSO, which does not know σ, and
• OLS post √LASSO, which applies OLS to the model selected by √LASSO.
We set the penalty level for LASSO as the standard choice in the literature, λ = 2cσ√n Φ^{-1}(1 − α/2p), and for √LASSO according to the asymptotic option, λ = c√n Φ^{-1}(1 − α/2p), with 1 − α = .95 and c = 1.1 for both estimators. (The results obtained using the exact option are similar to those with the penalty level set according to the asymptotic option, so we only report the results for the latter.)
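These penalty choices are straightforward to compute from the normal quantile; a minimal sketch (the function names are ours):

```python
import math
from statistics import NormalDist

def lasso_penalty(n, p, sigma, alpha=0.05, c=1.1):
    # standard LASSO choice: lam = 2*c*sigma*sqrt(n)*Phi^{-1}(1 - alpha/(2p))
    return 2 * c * sigma * math.sqrt(n) * NormalDist().inv_cdf(1 - alpha / (2 * p))

def sqrt_lasso_penalty(n, p, alpha=0.05, c=1.1):
    # asymptotic sqrt-LASSO choice: lam = c*sqrt(n)*Phi^{-1}(1 - alpha/(2p));
    # note that this choice is free of the unknown sigma
    return c * math.sqrt(n) * NormalDist().inv_cdf(1 - alpha / (2 * p))

lam_lasso = lasso_penalty(n=100, p=500, sigma=1.0)
lam_sqrt = sqrt_lasso_penalty(n=100, p=500)
# With sigma = 1, the two choices differ exactly by the factor 2*sigma.
assert abs(lam_lasso - 2 * lam_sqrt) < 1e-9
```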
We use the linear regression model stated in the introduction as the data-generating process, with either standard normal or t(4) errors:

(a) ǫi ∼ N(0, 1) or (b) ǫi ∼ t(4)/√2,

so that E[ǫi^2] = 1 in either case. We set the regression function as

f(xi) = xi'β*0, where β*0j = 1/j^{3/2}, j = 1, . . . , p.   (5.21)

The scaling parameter σ varies between 0.25 and 5. For a fixed design, as the scaling parameter σ increases, the number s of non-zero components in the oracle vector decreases. The number of regressors is p = 500, the sample size is n = 100, and we used 100 simulations for each design. We generate regressors as xi ∼ N(0, Σ) with the Toeplitz correlation matrix Σjk = (1/2)^{|j−k|}.
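One convenient way to generate rows with the Toeplitz correlation Σjk = (1/2)^{|j−k|} is an AR(1) recursion; this is our own illustration of the design, not necessarily how the paper's simulations were implemented:

```python
import math
import random

random.seed(2)

def toeplitz_ar1_row(p, rho=0.5):
    # x_j = rho*x_{j-1} + sqrt(1 - rho^2)*z_j gives Corr(x_j, x_k) = rho^{|j-k|},
    # i.e. the Toeplitz matrix Sigma_{jk} = (1/2)^{|j-k|} when rho = 1/2.
    x = [random.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        x.append(rho * x[-1] + math.sqrt(1 - rho * rho) * random.gauss(0.0, 1.0))
    return x

n, p = 20000, 5
X = [toeplitz_ar1_row(p) for _ in range(n)]

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ca = [u - ma for u in a]
    cb = [u - mb for u in b]
    num = sum(u * v for u, v in zip(ca, cb))
    return num / math.sqrt(sum(u * u for u in ca) * sum(v * v for v in cb))

def col(j):
    return [row[j] for row in X]

# Empirical correlations should be close to (1/2)^{|j-k|}.
assert abs(corr(col(0), col(1)) - 0.5) < 0.05
assert abs(corr(col(0), col(2)) - 0.25) < 0.05
```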
[Figure 1 about here. Left panel: ǫi ∼ N(0,1); right panel: ǫi ∼ t(4); legend: LASSO, √LASSO, OLS post LASSO, OLS post √LASSO.]

Figure 1. The average empirical risk of the estimators as a function of the scaling parameter σ.
[Figure 2 about here. Left panel: ǫi ∼ N(0,1); right panel: ǫi ∼ t(4); legend: LASSO, √LASSO, OLS post LASSO, OLS post √LASSO.]

Figure 2. The norm of the bias of the estimators as a function of the scaling parameter σ.
[Figure 3 about here. Left panel: ǫi ∼ N(0,1); right panel: ǫi ∼ t(4); legend: LASSO and OLS post LASSO, √LASSO and OLS post √LASSO.]

Figure 3. The average number of regressors selected as a function of the scaling parameter σ.
We present the results of the computational experiments for designs (a) and (b) in Figures 1, 2, and 3. The left plot of each figure reports the results for normal errors, and the right plot reports the results for t(4) errors. For each model, the figures show the following quantities as a function of the scaling parameter σ for each estimator β̂:
• Figure 1 – the average empirical risk, E[En[(xi'(β̂ − β0))^2]],
• Figure 2 – the norm of the bias, ‖E[β̂ − β0]‖, and
• Figure 3 – the average number of regressors selected, E|support(β̂)|.
Figure 1, left panel, shows the empirical risk for the Gaussian case. We see that, for a wide range of the scaling parameter σ, LASSO and √LASSO perform similarly in terms of empirical risk, although standard LASSO somewhat outperforms √LASSO. At the same time, OLS post LASSO slightly outperforms OLS post √LASSO for larger signal strengths. This is expected, since √LASSO over-regularizes relative to LASSO in order to simultaneously estimate σ (it essentially uses √(Q̂(β̂)) as an estimate of σ). In the nonparametric model considered here, the coefficients are not well separated from zero. These two issues combined lead to a smaller selected support.
Overall, the empirical performance of √LASSO and OLS post √LASSO achieves its goal. Despite not knowing σ, √LASSO performs comparably to the standard LASSO that knows σ. These results are in close agreement with our theoretical results, which state that the upper bounds on the empirical risk for √LASSO asymptotically approach the analogous bounds for standard LASSO.
Figures 2 and 3 provide additional insight into the performance of the estimators. On the one hand, Figure 2 shows that the finite-sample differences in empirical risk between LASSO and √LASSO arise primarily from √LASSO having a larger bias than standard LASSO. This bias arises because √LASSO uses an effectively heavier penalty. Figure 3 shows that this heavier penalty translates into √LASSO selecting a smaller support than LASSO on average.
Finally, Figure 1, right panel, shows the empirical risk for the t(4) case. We see that the
results for the Gaussian case carry over to the t(4) case. In fact, the performance of LASSO
and√LASSO under t(4) errors nearly coincides with their performance under Gaussian
errors. This is exactly what is predicted by our theoretical results.
Appendix A. Probability Inequalities
A.1. Moment Inequalities. We begin with the Rosenthal and von Bahr–Esseen inequalities.

Lemma 2 (Rosenthal Inequality). Let X1, . . . , Xn be independent zero-mean random variables. Then for r ≥ 2,

E[ |Σ_{i=1}^n Xi|^r ] ≤ C(r) max{ Σ_{i=1}^n E[|Xi|^r], (Σ_{i=1}^n E[Xi^2])^{r/2} }.
Corollary 3 (Rosenthal LLN). Let r ≥ 2, and consider the case of independent and identically distributed zero-mean variables Xi with E[Xi^2] = 1 and E[|Xi|^r] bounded by C. Then for any ℓn > 0,

Pr( |Σ_{i=1}^n Xi|/n > ℓn n^{−1/2} ) ≤ 2C(r)C/ℓn^r,

where C(r) is a constant depending only on r.
Remark. To verify the corollary, note that by Rosenthal's inequality, E[|Σ_{i=1}^n Xi|^r] ≤ C(r)C n^{r/2}. By the Markov inequality,

P( |Σ_{i=1}^n Xi|/n > c ) ≤ C(r)C n^{r/2}/(c^r n^r) ≤ C(r)C/(c^r n^{r/2}),

so the corollary follows with c = ℓn n^{−1/2}. We refer to [25] for complete proofs.
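A quick Monte Carlo illustration of the corollary (our own sketch, using Gaussian Xi, which have all moments bounded): the empirical frequency of the event |Σi Xi|/n > ℓn n^{−1/2} drops rapidly as ℓn grows, consistent with the ℓn^{−r} bound.

```python
import math
import random

random.seed(3)
n, reps = 50, 5000

def exceed_freq(ell):
    # empirical frequency of |sum X_i|/n > ell * n^{-1/2} for i.i.d. N(0,1) X_i,
    # which the Rosenthal LLN bounds by a constant times ell^{-r}
    count = 0
    for _ in range(reps):
        s = sum(random.gauss(0.0, 1.0) for _ in range(n))
        if abs(s) / n > ell / math.sqrt(n):
            count += 1
    return count / reps

f2, f3 = exceed_freq(2.0), exceed_freq(3.0)
# The tail frequency decays quickly as the threshold ell grows.
assert f3 < f2 < 0.1
```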
Lemma 3 (von Bahr–Esseen Inequality). Let X1, . . . , Xn be independent zero-mean random variables. Then for 1 ≤ r ≤ 2,

E[ |Σ_{i=1}^n Xi|^r ] ≤ (2 − n^{−1}) · Σ_{k=1}^n E[|Xk|^r].

We refer to [36] for proofs.
Corollary 4 (von Bahr–Esseen LLN). Let r ∈ [1, 2], and consider the case of independent and identically distributed zero-mean variables Xi with E|Xi|^r bounded by C. Then for any ℓn > 0,

Pr( |Σ_{i=1}^n Xi|/n > ℓn n^{−(1−1/r)} ) ≤ 2C/ℓn^r.

Remark. By the Markov and von Bahr–Esseen inequalities,

Pr( |Σ_{i=1}^n Xi|/n > c ) ≤ E[|Σ_{i=1}^n Xi|^r]/(c^r n^r) ≤ (2n − 1)E[|Xi|^r]/(c^r n^r) ≤ 2C/(c^r n^{r−1}),

which implies the corollary with c = ℓn n^{−(1−1/r)}.
A.2. Moderate Deviations for Sums of Independent Random Variables. Next we consider the Slastnikov–Rubin–Sethuraman moderate deviation theorem.

Let Xni, i = 1, . . . , kn; n ≥ 1, be a double sequence of row-wise independent random variables with E[Xni] = 0 and E[Xni^2] < ∞ for i = 1, . . . , kn; n ≥ 1, and let Bn^2 = Σ_{i=1}^{kn} E[Xni^2] → ∞ as n → ∞. Let

Fn(x) = Pr( Σ_{i=1}^{kn} Xni < x Bn ).
Lemma 4 (Slastnikov, Theorem 1.1). If for sufficiently large n and some positive constant c,

Σ_{i=1}^{kn} E[ |Xni|^{2+c^2} ρ(|Xni|) log^{−(1+c^2)/2}(3 + |Xni|) ] ≤ g(Bn) Bn^2,

where ρ(t) is a slowly varying function monotonically growing to infinity and g(t) = o(ρ(t)) as t → ∞, then

1 − Fn(x) ∼ 1 − Φ(x),  Fn(−x) ∼ Φ(−x),  n → ∞,

uniformly in the region 0 ≤ x ≤ c√(log Bn^2).
Corollary 5 (Slastnikov, Rubin–Sethuraman). If q > c^2 + 2 and

Σ_{i=1}^{kn} E[|Xni|^q] ≤ K Bn^2,

then there is a sequence γn → 1 such that

| (1 − Fn(x) + Fn(−x)) / (2(1 − Φ(x))) − 1 | ≤ γn − 1 → 0,  n → ∞,

uniformly in the region 0 ≤ x ≤ c√(log Bn^2).

Remark. Rubin and Sethuraman derived the corollary for x = t√(log Bn^2) for fixed t. Slastnikov's result adds uniformity and relaxes the moment assumption. We refer to [28] for proofs.
A.3. Moderate Deviations for Self-Normalized Sums. We shall be using the following
result – Theorem 7.4 in [12].
Let X1, . . . , Xn be independent, mean-zero variables, and let

Sn = Σ_{i=1}^n Xi,  Vn^2 = Σ_{i=1}^n Xi^2.

For 0 < δ ≤ 1 set

Bn^2 = Σ_{i=1}^n E[Xi^2],  Ln,δ = Σ_{i=1}^n E[|Xi|^{2+δ}],  dn,δ = Bn / Ln,δ^{1/(2+δ)}.

Then, uniformly in 0 ≤ x ≤ dn,δ,

Pr(Sn/Vn ≥ x) / Φ̄(x) = 1 + O(1) ((1 + x)/dn,δ)^{2+δ},
Pr(Sn/Vn ≤ −x) / Φ(−x) = 1 + O(1) ((1 + x)/dn,δ)^{2+δ},

where the terms O(1) are bounded in absolute value by a universal constant A, and Φ̄ := 1 − Φ.
Application of this result gives the following lemma:
Lemma 5 (Moderate Deviations for Self-Normalized Sums). Let X1,n, . . . , Xn,n be a triangular array of i.i.d. zero-mean random variables. Suppose that

Mn = (E[X1,n^2])^{1/2} / (E[|X1,n|^3])^{1/3} > 0

and that n^{1/6} Mn/ℓn ≥ 1 for some ℓn → ∞. Then, uniformly in 0 ≤ x ≤ n^{1/6} Mn/ℓn − 1, the quantities

Sn,n = Σ_{i=1}^n Xi,n,  Vn,n^2 = Σ_{i=1}^n Xi,n^2

obey

| Pr(|Sn,n/Vn,n| ≥ x) / (2Φ̄(x)) − 1 | ≤ A/ℓn^3 → 0.

Proof. This follows by applying the quoted theorem to the i.i.d. case with δ = 1 and dn,1 = n^{1/6} Mn. The stated error bound follows from the triangle inequality and the conditions on ℓn and Mn. □
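The practical content of Lemma 5 can be illustrated by simulation (our own sketch, with centered exponential noise, which is skewed but has a finite third moment): the self-normalized ratio Sn/Vn has nearly Gaussian tails even though the summands are far from normal.

```python
import math
import random

random.seed(4)
n, reps = 200, 4000
hits = 0
for _ in range(reps):
    # skewed, mean-zero summands: Exponential(1) - 1
    x = [random.expovariate(1.0) - 1.0 for _ in range(n)]
    s = sum(x)
    v = math.sqrt(sum(xi * xi for xi in x))
    if abs(s / v) >= 1.96:
        hits += 1
freq = hits / reps
# Despite the skewness, the self-normalized ratio S_n/V_n has nearly Gaussian
# tails: P(|S_n/V_n| >= 1.96) should sit close to 2*(1 - Phi(1.96)) = 0.05.
assert 0.01 < freq < 0.12
```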
A.4. Data-dependent Probabilistic Inequality. In this section we derive a data-dependent probability inequality for empirical processes indexed by a finite class of functions. In what follows, for a random variable X let q(X, 1 − τ) denote its (1 − τ)-quantile. For a class of functions F we define ‖X‖_F = sup_{f∈F} |f(X)|. Also, for random variables Z1, . . . , Zn and a function f, define ‖f‖_{Pn,2} = √(En[f(Zi)^2]), Gn(f) = (1/√n) Σ_{i=1}^n {f(Zi) − E[f(Zi)]}, and Gn°(f) = (1/√n) Σ_{i=1}^n εi f(Zi), where the εi are independent Rademacher random variables.

In order to prove a bound on tail probabilities of a general separable empirical process, we need to go through a symmetrization argument. Since we use a data-dependent threshold, we need an appropriate extension of the classical symmetrization lemma to allow for this. Let us call a threshold function x : IR^n ↦ IR k-sub-exchangeable if, for any v, w ∈ IR^n and any vectors ṽ, w̃ created by the pairwise exchange of components of v with components of w, we have x(ṽ) ∨ x(w̃) ≥ [x(v) ∨ x(w)]/k. Several functions satisfy this property; in particular, x(v) = ‖v‖ with k = √2, constant functions with k = 1, and x(v) = ‖v‖∞ with k = 1. The following result generalizes the standard symmetrization lemma for probabilities (Lemma 2.3.7 of [35]) to the case of a random threshold x that is sub-exchangeable. The proof of Lemma 6 can be found in [4].
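The sub-exchangeability claims for ‖·‖ and ‖·‖∞ are easy to verify numerically; the sketch below checks them on random pairs of vectors and random pairwise exchanges.

```python
import math
import random

random.seed(5)

def pairwise_exchange(v, w, mask):
    # where mask[i] is True, the i-th components of v and w trade places
    vt = [w[i] if mask[i] else v[i] for i in range(len(v))]
    wt = [v[i] if mask[i] else w[i] for i in range(len(v))]
    return vt, wt

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

def norminf(v):
    return max(abs(t) for t in v)

for _ in range(200):
    v = [random.gauss(0, 1) for _ in range(8)]
    w = [random.gauss(0, 1) for _ in range(8)]
    mask = [random.random() < 0.5 for _ in range(8)]
    vt, wt = pairwise_exchange(v, w, mask)
    # x(v) = ||v|| is sqrt(2)-sub-exchangeable: the sum of squared norms is
    # preserved by exchanges, so the max can drop by at most a factor sqrt(2).
    assert max(norm2(vt), norm2(wt)) >= max(norm2(v), norm2(w)) / math.sqrt(2) - 1e-12
    # x(v) = ||v||_inf is 1-sub-exchangeable: the pooled set of components is
    # unchanged, so the max over both vectors is exactly preserved.
    assert max(norminf(vt), norminf(wt)) >= max(norminf(v), norminf(w)) - 1e-12
```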
Lemma 6 (Symmetrization with Data-dependent Thresholds). Consider arbitrary independent stochastic processes Z1, . . . , Zn and arbitrary functions µ1, . . . , µn : F ↦ IR. Let x(Z) = x(Z1, . . . , Zn) be a k-sub-exchangeable random variable and, for any τ ∈ (0, 1), let qτ denote the τ-quantile of x(Z), with p̄τ := P(x(Z) ≤ qτ) ≥ τ and pτ := P(x(Z) < qτ) ≤ τ. Then

P( ‖Σ_{i=1}^n Zi‖_F > x0 ∨ x(Z) ) ≤ (4/p̄τ) P( ‖Σ_{i=1}^n εi(Zi − µi)‖_F > [x0 ∨ x(Z)]/(4k) ) + pτ,

where x0 is any constant such that inf_{f∈F} P( |Σ_{i=1}^n Zi(f)| ≤ x0/2 ) ≥ 1 − p̄τ/2.
Theorem 13 (Maximal Inequality for Empirical Processes). Let

qD(F, 1 − τ) = sup_{f∈F} q(|Gn(f)|, 1 − τ) ≤ sup_{f∈F} √(var_P(Gn(f))/τ),

and consider the data-dependent quantity

en(F, Pn) = √(2 log |F|) sup_{f∈F} ‖f‖_{Pn,2}.

Then, for any C ≥ 1 and τ ∈ (0, 1), we have

sup_{f∈F} |Gn(f)| ≤ qD(F, 1 − τ/2) ∨ 4√2 C en(F, Pn)

with probability at least 1 − τ − 4 exp(−(C^2 − 1) log |F|)/τ.
Proof. Step 1. (Main Step.) In this step we prove the main result. First, recall that en(F, Pn) := √(2 log |F|) sup_{f∈F} ‖f‖_{Pn,2}, and note that sup_{f∈F} ‖f‖_{Pn,2} is √2-sub-exchangeable by Step 2 below. By the symmetrization Lemma 6 we obtain

P{ sup_{f∈F} |Gn(f)| > 4√2 C en(F, Pn) ∨ qD(F, 1 − τ/2) } ≤ (4/τ) P{ sup_{f∈F} |Gn°(f)| > C en(F, Pn) } + τ.

Thus a union bound yields

P{ sup_{f∈F} |Gn(f)| > 4√2 C en(F, Pn) ∨ qD(F, 1 − τ/2) } ≤ τ + (4|F|/τ) sup_{f∈F} P{ |Gn°(f)| > C en(F, Pn) }.   (A.22)
We then condition on the values of Z1, . . . , Zn, denoting the conditional probability measure by Pε. Conditional on Z1, . . . , Zn, by the Hoeffding inequality the symmetrized process Gn° is sub-Gaussian with respect to the L2(Pn) norm, namely, for f ∈ F, Pε{|Gn°(f)| > x} ≤ 2 exp(−x^2/[2‖f‖_{Pn,2}^2]). Hence, we can bound

Pε{ |Gn°(f)| ≥ C en(F, Pn) | Z1, . . . , Zn } ≤ 2 exp(−C^2 en(F, Pn)^2/[2‖f‖_{Pn,2}^2]) ≤ 2 exp(−C^2 log |F|).

Taking the expectation over Z1, . . . , Zn does not affect the bound on the right-hand side. Plugging this bound into relation (A.22) yields the result.
Step 2. (Auxiliary calculations.) To establish that sup_{f∈F} ‖f‖_{Pn,2} is √2-sub-exchangeable, let Z̃ and Ỹ be created by exchanging any components of Z with the corresponding components of Y. Then

√2 ( sup_{f∈F} ‖f‖_{Pn(Z̃),2} ∨ sup_{f∈F} ‖f‖_{Pn(Ỹ),2} ) ≥ ( sup_{f∈F} ‖f‖_{Pn(Z̃),2}^2 + sup_{f∈F} ‖f‖_{Pn(Ỹ),2}^2 )^{1/2}
≥ ( sup_{f∈F} En[f(Z̃i)^2] + En[f(Ỹi)^2] )^{1/2} = ( sup_{f∈F} En[f(Zi)^2] + En[f(Yi)^2] )^{1/2}
≥ ( sup_{f∈F} ‖f‖_{Pn(Z),2}^2 ∨ sup_{f∈F} ‖f‖_{Pn(Y),2}^2 )^{1/2} = sup_{f∈F} ‖f‖_{Pn(Z),2} ∨ sup_{f∈F} ‖f‖_{Pn(Y),2}. □
Corollary 6 (Data-Dependent Probability Inequality). Let the ǫi be i.i.d. random variables such that E[ǫi] = 0 and E[ǫi^2] = σ^2 for i = 1, . . . , n. Conditional on x1, . . . , xn ∈ IR^p, we have that for any C ≥ 1, with probability at least 1 − 1/[9C^2 log p],

‖En[xi ǫi]‖∞ ≤ 24C √(log(p)/n) max_{j=1,...,p} { √(En[ǫi^2 xij^2]) ∨ √(σ^2 En[xij^2]) }.

Proof of Corollary 6. Consider the separable empirical process induced by ‖En[xi ǫi]‖∞, i.e. the class of functions F = {ǫi xij : j ∈ {1, . . . , p}}, so that √n ‖En[xi ǫi]‖∞ = sup_{f∈F} |Gn(f)|. Define the data-dependent quantity

en(F, Pn) = √(2 log p) max_{j=1,...,p} √(En[ǫi^2 xij^2]).

Then, by Theorem 13, for any constant C ≥ 1,

sup_{f∈F} |Gn(f)| ≤ qD(F, 1 − τ/2) ∨ 4√2 C en(F, Pn)

with probability at least 1 − τ − 4 exp(−(C^2 − 1) log p)/τ. Picking τ = 1/[2C^2 log p], we have by Chebyshev's inequality

qD(F, 1 − τ/2) ≤ max_{j=1,...,p} √(E[ǫi^2 xij^2]) / √(τ/2) = 2C √(log p) max_{j=1,...,p} √(E[ǫi^2 xij^2]).

Setting C ≥ 3, we have with probability at least 1 − 1/[C^2 log p]

sup_{f∈F} |Gn(f)| ≤ ( (6C/3) √(log p) max_{j=1,...,p} √(E[ǫi^2 xij^2]) ) ∨ ( (24C/3) √(log p) max_{j=1,...,p} √(En[ǫi^2 xij^2]) ).

(Note that if p ≤ 2 the statement is trivial since the claimed probability bound is vacuous.) □
A.5. Bounds via Symmetrization. Next we proceed to use symmetrization arguments to bound the empirical process. Let ‖f‖_{Pn,2} = √(En[f(Xi)^2]), Gn(f) = √n En[f(Xi) − E[f(Xi)]], and for a random variable Z let q(Z, 1 − τ) denote its (1 − τ)-quantile.

Theorem 14 (Maximal Inequality via Symmetrization). Let

qS(F, 1 − τ) = q( sup_{f∈F} ‖f‖_{Pn,2}, 1 − τ ).

Then, for any τ ∈ (0, 1) and δ ∈ (0, 1), we have

sup_{f∈F} |Gn(f)| ≤ 4√(2 log(2|F|/δ)) qS(F, 1 − τ)

with probability at least 1 − τ − δ.
Proof. Let en(F) = √(2 log(2|F|/δ)) qS(F, 1 − τ) and define the event E = { sup_{f∈F} √(En[f^2]) ≤ qS(F, 1 − τ) }; by definition P(E) ≥ 1 − τ. By the symmetrization Lemma 2.3.7 of [35] we obtain

P{ sup_{f∈F} |Gn(f)| > 4en(F) } ≤ 4 P{ sup_{f∈F} |Gn°(f)| > en(F) } ≤ 4 P{ sup_{f∈F} |Gn°(f)| > en(F) | E } + τ.

Thus a union bound yields

P{ sup_{f∈F} |Gn(f)| > 4en(F) } ≤ τ + |F| sup_{f∈F} P{ |Gn°(f)| > en(F) | E }.   (A.23)
We then condition on the values of Z1, . . . , Zn and on E, denoting the conditional probability measure by Pε. Conditional on Z1, . . . , Zn, by the Hoeffding inequality the symmetrized process Gn° is sub-Gaussian with respect to the L2(Pn) norm, namely, for f ∈ F, Pε{|Gn°(f)| > x} ≤ 2 exp(−x^2/[2‖f‖_{Pn,2}^2]). Hence, we can bound

Pε{ |Gn°(f)| ≥ en(F) | Z1, . . . , Zn, E } ≤ 2 exp(−en(F)^2/[2‖f‖_{Pn,2}^2]) ≤ 2 exp(−log(2|F|/δ)).

Taking the expectation over Z1, . . . , Zn does not affect the bound on the right-hand side. Plugging in this bound yields the result. □
Appendix B. Technical Lemmas
We begin with two technical lemmas that control the impact of the approximation errors
in the analysis.
Lemma 7. Under condition ASM we have

‖En[xi ri]‖∞ ≤ min{ σ/√n, cs }.

Proof. First note that for every j = 1, . . . , p we have |En[xij ri]| ≤ √(En[xij^2] En[ri^2]) = cs. Next, by the definition of β0 in (2.7), for j ∈ T we have En[xij(fi − xi'β0)] = En[xij ri] = 0, since β0 is a minimizer over the support of β0. For j ∈ T^c we have that for any t ∈ IR,

En[(fi − xi'β0)^2] + σ^2 s/n ≤ En[(fi − xi'β0 − t xij)^2] + σ^2 (s + 1)/n.

Therefore, for any t ∈ IR we have

−σ^2/n ≤ En[(fi − xi'β0 − t xij)^2] − En[(fi − xi'β0)^2] = −2t En[xij(fi − xi'β0)] + t^2 En[xij^2].

Minimizing the right-hand side over t, with the minimum attained at t* = En[xij(fi − xi'β0)] (using En[xij^2] = 1), we obtain −σ^2/n ≤ −(En[xij(fi − xi'β0)])^2, or equivalently, |En[xij(fi − xi'β0)]| ≤ σ/√n. □
Lemma 8. Assume that condition ASM holds and that there is q > 2 such that E[|ǫ|^q] < ∞. Then we have

P( √(En[σ^2 ǫi^2]) > (1 + un)√(En[(σǫi + ri)^2]) | X ) ≤ min_{v∈(0,1)} { ψ(v) + 2(1 + un)/[(1 − v) un n] },

where ψ(v) := (Cq E[|ǫ|^{q∨4}])/(v^q n^{q/4}) ∧ (2 E[|ǫ|^q])/(n^{1∧(q/2−1)} v^{q/2}).
Proof. Let an = [(1 + un)^2 − 1]/(1 + un)^2. We have that

P( En[σ^2 ǫi^2] > (1 + un)^2 En[(σǫi + ri)^2] | X ) = P( 2En[σǫi ri] < −cs^2 − an En[σ^2 ǫi^2] | X ).   (B.24)

By Lemma 9 we have

Pr( √(En[ǫi^2]) < 1 − v | F0 = F ) ≤ Pr( |En[ǫi^2] − 1| > v | F0 = F ) ≤ ψ(v).

Thus,

P( En[σ^2 ǫi^2] > (1 + un)^2 En[(σǫi + ri)^2] | X ) ≤ ψ(v) + P( 2En[σǫi ri] < −cs^2 − an σ^2(1 − v) | X ).

Since E[(2En[σǫi ri])^2] = 4σ^2 cs^2/n, by the Chebyshev inequality we have

P( √(En[σ^2 ǫi^2]) > (1 + un)√(En[(σǫi + ri)^2]) | X ) ≤ ψ(v) + (4σ^2 cs^2/n)/(cs^2 + an σ^2(1 − v))^2 ≤ ψ(v) + 2(1 + un)/[(1 − v) un n].

The result follows by minimizing over v ∈ (0, 1). □
Next we focus on controlling the deviations of the empirical second moment of the noise.

Lemma 9. Assume that condition ASM holds and that there is q > 2 such that E[|ǫ|^q] < ∞. Then there is a constant Cq, depending only on q, such that for v > 0 we have

Pr( |En[ǫi^2] − 1| > v | F0 = F ) ≤ ψ(v) := (Cq E[|ǫi|^{q∨4}])/(v^q n^{q/4}) ∧ (2 E[|ǫi|^q])/(n^{1∧(q/2−1)} v^{q/2}).

Proof. The bound follows by applying either Rosenthal's inequality [26] to the variables ǫi^2 − 1 in the case q > 4, or the von Bahr–Esseen inequality [37] in the case 2 < q ≤ 4. □
Lemma 10. Under conditions ASM, M' and R', we have that, with probability at least 1 − τ1 − τ2,

max_{1≤j≤p} En[xij^2(ǫi^2 − 1)] ≤ 4√(2 log(2p/τ1)/n) (E[ǫi^q]/τ2)^{4/q} max_{1≤j≤p} (En[xij^8])^{1/4}.
Proof. For a random variable Z, let q(Z, 1 − α) denote its (1 − α)-quantile, and let

q̄(1 − τ) = q( max_{1≤j≤p} En[xij^4 ǫi^4], 1 − τ ) ≤ max_{1≤j≤p} √(En[xij^8]) √(q(En[ǫi^8], 1 − τ)).

In particular, by Theorem 14, we have

P( max_{1≤j≤p} En[xij^2(ǫi^2 − 1)] > 4√(2 log(2p/τ1)/n) √(q̄(1 − τ2)) ) ≤ τ1 + τ2.

Under Condition M', by the Markov inequality with k = q/8 > 1, we have q(En[ǫi^8], 1 − τ) ≤ (E[ǫi^q])^{8/q}/τ^{8/q}. □
Lemma 11. Under conditions ASM, M' and R', assume that

n^{1/6} ≥ ℓn max_{1≤j≤p} ( En[|xij|^3] E[|ǫi|^3] )^{1/3}.

For τ ∈ (0, 1), let the constants c1, c2 > 1 satisfy

(i) c1 ≥ 1 + 4√(2 log(6p/τ)/n) (3E[ǫi^q]/τ)^{4/q} max_{1≤j≤p} (En[xij^8])^{1/4};
(ii) c2 ≥ 1/( 1 − (3 Cq E[|ǫi|^{q∨4}])^{1/q} / (τ^{1/q} n^{1/4}) ).

Then there is a universal constant A such that, uniformly in t ≥ 0 with

t + 1 ≤ n^{1/6}/[ℓn max_{1≤j≤p} (En[|xij|^3] E[|ǫi|^3])^{1/3}],

we have

P( ‖√n En[xi ǫi]/√(En[ǫi^2])‖∞ > √(c1 c2) t ) ≤ 2p Φ̄(t)(1 + A/ℓn^3) + τ.
Proof. Let t̄ = 1/[ℓn max_{1≤j≤p}(En[|xij|^3] E[|ǫi|^3])^{1/3}] > 0 and τ1 = τ2 = τ3 = τ/3. Then

P( ‖√n En[xi ǫi]/√(En[ǫi^2])‖∞ > √(c1 c2) t ) ≤ P( max_{1≤j≤p} √n |En[xij ǫi]| / √(En[xij^2 ǫi^2]) > t ) + P( max_{1≤j≤p} En[xij^2 ǫi^2] / En[ǫi^2] > c1 c2 ).

By Lemma 5 above, provided that t ≤ t̄ n^{1/6} − 1, there is a universal constant A such that

P( max_{1≤j≤p} √n |En[xij ǫi]| / √(En[xij^2 ǫi^2]) > t ) ≤ p max_{1≤j≤p} P( √n |En[xij ǫi]| / √(En[xij^2 ǫi^2]) > t ) ≤ 2p Φ̄(t)(1 + A/ℓn^3).

Next note that

P( max_{1≤j≤p} En[xij^2 ǫi^2] / En[ǫi^2] > c1 c2 ) ≤ P( max_{1≤j≤p} En[xij^2(ǫi^2 − 1)] > c1 − 1 ) + P( En[ǫi^2] < 1/c2 ).

By Lemma 9, there is a constant Cq such that

P( En[ǫi^2] < 1/c2 ) ≤ Cq E[|ǫi|^{q∨4}] / ( ([c2 − 1]/c2)^q n^{q/4} ) ≤ τ3,

where the last inequality follows from the choice of c2. Finally, by Lemma 10 and the choice of c1, we have

P( max_{1≤j≤p} En[xij^2(ǫi^2 − 1)] > c1 − 1 ) ≤ τ1 + τ2.

Since τ1 + τ2 + τ3 = τ, the result follows. □
Appendix C. Results under Conditions M and R
Condition M. There exists a finite constant q > 2 such that the disturbance obeys sup_{n≥1} E_{F0}[|ǫ|^q] < ∞, and the design X obeys sup_{n≥1} max_{1≤j≤p} En[|xij|^q] < ∞.

Condition R. We have that p ≤ α n^{η(q−2)/2}/2 for some constant 0 < η < 1, that cs^2 Φ^{-1}(1 − α/2p) ≤ σ^2, and that Φ^{-1}(1 − α/2p)^2 α^{−2/q} n^{−[(1−2/q)∧1/2]} log n ≤ 1/3. For convenience we also assume that Φ^{-1}(1 − α/2p) ≥ 1.
Conditions M and R were considered in [5] in the parametric setting, where the analysis relied on concentration of measure and moderate deviation bounds. Below we provide the results that validate the choice of λ in the current nonparametric setting, which is not covered in [5].
Theorem 15 (Properties of the Penalty Level under Conditions M and R). Suppose that conditions M and R hold, and let un = 2/Φ^{-1}(1 − α/2p). Then there is a constant C(q), depending only on q, and a deterministic sequence γn converging to 1, such that

(i) the exact option in (3.17) implements λ > cΛ̄ with probability at least

1 − α − C(q)( E[|ǫ|^{q∨4}]/n^{q/4} ∧ E[|ǫ|^q]/n^{1∧(q/2−1)} ) − 16/n;

(ii) the asymptotic option in (3.17) implements λ > cΛ̄ with probability at least

1 − αγn − C(q)( E[|ǫ|^{q∨4}]/n^{q/4} ∧ E[|ǫ|^q]/n^{1∧(q/2−1)} ) − 25/n − 24γn Φ^{-1}(1 − α/2p)^2 log(n) / ( α^{2/q} n^{[(1−2/q)∧1/2]} );

(iii) the approximate score Λ(1 − α|X) in (3.15) satisfies

Λ(1 − α(1 + C(q) log^{−q/2} n) | X) ≤ √n Φ^{-1}(1 − α/2p)(1 + √(2 log γn)/tn)/(1 − rn − 1/tn)
≤ √(2n log(p/α)) (1 + √(2 log γn)/tn)/(1 − rn − 1/tn),

where rn = α^{−2/q} n^{−[(1−2/q)∧1/2]} log n and tn = Φ^{-1}(1 − α/2p).
Proof of Theorem 15. Under conditions R and M, and using un = 2/Φ^{-1}(1 − α/2p), by Lemma 8 we have, with probability at least 1 − ψ(v) − 4cs^2(1 + un)/[n un(2 + un) σ^2 (1 − v)^2],

√(En[σ^2 ǫi^2]) ≤ (1 + un)√(En[(σǫi + ri)^2]).

Then, using (3.16), with at least the same probability we have Λ̄ ≤ (1 + un)Λ. Hence statement (i) holds by the definition of Λ(1 − α|X), setting v = 1/2, and condition R, which ensures cs^2 Φ^{-1}(1 − α/2p) ≤ σ^2.

To show statement (ii), we define tn = Φ^{-1}(1 − α/2p) and rn = α^{−2/q} n^{−[(1−2/q)∧1/2]} log n. Note that un = 2/tn, tn ≥ 1, and tn^2 rn ≤ 1/3.

By the same argument used to establish (i), under our conditions we have Λ̄ ≤ (1 + un/4)Λ with probability at least

1 − ψ(rn) − 25/n.

Thus it suffices to prove that, conditional on X, Λ ≤ (1 + un/2)√n Φ^{-1}(1 − α/2p) with probability at least 1 − αγn − α·12γn(exp(tn^2 rn) − 1) − ψ(rn).
Then, for any F = Fn and X = Xn that obey Condition M,

Pr(Λ > (1 + un/2)√n tn | X)
≤_{(1)} p max_{1≤j≤p} Pr( |n^{1/2} En[xij ǫi]| + 1 > (1 + 1/tn) tn (1 − rn) | X ) + Pr( √(En[ǫi^2]) < 1 − rn )
≤_{(2)} p max_{1≤j≤p} Pr( |n^{1/2} En[xij ǫi]| > tn(1 − rn − rn/tn) | X ) + ψ(rn),

where (1) holds by the union bound and (2) holds by Lemma 9.
Next note that, by Condition M and Slastnikov's theorem on moderate deviations [29] (see also [27]), we have that, uniformly in 0 ≤ |t| ≤ k√(log n) for some k^2 < q − 2, for a sequence γn → 1, and uniformly in 1 ≤ j ≤ p,

| Pr( n^{1/2}|En[xij ǫi]| > t | X ) / (2Φ̄(t)) − 1 | ≤ γn − 1.

We apply this result for t = tn(1 − rn − rn/tn) ≤ √(2 log(2p/α)) ≤ √(η(q − 2) log n), for η < 1, by assumption. Note that we apply Slastnikov's theorem to n^{−1/2}|Σ_{i=1}^n zi,n| for zi,n = xij ǫi, where we allow the design X, the law F, and the index j to be (implicitly) indexed by n. Slastnikov's theorem then applies provided

sup_{n, j≤p} En[|xij|^q] E_{Fn}[|ǫ|^q] < ∞,

which is implied by our Condition M, where we used that the design is fixed, so that the ǫi are independent of the xij. Thus we obtain the moderate deviation result uniformly in 1 ≤ j ≤ p and for any sequence of distributions F = Fn and designs X = Xn that obey Condition M.
Pr(Λ > (1 + un/2)√n tn | X) ≤ 2p Φ̄(tn(1 − rn − rn/tn)) γn + ψ(rn)
= 2p Φ̄(tn) γn + 2p γn ∫_{tn(1−rn−rn/tn)}^{tn} φ(u) du + ψ(rn)
≤ αγn + 2p γn [φ(tn(1 − rn − rn/tn)) − φ(tn)] / [tn(1 − rn − rn/tn)] + ψ(rn)
≤ αγn + 2p γn (φ(tn)/tn) (exp(tn^2 rn) − 1)/(1 − rn − rn/tn) + ψ(rn)
≤ αγn + 2p γn · 12 Φ̄(tn)(exp(tn^2 rn) − 1) + ψ(rn)
≤ αγn + α·24 γn tn^2 rn + ψ(rn),

where we used that Φ̄(t) ≤ φ(t)/t and φ(t)/t ≤ 2Φ̄(t) for t ≥ 1, that rn ≤ 1/3, and that exp(m) ≤ 1 + 2m for 0 ≤ m ≤ 1/3.
To show statement (iii) of the lemma, let ν' > (1 + √(2 log γn)/tn)/(1 − rn − 1/tn). Then

Pr(Λ > ν'√n tn | X) = Pr( √n ‖En[xi ǫi]‖∞ + 1 > ν' tn √(En[ǫi^2]) | X )
≤ Pr( √n ‖En[xi ǫi]‖∞ + 1 > ν' tn (1 − rn) | X ) + Pr( √(En[ǫi^2]) < 1 − rn )
≤ Pr( √n ‖En[xi ǫi]‖∞ > ν' tn (1 − rn − 1/tn) | X ) + ψ(rn)
≤ Pr( √n ‖En[xi ǫi]‖∞ > tn + √(2 log γn) | X ) + ψ(rn),

where the second step follows by Lemma 9 and the last step by the choice of ν'. Thus the bound

Pr( √n ‖En[xi ǫi]‖∞ > tn + √(2 log γn) | X ) ≤ 2p γn Φ̄(tn + √(2 log γn)) ≤ α

follows analogously to the proof of statement (ii); we omit the details for brevity. Therefore, under our conditions, Λ(1 − α − ψ(rn) | X) ≤ ν'√n Φ^{-1}(1 − α/2p). Note that by construction ψ(rn) ≤ αC(q) log^{−q/2} n. The last bound follows from standard tail bounds for Gaussian random variables. □
Appendix D. Proofs of Section 3
Proof of Theorem 1. The exact option in (3.17) is λ = (1 + un) c Λ(1 − α|X), so that

P(cΛ̄ ≥ λ | X) = P(Λ̄ ≥ (1 + un)Λ(1 − α|X) | X)
≤ P(Λ ≥ Λ(1 − α|X) | X) + P(Λ̄ ≥ (1 + un)Λ | X)
≤ α + P(Λ̄ ≥ (1 + un)Λ | X).

By (3.16) and Lemma 8 with v = rn, the last term satisfies

P(Λ̄ ≥ (1 + un)Λ | X) ≤ P( √(En[σ^2 ǫi^2]) ≥ (1 + un)√(En[(σǫi + ri)^2]) | X )
≤ ψ(rn) + 2(1 + un)/[n(1 − rn)un] ≤ α/log n + 4(1 + un)/(n un),

by the choice rn = (α^{−1} log(n) Cq E[|ǫ|^{q∨4}])^{1/q}/n^{1/4} < 1/2 under condition R'. □
Proof of Theorem 2. Let $t_n = \Phi^{-1}(1-\alpha/2p)$ and recall $r_n = \big(\alpha^{-1}\log n\,C_q\mathrm{E}[|\epsilon|^{q\wedge4}]\big)^{1/q}/n^{1/4} < 1/2$ under Condition R'. Next, note that for $u_n \ge 0$ it follows that $1+u_n \ge (1+u_n/2)\big(1+[u_n\wedge1]/4\big)$, so that
$$P\big(\bar\Lambda \ge (1+u_n)\sqrt n\,t_n\mid X\big) \le P\big(\Lambda \ge (1+u_n/2)\sqrt n\,t_n\mid X\big) + P\Big(\bar\Lambda \ge \Big(1+\tfrac{u_n\wedge1}{4}\Big)\Lambda\,\Big|\,X\Big).$$
Regarding the second term, by (3.16) and Lemma 8 with $v = r_n$, it follows that
$$P\Big(\bar\Lambda \ge \Big(1+\tfrac{u_n\wedge1}{4}\Big)\Lambda\,\Big|\,X\Big) \le P\Big(\sqrt{\mathrm{E}_n[\sigma^2\epsilon_i^2]} \ge \Big(1+\tfrac{u_n\wedge1}{4}\Big)\sqrt{\mathrm{E}_n[(\sigma\epsilon_i+r_i)^2]}\,\Big|\,X\Big) \le \psi(r_n) + \frac{2\big(1+\tfrac{u_n\wedge1}{4}\big)}{n(1-r_n)\,[u_n\wedge1]/4} \le \frac{\alpha}{\log n} + \frac{20}{n(u_n\wedge1)}.$$
Next note that
$$\Pr\big(\Lambda > (1+u_n/2)\sqrt n\,t_n\mid X\big) \le \Pr\left(\frac{n^{1/2}\|\mathrm{E}_n[x_i\epsilon_i]\|_\infty}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > (1+u_n/4)\,t_n\,\Big|\,X\right) + \Pr\left(\frac{\sqrt n}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > \sqrt n\,t_nu_n/4\,\Big|\,X\right).$$
To control the second term above, note that under Conditions M' and R' we have $4/[u_nt_n] \le 1-r_n$, so that by Lemma 9
$$\Pr\left(\frac{\sqrt n}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > \frac{\sqrt n\,t_nu_n}{4}\,\Big|\,X\right) = \Pr\left(\sqrt{\mathrm{E}_n[\epsilon_i^2]} < \frac{4}{u_nt_n}\,\Big|\,X\right) \le \psi(r_n) \le \alpha/\log n.$$
Next, under Conditions M' and R', letting $c_1, c_2$ be as in Lemma 11 with $\tau = \alpha/\log n$, we have $(1+u_n/4) \ge \sqrt{c_1c_2}$. Thus, since $t_n + 1 \le n^{1/6}\big/\max_{1\le j\le p}\big(\mathrm{E}_n[|x_{ij}|^3]\,\mathrm{E}[|\epsilon_i|^3]\big)^{1/3}$ by Condition R', Lemma 11 yields
$$\Pr\left(\frac{n^{1/2}\|\mathrm{E}_n[x_i\epsilon_i]\|_\infty}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > (1+u_n/4)\,t_n\,\Big|\,X\right) \le 2p\,\bar\Phi(t_n)\left(1+\frac{A}{\ell_n^3}\right) + \frac{\alpha}{\log n} \le \alpha\left(1+\frac{A}{\ell_n^3}+\frac{1}{\log n}\right). \qquad\Box$$
Proof of Theorem 3. Under Conditions M' and R', note that by the triangle inequality and by Lemma 11 with $\tau = \alpha/\log n$ we have
$$\Pr\big(\Lambda > (\nu+\nu')\sqrt n\,t_n\mid X\big) \le \Pr\left(\frac{n^{1/2}\|\mathrm{E}_n[x_i\epsilon_i]\|_\infty}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}} > \nu t_n\,\Big|\,X\right) + \Pr\Big(\sqrt{\mathrm{E}_n[\epsilon_i^2]} < 1/[\nu' t_n]\,\Big|\,X\Big) \le 2p\,\bar\Phi\big(\nu t_n/\sqrt{c_1c_2}\big)(1+A/\ell_n^3) + 2\alpha/\log n < \alpha,$$
provided that: $\nu > \sqrt{c_1c_2}\big(1+\sqrt{2\log(3+3A/\ell_n^3)}/t_n\big)$, so that $2p\,\bar\Phi\big(\nu t_n/\sqrt{c_1c_2}\big)(1+A/\ell_n^3) \le \alpha/3$; $\nu' \ge 1/[(1-r_n)t_n]$, so that $\psi\big((\nu' t_n-1)/[\nu' t_n]\big) < \alpha/\log n$; and $\log n > 2$. □
Appendix E. Proofs of Section 4
Proof of Theorem 4. We prove the upper bound on the prediction norm in two steps.
Step 1. In this step we show that $\delta = \hat\beta - \beta_0 \in \Delta_{\bar c}$ under the prescribed penalty level. By definition of $\hat\beta$,
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \le \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\hat\beta\|_1 \le \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big), \tag{E.25}$$
where the last inequality holds because
$$\|\beta_0\|_1 - \|\hat\beta\|_1 = \|\beta_{0T}\|_1 - \|\hat\beta_T\|_1 - \|\hat\beta_{T^c}\|_1 \le \|\delta_T\|_1 - \|\delta_{T^c}\|_1. \tag{E.26}$$
Note that using the convexity of $\sqrt{\hat Q}$, $-S \in \partial\sqrt{\hat Q}(\beta_0)$, and the condition $\lambda > cn\|S\|_\infty$, we have
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \ge -S'\delta \ge -\|S\|_\infty\|\delta\|_1 \tag{E.27}$$
$$\ge -\frac{\lambda}{cn}\big(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\big). \tag{E.28}$$
Combining (E.25) with (E.28) we obtain
$$-\frac{\lambda}{cn}\big(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\big) \le \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big), \tag{E.29}$$
that is,
$$\|\delta_{T^c}\|_1 \le \frac{c+1}{c-1}\,\|\delta_T\|_1 = \bar c\,\|\delta_T\|_1, \quad\text{i.e. } \delta \in \Delta_{\bar c}. \tag{E.30}$$
Step 2. In this step we derive bounds on the estimation error. We shall use the following relations:
$$\hat Q(\hat\beta) - \hat Q(\beta_0) = \|\delta\|_{2,n}^2 - 2\,\mathrm{E}_n[(\sigma\epsilon_i + r_i)x_i'\delta], \tag{E.31}$$
$$\hat Q(\hat\beta) - \hat Q(\beta_0) = \Big(\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)}\Big)\Big(\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)}\Big), \tag{E.32}$$
$$2\big|\mathrm{E}_n[(\sigma\epsilon_i + r_i)x_i'\delta]\big| \le 2\|\mathrm{E}_n[(\sigma\epsilon_i + r_i)x_i]\|_\infty\|\delta\|_1 \le 2\sqrt{\hat Q(\beta_0)}\,\lambda\|\delta\|_1/[cn], \tag{E.33}$$
$$\|\delta_T\|_1 \le \frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}}\quad\text{for }\delta\in\Delta_{\bar c}, \tag{E.34}$$
where the first inequality in (E.33) holds by Hölder's inequality and the inequality in (E.34) holds by the definition of $\kappa_{\bar c}$.
Note that by (E.31), if $\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)} = 0$, we have $\|\delta\|_{2,n} = 0$ and we are done. So we can assume $\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)} > 0$. Using (E.25) and (E.31)–(E.34) we obtain
$$\|\delta\|_{2,n}^2 \le \frac{2\lambda}{cn}\sqrt{\hat Q(\beta_0)}\,\|\delta\|_1 + \Big(\sqrt{\hat Q(\hat\beta)}+\sqrt{\hat Q(\beta_0)}\Big)\frac{\lambda}{n}\left(\frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}} - \|\delta_{T^c}\|_1\right). \tag{E.35}$$
Also, using (E.25) and (E.34) we obtain
$$\sqrt{\hat Q(\hat\beta)} \le \sqrt{\hat Q(\beta_0)} + \frac{\lambda}{n}\left(\frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}} - \|\delta_{T^c}\|_1\right) \le \sqrt{\hat Q(\beta_0)} + \frac{\lambda}{n}\,\frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}}. \tag{E.36}$$
Combining inequalities (E.36) and (E.35), we obtain
$$\|\delta\|_{2,n}^2 \le \frac{2\lambda}{cn}\sqrt{\hat Q(\beta_0)}\,\|\delta\|_1 + 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n} + \left(\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n}\right)^2 - 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda}{n}\|\delta_{T^c}\|_1.$$
Simplifying the expression further we obtain
$$\|\delta\|_{2,n}^2 \le \frac{2\lambda}{cn}\sqrt{\hat Q(\beta_0)}\,\|\delta_T\|_1 + 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n} + \left(\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n}\right)^2,$$
and then, using (E.34), we obtain
$$\left[1 - \left(\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\right)^2\right]\|\delta\|_{2,n}^2 \le 2\left(\frac{1}{c}+1\right)\sqrt{\hat Q(\beta_0)}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\delta\|_{2,n}.$$
Provided that $\lambda\sqrt s/(n\kappa_{\bar c}) < 1$, solving the inequality above yields the bound stated in the theorem. □
Proof of Theorem 5. Let $\delta := \hat\beta - \beta_0$. Under the stated condition on $\lambda$, we have $\delta \in \Delta_{\bar c}$. Thus,
$$\|\delta\|_1 \le (1+\bar c)\|\delta_T\|_1 \le (1+\bar c)\,\frac{\sqrt s\,\|\delta\|_{2,n}}{\kappa_{\bar c}},$$
by the restricted eigenvalue condition. □
Proof of Theorem 6. Let $\delta := \hat\beta - \beta_0$. We have that
$$\|\delta\|_\infty \le \|\mathrm{E}_n[x_ix_i'\delta]\|_\infty + \|\mathrm{E}_n[x_ix_i'\delta] - \delta\|_\infty.$$
By the first-order optimality conditions for $\hat\beta$ and the assumption on $\lambda$,
$$\|\mathrm{E}_n[x_ix_i'\delta]\|_\infty \le \|\mathrm{E}_n[x_i(y_i - x_i'\hat\beta)]\|_\infty + \|S\|_\infty\sqrt{\hat Q(\beta_0)} \le \frac{\lambda\sqrt{\hat Q(\hat\beta)}}{n} + \frac{\lambda\sqrt{\hat Q(\beta_0)}}{cn}.$$
Next, let $e_j$ denote the $j$th canonical direction; then for each $j$
$$\big|\mathrm{E}_n[e_j'x_ix_i'\delta] - \delta_j\big| \le \big|\mathrm{E}_n[e_j'(x_ix_i' - I)\delta]\big| \le \|\delta\|_1\,\|\mathrm{E}_n[x_ix_i' - I]\|_\infty.$$
Therefore, using the optimality of $\hat\beta$, which implies $\sqrt{\hat Q(\hat\beta)} \le \sqrt{\hat Q(\beta_0)} + (\lambda/n)\|\delta\|_1$, we have
$$\|\delta\|_\infty \le \left(\sqrt{\hat Q(\hat\beta)} + \frac{\sqrt{\hat Q(\beta_0)}}{c}\right)\frac{\lambda}{n} + \|\mathrm{E}_n[x_ix_i'] - I\|_\infty\|\delta\|_1 \le \left(1+\frac{1}{c}\right)\frac{\lambda\sqrt{\hat Q(\beta_0)}}{n} + \left(\frac{\lambda^2}{n^2} + \|\mathrm{E}_n[x_ix_i'] - I\|_\infty\right)\|\delta\|_1. \qquad\Box$$
Proof of Theorem 7. We have $\delta := \hat\beta - \beta_0 \in \Delta_{\bar c}$ under the condition $\lambda \ge c\Lambda$, by Step 1 in the proof of Theorem 4.
First we establish the upper bound. By optimality of $\hat\beta$,
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \le \frac{\lambda}{n}\big(\|\beta_0\|_1 - \|\hat\beta\|_1\big) \le \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big).$$
Thus, if $\delta\notin\Delta_1$ we have $\hat Q(\hat\beta) \le \hat Q(\beta_0)$. So we can assume $\delta\in\Delta_1$ in the bound above, so that $\|\delta_T\|_1 \le \sqrt s\,\|\delta\|_{2,n}/\kappa_1$. Thus, setting $\rho_1 = \lambda\sqrt s/[n\kappa_1]$, we have $\|\delta\|_{2,n} \le 2(1+1/c)\rho_1\sqrt{\hat Q(\beta_0)}\big/[1-\rho_1^2]$ if $\delta\in\Delta_1$, by inspection of the proof of Theorem 4.
To establish the lower bound, by convexity of $\sqrt{\hat Q}$ and the condition $\lambda \ge cn\|S\|_\infty$, we have
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \ge -S'\delta \ge -\frac{\lambda\|\delta\|_1}{cn}.$$
Thus, combining these bounds with Theorems 4 and 5 and letting $\rho := \lambda\sqrt s/[n\kappa_{\bar c}] < 1$, we obtain
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \ge -\frac{1+\bar c}{c}\,\frac{\rho^2}{1-\rho^2}\cdot2\left(1+\frac{1}{c}\right)\sqrt{\hat Q(\beta_0)} = -\frac{4\bar c}{c}\,\frac{\rho^2}{1-\rho^2}\,\sqrt{\hat Q(\beta_0)},$$
where the last equality follows from simplifications. □
Proof of Lemma 1. First note that by strong duality we have
$$\mathrm{E}_n[y_i\hat a_i] = \frac{\|Y - X\hat\beta\|}{\sqrt n} + \frac{\lambda}{n}\sum_{j=1}^p|\hat\beta_j|.$$
Since $\mathrm{E}_n[x_{ij}\hat a_i]\,\hat\beta_j = \lambda|\hat\beta_j|/n$ for every $j = 1,\dots,p$, we have
$$\mathrm{E}_n[y_i\hat a_i] = \frac{\|Y - X\hat\beta\|}{\sqrt n} + \sum_{j=1}^p\mathrm{E}_n[x_{ij}\hat a_i]\,\hat\beta_j = \frac{\|Y - X\hat\beta\|}{\sqrt n} + \mathrm{E}_n\Big[\hat a_i\sum_{j=1}^p x_{ij}\hat\beta_j\Big].$$
Rearranging the terms, we have $\mathrm{E}_n\big[(y_i - x_i'\hat\beta)\hat a_i\big] = \|Y - X\hat\beta\|/\sqrt n$.
If $\|Y - X\hat\beta\| = 0$, we have $\sqrt{\hat Q(\hat\beta)} = 0$ and the statement of the lemma trivially holds. If $\|Y - X\hat\beta\| > 0$, then since $\|\hat a\| \le \sqrt n$ the equality can hold only for
$$\hat a = \sqrt n\,(Y - X\hat\beta)\big/\|Y - X\hat\beta\| = (Y - X\hat\beta)\big/\sqrt{\hat Q(\hat\beta)}.$$
Next, note that for any $j\in\hat T$ we have $\mathrm{E}_n[x_{ij}\hat a_i] = \mathrm{sign}(\hat\beta_j)\,\lambda/n$. Therefore,
$$\sqrt{\hat Q(\hat\beta)}\,\sqrt{|\hat T|}\,\lambda = \big\|\big(X'(Y - X\hat\beta)\big)_{\hat T}\big\| \le \big\|\big(X'(Y - X\beta_0)\big)_{\hat T}\big\| + \big\|\big(X'X(\beta_0 - \hat\beta)\big)_{\hat T}\big\| \le \sqrt{|\hat T|}\,n\,\|\mathrm{E}_n[x_i(\sigma\epsilon_i + r_i)]\|_\infty + n\sqrt{\phi_{\max}(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n},$$
where we used
$$\big\|\big(X'X(\hat\beta - \beta_0)\big)_{\hat T}\big\| \le \sup_{\|\alpha_{T^c}\|_0\le\hat m,\,\|\alpha\|\le1}\big|\alpha'X'X(\hat\beta - \beta_0)\big| \le \sup_{\|\alpha_{T^c}\|_0\le\hat m,\,\|\alpha\|\le1}\|\alpha'X'\|\,\|X(\hat\beta - \beta_0)\| = \sup_{\|\alpha_{T^c}\|_0\le\hat m,\,\|\alpha\|\le1}\sqrt{\alpha'X'X\alpha}\;\|X(\hat\beta - \beta_0)\| = n\sqrt{\phi_{\max}(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n}. \qquad\Box$$
Proof of Theorem 8. We can assume that $\sqrt{\hat Q(\beta_0)} > 0$; otherwise the result is trivially true.
In the event $\lambda \ge c\Lambda$, by Lemma 1,
$$\left(\sqrt{\frac{\hat Q(\hat\beta)}{\hat Q(\beta_0)}} - \frac{1}{c}\right)\frac{\lambda}{n}\sqrt{\hat Q(\beta_0)}\,\sqrt{|\hat T|} \le \sqrt{\phi_{\max}(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n}. \tag{E.37}$$
Under the condition $\lambda\sqrt s < n\kappa_{\bar c}\rho$, $\rho < 1$, we have by Theorem 7 that
$$\left(1 - \frac{\rho^2}{1-\rho^2}\,\frac{4\bar c}{c} - \frac{1}{c}\right)\frac{\lambda\sqrt{\hat Q(\beta_0)}}{n\sqrt{\phi_{\max}(\hat m)}}\,\sqrt{|\hat T|} \le \|\hat\beta - \beta_0\|_{2,n}. \qquad\Box$$
Proof of Theorem 9. We can assume that $\sqrt{\hat Q(\beta_0)} > 0$; otherwise the result follows by Theorem 11, which establishes $\hat\beta = \beta_0$.
In the event $\lambda \ge c\Lambda$, by Lemma 1,
$$\left(\sqrt{\frac{\hat Q(\hat\beta)}{\hat Q(\beta_0)}} - \frac{1}{c}\right)\lambda\sqrt{\hat Q(\beta_0)}\,\sqrt{|\hat T|} \le n\sqrt{\phi_{\max}(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n}. \tag{E.38}$$
Under the condition $\lambda\sqrt s \le n\kappa_{\bar c}\rho$, $\rho < 1$, we have by Theorem 4 and Theorem 7 that
$$\left(1 - \frac{\rho^2}{1-\rho^2}\,\frac{4\bar c}{c} - \frac{1}{c}\right)\lambda\sqrt{\hat Q(\beta_0)}\,\sqrt{|\hat T|} \le n\sqrt{\phi_{\max}(\hat m)}\cdot2\left(1+\frac{1}{c}\right)\sqrt{\hat Q(\beta_0)}\,\frac{1}{1-\rho^2}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}.$$
Since $\rho^2 < (c-1)/(c-1+4\bar c)$ we have
$$\sqrt{|\hat T|} \le \frac{\sqrt s}{\kappa_{\bar c}}\cdot\frac{2\sqrt{\phi_{\max}(\hat m)}\,(c+1)}{c-1-\rho^2(c-1+4\bar c)} = \sqrt s\cdot\sqrt{\phi_{\max}(\hat m)}\cdot\frac{2\bar c}{\kappa_{\bar c}}\left(1 + \frac{\rho^2(c-1+4\bar c)}{c-1-\rho^2(c-1+4\bar c)}\right).$$
Under the condition $2\rho^2 \le (c-1)/(c-1+4\bar c)$, the expression above simplifies to
$$|\hat T| \le s\,\phi_{\max}(\hat m)\left(\frac{4\bar c}{\kappa_{\bar c}}\right)^2 \tag{E.39}$$
since $\rho^2(c-1+4\bar c) \le c-1-\rho^2(c-1+4\bar c)$.
Consider any $m\in\mathcal M$, and suppose $\hat m > m$. Then by the sublinearity of sparse eigenvalues
$$\hat m \le s\left\lceil\frac{\hat m}{m}\right\rceil\phi_{\max}(m)\left(\frac{4\bar c}{\kappa_{\bar c}}\right)^2.$$
Thus, since $\lceil k\rceil < 2k$ for any $k \ge 1$, we have $\hat m < s\cdot2\phi_{\max}(m)(4\bar c/\kappa_{\bar c})^2$, which violates the condition defining $\mathcal M$ and $s$. Therefore we must have $\hat m \le m$. In turn, applying (E.39) once more with $\hat m \le m$ we obtain $\hat m \le s\,\phi_{\max}(m)(4\bar c/\kappa_{\bar c})^2$. The result follows by minimizing the bound over $m\in\mathcal M$. □
Lemma 12 (Performance of a generic second-step estimator). Let $\hat\beta$ be any first-step estimator with support $\hat T$, define
$$B_n := \hat Q(\hat\beta) - \hat Q(\beta_0),\qquad C_n := \hat Q(\beta_{0\hat T}) - \hat Q(\beta_0),\qquad D_{\hat m} := \sup_{\|\delta_{T^c}\|_0\le\hat m,\,\|\delta\|_{2,n}>0}\left|\mathrm{E}_n\!\left[\frac{\sigma\epsilon_i x_i'\delta}{\|\delta\|_{2,n}}\right]\right|,$$
and let $\tilde\beta$ be the second-step estimator. Then, for $\hat m := |\hat T\setminus T|$, we have
$$\|\tilde\beta - \beta_0\|_{2,n} \le D_{\hat m} + 2c_s + \sqrt{(B_n)_+\wedge(C_n)_+}.$$
Proof of Lemma 12. Let $\delta := \tilde\beta - \beta_0$. By definition of the second-step estimator, it follows that $\hat Q(\tilde\beta) \le \hat Q(\hat\beta)$ and $\hat Q(\tilde\beta) \le \hat Q(\beta_{0\hat T})$. Thus,
$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \big(\hat Q(\hat\beta) - \hat Q(\beta_0)\big)\wedge\big(\hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)\big) \le B_n\wedge C_n.$$
Moreover, since
$$\big|\hat Q(\tilde\beta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2\big| \le \frac{|S'\delta|}{\|\delta\|_{2,n}}\,\|\delta\|_{2,n} + 2c_s\|\delta\|_{2,n},$$
we have
$$\|\delta\|_{2,n}^2 \le \left(\frac{|S'\delta|}{\|\delta\|_{2,n}} + 2c_s\right)\|\delta\|_{2,n} + B_n\wedge C_n.$$
Solving this inequality and using the definition of $D_{\hat m}$, we obtain the stated result:
$$\|\delta\|_{2,n} \le D_{\hat m} + 2c_s + \sqrt{(B_n)_+\wedge(C_n)_+}. \qquad\Box$$
Lemma 13 (Control of $B_n$ and $C_n$ for √LASSO). Under Condition ASM, if $\lambda \ge c\Lambda$ and $\rho_1 = \lambda\sqrt s/[n\kappa_1] < 1$, we have that
$$B_n\wedge C_n \le 1\{T\not\subseteq\hat T\}\cdot\hat Q(\beta_0)\,\frac{4(1+1/c)\rho_1^2}{1-\rho_1^2}\left(1 + \frac{(1+1/c)\rho_1^2}{1-\rho_1^2}\right).$$
Proof of Lemma 13. First note that $C_n = 0$ if $T\subseteq\hat T$.
Let $\delta = \hat\beta - \beta_0$. By optimality of $\hat\beta$ for (1.3), we have
$$\hat Q(\hat\beta) - \hat Q(\beta_0) \le \Big(\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)}\Big)\frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big) \le 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big) + \frac{\lambda^2}{n^2}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big)_+^2.$$
If $\|\delta_{T^c}\|_1 > \|\delta_T\|_1$, we have $B_n \le 0$. Otherwise, $\|\delta_T\|_1 \le \sqrt s\,\|\delta\|_{2,n}/\kappa_1$.
Moreover, under $\|\delta_{T^c}\|_1 \le \|\delta_T\|_1$ we have that $\|\delta\|_{2,n} \le 2(1+1/c)\sqrt{\hat Q(\beta_0)}\,\rho_1/(1-\rho_1^2)$, for $\lambda\sqrt s \le n\kappa_1\rho_1$, $\rho_1 < 1$, by Theorem 4. Thus,
$$\hat Q(\hat\beta) - \hat Q(\beta_0) \le 2\hat Q(\beta_0)\,\frac{2(1+1/c)\rho_1^2}{1-\rho_1^2} + \rho_1^2\,\frac{4(1+1/c)^2\hat Q(\beta_0)\,\rho_1^2}{(1-\rho_1^2)^2}. \qquad\Box$$
Lemma 14 (Control of $D_{\hat m}$). Suppose Condition ASM holds. Then, uniformly over $\hat m \le p$, with probability at least $1 - 1/C^2 - 1/[9C^2\log p]$,
$$D_{\hat m} \le \frac{C\sigma}{\sqrt{\phi_{\min}(\hat m)}}\sqrt{\frac{s}{n}} + \frac{24C\sigma}{\sqrt{\phi_{\min}(\hat m)}}\sqrt{1\vee\max_{j=1,\dots,p}\mathrm{E}_n\big[x_{ij}^2\epsilon_i^2\big]}\;\sqrt{\frac{\hat m\log p}{n}}.$$
Proof of Lemma 14. By the triangle inequality we have
$$D_{\hat m} := \sup_{\|\delta_{T^c}\|_0\le\hat m,\,\|\delta\|_{2,n}>0}\left|\mathrm{E}_n\!\left[\frac{\sigma\epsilon_i x_i'\delta}{\|\delta\|_{2,n}}\right]\right| \le \sup_{\|\delta_{T^c}\|_0\le\hat m,\,\|\delta\|_{2,n}>0}\left|\mathrm{E}_n\!\left[\frac{\sigma\epsilon_i x_{iT}'\delta}{\|\delta\|_{2,n}}\right]\right| + \sup_{\|\delta_{T^c}\|_0\le\hat m,\,\|\delta\|_{2,n}>0}\left|\mathrm{E}_n\!\left[\frac{\sigma\epsilon_i x_{iT^c}'\delta}{\|\delta\|_{2,n}}\right]\right| =: D^T_{\hat m} + D^{T^c}_{\hat m}.$$
To bound $D^T_{\hat m}$, note that $\|\delta\|_{2,n} \ge \sqrt{\phi_{\min}(\hat m)}\,\|\delta\|$ and $x_{iT}'\delta = x_{iT}'\delta_T$, so that
$$D^T_{\hat m} \le \sigma\,\|\mathrm{E}_n[\epsilon_i x_{iT}]\|\,\frac{\|\delta_T\|}{\sqrt{\phi_{\min}(\hat m)}\,\|\delta\|} \le \frac{\sigma\,\|\mathrm{E}_n[\epsilon_i x_{iT}]\|}{\sqrt{\phi_{\min}(\hat m)}}.$$
We proceed to bound the second moment of $\|\mathrm{E}_n[\epsilon_i x_{iT}]\|$. Since $\mathrm{E}[\epsilon_i] = 0$ and $\mathrm{E}[\epsilon_i^2] = 1$,
$$\mathrm{E}\big[\|\mathrm{E}_n[\epsilon_i x_{iT}]\|^2\big] = \mathrm{E}\big[\mathrm{E}_n[\epsilon_i x_{iT}]'\,\mathrm{E}_n[\epsilon_i x_{iT}]\big] = \mathrm{trace}\big(\mathrm{E}_n[x_{iT}x_{iT}']/n\big) = s/n.$$
Therefore, by Chebyshev's inequality, with probability at least $1 - 1/C^2$,
$$D^T_{\hat m} \le \frac{C\sigma}{\sqrt{\phi_{\min}(\hat m)}}\sqrt{\frac{s}{n}}.$$
Next, to bound $D^{T^c}_{\hat m}$, note that $\mathrm{supp}(\delta) \subseteq T\cup\hat T$. Thus we have $\|\delta\|_{2,n} \ge \sqrt{\phi_{\min}(\hat m)}\,\|\delta\|$ and
$$\mathrm{E}_n[\epsilon_i x_{iT^c}'\delta] = \mathrm{E}_n[\epsilon_i x_{iT^c}'\delta_{T^c}] \le \|\mathrm{E}_n[\epsilon_i x_{iT^c}]\|_\infty\,\|\delta_{T^c}\|_1 \le \|\mathrm{E}_n[\epsilon_i x_{iT^c}]\|_\infty\,\sqrt{\hat m}\,\|\delta_{T^c}\|.$$
Thus we have
$$D^{T^c}_{\hat m} \le \sqrt{\hat m}\,\sigma\,\|\mathrm{E}_n[\epsilon_i x_{iT^c}]\|_\infty\big/\sqrt{\phi_{\min}(\hat m)}.$$
Finally, by Corollary 6, with probability at least $1 - 1/[9C^2\log p]$, we have
$$\|\mathrm{E}_n[\epsilon_i x_{iT^c}]\|_\infty \le 24C\sqrt{1\vee\max_{j=1,\dots,p}\mathrm{E}_n\big[x_{ij}^2\epsilon_i^2\big]}\;\sqrt{\frac{\log p}{n}}. \qquad\Box$$
Proof of Theorem 10. The theorem follows from Lemma 12, applying Lemma 13 to bound $B_n\wedge C_n$ and Lemma 14 to bound $D_{\hat m}$. □
Proof of Theorem 11. Note that because $\sigma = 0$ and $c_s = 0$, we have $\sqrt{\hat Q(\beta_0)} = 0$ and $\sqrt{\hat Q(\hat\beta)} = \|\hat\beta - \beta_0\|_{2,n}$. Thus, by optimality of $\hat\beta$ we have
$$\|\hat\beta - \beta_0\|_{2,n} + \frac{\lambda}{n}\|\hat\beta\|_1 \le \frac{\lambda}{n}\|\beta_0\|_1.$$
Thus $\|\hat\beta\|_1 \le \|\beta_0\|_1$, and $\delta = \hat\beta - \beta_0$ satisfies $\|\delta_{T^c}\|_1 \le \|\delta_T\|_1$. This implies that
$$\|\delta\|_{2,n} \le \frac{\lambda}{n}\|\delta_T\|_1 \le \frac{\lambda\sqrt s}{n\kappa_1}\|\delta\|_{2,n}.$$
Since $\lambda\sqrt s < n\kappa_1$, we have $\|\delta\|_{2,n} = 0$. In turn, $0 = \sqrt s\,\|\delta\|_{2,n} \ge \kappa_1\|\delta_T\|_1 \ge \kappa_1\|\delta\|_1/2$. This shows that $\delta = 0$, i.e. $\hat\beta = \beta_0$. □
Lemma 15. Consider $\epsilon_i \sim t(2)$. Then, for $\tau\in(0,1)$, we have that:
(i) $P\big(\mathrm{E}_n[\epsilon_i^2] \ge 2\sqrt2/\tau + \log(4n/\tau)\big) \le \tau/2$.
(ii) Let $\bar x := \max_{i,j}|x_{ij}|$ and $0 < a < 1/6$. Then, with probability at least $1 - \tau - \frac{1}{n^{1/2}(1/6-a)^2}$, we have
$$\left\|\frac{\mathrm{E}_n[x_i\epsilon_i]}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}}\right\|_\infty \le \frac{4\bar x\sqrt{2\log(4p/\tau)}\,\sqrt{\log(4n/\tau)+2\sqrt2/\tau}}{\sqrt n\,\sqrt{a\log n}}.$$
(iii) For $u_n \ge 0$ and $0 < a < 1/6$, we have
$$P\Big(\sqrt{\mathrm{E}_n[\sigma^2\epsilon_i^2]} > (1+u_n)\sqrt{\mathrm{E}_n[(\sigma\epsilon_i+r_i)^2]}\Big) \le \frac{4\sigma^2c_s^2\log(4n/\tau)}{n\big(c_s^2+[u_n/(1+u_n)]\sigma^2a\log n\big)^2} + \frac{1}{n^{1/2}(1/6-a)^2} + \tau/2.$$
Proof of Lemma 15. To show (i) we will establish a bound on $q(\mathrm{E}_n[\epsilon_i^2], 1-\tau)$. Recall that for a $t(2)$ random variable the cumulative distribution function and the density function are given by
$$F(x) = \frac{1}{2}\left(1 + \frac{x}{\sqrt{2+x^2}}\right)\qquad\text{and}\qquad f(x) = \frac{1}{(2+x^2)^{3/2}}.$$
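As a quick numerical sanity check of these two formulas (not part of the proof), one can verify that integrating $f$ recovers $F$; the sketch below is ours and uses only stdlib Python:

```python
import math

def t2_cdf(x):
    # F(x) = (1/2) * (1 + x / sqrt(2 + x^2)), the t(2) distribution function
    return 0.5 * (1.0 + x / math.sqrt(2.0 + x * x))

def t2_pdf(x):
    # f(x) = 1 / (2 + x^2)^(3/2), the t(2) density
    return (2.0 + x * x) ** -1.5

def integrate_pdf(x, lo=-200.0, steps=200000):
    # trapezoidal rule: the integral of f over (lo, x) approximates F(x)
    # up to the (tiny) tail mass below lo and the discretization error
    h = (x - lo) / steps
    total = 0.5 * (t2_pdf(lo) + t2_pdf(x))
    for k in range(1, steps):
        total += t2_pdf(lo + k * h)
    return total * h
```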
For any truncation level $t_n \ge 2$ we have
$$\mathrm{E}\big[\epsilon_i^21\{\epsilon_i^2\le t_n\}\big] = 2\int_0^{\sqrt2}\frac{x^2\,dx}{(2+x^2)^{3/2}} + 2\int_{\sqrt2}^{\sqrt{t_n}}\frac{x^2\,dx}{(2+x^2)^{3/2}} \le 2\int_0^{\sqrt2}\frac{x^2\,dx}{2^{3/2}} + 2\int_{\sqrt2}^{\sqrt{t_n}}\frac{x^2\,dx}{x^3} \le \log t_n,$$
$$\mathrm{E}\big[\epsilon_i^41\{\epsilon_i^2\le t_n\}\big] \le 2\int_0^{\sqrt2}\frac{x^4\,dx}{2^{3/2}} + 2\int_{\sqrt2}^{\sqrt{t_n}}\frac{x^4\,dx}{x^3} \le t_n,$$
$$\mathrm{E}\big[\epsilon_i^21\{\epsilon_i^2\le t_n\}\big] \ge 2\int_0^1\frac{x^2\,dx}{3^{3/2}} + 2\int_1^{\sqrt2}\frac{x^2\,dx}{4^{3/2}} + \frac{2}{2\sqrt2}\int_{\sqrt2}^{\sqrt{t_n}}\frac{dx}{x} \ge \frac{\log t_n}{2\sqrt2}. \tag{E.40}$$
Also, because $1-\sqrt{1-v} \le v$ for every $0\le v\le1$,
$$P\big(|\epsilon_i|^2 > t_n\big) = 1 - \sqrt{\frac{t_n}{2+t_n}} \le \frac{2}{2+t_n}. \tag{E.41}$$
Thus, by setting $t_n = 4n/\tau$ and $t = 2\sqrt2/\tau$, we have by [14], relation (7.5),
$$P\Big(\big|\mathrm{E}_n[\epsilon_i^2] - \mathrm{E}[\epsilon_i^21\{\epsilon_i^2\le t_n\}]\big| \ge t\Big) \le \frac{\mathrm{E}\big[\epsilon_i^41\{\epsilon_i^2\le t_n\}\big]}{nt^2} + nP\big(|\epsilon_i^2| > t_n\big) \le \frac{t_n}{nt^2} + \frac{2n}{2+t_n} \le \tau/2. \tag{E.42}$$
Thus (i) is established, and we have
$$q\Big(\max_{1\le j\le p}\mathrm{E}_n[x_{ij}^2\epsilon_i^2],\ 1-\tau/2\Big) \le \bar x^2\,q\big(\mathrm{E}_n[\epsilon_i^2],\ 1-\tau/2\big) \le \bar x^2\big(\log(4n/\tau) + 2\sqrt2/\tau\big).$$
To show (ii), using the bound above and Theorem 14, we have that with probability $1-\tau$
$$\|\mathbb{G}_n[x_i\epsilon_i]\|_\infty \le 4\bar x\sqrt{2\log(4p/\tau)}\,\sqrt{\log(4n/\tau) + 2\sqrt2/\tau}.$$
Moreover, for $0 < a < 1/6$, we have
$$\begin{aligned}
P\big(\mathrm{E}_n[\epsilon_i^2] \le a\log n\big) &\le P\big(\mathrm{E}_n[\epsilon_i^21\{\epsilon_i^2\le n^{1/2}\}] \le a\log n\big)\\
&\le P\Big(\big|\mathrm{E}_n[\epsilon_i^21\{\epsilon_i^2\le n^{1/2}\}] - \mathrm{E}[\epsilon_i^21\{\epsilon_i^2\le n^{1/2}\}]\big| \ge (1/6)\log n - a\log n\Big)\\
&\le \frac{1}{n^{1/2}(1/6-a)^2}
\end{aligned}\tag{E.43}$$
by Chebyshev's inequality and since $\mathrm{E}\big[\epsilon_i^21\{\epsilon_i^2\le n^{1/2}\}\big] \ge (1/6)\log n$. Thus, with probability at least $1-\tau-\frac{1}{n^{1/2}(1/6-a)^2}$, we have
$$\left\|\frac{\mathrm{E}_n[x_i\epsilon_i]}{\sqrt{\mathrm{E}_n[\epsilon_i^2]}}\right\|_\infty \le \frac{4\bar x\sqrt{2\log(4p/\tau)}\,\sqrt{\log(4n/\tau)+2\sqrt2/\tau}}{\sqrt n\,\sqrt{a\log n}}.$$
To establish (iii), let $a_n = \big[(1+u_n)^2-1\big]/(1+u_n)^2 = u_n(2+u_n)/(1+u_n)^2 \ge u_n/(1+u_n)$, and note that by (E.40), (E.42), and (E.43) we have
$$\begin{aligned}
P\Big(\sqrt{\mathrm{E}_n[\sigma^2\epsilon_i^2]} > (1+u_n)\sqrt{\mathrm{E}_n[(\sigma\epsilon_i+r_i)^2]}\Big) &= P\big(-2\sigma\mathrm{E}_n[\epsilon_ir_i] > c_s^2 + a_n\mathrm{E}_n[\sigma^2\epsilon_i^2]\big)\\
&\le P\big(-2\sigma\mathrm{E}_n[\epsilon_ir_i1\{\epsilon_i^2\le t_n\}] > c_s^2 + a_n\sigma^2a\log n\big) + P\big(\mathrm{E}_n[\epsilon_i^2]\le a\log n\big) + nP\big(\epsilon_i^2 > t_n\big)\\
&\le \frac{4\sigma^2c_s^2\log t_n}{n\big(c_s^2+a_n\sigma^2a\log n\big)^2} + \frac{1}{n^{1/2}(1/6-a)^2} + \tau/2. \qquad\Box
\end{aligned}$$
Proof of Theorem 12. The result follows from Theorem 4 and Lemma 15, which provide bounds for $\lambda$ and for $\sqrt{\hat Q(\beta_0)} \le c_s + \sigma\sqrt{\mathrm{E}_n[\epsilon_i^2]}$. Note the simplification
$$\frac{4\sigma^2c_s^2\log t_n}{n\big(c_s^2 + a_n\sigma^2a\log n\big)^2} \le \frac{2\log t_n}{n\,a_n\,a\log n}.$$
The result follows with $t_n = 4n/\tau$, $a = 1/24$, and $a_n = u_n/[1+u_n]$. □
Appendix F. Comparing Computational Methods for LASSO and √LASSO
Below we discuss in more detail the application of these methods to lasso and √LASSO. For each method, the similarities between the lasso and √LASSO formulations derived below provide theoretical justification for their similar computational performance.
Interior point methods. Interior point method (IPM) solvers typically focus on solving conic programming problems in standard form,
$$\min_w\ c'w\ :\ Aw = b,\ w\in K. \tag{F.44}$$
The main difficulty of the problem arises because the conic constraint will be binding at the optimal solution. IPMs regularize the objective function of the optimization with a barrier function, so that the optimal solution of the regularized problem naturally lies in the interior of the cone. By steadily scaling down the barrier function, an IPM creates a sequence of solutions that converges to the solution of the original problem (F.44).
In order to formulate the optimization problem associated with the lasso estimator as a conic programming problem (F.44), we let $\beta = \beta^+ - \beta^-$, and note that for any vector $v\in\mathbb{R}^n$ and any scalar $t\ge0$,
$$v'v \le t \quad\text{is equivalent to}\quad \big\|(v,(t-1)/2)\big\| \le (t+1)/2.$$
Thus the lasso optimization problem can be cast as
$$\begin{aligned}
\min_{t,\beta^+,\beta^-,a_1,a_2,v}\quad & \frac{t}{n} + \frac{\lambda}{n}\sum_{j=1}^p\big(\beta^+_j + \beta^-_j\big)\\
\text{s.t.}\quad & v = Y - X\beta^+ + X\beta^-,\\
& t = -1 + 2a_1,\qquad t = 1 + 2a_2,\\
& (v, a_2, a_1)\in Q^{n+2},\quad t\ge0,\quad \beta^+\in\mathbb{R}^p_+,\quad \beta^-\in\mathbb{R}^p_+.
\end{aligned}$$
The √LASSO optimization problem can be cast similarly, but without the auxiliary variables $a_1, a_2$:
$$\begin{aligned}
\min_{t,\beta^+,\beta^-,v}\quad & \frac{t}{\sqrt n} + \frac{\lambda}{n}\sum_{j=1}^p\big(\beta^+_j + \beta^-_j\big)\\
\text{s.t.}\quad & v = Y - X\beta^+ + X\beta^-,\\
& (v,t)\in Q^{n+1},\quad \beta^+\in\mathbb{R}^p_+,\quad \beta^-\in\mathbb{R}^p_+.
\end{aligned}$$
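Both conic formulations rest on the rotated-cone identity $v'v \le t \iff \|(v,(t-1)/2)\| \le (t+1)/2$ used for lasso above; a quick numerical check of this equivalence (our sketch, with function names that are ours):

```python
import numpy as np

def in_rotated_cone(v, t):
    # membership in the second-order cone constraint ||(v, (t-1)/2)|| <= (t+1)/2
    return np.sqrt(v @ v + ((t - 1.0) / 2.0) ** 2) <= (t + 1.0) / 2.0

rng = np.random.default_rng(0)
for _ in range(1000):
    v = rng.normal(size=3)
    t = rng.uniform(0.0, 10.0)
    # agreement with the quadratic constraint v'v <= t
    assert in_rotated_cone(v, t) == (v @ v <= t)
```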
First-order methods. The new generation of first-order methods focuses on structured convex problems that can be cast as
$$\min_w\ f(A(w)+b) + h(w)\qquad\text{or}\qquad \min_w\ h(w)\ :\ A(w)+b\in K,$$
where $f$ is a smooth function and $h$ is a structured function, possibly non-differentiable or taking extended values, that admits an efficient proximal operator; see [1]. By combining projections and (sub)gradient information, these methods construct a sequence of iterates with strong theoretical guarantees. Recently these methods have been specialized to conic problems, which include lasso and √LASSO. It is well known that the same optimization problem admits several different formulations, and the particular choice can substantially affect computational running times. We focus on simple formulations for lasso and √LASSO.
Lasso is cast as
$$\min_w\ f(A(w)+b) + h(w)$$
where $f(\cdot) = \|\cdot\|^2/n$, $h(\cdot) = (\lambda/n)\|\cdot\|_1$, $A = X$, and $b = -Y$. The proximal subproblem to be solved at every iteration, given a current point $\beta^k$, is
$$\beta(\beta^k) = \arg\min_\beta\ -2\,\mathrm{E}_n[x_i(y_i - x_i'\beta^k)]'\beta + \frac{\mu}{2}\|\beta-\beta^k\|^2 + \frac{\lambda}{n}\|\beta\|_1.$$
It follows that the minimization in $\beta$ above is separable and can be solved by soft-thresholding as
$$\beta_j(\beta^k) = \mathrm{sign}\Big(\beta^k_j + 2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta^k)]/\mu\Big)\,\max\Big\{\Big|\beta^k_j + 2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta^k)]/\mu\Big| - \lambda/[n\mu],\ 0\Big\}.$$
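A minimal Python sketch of this soft-thresholding step (our illustration; `mu` plays the role of the prox parameter µ, and all names are ours):

```python
import numpy as np

def soft_threshold(v, kappa):
    # componentwise soft-thresholding: sign(v) * max(|v| - kappa, 0)
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_prox_step(beta_k, X, Y, lam, mu):
    # one proximal-gradient step for E_n[(y_i - x_i'beta)^2] + (lam/n)||beta||_1:
    # linearize the smooth part at beta_k, take a gradient step, soft-threshold
    n = X.shape[0]
    grad = -2.0 * X.T @ (Y - X @ beta_k) / n   # gradient of E_n[(y_i - x_i'beta)^2]
    return soft_threshold(beta_k - grad / mu, lam / (n * mu))
```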
For √LASSO the "conic form" is given by
$$\min_w\ h(w)\ :\ A(w)+b\in K.$$
Letting $Q^{n+1} = \{(z,t)\in\mathbb{R}^n\times\mathbb{R} : t\ge\|z\|\}$ and $h(w) = h(\beta,t) = t/\sqrt n + (\lambda/n)\|\beta\|_1$, we have
$$\min_{\beta,t}\ \frac{t}{\sqrt n} + \frac{\lambda}{n}\|\beta\|_1\ :\ A(\beta,t)+b\in Q^{n+1},$$
where $b = (-Y',0)'$ and $A(\beta,t) = (\beta'X', t)'$.
In the associated dual problem, the dual variable $z\in\mathbb{R}^n$ is constrained to satisfy $\|z\|\le1/\sqrt n$ (the corresponding dual variable associated with $t$ is set to $1/\sqrt n$ to obtain a finite dual value). Thus, after adding the proximal term around the current point $\beta^k$, we obtain
$$\max_{\|z\|\le1/\sqrt n}\ \inf_\beta\ \frac{\lambda}{n}\|\beta\|_1 + \frac{\mu}{2}\|\beta-\beta^k\|^2 + z'(Y - X\beta).$$
Given iterates $\beta^k, z^k$, as in the case of lasso the minimization in $\beta$ is separable and can be solved by soft-thresholding as
$$\beta_j(\beta^k,z^k) = \mathrm{sign}\big(\beta^k_j + (X'z^k/\mu)_j\big)\,\max\big\{\big|\beta^k_j + (X'z^k/\mu)_j\big| - \lambda/[n\mu],\ 0\big\}.$$
The dual projection accounts for the constraint $\|z\|\le1/\sqrt n$ and solves
$$z(\beta^k,z^k) = \arg\min_{\|z\|\le1/\sqrt n}\ \frac{\theta_k}{2t_k}\|z-z^k\|^2 - (Y - X\beta^k)'z,$$
which yields
$$z(\beta^k,z^k) = \frac{z^k + (t_k/\theta_k)(Y - X\beta^k)}{\big\|z^k + (t_k/\theta_k)(Y - X\beta^k)\big\|}\,\min\left\{\frac{1}{\sqrt n},\ \big\|z^k + (t_k/\theta_k)(Y - X\beta^k)\big\|\right\}.$$
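The z-update above is a gradient step in the residual direction followed by a radial projection onto the ball of radius $1/\sqrt n$; a small Python sketch (ours):

```python
import numpy as np

def dual_z_update(z_k, beta_k, X, Y, t_k, theta_k):
    # unconstrained minimizer of the quadratic part: a step along the residual
    w = z_k + (t_k / theta_k) * (Y - X @ beta_k)
    nw = np.linalg.norm(w)
    if nw == 0.0:
        return w
    # radial projection onto the Euclidean ball ||z|| <= 1/sqrt(n)
    n = X.shape[0]
    return (w / nw) * min(1.0 / np.sqrt(n), nw)
```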
Componentwise search. A common approach to solving unconstrained multivariate optimization problems is to (i) pick a component, (ii) fix all remaining components, (iii) minimize the objective function along the chosen component, and loop over steps (i)–(iii) until convergence is achieved. This is particularly attractive in cases where the minimization over a single component can be done very efficiently; its simple implementation also contributes to the widespread use of this approach.
Consider the following lasso optimization problem:
$$\min_{\beta\in\mathbb{R}^p}\ \mathrm{E}_n[(y_i - x_i'\beta)^2] + \frac{\lambda}{n}\sum_{j=1}^p\gamma_j|\beta_j|.$$
Under standard normalization assumptions we would have $\gamma_j = 1$ and $\mathrm{E}_n[x_{ij}^2] = 1$ for $j = 1,\dots,p$. Below we describe the rule that sets the value of $\beta_j$ optimally, given fixed values of the remaining variables; it is well known that the lasso optimization problem has a closed-form solution for minimizing over a single component.
For a current point $\beta$, let $\beta_{-j} = (\beta_1,\beta_2,\dots,\beta_{j-1},0,\beta_{j+1},\dots,\beta_p)'$.
• If $2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] > \lambda\gamma_j/n$, the optimal choice for $\beta_j$ is
$$\beta_j = \big(2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] - \lambda\gamma_j/n\big)\big/\big(2\,\mathrm{E}_n[x_{ij}^2]\big).$$
• If $2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] < -\lambda\gamma_j/n$, the optimal choice for $\beta_j$ is
$$\beta_j = \big(2\,\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] + \lambda\gamma_j/n\big)\big/\big(2\,\mathrm{E}_n[x_{ij}^2]\big).$$
• If $2\big|\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]\big| \le \lambda\gamma_j/n$, we set $\beta_j = 0$.
This simple method is particularly attractive when the optimal solution is sparse, which is typically the case of interest under choices of penalty level that dominate the noise, such as $\lambda \ge cn\|S\|_\infty$.
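The componentwise rules above translate directly into a coordinate-descent loop; a minimal Python sketch under the normalization $\mathrm{E}_n[x_{ij}^2] = 1$ and $\gamma_j = 1$ (the names and fixed iteration count are ours):

```python
import numpy as np

def lasso_cd(X, Y, lam, n_iter=100):
    # componentwise minimization of E_n[(y_i - x_i'beta)^2] + (lam/n)||beta||_1,
    # assuming the columns are normalized so that E_n[x_ij^2] = 1
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            beta[j] = 0.0                             # current point is beta_{-j}
            g = 2.0 * X[:, j] @ (Y - X @ beta) / n    # 2 E_n[x_ij (y_i - x_i'beta_{-j})]
            if g > lam / n:
                beta[j] = (g - lam / n) / 2.0
            elif g < -lam / n:
                beta[j] = (g + lam / n) / 2.0
    return beta
```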
Despite the additional square root, which makes the criterion function non-separable, it turns out that the componentwise minimization for √LASSO also has a closed-form solution. Consider the following optimization problem:
$$\min_{\beta\in\mathbb{R}^p}\ \sqrt{\mathrm{E}_n[(y_i - x_i'\beta)^2]} + \frac{\lambda}{n}\sum_{j=1}^p\gamma_j|\beta_j|.$$
As before, under standard normalization assumptions we would have $\gamma_j = 1$ and $\mathrm{E}_n[x_{ij}^2] = 1$ for $j = 1,\dots,p$. Below we describe the rule that sets the value of $\beta_j$ optimally, given fixed values of the remaining variables.
• If $\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] > (\lambda/n)\gamma_j\sqrt{\hat Q(\beta_{-j})}$, we have
$$\beta_j = \frac{\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]}{\mathrm{E}_n[x_{ij}^2]} - \frac{\lambda\gamma_j}{\mathrm{E}_n[x_{ij}^2]}\cdot\frac{\sqrt{\hat Q(\beta_{-j}) - \mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]^2/\mathrm{E}_n[x_{ij}^2]}}{\sqrt{n^2 - \lambda^2\gamma_j^2/\mathrm{E}_n[x_{ij}^2]}}.$$
• If $\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})] < -(\lambda/n)\gamma_j\sqrt{\hat Q(\beta_{-j})}$, we have
$$\beta_j = \frac{\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]}{\mathrm{E}_n[x_{ij}^2]} + \frac{\lambda\gamma_j}{\mathrm{E}_n[x_{ij}^2]}\cdot\frac{\sqrt{\hat Q(\beta_{-j}) - \mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]^2/\mathrm{E}_n[x_{ij}^2]}}{\sqrt{n^2 - \lambda^2\gamma_j^2/\mathrm{E}_n[x_{ij}^2]}}.$$
• If $\big|\mathrm{E}_n[x_{ij}(y_i - x_i'\beta_{-j})]\big| \le (\lambda/n)\gamma_j\sqrt{\hat Q(\beta_{-j})}$, we have $\beta_j = 0$.
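Analogously, the closed-form single-coordinate update yields the following √LASSO coordinate-descent sketch (ours; we take $\gamma_j = 1$ and assume $\lambda < n\sqrt{\mathrm{E}_n[x_{ij}^2]}$ so the denominator $n^2 - \lambda^2/\mathrm{E}_n[x_{ij}^2]$ stays positive):

```python
import numpy as np

def sqrt_lasso_cd(X, Y, lam, n_iter=100):
    # componentwise minimization of sqrt(E_n[(y_i - x_i'beta)^2]) + (lam/n)||beta||_1
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            beta[j] = 0.0
            r = Y - X @ beta                 # residuals at beta_{-j}
            A = X[:, j] @ r / n              # E_n[x_ij (y_i - x_i'beta_{-j})]
            B = X[:, j] @ X[:, j] / n        # E_n[x_ij^2]
            Q = r @ r / n                    # hat Q(beta_{-j})
            if abs(A) <= (lam / n) * np.sqrt(Q):
                continue                     # optimal beta_j is zero
            shrink = (lam / B) * np.sqrt(max(Q - A * A / B, 0.0)) \
                     / np.sqrt(n * n - lam * lam / B)
            beta[j] = A / B - np.sign(A) * shrink
    return beta
```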
References
[1] Stephen Becker, Emmanuel Candes, and Michael Grant. Templates for convex cone problems with
applications to sparse signal recovery. ArXiv, 2010.
[2] Stephen Becker, Emmanuel Candes, and Michael Grant. Tfocs v1.0 user guide (beta). 2010.
[3] A. Belloni and V. Chernozhukov. Post-ℓ1-penalized estimators in high-dimensional linear regression
models. arXiv:[math.ST], 2009.
[4] A. Belloni and V. Chernozhukov. ℓ1-penalized quantile regression for high dimensional sparse models.
accepted at the Annals of Statistics, 2010.
[5] A. Belloni, V. Chernozhukov, and L. Wang. Square-root-lasso: Pivotal recovery of sparse signals via
conic programming. arXiv:[math.ST], 2010.
[6] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals
of Statistics, 37(4):1705–1732, 2009.
[7] F. Bunea, A. Tsybakov, and M. H. Wegkamp. Sparsity oracle inequalities for the lasso. Electronic
Journal of Statistics, 1:169–194, 2007.
[8] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation and sparsity via ℓ1 penalized least
squares. In Proceedings of 19th Annual Conference on Learning Theory (COLT 2006) (G. Lugosi and
H. U. Simon, eds.), pages 379–391, 2006.
[9] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. The Annals of
Statistics, 35(4):1674–1697, 2007.
[10] E. Candes and Y. Plan. Near-ideal model selection by l1 minimization. Ann. Statist., 37(5A):2145–2177,
2009.
[11] E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Ann.
Statist., 35(6):2313–2351, 2007.
[12] V. H. de la Pena, T. L. Lai, and Q.-M. Shao. Self-Normalized Processes. Springer, 2009.
[13] J. Fan and J. Lv. Sure independence screening for ultra-high dimensional feature space. Journal of the
Royal Statistical Society Series B, 70(5):849–911, 2008.
[14] W. Feller. An Introduction to Probability Theory and Its Applications, Volume II. Wiley Series in Probability and Mathematical Statistics, 1966.
[15] V. Koltchinskii. Sparsity in penalized empirical risk minimization. Ann. Inst. H. Poincaré Probab. Statist., 45(1):7–57, 2009.
[16] K. Lounici. Sup-norm convergence rate and sign concentration property of lasso and dantzig estimators.
Electron. J. Statist., 2:90–102, 2008.
[17] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task
learning. arXiv:0903.1468v1 [stat.ML], 2010.
[18] Z. Lu. Gradient based method for cone programming with application to large-scale compressed sensing.
Technical Report, 2008.
[19] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data.
Annals of Statistics, 37(1):2246–2270, 2009.
[20] Y. Nesterov. Smooth minimization of non-smooth functions, mathematical programming. Mathematical
Programming, 103(1):127–152, 2005.
[21] Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related
problems. Mathematical Programming, 109(2-3):319–344, 2007.
[22] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society
for Industrial and Applied Mathematics (SIAM), Philadelphia, 1993.
[23] J. Renegar. A Mathematical View of Interior-Point Methods in Convex Optimization. Society for In-
dustrial and Applied Mathematics (SIAM), Philadelphia, 2001.
[24] M. Rosenbaum and A. B. Tsybakov. Sparse recovery under matrix uncertainty. arXiv:0812.2818v1
[math.ST], 2008.
[25] H. P. Rosenthal. On the subspaces of lp (p > 2) spanned by sequences of independent random variables.
Israel J. Math., 9:273–303, 1970.
[26] Haskell P. Rosenthal. On the subspaces of Lp (p > 2) spanned by sequences of independent random
variables. Israel J. Math., 8:273–303, 1970.
[27] Herman Rubin and J. Sethuraman. Probabilities of moderate deviations. Sankhya Ser. A, 27:325–346, 1965.
[28] A. D. Slastnikov. Limit theorems for moderate deviation probabilities. Theory of Probability and its
Applications, 23:322–340, 1979.
[29] A. D. Slastnikov. Large deviations for sums of nonidentically distributed random variables. Teor. Veroy-
atnost. i Primenen., 27(1):36–46, 1982.
[30] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58:267–288,
1996.
[31] K. C. Toh, M. J. Todd, and R. H. Tutuncu. Sdpt3 — a matlab software package for semidefinite
programming. Optimization Methods and Software, 11:545–581, 1999.
[32] K. C. Toh, M. J. Todd, and R. H. Tutuncu. On the implementation and usage of sdpt3 – a matlab
software package for semidefinite-quadratic-linear programming, version 4.0. Handbook of semidefinite,
cone and polynomial optimization: theory, algorithms, software and applications, Edited by M. Anjos
and J.B. Lasserre, June 2010.
[33] R. H. Tutuncu, K. C. Toh, and M. J. Todd. SDPT3 — a MATLAB software package
for semidefinite-quadratic-linear programming, version 3.0. Technical report, 2001. Available at
http://www.math.nus.edu.sg/∼mattohkc/sdpt3.html.
[34] S. A. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics,
36(2):614–645, 2008.
[35] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Series in
Statistics, 1996.
[36] Bengt von Bahr and Carl-Gustav Esseen. Inequalities for the rth absolute moment of a sum of random
variables, 1 ≤ r ≤ 2. Ann. Math. Statist, 36:299–303, 1965.
[38] M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-
constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183–2202,
May 2009.
[39] C.-H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional linear
regression. Ann. Statist., 36(4):1567–1594, 2008.