
Block Regularized Lasso for Multivariate Multi-Response Linear Regression

Weiguang Wang (Syracuse University), Yingbin Liang (Syracuse University), Eric P. Xing (Carnegie Mellon University)

Abstract

The multivariate multi-response (MVMR) linear regression problem is investigated, in which the design matrices are Gaussian with covariance matrices Σ(1:K) = (Σ(1), . . . , Σ(K)) for the K linear regressions. The support union of the K p-dimensional regression vectors (collected as the columns of the matrix B∗) is recovered using the l1/l2-regularized Lasso. Sufficient and necessary conditions to guarantee successful recovery of the support union are characterized via a threshold. More specifically, it is shown that, under certain conditions on the distributions of the design matrices, if n > c_p1 ψ(B∗, Σ(1:K)) log(p − s), where c_p1 is a constant and s is the size of the support set, then the l1/l2-regularized Lasso correctly recovers the support union; and if n < c_p2 ψ(B∗, Σ(1:K)) log(p − s), where c_p2 is a constant, then the l1/l2-regularized Lasso fails to recover the support union. In particular, ψ(B∗, Σ(1:K)) captures the impact of the sparsity of the K regression vectors and of the statistical properties of the design matrices on the threshold for support recovery. Numerical results are provided to demonstrate the advantages of joint support union recovery using multi-task Lasso over individual support recovery using single-task Lasso.

1 Introduction

Linear regression is a simple but practically very useful statistical model, in which an n-sample response vector Y can be modeled as

Y = Xθ + W

where X ∈ R^{n×p} is the design matrix containing n samples of feature vectors, θ = (θ_1, . . . , θ_p) ∈ R^p contains the regression coefficients, and W ∈ R^n is the noise vector. The goal is to find the regression coefficients θ such that the linear relationship is as accurate as possible with respect to a certain performance criterion. The problem is more interesting in the high-dimensional regime with a sparse regression vector, in which the sample size n can be much smaller than the dimension p of the regression vector.

Appearing in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.

In order to estimate the sparse regression vector, it is natural to construct an optimization problem with an l0-constraint on θ, i.e., a constraint on the number of nonzero components of θ. However, such an optimization problem is nonconvex and in general very difficult to solve efficiently [1]. More recently, a convex relaxation (referred to as Lasso) has been studied with an l1-constraint on θ, based on the ideas in the seminal work [2] by Tibshirani, [3] by Chen, Donoho and Saunders, and [4] by Donoho and Huo. More specifically, the regression problem can be formulated as:

min_{θ ∈ R^p}  (1/n) ‖Y − Xθ‖_l2^2 + λ_n ‖θ‖_l1 .

The l1-regularized estimator has been proved to be equivalent to the Dantzig selector [5], which was proposed in [6]. Various efficient algorithms have been developed to solve the above convex problem (see the review monograph [7]), although the objective function is not differentiable everywhere due to the l1-regularization. Moreover, the l1-regularization is critical for forcing the minimizer to have sparse components, as shown in [2–4].
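As a concrete illustration (our addition, not part of the original paper), the following minimal Python sketch solves the l1-regularized problem with scikit-learn. Note that scikit-learn's Lasso minimizes (1/(2n))‖Y − Xθ‖_2^2 + α‖θ‖_l1, which matches the formulation above up to a constant factor on the loss; the value α = 0.1 is purely illustrative.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 500, 10                       # high-dimensional regime: n << p, sparse truth
X = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[:s] = 1.0
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# alpha plays the role of the regularization parameter lambda_n above
theta_hat = Lasso(alpha=0.1).fit(X, y).coef_
print(np.flatnonzero(theta_hat))             # estimated support; ideally {0, ..., s-1}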

A vast amount of recent work has studied the high-dimensional linear regression problem via l1-regularized Lasso under various assumptions. For example, the studies in [8–11] investigated the noiseless scenario and showed that recovery of the true coefficients could be guaranteed under certain conditions on the design matrices and sparsity. A number of studies developed l1-regularized quadratic programming to achieve sparsity recovery for noisy scenarios. Some work (e.g., [12–14]) focused on the problem with deterministic design matrices, while other work (e.g., [15, 16]) studied the problem with random design matrices. Lasso has also been proved to be useful for generalized linear models (GLMs) such as logistic regression [17, 18] and some specific exponential families [19].

Inspired by the success of Lasso in the single-task problem, block-regularized Lasso for the high-dimensional multivariate linear regression problem (i.e., the multi-task linear regression problem) has been intensively studied (see, e.g., [20–23] and references therein). The model of the problem is given by

Y = XB∗ + W    (1)

where Y ∈ R^{n×K} has each column corresponding to the output of one task, X ∈ R^{n×p} is the design matrix, the regression matrix B∗ ∈ R^{p×K} has each column corresponding to the regression vector for one task, and W ∈ R^{n×K} has each column corresponding to the noise vector of one task. For each column Y(k) of the matrix Y, it is clear that Y(k) = X θ∗(k) + W(k), where θ∗(k) and W(k) are the corresponding columns of B∗ and W. Each column thus defines a single-task linear regression problem and can be solved individually. However, the K individual problems (i.e., tasks) can also be coupled together via a block-regularized Lasso and solved jointly in one problem.

Various types of block regularization have been proposed and studied. In [24], the l1/l2-regularization was adopted to recover the support union of the regression vectors. More specifically, the following problem was studied:

min_{B ∈ R^{p×K}}  (1/2n) |||Y − XB|||_F^2 + λ_n ‖B‖_l1/l2 ,

where ‖·‖_la/lb is defined in (5) in Section 2.1. Sufficient and necessary conditions for correct recovery of the support union (i.e., the union of the supports of all columns of B∗) have been characterized. The l1/lq-regularized Lasso was adopted for learning structured linear regression models in [25]. Block-regularized Lasso has also been applied to study various other models and problems in [26–36].
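For intuition (our addition, with purely illustrative parameter values), this multivariate problem with a single shared design matrix can be solved directly with scikit-learn's MultiTaskLasso, whose penalty is exactly the row-wise l1/l2 block norm ‖B‖_l1/l2:

import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(1)
n, p, K, s = 150, 200, 4, 10
X = rng.standard_normal((n, p))
B_true = np.zeros((p, K))
B_true[:s, :] = rng.standard_normal((s, K))           # support union is {0, ..., s-1}
Y = X @ B_true + 0.1 * rng.standard_normal((n, K))

# MultiTaskLasso minimizes (1/(2n)) |||Y - XB|||_F^2 + alpha * sum_i ||B_i||_2
fit = MultiTaskLasso(alpha=0.1).fit(X, Y)
B_hat = fit.coef_.T                                    # scikit-learn stores coefficients as (K, p)
print(np.flatnonzero(np.linalg.norm(B_hat, axis=1)))   # estimated support union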

In the multivariate linear regression problem given in (1), the design matrix is identical for all tasks, i.e., X is the same for all column vectors of Y and B∗. However, in many applications, it is often the case that different output variables may depend on design variables that are different or distributed differently. Thus, the resulting model includes K linear regression models with different design matrices and is given by:

Y(k) = X(k) θ∗(k) + W(k)    (2)

for k = 1, . . . , K, where Y(k) ∈ R^n, X(k) ∈ R^{n×p}, θ∗(k) ∈ R^p, and W(k) ∈ R^n. We refer to the above problem as the multivariate multi-response (MVMR) linear regression model, and the goal is to recover θ∗(k) for k = 1, . . . , K jointly. This problem has been studied in [37] via the l1/l2-regularized Lasso for fixed matrices X(1), . . . , X(K). For random design matrices, this model has been studied via l1/l∞-regularized Lasso in [38] and via l1/l1 + l1/l∞-regularized Lasso in [22] for incorporating both row sparsity and individual sparsity.

In this paper, we study the MVMR problem for random design matrices via the l1/l2-regularized Lasso. It is assumed that the design matrices are Gaussian distributed, and are independent but not identical across k. For each task k, the row vector of X(k) is Gaussian with mean zero and covariance matrix Σ(k), for k = 1, . . . , K. The noise vectors, and hence the output vectors, are also Gaussian distributed and independent across tasks. We are interested in joint recovery of the union of the support sets (i.e., the support union) of the regression vectors θ∗(1), . . . , θ∗(K). We collect these vectors together as a matrix B∗ = [θ∗(1), . . . , θ∗(K)].

We adopt the l1/l2-regularized Lasso for recovery of the support union via the following optimization problem:

min_{B ∈ R^{p×K}}  (1/2n) Σ_{k=1}^{K} ‖Y(k) − X(k) θ(k)‖_2^2 + λ_n ‖B‖_l1/l2

where B = [θ(1), . . . , θ(K)]. In this way, the K linear regression problems are coupled together via the regularization constraint. We show that this approach is advantageous as opposed to individual recovery of the support set for each linear regression problem. This is because the K regression models may share their samples in joint support recovery, so that the total number of samples needed can be significantly reduced compared to performing each task individually.
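Because each task here has its own design matrix, off-the-shelf multi-task Lasso solvers that assume a single shared X do not apply directly. The following proximal-gradient sketch (our own, with illustrative step-size and iteration choices, not the authors' algorithm) solves the above objective by alternating a gradient step on the smooth part with row-wise group soft-thresholding:

import numpy as np

def mvmr_group_lasso(X_list, Y_list, lam, n_iter=500):
    """Proximal-gradient sketch for min_B (1/2n) sum_k ||Y(k) - X(k) b(k)||_2^2 + lam * ||B||_{l1/l2}."""
    n, p = X_list[0].shape
    K = len(X_list)
    B = np.zeros((p, K))
    # Step size 1/L, where L upper-bounds the Lipschitz constant of the smooth part.
    L = max(np.linalg.norm(X, 2) ** 2 for X in X_list) / n
    for _ in range(n_iter):
        G = np.column_stack([X_list[k].T @ (X_list[k] @ B[:, k] - Y_list[k]) / n
                             for k in range(K)])
        V = B - G / L
        row_norms = np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
        B = np.maximum(0.0, 1.0 - lam / (L * row_norms)) * V   # row-wise group soft-thresholding
    return B

The nonzero rows of the returned estimate give the estimated support union; lam plays the role of λ_n.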

1.1 Main Results and Contributions

In the following, we summarize the main contributions of this work. Our results contain two parts, the achievability and the converse, corresponding respectively to sufficient and necessary conditions under which the l1/l2-regularized Lasso recovers the support union for the MVMR linear regression problem. Our proof adapts the techniques developed by Wainwright in [15] and by Obozinski, Wainwright, and Jordan in [24], but involves nontrivial development to deal with the differently distributed design matrices across tasks. This also leads to an interesting generalization of the results in [24], as we articulate in Section 1.2.


More specifically, we show that under certain conditions that the distributions of the design matrices satisfy, if n > c_p1 ψ(B∗, Σ(1:K)) log(p − s), where ψ(·) is defined in (6) in Section 2.1 and c_p1 is a constant, then the l1/l2-regularized Lasso recovers the support union for the MVMR linear regression problem; and if n < c_p2 ψ(B∗, Σ(1:K)) log(p − s), where c_p2 is a constant, then the l1/l2-regularized Lasso fails to recover the support union. Thus, ψ(B∗, Σ(1:K)) log(p − s) serves as a sharp threshold on the sample size.

In particular, ψ(B∗, Σ(1:K)) captures the sparsity of B∗ and the statistical properties of the design matrices, which are important in determining the sufficient and necessary conditions for successful recovery of the support union. The behavior of ψ(B∗, Σ(1:K)) also captures the advantages of the multi-task Lasso over solving each problem individually via the single-task Lasso. We show that when the K tasks share the same support sets (although the design matrices can be differently distributed), ψ(B∗, Σ(1:K)) = (1/K) max_{1≤k≤K} ψ(θ∗(k), Σ(k)). This means that the number of samples needed per task for the multi-task Lasso to jointly recover the support union is reduced by a factor of K compared to that of the single-task Lasso to recover each support set individually. On the other hand, if the K tasks have disjoint support sets, then ψ(B∗, Σ(1:K)) = max_{1≤k≤K} ψ(θ∗(k), Σ(k)). This implies that the number of samples needed per task to correctly recover the support union is almost the same as that of the single-task Lasso to recover each support set individually. Between these two extreme cases, tasks can have overlapping support sets with different overlapping levels, and the impact of these properties on the sample size for recovery of the support union is precisely captured by ψ(B∗, Σ(1:K)).

1.2 Comparison to Previous Results

The MVMR model (with differently distributed design matrices across tasks) can be viewed as a generalization of the multivariate model (with an identical design matrix across tasks) studied in [24]. It is thus interesting to compare our results to the results in [24]. For the scenario when the tasks share the same regression vector, it is shown in [24] that the major advantage of jointly solving a multi-task Lasso problem over solving each single-task Lasso problem individually is a reduction of the effective noise variance by the factor K. But the sample size needed per task for recovery of the support union via multi-task Lasso is the same as that needed for recovery of each support set individually via single-task Lasso. This implies that multi-task Lasso does not offer a benefit in reducing the sample size (in the order sense) for this case. Our result, on the other hand, shows that the benefit of multi-task Lasso in the sample size appears when the design matrices are differently distributed across tasks. For such a case, although the design matrices are different across the K tasks, we show that the same total sample size (i.e., as in the case with the same design matrix) is still sufficient for correct recovery of the support union. Hence, the sample size needed per task is reduced by K via multi-task Lasso compared to recovery of each support set individually via single-task Lasso. Consequently, our result is a nontrivial generalization of the result in [24]. For the scenario when the tasks have disjoint support sets, our result is consistent with the result in [24], which suggests that there is no advantage of performing multi-task Lasso as opposed to performing single-task Lasso for each task.

As mentioned before, the MVMR model was also studied in [22, 38], in which l1/l∞- and l1/l1 + l1/l∞-regularization were adopted for support union recovery, respectively. We study the same model but under l1/l2-regularization. More importantly, we characterize the gain in sample size due to multi-task Lasso. Furthermore, we characterize the conditions on the sample size for recovery of the support union as a threshold for the general model. This type of result was given in [22, 38] only for specific scenarios.

1.3 Relationship to Jointly Learning Multiple Markov Networks

One application of the MVMR linear regression model is the joint learning of multiple Gaussian Markov network structures. In this context, it solves a multi-task neighbor selection problem. This is also a natural scenario where the features and their distributions vary across the tasks of linear regression.

We consider K Gaussian Markov networks, each with p + 1 nodes represented by X(k)_1, . . . , X(k)_{p+1} for k = 1, . . . , K. The distribution of the Gaussian vector for graph k is given by N(0, Σ(k)_{p+1}), where Σ(k)_{p+1} ∈ R^{(p+1)×(p+1)}. Assume that for each graph there are n i.i.d. samples generated based on the joint distribution of the nodes. The objective is to estimate the connection relationship of the nodes based on the samples. We denote by X(k)_j ∈ R^n the column vector that collects the n samples of variable X(k)_j, for j = 1, . . . , p + 1 and k = 1, . . . , K. For each graph k and each node index a, the sample vector X(k)_a can be expressed as:

X(k)_a = X(k)_{−a} θ(k) + W(k)_a    (3)

where X(k)_{−a} is the n × p matrix that groups all column vectors X(k)_j for j ≠ a, θ(k) is a p-dimensional vector consisting of the parameters for estimating X(k)_a from the X(k)_j with j ≠ a, and W(k)_a is the n-dimensional Gaussian vector containing i.i.d. components with zero mean and variance given by

(σ(k)_W)^2 = Var(X_{1a}) − Cov(X_{1a}, X_{1,−a}) Cov^{−1}(X_{1,−a}) Cov(X_{1,−a}, X_{1a}).

It has been shown that the nonzero components of the vector θ(k) indicate the existence of edges between the corresponding nodes and node a in graph k. Hence, estimation of the support set of θ(k) provides an estimate of the graph structure, which is referred to as the neighbor selection problem [39].

Therefore, multi-task Lasso for the MVMR linear regression problem provides a useful approach for joint neighbor selection over K graphs. It is clear that in this case the design matrices X(k)_{−a} in general have different distributions across k, and hence the multi-feature model is well justified. We note that jointly learning multiple graphs has also been studied in [40, 41], which adopted a different objective function of the precision matrix Σ^{−1}. Via the MVMR linear regression model, we characterize threshold-based sufficient and necessary conditions for joint recovery of the graphs.

2 Problem Formulation and Notations

In this paper, we study the MVMR linear regression problem given by (2), which contains K linear regressions. Here, the design matrices X(1), . . . , X(K) and the noise vectors W(1), . . . , W(K) are Gaussian distributed, and are independent but not identical across k. For each task k, X(k) has independent and identically distributed (i.i.d.) row vectors, each being Gaussian with mean zero and covariance matrix Σ(k), and the noise vector W(k) has i.i.d. components, each being Gaussian with mean zero and variance (σ(k)_W)^2. We let σ_max^2 = max_{1≤k≤K} (σ(k)_W)^2.

In (2), θ∗(k) denotes the true regression vector for each task k. We define the support set of each θ∗(k) as S_k := {j ∈ {1, . . . , p} | θ∗(k)_j ≠ 0}. The support union over the K tasks is defined to be S := ∪_{k=1}^{K} S_k. In this paper, we are interested in estimating the support union jointly for the K tasks.

We adopt the l1/l2-regularized Lasso to jointly recover the support union for the MVMR linear regression model. More specifically, we solve the following multi-task Lasso problem, rewritten below:

min_{B ∈ R^{p×K}}  (1/2n) Σ_{k=1}^{K} ‖Y(k) − X(k) θ(k)‖_2^2 + λ_n ‖B‖_l1/l2    (4)

where B = [θ(1), . . . , θ(K)]. In this way, the K linear regression problems are coupled together via the regularization constraint. In this paper, we characterize conditions under which the solution to the above multi-task Lasso problem correctly recovers the support union of the true regression vectors for the K tasks.

2.1 Notations

We introduce some notation that we use in this paper. For a matrix A ∈ R^{p×K}, we define the la/lb block norm as

‖A‖_la/lb := ( Σ_{i=1}^{p} ( Σ_{j=1}^{K} |A_ij|^b )^{a/b} )^{1/a} .    (5)

We also define the operator norm for a matrix as

|||A|||_{a,b} := sup_{‖x‖_b = 1} ‖Ax‖_a .

In particular, we define the spectral norm as |||A|||_2 = |||A|||_{2,2} and the l∞-operator norm as |||A|||_∞ = |||A|||_{∞,∞}, which are special cases of the operator norm.
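For concreteness (our addition), these norms are straightforward to evaluate numerically:

import numpy as np

def block_norm(A, a=1, b=2):
    """l_a/l_b block norm of (5): the l_a norm of the vector of row-wise l_b norms."""
    row_norms = np.sum(np.abs(A) ** b, axis=1) ** (1.0 / b)
    return np.sum(row_norms ** a) ** (1.0 / a)

A = np.array([[1.0, -2.0], [0.0, 3.0]])
print(block_norm(A, a=1, b=2))        # ||A||_{l1/l2}
print(np.linalg.norm(A, 2))           # spectral norm |||A|||_2
print(np.abs(A).sum(axis=1).max())    # |||A|||_infty = maximum absolute row sum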

For the matrix B = [θ(1), . . . , θ(K)] that appears in (4), θ(k) denotes its kth column for k = 1, . . . , K. We further let B_i denote the ith row of B. Similarly, for B∗ = [θ∗(1), . . . , θ∗(K)], which contains the true regression vectors, the kth column is denoted by θ∗(k) and the ith row is denoted by B∗_i. We next define the normalized row vectors of B∗ as

Z∗_i = B∗_i / ‖B∗_i‖_l2   if B∗_i ≠ 0,   and   Z∗_i = 0   otherwise,

and define the matrix Z∗ to contain Z∗_i as its ith row for i = 1, . . . , p. To avoid confusion, we use B̂ to denote the solution to the multi-task Lasso problem (4).

The support union S(B) of a matrix B ∈ R^{p×K} is defined as S(B) = {i ∈ {1, . . . , p} | B_i ≠ 0}, which includes the indices of the nonzero rows of the matrix B. We use S to represent S(B∗) (i.e., the true support union) for convenience, and use Sc to denote the complement of the set S. We let s = |S| denote the size of the set S. For any matrix X(k) ∈ R^{n×p}, the matrix X(k)_S contains the columns of X(k) with column indices in the set S, and X(k)_Sc contains the columns of X(k) with column indices in the set Sc. Similarly, B∗_S and Z∗_S respectively contain the rows of B∗ and Z∗ with indices in S.

As each row of the matrix X(k) is Gaussian distributed as N(0, Σ(k)), we use Σ(k)_SS to denote the covariance matrix of each row of X(k)_S, and use Σ(k)_ScS to denote the cross covariance between the rows of X(k)_Sc and X(k)_S.

For convenience, we use Σ(1:K) to denote the set of matrices Σ(1), . . . , Σ(K). We also define the following functions of matrices Q(1:K) to simplify our notation:

ρ_u(Q(1:K)) := max_{j∈Sc} max_{1≤k≤K} Q(k)_jj ,

ρ_l(Q(1:K)) := min_{i,j∈Sc, j≠i} min_{1≤k≤K} [ Q(k)_jj + Q(k)_ii − 2 Q(k)_ji ] .

In particular, our results contain the functions ρ_u(Σ(1:K)_ScSc|S) and ρ_l(Σ(1:K)_ScSc|S), where Σ(k)_ScSc|S is the covariance matrix of each row of X(k)_Sc with X(1:K)_S given.

For the matrix B∗, we define b∗_min = min_{j∈S} ‖B∗_j‖_l2. We define the following function, which captures the sparsity of B∗ and the statistical properties of the design matrices X(1:K)_S:

ψ(B∗, Σ(1:K)) := max_{1≤k≤K} Z∗_Sk^T (Σ(k)_SS)^{−1} Z∗_Sk ,    (6)

where Z∗_Sk is the kth column of Z∗_S. We note that this definition of the function ψ(·) is different from that in the previous work [24]. Here, due to the different design matrices, ψ(·) depends on K quantities Z∗_Sk^T (Σ(k)_SS)^{−1} Z∗_Sk, each depending on one column vector Z∗_Sk.
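Definition (6) is straightforward to evaluate for a given problem instance; a small sketch of our own (the rows of B∗ indexed by S are nonzero by the definition of S):

import numpy as np

def psi(B_star, Sigma_list, S):
    """Evaluate psi(B*, Sigma(1:K)) of (6) for a support union S (array of row indices)."""
    B_S = B_star[S, :]
    Z_S = B_S / np.linalg.norm(B_S, axis=1, keepdims=True)   # normalized rows, i.e., Z*_S
    vals = []
    for k, Sigma in enumerate(Sigma_list):
        Sigma_SS = Sigma[np.ix_(S, S)]
        z = Z_S[:, k]                                         # kth column of Z*_S
        vals.append(z @ np.linalg.solve(Sigma_SS, z))         # z^T (Sigma_SS)^{-1} z
    return max(vals)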

3 Main Results

In this section, we provide our main results on using the multi-task Lasso to recover the support union for the MVMR linear regression model. Our results contain two parts: one is the achievability, i.e., sufficient conditions for the l1/l2-regularized Lasso to recover the support union; and the other is the converse, i.e., conditions under which the l1/l2-regularized Lasso fails to recover the support union. We then discuss implications of our results by considering a few representative scenarios, and compare our results with those for the multivariate linear regression with an identical design matrix across tasks.

3.1 Achievability and Converse

We first introduce a number of conditions on the covariance matrices Σ(k) for k = 1, . . . , K, which are useful for the statements of our results.

(C1). There exists a real number γ ∈ (0, 1] such that |||A|||_∞ ≤ 1 − γ, where A_js = max_{1≤k≤K} |( Σ(k)_ScS (Σ(k)_SS)^{−1} )_js| for j ∈ Sc and s ∈ S.

(C2). There exist constants 0 < C_min ≤ C_max < +∞ such that all eigenvalues of the matrix Σ(k)_SS are contained in the interval [C_min, C_max] for k = 1, . . . , K.

(C3). There exists a constant D_max < +∞ such that max_{1≤k≤K} |||(Σ(k)_SS)^{−1}|||_∞ ≤ D_max.
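For a given set of covariance matrices, the quantities appearing in (C1)-(C3) can be checked numerically; a sketch of our own:

import numpy as np

def condition_quantities(Sigma_list, S):
    """Evaluate the quantities of conditions (C1)-(C3) for covariances Sigma(1:K) and support S."""
    p = Sigma_list[0].shape[0]
    Sc = np.setdiff1d(np.arange(p), S)
    A = np.zeros((len(Sc), len(S)))
    eigvals, dmax_terms = [], []
    for Sigma in Sigma_list:
        Sigma_SS_inv = np.linalg.inv(Sigma[np.ix_(S, S)])
        A = np.maximum(A, np.abs(Sigma[np.ix_(Sc, S)] @ Sigma_SS_inv))   # entrywise max over k
        eigvals.extend(np.linalg.eigvalsh(Sigma[np.ix_(S, S)]))
        dmax_terms.append(np.abs(Sigma_SS_inv).sum(axis=1).max())        # |||(Sigma_SS)^{-1}|||_infty
    return {"C1": A.sum(axis=1).max(),           # must be at most 1 - gamma for some gamma in (0, 1]
            "C2": (min(eigvals), max(eigvals)),  # eigenvalue range [C_min, C_max]
            "C3": max(dmax_terms)}               # D_max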

In this paper, we consider the asymptotic regime, in which p → ∞, s → ∞, and log(p − s) → +∞. In such a regime, we introduce the following conditions on the regularization parameter and the sample size n:

(P1). The regularization parameter λ_n = √( f(p) log p / n ), where the function f(p) is chosen such that f(p) → +∞ as p → +∞, and f(p) log p / n → 0 as n → ∞, i.e., λ_n → 0 as n → +∞.

(P2). Define ρ(n, s, λ_n) as

ρ(n, s, λ_n) := √( 8 σ_max^2 s log s / (n C_min) ) + λ_n ( D_max + 12 s / (C_min √n) )

and require ρ(n, s, λ_n) / b∗_min = o(1).

The following theorem characterizes sufficient conditions for recovery of the support union via the multi-task Lasso.

Theorem 1. Consider the MVMR problem in the asymptotic regime, in which p → ∞, s → ∞ and log(p − s) → ∞. We assume that the parameters (n, p, s, B∗, Σ(1:K)) satisfy the conditions (C1)-(C3) and (P1)-(P2). If for some small constant v > 0,

n > 2(1 + v) ψ(B∗, Σ(1:K)) log(p − s) ρ_u(Σ(1:K)_ScSc|S) / γ^2 ,    (7)

then the problem (4) has a unique solution B̂, the support union S(B̂) is the same as the true support union S(B∗), and ‖B̂ − B∗‖_l∞/l2 = o(b∗_min), with probability greater than

1 − K exp(−c_0 log s) − exp(−c_1 log(p − s))    (8)

where c_0 and c_1 are constants.

Theorem 1 provides sufficient conditions on the sample size such that the solution to the multi-task Lasso problem correctly recovers the support union of the MVMR linear regression model. We next provide a theorem on the conditions on the sample size under which the solution to the multi-task Lasso problem fails to recover the support union.

Theorem 2. Consider the MVMR problem in the asymptotic regime, in which p → ∞, s → ∞ and log(p − s) → ∞. We assume that the parameters (n, p, s, B∗, Σ(1:K)) satisfy the conditions (C1)-(C2) and the conditions s/n = o(1) and 1/(λ_n^2 s) → 0. If for some small constant v > 0,

n < 2(1 − v) ψ(B∗, Σ(1:K)) log(p − s) ρ_l(Σ(1:K)_ScSc|S) / (2 − γ)^2 ,    (9)

then with probability greater than

1 − exp(−c_2 s) − c_3 exp(−c_4 n/s)    (10)

for some positive constants c_2, c_3 and c_4, no solution B̂ to the multi-task Lasso problem recovers the true support union and achieves ‖B̂ − B∗‖_l∞/l2 = o(b∗_min).

The proofs of Theorems 1 and 2 adapt the techniques developed by Obozinski, Wainwright, and Jordan in [24], but involve novel development to deal with the differently distributed design matrices across tasks.

Combining Theorems 1 and 2, we see that the quantity ψ(B∗, Σ(1:K)) log(p − s) serves as a threshold on the sample size n, which is tight in the order sense. When the sample size is above the threshold, the multi-task Lasso recovers the true support union, and when the sample size is below the threshold, the multi-task Lasso fails to recover the true support union. The following proposition provides bounds on the scaling behavior of the function ψ(B∗, Σ(1:K)) in the asymptotic regime.

Proposition 1. Consider the MVMR linear regression model with the regression matrix B∗ and the covariance matrices Σ(1:K) satisfying the condition (C2). Then the function ψ(B∗, Σ(1:K)) is bounded as

s / (K C_min) ≤ ψ(B∗, Σ(1:K)) ≤ s / C_min .

In the next subsection, we explore the properties of the quantity ψ(B∗, Σ(1:K)) in order to understand the impact of the sparsity of B∗ and of the covariance matrices Σ(1:K) on the conditions for recovering the support union.

3.2 Implications

The quantity ψ(B∗, Σ(1:K)) captures the sparsity of B∗ and the statistical properties of the design matrices (i.e., Σ(1:K)), and hence plays an important role in determining the conditions on the sample size for recovery of the support union, as shown in Theorems 1 and 2. In this section, we analyze ψ(B∗, Σ(1:K)) for a number of representative cases in order to understand the advantages of multi-task Lasso, which solves multiple linear regression problems jointly, over single-task Lasso, which solves each linear regression problem individually.

We denote by ψ(θ∗(k), Σ(k)) the corresponding function for a single linear regression problem, where θ∗(k) represents the kth column of B∗. It is clear that ψ(B∗, Σ(1:K)) captures the threshold on the sample size for the multi-task Lasso problem. Comparison of ψ(B∗, Σ(1:K)) and ψ(θ∗(k), Σ(k)) thus provides a comparison between multi-task Lasso and single-task Lasso in terms of the number of samples needed for recovery of the support union/set. We explicitly express ψ(B∗, Σ(1:K)) and ψ(θ∗(k), Σ(k)) as follows:

ψ(B∗, Σ(1:K)) = max_{1≤k≤K} Σ_{i∈S} Σ_{j∈S} ( B∗_ik B∗_jk / (‖B∗_i‖_l2 ‖B∗_j‖_l2) ) ((Σ(k)_SS)^{−1})_ij    (11)

ψ(θ∗(k), Σ(k)) = Σ_{i∈S} Σ_{j∈S} ( θ∗(k)_i θ∗(k)_j / (|θ∗(k)_i| |θ∗(k)_j|) ) ((Σ(k)_SS)^{−1})_ij    (12)

where B∗_ik denotes the (i, k)th entry of the matrix B∗ and θ∗(k)_i denotes the ith entry of the vector θ∗(k).

We first study the scenario in which all K tasks have the same regression vectors, and hence the same support sets.

Corollary 1. (Identical Regression Vectors) If B∗ has identical column vectors, i.e., θ∗(k) = θ∗ for k = 1, . . . , K, then

ψ(B∗, Σ(1:K)) = (1/K) max_{1≤k≤K} ψ(θ∗, Σ(k)).    (13)
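To see where the factor 1/K comes from, here is a short verification of our own (not from the paper). When all columns of B∗ equal θ∗, each nonzero row of B∗ is θ∗_i 1_K^T, so the normalized row is Z∗_i = (sign(θ∗_i)/√K) 1_K^T. Every column of Z∗_S therefore equals z/√K with z = sign(θ∗_S), and by (6) and (12),

Z∗_Sk^T (Σ(k)_SS)^{−1} Z∗_Sk = (1/K) z^T (Σ(k)_SS)^{−1} z = (1/K) ψ(θ∗, Σ(k));

taking the maximum over k yields (13).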

Remark 1. Corollary 1 implies that the number of samples per task needed to correctly recover the support union via multi-task Lasso is reduced by a factor of K compared to single-task Lasso that recovers each support set individually.

It can be seen that although the K tasks involve design matrices that have different covariances, as long as the dependence of the output variables on the feature variables is the same for all tasks, the tasks share samples in multi-task Lasso to recover the support union, so that the sample size needed per task is reduced by a factor of K. Hence, there is a significant advantage of grouping tasks with similar regression vectors together for multi-task learning.


Corollary 1 can be viewed as a generalization of the result in [24], in which the design matrices for the tasks are the same. The result in [24] suggests that if the tasks share the same regression vector, there is no benefit in terms of the number of samples needed for support recovery using multi-task Lasso compared to single-task Lasso. Our result suggests that the benefit of multi-task Lasso in fact appears when the design matrices are differently distributed. For such a case, we show that the total number of samples needed for recovery of the support union is not increased compared to the case with the same design matrix (studied in [24]). Hence, the sample size needed per design matrix (i.e., per task) is reduced by the factor K. Moreover, compared to recovery of each support set individually via single-task Lasso, multi-task Lasso also reduces the sample size per task by the factor K. However, such an advantage does not appear if the K tasks have the same design matrix and regression vectors, as in [24].

We next study a more general case in which the regression vectors are also different across tasks (but the support sets of the tasks are the same), in addition to the design matrices varying across tasks.

Corollary 2. (Varying Regression Vectors with Same Supports) Suppose all entries B∗_jk > 0 for j ∈ S and k = 1, . . . , K, and all coefficients are bounded, i.e., B_k − ∆_k ≤ B∗_jk ≤ B_k + ∆_k, where ∆_k > 0 is a small perturbation constant with B_k > ∆_k. Then,

ψ(B∗, Σ(1:K)) / max_{1≤k≤K} ψ(θ∗(k), Σ(k)) ≤ (1/K) max_{1≤k≤K} (B_k + ∆_k)^2 / (B_k − ∆_k)^2 .

Corollary 2 is a strengthened version of Corollary 1, in that Corollary 2 allows both the regression vectors and the design matrices to be different across tasks and still shows that the number of samples needed is reduced by a factor of K compared to single-task Lasso, as long as the support sets across tasks are the same.

Corollary 3. (Disjoint Support Sets) Suppose the distributions of all design matrices are the same, i.e., Σ(k) = Σ and Σ(k)_SS = Σ_SS for k = 1, . . . , K, and suppose that the support sets S_k of all tasks are disjoint. Let s_k = |S_k|, and hence s = Σ_{k=1}^{K} s_k. Then,

ψ(B∗, Σ(1:K)) = max_{1≤k≤K} ψ(θ∗(k), Σ(k)).

We note that

max_{1≤k≤K} ψ(θ∗(k), Σ(k)) log(p − s) ≤ max_{1≤k≤K} ψ(θ∗(k), Σ(k)) log(p − s_k).

Since the number of samples needed per task for multi-task Lasso is proportional to max_{1≤k≤K} ψ(θ∗(k), Σ(k)) log(p − s), and the number of samples needed by single-task Lasso for task k is proportional to ψ(θ∗(k), Σ(k)) log(p − s_k), the above inequality implies that the required number of samples for multi-task Lasso is smaller than (in fact almost the same as) that for single-task Lasso.

Corollary 3 suggests that if the tasks have disjoint support sets for the regression vectors, the advantage of the multi-task Lasso disappears. This is reasonable because the tasks do not benefit from sharing samples for recovering their supports if the supports are disjoint. The essential message of Corollary 3 should not change if the tasks have different design matrices and/or different regression vectors; the critical assumption in Corollary 3 is the disjointness of the support sets.

Corollaries 1 and 3 provide two extreme cases, in which the tasks share the same support sets and have disjoint support sets, respectively. The number of samples needed per task for recovery of the support union ranges from 1/K of the sample size needed for single-task Lasso to the same sample size. It is conceivable that between these two extreme cases, tasks may have overlapping support sets with various overlapping levels. Correspondingly, the number of samples needed for recovering the support union depends on the overlapping level of the support sets and is captured precisely by the quantity ψ(B∗, Σ(1:K)). We demonstrate such behavior via our numerical results in the next section.

4 Numerical Results

In this section, we provide numerical simulations to demonstrate our theoretical results on using block-regularized multi-task Lasso for recovery of the support union for the MVMR linear regression model. We study how the sample size needed for correct recovery of the support union depends on the sparsity of the regression vectors, on the distributions of the design matrices, and on the number of tasks.

We first study the scenario considered in Corollary 1, in which the K tasks have the same regression vectors, i.e., B∗ = θ∗ 1_K^T. We set θ∗ = (1/√K) 1_S, where S is the common support set across the K tasks. We set the covariance matrix Σ(k) to be different across the K tasks as follows. For k = 1, . . . , K, we set Cov(X_a, X_b) > 0 (where a, b ∈ {1, 2, . . . , p}) if a = b ± 1, and Cov(X_a, X_b) = 0 otherwise; Cov(X_a, X_b) = 1 + 1/k if a = b ± 1 and a is odd; and Cov(X_a, X_b) = 1 − 0.8/k if a = b ± 1 and a is even. The sparsity of the linear regression vectors is linearly proportional to the dimension p, i.e., s = αp, with the parameter α controlling the sparsity of the model. We set α = 1/8 and choose the dimension p = 128, 256, 512. We solve the multi-task Lasso for recovery of the support union for K = 2, 4, 6, 8, and set the regularization parameter λ_n = 3.5 × √(log(p − s) log s / n).


Figure 1: Impact of number of tasks on the sample size for scenarios with identical regression vectors and varying distributions for design matrices across tasks. (Each panel, p = 128, 256, 512, plots the probability of correct recovery versus the scaled sample size n/[2s log(p − s)] for K = 2, 4, 6, 8.)

Figure 2: Impact of overlapping levels of support sets on the sample size with the same regression values for overlapping entries and identical distributions for design matrices across tasks. (Each panel, p = 128, 256, 512, plots the probability of correct recovery versus the scaled sample size n/[2s log(p − s)] for the identical, overlapping, and disjoint support-set models.)

Fig. 1 plots the probability of correct recovery of the support union as a function of the scaled sample size for p = 128, 256, 512. It can be seen that the sample size for guaranteeing correct recovery scales in the order of s log(p − s) for all plots. Moreover, as the number of tasks K increases, the sample size (per task) needed for correct recovery decreases inversely proportionally with K, which is consistent with Corollary 1. These results demonstrate that when the regression vectors are the same across tasks, multi-task Lasso has a great advantage compared to single-task Lasso in terms of the reduction in the sample size needed per task.

We next study how the overlapping level of the support sets across tasks affects the sample size for correct recovery of the support union. We set K = 2, i.e., two tasks, and study three overlapping models for the two tasks: (1) the same support sets, S_1 = S_2 = {j ≤ p : j = 8t + 1}, where t ≥ 0 is an integer; (2) disjoint support sets with S_1 ∩ S_2 = ∅, in which S_1 = {j ≤ p : j = 16t + 1} and S_2 = {j ≤ p : j = 16t + 2}; (3) overlapping support sets, in which S_1 = {j : j = 24t + 1 or j = 24t + 2} and S_2 = {j : j = 24t + 2 or j = 24t + 3}. We choose the linear sparsity model with α = 1/8. We set p = 128, 256, 512, and Σ(k) = I_p for k = 1 and 2. We also set λ_n = 3.5 × √(log(p − s) log s / n).
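For reference (our own sketch, using 0-based indexing), the three support-set configurations can be generated as follows:

import numpy as np

def support_sets(p, model):
    """Support sets S1, S2 for the three K = 2 configurations (0-based indices)."""
    if model == "identical":
        S1 = S2 = np.arange(0, p, 8)
    elif model == "disjoint":
        t = np.arange(p // 16)
        S1, S2 = 16 * t, 16 * t + 1
    else:                                   # "overlap"
        t = np.arange(p // 24)
        S1 = np.union1d(24 * t, 24 * t + 1)
        S2 = np.union1d(24 * t + 1, 24 * t + 2)
    return S1, S2

S1, S2 = support_sets(96, "overlap")
print(len(np.union1d(S1, S2)))              # size of the support union, here 96/8 = 12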

Fig. 2 compares the probability of correct recovery of the support union as a function of the scaled sample size for the three overlapping models. It can be seen that the model with the same support sets requires the smallest sample size, and the model with disjoint support sets requires the largest sample size. The model with overlapping support sets needs a sample size between those of the two extreme models. This is reasonable because as the support sets overlap more, the tasks share more information in their samples for support recovery and hence need fewer samples for correct recovery.

5 Conclusions

In this paper, we have investigated the Gaussian MVMR linear regression model. We have characterized sufficient and necessary conditions under which the multi-task Lasso guarantees successful recovery of the support union of the K linear regression vectors. The two conditions are characterized by a threshold and hence are tight in the order sense. Our numerical results have demonstrated the advantage of joint recovery of the support union compared to using single-task Lasso to recover the support set of each task individually. Further study of the MVMR model under other block constraints is an interesting topic for future work. Applications of the approach to structure learning problems based on real data sets, such as social network data, are also of interest.

Acknowledgements

The work of W. Wang and Y. Liang was supported by a National Science Foundation CAREER Award under Grant CCF-10-26565 and by the National Science Foundation under Grant CCF-10-26566.


References

[1] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

[2] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 58:267–288, 1996.

[3] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

[4] D. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory, 47(7):2845–2862, 2001.

[5] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.

[6] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.

[7] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, Now Publishers, Hanover, MA, USA, 2012.

[8] M. Elad and A. M. Bruckstein. A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Trans. Inform. Theory, 48(9):2558–2567, 2002.

[9] A. Feuer and A. Nemirovski. On sparse representation in pairs of bases. IEEE Trans. Inform. Theory, 49(6):1579–1581, 2003.

[10] D. M. Malioutov, M. Cetin, and A. S. Willsky. Optimal sparse representations in general overcomplete bases. In Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Montreal, Canada, 2004.

[11] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theory, 50(10):2231–2242, 2004.

[12] J. J. Fuchs. Recovery of exact sparse representations in the presence of noise. IEEE Trans. Inform. Theory, 51(10):3601–3608, 2005.

[13] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2567, 2006.

[14] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2009.

[15] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory, 55(5):2183–2202, 2009.

[16] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.

[17] L. Meier, S. Van De Geer, and P. Buhlmann. The group Lasso for logistic regression. Journal of the Royal Statistical Society, 70:53–71, 2008.

[18] E. Yang, P. Ravikumar, G. Allen, and Z. Liu. Graphical models via generalized linear models. In Advances in Neural Information Processing Systems (NIPS), volume 25, pages 1367–1375, 2012.

[19] S. M. Kakade, O. Shamir, K. Sridharan, and A. Tewari. Learning exponential families in high-dimensions: Strong convexity and sparsity. In AI and Statistics, volume 9, pages 381–388, 2010.

[20] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Roy. Statist. Soc. Ser. B, 68:49–67, 2006.

[21] J. Huang and T. Zhang. The benefit of group sparsity. Annals of Statistics, 38(4):1978–2004, 2010.

[22] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems (NIPS), 2010.

[23] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. To appear in Statistical Science, 2012.

[24] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1):1–47, 2011.

[25] H. Liu and J. Zhang. On the l1 − lq regularized regression. arXiv:0802.1517v1, 2008.

[26] B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.

[27] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.

[28] M. Kolar, J. Lafferty, and L. Wasserman. Union support recovery in multi-task learning. Journal of Machine Learning Research, 12:2415–2435, 2011.

[29] M. Pontil, A. Argyriou, and T. Evgeniou. Multi-task feature learning. In Advances in Neural Information Processing Systems (NIPS), 2007.

[30] F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008.

[31] L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In International Conference on Machine Learning (ICML), 2009.

[32] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society, 67:91–108, 2005.

[33] Fused sparsity and robust estimation for linear models with unknown variance. In Advances in Neural Information Processing Systems (NIPS), volume 25, pages 1268–1276, 2012.

[34] Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica, 16:375–390, 2006.

[35] C. H. Zhang and J. Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4):1576–1594, 2008.

[36] M. Yuan, A. Ekici, Z. Lu, and R. Monteiro. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society, 69(3):329–346, 2007.

[37] K. Lounici, M. Pontil, S. van de Geer, and A. B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. Annals of Statistics, 39:2164–2204, 2011.

[38] S. Negahban and M. J. Wainwright. Simultaneous support recovery in high dimensions: Benefits and perils of block l1/l∞-regularization. IEEE Trans. Inform. Theory, 57(6):3841–3863, 2011.

[39] N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006.

[40] P. Danaher, P. Wang, and D. M. Witten. The joint graphical Lasso for inverse covariance estimation across multiple classes. Under review, 2011.

[41] J. Guo, E. Levina, G. Michailidis, and J. Zhu. Joint estimation of multiple graphical models. Biometrika, 98(1):1–15, 2011.