A RANDOM COORDINATE DESCENT METHOD ON LARGE-SCALE OPTIMIZATION PROBLEMS WITH LINEAR CONSTRAINTS

I. NECOARA*, Y. NESTEROV† AND F. GLINEUR†

* Automatic Control and Systems Engineering Department, University Politehnica Bucharest, 060042 Bucharest, Romania ([email protected]).
† Center for Operations Research and Econometrics, Catholic University of Louvain, Louvain-la-Neuve, B-1348, Belgium ({Yurii.Nesterov, Francois.Glineur}@uclouvain.be).

Abstract. In this paper we develop a random block coordinate descent method for minimizing large-scale convex problems with linearly coupled constraints and prove that it obtains in expectation an ϵ-accurate solution in at most O(1/ϵ) iterations. Moreover, the numerical complexity per iteration of the new method is usually much cheaper than that of methods based on full gradient information. We focus on how to choose the probabilities that make the randomized algorithm converge as fast as possible and show that this problem can be recast as a sparse SDP. An analysis of the rate of convergence in probability is also provided. For strongly convex functions the new method converges linearly. Numerical tests confirm that on large optimization problems with cheap coordinate derivatives the new method is much more efficient than methods based on the full gradient.

Key words. Large convex optimization problems, Lipschitz continuous gradient functions, linearly coupled constraints, block-coordinate descent method, random algorithm, connected graphs.

1. Introduction. The performance of a network composed of interconnected subsystems can be increased if the traditionally separated subsystems are jointly optimized. Recently, parallel and distributed optimization methods have emerged as a powerful tool for solving large network optimization problems: e.g. resource allocation [7, 8, 20], telecommunications [1, 13], coordination in multi-agent systems [20], estimation in sensor networks [14], distributed control [14], image processing [4], traffic equilibrium problems [1], network flow [1] and other areas [10, 16, 22].

The goal of this paper is to develop an efficient distributed algorithm with cheap iterations for solving large separable convex problems with linearly coupled constraints that arise in network applications. The problems that we consider in this paper have the following features: the size of the data is very large, so that usual methods based on whole gradient computations are prohibitive. Moreover, the incomplete structure of information (e.g. the data are distributed over all the nodes of the network, so that at a given time we need to work only with the data available then) may also be an obstacle for whole gradient computations. In this case, an appropriate way to approach these problems is through coordinate descent methods. These methods were among the first optimization methods studied in the literature [1] but until recently they had not received much attention.

The main differences among the variants of coordinate descent methods lie in the criterion for choosing at each iteration the coordinate over which we minimize the objective function and in the complexity of this choice. Two classical criteria often used in these algorithms are the cyclic [1] and the greedy descent coordinate search [19], which differ significantly in the amount of computation required to choose the appropriate index. For cyclic coordinate search, estimates on the rate of convergence were given recently in [3], while for the greedy coordinate search (e.g. the Gauss-Southwell rule) the convergence rate is given e.g. in [19]. Another interesting approach is based on a random choice rule, where the coordinate search is random. Recent complexity results on random coordinate descent methods for smooth convex functions were obtained by Nesterov in [12]. The extension to composite functions was given in [17]. However, in most of the previous work the authors considered optimization models where the constraint set is decoupled (i.e. characterized by a Cartesian product).

In this paper we develop a random block coordinate descent method suited for large optimization problems in networks where the information cannot be gathered centrally, but rather is distributed over the network. Moreover, we focus on optimization problems with linearly coupled constraints (i.e. the constraint set is coupled). Due to the coupling in the constraints we introduce a 2-block variant of the random coordinate descent method, which involves at each iteration the closed form solution of an optimization problem with respect to only two block variables, while keeping all the other variables fixed. We prove for the new algorithm an expected convergence rate of order O(1/k) for the function values, where k is the number of iterations. We focus on how to design the probabilities that make this algorithm converge as fast as possible and we prove that this problem can be recast as a sparse SDP. We also show that for functions with cheap coordinate derivatives the new method is faster than schemes based on full gradient information or based on greedy coordinate descent. An analysis of the rate of convergence in probability is also provided. For strongly convex functions we prove that the new method converges linearly. While the most obvious benefit of randomization is that it can lead to faster algorithms, either in worst-case complexity analysis and/or in numerical implementation, there are also other benefits of our algorithm that are at least as important. For example, the use of randomization leads to a simpler algorithm that is easier to analyze, produces a more robust output and can often be organized to exploit modern computational architectures (e.g. distributed and parallel computer architectures).

The paper is organized as follows. In Section 2 we introduce the optimization model analyzed in this paper and the main assumptions. In Section 3 we present and analyze a random 2-block coordinate descent method for solving our optimization problem, derive the convergence rate in expectation and in probability and provide means to choose the probability distribution. We also compare our algorithm with the full projected gradient method and other existing methods and show that on problems with cheap coordinate derivatives our method has better arithmetic complexity. In Section 4 we extend our algorithm to more than a pair of indexes. We also provide extensive numerical tests to demonstrate the efficiency of the proposed algorithm.

2. Problem formulation. We work in the space R^n composed of column vectors. For x, y ∈ R^n we denote the standard Euclidean inner product ⟨x, y⟩ = ∑_{i=1}^n xi yi and the Euclidean norm ∥x∥ = ⟨x, x⟩^{1/2}. We use the same notation ⟨·, ·⟩ and ∥ · ∥ for spaces of different dimension. The inner product on the space of symmetric matrices is denoted by ⟨W1, W2⟩_1 = trace(W1 W2) for all symmetric matrices W1, W2. We decompose the full space R^{Nn} = Π_{i=1}^N R^n. We also define the corresponding partition of the identity matrix: I_{Nn} = [U1 · · · UN], where Ui ∈ R^{Nn×n}. Then for any x ∈ R^{Nn} we write x = ∑_i Ui xi. We denote by e ∈ R^N the vector with all entries equal to 1 and by ei ∈ R^N the vector with all entries zero except component i, which equals 1. Furthermore, we define U = [In · · · In] = e^T ⊗ In ∈ R^{n×Nn} and Vi = [0 · · · Ui · · · 0] = ei^T ⊗ Ui ∈ R^{Nn×Nn}, where ⊗ denotes the Kronecker product. Given a vector ν = [ν1 · · · νN]^T ∈ R^N, we define the vector ν^p = [ν1^p · · · νN^p]^T for any integer p, and diag(ν) denotes the diagonal matrix with the entries νi on the diagonal. For a positive semidefinite matrix W ∈ R^{N×N} we order its eigenvalues as 0 ≤ λ1 ≤ λ2 ≤ · · · ≤ λN and use the notation ∥x∥²_W = x^T W x for any x.


We consider large network optimization problems where each agent in the network is associated with a local variable such that their sum is fixed, and we need to minimize a separable convex objective function:

(2.1)  f∗ = min_{xi ∈ R^n}  f1(x1) + · · · + fN(xN)   s.t.: x1 + · · · + xN = 0.

Optimization problems with linearly coupled constraints (2.1) arise in many areas such as resource allocation in economic systems [8] or distributed computer systems [9], in signal processing [4], in traffic equilibrium and network flow [1] or distributed control [14]. To problem (2.1) we associate a network composed of several nodes V = {1, · · · , N} that can exchange information according to a communication graph G = (V, E), where E denotes the set of edges, i.e. (i, j) ∈ E ⊆ V × V models that node j sends information to node i. We assume that the graph G is undirected and connected. The local information structure imposed by the graph G should be considered as part of the problem formulation. Note that constraints of the form α1 x1 + · · · + αN xN = b, where αi ∈ R, can be easily handled in our framework by a change of coordinates.

The goal of this paper is to devise a distributed algorithm that iteratively solves the convex problem (2.1) by passing the estimate of the optimizer only between neighboring nodes. There is great interest in designing such distributed algorithms, since centralized algorithms scale poorly with the number of nodes and are less resilient to failure of the central node. Let us define the extended subspace:

S = { x ∈ R^{Nn} : ∑_{i=1}^N xi = 0 },

whose orthogonal complement is the subspace T = {u ∈ R^{Nn} : u1 = · · · = uN}. We also use the notation:

x = [x1^T · · · xN^T]^T = ∑_{i=1}^N Ui xi ∈ R^{Nn},   f(x) = f1(x1) + · · · + fN(xN).

The basic assumption considered in this paper is the following:

Assumption A.1. We assume that the functions fi are convex and have Lipschitz continuous gradient, with Lipschitz constants Li > 0, i.e.:

(2.2)  ∥∇fi(xi) − ∇fi(yi)∥ ≤ Li ∥xi − yi∥  ∀xi, yi ∈ R^n, i ∈ V.

From the Lipschitz property of the gradient (2.2), the following inequality holds (see e.g. Section 2 in [11]):

(2.3)  fi(xi + di) ≤ fi(xi) + ⟨∇fi(xi), di⟩ + (Li/2)∥di∥²  ∀xi, di ∈ R^n.

The following inequality, which is central in our paper, is a straightforward consequence of (2.3) and holds for all x ∈ R^{Nn} and di, dj ∈ R^n:

(2.4)  f(x + Ui di + Uj dj) ≤ f(x) + ⟨∇fi(xi), di⟩ + (Li/2)∥di∥² + ⟨∇fj(xj), dj⟩ + (Lj/2)∥dj∥².


We denote by X∗ the set of optimal solutions of problem (2.1). The optimality conditions for the optimization problem (2.1) become: x∗ is an optimal solution of the convex problem (2.1) if and only if

∑_{i=1}^N x∗i = 0,   ∇fi(x∗i) = ∇fj(x∗j)  ∀i ≠ j ∈ V.

2.1. Previous work. We briefly review some well-known methods from the literature for solving the optimization model (2.1). In [7, 20] distributed weighted gradient methods were proposed to solve a problem similar to (2.1); in particular, the authors in [20] consider strongly convex functions fi with positive definite Hessians. These papers propose a class of center-free algorithms (in these papers the term center-free refers to the absence of a coordinator) with the following iteration:

(2.5)  x_i^{k+1} = x_i^k − ∑_{j∈Ni} wij (∇fj(x_j^k) − ∇fi(x_i^k))  ∀i ∈ V, k ≥ 0,

where Ni denotes the set of neighbors of node i in the graph G. Under the strong convexity assumption and provided that the weights wij are chosen as a solution of a certain SDP, a linear rate of convergence is obtained. Note, however, that this method requires at each iteration the computation of the full gradient and the iteration complexity is O(N(n + nf)), where nf is the number of operations for evaluating the gradient of any function fi at the current iterate for all i ∈ V.

In [19] Tseng studied optimization problems with linearly coupled constraints and composite objective functions of the form f + h, where h is a convex nonsmooth function, and developed a block coordinate descent method based on the Gauss-Southwell choice rule. The principal requirement for this method is that at each iteration a subset of indexes I is chosen with respect to the Gauss-Southwell rule and then the update direction is a solution of the following QP problem:

dH(x; I) = arg min_{s: ∑_{j∈I} sj = 0}  ⟨∇f(x), s⟩ + (1/2)∥s∥²_H + h(x + s),

where H is a positive definite matrix chosen at the initial step of the algorithm. Using this direction and choosing an appropriate step size αk, the next iterate is defined as x^{k+1} = x^k + αk dH(x^k; Ik). The total complexity per iteration of this method is O(Nn + nf). In [19], the authors proved, for the particular case of a single linear constraint and a piecewise linear and separable nonsmooth part h of the objective function, that after k iterations a sublinear convergence rate of order O(N n L̄ R0²/k) is attained for the function values, where L̄ = max_{i∈V} Li and R0 is the Euclidean distance of the starting iterate to the set of optimal solutions. In [2] a 2-coordinate descent method is developed for minimizing a smooth function subject to a single linear equality constraint and additional bound constraints on the decision variables. In the convex case, when all the variables are lower bounded but not upper bounded, the author shows that the sequence of function values converges at a sublinear rate O(N n L̄ R0²/k), while the complexity per iteration is at least O(Nn + nf).

A random coordinate descent algorithm for an optimization model with smooth objective function and separable constraints was analyzed by Nesterov in [12], where a complete rate analysis is provided. The main feature of his randomized algorithm is the cheap iteration complexity of order O(nf + n + ln N), while still keeping a sublinear rate of convergence. The generalization of this algorithm to composite objective functions has been studied in [16, 17]. However, none of these papers studied the application of random coordinate descent algorithms to smooth convex problems with linearly coupled constraints. In this paper we develop a random coordinate descent method for this type of optimization model, as described in (2.1).

3. A random block coordinate descent method. In this section we devise a randomized block coordinate descent algorithm for solving the separable convex problem (2.1) and analyze its convergence. We present a distributed method where only neighbors need to communicate with each other. At a certain iteration, having a feasible estimate x ∈ S of the optimizer, we choose randomly a pair (i, j) ∈ E with probability pij > 0. Since we assume an undirected graph G = (V, E) associated to problem (2.1) (the generalization of the present scheme to directed graphs is straightforward), we consider pij = pji. We assume that the graph G is connected. For a feasible x ∈ S and a randomly chosen pair of indexes (i, j), with i < j, we define the next feasible iterate x+ ∈ R^{Nn} as follows:

x+ = x + Ui di + Uj dj.

Derivation of the directions di and dj is based on the inequality (2.4):

(3.1)  f(x+) ≤ f(x) + ⟨∇fi(xi), di⟩ + ⟨∇fj(xj), dj⟩ + (Li/2)∥di∥² + (Lj/2)∥dj∥².

Minimizing the right hand side of inequality (3.1), but additionally imposing feasibility of the next iterate x+_{(i,j)} (i.e. we require di + dj = 0), we arrive at the following local minimization problem:

[di^T dj^T]^T = arg min_{si, sj ∈ R^n: si + sj = 0}  ⟨∇fi(xi), si⟩ + ⟨∇fj(xj), sj⟩ + (Li/2)∥si∥² + (Lj/2)∥sj∥²,

which has the closed form solution

(3.2)  di = − (1/(Li + Lj)) (∇fi(xi) − ∇fj(xj)),   dj = −di.

We also obtain from (2.4) the following decrease in the objective function, which shows that our method is a descent method:

(3.3)  f(x+_{(i,j)}) ≤ f(x) − (1/(2(Li + Lj))) ∥∇fi(xi) − ∇fj(xj)∥².

Now, let the starting point x^0 be feasible for our problem (2.1) and assume that some probability distribution (pij)_{(i,j)∈E} is available over the undirected graph G; then we can present the new random coordinate descent method:

Algorithm (RCD): (Random 2-Block Coordinate Descent Method)
For k ≥ 0 iterate:
1. Choose (ik, jk) ∈ E with probability p_{ik jk}
2. Update x^{k+1} = x^k − (1/(L_{ik} + L_{jk})) (U_{ik} − U_{jk}) (∇f_{ik}(x^k_{ik}) − ∇f_{jk}(x^k_{jk})).
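To make the iteration concrete, a minimal Python sketch of Algorithm (RCD) is given below. The function and variable names (rcd, grads, probs) are ours and the gradients are supplied as callables; this is only an illustration of the update rule (3.2), not the authors' implementation.

```python
import numpy as np

def rcd(grads, L, edges, probs, x0, num_iters, rng=None):
    """Sketch of Algorithm (RCD) for min sum_i f_i(x_i) s.t. sum_i x_i = 0.

    grads : list of callables, grads[i](x_i) returns the partial gradient of f_i.
    L     : array of Lipschitz constants L_i of the partial gradients.
    edges : list of pairs (i, j) of the communication graph G.
    probs : probabilities p_ij attached to the edges (must sum to 1).
    x0    : feasible starting point of shape (N, n), rows summing to zero.
    """
    rng = rng or np.random.default_rng()
    x = x0.copy()
    for _ in range(num_iters):
        # choose an edge (i_k, j_k) with probability p_{i_k j_k}
        i, j = edges[rng.choice(len(edges), p=probs)]
        # closed-form directions (3.2): d_i = -(grad f_i - grad f_j)/(L_i + L_j), d_j = -d_i
        d = (grads[i](x[i]) - grads[j](x[j])) / (L[i] + L[j])
        x[i] -= d
        x[j] += d   # the update preserves the coupling constraint sum_i x_i = 0
    return x
```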

Clearly, Algorithm (RCD) is distributed since only two neighboring nodes in the graph need to communicate at each iteration. Further, at each iteration only two components of x are updated, so that our method has low complexity per iteration and is very efficient on functions with cheap derivatives (we need to compute only two partial gradients ∇fi(xi), ∇fj(xj) in R^{2n}, compared to full gradient methods where the full gradient ∇f(x) in R^{Nn} is required). Finally, our algorithm maintains feasibility at each iteration, i.e. x_1^k + · · · + x_N^k = 0 for all k ≥ 0.

3.1. Efficiency of (RCD) method in expectation. In this section we analyze the convergence rate of Algorithm (RCD) for the expected values of the objective function and in probability. After k iterations of the previous algorithm, we generate a random output (x^k, f(x^k)), which depends on the observed realization of the random variable:

ηk = {(i0, j0), · · · , (ik, jk)}.

Let us define the expected value of the objective function w.r.t. ηk:

ϕk = E[f(x^k)].

For simplicity of the exposition we use the following notation: given the current iterate x, we denote by x+ = x + Ui di + Uj dj the next iterate, where the directions (di, dj) are given by (3.2) for some randomly chosen pair (i, j) w.r.t. a probability distribution. For brevity, we also adopt the notation of expectation upon the entire history, i.e. (ϕ, η) instead of (ϕk, ηk). For a feasible x, taking the expected value over the random variable (i, j), we obtain:

f(x) − E[f(x+) | η] = ∑_{(i,j)∈E} pij [f(x) − f(x+_{(i,j)})]
  ≥ ∑_{(i,j)∈E} (pij/(2(Li + Lj))) ∥∇fi(xi) − ∇fj(xj)∥²   (by (3.3))
  = ∇f(x)^T ( ∑_{(i,j)∈E} (pij/(2(Li + Lj))) Gij ) ∇f(x),

where Gij = (ei − ej)(ei − ej)^T ⊗ In ∈ R^{Nn×Nn}. We introduce the weighted Laplacian of the underlying graph G as the matrix L = L(pij, Li) ∈ R^{N×N} defined as:

(3.4)  Lij = − pij/(Li + Lj) if (i, j) ∈ E;   Lii = ∑_{l∈Ni} pil/(Li + Ll);   Lij = 0 if (i, j) ∉ E,

where Ni denotes the set of neighbors of node i in the graph G. Note that the Laplacian matrix L is positive semidefinite and Le = 0, i.e. it has the smallest eigenvalue λ1(L) = 0 with the associated eigenvector e. Since the graph is connected, it is well known that the eigenvalue λ1(L) = 0 is simple, i.e. λ2(L) > 0. We introduce the following set:

M = { L ∈ R^{N×N} : L defined in (3.4), pij = pji, ∑_{(i,j)∈E} pij = 1 }.
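For reference, the weighted Laplacian (3.4) is easy to assemble from the edge probabilities and the Lipschitz constants; a minimal Python sketch follows (the helper name and data layout are ours).

```python
import numpy as np

def weighted_laplacian(N, edges, probs, L):
    """Sketch of the weighted Laplacian L(p_ij, L_i) from (3.4).

    edges : list of pairs (i, j); probs : matching probabilities p_ij;
    L     : array of Lipschitz constants L_i.  All names are ours.
    """
    Lap = np.zeros((N, N))
    for (i, j), p in zip(edges, probs):
        w = p / (L[i] + L[j])
        Lap[i, j] -= w
        Lap[j, i] -= w          # symmetric off-diagonal entries -p_ij/(L_i + L_j)
        Lap[i, i] += w          # diagonal entries accumulate the incident weights
        Lap[j, j] += w
    return Lap                  # positive semidefinite, Lap @ np.ones(N) == 0
```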


Then, the matrix G2 ∈ R^{Nn×Nn} defined as

G2 = ∑_{(i,j)∈E} (pij/(Li + Lj)) Gij = L ⊗ In

is also positive semidefinite. In conclusion we obtain the following useful inequality that shows the decrease of the objective function in expectation:

(3.5)  f(x) − E[f(x+) | η] ≥ (1/2) ∇f(x)^T G2 ∇f(x).

On the extended subspace S we now define a norm that will be used subsequently for measuring distances in this subspace. We define the extended primal norm induced by the matrix G2 as:

∥u∥_{G2} = √(u^T G2 u)  ∀u ∈ R^{Nn} \ T.

On the subspace S we introduce its extended dual norm:

∥x∥*_{G2} = max_{u: ∥u∥_{G2} ≤ 1} ⟨x, u⟩ = max_{u: ⟨G2 u, u⟩ ≤ 1} ⟨x, u⟩  ∀x ∈ S.

Using the definition of conjugate norms, the Cauchy-Schwarz inequality holds:

⟨u, x⟩ ≤ ∥u∥_{G2} · ∥x∥*_{G2}  ∀x ∈ S, u ∈ R^{Nn}.

Let us compute this dual norm for any x ∈ S:

∥x∥*_{G2} = max_{u ∈ R^{Nn}: ⟨G2 u, u⟩ ≤ 1} ⟨x, u⟩
= max_{u: ⟨G2 (u − e⊗(1/N)∑_{i=1}^N ui), u − e⊗(1/N)∑_{i=1}^N ui⟩ ≤ 1} ⟨x, u − e⊗(1/N)∑_{i=1}^N ui⟩
= max_{u: ⟨G2 u, u⟩ ≤ 1, ∑_{i=1}^N ui = 0} ⟨x, u⟩ = max_{u: ⟨G2 u, u⟩ ≤ 1, Uu = 0} ⟨x, u⟩
= max_{u: ⟨G2 u, u⟩ ≤ 1, (Uu)^T Uu ≤ 0} ⟨x, u⟩ = max_{u: ⟨G2 u, u⟩ ≤ 1, u^T U^T U u ≤ 0} ⟨x, u⟩
= min_{ν, µ ≥ 0} max_u ⟨x, u⟩ + µ(1 − ⟨G2 u, u⟩) − ν ⟨U^T U u, u⟩
= min_{ν, µ ≥ 0} µ + (1/4)⟨(µ G2 + ν U^T U)^{-1} x, x⟩
= min_{ν ≥ 0} min_{µ ≥ 0} µ + (1/(4µ))⟨(G2 + (ν/µ) U^T U)^{-1} x, x⟩
= min_{ζ ≥ 0} √(⟨(G2 + ζ U^T U)^{-1} x, x⟩).

In conclusion, we obtain an extended dual norm that is well defined on S:

(3.6)  ∥x∥*_{G2} = min_{ζ ≥ 0} √(⟨((L + ζ e e^T)^{-1} ⊗ In) x, x⟩)  ∀x ∈ S.

Using the eigenvalue decomposition of the Laplacian L = Ξ diag(0, λ2, · · · , λN) Ξ^T, where the λi are the positive eigenvalues and Ξ = [e ξ2 · · · ξN] with ⟨e, ξi⟩ = 0 for all i ∈ V, we have (L + ζ e e^T)^{-1} = Ξ diag(ζ∥e∥², λ2, · · · , λN)^{-1} Ξ^T. It is straightforward to see that our defined norm has the following closed form:

∥x∥*_{G2} = √(x^T (L^+ ⊗ In) x)  ∀x ∈ S,

where L^+ = Ξ diag(0, 1/λ2, · · · , 1/λN) Ξ^T denotes the pseudoinverse of the matrix L. On the other hand, if we define L[N−1] as the leading matrix of dimension N − 1 of L and x_{1:N−1} = [x1^T · · · x_{N−1}^T]^T ∈ R^{(N−1)n}, from the definition of the norm we also have:

∥x∥*_{G2} = max_{u: ⟨G2(u − eN⊗uN), u − eN⊗uN⟩ ≤ 1} ⟨x, u − eN ⊗ uN⟩
= max_{u: ⟨G2 u, u⟩ ≤ 1, uN = 0} ⟨x, u⟩ = max_{u_{1:N−1}: ⟨(L[N−1]⊗In) u_{1:N−1}, u_{1:N−1}⟩ ≤ 1} ⟨x_{1:N−1}, u_{1:N−1}⟩.

The optimality condition in the previous maximization problem is given by:

(L[N−1] ⊗ In) u_{1:N−1} = x_{1:N−1}.

In conclusion, we have:

(3.7)  ∥x∥*_{G2} = √( x_{1:N−1}^T (L[N−1] ⊗ In)^{-1} x_{1:N−1} )  ∀x ∈ S.
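Numerically, the closed form above can be evaluated through the pseudoinverse of the weighted Laplacian; a small Python sketch follows (the helper names are ours, assuming x is stored as an (N, n) array of blocks).

```python
import numpy as np

def dual_norm_G2(x_blocks, Lap):
    """Sketch of the dual norm (3.6): ||x||*_{G2} = sqrt(x^T (L^+ kron I_n) x), x in S.

    x_blocks : array of shape (N, n) whose rows x_i sum to zero,
    Lap      : the N x N weighted Laplacian from (3.4).
    """
    Lplus = np.linalg.pinv(Lap)                  # pseudoinverse L^+ of the Laplacian
    # (L^+ kron I_n) acts block-wise, so x^T (L^+ kron I_n) x = sum_{i,j} L^+_{ij} <x_i, x_j>
    return np.sqrt(np.einsum('ij,ik,jk->', Lplus, x_blocks, x_blocks))
```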

Let us compute our defined norm for some important graphs:
1. For a cycle graph, if we define the vector of inverse probabilities as

p = [1/p12  1/p23  · · ·  1/pN1]^T

and the lower triangular matrix W ∈ R^{N×N} with all entries in the lower part equal to 1, then the norm takes the closed form:

(∥x∥*_{G2})² = x^T ((W^T (diag(p) − (1/(e^T p)) p p^T) W) ⊗ In) x  ∀x ∈ S.

2. For a star-shaped graph, if we define the vector of inverse probabilities as

p = [1/p1N  1/p2N  · · ·  1/p_{N−1,N}  0]^T,

then the norm takes the closed form

(∥x∥*_{G2})² = x^T (diag(p) ⊗ In) x  ∀x ∈ S.

3. For a complete graph, if we take for the probabilities the expressions

(3.8)  p^s_ij = (Li + Lj)/(N(N − 1)Lav),   Lav = (1/N) ∑_i Li,

then we can see immediately that

(3.9)  Ls = (1/(N(N − 1)Lav)) (N IN − e e^T)

and thus using the matrix inversion lemma we get

(L[N−1]_s)^{-1} = (N − 1)Lav (I_{N−1} + e e^T).

In this case from (3.7) we get:

(3.10)  (∥x∥*_{G2,s})² = (N − 1) ∑_{i=1}^N Lav ∥xi∥²  ∀x ∈ S.

On the other hand, if we take for the probabilities the expressions

(3.11)  p^inv_ij = (1/Li + 1/Lj)/((N − 1) ∑_i 1/Li),

then we can see immediately that

(3.12)  Linv = (1/(N − 1)) ( diag(L^{-1}) − (1/(∑_i 1/Li)) L^{-1}(L^{-1})^T )

and thus using again the matrix inversion lemma we get

(L[N−1]_inv)^{-1} = (N − 1)( diag(L1, . . . , L_{N−1}) + LN e e^T ).

In this case from (3.7) we obtain:

(3.13)  (∥x∥*_{G2,inv})² = (N − 1) ∑_{i=1}^N Li ∥xi∥²  ∀x ∈ S.

In order to estimate the rate of convergence of our algorithm we introduce the following distance, which takes into account that our algorithm is a descent method (see inequality (3.3)):

R(x^0) = max_{x∈S: f(x)≤f(x^0)} max_{x∗∈X∗} ∥x − x∗∥*_{G2},

which measures the size of the level set of f given by x^0. We assume that this distance is finite for the initial iterate x^0. We now state and prove the main result of this section:

Theorem 3.1. Let Assumption 1 hold for the optimization problem (2.1) and let the sequence (x^k)_{k≥0} be generated by Algorithm (RCD). Then, we have the following rate of convergence for the expected values of the objective function:

(3.14)  ϕk − f∗ ≤ 2R²(x^0)/k.

Proof. From the convexity of f and the definition of the norm ∥ · ∥_{G2} we get:

f(x^l) − f∗ ≤ ⟨∇f(x^l), x^l − x∗⟩ ≤ ∥x^l − x∗∥*_{G2} · ∥∇f(x^l)∥_{G2} ≤ R(x^0) · ∥∇f(x^l)∥_{G2}  ∀l ≥ 0.

Combining this inequality with (3.5), we obtain:

f(x^l) − E[f(x^{l+1}) | ηl] ≥ (f(x^l) − f∗)² / (2R²(x^0)),

or equivalently

(3.15)  E[f(x^{l+1}) | ηl] − f∗ ≤ f(x^l) − f∗ − (f(x^l) − f∗)²/(2R²(x^0)).

Taking the expectation of both sides of this inequality in ηl−1 and denoting ∆l = ϕl − f∗ leads to:

∆_{l+1} ≤ ∆l − ∆l²/(2R²(x^0)).

Dividing both sides of this inequality by ∆l ∆_{l+1} and taking into account that ∆_{l+1} ≤ ∆l we obtain:

1/∆l ≤ 1/∆_{l+1} − 1/(2R²(x^0))  ∀l ≥ 0.

Adding these inequalities for l = 0, · · · , k − 1 we get 0 ≤ 1/∆0 ≤ 1/∆k − k/(2R²(x^0)),

from which we obtain the statement (3.14) of the theorem.

Theorem 3.1 shows that for smooth functions Algorithm (RCD) has a sublinear rate of convergence in expectation, but with a low complexity per iteration. More specifically, the iteration complexity is of order

O(nf + n + ln N),

where we recall that nf is the maximum cost of computing the gradient of each function fi for all i ∈ V, O(n) is the cost of updating x+_{(i,j)} from x, and ln N is the cost of choosing randomly a pair of indices (i, j) for a given probability distribution (pij)_{(i,j)∈E}, where N is the number of nodes in the graph G. The convergence rate of our method (RCD) can be expressed explicitly for a complete graph and under some specific choices of probabilities, according to the discussion above. In particular, let us assume that we know some constants Ri > 0 such that for any x∗ ∈ X∗ and any x satisfying f(x) ≤ f(x^0) we have:

∥xi − x∗i∥ ≤ Ri  ∀i ∈ V,

and define R = [R1 · · · RN]^T. Since our Algorithm (RCD) is a descent method (see inequality (3.3)), it follows that:

R(x^0) ≤ max_{x∈S: ∥xi−x∗i∥≤Ri ∀i∈V} max_{x∗∈X∗} ∥x − x∗∥*_{G2}.

For a complete graph and using probabilities of the form (3.8), the following convergence rate for Algorithm (RCD) follows immediately (see (3.10)):

(3.16)  ϕk − f∗ ≤ 2(N − 1) ∑_{i=1}^N Lav Ri² / k.

For a complete graph and using probabilities of the form (3.11), the following convergence rate for Algorithm (RCD) follows (see (3.13)):

(3.17)  ϕk − f∗ ≤ 2(N − 1) ∑_{i=1}^N Li Ri² / k.


3.2. Design of the probabilities. We have several choices for the probabilities (pij)_{(i,j)∈E}, on which the randomized block coordinate descent Algorithm (RCD) depends. For example, we can choose probabilities dependent on the Lipschitz constants Lij:

(3.18)  p^α_ij = L^α_ij / L_α,   L_α = ∑_{(i,j)∈E} L^α_ij,   α ≥ 0.

Note that for α = 0 we recover the uniform probabilities. Finally, we can design the probabilities from the convergence rate of the method. From the definition of the constants Ri it follows that:

R(x^0) ≤ max_{x: ∥xi∥≤Ri ∀i∈V, ∑_{i=1}^N xi=0} ∥x∥*_{G2}.

We have the freedom to choose the matrix G2, which depends on the probabilities (we recall that G2 = L ⊗ In and L depends linearly on pij). Therefore, we search for the probabilities pij that are the optimal solution of the following optimization problem:

R∗(x^0) = min_{pij} R(x^0) ≤ min_{G2: G2=L⊗In, L∈M}  max_{x: ∥xi∥≤Ri ∀i∈V, ∑_{i=1}^N xi=0} ∥x∥*_{G2}.

In the next theorem we derive an easily computed upper bound on R(x^0) and we provide a way to suboptimally select the probabilities pij:

Theorem 3.2. A suboptimal choice of probabilities (pij)_{(i,j)∈E} can be obtained as a solution of the following SDP problem, whose optimal value is an upper bound on R²(x^0), i.e.:

(3.19)  (R∗(x^0))² ≤ min_{L∈M, ζ≥0, ν≥0}  { ⟨ν, R²⟩ :  [ L + ζ e e^T   IN ;  IN   diag(ν) ] ≽ 0 }.

Proof. The previous optimization problem can be written as follows:

R²(x^0) = min_{G2: G2=L⊗In, L∈M}  max_{x: ∥xi∥≤Ri ∀i, ∑_i xi=0}  (∥x∥*_{G2})²
= min_{G2: G2=L⊗In, L∈M}  max_{x: ∥xi∥≤Ri ∀i, ∑_i xi=0}  min_{ζ≥0}  ⟨(G2 + ζ U^T U)^{-1} x, x⟩
= min_{G2: G2=L⊗In, L∈M, ζ≥0}  max_{x: ∥xi∥≤Ri ∀i, ∑_i xi=0}  ⟨(G2 + ζ U^T U)^{-1} x, x⟩
= min_{G2: G2=L⊗In, L∈M, ζ≥0}  max_{x: ∥xi∥≤Ri ∀i, ∑_i xi=0}  ⟨(G2 + ζ U^T U)^{-1}, x x^T⟩_1.


Using the following well-known relaxation from the SDP literature, we have:

min_{G2,ζ} max_{X≽0, rank X=1, ⟨X,Vi⟩_1≤Ri² ∀i, ⟨U^TU,X⟩_1=0} ⟨(G2 + ζ U^T U)^{-1}, X⟩_1
≤ min_{G2,ζ} max_{X≽0, ⟨X,Vi⟩_1≤Ri² ∀i, ⟨U^TU,X⟩_1=0} ⟨(G2 + ζ U^T U)^{-1}, X⟩_1
= min_{G2,ζ,θ,Z≽0,ν≥0} max_X ⟨(G2 + ζ U^T U)^{-1} + Z + θ U^T U, X⟩_1 + ∑_{i=1}^N νi (Ri² − ⟨X, Vi⟩_1)
= min_{G2,ζ,θ,Z≽0,ν≥0} max_X ⟨(G2 + ζ U^T U)^{-1} + Z + θ U^T U − ∑_i νi Vi, X⟩_1 + ∑_i νi Ri²
= min_{G2,ζ,θ,Z≽0,ν≥0, (G2+ζU^TU)^{-1}+Z+θU^TU−∑_i νiVi=0} ∑_i νi Ri²
= min_{G2,ζ,θ,Z≽0,ν≥0, Z=∑_i νiVi−(G2+ζU^TU)^{-1}−θU^TU} ⟨ν, R²⟩
= min_{G2,ζ,θ,ν≥0, ∑_i νiVi−(G2+ζU^TU)^{-1}−θU^TU≽0} ⟨ν, R²⟩
= min_{G2,ζ,θ,ν≥0, (G2+ζU^TU)^{-1}≼∑_i νiVi−θU^TU} ⟨ν, R²⟩
= min_{G2,ζ≥0,ν≥0, (G2+ζU^TU)^{-1}≼diag(ν)⊗In} ⟨ν, R²⟩
= min_{G2,ζ≥0,ν≥0, G2+ζU^TU≽diag(ν^{-1})⊗In} ⟨ν, R²⟩,

where ν = [ν1 · · · νN]^T. Now, taking into account that G2 = L ⊗ In, where we recall that L ∈ M, and that ζ ≥ 0, we get:

min_{L∈M, ζ≥0, ν≥0, (L+ζee^T)⊗In ≽ diag(ν^{-1})⊗In} ⟨ν, R²⟩ = min_{L∈M, ζ≥0, ν≥0, L+ζee^T ≽ diag(ν^{-1})} ⟨ν, R²⟩.

Finally, the SDP (3.19) is obtained from the Schur complement formula applied to the previous optimization problem.

Since we assumed that the graph G is connected, we have that λ1(L) = 0 is simple and consequently λ2(L) > 0. Note that the following equivalence holds:

L + t e e^T/∥e∥² ≽ t IN  if and only if  t ≤ λ2(L),

since the spectrum of the matrix L + ζ e e^T is {ζ∥e∥², λ2(L), · · · , λN(L)}. It follows that ζ = t/∥e∥², νi = 1/t for all i, and any L such that t ≤ λ2(L) is feasible for the SDP problem (3.19). We conclude that:

(3.20)  (R∗(x^0))² ≤ min_{L∈M, ζ≥0, ν≥0, L+ζee^T≽diag(ν^{-1})} ⟨ν, R²⟩ ≤ min_{L∈M, t≤λ2(L)} ∑_{i=1}^N Ri² (1/t) ≤ ∑_i Ri² / λ2(L)  ∀L ∈ M.

Then, according to Theorem 3.1 we obtain the following upper bound on the rate of convergence for the expected values of the objective function:

(3.21)  ϕk − f∗ ≤ 2 ∑_{i=1}^N Ri² / (λ2(L) · k)  ∀L ∈ M.
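As a small illustration, the bound (3.21) can be evaluated numerically from any admissible Laplacian; a Python sketch follows (function name and inputs are ours, R being the vector of the constants Ri).

```python
import numpy as np

def iteration_bound(R, Lap, eps):
    """Sketch of (3.21): number of iterations k guaranteeing E[f(x^k)] - f* <= eps,
    expressed through lambda_2 of the weighted Laplacian Lap."""
    lam2 = np.sort(np.linalg.eigvalsh(Lap))[1]      # second smallest eigenvalue
    return int(np.ceil(2.0 * np.sum(R ** 2) / (lam2 * eps)))
```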


From the convergence rate of Algorithm (RCD) given in (3.21) it follows that we can choose the probabilities so as to maximize the second eigenvalue of L: max_{L∈M} λ2(L). In conclusion, in order to find some suboptimal probabilities (pij)_{(i,j)∈E}, we can solve the following SDP problem, simpler than the one given in (3.19):

(3.22)  p∗ij = arg max_{t, L∈M}  { t : L ≽ t (IN − e e^T/∥e∥²) }.

Note that the matrices on both sides of the LMI in (3.22) have the common eigenvalue zero associated with the eigenvector e, so that this LMI has empty interior, which can cause problems for some classes of interior point methods. We can overcome this problem by replacing this LMI with the following equivalent LMI:

L + e e^T/∥e∥² ≽ t (IN − e e^T/∥e∥²).
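A sketch of how (3.22), in the regularized LMI form above, could be set up with CVXPY is given below. The function name and the data layout (edge list plus Lipschitz constants) are our assumptions; any SDP solver interfaced by CVXPY can be used.

```python
import cvxpy as cp
import numpy as np

def suboptimal_probabilities(N, edges, L):
    """Sketch of the SDP (3.22): choose p_ij maximizing lambda_2 of the weighted Laplacian."""
    p = cp.Variable(len(edges), nonneg=True)
    t = cp.Variable()
    Lap = 0
    for k, (i, j) in enumerate(edges):
        g = np.zeros(N)
        g[i], g[j] = 1.0, -1.0
        Lap = Lap + (p[k] / (L[i] + L[j])) * np.outer(g, g)   # Laplacian (3.4) is linear in p_ij
    J = np.ones((N, N)) / N                                    # e e^T / ||e||^2
    constraints = [cp.sum(p) == 1,
                   Lap + J >> t * (np.eye(N) - J)]             # equivalent LMI from the text
    cp.Problem(cp.Maximize(t), constraints).solve()
    return p.value, t.value                                    # probabilities and lambda_2(L*)
```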

If the probabilities are chosen as the optimal solution of the previous SDP problem (3.22), then from (3.21) we obtain the following upper bound on the rate of convergence for the expected values of the objective function:

(3.23)  ϕk − f∗ ≤ 2(N − 1) ∑_{i=1}^N [1/((N − 1)λ2(L∗))] Ri² / k.

Here L∗ denotes the optimal solution of the SDP (3.22). In conclusion, we have:

(R∗(x^0))² ≤ min_{pij} SDP (3.19) ≤ min_{pij} (∑_i Ri²)/SDP (3.22) ≤ (∑_i Ri²)/λ2(L)  ∀pij,

and consequently

(3.24)  ϕk − f∗ ≤ 2(R∗(x^0))²/k ≤ 2 min_{pij} SDP (3.19)/k
≤ 2(N − 1) ∑_{i=1}^N [1/((N − 1)λ2(L∗))] Ri²/k ≤ 2(N − 1) ∑_{i=1}^N [1/((N − 1)λ2(L))] Ri²/k  ∀pij.

Finally, we also have the following results:
1. If we assume a complete graph and the probabilities are taken in the form (3.8), then the Laplacian matrix has the expression given in (3.9), i.e. Ls = (1/((N − 1)Lav)) (IN − (1/N) e e^T). For this matrix we can show immediately that λ2(Ls) = 1/((N − 1)Lav). In conclusion, we get the convergence rate (3.16), which shows that R²(x^0) ≤ (N − 1) ∑_i Lav Ri² = ∑_i Ri² / λ2(Ls).
2. If we assume a complete graph and the probabilities are taken in the form (3.11), then the Laplacian matrix has the expression given in (3.12), i.e. Linv = (1/((N − 1) ∑_i 1/Li)) ((∑_i 1/Li) diag(L^{-1}) − L^{-1}(L^{-1})^T). For this matrix we have that R²(x^0) ≤ (N − 1) ∑_i Li Ri² ≤ ∑_i Ri² / λ2(Linv).


3.3. Comparison with the full projected gradient method. Based on Assumption 1 we can derive the following inequality:

(3.25)  f(x + s) ≤ ∑_{i=1}^N [fi(xi) + ⟨∇fi(xi), si⟩] + ∑_{i=1}^N (Li/2)∥si∥² = f(x) + ⟨∇f(x), s⟩ + (1/2)∥s∥²_{diag(L)}  ∀x, s ∈ R^{Nn}.

Thus, we also have:

(3.26)  f(x + s) ≤ f(x) + ⟨∇f(x), s⟩ + (L̄/2)∥s∥²  ∀x, s ∈ R^{Nn},

where we recall that L̄ = max_i Li. Therefore, if we measure distances in the extended space R^{Nn} with the Euclidean norm, we can take L̄ as a Lipschitz constant for f. Let us apply the full projected gradient method for solving the optimization problem (2.1). Given x ∈ S, we define the following iteration:

x+ = x + d,

where d is the optimal solution of the following optimization problem (see (3.26)):

d = arg min_{s∈R^{Nn}: ∑_{i=1}^N si=0}  f(x) + ⟨∇f(x), s⟩ + (L̄/2)∥s∥².

Since we assume local Euclidean norms on R^n, we obtain the following solution:

di = (1/(N L̄)) ∑_{j=1}^N (∇fj(xj) − ∇fi(xi))  ∀i ∈ V.

In conclusion, if we consider the Euclidean norm in the extended space R^{Nn}, then from Assumption 1 it follows that the function f has Lipschitz continuous gradient with Lipschitz constant L̄ (according to (3.26)) and then the convergence rate of the projected gradient method is given by [11]:

(3.27)  f(x^k) − f∗ ≤ 2 ∑_{i=1}^N L̄ Ri² / k.

In the sequel we show that we can consider another norm, different from the Euclidean norm, to measure distances in the subspace S of the extended space R^{Nn}. We will see that with this norm the convergence rate of the projected gradient is better than that in (3.27), where the Euclidean norm was considered. Since f has Lipschitz continuous gradient and the descent lemma (3.25) is valid, by standard reasoning we can argue that the direction d^k in the gradient method can be computed as (see (3.25)):

d = arg min_{s∈R^{Nn}: ∑_{i=1}^N si=0}  ∑_{i=1}^N [fi(xi) + ⟨∇fi(xi), si⟩ + (Li/2)∥si∥²].

We obtain the following closed form solution:

di = (1/Li) · [∑_{j=1}^N (1/Lj)(∇fj(xj) − ∇fi(xi))] / [∑_{j=1}^N 1/Lj]  ∀i ∈ V.


From (3.25) we derive the following inequality:

f(x+) ≤ f(x) − ∑_{i=1}^N (1/(2Li)) ∥∑_{j=1}^N (1/Lj)(∇fj(xj) − ∇fi(xi))∥² / (∑_{j=1}^N 1/Lj)²
     = f(x) − ∑_{i=1}^N (1/(2Li)) ∥∇fi(xi) − [∑_{j=1}^N (1/Lj)∇fj(xj)] / [∑_{j=1}^N 1/Lj]∥²
     = f(x) − (1/2) ∇f(x)^T GN ∇f(x),

where GN = LN ⊗ In and the matrix LN is defined as

LN = diag(L^{-1}) − (1/(∑_i 1/Li)) L^{-1}(L^{-1})^T,

where we recall that L = [L1 · · · LN]^T. Note that LN is still a Laplacian matrix, but for a complete graph with N nodes. As in the previous section, for the matrix GN we define the induced norms in the extended primal and dual space:

∥u∥_{GN} = √(u^T GN u),   ∥x∥*_{GN} = max_{u: ⟨GN u, u⟩ ≤ 1} ⟨x, u⟩   ∀x ∈ S, u ∈ R^{Nn} \ T.

Based on (3.12) and (3.13) we conclude that

(3.28)  (∥x∥*_{GN})² = ∑_{i=1}^N Li ∥xi∥²  ∀x ∈ S.

The full projected gradient iteration at each step k becomes:

(3.29)  x_i^{k+1} = x_i^k − (1/Li) ∇fi(x_i^k) + (1/Li) [∑_{j=1}^N (1/Lj) ∇fj(x_j^k)] / [∑_{j=1}^N 1/Lj]  ∀i ∈ V.
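As an illustration, one step of (3.29) can be written as follows (a Python sketch; grads, L and the (N, n) array layout are our conventions).

```python
import numpy as np

def full_projected_gradient_step(grads, L, x):
    """Sketch of the full projected gradient iteration (3.29) under the norm (3.28).

    grads : list of callables grads[i](x_i); L : array of Lipschitz constants;
    x     : current feasible iterate of shape (N, n).
    """
    g = np.stack([grads[i](x[i]) for i in range(len(L))])     # all partial gradients
    w = 1.0 / L                                               # weights 1/L_j
    avg = (w[:, None] * g).sum(axis=0) / w.sum()              # sum_j (1/L_j) grad f_j / sum_j 1/L_j
    return x - (g - avg) / L[:, None]                         # x_i - (grad f_i - avg)/L_i, feasibility preserved
```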

Following the same reasoning as in Theorem 3.1, we obtain the following rate of convergence:

f(x^k) − f∗ ≤ 2R²_full(x^0)/k,

where R_full(x^0) = max_{x: f(x)≤f(x^0)} max_{x∗∈X∗} ∥x − x∗∥*_{GN}. Using the expression for the norm ∥x∥*_{GN} given in (3.28) and the definition of Ri, we can show that

R²_full(x^0) ≤ ∑_{i=1}^N Li Ri²,

and thus we get the following convergence rate for the full projected gradient method, when the extended norm (3.28) is considered:

(3.30)  f(x^k) − f∗ ≤ 2 ∑_{i=1}^N Li Ri² / k.


Moreover, the complexity per iteration of the full gradient is

O(Nnf +Nn),

i.e. O(N) times more than for Algorithm RCD.

Clearly, the estimate given in (3.30) (where the induced norm defined by the matrix GN is considered) is better than the estimate given in (3.27) (where the Euclidean norm is considered). Further, the iteration complexity of the full projected gradient method is O(N(n + nf)). Note that if ∑_{i=1}^N Lav Ri² ≤ ∑_{i=1}^N Li Ri², then the rate of convergence for Algorithm (RCD) is as follows:

(3.31)  ϕk − f∗ ≤ 2(R∗(x^0))²/k
      ≤ 2(N − 1) ∑_{i=1}^N [1/((N − 1)λ2(L∗))] Ri²/k   (with probabilities p∗ij)
(3.32)  ≤ 2(N − 1) ∑_{i=1}^N Lav Ri²/k   (with probabilities p^s_ij)
      ≤ 2(N − 1) ∑_{i=1}^N Li Ri²/k   (with probabilities p^inv_ij).

Table 3.1
Comparison of arithmetic complexities for algorithms (RCD), full gradient, [2] and [19] for n = 1.

Alg.            block   iteration       Rate of conv.                 Iter. complexity
Full grad.      yes     full            O(∑_i Li Ri² / k)             O(N nf + N)
(RCD)/p^inv_ij  yes     random (i, j)   O((N − 1) ∑_i Li Ri² / k)     O(nf + ln N)
(RCD)/p^s_ij    yes     random (i, j)   O((N − 1) ∑_i Lav Ri² / k)    O(nf + ln N)
Tseng [19]      yes     greedy (i, j)   O(N ∑_i L̄ Ri² / k)            O(nf + N)
Beck [2]        no      greedy (i, j)   O(N ∑_i L̄ Ri² / k)            O(nf + N)

However, the iteration complexity of the (RCD) method is usually O(N) times cheaper than the iteration complexity of the full projected gradient algorithm. Moreover, the full projected gradient method is not a distributed algorithm since it requires a central coordinator. Note that despite the fact that the coordinate descent methods presented in [2, 19] can solve optimization problems with additional box constraints, the arithmetic complexity of those methods is O(N) times worse than the arithmetic complexity of our Algorithm (RCD). This can be seen in Table 3.1, where we compare the arithmetic complexities of all four algorithms (full gradient, our method (RCD), the coordinate descent method in [19] and the coordinate descent method in [2]) for optimization problems with n = 1 (scalar case) and a single linear coupling constraint (recall that L̄ = max_i Li). Finally, note that the method in [19] has a very poor rate of convergence when the number of coupling constraints is larger than one, while the method in [2] is not able to handle more than a single coupling constraint.


3.4. Strongly convex case. In addition to the assumption of Lipschitz continuous gradient for each function fi (see Assumption 1), we now assume that the function f is also strongly convex with respect to the extended norm ∥ · ∥*_{G2}, with convexity parameter σG2, on the subspace S. More exactly, the objective function f satisfies:

(3.33)  f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (σG2/2) (∥x − y∥*_{G2})²  ∀x, y ∈ S.

Combining the Lipschitz inequality (3.25) with the previous strong convexity inequality (3.33) we get:

∑_{i=1}^N Li ∥xi − yi∥² ≥ σG2 (∥x − y∥*_{G2})²  ∀x, y ∈ S.

Now, if we consider e.g. a complete graph and the probabilities given in (3.11), then using the expression for the norm ∥ · ∥*_{G2} given in (3.13) we obtain σG2 ≤ 1/(N − 1). We now state the main result of this section:

Theorem 3.3. Under the assumptions of Theorem 3.1, let the function f also be strongly convex with respect to the norm ∥ · ∥*_{G2} with convexity parameter σG2. For the sequence (x^k)_{k≥0} generated by Algorithm (RCD) we have the following linear estimate for the convergence rate in expectation:

(3.34)  ϕk − f∗ ≤ (1 − σG2)^k (f(x^0) − f∗).

Proof. From (3.5) we have

2 (f(x^k) − E[f(x^{k+1}) | ηk]) ≥ ∥∇f(x^k)∥²_{G2}.

On the other hand, minimizing both sides of inequality (3.33) over x ∈ S we have:

∥∇f(y)∥²_{G2} ≥ 2σG2 (f(y) − f∗)  ∀y ∈ S,

and for y = x^k we get:

∥∇f(x^k)∥²_{G2} ≥ 2σG2 (f(x^k) − f∗).

Combining these two relations and taking the expectation in ηk−1 on both sides, we prove the statement of the theorem.

We notice that if the fi are strongly convex functions with respect to the Euclidean norm, with convexity parameters σi, i.e.

fi(xi) ≥ fi(yi) + ⟨∇fi(yi), xi − yi⟩ + (σi/2)∥xi − yi∥²  ∀xi, yi ∈ R^n, i ∈ V,

then the whole function f = ∑_i fi is also strongly convex w.r.t. the extended norm induced by the positive definite matrix Σ = diag(σ), where σ = [σ1 · · · σN]^T, i.e.

f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (1/2)∥x − y∥²_{Σ⊗In}  ∀x, y ∈ R^{nN}.

Note that in this extended norm ∥ · ∥_{Σ⊗In} the strong convexity parameter of the function f is equal to 1. It follows immediately that the function f is also strongly convex with respect to the norm ∥ · ∥*_{G2}, with strong convexity parameter σG2 satisfying:

σG2 Σ^{-1} ≼ L + ζ e e^T,


for some ζ ≥ 0. In conclusion we get the following LMI:

σG2 IN ≼ Σ^{1/2} (L + ζ e e^T) Σ^{1/2}.

From Theorem 3.3 it follows that in order to get a better convergence rate we need to search for σG2 as large as possible. Therefore, for strongly convex functions the optimal probabilities are chosen as the solution of the following SDP problem:

(3.35)  p∗ij = arg max_{ζ≥0, σG2, L∈M}  σG2   s.t.  σG2 IN ≼ Σ^{1/2} (L + ζ e e^T) Σ^{1/2}.

3.5. Rate of convergence in probability. In this section we estimate the quality of the random point x^k, quantifying the confidence of reaching the accuracy ϵ for the optimal value. We denote by ρ a confidence level and we use Theorem 1 from [17] to give the following lemma:

Lemma 3.4. Let {ξk}_{k≥0} be a nonnegative, nonincreasing sequence of discrete random variables with one of the properties:
(i) E[ξ_{k+1} | ξk] ≤ ξk − ξk²/r for k ≥ 0 and a constant r > ϵ;
(ii) E[ξ_{k+1} | ξk] ≤ (1 − 1/r) ξk for k ≥ 0 and a constant r > 1.
If the first property holds and we choose K such that

K ≥ (r/ϵ)(1 + ln(1/ρ)) + r/ξ0 + 2,

or if the second property holds and we choose K satisfying

K ≥ r ln(ξ0/(ϵρ)),

then we have Prob(ξK ≤ ϵ) ≥ 1 − ρ.
Proof. The proof uses a reasoning similar to that of Theorem 1 in [17] and is derived from the Markov inequality.
Considering now the sequence of random variables ξk = f(x^k) − f∗ in the previous lemma, we reach the following result:

Theorem 3.5. Under the assumptions of Theorem 3.1, let us choose

k ≥ (R²(x^0)/ϵ) (1 + ln(1/ρ) − ϵ/(f(x^0) − f∗)) + 2;

then the random sequence (x^k)_{k≥0} generated by Algorithm (RCD) for solving (2.1) satisfies:

Prob(f(x^k) − f∗ ≤ ϵ) ≥ 1 − ρ.

If, additionally, the function f is also strongly convex with respect to the norm ∥ · ∥*_{G2} with convexity parameter σG2, then choosing k to satisfy:

k ≥ (1/σG2) ln((f(x^0) − f∗)/(ϵρ))

it is ensured that:

Prob(f(x^k) − f∗ ≤ ϵ) ≥ 1 − ρ.


Proof. From inequality (3.15) we note that the random variable ξk = f(x^k) − f∗ has the property E[ξ_{k+1} | ξk] ≤ ξk − ξk²/r, where we take r = R²(x^0) > ϵ. Since ξk satisfies the first condition of Lemma 3.4, we get the first statement of the theorem.
For the second part, from the Markov inequality and relation (3.34) we have:

Prob(f(x^k) − f∗ ≥ ϵ) ≤ (ϕk − f∗)/ϵ ≤ (1/ϵ)(1 − σG2)^k (f(x^0) − f∗).

Choosing k as in the statement of the theorem we have that Prob(f(x^k) − f∗ ≥ ϵ) ≤ ρ, so that the second statement of the theorem is proved.
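As a small illustration, the iteration counts prescribed by Theorem 3.5 can be computed directly from estimates of R²(x^0), f(x^0) − f∗ and σG2; the Python sketch below simply evaluates the two bounds printed above, with hypothetical input names.

```python
import math

def smooth_case_iters(R2, eps, rho, f0_gap):
    """Iterations guaranteeing Prob(f(x^k) - f* <= eps) >= 1 - rho (smooth case of Theorem 3.5).
    R2 ~ R^2(x^0) and f0_gap ~ f(x^0) - f* are problem-dependent estimates."""
    return math.ceil(R2 / eps * (1 + math.log(1.0 / rho) - eps / f0_gap) + 2)

def strongly_convex_iters(sigma_G2, eps, rho, f0_gap):
    """Iterations for the strongly convex case of Theorem 3.5."""
    return math.ceil(math.log(f0_gap / (eps * rho)) / sigma_G2)
```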

4. Generalization. Parallel implementations of Algorithm (RCD) are possible, i.e. we can choose in parallel different block coordinates and update x in all these components. Usually, Algorithm (RCD) can be accelerated by parallelization, i.e. by using more than one pair per iteration. In this section we extend the main results of the previous sections to a general randomized block coordinate descent algorithm that chooses an M-tuple of indices, i.e. the (RCD) algorithm is extended to update more than a pair of indices at each iteration. For example, taking a positive integer M ≤ N, we denote by N any subset of V having cardinality M. Here, we do not explicitly assume additional structure such as that imposed by a graph, i.e. we consider all-to-all communication. Then, we can derive a randomized M-block coordinate descent algorithm where we update at each iteration only M blocks of the vector x. Let us define N = (i1, · · · , iM) with il ∈ V, sN = [si1 · · · siM]^T ∈ R^{Mn} and LN = [Li1 · · · LiM]^T ∈ R^M. Under assumption (2.2) the following inequality holds:

(4.1)  f(x + ∑_{i∈N} Ui si) ≤ f(x) + ⟨∇fN(x), sN⟩ + (1/2)∥sN∥²_{diag(LN)}.

Based on the inequality (4.1) we can define a general randomized M-block coordinate descent algorithm, which we call (RCD)M. Given an x in the feasible set S, we choose the coordinate M-tuple N with probability pN. The next iterate is chosen as follows:

x+ = x + ∑_{i∈N} Ui di,

i.e. we update M components of the vector x, where the direction dN is determined by requiring that the next iterate x+ also be feasible and by minimizing the right hand side of (4.1), i.e.:

dN = arg min_{sN: ∑_{i∈N} si=0}  f(x) + ⟨∇fN(x), sN⟩ + (1/2)∥sN∥²_{diag(LN)},

or explicitly

di = (1/Li) · [∑_{j∈N} (1/Lj)(∇fj(xj) − ∇fi(xi))] / [∑_{j∈N} 1/Lj]  ∀i ∈ N.

In conclusion, we obtain the following randomized M-block coordinate descent method:

Algorithm (RCD)M: (Random M-Block Coordinate Descent Method)
1. Choose the M-tuple Nk = (i1_k, · · · , iM_k) with probability pNk
2. Set x_l^{k+1} = x_l^k ∀l ∉ Nk and x_l^{k+1} = x_l^k + d_l^k ∀l ∈ Nk.
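For concreteness, one (RCD)M update can be sketched in Python as follows (the function name and the (N, n) block layout are ours; the chosen M-tuple is passed as a list of block indices).

```python
import numpy as np

def rcd_m_step(grads, L, x, block):
    """Sketch of one (RCD)_M update: the M blocks in `block` move along the
    closed-form directions d_i while all other blocks stay fixed."""
    g = np.stack([grads[i](x[i]) for i in block])             # partial gradients on the chosen blocks
    w = 1.0 / L[block]
    avg = (w[:, None] * g).sum(axis=0) / w.sum()              # weighted average of the chosen gradients
    x = x.copy()
    x[block] -= (g - avg) / L[block][:, None]                 # d_i = -(grad f_i - avg)/L_i, i in the tuple
    return x                                                  # sum_i x_i is unchanged
```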

Based on the inequality (4.1) the following decrease in the objective function can be derived:

f(x+) ≤ f(x) − ∑_{i∈N} (1/(2Li)) ∥∑_{j∈N} (1/Lj)(∇fj(xj) − ∇fi(xi))∥² / (∑_{j∈N} 1/Lj)²
     = f(x) − (1/2) ∇f(x)^T GN ∇f(x),

where GN = LN ⊗ In and the matrix LN ∈ R^{N×N} is defined as

LN = diag(LN^{-1}) − (1/(∑_{i∈N} 1/Li)) LN^{-1}(LN^{-1})^T,

where we redefine LN ∈ R^N as the vector with components zero outside the index set N and components Li for i ∈ N. Note that LN is still a Laplacian matrix. Therefore, taking the expectation over the random M-tuple N ⊂ V, we obtain the following inequality:

(4.2)  E[f(x+) | η] ≤ f(x) − (1/2) ∇f(x)^T GM ∇f(x).

The corresponding matrix GM = ∑_N pN GN is still positive semidefinite and has an eigenvalue λ1(GM) = 0. Based on the decrease in expectation (4.2), a similar rate of convergence can be obtained for this general algorithm (RCD)M as in the previous sections, but depending on M. For example, we can consider probabilities of the form:

p^inv_N = (∑_{i∈N} 1/Li) / (∑_N ∑_{i∈N} 1/Li).

We can see that:

Σp = ∑_N ∑_{i∈N} 1/Li = ∑_{j=1}^{(N choose M)} ∑_{i∈Nj} 1/Li = ∑_{j=1}^{(N choose M)} ∑_{i=1}^N 1_{i∈Nj} (1/Li)
   = ∑_{i=1}^N (1/Li) ∑_{j=1}^{(N choose M)} 1_{i∈Nj} = ∑_{i=1}^N (1/Li) (N−1 choose M−1),

where the Nj enumerate all (N choose M) subsets of V of cardinality M and 1_{i∈Nj} denotes the indicator of i ∈ Nj.

Using a similar reasoning we can derive that:

LM = ∑_N pN LN = (1/Σp) ∑_N [ (∑_{i∈N} 1/Li) diag(LN^{-1}) − LN^{-1}(LN^{-1})^T ]
   = (1/Σp) (N−2 choose M−2) [ (∑_{i=1}^N 1/Li) diag(L^{-1}) − L^{-1}(L^{-1})^T ]
   = ((M − 1)/(N − 1)) [ diag(L^{-1}) − (1/(∑_{i=1}^N 1/Li)) L^{-1}(L^{-1})^T ].


In conclusion, for this choice of the probabilities we get the following convergence rate for Algorithm (RCD)M in the expected values of the objective function:

ϕk − f∗ ≤ ((N − 1)/(M − 1)) · 2 ∑_i Li Ri² / k.

5. Applications. Problem (2.1) arises in many real applications, e.g. resource allocation in economic systems [8] or distributed computer systems [9], in distributed control [14], in traffic equilibrium problems or network flow [1] and other areas. For example, we can interpret it as N agents exchanging n goods to minimize a total cost, where the constraint ∑_i xi = 0 is the equilibrium or market clearing constraint. In this context [xi]j ≥ 0 means that agent i receives [xi]j of good j from the exchange and [xi]j < 0 means that agent i contributes |[xi]j| of good j to the exchange.

Problem (2.1) can also be seen as the dual problem corresponding to the minimization of a sum of convex functions. Consider the following convex optimization problem, which arises in many engineering applications such as signal processing, distributed control and network flow [1, 4-6, 13-15]:

(5.1)  g∗ = min_{v∈∩_{i=1}^N Qi}  g1(v) + · · · + gN(v),

where the gi are convex functions and the Qi are convex sets. This problem can be reformulated as:

min_{ui∈Qi, ui=v ∀i∈V}  g1(u1) + · · · + gN(uN).

Let us define u = [u1^T · · · uN^T]^T and g(u) = g1(u1) + · · · + gN(uN). By duality, using the Lagrange multipliers xi for the constraints ui = v, we obtain the separable convex problem (2.1), where fi(xi) = g̃∗i(xi) and g̃∗i is the convex conjugate of the function g̃i = gi + 1_{Qi}. Note that if gi is strongly convex, then the convex conjugate fi is well defined and has Lipschitz continuous gradient, so that Assumption 1 holds. A particular application is the problem of finding the projection of a point v0 onto the intersection of the convex sets ∩_{i=1}^N Qi. This problem can be written as an optimization problem of the form:

min_{v∈∩_{i=1}^N Qi}  p1 ∥v − v0∥² + · · · + pN ∥v − v0∥²,

where pi > 0 are such that ∑_i pi = 1. This is a particular case of the separable problem (5.1). Note that since the functions gi(v) = pi ∥v − v0∥² are strongly convex, the fi have Lipschitz continuous gradient with Lipschitz constants Li = 1/pi for all i. We can also consider the problem of finding a point in the intersection of some convex sets:

(5.2)  min_{v∈∩_{i=1}^N Qi}  c1^T v + · · · + cN^T v,

which, again applying duality as above, leads to the separable convex problem (2.1). Since in this case the objective functions gi(v) = ci^T v are linear, it follows that the functions fi are no longer smooth (i.e. Assumption 1 will not hold in this case) and smoothing techniques need to be applied. In this scenario we can consider smoothing (5.2) as follows:

min_{v∈∩_{i=1}^N Qi}  (c1^T v + p1 ∥v∥²) + · · · + (cN^T v + pN ∥v∥²).


5.1. Recovering approximate primal solutions from the full gradient algorithm. From Section 3.3 we obtained the following convergence rate for the full projected gradient method:

f(x^k) − f∗ ≤ 2 ∑_i Li Ri² / k  ∀k > 0.

We also want to provide estimates on the convergence rate of the primal sequence {u^k}_{k≥0}. Let us apply 2k iterations of the full projected gradient method. From the previous discussion we have that the following inequality holds:

f(x^{l+1}) ≤ f(x^l) − (1/2)∥∇f(x^l)∥²_{GN}  ∀l = k, · · · , 2k − 1.

Adding these inequalities for l = k, · · · , 2k − 1 we obtain:

f(x^k) − f(x^{2k}) ≥ (1/2) ∑_{l=k}^{2k−1} ∥∇f(x^l)∥²_{GN} ≥ (k/2) ∥∇f(x^{k∗})∥²_{GN},

where

k∗ = arg min_{l=k,··· ,2k−1} ∥∇f(x^l)∥²_{GN}.

Taking into account that after k steps we have sublinear convergence of the form O(2 ∑_i Li Ri²/k), we get:

∥∇f(x^{k∗})∥²_{GN} ≤ 2(f(x^k) − f(x^{2k}))/k ≤ 2(f(x^k) − f∗)/k ≤ 4 ∑_i Li Ri² / k²,
f(x^{k∗}) − f∗ ≤ f(x^k) − f∗ − ∑_{l=k}^{k∗−1} (1/2)∥∇f(x^l)∥²_{GN} ≤ 2 ∑_i Li Ri² / k.

Further, we have that:

∥∇f(x^{k∗})∥²_{GN} = ∑_{i=1}^N (1/(2Li)) ∥∇fi(x^{k∗}_i) − [∑_{j=1}^N (1/Lj) ∇fj(x^{k∗}_j)] / [∑_{j=1}^N 1/Lj]∥²
                 = ∑_{i=1}^N (1/(2Li)) ∥u^{k∗}_i − [∑_{j=1}^N (1/Lj) u^{k∗}_j] / [∑_{j=1}^N 1/Lj]∥².

Now, if we define v^{k∗} = [∑_{j=1}^N (1/Lj) u^{k∗}_j] / [∑_{j=1}^N 1/Lj], then the full projected gradient method produces after 2k iterations the dual variables x^{k∗} with ∑_i x^{k∗}_i = 0 and primal variables u^{k∗}_i ∈ Qi for all i ∈ V such that their convex combination v^{k∗} satisfies the following dual suboptimality:

(5.3)  f(x^{k∗}) − f∗ ≤ 2 ∑_{i=1}^N Li Ri² / k,

and primal feasibility violation:

(5.4)  ∥u^{k∗}_j − v^{k∗}∥² ≤ 8 Lj (∑_{i=1}^N Li Ri²) / k²  ∀j ∈ V.


5.2. Numerical experiments.

5.2.1. First application. In this section we perform simulations for solving the optimization problem (2.1) where the functions fi are taken as in [20]:

(5.5)  fi(xi) = (1/2) ai (xi − ci)² + log(1 + exp(bi(xi − di))),

where the coefficients ai ≥ 0, bi, ci, di are generated randomly with uniform distributions on [−2, 2]. We also generate random graphs with N nodes, where each node has at most p neighbors. Note that σi = ai and Li = ai + (1/4) bi².
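For reference, the test function (5.5), its derivative and the constants used above can be coded as follows (a Python sketch in the scalar case n = 1; the factory-function style is our choice).

```python
import numpy as np

def make_f_i(a_i, b_i, c_i, d_i):
    """Sketch of the test function (5.5) and its derivative (scalar case n = 1)."""
    def f(x):
        return 0.5 * a_i * (x - c_i) ** 2 + np.log1p(np.exp(b_i * (x - d_i)))
    def df(x):
        # derivative of the log-exp term is b_i * sigmoid(b_i (x - d_i))
        return a_i * (x - c_i) + b_i / (1.0 + np.exp(-b_i * (x - d_i)))
    L_i = a_i + 0.25 * b_i ** 2        # Lipschitz constant of df, as noted in the text
    sigma_i = a_i                      # strong convexity parameter
    return f, df, L_i, sigma_i
```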

5.2.2. Second application. In this section we consider the following application, which is a particular case of the optimization problem (5.1):

(5.6)  min_{v∈∩_{i=1}^N Qi}  p1 ∥v − a1∥² + · · · + pN ∥v − aN∥²,

where the ai are given vectors in R^n, the Qi are simple convex sets and the weights pi > 0 are such that ∑_i pi = 1. We also associate a local information structure to our problem, imposed by a connected graph G = (V, E). The application consists of finding, in a distributed fashion, the closest point to the average ∑_i pi ai in the intersection of the given sets Qi. Let us note that for Qi = R^n the problem reduces to computing the average ∑_i pi ai in a distributed way, which has many applications in engineering [15, 21], while for ai = 0 for all i the problem can be interpreted as finding a point common to the given convex sets Qi [4-6, 15]. We solve the primal problem (5.6) via its dual formulation (2.1). Note that the gradient of each function fi can be efficiently computed as long as the projection onto the set Qi is easy, and furthermore the gradient is Lipschitz continuous with constant Li = 1/pi.
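Concretely, since fi is the conjugate of g̃i = pi ∥· − ai∥² + 1_{Qi}, its gradient is the maximizer of ⟨xi, u⟩ − pi ∥u − ai∥² over Qi, i.e. the projection of ai + xi/(2pi) onto Qi. The following sketch is our own derivation under these conventions, not a statement from the paper; project_Q_i stands for the (assumed cheap) Euclidean projector onto Qi.

```python
def grad_f_i(x_i, a_i, p_i, project_Q_i):
    """Sketch: gradient of f_i(x_i) = sup_{u in Q_i} <x_i, u> - p_i ||u - a_i||^2,
    the conjugate used in Section 5.  The maximizer of this strongly concave
    quadratic is the projection of the unconstrained maximizer onto Q_i."""
    return project_Q_i(a_i + x_i / (2.0 * p_i))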

REFERENCES

[1] D.P. Bertsekas, Nonlinear Programming, Athena Scientific, 2nd edition, 1999.
[2] A. Beck, The 2-Coordinate Descent Method for Solving Double-Sided Simplex Constrained Minimization Problems, Technical Report, 2012.
[3] A. Beck and L. Tetruashvili, On the Convergence of Block Coordinate Descent Type Methods, Technical Report, 2012.
[4] P.L. Combettes, The convex feasibility problem in image recovery, in Advances in Imaging and Electron Physics (P. Hawkes, Ed.), 95, 155-270, Academic Press, 1996.
[5] F. Deutsch and H. Hundal, The rate of convergence for the cyclic projections algorithm: Angles between convex sets, Journal of Approximation Theory, 142, 36-55, 2006.
[6] L.G. Gubin, B.T. Polyak and E.V. Raik, The method of projections for finding the common point of convex sets, Computational Mathematics and Mathematical Physics, 7 (6), 1211-1228, 1967.
[7] Y.C. Ho, L.D. Servi and R. Suri, A class of center-free resource allocation algorithms, Large Scale Systems, 1, 51-62, 1980.
[8] L. Hurwicz, The design of mechanisms for resource allocation, American Economic Review, 63, 1-30, 1973.
[9] J. Kurose and R. Simha, Microeconomic approach to optimal resource allocation in distributed computer systems, IEEE Transactions on Computers, 38, 705-717, 1989.
[10] Z.Q. Luo and P. Tseng, On the convergence rate of dual ascent methods for linearly constrained convex minimization, Mathematics of Operations Research, 18 (2), 846-867, 1993.
[11] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer, Boston, 2004.
[12] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, CORE Discussion Paper, 2010.
[13] P. Tsiaflakis, I. Necoara, J. Suykens and M. Moonen, Improved dual decomposition based optimization for DSL dynamic spectrum management, IEEE Transactions on Signal Processing, 58 (4), 2230-2245, 2010.
[14] I. Necoara, V. Nedelcu and I. Dumitrache, Parallel and distributed optimization methods for estimation and control in networks, Journal of Process Control, 21 (5), 756-766, 2011.
[15] A. Nedic, A. Ozdaglar and A.P. Parrilo, Constrained Consensus and Optimization in Multi-Agent Networks, IEEE Transactions on Automatic Control, 55 (4), 922-938, 2010.
[16] Z. Qin, K. Scheinberg and D. Goldfarb, Efficient block-coordinate descent algorithms for the group lasso, Technical Report, 2010.
[17] P. Richtarik and M. Takac, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, submitted to Mathematical Programming, 2011.
[18] R.T. Rockafellar, The elementary vectors of a subspace in R^N, in Combinatorial Mathematics and its Applications, Proceedings of the Chapel Hill Conference (R.C. Bose and T.A. Dowling, eds.), Chapel Hill, 104-127, 1969.
[19] P. Tseng and S. Yun, A Block-Coordinate Gradient Descent Method for Linearly Constrained Nonsmooth Separable Optimization, Journal of Optimization Theory and Applications, 140, 513-535, 2009.
[20] L. Xiao and S. Boyd, Optimal scaling of a gradient method for distributed resource allocation, Journal of Optimization Theory and Applications, 129 (3), 2006.
[21] L. Xiao and S. Boyd, Fast Linear Iterations for Distributed Averaging, Systems and Control Letters, 53, 65-78, 2004.
[22] S. Wright, Accelerated block coordinate relaxation for regularized optimization, Technical Report, 2010.