
A Hybrid Variance-Reduced Method for Decentralized Stochastic Non-Convex Optimization

Ran Xin 1 Usman A. Khan 2 Soummya Kar 1

Abstract

This paper considers decentralized stochastic optimization over a network of n nodes, where each node possesses a smooth non-convex local cost function and the goal of the networked nodes is to find an ε-accurate first-order stationary point of the sum of the local costs. We focus on an online setting, where each node accesses its local cost only by means of a stochastic first-order oracle that returns a noisy version of the exact gradient. In this context, we propose a novel single-loop decentralized hybrid variance-reduced stochastic gradient method, called GT-HSGD, that outperforms the existing approaches in terms of both the oracle complexity and practical implementation. The GT-HSGD algorithm implements specialized local hybrid stochastic gradient estimators that are fused over the network to track the global gradient. Remarkably, GT-HSGD achieves a network topology-independent oracle complexity of O(n⁻¹ε⁻³) when the required error tolerance ε is small enough, leading to a linear speedup with respect to the centralized optimal online variance-reduced approaches that operate on a single node. Numerical experiments are provided to illustrate our main technical results.

1. Introduction

We consider n nodes, such as machines or edge devices, communicating over a decentralized network described by a directed graph G = (V, E), where V = {1, ..., n} is the set of node indices and E ⊆ V × V is the collection of ordered pairs (i, j), i, j ∈ V, such that node j sends information to node i. Each node i possesses a private local cost function f_i : R^p → R and the goal of the networked nodes

¹Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. ²Department of Electrical and Computer Engineering, Tufts University, Medford, MA, USA. Correspondence to: Ran Xin <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

is to solve, via local computation and communication, the following optimization problem:

    min_{x ∈ R^p} F(x) = (1/n) ∑_{i=1}^{n} f_i(x).

This canonical formulation is known as decentralized optimization (Tsitsiklis et al., 1986; Nedić & Ozdaglar, 2009; Kar et al., 2012; Chen & Sayed, 2015) and has emerged as a promising framework for large-scale data science and machine learning problems (Lian et al., 2017; Assran et al., 2019). Decentralized optimization is essential in scenarios where data is geographically distributed and/or centralized data processing is infeasible due to communication and computation overhead or data privacy concerns. In this paper, we focus on an online and non-convex setting. In particular, we assume that each local cost f_i is non-convex and each node i only accesses f_i by querying a local stochastic first-order oracle (SFO) (Nemirovski et al., 2009) that returns a stochastic gradient, i.e., a noisy version of the exact gradient, at the queried point. As a concrete example of practical interest, the SFO mechanism applies to many online learning and expected risk minimization problems where the noise in the SFO lies in the uncertainty of sampling from the underlying streaming data received at each node (Kar et al., 2012; Chen & Sayed, 2015). We are interested in the oracle complexity, i.e., the total number of queries to the SFO required at each node, to find an ε-accurate first-order stationary point x* of the global cost F such that E[‖∇F(x*)‖] ≤ ε.

1.1. Related Work

We now briefly review the literature of decentralized non-convex optimization with SFO, which has been widely studied recently. Perhaps the most well-known approach is decentralized stochastic gradient descent (DSGD) and its variants (Chen & Sayed, 2015; Kar et al., 2012; Vlaski & Sayed, 2019; Lian et al., 2017; Taheri et al., 2020), which combine average consensus and a local stochastic gradient step. Although simple and effective, DSGD is known to have difficulties in handling heterogeneous data (Xin et al., 2020a). Recent works (Tang et al., 2018; Lu et al., 2019; Xin et al., 2020e; Yi et al., 2020) achieve robustness to heterogeneous environments by leveraging certain decentralized bias-correction techniques such as EXTRA(-type) methods (Shi et al., 2015; Yuan et al., 2020; Li & Lin, 2020), gradient tracking (Di Lorenzo & Scutari, 2016; Xu et al., 2015; Pu & Nedich, 2020; Xin et al., 2020f; Nedich et al., 2017; Qu & Li, 2017; Xi et al., 2017), and primal-dual principles (Jakovetic, 2018; Li et al., 2019; Alghunaim et al., 2020; Xu et al., 2020). Built on top of these bias-correction techniques, the very recent works (Sun et al., 2020) and (Pan et al., 2020) propose D-GET and D-SPIDER-SFO, respectively, which further incorporate online SARAH/SPIDER-type variance reduction schemes (Fang et al., 2018; Wang et al., 2019; Pham et al., 2020) to achieve lower oracle complexities when the SFO satisfies a mean-squared smoothness property. Finally, we note that the family of decentralized variance-reduced methods has been significantly enriched recently, see, for instance, (Mokhtari & Ribeiro, 2016; Yuan et al., 2018; Xin et al., 2020a; Li et al., 2020b;a; Rajawat & Kumar, 2020; Xin et al., 2020b;c; Lu et al., 2020); however, these approaches are explicitly designed for empirical minimization where each local cost f_i is decomposed as a finite sum of component functions, i.e., f_i = (1/m) ∑_{r=1}^{m} f_{i,r}; it is therefore unclear whether these algorithms can be adapted to the online SFO setting, which is the focus of this paper.

1.2. Our Contributions

In this paper, we propose GT-HSGD, a novel online variance-reduced method for decentralized non-convex optimization with stochastic first-order oracles (SFO). To achieve fast and robust performance, the GT-HSGD algorithm is built upon global gradient tracking (Di Lorenzo & Scutari, 2016; Xu et al., 2015) and a local hybrid stochastic gradient estimator (Liu et al., 2020; Tran-Dinh et al., 2020; Cutkosky & Orabona, 2019) that can be considered as a convex combination of the vanilla stochastic gradient returned by the SFO and a SARAH-type variance-reduced stochastic gradient (Nguyen et al., 2017). In the following, we emphasize the key advantages of GT-HSGD compared with the existing decentralized online (variance-reduced) approaches, from both theoretical and practical aspects.

Improved oracle complexity. A comparison of the oracle complexity of GT-HSGD with related algorithms is provided in Table 1, from which we have the following important observations. First of all, the oracle complexity of GT-HSGD is lower than that of DSGD, D2, GT-DSGD, and D-PD-SGD, which are decentralized online algorithms without variance reduction; however, GT-HSGD imposes on the SFO an additional mean-squared smoothness (MSS) assumption that is required by all online variance-reduced techniques in the literature (Arjevani et al., 2019; Fang et al., 2018; Wang et al., 2019; Pham et al., 2020; Liu et al., 2020; Tran-Dinh et al., 2020; Cutkosky & Orabona, 2019; Sun et al., 2020; Pan et al., 2020; Zhou et al., 2020). Secondly, GT-HSGD further achieves a lower oracle complexity than the existing decentralized online variance-reduced methods D-GET (Sun et al., 2020) and D-SPIDER-SFO (Pan et al., 2020), especially in a regime where the required error tolerance ε and the network spectral gap (1 − λ) are relatively small.¹ Moreover, when ε is small enough such that ε ≲ min{ λ⁻⁴(1−λ)³n⁻¹, λ⁻¹(1−λ)^{1.5}n⁻¹ }, it can be verified that the oracle complexity of GT-HSGD reduces to O(n⁻¹ε⁻³), independent of the network topology, and GT-HSGD achieves a linear speedup, in terms of the scaling with the network size n, compared with the centralized optimal online variance-reduced approaches that operate on a single node (Fang et al., 2018; Wang et al., 2019; Pham et al., 2020; Liu et al., 2020; Tran-Dinh et al., 2020; Zhou et al., 2020); see Section 3 for a detailed discussion. In sharp contrast, the speedup of D-GET (Sun et al., 2020) and D-SPIDER-SFO (Pan et al., 2020) is not clear compared with the aforementioned centralized optimal methods even if the network is fully connected, i.e., λ = 0.

More practical implementation. Both D-GET (Sun et al., 2020) and D-SPIDER-SFO (Pan et al., 2020) are double-loop algorithms that require very large minibatch sizes. In particular, during each inner loop they execute a fixed number of minibatch stochastic gradient-type iterations with O(ε⁻¹) oracle queries per update per node, while at every outer loop they obtain a stochastic gradient with a mega minibatch size by O(ε⁻²) oracle queries at each node. Clearly, querying the oracles exceedingly, i.e., obtaining a large number of samples, at each node and every iteration in online streaming-data scenarios substantially jeopardizes the actual wall-clock time, because the next iteration cannot be performed until all nodes complete the sampling process. Moreover, the double-loop implementation may incur periodic network synchronizations. These issues are especially significant when the working environments of the nodes are heterogeneous. Conversely, the proposed GT-HSGD is a single-loop algorithm with O(1) oracle queries per update that only requires a large minibatch with O(ε⁻¹) oracle queries once in the initialization phase, i.e., before the update recursion is executed; see Algorithm 1 and Corollary 1 for details.

1.3. Roadmap and Notations

The rest of the paper is organized as follows. In Section 2, we state the problem formulation and develop the proposed GT-HSGD algorithm. Section 3 presents the main convergence results of GT-HSGD and their implications. Section 4 outlines the convergence analysis of GT-HSGD, while the detailed proofs are provided in the Appendix. Section 5 provides numerical experiments to illustrate our theoretical claims. Section 6 concludes the paper.

¹A small network spectral gap (1 − λ) implies that the connectivity of the network is weak.


Table 1. A comparison of the oracle complexity of decentralized online stochastic gradient methods. The oracle complexity is in terms of the total number of queries to SFO required at each node to obtain an ε-accurate stationary point x* of the global cost F such that E[‖∇F(x*)‖] ≤ ε. In the table, n is the number of the nodes and (1 − λ) ∈ (0, 1] is the spectral gap of the weight matrix associated with the network. We note that the complexity of D2 and D-SPIDER-SFO also depends on the smallest eigenvalue λ_n of the weight matrix; however, since λ_n is less sensitive to the network topology, we omit the dependence on λ_n in the table for conciseness. The MSS column indicates whether the algorithm in question requires the mean-squared smoothness assumption on the SFO. Finally, we emphasize that DSGD requires bounded heterogeneity such that sup_x (1/n) ∑_{i=1}^{n} ‖∇f_i(x) − ∇F(x)‖² ≤ ζ², for some ζ ∈ R₊, while other algorithms in the table do not need this assumption.

Algorithm | Oracle Complexity | MSS | Remarks
DSGD (Lian et al., 2017) | O(max{ 1/(nε⁴), λ²n/((1−λ)²ε²) }) | ✗ | bounded heterogeneity
D2 (Tang et al., 2018) | O(max{ 1/(nε⁴), n/((1−λ)^b ε²) }) | ✗ | b ∈ R₊ is not explicitly shown in (Tang et al., 2018)
GT-DSGD (Xin et al., 2020e) | O(max{ 1/(nε⁴), λ²n/((1−λ)³ε²) }) | ✗ |
D-PD-SGD (Yi et al., 2020) | O(max{ 1/(nε⁴), n/((1−λ)^c ε²) }) | ✗ | c ∈ R₊ is not explicitly shown in (Yi et al., 2020)
D-GET (Sun et al., 2020) | O( 1/((1−λ)^d ε³) ) | ✓ | d ∈ R₊ is not explicitly shown in (Sun et al., 2020)
D-SPIDER-SFO (Pan et al., 2020) | O( 1/((1−λ)^h ε³) ) | ✓ | h ∈ R₊ is not explicitly shown in (Pan et al., 2020)
GT-HSGD (this work) | O(max{ 1/(nε³), λ⁴/((1−λ)³ε²), λ^{1.5}n^{0.5}/((1−λ)^{2.25}ε^{1.5}) }) | ✓ |

We adopt the following notation throughout the paper. We use lowercase bold letters to denote vectors and uppercase bold letters to denote matrices. The ceiling function is denoted by ⌈·⌉. The matrix I_d represents the d × d identity; 1_d and 0_d are the d-dimensional column vectors of all ones and zeros, respectively. We denote by [x]_i the i-th entry of a vector x. The Kronecker product of two matrices A and B is denoted by A ⊗ B. We use ‖·‖ to denote the Euclidean norm of a vector or the spectral norm of a matrix. We use σ(·) to denote the σ-algebra generated by the sets and/or random vectors in its argument.

2. Problem Setup and GT-HSGD

In this section, we introduce the mathematical model of the stochastic first-order oracle (SFO) at each node and the communication network. Based on these formulations, we develop the proposed GT-HSGD algorithm.

2.1. Optimization and Network Model

We work with a rich enough probability space (Ω, F, P). We consider decentralized recursive algorithms of interest that generate a sequence of estimates {x_t^i}_{t≥0} of the first-order stationary points of F at each node i, where x_0^i is assumed constant. At each iteration t, each node i observes a random vector ξ_t^i in R^q, which, for instance, may be considered as noise or as an online data sample. We then introduce the natural filtration (an increasing family of sub-σ-algebras of F) induced by these random vectors observed sequentially by the networked nodes:

    F_0 := {Ω, φ},   F_t := σ( ξ_0^i, ξ_1^i, ..., ξ_{t−1}^i : i ∈ V ),   ∀t ≥ 1,   (1)

where φ is the empty set. We are now ready to define the SFO mechanism in the following. At each iteration t, each node i, given an input random vector x ∈ R^p that is F_t-measurable, is able to query the local SFO to obtain a stochastic gradient of the form g_i(x, ξ_t^i), where g_i : R^p × R^q → R^p is a Borel measurable function. We assume that the SFO satisfies the following four properties.

Assumption 1 (Oracle). For any F_t-measurable random vectors x, y ∈ R^p, we have the following: ∀i ∈ V, ∀t ≥ 0,

• E[g_i(x, ξ_t^i) | F_t] = ∇f_i(x);

• E[‖g_i(x, ξ_t^i) − ∇f_i(x)‖²] ≤ ν_i², where ν² := (1/n) ∑_{i=1}^{n} ν_i²;

• the family {ξ_t^i : ∀t ≥ 0, i ∈ V} of random vectors is independent;

• E[‖g_i(x, ξ_t^i) − g_i(y, ξ_t^i)‖²] ≤ L² E[‖x − y‖²].

The first three properties above are standard and commonly used to establish the convergence of decentralized stochastic gradient methods. They however do not explicitly impose any structure on the stochastic gradient mapping g_i other than the measurability. On the other hand, the last property, the mean-squared smoothness, roughly speaking, requires that g_i is L-smooth on average with respect to the input arguments x and y. As a simple example, Assumption 1 holds if f_i(x) = (1/2) xᵀQ_i x and g_i(x, ξ_i) = Q_i x + ξ_i, where Q_i is a constant symmetric matrix and ξ_i has zero mean and finite second moment. We further note that the mean-squared smoothness of each g_i implies, by Jensen's inequality, that each f_i is L-smooth, i.e., ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖, and consequently the global function F is also L-smooth.
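To make the oracle model concrete, the following sketch instantiates the quadratic example above and empirically checks the unbiasedness, bounded-variance, and mean-squared smoothness properties of Assumption 1; the matrix `Q`, the noise level `nu`, and the sample count are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, nu = 5, 0.1                          # dimension and noise level (arbitrary)
Q = rng.standard_normal((p, p))
Q = (Q + Q.T) / 2                       # symmetric Q_i, so grad f_i(x) = Q x

def oracle(x):
    """SFO for f_i(x) = x^T Q x / 2: returns Q x + xi with E[xi] = 0."""
    return Q @ x + nu * rng.standard_normal(p)

x = rng.standard_normal(p)
y = rng.standard_normal(p)
L = np.linalg.norm(Q, 2)                # spectral norm serves as the MSS constant

# Unbiasedness and bounded variance (first two bullets of Assumption 1).
samples = np.stack([oracle(x) for _ in range(20000)])
bias = np.linalg.norm(samples.mean(axis=0) - Q @ x)      # close to 0
var = np.mean(np.sum((samples - Q @ x) ** 2, axis=1))    # close to p * nu^2
assert bias < 0.01 and abs(var - p * nu**2) < 0.01

# Mean-squared smoothness: with the same xi, g(x, xi) - g(y, xi) = Q (x - y).
assert np.sum((Q @ x - Q @ y) ** 2) <= L**2 * np.sum((x - y) ** 2) + 1e-12
```

For this additive-noise oracle, g_i(x, ξ) − g_i(y, ξ) = Q_i(x − y) is deterministic, so the MSS property holds with constant ‖Q_i‖.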

In addition, we make the following assumptions on F and the communication network G.

Assumption 2 (Global Function). F is bounded below, i.e., F* := inf_{x∈R^p} F(x) > −∞.

Assumption 3 (Communication Network). The directed network G admits a primitive and doubly-stochastic weight matrix W = {w_ij} ∈ R^{n×n}. Hence, W1_n = Wᵀ1_n = 1_n and λ := ‖W − (1/n)1_n1_nᵀ‖ ∈ [0, 1).

A weight matrix W that satisfies Assumption 3 may be designed for strongly-connected weight-balanced directed graphs (and thus for arbitrary connected undirected graphs). For example, the family of directed exponential graphs is weight-balanced and plays a key role in decentralized training (Assran et al., 2019). We note that λ is the second largest singular value of W and measures the algebraic connectivity of the graph, i.e., a smaller value of λ roughly means better connectivity. We note that several existing approaches require strictly stronger assumptions on W. For instance, D2 (Tang et al., 2018) and D-PD-SGD (Yi et al., 2020) require W to be symmetric and hence are restricted to undirected networks.
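As an illustration of Assumption 3, the sketch below constructs a primitive, doubly-stochastic weight matrix and evaluates λ = ‖W − (1/n)1_n1_nᵀ‖; the lazy-averaging ring topology is an arbitrary illustrative choice, not a topology used in the paper.

```python
import numpy as np

def ring_weights(n):
    """Doubly-stochastic lazy-averaging weights for an undirected ring graph."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
    return W

n = 8
W = ring_weights(n)
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)

# lambda = ||W - (1/n) 1 1^T||: the second largest singular value of W.
# For this ring, lambda = 0.5 + 0.5 * cos(2*pi/n), roughly 0.854 for n = 8.
lam = np.linalg.norm(W - np.ones((n, n)) / n, 2)
print(f"lambda = {lam:.4f}, spectral gap 1 - lambda = {1 - lam:.4f}")
```

A denser topology (e.g., an exponential graph) yields a smaller λ, which, per Table 1, weakens the network-dependent terms in the oracle complexity.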

2.2. Algorithm Development and Description

We now describe the proposed GT-HSGD algorithm and provide an intuitive construction. Recall that x_t^i is the estimate of a stationary point of the global cost F at node i and iteration t. Let g_i(x_t^i, ξ_t^i) and g_i(x_{t−1}^i, ξ_t^i) be the corresponding stochastic gradients returned by the local SFO queried at x_t^i and x_{t−1}^i, respectively. Motivated by the strong performance of recently introduced decentralized methods that combine gradient tracking and various variance reduction schemes for finite-sum problems (Xin et al., 2020b;c; Li et al., 2020a; Sun et al., 2020), we seek similar variance reduction for decentralized online problems with SFO. In particular, we focus on the following local hybrid variance-reduced stochastic gradient estimator v_t^i introduced in (Liu et al., 2020; Tran-Dinh et al., 2020; Cutkosky & Orabona, 2019) for centralized online problems: ∀t ≥ 1,

    v_t^i = g_i(x_t^i, ξ_t^i) + (1 − β)( v_{t−1}^i − g_i(x_{t−1}^i, ξ_t^i) ),   (2)

for some applicable weight parameter β ∈ [0, 1]. This local gradient estimator v_t^i is fused, via a gradient tracking mechanism (Di Lorenzo & Scutari, 2016; Xu et al., 2015), over the network to update the global gradient tracker y_t^i, which is subsequently used as the descent direction in the x_t^i-update. The complete description of GT-HSGD is provided in Algorithm 1. We note that the update (2) of v_t^i may be equivalently written as

    v_t^i = β · g_i(x_t^i, ξ_t^i) + (1 − β) · ( g_i(x_t^i, ξ_t^i) − g_i(x_{t−1}^i, ξ_t^i) + v_{t−1}^i ),

where the first term is the vanilla stochastic gradient and the second term is a SARAH-type recursion, i.e., v_t^i is a convex combination of the local vanilla stochastic gradient returned by the SFO and a SARAH-type (Nguyen et al., 2017; Fang et al., 2018; Wang et al., 2019) gradient estimator. This discussion leads to the fact that GT-HSGD reduces to GT-DSGD (Pu & Nedich, 2020; Xin et al., 2020e; Lu et al., 2019) when β = 1, and becomes the inner loop of GT-SARAH (Xin et al., 2020b) when β = 0. However, our convergence analysis shows that GT-HSGD achieves its best oracle complexity and outperforms the existing decentralized online variance-reduced approaches (Sun et al., 2020; Pan et al., 2020) with a weight parameter β ∈ (0, 1). It is then clear that neither GT-DSGD nor the inner loop of GT-SARAH, on their own, are able to outperform the proposed approach, making GT-HSGD a non-trivial algorithmic design for this problem class.
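The hybrid update (2) and its two limiting cases can be sketched in a few lines; the helper name `hybrid_update` and the numeric values below are illustrative.

```python
import numpy as np

def hybrid_update(g_curr, g_prev, v_prev, beta):
    """Hybrid estimator (2): v_t = g(x_t, xi_t) + (1-beta)*(v_{t-1} - g(x_{t-1}, xi_t)).

    g_curr and g_prev must be evaluated with the same sample xi_t. The update
    equals beta * g_curr + (1 - beta) * (g_curr - g_prev + v_prev): a convex
    combination of a vanilla stochastic gradient and a SARAH-type recursion.
    """
    return g_curr + (1 - beta) * (v_prev - g_prev)

g_curr = np.array([1.0, 2.0])
g_prev = np.array([0.5, 1.5])
v_prev = np.array([0.8, 1.7])

# beta = 1 recovers the vanilla stochastic gradient (the GT-DSGD direction) ...
assert np.allclose(hybrid_update(g_curr, g_prev, v_prev, 1.0), g_curr)
# ... and beta = 0 recovers the SARAH recursion (the GT-SARAH inner loop).
assert np.allclose(hybrid_update(g_curr, g_prev, v_prev, 0.0),
                   g_curr - g_prev + v_prev)
```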

Algorithm 1 GT-HSGD at each node i

Require: x_0^i = x_0; α; β; b_0; y_0^i = 0_p; v_{−1}^i = 0_p; T.
1: Sample {ξ_{0,r}^i}_{r=1}^{b_0} and set v_0^i = (1/b_0) ∑_{r=1}^{b_0} g_i(x_0^i, ξ_{0,r}^i);
2: y_1^i = ∑_{j=1}^{n} w_ij ( y_0^j + v_0^j − v_{−1}^j );
3: x_1^i = ∑_{j=1}^{n} w_ij ( x_0^j − α y_1^j );
4: for t = 1, 2, ..., T − 1 do
5:    Sample ξ_t^i;
6:    v_t^i = g_i(x_t^i, ξ_t^i) + (1 − β)( v_{t−1}^i − g_i(x_{t−1}^i, ξ_t^i) );
7:    y_{t+1}^i = ∑_{j=1}^{n} w_ij ( y_t^j + v_t^j − v_{t−1}^j );
8:    x_{t+1}^i = ∑_{j=1}^{n} w_ij ( x_t^j − α y_{t+1}^j );
9: end for
Output: x̃_T selected uniformly at random from {x_t^i : i ∈ V, 0 ≤ t ≤ T}.
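A minimal end-to-end simulation of Algorithm 1 is sketched below on synthetic quadratic costs over a ring graph; all problem data and parameter values (the costs A_i, b_i, the weights, α, β, b_0, the noise level) are illustrative choices for this toy setup, not the paper's experiments.

```python
import numpy as np

# Toy simulation of Algorithm 1 (GT-HSGD) on n quadratic local costs
# f_i(x) = ||A_i x - b_i||^2 / 2 with additive-noise oracles over a ring.
rng = np.random.default_rng(1)
n, p, T = 8, 4, 4000
alpha, beta, b0, noise = 0.01, 0.05, 100, 0.05

A = 0.3 * rng.standard_normal((n, p, p))
b = rng.standard_normal((n, p))

def grad(i, x):                                   # exact local gradient
    return A[i].T @ (A[i] @ x - b[i])

# Doubly-stochastic weights for an undirected ring (satisfies Assumption 3).
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

x = np.zeros((n, p))                              # x_0^i = x_0 = 0
y = np.zeros((n, p))                              # y_0^i = 0_p
v_old = np.zeros((n, p))                          # v_{-1}^i = 0_p

# Line 1: minibatch-initialized estimator v_0^i (b0 noisy gradients at x_0).
v = np.stack([np.mean([grad(i, x[i]) + noise * rng.standard_normal(p)
                       for _ in range(b0)], axis=0) for i in range(n)])
x_prev = x.copy()
y = W @ (y + v - v_old)                           # line 2
x = W @ (x - alpha * y)                           # line 3

for t in range(1, T):                             # lines 4-9
    v_new = np.empty_like(v)
    for i in range(n):
        xi = noise * rng.standard_normal(p)       # one sample xi_t^i ...
        g_curr = grad(i, x[i]) + xi               # ... used for g_i(x_t, xi_t)
        g_prev = grad(i, x_prev[i]) + xi          # ... and g_i(x_{t-1}, xi_t)
        v_new[i] = g_curr + (1 - beta) * (v[i] - g_prev)   # line 6
    y = W @ (y + v_new - v)                       # line 7
    x_prev, v = x, v_new
    x = W @ (x_prev - alpha * y)                  # line 8

assert np.allclose(y.mean(axis=0), v.mean(axis=0))  # exact tracking identity
x_bar = x.mean(axis=0)
g_bar = np.mean([grad(i, x_bar) for i in range(n)], axis=0)
print("stationarity gap:", np.linalg.norm(g_bar))
```

On this toy instance the printed stationarity gap is small and shrinks further as T grows and α, β decrease, in line with Theorem 1; the final assert checks the exact averaging identity ȳ_t = v̄_{t−1} stated later in Lemma 1(b).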

Remark 1. Clearly, each v_t^i is a conditionally biased estimator of ∇f_i(x_t^i), i.e., E[v_t^i | F_t] ≠ ∇f_i(x_t^i) in general. However, it can be shown that E[v_t^i] = E[∇f_i(x_t^i)], meaning that v_t^i serves as a surrogate for the underlying exact gradient in the sense of total expectation.

3. Main Results

In this section, we present the main convergence results of GT-HSGD and discuss their salient features.


The formal convergence analysis is deferred to Section 4.

Theorem 1. If the weight parameter β = 48L²α²/n and the step-size α is chosen as

    0 < α < min{ (1−λ²)²/(90λ²), √n(1−λ)/(26λ), 1/(4√3) } · (1/L),

then the output x̃_T of GT-HSGD satisfies: ∀T ≥ 2,

    E[‖∇F(x̃_T)‖²] ≤ 4(F(x_0) − F*)/(αT) + 8βν²/n + 4ν²/(βb_0nT)
        + 64λ⁴‖∇f(x_0)‖²/((1−λ²)³nT) + 96λ²ν²/((1−λ²)³b_0T) + 256λ²β²ν²/(1−λ²)³,

where ‖∇f(x_0)‖² = ∑_{i=1}^{n} ‖∇f_i(x_0)‖².

Remark 2. Theorem 1 holds for GT-HSGD with an arbitrary initial minibatch size b_0 ≥ 1.

Theorem 1 establishes a non-asymptotic bound, with no hidden constants, on the mean-squared stationary gap of GT-HSGD over any finite time horizon T.

Transient and steady-state performance over infinite time horizon. If α and β are chosen according to Theorem 1, the mean-squared stationary gap E[‖∇F(x̃_T)‖²] of GT-HSGD decays sublinearly at a rate of O(1/T) up to a steady-state error (SSE) such that

    lim sup_{T→∞} E[‖∇F(x̃_T)‖²] ≤ 8βν²/n + 256λ²β²ν²/(1−λ²)³.   (3)

In view of (3), the SSE of GT-HSGD is bounded by the sum of two terms: (i) the first term is in the order of O(β) and the division by n demonstrates the benefit of increasing the network size²; (ii) the second term is in the order of O(β²) and reveals the impact of the spectral gap (1 − λ) of the network topology. Clearly, the SSE can be made arbitrarily small by choosing small enough β and α. Moreover, since the spectral gap (1 − λ) only appears in a higher-order term of β in (3), its impact diminishes as β becomes smaller, i.e., as we require a smaller SSE.
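The scaling of the two SSE terms in (3) can be checked with a few lines of arithmetic; the values of β, ν², n, and λ below are arbitrary illustrative choices, not from the paper.

```python
# Evaluate the two steady-state error terms in (3); beta, nu2, n, and lam
# are arbitrary illustrative values, not taken from the paper.
def sse_terms(beta, nu2=1.0, n=16, lam=0.9):
    t1 = 8 * beta * nu2 / n                                # O(beta) term
    t2 = 256 * lam**2 * beta**2 * nu2 / (1 - lam**2) ** 3  # O(beta^2) term
    return t1, t2

t1a, t2a = sse_terms(0.1)
t1b, t2b = sse_terms(0.05)
assert abs(t1a / t1b - 2.0) < 1e-12   # first term halves when beta halves
assert abs(t2a / t2b - 4.0) < 1e-12   # second term shrinks four-fold
```

This makes the discussion above concrete: halving β shrinks the network-dependent term four times faster than the first term, so the topology's influence fades as smaller SSE is required.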

The following corollary is concerned with the finite-time convergence rate of GT-HSGD with specific choices of the algorithmic parameters α, β, and b_0.

Corollary 1. Setting α = n^{2/3}/(8LT^{1/3}), β = 3n^{1/3}/(4T^{2/3}), and b_0 = ⌈T^{1/3}/n^{2/3}⌉ in Theorem 1, we have:

    E[‖∇F(x̃_T)‖²] ≤ (32L(F(x_0) − F*) + 12ν²)/(nT)^{2/3}
        + 64λ⁴‖∇f(x_0)‖²/((1−λ²)³nT) + 240λ²n^{2/3}ν²/((1−λ²)³T^{4/3}),

for all T > max{ 1424λ⁶n²/(1−λ²)⁶, 35λ³n^{0.5}/(1−λ)^{1.5} }. As a consequence, GT-HSGD achieves an ε-accurate stationary point x* of the global cost F such that E[‖∇F(x*)‖] ≤ ε with

    H = O(max{ H_opt, H_net })

iterations³, where H_opt and H_net are given respectively by

    H_opt = (L(F(x_0) − F*) + ν²)^{1.5}/(nε³),
    H_net = max{ λ⁴‖∇f(x_0)‖²/((1−λ²)³nε²), λ^{1.5}n^{0.5}ν^{1.5}/((1−λ²)^{2.25}ε^{1.5}) }.

The resulting total number of oracle queries at each node is thus H + ⌈H^{1/3}n^{−2/3}⌉.

²Since GT-HSGD computes O(n) stochastic gradients in parallel per iteration across the nodes, the network size n can be interpreted as the minibatch size of GT-HSGD.

Remark 3. Since H^{1/3}n^{−2/3} is much smaller than H, we treat the oracle complexity of GT-HSGD as H for the ease of exposition in Table 1 and the following discussion.

An important implication of Corollary 1 is given in the following.

A regime for network topology-independent oracle complexity and linear speedup. According to Corollary 1, the oracle complexity of GT-HSGD at each node is bounded by the maximum of two terms: (i) the first term H_opt is independent of the network topology and, more importantly, is n times smaller than the oracle complexity of the optimal centralized online variance-reduced methods that execute on a single node for this problem class (Fang et al., 2018; Wang et al., 2019; Pham et al., 2020; Tran-Dinh et al., 2020; Liu et al., 2020); (ii) the second term H_net depends on the network spectral gap (1 − λ) and is in the lower order of 1/ε. These two observations lead to the interesting fact that the oracle complexity of GT-HSGD becomes independent of the network topology, i.e., H_opt dominates H_net, if the required error tolerance ε is small enough such that⁴ ε ≲ min{ λ⁻⁴(1−λ)³n⁻¹, λ⁻¹(1−λ)^{1.5}n⁻¹ }. In this regime, GT-HSGD thus achieves a network topology-independent oracle complexity H_opt = O(n⁻¹ε⁻³), exhibiting a linear speedup compared with the aforementioned centralized optimal algorithms (Fang et al., 2018; Wang et al., 2019; Pham et al., 2020; Tran-Dinh et al., 2020; Liu et al., 2020; Zhou et al., 2020), in the sense that the total number of oracle queries required to achieve an ε-accurate stationary point at each node is reduced by a factor of 1/n.

Remark 4. The small error tolerance regime in the above discussion corresponds to a large number of oracle queries, which translates to the scenario where the required total number of iterations T is large. Note that a large T further implies that the step-size α and the weight parameter β are small; see the expressions of α and β in Corollary 1.

³The O(·) notation here does not absorb any problem parameters, i.e., it only hides universal constants.

⁴This boundary condition follows from basic algebraic manipulations.


4. Outline of the Convergence Analysis

In this section, we outline the proof of Theorem 1, while the detailed proofs are provided in the Appendix. We let Assumptions 1-3 hold throughout the rest of the paper without explicitly stating them. For the ease of exposition, we write the x_t- and y_t-updates of GT-HSGD in the following equivalent matrix form: ∀t ≥ 0,

    y_{t+1} = W( y_t + v_t − v_{t−1} ),   (4a)
    x_{t+1} = W( x_t − α y_{t+1} ),   (4b)

where W := W ⊗ I_p and x_t, y_t, v_t are square-integrable random vectors in R^{np} that respectively concatenate the local estimates {x_t^i}_{i=1}^n of a stationary point of F, the gradient trackers {y_t^i}_{i=1}^n, and the stochastic gradient estimators {v_t^i}_{i=1}^n. It is straightforward to verify that x_t and y_t are F_t-measurable while v_t is F_{t+1}-measurable for all t ≥ 0. For convenience, we also denote

    ∇f(x_t) := [∇f_1(x_t^1)ᵀ, ..., ∇f_n(x_t^n)ᵀ]ᵀ

and introduce the following quantities:

    J := ((1/n) 1_n 1_nᵀ) ⊗ I_p,
    x̄_t := (1/n)(1_nᵀ ⊗ I_p) x_t,    ȳ_t := (1/n)(1_nᵀ ⊗ I_p) y_t,
    v̄_t := (1/n)(1_nᵀ ⊗ I_p) v_t,    ∇f̄(x_t) := (1/n)(1_nᵀ ⊗ I_p) ∇f(x_t).

In the following lemma, we enlist several well-known results in the context of gradient tracking-based algorithms for decentralized stochastic optimization, whose proofs may be found in (Di Lorenzo & Scutari, 2016; Qu & Li, 2017; Xin et al., 2020d; Pu & Nedich, 2020).

Lemma 1. The following relationships hold.

(a) ‖Wx − Jx‖ ≤ λ‖x − Jx‖, ∀x ∈ R^{np}.

(b) ȳ_{t+1} = v̄_t, ∀t ≥ 0.

(c) ‖∇f̄(x_t) − ∇F(x̄_t)‖² ≤ (L²/n)‖x_t − Jx_t‖², ∀t ≥ 0.

We note that Lemma 1(a) holds since W is primitive and doubly-stochastic, Lemma 1(b) is a direct consequence of the gradient tracking update (4a), and Lemma 1(c) is due to the L-smoothness of each f_i. By the estimate update of GT-HSGD described in (4b) and Lemma 1(b), it is straightforward to obtain:

    x̄_{t+1} = x̄_t − αȳ_{t+1} = x̄_t − αv̄_t,   ∀t ≥ 0.   (5)

Hence, the mean state x̄_t proceeds in the direction of the average of the local stochastic gradient estimators v̄_t. With the help of (5) and the L-smoothness of F and each f_i, we establish the following descent inequality which sheds light on the overall convergence analysis.
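The exact identities in Lemma 1(b) and (5) rely only on the doubly-stochastic structure of W and can be verified numerically; in the minimal sketch below, the ring weights and the random stand-ins for the iterates are illustrative choices.

```python
import numpy as np

# Check Lemma 1(b) and (5) for the updates (4a)-(4b): with doubly-stochastic
# W, averaging commutes with mixing, so y_bar_{t+1} = v_bar_t and
# x_bar_{t+1} = x_bar_t - alpha * v_bar_t. Any doubly-stochastic W works.
rng = np.random.default_rng(2)
n, p, alpha = 6, 3, 0.1

W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

x = rng.standard_normal((n, p))
y = rng.standard_normal((n, p))
v_prev = rng.standard_normal((n, p))
v = rng.standard_normal((n, p))
y = y - y.mean(axis=0) + v_prev.mean(axis=0)   # enforce y_bar_t = v_bar_{t-1}

y_next = W @ (y + v - v_prev)                  # (4a)
x_next = W @ (x - alpha * y_next)              # (4b)

assert np.allclose(y_next.mean(axis=0), v.mean(axis=0))          # Lemma 1(b)
assert np.allclose(x_next.mean(axis=0),
                   x.mean(axis=0) - alpha * v.mean(axis=0))      # identity (5)
```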

Lemma 2. If 0 < α ≤ 1/(2L), then we have: ∀T ≥ 0,

    ∑_{t=0}^{T} ‖∇F(x̄_t)‖² ≤ 2(F(x̄_0) − F*)/α − (1/2) ∑_{t=0}^{T} ‖v̄_t‖²
        + 2 ∑_{t=0}^{T} ‖v̄_t − ∇f̄(x_t)‖² + (2L²/n) ∑_{t=0}^{T} ‖x_t − Jx_t‖².

In light of Lemma 2, our approach to establishing the convergence of GT-HSGD is to seek the conditions on the algorithmic parameters of GT-HSGD, i.e., the step-size α and the weight parameter β, such that

    −(1/(2T)) ∑_{t=0}^{T} E[‖v̄_t‖²] + (2/T) ∑_{t=0}^{T} E[‖v̄_t − ∇f̄(x_t)‖²]
        + (2L²/(nT)) ∑_{t=0}^{T} E[‖x_t − Jx_t‖²] = O(α, β, 1/b_0, 1/T),   (6)

where O(α, β, 1/b_0, 1/T) represents a nonnegative quantity which may be made arbitrarily small by choosing small enough α and β along with large enough T and b_0. If (6) holds, then Lemma 2 reduces to

    (1/(T+1)) ∑_{t=0}^{T} E[‖∇F(x̄_t)‖²] ≤ 2(F(x̄_0) − F*)/(αT) + O(α, β, 1/b_0, 1/T),

which leads to the convergence arguments of GT-HSGD. For these purposes, we quantify ∑_{t=0}^{T} E[‖v_t − ∇f(x_t)‖²] and ∑_{t=0}^{T} E[‖x_t − Jx_t‖²] next.

4.1. Contraction Relationships

First of all, we establish upper bounds on the gradient variances $\mathbb{E}\big[\|\overline{\mathbf{v}}_t - \overline{\nabla\mathbf{f}}(\mathbf{x}_t)\|^2\big]$ and $\mathbb{E}\big[\|\mathbf{v}_t - \nabla\mathbf{f}(\mathbf{x}_t)\|^2\big]$ by exploiting the hybrid and recursive update of $\mathbf{v}_t$.

Lemma 3. The following inequalities hold: $\forall t \geq 1$,

$$\mathbb{E}\big[\|\overline{\mathbf{v}}_t - \overline{\nabla\mathbf{f}}(\mathbf{x}_t)\|^2\big] \leq (1-\beta)^2\,\mathbb{E}\big[\|\overline{\mathbf{v}}_{t-1} - \overline{\nabla\mathbf{f}}(\mathbf{x}_{t-1})\|^2\big] + \frac{6L^2\alpha^2}{n}(1-\beta)^2\,\mathbb{E}\big[\|\overline{\mathbf{v}}_{t-1}\|^2\big] + \frac{2\beta^2\nu^2}{n} + \frac{6L^2}{n^2}(1-\beta)^2\,\mathbb{E}\big[\|\mathbf{x}_t - J\mathbf{x}_t\|^2 + \|\mathbf{x}_{t-1} - J\mathbf{x}_{t-1}\|^2\big], \tag{7}$$

and, $\forall t \geq 1$,

$$\mathbb{E}\big[\|\mathbf{v}_t - \nabla\mathbf{f}(\mathbf{x}_t)\|^2\big] \leq (1-\beta)^2\,\mathbb{E}\big[\|\mathbf{v}_{t-1} - \nabla\mathbf{f}(\mathbf{x}_{t-1})\|^2\big] + 6nL^2\alpha^2(1-\beta)^2\,\mathbb{E}\big[\|\overline{\mathbf{v}}_{t-1}\|^2\big] + 2n\beta^2\nu^2 + 6L^2(1-\beta)^2\,\mathbb{E}\big[\|\mathbf{x}_t - J\mathbf{x}_t\|^2 + \|\mathbf{x}_{t-1} - J\mathbf{x}_{t-1}\|^2\big]. \tag{8}$$


Remark 5. Since $\mathbf{v}_t$ is a conditionally biased estimator of $\nabla\mathbf{f}(\mathbf{x}_t)$, (7) and (8) do not directly imply each other and need to be established separately.

We emphasize that the contraction structure of the gradient variances shown in Lemma 3 plays a crucial role in the convergence analysis. The following contraction bounds on the consensus errors $\mathbb{E}\big[\|\mathbf{x}_t - J\mathbf{x}_t\|^2\big]$ are standard in decentralized algorithms based on gradient tracking, e.g., (Pu & Nedich, 2020; Xin et al., 2020b); in particular, they follow directly from the $\mathbf{x}_t$-update (4b) and Young's inequality.

Lemma 4. The following inequalities hold: $\forall t \geq 0$,

$$\|\mathbf{x}_{t+1} - J\mathbf{x}_{t+1}\|^2 \leq \frac{1+\lambda^2}{2}\,\|\mathbf{x}_t - J\mathbf{x}_t\|^2 + \frac{2\alpha^2\lambda^2}{1-\lambda^2}\,\|\mathbf{y}_{t+1} - J\mathbf{y}_{t+1}\|^2, \tag{9}$$

$$\|\mathbf{x}_{t+1} - J\mathbf{x}_{t+1}\|^2 \leq 2\lambda^2\,\|\mathbf{x}_t - J\mathbf{x}_t\|^2 + 2\alpha^2\lambda^2\,\|\mathbf{y}_{t+1} - J\mathbf{y}_{t+1}\|^2. \tag{10}$$
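Lemma 4 ultimately rests on the contraction of Lemma 1(a): one multiplication by $W$ shrinks the consensus error by the factor $\lambda$. A minimal numerical check, using an equal-weight ring chosen purely for illustration (not the experimental setup of Section 5):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4

# Equal-weight (1/3) undirected ring: symmetric and doubly stochastic.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

lam = np.sort(np.linalg.svd(W, compute_uv=False))[-2]  # second largest singular value
J = np.full((n, n), 1.0 / n)                           # averaging matrix

for _ in range(100):
    x = rng.standard_normal((n, p))
    # Lemma 1(a): one communication round contracts the consensus error by lambda.
    assert np.linalg.norm(W @ x - J @ x) <= lam * np.linalg.norm(x - J @ x) + 1e-12
```

Here the $(n \times p)$-matrix form is equivalent to the stacked vector form $\mathbf{x} \in \mathbb{R}^{np}$ used in the lemma.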

It is then clear from Lemma 4 that we need to further quantify the gradient tracking errors $\mathbb{E}\big[\|\mathbf{y}_t - J\mathbf{y}_t\|^2\big]$ in order to bound the consensus errors. These error bounds are shown in the following lemma.

Lemma 5. We have the following.

(a) $\mathbb{E}\big[\|\mathbf{y}_1 - J\mathbf{y}_1\|^2\big] \leq \lambda^2\|\nabla\mathbf{f}(\mathbf{x}_0)\|^2 + \lambda^2 n\nu^2/b_0$.

(b) If $0 < \alpha \leq \frac{1-\lambda^2}{2\sqrt{42}\,\lambda^2 L}$, then $\forall t \geq 1$,

$$\mathbb{E}\big[\|\mathbf{y}_{t+1} - J\mathbf{y}_{t+1}\|^2\big] \leq \frac{3+\lambda^2}{4}\,\mathbb{E}\big[\|\mathbf{y}_t - J\mathbf{y}_t\|^2\big] + \frac{21\lambda^2 nL^2\alpha^2}{1-\lambda^2}\,\mathbb{E}\big[\|\overline{\mathbf{v}}_{t-1}\|^2\big] + \frac{63\lambda^2 L^2}{1-\lambda^2}\,\mathbb{E}\big[\|\mathbf{x}_{t-1} - J\mathbf{x}_{t-1}\|^2\big] + \frac{7\lambda^2\beta^2}{1-\lambda^2}\,\mathbb{E}\big[\|\mathbf{v}_{t-1} - \nabla\mathbf{f}(\mathbf{x}_{t-1})\|^2\big] + 3\lambda^2 n\beta^2\nu^2.$$

We note that establishing the contraction argument of the gradient tracking errors in Lemma 5 requires a careful examination of the structure of the $\mathbf{v}_t$-update.

4.2. Error Accumulations

To proceed, we observe, from Lemmas 3, 4, and 5, that the recursions of the gradient variances, consensus errors, and gradient tracking errors admit similar forms. Therefore, we abstract out formulas for the accumulation of error recursions of this type in the following lemma.

Lemma 6. Let $\{V_t\}_{t\geq 0}$, $\{R_t\}_{t\geq 0}$, and $\{Q_t\}_{t\geq 0}$ be non-negative sequences and $C \geq 0$ be some constant such that $V_t \leq qV_{t-1} + qR_{t-1} + Q_t + C$, $\forall t \geq 1$, where $q \in (0,1)$. Then the following inequality holds: $\forall T \geq 1$,

$$\sum_{t=0}^{T} V_t \leq \frac{V_0}{1-q} + \frac{1}{1-q}\sum_{t=0}^{T-1} R_t + \frac{1}{1-q}\sum_{t=1}^{T} Q_t + \frac{CT}{1-q}. \tag{11}$$

Similarly, if $V_{t+1} \leq qV_t + R_{t-1} + C$, $\forall t \geq 1$, then we have: $\forall T \geq 2$,

$$\sum_{t=1}^{T} V_t \leq \frac{V_1}{1-q} + \frac{1}{1-q}\sum_{t=0}^{T-2} R_t + \frac{CT}{1-q}. \tag{12}$$
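Lemma 6 is a purely deterministic statement about scalar sequences, so it can be sanity-checked directly. The sketch below, with arbitrary synthetic sequences and illustrative values of $q$ and $C$, builds a sequence satisfying the recursion with equality and verifies the accumulated bound (11):

```python
import numpy as np

rng = np.random.default_rng(2)
q, C, T = 0.7, 0.3, 200   # contraction factor q in (0,1), constant C >= 0

# Non-negative driving sequences R_t and Q_t.
R = rng.uniform(0.0, 1.0, T + 1)
Q = rng.uniform(0.0, 1.0, T + 1)

# Build V_t satisfying V_t <= q V_{t-1} + q R_{t-1} + Q_t + C with equality.
V = np.empty(T + 1)
V[0] = 1.0
for t in range(1, T + 1):
    V[t] = q * V[t - 1] + q * R[t - 1] + Q[t] + C

# Accumulated bound (11): sum V_t <= (V_0 + sum R + sum Q + C*T) / (1 - q).
lhs = V.sum()
rhs = (V[0] + R[:T].sum() + Q[1:].sum() + C * T) / (1.0 - q)
assert lhs <= rhs
```

The bound follows by summing the recursion over $t$, dropping the non-negative leftover terms, and dividing by $1 - q$; the numerical slack above reflects the replacement of $q/(1-q)$ by the larger $1/(1-q)$ on the $R_t$ term.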

Applying Lemma 6 to Lemma 3 leads to the following upper bounds on the accumulated variances.

Lemma 7. For any $\beta \in (0,1)$, the following inequalities hold: $\forall T \geq 1$,

$$\sum_{t=0}^{T}\mathbb{E}\big[\|\overline{\mathbf{v}}_t - \overline{\nabla\mathbf{f}}(\mathbf{x}_t)\|^2\big] \leq \frac{\nu^2}{\beta b_0 n} + \frac{6L^2\alpha^2}{\beta}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\overline{\mathbf{v}}_t\|^2\big] + \frac{12L^2}{n^2\beta}\sum_{t=0}^{T}\mathbb{E}\big[\|\mathbf{x}_t - J\mathbf{x}_t\|^2\big] + \frac{2\beta\nu^2 T}{n}, \tag{13}$$

and, $\forall T \geq 1$,

$$\sum_{t=0}^{T}\mathbb{E}\big[\|\mathbf{v}_t - \nabla\mathbf{f}(\mathbf{x}_t)\|^2\big] \leq \frac{n\nu^2}{\beta b_0} + \frac{6nL^2\alpha^2}{\beta}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\overline{\mathbf{v}}_t\|^2\big] + \frac{12L^2}{\beta}\sum_{t=0}^{T}\mathbb{E}\big[\|\mathbf{x}_t - J\mathbf{x}_t\|^2\big] + 2n\beta\nu^2 T. \tag{14}$$

It can be observed that (13) in Lemma 7 may be used to refine the left-hand side of (6). The remaining step, naturally, is to bound $\sum_{t}\mathbb{E}\big[\|\mathbf{x}_t - J\mathbf{x}_t\|^2\big]$ in terms of $\sum_{t}\mathbb{E}\big[\|\overline{\mathbf{v}}_t\|^2\big]$. This result is provided in the following lemma, which is obtained with the help of Lemmas 4, 5, 6, and 7.

Lemma 8. If $0 < \alpha \leq \frac{(1-\lambda^2)^2}{70\lambda^2 L}$ and $\beta \in (0,1)$, then the following inequality holds: $\forall T \geq 2$,

$$\frac{\sum_{t=0}^{T}\mathbb{E}\big[\|\mathbf{x}_t - J\mathbf{x}_t\|^2\big]}{n} \leq \frac{2016\lambda^4 L^2\alpha^4}{(1-\lambda^2)^4}\sum_{t=0}^{T-2}\mathbb{E}\big[\|\overline{\mathbf{v}}_t\|^2\big] + \frac{32\lambda^4\alpha^2}{(1-\lambda^2)^3}\,\frac{\|\nabla\mathbf{f}(\mathbf{x}_0)\|^2}{n} + \Big(\frac{7\beta}{1-\lambda^2} + 1\Big)\frac{32\lambda^4\nu^2\alpha^2}{(1-\lambda^2)^3 b_0} + \Big(\frac{14\beta}{1-\lambda^2} + 3\Big)\frac{32\lambda^4\beta^2\nu^2\alpha^2 T}{(1-\lambda^2)^3}.$$

Finally, we note that Lemmas 7 and 8 suffice to establish (6) and hence lead to Theorem 1; see the Appendix for details.


Figure 1. A comparison of GT-HSGD with other decentralized online stochastic gradient algorithms over the undirected exponential graph of 20 nodes on the a9a, covertype, KDD98, and MiniBooNE datasets.

5. Numerical Experiments

In this section, we illustrate our theoretical results on the convergence of the proposed GT-HSGD algorithm with the help of numerical experiments.

Model. We consider a non-convex logistic regression model (Antoniadis et al., 2011) for decentralized binary classification. In particular, the decentralized non-convex optimization problem of interest takes the form $\min_{x\in\mathbb{R}^p} F(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + r(x)$, such that

$$f_i(x) = \frac{1}{m}\sum_{j=1}^{m}\log\big[1 + e^{-\langle x,\, \theta_{i,j}\rangle l_{i,j}}\big] \quad \text{and} \quad r(x) = R\sum_{k=1}^{p}\frac{[x]_k^2}{1 + [x]_k^2},$$

where $\theta_{i,j}$ is the feature vector, $l_{i,j} \in \{-1, +1\}$ is the corresponding binary label, and $r(x)$ is a non-convex regularizer. To simulate the online SFO setting described in Section 2, each node $i$ is only able to sample with replacement from its local data $\{\theta_{i,j}, l_{i,j}\}_{j=1}^{m}$ and compute the corresponding (minibatch) stochastic gradient. Throughout all experiments, we set the number of nodes to $n = 20$ and the regularization parameter to $R = 10^{-4}$.
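For concreteness, the following is a minimal sketch of one local cost $f_i + r$ and a single minibatch SFO query, on hypothetical synthetic data; the function and variable names are ours, not from the paper's code.

```python
import numpy as np

R_REG = 1e-4  # regularization parameter R used in the experiments

def local_cost(x, theta, labels):
    """Non-convex logistic loss f_i(x) plus the non-convex regularizer r(x)."""
    z = -labels * (theta @ x)                 # theta: (m, p); labels in {-1, +1}
    return np.mean(np.log1p(np.exp(z))) + R_REG * np.sum(x**2 / (1.0 + x**2))

def sfo_query(x, theta, labels, batch, rng):
    """One stochastic first-order oracle call: minibatch sampled with replacement."""
    idx = rng.integers(0, theta.shape[0], size=batch)
    th, l = theta[idx], labels[idx]
    s = 1.0 / (1.0 + np.exp(l * (th @ x)))    # = sigmoid(-l * <x, theta>)
    g_loss = -(th * (l * s)[:, None]).mean(axis=0)
    g_reg = R_REG * 2.0 * x / (1.0 + x**2) ** 2
    return g_loss + g_reg

rng = np.random.default_rng(3)
m, p = 200, 5
theta = rng.standard_normal((m, p))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # normalize features, as in Section 5
labels = rng.choice([-1.0, 1.0], size=m)
x = rng.standard_normal(p)
g = sfo_query(x, theta, labels, batch=8, rng=rng)
assert g.shape == (p,) and np.all(np.isfinite(g))
```

The regularizer gradient follows from $\frac{d}{du}\,\frac{u^2}{1+u^2} = \frac{2u}{(1+u^2)^2}$, and the loss gradient from differentiating $\log(1 + e^{-l\langle x,\theta\rangle})$.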

Data. To test the performance of the applicable decentralized algorithms, we distribute the a9a, covertype, KDD98, and MiniBooNE datasets uniformly over the nodes and normalize the feature vectors such that $\|\theta_{i,j}\| = 1, \forall i, j$. The statistics of these datasets are provided in Table 2.

Table 2. Datasets used in numerical experiments, all available at https://www.openml.org/.

Dataset      train (nm)    dimension (p)
a9a          48,840        123
covertype    100,000       54
KDD98        75,000        477
MiniBooNE    100,000       11

Network topology. We consider the following network topologies: the undirected ring graph, the undirected and directed exponential graphs, and the complete graph; see (Nedic et al., 2018; Xin et al., 2020f; Assran et al., 2019; Lian et al., 2017) for detailed configurations of these graphs. For all graphs, the associated doubly stochastic weights are set to be equal. The resulting second largest singular values $\lambda$ of the weight matrices are 0.98, 0.75, 0.67, and 0, respectively, demonstrating a significant difference in the algebraic connectivity of these graphs.
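To illustrate how $\lambda$ arises, the sketch below builds equal-weight matrices for two of these topologies and computes the second largest singular value. With equal 1/3 weights on the ring of 20 nodes this gives $\lambda \approx 0.97$; the exact values reported above depend on the specific weight assignment used in the experiments.

```python
import numpy as np

def lam2(W):
    """Second largest singular value of a weight matrix."""
    return np.sort(np.linalg.svd(W, compute_uv=False))[-2]

n = 20

# Complete graph, equal weights: a single round averages exactly.
W_complete = np.full((n, n), 1.0 / n)

# Undirected ring, equal weights 1/3 on self and both neighbors.
W_ring = np.zeros((n, n))
for i in range(n):
    W_ring[i, i] = W_ring[i, (i - 1) % n] = W_ring[i, (i + 1) % n] = 1.0 / 3.0

print(lam2(W_complete))        # 0 up to round-off: lambda = 0 for the complete graph
print(round(lam2(W_ring), 2))  # 0.97 with these equal weights
```

Smaller $\lambda$ means faster information mixing; $\lambda = 0$ for the complete graph explains why it serves as the centralized baseline topology.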


Performance measure. We measure the performance of the decentralized algorithms in question by the decrease of the global cost function value $F(\overline{x})$, to which we refer as loss, versus epochs, where $\overline{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ with $x_i$ being the model at node $i$, and each epoch contains $m$ stochastic gradient computations at each node.

5.1. Comparison with the Existing Decentralized Stochastic Gradient Methods

We conduct a performance comparison of GT-HSGD with GT-DSGD (Pu & Nedich, 2020; Lu et al., 2019; Xin et al., 2020e), D-GET (Sun et al., 2020), and D-SPIDER-SFO (Pan et al., 2020) over the undirected exponential graph of 20 nodes. Note that we use GT-DSGD to represent methods that do not incorporate online variance reduction techniques, since it in general matches or outperforms DSGD (Lian et al., 2017) and performs similarly to D2 (Tang et al., 2018) and D-PD-SGD (Yi et al., 2020).

Parameter tuning. We set the parameters of GT-HSGD, GT-DSGD, D-GET, and D-SPIDER-SFO according to the following procedure. First, we find a very large step-size candidate set for each algorithm in comparison. Second, we choose the minibatch size candidate set for all algorithms as $B := \{1, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$: the minibatch size of GT-DSGD, the minibatch size of GT-HSGD at $t = 0$, and the minibatch sizes of D-GET and D-SPIDER-SFO at the inner and outer loops are all chosen from $B$. Third, for D-GET and D-SPIDER-SFO, we choose the inner-loop length candidate set as $\big\{\frac{m}{20b}, \frac{m}{19b}, \cdots, \frac{m}{b}, \frac{2m}{b}, \cdots, \frac{20m}{b}\big\}$, where $m$ is the local data size and $b$ is the minibatch size at the inner loop. Fourth, we iterate over all combinations of parameters for each algorithm to find its best performance. In particular, we find that the best performance of GT-HSGD is attained with a small $\beta$ and a relatively large $\alpha$, as Corollary 1 suggests.

The experimental results are provided in Fig. 1, where we observe that GT-HSGD achieves faster convergence than the other algorithms in comparison on all four datasets. This observation is coherent with our main convergence results, i.e., that GT-HSGD achieves a lower oracle complexity than the existing approaches; see Table 1.

5.2. Topology-Independent Rate of GT-HSGD

We test the performance of GT-HSGD over different network topologies. In particular, we follow the procedures described in Section 5.1 to find the best set of parameters for GT-HSGD over the complete graph and then use this parameter set for the other graphs. The corresponding experimental results are presented in Fig. 2. Clearly, it can be observed that when the number of iterations is large enough, that is to say, when the required error tolerance is small enough, the convergence rate of GT-HSGD is not affected by the underlying network topology. This interesting phenomenon is consistent with our convergence theory; see Corollary 1 and the related discussion in Section 3.

Figure 2. Convergence behaviors of GT-HSGD over different network topologies on the a9a and covertype datasets.

6. Conclusion

In this paper, we investigate decentralized stochastic optimization to minimize a sum of smooth non-convex cost functions over a network of nodes. Assuming that each node has access to a stochastic first-order oracle, we propose GT-HSGD, a novel single-loop decentralized algorithm that leverages local hybrid variance reduction and gradient tracking to achieve provably fast convergence and robust performance. Compared with the existing online variance-reduced methods, GT-HSGD achieves a lower oracle complexity with a more practical implementation. We further show that GT-HSGD achieves a network topology-independent oracle complexity when the required error tolerance is small enough, leading to a linear speedup with respect to the centralized optimal methods that execute on a single node.

Acknowledgments

The work of Ran Xin and Soummya Kar was supported in part by NSF under Award #1513936. The work of Usman A. Khan was supported in part by NSF under Awards #1903972 and #1935555.


References

Alghunaim, S. A., Ryu, E., Yuan, K., and Sayed, A. H. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Trans. Autom. Control, 2020.

Antoniadis, A., Gijbels, I., and Nikolova, M. Penalized likelihood regression for generalized linear models with non-quadratic penalties. Annals of the Institute of Statistical Mathematics, 63(3):585–615, 2011.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 344–353, 2019.

Chen, J. and Sayed, A. H. On the learning behavior of adaptive networks—part I: Transient analysis. IEEE Transactions on Information Theory, 61(6):3487–3517, 2015.

Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. In Adv. Neural Inf. Process. Syst., pp. 15236–15245, 2019.

Di Lorenzo, P. and Scutari, G. NEXT: In-network nonconvex optimization. IEEE Trans. Signal Inf. Process. Netw., 2(2):120–136, 2016.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proc. Adv. Neural Inf. Process. Syst., pp. 689–699, 2018.

Jakovetic, D. A unification and generalization of exact distributed first-order methods. IEEE Trans. Signal Inf. Process. Netw., 5(1):31–46, 2018.

Kar, S., Moura, J. M. F., and Ramanan, K. Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication. IEEE Trans. Inf. Theory, 58(6):3575–3605, 2012.

Li, B., Cen, S., Chen, Y., and Chi, Y. Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. J. Mach. Learn. Res., 21(180):1–51, 2020a.

Li, H. and Lin, Z. Revisiting EXTRA for smooth distributed optimization. SIAM J. Optim., 30(3):1795–1821, 2020.

Li, H., Lin, Z., and Fang, Y. Optimal accelerated variance reduced EXTRA and DIGing for strongly convex and smooth decentralized optimization. arXiv preprint arXiv:2009.04373, 2020b.

Li, Z., Shi, W., and Yan, M. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Trans. Signal Process., 67(17):4494–4506, 2019.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Adv. Neural Inf. Process. Syst., pp. 5330–5340, 2017.

Liu, D., Nguyen, L. M., and Tran-Dinh, Q. An optimal hybrid variance-reduced algorithm for stochastic composite nonconvex optimization. arXiv preprint arXiv:2008.09055, 2020.

Lu, Q., Liao, X., Li, H., and Huang, T. A computation-efficient decentralized algorithm for composite constrained optimization. IEEE Trans. Signal Inf. Process. Netw., 6:774–789, 2020.

Lu, S., Zhang, X., Sun, H., and Hong, M. GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization. In 2019 IEEE Data Science Workshop, pp. 315–321, 2019.

Mokhtari, A. and Ribeiro, A. DSA: Decentralized double stochastic averaging gradient algorithm. J. Mach. Learn. Res., 17(1):2165–2199, 2016.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control, 54(1):48–61, 2009.

Nedic, A., Olshevsky, A., and Rabbat, M. G. Network topology and communication-computation tradeoffs in decentralized optimization. Proc. IEEE, 106(5):953–976, 2018.

Nedich, A., Olshevsky, A., and Shi, W. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim., 27(4):2597–2633, 2017.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.

Nesterov, Y. Lectures on Convex Optimization, volume 137. Springer, 2018.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takac, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proc. 34th Int. Conf. Mach. Learn., pp. 2613–2621, 2017.

Pan, T., Liu, J., and Wang, J. D-SPIDER-SFO: A decentralized optimization algorithm with faster convergence rate for nonconvex problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 1619–1626, 2020.

Pham, N. H., Nguyen, L. M., Phan, D. T., and Tran-Dinh, Q. ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res., 21(110):1–48, 2020.

Pu, S. and Nedich, A. Distributed stochastic gradient tracking methods. Math. Program., pp. 1–49, 2020.

Qu, G. and Li, N. Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control. Netw. Syst., 5(3):1245–1260, 2017.

Rajawat, K. and Kumar, C. A primal-dual framework for decentralized stochastic optimization. arXiv preprint arXiv:2012.04402, 2020.

Shi, W., Ling, Q., Wu, G., and Yin, W. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim., 25(2):944–966, 2015.

Sun, H., Lu, S., and Hong, M. Improving the sample and communication complexity for decentralized non-convex optimization: Joint gradient estimation and tracking. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 9217–9228, 2020.

Taheri, H., Mokhtari, A., Hassani, H., and Pedarsani, R. Quantized decentralized stochastic learning over directed graphs. In International Conference on Machine Learning, pp. 9324–9333, 2020.

Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. D2: Decentralized training over decentralized data. In International Conference on Machine Learning, pp. 4848–4856, 2018.

Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. Math. Program., 2020.

Tsitsiklis, J., Bertsekas, D., and Athans, M. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control, 31(9):803–812, 1986.

Vlaski, S. and Sayed, A. H. Distributed learning in non-convex environments—Part II: Polynomial escape from saddle-points. arXiv preprint arXiv:1907.01849, 2019.

Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. SpiderBoost and momentum: Faster variance reduction algorithms. In Proc. Adv. Neural Inf. Process. Syst., pp. 2403–2413, 2019.

Xi, C., Xin, R., and Khan, U. A. ADD-OPT: Accelerated distributed directed optimization. IEEE Trans. Autom. Control, 63(5):1329–1339, 2017.

Xin, R., Kar, S., and Khan, U. A. Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence. IEEE Signal Process. Mag., 37(3):102–113, 2020a.

Xin, R., Khan, U. A., and Kar, S. A near-optimal stochastic gradient method for decentralized non-convex finite-sum optimization. arXiv preprint arXiv:2008.07428, 2020b.

Xin, R., Khan, U. A., and Kar, S. A fast randomized incremental gradient method for decentralized non-convex optimization. arXiv preprint arXiv:2011.03853, 2020c.

Xin, R., Khan, U. A., and Kar, S. Variance-reduced decentralized stochastic optimization with accelerated convergence. IEEE Trans. Signal Process., 68:6255–6271, 2020d.

Xin, R., Khan, U. A., and Kar, S. An improved convergence analysis for decentralized online stochastic non-convex optimization. arXiv preprint arXiv:2008.04195, 2020e.

Xin, R., Pu, S., Nedic, A., and Khan, U. A. A general framework for decentralized optimization with first-order methods. Proceedings of the IEEE, 108(11):1869–1889, 2020f.

Xu, J., Zhu, S., Soh, Y. C., and Xie, L. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In Proc. IEEE Conf. Decis. Control, pp. 2055–2060, 2015.

Xu, J., Tian, Y., Sun, Y., and Scutari, G. Distributed algorithms for composite optimization: Unified and tight convergence analysis. arXiv preprint arXiv:2002.11534, 2020.

Yi, X., Zhang, S., Yang, T., Chai, T., and Johansson, K. H. A primal-dual SGD algorithm for distributed nonconvex optimization. arXiv preprint arXiv:2006.03474, 2020.

Yuan, K., Ying, B., Liu, J., and Sayed, A. H. Variance-reduced stochastic learning by networked agents under random reshuffling. IEEE Trans. Signal Process., 67(2):351–366, 2018.

Yuan, K., Alghunaim, S. A., Ying, B., and Sayed, A. H. On the influence of bias-correction on distributed stochastic optimization. IEEE Trans. Signal Process., 2020.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduction for nonconvex optimization. J. Mach. Learn. Res., 2020.