
Compositional ADAM: An Adaptive Compositional Solver

Rasul Tutunov *1, Minne Li *1 2, Alexander I. Cowen-Rivers *1, Jun Wang 2, Haitham Bou-Ammar 1 2

Abstract

In this paper we present C-ADAM, the first adaptive solver for compositional problems involving a non-linear functional nesting of expected values. We prove that C-ADAM converges to a stationary point in O(δ^{-2.25}) first-order oracle calls, with δ being a precision parameter. Moreover, we demonstrate the importance of our results by bridging, for the first time, model-agnostic meta-learning (MAML) and compositional optimisation, showing the fastest known rates for deep network adaptation to date. Finally, we validate our findings in a set of experiments from portfolio optimisation and meta-learning. Our results manifest significant sample-complexity reductions compared to both standard and compositional solvers.

1. Introduction

The availability of large data-sets, coupled with improvements in optimisation algorithms and the growth in computing power, has led to an unprecedented interest in machine learning and its applications in, for instance, medical imaging (Faes et al., 2019), autonomous self-driving (Bojarski et al., 2016), fraud detection (Wang et al., 2020), computer games (Mnih et al., 2015; Silver et al., 2017; Tian et al., 2018), and many other fields.

Fueling these successes are novel developments in non-convex optimisation, which is, by now, a wide-reaching field with algorithms involving zeroth-order (Hu et al., 2016; Shamir, 2017; Gabillon et al., 2019), first-order (Sun et al., 2019; Beck, 2017; Menghan, 2019), and second-order methods (Boyd & Vandenberghe, 2004; Nesterov & Polyak, 2006; Tutunov et al., 2016). Despite such diversity, first-order algorithms are still highly regarded as the go-to technique when it comes to large-scale machine learning applications, due to their ease of implementation and their memory and computational efficacy. First-order optimisers can further be split

*Equal contribution. 1 Huawei Technologies Research and Development, London, UK. 2 University College London, London, UK. Correspondence to: Rasul Tutunov <[email protected]>.

Based on ICML’s Template.

into gradient-based, momentum, and adaptive solvers, each varying in the process by which a descent direction is computed. On the one hand, momentum techniques (Nesterov, 2014; Li & Lin, 2015) handle the direction itself, while adaptive designs modify learning rates (Duchi et al., 2011; Zeiler, 2012; Mukkamala & Hein, 2017). In spite of numerous theoretical developments (Liu et al., 2018; Fang et al., 2018; Allen-Zhu & Hazan, 2016), ADAM, an adaptive optimiser originally proposed in (Kingma & Ba, 2014) and later theoretically grounded in (Zaheer et al., 2018; Reddi et al., 2019), (arguably) retains state-of-the-art status in machine learning practice.

Although this first wave of machine learning applications led to major breakthroughs and paradigm shifts, experts and practitioners are rapidly realising limitations of current models: they require a large amount of training data and are slow to adapt to novel tasks, even when these tasks involve only minor changes in the data distribution. The quest for fast and adaptable models has re-surged interest in general and model-agnostic machine learning forms, such as meta- (Finn et al., 2017; Fallah et al., 2019), lifelong (Thrun & Pratt, 1998; Ruvolo & Eaton, 2013; Bou-Ammar et al., 2014; 2015a;b), and few-shot learning (Wang & Yao, 2019).

When investigated closer, one comes to realise that such key problems typically involve a non-linear functional nesting of expected values. For instance, in both meta and lifelong learning, a learner attempts to acquire a model across multiple tasks, each exhibiting randomness in its data-generating process. In other words, the goal of the agent is to find a set of deep network parameters that minimise an expectation over all tasks, where each task's loss function involves another expectation over training data^1.

It is clear from the above that problems of this nature are compositional, abiding by the form introduced in (Wang et al., 2014). Such a bridge, however, has not been formally established before, partially due to the discrepancy between communities and the lack of stochastic and adaptive

^1 Please note that in model-agnostic meta-learning such an expectation can involve more than two layers. In this paper, however, we focus on the original presentation (Finn et al., 2017) that performs one gradient step and, as such, only includes two nestings. Generalising to an m-level nesting is an interesting avenue for future research.



compositional solvers that allow for scalable and effective algorithms (see Section 2)^2. With meta-learning as our motivation, we propose compositional ADAM (C-ADAM), the first, to the best of our knowledge, adaptive compositional optimiser that provably converges to a first-order stationary point. Along the way, we also extend standard ADAM's proof from (Zaheer et al., 2018) to a more realistic setting of hyper-parameters, closer to those used by practitioners (i.e., none of the β's are set to zero). Apart from theoretical rigour, C-ADAM is also stochastic, allowing for mini-batching, and only executes one loop, easing implementation.

Unlocking major associations, we further derive a novel connection between model-agnostic meta-learning (MAML) and compositional optimisation, allowing, for the first time, the application of compositional methods in large-scale machine learning problems beyond portfolio mean-variance trade-offs and discrete-state (or linear function approximation) reinforcement learning (Wang et al., 2014). Here, we show that MAML can be written as a nested expectation problem and derive corollaries from our results shedding light on MAML's convergence guarantees. Such an analysis reveals that our method achieves the best-known convergence bounds to date^3 for problems involving the fast adaptation of deep networks. Validating our theoretical discoveries, we finally conduct an in-depth empirical study demonstrating that C-ADAM outperforms others by significant margins. In fact, our results demonstrate that, by correctly handling correlations across data and tasks with C-ADAM, one can outperform ADAM (and others) in meta-learning.

2. Related Work

Since (Wang et al., 2014) proposed a stochastic compositional gradient optimiser with O(δ^{-4}) query complexity, much attention has been devoted towards more efficient and scalable solvers. Efforts in (Wang et al., 2017), for instance, led to further improvements, achieving O(δ^{-3.5}) complexity by adopting an accelerated version of the approach in (Wang et al., 2014). Building on these results, further developments employed Nesterov acceleration techniques to arrive at an O(δ^{-2.25}) gradient query complexity (Wang et al., 2016). Concurrently, other authors considered a relaxation of the problem by studying a finite-sum alternative, i.e., a Monte-Carlo approximation to the stochastic objectives^4. Here, Liu et al. (2017) examined variance reduction techniques to achieve a total complexity of O((m + n)^{0.8} δ^{-1}). Succeedingly, Huo et al. (2017) adapted variance reduction to proximal algorithms, acquiring an O((m + n)^{2/3} δ^{-1}) bound. A similar complexity bound was obtained in (Liu et al., 2018) by applying two kinds of variance reduction techniques. Such a bound has later been improved to O((m + n)^{1/2} δ^{-1}) with recursive gradient methods in (Hu et al., 2019)^5.

^2 Though stochastic forms are proposed in the literature, we would like to mention that these typically tackle relaxed versions that assume independent coupling random variables.

^3 We note that the results we study on MAML handle the more realistic non-convex scenario, compared to those in (Finn et al., 2019; Golmant, 2019).

^4 We note that such forms are a special case of the problem considered in this work. Here, m and n denote the number of samples for the inner and outer nesting, respectively.

^5 In our related-work review, we only focus on non-convex objectives. Notice that authors have devoted a lot of research to convex/strongly convex settings, e.g., (Lin et al., 2018; Bedi et al., 2019; Liu et al., 2018; Lian et al., 2016; Liu et al., 2017).

3. Compositional ADAM

3.1. Problem Definition, Notation & Assumptions

We focus on the following two-level nested optimisation problem:

    min_{x ∈ R^p} J(x) = E_ν [ f_ν ( E_ω [ g_ω(x) ] ) ],    (1)

where ν and ω are random variables sampled from ν ∼ P_ν(·) and ω ∼ P_ω(·), with P_ν(·) and P_ω(·) being unknown. Furthermore, for any ν and ω, f_ν(·): R^q → R is a function mapping to real values, while g_ω(·): R^p → R^q represents a map transforming the p-dimensional optimisation variable to a q-dimensional space. We make no restrictive assumptions on the probabilistic relationship between ν and ω, which can be either dependent or independent.
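To make the nesting concrete, the following minimal numpy sketch (with toy choices of f_ν and g_ω that are ours, not the paper's) estimates such an objective by Monte Carlo and illustrates why naively plugging a small-sample estimate of the inner expectation into a non-linear f yields a biased value, which is exactly the difficulty compositional solvers must control.

```python
# Toy nested objective J(x) = E_nu[ f_nu( E_omega[ g_omega(x) ] ) ];
# f and g below are illustrative choices only, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
x = rng.normal(size=p)
A = rng.normal(size=(q, p))                      # toy linear map used inside g

def g_omega(x, omega):                           # toy inner map g_omega: R^p -> R^q
    return A @ x + omega                         # omega ~ N(0, I_q)

def f_nu(y, nu):                                 # toy outer function f_nu: R^q -> R
    return np.sum((y - nu) ** 2)                 # non-linear in y, so E[f(estimate)] != f(E[g])

def J_plugin(x, n_outer, n_inner):
    """Average f_nu over outer samples, each applied to a fresh size-n_inner estimate of E[g]."""
    vals = []
    for _ in range(n_outer):
        inner = np.mean([g_omega(x, rng.normal(size=q)) for _ in range(n_inner)], axis=0)
        vals.append(f_nu(inner, rng.normal(size=q)))
    return np.mean(vals)

true_J = np.sum((A @ x) ** 2) + q                # closed form for this toy choice
print(true_J)
print(J_plugin(x, n_outer=4000, n_inner=1))      # biased upward by roughly q (the inner variance)
print(J_plugin(x, n_outer=4000, n_inner=100))    # bias shrinks as the inner batch grows
```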

As our method utilises gradient information in performing updates, it is instructive to clearly state the assumptions we exploit in designing compositional ADAM. Defining J(x) = f(g(x)), with f(y) = E_ν[f_ν(y)] and g(x) = E_ω[g_ω(x)], gradients can be written as ∇_x J(x) = ∇_x g(x)^T ∇_y f(g(x)). Analogous to most optimisation techniques, we prefer gradients that are Lipschitz continuous, thereby introducing regularity conditions that can aid in designing better-behaved algorithms. Rather than restricting the overall objective, however, we realise that it suffices for the gradients of f_ν(·) and g_ω(·) to be Lipschitz and bounded so as to achieve such results. As such, we assume:

Assumption 1. We make the following assumptions:

1. We assume that f_ν(·) is bounded by B_f and its gradients by M_f, i.e., for all y ∈ R^q and for any ν, |f_ν(y)| ≤ B_f and ||∇f_ν(y)|| ≤ M_f.

2. We assume that f_ν(·) is L_f-smooth, i.e., for any y_1, y_2 ∈ R^q and for any ν, we have ||∇f_ν(y_1) − ∇f_ν(y_2)||_2 ≤ L_f ||y_1 − y_2||_2.

3. We assume that the mapping g_ω(x) is M_g-Lipschitz continuous, i.e., for any x_1, x_2 ∈ R^p and for any ω, we have ||g_ω(x_1) − g_ω(x_2)||_2 ≤ M_g ||x_1 − x_2||_2.


4. We assume that the mapping g_ω(x) is L_g-smooth, i.e., for any x_1, x_2 ∈ R^p and for any ω, we have ||∇g_ω(x_1) − ∇g_ω(x_2)||_2 ≤ L_g ||x_1 − x_2||_2.

Given the above assumptions, we can now prove Lipschitz smoothness of J(x):^6

Lemma 1. If Assumption 1 holds, then J(x) is L-Lipschitz smooth, i.e., ||∇J(x_1) − ∇J(x_2)||_2 ≤ L ||x_1 − x_2||_2 for all x_1, x_2 ∈ R^p, with L = M_g^2 L_f + L_g M_f.

We now focus on the process by which gradients are evaluated. As we assume that the distributions P_ν(·) and P_ω(·) are unknown, we introduce two first-order oracles that can be queried to return gradients and function values. We presume Oracle_f(·, ·) and Oracle_g(·, ·) such that, at a time instance t, for any two fixed vectors z_t ∈ R^p, y_t ∈ R^q, and for any two integers K_t^(1) and K_t^(2) representing batch sizes, these oracles return the following collections:

    Oracle_f(y_t, K_t^(1)) = { ⟨ν_{t_i}, ∇f_{ν_{t_i}}(y_t)⟩ }_{i=1}^{K_t^(1)},
    Oracle_g(z_t, K_t^(2)) = { ⟨ω_{t_i}, g_{ω_{t_i}}(z_t), ∇g_{ω_{t_i}}(z_t)⟩ }_{i=1}^{K_t^(2)},

with {ν_{t_i}}_{i=1}^{K_t^(1)} and {ω_{t_i}}_{i=1}^{K_t^(2)} being independently and identically distributed (i.i.d.).^7
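As an illustration of the oracle interfaces just defined, here is a small numpy sketch (again with a toy f_ν and g_ω of our own) of Oracle_f and Oracle_g returning mini-batches of sampled gradients, values, and Jacobians, together with the averaged estimates a solver would form from them.

```python
# Toy instantiation (not the paper's) of the two first-order oracles:
# Oracle_f returns sampled gradients of f_nu at y; Oracle_g returns sampled
# values and Jacobians of g_omega at z.
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2
A = rng.normal(size=(q, p))                      # toy inner linear map

def oracle_f(y, K1):
    """K1 i.i.d. draws of (nu, grad f_nu(y)) for the toy choice f_nu(y) = ||y - nu||^2."""
    nus = rng.normal(size=(K1, q))
    return [(nu, 2.0 * (y - nu)) for nu in nus]

def oracle_g(z, K2):
    """K2 i.i.d. draws of (omega, g_omega(z), Jacobian of g_omega at z)."""
    omegas = rng.normal(size=(K2, q))
    return [(om, A @ z + om, A.copy()) for om in omegas]

# Mini-batch estimates used by a compositional solver: average the returned samples.
y, z = np.zeros(q), np.ones(p)
grad_f = np.mean([gf for _, gf in oracle_f(y, K1=8)], axis=0)     # approx grad f(y)
g_val  = np.mean([gv for _, gv, _ in oracle_g(z, K2=8)], axis=0)  # approx g(z)
jac_g  = np.mean([J for _, _, J in oracle_g(z, K2=8)], axis=0)    # approx grad g(z)
print(grad_f.shape, g_val.shape, jac_g.shape)                     # (2,) (2,) (2, 3)
```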

We report our convergence complexity results in terms of the total number of calls to these first-order oracles. Similar to (Wang et al., 2014), we introduce one final set of assumptions needed to understand the randomness of the sampling process:

Assumption 2. For any time step t, Oracle_f(·, ·) and Oracle_g(·, ·) satisfy the following conditions for any z ∈ R^p and y ∈ R^q:

1. Independent sample collections:

    ({ν_{1_i}}_{i=1}^{K_1^(1)}, {ω_{1_i}}_{i=1}^{K_1^(2)}), . . . , ({ν_{t_i}}_{i=1}^{K_t^(1)}, {ω_{t_i}}_{i=1}^{K_t^(2)});

2. Oracles return unbiased gradient estimates for any t, i, and j:

    E_{ω_{t_i}}[ g_{ω_{t_i}}(z) ] = E_ω[ g_ω(z) ],
    E_{ω_{t_i}, ν_{t_j}}[ ∇g_{ω_{t_i}}^T(z) ∇f_{ν_{t_j}}(y) ] = E_ω[ ∇g_ω^T(z) ] × E_ν[ ∇f_ν(y) ];

3. Variance-bounded stochastic gradients for any t, i, and j:

    E_{ν_{t_j}}[ ||∇f_{ν_{t_j}}(y) − ∇E_ν[f_ν(y)]||_2^2 ] ≤ σ_1^2,
    E_{ω_{t_i}}[ ||∇g_{ω_{t_i}}(z) − ∇E_ω[g_ω(z)]||_2^2 ] ≤ σ_2^2,
    E_{ω_{t_i}}[ ||g_{ω_{t_i}}(z) − E_ω[g_ω(z)]||_2^2 ] ≤ σ_3^2.

^6 Due to space constraints, we relegate proofs to the appendix.

^7 Please notice that the i.i.d. assumption here should be interpreted as follows: at each iteration t, the samples ν_{t_i} and ν_{t_j} (respectively ω_{t_i} and ω_{t_j}) for i, j ∈ 1, . . . , K_t^(1) (respectively i, j ∈ 1, . . . , K_t^(2)) are i.i.d., but ω_{t_i} and ν_{t_i} might not be independent.

3.2. Algorithmic Development & Theoretical Results

We now propose our algorithm and provide its convergence properties. We summarise our solver, titled C-ADAM, in the pseudo-code of Algorithm 1.

On a high level, C-ADAM shares similarities with the original ADAM (Kingma & Ba, 2014) in that it exhibits both a main variable, x_t, and auxiliary variables m_t and v_t. Similar to ADAM, at each iteration we compute two quantities corresponding to a weighted combination of historical and current gradients, and a weighted combination of squared components that relate to variances. When performing parameter updates, however, we deviate from ADAM by introducing additional auxiliary variables essential for an extrapolation-smoothing scheme that allows for fast estimation of E_ω[g_ω(x_t)].

In addition to precision parameters and an initialisation, our algorithm takes as inputs a schedule of learning rates and mini-batch sizes. Given such inputs, we then execute a loop for T = O(δ^{-5/4}) steps to return a δ-first-order stationary point of the problem in Equation 1. The loop operates in two main phases. In the first, the variables needed for parameter updates are computed according to lines 5-6, while the second updates each of x_t, y_t, and z_t. Similar to ADAM, x_t is updated with no additional calls to the oracles. Contrary to standard ADAM, on the other hand, our method makes another call to the oracle to sample K_t^(3) function values before revising the value of y_t; see lines 8-9.

The remainder of this section is dedicated to providing theoretical guarantees for Algorithm 1. We next prove that our algorithm converges to a δ-first-order stationary point after T = O(δ^{-5/4}) iterations:

Theorem 1 (Main Result). Consider Algorithm 1 with a parameter setup given by: α_t = C_α / t^{1/5}, β_t = C_β, K_t^(1) = C_1 t^{4/5}, K_t^(2) = C_2 t^{4/5}, K_t^(3) = C_3 t^{4/5}, γ_t^(1) = C_γ μ^t, γ_t^(2) = 1 − (C_α / t^{2/5}) (1 − C_γ μ^t)^2, for some positive constants C_α, C_β, C_1, C_2, C_3, C_γ, μ such that C_β < 1 and μ ∈ (0, 1). For any δ ∈ (0, 1), Algorithm 1 outputs, in expectation, a δ-approximate first-order stationary point x of J(x). That is:

    E_total[ ||∇J(x)||_2^2 ] ≤ δ,

with "total" representing all incurred randomness.


Algorithm 1 Compositional ADAM (C-ADAM)

1: Inputs: Initial variable x_1 ∈ R^p, precision parameter δ ∈ (0, 1), positive constant ε, running time T = O(δ^{-5/4}), learning rates {⟨α_t, β_t, γ_t^(1), γ_t^(2)⟩}_{t=1}^T, batch sizes {⟨K_t^(1), K_t^(2), K_t^(3)⟩}_{t=1}^T
2: Initialisation: Initialise z_1 = x_1, y_1 = 0 ∈ R^q, and m_0 = v_0 = 0 ∈ R^p
3: for t = 1 to T do:
4:   Computing Necessary Variables:
5:   Call the oracles to compute:

         ∇f_t(y_t) = (1 / K_t^(1)) Σ_{i=1}^{K_t^(1)} ∇f_{ν_{t_i}}(y_t),
         ∇g_t(x_t) = (1 / K_t^(2)) Σ_{j=1}^{K_t^(2)} ∇g_{ω_{t_j}}(x_t)

6:   With ∇J(x_t) = ∇g_t(x_t)^T ∇f_t(y_t), compute:

         m_t = γ_t^(1) m_{t-1} + (1 − γ_t^(1)) ∇J(x_t),
         v_t = γ_t^(2) v_{t-1} + (1 − γ_t^(2)) [∇J(x_t)]^2

7:   Variable Updates:
8:   Update the main and first auxiliary variable using:

         x_{t+1} = x_t − α_t m_t / (√v_t + ε),
         z_{t+1} = (1 − 1/β_t) x_t + (1/β_t) x_{t+1}

9:   To update the second auxiliary variable perform:

         y_{t+1} = (1 − β_t) y_t + β_t g_t(z_{t+1}),

     with g_t(z_{t+1}) = (1 / K_t^(3)) Σ_{i=1}^{K_t^(3)} g_{ω_{t_i}}(z_{t+1}) from Oracle_g(z_{t+1}, K_t^(3))
10: Output: Return a uniform sample from {x_t}_{t=1}^T
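The following numpy sketch is our reading of Algorithm 1, written as a stand-alone routine. The toy oracles in the usage example and the particular constants are our own choices, while the parameter schedules follow the setup of Theorem 1 below (α_t ∝ t^{-1/5}, constant β_t, batch sizes growing as t^{4/5}). It is meant as a sanity check of the update rules, not as a reproduction of the paper's experiments.

```python
# Minimal sketch of the C-ADAM loop; oracle callables return mini-batch estimates.
import numpy as np

def c_adam(grad_f_batch, grad_g_batch, g_batch, x1, T,
           C_alpha=0.01, C_beta=0.01, C_gamma=1.0, mu=0.99,
           C1=1, C2=1, C3=1, eps=1e-8, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p, q = x1.shape[0], g_batch(x1, 1).shape[0]
    x, z, y = x1.copy(), x1.copy(), np.zeros(q)          # line 2: z1 = x1, y1 = 0
    m, v = np.zeros(p), np.zeros(p)                      # line 2: m0 = v0 = 0
    iterates = []
    for t in range(1, T + 1):
        # Theorem 1 parameter schedule (our concrete choice of constants).
        alpha_t = C_alpha / t ** 0.2
        beta_t = C_beta
        K1, K2, K3 = (max(1, int(C * t ** 0.8)) for C in (C1, C2, C3))
        gamma1_t = C_gamma * mu ** t
        gamma2_t = 1.0 - (C_alpha / t ** 0.4) * (1.0 - gamma1_t) ** 2
        # Line 5: mini-batch estimates of grad f at y_t and of the Jacobian of g at x_t.
        grad_f = grad_f_batch(y, K1)                     # shape (q,)
        grad_g = grad_g_batch(x, K2)                     # shape (q, p)
        grad_J = grad_g.T @ grad_f                       # line 6: gradient estimate of J
        # Line 6: ADAM-style first and second moment estimates.
        m = gamma1_t * m + (1.0 - gamma1_t) * grad_J
        v = gamma2_t * v + (1.0 - gamma2_t) * grad_J ** 2
        # Line 8: main variable and extrapolation variable updates.
        x_next = x - alpha_t * m / (np.sqrt(v) + eps)
        z = (1.0 - 1.0 / beta_t) * x + (1.0 / beta_t) * x_next
        # Line 9: smoothed tracking of E_omega[g_omega(.)] at the extrapolated point.
        y = (1.0 - beta_t) * y + beta_t * g_batch(z, K3)
        x = x_next
        iterates.append(x.copy())
    # Line 10: return a uniform sample over the iterates.
    return iterates[rng.integers(len(iterates))]

# Toy usage: f_nu(y) = ||y - nu||^2 with nu ~ N(0, I), g_omega(x) = A x + omega,
# so J(x) = ||A x||^2 + const and the minimiser is x = 0.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
grad_f_batch = lambda y, K: np.mean(2.0 * (y - rng.normal(size=(K, 2))), axis=0)
grad_g_batch = lambda x, K: A                            # Jacobian of g is constant here
g_batch      = lambda z, K: A @ z + rng.normal(size=(K, 2)).mean(axis=0)
x_out = c_adam(grad_f_batch, grad_g_batch, g_batch, x1=np.ones(3), T=2000, rng=rng)
print(np.linalg.norm(A @ x_out))   # typically well below the value at the initial point
```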

Moreover, Algorithm 1 acquires x with an overall oracle complexity for Oracle_f(·, ·) and Oracle_g(·, ·) of the order O(δ^{-9/4}).

Proof Road-Map: In achieving Theorem 1, we study the effect of the updates in lines 4-9 of Algorithm 1 on the change in function value between two iterations. When attempting such an analysis, we realise the random effect introduced by gradient sub-sampling (e.g., line 5 in Algorithm 1). Consequently, we bound the expected loss-value reduction with respect to all randomness induced at some iteration t. With this achieved, we then focus on bounding the expected loss reduction with respect to all iterations t = 1, . . . , T, exploiting the independence across two successive updates. Throughout our proof, we also realise the need to study two additional components, which we derive by generalising lemmas originally presented in (Wang et al., 2014). We note that our proof is not just a mere application of the results in (Zaheer et al., 2018) to a compositional setting. In fact, our derivations are novel in that they follow alternative directions, combining and generalising results from (Wang et al., 2014; 2017) with those from (Zaheer et al., 2018). Due to space constraints, all proofs can be found in Appendix A.

3.2.1. PRELIMINARY LEMMAS:

First, we detail two essential lemmas that are adjusted from the work in (Wang et al., 2014) to deal with our ADAM solver.

Lemma 2 (Auxiliary Variable Properties). Consider the auxiliary variable updates in lines 8 and 9 of Algorithm 1. Let E_total[·] denote the expectation with respect to all incurred randomness. For any t, the following holds:

    E_total[ ||g(x_{t+1}) − y_{t+1}||_2^2 ] ≤ (L_g^2 / 2) E_total[ D_{t+1}^2 ] + 2 E_total[ ||E_{t+1}||_2^2 ],

with g(x_{t+1}) = E_ω[g_ω(x_{t+1})], and D_{t+1}, ||E_{t+1}||_2^2 satisfying the following recurrent inequalities:

    D_{t+1} ≤ (1 − β_t) D_t + (2 M_g^2 M_f^2 / ε^2) (α_t^2 / β_t) + β_t F_t^2,
    E_total[ ||E_{t+1}||_2^2 ] ≤ (1 − β_t)^2 E_total[ ||E_t||_2^2 ] + (β_t^2 / K_t^(3)) σ_3^2,
    F_t^2 ≤ (1 − β_{t-1}) F_{t-1}^2 + (4 M_g^2 M_f^2 / ε^2) (α_{t-1}^2 / β_{t-1}),

and D_1 = 0, E_total[ ||E_1||_2^2 ] = ||g(x_1)||_2^2, F_1 = 0.

Having established the first preliminary lemma, depicting relationships between the auxiliary variable updates, we now present a second one essential to our proof.

Lemma 3 (Recursion Property). Let η_t = C_η / t^a and ζ_t = C_ζ / t^b, where C_η > 1 + b − a, C_ζ > 0, (b − a) ∉ [−1, 0], and a ≤ 1. Consider the following recurrent inequality: A_{t+1} ≤ (1 − η_t + C_1 η_t^2) A_t + C_2 ζ_t, where C_1, C_2 ≥ 0. Then, there is a constant C_A > 0 such that A_t ≤ C_A / t^{b−a}.

We can easily adapt the above lemma to the specifics of our algorithm, bounding expected norm differences between g(x_t) and y_t:

Corollary 1 (Recursion & Auxiliary Variables in Algorithm 1). Consider Algorithm 1 with step sizes given by:


α_t = C_α / t^a, β_t = C_β / t^b, and K_t^(3) = C_3 t^e, for some constants C_α, C_β, C_3, a, b, e > 0 such that (2a − 2b) ∉ [−1, 0] and b ≤ 1. Then, for some C_D, C_E > 0, we have:

    E_total[ ||g(x_t) − y_t||_2^2 ] ≤ (L_g^2 C_D^2 / 2) · (1 / t^{4a−4b}) + 2 C_E^2 · (1 / t^{b+e}).

3.2.2. STEPS IN MAIN PROOF:

With the above lemmas established, we now detail the essential steps needed to arrive at the statement of Theorem 1. As mentioned previously, our main analysis relies on the study of the change in the value of the compositional loss between two successive iterations. In Lemma 1, we have shown that under our assumptions the loss function is Lipschitz smooth with a constant L. Hence, we write:

    J(x_{t+1}) ≤ J(x_t) + ∇^T J(x_t) Δx_{t+1} + (L/2) ||Δx_{t+1}||_2^2,

with Δx_{t+1} = x_{t+1} − x_t.

Due to the randomness induced by the first-order oracles, such a change has to be analysed in expectation with respect to all randomness incurred at iteration t given a fixed x_t. We define such an expectation as E_t[·] = E_{K_t^(1), K_t^(2), K_t^(3)}[· | x_t] to account for all sampling performed at the t-th iteration. With this, we note that the gradients and all primary and auxiliary variables are t-measurable, leading us to:

    E_t[J(x_{t+1})] ≤ J(x_t) − α_t ∇^T J(x_t) E_t[ m_t / (√v_t + ε) ] + (L α_t^2 / 2) E_t[ || m_t / (√v_t + ε) ||_2^2 ].

To achieve the convergence result, we bound each of the terms on the right-hand side of the above inequality. Using Assumptions 1 and 2 with the proposed setup of free parameters, and applying the law of total expectation E_total[E_t[·]] = E_total[·], we get:

    E_total[ J(x_{t+1}) − J(x_t) ] ≤ O(t^{−(a+c)}) + O(μ^t t^{−a}) − O(t^{−a}) ( E_total[ ||∇J(x_t)||_2^2 ] − E_total[ ||g(x_t) − y_t||_2^2 ] ).

Hence, using Corollary 1 and re-grouping yields:

    E_total[ ||∇J(x_t)||_2^2 ] ≤ t^{−(5a−5b)} + t^{−(a+b+c)} + t^{−(a+c)} + μ^t t^{−a} + O( t^a E_total[ J(x_t) − J(x_{t+1}) ] ).

Considering the average over all iterations T, and using first-order concavity conditions for f(t) = t^a (when a ∈ (0, 1)) with the following setup for the constants, a = 0.2, b = 0, c = e = 0.8, eventually yields:

    (1/T) Σ_{t=1}^T E_total[ ||∇J(x_t)||_2^2 ] ≤ δ,

with a total gradient sample complexity given by O(δ^{-9/4}) (i.e., the result in Theorem 1).

4. Use Case: Model-Agnostic Meta-Learning

In this section, we present a novel connection mapping model-agnostic meta-learning (MAML) to compositional optimisation. MAML, introduced in (Finn et al., 2017), is an algorithm aiming at efficiently adapting deep networks to unseen problems by considering meta-updates performed on a set of tasks.

In its original form, MAML solves a meta-optimisation problem of the form:

    min_x L(x) = min_x Σ_{T_i ∼ P_Tasks(·)} L_{T_i}( NN(Φ(x; D_{T_i})); D_{T_i} ),    (2)

where T_i represents the i-th task, P_Tasks(·) an unknown task distribution, and D_{T_i} the data set for task T_i. Furthermore, Φ is a map operating on neural network parameters, denoted by x, to define a function typically encoded by a neural network, or NN(·) in the above equation. Various algorithms from meta-learning have considered a variety of maps Φ; see (Finn et al., 2018; Kim et al., 2018; Grant et al., 2018; Vuorio et al., 2018) and references therein. In this paper, we follow the original presentation in (Finn et al., 2017) that defines Φ as a one-step gradient update, i.e., Φ(x; D_{T_i}) = x − α∇L_{T_i}(NN(x); D_{T_i}). The remaining ingredient needed for fully defining the problem in Equation 2 is the loss function for a task T_i, i.e., L_{T_i}(·; D_{T_i}). Such a loss, however, is task dependent, e.g., the logistic loss in classification and the mean-squared error in regression. Keeping with the generality of the exposition, we will map to a compositional form while assuming generic losses^8 L_{T_i}(·; D_{T_i}).
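For concreteness, the sketch below instantiates the one-step adaptation map Φ(x; D) = x − α∇L(NN(x); D) with a toy linear "network" and a mean-squared-error loss; the model and data are our own choices and only illustrate the shape of the map.

```python
# Toy illustration (our own model and loss) of the one-step gradient map Phi.
import numpy as np

def task_loss(x, D):
    """Mean-squared error of the linear model w -> X @ w on batch D = (X, y)."""
    X, y = D
    return 0.5 * np.mean((X @ x - y) ** 2)

def task_grad(x, D):
    X, y = D
    return X.T @ (X @ x - y) / X.shape[0]

def phi(x, D, alpha=0.01):
    """One inner gradient step, i.e. the adaptation map Phi(x; D)."""
    return x - alpha * task_grad(x, D)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10, 5)), rng.normal(size=10)
x = np.zeros(5)
x_adapted = phi(x, (X, y), alpha=0.1)
print(task_loss(x, (X, y)) > task_loss(x_adapted, (X, y)))   # True: one step of adaptation helps
```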

4.1. Compositional Meta-Learning: C-MAML

4.1.1. STOCHASTIC PROBLEM FORMULATION:

To map MAML to compositional optimisation, we first rewrite the problem in Equation 2 in an equivalent stochastic form. To do so, we start by considering Tasks = {T_1, T_2, . . . } to be a collection of tasks (e.g., a set of regression or classification tasks) that can be observed by the learner. Furthermore, we assume a distribution, P_Tasks = Δ({T_1, T_2, . . . }), from which tasks are sampled over rounds. To evaluate loss functions and perform updates of Equation 2, we also assume a deterministically available data-set, D_{T_i}, for each task T_i ∼ Δ({T_1, T_2, . . . }). As typical in machine learning, we suppose that an agent cannot access all data for a task T_i but rather needs to employ a mini-batching scheme in performing updates.

^8 Specifics on various machine learning tasks can be found in the appendix.


Signifying such a process, we introduce an additional random variable ξ used for performing data sub-sampling according to an unknown distribution^9 P_{T_i}^{(data)}(·) over D_{T_i}, with T_i being the i-th task. With these notations, we can write the gradient-step update Φ(x; D_T) for a given T ∼ Δ({T_1, T_2, . . . }) as:

    Φ(x; D_T) = E_{ξ ∼ P_T^{(data)}(·)} [ x − α∇L_t(NN(x); ξ) | t = T ].

Given a stochastic form of the update Φ(x; D_T), we now focus on the problem in Equation 2. In (Finn et al., 2017), the authors re-sample a new data-set from the same data distribution to compute the outer loss. Hence, we introduce one more random variable ξ′ to write:

    min_x E_T [ E_{ξ′ ∼ P_T^{(data)}(·)} [ L_t( NN(Φ(x; D_T)); ξ′ ) | t = T ] ].    (3)

In words, the stochastic formulation in Equation 3 simply states that MAML attempts to minimise an expected (over tasks) loss, which itself involves a nested expectation over data mini-batches used to compute per-task errors and updates.

Given such a stochastic formulation of MAML, we now consider two cases. The first involves one task, while the second handles a meta-learning scenario, as we detail next.

4.1.2. CASE I – SINGLE TASK WITH LARGE DATA-SET:

Here, we assume the availability of only one task T_1 whose data-set is large, motivating a data-streaming scenario. As such, the outer expectation in Equation 3 is non-existent, and the gradient map Φ can be written as:

    Φ(x; D_{T_1}) = E_{ξ ∼ P_{T_1}^{(data)}(·)} [ x − α∇L_t(NN(x); ξ) | t = T_1 ].

Noting that the original form in Equation 1 can be rewritten as J(x) = f(g(x)) (Section 3.1), we first define g(x) = E_{ξ ∼ P_{T_1}^{(data)}(·)} [ x − α∇L_t(NN(x); ξ) | t = T_1 ]. Subsequently, the objective in Equation 3, when considering one task T_1, can be rewritten as:

    min_x E_{ξ′ ∼ P_{T_1}^{(data)}(·)} [ f_{ξ′} ( E_{ξ ∼ P_{T_1}^{(data)}(·)} [ g_ξ(x) ] ) ],    (4)

where g_ξ(x) = x − α∇L_t(NN(x); ξ), and f_{ξ′}(y) = L_{T_1}(NN(y); ξ′). Clearly, the problem in Equation 4 is a special case of that in Equation 1, and thus one can employ Algorithm 1 for finding a stationary point (see Section 5.2).
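The next sketch checks, on a toy linear model of our own choosing, that defining g as the one-step adaptation map and f as the loss on re-sampled data indeed reproduces the single-task MAML objective as the composition f(g(x)) of Equation 4.

```python
# Deterministic toy check (our own data) that f(g(x)) equals L(Phi(x; D); D').
import numpy as np

rng = np.random.default_rng(1)
X,  y  = rng.normal(size=(20, 4)), rng.normal(size=20)    # "training" batch D (role of xi)
Xp, yp = rng.normal(size=(20, 4)), rng.normal(size=20)    # re-sampled batch D' (role of xi')
alpha = 0.05

loss = lambda w, X, y: 0.5 * np.mean((X @ w - y) ** 2)
grad = lambda w, X, y: X.T @ (X @ w - y) / X.shape[0]

g = lambda x: x - alpha * grad(x, X, y)      # inner map: one adaptation step
f = lambda w: loss(w, Xp, yp)                # outer loss on the re-sampled data

x = rng.normal(size=4)
maml_objective = loss(x - alpha * grad(x, X, y), Xp, yp)  # L(Phi(x; D); D')
print(np.isclose(f(g(x)), maml_objective))                # True
```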

^9 We note that we do not assume that agents can access such a data distribution, but only that they can sample input-output pairs from P_t^{(data)}(·).

4.1.3. CASE II – MULTIPLE TASKS & META-UPDATES:

After showing single-task MAML as a special case of compositional optimisation, we now generalise the previous results to K > 1 tasks T_1, . . . , T_K. Sampling according to P_Tasks(·), we can write a Monte-Carlo estimator of the loss in Equation 3 to arrive at the following optimisation problem:

    min_x L(x) = min_x (1/K) Σ_{j=1}^K E_{ξ′_j ∼ P_{T_j}^{(data)}(·)} [ γ_t^{(j)}(x; ξ′_j) | t = T_j ],    (5)

with γ_t^{(j)}(x; ξ′_j) = L_t( NN(Φ(x; D_{T_j})); ξ′_j ), and Φ(·) = Φ(x; D_{T_j}) defined as:

    Φ(·) = E_{ξ_j ∼ P_{T_j}^{(data)}(·)} [ x − α∇L_t(NN(x); ξ_j) | t = T_j ].    (6)

Now, consider G_{ξ_1, . . . , ξ_K}(x) ∈ R^{p×K} to be a matrix with p rows (the dimension of the neural network parameters) and K columns (the total number of tasks), such that each column j ∈ {1, . . . , K} is defined as:

    G_{ξ_1, . . . , ξ_K}(x)[:, j] = x − α∇L_{T_j}(NN(x); ξ_j).

Denoting ξ = (ξ_1, . . . , ξ_K), where ξ_j ∼ P_{T_j}^{(data)}(·) for all j ∈ {1, . . . , K}, we define G(x) = E_ξ[G_ξ(x)] as an expected map such that each of its columns j is given by^10:

    G(x)[:, j] = E_{ξ_j ∼ P_{T_j}^{(data)}(·)} [ x − α∇L_t(NN(x); ξ_j) | t = T_j ].

Hence, one can write the gradient update map in Equation 6 for any task j simply as Φ(x; D_{T_j}) = G(x) e_j, with e_j being the j-th basis vector in a standard orthonormal basis of R^K. Applying our newly introduced notions to the loss in Equation 5, we get:

    L(x) = E_{ξ′} [ (1/K) Σ_{j=1}^K L_t( NN(G(x) e_j); ξ′_j ) | t = T_j ] = E_{ξ′} [ f_{ξ′} ( E_ξ [ G_ξ(x) ] ) ],    (compositional form)

with ξ′ = (ξ′_1, . . . , ξ′_K), and f_{ξ′}(A) = (1/K) Σ_{j=1}^K L_{T_j}(NN(A[:, j]); ξ′_j), with A[:, j] being the j-th column of the matrix A ∈ R^{p×K}. It is again clear that multi-task MAML is another special case of compositional optimisation, for which we can employ Algorithm 1 to determine a stationary point.
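The construction above can be mirrored in a few lines of numpy. The sketch below uses toy linear tasks of our own; for brevity the same batch plays the role of both ξ_j and ξ'_j, whereas the paper re-samples data for the outer loss. It builds the p × K matrix G(x) and the outer average f over its columns.

```python
# Toy multi-task construction: one adapted parameter vector per column of G(x).
import numpy as np

rng = np.random.default_rng(2)
p, K, alpha = 4, 3, 0.05
tasks = [(rng.normal(size=(15, p)), rng.normal(size=15)) for _ in range(K)]   # (X_j, y_j)

loss = lambda w, X, y: 0.5 * np.mean((X @ w - y) ** 2)
grad = lambda w, X, y: X.T @ (X @ w - y) / X.shape[0]

def G(x):
    """Columns G(x)[:, j] = x - alpha * grad L_{T_j}(x): one adapted vector per task."""
    return np.stack([x - alpha * grad(x, X, y) for X, y in tasks], axis=1)

def f_outer(A):
    """Average task loss, each task evaluated at its own column of A."""
    return np.mean([loss(A[:, j], X, y) for j, (X, y) in enumerate(tasks)])

x = rng.normal(size=p)
print(G(x).shape, f_outer(G(x)))      # (4, 3) and the meta objective value L(x)
```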

^10 Please note that, for ease of notation, we have denoted E_ξ[·] = E_{ξ_1, . . . , ξ_K}[·] = E_{ξ_1}[E_{ξ_2}[. . . E_{ξ_K}[·]]].


Table 1. Theoretical sample-complexity bounds comparing our solver C-MAML to the state-of-the-art in the current literature to achieve ||∇L(x)||_2^2 ≤ δ. It is clear that C-MAML achieves the best rate known so far.

    Algorithm                         Complexity
    C-MAML (this paper)               O(δ^{-2.25})
    (Fallah et al., 2019), Alg. 1     O(δ^{-3})
    (Fallah et al., 2019), Alg. 3     O(δ^{-3})
    (Kingma & Ba, 2014) in MAML       N/A
    (Finn et al., 2017)               N/A

C-MAML Implications: Connecting MAML and compositional optimisation sheds light on interesting theoretical insights for meta-learning, a topic gaining considerable attention in recent literature (Fallah et al., 2019). Importantly, we realise that, upon applying C-ADAM to the bridge made in Section 4, we achieve the fastest known complexity bound for meta-learning. Table 1 depicts these results, demonstrating that the closest results to ours are recent bounds derived in (Fallah et al., 2019). In Table 1, we also note that the methods proposed in (Finn et al., 2017; Zaheer et al., 2018) have no provided convergence guarantees (marked by N/A in the table), leaving such an analysis as an interesting avenue for future research^11.

5. Experiments

We present an in-depth empirical study demonstrating that C-ADAM (Algorithm 1) outperforms state-of-the-art methods from both compositional and standard optimisation. We split this section into two parts. In the first, we experiment with more conventional portfolio optimisation tasks as presented in (Wang et al., 2017), while in the second, we focus on model-agnostic meta-learning, building on our results from Section 4. In the portfolio scenario, we study algorithms tasked with determining stationary points for sparse mean-variance trade-offs on real-world data-sets gathered by the Center for Research in Security Prices (CRSP)^12. In MAML's case, on the other hand, we benchmark against regression tasks initially presented in (Finn et al., 2017), demonstrating that C-ADAM achieves new state-of-the-art performance. All models were trained on a single NVIDIA GeForce GTX 1080 Ti GPU.

5.1. Portfolio Mean-Variance

We consider three large 100-portfolio data-sets (m = 13781, n = 100) and 18 region-based medium-sized sets

^11 Furthermore, we note that other theoretical attempts aimed at understanding MAML (Finn et al., 2019; Golmant, 2019) were not explicitly mentioned, as these consider convex assumptions on loss functions albeit using deep network models.

^12 https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

with m = 7240 and 25 assets, as collected by CRSP, covering book-to-market (BM), operating profitability (OP), and investment (Inv.). Given n assets and reward vectors at m time instances, the goal of sparse mean-variance optimisation (Ravikumar et al., 2007) is to maximise returns and control risk, a problem that can be mapped to a compositional form (see Appendix B.1). We compared C-ADAM with a line of existing algorithms for compositional optimisation: ASCVRG (Lin et al., 2018), VRSC-PG (Huo et al., 2017), ASC-PG (Wang et al., 2016), and SCGD (Wang et al., 2014). Implementations of these baselines were provided by the respective authors and optimised for mean-variance tasks. As a comparative metric between algorithms, we measured the optimality gap J(x) − J*, versus the number of samples used per asset. Free parameters in Algorithm 1 were set to C_α = 0.01, C_β = 0.01, K_t^(1) = K_t^(2) = K_t^(3) = 1, and C_γ = 1 unless otherwise specified. Figure 1(a, b, c) shows that C-ADAM outperforms other algorithms on the large data-sets, while Figure 1(d, e, f) demonstrates that C-ADAM outperforms others on sample results from the BM, OP, and Inv data-sets^13.

Ablation Study: Algorithm 1 introduces free parameters that need to be tuned for successful execution. To assess their importance, we conducted an ablation study on the batch sizes K_t^(1), K_t^(2), K_t^(3) and the step-size constants C_α, C_β. We demonstrate the effects of varying these parameters on an asset data-set in Figure 1(g, h). It is clear that C-ADAM performs best with a batch size of 1 (Figure 1(g)) and a small step size of 0.01 (Figure 1(h)). This, in turn, motivated our choice of these parameters in the portfolio experiments.
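For readers unfamiliar with the mean-variance setting, the sketch below shows one standard way of writing the (unregularised) mean-variance objective as a nested expectation, with expectations taken over a toy empirical reward matrix; the paper's exact sparse formulation is given in Appendix B.1 and is not reproduced here.

```python
# One standard compositional rewrite of mean-variance portfolio selection (illustrative only).
import numpy as np

rng = np.random.default_rng(3)
m, n = 500, 25
R = rng.normal(loc=0.01, scale=0.05, size=(m, n))      # toy reward matrix: m periods, n assets

def g(x):
    """Inner expectation: returns (x, mean portfolio reward <r, x>) over the m periods."""
    return np.concatenate([x, [np.mean(R @ x)]])

def f(w_z):
    """Outer expectation: negative mean reward plus variance of the portfolio reward."""
    w, z = w_z[:-1], w_z[-1]
    rw = R @ w
    return np.mean(-rw + (rw - z) ** 2)

x = np.ones(n) / n                                     # equally weighted portfolio
print(f(g(x)))                                         # J(x) = -E[<r, x>] + Var(<r, x>)
```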

5.2. Compositional MAML

To validate the theoretical guarantees achieved for MAML, we conduct two sets of experiments on the regression and classification problems originally introduced in (Finn et al., 2017).

5.2.1. COMPOSITIONAL REGRESSION RESULTS

After observing a set of tasks, the goal is to perform few-shot supervised regression on novel unobserved tasks when only a few data points are available. Tasks vary in their data distribution, e.g., through changing parameters of the data-generating distribution^14.

We use a neural network regressor with two hidden layers, each of size 40, with ReLU non-linearities. For all experiments, we use one meta step (one inner gradient update) with ten examples, i.e., 10-shot regression, and

^13 Please note that, due to space constraints, the full results (all demonstrating that C-ADAM outperforms others) on all 18 data-sets can be found in Appendix B.1.

^14 Due to space constraints, all hyper-parameters used in our experiments can be found in Appendix B.2.


[Figure 1: panels (a)-(c) plot the objective gap (log scale) against #grad/n for the 100_INV, 100_ME, and 100_OP data-sets (m = 13781, n = 100); panels (d)-(f) for Global_ME, Europe_OP, and AP_ex_Japan_INV (m = 7240, n = 25); panels (g)-(h) show the ablation study on AP_ex_Japan_ME; panels (i)-(l) show training, test, and meta test loss against the number of samples.]

Figure 1. (a, b, c): C-ADAM's performance versus other compositional optimisation methods on the three large 100-portfolio data-sets, demonstrating that our method significantly outperforms others in terms of convergence speed. (d, e, f): [d] C-ADAM's performance versus other compositional methods on one data-set from BM, [e] C-ADAM's performance on OP, and [f] C-ADAM's performance on a data-set from Inv. In all cases, we see that C-ADAM outperforms other techniques significantly. (g, h): [g] An ablation study demonstrating the effect of modifying the batch sizes, K_t^(1) = K_t^(2) = K_t^(3) = K, and [h] the effect of the step size, C_α = C_β = C. (i, j): Training loss curves of single-task compositional MAML compared to methods from compositional and standard optimisation. These results again demonstrate that C-ADAM outperforms others. (k): The meta test loss of multi-task compositional MAML compared to others, demonstrating C-ADAM's performance. (l): Quantitative regression results showing the learning curve after training for 40000 iterations, at meta test time. It is again clear that C-ADAM outperforms others.

adopt an outer and inner learning rate (meta-learning rate) of α = 0.01 for all algorithms unless stated otherwise. We compare C-ADAM with both compositional (ASC-PG and SCGD) and non-compositional optimisers (ADAM, Adagrad (Duchi et al., 2011), ASGD (Polyak & Juditsky, 1992), RMSprop (Mukkamala & Hein, 2017), SGD) and demonstrate the performance in Figure 1. In C-ADAM and ADAM, we set the outer learning rate to 0.001. When dealing with only one task, i.e., streaming data points, C-ADAM achieves the best performance in training and test loss with respect to the number of samples, as shown in Figures 1(i) and (j).

Evaluating few-shot regression (the multi-task scenario), we fine-tune the meta-learnt model using the same optimiser (SGD) on M = 10 examples for each method. We also compare performance against two additional baseline models: (1) pre-training on all of the tasks, which entails training a single neural net to deal with multiple tasks at the same time (Transfer in Figure 1(k, l)), and (2) random initialisation (Random in Figure 1(k, l)).

During fine-tuning, each gradient step is computed using the same M = 10 data points. For each evaluation point, we report the result over 100 test tasks and show the meta test loss (i.e., the test loss on novel tasks unseen during training) after M = 10 gradient steps in Figure 1(k). The meta-learnt model trained with the C-ADAM optimiser is able to adapt quickly at meta test time. While models learnt with baseline optimisers can adapt to the meta test set after training for 40000 iterations, as shown in Figure 1(l), C-ADAM's model still achieves the best convergence performance among all baselines, which demonstrates both the advantage of our compositional formulation of MAML and the adaptive nature of C-ADAM.
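As a reference for the architecture described above, the following numpy sketch (initialisation and input range are our own choices) spells out the forward pass of the two-hidden-layer, width-40 ReLU regressor on a 10-shot batch.

```python
# Minimal forward pass of a 1-40-40-1 ReLU regressor (illustrative initialisation).
import numpy as np

rng = np.random.default_rng(4)
sizes = [1, 40, 40, 1]                                   # input, two hidden layers, output
params = [(rng.normal(scale=0.1, size=(a, b)), np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: ReLU on both hidden layers, linear output layer."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

x_shot = rng.uniform(-5.0, 5.0, size=(10, 1))            # a 10-example (10-shot) batch
print(forward(params, x_shot).shape)                     # (10, 1)
```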


[Figure 2: twelve panels (a)-(l) plotting meta test accuracy and meta test loss (log scale) against the number of samples (×10^3) and the number of gradient steps, with legends Transfer, Random, C-ADAM, ASC-PG, SCGD, ADAM, Adagrad, ASGD, RMSprop, and SGD.]

Figure 2. We show three scenarios on Omniglot data highlighting test statistics of C-MAML compared against other optimisation methods for varying N-shot K-way values using the convolutional model. (a)-(d) refer to the Omniglot 5-shot 5-way task, (e)-(h) refer to the Omniglot 1-shot 20-way task, and (i)-(l) refer to the Omniglot 5-shot 20-way task.

5.2.2. COMPOSITIONAL CLASSIFICATION RESULTS

Following a similar setup to MAML, we applied our algorithm to N-shot K-way image recognition on the Omniglot (Lake et al., 2011) and MiniImagenet (Ravi & Larochelle, 2016) data-sets. For both data-sets, we use the same convolutional (Figure 2) and non-convolutional (Figure 9) models as in (Finn et al., 2017), trained with one meta step and a meta batch size of 32 tasks. For Omniglot, we set all outer optimiser learning rates to 0.01 and the meta-learning rate to 0.4. For MiniImagenet, we set all outer optimiser learning rates to 0.001 and the meta-learning rate to 0.1. When comparing optimisers, we use the same learning rates and architecture. We evaluated models using three meta steps with the same meta step size as was used for training. Figure 2 highlights that C-MAML outperforms others across this suite of challenging image tasks.

6. Conclusions and Future Work

We proposed C-ADAM, the first adaptive compositional solver. We validated our method both theoretically and empirically, and provided the first connection between model-agnostic deep learning and compositional optimisation, attaining the best-known convergence bounds to date.

In future work, we plan to investigate further applications of compositional optimisation, such as robust single-player and multi-player objectives (Grau-Moya et al., 2018; Mguni, 2018; Abdullah et al., 2019; Wen et al., 2019), as well as to propose adaptive algorithms for problems involving the nesting of more than two expected values.

References

Abdullah, M. A., Ren, H., Ammar, H. B., Milenkovic, V., Luo, R., Zhang, M., and Wang, J. Wasserstein robust reinforcement learning, 2019.

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization, 2016.

Beck, A. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics, 2017. ISBN 9781611974980. URL https://books.google.co.uk/books?id=xLk4DwAAQBAJ.


Bedi, A. S., Koppel, A., and Rajawat, K. Nonparametric compositional stochastic optimization, 2019.

Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Bou-Ammar, H., Eaton, E., Ruvolo, P., and Taylor, M. E. Online multi-task learning for policy gradient methods. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1206–1214, 2014.

Bou-Ammar, H., Eaton, E., Luna, J., and Ruvolo, P. Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp. 3345–3351, 2015a.

Bou-Ammar, H., Tutunov, R., and Eaton, E. Safe policy search for lifelong reinforcement learning with sublinear regret. CoRR, abs/1505.05798, 2015b. URL http://arxiv.org/abs/1505.05798.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, USA, 2004. ISBN 0521833787.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.

Faes, L., Wagner, S. K., Fu, D. J., Liu, X., Korot, E., Ledsam, J. R., Back, T., Chopra, R., Pontikos, N., Kern, C., et al. Automated deep learning design for medical image classification by health-care professionals with no coding experience: a feasibility study. The Lancet Digital Health, 1(5):e232–e242, 2019.

Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic meta-learning algorithms, 2019.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. Spider: Near-optimal non-convex optimization via stochastic path integrated differential estimator, 2018.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR, abs/1703.03400, 2017. URL http://arxiv.org/abs/1703.03400.

Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527, 2018.

Finn, C., Rajeswaran, A., Kakade, S. M., and Levine, S. Online meta-learning. CoRR, abs/1902.08438, 2019. URL http://arxiv.org/abs/1902.08438.

Gabillon, V., Tutunov, R., Valko, M., and Ammar, H. B. Derivative-free & order-robust optimisation. arXiv preprint arXiv:1910.04034, 2019.

Golmant, N. On the convergence of model-agnostic meta-learning. 2019. URL http://noahgolmant.com/writings/maml.pdf.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.

Grau-Moya, J., Leibfried, F., and Bou-Ammar, H. Balancing two-player stochastic games with soft q-learning, 2018.

Hu, W., Li, C. J., Lian, X., Liu, J., and Yuan, H. Efficient smooth non-convex stochastic compositional optimization via stochastic recursive gradient descent. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alche-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 6926–6935. Curran Associates, Inc., 2019.

Hu, X., Prashanth, L., Gyorgy, A., and Szepesvari, C. (Bandit) convex optimization with biased noisy gradient oracles. In Artificial Intelligence and Statistics, pp. 819–828, 2016.

Huo, Z., Gu, B., and Huang, H. Accelerated method for stochastic composition optimization with nonsmooth regularization. CoRR, abs/1711.03937, 2017. URL http://arxiv.org/abs/1711.03937.

Kim, T., Yoon, J., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2014.

Lake, B., Salakhutdinov, R., Gross, J., and Tenenbaum, J. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.

Li, H. and Lin, Z. Accelerated proximal gradient methods for nonconvex programming. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pp. 379–387, Cambridge, MA, USA, 2015. MIT Press.

Lian, X., Wang, M., and Liu, J. Finite-sum composition optimization via variance reduced gradient descent, 2016.


Lin, T., Fan, C., Wang, M., and Jordan, M. I. Improved sample complexity for stochastic compositional variance reduced gradient, 2018.

Liu, L., Liu, J., and Tao, D. Variance reduced methods for non-convex composition optimization, 2017.

Liu, L., Liu, J., Hsieh, C.-J., and Tao, D. Stochastically controlled stochastic gradient for the convex and non-convex composition problem, 2018.

Menghan, Z. A review of classical first order optimization methods in machine learning. In Deng, K., Yu, Z., Patnaik, S., and Wang, J. (eds.), Recent Developments in Mechatronics and Intelligent Robotics, pp. 1183–1190, Cham, 2019. Springer International Publishing. ISBN 978-3-030-00214-5.

Mguni, D. A viscosity approach to stochastic differential games of control and stopping involving impulsive control, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mukkamala, M. C. and Hein, M. Variants of RMSprop and Adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 2545–2553. JMLR.org, 2017.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014. ISBN 1461346916.

Nesterov, Y. and Polyak, B. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, Aug 2006. ISSN 1436-4646. doi: 10.1007/s10107-006-0706-8. URL https://doi.org/10.1007/s10107-006-0706-8.

Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. 2016.

Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. Sparse additive models, 2007.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. CoRR, abs/1904.09237, 2019. URL http://arxiv.org/abs/1904.09237.

Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), June 2013.

Shamir, O. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1):1703–1713, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

Sun, S., Cao, Z., Zhu, H., and Zhao, J. A survey of optimization methods from a machine learning perspective. CoRR, abs/1906.06821, 2019. URL http://arxiv.org/abs/1906.06821.

Thrun, S. and Pratt, L. Learning to learn: Introduction and overview. In Learning to Learn, pp. 3–17. Springer, 1998.

Tian, Z., Zou, S., Warr, T., Wu, L., and Wang, J. Learning to communicate implicitly by actions. arXiv preprint arXiv:1810.04444, 2018.

Tutunov, R., Bou-Ammar, H., and Jadbabaie, A. A distributed Newton method for large scale consensus optimization. CoRR, abs/1606.06593, 2016. URL http://arxiv.org/abs/1606.06593.

Vuorio, R., Sun, S.-H., Hu, H., and Lim, J. J. Toward multimodal model-agnostic meta-learning. arXiv preprint arXiv:1812.07172, 2018.

Wang, M., Fang, E. X., and Liu, H. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions, 2014.

Wang, M., Liu, J., and Fang, E. Accelerating stochastic composition optimization. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 1714–1722. Curran Associates, Inc., 2016.

Wang, M., Fang, E. X., and Liu, H. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Math. Program., 161(1-2):419–449, January 2017. ISSN 0025-5610. doi: 10.1007/s10107-016-1017-3. URL https://doi.org/10.1007/s10107-016-1017-3.

Wang, R., Nie, K., Wang, T., Yang, Y., and Long, B. Deep learning for anomaly detection. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 894–896, 2020.


Wang, Y. and Yao, Q. Few-shot learning: A survey. CoRR, abs/1904.05046, 2019. URL http://arxiv.org/abs/1904.05046.

Wen, Y., Yang, Y., Luo, R., and Wang, J. Modelling bounded rationality in multi-agent interactions by generalized recursive reasoning, 2019.

Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9793–9803. Curran Associates, Inc., 2018.

Zeiler, M. D. Adadelta: An adaptive learning rate method, 2012.


A. Theoretical Results

In this part of the appendix, we present proofs for all statements made in the main paper.

A.1. Proof of Lemma 1

Lemma 4. If Assumptions 1.1, 1.3, and 1.4 hold, then J(x) is L-Lipschitz smooth, i.e.,

∥∇_x J(x_1) − ∇_x J(x_2)∥_2 ≤ L ∥x_1 − x_2∥_2,

for all x_1, x_2 ∈ R^p, with L = M_g^2 L_f + L_g M_f.

Proof. Assumption 1 implies that ∥∇g_w(x)∥_2 ≤ M_g for any x ∈ R^p. Hence, using Jensen's inequality as well as the triangle inequality for the norm, we have:

∥∇J(x_1) − ∇J(x_2)∥_2
= ∥E_w[∇g_w^T(x_1)] E_v[∇f_v(E_w[g_w(x_1)])] − E_w[∇g_w^T(x_2)] E_v[∇f_v(E_w[g_w(x_2)])]∥_2
≤ ∥E_w[∇g_w^T(x_1)] E_v[∇f_v(E_w[g_w(x_1)])] − E_w[∇g_w^T(x_1)] E_v[∇f_v(E_w[g_w(x_2)])]∥_2
  + ∥E_w[∇g_w^T(x_1)] E_v[∇f_v(E_w[g_w(x_2)])] − E_w[∇g_w^T(x_2)] E_v[∇f_v(E_w[g_w(x_2)])]∥_2
≤ E_w[∥∇g_w^T(x_1)∥_2] ∥E_v[∇f_v(E_w[g_w(x_1)])] − E_v[∇f_v(E_w[g_w(x_2)])]∥_2
  + ∥E_w[∇g_w^T(x_1)] − E_w[∇g_w^T(x_2)]∥_2 E_v[∥∇f_v(E_w[g_w(x_2)])∥_2]
≤ M_g E_v[∥∇f_v(E_w[g_w(x_1)]) − ∇f_v(E_w[g_w(x_2)])∥_2] + M_f E_w[∥∇g_w^T(x_1) − ∇g_w^T(x_2)∥_2]
≤ M_g L_f ∥E_w[g_w(x_1)] − E_w[g_w(x_2)]∥_2 + M_f E_w[∥∇g_w^T(x_1) − ∇g_w^T(x_2)∥_2]
≤ M_g L_f E_w[∥g_w(x_1) − g_w(x_2)∥_2] + M_f L_g ∥x_1 − x_2∥_2
≤ M_g^2 L_f ∥x_1 − x_2∥_2 + M_f L_g ∥x_1 − x_2∥_2 = (M_g^2 L_f + M_f L_g) ∥x_1 − x_2∥_2 = L ∥x_1 − x_2∥_2,

which finishes the proof of the claim.
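As a quick numerical illustration of the constant L = M_g^2 L_f + L_g M_f (this sketch is ours and not part of the paper; the toy maps, dimensions, and constants below are assumptions chosen only for the check), one can verify the bound on a composition with a linear inner map and a smooth outer map:

# Hypothetical sanity check of Lemma 4 on a toy composition J(x) = f(E_w[g_w(x)]).
# Here g_w(x) = A_w x (linear, so L_g = 0) and f(y) = sum(tanh(y)); both are chosen
# purely for illustration and are not the objectives used in the paper.
import numpy as np

rng = np.random.default_rng(0)
p, q, W = 3, 4, 16                      # dimensions and number of inner samples
A = rng.normal(size=(W, q, p)) * 0.3    # fixed family {A_w}
A_bar = A.mean(axis=0)                  # E_w[grad g_w] = E_w[A_w]

M_g = max(np.linalg.norm(A[w], 2) for w in range(W))   # sup_w ||grad g_w||_2
L_g = 0.0                               # linear inner map is Lipschitz-smooth with L_g = 0
M_f = np.sqrt(q)                        # ||grad f(y)||_2 = ||1 - tanh(y)^2||_2 <= sqrt(q)
L_f = 4.0 / (3.0 * np.sqrt(3.0))        # max |d/dy (1 - tanh(y)^2)| per coordinate

def grad_J(x):
    # grad J(x) = E_w[grad g_w(x)]^T grad f(E_w[g_w(x)])
    y = A_bar @ x
    return A_bar.T @ (1.0 - np.tanh(y) ** 2)

L = M_g ** 2 * L_f + L_g * M_f
ratios = []
for _ in range(1000):
    x1, x2 = rng.normal(size=p), rng.normal(size=p)
    ratios.append(np.linalg.norm(grad_J(x1) - grad_J(x2)) / np.linalg.norm(x1 - x2))
print(f"max empirical ratio {max(ratios):.4f} <= L = {L:.4f}")

Since the inner map is linear in this sketch, L_g = 0 and the bound reduces to M_g^2 L_f; the empirical ratio stays below it, as the lemma predicts.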

A.2. Proof of Lemma 2

Lemma 5. Consider the auxiliary variable updates in lines 8 and 9 of Algorithm 1. Let E_total[·] denote the expectation with respect to all incurred randomness. For any t, the following holds:

E_total[∥g(x_{t+1}) − y_{t+1}∥_2^2] ≤ (L_g^2/2) E_total[D_{t+1}^2] + 2 E_total[∥E_{t+1}∥_2^2],

with g(x_{t+1}) = E_ω[g_ω(x_{t+1})], and D_{t+1}, ∥E_{t+1}∥_2^2 satisfy the following recurrent inequalities:

D_{t+1} ≤ (1 − β_t) D_t + (2 M_g^2 M_f^2/ε^2)(α_t^2/β_t) + β_t F_t^2,

E_total[∥E_{t+1}∥_2^2] ≤ (1 − β_t)^2 E_total[∥E_t∥_2^2] + (β_t^2/K_t^{(3)}) σ_3^2,

F_t^2 ≤ (1 − β_{t−1}) F_{t−1}^2 + (4 M_g^2 M_f^2/ε^2)(α_{t−1}^2/β_{t−1}),

and D_1 = 0, E_total[∥E_1∥_2^2] = ∥g(x_1)∥_2^2, F_1 = 0.

Proof. Let us introduce the sequence of coefficients {θ_j^{(t)}}_{j=0}^{t} such that

θ_j^{(t)} = β_j ∏_{i=j+1}^{t} (1 − β_i) if j < t,   and   θ_t^{(t)} = β_t,

and we assume β_0 = 1 for simplicity. Denote S_t = Σ_{j=0}^{t} θ_j^{(t)}; then:

S_t = Σ_{j=0}^{t} θ_j^{(t)} = β_t + (1 − β_t)β_{t−1} + (1 − β_t)(1 − β_{t−1})β_{t−2} + ··· + (1 − β_t)(1 − β_{t−1})···(1 − β_1)β_0
= β_t + (1 − β_t)[β_{t−1} + (1 − β_{t−1})β_{t−2} + ··· + (1 − β_{t−1})···(1 − β_1)β_0] = β_t + (1 − β_t)S_{t−1},

and S_1 = β_1 + (1 − β_1)β_0. Since β_0 = 1, it follows that S_1 = S_2 = ··· = S_t = 1. By assuming g_0(z_1) = 0_q, one can represent x_{t+1} and y_{t+1} as convex combinations of {z_j}_{j=1}^{t+1} and {g_j(z_{j+1})}_{j=0}^{t}, respectively:

x_{t+1} = Σ_{j=0}^{t} θ_j^{(t)} z_{j+1},   and   y_{t+1} = Σ_{j=0}^{t} θ_j^{(t)} g_j(z_{j+1}).    (7)

Hence, using a Taylor expansion of the function g(z_{j+1}) around x_{t+1}, we have:

y_{t+1} = Σ_{j=0}^{t} θ_j^{(t)} g_j(z_{j+1}) = Σ_{j=0}^{t} θ_j^{(t)} [g(z_{j+1}) − g(z_{j+1}) + g_j(z_{j+1})]
= Σ_{j=0}^{t} θ_j^{(t)} g(z_{j+1}) + Σ_{j=0}^{t} θ_j^{(t)} [g_j(z_{j+1}) − g(z_{j+1})]
= Σ_{j=0}^{t} θ_j^{(t)} ( g(x_{t+1}) + ∇g(x_{t+1})[z_{j+1} − x_{t+1}] + O(∥z_{j+1} − x_{t+1}∥_2^2) ) + Σ_{j=0}^{t} θ_j^{(t)} [g_j(z_{j+1}) − g(z_{j+1})]
= g(x_{t+1}) + ∇g(x_{t+1}) ( Σ_{j=0}^{t} θ_j^{(t)} z_{j+1} − Σ_{j=0}^{t} θ_j^{(t)} x_{t+1} ) + Σ_{j=0}^{t} θ_j^{(t)} O(∥z_{j+1} − x_{t+1}∥_2^2) + Σ_{j=0}^{t} θ_j^{(t)} [g_j(z_{j+1}) − g(z_{j+1})]
= g(x_{t+1}) + Σ_{j=0}^{t} θ_j^{(t)} [g_j(z_{j+1}) − g(z_{j+1})] + Σ_{j=0}^{t} θ_j^{(t)} O(∥z_{j+1} − x_{t+1}∥_2^2),

where the first-order term vanishes because of (7) and S_t = 1. Therefore, using that g(·) is an L_g-smooth function:

∥y_{t+1} − g(x_{t+1})∥_2 ≤ (L_g/2) Σ_{j=0}^{t} θ_j^{(t)} ∥z_{j+1} − x_{t+1}∥_2^2 + ∥ Σ_{j=0}^{t} θ_j^{(t)} [g_j(z_{j+1}) − g(z_{j+1})] ∥_2,

and, applying (a + b)^2 ≤ 2a^2 + 2b^2:

∥y_{t+1} − g(x_{t+1})∥_2^2 ≤ (L_g^2/2) ( Σ_{j=0}^{t} θ_j^{(t)} ∥z_{j+1} − x_{t+1}∥_2^2 )^2 + 2 ∥ Σ_{j=0}^{t} θ_j^{(t)} [g_j(z_{j+1}) − g(z_{j+1})] ∥_2^2.

Taking the expectation E_total of both sides gives:

E_total[∥y_{t+1} − g(x_{t+1})∥_2^2] ≤ (L_g^2/2) E_total[D_{t+1}^2] + 2 E_total[∥E_{t+1}∥_2^2],    (8)

where

D_{t+1} = Σ_{j=0}^{t} θ_j^{(t)} ∥z_{j+1} − x_{t+1}∥_2^2,   E_{t+1} = Σ_{j=0}^{t} θ_j^{(t)} [g_j(z_{j+1}) − g(z_{j+1})].

It is easy to see that D_1 = 0 and E_1 = g(z_1) (using the notational assumptions g_0(z_1) = 0_q and β_0 = 1). Let us bound both terms in expression (8). Since θ_j^{(t)} = (1 − β_t) θ_j^{(t−1)} for j < t, for the expression D_{t+1} we have:

D_{t+1} = Σ_{j=0}^{t} θ_j^{(t)} ∥z_{j+1} − x_{t+1}∥_2^2 = Σ_{j=0}^{t−1} θ_j^{(t)} ∥z_{j+1} − x_{t+1}∥_2^2 + β_t ∥z_{t+1} − x_{t+1}∥_2^2
= (1 − β_t) Σ_{j=0}^{t−1} θ_j^{(t−1)} ∥z_{j+1} − x_{t+1}∥_2^2 + ((1 − β_t)^2/β_t) ∥x_{t+1} − x_t∥_2^2
= (1 − β_t) Σ_{j=0}^{t−1} θ_j^{(t−1)} ∥z_{j+1} − x_t∥_2^2 + ((1 − β_t)^2/β_t) ∥x_{t+1} − x_t∥_2^2 + (1 − β_t) Σ_{j=0}^{t−1} θ_j^{(t−1)} [∥z_{j+1} − x_{t+1}∥_2^2 − ∥z_{j+1} − x_t∥_2^2]
= (1 − β_t) D_t + ((1 − β_t)^2/β_t) ∥x_{t+1} − x_t∥_2^2 + (1 − β_t) Σ_{j=0}^{t−1} θ_j^{(t−1)} [∥z_{j+1} − x_{t+1}∥_2 − ∥z_{j+1} − x_t∥_2] × [∥z_{j+1} − x_{t+1}∥_2 + ∥z_{j+1} − x_t∥_2]
≤ (1 − β_t) D_t + ((1 − β_t)^2/β_t) ∥x_{t+1} − x_t∥_2^2 + (1 − β_t) Σ_{j=0}^{t−1} θ_j^{(t−1)} ∥x_{t+1} − x_t∥_2 [∥x_{t+1} − x_t∥_2 + 2∥x_t − z_{j+1}∥_2]
= (1 − β_t) D_t + ((1 − β_t)^2/β_t) ∥x_{t+1} − x_t∥_2^2 + (1 − β_t) ∥x_{t+1} − x_t∥_2^2 + 2(1 − β_t) ∥x_{t+1} − x_t∥_2 Σ_{j=0}^{t−1} θ_j^{(t−1)} ∥x_t − z_{j+1}∥_2
= (1 − β_t) D_t + ((1 − β_t)/β_t) ∥x_{t+1} − x_t∥_2^2 + 2(1 − β_t) ∥x_{t+1} − x_t∥_2 Σ_{j=0}^{t−1} θ_j^{(t−1)} ∥x_t − z_{j+1}∥_2.

Applying 2ab ≤ (1/β_t) a^2 + β_t b^2:

D_{t+1} ≤ (1 − β_t) D_t + ((1 − β_t)/β_t) ∥x_{t+1} − x_t∥_2^2 + (1 − β_t) [ ∥x_{t+1} − x_t∥_2^2/β_t + β_t ( Σ_{j=0}^{t−1} θ_j^{(t−1)} ∥x_t − z_{j+1}∥_2 )^2 ]
= (1 − β_t) D_t + 2((1 − β_t)/β_t) ∥x_{t+1} − x_t∥_2^2 + (1 − β_t) β_t ( Σ_{j=0}^{t−1} θ_j^{(t−1)} ∥x_t − z_{j+1}∥_2 )^2
≤ (1 − β_t) D_t + (2/β_t) ∥x_{t+1} − x_t∥_2^2 + β_t F_t^2,   where F_t := Σ_{j=0}^{t−1} θ_j^{(t−1)} ∥x_t − z_{j+1}∥_2.

Applying the primal variable update with ∥∇_x J(·)∥_2 ≤ M_g M_f gives:

D_{t+1} ≤ (1 − β_t) D_t + (2α_t^2/β_t) ∥ m_t/(√v_t + ε) ∥_2^2 + β_t F_t^2 ≤ (1 − β_t) D_t + (2 M_g^2 M_f^2/ε^2)(α_t^2/β_t) + β_t F_t^2.    (9)

Next, for the expression F_t we have (using θ_j^{(t−1)} = (1 − β_{t−1}) θ_j^{(t−2)} for j < t − 1):

F_t = Σ_{j=0}^{t−1} θ_j^{(t−1)} ∥x_t − z_{j+1}∥_2 = Σ_{j=0}^{t−2} θ_j^{(t−1)} ∥x_t − z_{j+1}∥_2 + θ_{t−1}^{(t−1)} ∥x_t − z_t∥_2
= Σ_{j=0}^{t−2} θ_j^{(t−1)} ∥x_t − z_{j+1}∥_2 + β_{t−1} ∥x_t − z_t∥_2
= (1 − β_{t−1}) Σ_{j=0}^{t−2} θ_j^{(t−2)} ∥x_t − z_{j+1}∥_2 + β_{t−1} ∥x_t − z_t∥_2
≤ (1 − β_{t−1}) ∥x_t − x_{t−1}∥_2 + (1 − β_{t−1}) Σ_{j=0}^{t−2} θ_j^{(t−2)} [∥x_{t−1} − z_{j+1}∥_2 + ∥x_t − x_{t−1}∥_2]
= (1 − β_{t−1}) (F_{t−1} + 2∥x_t − x_{t−1}∥_2),

and F_1 = 0. Hence, applying (a + b)^2 ≤ (1 + α) a^2 + (1 + 1/α) b^2 with α = β_{t−1} > 0 and using the primal variable update with ∥∇_x J(·)∥_2 ≤ M_g M_f:

F_t^2 ≤ (1 + β_{t−1})(1 − β_{t−1})^2 F_{t−1}^2 + 4(1 + 1/β_{t−1})(1 − β_{t−1})^2 ∥x_t − x_{t−1}∥_2^2    (10)
≤ (1 − β_{t−1}) F_{t−1}^2 + (4/β_{t−1}) ∥x_t − x_{t−1}∥_2^2 = (1 − β_{t−1}) F_{t−1}^2 + (4α_{t−1}^2/β_{t−1}) ∥ m_{t−1}/(√v_{t−1} + ε) ∥_2^2
≤ (1 − β_{t−1}) F_{t−1}^2 + (4 M_g^2 M_f^2/ε^2)(α_{t−1}^2/β_{t−1}).

Finally, for E_{t+1} we have (using θ_j^{(t)} = (1 − β_t) θ_j^{(t−1)} for j < t):

E_{t+1} = Σ_{j=0}^{t} θ_j^{(t)} [g_j(z_{j+1}) − g(z_{j+1})] = Σ_{j=0}^{t−1} θ_j^{(t)} [g_j(z_{j+1}) − g(z_{j+1})] + β_t [g_t(z_{t+1}) − g(z_{t+1})]
= (1 − β_t) Σ_{j=0}^{t−1} θ_j^{(t−1)} [g_j(z_{j+1}) − g(z_{j+1})] + β_t [g_t(z_{t+1}) − g(z_{t+1})]
= (1 − β_t) E_t + β_t [g_t(z_{t+1}) − g(z_{t+1})].

Since all sampling done at iteration t_1 is independent from the sampling done at iteration t_2 ≠ t_1, consider the expectation with respect to all randomness induced at iteration t (with the iterate x_t fixed):

E_t[·] = E_{K_t^{(1)}, K_t^{(2)}, K_t^{(3)}}[· | x_t]    (11)

for any t. Using that the error vector E_t is independent from the randomness induced at iteration t, we have:

E_t[∥E_{t+1}∥_2^2] = E_t[ ((1 − β_t) E_t + β_t [g_t(z_{t+1}) − g(z_{t+1})])^T ((1 − β_t) E_t + β_t [g_t(z_{t+1}) − g(z_{t+1})]) ]
= (1 − β_t)^2 ∥E_t∥_2^2 + 2β_t(1 − β_t) E_t^T E_t[g_t(z_{t+1}) − g(z_{t+1})] + β_t^2 E_t[∥g_t(z_{t+1}) − g(z_{t+1})∥_2^2]
= (1 − β_t)^2 ∥E_t∥_2^2 + β_t^2 E_t[∥g_t(z_{t+1}) − g(z_{t+1})∥_2^2],

where E_t[g_t(z_{t+1}) − g(z_{t+1})] = E_{K_t^{(1)},K_t^{(2)},K_t^{(3)}}[g_t(z_{t+1}) − g(z_{t+1})] = 0_q due to Assumption 2.2. Assumption 2.3 implies E_t[∥g_t(z_{t+1}) − g(z_{t+1})∥_2^2] ≤ σ_3^2/K_t^{(3)}; therefore,

E_t[∥E_{t+1}∥_2^2] ≤ (1 − β_t)^2 ∥E_t∥_2^2 + (β_t^2/K_t^{(3)}) σ_3^2.

Taking the expectation E_total of both sides of the above inequality and using (11) and the law of total expectation, we have:

E_total[∥E_{t+1}∥_2^2] ≤ (1 − β_t)^2 E_total[∥E_t∥_2^2] + (β_t^2/K_t^{(3)}) σ_3^2.    (12)

Combining (8), (9), (10), and (12) gives the statement of the Lemma.
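The convex-combination representation (7), and the identity S_t = 1 behind it, can be checked directly; the sketch below is only an illustration, with arbitrary β_t values, and with the extrapolation step x_{t+1} = (1 − β_t)x_t + β_t z_{t+1} inferred from (7) rather than quoted from Algorithm 1:

# Illustrative check that the weights theta_j^{(t)} sum to one and that x_{t+1}
# is the convex combination in equation (7); not taken from the paper's code.
import numpy as np

rng = np.random.default_rng(1)
T, p = 50, 4
beta = np.r_[1.0, rng.uniform(0.1, 0.9, T)]      # beta_0 = 1, as assumed in the proof

def theta(t):
    """theta_j^{(t)} = beta_j * prod_{i=j+1..t} (1 - beta_i), with theta_t^{(t)} = beta_t."""
    return np.array([beta[j] * np.prod(1.0 - beta[j + 1:t + 1]) for j in range(t + 1)])

# run the auxiliary recursion x_{t+1} = (1 - beta_t) x_t + beta_t z_{t+1}
z = rng.normal(size=(T + 2, p))
x = np.zeros((T + 2, p))
x[1] = z[1]                                      # with beta_0 = 1, x_1 = z_1
for t in range(1, T + 1):
    x[t + 1] = (1.0 - beta[t]) * x[t] + beta[t] * z[t + 1]

for t in (1, 10, T):
    w = theta(t)
    assert abs(w.sum() - 1.0) < 1e-9             # S_t = 1
    recon = sum(w[j] * z[j + 1] for j in range(t + 1))
    assert np.allclose(recon, x[t + 1])          # equation (7)
print("theta weights sum to 1 and reproduce x_{t+1} as a convex combination")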

A.3. Proof of Lemma 3

Lemma 6. Let η_t = C_η/t^a and ζ_t = C_ζ/t^b, where C_η > 1 + b − a, C_ζ > 0, (b − a) ∉ [−1, 0], and a ≤ 1. Consider the following recurrent inequality:

A_{t+1} ≤ (1 − η_t + C_1 η_t^2) A_t + C_2 ζ_t,

where C_1, C_2 ≥ 0. Then, there is a constant C_A > 0 such that A_t ≤ C_A/t^{b−a}.

Proof. We adapt this proof from (Wang et al., 2017) and include it in the appendix to keep the paper self-contained.

Let us introduce the constant C_A such that

C_A = max_{t ≤ (C_1 C_η^2)^{1/a} + 1} A_t t^{b−a} + C_2 C_ζ/(C_η − 1 − b + a).

The claim will be proved by induction. Consider two cases:

1. If t ≤ (C_1 C_η^2)^{1/a}: then, by the definition of the constant C_A, it follows immediately that A_t ≤ C_A t^{a−b} = C_A/t^{b−a}.

2. If t > (C_1 C_η^2)^{1/a}: assume that A_t ≤ C_A/t^{b−a} for some t > (C_1 C_η^2)^{1/a}. Hence:

A_{t+1} ≤ (1 − η_t + C_1 η_t^2) A_t + C_2 ζ_t = (1 − C_η/t^a + C_1 C_η^2/t^{2a}) A_t + C_2 C_ζ/t^b    (13)
≤ (1 − C_η/t^a + C_1 C_η^2/t^{2a}) C_A/t^{b−a} + C_2 C_ζ/t^b
= C_A/t^{b−a} − C_A C_η/t^b + C_1 C_A C_η^2/t^{a+b} + C_2 C_ζ/t^b
= C_A/(t+1)^{b−a} − C_A [ 1/(t+1)^{b−a} − 1/t^{b−a} + C_η/t^b − C_1 C_η^2/t^{a+b} ] + C_2 C_ζ/t^b
= C_A/(t+1)^{b−a} − C_A Δ_{t+1} + C_2 C_ζ/t^b
= C_A/(t+1)^{b−a} − Δ_{t+1} ( C_A − C_2 C_ζ/(Δ_{t+1} t^b) ),

where Δ_{t+1} := 1/(t+1)^{b−a} − 1/t^{b−a} + C_η/t^b − C_1 C_η^2/t^{a+b}.

Since the function f(t) = 1/t^c is convex for c ∉ [−1, 0] and t > 0, one can apply the first-order condition of convexity:

f(t+1) ≥ f(t) + f′(t)  ⟹  1/(t+1)^c ≥ 1/t^c − c/t^{c+1}.

Hence, using a ≤ 1 and t > (C_1 C_η^2)^{1/a}, for Δ_{t+1} we have:

Δ_{t+1} = 1/(t+1)^{b−a} − 1/t^{b−a} + C_η/t^b − C_1 C_η^2/t^{a+b} ≥ −(b − a)/t^{b−a+1} + C_η/t^b − C_1 C_η^2/t^{a+b}
≥ −(b − a)/t^{b−a+1} + C_η/t^b − 1/t^b ≥ −(b − a)/t^b + C_η/t^b − 1/t^b = (C_η − 1 − b + a)/t^b > 0.

Moreover,

C_2 C_ζ/(Δ_{t+1} t^b) ≤ C_2 C_ζ/(C_η − 1 − b + a) ≤ C_A.

Combining these two results in (13) gives:

A_{t+1} ≤ C_A/(t+1)^{b−a} − Δ_{t+1} ( C_A − C_2 C_ζ/(Δ_{t+1} t^b) ) ≤ C_A/(t+1)^{b−a},

which proves the induction step.
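A quick way to see the rate of Lemma 6 in action (purely illustrative; the constants below are arbitrary choices satisfying the lemma's conditions) is to iterate the recursion with equality and track A_t t^{b−a}:

# Numerical illustration of Lemma 6: iterate A_{t+1} = (1 - eta_t + C1*eta_t^2) A_t + C2*zeta_t
# with eta_t = C_eta/t^a and zeta_t = C_zeta/t^b, and watch A_t * t^{b-a} stay bounded.
a, b = 0.2, 0.6                  # b - a = 0.4, outside [-1, 0]; a <= 1
C1, C2, C_zeta = 0.5, 1.0, 1.0
C_eta = 1.0 + (b - a) + 0.5      # C_eta > 1 + b - a

A, history = 5.0, []
for t in range(1, 200001):
    history.append(A * t ** (b - a))
    eta, zeta = C_eta / t ** a, C_zeta / t ** b
    A = (1.0 - eta + C1 * eta ** 2) * A + C2 * zeta

print(f"sup_t A_t * t^(b-a) ~= {max(history):.3f}, final value ~= {history[-1]:.3f}")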

A.4. Proof of Corollary 1

Corollary 2. Consider Algorithm 1 with step sizes given by α_t = C_α/t^a, β_t = C_β/t^b, and K_t^{(3)} = C_3 t^e, for some constants C_α, C_β, C_3, a, b, e > 0 such that (2a − 2b) ∉ [−1, 0] and b ≤ 1. Then we have:

E_total[∥g(x_t) − y_t∥_2^2] ≤ (L_g^2 C_D^2/2)(1/t^{4a−4b}) + 2 C_E^2 (1/t^{b+e}),

for some constants C_D, C_E > 0.

Proof. Using α_t = C_α/t^a and β_t = C_β/t^b in the recurrent inequalities for F_t^2, D_t, and E_total[∥E_{t+1}∥_2^2] gives:

1. For F_{t+1}^2:

F_{t+1}^2 ≤ (1 − β_t) F_t^2 + (4 M_g^2 M_f^2/ε^2)(α_t^2/β_t) = (1 − C_β/t^b) F_t^2 + (4 M_g^2 M_f^2 C_α^2/(ε^2 C_β)) (1/t^{2a−b}),

and applying Lemma 6 gives

F_t^2 ≤ C_F/t^{2a−2b},   where C_F = 4 M_g^2 M_f^2 C_α^2 / (ε^2 C_β (C_β − 1 − 2a + 2b)).    (14)

2. For D_{t+1}:

D_{t+1} ≤ (1 − C_β/t^b) D_t + (2 M_g^2 M_f^2 C_α^2/(ε^2 C_β)) (1/t^{2a−b}) + C_β C_F (1/t^{2a−b})
= (1 − C_β/t^b) D_t + [ 2 M_g^2 M_f^2 C_α^2/(ε^2 C_β) + C_β C_F ] (1/t^{2a−b}),

and applying Lemma 6 gives

D_t ≤ C_D/t^{2a−2b},   where C_D = (2 M_g^2 M_f^2 C_α^2 + ε^2 C_β^2 C_F) / (ε^2 C_β (C_β − 1 − 2a + 2b)).    (15)

3. For E_total[∥E_{t+1}∥_2^2]:

E_total[∥E_{t+1}∥_2^2] ≤ (1 − β_t)^2 E_total[∥E_t∥_2^2] + (β_t^2/K_t^{(3)}) σ_3^2
= (1 − 2C_β/t^b + C_β^2/t^{2b}) E_total[∥E_t∥_2^2] + (C_β^2 σ_3^2/C_3) (1/t^{2b+e})
= (1 − (2C_β)/t^b + (1/4)(2C_β)^2/t^{2b}) E_total[∥E_t∥_2^2] + (C_β^2 σ_3^2/C_3) (1/t^{2b+e}),

and applying Lemma 6 gives:

E_total[∥E_t∥_2^2] ≤ C_E^2/t^{b+e},   where C_E^2 = max_{t ≤ (C_β^2)^{1/b} + 1} E_total[∥E_t∥_2^2] t^{b+e} + C_β^2 σ_3^2 / (C_3 (2C_β − 1 − b − e)).    (16)

Next, combining results (15) and (16) in the bound of Lemma 5 gives:

E_total[∥g(x_t) − y_t∥_2^2] ≤ (L_g^2/2) E_total[D_t^2] + 2 E_total[∥E_t∥_2^2] ≤ (L_g^2 C_D^2/2)(1/t^{4a−4b}) + 2 C_E^2 (1/t^{b+e}).
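The recurrences driving Corollary 2 can also be iterated numerically; the sketch below uses arbitrary placeholder constants (not the paper's) and simply confirms that D_t and E_total[∥E_t∥_2^2] decay at the predicted orders t^{−(2a−2b)} and t^{−(b+e)}:

# Illustrative simulation of the recurrences (9), (10), and (12) with the step sizes of
# Corollary 2; all constants here are arbitrary placeholders chosen only for the plot-free check.
a, b, e = 0.2, 0.0, 0.8
C_alpha, C_beta, C3 = 0.1, 0.5, 10.0
Mg = Mf = 1.0
sigma3, eps = 1.0, 0.1

F2 = D = 0.0                      # F_1^2 = D_1 = 0
E2 = 1.0                          # E_1 depends on g(x_1); set arbitrarily for illustration
T = 200000
for t in range(1, T + 1):
    alpha, beta, K3 = C_alpha / t ** a, C_beta / t ** b, C3 * t ** e
    F2_new = (1 - beta) * F2 + 4 * Mg**2 * Mf**2 / eps**2 * alpha**2 / beta
    D_new  = (1 - beta) * D  + 2 * Mg**2 * Mf**2 / eps**2 * alpha**2 / beta + beta * F2
    E2_new = (1 - beta) ** 2 * E2 + beta**2 / K3 * sigma3**2
    F2, D, E2 = F2_new, D_new, E2_new

print(f"t={T}:  D_t * t^(2a-2b)        = {D * T**(2*a - 2*b):.3e}")
print(f"t={T}:  E[||E_t||^2] * t^(b+e) = {E2 * T**(b + e):.3e}")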

A.5. Proof of Main Theorem 1

Theorem 2. Consider Algorithm 1 with the parameter setup given by: α_t = C_α/t^{1/5}, β_t = C_β, K_t^{(1)} = C_1 t^{4/5}, K_t^{(2)} = C_2 t^{4/5}, K_t^{(3)} = C_3 t^{4/5}, γ_t^{(1)} = C_γ μ^t, γ_t^{(2)} = 1 − (C_α^2/t^{2/5})(1 − C_γ μ^t)^2, for some positive constants C_α, C_β, C_1, C_2, C_3, C_γ, μ such that C_β < 1 and μ ∈ (0, 1). For any δ ∈ (0, 1), Algorithm 1 outputs, in expectation, a δ-approximate first-order stationary point x of J(x). That is:

E_total[∥∇_x J(x)∥_2^2] ≤ δ,

with "total" representing all incurred randomness. Moreover, Algorithm 1 acquires x with an overall oracle complexity for Oracle_f(·, ·) and Oracle_g(·, ·) of the order O(δ^{−9/4}).
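For concreteness, the following minimal sketch generates the parameter schedules named in the theorem statement; the numeric constants are placeholders, and the function is our own illustration rather than the paper's released implementation (β_t = C_β is a constant in (0, 1) and is omitted):

# Illustrative generator for the Theorem 2 parameter setup (placeholder constants).
import math

def cadam_schedules(t, C_alpha=0.01, C1=2, C2=2, C3=2, C_gamma=0.5, mu=0.9):
    """Return (alpha_t, K1_t, K2_t, K3_t, gamma1_t, gamma2_t) for iteration t >= 1."""
    alpha = C_alpha / t ** 0.2                   # alpha_t = C_alpha / t^{1/5}
    K1 = max(1, math.ceil(C1 * t ** 0.8))        # K_t^{(1)} = C_1 t^{4/5}
    K2 = max(1, math.ceil(C2 * t ** 0.8))        # K_t^{(2)} = C_2 t^{4/5}
    K3 = max(1, math.ceil(C3 * t ** 0.8))        # K_t^{(3)} = C_3 t^{4/5}
    gamma1 = C_gamma * mu ** t                   # gamma_t^{(1)} = C_gamma mu^t
    gamma2 = 1.0 - (C_alpha ** 2 / t ** 0.4) * (1.0 - gamma1) ** 2
    return alpha, K1, K2, K3, gamma1, gamma2

for t in (1, 10, 100, 1000):
    print(t, cadam_schedules(t))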

Proof. Let us study the change of the function between two consecutive iterations. Using that the function J(·) is L-Lipschitz smooth (see Lemma 4):

J(x_{t+1}) ≤ J(x_t) + ∇J(x_t)^T (x_{t+1} − x_t) + (L/2) ∥x_{t+1} − x_t∥_2^2    (17)
= J(x_t) − α_t Σ_{i=1}^{p} [∇J(x_t)]_i [m_t]_i/(√[v_t]_i + ε) + (L α_t^2/2) Σ_{i=1}^{p} [m_t]_i^2/(√[v_t]_i + ε)^2.

Now, let us introduce the mathematical expectation with respect to all randomness at iteration t, given a fixed iterate x_t, as E_t[·] = E_{K_t^{(1)}, K_t^{(2)}, K_t^{(3)}}[· | x_t]. This expectation takes into account all sampling done at iteration t of Algorithm 1. Throughout the rest of the proof we also write ∇J_t(x_t) := ∇g_t(x_t)^T ∇f_t(y_t) for the mini-batch gradient estimate used in Algorithm 1, and divisions by the vectors √v_t + ε and √(γ_t^{(2)} v_{t−1}) + ε are understood elementwise. It is easy to see that the following variables are t-measurable: J(x_t), m_t, v_t, x_{t+1}, z_{t+1}, y_{t+1}. On the other hand, the following variables are independent from the randomness introduced at iteration t: ∇J(x_{t−1}), m_{t−1}, v_{t−1}, x_t, z_t, y_t. Hence, taking the expectation E_t[·] of both sides of equation (17) gives:

E_t[J(x_{t+1})] ≤ J(x_t) − α_t Σ_{i=1}^{p} [∇J(x_t)]_i E_t[ [m_t]_i/(√[v_t]_i + ε) ] + (L α_t^2/2) Σ_{i=1}^{p} E_t[ [m_t]_i^2/(√[v_t]_i + ε)^2 ].    (18)

Now, let us focus on the second term in the above expression:

Σ_{i=1}^{p} [∇J(x_t)]_i E_t[ [m_t]_i/(√[v_t]_i + ε) ]
= Σ_{i=1}^{p} [∇J(x_t)]_i E_t[ [m_t]_i/(√[v_t]_i + ε) − [m_t]_i/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + [m_t]_i/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ]
= Σ_{i=1}^{p} [∇J(x_t)]_i E_t[ [m_t]_i/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ]   (=: A)
 + Σ_{i=1}^{p} [∇J(x_t)]_i E_t[ [m_t]_i/(√[v_t]_i + ε) − [m_t]_i/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ]   (=: B).

Please notice that from Assumptions 1.3 and 1.4 it follows immediately that:

∥∇J(x_t)∥_2 ≤ E_w[∥∇g_w(x_t)^T∥_2] E_v[∥∇f_v(E_w[g_w(x_t)])∥_2] ≤ M_g M_f,    (19)
∥∇J_t(x_t)∥_2 ≤ ∥∇g_t(x_t)^T∥_2 ∥∇f_t(y_t)∥_2 ≤ M_g M_f,

and applying induction:

∥m_t∥_2 ≤ γ_t^{(1)} M_g M_f + (1 − γ_t^{(1)}) M_g M_f = M_g M_f,  ∀t,
∥v_t∥_2 ≤ γ_t^{(2)} M_g^2 M_f^2 + (1 − γ_t^{(2)}) M_g^2 M_f^2 = M_g^2 M_f^2,  ∀t.

Now, let us apply (19) to the expression A:

A = Σ_{i=1}^{p} [∇J(x_t)]_i E_t[ [m_t]_i ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
= Σ_{i=1}^{p} [∇J(x_t)]_i ( γ_t^{(1)} [m_{t−1}]_i + (1 − γ_t^{(1)}) E_t[ [∇J_t(x_t)]_i ] )/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
= γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) + (1 − γ_t^{(1)}) ∇J(x_t)^T E_t[∇g_t(x_t)^T ∇f_t(y_t)]/(√(γ_t^{(2)} v_{t−1}) + ε)
= γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) + (1 − γ_t^{(1)}) ∇J(x_t)^T E_t[∇J(x_t) − ∇J(x_t) + ∇g_t(x_t)^T ∇f_t(y_t)]/(√(γ_t^{(2)} v_{t−1}) + ε)
= γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) + (1 − γ_t^{(1)}) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
 − (1 − γ_t^{(1)}) Σ_{i=1}^{p} E_t[ [∇J(x_t)]_i ([∇J(x_t)]_i − [∇g_t(x_t)^T ∇f_t(y_t)]_i) ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε),

where the last sum is denoted by A_1:

A_1 := Σ_{i=1}^{p} E_t[ [∇J(x_t)]_i ([∇J(x_t)]_i − [∇g_t(x_t)^T ∇f_t(y_t)]_i) ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε),

so that A = γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) + (1 − γ_t^{(1)}) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) − (1 − γ_t^{(1)}) A_1.

Using Assumption 2.2 we have:

A_1 = Σ_{i=1}^{p} E_t[ [∇J(x_t)]_i ([∇J(x_t)]_i − [∇g_t(x_t)^T ∇f_t(g(x_t))]_i) ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
 + Σ_{i=1}^{p} E_t[ [∇J(x_t)]_i ([∇g_t(x_t)^T ∇f_t(g(x_t))]_i − [∇g_t(x_t)^T ∇f_t(y_t)]_i) ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
= Σ_{i=1}^{p} E_t[ [∇J(x_t)]_i [∇g_t(x_t)^T (∇f_t(g(x_t)) − ∇f_t(y_t))]_i ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)

(the first sum vanishes since E_t[∇g_t(x_t)^T ∇f_t(g(x_t))] = ∇J(x_t) by Assumption 2.2)

= E_t[ Σ_{i=1}^{p} ( [∇J(x_t)]_i/√(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ) ( [∇g_t(x_t)^T ∇f_t(g(x_t)) − ∇g_t(x_t)^T ∇f_t(y_t)]_i/√(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ) ]
≤ (1/2) E_t[ Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ] + (1/2) E_t[ Σ_{i=1}^{p} [∇g_t(x_t)^T ∇f_t(g(x_t)) − ∇g_t(x_t)^T ∇f_t(y_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ]
≤ (1/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (1/(2ε)) E_t[ ∥∇g_t(x_t)^T ∇f_t(g(x_t)) − ∇g_t(x_t)^T ∇f_t(y_t)∥_2^2 ]
≤ (1/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (1/(2ε)) E_t[ ∥∇g_t(x_t)^T∥_2^2 ∥∇f_t(g(x_t)) − ∇f_t(y_t)∥_2^2 ]
≤ (1/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (M_g^2/(2ε)) E_t[ ∥∇f_t(g(x_t)) − ∇f_t(y_t)∥_2^2 ]
≤ (1/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (M_g^2/(2 K_t^{(1)} ε)) Σ_{a=1}^{K_t^{(1)}} E_t[ ∥∇f_{t_a}(g(x_t)) − ∇f_{t_a}(y_t)∥_2^2 ]
≤ (1/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (M_g^2 L_f^2/(2ε)) E_t[ ∥g(x_t) − y_t∥_2^2 ]
= (1/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (M_g^2 L_f^2/(2ε)) ∥g(x_t) − y_t∥_2^2.

Hence, we arrive at the following expression for A:

−A = −γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) − (1 − γ_t^{(1)}) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (1 − γ_t^{(1)}) A_1    (20)
≤ −γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) − (1 − γ_t^{(1)}) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
 + (1 − γ_t^{(1)}) [ (1/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (M_g^2 L_f^2/(2ε)) ∥g(x_t) − y_t∥_2^2 ]
= −γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) − ((1 − γ_t^{(1)})/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + ((1 − γ_t^{(1)})/(2ε)) M_g^2 L_f^2 ∥g(x_t) − y_t∥_2^2.

Now, let us consider the second term more carefully:

−B = −Σ_{i=1}^{p} [∇J(x_t)]_i E_t[ [m_t]_i/(√[v_t]_i + ε) − [m_t]_i/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ]
≤ Σ_{i=1}^{p} |[∇J(x_t)]_i| | E_t[ B_1 ] |,   where B_1 := [m_t]_i/(√[v_t]_i + ε) − [m_t]_i/(√(γ_t^{(2)} [v_{t−1}]_i) + ε).

For the expression B_1 we have:

B_1 = [m_t]_i/(√[v_t]_i + ε) − [m_t]_i/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
≤ |[m_t]_i| × | 1/(√[v_t]_i + ε) − 1/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) |
= ( |[m_t]_i|/((√[v_t]_i + ε)(√(γ_t^{(2)} [v_{t−1}]_i) + ε)) ) × | (1 − γ_t^{(2)}) [∇J_t(x_t)]_i^2/(√[v_t]_i + √(γ_t^{(2)} [v_{t−1}]_i)) |
= ( (1 − γ_t^{(2)}) |[m_t]_i|/((√[v_t]_i + ε)(√(γ_t^{(2)} [v_{t−1}]_i) + ε)) ) × [∇J_t(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i + (1 − γ_t^{(2)}) [∇J_t(x_t)]_i^2) + √(γ_t^{(2)} [v_{t−1}]_i))
≤ √(1 − γ_t^{(2)}) |[m_t]_i| |[∇J_t(x_t)]_i|/(ε (√(γ_t^{(2)} [v_{t−1}]_i) + ε))
≤ √(1 − γ_t^{(2)}) ( γ_t^{(1)} |[m_{t−1}]_i| + (1 − γ_t^{(1)}) |[∇J_t(x_t)]_i| ) |[∇J_t(x_t)]_i|/(ε (√(γ_t^{(2)} [v_{t−1}]_i) + ε))
= ( (1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/(ε (√(γ_t^{(2)} [v_{t−1}]_i) + ε)) ) [∇J_t(x_t)]_i^2 + ( γ_t^{(1)} √(1 − γ_t^{(2)})/(ε (√(γ_t^{(2)} [v_{t−1}]_i) + ε)) ) |[∇J_t(x_t)]_i| |[m_{t−1}]_i|
≤ ( 2(1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/(ε (√(γ_t^{(2)} [v_{t−1}]_i) + ε)) ) [∇J_t(x_t)]_i^2 + ( (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)}) (√(γ_t^{(2)} [v_{t−1}]_i) + ε)) ) [m_{t−1}]_i^2,

where in the last step we use |ab| ≤ α a^2 + (1/α) b^2 with α = (1 − γ_t^{(1)})/γ_t^{(1)}. Therefore, for the term B we have:

−B ≤ Σ_{i=1}^{p} |[∇J(x_t)]_i| E_t[ ( 2(1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/(ε (√(γ_t^{(2)} [v_{t−1}]_i) + ε)) ) [∇J_t(x_t)]_i^2 + ( (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)}) (√(γ_t^{(2)} [v_{t−1}]_i) + ε)) ) [m_{t−1}]_i^2 ]    (21)
= ( (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)})) ) Σ_{i=1}^{p} |[∇J(x_t)]_i| [m_{t−1}]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + ( (1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/ε ) Σ_{i=1}^{p} ( |[∇J(x_t)]_i|/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ) E_t[ [∇J_t(x_t)]_i^2 ].

Using that [v_t]_i ≥ 0 for any t and applying (19), we get immediately:

−B ≤ ( (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)})) ) Σ_{i=1}^{p} |[∇J(x_t)]_i| [m_{t−1}]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + ( (1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/ε ) Σ_{i=1}^{p} ( |[∇J(x_t)]_i|/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ) E_t[ [∇J_t(x_t)]_i^2 ]
≤ M_g M_f ( (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)})) ) Σ_{i=1}^{p} |[∇J(x_t)]_i [m_{t−1}]_i|/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + M_g M_f ( (1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/ε ) Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε).

Finally, we have the bound for the second term in expression (18):

−Σ_{i=1}^{p} [∇J(x_t)]_i E_t[ [m_t]_i/(√[v_t]_i + ε) ] = −A − B ≤    (22)
−γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) − ((1 − γ_t^{(1)})/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + ((1 − γ_t^{(1)})/(2ε)) M_g^2 L_f^2 ∥g(x_t) − y_t∥_2^2
+ M_g M_f ( (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)})) ) Σ_{i=1}^{p} |[∇J(x_t)]_i [m_{t−1}]_i|/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
+ M_g M_f ( (1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/ε ) Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε).

Now, we can focus on the third term in expression (18). Applying [v_t]_i ≥ γ_t^{(2)} [v_{t−1}]_i we have:

Σ_{i=1}^{p} E_t[ [m_t]_i^2/(√[v_t]_i + ε)^2 ] ≤ Σ_{i=1}^{p} E_t[ [m_t]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)^2 ] = Σ_{i=1}^{p} E_t[ [m_t]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)^2
= Σ_{i=1}^{p} E_t[ [γ_t^{(1)} m_{t−1} + (1 − γ_t^{(1)}) ∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)^2
≤ Σ_{i=1}^{p} ( 2(γ_t^{(1)})^2 E_t[ [m_{t−1}]_i^2 ] + 2(1 − γ_t^{(1)})^2 E_t[ [∇J_t(x_t)]_i^2 ] )/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)^2
= 2(γ_t^{(1)})^2 Σ_{i=1}^{p} [m_{t−1}]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)^2 + 2(1 − γ_t^{(1)})^2 Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)^2
≤ (2(γ_t^{(1)})^2/ε) Σ_{i=1}^{p} [m_{t−1}]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (2(1 − γ_t^{(1)})^2/ε) Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε).

Hence, we arrive at the following expression for the change of the function between two consecutive iterations:

E_t[J(x_{t+1})] − J(x_t) ≤    (23)
−α_t γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) − (α_t (1 − γ_t^{(1)})/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
+ (α_t (1 − γ_t^{(1)})/(2ε)) M_g^2 L_f^2 ∥g(x_t) − y_t∥_2^2 + (α_t M_g M_f (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)}))) Σ_{i=1}^{p} |[∇J(x_t)]_i [m_{t−1}]_i|/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
+ (α_t M_g M_f (1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/ε) Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
+ (L α_t^2 (γ_t^{(1)})^2/ε) Σ_{i=1}^{p} [m_{t−1}]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (L α_t^2 (1 − γ_t^{(1)})^2/ε) Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
= −(α_t (1 − γ_t^{(1)})/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (α_t (1 − γ_t^{(1)})/(2ε)) M_g^2 L_f^2 ∥g(x_t) − y_t∥_2^2
+ [ L α_t^2 (1 − γ_t^{(1)})^2/ε + α_t M_g M_f (1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/ε ] Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
− α_t γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) + (L α_t^2 (γ_t^{(1)})^2/ε) Σ_{i=1}^{p} [m_{t−1}]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
+ (α_t M_g M_f (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)}))) Σ_{i=1}^{p} |[∇J(x_t)]_i [m_{t−1}]_i|/(√(γ_t^{(2)} [v_{t−1}]_i) + ε).

Please notice that from Assumption 2.3 we immediately have the following bounds on the variances of ∇g_t(x) and ∇f_t(y):

E_t[∥∇g_t(x) − ∇g(x)∥_2^2] ≤ σ_2^2/K_t^{(1)},   E_t[∥∇f_t(y) − ∇f(y)∥_2^2] ≤ σ_1^2/K_t^{(2)}.    (24)

Now, let us focus on the term E_t[ [∇J_t(x_t)]_i^2 ]. Using Assumption 2.2 we have:

E_t[ ( [∇J_t(x_t)]_i − [∇g(x_t)^T ∇f(y_t)]_i )^2 ] = E_t[ [∇J_t(x_t)]_i^2 ] − [∇g(x_t)^T ∇f(y_t)]_i^2.    (25)

On the other hand, using the property of the matrix norm ∥·∥_2 in footnote 15 and Assumptions 1.1, 1.3, and 2.3:

E_t[ ( [∇J_t(x_t)]_i − [∇g(x_t)^T ∇f(y_t)]_i )^2 ] = E_t[ ( [∇g_t(x_t)^T ∇f_t(y_t)]_i − [∇g(x_t)^T ∇f(y_t)]_i )^2 ]
= E_t[ [∇g_t(x_t)^T ∇f_t(y_t) − ∇g(x_t)^T ∇f_t(y_t) + ∇g(x_t)^T ∇f_t(y_t) − ∇g(x_t)^T ∇f(y_t)]_i^2 ]
≤ 2 E_t[ [(∇g_t(x_t)^T − ∇g(x_t)^T) ∇f_t(y_t)]_i^2 ] + 2 E_t[ [∇g(x_t)^T (∇f_t(y_t) − ∇f(y_t))]_i^2 ]
= 2 E_t[ ( Σ_{r=1}^{q} [∇g_t(x_t)^T − ∇g(x_t)^T]_{ir} [∇f_t(y_t)]_r )^2 ] + 2 E_t[ ( Σ_{r=1}^{q} [∇g(x_t)^T]_{ir} [∇f_t(y_t) − ∇f(y_t)]_r )^2 ]
≤ 2q E_t[ Σ_{r=1}^{q} [∇g_t(x_t)^T − ∇g(x_t)^T]_{ir}^2 [∇f_t(y_t)]_r^2 ] + 2q E_t[ Σ_{r=1}^{q} [∇g(x_t)^T]_{ir}^2 [∇f_t(y_t) − ∇f(y_t)]_r^2 ]
≤ 2q E_t[ ∥∇g_t(x_t)^T − ∇g(x_t)^T∥_2^2 Σ_{r=1}^{q} [∇f_t(y_t)]_r^2 ] + 2q E_t[ ∥∇f_t(y_t) − ∇f(y_t)∥_2^2 Σ_{r=1}^{q} [∇g(x_t)^T]_{ir}^2 ]
≤ 2q E_t[ ∥∇g_t(x_t)^T − ∇g(x_t)^T∥_2^2 ∥∇f_t(y_t)∥_2^2 ] + 2q E_t[ ∥∇f_t(y_t) − ∇f(y_t)∥_2^2 ∥∇g(x_t)^T∥_2^2 ]
≤ 2q M_f^2 E_t[ ∥∇g_t(x_t)^T − ∇g(x_t)^T∥_2^2 ] + 2q M_g^2 E_t[ ∥∇f_t(y_t) − ∇f(y_t)∥_2^2 ]
≤ 2q M_f^2 E_t[ ∥∇g_t(x_t) − ∇g(x_t)∥_2^2 ] + 2q M_g^2 E_t[ ∥∇f_t(y_t) − ∇f(y_t)∥_2^2 ]
≤ 2q M_f^2 σ_2^2/K_t^{(1)} + 2q M_g^2 σ_1^2/K_t^{(2)},

where in the last step we use (24). Combining this result with (25) gives:

E_t[ [∇J_t(x_t)]_i^2 ] ≤ 2q M_f^2 σ_2^2/K_t^{(1)} + 2q M_g^2 σ_1^2/K_t^{(2)} + [∇g(x_t)^T ∇f(y_t)]_i^2.

Moreover,

[∇g(x_t)^T ∇f(y_t)]_i^2 = [∇g(x_t)^T ∇f(y_t) − ∇g(x_t)^T ∇f(g(x_t)) + ∇g(x_t)^T ∇f(g(x_t))]_i^2
= [∇g(x_t)^T (∇f(y_t) − ∇f(g(x_t))) + ∇g(x_t)^T ∇f(g(x_t))]_i^2
≤ 2 [∇g(x_t)^T ∇f(g(x_t))]_i^2 + 2 [∇g(x_t)^T (∇f(y_t) − ∇f(g(x_t)))]_i^2
≤ 2 [∇J(x_t)]_i^2 + 2 ∥∇g(x_t)^T∥_2^2 ∥∇f(y_t) − ∇f(g(x_t))∥_2^2
≤ 2 [∇J(x_t)]_i^2 + 2 M_g^2 L_f^2 ∥y_t − g(x_t)∥_2^2.

Hence, we have

Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ≤ Σ_{i=1}^{p} ( 2q M_f^2 σ_2^2/K_t^{(1)} + 2q M_g^2 σ_1^2/K_t^{(2)} + [∇g(x_t)^T ∇f(y_t)]_i^2 )/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)    (26)
≤ 2pq M_f^2 σ_2^2/(ε K_t^{(1)}) + 2pq M_g^2 σ_1^2/(ε K_t^{(2)}) + Σ_{i=1}^{p} ( 2 [∇J(x_t)]_i^2 + 2 M_g^2 L_f^2 ∥y_t − g(x_t)∥_2^2 )/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
≤ 2pq M_f^2 σ_2^2/(ε K_t^{(1)}) + 2pq M_g^2 σ_1^2/(ε K_t^{(2)}) + (2p M_g^2 L_f^2/ε) ∥y_t − g(x_t)∥_2^2 + 2 Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε).

^15 Particularly, we use that for any matrix A ∈ R^{p×q}: ∥A∥_2^2 = sup_{v:∥v∥=1} v^T A A^T v ≥ Σ_{r=1}^{q} [A]_{ir}^2 for any i = 1, ..., p. To see this, just take v = e_i.

Hence, expression (23) can be simplified as follows:

E_t[J(x_{t+1})] − J(x_t) ≤ −(α_t (1 − γ_t^{(1)})/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (α_t (1 − γ_t^{(1)})/(2ε)) M_g^2 L_f^2 ∥g(x_t) − y_t∥_2^2    (27)
+ [ L α_t^2 (1 − γ_t^{(1)})^2/ε + α_t M_g M_f (1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/ε ] Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
− α_t γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) + (L α_t^2 (γ_t^{(1)})^2/ε) Σ_{i=1}^{p} [m_{t−1}]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
+ (α_t M_g M_f (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)}))) Σ_{i=1}^{p} |[∇J(x_t)]_i [m_{t−1}]_i|/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
≤ −(α_t (1 − γ_t^{(1)})/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (α_t (1 − γ_t^{(1)})/(2ε)) M_g^2 L_f^2 ∥g(x_t) − y_t∥_2^2
+ α_t γ_t^{(1)} M_g^2 M_f^2/ε + L α_t^2 (γ_t^{(1)})^2 M_g^2 M_f^2/ε^2 + α_t M_g^3 M_f^3 (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε^2 (1 − γ_t^{(1)}))
+ [ L α_t^2 (1 − γ_t^{(1)})^2/ε + α_t M_g M_f (1 − γ_t^{(1)}) √(1 − γ_t^{(2)})/ε ] Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε),

where we used (19):

−α_t γ_t^{(1)} ∇J(x_t)^T m_{t−1}/(√(γ_t^{(2)} v_{t−1}) + ε) ≤ (α_t γ_t^{(1)}/ε) ∥∇J(x_t)∥_2 ∥m_{t−1}∥_2 ≤ α_t γ_t^{(1)} M_g^2 M_f^2/ε,
(L α_t^2 (γ_t^{(1)})^2/ε) Σ_{i=1}^{p} [m_{t−1}]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ≤ L α_t^2 (γ_t^{(1)})^2 M_g^2 M_f^2/ε^2,
(α_t M_g M_f (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)}))) Σ_{i=1}^{p} |[∇J(x_t)]_i [m_{t−1}]_i|/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
 ≤ (α_t M_g M_f (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε^2 (1 − γ_t^{(1)}))) Σ_{i=1}^{p} |[∇J(x_t)]_i [m_{t−1}]_i|
 ≤ (α_t M_g M_f (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε^2 (1 − γ_t^{(1)}))) (1/2)(∥∇J(x_t)∥_2^2 + ∥m_{t−1}∥_2^2) ≤ α_t M_g^3 M_f^3 (γ_t^{(1)})^2 √(1 − γ_t^{(2)})/(ε^2 (1 − γ_t^{(1)})).

Hence, applying (26) in (27) gives:

E_t[J(x_{t+1})] − J(x_t) ≤    (28)
−(α_t (1 − γ_t^{(1)})/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (α_t (1 − γ_t^{(1)})/(2ε)) M_g^2 L_f^2 ∥g(x_t) − y_t∥_2^2
+ (α_t γ_t^{(1)} M_g^2 M_f^2/ε) [ 1 + L α_t γ_t^{(1)}/ε + M_g M_f γ_t^{(1)} √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)})) ]
+ (α_t (1 − γ_t^{(1)})/ε) [ L α_t (1 − γ_t^{(1)}) + M_g M_f √(1 − γ_t^{(2)}) ] Σ_{i=1}^{p} E_t[ [∇J_t(x_t)]_i^2 ]/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)
≤ −(α_t (1 − γ_t^{(1)})/2) Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (α_t (1 − γ_t^{(1)})/(2ε)) M_g^2 L_f^2 ∥g(x_t) − y_t∥_2^2
+ (α_t γ_t^{(1)} M_g^2 M_f^2/ε) [ 1 + L α_t γ_t^{(1)}/ε + M_g M_f γ_t^{(1)} √(1 − γ_t^{(2)})/(ε (1 − γ_t^{(1)})) ]
+ (α_t (1 − γ_t^{(1)})/ε) [ L α_t (1 − γ_t^{(1)}) + M_g M_f √(1 − γ_t^{(2)}) ] [ 2pq M_f^2 σ_2^2/(ε K_t^{(1)}) + 2pq M_g^2 σ_1^2/(ε K_t^{(2)}) + (2p M_g^2 L_f^2/ε) ∥y_t − g(x_t)∥_2^2 + 2 Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) ].

Grouping the terms in (28) gives:

E_t[J(x_{t+1})] − J(x_t) ≤ −(α_t (1 − γ_t^{(1)})/2) [ 1 − (4/ε)( L α_t (1 − γ_t^{(1)}) + M_g M_f √(1 − γ_t^{(2)}) ) ] Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε)    (29)
+ (α_t (1 − γ_t^{(1)}) M_g^2 L_f^2/(2ε)) [ 1 + (4p/ε)( L α_t (1 − γ_t^{(1)}) + M_g M_f √(1 − γ_t^{(2)}) ) ] ∥g(x_t) − y_t∥_2^2
+ (2pq α_t (1 − γ_t^{(1)})/ε^2) [ L α_t (1 − γ_t^{(1)}) + M_g M_f √(1 − γ_t^{(2)}) ] ( M_f^2 σ_2^2/K_t^{(1)} + M_g^2 σ_1^2/K_t^{(2)} )
+ (α_t γ_t^{(1)} M_g^2 M_f^2/ε) [ 1 + γ_t^{(1)} ( L α_t (1 − γ_t^{(1)}) + M_g M_f √(1 − γ_t^{(2)}) )/(ε (1 − γ_t^{(1)})) ]
= −(α_t (1 − γ_t^{(1)})/2) [1 − 4C_t] Σ_{i=1}^{p} [∇J(x_t)]_i^2/(√(γ_t^{(2)} [v_{t−1}]_i) + ε) + (α_t (1 − γ_t^{(1)}) M_g^2 L_f^2/(2ε)) [1 + 4pC_t] ∥g(x_t) − y_t∥_2^2
+ (2pq α_t C_t (1 − γ_t^{(1)})/ε) ( M_f^2 σ_2^2/K_t^{(1)} + M_g^2 σ_1^2/K_t^{(2)} ) + (α_t γ_t^{(1)} M_g^2 M_f^2/ε) [ 1 + γ_t^{(1)} C_t/(1 − γ_t^{(1)}) ]
≤ −(α_t (1 − γ_t^{(1)})/(2(M_g M_f + ε))) [1 − 4C_t] ∥∇J(x_t)∥_2^2 + (α_t (1 − γ_t^{(1)}) M_g^2 L_f^2/(2ε)) [1 + 4pC_t] ∥g(x_t) − y_t∥_2^2
+ (2pq α_t C_t (1 − γ_t^{(1)})/ε) ( M_f^2 σ_2^2/K_t^{(1)} + M_g^2 σ_1^2/K_t^{(2)} ) + (α_t γ_t^{(1)} M_g^2 M_f^2/ε) [ 1 + γ_t^{(1)} C_t/(1 − γ_t^{(1)}) ],

where we use the notation C_t = ( L α_t (1 − γ_t^{(1)}) + M_g M_f √(1 − γ_t^{(2)}) )/ε and the fact that [v_{t−1}]_i ≤ M_g^2 M_f^2.

Next, let E_total[·] denote the expectation with respect to all randomness induced in all T iterations of Algorithm 1. Using the law of total expectation:

E_total[ E_t[ζ_t] ] = E_total[ E_{K_t^{(1)}, K_t^{(2)}, K_t^{(3)}}[ ζ_t | x_t ] ] = E_total[ζ_t]

for any t-measurable random variable ζ_t (a random variable is called t-measurable if it is affected only by the randomness induced in the first t rounds). Hence, taking the expectation E_total of both sides of (29) gives:

E_total[J(x_{t+1}) − J(x_t)] ≤    (30)
−(α_t (1 − γ_t^{(1)})/(2(M_g M_f + ε))) [1 − 4C_t] E_total[∥∇J(x_t)∥_2^2] + (α_t (1 − γ_t^{(1)}) M_g^2 L_f^2/(2ε)) [1 + 4pC_t] E_total[∥g(x_t) − y_t∥_2^2]
+ (2pq α_t C_t (1 − γ_t^{(1)})/ε) ( M_f^2 σ_2^2/K_t^{(1)} + M_g^2 σ_1^2/K_t^{(2)} ) + (α_t γ_t^{(1)} M_g^2 M_f^2/ε) [ 1 + γ_t^{(1)} C_t/(1 − γ_t^{(1)}) ].

Now, let α_t = C_α/t^a, β_t = C_β/t^b, K_t^{(1)} = C_1 t^c, K_t^{(2)} = C_2 t^c, and K_t^{(3)} = C_3 t^e for some constants C_α, C_β, C_1, C_2, C_3, a, b, c, e > 0 such that (2a − 2b) ∉ [−1, 0] and b ≤ 1. Following Corollary 2 we have:

E_total[∥g(x_t) − y_t∥_2^2] ≤ (L_g^2 C_D^2/2)(1/t^{4a−4b}) + 2 C_E^2 (1/t^{b+e})

for some constants C_D, C_E > 0, and (30) can be written as:

E_total[J(x_{t+1}) − J(x_t)] ≤    (31)
−(C_α (1 − γ_t^{(1)})/(2 t^a (M_g M_f + ε))) [1 − 4C_t] E_total[∥∇J(x_t)∥_2^2] + (C_α (1 − γ_t^{(1)}) M_g^2 L_f^2/(2 t^a ε)) [1 + 4pC_t] [ (L_g^2 C_D^2/2)(1/t^{4a−4b}) + 2 C_E^2 (1/t^{b+e}) ]
+ (2pq C_α C_t (1 − γ_t^{(1)})/(t^a ε)) ( M_f^2 σ_2^2/(C_1 t^c) + M_g^2 σ_1^2/(C_2 t^c) ) + (C_α γ_t^{(1)} M_g^2 M_f^2/(t^a ε)) [ 1 + γ_t^{(1)} C_t/(1 − γ_t^{(1)}) ],

with C_t = ( L C_α (1 − γ_t^{(1)})/t^a + M_g M_f √(1 − γ_t^{(2)}) )/ε. By choosing γ_t^{(2)} = 1 − (C_α^2/t^{2a})(1 − γ_t^{(1)})^2, so that √(1 − γ_t^{(2)}) = (C_α/t^a)(1 − γ_t^{(1)}), we have C_t = C_α (L + M_g M_f)(1 − γ_t^{(1)})/(ε t^a) ≤ C_α (L + M_g M_f)/ε. By choosing C_α ≤ ε/(8p(L + M_g M_f)) we have 8pC_t ≤ 1; hence, (31) can be

simplified as:

E_total[J(x_{t+1}) − J(x_t)] ≤    (32)
−(C_α (1 − γ_t^{(1)})/(2 t^a (M_g M_f + ε))) [1 − 1/(2p)] E_total[∥∇J(x_t)∥_2^2] + (C_α (1 − γ_t^{(1)}) M_g^2 L_f^2/(t^a ε)) [ (L_g^2 C_D^2/2)(1/t^{4a−4b}) + 2 C_E^2 (1/t^{b+e}) ]
+ (q C_α (C_1 M_g^2 σ_1^2 + C_2 M_f^2 σ_2^2)/(2 C_1 C_2 ε)) (1 − γ_t^{(1)})/t^{a+c} + (C_α M_g^2 M_f^2/ε) [ 1 + γ_t^{(1)}/(4p(1 − γ_t^{(1)})) ] γ_t^{(1)}/t^a
≤ −(C_α (1 − γ_t^{(1)})/(4 t^a (M_g M_f + ε))) E_total[∥∇J(x_t)∥_2^2] + (C_α (1 − γ_t^{(1)}) M_g^2 L_f^2/(t^a ε)) [ (L_g^2 C_D^2/2)(1/t^{4a−4b}) + 2 C_E^2 (1/t^{b+e}) ]
+ (q C_α (C_1 M_g^2 σ_1^2 + C_2 M_f^2 σ_2^2)/(2 C_1 C_2 ε)) (1 − γ_t^{(1)})/t^{a+c} + (C_α M_g^2 M_f^2/ε) [ 1 + γ_t^{(1)}/(4p(1 − γ_t^{(1)})) ] γ_t^{(1)}/t^a.

Assuming that γ_t^{(1)} = C_γ μ^t for some C_γ ∈ (0, 1/2) and μ ∈ [0, 1), expression (32) gives:

E_total[J(x_{t+1}) − J(x_t)] ≤ −(C_α/(8 t^a (M_g M_f + ε))) E_total[∥∇J(x_t)∥_2^2] + (C_α M_g^2 L_f^2/(t^a ε)) [ (L_g^2 C_D^2/2)(1/t^{4a−4b}) + 2 C_E^2 (1/t^{b+e}) ]
+ (q C_α (C_1 M_g^2 σ_1^2 + C_2 M_f^2 σ_2^2)/(2 C_1 C_2 ε)) (1/t^{a+c}) + (2 C_α M_g^2 M_f^2/ε) μ^t/t^a.

Finally, choosing

C_α = min{ ε/(8p(L + M_g M_f)),  ε/(2 M_g^2 M_f^2),  2 C_1 C_2 ε/(q (C_1 M_g^2 σ_1^2 + C_2 M_f^2 σ_2^2)),  2ε/(M_g^2 L_f^2 L_g^2 C_D^2),  ε/(2 M_g^2 L_f^2 C_E^2) },

we have:

E_total[J(x_{t+1}) − J(x_t)] ≤ −(C_0/t^a) E_total[∥∇J(x_t)∥_2^2] + 1/t^{5a−4b} + 1/t^{a+b+e} + 1/t^{a+c} + μ^t/t^a,

where C_0 = C_α/(8(M_g M_f + ε)). Therefore,

E_total[∥∇J(x_t)∥_2^2] ≤ (t^a/C_0) E_total[J(x_t) − J(x_{t+1})] + 1/(C_0 t^{4a−4b}) + 1/(C_0 t^{b+e}) + 1/(C_0 t^c) + μ^t/C_0.    (33)

Taking the summation in (33) over t = 1, ..., T and dividing the result by T gives:

(1/T) Σ_{t=1}^{T} E_total[∥∇J(x_t)∥_2^2] ≤ (1/(C_0 T)) Σ_{t=1}^{T} t^a E_total[J(x_t) − J(x_{t+1})] + (1/(C_0 T)) Σ_{t=1}^{T} [ 1/t^{4a−4b} + 1/t^{b+e} + 1/t^c ] + 1/(C_0 T (1 − μ)).    (34)

Notice that, using the first-order concavity condition for the function f(t) = t^a (for a ∈ (0, 1)) and Assumption 1.1:

Σ_{t=1}^{T} t^a E_total[J(x_t) − J(x_{t+1})] = J(x_1) + Σ_{t=2}^{T} [(t+1)^a − t^a] E_total[J(x_t)] ≤ B_f + Σ_{t=2}^{T} a t^{a−1} E_total[J(x_t)] ≤ B_f + B_f Σ_{t=2}^{T} a t^{a−1} ≤ B_f + B_f a T^a.

Hence, for (34) we have (for 4a − 4b ≤ 1):

(1/T) Σ_{t=1}^{T} E_total[∥∇J(x_t)∥_2^2] ≤ (B_f/C_0)(1/T) + (aB_f/C_0)(1/T^{1−a}) + (1/T) O( T^{4b−4a+1} I{4a − 4b ≠ 1} ) + (1/T) O( log T · I{4a − 4b = 1} )    (35)
+ (1/T) O( T^{1−b−e} I{b + e ≠ 1} + log T · I{b + e = 1} ) + (1/T) O( T^{1−c} I{c ≠ 1} + log T · I{c = 1} ) + (1/(C_0 (1 − μ)))(1/T)
= ( B_f/C_0 + 1/(C_0 (1 − μ)) )(1/T) + (aB_f/C_0)(1/T^{1−a}) + O( (1/T^{4a−4b}) I{4a − 4b ≠ 1} + (1/T^{b+e}) I{b + e ≠ 1} + (1/T^c) I{c ≠ 1} )
+ O( log T/T ) [ I{4a − 4b = 1} + I{b + e = 1} + I{c = 1} ],

where I{condition} = 1 if the condition is satisfied, and 0 otherwise. Notice that the oracle complexity per iteration of Algorithm 1 is O(t^{max{c,e}}). Hence, after T iterations, the total first-order oracle complexity is O(T^{1+max{c,e}}).

Let us denote φ(a, b, c, e) = min{1 − a, 4a − 4b, b + e, c}. Then, ignoring logarithmic factors log T, we have that

(1/T) Σ_{t=1}^{T} E_total[∥∇J(x_t)∥_2^2] ≤ O( 1/T^{φ(a,b,c,e)} ) ≤ δ

implies T = O( δ^{−1/φ(a,b,c,e)} ), and for the first-order oracle complexity we have the expression Ψ(a, b, c, e) = δ^{−(1+max{c,e})/φ(a,b,c,e)}. Hence, we have the following optimisation problem to find the optimal setup of the parameters a, b, c, e:

min_{a,b,c,e}  (1 + max{c, e}) / min{1 − a, 4a − 4b, b + e, c}    (36)
s.t.  a > b,  0 ≤ a ≤ 1,  0 ≤ b ≤ 1,  4a − 4b ≤ 1,  c ≥ 0,  e ≥ 0.

Considering all possible cases (8 in total), it is easy to see that the optimal setup is given by a* = 1/5, b* = 0, c* = e* = 4/5, and the overall complexity is given by:

Ψ(a*, b*, c*, e*) = δ^{−9/4}.

This implies that Algorithm 1 in expectation outputs a δ-approximate first-order stationary point of the function J(x) and requires O(δ^{−9/4}) calls to the oracles FOO_f[·, ·] and FOO_g[·, ·].
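The case analysis behind (36) can also be cross-checked by a coarse brute-force search; the grid below is an illustration only and recovers the ratio 9/4 at a = 1/5, b = 0, c = e = 4/5:

# Brute-force check of the parameter optimisation (36) on a coarse grid (illustration only).
import itertools
import numpy as np

ab_grid = np.linspace(0.0, 1.0, 21)      # a, b on a step-0.05 grid
ce_grid = np.linspace(0.0, 2.0, 41)      # c, e on a step-0.05 grid over [0, 2]
best = (np.inf, None)
for a, b in itertools.product(ab_grid, ab_grid):
    if not (a > b and 4 * (a - b) <= 1.0):
        continue
    for c, e in itertools.product(ce_grid, ce_grid):
        phi = min(1.0 - a, 4 * (a - b), b + e, c)
        if phi <= 0:
            continue
        val = (1.0 + max(c, e)) / phi
        if val < best[0]:
            best = (val, (a, b, c, e))
print(best)   # expected: approximately (2.25, (0.2, 0.0, 0.8, 0.8))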

B. Additional Experiment Details

B.1. Portfolio Mean-Variance

In this section, we present numerical results for sparse mean-variance optimization problems on real-world portfolio datasets from CRSP^17, which are formed on size and: 1) Book-to-Market (BM), 2) Operating Profitability (OP), and 3) Investment (INV). We consider three large 100-portfolio datasets (m = 13781, n = 100) and eighteen region-based medium 25-portfolio datasets (m = 7240, n = 25).

^17 https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

[Figure 3: objective gap (log scale) vs. #grad/n on the 100_INV, 100_ME, and 100_OP datasets, comparing C-ADAM, ASC-PG, ASCVRG, SCGD, VRSC-PG, and AGD.]
Figure 3. Performance on 3 large 100-portfolio datasets (m = 13781, n = 100).

[Figure 4: objective gap (log scale) vs. normalized number of queries on the 100_INV, 100_ME, and 100_OP datasets, comparing the same solvers.]
Figure 4. Performance on 3 larger 100-portfolio datasets (m = 137810, n = 100).

[Figure 5: objective gap (log scale) vs. normalized number of queries on the 6 Regions All (m = 130320, n = 25) and 100_Full (m = 41343, n = 100) datasets, comparing the same solvers.]
Figure 5. Performance on all region-based (m = 130320, n = 25) and all 100-portfolio datasets (m = 41343, n = 100).

Given n assets and the reward vectors at m time points, the goal of sparse mean-variance optimization is to maximize the return of the investment as well as to control the investment risk:

min_{x∈X}  (1/m) Σ_{i=1}^{m} ( r_i^T x − (1/m) Σ_{j=1}^{m} r_j^T x )^2 − (1/m) Σ_{i=1}^{m} r_i^T x,

which is in the form of Eq. (1) with p = n, q = n + 1,

g_ω(x) = g_j(x) = [x; −r_j^T x],   f_ν(g_ω(x)) = f_i(g_j(x)) = ([r_i^T, 1] g_j(x))^2 − [r_i^T, 0] g_j(x),

for x ∈ R^p and a bounded set X.
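To make the compositional structure above concrete, the following sketch (our own illustration on synthetic returns, not the paper's experiment code) evaluates the inner map g and the outer maps f_i and checks that their composition reproduces the mean-variance objective:

# Illustrative check that the maps g_j, f_i above compose to the mean-variance objective.
# Synthetic reward vectors r_i are used here; the paper uses the CRSP portfolio datasets.
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 10
R = rng.normal(0.001, 0.02, size=(m, n))       # rows r_i^T of synthetic returns
x = rng.dirichlet(np.ones(n))                  # an arbitrary feasible portfolio

def g(x):
    """E_w[g_w(x)]: stack x with the average of -r_j^T x (so q = n + 1)."""
    return np.concatenate([x, [-(R @ x).mean()]])

def f_i(i, y):
    """f_i(y) = ([r_i^T, 1] y)^2 - [r_i^T, 0] y."""
    ri = R[i]
    return (ri @ y[:n] + y[n]) ** 2 - ri @ y[:n]

composed = np.mean([f_i(i, g(x)) for i in range(m)])
direct = np.mean((R @ x - (R @ x).mean()) ** 2) - (R @ x).mean()
assert np.isclose(composed, direct)
print(f"objective = {direct:.6e} (compositional and direct forms agree)")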

Figure 3 shows that C-ADAM outperforms the other algorithms on the large datasets. AGD performs the worst because it has the highest per-iteration sample complexity on large datasets. Figures 6, 7, and 8 show that C-ADAM outperforms the other algorithms on the Operating Profitability, Investment, and Book-to-Market datasets, respectively. This is consistent with the better complexity bound of C-ADAM. Overall, C-ADAM has the potential to be a benchmark algorithm for convex composition optimization.

[Figure 6: objective gap (log scale) vs. normalized number of queries on the Japan, North_AM, Global, Global_ex_US, Europe, and AP_ex_Japan OP datasets, comparing C-ADAM, ASC-PG, ASCVRG, SCGD, VRSC-PG, and AGD.]
Figure 6. Performance on 6 region-based Operating Profitability datasets (m = 7240, n = 25).

[Figure 7: objective gap (log scale) vs. normalized number of queries on the corresponding region-based INV datasets, comparing the same solvers.]
Figure 7. Performance on 6 region-based Investment datasets (m = 7240, n = 25).

[Figure 8: objective gap (log scale) vs. normalized number of queries on the corresponding region-based ME (Book-to-Market) datasets (m = 7240, n = 25), comparing the same solvers.]

B.2. Compositional MAML

In the few-shot supervised regression problem, the goal is to predict the outputs of a sine wave T_k from only a few datapoints ξ_obs sampled from P_{T_k}^{(data)}(·), after training on many functions T_k using observations of the form ξ = (ξ_obs, ξ_target). The amplitude and phase of the sinusoid are varied between tasks. Specifically, for each task the underlying function is

ξ_target = a sin(ξ_obs + b),

where a ∈ [0.1, 5.0] and b ∈ [0, 2π]. The goal is to learn to find ξ_target given ξ_obs based on M = 10 (ξ_obs, ξ_target) pairs. The loss for task T_k is the mean-squared error between the model's output for ξ_obs and the corresponding target values ξ_target:

L_{T_k}(NN(x); ξ) = Σ_{(ξ_obs, ξ_target) ∼ P_{T_k}^{(data)}(·)} ∥NN_x(ξ_obs) − ξ_target∥_2^2.

During training and testing, datapoints ξ_obs are sampled uniformly from [−5.0, 5.0]. The free parameters in Algorithm 1 were set to C_α = 0.001, C_β = 0.99, K_t^{(1)} = K_t^{(2)} = K_t^{(3)} = 10, and C_γ = 1.
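A minimal sketch of the task sampling and per-task loss described above is given below; PyTorch is assumed, and the two-hidden-layer regressor is a placeholder rather than necessarily the architecture used in the experiments:

# Minimal sketch of the few-shot sine regression setup (placeholder network; PyTorch assumed).
import torch

def sample_task(rng):
    """Draw amplitude a in [0.1, 5.0] and phase b in [0, 2*pi] for one task T_k."""
    a = 0.1 + 4.9 * torch.rand(1, generator=rng)
    b = 2 * torch.pi * torch.rand(1, generator=rng)
    return a, b

def sample_batch(a, b, M, rng):
    """Sample M pairs (xi_obs, xi_target) with xi_obs ~ U[-5, 5]."""
    xi_obs = -5.0 + 10.0 * torch.rand(M, 1, generator=rng)
    return xi_obs, a * torch.sin(xi_obs + b)

net = torch.nn.Sequential(                       # placeholder regressor NN_x
    torch.nn.Linear(1, 40), torch.nn.ReLU(),
    torch.nn.Linear(40, 40), torch.nn.ReLU(),
    torch.nn.Linear(40, 1),
)

rng = torch.Generator().manual_seed(0)
a, b = sample_task(rng)
xi_obs, xi_target = sample_batch(a, b, M=10, rng=rng)
loss = ((net(xi_obs) - xi_target) ** 2).sum()    # L_{T_k}(NN(x); xi), summed over the M pairs
loss.backward()
print(f"task: a={a.item():.2f}, b={b.item():.2f}, loss={loss.item():.3f}")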

C. Image Classification

We observe the critical impact of the number of meta-steps on MiniImagenet: our algorithm's limitation of being able to optimise for only one meta-step significantly reduces overall performance on this task across all optimizers. We wish to extend our algorithm's theoretical guarantees to higher meta-step scenarios in the future.


[Figure 9: meta-test accuracy and meta-test loss (log scale) vs. number of samples and number of gradient steps, comparing Transfer, Random, C-ADAM, ASC-PG, SCGD, ADAM, Adagrad, ASGD, RMSprop, and SGD.]
Figure 9. We show four scenarios on Omniglot and MiniImagenet data highlighting test statistics of C-MAML compared against other optimisation methods for varying N-shot K-way values using the non-convolutional model. (a)-(d) refer to the Omniglot 1-shot 5-way task, (e)-(h) refer to the Omniglot 1-shot 5-way task, (i)-(l) refer to the Omniglot 1-shot 20-way task, (m)-(p) refer to the Omniglot 5-shot 20-way task, (q)-(r) refer to the MiniImagenet 1-shot 5-way task, and (s)-(t) refer to the MiniImagenet 5-shot 5-way task.