ε-SUBGRADIENT ALGORITHMS FOR LOCALLY LIPSCHITZ FUNCTIONS ON RIEMANNIAN MANIFOLDS

P. GROHS*, S. HOSSEINI*

Abstract. This paper presents a descent direction method for finding extrema of locally Lipschitz functions defined on Riemannian manifolds. To this end we define a set-valued mapping x → ∂εf(x), named the ε-subdifferential, which is an approximation for the Clarke subdifferential and which generalizes the Goldstein ε-subdifferential to the Riemannian setting. Using this notion we construct a steepest descent method where the descent directions are computed by a computable approximation of the ε-subdifferential. We establish the global convergence of our algorithm to a stationary point. Numerical experiments illustrate our results.

1. Introduction

This paper is concerned with the numerical solution of optimization problems defined on Riemannian manifolds where the objective function may be nonsmooth. Such problems arise in a variety of applications, e.g., in computer vision, signal processing, motion and structure estimation, or numerical linear algebra; see for instance [2, 3, 30, 39].

In the linear case it is well known that ordinary gradient descent, when applied to nonsmooth functions, typically fails by converging to a non-optimal point. The fundamental difficulty is that most interesting nonsmooth objective functions assume their extrema at points where the gradient is not defined. This has led to the introduction of the generalized gradient, first for convex functions defined on a linear space by Rockafellar in 1961 and subsequently for locally Lipschitz functions by Clarke in 1975; see [40, 15]. Their use in optimization algorithms began soon after their appearance. Since the Clarke generalized gradient is in general difficult to compute numerically, most algorithms based on it are efficient only for certain types of functions; see for instance [7, 26, 49].

The paper [22] is among the first works on optimization of Lipschitz functions on Euclidean spaces. In that article a new set-valued mapping named the ε-subdifferential ∂εf of a function f was introduced, and several properties of this map, which are useful for building optimization algorithms for locally Lipschitz functions on linear spaces, were presented. For the numerical computation of the ε-subdifferential various strategies have been proposed in the literature.

Key words and phrases. Riemannian manifolds, Lipschitz function, Descent direction, Clarke subdifferential.

AMS Subject Classifications: 49J52, 65K05, 58C05.

*ETH Zürich, Seminar for Applied Mathematics, Rämistrasse 101, 8092 Zürich, Switzerland ([email protected], [email protected]).

The gradient sampling algorithm (GS), introduced and analyzed by Burke, Lewis and Overton [13, 14], is a method for minimizing an objective function f that is locally Lipschitz and continuously differentiable in an open dense subset of Rn. At each iteration, the GS algorithm computes the gradient of f at the current iterate and at m ≥ n + 1 randomly generated nearby points. This bundle of gradients is used to find an approximate ε-steepest descent direction as the solution of a quadratic program, where ε denotes the sampling radius. A standard Armijo line search along this direction produces a candidate for the next iterate, which is obtained by perturbing the candidate, if necessary, to stay in the set Ω where f is differentiable; the perturbation is random and small enough to maintain the Armijo sufficient descent property. The sampling radius may be fixed for all iterations or may be reduced dynamically.
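In the Euclidean setting, one iteration of this scheme can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `f` and `grad_f` are hypothetical user-supplied callables, and the minimum-norm element of the convex hull of the sampled gradients is obtained from a small quadratic program over the simplex, solved here with SciPy's SLSQP method.

```python
import numpy as np
from scipy.optimize import minimize

def gs_step(f, grad_f, x, eps=0.1, m=None, c=1e-4, beta=0.5, rng=None):
    """One simplified gradient-sampling step: sample gradients in a ball of
    radius eps around x, take the minimum-norm convex combination as the
    search direction, then backtrack with an Armijo-type test."""
    rng = rng or np.random.default_rng(0)
    n = x.size
    m = m or n + 1
    # Bundle: gradient at x plus gradients at m random nearby points.
    pts = [x] + [x + eps * rng.uniform(-1.0, 1.0, n) for _ in range(m)]
    G = np.array([grad_f(p) for p in pts])                    # shape (m+1, n)
    # Min-norm point of conv{g_i}: minimize ||G^T lam||^2 over the simplex.
    Q = G @ G.T
    res = minimize(lambda lam: lam @ Q @ lam,
                   np.full(len(G), 1.0 / len(G)),
                   bounds=[(0.0, 1.0)] * len(G),
                   constraints=[{"type": "eq",
                                 "fun": lambda lam: lam.sum() - 1.0}],
                   method="SLSQP")
    w = G.T @ res.x
    if np.linalg.norm(w) < 1e-12:        # bundle is (approximately) stationary
        return x, w
    g = -w / np.linalg.norm(w)
    t = eps
    # Armijo backtracking along the direction g.
    while f(x + t * g) > f(x) - c * t * np.linalg.norm(w):
        t *= beta
        if t < 1e-12:
            break
    return x + t * g, w
```

For instance, on f(x) = ‖x‖₁ starting from (1, 2), one such step strictly decreases f even though f is nonsmooth.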

The discrete gradient method (DG) approximates ∂εf(x) by a set of discrete gradients. In this algorithm, the descent direction is computed iteratively, and in every iteration the approximation of ∂εf(x) is improved by adding a discrete gradient to the set of discrete gradients; see [7].

In [34], ∂εf(x) is approximated by an iterative algorithm. The algorithm starts with one element of ∂εf(x) in the first iteration, and in every subsequent iteration a new element of ∂εf(x) is computed and added to the working set to improve the approximation of ∂εf(x). The algorithm presented in [34] is more efficient than GS and more accurate than DG; see [34].

The extension of the aforementioned optimization techniques to Riemannian manifolds is the subject of the present paper. A manifold, in general, does not have a linear structure, hence the usual techniques which are used to study optimization problems on linear spaces cannot be applied, and new techniques need to be developed.

The development of smooth and nonsmooth Riemannian optimization algorithms is primarily motivated by their large-scale applications in robust, sparse, and structured principal component analysis, statistics on manifolds (e.g. median calculation of positive semidefinite tensors), and low-rank optimization (matrix completion, collaborative filtering, source separation); see [28, 46, 47, 45]. Furthermore, these algorithms have many applications in image processing, computer vision, and constrained optimization problems on linear spaces; see [4, 12, 18].

Contributions. Our main contributions are twofold. First, we define a Riemannian generalization of the ε-subdifferential defined in [22]. This is nontrivial since the linear definition of ∂εf(x), x ∈ Rn, involves subgradients of f at points y ∈ Rn different from x. In the linear case this is not an issue since tangent spaces at different points can be identified. In the nonlinear case, with M being a Riemannian manifold and f : M → R, we move these subgradients at points y ∈ M to the tangent space at x via the derivative of the logarithm mapping in order to obtain a workable definition of the ε-subdifferential; see Definition 3.1 below. In Section 3.1, we prove several basic properties of the novel Riemannian ε-subdifferential, which subsequently enables us to formulate conditions for descent directions in Section 3.2. Using these basic properties of the ε-subdifferential, we are able to generalize GS and the algorithm in [34] to the Riemannian setting. In Section 3.3, we present the details for the generalization of [34], which yields the

second main contribution of the present paper, namely a proof of global convergence of the proposed algorithm. Finally, our proposed algorithm is implemented in the MATLAB environment and applied to some nonsmooth problems with locally Lipschitz objective functions.

Previous Work. For the optimization of smooth objective functions, many classical methods for unconstrained minimization, such as Newton-type and trust-region methods, have been successfully generalized to problems on Riemannian manifolds [1, 3, 16, 33, 38, 43, 44, 50]. The recent monograph by Absil, Mahony and Sepulchre discusses, in a systematic way, the framework and many numerical first-order and second-order manifold-based algorithms for minimization problems on Riemannian manifolds, with an emphasis on applications to numerical linear algebra [2].

In considering optimization problems with nonsmooth objective functions on Riemannian manifolds, it is necessary to generalize concepts of nonsmooth analysis to Riemannian manifolds. In the past few years a number of results have been obtained on numerous aspects of nonsmooth analysis on Riemannian manifolds [5, 6, 23, 24, 25, 32].

Recently, some mathematicians have started extending nonsmooth optimization algorithms to manifold settings. It is worth noting that while they presented gradient-based and proximal point algorithms on manifolds, their numerical experiments are limited to some special test functions whose subdifferentials either are singletons or can be computed explicitly. This might be because of the difficulty of finding the subdifferential of the functions; see [9, 10, 17, 20, 37]. Finally, it is worth mentioning the paper [18], which presents a survey on Riemannian geometry methods for smooth and nonsmooth constrained optimization, as well as gradient and subgradient descent algorithms on a Riemannian manifold. In that paper, the methods are illustrated by applications from robotics and multi-antenna communication.

2. Preliminaries

In this paper, we use the standard notations and known results of Riemannian manifolds; see, e.g., [29]. Throughout this paper, M is an n-dimensional complete manifold endowed with a Riemannian metric 〈·, ·〉 on the tangent space TxM. We identify (via the Riemannian metric) the tangent space of M at a point x, denoted by TxM, with the cotangent space at x, denoted by TxM*. As usual, we denote by B(x, δ) the open ball centered at x with radius δ, and by int N (cl N) the interior (closure) of the set N. Also, for a nonempty closed subset S of a Riemannian manifold M, we define distS : M → R by

distS(x) := inf{dist(x, s) : s ∈ S},

where dist is the Riemannian distance on M. Recall that a set S in a Riemannian manifold M is called convex if every two points p1, p2 ∈ S can be joined by a unique geodesic whose image belongs to S. For a point x ∈ M, expx : Ux → M will stand for the exponential map at x, where Ux is an open subset of TxM. Recall that expx maps straight lines of the tangent space TxM passing through 0x ∈ TxM into geodesics of M passing through x.

We will also use the parallel transport of vectors along geodesics. Recall that, for a given curve γ : I → M, a number t0 ∈ I, and a vector V0 ∈ Tγ(t0)M, there exists a unique parallel vector field V(t) along γ(t) such that V(t0) = V0. Moreover, the

map defined by V0 ↦ V(t1) is a linear isometry between the tangent spaces Tγ(t0)M and Tγ(t1)M, for each t1 ∈ I. In the case when γ is a minimizing geodesic and γ(t0) = x, γ(t1) = y, we will denote this map by Lxy, and we will call it the parallel transport from TxM to TyM along the curve γ. Note that Lxy is well defined when the minimizing geodesic which connects x to y is unique. For example, the parallel transport Lxy is well defined when x and y are contained in a convex neighborhood. In what follows, Lxy will be used wherever it is well defined. The isometry Lyx induces another linear isometry L*yx between TxM* and TyM*, such that for every σ ∈ TxM* and v ∈ TyM, we have 〈L*yx(σ), v〉 = 〈σ, Lyx(v)〉. We will still denote this isometry by Lxy : TxM* → TyM*.

By iM(x) we denote the injectivity radius of M at x, that is, the supremum of the radii r of all balls B(0x, r) in TxM for which expx is a diffeomorphism from B(0x, r) onto B(x, r). Note that if U is a compact subset of a Riemannian manifold M and i(U) := inf{iM(x) : x ∈ U}, then 0 < i(U); see [27].

A retraction on a manifold M is a continuously differentiable map R : TM → M with the following properties, where Rx denotes the restriction of R to TxM:

• Rx(0x) = x, where 0x denotes the zero element of TxM.
• With the canonical identification T0xTxM ≈ TxM, dRx(0x) = idTxM, where idTxM denotes the identity map on TxM.

In the present paper, we are concerned with the minimization of locally Lipschitz functions, which we now define.

Definition 2.1 (Lipschitz condition). Recall that a real-valued function f defined on a Riemannian manifold M is said to satisfy a Lipschitz condition of rank k on a given subset S of M if |f(x) − f(y)| ≤ k dist(x, y) for every x, y ∈ S, where dist is the Riemannian distance on M. A function f is said to be Lipschitz near x ∈ M if it satisfies the Lipschitz condition of some rank on an open neighborhood of x. A function f is said to be locally Lipschitz on M if f is Lipschitz near x for every x ∈ M.

Let us continue with the definition of the Clarke generalized directional derivative for locally Lipschitz functions on Riemannian manifolds; see [23, 25].

Definition 2.2 (Clarke generalized directional derivative). Suppose f : M → R is a locally Lipschitz function on a Riemannian manifold M. Let φx : Ux → TxM be an exponential chart at x. Given another point y ∈ Ux, consider σy,v(t) := φ_y^{-1}(tw), a geodesic passing through y with derivative w, where (φy, y) is an exponential chart around y and d(φx ∘ φ_y^{-1})(0y)(w) = v. Then the Clarke generalized directional derivative of f at x ∈ M in the direction v ∈ TxM, denoted by f◦(x; v), is defined as

f◦(x; v) = lim sup_{y→x, t↓0} (f(σy,v(t)) − f(y)) / t.

If f is differentiable at x ∈ M, we define the gradient of f as the unique vector grad f(x) ∈ TxM which satisfies

〈grad f(x), ξ〉 = df(x)(ξ) for all ξ ∈ TxM.

Using the previous definition of a Riemannian Clarke generalized directional derivative, we can also generalize the notion of the subdifferential to a Riemannian context.

Definition 2.3 (Subdifferential). We define the subdifferential of f at x, denoted by ∂f(x), as the subset of TxM with support function given by f◦(x; ·), i.e., for every v ∈ TxM,

f◦(x; v) = sup{〈ξ, v〉 : ξ ∈ ∂f(x)}.

It can be proved [23] that

∂f(x) = conv{lim_{i→∞} grad f(yi) : {yi} ⊆ Ωf, yi → x},

where Ωf is a dense subset of M on which f is differentiable.

It is worthwhile to mention that lim grad f(yi) in the previous definition is obtained as follows. Let ξi ∈ TyiM, i = 1, 2, ..., be a sequence of tangent vectors of M and ξ ∈ TxM. We say ξi converges to ξ, denoted by lim ξi = ξ, provided that yi → x and, for any smooth vector field X, 〈ξi, X(yi)〉 → 〈ξ, X(x)〉.

Using the notion of the subdifferential, we can now define stationary points of a locally Lipschitz mapping f.

Definition 2.4 (Stationary point, stationary set). A point x is a stationary point of f if 0 ∈ ∂f(x). A set Z is a stationary set if each z ∈ Z is a stationary point.

Proposition 2.5. A necessary condition that f achieve a local minimum at x isthat 0 ∈ ∂f(x).

Proof. If f has a local minimum at x, then for every v ∈ TxM, f◦(x; v) ≥ 0, which implies 0 ∈ ∂f(x).

3. The Riemannian ε-Subdifferential

In smooth optimization, there exist minimization methods which, instead of using the gradient, use its approximations through finite differences (forward, backward, and central differences). In [31], a very simple convex nondifferentiable function was presented for which these finite differences may give no information about the subdifferential. It follows that these finite-difference estimates of the gradient cannot be used for the approximation of the subgradient of nonsmooth functions. In [22], a set-valued mapping named the ε-subdifferential was introduced to approximate the subdifferential of locally Lipschitz functions defined on Rn.

The present section extends the concept of the ε-subdifferential to locally Lipschitz functions defined on a Riemannian manifold, generalizing the corresponding Euclidean concept introduced in [22]. The definition is as follows.

Definition 3.1 (ε-subdifferential). Let f : M → R be a locally Lipschitz function on a Riemannian manifold M, and let {θk} be any sequence of positive numbers converging downward to zero. For each ε > 0 with ε + θk < iM(x) for almost every k, the ε-subdifferential at x is defined by

∂εf(x) := conv ⋂_k cl{d exp_x^{-1}(y)(grad f(y)) : y ∈ cl B(x, ε + θk) ∩ Ωf},

where the intersection is taken over all k for which ε + θk < iM(x).

Clearly, this definition is independent of the choice of the sequence {θk}.
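In the Euclidean case M = Rn one has exp_x(v) = x + v and d exp_x^{-1}(y) is the identity, so the definition above collapses to the Goldstein ε-subdifferential conv{grad f(y) : y ∈ cl B(x, ε)}. The following sketch (ours, for illustration; `grad_f` is a hypothetical user-supplied callable) approximates that set by sampling, which gives only an inner approximation of the convex hull:

```python
import numpy as np

def sampled_eps_subdiff(grad_f, x, eps, n_samples=200, rng=None):
    """Euclidean specialization of the eps-subdifferential: approximate
    conv{grad f(y) : y in cl B(x, eps)} by gradients at randomly sampled
    points of the ball (the transport d exp_x^{-1}(y) is the identity
    in R^n)."""
    rng = rng or np.random.default_rng(0)
    n = x.size
    samples = []
    for _ in range(n_samples):
        u = rng.normal(size=n)
        # Rescale to a uniform sample of the unit ball, then shift/scale.
        u *= rng.uniform() ** (1.0 / n) / np.linalg.norm(u)
        samples.append(grad_f(x + eps * u))
    return np.array(samples)   # rows span an inner approximation of the hull
```

For f(y) = |y1| + |y2| at x = (0.05, 1.0) with ε = 0.2, the sampled gradients take both values (±1, 1), recovering the two smooth "pieces" of f that meet inside the ball.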

3.1. Basic Properties. In the present subsection, we establish some basic properties of the ε-subdifferential as defined above in Definition 3.1; see [22] for similar results in the linear case. We select ε small enough that f is Lipschitz on B(x, 2ε) and expx is a diffeomorphism from B(0x, 2ε) onto B(x, 2ε).

Lemma 3.2. For every y ∈ B(x, ε),

d exp_x^{-1}(y)(∂f(y)) ⊂ ∂εf(x).

Proof. For every ξ = lim_{i→∞} grad f(yi), where grad f(yi) exists and yi → y, we have

d exp_x^{-1}(y)(ξ) = lim_{i→∞} d exp_x^{-1}(yi)(grad f(yi)),

hence there exists N ∈ N such that for every i ≥ N, yi ∈ B(x, ε) and

lim_{i→∞} d exp_x^{-1}(yi)(grad f(yi)) ∈ ⋂_k cl{d exp_x^{-1}(y)(grad f(y)) : y ∈ cl B(x, ε + θk) ∩ Ωf},

which means

d exp_x^{-1}(y)({lim_{i→∞} grad f(yi) : {yi} ⊆ Ωf, yi → y})

is a subset of ⋂_k cl{d exp_x^{-1}(y)(grad f(y)) : y ∈ cl B(x, ε + θk) ∩ Ωf}, which implies

d exp_x^{-1}(y)(∂f(y)) ⊂ ∂εf(x).

Lemma 3.3. ∂εf(x) is a nonempty, compact, and convex subset of TxM.

Proof. By the Lipschitzness of f and the smoothness of the exponential map,

Sk = cl{d exp_x^{-1}(y)(grad f(y)) : y ∈ cl B(x, ε + θk) ∩ Ωf}

is a closed bounded subset of TxM and S_{k+1} ⊂ Sk. Hence ⋂_k Sk is compact and nonempty, and the convex hull of a compact set in TxM is compact. The convexity of ∂εf(x) follows from the definition.

Lemma 3.4.

∂εf(x) = conv{lim_{i→∞} d exp_x^{-1}(yi)(grad f(yi)) : lim_{i→∞} yi = y ∈ cl B(x, ε), {yi} ⊆ Ωf}.

Proof. We start with the inclusion

∂εf(x) ⊃ conv{lim_{i→∞} d exp_x^{-1}(yi)(grad f(yi)) : lim_{i→∞} yi = y ∈ cl B(x, ε), {yi} ⊆ Ωf}.

Let {yi} be a sequence in Ωf converging to some point y ∈ cl B(x, ε) and v = lim_{i→∞} d exp_x^{-1}(yi)(grad f(yi)). For any sequence of positive numbers {θk} converging downward to zero, we have cl B(x, ε) ⊂ B(x, θk + ε). Therefore, for any k, there exists Nk such that for i ≥ Nk, yi ∈ B(x, θk + ε). We set (z_j^k)_j = (y_{Nk+j})_j ⊆ B(x, θk + ε). Now it is clear that for every k, v = lim_{j→∞} d exp_x^{-1}(z_j^k)(grad f(z_j^k)) ∈ cl{d exp_x^{-1}(y)(grad f(y)) : y ∈ cl B(x, ε + θk) ∩ Ωf}, which proves the first inclusion.

For the converse, let

w ∈ ⋂_k cl{d exp_x^{-1}(y)(grad f(y)) : y ∈ cl B(x, ε + θk) ∩ Ωf}.

Then, for every k ∈ N with θk + ε < iM(x), we have

w ∈ cl{d exp_x^{-1}(y)(grad f(y)) : y ∈ cl B(x, ε + θk) ∩ Ωf}.

Therefore, we can find a sequence yi ∈ cl B(x, ε + θi) ∩ Ωf such that

lim_{i→∞} ‖d exp_x^{-1}(yi)(grad f(yi)) − w‖ = 0

and (if necessary after passing to a subsequence)

lim_{i→∞} yi = y ∈ cl B(x, ε),

as required.

Using the previous lemma, one can prove the following characterization of the Riemannian ε-subdifferential.

Lemma 3.5. We have

∂εf(x) = conv{d exp_x^{-1}(y)(∂f(y)) : y ∈ cl B(x, ε)}.

Proof. Assume that η ∈ ∂εf(x). Lemma 3.4 implies η = Σ_{k=1}^n tk ξk, where

ξk = lim_{ik→∞} d exp_x^{-1}(y_{ik})(grad f(y_{ik})),

y_{ik} ∈ Ωf, lim_{ik→∞} y_{ik} = yk ∈ cl B(x, ε). Hence

ξk = d exp_x^{-1}(yk)(lim_{ik→∞} grad f(y_{ik})).

Set ηk = (d exp_x^{-1}(yk))^{-1}(ξk) ∈ ∂f(yk); then

η = Σ_{k=1}^n tk d exp_x^{-1}(yk)(ηk) ∈ conv{d exp_x^{-1}(y)(∂f(y)) : y ∈ cl B(x, ε)}.

For the converse, let

A = {d exp_x^{-1}(y)(∂f(y)) : y ∈ cl B(x, ε)}

and ξ ∈ A; then ξ = d exp_x^{-1}(y)(η), where η = lim_{i→∞} grad f(yi), yi ∈ Ωf, lim_{i→∞} yi = y. Hence

ξ = lim_{i→∞} d exp_x^{-1}(yi)(grad f(yi)),

which implies A ⊂ ∂εf(x), and taking convex hulls completes the proof.

The following remark is required in the sequel.

Remark 3.6. Let M be a Riemannian manifold. An easy consequence of the definition of parallel translation along a curve as the solution of an ordinary linear differential equation is that the mapping

C : TM → Tx0M, C(x, ξ) = Lxx0(ξ),

where x is in a neighborhood U of x0, is well defined and continuous at (x0, ξ0); that is, if (xn, ξn) → (x0, ξ0) in TM, then Lxnx0(ξn) → Lx0x0(ξ0) = ξ0, for every (x0, ξ0) ∈ TM; see [5, Remark 6.11].

Remark 3.7. Note that for small enough ε > 0, ∂f(x) ⊂ ∂εf(x). If ε1 > ε2, then ∂_{ε2}f(x) ⊂ ∂_{ε1}f(x). Therefore, ∂f(x) ⊆ lim_{εk↓0} ∂_{εk}f(x) = ⋂_{εk} ∂_{εk}f(x). We claim that ⋂_{εk} ∂_{εk}f(x) ⊆ ∂f(x). To prove the claim, we assume on the contrary that there exists ξ ∈ ⋂_{εk} ∂_{εk}f(x) \ ∂f(x). Since {∂_{εk}f(x)} is a sequence of compact and nested subsets of TxM, we have

ξ ∈ ⋂_{εk} ∂_{εk}f(x) = conv ⋂_{εk} {d exp_x^{-1}(y)(∂f(y)) : dist(y, x) ≤ εk}.

Hence, ξ = Σ_{k=1}^m tk ξk, where ξk ∈ ⋂_{εk} {d exp_x^{-1}(y)(∂f(y)) : dist(y, x) ≤ εk}, and Σ_{k=1}^m tk = 1. Therefore, there exist w_{ki} ∈ ∂f(y_{ki}) such that dist(y_{ki}, x) ≤ εk with

ξk = d exp_x^{-1}(y_{ki})(w_{ki}). Since M is complete, it follows that {y_{ki}} has a subsequence convergent to x in M. By Theorem 2.9 of [23], L_{y_{ki}x}(w_{ki}) has a subsequence convergent to some vector ξ̄ ∈ ∂f(x). Since L_{y_{ki}x}(w_{ki}) = L_{y_{ki}x}((d exp_x^{-1}(y_{ki}))^{-1}(ξk)) converges to ξk, we get ξk = ξ̄ ∈ ∂f(x), and since ∂f(x) is convex, ξ ∈ ∂f(x), which is a contradiction.

We recall that a set-valued function F : X → Y, where X, Y are topological spaces, is said to be upper semicontinuous at x if for every open neighborhood U of F(x) there exists an open neighborhood V of x such that

y ∈ V =⇒ F(y) ⊆ U.

If F has compact values, then there is a sequential characterization of set-valued upper semicontinuity as follows: F is upper semicontinuous at x if and only if for each sequence {xn} ⊆ X converging to x and each sequence yn ∈ F(xn) converging to y, we have y ∈ F(x).

Lemma 3.8. Let U be a compact subset of M and ε < i(U). Then for every open neighborhood W in U, the set-valued mapping ∂εf : W → TM is upper semicontinuous.

Proof. For an arbitrary fixed x ∈ W, let r be a positive number with r < ε. We define F : B(x, r) ∩ W → TxM by

F(z) = Lzx({d exp_z^{-1}(y)(∂f(y)) : y ∈ cl B(z, ε)}).

First, we prove F is upper semicontinuous at x. Let {xk} ⊂ B(x, r) ∩ W and {vk} ⊂ TxM be two sequences converging, respectively, to x and v, where vk ∈ F(xk). Hence vk = Lxkx(d exp_{xk}^{-1}(yk)(ξk)), where ξk ∈ ∂f(yk) and yk ∈ cl B(xk, ε). Note that M is complete, therefore {yk} has a subsequence convergent to some point y in M. Moreover, f is Lipschitz on B(x, ε), so by Theorem 2.9 of [23] we deduce that Lyky(ξk) has a subsequence convergent to some vector ξ ∈ ∂f(y). Thus, v = d exp_x^{-1}(y)(ξ), where ξ ∈ ∂f(y). Since dist(xk, yk) ≤ ε, by the continuity of the distance function dist(x, y) ≤ ε, which means v ∈ F(x) and F is upper semicontinuous at x. Note that F has compact values; consequently the set-valued function conv F : B(x, r) ∩ W → TxM defined by

conv F(z) = Lzx(conv{d exp_z^{-1}(y)(∂f(y)) : y ∈ cl B(z, ε)})

is upper semicontinuous at x. Now, we prove the upper semicontinuity of ∂εf at x. Let {xk} ⊂ B(x, r) ∩ W and {vk} ⊂ TM be two sequences converging, respectively, to x and v, where vk ∈ ∂εf(xk). Then Lxkx(vk) ∈ conv F(xk), and by Remark 3.6, Lxkx(vk) converges to v; by the upper semicontinuity of conv F at x, v ∈ conv F(x) = ∂εf(x).

Lemma 3.9. Let B be a closed ball in a complete Riemannian manifold M, let f : M → R be locally Lipschitz, let Z be the set of all stationary points of f in B, and let Bδ := {x ∈ B : distZ(x) ≥ δ}, δ > 0. Then there exist ε > 0 and σ > 0 such that 0 ∉ ∂εf(x) and min{‖v‖ : v ∈ ∂εf(x)} ≥ σ for all x ∈ Bδ.

Proof. Since B is compact, it follows that there exists ε > 0 such that ∂εf is well defined on Bδ. Assume that x ∈ Bδ; consequently 0 ∉ ∂f(x). We claim that there

exists ε > 0 such that 0 ∉ ∂εf(x). On the contrary, suppose that 0 ∈ ∂_{1/i}f(x) for i = N, N + 1, ..., where 1/N < iM(x). Since {∂_{1/i}f(x)} is a sequence of compact and nested subsets of TxM, we have

0 ∈ ⋂_{i=N}^∞ ∂_{1/i}f(x) = conv ⋂_{i=N}^∞ {d exp_x^{-1}(y)(∂f(y)) : dist(y, x) ≤ 1/i}.

Hence, 0 = Σ_{k=1}^m tk ξk, where ξk ∈ ⋂_{i=N}^∞ {d exp_x^{-1}(y)(∂f(y)) : dist(y, x) ≤ 1/i}. Therefore, there exist w_{ki} ∈ ∂f(y_{ki}) such that dist(y_{ki}, x) ≤ 1/(i + N) with ξk = d exp_x^{-1}(y_{ki})(w_{ki}). Since M is complete, it follows that {y_{ki}} has a subsequence convergent to x in M. By Theorem 2.9 of [23], L_{y_{ki}x}(w_{ki}) has a subsequence convergent to some vector ξ̄ ∈ ∂f(x). Since L_{y_{ki}x}(w_{ki}) = L_{y_{ki}x}((d exp_x^{-1}(y_{ki}))^{-1}(ξk)) converges to ξk, we get ξk = ξ̄ ∈ ∂f(x), and since ∂f(x) is convex, 0 ∈ ∂f(x), which is a contradiction.

To prove the second part of the lemma, note that ∂εf(x) is a compact subset of TxM and the norm function is continuous; therefore there exists 0 ≠ w ∈ ∂εf(x) such that ‖w‖ = min{‖v‖ : v ∈ ∂εf(x)}. Assume on the contrary that for every i ∈ N there exists xi ∈ Bδ such that ‖wi‖ = min{‖v‖ : v ∈ ∂εf(xi)} and 0 < ‖wi‖ < 1/i. Then there exist convergent subsequences of {xi} and {wi} with respective limits x ∈ Bδ and 0 ∈ ∂εf(x), which is a contradiction.

3.2. Descent Directions. In the present section, we treat the problem of finding directions w0 ∈ ∂εf(x) such that, with suitable step lengths t > 0, the objective function f affords a decrease along the geodesic expx(−t w0/‖w0‖). The next result shows that, whenever one has full knowledge of the ε-subdifferential, a suitable descent direction can be obtained by solving a simple quadratic program. We will use the following theorem; for its proof see [23].

Theorem 3.10 (Lebourg's Mean Value Theorem). Let M be a finite dimensional Riemannian manifold, x, y ∈ M, and γ : [0, 1] → M be a smooth path joining x and y. Let f be Lipschitz around γ([0, 1]). Then there exist 0 < t0 < 1 and ξ ∈ ∂f(γ(t0)) such that

f(y) − f(x) = 〈ξ, γ′(t0)〉.

The next theorem can be proved using a property of the exponential map called radial isometry.

Definition 3.11. Assume that R : TM → M is a retraction on M. We say R is a radial isometry if for each x ∈ M there exists ε > 0 such that for all v, w ∈ B(0x, ε), we have

〈dRx(v)(v), dRx(v)(w)〉 = 〈v, w〉.

By Gauss’s lemma, we know that the exponential map is a radial isometry.

Theorem 3.12. Assume ε > 0 and δ are given as in Lemma 3.9, so that 0 ∉ ∂εf(x) for all x ∈ Bδ. Let x ∈ Bδ and consider an element of ∂εf(x) with minimum norm,

w0 := argmin{‖v‖ : v ∈ ∂εf(x)},

and set g0 := −w0/‖w0‖. Then g0 affords a uniform decrease of f over B(x, ε), i.e.,

f(expx(εg0)) − f(x) ≤ −ε‖w0‖.

Proof. By Lebourg's mean value theorem [23], there exist 0 < t0 < 1 and ξ ∈ ∂f(γ(t0)) such that f(expx(εg0)) − f(x) = 〈ξ, γ′(t0)〉, where γ(t) := expx(tεg0) is the geodesic starting at x with initial velocity εg0. Since expx is a radial isometry, we have

f(expx(εg0)) − f(x) = 〈ξ, d expx(εt0g0)(εg0)〉
= ε〈d expx(εt0g0)(d exp_x^{-1}(expx(εt0g0))(ξ)), d expx(εt0g0)(g0)〉
= ε〈d exp_x^{-1}(expx(εt0g0))(ξ), g0〉.

Since dist(expx(εt0g0), x) = t0ε ≤ ε, it follows that d exp_x^{-1}(expx(εt0g0))(ξ) ∈ ∂εf(x). We claim that ‖w0‖² ≤ 〈φ, w0〉 for every φ ∈ ∂εf(x), which implies 〈φ, g0〉 ≤ −‖w0‖. Hence we can deduce that f(expx(εg0)) − f(x) ≤ −ε‖w0‖.

Proof of the claim: assume on the contrary that there exists φ ∈ ∂εf(x) such that 〈φ, w0〉 < ‖w0‖², and consider w := w0 + t(φ − w0) ∈ ∂εf(x); then

‖w0‖² − ‖w‖² = −t(2〈w0, φ − w0〉 + t〈φ − w0, φ − w0〉).

Since 〈w0, φ − w0〉 = 〈w0, φ〉 − ‖w0‖² < 0, we can choose t > 0 small enough that ‖w0‖² > ‖w‖², which contradicts the minimality of w0.

Remark 3.13. Instead of using the differential of the inverse exponential map to transport the subgradients of the cost function at the point y to the point x, we can use the differential of the inverse of a retraction R : TM → M; however, by the proof of the previous theorem, we have to restrict ourselves to the class of retractions that are radial isometries; see [1]. For example, as parallel transport can be considered as the differential of the inverse of a retraction, it might be used to transport the subgradients of the cost function at the point y to the point x; see [38]. It is worth mentioning that if the differential of the inverse of a retraction R : TM → M is selected to transport the vectors, then the retraction R must also be used to take a step in the direction of a tangent vector. Using a good retraction amounts to finding an approximation of the exponential mapping that can be computed with low computational cost while not adversely affecting the behavior of the optimization algorithm.
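As a concrete illustration of this trade-off (our example, not from the paper): on the unit sphere S^{n-1} ⊂ Rn the projection R_x(v) = (x + v)/‖x + v‖ is a retraction that is much cheaper than the exponential map exp_x(v) = cos(‖v‖)x + sin(‖v‖)v/‖v‖, and the two agree to second order in ‖v‖.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere S^{n-1} (v tangent at x)."""
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return x.copy()
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def sphere_retract(x, v):
    """Projection retraction: a cheap approximation of sphere_exp that
    agrees with it to second order in ||v||."""
    y = x + v
    return y / np.linalg.norm(y)
```

For ‖v‖ = 0.1 the two points differ by roughly 10⁻⁴ while both lie exactly on the sphere, so for small steps the retraction is a faithful stand-in for the exponential map.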

Definition 3.14 (Descent direction). Let f : M → R be a locally Lipschitz function on a complete Riemannian manifold M and w ∈ TxM. Then g = −w/‖w‖ is called a descent direction at x if there exists α > 0 such that

(3.1) f(expx(tg)) − f(x) ≤ −t‖w‖, ∀t ∈ (0, α).

In the construction of the previous theorem, g0 is a descent direction of f at x, because along the same lines as the proof we can deduce that for g0 and every t ∈ (0, ε),

f(expx(tg0)) − f(x) ≤ −t‖w0‖.

It is clear that we can use this descent direction in order to move along a geodesic starting from an initial point toward a neighborhood of a minimum point.

3.3. Approximation of the ε-subdifferential. For general nonsmooth optimization problems it may be difficult to give an explicit description of the full subdifferential set. In the present section, we generalize ideas of [34] to obtain an iterative procedure to approximate the ε-subdifferential. We start with the subgradient of an arbitrary point near x and move the subgradient to the tangent space at x

via the derivative of the logarithm mapping; in every subsequent iteration, the subgradient of a new point near x is computed, moved to the tangent space at x, and added to the working set to improve the approximation of ∂εf(x). Indeed, we do not want to provide a description of the entire ε-subdifferential set at each iteration; what we do is approximate ∂εf(x) by the convex hull of finitely many of its elements. To this end, let Wk := {v1, ..., vk} ⊆ ∂εf(x); then we define

wk := argmin_{v ∈ conv Wk} ‖v‖.

Now if we have

(3.2) f(expx(εgk))− f(x) ≤ −cε‖wk‖, c ∈ (0, 1)

where gk = − wk

‖wk‖ , then we can say convWk is an acceptable approximation for

∂εf(x). Otherwise, we add a new element of ∂εf(x) \ convWk to Wk.
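The minimum-norm element wk of conv Wk is the solution of a small quadratic program over the simplex. A minimal sketch (ours; min_norm_in_hull is a hypothetical helper built on SciPy's SLSQP solver):

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_in_hull(W):
    """Return the minimum-norm point of conv{W[0], ..., W[k-1]} (rows of W).

    Solves min ||sum_i lam_i W_i||^2 s.t. lam >= 0, sum_i lam_i = 1 -- the
    small quadratic program behind the choice of w_k in the text.
    """
    W = np.asarray(W, dtype=float)
    k = W.shape[0]
    obj = lambda lam: np.sum((lam @ W) ** 2)
    cons = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
    res = minimize(obj, np.full(k, 1.0 / k), method='SLSQP',
                   bounds=[(0.0, 1.0)] * k, constraints=cons)
    return res.x @ W

# Two subgradients; the min-norm point of their convex hull is the midpoint.
wk = min_norm_in_hull([[1.0, 0.0], [0.0, 1.0]])
print(np.round(wk, 6))  # -> [0.5 0.5]
```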

Lemma 3.15. Let v ∈ ∂εf(x) be such that ⟨v, gk⟩ > −‖wk‖. Then v ∉ conv Wk.

Proof. It can be proved along the same lines as the proof of the claim of Theorem 3.12. □

The following lemma proves that if Wk is not an acceptable approximation of ∂εf(x), then there exists vk+1 ∈ ∂εf(x) such that ⟨vk+1, gk⟩ ≥ −c‖wk‖ > −‖wk‖; therefore, by the previous lemma, vk+1 ∈ ∂εf(x) \ conv Wk.

Lemma 3.16. Let Wk = {v1, ..., vk} ⊂ ∂εf(x), 0 ∉ conv Wk and

wk = argmin{‖v‖ : v ∈ conv Wk}.

If f(expx(εgk)) − f(x) > −cε‖wk‖, where c ∈ (0, 1) and gk = −wk/‖wk‖, then there exist θ0 ∈ (0, ε] and ξ ∈ ∂f(expx(θ0gk)) such that

⟨d exp⁻¹x(expx(θ0gk))(ξ), gk⟩ ≥ −c‖wk‖,

and vk+1 := d exp⁻¹x(expx(θ0gk))(ξ) ∉ conv Wk.

Proof. We prove this lemma using Lemma 3.1 and Proposition 3.1 in [34]. Define

h(t) := f(expx(tgk)) − f(x) + ct‖wk‖, t ∈ R,

and a new locally Lipschitz function G : B(0x, iM(x)) ⊂ TxM → R by G(g) = f(expx(g)); then h(t) = G(tgk) − G(0) + ct‖wk‖. Assume that h(ε) > 0. Then by Proposition 3.1 of [34], there exists θ0 ∈ [0, ε] such that h is increasing in a neighborhood of θ0. Therefore, by Lemma 3.1 of [34], for every ζ ∈ ∂h(θ0) one has ζ ≥ 0. By [23, Proposition 3.1],

∂h(θ0) ⊆ ⟨∂f(expx(θ0gk)), d expx(θ0gk)(gk)⟩ + c‖wk‖.

If ξ ∈ ∂f(expx(θ0gk)) is such that

⟨ξ, d expx(θ0gk)(gk)⟩ + c‖wk‖ ∈ ∂h(θ0),

then

⟨d exp⁻¹x(expx(θ0gk))(ξ), gk⟩ + c‖wk‖ ≥ 0.

Now, Lemma 3.15 implies that

vk+1 := d exp⁻¹x(expx(θ0gk))(ξ) ∉ conv Wk,

which proves our claim. □


Now we present Algorithm 1, which finds a vector vk+1 ∈ ∂εf(x) that can be added to the set Wk in order to improve the approximation of ∂εf(x). It is easy to prove by Propositions 3.2 and 3.3 of [34] that this algorithm terminates after finitely many iterations. In practice, applying Algorithm 1 is not costly: we have observed that, as h usually has no local extremum on (0, ε) for ε small, the algorithm terminates after one iteration.

Algorithm 1 An h-increasing point algorithm; v = Increasing(x, g, a, b).
1: Input: x ∈ M, g ∈ TxM, a, b ∈ R.
2: Let t = b.
3: repeat
4:   select v ∈ ∂f(expx(tg)) such that ⟨v, d expx(tg)(g)⟩ + c‖w‖ ∈ ∂h(t)
5:   if ⟨v, d expx(tg)(g)⟩ + c‖w‖ < 0 then
6:     t = (a + b)/2
7:     if h(b) > h(t) then
8:       a = t
9:     else
10:      b = t
11:    end if
12:  end if
13: until ⟨v, d expx(tg)(g)⟩ + c‖w‖ ≥ 0

Next we give Algorithm 2 for finding a descent direction; Theorem 3.17 proves that Algorithm 2 terminates after finitely many iterations.

Algorithm 2 A descent direction algorithm; (gk, k) = Descent(x, δ, c, ε).
1: Input: x ∈ M; δ, c, ε ∈ (0, 1).
2: Let g1 ∈ TxM be such that ‖g1‖ = 1.
3: if f is differentiable at expx(εg1) then
     v = d exp⁻¹x(expx(εg1))(grad f(expx(εg1)))
4: else select arbitrary v ∈ d exp⁻¹x(expx(εg1))(∂f(expx(εg1)))
5: Set W1 = {v} and let k = 1.
6: end if
7: Step 1 (Compute a descent direction):
8: Solve the following minimization problem and let wk be its solution:
     min_{v ∈ conv Wk} ‖v‖.
9: if ‖wk‖ ≤ δ then stop
10: else let gk+1 = −wk/‖wk‖.
11: end if
12: Step 2 (Stopping condition):
13: if f(expx(εgk+1)) − f(x) ≤ −cε‖wk‖ then stop.
14: end if
15: Step 3: v = Increasing(x, gk+1, 0, ε).
16: Set vk+1 = v, Wk+1 = Wk ∪ {vk+1}, k = k + 1, and go to Step 1.
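The interplay between the acceptance test (3.2) and the enrichment of Wk can be illustrated in a drastically simplified setting. The sketch below (ours, not the paper's algorithm verbatim) takes M = R with exp_x(tg) = x + tg and f(x) = |x| near its kink; the one-sample "enrichment" stands in for Algorithm 1, which suffices for this f.

```python
# A drastically simplified Euclidean sketch (M = R, exp_x(tg) = x + tg) of the
# descent-direction loop for f(x) = |x|; names and constants are ours.
f = lambda x: abs(x)
subgrad = lambda x: 1.0 if x > 0 else -1.0 if x < 0 else 0.0

def descent_direction(x, eps=0.5, c=0.2, delta=1e-6):
    g = 1.0                                # initial unit direction
    W = [subgrad(x + eps * g)]             # subgradient at a nearby point
    while True:
        lo, hi = min(W), max(W)            # min-norm element of conv W (an interval)
        w = 0.0 if lo <= 0.0 <= hi else (lo if abs(lo) < abs(hi) else hi)
        if abs(w) <= delta:
            return None, W                 # x is declared (eps, delta)-stationary
        g = -1.0 if w > 0 else 1.0
        if f(x + eps * g) - f(x) <= -c * eps * abs(w):
            return g, W                    # test (3.2): acceptable descent direction
        # otherwise enrich W with a subgradient sampled along the ray
        # (the role of Algorithm 1); one sample at the far end suffices for |x|
        W.append(subgrad(x + eps * g))

# With eps = 0.5 around x = 0.1 the working set quickly contains both +1 and -1,
# whose convex hull contains 0: the point is recognized as eps-stationary.
g, W = descent_direction(0.1)
print(g, sorted(W))  # -> None [-1.0, 1.0]
```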


Theorem 3.17. Let the level set N = {x : f(x) ≤ f(x1)} be bounded for the point x1 ∈ M. Then for each x ∈ N, Algorithm 2 terminates after finitely many iterations.

Proof. We claim that either after a finite number of iterations the stopping condition is satisfied, or ‖wm‖ ≤ δ for some m and the algorithm terminates. If the stopping condition is not satisfied and ‖wk‖ > δ, then by Lemma 3.16 we find vk+1 ∉ conv Wk such that

⟨vk+1, wk⟩ ≤ c‖wk‖².

Note that d exp⁻¹x is bounded on cl B(x, ε) by some m1 ≥ 0, and by the Lipschitzness of f with constant L, Theorem 2.9 of [23] implies that ‖ξ‖ ≤ m1L for every ξ ∈ ∂εf(x). Now, wk+1 ∈ conv({vk+1} ∪ Wk) has minimum norm, so for all t ∈ (0, 1),

(3.3)
‖wk+1‖² ≤ ‖tvk+1 + (1 − t)wk‖²
        ≤ ‖wk‖² + 2t⟨wk, vk+1 − wk⟩ + t²‖vk+1 − wk‖²
        ≤ ‖wk‖² − 2t(1 − c)‖wk‖² + 4t²L²m1²
        ≤ (1 − [(1 − c)(2Lm1)⁻¹δ]²)‖wk‖²,

where the last inequality is obtained by taking t = (1 − c)(2Lm1)⁻²‖wk‖² ∈ (0, 1) and assuming δ ∈ (0, Lm1) and ‖wk‖ > δ. Setting r = 1 − [(1 − c)(2Lm1)⁻¹δ]², it follows that

‖wk+1‖² ≤ r‖wk‖² ≤ ... ≤ r^k(Lm1)².

Therefore, after a finite number of iterations, ‖wk+1‖ ≤ δ. □

Finally, Algorithm 3 is the minimization algorithm, which computes a descent direction in each iteration.

Algorithm 3 A minimization algorithm; xk = Min(f, x1, θε, θδ, ε1, δ1, c).
1: Input: f (a locally Lipschitz function defined on a complete Riemannian manifold M); x1 ∈ M (a starting point); c, θε, θδ, ε1, δ1 ∈ (0, 1); k = 1.
2: Step 1 (Set new parameters): s = 1 and x^s_k = xk.
3: Step 2 (Descent direction): (g^s_k, n^s_k) = Descent(x^s_k, δk, c, εk)
4:   ‖w^s_k‖ = min{‖w‖ : w ∈ conv W^s_k}.
5: if ‖w^s_k‖ = 0 then stop
6: else let g^s_k = −w^s_k/‖w^s_k‖ be the descent direction.
7: end if
8: if ‖w^s_k‖ ≤ δk then set εk+1 = εkθε, δk+1 = δkθδ, xk+1 = x^s_k, k = k + 1, and go to Step 1.
9: else set
     σ = argmax{σ ≥ εk : f(exp_{x^s_k}(σg^s_k)) − f(x^s_k) ≤ −cσ‖w^s_k‖}
   and construct the next iterate x^{s+1}_k = exp_{x^s_k}(σg^s_k). Set s = s + 1 and go to Step 2.
10: end if


Theorem 3.18. If f : M → R is a locally Lipschitz function on a complete Riemannian manifold M, and

N = {x : f(x) ≤ f(x1)}

is bounded, then either Algorithm 3 terminates after a finite number of iterations with ‖w^s_k‖ = 0, or every accumulation point of the sequence {xk} belongs to the set

X = {x ∈ M : 0 ∈ ∂f(x)}.

Proof. Note that there exists ε < i(N) such that ∂εf is well defined on N. If the algorithm terminates after a finite number of iterations, then x^s_k is an ε-stationary point of f. Suppose that the algorithm does not terminate after finitely many iterations. Since g^s_k is a descent direction and σ ≥ εk, we have

f(x^{s+1}_k) − f(x^s_k) ≤ −cεk‖w^s_k‖ < 0

for s = 1, 2, ...; therefore f(x^{s+1}_k) < f(x^s_k) for s = 1, 2, .... Since f is Lipschitz and N is bounded, it follows that f has a minimum in N. Therefore, {f(x^s_k)}s is a bounded decreasing sequence in R and hence convergent. Thus f(x^s_k) − f(x^{s+1}_k) converges to zero, and there exists sk such that

f(x^s_k) − f(x^{s+1}_k) ≤ cεkδk for all s ≥ sk.

Thus

(3.4) ‖w^s_k‖ ≤ (f(x^s_k) − f(x^{s+1}_k))/(cεk) ≤ δk, s ≥ sk.

Hence, after finitely many iterations there exists sk such that

xk+1 = x^{sk}_k

and

min{‖v‖ : v ∈ conv W^{sk}_k} ≤ δk.

Since M is a complete Riemannian manifold and {xk} ⊂ N is bounded, there exists a subsequence {xki} converging to a point x∗ ∈ M. Since conv W^{ski}_{ki} is a subset of ∂_{εki}f(x^{ski}_{ki}), we have

‖wki‖ = min{‖v‖ : v ∈ ∂_{εki}f(x^{ski}_{ki})} ≤ δki.

Hence lim_{i→∞} ‖wki‖ = 0. Note that wki ∈ ∂_{εki}f(x^{ski}_{ki}); hence, by Lemma 3.8 and Remark 3.7, 0 ∈ ∂f(x∗). □

4. Numerical Experiments

We close this article with several numerical experiments. We set the parameters as follows: c = 0.2, δ1 = 10⁻⁵, ε1 = 0.1, and θδ = 1. In Algorithm 3, we set θε = 0.1 for k ≤ 4 and θε = 0.8 for k > 4; the algorithm terminates when εk < 10⁻⁷. With the Armijo parameter c = 0.2, we use the simple line search strategy

σ = argmax{σ ≥ εk : f(exp_{x^s_k}(σg^s_k)) − f(x^s_k) ≤ −cσ‖w^s_k‖};

that is, we start with σ = 1 and backtrack with a factor γ = 0.5. It is worth pointing out that in Algorithm 2 we generate g1 randomly; therefore our algorithm has a stochastic behavior.
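The backtracking rule just described can be sketched as follows (a Euclidean toy version; the helper name, test function, and fallback handling are ours):

```python
import numpy as np

# Start at sigma = 1 and halve until the sufficient-decrease test holds.
def line_search(f, x, g, w_norm, eps_k, c=0.2, gamma=0.5):
    sigma = 1.0
    while f(x + sigma * g) - f(x) > -c * sigma * w_norm:
        sigma *= gamma
        if sigma < eps_k:        # fall back to the lower bound sigma >= eps_k
            return eps_k
    return sigma

f = lambda x: float(np.dot(x, x))        # smooth test function f(x) = ||x||^2
x = np.array([0.3, 0.0])
g = np.array([-1.0, 0.0])                # unit descent direction
sigma = line_search(f, x, g, w_norm=0.6, eps_k=1e-3)  # ||grad f(x)|| = 0.6
print(sigma)  # -> 0.25
```

The loop rejects sigma = 1 and 0.5 before accepting 0.25, which is the first step satisfying the sufficient-decrease test for this f.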


4.1. Denoising on a Riemannian manifold. We solve the one-dimensional total variation problem for functions mapping into a manifold. Let M be a manifold and consider the minimization problem

(4.1) min_{u ∈ BV([0,1];M)} F(u) := dist2(f, u)² + λ‖∇u‖1,

where f : [0, 1] → M is the given (noisy) function, u is a function of bounded variation from [0, 1] to M, dist2 is the distance on the function space L²([0, 1]; M), and λ > 0 is a Lagrangian parameter [41]. Note that for every w ∈ [0, 1], ∇u(w) : R → T_{u(w)}M and ‖∇u‖1 = ∫_{[0,1]} ‖∇u(w)‖ dw. We can now formulate a discrete version of problem (4.1) by restricting the space of functions to V_h^M, the space of all geodesic finite element functions for M associated with a regular grid on [0, 1]; see [42, 21]. We refer to [42] for the definition of the geodesic finite element spaces V_h^M.

Using the nodal evaluation operator ε : V_h^M → Mⁿ, (ε(vh))i = vh(xi), where xi is the i-th vertex of the simplicial grid on [0, 1], one can find an equivalent problem defined on Mⁿ as follows:

(4.2) min_{u ∈ Mⁿ} F∗(u) := dist∗(ε(f), u)² + λ‖∇(ε⁻¹(u))‖1,

where dist∗ is the Riemannian distance on Mⁿ.

Theorem 4.1. Let M be a Hadamard manifold. If F∗ is defined as in (4.2), then F∗ is convex as a function defined on Mⁿ.

Proof. It is enough to prove that ‖∇(ε⁻¹(u))‖1 is convex; that is, we must show that ∫_{[0,1]} ‖∇v_h^u(w)‖ dw is convex, where v_h^u is the geodesic finite element function corresponding to u. To this end, let u1, u2 be two arbitrary points in Mⁿ and let γ = (γ1, ..., γn) be a geodesic connecting them. We first show that for every fixed w ∈ [0, 1], f(t) = ‖∇v_h^{γ(t)}(w)‖ is convex as a function of t. Define

g(t) = 1/2 ⟨∇v_h^{γ(t)}(w), ∇v_h^{γ(t)}(w)⟩.

Assume that Γ is a grid on [0, 1] and (si, si+1) ∈ Γ is such that w ∈ (si, si+1); moreover, let σ_t^i be a minimizing geodesic parametrized by arc length connecting γi(t) and γi+1(t). Since σ_t^i is a geodesic with constant speed, we have

g(t) = 1/2 ⟨∇σ_t^i(w), ∇σ_t^i(w)⟩ = 1/2 ∫₀¹ ⟨∇σ_t^i(x), ∇σ_t^i(x)⟩ dx.

Now we define another smooth function G : [0, 1] × [0, 1] → M by

G(t, x) = σ_t^i(x).

We put V(t, x) := ∂G/∂t(t, x) and write ∇σ_t^i(x) = ∇G = (∂G/∂x) dx. Consider the vector bundle T([0, 1] × [0, 1])∗ ⊗ G⁻¹TM over [0, 1] × [0, 1], which admits a natural fiber metric and a standard connection ∇ compatible with the metric. Under the natural identification, we denote ∇x = ∇_(0, ∂/∂x) and ∇t = ∇_(∂/∂t, 0). Therefore,

1/2 ∂²/∂t² ⟨(∂G/∂x) dx, (∂G/∂x) dx⟩
  = ∂/∂t ⟨∇t (∂G/∂x) dx, (∂G/∂x) dx⟩
  = ∂/∂t ⟨∇x (∂G/∂t) dx, (∂G/∂x) dx⟩
  = ⟨∇t∇x (∂G/∂t) dx, (∂G/∂x) dx⟩ + ⟨∇x (∂G/∂t) dx, ∇x (∂G/∂t) dx⟩
  = ⟨∇x∇t (∂G/∂t) dx, (∂G/∂x) dx⟩ + ⟨R(∂G/∂t, ∂G/∂x)(∂G/∂t) dx, (∂G/∂x) dx⟩
    + ⟨∇x V dx, ∇x V dx⟩.

Since ∇ is metric (and the boundary curves x = 0, 1 are the geodesics γi(t), γi+1(t), so ∇t(∂G/∂t) vanishes there),

0 = ∫₀¹ ∂/∂x ⟨∇t (∂G/∂t), (∂G/∂x) dx⟩ dx = ∫₀¹ ⟨∇x∇t (∂G/∂t) dx, (∂G/∂x) dx⟩ dx.

Hence,

g″(t) = ∫₀¹ (‖∇V‖² − trace⟨R(∇G, V)V, ∇G⟩) dx.

Since M is a Hadamard manifold, it follows that g″(t) ≥ 0, which implies that g is convex. By the definition of g, it is clear that g(t) = 1/2 f²(t). Assuming f(t) ≠ 0, we have

f″(t) = (g″(t)f²(t) − (g′(t))²)/f³(t).

Hence,

f″(t) = 1/f³(t) ∫₀¹ (‖∇V‖² ⟨∇v_h^{γ(t)}(w), ∇v_h^{γ(t)}(w)⟩ − trace⟨R(∇G, V)V, ∇G⟩ ⟨∇v_h^{γ(t)}(w), ∇v_h^{γ(t)}(w)⟩ − ⟨∇V, ∇v_h^{γ(t)}(w)⟩²) ≥ 0,

which follows from the Cauchy–Schwarz inequality and the nonpositivity of the sectional curvature. Thus, for every w ∈ [0, 1], f(t) = ‖∇v_h^{γ(t)}(w)‖ is convex; hence

‖∇v_h^{γ(t)}(w)‖ ≤ t‖∇v_h^{u1}(w)‖ + (1 − t)‖∇v_h^{u2}(w)‖,

which implies

∫_{[0,1]} ‖∇v_h^{γ(t)}(w)‖ dw ≤ t ∫_{[0,1]} ‖∇v_h^{u1}(w)‖ dw + (1 − t) ∫_{[0,1]} ‖∇v_h^{u2}(w)‖ dw;

that is, ∫_{[0,1]} ‖∇v_h^u(w)‖ dw = ‖∇(ε⁻¹(u))‖1 is convex. □

Note that if M is a Hadamard manifold, dist2 is also a convex function on Mⁿ. Hence we can conclude that F∗ is convex on Mⁿ.

Let ε(f) = (p1, ..., pn); then F∗ : Mⁿ → R can be written as

F∗(u1, ..., un) = Σⁿ_{i=1} dist(pi, ui)² + λ Σ^{n−1}_{i=1} dist(ui, ui+1),

where dist is the Riemannian distance on M. In order to find the subdifferential of F∗, we have to find the subdifferentials of the distance and squared distance functions. The distance function is differentiable at (p, q) ∈ M × M if and only if there is a unique length-minimizing geodesic from p to q. Furthermore, the distance function is smooth in a neighborhood of (p, q) if and only if p and q are not conjugate points along this minimizing geodesic. Consequently, the distance function is nondifferentiable at (p, q) if and only if p = q or p and q are conjugate points. If the distance function is differentiable at (p, q), then

∂dist/∂p (p, q) = −exp⁻¹p(q)/dist(p, q),    ∂dist²/∂p (p, q) = −2 exp⁻¹p(q).
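The gradient identity for the squared distance can be checked by finite differences on S² (implementation ours, using the standard sphere exponential and logarithm):

```python
import numpy as np

# Finite-difference check of grad_p dist^2(p, q) = -2 exp_p^{-1}(q) on S^2.
def s2_exp(x, v):
    nv = np.linalg.norm(v)
    return x if nv < 1e-15 else np.cos(nv) * x + np.sin(nv) * v / nv

def s2_log(x, y):
    th = np.arccos(np.clip(x @ y, -1.0, 1.0))
    return np.zeros(3) if th < 1e-15 else th / np.sin(th) * (y - np.cos(th) * x)

d2 = lambda p, q: np.arccos(np.clip(p @ q, -1.0, 1.0)) ** 2   # dist^2

p = np.array([0.0, 0.0, 1.0])
q = np.array([np.sin(1.0), 0.0, np.cos(1.0)])     # dist(p, q) = 1
v = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)        # unit tangent at p

t = 1e-6
fd = (d2(s2_exp(p, t * v), q) - d2(s2_exp(p, -t * v), q)) / (2 * t)
analytic = (-2 * s2_log(p, q)) @ v
print(abs(fd - analytic) < 1e-5)  # -> True
```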

The following lemma is a direct consequence of Definition 2.3, the fact that distp is smooth on U \ {p}, where U is a sufficiently small open neighborhood of p, and the fact that expp is a radial isometry near 0 ∈ TpM.

Lemma 4.2. Let M be a complete Riemannian manifold. If distp : M → R is defined by distp(q) = dist(p, q), then

∂distp(p) = B,

where B is the closed unit ball of TpM.

In our numerical examples, we consider the two-dimensional sphere S² and the space of positive definite matrices, which is a Hadamard manifold. Therefore, F∗ is convex on the space of positive definite matrices, while F∗ is not a convex function on the sphere; see [48]. We first recall the properties of these two Riemannian manifolds.

The unit sphere S² is the smooth compact manifold

S² = {x ∈ R³ : ‖x‖ = 1},

and the global coordinates on S² are naturally given by this embedding into R³. The tangent space at a point x ∈ S² is

TxS² = {v ∈ R³ : ⟨x, v⟩ = 0}.

The inner product on TxS² is defined by

⟨v, w⟩_{TxS²} = ⟨v, w⟩_{R³}.

The exponential map expx : TxS² → S² is given by

expx(v) = cos(‖v‖)x + sin(‖v‖) v/‖v‖.

Moreover, if x ∈ S², then exp⁻¹x : S² → TxS² is given by

exp⁻¹x(y) = (θ/sin θ)(y − x cos θ),

where θ = arccos⟨x, y⟩. The Riemannian distance between two points x, y ∈ S² is given by

dist(x, y) = arccos⟨x, y⟩.

Let t ↦ γ(t) be a geodesic on S², and let u = γ̇(0)/‖γ̇(0)‖. The parallel translation of a vector v ∈ Tγ(0)S² along the geodesic γ is given by [2]

L_{γ(0)γ(t)}(v) = −γ(0) sin(‖γ̇(0)‖t)u′v + u cos(‖γ̇(0)‖t)u′v + (I − uu′)v.
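These formulas are straightforward to implement and test numerically (implementation ours): the exponential and logarithm are mutually inverse, and parallel translation is an isometry that maps tangent vectors at x to tangent vectors at the geodesic endpoint.

```python
import numpy as np

# The S^2 formulas above, checked numerically.
def exp_x(x, v):
    nv = np.linalg.norm(v)
    return x if nv < 1e-15 else np.cos(nv) * x + np.sin(nv) * v / nv

def log_x(x, y):
    th = np.arccos(np.clip(x @ y, -1.0, 1.0))
    return np.zeros(3) if th < 1e-15 else th / np.sin(th) * (y - np.cos(th) * x)

def transport(x, w, t, v):
    """Parallel translation of v along the geodesic s -> exp_x(s w), up to s = t."""
    nw = np.linalg.norm(w)
    u = w / nw
    uv = u @ v
    return -x * np.sin(nw * t) * uv + u * np.cos(nw * t) * uv + (np.eye(3) - np.outer(u, u)) @ v

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
v = log_x(x, y)
assert np.allclose(exp_x(x, v), y)                       # exp and log are inverse
assert np.isclose(np.linalg.norm(v), np.arccos(x @ y))   # ||log_x(y)|| = dist(x, y)

w = np.array([0.0, np.pi / 2, 0.0])          # geodesic from x reaching y at t = 1
z = np.array([0.0, 0.0, 1.0])                # tangent vector at x
zt = transport(x, w, 1.0, z)
assert np.isclose(np.linalg.norm(zt), 1.0)   # translation is an isometry
assert abs(zt @ exp_x(x, w)) < 1e-12         # still tangent at the endpoint
v2 = transport(x, w, 1.0, w / np.linalg.norm(w))
assert np.allclose(v2, [-1.0, 0.0, 0.0])     # velocity direction is transported
print("ok")
```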


Utilizing the properties of the exponential map on a Riemannian manifold, for a fixed point x ∈ S² and each ε > 0 we may find a number δx > 0 such that

‖d(exp⁻¹x)(y) − Lyx‖ ≤ ε whenever dist(x, y) < δx.

It is worthwhile to mention that on any sphere antipodal points are conjugate points, but without loss of generality we can assume that ui and ui+1 are not conjugate: since we use more than two nodal points for the discretization of F∗, it can be assumed that there is a nodal point between every two antipodal points.

The set of symmetric positive definite matrices, as a Riemannian manifold, is the most studied example of a manifold of nonpositive curvature. The space of all n × n symmetric positive definite matrices is denoted by P(n). The tangent space to P(n) at any of its points P is TPP(n) = {P} × S(n), where S(n) is the space of symmetric n × n matrices. On each tangent space TPP(n), the inner product is defined by

⟨A, B⟩_{TPP(n)} = tr(P⁻¹AP⁻¹B).

The Riemannian distance between P, Q ∈ P(n) is given by

dist(P, Q) = (Σⁿ_{i=1} ln²(λi))^{1/2},

where λi, i = 1, ..., n, are the eigenvalues of P⁻¹Q. The exponential map expP : S(n) → P(n) is defined by

expP(v) = P^{1/2} exp(P^{−1/2}vP^{−1/2})P^{1/2}.

Moreover, if P ∈ P(n), then exp⁻¹P : P(n) → S(n) is defined by

exp⁻¹P(Q) = P^{1/2} log(P^{−1/2}QP^{−1/2})P^{1/2},

where log and exp denote the logarithm and exponential functions on matrix space; for more details see [35].
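These formulas can be implemented with symmetric eigendecompositions, which give the matrix square root, exponential, and logarithm on P(n) without complex round-off (implementation ours):

```python
import numpy as np

# The P(n) formulas above, via symmetric eigendecompositions.
def sym_fun(A, f):
    """Apply f to the eigenvalues of a symmetric matrix A."""
    w, U = np.linalg.eigh(A)
    return U @ np.diag(f(w)) @ U.T

def exp_P(P, V):
    S = sym_fun(P, np.sqrt); Si = np.linalg.inv(S)
    return S @ sym_fun(Si @ V @ Si, np.exp) @ S

def log_P(P, Q):
    S = sym_fun(P, np.sqrt); Si = np.linalg.inv(S)
    return S @ sym_fun(Si @ Q @ Si, np.log) @ S

def dist(P, Q):
    Si = np.linalg.inv(sym_fun(P, np.sqrt))
    lam = np.linalg.eigvalsh(Si @ Q @ Si)    # same spectrum as P^{-1} Q
    return np.sqrt(np.sum(np.log(lam) ** 2))

P = np.array([[2.0, 0.5], [0.5, 1.0]])
Q = np.array([[1.0, 0.2], [0.2, 3.0]])
V = log_P(P, Q)
assert np.allclose(exp_P(P, V), Q, atol=1e-8)    # exp_P and log_P are inverse
# dist(P, Q) equals the norm of log_P(Q) in the metric tr(P^{-1}A P^{-1}B)
Pi = np.linalg.inv(P)
assert np.isclose(dist(P, Q), np.sqrt(np.trace(Pi @ V @ Pi @ V)))
print("ok")
```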

First, we assume that M = S². We define a function from [0, 1] to S² as the original image, add Gaussian noise to it to obtain the noisy image, and finally apply Algorithm 3 to the function F∗ defined on M¹⁰⁰ to obtain the denoised image; see Figure 1.

For another example, we assume that M = P(2). We add Gaussian noise to an original image on P(2) and then apply Algorithm 3 to F∗ on M¹⁰⁰ to denoise the noisy image. In Figure 2, we present the results of the minimization of F∗ on M¹⁰⁰. Note that every symmetric positive definite matrix A ∈ P(2) defines an ellipse: the principal axes are given by the eigenvectors of A, and the square roots of the eigenvalues are the radii of the corresponding axes.

4.2. Riemannian geometric median on the sphere S². Let M be a Riemannian manifold. Given points p1, ..., pm in M and corresponding positive real weights w1, ..., wm with Σᵐ_{i=1} wi = 1, define the weighted sum-of-distances function

f(q) = Σᵐ_{i=1} wi dist(pi, q),


Figure 1. TV regularization on S2: the original image, the noisy image, and the denoised image.

Figure 2. TV regularization on P(2). Down-to-up: the original image, the noisy image, the denoised image.


Figure 3. ε-subgradient descent for Riemannian geometric median on S2 (function values vs. iterations).

Figure 4. ε-subgradient descent for Rayleigh quotients on S2 (function values vs. iterations).

where dist is the Riemannian distance function on M. We define the weighted geometric median x as the minimizer of f. When all the weights are equal, wi = 1/m, we call x simply the geometric median. Now we assume that M = S². In Figure 3, the results of the ε-subgradient algorithm for the Riemannian geometric median on S² are plotted; Algorithm 2 terminated after only one iteration.
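Away from the data points pi, f is smooth with Riemannian gradient −Σ wi exp⁻¹q(pi)/dist(pi, q). The sketch below (ours, with three hand-picked data points rather than the paper's test data) evaluates f on S² and checks that one normalized gradient step decreases it:

```python
import numpy as np

# f and its Riemannian gradient for the geometric median on S^2
# (valid at points q distinct from every p_i).
def s2_log(x, y):
    th = np.arccos(np.clip(x @ y, -1.0, 1.0))
    return np.zeros(3) if th < 1e-15 else th / np.sin(th) * (y - np.cos(th) * x)

def s2_exp(x, v):
    nv = np.linalg.norm(v)
    return x if nv < 1e-15 else np.cos(nv) * x + np.sin(nv) * v / nv

dist = lambda x, y: np.arccos(np.clip(x @ y, -1.0, 1.0))

pts = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])   # data points p_i
wts = np.full(3, 1.0 / 3.0)                               # equal weights

f = lambda q: sum(w * dist(p, q) for w, p in zip(wts, pts))
def grad(q):  # Riemannian gradient where all distances are positive
    return sum(-w * s2_log(q, p) / dist(p, q) for w, p in zip(wts, pts))

q = np.array([1.0, 1.0, 2.0]) / np.sqrt(6.0)     # not the median
g = grad(q)
q1 = s2_exp(q, -0.1 * g / np.linalg.norm(g))     # one normalized gradient step
print(f(q1) < f(q))  # -> True
```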

4.3. Rayleigh quotients on S². We consider the maximum of m Rayleigh quotients on the sphere S², i.e.,

(4.3) f(x) = max_{i=1,...,m} 1/2 x′Aix,

with Ai ∈ S(3). Our aim is to find a minimum of f. In Figure 4, the results of the ε-subgradient algorithm for Rayleigh quotients on S² are plotted. We have observed that for small ε > 0, Algorithm 2 terminates after one iteration; indeed, one rarely encounters points where several of the x′Aix attain the maximum simultaneously.
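At a point where the maximum in (4.3) is attained by a unique index i, f agrees locally with the smooth function 1/2 x′Aix, whose Riemannian gradient is the tangential projection of Aix. A sketch with hand-picked diagonal Ai (ours, not the paper's test data), checked against a finite difference along a geodesic:

```python
import numpy as np

# f(x) = max_i (1/2) x'A_i x on S^2 with A_i = diag(2 e_i); here f(x) = max_i x_i^2.
As = [np.diag([2.0, 0, 0]), np.diag([0, 2.0, 0]), np.diag([0, 0, 2.0])]

f = lambda x: max(0.5 * x @ A @ x for A in As)

def riem_grad(x):
    """Riemannian gradient where the maximizing index is unique."""
    i = int(np.argmax([0.5 * x @ A @ x for A in As]))
    g = As[i] @ x
    return g - (x @ g) * x                      # project onto T_x S^2

x = np.array([0.48, 0.6, 0.64])                 # unit vector; x3^2 is the unique max
v = np.array([0.8, 0.0, -0.6])                  # unit tangent at x
curve = lambda t: np.cos(t) * x + np.sin(t) * v # geodesic through x with velocity v
t = 1e-6
fd = (f(curve(t)) - f(curve(-t))) / (2 * t)     # directional derivative, numerically
print(abs(fd - riem_grad(x) @ v) < 1e-8)  # -> True
```

For this x and v the directional derivative is 2·0.64·(−0.6) = −0.768, matching the projected gradient.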


4.4. Sphere packing on Grassmannians. Let the Grassmannian Gr(n, k) be the set of all k-dimensional linear subspaces of Rⁿ. In this section, we consider the problem of packing m spherical balls on Gr(n, k) with respect to the chordal distance. Let B(P, r) denote the ball of radius r centered at P in Gr(n, k) with respect to the chordal distance. We would like to find m points P1, ..., Pm in Gr(n, k) such that

(4.4) max{r | ∀i ≠ j : B(Pi, r) ∩ B(Pj, r) = ∅}

is maximized. This problem has been solved in [18] using a subgradient method. Indeed, Gr(n, k) can be identified with the set {P ∈ S(n) | P² = P, tr(P) = k}. Moreover, the tangent space of the Grassmannian at the point P is

TPGr(n, k) = {PΩ − ΩP | Ω ∈ so(n)}, where so(n) = {Ω ∈ Rⁿˣⁿ | Ω′ = −Ω}.

As Gr(n, k) is a subset of the Euclidean vector space S(n), the scalar product ⟨P, Q⟩ := tr(PQ) induces a Riemannian metric on it. The chordal distance on Gr(n, k), denoted by dist(P, Q), is then

dist(P, Q) = √(1/2) ‖P − Q‖F,

where ‖·‖F denotes the Frobenius norm. On Gr(n, k) with the induced Riemannian metric, the geodesic γ emanating from P in the direction η ∈ TPGr(n, k) is

γ(t) = exp(t(ηP − Pη)) P exp(−t(ηP − Pη)).

Problem (4.4) is equivalent to minimizing the nonsmooth function

(4.5) F(P1, ..., Pm) := max_{i≠j} tr(PiPj)

on Gr(n, k) × ... × Gr(n, k); see [18].

In Table 1, we report the results of the nonsmooth subgradient (SB) method and of our method for sphere packing in Gr(16, 2) with m = 10 after 80 iterations, with the same arbitrary starting points and the same step lengths for both methods. The computation time of the two methods is the same. Our results show that both methods have similar performance on this example. However, the ε-subgradient algorithm is more general: it does not require an explicit formula for the subdifferential, which can instead be computed approximately.
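Since tr P² = tr P = k for a rank-k orthogonal projection, dist(P, Q)² = k − tr(PQ); so minimizing the objective (4.5) directly maximizes the pairwise chordal separation. A small check of both identities (implementation ours):

```python
import numpy as np

# Chordal distance on Gr(n, k) via projection matrices:
# dist(P, Q) = sqrt(1/2) ||P - Q||_F for rank-k orthogonal projections P, Q.
def proj(X):
    """Orthogonal projection onto the column span of X (n x k, full rank)."""
    Q, _ = np.linalg.qr(X)
    return Q @ Q.T

def chordal(P, Q):
    return np.linalg.norm(P - Q, 'fro') / np.sqrt(2.0)

# Two orthogonal 2-planes in R^4: all principal angles are pi/2, so the
# chordal distance attains its maximum sqrt(k) = sqrt(2).
P = proj(np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]))
Q = proj(np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
assert np.isclose(chordal(P, Q), np.sqrt(2.0))
# The packing objective (4.5) couples pairs through tr(P_i P_j):
assert np.isclose(chordal(P, Q) ** 2, 2 - np.trace(P @ Q))   # dist^2 = k - tr(PQ)
print("ok")
```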

5. Conclusions

We have presented a practical algorithm in the context of ε-subgradient methods for nonsmooth problems on Riemannian manifolds. To the best of our knowledge, this is the first practical work on approximating the subdifferential of locally Lipschitz functions on Riemannian manifolds. Using the algorithms presented in this paper, one can solve nonsmooth locally Lipschitz minimization problems on Riemannian manifolds. The main result is the global convergence property of our minimization algorithm, stated in Theorem 3.18. Moreover, compared with the subgradient algorithm of [18], the ε-subgradient algorithm is much more general, because it does not need an explicit formula for the subdifferential, which can be computed approximately. An implementation of our


Table 1. Numerical results in terms of number of function evaluations and the final obtained value of the function for sphere packings on Grassmannians.

No.  Minimal distance (our method)  nfeval  Minimal distance (SB)  nfeval
 1   1.853                           605    1.853                   601
 2   1.682                           230    1.683                   231
 3   1.775                          1101    1.775                  1109
 4   1.872                          1128    1.872                  1123
 5   1.801                           237    1.801                   234
 6   1.813                          1608    1.813                  1609
 7   1.872                          1205    1.872                  1209
 8   1.874                           891    1.874                   890
 9   1.719                          1400    1.718                  1409
10   1.789                           989    1.789                   980
11   1.708                           425    1.708                   430
12   1.804                          1236    1.804                  1230
13   1.897                          1674    1.897                  1679
14   1.676                          2344    1.676                  2340
15   1.631                          1300    1.631                  1306

proposed minimization algorithm is written in the Matlab environment and has been tested on several problems.

References

[1] P. A. Absil, C. G. Baker, Trust-region methods on Riemannian manifolds, Found. Comput. Math., 7 (2007), pp. 303-330.
[2] P. A. Absil, R. Mahony, R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, 2008.
[3] R. L. Adler, J. P. Dedieu, J. Y. Margulies, M. Martens, M. Shub, Newton's method on Riemannian manifolds and a geometric model for the human spine, IMA J. Numer. Anal., 22 (2002), pp. 359-390.
[4] B. Afsari, R. Tron, R. Vidal, On the convergence of gradient descent for finding the Riemannian center of mass, SIAM J. Control Optim., 51 (2013), pp. 2230-2260.
[5] D. Azagra, J. Ferrera, F. Lopez-Mesas, Nonsmooth analysis and Hamilton-Jacobi equations on Riemannian manifolds, J. Funct. Anal., 220 (2005), pp. 304-361.
[6] D. Azagra, J. Ferrera, Applications of proximal calculus to fixed point theory on Riemannian manifolds, Nonlinear Anal., 67 (2007), pp. 154-174.
[7] A. M. Bagirov, Continuous subdifferential approximations and their applications, J. Math. Sci., 115 (2003), pp. 2567-2609.
[8] G. C. Bento, O. P. Ferreira, P. R. Oliveira, Local convergence of the proximal point method for a special class of nonconvex functions on Hadamard manifolds, Nonlinear Anal., 73 (2010), pp. 564-572.
[9] G. C. Bento, O. P. Ferreira, P. R. Oliveira, Unconstrained steepest descent method for multicriteria optimization on Riemannian manifolds, J. Optim. Theory Appl., 154 (2012), pp. 88-107.
[10] G. C. Bento, J. G. Melo, A subgradient method for convex feasibility on Riemannian manifolds, J. Optim. Theory Appl., 152 (2012), pp. 773-785.
[11] G. C. Bento, O. P. Ferreira, P. R. Oliveira, Proximal point method for a special class of nonconvex functions on Hadamard manifolds, Optimization (2012), pp. 1-31.
[12] P. B. Borckmans, S. Easter Selvan, N. Boumal, P. A. Absil, A Riemannian subgradient algorithm for economic dispatch with valve-point effect, J. Comput. Appl. Math., 255 (2014), pp. 848-866.
[13] J. V. Burke, A. S. Lewis, M. L. Overton, Approximating subdifferentials by random sampling of gradients, Math. Oper. Res., 27 (2002), pp. 567-584.
[14] J. V. Burke, A. S. Lewis, M. L. Overton, A robust gradient sampling algorithm for nonsmooth, nonconvex optimization, SIAM J. Optim., 15 (2005), pp. 751-779.
[15] F. H. Clarke, Necessary Conditions for Nonsmooth Problems in Optimal Control and the Calculus of Variations, Thesis, University of Washington, Seattle, 1973.
[16] J. X. da Cruz Neto, L. L. Lima, P. R. Oliveira, Geodesic algorithm in Riemannian manifolds, Balkan J. Geom. Appl., 2 (1998), pp. 89-100.
[17] J. X. da Cruz Neto, O. P. Ferreira, L. R. Lucambio Perez, A proximal regularization of the steepest descent method in Riemannian manifold, Balkan J. Geom. Appl., 2 (1999), pp. 1-8.
[18] G. Dirr, U. Helmke, C. Lageman, Nonsmooth Riemannian optimization with applications to sphere packing and grasping, in Lagrangian and Hamiltonian Methods for Nonlinear Control 2006: Proceedings of the 3rd IFAC Workshop, Nagoya, Japan, 2006, Lecture Notes in Control and Information Sciences, Vol. 366, Springer, Berlin, 2007.
[19] O. P. Ferreira, P. R. Oliveira, Proximal point algorithm on Riemannian manifolds, Optimization, 51 (2002), pp. 257-270.
[20] O. P. Ferreira, P. R. Oliveira, Subgradient algorithm on Riemannian manifolds, J. Optim. Theory Appl., 97 (1998), pp. 93-104.
[21] P. Grohs, H. Hardering, O. Sander, Optimal a priori discretization error bounds for geodesic finite elements, SAM Report 2013-16, ETH Zurich, 2013, submitted.
[22] A. A. Goldstein, Optimization of Lipschitz continuous functions, Math. Program., 13 (1977), pp. 14-22.
[23] S. Hosseini, M. R. Pouryayevali, Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds, Nonlinear Anal., 74 (2011), pp. 3884-3895.
[24] S. Hosseini, M. R. Pouryayevali, Euler characterization of epi-Lipschitz subsets of Riemannian manifolds, J. Convex Anal., 20 (2013), no. 1, pp. 67-91.
[25] S. Hosseini, M. R. Pouryayevali, On the metric projection onto prox-regular subsets of Riemannian manifolds, Proc. Amer. Math. Soc., 141 (2013), pp. 233-244.
[26] K. C. Kiwiel, Methods of Descent for Nondifferentiable Optimization, Lecture Notes in Mathematics, 1133, Springer-Verlag, Berlin, 1985.
[27] W. Klingenberg, Riemannian Geometry, Walter de Gruyter Studies in Mathematics, Vol. 1, Walter de Gruyter, Berlin, New York, 1995.
[28] D. Kressner, M. Steinlechner, B. Vandereycken, Low-rank tensor completion by Riemannian optimization, BIT Numer. Math., 54 (2014), pp. 447-468.
[29] S. Lang, Fundamentals of Differential Geometry, Graduate Texts in Mathematics, Vol. 191, Springer, New York, 1999.
[30] P. Y. Lee, Geometric Optimization for Computer Vision, PhD thesis, Australian National University, 2005.
[31] C. Lemarechal, Nondifferentiable optimization, in Handbook in Operations Research and Management Science (G. L. Nemhauser et al., eds.), Vol. 1, North Holland, Amsterdam, 1989, pp. 529-572.
[32] C. Li, B. S. Mordukhovich, J. Wang, J. C. Yao, Weak sharp minima on Riemannian manifolds, SIAM J. Optim., 21 (2011), no. 4, pp. 1523-1560.
[33] R. E. Mahony, The constrained Newton method on a Lie group and the symmetric eigenvalue problem, Linear Algebra Appl., 248 (1996), pp. 67-89.
[34] N. Mahdavi-Amiri, R. Yousefpour, An effective nonsmooth optimization algorithm for locally Lipschitz functions, J. Optim. Theory Appl., 155 (2012), pp. 180-195.
[35] M. Moakher, M. Zerai, The Riemannian geometry of the space of positive-definite matrices and its application to the regularization of positive-definite matrix-valued data, J. Math. Imaging Vision, 40 (2011), pp. 171-187.
[36] J. Nocedal, S. J. Wright, Numerical Optimization, Springer, 1999.
[37] E. A. Papa Quiroz, E. M. Quispe, P. R. Oliveira, Steepest descent method with a generalized Armijo search for quasiconvex functions on Riemannian manifolds, J. Math. Anal. Appl., 341 (2008), pp. 467-477.
[38] W. Ring, B. Wirth, Optimization methods on Riemannian manifolds and their application to shape space, SIAM J. Optim., 22 (2012), no. 2, pp. 596-627.
[39] R. C. Riddell, Minimax problems on Grassmann manifolds. Sums of eigenvalues, Adv. Math., 54 (1984), pp. 107-199.
[40] R. T. Rockafellar, Convex Functions and Dual Extremum Problems, Thesis, Harvard, 1963.
[41] L. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms, Phys. D, 60 (1992), no. 1, pp. 259-268.
[42] O. Sander, Geodesic finite elements for Cosserat rods, Internat. J. Numer. Methods Engrg., 82 (2010), pp. 1645-1670.
[43] S. T. Smith, Optimization techniques on Riemannian manifolds, Fields Institute Communications, 3 (1994), pp. 113-146.
[44] C. Udriste, Convex Functions and Optimization Methods on Riemannian Manifolds, Kluwer Academic Publishers, Dordrecht, Netherlands, 1994.
[45] K. Usevich, I. Markovsky, Optimization on a Grassmann manifold with application to system identification, submitted (http://homepages.vub.ac.be/imarkovs/t-abstracts.html).
[46] B. Vandereycken, S. Vandewalle, A Riemannian optimization approach for computing low-rank solutions of Lyapunov equations, SIAM J. Matrix Anal. Appl., 31 (2010), no. 5, pp. 2553-2579.
[47] B. Vandereycken, Low-rank matrix completion by Riemannian optimization, SIAM J. Optim., 23 (2013), no. 2, pp. 1214-1236.
[48] S. T. Yau, Non-existence of continuous convex functions on certain Riemannian manifolds, Math. Ann., 207 (1974), pp. 269-270.
[49] L. S. Zhang, X. L. Sun, An algorithm for minimizing a class of locally Lipschitz functions, J. Optim. Theory Appl., 90 (1996), pp. 203-212.
[50] L. H. Zhang, Riemannian Newton method for the multivariate eigenvalue problem, SIAM J. Matrix Anal. Appl., 31 (2010), pp. 2972-2996.