
JID:TCS AID:9672 /FLA Doctopic: Theory of natural computing [m3G; v 1.133; Prn:30/04/2014; 8:45] P.1 (1-21)

Theoretical Computer Science ••• (••••) •••–•••

Contents lists available at ScienceDirect

Theoretical Computer Science

www.elsevier.com/locate/tcs

Optimizing linear functions with the (1 + λ) evolutionary algorithm—Different asymptotic runtimes for different instances ✩

Benjamin Doerr a,b,c, Marvin Künnemann b,d,∗
a Laboratoire d’Informatique (LIX), École Polytechnique, Palaiseau, France
b Max Planck Institute for Informatics, Saarbrücken, Germany
c Saarland University, Saarbrücken, Germany
d Saarbrücken Graduate School of Computer Science, Germany


Article history:
Received 1 November 2013
Accepted 7 March 2014
Available online xxxx

Keywords:
Population-based EA
Runtime analysis

We analyze how the (1 + λ) evolutionary algorithm (EA) optimizes linear pseudo-Boolean functions. We prove that it finds the optimum of any linear function within an expected number of O((1/λ) n log n + n) iterations. We also show that this bound is sharp for some linear functions, e.g., the binary value function. Since previous work shows an asymptotically smaller runtime for the special case of OneMax, it follows that for the (1 + λ) EA different linear functions may have runtimes of different asymptotic order. The proof of our upper bound relies heavily on a number of classic and recent drift analysis methods. In particular, we show how to analyze a process displaying different types of drift in different phases. Our work corrects a wrongfully claimed better asymptotic runtime in an earlier work [13]. We also use our methods to analyze the runtime of the (1 + λ) EA on the OneMax test function and obtain a new upper bound of O(n log log λ / log λ) for the case that λ is larger than the cut-off point Θ(log n log log n / log log log n), where a linear speed-up ceases to exist. While our results are mostly spurred by a theory-driven interest, they also show that choosing the right size of the offspring population can be crucial. For both the binary value and the OneMax test function we observe that once a linear speed-up ceases to exist, the speed-up from a larger λ reduces to sub-logarithmic (still at the price of a linear increase of the cost of each generation).

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

If there is one single problem that was most influential on the theory of evolutionary computation, then clearly it is the innocent question of how simple evolutionary algorithms optimize linear pseudo-Boolean functions, that is, functions f : {0,1}^n → R; x ↦ ∑_{i=1}^n w_i x_i with w_1, . . . , w_n ∈ R. While easy to state, this problem is surprisingly difficult and already for the simple (1 + 1) evolutionary algorithm (EA) was the subject of a sequence of works [10,11,14–16,18,6,5,3,2,19,35,7,4,36]. More importantly, the quest for understanding this linear functions problem led to a large number of strong analysis tools like artificial fitness functions [11], additive drift analysis [14], average drift [18], multiplicative drift [6] and adaptive drift [3]

✩ A preliminary version of this work has been presented at the 15th Genetic and Evolutionary Computation Conference (GECCO 2013) [8].

* Correspondence to: Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany. Tel.: +49 681 9325 1027; fax: +49 681 9325 1099.
E-mail address: [email protected] (M. Künnemann).

http://dx.doi.org/10.1016/j.tcs.2014.03.015
0304-3975/© 2014 Elsevier B.V. All rights reserved.


(see Section 3 for an explanation of these terms). These techniques subsequently were heavily used and gave rise to a number of remarkable results in different areas, e.g., to the ICARIS 2011 best paper award winner [22] analyzing an immune-inspired B-cell algorithm for the vertex cover problem.

In this work, we significantly extend this line of research and analyze how the (1 + λ) EA solves the linear functions problem. We prove a sharp upper bound on the runtime for arbitrary linear functions. This is the first time a sharp runtime analysis for a population-based EA for the linear functions problem is given (note that the only previous work in this direction is not correct, see the discussion below).

1.1. Optimizing linear functions

Rigorously analyzing how evolutionary algorithms solve optimization problems, and supporting this understanding with mathematical proofs, is one of the main goals in the theory of evolutionary computation. The hope is that a deeper understanding tells us which problems are tractable with evolutionary methods, what particular methods are best suited to solve a particular problem, how to choose the parameters of these methods in the right way, and what computational efforts to expect.

Unfortunately, even for simple evolutionary algorithms and very simple optimization problems such an understanding is surprisingly hard to obtain. For example, the innocent-looking question how the (1 + 1) EA finds the minimum of a pseudo-Boolean linear function f : {0,1}^n → R, x ↦ ∑_{i=1}^n w_i x_i with positive coefficients w_1, . . . , w_n kept the field busy for around 15 years. It seems obvious that any reasonable randomized search heuristic should easily find the unique optimum x* = (0, . . . , 0), since flipping any 1-bit into a 0-bit strictly improves the objective value. Unfortunately, if the w_i are not all equal, mutations may be accepted that increase the number of 1-bits (namely if the gain from flipping fewer 1-bits to 0-bits is larger than the loss from flipping more bits in the other direction). It is easy to prove that randomized local search (RLS) (flipping only single bits) finds the optimum of any linear function in n ln n + O(n) iterations. However, the analogous bound of en ln n + O(n) for the (1 + 1) EA was only proven last year in Witt’s remarkable STACS paper [35], ending a series of works [10,11,14–16,18,6,5,3,2,19,7,4,36] from various research groups over the last 15 years. See the extended version [36] of Witt’s paper for a detailed account of the history of this problem.
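The contrast between RLS and the (1 + 1) EA is easy to experiment with. The following sketch (our own illustrative implementation; function and parameter names are not from the paper) runs RLS, which flips exactly one uniformly chosen bit per iteration and accepts whenever the objective value does not worsen:

```python
import random

def rls_runtime(weights, seed=0):
    """Randomized local search minimizing f(x) = sum_i w_i * x_i.

    Flips a single uniformly chosen bit per iteration, accepts if the
    objective does not increase, and returns the number of iterations
    until the all-zeros optimum is reached.
    """
    rng = random.Random(seed)
    n = len(weights)
    x = [rng.randint(0, 1) for _ in range(n)]
    f = lambda y: sum(w * b for w, b in zip(weights, y))
    t = 0
    while any(x):
        t += 1
        y = x[:]
        i = rng.randrange(n)          # flip exactly one bit
        y[i] = 1 - y[i]
        if f(y) <= f(x):              # accept if not worse
            x = y
    return t

# With all weights equal (OneMax), roughly n ln n iterations suffice.
print(rls_runtime([1] * 64))
```

Replacing the single-bit flip by standard bit mutation (each bit flipped independently with probability 1/n) turns this sketch into the (1 + 1) EA discussed above.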

One noteworthy consequence of Witt’s result is that the (1 + 1) EA finds all linear functions (with non-zero coefficients) equally easy to optimize (ignoring lower order terms). This follows from the lower bound [1] of (1 − o(1))en ln n shown for the particular linear function OneMax together with the general result [7] that OneMax has the shortest optimization time among all linear functions (when using the (1 + 1) EA).

1.2. Population-based EA

While the linear functions problem for the (1 + 1) EA is now well understood and, moreover, many tight or near-tight analyses for combinatorial optimization problems like minimum spanning trees or shortest paths exist (see the book [27]), our understanding of population-based EAs, even simple ones like the (μ + 1) EA or the (1 + λ) EA, is comparably weak.

Among the few runtime analyses of population-based EAs, the one that comes closest to our work is the paper by Jansen, De Jong, and Wegener [21]. Among other results, they determine the asymptotic optimization times of the (1 + λ) EA on the classic test functions OneMax and LeadingOnes. While for the LeadingOnes function an optimization time of Θ((1/λ) n² + n) generations is not too difficult to show, the optimization behavior on the simple linear function OneMax : {0,1}^n → R; x ↦ ∑_{i=1}^n x_i is already surprisingly rich. For all λ = O(log n log log n / log log log n), only O((1/λ) n log n) generations are needed. Hence for these λ, a linear speed-up is observed. For larger values of λ, a linear speed-up provably does not exist, that is, ω((1/λ) n log n) generations are necessary. Note that these results are far from trivial; in fact, their proofs span several pages in [21]. Similar analyses of how the (μ + 1) EA solves the OneMax problem, also technically highly demanding, have been conducted by Storch [32] and Witt [34]; results on how the (1, λ) EA optimizes OneMax were given by Jägersküpper and Storch [20] as well as Rowe and Sudholt [31].

There are some more results on runtimes of population-based EAs on artificial functions and a few scattered results on combinatorial optimization problems [28,26], but the overall impression remains that our understanding of population-based EAs is much inferior to the one of simpler randomized search heuristics.

1.3. Our result

In this work, we extend both the sequence of results on the (1 + 1) EA optimizing linear functions as well as the analysis [21] of how the (1 + λ) EA optimizes the OneMax function.

Our technically most demanding result (Theorem 8) is that the (1 + λ) EA with arbitrary population size λ (which may and usually will be a function of the bit-string length n) finds the optimum of any linear function in O((1/λ) n log n + n) generations. When λ = O(n^{1−ε}), we also show that this bound holds with probability 1 − n^{−s}, s any constant, if the implicit constant in the runtime statement is made large enough.

These bounds are larger than some of the bounds in [21] for the OneMax function. This is with good reason. We show (Theorems 19 and 21) that our bound is sharp (apart from constant factors, but again with strong concentration) for the linear function BinVal : {0,1}^n → R; x ↦ ∑_{i=1}^n x_i 2^{i−1} and all λ = O(n). Consequently, not all linear functions have the same


asymptotic runtime. We find this quite surprising—recall that for the (1 + 1) EA, all linear functions not only have the same asymptotic runtime of O(n log n), but even are all in (1 ± o(1))en ln(n).
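The two linear functions contrasted here are easy to state in code; this snippet (helper names are our own) fixes the conventions used above, with bit i carrying weight 2^{i−1} in BinVal:

```python
def onemax(x):
    """OneMax(x) = number of one-bits, i.e. all weights equal 1."""
    return sum(x)

def binval(x):
    """BinVal(x) = sum_i x_i * 2^(i-1): the bit string read as a binary number."""
    return sum(b << (i - 1) for i, b in enumerate(x, start=1))

print(onemax([1, 0, 1, 1]))   # 3
print(binval([1, 0, 1, 1]))   # 1*1 + 0*2 + 1*4 + 1*8 = 13
```

BinVal is the extreme opposite of OneMax among linear functions: bit n alone outweighs all other bits together, which is what drives the different asymptotic runtime.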

Our result also shows that the point when larger populations cease to yield a linear speed-up can, for general linear functions, be smaller than for OneMax. There, this turning point, also called cut-off point, was shown to be Θ(log n log log n / log log log n), whereas our result shows that it can be as low as Θ(log n) for certain linear functions.

We also use our methods to give a sharp runtime analysis for the (1 + λ) EA on OneMax, see Section 8. Here it was only known that the cut-off point is λ = Θ(log n log log n / log log log n), leading to a runtime of Θ((1/λ) n log n) generations for λ values up to this slightly super-logarithmic cut-off point. For all λ larger than the cut-off, but still in O(n^{1−ε}), ε an arbitrarily small constant, we show that the runtime is Θ(n log log λ / log λ). Consequently, at the cut-off point, the speed-up reduces from linear to only sub-logarithmic.

From a broader perspective, this and the result on linear functions show how important the choice of the offspring population is. Since every generation of the (1 + λ) EA incurs Θ(λ) fitness evaluations, our results imply that above the cut-off point, the total computational effort increases linearly (in the case of BinVal) or almost linearly (in the case of OneMax) with λ. Hence using an inappropriately large λ can lead to a significant increase in the computational effort while giving no or almost no speed-up, even under the optimistic assumption that the generations can be sampled fully in parallel.

Given that both the analyses of how the (1 + 1) EA optimizes linear functions and the work [21] on how the (1 + λ) EA optimizes OneMax are quite technical, it is no surprise that to prove our results we need a number of non-trivial tools, mostly from the drift analysis world. A particular difficulty we face is that the (1 + λ) EA, when optimizing a linear function, first makes constant progress (both in terms of the Hamming distance to the optimum and in terms of the potential function we use for the drift analysis), but later makes progress proportional to the distance from the optimum. Hence the first phase calls for the additive drift theorem of He and Yao [14], the second phase for the multiplicative one [6]. Using either theorem for both phases would necessarily not give sharp results. Unfortunately, we cannot simply use one analysis after the other—while in the second phase we observe an expected multiplicative drift, it is well possible that mutations are accepted that bring us back into the range of additive drift. Hence we cannot guarantee a multiplicative drift for the whole remaining process, as required to apply the multiplicative drift theorem.

We solve this difficulty as follows. We easily see that the additive and multiplicative regimes overlap. Once in the multiplicative regime, we may use an additive negative drift theorem [29] to show that the probability of leaving the multiplicative range is small. Hence the expected number of times we revert to the purely additive regime is small, and Wald’s equation allows us to bound the total time spent. We are optimistic that this technique will find more applications in the near future.

1.4. Relation with He’s analysis [13]

In the description of previous results above, we omitted a paper [13] that appeared at CEC 2010. The reason is that its main claim is not correct, as we noticed when conducting this research.

In [13], the linear functions problem for the (1 + λ) EA is also analyzed. An upper bound of O((1/λ) n log n + n log log λ / log λ) is claimed to hold for all linear functions. For λ asymptotically larger than logarithmic in n, this contradicts our lower bound of Ω(n) for the BinVal linear function (see Section 7). Consequently, [13] also obtains the too optimistic cut-off point that a linear speed-up only ceases to exist for λ = ω(log n log log n / log log log n).

We are sorry to have to point out this mistake in [13], because it is truly a courageous work. Note that at that time, neither multiplicative nor adaptive drift was known. Consequently, an extremely complicated potential function had to be invented, which allows one to analyze an inherently non-additive process via the additive drift theorem. This might be the cause of the problem, and also the reason why no reviewer found the mistake.

2. Preliminaries and notation

In this work, we will be solely regarding one evolutionary algorithm called the (1 + λ) evolutionary algorithm ((1 + λ) EA), and use it to minimize pseudo-Boolean functions f : {0,1}^n → R.

This algorithm keeps a population P = {x} of size one only. Initially, this one member is chosen uniformly at random from the search space {0,1}^n of all length-n bit strings.

In one iteration, exactly λ offspring x^{(1)}, . . . , x^{(λ)} of x are generated independently via standard bit-mutation with mutation probability 1/n. That is, for all i ∈ [λ] := {1, . . . , λ} independently, we have that x^{(i)} satisfies the following: For all bit positions k ∈ [n] independently, we have Pr(x^{(i)}_k = x_k) = 1 − 1/n and Pr(x^{(i)}_k = 1 − x_k) = 1/n. If the best of these offspring is at least as good as x, it replaces x in the population. More precisely, we choose x* uniformly at random from the multi-set {x^{(i)} | i ∈ [λ] ∧ ∀j ∈ [λ] : f(x^{(j)}) ≥ f(x^{(i)})} of offspring having the best f-value. If f(x*) ≤ f(x), then we set x := x*. Recall that we aim at minimizing f.

This loop is iterated until a suitable termination criterion is satisfied. Since in this theoretical investigation we aim at understanding how many iterations it takes to find an optimal solution, we do not specify a termination criterion here.

The usual measure for the performance of an evolutionary algorithm is its optimization time, which is the number T_f of f-evaluations the algorithm performs until an optimal solution is found. Note that for the (1 + λ) EA, this is at most λ


Algorithm 1: The (1 + λ) Evolutionary Algorithm for minimizing pseudo-Boolean functions f : {0,1}^n → R.

1 choose x ∈ {0,1}^n uniformly at random;
2 for t = 0 to ∞ do
3     create x^{(1)}, . . . , x^{(λ)} ∈ {0,1}^n independently, each by flipping each bit of x with probability 1/n;
4     choose x* uniformly at random from the multi-set {x^{(i)} | i ∈ [λ] ∧ ∀j ∈ [λ] : f(x^{(j)}) ≥ f(x^{(i)})};
5     if f(x*) ≤ f(x) then
6         x := x*

times the number of iterations (“generations”) performed (plus one, to be precise, to include the evaluation of the initial individual). For convenience, in this work we shall mainly talk about the number of generations until an optimum is found. Note also that T_f is a random variable, since we regard a randomized algorithm.
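Algorithm 1 can be transcribed almost line by line into executable form. The following Python sketch is our own (names and the generation budget, added only so the loop terminates, are not from the paper); it assumes the optimum has f-value 0, which matches the minimization of linear functions with positive weights:

```python
import random

def one_plus_lambda_ea(f, n, lam, max_gens=100000, seed=0):
    """(1 + lambda) EA minimizing f : {0,1}^n -> R, following Algorithm 1.

    Returns (final individual, generations used); stops when f reaches 0
    or the (artificial) generation budget is exhausted.
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]            # line 1
    for t in range(max_gens):                            # line 2
        if f(x) == 0:                                    # optimum reached
            return x, t
        # line 3: lambda offspring via standard bit mutation with rate 1/n
        offspring = [[1 - b if rng.random() < 1.0 / n else b for b in x]
                     for _ in range(lam)]
        # line 4: a uniformly chosen best offspring
        best = min(f(y) for y in offspring)
        x_star = rng.choice([y for y in offspring if f(y) == best])
        # lines 5-6: elitist selection
        if f(x_star) <= f(x):
            x = x_star
    return x, max_gens

# Minimizing OneMax (optimum value 0 at the all-zeros string):
x, gens = one_plus_lambda_ea(lambda y: sum(y), 32, 8)
print(gens, sum(x))
```

Note that one generation costs λ fitness evaluations, which is exactly the factor relating the optimization time T_f to the number of generations discussed above.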

We shall analyze how the (1 + λ) EA optimizes linear pseudo-Boolean functions, that is, functions f : {0,1}^n → R; x ↦ f(x) := ∑_{i=1}^n w_i x_i, where w_1, . . . , w_n are arbitrary real-valued coefficients. It is easy to see (consult [11] if in doubt) that in the analysis of upper bounds of the optimization time of the (1 + 1) EA on linear functions we may assume without loss of generality that the weights w_i are all positive and in increasing order, that is, we have 0 < w_1 ≤ · · · ≤ w_n. Naturally, the same is true for the (1 + λ) EA. Hence in the following we assume

0 < w_1 ≤ · · · ≤ w_n.

2.1. Tools from probability theory

The following tools from probability theory will be used to derive tail bounds for sums of random variables; refer, e.g., to [25].

Lemma 1 (Chernoff bounds). Let X_1, . . . , X_n be independent Poisson trials such that Pr[X_i = 1] = p_i. Let X = ∑_{i=1}^n X_i and μ = E[X]. Then the following Chernoff bounds hold.

1. For any δ > 0, we have Pr[X ≥ (1 + δ)μ] < (e^δ / (1 + δ)^{1+δ})^μ.

2. For any 0 < δ ≤ 1, we have Pr[X ≥ (1 + δ)μ] ≤ e^{−μδ²/3}.

3. For R ≥ 6μ, it holds that Pr[X ≥ R] ≤ 2^{−R}.
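The third bound of Lemma 1 is the crudest but often the handiest. As a quick numerical sanity check (parameters are our own choice), it can be compared against the exact binomial tail:

```python
from math import comb

def binom_tail(n, p, r):
    """Exact Pr[X >= r] for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

n, p = 100, 0.01               # so mu = n*p = 1
R = 6                          # satisfies R >= 6*mu
exact = binom_tail(n, p, R)
bound = 2.0 ** (-R)            # Lemma 1.3: Pr[X >= R] <= 2^(-R)
print(exact, "<=", bound)
```

The exact tail is far below the bound here; the value of Lemma 1.3 lies in its simplicity, not its tightness.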

Lemma 2 (Azuma–Hoeffding inequality). Let X_0, . . . , X_n be a martingale such that

B_k ≤ X_k − X_{k−1} ≤ B_k + d_k

for some constants d_k and for some random variables B_k that may be functions of X_0, X_1, . . . , X_{k−1}. Then, for all t ≥ 0 and any λ > 0,

Pr[|X_t − X_0| ≥ λ] ≤ 2 e^{−2λ² / (∑_{k=1}^t d_k²)}.
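For a ±1 random walk one may take B_k = −1 and d_k = 2, so Lemma 2 bounds the deviation after t steps by 2e^{−λ²/(2t)}. A small Monte Carlo experiment (parameters our own) illustrates how conservative the bound is:

```python
import random
from math import exp

rng = random.Random(0)
t, dev = 400, 40                   # walk length and deviation threshold
trials = 20000
hits = sum(1 for _ in range(trials)
           if abs(sum(rng.choice((-1, 1)) for _ in range(t))) >= dev)
empirical = hits / trials
# For a +/-1 walk: B_k = -1, d_k = 2, so sum d_k^2 = 4t.
bound = 2 * exp(-2 * dev**2 / (4 * t))
print(empirical, "<=", bound)
```

With t = 400 and deviation 40 (two standard deviations), the empirical frequency is well below the Azuma–Hoeffding bound, as it must be.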

2.2. Notation

For x ∈ {0,1}^n, we write |x|_1 := ∑_{i=1}^n x_i (or OneMax(x), when this measure is used as fitness function) to count the number of one-bits of x. For a Boolean expression φ, e.g. f(x) ≤ f(x*), we write [φ] to denote the corresponding event that φ holds. For f : N → R and g : N → R_{>0}, we let f = Ω(g) imply that there is an n_0 such that f(n) > 0 for all n ≥ n_0.

3. Drift analysis

Our work heavily uses a method known as drift analysis. To ease reading, we collect in this section the results we shall use.

In a nutshell, drift analysis comprises a number of methods that allow one to translate information on the expected progress of a randomized process into information on the time the process needs to reach a particular goal. In evolutionary algorithms, an easy example could be the following: We let the (1 + 1) EA optimize the LeadingOnes test function. Denote


by X_t the fitness of the population after the t-th iteration. Then it is easy to see that we have an expected progress of E[X_{t+1} − X_t | X_t < n] ≥ 1/(en). Drift analysis can translate this information into the bound that the first time T such that X_T = n, that is, an optimal solution is found, satisfies E[T] < en². Such an approach, already for the much more difficult linear functions problem, was first conducted in the theory of evolutionary computation by He and Yao in their seminal paper [14].
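With X_t the number of leading ones, the drift bound 1/(en) arises because flipping the leftmost zero while keeping all earlier bits has probability at least (1/n)(1 − 1/n)^{n−1} ≥ 1/(en); the distance to cover is at most n, giving E[T] ≤ n/(1/(en)) = en². A quick simulation of the (1 + 1) EA (our own implementation) makes the bound concrete:

```python
import random
from math import e

def leadingones(x):
    """Number of consecutive one-bits at the start of x."""
    count = 0
    for b in x:
        if b == 0:
            break
        count += 1
    return count

def one_plus_one_ea(n, seed=0):
    """(1 + 1) EA maximizing LeadingOnes; returns iterations to the optimum."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    t = 0
    while leadingones(x) < n:
        t += 1
        y = [1 - b if rng.random() < 1.0 / n else b for b in x]
        if leadingones(y) >= leadingones(x):
            x = y
    return t

n = 32
t = one_plus_one_ea(n)
print(t, "vs additive drift bound e*n^2 =", e * n * n)
```

Typical runs stay clearly below en², consistent with the drift-based upper bound on the expectation.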

Usually the random process will be more complicated than simply regarding the fitness of the individual (or its fitness difference to the optimum). Choosing the right measure for the progress of an EA is usually tricky. Let us remark that the idea of tracking not the fitness, but some potential function (“artificial fitness function”) was first used, without explicit drift methods, in [11].

The drift theorem first used in the analysis of evolutionary computation is the additive drift theorem of He and Yao [14,16]. It is strongest when we have a similar progress throughout the whole random process. Due to its symmetry in terms of upper and lower bounds, it often allows one to prove sharp bounds.

Lemma 3 (Additive drift theorem). Let S ⊆ R be a finite set of positive numbers and let {X_t}_{t∈N} be a sequence of random variables over S ∪ {0}. Let T be the random variable that denotes the first point in time t ∈ N for which X_t = 0.

Upper bound. Suppose that there exists a real number δ > 0 such that for all t ≥ 0, we have E[X_t − X_{t+1} | T > t] ≥ δ. Then E[T | X_0] ≤ X_0/δ.

Lower bound. Suppose that there exists a real number δ > 0 such that for all t ≥ 0, we have E[X_t − X_{t+1} | T > t] ≤ δ. Then E[T | X_0] ≥ X_0/δ.
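As a minimal illustration of Lemma 3 (a toy process of our own, not from the paper), consider an X_t that decreases by 1 with probability δ in each step; both drift conditions then hold with equality, so E[T] = X_0/δ exactly:

```python
import random

def additive_hitting_time(x0, delta, seed):
    """X_t decreases by 1 with probability delta per step: drift exactly delta."""
    rng = random.Random(seed)
    x, t = x0, 0
    while x > 0:
        if rng.random() < delta:
            x -= 1
        t += 1
    return t

x0, delta = 50, 0.25
avg = sum(additive_hitting_time(x0, delta, s) for s in range(500)) / 500
print(avg, "vs X0/delta =", x0 / delta)
```

Averaging over many runs, the empirical hitting time matches the value X_0/δ = 200 predicted by both directions of the additive drift theorem.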

Often, the progress is not constant, but stronger when further from the optimum. In such situations, the following multiplicative drift theorem can be the right tool. Note that it not only gives a bound on the expected hitting time, but also admits large deviation bounds. The concept of multiplicative drift was first used in [6,7]; the large deviation bound was found shortly thereafter [3,4].

Lemma 4 (Multiplicative drift theorem). Let S ⊆ R be a finite set of positive numbers with minimum s_min. Let {X_t}_{t∈N} be a sequence of random variables over S ∪ {0}. Let T be the random variable that denotes the first point in time t ∈ N for which X_t = 0. Suppose there exists a real number δ > 0 such that

E[X_t − X_{t+1} | X_t = s] ≥ δs

holds for all s ∈ S with Pr[X_t = s] > 0. Then for all s_0 ∈ S and t ≥ 0 with Pr[X_0 = s_0] > 0, we have

E[T | X_0 = s_0] ≤ (1 + ln(s_0/s_min)) / δ.

Moreover, we have Pr[T > (ln(X_0/s_min) + r)/δ] ≤ e^{−r} for all r > 0.
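A minimal illustration of Lemma 4 (again a toy process of our own): if each of the X_t remaining units independently disappears with probability δ per step, the drift is exactly δX_t, and with s_min = 1 the theorem bounds the expected hitting time by (1 + ln X_0)/δ:

```python
import random
from math import log

def mult_hitting_time(s0, delta, seed):
    """Each of the X_t units vanishes independently with probability delta,
    so E[X_t - X_{t+1} | X_t = s] = delta * s."""
    rng = random.Random(seed)
    x, t = s0, 0
    while x > 0:
        x -= sum(1 for _ in range(x) if rng.random() < delta)
        t += 1
    return t

s0, delta = 1000, 0.1
avg = sum(mult_hitting_time(s0, delta, s) for s in range(200)) / 200
bound = (1 + log(s0)) / delta
print(avg, "<=", bound)
```

The empirical average stays below the bound (1 + ln 1000)/0.1 ≈ 79, and note the logarithmic (rather than linear) dependence on the starting value, which is the hallmark of multiplicative drift.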

All theorems above deal with the situation that the target lies in the direction the drift is pointing to. Sometimes, we need to prove that the probability that we reach a target lying in the opposite direction of the drift is very small. A negative drift theorem as introduced in [30] yields such results.

Lemma 5 (Negative drift theorem with scaling). Let {X_t}_{t∈N} be the random variables describing a stochastic process over some discrete state space S ⊆ R. Let S_t := {(x_0, . . . , x_t) ∈ S^{t+1} | Pr[X_0 = x_0, . . . , X_t = x_t] > 0}. Suppose there exist an interval [a, b] in the state space and, possibly depending on ℓ := b − a, a bound ε := ε(ℓ) > 0 and a scaling factor 1 ≤ r := r(ℓ) ≤ min{ε²ℓ, √(εℓ)/(132 log(εℓ))} such that for all t ≥ 0 the following two conditions hold:

1. For all (x_0, . . . , x_t) ∈ S_t with a < x_t < b, we have E[X_t − X_{t+1} | X_0 = x_0, . . . , X_t = x_t] ≥ ε.
2. For all (x_0, . . . , x_t) ∈ S_t with x_t < b, we have Pr[|X_t − X_{t+1}| ≥ jr | X_0 = x_0, . . . , X_t = x_t] ≤ e^{−j} for all j ∈ N_0.

For the first hitting time T* := min{t ≥ 0 : X_t ≥ b} it then holds that

Pr[T* ≤ e^{εℓ/(132r²)} | X_0 ≤ a] = O(e^{−εℓ/(132r²)}).

Before the work [2], all drift analyses in the field of evolutionary computation used the same way of measuring progress (“universal potential function”) for all instances of the problem under consideration (e.g., the same potential function for all linear functions). Using a different potential function for each linear function, which is called adaptive drift analysis, was used


in [2] to obtain results on higher mutation probabilities, which provably cannot be shown via universal potential functions. The same is true for Witt’s [35,36] sharp analysis of the linear functions problem.

Essential to our analysis of linear functions is a separate treatment of different phases of the optimization process. In this context, the previously mentioned tools already allow us to derive large-deviation bounds on the runtime of the (1 + λ) EA. To additionally obtain corresponding results for the expected optimization time, we follow two approaches to merge the drift analyses of the separate phases.

The first approach uses a tool introduced, independently in [23] (this is the version we shall use in the following) and in [24], exactly for the purpose of conveniently dealing with more complex drift bounds than only constant or linear functions.

Lemma 6 (Johannsen’s variable drift theorem). Let S ⊆ R be a finite set of positive numbers with minimum s_min. Let {X_t}_{t∈N} be a sequence of random variables over S ∪ {0}. Let T be the random variable that denotes the first point in time t ∈ N for which X_t = 0. Suppose that there exists a continuous and monotone increasing function h : R → R_{≥0} such that for all s ∈ S

E[X_t − X_{t+1} | X_t = s] ≥ h(s)

holds for all t < T. Then,

E[T | X_0] ≤ s_min/h(s_min) + ∫_{s_min}^{X_0} 1/h(x) dx.
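A concrete reading of Lemma 6 (numbers our own): with s_min = 1 and the multiplicative drift bound h(s) = δs, the right-hand side becomes 1/δ + ∫_1^{X_0} dx/(δx) = (1 + ln X_0)/δ, i.e., the variable drift theorem recovers the bound of Lemma 4. The snippet evaluates the integral numerically to confirm this:

```python
from math import log

def variable_drift_bound(s_min, x0, h, steps=100000):
    """Johannsen's bound: s_min/h(s_min) + integral over [s_min, x0] of 1/h,
    with the integral approximated by the midpoint rule."""
    width = (x0 - s_min) / steps
    integral = sum(width / h(s_min + (i + 0.5) * width) for i in range(steps))
    return s_min / h(s_min) + integral

delta, x0 = 0.1, 1000.0
numeric = variable_drift_bound(1.0, x0, lambda s: delta * s)
closed_form = (1 + log(x0)) / delta
print(numeric, closed_form)
```

The same routine handles drift functions that are neither constant nor linear, which is exactly the situation Lemma 6 was introduced for.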

We finally state a result strongly connected to drift analysis that enables us to pursue a second approach to combine the analysis of different phases. Wald’s equation [33] allows for bounding the expectation of a sum of random variables, where the range of summation is also controlled by a random variable (a so-called stopping time). It is usually stated for all random variables in the sum being identically distributed. We shall need the following slight extension, which avoids the notion of stopping times and is analogous to a lower-bound version used in [17]. We, however, do not require the random variables X_1, X_2, . . . to be bounded.

Lemma 7. Let T be a random variable with bounded expectation and let X_1, X_2, . . . be non-negative random variables with E[X_i | T ≥ i] ≤ C. Then

E[∑_{i=1}^T X_i] ≤ E[T] · C.

Proof. For i ≥ 1, let I_{T≥i} be the indicator variable defined by I_{T≥i} = 1 if and only if T ≥ i. We compute

E[∑_{i=1}^T X_i] = ∑_{i=1}^∞ E[I_{T≥i} X_i] = ∑_{i=1}^∞ Pr[T ≥ i] E[X_i | T ≥ i] ≤ C ∑_{i=1}^∞ Pr[T ≥ i] = E[T] · C.

The transformation in the first line is legal, since the series ∑_{i=1}^∞ E[I_{T≥i} X_i] converges absolutely, which follows from E[T] < ∞. □

Note that when all X_i, conditioned on T ≥ i, have equal expectation C, then we actually have equality: E[∑_{i=1}^T X_i] = E[T] · C. In this case, Wald’s equation may also be used to conduct a drift analysis. If X_i is the progress made in iteration i, then d := ∑_{i=1}^T X_i is simply the distance from the initial population to the optimum, and the time to get there satisfies E[T] = d/C. A similar application of Wald’s equation for the analysis of evolutionary algorithms can be found in [11].
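The intended use of Lemma 7 is a phase argument: T counts how often some event (such as falling back into a regime) occurs, and X_i is the cost of the i-th occurrence. A toy check (distributions entirely our own) in which the X_i are independent of T, so that equality holds:

```python
import random

rng = random.Random(1)

def one_trial():
    """T ~ Geometric(1/2) phases, each costing X_i ~ Uniform{1,...,9};
    the X_i are independent of T, so E[X_i | T >= i] = 5 =: C."""
    total, t = 0, 0
    while True:
        t += 1
        total += rng.randint(1, 9)      # cost X_i of phase i
        if rng.random() < 0.5:          # stop after this phase
            return t, total

samples = [one_trial() for _ in range(20000)]
avg_T = sum(t for t, _ in samples) / len(samples)
avg_cost = sum(c for _, c in samples) / len(samples)
print(avg_cost, "~", avg_T * 5)    # Wald: E[sum X_i] = E[T] * C here
```

Empirically the average total cost matches E[T] · C, and Lemma 7 guarantees that with only the one-sided condition E[X_i | T ≥ i] ≤ C the product remains an upper bound.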

4. Outline of the analysis

Our main theorem provides an upper bound on the optimization time of the (1 + λ) EA on arbitrary linear functions, both in expectation and with high probability.


Theorem 8. Let f(x) = ∑_{i=1}^n w_i x_i with 0 < w_1 ≤ · · · ≤ w_n be any linear function on bit strings with positive bit weights. Let λ : N → N be an arbitrary function. Then the (1 + λ) EA optimizes f in an expected number of

O((1/λ(n)) n log n + n)

iterations. Furthermore, if λ = O(n^{1−ε}) for some ε > 0, then for each s > 0 there is a constant C(s) such that the probability that the optimum is not found within C(s)((1/λ(n)) n log n + n) iterations is bounded by O(n^{−s}).

The proof of this theorem relies on analyzing two overlapping phases of the optimization process. In the beginning of the process, in each optimization step an individual with a strictly better fitness is found with constant probability. In this first phase, the gain of creating several offspring in parallel does not scale linearly with λ(n). Later in the process, when the current individual is already close to the optimum value, the probability of finding a strictly better individual decreases, which allows a speed-up asymptotically linear in λ(n).

In the rest of the section, let λ := λ(n) and choose arbitrary constants 0 < γ_2 < γ_1 < ln(3/e). We call Phase 1 active if the current individual x has at least θ_2 := γ_2 n/λ one-bits. Phase 2 becomes active if |x|_1 < θ_1 := γ_1 n/λ. Correspondingly, we say that an individual x lies above or below some threshold θ if its number of one-bits is at least or at most θ, respectively. We pessimistically assume that the starting individual lies above θ_1 and the process needs to traverse both phases to get to the optimum.

In Section 5.1, we will prove that Phase 1 can be traversed in O(n) generations both in expectation (by use of the additive drift theorem) and with high probability (by use of suitable martingales and the Azuma–Hoeffding inequality).

Lemma 9. Let x ∈ {0,1}^n. The expected number of generations until the (1 + λ) EA hits the first element below θ_2 when starting with x is bounded by O(n). Furthermore, if λ = O(e^{∛n}), there is a constant c such that the probability that it fails to do so in cn generations is at most e^{−Ω(∛n)}.

In contrast to the first phase, Phase 2 provides a drift towards the optimum that lies within an Ω(λ)-factor of the drift of the (1 + 1) EA. Using the multiplicative drift theorem, we obtain an expected number of O((1/λ) n log n) iterations to find the optimum, under the condition that the process does not leave Phase 2. Large deviation bounds follow immediately from the use of multiplicative drift. The following lemma will be proved in Section 5.2.

Lemma 10. The expected number of generations until the (1 + λ) EA hits the optimum 0^n under the condition that Phase 2 stays active throughout the process is bounded by O((1/λ) n log n). For any s > 0, the probability that it fails to find the optimum value in (c(s)/λ) n log n steps under this condition is bounded by n^{−s} for some constant c(s).

The lemma above needs to ensure that the drift guarantee of Phase 2 applies to all individuals observed throughout the process. It is however possible, although very unlikely, that the process returns above θ_1. By use of a negative drift theorem, we bound the probability of this event in Section 6.2.

Lemma 11. Let λ = O(n^{1−ε}) for some ε > 0. The probability to ever return above θ_1 with the (1 + λ) EA starting at some individual below θ_2 is bounded by e^{−Ω(n^{s′})} for some s′ > 0.

As an immediate corollary of the previous three lemmas, we bound the optimization time with high probability for λ = O(n^{1−ε}). Let s > 0 be arbitrary and choose c sufficiently large. The probability of not finding the optimum value in c((1/λ) n log n + n) steps can be bounded by the sum of the probabilities of (i) not finding the first element below θ_2 in cn steps, of (ii) returning to an element above θ_1 again, and of (iii) not finding the optimum value in (c/λ) n log n steps although never returning above θ_1. It is thus bounded by e^{−Ω(∛n)} + n^{−s} + e^{−Ω(n^{s′})} = O(n^{−s}).

To bound the expected number of generations to find the optimum, we split the optimization process into stages. In each stage, the (1 + λ) EA starts with an individual (typically above θ_1), traverses Phase 1 to below θ_2 and then either returns to above θ_1 and enters the next stage or finds the optimum and stops the process. In the first case, we enter the next stage starting from the individual above θ_1 that was hit. Define T_i as the time spent in stage i and N as the random number of stages; then we aim to compute E[∑_{i=1}^{N} T_i], which is the expectation of a sum of a random number of random variables. Wald's equation appears to be the appropriate tool for this. Note, however, that although the random variables are well-behaved in the sense that they are collectively bounded in expectation, they are not independent from each other and from N, which is why we need the generalized version of Wald's equation of Lemma 7.

To bound the expectation of each T_i, we pessimistically analyze the time under the assumption that only one of the stopping criteria of a stage, namely hitting the optimum value, is in effect. By Lemmas 9 and 10, we conclude that E[T_i] ≤ O(n + n log n/λ). We also need to argue that the indicator variable I_{N≥i} is independent from the outcome of X_i. Since, by definition, N ≥ i is a consequence of not finding the optimum in stages 1, . . . , i − 1, and only then the variable X_i is explicitly defined, the claim is obvious. It is left to bound the expectation of N using Lemma 11 to obtain, with the Wald-type inequality of Lemma 7,


E[∑_{i=1}^{N} T_i] ≤ E[N] · O(n + n log n/λ) ≤ (1 − e^{−Ω(n^s)})^{−1} · O(n + n log n/λ) = O(n + n log n/λ),

for sufficiently large n. This proves the upper bound on the expected optimization time for λ = O(n^{1−ε}). In Section 6.1, an alternative approach using the variable drift theorem will prove the result without this restriction on λ. We feel that the Wald-type approach is interesting in its own right, since it is a natural way to directly turn the statements proving a large-deviation bound on the runtime into a corresponding result for the expected optimization time. Furthermore, it is independent of whether Lemmas 9 and 10 were derived by drift analysis or other proof methods.
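The Wald-type argument can be illustrated by a toy simulation. The geometric number of stages and the uniform stage times below are illustrative assumptions only (not the distributions arising in the proof); the point is that the empirical mean of ∑_{i=1}^{N} T_i matches E[N] · E[T_1]:

```python
import random

def average_total_time(num_runs=200_000, p_restart=0.25):
    """Monte Carlo check of Wald's identity E[sum_{i<=N} T_i] = E[N] * E[T_1].

    Toy model: each stage takes T_i ~ Uniform{1,...,10} generations;
    after each stage the process restarts with probability p_restart,
    so N is geometric with E[N] = 1 / (1 - p_restart).
    """
    total = 0.0
    for _ in range(num_runs):
        elapsed = 0
        while True:
            elapsed += random.randint(1, 10)   # stage time, E[T_i] = 5.5
            if random.random() >= p_restart:   # no restart: last stage
                break
        total += elapsed
    return total / num_runs

random.seed(0)
avg = average_total_time()
expected = 5.5 / (1 - 0.25)   # E[N] * E[T_1]
```

Here the restart decision is independent of the stage times, so the classical Wald equation applies; the proof above needs the generalized version precisely because that independence fails.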

Our analysis for general linear functions can be refined to show a stronger upper bound for special classes of linear functions; see Section 8 for a treatment of OneMax. In addition, for bounded coefficients we expect to obtain the same asymptotic bounds that we prove for OneMax. For this, an approach using variable drift for the first phase or, alternatively, splitting the first phase again into two phases with different drift guarantees seems appropriate. The statement of the drift in Section 5.1 (Lemma 13) aims to provide a general way to perform such an analysis for more complex functions: it is the key statement to adapt for special cases of linear functions.

5. Bounding the drift

An essential aspect in drift analysis is to choose a suitable potential function to measure the progress of the random process towards the optimum value. We rely on the following potential function, which has proven useful to give tight bounds on the optimization time of the (1 + 1) EA on linear functions [35]. For x ∈ {0,1}^n, define g(x) := ∑_{i=1}^n g_i x_i with

g_i := (1 + 1/(n − 1))^{min{j ≤ i | w_j = w_i} − 1}.

Note that the weights g_1, . . . , g_n of the potential function are non-decreasing, as are the bit weights w_1, . . . , w_n. Furthermore, bits of equal weight receive the same weight in g. Note also that 1 ≤ g_i ≤ (1 + 1/(n − 1))^{n−1} ≤ e and thus g(x) ≤ en for all x ∈ {0,1}^n.
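The weights g_i can be computed directly from the bit weights; a small helper (our own naming, with 0-based indexing, whereas the paper indexes bits from 1):

```python
import math

def potential_weights(w):
    """Drift-potential weights g_i for sorted bit weights w_1 <= ... <= w_n.

    g_i = (1 + 1/(n-1)) ** (min{j <= i : w_j = w_i} - 1); in 0-based
    indexing the exponent is simply the first position holding w[i].
    """
    n = len(w)
    base = 1 + 1 / (n - 1)
    return [base ** w.index(w[i]) for i in range(n)]

g = potential_weights([1, 1, 2, 2, 2, 5, 8, 8])
```

Equal bit weights receive equal g-weights, the sequence is non-decreasing, and every g_i lies in [1, e], matching the properties stated above.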

The potential change Δ(x) of an optimization step with parent individual x is defined by Δ(x) = g(x) − g(x∗) if f(x∗) ≤ f(x) and Δ(x) = 0 otherwise. The drift is the expected potential change E[Δ(x)].

Let a parent individual x ∈ {0,1}^n \ {0^n} be given. In the following, we aim at bounding the bit-flip probabilities in the winner individual x∗. For the (1 + λ) EA, this is more intricate than for the (1 + 1) EA, since the bits of the winner individual are not independently flipped with probability 1/n, but result from choosing the offspring with the best fitness. This distribution even depends on the fitness function. However, for a very general class of events, we show how to bound the conditional bit-flip probability of a zero-bit of x in x∗.

For the statement of this class of events, let an index 1 ≤ j ≤ n with x_j = 0 be given. Let y^(1), . . . , y^(λ) ∈ {0,1}^n be a realization of the offspring individuals in which the winner individual y∗ := argmin_{z = y^(1),...,y^(λ)} f(z) has the j-th bit set to one, i.e., y∗_j = 1. Let m be the first index with f(y^(m)) = f(y∗). We define an injective map M : (y^(1), . . . , y^(λ)) ↦ (ȳ^(1), . . . , ȳ^(λ)) by

ȳ^(i) = y^(i) for i ≠ m,  ȳ^(m)_{j′} = y^(m)_{j′} for j′ ≠ j,  ȳ^(m)_j = 0.

We say that an event E is zero-flip closed if for any realization y^(1), . . . , y^(λ) with y∗_j = 1, (y^(1), . . . , y^(λ)) ∈ E implies that (ȳ^(1), . . . , ȳ^(λ)) ∈ E.

Lemma 12. Let j be such that x_j = 0 and E be a zero-flip closed event. Then Pr[x∗_j = 1 | E] ≤ 1/n.

Proof. We verify that M is indeed injective. Observe that given some ȳ^(1), . . . , ȳ^(λ), there is a unique winner individual and hence exactly one index m with y^(m) = y∗; we can uniquely reconstruct y^(1), . . . , y^(λ) by reversing the mapping of M, i.e., setting ȳ^(m)_j to one and using the values ȳ^(i)_{j′} for all other coordinates.

Consider any realization of the offspring individuals y^(1), . . . , y^(λ) ∈ {0,1}^n which satisfies the zero-flip closed event E and y∗_j = 1. Let m be such that y∗ = y^(m) and let ȳ^(1), . . . , ȳ^(λ) ∈ {0,1}^n be given by the mapping M. We have that

Pr[x^(1) = y^(1), . . . , x^(λ) = y^(λ)] = (1/n) · ∏_{j′,m′: j′≠j or m′≠m} Pr[x^(m′)_{j′} = y^(m′)_{j′}]


= (1/(n − 1)) · (1 − 1/n) · ∏_{j′,m′: j′≠j or m′≠m} Pr[x^(m′)_{j′} = y^(m′)_{j′}]
= (1/(n − 1)) · Pr[x^(1) = ȳ^(1), . . . , x^(λ) = ȳ^(λ)].

Since M is an injective map from E ∩ [x∗_j = 1] into E ∩ [x∗_j = 0], we can relate the event x∗_j = 1 conditioned on E to its opposite event x∗_j = 0 conditioned on E by computing

Pr[x∗_j = 1 | E] = (∑_{(y^(1),...,y^(λ)) ∈ E, y∗_j = 1} Pr[y^(1), . . . , y^(λ)]) / Pr[E]
= (1/(n − 1)) · (∑_{(y^(1),...,y^(λ)) ∈ E, y∗_j = 1} Pr[ȳ^(1), . . . , ȳ^(λ)]) / Pr[E]
≤ (1/(n − 1)) · (∑_{(y^(1),...,y^(λ)) ∈ E, y∗_j = 0} Pr[y^(1), . . . , y^(λ)]) / Pr[E]
= (1 − Pr[x∗_j = 1 | E]) / (n − 1).

Rearranging yields Pr[x∗_j = 1 | E] ≤ 1/n. □

5.1. Phase 1

The key step in proving the drift guarantee in Phase 1 (Lemma 9) is a careful, quite technical case analysis in the spirit of [7] and [35]. For this case analysis, Lemma 12 serves as a convenient tool to deal with favorable, yet non-trivial events. In the following, let E_ℓ denote the event that exactly ℓ of the one-bits of x flip in the winner individual x∗.

Lemma 13. Let x ∈ {0,1}^n be such that Pr[f(x^(i)) < f(x)] ≥ c_1/λ. Then E[Δ(x)] ≥ Ω(F(x)) and F(x) ≥ Ω(1), where F(x) = ∑_{ℓ=0}^{n} ℓ · Pr[E_ℓ | f(x∗) ≤ f(x)] is the expected number of flipped one-bits of x in an accepted winner individual x∗.

Proof. Let I = {i ∈ [n] | x_i = 1} denote the one-bits in x. The expected progress of an accepted offspring can be decomposed into

E[Δ(x) | f(x∗) ≤ f(x)] = ∑_{ℓ=0}^{|I|} Pr[E_ℓ | f(x∗) ≤ f(x)] · E[Δ(x) | E_ℓ and f(x∗) ≤ f(x)].

Consider the three events E_0, E_1 and E_2 ∪ · · · ∪ E_n separately. We have that E[Δ(x) | E_0 and f(x∗) ≤ f(x)] = 0, since under E_0, no zero-bit of x may flip in an accepted winner individual.

Let Z = {i ∈ [n] | x_i = 0} denote the zero-bits in x and let ℓ ≥ 2. Note that E_ℓ ∩ [f(x∗) ≤ f(x)] is a zero-flip closed event and recall that 1 ≤ g_i ≤ (1 + 1/(n − 1))^{i−1}. We compute

E[Δ(x) | E_ℓ and f(x∗) ≤ f(x)] ≥ ℓ − ∑_{j∈Z} Pr[x∗_j = 1 | E_ℓ and f(x∗) ≤ f(x)] · g_j
≥ ℓ − (1/n) ∑_{j=1}^{n} (1 + 1/(n − 1))^{j−1}
= ℓ − ((n − 1)/n) · ((1 + 1/(n − 1))^n − 1)
≥ ℓ − (e − (1 − 1/n)) = ℓ − β,

with β := e − 1 + 1/n < 2 ≤ ℓ for n ≥ 4, where the first inequality follows from Lemma 12.

It is left to analyze E_1, which we decompose into the events E_1^(i), in which the i-th bit is the only one-bit of x that flips in the winner individual x∗. In such an event, define the set W(i) = {j ∈ [n] | x_j = 0 and w_j = w_i} of zero-bits of x of equal weight to the flipped one-bit x_i. By W(i)_{=0}, we denote the event that none of the bits of W(i) flips in the winner individual x∗, and, analogously, W(i)_{≥1} denotes the complementing event in which at least one of the bits of W(i) flips.


For a given i ∈ I , consider the following case distinction:

Case 1: C_1 := E_1^(i) ∩ [f(x∗) ≤ f(x)] ∩ W(i)_{≥1}. In this case, at least one of the bits of W(i) flips in x∗. For f(x∗) ≤ f(x) to hold, this is the only zero-bit of x that flips in x∗. Call this bit j. From w_i = w_j it follows that g_i = g_j and hence E[Δ(x) | C_1] = 0.

Case 2: C_2 := E_1^(i) ∩ [f(x∗) ≤ f(x)] ∩ W(i)_{=0}. Define i∗ = min{i′ | w_i = w_{i′}}, so that all elements j < i∗ have a strictly smaller weight. Note that C_2 is a zero-flip closed event. We conclude, using Lemma 12, that

E[Δ(x) | C_2] ≥ g_{i∗} − ∑_{j<i∗: x_j=0} Pr[x∗_j = 1 | C_2] · g_j
≥ g_{i∗} − (1/n) ∑_{j=1}^{i∗−1} (1 + 1/(n − 1))^{j−1}
= g_{i∗} − ((n − 1)/n) · ((1 + 1/(n − 1))^{i∗−1} − 1)
= (1 + 1/(n − 1))^{i∗−1} − ((n − 1)/n)(1 + 1/(n − 1))^{i∗−1} + (n − 1)/n
= (1/n)(1 + 1/(n − 1))^{i∗−1} + 1 − 1/n ≥ 1.

Note that Pr[W(i)_{=0} | E_1^(i) and f(x∗) ≤ f(x)] ≥ (1 − 1/n)^{n−1} ≥ 1/e, again utilizing Lemma 12 and the fact that there are at most n − 1 zero-bits of the same weight. By the law of total probability, we immediately conclude that

E[Δ(x) | E_1^(i) and f(x∗) ≤ f(x)] ≥ E[Δ(x) | C_2] · Pr[W(i)_{=0} | E_1^(i) and f(x∗) ≤ f(x)] ≥ 1/e.

Recall that F(x) = ∑_{ℓ=0}^{n} ℓ · Pr[E_ℓ | f(x∗) ≤ f(x)] is the expected number of flipped one-bits of x in an accepted winner individual x∗ and compute, by the law of total probability,

E[Δ(x) | f(x∗) ≤ f(x)] ≥ ∑_{i∈I} Pr[E_1^(i) | f(x∗) ≤ f(x)] · (1/e) + ∑_{ℓ=2}^{|I|} Pr[E_ℓ | f(x∗) ≤ f(x)] · (ℓ − β)
= Pr[E_1 | f(x∗) ≤ f(x)] · (1/e) + ∑_{ℓ=2}^{|I|} Pr[E_ℓ | f(x∗) ≤ f(x)] · (ℓ − β)
= F(x) − β Pr[⋃_{ℓ≥2} E_ℓ | f(x∗) ≤ f(x)] − (1 − 1/e) Pr[E_1 | f(x∗) ≤ f(x)].

If F(x) ≥ 2β, this already proves that E[Δ(x) | f(x∗) ≤ f(x)] ≥ F(x) − β ≥ F(x)/2 = Ω(1). To prove a positive constant lower bound on the drift also when F(x) < 2β, set γ := 2 − β and note that the computations above show that

E[Δ(x) | f(x∗) ≤ f(x)] ≥ ∑_{ℓ=1}^{n} Pr[E_ℓ | f(x∗) ≤ f(x)] · γ = (1 − Pr[E_0 | f(x∗) ≤ f(x)]) · γ.

This yields

E[Δ(x)] = Pr[f(x∗) ≤ f(x)] · E[Δ(x) | f(x∗) ≤ f(x)]
≥ Pr[f(x∗) ≤ f(x)] · (1 − Pr[E_0 | f(x∗) ≤ f(x)]) · γ
= (Pr[f(x∗) ≤ f(x)] − Pr[E_0]) · γ
≥ (Pr[f(x∗) ≤ f(x)] − Pr[f(x∗) = f(x)]) · γ
= Pr[f(x∗) < f(x)] · γ
≥ (1 − (1 − c_1/λ)^λ) · γ ≥ (1 − 1/e^{c_1}) · γ = Ω(1),

where the fourth line follows from the fact that E_0 implies that f(x∗) = f(x) holds, and the last line by noting that each of the λ offspring is generated independently and strictly improves in fitness with probability at least c_1/λ. □


Note that F(x), and thus the bound on E[Δ(x)], heavily depends on the specific linear function to be minimized.

Using the above lemma, we can prove Lemma 9. Let the (1 + λ) EA start at an arbitrary individual x ∈ {0,1}^n. The hitting time of the first element below θ_2 is bounded by zero if already |x|_1 ≤ θ_2 = γ_2 n/λ holds. Otherwise the previous lemma applies, since

Pr[f(x^(i)) < f(x)] ≥ ∑_{j: x_j=1} (1/n)(1 − 1/n)^{n−1} ≥ |x|_1/(en) ≥ γ_2/(λe).

Consequently, for some D > 0, the drift satisfies E[Δ] ≥ D throughout the phase, and the additive drift theorem (Lemma 3) yields that the expected hitting time of the first element below the threshold is bounded by (max_x g(x))/D ≤ en/D.

To prove that the linear time bound also holds with high probability, we have to overcome two obstacles. First, although a constant drift is guaranteed throughout Phase 1, the random variables representing the drift in a fixed number of rounds B are not independent. Second, the individual progress of a single generation may get as large as Ω(n). Hence, Chernoff bounds do not directly apply here. However, it is very unlikely to encounter a very large potential change in O(n) optimization steps. Since, additionally, the lower bound on the drift holds regardless of how the process developed previously, it is possible to define a martingale on a process that is conditioned on experiencing only small drifts in each direction. Applying the Azuma–Hoeffding inequality to bound the deviation from the martingale's expectation, the time spent in Phase 1 can be bounded with high probability.

Lemma 14. Let x ∈ {0,1}^n be an arbitrary initial individual and λ = o(e^{∛n}). There is some constant c > 0 such that for the hitting time T_{x→θ_2} of the first individual below θ_2, it holds that

Pr[T_{x→θ_2} ≥ cn] ≤ e^{−Ω(∛n)}.

Proof. Let P_1 = x, P_2, P_3, . . . denote the parent individuals encountered by the (1 + λ) EA and X_1 := g(P_1), X_2 := g(P_2), . . . denote the potential of these parent individuals. To bound T_{x→θ_2} = min{t | X_t ≤ θ_2}, consider analyzing a slightly adapted stochastic process X̃_1, X̃_2, . . . which is almost identical to X_1, X_2, . . . , but forbids large potential changes in either direction and continues to have a constant drift even after the first individual below θ_2 is found. More formally, let a threshold parameter d be given. By Lemma 13, there is some ε > 0 with E[Δ(x)] ≥ ε for all x ∈ {0,1}^n with |x|_1 ≥ θ_2. For such an x, we set Pr[P̃_i = y | P̃_{i−1} = x] = Pr[P_i = y | P_{i−1} = x ∧ |X_i − X_{i−1}| ≤ ed], i.e., during Phase 1 we enforce the drift to stay in [−ed, ed]. For P̃_{i−1} = x with |x|_1 < θ_2 we set X̃_i := X̃_{i−1} − ε deterministically. Define Δ_1, Δ_2, . . . and Δ̃_1, Δ̃_2, . . . as the drifts throughout the optimization process, i.e., Δ_i := X_i − X_{i+1} and Δ̃_i := X̃_i − X̃_{i+1}. To compute E[Δ̃_i], we bound the probability that a large potential change occurs in round i in the original process. Let x be an individual in Phase 1; then, since every flipping bit changes the potential by at most g_n ≤ e,

Pr[|Δ_i| > ed | P_{i−1} = x] ≤ Pr[∃ 1 ≤ j ≤ λ : ∃ S ⊆ [n], |S| = d : all bits in S flip in x^(j)] ≤ λ · C(n, d) · (1/n)^d ≤ λ/d!.
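The last estimate uses the elementary bound C(n, d) · (1/n)^d ≤ 1/d!, which follows from C(n, d) ≤ n^d/d!. A quick numerical sanity check, with arbitrary sample values of n and d:

```python
from math import comb, factorial

# C(n, d) = n(n-1)...(n-d+1)/d! <= n^d/d!, hence C(n, d) * (1/n)^d <= 1/d!.
for n in (100, 1000, 10_000):
    for d in (2, 5, 10):
        assert comb(n, d) * (1.0 / n) ** d <= 1.0 / factorial(d)
```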

Choosing d = Ω(log(λn)/log log(λn)), this probability is bounded by o(1/n). Then, by the law of total probability,

E[Δ̃_i | P̃_{i−1} = x] = E[Δ_i | P_{i−1} = x ∧ |Δ_i| ≤ ed]
= (E[Δ_i | P_{i−1} = x] − Pr[|Δ_i| > ed | P_{i−1} = x] · E[Δ_i | P_{i−1} = x ∧ |Δ_i| > ed]) / Pr[|Δ_i| ≤ ed | P_{i−1} = x]
≥ ε/(1 − o(1)) − (Pr[|Δ_i| > ed | P_{i−1} = x] / (1 − Pr[|Δ_i| > ed | P_{i−1} = x])) · en
= ε − o(1).

Consequently, E[Δ̃_i] ≥ ε̃ for any positive constant ε̃ < ε and sufficiently large n. We are ready to define a martingale Z_0, Z_1, . . . with respect to X̃_1, . . . by Z_i := ∑_{j=1}^{i} (Δ̃_j − E[Δ̃_j]). Note that Z_0 = 0. We analyze the probability of not arriving below θ_2 in B := (1 + δ)en/ε̃ generations for some δ > 0. Observe that Z_B > −δen implies X̃_j < θ_2 for some 1 ≤ j ≤ B, i.e., in the modified process an individual below θ_2 is found. In other words, we seek to bound the probability that Z_B deviates heavily from its expectation 0. This observation, together with |Z_i − Z_{i−1}| ≤ 2ed, enables the use of the Azuma–Hoeffding inequality (Lemma 2), which yields

Pr[|Z_B − Z_0| > δen] ≤ 2e^{−(δen)²/(2B(2ed)²)} = e^{−Ω(n/d²)}.


To complete the proof, we bound the probability that there is a large potential change in one of the B generations and the modified process does not apply. Similar to above, we have

Pr[∃ 1 ≤ j ≤ B : |Δ_j| ≥ ed] ≤ B · λ · C(n, d) · (1/n)^d ≤ Bλ/d! = O(nλ/d!).

By choosing d := ∛n = Ω(log(λn)/log log(λn)), we obtain

Pr[T_{x→θ_2} ≥ B] ≤ Pr[∃ 1 ≤ j ≤ B : |Δ_j| ≥ ed] + Pr[|Z_B − Z_0| > δen]
≤ O(nλ/(∛n)!) + e^{−Ω(∛n)} = e^{−Ω(∛n)}. □

5.2. Phase 2

This subsection is dedicated to proving Lemma 10. In Phase 2, the probability of making progress towards the optimum is small. Hence, the parallel mutations can be regarded as repetitions of a Bernoulli trial with small success probability; the expected gain scales favorably with λ. The following lemma makes this claim formal and shows a multiplicative drift that is larger than corresponding bounds for the (1 + 1) EA by a factor of Ω(λ).

Lemma 15. Let x ∈ {0,1}^n \ {0^n} be such that Pr[f(x^(i)) < f(x)] ≤ c_2/λ with c_2 ≤ ln(3/e), and let n be sufficiently large. Then

E[Δ(x)] ≥ λg(x)/(e^{c_2+1}n).

Proof. Set p := Pr[f(x^(i)) < f(x)] ≤ c_2/λ and let A_ℓ denote the event that among all λ offspring, exactly ℓ offspring have a strictly smaller fitness value, i.e., |{1 ≤ i ≤ λ | f(x^(i)) < f(x)}| = ℓ. It holds that

E[Δ(x)] = ∑_{ℓ=0}^{λ} C(λ, ℓ) p^ℓ (1 − p)^{λ−ℓ} E[Δ(x) | A_ℓ].  (1)

We analyze the events A_0, A_1, and A_2 ∪ · · · ∪ A_λ separately. For A_0, note that if the best individual has the same fitness as its parent, the algorithm chooses the winner individual uniformly at random among all offspring with the same fitness. Hence,

E[Δ(x) | A_0] = Pr[f(x∗) = f(x) | A_0] · E[Δ(x) | f(x∗) = f(x)]
= Pr[f(x∗) = f(x) | A_0] · E[g(x) − g(x^(i)) | f(x^(i)) = f(x)],

i.e., we can ignore the fact that λ offspring instead of a single one are produced.

The case that there is more than one strictly improving offspring is quite unlikely. Consequently, for ℓ ≥ 2, it suffices to prove that E[Δ(x) | A_ℓ] ≥ 0. To see this, recall that E_m is the event that exactly m one-bits of x flip in the winner individual x∗. Observe that the events A_ℓ and A_ℓ ∩ E_m, for any m ≥ 1, are zero-flip closed. By the law of total probability, we have

E[Δ(x) | A_ℓ] = ∑_{m=0}^{n} Pr[E_m | A_ℓ] · E[Δ(x) | A_ℓ ∩ E_m],

with Pr[E_0 | A_ℓ] = 0, since a strict improvement requires at least one one-bit of x to flip in the winner individual. If A_ℓ ∩ E_1 occurs, call the flipping one-bit i∗. In this case, the drift equals

g_{i∗} − ∑_{j<i∗: x_j=0} Pr[x∗_j = 1 | A_ℓ ∩ E_1] · g_j ≥ g_{i∗} − ((i∗ − 1)/n) · g_{i∗} ≥ 0,

where we have applied Lemma 12 and the monotonicity of the drift potential function. Similarly, if A_ℓ ∩ E_m for m ≥ 2 occurs, the drift can be bounded by

2 − ∑_{j: x_j=0} Pr[x∗_j = 1 | A_ℓ ∩ E_m] · g_j ≥ 2 − (1/n) ∑_{j=1}^{n} g_j ≥ 0.

For the case ℓ = 1, observe that

Pr[A_1] · E[Δ(x) | A_1] = ∑_{i=1}^{λ} Pr[∀ j ≠ i : f(x^(j)) ≥ f(x)] · Pr[f(x^(i)) < f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) < f(x)].


This observation allows us to use the case λ = 1, i.e., the results already obtained for the (1 + 1) EA. Using all the previous observations, we can simplify (1) to

E[Δ(x)] ≥ Pr[f(x∗) = f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) = f(x)]
+ λ(1 − p)^{λ−1} · Pr[f(x^(i)) < f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) < f(x)].  (2)

This expression is a bound on the drift using only the expected progress of a single mutation, conditioned on either a strict improvement or acceptance of an offspring of equal fitness. Naturally, we aim at reusing the results from Witt's tight runtime analysis of the (1 + 1) EA on linear functions [36].

Lemma 16 (Drift of the (1 + 1) EA on linear functions). Let n ≥ 4 and 1 ≤ i ≤ λ. Then

Pr[f(x^(i)) ≤ f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) ≤ f(x)] ≥ g(x)/(en),  (3)
Pr[f(x^(i)) < f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) < f(x)] ≥ g(x)/(en).  (4)

Proof. Inequality (3) is part of the proof of Theorem 5.1 in [36], where it is shown that for the (1 + 1) EA, E[Δ(x)] ≥ g(x)/(en) for any x ∈ {0,1}^n. This translates trivially to creating a single offspring of x for the (1 + λ) EA.

It is not hard to see that this guarantee also holds if only strictly improving individuals are accepted, which proves (4). In particular, note that the proof of (3) only exploits a positive expected progress if exactly one one-bit flips and all other bits keep their value (Case 2.1 in [36]), which is a strict improvement. All other cases are shown to have a non-negative expected contribution, where Case 1 and Case 2.2.2 are trivially adapted and Case 2.2.1 cannot even occur when accepting only strictly improving offspring. □

Inequality (3) allows us to transform (2) to

E[Δ(x)] ≥ Pr[f(x∗) = f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) = f(x)] + λ(1 − c_2/λ)^{λ−1} · Pr[f(x^(i)) < f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) < f(x)]
≥ Pr[f(x∗) = f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) = f(x)] + (λ/e^{c_2}) · Pr[f(x^(i)) < f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) < f(x)].

We aim to prove E[Δ(x)] ≥ (λ/e^{c_2}) · g(x)/(en). If E[g(x) − g(x^(i)) | f(x) = f(x^(i))] ≥ 0, this follows immediately from the above inequality using (4). If, however, E[g(x) − g(x^(i)) | f(x) = f(x^(i))] < 0, note that if λ ≥ 3, we have e^{1+c_2} ≤ e^{1+ln(3/e)} = 3 ≤ λ, and consequently

Pr[f(x∗) = f(x)] ≤ 1 ≤ (λ/e^{c_2}) · (1/e) ≤ (λ/e^{c_2}) · Pr[f(x^(i)) = f(x)]

holds for sufficiently large n, since Pr[f(x) = f(x^(i))] ≥ (1 − 1/n)^n ≥ 1/e − o(1). This proves

E[Δ(x)] ≥ (λ/e^{c_2}) · Pr[f(x^(i)) = f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) = f(x)] + (λ/e^{c_2}) · Pr[f(x^(i)) < f(x)] · E[g(x) − g(x^(i)) | f(x^(i)) < f(x)]
≥ (λ/e^{c_2}) · g(x)/(en),

where the last inequality follows from (3).

The above argument only proves the result for λ ≥ 3. With a little more effort, we can manually extend the drift lower bound to λ = 2. Set p_> := Pr[f(x^(i)) > f(x)] and p_= := Pr[f(x^(i)) = f(x)]. We aim to show

Pr[f(x∗) = f(x)] = (p_> + p_=)^λ − p_>^λ ≤ (λ/e^{c_2}) · p_=,

which is equivalent, for λ = 2, to

2p_> + p_= ≤ 2/e^{c_2}.  (5)


Since any non-optimal search point flips exactly one of the remaining one-bits (and no other bit) with probability at least (1/n)(1 − 1/n)^{n−1} ≥ 1/(en), we have p_> + p_= ≤ 1 − 1/(en). Additionally, p_= ≥ (1 − 1/n)^n ≥ (1 − 1/n)/e, and thus p_> ≤ 1 − (1 − 1/n)/e. This yields 2p_> + p_= ≤ (1 − 1/(en)) + 1 − (1 − 1/n)/e = 2 − 1/e. Since c_2 ≤ ln(3/e), (5) follows immediately from 2/e^{c_2} ≥ 2e/3 ≥ 2 − 1/e. □
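The λ-fold speed-up in Phase 2 rests on the elementary fact that, for a small per-offspring success probability p, the probability that at least one of λ independent offspring succeeds is close to λp. A quick numerical check of the sandwich λp(1 − λp) ≤ 1 − (1 − p)^λ ≤ λp, with arbitrary sample values of p and λ:

```python
# Union bound (upper) and a Bonferroni-type bound (lower) for the
# probability that at least one of lam independent trials succeeds.
for p, lam in [(1e-3, 10), (1e-4, 50), (1e-5, 200)]:
    q = 1 - (1 - p) ** lam
    assert q <= lam * p                  # union bound
    assert q >= lam * p * (1 - lam * p)  # Bonferroni-type lower bound
```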

To prove Lemma 10, start with any individual x ∈ {0,1}^n below θ_2. Observe that we condition on |x|_1 ≤ θ_1 = γ_1 n/λ throughout the process; hence all encountered parent individuals satisfy

Pr[f(x^(i)) < f(x)] ≤ ∑_{j: x_j=1} Pr[x^(i)_j = 0] = |x|_1/n ≤ γ_1/λ.

Since conditioning on not returning above θ_1 cannot decrease the drift, the drift guarantee of the previous lemma applies, and we conclude, using the multiplicative drift theorem (Lemma 4) and the facts that g(x) ≥ 1 for all x ∈ {0,1}^n \ {0^n} and g(x) ≤ e|x|_1, that the expected hitting time of the optimum value, conditioned on not returning above θ_1, is bounded by

(e^{γ_1+1} n/λ) · (1 + ln(en/λ)) = O(n log(n)/λ).

By choosing r := s log n and a sufficiently large c, the multiplicative drift theorem also directly yields the corresponding tail bound.

6. Merging the phases

In Section 4, we showed how to connect the different drift behaviors in Phases 1 and 2 by an application of a Wald-type inequality. For this, a bound on the probability of returning to the previous phase was necessary; it will be proven in Section 6.2. In the next subsection, we give an alternative method employing the variable drift theorem, which is more specific to the details of the drift analysis but allows a stronger bound on the optimization time.

6.1. Application of the variable drift theorem

We introduce a potential threshold θ_g := γ_2 n/λ. If g(x) ≥ θ_g, we have |x|_1 ≥ γ_2 n/(eλ), since g_i ≤ e. Consequently, Pr[f(x^(i)) < f(x)] ≥ γ_2/(e²λ) and thus, by Lemma 13, there is some ε > 0 with E[Δ(x)] ≥ ε for all x with g(x) ≥ θ_g and sufficiently large n. Similarly, if g(x) ≤ θ_g, we have Pr[f(x^(i)) < f(x)] ≤ |x|_1/n ≤ γ_2/λ, since g_i ≥ 1. Hence, Lemma 15 yields E[Δ(x)] ≥ λg(x)/(e^{c_2+1}n) for sufficiently large n.

To define a monotone increasing lower bound on the drift, choose some 0 < ε′ ≤ 1/e^{c_2+1} such that ε′λθ_g/n ≤ ε. We define

h(s) := ε if s ≥ θ_g, and h(s) := ε′λs/n if s ≤ θ_g,

which is a monotone increasing lower bound on the drift. We can apply the variable drift theorem (Lemma 6) with s_min = 1 and obtain, for the hitting time T of the optimum value, the bound

E[T | X_0] ≤ n/ε′ + ∫_1^{θ_g} n/(ε′λs) ds + ∫_{θ_g}^{en} ds/ε
= n · ((e − γ_2/λ)/ε + 1/ε′) + (n/(ε′λ)) · ln(γ_2 n/λ) = O(n + n log n/λ).

This proves the first part of Theorem 8.
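The variable-drift calculation can be mirrored numerically. The constants below (γ_2, ε, ε′) are arbitrary placeholders, not the ones guaranteed by Lemmas 13 and 15; the point is only that the resulting bound scales as n + n log(n)/λ:

```python
import math

def variable_drift_bound(n, lam, gamma2=0.05, eps=0.01, eps_prime=0.01):
    """Evaluates n/eps' + int_1^{theta_g} n/(eps'*lam*s) ds
    + int_{theta_g}^{e*n} ds/eps, the variable-drift bound for a drift of
    eps above theta_g and eps'*lam*s/n below."""
    theta_g = gamma2 * n / lam
    assert theta_g >= 1, "need gamma2 * n / lam >= 1 for the log term"
    return (n / eps_prime
            + n / (eps_prime * lam) * math.log(theta_g)
            + (math.e * n - theta_g) / eps)

bound = variable_drift_bound(10**6, 10)
```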

6.2. Bounding the return probability

To prove Lemma 11, we will apply the negative drift theorem (Lemma 5), which shows that, with high probability, the process returns above θ_1 only after a super-polynomial number of steps. In this time, however, the process is much more likely to already hit the optimum value. Note that once the optimum value is found, the process will stay at this individual and consequently never return above θ_1.

To apply the negative drift theorem, we prove an exponential decay in the probability to get a high drift in one step.

Lemma 17. Let r := 6e ln λ for λ > 1, let x be an element of Phase 2, and let j ∈ N_0. Then

Pr[|Δ(x)| ≥ jr] ≤ e^{−j}.


Proof. Fix an individual 1 ≤ i ≤ λ. For 1 ≤ j′ ≤ n, define X^(i)_{j′} as an indicator random variable attaining value one if and only if the j′-th bit flips in x^(i). Union bounding and using the Chernoff bound of Lemma 1, we conclude

Pr[|Δ(x)| ≥ jr] ≤ Pr[∃ 1 ≤ i ≤ λ : ∑_{j′=1}^{n} X^(i)_{j′} ≥ ⌈jr/e⌉]
≤ ∑_{i=1}^{λ} Pr[∑_{j′=1}^{n} X^(i)_{j′} ≥ 6j ln λ]
≤ λ · 2^{−6j ln λ} ≤ 2^{−5j ln λ} ≤ e^{−j}. □

Let x ∈ {0,1}^n below θ_2 be given and let X_1 = g(x), X_2, . . . denote the potential of the parent individuals throughout the process. Define T_return := min{t ≥ 0 | X_t ≥ θ_1}; then we have the following.

Lemma 18. For λ = O(n^{1−ε}) for any ε > 0, it holds that Pr[T_return ≤ e^{Ω(n^s)}] ≤ e^{−Ω(n^s)} for some s > 0.

Proof. Since in the range |x|_1 ∈ [θ_2, θ_1] the drift satisfies E[Δ(x)] ≥ ε > 0 by Lemma 13, condition 1 of the negative drift theorem is satisfied. Lemma 17 proves the second condition for a scaling factor of r := 6e ln λ. Since the interval length ℓ := θ_1 − θ_2 = Θ(n/λ), this choice of r satisfies r ≤ ε²ℓ and r ≤ √(εℓ/(132 log(εℓ))) as long as λ = O(n^{1−ε}) and n is sufficiently large. We can hence apply Lemma 5 and obtain

Pr[T_return ≤ e^{Ω(n/(λ ln² λ))}] ≤ e^{−Ω(n/(λ ln² λ))}, where n/(λ ln² λ) = Ω(n^{ε/2}). □

Define T_optimize := min{t ≥ 0 | X_t = 0 ∧ X_j < θ_1 for all j < t}. To prove Lemma 11, note that at each point in time t there are only three possible events. In event E_opt, the optimum was found without returning above θ_1 before t. In event E_return, the process has already returned above θ_1 before t. In event E_none, the process has neither found the optimum value nor returned above θ_1, but still has the possibility to do so. We bound the probability to ever return by bounding the probability that, at t, one of the latter two events occurs. Choosing t = e^{Ω(n^s)}, the previous lemma, as well as Lemma 10 together with Markov's inequality, can be used to bound the probability of ever returning above θ_1 by

Pr[E_return] + Pr[E_none | ¬E_return] ≤ Pr[T_return ≤ t] + Pr[T_optimize ≥ t | ¬E_return]
≤ Pr[T_return ≤ e^{Ω(n^s)}] + E[T_optimize | ¬E_return]/e^{Ω(n^s)}
≤ e^{−Ω(n^s)} + O(n log n/λ) · e^{−Ω(n^s)} = e^{−Ω(n^s)}.

This concludes the proof of Lemma 11.

7. Lower bounds

In this section, we prove a lower bound showing that for all λ = O(n), the expected number of generations needed to optimize the linear function BinVal : {0,1}^n → R; x ↦ ∑_{i=1}^n x_i 2^{i−1} is at least Ω((1/λ) n log n + n). We will then strengthen this result to hold also with high probability.

7.1. Expected optimization time

Let us first state the result formally.

Theorem 19 (Lower bound for BinVal). For the (1 + λ) EA with λ = O(n), the expected number of iterations performed on BinVal to find the optimum is lower bounded by

Ω(n + n log n/λ).

This result has some important consequences. (i) It shows that the stronger upper bound claimed in [13] cannot be true. (ii) It shows that our general upper bound cannot be improved. (iii) It also shows that, surprisingly and very differently from what we know for the (1 + 1) EA, the optimization time of different linear functions varies significantly (recall that [21] proved that O(n log log log n/log log n) iterations suffice for the (1 + λ) EA to find the optimum of OneMax when λ equals the cut-off point log n log log n/log log log n). That different linear functions can have different optimization times was previously only observed for estimation of distribution algorithms in [9].


There is a not too difficult explanation for the different optimization behaviors of OneMax and BinVal. When λ is large, say equal to n, then a simple balls-into-bins argument [12] shows that with high probability there is an offspring with Θ(log n/log log n) bits flipped. If the parent has a OneMax value of n/2, with constant probability the number of bits flipping from one to zero is by Θ(√(log n/log log n)) larger than the number of bits flipping in the other direction. Consequently, we have an expected gain of Θ(√(log n/log log n)) in this situation (see [21] for arguments why also for smaller OneMax values a non-constant improvement is likely).

When optimizing BinVal instead of OneMax, we will of course also create offspring that are by $\Theta(\sqrt{\log n/\log\log n})$ closer to the optimum (in terms of the Hamming distance). However, since the winner individual is mostly decided by the most valuable flipping bit, such an individual has only a minimal advantage in becoming the winner individual. In fact, this intuitive argument appears to be similar to the reason why the compact genetic algorithm performs worse on BinVal than on OneMax [9]. Also there, a winner individual is selected from more than one individual, and progress may be lost since the most significant bit is crucial and supersedes improvements in the other bits.

We make the above argument precise in the following proof. We first show that throughout the optimization process there is at most a constant drift; we then apply the additive drift theorem.

Lemma 20. Consider optimizing $f = \textsc{BinVal}$ through the (1 + λ) EA with λ = O(n). Let $x \in \{0,1\}^n$ be arbitrary. Then the drift of x satisfies $E[\Delta(x)] = O(1)$.

Proof. We model the process of determining the winner individual among the λ offspring of x by successively revealing the bit at position $1 \le i \le n$, from the highest-weighted to the lowest-weighted bit, in all offspring individuals. Consider the set $W_i$ of potential winner individuals after the bits $n, n-1, \ldots, i$ have been sampled in all offspring individuals. For BinVal, once $x^{(i)}$ strictly dominates $x^{(j)}$ in the highest-weighted bits, $f(x^{(i)}) > f(x^{(j)})$ follows.

We count the expected number F of flipped one-bits of x in the winner individual $x^*$. As soon as the winner individual is uniquely determined, the probability that a one-bit of x flips in $x^*$ is 1/n. To analyze the process before the winner is determined, we compute the probability that the winner individual stays undetermined if the count of flipped one-bits of x in the winner is increased. This is the complementary event to having exactly one potential winner individual flip the bit, conditioned on at least one of them flipping it. More formally, for $i \in I$ and $k \ge 2$,

\[
\Pr\Bigl[|W_i| = 1 \Bigm| |W_{i+1}| = k \text{ and } \exists j \in W_{i+1}: x^{(j)}_i = 0\Bigr] = \frac{\frac{k}{n}\bigl(1-\frac{1}{n}\bigr)^{k-1}}{1-\bigl(1-\frac{1}{n}\bigr)^{k}} =: p_k.
\]
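The quantity $p_k$ is elementary to evaluate; the following check (purely illustrative, not part of the proof) confirms the monotonicity claim used next, namely that $p_k$ is decreasing in $k$:

```python
def p(k, n):
    # p_k: probability that exactly one of k potential winners flips a
    # fixed one-bit, conditioned on at least one of them flipping it
    q = 1.0 - 1.0 / n
    return (k / n) * q ** (k - 1) / (1.0 - q ** k)

n = 1000
vals = [p(k, n) for k in range(2, n + 1)]
decreasing = all(a >= b for a, b in zip(vals, vals[1:]))
```

For $k = \lambda = n$, the value settles near $e^{-1}/(1-e^{-1}) \approx 0.58$, bounded away from both 0 and 1.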

Note that $p_k$ is decreasing in k. Hence, we can pessimistically assume that $|W_i| = \cdots = |W_{n+1}| = \lambda$ until a winner is uniquely determined, i.e., until $|W_j| = 1$ for some j. Also, we disregard all zero-bits of x, since by revealing these bits, the number of potential winners can only decrease and the number of flipped one-bits is left unchanged. If all potential winner individuals leave a one-bit of x unchanged, then the number of flipped one-bits keeps its value as well. Consequently, the number of flipped one-bits until a unique winner is found can be upper bounded by a geometrically distributed random variable X with success probability $p_\lambda$. Once the winner is determined, we flip each of the remaining at most $n-1$ bits with probability 1/n. This yields

\begin{align*}
F(x) &\le E[X] + (n-1)\cdot\frac{1}{n} \le \frac{1-\bigl(1-\frac{1}{n}\bigr)^{\lambda}}{\frac{\lambda}{n}\bigl(1-\frac{1}{n}\bigr)^{\lambda-1}} + 1\\
&= \frac{n}{\lambda}\left(\Bigl(1+\frac{1}{n-1}\Bigr)^{\lambda-1} - \Bigl(1-\frac{1}{n}\Bigr)\right) + 1\\
&\le \frac{n}{\lambda}\Bigl(e^{\frac{\lambda-1}{n-1}} - 1\Bigr) + 1 + \frac{1}{\lambda} = O(1).
\end{align*}

The last line follows immediately for $\lambda = \Theta(n)$, and from the convergence $x(e^{1/x} - 1) \to 1$ as $x \to \infty$ if $\lambda = o(n)$. With $g_i \le e$, the bound $E[\Delta(x)] \le e\,F(x)$ establishes the claim. $\Box$
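The closed-form bound in the last display is easy to evaluate numerically. An illustrative check (values of n and λ are ours) that it stays below a fixed constant across the allowed range of λ:

```python
import math

def flip_bound(n, lam):
    # (n/lam) * (e^((lam-1)/(n-1)) - 1) + 1 + 1/lam, from the proof above
    return (n / lam) * (math.exp((lam - 1) / (n - 1)) - 1) + 1 + 1 / lam

bounds = [flip_bound(10**6, lam) for lam in (2, 10, 10**3, 10**6)]
```

All four values stay between roughly 2 and e + 1, matching the O(1) claim.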

Note that by the Chernoff bound of Lemma 1, the initial individual x satisfies $|x|_1 \ge n/4$ with probability at least $1 - e^{-n/12}$. Hence, for the number T of iterations the (1 + λ) EA requires to optimize BinVal, the additive drift theorem yields, since $E[\Delta(x)] \le \varepsilon$ for some constant ε,
\[
E[T] \ge \bigl(1 - e^{-n/12}\bigr)\frac{n}{4\varepsilon} = \Omega(n).
\]

To complete the proof of Theorem 19, we need to establish an additional bound of $\Omega(\frac{1}{\lambda} n\log n)$. This bound follows from Theorem 3 in [21] or from Lemma 22 below, which proves that the lower bound even holds with high probability.


7.2. Large-deviation bound

Theorem 21. Let $\lambda : \mathbb{N} \to \mathbb{N}$ be an arbitrary function with $\lambda = \lambda(n) = O(n)$. Then there is some constant $c > 0$ such that the (1 + λ) EA does not optimize BinVal in less than $c(\frac{1}{\lambda} n\log n + n)$ iterations with probability $1 - e^{-\Omega(\sqrt[3]{n})}$.

In the following, let λ = O(n). We prove the theorem again in two steps. The first bound is a simple $\Omega(\frac{1}{\lambda} n\log n)$ lower bound, which follows from a coupon collector argument.

Lemma 22. Let $f(x) = \sum_{i=1}^{n} w_i x_i$ with $0 < w_1 \le \cdots \le w_n$ be any linear function on bit strings with positive bit weights. Let $0 < c < 1$. Then the (1 + λ) EA does not optimize f in less than $c\frac{1}{\lambda}(n-1)\ln n$ generations with probability $1 - e^{-\Omega(n^{1-c})}$.

Proof. Let x be the starting individual of the (1 + λ) EA and let $I := \{i \mid x_i = 1\}$ denote its one-bits. Since x is chosen uniformly at random from $\{0,1\}^n$, the Chernoff bound of Lemma 1 yields that $|I| \ge n/4$ with probability at least $1 - e^{-n/12}$. Observe that if the (1 + λ) EA has optimized f after t generations, then each bit position $i \in I$ must have been flipped in at least one of the λ offspring of at least one of the t generations. Note that the occurrence of a bit flip at position i is independent of the actual parent individual. Hence, we can bound the probability that $i \in I$ has not been flipped after $t := c\frac{1}{\lambda}(n-1)\ln n$ iterations by

\[
\Bigl(1 - \frac{1}{n}\Bigr)^{\lambda t} = \Bigl(1 - \frac{1}{n}\Bigr)^{c(n-1)\ln n} \ge e^{-c\ln n} = \frac{1}{n^c}.
\]

Consequently, given that |I| ≥ n/4, the probability that f has been optimized already after t steps is at most

\[
\bigl(1 - n^{-c}\bigr)^{|I|} \le \Bigl(1 - \frac{n^{1-c}}{n}\Bigr)^{n/4} \le e^{-\frac{n^{1-c}}{4}}. \qquad \Box
\]
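The key inequality $(1-1/n)^{c(n-1)\ln n} \ge n^{-c}$ rests on the elementary fact $(n-1)\ln(1-1/n) \ge -1$. A quick numerical sanity check (illustrative; the grid of values is ours):

```python
import math

def survive_prob(n, c):
    # probability that a fixed bit position is never flipped in
    # lam * t = c * (n-1) * ln(n) independent trials with rate 1/n
    return (1.0 - 1.0 / n) ** (c * (n - 1) * math.log(n))

ok = all(
    survive_prob(n, c) >= n ** (-c)
    for n in (10**2, 10**4, 10**6)
    for c in (0.25, 0.5, 0.9)
)
```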

Analogously to the way in which the lower bound on the drift can be turned into large-deviation bounds (Lemma 14), we can derive from Lemma 20 the following lower bound that holds with high probability.

Lemma 23. There is some constant $c > 0$ such that the (1 + λ) EA does not optimize f in less than cn iterations with probability $1 - e^{-\Omega(\sqrt[3]{n})}$.

Proof. Let $P_1, P_2, P_3, \ldots$ denote the parent individuals encountered by the (1 + λ) EA and let $X_1 := g(P_1), X_2 := g(P_2), \ldots$ denote the potentials of these parent individuals. Since the initial individual $P_1$ is chosen uniformly at random from $\{0,1\}^n$, it has, by the Chernoff bound of Lemma 1, at least $\frac{n}{4}$ one-bits with probability $1 - e^{-n/12}$. Hence, to bound $T := \min\{t \mid X_t = 0\}$, we can assume that $g(P_1) \ge \frac{n}{4}$.

By Lemma 20, there is some ε with $E[\Delta(x)] \le \varepsilon$ for all $x \in \{0,1\}^n$. Symmetrically to Lemma 14, we analyze a process $\bar X_1, \bar X_2, \ldots$ which is defined exactly as in Lemma 14, with the difference that now only once $\bar X_{i-1} \le 0$, i.e., the optimum has already been found, we set $\bar X_i := \bar X_{i-1} - \varepsilon$ deterministically. Define $\Delta_1, \Delta_2, \ldots$ and $\bar\Delta_1, \bar\Delta_2, \ldots$ again as the potential changes throughout the optimization process. With a threshold value of $d = \Omega(\frac{\log \lambda n}{\log\log \lambda n})$, we again have

\[
\Pr\bigl[|\Delta_i| > ed \bigm| P_{i-1} = x\bigr] \le \lambda\binom{n}{d}\frac{1}{n^d} = o(1/n).
\]

By the law of total probability, we compute

\begin{align*}
E[\bar\Delta_i \mid P_{i-1} = x] &= E\bigl[\Delta_i \bigm| P_{i-1} = x \wedge |\Delta_i| \le ed\bigr]\\
&= \frac{E[\Delta_i \mid P_{i-1} = x] - \Pr\bigl[|\Delta_i| > ed \bigm| P_{i-1} = x\bigr]\, E\bigl[\Delta_i \bigm| P_{i-1} = x \wedge |\Delta_i| > ed\bigr]}{\Pr\bigl[|\Delta_i| \le ed \bigm| P_{i-1} = x\bigr]}\\
&\le \frac{\varepsilon}{1-o(1)} + \frac{\Pr\bigl[|\Delta_i| > ed \bigm| P_{i-1} = x\bigr]}{1 - \Pr\bigl[|\Delta_i| > ed \bigm| P_{i-1} = x\bigr]}\cdot en = \varepsilon + o(1).
\end{align*}

Consequently, $E[\bar\Delta_i] \le \bar\varepsilon$ for any constant $\bar\varepsilon > \varepsilon$ and sufficiently large n. Define the martingale $Z_0, Z_1, \ldots$ with respect to $\bar X_1, \ldots$ by $Z_i := \sum_{j=1}^{i}(\bar\Delta_j - E[\bar\Delta_j])$. We analyze the probability of not arriving at the optimum value within $B := (1-\delta)\frac{n}{4\bar\varepsilon}$ generations for some $0 < \delta < 1$. Since $Z_0 = 0$, the event $|Z_B - Z_0| \le \delta\frac{n}{4\bar\varepsilon}$ implies $\bar X_j > 0$ for all $1 \le j \le B$, i.e., the optimum value has not been found. The fact that $|Z_i - Z_{i-1}| \le 2ed$ enables the use of the Azuma–Hoeffding inequality (Lemma 2), which yields
\[
\Pr\Bigl[|Z_B - Z_0| > \delta\frac{n}{4\bar\varepsilon}\Bigr] \le 2e^{-\frac{n}{2}\bigl(\frac{\delta}{4ed\bar\varepsilon}\bigr)^2} = e^{-\Omega(n/d^2)}.
\]

To complete the proof, we bound the probability that there is a large potential change in one of the B generations, so that the modified process does not apply. Similarly to above, we have

\[
\Pr\bigl[\exists 1 \le j \le B : |\bar\Delta_j| \ge ed\bigr] \le B\lambda\binom{n}{d}\frac{1}{n^d} \le \frac{B\lambda}{d!} = O\Bigl(\frac{n^2}{2^d}\Bigr).
\]

By choosing $d := \sqrt[3]{n} = \Omega\bigl(\frac{\log \lambda n}{\log\log \lambda n}\bigr)$, we obtain

\begin{align*}
\Pr[T < B] &\le \Pr\Bigl[|P_1|_1 < \frac{n}{4}\Bigr] + \Pr\bigl[\exists 1 \le j \le B : |\bar\Delta_j| \ge ed\bigr] + \Pr\Bigl[|Z_B - Z_0| > \delta\frac{n}{4\bar\varepsilon}\Bigr]\\
&\le e^{-\Omega(n)} + O\bigl(n^2 2^{-\sqrt[3]{n}}\bigr) + e^{-\Omega(\sqrt[3]{n})} = e^{-\Omega(\sqrt[3]{n})}. \qquad \Box
\end{align*}

Since it is not the most interesting case, we only cursorily note that Theorem 21 also holds for any λ that is polynomial in n. The reason is that the $W_i$'s in Lemma 20 satisfy a multiplicative drift condition when $|W_i| \ge n$ and $x_i = 1$, namely $E[|W_{i-1}|] = O(|W_i|/n)$. Hence, if $\lambda = O(n^{\kappa})$, then with probability $1 - n^{-\Theta(\kappa)}$, $\Theta(\kappa)$ revealed bit positions suffice to bring the number of potential winner individuals down to less than n. Hence, for constant κ, the drift remains constant, and consequently, Lemma 20 still holds.

8. The special case of the ONEMAX function

In this section, we use the methods developed so far to analyze the runtime of the (1 + λ) EA for the classic test function $\textsc{OneMax}: \{0,1\}^n \to \mathbb{R};\ x \mapsto \sum_{i=1}^{n} x_i$. For this function, Jansen, De Jong, and Wegener [21] proved that a linear speed-up exists for all $\lambda = O(\log n \log\log n/\log\log\log n)$. Consequently, in this case $\Theta(\frac{1}{\lambda} n\log n)$ generations are needed. It was also proven that for larger values of λ a linear speed-up does not exist (this implies that $\omega(\frac{1}{\lambda} n\log n)$ generations are necessary), but no more precise statement on the runtime is known. We close this gap by showing the following result.

Theorem 24. Let $\varepsilon > 0$ and let $\lambda : \mathbb{N} \to \mathbb{N}$ be an arbitrary function with $\lambda(n) = O(n^{1-\varepsilon})$. Then the (1 + λ) EA with offspring population size $\lambda = \lambda(n)$ optimizes OneMax in an expected number of
\[
O\Bigl(\frac{n\log n}{\lambda(n)} + \frac{n\log\log\lambda(n)}{\log\lambda(n)}\Bigr)
\]
iterations.

The runtime bound for OneMax differs from the one for arbitrary linear functions by showing faster progress when the current search point is far away from the optimum. In particular, we prove the following stronger drift bound of $\Omega(\frac{\log\lambda}{\log\log\lambda})$ in the regime above $\frac{n}{\ln\lambda}$.

Lemma 25. Let $x \in \{0,1\}^n$ with $|x|_1 \ge \frac{n}{\ln\lambda}$. Then for the (1 + λ) EA on OneMax, we have
\[
E[\Delta(x)] \ge \Bigl(1 - \frac{1}{e}\Bigr)\frac{\ln(\lambda) - 1}{2\ln\ln\lambda}.
\]

Proof. Note that for OneMax, the fitness function equals the potential function, i.e., $g \equiv \textsc{OneMax}$, since all bit weights are one. Hence, negative progress in terms of the potential function is impossible, i.e., $\Delta(x) \ge 0$ holds always. Additionally, note that for an arbitrary x with $|x|_1$ above $\frac{n}{\ln\lambda}$, it suffices to flip at least t one-bits and no zero-bit to obtain a progress of t. Let $|x|_1 \ge \frac{n}{\ln\lambda}$ and set $t := \frac{\ln(\lambda)-1}{2\ln\ln\lambda}$. We compute

\begin{align*}
\Pr\bigl[f(x^{(i)}) \le f(x) - t\bigr] &\ge \binom{|I|}{t}\frac{1}{n^t}\Bigl(1-\frac{1}{n}\Bigr)^{n-t} \ge \Bigl(\frac{|I|}{t}\Bigr)^{t}\frac{1}{en^t} \ge \frac{1}{e\,t^t(\ln\lambda)^t}\\
&\ge e^{-(1+t(\ln t+\ln\ln\lambda))} \ge e^{-(1+2t\ln\ln\lambda)} \ge \frac{1}{\lambda}.
\end{align*}


We conclude that
\begin{align*}
E[\Delta(x)] &\ge \Pr\bigl[\exists 1 \le i \le \lambda : f(x^{(i)}) \le f(x) - t\bigr]\, t\\
&\ge \Bigl(1 - \Bigl(1 - \frac{1}{\lambda}\Bigr)^{\lambda}\Bigr) t \ge \Bigl(1 - \frac{1}{e}\Bigr)\frac{\ln(\lambda) - 1}{2\ln\ln\lambda}. \qquad \Box
\end{align*}
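For concreteness, the first inequality of this chain can be checked numerically; the values n = 10^6 and λ = 10^4 below are ours, and t is rounded to the nearest integer since it counts flipped bits:

```python
import math

n, lam = 10**6, 10**4
ones = n // math.ceil(math.log(lam))   # a conservative |I| with |I| <= n/ln(lam)
t = max(1, round(0.5 * (math.log(lam) - 1) / math.log(math.log(lam))))

# Pr[one offspring flips exactly t bits, all of them one-bits]:
# binom(|I|, t) * (1/n)^t * (1 - 1/n)^(n - t), a lower bound on the
# probability of a progress of at least t
p = math.comb(ones, t) / n**t * (1 - 1.0 / n) ** (n - t)
```

Even with the conservative choice of |I|, the resulting probability comfortably exceeds 1/λ.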

Besides the previous lemma, we have Lemma 13 ($E[\Delta(x)] \ge \alpha > 0$ if $|x|_1 = g(x) \ge \gamma_1 n/\lambda$) and Lemma 15 ($E[\Delta(x)] \ge \lambda g(x)/(e^{1+c_1} n)$ if $|x|_1 = g(x) \le \gamma_1 n/\lambda$) yielding drift lower bounds. We combine these bounds into a function h that is monotone increasing in $s := g(x)$ by setting
\[
h(s) :=
\begin{cases}
\max\bigl\{\bigl(1-\frac{1}{e}\bigr)\frac{\ln(\lambda)-1}{2\ln\ln\lambda},\, \alpha\bigr\} & \text{if } s \ge \frac{n}{\ln\lambda},\\[2pt]
\alpha & \text{if } \frac{\gamma_1 n}{\lambda} \le s \le \frac{n}{\ln\lambda},\\[2pt]
\min\bigl\{\frac{\lambda s}{e^2 n},\, \alpha\bigr\} & \text{if } 1 \le s \le \frac{\gamma_1 n}{\lambda}.
\end{cases}
\]

Hence, we may apply Lemma 6 to obtain, for the required number of iterations T,
\[
E[T \mid X_0] \le \max\Bigl\{\frac{e^{1+\gamma_2} n}{\lambda},\, \frac{1}{\alpha}\Bigr\} + \int_1^{X_0} \frac{1}{h(x)}\,dx,
\]

where

\begin{align*}
\int_1^{X_0} \frac{1}{h(x)}\,dx &\le \int_1^{\gamma_1 n/\lambda} \frac{e^{1+\gamma_1} n}{\lambda x}\,dx + \int_{\gamma_1 n/\lambda}^{n/\ln\lambda} \frac{1}{\alpha}\,dx + \int_{n/\ln\lambda}^{n} \frac{1}{1-1/e}\cdot\frac{2\ln\ln\lambda}{\ln(\lambda)-1}\,dx\\
&\le \frac{e^{1+\gamma_1} n}{\lambda}\ln\Bigl(\frac{n}{\lambda}\Bigr) + \frac{n}{\alpha\ln\lambda} + \frac{1}{1-1/e}\cdot\frac{2n\ln\ln\lambda}{\ln(\lambda)-1}.
\end{align*}

Hence,

\[
E[T \mid X_0] \le O\Bigl(\frac{n\log n}{\lambda} + \frac{n\log\log\lambda}{\log\lambda}\Bigr).
\]

To give a matching lower bound on the expected runtime of the (1 + λ) EA on OneMax, we prove that the bound on the drift in Lemma 25 is in fact tight.

Lemma 26. Let $x \in \{0,1\}^n$ and $\lambda = \omega(1)$. There is a constant c such that for the (1 + λ) EA on OneMax,
\[
E[\Delta(x)] \le c\,\frac{\log\lambda}{\log\log\lambda}
\]
holds for sufficiently large n.

Proof. Note that each of the λ offspring flips only one bit in expectation. To experience a potential change of t, there has to be an offspring whose number of bit flips deviates heavily from this expectation. Using the Chernoff bounds of Lemma 1, we can bound the probability that one of the λ offspring flips more than $t(c) := c\ln\lambda/\ln\ln\lambda$ bits by

\[
\lambda\,\frac{e^{t-1}}{t^{t}} = e^{-t\ln t + t + \ln\lambda - 1} = e^{-(c-1)(\ln\lambda - o(\ln\lambda))}.
\]

Since $\lambda = \omega(1)$, we have that, for sufficiently large n,
\[
\Pr[\Delta(x) \ge t(c)] \le 2^{-c+1}.
\]
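The tail bound $\lambda e^{t-1}/t^t$ decays geometrically in c. An illustrative evaluation at λ = 10^6 (our choice; for c = 1 the bound only takes effect asymptotically, so we check c in {2, 4, 8}):

```python
import math

def tail_bound(lam, c):
    # lam * e^(t-1) / t^t with t = t(c) = c * ln(lam) / ln(ln(lam))
    t = c * math.log(lam) / math.log(math.log(lam))
    return lam * math.exp(t - 1) / t**t

lam = 10**6
tails = [tail_bound(lam, c) for c in (2, 4, 8)]
```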

We can thus bound
\begin{align*}
E[\Delta(x)] &= \Pr[\Delta(x) \le t(1)]\, t(1) + \sum_{i=0}^{\infty} \Pr\bigl[t(2^{i}) \le \Delta(x) \le t(2^{i+1})\bigr]\, t(2^{i+1})\\
&\le t(1)\Bigl(1 + \sum_{i=0}^{\infty} \Pr\bigl[t(2^{i}) \le \Delta(x)\bigr]\, 2^{i+1}\Bigr)\\
&\le t(1)\Bigl(1 + \sum_{i=0}^{\infty} 2^{-2^{i}+1}\, 2^{i+1}\Bigr) \le O(t(1)) = O\Bigl(\frac{\log\lambda}{\log\log\lambda}\Bigr). \qquad \Box
\end{align*}

Fig. 1. Number of generations performed by the (1 + λ) EA on OneMax and BinVal for $\lambda = \lceil\log_2(n)\rceil$ (left, together with regression lines) and $\lambda = n$ (right).

Analogously to the lower bound of Theorem 19, an application of the additive drift theorem immediately yields the following result.

Theorem 27 (Lower Bound for OneMax). For the (1 + λ) EA with λ = O(n), the expected number of iterations performed on OneMax to find the optimum is lower bounded by
\[
\Omega\Bigl(\frac{n\log\log\lambda}{\log\lambda} + \frac{n\log n}{\lambda}\Bigr).
\]

9. Experiments

Asymptotic analyses in general, while explaining the growth behavior for very large problem sizes, could possibly hide phenomena in the range of practical problem sizes. To illustrate that the predicted results are reflected in practical observations even for small n, and to provide further evidence for our results, we conducted an empirical investigation. We implemented the (1 + λ) EA in C++, using the GNU Scientific Library to generate pseudo-random numbers.

The results in Fig. 1 depict the number of iterations performed to hit the optimum, averaged over 10,000 runs, of the (1 + λ) EA on OneMax and BinVal for $\lambda = \lceil\log_2(n)\rceil$ and $\lambda = n$, respectively. The empirical standard deviation of any depicted value does not exceed 140 and stays within 47 percent of the mean. The plot indicates that for $\lambda = \lceil\log_2(n)\rceil$, the expected logarithmic speed-up is obtained for both OneMax and BinVal; however, the slopes of the two curves differ. Observe that jumps in the function value occur near powers of two due to the rounding of the logarithm. The corresponding values for $\lambda = n$ give insight into the performance far above the cut-off point. BinVal, as predicted by Theorem 19, does not profit from the larger value of λ in terms of asymptotic growth. OneMax, however, shows a sublinear growth and deviates heavily from the optimization time of BinVal.
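The experiment is straightforward to reproduce in outline. The following reduced-scale Python sketch (ours; the paper used C++ with the GNU Scientific Library, and our problem size, seed, and run count are far smaller than the 10,000 runs above) compares the two functions at λ = n; mutation is implemented via geometric gaps, which is equivalent to flipping each bit independently with probability 1/n:

```python
import math
import random

def mutate(x, n, rng):
    # standard bit mutation via geometric gaps: each bit flips w.p. 1/n
    child = x[:]
    log_q = math.log(1.0 - 1.0 / n)
    i = -1
    while True:
        u = 1.0 - rng.random()              # u in (0, 1]
        i += 1 + int(math.log(u) / log_q)   # gap ~ Geometric(1/n)
        if i >= n:
            return child
        child[i] ^= 1

def run_ea(n, lam, f, rng, max_gens=10**5):
    # generations of the (1 + lam) EA until the all-ones optimum is hit
    x = [rng.randint(0, 1) for _ in range(n)]
    for gen in range(1, max_gens + 1):
        best = max((mutate(x, n, rng) for _ in range(lam)), key=f)
        if f(best) >= f(x):                 # elitist, ties to the offspring
            x = best
        if sum(x) == n:
            return gen
    return max_gens

onemax = sum
def binval(x):
    return sum(bit << i for i, bit in enumerate(x))

n = lam = 100
runs = 20
rng = random.Random(2014)
avg_om = sum(run_ea(n, lam, onemax, rng) for _ in range(runs)) / runs
avg_bv = sum(run_ea(n, lam, binval, rng) for _ in range(runs)) / runs
```

Already at this small scale, the averaged generation counts reflect the predicted gap between the two functions at λ = n.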

Acknowledgements

We thank the anonymous reviewers of a preliminary version of this work [8] for helpful remarks and Karl Bringmannfor pointing out the short proof of Lemma 7.

References

[1] B. Doerr, M. Fouz, C. Witt, Quasirandom evolutionary algorithms, in: GECCO ’10: Proceedings of the 12th Annual Genetic and Evolutionary Computation Conference, ACM, 2010, pp. 1457–1464.
[2] B. Doerr, L.A. Goldberg, Adaptive drift analysis, in: PPSN ’10: Proceedings of the 11th International Conference on Parallel Problem Solving from Nature, 2010, pp. 32–41.
[3] B. Doerr, L.A. Goldberg, Drift analysis with tail bounds, in: PPSN ’10: Proceedings of the 11th International Conference on Parallel Problem Solving from Nature, Springer, 2010, pp. 174–183.
[4] B. Doerr, L.A. Goldberg, Adaptive drift analysis, Algorithmica 65 (2013) 224–250.
[5] B. Doerr, D. Johannsen, C. Winzen, Drift analysis and linear functions revisited, in: CEC ’10: Proceedings of the 2010 IEEE Congress on Evolutionary Computation, IEEE, 2010, pp. 1967–1974.
[6] B. Doerr, D. Johannsen, C. Winzen, Multiplicative drift analysis, in: GECCO ’10: Proceedings of the 12th Annual Genetic and Evolutionary Computation Conference, ACM, 2010, pp. 1449–1456.
[7] B. Doerr, D. Johannsen, C. Winzen, Multiplicative drift analysis, Algorithmica 64 (2012) 673–697.
[8] B. Doerr, M. Künnemann, How the (1 + λ) evolutionary algorithm optimizes linear functions, in: GECCO ’13: Proceedings of the 15th Annual Genetic and Evolutionary Computation Conference, ACM, 2013, pp. 1589–1596.
[9] S. Droste, A rigorous analysis of the compact genetic algorithm for linear functions, Nat. Comput. 5 (2006) 257–283.
[10] S. Droste, T. Jansen, I. Wegener, A rigorous complexity analysis of the (1 + 1) evolutionary algorithm for linear functions with Boolean inputs, in: CEC ’98: Proceedings of the 1998 IEEE Congress on Evolutionary Computation, IEEE, 1998, pp. 499–504.
[11] S. Droste, T. Jansen, I. Wegener, On the analysis of the (1 + 1) evolutionary algorithm, Theoret. Comput. Sci. 276 (2002) 51–81.
[12] G.H. Gonnet, Expected length of the longest probe sequence in hash code searching, J. ACM 28 (1981) 289–304.
[13] J. He, A note on the first hitting time of (1 + λ) evolutionary algorithm for linear functions with Boolean inputs, in: CEC ’10: Proceedings of the 2010 IEEE Congress on Evolutionary Computation, IEEE, 2010, pp. 1–6.
[14] J. He, X. Yao, Drift analysis and average time complexity of evolutionary algorithms, Artificial Intelligence 127 (2001) 57–85.
[15] J. He, X. Yao, Erratum to: Drift analysis and average time complexity of evolutionary algorithms [Artificial Intelligence 127 (2001) 57–85], Artificial Intelligence 140 (2002) 245–248.
[16] J. He, X. Yao, A study of drift analysis for estimating computation time of evolutionary algorithms, Nat. Comput. 3 (2004) 21–35.
[17] J. Jägersküpper, Algorithmic analysis of a basic evolutionary algorithm for continuous optimization, Theoret. Comput. Sci. 379 (2007) 329–347.
[18] J. Jägersküpper, A blend of Markov-chain and drift analysis, in: PPSN ’08: Proceedings of the 10th International Conference on Parallel Problem Solving from Nature, Springer, 2008, pp. 41–51.
[19] J. Jägersküpper, Combining Markov-chain analysis and drift analysis: the (1 + 1) evolutionary algorithm on linear functions reloaded, Algorithmica 59 (2011) 409–424.
[20] J. Jägersküpper, T. Storch, When the plus strategy outperforms the comma strategy and when not, in: FOCI ’07: Proceedings of the IEEE Symposium on Foundations of Computational Intelligence, IEEE, 2007, pp. 25–32.
[21] T. Jansen, K.A. De Jong, I. Wegener, On the choice of the offspring population size in evolutionary algorithms, Evol. Comput. 13 (2005) 413–440.
[22] T. Jansen, P.S. Oliveto, C. Zarges, On the analysis of the immune-inspired B-cell algorithm for the vertex cover problem, in: ICARIS ’11: Proceedings of the 10th International Conference on Artificial Immune Systems, Springer, 2011, pp. 117–131.
[23] D. Johannsen, Random combinatorial structures and randomized search heuristics, PhD thesis, Universität des Saarlandes, 2010. Available online at http://scidok.sulb.uni-saarland.de/volltexte/2011/3529/pdf/Dissertation_3166_Joha_Dani_2010.pdf.
[24] B. Mitavskiy, J.E. Rowe, C. Cannings, Theoretical analysis of local search strategies to optimize network communication subject to preserving the total number of links, Int. J. Intell. Comput. Cybern. 2 (2009) 243–284.
[25] M. Mitzenmacher, E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, 2005.
[26] F. Neumann, I. Wegener, Randomized local search, evolutionary algorithms, and the minimum spanning tree problem, Theoret. Comput. Sci. 378 (2007) 32–40.
[27] F. Neumann, C. Witt, Bioinspired Computation in Combinatorial Optimization, Springer, 2010.
[28] P.S. Oliveto, J. He, X. Yao, Analysis of population-based evolutionary algorithms for the vertex cover problem, in: CEC ’08: Proceedings of the 2008 IEEE Congress on Evolutionary Computation, 2008, pp. 1563–1570.
[29] P.S. Oliveto, C. Witt, On the analysis of the simple genetic algorithm, in: GECCO ’12: Proceedings of the 14th Annual Genetic and Evolutionary Computation Conference, 2012, pp. 1341–1348.
[30] P.S. Oliveto, C. Witt, On the runtime analysis of the simple genetic algorithm, Theoret. Comput. Sci. (2013), in press, http://dx.doi.org/10.1016/j.tcs.2013.06.015.
[31] J.E. Rowe, D. Sudholt, The choice of the offspring population size in the (1, λ) EA, in: GECCO ’12: Proceedings of the 14th Annual Genetic and Evolutionary Computation Conference, ACM, 2012, pp. 1349–1356.
[32] T. Storch, On the choice of the parent population size, Evol. Comput. 16 (2008) 557–578.
[33] A. Wald, On cumulative sums of random variables, Ann. Math. Stat. 15 (1944) 283–296.
[34] C. Witt, Runtime analysis of the (μ + 1) EA on simple pseudo-Boolean functions, Evol. Comput. 14 (2006) 65–86.
[35] C. Witt, Optimizing linear functions with randomized search heuristics: the robustness of mutation, in: STACS ’12: Proceedings of the 29th Annual Symposium on Theoretical Aspects of Computer Science, 2012, pp. 420–431.
[36] C. Witt, Tight bounds on the optimization time of a randomized search heuristic on linear functions, Combin. Probab. Comput. 22 (2013) 294–318.