Lecture 7

Sample average approximation estimators

April 1, 2015

Uday V. Shanbhag

Introduction

Consider the following stochastic program:

\[
\min_{x \in X} f(x), \qquad f(x) := \mathbb{E}[f(x, \xi)],
\]

where X ⊆ Rn is a closed and convex set, ξ is a random vector taking values in Ξ ⊆ Rd with distribution P, and f : X × Ξ → R. Unless stated otherwise, the expectation is assumed to be well defined and finite valued for all x ∈ X.

Suppose we have access to N realizations of the random vector ξ, denoted by ξ1, . . . , ξN. Then an estimator of the expected value f(x) can be obtained from the sample average, leading to the following sample average approximation (SAA)


problem:

\[
\min_{x \in X} f_N(x), \qquad \text{where } f_N(x) := \frac{1}{N} \sum_{j=1}^{N} f(x, \xi^j).
\]

Note that fN(x) can be viewed as the expectation of f(x, ξ) taken with respect to the empirical measure that places mass 1/N on each of the realizations ξ1, . . . , ξN.

By the law of large numbers (LLN), under suitable regularity conditions, fN(x) converges to f(x) pointwise with probability one as N → ∞. Moreover, by the classical LLN, this convergence holds if the sample is independent and identically distributed (iid).

Furthermore, we have that E[fN(x)] = f(x); that is, the estimator fN(x) is unbiased.

It is also natural to expect that the optimal value and optimal solutions of the (SAA) problem converge to their true counterparts as N → ∞.

Note that we may view fN(x) as being defined on a common probability

space (Ω,F ,P). When considering iid samples, one avenue is to set


\[
\Omega := \Xi^{\infty}, \qquad \Xi^{\infty} := \left\{ (\xi^1, \xi^2, \ldots) : \xi^i \in \Xi, \; i \in \mathbb{N} \right\}.
\]
Furthermore, this set is equipped with the product of the corresponding probability measures.

Finally, suppose f(x, ξ) is a Caratheodory function (continuous in x and

measurable in ξ). It follows that fN(x) is also a Caratheodory function.
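To make the construction above concrete, the following sketch, which is not part of the lecture, builds fN for an assumed toy instance f(x, ξ) = ‖x − ξ‖² with ξ ∼ N(µ, I) and X = Rn_+, and solves the SAA problem numerically; the instance, the sampler, and names such as `f_N` and `mu` are illustrative assumptions only.

```python
# A minimal SAA sketch for min_{x in X} E[f(x, xi)] under the assumed toy instance
# f(x, xi) = ||x - xi||^2, xi ~ N(mu, I), X = R^n_+ (closed and convex).
# Here E||x - xi||^2 = ||x - mu||^2 + n, so the true minimizer is max(mu, 0).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, N = 3, 2000
mu = np.array([1.0, -0.5, 2.0])
xi = rng.normal(loc=mu, scale=1.0, size=(N, n))   # N iid realizations xi^1, ..., xi^N

def f(x, xi_j):
    """Integrand f(x, xi^j) = ||x - xi^j||^2 (continuous in x, measurable in xi)."""
    return float(np.sum((x - xi_j) ** 2))

def f_N(x):
    """SAA objective f_N(x) = (1/N) sum_j f(x, xi^j)."""
    return np.mean([f(x, xi_j) for xi_j in xi])

# Solve the SAA problem min_{x >= 0} f_N(x).
res = minimize(f_N, x0=np.zeros(n), bounds=[(0.0, None)] * n)
print("SAA solution x_N  :", np.round(res.x, 3))
print("true minimizer    :", np.maximum(mu, 0.0))
print("theta_N = f_N(x_N):", round(res.fun, 3))
```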


Consistency of SAA estimators

Proposition 1 Let f : X → R be a function and let {fN : X → R} be a sequence of deterministic real-valued functions. Then the following two properties are equivalent: (i) for any point x ∈ X and any sequence {xN} ⊂ X converging to x, fN(xN) converges to f(x); and (ii) the function f(•) is continuous on X and fN(•) converges to f(•) uniformly∗ on any compact subset of X.

Proof:

(a) ((i) ⇒ (ii)): Suppose that property (i) holds. Consider a point x ∈ X, a sequence {xN} ⊂ X such that xN → x, and a scalar ε > 0.

By considering the constant sequence x1, x1, . . ., we have that fN(x1) → f(x1), and hence there exists an N1 such that |fN1(x1) − f(x1)| < ε/2. Similarly,

∗Point-wise and uniform convergence: Let D be a subset of Rn and let fn be a sequence of functions defined on D. We say that fn converges pointwise on D if limn→∞ fn(x) exists for each point x ∈ D. Furthermore, fn converges uniformly to f if, given any ε > 0, there exists an N = N(ε) such that |fn(x) − f(x)| < ε for n > N(ε) and for every x ∈ D.


there exists an N2 > N1 such that |fN2(x2) − f(x2)| < ε/2, and so on. We may now construct a sequence {x′N} defined as follows:
\[
x'_i := x_1, \;\; i = 1, \ldots, N_1; \qquad x'_i := x_2, \;\; i = N_1 + 1, \ldots, N_2; \qquad \ldots
\]

It follows that x′N → x and therefore |fN(x′N) − f(x)| < ε/2 for N large enough. We also have that |fNk(x′Nk) − f(xk)| < ε/2, since x′Nk = xk, and hence we have that for k large enough
\[
|f(x_k) - f(x)| \le |f(x_k) - f_{N_k}(x'_{N_k})| + |f_{N_k}(x'_{N_k}) - f(x)| < \tfrac{1}{2}\varepsilon + \tfrac{1}{2}\varepsilon = \varepsilon.
\]
This shows that f(xk) → f(x) and hence f(•) is continuous at x.

Next, let C be a compact subset of X and proceed by contradiction.


Suppose that fN(•) does not converge to f(•) uniformly on C. It follows that there exist a sequence {xN} ⊂ C and an ε > 0 such that |fN(xN) − f(xN)| ≥ ε for all N large enough. Since C is a compact set, we can assume that xN converges to a point x ∈ C. It follows that
\[
|f_N(x_N) - f(x_N)| \le |f_N(x_N) - f(x)| + |f(x) - f(x_N)|.
\]
Of these, the first term tends to zero by hypothesis (i) and the second term tends to zero by the continuity of f(•). Since both terms are less than ε/2 for sufficiently large N, a contradiction follows.

(b) ((ii) ⇒ (i)): Suppose that property (ii) holds. Consider a sequence {xN} ⊂ X such that xN → x ∈ X. Assume that this sequence is contained in a compact subset of X and consider |fN(xN) − f(x)|. Then


we have the following:

\[
|f_N(x_N) - f(x)| \le |f_N(x_N) - f(x_N)| + |f(x_N) - f(x)|.
\]

Of these, the first term tends to zero given the uniform convergence of fN to f on the compact set, while the second term tends to zero by the continuity of f. It follows that fN(xN) converges to f(x).
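The interplay between continuity and uniform convergence in Proposition 1 can be seen in a standard counterexample, not taken from the lecture: fn(x) = x^n converges pointwise on [0, 1] but not uniformly, its pointwise limit is discontinuous at x = 1, and property (i) fails along xn = 1 − 1/n. A minimal sketch:

```python
# Standard counterexample: f_n(x) = x^n on [0, 1] converges pointwise to
# f(x) = 0 for x < 1 and f(1) = 1, but not uniformly. For x_n = 1 - 1/n -> 1,
# f_n(x_n) -> 1/e, which differs from f(1) = 1, so property (i) of Prop. 1 fails.
import numpy as np

def f_n(x, n):
    return x ** n

def f(x):
    return 1.0 if x == 1.0 else 0.0   # pointwise limit of f_n on [0, 1]

for n in [10, 100, 1_000, 10_000]:
    x_n = 1.0 - 1.0 / n
    print(f"n = {n:>5d}   f_n(x_n) = {f_n(x_n, n):.4f}")
print("limit of f_n(x_n) = 1/e =", round(1.0 / np.e, 4), "  but f(1) =", f(1.0))
```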

We now consider the consistency of the estimators θN and SN, where θN and SN denote the optimal value and the set of optimal solutions of the SAA problem, respectively, while θ∗ and S denote the optimal value and the set of optimal solutions of the true problem.

Definition 1 (Consistency of estimators) An estimator θN of a parameter θ is said to be consistent if θN → θ w.p.1. as N → ∞.

We begin by considering the consistency of the estimator θN of the optimal value θ∗. For a fixed x ∈ X, we have that
\[
\theta_N \le f_N(x).
\]


Furthermore, if the pointwise LLN holds, we have that

\[
\limsup_{N \to \infty} \theta_N \le \lim_{N \to \infty} f_N(x) = f(x) \quad \text{w.p.1.}
\]

Proposition 2 Suppose that fN(x) converges to f(x) w.p.1. as N →∞,

uniformly on X. Then θN → θ∗ w.p.1. as N →∞.

Proof: First, we note that since θN ≤ fN(x) for all x and N, we have that, almost surely,
\[
\limsup_{N \to \infty} \theta_N \le \lim_{N \to \infty} f_N(x^*) = f(x^*) = \theta^*,
\]
where x∗ denotes an optimal solution of the true problem and the first equality follows from the convergence of fN to f uniformly on X.


Furthermore, we may show that lim infN→∞ θN ≥ θ∗. Fix ε > 0. Uniform convergence of fN to f on X implies that, for all x ∈ X and for sufficiently large N, we have that fN(x) ≥ f(x) − ε. Taking the infimum over x ∈ X, it follows that for sufficiently large N, θN ≥ θ∗ − ε. Since ε is arbitrary, we have that lim infN→∞ θN ≥ θ∗ in an almost sure sense.

It follows that θN → θ∗ a.s. as N → ∞.
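As a numerical sanity check of Proposition 2 (an illustration, not part of the lecture), the sketch below uses an assumed toy instance f(x, ξ) = (x − ξ)² with ξ ∼ N(2, 1) and X = [0, 10], for which θ∗ = 1 and the SAA minimizer over X is the clipped sample mean.

```python
# Consistency check: theta_N -> theta* as N grows, for the assumed toy instance
# f(x, xi) = (x - xi)^2, X = [0, 10], xi ~ N(2, 1), so f(x) = (x - 2)^2 + 1, theta* = 1.
import numpy as np

rng = np.random.default_rng(1)
theta_star = 1.0

for N in [10, 100, 1_000, 10_000, 100_000]:
    xi = rng.normal(2.0, 1.0, size=N)          # iid sample xi^1, ..., xi^N
    x_N = np.clip(xi.mean(), 0.0, 10.0)        # argmin of f_N over X = [0, 10]
    theta_N = np.mean((x_N - xi) ** 2)         # optimal value of the SAA problem
    print(f"N = {N:>6d}   theta_N = {theta_N:.4f}   error = {abs(theta_N - theta_star):.4f}")
```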

Consistency of the estimators of the solution set requires somewhat stronger conditions. Here, D(A, B) denotes the deviation between sets A and B and is defined as
\[
\mathbb{D}(A, B) := \sup_{x \in A} \operatorname{dist}(x, B), \qquad \text{where } \operatorname{dist}(x, A) := \inf_{x' \in A} \|x - x'\|.
\]
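For finite point sets, the deviation can be computed directly; the helper below is an illustrative sketch (not from the lecture, with the name `deviation` assumed) that makes the asymmetry of D(A, B) concrete.

```python
# D(A, B) = sup_{x in A} dist(x, B) for finite point sets A, B in R^n.
# Note the asymmetry: D(A, B) = 0 whenever A is contained in B, but not conversely.
import numpy as np

def deviation(A, B):
    """Deviation D(A, B) for arrays A (m x n) and B (k x n) of points."""
    diffs = A[:, None, :] - B[None, :, :]                    # pairwise differences, shape (m, k, n)
    dist_to_B = np.linalg.norm(diffs, axis=2).min(axis=1)    # dist(x, B) for each x in A
    return dist_to_B.max()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(deviation(A, B))   # 0.0, since A is a subset of B
print(deviation(B, A))   # ~6.403, the distance from (5, 5) to A
```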

Theorem 3 Suppose that there exists a compact set C ⊂ Rn such that

the following hold: (i) the set of the optimal solutions of the true problem


is nonempty and contained in C; (ii) the function f(x) is finite valued and continuous on C; (iii) fN(x) converges to f(x) w.p.1. as N → ∞, uniformly in x ∈ C; and (iv) w.p.1., for N large enough, the set SN is

nonempty and SN ⊂ C. Then θN → θ∗ and D(SN , S) → 0 w.p.1. as

N →∞.

Proof: (i) and (ii) imply that both the true and the sample-average problems can be restricted to the set X ∩ C, implying that we can assume without loss of generality that X is compact.

Proposition 2 then implies that θN → θ∗ w.p.1. as N → ∞. It remains

to show that D(SN(ω), S) → 0 as N → ∞ for every ω ∈ Ω such that

θN → θ∗ and (ii) and (iii) hold.

We proceed by contradiction and drop the dependence on ω for notational convenience (though the result is proved for every such ω). Suppose D(SN, S) ↛ 0. Since X is compact, we may pass to a subsequence and assume


that there exists an xN ∈ SN such that dist(xN , S) ≥ ε for some ε > 0

and that xN → x∗ ∈ X. Consequently, x∗ ∉ S and f(x∗) > θ∗.

Further,
\[
f_N(x_N) - f(x^*) = [f_N(x_N) - f(x_N)] + [f(x_N) - f(x^*)].
\]
Of these, the first term tends to zero based on (iii), while the second term tends to zero by the continuity of f. As a result, θN = fN(xN) → f(x∗) > θ∗, which contradicts θN → θ∗. The required result follows.

• By Prop. 1, (ii) and (iii) are together equivalent to the condition that for any sequence {xN} ⊂ C converging to a point x, it follows that fN(xN) → f(x) w.p.1.

• (iv) holds in the above theorem if the feasible set X is closed, the functions fN(x) are lower semicontinuous, and, for some α > θ∗, the level sets {x ∈ X : fN(x) ≤ α}


are uniformly bounded w.p.1. This is often called an inf-compactness

condition.

• Conditions guaranteeing the uniform convergence of fN(x) to f(x) can be provided. For instance, it suffices that f(•, ξ) is continuous at every x ∈ X for almost every ξ ∈ Ξ, that f(x, ξ) is dominated by an integrable function for all x ∈ X, and that the sample is iid.

If the problem is convex, we can relax the required regularity conditions.

We may allow f(x, ξ) to be an extended real-valued function and define

the following:

\[
f(x, \xi) := f(x, \xi) + \mathbb{1}_X(x), \qquad f(x) := f(x) + \mathbb{1}_X(x), \qquad f_N(x) := f_N(x) + \mathbb{1}_X(x),
\]
where 𝟙X(x) denotes the indicator function of X, equal to 0 if x ∈ X and +∞ otherwise.

Theorem 4 Suppose that (i) the function f(x, ξ) is random lsc, (ii) for

almost every ξ ∈ Ξ, the function f(x, ξ) is convex in x, (iii) X is closed


and convex, (iv) the function f(x) is lsc and there exists a point x̄ ∈ X such that f(x) < +∞ for all x in a neighborhood of x̄, (v) the set S of optimal solutions of the true problem is nonempty and bounded, and (vi) the LLN holds pointwise. Then θN → θ∗ and D(SN, S) → 0 as N → ∞ w.p.1.

Some observations:

• lsc† of f(•) follows from the lsc of f(•, ξ), provided that f(x, •) is bounded from below by an integrable function.

• It was assumed that the LLN holds pointwise for all x ∈ Rn. It suffices

to assume that this holds for all x in a neighborhood of S.

†Recall that a function f(x) is lower semicontinuous at x0 if lim infx→x0 f(x) ≥ f(x0).


Randomness of feasible set X

Up to this point, we have assumed that the feasible set X of the SAA

problem is fixed and deterministic. Suppose we instead assume that the SAA feasible set XN is a random subset of Rn.

Theorem 5 Suppose that in addition to the assumptions of Theorem 3, the

following hold:

(a) If xN ∈ XN and xN → x w.p.1., then x ∈ X.

(b) For some point x ∈ S, there exists a sequence xN ∈ XN such that xN → x

w.p.1.

Then θN → θ∗ and D(SN , S)→ 0 w.p.1. as N →∞.


Proof: Consider an xN ∈ SN . By compactness, we may assume that the

sequence xN converges to an x∗ ∈ Rn. Since SN ⊂ XN , we have that

xN ∈ XN , and it follows from (a) that x∗ ∈ X.

First, note that θN = fN(xN) since xN ∈ SN. Furthermore, from Prop. 1, θN = fN(xN) tends w.p.1. to f(x∗). But f(x∗) ≥ θ∗ since x∗ ∈ X, and it follows that
\[
\liminf_{N \to \infty} \theta_N \ge \theta^* \quad \text{w.p.1.}
\]

On the other hand, we have from (b) that there exist a point x̄ ∈ S and a sequence x̄N ∈ XN converging to x̄ w.p.1. From Prop. 1, we have that fN(x̄N) → f(x̄) = θ∗ w.p.1. as N → ∞. But, since x̄N is


not necessarily a minimizer of fN over XN, we have that fN(x̄N) ≥ θN, and it follows that
\[
\theta_N \le f_N(\bar{x}_N) \to f(\bar{x}) = \theta^* \quad \text{w.p.1.},
\]
where the convergence follows from Prop. 1. Hence lim supN→∞ θN ≤ θ∗. As a result, θN → θ∗ w.p.1.

The remainder of the proof follows in a fashion analogous to Theorem 3.


Asymptotics of the SAA optimal value

Consistency of the estimators is important in that it allows one to claim that the estimation error tends to zero as the sample size grows to infinity. It does not, however, provide much indication of the error for a given sample.

Suppose the sample average estimator fN(x) of f(x) is unbiased and has variance σ²(x)/N, where σ²(x) := var(f(x, ξ)) is assumed to be finite. Then, by the central limit theorem (CLT), we have
\[
\sqrt{N}\left( f_N(x) - f(x) \right) \xrightarrow{D} Y_x,
\]
where Yx ∼ N(0, σ²(x)). In effect, for sufficiently large N, fN(x) has an approximately normal distribution with mean f(x) and variance σ²(x)/N.

This allows for constructing an approximate 100(1 − α)% confidence


interval for f(x) given by

\[
\left[ f_N(x) - \frac{z_{\alpha/2}\, \hat{\sigma}(x)}{\sqrt{N}}, \;\; f_N(x) + \frac{z_{\alpha/2}\, \hat{\sigma}(x)}{\sqrt{N}} \right],
\]
where z_{α/2} := Φ^{-1}(1 − α/2) and the sample variance is given by
\[
\hat{\sigma}^2(x) := \frac{1}{N-1} \sum_{j=1}^{N} \left[ f(x, \xi^j) - f_N(x) \right]^2.
\]
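The sketch below is an illustration of this construction; the instance f(x, ξ) = (x − ξ)² with ξ ∼ N(2, 1) and all variable names are assumptions, not part of the lecture.

```python
# Approximate 100(1 - alpha)% confidence interval for f(x) from N iid samples,
# under the assumed toy integrand f(x, xi) = (x - xi)^2 with xi ~ N(2, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x, N, alpha = 1.5, 5_000, 0.05
xi = rng.normal(2.0, 1.0, size=N)

samples = (x - xi) ** 2                     # f(x, xi^j), j = 1, ..., N
f_N_x = samples.mean()                      # SAA estimate f_N(x)
sigma_hat = samples.std(ddof=1)             # sample standard deviation (1/(N-1) normalization)
z = norm.ppf(1.0 - alpha / 2.0)             # z_{alpha/2} = Phi^{-1}(1 - alpha/2)
half_width = z * sigma_hat / np.sqrt(N)

print(f"f_N(x) = {f_N_x:.3f}, CI = [{f_N_x - half_width:.3f}, {f_N_x + half_width:.3f}]")
# For reference, the true value here is f(x) = (x - 2)^2 + 1 = 1.25.
```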

Consider now the optimal value θN of the SAA problem. For any x′ ∈ X, we have that
\[
f_N(x') \ge \inf_{x \in X} f_N(x).
\]


It follows that
\[
\mathbb{E}\left[ f_N(x') \right] \ge \mathbb{E}\left[ \inf_{x \in X} f_N(x) \right].
\]
Taking the infimum over x′ ∈ X, we obtain
\[
\inf_{x' \in X} \mathbb{E}\left[ f_N(x') \right] \ge \mathbb{E}\left[ \inf_{x \in X} f_N(x) \right].
\]
Since E[fN(x)] = f(x), we have that
\[
\theta^* = \inf_{x \in X} f(x) = \inf_{x \in X} \mathbb{E}\left[ f_N(x) \right] \ge \mathbb{E}\left[ \inf_{x \in X} f_N(x) \right] = \mathbb{E}\left[ \theta_N \right].
\]

As a consequence, it can be said that θN is a downward biased estimator of

θ∗. The next proposition shows that this bias decreases monotonically with

sample size N .


Proposition 6 Let θN be the optimal value of the SAA problem and suppose

the sample is iid. Then we have the following:

(a) θ∗ ≥ E[θN] (downward biased)

(b) E[θN] ≤ E[θN+1] ≤ θ∗ for any N ∈ N.

Proof:

(a) See the discussion preceding this proposition. (b) Recall that fN+1(x) can be


written as
\[
\begin{aligned}
f_{N+1}(x) &= \frac{1}{N+1} \sum_{j=1}^{N+1} f(x, \xi^j) \\
&= \frac{1}{N(N+1)} \left[ \left( \sum_{j=1}^{N+1} f(x, \xi^j) - f(x, \xi^1) \right) + \ldots + \left( \sum_{j=1}^{N+1} f(x, \xi^j) - f(x, \xi^{N+1}) \right) \right] \\
&= \frac{1}{N+1} \sum_{i=1}^{N+1} \left[ \frac{1}{N} \sum_{j \ne i} f(x, \xi^j) \right].
\end{aligned}
\]
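The leave-one-out identity above is easy to verify numerically; the brief sketch below is not part of the lecture and uses an arbitrary toy integrand to check that the average of the N + 1 leave-one-out sample means equals fN+1(x).

```python
# Check of the leave-one-out identity: the average of the N+1 "leave-one-out"
# sample means equals the full sample mean f_{N+1}(x).
import numpy as np

rng = np.random.default_rng(5)
N, x = 7, 1.5
xi = rng.normal(2.0, 1.0, size=N + 1)
vals = (x - xi) ** 2                            # f(x, xi^j), j = 1, ..., N+1

full_mean = vals.mean()                         # f_{N+1}(x)
loo_means = (vals.sum() - vals) / N             # (1/N) * sum_{j != i} f(x, xi^j), for each i
print(np.isclose(full_mean, loo_means.mean()))  # True
```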


By some elementary analysis, we see that
\[
\begin{aligned}
\mathbb{E}[\theta_{N+1}] &= \mathbb{E}\left[ \inf_{x \in X} f_{N+1}(x) \right]
= \mathbb{E}\left[ \inf_{x \in X} \frac{1}{N+1} \sum_{i=1}^{N+1} \frac{1}{N} \sum_{j \ne i} f(x, \xi^j) \right] \\
&\ge \mathbb{E}\left[ \frac{1}{N+1} \sum_{i=1}^{N+1} \inf_{x \in X} \frac{1}{N} \sum_{j \ne i} f(x, \xi^j) \right].
\end{aligned}
\]
Since the samples are iid, we have that
\[
\mathbb{E}\left[ \frac{1}{N+1} \sum_{i=1}^{N+1} \inf_{x \in X} \frac{1}{N} \sum_{j \ne i} f(x, \xi^j) \right]
= \frac{1}{N+1} \sum_{i=1}^{N+1} \mathbb{E}\left[ \inf_{x \in X} \frac{1}{N} \sum_{j \ne i} f(x, \xi^j) \right]
= \frac{1}{N+1} \sum_{i=1}^{N+1} \mathbb{E}[\theta_N] = \mathbb{E}[\theta_N].
\]


This completes the proof.
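A quick Monte Carlo experiment, shown below as an illustrative sketch on the same assumed toy instance as earlier (f(x, ξ) = (x − ξ)², X = [0, 10], θ∗ = 1), makes the downward bias and its monotone shrinkage with N visible; the replication count is an assumption.

```python
# Estimates of E[theta_N] lie below theta* and increase toward it with N
# (Proposition 6), approximated here by averaging over many replications.
import numpy as np

rng = np.random.default_rng(3)
theta_star, reps = 1.0, 20_000

for N in [2, 5, 10, 50, 200]:
    xi = rng.normal(2.0, 1.0, size=(reps, N))                # reps independent samples of size N
    x_N = np.clip(xi.mean(axis=1, keepdims=True), 0.0, 10.0)
    theta_N = np.mean((x_N - xi) ** 2, axis=1)               # SAA optimal value per replication
    print(f"N = {N:>3d}   estimated E[theta_N] = {theta_N.mean():.4f}   (theta* = {theta_star})")
```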


First-order asymptotics of the SAA optimal value

We begin with the following assumptions on f(x, ξ).

Assumption 1

(A1) For some point x̄ ∈ X, the expectation E[f(x̄, ξ)²] is finite.

(A2) There exists a measurable function C : Ξ → R+ such that E[C(ξ)²] is finite and
\[
|f(x, \xi) - f(x', \xi)| \le C(\xi)\, \|x - x'\|
\]
for all x, x′ ∈ X and a.e. ξ ∈ Ξ.

The above assumption allows one to claim that f(x) is Lipschitz continuous

with constant E[C(ξ)], using Jensen's inequality and the convexity


of the norm function:

\[
|f(x) - f(x')| = \left| \mathbb{E}[f(x, \xi) - f(x', \xi)] \right| \le \mathbb{E}\big[ |f(x, \xi) - f(x', \xi)| \big] \le \mathbb{E}[C(\xi)]\, \|x - x'\|.
\]

Consequently, if X is compact then the set of minimizers over X is

nonempty.

Let Yx denote the random variable defined earlier, now written as Y(x). Then, by the multivariate CLT, we have that for any finite set {x1, . . . , xm} ⊂ X, the random vector (Y(x1), . . . , Y(xm)) has a multivariate normal distribution with mean zero and a covariance matrix identical to that of (f(x1, ξ), . . . , f(xm, ξ)).


Then, by the functional central limit theorem, it follows from (A1), (A2), and the compactness of X that
\[
\sqrt{N}\left( f_N - f \right) \xrightarrow{D} Y,
\]
where Y is a random element of C(X).‡

‡C(X) represents the space of continuous functions on X, equipped with the sup-norm. A random element of C(X) is a map Y : Ω → C(X) from a probability space (Ω, F, P) into C(X) that is measurable with respect to the Borel sigma-algebra of C(X); Y(x) = Y(x, ω) can be viewed as a random function.

Theorem 7 Let θN be the optimal value of the SAA problem. Suppose that the sample is iid and assumptions (A1) and (A2) are satisfied. Then


the following holds§:

\[
\theta_N = \inf_{x \in S} f_N(x) + o_p(N^{-1/2}),
\]
\[
\sqrt{N}\left( \theta_N - \theta^* \right) \xrightarrow{D} \inf_{x \in S} Y(x),
\]
where Y(x) is the random function defined above, so that for each x, Y(x) has the distribution of Yx; recall that
\[
\sqrt{N}\left[ f_N(x) - f(x) \right] \xrightarrow{D} Y_x.
\]
Moreover, if S is a singleton {x̄}, then we have that
\[
\sqrt{N}\left( \theta_N - \theta^* \right) \xrightarrow{D} \mathcal{N}(0, \sigma^2(\bar{x})).
\]

§Recall that a sequence of random variables rN is said to be op(aN) if rN/aN converges to zero in probability as N → ∞.


Proof omitted.
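When S is a singleton, the sketch below, an illustration under the same assumed toy instance as before (where x̄ = 2, θ∗ = 1, and σ²(x̄) = var((2 − ξ)²) = 2), checks that √N(θN − θ∗) is approximately N(0, σ²(x̄)).

```python
# Simulated distribution of sqrt(N) * (theta_N - theta*) for the toy instance
# f(x, xi) = (x - xi)^2, xi ~ N(2, 1), X = [0, 10], where S = {2} and sigma^2(2) = 2.
import numpy as np

rng = np.random.default_rng(4)
theta_star, N, reps = 1.0, 1_000, 10_000

xi = rng.normal(2.0, 1.0, size=(reps, N))
x_N = np.clip(xi.mean(axis=1, keepdims=True), 0.0, 10.0)
theta_N = np.mean((x_N - xi) ** 2, axis=1)

z = np.sqrt(N) * (theta_N - theta_star)   # should be roughly N(0, 2) for large N
print("sample mean    :", round(z.mean(), 3), "(theory: 0)")
print("sample variance:", round(z.var(), 3), "(theory: 2)")
```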

Under mild conditions, it follows that
\[
\sqrt{N}\, \mathbb{E}\left[ \theta_N - \theta^* \right] \longrightarrow \mathbb{E}\left[ \inf_{x \in S} Y(x) \right].
\]
This implies the following:
\[
\underbrace{\mathbb{E}\left[ \theta_N - \theta^* \right]}_{\text{bias}} = N^{-1/2}\, \mathbb{E}\left[ \inf_{x \in S} Y(x) \right] + o(N^{-1/2}).
\]
If S = {x̄}, then E[Y(x̄)] = 0 and the bias is o(N^{-1/2}).
