Mathematical Programming and Research Methods (Part II)
4. Convexity and Optimization
Massimiliano Pontil
(based on previous lecture by Andreas Argyriou)
Today’s Plan
• Convex sets and functions
• Types of convex programs
• Algorithms
• Convex learning problems
Convexity
• Simple intuition, originating from simple geometric shapes (e.g. polygons)
[Figure: two convex shapes and one non-convex shape]
• Convexity plays a very important role in optimization
Convex Sets
Definition 1. A set C ⊆ IRd is called convex if
λx + (1 − λ)y ∈ C for all x, y ∈ C, λ ∈ [0, 1]
• I.e. if x and y are in the set C, then the whole line segment {λx + (1 − λ)y : λ ∈ [0, 1]} also lies in C
Convex Sets (contd.)
• We call λx + (1 − λ)y a convex combination of x and y whenever λ ∈ [0, 1]
• Generally, for any k ∈ IN, the sum

∑_{i=1}^k λi xi

is called a convex combination of the points x1, . . . , xk ∈ IRd whenever λ1, . . . , λk ≥ 0 and ∑_{i=1}^k λi = 1
Convex Sets (contd.)
• Clearly, if a set C is convex, all convex combinations of points in C (for all k ∈ IN) belong to C
• This set of all convex combinations is called the convex hull of C
• In general, given a set S ⊆ IRd (S need not be convex), the convex hull of S is the set

conv(S) := { ∑_{i=1}^k λi xi : xi ∈ S, λ1, . . . , λk ≥ 0, ∑_{i=1}^k λi = 1, k ∈ IN }

• The convex hull is the smallest convex set containing S
Convex Sets (contd.)
• S = conv(S) if and only if S is convex
Examples of Convex Sets
• Affine sets, i.e. sets of solutions of linear equations {x : Ax = b}
• Convex cones, i.e. sets containing any nonnegative combination ∑_{i=1}^k θi xi, θ1, . . . , θk ≥ 0, of their points
[Figure: a convex cone in IR3]
Examples of Convex Sets (contd.)
• Hyperplanes, i.e. sets of the form {x : a⊤x = b}, where a ∈ IRd, a ≠ 0, b ∈ IR (since they are special cases of affine sets)
• Halfspaces, i.e. sets of the form {x : a⊤x ≤ b}, where a ∈ IRd, a ≠ 0, b ∈ IR
Polyhedra
• A polyhedron is a set defined by a finite number of affine equalities and inequalities

P = {x : ai⊤x ≤ bi, i = 1, . . . , m, cj⊤x = dj, j = 1, . . . , p}

where a1, . . . , am, c1, . . . , cp ∈ IRd, b1, . . . , bm, d1, . . . , dp ∈ IR
• Polyhedra are convex sets
Polyhedra (contd.)
• A bounded polyhedron is called a polytope
• A set is a polytope if and only if it is the convex hull of a finite set of points
The Positive Semidefinite Cone
• We use the notations

X ⪰ 0, X ≻ 0

to denote that a d × d matrix X is positive semidefinite and positive definite, respectively
• The sets

Sd+ := {X ∈ IRd×d : X ⪰ 0} and Sd++ := {X ∈ IRd×d : X ≻ 0}

are called the positive semidefinite cone and positive definite cone, respectively
The Positive Semidefinite Cone (contd.)
• Sd+ and Sd++ are convex cones

Proof. For any A, B ∈ Sd+, θ1, θ2 ≥ 0, the matrix θ1A + θ2B is psd. ⊓⊔

• E.g. in IR2×2, the positive semidefinite cone consists of the matrices of the form

( x y )
( y z )

such that x, z ≥ 0, xz ≥ y^2
[Figure: the positive semidefinite cone for 2 × 2 matrices]
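The 2 × 2 condition above is easy to sanity-check numerically: the test x, z ≥ 0, xz ≥ y^2 should agree with checking that all eigenvalues of the symmetric matrix are nonnegative. A minimal sketch (numpy assumed; not part of the lecture):

```python
import numpy as np

def psd_2x2_by_formula(x, y, z):
    # condition from the slide: x, z >= 0 and xz >= y^2
    return x >= 0 and z >= 0 and x * z >= y * y

def psd_by_eigenvalues(X, tol=1e-12):
    # a symmetric matrix is psd iff all its eigenvalues are >= 0
    return bool(np.all(np.linalg.eigvalsh(X) >= -tol))

# a few psd and non-psd examples
for x, y, z in [(1.0, 0.5, 1.0), (1.0, 2.0, 1.0), (0.0, 0.0, 3.0), (-1.0, 0.0, 1.0)]:
    X = np.array([[x, y], [y, z]])
    assert psd_2x2_by_formula(x, y, z) == psd_by_eigenvalues(X)
```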
Norms
• A norm, denoted by ‖ · ‖, is a function from IRd to IR+ such that
1. ‖w‖ ≥ 0, for all w ∈ IRd
2. ‖w‖ = 0 if and only if w = 0
3. ‖aw‖ = |a|‖w‖, for all a ∈ IR, w ∈ IRd (homogeneity)
4. ‖w + z‖ ≤ ‖w‖ + ‖z‖, for all w, z ∈ IRd (triangle inequality)
• Important example: the Lp norm

‖w‖p := ( ∑_{i=1}^d |wi|^p )^{1/p}

where p ∈ [1, +∞)
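The Lp norm formula can be checked directly against numpy's built-in implementation; a small sketch (numpy assumed):

```python
import numpy as np

def lp_norm(w, p):
    # ||w||_p = (sum_i |w_i|^p)^(1/p), for p in [1, +inf)
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([3.0, -4.0, 1.0])
assert np.isclose(lp_norm(w, 2), np.linalg.norm(w, 2))   # Euclidean (L2) norm
assert np.isclose(lp_norm(w, 1), np.abs(w).sum())        # L1 norm
# as p grows, ||w||_p approaches the max-norm ||w||_inf
assert np.isclose(lp_norm(w, 100), np.max(np.abs(w)), atol=1e-1)
```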
Norms (contd.)
• We have already seen the L2 norm – it is the regularizer in ridge regression, SVM etc.

‖w‖2 = ( ∑_{i=1}^d wi^2 )^{1/2} = (w⊤w)^{1/2}

• The L1 norm

‖w‖1 = ∑_{i=1}^d |wi|

• Letting p → +∞, we get the L∞ norm

‖w‖∞ = max_{i=1,...,d} |wi|
Norm Balls
• The unit ball for a norm is the set {w : ‖w‖ ≤ 1}
[Figure: the L1, L2, and L∞ unit balls in IR2]
Norm Balls (contd.)
• In general, any norm ball of the form
{w : ‖w − c‖ ≤ r}
where c ∈ IRd and r ≥ 0 are the center and radius of the ball, respectively, is a convex set
Convex Functions
• A function f : IRd → IR is called convex if
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ IRd, λ ∈ [0, 1]
• Intuition: the line segment connecting any two points on the graph of f lies above the graph
[Figure: a convex and a non-convex function]
Convex Functions (contd.)
• Similarly, a function f is called concave if
f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y)
for all x, y ∈ IRd, λ ∈ [0, 1] or, equivalently, if −f is convex
Strict Convexity
• A function f : IRd → IR is called strictly convex if
f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)
for all x, y ∈ IRd, x ≠ y, λ ∈ (0, 1)
• I.e. the line segment connecting any two points on the graph of f lies strictly above the graph
• Equivalently, the graph of a strictly convex function does not contain any line segments
Strict Convexity (contd.)
[Figure: a convex but not strictly convex function, and a strictly convex function]
Jensen’s Inequality
• If f is convex, it follows by induction that
f( ∑_{i=1}^k λi xi ) ≤ ∑_{i=1}^k λi f(xi)

for all k ∈ IN, x1, . . . , xk ∈ IRd, λ1, . . . , λk ≥ 0, such that ∑_{i=1}^k λi = 1
• It can be generalised to integrals and expected values (it is used e.g. to derive the EM algorithm)
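Jensen's inequality can be verified numerically for a concrete convex function such as exp, using random convex combinations; a quick sketch (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp  # exp is convex on IR

for _ in range(100):
    k = 5
    x = rng.normal(size=k)       # points x_1, ..., x_k
    lam = rng.random(k)
    lam /= lam.sum()             # lam_i >= 0 and sum_i lam_i = 1
    # Jensen: f(sum_i lam_i x_i) <= sum_i lam_i f(x_i)
    assert f(lam @ x) <= lam @ f(x) + 1e-12
```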
Continuity / Differentiability
Theorem 1. If a function f is convex on IRd then it is also continuous
on IRd.
Proof. Not easy. ⊓⊔
• There are convex functions which are not differentiable everywhere (and others which are)
Second Order Condition
Theorem 2. Assume that a function f is twice differentiable on IRd. Then f is convex if and only if its Hessian is psd:

∇2f(w) ⪰ 0 for all w ∈ IRd

• Recall that the Hessian is the matrix formed by the second partial derivatives

∇2f(w) := ( ∂^2f / ∂wi∂wj (w) )_{i,j=1}^d

• Note: the condition ∇2f ≻ 0 implies strict convexity, but the converse is not true – see [Boyd & Vandenberghe]
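The second order condition suggests a simple numerical convexity check: estimate the Hessian by finite differences and test that its eigenvalues are nonnegative at random points. A sketch for the log-sum-exp function (numpy assumed; the finite-difference helper is illustrative, not from the lecture):

```python
import numpy as np

def numerical_hessian(f, w, eps=1e-4):
    # central finite differences for the matrix of second partial derivatives
    d = len(w)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return H

def log_sum_exp(w):
    return np.log(np.sum(np.exp(w)))

rng = np.random.default_rng(0)
for _ in range(10):
    w = rng.normal(size=3)
    H = numerical_hessian(log_sum_exp, w)
    # second-order condition: the Hessian should be psd at every point
    assert np.all(np.linalg.eigvalsh((H + H.T) / 2) >= -1e-6)
```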
Examples of Convex Functions
• Affine functions w ↦ a⊤w + b are both convex and concave
• Exponentials, powers, log-sum-exp
[Figures: graphs of w ∈ IR ↦ e^{aw} for any a ∈ IR, w ∈ IR ↦ |w|^p for any p ≥ 1, and w ∈ IRd ↦ log( ∑_{i=1}^d e^{wi} )]
Examples of Convex Functions (contd.)
• Psd. quadratic functions
f(w) = w⊤Aw + a⊤w + b for all w ∈ IRd
where A ∈ Sd+, a ∈ IRd, b ∈ IR
Examples of Convex Functions (contd.)
• Max function w ↦ max{w1, . . . , wd}
Norms Are Convex Functions
• Every norm is a convex function
Proof. Let w, z ∈ IRd and λ ∈ (0, 1). Then

‖λw + (1−λ)z‖ ≤ ‖λw‖ + ‖(1−λ)z‖ = λ‖w‖ + (1−λ)‖z‖

using the triangle inequality and then homogeneity. ⊓⊔
• No norm is strictly convex (to see this, select z = aw with a > 0, a ≠ 1)
• The square of every norm, w ↦ ‖w‖^2, is a convex function

Proof. Do it as an exercise. ⊓⊔
Operations that Preserve Convexity
Question: If f1, . . . , fq : IRd → IR are convex functions, which operations F can we apply so that f := F (f1, . . . , fq) is also convex?
• Nonnegative weighted sums
f = ∑_{i=1}^q θi fi

where θ1, . . . , θq ≥ 0

Proof. Easy from the definition. ⊓⊔
Operations that Preserve Convexity (contd.)
• Composition of a convex function with an affine map
f(x) = g(Ax + b) for all x ∈ IRd
where g : IRn → IR is convex, A ∈ IRn×d is a matrix and b ∈ IRn
Proof. Let x, y ∈ IRd, λ ∈ (0, 1). Then
f(λx + (1 − λ)y) = g(λAx + (1 − λ)Ay + b)
= g (λ(Ax + b) + (1 − λ)(Ay + b))
≤ λg(Ax + b) + (1 − λ)g(Ay + b) = λf(x) + (1 − λ)f(y)
⊓⊔
Operations that Preserve Convexity (contd.)
• Maximum of convex functions
f = max{f1, . . . , fq}
Proof (for q = 2, can be easily generalised to any q). Let x, y ∈ IRd,
λ ∈ (0, 1). Then
f(λx + (1 − λ)y) = max{f1(λx + (1 − λ)y), f2(λx + (1 − λ)y)}
≤ max{λf1(x) + (1 − λ)f1(y), λf2(x) + (1 − λ)f2(y)}
≤ max{λf1(x), λf2(x)} + max{(1 − λ)f1(y), (1 − λ)f2(y)}
= λf(x) + (1 − λ)f(y) ⊓⊔
• Extends also to infinite sets of convex functions
Proving Convexity
• Thus, to prove convexity of a function f , there are several approaches
– From the definition of convexity
– Compute the Hessian of f and show that it is psd.
– Decompose f as a nonnegative weighted sum of convex functions
– Decompose f as the composition of a convex and an affine function
– Decompose f as a maximum of convex functions
Examples
• Show that the function
f(w) = w⊤w = ‖w‖2
is strictly convex
Proof. The Hessian of f equals ∇2f(w) = 2Id, which is positive definite. ⊓⊔
Examples (contd.)
• Show that the quadratic function
f(w) = w⊤Aw + a⊤w + b
where A ∈ Sd+, a ∈ IRd, b ∈ IR, is convex
Proof. Write A = R⊤R for some matrix R ∈ IRd×d. Then w⊤Aw = (Rw)⊤(Rw), which is a composition of the convex function w ↦ w⊤w and a linear map. The term w ↦ a⊤w + b is affine and hence convex. Thus f is convex, as the sum of convex functions. ⊓⊔
Alternatively, we may compute the Hessian, which equals 2A, which is psd.
Proving Strict Convexity
• To prove strict convexity of a function f
– Use the definition (with strict inequality, x ≠ y and λ ∈ (0, 1))
– Compute the Hessian of f and show that it is positive definite
– Decompose f as a sum of a convex and a strictly convex function
(easy to prove this property)
Note: When does the convex-affine composition operation apply?
• Example: the quadratic function f(w) = w⊤Aw + a⊤w + b is strictly convex if and only if A ≻ 0 (since the Hessian equals 2A)
Convex Optimization
• The problem
min_{w∈IRd} f(w)
subject to fi(w) ≤ 0, i = 1, . . . , M        (1)
           aj⊤w = bj, j = 1, . . . , P

where f, f1, . . . , fM are convex functions, is called a convex program or convex optimization problem
• The function f whose value we wish to minimise is called the objective function
Remarks
• The set of points w satisfying the constraints fi(w) ≤ 0, aj⊤w = bj is called the feasible set
• The feasible set is convex: for any feasible points x, y, fi(λx + (1 − λ)y) ≤ λfi(x) + (1 − λ)fi(y) ≤ 0 and aj⊤(λx + (1 − λ)y) = λbj + (1 − λ)bj = bj
• In general, if we minimize a convex objective function over a convex set, the problem can be rewritten in form (1) (in principle at least; sometimes not practically possible)
• Many problems of interest can be rewritten in the form (1); they do not necessarily appear in that form however
Remarks (contd.)
• Minimum (1) does not always exist! (could be an infimum or could be −∞)
[Figure: a convex function whose minimum is not attained]
• The set of solutions (minimisers) of problem (1) is convex (easy to show)
• In particular, if the function f is strictly convex, then there is a unique minimiser (if any exists)
Remarks (contd.)
• There are no local minima outside the set of minimisers
• This is important because it implies that algorithms will not get stuck away from the solution(s)
• Thus, the great appeal of convex programs is that they can be solved! (many of them in polynomial time)
Examples
min_{w∈IRd} w⊤Aw + a⊤w
subject to w⊤Bw + b⊤w + c ≤ 0
           d⊤w = e

where A, B ⪰ 0

min_{w∈IRd} a⊤w
subject to b1⊤w ≤ c1
           b2⊤w ≤ c2
           d1⊤w = e1
Regularization
min_{w∈IRd} ∑_{i=1}^m E( w⊤xi , yi ) + γ ‖w‖^2        (R)

• Assume that E(·, y) is a convex function for every y ∈ IR
• Then, problem (R) is a convex program; this program is unconstrained
• Indeed, the objective function is convex, as a sum of convex functions: E(w⊤xi , yi) is convex as a convex-affine composition and ‖w‖^2 is also convex, as we have already seen
• It can be shown that the minimum exists (under mild assumptions on E)
Regularization (contd.)
• Example 1: ridge regression
min_{w∈IRd} ∑_{i=1}^m (yi − w⊤xi)^2 + γ ‖w‖^2

• Convex program, since the function z ↦ (z − y)^2 is convex for every y ∈ IR
Regularization (contd.)
• Example 2: we have seen (in Lecture 1) that SVM is equivalent to the regularization problem

min_{w∈IRd} ∑_{i=1}^m max{1 − yi(w⊤xi) , 0} + γ ‖w‖^2

with γ = 1/(2C)
• This is a convex program, since the function z ↦ max{1 − yz , 0} (the hinge loss) is convex for every y ∈ {−1, 1}; indeed, it is a maximum of convex (in particular, affine) functions
Regularization (contd.)
• How about the SVM primal

min_{w∈IRd, ξ∈IRm} (1/2)‖w‖^2 + C ∑_{i=1}^m ξi
subject to yi(w⊤xi) ≥ 1 − ξi, ξi ≥ 0, for i = 1, . . . , m

• It is also easy to see that this is a convex program, but now the variables include the wi and the ξi
• The objective function is convex (quadratic in w, linear in ξ); the functions 1 − ξi − yi(w⊤xi) and −ξi in the inequality constraints are also convex
Regularization (contd.)
• Similarly, the SVM and ridge regression dual problems can be seen to be convex problems
• In general, the dual of regularization

min_{c∈IRm} ∑_{i=1}^m E( c⊤gi , yi ) + γ c⊤Gc        (C)

is a convex problem (assuming as before that the loss function is convex)
• Indeed, the quadratic form c⊤Gc is a convex function of c since the Gram matrix G is positive semidefinite
Regularization (contd.)
min_{w∈IRd} ∑_{i=1}^m E( w⊤xi , yi ) + γ ‖w‖^2        (R)

min_{c∈IRm} ∑_{i=1}^m E( c⊤gi , yi ) + γ c⊤Gc        (C)

• Problem (R) has a unique solution; indeed, the term ‖w‖^2 is strictly convex, hence the objective function is also strictly convex
• However, problem (C) has a unique solution only if G ≻ 0, i.e. only if the feature vectors φ(xi) are linearly independent; otherwise, there are infinitely many optimal c, but the corresponding w = ∑_{i=1}^m ci φ(xi) is unique
Convex Programs with Linear Equality Constraints
• The following special type of convex program can be solved using Lagrange multipliers

min_{w∈IRd} f(w)
subject to aj⊤w = bj, j = 1, . . . , P        (2)

where f is a convex and differentiable function
• Set the gradient of the Lagrangian to zero: ∇f(w) = ∑_{j=1}^P cj aj, for some cj ∈ IR; together with the equality constraints, the solutions of this condition are exactly the minimisers of (2) (by a theorem); a dual problem can also be obtained
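For a quadratic objective, the stationarity condition ∇f(w) = ∑_j cj aj together with the constraints forms a linear (KKT) system that can be solved directly. A sketch for min (1/2)‖w‖^2 subject to Aw = b (numpy assumed; the particular A and b are made up for illustration):

```python
import numpy as np

# minimise f(w) = 0.5 * ||w||^2  subject to  A w = b
# stationarity: grad f(w) = w = A^T c for some multipliers c, i.e. w - A^T c = 0
# KKT system:  [ I   -A^T ] [w]   [0]
#              [ A    0   ] [c] = [b]
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
d, P = A.shape[1], A.shape[0]

KKT = np.block([[np.eye(d), -A.T],
                [A, np.zeros((P, P))]])
sol = np.linalg.solve(KKT, np.concatenate([np.zeros(d), b]))
w = sol[:d]

assert np.allclose(A @ w, b)                  # feasibility
# for this objective, w is the minimum-norm solution of Aw = b
assert np.allclose(w, np.linalg.pinv(A) @ b)
```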
Important Convex Optimization Problems
• Linear Programming
• Quadratic Programming
• Semidefinite Programming
• Dedicated off-the-shelf algorithms exist for each of the above categories
• In machine learning, algorithms have been developed for special subtypes of such problems
Linear Programming (LP)
min_{w∈IRd} c⊤w
subject to di⊤w ≤ ei, i = 1, . . . , M        (3)
           aj⊤w = bj, j = 1, . . . , P

• The feasible set is a polyhedron (bounded or not)
• Problem (3) may have one, none, or infinitely many solutions
• Interesting fact: the dual problem is also a linear program
Linear Programming (contd.)
• It can be shown that the solution (if unique) will be one of the vertices
• The simplex algorithm is one of the oldest optimization algorithms (Dantzig in the 40s)
Linear Programming (contd.)
• Intuition of simplex: find a vertex to start from; from each vertex, move to a neighbour so that the objective function decreases; terminate if there is no such neighbour
• Time complexity: very good in almost all cases, but very bad (exponential) in the worst case
• In practice, very fast for typical problems and can be applied to large data sets
• Methods developed in the 80s (interior-point methods) have been applied to linear programming and are of polynomial-time complexity
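As a concrete illustration, off-the-shelf LP solvers are readily available; a sketch using scipy's linprog, whose default HiGHS backend implements simplex and interior-point methods (scipy assumed; the particular LP is made up, and its optimum sits at the vertex (0, 4)):

```python
import numpy as np
from scipy.optimize import linprog

# minimise c^T w subject to D w <= e and w >= 0
c = np.array([-1.0, -2.0])           # objective: maximise w1 + 2 w2
D = np.array([[1.0, 1.0],            # w1 + w2 <= 4
              [1.0, 0.0]])           # w1      <= 2
e = np.array([4.0, 2.0])

res = linprog(c, A_ub=D, b_ub=e, bounds=[(0, None), (0, None)], method="highs")
assert res.success
# the solution sits at a vertex of the feasible polytope
assert np.allclose(res.x, [0.0, 4.0])
```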
Quadratic Programming (QP)
min_{w∈IRd} w⊤Aw + c⊤w
subject to di⊤w ≤ ei, i = 1, . . . , M        (4)
           aj⊤w = bj, j = 1, . . . , P

where A ⪰ 0
• If A ≻ 0, the solution (if any) is unique (due to strict convexity)
• The dual is also a quadratic program
Quadratic Programming (contd.)
• The idea behind simplex does not apply here; in fact, the minimiser could be anywhere in the feasible set (on the boundary or in the interior)
• The difficulty in solving QP is due to the fact that the solution may lie on the boundary of the feasible set
Interior-Point Methods
• Idea: change the objective function by adding to it a barrier function
• The barrier depends on the constraints and is parameterised by a parameter t
• Unconstrained minimisation of the barrier function gives a solution in the interior of the feasible set
• Changing t appropriately, the algorithm converges to the solution of (4) in polynomial time
• These methods can handle problems of reasonably large size
Ridge Regression as QP
min_{w∈IRd} ∑_{i=1}^m (yi − w⊤xi)^2 + γ ‖w‖^2

• Ridge regression is an unconstrained QP
• Just need to solve a linear system using standard methods
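The linear system in question comes from setting the gradient of the ridge objective to zero, which gives (X⊤X + γI)w = X⊤y. A sketch on synthetic data (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, gamma = 50, 5, 0.1
X = rng.normal(size=(m, d))          # rows are the inputs x_i
y = rng.normal(size=m)

# setting the gradient to zero gives (X^T X + gamma I) w = X^T y
w = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

# sanity check: the gradient of the objective vanishes at w
grad = 2 * X.T @ (X @ w - y) + 2 * gamma * w
assert np.allclose(grad, 0)
```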
SVM as QP
Primal:

min_{w∈IRd, ξ∈IRm} (1/2)‖w‖^2 + C ∑_{i=1}^m ξi
s. t. yi(w⊤xi) ≥ 1 − ξi, ξi ≥ 0, for i = 1, . . . , m

Dual:

max_{α∈IRm} −(1/2) α⊤Aα + ∑_{i=1}^m αi
s. t. 0 ≤ αi ≤ C, for i = 1, . . . , m

• SVM is a QP with inequality constraints
• The SVM dual is a QP with “box” constraints
Algorithms for SVM
• One approach to solve SVMs is with interior-point methods
• For large datasets (say m > 10^3) it is practically impossible to solve the dual problem with such methods (matrix A is dense!)
• A typical approach is to iteratively optimize wrt. an ‘active set’ A of dual variables, fixing the rest. Set α = 0, choose q ≤ m and an active set A of q variables. We repeat until convergence the steps
  – Solve the problem wrt. the variables in A
  – Remove one variable from A which satisfies the KKT conditions and add one variable, if any, which violates the KKT conditions. If no such variable exists, stop
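For intuition only, the box-constrained dual above can also be attacked with a much simpler (and slower) method than active-set schemes: projected gradient ascent, clipping each iterate back into the box [0, C]. A sketch on made-up synthetic data (numpy assumed; this is not the decomposition method described on this slide):

```python
import numpy as np

rng = np.random.default_rng(0)
m, C = 40, 1.0
# two roughly separable classes in IR2
X = np.vstack([rng.normal(-1, 1, size=(m // 2, 2)),
               rng.normal(+1, 1, size=(m // 2, 2))])
y = np.concatenate([-np.ones(m // 2), np.ones(m // 2)])

A = (y[:, None] * X) @ (y[:, None] * X).T        # A_ij = y_i y_j x_i^T x_j

def dual_objective(alpha):
    return -0.5 * alpha @ A @ alpha + alpha.sum()

alpha = np.zeros(m)
eta = 1.0 / np.linalg.eigvalsh(A).max()          # step size from the curvature of A
for _ in range(2000):
    # gradient of the dual is 1 - A alpha; project back onto the box [0, C]
    alpha = np.clip(alpha + eta * (1.0 - A @ alpha), 0.0, C)

assert np.all((alpha >= 0) & (alpha <= C))       # box constraints hold
assert dual_objective(alpha) > dual_objective(np.zeros(m))   # dual value improved
```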
QCQP
min_{w∈IRd} w⊤Aw + c⊤w
subject to w⊤Bw + di⊤w ≤ ei, i = 1, . . . , M
           aj⊤w = bj, j = 1, . . . , P

where both A, B are psd.
• It is called a quadratically constrained quadratic program (QCQP)
• Larger family; contains the family of QP
• The dual problem is not a QCQP, in general
QCQP (contd.)
• The feasible set is the intersection of ellipsoids and/or a polyhedron
• It is faster to solve a QP with a dedicated method than to use a QCQP solver
SDP
min_{w∈IRd} c⊤w
subject to w1F1 + · · · + wnFn + G ⪯ 0
           aj⊤w = bj, j = 1, . . . , P

• There is a linear matrix inequality (LMI) constraint
• Multiple LMIs reduce to an equivalent problem with just one LMI
• The dual problem of an SDP is also an SDP
• LP ⊆ QP ⊆ QCQP ⊆ · · · ⊆ SDP (LPs, QPs, QCQPs can be rewritten as SDPs)
Bibliography
Lectures available at:
http://www.cs.ucl.ac.uk/staff/M.Pontil/courses/index-ATML10.htm
See also Boyd and Vandenberghe, Convex Optimization, 2004,
http://www.stanford.edu/boyd/cvxbook/
Secs. 2.1.4-2.2.5, 3.1.1, 3.1.5, 3.1.8, 3.2.1-3.2.3, 4.1.1, 4.2.1, 4.2.2, 4.3, 4.4, 4.6.2