Mathematical Programming and Research Methods (Part II)
4. Convexity and Optimization
Massimiliano Pontil
(based on previous lecture by Andreas Argyriou)
Today’s Plan
• Convex sets and functions
• Types of convex programs
• Algorithms
• Convex learning problems
Convexity
• Simple intuition, originating from simple geometric shapes (e.g. polygons)
[Figure: two convex shapes and one non-convex shape]
• Convexity plays a very important role in optimization
Convex Sets
Definition 1. A set C ⊆ IRd is called convex if
λx + (1 − λ)y ∈ C for all x, y ∈ C, λ ∈ [0, 1]
• I.e. if x and y are in the set C, then the whole line segment {λx + (1 − λ)y : λ ∈ [0, 1]} also lies in C
Convex Sets (contd.)
• We call λx + (1 − λ)y a convex combination of x and y whenever λ ∈ [0, 1]
• Generally, for any k ∈ IN, the sum

∑_{i=1}^k λi xi

is called a convex combination of the points x1, . . . , xk ∈ IRd whenever λ1, . . . , λk ≥ 0 and ∑_{i=1}^k λi = 1
Convex Sets (contd.)
• Clearly, if a set C is convex, all convex combinations of points in C (for all k ∈ IN) belong to C
• This set of all convex combinations is called the convex hull of C
• In general, given a set S ⊆ IRd (S need not be convex), the convex hull of S is the set

conv(S) := { ∑_{i=1}^k λi xi : xi ∈ S, λ1, . . . , λk ≥ 0, ∑_{i=1}^k λi = 1, k ∈ IN }

• The convex hull is the smallest convex set containing S
Convex Sets (contd.)
• S = conv(S) if and only if S is convex
Examples of Convex Sets
• Affine sets, i.e. sets of solutions of linear equations {x : Ax = b}
• Convex cones, i.e. sets containing any nonnegative combination ∑_{i=1}^k θi xi, θ1, . . . , θk ≥ 0, of their points
[Figure: a convex cone in IR3]
Examples of Convex Sets (contd.)
• Hyperplanes, i.e. sets of the form {x : a⊤x = b}, where a ∈ IRd, a ≠ 0, b ∈ IR (since they are special cases of affine sets)
• Halfspaces, i.e. sets of the form {x : a⊤x ≤ b}, where a ∈ IRd, a ≠ 0, b ∈ IR
Polyhedra
• A polyhedron is a set defined by a finite number of affine equalities and inequalities

P = {x : ai⊤x ≤ bi, i = 1, . . . , m, cj⊤x = dj, j = 1, . . . , p}

where a1, . . . , am, c1, . . . , cp ∈ IRd, b1, . . . , bm, d1, . . . , dp ∈ IR
• Polyhedra are convex sets
Polyhedra (contd.)
• A bounded polyhedron is called a polytope
• A set is a polytope if and only if it is the convex hull of a finite set of points
The Positive Semidefinite Cone
• We use the notations

X ⪰ 0, X ≻ 0

to denote that a d × d matrix X is positive semidefinite and positive definite, respectively
• The sets

Sd+ := {X ∈ IRd×d : X ⪰ 0} and Sd++ := {X ∈ IRd×d : X ≻ 0}

are called the positive semidefinite cone and positive definite cone, respectively
The Positive Semidefinite Cone (contd.)
• Sd+ and Sd++ are convex cones

Proof. For any A, B ∈ Sd+, θ1, θ2 ≥ 0, the matrix θ1A + θ2B is psd. ⊓⊔

• E.g. in IR2×2, the positive semidefinite cone consists of the matrices of the form

( x y )
( y z )

such that x, z ≥ 0, xz ≥ y^2
[Figure: the positive semidefinite cone for 2 × 2 matrices]
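The 2 × 2 condition above is easy to sanity-check numerically: the test x, z ≥ 0, xz ≥ y^2 should agree with checking that all eigenvalues of the symmetric matrix are nonnegative. A minimal sketch (numpy assumed; not part of the lecture):

```python
import numpy as np

def psd_2x2_by_formula(x, y, z):
    # condition from the slide: x, z >= 0 and xz >= y^2
    return x >= 0 and z >= 0 and x * z >= y * y

def psd_by_eigenvalues(X, tol=1e-12):
    # a symmetric matrix is psd iff all its eigenvalues are >= 0
    return bool(np.all(np.linalg.eigvalsh(X) >= -tol))

# a few psd and non-psd examples
for x, y, z in [(1.0, 0.5, 1.0), (1.0, 2.0, 1.0), (0.0, 0.0, 3.0), (-1.0, 0.0, 1.0)]:
    X = np.array([[x, y], [y, z]])
    assert psd_2x2_by_formula(x, y, z) == psd_by_eigenvalues(X)
```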
Norms
• A norm, denoted by ‖ · ‖, is a function from IRd to IR+ such that
1. ‖w‖ ≥ 0, for all w ∈ IRd
2. ‖w‖ = 0 if and only if w = 0
3. ‖aw‖ = |a|‖w‖, for all a ∈ IR, w ∈ IRd (homogeneity)
4. ‖w + z‖ ≤ ‖w‖ + ‖z‖, for all w, z ∈ IRd (triangle inequality)
• Important example: the Lp norm

‖w‖p := ( ∑_{i=1}^d |wi|^p )^{1/p}

where p ∈ [1, +∞)
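The Lp norm formula can be checked directly against numpy's built-in implementation; a small sketch (numpy assumed):

```python
import numpy as np

def lp_norm(w, p):
    # ||w||_p = (sum_i |w_i|^p)^(1/p), for p in [1, +inf)
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([3.0, -4.0, 1.0])
assert np.isclose(lp_norm(w, 2), np.linalg.norm(w, 2))   # Euclidean (L2) norm
assert np.isclose(lp_norm(w, 1), np.abs(w).sum())        # L1 norm
# as p grows, ||w||_p approaches the max-norm ||w||_inf
assert np.isclose(lp_norm(w, 100), np.max(np.abs(w)), atol=1e-1)
```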
Norms (contd.)
• We have already seen the L2 norm – it is the regularizer in ridge regression, SVM etc.

‖w‖2 = ( ∑_{i=1}^d wi^2 )^{1/2} = (w⊤w)^{1/2}

• The L1 norm

‖w‖1 = ∑_{i=1}^d |wi|

• Letting p → +∞, we get the L∞ norm

‖w‖∞ = max_{i=1,...,d} |wi|
Norm Balls
• The unit ball for a norm is the set {w : ‖w‖ ≤ 1}
[Figure: the L1, L2, and L∞ unit balls in IR2]
Norm Balls (contd.)
• In general, any norm ball of the form
{w : ‖w − c‖ ≤ r}
where c ∈ IRd and r ≥ 0 are the center and radius of the ball, respectively, is a convex set
Convex Functions
• A function f : IRd → IR is called convex if
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ IRd, λ ∈ [0, 1]
• Intuition: the line segment connecting any two points on the graph of f lies above the graph
[Figure: a convex and a non-convex function]
Convex Functions (contd.)
• Similarly, a function f is called concave if
f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y)
for all x, y ∈ IRd, λ ∈ [0, 1] or, equivalently, if −f is convex
Strict Convexity
• A function f : IRd → IR is called strictly convex if
f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)
for all x, y ∈ IRd, x ≠ y, λ ∈ (0, 1)
• I.e. the line segment connecting any two points on the graph of f lies strictly above the graph
• Equivalently, the graph of a strictly convex function does not contain any line segments
Strict Convexity (contd.)
[Figure: a convex but not strictly convex function, and a strictly convex function]
Jensen’s Inequality
• If f is convex, it follows by induction that
f( ∑_{i=1}^k λi xi ) ≤ ∑_{i=1}^k λi f(xi)

for all k ∈ IN, x1, . . . , xk ∈ IRd, λ1, . . . , λk ≥ 0, such that ∑_{i=1}^k λi = 1
• It can be generalised to integrals and expected values (it is used e.g. to derive the EM algorithm)
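Jensen's inequality can be verified numerically for a concrete convex function such as exp, using random convex combinations; a quick sketch (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp  # exp is convex on IR

for _ in range(100):
    k = 5
    x = rng.normal(size=k)       # points x_1, ..., x_k
    lam = rng.random(k)
    lam /= lam.sum()             # lam_i >= 0 and sum_i lam_i = 1
    # Jensen: f(sum_i lam_i x_i) <= sum_i lam_i f(x_i)
    assert f(lam @ x) <= lam @ f(x) + 1e-12
```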
Continuity / Differentiability
Theorem 1. If a function f is convex on IRd then it is also continuous
on IRd.
Proof. Not easy. ⊓⊔
• There are convex functions which are not differentiable everywhere (and others which are)
Second Order Condition
Theorem 2. Assume that a function f is twice differentiable on IRd. Then f is convex if and only if its Hessian is psd:

∇2f(w) ⪰ 0 for all w ∈ IRd

• Recall that the Hessian is the matrix formed by the second partial derivatives

∇2f(w) := ( ∂^2f / ∂wi∂wj (w) )_{i,j=1}^d

• Note: the condition ∇2f ≻ 0 implies strict convexity, but the converse is not true – see [Boyd & Vandenberghe]
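The second order condition suggests a simple numerical convexity check: estimate the Hessian by finite differences and test that its eigenvalues are nonnegative at random points. A sketch for the log-sum-exp function (numpy assumed; the finite-difference helper is illustrative, not from the lecture):

```python
import numpy as np

def numerical_hessian(f, w, eps=1e-4):
    # central finite differences for the matrix of second partial derivatives
    d = len(w)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return H

def log_sum_exp(w):
    return np.log(np.sum(np.exp(w)))

rng = np.random.default_rng(0)
for _ in range(10):
    w = rng.normal(size=3)
    H = numerical_hessian(log_sum_exp, w)
    # second-order condition: the Hessian should be psd at every point
    assert np.all(np.linalg.eigvalsh((H + H.T) / 2) >= -1e-6)
```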
Examples of Convex Functions
• Affine functions w ↦ a⊤w + b are both convex and concave
• Exponentials, powers, log-sum-exp
[Figures: graphs of w ∈ IR ↦ e^{aw} for any a ∈ IR, w ∈ IR ↦ |w|^p for any p ≥ 1, and w ∈ IRd ↦ log( ∑_{i=1}^d e^{wi} )]
Examples of Convex Functions (contd.)
• Psd. quadratic functions
f(w) = w⊤Aw + a⊤w + b for all w ∈ IRd
where A ∈ Sd+, a ∈ IRd, b ∈ IR
Examples of Convex Functions (contd.)
• Max function w ↦ max{w1, . . . , wd}
Norms Are Convex Functions
• Every norm is a convex function
Proof. Let w, z ∈ IRd and λ ∈ (0, 1). Then

‖λw + (1−λ)z‖ ≤ ‖λw‖ + ‖(1−λ)z‖ = λ‖w‖ + (1−λ)‖z‖

using the triangle inequality and then homogeneity. ⊓⊔
• No norm is strictly convex (to see this, select z = aw with a > 0, a ≠ 1)
• The square of every norm, w ↦ ‖w‖^2, is a convex function

Proof. Do it as an exercise. ⊓⊔
Operations that Preserve Convexity
Question: If f1, . . . , fq : IRd → IR are convex functions, which operations F can we apply so that f := F (f1, . . . , fq) is also convex?
• Nonnegative weighted sums
f = ∑_{i=1}^q θi fi

where θ1, . . . , θq ≥ 0

Proof. Easy from the definition. ⊓⊔
Operations that Preserve Convexity (contd.)
• Composition of a convex function with an affine map
f(x) = g(Ax + b) for all x ∈ IRd
where g : IRn → IR is convex, A ∈ IRn×d is a matrix and b ∈ IRn
Proof. Let x, y ∈ IRd, λ ∈ (0, 1). Then
f(λx + (1 − λ)y) = g(λAx + (1 − λ)Ay + b)
= g (λ(Ax + b) + (1 − λ)(Ay + b))
≤ λg(Ax + b) + (1 − λ)g(Ay + b) = λf(x) + (1 − λ)f(y)
⊓⊔
Operations that Preserve Convexity (contd.)
• Maximum of convex functions
f = max{f1, . . . , fq}
Proof (for q = 2, can be easily generalised to any q). Let x, y ∈ IRd,
λ ∈ (0, 1). Then
f(λx + (1 − λ)y) = max{f1(λx + (1 − λ)y), f2(λx + (1 − λ)y)}
≤ max{λf1(x) + (1 − λ)f1(y), λf2(x) + (1 − λ)f2(y)}
≤ max{λf1(x), λf2(x)} + max{(1 − λ)f1(y), (1 − λ)f2(y)}
= λf(x) + (1 − λ)f(y) ⊓⊔
• Extends also to infinite sets of convex functions
Proving Convexity
• Thus, to prove convexity of a function f , there are several approaches
– From the definition of convexity
– Compute the Hessian of f and show that it is psd.
– Decompose f as a nonnegative weighted sum of convex functions
– Decompose f as the composition of a convex and an affine function
– Decompose f as a maximum of convex functions
Examples
• Show that the function
f(w) = w⊤w = ‖w‖2
is strictly convex
Proof. The Hessian of f equals ∇2f(w) = 2Id, which is positive definite. ⊓⊔
Examples (contd.)
• Show that the quadratic function
f(w) = w⊤Aw + a⊤w + b
where A ∈ Sd+, a ∈ IRd, b ∈ IR, is convex
Proof. Write A = R⊤R for some matrix R ∈ IRd×d. Then w⊤Aw = (Rw)⊤(Rw), which is a composition of the convex function w ↦ w⊤w and a linear map. The term w ↦ a⊤w + b is affine and hence convex. Thus f is convex, as the sum of convex functions. ⊓⊔
Alternatively, we may compute the Hessian, which equals 2A, which is psd.
Proving Strict Convexity
• To prove strict convexity of a function f
– Use the definition (with strict inequality, x ≠ y and λ ∈ (0, 1))
– Compute the Hessian of f and show that it is positive definite
– Decompose f as a sum of a convex and a strictly convex function
(easy to prove this property)
Note: When does the convex-affine composition operation apply?
• Example: the quadratic function f(w) = w⊤Aw + a⊤w + b is strictly convex if and only if A ≻ 0 (since the Hessian equals 2A)
Convex Optimization
• The problem
min_{w∈IRd} f(w)
subject to fi(w) ≤ 0, i = 1, . . . , M        (1)
           aj⊤w = bj, j = 1, . . . , P

where f, f1, . . . , fM are convex functions, is called a convex program or convex optimization problem
• The function f whose value we wish to minimise is called the objective function
Remarks
• The set of points w satisfying the constraints fi(w) ≤ 0, aj⊤w = bj is called the feasible set
• The feasible set is convex: for any feasible points x, y, fi(λx + (1 − λ)y) ≤ λfi(x) + (1 − λ)fi(y) ≤ 0 and aj⊤(λx + (1 − λ)y) = λbj + (1 − λ)bj = bj
• In general, if we minimize a convex objective function over a convex set, the problem can be rewritten in form (1) (in principle at least; sometimes not practically possible)
• Many problems of interest can be rewritten in the form (1); they do not necessarily appear in that form however
Remarks (contd.)
• Minimum (1) does not always exist! (could be an infimum or could be −∞)
[Figure: a convex function whose minimum is not attained]
• The set of solutions (minimisers) of problem (1) is convex (easy to show)
• In particular, if the function f is strictly convex, then there is a unique minimiser (if any exists)
Remarks (contd.)
• There are no local minima outside the set of minimisers
• This is important because it implies that algorithms will not get stuck away from the solution(s)
• Thus, the great appeal of convex programs is that they can be solved! (many of them in polynomial time)
Examples
min_{w∈IRd} w⊤Aw + a⊤w
subject to w⊤Bw + b⊤w + c ≤ 0
           d⊤w = e

where A, B ⪰ 0

min_{w∈IRd} a⊤w
subject to b1⊤w ≤ c1
           b2⊤w ≤ c2
           d1⊤w = e1
Regularization
min_{w∈IRd} ∑_{i=1}^m E( w⊤xi , yi ) + γ ‖w‖^2        (R)

• Assume that E(·, y) is a convex function for every y ∈ IR
• Then, problem (R) is a convex program; this program is unconstrained
• Indeed, the objective function is convex, as a sum of convex functions: E(w⊤xi , yi) is convex as a convex-affine composition and ‖w‖^2 is also convex, as we have already seen
• It can be shown that the minimum exists (under mild assumptions on E)
Regularization (contd.)
• Example 1: ridge regression
min_{w∈IRd} ∑_{i=1}^m (yi − w⊤xi)^2 + γ ‖w‖^2

• Convex program, since the function z ↦ (z − y)^2 is convex for every y ∈ IR
Regularization (contd.)
• Example 2: we have seen (in Lecture 1) that SVM is equivalent to the regularization problem

min_{w∈IRd} ∑_{i=1}^m max{1 − yi(w⊤xi) , 0} + γ ‖w‖^2

with γ = 1/(2C)
• This is a convex program, since the function z ↦ max{1 − yz , 0} (the hinge loss) is convex for every y ∈ {−1, 1}; indeed, it is a maximum of convex (in particular, affine) functions
Regularization (contd.)
• How about the SVM primal

min_{w∈IRd, ξ∈IRm} (1/2)‖w‖^2 + C ∑_{i=1}^m ξi
subject to yi(w⊤xi) ≥ 1 − ξi, ξi ≥ 0, for i = 1, . . . , m

• It is also easy to see that this is a convex program, but now the variables include the wi and the ξi
• The objective function is convex (quadratic in w, linear in ξ); the functions 1 − ξi − yi(w⊤xi) and −ξi in the inequality constraints are also convex
Regularization (contd.)
• Similarly, the SVM and ridge regression dual problems can be seen to be convex problems
• In general, the dual of regularization

min_{c∈IRm} ∑_{i=1}^m E( c⊤gi , yi ) + γ c⊤Gc        (C)

is a convex problem (assuming as before that the loss function is convex)
• Indeed, the quadratic form c⊤Gc is a convex function of c since the Gram matrix G is positive semidefinite
Regularization (contd.)
min_{w∈IRd} ∑_{i=1}^m E( w⊤xi , yi ) + γ ‖w‖^2        (R)

min_{c∈IRm} ∑_{i=1}^m E( c⊤gi , yi ) + γ c⊤Gc        (C)

• Problem (R) has a unique solution; indeed, the term ‖w‖^2 is strictly convex, hence the objective function is also strictly convex
• However, problem (C) has a unique solution only if G ≻ 0, i.e. only if the feature vectors φ(xi) are linearly independent; otherwise, there are infinitely many optimal c, but the corresponding w = ∑_{i=1}^m ci φ(xi) is unique
Convex Programs with Linear Equality Constraints
• The following special type of convex program can be solved using Lagrange multipliers

min_{w∈IRd} f(w)
subject to aj⊤w = bj, j = 1, . . . , P        (2)

where f is a convex and differentiable function
• Set the gradient of the Lagrangian to zero: ∇f(w) = ∑_{j=1}^P cj aj, for some cj ∈ IR; together with the equality constraints, the solutions of this condition are exactly the minimisers of (2) (by a theorem); a dual problem can also be obtained
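For a quadratic objective, the stationarity condition ∇f(w) = ∑_j cj aj together with the constraints forms a linear (KKT) system that can be solved directly. A sketch for min (1/2)‖w‖^2 subject to Aw = b (numpy assumed; the particular A and b are made up for illustration):

```python
import numpy as np

# minimise f(w) = 0.5 * ||w||^2  subject to  A w = b
# stationarity: grad f(w) = w = A^T c for some multipliers c, i.e. w - A^T c = 0
# KKT system:  [ I   -A^T ] [w]   [0]
#              [ A    0   ] [c] = [b]
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
d, P = A.shape[1], A.shape[0]

KKT = np.block([[np.eye(d), -A.T],
                [A, np.zeros((P, P))]])
sol = np.linalg.solve(KKT, np.concatenate([np.zeros(d), b]))
w = sol[:d]

assert np.allclose(A @ w, b)                  # feasibility
# for this objective, w is the minimum-norm solution of Aw = b
assert np.allclose(w, np.linalg.pinv(A) @ b)
```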
Important Convex Optimization Problems
• Linear Programming
• Quadratic Programming
• Semidefinite Programming
• Dedicated off-the-shelf algorithms exist for each of the above categories
• In machine learning, algorithms have been developed for special subtypes of such problems
Linear Programming (LP)
min_{w∈IRd} c⊤w
subject to di⊤w ≤ ei, i = 1, . . . , M        (3)
           aj⊤w = bj, j = 1, . . . , P

• The feasible set is a polyhedron (bounded or not)
• Problem (3) may have one, none, or infinitely many solutions
• Interesting fact: the dual problem is also a linear program
Linear Programming (contd.)
• It can be shown that the solution (if unique) will be one of the vertices
• The simplex algorithm is one of the oldest optimization algorithms (Dantzig in the 40s)
Linear Programming (contd.)
• Intuition of simplex: find a vertex to start from; from each vertex, move to a neighbour so that the objective function decreases; terminate if there is no such neighbour
• Time complexity: very good in almost all cases, but very bad (exponential) in the worst case
• In practice, very fast for typical problems and can be applied to large data sets
• Methods developed in the 80s (interior-point methods) have been applied to linear programming and are of polynomial-time complexity
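As a concrete illustration, off-the-shelf LP solvers are readily available; a sketch using scipy's linprog, whose default HiGHS backend implements simplex and interior-point methods (scipy assumed; the particular LP is made up, and its optimum sits at the vertex (0, 4)):

```python
import numpy as np
from scipy.optimize import linprog

# minimise c^T w subject to D w <= e and w >= 0
c = np.array([-1.0, -2.0])           # objective: maximise w1 + 2 w2
D = np.array([[1.0, 1.0],            # w1 + w2 <= 4
              [1.0, 0.0]])           # w1      <= 2
e = np.array([4.0, 2.0])

res = linprog(c, A_ub=D, b_ub=e, bounds=[(0, None), (0, None)], method="highs")
assert res.success
# the solution sits at a vertex of the feasible polytope
assert np.allclose(res.x, [0.0, 4.0])
```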
Quadratic Programming (QP)
min_{w∈IRd} w⊤Aw + c⊤w
subject to di⊤w ≤ ei, i = 1, . . . , M        (4)
           aj⊤w = bj, j = 1, . . . , P

where A ⪰ 0
• If A ≻ 0, the solution (if any) is unique (due to strict convexity)
• The dual is also a quadratic program
Quadratic Programming (contd.)
• The idea behind simplex does not apply here; in fact, the minimiser could be anywhere in the feasible set (on the boundary or in the interior)
• The difficulty in solving QP is due to the fact that the solution may lie on the boundary of the feasible set
Interior-Point Methods
• Idea: change the objective function by adding to it a barrier function
• The barrier depends on the constraints and is parameterised by a parameter t
• Unconstrained minimisation of the barrier function gives a solution in the interior of the feasible set
• Changing t appropriately, the algorithm converges to the solution of (4) in polynomial time
• These methods can handle problems of reasonably large size
Ridge Regression as QP
min_{w∈IRd} ∑_{i=1}^m (yi − w⊤xi)^2 + γ ‖w‖^2

• Ridge regression is an unconstrained QP
• Just need to solve a linear system using standard methods
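The linear system in question comes from setting the gradient of the ridge objective to zero, which gives (X⊤X + γI)w = X⊤y. A sketch on synthetic data (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, gamma = 50, 5, 0.1
X = rng.normal(size=(m, d))          # rows are the inputs x_i
y = rng.normal(size=m)

# setting the gradient to zero gives (X^T X + gamma I) w = X^T y
w = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

# sanity check: the gradient of the objective vanishes at w
grad = 2 * X.T @ (X @ w - y) + 2 * gamma * w
assert np.allclose(grad, 0)
```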
SVM as QP
Primal:

min_{w∈IRd, ξ∈IRm} (1/2)‖w‖^2 + C ∑_{i=1}^m ξi
s. t. yi(w⊤xi) ≥ 1 − ξi, ξi ≥ 0, for i = 1, . . . , m

Dual:

max_{α∈IRm} −(1/2) α⊤Aα + ∑_{i=1}^m αi
s. t. 0 ≤ αi ≤ C, for i = 1, . . . , m

• SVM is a QP with inequality constraints
• The SVM dual is a QP with “box” constraints
Algorithms for SVM
• One approach to solve SVMs is with interior-point methods
• For large datasets (say m > 10^3) it is practically impossible to solve the dual problem with such methods (matrix A is dense!)
• A typical approach is to iteratively optimize wrt. an ‘active set’ A of dual variables, fixing the rest. Set α = 0, choose q ≤ m and an active set A of q variables. We repeat until convergence the steps
  – Solve the problem wrt. the variables in A
  – Remove one variable from A which satisfies the KKT conditions and add one variable, if any, which violates the KKT conditions. If no such variable exists, stop
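For intuition only, the box-constrained dual above can also be attacked with a much simpler (and slower) method than active-set schemes: projected gradient ascent, clipping each iterate back into the box [0, C]. A sketch on made-up synthetic data (numpy assumed; this is not the decomposition method described on this slide):

```python
import numpy as np

rng = np.random.default_rng(0)
m, C = 40, 1.0
# two roughly separable classes in IR2
X = np.vstack([rng.normal(-1, 1, size=(m // 2, 2)),
               rng.normal(+1, 1, size=(m // 2, 2))])
y = np.concatenate([-np.ones(m // 2), np.ones(m // 2)])

A = (y[:, None] * X) @ (y[:, None] * X).T        # A_ij = y_i y_j x_i^T x_j

def dual_objective(alpha):
    return -0.5 * alpha @ A @ alpha + alpha.sum()

alpha = np.zeros(m)
eta = 1.0 / np.linalg.eigvalsh(A).max()          # step size from the curvature of A
for _ in range(2000):
    # gradient of the dual is 1 - A alpha; project back onto the box [0, C]
    alpha = np.clip(alpha + eta * (1.0 - A @ alpha), 0.0, C)

assert np.all((alpha >= 0) & (alpha <= C))       # box constraints hold
assert dual_objective(alpha) > dual_objective(np.zeros(m))   # dual value improved
```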
QCQP
min_{w∈IRd} w⊤Aw + c⊤w
subject to w⊤Bw + di⊤w ≤ ei, i = 1, . . . , M
           aj⊤w = bj, j = 1, . . . , P

where both A, B are psd.
• It is called a quadratically constrained quadratic program (QCQP)
• Larger family; contains the family of QP
• The dual problem is not a QCQP, in general
QCQP (contd.)
• The feasible set is the intersection of ellipsoids and/or a polyhedron
• It is faster to solve a QP with a dedicated method than to use a QCQP solver
SDP
min_{w∈IRd} c⊤w
subject to w1F1 + · · · + wnFn + G ⪯ 0
           aj⊤w = bj, j = 1, . . . , P

• There is a linear matrix inequality (LMI) constraint
• Multiple LMIs reduce to an equivalent problem with just one LMI
• The dual problem of an SDP is also an SDP
• LP ⊆ QP ⊆ QCQP ⊆ · · · ⊆ SDP (LPs, QPs, QCQPs can be rewritten as SDPs)
Bibliography
Lectures available at:
http://www.cs.ucl.ac.uk/staff/M.Pontil/courses/index-ATML10.htm
See also Boyd and Vandenberghe, Convex Optimization, 2004,
http://www.stanford.edu/boyd/cvxbook/
Secs. 2.1.4-2.2.5, 3.1.1, 3.1.5, 3.1.8, 3.2.1-3.2.3, 4.1.1, 4.2.1, 4.2.2, 4.3, 4.4, 4.6.2