Convex Optimization — Boyd & Vandenberghe

1. Introduction

• mathematical optimization
• least-squares and linear programming
• convex optimization
• example
• course goals and topics
• nonlinear optimization
• brief history of convex optimization

Mathematical optimization

(mathematical) optimization problem

    minimize    f_0(x)
    subject to  f_i(x) ≤ b_i,  i = 1,...,m

• x = (x_1,...,x_n): optimization variables
• f_0 : R^n → R: objective function
• f_i : R^n → R, i = 1,...,m: constraint functions

optimal solution x⋆ has smallest value of f_0 among all vectors that satisfy the constraints

Examples

portfolio optimization
• variables: amounts invested in different assets
• constraints: budget, max./min. investment per asset, minimum return
• objective: overall risk or return variance

device sizing in electronic circuits
• variables: device widths and lengths
• constraints: manufacturing limits, timing requirements, maximum area
• objective: power consumption

data fitting
• variables: model parameters
• constraints: prior information, parameter limits
• objective: measure of misfit or prediction error

Solving optimization problems

general optimization problem
• very difficult to solve
• methods involve some compromise, e.g., very long computation time, or not always finding the solution

exceptions: certain problem classes can be solved efficiently and reliably
• least-squares problems
• linear programming problems
• convex optimization problems

Least-squares

    minimize  ‖Ax − b‖₂²

solving least-squares problems
• analytical solution: x⋆ = (AᵀA)⁻¹Aᵀb
• reliable and efficient algorithms and software
• computation time proportional to n²k (A ∈ R^{k×n}); less if structured
• a mature technology

using least-squares
• least-squares problems are easy to recognize
• a few standard techniques increase flexibility (e.g., including weights, adding regularization terms)

Linear programming

    minimize    cᵀx
    subject to  a_iᵀx ≤ b_i,  i = 1,...,m

solving linear programs
• no analytical formula for the solution
• reliable and efficient algorithms and software
• computation time proportional to n²m if m ≥ n; less with structure
• a mature technology

using linear programming
• not as easy to recognize as least-squares problems
• a few standard tricks are used to convert problems into linear programs (e.g., problems involving ℓ₁ or ℓ∞ norms, piecewise-linear functions)

Convex optimization problem

    minimize    f_0(x)
    subject to  f_i(x) ≤ b_i,  i = 1,...,m

• objective and constraint functions are convex:

    f_i(αx + βy) ≤ α f_i(x) + β f_i(y)   if α + β = 1, α ≥ 0, β ≥ 0

• includes least-squares problems and linear programs as special cases

solving convex optimization problems
• no analytical solution
• reliable and efficient algorithms
• computation time (roughly) proportional to max{n³, n²m, F}, where F is the cost of evaluating the f_i and their first and second derivatives
• almost a technology

using convex optimization
• often difficult to recognize
• many tricks for transforming problems into convex form
• surprisingly many problems can be solved via convex optimization

Example

m lamps illuminating n (small, flat) patches; lamp j has power p_j, and sits at distance r_kj and angle θ_kj from patch k

intensity I_k at patch k depends linearly on the lamp powers p_j:

    I_k = ∑_{j=1}^m a_kj p_j,   a_kj = r_kj⁻² max{cos θ_kj, 0}

problem: achieve desired illumination I_des with bounded lamp powers

    minimize    max_{k=1,...,n} |log I_k − log I_des|
    subject to  0 ≤ p_j ≤ p_max,  j = 1,...,m

how to solve?

1. use uniform power: p_j = p, vary p
2. use least-squares:

       minimize  ∑_{k=1}^n (I_k − I_des)²

   round p_j if p_j > p_max or p_j < 0
3. use weighted least-squares:

       minimize  ∑_{k=1}^n (I_k − I_des)² + ∑_{j=1}^m w_j (p_j − p_max/2)²

   iteratively adjust weights w_j until 0 ≤ p_j ≤ p_max
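The closed-form least-squares solution x⋆ = (AᵀA)⁻¹Aᵀb above can be checked on a tiny instance. The data below (three points on the line y = 1 + t) are made up for illustration; the normal equations are solved by Cramer's rule, which is only sensible at this size.

```python
# Least-squares via the normal equations, x* = (A^T A)^{-1} A^T b,
# on a small made-up instance: fit y = x1 + x2*t to three points.
A = [[1.0, 0.0],   # one row per measurement, one column per variable
     [1.0, 1.0],
     [1.0, 2.0]]
b = [1.0, 2.0, 3.0]

# form the 2x2 normal equations A^T A x = A^T b
AtA = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]
Atb = [sum(A[k][i] * b[k] for k in range(3)) for i in range(2)]

# solve the 2x2 system by Cramer's rule (for larger problems use a
# QR or Cholesky routine; never form an explicit inverse)
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x1 = (Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det
x2 = (AtA[0][0] * Atb[1] - Atb[0] * AtA[1][0]) / det

print(x1, x2)  # the points lie exactly on y = 1 + t, so x* = (1, 1)
```

In practice one would call a library least-squares solver; the point here is only that the analytical formula is directly computable.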
4. use linear programming:

       minimize    max_{k=1,...,n} |I_k − I_des|
       subject to  0 ≤ p_j ≤ p_max,  j = 1,...,m

   which can be solved via linear programming

of course these are approximate (suboptimal) ‘solutions’

5. use convex optimization: problem is equivalent to

       minimize    f_0(p) = max_{k=1,...,n} h(I_k/I_des)
       subject to  0 ≤ p_j ≤ p_max,  j = 1,...,m

   with h(u) = max{u, 1/u}

f_0 is convex because the maximum of convex functions is convex

exact solution obtained with effort ≈ modest factor × least-squares effort

additional constraints: does adding 1 or 2 below complicate the problem?

1. no more than half of total power is in any 10 lamps
2. no more than half of the lamps are on (p_j > 0)

• answer: with (1), still easy to solve; with (2), extremely difficult
• moral: (untrained) intuition doesn’t always work; without the proper background very easy problems can appear quite similar to very difficult problems

Course goals and topics

goals
1. recognize/formulate problems (such as the illumination problem) as convex optimization problems
2. develop code for problems of moderate size (1000 lamps, 5000 patches)
3. characterize optimal solution (optimal power distribution), give limits of performance, etc.

topics
1. convex sets, functions, optimization problems
2. examples and applications
3. algorithms

Nonlinear optimization

traditional techniques for general nonconvex problems involve compromises

local optimization methods (nonlinear programming)
• find a point that minimizes f_0 among feasible points near it
• fast, can handle large problems
• require initial guess
• provide no information about distance to (global) optimum

global optimization methods
• find the (global) solution
• worst-case complexity grows exponentially with problem size

these algorithms are often based on solving convex subproblems

Brief history of convex optimization

theory (convex analysis): ca. 1900–1970

algorithms
• 1947: simplex algorithm for linear programming (Dantzig)
• 1960s: early interior-point methods (Fiacco & McCormick, Dikin, ...)
• 1970s: ellipsoid method and other subgradient methods
• 1980s: polynomial-time interior-point methods for linear programming (Karmarkar 1984)
• late 1980s–now: polynomial-time interior-point methods for nonlinear convex optimization (Nesterov & Nemirovski 1994)

applications
• before 1990: mostly in operations research; few in engineering
• since 1990: many new applications in engineering (control, signal processing, communications, circuit design, ...); new problem classes (semidefinite and second-order cone programming, robust optimization)
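The exact convex objective f_0(p) = max_k h(I_k/I_des) from the illumination example is cheap to evaluate. The coefficient matrix `a` and powers `p` below are made-up illustrative values, not data from the slides.

```python
# Evaluating f_0(p) = max_k h(I_k / I_des), h(u) = max{u, 1/u},
# for the illumination problem on made-up data.
a = [[0.5, 0.2],    # a[k][j]: illumination of patch k per unit power of lamp j
     [0.1, 0.6],
     [0.3, 0.3]]
p = [1.0, 2.0]      # lamp powers
I_des = 1.0         # desired illumination

def h(u):
    # convex on u > 0, equals 1 exactly when the patch meets I_des
    return max(u, 1.0 / u)

# intensities I_k depend linearly on the powers
I = [sum(a_kj * p_j for a_kj, p_j in zip(row, p)) for row in a]
f0 = max(h(I_k / I_des) for I_k in I)
print(I, f0)  # I = [0.9, 1.3, 0.9], f0 = 1.3
```

Minimizing f_0 over the box 0 ⪯ p ⪯ p_max would then be a convex problem; only the evaluation is shown here.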
Convex Optimization — Boyd & Vandenberghe

2. Convex sets

• affine and convex sets
• some important examples
• operations that preserve convexity
• generalized inequalities
• separating and supporting hyperplanes
• dual cones and generalized inequalities

Affine set

line through x_1, x_2: all points

    x = θx_1 + (1−θ)x_2   (θ ∈ R)

affine set: contains the line through any two distinct points in the set

example: solution set of linear equations {x | Ax = b}
(conversely, every affine set can be expressed as the solution set of a system of linear equations)

Convex set

line segment between x_1 and x_2: all points

    x = θx_1 + (1−θ)x_2

with 0 ≤ θ ≤ 1

convex set: contains line segment between any two points in the set

    x_1, x_2 ∈ C, 0 ≤ θ ≤ 1  ⟹  θx_1 + (1−θ)x_2 ∈ C

Convex combination and convex hull

convex combination of x_1,...,x_k: any point x of the form

    x = θ_1 x_1 + θ_2 x_2 + ··· + θ_k x_k

with θ_1 + ··· + θ_k = 1, θ_i ≥ 0

convex hull conv S: set of all convex combinations of points in S

Convex cone

conic (nonnegative) combination of x_1 and x_2: any point of the form

    x = θ_1 x_1 + θ_2 x_2

with θ_1 ≥ 0, θ_2 ≥ 0

convex cone: set that contains all conic combinations of points in the set

Hyperplanes and halfspaces

hyperplane: set of the form {x | aᵀx = b} (a ≠ 0)

halfspace: set of the form {x | aᵀx ≤ b} (a ≠ 0)

• a is the normal vector
• hyperplanes are affine and convex; halfspaces are convex

Euclidean balls and ellipsoids

(Euclidean) ball with center x_c and radius r:

    B(x_c, r) = {x | ‖x − x_c‖₂ ≤ r} = {x_c + ru | ‖u‖₂ ≤ 1}

ellipsoid: set of the form

    {x | (x − x_c)ᵀ P⁻¹ (x − x_c) ≤ 1}

with P ∈ S^n_++ (i.e., P symmetric positive definite)

other representation: {x_c + Au | ‖u‖₂ ≤ 1} with A square and nonsingular

Norm balls and norm cones

norm: a function ‖·‖ that satisfies
• ‖x‖ ≥ 0; ‖x‖ = 0 if and only if x = 0
• ‖tx‖ = |t| ‖x‖ for t ∈ R
• ‖x + y‖ ≤ ‖x‖ + ‖y‖

notation: ‖·‖ is a general (unspecified) norm; ‖·‖_symb is a particular norm

norm ball with center x_c and radius r: {x | ‖x − x_c‖ ≤ r}

norm cone: {(x, t) | ‖x‖ ≤ t}

the Euclidean norm cone is called the second-order cone

norm balls and cones are convex

Polyhedra

solution set of finitely many linear inequalities and equalities

    Ax ⪯ b,   Cx = d

(A ∈ R^{m×n}, C ∈ R^{p×n}, ⪯ is componentwise inequality)

a polyhedron is the intersection of a finite number of halfspaces and hyperplanes

Positive semidefinite cone

notation:
• S^n is the set of symmetric n×n matrices
• S^n_+ = {X ∈ S^n | X ⪰ 0}: positive semidefinite n×n matrices

    X ∈ S^n_+  ⟺  zᵀXz ≥ 0 for all z

  S^n_+ is a convex cone
• S^n_++ = {X ∈ S^n | X ≻ 0}: positive definite n×n matrices

example: [x y; y z] ∈ S²_+

Operations that preserve convexity

practical methods for establishing convexity of a set C

1. apply definition

    x_1, x_2 ∈ C, 0 ≤ θ ≤ 1  ⟹  θx_1 + (1−θ)x_2 ∈ C

2. show that C is obtained from simple convex sets (hyperplanes, halfspaces, norm balls, ...) by operations that preserve convexity
   • intersection
   • affine functions
   • perspective function
   • linear-fractional functions

Intersection

the intersection of (any number of) convex sets is convex

example:

    S = {x ∈ R^m | |p(t)| ≤ 1 for |t| ≤ π/3}

where p(t) = x_1 cos t + x_2 cos 2t + ··· + x_m cos mt

Affine function

suppose f : R^n → R^m is affine (f(x) = Ax + b with A ∈ R^{m×n}, b ∈ R^m)

• the image of a convex set under f is convex

    S ⊆ R^n convex  ⟹  f(S) = {f(x) | x ∈ S} convex

• the inverse image f⁻¹(C) of a convex set under f is convex

    C ⊆ R^m convex  ⟹  f⁻¹(C) = {x ∈ R^n | f(x) ∈ C} convex

examples
• scaling, translation, projection
• solution set of linear matrix inequality {x | x_1 A_1 + ··· + x_m A_m ⪯ B} (with A_i, B ∈ S^p)
• hyperbolic cone {x | xᵀPx ≤ (cᵀx)², cᵀx ≥ 0} (with P ∈ S^n_+)

Perspective and linear-fractional function

perspective function P : R^{n+1} → R^n:

    P(x, t) = x/t,   dom P = {(x, t) | t > 0}

images and inverse images of convex sets under perspective are convex

linear-fractional function f : R^n → R^m:

    f(x) = (Ax + b)/(cᵀx + d),   dom f = {x | cᵀx + d > 0}

images and inverse images of convex sets under linear-fractional functions are convex

example of a linear-fractional function: f(x) = x/(x_1 + x_2 + 1)

Generalized inequalities

a convex cone K ⊆ R^n is a proper cone if
• K is closed (contains its boundary)
• K is solid (has nonempty interior)
• K is pointed (contains no line)

examples
• nonnegative orthant K = R^n_+ = {x ∈ R^n | x_i ≥ 0, i = 1,...,n}
• positive semidefinite cone K = S^n_+
• nonnegative polynomials on [0,1]:

    K = {x ∈ R^n | x_1 + x_2 t + x_3 t² + ··· + x_n t^{n−1} ≥ 0 for t ∈ [0,1]}

generalized inequality defined by a proper cone K:

    x ⪯_K y  ⟺  y − x ∈ K,    x ≺_K y  ⟺  y − x ∈ int K

examples
• componentwise inequality (K = R^n_+): x ⪯ y ⟺ x_i ≤ y_i, i = 1,...,n
• matrix inequality (K = S^n_+): X ⪯ Y ⟺ Y − X positive semidefinite

these two types are so common that we drop the subscript in ⪯_K
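The definition of convexity can be spot-checked numerically. The sketch below samples random points of the second-order cone {(x, t) : ‖x‖₂ ≤ t} and verifies that convex combinations stay in the cone; this is a sanity check under randomly generated data, not a proof.

```python
import math, random

random.seed(0)

def in_soc(x, t):
    # membership in the second-order cone {(x, t) : ||x||_2 <= t}
    # (small slack absorbs floating-point rounding)
    return math.sqrt(sum(xi * xi for xi in x)) <= t + 1e-12

def random_soc_point(n=3):
    # sample a cone point: pick x, then any t at least ||x||_2
    x = [random.uniform(-1, 1) for _ in range(n)]
    t = math.sqrt(sum(xi * xi for xi in x)) + random.uniform(0, 1)
    return x, t

ok = True
for _ in range(1000):
    (x1, t1), (x2, t2) = random_soc_point(), random_soc_point()
    th = random.random()
    z = [th * a + (1 - th) * b for a, b in zip(x1, x2)]
    tz = th * t1 + (1 - th) * t2
    ok = ok and in_soc(z, tz)
print(ok)  # True: the second-order cone is convex
```

The check succeeds because ‖θx_1 + (1−θ)x_2‖₂ ≤ θ‖x_1‖₂ + (1−θ)‖x_2‖₂ ≤ θt_1 + (1−θ)t_2, i.e., the triangle inequality does the work.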
properties: many properties of ⪯_K are similar to ≤ on R, e.g.,

    x ⪯_K y, u ⪯_K v  ⟹  x + u ⪯_K y + v

Minimum and minimal elements

⪯_K is not in general a linear ordering: we can have x ⋠_K y and y ⋠_K x

x ∈ S is the minimum element of S with respect to ⪯_K if

    y ∈ S  ⟹  x ⪯_K y

x ∈ S is a minimal element of S with respect to ⪯_K if

    y ∈ S, y ⪯_K x  ⟹  y = x

example (K = R²_+): x_1 is the minimum element of S_1; x_2 is a minimal element of S_2

Separating hyperplane theorem

if C and D are nonempty disjoint convex sets, there exist a ≠ 0, b such that

    aᵀx ≤ b for x ∈ C,   aᵀx ≥ b for x ∈ D

the hyperplane {x | aᵀx = b} separates C and D

strict separation requires additional assumptions (e.g., C is closed, D is a singleton)

Supporting hyperplane theorem

supporting hyperplane to set C at boundary point x_0:

    {x | aᵀx = aᵀx_0}

where a ≠ 0 and aᵀx ≤ aᵀx_0 for all x ∈ C

supporting hyperplane theorem: if C is convex, then there exists a supporting hyperplane at every boundary point of C

Dual cones and generalized inequalities

dual cone of a cone K:

    K* = {y | yᵀx ≥ 0 for all x ∈ K}

examples
• K = R^n_+: K* = R^n_+
• K = S^n_+: K* = S^n_+
• K = {(x, t) | ‖x‖₂ ≤ t}: K* = {(x, t) | ‖x‖₂ ≤ t}
• K = {(x, t) | ‖x‖₁ ≤ t}: K* = {(x, t) | ‖x‖∞ ≤ t}

the first three examples are self-dual cones

dual cones of proper cones are proper, hence define generalized inequalities:

    y ⪰_{K*} 0  ⟺  yᵀx ≥ 0 for all x ⪰_K 0

Minimum and minimal elements via dual inequalities

minimum element w.r.t. ⪯_K: x is the minimum element of S iff for all λ ≻_{K*} 0, x is the unique minimizer of λᵀz over S

minimal element w.r.t. ⪯_K
• if x minimizes λᵀz over S for some λ ≻_{K*} 0, then x is minimal
• if x is a minimal element of a convex set S, then there exists a nonzero λ ⪰_{K*} 0 such that x minimizes λᵀz over S

optimal production frontier
• different production methods use different amounts of resources x ∈ R^n
• production set P: resource vectors x for all possible production methods
• efficient (Pareto optimal) methods correspond to resource vectors x that are minimal w.r.t. R^n_+

fuel/labor example (n = 2): x_1, x_2, x_3 are efficient; x_4, x_5 are not

Convex Optimization — Boyd & Vandenberghe

3. Convex functions

• basic properties and examples
• operations that preserve convexity
• the conjugate function
• quasiconvex functions
• log-concave and log-convex functions
• convexity with respect to generalized inequalities

Definition

f : R^n → R is convex if dom f is a convex set and

    f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y)

for all x, y ∈ dom f, 0 ≤ θ ≤ 1

• f is concave if −f is convex
• f is strictly convex if dom f is convex and

    f(θx + (1−θ)y) < θf(x) + (1−θ)f(y)

  for x, y ∈ dom f, x ≠ y, 0 < θ < 1

Examples on R

convex:
• affine: ax + b on R, for any a, b ∈ R
• exponential: e^{ax}, for any a ∈ R
• powers: x^α on R_++, for α ≥ 1 or α ≤ 0
• powers of absolute value: |x|^p on R, for p ≥ 1
• negative entropy: x log x on R_++

concave:
• affine: ax + b on R, for any a, b ∈ R
• powers: x^α on R_++, for 0 ≤ α ≤ 1
• logarithm: log x on R_++

Examples on R^n and R^{m×n}

affine functions are convex and concave; all norms are convex

examples on R^n
• affine function f(x) = aᵀx + b
• norms: ‖x‖_p = (∑_{i=1}^n |x_i|^p)^{1/p} for p ≥ 1; ‖x‖∞ = max_k |x_k|

examples on R^{m×n} (m×n matrices)
• affine function f(X) = tr(AᵀX) + b = ∑_{i=1}^m ∑_{j=1}^n A_ij X_ij + b
• spectral (maximum singular value) norm f(X) = ‖X‖₂ = σ_max(X) = (λ_max(XᵀX))^{1/2}

Restriction of a convex function to a line

f : R^n → R is convex if and only if the function g : R → R,

    g(t) = f(x + tv),   dom g = {t | x + tv ∈ dom f}

is convex (in t) for any x ∈ dom f, v ∈ R^n

can check convexity of f by checking convexity of functions of one variable
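The distinction between minimum and minimal elements is easy to compute for a finite set in R² under K = R²_+. The point set `S` below is made up for illustration; its minimal elements form a Pareto frontier, and it has no minimum element.

```python
# Minimal vs. minimum elements of a finite S ⊂ R^2 w.r.t. K = R^2_+.
# A point x is minimal if no other point of S is <= x componentwise;
# it is the minimum if it is <= every point of S componentwise.
S = [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0), (4.0, 1.0)]

def dominates(y, x):
    # y ⪯_K x with y != x
    return y != x and all(yi <= xi for yi, xi in zip(y, x))

minimal = [x for x in S if not any(dominates(y, x) for y in S)]
minimum = [x for x in S
           if all(all(xi <= yi for xi, yi in zip(x, y)) for y in S)]

print(minimal)  # [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0)] -- the Pareto frontier
print(minimum)  # [] -- S has minimal elements but no minimum element
```

This mirrors the production-frontier slide: the efficient methods are exactly the resource vectors that survive the `dominates` test.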
example: f : S^n_++ → R with f(X) = log det X

    g(t) = log det(X + tV) = log det X + log det(I + t X^{−1/2} V X^{−1/2})
         = log det X + ∑_{i=1}^n log(1 + tλ_i)

where λ_i are the eigenvalues of X^{−1/2} V X^{−1/2}

g is concave in t (for any choice of X ≻ 0, V); hence f is concave

Extended-value extension

the extended-value extension f̃ of f is

    f̃(x) = f(x), x ∈ dom f;   f̃(x) = ∞, x ∉ dom f

often simplifies notation; for example, the condition

    0 ≤ θ ≤ 1  ⟹  f̃(θx + (1−θ)y) ≤ θf̃(x) + (1−θ)f̃(y)

(as an inequality in R ∪ {∞}) means the same as the two conditions
• dom f is convex
• for x, y ∈ dom f, 0 ≤ θ ≤ 1 ⟹ f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y)

First-order condition

f is differentiable if dom f is open and the gradient

    ∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n)

exists at each x ∈ dom f

1st-order condition: differentiable f with convex domain is convex iff

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)   for all x, y ∈ dom f

the first-order approximation of f is a global underestimator

Second-order conditions

f is twice differentiable if dom f is open and the Hessian ∇²f(x) ∈ S^n,

    ∇²f(x)_ij = ∂²f(x)/∂x_i∂x_j,   i, j = 1,...,n,

exists at each x ∈ dom f

2nd-order conditions: for twice differentiable f with convex domain
• f is convex if and only if ∇²f(x) ⪰ 0 for all x ∈ dom f
• if ∇²f(x) ≻ 0 for all x ∈ dom f, then f is strictly convex

Examples

quadratic function: f(x) = (1/2)xᵀPx + qᵀx + r (with P ∈ S^n)

    ∇f(x) = Px + q,   ∇²f(x) = P

convex if P ⪰ 0

least-squares objective: f(x) = ‖Ax − b‖₂²

    ∇f(x) = 2Aᵀ(Ax − b),   ∇²f(x) = 2AᵀA

convex (for any A)

quadratic-over-linear: f(x, y) = x²/y

    ∇²f(x, y) = (2/y³) [y; −x][y; −x]ᵀ ⪰ 0

convex for y > 0

log-sum-exp: f(x) = log ∑_{k=1}^n exp x_k is convex

    ∇²f(x) = (1/(1ᵀz)) diag(z) − (1/(1ᵀz)²) zzᵀ   (z_k = exp x_k)

to show ∇²f(x) ⪰ 0, we must verify that vᵀ∇²f(x)v ≥ 0 for all v:

    vᵀ∇²f(x)v = [(∑_k z_k v_k²)(∑_k z_k) − (∑_k v_k z_k)²] / (∑_k z_k)² ≥ 0

since (∑_k v_k z_k)² ≤ (∑_k z_k v_k²)(∑_k z_k) (from the Cauchy–Schwarz inequality)

geometric mean: f(x) = (∏_{k=1}^n x_k)^{1/n} on R^n_++ is concave
(similar proof as for log-sum-exp)

Epigraph and sublevel set

α-sublevel set of f : R^n → R:

    C_α = {x ∈ dom f | f(x) ≤ α}

sublevel sets of convex functions are convex (converse is false)

epigraph of f : R^n → R:

    epi f = {(x, t) ∈ R^{n+1} | x ∈ dom f, f(x) ≤ t}

f is convex if and only if epi f is a convex set

Jensen’s inequality

basic inequality: if f is convex, then for 0 ≤ θ ≤ 1,

    f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y)

extension: if f is convex, then

    f(E z) ≤ E f(z)

for any random variable z

basic inequality is the special case with discrete distribution

    prob(z = x) = θ,   prob(z = y) = 1 − θ

Operations that preserve convexity

practical methods for establishing convexity of a function

1. verify definition (often simplified by restricting to a line)
2. for twice differentiable functions, show ∇²f(x) ⪰ 0
3. show that f is obtained from simple convex functions by operations that preserve convexity
   • nonnegative weighted sum
   • composition with affine function
   • pointwise maximum and supremum
   • composition
   • minimization
   • perspective

Positive weighted sum & composition with affine function

nonnegative multiple: αf is convex if f is convex, α ≥ 0

sum: f_1 + f_2 convex if f_1, f_2 convex (extends to infinite sums, integrals)

composition with affine function: f(Ax + b) is convex if f is convex

examples
• log barrier for linear inequalities

    f(x) = −∑_{i=1}^m log(b_i − a_iᵀx),   dom f = {x | a_iᵀx < b_i, i = 1,...,m}

• (any) norm of affine function: f(x) = ‖Ax + b‖
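The first-order condition (the tangent plane is a global underestimator) can be checked numerically for log-sum-exp, whose gradient is the softmax. This is a random spot-check on made-up points, not a proof.

```python
import math, random

random.seed(1)

def f(x):
    # log-sum-exp, a smooth convex function on R^n
    m = max(x)  # shift for numerical stability
    return m + math.log(sum(math.exp(xi - m) for xi in x))

def grad(x):
    # analytic gradient of log-sum-exp: the softmax of x
    m = max(x)
    z = [math.exp(xi - m) for xi in x]
    s = sum(z)
    return [zi / s for zi in z]

# check f(y) >= f(x) + grad(x)^T (y - x) at random pairs of points
ok = True
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(4)]
    y = [random.uniform(-5, 5) for _ in range(4)]
    lin = f(x) + sum(g * (yi - xi) for g, yi, xi in zip(grad(x), y, x))
    ok = ok and f(y) >= lin - 1e-9
print(ok)  # True: the first-order approximation never overestimates
```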
Pointwise maximum

if f_1, ..., f_m are convex, then f(x) = max{f_1(x), ..., f_m(x)} is convex

examples
• piecewise-linear function: f(x) = max_{i=1,...,m} (a_iᵀx + b_i) is convex
• sum of r largest components of x ∈ R^n:

    f(x) = x_[1] + x_[2] + ··· + x_[r]

  is convex (x_[i] is the ith largest component of x)

  proof: f(x) = max{x_{i_1} + x_{i_2} + ··· + x_{i_r} | 1 ≤ i_1 < i_2 < ··· < i_r ≤ n}

Pointwise supremum

if f(x, y) is convex in x for each y ∈ A, then

    g(x) = sup_{y∈A} f(x, y)

is convex

examples
• support function of a set C: S_C(x) = sup_{y∈C} yᵀx is convex
• distance to farthest point in a set C: f(x) = sup_{y∈C} ‖x − y‖
• maximum eigenvalue of symmetric matrix: for X ∈ S^n, λ_max(X) = sup_{‖y‖₂=1} yᵀXy

Composition with scalar functions

composition of g : R^n → R and h : R → R:

    f(x) = h(g(x))

f is convex if
• g convex, h convex, h̃ nondecreasing
• g concave, h convex, h̃ nonincreasing

• proof (for n = 1, differentiable g, h)

    f″(x) = h″(g(x)) g′(x)² + h′(g(x)) g″(x)

• note: monotonicity must hold for the extended-value extension h̃

examples
• exp g(x) is convex if g is convex
• 1/g(x) is convex if g is concave and positive

Vector composition

composition of g : R^n → R^k and h : R^k → R:

    f(x) = h(g(x)) = h(g_1(x), g_2(x), ..., g_k(x))

f is convex if
• g_i convex, h convex, h̃ nondecreasing in each argument
• g_i concave, h convex, h̃ nonincreasing in each argument

proof (for n = 1, differentiable g, h)

    f″(x) = g′(x)ᵀ ∇²h(g(x)) g′(x) + ∇h(g(x))ᵀ g″(x)

examples
• ∑_{i=1}^m log g_i(x) is concave if g_i are concave and positive
• log ∑_{i=1}^m exp g_i(x) is convex if g_i are convex

Minimization

if f(x, y) is convex in (x, y) and C is a convex set, then

    g(x) = inf_{y∈C} f(x, y)

is convex

examples
• f(x, y) = xᵀAx + 2xᵀBy + yᵀCy with [A B; Bᵀ C] ⪰ 0, C ≻ 0; minimizing over y gives

    g(x) = inf_y f(x, y) = xᵀ(A − BC⁻¹Bᵀ)x

  g is convex, hence the Schur complement A − BC⁻¹Bᵀ ⪰ 0
• distance to a set: dist(x, S) = inf_{y∈S} ‖x − y‖ is convex if S is convex

Perspective

the perspective of a function f : R^n → R is the function g : R^n × R → R,

    g(x, t) = t f(x/t),   dom g = {(x, t) | x/t ∈ dom f, t > 0}

g is convex if f is convex

examples
• f(x) = xᵀx is convex; hence g(x, t) = xᵀx/t is convex for t > 0
• negative logarithm f(x) = −log x is convex; hence relative entropy g(x, t) = t log t − t log x is convex on R²_++
• if f is convex, then

    g(x) = (cᵀx + d) f((Ax + b)/(cᵀx + d))

  is convex on {x | cᵀx + d > 0, (Ax + b)/(cᵀx + d) ∈ dom f}

The conjugate function

the conjugate of a function f is

    f*(y) = sup_{x∈dom f} (yᵀx − f(x))

• f* is convex (even if f is not)
• will be useful in chapter 5

examples
• negative logarithm f(x) = −log x

    f*(y) = sup_{x>0} (xy + log x) = −1 − log(−y) if y < 0, ∞ otherwise

• strictly convex quadratic f(x) = (1/2)xᵀQx with Q ∈ S^n_++

    f*(y) = sup_x (yᵀx − (1/2)xᵀQx) = (1/2) yᵀQ⁻¹y

Quasiconvex functions

f : R^n → R is quasiconvex if dom f is convex and the sublevel sets

    S_α = {x ∈ dom f | f(x) ≤ α}

are convex for all α

• f is quasiconcave if −f is quasiconvex
• f is quasilinear if it is quasiconvex and quasiconcave

Examples

• √|x| is quasiconvex on R
• ceil(x) = inf{z ∈ Z | z ≥ x} is quasilinear
• log x is quasilinear on R_++
• f(x_1, x_2) = x_1 x_2 is quasiconcave on R²_++
• linear-fractional function

    f(x) = (aᵀx + b)/(cᵀx + d),   dom f = {x | cᵀx + d > 0}

  is quasilinear
• distance ratio

    f(x) = ‖x − a‖₂ / ‖x − b‖₂,   dom f = {x | ‖x − a‖₂ ≤ ‖x − b‖₂}

  is quasiconvex

internal rate of return
• cash flow x = (x_0,...,x_n); x_i is payment in period i (to us if x_i > 0)
• we assume x_0 < 0 and x_0 + x_1 + ··· + x_n > 0
• present value of cash flow x, for interest rate r:

    PV(x, r) = ∑_{i=0}^n (1 + r)^{−i} x_i

• internal rate of return is smallest interest rate for which PV(x, r) = 0:

    IRR(x) = inf{r ≥ 0 | PV(x, r) = 0}

IRR is quasiconcave: superlevel set is intersection of open halfspaces

    IRR(x) ≥ R  ⟺  ∑_{i=0}^n (1 + r)^{−i} x_i > 0 for 0 ≤ r < R

Properties

modified Jensen inequality: for quasiconvex f

    0 ≤ θ ≤ 1  ⟹  f(θx + (1−θ)y) ≤ max{f(x), f(y)}

first-order condition: differentiable f with convex domain is quasiconvex iff

    f(y) ≤ f(x)  ⟹  ∇f(x)ᵀ(y − x) ≤ 0

sums of quasiconvex functions are not necessarily quasiconvex

Log-concave and log-convex functions

a positive function f is log-concave if log f is concave:

    f(θx + (1−θ)y) ≥ f(x)^θ f(y)^{1−θ}   for 0 ≤ θ ≤ 1

f is log-convex if log f is convex

• powers: x^a on R_++ is log-convex for a ≤ 0, log-concave for a ≥ 0
• many common probability densities are log-concave, e.g., normal:

    f(x) = (1/√((2π)^n det Σ)) e^{−(1/2)(x − x̄)ᵀΣ⁻¹(x − x̄)}

• cumulative Gaussian distribution function Φ is log-concave

    Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−u²/2} du

Properties of log-concave functions

• twice differentiable f with convex domain is log-concave if and only if

    f(x) ∇²f(x) ⪯ ∇f(x) ∇f(x)ᵀ   for all x ∈ dom f

• product of log-concave functions is log-concave
• sum of log-concave functions is not always log-concave
• integration: if f : R^n × R^m → R is log-concave, then

    g(x) = ∫ f(x, y) dy

  is log-concave (not easy to show)

consequences of integration property
• convolution f∗g of log-concave functions f, g is log-concave

    (f∗g)(x) = ∫ f(x − y) g(y) dy

• if C ⊆ R^n is convex and y is a random variable with log-concave pdf, then

    f(x) = prob(x + y ∈ C)

  is log-concave

  proof: write f(x) as an integral of a product of log-concave functions

    f(x) = ∫ g(x + y) p(y) dy,   g(u) = 1 if u ∈ C, 0 if u ∉ C

  where p is the pdf of y

example: yield function

    Y(x) = prob(x + w ∈ S)

• x ∈ R^n: nominal parameter values for product
• w ∈ R^n: random variations of parameters in manufactured product
• S: set of acceptable values

if S is convex and w has a log-concave pdf, then
• Y is log-concave
• yield regions {x | Y(x) ≥ α} are convex

Convexity with respect to generalized inequalities

f : R^n → R^m is K-convex if dom f is convex and

    f(θx + (1−θ)y) ⪯_K θf(x) + (1−θ)f(y)

for x, y ∈ dom f, 0 ≤ θ ≤ 1

example: f : S^m → S^m, f(X) = X² is S^m_+-convex

proof: for fixed z ∈ R^m, zᵀX²z = ‖Xz‖₂² is convex in X, i.e.,

    zᵀ(θX + (1−θ)Y)²z ≤ θ zᵀX²z + (1−θ) zᵀY²z

for X, Y ∈ S^m, 0 ≤ θ ≤ 1

therefore (θX + (1−θ)Y)² ⪯ θX² + (1−θ)Y²

Convex Optimization — Boyd & Vandenberghe

4. Convex optimization problems

• optimization problem in standard form
• convex optimization problems
• quasiconvex optimization
• linear optimization
• quadratic optimization
• geometric programming
• generalized inequality constraints
• semidefinite programming
• vector optimization

Optimization problem in standard form

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1,...,m
                h_i(x) = 0,  i = 1,...,p

• x ∈ R^n is the optimization variable
• f_0 : R^n → R is the objective or cost function
• f_i : R^n → R, i = 1,...,m, are the inequality constraint functions
• h_i : R^n → R are the equality constraint functions

optimal value:

    p⋆ = inf{f_0(x) | f_i(x) ≤ 0, i = 1,...,m, h_i(x) = 0, i = 1,...,p}

• p⋆ = ∞ if problem is infeasible (no x satisfies the constraints)
• p⋆ = −∞ if problem is unbounded below

Optimal and locally optimal points

x is feasible if x ∈ dom f_0 and it satisfies the constraints

a feasible x is optimal if f_0(x) = p⋆; X_opt is the set of optimal points

x is locally optimal if there is an R > 0 such that x is optimal for

    minimize (over z)  f_0(z)
    subject to         f_i(z) ≤ 0, i = 1,...,m,  h_i(z) = 0, i = 1,...,p
                       ‖z − x‖₂ ≤ R

examples (with n = 1, m = p = 0)
• f_0(x) = 1/x, dom f_0 = R_++: p⋆ = 0, no optimal point
• f_0(x) = −log x, dom f_0 = R_++: p⋆ = −∞
• f_0(x) = x log x, dom f_0 = R_++: p⋆ = −1/e, x = 1/e is optimal
• f_0(x) = x³ − 3x: p⋆ = −∞, local optimum at x = 1

Implicit constraints

the standard form optimization problem has an implicit constraint

    x ∈ D = ∩_{i=0}^m dom f_i ∩ ∩_{i=1}^p dom h_i

• we call D the domain of the problem
• the constraints f_i(x) ≤ 0, h_i(x) = 0 are the explicit constraints
• a problem is unconstrained if it has no explicit constraints (m = p = 0)

example:

    minimize  f_0(x) = −∑_{i=1}^k log(b_i − a_iᵀx)

is an unconstrained problem with implicit constraints a_iᵀx < b_i
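One of the scalar examples above, f_0(x) = x log x on R_++ with p⋆ = −1/e attained at x = 1/e, can be verified by a quick numerical spot-check (setting f_0′(x) = log x + 1 = 0 gives x = 1/e analytically; the grid search below just confirms it).

```python
import math

# Spot-check: f_0(x) = x log x on R_++ has optimal value p* = -1/e
# attained at x = 1/e (from f_0'(x) = log x + 1 = 0).
def f0(x):
    return x * math.log(x)

# crude grid search over (0, 1]; fine for a 1-D sanity check
xs = [k / 100000.0 for k in range(1, 100001)]
p_star = min(f0(x) for x in xs)

print(p_star)          # close to -1/e = -0.367879...
print(-1.0 / math.e)
```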
Feasibility problem

    find        x
    subject to  f_i(x) ≤ 0,  i = 1,...,m
                h_i(x) = 0,  i = 1,...,p

can be considered a special case of the general problem with f_0(x) = 0:

    minimize    0
    subject to  f_i(x) ≤ 0,  i = 1,...,m
                h_i(x) = 0,  i = 1,...,p

• p⋆ = 0 if constraints are feasible; any feasible x is optimal
• p⋆ = ∞ if constraints are infeasible

Convex optimization problem

standard form convex optimization problem

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1,...,m
                a_iᵀx = b_i,  i = 1,...,p

• f_0, f_1, ..., f_m are convex; equality constraints are affine
• problem is quasiconvex if f_0 is quasiconvex (and f_1, ..., f_m convex)

often written as

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1,...,m
                Ax = b

important property: feasible set of a convex optimization problem is convex

example

    minimize    f_0(x) = x_1² + x_2²
    subject to  f_1(x) = x_1/(1 + x_2²) ≤ 0
                h_1(x) = (x_1 + x_2)² = 0

• f_0 is convex; feasible set {(x_1, x_2) | x_1 = −x_2 ≤ 0} is convex
• not a convex problem (according to our definition): f_1 is not convex, h_1 is not affine
• equivalent (but not identical) to the convex problem

    minimize    x_1² + x_2²
    subject to  x_1 ≤ 0
                x_1 + x_2 = 0

Local and global optima

any locally optimal point of a convex problem is (globally) optimal

proof: suppose x is locally optimal, but there exists a feasible y with f_0(y) < f_0(x)

x locally optimal means there is an R > 0 such that

    z feasible, ‖z − x‖₂ ≤ R  ⟹  f_0(z) ≥ f_0(x)

consider z = θy + (1−θ)x with θ = R/(2‖y − x‖₂)

• ‖y − x‖₂ > R, so 0 < θ < 1/2
• z is a convex combination of two feasible points, hence also feasible
• ‖z − x‖₂ = R/2 and

    f_0(z) ≤ θf_0(y) + (1−θ)f_0(x) < f_0(x)

which contradicts our assumption that x is locally optimal

Optimality criterion for differentiable f_0

x is optimal if and only if it is feasible and

    ∇f_0(x)ᵀ(y − x) ≥ 0   for all feasible y

if nonzero, ∇f_0(x) defines a supporting hyperplane to the feasible set X at x

• unconstrained problem: x is optimal if and only if

    x ∈ dom f_0,   ∇f_0(x) = 0

• equality constrained problem

    minimize f_0(x)   subject to Ax = b

  x is optimal if and only if there exists a ν such that

    x ∈ dom f_0,   Ax = b,   ∇f_0(x) + Aᵀν = 0

• minimization over nonnegative orthant

    minimize f_0(x)   subject to x ⪰ 0

  x is optimal if and only if

    x ∈ dom f_0,  x ⪰ 0,  ∇f_0(x)_i ≥ 0 if x_i = 0,  ∇f_0(x)_i = 0 if x_i > 0

Equivalent convex problems

two problems are (informally) equivalent if the solution of one is readily obtained from the solution of the other, and vice-versa

some common transformations that preserve convexity:

• eliminating equality constraints

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1,...,m
                Ax = b

  is equivalent to

    minimize (over z)  f_0(Fz + x_0)
    subject to         f_i(Fz + x_0) ≤ 0,  i = 1,...,m

  where F and x_0 are such that Ax = b ⟺ x = Fz + x_0 for some z

• introducing equality constraints

    minimize    f_0(A_0 x + b_0)
    subject to  f_i(A_i x + b_i) ≤ 0,  i = 1,...,m

  is equivalent to

    minimize (over x, y_i)  f_0(y_0)
    subject to              f_i(y_i) ≤ 0,  i = 1,...,m
                            y_i = A_i x + b_i,  i = 0, 1, ..., m

• introducing slack variables for linear inequalities

    minimize    f_0(x)
    subject to  a_iᵀx ≤ b_i,  i = 1,...,m

  is equivalent to

    minimize (over x, s)  f_0(x)
    subject to            a_iᵀx + s_i = b_i,  i = 1,...,m
                          s_i ≥ 0,  i = 1,...,m

• epigraph form: standard form convex problem is equivalent to

    minimize (over x, t)  t
    subject to            f_0(x) − t ≤ 0
                          f_i(x) ≤ 0,  i = 1,...,m
                          Ax = b

• minimizing over some variables

    minimize    f_0(x_1, x_2)
    subject to  f_i(x_1) ≤ 0,  i = 1,...,m

  is equivalent to

    minimize    f̃_0(x_1)
    subject to  f_i(x_1) ≤ 0,  i = 1,...,m

  where f̃_0(x_1) = inf_{x_2} f_0(x_1, x_2)
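The epigraph transformation is mechanical enough to code directly. The sketch below converts a piecewise-linear minimization, minimize max_i (a_iᵀx + b_i), into LP data of the form minimize cᵀv subject to Gv ⪯ h with v = (x, t); the coefficients a_i, b_i are made up for illustration.

```python
# Epigraph trick: minimize max_i (a_i^T x + b_i)  becomes
#   minimize t  s.t.  a_i^T x - t <= -b_i
# with variable v = (x, t). Build the LP data (c, G, h).
a = [[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]]   # a_i in R^2 (made up)
b = [0.0, 1.0, -0.5]
n = 2

# objective picks out the epigraph variable t
c = [0.0] * n + [1.0]

# one constraint row per affine piece: (a_i, -1) v <= -b_i
G = [ai + [-1.0] for ai in a]
h = [-bi for bi in b]

print(c)  # [0.0, 0.0, 1.0]
print(G)  # each row is (a_i, -1)
print(h)  # the right-hand sides -b_i
```

Any LP solver accepting (c, G, h) in this inequality form could then be called on the result; the construction itself is solver-agnostic.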
1,...,m i Ax =b n with f :R →R quasiconvex, f , . . . , f convex 0 1 m can have locally optimal points that are not (globally) optimal (x,f (x)) 0 Convex optimization problems 4–14convex representation of sublevel sets of f 0 if f is quasiconvex, there exists a family of functions φ such that: 0 t • φ (x) is convex in x for fixed t t • tsublevel set of f is 0sublevel set of φ , i.e., 0 t f (x)≤t ⇐⇒ φ (x)≤ 0 0 t example p(x) f (x) = 0 q(x) with p convex, q concave, and p(x)≥ 0, q(x) 0 on domf 0 can take φ (x) =p(x)−tq(x): t • for t≥ 0, φ convex in x t • p(x)/q(x)≤t if and only if φ (x)≤ 0 t Convex optimization problems 4–15quasiconvex optimization via convex feasibility problems φ (x)≤ 0, f (x)≤ 0, i = 1,...,m, Ax =b (1) t i • for fixed t, a convex feasibility problem in x ⋆ ⋆ • if feasible, we can conclude that t≥p ; if infeasible, t≤p Bisection method for quasiconvex optimization ⋆ ⋆ given l≤ p , u≥ p , tolerance ǫ 0. repeat 1. t := (l+u)/2. 2. Solve the convex feasibility problem (1). 3. if (1) is feasible, u :=t; else l :=t. until u−l≤ ǫ. requires exactly⌈log ((u−l)/ǫ)⌉ iterations (where u, l are initial values) 2 Convex optimization problems 4–16Linear program (LP) T minimize c x+d subject to Gxh Ax =b • convex problem with affine objective and constraint functions • feasible set is a polyhedron −c ⋆ x P Convex optimization problems 4–17Examples diet problem: choose quantities x , . . . 
Examples

diet problem: choose quantities x_1, …, x_n of n foods

• one unit of food j costs c_j, contains amount a_ij of nutrient i
• healthy diet requires nutrient i in quantity at least b_i

to find cheapest healthy diet,

minimize c^T x subject to Ax ⪰ b, x ⪰ 0

piecewise-linear minimization

minimize max_{i=1,…,m} (a_i^T x + b_i)

equivalent to an LP

minimize t subject to a_i^T x + b_i ≤ t, i = 1, …, m

Convex optimization problems 4–18

Chebyshev center of a polyhedron

Chebyshev center of P = {x | a_i^T x ≤ b_i, i = 1, …, m} is x_cheb, the center of the largest inscribed ball

B = {x_c + u | ‖u‖₂ ≤ r}

• a_i^T x ≤ b_i for all x ∈ B if and only if

sup{a_i^T (x_c + u) | ‖u‖₂ ≤ r} = a_i^T x_c + r‖a_i‖₂ ≤ b_i

• hence, x_c, r can be determined by solving the LP

maximize r subject to a_i^T x_c + r‖a_i‖₂ ≤ b_i, i = 1, …, m

Convex optimization problems 4–19

Linear-fractional program

minimize f_0(x) subject to Gx ⪯ h; Ax = b

linear-fractional program

f_0(x) = (c^T x + d)/(e^T x + f), dom f_0 = {x | e^T x + f > 0}

• a quasiconvex optimization problem; can be solved by bisection
• also equivalent to the LP (variables y, z)

minimize c^T y + dz subject to Gy ⪯ hz; Ay = bz; e^T y + fz = 1; z ≥ 0

Convex optimization problems 4–20

generalized linear-fractional program

f_0(x) = max_{i=1,…,r} (c_i^T x + d_i)/(e_i^T x + f_i), dom f_0 = {x | e_i^T x + f_i > 0, i = 1, …, r}

a quasiconvex optimization problem; can be solved by bisection

example: Von Neumann model of a growing economy

maximize (over x, x⁺) min_{i=1,…,n} x_i⁺/x_i subject to x⁺ ⪰ 0, Bx⁺ ⪯ Ax

• x, x⁺ ∈ R^n: activity levels of n sectors, in current and next period
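As a sanity check of the Chebyshev-center LP above, the center of a small polyhedron can be approximated by brute force over a grid. A minimal sketch, with the triangle {x ≥ 0, y ≥ 0, x + y ≤ 1} and the grid resolution as illustrative choices:

```python
import math

# Chebyshev center by grid search: maximize the inscribed-ball radius
# r(x) = min_i (b_i - a_i^T x) / ||a_i||_2 over candidate centers x
constraints = [((-1.0, 0.0), 0.0),   # -x <= 0
               ((0.0, -1.0), 0.0),   # -y <= 0
               ((1.0, 1.0), 1.0)]    # x + y <= 1

def radius(x):
    return min((b - a[0] * x[0] - a[1] * x[1]) / math.hypot(*a)
               for a, b in constraints)

best = max(((i / 400, j / 400) for i in range(401) for j in range(401)),
           key=radius)
r = radius(best)
# exact inradius of this right triangle is (2 - sqrt(2))/2, about 0.2929,
# attained at the symmetric point (r, r)
```

An LP solver would return the same point directly from the formulation maximize r s.t. a_i^T x_c + r‖a_i‖₂ ≤ b_i.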
• (Ax)_i, (Bx⁺)_i: produced, resp. consumed, amounts of good i
• x_i⁺/x_i: growth rate of sector i

allocate activity to maximize growth rate of slowest growing sector

Convex optimization problems 4–21

Quadratic program (QP)

minimize (1/2)x^T P x + q^T x + r subject to Gx ⪯ h; Ax = b

• P ∈ S^n_+, so objective is convex quadratic
• minimize a convex quadratic function over a polyhedron

[figure: polyhedron P with −∇f_0(x⋆) at the optimal point x⋆]

Convex optimization problems 4–22

Examples

least-squares

minimize ‖Ax − b‖₂²

• analytical solution x⋆ = A†b (A† is pseudo-inverse)
• can add linear constraints, e.g., l ⪯ x ⪯ u

linear program with random cost

minimize c̄^T x + γ x^T Σ x = E c^T x + γ var(c^T x) subject to Gx ⪯ h, Ax = b

• c is random vector with mean c̄ and covariance Σ
• hence, c^T x is random variable with mean c̄^T x and variance x^T Σ x
• γ > 0 is risk aversion parameter; controls the trade-off between expected cost and variance (risk)

Convex optimization problems 4–23

Quadratically constrained quadratic program (QCQP)

minimize (1/2)x^T P_0 x + q_0^T x + r_0
subject to (1/2)x^T P_i x + q_i^T x + r_i ≤ 0, i = 1, …, m; Ax = b

• P_i ∈ S^n_+; objective and constraints are convex quadratic
• if P_1, …, P_m ∈ S^n_++, feasible region is intersection of m ellipsoids and an affine set

Convex optimization problems 4–24

Second-order cone programming

minimize f^T x subject to ‖A_i x + b_i‖₂ ≤ c_i^T x + d_i, i = 1, …, m; Fx = g

(A_i ∈ R^{n_i×n}, F ∈ R^{p×n})

• inequalities are called second-order cone (SOC) constraints:

(A_i x + b_i, c_i^T x + d_i) ∈ second-order cone in R^{n_i+1}

• for n_i = 0, reduces to an LP; if c_i = 0, reduces to a QCQP
• more general than QCQP and LP

Convex optimization problems 4–25
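The least-squares example above has the closed-form solution x⋆ = (A^T A)^{−1} A^T b when A has full rank. A minimal sketch for a two-column A (fitting a line b ≈ x_0 + x_1 t), with the data an illustrative choice:

```python
def lstsq_2col(A, b):
    # solve the normal equations A^T A x = A^T b for an m-by-2 matrix A,
    # assuming A has full column rank (2x2 system solved by Cramer's rule)
    g00 = sum(r[0] * r[0] for r in A)
    g01 = sum(r[0] * r[1] for r in A)
    g11 = sum(r[1] * r[1] for r in A)
    h0 = sum(r[0] * bi for r, bi in zip(A, b))
    h1 = sum(r[1] * bi for r, bi in zip(A, b))
    det = g00 * g11 - g01 * g01
    return ((g11 * h0 - g01 * h1) / det, (g00 * h1 - g01 * h0) / det)

# noiseless data b = 2 + 3t should be recovered exactly
ts = [0.0, 1.0, 2.0, 3.0]
A = [(1.0, t) for t in ts]
b = [2 + 3 * t for t in ts]
x = lstsq_2col(A, b)
```

Adding linear constraints such as l ⪯ x ⪯ u destroys this closed form and turns the problem into the QP of the previous slide.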
Robust linear programming

the parameters in optimization problems are often uncertain, e.g., in an LP

minimize c^T x subject to a_i^T x ≤ b_i, i = 1, …, m,

there can be uncertainty in c, a_i, b_i

two common approaches to handling uncertainty (in a_i, for simplicity)

• deterministic model: constraints must hold for all a_i ∈ E_i

minimize c^T x subject to a_i^T x ≤ b_i for all a_i ∈ E_i, i = 1, …, m

• stochastic model: a_i is random variable; constraints must hold with probability η

minimize c^T x subject to prob(a_i^T x ≤ b_i) ≥ η, i = 1, …, m

Convex optimization problems 4–26

deterministic approach via SOCP

• choose an ellipsoid as E_i:

E_i = {ā_i + P_i u | ‖u‖₂ ≤ 1}  (ā_i ∈ R^n, P_i ∈ R^{n×n})

center is ā_i, semi-axes determined by singular values/vectors of P_i

• robust LP

minimize c^T x subject to a_i^T x ≤ b_i ∀ a_i ∈ E_i, i = 1, …, m

is equivalent to the SOCP

minimize c^T x subject to ā_i^T x + ‖P_i^T x‖₂ ≤ b_i, i = 1, …, m

(follows from sup_{‖u‖₂≤1} (ā_i + P_i u)^T x = ā_i^T x + ‖P_i^T x‖₂)

Convex optimization problems 4–27

stochastic approach via SOCP

• assume a_i is Gaussian with mean ā_i, covariance Σ_i (a_i ∼ N(ā_i, Σ_i))
• a_i^T x is Gaussian r.v. with mean ā_i^T x, variance x^T Σ_i x; hence

prob(a_i^T x ≤ b_i) = Φ((b_i − ā_i^T x)/‖Σ_i^{1/2} x‖₂)

where Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt is CDF of N(0,1)

• robust LP

minimize c^T x subject to prob(a_i^T x ≤ b_i) ≥ η, i = 1, …, m,

with η ≥ 1/2, is equivalent to the SOCP

minimize c^T x subject to ā_i^T x + Φ^{−1}(η)‖Σ_i^{1/2} x‖₂ ≤ b_i, i = 1, …, m

Convex optimization problems 4–28

Geometric programming

monomial function

f(x) = c x_1^{a_1} x_2^{a_2} ⋯ x_n^{a_n}, dom f = R^n_++

with c > 0; exponents a_i can be any real numbers

posynomial function: sum of monomials

f(x) = Σ_{k=1}^K c_k x_1^{a_1k} x_2^{a_2k} ⋯ x_n^{a_nk}, dom f = R^n_++

geometric program (GP)

minimize f_0(x) subject to f_i(x) ≤ 1, i = 1, …, m; h_i(x) = 1, i = 1, …, p

with f_i posynomial, h_i monomial

Convex optimization problems 4–29
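The key identity behind the deterministic SOCP reformulation above, sup over the ellipsoid of (ā + Pu)^T x = ā^T x + ‖P^T x‖₂, can be checked by sampling u on the unit circle. A small sketch; the numbers are illustrative:

```python
import math

# worst-case value of a^T x over the ellipsoid {a_bar + P u : ||u||_2 <= 1},
# closed form: a_bar^T x + ||P^T x||_2  (the robust-LP-to-SOCP step)
def worst_case(a_bar, P, x):
    Ptx = (P[0][0] * x[0] + P[1][0] * x[1],
           P[0][1] * x[0] + P[1][1] * x[1])
    return a_bar[0] * x[0] + a_bar[1] * x[1] + math.hypot(*Ptx)

a_bar, P, x = (1.0, 2.0), ((0.5, 0.1), (0.2, 0.3)), (1.0, -1.0)

# brute-force check: maximize over u = (cos t, sin t) on the unit circle
sampled = max((a_bar[0] + P[0][0] * math.cos(t) + P[0][1] * math.sin(t)) * x[0] +
              (a_bar[1] + P[1][0] * math.cos(t) + P[1][1] * math.sin(t)) * x[1]
              for t in (2 * math.pi * k / 10000 for k in range(10000)))
```

Because the supremum has this affine-plus-norm form in x, each robust constraint stays convex, which is exactly why the robust LP is an SOCP.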
Geometric program in convex form

change variables to y_i = log x_i, and take logarithm of cost, constraints

• monomial f(x) = c x_1^{a_1} ⋯ x_n^{a_n} transforms to

log f(e^{y_1}, …, e^{y_n}) = a^T y + b  (b = log c)

• posynomial f(x) = Σ_{k=1}^K c_k x_1^{a_1k} ⋯ x_n^{a_nk} transforms to

log f(e^{y_1}, …, e^{y_n}) = log( Σ_{k=1}^K e^{a_k^T y + b_k} )  (b_k = log c_k)

• geometric program transforms to convex problem

minimize log( Σ_{k=1}^K exp(a_0k^T y + b_0k) )
subject to log( Σ_{k=1}^K exp(a_ik^T y + b_ik) ) ≤ 0, i = 1, …, m; Gy + d = 0

Convex optimization problems 4–30

Design of cantilever beam

[figure: beam of N segments (segment 1 at the right end), vertical force F applied at the right end]

• N segments with unit lengths, rectangular cross-sections of size w_i × h_i
• given vertical force F applied at the right end

design problem

minimize total weight
subject to upper & lower bounds on w_i, h_i
upper & lower bounds on aspect ratios h_i/w_i
upper bound on stress in each segment
upper bound on vertical deflection at the end of the beam

variables: w_i, h_i for i = 1, …, N

Convex optimization problems 4–31

objective and constraint functions

• total weight w_1 h_1 + ⋯ + w_N h_N is posynomial
• aspect ratio h_i/w_i and inverse aspect ratio w_i/h_i are monomials
• maximum stress in segment i is given by 6iF/(w_i h_i²), a monomial
• the vertical deflection y_i and slope v_i of central axis at the right end of segment i are defined recursively as

v_i = 12(i − 1/2) F/(E w_i h_i³) + v_{i+1}
y_i = 6(i − 1/3) F/(E w_i h_i³) + v_{i+1} + y_{i+1}

for i = N, N−1, …, 1, with v_{N+1} = y_{N+1} = 0 (E is Young's modulus)

v_i and y_i are posynomial functions of w, h

Convex optimization problems 4–32

formulation as a GP

minimize w_1 h_1 + ⋯ + w_N h_N
subject to w_max^{−1} w_i ≤ 1, w_min w_i^{−1} ≤ 1, i = 1, …, N
h_max^{−1} h_i ≤ 1, h_min h_i^{−1} ≤ 1, i = 1, …, N
S_max^{−1} w_i^{−1} h_i ≤ 1, S_min w_i h_i^{−1} ≤ 1, i = 1, …, N
6iF σ_max^{−1} w_i^{−1} h_i^{−2} ≤ 1, i = 1, …, N
y_max^{−1} y_1 ≤ 1

note

• we write w_min ≤ w_i ≤ w_max and h_min ≤ h_i ≤ h_max as

w_min/w_i ≤ 1, w_i/w_max ≤ 1, h_min/h_i ≤ 1, h_i/h_max ≤ 1

• we write S_min ≤ h_i/w_i ≤ S_max as

S_min w_i/h_i ≤ 1, h_i/(w_i S_max) ≤ 1

Convex optimization problems 4–33

Minimizing spectral radius of nonnegative matrix

Perron–Frobenius eigenvalue λ_pf(A)

• exists for (elementwise) positive A ∈ R^{n×n}
• a real, positive eigenvalue of A, equal to spectral radius max_i |λ_i(A)|
• determines asymptotic growth (decay) rate of A^k: A^k ∼ λ_pf^k as k → ∞
• alternative characterization: λ_pf(A) = inf{λ | Av ⪯ λv for some v ≻ 0}
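The heart of the GP-to-convex transformation above is that a monomial becomes affine under the substitution x = e^y: log f(e^{y_1}, …, e^{y_n}) = a^T y + log c. A quick numerical check, with arbitrary illustrative values of c, a, y:

```python
import math

# monomial f(x) = c * x1^a1 * x2^a2 on the positive orthant
def monomial(c, a, x):
    val = c
    for ai, xi in zip(a, x):
        val *= xi ** ai
    return val

c, a, y = 2.5, (1.5, -0.7), (0.3, -1.2)
x = [math.exp(v) for v in y]            # change of variables x = exp(y)
lhs = math.log(monomial(c, a, x))       # log of the monomial
rhs = a[0] * y[0] + a[1] * y[1] + math.log(c)   # affine function of y
```

The same substitution sends a posynomial (a sum of monomials) to a log-sum-exp of affine functions, which is convex; that is why the transformed GP is a convex problem.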
minimizing spectral radius of matrix of posynomials

• minimize λ_pf(A(x)), where the elements A(x)_ij are posynomials of x
• equivalent geometric program:

minimize λ subject to Σ_{j=1}^n A(x)_ij v_j/(λ v_i) ≤ 1, i = 1, …, n

variables λ, v, x

Convex optimization problems 4–34

Generalized inequality constraints

convex problem with generalized inequality constraints

minimize f_0(x) subject to f_i(x) ⪯_{K_i} 0, i = 1, …, m; Ax = b

• f_0 : R^n → R convex; f_i : R^n → R^{k_i} convex w.r.t. proper cone K_i
• same properties as standard convex problem (convex feasible set, local optimum is global, etc.)

conic form problem: special case with affine objective and constraints

minimize c^T x subject to Fx + g ⪯_K 0; Ax = b

extends linear programming (K = R^m_+) to nonpolyhedral cones

Convex optimization problems 4–35

Semidefinite program (SDP)

minimize c^T x subject to x_1 F_1 + x_2 F_2 + ⋯ + x_n F_n + G ⪯ 0; Ax = b

with F_i, G ∈ S^k

• inequality constraint is called linear matrix inequality (LMI)
• includes problems with multiple LMI constraints: for example,

x_1 F̂_1 + ⋯ + x_n F̂_n + Ĝ ⪯ 0, x_1 F̃_1 + ⋯ + x_n F̃_n + G̃ ⪯ 0

is equivalent to the single LMI

x_1 [F̂_1 0; 0 F̃_1] + x_2 [F̂_2 0; 0 F̃_2] + ⋯ + x_n [F̂_n 0; 0 F̃_n] + [Ĝ 0; 0 G̃] ⪯ 0

Convex optimization problems 4–36

LP and SOCP as SDP

LP and equivalent SDP

LP: minimize c^T x subject to Ax ⪯ b
SDP: minimize c^T x subject to diag(Ax − b) ⪯ 0

(note different interpretation of generalized inequality ⪯)

SOCP and equivalent SDP

SOCP: minimize f^T x subject to ‖A_i x + b_i‖₂ ≤ c_i^T x + d_i, i = 1, …, m
SDP: minimize f^T x subject to [(c_i^T x + d_i)I, A_i x + b_i; (A_i x + b_i)^T, c_i^T x + d_i] ⪰ 0, i = 1, …, m

Convex optimization problems 4–37

Eigenvalue minimization

minimize λ_max(A(x))

where A(x) = A_0 + x_1 A_1 + ⋯ + x_n A_n (with given A_i ∈ S^k)

equivalent SDP

minimize t subject to A(x) ⪯ tI

• variables x ∈ R^n, t ∈ R
• follows from λ_max(A) ≤ t ⟺ A ⪯ tI

Convex optimization problems 4–38
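The equivalence λ_max(A) ≤ t ⟺ tI − A ⪰ 0 behind the SDP above can be checked by hand in the 2×2 symmetric case, where both the largest eigenvalue and positive semidefiniteness (nonnegative trace and determinant) have closed forms. The matrix entries below are illustrative:

```python
import math

# largest eigenvalue of the symmetric 2x2 matrix A = [[p, q], [q, r]]
def lam_max(p, q, r):
    return (p + r) / 2 + math.sqrt(((p - r) / 2) ** 2 + q ** 2)

# tI - A is positive semidefinite iff its trace and determinant are >= 0
# (valid test for symmetric 2x2 matrices)
def tI_minus_A_psd(t, p, q, r):
    tr = (t - p) + (t - r)
    det = (t - p) * (t - r) - q * q
    return tr >= 0 and det >= 0

p, q, r = 2.0, 1.0, 0.0
lm = lam_max(p, q, r)     # equals 1 + sqrt(2) for these entries
```

For t just above λ_max the LMI holds, and for t below it fails, so minimizing t subject to A(x) ⪯ tI indeed minimizes λ_max(A(x)).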
Matrix norm minimization

minimize ‖A(x)‖₂ = (λ_max(A(x)^T A(x)))^{1/2}

where A(x) = A_0 + x_1 A_1 + ⋯ + x_n A_n (with given A_i ∈ R^{p×q})

equivalent SDP

minimize t subject to [tI, A(x); A(x)^T, tI] ⪰ 0

• variables x ∈ R^n, t ∈ R
• constraint follows from

‖A‖₂ ≤ t ⟺ A^T A ⪯ t²I, t ≥ 0 ⟺ [tI, A; A^T, tI] ⪰ 0

Convex optimization problems 4–39

Vector optimization

general vector optimization problem

minimize (w.r.t. K) f_0(x) subject to f_i(x) ≤ 0, i = 1, …, m; h_i(x) = 0, i = 1, …, p

vector objective f_0 : R^n → R^q, minimized w.r.t. proper cone K ⊆ R^q

convex vector optimization problem

minimize (w.r.t. K) f_0(x) subject to f_i(x) ≤ 0, i = 1, …, m; Ax = b

with f_0 K-convex, f_1, …, f_m convex

Convex optimization problems 4–40

Optimal and Pareto optimal points

set of achievable objective values

O = {f_0(x) | x feasible}

• feasible x is optimal if f_0(x) is the minimum value of O
• feasible x is Pareto optimal if f_0(x) is a minimal value of O

[figure: left, a set O whose minimum value f_0(x⋆) exists (x⋆ is optimal); right, a set O with a minimal value f_0(x^po) (x^po is Pareto optimal)]

Convex optimization problems 4–41

Multicriterion optimization

vector optimization problem with K = R^q_+

f_0(x) = (F_1(x), …, F_q(x))

• q different objectives F_i; roughly speaking we want all F_i's to be small
• feasible x⋆ is optimal if

y feasible ⟹ f_0(x⋆) ⪯ f_0(y)

if there exists an optimal point, the objectives are noncompeting

• feasible x^po is Pareto optimal if

y feasible, f_0(y) ⪯ f_0(x^po) ⟹ f_0(x^po) = f_0(y)

if there are multiple Pareto optimal values, there is a trade-off between the objectives

Convex optimization problems 4–42

Regularized least-squares

minimize (w.r.t. R²_+) (‖Ax − b‖₂², ‖x‖₂²)

[figure: achievable set O in the plane F_1(x) = ‖Ax − b‖₂², F_2(x) = ‖x‖₂²; the heavy line is formed by Pareto optimal points]

example for A ∈ R^{100×10}

Convex optimization problems 4–43
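The Pareto-optimal trade-off curve in a regularized least-squares problem can be traced by minimizing weighted sums of the two objectives. A minimal scalar sketch (a = 1, b = 1 is an illustrative instance; the weighted-sum minimizer then has the closed form x = 1/(1 + γ)):

```python
# bi-criterion problem (F1, F2) = ((x - 1)^2, x^2): for each weight gamma > 0,
# minimize F1 + gamma * F2; the minimizer is x = 1/(1 + gamma)
pareto = []
for k in range(1, 60):
    gamma = 0.1 * k
    x = 1 / (1 + gamma)
    pareto.append(((x - 1) ** 2, x ** 2))

f1, f2 = zip(*pareto)
# along the traced curve, decreasing one objective increases the other,
# and the points satisfy sqrt(F1) + sqrt(F2) = 1 (the exact Pareto frontier here)
```

Sweeping γ over (0, ∞) traces out the heavy Pareto curve in the figure; this is the scalarization idea made precise two slides later.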
Risk–return trade-off in portfolio optimization

minimize (w.r.t. R²_+) (−p̄^T x, x^T Σ x)
subject to 1^T x = 1, x ⪰ 0

• x ∈ R^n is investment portfolio; x_i is fraction invested in asset i
• p ∈ R^n is vector of relative asset price changes; modeled as a random variable with mean p̄, covariance Σ
• p̄^T x = E r is expected return; x^T Σ x = var r is return variance

[figure: mean return versus standard deviation of return, and allocation x(1), …, x(4) versus standard deviation of return]

Convex optimization problems 4–44

Scalarization

to find Pareto optimal points: choose λ ≻_{K*} 0 and solve scalar problem

minimize λ^T f_0(x) subject to f_i(x) ≤ 0, i = 1, …, m; h_i(x) = 0, i = 1, …, p

if x is optimal for scalar problem, then it is Pareto-optimal for vector optimization problem

[figure: set O with Pareto points f_0(x_1), f_0(x_3) supported by hyperplanes with normals λ_1, λ_2; the Pareto point f_0(x_2) is not obtained by scalarization]

for convex vector optimization problems, can find (almost) all Pareto optimal points by varying λ ≻_{K*} 0

Convex optimization problems 4–45

Scalarization for multicriterion problems

to find Pareto optimal points, minimize positive weighted sum

λ^T f_0(x) = λ_1 F_1(x) + ⋯ + λ_q F_q(x)

examples

• regularized least-squares problem of page 4–43

take λ = (1, γ) with γ > 0:

minimize ‖Ax − b‖₂² + γ‖x‖₂²

for fixed γ, a LS problem

[figure: trade-off curve of ‖x‖₂² versus ‖Ax − b‖₂², with the point for γ = 1 marked]

Convex optimization problems 4–46

• risk–return trade-off of page 4–44

minimize −p̄^T x + γ x^T Σ x subject to 1^T x = 1, x ⪰ 0

for fixed γ > 0, a quadratic program

Convex optimization problems 4–47

Convex Optimization — Boyd & Vandenberghe

5. Duality
• Lagrange dual problem
• weak and strong duality
• geometric interpretation
• optimality conditions
• perturbation and sensitivity analysis
• examples
• generalized inequalities

5–1

Lagrangian

standard form problem (not necessarily convex)

minimize f_0(x) subject to f_i(x) ≤ 0, i = 1, …, m; h_i(x) = 0, i = 1, …, p

variable x ∈ R^n, domain D, optimal value p⋆

Lagrangian: L : R^n × R^m × R^p → R, with dom L = D × R^m × R^p,

L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x)

• weighted sum of objective and constraint functions
• λ_i is Lagrange multiplier associated with f_i(x) ≤ 0
• ν_i is Lagrange multiplier associated with h_i(x) = 0

Duality 5–2

Lagrange dual function

Lagrange dual function: g : R^m × R^p → R,

g(λ, ν) = inf_{x∈D} L(x, λ, ν) = inf_{x∈D} ( f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x) )

g is concave, can be −∞ for some λ, ν

lower bound property: if λ ⪰ 0, then g(λ, ν) ≤ p⋆

proof: if x̃ is feasible and λ ⪰ 0, then

f_0(x̃) ≥ L(x̃, λ, ν) ≥ inf_{x∈D} L(x, λ, ν) = g(λ, ν)

minimizing over all feasible x̃ gives p⋆ ≥ g(λ, ν)

Duality 5–3

Least-norm solution of linear equations

minimize x^T x subject to Ax = b

dual function

• Lagrangian is L(x, ν) = x^T x + ν^T(Ax − b)
• to minimize L over x, set gradient equal to zero:

∇_x L(x, ν) = 2x + A^T ν = 0 ⟹ x = −(1/2)A^T ν

• plug in into L to obtain g:

g(ν) = L(−(1/2)A^T ν, ν) = −(1/4)ν^T AA^T ν − b^T ν

a concave function of ν

lower bound property: p⋆ ≥ −(1/4)ν^T AA^T ν − b^T ν for all ν

Duality 5–4

Standard form LP

minimize c^T x subject to Ax = b, x ⪰ 0

dual function

• Lagrangian is

L(x, λ, ν) = c^T x + ν^T(Ax − b) − λ^T x = −b^T ν + (c + A^T ν − λ)^T x

• L is affine in x, hence

g(λ, ν) = inf_x L(x, λ, ν) = −b^T ν if A^T ν − λ + c = 0, −∞ otherwise

g is linear on affine domain {(λ, ν) | A^T ν − λ + c = 0}, hence concave

lower bound property: p⋆ ≥ −b^T ν if A^T ν + c ⪰ 0

Duality 5–5

Equality constrained norm minimization

minimize ‖x‖ subject to Ax = b

dual function

g(ν) = inf_x (‖x‖ − ν^T Ax + b^T ν) = b^T ν if ‖A^T ν‖_* ≤ 1, −∞ otherwise

where ‖v‖_* = sup_{‖u‖≤1} u^T v is dual norm of ‖·‖

proof: follows from
inf_x (‖x‖ − y^T x) = 0 if ‖y‖_* ≤ 1, −∞ otherwise

• if ‖y‖_* ≤ 1, then ‖x‖ − y^T x ≥ 0 for all x, with equality if x = 0
• if ‖y‖_* > 1, choose x = tu where ‖u‖ ≤ 1, u^T y = ‖y‖_* > 1:

‖x‖ − y^T x = t(‖u‖ − ‖y‖_*) → −∞ as t → ∞

lower bound property: p⋆ ≥ b^T ν if ‖A^T ν‖_* ≤ 1

Duality 5–6

Two-way partitioning

minimize x^T W x subject to x_i² = 1, i = 1, …, n

• a nonconvex problem; feasible set contains 2^n discrete points
• interpretation: partition {1, …, n} in two sets; W_ij is cost of assigning i, j to the same set; −W_ij is cost of assigning to different sets

dual function

g(ν) = inf_x ( x^T W x + Σ_i ν_i(x_i² − 1) ) = inf_x x^T(W + diag(ν))x − 1^T ν
     = −1^T ν if W + diag(ν) ⪰ 0, −∞ otherwise

lower bound property: p⋆ ≥ −1^T ν if W + diag(ν) ⪰ 0

example: ν = −λ_min(W) 1 gives bound p⋆ ≥ n λ_min(W)

Duality 5–7

Lagrange dual and conjugate function

minimize f_0(x) subject to Ax ⪯ b, Cx = d

dual function

g(λ, ν) = inf_{x∈dom f_0} ( f_0(x) + (A^T λ + C^T ν)^T x − b^T λ − d^T ν )
        = −f_0^*(−A^T λ − C^T ν) − b^T λ − d^T ν

• recall definition of conjugate f^*(y) = sup_{x∈dom f} (y^T x − f(x))
• simplifies derivation of dual if conjugate of f_0 is known

example: entropy maximization

f_0(x) = Σ_{i=1}^n x_i log x_i, f_0^*(y) = Σ_{i=1}^n e^{y_i − 1}

Duality 5–8

The dual problem

Lagrange dual problem

maximize g(λ, ν) subject to λ ⪰ 0

• finds best lower bound on p⋆, obtained from Lagrange dual function
• a convex optimization problem; optimal value denoted d⋆
• λ, ν are dual feasible if λ ⪰ 0, (λ, ν) ∈ dom g
• often simplified by making implicit constraint (λ, ν) ∈ dom g explicit

example: standard form LP and its dual (page 5–5)

minimize c^T x subject to Ax = b, x ⪰ 0
maximize −b^T ν subject to A^T ν + c ⪰ 0

Duality 5–9

Weak and strong duality

weak duality: d⋆ ≤ p⋆

• always holds (for convex and nonconvex problems)
• can be used to find nontrivial lower bounds for difficult problems

for example, solving the SDP

maximize −1^T ν subject to W + diag(ν) ⪰ 0

gives a lower bound for the two-way partitioning problem on page 5–7

strong duality: d⋆ = p⋆

• does not hold in general
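The two-way partitioning bound p⋆ ≥ n λ_min(W) can be checked by brute force when n is tiny. With the illustrative 2×2 matrix below, λ_min(W) = −1, so the dual bound is −2, and here it happens to be tight:

```python
import itertools

# two-way partitioning: minimize x^T W x over x_i in {-1, +1}
W = [[0.0, 1.0], [1.0, 0.0]]     # eigenvalues +1 and -1, so lambda_min = -1

p_star = min(sum(W[i][j] * x[i] * x[j] for i in range(2) for j in range(2))
             for x in itertools.product((-1, 1), repeat=2))

# dual bound from nu = -lambda_min(W) * 1:  p* >= n * lambda_min(W)
bound = 2 * (-1.0)
```

For general W the bound need not be tight, but it is always a valid lower bound, which is the point of weak duality for this nonconvex problem.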
• (usually) holds for convex problems
• conditions that guarantee strong duality in convex problems are called constraint qualifications

Duality 5–10

Slater's constraint qualification

strong duality holds for a convex problem

minimize f_0(x) subject to f_i(x) ≤ 0, i = 1, …, m; Ax = b

if it is strictly feasible, i.e.,

∃x ∈ int D : f_i(x) < 0, i = 1, …, m, Ax = b

• also guarantees that the dual optimum is attained (if p⋆ > −∞)
• can be sharpened: e.g., can replace int D with relint D (interior relative to affine hull); linear inequalities do not need to hold with strict inequality, . . .
• there exist many other types of constraint qualifications

Duality 5–11

Inequality form LP

primal problem

minimize c^T x subject to Ax ⪯ b

dual function

g(λ) = inf_x ( (c + A^T λ)^T x − b^T λ ) = −b^T λ if A^T λ + c = 0, −∞ otherwise

dual problem

maximize −b^T λ subject to A^T λ + c = 0, λ ⪰ 0

• from Slater's condition: p⋆ = d⋆ if Ax̃ ≺ b for some x̃
• in fact, p⋆ = d⋆ except when primal and dual are infeasible

Duality 5–12

Quadratic program

primal problem (assume P ∈ S^n_++)

minimize x^T P x subject to Ax ⪯ b

dual function

g(λ) = inf_x ( x^T P x + λ^T(Ax − b) ) = −(1/4)λ^T A P^{−1} A^T λ − b^T λ

dual problem

maximize −(1/4)λ^T A P^{−1} A^T λ − b^T λ subject to λ ⪰ 0

• from Slater's condition: p⋆ = d⋆ if Ax̃ ≺ b for some x̃
• in fact, p⋆ = d⋆ always

Duality 5–13

A nonconvex problem with strong duality

minimize x^T A x + 2b^T x subject to x^T x ≤ 1

A ⋡ 0, hence nonconvex

dual function: g(λ) = inf_x ( x^T(A + λI)x + 2b^T x − λ )

• unbounded below if A + λI ⋡ 0 or if A + λI ⪰ 0 and b ∉ R(A + λI)
• minimized by x = −(A + λI)†b otherwise: g(λ) = −b^T(A + λI)†b − λ

dual problem and equivalent SDP:

maximize −b^T(A + λI)†b − λ subject to A + λI ⪰ 0, b ∈ R(A + λI)

maximize −t − λ subject to [A + λI, b; b^T, t] ⪰ 0

strong duality although primal problem is not convex (not easy to show)

Duality 5–14

Geometric interpretation

for simplicity, consider problem with one constraint f_1(x) ≤ 0

interpretation of dual function:

g(λ) = inf_{(u,t)∈G} (t + λu), where G = {(f_1(x), f_0(x)) | x ∈ D}

[figure: set G in the (u, t)-plane with optimal value p⋆, dual value d⋆, and supporting line λu + t = g(λ)]
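The QP dual above can be verified on a scalar instance. With P = 1, A = 1, b = −1 (illustrative values), the primal is minimize x² subject to x ≤ −1, so p⋆ = 1, and the dual function reduces to g(λ) = −λ²/4 + λ:

```python
# scalar QP: minimize x^2 subject to x <= -1  (P = 1, A = 1, b = -1)
# dual function: g(lambda) = -(1/4) lambda^2 * A P^{-1} A - b*lambda
#              = -(1/4) lambda^2 + lambda
def g(lam):
    return -0.25 * lam * lam + lam

p_star = 1.0                        # attained at the primal optimum x = -1

# weak duality: g(lambda) <= p* for all lambda >= 0;
# equality at lambda = 2, as guaranteed by Slater's condition here
vals = [g(0.01 * k) for k in range(500)]
```

The maximizing multiplier λ⋆ = 2 also appears later as the sensitivity of p⋆ to perturbing the constraint.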
• λu + t = g(λ) is (nonvertical) supporting hyperplane to G
• hyperplane intersects t-axis at t = g(λ)

Duality 5–15

epigraph variation: same interpretation if G is replaced with

A = {(u, t) | f_1(x) ≤ u, f_0(x) ≤ t for some x ∈ D}

[figure: epigraph-like set A with supporting line λu + t = g(λ) through (0, p⋆)]

strong duality

• holds if there is a nonvertical supporting hyperplane to A at (0, p⋆)
• for convex problem, A is convex, hence has supp. hyperplane at (0, p⋆)
• Slater's condition: if there exist (ũ, t̃) ∈ A with ũ < 0, then supporting hyperplanes at (0, p⋆) must be nonvertical

Duality 5–16

Complementary slackness

assume strong duality holds, x⋆ is primal optimal, (λ⋆, ν⋆) is dual optimal

f_0(x⋆) = g(λ⋆, ν⋆) = inf_x ( f_0(x) + Σ_{i=1}^m λ_i⋆ f_i(x) + Σ_{i=1}^p ν_i⋆ h_i(x) )
        ≤ f_0(x⋆) + Σ_{i=1}^m λ_i⋆ f_i(x⋆) + Σ_{i=1}^p ν_i⋆ h_i(x⋆)
        ≤ f_0(x⋆)

hence, the two inequalities hold with equality

• x⋆ minimizes L(x, λ⋆, ν⋆)
• λ_i⋆ f_i(x⋆) = 0 for i = 1, …, m (known as complementary slackness):

λ_i⋆ > 0 ⟹ f_i(x⋆) = 0, f_i(x⋆) < 0 ⟹ λ_i⋆ = 0

Duality 5–17

Karush–Kuhn–Tucker (KKT) conditions

the following four conditions are called KKT conditions (for a problem with differentiable f_i, h_i):

1. primal constraints: f_i(x) ≤ 0, i = 1, …, m, h_i(x) = 0, i = 1, …, p
2. dual constraints: λ ⪰ 0
3. complementary slackness: λ_i f_i(x) = 0, i = 1, …, m
4. gradient of Lagrangian with respect to x vanishes:

∇f_0(x) + Σ_{i=1}^m λ_i ∇f_i(x) + Σ_{i=1}^p ν_i ∇h_i(x) = 0

from page 5–17: if strong duality holds and x, λ, ν are optimal, then they must satisfy the KKT conditions

Duality 5–18

KKT conditions for convex problem

if x̃, λ̃, ν̃ satisfy KKT for a convex problem, then they are optimal:

• from complementary slackness: f_0(x̃) = L(x̃, λ̃, ν̃)
• from 4th condition (and convexity): g(λ̃, ν̃) = L(x̃, λ̃, ν̃)

hence, f_0(x̃) = g(λ̃, ν̃)

if Slater's condition is satisfied:

x is optimal if and only if there exist λ, ν that satisfy KKT conditions

• recall that Slater implies strong duality, and dual optimum is attained
• generalizes optimality condition ∇f_0(x) = 0 for unconstrained problem

Duality 5–19

example: water-filling (assume α_i > 0)

minimize −Σ_{i=1}^n log(x_i + α_i) subject to x ⪰ 0, 1^T x = 1

x is optimal iff x ⪰ 0, 1^T x = 1, and there exist λ ∈ R^n, ν ∈ R such that

λ ⪰ 0, λ_i x_i = 0, 1/(x_i + α_i) + λ_i = ν

• if ν < 1/α_i: λ_i = 0 and x_i = 1/ν − α_i
• if ν ≥ 1/α_i: λ_i = ν − 1/α_i and x_i = 0
• determine ν from 1^T x = Σ_{i=1}^n max{0, 1/ν − α_i} = 1

interpretation

• n patches; level of patch i is at height α_i
• flood area with unit amount of water
• resulting level is 1/ν⋆

[figure: water-filling picture — water level 1/ν⋆ above patches of height α_i, with water depth x_i⋆ over patch i]

Duality 5–20
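The KKT solution of the water-filling example above reduces to finding the water level 1/ν by bisection so the total water equals 1. A minimal sketch, with the patch heights α an illustrative choice:

```python
# water-filling: maximize sum_i log(x_i + alpha_i) s.t. x >= 0, 1^T x = 1
# KKT solution: x_i = max(0, level - alpha_i) with level = 1/nu chosen
# so that sum_i x_i = 1; the level is found by bisection
def waterfill(alpha, total=1.0):
    lo, hi = 0.0, max(alpha) + total       # bracket for the water level
    for _ in range(200):
        level = (lo + hi) / 2
        if sum(max(0.0, level - a) for a in alpha) > total:
            hi = level
        else:
            lo = level
    return [max(0.0, level - a) for a in alpha]

x = waterfill([0.5, 1.0, 2.0])
# for these alphas the level is 1.25: x = (0.75, 0.25, 0); the deepest
# patch gets the most water, and patches above the level get none
```

Complementary slackness is visible in the result: x_i = 0 exactly for the patches whose height α_i exceeds the water level.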
Perturbation and sensitivity analysis

(unperturbed) optimization problem and its dual

minimize f_0(x) subject to f_i(x) ≤ 0, i = 1, …, m; h_i(x) = 0, i = 1, …, p
maximize g(λ, ν) subject to λ ⪰ 0

perturbed problem and its dual

min. f_0(x) s.t. f_i(x) ≤ u_i, i = 1, …, m; h_i(x) = v_i, i = 1, …, p
max. g(λ, ν) − u^T λ − v^T ν s.t. λ ⪰ 0

• x is primal variable; u, v are parameters
• p⋆(u, v) is optimal value as a function of u, v
• we are interested in information about p⋆(u, v) that we can obtain from the solution of the unperturbed problem and its dual

Duality 5–21

global sensitivity result

assume strong duality holds for unperturbed problem, and that λ⋆, ν⋆ are dual optimal for unperturbed problem

apply weak duality to perturbed problem:

p⋆(u, v) ≥ g(λ⋆, ν⋆) − u^T λ⋆ − v^T ν⋆ = p⋆(0, 0) − u^T λ⋆ − v^T ν⋆

sensitivity interpretation

• if λ_i⋆ large: p⋆ increases greatly if we tighten constraint i (u_i < 0)
• if λ_i⋆ small: p⋆ does not decrease much if we loosen constraint i (u_i > 0)
• if ν_i⋆ large and positive: p⋆ increases greatly if we take v_i < 0; if ν_i⋆ large and negative: p⋆ increases greatly if we take v_i > 0
• if ν_i⋆ small and positive: p⋆ does not decrease much if we take v_i > 0; if ν_i⋆ small and negative: p⋆ does not decrease much if we take v_i < 0

Duality 5–22

local sensitivity: if (in addition) p⋆(u, v) is differentiable at (0, 0), then

λ_i⋆ = −∂p⋆(0,0)/∂u_i, ν_i⋆ = −∂p⋆(0,0)/∂v_i

proof (for λ_i⋆): from global sensitivity result,

∂p⋆(0,0)/∂u_i = lim_{t↘0} (p⋆(t e_i, 0) − p⋆(0,0))/t ≥ −λ_i⋆
∂p⋆(0,0)/∂u_i = lim_{t↗0} (p⋆(t e_i, 0) − p⋆(0,0))/t ≤ −λ_i⋆

hence, equality

[figure: p⋆(u) for a problem with one (inequality) constraint, lying above the supporting line p⋆(0) − λ⋆u]

Duality 5–23

Duality and problem reformulations

• equivalent formulations of a problem can lead to very different duals
• reformulating the primal problem can be useful when the dual is difficult to derive, or uninteresting

common reformulations

• introduce new variables and equality constraints
• make explicit constraints implicit or vice-versa
• transform objective or constraint functions, e.g., replace f_0(x) by φ(f_0(x)) with φ convex, increasing

Duality 5–24

Introducing new variables and equality constraints

minimize f_0(Ax + b)

• dual function is constant: g = inf_x L(x) = inf_x f_0(Ax + b) = p⋆
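The local sensitivity identity λ⋆ = −∂p⋆(0)/∂u can be checked numerically on the scalar QP minimize x² subject to x ≤ −1 + u (illustrative instance): there p⋆(u) = (1 − u)² for u < 1 and the dual optimal multiplier is λ⋆ = 2.

```python
# perturbed scalar QP: minimize x^2 s.t. x <= -1 + u, so p*(u) = (1 - u)^2
def p_star(u):
    return (1 - u) ** 2

h = 1e-6
dp = (p_star(h) - p_star(-h)) / (2 * h)   # central difference at u = 0

lam_star = 2.0   # dual optimal multiplier of the unperturbed problem
# local sensitivity: lambda* = -dp*/du at u = 0
```

Loosening the constraint (u > 0) lowers p⋆ at rate λ⋆, matching the interpretation bullets above.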
• we have strong duality, but dual is quite useless

reformulated problem and its dual

minimize f_0(y) subject to Ax + b − y = 0
maximize b^T ν − f_0^*(ν) subject to A^T ν = 0

dual function follows from

g(ν) = inf_{x,y} ( f_0(y) − ν^T y + ν^T Ax + b^T ν ) = −f_0^*(ν) + b^T ν if A^T ν = 0, −∞ otherwise

Duality 5–25

norm approximation problem: minimize ‖Ax − b‖

minimize ‖y‖ subject to y = Ax − b

can look up conjugate of ‖·‖, or derive dual directly

g(ν) = inf_{x,y} ( ‖y‖ + ν^T y − ν^T Ax + b^T ν )
     = b^T ν + inf_y ( ‖y‖ + ν^T y ) if A^T ν = 0, −∞ otherwise
     = b^T ν if A^T ν = 0, ‖ν‖_* ≤ 1, −∞ otherwise

(see page 5–4)

dual of norm approximation problem

maximize b^T ν subject to A^T ν = 0, ‖ν‖_* ≤ 1

Duality 5–26

Implicit constraints

LP with box constraints: primal and dual problem

minimize c^T x subject to Ax = b, −1 ⪯ x ⪯ 1
maximize −b^T ν − 1^T λ_1 − 1^T λ_2 subject to c + A^T ν + λ_1 − λ_2 = 0, λ_1 ⪰ 0, λ_2 ⪰ 0

reformulation with box constraints made implicit

minimize f_0(x) = c^T x if −1 ⪯ x ⪯ 1, ∞ otherwise
subject to Ax = b

dual function

g(ν) = inf_{−1⪯x⪯1} ( c^T x + ν^T(Ax − b) ) = −b^T ν − ‖A^T ν + c‖₁

dual problem: maximize −b^T ν − ‖A^T ν + c‖₁

Duality 5–27

Problems with generalized inequalities

minimize f_0(x) subject to f_i(x) ⪯_{K_i} 0, i = 1, …, m; h_i(x) = 0, i = 1, …, p

⪯_{K_i} is generalized inequality on R^{k_i}

definitions are parallel to scalar case:

• Lagrange multiplier for f_i(x) ⪯_{K_i} 0 is vector λ_i ∈ R^{k_i}
• Lagrangian L : R^n × R^{k_1} × ⋯ × R^{k_m} × R^p → R, is defined as

L(x, λ_1, …, λ_m, ν) = f_0(x) + Σ_{i=1}^m λ_i^T f_i(x) + Σ_{i=1}^p ν_i h_i(x)

• dual function g : R^{k_1} × ⋯ × R^{k_m} × R^p → R, is defined as

g(λ_1, …, λ_m, ν) = inf_{x∈D} L(x, λ_1, …, λ_m, ν)

Duality 5–28

lower bound property: if λ_i ⪰_{K_i*} 0, then g(λ_1, …, λ_m, ν) ≤ p⋆

proof: if x̃ is feasible and λ_i ⪰_{K_i*} 0, then

f_0(x̃) ≥ f_0(x̃) + Σ_{i=1}^m λ_i^T f_i(x̃) + Σ_{i=1}^p ν_i h_i(x̃)
       ≥ inf_{x∈D} L(x, λ_1, …, λ_m, ν) = g(λ_1, …, λ_m, ν)

minimizing over all feasible x̃ gives p⋆ ≥ g(λ_1, …, λ_m, ν)

dual problem

maximize g(λ_1, …, λ_m, ν) subject to λ_i ⪰_{K_i*} 0, i = 1, …, m
• weak duality: p⋆ ≥ d⋆ always
• strong duality: p⋆ = d⋆ for convex problem with constraint qualification (for example, Slater's: primal problem is strictly feasible)

Duality 5–29

Semidefinite program

primal SDP (F_i, G ∈ S^k)

minimize c^T x subject to x_1 F_1 + ⋯ + x_n F_n ⪯ G

• Lagrange multiplier is matrix Z ∈ S^k
• Lagrangian L(x, Z) = c^T x + tr(Z(x_1 F_1 + ⋯ + x_n F_n − G))
• dual function

g(Z) = inf_x L(x, Z) = −tr(GZ) if tr(F_i Z) + c_i = 0, i = 1, …, n, −∞ otherwise

dual SDP

maximize −tr(GZ) subject to Z ⪰ 0, tr(F_i Z) + c_i = 0, i = 1, …, n

p⋆ = d⋆ if primal SDP is strictly feasible (∃x with x_1 F_1 + ⋯ + x_n F_n ≺ G)

Duality 5–30

Convex Optimization — Boyd & Vandenberghe

6. Approximation and fitting

• norm approximation
• least-norm problems
• regularized approximation
• robust approximation

6–1

Norm approximation

minimize ‖Ax − b‖

(A ∈ R^{m×n} with m ≥ n, ‖·‖ is a norm on R^m)

interpretations of solution x⋆ = argmin_x ‖Ax − b‖:

• geometric: Ax⋆ is point in R(A) closest to b
• estimation: linear measurement model y = Ax + v; y are measurements, x is unknown, v is measurement error; given y = b, best guess of x is x⋆
• optimal design: x are design variables (input), Ax is result (output); x⋆ is design that best approximates desired result b

Approximation and fitting 6–2

examples

• least-squares approximation (‖·‖₂): solution satisfies normal equations

A^T A x = A^T b

(x⋆ = (A^T A)^{−1} A^T b if rank A = n)

• Chebyshev approximation (‖·‖_∞): can be solved as an LP

minimize t subject to −t1 ⪯ Ax − b ⪯ t1

• sum of absolute residuals approximation (‖·‖₁): can be solved as an LP

minimize 1^T y subject to −y ⪯ Ax − b ⪯ y

Approximation and fitting 6–3

Penalty function approximation

minimize φ(r_1) + ⋯ + φ(r_m) subject to r = Ax − b

(A ∈ R^{m×n}, φ : R → R is a convex penalty function)

examples

• quadratic: φ(u) = u²
• deadzone-linear with width a: φ(u) = max{0, |u| − a}
• log-barrier with limit a: φ(u) = −a² log(1 − (u/a)²) for |u| < a, ∞ otherwise

[figure: quadratic, deadzone-linear, and log-barrier penalties φ(u)]

Approximation and fitting 6–4
example (m = 100, n = 30): histogram of residuals for penalties

φ(u) = |u|, φ(u) = u², φ(u) = max{0, |u| − a}, φ(u) = −log(1 − u²)

[figure: four residual histograms (p = 1, p = 2, deadzone, log barrier)]

shape of penalty function has large effect on distribution of residuals

Approximation and fitting 6–5

Huber penalty function (with parameter M)

φ_hub(u) = u² for |u| ≤ M, M(2|u| − M) for |u| > M

linear growth for large u makes approximation less sensitive to outliers

[figure: left, Huber penalty φ_hub(u) for M = 1; right, affine function f(t) = α + βt fitted to 42 points t_i, y_i (circles) using quadratic (dashed) and Huber (solid) penalty]

Approximation and fitting 6–6

Least-norm problems

minimize ‖x‖ subject to Ax = b

(A ∈ R^{m×n} with m ≤ n, ‖·‖ is a norm on R^n)

interpretations of solution x⋆ = argmin_{Ax=b} ‖x‖:

• geometric: x⋆ is point in affine set {x | Ax = b} with minimum distance to 0
• estimation: b = Ax are (perfect) measurements of x; x⋆ is smallest ('most plausible') estimate consistent with measurements
• design: x are design variables (inputs); b are required results (outputs); x⋆ is smallest ('most efficient') design that satisfies requirements

Approximation and fitting 6–7

examples

• least-squares solution of linear equations (‖·‖₂): can be solved via optimality conditions

2x + A^T ν = 0, Ax = b

• minimum sum of absolute values (‖·‖₁): can be solved as an LP

minimize 1^T y subject to −y ⪯ x ⪯ y, Ax = b

tends to produce sparse solution x⋆

extension: least-penalty problem

minimize φ(x_1) + ⋯ + φ(x_n) subject to Ax = b

φ : R → R is convex penalty function

Approximation and fitting 6–8
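The Huber penalty above is easy to implement directly; its quadratic-then-linear shape is what tames outliers. A minimal sketch with the default M = 1:

```python
# Huber penalty with parameter M: quadratic for |u| <= M, linear beyond
def huber(u, M=1.0):
    a = abs(u)
    return a * a if a <= M else M * (2 * a - M)

# a residual of 10 costs 19 under Huber (M = 1) versus 100 under the
# quadratic penalty, so a single outlier pulls the fit far less
vals = (huber(0.5), huber(-1.0), huber(10.0))
```

Note the two pieces match in value and slope at |u| = M, so φ_hub is convex and differentiable, which keeps the penalty approximation problem convex.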
Regularized approximation

minimize (w.r.t. R²_+) (‖Ax − b‖, ‖x‖)

A ∈ R^{m×n}; norms on R^m and R^n can be different

interpretation: find good approximation Ax ≈ b with small x

• estimation: linear measurement model y = Ax + v, with prior knowledge that ‖x‖ is small
• optimal design: small x is cheaper or more efficient, or the linear model y = Ax is only valid for small x
• robust approximation: good approximation Ax ≈ b with small x is less sensitive to errors in A than good approximation with large x

Approximation and fitting 6–9

Scalarized problem

minimize ‖Ax − b‖ + γ‖x‖

• solution for γ > 0 traces out optimal trade-off curve
• other common method: minimize ‖Ax − b‖² + δ‖x‖² with δ > 0

Tikhonov regularization

minimize ‖Ax − b‖₂² + δ‖x‖₂²

can be solved as a least-squares problem

minimize ‖ [A; √δ I] x − [b; 0] ‖₂²

solution x⋆ = (A^T A + δI)^{−1} A^T b

Approximation and fitting 6–10

Optimal input design

linear dynamical system with impulse response h:

y(t) = Σ_{τ=0}^t h(τ) u(t − τ), t = 0, 1, …, N

input design problem: multicriterion problem with 3 objectives

1. tracking error with desired output y_des: J_track = Σ_{t=0}^N (y(t) − y_des(t))²
2. input magnitude: J_mag = Σ_{t=0}^N u(t)²
3. input variation: J_der = Σ_{t=0}^{N−1} (u(t+1) − u(t))²

track desired output using a small and slowly varying input signal

regularized least-squares formulation

minimize J_track + δ J_der + η J_mag

for fixed δ, η, a least-squares problem in u(0), …, u(N)

Approximation and fitting 6–11

example: 3 solutions on optimal trade-off surface

[figure: input u(t) and output y(t) for (top) δ = 0, small η; (middle) δ = 0, larger η; (bottom) large δ]

Approximation and fitting 6–12
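The Tikhonov closed form x⋆ = (A^T A + δI)^{−1} A^T b is easiest to see in the scalar case, where it reads x⋆ = ab/(a² + δ). A minimal sketch with illustrative numbers:

```python
# Tikhonov-regularized least squares, scalar case:
# minimize (a x - b)^2 + delta x^2, with solution x* = a b / (a^2 + delta)
def tikhonov_1d(a, b, delta):
    return a * b / (a * a + delta)

x0 = tikhonov_1d(2.0, 4.0, 0.0)   # delta = 0 recovers plain LS: x = b/a = 2
x1 = tikhonov_1d(2.0, 4.0, 4.0)   # delta > 0 shrinks the solution toward 0
```

Increasing δ moves the solution along the trade-off curve: larger residual ‖Ax − b‖, smaller ‖x‖.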
Signal reconstruction

minimize (w.r.t. R²_+) (‖x̂ − x_cor‖₂, φ(x̂))

• x ∈ R^n is unknown signal
• x_cor = x + v is (known) corrupted version of x, with additive noise v
• variable x̂ (reconstructed signal) is estimate of x
• φ : R^n → R is regularization function or smoothing objective

examples: quadratic smoothing, total variation smoothing:

φ_quad(x̂) = Σ_{i=1}^{n−1} (x̂_{i+1} − x̂_i)², φ_tv(x̂) = Σ_{i=1}^{n−1} |x̂_{i+1} − x̂_i|

Approximation and fitting 6–13

quadratic smoothing example

[figure: original signal x and noisy signal x_cor; three reconstructions x̂ on the trade-off curve of ‖x̂ − x_cor‖₂ versus φ_quad(x̂)]

Approximation and fitting 6–14

total variation reconstruction example

[figure: original signal x with sharp transitions and noisy signal x_cor; three reconstructions x̂ on the trade-off curve of ‖x̂ − x_cor‖₂ versus φ_quad(x̂)]

quadratic smoothing smooths out noise and sharp transitions in signal

Approximation and fitting 6–15

[figure: same signal; three reconstructions x̂ on the trade-off curve of ‖x̂ − x_cor‖₂ versus φ_tv(x̂)]

total variation smoothing preserves sharp transitions in signal

Approximation and fitting 6–16

Robust approximation

minimize ‖Ax − b‖ with uncertain A

two approaches:

• stochastic: assume A is random, minimize E‖Ax − b‖
• worst-case: set A of possible values of A, minimize sup_{A∈A} ‖Ax − b‖

tractable only in special cases (certain norms ‖·‖, distributions, sets A)

example: A(u) = A_0 + u A_1

• x_nom minimizes ‖A_0 x − b‖₂²
• x_stoch minimizes E‖A(u)x − b‖₂² with u uniform on [−1, 1]
• x_wc minimizes sup_{−1≤u≤1} ‖A(u)x − b‖₂²
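The contrast between φ_quad and φ_tv above can be seen on two clean signals with the same endpoints, a sharp step and a gradual ramp (illustrative signals):

```python
# quadratic vs total-variation smoothing objectives
def phi_quad(x):
    return sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))

def phi_tv(x):
    return sum(abs(x[i + 1] - x[i]) for i in range(len(x) - 1))

step = [0.0] * 50 + [1.0] * 50            # one sharp transition
ramp = [i / 99 for i in range(100)]        # gradual transition, same endpoints

# phi_tv charges both signals the same total variation (1), while phi_quad
# charges the step far more than the ramp -- so quadratic smoothing flattens
# sharp edges and TV smoothing preserves them
```

This is exactly the behavior seen in the reconstruction figures: quadratic smoothing blurs the jumps, total variation keeps them.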
figure shows r(u) = ‖A(u)x − b‖₂ for the three solutions

[figure: r(u) versus u on [−2, 2] for x_nom, x_stoch, x_wc]

Approximation and fitting 6–17

stochastic robust LS with A = Ā + U, U random, E U = 0, E UᵀU = P

    minimize  E‖(Ā + U)x − b‖₂²

• explicit expression for objective:

    E‖Ax − b‖₂² = E‖Āx − b + Ux‖₂²
                = ‖Āx − b‖₂² + E xᵀUᵀUx
                = ‖Āx − b‖₂² + xᵀPx

• hence, robust LS problem is equivalent to LS problem

    minimize  ‖Āx − b‖₂² + ‖P^{1/2}x‖₂²

• for P = δI, get Tikhonov regularized problem

    minimize  ‖Āx − b‖₂² + δ‖x‖₂²

Approximation and fitting 6–18

worst-case robust LS with A = {Ā + u₁A₁ + ··· + u_pA_p | ‖u‖₂ ≤ 1}

    minimize  sup_{A∈A} ‖Ax − b‖₂² = sup_{‖u‖₂≤1} ‖P(x)u + q(x)‖₂²

where P(x) = [A₁x  A₂x  ···  A_px], q(x) = Āx − b

• from page 5–14, strong duality holds between the following problems

    maximize    ‖Pu + q‖₂²            minimize    t + λ
    subject to  ‖u‖₂² ≤ 1             subject to  [ I    P    q
                                                    Pᵀ   λI   0
                                                    qᵀ   0    t ] ⪰ 0

• hence, robust LS problem is equivalent to SDP

    minimize    t + λ
    subject to  [ I        P(x)   q(x)
                  P(x)ᵀ    λI     0
                  q(x)ᵀ    0      t   ] ⪰ 0

Approximation and fitting 6–19

example: histogram of residuals

    r(u) = ‖(A₀ + u₁A₁ + u₂A₂)x − b‖₂

with u uniformly distributed on unit disk, for three values of x

[figure: frequency histograms of r(u) for x_ls, x_tik, x_rls]

• x_ls minimizes ‖A₀x − b‖₂
• x_tik minimizes ‖A₀x − b‖₂² + δ‖x‖₂² (Tikhonov solution)
• x_rls minimizes sup_{A∈A} ‖Ax − b‖₂² + ‖x‖₂²

Approximation and fitting 6–20

Convex Optimization — Boyd Vandenberghe

7.
Statistical estimation

• maximum likelihood estimation
• optimal detector design
• experiment design

7–1

Parametric distribution estimation

• distribution estimation problem: estimate probability density p(y) of a random variable from observed values
• parametric distribution estimation: choose from a family of densities p_x(y), indexed by a parameter x

maximum likelihood estimation

    maximize (over x)  log p_x(y)

• y is observed value
• l(x) = log p_x(y) is called log-likelihood function
• can add constraints x ∈ C explicitly, or define p_x(y) = 0 for x ∉ C
• a convex optimization problem if log p_x(y) is concave in x for fixed y

Statistical estimation 7–2

Linear measurements with IID noise

linear measurement model

    y_i = a_iᵀx + v_i,  i = 1,...,m

• x ∈ Rⁿ is vector of unknown parameters
• v_i is IID measurement noise, with density p(z)
• y_i is measurement: y ∈ Rᵐ has density p_x(y) = ∏_{i=1}^m p(y_i − a_iᵀx)

maximum likelihood estimate: any solution x of

    maximize  l(x) = Σ_{i=1}^m log p(y_i − a_iᵀx)

(y is observed value)

Statistical estimation 7–3

examples

• Gaussian noise N(0, σ²): p(z) = (2πσ²)^{−1/2} e^{−z²/(2σ²)},

      l(x) = −(m/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^m (a_iᵀx − y_i)²

  ML estimate is LS solution

• Laplacian noise: p(z) = (1/(2a)) e^{−|z|/a},

      l(x) = −m log(2a) − (1/a) Σ_{i=1}^m |a_iᵀx − y_i|

  ML estimate is ℓ₁-norm solution

• uniform noise on [−a, a]:

      l(x) = −m log(2a) if |a_iᵀx − y_i| ≤ a, i = 1,...,m;  −∞ otherwise

  ML estimate is any x with |a_iᵀx − y_i| ≤ a

Statistical estimation 7–4

Logistic regression

random variable y ∈ {0,1} with distribution

    p = prob(y = 1) = exp(aᵀu + b) / (1 + exp(aᵀu + b))

• a, b are parameters; u ∈ Rⁿ are (observable) explanatory variables
• estimation problem: estimate a, b from m observations (u_i, y_i)

log-likelihood function (for y₁ = ··· = y_k = 1, y_{k+1} = ··· = y_m = 0):

    l(a,b) = log( ∏_{i=1}^k exp(aᵀu_i + b)/(1 + exp(aᵀu_i + b)) · ∏_{i=k+1}^m 1/(1 + exp(aᵀu_i + b)) )
           = Σ_{i=1}^k (aᵀu_i + b) − Σ_{i=1}^m log(1 + exp(aᵀu_i + b))

concave in a, b

Statistical estimation 7–5

example (n = 1, m = 50
measurements)

[figure: prob(y = 1) versus u]

• circles show 50 points (u_i, y_i)
• solid curve is ML estimate of p = exp(au + b)/(1 + exp(au + b))

Statistical estimation 7–6

(Binary) hypothesis testing

detection (hypothesis testing) problem

given observation of a random variable X ∈ {1,...,n}, choose between:
• hypothesis 1: X was generated by distribution p = (p₁,...,p_n)
• hypothesis 2: X was generated by distribution q = (q₁,...,q_n)

randomized detector

• a nonnegative matrix T ∈ R^{2×n}, with 1ᵀT = 1ᵀ
• if we observe X = k, we choose hypothesis 1 with probability t_{1k}, hypothesis 2 with probability t_{2k}
• if all elements of T are 0 or 1, it is called a deterministic detector

Statistical estimation 7–7

detection probability matrix:

    D = [Tp  Tq] = [ 1 − P_fp   P_fn
                     P_fp       1 − P_fn ]

• P_fp is probability of selecting hypothesis 2 if X is generated by distribution 1 (false positive)
• P_fn is probability of selecting hypothesis 1 if X is generated by distribution 2 (false negative)

multicriterion formulation of detector design

minimize (w.r.t.
R²₊)  (P_fp, P_fn) = ((Tp)₂, (Tq)₁)
subject to  t_{1k} + t_{2k} = 1, k = 1,...,n
            t_{ik} ≥ 0, i = 1,2, k = 1,...,n

variable T ∈ R^{2×n}

Statistical estimation 7–8

scalarization (with weight λ > 0)

    minimize    (Tp)₂ + λ(Tq)₁
    subject to  t_{1k} + t_{2k} = 1,  t_{ik} ≥ 0,  i = 1,2,  k = 1,...,n

an LP with a simple analytical solution

    (t_{1k}, t_{2k}) = (1,0) if p_k ≥ λq_k,  (0,1) if p_k < λq_k

• a deterministic detector, given by a likelihood ratio test
• if p_k = λq_k for some k, any value 0 ≤ t_{1k} ≤ 1, t_{2k} = 1 − t_{1k} is optimal (i.e., Pareto-optimal detectors include nondeterministic detectors)

minimax detector

    minimize    max{P_fp, P_fn} = max{(Tp)₂, (Tq)₁}
    subject to  t_{1k} + t_{2k} = 1,  t_{ik} ≥ 0,  i = 1,2,  k = 1,...,n

an LP; solution is usually not deterministic

Statistical estimation 7–9

example

    P = [ 0.70  0.10
          0.20  0.10
          0.05  0.70
          0.05  0.10 ]

[figure: trade-off curve of P_fn versus P_fp]

solutions 1, 2, 3 (and endpoints) are deterministic; 4 is minimax detector

Statistical estimation 7–10

Experiment design

m linear measurements y_i = a_iᵀx + w_i, i = 1,...,m of unknown x ∈ Rⁿ

• measurement errors w_i are IID N(0,1)
• ML (least-squares) estimate is

      x̂ = ( Σ_{i=1}^m a_i a_iᵀ )⁻¹ Σ_{i=1}^m y_i a_i

• error e = x̂ − x has zero mean and covariance

      E = E eeᵀ = ( Σ_{i=1}^m a_i a_iᵀ )⁻¹

confidence ellipsoids are given by {x | (x − x̂)ᵀE⁻¹(x − x̂) ≤ β}

experiment design: choose a_i ∈ {v₁,...,v_p} (a set of possible test vectors) to make E ‘small’

Statistical estimation 7–11

vector optimization formulation

    minimize (w.r.t. Sⁿ₊)  E = ( Σ_{k=1}^p m_k v_k v_kᵀ )⁻¹
    subject to             m_k ≥ 0,  m₁ + ··· + m_p = m,  m_k ∈ Z

• variables are m_k (# vectors a_i equal to v_k)
• difficult in general, due to integer constraint

relaxed experiment design

assume m ≫ p, use λ_k = m_k/m as (continuous) real variable

    minimize (w.r.t. Sⁿ₊)  E = (1/m) ( Σ_{k=1}^p λ_k v_k v_kᵀ )⁻¹
    subject to             λ ⪰ 0,  1ᵀλ = 1

• common scalarizations: minimize log det E, tr E, λ_max(E), . . .
• can add other convex constraints, e.g., bound experiment cost cᵀλ ≤ B

Statistical estimation 7–12

D-optimal design

    minimize    log det ( Σ_{k=1}^p λ_k v_k v_kᵀ )⁻¹
    subject to  λ ⪰ 0,  1ᵀλ = 1

interpretation: minimizes volume of confidence ellipsoids

dual problem

    maximize    log det W + n log n
    subject to  v_kᵀW v_k ≤ 1,  k = 1,...,p

interpretation: {x | xᵀWx ≤ 1} is minimum volume ellipsoid centered at origin, that includes all test vectors v_k

complementary slackness: for λ, W primal and dual optimal

    λ_k (1 − v_kᵀW v_k) = 0,  k = 1,...,p

optimal experiment uses vectors v_k on boundary of ellipsoid defined by W

Statistical estimation 7–13

example (p = 20)

[figure: λ₁ = 0.5, λ₂ = 0.5; design uses two vectors, on boundary of ellipse defined by optimal W]

Statistical estimation 7–14

derivation of dual of page 7–13

first reformulate primal problem with new variable X:

    minimize    log det X⁻¹
    subject to  X = Σ_{k=1}^p λ_k v_k v_kᵀ,  λ ⪰ 0,  1ᵀλ = 1

    L(X, λ, Z, z, ν) = log det X⁻¹ + tr( Z ( X − Σ_{k=1}^p λ_k v_k v_kᵀ ) ) − zᵀλ + ν(1ᵀλ − 1)

• minimize over X by setting gradient to zero: −X⁻¹ + Z = 0
• minimum over λ_k is −∞ unless −v_kᵀZ v_k − z_k + ν = 0

dual problem

    maximize    n + log det Z − ν
    subject to  v_kᵀZ v_k ≤ ν,  k = 1,...,p

change variable W = Z/ν, and optimize over ν to get dual of page 7–13

Statistical estimation 7–15

Convex Optimization — Boyd Vandenberghe

8. Geometric problems

• extremal volume ellipsoids
• centering
• classification
• placement and facility location

8–1

Minimum volume ellipsoid around a set

Löwner-John ellipsoid of a set C: minimum volume ellipsoid E s.t. C ⊆ E

• parametrize E as E = {v | ‖Av + b‖₂ ≤ 1}; w.l.o.g.
assume A∈S 2 ++ −1 • volE is proportional to detA ; to compute minimum volume ellipsoid, −1 minimize (over A, b) logdetA subject to sup kAv +bk ≤ 1 2 v∈C convex, but evaluating the constraint can be hard (for general C) finite set C =x ,...,x : 1 m −1 minimize (over A, b) logdetA subject to kAx +bk ≤ 1, i = 1,...,m i 2 also gives L¨ownerJohn ellipsoid for polyhedron convx ,...,x 1 m Geometric problems 8–2Maximum volume inscribed ellipsoid n maximum volume ellipsoidE inside a convex set C⊆R n • parametrizeE asE =Bu+dkuk ≤ 1; w.l.o.g. assume B∈S 2 ++ • volE is proportional to detB; can computeE by solving maximize logdetB subject to sup I (Bu+d)≤ 0 C kuk ≤1 2 (where I (x) = 0 for x∈C and I (x) =∞ for x6∈C) C C convex, but evaluating the constraint can be hard (for general C) T polyhedronxa x≤b, i = 1,...,m: i i maximize logdetB T subject to kBak +a d≤b, i = 1,...,m i 2 i i T T (constraint follows from sup a (Bu+d) =kBak +a d) i 2 kuk ≤1 i i 2 Geometric problems 8–3Efficiency of ellipsoidal approximations n C⊆R convex, bounded, with nonempty interior • L¨ownerJohn ellipsoid, shrunk by a factor n, lies inside C • maximum volume inscribed ellipsoid, expanded by a factor n, covers C 2 example (for two polyhedra in R ) √ factor n can be improved to n if C is symmetric Geometric problems 8–4Centering some possible definitions of ‘center’ of a convex set C: • center of largest inscribed ball (’Chebyshev center’) for polyhedron, can be computed via linear programming (page 4–19) • center of maximum volume inscribed ellipsoid (page 8–3) xx cch heeb b x mve MVE center is invariant under affine coordinate transformations Geometric problems 8–5Analytic center of a set of inequalities the analytic center of set of convex inequalities and linear equations f (x)≤ 0, i = 1,...,m, Fx =g i is defined as the optimal point of P m minimize − log(−f (x)) i i=1 subject to Fx =g • more easily computed than MVE or Chebyshev center (see later) • not just a property of the feasible set: two 
sets of inequalities can describe the same set, but have different analytic centers

Geometric problems 8–6

analytic center of linear inequalities a_iᵀx ≤ b_i, i = 1,...,m

x_ac is minimizer of

    φ(x) = − Σ_{i=1}^m log(b_i − a_iᵀx)

inner and outer ellipsoids from analytic center:

    E_inner ⊆ {x | a_iᵀx ≤ b_i, i = 1,...,m} ⊆ E_outer

where

    E_inner = {x | (x − x_ac)ᵀ ∇²φ(x_ac) (x − x_ac) ≤ 1}
    E_outer = {x | (x − x_ac)ᵀ ∇²φ(x_ac) (x − x_ac) ≤ m(m−1)}

Geometric problems 8–7

Linear discrimination

separate two sets of points {x₁,...,x_N}, {y₁,...,y_M} by a hyperplane:

    aᵀx_i + b > 0, i = 1,...,N,    aᵀy_i + b < 0, i = 1,...,M

homogeneous in a, b, hence equivalent to

    aᵀx_i + b ≥ 1, i = 1,...,N,    aᵀy_i + b ≤ −1, i = 1,...,M

a set of linear inequalities in a, b

Geometric problems 8–8

Robust linear discrimination

(Euclidean) distance between hyperplanes

    H₁ = {z | aᵀz + b = 1},  H₂ = {z | aᵀz + b = −1}

is dist(H₁, H₂) = 2/‖a‖₂

to separate two sets of points by maximum margin,

    minimize    (1/2)‖a‖₂                                   (1)
    subject to  aᵀx_i + b ≥ 1, i = 1,...,N
                aᵀy_i + b ≤ −1, i = 1,...,M

(after squaring objective) a QP in a, b

Geometric problems 8–9

Lagrange dual of maximum margin separation problem (1)

    maximize    1ᵀλ + 1ᵀµ                                   (2)
    subject to  2‖ Σ_{i=1}^N λ_i x_i − Σ_{i=1}^M µ_i y_i ‖₂ ≤ 1
                1ᵀλ = 1ᵀµ,  λ ⪰ 0,  µ ⪰ 0

from duality, optimal value is inverse of maximum margin of separation

interpretation

• change variables to θ_i = λ_i/1ᵀλ, γ_i = µ_i/1ᵀµ, t = 1/(1ᵀλ + 1ᵀµ)
• invert objective to minimize 1/(1ᵀλ + 1ᵀµ) = t

    minimize    t
    subject to  ‖ Σ_{i=1}^N θ_i x_i − Σ_{i=1}^M γ_i y_i ‖₂ ≤ t
                θ ⪰ 0, 1ᵀθ = 1, γ ⪰ 0, 1ᵀγ = 1

optimal value is distance between convex hulls

Geometric problems 8–10

Approximate linear separation of non-separable sets

    minimize    1ᵀu + 1ᵀv
    subject to  aᵀx_i + b ≥ 1 − u_i, i = 1,...,N
                aᵀy_i + b ≤ −1 + v_i, i = 1,...,M
                u ⪰ 0, v ⪰ 0

• an LP in a, b, u, v
• at optimum, u_i = max{0, 1 − aᵀx_i − b}, v_i = max{0, 1 + aᵀy_i + b}
• can be interpreted as a heuristic for minimizing misclassified points

Geometric problems 8–11

Support vector classifier

    minimize
‖a‖₂ + γ(1ᵀu + 1ᵀv)
    subject to  aᵀx_i + b ≥ 1 − u_i, i = 1,...,N
                aᵀy_i + b ≤ −1 + v_i, i = 1,...,M
                u ⪰ 0, v ⪰ 0

produces point on trade-off curve between inverse of margin 2/‖a‖₂ and classification error, measured by total slack 1ᵀu + 1ᵀv

same example as previous page, with γ = 0.1

Geometric problems 8–12

Nonlinear discrimination

separate two sets of points by a nonlinear function:

    f(x_i) > 0, i = 1,...,N,    f(y_i) < 0, i = 1,...,M

• choose a linearly parametrized family of functions

      f(z) = θᵀF(z)

  F = (F₁,...,F_k) : Rⁿ → R^k are basis functions

• solve a set of linear inequalities in θ:

      θᵀF(x_i) ≥ 1, i = 1,...,N,    θᵀF(y_i) ≤ −1, i = 1,...,M

Geometric problems 8–13

quadratic discrimination: f(z) = zᵀPz + qᵀz + r

    x_iᵀP x_i + qᵀx_i + r ≥ 1,    y_iᵀP y_i + qᵀy_i + r ≤ −1

can add additional constraints (e.g., P ⪯ −I to separate by an ellipsoid)

polynomial discrimination: F(z) are all monomials up to a given degree

[figures: separation by ellipsoid; separation by 4th degree polynomial]

Geometric problems 8–14

Placement and facility location

• N points with coordinates x_i ∈ R² (or R³)
• some positions x_i are given; the other x_i's are variables
• for each pair of points, a cost function f_ij(x_i, x_j)

placement problem

    minimize  Σ_{i≠j} f_ij(x_i, x_j)

variables are positions of free points

interpretations

• points represent plants or warehouses; f_ij is transportation cost between facilities i and j
• points represent cells on an IC; f_ij represents wirelength

Geometric problems 8–15

example: minimize Σ_{(i,j)∈A} h(‖x_i − x_j‖₂), with 6 free points, 27 links

[figures: optimal placement for h(z) = z, h(z) = z², h(z) = z⁴; histograms of connection lengths ‖x_i − x_j‖₂]

Geometric problems 8–16

Convex Optimization — Boyd Vandenberghe

9.
Numerical linear algebra background • matrix structure and algorithm complexity • solving linear equations with factored matrices T • LU, Cholesky, LDL factorization • block elimination and the matrix inversion lemma • solving underdetermined equations 9–1Matrix structure and algorithm complexity n×n cost (execution time) of solving Ax =b with A∈R 3 • for general methods, grows as n • less if A is structured (banded, sparse, Toeplitz, . . . ) flop counts • flop (floatingpoint operation): one addition, subtraction, multiplication, or division of two floatingpoint numbers • to estimate complexity of an algorithm: express number of flops as a (polynomial) function of the problem dimensions, and simplify by keeping only the leading terms • not an accurate predictor of computation time on modern computers • useful as a rough estimate of complexity Numerical linear algebra background 9–2n vectorvector operations (x, y∈R ) T • inner product x y: 2n−1 flops (or 2n if n is large) • sum x+y, scalar multiplication αx: n flops m×n matrixvector product y =Ax with A∈R • m(2n−1) flops (or 2mn if n large) • 2N if A is sparse with N nonzero elements m×p n×p T • 2p(n+m) if A is given as A =UV , U ∈R , V ∈R m×n n×p matrixmatrix product C =AB with A∈R , B∈R • mp(2n−1) flops (or 2mnp if n large) • less if A and/or B are sparse 2 • (1/2)m(m+1)(2n−1)≈m n if m =p and C symmetric Numerical linear algebra background 9–36 Linear equations that are easy to solve diagonal matrices (a = 0 if i =j): n flops ij −1 x =A b = (b /a ,...,b /a ) 1 11 n nn 2 lower triangular (a = 0 if j i): n flops ij x := b /a 1 1 11 x := (b −a x )/a 2 2 21 1 22 x := (b −a x −a x )/a 3 3 31 1 32 2 33 . . . 
x := (b −a x −a x −···−a x )/a n n n1 1 n2 2 n,n−1 n−1 nn called forward substitution 2 upper triangular (a = 0 if j i): n flops via backward substitution ij Numerical linear algebra background 9–4−1 T orthogonal matrices: A =A 2 T • 2n flops to compute x =A b for general A T • less with structure, e.g., if A =I−2uu withkuk = 1, we can 2 T T compute x =A b =b−2(u b)u in 4n flops permutation matrices:  1 j =π i a = ij 0 otherwise where π = (π ,π ,...,π ) is a permutation of (1,2,...,n) 1 2 n • interpretation: Ax = (x ,...,x ) π π n 1 −1 T • satisfies A =A , hence cost of solving Ax =b is 0 flops example:     0 1 0 0 0 1 −1 T     A = 0 0 1 , A =A = 1 0 0 1 0 0 0 1 0 Numerical linear algebra background 9–5The factorsolve method for solving Ax =b • factor A as a product of simple matrices (usually 2 or 3): A =A A ···A 1 2 k (A diagonal, upper or lower triangular, etc) i −1 −1 −1 −1 • compute x =A b =A ···A A b by solving k ‘easy’ equations k 2 1 A x =b, A x =x , ..., A x =x 1 1 2 2 1 k k−1 cost of factorization step usually dominates cost of solve step equations with multiple righthand sides Ax =b , Ax =b , ..., Ax =b 1 1 2 2 m m cost: one factorization plus m solves Numerical linear algebra background 9–6LU factorization every nonsingular matrix A can be factored as A =PLU with P a permutation matrix, L lower triangular, U upper triangular 3 cost: (2/3)n flops Solving linear equations by LU factorization. given a set of linear equations Ax =b, with A nonsingular. 3 1. LU factorization. Factor A as A =PLU ((2/3)n flops). 2. Permutation. Solve Pz =b (0 flops). 1 2 3. Forward substitution. Solve Lz =z (n flops). 2 1 2 4. Backward substitution. Solve Ux =z (n flops). 
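The four-step recipe in the box maps directly onto SciPy's LAPACK wrappers; a small sketch (random data, illustrative sizes) showing the point of the factor-solve method, reusing one factorization for several right-hand sides:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(1)
n = 200
A = rng.standard_normal((n, n))   # generic nonsingular matrix (illustrative)

# factor once: A = P L U, about (2/3) n^3 flops
lu, piv = lu_factor(A)

# each solve (permutation + forward + backward substitution) costs ~2n^2,
# so many right-hand sides amortize the one-time factorization
for _ in range(5):
    b = rng.standard_normal(n)
    x = lu_solve((lu, piv), b)
    assert np.allclose(A @ x, b)
```

Calling `np.linalg.solve(A, b)` in the loop instead would redo the O(n³) factorization for every right-hand side.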
2 3 2 3 cost: (2/3)n +2n ≈ (2/3)n for large n Numerical linear algebra background 9–7sparse LU factorization A =P LUP 1 2 • adding permutation matrix P offers possibility of sparser L, U (hence, 2 cheaper factor and solve steps) • P and P chosen (heuristically) to yield sparse L, U 1 2 • choice of P and P depends on sparsity pattern and values of A 1 2 3 • cost is usually much less than (2/3)n ; exact value depends in a complicated way on n, number of zeros in A, sparsity pattern Numerical linear algebra background 9–8Cholesky factorization every positive definite A can be factored as T A =LL with L lower triangular 3 cost: (1/3)n flops Solving linear equations by Cholesky factorization. n given a set of linear equations Ax =b, with A∈ S . ++ T 3 1. Cholesky factorization. Factor A as A =LL ((1/3)n flops). 2 2. Forward substitution. Solve Lz =b (n flops). 1 T 2 3. Backward substitution. Solve L x =z (n flops). 1 3 2 3 cost: (1/3)n +2n ≈ (1/3)n for large n Numerical linear algebra background 9–9sparse Cholesky factorization T T A =PLL P • adding permutation matrix P offers possibility of sparser L • P chosen (heuristically) to yield sparse L • choice of P only depends on sparsity pattern of A (unlike sparse LU) 3 • cost is usually much less than (1/3)n ; exact value depends in a complicated way on n, number of zeros in A, sparsity pattern Numerical linear algebra background 9–10T LDL factorization every nonsingular symmetric matrix A can be factored as T T A =PLDL P with P a permutation matrix, L lower triangular, D block diagonal with 1×1 or 2×2 diagonal blocks 3 cost: (1/3)n T • cost of solving symmetric sets of linear equations by LDL factorization: 3 2 3 (1/3)n +2n ≈ (1/3)n for large n 3 • for sparse A, can choose P to yield sparse L; cost≪ (1/3)n Numerical linear algebra background 9–11Equations with structured subblocks      A A x b 11 12 1 1 = (1) A A x b 21 22 2 2 n n n×n 1 2 i j • variables x ∈R , x ∈R ; blocks A ∈R 1 2 ij −1 • if A is nonsingular, can 
eliminate x : x =A (b −A x ); 11 1 1 1 12 2 11 to compute x , solve 2 −1 −1 (A −A A A )x =b −A A b 22 21 12 2 2 21 1 11 11 Solving linear equations by block elimination. given a nonsingular set of linear equations (1), with A nonsingular. 11 −1 −1 1. Form A A and A b . 12 1 11 11 −1 −1 ˜ 2. Form S =A −A A A and b =b −A A b . 22 21 12 2 21 1 11 11 ˜ 3. Determine x by solving Sx =b. 2 2 4. Determine x by solving A x =b −A x . 1 11 1 1 12 2 Numerical linear algebra background 9–12dominant terms in flop count • step 1: f +n s (f is cost of factoring A ; s is cost of solve step) 2 11 −1 2 • step 2: 2n n (cost dominated by product of A and A A ) 1 21 12 2 11 3 • step 3: (2/3)n 2 2 3 total: f +n s+2n n +(2/3)n 2 1 2 2 examples 3 2 • general A (f = (2/3)n , s = 2n ): no gain over standard method 11 1 1 3 2 2 3 3 flops = (2/3)n +2n n +2n n +(2/3)n = (2/3)(n +n ) 2 1 1 2 1 1 2 2 3 • block elimination is useful for structured A (f ≪n ) 11 1 2 3 for example, diagonal (f = 0, s =n ): flops≈ 2n n +(2/3)n 1 1 2 2 Numerical linear algebra background 9–13Structured matrix plus low rank term (A+BC)x =b n×n n×p p×n • A∈R , B∈R , C∈R • assume A has structure (Ax =b easy to solve) first write as      A B x b = C −I y 0 now apply block elimination: solve −1 −1 (I +CA B)y =CA b, then solve Ax =b−By this proves the matrix inversion lemma: if A and A+BC nonsingular, −1 −1 −1 −1 −1 −1 (A+BC) =A −A B(I +CA B) CA Numerical linear algebra background 9–14example: A diagonal, B,C dense • method 1: form D =A+BC, then solve Dx =b 3 2 cost: (2/3)n +2pn • method 2 (via matrix inversion lemma): solve −1 −1 (I +CA B)y =CA b, (2) −1 −1 then compute x =A b−A By 2 3 total cost is dominated by (2): 2p n+(2/3)p (i.e., linear in n) Numerical linear algebra background 9–15Underdetermined linear equations p×n if A∈R with pn, rankA =p, n−p xAx =b =Fz +xˆz∈R • xˆ is (any) particular solution n×(n−p) • columns of F ∈R span nullspace of A • there exist several numerical methods for computing F (QR factorization, 
rectangular LU factorization, . . . ) Numerical linear algebra background 9–16Convex Optimization — Boyd Vandenberghe 10. Unconstrained minimization • terminology and assumptions • gradient descent method • steepest descent method • Newton’s method • selfconcordant functions • implementation 10–1Unconstrained minimization minimize f(x) • f convex, twice continuously differentiable (hence domf open) ⋆ • we assume optimal value p = inf f(x) is attained (and finite) x unconstrained minimization methods (k) • produce sequence of points x ∈domf, k = 0,1,... with (k) ⋆ f(x )→p • can be interpreted as iterative methods for solving optimality condition ⋆ ∇f(x ) = 0 Unconstrained minimization 10–2Initial point and sublevel set (0) algorithms in this chapter require a starting point x such that (0) • x ∈domf (0) • sublevel set S =xf(x)≤f(x ) is closed 2nd condition is hard to verify, except when all sublevel sets are closed: • equivalent to condition that epif is closed n • true if domf =R • true if f(x)→∞ as x→bddomf examples of differentiable functions with closed sublevel sets: m m X X T T f(x) = log( exp(a x+b )), f(x) =− log(b −a x) i i i i i=1 i=1 Unconstrained minimization 10–3Strong convexity and implications f is strongly convex on S if there exists an m 0 such that 2 ∇ f(x)mI for all x∈S implications • for x,y∈S, m T 2 f(y)≥f(x)+∇f(x) (y−x)+ kx−yk 2 2 hence, S is bounded ⋆ • p −∞, and for x∈S, 1 ⋆ 2 f(x)−p ≤ k∇f(x)k 2 2m useful as stopping criterion (if you know m) Unconstrained minimization 10–4Descent methods (k+1) (k) (k) (k) (k+1) (k) x =x +t Δx with f(x )f(x ) + • other notations: x =x+tΔx, x :=x+tΔx • Δx is the step, or search direction; t is the step size, or step length + T • from convexity, f(x )f(x) implies∇f(x) Δx 0 (i.e., Δx is a descent direction) General descent method. given a starting point x∈ domf. repeat 1. Determine a descent direction Δx. 2. Line search. Choose a step size t 0. 3. Update. x :=x+tΔx. until stopping criterion is satisfied. 
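The general descent method above can be sketched in a few lines of numpy, instantiated with Δx = −∇f(x) and a standard backtracking line search (detailed on the next slide), on the nonquadratic example of page 10–9; the parameter values and starting point here are illustrative:

```python
import numpy as np

def f(x):
    # example objective from page 10-9
    return (np.exp(x[0] + 3*x[1] - 0.1) + np.exp(x[0] - 3*x[1] - 0.1)
            + np.exp(-x[0] - 0.1))

def grad(x):
    e1 = np.exp(x[0] + 3*x[1] - 0.1)
    e2 = np.exp(x[0] - 3*x[1] - 0.1)
    e3 = np.exp(-x[0] - 0.1)
    return np.array([e1 + e2 - e3, 3*e1 - 3*e2])

def gradient_descent(x, alpha=0.1, beta=0.7, tol=1e-8, max_iter=1000):
    """General descent method with dx = -grad f(x), backtracking line search."""
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # stopping criterion ||grad f||_2 <= tol
            break
        dx = -g                        # 1. descent direction
        t = 1.0                        # 2. backtracking: t := beta*t until
        while f(x + t*dx) >= f(x) + alpha*t*(g @ dx):   # sufficient decrease
            t *= beta
        x = x + t*dx                   # 3. update
    return x

x_star = gradient_descent(np.array([1.0, 1.0]))
# by symmetry the minimizer has x2 = 0 and x1 = -(log 2)/2
assert abs(x_star[0] + np.log(2)/2) < 1e-4 and abs(x_star[1]) < 1e-4
```

The slow, zig-zagging behavior the slides describe shows up here too: convergence is linear, with a rate governed by the conditioning of the Hessian.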
Unconstrained minimization 10–5

Line search types

exact line search: t = argmin_{t>0} f(x + tΔx)

backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1))

• starting at t = 1, repeat t := βt until

      f(x + tΔx) < f(x) + αt∇f(x)ᵀΔx

• graphical interpretation: backtrack until t ≤ t₀

[figure: f(x + tΔx) versus t, with bounds f(x) + t∇f(x)ᵀΔx and f(x) + αt∇f(x)ᵀΔx]

Unconstrained minimization 10–6

Gradient descent method

general descent method with Δx = −∇f(x)

given a starting point x ∈ dom f.
repeat
    1. Δx := −∇f(x).
    2. Line search. Choose step size t via exact or backtracking line search.
    3. Update. x := x + tΔx.
until stopping criterion is satisfied.

• stopping criterion usually of the form ‖∇f(x)‖₂ ≤ ǫ
• convergence result: for strongly convex f,

      f(x^(k)) − p⋆ ≤ c^k (f(x^(0)) − p⋆)

  c ∈ (0,1) depends on m, x^(0), line search type
• very simple, but often very slow; rarely used in practice

Unconstrained minimization 10–7

quadratic problem in R²

    f(x) = (1/2)(x₁² + γx₂²)    (γ > 0)

with exact line search, starting at x^(0) = (γ, 1):

    x₁^(k) = γ ( (γ−1)/(γ+1) )^k,    x₂^(k) = ( −(γ−1)/(γ+1) )^k

• very slow if γ ≫ 1 or γ ≪ 1
• example for γ = 10: [figure: iterates x^(0), x^(1) in the (x₁, x₂) plane]

Unconstrained minimization 10–8

nonquadratic example

    f(x₁, x₂) = e^{x₁+3x₂−0.1} + e^{x₁−3x₂−0.1} + e^{−x₁−0.1}

[figures: iterates with backtracking line search and with exact line search]

Unconstrained minimization 10–9

a problem in R¹⁰⁰

    f(x) = cᵀx − Σ_{i=1}^{500} log(b_i − a_iᵀx)

[figure: f(x^(k)) − p⋆ (semilog scale) versus k, for exact l.s. and backtracking l.s.]

‘linear’ convergence, i.e., a straight line on a semilog plot

Unconstrained minimization 10–10

Steepest descent method

normalized steepest descent direction (at x, for norm ‖·‖):

    Δx_nsd = argmin { ∇f(x)ᵀv | ‖v‖ = 1 }

interpretation: for small v, f(x + v) ≈ f(x) + ∇f(x)ᵀv; direction Δx_nsd is unit-norm step with most negative directional derivative

(unnormalized) steepest descent direction

    Δx_sd = ‖∇f(x)‖_∗ Δx_nsd

satisfies ∇f(x)ᵀΔx_sd = −‖∇f(x)‖_∗²

steepest descent method

• general descent method with Δx = Δx_sd
• convergence properties similar to gradient descent

Unconstrained minimization 10–11

examples

• Euclidean norm: Δx_sd = −∇f(x)
• quadratic norm ‖x‖_P = (xᵀPx)^{1/2} (P ∈ Sⁿ₊₊): Δx_sd = −P⁻¹∇f(x)
• ℓ₁-norm: Δx_sd = −(∂f(x)/∂x_i) e_i, where |∂f(x)/∂x_i| = ‖∇f(x)‖_∞

[figure: unit balls and normalized steepest descent directions for a quadratic norm and the ℓ₁-norm]

Unconstrained minimization 10–12

choice of norm for steepest descent

[figures: iterates of steepest descent with backtracking line search for two quadratic norms; ellipses show {x | ‖x − x^(k)‖_P = 1}]

• equivalent interpretation of steepest descent with quadratic norm ‖·‖_P: gradient descent after change of variables x̄ = P^{1/2}x

shows choice of P has strong effect on speed of convergence

Unconstrained minimization 10–13

Newton step

    Δx_nt = −∇²f(x)⁻¹∇f(x)

interpretations

• x + Δx_nt minimizes second order approximation

      f̂(x + v) = f(x) + ∇f(x)ᵀv + (1/2)vᵀ∇²f(x)v

• x + Δx_nt solves linearized optimality condition

      ∇f(x + v) ≈ ∇f̂(x + v) = ∇f(x) + ∇²f(x)v = 0

[figures: f and f̂ through (x, f(x)) and (x + Δx_nt, f̂(x + Δx_nt)); f′ and f̂′ through (x, f′(x))]

Unconstrained minimization 10–14

• Δx_nt is steepest descent direction at x in local Hessian norm

      ‖u‖_{∇²f(x)} = ( uᵀ∇²f(x)u )^{1/2}

[figure: dashed lines are contour lines of f; ellipse is {x + v | vᵀ∇²f(x)v = 1}; arrow shows −∇f(x); points x + Δx_nsd and x + Δx_nt]

Unconstrained minimization 10–15

Newton decrement

    λ(x) = ( ∇f(x)ᵀ ∇²f(x)⁻¹
∇f(x) ⋆ a measure of the proximity of x to x properties ⋆ b • gives an estimate of f(x)−p , using quadratic approximation f: 1 2 b f(x)−inff(y) = λ(x) y 2 • equal to the norm of the Newton step in the quadratic Hessian norm  1/2 T 2 λ(x) = Δx ∇ f(x)Δx nt nt T 2 • directional derivative in the Newton direction: ∇f(x) Δx =−λ(x) nt • affine invariant (unlikek∇f(x)k ) 2 Unconstrained minimization 10–16Newton’s method given a starting point x∈ domf, tolerance ǫ 0. repeat 1. Compute the Newton step and decrement. 2 −1 2 T 2 −1 Δx :=−∇ f(x) ∇f(x); λ :=∇f(x) ∇ f(x) ∇f(x). nt 2 2. Stopping criterion. quit if λ /2≤ ǫ. 3. Line search. Choose step size t by backtracking line search. 4. Update. x :=x+tΔx . nt affine invariant, i.e., independent of linear changes of coordinates: (0) −1 (0) ˜ Newton iterates for f(y) =f(Ty) with starting point y =T x are (k) −1 (k) y =T x Unconstrained minimization 10–17Classical convergence analysis assumptions • f strongly convex on S with constant m 2 • ∇ f is Lipschitz continuous on S, with constant L 0: 2 2 k∇ f(x)−∇ f(y)k ≤Lkx−yk 2 2 (L measures how well f can be approximated by a quadratic function) 2 outline: there exist constants η∈ (0,m /L), γ 0 such that (k+1) (k) • ifk∇f(x)k ≥η, then f(x )−f(x )≤−γ 2 • ifk∇f(x)k η, then 2   2 L L (k+1) (k) k∇f(x )k ≤ k∇f(x )k 2 2 2 2 2m 2m Unconstrained minimization 10–18damped Newton phase (k∇f(x)k ≥η) 2 • most iterations require backtracking steps • function value decreases by at least γ ⋆ (0) ⋆ • if p −∞, this phase ends after at most (f(x )−p )/γ iterations quadratically convergent phase (k∇f(x)k η) 2 • all iterations use step size t = 1 (k) • k∇f(x)k converges to zero quadratically: ifk∇f(x )k η, then 2 2 l−k l−k     2 2 L L 1 l k k∇f(x )k ≤ k∇f(x )k ≤ , l≥k 2 2 2 2 2m 2m 2 Unconstrained minimization 10–19⋆ conclusion: number of iterations until f(x)−p ≤ǫ is bounded above by (0) ⋆ f(x )−p +log log (ǫ /ǫ) 0 2 2 γ (0) • γ, ǫ are constants that depend on m, L, x 0 • second term is small (of the 
order of 6) and almost constant for practical purposes
• in practice, constants m, L (hence γ, ǫ₀) are usually unknown
• provides qualitative insight in convergence properties (i.e., explains two algorithm phases)

Unconstrained minimization 10–20

Examples

example in R² (page 10–9)

[figure: f(x^(k)) − p⋆ versus k; iterates x^(0), x^(1)]

• backtracking parameters α = 0.1, β = 0.7
• converges in only 5 steps
• quadratic local convergence

Unconstrained minimization 10–21

example in R¹⁰⁰ (page 10–10)

[figures: f(x^(k)) − p⋆ versus k, and step size t^(k) versus k, for exact line search and backtracking]

• backtracking parameters α = 0.01, β = 0.5
• backtracking line search almost as fast as exact l.s. (and much simpler)
• clearly shows two phases in algorithm

Unconstrained minimization 10–22

example in R¹⁰⁰⁰⁰ (with sparse a_i)

    f(x) = − Σ_{i=1}^{10000} log(1 − x_i²) − Σ_{i=1}^{100000} log(b_i − a_iᵀx)

[figure: f(x^(k)) − p⋆ versus k]

• backtracking parameters α = 0.01, β = 0.5
• performance similar as for small examples

Unconstrained minimization 10–23

Self-concordance

shortcomings of classical convergence analysis

• depends on unknown constants (m, L, . . .
) • bound is not affinely invariant, although Newton’s method is convergence analysis via selfconcordance (Nesterov and Nemirovski) • does not depend on any unknown constants • gives affineinvariant bound • applies to special class of convex functions (‘selfconcordant’ functions) • developed to analyze polynomialtime interiorpoint methods for convex optimization Unconstrained minimization 10–24Selfconcordant functions definition ′′′ ′′ 3/2 • convex f :R→R is selfconcordant iff (x)≤ 2f (x) for all x∈domf n • f :R →R is selfconcordant if g(t) =f(x+tv) is selfconcordant for n all x∈domf, v∈R examples on R • linear and quadratic functions • negative logarithm f(x) =−logx • negative entropy plus negative logarithm: f(x) =xlogx−logx ˜ affine invariance: if f :R→R is s.c., then f(y) =f(ay+b) is s.c.: ′′′ 3 ′′′ ′′ 2 ′′ ˜ ˜ f (y) =a f (ay+b), f (y) =a f (ay+b) Unconstrained minimization 10–25Selfconcordant calculus properties • preserved under positive scaling α≥ 1, and sum • preserved under composition with affine function ′′′ ′′ • if g is convex with domg =R andg (x)≤ 3g (x)/x then ++ f(x) = log(−g(x))−logx is selfconcordant examples: properties can be used to show that the following are s.c. 
• f(x) = − Σ_{i=1}^m log(b_i − a_iᵀx) on {x | a_iᵀx < b_i, i = 1,...,m}
• f(X) = −log det X on Sⁿ₊₊
• f(x) = −log(y² − xᵀx) on {(x,y) | ‖x‖₂ < y}

Unconstrained minimization 10–26

Convergence analysis for self-concordant functions

summary: there exist constants η ∈ (0, 1/4], γ > 0 such that

• if λ(x) > η, then

      f(x^(k+1)) − f(x^(k)) ≤ −γ

• if λ(x) ≤ η, then

      2λ(x^(k+1)) ≤ ( 2λ(x^(k)) )²

(η and γ only depend on backtracking parameters α, β)

complexity bound: number of Newton iterations bounded by

    (f(x^(0)) − p⋆)/γ + log₂ log₂ (1/ǫ)

for α = 0.1, β = 0.8, ǫ = 10⁻¹⁰, bound evaluates to 375(f(x^(0)) − p⋆) + 6

Unconstrained minimization 10–27

numerical example: 150 randomly generated instances of

    minimize  f(x) = − Σ_{i=1}^m log(b_i − a_iᵀx)

[figure: iterations versus f(x^(0)) − p⋆; ◦: m = 100, n = 50; □: m = 1000, n = 500; ♦: m = 1000, n = 50]

• number of iterations much smaller than 375(f(x^(0)) − p⋆) + 6
• bound of the form c(f(x^(0)) − p⋆) + 6 with smaller c (empirically) valid

Unconstrained minimization 10–28

Implementation

main effort in each iteration: evaluate derivatives and solve Newton system

    HΔx = −g

where H = ∇²f(x), g = ∇f(x)

via Cholesky factorization

    H = LLᵀ,    Δx_nt = −L⁻ᵀL⁻¹g,    λ(x) = ‖L⁻¹g‖₂

• cost (1/3)n³ flops for unstructured system
• cost ≪ (1/3)n³ if H sparse, banded

Unconstrained minimization 10–29

example of dense Newton system with structure

    f(x) = Σ_{i=1}^n ψ_i(x_i) + ψ₀(Ax + b),    H = D + AᵀH₀A

• assume A ∈ R^{p×n}, dense, with p ≪ n
• D diagonal with diagonal elements ψ_i″(x_i); H₀ = ∇²ψ₀(Ax + b)

method 1: form H, solve via dense Cholesky factorization (cost (1/3)n³)

method 2 (page 9–15): factor H₀ = L₀L₀ᵀ; write Newton system as

    DΔx + AᵀL₀w = −g,    L₀ᵀAΔx − w = 0

eliminate Δx from first equation; compute w and Δx from

    (I + L₀ᵀAD⁻¹AᵀL₀)w = −L₀ᵀAD⁻¹g,    DΔx = −g − AᵀL₀w

cost: 2p²n (dominated by computation of L₀ᵀAD⁻¹AᵀL₀)

Unconstrained minimization 10–30

Convex Optimization — Boyd Vandenberghe

11.
Equality constrained minimization

• equality constrained minimization
• eliminating equality constraints
• Newton's method with equality constraints
• infeasible start Newton method
• implementation

11–1

Equality constrained minimization

    minimize    f(x)
    subject to  Ax = b

• f convex, twice continuously differentiable
• A ∈ R^{p×n} with rank A = p
• we assume p⋆ is finite and attained

optimality conditions: x⋆ is optimal iff there exists a ν⋆ such that

    ∇f(x⋆) + A^T ν⋆ = 0,    Ax⋆ = b

Equality constrained minimization 11–2

equality constrained quadratic minimization (with P ∈ S^n_+)

    minimize    (1/2) x^T P x + q^T x + r
    subject to  Ax = b

optimality condition:

    [ P   A^T ] [ x⋆ ]   [ −q ]
    [ A   0   ] [ ν⋆ ] = [  b ]

• coefficient matrix is called KKT matrix
• KKT matrix is nonsingular if and only if

    Ax = 0, x ≠ 0  ⟹  x^T P x > 0

• equivalent condition for nonsingularity: P + A^T A ≻ 0

Equality constrained minimization 11–3

Eliminating equality constraints

represent solution of {x | Ax = b} as

    {x | Ax = b} = {Fz + x̂ | z ∈ R^{n−p}}

• x̂ is (any) particular solution
• range of F ∈ R^{n×(n−p)} is nullspace of A (rank F = n − p and AF = 0)

reduced or eliminated problem

    minimize f(Fz + x̂)

• an unconstrained problem with variable z ∈ R^{n−p}
• from solution z⋆, obtain x⋆ and ν⋆ as

    x⋆ = Fz⋆ + x̂,    ν⋆ = −(AA^T)^{-1} A ∇f(x⋆)

Equality constrained minimization 11–4

example: optimal allocation with resource constraint

    minimize    f₁(x₁) + f₂(x₂) + ··· + f_n(x_n)
    subject to  x₁ + x₂ + ··· + x_n = b

eliminate x_n = b − x₁ − ··· − x_{n−1}, i.e., choose

    x̂ = b e_n,    F = [ I ; −1^T ] ∈ R^{n×(n−1)}

reduced problem:

    minimize f₁(x₁) + ··· + f_{n−1}(x_{n−1}) + f_n(b − x₁ − ··· − x_{n−1})

(variables x₁, ..., x_{n−1})
Equality constrained minimization 11–5

Newton step

Newton step Δx_nt of f at feasible x is given by solution v of

    [ ∇²f(x)   A^T ] [ v ]   [ −∇f(x) ]
    [ A        0   ] [ w ] = [  0     ]

interpretations

• Δx_nt solves second order approximation (with variable v)

    minimize    f̂(x+v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v
    subject to  A(x+v) = b

• Δx_nt equations follow from linearizing optimality conditions

    ∇f(x+v) + A^T w ≈ ∇f(x) + ∇²f(x) v + A^T w = 0,    A(x+v) = b

Equality constrained minimization 11–6

Newton decrement

    λ(x) = (Δx_nt^T ∇²f(x) Δx_nt)^{1/2} = (−∇f(x)^T Δx_nt)^{1/2}

properties

• gives an estimate of f(x) − p⋆ using quadratic approximation f̂:

    f(x) − inf_{Ay=b} f̂(y) = λ(x)²/2

• directional derivative in Newton direction:

    (d/dt) f(x + tΔx_nt) |_{t=0} = −λ(x)²

• in general, λ(x) ≠ (∇f(x)^T ∇²f(x)^{-1} ∇f(x))^{1/2}

Equality constrained minimization 11–7

Newton's method with equality constraints

given starting point x ∈ dom f with Ax = b, tolerance ǫ > 0.
repeat
1. Compute the Newton step and decrement Δx_nt, λ(x).
2. Stopping criterion. quit if λ²/2 ≤ ǫ.
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + tΔx_nt.
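As a concrete illustration, the Newton step above can be computed by forming and solving the KKT system directly. This is a minimal NumPy sketch, not from the slides: the function name is ours, and the dense solve stands in for the structured factorizations discussed later.

```python
import numpy as np

def eq_newton_step(hess, grad, A, x):
    """Newton step at a feasible x (Ax = b): solve
        [ H  A^T ] [ v ]   [ -g ]
        [ A  0   ] [ w ] = [  0 ]
    and return v (= dx_nt), w, and the squared Newton decrement."""
    H, g = hess(x), grad(x)
    p, n = A.shape
    K = np.block([[H, A.T], [A, np.zeros((p, p))]])
    rhs = np.concatenate([-g, np.zeros(p)])
    sol = np.linalg.solve(K, rhs)   # dense solve; exploit structure in practice
    v, w = sol[:n], sol[n:]
    lam_sq = float(-g @ v)          # lambda(x)^2 = -grad f(x)^T dx_nt
    return v, w, lam_sq
```

For a quadratic objective (as in the equality constrained quadratic minimization slide), one full Newton step from any feasible point lands exactly on the optimum.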
• a feasible descent method: x^(k) feasible and f(x^(k+1)) < f(x^(k))
• affine invariant

Equality constrained minimization 11–8

Newton's method and elimination

Newton's method for reduced problem

    minimize f̃(z) = f(Fz + x̂)

• variables z ∈ R^{n−p}
• x̂ satisfies Ax̂ = b; rank F = n − p and AF = 0
• Newton's method for f̃, started at z^(0), generates iterates z^(k)

Newton's method with equality constraints

when started at x^(0) = Fz^(0) + x̂, iterates are

    x^(k+1) = Fz^(k+1) + x̂

hence, don't need separate convergence analysis

Equality constrained minimization 11–9

Newton step at infeasible points

2nd interpretation of page 11–6 extends to infeasible x (i.e., Ax ≠ b)

linearizing optimality conditions at infeasible x (with x ∈ dom f) gives

    [ ∇²f(x)   A^T ] [ Δx_nt ]     [ ∇f(x)  ]
    [ A        0   ] [ w     ] = − [ Ax − b ]        (1)

primal-dual interpretation

• write optimality condition as r(y) = 0, where

    y = (x, ν),    r(y) = (∇f(x) + A^T ν, Ax − b)

• linearizing r(y) = 0 gives r(y + Δy) ≈ r(y) + Dr(y)Δy = 0:

    [ ∇²f(x)   A^T ] [ Δx_nt ]     [ ∇f(x) + A^T ν ]
    [ A        0   ] [ Δν_nt ] = − [ Ax − b        ]

same as (1) with w = ν + Δν_nt

Equality constrained minimization 11–10

Infeasible start Newton method

given starting point x ∈ dom f, ν, tolerance ǫ > 0, α ∈ (0, 1/2), β ∈ (0, 1).
repeat
1. Compute primal and dual Newton steps Δx_nt, Δν_nt.
2. Backtracking line search on ‖r‖₂.
   t := 1.
   while ‖r(x + tΔx_nt, ν + tΔν_nt)‖₂ > (1 − αt)‖r(x, ν)‖₂, t := βt.
3. Update. x := x + tΔx_nt, ν := ν + tΔν_nt.
until Ax = b and ‖r(x, ν)‖₂ ≤ ǫ.
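A compact NumPy sketch of the algorithm above. The helper names and the loop cap are our own choices; a real implementation would also guard that line-search trial points stay in dom f, which this sketch assumes.

```python
import numpy as np

def infeasible_start_newton(grad, hess, A, b, x, nu,
                            eps=1e-8, alpha=0.1, beta=0.5, max_iter=50):
    """Infeasible start Newton method: drive r(x, nu) to zero,
    where r = (grad f(x) + A^T nu, A x - b).
    Sketch only: assumes iterates stay in dom f."""
    p, n = A.shape
    r = lambda x, nu: np.concatenate([grad(x) + A.T @ nu, A @ x - b])
    for _ in range(max_iter):
        rv = r(x, nu)
        if np.linalg.norm(A @ x - b) <= eps and np.linalg.norm(rv) <= eps:
            break
        # primal-dual Newton step from the linearized optimality conditions
        K = np.block([[hess(x), A.T], [A, np.zeros((p, p))]])
        d = np.linalg.solve(K, -rv)
        dx, dnu = d[:n], d[n:]
        t = 1.0                      # backtracking line search on ||r||_2
        while np.linalg.norm(r(x + t * dx, nu + t * dnu)) > (1 - alpha * t) * np.linalg.norm(rv):
            t *= beta
        x, nu = x + t * dx, nu + t * dnu
    return x, nu
```

On a quadratic objective the residual is affine in (x, ν), so the method reaches the optimum in one full step even from an infeasible starting point.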
• not a descent method: f(x^(k+1)) > f(x^(k)) is possible
• directional derivative of ‖r(y)‖₂ in direction Δy = (Δx_nt, Δν_nt) is

    (d/dt) ‖r(y + tΔy)‖₂ |_{t=0} = −‖r(y)‖₂

Equality constrained minimization 11–11

Solving KKT systems

    [ H   A^T ] [ v ]     [ g ]
    [ A   0   ] [ w ] = − [ h ]

solution methods

• LDL^T factorization
• elimination (if H nonsingular)

    AH^{-1}A^T w = h − AH^{-1}g,    Hv = −(g + A^T w)

• elimination with singular H: write as

    [ H + A^T QA   A^T ] [ v ]     [ g + A^T Qh ]
    [ A            0   ] [ w ] = − [ h          ]

  with Q ⪰ 0 for which H + A^T QA ≻ 0, and apply elimination

Equality constrained minimization 11–12

Equality constrained analytic centering

primal problem: minimize −Σ_{i=1}^n log x_i subject to Ax = b

dual problem: maximize −b^T ν + Σ_{i=1}^n log(A^T ν)_i + n

three methods for an example with A ∈ R^{100×500}, different starting points

1. Newton method with equality constraints (requires x^(0) ≻ 0, Ax^(0) = b)

[figure: f(x^(k)) − p⋆ versus k]

Equality constrained minimization 11–13

2. Newton method applied to dual problem (requires A^T ν^(0) ≻ 0)

[figure: p⋆ − g(ν^(k)) versus k]

3. infeasible start Newton method (requires x^(0) ≻ 0)

[figure: ‖r(x^(k), ν^(k))‖₂ versus k]

Equality constrained minimization 11–14

complexity per iteration of three methods is identical

1. use block elimination to solve KKT system

    [ diag(x)^{-2}   A^T ] [ Δx ]   [ diag(x)^{-1} 1 ]
    [ A              0   ] [ w  ] = [ 0              ]

   reduces to solving A diag(x)² A^T w = b

2. solve Newton system A diag(A^T ν)^{-2} A^T Δν = −b + A diag(A^T ν)^{-1} 1

3.
use block elimination to solve KKT system

    [ diag(x)^{-2}   A^T ] [ Δx ]   [ diag(x)^{-1} 1 − A^T ν ]
    [ A              0   ] [ Δν ] = [ b − Ax                 ]

   reduces to solving A diag(x)² A^T w = 2Ax − b

conclusion: in each case, solve ADA^T w = h with D positive diagonal

Equality constrained minimization 11–15

Network flow optimization

    minimize    Σ_{i=1}^n φ_i(x_i)
    subject to  Ax = b

• directed graph with n arcs, p + 1 nodes
• x_i: flow through arc i; φ_i: cost flow function for arc i (with φ_i''(x) > 0)
• node-incidence matrix Ã ∈ R^{(p+1)×n} defined as

    Ã_ij = {  1   arc j leaves node i
           { −1   arc j enters node i
           {  0   otherwise

• reduced node-incidence matrix A ∈ R^{p×n} is Ã with last row removed
• b ∈ R^p is (reduced) source vector
• rank A = p if graph is connected

Equality constrained minimization 11–16

KKT system

    [ H   A^T ] [ v ]     [ g ]
    [ A   0   ] [ w ] = − [ h ]

• H = diag(φ₁''(x₁), ..., φ_n''(x_n)), positive diagonal
• solve via elimination:

    AH^{-1}A^T w = h − AH^{-1}g,    Hv = −(g + A^T w)

sparsity pattern of coefficient matrix is given by graph connectivity

    (AH^{-1}A^T)_ij ≠ 0  ⟺  (AA^T)_ij ≠ 0
                          ⟺  nodes i and j are connected by an arc

Equality constrained minimization 11–17

Analytic center of linear matrix inequality

    minimize    −log det X
    subject to  tr(A_i X) = b_i,  i = 1,...,p

variable X ∈ S^n

optimality conditions

    X⋆ ≻ 0,    −(X⋆)^{-1} + Σ_{j=1}^p ν_j⋆ A_j = 0,    tr(A_i X⋆) = b_i,  i = 1,...,p

Newton equation at feasible X:

    X^{-1} ΔX X^{-1} + Σ_{j=1}^p w_j A_j = X^{-1},    tr(A_i ΔX) = 0,  i = 1,...,p

• follows from linear approximation (X + ΔX)^{-1} ≈ X^{-1} − X^{-1} ΔX X^{-1}
• n(n+1)/2 + p variables ΔX, w

Equality constrained minimization 11–18

solution by block elimination

• eliminate ΔX from first equation: ΔX = X − Σ_{j=1}^p w_j X A_j X
• substitute ΔX in second equation

    Σ_{j=1}^p tr(A_i X A_j X) w_j = b_i,  i = 1,...,p        (2)

  a dense positive definite set of linear equations with variable w ∈ R^p

flop count (dominant terms) using Cholesky factorization X = LL^T:
• form p products L^T A_j L: (3/2)pn³
• form p(p+1)/2 inner products tr((L^T A_i L)(L^T A_j L)): (1/2)p²n²
• solve (2) via Cholesky factorization:
(1/3)p³

Equality constrained minimization 11–19

Convex Optimization — Boyd Vandenberghe

12. Interior-point methods

• inequality constrained minimization
• logarithmic barrier function and central path
• barrier method
• feasibility and phase I methods
• complexity analysis via self-concordance
• generalized inequalities

12–1

Inequality constrained minimization

    minimize    f₀(x)
    subject to  f_i(x) ≤ 0,  i = 1,...,m        (1)
                Ax = b

• f_i convex, twice continuously differentiable
• A ∈ R^{p×n} with rank A = p
• we assume p⋆ is finite and attained
• we assume problem is strictly feasible: there exists x̃ with

    x̃ ∈ dom f₀,    f_i(x̃) < 0,  i = 1,...,m,    Ax̃ = b

hence, strong duality holds and dual optimum is attained

Interior-point methods 12–2

Examples

• LP, QP, QCQP, GP
• entropy maximization with linear inequality constraints

    minimize    Σ_{i=1}^n x_i log x_i
    subject to  Fx ⪯ g
                Ax = b

  with dom f₀ = R^n₊₊
• differentiability may require reformulating the problem, e.g., piecewise-linear minimization or ℓ∞-norm approximation via LP
• SDPs and SOCPs are better handled as problems with generalized inequalities (see later)

Interior-point methods 12–3

Logarithmic barrier

reformulation of (1) via indicator function:

    minimize    f₀(x) + Σ_{i=1}^m I₋(f_i(x))
    subject to  Ax = b

where I₋(u) = 0 if u ≤ 0, I₋(u) = ∞ otherwise (indicator function of R₋)

approximation via logarithmic barrier

    minimize    f₀(x) − (1/t) Σ_{i=1}^m log(−f_i(x))
    subject to  Ax = b

• an equality constrained problem
• for t > 0, −(1/t) log(−u) is a smooth approximation of I₋
• approximation improves as t → ∞

[figure: −(1/t) log(−u) versus u, for several values of t]

Interior-point methods 12–4

logarithmic barrier function

    φ(x) = −Σ_{i=1}^m log(−f_i(x)),    dom φ = {x | f₁(x) < 0, ..., f_m(x) < 0}

• convex (follows from composition rules)
• twice continuously differentiable, with derivatives

    ∇φ(x) = Σ_{i=1}^m (1/(−f_i(x))) ∇f_i(x)

    ∇²φ(x) = Σ_{i=1}^m (1/f_i(x)²) ∇f_i(x) ∇f_i(x)^T + Σ_{i=1}^m (1/(−f_i(x))) ∇²f_i(x)

Interior-point methods 12–5

Central path

• for t > 0, define x⋆(t) as the solution of

    minimize    t f₀(x) + φ(x)
    subject to  Ax = b

  (for now, assume x⋆(t) exists and is unique for each t > 0)
• central path is {x⋆(t) | t > 0}

example: central path for an LP

    minimize    c^T x
    subject to  a_i^T x ≤ b_i,  i = 1,...,6

hyperplane c^T x = c^T x⋆(t) is tangent to level curve of φ through x⋆(t)

[figure: central path for the LP, showing x⋆(10) and x⋆]

Interior-point methods 12–6

Dual points on central path

x = x⋆(t) if there exists a w such that

    t ∇f₀(x) + Σ_{i=1}^m (1/(−f_i(x))) ∇f_i(x) + A^T w = 0,    Ax = b

• therefore, x⋆(t) minimizes the Lagrangian

    L(x, λ⋆(t), ν⋆(t)) = f₀(x) + Σ_{i=1}^m λ_i⋆(t) f_i(x) + ν⋆(t)^T (Ax − b)

  where we define λ_i⋆(t) = 1/(−t f_i(x⋆(t))) and ν⋆(t) = w/t
• this confirms the intuitive idea that f₀(x⋆(t)) → p⋆ if t → ∞:

    p⋆ ≥ g(λ⋆(t), ν⋆(t))
       = L(x⋆(t), λ⋆(t), ν⋆(t))
       = f₀(x⋆(t)) − m/t

Interior-point methods 12–7

Interpretation via KKT conditions

x = x⋆(t), λ = λ⋆(t), ν = ν⋆(t) satisfy

1. primal constraints: f_i(x) ≤ 0, i = 1,...,m, Ax = b
2. dual constraints: λ ⪰ 0
3. approximate complementary slackness: −λ_i f_i(x) = 1/t, i = 1,...,m
4. gradient of Lagrangian with respect to x vanishes:

    ∇f₀(x) + Σ_{i=1}^m λ_i ∇f_i(x) + A^T ν = 0

difference with KKT is that condition 3 replaces λ_i f_i(x) = 0

Interior-point methods 12–8

Force field interpretation

centering problem (for problem with no equality constraints)

    minimize  t f₀(x) − Σ_{i=1}^m log(−f_i(x))

force field interpretation

• t f₀(x) is potential of force field F₀(x) = −t ∇f₀(x)
• −log(−f_i(x)) is potential of force field F_i(x) = (1/f_i(x)) ∇f_i(x)

the forces balance at x⋆(t):

    F₀(x⋆(t)) + Σ_{i=1}^m F_i(x⋆(t)) = 0

Interior-point methods 12–9

example

    minimize    c^T x
    subject to  a_i^T x ≤ b_i,  i = 1,...,m

• objective force field is constant: F₀(x) = −tc
• constraint force field decays as inverse distance to constraint hyperplane:

    F_i(x) = (−a_i)/(b_i − a_i^T x),    ‖F_i(x)‖₂ = 1/dist(x, H_i)

  where H_i = {x | a_i^T x = b_i}

[figure: balance of forces at x⋆(t), with objective force −c for t = 1 and −3c for t = 3]

Interior-point methods 12–10

Barrier method

given strictly feasible x, t := t^(0) > 0, µ > 1, tolerance ǫ > 0.
repeat
1.
Centering step. Compute x⋆(t) by minimizing t f₀ + φ, subject to Ax = b.
2. Update. x := x⋆(t).
3. Stopping criterion. quit if m/t < ǫ.
4. Increase t. t := µt.

• terminates with f₀(x) − p⋆ ≤ ǫ (stopping criterion follows from f₀(x⋆(t)) − p⋆ ≤ m/t)
• centering usually done using Newton's method, starting at current x
• choice of µ involves a trade-off: large µ means fewer outer iterations, more inner (Newton) iterations; typical values: µ = 10–20
• several heuristics for choice of t^(0)

Interior-point methods 12–11

Convergence analysis

number of outer (centering) iterations: exactly

    ⌈ log(m/(ǫ t^(0))) / log µ ⌉

plus the initial centering step (to compute x⋆(t^(0)))

centering problem

    minimize  t f₀(x) + φ(x)

see convergence analysis of Newton's method

• t f₀ + φ must have closed sublevel sets for t ≥ t^(0)
• classical analysis requires strong convexity, Lipschitz condition
• analysis via self-concordance requires self-concordance of t f₀ + φ

Interior-point methods 12–12

Examples

inequality form LP (m = 100 inequalities, n = 50 variables)

[figures: duality gap versus Newton iterations for µ = 2, 50, 150; total Newton iterations versus µ]

• starts with x^(0) on central path (t^(0) = 1, duality gap 100)
• terminates when t = 10⁸ (gap 10⁻⁶)
• centering uses Newton's method with backtracking
• total number of Newton iterations not very sensitive for µ ≥ 10

Interior-point methods 12–13

geometric program (m = 100 inequalities and n = 50 variables)

    minimize    log( Σ_{k=1}^5 exp(a_{0k}^T x + b_{0k}) )
    subject to  log( Σ_{k=1}^5 exp(a_{ik}^T x + b_{ik}) ) ≤ 0,  i = 1,...,m

[figure: duality gap versus Newton iterations for µ = 2, 50, 150]

Interior-point methods 12–14

family of standard LPs (A ∈ R^{m×2m})

    minimize    c^T x
    subject to  Ax = b,  x ⪰ 0

m = 10,...,1000; for each m, solve 100 randomly generated instances

[figure: Newton iterations versus m]

number of iterations grows very slowly as m ranges over a 100 : 1 ratio

Interior-point methods
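The barrier method of page 12–11 specializes nicely to the inequality form LP used in these examples. Below is a minimal NumPy sketch; the function name, the inner-iteration cap, and the backtracking constants are our own choices, not part of the slides, and a dense solve stands in for a structured one.

```python
import numpy as np

def lp_barrier(c, A, b, x, t=1.0, mu=10.0, eps=1e-6):
    """Barrier method sketch for: minimize c^T x subject to A x <= b,
    starting from a strictly feasible x."""
    m = A.shape[0]
    while True:
        # centering step: Newton's method on t*c^T x - sum log(b - A x)
        for _ in range(50):
            d = b - A @ x                         # slacks, kept positive
            g = t * c + A.T @ (1.0 / d)           # gradient
            H = (A / d[:, None]**2).T @ A         # Hessian A^T diag(1/d^2) A
            dx = np.linalg.solve(H, -g)
            lam_sq = float(-g @ dx)               # Newton decrement squared
            if lam_sq / 2.0 <= 1e-10:
                break
            s = 1.0                               # backtracking line search
            while np.min(b - A @ (x + s * dx)) <= 0:
                s *= 0.5                          # stay strictly feasible
            f = lambda z: t * c @ z - np.sum(np.log(b - A @ z))
            while f(x + s * dx) > f(x) - 0.1 * s * lam_sq:
                s *= 0.5
            x = x + s * dx
        if m / t < eps:                           # duality gap m/t below tolerance
            break
        t *= mu                                   # increase t
    return x
```

Each outer pass centers at the current t by a full Newton loop, then multiplies t by µ; the returned x satisfies c^T x − p⋆ ≤ m/t ≤ ǫ.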
12–15

Feasibility and phase I methods

feasibility problem: find x such that

    f_i(x) ≤ 0,  i = 1,...,m,    Ax = b        (2)

phase I: computes strictly feasible starting point for barrier method

basic phase I method

    minimize (over x, s)  s
    subject to            f_i(x) ≤ s,  i = 1,...,m        (3)
                          Ax = b

• if x, s feasible, with s < 0, then x is strictly feasible for (2)
• if optimal value p̄⋆ of (3) is positive, then problem (2) is infeasible
• if p̄⋆ = 0 and attained, then problem (2) is feasible (but not strictly); if p̄⋆ = 0 and not attained, then problem (2) is infeasible

Interior-point methods 12–16

sum of infeasibilities phase I method

    minimize    1^T s
    subject to  s ⪰ 0,  f_i(x) ≤ s_i,  i = 1,...,m
                Ax = b

for infeasible problems, produces a solution that satisfies many more inequalities than basic phase I method

example (infeasible set of 100 linear inequalities in 50 variables)

[figure: histograms of b_i − a_i^T x for the basic (left) and sum-of-infeasibilities (right) phase I solutions]

left: basic phase I solution; satisfies 39 inequalities
right: sum of infeasibilities phase I solution; satisfies 79 inequalities

Interior-point methods 12–17

example: family of linear inequalities Ax ⪯ b + γΔb

• data chosen to be strictly feasible for γ > 0, infeasible for γ ≤ 0
• use basic phase I, terminate when s < 0 or dual objective is positive

[figures: Newton iterations versus γ, for the feasible (γ > 0) and infeasible (γ < 0) cases]

number of iterations roughly proportional to log(1/|γ|)

Interior-point methods 12–18

Complexity analysis via self-concordance

same assumptions as on page 12–2, plus:
• sublevel sets (of f₀, on the feasible set) are bounded
• t f₀ + φ is self-concordant with closed sublevel sets

second condition
• holds for LP, QP, QCQP
• may require reformulating the problem, e.g.,

    minimize    Σ_{i=1}^n x_i log x_i    −→    minimize    Σ_{i=1}^n x_i log x_i
    subject to  Fx ⪯ g                         subject to  Fx ⪯ g,  x ⪰ 0

• needed for
complexity analysis; barrier method works even when self-concordance assumption does not apply

Interior-point methods 12–19

Newton iterations per centering step: from self-concordance theory

    Newton iterations ≤ (µt f₀(x) + φ(x) − µt f₀(x⁺) − φ(x⁺))/γ + c

• bound on effort of computing x⁺ = x⋆(µt) starting at x = x⋆(t)
• γ, c are constants (depend only on Newton algorithm parameters)
• from duality (with λ = λ⋆(t), ν = ν⋆(t)):

    µt f₀(x) + φ(x) − µt f₀(x⁺) − φ(x⁺)
      = µt f₀(x) − µt f₀(x⁺) + Σ_{i=1}^m log(−µt λ_i f_i(x⁺)) − m log µ
      ≤ µt f₀(x) − µt f₀(x⁺) − µt Σ_{i=1}^m λ_i f_i(x⁺) − m − m log µ
      ≤ µt f₀(x) − µt g(λ, ν) − m − m log µ
      = m(µ − 1 − log µ)

Interior-point methods 12–20

total number of Newton iterations (excluding first centering step)

    Newton iterations ≤ N = ⌈ log(m/(t^(0) ǫ)) / log µ ⌉ ( m(µ − 1 − log µ)/γ + c )

[figure: N versus µ for typical values of γ, c, with m = 100 and m/(t^(0) ǫ) = 10⁵]

• confirms trade-off in choice of µ
• in practice, number of iterations is in the tens; not very sensitive for µ ≥ 10

Interior-point methods 12–21

polynomial-time complexity of barrier method

• for µ = 1 + 1/√m:

    N = O( √m log( (m/t^(0)) / ǫ ) )

• number of Newton iterations for fixed gap reduction is O(√m)
• multiply with cost of one Newton iteration (a polynomial function of problem dimensions), to get bound on number of flops

this choice of µ optimizes worst-case complexity; in practice we choose µ fixed (µ = 10,...,20)

Interior-point methods 12–22

Generalized inequalities

    minimize    f₀(x)
    subject to  f_i(x) ⪯_{K_i} 0,  i = 1,...,m
                Ax = b

• f₀ convex; f_i : R^n → R^{k_i}, i = 1,...,m, convex with respect to proper cones K_i ⊆ R^{k_i}
• f_i twice continuously differentiable
• A ∈ R^{p×n} with rank A = p
• we assume p⋆ is finite and attained
• we assume problem is strictly feasible; hence strong duality holds and dual optimum is attained

examples of greatest interest: SOCP, SDP

Interior-point methods 12–23

Generalized logarithm for proper cone

ψ : R^q → R is generalized logarithm for proper cone
K ⊆ R^q if:

• dom ψ = int K and ∇²ψ(y) ≺ 0 for y ≻_K 0
• ψ(sy) = ψ(y) + θ log s for y ≻_K 0, s > 0 (θ is the degree of ψ)

examples

• nonnegative orthant K = R^n₊: ψ(y) = Σ_{i=1}^n log y_i, with degree θ = n
• positive semidefinite cone K = S^n₊: ψ(Y) = log det Y (θ = n)
• second-order cone K = {y ∈ R^{n+1} | (y₁² + ··· + y_n²)^{1/2} ≤ y_{n+1}}:

    ψ(y) = log(y_{n+1}² − y₁² − ··· − y_n²)    (θ = 2)

Interior-point methods 12–24

properties (without proof): for y ≻_K 0,

    ∇ψ(y) ⪰_{K⋆} 0,    y^T ∇ψ(y) = θ

• nonnegative orthant R^n₊: ψ(y) = Σ_{i=1}^n log y_i

    ∇ψ(y) = (1/y₁, ..., 1/y_n),    y^T ∇ψ(y) = n

• positive semidefinite cone S^n₊: ψ(Y) = log det Y

    ∇ψ(Y) = Y^{-1},    tr(Y ∇ψ(Y)) = n

• second-order cone K = {y ∈ R^{n+1} | (y₁² + ··· + y_n²)^{1/2} ≤ y_{n+1}}:

    ∇ψ(y) = (2/(y_{n+1}² − y₁² − ··· − y_n²)) (−y₁, ..., −y_n, y_{n+1}),    y^T ∇ψ(y) = 2

Interior-point methods 12–25

Logarithmic barrier and central path

logarithmic barrier for f₁(x) ⪯_{K₁} 0, ..., f_m(x) ⪯_{K_m} 0:

    φ(x) = −Σ_{i=1}^m ψ_i(−f_i(x)),    dom φ = {x | f_i(x) ≺_{K_i} 0, i = 1,...,m}

• ψ_i is generalized logarithm for K_i, with degree θ_i
• φ is convex, twice continuously differentiable

central path: {x⋆(t) | t > 0} where x⋆(t) solves

    minimize    t f₀(x) + φ(x)
    subject to  Ax = b

Interior-point methods 12–26

Dual points on central path

x = x⋆(t) if there exists w ∈ R^p,

    t ∇f₀(x) + Σ_{i=1}^m Df_i(x)^T ∇ψ_i(−f_i(x)) + A^T w = 0

(Df_i(x) ∈ R^{k_i×n} is derivative matrix of f_i)

• therefore, x⋆(t) minimizes Lagrangian L(x, λ⋆(t), ν⋆(t)), where

    λ_i⋆(t) = (1/t) ∇ψ_i(−f_i(x⋆(t))),    ν⋆(t) = w/t

• from properties of ψ_i: λ_i⋆(t) ≻_{K_i⋆} 0, with duality gap

    f₀(x⋆(t)) − g(λ⋆(t), ν⋆(t)) = (1/t) Σ_{i=1}^m θ_i

Interior-point methods 12–27

example: semidefinite programming (with F_i ∈ S^p)

    minimize    c^T x
    subject to  F(x) = Σ_{i=1}^n x_i F_i + G ⪯ 0

• logarithmic barrier: φ(x) = log det(−F(x)^{-1})
• central path: x⋆(t) minimizes t c^T x − log det(−F(x)); hence

    t c_i − tr(F_i F(x⋆(t))^{-1}) = 0,  i = 1,...,n

• dual point on central path: Z⋆(t) = −(1/t) F(x⋆(t))^{-1} is feasible for

    maximize    tr(GZ)
    subject to  tr(F_i Z) + c_i =
0,  i = 1,...,n
                Z ⪰ 0

• duality gap on central path: c^T x⋆(t) − tr(GZ⋆(t)) = p/t

Interior-point methods 12–28

Barrier method

given strictly feasible x, t := t^(0) > 0, µ > 1, tolerance ǫ > 0.
repeat
1. Centering step. Compute x⋆(t) by minimizing t f₀ + φ, subject to Ax = b.
2. Update. x := x⋆(t).
3. Stopping criterion. quit if (Σ_i θ_i)/t < ǫ.
4. Increase t. t := µt.

• only difference is duality gap m/t on central path is replaced by (Σ_i θ_i)/t
• number of outer iterations:

    ⌈ log((Σ_i θ_i)/(ǫ t^(0))) / log µ ⌉

• complexity analysis via self-concordance applies to SDP, SOCP

Interior-point methods 12–29

Examples

second-order cone program (50 variables, 50 SOC constraints in R⁶)

[figures: duality gap versus Newton iterations for µ = 2, 50, 200; total Newton iterations versus µ]

semidefinite program (100 variables, LMI constraint in S¹⁰⁰)

[figures: duality gap versus Newton iterations for µ = 2, 50, 150; total Newton iterations versus µ]

Interior-point methods 12–30

family of SDPs (A ∈ S^n, x ∈ R^n)

    minimize    1^T x
    subject to  A + diag(x) ⪰ 0

n = 10,...,1000, for each n solve 100 randomly generated instances

[figure: Newton iterations versus n]

Interior-point methods 12–31

Primal-dual interior-point methods

more efficient than barrier method when high accuracy is needed

• update primal and dual variables at each iteration; no distinction between inner and outer iterations
• often exhibit superlinear asymptotic convergence
• search directions can be interpreted as Newton directions for modified KKT conditions
• can start at infeasible points
• cost per iteration same as barrier method

Interior-point methods 12–32

Convex Optimization — Boyd Vandenberghe

13. Conclusions

• main ideas of the course
• importance of modeling in optimization

13–1

Modeling

mathematical optimization

• problems in engineering design, data analysis and statistics, economics, management, . . .
  can often be expressed as mathematical optimization problems
• techniques exist to take into account multiple objectives or uncertainty in the data

tractability

• roughly speaking, tractability in optimization requires convexity
• algorithms for nonconvex optimization find local (suboptimal) solutions, or are very expensive
• surprisingly many applications can be formulated as convex problems

Conclusions 13–2

Theoretical consequences of convexity

• local optima are global
• extensive duality theory
  – systematic way of deriving lower bounds on optimal value
  – necessary and sufficient optimality conditions
  – certificates of infeasibility
  – sensitivity analysis
• solution methods with polynomial worst-case complexity theory (with self-concordance)

Conclusions 13–3

Practical consequences of convexity

(most) convex problems can be solved globally and efficiently

• interior-point methods require 20 – 80 steps in practice
• basic algorithms (e.g., Newton, barrier method, . . . ) are easy to implement and work well for small and medium size problems (larger problems if structure is exploited)
• more and more high-quality implementations of advanced algorithms and modeling tools are becoming available
• high level modeling tools like cvx ease modeling and problem specification

Conclusions 13–4

How to use convex optimization

to use convex optimization in some applied context

• use rapid prototyping, approximate modeling
  – start with simple models, small problem instances, inefficient solution methods
  – if you don't like the results, no need to expend further effort on more accurate models or efficient algorithms
• work out, simplify, and interpret optimality conditions and dual
• even if the problem is quite nonconvex, you can use convex optimization
  – in subproblems, e.g., to find search direction
  – by repeatedly forming and solving a convex approximation at the current point

Conclusions 13–5

Further topics

some topics we didn't cover:

• methods for very large scale problems
• subgradient calculus, convex analysis
• localization, subgradient, and related methods
• distributed convex optimization
• applications that build on or use convex optimization

Conclusions 13–6

What's next

• EE364B — convex optimization II
• MATH301 — advanced topics in convex optimization
• MSE314 — linear and conic optimization
• EE464 — semidefinite optimization and algebraic techniques

Conclusions 13–7