Advanced optimization techniques lecture notes

EECS260 Optimization — Lecture notes
Miguel Á. Carreira-Perpiñán
EECS, University of California, Merced
November 25, 2016

These are notes for a one-semester graduate course on numerical optimisation given by Prof. Miguel Á. Carreira-Perpiñán at the University of California, Merced. The notes are largely based on the book “Numerical Optimization” by Jorge Nocedal and Stephen J. Wright (Springer, 2nd ed., 2006), with some additions. These notes may be used for educational, non-commercial purposes. © 2005–2016 Miguel Á. Carreira-Perpiñán

1 Introduction

• Goal: describe the basic concepts & main state-of-the-art algorithms for continuous optimization.
• The optimization problem:

  min_{x∈R^n} f(x)  s.t.  c_i(x) = 0, i ∈ E (equality constraints, scalar)
                          c_i(x) ≥ 0, i ∈ I (inequality constraints, scalar)

  x: variables (vector); f(x): objective function (scalar). Feasible region: set of points satisfying all constraints. max f ≡ −min −f.
• Ex. (fig. 1.1): min (x1 − 2)² + (x2 − 1)² s.t. x1² − x2 ≤ 0, x1 + x2 ≤ 2.
• Ex.: transportation problem (LP):

  min Σ_{i,j} c_ij x_ij  s.t.  Σ_j x_ij ≤ a_i ∀i (capacity of factory i)
                               Σ_i x_ij ≥ b_j ∀j (demand of shop j)
                               x_ij ≥ 0 ∀i,j (nonnegative production)

  c_ij: shipping cost; x_ij: amount of product shipped from factory i to shop j.
• Ex.: LSQ problem: fit a parametric model (e.g. line, polynomial, neural net...) to a data set (ex. 2.1).
• Optimization algorithms are iterative: they build a sequence of points that converges to the solution. They need a good initial point (often from prior knowledge).
• Focus on many-variable problems (but we will illustrate in 2D).
• Desiderata for algorithms:
  – Robustness: perform well on a wide variety of problems in their class, for any starting point;
  – Efficiency: little computer time or storage;
  – Accuracy: identify the solution precisely (within the limits of floating-point arithmetic).
  They conflict with each other.
• General comment about optimization (Fletcher): “fascinating blend of theory and computation, heuristics and rigour”.
  – No universal algorithm: a given algorithm works well on a given class of problems.
  – Necessary to adapt a method to the problem at hand (by experimenting).
  – Not choosing an appropriate algorithm ⇒ solution found very slowly or not at all.
• Not covered in the Nocedal & Wright book, or in this course:
  – Discrete optimization (integer programming): the variables are discrete. Ex.: integer transportation problem, traveling salesman problem.
    ∗ Harder to solve than continuous opt. (in the latter we can predict the objective function value at nearby points).
    ∗ Too many solutions to count them.
    ∗ Rounding typically gives very bad solutions.
    ∗ Highly specialized techniques for each problem type. Ref: Papadimitriou & Steiglitz 1982.
  – Network opt.: shortest paths, max flow, min cost flow, assignments & matchings, MST, dynamic programming, graph partitioning... Ref: Ahuja, Magnanti & Orlin 1993.
  – Non-smooth opt.: discontinuous derivatives, e.g. the L1 norm. Ref: Fletcher 1987.
  – Stochastic opt.: the model is specified with uncertainty, e.g. x ≤ b where b could be given by a probability density function.
  – Global opt.: find the global minimum, not just a local one. Very difficult. Some heuristics: simulated annealing, genetic algorithms, evolutionary computation.
  – Multiobjective opt.: one approach is to transform it into a single objective = a linear combination of objectives.
  – EM algorithm (expectation-maximization): specialized technique for maximum likelihood estimation of probabilistic models. Ref: McLachlan & Krishnan 2008; many books on statistics or machine learning.
  – Calculus of variations: stationary point of a functional (= function of functions).
  – Convex optimization: we’ll see some of this. Ref: Boyd & Vandenberghe 2003.
  – Modeling: the setup of the opt. problem, i.e., the process of identifying objective, variables and constraints for a given problem. Very important but application-dependent. Ref: Dantzig 1963; Ahuja, Magnanti & Orlin 1993.
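As a concrete instance of the LSQ example from the introduction, the sketch below fits a line y ≈ w0 + w1·x to synthetic data by minimizing the sum of squared residuals. The data and the use of NumPy's least-squares solver (rather than an iterative method) are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Hypothetical data: noisy samples of y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.01 * rng.standard_normal(50)

# Fit y ~ w0 + w1*x by minimizing f(w) = ||A w - y||^2,
# an unconstrained continuous optimization problem.
A = np.column_stack([np.ones_like(x), x])   # design matrix: [1, x]
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares solution
```

For this linear model the minimizer has a closed form; for a nonlinear model (e.g. a neural net) the same objective must be minimized iteratively, which is the subject of these notes.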
• Course contents: derivative-based methods for continuous optimization (see syllabus).

2 Fundamentals of unconstrained optimization

Problem: min f(x), x ∈ R^n. Conditions for a local minimum x* (cf. the case n = 1):
• Global minimizer: f(x*) ≤ f(x) ∀x ∈ R^n.
• Local minimizer: ∃ a neighborhood N of x*: f(x*) ≤ f(x) ∀x ∈ N.
• Strict (or strong) local minimizer: f(x*) < f(x) ∀x ∈ N∖{x*}. (Ex.: f(x) = 2 vs. f(x) = (x−2)⁴.)
• Isolated local minimizer: ∃N of x* such that x* is the only local min. in N. (Ex.: f(x) = x⁴cos(1/x) + 2x⁴ with f(0) = 0: x* = 0 is a strict local minimizer but not isolated.) All isolated local min. are strict.
• First-order necessary condition (Th. 2.2): x* local min., f cont. diff. in an open neighborhood of x* ⇒ ∇f(x*) = 0.
  (Pf.: by contradiction: if ∇f(x*) ≠ 0 then f decreases along the negative gradient direction.) (Not a sufficient condition, ex.: f(x) = x³.)
• Stationary point: ∇f(x*) = 0.
• Second-order necessary conditions (Th. 2.3): x* local min., f twice cont. diff. in an open neighborhood of x* ⇒ ∇f(x*) = 0 and ∇²f(x*) psd.
  (Pf.: by contradiction: if ∇²f(x*) is not psd then f decreases along a direction of negative curvature.)
• Second-order sufficient conditions (Th. 2.4): ∇²f cont. in an open neighborhood of x*, ∇f(x*) = 0, ∇²f(x*) pd ⇒ x* is a strict local minimizer of f.
  (Pf.: Taylor-expand f around x*.) (Not a necessary condition, ex.: f(x) = x⁴ at x* = 0.)
The key for these conditions is that ∇f, ∇²f exist and are continuous. The smoothness of f allows us to predict approximately the landscape around a point x.

Convex optimization
• S ⊂ R^n is a convex set if x, y ∈ S ⇒ αx + (1−α)y ∈ S, ∀α ∈ [0,1].
• f: S ⊂ R^n → R is a convex function if its domain S is convex and f(αx + (1−α)y) ≤ αf(x) + (1−α)f(y), ∀α ∈ [0,1], ∀x, y ∈ S.
• Convex optimization problem: the objective function is convex, the equality constraints are linear and the inequality constraints are concave ⇒ the feasible set is convex. Ex.: linear programming (LP).
• Easier to solve because every local min. is a global min.
• Th. 2.5:
  – f convex ⇒ any local min. is also global.
  – f convex and differentiable ⇒ any stationary point is a global min.
    (Pf.: by contradiction: assume z with f(z) < f(x*) and study the segment x*z.)

Algorithm overview
• Algorithms look for a stationary point starting from a point x0 (arbitrary or user-supplied) ⇒ a sequence of iterates {x_k}, k = 0, 1, 2, ..., that terminates when no more progress can be made, or it seems that a solution has been approximated with sufficient accuracy.
• Stopping criterion: we can’t use ‖x_k − x*‖ or f(x_k) − f(x*). Instead, in practice (given a small ǫ > 0):
  – ‖∇f(x_k)‖ < ǫ
  – ‖x_k − x_{k−1}‖ < ǫ or ‖x_k − x_{k−1}‖ < ǫ‖x_{k−1}‖
  – |f(x_k) − f(x_{k−1})| < ǫ or |f(x_k) − f(x_{k−1})| < ǫ|f(x_{k−1})|
  and also set a limit on k.
• We choose x_{k+1} given information about f at x_k (and possibly earlier iterates) so that f(x_{k+1}) < f(x_k) (descent).
• Move x_k → x_{k+1}: two fundamental strategies, line search and trust region.
  – Line search strategy:
    1. Choose a direction p_k.
    2. Search along p_k from x_k for x_{k+1} with f(x_{k+1}) < f(x_k), i.e., approximately solve the 1D minimization problem min_{α>0} f(x_k + αp_k), where α = step length.
  – Trust region strategy:
    1. Construct a model function m_k (typ. quadratic) that is similar to (but simpler than) f around x_k.
    2. Search for x_{k+1} with m_k(x_{k+1}) < m_k(x_k) inside a small trust region (typ. a ball) around x_k, i.e., approximately solve the n-D minimization problem min_p m_k(x_k + p) s.t. x_k + p ∈ trust region.
    3. If x_{k+1} does not produce enough decay in f, shrink the region.
• In both strategies, the subproblem (step 2) is easier to solve than the real problem. Why not solve the subproblem exactly?
  – Good: derives maximum benefit from p_k or m_k; but
  – Bad: expensive (many iterations over α) and unnecessary towards solving the real problem (min f over all of R^n).
• The two strategies differ in the order in which they choose the direction and the distance of the move:
  – Line search: fix direction, choose distance.
  – Trust region: fix maximum distance, choose direction and actual distance.
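A minimal sketch of the iterative scheme just described: descend along −∇f with the gradient-norm stopping criterion and an iteration cap. A fixed step length stands in for the line-search machinery developed later; the test function and all constants are illustrative assumptions.

```python
import numpy as np

def grad_descent(f, grad, x0, alpha=0.1, eps=1e-6, max_k=10_000):
    """Iterate x_{k+1} = x_k - alpha*grad f(x_k); stop when
    ||grad f(x_k)|| < eps or when the limit on k is reached."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_k):
        g = grad(x)
        if np.linalg.norm(g) < eps:   # stopping criterion on the gradient
            break
        x = x - alpha * g             # descent step (fixed step length)
    return x, k

# Unconstrained version of the objective in fig. 1.1: min (x1-2)^2 + (x2-1)^2.
f = lambda x: (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2
grad = lambda x: np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)])
x_star, k = grad_descent(f, grad, np.zeros(2))
```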
Scaling (“units” of the variables)
• A problem is poorly scaled if changes to x in a certain direction produce much larger variations in the value of f than do changes to x in another direction. Some algorithms (e.g. steepest descent) are sensitive to poor scaling while others (e.g. Newton’s method) are not. Generally, scale-invariant algorithms are more robust to poor problem formulations. Ex.: f(x) = 10⁹x1² + x2² (fig. 2.7).
• Related to, but not the same as, ill-conditioning. Ex.: f(x) = ½ xᵀ (1 1; 1 1+ǫ) x − (2+ǫ  2) x.

Review: fundamentals of unconstrained optimization
Assuming the derivatives ∇f(x*), ∇²f(x*) exist and are continuous in a neighborhood of x*:
• x* is a local minimizer ⇒ ∇f(x*) = 0 (first order) and ∇²f(x*) psd (second order) (necessary conditions).
• ∇f(x*) = 0, ∇²f(x*) pd ⇒ x* is a strict local minimizer (sufficient conditions).
• At a stationary point (∇f(x*) = 0), depending on ∇²f(x*):
  – pd: strict local minimizer
  – nd: strict local maximizer
  – not definite (pos. & neg. eigenvalues): saddle point
  – psd: may be a non-strict local minimizer
  – nsd: may be a non-strict local maximizer

3 Line search methods

Iteration: x_{k+1} = x_k + α_k p_k, where α_k > 0 is the step length (how far to move along p_k) and p_k is the search direction. (Figure: a search direction p_k from the iterate x_k.)

Descent direction: p_kᵀ∇f_k = ‖p_k‖‖∇f_k‖ cos θ_k < 0, where θ_k < π/2 is the angle of p_k with −∇f_k. This guarantees that f can be reduced along p_k (for a sufficiently small step):

  f(x_k + αp_k) = f(x_k) + αp_kᵀ∇f_k + O(α²)  (Taylor’s th.)
                < f(x_k)  for all sufficiently small α > 0.

• The steepest descent direction, i.e., the direction along which f decreases most rapidly, is p_k = −∇f_k (figs. 2.5, 2.6). Pf.: for any p, α: f(x_k + αp) = f(x_k) + αpᵀ∇f_k + O(α²), so the rate of change in f along p at x_k is pᵀ∇f_k (the directional derivative) = ‖p‖‖∇f_k‖ cos θ. Then min_p pᵀ∇f_k s.t. ‖p‖ = 1 is achieved when cos θ = −1, i.e., p = −∇f_k/‖∇f_k‖.
  This direction is ⊥ to the contours of f.
  Pf.: take x + p on the same contour line as x. Then, by Taylor’s th.:

    f(x+p) = f(x) + pᵀ∇f(x) + ½ pᵀ∇²f(x+ǫp)p,  ǫ ∈ (0,1)
    ⇒ cos ∠(p, ∇f(x)) = −½ pᵀ∇²f(x+ǫp)p / (‖p‖‖∇f(x)‖) → 0 as ‖p‖ → 0,

  but ‖p‖ → 0 along the contour line means p/‖p‖ becomes parallel to the tangent of the contour at x.
• The Newton direction is p_k = −∇²f_k⁻¹∇f_k. This corresponds to assuming f is locally quadratic and jumping directly to the minimum of that quadratic. Pf.: by Taylor’s th.:

    f(x_k + p) ≈ f_k + pᵀ∇f_k + ½ pᵀ∇²f_k p = m_k(p),

  which is minimized (take derivatives wrt p) by the Newton direction if ∇²f_k is pd. (✐ What happens if we assume f is locally linear (order 1)?)
  In a line search the Newton direction has a natural step length of 1.
• For most algorithms, p_k = −B_k⁻¹∇f_k where B_k is symmetric nonsingular:
  – steepest descent: B_k = I
  – Newton’s method: B_k = ∇²f(x_k)
  – quasi-Newton methods: B_k ≈ ∇²f(x_k)
  If B_k is pd then p_k is a descent direction: p_kᵀ∇f_k = −∇f_kᵀB_k⁻¹∇f_k < 0.
Here, we deal with how to choose the step length given the search direction p_k. Desirable properties: guaranteed global convergence and rapid rate of convergence.

Step length
Time/accuracy trade-off: we want to choose α_k to give a substantial reduction in f but not to spend much time on it.
• Exact line search (global or local min.): α_k: min_{α>0} φ(α) = f(x_k + αp_k). Too expensive: many evaluations of f, ∇f to find α_k even with moderate precision. (✐ Angle ∠(p_k, ∇f_{k+1}) = ?)
• Inexact line search: a typical l.s. algorithm will try a sequence of α values and stop when certain conditions hold. We want easily verifiable theoretical conditions on the step length that allow us to prove convergence of an optimization algorithm.
• Reduction in f: f(x_k + α_k p_k) < f(x_k) → not enough, can converge before reaching the minimizer.

Wolfe conditions
➀ f(x_k + α_k p_k) ≤ f(x_k) + c1 α_k ∇f_kᵀp_k (“smaller objective function”);
➁ ∇f(x_k + α_k p_k)ᵀp_k ≥ c2 ∇f_kᵀp_k (“larger gradient”),
where 0 < c1 < c2 < 1. Call φ(α) = f(x_k + αp_k); then φ′(α) = ∇f(x_k + αp_k)ᵀp_k.
➀ Sufficient decrease (Armijo condition) is equivalent to φ(0) − φ(α_k) ≥ α_k(−c1 φ′(0)) (fig. 3.3). Rejects too small decreases. The reduction is proportional both to the step length α_k and the directional derivative ∇f_kᵀp_k. In practice, c1 is very small, e.g. c1 = 10⁻⁴.
  It is satisfied for any sufficiently small α ⇒ not enough by itself, need to rule out unacceptably small steps.
➁ Curvature condition is equivalent to −φ′(α_k) ≤ −c2 φ′(0) (figs. 3.4, 3.5). Rejects too negative slopes. Reason: if the slope at α, φ′(α), is strongly negative, it’s likely we can reduce f significantly by moving further. In practice:
  c2 = 0.9 if p_k is chosen by a Newton or quasi-Newton method; c2 = 0.1 if p_k is chosen by a nonlinear conjugate gradient method.
We will concentrate on the Wolfe conditions in general, and assume they always hold when the l.s. is used as part of an optimization algorithm (allows convergence proofs).
Lemma 3.1: there always exist step lengths that satisfy the Wolfe (also the strong Wolfe) conditions if f is smooth and bounded below. (Pf.: mean value th.; see the figure of φ(α) = f(x_k + αp_k) against the line l(α) = f_k + c1 α φ′(0).)

Other useful conditions
• Strong Wolfe conditions: ➀ + ➁′ |∇f(x_k + α_k p_k)ᵀp_k| ≤ c2 |∇f_kᵀp_k| (“flatter gradient”). We don’t allow φ′(α_k) to be too positive, so we exclude points that are far from stationary points of φ.
• Goldstein conditions: ensure the step length achieves sufficient decrease but is not too short (fig. 3.6):

  ➋ f(x_k) + (1−c) α_k ∇f_kᵀp_k ≤ f(x_k + α_k p_k) ≤ f(x_k) + c α_k ∇f_kᵀp_k ➊,  0 < c < ½.

  ➋ controls the step from below; ➊ is the first Wolfe condition. Disadvantage: ➋ may exclude all minimizers of φ. (✐ Do the Wolfe conditions exclude minimizers?)

Sufficient decrease and backtracking l.s. (Armijo l.s.)
Start with a largish step size and decrease it (times ρ < 1) until it meets the sufficient decrease condition ➀ (Algorithm 3.1).
• It is a heuristic approach to avoid a more careful l.s. that satisfies the Wolfe conditions.
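Algorithm 3.1 can be sketched as follows: a minimal backtracking loop driving steepest descent. The test problem, the safeguard on α and the parameter values (c1 = 10⁻⁴ as in the notes, ρ = 0.5) are illustrative assumptions.

```python
import numpy as np

def backtracking(f, gk, xk, pk, alpha0=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha (times rho) until the sufficient decrease
    condition (1) holds: f(x+a*p) <= f(x) + c1*a*grad_f'p."""
    alpha, fk, slope = alpha0, f(xk), gk @ pk   # slope < 0 for a descent dir.
    while f(xk + alpha * pk) > fk + c1 * alpha * slope and alpha > 1e-12:
        alpha *= rho                            # step too long: backtrack
    return alpha

# Steepest descent with backtracking on a poorly scaled quadratic.
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

x = np.array([1.0, 1.0])
for _ in range(200):
    g = grad(x)
    p = -g                                      # descent direction
    x = x + backtracking(f, g, x, p) * p
```

Note how the accepted α stays within a factor ρ of the last rejected (too long) step, which is why the heuristic works well in practice.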
• It always terminates because ➀ is satisfied by all sufficiently small α.
• Works well in practice because the accepted α is near (times ρ) the previous α, which was rejected for being too long.
• The initial step length is α0 = 1 for Newton and quasi-Newton methods.

Step-length selection algorithms
• They take a starting value of α and generate a sequence {α_i} that satisfies the Wolfe conditions. Usually they use interpolation, e.g. approximating φ(α) as a cubic polynomial.
• There are also derivative-free methods (e.g. the golden section search) but they are less efficient and can’t benefit from the Wolfe conditions (to prove global convergence).
• We’ll just use backtracking for simplicity.

Convergence of line search methods
• Global convergence: ‖∇f_k‖ → 0 as k → ∞, i.e., convergence to a stationary point for any starting point x0. To converge to a minimizer we need more information, e.g. the Hessian.
• We give a theorem on the search direction p_k to obtain global convergence, focusing on the angle θ_k between p_k and the steepest descent direction −∇f_k: cos θ_k = −∇f_kᵀp_k/(‖∇f_k‖‖p_k‖), and assuming the Wolfe conditions. Similar theorems exist for the strong Wolfe and Goldstein conditions.
• Important theorem: e.g. it shows that the steepest descent method is globally convergent; for other algorithms, it describes how far p_k can deviate from the steepest descent direction and still give rise to a globally convergent iteration.
• Th. 3.2 (Zoutendijk). Consider an iterative method x_{k+1} = x_k + α_k p_k with starting point x0, where p_k is a descent direction and α_k satisfies the Wolfe conditions. Suppose f is bounded below in R^n and cont. diff. in an open set N containing the level set L = {x: f(x) ≤ f(x0)}, and that ∇f is Lipschitz continuous on N (⇔ ∃L > 0: ‖∇f(x) − ∇f(x̃)‖ ≤ L‖x − x̃‖ ∀x, x̃ ∈ N; weaker than bounded Hessian). Then

  Σ_{k≥0} cos²θ_k ‖∇f_k‖² < ∞  (Zoutendijk’s condition).

  (Proof: ➁ + Lipschitz condition lower-bound α_k; ➀ upper-bounds f_{k+1}; telescope.)
Zoutendijk’s condition implies cos²θ_k ‖∇f_k‖² → 0.
Thus if cos θ_k ≥ δ > 0 ∀k for fixed δ, then ‖∇f_k‖ → 0 (global convergence).
• Examples:
  – Steepest descent method: p_k = −∇f_k ⇒ cos θ_k = 1 ⇒ global convergence (fig. 3.7). Intuitive method, but very slow on difficult problems.
  – Newton-like method: p_k = −B_k⁻¹∇f_k with B_k symmetric, pd and with bounded condition number: ‖B_k‖‖B_k⁻¹‖ ≤ M ∀k. (✐ ill-conditioning ⇒ ∇f ⊥ Newton dir.) Then cos θ_k ≥ 1/M (Pf.: exe. 3.5) ⇒ global convergence.
    In other words, if the B_k are pd (which is required for descent directions), have bounded c.n., and the step lengths satisfy the Wolfe conditions ⇒ global convergence. This includes steepest descent and some Newton and quasi-Newton methods.
• For some methods (e.g. conjugate gradients) we may have directions that are almost ⊥ ∇f_k when the Hessian is ill-conditioned. It is still possible to show global convergence by assuming that we take a steepest descent step from time to time. “Turning” the directions toward −∇f_k so that cos θ_k > δ for some preselected δ > 0 is generally a bad idea: it slows down the method (difficult to choose a good δ) and also destroys the invariance properties of quasi-Newton methods.
• Fast convergence can sometimes conflict with global convergence, e.g. steepest descent is globally convergent but quite slow; Newton’s method converges very fast near a solution, but away from the solution its steps may not even be descent directions (indeed, it may be looking for a maximizer). The challenge is to design algorithms with both fast and global convergence.

Rate of convergence
• Steepest descent: p_k = −∇f_k.
  Th. 3.4: assume f is twice cont. diff. and that the iterates generated by the steepest descent method with exact line searches converge to a point x* where the Hessian ∇²f(x*) is pd. Then

    f(x_{k+1}) − f(x*) ≤ r² (f(x_k) − f(x*)),  where r = (λ_n − λ_1)/(λ_n + λ_1) = (κ − 1)/(κ + 1),

  0 < λ_1 ≤ ··· ≤ λ_n are the eigenvalues of ∇²f(x*) and κ = λ_n/λ_1 its condition number.
  (Pf.: near the min., f is approx. quadratic.)
  (For quadratic functions (with matrix Q): ‖x_{k+1} − x*‖_Q ≤ r‖x_k − x*‖_Q, where ‖x‖²_Q = xᵀQx and ½‖x − x*‖²_Q = f(x) − f(x*).)
  Thus the convergence rate is linear, with two extremes:
  – Very well conditioned Hessian: λ_1 ≈ λ_n; very fast, since the steepest descent direction approximately points to the minimizer.
  – Ill-conditioned Hessian: λ_1 ≪ λ_n; very slow, zigzagging behaviour. This is the typical situation in practice.
• Newton’s method: p_k = −∇²f_k⁻¹∇f_k.
  Th. 3.5: assume f twice diff. and ∇²f Lipschitz cont. in a neighborhood of a solution x* at which the second-order sufficient conditions hold. Consider the iterates x_{k+1} = x_k − ∇²f_k⁻¹∇f_k. Then, if x0 is sufficiently close to x*: (x_k) → x* and (‖∇f_k‖) → 0, both quadratically.
  (Pf.: upper-bound ‖x_k + p_k − x*‖ by a constant × ‖x_k − x*‖² using Taylor + Lipschitz.)
  That is, near the solution, where the Hessian is pd, the convergence rate is quadratic if we always take α_k = 1 (no line search at all). The theorem does not apply where the Hessian is not pd (away from the solution); practical Newton methods avoid this.
• Quasi-Newton methods: p_k = −B_k⁻¹∇f_k with B_k symmetric pd.
  Th. 3.7: assume f is twice cont. diff. and that the iterates x_{k+1} = x_k − B_k⁻¹∇f_k converge to a point x* at which the second-order sufficient conditions hold. Then (x_k) → x* superlinearly iff

    lim_{k→∞} ‖(B_k − ∇²f(x*))p_k‖ / ‖p_k‖ = 0.

  Thus, the convergence rate is superlinear if the matrices B_k become increasingly accurate approximations of the Hessian along the search directions p_k, and near the solution the step length α_k is always 1, i.e., x_{k+1} = x_k + p_k. In practice, we must always try α = 1 in the line search and accept it if it satisfies the Wolfe conditions.

Newton’s method with Hessian modification
Newton step: solution of the n×n linear system ∇²f(x_k) p_k^N = −∇f(x_k).
• Near a minimizer the Hessian is pd ⇒ quadratic convergence (with unit steps α_k = 1).
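The quadratic rate of Th. 3.5 can be observed numerically. A minimal sketch of the pure Newton iteration (unit steps, no line search) on a hypothetical test function whose Hessian is pd everywhere:

```python
import numpy as np

# f(x) = x1^4 + x1^2 + x2^2: minimizer x* = 0, Hessian pd everywhere.
grad = lambda x: np.array([4.0 * x[0] ** 3 + 2.0 * x[0], 2.0 * x[1]])
hess = lambda x: np.diag([12.0 * x[0] ** 2 + 2.0, 2.0])

x = np.array([0.5, 0.5])
errors = []
for _ in range(8):
    # Newton step with alpha = 1: solve Hess(x) p = -grad(x).
    x = x - np.linalg.solve(hess(x), grad(x))
    errors.append(np.linalg.norm(x))     # distance to the minimizer x* = 0
```

The error roughly squares at each iteration once the iterate is close to x*, so a handful of steps reach machine precision.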
• Away from a minimizer the Hessian may not be pd, or may be close to singular ⇒ p_k^N may be an ascent direction, or too long. A too long direction may not be good even if it is a descent direction, because it violates the spirit of Newton’s method (which relies on a quadratic approximation valid near the current iterate); thus it may require many iterations in the line search.

Modified Newton method: modify the Hessian to make it sufficiently pd.
• We solve (by factorizing B_k, e.g. Cholesky) the system B_k p_k = −∇f(x_k), where B_k = ∇²f(x_k) + E_k (modified Hessian) such that B_k is sufficiently pd. Then we use p_k (which is a descent direction because B_k is pd) in a line search with Wolfe conditions.
• Global convergence (‖∇f_k‖ → 0) by Zoutendijk’s condition if κ(B_k) = ‖B_k‖‖B_k⁻¹‖ ≤ M (which can be proven for several types of modifications).
• Quadratic convergence near the minimizer, where the Hessian is pd, if E_k = 0 there.

Diagonal Hessian modification: E_k = λI with λ ≥ 0 large enough that B_k is sufficiently pd: λ = max(0, δ − λ_min(∇²f(x_k))) for some δ > 0.
• We want λ as small as possible to preserve the Hessian information along the positive-curvature directions; but if λ is too small, B_k is nearly singular and the step too long.
• The method behaves like pure Newton for a pd Hessian and λ = 0, like steepest descent for λ → ∞, and finds some descent direction for intermediate λ.
Other types of Hessian modification exist, but there is no consensus about which one is best:
• Direct modification of the eigenvalues (needs the spectral decomposition of the Hessian). If A = QΛQᵀ (spectral th.), then ΔA = Q diag(max(0, δ − λ_i)) Qᵀ is the correction with minimum Frobenius norm s.t. λ_min(A + ΔA) ≥ δ.
• Modified Cholesky factorization (add terms to the diagonal on the fly).

Review: line search (l.s.) methods
• Iteration x_{k+1} = x_k + α_k p_k:
  – search direction p_k given by the optimization method;
  – l.s. ≡ determine the step length α_k.
• We want:
  – Descent direction: p_kᵀ∇f_k = ‖p_k‖‖∇f_k‖ cos θ_k < 0. Ex.: steepest descent dir. −∇f_k; Newton dir. −∇²f_k⁻¹∇f_k; quasi-Newton dir. −B_k⁻¹∇f_k (B_k pd ⇒ descent dir.).
  – Inexact l.s.: approximate solution of min_{α>0} f(x_k + αp_k) (faster convergence of the overall algorithm). Even if the l.s. is inexact, if α_k satisfies certain conditions at each k then the overall algorithm has global convergence. Ex.: the Wolfe conditions (others exist), most crucially the sufficient decrease in f. A simple l.s. algorithm that often (not always) satisfies the Wolfe conditions is backtracking (better ones exist).
• Global convergence (to a stationary point): ‖∇f_k‖ → 0.
  Zoutendijk’s th.: descent dir. + Wolfe + mild conditions on f ⇒ Σ_{k≥0} cos²θ_k ‖∇f_k‖² < ∞.
  Corollary: cos θ_k ≥ δ > 0 ∀k ⇒ global convergence. But we often want cos θ_k ≈ 0.
  Ex.: steepest descent and some Newton-like methods have global convergence.
• Convergence rate:
  – Steepest descent: linear, with rate r² where r = (λ_n − λ_1)/(λ_n + λ_1); slow for ill-conditioned problems.
  – Quasi-Newton: superlinear under certain conditions.
  – Newton: quadratic near the solution.
• Modified-Hessian Newton’s method: B_k = ∇²f(x_k) + λI (diagonal modification; others exist) s.t. B_k sufficiently pd. λ = 0: pure Newton step; λ → ∞: steepest descent direction (with step → 0). Descent direction with moderate length away from the minimizer, pure Newton’s method near the minimizer. Global convergence, with quadratic rate, if κ(B_k) ≤ M and the l.s. uses the Wolfe conditions.

4 Trust region methods

Iteration: x_{k+1} = x_k + p_k, where p_k is the approximate minimizer of the model m_k(p) in a region around x_k (the trust region); if p_k does not produce a sufficient decrease in f, we shrink the region and try again.
• The trust region is typically a ball B(x_k; Δ_k). Other region shapes may also be used: elliptical (to adapt to ill-conditioned Hessians) and box-shaped (with linear constraints).
• Each time we decrease Δ after failure of a candidate iterate, the step from x_k is shorter and usually points in a different direction.
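The basic trust-region loop can be sketched as follows, using the Cauchy point (steepest descent restricted to the region, discussed later in this chapter) as the approximate subproblem solver. The radius-update thresholds, η, and the test problem are illustrative choices, not the book's prescribed values.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Minimizer of the quadratic model along -g within ||p|| <= delta."""
    gn = np.linalg.norm(g)
    p = -(delta / gn) * g                 # boundary step along -g
    gBg = g @ B @ g
    if gBg > 0:                           # interior minimizer possible
        p = min(1.0, gn ** 3 / (delta * gBg)) * p
    return p

def trust_region(f, grad, hess, x0, delta=1.0, delta_max=10.0, eta=0.15,
                 eps=1e-8, max_k=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_k):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        B = hess(x)
        p = cauchy_point(g, B, delta)
        pred = -(g @ p + 0.5 * p @ B @ p)      # m(0) - m(p) >= 0
        rho = (f(x) - f(x + p)) / pred          # actual / predicted reduction
        if rho < 0.25:
            delta *= 0.25                       # poor model: shrink region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2.0 * delta, delta_max) # good model at boundary: expand
        if rho > eta:
            x = x + p                           # accept the step
    return x

f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
hess = lambda x: np.diag([2.0, 20.0])
x_min = trust_region(f, grad, hess, [1.0, 1.0])
```

On this quadratic the model is exact, so ρ = 1 at every step and the radius only grows; on a general f the shrink/expand logic is what keeps the steps inside the region where the model is trustworthy.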
• Trade-off in Δ:
  – too small: good model, but we can only take a small step, so slow convergence;
  – too large: bad model, we may have to reduce Δ and repeat.
  In practice, we increase Δ if previous steps showed the model reliable (fig. 4.1).
• Linear model: m_k(p) = f_k + ∇f_kᵀp s.t. ‖p‖ ≤ Δ_k ⇒ p_k = −Δ_k ∇f_k/‖∇f_k‖, i.e., steepest descent with step length α_k given by Δ_k (no news). (✐ What is the approximation error of the linear model?)
• Quadratic model: m_k(p) = f_k + ∇f_kᵀp + ½ pᵀB_k p, where B_k is symmetric but need not be psd. The approximation error is O(‖p‖³) if B_k = ∇²f_k (trust-region Newton method), O(‖p‖²) otherwise. In both cases, the model is accurate for small ‖p‖, which guarantees we can always find a good step for sufficiently small Δ_k. (✐ What happens if ρ_k < 0 but ‖p_k‖ < Δ_k with an arbitrary m_k?)
Two issues remain: how to choose Δ_k? How to find p_k?

Choice of the trust-region radius Δ_k
Define the ratio ρ_k = (f_k − f(x_k + p_k)) / (m_k(0) − m_k(p_k)) = actual reduction / predicted reduction.
• Predicted reduction ≥ 0 always (since p = 0 is in the region).
• If the actual reduction is < 0 the new objective value is larger, so reject the step.
• Depending on ρ_k:
  – ≈ 1: good agreement between f and the model m_k, so expand Δ_k if ‖p_k‖ = Δ_k (otherwise, don’t interfere);
  – > 0 but not close to 1: keep Δ_k;
  – close to 0 or negative: shrink Δ_k.
(Algorithm 4.1.)

The optimization subproblem
A quadratically-constrained quadratic program:

  min_{p∈R^n} m_k(p) = f_k + ∇f_kᵀp + ½ pᵀB_k p  s.t.  ‖p‖ ≤ Δ_k.

• If B_k is pd and ‖B_k⁻¹∇f_k‖ ≤ Δ_k, the solution is the unconstrained minimizer p_k^B = −B_k⁻¹∇f_k (the full step).
• Otherwise, we compute an approximate solution (finding an exact one is too costly).

Characterization of the exact solution of the optimization subproblem
Th. 4.1: p* is a global solution of the trust-region problem min_{‖p‖≤Δ} m(p) = f + gᵀp + ½ pᵀBp iff p* is feasible and ∃λ ≥ 0 such that:
  1. (B + λI)p* = −g
  2. λ(Δ − ‖p*‖) = 0 (i.e., λ = 0 or ‖p*‖ = Δ)
  3. B + λI is psd.
Note:
• Using B + λI instead of B in the model transforms the problem into min_p m(p) + (λ/2)‖p‖², and so for large λ > 0 the minimizer is strictly inside the region. As we decrease λ the minimizer moves to the region boundary, and the theorem holds for that λ.
• If λ > 0 then the direction is antiparallel to the model gradient, and so the region is tangent to the model contour at the solution: λp* = −g − Bp* = −∇m(p*).
• This is useful for Newton’s method and is the basis of the Levenberg-Marquardt algorithm for nonlinear least-squares problems.

Approximate solution of the optimization subproblem
Several approaches:
• The Cauchy point p_k^C is the minimizer of m_k along p_k = −∇m_k(0) = −g_k (eq. 4.12, fig. 4.3), which lies either at −(Δ_k/‖g_k‖)g_k (boundary) or at −τ(Δ_k/‖g_k‖)g_k with 0 < τ < 1 (interior). Thus, it is steepest descent with a certain step size, and gives a baseline solution. Several methods improve over the Cauchy point.
• Iterative solution of the subproblem (based on the characterization):
  1. Try λ = 0: solve Bp* = −g and see whether ‖p*‖ ≤ Δ (full step).
  2. If ‖p*‖ > Δ, define p(λ) = −(B + λI)⁻¹g for λ sufficiently large that B + λI is pd, and seek a smaller value λ > 0 such that ‖p(λ)‖ = Δ (1D root finding for λ; iterative solution factorizing the matrix B + λI) (fig. 4.5).
     With B = QΛQᵀ (spectral th., eigenvalues λ_1 ≤ ··· ≤ λ_n): p(λ) = −(B + λI)⁻¹g = −Σ_j (q_jᵀg/(λ_j + λ)) q_j, so ‖p(λ)‖² = Σ_j (q_jᵀg)²/(λ_j + λ)². If q_1ᵀg ≠ 0, find λ* > −λ_1 using Newton’s method for root finding on r(λ) = 1/Δ − 1/‖p(λ)‖ (since 1/‖p(λ)‖ ≈ (λ + λ_1)/constant, nearly linear in λ). One can show this is equivalent to Alg. 4.3, which uses Cholesky factorizations (limit to ~3 steps).
• Dogleg method (for pd B): find the minimizer along the two-leg path 0 → p^C → p^B within the trust region (fig. 4.4). m(p) decreases along the dogleg path (lemma 4.2). The minimizer results from a 2nd-degree polynomial.
• Two-dimensional subspace minimization: minimize over the span of the dogleg path (eq. 4.17). The minimizer results from a 4th-degree polynomial.

Global and local convergence
• Under certain assumptions, these approximate algorithms have global convergence to a stationary point (Th. 4.6). Essentially, they must ensure a sufficient decrease m_k(0) − m_k(p_k) at each step; the Cauchy point already achieves such a decrease.
• If using B_k = ∇²f_k, and if the region eventually becomes inactive and we always take the full step, the convergence is quadratic (the method becomes Newton’s method).

Review: trust-region methods
• Iteration x_{k+1} = x_k + p_k:
  – p_k = approximate minimizer of the model m_k of f in the trust region: min_{‖p‖≤Δ_k} m_k(p).
  – If p_k does not produce sufficient decrease ⇒ region too big: shrink it and try again.
    Insufficient decrease ⇔ ρ_k = (f_k − f(x_k + p_k))/(m_k(0) − m_k(p_k)) ≈ 0 or < 0.
• Quadratic model: m_k(p) = f_k + ∇f_kᵀp + ½ pᵀB_k p; B_k need not be psd.
  Cauchy point: minimizer of m_k within the trust region along the steepest descent direction.
  The exact, global minimizer of m_k within the trust region ‖p‖ ≤ Δ_k satisfies certain conditions (Th. 4.1) that can be used to find an approximate solution. Simpler methods exist (dogleg, 2D subspace minimization).
• Global convergence under mild conditions if sufficient decrease; quadratic rate if B_k = ∇²f_k and the region eventually becomes inactive so we always take the full step.
• Mainly used for Newton and Levenberg-Marquardt methods.

5 Conjugate gradient methods

• Linear conjugate gradient method: solves a large linear system of equations.
• Nonlinear conjugate gradient method: adaptation of linear CG for nonlinear optimization.
Key features: requires no matrix storage, faster than steepest descent.
Assume throughout this chapter that A is an n×n symmetric pd matrix, φ(x) = ½ xᵀAx − bᵀx and ∇φ(x) = Ax − b = r(x) (residual).
Idea: • steepest descent (1 or∞ iterations): • coordinate descent (n or∞ iterations): • Newton’s method (1 iteration): • CG (n iterations): The linear conjugate gradient method Iterative method for solving the two equivalent problems (i.e., both have the same, unique solution ∗ x ): 1 T T Linear system Ax =b⇐⇒ Optimization problem minφ(x) = x Ax−b x. 2 T • A set of nonzero vectorsp ,p ,...,p is conjugate wrt A iff p Ap = 0∀i6=j. 0 1 l j i P T Conjugacy⇒ linear independence (Pf.: left-multiply σ p times p A.) i i j 1/2 1/2 1/2 NoteA p ,A p ,...,A p are orthogonal. 0 1 l • Th. 5.1: we can minimizeφ inn steps at most by successively minimizingφ along then vectors in a conjugate set. n Conjugate direction method: given a starting point x ∈R and a set of conjugate directions 0 T r p k k p ,...,p , generate the sequencex with x = x +α p , where α =− (✐ 0 n−1 k k+1 k k k k T p Ap k k P n−1 ∗ denominator 6=0?) (exact line search) (Proof: x =x + α p .) 0 i i i=0 • Intuitive idea: – A diagonal: quadratic function φ can be minimized along the coordinate directions fig. 5.1 e ,...,e in n iterations. 1 n – A not diagonal: the coordinate directions don’t minimize φ in n iterations; but the fig. 5.2 −1 ˆ ˆ ˆ ˆ variable change x = S x with S = (p p ...p ) diagonalizes A: φ(x) = φ(Sx) = 0 1 n−1 1 T T T T x ˆ (S AS)x ˆ−(S b) xˆ. (✐ why isS invertible?) 2 coordinate search in x ˆ⇔ conjugate direction search in x. • Th. 5.2 (expanding subspace minimization): for the conjugate directions method: T – r p = 0 for i = 0,...,k−1 (the current residual is⊥ to all previous search directions). i k (Intuition: if r =∇φ(x ) had a nonzero projection along p , it would not be a minimum.) i k k – x is the minimizer of φ over the set x +spanp ,...,p . k 0 0 k−1 That is, the method minimizes φ piecewise, one direction at a time. (Proof: induction plus the fact r =r +α Ap (implied by r =Ax −b and x =x +α p ).) k+1 k k k k k k+1 k k k 16• How to obtain conjugate directions? Many ways, e.g. 
using the eigenvectors ofA or transform- ing a set of l.i. vectors into conjugate directions with a procedure similar to Gram-Schmidt. But these are computationally expensive • The conjugate gradient method generates conjugate direction p by using only the previous k one, p : k−1 – p is a l.c. of−∇φ(x ) andp s.t. being conjugate top ⇒p =−r +β p with k k k−1 k−1 k k k k−1 T r Ap k−1 k β = . k T p Ap k−1 k−1 – We start with the steepest descent direction: p =−∇φ(x ) =−r . 0 0 0 Algorithm 5.1 (CG; preliminary version): given x : 0 r ←∇φ(x ) =Ax −b, p ←−r , k→ 0 Start with steepest descent dir. from x 0 0 0 0 0 0 while r =6 0 r =0 means we are done, which may happen before n steps k k T r p k k α ←− , x ←x +α p Exact line search k T k+1 k k k p Ap k k r =Ax −b New residual k+1 k+1 T r Ap k k+1 β ← , p =−r +β p New l.s. direction p is conjugate to p ,p ,...,p k+1 k k−1 0 k+1 T k+1 k+1 k+1 k p Ap k k k←k+1 end To prove the algorithm works, we need to prove it builds a conjugate direction set. ∗ • Th. 5.3: suppose that the kth iterate of the CG method is not the solution x . Then: T – r r = 0 for i = 0,...,k−1 (the gradients at all iterates are⊥ to each other.) i k  k – spanr ,...,r = spanp ,...,p = span r ,Ar ,...,A r = Krylov subspace of 0 k 0 k 0 0 0 degree k for r . (So r orthogonal basis,p basis.) 0 k k (Intuitive explanation: compute r ,p for k =1,2 usingr =r +A p ,p =−r +β p .) k k k+1 k k k k+1 k+1 k+1 k T – p Ap = 0 for i = 0,...,k−1 (conjugate wrt A.) i k ∗ Thus the sequencex converges to x in at most n steps. k Important: the theorem needs that the first direction be the steepest descent dir. T T (Proof: r r =p Ap = 0 follow fori =k−1 by construction and for ik−1 by induction.) i i k k • We can simplify a bit the algorithm using the following results: – p =−r +β p (construction of the kth direction) k+1 k+1 k+1 k – r =r +α Ap (definition of r and x ) k+1 k k k k k+1 T T – r p =r r = 0 for ik (th. 
⇒ α_k ← (r_k^T r_k)/(p_k^T A p_k) = ‖r_k‖²/‖p_k‖²_A, r_{k+1} ← r_k + α_k A p_k, β_{k+1} ← (r_{k+1}^T r_{k+1})/(r_k^T r_k) = ‖r_{k+1}‖²/‖r_k‖² in algorithm 5.2.
• Space complexity: O(n), since it computes x, r, p at step k+1 given only their values at step k: no matrix storage.
• Time complexity: the bottleneck is the matrix-vector product A p_k, which is O(n²) (maybe less if A has structure) ⇒ in n steps, O(n³), similar to other methods for solving linear systems (e.g. Gauss factorization).
• Advantages: no matrix storage; does not alter A; does not introduce fill (for a sparse matrix A); fast convergence.
• Disadvantages: sensitive to roundoff errors.
It is recommended for large systems.

Rate of convergence

• Here we don't mean the asymptotic rate (k → ∞), because CG converges in at most n steps for a quadratic function. But CG can get very close to the solution in far fewer than n steps, depending on the eigenvalue structure of A:
  – Th. 5.4: if A has only r distinct eigenvalues, CG converges in at most r steps.
  – If the eigenvalues of A occur in r distinct clusters, CG will approximately solve the problem in r steps (fig. 5.4).
• Two bounds (using ‖x‖²_A = x^T A x), useful to estimate the convergence rate in advance if we know something about the eigenvalues of A:
  – Th. 5.5: ‖x_{k+1} − x*‖²_A ≤ ((λ_{n−k} − λ_1)/(λ_{n−k} + λ_1))² ‖x_0 − x*‖²_A if A has eigenvalues λ_1 ≤ ··· ≤ λ_n.
  – ‖x_k − x*‖_A ≤ 2 ((√κ − 1)/(√κ + 1))^k ‖x_0 − x*‖_A if κ = λ_n/λ_1 is the c.n. (this bound is very coarse).
  Recall that for steepest descent we had a similar expression, but with (κ − 1)/(κ + 1) instead of (√κ − 1)/(√κ + 1).
• Preconditioning: change of variables x̂ = Cx so that the new matrix Â = C^{−T} A C^{−1} has a clustered spectrum or a small condition number (thus faster convergence). Besides being effective in this sense, a good preconditioner C should take little storage and allow an inexpensive solution of C x̂ = x. Finding good preconditioners C depends on the problem (the structure of A); e.g. good ones exist when A results from a discretized PDE.
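Algorithm 5.2 can be sketched in numpy as follows (the random SPD test system is an illustrative assumption; note the single matrix-vector product per iteration and the O(n) extra storage):

```python
import numpy as np

def cg(A, b, x0, tol=1e-10, maxiter=None):
    """Linear conjugate gradient (algorithm 5.2) for SPD A."""
    x = x0.copy()
    r = A @ x - b                  # r_0 = A x_0 - b
    p = -r                         # p_0 = steepest descent direction
    k = 0
    if maxiter is None:
        maxiter = len(b)           # converges in at most n steps (th. 5.3)
    while np.linalg.norm(r) > tol and k < maxiter:
        Ap = A @ p                 # the only matrix-vector product per iteration
        rr = r @ r
        alpha = rr / (p @ Ap)      # alpha_k = r_k^T r_k / p_k^T A p_k
        x = x + alpha * p
        r = r + alpha * Ap         # r_{k+1} = r_k + alpha_k A p_k
        beta = (r @ r) / rr        # beta_{k+1} = ||r_{k+1}||^2 / ||r_k||^2
        p = -r + beta * p
        k += 1
    return x, k

# Illustrative SPD system.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)        # SPD by construction
b = rng.standard_normal(5)
x, k = cg(A, b, np.zeros(5))
print(k, np.allclose(A @ x, b))
```

In exact arithmetic the loop needs at most n = 5 iterations; in floating point the residual after n steps is at roundoff level for this well-conditioned system.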
Nonlinear conjugate gradient methods

We adapt the linear CG (which minimizes a quadratic function φ) to a nonlinear function f.

The Fletcher-Reeves method  {α_k is determined by an inexact line search; r_k = ∇f_k}

Algorithm 5.4: given x_0:
  Evaluate f_0 ← f(x_0), ∇f_0 ← ∇f(x_0)
  p_0 ← −∇f_0, k ← 0
  while ∇f_k ≠ 0
    x_{k+1} ← x_k + α_k p_k with inexact l.s. for α_k
    Evaluate ∇f_{k+1}
    β^FR_{k+1} ← (∇f_{k+1}^T ∇f_{k+1})/(∇f_k^T ∇f_k), p_{k+1} ← −∇f_{k+1} + β^FR_{k+1} p_k
    k ← k + 1
  end

• Uses no matrix operations; requires only f and ∇f.
• Line search for α_k: we need each direction p_{k+1} = −∇f_{k+1} + β^FR_{k+1} p_k to be a descent direction, i.e., ∇f_{k+1}^T p_{k+1} = −‖∇f_{k+1}‖² + β^FR_{k+1} ∇f_{k+1}^T p_k < 0.
  – Exact l.s.: α_k is a local minimizer along p_k ⇒ ∇f_{k+1}^T p_k = 0 ⇒ p_{k+1} is descent.
  – Inexact l.s.: p_{k+1} is descent if α_k satisfies the strong Wolfe conditions (lemma 5.6):
      f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f_k^T p_k
      |∇f(x_k + α_k p_k)^T p_k| ≤ c_2 |∇f_k^T p_k|
    where 0 < c_1 < c_2 < ½ (note we required a looser 0 < c_1 < c_2 < 1 in ch. 3).

The Polak-Ribière method

• Differs in the parameter β_{k+1}, defined as β^PR_{k+1} = (∇f_{k+1}^T (∇f_{k+1} − ∇f_k))/‖∇f_k‖².
• For strongly convex quadratic functions and exact l.s., β^PR_k = β^FR_k = β^HS_k = the β_k of linear CG (since the successive gradients are mutually ⊥).
• For nonlinear functions in general, with inexact l.s., PR is empirically more robust and efficient than FR.
• The strong Wolfe conditions don't guarantee that p_k is a descent direction.
• Other variants similar to PR: Hestenes-Stiefel β^HS_{k+1} = (∇f_{k+1}^T (∇f_{k+1} − ∇f_k))/((∇f_{k+1} − ∇f_k)^T p_k) (derived by requiring p_{k+1}^T Ḡ_k p_k = 0 for the average Hessian Ḡ_k = ∫₀¹ ∇²f(x_k + t α_k p_k) dt); β_{k+1} = ‖∇f_{k+1}‖²/((∇f_{k+1} − ∇f_k)^T p_k); etc.

Restarts: restarting the iteration every n steps (by setting β_k = 0, i.e., taking a steepest descent step) periodically refreshes the algorithm and works well in practice.
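A minimal sketch of FR/PR with restarts every n steps. For brevity it uses a backtracking (Armijo) line search plus a fallback to steepest descent whenever p_k fails to be a descent direction, instead of a strong-Wolfe search; the strictly convex test function is an illustrative assumption:

```python
import numpy as np

def nonlinear_cg(f, grad, x0, variant="PR", tol=1e-8, maxiter=5000):
    """Nonlinear CG (algorithm 5.4 with the FR or PR beta), restarted every n
    steps. Backtracking (Armijo) line search for simplicity; a practical code
    would enforce the strong Wolfe conditions instead."""
    x = x0.copy()
    n = len(x0)
    g = grad(x)
    p = -g
    for k in range(maxiter):
        if np.linalg.norm(g) <= tol:
            break
        if g @ p >= 0:                  # safeguard: not a descent direction,
            p = -g                      # fall back to steepest descent
        alpha, fx, gTp = 1.0, f(x), g @ p
        while f(x + alpha * p) > fx + 1e-4 * alpha * gTp:
            alpha *= 0.5                # backtrack until sufficient decrease
        x = x + alpha * p
        g_new = grad(x)
        if (k + 1) % n == 0:
            beta = 0.0                  # restart: steepest descent step
        elif variant == "FR":
            beta = (g_new @ g_new) / (g @ g)
        else:                           # "PR"
            beta = g_new @ (g_new - g) / (g @ g)
        p = -g_new + beta * p
        g = g_new
    return x

# Illustrative strictly convex test: a quadratic plus a mild quartic term.
A = np.diag([1.0, 5.0, 25.0])
b = np.array([1.0, 1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x + 0.25 * x[0]**4
grad = lambda x: A @ x - b + np.array([x[0]**3, 0.0, 0.0])
x = nonlinear_cg(f, grad, np.zeros(3))
print(np.linalg.norm(grad(x)))          # small: (approximate) stationary point
```

Switching variant between "PR" and "FR" changes only the β formula; with a strong-Wolfe line search the descent safeguard would be unnecessary for FR (lemma 5.6).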
Restarting every n steps also leads to n-step quadratic convergence, ‖x_{k+n} − x*‖ ≤ M ‖x_k − x*‖²; intuitively, because near the minimum f is approximately quadratic, after a restart we have (approximately) the linear CG method (which requires p_0 = steepest descent direction). For large n (when CG is most useful) restarts may never occur, since an approximate solution may be found in fewer than n steps.

Global convergence

• With restarts and the strong Wolfe conditions, the algorithms (FR, PR) have global convergence, since they include as a subsequence the steepest descent method (which is globally convergent with the Wolfe conditions).
• Without restarts:
  – FR has global convergence with the strong Wolfe conditions above.
  – PR does not have global convergence, even though in practice it is better.
• In general, the theory on the rate of convergence of CG is complex and assumes exact l.s.