Lecture Notes on Iterative Optimization Algorithms

Charles L. Byrne
Department of Mathematical Sciences
University of Massachusetts Lowell
December 8, 2014

Chapter 1: Overview and Examples

1.1 Overview
1.2 Auxiliary-Function Methods
    1.2.1 Barrier-Function Methods: An Example
    1.2.2 Barrier-Function Methods: Another Example
    1.2.3 Barrier-Function Methods for the Basic Problem
    1.2.4 The SUMMA Class
    1.2.5 Cross-Entropy Methods
    1.2.6 Alternating Minimization
    1.2.7 Penalty-Function Methods
1.3 Fixed-Point Methods
    1.3.1 Gradient Descent Algorithms
    1.3.2 Projected Gradient Descent
    1.3.3 Solving Ax = b
    1.3.4 Projected Landweber Algorithm
    1.3.5 The Split Feasibility Problem
    1.3.6 Firmly Nonexpansive Operators
    1.3.7 Averaged Operators
    1.3.8 Useful Properties of Operators on H
    1.3.9 Subdifferentials and Subgradients
    1.3.10 Monotone Operators
    1.3.11 The Baillon-Haddad Theorem

1.1 Overview

The basic problem we consider in these notes is to minimize a function $f: X \to \mathbb{R}$ over $x$ in $C \subseteq X$, where $X$ is an arbitrary nonempty set. Until it is absolutely necessary, we shall not impose any structure on $X$ or on $f$. One reason for avoiding structure on $X$ and $f$ is that we can actually achieve something interesting without it.
The second reason is that when we do introduce structure, it will not necessarily be that of a metric space; for instance, cross-entropy and other Bregman distances play an important role in some of the iterative optimization algorithms I discuss in these notes.

We investigate two classes of iterative optimization methods: sequential auxiliary-function (AF) methods, and fixed-point (FP) methods. As we shall see, there is some overlap between these two classes of methods. As is appropriate for an overview, in this chapter we make a number of assertions without providing proofs. Proofs of most of these assertions will be given in subsequent chapters.

1.2 Auxiliary-Function Methods

For $k = 1, 2, \ldots$ we minimize the function

    $G_k(x) = f(x) + g_k(x)$    (1.1)

over $x$ in $X$ to get $x^k$. We shall say that the functions $g_k(x)$ are auxiliary functions if they have the properties $g_k(x) \ge 0$ for all $x \in X$, and $g_k(x^{k-1}) = 0$. We then say that the sequence $\{x^k\}$ has been generated by an auxiliary-function (AF) method. We then have the following result.

Proposition 1.1 If the sequence $\{x^k\}$ is generated by an AF method, then the sequence $\{f(x^k)\}$ is nonincreasing.

Proof: We have

    $G_k(x^{k-1}) = f(x^{k-1}) + g_k(x^{k-1}) = f(x^{k-1}) \ge G_k(x^k) = f(x^k) + g_k(x^k) \ge f(x^k),$

so $f(x^k) \le f(x^{k-1})$.

In order to have the sequence $\{f(x^k)\}$ converge to $\beta = \inf\{f(x) \mid x \in C\}$ we need to impose an additional property on the $g_k(x)$. We shall return to this issue later in this chapter.

Perhaps the best known examples of AF methods are the sequential unconstrained minimization (SUM) methods discussed by Fiacco and McCormick in their classic book [94]. They focus on barrier-function and penalty-function algorithms, in which the auxiliary functions are introduced to incorporate the constraint that $f$ is to be minimized over $C$. In [94] barrier-function methods are called interior-point methods, while penalty-function methods are called exterior-point methods.
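As a concrete illustration of Proposition 1.1 (my example, not one from the text), take $g_k(x) = \|x - x^{k-1}\|^2$, which satisfies $g_k(x) \ge 0$ and $g_k(x^{k-1}) = 0$; minimizing $G_k$ is then a proximal-point step. A minimal Python sketch with the illustrative choice $f(x) = (x-3)^2$:

```python
# AF/proximal-point sketch: minimize G_k(x) = f(x) + (x - x_prev)^2.
# For f(x) = (x - 3)^2 the minimizer of G_k is available in closed form:
# setting 2(x - 3) + 2(x - x_prev) = 0 gives x = (3 + x_prev) / 2.

def f(x):
    return (x - 3.0) ** 2

def af_step(x_prev):
    # exact minimizer of f(x) + (x - x_prev)^2
    return (3.0 + x_prev) / 2.0

x = 0.0
values = [f(x)]
for k in range(50):
    x = af_step(x)
    values.append(f(x))

# Proposition 1.1: the sequence {f(x^k)} is nonincreasing.
monotone = all(values[i + 1] <= values[i] for i in range(len(values) - 1))
```

Here $C = X = \mathbb{R}$, and the iterates also happen to converge to the minimizer $x = 3$; in general an AF method only guarantees the monotonicity of $\{f(x^k)\}$.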
A barrier function has the value $+\infty$ for $x$ not in $C$, while the penalty function is zero on $C$ and positive off of $C$. In more general AF methods, we may or may not have $C = X$. If $C$ is a proper subset of $X$, we can replace the function $f(x)$ with $f(x) + \iota_C(x)$, where the indicator $\iota_C(x)$ takes on the value zero for $x$ in $C$ and the value $+\infty$ for $x$ not in $C$; then the $g_k(x)$ need not involve $C$.

Prior to the 1980's, linear programming and nonlinear programming were viewed as separate subjects. In 1984 Karmarkar published his polynomial-time interior-point algorithm for linear programming [113]. This event created much excitement at the time, even making the front page of the New York Times. Although it appears now that some claims for this algorithm were overstated, its arrival served to revive interest in interior-point methods and to bring linear and nonlinear programming closer together.

As we shall see, in addition to incorporating the constraint set $C$, the $g_k(x)$ can be selected to make the computations simpler; sometimes we select the $g_k(x)$ so that $x^k$ can be expressed in closed form. However, in the most general, non-topological case, we are not concerned with calculational issues involved in finding $x^k$. Our objective is to select the $g_k(x)$ so that the sequence $\{f(x^k)\}$ converges to $\beta = \inf\{f(x) \mid x \in C\}$.

We begin with two simple examples of the use of barrier functions.

1.2.1 Barrier-Function Methods: An Example

Our first problem is to minimize the function $f(x) = f(x_1, x_2) = x_1^2 + x_2^2$, subject to $x_1 + x_2 \ge 1$. Here $X = \mathbb{R}^2$, the function $f(x)$ is continuous, indeed differentiable, and the set $C = \{x \mid x_1 + x_2 \ge 1\}$ is closed and convex. For each $k$ we minimize the function

    $B_k(x) = f(x) + \frac{1}{k} b(x) = x_1^2 + x_2^2 - \frac{1}{k}\log(x_1 + x_2 - 1)$    (1.2)

over $x \in D$, where $D = \{x \mid x_1 + x_2 > 1\}$. Note that $C$ is the closure of $D$ and $\beta = \inf\{f(x) \mid x \in D\}$.
Setting the partial derivatives to zero, we find that

    $x_1^k = x_2^k = \frac{1}{4} + \frac{1}{4}\sqrt{1 + \frac{4}{k}}.$

As $k \to +\infty$ the sequence $\{x^k\}$ converges to $(\frac{1}{2}, \frac{1}{2})$, which is the solution. In this example the auxiliary function $-\log(x_1 + x_2 - 1)$ serves to limit the calculations to those $x$ satisfying the constraint $x_1 + x_2 > 1$, while also permitting us to get $x^k$ in closed form. The minus sign may seem unnecessary, since $\log(x_1 + x_2 - 1)$ would similarly restrict the $x$. However, without the minus sign there is no minimizer of $B_k(x)$ within $D$.

1.2.2 Barrier-Function Methods: Another Example

Now we want to minimize the function $f(x) = x_1 + x_2$, subject to the constraints $x_2 - x_1^2 \ge 0$ and $x_1 \ge 0$. Once again, we use the log function to create our barrier function. We minimize

    $B_k(x) = f(x) + \frac{1}{k} b(x) = x_1 + x_2 + \frac{1}{k}\left(-\log(x_2 - x_1^2) - \log x_1\right)$    (1.3)

to get $x^k$. With a bit of algebra we find that

    $x_1^k = -\frac{1}{4} + \frac{1}{4}\sqrt{1 + \frac{8}{k}},$

and

    $x_2^k = \frac{1}{16}\left(-1 + \sqrt{1 + \frac{8}{k}}\right)^2 + \frac{1}{k}.$

As $k \to +\infty$ the sequence $\{x^k\}$ converges to $(0, 0)$, which is the answer.
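The closed-form iterates of the first barrier example can be checked numerically. A short Python sketch (illustrative, not from the text) computes them, confirms strict feasibility, and confirms the monotone convergence of $f(x^k)$ to the constrained minimum $\frac{1}{2}$:

```python
import math

# First barrier example: minimize x1^2 + x2^2 subject to x1 + x2 >= 1.
# By symmetry the minimizer of B_k(x) = x1^2 + x2^2 - (1/k) log(x1 + x2 - 1)
# has x1 = x2 = t_k, where setting the derivative of 2t^2 - (1/k) log(2t - 1)
# to zero gives 8k t^2 - 4k t - 2 = 0, i.e. t_k = 1/4 + (1/4) sqrt(1 + 4/k).

def barrier_minimizer(k):
    t = 0.25 + 0.25 * math.sqrt(1.0 + 4.0 / k)
    return (t, t)

iterates = [barrier_minimizer(k) for k in range(1, 2001)]
f_values = [x1 ** 2 + x2 ** 2 for (x1, x2) in iterates]

# Every iterate is strictly feasible (x1 + x2 > 1), f(x^k) is nonincreasing,
# and the iterates approach the constrained minimizer (1/2, 1/2).
feasible = all(x1 + x2 > 1.0 for (x1, x2) in iterates)
monotone = all(f_values[i + 1] <= f_values[i] for i in range(len(f_values) - 1))
```

This is the interior-point behavior described above: every $x^k$ lies inside $D$, approaching the boundary point $(\frac{1}{2}, \frac{1}{2})$ only in the limit.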
1.2.3 Barrier-Function Methods for the Basic Problem

Now the problem is to minimize $f: X \to \mathbb{R}$, subject to $x \in C$. We select $b: X \to (-\infty, +\infty]$ with $C = \{x \mid b(x) < +\infty\}$. For each $k$ we minimize $B_k(x) = f(x) + \frac{1}{k} b(x)$ over all $x \in X$ to get $x^k$, which must necessarily lie in $C$. Formulated this way, the method is not yet in AF form. Nevertheless, we have the following proposition.

Proposition 1.2 The sequence $\{b(x^k)\}$ is nondecreasing, and the sequence $\{f(x^k)\}$ is nonincreasing and converges to $\beta = \inf_{x \in C} f(x)$.

Proof: From $B_k(x^k) \le B_k(x^{k-1})$ and $B_{k-1}(x^{k-1}) \le B_{k-1}(x^k)$, for $k = 2, 3, \ldots$, it follows easily that

    $\frac{1}{k}\left(b(x^k) - b(x^{k-1})\right) \le f(x^{k-1}) - f(x^k) \le \frac{1}{k-1}\left(b(x^k) - b(x^{k-1})\right).$

Since $\frac{1}{k} < \frac{1}{k-1}$, this forces $b(x^k) \ge b(x^{k-1})$, and then $f(x^k) \le f(x^{k-1})$. Suppose that $\{f(x^k)\} \to \beta^* > \beta$. Then there is $z \in C$ with

    $f(x^k) \ge \beta^* > f(z) \ge \beta,$

for all $k$. Then

    $\frac{1}{k}\left(b(z) - b(x^k)\right) \ge f(x^k) - f(z) \ge \beta^* - f(z) > 0,$

for all $k$. But the sequence $\{\frac{1}{k}(b(z) - b(x^k))\}$ converges to zero, which contradicts the assumption that $\beta^* > \beta$.

The proof of Proposition 1.2 depended heavily on the details of the barrier-function method. Now we reformulate the barrier-function method as an AF method, and obtain a different proof of Proposition 1.2 that leads to the definition of the SUMMA class of AF methods [47, 55].

Minimizing $B_k(x) = f(x) + \frac{1}{k}b(x)$ to get $x^k$ is equivalent to minimizing $kf(x) + b(x)$, which, in turn, is equivalent to minimizing $G_k(x) = f(x) + g_k(x)$, where

    $g_k(x) = \left[(k-1)f(x) + b(x)\right] - \left[(k-1)f(x^{k-1}) + b(x^{k-1})\right].$

Clearly, $g_k(x) \ge 0$, since $x^{k-1}$ minimizes $(k-1)f(x) + b(x)$, and $g_k(x^{k-1}) = 0$. Now we have the AF form of the method.

Here is a different proof of Proposition 1.2. A simple calculation shows that

    $G_k(x) - G_k(x^k) = g_{k+1}(x),$    (1.4)

for all $x \in X$. Suppose that the nonincreasing sequence $\{f(x^k)\}$ converges to some $\beta^* > \beta$. Then there is $z \in C$ with $f(z) < \beta^*$. Then

    $g_k(z) - g_{k+1}(z) = g_k(z) - G_k(z) + G_k(x^k) = g_k(z) - f(z) - g_k(z) + f(x^k) + g_k(x^k) \ge \beta^* - f(z) > 0.$

This is impossible, since $\{g_k(z)\}$ is then a nonincreasing sequence of nonnegative numbers whose successive differences are bounded below by the positive number $\beta^* - f(z)$.

1.2.4 The SUMMA Class

Close inspection of the second proof of Proposition 1.2 reveals that Equation (1.4) is unnecessary; all we need is the SUMMA Inequality

    $G_k(x) - G_k(x^k) \ge g_{k+1}(x),$    (1.5)

for all $x \in X$. All iterative methods that can be reformulated as AF methods for which the inequality in (1.5) holds are said to belong to the SUMMA class of methods. This may seem to be a quite restricted class of methods, but, as we shall see, that is far from the case. Many well known iterative methods fall into the SUMMA class [47, 55].

1.2.5 Cross-Entropy Methods

For $a > 0$ and $b > 0$, let the cross-entropy or Kullback-Leibler (KL) distance [117] from $a$ to $b$ be

    $KL(a, b) = a\log\frac{a}{b} + b - a,$    (1.6)

with $KL(a, 0) = +\infty$, and $KL(0, b) = b$.
Extend to nonnegative vectors coordinate-wise, so that

    $KL(x, z) = \sum_{j=1}^{J} KL(x_j, z_j).$    (1.7)

Then $KL(x, z) \ge 0$ and $KL(x, z) = 0$ if and only if $x = z$. Unlike the Euclidean distance, the KL distance is not symmetric; $KL(x, y)$ and $KL(y, x)$ are distinct. We can obtain different approximate solutions of a nonnegative system of linear equations $Px = y$ by minimizing $KL(Px, y)$ and $KL(y, Px)$ with respect to nonnegative $x$. The SMART minimizes $KL(Px, y)$, while the EMML algorithm minimizes $KL(y, Px)$. Both are iterative algorithms in the SUMMA class, and are best developed using the alternating minimization (AM) framework.

1.2.6 Alternating Minimization

Let $\Theta: P \times Q \to (-\infty, +\infty]$, where $P$ and $Q$ are arbitrary nonempty sets. In the alternating minimization (AM) method we minimize $\Theta(p, q^{n-1})$ over $p \in P$ to get $p^n$ and then minimize $\Theta(p^n, q)$ over $q \in Q$ to get $q^n$. We want

    $\{\Theta(p^n, q^n)\} \to \inf\{\Theta(p, q) \mid p \in P, q \in Q\}.$    (1.8)

In [81] Csiszár and Tusnády show that if the function $\Theta$ possesses what they call the five-point property, then Equation (1.8) holds. There seemed to be no convincing explanation of why the five-point property should be used, except that it works. I was quite surprised when I discovered that the AM method can be reformulated as an AF method to minimize a function of the single variable $p$, and that the five-point property becomes precisely the SUMMA condition.

1.2.7 Penalty-Function Methods

Once again, we want to minimize $f: X \to \mathbb{R}$, subject to $x \in C$. We select a penalty function $p: X \to [0, +\infty)$ with $p(x) = 0$ if and only if $x \in C$. Then, for each $k$, we minimize

    $P_k(x) = f(x) + kp(x),$

over all $x$, to get $x^k$. Here is a simple example of the use of penalty-function methods.

Let us minimize the function $f(x) = (x + 1)^2$, subject to $x \ge 0$. We let $p(x) = 0$ for $x \ge 0$, and $p(x) = x^2$ for $x < 0$. Then $x^k = \frac{-1}{k+1}$, which converges to zero, the correct answer, as $k \to +\infty$. Note that $x^k$ is not in $C = \mathbb{R}_+$, which is why such methods are called exterior-point methods.
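The exterior-point behavior of this penalty example is easy to see numerically. A short Python sketch (illustrative, not from the text), using the closed-form minimizer of $P_k$:

```python
def f(x):
    return (x + 1.0) ** 2

def penalty_minimizer(k):
    # For x < 0, P_k(x) = (x+1)^2 + k*x^2 is minimized where
    # 2(x + 1) + 2*k*x = 0, i.e. at x = -1/(k+1); one checks that this
    # beats every x >= 0, where P_k(x) = f(x) >= f(0) = 1.
    return -1.0 / (k + 1)

iterates = [penalty_minimizer(k) for k in range(1, 1001)]
f_values = [f(x) for x in iterates]

# Exterior-point behavior: every x^k lies outside C = [0, +infinity),
# f(x^k) is nondecreasing, and x^k converges to the solution x = 0.
exterior = all(x < 0 for x in iterates)
nondecreasing = all(f_values[i + 1] >= f_values[i] for i in range(len(f_values) - 1))
```

In contrast to the barrier examples, here the objective values increase toward the constrained minimum $f(0) = 1$ from below, approaching $C$ from outside.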
Clearly, it is equivalent to minimize

    $p(x) + \frac{1}{k}f(x),$

which gives the penalty-function method the form of a barrier-function method. From Proposition 1.2 it follows that the sequence $\{p(x^k)\}$ is nonincreasing and converges to zero, while the sequence $\{f(x^k)\}$ is nondecreasing and, as we can easily show, converges to some $\beta^* \le \beta$.

Without imposing further structure on $X$ and $f$ we cannot conclude that $\{f(x^k)\}$ converges to $\beta$. The reason is that, in the absence of further structure, such as the continuity of $f$, what $f$ does within $C$ is unrelated to what it does outside $C$. If, for some $f$, we do have $\{f(x^k)\}$ converging to $\beta$, we can replace $f(x)$ with $f(x) - 1$ for $x$ not in $C$, while leaving $f(x)$ unchanged for $x$ in $C$. Then $\beta$ remains unaltered, while the new sequence $\{f(x^k)\}$ converges to $\beta^* = \beta - 1$.

1.3 Fixed-Point Methods

We turn now to fixed-point (FP) methods, the second class of methods we shall discuss in these notes. For our discussion of fixed-point methods we shall impose some structure on $X$, although, as we shall see, it may not be that of a metric space. However, most of the examples we shall present will be in the setting of a Hilbert space $\mathcal{H}$, usually $\mathcal{H} = \mathbb{R}^J$.

When we use an FP method to solve a problem iteratively, we select an operator $T: X \to X$ such that $z$ solves the problem if and only if $z$ is a fixed point of $T$, that is, $Tz = z$. The set of fixed points of $T$ will be denoted $\mathrm{Fix}(T)$. Our iterative method is then to let $x^k = Tx^{k-1}$, for $k = 1, 2, \ldots$. If we are trying to minimize $f: X \to \mathbb{R}$, subject to $x \in C$, then, as before, we want the sequence $\{f(x^k)\}$ to converge to $\beta$. If we have imposed some topology on $X$, we can also ask for the sequence $\{x^k\}$ to converge, at least weakly, to a solution of the problem.

Definition 1.1 An operator $T: \mathcal{H} \to \mathcal{H}$ is convergent if $T$ has at least one fixed point and the sequence $\{T^k x^0\}$ converges weakly to a fixed point of $T$, for every starting point $x^0$.

1.3.1 Gradient Descent Algorithms

Suppose that we want to minimize $f: \mathcal{H} \to \mathbb{R}$.
When $f$ is Gâteaux differentiable, the derivative of $f$ at $x$ is the gradient, $\nabla f(x)$. A gradient-descent algorithm has the iterative step

    $x^k = x^{k-1} - \gamma_k \nabla f(x^{k-1}),$    (1.9)

where the step-length parameters $\gamma_k$ are adjusted at each step. When $f$ is Gâteaux differentiable at $x$, the one-sided directional derivative of $f$ at $x$ and in the direction $d$, denoted $f'_+(x; d)$, is given by

    $f'_+(x; d) = \langle \nabla f(x), d \rangle.$

When $f$ is convex and Gâteaux differentiable and the gradient $\nabla f$ is $L$-Lipschitz continuous, that is,

    $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|,$

for all $x$ and $y$, we can modify the iteration in Equation (1.9). If $0 < \gamma < \frac{2}{L}$, and

    $x^k = x^{k-1} - \gamma \nabla f(x^{k-1}),$    (1.10)

then the sequence $\{x^k\}$ converges to a zero of $\nabla f$, whenever there are such zeros. Now that $\gamma$ is independent of $k$, we can take $T = I - \gamma \nabla f$, with $I$ the identity operator, and write the iteration as $x^k = Tx^{k-1}$. As we shall show later, the operator $T$ is convergent. A point $z$ is a fixed point of $T$ if and only if $\nabla f(z) = 0$, and so if and only if $z$ minimizes $f$.

Definition 1.2 For any function $f: \mathcal{H} \to \mathbb{R}$ the epigraph of $f$ is the set

    $\mathrm{epi}(f) = \{(x, \gamma) \in \mathcal{H} \times \mathbb{R} \mid f(x) \le \gamma\}.$

Definition 1.3 A function $f: \mathcal{H} \to \mathbb{R}$ is lower semi-continuous, or closed, if $\mathrm{epi}(f)$ is closed in $\mathcal{H} \times \mathbb{R}$.

If $f: \mathcal{H} \to \mathbb{R}$ is closed and convex, then $\mathrm{epi}(f)$ is a nonempty, closed, and convex set. Note that when $f: \mathbb{R}^J \to \mathbb{R}$ is convex, it is continuous. This is also true in the infinite-dimensional case, provided that $f$ is closed as well ([15], Corollary 8.30). For any convex $f: \mathcal{H} \to \mathbb{R}$, Gâteaux differentiability and Fréchet differentiability are equivalent for finite-dimensional $\mathcal{H}$, but not necessarily equivalent in the case of infinite-dimensional Hilbert space. We shall use the word "differentiable" to mean Gâteaux differentiable. Whenever $f$ is differentiable and $\nabla f$ is continuous, $f$ is Fréchet differentiable. Therefore, if $\nabla f$ is $L$-Lipschitz continuous, then $f$ is Fréchet differentiable.

A function $f: \mathcal{H} \to (-\infty, +\infty]$ is proper if there is no $x$ with $f(x) = -\infty$ and some $x$ with $f(x) < +\infty$. All the functions we consider are proper.
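A minimal Python sketch (illustrative, not from the text) of the constant-step iteration (1.10), on the convex quadratic $f(x_1, x_2) = x_1^2 + 2x_2^2$, whose gradient is Lipschitz with $L = 4$:

```python
# Gradient descent x^k = x^{k-1} - gamma * grad f(x^{k-1}) with a constant
# step 0 < gamma < 2/L, for f(x1, x2) = x1^2 + 2*x2^2 (so L = 4).

def grad_f(x):
    return (2.0 * x[0], 4.0 * x[1])

gamma = 0.25          # satisfies 0 < gamma < 2/L = 0.5
x = (5.0, -3.0)
for k in range(200):
    g = grad_f(x)
    x = (x[0] - gamma * g[0], x[1] - gamma * g[1])
```

Here $T = I - \gamma\nabla f$ scales the coordinates by $1 - 2\gamma$ and $1 - 4\gamma$ respectively, both of magnitude less than one for this $\gamma$, so the iterates converge to the unique fixed point $(0, 0)$, the minimizer of $f$.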
1.3.2 Projected Gradient Descent

For any nonempty, closed, convex subset $C$ of a Hilbert space $\mathcal{H}$ and any $x$ in $\mathcal{H}$ there is a unique $z \in C$ closest to $x$. This $z$ is denoted $z = P_C x$ and the operator $P_C$ is called an orthogonal projection (or metric projection) operator.

Suppose now that we want to minimize a differentiable closed convex function $f: \mathcal{H} \to \mathbb{R}$ over $x \in C$, where $C$ is a nonempty, closed, convex subset of $\mathcal{H}$. Assume that $\nabla f$ is $L$-Lipschitz continuous, and $0 < \gamma < \frac{2}{L}$. The projected gradient descent algorithm has the iterative step

    $x^k = P_C(x^{k-1} - \gamma \nabla f(x^{k-1})).$    (1.11)

The sequence $\{x^k\}$ converges weakly to a point $z \in C$ with $f(z) \le f(x)$, for all $x \in C$, whenever such a point $z$ exists. It is not hard to show that such $z$ are the fixed points of the operator $T = P_C(I - \gamma \nabla f)$, which is a convergent operator.

1.3.3 Solving Ax = b

Suppose that $A$ is a real $M$ by $N$ matrix and $b$ is a member of $\mathbb{R}^M$. We want $x \in \mathbb{R}^N$ so that $Ax = b$, or, if there are no such $x$, then we want an $x$ that minimizes the function $f(x) = \frac{1}{2}\|Ax - b\|^2$, where the norm is the two-norm (that is, the usual Euclidean norm). There is a closed-form solution for this problem: $x = (A^T A)^{-1} A^T b$, provided that $A^T A$ is invertible. In many applications in image processing and remote sensing the matrix $A$ can be quite large, with $M$ and $N$ in the tens of thousands. In such cases, using the closed-form expression for the solution is not practical and we turn to iterative methods.

The function $f(x) = \frac{1}{2}\|Ax - b\|^2$ is differentiable and its derivative,

    $\nabla f(x) = A^T(Ax - b),$

is $L$-Lipschitz continuous for $L = \rho(A^T A)$, the spectral radius of $A^T A$, which, in this case, is the largest eigenvalue of $A^T A$. Applying the gradient descent algorithm of the previous section, we get the Landweber iteration [118, 20],

    $x^k = x^{k-1} - \gamma A^T(Ax^{k-1} - b),$    (1.12)

for $0 < \gamma < \frac{2}{\rho(A^T A)}$. The sequence $\{x^k\}$ converges to the minimizer of $f(x)$ closest to $x^0$.
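A small NumPy sketch (illustrative, not from the text) of the Landweber iteration (1.12) on an overdetermined system, compared against the closed-form least-squares solution $(A^TA)^{-1}A^Tb$:

```python
import numpy as np

# Landweber iteration for f(x) = (1/2) ||Ax - b||^2.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 2.0])

L = np.linalg.eigvalsh(A.T @ A).max()   # spectral radius of A^T A
gamma = 1.0 / L                          # satisfies 0 < gamma < 2/L

x = np.zeros(2)                          # x^0 = 0
for k in range(20000):
    x = x - gamma * (A.T @ (A @ x - b))

# Closed-form least-squares solution, for comparison.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
```

No matrix inversion appears inside the loop, only matrix-vector products, which is the point of the method when $A$ is too large to factor.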
1.3.4 Projected Landweber Algorithm

Suppose that we want to find $x \in C \subseteq \mathbb{R}^N$ such that $Ax = b$, or, failing that, to minimize the function $\frac{1}{2}\|Ax - b\|^2$ over $x \in C$. Applying the projected gradient descent algorithm, we get the projected Landweber algorithm [20],

    $x^k = P_C(x^{k-1} - \gamma A^T(Ax^{k-1} - b)).$    (1.13)

The sequence $\{x^k\}$ converges to a minimizer of $f$ over $C$, whenever such minimizers exist.

1.3.5 The Split Feasibility Problem

Let $C$ and $Q$ be nonempty, closed, convex subsets of $\mathbb{R}^N$ and $\mathbb{R}^M$, respectively, and $A$ a real $M$ by $N$ matrix. The split feasibility problem (SFP) is to find $x \in C$ with $Ax \in Q$, or, failing that, to minimize the function $f(x) = \frac{1}{2}\|P_Q Ax - Ax\|^2$ over $x \in C$. It can be shown [55] that $f(x)$ is differentiable and its gradient is

    $\nabla f(x) = A^T(I - P_Q)Ax.$

The gradient is again $L$-Lipschitz continuous for $L = \rho(A^T A)$. Applying the gradient descent algorithm we have the CQ algorithm [42, 43]:

    $x^k = P_C(x^{k-1} - \gamma A^T(I - P_Q)Ax^{k-1}).$    (1.14)

Solutions of the problem are the fixed points of the convergent operator $T: \mathcal{H} \to \mathcal{H}$ given by $T = P_C(I - \gamma A^T(I - P_Q)A)$.

In [66, 62] Yair Censor and his colleagues modified the CQ algorithm and used their modified method to derive protocols for intensity-modulated radiation therapy (IMRT).

1.3.6 Firmly Nonexpansive Operators

We are interested in operators $T$ that are convergent. For such operators we often find that $\|x^{k+1} - x^k\| \le \|x^k - x^{k-1}\|$ for each $k$. This leads us to the definition of nonexpansive operators.

Definition 1.4 An operator $T$ on $\mathcal{H}$ is nonexpansive (ne) if, for all $x$ and $y$, we have

    $\|Tx - Ty\| \le \|x - y\|.$

Nonexpansive operators need not be convergent, as the ne operator $T = -I$ illustrates. As we shall see later, the operators $T = P_C$ are nonexpansive. In fact, the operators $P_C$ have a much stronger property; they are firmly nonexpansive.

Definition 1.5 An operator $T$ on $\mathcal{H}$ is firmly nonexpansive (fne) if, for every $x$ and $y$, we have

    $\langle Tx - Ty, x - y \rangle \ge \|Tx - Ty\|^2.$

If $T$ is fne then $T$ is convergent.
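A quick NumPy check (illustrative, not from the text) of Definition 1.5 for an orthogonal projection: with $P_C$ the projection onto the nonnegative orthant (componentwise clipping), random pairs $x, y$ are tested against the fne inequality.

```python
import numpy as np

rng = np.random.default_rng(0)

def P_C(x):
    # Orthogonal projection onto the nonnegative orthant.
    return np.maximum(x, 0.0)

# Check the fne inequality <Px - Py, x - y> >= ||Px - Py||^2
# on many random pairs of points.
fne_holds = True
for _ in range(1000):
    x = rng.normal(size=5)
    y = rng.normal(size=5)
    d = P_C(x) - P_C(y)
    if np.dot(d, x - y) < np.dot(d, d) - 1e-12:
        fne_holds = False
```

Of course a numerical check on random samples is not a proof; the proof that every orthogonal projection is fne is given later in the notes.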
The class of fne operators is smaller than the class of ne operators, and does yield convergent iterative sequences. However, the product or composition of two or more fne operators need not be fne, which limits the usefulness of this class of operators. Even the product of $P_{C_1}$ and $P_{C_2}$ need not be fne. We need to find a class of convergent operators that is closed under finite products.

1.3.7 Averaged Operators

It can be shown easily that an operator $F$ is fne if and only if there is a nonexpansive operator $N$ such that

    $F = \frac{1}{2}I + \frac{1}{2}N.$

Definition 1.6 An operator $A: \mathcal{H} \to \mathcal{H}$ is $\alpha$-averaged ($\alpha$-av) if there is a nonexpansive operator $N$ such that

    $A = (1 - \alpha)I + \alpha N,$

for some $\alpha$ in the interval $(0, 1)$. If $A$ is $\alpha$-av for some $\alpha$ then $A$ is an averaged (av) operator.

All averaged operators are nonexpansive, all firmly nonexpansive operators are averaged, the class of averaged operators is closed under finite products, and averaged operators are convergent. In other words, the class of averaged operators is precisely the class that we are looking for.

1.3.8 Useful Properties of Operators on H

It turns out that properties of an operator $T$ are often more easily studied in terms of properties of its complement, $G = I - T$. The following two identities are easy to prove and are quite helpful. For any operator $T: \mathcal{H} \to \mathcal{H}$ and $G = I - T$ we have

    $\|x - y\|^2 - \|Tx - Ty\|^2 = 2\langle Gx - Gy, x - y \rangle - \|Gx - Gy\|^2,$    (1.15)

and

    $\langle Tx - Ty, x - y \rangle - \|Tx - Ty\|^2 = \langle Gx - Gy, x - y \rangle - \|Gx - Gy\|^2.$    (1.16)

Definition 1.7 An operator $G: \mathcal{H} \to \mathcal{H}$ is $\nu$-inverse strongly monotone ($\nu$-ism) for some $\nu > 0$ if

    $\langle Gx - Gy, x - y \rangle \ge \nu\|Gx - Gy\|^2,$

for all $x$ and $y$. Clearly, if $G$ is $\nu$-ism then $\gamma G$ is $\frac{\nu}{\gamma}$-ism.

Using the two identities in (1.15) and (1.16) it is easy to prove the following theorem.

Theorem 1.1 Let $T: \mathcal{H} \to \mathcal{H}$ be arbitrary and $G = I - T$. Then

1. $T$ is ne if and only if $G$ is $\nu$-ism for $\nu = \frac{1}{2}$;
2. $T$ is $\alpha$-av if and only if $G$ is $\nu$-ism for $\nu = \frac{1}{2\alpha}$, for some $0 < \alpha < 1$;
3. $T$ is fne if and only if $G$ is $\nu$-ism for $\nu = 1$;
4. $T$ is fne if and only if $G$ is fne;
5.
If $G$ is $\nu$-ism and $\gamma > 0$, then $\gamma G$ is $\frac{\nu}{\gamma}$-ism.

1.3.9 Subdifferentials and Subgradients

Subdifferentials and subgradients are important tools in optimization, particularly in convex optimization.

Definition 1.8 Let $f: \mathcal{H} \to (-\infty, +\infty]$. Then $\partial f(x)$ is the subdifferential of $f$ at $x$, defined by

    $\partial f(x) = \{u \mid \langle u, z - x \rangle + f(x) \le f(z), \text{ for all } z\}.$    (1.17)

The members of $\partial f(x)$ are the subgradients of $f$ at $x$.

It is easy to see that, if $z \in \mathcal{H}$ minimizes $f(x)$ over $x \in \mathcal{H}$, then $0 \in \partial f(z)$. There is a subtle point to be aware of here: if $f$ is differentiable, but not convex, $\nabla f(x)$ need not be a member of $\partial f(x)$, which may be empty. If $f$ is closed and convex and $\partial f(x)$ is a singleton set, then $f$ is differentiable and $\partial f(x) = \{\nabla f(x)\}$.

The following proposition provides a characterization of convexity for nondifferentiable functions.

Proposition 1.3 A function $f: \mathcal{H} \to \mathbb{R}$ is closed and convex if and only if $\partial f(x)$ is nonempty, for every $x \in \mathcal{H}$.

Proof: When $f$ is closed and convex, its epigraph is a closed convex set. We can then use orthogonal projection to find a supporting hyperplane for the epigraph at the point $(x, f(x))$. From the normal to the hyperplane we can construct a member of $\partial f(x)$ [56].

Now we prove the converse. By Proposition 17.39 of [15], if $f: \mathcal{H} \to \mathbb{R}$ is convex and $\partial f(x)$ is nonempty, for each $x$, then $f$ is closed. Now let $x$ and $y$ be arbitrary in $\mathcal{H}$, $z = (1 - \alpha)x + \alpha y$, for some $\alpha \in (0, 1)$, and $u$ a member of $\partial f(z)$. Then

    $f(x) - f(z) \ge \langle u, x - z \rangle = \alpha\langle u, x - y \rangle,$

and

    $f(y) - f(z) \ge \langle u, y - z \rangle = (1 - \alpha)\langle u, y - x \rangle = -(1 - \alpha)\langle u, x - y \rangle.$

Therefore,

    $(1 - \alpha)(f(x) - f(z)) \ge (1 - \alpha)\alpha\langle u, x - y \rangle \ge \alpha(f(z) - f(y)),$

and so

    $(1 - \alpha)f(x) + \alpha f(y) \ge (1 - \alpha)f(z) + \alpha f(z) = f(z).$

Proposition 1.4 For any $f: \mathcal{H} \to \mathbb{R}$ and $g: \mathcal{H} \to \mathbb{R}$ we have

    $\partial f(x) + \partial g(x) \subseteq \partial(f + g)(x).$    (1.18)

If $f$ and $g$ are closed and convex, then

    $\partial f(x) + \partial g(x) = \partial(f + g)(x).$    (1.19)

Proof: The containment in (1.18) follows immediately from the definition of the subdifferential. For the proof of (1.19) see Corollary 16.38 of [15].

In some discussions, convex functions may be allowed to take on the value $+\infty$. In such cases, only the containment in (1.18) may hold; see Corollary 16.38 of [15].
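A small Python check (illustrative, not from the text) of Definition 1.8 for $f(x) = |x|$ on the real line, where the subdifferential at $x = 0$ is the interval $[-1, 1]$:

```python
# Subgradient check for f(x) = |x|: u is a subgradient of f at x if
# u*(z - x) + f(x) <= f(z) for all z.  At x = 0 the subdifferential
# is [-1, 1]; note that 0 being in it confirms that 0 minimizes |x|.

def f(x):
    return abs(x)

def is_subgradient(u, x, test_points):
    return all(u * (z - x) + f(x) <= f(z) + 1e-12 for z in test_points)

zs = [i / 100.0 for i in range(-500, 501)]   # test points in [-5, 5]

inside = all(is_subgradient(u, 0.0, zs) for u in (-1.0, -0.5, 0.0, 0.5, 1.0))
outside = is_subgradient(1.5, 0.0, zs)        # fails: 1.5*z > |z| for z > 0
```

Checking the inequality on a grid is only a sanity check, of course, not a proof that $\partial f(0) = [-1, 1]$.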
Corollary 1.1 If $f: \mathcal{H} \to \mathbb{R}$ and $g: \mathcal{H} \to \mathbb{R}$ are both closed and convex, and $f + g = h$ is differentiable, then both $f$ and $g$ are differentiable.

Proof: From Proposition 1.4 we have

    $\partial f(x) + \partial g(x) \subseteq \partial(f + g)(x) = \partial h(x) = \{\nabla h(x)\}.$

Since both $\partial f(x)$ and $\partial g(x)$ are nonempty, they must be singleton sets. Therefore, both functions are differentiable, according to Proposition 17.26 of [15].

1.3.10 Monotone Operators

There is an interesting connection between fne operators and monotone operators.

Definition 1.9 A set-valued function $B: \mathcal{H} \to 2^{\mathcal{H}}$ is said to be monotone if, for each $x$ and $y$ in $\mathcal{H}$ and $u \in B(x)$ and $v \in B(y)$, we have $\langle u - v, x - y \rangle \ge 0$. If there is no monotone operator $A: \mathcal{H} \to 2^{\mathcal{H}}$, distinct from $B$, with $B(x) \subseteq A(x)$ for all $x$, then $B$ is a maximal monotone operator.

If $f: \mathcal{H} \to \mathbb{R}$ is closed and convex then $B(x) = \partial f(x)$ defines a monotone operator. In particular, if $f$ is also differentiable then $T = \nabla f$ is a monotone operator. We have the following proposition.

Proposition 1.5 If $B$ is monotone and $x \in z + B(z)$ and $x \in y + B(y)$, then $z = y$.

The proof is not difficult and we leave it to the reader. It follows from Proposition 1.5 that $z$ is uniquely defined by the inclusion $x \in z + B(z)$, and we write $z = J_B x$. The operator $J_B: \mathcal{H} \to \mathcal{H}$ is the resolvent of the monotone operator $B$. Sometimes we write $J_B = (I + B)^{-1}$. The operator $J_B$ is fne, and $T$ is a fne operator if and only if there is a monotone operator $B: \mathcal{H} \to 2^{\mathcal{H}}$ such that $T = J_B$. Given an operator $T$, define $B(x) = T^{-1}(\{x\}) - x$. Then $T$ is fne if and only if $B$ is monotone.

1.3.11 The Baillon-Haddad Theorem

The Baillon-Haddad Theorem [4, 14] provides one of the most important links between fixed-point methods and iterative optimization. The proof we give here is new [52]. It is the first elementary proof of this theorem and depends only on basic properties of convex functions. The non-elementary proof of this theorem in [100] was repeated in the book [46]. The proof given here and in [52] is closely related to that given in the book [55].

Definition 1.10 Let $f: \mathcal{H} \to \mathbb{R}$ be convex and differentiable.
The Bregman distance associated with $f$ is $D_f(x, y)$, given by

    $D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle.$

Then $D_f(x, y) \ge 0$, and $D_f(x, x) = 0$. If $f$ is strictly convex, then $D_f(x, y) = 0$ if and only if $x = y$.

Theorem 1.2 Let $f: \mathcal{H} \to \mathbb{R}$ be convex and differentiable, and let $q(x) = \frac{1}{2}\|x\|^2$. The following are equivalent:

1. $g = q - f$ is convex;
2. $\frac{1}{2}\|z - x\|^2 \ge D_f(z, x)$ for all $x$ and $z$;
3. $T = \nabla f$ is firmly nonexpansive;
4. $T = \nabla f$ is nonexpansive and $f$ is Fréchet differentiable.

Proof:

(1. implies 2.) Because $g$ is convex, we have

    $g(z) \ge g(x) + \langle \nabla g(x), z - x \rangle,$

which is easily shown to be equivalent to

    $\frac{1}{2}\|z - x\|^2 \ge f(z) - f(x) - \langle \nabla f(x), z - x \rangle = D_f(z, x).$

(2. implies 3.) Fix $y$ and define $d(x)$ by

    $d(x) = D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle \ge 0.$

Then $\nabla d(x) = \nabla f(x) - \nabla f(y)$ and $D_d(z, x) = D_f(z, x)$ for all $z$ and $x$. Therefore, we have

    $\frac{1}{2}\|z - x\|^2 \ge D_d(z, x) = d(z) - d(x) - \langle \nabla d(x), z - x \rangle.$

Now let $z - x = \nabla f(y) - \nabla f(x)$, so that

    $d(x) = D_f(x, y) \ge \frac{1}{2}\|\nabla f(x) - \nabla f(y)\|^2.$

Similarly,

    $D_f(y, x) \ge \frac{1}{2}\|\nabla f(x) - \nabla f(y)\|^2.$

Adding these two inequalities gives

    $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \|\nabla f(x) - \nabla f(y)\|^2.$

(3. implies 4.) Clearly, if $\nabla f$ is firmly nonexpansive, it is also nonexpansive. Since it is then continuous, $f$ must be Fréchet differentiable.

(4. implies 1.) From $\nabla g(x) = x - \nabla f(x)$ we get

    $\langle \nabla g(x) - \nabla g(y), x - y \rangle = \|x - y\|^2 - \langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \|x - y\|\left(\|x - y\| - \|\nabla f(x) - \nabla f(y)\|\right) \ge 0.$

Therefore, $g$ is convex.

We get a slightly more general version of Theorem 1.2, but with a slightly less elementary proof, if we assume that $f$ is closed and omit the assumption that $f$ be differentiable. Once we assume 1., the differentiability of $f$ follows from Proposition 1.4 and Corollary 1.1.

As was mentioned previously, the Baillon-Haddad Theorem plays an important role in linking fixed-point algorithms to optimization. Suppose that $f: \mathcal{H} \to \mathbb{R}$ is convex and differentiable, and its gradient, $\nabla f$, is $L$-Lipschitz continuous. Then the gradient of the function $g = \frac{1}{L}f$ is ne, and so $\nabla g$ is fne.
As we shall see in Chapter 4, it follows from the theory of averaged operators and the Krasnosel'skii-Mann-Opial Theorem 4.2 that the operator $I - \gamma\nabla f$ is an averaged operator, therefore a convergent operator, for $0 < \gamma < \frac{2}{L}$.

In [14] Bauschke and Combettes extend the Baillon-Haddad Theorem to include several other equivalent conditions. These additional conditions involve definitions and results that are not elementary; we shall return to their expanded version of the theorem in Chapter 7.

Chapter 2: Auxiliary-Function Methods and Examples

2.1 Auxiliary-Function Methods
2.2 Majorization Minimization
2.3 The Method of Auslender and Teboulle
2.4 The EM Algorithm

2.1 Auxiliary-Function Methods

We suppose that $f: X \to \mathbb{R}$ and $C \subseteq X$, where $X$ is an arbitrary nonempty set. An iterative algorithm is in the AF class if, for $k = 1, 2, \ldots$, we minimize $G_k(x) = f(x) + g_k(x)$ over $x \in X$ to get $x^k$, where $g_k(x) \ge 0$ and $g_k(x^{k-1}) = 0$. As we saw previously, the sequence $\{f(x^k)\}$ is then nonincreasing. We want the sequence $\{f(x^k)\}$ to converge to $\beta = \inf\{f(x) \mid x \in C\}$. If $C$ is a proper subset of $X$ we replace $f$ with $f + \iota_C$ at the beginning. In that case every $x^k$ lies in $C$.

2.2 Majorization Minimization

Majorization minimization (MM), also called optimization transfer, is a technique used in statistics to convert a difficult optimization problem into a sequence of simpler ones [138, 18, 121]. The MM method requires that we majorize the objective function $f(x)$ with $g(x|y)$, such that $g(x|y) \ge f(x)$, for all $x$ and $y$, and $g(y|y) = f(y)$. At the $k$th step of the iterative algorithm we minimize the function $g(x|x^{k-1})$ to get $x^k$. The MM methods are members of the AF class.
At the $k$th step of an MM iteration we minimize

    $G_k(x) = f(x) + \left[g(x|x^{k-1}) - f(x)\right] = f(x) + d(x, x^{k-1}),$    (2.1)

where $d(x, z) = g(x|z) - f(x)$ is a distance function satisfying $d(x, z) \ge 0$ and $d(z, z) = 0$. Since $g_k(x) = d(x, x^{k-1}) \ge 0$ and $g_k(x^{k-1}) = 0$, MM methods are also AF methods; it then follows that the sequence $\{f(x^k)\}$ is nonincreasing.

All MM algorithms have the form $x^k = Tx^{k-1}$, where $T$ is the operator defined by

    $Tz = \operatorname{argmin}_x \{f(x) + d(x, z)\}.$    (2.2)

If $d(x, z) = \frac{1}{2}\|x - z\|^2$, then $T$ is Moreau's proximity operator $Tz = \operatorname{prox}_f(z)$ [131, 132, 133], which we shall discuss in some detail later.

2.3 The Method of Auslender and Teboulle

The method of Auslender and Teboulle [2] is a particular example of an MM algorithm. We take $C$ to be a closed, nonempty, convex subset of $\mathbb{R}^J$, with interior $U$. At the $k$th step of their method one minimizes a function

    $G_k(x) = f(x) + d(x, x^{k-1})$    (2.3)

to get $x^k$. Their distance $d(x, y)$ is defined for $x$ and $y$ in $U$, and the gradient with respect to the first variable, denoted $\nabla_1 d(x, y)$, is assumed to exist. The distance $d(x, y)$ is not assumed to be a Bregman distance. Instead, they assume that the distance $d$ has an associated induced proximal distance $H(a, b) \ge 0$, finite for $a$ and $b$ in $U$, with $H(a, a) = 0$ and

    $\langle \nabla_1 d(b, a), c - b \rangle \le H(c, a) - H(c, b),$    (2.4)

for all $c$ in $U$.

If $d = D_h$, that is, if $d$ is a Bregman distance, then from the equation

    $\langle \nabla_1 d(b, a), c - b \rangle = D_h(c, a) - D_h(c, b) - D_h(b, a)$    (2.5)

we see that $D_h$ has $H = D_h$ for its associated induced proximal distance, so $D_h$ is self-proximal, in the terminology of [2].

The method of Auslender and Teboulle seems not to be a particular case of SUMMA. However, we can adapt the proof of Proposition 1.2 to prove the analogous result for their method. We assume that $f(\hat{x}) \le f(x)$, for all $x$ in $C$.
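Returning for a moment to the MM step (2.2): a classic closed-form instance (my example, not one from the text) takes $d(x, z) = \frac{1}{2}(x - z)^2$ and $f(x) = \lambda|x|$ on the real line, for which the proximity operator is soft-thresholding. A short Python sketch:

```python
# MM/prox instance of (2.2) with d(x, z) = (1/2)(x - z)^2:
# for f(x) = lam*|x| the minimizer of f(x) + (1/2)(x - z)^2 is the
# soft-thresholding operator, a standard closed-form proximity operator.

def prox_abs(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Proximity operators are firmly nonexpansive; check the inequality
# (Tx - Ty)(x - y) >= (Tx - Ty)^2 on a grid of pairs.
pts = [i / 10.0 for i in range(-50, 51)]
fne_ok = all(
    (prox_abs(x, 1.0) - prox_abs(y, 1.0)) * (x - y)
    >= (prox_abs(x, 1.0) - prox_abs(y, 1.0)) ** 2 - 1e-12
    for x in pts for y in pts
)
```

The numerical fne check is consistent with the earlier remark that resolvents of monotone operators, of which $\operatorname{prox}_f$ is one, are fne.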
Theorem 2.1 For $k = 2, 3, \ldots$, let $x^k$ minimize the function

    $G_k(x) = f(x) + d(x, x^{k-1}).$

If the distance $d$ has an induced proximal distance $H$, then $\{f(x^k)\} \to f(\hat{x})$.

Proof: We know that the sequence $\{f(x^k)\}$ is decreasing and the sequence $\{d(x^k, x^{k-1})\}$ converges to zero. Now suppose that

    $f(x^k) \ge f(\hat{x}) + \delta,$

for some $\delta > 0$ and all $k$. Since $\hat{x}$ is in $C$, there is $z$ in $U$ with

    $f(x^k) \ge f(z) + \frac{\delta}{2},$

for all $k$. Since $x^k$ minimizes $G_k(x)$, it follows that

    $0 = \nabla f(x^k) + \nabla_1 d(x^k, x^{k-1}).$

Using the convexity of the function $f(x)$ and the fact that $H$ is an induced proximal distance, we have

    $0 < \frac{\delta}{2} \le f(x^k) - f(z) \le \langle \nabla f(x^k), x^k - z \rangle = \langle \nabla_1 d(x^k, x^{k-1}), z - x^k \rangle \le H(z, x^{k-1}) - H(z, x^k).$

Therefore, the nonnegative sequence $\{H(z, x^k)\}$ is decreasing, but its successive differences remain bounded below by $\frac{\delta}{2}$, which is a contradiction.

It is interesting to note that the Auslender-Teboulle approach places a restriction on the function $d(x, y)$, namely the existence of the induced proximal distance $H$, that is unrelated to the objective function $f(x)$, but this condition is helpful only for convex $f(x)$. In contrast, the SUMMA approach requires that

    $0 \le g_{k+1}(x) \le G_k(x) - G_k(x^k),$

which involves the $f(x)$ being minimized, but does not require that this $f(x)$ be convex.

2.4 The EM Algorithm

The expectation maximization maximum likelihood (EM) "algorithm" is not a single algorithm, but a framework, or, as the authors of [18] put it, a "prescription" for constructing algorithms. Nevertheless, we shall refer to it as the EM algorithm. The EM algorithm is always presented within the context of statistical likelihood maximization, but the essence of this method is not stochastic; the EM algorithms can be shown to form a subclass of AF methods.
