Lecture notes on Nonlinear optimization

what is linear and nonlinear optimization and lecture notes on nonlinear programming and how to solve nonlinear optimization problems pdf free download
ImogenCameron Profile Pic
Published Date:14-07-2017
Your Website URL(Optional)
Nonlinear optimization Lecture notes for the course MAT-INF2360 Øyvind Ryan, Geir Dahl, and Knut Mørken March 7, 2013Contents 1 The basics and applications 5 1.1 The basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Some applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.1 Portfolio optimization . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.2 Fitting a model . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2.3 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2.4 Optimal control problems . . . . . . . . . . . . . . . . . . . . . 12 1.2.5 Linear optimization . . . . . . . . . . . . . . . . . . . . . . . . 14 1.3 Multivariate calculus and linear algebra . . . . . . . . . . . . . . . . 14 2 A crash course in convexity 21 2.1 Convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2 Convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 Properties of convex functions . . . . . . . . . . . . . . . . . . . . . . 24 3 Nonlinear equations 29 3.1 Equations and fixed points . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4 Unconstrained optimization 37 4.1 Optimality conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5 Constrained optimization - theory 51 5.1 Equality constraints and the Lagrangian . . . . . . . . . . . . . . . . 51 5.2 Inequality constraints and KKT . . . . . . . . . . . . . . . . . . . . . . 56 5.3 Convex optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 i6 Constrained optimization - methods 73 6.1 Equality constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6.2 Inequality constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Mathematics index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Index for MATLAB commands . . . . . . . . . . . . . . . . . . . . . . . . . 103 12Preface These lecture notes have been written for the course MAT-INF2360. They deal with the third part of that course, and is about nonlinear optimization. Just as the first parts of MAT-INF2360, this third part also has its roots in linear algebra. In addition, it has stronger connection than the previous parts to multivariate calculus, as taught in MAT1110. Notation We will follow multivariate calculus and linear algebra notation as you know it from MAT1110 and MAT1120. In particular, vectors will be in boldface (x, y, etc.), while matrices will be in uppercase (A, B, etc.). The zero vector, or the zero matrix, is denoted by 0. All vectors stated will be assumed to be column T vectors. A row vector will always be written asx , wherex is a (column) vector Vector-valued functions will be in uppercase boldface (F ,G, etc.). Functions are written using both uppercase and lowercase. Uppercase is often used for the component functions of a vector-valued function. Acknowledgment Most material has been written by Geir Dahl, with many useful suggestions and help from Øyvind Ryan. The authors would all like to thank each other for estab- lishing these notes. The authors would also like to thank Andreas Våvang Solbrå for his valuable contributions to the notes. Geir Dahl, Knut Mørken, and Øyvind Ryan Oslo, February 2013. 34Chapter1 The basics and applications The problem of minimizing a function of several variables, possibly subject to constraints on these variables, is what optimization is about. So the main prob- lem is easy to state And, more importantly, such problems arise in many ap- plications in natural science, engineering, economics and business as well as in mathematics itself. Nonlinear optimization differs from Fourier analysis and wavelet theory in that classical multivariate analysis also is an important ingredient. A recom- mended book on this, used here at the University of Oslo, is 8 (in Norwegian). It contains a significant amount of fixed point theory, nonlinear equations, and optimization. There are many excellent books on nonlinear optimization (or nonlinear programming, as it is also called). Some of these books that have influenced these notes are 1, 2, 9, 5, 13, 11. These are all recommended books for those who want to go deeper into the subject. These lecture notes are particularly in- fluenced by the presentations in 1, 2. Optimization has its mathematical foundation in linear algebra and multi- variate calculus. In analysis the area of convexity is especially important. For the brief presentation of convexity given here the author’s own lecture notes 4 (originally from 2001), and the very nice book 14, have been useful sources. But, of course, anyone who wants to learn convexity should study the work by R.T. Rockafellar, see e.g. the classic text 12. Linear optimization (LP, linear programming) is a special case of nonlinear optimization, but we do not discuss this in any detail here. The reason for this is that we, at the University of Oslo, have a separate course in linear optimization which covers many parts of that subject in some detail. 5This first chapter introduces some of the basic concepts in optimization and discusses some applications. Many of the ideas and results that you will find in these lecture notes may be extended to more general linear spaces, even infinite- dimensional. However, to keep life a bit easier and still cover most applications, n we will only be working inR . Due to its character this chapter is a “proof-free zone”, but in the remaining text we usually give full proofs of the main results. n n ¯ Notation: For z 2R and ±È 0 define the (closed) ball B(z;²)Æ x 2R : kx¡zk·². It consists of all points with distance at most² from z. Similarly, n define the open ball B(z;²)Æ x2R :kx¡zkDz. A neighborhood ofz is a set N containing B(z;²) for some²È 0. Vectors are treated as column vectors and they are identified with the corresponding n-tuple, denoted by xÆ (x , x ,..., x ). A 1 2 n statement like P(x) (x2 H) means that the statement P(x) is true for allx2 H. 1.1 The basic concepts Optimization deals with finding optimal solutions So we need to define what this is. n Let f :R R be a real-valued function in n variables. The function value n is written as f (x), forx2R , or f (x , x ,..., x ). This is the function we want to 1 2 n ¤ n minimize (or maximize) and it is often called the objective function. Letx 2R . ¤ Then x is a local minimum (or local minimizer) of f if there is an²È 0 such that ¤ ¤ f (x )· f (x) for allx2 B(x ;²). ¤ ¤ So, no point “sufficiently near”x has smaller f -value thanx . A local max- imum is defined similarly, but with the inequality reversed. A stronger notion is ¤ thatx is a global minimum of f which means that ¤ n f (x )· f (x) for allx2R . A global maximum satisfies the opposite inequality. 6The definition of local minimum has a “variational character”; it concerns ¤ the behavior of f nearx . Due to this it is perhaps natural that Taylor’s formula, which gives an approximation of f in such a neighborhood, becomes a main tool for characterizing and finding local minima. We present Taylor’s formula, in different versions, in Section 1.3. An extension of the notion of minimum and maximum is for constrained problems where we want, for instance, to minimize f (x) over all x lying in a ¤ given set C . Then x 2 C is a local minimum of f over the set C , or subject to ¤ x2 C as we shall say, provided no point in C in some neighborhood of x has ¤ smaller f -value thanx . A similar extension holds for global minimum over C , and for maxima. Example 1.1. To make these things concrete, consider an example from plane geometry. Consider the point set C Æ (z , z ) : z ¸ 0, z ¸ 0, z Å z · 1 in the 1 2 1 2 1 2 plane. We want to find a point xÆ (x , x )2 C which is closest possible to the 1 2 pointaÆ (3,2). This can be formulated as the minimization problem 2 2 minimize (x ¡ 3) Å (x ¡ 2) 1 2 subject to x Å x · 1 1 2 x ¸ 0, x ¸ 0. 1 2 2 2 The function we want to minimize is f (x)Æ (x ¡3) Å(x ¡2) which is a quadratic 1 2 function. This is the square of the distance betweenx anda; and minimizing the distance or the square of the distance is equivalent (why?). A minimum here is ¤ x Æ (1,0), as can be seen from a simple geometric argument where we draw the 2 normal from (3,2) to the line x Å x Æ 1. If we instead minimize f overR , the 1 2 ¤ unique global minimum is clearly x ÆaÆ (3,2). It is also useful, and not too hard, to find these minima analytically. In optimization one considers minimization and maximization problems. As maxf (x) :x2 SÆ¡min¡f (x) :x2 S it is clear how to convert a maximization problem into a minimization problem (or vise versa). This transformation may, however, change the properties of the function you work with. For instance, if f is convex (definitions come later), then¡f is not convex (unless f is linear), so rewriting between minimization and maximization may take you out of a class of “good problems”. Note that a minimum or maximum may not exist. A main tool one uses to establish that 7optimal solutions really exist is the extreme value theorem as stated next. You may want to look these notions up in 8. n Theorem 1.2. Let C be a subset ofR which is closed and bounded, and let f : CR be a continuous function. Then f attains both its (global) minimum 1 2 and maximum, so these are pointsx ,x 2 C with 1 2 f (x )· f (x)· f (x ) (x2 C ). 1.2 Some applications It is useful to see some application areas for optimization. They are many, and here we mention a few in some detail. The methods we will learn later will be applied to these examples. 1.2.1 Portfolio optimization The following optimization problem was introduced by Markowitz in order to find an optimal portfolio in a financial market; he later received the Nobel prize 1 in economics (in 1990) for his contributions in this area: P P n minimize ® c x x ¡ ¹ x i ,j·n i j i j j j jÆ1 subject to P n x Æ 1 j jÆ1 x ¸ 0 (j· n). j The model may be understood as follows. The decision variables are x , x , 1 2 ..., x where x is the fraction of a total investment that is made in (say) stock i . n i Thus one has available a set of stocks in different companies (Statoil, IBM, Apple etc.) or bonds. The fractions x must be nonnegative (so we consider no short i sale) and add up to 1. The function f to be minimized is n X X f (x)Æ® c x x ¡ ¹ x . i j i j j j i ,j·n jÆ1 1 The precise term is “Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel” 8It can be explained in terms of random variables. Let R be the return on stock j j , this is a random variable, and let¹ ÆER be the expectation of R . So if X j j j P n denotes the random variable XÆ x R , which is the return on our portfolio j j jÆ1 P n (= mix among investments), thenEXÆ ¹ x which is the second term in f . j j jÆ1 The minus sign in front explains that we really want to maximize the expected return. The first term in f is there because just looking at expected return is too simple. We want to spread our investments to reduce the risk. The first term in f is the variance of X multiplied by a weight factor®; the constant c is the i j covariance of R and R , defined by i j c ÆE(R ¡¹ )(R ¡¹ ). i j i i j j c is also called the variance of R . i i i So f is a weighted difference of variance and expected return. This is what we want to minimize. The optimization problem is to minimize a quadratic function subject to linear constraints. We shall discuss theory and methods for such problems later. In order to use such a model one needs to find good values for all the param- eters¹ and c ; this is done using historical data from the stock markets. The j i j weight parameter® is often varied and the optimization problem is solved for each such “interesting” value. This makes it possible to find a so-called efficient frontier of expectation versus variance for optimal solutions. The Markowitz model is a useful tool for financial investments, and now ex- tensions and variations of the model exist, e.g., by using different ways of mea- suring risk. All such models involve a balance between risk and expected return. 1.2.2 Fitting a model In many applications one has a mathematical model of some phenomenon where the model has some parameters. These parameters represent a flexibility of the model, and they may be adjusted so that the model explains the phenomenon best possible. To be more specific consider a model yÆ F (x) ® m n for some function F :R R. Here®Æ (® ,® ,...,® )2R is a parameter vec- ® 1 2 n tor (so we may have several parameters). Perhaps there are natural constraints n on the parameter, say®2 A for a given set A inR . For instance, consider ® 2 yÆ® cos x Å x 1 1 2 9® 2 so here nÆ mÆ 2,®Æ (® ,® ) and F (x)Æ® cos x Åx where (say)® 2R and 1 2 ® 1 1 1 2 ® 2 1,2. 2 The general model may also be thought of as yÆ F (x)Å error ® since it is usually a simplification of the system one considers. In statistics one specifies this error term as a random variable with some (partially) known dis- tribution. Sometimes one calls y the dependent variable and x the explaining variable. The goal is to understand how y depends onx. To proceed, assume we are given a number of observations of the phenomenon given by points i i (x , y ) (iÆ 1,2,...,m). i i meaning that one has observed y corresponding toxÆx . We have m such ob- servations. Usually (but not always) we have m¸ n. The model fit problem is to adjust the parameter® so that the model fits the given data as good as possible. This leads to the optimization problem m X i i 2 minimize (y ¡ F (x )) subject to ®2 A. ® iÆ1 The optimization variable is the parameter®. Here the model error is quadratic (corresponding to the Euclidean norm), but other norms are also used. This optimization problem above is a constrained nonlinear optimization problem. When the function F depends linearly on®, which often is the case in ® practice, the problem becomes the classical least squares approximation prob- lem which is treated in basic linear algebra courses. The solution is then charac- terized by a certain linear system of equations, the so-called normal equations. 1.2.3 Maximum likelihood A very important problem in statistics, arising in many applications, is param- eter estimation and, in particular, maximum likelihood estimation. It leads to optimization. Let Y be a “continuous” real-valued random variable with probability den- sisty p (y). Here x is a parameter (often one uses other symbols for the pa- x rameter, like», µ etc.). For instance, if Y is a normal (Gaussian) variable with 2 1 ¡(y¡x) /2 p expectation x and variance 1, then p (y)Æ e and x 2¼ Z b 1 2 ¡(y¡x) /2 P(a· Y · b)Æ p e d y a 2¼ 10whereP denotes probability. Assume Y is the outcome of an experiment, and that we have observed Y Æ y (so y is a known real number or a vector, if several observations were made). On the basis of y we want to estimate the value of the parameter x which “explains” best possible our observation Y Æ y. We have now available the probability den- sity p (¢). The function x p (y), for fixed y, is called the likelihood function. x x It gives the “probability mass” in y as a function of the parameter x. The max- imum likelihood problem is to find a parameter value x which maximizes the likelihood, i.e., which maximizes the probability of getting precisely y. This is an optimization problem max p (y) x x where y is fixed and the optimization variable is x. We may here add a con- straint on x, say x2 C for some set C , which may incorporate possible knowl- edge of x and assure that p (y) is positive for x2 C . Often it is easier to solve the x equivalent optimization problem of maximizing the logarithm of the likelihood function max ln p (y) x x This is a nonlinear optimization problem. Often, in statistics, there are several n parameters, so x2R for some n, and we need to solve a nonlinear optimiza- tion problem in several variables, possibly with constraints on these variables. If the likelihood function, or its logarithm, is a concave function, we have (after multiplying by¡1) a convex optimization problem. Such problems are easier to solve than general optimization problems. This will be discussed later. As a specific example assume we have the linear statistical model yÆ AxÅw n m where A is a given m£ n matrix, x2R is an unknown parameter, w2R is a n random variable (the “noise”), and y2R is the observed quantity. We assume that the components of w, i.e., w , w ,..., w are independent and identically 1 2 m distributed with common density function p onR. This leads to the likelihood function m Y p (y)Æ p(y ¡a x) x i i iÆ1 where a is the i ’th row in A. Taking the logarithm we obtain the maximum i likelihood problem m X max ln p(y ¡a x). i i iÆ1 In many applications of statistics is is central to solve this optimization problem numerically. 11Example 1.3. Let us take a look at a model take from physics for desintegration of muons. The angleµ in electron radiation for desintegration of muons has a probability density 1Å®x p(x;®)Æ (1.1) 2 for x2 ¡1,1, where xÆ cosµ, and where® is an unknown parameter in ¡1,1. Our goal is to estimate® from n measurements xÆ (x ,..., x ). In this case the 1 n Q n likelihood function, which we seek to maximize, takes the form g (®)Æ p(x ;®). i iÆ1 Taking logarithms and multiplying by¡1, our problem is to minimize à n n Y X f (®)Æ¡ln g (®)Æ¡ln p(x ;®) Æ¡ ln((1Å®x )/2). (1.2) i i iÆ1 iÆ1 We compute n n X X x /2 x i i 0 f (®)Æ¡ Æ¡ (1Å®x )/2 1Å®x i i iÆ1 iÆ1 2 n X x 00 i f (®)Æ 2 (1Å®x ) i iÆ1 00 We see that f (®)¸ 0, so that f is convex. As explained, this will make the prob- 0 lem easier to solve using numerical methods. If we try to solve f (®)Æ 0 we will 0 run into problems, however. We see, however, that f (®) 0 when ®§1, x 1 0 i and since Æ , we must have that f (®)1 when®¡1/x from i 1Å®x 1/x Å® i i 0 below, and f (®)¡1 when®¡1/x from above. It is therefore clear that i f has exactly one minimum in every interval of the form ¡1/x ,¡1/x when i iÅ1 we list the x in increasing order. It is not for sure that there is a minimum within i ¡1,1 at all. If all measurements have the same sign we are guaranteed to find no such point. In this case the minimum must be one of the end points in the interval. We will later look into numerical method for finding this minimum. 1.2.4 Optimal control problems Recall that a discrete dynamical system is an equation x Æ h (x ) (tÆ 0,1,...) tÅ1 t t n wherex 2R ,x is the initial solution, and h is a given function for each t. We t 0 t here think of t as time andx is the state of the process at time t. For instance, let t nÆ 1 and consider h (x)Æ ax (tÆ 0,1,...) for some a2R. Then the solution is t t n x Æ a x . Another example is when A is an n£n matrix,x 2R and h (x)Æ Ax t 0 t t 12t for each t. Then the solution isx Æ A x . For the more general situation, where t 0 the system functions h may be different, it may be difficult to find an explicit t solution for x . Numerically, however, we compute x simply in a for-loop by t t computingx , thenx Æ f (x ) and thenx Æ f (x ) etc. 0 1 1 0 2 2 1 Now, consider a dynamical system where we may “control” the system in each time step. We restrict the attention to a finite time span, t Æ 0,1,...,T . A proper model is then x Æ h (x ,u ) (tÆ 0,1,...,T¡ 1) tÅ1 t t t wherex is the state of the system at time t and the new variableu is the control t t n m at time t. We assumex 2R andu 2R for each t (but these things also work t t if these vectors lie in spaces of different dimensions). Thus, when we choose the controlsu ,u ,...,u andx is known, the sequence x of states is uniquely 0 1 T¡1 0 t n m determined. Next, assume there are given functions f :R £R R that we call t cost functions. We think of f (x ,u ) as the “cost” at time t when the system is t t t in statex and we choose controlu . The optimal control problem is t t P T¡1 minimize f (x )Å f (x ,u ) T T t t t tÆ0 subject to (1.3) x Æ h (x ,u ) (tÆ 0,1,...,T¡ 1) tÅ1 t t t where the control is the sequence (u ,u ,...,u ) to be determined. This 0 1 T¡1 problem arises an many applications, in engineering, finance, economics etc. N We now rewrite this problem. First, letuÆ (u ,u ,...,u )2R where NÆ T n. 1 2 T Since, as we noted, x is uniquely determined byu, there is a function v such t t thatx Æv (u) (tÆ 1,2,...,T );x is given. Therefore the total cost may be writ- t t 0 ten T¡1 T¡1 X X f (x )Å f (x ,u )Æ f (v (u))Å f (v (u),u ) :Æ f (u) T T t t t T T t t t tÆ0 tÆ0 which is a function ofu. Thus, we see that the optimal control problem may be transformed to the unconstrained optimization problem min f (u) N u2R Sometimes there may be constraints on the control variables, for instance that they each lie in some interval, and then the transformation above results in a constrained optimization problem. 131.2.5 Linear optimization This is not an application, but rather a special case of the general nonlinear opti- mization problem where all functions are linear. A linear optimization problem, also called linear programming, has the form T minimize c x subject to (1.4) AxÆb, x¸ 0. m Here A is an m£ n matrix,b2R andx¸ 0 means that x ¸ 0 for each i· n. i So in linear optimization one minimizes (or maximizes) a linear function subject to linear equations and nonnegativity on the variables. Actually, one can show any problem with constraints that are linear equations and/or linear inequali- ties may be transformed into the form above. Such problems have a wide range of application in science, engineering, economics, business etc. Applications in- clude portfolio optimization and many planning problems for e.g. production, transportation etc. Some of these problems are of a combinatorial nature, but linear optimization is a main tool here as well. We shall not treat linear optimization in detail here since this is the topic of a separate course, INF-MAT3370 Linear optimization. In that course one presents some powerful methods for such problems, the simplex algorithm and interior point methods. In addition one considers applications in network flow models and game theory. 1.3 Multivariate calculus and linear algebra We first recall some useful facts from linear algebra. The spectral theorem says that if A is a real symmetric matrix, then there is an orthogonal matrix V (i.e., its columns are orthonormal) and a diagonal matrix D such that T AÆ V DV . The diagonal of D contains the eigenvalues of A, and A has an orthonormal set of eigenvectors (the columns of V ). 2 T n A real symmetric matrix is positive semidefinite if x Ax¸ 0 for all x2R . 2 See Section 7.2 in 7 14The following statements are equivalent (i) A is positive semidefinite, (ii) all eigenvalues of A are nonnegative, T (iii) AÆ W W for some matrix W . T Similarly, a real symmetric matrix is positive definite ifx AxÈ 0 for all nonzero n x2R . The following statements are equivalent (i) A is positive definite, (ii) all eigenvalues of A are positive, T (iii) AÆ W W for some invertible matrix W . Every positive definite matrix is therefore invertible. We also recall some central facts from multivariate calculus. They will be n used repeatedly in these notes. Let f :R R be a real-valued function defined n onR . The gradient of f atx is the n-tuple µ ¶ f (x) f (x) f (x) rf (x)Æ , ,..., . x x x 1 2 n 3 We will always identify an n-tuple with the corresponding column vector . Of course, the gradient only exists if all the partial derivatives exist. Second or- der information is contained in a matrix: assuming f has second order par- 4 2 tial derivatives we define the Hessian matrix r f (x) as the n£ n matrix whose (i , j )’th entry is 2 f (x) . x x i j If these second order partial derivatives are continuous, then we may switch the 2 order in the derivations, andr f (x) is a symmetric matrix. For vector-valued functions we also need the derivative. Consider the vector- valued function F given by 2 3 F (x) 1 6 7 F (x) 6 2 7 6 7 F (x)Æ . 6 7 . 4 . 5 F (x) n 3 This is somewhat different from 8, since the gradient there is always considered as a row vector 4 See Section 5.9 in 8 15n 0 5 so F :R R is the i th component function ofF .F denotes the Jacobi matrix , i or simply the derivative, ofF 2 3 F (x) F (x) F (x) 1 1 1 ¢¢¢ x x x 1 2 n 6 7 F (x) F (x) F (x) 2 2 1 6 7 ¢¢¢ x x x 6 7 0 1 2 n F (x)Æ 6 7 . 6 . 7 . 4 5 F (x) F (x) F (x) n n n ¢¢¢ x x x 1 2 n The i th row of this matrix is therefore the gradient of F , now viewed as a row i vector. 6 Next we recall Taylor’s theorems from multivariate calculus : n Theorem 1.4 (First order Taylor theorem). Let f :R R be a function hav- n ing continuous partial derivatives in some ball B(x;r ). Then, for eachh2R withkhkÇ r there is some t2 (0,1) such that T f (xÅh)Æ f (x)Årf (xÅ th) h. The next one is known as Taylor’s formula, or the second order Taylor’s the- 7 orem : n Theorem 1.5 (Second order Taylor theorem). Let f :R R be a function having second order partial derivatives that are continuous in some ball n B(x;r ). Then, for eachh2R withkhkÇ r there is some t2 (0,1) such that 1 T T 2 f (xÅh)Æ f (x)Årf (x) hÅ h r f (xÅ th)h. 2 This may be shown by considering the one-variable function g (t)Æ f (xÅth) and applying the chain rule and Taylor’s formula in one variable. There is another version of the second order Taylor theorem in which the Hessian is evaluated in x and, as a result, we get an error term. This theorem 8 shows how f may be approximated by a quadratic polynomial in n variables : 5 See Section 2.6 in 8 6 This theorem is also the mean value theorem of functions in several variables, see Section 5.5 in 8 7 See Section 5.9 in 8 8 See Section 5.9 in 8 16n Theorem 1.6 (Second order Taylor theorem, version 2). Let f :R R be a function having second order partial derivatives that are continuous in some n n ball B(x;r ). Then there is a function² :R R such that, for eachh2R with khkÇ r , 1 T T 2 2 h r f (x)hŲ(h)khk . f (xÅh)Æ f (x)Årf (x) hÅ 2 Here²(y) 0 wheny 0. Using the O-notation from Definition ??, the very useful approximations we get from Taylor’ theorems can thus be summarized as follows: Taylor approximations: T First order: f (xÅh) Æ f (x)Årf (x) hÅO(khk) T ¼ f (x)Årf (x) h. 1 T T 2 2 Second order: f (xÅh) Æ f (x)Årf (x) hÅ h r f (x)hÅO(khk ) 2 T 1 T 2 ¼ f (x)Årf (x) hÅ h r f (x)h. 2 We introduce notation for these approximations 1 T T (x;xÅh) Æ f (x)Årf (x) h f 2 T 1 T 2 T (x;xÅh) Æ f (x)Årf (x) hÅ h r f (x)h f 2 As we shall see, one can get a lot of optimization out of these approximations We also need a Taylor theorem for vector-valued functions, which follows by applying Taylor’ theorem above to each component function: Theorem 1.7 (First order Taylor theorem for vector-valued functions). Let n m F :R R be a vector-valued function which is continuously differentiable in a neighborhood N ofx. Then 0 F (xÅh)ÆF (x)ÅF (x)hÅO(khk) whenxÅh2 N . n m k n Finally, ifF :R R andG :R R we define the compositionHÆF±G k m as the functionH :R R byH(x)ÆF (G(x)). Then, under natural differentia- 179 bility assumption the following chain rule holds: 0 0 0 H (x)ÆF (G(x))G (x). Here the right-hand side is a product of two matrices, the respective Jacobi ma- trices evaluated in the right points. Finally, we discuss some notions concerning the convergence of sequences. 1 Definition 1.8 (Linear convergence). We say that a sequence x con- k kÆ1 ¤ verges tox linearly (or that the convergence speed in linear) if there is a°Ç 1 such that ¤ ¤ kx ¡x k·°kx ¡x k (kÆ 0,1,...). kÅ1 k A faster convergence rate is superlinear convergence which means that ¤ ¤ lim kx ¡x k/kx ¡x kÆ 0 kÅ1 k k1 A special type of superlinear convergence is quadratic convergence where ¤ ¤ 2 kx ¡x k·°kx ¡x k (kÆ 0,1,...) kÅ1 k for some°Ç 1. Exercises for Chapter 1 1. Give an example of a function f :RR with 10 global minima. 2. Consider the function f (x)Æ x sin(1/x) defined for xÈ 0. Find its local min- ima. What about global minimum? 3. Let f : X R be a function (with nonnegative function values). Explain Å 2 why it is equivalent to minimize f over x2 X or minimize f (x) over X . 4. In Example 1.2.3 we mentioned that optimizing the function p (y) is equiv- x alent to optimizing the function ln p (y). Explain why maximizing/minimizing x g is the same as maximizing/minimizing ln g for any positive function g . 2 2 2 5. Consider f :R R given by f (x)Æ (x ¡ 3) Å (x ¡ 2) . How would you 1 2 ¤ explain to anyone thatx Æ (3, 2) is a minimum point? 2 2 6. The level sets of a function f :R R are sets of the form L Æx2R : f (x)Æ ® 1 2 2 ®. Let f (x)Æ (x ¡ 1) Å (x ¡ 3) . Draw the level sets in the plane for ®Æ 1 2 4 10,5,1,0.1. 9 See Section 2.7 in 8 18

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.