Lectures on modern convex optimization 2018

lectures on modern convex optimization analysis algorithms and engineering applications and lectures on modern convex optimization mps-siam series on optimization
ImogenCameron Profile Pic
Published Date:14-07-2017
Your Website URL(Optional)
. LECTURES ON MODERN CONVEX OPTIMIZATION y  Aharon Ben-Tal and Arkadi Nemirovski y The William Davidson Faculty of Industrial Engineering & Management, Technion Israel Institute of Technology, abentalie.technion.ac.il http://ie.technion.ac.il/Home/Users/morbt0.html  H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, nemirovsisye.gatech.edu http://www.isye.gatech.edu/faculty-staff/profile.php?entry=an63 Fall Semester 2013Lecture 1 From Linear to Conic Programming 1.1 Linear programming: basic notions A Linear Programming (LP) program is an optimization program of the form   T min c x Axb ; (LP) where n  x2 R is the design vector n T  c2 R is a given vector of coecients of the objective function c x m  A is a given mn constraint matrix, and b 2 R is a given right hand side of the constraints. (LP) is called feasible, if its feasible set F =fx :Axb 0g is nonempty; a point x2F is called a feasible solution to (LP); T bounded below, if it is either infeasible, or its objective c x is bounded below onF. For a feasible bounded below problem (LP), the quantity  T c  inf c x x:Axb0 is called the optimal value of the problem. For an infeasible problem, we setc = +1,  while for feasible unbounded below problem we set c =1.  (LP) is called solvable, if it is feasible, bounded below and the optimal value is attained, i.e., T  there exists x2F with c x =c . An x of this type is called an optimal solution to (LP). A priori it is unclear whether a feasible and bounded below LP program is solvable: why should the in mum be achieved? It turns out, however, that a feasible and bounded below program (LP) always is solvable. This nice fact (we shall establish it later) is speci c for LP. Indeed, a very simple nonlinear optimization program   1 min x 1 x is feasible and bounded below, but it is not solvable. 1516 LECTURE 1. FROM LINEAR TO CONIC PROGRAMMING 1.2 Duality in Linear Programming The most important and interesting feature of linear programming as a mathematical entity (i.e., aside of computations and applications) is the wonderful LP duality theory we are about to consider. We motivate this topic by rst addressing the following question: Given an LP program    T c = min c x Axb 0 ; (LP) x  how to nd a systematic way to bound from below its optimal value c ? Why this is an important question, and how the answer helps to deal with LP, this will be seen in the sequel. For the time being, let us just believe that the question is worthy of the e ort. A trivial answer to the posed question is: solve (LP) and look what is the optimal value. There is, however, a smarter and a much more instructive way to answer our question. Just to get an idea of this way, let us look at the following example: 8 9 x + 2x +::: + 2001x + 2002x 1  0; 1 2 2001 2002 = min x +x +::: +x 2002x + 2001x +::: + 2x +x 100  0; : 1 2 2002 1 2 2001 2002 : ; ::::: ::: ::: 101 We claim that the optimal value in the problem is . How could one certify this bound? 2003 This is immediate: add the rst two constraints to get the inequality 2003(x +x +::: +x +x ) 101 0; 1 2 1998 2002 and divide the resulting inequality by 2003. LP duality is nothing but a straightforward gener- alization of this simple trick. 1.2.1 Certi cates for solvability and insolvability Consider a ( nite) system of scalar inequalities with n unknowns. To be as general as possible, we do not assume for the time being the inequalities to be linear, and we allow for both non- strict and strict inequalities in the system, as well as for equalities. Since an equality can be represented by a pair of non-strict inequalities, our system can always be written as f (x) 0; i = 1;:::;m; (S) i i where every is either the relation " " or the relation " ". i The basic question about (S) is (?) Whether (S) has a solution or not. Knowing how to answer the question (?), we are able to answer many other questions. E.g., to  verify whether a given real a is a lower bound on the optimal value c of (LP) is the same as to verify whether the system ( T c x +a 0 Axb  0 has no solutions. The general question above is too dicult, and it makes sense to pass from it to a seemingly simpler one:1.2. DUALITY IN LINEAR PROGRAMMING 17 (??) How to certify that (S) has, or does not have, a solution. Imagine that you are very smart and know the correct answer to (?); how could you convince somebody that your answer is correct? What could be an \evident for everybody" certi cate of the validity of your answer?  If your claim is that (S) is solvable, a certi cate could be just to point out a solution x to   (S). Given this certi cate, one can substitute x into the system and check whether x indeed is a solution. Assume now that your claim is that (S) has no solutions. What could be a \simple certi cate" of this claim? How one could certify a negative statement? This is a highly nontrivial problem not just for mathematics; for example, in criminal law: how should someone accused in a murder prove his innocence? The \real life" answer to the question \how to certify a negative statement" is discouraging: such a statement normally cannot be certi ed (this is where the rule \a person is presumed innocent until proven guilty" comes from). In mathematics, however, the situation is di erent: in some cases there exist \simple certi cates" of negative statements. E.g., in order to certify that (S) has no solutions, it suces to demonstrate that a consequence of (S) is a contradictory inequality such as 1 0: For example, assume that ,i = 1;:::;m, are nonnegative weights. Combining inequalities from i (S) with these weights, we come to the inequality m X f (x) 0 (Cons()) i i i=1 where is either " " (this is the case when the weight of at least one strict inequality from (S) is positive), or " " (otherwise). Since the resulting inequality, due to its origin, is a consequence of the system (S), i.e., it is satis ed by every solution toS), it follows that if (Cons()) has no solutions at all, we can be sure that (S) has no solution. Whenever this is the case, we may treat the corresponding vector  as a \simple certi cate" of the fact that (S) is infeasible. Let us look what does the outlined approach mean when (S) is comprised of linear inequal- ities:    " " T (S) : fa x b; i = 1;:::;mg = i i i i " " Here the \combined inequality" is linear as well: m m X X T (Cons()) : ( a ) x b i i i=1 i=1 ( is " " whenever 0 for at least onei with = " ", and is " " otherwise). Now, i i when can a linear inequality T d x e be contradictory? Of course, it can happen only whend = 0. Whether in this case the inequality is contradictory, it depends on what is the relation : if = " ", then the inequality is contradictory if and only if e 0, and if = " ", it is contradictory if and only if e 0. We have established the following simple result:18 LECTURE 1. FROM LINEAR TO CONIC PROGRAMMING Proposition 1.2.1 Consider a system of linear inequalities ( T a x b; i = 1;:::;m ; i s i (S) : T a x  b; i =m + 1;:::;m: i s i with n-dimensional vector of unknowns x. Let us associate with (S) two systems of linear inequalities and equations with m-dimensional vector of unknowns : 8 (a)   0; m P (b) a = 0; i i i=1 m P T : I (c ) b  0; I i i i=1 m s P : (d )  0: I i i=1 8 (a)   0; m P (b) a = 0; i i T : II i=1 m P : (c ) b 0: i i II i=1 Assume that at least one of the systemsT ,T is solvable. Then the system (S) is infeasible. I II Proposition 1.2.1 says that in some cases it is easy to certify infeasibility of a linear system of inequalities: a \simple certi cate" is a solution to another system of linear inequalities. Note, however, that the existence of a certi cate of this latter type is to the moment only a sucient, but not a necessary, condition for the infeasibility of (S). A fundamental result in the theory of linear inequalities is that the sucient condition in question is in fact also necessary: Theorem 1.2.1 General Theorem on Alternative In the notation from Proposition 1.2.1, sys- tem (S) has no solutions if and only if eitherT , orT , or both these systems, are solvable. I II There are numerous proofs of the Theorem on Alternative; in my taste, the most instructive one is to reduce the Theorem to its particular case the Homogeneous Farkas Lemma: Homogeneous Farkas Lemma A homogeneous nonstrict linear inequality T a x 0 is a consequence of a system of homogeneous nonstrict linear inequalities T a x 0; i = 1;:::;m i if and only if it can be obtained from the system by taking weighted sum with nonnegative weights: T T (a) a x 0; i = 1;:::;m)a x 0; i m (1.2.1) P (b) 9  0 :a =  a : i i i i The reduction of GTA to HFL is easy. As about the HFL, there are, essentially, two ways to prove the statement:  The \quick and dirty" one based on separation arguments (see Section B.2.6 and/or Exercise B.14), which is as follows:1.2. DUALITY IN LINEAR PROGRAMMING 19 n 1. First, we demonstrate that if A is a nonempty closed convex set in R anda is a point from n n R nA, then a can be strongly separated from A by a linear form: there exists x2 R such that T T x a inf x b: (1.2.2) b2A To this end, it suces to verify that p T (a) In A, there exists a point closest to a w.r.t. the standard Euclidean normkbk = b b, 2 i.e., that the optimization program minkabk 2 b2A has a solution b ;  (b) Setting x =b a, one ensures (1.2.2).  Both (a) and (b) are immediate. 2. Second, we demonstrate that the set m X A =fb :9 0 :b =  ag i i i=1 the cone spanned by the vectorsa ;:::;a is convex (which is immediate) and closed (the 1 m proof of this crucial fact also is not dicult). 3. Combining the above facts, we immediately see that either a2A, i.e., (1.2.1.b) holds, P T T or there exists x such that x a inf x  a . i i 0 i T The latter inf is nite if and only if x a  0 for all i, and in this case the inf is 0, so that i T T the \or" statement says exactly that there exists x with a x 0, a x 0, or, which is the i same, that (1.2.1.a) does not hold. Thus, among the statements (1.2.1.a) and the negation of (1.2.1.b) at least one (and, as it is immediately seen, at most one as well) always is valid, which is exactly the equivalence (1.2.1).  \Advanced" proofs based purely on Linear Algebra facts (see Section B.2.5.A). The advantage of these purely Linear Algebra proofs is that they, in contrast to the outlined separation-based proof, n do not use the completeness of R as a metric space and thus work when we pass from systems with real coecients and unknowns to systems with rational (or algebraic) coecients. As a result, an advanced proof allows to establish the Theorem on Alternative for the case when the coecients and unknowns in (S),T ,T are restricted to belong to a given \real eld" (e.g., are rational). I II We formulate here explicitly two very useful principles following from the Theorem on Al- ternative: A. A system of linear inequalities T a x b; i = 1;:::;m i i i has no solutions if and only if one can combine the inequalities of the system in a linear fashion (i.e., multiplying the inequalities by nonnegative weights, adding T the results and passing, if necessary, from an inequality a x b to the inequality T T a xb) to get a contradictory inequality, namely, either the inequality 0 x 1, or T the inequality 0 x 0. B. A linear inequality T a x b 0 0 020 LECTURE 1. FROM LINEAR TO CONIC PROGRAMMING is a consequence of a solvable system of linear inequalities T a x b; i = 1;:::;m i i i if and only if it can be obtained by combining, in a linear fashion, the inequalities of the system and the trivial inequality 01. It should be stressed that the above principles are highly nontrivial and very deep. Consider, e.g., the following system of 4 linear inequalities with two variables u;v: 1u 1 1v 1: From these inequalities it follows that 2 2 u +v  2; () which in turn implies, by the Cauchy inequality, the linear inequality u +v 2: p p p 2 2 2 2 2 u +v = 1u + 1v 1 + 1 u +v  ( 2) = 2: () The concluding inequality is linear and is a consequence of the original system, but in the demonstration of this fact both steps () and () are \highly nonlinear". It is absolutely unclear a priori why the same consequence can, as it is stated by Principle A, be derived from the system in a linear manner as well of course it can it suces just to add two inequalities u 1 and v 1. Note that the Theorem on Alternative and its corollaries A and B heavily exploit the fact that we are speaking about linear inequalities. E.g., consider the following 2 quadratic and 2 linear inequalities with two variables: 2 (a) u  1; 2 (b) v  1; (c) u  0; (d) v  0; along with the quadratic inequality (e) uv  1: The inequality (e) is clearly a consequence of (a) (d). However, if we extend the system of inequalities (a) (b) by all \trivial" (i.e., identically true) linear and quadratic inequalities 2 2 2 2 2 2 with 2 variables, like 0 1, u +v  0, u + 2uv +v  0, u uv +v  0, etc., and ask whether (e) can be derived in a linear fashion from the inequalities of the extended system, the answer will be negative. Thus, Principle A fails to be true already for quadratic inequalities (which is a great sorrow otherwise there were no dicult problems at all) We are about to use the Theorem on Alternative to obtain the basic results of the LP duality theory. 1.2.2 Dual to an LP program: the origin As already mentioned, the motivation for constructing the problem dual to an LP program 2 2 3 3 T a 1   6 6 7 7 T a 6 6 7 7  T 2 mn c = min c x Axb 0 A = 2 R (LP) 6 6 7 7 x 4 4 ::: 5 5 T a m1.2. DUALITY IN LINEAR PROGRAMMING 21  is the desire to generate, in a systematic way, lower bounds on the optimal value c of (LP). An evident way to bound from below a given function f(x) in the domain given by system of inequalities g (x)b; i = 1;:::;m; (1.2.3) i i is o ered by what is called the Lagrange duality and is as follows: Lagrange Duality:  Let us look at all inequalities which can be obtained from (1.2.3) by linear aggre- gation, i.e., at the inequalities of the form X X yg (x) yb (1.2.4) i i i i i i with the \aggregation weights" y  0. Note that the inequality (1.2.4), due to its i origin, is valid on the entire set X of solutions of (1.2.3).  Depending on the choice of aggregation weights, it may happen that the left hand n side in (1.2.4) isf(x) for all x2 R . Whenever it is the case, the right hand side P yb of (1.2.4) is a lower bound on f in X. i i i P Indeed, on X the quantity y b is a lower bound ony g (x), and fory in question i i i i i the latter function of x is everywheref(x). It follows that  The optimal value in the problem 8 9 = X y 0; (a) P max yb : n (1.2.5) i i yg (x)f(x)8x2 R (b) i i y : ; i i is a lower bound on the values of f on the set of solutions to the system (1.2.3). Let us look what happens with the Lagrange duality when f andg are homogeneous linear i T T functions: f = c x, g (x) = a x. In this case, the requirement (1.2.5.b) merely says that i i P T c = ya (or, which is the same, A y = c due to the origin of A). Thus, problem (1.2.5) i i i becomes the Linear Programming problem n o T T  max b y :A y =c; y 0 ; (LP ) y which is nothing but the LP dual of (LP). By the construction of the dual problem,  Weak Duality The optimal value in (LP ) is less than or equal to the optimal value in (LP). In fact, the \less than or equal to" in the latter statement is \equal", provided that the optimal  value c in (LP) is a number (i.e., (LP) is feasible and below bounded). To see that this indeed  T is the case, note that a real a is a lower bound on c if and only if c x a whenever Ax b, or, which is the same, if and only if the system of linear inequalities T (S ) : c xa;Axb a22 LECTURE 1. FROM LINEAR TO CONIC PROGRAMMING has no solution. We know by the Theorem on Alternative that the latter fact means that some other system of linear equalities (more exactly, at least one of a certain pair of systems) does have a solution. More precisely, () (S ) has no solutions if and only if at least one of the following two systems with a m + 1 unknowns: 8 (a)  = ( ; ;:::; )  0; 0 1 m m P (b)  c + a = 0; 0 i i i=1 T : I m P (c )  a + b  0; I 0 i i i=1 : (d )  0; I 0 or 8 (a)  = ( ; ;:::; )  0; 0 1 m m P (b)  c a = 0; 0 i i T : II i=1 m P : (c )  a b 0 II 0 i i i=1 has a solution. Now assume that (LP) is feasible. We claim that under this assumption (S ) has no solutions a if and only ifT has a solution. I The implication "T has a solution) (S ) has no solution" is readily given by the above I a remarks. To verify the inverse implication, assume that (S ) has no solutions and the system a Axb has a solution, and let us prove that thenT has a solution. IfT has no solution, then I I by ()T has a solution and, moreover,  = 0 for (every) solution toT (since a solution II 0 II to the latter system with  0 solvesT as well). But the fact thatT has a solution  0 I II with  = 0 is independent of the values of a and c; if this fact would take place, it would 0 mean, by the same Theorem on Alternative, that, e.g., the following instance of (S ): a T 0 x1;Axb has no solutions. The latter means that the systemAxb has no solutions a contradiction with the assumption that (LP) is feasible. Now, ifT has a solution, this system has a solution with  = 1 as well (to see this, pass from I 0 a solution  to the one = ; this construction is well-de ned, since  0 for every solution 0 0 toT ). Now, an (m + 1)-dimensional vector  = (1;y) is a solution toT if and only if the I I m-dimensional vector y solves the system of linear inequalities and equations y  0; m P T A y ya = c; (D) i i i=1 T b y  a Summarizing our observations, we come to the following result. Proposition 1.2.2 Assume that system (D) associated with the LP program (LP) has a solution (y;a). Then a is a lower bound on the optimal value in (LP). Vice versa, if (LP) is feasible and a is a lower bound on the optimal value of (LP), then a can be extended by a properly chosen m-dimensional vector y to a solution to (D).1.2. DUALITY IN LINEAR PROGRAMMING 23 We see that the entity responsible for lower bounds on the optimal value of (LP) is the system (D): every solution to the latter system induces a bound of this type, and in the case when (LP) is feasible, all lower bounds can be obtained from solutions to (D). Now note that if T (y;a) is a solution to (D), then the pair (y;b y) also is a solution to the same system, and the T  lower bound b y on c is not worse than the lower bound a. Thus, as far as lower bounds on  c are concerned, we lose nothing by restricting ourselves to the solutions (y;a) of (D) with T  a =b y; the best lower bound on c given by (D) is therefore the optimal value of the problem   T T  max b y A y =c;y 0 , which is nothing but the dual to (LP) problem (LP ). Note that y  (LP ) is also a Linear Programming program. All we know about the dual problem to the moment is the following:  Proposition 1.2.3 Whenever y is a feasible solution to (LP ), the corresponding value of the T  dual objective b y is a lower bound on the optimal value c in (LP). If (LP) is feasible, then for   T every ac there exists a feasible solution y of (LP ) with b ya. 1.2.3 The LP Duality Theorem Proposition 1.2.3 is in fact equivalent to the following Theorem 1.2.2 Duality Theorem in Linear Programming Consider a linear programming program   T min c x Axb (LP) x along with its dual   T T  max b y A y =c;y 0 (LP ) y Then 1) The duality is symmetric: the problem dual to dual is equivalent to the primal; 2) The value of the dual objective at every dual feasible solution is the value of the primal objective at every primal feasible solution 3) The following 5 properties are equivalent to each other: (i) The primal is feasible and bounded below. (ii) The dual is feasible and bounded above. (iii) The primal is solvable. (iv) The dual is solvable. (v) Both primal and dual are feasible. Whenever (i) (ii) (iii) (iv) (v) is the case, the optimal values of the primal and the dual problems are equal to each other.  Proof. 1) is quite straightforward: writing the dual problem (LP ) in our standard form, we get 8 2 3 9 0 1 I 0 m = 6 7 T T A min b y 4 A 5y c  0 ; y : T ; A c24 LECTURE 1. FROM LINEAR TO CONIC PROGRAMMING where I is the m-dimensional unit matrix. Applying the duality transformation to the latter m problem, we come to the problem 8 9   0 =   0 T T T max 0  +c  + (c)  ;   0 ;; : ; A +A = b which is clearly equivalent to (LP) (set x =). 2) is readily given by Proposition 1.2.3. 3):  (i))(iv): If the primal is feasible and bounded below, its optimal value c (which of course is a lower bound on itself) can, by Proposition 1.2.3, be (non-strictly) T    majorized by a quantity b y , where y is a feasible solution to (LP ). In the T   situation in question, of course, b y =c (by already proved item 2)); on the other  hand, in view of the same Proposition 1.2.3, the optimal value in the dual isc . We conclude that the optimal value in the dual is attained and is equal to the optimal value in the primal. (iv))(ii): evident; (ii))(iii): This implication, in view of the primal-dual symmetry, follows from the implication (i))(iv). (iii))(i): evident. We have seen that (i)(ii)(iii)(iv) and that the rst (and consequently each) of these 4 equivalent properties implies that the optimal value in the primal problem is equal to the optimal value in the dual one. All which remains is to prove the equivalence between (i)(iv), on one hand, and (v), on the other hand. This is immediate: (i)(iv), of course, imply (v); vice versa, in the case of (v) the primal is not only feasible, but also bounded below (this is an immediate consequence of the feasibility of the dual problem, see 2)), and (i) follows. An immediate corollary of the LP Duality Theorem is the following necessary and sucient optimality condition in LP: Theorem 1.2.3 Necessary and sucient optimality conditions in linear programming Con-  sider an LP program (LP) along with its dual (LP ). A pair (x;y) of primal and dual feasible solutions is comprised of optimal solutions to the respective problems if and only if y Axb = 0; i = 1;:::;m; complementary slackness i i likewise as if and only if T T c xb y = 0 zero duality gap Indeed, the \zero duality gap" optimality condition is an immediate consequence of the fact that the value of primal objective at every primal feasible solution is the value of the dual objective at every dual feasible solution, while the optimal values in the primal and the dual are equal to each other, see Theorem 1.2.2. The equivalence between the \zero duality gap" and the \complementary slackness" optimality conditions is given by the following1.3. SELECTED ENGINEERING APPLICATIONS OF LP 25 computation: whenever x is primal feasible and y is dual feasible, the products y Axb , i i i = 1;:::;m, are nonnegative, while the sum of these products is precisely the duality gap: T T T T T T y Axb = (A y) xb y =c xb y: Thus, the duality gap can vanish at a primal-dual feasible pair (x;y) if and only if all products y Axb for this pair are zeros. i i 1.3 Selected Engineering Applications of LP Linear Programming possesses enormously wide spectrum of applications. Most of them, or at least the vast majority of applications presented in textbooks, have to do with Decision Making. Here we present an instructive sample of applications of LP in Engineering. The \common denominator" of what follows (except for the topic on Support Vector Machines, where we just tell stories) can be summarized as \LP Duality at work." 1.3.1 Sparsity-oriented Signal Processing and ` minimization 1 1 Let us start with Compressed Sensing which addresses the problem as follows: \in the nature" there exists a signal represented by an n-dimensional vector x. We observe (perhaps, in the presence of observation noise) the image of x under linear transformation x7Ax, where A is a given mn sensing matrix; thus, our observation is m y =Ax +2 R (1.3.1) where is observation noise. Our goal is to recoverx from the observedy. The outlined problem is responsible for an extremely wide variety of applications and, depending on a particular application, is studied in di erent \regimes." For example, in the traditional Statistics x is interpreted not as a signal, but as the vector of parameters of a \black box" which, given on n T 1 m input a vectora2 R , produces outputa x. Given a collectiona ;:::;a ofn-dimensional inputs i T to the black box and the corresponding outputs (perhaps corrupted by noise) y = a x + , i i we want to recover the vector of parametersx; this is called linear regression problem,. In order i T to represent this problem in the form of (1.3.1) one should make the row vectors a the rows of an mn matrix, thus getting matrix A, and to set y = y ;:::;y ,  =  ;:::; . The 1 m 1 m typical regime here is m n - the number of observations is much larger than the number of parameters to be recovered, and the challenge is to use this \observation redundancy" in order to get rid, to the best extent possible, of the observation noise. In Compressed Sensing the situation is opposite: the regime of interest is m n. At the rst glance, this regime seems to be completely hopeless: even with no noise ( = 0), we need to recover a solution x to an underdetermined system of linear equations y = Ax. When the number of variables is greater than the number of observations, the solution to the system either does not exist, or is not unique, and in both cases our goal seems to be unreachable. This indeed is so, unless we have at our disposal some additional information on x. In Compressed Sensing, this additional information is thatx is s-sparse has at most a given numbers of nonzero entries. Note that in many applications we indeed can be sure that the true signal x is sparse. Consider, e.g., the following story about signal detection: 1 For related reading, see, e.g., 16 and references therein.26 LECTURE 1. FROM LINEAR TO CONIC PROGRAMMING There aren locations where signal transmitters could be placed, andm locations with the receivers. The contribution of a signal of unit magnitude originating in location j to the signal measured by receiveri is a known quantitya , and signals originating ij in di erent locations merely sum up in the receivers; thus, if x is the n-dimensional vector with entriesx representing the magnitudes of signals transmitted in locations j j = 1; 2;:::;n, then them-dimensional vectory of (noiseless) measurements of them mn receivers is y =Ax, A2 R . Given this vector, we intend to recover y. Now, if the receivers are hydrophones registering noises emitted by submarines in certain part of Atlantic, tentative positions of submarines being discretized with resolution 500 m, the dimension of the vector x (the number of points in the discretization grid) will be in the range of tens of thousands, if not tens of millions. At the same time, the total number of submarines (i.e., nonzero entries in x) can be safely upper-bounded by 50, if not by 20. Sparse recovery from de cient observations Sparsity changes dramatically our possibilities to recover high-dimensional signals from their low-dimensional linear images: given in advance that x has at most sm nonzero entries, the possibility of exact recovery of x at least from noiseless observations y becomes quite natural. Indeed, let us try to recover x by the following \brute force" search: we inspect, one by one, all subsetsI of the index setf1;:::;ng rst the empty set, then n singletonsf1g,...,fng, then n(n1) 2-element subsets, etc., and each time try to solve the system of linear equations 2 y =Ax; x = 0 when j62I; j when arriving for the rst time at a solvable system, we terminate and claim that its solution is the true vector x. It is clear that we will terminate before all sets I of cardinality s are inspected. It is also easy to show (do it) that if every 2s distinct columns in A are linearly 2 independent (when m 2s, this indeed is the case for a matrix A in a \general position" ), then the procedure is correct it indeed recovers the true vector x. A bad news is that the outlined procedure becomes completely impractical already for \small" values ofs andn because of the astronomically large number of linear systems we need 3 to process . A partial remedy is as follows. The outlined approach is, essentially, a particular way to solve the optimization problem minfnnz(x) :Ax =yg; () 2 Here and in the sequel, the words \in general position" mean the following. We consider a family of objects, with a particular object an instance of the family identi ed by a vector of real parameters (you may think about the family of nn square matrices; the vector of parameters in this case is the matrix itself). We say that an instance of the family possesses certain property in general position, if the set of values of the parameter vector for which the associated instance does not possess the property is of measure 0. Equivalently: randomly perturbing the parameter vector of an instance, the perturbation being uniformly distributed in a (whatever small) box, we with probability 1 get an instance possessing the property in question. E.g., a square matrix \in general position" is nonsingular. 3 When s = 5 and n = 100, this number is 7:53e7 much, but perhaps doable. When n = 200 and s = 20, the number of systems to be processed jumps to 1:61e27, which is by many orders of magnitude beyond our \computational grasp"; we would be unable to carry out that many computations even if the fate of the mankind were dependent on them. And from the perspective of Compressed Sensing,n = 200 still is a completely toy size, by 3-4 orders of magnitude less than we would like to handle.1.3. SELECTED ENGINEERING APPLICATIONS OF LP 27 where nnz(x) is the number of nonzero entries in a vector x. At the present level of our knowl- edge, this problem looks completely intractable (in fact, we do not know algorithms solving the problem essentially faster than the brute force search), and there are strong reasons, to be addressed later in our course, to believe that it indeed is intractable. Well, if we do not know how to minimize under linear constraints the \bad" objective nnz(x), let us \approximate" this objective with one which we do know how to minimize. The true objective is separable: P n nnz(x) = (x ), where (s) is the function on the axis equal to 0 at the origin and equal j i=1 to 1 otherwise. As a matter of fact, the separable functions which we do know how to minimize 4 under linear constraints are sums of convex functions ofx ;:::;x . The most natural candidate 1 n to the role of convex approximation of(s) isjsj; with this approximation, () converts into the ` -minimization problem 1 ( ) n X min kxk := jxj :Ax =y ; (1.3.2) 1 j x i=1 which is equivalent to the LP program ( ) n X min w :Ax =y;w x w ; 1jn : j j j j x;w i=1 For the time being, we were focusing on the (unrealistic) case of noiseless observations  = 0. A realistic model is that 6= 0. How to proceed in this case, depends on what we know on . In the simplest case of \unknown but small noise" one assumes that, say, the Euclidean normkk of  is upper-bounded by a given \noise level\ : kk  . In this case, the ` 2 2 1 recovery usually takes the form xb = Argminfkwk :kAwyk g (1.3.3) 1 2 w b Now we cannot hope that our recoveryx will be exactly equal to the true s-sparse signalx, but perhaps may hope that xb is close to x when  is small. 5 Note that (1.3.3) is not an LP program anymore , but still is a nice convex optimization program which can be solved to high accuracy even for reasonable large m;n. s-goodness and nullspace property Let us say that a sensing matrix A is s-good, if in the noiseless case ` minimization (1.3.2) 1 recovers correctly all s-sparse signals x. It is easy to say when this is the case: the necessary and sucient condition for A to be s-good is the following nullspace property: X 1 n 8(z2 R :Az = 0;z6= 0;If1;:::;ng; Card(I)s) : jzj kzk : (1.3.4) i 1 2 i2I In other words, for every nonzero vector z2 KerA, the sumkzk of the s largest magnitudes s;1 of entries in z should be strictly less than half of the sum of magnitudes of all entries. 4 A real-valued function f(s) on the real axis is called convex, if its graph, between every pair of its points, is below the chord linking these points, or, equivalently, iff(x+(yx))f(x)+(f(y)f(x)) for everyx;y2 R and every2 0; 1. For example, maxima of ( nitely many) ane functions a s +b on the axis are convex. For i i more detailed treatment of convexity of functions, see Appendix C. 5 To get an LP, we should replace the Euclidean normkAwyk of the residual with, say, the uniform norm 2 kAwyk , which makes perfect sense when we start with coordinate-wise bounds on observation errors, which 1 indeed is the case in some applications.28 LECTURE 1. FROM LINEAR TO CONIC PROGRAMMING The necessity and suciency of the nullspace property for s-goodness of A can be derived \from scratch" from the fact that s-goodness means that every s-sparse signal x should be the unique optimal solution to the associated LP min fkwk : w 1 Aw = Axg combined with the LP optimality conditions. Another option, which we prefer to use here, is to guess the condition and then to prove that its indeed is necessary and sucient fors-goodness ofA. The necessity is evident: if the nullspace property does not take place, then there exists 0 =6 z2 KerA and s-element subset I of the index setf1;:::;ng such that if J is the complement of I inf1;:::;ng, then the vector z obtained from z by zeroing out all entries with indexes not in I along I with the vector z obtained from z by zeroing out all entries with indexes not in J J 1 1 satisfy the relationkzk  kzk = kzk +kz k , that is, I 1 1 I 1 J 1 2 2 kzk kz k : I 1 J 1 Since Az = 0, we have Az = Az , and we conclude that the s-sparse vector z I J I is not the unique optimal solution to the LP min fkwk :Aw =Azg, sincez is w 1 I J feasible solution to the program with the value of the objective at least as good as the one atz , on one hand, and the solutionz is di erent fromz (since otherwise J J I we should havez =z = 0, whencez = 0, which is not the case) on the other hand. I J To prove that the nullspace property is sucient for A to be s-good is equally easy: indeed, assume that this property does take place, and let x be s-sparse signal, so that the indexes of nonzero entries in x are contained in an s-element subset I of f1;:::;ng, and let us prove that if xb is an optimal solution to the LP (1.3.2), then xb =X. Indeed, denoting by J the complement of I setting z =xbx and assuming that z =6 0, we have Az = 0. Further, in the same notation as above we have b b kxk kxk kzk kzk kx kkx k I 1 I 1 I 1 J J J 1 (the rst and the third inequality are due to the Triangle inequality, and the second due to the nullspace property), whencekxk =kxk +kx k kxbk +kxb k =kxbk , 1 I 1 J 1 i 1 J 1 b which contradicts the origin of x. From nullspace property to error bounds for imperfect ` recovery 1 The nullspace property establishes necessary and sucient condition for the validity of ` re- 1 covery in the noiseless case, whatever be the s-sparse true signal. We are about to show that after appropriate quanti cation, this property implies meaningful error bounds in the case of imperfect recovery (presence of observation noise, near-, but not exact, s-sparsity of the true signal, approximate minimization in (1.3.3). The aforementioned \proper quanti cation" of the nullspace property is suggested by the LP n duality theory and is as follows. Let V be the set of all vectors v2 R with at most s nonzero s entries, equal1 each. Observing that the sumkzk of the s largest magnitudes of entries in s;1 T a vector z is nothing that maxv z, the nullspace property says that the optimal value in the v2V s LP program T (v) = maxfv z :Az = 0;kzk  1g (P ) 1 v z is 1=2 whenever v2V (why?). Applying the LP Duality Theorem, we get, after straightfor- s ward simpli cations of the dual, that T (v) = minkA hvk : 1 h1.3. SELECTED ENGINEERING APPLICATIONS OF LP 29 Denoting by h an optimal solution to the right hand side LP, let us set v := (A) = max (v); (A) = maxkhk : s s v 2 v2V v2V s s Observe that the maxima in question are well de ned reals, since V is a nite set, and that the s nullspace property is nothing but the relation (A) 1=2: (1.3.5) s Observe also that we have the following relation: n 8z2 R :kzk  (A)kAzk + (A)kzk : (1.3.6) s;1 s 2 s 1 n Indeed, for v2V and z2 R we have s T T T T T T T v z = vA h z + A h zkvA hk kzk +h Az v v v 1 1 v  (v)kzk +khkkAzk  (A)kzk + (A)kAzk : 1 v 2 2 s 1 s 2 T Sincekzk = max v z, the resulting inequality implies (1.3.6). s;1 v2V s Now consider imperfect ` recovery x7y7xb, where 1 n 1. x2 R can be approximated within some accuracy , measured in the ` norm, by an 1 s-sparse signal, or, which is the same, s kxxk  1 s wherex is the bests-sparse approximation ofx (to get this approximation, one zeros out all but the s largest in magnitude entries in x, the ties, if any, being resolved arbitrarily); 2. y is a noisy observation of x: y =Ax +;kk ; 2 b 3. x is a -suboptimal and -feasible solution to (1.3.3), speci cally, kxbk  + minfkwk :kAwyk g &kAxbyk : 1 1 2 2 w Theorem 1.3.1 Let A, s be given, and let the relation 8z :kzk  kAzk + kzk (1.3.7) s;1 2 1 holds true with some parameters 1=2 and 1 (as de nitely is the case when A iss-good, = (A) and = (A)). The for the outlined imperfect ` recovery the following error bound s s 1 holds true: 2 ( +) + + 2 kxbxk  ; (1.3.8) 1 1 2 i.e., the recovery error is of order of the maximum of the \imperfections" mentioned in 1) 3).30 LECTURE 1. FROM LINEAR TO CONIC PROGRAMMING Proof. Let I be the set of indexes of the s largest in magnitude entries in x, J be b the complement of I, and z = x x. Observing that x is feasible for (1.3.3), we have minfkwk :kAwyk gkxk , whence 1 2 1 w kxbk  +kxk ; 1 1 or, in the same notation as above, kxk kxbk kxb k kx k  I 1 I 1 J 1 J 1 z z kz k kz k2kx k I 1 J 1 J 1 whence kz k  +kzk + 2kx k ; J 1 I 1 J 1 so that kzk  + 2kzk + 2kx k : (a) 1 I 1 J 1 We further have kzk  kAzk + kzk ; 1 2 1 I which combines with (a) to imply that kzk  kAzk +  + 2kzk + 2kx k ; I 1 2 I 1 J 1 whence, in view of 1=2 and due tokx k =, J 1 1 kzk  kAzk +  + 2: I 1 2 1 2 Combining this bound with (a), we get 2 kzk  + 2 + kAzk +  + 2: 1 2 1 2 Recalling that z =xbx and that thereforekAzk kAxyk +kAxbyk  +, we nally 2 2 2 get 2 kxbxk  + 2 +  + +  + 2: 1 1 2 Compressed Sensing: Limits of performance The Compressed Sensing theory demonstrates that 1. For given m;n with m n (say, m=n 1=2), there exist mn sensing matrices which m 6 are s-good for the values of s \nearly as large as m," speci cally, for s O(1) ln(n=m) Moreover, there are natural families of matrices where this level of goodness \is a rule." E.g., when drawing anmn matrix at random from the Gaussian or the1 distributions (i.e., lling the matrix with independent realizations of a random variable which is either 6 From now on, O(1)'s denote positive absolute constants appropriately chosen numbers like 0.5, or 1, or perhaps 100,000. We could, in principle, replace all O(1)'s by speci c numbers; following the standard mathe- matical practice, we do not do it, partly from laziness, partly because the particular values of these numbers in our context are irrelevant.1.3. SELECTED ENGINEERING APPLICATIONS OF LP 31 p 7 Gaussian (zero mean, variance 1=m), or takes values1= m with probabilities 0.5 , the result will bes-good, for the outlined value ofs, with probability approaching 1 as m and n grow. Moreover, for the indicated values ofs and randomly selected matricesA, one has p (A)O(1) s with probability approaching one when m;n grow. s 2. The above results can be considered as a good news. A bad news is, that we do not know how to check eciently, given an s and a sensing matrix A, that the matrix is s- good. Indeed, we know that a necessary and sucient condition for s-goodness of A is the nullspace property (1.3.5); this, however, does not help, since the quantity (A) is s s n dicult to compute: computing it by de nition requires solving 2 C LP programs (P ), v s v2V , which is an astronomic number already for moderaten unlesss is really small, like s 1 or 2. And no alternative ecient way to compute (A) is known. s As a matter of fact, not only we do not know how to checks-goodness eciently; there still is no ecient recipe allowing to build, given m, an m 2m matrix A which is provably p s-good for s larger than O(1) m a much smaller \level of goodness" then the one 8 (s =O(1)m) promised by theory for typical randomly generated matrices. The \common life" analogy of this pitiful situation would be as follows: you know that with probability at least 0.9, a brick in your wall is made of gold, and at the same time, you do not know 9 how to tell a golden brick from a usual one. Veri able sucient conditions for s-goodness As it was already mentioned, we do not know ecient ways to checks-goodness of a given sensing matrix in the case when s is not really small. The diculty here is the standard: to certify s- goodness, we should verify (1.3.5), and the most natural way to do it, based on computing (A), s is blocked: by de nition, (A) = maxfkzk :Az = 0;kzk  1g (1.3.9) s s;1 1 z that is, (A) is the maximum of a convex functionkzk over the convex setfz :Az = 0;kzk 1g. s s;1  Although both the function and the set are simple, maximizing of convex function over a convex p 7 entries \of order of 1= m" make the Euclidean norms of columns inmn matrixA nearly one, which is the most convenient for Compressed Sensing normalization of A. 8 Note that the naive algorithm \generatem 2m matrices at random until ans-good, withs promised by the theory, matrix is generated" is not an ecient recipe, since we do not know how to check s-goodness eciently. 9 This phenomenon is met in many other situations. E.g., in 1938 Claude Shannon (1916-2001), \the father of Information Theory," made (in his M.Sc. Thesis) a fundamental discovery as follows. Consider a Boolean function of n Boolean variables (i.e., both the function and the variables take values 0 and 1 only); as it is easily n 2 seen there are 2 function of this type, and every one of them can be computed by a dedicated circuit comprised of \switches" implementing just 3 basic operations AND, OR and NOT (like computing a polynomial can be carried out on a circuit with nodes implementing just two basic operation: addition of reals and their multiplication). The discovery of Shannon was that every Boolean function of n variables can be computed on a circuit with no more 1 n thanCn 2 switches, whereC is an appropriate absolute constant. Moreover, Shannon proved that \nearly all" 1 n Boolean functions ofn variables require circuits with at leastcn 2 switches,c being another absolute constant; \nearly all" in this context means that the fraction of \easy to compute" functions (i.e., those computable by 1 n circuits with less thancn 2 switches) among all Boolean functions ofn variables goes to 0 asn goes to1. Now, computing Boolean functions by circuits comprised of switches was an important technical task already in 1938; its role in our today life can hardly be overestimated the outlined computation is nothing but what is going on in a computer. Given this observation, it is not surprising that the Shannon discovery of 1938 was the subject of countless re nements, extensions, modi cations, etc., etc. What is still missing, is a single individual example of a \dicult to compute" Boolean function: as a matter of fact, all multivariate Boolean functions f(x ;:::;x ) 1 n people managed to describe explicitly are computable by circuits with just linear in n number of switches32 LECTURE 1. FROM LINEAR TO CONIC PROGRAMMING set typically is dicult. The only notable exception here is the case of maximizing a convex 1 N functionf over a convex setX given as the convex hull of a nite set: X = Convfv ;:::;v g. In 1 N this case, a maximizer of f on the nite setfv ;:::;v g (this maximizer can be found by brute i force computation of the values of f at v ) is the maximizer of f over the entire X (check it yourself or see Section C.5). Given that the nullspace property \as it is" is dicult to check, we can look for \the second best thing" eciently computable upper and lower bounds on the \goodness" s (A) of A  (i.e., on the largest s for which A is s-good). Let us start with ecient lower bounding of s (A), that is, with eciently veri able su-  cient conditions for s-goodness. One way to derive such a condition is to specify an eciently b computable upper bound (A) on (A). With such a bound at our disposal, the eciently s s veri able condition b (A) 1=2 clearly will be a sucient condition for the validity of (1.3.5). s The question is, how to nd an eciently computable upper bound on (A), and here is s one of the options:   T (A) = max maxv z :Az = 0;kzk  1 s 1 z v2V s   mn T T )8H2 R : (A) = max maxv 1H Az :Az = 0;kzk  1 s 1 z v2V s   T T  max maxv 1H Az :kzk  1 1 z v2V s T = maxkIH Azk ; Z =fz :kzk  1g: s;1 1 z2Z mn We see that whatever be \design parameter" H2 R , the quantity (A) does not exceed s T the maximum of a convex functionkIH Azk of z over the unit ` -ball Z. But the latter s;1 1 set is perfectly well suited for maximizing convex functions: it is the convex hull of a small (just 2n points, basic orths) set. We end up with mn T T 8H2 R : (A) maxkIH Azk = maxkCol IH Ak ; s s;1 j s;1 z2Z 1jn where Col (B) denotes j-th column of a matrix B. We conclude that j T (A) b (A) := min maxkCol IH Ak (1.3.10) s s j s;1 H j z (H) The function (H) is eciently computable and convex, this is why its minimization can be carried out eciently. Thus, b (A) is an eciently computable upper bound on (A). s s Some instructive remarks are in order. b 1. The trick which led us to (A) is applicable to bounding from above the maximum of a s 1 N convex function f over the set X of the formfx2 Convfv ;:::;v g : Ax = 0g (i.e., over the intersection of an \easy for convex maximization" domain and a linear subspace. The mn trick is merely to note that if A is mn, then for every H2 R one has n o 1 N T i max f(x) :x2 Convfv ;:::;v g;Ax = 0  max f(IH Axv ) () x 1iN Indeed, a feasible solutionx to the left hand side optimization problem can be represented P P i T i as a convex combination v , and since Ax = 0, we have also x =  IH Av ; i i i i1.3. SELECTED ENGINEERING APPLICATIONS OF LP 33 T i sincef is convex, we have thereforef(x) maxf(IH Av ), and () follows. Since () i takes place for every H, we arrive at n o 1 N T i max f(x) :x2 Convfv ;:::;v g;Ax = 0  b := max f(IH Av ); x 1iN b and, same as above, is eciently computable, provided that f is eciently computable convex function. 2. The eciently computable upper bound b (A) is polyhedrally representable it is the s optimal value in an explicit LP program. To derive this problem, we start with important by itself polyhedral representation of the functionkzk : s;1 n Lemma 1.3.1 For every z2 R and integer sn, we have ( ) n X kzk = min st + w :jzjt +w; 1in;w 0 : (1.3.11) s;1 i i i w;t i=1 T T Proof. One way to get (1.3.11) is to note thatkzk = maxv z = max v z and s;1 v2V s v2Conv(V ) s n to verify that the convex hull of the set V is exactly the polytopeV =fv2 R :jvj s s i P 18i; jvjsg (or, which is the same, to verify that the vertices of the latter polytope i i are exactly the vectors from V ). With this veri cation at our disposal, we get s ( ) X T kzk = max v z :jvj 18i; jvjs ; s;1 i i v i applying LP Duality, we get the representation (1.3.11). A shortcoming of the outlined approach is that one indeed should prove that the extreme points ofV are exactly the s points from V ; this is a relatively easy exercise which we strongly recommend to do. We, s however, prefer to demonstrate (1.3.11) directly. Indeed, if (w;t) is feasible for (1.3.11), thenjzjw +t, whence the sum of thes largest magnitudes of entries inz does not exceed i i st plus the sum of the corresponding s entries in w, and thus since w is nonnegative P does not exceed st + w . Thus, the right hand side in (1.3.11) is the left hand side. i i On the other hand, letjz jjz j:::jz j are the s largest magnitudes of entries in i i i 1 2 s z (so that i ;:::;i are distinct from each other), and let t =jz j, w = maxjzjt; 0. It 1 s i i i s is immediately seen that (t;w) is feasible for the right hand side problem in (1.3.11) and P P s that st + w = jz j =kzk . Thus, the right hand side in (1.3.11) is the left i i s;1 i j=1 j hand side. 2 Lemma 1.3.1 straightforwardly leads to the following polyhedral representation of b (A): s T b (A) := min maxkCol IH Ak s j s;1 j H ( ) j j j T j w t  IH A w +t 8i;j ij i i = min  : P : j j j j j H;w ;t ; w  08j;st + w 8j i i