Algorithmic Aspects of Machine Learning

Ankur Moitra

Draft date: March 30, 2014

Contents

Preface
1 Introduction
2 Nonnegative Matrix Factorization
   2.1 Introduction
   2.2 Algebraic Algorithms
   2.3 Stability and Separability
   2.4 Topic Models
3 Tensor Methods
   3.1 Basics
   3.2 Perturbation Bounds
   3.3 Phylogenetic Trees and HMMs
   3.4 Community Detection
   3.5 Extensions to Mixed Models
   3.6 Independent Component Analysis
4 Sparse Recovery
   4.1 Basics
   4.2 Uniqueness and Uncertainty Principles
   4.3 Pursuit Algorithms
   4.4 Prony's Method
   4.5 Compressed Sensing
5 Dictionary Learning
   5.1 Background
   5.2 Full Rank Dictionaries
   5.3 Overcomplete Dictionaries
6 Gaussian Mixture Models
   6.1 History
   6.2 Clustering-Based Algorithms
   6.3 Discussion of Density Estimation
   6.4 Clustering-Free Algorithms
   6.5 A Univariate Algorithm
   6.6 A View from Algebraic Geometry
7 Matrix Completion
   7.1 Background
   7.2 Nuclear Norm
   7.3 Quantum Golfing
Bibliography

Preface

This monograph is based on the class "18.S996: Algorithmic Aspects of Machine Learning" taught at MIT in Fall 2013. Thanks to the scribes Adam Hesterberg, Adrian Vladu, Matt Coudron, Jan-Christian Hütter, Henry Yuen, Yufei Zhao, Hilary Finucane, Matthew Johnson, Kayhan Batmanghelich, Gautam Kamath, George Chen, Pratiksha Thaker, Mohammad Bavarian, Vlad Firoiu, Madalina Persu, Cameron Musco, Christopher Musco, Jing Lin, Timothy Chu, Yin-Tat Lee, Josh Alman, Nathan Pinsker and Adam Bouland.

Chapter 1

Introduction

This course will be organized around algorithmic issues that arise in machine learning. The usual paradigm for algorithm design is to give an algorithm that succeeds on all possible inputs, but the difficulty is that almost all of the optimization problems that arise in modern machine learning are computationally intractable. Nevertheless, practitioners use a wide variety of heuristics that are successful in practice.
However we often do not understand when and why these approaches work (an issue we would not have if our algorithms came with provable guarantees). The central questions in this course are:

Question 1 Which models in machine learning lead to tractable algorithmic problems?

Worst-case analysis is comfortable because if an algorithm works in this model, it certainly works in practice. But the optimization problems that machine learning systems "solve" every day are indeed hard in the worst case. However these lower bounds are not so frightening; many of the hard instances of machine learning problems are not ones we would want to solve in practice anyway. We will see a number of examples where choosing the right model will lead us to discover new algorithms with provable guarantees, where we really can understand when and why they work. In some cases, we will even be able to analyze approaches that practitioners already use and give new insights into their behavior.

Question 2 Can new models that better represent the instances we actually want to solve in practice be the inspiration for developing fundamentally new algorithms for machine learning problems? Can we understand when and why widely used heuristics work?

This course will focus on
(a) nonnegative matrix factorization
(b) topic modeling
(c) tensor decompositions
(d) sparse recovery
(e) dictionary learning
(f) learning mixture models
(g) matrix completion
Hopefully more sections will be added to this course over time, since there are a vast number of topics at the intersection of algorithms and machine learning left to explore.

Chapter 2

Nonnegative Matrix Factorization

In this chapter we will explore the nonnegative matrix factorization problem. We will first recap the motivations for this problem. Next, we give new algorithms that we apply to the classic problem of learning the parameters of a topic model.

2.1 Introduction

In order to understand why nonnegative matrix factorization is useful in applications, it will be helpful to compare it to the singular value decomposition. We will focus on applications of both of these to text analysis in this chapter.

Singular Value Decomposition

Given an $m \times n$ matrix $M$, its singular value decomposition is
$$M = U \Sigma V^T$$
where $U$ and $V$ are orthonormal and $\Sigma$ is diagonal and its entries are nonnegative. Alternatively we can write
$$M = \sum_{i=1}^{r} \sigma_i u_i v_i^T$$
where $u_i$ is the $i$th column of $U$, $v_i$ is the $i$th column of $V$ and $\sigma_i$ is the $i$th diagonal entry of $\Sigma$. Every matrix has a singular value decomposition. In fact, this representation can be quite useful in understanding the behavior of a linear operator or, in general, for extracting significant "features" from a large data matrix. We will focus our discussion of the singular value decomposition on the latter. One of the many useful properties of this decomposition is that we can immediately read off the best low rank approximation to $M$ from it.

Definition 2.1.1 The Frobenius norm of a matrix $M$ is $\|M\|_F = \sqrt{\sum_{i,j} M_{i,j}^2}$. Alternately, if $M = \sum_{i=1}^{r} \sigma_i u_i v_i^T$, then $\|M\|_F = \sqrt{\sum_i \sigma_i^2}$.

Consider the following optimization problem: Let $B$ be the best rank $k$ approximation to $M$ in the Frobenius norm, i.e. $B$ is the minimizer of $\|M - B\|_F$ over all matrices of rank at most $k$. Then we can without loss of generality choose $B$ to be the first $k$ terms of the singular value decomposition.
Theorem 2.1.2 (Eckart-Young) The best rank $k$ approximation to $M$ in Frobenius norm is attained by $B = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, and its error is $\|M - B\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$.

This is one of the reasons why the singular value decomposition is so widely useful: if we are given data in the form of a matrix $M$ but we believe that the data is approximately low-rank, a natural approach to making use of this structure is to instead work with the best rank $k$ approximation to $M$. This theorem is quite robust and holds even when we change how we measure how good $B$ is as an approximation to $M$:

Definition 2.1.3 The operator norm of a matrix $M$ is $\|M\|_2 = \max_{\|v\|=1} \|Mv\|_2$. Then if $M = \sum_{i=1}^{r} \sigma_i u_i v_i^T$, $\|M\|_2 = \sigma_1$ (the largest singular value).

The best approximation to $M$ in the operator norm is also attained by $B = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, in which case the error is $\|M - B\|_2 = \sigma_{k+1}$.

Let us give one more interpretation of the singular value decomposition. We can regard an $m \times n$ matrix $M$ as a collection of $n$ data points in $\mathbb{R}^m$. We associate with this set of points the distribution that chooses one of them uniformly at random. Further suppose that the expectation of this distribution is zero. Our data is in high dimension, and a natural question to ask is: how should we project our data onto a one dimensional subspace in a manner that preserves as much information as possible? One concrete goal is to find a direction $u$ so that projecting the data on $u$ maximizes the variance (among all one-dimensional projections). This question leads to another characterization of the singular vectors:
$$u_1 = \arg\max_{u} \frac{\|u^T M\|_2}{\|u\|_2}$$
and the maximum is $\sigma_1$. Similarly if we want to project onto a two-dimensional subspace so as to maximize the projected variance we should project on $\mathrm{span}(u_1, u_2)$. Relatedly
$$u_2 = \arg\max_{u \perp u_1} \frac{\|u^T M\|_2}{\|u\|_2}$$
and the maximum is $\sigma_2$. This is called the variational characterization of singular vectors. (Here we have assumed the singular values are distinct.)

There are efficient algorithms to compute the singular value decomposition. If $m \geq n$ then these algorithms run in time $O(mn^2)$. The basic idea is to reduce $M$ to bidiagonal form using Householder reflections, and then to compute the singular value decomposition from this representation using the QR algorithm. Next we will describe an application to text analysis.

Applications to Text Analysis

Latent Semantic Indexing [49]: Suppose we are given a large collection of documents, and we would like to extract some hidden structure in this collection (either with the goal of performing information retrieval, or clustering). The usual first step is to collect the data in a very large, very sparse matrix:

Definition 2.1.4 The term-by-document matrix $M$ is an $m \times n$ matrix where each row represents a word, each column represents a document, and the entry in row $i$, column $j$ is the number of times that word $i$ occurs in document $j$.

We have clearly lost some information, since this representation does not take into account the order of the words. However matrices are much easier to work with, and the underlying assumption is that it should still be possible to cluster the documents just knowing what words each one contains but not their order. This is often called the bag-of-words assumption.

The idea behind latent semantic indexing is to compute the singular value decomposition of $M$ and use this for information retrieval and clustering. More precisely, if we write
$$M \approx U^{(k)} \Sigma^{(k)} V^{(k)T}$$
where $U^{(k)}$ is the first $k$ columns of $U$, etc.,
then the columns of $U^{(k)}$ are the $k$ directions that maximize the projected variance of a random document. These vectors are interpreted as "topics". More precisely, suppose we want to compute a "similarity" score for document $i$ and document $j$. We could do this by computing
$$\langle M_i, M_j \rangle$$
where $M_i$ is the $i$th column of $M$, etc. This function "counts" the number of words in common. In particular, given a query we would judge how similar a document is to it just by counting how many of its words occur in each document. This is quite naive. Instead, we could compute
$$\langle M_i^T U^{(k)}, M_j^T U^{(k)} \rangle$$
Intuitively this maps each document to a vector of length $k$ that measures how much of each topic is present in the document, and computes the similarity of the documents by taking an inner-product in this low-dimensional space. In practice this is a much better way to measure similarity and was introduced by the seminal paper of Deerwester et al [49]. However it has its own weaknesses. This approach has some rather undesirable properties:

(a) "Topics" are orthonormal. Consider topics like "politics" and "finance". Are the sets of words that describe these topics uncorrelated? No.

(b) "Topics" contain negative values. This is more subtle, but negative words can be useful to signal that a document is not about a given topic. But then when we compute similarity, two documents are judged to be more similar based on a topic that they are both decidedly not about. This is another counterintuitive and undesirable property.

Nonnegative Matrix Factorization

The idea due to [73] and [98] is to write
$$M \approx AW$$
where $A$ and $W$ are $m \times k$ and $k \times n$ respectively and are required to be entry-wise nonnegative. In fact, let us suppose that the columns of $M$ each sum to one. It is not hard to see that if $D$ is a diagonal matrix where the $i$th entry is the reciprocal of the sum of the entries in the $i$th column of $A$, then $M = \widetilde{A}\widetilde{W}$ where $\widetilde{A} = AD$ and $\widetilde{W} = D^{-1}W$ normalizes the data so that the columns of $\widetilde{A}$ and of $\widetilde{W}$ each sum to one. Hence we are finding a set of topics (the columns of $\widetilde{A}$, which are each distributions on words) so that every document can be obtained as a convex combination of the topics that we have found.

This optimization problem plays a crucial role in many machine learning systems, such as image segmentation, text analysis, recommendation systems, etc. But this optimization problem is NP-hard [115]. So what should we do now? Give up? In contrast, singular value decomposition is a problem where theory and practice agree: it can be computed efficiently, and it has many uses. But in spite of this intractability result, nonnegative matrix factorization really is used in practice. The standard approach is to use alternating minimization:

Alternating Minimization: This problem is non-convex, but suppose we guess $A$. Then computing the nonnegative $W$ that minimizes $\|M - AW\|_F$ is convex and can be solved efficiently. The approach is to guess $A$, compute the best $W$, then fix $W$ and compute the best $A$, and so on. This process converges, but not necessarily to the optimal solution. It can and does get stuck in local minima in practice. We note that this approach is also called expectation-maximization [50], and is the standard approach not just for nonnegative matrix factorization, but for many other problems we will study in this course such as dictionary learning and learning mixture models.
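To make the heuristic concrete, here is a minimal sketch of alternating minimization (ours, not from the text); it uses SciPy's nonnegative least squares solver for each convex subproblem, and the function name, random initialization and iteration count are arbitrary choices.

```python
import numpy as np
from scipy.optimize import nnls

def alternating_minimization(M, k, iters=50, seed=0):
    """Heuristic NMF: alternately solve the two convex subproblems.

    Each half-step is an exact nonnegative least-squares solve, but the
    overall problem is non-convex, so this may converge to a local
    minimum rather than the best factorization.
    """
    m, n = M.shape
    rng = np.random.default_rng(seed)
    A = rng.random((m, k))                  # initial guess for A
    W = np.zeros((k, n))
    for _ in range(iters):
        # Best nonnegative W for the current A (column by column).
        for j in range(n):
            W[:, j], _ = nnls(A, M[:, j])
        # Best nonnegative A for the current W (row by row).
        for i in range(m):
            A[i, :], _ = nnls(W.T, M[i, :])
    return A, W

# Tiny usage example on a matrix that does admit an exact rank-2 NMF.
A_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W_true = np.array([[0.2, 0.7, 0.1], [0.5, 0.1, 0.9]])
M = A_true @ W_true
A, W = alternating_minimization(M, k=2)
print(np.linalg.norm(M - A @ W))            # typically small, but no guarantee
```

Each half-step can only decrease $\|M - AW\|_F$, which is why the process converges; nothing, however, prevents the limit from being a poor local minimum.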
Food for Thought

But maybe heuristics like this are identifying interesting instances of the problem. The goal of this course is to not give up when faced with intractability, and to look for new explanations. These explanations could be new models (that avoid the aspects of the problem that allow us to embed hard problems) or could be identifying conditions under which heuristics that are already used do work. This is a largely unexplored area.

In the next section, we will ask what happens if we restrict the number of topics. The instances generated by [115] have $k$ linear in $m$ and $n$, but when we look for a set of topics that explain 300,000 New York Times articles, we are looking for only a few hundred topics. So one way to reformulate the question is to ask what its complexity is as a function of $k$. We will essentially resolve this using algebraic techniques. Nevertheless if we want even better algorithms, we need more assumptions. We will see how a geometric interpretation of this problem implies that these hard instances are unstable, and we will examine a condition (separability) that enforces stability and allows us to give much better algorithms - ones that run in time polynomial in all of the parameters.

2.2 Algebraic Algorithms

In the previous section we introduced the nonnegative matrix factorization problem and described its applications to text analysis (it has many other applications). Vavasis proved that this problem is NP-hard in the worst case, but the instances he constructed have $k$ - the number of topics - linear in the size of the matrix [115]. In most practical applications, $k$ is much smaller than $m$ or $n$, and with this in mind we will instead ask: What is the complexity of this problem as a function of $k$? We will make use of tools from algebra to give a polynomial time algorithm for any $k = O(1)$. In fact, the algorithm we present here will be nearly optimal in terms of its dependence on $k$.

Definitions

Let us define the nonnegative matrix factorization problem formally, since we did so only informally in the previous section: Suppose we are given an entry-wise nonnegative matrix $M$ of size $m \times n$.

Definition 2.2.1 The nonnegative rank of $M$, denoted by $\mathrm{rank}^+(M)$, is the smallest $k$ such that there are nonnegative matrices $A$ and $W$ of size $m \times k$ and $k \times n$ respectively that satisfy $M = AW$.

Equivalently, $\mathrm{rank}^+(M)$ is the smallest $k$ such that there are $k$ nonnegative rank one matrices $\{M_i\}$ that satisfy $M = \sum_i M_i$. Both of these equivalent formulations of the problem will be useful throughout our discussion. To gain some familiarity with this parameter, it is helpful to compare it to a more familiar one: If we omit the requirement that $A$ and $W$ be entry-wise nonnegative, then the smallest $k$ is precisely the rank of $M$. Hence the following relation is immediate:

Fact 2.2.2 $\mathrm{rank}^+(M) \geq \mathrm{rank}(M)$

In fact the rank and the nonnegative rank of a matrix can be quite different:

Example. Let $M$ be the $n \times n$ matrix with $M_{ij} = (i - j)^2$. It is easy to see that the columns of $M$ are spanned by
$$\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, \quad \begin{bmatrix} 1 \\ 2 \\ \vdots \\ n \end{bmatrix}, \quad \begin{bmatrix} 1 \\ 4 \\ \vdots \\ n^2 \end{bmatrix}$$
It is easy to see that $\mathrm{rank}(M) = 3$. However, $M$ has zeros along the diagonal and non-zeros off it. Furthermore for any rank one nonnegative matrix $M_i$, its pattern of zeros and non-zeros is a combinatorial rectangle - i.e. the intersection of some set of rows and columns - and a standard argument implies that $\mathrm{rank}^+(M) = \Omega(\log n)$. There are examples with even larger separations too.
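A quick numerical sanity check of this example (ours, not part of the text):

```python
import numpy as np

n = 8
i = np.arange(1, n + 1)
M = (i[:, None] - i[None, :]) ** 2      # M_ij = (i - j)^2

# rank(M) = 3: every column is a combination of (1,...,1), (1,...,n), (1,...,n^2).
print(np.linalg.matrix_rank(M))         # 3

# Zeros exactly on the diagonal, positive entries everywhere else -- the
# zero pattern that forces the nonnegative rank to be Omega(log n).
off_diag = M[~np.eye(n, dtype=bool)]
print(np.all(np.diag(M) == 0), np.all(off_diag > 0))   # True True
```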
Next we will connect nonnegative matrix factorization to computational prob- lems involving systems of polynomial inequalities. Systems of Polynomial Inequalities + We can reformulate the problem of nding an A and W that prove rank (M)k as a problem of nding a feasible solution to a particular system of polynomial inequalities. More speci cally, the problem we want to solve is: 8 M =AW (2.1) A  0 : W  0 This system consists of quadratic equality constraints (one for each entry of M), and linear constraints that A and W be entry-wise nonnegative. Before trying to design better algorithms for k =O(1), we should ask a more basic question (whose answer is not at all obvious): Question 3 Is there any nite time algorithm? The diculty is that even if there is a solution, the entries ofA andW could be irrational. This is quite di erent than, say, 3-SAT where there is a simple brute-force algorithm. In contrast for nonnegative matrix factorization it is quite challenging to design algorithms that run in any nite amount of time. But indeed there are algorithms (that run in some xed amount of time) to decide whether a system of polynomial inequalities has a solution or not in the real RAM model. These algorithms can also compute an implicit representation of the solution, if there is12 CHAPTER 2. NONNEGATIVE MATRIX FACTORIZATION one. The output is a polynomial and an interval (for each variable) in which there is only one root, which is the value of the variable in the true solution. And you can nd as many bits of the solution as you would like by performing binary search for the root. The rst algorithm follows from the seminal work of Tarski, and there has been a long line of improvements based on successively more powerful algebraic decompositions. This line of work culminated in algorithms whose running time is exponential in the number of variables but is polynomial in all the other parameters of the problem (the number of polynomial inequalities, the maximum degree and O(r) the bit complexity of the coecients). The running time is (nD) where n is the number of polynomial inequalities,D is the maximum degree andr is the number of variables 106. This running time is essentially optimal under the exponential time hypothesis 78. In particular, if there is an algorithm for this problem that runs in o(r) time (pD) then it would yield sub-exponential time algorithms for 3-SAT. We can use these algorithms to solve nonnegative matrix factorization. How- ever the number of variables we would need in the naive representation is nk +mk, one for each entry inA orW . So even ifk =O(1), we would need a linear number of variables and the running time would be exponential. However we could hope that even though the naive representation uses many variables, perhaps there is a more clever representation that uses many fewer variables. Can we reduce the number of variables in the system of polynomial inequalities from O(nk +mk) to f(k)? If we could do this, then we could solve nonnegative matrix factorization in polynomial time for any k = O(1). Next, we will describe some basic tools in the rst-order theory of the reals. These results will help formalize our intuition from above that the number of variables is the right complexity measure when reasoning about how dicult it is to solve a system of polynomial inequalities, but their proof is out of scope of this course. 
First-Order Theory of the Reals

Definition 2.2.3 A set $S$ is semialgebraic if there exist multivariate polynomials $p_1, \ldots, p_n$ such that
$$S = \{x_1, \ldots, x_r \mid p_i(x_1, \ldots, x_r) \geq 0\}$$
or if $S$ is a finite union or intersection of such sets.

Definition 2.2.4 The projection of a semialgebraic set $S$ is defined as
$$\mathrm{proj}_S(X_1, \ldots, X_\ell) = \{x_1, \ldots, x_\ell \mid \exists\, x_{\ell+1}, \ldots, x_r \text{ such that } (x_1, \ldots, x_r) \in S\}$$

Theorem 2.2.5 (Tarski) The projection of a semialgebraic set is semialgebraic.

This is one of the foundational results in the field, and is often called quantifier elimination [110, 107]. To gain some familiarity with this notion, consider the case of algebraic sets (defined analogously as above, but with polynomial equality constraints instead of inequalities). Indeed, the above theorem implies that the projection of an algebraic set is itself semialgebraic. Is its projection also algebraic? No (e.g. think about the projection of a circle).

Earlier, we stated that there are algorithms to solve systems of polynomial inequalities (and find an implicit representation for the solution, if there is one) in time $(nD)^{O(r)}$ where $n$ is the number of polynomial inequalities, $D$ is the maximum degree and $r$ is the number of variables [106]. In fact, these algorithms work in a more general setting where there is additionally a boolean function $B$ that constrains the sign pattern of the polynomials. We are interested in deciding whether the set
$$S = \{x_1, \ldots, x_r \mid B(p_1(x_1, \ldots, x_r), \ldots, p_n(x_1, \ldots, x_r)) = \text{true}\}$$
is non-empty, and we assume that we can evaluate $B$ (but not, say, that it has a succinct circuit). A related result is the famous Milnor-Warren bound (see e.g. [7]):

Theorem 2.2.6 (Milnor-Warren) Given $n$ polynomials $p_1, \ldots, p_n$ of degree at most $D$ on $r$ variables $x = x_1, \ldots, x_r$, consider the sign pattern at $x$:
$$x \mapsto \big(\mathrm{sgn}(p_1(x)), \mathrm{sgn}(p_2(x)), \ldots, \mathrm{sgn}(p_n(x))\big)$$
Then as $x$ ranges over $\mathbb{R}^r$ the number of distinct sign patterns is at most $(nD)^r$.

A priori we could have expected as many as $3^n$ sign patterns. In fact, algorithms for solving systems of polynomial inequalities are based on cleverly enumerating the set of sign patterns so that the total running time is dominated by the maximum number of distinct sign patterns that there could be. In fact, the Milnor-Warren bound can be thought of as an analogue of the Sauer-Shelah lemma that is used throughout supervised learning, where the number of variables plays the role of the VC-dimension.

Next we will give a technique to reduce the number of variables.

Variable Reduction

It is clear that the set of points satisfying (2.1) is a semialgebraic set. However even for $k = 3$ this system has a linear (in $n$ and $m$) number of variables, so directly solving (2.1) would require exponential time.

Question 4 Can we find an alternate system of polynomial inequalities that expresses the same decision problem but uses many fewer variables?

We will focus on a special case called simplicial factorization where $\mathrm{rank}(M) = k$. In this case, we are asking whether or not $\mathrm{rank}^+(M) = \mathrm{rank}(M) = k$, and this simplifies matters because of the following observation:

Claim 2.2.7 In any solution, $A$ and $W$ must have full column and row rank respectively.

Proof: The span of the columns of $A$ must contain the columns of $M$, and similarly the span of the rows of $W$ must contain the rows of $M$. Since $\mathrm{rank}(M) = k$ and $A$ and $W$ have $k$ columns and rows respectively, we conclude that $A$ and $W$ must have full column and row rank respectively.
Moreover their span must be the column space and row space of $M$ respectively.

Hence we know that $A$ and $W$ have left and right pseudo-inverses $A^+$ and $W^+$ respectively. We will make use of these pseudo-inverses to reduce the number of variables in our system of polynomial inequalities: We have that $A^+A = I_k$ where $I_k$ is the $k \times k$ identity. Hence
$$A^+AW = W$$
and so we can recover the columns of $W$ from a linear transformation of the columns of $M$. This leads to the following alternative system of polynomial inequalities:
$$MW^+A^+M = M, \qquad MW^+ \geq 0, \qquad A^+M \geq 0 \qquad (2.2)$$
A priori, it is not clear that we have made progress since this system also has $nk + mk$ variables corresponding to the entries of $A^+$ and $W^+$. However consider the matrix $A^+M$. If we represent $A^+$ as a $k \times m$ matrix then we are describing its action on all vectors, but the crucial observation is that we only need to know how $A^+$ acts on the columns of $M$, which span a $k$-dimensional space. Hence we can apply a change of basis to rewrite $M$ as $M_R$, which is a $k \times n$ matrix, and there is a $k \times k$ linear transformation $T$ (obtained from $A^+$ and the change of basis) so that $TM_R = W$. A similar approach works for $W$, and hence we get a new system:
$$M_C S T M_R = M, \qquad M_C S \geq 0, \qquad T M_R \geq 0 \qquad (2.3)$$
The variables of this system are the entries in $S$ and $T$. So there are $2k^2$ variables. And the properties we need of this system are that

(a) If the simplicial factorization problem has a solution, then there is a solution to this system (completeness)

(b) If there is any solution to the system, then the simplicial factorization has a solution (soundness)

We have already proven the first property, and the second property follows because we can set $A = M_C S$ and $W = T M_R$ and this is a valid factorization with inner-dimension $k$. Hence if we apply Renegar's algorithm to this new system, the algorithm runs in time $(nm)^{O(k^2)}$ and solves the simplicial factorization problem.

The above approach is based on the paper of Arora et al [13], where the authors also give a variable reduction procedure for nonnegative matrix factorization (in the general case where $A$ and $W$ need not have full column or row rank respectively). The authors reduce the number of variables from $(nk + mk)$ to $f(k) = 2k^2 2^k$, and this yields a doubly-exponential time algorithm as a function of $k$. The crucial observation is that even if $A$ does not have full column rank, we could write a system of polynomial inequalities that has a pseudo-inverse for each set of its columns that is full rank (and similarly for $W$). However $A$ could have as many as $\binom{k}{k/2}$ maximal sets of linearly independent columns, and hence the resulting system of polynomial inequalities has $f(k)$ variables, but $f(k)$ is itself exponential in $k$.

In [94] the author further reduces the number of variables to $2k^2$ for nonnegative matrix factorization, and the main idea is that even though $A$ could have exponentially many maximal sets of linearly independent columns, their pseudo-inverses are algebraically dependent and can be expressed over a common set of $k^2$ variables using Cramer's rule. This yields a singly exponential time algorithm for nonnegative matrix factorization that runs in $(nm)^{O(k^2)}$ time, which is essentially optimal since any algorithm that runs in time $(nm)^{o(k)}$ would yield a sub-exponential time algorithm for 3-SAT [13].
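A small numerical illustration (ours, not from the text) of the observation driving this reduction: when $A$ has full column rank, $W$ is determined from $M$ by a single linear map, $W = A^+M$. The dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 6, 5, 3

# A nonnegative factorization M = AW with A of full column rank.
A = rng.random((m, k))
W = rng.random((k, n))
M = A @ W

# A^+ A = I_k, so W = A^+ M: the whole of W is determined by a k x m
# linear map applied to M -- the starting point for re-encoding the
# factorization with far fewer variables.
A_pinv = np.linalg.pinv(A)
print(np.allclose(A_pinv @ A, np.eye(k)))   # True
print(np.allclose(A_pinv @ M, W))           # True
```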
2.3 Stability and Separability

In the previous section we took an algebraic approach, and here instead we will work with an assumption called separability [54], which will allow us to give an algorithm that runs in polynomial time (even for large values of $r$). Our discussion will revolve around the intermediate simplex problem.

Intermediate Simplex Problem

Let us define the intermediate simplex problem: We are given two polytopes $Q$ and $P$ with $P \subseteq Q$, and furthermore $P$ is encoded by its vertices and $Q$ is encoded by its facets. Is there a simplex $K$ with $P \subseteq K \subseteq Q$?

We would like to connect this problem to nonnegative matrix factorization, since it will help us build up a geometric view of the problem. Consider the following problem: Given nonnegative matrices $M$ and $A$, does there exist $W \geq 0$ such that $M = AW$? The answer is "Yes" if and only if each column of $M$ is in the cone spanned by nonnegative combinations of the columns of $A$. Moreover if we normalize the columns of $M$ and $A$ so that they sum to one, then the answer is "Yes" if and only if the convex hull of the columns of $A$ contains the columns of $M$. Recall that in simplicial factorization we are given a nonnegative matrix $M$ with $\mathrm{rank}(M) = k$, and our goal is to decide whether or not $\mathrm{rank}^+(M) = k$. We will prove that the simplicial factorization problem and the intermediate simplex problem are equivalent [115]. Consider the following helper problem, which we call (P0):

Given $M = UV$, is there an invertible $k \times k$ matrix $T$ such that $UT^{-1}$ and $TV$ are nonnegative?

In fact, Vavasis [115] proved that (P0), intermediate simplex and the simplicial factorization problem are each polynomial time interreducible. It is easy to see that (P0) and the simplicial factorization problem are equivalent since in any two factorizations $M = UV$ or $M = AW$ (where the inner-dimension equals the rank of $M$), the column spaces of $M$, $U$ and $A$ are identical. Similarly the row spaces of $M$, $V$ and $W$ are also identical. The more interesting aspect of the proof is the equivalence between (P0) and the intermediate simplex problem. The translation is:

(a) rows of $U$ $\longleftrightarrow$ vertices of $P$

(b) rows of $T$ $\longleftrightarrow$ vertices of $K$

(c) columns of $V$ $\longleftrightarrow$ facets of $Q$

[Figure 2.1 (taken from [115]): The intermediate simplex problem has two solutions, which will be used to encode the truth assignment of a variable.]

Then the constraint that $UT^{-1}$ is nonnegative is (roughly) the constraint that $P \subseteq K$, and the constraint that $TV$ is nonnegative is (roughly) the constraint $K \subseteq Q$. There are some tedious normalization issues that arise since we need to be careful about the distinction between the convex hull of a set of vectors and the cone generated by all nonnegative combinations. However this equivalence gives us a geometric view that will be helpful.

Vavasis made use of the equivalences above to prove that nonnegative matrix factorization is NP-hard. Consider the gadget in Figure 2.1; the crucial property is that there are only two possible intermediate triangles, which can then be used to represent the truth assignment for a variable $x_i$. The description of the complete reduction, and the proof of its soundness, are involved (see [115]). The trouble is that gadgets like those in Figure 2.1 are unstable. We can change the number of solutions by small perturbations to the problem.
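To connect the geometric view to something executable (our sketch, not from the text): for a fixed $A$, deciding whether $M = AW$ for some $W \geq 0$ is one linear feasibility problem per column of $M$, for instance using SciPy's LP solver.

```python
import numpy as np
from scipy.optimize import linprog

def exists_nonnegative_W(M, A, tol=1e-8):
    """Check whether M = AW for some entry-wise nonnegative W.

    Column by column: M[:, j] must lie in the cone generated by the
    columns of A, which is a linear feasibility problem.
    """
    m, k = A.shape
    for j in range(M.shape[1]):
        # Feasibility LP: find w >= 0 with A w = M[:, j] (zero objective).
        res = linprog(c=np.zeros(k), A_eq=A, b_eq=M[:, j],
                      bounds=[(0, None)] * k, method="highs")
        if not res.success or np.linalg.norm(A @ res.x - M[:, j]) > tol:
            return False
    return True

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(exists_nonnegative_W(A @ np.array([[0.3, 0.6], [0.7, 0.2]]), A))  # True
print(exists_nonnegative_W(np.array([[0.0], [0.0], [1.0]]), A))         # False
```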
Motivated by issues of uniqueness and robustness, Donoho and Stodden [54] introduced a condition called separability that alleviates many of these problems, which we will discuss in the next subsection.

Separability

Definition 2.3.1 We call $A$ separable if, for every column of $A$, there exists a row of $A$ whose only non-zero entry is in that column.
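For instance (our illustrative example, not from the text), the following $5 \times 3$ matrix is separable: each of its first three rows has a single non-zero entry, one per column.
$$A = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.3 & 0 \\ 0 & 0 & 0.4 \\ 0.2 & 0.4 & 0.1 \\ 0.3 & 0.3 & 0.5 \end{bmatrix}$$
In the topic model interpretation, where the columns of $A$ are distributions on words, separability says that each topic contains at least one word that appears in that topic and in no other.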