Introduction to Information Retrieval
Support vector machines and machine learning on documents

Text classification: Up until now and today
- Previously: three algorithms for text classification
  - Naive Bayes classifier
  - k Nearest Neighbor classification: simple, expensive at test time, high variance, non-linear
  - Vector space classification using centroids and the hyperplanes that split them: simple, linear discriminant classifier; perhaps too simple (or maybe not)
- Today
  - SVMs
  - Some empirical evaluation and comparison
  - Text-specific issues in classification

Linear classifiers: Which hyperplane? (Ch. 15)
- There are many possible separating lines ax + by − c = 0, i.e., many choices of a, b, c.
- Some methods find a separating hyperplane, but not the optimal one according to some criterion of expected goodness (e.g., the perceptron).
- The Support Vector Machine (SVM) finds an optimal solution: it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
- One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Another intuition (Sec. 15.1)
- If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model has been decreased.

Support Vector Machine (SVM) (Sec. 15.1)
- SVMs maximize the margin around the separating hyperplane; they are a.k.a. large margin classifiers.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Solving SVMs is a quadratic programming problem.
- Seen by many as the most successful current text classification method, although other discriminative methods often perform very similarly.

Maximum Margin: Formalization (Sec. 15.1)
- w: decision hyperplane normal vector
- x_i: data point i
- y_i: class of data point i (+1 or −1; note: not 1/0)
- Classifier: f(x_i) = sign(w^T x_i + b)
- Functional margin of x_i: y_i (w^T x_i + b)
  - But note that we can increase this margin simply by scaling w and b.
- The functional margin of the dataset is twice the minimum functional margin over all points; the factor of 2 comes from measuring the whole width of the margin.

Geometric Margin (Sec. 15.1)
- The distance from an example x to the separator is r = y (w^T x + b) / ||w||.
- The examples closest to the hyperplane are the support vectors.
- The margin ρ of the separator is the width of separation between the support vectors of the two classes.
- Derivation of r: the segment x′ − x is perpendicular to the decision boundary, so it is parallel to w. The unit vector in that direction is w/||w||, so the segment is r w/||w|| and x′ = x − y r w/||w||. Since x′ lies on the boundary, it satisfies w^T x′ + b = 0, so w^T (x − y r w/||w||) + b = 0. Recalling that ||w|| = sqrt(w^T w), this gives w^T x − y r ||w|| + b = 0, and solving for r yields r = y (w^T x + b) / ||w||.
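To make these margin definitions concrete, here is a minimal numerical sketch (not part of the original slides), using a made-up weight vector w, bias b, and two toy points, that computes the classifier output, the functional margin y_i (w^T x_i + b), and the geometric margin r = y_i (w^T x_i + b) / ||w||:

```python
import numpy as np

# Hypothetical hyperplane parameters and toy data points, chosen only for illustration.
w = np.array([2.0, 1.0])        # decision hyperplane normal vector
b = -1.0
X = np.array([[1.0, 1.0],       # data points x_i
              [-0.5, 0.0]])
y = np.array([1, -1])           # classes y_i in {+1, -1}

scores = X @ w + b                                  # w^T x_i + b
predictions = np.sign(scores)                       # f(x_i) = sign(w^T x_i + b)
functional_margins = y * scores                     # y_i (w^T x_i + b); grows if w, b are rescaled
geometric_margins = functional_margins / np.linalg.norm(w)   # r = y (w^T x + b) / ||w||

print(predictions, functional_margins, geometric_margins)
```

Note how rescaling w and b would inflate the functional margins but leave the geometric margins unchanged, which is why the geometric margin is the quantity the SVM actually maximizes.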
Linear SVM Mathematically: the linearly separable case (Sec. 15.1)
- Assume that all data points are at least distance 1 from the hyperplane. Then the following two constraints hold for a training set {(x_i, y_i)}:
  w^T x_i + b ≥ 1   if y_i = 1
  w^T x_i + b ≤ −1  if y_i = −1
- For the support vectors, the inequality becomes an equality.
- Since each example's distance from the hyperplane is r = y (w^T x + b) / ||w||, the margin is ρ = 2 / ||w||.

Linear Support Vector Machine (SVM) (Sec. 15.1)
- Hyperplane: w^T x + b = 0, with the planes w^T x_a + b = 1 and w^T x_b + b = −1 passing through support vectors x_a and x_b on either side.
- Extra scale constraint: min_{i=1,…,n} |w^T x_i + b| = 1
- This implies w^T (x_a − x_b) = 2, so ρ = ||x_a − x_b||_2 = 2 / ||w||_2.

Linear SVMs Mathematically (cont.) (Sec. 15.1)
- We can now formulate the quadratic optimization problem:
  Find w and b such that ρ = 2/||w|| is maximized, and for all (x_i, y_i): w^T x_i + b ≥ 1 if y_i = 1; w^T x_i + b ≤ −1 if y_i = −1.
- A better formulation (minimizing ||w|| is equivalent to maximizing 1/||w||):
  Find w and b such that Φ(w) = ½ w^T w is minimized, and for all (x_i, y_i): y_i (w^T x_i + b) ≥ 1.

Solving the Optimization Problem (Sec. 15.1)
- Find w and b such that Φ(w) = ½ w^T w is minimized, and for all (x_i, y_i): y_i (w^T x_i + b) ≥ 1.
- This is optimizing a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them, including many built specifically for SVMs.
- The solution involves constructing a dual problem in which a Lagrange multiplier α_i is associated with every constraint in the primal problem:
  Find α_1 … α_N such that Q(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized, subject to
  (1) Σ_i α_i y_i = 0
  (2) α_i ≥ 0 for all i

The Optimization Problem Solution (Sec. 15.1)
- The solution has the form:
  w = Σ_i α_i y_i x_i,  b = y_k − w^T x_k for any x_k such that α_k ≠ 0
- Each non-zero α_i indicates that the corresponding x_i is a support vector.
- The classifying function then has the form:
  f(x) = Σ_i α_i y_i x_i^T x + b
- Notice that it relies on an inner product between the test point x and the support vectors x_i; we will return to this later.
- Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all pairs of training points.

Soft Margin Classification (Sec. 15.2.1)
- If the training data is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples.
- Allow some errors: let some points be moved to where they belong, at a cost.
- Still try to minimize training-set errors and to place the hyperplane "far" from each class (large margin).

Soft Margin Classification Mathematically (Sec. 15.2.1)
- The old formulation:
  Find w and b such that Φ(w) = ½ w^T w is minimized, and for all (x_i, y_i): y_i (w^T x_i + b) ≥ 1.
- The new formulation incorporating slack variables:
  Find w and b such that Φ(w) = ½ w^T w + C Σ_i ξ_i is minimized, and for all (x_i, y_i): y_i (w^T x_i + b) ≥ 1 − ξ_i with ξ_i ≥ 0 for all i.
- The parameter C can be viewed as a way to control overfitting: it is a regularization term.
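As a practical illustration of the soft-margin trade-off just described, here is a minimal sketch (not from the original slides; it assumes scikit-learn is available and uses an invented four-document toy corpus). SVC with a linear kernel solves the objective ½ w^T w + C Σ ξ_i through its dual, so C plays exactly the regularization role discussed above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Invented toy corpus and labels, purely for illustration.
docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "quarterly report for review"]
labels = [1, -1, 1, -1]                      # +1 = spam, -1 = not spam (toy labels)

X = TfidfVectorizer().fit_transform(docs)    # documents as tf-idf vectors
clf = SVC(kernel="linear", C=1.0)            # larger C penalizes slack variables more heavily
clf.fit(X, labels)

print(clf.support_)                          # indices of the support vectors
print(clf.decision_function(X))              # signed scores w^T x + b
```

Shrinking C widens the margin at the cost of more training errors; growing C forces the fit closer to the hard-margin case.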
Soft Margin Classification – Solution (Sec. 15.2.1)
- The dual problem for soft margin classification:
  Find α_1 … α_N such that Q(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized, subject to
  (1) Σ_i α_i y_i = 0
  (2) 0 ≤ α_i ≤ C for all i
- Neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem.
- Again, the x_i with non-zero α_i will be the support vectors.
- The solution to the dual problem is:
  w = Σ_i α_i y_i x_i
  b = y_k (1 − ξ_k) − w^T x_k, where k = argmax_{k′} α_{k′}
  f(x) = Σ_i α_i y_i x_i^T x + b
  (w is not needed explicitly for classification.)

Classification with SVMs (Sec. 15.1)
- Given a new point x, we can score its projection onto the hyperplane normal, i.e., compute the score w^T x + b = Σ_i α_i y_i x_i^T x + b.
- Decide the class based on whether the score is above or below 0.
- We can also set a confidence threshold t: score > t means yes, score < −t means no, and otherwise we answer "don't know".

Linear SVMs: Summary (Sec. 15.2.1)
- The classifier is a separating hyperplane.
- The most "important" training points are the support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e., have non-zero Lagrange multipliers α_i.
- Both in the dual formulation of the problem and in the solution, the training points appear only inside inner products:
  Find α_1 … α_N such that Q(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized, subject to (1) Σ_i α_i y_i = 0 and (2) 0 ≤ α_i ≤ C for all i
  f(x) = Σ_i α_i y_i x_i^T x + b

Non-linear SVMs (Sec. 15.2.3)
- Datasets that are linearly separable (with some noise) work out great.
- But what are we going to do if the dataset is just too hard to separate linearly?
- How about mapping the data to a higher-dimensional space, e.g., from x to (x, x²)?

Non-linear SVMs: Feature spaces (Sec. 15.2.3)
- General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x).

The "Kernel Trick" (Sec. 15.2.3)
- The linear classifier relies on an inner product between vectors, K(x_i, x_j) = x_i^T x_j.
- If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j).
- A kernel function is a function that corresponds to an inner product in some expanded feature space.
- Example: for 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_i^T x_j)².
  We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j) for some φ:
  K(x_i, x_j) = (1 + x_i^T x_j)²
              = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
              = [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]^T [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]
              = φ(x_i)^T φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2].
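As a quick numerical check of the worked kernel example above (a sketch, not part of the original slides; the two sample points are arbitrary), the following code confirms that the quadratic kernel (1 + x_i^T x_j)² equals the explicit inner product φ(x_i)^T φ(x_j):

```python
import numpy as np

def K(xi, xj):
    # Quadratic kernel from the example: (1 + xi . xj)^2
    return (1.0 + xi @ xj) ** 2

def phi(x):
    # Explicit feature map phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

xi = np.array([0.5, -1.0])   # arbitrary 2-dimensional example points
xj = np.array([2.0, 0.25])

print(K(xi, xj), phi(xi) @ phi(xj))   # both print 3.0625: kernel = inner product in feature space
```

The point of the trick is that evaluating K(x_i, x_j) never requires constructing φ(x) explicitly, which is what makes very high-dimensional (or implicit) feature spaces affordable.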