A Vector Space Model for Automatic Indexing

Introduction to Information Retrieval: Vector Space Classification

Relevance feedback: Basic idea
- The user issues a (short, simple) query.
- The search engine returns a set of documents.
- The user marks some documents as relevant, some as nonrelevant.
- The search engine computes a new representation of the information need, which should be better than the initial query.
- The search engine runs the new query and returns new results.
- The new results have (hopefully) better recall.

Rocchio illustrated
(Figure not shown.)

Take-away today
- Feature selection for text classification: how to select a subset of the available dimensions
- Vector space classification: the basic idea of doing text classification for documents that are represented as vectors
- Rocchio classifier: the Rocchio relevance feedback idea applied to text classification
- k nearest neighbor (kNN) classification
- Linear classifiers
- More than two classes

Feature selection
- In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term.
- In this lecture: axis = dimension = word = term = feature
- Many dimensions correspond to rare words.
- Rare words can mislead the classifier.
- Rare misleading features are called noise features.
- Eliminating noise features from the representation increases the efficiency and effectiveness of text classification.
- Eliminating features is called feature selection.

Example for a noise feature
- Let's say we are doing text classification for the class China.
- Suppose a rare term, say ARACHNOCENTRIC, has no information about China . . .
- . . . but all instances of ARACHNOCENTRIC happen to occur in China documents in our training set.
- Then we may learn a classifier that incorrectly interprets ARACHNOCENTRIC as evidence for the class China.
- Such an incorrect generalization from an accidental property of the training set is called overfitting.
- Feature selection reduces overfitting and improves the accuracy of the classifier.

Basic feature selection algorithm
- For each term in the vocabulary, compute a feature utility A(t, c) and select the k terms with the highest utility (a code sketch follows the list of utility measures below).

Different feature selection methods
- A feature selection method is mainly defined by the feature utility measure it employs.
- Feature utility measures:
  - Frequency: select the most frequent terms
  - Mutual information: select the terms with the highest mutual information
  - Mutual information is also called information gain in this context.
  - Chi-square (see book)
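The generic selection loop behind all of these measures is the same: score every term in the vocabulary with the chosen utility measure and keep the k highest-scoring terms. Below is a minimal Python sketch of that loop, assuming documents are given as lists of tokens and using within-class document frequency as the default utility; the function name select_features and the toy documents are illustrative, not from the original slides.

```python
from collections import Counter

def select_features(docs_in_class, all_docs, k, utility=None):
    """Score every vocabulary term with a utility measure and keep the top k.

    docs_in_class / all_docs are lists of token lists.
    utility maps a term to a score; the default is within-class document frequency.
    """
    # Document frequency of each term inside the class (the 'frequency' measure).
    df_in_class = Counter()
    for doc in docs_in_class:
        df_in_class.update(set(doc))

    vocabulary = set(t for doc in all_docs for t in doc)
    if utility is None:
        utility = lambda t: df_in_class[t]

    # Keep the k terms with the largest utility values.
    return sorted(vocabulary, key=utility, reverse=True)[:k]

# Tiny hypothetical example: documents labeled China vs. everything else.
china_docs = [["china", "beijing", "trade"], ["china", "exports"]]
other_docs = [["london", "trade"], ["nairobi", "exports"]]
print(select_features(china_docs, china_docs + other_docs, k=3))
```

Mutual information or chi-square can be used instead by passing a different utility function; only the scoring step changes, not the selection loop.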
Mutual information
- Compute the feature utility A(t, c) as the expected mutual information (MI) of term t and class c.
- MI tells us "how much information" the term contains about the class and vice versa.
- For example, if a term's occurrence is independent of the class (the same proportion of documents inside and outside the class contain the term), then MI is 0.
- Definition:

  I(U; C) = sum over e_t in {0, 1} and e_c in {0, 1} of
            P(U = e_t, C = e_c) * log2 [ P(U = e_t, C = e_c) / (P(U = e_t) * P(C = e_c)) ]

  where U indicates whether a document contains term t and C indicates whether it is in class c.

How to compute MI values
- Based on maximum likelihood estimates, the formula we actually use is:

  I(U; C) = (N11/N) log2 (N*N11 / (N1.*N.1)) + (N01/N) log2 (N*N01 / (N0.*N.1))
          + (N10/N) log2 (N*N10 / (N1.*N.0)) + (N00/N) log2 (N*N00 / (N0.*N.0))

- N10: number of documents that contain t (e_t = 1) and are not in c (e_c = 0)
- N11: number of documents that contain t (e_t = 1) and are in c (e_c = 1)
- N01: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1)
- N00: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0)
- N = N00 + N01 + N10 + N11
- The marginals are N1. = N10 + N11, N0. = N00 + N01, N.1 = N01 + N11, N.0 = N00 + N10.

MI example for poultry/EXPORT in Reuters
(Figure not shown.)

MI feature selection on Reuters
(Figure not shown.)

Naive Bayes: Effect of feature selection
(Figure not shown; multinomial = multinomial Naive Bayes, binomial = Bernoulli Naive Bayes.)

Feature selection for Naive Bayes
- In general, feature selection is necessary for Naive Bayes to get decent performance.
- This is also true for most other learning methods in text classification: you need feature selection for optimal performance.

Exercise
(i) Compute the contingency table for "Kyoto"/JAPAN (analogous to the "export"/POULTRY table) for the collection given below.
(ii) Make up a contingency table for which MI is 0, that is, one in which term and class are independent of each other.
The "export"/POULTRY table: (table not shown.)

Outline
❶ Recap
❷ Feature selection
❸ Intro vector space classification
❹ Rocchio
❺ kNN
❻ Linear classifiers
❼ More than two classes

Recall vector space representation
- Each document is a vector, with one component for each term.
- Terms are axes.
- High dimensionality: 100,000s of dimensions
- Normalize vectors (documents) to unit length.
- How can we do classification in this space?

Vector space classification
- As before, the training set is a set of documents, each labeled with its class.
- In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
- Premise 1: Documents in the same class form a contiguous region.
- Premise 2: Documents from different classes don't overlap.
- We define lines, surfaces, and hypersurfaces to divide the regions.

Classes in the vector space
- Should the document ⋆ be assigned to China, UK or Kenya?
- Find separators between the classes.
- Based on these separators, ⋆ should be assigned to China.
- How do we find separators that do a good job at classifying new documents like ⋆? That is the main topic of today.

Aside: 2D/3D graphs can be misleading
Left: a projection of the 2D semicircle to 1D. For the points x1, x2, x3, x4, x5 at x coordinates -0.9, -0.2, 0, 0.2, 0.9, the distance x2x3 ≈ 0.201 differs by only 0.5% from the projected distance x'2x'3 = 0.2; but x1x3 / x'1x'3 = d_true / d_projected ≈ 1.06 / 0.9 ≈ 1.18 is an example of a large distortion (18%) when a large area is projected. Right: the corresponding projection of the 3D hemisphere to 2D.
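As a quick check of the arithmetic in the aside, the small Python sketch below (not part of the original slides) recomputes the true and projected distances for the five points on the unit semicircle:

```python
import math

def on_semicircle(x):
    """Point (x, y) on the unit semicircle y = sqrt(1 - x^2)."""
    return (x, math.sqrt(1.0 - x * x))

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

xs = [-0.9, -0.2, 0.0, 0.2, 0.9]      # x1 .. x5 from the caption
pts = [on_semicircle(x) for x in xs]

# Small region: x2-x3. True (2D) distance vs. its 1D projection.
true_small = dist(pts[1], pts[2])      # ~0.201
proj_small = abs(xs[1] - xs[2])        # 0.2
print(true_small / proj_small)         # ~1.005, i.e. ~0.5% distortion

# Large region: x1-x3.
true_large = dist(pts[0], pts[2])      # ~1.06
proj_large = abs(xs[0] - xs[2])        # 0.9
print(true_large / proj_large)         # ~1.18, i.e. ~18% distortion
```

Running it reproduces the ratios in the caption: the projection is nearly faithful for the small region but distorts the larger one by about 18%, which is why low-dimensional pictures of high-dimensional document spaces should be read with care.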