Introduction to Information Retrieval
Vector Space Classification

Overview
❶ Recap
❷ Feature selection
❸ Intro vector space classification
❹ Rocchio
❺ kNN
❻ Linear classifiers
❼ More than two classes

Relevance feedback: Basic idea
- The user issues a (short, simple) query.
- The search engine returns a set of documents.
- The user marks some documents as relevant and some as nonrelevant.
- The search engine computes a new representation of the information need – it should be better than the initial query.
- The search engine runs the new query and returns new results.
- The new results have (hopefully) better recall.

Rocchio illustrated
[Figure]

Take-away today
- Feature selection for text classification: how to select a subset of the available dimensions
- Vector space classification: the basic idea of doing text classification for documents that are represented as vectors
- Rocchio classifier: the Rocchio relevance feedback idea applied to text classification
- k nearest neighbor classification
- Linear classifiers
- More than two classes

Feature selection
- In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term.
- In this lecture: axis = dimension = word = term = feature
- Many dimensions correspond to rare words.
- Rare words can mislead the classifier.
- Rare misleading features are called noise features.
- Eliminating noise features from the representation increases the efficiency and effectiveness of text classification.
- Eliminating features is called feature selection.

Example of a noise feature
- Let's say we're doing text classification for the class China.
- Suppose a rare term, say ARACHNOCENTRIC, has no information about China . . .
- . . . but all instances of ARACHNOCENTRIC happen to occur in China documents in our training set.
- Then we may learn a classifier that incorrectly interprets ARACHNOCENTRIC as evidence for the class China.
- Such an incorrect generalization from an accidental property of the training set is called overfitting.
- Feature selection reduces overfitting and improves the accuracy of the classifier.

Basic feature selection algorithm
- For each class c, compute a utility measure A(t, c) for every term t in the vocabulary and keep the k terms with the highest utility.

Different feature selection methods
- A feature selection method is mainly defined by the feature utility measure it employs.
- Feature utility measures:
  - Frequency – select the most frequent terms
  - Mutual information – select the terms with the highest mutual information
    - Mutual information is also called information gain in this context.
  - Chi-square (see book)

Mutual information
- Compute the feature utility A(t, c) as the expected mutual information (MI) of term t and class c.
- MI tells us "how much information" the term contains about the class and vice versa.
- For example, if a term's occurrence is independent of the class (the same proportion of documents inside and outside the class contain the term), then MI is 0.
- Definition:
  I(U; C) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(U = e_t, C = e_c) · log2 [ P(U = e_t, C = e_c) / (P(U = e_t) · P(C = e_c)) ]

How to compute MI values
- Based on maximum likelihood estimates, the formula we actually use is:
  I(U; C) = (N11/N) log2 (N·N11)/(N1. N.1) + (N01/N) log2 (N·N01)/(N0. N.1) + (N10/N) log2 (N·N10)/(N1. N.0) + (N00/N) log2 (N·N00)/(N0. N.0)
- N11: number of documents that contain t (e_t = 1) and are in c (e_c = 1)
- N10: number of documents that contain t (e_t = 1) and are not in c (e_c = 0)
- N01: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1)
- N00: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0)
- N = N00 + N01 + N10 + N11; the marginals are N1. = N10 + N11, N0. = N00 + N01, N.1 = N01 + N11, N.0 = N00 + N10.
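The algorithm on the "Basic feature selection algorithm" slide and the MI computation above appeared only as figures. As a rough illustration (not from the slides; the function names and example counts below are invented), a minimal Python sketch of computing the MI utility from the four document counts and keeping the k highest-scoring terms for a class might look like this:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected mutual information of term t and class c, estimated from
    document counts (maximum likelihood), following the formula above."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # docs that do / do not contain t
    n_1, n_0 = n11 + n01, n10 + n00   # docs that are / are not in c
    mi = 0.0
    for joint, term_marginal, class_marginal in (
        (n11, n1_, n_1), (n01, n0_, n_1), (n10, n1_, n_0), (n00, n0_, n_0)
    ):
        if joint > 0:  # a zero count contributes nothing (x log x -> 0 as x -> 0)
            mi += (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))
    return mi

def select_features(counts_for_class, k):
    """Basic feature selection: keep the k terms with the highest utility.
    counts_for_class maps each term to its (n11, n10, n01, n00) counts."""
    scored = sorted(((mutual_information(*c), t) for t, c in counts_for_class.items()),
                    reverse=True)
    return [t for _, t in scored[:k]]

# Made-up counts for two terms with respect to one class:
counts = {"arachnocentric": (2, 0, 98, 900), "export": (30, 20, 70, 880)}
print(select_features(counts, k=1))
```

Terms with the largest utility values are kept as features for the class; all other terms are dropped from the document representation before training the classifier.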
MI example for poultry/EXPORT in Reuters
[Figure]

MI feature selection on Reuters
[Figure]

Naive Bayes: Effect of feature selection
(multinomial = multinomial Naive Bayes, binomial = Bernoulli Naive Bayes)
[Figure]

Feature selection for Naive Bayes
- In general, feature selection is necessary for Naive Bayes to get decent performance.
- This is also true for most other learning methods in text classification: you need feature selection for optimal performance.

Exercise
(i) Compute the "Kyoto"/JAPAN contingency table (analogous to the "export"/POULTRY table) for the collection given below.
(ii) Make up a contingency table for which MI is 0 – that is, term and class are independent of each other.
"export"/POULTRY table:
[Table]

Recall vector space representation
- Each document is a vector, one component for each term.
- Terms are axes.
- High dimensionality: 100,000s of dimensions
- Normalize vectors (documents) to unit length.
- How can we do classification in this space?
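As a small illustration of the last two bullets (not from the slides; the names and numbers are made up), length-normalizing document vectors places every document on the unit hypersphere, so the dot product of two normalized vectors is their cosine similarity, which the vector space classifiers discussed next (Rocchio, kNN) build on:

```python
import math

def unit_normalize(vec):
    """Scale a sparse document vector (term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)

def dot(u, v):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

d1 = unit_normalize({"china": 3, "export": 1, "poultry": 2})
d2 = unit_normalize({"china": 1, "beijing": 2})
print(dot(d1, d2))  # cosine similarity of the two documents
```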