Cosine similarity in Information retrieval

similarity measures in information retrieval system and distributional similarity for information retrieval
ImogenCameron Profile Pic
ImogenCameron,France,Teacher
Published Date:14-07-2017
Your Website URL(Optional)
Comment
Document Similarity in Information Retrieval Mausam (Based on slides of W. Arms, Thomas Hofmann, Ata Kaban, Melanie Martin) Standard Web Search Engine Architecture store documents, check for duplicates, extract links crawl the web DocIds create an inverted user index query Search inverted show results engine index To user servers Slide adapted from Marti Hearst / UC Berkeley Indexing Subsystem documents Documents assign document IDs text document break into tokens numbers and field tokens stop list numbers non-stoplist stemming tokens Indicates stemmed term weighting optional terms operation. terms with Index weights database Search Subsystem query parse query query tokens ranked stop list non-stoplist document set tokens ranking stemming stemmed terms Boolean retrieved operations Indicates document set optional Index operation. database relevant document set Terms vs tokens • Terms are what results after tokenization and linguistic processing. – Examples • knowledge - knowledg • The - the • Removal of stop words Matching/Ranking of Textual Documents Major Categories of Methods 1. Exact matching (Boolean) 2. Ranking by similarity to query (vector space model) 3. Ranking of matches by importance of documents (PageRank) 4. Combination methods What happens in major search engines (Googlerank) Vector representation of documents and queries Why do this? • Represents a large space for documents • Compare – Documents – Documents with queries • Retrieve and rank documents with regards to a specific query - Enables methods of similarity All search engines do this. Boolean queries • Document is relevant to a query of the query itself is in the document. – Query blue and red brings back all documents with blue and red in them • Document is either relevant or not relevant to the query. • What about relevance ranking – partial relevance. Vector model deals with this. Similarity Measures and Relevance • Retrieve the most similar documents to a query • Equate similarity to relevance – Most similar are the most relevant • This measure is one of “text similarity” – The matching of text or words Similarity Ranking Methods Index Documents Query database Mechanism for determining the similarity of the query to the document. Set of documents ranked by how similar they are to the query Term Similarity: Example Problem: Given two text documents, how similar are they? Methods that measure similarity do not assume exact matches. Example (assume tokens converted to terms) Here are three documents. How similar are they? d ant ant bee 1 d dog bee dog hog dog ant dog 2 d cat gnu dog eel fox 3 Documents can be any length from one word to thousands. A query is a special type of document. Bag of words view of a doc Tokens are extracted from text and thrown into a “bag” without order and labeled by document. is Mary John • Thus the doc than – John is quicker than Mary. quicker is indistinguishable from the doc – Mary is quicker than John. Term Similarity: Basic Concept Two documents are similar if they contain some of the same terms. Possible measures of similarity might take into consideration: (a) The lengths of the documents (b) The number of terms in common (c) Whether the terms are common or unusual (d) How many times each term appears TERM VECTOR SPACE Term vector space n-dimensional space, where n is the number of different terms/tokens used to index a set of documents. Vector Document i, d , represented by a vector. Its magnitude in i dimension j is w , where: ij w 0 if term j occurs in document i ij w = 0 otherwise ij w is the weight of term j in document i. ijA Document Represented in a 3-Dimensional Term Vector Space t 3 d 1 t 13 t 2 t 12 t 11 t 1 Basic Method: Incidence Matrix (Binary Weighting) document text terms d ant ant bee ant bee 1 d dog bee dog hog dog ant dog ant bee dog hog 2 d cat gnu dog eel fox cat dog eel fox gnu 3 ant bee cat dog eel fox gnu hog 3 vectors in d 1 1 8-dimensional 1 term vector d 1 1 1 1 2 space d 1 1 1 1 1 3 Weights: t = 1 if document i contains term j and zero otherwise ij Basic Vector Space Methods: Similarity between 2 documents t 3 The similarity between two documents is a d 1 d 2 function of the angle between their vectors in t  the term vector space. 2 t 1 Vector Space Revision x = (x , x , x , ..., x ) is a vector in an n-dimensional vector space 1 2 3 n Length of x is given by (extension of Pythagoras's theorem) 2 2 2 2 2 x = x + x + x + ... + x 1 2 3 n 2 2 2 2 1/2 x = ( x + x + x + ... + x ) 1 2 3 n If x and x are vectors: 1 2 Inner product (or dot product) is given by x .x = x x + x x + x x + ... + x x 1 2 11 21 12 22 13 23 1n 2n Cosine of the angle between the vectors x and x 1 2: x .x 1 2 cos () = x x 1 2Document similarity d = (x , x , x , ..., x ) is a vector in an n-dimensional vector space 1 2 3 n Length of x is given by (extension of Pythagoras's theorem) 2 2 2 2 2 d = x + x + x + ... + x 1 2 3 n 2 2 2 2 1/2 d = ( x + x + x + ... + x ) 1 2 3 n If d and d are document vectors: 1 2 Inner product (or dot product) is given by d .d = x x + x x + x x + ... + x x 1 2 11 21 12 22 13 23 1n 2n Cosine angle between the docs d and d determines doc similarity 1 2 d .d 1 2 cos () = d d 1 2 cos () = 1; documents exactly the same; = 0, totally different Example 1 No Weighting ant bee cat dog eel fox gnu hog length d 1 1 2 1 d 1 1 1 1 4 2 d 1 1 1 1 1 5 3 2 2 1/2 Ex: length d = (1 +1 ) 1