Document Similarity in Information Retrieval

Mausam
(Based on slides of W. Arms, Thomas Hofmann, Ata Kaban, Melanie Martin)

Standard Web Search Engine Architecture

[Diagram: a crawler fetches pages from the web; documents are stored, checked for duplicates, and links are extracted; DocIds feed an indexer that creates an inverted index; search engine servers answer user queries against the inverted index and show results to the user. Slide adapted from Marti Hearst / UC Berkeley.]

Indexing Subsystem

documents → assign document IDs → break into tokens → stop list → non-stoplist tokens → stemming* → stemmed terms → term weighting* → terms with weights → index database
(* indicates an optional operation; document numbers and field numbers are recorded alongside the tokens)

Search Subsystem

query → parse query → query tokens → stop list → non-stoplist tokens → stemming* → stemmed terms → Boolean operations* → retrieved document set → ranking* → ranked document set
(* indicates an optional operation; retrieval runs against the index database and yields the relevant document set)

Terms vs. Tokens

• Terms are what results after tokenization and linguistic processing.
  – Examples:
    • knowledge → knowledg (stemming)
    • The → the (case folding)
    • Removal of stop words

Matching/Ranking of Textual Documents

Major categories of methods:
1. Exact matching (Boolean)
2. Ranking by similarity to query (vector space model)
3. Ranking of matches by importance of documents (PageRank)
4. Combination methods

Major search engines use a combination of these (e.g., Google's ranking).

Vector Representation of Documents and Queries

Why do this?
• It represents a large space for documents.
• It lets us compare
  – documents with documents
  – documents with queries
• It lets us retrieve and rank documents with regard to a specific query.
• It enables measures of similarity. All search engines do this.

Boolean Queries

• A document is relevant to a query if the query itself is in the document.
  – The query "blue and red" brings back all documents with blue and red in them.
• A document is either relevant or not relevant to the query.
• What about relevance ranking, i.e., partial relevance? The vector model deals with this.

Similarity Measures and Relevance

• Retrieve the most similar documents to a query.
• Equate similarity to relevance:
  – the most similar documents are the most relevant.
• This measure is one of "text similarity": the matching of text or words.

Similarity Ranking Methods

[Diagram: the query and the documents in the index database feed a mechanism for determining the similarity of the query to each document; the output is the set of documents ranked by how similar they are to the query.]

Term Similarity: Example

Problem: given two text documents, how similar are they?
Methods that measure similarity do not assume exact matches.

Example (assume tokens have already been converted to terms). Here are three documents. How similar are they?

d1   ant ant bee
d2   dog bee dog hog dog ant dog
d3   cat gnu dog eel fox

Documents can be any length, from one word to thousands. A query is a special type of document.

Bag-of-Words View of a Document

Tokens are extracted from text and thrown into a "bag" without order, labeled by document.
• Thus the document "John is quicker than Mary" is indistinguishable from the document "Mary is quicker than John".

Term Similarity: Basic Concept

Two documents are similar if they contain some of the same terms. Possible measures of similarity might take into consideration:
(a) the lengths of the documents
(b) the number of terms in common
(c) whether the terms are common or unusual
(d) how many times each term appears
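To make the indexing pipeline and the bag-of-words view concrete, here is a minimal sketch in Python. It is illustrative only: the stop list is tiny, and the suffix-stripping stand-in is not a real stemmer (a real one, e.g. Porter's, is what maps knowledge to knowledg as in the slide example).

```python
from collections import Counter

STOP_WORDS = {"is", "the", "a", "an", "of", "than"}   # toy stop list

def stem(token):
    # Toy stemmer: strip one common suffix. A real system would use
    # e.g. the Porter stemmer instead of this stand-in.
    for suffix in ("es", "s", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def bag_of_words(text):
    """Tokenize, drop stop words, stem, and count term occurrences."""
    tokens = text.lower().split()
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)

# Word order is discarded, so these two documents become the same bag:
print(bag_of_words("John is quicker than Mary"))
print(bag_of_words("Mary is quicker than John"))
```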
Term Vector Space

A term vector space is an n-dimensional space, where n is the number of different terms/tokens used to index a set of documents.

Vector: document i, d_i, is represented by a vector. Its component in dimension j is w_ij, where:

  w_ij > 0  if term j occurs in document i
  w_ij = 0  otherwise

w_ij is the weight of term j in document i.

A Document Represented in a 3-Dimensional Term Vector Space

[Figure: a document d_1 plotted in a 3-dimensional term vector space, with components t_11, t_12, t_13 along the axes t_1, t_2, t_3.]

Basic Method: Incidence Matrix (Binary Weighting)

document   text                           terms
d1         ant ant bee                    ant bee
d2         dog bee dog hog dog ant dog    ant bee dog hog
d3         cat gnu dog eel fox            cat dog eel fox gnu

       ant  bee  cat  dog  eel  fox  gnu  hog
d1      1    1    0    0    0    0    0    0
d2      1    1    0    1    0    0    0    1
d3      0    0    1    1    1    1    1    0

These are 3 vectors in an 8-dimensional term vector space.
Weights: t_ij = 1 if document i contains term j, and zero otherwise.

Basic Vector Space Method: Similarity Between Two Documents

The similarity between two documents d_1 and d_2 is a function of the angle θ between their vectors in the term vector space.

Vector Space Revision

x = (x_1, x_2, x_3, ..., x_n) is a vector in an n-dimensional vector space.

The length of x is given by (an extension of Pythagoras's theorem):

  |x|^2 = x_1^2 + x_2^2 + x_3^2 + ... + x_n^2
  |x| = (x_1^2 + x_2^2 + x_3^2 + ... + x_n^2)^(1/2)

If x_1 and x_2 are vectors, their inner product (or dot product) is given by:

  x_1 · x_2 = x_11 x_21 + x_12 x_22 + x_13 x_23 + ... + x_1n x_2n

The cosine of the angle θ between the vectors x_1 and x_2 is:

  cos(θ) = (x_1 · x_2) / (|x_1| |x_2|)

Document Similarity

The same machinery applies to document vectors d = (x_1, x_2, x_3, ..., x_n). The cosine of the angle between two document vectors d_1 and d_2 determines their similarity:

  cos(θ) = (d_1 · d_2) / (|d_1| |d_2|)

cos(θ) = 1: the documents are exactly the same; cos(θ) = 0: totally different.

Example 1: No Weighting

       ant  bee  cat  dog  eel  fox  gnu  hog   length
d1      1    1    0    0    0    0    0    0     √2
d2      1    1    0    1    0    0    0    1     √4
d3      0    0    1    1    1    1    1    0     √5

E.g., length of d1 = (1^2 + 1^2)^(1/2) = √2.

Example 1 (continued)

Similarity of the documents in the example:

       d1     d2     d3
d1     1     0.71    0
d2    0.71    1     0.22
d3     0     0.22    1

Digression: Terminology

• WARNING: In a lot of IR literature, "frequency" is used to mean "count".
  – Thus "term frequency" in the IR literature means the number of occurrences of a term in a document,
  – not divided by document length (which would actually make it a frequency).
• We will conform to this misnomer: by "term frequency" we mean the number of occurrences of a term in a document.

Example 2: Weighting by Term Frequency (tf)

document   text                           terms
d1         ant ant bee                    ant bee
d2         dog bee dog hog dog ant dog    ant bee dog hog
d3         cat gnu dog eel fox            cat dog eel fox gnu

       ant  bee  cat  dog  eel  fox  gnu  hog   length
d1      2    1    0    0    0    0    0    0     √5
d2      1    1    0    4    0    0    0    1     √19
d3      0    0    1    1    1    1    1    0     √5

Weights: t_ij = frequency with which term j occurs in document i.

Example 2 (continued)

Similarity of the documents in the example:

       d1     d2     d3
d1     1     0.31    0
d2    0.31    1     0.41
d3     0     0.41    1

Similarity depends upon the weights given to the terms. Note the differences in results from Example 1.
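Both examples can be reproduced directly. A minimal sketch in plain Python (no IR library assumed; the helper names are our own):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(term, 0) for term, w in a.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(a) * norm(b))

texts = {"d1": "ant ant bee",
         "d2": "dog bee dog hog dog ant dog",
         "d3": "cat gnu dog eel fox"}

tf = {d: Counter(t.split()) for d, t in texts.items()}        # Example 2
binary = {d: {term: 1 for term in c} for d, c in tf.items()}  # Example 1

print(round(cosine(binary["d1"], binary["d2"]), 2))  # 0.71
print(round(cosine(binary["d2"], binary["d3"]), 2))  # 0.22
print(round(cosine(tf["d1"], tf["d2"]), 2))          # 0.31
print(round(cosine(tf["d2"], tf["d3"]), 2))          # 0.41
```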
Summary: Vector Similarity Computation with Weights

Documents in a collection are assigned terms from a set of n terms. The term vector space W is defined as:

  if term k does not occur in document d_i, w_ik = 0
  if term k occurs in document d_i, w_ik > 0

(w_ik is called the weight of term k in document d_i.)

Similarity between d_i and d_j is defined as:

  cos(d_i, d_j) = ( Σ_{k=1}^{n} w_ik w_jk ) / (|d_i| |d_j|)

where d_i and d_j are the corresponding weighted term vectors and |d_i| is the length of the document vector d_i.

The inner product (or dot product) between two documents is:

  d_1 · d_2 = w_11 w_21 + w_12 w_22 + w_13 w_23 + ... + w_1n w_2n

and between a document and a query:

  d_1 · q_1 = w_11 w_q11 + w_12 w_q12 + w_13 w_q13 + ... + w_1n w_q1n

where w_qij is the weight of the jth term of the ith query.

Simple Uses of Vector Similarity in Information Retrieval

Threshold: for query q, retrieve all documents with similarity above a threshold, e.g., similarity > 0.50.

Ranking: for query q, return the n most similar documents ranked in order of similarity. This is the standard practice.

Simple Example of Ranking (Weighting by Term Frequency)

Query q: ant dog

       ant  bee  cat  dog  eel  fox  gnu  hog   length
q       1    0    0    1    0    0    0    0     √2
d1      2    1    0    0    0    0    0    0     √5
d2      1    1    0    4    0    0    0    1     √19
d3      0    0    1    1    1    1    1    0     √5

Calculate Ranking

Similarity of the query to the documents in the example:

       d1       d2       d3
q    2/√10    5/√38    1/√10
     0.63     0.81     0.32

If the query q is searched against this document set, the ranked results are: d2, d1, d3.
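Continuing the previous sketch (this reuses its cosine() and tf), the query is treated as just another short document and the documents are sorted by similarity:

```python
from collections import Counter

# Reuses cosine() and the tf vectors from the earlier sketch.
query = Counter("ant dog".split())

for d in sorted(tf, key=lambda d: cosine(query, tf[d]), reverse=True):
    print(d, round(cosine(query, tf[d]), 2))
# d2 0.81   -> ranked results: d2, d1, d3
# d1 0.63
# d3 0.32
```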
Bigger Corpora

• Consider
  – n = 1M documents,
  – each with about 1K terms.
• Avg 6 bytes/term, incl. spaces/punctuation:
  – 6 GB of data.
• Say there are m = 500K distinct terms.

Can't Build the Matrix

• A 500K × 1M matrix has 500 billion 0's and 1's.
• But it has no more than 1 billion 1's.
  – The matrix is extremely sparse.
• What's a better representation?

Inverted Index

Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The Noble Brutus hath told you Caesar was ambitious

This yields a list of (term, doc) pairs in order of appearance: (I, 1), (did, 1), (enact, 1), (julius, 1), (caesar, 1), and so on through (was, 2), (ambitious, 2).

Later, the inverted file is sorted by term. Multiple entries for the same term in a single document are then merged, and frequency information is added:

Term       Doc  Freq
ambitious   2    1
be          2    1
brutus      1    1
brutus      2    1
capitol     1    1
caesar      1    1
caesar      2    2
did         1    1
enact       1    1
hath        2    1
I           1    2
i'          1    1
it          2    1
julius      1    1
killed      1    2
let         2    1
me          1    1
noble       2    1
so          2    1
the         1    1
the         2    1
told        2    1
was         1    1
was         2    1
with        2    1
you         2    1
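A minimal sketch of the parse, sort, and merge steps. The punctuation handling here is deliberately crude and for illustration only; a real indexer would run the tokenization pipeline from the indexing subsystem above.

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; "
       "Brutus killed me.",
    2: "So let it be with Caesar. The Noble Brutus hath told you "
       "Caesar was ambitious",
}

# term -> {doc_id: frequency}; multiple entries of a term within one
# document are merged into a single posting carrying a count.
index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for token in text.lower().replace(";", " ").replace(".", " ").split():
        index[token][doc_id] += 1

for term in sorted(index):        # sort the inverted file by term
    print(term, dict(index[term]))
# e.g.  brutus {1: 1, 2: 1}   caesar {1: 1, 2: 2}   killed {1: 2}
```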
Best Choice of Weights

Query q: ant dog, against the documents d1, d2, d3 from the earlier examples. What weights lead to the best information retrieval?

Weighting: Term Frequency (tf)

Suppose term j appears f_ij times in document i. What weighting should be given to term j?

Term frequency, the concept: a term that appears many times within a document is likely to be more important than a term that appears only once.

Term Frequency: Free-Text Documents

A simple method is to use f_ij as the term frequency. But in free-text documents, terms are likely to appear more often in long documents. Therefore f_ij should be scaled by some variable related to document length.

A standard method for free-text documents: scale f_ij relative to the frequency of other terms in the document. This partially corrects for variations in the length of the documents.

Let m_i = max_j(f_ij), i.e., m_i is the maximum frequency of any term in document i.

Term frequency (tf):

  tf_ij = f_ij / m_i    when f_ij > 0

Note: there is no special justification for taking this form of term frequency except that it works well in practice and is easy to calculate.

Weighting: Inverse Document Frequency (idf)

Inverse document frequency, the concept: a term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

Suppose there are n documents and that the number of documents in which term j occurs is n_j. A possible method might be to use n/n_j as the inverse document frequency. A standard method: the simple method over-emphasizes small differences, therefore use a logarithm.

Inverse document frequency (idf):

  idf_j = log2(n/n_j) + 1    when n_j > 0

Note: there is no special justification for taking this form of inverse document frequency except that it works well in practice and is easy to calculate.

Example of Inverse Document Frequency

Example: n = 1,000 documents; n_j = number of documents term j appears in.

term j    n_j     idf_j
A         100     4.32
B         500     2.00
C         900     1.13
D       1,000     1.00

idf_j modifies only the columns (terms), not the rows (documents). (From Salton and McGill.)

Full Weighting: A Standard Form of tf.idf

Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:

  (weight of term j in document i) = (term frequency) × (inverse document frequency)

A standard tf.idf weighting scheme, for free-text documents, is:

  t_ij = tf_ij × idf_j
       = (f_ij / m_i) × (log2(n/n_j) + 1)    when n_j > 0
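A minimal sketch of this weighting scheme; the function name and the whitespace tokenization are our own illustrative choices:

```python
import math
from collections import Counter

def tf_idf(doc_texts):
    """t_ij = (f_ij / m_i) * (log2(n / n_j) + 1), per the formulas above."""
    counts = [Counter(text.split()) for text in doc_texts]
    n = len(counts)
    n_j = Counter(term for c in counts for term in c)  # docs containing term j
    weights = []
    for c in counts:
        m_i = max(c.values())          # frequency of the most frequent term
        weights.append({term: (f / m_i) * (math.log2(n / n_j[term]) + 1)
                        for term, f in c.items()})
    return weights

for w in tf_idf(["ant ant bee",
                 "dog bee dog hog dog ant dog",
                 "cat gnu dog eel fox"]):
    print({term: round(v, 2) for term, v in w.items()})
```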
Discussion of Similarity

The cosine similarity measure is widely used and works well on a wide range of documents, but it has no theoretical basis.
1. There are several possible measures other than the angle between vectors.
2. There is a choice of possible definitions of tf and idf.
3. With fielded searching, there are various ways to adjust the weight given to each field.

Similarity Measures Compared

  Simple matching (coordination level match):  |Q ∩ D|
  Dice's coefficient:      2|Q ∩ D| / (|Q| + |D|)
  Jaccard's coefficient:   |Q ∩ D| / |Q ∪ D|
  Cosine coefficient (what we studied):  |Q ∩ D| / (|Q|^(1/2) · |D|^(1/2))
  Overlap coefficient:     |Q ∩ D| / min(|Q|, |D|)

Similarity Measures

• A similarity measure is a function which computes the degree of similarity between a pair of vectors or documents.
  – Since queries and documents are both vectors, a similarity measure can represent the similarity between two documents, two queries, or one document and one query.
• There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet).
• With a similarity measure between query and documents:
  – it is possible to rank the retrieved documents in the order of presumed importance;
  – it is possible to enforce a threshold so that the size of the retrieved set can be controlled;
  – the results can be used to reformulate the original query in relevance feedback (e.g., combining a document vector with the query vector).

Problems

• Synonyms: separate words that have the same meaning.
  – E.g. 'car' and 'automobile'.
  – They tend to reduce recall.
• Polysemes: words with multiple meanings.
  – E.g. 'Java'.
  – They tend to reduce precision.
• The problem is more general: there is a disconnect between topics and words.
  – '... a more appropriate model should consider some conceptual dimensions instead of words.' (Gardenfors)

Latent Semantic Analysis (LSA)

• LSA aims to discover something about the meaning behind the words; about the topics in the documents.
• What is the difference between topics and words?
  – Words are observable.
  – Topics are not. They are latent.
• How to find out topics from the words in an automatic way?
  – We can imagine them as a compression of words.
  – A combination of words.
  – Try to formalise this.

Latent Semantic Analysis

• Singular Value Decomposition (SVD):
  – A(m×n) = U(m×r) E(r×r) V(r×n)
  – Keep only the k largest singular values from E:
  – A(m×n) ≈ U(m×k) E(k×k) V(k×n)
• Dimension reduction: writing the decomposition as A = U S V^T and truncating it converts terms and documents to points in a k-dimensional space.

Latent Semantic Analysis

• LSA puts documents together even if they don't have common words, if
  – the docs share frequently co-occurring terms.
• Disadvantage:
  – The statistical foundation is missing. PLSA addresses this concern.

PLSA

• A latent variable model for general co-occurrence data.
• Associate each observation (w, d) with a class variable z ∈ Z = {z_1, ..., z_K}.
• Generative model:
  – Select a document d with probability P(d).
  – Pick a latent class z with probability P(z|d).
  – Generate a word w with probability P(w|z).

  d → z → w, with factors P(d), P(z|d), P(w|z)

• Summing out the latent class gives the joint probability model:

  P(d, w) = P(d) Σ_z P(z|d) P(w|z)

Model Fitting with EM

• We have the log-likelihood function from the aspect model, and we need to maximize it.
• Expectation Maximization (EM) is used for this purpose.

EM Steps

• E-step
  – Expectation step, where the expectation of the likelihood function is calculated with the current parameter values.
• M-step
  – Update the parameters with the calculated posterior probabilities.
  – Find the parameters that maximize the likelihood function.

E-Step

• Computes the probability that a word w occurring in a document d is explained by aspect z, i.e., the posterior P(z|d,w), from the current estimates of P(z|d) and P(w|z).

M-Step

• All the update equations use the P(z|d,w) calculated in the E-step.
• The procedure converges to a local maximum of the likelihood function.

The performance of a retrieval system based on this model (PLSI) was found superior to that of both vector-space-based similarity (cosine) and a non-probabilistic latent semantic indexing (LSI) method. (We skip the details here.) From Th. Hofmann, 2000.

Comparing PLSA and LSA

• LSA and PLSA both perform dimensionality reduction:
  – in LSA, by keeping only the K largest singular values;
  – in PLSA, by having K aspects.
• Comparison to SVD:
  – the U matrix is related to P(d|z) (documents to aspects);
  – the V matrix is related to P(w|z) (aspects to terms);
  – the E matrix is related to P(z) (aspect strengths).
• The main difference is the way the approximation is done:
  – PLSA generates a model (the aspect model) and maximizes its predictive power;
  – selecting the proper value of K is heuristic in LSA;
  – model selection in statistics can determine the optimal K in PLSA.
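To make the SVD truncation concrete, here is a minimal sketch using NumPy. The toy term-document matrix and the choice k = 2 are illustrative, not from the slides:

```python
import numpy as np

# Toy term-document matrix A (m terms x n documents), illustrative counts.
A = np.array([[2., 1., 0.],   # ant
              [1., 1., 0.],   # bee
              [0., 4., 1.],   # dog
              [0., 1., 0.],   # hog
              [0., 0., 1.]])  # cat

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U E V^T

k = 2                         # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation of A

# Terms and documents as points in the k-dimensional latent space:
term_coords = U[:, :k] * s[:k]                # one row per term
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T   # one row per document
print(np.round(doc_coords, 2))
```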