



Text and Web Search

Text Databases and IR

- Text databases (document databases): large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
- Information retrieval (IR): a field that developed in parallel with database systems. Information is organized into a (large) number of documents, and the information retrieval problem is locating the relevant documents, based on user input such as keywords or example documents.

Information Retrieval

- Typical IR systems: online library catalogs, online document management systems.
- Information retrieval vs. database systems:
  - Some DB problems are not present in IR, e.g., updates, transaction management, complex objects.
  - Some IR problems are not addressed well in a DBMS, e.g., unstructured documents, approximate search by keywords and by relevance.

Basic Measures for Text Retrieval

[Figure omitted: Venn diagram of the Relevant and Retrieved subsets of All Documents.]

- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):

  precision = |Relevant ∩ Retrieved| / |Retrieved|

- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:

  recall = |Relevant ∩ Retrieved| / |Relevant|

Information Retrieval Techniques

- Index term (attribute) selection: stop lists, word stemming.
- Index-term weighting methods.
- Term-document frequency matrices.
- Information retrieval models: the Boolean model, the vector model, the probabilistic model.

Problem Motivation

- Given a database of documents, find the documents containing "data" and "retrieval".
- Applications: the Web, law and patent offices, digital libraries, information filtering.
- Types of queries:
  - Boolean ('data' AND 'retrieval' AND NOT ...)
  - additional features ('data' ADJACENT 'retrieval')
  - keyword queries ('data', 'retrieval')
- The core question: how to search a large collection of documents.

Full-Text Scanning

- For a single term, the naive algorithm slides the pattern (e.g., CAB) along the text (e.g., ABRACADABRA) one position at a time: O(N*M) for text length N and pattern length M.
- Knuth, Morris and Pratt ('77): build a small finite-state automaton from the pattern; visit every text letter once only, by carefully shifting the pattern more than one step at a time.
- Boyer and Moore ('77): preprocess the pattern; compare it right to left, and skip ahead on a mismatch.
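The slides stop at the idea level; the following is a minimal Python sketch (not from the original deck) of the Knuth-Morris-Pratt approach: precompute a failure function from the pattern, then scan the text so that every text letter is visited once only. The function name and example strings are illustrative.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all start offsets of `pattern` in `text`, in O(N + M) time."""
    if not pattern:
        return []
    # Failure function: fail[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]          # shift the pattern by more than one step
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan: each text character is examined once, never re-read.
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(kmp_search("ABRACADABRA", "CAB"))   # [] -- CAB does not occur
print(kmp_search("ABRACADABRA", "ABRA"))  # [0, 7]
```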
Text: Detailed Outline

- the problem
- full-text scanning
- inversion (inverted files)
- signature files
- clustering
- information filtering and LSI

Text: Inverted Files

[Figure omitted: an inverted file maps each term in the dictionary to a postings list of the documents that contain it.]

- Q: Where is the space overhead? A: mainly in the postings lists.
- How to organize the dictionary: B-trees, hashing, TRIEs, PATRICIA trees, ...
- Stemming (yes/no): keep only the root of each word, e.g., "inverted", "inversion" -> "invert".
- Insertions must also be supported.
- Postings lists follow a Zipf distribution; e.g., the rank-frequency plot of the Bible is roughly a straight line on log-log axes, with freq ≈ 1 / (rank * ln(1.78 V)) for vocabulary size V.
- Postings lists in practice:
  - Cutting+Pedersen: keep the first 4 postings in the B-tree leaves.
  - How to allocate space: Faloutsos+ '92, geometric progression.
  - Compression (Elias codes): Zobel+, down to ~2% overhead.
- Conclusion: inverted files need space overhead (2%-300%), but they are the fastest approach.
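To make the structure concrete, here is a toy inverted index in Python (illustrative, not from the deck): the dictionary is a plain hash map and the postings lists are sorted lists of document ids, whereas a real system would use a B-tree or trie dictionary and compressed postings, as discussed above.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each term to a sorted postings list of the doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def boolean_and(index: dict[str, list[str]], *terms: str) -> list[str]:
    """Answer a Boolean query ('data' AND 'retrieval') by intersecting postings."""
    result = None
    for term in terms:
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

docs = {
    "d1": "data retrieval systems",
    "d2": "database systems",
    "d3": "information retrieval and data mining",
}
index = build_inverted_index(docs)
print(index["data"])                            # ['d1', 'd3']
print(boolean_and(index, "data", "retrieval"))  # ['d1', 'd3']
```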
Vector Space Model and Clustering

- Keyword (free-text) queries, as opposed to Boolean queries.
- Each document is a vector (how is described below); each query is also a vector; search for 'similar' vectors.
- Main idea: each document is a vector of size d, where d is the number of distinct terms in the database (the vocabulary size). A document containing "...data..." has a nonzero entry at the coordinate for 'data' (and zeros at, say, 'zoo' and 'aaron').

Document Vectors

- Documents are represented as "bags of words", and as vectors when used computationally.
- A vector is like an array of floating-point numbers: it has direction and magnitude.
- Each vector holds a place for every term in the collection; therefore, most vectors are sparse.

One location for each word; rows are document ids; blank means 0 occurrences:

        nova  galaxy  heat  h'wood  film  role  diet  fur
  A      10      5      3
  B       5     10
  C                            10      8     7
  D                             9     10     5
  E                                                 10    10
  F                                                  9    10
  G       5             7                            9
  H       6     10      2       8
  I                             7      5             1     3

For example, "nova" occurs 10 times in text A, "galaxy" 5 times, and "heat" 3 times; "Hollywood" occurs 7 times in text I, "film" 5 times, "diet" once, and "fur" 3 times.

We Can Plot the Vectors

[Figure omitted: documents plotted against the axes "Star" and "Diet"; the documents about movie stars, about astronomy, and about mammal behavior form three distinct clusters.]

Vector Space Model and Clustering (continued)

- Then, group nearby vectors together: Q1, cluster search; Q2, cluster generation.
- Two significant contributions of the model: ranked output and relevance feedback.
- Cluster search: visit the (k) closest superclusters; continue recursively. (The running illustration: a cluster of CS technical reports vs. a cluster of MD technical reports.)
- Ranked output: easy; report documents in decreasing order of similarity to the query.
- Relevance feedback (a brilliant idea; Rocchio '73): the user marks relevant and irrelevant results. How is the query updated? By adding the 'good' vectors to the query vector and subtracting the 'bad' ones.

Cluster Generation

- Problem: given N points in V dimensions, group them (typically k-means or AGNES is used).

Assigning Weights to Terms

- Options: binary weights, raw term frequency, tf x idf.
- Recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.

Binary Weights

Only the presence (1) or absence (0) of a term is recorded in the vector:

  docs   t1  t2  t3
  D1      1   0   1
  D2      1   0   0
  D3      0   1   1
  D4      1   0   0
  D5      1   1   1
  D6      1   1   0
  D7      0   1   0
  D8      0   1   0
  D9      0   0   1
  D10     0   1   1
  D11     1   0   1

Raw Term Weights

The frequency of occurrence of the term in each document is recorded in the vector:

  docs   t1  t2  t3
  D1      2   0   3
  D2      1   0   0
  D3      0   4   7
  D4      3   0   0
  D5      1   6   3
  D6      3   5   0
  D7      0   8   0
  D8      0  10   0
  D9      0   0   1
  D10     0   3   5
  D11     4   0   1

tf x idf

- The tf x idf measure combines term frequency (tf) with inverse document frequency (idf), a way to deal with the problems of the Zipf distribution.
- Goal: assign a tf x idf weight to each term in each document:

  w_ik = tf_ik * log(N / n_k)

  where
  - T_k: term k
  - tf_ik: frequency of term T_k in document D_i
  - idf_k = log(N / n_k): inverse document frequency of term T_k in collection C
  - N: total number of documents in the collection C
  - n_k: the number of documents in C that contain T_k

Inverse Document Frequency

IDF provides high values for rare words and low values for common words. For a collection of 10000 documents:

  log(10000 / 10000) = 0
  log(10000 / 5000)  = 0.301
  log(10000 / 20)    = 2.698
  log(10000 / 1)     = 4

Similarity Measures for Document Vectors

  Simple matching (coordination level match):  |Q ∩ D|
  Dice's coefficient:                          2 |Q ∩ D| / (|Q| + |D|)
  Jaccard's coefficient:                       |Q ∩ D| / |Q ∪ D|
  Cosine coefficient:                          |Q ∩ D| / (|Q|^1/2 * |D|^1/2)
  Overlap coefficient:                         |Q ∩ D| / min(|Q|, |D|)

tf x idf Normalization

- Normalize the term weights, so longer documents are not unfairly given more weight.
- To normalize usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive:

  w_ik = [ tf_ik * log(N / n_k) ] / sqrt( Σ_{k=1..t} (tf_ik)^2 * [log(N / n_k)]^2 )

Vector Space Similarity

Using the (normalized) weights to compare documents, the similarity of two documents is

  sim(D_i, D_j) = Σ_{k=1..t} w_ik * w_jk = (v_i · v_j) / (|v_i| |v_j|)

This is also called the cosine, or normalized inner product.

Computing Similarity Scores

With term weights D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit) and Q = (q_1, w_q1; ...; q_t, w_qt), cosine matching is

  sim(Q, D_i) = Σ_{j=1..t} w_qj * w_dij / sqrt( Σ_j (w_qj)^2 * Σ_j (w_dij)^2 )

Worked example with D1 = (0.8, 0.3), D2 = (0.2, 0.7), Q = (0.4, 0.8):

  sim(Q, D2) = (0.4*0.2 + 0.8*0.7) / sqrt[ (0.4^2 + 0.8^2) * (0.2^2 + 0.7^2) ]
             = 0.64 / sqrt(0.42) ≈ 0.98

  sim(Q, D1) = (0.4*0.8 + 0.8*0.3) / sqrt[ (0.4^2 + 0.8^2) * (0.8^2 + 0.3^2) ]
             = 0.56 / sqrt(0.58) ≈ 0.74

So the query Q is closer (in angle) to D2 than to D1.
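Here is a small Python sketch of the two pieces just described: tf x idf weighting, w_ik = tf_ik * log10(N / n_k), and cosine matching over sparse vectors. All names and the three example documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse vector per document with w_ik = tf_ik * log10(N / n_k)."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # n_k: the number of documents containing term k.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log10(N / df[t]) for t in tf})
    return vectors

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Normalized inner product of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["data retrieval systems",
        "database systems",
        "information retrieval and data mining"]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[2]), 3))  # ~0.24 -- cosine of documents 0 and 2
```

Note that a term occurring in every document gets idf = 0 and drops out, exactly the behavior the Zipf discussion above asks for.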
Text: Detailed Outline (continued)

- text databases: the problem
- full-text scanning
- inversion
- signature files (a.k.a. Bloom filters)
- vector model and clustering
- information filtering and LSI

Information Filtering + LSI

- Foltz+ '92. Goal: users specify their interests (= keywords); the system alerts them to suitable news documents.
- Major contribution: LSI = Latent Semantic Indexing, i.e., latent ('hidden') concepts.
- Main idea: map each document into some 'concepts'; map each term into some 'concepts'.
- A 'concept' is a set of terms with weights. E.g., the 'DBMS' concept: "data" (0.8), "system" (0.5), "retrieval" (0.6).

Pictorially: the term-document matrix (BEFORE)

         'data'  'system'  'retrieval'  'lung'  'ear'
  TR1       1        1          1
  TR2       1        1          1
  TR3                                      1       1
  TR4                                      1       1

...becomes a concept-document matrix...

         'DBMS concept'  'medical concept'
  TR1          1
  TR2          1
  TR3                           1
  TR4                           1

...and a concept-term matrix:

             'DBMS concept'  'medical concept'
  data             1
  system           1
  retrieval        1
  lung                               1
  ear                                1

- Q: How to search, e.g., for 'system'? A: find the corresponding concept(s) in the concept-term matrix, then the corresponding documents in the concept-document matrix.
- Thus LSI works like an (automatically constructed) thesaurus: we may retrieve documents that DON'T contain the term 'system', but contain almost everything else ('data', 'retrieval').

SVD

LSI finds the 'concepts' with the Singular Value Decomposition.

Definition:

  A = U L V^T,   with A: n x m, U: n x r, L: r x r, V: m x r

- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- L: r x r diagonal matrix: the 'strength' of each concept (r = rank of the matrix)
- V: m x r matrix (m terms, r concepts)

Example (n = 7 documents, the first four about CS, the last three about MD; m = 5 terms):

  A =
        data  information  retrieval  brain  lung
  CS     1        1            1        0      0
         2        2            2        0      0
         1        1            1        0      0
         5        5            5        0      0
  MD     0        0            0        2      2
         0        0            0        3      3
         0        0            0        1      1

  U (the document-to-concept similarity matrix) =
        CS-concept  MD-concept
          0.18          0
          0.36          0
          0.18          0
          0.90          0
          0             0.53
          0             0.80
          0             0.27

  L (diagonal: the 'strength' of each concept) =
          9.64    0
          0       5.29

  V^T (V is the term-to-concept similarity matrix) =
          0.58   0.58   0.58   0      0
          0      0      0      0.71   0.71
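The example above can be checked with a few lines of numpy (a sketch, not part of the original deck; numpy may flip the sign of a singular-vector pair, which does not change the results). The final query-folding step, q_concept = q V L^{-1}, is the standard LSI way to search; it is assumed here rather than taken from the slides.

```python
import numpy as np

# Rows = 7 documents (4 CS, 3 MD); columns = terms
# (data, information, retrieval, brain, lung).
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # keep only the first k concepts
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

print(np.round(s_k, 2))    # [9.64 5.29] -- the 'strength' of each concept
print(np.round(U_k, 2))    # document-to-concept similarities (up to sign)
print(np.round(Vt_k, 2))   # concept-to-term similarities (up to sign)

# Search for the term 'retrieval': fold the query into concept space,
# then score every document by inner product in that space.
q = np.array([0, 0, 1, 0, 0], dtype=float)
q_k = (q @ Vt_k.T) / s_k
scores = U_k @ (s_k * q_k)
print(np.round(scores, 2)) # CS documents score high, MD documents get 0
```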
SVD for LSI

- 'Documents', 'terms' and 'concepts':
  - U: the document-to-concept similarity matrix;
  - V: the term-to-concept similarity matrix;
  - L: its diagonal elements give the 'strength' of each concept.
- Do we need to keep all the eigenvectors? NO: just keep the first k (concepts).

Web Search

- What about Web search?
  - First you need to get all the documents of the Web: crawlers.
  - Then you have to index them (inverted files, etc.).
  - Find the web pages that are relevant to the query.
  - Report the pages, with their links, in sorted order.
- The main difference from classical IR: web pages have links. It may be possible to exploit the link structure for sorting the relevant documents.

Kleinberg's Algorithm (HITS)

- Main idea: in many cases, when you search the Web using some terms, the most relevant pages may not contain these terms (or contain them only a few times). Examples: Harvard (www.harvard.edu); search engines (yahoo, google, altavista).
- This motivates two kinds of pages: authorities and hubs.
- Problem definition: given the Web and a query, find the most 'authoritative' web pages for this query.
  - Step 0: find all pages containing the query terms (the root set).
  - Step 1: expand by one move forward and backward along links (the base set).
- On the resulting graph, give a high score (= 'authority') to nodes that many important nodes point to, and a high importance score ('hub') to nodes that point to good authorities.
- Observations: this is a recursive definition; each node i has both an authoritativeness score a_i and a hubness score h_i.
- Let E be the set of edges and A the adjacency matrix: entry (i,j) is 1 if the edge from i to j exists. Let h and a be n x 1 vectors of the 'hubness' and 'authoritativeness' scores. If, say, nodes k, l, m point to node i, then

  a_i = h_k + h_l + h_m,   i.e.,   a_i = Σ h_j over all j such that edge (j,i) exists,   or   a = A^T h.

  Symmetrically, for the 'hubness': if node i points to nodes n, p, q, then

  h_i = a_n + a_p + a_q,   i.e.,   h_i = Σ a_j over all j such that edge (i,j) exists,   or   h = A a.

- In conclusion, we want vectors h and a such that h = A a and a = A^T h. Start with a and h set to all 1s, then apply the following trick:

  h = A a = A (A^T h) = (A A^T) h = ... = (A A^T)^k h
  a = (A^T A)^k a

- In short, the solutions of h = A a and a = A^T h are the left and right eigenvectors of the adjacency matrix A (i.e., its singular vectors).
- Starting from random vectors and iterating, we eventually converge. (Q: to which of all the eigenvectors, and why?) A: to the ones of the strongest eigenvalue, because of the property

  (A^T A)^k v' ≈ (constant) v_1

  where v_1 is the eigenvector of the strongest eigenvalue.
- So we can find the a and h vectors, and the pages with the highest a values are reported.
- Example results for the query 'java':

  0.328  www.gamelan.com
  0.251  java.sun.com
  0.190  www.digitalfocus.com ("the java developer")

- Discussion:
  - The 'authority' score can be used to find pages 'similar' to a given page p.
  - Closely related to citation analysis and to social networks / 'small world' phenomena.

google/PageRank Algorithm

- A closely related idea: the Web is a directed graph of connected nodes. Imagine a particle randomly moving along the edges, with occasional random jumps, and compute its steady-state probabilities. That gives the PageRank of each page (the importance of the page).

PageRank Definition

- Assume a page A and pages T1, T2, ..., Tm that point to A. Let d be a damping factor, PR(A) the PageRank of A, and C(A) the out-degree of A. Then:

  PR(A) = (1 - d) + d * [ PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tm)/C(Tm) ]

- Computing the PR of each page is an identical problem to computing the steady-state probabilities p1, ..., pn of a Markov chain. [Figure omitted: example graph with five pages.]

Computing PageRank

- Iterative procedure. Equivalently: navigate the Web by randomly following links, or, with some probability, jump to a random page. With A the adjacency matrix (n x n) and c_i the out-degree of page i, the transition probabilities are

  A'_{i,j} = Prob(A_i -> A_j) = d * n^{-1} + (1 - d) * c_i^{-1} * A_{i,j}

- Let A' be the transition matrix (the adjacency matrix, row-normalized so that the sum of each row is 1, plus the random jumps). [Figure omitted: a five-node example graph and its transition matrix, with entries 1 and 1/2.] The steady state p satisfies

  A'^T p = p

- Thus p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is row-normalized).

Kleinberg/google Conclusions

SVD helps in graph analysis:

- Hub/authority scores: the strongest left and right eigenvectors of the adjacency matrix.
- Random walk on a graph: the steady-state probabilities are given by the strongest eigenvector of the transition matrix.
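Both conclusions can be demonstrated with one short power-iteration sketch (illustrative; the 4-node graph is hypothetical, and the PageRank code follows the damping convention of the PR(A) definition above, with d weighting the link-following part):

```python
import numpy as np

def hits(A: np.ndarray, iters: int = 50):
    """Hub/authority scores via power iteration: h = A a, a = A^T h."""
    h = np.ones(A.shape[0])
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return h, a

def pagerank(A: np.ndarray, d: float = 0.85, iters: int = 50):
    """Steady state of the random surfer: follow a link with prob. d, else jump."""
    n = A.shape[0]
    out = A.sum(axis=1)
    out[out == 0] = 1                         # avoid division by zero for sinks
    T = d * (A / out[:, None]) + (1 - d) / n  # T[i, j] = Prob(page i -> page j)
    p = np.ones(n) / n
    for _ in range(iters):
        p = T.T @ p                           # one step of the random walk
        p /= p.sum()
    return p

# A[i, j] = 1 iff page i links to page j (hypothetical 4-page web).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

h, a = hits(A)
print("hubs:       ", np.round(h, 2))
print("authorities:", np.round(a, 2))   # page 2 is the strongest authority
print("pagerank:   ", np.round(pagerank(A), 2))
```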
References

- Brin, S. and Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proc. 7th Intl. World Wide Web Conf.