Web search and text mining

javascript search highlight text web page and search text web page javascript
Dr.JakeFinlay Profile Pic
Published Date:22-07-2017
Your Website URL(Optional)
Text and Web Search www.ThesisScientist.comText Databases and IR  Text databases (document databases)  Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, and Web pages, library database, etc.  Information retrieval  A field developed in parallel with database systems  Information is organized into (a large number of) documents  Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents www.ThesisScientist.comInformation Retrieval  Typical IR systems  Online library catalogs  Online document management systems  Information retrieval vs. database systems  Some DB problems are not present in IR, e.g., update, transaction management, complex objects  Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance www.ThesisScientist.comBasic Measures for Text Retrieval Relevant & Relevant Retrieved Retrieved All Documents  Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses) RelevantRetrieved precision  Retrieved  Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved RelevantRetrieved recall  Relevant www.ThesisScientist.comInformation Retrieval Techniques  Index Terms (Attribute) Selection:  Stop list  Word stem  Index terms weighting methods  Terms  Documents Frequency Matrices  Information Retrieval Models:  Boolean Model  Vector Model  Probabilistic Model www.ThesisScientist.comProblem - Motivation  Given a database of documents, find documents containing “data”, “retrieval”  Applications:  Web  law + patent offices  digital libraries  information filtering www.ThesisScientist.comProblem - Motivation  Types of queries:  boolean („data‟ AND „retrieval‟ AND NOT ...)  additional features („data‟ ADJACENT „retrieval‟)  keyword queries („data‟, „retrieval‟)  How to search a large collection of documents? www.ThesisScientist.comFull-text scanning  for single term:  (naive: O(NM)) ABRACADABRA text CAB pattern www.ThesisScientist.comFull-text scanning  for single term:  (naive: O(NM))  Knuth, Morris and Pratt („77)  build a small FSA; visit every text letter once only, by carefully shifting more than one step ABRACADABRA text CAB pattern www.ThesisScientist.comFull-text scanning ABRACADABRA text CAB pattern CAB ... CAB CAB www.ThesisScientist.comFull-text scanning  for single term:  (naive: O(NM))  Knuth Morris and Pratt („77)  Boyer and Moore („77)  preprocess pattern; start from right to left & skip ABRACADABRA text CAB pattern www.ThesisScientist.comText - Detailed outline  text  problem  full text scanning  inversion  signature files  clustering  information filtering and LSI www.ThesisScientist.comText – Inverted Files www.ThesisScientist.comText – Inverted Files Q: space overhead? A: mainly, the postings lists www.ThesisScientist.comText – Inverted Files  how to organize dictionary?  stemming – Y/N?  Keep only the root of each word ex. inverted, inversion  invert  insertions? www.ThesisScientist.comText – Inverted Files  how to organize dictionary?  B-tree, hashing, TRIEs, PATRICIA trees, ...  stemming – Y/N?  insertions? www.ThesisScientist.comText – Inverted Files  postings list – more Zipf distr.: eg., rank-frequency plot of „Bible‟ log(freq) freq 1/rank / ln(1.78V) log(rank) www.ThesisScientist.comText – Inverted Files  postings lists  Cutting+Pedersen  (keep first 4 in B-tree leaves)  how to allocate space: Faloutsos+92  geometric progression  compression (Elias codes) Zobel+ – down to 2% overhead  Conclusions: needs space overhead (2%-300%), but it is the fastest www.ThesisScientist.comVector Space Model and Clustering  Keyword (free-text) queries (vs Boolean)  each document: - vector (HOW?)  each query: - vector  search for „similar‟ vectors www.ThesisScientist.comVector Space Model and Clustering  main idea: each document is a vector of size d: d is the number of different terms in the database document zoo aaron data „indexing‟ ...data... d (= vocabulary size) www.ThesisScientist.com