Introduction to Information Retrieval: Evaluation & Result Summaries

Heuristics for finding the top k even faster
- Document-at-a-time processing
  - We complete the computation of the query-document similarity score of document d_i before starting to compute the query-document similarity score of d_{i+1}.
  - Requires a consistent ordering of documents in the postings lists.
- Term-at-a-time processing
  - We complete processing the postings list of query term t_i before starting to process the postings list of t_{i+1}.
  - Requires an accumulator for each document "still in the running".
- The most effective heuristics switch back and forth between term-at-a-time and document-at-a-time processing.

Use min heap for selecting top k out of N
- Use a binary min heap.
- A binary min heap is a binary tree in which each node's value is less than the values of its children.
- It takes O(N log k) operations to construct the k-heap containing the k largest values (where N is the number of documents).
- Essentially linear in N for small k and large N.

Binary min heap
(Figure: example of a binary min heap.)

Selecting k top scoring documents in O(N log k)
- Goal: keep the k top documents seen so far.
- Use a binary min heap.
- To process a new document d′ with score s′:
  - Get the current minimum h_m of the heap (in O(1)).
  - If s′ ≤ h_m, skip to the next document.
  - If s′ > h_m: heap-delete-root (in O(log k)), heap-add d′/s′ (in O(1)), reheapify (in O(log k)).
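The selection loop above maps directly onto Python's heapq module. Below is a minimal sketch, not taken from the slides; the function name top_k_documents and the (doc_id, score) input format are assumptions made for illustration.

```python
import heapq

def top_k_documents(scored_docs, k):
    """Keep the k highest-scoring documents using a binary min heap.

    scored_docs: iterable of (doc_id, score) pairs, e.g. as produced by
    document-at-a-time scoring (names are illustrative assumptions).
    Returns the top k as (score, doc_id) pairs, highest score first.
    """
    heap = []  # min heap of (score, doc_id); the root holds the smallest score among the current top k
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))   # fill the heap, O(log k)
        elif score > heap[0][0]:                    # compare against the current minimum h_m, O(1)
            # heapreplace = delete root + add the new item + reheapify, O(log k)
            heapq.heapreplace(heap, (score, doc_id))
        # otherwise score <= h_m: skip to the next document
    return sorted(heap, reverse=True)

# Example: top 3 out of a small set of (doc_id, score) pairs
scores = [("d1", 0.2), ("d2", 0.9), ("d3", 0.4), ("d4", 0.7), ("d5", 0.1)]
print(top_k_documents(scores, k=3))  # [(0.9, 'd2'), (0.7, 'd4'), (0.4, 'd3')]
```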
Measures for a search engine
- How fast does it index? (e.g., number of bytes per hour)
- How fast does it search? (e.g., latency as a function of queries per second)
- What is the cost per query? (in dollars)

Measures for a search engine (continued)
- All of the preceding criteria are measurable: we can quantify speed / size / money.
- However, the key measure for a search engine is user happiness.
- What is user happiness? Factors include:
  - Speed of response
  - Size of index
  - Uncluttered UI
  - Most important: relevance
  - (Actually, maybe even more important: it's free.)
- Note that none of these is sufficient: blindingly fast but useless answers won't make a user happy.
- How can we quantify user happiness?

Who is the user?
- Who is the user we are trying to make happy?
- Web search engine: the searcher. Success: the searcher finds what she was looking for. Measure: rate of return to this search engine.
- Web search engine: the advertiser. Success: the searcher clicks on an ad. Measure: clickthrough rate.
- Ecommerce: the buyer. Success: the buyer buys something. Measures: time to purchase, fraction of "conversions" of searchers to buyers.
- Ecommerce: the seller. Success: the seller sells something. Measure: profit per item sold.
- Enterprise: the CEO. Success: employees are more productive (because of effective search). Measure: profit of the company.

Most common definition of user happiness: Relevance
- User happiness is equated with the relevance of search results to the query.
- But how do you measure relevance?
- The standard methodology in information retrieval consists of three elements:
  - A benchmark document collection
  - A benchmark suite of queries
  - An assessment of the relevance of each query-document pair

Relevance: query vs. information need
- Relevance to what? First take: relevance to the query.
- "Relevance to the query" is very problematic.
- Information need i: "I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."
- This is an information need, not a query.
- Query q: red wine white wine heart attack
- Consider document d′: "At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving."
- d′ is an excellent match for query q . . . but d′ is not relevant to the information need i.

Relevance: query vs. information need (continued)
- User happiness can only be measured by relevance to an information need, not by relevance to queries.
- Our terminology is sloppy in these slides and in IIR: we talk about query-document relevance judgments even though we mean information-need-document relevance judgments.

Precision and recall
- Precision (P) is the fraction of retrieved documents that are relevant.
- Recall (R) is the fraction of relevant documents that are retrieved.
- With TP = relevant retrieved, FP = nonrelevant retrieved, FN = relevant not retrieved, and TN = nonrelevant not retrieved:
  - P = TP / (TP + FP)
  - R = TP / (TP + FN)

Precision/recall tradeoff
- You can increase recall by returning more docs: recall is a non-decreasing function of the number of docs retrieved, and a system that returns all docs has 100% recall.
- The converse is also true (usually): it is easy to get high precision for very low recall.
- Suppose the document with the largest score is relevant. How can we maximize precision?

A combined measure: F
- F allows us to trade off precision against recall:
  F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), where β² = (1 − α)/α.
- α ∈ [0, 1] and thus β² ∈ [0, ∞].
- Most frequently used: balanced F with β = 1 (equivalently α = 0.5). This is the harmonic mean of P and R: F1 = 2PR / (P + R).
- What value range of β weights recall higher than precision?

F: Example

                relevant   not relevant       total
retrieved             20             40          60
not retrieved         60      1,000,000   1,000,060
total                 80      1,000,040   1,000,120

- P = 20/(20 + 40) = 1/3
- R = 20/(20 + 60) = 1/4
- F1 = 2PR/(P + R) = 2 · (1/3) · (1/4) / (1/3 + 1/4) = 2/7 ≈ 0.29

Accuracy
- Why do we use complex measures like precision, recall, and F? Why not something simple like accuracy?
- Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct.
- In terms of the contingency table above, accuracy = (TP + TN) / (TP + FP + FN + TN).
- Why is accuracy not a useful measure for web information retrieval?

Exercise
- Compute precision, recall and F1 for this result set:

                relevant   not relevant
retrieved             18              2
not retrieved         82  1,000,000,000

- The snoogle search engine always returns 0 results ("0 matching results found"), regardless of the query. Why does snoogle demonstrate that accuracy is not a useful measure in IR?

Why accuracy is a useless measure in IR
- Simple trick to maximize accuracy in IR: always say no and return nothing.
- You then get 99.99% accuracy on most queries.
- Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk.
- It is better to return some bad hits as long as you return something.
- → We use precision, recall, and F for evaluation, not accuracy.
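To make the accuracy problem concrete, here is a small sketch (not part of the slides) that computes precision, recall, F1, and accuracy for the contingency table of the F example above, and for a system that returns nothing; the function name is an illustrative assumption.

```python
def precision_recall_f1_accuracy(tp, fp, fn, tn):
    """Compute P, R, F1, and accuracy from a 2x2 contingency table."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    acc = (tp + tn) / (tp + fp + fn + tn)
    return p, r, f1, acc

# Contingency table from the F example above: TP=20, FP=40, FN=60, TN=1,000,000
print(precision_recall_f1_accuracy(20, 40, 60, 1_000_000))
# -> P ~ 0.33, R = 0.25, F1 ~ 0.29, accuracy ~ 0.9999 (misleadingly high)

# A "return nothing" system on the same collection: TP=0, FP=0, FN=80, TN=1,000,040
print(precision_recall_f1_accuracy(0, 0, 80, 1_000_040))
# -> P = R = F1 = 0, yet accuracy ~ 0.99992
```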
F: Why harmonic mean?
- Why don't we use a different mean of P and R as a measure? E.g., the arithmetic mean.
- The simple (arithmetic) mean is 50% for the "return-everything" search engine, which is too high.
- Desideratum: punish really bad performance on either precision or recall.
- Taking the minimum achieves this.
- But the minimum is not smooth and hard to weight.
- F (harmonic mean) is a kind of smooth minimum.
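As a quick numerical illustration of this point (a sketch, not part of the slides, reusing the collection size from the F example above): a return-everything engine has recall 1 but near-zero precision; the arithmetic mean still reports roughly 0.5, while the harmonic mean stays near zero.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def harmonic_mean(p, r):  # this is F1
    return 2 * p * r / (p + r) if p + r else 0.0

# "Return everything" over the F-example collection: 80 relevant docs out of 1,000,120.
# Recall is 1.0; precision is the fraction of the collection that is relevant.
p, r = 80 / 1_000_120, 1.0
print(arithmetic_mean(p, r))  # ~0.50004: looks respectable despite useless precision
print(harmonic_mean(p, r))    # ~0.00016: the harmonic mean acts as a smooth minimum
```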