Information extraction techniques ppt

information extraction from unstructured text and information retrieval and extraction
Dr.DouglasPatton, United States, Teacher
Published Date: 26-07-2017
Information Extraction and Named Entity Recognition
Christopher Manning

Information Extraction
• Information extraction (IE) systems
  • Find and understand limited relevant parts of texts
  • Gather information from many pieces of text
  • Produce a structured representation of relevant information:
    • relations (in the database sense), a.k.a.
    • a knowledge base
• Goals:
  1. Organize information so that it is useful to people
  2. Put information in a semantically precise form that allows further inferences to be made by computer algorithms

Information Extraction (IE)
• IE systems extract clear, factual information
  • Roughly: Who did what to whom when?
• E.g.,
  • Gathering earnings, profits, board members, headquarters, etc. from company reports
    • The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.
    • headquarters(“BHP Billiton Limited”, “Melbourne, Australia”)
  • Learn drug-gene product interactions from medical research literature

Low-level information extraction
• Is now available – and I think popular – in applications like Apple or Google mail, and web indexing
• Often seems to be based on regular expressions and name lists

Named Entity Recognition (NER)
• A very important sub-task: find and classify names in text, for example:
  • The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
• Entity classes marked on the slide: Person, Date, Location, Organization

Named Entity Recognition (NER)
• The uses:
  • Named entities can be indexed, linked off, etc.
  • Sentiment can be attributed to companies or products
  • A lot of IE relations are associations between named entities
  • For question answering, answers are often named entities.
• Concretely:
  • Many web pages tag various entities, with links to bio or topic pages, etc.
    • Reuters’ OpenCalais, Evri, AlchemyAPI, Yahoo’s Term Extraction, …
  • Apple/Google/Microsoft/… smart recognizers for document content

Information Extraction and Named Entity Recognition
Introducing the tasks: Getting simple structured information out of text

Evaluation of Named Entity Recognition
The extension of Precision, Recall, and the F measure to sequences

The Named Entity Recognition Task
Task: Predict entities in a text

  Foreign     ORG
  Ministry    ORG
  spokesman   O
  Shen        PER
  Guofang     PER
  told        O
  Reuters     ORG
  :           :

Standard evaluation is per entity, not per token

Precision/Recall/F1 for IE/NER
• Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents)
• The measure behaves a bit funnily for IE/NER when there are boundary errors (which are common):
  • First Bank of Chicago announced earnings …
• This counts as both a fp and a fn
• Selecting nothing would have been better
• Some other metrics (e.g., MUC scorer) give partial credit (according to complex rules)

Sequence Models for Named Entity Recognition

The ML sequence model approach to NER
Training
1. Collect a set of representative training documents
2. Label each token for its entity class or other (O)
3. Design feature extractors appropriate to the text and classes
4. Train a sequence classifier to predict the labels from the data
Testing
1. Receive a set of testing documents
2. Run sequence model inference to label each token
3. Appropriately output the recognized entities

Encoding classes for sequence labeling

  Token      IO encoding   IOB encoding
  Fred       PER           B-PER
  showed     O             O
  Sue        PER           B-PER
  Mengqiu    PER           B-PER
  Huang      PER           I-PER
  ‘s         O             O
  new        O             O
  painting   O             O

Features for sequence labeling
• Words
  • Current word (essentially like a learned dictionary)
  • Previous/next word (context)
• Other kinds of inferred linguistic classification
  • Part-of-speech tags
• Label context
  • Previous (and perhaps next) label

Features: Word substrings
[Figure: counts of the substrings "oxa", ":" and "field" across the entity classes drug, company, movie, place and person; example names: Cotrimoxazole, Wethersfield, Alien Fury: Countdown to Invasion]

Features: Word shapes
• Word Shapes
  • Map words to simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.

  Varicella-zoster   Xx-xxx
  mRNA               xXXX
  CPA1               XXXd
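
The word-shape feature can be illustrated with a small sketch. This is not the feature extractor behind the slides: the function name `word_shape` and its `collapse` option are hypothetical, and the mapping is the simplest per-character one (uppercase to X, lowercase to x, digit to d, everything else kept). It reproduces the short examples (mRNA, CPA1) but not the abbreviated shape shown for Varicella-zoster, which relies on an additional shortening rule for long words that the slides do not spell out.

```python
from itertools import groupby


def word_shape(word: str, collapse: bool = False) -> str:
    """Map a word to a simplified shape string (illustrative sketch only).

    Uppercase letters become 'X', lowercase letters 'x', digits 'd';
    any other character (hyphen, colon, ...) is kept unchanged.
    With collapse=True, runs of identical shape characters are merged,
    which discards length information (a hypothetical variant).
    """
    def char_class(ch: str) -> str:
        if ch.isupper():
            return "X"
        if ch.islower():
            return "x"
        if ch.isdigit():
            return "d"
        return ch

    shape = "".join(char_class(ch) for ch in word)
    if collapse:
        # groupby yields one key per run of equal adjacent characters
        shape = "".join(key for key, _ in groupby(shape))
    return shape


print(word_shape("mRNA"))                             # xXXX
print(word_shape("CPA1"))                             # XXXd
print(word_shape("Varicella-zoster"))                 # Xxxxxxxxx-xxxxxx
print(word_shape("Varicella-zoster", collapse=True))  # Xx-x
```

A shape string like this would typically be added as one more feature of the current token, alongside the word itself and its neighbours.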
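
The per-entity evaluation from the Precision/Recall/F1 slide can be sketched as follows. The helper names `iob_to_spans` and `entity_prf` are hypothetical, and the tag sequences are an assumed reading of the "First Bank of Chicago" example: the gold entity is taken to be the full four-token name, and the system is taken to have selected only "Bank of Chicago". Scoring uses exact span match, not the partial-credit rules of the MUC scorer.

```python
def iob_to_spans(tags):
    """Convert IOB tags to a set of (start, end, type) spans, end exclusive."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes a span that runs to the end of the input.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # B- always opens a span; a stray I- with no open span opens one too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_prf(gold_tags, pred_tags):
    """Entity-level tp, fp, fn, precision, recall and F1 (exact span match)."""
    gold, pred = iob_to_spans(gold_tags), iob_to_spans(pred_tags)
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return tp, fp, fn, precision, recall, f1


# Tokens: First Bank of Chicago announced earnings
gold = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]  # assumed gold labels
pred = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O"]      # assumed boundary error
nothing = ["O"] * 6

print(entity_prf(gold, pred)[:3])     # (0, 1, 1): one fp and one fn
print(entity_prf(gold, nothing)[:3])  # (0, 0, 1): only the fn remains
```

Under this scoring the boundary error costs both a false positive and a false negative, while predicting no entity at all costs only the false negative, which is the behaviour the slide calls out.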
