Question? Leave a message!




XML indexing

XML indexing
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Website URL
Comment
Introduction to Information Retrieval Introduction to Information Retrieval XML Retrieval 1Introduction to Information Retrieval Overview ❶ Introduction ❷ Basic XML concepts ❸ Challenges in XML IR ❹ Vector space model for XML IR ❺ Evaluation of XML IR 2Introduction to Information Retrieval Outline ❶ Introduction ❷ Basic XML concepts ❸ Challenges in XML IR ❹ Vector space model for XML IR ❺ Evaluation of XML IR 3Introduction to Information Retrieval IR and relational databases IR systems are often contrasted with relational databases (RDB).  Traditionally, IR systems retrieve information from unstructured text (“raw” text without markup).  RDB systems are used for querying relational data: sets of records that have values for predefined attributes such as employee number, title and salary. Some structured data sources containing text are best modeled as structured documents rather than relational data (Structured retrieval). 4 4Introduction to Information Retrieval Structured retrieval Basic setting: queries are structured or unstructured; documents are structured. Applications of structured retrieval Digital libraries, patent databases, blogs, tagged text with entities like persons and locations (named entity tagging) Example  Digital libraries: give me a full-length article on fast fourier transforms  Patents: give me patens whose claims mention RSA public key encryption and that cite US patent 4,405,829  Entity-tagged text: give me articles about sightseeing tours of the Vatican and the Coliseum 5 5Introduction to Information Retrieval Why RDB is not suitable in this case Three main problems ❶ An unranked system (DB) would return a potentially large number of articles that mention the Vatican, the Coliseum and sightseeing tours without ranking them by relevance to query. ❷ Difficult for users to precisely state structural constraints – may not know which structured elements are supported by the system. tours AND (COUNTRY: Vatican OR LANDMARK: Coliseum)? tours AND (STATE: Vatican OR BUILDING: Coliseum)? ❸ Users may be completely unfamiliar with structured search and advanced search interfaces or unwilling to use them. Solution: adapt ranked retrieval to structured documents to address these problems. 6Introduction to Information Retrieval Structured Retrieval Standard for encoding structured documents: Extensible Markup Language (XML)   structured IR XML IR  also applicable to other types of markup (HTML, SGML, …) 7 7Introduction to Information Retrieval Outline ❶ Introduction ❷ Basic XML concepts ❸ Challenges in XML IR ❹ Vector space model for XML IR ❺ Evaluation of XML IR 8Introduction to Information Retrieval XML document  Ordered, labeled tree play authorShakespeare/author  Each node of the tree is an XML element, written with titleMacbeth/title an opening and closing XML act number=“I” tag (e.g. title…, /title…) scene number=“”vii”  An element can have one or titleMacbeth’s castle/title more XML attributes (e.g. verseWill I with wine number) …/verse  Attributes can have values (e.g. vii) /scene  Attributes can have child /act elements (e.g. title, verse) /play 9 9Introduction to Information Retrieval XML document root element play element element element author act title text text Shakespeare Macbeth element attribute scene number=“I” element attribute element verse number=“vii” title text text 10 10 Shakespeare Macbeth’s castleIntroduction to Information Retrieval XML document The leaf nodes root element play consist of text element element element author act title text text Shakespeare Macbeth element attribute scene number=“I” element attribute element verse number=“vii” title text text 11 11 Shakespeare Macbeth’s castleIntroduction to Information Retrieval XML document The internal nodes encode root element play document structure or metadata functions element element element author act title text text Shakespeare Macbeth element attribute scene number=“I” element attribute element verse number=“vii” title text text 12 12 Shakespeare Macbeth’s castleIntroduction to Information Retrieval XML basics  XML Documents Object Model (XML DOM): standard for accessing and processing XML documents  The DOM represents elements, attributes and text within elements as nodes in a tree.  With a DOM API, we can process an XML documents by starting at the root element and then descending down the tree from parents to children.  XPath: standard for enumerating path in an XML document collection.  We will also refer to paths as XML contexts or simply contexts  Schema: puts constraints on the structure of allowable XML documents. E.g. a schema for Shakespeare’s plays: scenes can occur as children of acts.  Two standards for schemas for XML documents are: XML DTD 13 13 (document type definition) and XML Schema.Introduction to Information Retrieval Outline ❶ Introduction ❷ Basic XML concepts ❸ Challenges in XML IR ❹ Vector space model for XML IR ❺ Evaluation of XML IR 14Introduction to Information Retrieval First challenge: document parts to retrieve Structured or XML retrieval: users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval. Example If we query Shakespeare’s plays for Macbeth’s castle, should we return the scene, the act or the entire play?  In this case, the user is probably looking for the scene.  However, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit. Solution: structured document retrieval principle 15 15Introduction to Information Retrieval Structured document retrieval principle Structured document retrieval principle One criterion for selecting the most appropriate part of a document: A system should always retrieve the most specific part of a document answering the query.  Motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level.  Hard to implement this principle algorithmically. E.g. query: title:Macbeth can match both the title of the tragedy, Macbeth, and the title of Act I, Scene vii, Macbeth’s castle.  But in this case, the title of the tragedy (higher node) is preferred.  Difficult to decide which level of the tree satisfies the query. 16 16Introduction to Information Retrieval Second challenge: document parts to index Central notion for indexing and ranking in IR: documents unit or indexing unit.  In unstructured retrieval, usually straightforward: files on your desktop, email massages, web pages on the web etc.  In structured retrieval, there are four main different approaches to defining the indexing unit ❶ non-overlapping pseudodocuments ❷ top down ❸ bottom up ❹ all 17Introduction to Information Retrieval XML indexing unit: approach 1 Group nodes into non-overlapping pseudodocuments. Indexing units: books, chapters, section, but without overlap. Disadvantage: pseudodocuments may not make sense to the user because they are not coherent units. 18Introduction to Information Retrieval XML indexing unit: approach 2 Top down (2-stage process): ❶ Start with one of the latest elements as the indexing unit, e.g. the book element in a collection of books ❷ Then, postprocess search results to find for each book the subelement that is the best hit. This two-stage retrieval process often fails to return the best subelement because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it. 19Introduction to Information Retrieval XML indexing unit: approach 3 Bottom up: Instead of retrieving large units and identifying subelements (top down), we can search all leaves, select the most relevant ones and then extend them to larger units in postprocessing. Similar problem as top down: the relevance of a leaf element is often not a good predictor of the relevance of elements it is contained in. 20