Link analysis in information retrieval ppt

link analysis in web mining ppt and communication link analysis ppt and algorithms and models for network data and link analysis
RyanCanon Profile Pic
RyanCanon,United Arab Emirates,Teacher
Published Date:20-07-2017
Your Website URL(Optional)
Comment
Introduction to Information Retrieval Introduction to Information Retrieval Link analysis www.ThesisScientist.com 1Introduction to Information Retrieval Today’s lecture – hypertext and links  We look beyond the content of documents  We begin to look at the hyperlinks between them  Address questions like  Do the links represent a conferral of authority to some pages? Is this useful for ranking?  How likely is it that a page pointed to by the CERN home page is about high energy physics  Big application areas  The Web  Email  Social networks www.ThesisScientist.com 2Introduction to Information Retrieval Links are everywhere  Powerful sources of authenticity and authority  Mail spam – which email accounts are spammers?  Host quality – which hosts are “bad”?  Phone call logs  The Good, The Bad and The Unknown ? Good ? ? Bad ? www.ThesisScientist.com 3Introduction to Information Retrieval Example 1: Good/Bad/Unknown  The Good, The Bad and The Unknown  Good nodes won’t point to Bad nodes  All other combinations plausible ? Good ? ? Bad ? www.ThesisScientist.com 4Introduction to Information Retrieval Simple iterative logic  Good nodes won’t point to Bad nodes  If you point to a Bad node, you’re Bad  If a Good node points to you, you’re Good ? Good ? ? Bad ? www.ThesisScientist.com 5Introduction to Information Retrieval Simple iterative logic  Good nodes won’t point to Bad nodes  If you point to a Bad node, you’re Bad  If a Good node points to you, you’re Good Good ? Bad ? www.ThesisScientist.com 6Introduction to Information Retrieval Simple iterative logic  Good nodes won’t point to Bad nodes  If you point to a Bad node, you’re Bad  If a Good node points to you, you’re Good Good Bad www.ThesisScientist.com 7 Sometimes need probabilistic analogs – e.g., mail spamIntroduction to Information Retrieval Example 2: In-links to pages – unusual patterns  Spammers violating power laws www.ThesisScientist.com 8Introduction to Information Retrieval Many other examples of link analysis  Social networks are a rich source of grouping behavior  E.g., Shoppers’ affinity – Goel+Goldstein 2010  Consumers whose friends spend a lot, spend a lot themselves  http://www.cs.cornell.edu/home/kleinber/networks-book/ www.ThesisScientist.com 9Introduction to Information Retrieval Our primary interest in this course  Link analysis for most IR functionality thus far based purely on text  Scoring and ranking  Link-based clustering – topical structure from links  Links as features in classification – documents that link to one another are likely to be on the same subject  Crawling  Based on the links seen, where do we crawl next? www.ThesisScientist.com 10Introduction to Information Retrieval Sec. 21.1 The Web as a Directed Graph hyperlink Page B Anchor Page A Hypothesis 1: A hyperlink between pages denotes a conferral of authority (quality signal) Hypothesis 2: The text in the anchor of the hyperlink on page A describes the target page B www.ThesisScientist.com 11Introduction to Information Retrieval Assumption 1: reputed sites www.ThesisScientist.com 12Introduction to Information Retrieval Assumption 2: annotation of target www.ThesisScientist.com 13Introduction to Information Retrieval Sec. 21.1.1 Anchor Text WWW Worm - McBryan Mcbr94  For ibm how to distinguish between:  IBM’s home page (mostly graphical)  IBM’s copyright page (high term freq. for ‘ibm’)  Rival’s spam page (arbitrarily high term freq.) “IBM home page” “ibm.com” “ibm” A million pieces of anchor text with “ibm” www.ibm.com send a strong signal www.ThesisScientist.com 14Introduction to Information Retrieval Sec. 21.1.1 Indexing anchor text  When indexing a document D, include (with some weight) anchor text from links pointing to D. Armonk, NY-based computer giant IBM announced today www.ibm.com Joe’s computer hardware Big Blue today announced links record profits for the quarter Sun HP IBM www.ThesisScientist.com 15Introduction to Information Retrieval Sec. 21.1.1 Indexing anchor text  Can sometimes have unexpected effects, e.g., spam, miserable failure  Can score anchor text with weight depending on the authority of the anchor page’s website  E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust (more) the anchor text from them  Increase the weight of off-site anchors (non-nepotistic scoring) www.ThesisScientist.com 16Introduction to Information Retrieval Connectivity servers Getting at all that link information Inexpensively www.ThesisScientist.com 17Introduction to Information Retrieval Sec. 20.4 Connectivity Server  Support for fast queries on the web graph  Which URLs point to a given URL?  Which URLs does a given URL point to? Stores mappings in memory from  URL to outlinks, URL to inlinks  Applications  Link analysis  Web graph analysis  Connectivity, crawl optimization  Crawl control www.ThesisScientist.com 18Introduction to Information Retrieval Sec. 20.4 Boldi and Vigna 2004  http://www2004.org/proceedings/docs/1p595.pdf  Webgraph – set of algorithms and a java implementation  Fundamental goal – maintain node adjacency lists in memory  For this, compressing the adjacency lists is the critical component www.ThesisScientist.com 19Introduction to Information Retrieval Sec. 20.4 Adjacency lists  The set of neighbors of a node  Assume each URL represented by an integer  E.g., for a 4 billion page web, need 32 bits per node  Naively, this demands 64 bits to represent each hyperlink  Boldi/Vigna get down to an average of 3 bits/link  Further work achieves 2 bits/link www.ThesisScientist.com 20