



Computing PageRank: Power method

Introduction to Information Retrieval
Web Search

Overview
❶ Recap ❷ Big picture ❸ Ads ❹ Duplicate detection

Indexing anchor text
- Anchor text is often a better description of a page's content than the page itself.
- Anchor text can be weighted more highly than the text on the page.
- A Google bomb is a search with "bad" results due to maliciously manipulated anchor text.
  - [dangerous cult] on Google, Bing, Yahoo

PageRank
- Model: a web surfer doing a random walk on the web
- Formalization: Markov chain
- PageRank is the long-term visit rate of the random surfer, or the steady-state distribution.
- Need teleportation to ensure well-defined PageRank
- Power method to compute PageRank
- PageRank is the principal left eigenvector of the transition probability matrix.

Computing PageRank: Power method
- PageRank vector π = (π_1, π_2) = (0.25, 0.75)
- P_t(d_1) = P_{t−1}(d_1) · P_{11} + P_{t−1}(d_2) · P_{21}
- P_t(d_2) = P_{t−1}(d_1) · P_{12} + P_{t−1}(d_2) · P_{22}

HITS: Hubs and authorities

HITS update rules
- A: link matrix
- h: vector of hub scores
- a: vector of authority scores
- HITS algorithm:
  - Compute h = A·a
  - Compute a = A^T·h
  - Iterate until convergence
- Output: (i) list of hubs ranked according to hub score and (ii) list of authorities ranked according to authority score
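The two iterations recapped above are easy to see in code. The following is a minimal sketch (Python/NumPy), not taken from the slides: the transition matrix P and the link matrix A are illustrative stand-ins, with P chosen so that its steady state matches the π = (0.25, 0.75) above.

```python
import numpy as np

# --- PageRank via the power method -------------------------------------
# P[i][j] = probability of moving from page d_i to page d_j (teleportation
# already folded in). Illustrative matrix: its steady state is (0.25, 0.75).
P = np.array([[0.25, 0.75],
              [0.25, 0.75]])
x = np.array([1.0, 0.0])              # arbitrary start distribution
for _ in range(100):                  # iterate x_t = x_{t-1} P until convergence
    x_new = x @ P
    if np.allclose(x_new, x):
        break
    x = x_new
print("PageRank:", x)                 # -> [0.25 0.75]

# --- HITS update rules --------------------------------------------------
A = np.array([[0, 1, 1],              # hypothetical link matrix:
              [1, 0, 1],              # A[i][j] = 1 iff page i links to page j
              [0, 0, 1]], dtype=float)
h = np.ones(3)                        # hub scores
a = np.ones(3)                        # authority scores
for _ in range(100):
    h = A @ a                         # h = A a
    a = A.T @ h                       # a = A^T h
    h, a = h / np.linalg.norm(h), a / np.linalg.norm(a)   # keep scores bounded
print("hubs:", h.round(3), "authorities:", a.round(3))
```

Because teleportation is folded into P, the power iteration converges to the same π from any start distribution; for HITS only the relative ordering of the normalized hub and authority scores matters.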
Outline
❶ Recap ❷ Big picture ❸ Ads ❹ Duplicate detection

Web search overview

Search is the top activity on the web

Without search engines, the web wouldn't work
- Without search, content is hard to find.
  - → Without search, there is no incentive to create content.
  - Why publish something if nobody will read it?
  - Why publish something if I don't get ad revenue from it?
- Somebody needs to pay for the web.
  - Servers, web infrastructure, content creation
  - A large part today is paid by search ads.
  - Search pays for the web.

Interest aggregation
- Unique feature of the web: a small number of geographically dispersed people with similar interests can find each other.
  - Elementary school kids with hemophilia
  - People interested in translating R5RS Scheme into relatively portable C (open-source project)
- Search engines are a key enabler for interest aggregation.

IR on the web vs. IR in general
- On the web, search is not just a nice feature.
  - Search is a key enabler of the web: financing, content creation, interest aggregation, etc. → look at search ads
- The web is a chaotic and uncoordinated collection. → lots of duplicates – need to detect duplicates
- No control / restrictions on who can author content → lots of spam – need to detect spam
- The web is very large. → need to know how big it is

Takeaway today
- Big picture
- Ads – they pay for the web
- Duplicate detection – addresses one aspect of chaotic content creation
- Spam detection – addresses one aspect of lack of central access control
- Probably won't get to today:
  - Web information retrieval
  - Size of the web

Outline
❶ Recap ❷ Big picture ❸ Ads ❹ Duplicate detection

First generation of search ads: Goto (1996)

First generation of search ads: Goto (1996)
- Buddy Blake bid the maximum ($0.38) for this search.
- He paid $0.38 to Goto every time somebody clicked on the link.
- Pages were simply ranked according to bid – revenue maximization for Goto.
- No separation of ads/docs. Only one result list!
- Upfront and honest. No relevance ranking, but Goto did not pretend there was any.

Second generation of search ads: Google (2000/2001)
- Strict separation of search results and search ads

Two ranked lists: web pages (left) and ads (right)
- SogoTrade appears in search results.
- SogoTrade appears in ads.
- Do search engines rank advertisers higher than non-advertisers? All major search engines claim no.

Do ads influence editorial content?
- Similar problem at newspapers / TV channels
  - A newspaper is reluctant to publish harsh criticism of its major advertisers.
  - The line often gets blurred at newspapers / on TV.
- No known case of this happening with search engines yet

How are the ads on the right ranked?

How are ads ranked?
- Advertisers bid for keywords – sale by auction.
- Open system: anybody can participate and bid on keywords.
- Advertisers are only charged when somebody clicks on their ad.
- How does the auction determine an ad's rank and the price paid for the ad?
- Basis is a second-price auction, but with twists.
- For the bottom line, this is perhaps the most important research area for search engines – computational advertising.
  - Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.

How are ads ranked?
- First cut: according to bid price, à la Goto
  - Bad idea: open to abuse
  - Example: query [does my husband cheat?] → ad for divorce lawyer
  - We don't want to show non-relevant ads.
- Instead: rank based on bid price and relevance
- Key measure of ad relevance: clickthrough rate
  - clickthrough rate = CTR = clicks per impression
- Result: a non-relevant ad will be ranked low.
  - Even if this decreases search engine revenue short-term
  - Hope: overall acceptance of the system and overall revenue is maximized if users get useful information.
- Other ranking factors: location, time of day, quality and loading speed of landing page
- The main ranking factor: the query

Google AdWords demo

Google's second-price auction
- bid: maximum bid for a click by the advertiser
- CTR: clickthrough rate: when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
- ad rank: bid × CTR: this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
- rank: rank in auction
- paid: second-price auction price paid by the advertiser

Google's second-price auction
- Second-price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent).
- price_1 × CTR_1 = bid_2 × CTR_2 (this will result in rank_1 = rank_2)
- price_1 = bid_2 × CTR_2 / CTR_1
- p_1 = bid_2 × CTR_2 / CTR_1 = 3.00 × 0.03/0.06 = 1.50
- p_2 = bid_3 × CTR_3 / CTR_2 = 1.00 × 0.08/0.03 = 2.67
- p_3 = bid_4 × CTR_4 / CTR_3 = 4.00 × 0.01/0.08 = 0.50
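As a small check of the pricing rule just shown, the sketch below recomputes p_1, p_2, p_3 from the (bid, CTR) pairs visible in the worked example. The top-ranked advertiser's own bid never enters the prices, so the value used for it here is a hypothetical filler, not a number from the slides.

```python
# price_i = bid_{i+1} * CTR_{i+1} / CTR_i for ads already sorted by ad rank = bid * CTR.
ads = [
    {"bid": 3.50, "ctr": 0.06},   # rank 1 (bid is an assumed filler; only its CTR matters)
    {"bid": 3.00, "ctr": 0.03},   # rank 2
    {"bid": 1.00, "ctr": 0.08},   # rank 3
    {"bid": 4.00, "ctr": 0.01},   # rank 4
]

for i in range(len(ads) - 1):
    nxt = ads[i + 1]
    price = nxt["bid"] * nxt["ctr"] / ads[i]["ctr"]   # just enough to keep the position
    print(f"position {i + 1} pays {price:.2f} per click")
# -> 1.50, 2.67, 0.50, matching p_1, p_2, p_3 above
```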
Keywords with high bids
According to http://www.cwire.org/highestpayingsearchterms/ :
69.1  mesothelioma treatment options
65.9  personal injury lawyer michigan
62.6  student loans consolidation
61.4  car accident attorney los angeles
59.4  online car insurance quotes
59.4  arizona dui lawyer
46.4  asbestos cancer
40.1  home equity line of credit
39.8  life insurance quotes
39.2  refinancing
38.7  equity line of credit
38.0  lasik eye surgery new york city
37.0  2nd mortgage
35.9  free car insurance quote

Search ads: A win-win-win
- The search engine company gets revenue every time somebody clicks on an ad.
- The user only clicks on an ad if they are interested in the ad.
  - Search engines punish misleading and non-relevant ads.
  - As a result, users are often satisfied with what they find after clicking on an ad.
- The advertiser finds new customers in a cost-effective way.

Exercise
- Why is web search potentially more attractive for advertisers than TV spots, newspaper ads, or radio spots?
- The advertiser pays for all this. How can the advertiser be cheated?
- Any way this could be bad for the user?
- Any way this could be bad for the search engine?

Not a win-win-win: Keyword arbitrage
- Buy a keyword on Google.
- Then redirect traffic to a third party that is paying much more than you are paying Google.
  - E.g., redirect to a page full of ads
- This rarely makes sense for the user.
- Ad spammers keep inventing new tricks.
  - The search engines need time to catch up with them.

Not a win-win-win: Violation of trademarks
- Example: geico
- During part of 2005: the search term "geico" on Google was bought by competitors.
- Geico lost this case in the United States.
- Louis Vuitton lost a similar case in Europe.
- See http://google.com/tm complaint.html
- It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site.

Outline
❶ Recap ❷ Big picture ❸ Ads ❹ Duplicate detection

Duplicate detection
- The web is full of duplicated content.
  - More so than many other collections
- Exact duplicates
  - Easy to eliminate
  - E.g., use hash/fingerprint
- Near-duplicates
  - Abundant on the web
  - Difficult to eliminate
- For the user, it's annoying to get a search result with near-identical documents.
- Marginal relevance is zero: even a highly relevant document becomes non-relevant if it appears below a (near-)duplicate.
- We need to eliminate near-duplicates.
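For the "hash/fingerprint" bullet above, a minimal sketch of exact-duplicate elimination might look as follows; near-duplicates need the shingling machinery developed next.

```python
import hashlib

def dedupe_exact(docs):
    """Return docs with exact duplicates removed (first occurrence kept)."""
    seen = set()
    unique = []
    for doc in docs:
        fp = hashlib.sha1(doc.encode("utf-8")).hexdigest()  # document fingerprint
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique

print(dedupe_exact(["same page", "same page", "different page"]))
# -> ['same page', 'different page']
```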
Near-duplicates: Example

Exercise
How would you eliminate near-duplicates on the web?

Detecting near-duplicates
- Compute similarity with an edit-distance measure.
- We want "syntactic" (as opposed to semantic) similarity.
  - True semantic similarity (similarity in content) is too difficult to compute.
- We do not consider documents near-duplicates if they have the same content, but express it with different words.
- Use a similarity threshold θ to make the call "is/isn't a near-duplicate".
- E.g., two documents are near-duplicates if similarity > θ = 80%.

Represent each document as a set of shingles
- A shingle is simply a word n-gram.
- Shingles are used as features to measure syntactic similarity of documents.
- For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles:
  - { a-rose-is, rose-is-a, is-a-rose }
- We can map shingles to 1..2^m (e.g., m = 64) by fingerprinting.
- From now on: s_k refers to the shingle's fingerprint in 1..2^m.
- We define the similarity of two documents as the Jaccard coefficient of their shingle sets.

Recall: Jaccard coefficient
- A commonly used measure of the overlap of two sets
- Let A and B be two sets.
- Jaccard coefficient: JACCARD(A, B) = |A ∩ B| / |A ∪ B|
- JACCARD(A, A) = 1
- JACCARD(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- Always assigns a number between 0 and 1.

Jaccard coefficient: Example
- Three documents:
  d1: "Jack London traveled to Oakland"
  d2: "Jack London traveled to the city of Oakland"
  d3: "Jack traveled from Oakland to London"
- Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
- J(d1, d2) = 3/8 = 0.375
- J(d1, d3) = 0
- Note: very sensitive to dissimilarity
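A short sketch that reproduces the numbers in this example, using word bigrams as shingles and the exact Jaccard coefficient:

```python
def shingles(text, n=2):
    """Return the set of word n-grams of a string."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")

print(jaccard(d1, d2))   # 3/8 = 0.375
print(jaccard(d1, d3))   # 0.0
```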
Represent each document as a sketch
- The number of shingles per document is large.
- To increase efficiency, we will use a sketch, a cleverly chosen subset of the shingles of a document.
- The size of a sketch is, say, n = 200, and it is defined by a set of permutations π_1 . . . π_200.
- Each π_i is a random permutation on 1..2^m.
- The sketch of d is defined as:
  < min_{s∈d} π_1(s), min_{s∈d} π_2(s), . . . , min_{s∈d} π_200(s) >
  (a vector of 200 numbers).

The permutation and minimum: Example
- We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for: are d_1 and d_2 near-duplicates?
- In this case: permutation π says d_1 ≈ d_2.

Computing Jaccard for sketches
- Sketches: each document is now a vector of n = 200 numbers.
- Much easier to deal with than the very high-dimensional space of shingles
- But how do we compute Jaccard?

Computing Jaccard for sketches (2)
- How do we compute Jaccard?
- Let U be the union of the sets of shingles of d_1 and d_2, and I the intersection.
- There are |U|! permutations on U.
- For s′ ∈ I, for how many permutations π do we have argmin_{s∈d1} π(s) = s′ = argmin_{s∈d2} π(s)?
- Answer: (|U| − 1)!
- There is a set of (|U| − 1)! different permutations for each s′ in I.
- ⇒ |I| · (|U| − 1)! permutations make argmin_{s∈d1} π(s) = argmin_{s∈d2} π(s) true.
- Thus, the proportion of permutations that make min_{s∈d1} π(s) = min_{s∈d2} π(s) true is:
  |I| · (|U| − 1)! / |U|! = |I| / |U| = J(d_1, d_2)

Estimating Jaccard
- Thus, the proportion of successful permutations is the Jaccard coefficient.
  - Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
- Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
- Estimator of the probability of success: proportion of successes in n Bernoulli trials (n = 200).
- Our sketch is based on a random selection of permutations.
- Thus, to compute Jaccard, count the number k of successful permutations for <d_1, d_2> and divide by n = 200.
- k/n = k/200 estimates J(d_1, d_2).

Implementation
- We use hash functions as an efficient type of permutation: h_i : 1..2^m → 1..2^m.
- Scan all shingles s_k in the union of the two sets in arbitrary order.
- For each hash function h_i and documents d_1, d_2, . . .: keep a slot for the minimum value found so far.
- If h_i(s_k) is lower than the minimum found so far: update the slot.

Example
(worked example with final sketches)

Exercise
h(x) = (5x + 5) mod 4
g(x) = (3x + 1) mod 4

Solution (1)
h(x) = (5x + 5) mod 4
g(x) = (3x + 1) mod 4
(final sketches)

Solution (2)

Shingling: Summary
- Input: N documents
- Choose an n-gram size for shingling, e.g., n = 5.
- Pick 200 random permutations, represented as hash functions.
- Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
- Compute pairwise similarities.
- Compute the transitive closure of documents with similarity > θ.
- Index only one document from each equivalence class.

Efficient near-duplicate detection
- Now we have an extremely efficient method for estimating the Jaccard coefficient for a single pair of documents.
- But we still have to estimate O(N^2) coefficients, where N is the number of web pages.
- Still intractable!
- One solution: locality sensitive hashing (LSH)
- Another solution: sorting (Henzinger 2006)
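To tie the pieces together, here is a minimal end-to-end sketch of the single-pair estimator described above (shingle → fingerprint → min-hash sketch → Jaccard estimate). The hash-function construction and constants are illustrative choices, not the ones from the exercise slides, and Python's built-in hash() stands in for shingle fingerprinting.

```python
import random

N_HASHES = 200                      # sketch size n = 200, as in the slides
M = (1 << 61) - 1                   # modulus standing in for the 1..2^m fingerprint space

random.seed(0)
HASHES = [(random.randrange(1, M), random.randrange(M)) for _ in range(N_HASHES)]

def shingles(text, n=2):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def sketch(text):
    """Min-hash sketch: for each hash function, the minimum value over all shingle fingerprints."""
    fingerprints = [hash(s) % M for s in shingles(text)]
    return [min((a * f + b) % M for f in fingerprints) for a, b in HASHES]

def estimate_jaccard(s1, s2):
    """Fraction of sketch positions where the two documents agree (successful hash functions)."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

s1 = sketch("Jack London traveled to Oakland")
s2 = sketch("Jack London traveled to the city of Oakland")
print(estimate_jaccard(s1, s2))     # roughly the exact value 3/8 = 0.375
```

With n = 200 hash functions the estimate typically lands within a few hundredths of the exact 0.375; applying this to all O(N^2) pairs is the scaling problem that LSH or sorting-based methods address.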