Introduction to Information Retrieval
Search engines rank content pages and ads
Google’s second price auction
bid: maximum bid for a click by advertiser
CTR: click-through rate: when an ad is displayed, what
percentage of time do users click on it? CTR is a measure of
relevance.
ad rank: bid × CTR: this trades off (i) how much money the
advertiser is willing to pay against (ii) how relevant the ad is
paid: Second price auction: The advertiser pays the minimum
amount necessary to maintain their position in the auction
(plus 1 cent).
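A worked example may help (a minimal sketch; the advertisers, bids, and CTRs below are invented for illustration, not taken from the lecture). Rank by bid × CTR, then charge each advertiser the smallest price that keeps its position, i.e. the ad rank of the next advertiser divided by its own CTR, plus one cent:

    # Toy second price auction: rank by bid * CTR; each advertiser pays
    # the minimum needed to keep its position, plus $0.01.
    # All names and numbers are invented for illustration.
    advertisers = [
        ("A", 4.00, 0.01),   # (name, max bid in $, CTR)
        ("B", 3.00, 0.03),
        ("C", 2.00, 0.06),
        ("D", 1.00, 0.08),
    ]

    # Sort by ad rank = bid * CTR, highest first.
    ranked = sorted(advertisers, key=lambda a: a[1] * a[2], reverse=True)

    for i, (name, bid, ctr) in enumerate(ranked):
        if i + 1 < len(ranked):
            # Smallest price p with p * ctr >= the next advertiser's ad
            # rank, plus 1 cent.
            next_rank = ranked[i + 1][1] * ranked[i + 1][2]
            price = next_rank / ctr + 0.01
        else:
            price = 0.01   # lowest-ranked advertiser pays a nominal amount
        print(f"{name}: ad rank = {bid * ctr:.4f}, pays ${price:.2f} per click")

Here C wins the top slot with ad rank 0.12 but pays only 0.09/0.06 + 0.01 = $1.51 per click, not its full $2.00 bid: the minimum amount necessary to stay ahead of B.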
What’s great about search ads
Users only click if they are interested.
The advertiser only pays when a user clicks on an ad.
Searching for something indicates that you are more likely
to buy it . . .
. . . in contrast to radio and newspaper ads.
Near duplicate detection: minimum of a random permutation
document 1: shingle set S(d1)      document 2: shingle set S(d2)
Apply a random permutation π to the shingle IDs and take the minimum:
x1 = min π(S(d1)), x2 = min π(S(d2))
Then P(x1 = x2) = J(S(d1), S(d2)), the Jaccard coefficient of the two shingle sets.
Roughly: we use the test x1 = x2 for: are
d1 and d2 near-duplicates?
h(x) = x mod 5
g(x) = (2x + 1) mod 5
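To see how these two functions act as min-hash "permutations" (a toy sketch; the shingle ID sets below are invented, not the lecture's example):

    # Min-hash sketch using the two toy hash functions above.
    def h(x): return x % 5
    def g(x): return (2 * x + 1) % 5

    s1 = {1, 2, 3}   # shingle IDs of document 1 (hypothetical)
    s2 = {1, 4, 5}   # shingle IDs of document 2 (hypothetical)

    # One signature component per hash function: the minimum hash value
    # over each document's shingle set.
    sig1 = (min(h(x) for x in s1), min(g(x) for x in s1))
    sig2 = (min(h(x) for x in s2), min(g(x) for x in s2))

    # The fraction of agreeing components estimates J(s1, s2); with only
    # two hash functions the estimate is of course very coarse.
    agree = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
    print(sig1, sig2, agree)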
How hard can crawling be?
Web search engines must crawl their documents.
Getting the content of the documents is easier for many
other IR systems.
E.g., indexing all files on your hard disk: just do a recursive
descent on your file system.
OK: for web IR, getting the content of the documents takes
longer . . .
. . . because of latency.
But is that really a design/systems challenge?
Basic crawler operation
Initialize queue with URLs of known seed pages
Take URL from queue
Fetch and parse page
Extract URLs from page
Add URLs to queue
Fundamental assumption: The web is well linked.
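A minimal sketch of this loop in Python (standard library only; the seed URL is a placeholder, and politeness, robots.txt, and duplicate detection are deliberately left out: those are exactly the gaps the next slides discuss):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # Collect href attributes of <a> tags, resolved against the page URL.
        def __init__(self, base):
            super().__init__()
            self.base = base
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base, value))

    def crawl(seeds, max_pages=10):
        queue = deque(seeds)      # queue initialized with known seed pages
        seen = set(seeds)         # URLs already queued or fetched
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()             # take URL from queue
            try:
                page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                      # unfetchable page: skip it
            fetched += 1
            parser = LinkExtractor(url)
            parser.feed(page)                 # fetch and parse page
            for link in parser.links:         # extract URLs from page
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    queue.append(link)        # add URLs to queue
        return seen

    # crawl(["https://example.com/"])   # placeholder seed URL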
Exercise: What’s wrong with this crawler?
urlqueue := (some carefully selected set of seed urls)
while urlqueue is not empty:
    myurl := urlqueue.getlastanddelete()
    mypage := myurl.fetch()
    fetchedurls.add(myurl)
    newurls := mypage.extracturls()
    for myurl in newurls:
        if myurl not in fetchedurls and not in urlqueue:
            urlqueue.add(myurl)
What’s wrong with the simple crawler
Scale: we need to distribute.
We can’t index everything: we need to subselect. How?
Duplicates: need to integrate duplicate detection
Spam and spider traps: need to integrate spam detection
Politeness: we need to be “nice” and space out all requests
for a site over a longer period (hours, days)
Freshness: we need to recrawl periodically.
Because of the size of the web, we can do frequent recrawls
only for a small subset.
Again, a subselection or prioritization problem
Magnitude of the crawling problem
To fetch 20,000,000,000 pages in one month . . .
. . . we need to fetch almost 8000 pages per second
Actually many more, since many of the pages we attempt to
crawl will be duplicates, unfetchable, spam, etc.
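For concreteness, the arithmetic behind the 8000 figure:
20,000,000,000 pages / (30 days × 86,400 seconds/day) ≈ 7,716 pages per second, sustained around the clock for a month.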
What a crawler must do
Don’t hit a site too often
Only crawl pages you are allowed to crawl: robots.txt
Be immune to spider traps, duplicates, very large pages, very
large websites, dynamic pages, etc.
Robots.txt
Protocol for giving crawlers (“robots”) limited access to a
website, originally from 1994.
Important: cache the robots.txt file of each site we are crawling.
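A minimal sketch of such a cache using Python's standard library robots.txt parser (the host, path, and user-agent string below are placeholders):

    from urllib.robotparser import RobotFileParser

    robots_cache = {}   # host -> parsed robots.txt, fetched once per host

    def allowed(host, path, agent="MyCrawlerBot"):
        # "MyCrawlerBot" and the example URLs are placeholders.
        if host not in robots_cache:
            rp = RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            rp.read()                 # fetch once; reuse the parsed copy
            robots_cache[host] = rp
        return robots_cache[host].can_fetch(agent, f"https://{host}{path}")

    # allowed("example.com", "/some/page.html")   # placeholder host/path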
Example of a robots.txt (nih.gov)
What any crawler should do
Be capable of distributed operation
Be scalable: need to be able to increase crawl rate by adding
more machines
Fetch pages of higher quality first
Continuous operation: get fresh versions of already crawled
pages
❶ Recap
❷ A simple crawler
❸ A real crawler
The URL frontier is the data structure that holds and
manages URLs we’ve seen but have not yet crawled.
Can include multiple pages from the same host
Must avoid trying to fetch them all at the same time
Must keep all crawling threads busy
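One simple way to meet these requirements (a toy sketch under simplifying assumptions, not how a production frontier is organized): keep one FIFO queue per host, plus the earliest time each host may be contacted again.

    import time
    from collections import defaultdict, deque

    class URLFrontier:
        # Toy frontier: one FIFO queue per host, plus a per-host
        # earliest-next-fetch time so requests to a host are spaced out.
        def __init__(self, delay=2.0):
            self.delay = delay                  # seconds between hits to a host
            self.queues = defaultdict(deque)    # host -> pending URLs
            self.next_ok = {}                   # host -> earliest fetch time

        def add(self, host, url):
            self.queues[host].append(url)

        def get(self):
            # Return a URL whose host is currently allowed, else None;
            # a crawling thread that gets None can retry after a short sleep.
            now = time.monotonic()
            for host, q in self.queues.items():
                if q and self.next_ok.get(host, 0.0) <= now:
                    self.next_ok[host] = now + self.delay
                    return q.popleft()
            return None                         # all hosts are cooling down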
Basic crawl architecture
Some URLs extracted from a document are relative URLs.
E.g., at http://mit.edu, we may have aboutsite.html
This is the same as: http://mit.edu/aboutsite.html
During parsing, we must normalize (expand) all relative URLs.
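In Python, urllib.parse.urljoin from the standard library performs exactly this expansion; using the slide's example:

    from urllib.parse import urljoin

    print(urljoin("http://mit.edu/", "aboutsite.html"))
    # prints: http://mit.edu/aboutsite.html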