Week 4
SIMPLE CRAWLER
How hard can crawling be?
Exercise: What’s wrong with this crawler?
urlqueue := (some carefully selected set of seed URLs)
while urlqueue is not empty:
    myurl := urlqueue.getlastanddelete()
    mypage := myurl.fetch()
    fetchedurls.add(myurl)
    newurls := mypage.extracturls()
    for newurl in newurls:
        if newurl not in fetchedurls and newurl not in urlqueue:
            urlqueue.add(newurl)
    addtoinvertedindex(mypage)
What’s wrong with the simple crawler
Scale: we need to distribute.
We can’t index everything: we need to subselect. How?
Duplicates: need to integrate duplicate detection.
Spam and spider traps: need to integrate spam detection.
Politeness: we need to be “nice” and space out all requests for a site over a longer period (hours, days).
Freshness: we need to recrawl periodically. Because of the size of the web, we can do frequent recrawls only for a small subset.
Magnitude of the crawling problem
What a crawler must do
Be polite
    Don’t hit a site too often.
    Only crawl pages you are allowed to crawl: robots.txt.
Be robust
    Be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages, etc.
Robots.txt
Protocol for giving crawlers (“robots”) limited access to a website: the file /robots.txt at the root of a host states which parts of the site which crawlers may visit.
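As a minimal sketch of honoring robots.txt in Python, the standard library’s urllib.robotparser can fetch and query the file; the host and the user-agent string "MyCrawler" below are made up for illustration.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # hypothetical host
rp.read()                                     # fetch and parse the file

# Check permission before fetching; "MyCrawler" is an assumed user agent.
if rp.can_fetch("MyCrawler", "http://example.com/some/page.html"):
    print("allowed to fetch")

The parser also exposes crawl_delay(), which reports a Crawl-delay directive if the site sets one; that ties in with the politeness gap discussed later.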
REAL CRAWLER
URL frontier
The URL frontier is the data structure that holds and manages the URLs we have seen but not yet crawled.
Basic crawl architecture
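The architecture diagram on this slide did not survive extraction. As a rough sketch based on the chapter 20 description, one crawl iteration threads a URL through fetch, parse, the content-seen test, the URL filter, and duplicate URL elimination before new links re-enter the frontier; all helper names below are assumptions.

def crawl_step(frontier, doc_fps, seen_urls):
    url = frontier.pop_url()              # frontier decides what is due
    page = fetch(url)                     # hypothetical fetcher (DNS + HTTP)
    fp = fingerprint(page)                # hypothetical content fingerprint
    if fp in doc_fps:                     # content seen? -> skip duplicate
        return
    doc_fps.add(fp)
    index(page)                           # hand the page to the indexer
    for link in extract_urls(page):       # parse out new links
        link = normalize(link)            # URL normalization / filtering
        if robots_allow(link) and link not in seen_urls:
            seen_urls.add(link)           # duplicate URL elimination
            frontier.add_url(link)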
URL normalization
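One job of URL normalization is expanding relative links against the base URL of the page they were found on. A quick illustration with Python’s urllib.parse.urljoin (URLs made up):

from urllib.parse import urljoin

# A link "../c.html" found on page /a/b.html resolves to /c.html.
urljoin("http://www.example.com/a/b.html", "../c.html")
# -> 'http://www.example.com/c.html'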
Content seen
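A minimal content-seen test can use exact fingerprints, here assumed to be a SHA-1 hash of the page body; real crawlers also use near-duplicate techniques such as shingling, which exact hashing does not catch.

import hashlib

doc_fps = set()

def content_seen(body: bytes) -> bool:
    # Exact-duplicate check: identical bytes -> identical fingerprint.
    fp = hashlib.sha1(body).digest()
    if fp in doc_fps:
        return True
    doc_fps.add(fp)
    return False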
Distributing the crawler
Run multiple crawl threads, potentially at different nodes.
Partition the hosts being crawled among the nodes.
Google data centers (wayfaring.com)
Distributed crawler
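A common way to split work across crawler nodes is to partition by host, so each host’s politeness state lives on exactly one node. A sketch, with an assumed four-node cluster:

import hashlib
from urllib.parse import urlparse

NUM_NODES = 4  # assumed cluster size

def node_for(url: str) -> int:
    # Hash the host, not the full URL: all URLs of one host map to
    # the same node, keeping per-host politeness state in one place.
    host = urlparse(url).netloc
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_NODES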
URL frontier: Two main considerations
Politeness: don’t hit a web server too frequently.
Freshness: crawl some pages (e.g., news sites) more often than others.
These two goals can conflict with each other.
Mercator URL frontier
URLs flow in from the top into the frontier.
Front queues manage prioritization.
Back queues enforce politeness.
Each queue is FIFO.
Mercator URL frontier: Front queues
Prioritizer assigns to each URL an integer priority between 1 and F, then appends the URL to the corresponding queue.
Heuristics for assigning priority: refresh rate, PageRank, etc.
Mercator URL frontier: Front queues
Selection from front queues is initiated by the back queues: pick a front queue from which to take the next URL, e.g., round robin, randomly, or randomized and biased toward higher-priority queues.
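A sketch of the front-queue side under assumptions: F = 3 priority levels, queue 0 treated as highest priority (the slide numbers priorities 1 to F), and the weighting standing in for the biased-random selection variant just mentioned.

import random
from collections import deque

F = 3  # assumed number of front queues
front_queues = [deque() for _ in range(F)]

def add_url(url, priority):
    # The prioritizer has assigned priority in 0..F-1 (refresh rate,
    # PageRank, etc. would drive this in a real crawler).
    front_queues[priority].append(url)

def select_front_queue():
    # Randomized selection biased toward higher-priority queues.
    nonempty = [i for i in range(F) if front_queues[i]]
    weights = [F - i for i in nonempty]
    return front_queues[random.choices(nonempty, weights=weights)[0]]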
Mercator URL frontier: Back queues
Invariant 1. Each back queue is kept non-empty while the crawl is in progress.
Invariant 2. Each back queue only contains URLs from a single host.
Maintain a table from hosts to back queues.
Mercator URL frontier: Back queues
In the heap: one entry for each back queue.
The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again.
The earliest time t_e is determined by (i) the last access to that host and (ii) the time gap heuristic.
Mercator URL frontier: Back queues
How the fetcher interacts with a back queue:
Repeat (i) extract the current root q of the heap (q is a back queue) and (ii) fetch the URL u at the head of q . . .
. . . until we empty the q we get (i.e., u was the last URL in q).
Mercator URL frontier: Back queues
When we have emptied a back queue q:
Repeat (i) pull URLs u from the front queues and (ii) add u to its corresponding back queue . . .
. . . until we get a u whose host does not have a back queue.
Then put u in q and create a heap entry for it.
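Pulling the back-queue slides together, a sketch using heapq with (t_e, host) entries; the fixed 10-second gap and the fetch()/refill() helpers are assumptions (Mercator derives the gap from the last fetch time for that host).

import heapq
import time
from collections import deque

GAP = 10.0   # assumed politeness gap; the real heuristic scales with fetch time

back_queues = {}   # host -> deque of URLs (invariant 2: one host per queue)
heap = []          # (t_e, host): earliest time each host may be hit again

def fetch_next():
    t_e, host = heapq.heappop(heap)           # back queue with earliest t_e
    time.sleep(max(0.0, t_e - time.time()))   # wait until the host is due
    url = back_queues[host].popleft()         # URL at the head of q
    fetch(url)                                # hypothetical fetcher
    if back_queues[host]:
        # q still non-empty: re-insert with a new earliest-hit time.
        heapq.heappush(heap, (time.time() + GAP, host))
    else:
        refill(host)   # hypothetical: pull from front queues as described above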
Resources
Chapter 20 of IIR
Resources at http://ifnlp.org/ir
Paper on Mercator by Heydon et al.
Robot exclusion standard