Week 4

SIMPLE CRAWLER
1

How hard can crawling be?

 Web search engines must crawl their documents.
 Getting the content of the documents is easier for many other IR systems.
   E.g., indexing all files on your hard disk: just do a recursive descent on your file system
 Ok: for web IR, getting the content of the documents takes longer . . .
 . . . because of latency.
 But is that really a design/systems challenge?
2
Basic crawler operation

 Initialize queue with URLs of known seed pages
 Repeat
   Take URL from queue
   Fetch and parse page
   Extract URLs from page
   Add URLs to queue
 Fundamental assumption: The web is well linked.
3
Exercise: What’s wrong with this crawler?

urlqueue := (some carefully selected set of seed urls)
while urlqueue is not empty:
    myurl := urlqueue.getlastanddelete()
    mypage := myurl.fetch()
    fetchedurls.add(myurl)
    newurls := mypage.extracturls()
    for myurl in newurls:
        if myurl not in fetchedurls and not in urlqueue:
            urlqueue.add(myurl)
    addtoinvertedindex(mypage)
4
What’s wrong with the simple crawler

 Scale: we need to distribute.
 We can’t index everything: we need to subselect. How?
 Duplicates: need to integrate duplicate detection
 Spam and spider traps: need to integrate spam detection
 Politeness: we need to be “nice” and space out all requests for a site over a longer period (hours, days)
 Freshness: we need to recrawl periodically. Because of the size of the web, we can only do frequent recrawls for a small subset of pages.
5
Magnitude of the crawling problem

 To fetch 20,000,000,000 pages in one month . . .
 . . . we need to fetch almost 8000 pages per second!
 Actually: many more, since many of the pages we attempt to crawl will be duplicates, unfetchable, spam etc.
6
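(Check on the arithmetic: one month ≈ 30 × 86,400 s = 2,592,000 s, and 20,000,000,000 pages / 2,592,000 s ≈ 7,716 pages per second.)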
What a crawler must do

Be polite
 Don’t hit a site too often
 Only crawl pages you are allowed to crawl: robots.txt

Be robust
 Be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages etc.
7
Robots.txt

 Protocol for giving crawlers (“robots”) limited access to a website, originally from 1994
 Examples:
   User-agent: *
   Disallow: /yoursite/temp/

   User-agent: searchengine
   Disallow: /
 Important: cache the robots.txt file of each site we are crawling
8
Example of a robots.txt (nih.gov)
User-agent: PicoSearch/1.0
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
User-agent: *
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
Disallow: /ddir/
Disallow: /sdminutes/
9
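As a rough illustration (not part of the original slides), the per-host robots.txt check with caching can be sketched in Python with the standard library robotparser; the allowed() helper and the robots_cache dict are just illustrative names, and whether a given nih.gov path is blocked depends on the rules actually served:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

robots_cache = {}  # host -> parsed robots.txt, fetched once per host

def allowed(url, user_agent="*"):
    host = urlparse(url).netloc
    if host not in robots_cache:
        rp = RobotFileParser()
        rp.set_url("https://" + host + "/robots.txt")
        rp.read()                      # fetch and parse the file once per host
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(user_agent, url)

# Under the rules shown above, allowed("https://www.nih.gov/nidcd/", "PicoSearch/1.0")
# would come back False, while a path with no matching Disallow line is allowed.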
What any crawler should do

 Be capable of distributed operation
 Be scalable: need to be able to increase crawl rate by adding more machines
 Fetch pages of higher quality first
 Continuous operation: get fresh versions of already crawled pages
10
REAL CRAWLER
11

URL frontier
12
URL frontier

 The URL frontier is the data structure that holds and manages URLs we’ve seen, but that have not been crawled yet.
 Can include multiple pages from the same host
 Must avoid trying to fetch them all at the same time
 Must keep all crawling threads busy
13
Basic crawl architecture

14
URL normalization

 Some URLs extracted from a document are relative URLs.
 E.g., at http://mit.edu, we may have aboutsite.html
   This is the same as: http://mit.edu/aboutsite.html
 During parsing, we must normalize (expand) all relative URLs.
15
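A minimal sketch of this expansion step in Python; urljoin from the standard library does the real work, and the /research/ path is only a made-up second example:

from urllib.parse import urljoin

base = "http://mit.edu/"                        # URL of the page being parsed
print(urljoin(base, "aboutsite.html"))          # -> http://mit.edu/aboutsite.html
print(urljoin(base, "/research/index.html"))    # -> http://mit.edu/research/index.html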
Content seen

 For each page fetched: check if the content is already in the index
 Check this using document fingerprints or shingles
 Skip documents whose content has already been indexed
16
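One way to make this concrete (a sketch, not the exact scheme behind the slides) is to fingerprint each page with the smallest hashes of its word k-shingles and compare signatures; k, the signature size and the 0.9 threshold below are illustrative choices:

import hashlib

def shingle_signature(text, k=4, keep=64):
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return set(hashes[:keep])               # keep the smallest hashes as the signature

def resemblance(sig_a, sig_b):
    if not sig_a or not sig_b:
        return 0.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)   # Jaccard overlap of signatures

# Skip indexing a fetched page if its resemblance to an already indexed page is, say, > 0.9.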
Distributing the crawler

 Run multiple crawl threads, potentially at different nodes
 Usually geographically distributed nodes
 Partition hosts being crawled into nodes
17
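A common way to sketch this partitioning (NUM_NODES and the hash choice are illustrative assumptions) is to hash the host name, so every node agrees on which node owns which host:

import hashlib

NUM_NODES = 4                                   # illustrative cluster size

def node_for_host(host):
    digest = hashlib.md5(host.lower().encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# A URL extracted on any node is forwarded to node_for_host of its host for crawling.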
Google data centers (wayfaring.com)
18
Distributed crawler

19
URL frontier: Two main considerations

 Politeness: Don’t hit a web server too frequently
   E.g., insert a time gap between successive requests to the same server
 Freshness: Crawl some pages (e.g., news sites) more often than others
 Not an easy problem: simple priority queue fails.
20
Mercator URL frontier

21
Mercator URL frontier

 URLs flow in from the top into the frontier.
22
Mercator URL frontier

 URLs flow in from the top into the frontier.
 Front queues manage prioritization.
23
Mercator URL frontier

 URLs flow in from the top into the frontier.
 Front queues manage prioritization.
 Back queues enforce politeness.
24
Mercator URL frontier

 URLs flow in from the top into the frontier.
 Front queues manage prioritization.
 Back queues enforce politeness.
 Each queue is FIFO.
25
Mercator URL frontier: Front queues

26
Mercator URL frontier: Front queues

 Prioritizer assigns each URL an integer priority between 1 and F.
27
Mercator URL frontier: Front queues

 Prioritizer assigns each URL an integer priority between 1 and F.
 Then appends the URL to the corresponding queue.
28
Mercator URL frontier: Front queues

 Prioritizer assigns each URL an integer priority between 1 and F.
 Then appends the URL to the corresponding queue.
 Heuristics for assigning priority: refresh rate, PageRank etc.
29
Mercator URL frontier: Front queues

 Selection from front queues is initiated by back queues.
 Pick a front queue from which to select the next URL: round robin, randomly, or a more sophisticated variant
   But with a bias in favor of high-priority front queues.
30
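A toy sketch of the front-queue side, assuming priority 1 is highest, F FIFO queues, and a placeholder prioritize() heuristic; the names and the weighting scheme are illustrative, not taken from Mercator itself:

import random
from collections import deque

F = 5                                        # number of priority levels (illustrative)
front_queues = [deque() for _ in range(F)]   # front_queues[0] holds priority-1 URLs

def prioritize(url):
    # Placeholder; a real prioritizer would use refresh rate, PageRank etc.
    return random.randint(1, F)

def add_url(url):
    front_queues[prioritize(url) - 1].append(url)

def next_from_front_queues():
    # Randomized selection with a bias: high-priority (low-numbered) queues get larger weights.
    non_empty = [i for i in range(F) if front_queues[i]]
    if not non_empty:
        return None
    i = random.choices(non_empty, weights=[F - j for j in non_empty])[0]
    return front_queues[i].popleft()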
Mercator URL frontier: Back queues

31
Mercator URL frontier: Back queues

 Invariant 1. Each back queue is kept non-empty while the crawl is in progress.
 Invariant 2. Each back queue only contains URLs from a single host.
 Maintain a table from hosts to back queues.
32
Mercator URL frontier: Back queues

 In the heap:
   One entry for each back queue
   The entry is the earliest time te at which the host corresponding to the back queue can be hit again.
   The earliest time te is determined by (i) the last access to that host and (ii) the time gap we want to enforce between successive requests to the same host.
33
Mercator URL frontier: Back queues

 How fetcher interacts with back queue:
   Repeat (i) extract current root q of the heap (q is a back queue) and (ii) fetch URL u at head of q . . .
   . . . until we empty the q we get (i.e.: u was the last URL in q).
34
Mercator URL frontier: Back queues

 When we have emptied a back queue q:
   Repeat (i) pull URLs u from the front queues and (ii) add u to its corresponding back queue . . .
   . . . until we get a u whose host does not have a back queue.
   Then put u in q and create a heap entry for it.
35
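A toy sketch of the back-queue bookkeeping described on the last few slides, reusing next_from_front_queues() from the front-queue sketch above; the 10-second gap, the number of queues, the dict layout and the integer queue ids are assumptions for illustration, not Mercator's actual design:

import heapq
import time
from collections import deque
from urllib.parse import urlparse

GAP = 10.0                 # politeness gap in seconds (illustrative)
B = 24                     # number of back queues, e.g. a few times the crawl threads
back_queues = {qid: deque() for qid in range(B)}   # queue id -> URLs from one host
queue_host = {}            # queue id -> host currently assigned to that queue
host_to_queue = {}         # host -> queue id
heap = []                  # entries (te, queue id): te = earliest next hit time for the host
# At start-up, call refill(qid) once for each back queue to seed the heap.

def refill(qid):
    # Slide 35: pull from the front queues; a URL whose host already has a back
    # queue goes there, and the first new host takes over the emptied queue qid.
    while True:
        url = next_from_front_queues()
        if url is None:
            return
        h = urlparse(url).netloc
        if h in host_to_queue:
            back_queues[host_to_queue[h]].append(url)
        else:
            host_to_queue.pop(queue_host.get(qid), None)
            queue_host[qid], host_to_queue[h] = h, qid
            back_queues[qid].append(url)
            heapq.heappush(heap, (time.time(), qid))
            return

def next_url_politely():
    # Slide 34: take the heap root, wait until its host may be hit again,
    # then fetch the URL at the head of that back queue.
    te, qid = heapq.heappop(heap)
    time.sleep(max(0.0, te - time.time()))
    url = back_queues[qid].popleft()
    if back_queues[qid]:
        heapq.heappush(heap, (time.time() + GAP, qid))   # next allowed hit for this host
    else:
        refill(qid)                                      # keep back queues non-empty
    return url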
Mercator URL frontier

 URLs flow in from the top into the frontier.
 Front queues manage prioritization.
 Back queues enforce politeness.
36
Spider trap

 Malicious server that generates an infinite sequence of linked pages
 Sophisticated spider traps generate pages that are not easily identified as dynamic.
37
Resources

 Chapter 20 of IIR
 Resources at http://ifnlp.org/ir
 Paper on Mercator by Heydon et al.
 Robot exclusion standard
38
