20 Crawl
Introduction to
Information Retrieval
Overview
❶ Recap
❷ A simple crawler
❸ A real crawler
Example
h(x) = x mod 5
g(x) = (2x + 1) mod 5
final sketches
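A minimal sketch of how the two functions above could produce final sketches, assuming the example refers to min-hash sketches for near-duplicate detection. The shingle-ID sets `d1` and `d2` are hypothetical inputs chosen for illustration.

```python
def h(x):
    return x % 5


def g(x):
    return (2 * x + 1) % 5


def sketch(shingles):
    """Min-hash sketch: the smallest value of each hash function over the set."""
    return (min(h(x) for x in shingles), min(g(x) for x in shingles))


# Hypothetical documents, each represented as a set of shingle IDs.
d1 = {1, 3, 4}
d2 = {2, 3, 5}

s1 = sketch(d1)
s2 = sketch(d2)

# Fraction of sketch positions on which the two documents agree;
# this serves as an estimate of their Jaccard similarity.
agreement = sum(a == b for a, b in zip(s1, s2)) / len(s1)
print(s1, s2, agreement)
```

With only two hash functions the estimate is very coarse; a practical sketch would likely use many more.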
Be polite
▪ Don’t hit a site too often
▪ Only crawl pages you are allowed to crawl: robots.txt
Be robust
▪ Be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages, etc.
Robots.txt
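As an illustration of checking robots.txt before fetching, the sketch below uses Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT`, the agent names `mycrawler` and `examplebot`, and the `www.example.com` URLs are all made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ordinary crawlers may fetch public pages but not anything under /private/.
allowed_public = parser.can_fetch("mycrawler", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("mycrawler", "http://www.example.com/private/data.html")

# The agent named "examplebot" is excluded from the whole site.
allowed_bot = parser.can_fetch("examplebot", "http://www.example.com/index.html")
```

A crawler would typically cache the parsed rules per host rather than re-reading robots.txt for every URL.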
URL frontier
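One way to picture a polite URL frontier is a FIFO queue per host plus a heap that records when each host may next be contacted. This is a simplified sketch, not the Mercator design itself; the class name `Frontier`, the `delay` parameter, and the explicit `now` argument are assumptions made to keep the example small and deterministic.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """Toy URL frontier: one FIFO queue per host plus a heap of ready times."""

    def __init__(self, delay=2.0):
        self.delay = delay                # assumed minimum gap between hits on one host
        self.queues = defaultdict(deque)  # host -> URLs waiting to be fetched
        self.ready_at = {}                # host -> earliest time of its next fetch
        self.heap = []                    # (ready time, host) for hosts with pending URLs
        self.active = set()               # hosts currently present in the heap

    def add(self, url, now):
        host = urlsplit(url).netloc
        self.queues[host].append(url)
        if host not in self.active:
            ready = max(now, self.ready_at.get(host, 0.0))
            heapq.heappush(self.heap, (ready, host))
            self.active.add(host)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.queues[host].popleft()
        self.ready_at[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.heap, (self.ready_at[host], host))
        else:
            self.active.discard(host)
        return url
```

For instance, after adding two URLs on one host and one on another at time 0, two calls to `next_url(0.0)` return one URL from each host, and the second URL of the first host only becomes available once `delay` seconds have passed.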
URL normalization
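A small sketch of URL normalization, understood here as expanding relative links found on a page into absolute URLs. It uses the standard `urllib.parse` module; the helper name `normalize`, the `www.example.com` URLs, and the choice to drop fragments are assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin


def normalize(base_url, link):
    """Expand a (possibly relative) link against the URL of the page it came from."""
    absolute = urljoin(base_url, link)
    # Assumption: fragments such as '#top' name a position within the same
    # document, so they are dropped before the URL enters the frontier.
    return urldefrag(absolute).url


base = "http://www.example.com/docs/index.html"
print(normalize(base, "about.html"))       # sibling page in the same directory
print(normalize(base, "../img/logo.png"))  # path that climbs one directory
print(normalize(base, "/faq.html#top"))    # site-absolute path with a fragment
```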
Content seen
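A content-seen test can be sketched with document fingerprints: hash each fetched page and skip it if the same fingerprint has been stored before. The class name `ContentSeen`, the whitespace and case normalization, and the use of SHA-1 are assumptions for this example; catching near-duplicates would need shingles or a similar technique rather than an exact hash.

```python
import hashlib


class ContentSeen:
    """Remember fingerprints of fetched pages to detect exact duplicates."""

    def __init__(self):
        self.fingerprints = set()

    def check_and_add(self, page_text):
        """Return True if this content was seen before; otherwise record it."""
        # Collapse whitespace and case so trivial formatting differences
        # do not produce different fingerprints (an illustrative choice).
        normalized = " ".join(page_text.split()).lower()
        fingerprint = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if fingerprint in self.fingerprints:
            return True
        self.fingerprints.add(fingerprint)
        return False


seen = ContentSeen()
first = seen.check_and_add("Hello   world")
repeat = seen.check_and_add("hello world\n")
other = seen.check_and_add("A different page")
```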
Distributed crawler
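One common way to split work in a distributed crawler is to assign every host to exactly one node, so that politeness bookkeeping for a host stays in one place. The sketch below shows such a host-based partition under that assumption; the function name `node_for` and the use of MD5 as a stable hash are illustrative choices, not taken from the slides.

```python
import hashlib
from urllib.parse import urlsplit


def node_for(url, num_nodes):
    """Map a URL to a crawler node so that all URLs of one host share a node."""
    host = urlsplit(url).netloc.lower()
    # A cryptographic digest is used only because it is stable across runs
    # and machines, unlike Python's built-in hash() for strings.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


a = node_for("http://www.example.com/a.html", 4)
b = node_for("http://WWW.example.com/b.html?x=1", 4)
```

A node that extracts a URL belonging to another node would forward it there instead of fetching it itself.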
Spider trap
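Spider traps often show up as URLs that keep growing or repeat the same path segment indefinitely. The heuristic below is only a rough sketch of how a crawler might flag such URLs; the function name `looks_like_trap` and all three thresholds are hypothetical values, and real crawlers usually combine several signals.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_length=256, max_depth=12, max_repeats=3):
    """Flag URLs that are very long, very deep, or repeat a path segment often."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(url) > max_length or len(segments) > max_depth:
        return True
    # A path segment occurring max_repeats times or more suggests a loop.
    counts = Counter(segments)
    return any(n >= max_repeats for n in counts.values())


normal = looks_like_trap("http://www.example.com/a/b/c.html")
looping = looks_like_trap("http://www.example.com/cal/next/next/next/next")
```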
Resources
▪ Chapter 20 of IIR
▪ Resources at http://ifnlp.org/ir
▪ Paper on Mercator by Heydon et al.
▪ Robot exclusion standard