

Introduction to Information Retrieval

Hinrich Schütze and Christina Lioma


Lecture 20: Crawling


Overview

❶ Recap

❷ A simple crawler

❸ A real crawler


Outline

❶ Recap

❷ A simple crawler

❸ A real crawler


Search engines rank content pages and ads


Google’s second price auction

▪ bid: maximum bid for a click by advertiser
▪ CTR: click-through rate: when an ad is displayed, what percentage of time do users click on it? CTR is a measure of relevance.
▪ ad rank: bid × CTR: this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
▪ paid: Second price auction: The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent).
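
A worked example with made-up numbers: advertiser A bids $3.00 with CTR 0.05 (ad rank 0.15); advertiser B bids $4.00 with CTR 0.03 (ad rank 0.12). A wins the auction and pays the smallest bid that keeps its ad rank above B's, plus 1 cent:

bid_b, ctr_b = 4.00, 0.03      # B's ad rank = 0.12
ctr_a = 0.05                   # A's ad rank = 3.00 * 0.05 = 0.15, so A wins
paid_a = bid_b * ctr_b / ctr_a + 0.01
print(round(paid_a, 2))        # 2.41 per click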

What’s great about search ads

▪ Users only click if they are interested.
▪ The advertiser only pays when a user clicks on an ad.
▪ Searching for something indicates that you are more likely to buy it . . .
▪ . . . in contrast to radio and newspaper ads.


Near duplicate detection: Minimum of permutation

document 1: {s_k}    document 2: {s_k}

Roughly: We use  min_{s ∈ S(d1)} π(s) = min_{s ∈ S(d2)} π(s)  as a test for:
are d1 and d2 near-duplicates?

Example

h(x) = x mod 5
g(x) = (2x + 1) mod 5

final sketches
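
The table of hash values and the final sketches is not reproduced here. A minimal Python sketch with made-up shingle IDs shows the computation:

def sketch(shingles, hash_fns):
    # Minimum hash value per (simulated) permutation = the document sketch.
    return [min(h(s) for s in shingles) for h in hash_fns]

h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5

d1 = {1, 2, 3, 4}              # hypothetical shingle IDs
d2 = {1, 2, 3, 6}
print(sketch(d1, [h, g]), sketch(d2, [h, g]))   # [1, 0] [1, 0] -> test passes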

Outline

❶ Recap

❷ A simple crawler

❸ A real crawler


How hard can crawling be?

▪ Web search engines must crawl their documents.
▪ Getting the content of the documents is easier for many other IR systems.
▪ E.g., indexing all files on your hard disk: just do a recursive descent on your file system.
▪ Ok: for web IR, getting the content of the documents takes longer . . .
▪ . . . because of latency.
▪ But is that really a design/systems challenge?


Basic crawler operation

▪ Initialize queue with URLs of known seed pages
▪ Repeat
  ▪ Take URL from queue
  ▪ Fetch and parse page
  ▪ Extract URLs from page
  ▪ Add URLs to queue
▪ Fundamental assumption: The web is well linked.


Exercise: What’s wrong with this crawler?

urlqueue := (some carefully selected set of seed urls)

while urlqueue is not empty:
    myurl := urlqueue.getlastanddelete()
    mypage := myurl.fetch()
    fetchedurls.add(myurl)
    newurls := mypage.extracturls()
    for myurl in newurls:
        if myurl not in fetchedurls and not in urlqueue:
            urlqueue.add(myurl)
    addtoinvertedindex(mypage)


What’s wrong with the simple crawler


▪ Scale: we need to distribute.
▪ We can’t index everything: we need to subselect. How?
▪ Duplicates: need to integrate duplicate detection
▪ Spam and spider traps: need to integrate spam detection
▪ Politeness: we need to be “nice” and space out all requests for a site over a longer period (hours, days)
▪ Freshness: we need to recrawl periodically.
  ▪ Because of the size of the web, we can do frequent recrawls only for a small subset.
  ▪ Again, a subselection / prioritization problem


Magnitude of the crawling problem

▪ To fetch 20,000,000,000 pages in one month . . .
▪ . . . we need to fetch almost 8000 pages per second!
▪ Actually: many more, since many of the pages we attempt to crawl will be duplicates, unfetchable, spam etc.
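
Checking the slide’s figure (assuming a 30-day month):

pages = 20_000_000_000
seconds = 30 * 24 * 3600           # 2,592,000 seconds in the month
print(round(pages / seconds))      # 7716, i.e. almost 8000 pages/second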


What a crawler must do

Be polite
▪ Don’t hit a site too often
▪ Only crawl pages you are allowed to crawl: robots.txt

Be robust
▪ Be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages etc.


Robots.txt

▪ Protocol for giving crawlers (“robots”) limited access to a website, originally from 1994
▪ Examples:
  ▪ User-agent: *
    Disallow: /yoursite/temp/
  ▪ User-agent: searchengine
    Disallow: /
▪ Important: cache the robots.txt file of each site we are crawling
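
As a sketch, Python’s standard library ships a parser for this protocol (the URL below is just an example; a real crawler would keep one cached parser per host):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.nih.gov/robots.txt")
rp.read()                                   # fetch and parse once, then cache
print(rp.can_fetch("*", "https://www.nih.gov/ddir/"))   # False if disallowed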


Example of a robots.txt (nih.gov)


User-agent: PicoSearch/1.0
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
User-agent: *
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
Disallow: /ddir/
Disallow: /sdminutes/

What any crawler should do

▪ Be capable of distributed operation
▪ Be scalable: need to be able to increase crawl rate by adding more machines
▪ Fetch pages of higher quality first
▪ Continuous operation: get fresh versions of already crawled pages


Outline

❶ Recap

❷ A simple crawler

❸ A real crawler



URL frontier

▪ The URL frontier is the data structure that holds and manages URLs we’ve seen, but that have not been crawled yet.
▪ Can include multiple pages from the same host
▪ Must avoid trying to fetch them all at the same time
▪ Must keep all crawling threads busy


Basic crawl architecture


URL normalization

▪ Some URLs extracted from a document are relative URLs.
▪ E.g., at http://mit.edu, we may have aboutsite.html
▪ This is the same as: http://mit.edu/aboutsite.html
▪ During parsing, we must normalize (expand) all relative URLs.
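
In Python, this expansion is one standard-library call (a minimal sketch using the example above):

from urllib.parse import urljoin

print(urljoin("http://mit.edu", "aboutsite.html"))
# -> http://mit.edu/aboutsite.html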


Content seen

▪ For each page fetched: check if the content is already in the index
▪ Check this using document fingerprints or shingles
▪ Skip documents whose content has already been indexed
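
A minimal sketch of the exact-duplicate variant of this test (an exact content fingerprint; the shingle-based test from the recap would replace the hash here to catch near-duplicates):

import hashlib

seen_fingerprints = set()

def content_seen(page_text: str) -> bool:
    # Fingerprint of the page content; True means: already indexed, skip it.
    fp = hashlib.sha1(page_text.encode("utf-8")).digest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False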


Distributing the crawler

▪ Run multiple crawl threads, potentially at different nodes
▪ Usually geographically distributed nodes
▪ Partition hosts being crawled into nodes
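
One simple partitioning scheme (an illustrative assumption, not necessarily what any given engine does): hash the host name modulo the number of nodes, so all URLs of one host land on the same node:

from urllib.parse import urlparse
from zlib import crc32

def node_for(url: str, num_nodes: int) -> int:
    # All URLs of a host map to the same node, keeping politeness local.
    host = urlparse(url).netloc
    return crc32(host.encode("utf-8")) % num_nodes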


Google data centers (wayfaring.com)


Distributed crawler


URL frontier: Two main considerations

▪ Politeness: Don’t hit a web server too frequently
  ▪ E.g., insert a time gap between successive requests to the same server
▪ Freshness: Crawl some pages (e.g., news sites) more often than others
▪ Not an easy problem: a simple priority queue fails.


Mercator URL frontier

▪ URLs flow in from the top into the frontier.
▪ Front queues manage prioritization.
▪ Back queues enforce politeness.
▪ Each queue is FIFO.


Mercator URL frontier: Front queues

▪ Prioritizer assigns to URL an integer priority between 1 and F.
▪ Then appends URL to corresponding queue.
▪ Heuristics for assigning priority: refresh rate, PageRank etc.
▪ Selection from front queues is initiated by back queues.
▪ Pick a front queue from which to select next URL: round robin, randomly, or a more sophisticated variant
▪ But with a bias in favor of high-priority front queues

Mercator URL frontier: Back queues


▪ Invariant 1. Each back queue is kept non-empty while the crawl is in progress.
▪ Invariant 2. Each back queue only contains URLs from a single host.
▪ Maintain a table from hosts to back queues.


▪ In the heap:
  ▪ One entry for each back queue
  ▪ The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again.
  ▪ The earliest time t_e is determined by (i) the last access to that host and (ii) a time gap heuristic.

▪ How the fetcher interacts with a back queue:
  ▪ Repeat (i) extract the current root q of the heap (q is a back queue) and (ii) fetch URL u at the head of q . . .
  ▪ . . . until we empty the q we get (i.e., u was the last URL in q).


▪ When we have emptied a back queue q:
  ▪ Repeat (i) pull URLs u from the front queues and (ii) add u to its corresponding back queue . . .
  ▪ . . . until we get a u whose host does not have a back queue.
  ▪ Then put u in q and create a heap entry for it.
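
The back-queue mechanics of the last few slides fit in a short Python sketch. The class and method names are this sketch’s own inventions (not Mercator’s actual interfaces), and the front-queue bias and queue priming are simplified assumptions:

import heapq
import time
from collections import deque
from urllib.parse import urlparse

class MercatorFrontier:
    def __init__(self, num_front=3, num_back=5, gap=2.0):
        self.front = [deque() for _ in range(num_front)]  # priorities 1..F
        self.back = [deque() for _ in range(num_back)]    # one host each
        self.host_to_back = {}    # invariant 2: host -> its back queue
        self.heap = []            # (earliest hit time t_e, back queue id)
        self.gap = gap            # time gap heuristic, in seconds

    def add_url(self, url, priority=0):
        # Prioritizer: append the URL to the front queue for its priority.
        self.front[priority].append(url)

    def _pull_front(self):
        # Select from front queues with a bias toward high priority
        # (here simply best-first; Mercator allows fancier schemes).
        for q in self.front:
            if q:
                return q.popleft()
        return None

    def _refill(self, back_id):
        # After back queue back_id empties: pull URLs from front queues,
        # routing each to its host's existing back queue, until we see a
        # host with no back queue; that host takes over back_id.
        while True:
            url = self._pull_front()
            if url is None:
                return                      # front queues exhausted
            host = urlparse(url).netloc
            if host in self.host_to_back:
                self.back[self.host_to_back[host]].append(url)
            else:
                self.host_to_back[host] = back_id
                self.back[back_id].append(url)
                heapq.heappush(self.heap, (time.monotonic(), back_id))
                return

    def next_url(self):
        # Fetcher: pop the heap root, wait until its host may be hit
        # again, and take the URL at the head of that back queue.
        if not self.heap:
            return None
        t_e, back_id = heapq.heappop(self.heap)
        time.sleep(max(0.0, t_e - time.monotonic()))
        url = self.back[back_id].popleft()
        if self.back[back_id]:              # host just hit: delay next hit
            heapq.heappush(self.heap, (time.monotonic() + self.gap, back_id))
        else:                               # emptied: reassign this queue
            del self.host_to_back[urlparse(url).netloc]
            self._refill(back_id)           # invariant 1: keep queues full
        return url

# Usage sketch: add seed URLs, prime the back queues, then fetch politely.
frontier = MercatorFrontier()
frontier.add_url("http://example.com/a")
frontier.add_url("http://example.org/b", priority=1)
for b in range(len(frontier.back)):
    frontier._refill(b)
while (u := frontier.next_url()) is not None:
    print(u)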

Spider trap

▪ Malicious server that generates an infinite sequence of linked pages
▪ Sophisticated spider traps generate pages that are not easily identified as dynamic.
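
One common defensive heuristic (an addition of this sketch, not something the lecture prescribes) is to refuse to enqueue URLs that are suspiciously long or deep, since trap-generated URL families tend to grow without bound:

from urllib.parse import urlparse

def looks_like_trap(url: str, max_len: int = 256, max_depth: int = 12) -> bool:
    # Reject URLs that are extremely long or have very many path segments.
    path = urlparse(url).path
    return len(url) > max_len or path.count("/") > max_depth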


Resources

▪ Chapter 20 of IIR
▪ Resources at http://ifnlp.org/ir
▪ Paper on Mercator by Heydon and Najork
▪ Robot exclusion standard
