CIS 455/555: Internet and Web Systems
Crawling and Publish/Subscribe
February 15, 2012
A. Haeberlen, Z. Ives
University of Pennsylvania
Plan for today
Basic crawling   [NEXT]
Mercator
Publish/subscribe
XFilter
Motivation
Suppose you want to build a search engine
Need a large corpus of web pages
How can we find these pages?
Idea: crawl the web
What else can you crawl?
For example, social network
Crawling: The basic process
What state do we need?
Q := Queue of URLs to visit
P := Set of pages already crawled
Basic process:
1. Initialize Q with a set of seed URLs
2. Pick the first URL from Q and download the corresponding page
3. Extract all URLs from the page (<base href> tag, anchor links, CSS, DTDs,
scripts, optionally image links)
4. Append to Q any URLs that a) meet our criteria, and b) are not already in P
5. If Q is not empty, repeat from step 2
Can one machine crawl the entire web?
Of course not! Need to distribute crawling across many machines.
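A minimal single-machine sketch of this loop in Java (download, extractUrls, and meetsCriteria are hypothetical helpers standing in for a real HTTP client, HTML parser, and crawl policy):

    import java.util.*;

    public class BasicCrawler {
        public static void crawl(List<String> seeds) throws Exception {
            Deque<String> q = new ArrayDeque<>(seeds);  // Q: URLs to visit
            Set<String> p = new HashSet<>(seeds);       // P: URLs already seen

            while (!q.isEmpty()) {                      // step 5
                String url = q.removeFirst();           // step 2: pick first URL
                String page = download(url);            //         and fetch it
                for (String link : extractUrls(page)) { // step 3: extract URLs
                    // step 4: enqueue if it meets our criteria and is new
                    if (meetsCriteria(link) && p.add(link)) {
                        q.addLast(link);
                    }
                }
            }
        }

        // Hypothetical helpers -- a real crawler plugs in an HTTP client,
        // an HTML parser, and its crawl policy here:
        static String download(String url) { return ""; }
        static List<String> extractUrls(String page) { return List.of(); }
        static boolean meetsCriteria(String url) { return true; }
    }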
Crawling visualized
[Figure: the web as seen by a crawler - seed URLs; URLs already crawled; pages currently being crawled; the "frontier" of URLs that will eventually be crawled; URLs that do not fit our criteria (too deep, etc.); and the unseen web beyond.]
Crawling complications
What order to traverse in?
Polite to do BFS - why?
Malicious pages
Spam pages / SEO
Spider traps (incl. dynamically generated ones)
General messiness
Cycles
Varying latency and bandwidth to remote servers
Site mirrors, duplicate pages, aliases
Web masters' stipulations
How deep to crawl? How often to crawl?
Continuous crawling; freshness
Need to be robust!
Normalization; eliminating duplicates
Some of the extracted URLs are relative URLs
Example: /~ahae/papers/ from www.cis.upenn.edu
Normalize it: http://www.cis.upenn.edu/~ahae/papers
Duplication is widespread on the web
If the fetched page is already in the index, do not process it
Can verify using document fingerprint (hash) or shingles
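A sketch of both steps in Java: java.net.URI resolves the relative URL against its base page, and an MD5 digest serves as a simple document fingerprint (the base page URL here is made up for illustration):

    import java.math.BigInteger;
    import java.net.URI;
    import java.security.MessageDigest;

    public class Normalize {
        public static void main(String[] args) throws Exception {
            // Resolve a relative URL against the page it was found on
            URI base = URI.create("http://www.cis.upenn.edu/index.html");
            URI abs = base.resolve("/~ahae/papers/").normalize();
            System.out.println(abs);  // http://www.cis.upenn.edu/~ahae/papers/

            // Fingerprint the fetched content to detect exact duplicates
            byte[] content = "<html>...</html>".getBytes();
            byte[] md5 = MessageDigest.getInstance("MD5").digest(content);
            System.out.println(String.format("%032x", new BigInteger(1, md5)));
        }
    }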
Crawler etiquette
Explicit politeness
Look for meta tags; for example, ignore pages that have
<META NAME="ROBOTS" CONTENT="NOINDEX">
Implement the robot exclusion protocol; for example, look
for, and respect, robots.txt
Implicit politeness
Even if no explicit specifications are present, do not hit the
same web site too often
Robots.txt
What should be in robots.txt?
See http://www.robotstxt.org/wc/robots.html
To exclude all robots from a server:
User-agent: *
Disallow: /
To exclude one robot from two directories:
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
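A toy Java checker for rules like the ones above; it handles only User-agent and Disallow prefix rules, whereas real robots.txt files can also contain Allow, Crawl-delay, and more:

    public class RobotsTxt {
        // Returns true if 'agent' may fetch 'path' under these rules.
        static boolean allowed(String robotsTxt, String agent, String path) {
            boolean applies = false;
            for (String line : robotsTxt.split("\n")) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    String ua = line.substring(11).trim();
                    applies = ua.equals("*") || ua.equalsIgnoreCase(agent);
                } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                    String prefix = line.substring(9).trim();
                    // An empty Disallow value means "everything is allowed"
                    if (!prefix.isEmpty() && path.startsWith(prefix)) return false;
                }
            }
            return true;
        }

        public static void main(String[] args) {
            String rules = "User-agent: BobsCrawler\nDisallow: /news/\nDisallow: /tmp/";
            System.out.println(allowed(rules, "BobsCrawler", "/news/today")); // false
            System.out.println(allowed(rules, "BobsCrawler", "/index.html")); // true
        }
    }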
Recap: Crawling
How does the basic process work?
What are some of the main challenges?
Duplicate elimination
Politeness
Malicious pages / spider traps
Normalization
Scalability
Plan for today
Basic crawling
Mercator   [NEXT]
Publish/subscribe
XFilter
Mercator: A scalable web crawler
Written entirely in Java
Expands a URL frontier
Avoids re-crawling same URLs
Also considers whether a document has been
seen before
Same content, different URL [when might this occur?]
Every document has signature/checksum info computed as
it's crawled
Despite the name, it does not actually scale
to a large number of nodes
But it would not be too difficult to parallelize
Heydon and Najork: Mercator, a scalable, extensible web crawler (WWW '99)
Mercator architecture
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream
(RIS)
4. Check against fingerprints to
verify it's new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated
(compare its hash)
8. Enqueue URL
Source: Mercator paper
Mercator's polite frontier queues
Tries to go beyond breadth-first approach
Goal is to have only one crawler thread per server
What does this mean for the load caused by Mercator?
Distributed URL frontier queue:
One subqueue per worker thread
The worker thread is determined by hashing the hostname
of the URL
Thus, only one outstanding request per web server
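The assignment can be as simple as hashing the URL's hostname modulo the number of worker threads; a minimal sketch (not Mercator's actual code):

    import java.net.URI;

    public class FrontierRouter {
        private final int numWorkers;

        FrontierRouter(int numWorkers) { this.numWorkers = numWorkers; }

        // All URLs from the same host map to the same subqueue, and each
        // subqueue is served by one thread -- so there is at most one
        // outstanding request per web server.
        int subqueueFor(String url) {
            String host = URI.create(url).getHost();
            return Math.floorMod(host.hashCode(), numWorkers);
        }

        public static void main(String[] args) {
            FrontierRouter r = new FrontierRouter(8);
            System.out.println(r.subqueueFor("http://example.com/a"));  // same
            System.out.println(r.subqueueFor("http://example.com/b"));  // subqueue
        }
    }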
Mercator's HTTP fetcher
First, needs to ensure robots.txt is followed
Caches the contents of robots.txt for various web sites as it
crawls them
Designed to be extensible to other protocols
Had to write their own HTTP requestor in Java
(their Java version didn't have timeouts)
Today, can use setSoTimeout() (see sketch below)
Could use Java non-blocking I/O:
http://www.owlmountain.com/tutorials/NonBlockingIo.htm
But they use multiple threads and synchronous I/O
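For instance, with the standard HttpURLConnection both the connect and each read can be bounded (a simple sketch; Mercator predates these APIs):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class Fetcher {
        static byte[] fetch(String url) throws Exception {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5000);  // give up if connecting takes > 5s
            conn.setReadTimeout(5000);     // ... or if any read stalls for > 5s
            conn.setRequestProperty("User-Agent", "cis455-crawler");
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes();
            } finally {
                conn.disconnect();
            }
        }
    }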
Other caveats
Infinitely long URL names (good way to get a
buffer overflow!)
Aliased host names
Alternative paths to the same host
Can catch most of these with signatures of
document data (e.g., MD5)
Comparison to Bloom filters (see the sketch below)
Crawler traps (e.g., CGI scripts that link to
themselves using a different name)
May need a way for a human to override certain URL
paths (see Section 5 of the paper)
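To make the Bloom filter comparison concrete: an exact set of MD5 signatures never errs but must store every signature, while a Bloom filter is far more compact but can return false positives (wrongly skipping a new document). A toy two-hash Bloom filter:

    import java.util.BitSet;

    // Toy Bloom filter with two derived hash functions. No false
    // negatives; false positives possible. A HashSet<String> of MD5
    // signatures gives exact answers but stores every signature.
    public class BloomFilter {
        private final BitSet bits;
        private final int m;  // number of bits

        BloomFilter(int m) { this.m = m; this.bits = new BitSet(m); }

        private int h1(String s) { return Math.floorMod(s.hashCode(), m); }
        private int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, m); }

        void add(String sig) { bits.set(h1(sig)); bits.set(h2(sig)); }

        boolean mayContain(String sig) {
            return bits.get(h1(sig)) && bits.get(h2(sig));
        }
    }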
Mercator document statistics
PAGE TYPE      PERCENT
text/html      69.2%
image/gif      17.9%
image/jpeg      8.1%
text/plain      1.5%
pdf             0.9%
audio           0.4%
zip             0.4%
postscript      0.3%
other           1.4%
[Figure: histogram of document sizes over 60M pages]
Further considerations
May want to prioritize certain pages as being
most worth crawling
Focused crawling tries to prioritize based on relevance
May need to refresh certain pages more often
Plan for today
Basic crawling
Mercator
Publish/subscribe   [NEXT]
XFilter
The publish/subscribe model
Each publisher produces events
Example: Web page update, stock quote, announcement, ...
Each subscriber wants a subset of the events
But usually not all
How do we implement this efficiently?
[Diagram: publishers produce events; subscribers register interests; the system must match each event against the subscribers' interests.]
Example: RSS
Web server publishes XML file with events
Clients periodically request the file to see if
there are new events
Is this a good solution?
Interest-based crawling
Suppose we want to crawl XML documents
based on user interests
We need several parts:
A list of interests expressed in an executable form,
perhaps XPath queries
A crawler goes out and fetches XML content
A filter / routing engine matches XML content against
users' interests, sends them the content if it matches
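A naive version of the filter using the JDK's built-in XPath engine; fine for a handful of interests, but it builds a DOM and evaluates each query separately, which is exactly what XFilter (next) avoids:

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.*;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class NaiveFilter {
        // True if the document contains at least one node matching the XPath.
        static boolean matches(String xml, String xpath) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPathExpression expr =
                XPathFactory.newInstance().newXPath().compile(xpath);
            return (Boolean) expr.evaluate(doc, XPathConstants.BOOLEAN);
        }

        public static void main(String[] args) throws Exception {
            String doc = "<politics topic='president'><usa><body/></usa></politics>";
            System.out.println(matches(doc, "/politics/usa//body"));  // true
        }
    }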
Plan for today
Basic crawling
Mercator
Publish/subscribe
XFilter   [NEXT]
XML-Based information dissemination
Basic model (XFilter, YFilter, Xyleme):
Users are interested in data relating to a particular topic, and
know the schema
/politics/usa//body
A crawler-aggregator reads XML files from the web (or gets
them from data sources) and feeds them to interested
parties
XPath (here used to match documents, not nodes)
Engine for XFilter [Altinel & Franklin 00]
[Figure: architecture of the XFilter filtering engine]
How does it work?
Each XPath segment is basically a subset of
regular expressions over element tags
Convert into finite state automata
Parse data as it comes in: use the SAX API (see skeleton below)
Match against finite state machines
Most of these systems use modified FSMs
because they want to match many patterns
at the same time
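The SAX side is just an event handler; a skeleton showing where the FSM transitions would hook in (the actual Query Index logic is elided):

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class StreamingMatcher extends DefaultHandler {
        private int depth = 0;  // current depth in the document tree

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            depth++;
            // XFilter would look up qName in its Query Index here and try
            // to advance every FSM waiting for this tag at this depth.
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            // ... and undo the transitions made at this depth here.
            depth--;
        }

        public static void main(String[] args) throws Exception {
            String xml = "<politics topic='president'><usa><body/></usa></politics>";
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes()), new StreamingMatcher());
        }
    }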
Path nodes and FSMs
XPath parser decomposes XPath expressions into a set of
path nodes
These nodes act as the states of corresponding FSM
A node in the Candidate List denotes the current state
The rest of the states are in corresponding Wait Lists
Simple FSM for /politics[@topic="president"]/usa//body:
(start) -politics-> Q1_1 -usa-> Q1_2 -body-> Q1_3
Decomposing into path nodes
Query ID
Position in state machine
Relative Position (RP) in tree:
0 for the root node if it's not preceded by //
-1 for any node preceded by //
Else 1 + (number of * nodes since the predecessor node)
Level:
If the current node has a fixed distance from the root, then 1 + distance
Else -1 if RP = -1, else 0
Finally, NextPathNodeSet points to the next path node(s)
Q1 = /politics[@topic="president"]/usa//body

          Q1-1   Q1-2   Q1-3
Position   1      2      3
RP         0      1     -1
Level      1      2     -1

Q2 = //usa/*/body/p

          Q2-1   Q2-2   Q2-3
Position   1      2      3
RP        -1      2      1
Level     -1      0      0
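As a data structure, one path node per FSM state might look like this (field names follow the slide; the decomposing parser itself is elided):

    // One state of a query's FSM, as produced by XPath decomposition.
    // Sample values are for Q1 = /politics[@topic="president"]/usa//body.
    public class PathNode {
        String queryId;    // e.g., "Q1"
        int position;      // position in the state machine: 1, 2, 3, ...
        int relativePos;   // RP: 0 (root, no //), -1 (after //), or 1 + #wildcards
        int level;         // fixed depth from root, or -1 / 0 if not fixed
        PathNode next;     // NextPathNodeSet (a single successor here)

        PathNode(String queryId, int position, int relativePos, int level) {
            this.queryId = queryId;
            this.position = position;
            this.relativePos = relativePos;
            this.level = level;
        }

        static PathNode[] q1() {
            return new PathNode[] {
                new PathNode("Q1", 1, 0, 1),    // politics
                new PathNode("Q1", 2, 1, 2),    // usa
                new PathNode("Q1", 3, -1, -1),  // body (preceded by //)
            };
        }
    }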
Query index
Query index entry
for each XML tag
Two lists: Candidate
List (CL) and Wait
List (WL) divided
across the nodes
Live queries' states are in CL;
pending queries' states are in WL
Events that cause
state transition are
generated by the
XML parser
Example for Q1 and Q2 above:

Element     CL        WL
politics    Q1-1      -
usa         Q2-1      Q1-2
body        -         Q1-3, Q2-2
p           -         Q2-3
Encountering an element
Look up the element name in the Query
Index and examine all nodes in the associated CL
Validate that we actually have a match
Example: startElement("politics")
The Query Index entry for "politics" has Q1-1 in its CL.
Path node Q1-1: Query ID = Q1, Position = 1, Rel. Position = 0, Level = 1;
its NextPathNodeSet points to Q1-2.
Validating a match
We first check that the current XML depth
matches the level in the user query:
If the level in the CL node is less than 1, then ignore the depth
else the level in the CL node must equal the current depth
This ensures we're matching at the right point in the tree!
Finally, we validate any predicates against
attributes (e.g., [@topic="president"])
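As a code fragment, the depth check is one line (nodeLevel is the path node's Level field, currentDepth the parser's element depth):

    // level < 1 means the node may occur at any depth (it follows a //
    // or an unanchored wildcard); otherwise the levels must agree exactly.
    static boolean levelMatches(int nodeLevel, int currentDepth) {
        return nodeLevel < 1 || nodeLevel == currentDepth;
    }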
Processing further elements
Queries that don't meet validation are
removed from the Candidate Lists
For other queries, we advance to the next
state
We copy the next node of the query from the WL to the CL,
and update the RP and level
When we reach a final state (e.g., Q1-3), we can output the
document to the subscriber
When we encounter an end element, we
must remove that element from the CL
A simpler approach
Instantiate a DOM tree for each document
Traverse and recursively match XPaths
Pros and cons?
Recap: Publish-subscribe model
Publish-subscribe model
Publishers produce events
Each subscriber is interested in a subset of the events
Challenge: Efficient implementation
Comparison: XFilter vs RSS
XFilter
Interests are specified with XPaths (very powerful!)
Sophisticated technique for efficiently matching documents
against many XPaths in parallel