CIS 455/555: Internet and Web Systems
Crawling and Publish/Subscribe
February 15, 2012
A. Haeberlen, Z. Ives
University of Pennsylvania
Plan for today
Basic crawling   [NEXT]
Mercator
Publish/subscribe
XFilter
Motivation
Suppose you want to build a search engine
Need a large corpus of web pages
How can we find these pages?
Idea: crawl the web
What else can you crawl?
For example, social network
Crawling: The basic process
What state do we need?
Q := Queue of URLs to visit
P := Set of pages already crawled
Basic process:
1. Initialize Q with a set of seed URLs
2. Pick the first URL from Q and download the corresponding page
3. Extract all URLs from the page (<base href> tag, anchor links, CSS, DTDs,
scripts, optionally image links)
4. Append to Q any URLs that a) meet our criteria, and b) are not already in P
5. If Q is not empty, repeat from step 2
Can one machine crawl the entire web?
Of course not! Need to distribute crawling across many machines.
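A minimal single-machine sketch of this loop in Java (download, extractUrls, and meetsCriteria are hypothetical helpers standing in for a real HTTP client, HTML parser, and crawl policy):

    import java.util.*;

    public class BasicCrawler {
        public static void crawl(List<String> seeds) throws Exception {
            Deque<String> q = new ArrayDeque<>(seeds);  // Q: URLs to visit
            Set<String> p = new HashSet<>(seeds);       // P: URLs already seen

            while (!q.isEmpty()) {                      // step 5
                String url = q.removeFirst();           // step 2: pick first URL
                String page = download(url);            //         and fetch it
                for (String link : extractUrls(page)) { // step 3: extract URLs
                    // step 4: enqueue if it meets our criteria and is new
                    if (meetsCriteria(link) && p.add(link)) {
                        q.addLast(link);
                    }
                }
            }
        }

        // Hypothetical helpers -- a real crawler plugs in an HTTP client,
        // an HTML parser, and its crawl policy here:
        static String download(String url) { return ""; }
        static List<String> extractUrls(String page) { return List.of(); }
        static boolean meetsCriteria(String url) { return true; }
    }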
Crawling visualized
[Figure: the web as seen by a crawler - seed URLs; URLs already crawled; pages currently being crawled; the "frontier" of URLs that will eventually be crawled; URLs that do not fit our criteria (too deep, etc.); and the unseen web beyond.]
Crawling complications
What order to traverse in?
Polite to do BFS - why?
Malicious pages
Spam pages / SEO
Spider traps (incl. dynamically generated ones)
General messiness
Cycles
Varying latency and bandwidth to remote servers
Site mirrors, duplicate pages, aliases
Web masters' stipulations
How deep to crawl? How often to crawl?
Continuous crawling; freshness
Need to be robust!
Normalization; eliminating duplicates
Some of the extracted URLs are relative URLs
Example: /~ahae/papers/ from www.cis.upenn.edu
Normalize it: http://www.cis.upenn.edu/~ahae/papers
Duplication is widespread on the web
If the fetched page is already in the index, do not process it
Can verify using document fingerprint (hash) or shingles
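A sketch of both steps in Java: java.net.URI resolves the relative URL against its base page, and an MD5 digest serves as a simple document fingerprint (the base page URL here is made up for illustration):

    import java.math.BigInteger;
    import java.net.URI;
    import java.security.MessageDigest;

    public class Normalize {
        public static void main(String[] args) throws Exception {
            // Resolve a relative URL against the page it was found on
            URI base = URI.create("http://www.cis.upenn.edu/index.html");
            URI abs = base.resolve("/~ahae/papers/").normalize();
            System.out.println(abs);  // http://www.cis.upenn.edu/~ahae/papers/

            // Fingerprint the fetched content to detect exact duplicates
            byte[] content = "<html>...</html>".getBytes();
            byte[] md5 = MessageDigest.getInstance("MD5").digest(content);
            System.out.println(String.format("%032x", new BigInteger(1, md5)));
        }
    }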
Crawler etiquette
Explicit politeness
Look for meta tags; for example, ignore pages that have
<META NAME="ROBOTS" CONTENT="NOINDEX">
Implement the robot exclusion protocol; for example, look
for, and respect, robots.txt
Implicit politeness
Even if no explicit specifications are present, do not hit the
same web site too often
Robots.txt
What should be in robots.txt?
See http://www.robotstxt.org/wc/robots.html
To exclude all robots from a server:
User-agent: *
Disallow: /
To exclude one robot from two directories:
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
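A toy Java checker for rules like the ones above; it handles only User-agent and Disallow prefix rules, whereas real robots.txt files can also contain Allow, Crawl-delay, and more:

    public class RobotsTxt {
        // Returns true if 'agent' may fetch 'path' under these rules.
        static boolean allowed(String robotsTxt, String agent, String path) {
            boolean applies = false;
            for (String line : robotsTxt.split("\n")) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    String ua = line.substring(11).trim();
                    applies = ua.equals("*") || ua.equalsIgnoreCase(agent);
                } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                    String prefix = line.substring(9).trim();
                    // An empty Disallow value means "everything is allowed"
                    if (!prefix.isEmpty() && path.startsWith(prefix)) return false;
                }
            }
            return true;
        }

        public static void main(String[] args) {
            String rules = "User-agent: BobsCrawler\nDisallow: /news/\nDisallow: /tmp/";
            System.out.println(allowed(rules, "BobsCrawler", "/news/today")); // false
            System.out.println(allowed(rules, "BobsCrawler", "/index.html")); // true
        }
    }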
Recap: Crawling
How does the basic process work?
What are some of the main challenges?
Duplicate elimination
Politeness
Malicious pages / spider traps
Normalization
Scalability
Plan for today
Basic crawling
Mercator   [NEXT]
Publish/subscribe
XFilter
Mercator: A scalable web crawler
Written entirely in Java
Expands a URL frontier
Avoids re-crawling same URLs
Also considers whether a document has been
seen before
Same content, different URL [when might this occur?]
Every document has signature/checksum info computed as
it's crawled
Despite the name, it does not actually scale
to a large number of nodes
But it would not be too difficult to parallelize
Heydon and Najork: Mercator, a scalable, extensible web crawler (WWW '99)
Mercator architecture
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream
(RIS)
4. Check against fingerprints to
verify it's new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated
(compare its hash)
8. Enqueue URL
Source: Mercator paper
Mercator's polite frontier queues
Tries to go beyond breadth-first approach
Goal is to have only one crawler thread per server
What does this mean for the load caused by Mercator?
Distributed URL frontier queue:
One subqueue per worker thread
The worker thread is determined by hashing the hostname
of the URL
Thus, only one outstanding request per web server
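The assignment can be as simple as hashing the URL's hostname modulo the number of worker threads; a minimal sketch (not Mercator's actual code):

    import java.net.URI;

    public class FrontierRouter {
        private final int numWorkers;

        FrontierRouter(int numWorkers) { this.numWorkers = numWorkers; }

        // All URLs from the same host map to the same subqueue, and each
        // subqueue is served by one thread -- so there is at most one
        // outstanding request per web server.
        int subqueueFor(String url) {
            String host = URI.create(url).getHost();
            return Math.floorMod(host.hashCode(), numWorkers);
        }

        public static void main(String[] args) {
            FrontierRouter r = new FrontierRouter(8);
            System.out.println(r.subqueueFor("http://example.com/a"));  // same
            System.out.println(r.subqueueFor("http://example.com/b"));  // subqueue
        }
    }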
Mercator's HTTP fetcher
First, needs to ensure robots.txt is followed
Caches the contents of robots.txt for various web sites as it
crawls them
Designed to be extensible to other protocols
Had to write their own HTTP requestor in Java
(their Java version didn't have timeouts)
Today, can use setSoTimeout() (see sketch below)
Could use Java non-blocking I/O:
http://www.owlmountain.com/tutorials/NonBlockingIo.htm
But they use multiple threads and synchronous I/O
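For instance, with the standard HttpURLConnection both the connect and each read can be bounded (a simple sketch; Mercator predates these APIs):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class Fetcher {
        static byte[] fetch(String url) throws Exception {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5000);  // give up if connecting takes > 5s
            conn.setReadTimeout(5000);     // ... or if any read stalls for > 5s
            conn.setRequestProperty("User-Agent", "cis455-crawler");
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes();
            } finally {
                conn.disconnect();
            }
        }
    }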
Other caveats
Infinitely long URL names (good way to get a
buffer overflow!)
Aliased host names
Alternative paths to the same host
Can catch most of these with signatures of
document data (e.g., MD5)
Comparison to Bloom filters (see the sketch below)
Crawler traps (e.g., CGI scripts that link to
themselves using a different name)
May need a way for a human to override certain URL
paths (see Section 5 of the paper)
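To make the Bloom filter comparison concrete: an exact set of MD5 signatures never errs but must store every signature, while a Bloom filter is far more compact but can return false positives (wrongly skipping a new document). A toy two-hash Bloom filter:

    import java.util.BitSet;

    // Toy Bloom filter with two derived hash functions. No false
    // negatives; false positives possible. A HashSet<String> of MD5
    // signatures gives exact answers but stores every signature.
    public class BloomFilter {
        private final BitSet bits;
        private final int m;  // number of bits

        BloomFilter(int m) { this.m = m; this.bits = new BitSet(m); }

        private int h1(String s) { return Math.floorMod(s.hashCode(), m); }
        private int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, m); }

        void add(String sig) { bits.set(h1(sig)); bits.set(h2(sig)); }

        boolean mayContain(String sig) {
            return bits.get(h1(sig)) && bits.get(h2(sig));
        }
    }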
Mercator document statistics
PAGE TYPE      PERCENT
text/html      69.2%
image/gif      17.9%
image/jpeg      8.1%
text/plain      1.5%
pdf             0.9%
audio           0.4%
zip             0.4%
postscript      0.3%
other           1.4%
[Figure: histogram of document sizes over 60M pages]
Further considerations
May want to prioritize certain pages as being
most worth crawling
Focused crawling tries to prioritize based on relevance
May need to refresh certain pages more often
Plan for today
Basic crawling
Mercator
Publish/subscribe   [NEXT]
XFilter
The publish/subscribe model
Each publisher produces events
Example: Web page update, stock quote, announcement, ...
Each subscriber wants a subset of the events
But usually not all
How do we implement this efficiently?
[Diagram: publishers produce events; subscribers register interests; the system must match each event against the subscribers' interests.]
Example: RSS
Web server publishes XML file with events
Clients periodically request the file to see if
there are new events
Is this a good solution?
Interest-based crawling
Suppose we want to crawl XML documents
based on user interests
We need several parts:
A list of interests expressed in an executable form,
perhaps XPath queries
A crawler goes out and fetches XML content
A filter / routing engine matches XML content against
users' interests, sends them the content if it matches
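A naive version of the filter using the JDK's built-in XPath engine; fine for a handful of interests, but it builds a DOM and evaluates each query separately, which is exactly what XFilter (next) avoids:

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.*;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class NaiveFilter {
        // True if the document contains at least one node matching the XPath.
        static boolean matches(String xml, String xpath) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPathExpression expr =
                XPathFactory.newInstance().newXPath().compile(xpath);
            return (Boolean) expr.evaluate(doc, XPathConstants.BOOLEAN);
        }

        public static void main(String[] args) throws Exception {
            String doc = "<politics topic='president'><usa><body/></usa></politics>";
            System.out.println(matches(doc, "/politics/usa//body"));  // true
        }
    }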
Plan for today
Basic crawling
Mercator
Publish/subscribe
XFilter   [NEXT]
XML-Based information dissemination
Basic model (XFilter, YFilter, Xyleme):
Users are interested in data relating to a particular topic, and
know the schema
/politics/usa//body
A crawler-aggregator reads XML files from the web (or gets
them from data sources) and feeds them to interested
parties
XPath (here used to match documents, not nodes)
Engine for XFilter [Altinel & Franklin 00]
[Figure: architecture of the XFilter filtering engine]
How does it work?
Each XPath segment is basically a subset of
regular expressions over element tags
Convert into finite state automata
Parse data as it comes in: use the SAX API (see skeleton below)
Match against finite state machines
Most of these systems use modified FSMs
because they want to match many patterns
at the same time
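The SAX side is just an event handler; a skeleton showing where the FSM transitions would hook in (the actual Query Index logic is elided):

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class StreamingMatcher extends DefaultHandler {
        private int depth = 0;  // current depth in the document tree

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            depth++;
            // XFilter would look up qName in its Query Index here and try
            // to advance every FSM waiting for this tag at this depth.
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            // ... and undo the transitions made at this depth here.
            depth--;
        }

        public static void main(String[] args) throws Exception {
            String xml = "<politics topic='president'><usa><body/></usa></politics>";
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes()), new StreamingMatcher());
        }
    }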
Path nodes and FSMs
XPath parser decomposes XPath expressions into a set of
path nodes
These nodes act as the states of corresponding FSM
A node in the Candidate List denotes the current state
The rest of the states are in corresponding Wait Lists
Simple FSM for /politics[@topic="president"]/usa//body:
(start) -politics-> Q1_1 -usa-> Q1_2 -body-> Q1_3
Decomposing into path nodes
Query ID
Position in state machine
Relative Position (RP) in tree:
0 for the root node if it's not preceded by //
-1 for any node preceded by //
Else 1 + (number of * nodes since the predecessor node)
Level:
If the current node has a fixed distance from the root, then 1 + distance
Else -1 if RP = -1, else 0
Finally, NextPathNodeSet points to the next path node(s)
Q1 = /politics[@topic="president"]/usa//body

          Q1-1   Q1-2   Q1-3
Position   1      2      3
RP         0      1     -1
Level      1      2     -1

Q2 = //usa/*/body/p

          Q2-1   Q2-2   Q2-3
Position   1      2      3
RP        -1      2      1
Level     -1      0      0
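As a data structure, one path node per FSM state might look like this (field names follow the slide; the decomposing parser itself is elided):

    // One state of a query's FSM, as produced by XPath decomposition.
    // Sample values are for Q1 = /politics[@topic="president"]/usa//body.
    public class PathNode {
        String queryId;    // e.g., "Q1"
        int position;      // position in the state machine: 1, 2, 3, ...
        int relativePos;   // RP: 0 (root, no //), -1 (after //), or 1 + #wildcards
        int level;         // fixed depth from root, or -1 / 0 if not fixed
        PathNode next;     // NextPathNodeSet (a single successor here)

        PathNode(String queryId, int position, int relativePos, int level) {
            this.queryId = queryId;
            this.position = position;
            this.relativePos = relativePos;
            this.level = level;
        }

        static PathNode[] q1() {
            return new PathNode[] {
                new PathNode("Q1", 1, 0, 1),    // politics
                new PathNode("Q1", 2, 1, 2),    // usa
                new PathNode("Q1", 3, -1, -1),  // body (preceded by //)
            };
        }
    }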
Query index
Query index entry
for each XML tag
Two lists: Candidate
List (CL) and Wait
List (WL) divided
across the nodes
Live queries' states are in CL;
pending queries' states are in WL
Events that cause
state transition are
generated by the
XML parser
Example for Q1 and Q2 above:

Element     CL        WL
politics    Q1-1      -
usa         Q2-1      Q1-2
body        -         Q1-3, Q2-2
p           -         Q2-3
Encountering an element
Look up the element name in the Query
Index and examine all nodes in the associated CL
Validate that we actually have a match
Example: startElement("politics")
The Query Index entry for "politics" has Q1-1 in its CL.
Path node Q1-1: Query ID = Q1, Position = 1, Rel. Position = 0, Level = 1;
its NextPathNodeSet points to Q1-2.
Validating a match
We first check that the current XML depth
matches the level in the user query:
If the level in the CL node is less than 1, then ignore the depth
else the level in the CL node must equal the current depth
This ensures we're matching at the right point in the tree!
Finally, we validate any predicates against
attributes (e.g., [@topic="president"])
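As a code fragment, the depth check is one line (nodeLevel is the path node's Level field, currentDepth the parser's element depth):

    // level < 1 means the node may occur at any depth (it follows a //
    // or an unanchored wildcard); otherwise the levels must agree exactly.
    static boolean levelMatches(int nodeLevel, int currentDepth) {
        return nodeLevel < 1 || nodeLevel == currentDepth;
    }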
Processing further elements
Queries that don't meet validation are
removed from the Candidate Lists
For other queries, we advance to the next
state
We copy the next node of the query from the WL to the CL,
and update the RP and level
When we reach a final state (e.g., Q1-3), we can output the
document to the subscriber
When we encounter an end element, we
must remove that element from the CL
A simpler approach
Instantiate a DOM tree for each document
Traverse and recursively match XPaths
Pros and cons?
Recap: Publish-subscribe model
Publish-subscribe model
Publishers produce events
Each subscriber is interested in a subset of the events
Challenge: Efficient implementation
Comparison: XFilter vs RSS
XFilter
Interests are specified with XPaths (very powerful!)
Sophisticated technique for efficiently matching documents
against many XPaths in parallel