IR on Web Search Engines: Reference Slides Taken from Dr. Haddawy's Material
Search Engines
[Figure: the web as a collection of sites (Site 1, Site 2, Site 3, Site 5), each containing several pages (Page 1 to Page 5)]
Search Engine Architecture
● Spider
– Crawls the web to find pages. Follows hyperlinks
● Indexer
– Produces data structures for fast searching of all words in the pages
● Retriever
– Query interface
– Database lookup to find hits
  ● 2 billion documents
  ● 4 TB RAM, many terabytes of disk
– Ranking
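The indexer and retriever roles above can be illustrated with a minimal Python sketch. It assumes the spider has already collected a few pages; the page ids, page texts, and function names are hypothetical, and a real engine would use far more elaborate data structures.

```python
from collections import defaultdict

def build_index(pages):
    """Indexer: map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def retrieve(index, query):
    """Retriever: return the ids of pages containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical pages that a spider might have collected.
pages = {
    "p1": "web search engines crawl the web",
    "p2": "an indexer builds data structures for search",
    "p3": "the retriever answers search queries",
}
index = build_index(pages)
hits = retrieve(index, "search web")  # only p1 contains both words
```

Ranking is left out here; the sketch only shows how a lookup structure built once by the indexer lets the retriever find hits without rescanning the pages.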
Typical Search Engine Architecture
[Figure: architecture diagram showing the Web, Crawlers, Page Repository, Indexer, Indexes (Structure, Text), Query Engine, Ranking, and the User (Queries, Results)]
Manpower and Hardware:
Google
85 people
50% technical, 14 Ph.D. in Computer Science
Equipment
2,500 Linux machines
80 terabytes of spinning disks
30 new machines installed daily
● Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
Web Crawlers
● Start with an initial page P0. Find URLs on P0 and add
them to a queue.
● When done with P0, pass it to an indexing program,
get a page P1 from the queue and repeat.
● Issues
– Which page to look at next?
– Avoid overloading a site
– How deep within a site to go (drill-down)?
– How frequently to visit pages?
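The queue-based procedure described above can be sketched in a few lines of Python. This is a minimal sketch, not a production crawler: the toy web is an in-memory dictionary, `get_links` and `index_page` are hypothetical callbacks standing in for fetching and indexing, and `max_pages` is an assumed, crude answer to the "how deep" question. Politeness delays and revisit scheduling are not modeled.

```python
from collections import deque

def crawl(start_url, get_links, index_page, max_pages=100):
    """Visit pages starting from start_url, following links via a queue."""
    queue = deque([start_url])
    seen = {start_url}          # URLs already queued, to avoid repeats
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        for link in get_links(url):   # find URLs on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
        index_page(url)               # pass the finished page to the indexer
        visited.append(url)
    return visited

# Hypothetical toy web: each page maps to the pages it links to.
toy_web = {"P0": ["P1", "P2"], "P1": ["P2", "P3"], "P2": ["P0"], "P3": []}
indexed = []
order = crawl("P0", lambda url: toy_web.get(url, []), indexed.append)
```

Because the queue is first-in, first-out, this sketch visits pages in breadth-first order; swapping the queue for a stack or a priority queue changes which page is looked at next.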
Page Visit Order
● Animated examples of breadth-first vs depth-first search on trees:
– http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
[Figure: structure to be traversed]
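To make the difference in visit order concrete, here is a small Python sketch that traverses the same tree breadth-first and depth-first. The tree and its node names are hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical tree: each node maps to its children, left to right.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}

def breadth_first(tree, root):
    """Visit level by level, using a FIFO queue."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

def depth_first(tree, root):
    """Follow one branch to the bottom before backing up, using a stack."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(tree[node]))
    return order

bfs_order = breadth_first(tree, "A")  # A B C D E F
dfs_order = depth_first(tree, "A")    # A B D E C F
```

For a crawler, breadth-first order spreads visits across many pages near the start, while depth-first order drills down within one branch first.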
Indexing
• Arrangement of data to permit fast searching
• Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
• Sorting helps in searching
– You probably use this when looking something up in the
telephone book or dictionary. For instance, "cold fusion" is
probably near the front, so you open maybe 1/4 of the way in.
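The benefit of sorting can be shown with the two word lists above. In this Python sketch, the unsorted list needs a left-to-right scan, while the sorted list supports binary search through the standard `bisect` module; the function names are hypothetical.

```python
from bisect import bisect_left

def linear_search(items, target):
    """Scan left to right; return (found, number of items examined)."""
    for count, item in enumerate(items, start=1):
        if item == target:
            return True, count
    return False, len(items)

def binary_search(sorted_items, target):
    """Repeatedly halve the search range; requires sorted input."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

unsorted_words = "sow fox pig eel yak hen ant cat dog hog".split()
sorted_words = "ant cat dog eel fox hen hog pig sow yak".split()

found, examined = linear_search(unsorted_words, "hog")  # examines all 10 words
in_sorted = binary_search(sorted_words, "hog")
```

Binary search examines roughly log2(n) entries rather than up to n, which is why an index keeps its entries in a searchable order.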
Inverted Files
● A file is a list of words by position
– First entry is the word in position 1 (first word)
– Entry 4562 is the word in position 4562 (4562nd word)
[Figure: a FILE drawn as a column of positions (POS 1, 10, 20, 30, ...); remaining numeric labels: 677 1 481 and 713 3 42 312 802]
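One common reading of an inverted file is the reverse mapping: from each word to the positions where it occurs. The Python sketch below builds such a mapping under that assumption; the sample text and the function name `invert` are hypothetical.

```python
from collections import defaultdict

def invert(words):
    """Map each word to the list of 1-based positions where it occurs."""
    inverted = defaultdict(list)
    for position, word in enumerate(words, start=1):
        inverted[word].append(position)
    return dict(inverted)

# The file: a list of words by position (position 1 is the first word).
file_words = "the cat saw the dog and the dog saw the cat".split()
inverted = invert(file_words)
dog_positions = inverted["dog"]  # [5, 8]
```

With the inverted form, finding every occurrence of a word is a single lookup instead of a scan over the whole file.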
Ranking (Scoring) Hits
● Hits must be presented in some order
● What order?
● Apply recursively: Quality of a page is related to
– its in-degree, and to
– the quality of pages linking to it
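The first of the two criteria, in-degree, is easy to compute directly. The Python sketch below counts incoming links on a hypothetical link graph and orders hits by that count. It deliberately ignores the second, recursive criterion (the quality of the linking pages).

```python
from collections import Counter

def in_degrees(links):
    """Count, for each page, how many distinct pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # count each linking page once
            counts[target] += 1
    return counts

# Hypothetical link graph: page -> pages it links to.
links = {"A": ["C"], "B": ["C", "A"], "C": ["A"], "D": ["C"]}
degree = in_degrees(links)

hits = ["A", "B", "C", "D"]
# Highest in-degree first; ties broken alphabetically.
ranked = sorted(hits, key=lambda page: (-degree[page], page))
```

Here C (three incoming links) ranks above A (two), with B and D last. A pure in-degree count treats every link as equally valuable, which is the gap the recursive definition addresses.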
PageRank Algorithm (Brin & Page, 1998)
SOURCE: GOOGLE
PageRank
● Consider the following infinite random walk (surfing):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
  ● to a randomly chosen web page with probability d
  ● to a randomly chosen successor of the current page with probability 1-d
PageRank Formula
$$\mathrm{PageRank}(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in E} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$
● Google uses a damping factor of about 0.85 for following links, which corresponds to d ≈ 0.15 in this formula (1-d ≈ 0.85)
● PageRank is a probability distribution over web pages
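The formula can be evaluated by repeated substitution: start from a uniform distribution and apply the formula until the values settle. The Python sketch below does this under simplifying assumptions: every page has at least one outgoing link, no page lists the same successor twice, and d is the random-jump probability as in the formula. The toy graph, the function name, and the fixed iteration count are hypothetical.

```python
def pagerank(links, d=0.15, iterations=100):
    """Iterate PageRank(p) = d/n + (1-d) * sum over links (q, p) of
    PageRank(q) / outdegree(q).

    Assumes every page has at least one outgoing link and that all
    link targets are keys of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}  # random-jump term d/n
        for q, successors in links.items():
            share = (1 - d) * rank[q] / len(successors)
            for p in successors:
                new_rank[p] += share          # q passes an equal share to each successor
        rank = new_rank
    return rank

# Hypothetical three-page web: page -> pages it links to.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
total = sum(ranks.values())  # stays at 1, up to rounding
```

In this toy graph C ends up with the highest value because it receives links from both A and B, and A follows closely because it receives all of C's rank.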
PageRank Example
[Figure: pages A and B both link to page P; A has 4 outgoing links and B has 3]
PageRank of P is
(1-d) * [(PageRank of A)/4 + (PageRank of B)/3] + d/n
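To see the arithmetic, the expression for P can be evaluated with sample numbers. The values used below (PageRank of A = 0.2, PageRank of B = 0.3, d = 0.15, n = 10) are assumptions chosen purely for illustration and do not come from the slides.

```python
def pagerank_of_p(rank_a, rank_b, d, n):
    """Evaluate (1-d) * [rank_a/4 + rank_b/3] + d/n for the example page P."""
    return (1 - d) * (rank_a / 4 + rank_b / 3) + d / n

# Assumed sample values, for illustration only.
value = pagerank_of_p(rank_a=0.2, rank_b=0.3, d=0.15, n=10)
# 0.85 * (0.05 + 0.10) + 0.015 = 0.1425
```

A contributes a quarter of its rank and B a third, because each page divides its rank evenly among its outgoing links.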
Robot Exclusion
● You may not want certain pages indexed but still want them viewable by browsers, so you can't simply protect the directory.
● Some crawlers conform to the Robot Exclusion Protocol. Compliance is voluntary. One way to enforce: firewall
● They look for the file robots.txt at the highest directory level in the domain. If the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
● A specific document can be shielded from a crawler by adding the line: <META NAME="ROBOTS" CONTENT="NOINDEX">
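A compliant crawler checks robots.txt before fetching a page. The Python sketch below uses the standard library's `urllib.robotparser` on an in-memory robots.txt, so nothing is downloaded. The rules, the paths, and the crawler name "ExampleBot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents of www.ecom.cmu.edu/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse the rules without any network access

def allowed(url, agent="ExampleBot"):
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

public_ok = allowed("http://www.ecom.cmu.edu/index.html")
private_ok = allowed("http://www.ecom.cmu.edu/private/notes.html")
```

Since compliance is voluntary, this check only restrains crawlers that choose to run it; it does not stop a browser or a non-compliant robot from requesting the page.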