How A Search Engine Works
How A Search Engine Works
-N.Sai Sumanth
3rd year cseB
08BQ1A0560
How it happens?
How a search engine works
A search engine has three main parts
1.Gather the content of all web pages(using a
program called crawler or spider)
syntax:
<a href=www.google.com>google</a>
used for redirecting from one page to other
How to find web pages visit and copy
• Can start with a list of domain names , visit the
home pages there
• Look at the hyper link on the home page and
follow those links to more pages use http
commands to get the pages
• Keep a list of urls visited and those still to be
visited
• Each time the program loads in a new html page
add the links in that page to the list to be crawled
Four laws of crawling
• A crawler must show identification
• A crawler must obey robots exclusion standard
• A crawler must not hog resources
• A crawler must report errors –rarely used
Robots exclusion standard
Compute-Prefix-Function (p)
1 m length[p] //’p’ pattern to be matched
2 Π[1] 0
3 k0
4 for q 2 to m
5 do while k > 0 and p[k+1] != p[q]
6 do k Π[k]
7 If p[k+1] = p[q]
8 then k k +1
9 Π[q] k
10 return Π
Example: compute Π for the pattern ‘p’ below:
p a b a b a c a
Initially: m = length[p] = 7
Π[1] = 0
k=0
q 1 2 3 4 5 6 7
Step 1: q = 2, k=0
Π[2] = 0 p a b a b a c a
Π 0 0
Step 2: q = 3, k = 0, q 1 2 3 4 5 6 7
Π[3] = 1 p a b a b a c a
Π 0 0 1
Step 3: q = 4, k = 1 q 1 2 3 4 5 6 7
Π[4] = 2
p a b a b a c A
Π 0 0 1 2
Step 4: q = 5, k =2 q 1 2 3 4 5 6 7
Π[5] = 3 p a b a b a c a
Π 0 0 1 2 3
q 1 2 3 4 5 6 7
Step 5: q = 6, k = 3
Π[6] = 1 p a b a b a c a
Π 0 0 1 2 3 1
q 1 2 3 4 5 6 7
Step 6: q = 7, k = 1 p a b a b a c a
Π[7] = 1
Π 0 0 1 2 3 1 1
Note: KMP finds every occurrence of a ‘p’ in ‘S’. That is why KMP does not terminate in step 12, rather it
searches remainder of ‘S’ for any more occurrences of ‘p’.
Illustration: given a String ‘S’ and pattern ‘p’ as follows:
S
b a c b a b a b a b a c a c a
p a b a b a c a
Let us execute the KMP algorithm to find whether
‘p’ occurs in ‘S’.
For ‘p’ the prefix function, Π was computed previously and is as follows:
q 1 2 3 4 5 6 7
p a b A b a c a
Π 0 0 1 2 3 1 1
Initially: n = size of S = 15;
m = size of p = 7
Step 1: i = 1, q = 0
comparing p[1] with S[1]
S b a c b a b a b a b a c a a b
p a b a b a c a
P[1] does not match with S[1]. ‘p’ will be shifted one position to the right.
Step 2: i = 2, q = 0
comparing p[1] with S[2]
S b a c b a b a b a b a c a a b
p a b a b a c a
P[1] matches S[2]. Since there is a match, p is not shifted.
Step 3: i = 3, q = 1
Comparing p[2] with S[3] p[2] does not match with S[3]
S b a c b a b a b a b a c a a b
p a b a b a c a
Backtracking on p, comparing p[1] and S[3]
Step 4: i = 4, q = 0
comparing p[1] with S[4] p[1] does not match with S[4]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 5: i = 5, q = 0
comparing p[1] with S[5] p[1] matches with S[5]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 6: i = 6, q = 1
Comparing p[2] with S[6] p[2] matches with S[6]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 7: i = 7, q = 2
Comparing p[3] with S[7] p[3] matches with S[7]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 8: i = 8, q = 3
Comparing p[4] with S[8] p[4] matches with S[8]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 9: i = 9, q = 4
Comparing p[5] with S[9] p[5] matches with S[9]
S b a c b a b a b a b a c a a b
p a b a b a c a
S b a c b a b a b a b a c a a b
p a b a b a c a
Backtracking on p, comparing p[4] with S[10] because after mismatch q = Π[5] = 3
p a b a b a c a
S b a c b a b a b a b a c a a b
p a b a b a c a
Pattern ‘p’ has been found to completely occur in string ‘S’. The total number of shifts
that took place for the match to be found are: i – m = 13 – 7 = 6 shifts.
Running - time analysis
• Compute-Prefix-Function (Π) • KMP Matcher
1 m length[p] //’p’ pattern to be matched 1 n length[S]
2 Π[1] 0 2 m length[p]
3 k0 3 Π Compute-Prefix-Function(p)
4 for q 2 to m 4q0
5 do while k > 0 and p[k+1] != p[q] 5 for i 1 to n
6 do k Π[k] 6 do while q > 0 and p[q+1] != S[i]
7 If p[k+1] = p[q] 7 do q Π[q]
8 then k k +1 8 if p[q+1] = S[i]
9 Π[q] k 9 then q q + 1
10 return Π 10 if q = m
11 then print “Pattern occurs with shift” i – m
12 q Π[ q]
back