
IRS Module 5 & 6 Web Search

This document discusses the components and architecture of web crawlers, including their crawling policies for selection, re-visiting pages, and politeness. It covers the key components of a web crawler like the frontier, history and page repository, fetching, parsing, URL extraction and canonicalization. The document also touches on topics like graph search problems, URL normalization, multithreaded crawlers, and crawler identification.
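As a rough illustration of how the frontier, the history, and URL canonicalization mentioned above fit together, here is a minimal Python sketch. It is not taken from the slides: the function names `canonicalize` and `crawl`, the `get_links` callback, the in-memory `FAKE_WEB` link table, and the specific normalization rules (lowercase scheme and host, drop default ports and fragments, use "/" for an empty path) are all assumptions chosen for the example. Fetching and parsing are replaced by a dictionary lookup so the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url, base=None):
    """Return one canonical spelling of a URL (illustrative rules only)."""
    if base is not None:
        url = urljoin(base, url)              # resolve relative links against the page URL
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()     # note: any user:password@ part is dropped
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host                         # omit the default port
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"                  # empty path becomes "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # fragment removed

def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl: the frontier is a FIFO queue, the history a set."""
    frontier = deque([canonicalize(seed)])
    history = set()
    visited_order = []
    while frontier and len(visited_order) < max_pages:
        url = frontier.popleft()
        if url in history:
            continue                          # already crawled under this canonical form
        history.add(url)
        visited_order.append(url)
        for link in get_links(url):           # stands in for fetch + parse + URL extraction
            candidate = canonicalize(link, base=url)
            if candidate not in history:
                frontier.append(candidate)
    return visited_order

# Hypothetical three-page site used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": ["/a", "b#section"],
    "http://example.com/a": ["/", "/b"],
    "http://example.com/b": [],
}

if __name__ == "__main__":
    for page in crawl("HTTP://EXAMPLE.com", lambda u: FAKE_WEB.get(u, [])):
        print(page)
```

A real crawler would replace `get_links` with an HTTP fetch and an HTML parser, and would add the politeness and re-visit policies listed in the outline; the queue-plus-set structure is only one simple way to treat crawling as a graph search.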

Uploaded by Pravin Shinde

WEB SEARCH IRS

Prof. Pravin V. Shinde
Web Crawler
Use of Web Crawler
Contd..
Crawling Policies
Selection Policy
Re-Visit Policy
Re-Visit Policy
Politeness Policy
Parallelization Policy
Components
Contd..
Contd..
Web Crawler Architecture
Crawling Infrastructure
Contd..
Graph Search Problem
Contd..
Frontier
History and Page Repository
Contd..
Fetching
Fetching Contd..
Parsing
URL Extraction and Canonicalization
Canonicalization Procedure
Stoplisting
Stemming
HTML Tag Tree
Example
Contd..
URL Normalization
Crawler Identification
Multithreaded Crawler
Contd..
Thank You
