Chapter 3

The document discusses the functioning of web crawlers, which automatically find and download web pages for search engines. It covers various aspects of crawling, including the importance of freshness, the challenges of duplicate detection, and the use of sitemaps and document feeds. Additionally, it highlights the role of distributed crawling and storage systems like BigTable in managing large collections of web data.

Uploaded by

gillybobfitz

Search Engines

Information Retrieval in Practice

Web Crawler
• Finds and downloads web pages automatically
– provides the collection for searching
• Web is huge and constantly growing
• Web is not under the control of search engine providers
• Web pages are constantly changing
• Crawlers also used for other types of data
Retrieving Web Pages
• Every page has a unique uniform resource locator (URL)
• Web pages are stored on web servers that use HTTP to
exchange information with client software
Retrieving Web Pages
• Web crawler client program connects to a domain name system
(DNS) server
• DNS server translates the hostname into an internet protocol (IP)
address
• Crawler then attempts to connect to server host using a specific
port
• After connection, the crawler sends an HTTP request to the
webserver to request a page
– usually a GET request
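The retrieval steps above can be sketched in Python. This is a minimal illustration, not a production fetcher: it handles plain HTTP only, the helper names build_get_request and fetch are hypothetical, and www.example.com is a placeholder host.

```python
import socket
from urllib.parse import urlsplit


def build_get_request(url):
    """Split a URL and build a minimal HTTP/1.1 GET request for it."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80  # default port for plain HTTP
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return host, port, request


def fetch(url, timeout=10.0):
    """Resolve the hostname, connect to the port, send GET, return raw bytes."""
    host, port, request = build_get_request(url)
    # create_connection does the DNS lookup and the TCP connect in one call.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


host, port, request = build_get_request("http://www.example.com/people.html")
print(host, port)
print(request.decode("ascii"))
```

The response returned by fetch still contains the HTTP status line and headers ahead of the page body.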
Web Crawler
• Starts with a set of seeds, which are a set of URLs given to it as
parameters
• Seeds are added to a URL request queue
• Crawler starts fetching pages from the request queue
• Downloaded pages are parsed to find link tags that might contain
other useful URLs to fetch
• New URLs added to the crawler’s request queue, or frontier
• Continue until no more new URLs or disk full
Web Crawling
• Web crawlers spend a lot of time waiting for responses to requests
• To reduce this inefficiency, web crawlers use threads and fetch
hundreds of pages at once
• Crawlers could potentially flood sites with requests for pages
• To avoid this problem, web crawlers use politeness policies
– e.g., delay between requests to same web server
Controlling Crawling
• Even crawling a site slowly will anger some web server
administrators, who object to any copying of their data
• Robots.txt file can be used to control crawlers
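As a rough illustration of how a crawler might consult robots.txt before fetching, the sketch below uses Python's standard urllib.robotparser. The rules, the "ExampleBot" agent name, and the URLs are made up for the example.

```python
from urllib import robotparser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /confidential/
Allow: /
"""


def make_policy(robots_text):
    """Parse robots.txt text into an object that answers can_fetch()."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)
for url in ("http://www.example.com/index.html",
            "http://www.example.com/private/notes.html"):
    print(url, policy.can_fetch("ExampleBot", url))
```

In a real crawler the text would be downloaded from /robots.txt on each host and cached, so the check costs no extra request per page.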
Simple Crawler Thread
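One way to sketch the loop a single crawler thread runs is shown below. It is an assumption-laden illustration, not the slide's original pseudocode: the fetch and allowed callables are injected by the caller (for example an HTTP client and a robots.txt check), and the names crawl and LinkParser are hypothetical.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allowed=lambda url: True, delay=0.0, max_pages=100):
    """Breadth-first crawl from seeds; returns {url: html}."""
    frontier = deque(seeds)   # the URL request queue
    seen = set(seeds)         # URLs already queued, to avoid refetching
    last_visit = {}           # host -> time of the last request to it
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed(url):
            continue
        host = urlparse(url).netloc
        if host in last_visit:
            # Politeness: wait until `delay` seconds since the last request.
            wait = last_visit[host] + delay - time.monotonic()
            if wait > 0:
                time.sleep(wait)
        last_visit[host] = time.monotonic()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# A tiny in-memory "web" standing in for real servers.
web = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": "no links here",
}
print(sorted(crawl(["http://a.example/"], web.get)))
```

A multithreaded crawler would run many such loops against a shared, thread-safe frontier so that waiting on one server does not stall the others.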
Freshness
• Web pages are constantly being added, deleted, and modified
• Web crawler must continually revisit pages it has already
crawled to see if they have changed in order to maintain the
freshness of the document collection
– stale copies no longer reflect the real contents of the web pages
Freshness
• HTTP protocol has a special request type called HEAD that
makes it easy to check for page changes
– returns information about page, not page itself
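A HEAD-based change check might look like the sketch below. It assumes the server returns an ETag or Last-Modified header, which not every server does; the function names are hypothetical, and fetch_headers needs network access when called.

```python
import urllib.request


def make_head_request(url):
    """Build a request that asks for headers only, not the page body."""
    return urllib.request.Request(url, method="HEAD")


def fetch_headers(url, timeout=10.0):
    """Send the HEAD request and return headers with lowercased names."""
    with urllib.request.urlopen(make_head_request(url), timeout=timeout) as resp:
        return {name.lower(): value for name, value in resp.headers.items()}


def probably_changed(stored, current):
    """Compare headers saved at the last crawl with fresh HEAD headers."""
    for name in ("etag", "last-modified"):
        if name in stored and name in current:
            return stored[name] != current[name]
    return True  # nothing comparable: assume the page changed and refetch


old = {"last-modified": "Tue, 04 Jan 2005 14:11:32 GMT"}
new = {"last-modified": "Thu, 03 Apr 2008 05:17:54 GMT"}
print(probably_changed(old, new))
```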
Freshness
• Not possible to constantly check all pages
– must check important pages and pages that change frequently
• Freshness is the proportion of fresh pages
• Optimizing for this metric can lead to bad decisions, such as
not crawling popular sites
• Age is a better metric
Freshness vs. Age
Age
• Expected age of a page t days after it was last crawled:

• Web page updates follow the Poisson distribution on average

– time until the next update is governed by an exponential distribution
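Under the Poisson assumption, the expected age can be written as follows. Here λ is the mean change rate (changes per day) and x is the time since the last crawl at which the page changed. The closed form on the last line is a derived step, included as a convenience.

```latex
\begin{aligned}
\mathrm{Age}(\lambda, t)
  &= \int_0^t P(\text{page changed at time } x)\,(t - x)\,dx \\
  &= \int_0^t \lambda e^{-\lambda x}\,(t - x)\,dx \\
  &= t - \frac{1 - e^{-\lambda t}}{\lambda}
\end{aligned}
```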
Age
• The older a page gets, the more it costs not to crawl it
– e.g., expected age with mean change frequency λ = 1/7 (one change
per week)
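To make the λ = 1/7 case concrete, the sketch below evaluates the expected age assuming the closed form t − (1 − e^(−λt))/λ, which follows from exponentially distributed update times. The function name expected_age is hypothetical.

```python
import math


def expected_age(lam, t):
    """Expected age (in days) of a page t days after its last crawl,
    assuming updates arrive as a Poisson process with rate lam per day."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return t - (1.0 - math.exp(-lam * t)) / lam


# One change per week on average.
for days in (1, 7, 14, 28):
    print(days, round(expected_age(1 / 7, days), 2))
```

The printed values grow faster than linearly at first, so each extra day without a crawl costs more than the previous one.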
Focused Crawling
• Attempts to download only those pages that are about a
particular topic
– used by vertical search applications
• Rely on the fact that pages about a topic tend to have links to
other pages on the same topic
– popular pages for a topic are typically used as seeds
• Crawler uses text classifier to decide whether a page is on topic
Deep Web
• Sites that are difficult for a crawler to find are collectively
referred to as the deep (or hidden) Web
– much larger than conventional Web
• Three broad categories:
– private sites
• no incoming links, or may require log in with a valid account
– form results
• sites that can be reached only after entering some data into a form
– scripted pages
• pages that use JavaScript, Flash, or another client-side language to generate
links
Sitemaps
• Sitemaps contain lists of URLs and data about those URLs, such
as modification time and modification frequency
• Generated by web server administrators
• Tells crawler about pages it might not otherwise find
• Gives crawler a hint about when to check a page for changes
Sitemap Example
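As an illustration, the sketch below parses a small sitemap with Python's standard XML library. The XML follows the sitemaps.org element names (urlset, url, loc, lastmod, changefreq, priority), but the URLs and values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content for illustration.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.example.com/items?item=truck</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing fields come back as None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
        })
    return entries


for entry in parse_sitemap(SITEMAP_XML):
    print(entry)
```

A crawler could add each loc to its frontier and use changefreq as a hint for how often to revisit the page.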
Distributed Crawling
• Three reasons to use multiple computers for crawling
– Helps to put the crawler closer to the sites it crawls
– Reduces the number of sites the crawler has to remember
– Reduces computing resources required
• Distributed crawler uses a hash function to assign URLs to
crawling computers
– hash function should be computed on the host part of each URL
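The host-based assignment can be sketched as below. SHA-1 is used only as a convenient stable hash, which is an assumption rather than a requirement, and assign_crawler is a hypothetical name.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index by hashing only the host part,
    so every URL on one host goes to the same machine."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


for url in ("http://www.example.com/a.html",
            "http://www.example.com/b/c.html",
            "http://other.example.org/"):
    print(url, assign_crawler(url, 4))
```

Hashing the host rather than the full URL keeps politeness bookkeeping for a site on a single machine.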
Desktop Crawls
• Used for desktop search and enterprise search
• Differences from web crawling:
– Much easier to find the data
– Responding quickly to updates is more important
– Must be conservative in terms of disk and CPU usage
– Many different document formats
– Data privacy very important
Document Feeds
• Many documents are published
– created at a fixed time and rarely updated again
– e.g., news articles, blog posts, press releases, email
• Published documents from a single source can be ordered in a
sequence called a document feed
– new documents found by examining the end of the feed
Document Feeds
• Two types:
– A push feed alerts the subscriber to new documents
– A pull feed requires the subscriber to check periodically for new
documents
• Most common format for pull feeds is called RSS
– Really Simple Syndication, RDF Site Summary, Rich Site Summary,
or ...
RSS “Really Simple Syndication” Example
RSS is most often expanded as Really Simple Syndication, a way to distribute content such as news and podcasts.
RSS feeds are XML-based, and subscribers poll them to keep up with new content from the websites they follow.
RSS Example
RSS
• ttl tag (time to live)
– amount of time (in minutes) contents should be cached
• RSS feeds are accessed like web pages
– using HTTP GET requests to web servers that host them
• Easy for crawlers to parse
• Easy to find new information
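A minimal sketch of reading a pull feed is shown below. The feed content is hypothetical, the element names (channel, ttl, item, title, link, pubDate) are standard RSS 2.0, and parse_feed and new_items are illustrative names.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed for illustration.
RSS_XML = """\
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed used for illustration.</description>
    <ttl>60</ttl>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/news/2</link>
      <pubDate>Wed, 20 Jun 2007 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>First story</title>
      <link>http://www.example.com/news/1</link>
      <pubDate>Tue, 19 Jun 2007 05:17:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return (ttl in minutes or None, list of (title, link) items)."""
    channel = ET.fromstring(xml_text).find("channel")
    ttl_text = channel.findtext("ttl")
    ttl_minutes = int(ttl_text) if ttl_text else None
    items = [(item.findtext("title"), item.findtext("link"))
             for item in channel.findall("item")]
    return ttl_minutes, items


def new_items(items, seen_links):
    """Keep only items whose link has not been crawled before."""
    return [(title, link) for title, link in items if link not in seen_links]


ttl, items = parse_feed(RSS_XML)
print("recheck after", ttl, "minutes")
print(new_items(items, {"http://www.example.com/news/1"}))
```

The ttl value gives the crawler a natural polling interval, so it need not fetch the feed again before the cache time expires.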
Storing the Documents
• Many reasons to store converted document text
– saves crawling time when page is not updated
– provides efficient access to text for snippet generation, information
extraction, etc.
• Database systems can provide document storage for some
applications
– web search engines use customized document storage systems
Storing the Documents
• Requirements for document storage system:
– Random access
• request the content of a document based on its URL
• hash function based on URL is typical
– Compression and large files
• reducing storage requirements and efficient access
– Update
• handling large volumes of new and modified documents
• adding new anchor text
Large Files
• Store many documents in large files, rather than each
document in a file
– avoids overhead in opening and closing files
– reduces seek time relative to read time
• Compound document formats
– used to store multiple documents in a file
– e.g., TREC Web
Compression
• Text is highly redundant (or predictable)
• Compression techniques exploit this redundancy to make files
smaller without losing any of the content
• Compression of indexes covered later
• Popular algorithms can compress HTML and XML text by 80%
– e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
– may compress large files in blocks to make access faster
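The storage requirements above (random access by URL, one large file, compression) can be combined in a toy sketch. It is only an illustration under simplifying assumptions: the index lives in memory as a Python dict (a hash table keyed by URL), each document is compressed on its own with zlib (DEFLATE), and DocumentStore is a hypothetical name.

```python
import io
import zlib


class DocumentStore:
    """Append-only store: many compressed documents in one large file."""

    def __init__(self, fileobj):
        self.file = fileobj
        self.index = {}  # URL -> (offset, length) of the compressed record

    def put(self, url, text):
        """Append a compressed copy; a later put for the same URL wins."""
        block = zlib.compress(text.encode("utf-8"))
        self.file.seek(0, io.SEEK_END)
        offset = self.file.tell()
        self.file.write(block)
        self.index[url] = (offset, len(block))
        return len(block)

    def get(self, url):
        """Seek straight to the record and decompress only that record."""
        offset, length = self.index[url]
        self.file.seek(offset)
        return zlib.decompress(self.file.read(length)).decode("utf-8")


store = DocumentStore(io.BytesIO())  # an in-memory file for the demo
html = "<html>" + "<p>tropical fish</p>" * 200 + "</html>"
stored_bytes = store.put("http://www.example.com/fish", html)
print(len(html), "bytes of text stored in", stored_bytes, "bytes")
```

Compressing each record separately plays the role of block compression here: one document can be read without decompressing the whole file.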
BigTable
• Google’s document storage system
– Customized for storing, finding, and updating web pages
– Handles large collection sizes using inexpensive computers
BigTable
• No query language, no complex queries to optimize
• Only row-level transactions
• Tablets are stored in a replicated file system that is accessible by all
BigTable servers
• Any changes to a BigTable tablet are recorded to a transaction log,
which is also stored in a shared file system
• If any tablet server crashes, another server can immediately read the
tablet data and transaction log from the file system and take over
BigTable
• Logically organized into rows
• A row stores data for a single web page

• The combination of a row key, a column key, and a timestamp points
to a single cell in the row
BigTable
• BigTable can have a huge number of columns per row
– all rows have the same column groups
– not all rows have the same columns
– important for reducing disk reads to access document data
• Rows are partitioned into tablets based on their row keys
– simplifies determining which server is appropriate
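The data model described above can be mimicked with a toy in-memory table. This is not BigTable's API; it is a sketch with hypothetical names (ToyTable, split_keys) that shows cells addressed by (row key, column key, timestamp) and rows partitioned into tablets by row-key range.

```python
import bisect


class ToyTable:
    """Cells addressed by (row key, column key, timestamp)."""

    def __init__(self, split_keys):
        # Sorted row keys at which one tablet ends and the next begins.
        self.split_keys = sorted(split_keys)
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column, timestamp=None):
        """Return one cell; with no timestamp, return the newest version."""
        versions = {ts: value for (r, c, ts), value in self.cells.items()
                    if r == row and c == column}
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def tablet_for(self, row):
        """Index of the tablet whose key range contains this row key."""
        return bisect.bisect_right(self.split_keys, row)


table = ToyTable(split_keys=["g", "p"])
table.put("www.example.com", "text", 1, "old page text")
table.put("www.example.com", "text", 2, "new page text")
table.put("www.example.com", "anchor:other.example.org", 2, "example link")
print(table.get("www.example.com", "text"))
print(table.tablet_for("www.example.com"))
```

Because tablets cover sorted key ranges, finding the right tablet is a binary search over the split keys. Rows only store the columns they actually use.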
Detecting Duplicates
• Duplicate and near-duplicate documents occur in many situations
– Copies, versions, plagiarism, spam, mirror sites
– 30% of the web pages in a large crawl are exact or near duplicates of
pages in the other 70%
• Duplicates consume significant resources during crawling,
indexing, and search
– Little value to most users
Duplicate Detection
• Exact duplicate detection is relatively easy
• Checksum techniques
– A checksum is a value that is computed based on the content of the
document
• e.g., sum of the bytes in the document file
– Possible for files with different text to have same checksum
• Functions such as a cyclic redundancy check (CRC) have been
developed that consider the positions of the bytes
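The difference between a plain byte sum and a position-sensitive checksum can be shown with a short sketch. The two byte strings are arbitrary examples chosen because they contain the same bytes in a different order, and byte_sum is a hypothetical helper.

```python
import zlib


def byte_sum(data):
    """Naive checksum: the sum of the byte values, ignoring their order."""
    return sum(data) & 0xFFFFFFFF


a = b"stop"
b = b"pots"  # same bytes as `a`, different order

print(byte_sum(a) == byte_sum(b))      # the byte sum cannot tell them apart
print(zlib.crc32(a) == zlib.crc32(b))  # CRC-32 depends on byte positions
```

A CRC can still collide for unrelated documents, so a matching checksum is a strong hint of an exact duplicate, not a proof.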
Near-Duplicate Detection
• More challenging task
– Are web pages with the same text content but different advertising or
format near-duplicates?
• A near-duplicate document is defined using a threshold value
for some similarity measure between pairs of documents
– e.g., document D1 is a near-duplicate of document D2 if more than
90% of the words in the documents are the same
Near-Duplicate Detection
• Search:
– find near-duplicates of a document D
– O(N) comparisons required
• Discovery:
– find all pairs of near-duplicate documents in the collection
– O(N²) comparisons
• IR techniques are effective for search scenario
• For discovery, other techniques used to generate compact
representations
Fingerprints
Fingerprint Example
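One common fingerprinting scheme, sketched below under stated assumptions, splits a document into overlapping word n-grams, hashes each one, and keeps only the hashes that are 0 mod p as the compact representation. CRC-32 stands in for the hash function, the defaults n=3 and p=4 are arbitrary, and the names fingerprint and overlap are hypothetical.

```python
import re
import zlib


def fingerprint(text, n=3, p=4):
    """Set of selected n-gram hashes representing the document."""
    words = re.findall(r"\w+", text.lower())
    ngrams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    hashes = (zlib.crc32(gram.encode("utf-8")) for gram in ngrams)
    return {h for h in hashes if h % p == 0}  # keep roughly 1 in p hashes


def overlap(fp_a, fp_b):
    """Shared fraction of two fingerprints (0.0 when both are empty)."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


doc1 = ("Tropical fish include fish found in tropical environments "
        "around the world")
doc2 = ("Tropical fish include fish found in tropical environments "
        "around the globe")
print(overlap(fingerprint(doc1, p=1), fingerprint(doc2, p=1)))
```

Two documents would be flagged as near-duplicates when the overlap exceeds a chosen threshold. A larger p shrinks the fingerprints at the cost of a noisier estimate.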
