IR 5
PageRank
The PageRank algorithm was introduced by Google founders Larry Page and Sergey Brin and
became a foundational component of Google's search engine. It evaluates the importance of
web pages based on the quantity and quality of links pointing to them, treating hyperlinks as
votes of trust and authority.
How PageRank Works
1. Voting System
○ Each link from one page to another is treated as a "vote" for the linked page.
○ Votes from high-authority pages carry more weight than votes from low-quality
pages.
2. Link Value Distribution
○ A page divides its PageRank equally among its outgoing links.
○ For example, if a page has a PageRank of 6 and links to 3 other pages, each link
passes 6 ÷ 3 = 2 units of PageRank.
3. Damping Factor (d)
○ A damping factor (usually set to 0.85) represents the likelihood that a user
continues clicking links instead of starting a new search.
○ This ensures the algorithm doesn't rely solely on link structure and accounts for random navigation (a small iterative sketch of this computation follows the list).
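To make these three ideas concrete, here is a minimal sketch of the iterative PageRank computation in Python. The three-page link graph, the function name, and the iteration count are illustrative assumptions, not part of the original algorithm description.

    # Minimal PageRank sketch: repeatedly redistribute rank along links,
    # damped by d to model random navigation.
    def pagerank(links, d=0.85, iterations=50):
        """links maps each page to the list of pages it links to (toy input)."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}               # start uniform
        for _ in range(iterations):
            new_rank = {p: (1 - d) / n for p in pages}   # "random navigation" term
            for page, outlinks in links.items():
                if not outlinks:                         # dangling page: spread evenly
                    for p in pages:
                        new_rank[p] += d * rank[page] / n
                else:
                    share = rank[page] / len(outlinks)   # rank split equally among links
                    for target in outlinks:
                        new_rank[target] += d * share
            rank = new_rank
        return rank

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}    # made-up link graph
    print(pagerank(graph))                               # prints each page's final rank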
Advantages of PageRank
1. Authority Recognition
○ Pages with more backlinks from authoritative sources rank higher, improving
result quality.
2. Resilience
○ Resistant to small changes in web structure due to its global link-based
calculation.
3. Scalability
○ Suitable for large-scale web indexing with millions or billions of pages.
Limitations of PageRank
1. Susceptible to Manipulation
○ Black-hat SEO techniques like link farming and paid links can artificially inflate
rankings.
2. Ignores Content Relevance
○ Focuses solely on link structure and does not account for the actual relevance of
page content to a query.
3. Outdated for Modern Needs
○ Modern search engines use advanced algorithms incorporating hundreds of
factors, such as user behavior, mobile-friendliness, and semantic search, beyond
just PageRank.
Ranking Functions
Ranking functions in Information Retrieval (IR) are mathematical models that evaluate and rank documents based on their relevance to a given query. Simple ranking functions compute a document's score directly from easily counted quantities, for example how often the query terms occur in the document.
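As an illustration (the scoring function and the two toy documents below are made-up examples, not from these notes), one of the simplest ranking functions just counts query-term occurrences in each document:

    # A very simple ranking function: score = number of query-term occurrences.
    def score(query, document):
        terms = query.lower().split()
        words = document.lower().split()
        return sum(words.count(t) for t in terms)

    docs = {
        "doc1": "pagerank ranks pages by links",
        "doc2": "a web crawler fetches pages and follows links to more pages",
    }
    query = "pages links"
    ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
    print(ranked)   # doc2 scores 3 (pages x2 + links x1), doc1 scores 2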
Applications
Web Crawler
A web crawler (also known as a spider, bot, or web robot) is an automated program or script
used to systematically browse and index web pages across the internet. It plays a vital role in
search engines by collecting data to build and maintain an up-to-date index of the web.
How a Web Crawler Works
1. Seed URLs
○ The process starts with a list of initial URLs, known as seed URLs.
○ The crawler fetches the content of these URLs.
2. Parsing Content
○ Extracts links (hyperlinks) from the fetched pages.
○ Stores the data for indexing, such as the text, metadata, and media files.
3. Queue Management
○ Adds newly discovered links to a queue for future crawling.
○ URLs are prioritized based on factors like importance, freshness, and relevance.
4. Fetching and Repeating
○ The crawler visits new URLs in the queue, repeating the process until resource or policy constraints stop it (a minimal version of this loop is sketched after this list).
5. Indexing
○ The fetched content is passed to the search engine's indexing system for
organizing and storing.
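The five steps above can be compressed into a short sketch. It assumes the third-party requests and beautifulsoup4 packages; the seed URL and page limit are placeholders, and a real crawler would also check robots.txt and rate-limit its requests.

    # Minimal crawl loop: seed URLs -> fetch -> parse links -> queue -> repeat.
    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed_urls, max_pages=10):
        queue = deque(seed_urls)               # URL frontier
        visited = set()
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            try:
                response = requests.get(url, timeout=5)
            except requests.RequestException:
                continue                        # skip URLs that fail to fetch
            visited.add(url)
            soup = BeautifulSoup(response.text, "html.parser")
            print(url, "->", soup.title.string if soup.title else "(no title)")
            for link in soup.find_all("a", href=True):
                queue.append(urljoin(url, link["href"]))   # resolve relative links
        return visited

    crawl(["https://example.com/"])             # placeholder seed URL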
Key Features of a Web Crawler
1. Politeness
○ Adheres to the rules in a site's robots.txt file, which specifies which pages may or may not be crawled (see the robots.txt check sketched after this list).
2. Efficiency
○ Optimized to handle vast amounts of data without overwhelming servers.
3. Scalability
○ Designed to crawl millions or billions of web pages efficiently.
4. Freshness
○ Re-visits pages periodically to keep the index updated.
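As noted under Politeness, the robots.txt check can be done with Python's standard urllib.robotparser module; the site URL and user-agent string below are placeholders.

    # Check robots.txt before fetching a page (politeness).
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")   # placeholder site
    robots.read()                                      # fetches and parses robots.txt

    user_agent = "MyCrawler"                           # placeholder identifier
    url = "https://example.com/private/page.html"
    if robots.can_fetch(user_agent, url):
        print("Allowed to crawl", url)
    else:
        print("Disallowed by robots.txt:", url)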
Challenges in Web Crawling
1. Scale
○ The web contains billions of pages, and crawlers must handle massive amounts
of data.
2. Politeness and Ethics
○ Crawlers must avoid overwhelming servers or accessing restricted pages.
3. Dynamic Content
○ Pages with JavaScript or AJAX may be harder to parse.
4. Duplicate Content
○ Avoid indexing multiple versions of the same content (e.g., www.example.com vs. example.com); a URL-normalization sketch follows this list.
5. Bandwidth and Storage
○ Crawlers must balance resource usage with efficient data collection.
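A small sketch of how the duplicate-content problem is commonly handled, combining URL normalization with content hashing; the helper functions and sample URLs are illustrative, not a standard library API.

    # Detect duplicates via URL normalization and content hashing.
    import hashlib
    from urllib.parse import urlsplit, urlunsplit

    def normalize_url(url):
        """Illustrative normalization: lowercase host, drop 'www.' and fragments."""
        parts = urlsplit(url)
        host = parts.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))

    def content_fingerprint(html):
        """Hash of the page body, used to spot identical content at different URLs."""
        return hashlib.sha256(html.encode("utf-8")).hexdigest()

    print(normalize_url("https://www.example.com/page#section"))
    print(normalize_url("https://example.com/page"))   # same normalized URL as above
    print(content_fingerprint("<html>same body</html>"))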
Applications of Web Crawlers
1. Search Engines
○ Build and maintain an index of web pages for user queries.
2. Market Research
○ Collect data for analyzing trends or competitors.
3. Content Aggregation
○ Fetch data for platforms like news aggregators or price comparison sites.
4. Sentiment Analysis
○ Gather user reviews, comments, or social media data for analysis.
5. Academic Research
○ Collect large-scale datasets for machine learning or data mining projects.
Structure of a Web Crawler
A web crawler consists of several interconnected components designed to fetch, process, and store web
content efficiently. Below is the typical structure of a web crawler:
1. Seed URL List
● Purpose: Stores the initial list of URLs (seed URLs) from where the crawling begins.
● Features:
○ URLs can be added manually or dynamically.
○ Updated periodically based on crawling needs.
2. URL Frontier
● Purpose: Maintains the queue of URLs that have been discovered but not yet crawled, in priority order.
3. Fetcher (Downloader)
● Purpose: Sends HTTP requests to web servers to retrieve the content of web pages.
● Key Features:
○ Handles protocols like HTTP/HTTPS.
○ Manages timeouts and retries for failed requests.
○ Uses headers to identify itself (e.g., User-Agent).
4. Content Parser
● Purpose: Processes the fetched HTML, extracting useful data and links.
● Key Features:
○ HTML Parsing: Extracts text, metadata, and structured content (e.g., headings, tables).
○ Link Extraction: Finds all hyperlinks in the page to expand the URL frontier.
5. Data Storage
● Purpose: Stores the fetched content and metadata for further processing or indexing.
● Types of Data Stored:
○ Raw HTML: For re-parsing or advanced analysis.
○ Parsed Data: Metadata, extracted text, or media files.
○ Log Information: Details about crawling activities (e.g., timestamps, response codes).
6. Duplicate Detection
● Purpose: Ensures that the same content is not fetched or indexed multiple times.
● Techniques:
○ URL Normalization: Removing duplicate URLs caused by variations (e.g.,
www.example.com vs. example.com).
○ Content Hashing: Comparing hashes of page content to detect duplicates.
7. Scheduler
● Purpose: Determines the crawling strategy by selecting URLs from the frontier and prioritizing
them.
● Policies:
○ Politeness: Ensures compliance with server rules (e.g., robots.txt).
○ Freshness: Re-crawls content based on the likelihood of updates.
○ Priority: Prioritizes high-value URLs (e.g., popular pages or sites with high PageRank).
8. Politeness Module
● Purpose: Enforces robots.txt rules and per-host crawl delays so the crawler does not overload servers.
9. Indexer (Optional)
● Purpose: Organizes and structures the collected data for efficient retrieval.
● Key Features:
○ Builds an inverted index for search engines.
○ Associates keywords with the pages they appear on (a minimal inverted-index sketch appears after this section).
10. Analytics and Monitoring
● Purpose: Tracks crawl progress, error rates, and response codes so the crawl can be tuned and debugged.
Challenges in Design
The design challenges largely mirror the crawling challenges listed earlier: scale, politeness and ethics, dynamic content, duplicate content, and bandwidth and storage.
This structured approach ensures that web crawlers are effective, efficient, and ethical.
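To make the Indexer component concrete, here is a minimal inverted-index sketch; the two toy pages are made up.

    # Minimal inverted index: maps each word to the pages it appears on.
    from collections import defaultdict

    pages = {
        "page1.html": "web crawlers fetch pages",
        "page2.html": "search engines index pages",
    }

    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)

    print(sorted(index["pages"]))   # ['page1.html', 'page2.html']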
Popular Crawling and Scraping Libraries
Python
1. Scrapy
○ Description: A powerful and widely-used web scraping and crawling library.
○ Features:
■ Supports asynchronous crawling.
■ Built-in support for handling requests, parsing, and storing data.
■ Handles robots.txt rules.
○ Use Case: General-purpose web scraping and crawling.
2. BeautifulSoup
○ Description: A library for parsing HTML and XML documents.
○ Features:
■ Easy HTML parsing and data extraction.
■ Works well with static pages.
○ Use Case: Parsing content after fetching it with a tool like requests.
3. Requests-HTML
○ Description: A simple library for fetching and parsing HTML.
○ Features:
■ Supports JavaScript-rendered pages.
■ User-friendly API for extracting data.
○ Use Case: Lightweight crawling and data extraction.
4. Selenium
○ Description: A browser automation tool often used for crawling dynamic web
pages.
○ Features:
■ Interacts with JavaScript-heavy pages.
■ Automates browser actions.
○ Use Case: Crawling dynamic or interactive web content.
5. PySpider
○ Description: A web crawler framework for larger-scale projects.
○ Features:
■ Web UI for monitoring and managing crawlers.
■ Built-in database support.
○ Use Case: Distributed crawling.
Java
1. Apache Nutch
○ Description: An open-source, scalable web crawler based on Hadoop.
○ Features:
■ Highly scalable and customizable.
■ Integration with Apache Solr for search indexing.
○ Use Case: Large-scale distributed crawling.
2. Crawler4j
○ Description: A simple and lightweight Java web crawler.
○ Features:
■ Multi-threaded crawling.
■ Politeness policy support.
○ Use Case: Custom crawlers for specific domains.
JavaScript
1. Puppeteer
○ Description: A Node.js library for controlling Chrome or Chromium via the
DevTools protocol.
○ Features:
■ Supports JavaScript-heavy pages.
■ Can take screenshots and interact with web pages.
○ Use Case: Crawling and interacting with dynamic web pages.
2. Cheerio
○ Description: A fast, flexible, and lean implementation of jQuery for the server.
○ Features:
■ Parsing and traversing HTML documents.
■ Lightweight compared to Puppeteer.
○ Use Case: Scraping static HTML content.
Scrapy
Scrapy is a powerful and open-source web crawling and web scraping framework written in
Python. It is designed to efficiently extract data from websites and process it according to
user-defined rules, making it an ideal tool for large-scale web scraping projects.
Key Components of Scrapy
1. Spider:
○ A Spider is a class that defines how a certain site (or a set of sites) will be
scraped.
○ It contains the logic for navigating the website and extracting the required
information.
○ Spiders are responsible for:
■ Sending requests to the URLs.
■ Parsing the responses and extracting relevant data.
■ Yielding the scraped data in a structured form.
■ Following links to scrape additional pages if needed.
2. Item:
○ Items are simple containers used to structure the data that will be scraped from a
website.
○ They are similar to dictionaries in Python and can be used to define fields that will
store scraped content.
○ For example, an Item could define fields for the title, description, and URL of a scraped article (see the sketch after this list).
3. Selector:
○ Selectors are used to extract data from the HTML content returned by the
website.
○ Scrapy uses XPath or CSS selectors to target specific elements of a webpage
(such as titles, links, and text).
○ XPath is the default method used, but CSS selectors are also supported for ease
of use.
4. Pipeline:
○ Pipelines are used to process the scraped data after it has been collected.
○ They can be used for various tasks such as:
■ Cleaning or validating data.
■ Storing the data in a specific format (e.g., JSON, CSV, or database).
■ Removing duplicates.
○ Each pipeline component is executed sequentially as the data passes through it.
5. Middleware:
○ Middleware is a mechanism that allows you to modify the request and response
processing at various stages.
○ Scrapy’s middleware system allows you to:
■ Handle retries, user-agent rotation, and cookies.
■ Control the response or request before they are passed to the spider.
■ Implement custom functionality for advanced use cases.
6. Settings:
○ Scrapy provides a settings module that allows you to configure various aspects of
a project.
○ You can define settings for download delay, request headers, user-agents,
logging, and more.
○ These settings can be changed globally or on a per-project basis.
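Referring back to the Item concept above, a minimal Item definition might look like this; the class and field names follow the article example and are placeholders.

    # Illustrative Item declaring the fields a spider will fill in.
    import scrapy

    class ArticleItem(scrapy.Item):
        title = scrapy.Field()
        description = scrapy.Field()
        url = scrapy.Field()

    # Once instantiated, Items behave like dictionaries:
    item = ArticleItem(title="Sample title", url="https://example.com/article")
    item["description"] = "Placeholder description"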
How Scrapy Works
1. Define a Spider:
○ You start by creating a spider that defines the start URLs and the parsing logic.
○ The spider sends requests to the target website(s), processes the response, and
extracts the necessary data.
2. Request-Response Cycle:
○ Scrapy sends HTTP requests to the specified URLs.
○ It then receives the response and passes it to the spider's parsing function.
3. Data Extraction:
○ The parsing function extracts the required data using selectors (XPath or CSS).
○ The extracted data is returned as an Item object or can be directly saved to a
file.
4. Data Processing:
○ After the data is scraped, it can be processed by Scrapy’s item pipelines.
○ Pipelines clean, validate, or store the data in a desired format (such as JSON,
XML, or a database).
5. Following Links:
○ Spiders can be configured to follow links within the scraped pages to recursively
scrape additional pages.
○ This is done by yielding new requests from the spider's parsing function (the spider sketch after this list shows both data extraction and link following).
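A minimal spider tying together the request-response cycle, data extraction, and link following described above. It targets the public practice site https://quotes.toscrape.com/ (commonly used in Scrapy's own tutorial); the spider name and CSS selectors are specific to that site.

    # Minimal Scrapy spider: extract data with CSS selectors and follow pagination.
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Data extraction: yield one item (here a plain dict) per quote block.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Following links: queue the next page, reusing this parse method.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, it can be run with scrapy runspider quotes_spider.py -o quotes.json to export the scraped items as JSON.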
Key Features of Scrapy
1. Asynchronous Processing:
○ Scrapy is built on top of Twisted, a Python framework that allows asynchronous
processing.
○ This allows Scrapy to perform multiple HTTP requests concurrently, making it
much faster compared to other synchronous scraping tools.
2. Extensibility:
○ Scrapy allows for easy customization through middlewares and pipelines.
○ It also provides the ability to extend existing functionality through custom spiders,
settings, and handlers.
3. Built-in Features:
○ Scrapy has built-in support for handling HTTP requests, managing cookies,
following redirects, and handling retries.
○ It also handles data storage (e.g., saving results in JSON, CSV, or database
formats) and logging.
4. Robust Error Handling:
○ Scrapy provides excellent error handling for timeouts, connection issues, and
missing pages.
○ It also has built-in features to respect the robots.txt file of websites, ensuring
that the crawler doesn't violate the website’s terms of service.
5. Data Export:
○ Scrapy supports exporting scraped data into a variety of formats, including
JSON, CSV, and XML. It can also integrate with databases for more structured
data storage.
Advantages of Scrapy
1. High Performance:
○ Scrapy's asynchronous processing makes it highly efficient for crawling large
websites with many pages.
2. Built-in Features:
○ It handles various crawling issues (e.g., retries, delays, redirects, user-agent
rotation) out of the box, reducing the need for custom code.
3. Flexibility:
○ It offers great flexibility with its extensive configuration options and custom spider
development.
4. Scalability:
○ Scrapy can be easily scaled by running multiple crawlers or distributing tasks
across machines for large-scale scraping.
5. Ease of Use:
○ Scrapy’s straightforward syntax and structure make it relatively easy to use, even
for beginners.
Applications of Scrapy
1. Web Scraping:
○ Collecting structured data from websites like product details, news articles, or
real estate listings.
2. Data Mining:
○ Scrapy can be used to mine large amounts of unstructured data for analysis.
3. Competitive Analysis:
○ Extracting data about competitors, such as product prices, reviews, and more.
4. Search Engine Indexing:
○ Scrapy is useful for creating crawlers that index specific parts of the web for
search engines or databases.
BeautifulSoup
BeautifulSoup is a Python library used for parsing HTML and XML documents. It is particularly
useful for web scraping tasks, where the goal is to extract data from web pages. BeautifulSoup
provides simple methods to navigate and search the parse tree, making it easier to extract
specific elements from a webpage's structure.
Key Features of BeautifulSoup
1. Parsing HTML/XML:
○ BeautifulSoup parses HTML or XML documents into a structured tree, allowing
easy navigation and manipulation of the document’s elements.
○ It can handle poorly structured or invalid HTML, which makes it especially useful
for scraping real-world web data.
2. Navigating the Parse Tree:
○ BeautifulSoup allows traversal of the HTML/XML parse tree using a variety of
methods, such as accessing elements by tags, class, or attributes.
○ It supports various ways to access nodes, including:
■ Tags: Access elements by their tag names (e.g., div, a, p).
■ Attributes: Access elements by their HTML attributes, such as id,
class, href, etc.
■ Text content: Extract text inside HTML elements.
3. Search Capabilities:
○ BeautifulSoup provides several methods to search for elements or extract text:
■ find(): Finds the first matching tag.
■ find_all(): Finds all matching tags.
■ select(): Allows searching for elements using CSS selectors.
○ These search methods help locate specific data within complex HTML structures (see the example after this list).
4. Modifying the Parse Tree:
○ Once the data is extracted, BeautifulSoup also allows modifications to the
document. For example, you can change the text or attributes of an element, or
even add new elements.
5. Support for Multiple Parsers:
○ BeautifulSoup can use multiple parsers, including:
■ lxml: Faster and more feature-rich.
■ html.parser: A built-in Python parser.
■ html5lib: For parsing HTML5 documents.
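A short example of the search methods listed above (find(), find_all(), and select()), run on a small in-memory HTML snippet so no network access is needed; the HTML itself is made up.

    # Parse a small HTML document and extract data with find, find_all, and select.
    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <h1 id="title">Sample Page</h1>
      <p class="intro">First paragraph.</p>
      <p class="intro">Second paragraph.</p>
      <a href="https://example.com/more">Read more</a>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    print(soup.find("h1").get_text())                 # first matching tag -> "Sample Page"
    print([p.get_text() for p in soup.find_all("p")]) # all <p> tags
    print(soup.select("p.intro")[1].get_text())       # CSS selector -> "Second paragraph."
    print(soup.find("a")["href"])                     # attribute access -> the link URL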
Advantages of BeautifulSoup
1. Ease of Use:
○ BeautifulSoup offers a simple and intuitive API for parsing and extracting data,
making it a popular choice for web scraping tasks.
2. Flexibility:
○ It supports a wide range of HTML parsing and searching features, allowing users
to scrape complex websites.
3. Handling Badly Structured HTML:
○ BeautifulSoup can parse malformed or poorly structured HTML, making it
resilient in real-world web scraping scenarios.
4. Works Well with Other Libraries:
○ BeautifulSoup can be easily combined with other libraries like requests for
fetching web pages or pandas for storing scraped data in structured formats.
5. Lightweight:
○ The library is lightweight and easy to install, making it an excellent choice for
small to medium-sized scraping tasks.
Disadvantages of BeautifulSoup
1. Performance:
○ While BeautifulSoup is easy to use, it may not be the fastest option for
large-scale web scraping projects.
○ For more complex scraping tasks or high-performance needs, other tools like
Scrapy or lxml might be better suited.
2. Lack of Asynchronous Features:
○ Unlike Scrapy, BeautifulSoup does not provide built-in support for asynchronous
scraping, meaning it might not be ideal for handling large volumes of requests
concurrently.
3. Limited Support for JavaScript:
○ BeautifulSoup works well with static HTML but does not have the capability to
render or interact with JavaScript-heavy websites. For dynamic pages, tools like
Selenium or requests-HTML are more suitable.
Applications of BeautifulSoup
1. Web Scraping:
○ Extracting data such as product details, reviews, articles, or weather information
from websites.
2. HTML/XML Parsing:
○ BeautifulSoup can be used for parsing and manipulating HTML or XML
documents in various applications, such as web scraping, data cleaning, and
automated data extraction.
3. Data Cleaning and Transformation:
○ Extracting and cleaning data from web pages, then transforming it into a
structured format for analysis or storage.
4. SEO and Web Monitoring:
○ Scraping websites for SEO analysis, tracking changes, and extracting content for
monitoring purposes.