
MODULE 3

WAD 3
Module 3: Web Search and Information Retrieval
Web Search and Retrieval: Search Engine Optimization-
Importance of SEO for web visibility and ranking,
Types of SEO: On-page, Off-page, Technical SEO.
Web Crawling and indexing- Crawling Algorithms and Challenges,
Ranking Algorithms,
Web traffic models.
Web Search & Retrieval" refers to the process of finding relevant information on the
World Wide Web by using a search engine
It is the process of utilizing an online search engine to locate documents, web pages,
videos, images, or any other type of digital content on the internet.
It applies "Information Retrieval" techniques to the vast collection of data available
online
Ø The user enters a query and the system returns a ranked list of web pages that best
match the search terms based on complex algorithms that analyze content and link
structure
Ø it involves components like web crawlers to index web pages, and ranking
algorithms to determine the most relevant results for a given query
Advantages of web search
It is fast, convenient, and comprehensive.
ü You can quickly search through thousands of websites containing data on virtually
any topic.
ü It enables you to easily compare different sources of information to get the most
accurate and up-to-date facts.
Web searching offers a number of advantages over other methods of finding
information, including the following:
ü Your web search history is automatically saved, so you can access your search
results whenever you want.
ü It allows you to narrow down your search criteria so that you can find exactly what
you are looking for.
ü Most search engines now offer a web search app, so you can access them with a
single tap on your phone.
Types of Web search:

Ø Boolean search — This type of search allows the user to combine keywords and
phrases using the so-called Boolean operators, such as “and”, “or”, and “not”, in
order to narrow down results. This approach is best for finding specific information.
Ø Natural language search — This one allows the user to type in a phrase or question
the same way they would say it out loud. This type of search is great for getting
more informal results from your queries.
Ø Semantic search — This option takes into account the context of a query to provide
more precise results. Semantic search is great for finding information related to
specific topics or concepts.
Boolean search - examples
1. A typical example of a Boolean search in a web application would be searching for
"laptops AND (Dell OR HP) NOT Chromebook" on an online shopping site, which
would return results only for laptops that are either Dell or HP, but would exclude
any Chromebooks, demonstrating the use of "AND", "OR", and "NOT" operators to
narrow down the search results.
2. "Climate change OR global warming NOT politics" - To find articles about climate
change or global warming, excluding articles focused on politics
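To make the operator semantics concrete, here is a minimal Python sketch of evaluating the query "laptops AND (Dell OR HP) NOT Chromebook" against a toy catalogue. The product data and matching rule are invented for illustration; real engines evaluate Boolean queries against an inverted index rather than raw text.

```python
# Minimal sketch: evaluating "laptops AND (Dell OR HP) NOT Chromebook"
# against a toy product catalogue. Data and logic are illustrative only.
products = [
    "Dell laptops 15-inch business model",
    "HP laptops Pavilion series",
    "HP Chromebook 14 laptops",
    "Lenovo laptops ThinkPad",
]

def matches(text: str) -> bool:
    words = set(text.lower().split())
    # AND: must contain "laptops"; OR: "dell" or "hp"; NOT: no "chromebook"
    return ("laptops" in words
            and bool({"dell", "hp"} & words)
            and "chromebook" not in words)

results = [p for p in products if matches(p)]
print(results)  # Dell and HP laptops only; the Chromebook is excluded
```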
Natural Language Search - example

Typing "Trump and Zelensky fight" as a search phrase, just as we would say it out loud,
returns several links, such as:

https://www.hindustantimes.com/world-news/volodymyr-zelensky-on-white-house-fight-with-donald-trump-not-good-for-both-sides-101740786722125.html

https://www.thehindu.com/news/international/world-reacts-to-trump-zelensky-oval-office-clash/article69278424.ece
Ø Semantic search is a set of search engine capabilities, which includes
understanding words from the searcher's intent and their search context.
Ø It is intended to improve the quality of search results by interpreting natural language
more accurately and in context.
Ø Semantic search is a data searching technique that focuses on understanding the
contextual meaning and intent behind a user's search query, rather than only
matching keywords.
v Google's Knowledge Graph is one of the most well-known examples of semantic
search in action. It uses data from a variety of sources to provide users with
information about people, places, and things. The Knowledge Graph has been
growing steadily since it was first introduced in 2012.
v Search engines use semantic search to provide users with more accurate and
relevant search results. Companies use semantic search to boost market visibility,
increase sales, and more.
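As a rough illustration of the idea, the sketch below ranks documents by cosine similarity between vectors rather than by keyword overlap, so "global warming" can match a "climate change" document with no shared words. The three-dimensional "embeddings" are hand-made stand-ins; real semantic search systems learn high-dimensional embeddings from large corpora.

```python
# Toy semantic search: compare query and documents as vectors.
# The 3-d "embeddings" below are invented for illustration.
import math

embeddings = {
    "climate change effects":  [0.90, 0.10, 0.0],
    "global warming research": [0.85, 0.15, 0.0],
    "football match results":  [0.00, 0.10, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_vec = [0.88, 0.12, 0.0]  # pretend embedding of the query "global warming"
ranked = sorted(embeddings, key=lambda d: cosine(query_vec, embeddings[d]),
                reverse=True)
print(ranked)  # the climate/warming documents rank above the football one
```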
Key points about Web Search & Retrieval:
Information Retrieval (IR):
ü The broader field that encompasses the theory and techniques for finding relevant
information from a collection of data, with web search being a specific application of IR.
Web Crawler:
ü A program that automatically browses the web, discovering and downloading web
pages to be indexed by the search engine.
Index:
ü A structured data storage that allows for quick lookups of relevant web pages based on
keywords and other metadata.
Query:
ü The text a user enters into a search engine to specify what information they are
looking for.
Ranking Algorithm:

ü The complex mathematical formula that determines the order in which search results
are displayed based on relevance to the query.

o Example:

When you search for "best restaurants near me" on Google, the search engine
uses its web crawler to find relevant restaurant pages, indexes them based on keywords
and location data, and then applies its ranking algorithm to present you with the most
likely top matches for your search query. A toy sketch of these pieces working together
follows.
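The sketch below ties the key points above together, assuming the pages have already been crawled: build an inverted index, then answer a query with a naive match-count ranking. The page data and scoring rule are invented for illustration.

```python
# Toy sketch of the index -> query -> ranking pipeline described above.
from collections import defaultdict

pages = {
    "page1.html": "best italian restaurants downtown",
    "page2.html": "restaurants near me with best reviews",
    "page3.html": "home cooking recipes",
}

# Build an inverted index: word -> set of pages containing it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

def search(query: str):
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] += 1  # naive ranking: count matching query words
    return sorted(scores, key=scores.get, reverse=True)

print(search("best restaurants near me"))  # page2 matches most words, ranks first
```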
Bookmark (or favourite): a collection of links (saved shortcuts) to web pages that is
stored in a web browser.
Saving bookmarks allows users to quickly navigate back to the websites they visit the
most.
Bookmark bar: A toolbar that contains all bookmarks and is displayed at the top of a
browser window (under the address bar)
Web search vs Information retrieval

Ø While both involve finding relevant information based on a user query, "web search" is
a specific application of "information retrieval" that focuses on searching the vast,
interconnected collection of data on the World Wide Web.

Ø "Information retrieval" is a broader concept encompassing the process of
retrieving relevant documents from any structured collection of data, including
databases or even physical libraries, using various algorithms to rank results based on
relevance.
o Example:
Ø Information retrieval: Searching for research papers related to a specific medical topic
within a scientific database.
Ø Web search: Using Google to find information about a restaurant in your area.
Key differences:
o Scope:
Ø Web search is limited to the internet, while information retrieval can be applied to any
collection of data, including internal company databases or specialized document
repositories.
o Complexity of data:
Ø Web search deals with a massive, dynamic, and often unstructured collection of web
pages, whereas traditional information retrieval might involve more controlled and
structured data sets.
o Crawling and indexing:
Ø Web search engines rely on web crawlers to constantly index new web pages, a
feature not always necessary in other information retrieval systems.
Ø Data Sources:
ü Information retrieval systems often work with structured databases or specific
collections of documents
ü In contrast, web search engines index the entire web, making them more
versatile but less precise in certain contexts.
Ø User Intent:
ü IR techniques are designed to understand and fulfill specific user needs, often
requiring more detailed queries.
ü In contrast, web search engines cater to a broader audience, optimizing for
speed and general relevance.
Search Engine Architecture
A search engine operates through a series of steps:

Ø User Input: The user enters a keyword or phrase into the search interface.
Ø Query Processing: The search engine processes the query using algorithms to find
relevant documents in its database.
Ø Result Ranking: The engine ranks the results based on relevance and displays
them to the user.
Ø Web Crawlers: Web crawlers play a vital role in web search by systematically
browsing the internet to index content. They ensure that search engines have
up-to-date information to provide users with the most relevant results.
Search Engine Optimization (SEO)
Ø Search engine optimization is the process of improving the quality and quantity of
website traffic to a website or a web page from search engines.
Ø SEO targets unpaid search traffic rather than direct traffic, referral traffic, social
media traffic, or paid traffic.
Ø Search engine optimization (SEO) is an essential practice for any website looking to
improve its visibility and attract more organic traffic.
Ø SEO is the practice of orienting your website to rank higher on a search engine results
page (SERP) so that you receive more traffic. The aim is typically to rank on the first
page of Google results for search terms that mean the most to your target audience.
The four types of SEO are on-page, off-page, technical, and local.
Each type has its own strategies and best practices to improve a website's search
engine ranking.
On-page SEO
Ø Also known as on-site SEO, this involves optimizing a website's body copy,
keywords, headers, meta titles, meta descriptions, and images
Off-page SEO
Ø This involves using tools, tips, and best practices to promote a website on search
engines and third-party websites
Technical SEO
Ø This involves crawling, indexing, rendering, and website architecture
Local SEO
Ø This involves optimizing a website for customers in a particular geographic area
Other types of SEO
Ø International SEO: Optimizing a website for users in different countries and
speakers of different languages
Ø Multilingual SEO: Optimizing website content for multiple languages to improve
visibility and ranking on search engines
Ø White hat SEO: A safe optimization strategy to rank websites higher on search
engine results pages (SERPs)
Ø Black hat SEO: A tactic that uses keyword stuffing, spammy tactics, and abuses
Google's algorithms
Ø Grey hat SEO: A technique that falls between white hat and black hat SEO
To optimize your website for SEO (Search Engine Optimization),
Ø conduct thorough keyword research
Ø create high-quality content relevant to those keywords
Ø optimize your website structure
Ø use relevant meta tags
Ø ensure mobile responsiveness
Ø monitor your website's performance using analytics tools like Google Search
Console to identify areas for improvement.
Essentially, this means making your site easily understood and valuable to search
engines like Google, leading to higher rankings in search results.
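A rough sketch of automating a couple of the on-page checks above, assuming the third-party requests and beautifulsoup4 packages; "https://example.com" is a placeholder URL.

```python
# Sketch of a basic on-page SEO audit: fetch a page and verify that a
# title, meta description, and mobile viewport tag are present.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

title = soup.find("title")
description = soup.find("meta", attrs={"name": "description"})
viewport = soup.find("meta", attrs={"name": "viewport"})

print("title present:", title is not None and bool(title.get_text(strip=True)))
print("meta description present:", description is not None)
print("mobile viewport tag present:", viewport is not None)
```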
"Website crawling and indexing" refers to the process where
Ø search engine bots, also called "crawlers" or "spiders", automatically discover and
explore web pages across the internet
Ø then store and organize the content of those pages in a searchable database called
an "index",
Ø allowing them to display relevant results when users search for information online

A bot is a software program that performs tasks automatically,
often imitating human behavior.
Bots can be used for both helpful and harmful purposes.
Ø A web crawler is a software program that follows all the links on a page, leading to new
pages, and continues that process until it has no more new links or pages to crawl.
Ø Also called spider, it is a type of bot that is typically operated by search engines like
Google and Bing. Their purpose is to index the content of websites all across the
Internet so that those websites can appear in search engine results.
Google’s web crawler is named Googlebot.
Ø Google uses an initial “seed list” of trusted websites that tend to link to many other
sites.
Ø They also use lists of sites they’ve seen in past crawls as well as sitemaps submitted by
website owners.
Ø The mission is to catalog the entire Internet
ü Basic crawling is the simplest form of web crawling technique.
ü It involves the crawler visiting web pages and following links to other pages within
the same domain.
ü The crawler starts from a seed URL and recursively explores the website, extracting
relevant data from each page it encounters.
Crawling algorithms
v Web crawling algorithms are automated processes that systematically explore the
internet, following links and extracting data from websites. Common crawling
algorithms:
Ø general-purpose crawling
Ø focused crawling
Ø distributed crawling.
v The basic web crawling algorithm is simple:
Ø Given a set of seed Uniform Resource Locators (URLs), a crawler downloads all the
web pages addressed by the URLs, extracts the hyperlinks contained in the pages,
and iteratively downloads the web pages addressed by these hyperlinks.
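A minimal sketch of that loop, assuming the third-party requests and beautifulsoup4 packages and a placeholder seed URL. A production crawler would also handle the robots.txt, politeness, and error-recovery concerns discussed below.

```python
# Basic crawling loop: download pages starting from seed URLs, extract
# hyperlinks, and iteratively follow them (breadth-first, capped for safety).
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)   # URLs waiting to be visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue              # skip unreachable pages
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a", href=True):
            frontier.append(urljoin(url, link["href"]))  # resolve relative links
    return visited

print(crawl(["https://example.com"]))  # placeholder seed URL
```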
General-Purpose Crawling:
v Description: This is the most basic type of web crawling, where a crawler follows
links from a starting point (seed URL) and explores the web without any specific
focus or constraints.
v Use Cases: Used by search engines like Google and Bing to index the entire web
and by applications that need to collect a broad range of data.
v Example: Googlebot, Amazonbot.
Focused Crawling:
v Description:
v Focused crawlers are designed to target specific topics or domains, improving
efficiency and relevance by focusing on relevant pages.
v Use Cases:
v Used for building domain-specific databases, monitoring specific websites, and
collecting data for particular research or applications.
Algorithms:
Ø Best-first search: Prioritizes URLs based on their estimated relevance to the target
topic (a sketch follows this list).
Ø Fish search algorithm: Simulates crawling as a school of fish: agents multiply and
keep exploring where relevant pages (food) are found, and die off in regions with
no relevant content.
Ø Adaptive crawling: Crawlers that can change their strategies based on the target
websites' behavior and structure.
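A sketch of the best-first frontier ordering referenced above: candidate URLs are popped by estimated relevance rather than in discovery order. The anchor-text keyword-overlap score is a deliberately simple stand-in for a real relevance model, and the URLs are placeholders.

```python
# Best-first frontier for a focused crawler: a priority queue ordered by
# estimated topical relevance (here, keyword overlap with anchor text).
import heapq

TOPIC_KEYWORDS = {"climate", "warming", "emissions"}

def score(anchor_text: str) -> int:
    return len(TOPIC_KEYWORDS & set(anchor_text.lower().split()))

frontier = []  # max-heap simulated by negating scores
for url, anchor in [
    ("https://example.com/sports", "football results"),
    ("https://example.com/climate", "climate warming report"),
    ("https://example.com/policy", "emissions policy news"),
]:
    heapq.heappush(frontier, (-score(anchor), url))

while frontier:
    neg_score, url = heapq.heappop(frontier)
    print(url, "relevance:", -neg_score)  # most relevant URL comes out first
```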
Distributed Crawling:
Description:
v This approach involves multiple crawlers working independently to explore the web,
improving scalability and robustness.
Use Cases:
v Used for large-scale web crawling projects where a single crawler may not be
sufficient.
Process:
v A central server coordinates the crawlers, assigning them different sections of the
web to explore.
Web crawling faces many challenges, including:

Ø Anti-scraping techniques: Websites use CAPTCHAs and other security measures to
prevent automated access.
Ø Page structure changes: Frequent changes to page structure can make it hard for web
crawlers to select elements correctly.
Ø Dynamic content: Websites that use dynamic content, like JavaScript, can be difficult
for traditional web crawlers to index.
Ø IP blocking: Websites can block IP addresses that make too many requests to the site.
Ø Scalability: As data needs grow, so does the complexity of managing, storing,
and processing it.
Ø Bandwidth: Web crawlers can consume a lot of server bandwidth when visiting
large websites.
Ø Crawling policies: Crawlers must adhere to website terms of use, such as
robots.txt files (see the sketch after this list).
Ø Ethical considerations: Web crawlers must comply with legal and ethical
guidelines.
Ø Managing large-scale data: Web crawlers must handle large amounts of data
efficiently.
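As an example of honoring a crawling policy, Python's standard library can parse robots.txt and answer per-URL permission checks. The URL and user-agent name below are placeholders.

```python
# Checking a crawling policy: robots.txt tells crawlers which paths
# they may fetch for a given user agent.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

if rp.can_fetch("MyCrawlerBot", "https://example.com/some/page.html"):
    print("allowed to crawl this page")
else:
    print("disallowed by robots.txt; skip this page")
```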
Ranking algorithms
§ Search engines use ranking algorithms to determine the order of search results for
a given query.
§ Some examples of ranking algorithms include PageRank, HITS, and Tagrank.

Factors considered for ranking:

Ø Keyword relevance: How well a page's content matches the user's query
Ø Page authority: How authoritative a page is
Ø Link structures: How the links on a page are structured
Ø Content quality: How high quality the page's content is
PageRank
ü A Google algorithm that estimates the importance of a web page by counting the
number and quality of links to it
ü It is named after both the term "web page" and co-founder Larry Page.
ü PageRank is a way of measuring the importance of website pages
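A compact sketch of PageRank via power iteration on a tiny hand-made link graph: a page's score is spread evenly over its outgoing links, with a damping factor (0.85 is the conventional choice) modelling random jumps.

```python
# PageRank by power iteration on a toy graph: A links to B and C,
# B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
damping, n = 0.85, len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):  # iterate until the scores stabilise
    new_rank = {}
    for page in links:
        incoming = sum(rank[p] / len(outs)
                       for p, outs in links.items() if page in outs)
        new_rank[page] = (1 - damping) / n + damping * incoming
    rank = new_rank

print(rank)  # C receives links from both A and B, so it ranks highest
```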
HITS (Hyperlink-Induced Topic Search)
ü A link-analysis algorithm that gives each page two scores: a hub score (how well
it links out to good sources) and an authority score (how well it is linked to by
good hubs)
ü Good hubs point to good authorities, and good authorities are linked from good hubs
Tagrank
ü A ranking algorithm that uses document and query features to maximize the
rating of search engine results pages (SERPs)
Ø Ranking by similarity, distance, preference, and probability are the most common
types of ranking algorithms.
Ø Ranking by probability can be the most robust type of ranking algorithm because it
takes into account the uncertainty in the data.
SERP stands for Search Engine Results Page. It's the page that appears after you
search for something on a search engine.
How do SERPs work?
Ø Search engines use algorithms to rank search results.
Ø The ranking takes into account factors like the quality of the content, the authority
of the source, and the user's location.
Ø The order of results on a SERP is called the SERP ranking.
Ø The higher the ranking, the closer the URL is to the top of the page.
Ø SERPs can include organic results, paid search ads, images, videos, and more.
Web traffic models
o Web traffic models are mathematical or statistical representations of how data is
exchanged between users and websites, useful for analyzing network performance,
predicting traffic patterns, and optimizing website infrastructure.
o Web traffic models aim to simulate and understand the flow of data (packets,
requests, responses) on the internet, specifically related to website usage.
o Website Optimization: Predicting traffic patterns can help optimize website
performance, reduce latency, and improve user experience.
Website traffic comes in various flavors:
ü Direct traffic occurs when people directly type your website's address
ü Organic traffic comes from search engines
ü Referral traffic arrives from other websites
ü Social traffic originates from social media
ü Email traffic comes from email
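A toy sketch of how a visit might be bucketed into these types from its HTTP referrer. The rules are simplified stand-ins; real analytics tools also rely on campaign tags such as utm_source and landing-page context.

```python
# Classify a visit into a traffic type from its HTTP referrer (toy rules).
def classify_visit(referrer: str) -> str:
    if not referrer:
        return "direct"      # address typed in, or a bookmark was used
    if any(s in referrer for s in ("google.", "bing.", "duckduckgo.")):
        return "organic"     # arrived from a search engine
    if any(s in referrer for s in ("facebook.", "twitter.", "linkedin.")):
        return "social"
    if "mail." in referrer or "outlook." in referrer:
        return "email"
    return "referral"        # any other website

print(classify_visit(""))                          # direct
print(classify_visit("https://www.google.com"))    # organic
print(classify_visit("https://blog.example.org"))  # referral
```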
Why they are important:

§ Performance Analysis: They help assess the capacity and efficiency of networks,
servers, and protocols.
§ Capacity Planning: By understanding traffic patterns, network administrators can
anticipate future needs and allocate resources accordingly.
§ Optimization: Models can identify bottlenecks and suggest improvements to
website infrastructure and network configurations.
§ Forecasting: They can be used to predict future traffic trends, allowing for proactive
measures to prevent overload or improve user experience.

Types of Models:
Ø Poisson Process: A simple model where events (e.g., web requests) occur
independently at a constant average rate (see the sketch below).
Ø Self-Similarity: A more complex model that acknowledges that traffic patterns can
exhibit long-range dependencies, meaning that short-term fluctuations can influence
long-term trends.
Ø Markov-Modulated Traffic Models: These models use a Markov process to capture
the dynamic nature of traffic, where the underlying traffic mechanism is influenced
by an auxiliary process.
Ø Spatiotemporal Models: These models consider both the location and time
dependencies of traffic data.
ü A Markov process is a random process indexed by time, and with the property that
the future is independent of the past, given the present. Markov processes, named
for Andrei Markov, are among the most important of all random processes.
ü A common example of a Markov chain in action is the way Google predicts the next
word in your sentence based on your previous entry within Gmail.
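A small simulation of the Poisson model mentioned above: request arrivals are generated with exponential inter-arrival gaps at an assumed average rate, then counted per one-second window.

```python
# Simulate a Poisson process of web requests: exponential inter-arrival
# times at a constant average rate imply Poisson counts per window.
import random
from collections import Counter

rate = 5.0            # assumed average requests per second
t, horizon = 0.0, 60.0
arrivals = []
while True:
    t += random.expovariate(rate)  # exponential gap to the next request
    if t >= horizon:
        break
    arrivals.append(t)

per_second = Counter(int(a) for a in arrivals)
print("mean requests/second:", len(arrivals) / horizon)  # close to `rate`
print("busiest second saw", max(per_second.values()), "requests")
```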

v The aim is to improve and optimise the website in every respect by analysing the
different factors involved.
BEST WISHES
