IRS Unit-5
--> Searching the Web: Challenges
This section discusses the challenges of searching the Web and how the nature
of Web data complicates efficient retrieval. The main points:
Exponential Growth: The Web has exploded in size since its inception in the
late 1980s, with vast amounts of textual data, images, audio, and video now
available.
Need for Efficient Tools: The explosion of available data on the Web and
intranets necessitates the development of tools to manage, retrieve, and filter
information efficiently.
Data vs. Information Retrieval: Data retrieval returns items that exactly match
a query, whereas information retrieval aims to retrieve data relevant to a
user's specific information need; the latter is the focus here.
Search Engines: These index large portions of the Web, making it searchable
via keywords or patterns.
Other challenges concern the interaction between users and search systems
rather than the raw data itself; they are only mentioned briefly in this
section.
Scaling Issues: The growth in the size of the Web is difficult to manage
effectively with current search technologies.
Measuring the scale and characteristics of the Web is inherently difficult due
to its dynamic and constantly changing nature. Key points and challenges:
Global Connectivity: As of the late 1990s, there were more than 40 million
computers connected to the Internet across over 200 countries, many of
which hosted Web servers.
Estimating Web Servers: The number of Web servers was estimated to range
from 2.4 million (according to NetSizer, 1998) to over 3 million (according to
the Netcraft survey, 1998). The variation can be explained by:
1. Shared servers: Many websites share the same Web server through a
method called virtual hosting.
2. Inaccessibility and Provisional Sites: Not all servers or websites are
fully accessible, and some may be temporary or "under construction."
Institutions and Web Servers: The number of institutions (not Web servers)
maintaining Web data is smaller than the number of servers, as many
institutions operate multiple servers. However, the exact number of institutions
is unknown but was estimated to be over 40% of the number of Web servers in
1995.
Number of Web Pages: The exact number of Web pages is hard to estimate.
Estimates in early 1998 ranged from 200 million to 320 million pages, and by
July 1998 the figure was put at around 350 million.
Bray and Woodruff's Studies: These studies provide valuable data about Web
characteristics, such as the number of pages, links, and institutions. While their
data is outdated by today's standards, they were some of the first to explore the
statistical properties of the Web.
Growth: The Web was growing rapidly in the late 1990s, with new servers and
websites appearing daily. This exponential growth made it difficult to keep
measurements accurate and up to date.
Key Takeaways:
The Web was estimated to have millions of Web servers and hundreds of
millions of Web pages by the late 1990s.
The exact numbers were difficult to pinpoint due to the Web’s dynamic and
decentralized nature, with many sites shared across multiple servers and
content changing frequently.
SEARCH ENGINES
This section discusses the architecture of search engines and their role in
retrieving information from the Web, which is treated as a massive full-text
database. Unlike traditional information retrieval (IR) systems, Web search
engines face unique challenges due to the size, dynamic nature, and structure of
the Web.
These constraints (size, dynamics, and structure) affect the indexing
algorithms, searching methods, and query languages used by search engines.
Crawler-Indexer Architecture:
Crawlers: These are programs or software agents that traverse the Web by
sending requests to Web servers and retrieving pages. They don’t actually run
on remote machines but instead send requests to Web servers to fetch
documents.
Indexers: Once pages are retrieved by the crawlers, they are processed and
indexed in a centralized location. The index is a database that stores metadata
about the content, such as keywords, page structure, and links, allowing the
search engine to quickly match queries to relevant pages.
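The crawler-indexer pipeline described above can be sketched in a few lines; the in-memory pages and links below are hypothetical stand-ins for real HTTP fetching.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP instead.
WEB = {
    "a.example": ("welcome to the example home page", ["b.example"]),
    "b.example": ("search engines index web pages", ["a.example", "c.example"]),
    "c.example": ("crawlers traverse the web", []),
}

def crawl(seed):
    """Breadth-first traversal of the Web graph starting from a seed URL."""
    seen, queue, pages = set(), deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in WEB:
            continue
        seen.add(url)
        text, links = WEB[url]
        pages[url] = text
        queue.extend(links)
    return pages

def build_index(pages):
    """Inverted index: keyword -> set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.split():
            index[word].add(url)
    return index

index = build_index(crawl("a.example"))
print(sorted(index["web"]))  # pages whose text contains "web"
```

A real indexer would also store page structure and link metadata, as described above; this sketch keeps only a keyword-to-URL mapping.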
Examples of early crawler-indexer search engines:
o AltaVista
o HotBot
o Northern Light
o Excite
These engines were estimated to cover about 28-55% or 14-34% of the
Web pages, depending on the study. The total number of Web pages was
estimated to be over 300 million by that time.
Different Search Engine Approaches:
Most major search engines (e.g., AltaVista, HotBot, Excite) were based
in the United States and primarily focused on English-language
content.
However, there were also search engines designed for specific countries
or languages.
Challenges in Crawling:
Freshness of Pages: keeping the indexed copy of each page up to date as pages
change or disappear.
Crawl Speed: modern crawlers can visit up to 10 million pages per day but
must manage server load and crawl depth to avoid overwhelming websites.
Crawling Order: deciding which pages to fetch first, since not everything can
be crawled at once.
Common errors in user queries:
1. Spelling Errors
"accomodation" → "accommodation"
"definately" → "definitely"
2. Phonetic Misspellings
"expierience" → "experience"
"liscence" → "license"
3. Contractions or Abbreviations
"u" → "you"
"ur" → "your" or "you're"
"tho" → "though"
Problem: "Go" could refer to the ancient board game (Go) or the verb meaning
to move. Without context, the search engine will struggle to provide accurate
results, often favoring the more common usage (the verb "go").
Impact: Inaccurate results, leading the user to irrelevant pages, like those about
travel or movement, instead of the game Go.
Example Search: Searching "Jaguar speed" might return results for Jaguar
cars, the Atari video game, or network servers rather than information about
the jaguar animal.
Solution: Adding "cat" might improve the search results, but users may still
face confusion due to the multiple meanings of "jaguar."
F. Query Modifications:
Search Term:
Problem:
G. Proximity Search:
· Search Result 1: "A grocery store sells many fruits, including apples,
bananas, and oranges."
Here, "apple" and "store" are not related to the Apple Store at all.
· Search Result 2: "The Apple Store in New York is a great place to purchase
products."
Here, "apple" and "store" are related and appear close together.
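The proximity idea can be sketched as a positional check: two terms match a document only if they occur within k words of each other. The two documents below are the hypothetical search results above.

```python
def near(text, term1, term2, k=2):
    """True if term1 and term2 occur within k words of each other."""
    words = [w.strip('.,').lower() for w in text.split()]
    pos1 = [i for i, w in enumerate(words) if w == term1]
    pos2 = [i for i, w in enumerate(words) if w == term2]
    return any(abs(i - j) <= k for i in pos1 for j in pos2)

doc1 = "A grocery store sells many fruits, including apples, bananas, and oranges."
doc2 = "The Apple Store in New York is a great place to purchase products."

print(near(doc1, "apple", "store"))  # False: "apples" != "apple", terms unrelated
print(near(doc2, "apple", "store"))  # True: the terms are adjacent
```

Real engines implement this over positional inverted indexes rather than scanning raw text, but the matching condition is the same.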
This section focuses on advanced techniques for searching the Web by using
hyperlink structure, alongside the traditional content-based search methods.
Here's an overview of the main ideas discussed, along with examples.
Traditional search engines index and query the content of web pages, but Web
query languages go further by querying the link structure of the Web,
enabling users to search for pages based not only on their content but also on
the way they are connected to other pages via hyperlinks.
A query language designed to search both the content of web pages and their
link structures. For example, a query might ask for all pages that contain a
specific image and are linked from a particular website up to a certain number
of links.
Example Query:
Find pages with the word "robot" and an image that are reachable from
example.com via three links.
WebSQL:
Another query language that combines SQL-style queries with web page
structure, allowing users to search not only for page content but also for links
between pages.
Example Query:
Find all pages that contain "technology" and are linked from a page that
contains "AI".
A query language that allows searching for specific paths within the hyperlink
structure using path regular expressions. This means a query can focus on the
route or sequence of links that lead to a page, not just the page itself.
Example Query:
Search for pages that contain "climate change" and are located within 2
links of a page about "environmental policies."
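Queries that combine page content with link distance, like the examples above, can be emulated by a bounded breadth-first search over a link graph; the graph, URLs, and page texts below are hypothetical.

```python
from collections import deque

# Hypothetical link graph and page texts.
LINKS = {
    "policy.example": ["report.example", "news.example"],
    "report.example": ["science.example"],
    "news.example": [],
    "science.example": [],
}
TEXT = {
    "policy.example": "overview of environmental policies",
    "report.example": "annual report",
    "news.example": "climate change in the news",
    "science.example": "climate change projections",
}

def within_links(start, term, max_dist):
    """Pages containing `term` reachable from `start` in at most max_dist links."""
    hits, seen = [], {start}
    queue = deque([(start, 0)])
    while queue:
        url, dist = queue.popleft()
        if term in TEXT.get(url, ""):
            hits.append(url)
        if dist < max_dist:
            for nxt in LINKS.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return sorted(hits)

# Pages about "climate change" within 2 links of the policy page.
print(within_links("policy.example", "climate change", 2))
```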
WebLog:
A declarative, logic-based language for querying and restructuring Web
documents.
Example Query:
Extract a list of all articles about "machine learning" from web pages
where the content includes headings, images, and external links to
related resources.
FLORID:
A language used for data extraction and integration from multiple sources
across the Web. It supports both content and link structure queries and can
generate new web structures based on the queried data.
Example Query:
Find and integrate information about "data science" from multiple sites,
including articles, tutorials, and videos, and present it in a structured
form.
A query language that allows querying not just the content but also the internal
structure of web pages. It can work with the semi-structured nature of web
data, handling HTML and other markup languages.
Example Query:
Retrieve all tables from a web page that contain data on "global
temperature rise."
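Querying the internal structure of a page, as described above, can be sketched with Python's built-in HTML parser: the snippet collects the text of each table element and keeps those mentioning a phrase. The HTML fragment is a made-up example.

```python
from html.parser import HTMLParser

class TableFinder(HTMLParser):
    """Collect the text content of each <table> element in a page."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # current nesting level inside <table>
        self.current = []   # text pieces of the table being read
        self.tables = []    # text of every finished top-level table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.tables.append(" ".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data.strip())

HTML = """
<h1>Climate report</h1>
<table><tr><td>global temperature rise</td><td>1.1 C</td></tr></table>
<table><tr><td>sea level</td><td>3.3 mm/yr</td></tr></table>
"""

finder = TableFinder()
finder.feed(HTML)
matches = [t for t in finder.tables if "global temperature rise" in t]
print(len(matches))  # 1: only the first table mentions the phrase
```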
Link-Based Search:
Example: Searching for all pages related to climate change that are linked
from educational institutions or government sites can yield more credible and
focused results.
Dynamic Searching:
Dynamic searching refers to the idea that search queries can evolve in real-time
based on the user's needs and the structure of the Web. For example, search
engines can prioritize pages based on their link popularity or freshness.
Example: A search engine might prioritize pages that have been linked to
recently or are at the center of a topic based on a network of connected pages,
helping users find more up-to-date or relevant information.
4. Challenges:
In an information retrieval system, extracting data from text is a critical task for
processing, indexing, and retrieving relevant information. The process of
extracting meaningful data involves several steps, from initial text parsing to
applying various extraction techniques like keyword identification, named
entity recognition (NER), and syntactic parsing.
Here’s a breakdown of the common steps involved in extracting data from text
in an information retrieval system:
Before any meaningful extraction can happen, the raw text typically needs to
be preprocessed. Common preprocessing steps include:
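As a sketch, typical preprocessing (tokenization, lowercasing, stopword removal, and a crude suffix stemmer) might look like this; the stopword list and suffix rules are minimal illustrations, not a full stemmer.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # tiny sample list

def preprocess(text):
    """Tokenize, lowercase, drop stopwords, and apply a crude suffix stemmer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    stemmed = []
    for t in tokens:
        # Strip one common suffix, but only if a reasonable stem remains.
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The indexing of documents is a critical task"))
```

Production systems would replace the suffix loop with a proper stemmer or lemmatizer, but the stages shown here are the standard pipeline order.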
4. Relationship Extraction
7. Advanced Techniques
For more complex data extraction tasks, advanced techniques such as the
following can be used:
8. Domain-Specific Refinements
Example:
Similarity Metrics
Cosine Similarity: Measures the cosine of the angle between two vectors
(user/item profiles).
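Cosine similarity can be computed directly from two profile vectors; the vectors below are hypothetical user/item profiles.

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

user = [1, 0, 2]   # e.g. weights over three hypothetical features
item = [2, 0, 4]
print(cosine(user, item))  # 1.0: the vectors point in the same direction
```

Because it depends only on direction, cosine similarity ignores differences in overall rating scale between users, which is why it is popular for profile comparison.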
Pros
Challenges
2. Content-Based Recommendation
Feature Extraction
o Example:
User Profile
o Example:
Similarity Metrics
o Example:
o Example:
E-commerce websites: Products are recommended by
comparing item attributes like price or rating using
Euclidean distance.
BERT Embeddings: Advanced natural language processing techniques
to measure semantic similarity.
o Example:
Pros
Challenges
Limited Serendipity: The system may recommend items that are too
similar, reducing the discovery of new or diverse content.
Methods of Combining
Weighted Hybrid: Combines CF and content-based recommendations
with assigned weights.
o Example:
Netflix: Uses both user-item interactions (CF) and content-based data (e.g.,
movie genre, cast) to recommend content, adjusting the influence of each
approach based on the user’s preferences.
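A weighted hybrid can be sketched as a linear blend of the two scores; the candidate scores and the 0.6/0.4 weights below are hypothetical.

```python
def hybrid_score(cf_score, content_score, w_cf=0.6, w_content=0.4):
    """Blend collaborative-filtering and content-based scores linearly."""
    return w_cf * cf_score + w_content * content_score

# Hypothetical normalized (cf_score, content_score) pairs for three movies.
candidates = {
    "Movie A": (0.9, 0.3),   # liked by similar users, weak content match
    "Movie B": (0.4, 0.95),  # strong content match, weaker CF signal
    "Movie C": (0.5, 0.5),
}

ranked = sorted(candidates,
                key=lambda m: hybrid_score(*candidates[m]),
                reverse=True)
print(ranked)  # ['Movie A', 'Movie B', 'Movie C']
```

Adjusting the weights per user, as the Netflix example suggests, amounts to making w_cf and w_content functions of how much interaction history that user has.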
o Example:
o Example:
o Example:
Benefits
4. Practical Applications
Document Recommendations
News Websites:
Research Papers:
Product Recommendations
E-commerce Websites:
5. Summary of Approaches
PageRank Algorithm
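The PageRank idea can be sketched with power iteration over a toy link graph, assuming a damping factor of 0.85 and no dangling pages; the graph itself is hypothetical.

```python
# Toy link graph: page -> pages it links to (every page has out-links).
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank with damping factor d."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)   # rank is split among out-links
            for q in outs:
                new[q] += d * share
        rank = new
    return rank

ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))  # C: it receives links from both A and B
```

Real implementations must also handle dangling pages (no out-links) and run on sparse matrix representations, but the fixed-point iteration is the same.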
8. Alexa Rank
10. TrustRank
Summary of Examples