
WEB DATA MINING AND RECOMMENDATION

--> Searching the Web: Challenges

This section discusses the challenges involved in searching the Web and how the nature of Web data complicates efficient retrieval. The main points are summarized below:

1. The Growth and Nature of Web Data

Exponential Growth: The Web has exploded in size since its inception in the
late 1980s, with vast amounts of textual data, images, audio, and video now
available.

Unstructured Data: The Web is a largely unstructured, decentralized database that lacks consistency or inherent organization, making data retrieval complex.

2. The Challenge of Searching the Web

Need for Efficient Tools: The explosion of available data on the Web and
intranets necessitates the development of tools to manage, retrieve, and filter
information efficiently.

Data vs. Information Retrieval: Data retrieval returns items that match a query exactly, while information retrieval, the focus here, aims to retrieve data that is relevant to a user's specific information need.

3. Search Types on the Web

Syntactic Search: Focuses on searching for specific words or patterns in Web documents, but it may not capture the intrinsic semantics (meaning) of the text. Natural language processing (NLP) attempts to extract semantics but is computationally expensive and not fully effective at scale.

Types of Web Searching:

Search Engines: These index large portions of the Web, making it searchable
via keywords or patterns.

Web Directories: These are human-curated lists of Web pages, organized by subject or category.

Hyperlink-Based Search (still evolving): This form of search exploits the Web's hyperlink structure, potentially providing a richer search experience by analyzing connections between documents.

4. Challenges with Data


Distributed Data: Web data is distributed across countless servers with
varying bandwidth and reliability, making it difficult to manage and search
efficiently.

Volatile Data: Web content is highly dynamic, with websites frequently changing or disappearing (an estimated 40% of the Web changes every month). This leads to problems like dangling links (links that lead to non-existent pages) or relocation issues (URLs that change).

Large Volume of Data: The Web is growing exponentially, which presents scaling issues that are hard to handle with traditional search systems.

Unstructured and Redundant Data: Much Web content is poorly structured, with many HTML pages not following any strict guidelines. Additionally, a significant amount of content is duplicated or very similar (around 30% of Web pages are duplicates or near-duplicates), leading to semantic redundancy (similar or repetitive content spread across different sites).

Quality of Data: Unlike traditional publishing, there's no editorial control over Web content. This can lead to:

1. Inaccurate or outdated data


2. Poorly written content (including grammatical mistakes, typos, etc.)
3. Content errors (from misprints, OCR errors, etc.)
4. It's estimated that typo rates could be as high as 1 in 3 for certain types
of data (e.g., foreign surnames).

5. Problems Regarding User Interaction with Retrieval Systems

These problems are related to the interaction between users and search
systems, not just the raw data. They are briefly mentioned, but specifics are not
detailed in this section.

Summary of the Key Challenges:

Data Distribution and Volatility: The decentralized, dynamic, and distributed nature of Web data creates challenges in searching and indexing.

Scaling Issues: The growth in the size of the Web is difficult to manage
effectively with current search technologies.

Unstructured and Redundant Content: Most of the Web's data is unorganized and redundant, complicating efficient information retrieval.

Data Quality: The lack of editorial oversight results in inaccuracies, poor writing quality, and errors, all of which degrade search results.
CHARACTERIZING THE WEB

Measuring the Web: Challenges and Estimates

Measuring the scale and characteristics of the Web is inherently difficult due to
its dynamic and constantly changing nature. Here are the key points and
challenges outlined in the text:

1. Size and Scale of the Internet and the Web

Global Connectivity: As of the late 1990s, there were more than 40 million
computers connected to the Internet across over 200 countries, many of
which hosted Web servers.

Estimating Web Servers: The number of Web servers was estimated to range
from 2.4 million (according to NetSizer, 1998) to over 3 million (according to
the Netcraft survey, 1998). The variation can be explained by:

1. Shared servers: Many websites share the same Web server through a
method called virtual hosting.
2. Inaccessibility and Provisional Sites: Not all servers or websites are
fully accessible, and some may be temporary or "under construction."

 Sampling and Surveys:

o One method involved sampling 0.1% of all Internet numeric addresses, yielding about 2 million unique websites.
o Another approach counted domain names starting with "www",
which amounted to 780,000 in July 1998. However, since not all
Web servers use this prefix, the true number of websites was
likely higher.

2. Hosts and Web Servers

 The number of Internet hosts (computers connected to the Internet) in July 1998 was estimated at 36.7 million. Therefore, there was about one Web server for every ten computers connected to the Internet at the time.

3. Studies on Web Pages

 Two key studies focused on measuring and characterizing Web pages:


o Bray's Study (1995): Analyzed 1 million pages.
o Woodruff et al. (1995): Examined 2.6 million pages.
 These studies aimed to understand the structure, distribution, and other
statistical measures of the Web.
4. Challenges in Estimating Web Characteristics

Institutions and Web Servers: The number of institutions (not Web servers)
maintaining Web data is smaller than the number of servers, as many
institutions operate multiple servers. However, the exact number of institutions
is unknown but was estimated to be over 40% of the number of Web servers in
1995.

Number of Web Pages: The exact number of Web pages is hard to estimate.
However, estimates in early 1998 ranged from 200 million to 320 million
pages, with 350 million being considered the most accurate estimate as of July
1998.

This estimate was based on 20,000 random queries submitted to 4 search engines, which covered about 70% of the Web. The queries used a lexicon of 400,000 words extracted from the Yahoo! directory.

5. Statistical Analysis of Web Data

Bray and Woodruff's Studies: These studies provide valuable data about Web
characteristics, such as the number of pages, links, and institutions. While their
data is outdated by today's standards, they were some of the first to explore the
statistical properties of the Web.

Key Insights from the Data:

Growth: The Web was growing rapidly in the late 1990s, with millions of
servers and websites emerging daily. The exponential growth made it difficult
to keep accurate and up-to-date measurements.

Measurement Complexity: Estimating the number of Web pages, servers, or institutions was challenging due to factors like virtual hosting, inaccessible sites, the lack of a consistent structure, and the dynamic nature of Web content.

Key Takeaways:

The Web was estimated to have millions of Web servers and hundreds of
millions of Web pages by the late 1990s.

Measurement techniques were based on various sampling methods, such as counting domain names, examining search engine results, and using random queries.

The exact numbers were difficult to pinpoint due to the Web’s dynamic and
decentralized nature, with many sites shared across multiple servers and
content changing frequently.
Example of the Web's Scale (as of 1998):

 Web servers: Estimated to be between 2.4 million and 3 million.


 Web pages: Estimates ranged from 200 million to 350 million.
 Search coverage: A study based on 20,000 queries covered about 70%
of the Web at the time.

SEARCH ENGINES

Search Engines: Architecture and Challenges

This section discusses the architecture of search engines and their role in
retrieving information from the Web, which is treated as a massive full-text
database. Unlike traditional information retrieval (IR) systems, Web search
engines face unique challenges due to the size, dynamic nature, and structure of
the Web.

1. Key Difference Between Standard IR Systems and Web Search Engines

 In traditional IR systems, queries are answered by directly accessing and processing the underlying text or documents.
 Web search engines, however, can only access indices (pre-processed
data structures) instead of the actual Web pages at query time. This is
because:
o Storing a full copy of the Web for querying would be
prohibitively expensive.
o Accessing remote Web pages in real-time for each query would
be too slow.

This constraint affects the indexing algorithms, searching methods, and the
query languages used by search engines.

Centralized Architecture

Most search engines utilize a centralized crawler-indexer architecture, which consists of two main components: crawlers (also called robots, spiders, wanderers, walkers, or knowbots) and indexers.

Crawler-Indexer Architecture:

Crawlers: These are programs or software agents that traverse the Web by
sending requests to Web servers and retrieving pages. They don’t actually run
on remote machines but instead send requests to Web servers to fetch
documents.

Crawlers continuously discover new or updated Web pages.


They send the retrieved pages to a central indexer, which processes and
organizes the content for efficient retrieval.

Indexers: Once pages are retrieved by the crawlers, they are processed and
indexed in a centralized location. The index is a database that stores metadata
about the content, such as keywords, page structure, and links, allowing the
search engine to quickly match queries to relevant pages.

Key Challenges of Centralized Architecture:

 Web's Dynamic Nature: The Web is highly dynamic, with pages frequently being updated, moved, or deleted. This makes it difficult for crawlers to keep up with changes.
 Network Congestion: The Web’s vastness and the communication links
between servers can lead to saturated links and high server load,
which can slow down crawling and indexing processes.
 Volume of Data: The exponential growth of the Web creates a huge
volume of data. Search engines need to handle a continuously
increasing amount of content, which can strain infrastructure.
 Load Balancing: Effective load balancing is crucial for managing the
internal processes (like answering queries and indexing) and external
tasks (like crawling) of a search engine to ensure smooth performance.

AltaVista Example (1998):

 In 1998, AltaVista was one of the leading search engines. It used a centralized crawler-indexer architecture and ran on 20 multi-processor machines, which together had over 130 GB of RAM and 500 GB of disk space. The query engine consumed over 75% of the system's resources.
 Despite its massive infrastructure, AltaVista faced challenges related to crawling the Web efficiently and indexing the growing volume of content.

Search Engine Coverage (1998):

 By June 1998, some of the largest search engines included:

o AltaVista
o HotBot
o Northern Light
o Excite
 These engines were estimated to cover about 28-55% or 14-34% of the
Web pages, depending on the study. The total number of Web pages was
estimated to be over 300 million by that time.
Different Search Engine Approaches:

 Most major search engines (e.g., AltaVista, HotBot, Excite) were based
in the United States and primarily focused on English-language
content.
 However, there were also search engines designed for specific countries
or languages. For example:

o Search engines that could handle Kanji (for Chinese, Japanese, and Korean content).
o Ask Jeeves!: A search engine that simulated an interview, where
users could ask questions in natural language.
o DirectHit: A search engine that ranked results based on
popularity (i.e., the number of users clicking on a particular
page).
o DejaNews: Focused on searching USENET archives, a
precursor to modern forums and discussion boards.
 Some search engines were also topic-specific, such as Search Broker, which allowed users to search across a wide range of specialized topics.
Summary of Search Engine Challenges and Features:

 Centralized Crawling and Indexing: The most common search engine architecture in the 1990s involved centralized crawlers that traversed the Web and indexed pages in a central database.
 Scalability Issues: Search engines had to deal with the increasing size
of the Web and the dynamic nature of Web content, which posed
challenges for efficient indexing and retrieval.
 Diverse Search Approaches: While many search engines were based in
the U.S. and focused on English-language content, there were also
search engines tailored to specific languages, regions, and types of
content (e.g., USENET or specific topics).
 Performance and Load Balancing: As Web traffic and the volume of
content increased, balancing the load between crawling, indexing, and
query processing became a critical concern for maintaining search
engine performance.

Example: Search Engine Resources (AltaVista, 1998):

 20 multi-processor machines with 130 GB of RAM and 500 GB of disk space.
 Over 75% of resources were dedicated to the query engine, which highlights the significant computational resources required to handle a large-scale search engine operation.
KEY POINTS ON CRAWLING THE WEB

CRAWLING TECHNIQUES: The fundamental task in crawling the Web is discovering new URLs and retrieving their content. Different strategies can be employed to decide how to traverse the Web (a minimal crawler sketch follows the list below):

1. Breadth-First Crawling: Explores all linked pages from the current page before moving on. It covers a wide but shallow area of the Web.
2. Depth-First Crawling: Follows one link to the deepest level
before backtracking. It offers a narrow but deep exploration.
3. Partitioned Crawling: Divides the Web into regions (e.g., by
country or domain) and assigns crawlers to each region to avoid
duplication and improve efficiency.
4. Starting with Popular URLs: Crawlers begin with highly
requested or popular pages, ensuring important content is indexed
first.
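
As a rough illustration of the breadth-first strategy above, the sketch below organizes the crawl around a FIFO queue of URLs. The fetch_page and extract_links helpers are hypothetical placeholders supplied by the caller; a real crawler would use an HTTP client, respect robots.txt, and throttle requests per server.

from collections import deque

def bfs_crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    """Breadth-first crawl: visit all links of the current page before going deeper."""
    queue = deque(seed_urls)          # FIFO queue -> breadth-first order
    visited = set(seed_urls)
    pages = {}                        # url -> page content handed to the indexer

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)    # hypothetical helper: e.g. an HTTP GET with a timeout
        except Exception:
            continue                  # broken or dangling link: skip it
        pages[url] = html
        for link in extract_links(html, base_url=url):   # hypothetical link extractor
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return pages

A depth-first variant would simply pop from the right end of the same queue (a LIFO stack), trading breadth for depth.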

Challenges in Crawling:

1. Dynamic Web: Pages change frequently, leading to challenges in keeping the index up-to-date. Crawlers may encounter outdated or broken links (2–9% of links can be invalid).
2. Coordinating Multiple Crawlers: Multiple crawlers may
overlap in their work, leading to inefficiency or server overload.

Freshness of Pages:

1. Pages in the index can be from 1 day to 2 months old. Search engines show the index date to indicate how fresh the content is.
2. Popular and frequently updated pages are crawled more often.

Crawl Speed:
1. Modern crawlers can visit up to 10 million pages per day but
must manage server load and crawl depth to avoid overwhelming
websites.

Crawling Order:

1. Better ordering of crawled pages (e.g., using PageRank) can improve efficiency by prioritizing important pages.

Speed of Crawling

 Modern crawlers are capable of crawling millions of pages per day. The fastest crawlers can now traverse up to 10 million Web pages per day. However, the speed of crawling is affected by several factors, such as:
o Server load: Crawlers must be careful not to overload servers
with too many requests.
o Crawl depth and breadth: Crawling deeply or broadly can
impact the speed and efficiency of the process.

FINDING A NEEDLE IN THE HAYSTACK

Key Points on User Problems in Web Search

User Understanding Issue:

A. Lack of Semantics: Users often don't understand how search engines interpret words. For example, searching for "bank" instead of "Bank" may lead to results that lose important semantics (case sensitivity and variations in spelling).
· Searching for "bank" might return results related to financial institutions or a physical object like a riverbank.
· Searching for "Bank" could prioritize proper-noun usage, such as a specific bank's name (e.g., "Bank of America").

B. Typos and Variations: Typos and difficult-to-spell words can cause significant loss of relevant results (up to 50%).

1.Spelling Errors

 "accomodation" → "accommodation"
 "definately" → "definitely"

2.Phonetic Misspellings

 "expierience" → "experience"
 "liscence" → "license"
3.Contractions or Abbreviations

 "u" → "you"
 "ur" → "your" or "you're"
 "tho" → "though"

4.Incorrect Use of Homophones

 "there" → "their" or "they're"


 "to" → "too" or "two"
 "affect" → "effect" (depending on context)
 "its" → "it's" (depending on context)

C. Boolean Logic Confusion: Many users struggle with Boolean operators (AND, OR), and often don't use them properly, resulting in ineffective searches. Around 80% of users don't use Boolean logic in their queries. (A short sketch of how these operators evaluate against an index follows the operator list below.)

Key Boolean Operators:

AND: Narrows search results by requiring that all search terms be present in the results.

1. Example: "cats AND dogs"
This will return results that include both cats and dogs.

OR: Broadens search results by allowing either of the search terms to appear in the results.

1. Example: "cats OR dogs"
This will return results that include either cats or dogs (or both).

NOT: Excludes specific terms from the search results.

1. Example: "cats NOT dogs"
This will return results that mention cats, but not dogs.

Quotation Marks (" "): Used to search for exact phrases.

1. Example: "cute cats"
This will return results where the exact phrase "cute cats" appears.

Parentheses ( ): Used to group terms and control the order of operations, especially for combining operators.

1. Example: "cats AND (dogs OR rabbits)"
This will return results that mention cats and either dogs or rabbits.
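
As a rough sketch of how these operators behave against an index, the toy example below treats each operator as a set operation over posting lists (sets of document ids per term). It assumes documents have already been tokenized by a naive whitespace split; real engines work with sorted posting lists and merge algorithms.

def build_postings(docs):
    """Map each term to the set of document ids that contain it."""
    postings = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings.setdefault(term, set()).add(doc_id)
    return postings

docs = {
    1: "cats and dogs live together",
    2: "cats are independent",
    3: "dogs are loyal",
    4: "rabbits and cats",
}
p = build_postings(docs)

# "cats AND dogs": documents containing both terms
print(p["cats"] & p["dogs"])                    # {1}
# "cats OR dogs": documents containing either term
print(p["cats"] | p["dogs"])                    # {1, 2, 3, 4}
# "cats NOT dogs": documents with cats but not dogs
print(p["cats"] - p["dogs"])                    # {2, 4}
# "cats AND (dogs OR rabbits)": grouping controls evaluation order
print(p["cats"] & (p["dogs"] | p["rabbits"]))   # {1, 4}
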
D. Query Behavior:

E. Polysemy (Multiple Meanings of a Word):

Search Term: "Go"

Problem: "Go" could refer to the ancient board game (Go) or the verb meaning
to move. Without context, the search engine will struggle to provide accurate
results, often favoring the more common usage (the verb "go").

Impact: Inaccurate results, leading the user to irrelevant pages, like those about
travel or movement, instead of the game Go.

Example Search: Searching "Jaguar speed" might return results for Jaguar
cars, the Atari video game, or network servers rather than information about
the jaguar animal.

Solution: Adding "cat" might improve the search results, but users may still
face confusion due to the multiple meanings of "jaguar."

F. Query Modifications:

Search Term: "best restaurants in NYC"

Problem: Lack of Query Refinement

Many users will type a general search query like "best restaurants in NYC" and, if the first page of results doesn't satisfy them, they'll either:

 Accept those results as "good enough," or
 Try a very slightly modified query (e.g., "top restaurants in NYC" or "famous restaurants NYC").

However, users rarely leverage advanced search features to further refine their query or tailor results to their specific needs.

G. Proximity Search:

· Search Result 1: "A grocery store sells many fruits, including apples, bananas, and oranges."
In this case, "apple" and "store" are not related to the Apple Store at all.

· Search Result 2: "The Apple Store in New York is a great place to purchase products."
Here, "apple" and "store" are related and should appear together.
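
A minimal sketch of the proximity idea: the helper below reports whether two terms occur within a given number of word positions of each other, which is roughly what a proximity operator tests. The tokenization is a naive lowercased split and is for illustration only.

def within_proximity(text, term_a, term_b, max_distance=2):
    """True if term_a and term_b occur within max_distance words of each other."""
    tokens = [t.strip('.,').lower() for t in text.split()]
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

r1 = "A grocery store sells many fruits, including apples, bananas, and oranges."
r2 = "The Apple Store in New York is a great place to purchase products."

print(within_proximity(r1, "apple", "store"))  # False: "apple" never appears as a token here
print(within_proximity(r2, "apple", "store"))  # True: "Apple Store" are adjacent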

Summary of Example Problems:

 Case Sensitivity: "bank" vs. "Bank" leading to different meanings.


 Typos: Misspelled terms like "writting" reduce result accuracy.
 Polysemy: Words with multiple meanings (e.g., "Go" as a game or verb).
 Synonyms: Using "fast car" might miss pages that use "quick vehicle."
 Boolean Confusion: Misunderstanding of "AND" and "OR" operators.
 Query Modification: Users don’t often refine or adjust queries after
receiving unsatisfactory results.
 Proximity Search: Terms like "apple" and "store" should be close
together but may not be in some search results.

SEARCHING USING HYPERLINKS: WEB QUERY LANGUAGES AND DYNAMIC SEARCHING

This section focuses on advanced techniques for searching the Web by using
hyperlink structure, alongside the traditional content-based search methods.
Here's an overview of the main ideas discussed, along with examples.

1. Web Query Languages

Traditional search engines index and query the content of web pages, but Web
query languages go further by querying the link structure of the Web,
enabling users to search for pages based not only on their content but also on
the way they are connected to other pages via hyperlinks.

Key Features of Web Query Languages:

 Content and Link Structure: These languages combine content-based search with the ability to query hyperlink relationships. For instance, users can query for pages that contain specific content and are linked from a particular set of pages.
 Graph and Semi-Structured Data Models:
o Labeled Graph Model: This represents Web pages as nodes
and hyperlinks as edges between nodes.
o Semi-Structured Data Model: This model accounts for dynamic
or unstructured content on Web pages, where the data schema is
flexible and may change over time.
Examples of Web Query Languages:

W3QL (World Wide Web Query Language):

A query language designed to search both the content of web pages and their
link structures. For example, a query might ask for all pages that contain a
specific image and are linked from a particular website up to a certain number
of links.

Example Query:
Find pages with the word "robot" and an image that are reachable from
example.com via three links.

WebSQL:

Another query language that combines SQL-style queries with web page
structure, allowing users to search not only for page content but also for links
between pages.

Example Query:
Find all pages that contain "technology" and are linked from a page that
contains "AI".

WQL (Web Query Language):

A query language that allows searching for specific paths within the hyperlink
structure using path regular expressions. This means a query can focus on the
route or sequence of links that lead to a page, not just the page itself.

Example Query:
Search for pages that contain "climate change" and are located within 2
links of a page about "environmental policies."

WebLog:

A language focused on querying logs of web activity, integrating both page content and structural data about how users navigate the web.

2. Web Data Manipulation Languages (Second Generation)

The second generation of web query languages emphasizes semi-structured data and enhances the capabilities of earlier query languages by providing more advanced manipulation of data, such as the ability to query the internal structure of a webpage or create new data structures from existing data.

Languages in this Category:

STRUQL (Structure Query Language):

A query language that allows users to query semi-structured web data. It provides powerful mechanisms to create new structures based on the web data returned by queries.

Example Query:
Extract a list of all articles about "machine learning" from web pages
where the content includes headings, images, and external links to
related resources.

FLORID:

A language used for data extraction and integration from multiple sources
across the Web. It supports both content and link structure queries and can
generate new web structures based on the queried data.

Example Query:
Find and integrate information about "data science" from multiple sites,
including articles, tutorials, and videos, and present it in a structured
form.

WebOQL (Web Object Query Language):

A query language that allows querying not just the content but also the internal
structure of web pages. It can work with the semi-structured nature of web
data, handling HTML and other markup languages.

Example Query:
Retrieve all tables from a web page that contain data on "global
temperature rise."

3. Use Cases and Challenges

 Link-Based Search:

One of the major advantages of hyperlink-based search is that it can be used to identify important pages that are well-connected through links, making it possible to search beyond just the text of a page.

Example: Searching for all pages related to climate change that are linked
from educational institutions or government sites can yield more credible and
focused results.

 Dynamic Searching:
Dynamic searching refers to the idea that search queries can evolve in real-time
based on the user's needs and the structure of the Web. For example, search
engines can prioritize pages based on their link popularity or freshness.

Example: A search engine might prioritize pages that have been linked to
recently or are at the center of a topic based on a network of connected pages,
helping users find more up-to-date or relevant information.

4. Challenges:

 Performance Limitations: Querying both content and hyperlink structure requires substantial computational resources, particularly when dealing with the vast scale of the Web.
 Lack of Commercial Products: Despite the theoretical advantages,
Web query languages are not widely adopted in commercial search
engines due to their complexity and performance limitations.
 Integration with Existing Systems: Implementing these query
languages into current systems (which typically rely on simpler content-
based search) requires significant changes to the infrastructure.

EXTRACTING DATA FROM TEXT IN AN INFORMATION RETRIEVAL SYSTEM

In an information retrieval system, extracting data from text is a critical task for
processing, indexing, and retrieving relevant information. The process of
extracting meaningful data involves several steps, from initial text parsing to
applying various extraction techniques like keyword identification, named
entity recognition (NER), and syntactic parsing.

Here’s a breakdown of the common steps involved in extracting data from text
in an information retrieval system:

1. Preprocessing the Text

Before any meaningful extraction can happen, the raw text typically needs to
be preprocessed. Common preprocessing steps include:

 Tokenization: Splitting the text into smaller units (tokens) such as words or phrases.
 Lowercasing: Standardizing the text by converting everything to
lowercase.
 Stopword Removal: Removing common words (like "the," "and," "is")
that don't contribute significant meaning.
 Stemming and Lemmatization: Reducing words to their base forms.
For example, “running” becomes “run”.
 Noise Removal: Eliminating irrelevant symbols or characters (e.g.,
special characters, digits).
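
A minimal preprocessing sketch, assuming NLTK is installed for the Porter stemmer; the tiny stopword list is an illustrative stand-in, and a real pipeline would usually rely on NLTK or spaCy for tokenization and lemmatization as well.

import re
from nltk.stem import PorterStemmer   # assumes NLTK is installed (pip install nltk)

STOPWORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "on", "for"}  # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(text):
    text = text.lower()                                 # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)               # noise removal: drop digits and punctuation
    tokens = text.split()                               # tokenization (naive whitespace split)
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [stemmer.stem(t) for t in tokens]            # stemming, e.g. "running" -> "run"

print(preprocess("The crawler is running over 28 Web pages, indexing the text!"))
# e.g. ['crawler', 'run', 'over', 'web', 'page', 'index', 'text']
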
2. Keyword Extraction

Once the text is preprocessed, keyword extraction focuses on identifying important terms or phrases that are central to the information retrieval process.

 TF-IDF (Term Frequency-Inverse Document Frequency): A statistical method that evaluates the importance of a word in a document relative to a corpus. Words with high TF-IDF scores are more relevant for retrieval.
 Frequency Analysis: Counting how often a word or phrase occurs can
highlight its importance in a document.
 N-grams: Extracting sequences of "n" words together, which can be
useful for understanding multi-word concepts or phrases (e.g., "data
mining" as a 2-gram).
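
The sketch below computes TF-IDF over a toy corpus using the common tf(t, d) × log(N / df(t)) form; production systems typically use smoothed or normalized variants (for example scikit-learn's TfidfVectorizer), so the exact numbers are illustrative only.

import math
from collections import Counter

def tfidf_scores(docs):
    """Return {doc_id: {term: tf-idf}} using tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))            # document frequency: one count per document
    scores = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[d] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return scores

docs = {
    "d1": "data mining finds patterns in data",
    "d2": "web mining applies data mining to web data",
    "d3": "users browse the web",
}
scores = tfidf_scores(docs)
print(round(scores["d1"]["patterns"], 3))  # rare term -> higher weight (about 1.099)
print(round(scores["d2"]["web"], 3))       # "web" occurs twice in d2 but also in d3 (about 0.811)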

3. Named Entity Recognition (NER)

NER is a crucial technique for identifying specific entities in text, such as names of people, organizations, locations, dates, etc. This allows the system to extract specific, meaningful data from unstructured text.

 Person (P): Names of individuals or groups.


 Organization (O): Names of companies, institutions, etc.
 Location (L): Geographical locations such as cities or countries.
 Date/Time (T): Dates and times (e.g., "January 12, 2024").
 Miscellaneous (M): Other types of entities, such as product names,
event names, etc.

NER can be done using pre-trained models or rule-based approaches, depending on the complexity and domain of the text.
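
A short illustration with spaCy's pre-trained English pipeline, assuming spaCy and the en_core_web_sm model are installed; a rule-based or domain-specific model could be substituted.

import spacy

# assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Elon Musk founded SpaceX in California in 2002.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output (labels depend on the model):
#   Elon Musk PERSON
#   SpaceX ORG
#   California GPE
#   2002 DATE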

4. Relationship Extraction

Beyond identifying individual entities, extracting relationships between entities is important for understanding the interactions and connections between them. This might involve:

 Syntax-based Methods: Analyzing grammatical structure to identify relationships (e.g., subject-verb-object patterns).
 Dependency Parsing: Using syntactic structure to find relationships
between words, such as who is doing what to whom.
 Pattern Matching: Searching for predefined patterns that describe
common relationships (e.g., "works at," "was born in").
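
A minimal pattern-matching sketch for relation extraction: it looks for hand-written "works at" and "was born in" patterns with regular expressions. The patterns and sentences are purely illustrative; real systems typically use dependency parsing or trained relation-extraction models.

import re

PATTERNS = [
    (re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) works at ([A-Z][A-Za-z]+)"), "works_at"),
    (re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in ([A-Z][A-Za-z]+)"), "born_in"),
]

def extract_relations(text):
    relations = []
    for pattern, label in PATTERNS:
        for subj, obj in pattern.findall(text):
            relations.append((subj, label, obj))
    return relations

text = "Ada Lovelace was born in London. Grace Hopper works at Yale."
print(extract_relations(text))
# [('Grace Hopper', 'works_at', 'Yale'), ('Ada Lovelace', 'born_in', 'London')]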

5. Information Retrieval and Querying


Once data has been extracted, an effective retrieval system must allow users to
query the extracted data. This involves:

 Indexing: Building an index of the extracted keywords, entities, and relationships for fast retrieval.
 Ranking: Ranking retrieved documents or data based on relevance
using algorithms like BM25, language models, or neural networks.
 Query Expansion: Enhancing the user's query with synonyms or related
terms to improve retrieval.
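
A small sketch of the indexing step: an inverted index mapping terms to posting sets, plus a crude query function that ranks documents by how many query terms they contain (a real system would rank with BM25 or a learned model, as noted above). The document ids and texts are made up.

from collections import defaultdict

def build_inverted_index(docs):
    """Map term -> set of document ids containing that term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank documents by the number of query terms they match (a crude relevance proxy)."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

docs = {
    "a": "named entity recognition extracts people and places",
    "b": "relationship extraction links each entity",
    "c": "keyword extraction uses tf idf",
}
index = build_inverted_index(docs)
print(search(index, "entity extraction"))   # ['b', 'a', 'c'] - doc "b" matches both query terms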

6. Data Normalization and Structuring

To ensure consistency and usability, the extracted data may need to be normalized or structured. For instance:

 Date Normalization: Converting different date formats into a standard format.
 Canonicalization: Mapping extracted terms to standardized forms (e.g.,
different abbreviations of the same company).
 Structuring Data: Organizing data into a structured format such as
JSON, CSV, or a relational database for easier querying and retrieval.

7. Advanced Techniques

For more complex data extraction tasks, advanced techniques such as the
following can be used:

 Machine Learning Models: Models such as Named Entity Recognition (NER) or Relation Extraction can be trained on domain-specific data to improve the accuracy of data extraction.
 Deep Learning: Deep learning techniques, like transformers (e.g.,
BERT, GPT), can be employed for more advanced semantic
understanding and extraction.

Use Cases of Data Extraction in Information Retrieval:

 Search Engines: Extracting relevant documents or facts from a large corpus based on a user query.
 Document Summarization: Extracting key points and concepts to
generate a concise summary of a document.
 Question Answering Systems: Extracting direct answers to user queries
from a text.
 Sentiment Analysis: Extracting sentiments (positive, negative, neutral)
from text.
 Content Recommendation: Extracting interests and preferences based
on user behavior or content similarity for personalized recommendations.
Example Workflow of Data Extraction:

1. Input: A news article or document.


2. Preprocessing: Tokenization, stopword removal, stemming.
3. Keyword Extraction: Extract relevant keywords and topics using TF-
IDF or other methods.
4. NER: Identify named entities (e.g., "Elon Musk", "Tesla", "2024").
5. Relationship Extraction: Identify relationships (e.g., "Elon Musk is the
CEO of Tesla").
6. Indexing & Retrieval: Store the extracted data in an index and retrieve
based on user queries.

--> COLLECTING AND INTEGRATING SPECIALIZED INFORMATION FROM THE WEB IN AN INFORMATION RETRIEVAL SYSTEM (IRS):

1. Identifying Relevant Web Sources

 Domain-specific databases (e.g., PubMed, IEEE Xplore, LexisNexis)


 Government websites and authoritative sources (e.g., WHO, U.S.
Census)
 Academic research and institutional websites
 Industry-specific websites and expert forums (e.g., Stack Overflow,
Reddit)

2. Web Crawling and Scraping

 Web Crawling: Automate data collection by traversing web pages.


 Web Scraping: Extract structured and unstructured data from pages.
 APIs: Use official APIs for direct, structured access to data.

3. Preprocessing and Normalizing the Data

 Data Cleaning: Remove irrelevant or noisy content.


 Text Normalization: Standardize dates, formats, and terms.
 Named Entity Recognition (NER): Identify and standardize domain-
specific entities (e.g., drug names, legal cases).
 Language Processing: Handle multilingual data or translate when
necessary.

4. Integration and Structuring the Data

 Data Structuring: Convert raw data into structured formats (JSON, XML, CSV).
 Ontology Integration: Incorporate domain-specific taxonomies or
ontologies (e.g., MeSH, SNOMED).
 External Knowledge Linking: Connect to external knowledge bases
for richer context.

5. Indexing the Collected Data

 Full-Text Indexing: Build inverted indexes for fast text retrieval.


 Keyword-Based Indexing: Index based on important terms or phrases.
 Entity-Based Indexing: Index by domain-specific entities (e.g.,
diseases, legal terms).
 Vector Indexing: Use embeddings (e.g., word2vec, BERT) for semantic
search.

6. Search and Retrieval

 Query Parsing: Translate user queries into structured searches.


 Ranking: Rank results using algorithms like BM25 or machine learning
models.
 Contextualization: Use domain context to refine query results (e.g.,
legal context, medical context).

7. Domain-Specific Refinements

 Search Features: Add specialized search facets (e.g., drug interactions, case law).
 Semantic Search: Implement domain-specific ontologies and
embeddings.
 Expert Review: Incorporate human verification for high-stakes domains
like healthcare or law.

8. Continuous Updates and Monitoring

 Real-Time Crawling: Collect the latest information as it becomes available.
 Data Refreshing: Periodically update the database to ensure freshness.
 Quality Control: Regular validation to ensure the integrity of data.

--> 1. Collaborative Filtering (CF)

Concept: Collaborative Filtering recommends items based on the preferences and behaviors of similar users. There are two main types of CF:

a. User-Based Collaborative Filtering

 How it Works: Recommends items that similar users have liked or interacted with.
 Example:
Netflix: If User A and User B have watched and rated movies similarly in the
past, Netflix will recommend movies User A has watched to User B.

b. Item-Based Collaborative Filtering

 How it Works: Recommends items similar to those a user has already interacted with.

Example:

Amazon: If you purchase a camera, Amazon recommends similar cameras or accessories based on what other users have purchased alongside the camera.

Similarity Metrics

Cosine Similarity: Measures the cosine of the angle between two vectors
(user/item profiles).

Example: In Spotify, the similarity between two playlists is calculated using cosine similarity.

Pearson Correlation: Measures the linear correlation between two variables.

Example: In Rotten Tomatoes, the similarity between two users' movie ratings is often calculated using Pearson correlation.

Jaccard Index: Measures similarity between two sets.

Example: In e-commerce, the Jaccard index is used to measure the similarity between customers' purchase histories.
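
The three metrics above can be computed directly; the sketch below shows them on made-up rating vectors and purchase sets using only the standard library.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    sd = math.sqrt(sum((a - mu_u) ** 2 for a in u)) * math.sqrt(sum((b - mu_v) ** 2 for b in v))
    return cov / sd

def jaccard(s, t):
    return len(s & t) / len(s | t)

# Two users' ratings of the same five movies (1-5 scale, invented numbers)
alice = [5, 3, 4, 4, 2]
bob   = [4, 2, 5, 3, 2]
print(round(cosine(alice, bob), 3))    # close to 1.0 -> similar taste
print(round(pearson(alice, bob), 3))   # linear correlation of the two rating profiles

# Purchase histories as sets of product ids
print(jaccard({"cam1", "lens", "bag"}, {"cam1", "tripod", "bag"}))  # 0.5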

Pros

 Does not require item content knowledge.


 Can discover unexpected items (serendipity).

Challenges

 Cold Start Problem: Difficulty in recommending items for new users or new items with little interaction data.

Example: On Amazon, if a new product is launched, Collaborative Filtering struggles to recommend it until there is enough interaction data.

 Sparsity: The user-item interaction matrix is often sparse, making it harder to find similarities.
 Scalability: Computationally expensive for large datasets.
o Example: In platforms like Netflix with millions of users and
movies, the computational cost for calculating similarities can be
very high.

2. Content-Based Recommendation

Concept: Content-Based Recommendation uses the features of items (e.g., keywords, topics, metadata) to recommend similar items based on the user's past interactions.

Feature Extraction

 How it Works: Uses item features like keywords, topics, and metadata to recommend items with similar characteristics.

o Example:

 Spotify: Recommends music based on features like genre, artist, or mood.

User Profile

 How it Works: Builds a profile based on the user's past interactions with items.

o Example:

 Amazon: Recommends products based on a user's browsing history, such as recommending books similar to those a user has previously read.

Similarity Metrics

 Cosine Similarity: Measures the similarity between items based on features.

o Example:

 Google News: Recommends news articles similar to those the user has read, based on article keywords or content.
 Euclidean Distance: Measures the distance between item features
(often used in continuous-valued features).

o Example:
 E-commerce websites: Products are recommended by
comparing item attributes like price or rating using
Euclidean distance.
 BERT Embeddings: Advanced natural language processing techniques
to measure semantic similarity.

o Example:

 Medium or ResearchGate: Recommends articles or papers by analyzing content similarity through embeddings generated from BERT.

Pros

 Personalized recommendations based on user preferences.


 No cold start problem for items, as content is available from the
beginning.

Challenges

 Limited Serendipity: The system may recommend items that are too
similar, reducing the discovery of new or diverse content.

o Example: If you regularly listen to pop music, a content-based system will mostly recommend other pop songs, limiting exposure to other genres.
 Feature Engineering: Requires careful extraction of relevant features
from items, which can be domain-specific.

o Example: In movies, content-based features might include genre, director, actors, etc., but feature extraction is challenging for more abstract items.
 Over-Specialization: Recommendations may become too narrow,
leading to reduced diversity.

o Example: In Amazon, recommending only related products to previously purchased items can reduce the variety in what the user sees.

3. Hybrid Recommendation System

Concept: Combines Collaborative Filtering and Content-Based Recommendation to leverage the strengths of both approaches and overcome their individual limitations.

Methods of Combining
 Weighted Hybrid: Combines CF and content-based recommendations with assigned weights (a minimal sketch follows this list).

o Example:

Netflix: Uses both user-item interactions (CF) and content-based data (e.g.,
movie genre, cast) to recommend content, adjusting the influence of each
approach based on the user’s preferences.

 Switching Hybrid: Switches between CF and content-based recommendations depending on the available data.

o Example:

Amazon: Uses Collaborative Filtering when enough interaction data is available but switches to content-based when introducing a new product.

 Feature Augmentation: Uses CF results as features for content-based models.

o Example:

YouTube: Combines CF recommendations (similar users' behavior) with content-based filtering (video metadata) to recommend videos.

 Model-Based Hybrid: Machine learning models (e.g., decision trees, neural networks) combine both CF and content-based predictions.

o Example:

Spotify: Combines collaborative listening patterns with the content of songs (e.g., tempo, genre) to recommend music.
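
A minimal sketch of the weighted-hybrid idea referenced above: two pre-computed score dictionaries, one from a hypothetical CF model and one from a hypothetical content-based model, are blended with a tunable weight. The item names, scores, and the 0.7/0.3 split are made-up values for illustration.

def weighted_hybrid(cf_scores, content_scores, alpha=0.7):
    """Blend CF and content-based scores: alpha weights the CF side."""
    items = set(cf_scores) | set(content_scores)
    return {
        item: alpha * cf_scores.get(item, 0.0) + (1 - alpha) * content_scores.get(item, 0.0)
        for item in items
    }

cf_scores      = {"movie_a": 0.9, "movie_b": 0.4}                   # from user-user similarity
content_scores = {"movie_a": 0.2, "movie_b": 0.8, "movie_c": 0.6}   # from genre/cast features

blended = weighted_hybrid(cf_scores, content_scores, alpha=0.7)
for item, score in sorted(blended.items(), key=lambda kv: -kv[1]):
    print(item, round(score, 2))
# movie_a 0.69, movie_b 0.52, movie_c 0.18 (the new item movie_c still receives a score)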

Benefits

 Improved accuracy and diversity in recommendations.


 Reduces the limitations of individual methods (cold start, narrow
recommendations).

o Example: Combining CF and content-based approaches on Netflix helps mitigate the cold start problem by recommending based on both user preferences and content attributes.

4. Practical Applications

Document Recommendations
News Websites:

Collaborative Filtering: Recommends articles based on similar users' reading history.

Example: Flipboard: Recommends news stories based on similar users' preferences.

Content-Based: Recommends articles based on keywords or topics.

Example: Google News: Recommends articles based on a user's previously read topics like "technology" or "health."

Research Papers:

Collaborative Filtering: Recommends papers from researchers with similar reading history.

Example: Google Scholar: Recommends papers similar to those you have previously read or cited.

Content-Based: Recommends papers based on the paper's content, abstract, or keywords.

Example: Semantic Scholar: Recommends research papers based on keywords and topics.

Product Recommendations

E-commerce Websites:

Collaborative Filtering: Recommends products based on similar users' purchases.

Example: Amazon: Recommends products based on the purchasing history of similar users.

Content-Based: Recommends products based on their attributes, such as category, brand, and price.

Example: Etsy: Recommends jewelry or art based on the type of products a user has previously purchased.

 Movie Streaming Services:

Collaborative Filtering: Recommends movies based on similar users' viewing habits.
Example: Netflix: Recommends movies based on the preferences of users who
have similar watching habits.

Content-Based: Recommends movies based on genre, cast, or director.

Example: Hulu: Recommends movies similar to those a user has watched based on content features like genre and actors.

5. Summary of Approaches

Feature                | Collaborative Filtering                 | Content-Based Filtering
Recommendation Basis   | User/item similarity                    | Item content and features
Data Used              | User-item interaction matrix            | Item metadata (keywords, categories, features)
Strength               | Discovers new items through similarity  | Personalized recommendations based on user interests
Cold Start Problem     | New users/items struggle                | New items can be recommended
Serendipity            | High (unexpected recommendations)       | Low (focuses on similar items)
Complexity             | High (due to similarity computations)   | Moderate (requires feature extraction)

--> PAGE RANKING ALGORITHMS

PageRank Algorithm

 Example: Google Search


o How it works: Google's search engine ranks web pages based on
their importance determined by the number and quality of
incoming links. Pages with more incoming links from other
authoritative pages are ranked higher.
o Real-life Example: When you search for "best pizza places near
me," Google's PageRank algorithm will prioritize pages (websites)
that are linked to by high-authority pages (food blogs, review
sites, news articles).
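
A minimal power-iteration sketch of PageRank over a tiny made-up link graph (damping factor 0.85, uniform teleportation, dangling pages spreading their rank uniformly). It illustrates the core idea only; a production engine combines link analysis with many other signals.

def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}; ranks sum to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share
            else:                        # dangling page: spread its rank over all pages
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

links = {
    "blog":     ["pizzeria"],
    "review":   ["pizzeria", "blog"],
    "news":     ["pizzeria", "review"],
    "pizzeria": [],                      # well-linked page with no outlinks
}
for page, r in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(r, 3))             # "pizzeria" ends up with the highest rank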

2. HITS (Hyperlink-Induced Topic Search)

 Example: Academic Search Engines (e.g., Google Scholar, CiteSeer)


o How it works: HITS classifies web pages as authorities (trusted,
high-quality content) and hubs (pages that link to authoritative
content). For example, a research paper that links to many
authoritative academic sources would be considered a hub.
o Real-life Example: In academic research, an influential journal
article (authority) may be linked to by multiple research papers
(hubs) in the same topic area.
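
A brief sketch of the HITS iteration: authority scores sum the hub scores of pages linking in, hub scores sum the authority scores of pages linked to, and both are normalized each round. The small citation graph is invented for illustration.

import math

def hits(links, iterations=20):
    """links: {page: [pages it links to]}. Returns (hubs, authorities)."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority: sum of hub scores of pages linking to it
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub: sum of authority scores of the pages it links to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Survey papers (hubs) citing an influential article (authority)
links = {
    "survey1": ["influential_paper", "paper_b"],
    "survey2": ["influential_paper", "paper_c"],
    "paper_b": ["influential_paper"],
    "paper_c": [],
}
hub, auth = hits(links)
print(max(auth, key=auth.get))   # influential_paper
print(max(hub, key=hub.get))     # one of the surveys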

3. TF-IDF (Term Frequency-Inverse Document Frequency)

 Example: Search Engine Ranking (e.g., Google Search, Elasticsearch)
o How it works: TF-IDF is used to rank documents based on the
frequency of query terms in a document (TF) and how unique
those terms are across all documents (IDF).
o Real-life Example: If you search for "machine learning
techniques," the search engine ranks documents with higher
occurrences of these terms and penalizes those terms if they
appear in many other documents.

4. BM25 (Best Matching 25)

 Example: Elasticsearch and Apache Solr


o How it works: BM25 builds on TF-IDF but adds saturation and
document length normalization. It adjusts how term frequency
affects the score and normalizes based on document length.
o Real-life Example: When searching for "deep learning models,"
BM25 will prioritize documents with the term "deep learning" in
a way that prevents documents with repeated occurrences of this
term from being ranked too highly.
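
A compact sketch of the BM25 formula for scoring a single document, using the common k1 = 1.5 and b = 0.75 defaults and the standard idf form. The document, query, and document-frequency numbers are made up; the point is to show term-frequency saturation and length normalization, not a full engine.

import math

def bm25_score(query_terms, doc_tokens, doc_freq, n_docs, avg_len, k1=1.5, b=0.75):
    """Score one document for a query using the BM25 formula."""
    score = 0.0
    dl = len(doc_tokens)
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0 or term not in doc_freq:
            continue
        idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        # tf saturates as it grows; dl / avg_len normalizes for document length
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score

doc = "deep learning models for deep learning research".split()
df = {"deep": 30, "learning": 40, "models": 25}   # made-up document frequencies
print(round(bm25_score(["deep", "learning", "models"], doc, df, n_docs=1000, avg_len=20), 3))
# Repeating "deep" again would raise the score only slightly: term frequency saturates.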

5. Personalized PageRank (PPR)

 Example: YouTube or Netflix Recommendations


o How it works: Personalized PageRank tailors rankings based on
user behavior. For example, YouTube adjusts the ranking of
videos based on what a user has watched before, their likes, and
subscriptions.
o Real-life Example: If you frequently watch cooking tutorials,
YouTube will recommend more cooking-related videos based on
your previous watch history.

6. RankNet (Neural Network-based Ranking)

 Example: Microsoft Bing Search


o How it works: RankNet uses a neural network to rank documents
based on a learned model that considers multiple features such as
TF-IDF, BM25 scores, click-through rates, and more.
o Real-life Example: Bing's search results are influenced by RankNet, which uses machine learning to understand which documents are most relevant to a user's query based on previous search patterns and clicks.

7. Learning to Rank (LTR)

 Example: Amazon Search and Yahoo Search


o How it works: LTR uses machine learning algorithms (like
decision trees or gradient boosting) to rank documents based on a
set of features (e.g., term frequency, user feedback, query-
document relevance).
o Real-life Example: In Amazon's product search, Learning to
Rank algorithms consider factors like the product's relevance,
user reviews, past purchase data, and click-through history to
rank search results.

8. Alexa Rank

 Example: Alexa.com (now part of Amazon)


o How it works: Alexa Rank is based on the popularity of websites,
which is determined by traffic data such as the number of visitors,
page views, and engagement time.
o Real-life Example: Websites like Facebook and YouTube have
high Alexa ranks because of their massive traffic, while niche
blogs or lesser-known sites will have lower ranks.

9. Social Media and Social Bookmarking-based Ranking

 Example: Reddit and Twitter


o How it works: Content is ranked based on user interactions like
upvotes, shares, comments, and retweets.
o Real-life Example: On Reddit, posts that get a high number of
upvotes and comments rise to the top of the subreddit, while on
Twitter, tweets that get the most retweets and likes gain visibility.

10. TrustRank

 Example: Google Search (fighting web spam)


o How it works: TrustRank is used to identify high-quality,
trustworthy websites by starting with a small set of trusted pages
(e.g., government, academic, or well-established sites) and
propagating trust through links.
o Real-life Example: Google’s algorithm incorporates TrustRank
to minimize the influence of spammy websites that try to
manipulate rankings through unnatural link-building tactics.

Summary of Examples

Algorithm             | Example                      | Real-Life Application
PageRank              | Google Search                | Ranking web pages based on link structure
HITS                  | Google Scholar, CiteSeer     | Identifying authorities and hubs in academic papers
TF-IDF                | Google Search, Elasticsearch | Ranking documents based on query-term relevance
BM25                  | Elasticsearch, Solr          | Ranking with term frequency saturation and length normalization
Personalized PageRank | YouTube, Netflix             | Content recommendation based on user behavior
RankNet               | Microsoft Bing               | Ranking documents using neural networks
Learning to Rank      | Amazon Search, Yahoo Search  | Ranking products or search results using machine learning
Alexa Rank            | Alexa.com                    | Measuring website popularity and traffic
Social Media Ranking  | Reddit, Twitter              | Ranking based on user interaction (upvotes, likes, shares)
TrustRank             | Google Search                | Reducing spam by ranking trusted websites higher
