4

Information Retrieval (IR) faces challenges such as vocabulary mismatches, ambiguous queries, and inadequate content representation. Different types of search engines, including mainstream, private, vertical, and computational, serve various user needs and privacy concerns. The integration of artificial intelligence in IR systems aims to enhance user experience and improve search outcomes through intelligent processing and automation.

Why is IR difficult?

• Vocabulary mismatch between queries and documents
• Queries are ambiguous
• Content representation may be inadequate and incomplete
• The user is the ultimate judge of relevance, but we don’t know how the judge judges.
Challenges in IR

• Scale, distribution of documents

• Controversy over the unit of indexing
• High heterogeneity
• Retrieval strategies
Types Of Search Engines
1. Mainstream search engines
• Mainstream search engines like Google, Bing, and Yahoo! are all free to use and supported by online advertising. They all use variations of the same strategy (crawling, indexing, and ranking) to let you search the web as a whole.
2. Private search engines
• Private search engines have risen in popularity recently due to privacy concerns raised by the data collection practices of mainstream search engines. These include anonymous, ad-supported search engines like DuckDuckGo and private, ad-free search engines like Neeva.
3. Vertical search engines
• Vertical search, or specialized search, is a way of narrowing your search to one topic category rather than the entire web. Examples of vertical search engines include:
  • The search bar on shopping sites like eBay and Amazon
  • Google Scholar, which indexes scholarly literature across publications
  • Searchable social media sites and apps like Pinterest
4. Computational search engines
• WolframAlpha is an example of a computational search engine, devoted to answering questions related to math and science.
Open source search engine
• Open-source software is software whose source code is available for modification or enhancement by
anyone. "Source code" is the part of software that most computer users don't ever see; it's the code computer
programmers can manipulate to change how a piece of software—a "program" or "application"—works.

Advantage of open source

• The right to use the software in any way.

• There is usually no license cost.
• The source code is open and can be modified freely.
• Open standards.
• It provides higher flexibility.
Disadvantage of open source

• There is no guarantee that development will happen.

• It is sometimes difficult to know that a project exists, or what its current status is.
• No secured follow-up development strategy.
Closed search engine
• Closed software is a term for software whose license does not allow for the release or distribution of the software’s source code. Generally, it means only the binaries of a computer program are distributed.
• Google Search – The most widely used web search engine.
• Bing – Microsoft’s search engine.
• Yandex – Russian search engine.
• Baidu – Leading search engine in China.
• DuckDuckGo – Privacy-focused search engine, but proprietary.
• Yahoo Search – Powered by Bing.
• Brave Search – Independent and privacy-focused search engine.
• Algolia – AI-powered search API for developers.
• Amazon A9 – Search engine used in Amazon’s product search.
• IBM Watson Discovery – AI-powered enterprise search.
• List of open source search engines:
  1. Apache Lucene
  2. Sphinx
  3. Whoosh
  4. Carrot²
Apache Lucene Core
• Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
• Powerful features through a simple API:
  • Scalable, High-Performance Indexing
    • Over 150GB/hour on modern hardware
    • small RAM requirements -- only 1MB heap
    • incremental indexing as fast as batch indexing
    • index size roughly 20-30% the size of text indexed
  • Powerful, Accurate and Efficient Search Algorithms
    • ranked searching -- best results returned first
    • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
    • fielded searching (e.g. title, author, contents)
    • sorting by any field
    • multiple-index searching with merged results
    • allows simultaneous update and searching
    • flexible faceting, highlighting, joins and result grouping
    • fast, memory-efficient and typo-tolerant suggesters
    • pluggable ranking models, including the Vector Space Model and Okapi BM25
    • configurable storage engine (codecs)
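Okapi BM25, one of the ranking models named above, can be illustrated with a short Python sketch. This is not Lucene's code: the function name `bm25_scores`, the toy documents, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy collection (hypothetical documents).
docs = [
    "lucene is a search library".split(),
    "sphinx is a search server".split(),
    "whoosh is a pure python library".split(),
]
scores = bm25_scores(["search", "library"], docs)
# "Ranked searching": best-scoring document first.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

The first document matches both query terms, so it ranks first; the other two match one term each.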
Sphinx

• Sphinx is a full-text search engine, publicly distributed under GPL version 2. Technically, Sphinx is a standalone software package that provides fast and relevant full-text search functionality to client applications.
• It was specially designed to integrate well with SQL databases storing the data, and to be easily accessed by scripting languages.
• However, Sphinx does not depend on nor require any specific database to function.
• Applications can access the Sphinx search daemon (searchd) using any of three different access methods:
  a) via Sphinx's own implementation of the MySQL network protocol (SphinxQL),
  b) via the native search API (SphinxAPI), or
  c) via MySQL server with a pluggable storage engine (SphinxSE).
• Starting from version 1.10-beta, Sphinx supports two different indexing backends (previous versions only supported disk indexes):
  a) "Disk" index backend - Disk indexes support online full-text index rebuilds, but online updates can only be done on non-text (attribute) data.
  b) "Realtime" (RT) index backend - RT indexes additionally allow for online full-text index updates.
• Sphinx features are:
• high indexing and searching performance;
• advanced indexing and querying tools;
• advanced result set post-processing (SELECT with expressions, WHERE, ORDER BY, GROUP BY, HAVING etc over text
search results);
• proven scalability up to billions of documents, terabytes of data, and thousands of queries per second;
• easy integration with SQL and XML data sources, and SphinxQL, SphinxAPI, or SphinxSE search interfaces;
• easy scaling with distributed searches.
Whoosh
• Whoosh was created by Matt Chaput. It started as a quick and dirty search server for the online documentation of the Houdini 3D animation software package.
• Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.
• Programmers can use it to easily add search functionality to their applications and websites.
• Every part of how Whoosh works can be extended or replaced to meet your needs exactly. Whoosh’s features include:
  • Pythonic API.
  • Pure-Python. No compilation or binary packages needed, no mysterious crashes.
  • Fielded indexing and search.
  • Fast indexing and retrieval – faster than any other pure-Python, scoring, full-text search solution.
  • Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
  • Powerful query language.
  • Pure Python spell-checker.
Carrot²

• Carrot² is an Open Source Search Results Clustering Engine. It can automatically organize small collections of documents into thematic categories.
• The architecture of Carrot² is based on processing components arranged into pipelines. The two major groups of processing components in Carrot² are:
  a) Document sources
  b) Clustering algorithms
• a) Document sources provide data for further processing. Typically, they fetch search results from an external search engine or a Lucene / Solr index, or load text files from a local disk.
• Currently, Carrot² has built-in support for the following document sources:
  • Bing Search API
  • Lucene index
  • OpenSearch
  • PubMed
  • Solr server
  • eTools metasearch engine
  • Generic XML files
• Other document sources can be integrated based on the code examples provided with the Carrot² distribution.
• b) Clustering algorithms: Carrot² offers two specialized document clustering algorithms that place emphasis on the quality of cluster labels:
  • Lingo: a clustering algorithm based on the singular value decomposition
  • STC: Suffix Tree Clustering
Impact of Web on IR

• 1989: Tim Berners-Lee, a British computer scientist, proposed the concept of the World Wide Web while working at CERN (the European Organization for Nuclear Research).
• 1990: The concept was successfully tested, and the first website was created.
• 1991: The World Wide Web was publicly released, allowing people outside of CERN to use and access web pages. This marked the beginning of the modern web era.
• It was called the World Wide Web (WWW).
• The WWW is built on three core technologies:
  • HTML (the markup language for pages)
  • HTTP (the transfer protocol)
  • URLs (the addressing scheme)
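As a small illustration of how a URL ties the pieces together, the sketch below splits an address into the protocol to use, the server to contact, and the document to request. It uses Python's standard `urllib.parse`; the example address is the one commonly cited for the first website at CERN.

```python
from urllib.parse import urlsplit

# Address commonly cited for the first website, hosted at CERN.
url = "http://info.cern.ch/hypertext/WWW/TheProject.html"
parts = urlsplit(url)

print(parts.scheme)  # which protocol to speak (HTTP)
print(parts.netloc)  # which server to contact
print(parts.path)    # which (HTML) document to request
```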
IR on web

• IR on the web has always been a difficult and different task compared to a classical retrieval system, because of:
• Hypertext
• Heterogeneity of document
• Duplication
• Number of documents
• Lack of stability
• Poor queries
• Reaction to results
• Heterogeneity of users
• An IR system describes documents with two kinds of terms: objective and non-objective.
• Objective terms: extrinsic to the semantic content of the document.
• Ex: author name, document URL, date of publication.
• Non-objective terms (also known as content terms): intended to reflect the information in the document; there is no agreement about the choice of terms or their degree of applicability.
• Ex: keywords, concepts and topics, synonyms and related terms (different expressions of the same concept), latent semantic terms.
IR queries

• Keyword queries
• Boolean Queries
• Phrase queries
• Proximity queries
• Full document queries
• Natural language questions
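Several of these query types can be answered from one data structure, a positional inverted index. The following is a minimal sketch, not any particular engine's implementation; the function names, the whitespace tokenizer, and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id -> [positions]} using whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def boolean_and(index, a, b):
    """Boolean query: documents that contain both terms."""
    return set(index.get(a, {})) & set(index.get(b, {}))

def proximity(index, a, b, max_gap):
    """Proximity query: b occurs at most max_gap positions after a."""
    hits = set()
    for doc_id in boolean_and(index, a, b):
        if any(0 < pb - pa <= max_gap
               for pa in index[a][doc_id]
               for pb in index[b][doc_id]):
            hits.add(doc_id)
    return hits

def phrase(index, a, b):
    """Phrase query: a immediately followed by b."""
    return proximity(index, a, b, 1)

# Hypothetical sample collection.
docs = {
    1: "open source search engine",
    2: "search for an open engine",
    3: "engine search",
}
index = build_positional_index(docs)
print(boolean_and(index, "search", "engine"))   # all three documents
print(phrase(index, "search", "engine"))        # only document 1
print(proximity(index, "search", "engine", 4))  # documents 1 and 2
```

The phrase query is treated here as a proximity query with a gap of one, which shows why both need term positions while a plain Boolean query does not.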
Web challenges on IR

• The WWW is expanding faster than any current search engine can possibly index. Many web pages are updated frequently or are dynamically generated, which forces search engines to revisit them repeatedly.
• Many dynamically generated sites are not indexable by search engines; this part of the web is known as the invisible web.
• The ordering of results is not always solely by relevance, but is sometimes influenced by monetary contributions. This is a difficulty that comes with the search engines' business model.
• Some sites use tricks to manipulate the search engine and improve their ranking for certain keywords, known as search engine spamming.
Web problems divided into 2 classes

• Problems with the data itself (data-centric)
• Problems regarding the user (interaction-centric)
Problem with data itself

• Distributed data: Documents spread over millions of different web servers.
• Volatile data: Many documents change or disappear rapidly.
• Large volume: Trillions of separate documents.
• Unstructured and redundant data: HTML errors, duplicate documents.
• Quality of data: False information, poor quality writing.
• Heterogeneous data: Multiple media types.
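The redundancy problem above can be made concrete with a sketch of exact-duplicate detection: normalize each page's text, hash it, and flag pages whose fingerprints collide. The URLs and page texts are hypothetical, and this only catches copies that are identical after normalization; near-duplicate detection needs other techniques not shown here.

```python
import hashlib
import re

def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(pages):
    """Return (duplicate_url, first_seen_url) pairs."""
    seen = {}
    duplicates = []
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp in seen:
            duplicates.append((url, seen[fp]))
        else:
            seen[fp] = url
    return duplicates

# Hypothetical crawl results.
pages = {
    "a.example/1": "Information  Retrieval basics",
    "b.example/copy": "information retrieval BASICS\n",
    "c.example/2": "Web search basics",
}
duplicates = find_duplicates(pages)
print(duplicates)
```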
Problems regarding the user
These problems are concerned with how users interact with web
systems and services.
• How to specify the query?
• How to interpret the answer provided by the system?
• Usability Issues: Designing intuitive interfaces for better user experience.
• Personalization: Tailoring content and recommendations based on user preferences.
• Accessibility: Ensuring web content is accessible to users with disabilities.
• User Engagement: Encouraging interaction and participation.
• Trust and Credibility: Ensuring the user perceives the content as reliable and authentic.
• Latency and Performance: Ensuring fast and responsive interactions.
• Behavior Analysis: Understanding user needs and optimizing interactions accordingly.
The role of artificial intelligence (AI) in IR

• In the early days of computer science, IR and AI developed in parallel.
• Information Retrieval
  • The amount of available information is growing at an incredible rate, for example on the Internet and World Wide Web. Information is stored in many forms, e.g. images, text, video, and audio.
  • Information retrieval is a way to separate relevant data from irrelevant data.
  • The IR field has developed successful methods to deal effectively with huge amounts of information. Common methods include the Boolean, Vector Space, and Probabilistic models.
• Artificial Intelligence
  • The study of how to construct intelligent machines and systems that can simulate or extend human intelligence.
• In the 1980s, the two fields started to cooperate, and the term intelligent information retrieval was coined for AI applications in IR.
• The integration of Artificial Intelligence and Information Retrieval has led to the following developments:
  • Methods to learn the user's information needs.
  • Extraction of information based on what has been learned.
  • Representation of the semantics of information.
• In the 1990s, information retrieval saw a shift from set-based Boolean retrieval models to ranking systems.
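The difference between set-based Boolean retrieval and ranking can be sketched with a small vector space example. TF-IDF weighting with cosine similarity is one common instance of the Vector Space model; the helper names, the `log(n/df) + 1` IDF form, and the toy documents are assumptions of this sketch.

```python
import math
from collections import Counter

def idf_table(docs):
    """Inverse document frequency for every term in the collection."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / df[t]) + 1.0 for t in df}

def vectorize(tokens, idf):
    """TF-IDF vector as a dict; terms unknown to the collection are ignored."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy collection.
docs = [
    "boolean retrieval model".split(),
    "vector space model ranking".split(),
    "probabilistic ranking model".split(),
]
query = "vector space ranking".split()

# Boolean AND: a document either matches every term or is not returned.
boolean_hits = [i for i, d in enumerate(docs) if all(t in d for t in query)]

# Ranking: every document gets a score, so partial matches are ordered.
idf = idf_table(docs)
q_vec = vectorize(query, idf)
sims = [cosine(q_vec, vectorize(d, idf)) for d in docs]
order = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)
print(boolean_hits, order, sims)
```

The Boolean query returns only the second document, while ranking also surfaces the third document as a partial match and places it below the full match.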
What are Intelligent IR Systems?
• The concept of 'intelligent' information retrieval was first
suggested in the late 1970s.
• It was not pursued by the IR community until the early 1990s.
• An intelligent IR system can simulate the human thinking
process on information processing and intelligence
activities to achieve information and knowledge storage,
retrieval and reasoning, and to provide intelligence support.
• In an Intelligent IR system, the functions of the human
intermediary are performed by a program, interacting with
the human user.
• Intelligent IR is performed by a computer program
(intelligent agent), which acts on (minimal or no explicit)
instructions from a human user, retrieves and presents
information to the user without any other interaction.
How to introduce AI into IR systems?

• Levels of user and system involvement:

• Level 0 – No system involvement (the user comes up with a tactic, formulates a query, devises a strategy, and thinks about the outcome).
• Level 1 – The user can ask for information about searching (the system suggests tactics that can be used to formulate queries, e.g. help).
• Level 2 – The user simply enters a query and suggests what needs to be done, and the system executes the query to return results.
• Level 3 – First signs of AI. The system actually starts suggesting improvements to the user.
• Level 4 – Full automation. User queries are entered and the rest is done by the system.
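As a rough illustration of Level 3, the sketch below has the system propose a corrected query when a term is not in its vocabulary. It uses `difflib.get_close_matches` from the Python standard library; the `suggest` function and the vocabulary are hypothetical, and real systems draw on far richer signals than string similarity.

```python
import difflib

def suggest(query, vocabulary):
    """Replace each unknown query term with its closest vocabulary term."""
    improved = []
    for term in query.lower().split():
        if term in vocabulary:
            improved.append(term)
            continue
        close = difflib.get_close_matches(term, vocabulary, n=1)
        # Keep the original term when nothing is similar enough.
        improved.append(close[0] if close else term)
    return " ".join(improved)

# Hypothetical index vocabulary.
vocabulary = ["web", "retrieval", "relevance", "ranking", "crawler"]
suggestion = suggest("web retreival", vocabulary)
print("Did you mean:", suggestion)
```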
Some AI methods currently used in
Intelligent IR Systems
• Web Crawlers (for information extraction)
• Mediator Techniques (for information integration)
• Ontologies (for intelligent information access by making semantics of information explicit and machine readable)
• Neural Networks (for document clustering & preprocessing)
• Kohonen Neural Networks – Self-Organizing Maps
• Hopfield Networks
• Semantic Networks
Areas of AI for IR

• Reasoning under uncertainty
• Natural language processing
• Knowledge representation
• Cognitive theory
• Machine Learning
• Computer Vision
AI applied to IR

• Information characterization
• System integration
• Search formulation in seeking information
• Support functions
Web Search vs IR
• Traditional IR systems normally index a closed
collection of documents, which are mainly text-
based and usually offer little linkage between
documents.
• Traditional IR systems are often referred to as
full-text retrieval systems.
• Libraries were among the first to adopt IR to
index their catalogs and later, to search through
information which was typically imprinted onto
CD-ROMs.
• The main aim of traditional IR was to return
relevant documents that satisfy the user’s
information need.
• Satisfying the user’s information need is still the central issue in web IR (or web search).
Components of a Search engine

• A search engine is an information retrieval software program that discovers, crawls, transforms and stores information for retrieval and presentation in response to user queries.
• A search engine normally consists of four components:
  • search interface,
  • crawler (also known as a spider or bot),
  • indexer, and
  • database.
• The crawler traverses a document collection, deconstructs document text, and assigns surrogates for storage in the search engine index.
• Online search engines store images, link data and metadata for the document as well.
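The four components can be sketched end to end on a toy collection. Everything here is an assumption for illustration: the in-memory `WEB` dictionary stands in for real pages and links, a plain dictionary plays the role of the database, and the search interface is a single function doing an AND over query terms.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a": ("information retrieval basics", ["b", "c"]),
    "b": ("web search engines crawl pages", ["c"]),
    "c": ("indexing supports fast retrieval", ["a"]),
    "d": ("unlinked page never crawled", []),
}

def crawl(seed):
    """Crawler: breadth-first traversal of links starting from a seed URL."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in WEB[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def build_index(urls):
    """Indexer: inverted index (term -> set of URLs), our stand-in database."""
    index = defaultdict(set)
    for url in urls:
        for term in WEB[url][0].split():
            index[term].add(url)
    return index

def search(index, query):
    """Search interface: URLs that contain every query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

crawled = crawl("a")
index = build_index(crawled)
print(crawled)
print(search(index, "retrieval"))
print(search(index, "unlinked"))
```

Page "d" has no incoming links, so the crawler never reaches it and it cannot be found through the index.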
CHARACTERIZING THE WEB

• Characteristics
• Measuring the Internet, and in particular the Web, is a difficult task due to its highly dynamic nature.
• How many different institutions (not Web servers) maintain Web data?
  • This number is smaller than the number of servers, because many places have multiple servers.
  • The exact number is unknown, but should be larger than 40% of the number of Web servers.
• More recent studies on the size of search engines estimated that there were over 20 billion pages in 2005, and that the size of the static Web is roughly doubling every eight months.
• Nowadays, the Web is infinite for practical purposes, as we can generate an infinite number of dynamic pages (e.g. consider an online calendar).
• The most popular formats of Web documents are HTML, followed by GIF and JPG (both images), ASCII text, and PDF, in that order.
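To get a feel for what "doubling every eight months" implies, the arithmetic can be written out. This is only an illustration of the stated growth rate applied to the 2005 estimate, not a measurement; the function name and the two-year horizon are assumptions.

```python
def projected_pages(initial_pages, months, doubling_months=8):
    """Exponential growth: the size doubles every `doubling_months` months."""
    return initial_pages * 2 ** (months / doubling_months)

pages_2005 = 20e9  # estimate quoted above
after_two_years = projected_pages(pages_2005, 24)
print(f"{after_two_years:.3g} pages after 24 months")
```

Twenty-four months is three doubling periods, so the estimate grows by a factor of eight.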
