Information Retrieval (IR) faces challenges such as vocabulary mismatch, ambiguous queries, and inadequate content representation. Different types of search engines, including mainstream, private, vertical, and computational engines, address different user needs and privacy concerns. The integration of artificial intelligence into IR systems aims to enhance the user experience and improve search outcomes through intelligent processing and automation.


Why is IR difficult?

• Vocabulary mismatch: users and authors often choose different words for the same concept
• Queries are ambiguous
• Content representation may be inadequate and incomplete
• The user is the ultimate judge of relevance, but we don't know how the judge judges
Challenges in IR

• Scale and distribution of documents
• Controversy over the unit of indexing
• High heterogeneity
• Retrieval strategies
Types of Search Engines
1. Mainstream search engines.
• Mainstream search engines like Google, Bing, and Yahoo! are all free to use and supported by online advertising. They all use variations of the same strategy (crawling, indexing, and ranking) to let you search the entirety of the internet.
2. Private search engines.
• Private search engines have risen in popularity recently due to privacy concerns raised by the data collection practices of mainstream search engines. These include anonymous, ad-supported search engines like DuckDuckGo and private, ad-free search engines like Neeva.
3. Vertical search engines.
• Vertical search, or specialized search, is a way of narrowing your search to one topic category rather than the entirety of the web. Examples of vertical search engines include:
1. The search bar on shopping sites like eBay and Amazon
2. Google Scholar, which indexes scholarly literature across publications
3. Searchable social media sites and apps like Pinterest
4. Computational search engines.
• WolframAlpha is an example of a computational search engine, devoted to answering questions related to math and science.
Open source search engine
• Open-source software is software whose source code is available for modification or enhancement by
anyone. "Source code" is the part of software that most computer users don't ever see; it's the code computer
programmers can manipulate to change how a piece of software—a "program" or "application"—works.

Advantages of open source

• The right to use the software in any way.
• There is usually no license cost; the software is free of charge.
• The source code is open and can be modified freely.
• Open standards.
• It provides higher flexibility.
Disadvantages of open source

• There is no guarantee that development will happen.
• It is sometimes difficult to know that a project exists, and what its current status is.
• There is no guaranteed follow-up development strategy.

• Closed software is a term for software whose license does not allow for the release or distribution of the software's source code. Generally, it means only the binaries of a computer program are distributed.
Closed search engines
•Google Search – The most widely used web search engine.
•Bing – Microsoft’s search engine.
•Yandex – Russian search engine.
•Baidu – Leading search engine in China.
•DuckDuckGo – Privacy-focused search engine, but proprietary.
•Yahoo Search – Powered by Bing.
•Brave Search – Independent and privacy-focused search engine.
•Algolia – AI-powered search API for developers.
•Amazon A9 – Search engine used in Amazon’s product search.
•IBM Watson Discovery – AI-powered enterprise search.
• List of open-source search engines:

1. Apache Lucene
2. Sphinx
3. Whoosh
4. Carrot²
Apache Lucene Core
• Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
• Powerful features through a simple API:
• Scalable, high-performance indexing
• Over 150 GB/hour on modern hardware
• Small RAM requirements -- only 1 MB heap
• Incremental indexing as fast as batch indexing
• Index size roughly 20-30% the size of the text indexed
• Powerful, accurate and efficient search algorithms
• Ranked searching -- best results returned first
• Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
• Fielded searching (e.g. title, author, contents)
• Sorting by any field
• Multiple-index searching with merged results
• Allows simultaneous update and searching
• Flexible faceting, highlighting, joins and result grouping
• Fast, memory-efficient and typo-tolerant suggesters
• Pluggable ranking models, including the Vector Space Model and Okapi BM25
• Configurable storage engine (codecs)
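Lucene itself is a Java library, but the core technique behind "ranked searching -- best results returned first" is easy to see in miniature. Below is a conceptual Python sketch of an inverted index with TF-IDF scoring; it illustrates the idea only and is not Lucene's actual API (Lucene's default ranking model today is the more refined Okapi BM25).

import math
from collections import Counter, defaultdict

# Toy corpus standing in for an indexed document collection.
docs = {
    1: "lucene is a java full text search engine library",
    2: "whoosh is a pure python search library",
    3: "sphinx provides full text search for sql databases",
}

# Build the inverted index: term -> {doc_id: term frequency}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def search(query):
    """Rank documents by a simple TF-IDF score (best results first)."""
    scores = Counter()
    n_docs = len(docs)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))  # rare terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores.most_common()  # [(doc_id, score), ...], ranked

print(search("full text search"))  # docs 1 and 3 outrank doc 2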
Sphinx

• Sphinx is a full-text search engine, publicly distributed under the GPL. Technically, Sphinx is a standalone software package that provides fast and relevant full-text search functionality to client applications.
• It was specially designed to integrate well with SQL databases storing the data, and to be easily accessed by scripting languages.
• However, Sphinx does not depend on nor require any specific database to function.
• Applications can access the Sphinx search daemon (searchd) using any of three different access methods:
a) via Sphinx's own implementation of the MySQL network protocol,
b) via the native search API (SphinxAPI), or
c) via a MySQL server with a pluggable storage engine (SphinxSE).
• Starting from version 1.10-beta, Sphinx supports two different indexing backends:
a) "Disk" index backend -- disk indexes support online full-text index rebuilds, but online updates can only be done on non-text (attribute) data.
b) "Realtime" (RT) index backend -- RT indexes additionally allow for online full-text index updates. Previous versions only supported disk indexes.
• Sphinx features are:
• high indexing and searching performance;
• advanced indexing and querying tools;
• advanced result set post-processing (SELECT with expressions, WHERE, ORDER BY, GROUP BY, HAVING etc. over text search results);
• proven scalability up to billions of documents, terabytes of data, and thousands of queries per second;
• easy integration with SQL and XML data sources, and SphinxQL, SphinxAPI, or SphinxSE search interfaces;
• easy scaling with distributed searches.
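Because searchd implements the MySQL network protocol (access method (a) above), any MySQL-protocol client can query it with SphinxQL. A minimal sketch, assuming a local searchd listening on the conventional SphinxQL port 9306 and an index named docs_idx (the index name is a placeholder for illustration):

import pymysql  # any MySQL-protocol client library works

# Connect to the Sphinx daemon itself, not to a MySQL server.
conn = pymysql.connect(host="127.0.0.1", port=9306)
try:
    with conn.cursor() as cur:
        # SphinxQL: MATCH() runs the full-text query; WHERE / ORDER BY /
        # LIMIT post-process the result set, as listed in the features above.
        cur.execute(
            "SELECT id, WEIGHT() AS relevance FROM docs_idx "
            "WHERE MATCH('full-text search') "
            "ORDER BY relevance DESC LIMIT 10"
        )
        for doc_id, relevance in cur.fetchall():
            print(doc_id, relevance)
finally:
    conn.close()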
Whoosh
• Whoosh was created by Matt Chaput. It started as a quick and dirty search server for the online documentation of the Houdini 3D animation software package.
• Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.
• Programmers can use it to easily add search functionality to their applications and websites.
• Every part of how Whoosh works can be extended or replaced to meet your needs exactly. Whoosh's features include:
• Pythonic API.
• Pure Python. No compilation or binary packages needed, no mysterious crashes.
• Fielded indexing and search.
• Fast indexing and retrieval -- faster than any other pure-Python, scoring, full-text search solution.
• Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
• Powerful query language.
• Pure-Python spell-checker.
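A minimal sketch of fielded indexing and ranked search with Whoosh (the index directory and field names are arbitrary choices for illustration):

import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Fielded schema: searchable title/body, stored path identifier.
schema = Schema(path=ID(stored=True),
                title=TEXT(stored=True),
                body=TEXT)

os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

# Index a couple of documents.
writer = ix.writer()
writer.add_document(path="/a", title="Whoosh intro",
                    body="pure python full text search library")
writer.add_document(path="/b", title="Lucene intro",
                    body="java full text search engine library")
writer.commit()

# Parse a query against the 'body' field and run a ranked search.
with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse("python AND search")
    for hit in searcher.search(query, limit=10):
        print(hit["path"], hit["title"], hit.score)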
Carrot²

• Carrot² is an open source search results clustering engine. It can automatically organize small collections of documents into thematic categories.
• The architecture of Carrot² is based on processing components arranged into pipelines. The two major groups of processing components in Carrot² are: a) document sources and b) clustering algorithms.
• a) Document sources provide data for further processing. Typically, they would e.g. fetch search results from an external search engine or a Lucene/Solr index, or load text files from a local disk.
• Currently, Carrot² has built-in support for the following document sources: Bing Search API, Lucene index, OpenSearch, PubMed, Solr server, eTools metasearch engine, and generic XML files. Other document sources can be integrated based on the code examples provided with the Carrot² distribution.
• b) Clustering algorithms: Carrot² offers two specialized document clustering algorithms that place emphasis on the quality of cluster labels:
• Lingo -- a clustering algorithm based on singular value decomposition
• STC -- Suffix Tree Clustering
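The pipeline architecture (a document source feeding a clustering algorithm) can be sketched in plain Python. This is only a conceptual illustration of the component arrangement, not Carrot²'s actual Java API; the one-word grouping rule is a trivial stand-in for Lingo or STC:

from collections import defaultdict

# a) Document source: any callable that yields documents.
#    This stub stands in for e.g. a Solr or Bing Search API fetcher;
#    the 'query' argument would select what to fetch, but is ignored here.
def document_source(query):
    return [
        {"title": "Python search libraries", "snippet": "whoosh python search"},
        {"title": "Java search libraries", "snippet": "lucene java search"},
        {"title": "Python clustering", "snippet": "python cluster labels"},
    ]

# b) Clustering algorithm: groups documents and labels the clusters.
#    Trivial rule (cluster by first snippet word) as a stand-in for Lingo/STC.
def cluster(documents):
    clusters = defaultdict(list)
    for doc in documents:
        label = doc["snippet"].split()[0]
        clusters[label].append(doc["title"])
    return dict(clusters)

# Pipeline: source -> clustering -> labeled thematic categories.
print(cluster(document_source("search libraries")))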
Impact of the Web on IR

• 1989: Tim Berners-Lee, a British computer scientist, proposed the concept of the World Wide Web while working at CERN (the European Organization for Nuclear Research).
• 1990: The concept was successfully tested, and the first website was created.
• 1991: The World Wide Web was publicly released, allowing people outside of CERN to use and access web pages. This marked the beginning of the modern internet era.
• It was called the World Wide Web (WWW).
• The WWW rests on three core technologies:
• HTML
• HTTP
• URLs
IR on the web

• IR on the web has always been a difficult and different task compared to a classical retrieval system, because of:
• Hypertext
• Heterogeneity of documents
• Duplication
• Number of documents
• Lack of stability
• Poor queries
• Reaction to results
• Heterogeneity of users
• An IR system involves two kinds of terms: objective and non-objective.
• Objective terms are extrinsic to the semantic content.
Ex: author name, document URL, date of publication.
• Non-objective terms are intended to reflect the information in the document; there is no agreement about the choice or degree of applicability of these terms. They are known as content terms.
Ex: keywords, concepts and topics, synonyms and related terms (different expressions of the same concept), latent semantic terms.
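As a small illustration, a document representation might keep the two kinds of terms in separate fields (all field names and weights below are hypothetical):

# Hypothetical document representation separating the two kinds of terms.
document = {
    # Objective terms: extrinsic to the semantic content.
    "author": "J. Doe",
    "url": "https://example.org/ir-notes",
    "published": "2021-03-01",
    # Non-objective (content) terms: reflect the information itself,
    # with no single agreed-upon choice of terms or weights.
    "content_terms": {"information retrieval": 0.9,
                      "search": 0.7,
                      "indexing": 0.5},
}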
IR queries

• Keyword queries
• Boolean Queries
• Phrase queries
• Proximity queries
• Full document queries
• Natural language questions
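Most of these can be expressed directly as query strings. The sketch below uses Whoosh's default query syntax as one concrete instance; the exact operators (especially the proximity form) vary between engines and should be checked against the engine's documentation:

# Example query strings for the query types above, in Whoosh's
# default query syntax (other engines use different operators):
queries = {
    "keyword":   "retrieval",                    # bare terms
    "boolean":   "retrieval AND (web OR search) NOT spam",
    "phrase":    '"information retrieval"',      # exact phrase
    "proximity": '"information retrieval"~3',    # terms within 3 positions
}

Full document queries and natural language questions are usually reduced to one of the forms above first, e.g. by extracting key terms from the example document or the question.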
Web challenges for IR

• The WWW is expanding faster than any current search engine can possibly index it. Many web pages are updated frequently or are dynamically generated, which forces search engines to repeatedly revisit them.
• Many dynamically generated sites are not indexable by search engines; this portion is known as the invisible web.
• The ordering of results is not always solely by relevance, but is sometimes influenced by monetary contributions, which sits uneasily with the business model.
• Some sites use tricks to manipulate the search engine to improve their ranking for certain keywords, known as search engine spamming.
Web problems are divided into 2 classes

• Problems with the data itself (data-centric)
• Problems regarding the user (interaction-centric)
Problems with the data itself

• Distributed data: documents spread over millions of different web servers.
• Volatile data: many documents change or disappear rapidly.
• Large volume: trillions of separate documents.
• Unstructured and redundant data: HTML errors, duplicate documents.
• Quality of data: false information, poor-quality writing.
• Heterogeneous data: multiple media types.
Problems regarding the user
These problems concern how users interact with web systems and services.
• How to specify the query?
• How to interpret the answer provided by the system?

• Usability issues: designing intuitive interfaces for a better user experience.
• Personalization: tailoring content and recommendations based on user preferences.
• Accessibility: ensuring web content is accessible to users with disabilities.
• User engagement: encouraging interaction and participation.
• Trust and credibility: ensuring the user perceives the content as reliable and authentic.
• Latency and performance: ensuring fast and responsive interactions.
• Behavior analysis: understanding user needs and optimizing interactions accordingly.
The role of artificial intelligence (AI) in IR

• In the early days of computer science, IR and AI developed in parallel.

• Information Retrieval
• The amount of available information is growing at an incredible rate, for example on the Internet and World Wide Web. Information is stored in many forms, e.g. images, text, video, and audio.
• Information Retrieval is a way to separate relevant data from irrelevant.
• The IR field has developed successful methods to deal effectively with huge amounts of information. Common methods include the Boolean, Vector Space and Probabilistic models.

• Artificial Intelligence
• The study of how to construct intelligent machines and systems that can simulate or extend the development of human intelligence.

• In the 1980s, the two fields started to cooperate, and the term intelligent information retrieval was coined for AI applications in IR.
• The integration of Artificial Intelligence and Information Retrieval has led to the following developments:
o Methods to learn the user's information needs.
o Extraction of information based on what has been learned.
o Representation of the semantics of information.
• In the 1990s, information retrieval saw a shift from set-based Boolean retrieval models to ranking systems.
What are Intelligent IR Systems?
• The concept of 'intelligent' information retrieval was first suggested in the late 1970s.
• It was not pursued by the IR community until the early 1990s.
• An intelligent IR system can simulate the human thinking process on information processing and intelligence activities, to achieve information and knowledge storage, retrieval and reasoning, and to provide intelligence support.
• In an intelligent IR system, the functions of the human intermediary are performed by a program interacting with the human user.
• Intelligent IR is performed by a computer program (an intelligent agent), which acts on minimal or no explicit instructions from a human user, and retrieves and presents information to the user without any other interaction.
How to introduce AI into IR systems?

• Levels of user and system involvement:
• Level 0 -- No system involvement (the user comes up with a tactic, formulating a query, coming up with a strategy and thinking about the outcome).
• Level 1 -- The user can ask for information about searching (the system suggests tactics that can be used to formulate queries, e.g. help).
• Level 2 -- The user simply enters a query, suggests what needs to be done, and the system executes the query to return results.
• Level 3 -- First signs of AI. The system actually starts suggesting improvements to the user.
• Level 4 -- Full automation. User queries are entered and the rest is done by the system.
Some AI methods currently used in Intelligent IR Systems
• Web crawlers (for information extraction)
• Mediator techniques (for information integration)
• Ontologies (for intelligent information access, by making the semantics of information explicit and machine readable)
• Neural networks (for document clustering & preprocessing)
• Kohonen neural networks -- self-organizing maps
• Hopfield networks
• Semantic networks
Areas of AI for IR

• Reasoning under uncertainty
• Natural language processing
• Knowledge representation
• Cognitive theory
• Machine learning
• Computer vision
AI applied to IR

• System characterization
• Information integration
• Search formulation in information seeking
• Support functions
Web Search vs IR
• Traditional IR systems normally index a closed collection of documents, which are mainly text-based and usually offer little linkage between documents.
• Traditional IR systems are often referred to as full-text retrieval systems.
• Libraries were among the first to adopt IR to index their catalogs and, later, to search through information that was typically distributed on CD-ROMs.
• The main aim of traditional IR was to return relevant documents that satisfy the user's information need.
• Satisfying the user's need remains the central issue in web IR (or web search) as well.
Components of a Search Engine

• A search engine is an information retrieval software program that discovers, crawls, transforms and stores information for retrieval and presentation in response to user queries.
• A search engine normally consists of four components:
• search interface,
• crawler (also known as a spider or bot),
• indexer, and
• database.
• The crawler traverses a document collection, deconstructs document text, and assigns surrogates for storage in the search engine index.
• Online search engines store images, link data and metadata for the document as well.
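A minimal sketch of the crawler-plus-indexer part of this pipeline, using only the Python standard library (the seed URL is a placeholder; a real crawler also needs politeness delays, robots.txt handling, and deduplication):

from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class PageParser(HTMLParser):
    """Deconstructs a page into its text and its outgoing links."""
    def __init__(self):
        super().__init__()
        self.text, self.links = [], []
    def handle_data(self, data):
        self.text.append(data)
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_and_index(seed, max_pages=5):
    index = defaultdict(set)              # term -> set of URLs ("database")
    frontier, seen, fetched = deque([seed]), {seed}, 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                      # unreachable or bad URL: skip it
        fetched += 1
        parser = PageParser()
        parser.feed(html)
        for term in " ".join(parser.text).lower().split():
            index[term].add(url)          # indexer: term -> document surrogate
        for link in parser.links:         # crawler: follow discovered links
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return index

index = crawl_and_index("https://example.com/")  # placeholder seed URL
print(sorted(index)[:10])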
CHARACTERIZING THE WEB

• Characteristics
• Measuring the Internet, and in particular the Web, is a difficult task due to its highly dynamic nature.
• How many different institutions (not Web servers) maintain Web data? This number is smaller than the number of servers, because many places have multiple servers.
• The exact number is unknown, but it should be larger than 40% of the number of Web servers.
• More recent studies on the size of search engines estimated that there were over 20 billion pages in 2005, and that the size of the static Web was roughly doubling every eight months.
• Nowadays, the Web is infinite for practical purposes, as we can generate an infinite number of dynamic pages (e.g. consider an online calendar).
• The most popular formats of Web documents are HTML, followed by GIF and JPG (both images), ASCII text, and PDF, in that order.
