Ir - Chapter 1

This document provides an overview of information retrieval (IR). It discusses how IR has evolved from relying on librarians to using tools such as indexing. It also describes some challenges in IR, such as unstructured text and volatile web data. The typical IR process involves indexing documents, formulating queries, matching queries against the index, and selecting relevant results. Artificial intelligence techniques such as natural language processing and machine learning can help improve different parts of the IR process.

Uploaded by

Anoo Shrestha

IR - CHAPTER 1 2011

INFORMATION RETRIEVAL
- Text is the primary way that human knowledge is stored and, after speech, the primary way it is transmitted.
- Techniques for storing and searching for textual documents are nearly as old as written language
itself.
- In the past, information retrieval meant going to the town’s library and asking the librarian for help.
- The librarian usually knew all the books in the library’s possession and could give a definite answer.
- As the number of books grew, this became impossible, so tools for information retrieval had to be
devised.
- One of the most important tools is indexing.
- An index is a list of terms with pointers to places where information about them can be found.
- The terms can be subject matter, author names, etc.
- Oliver Wendell Holmes wrote in 1872, “It is the province of knowledge to speak and it is the
privilege of wisdom to listen”.
- In the future: “It is the province of knowledge to write and it is the privilege of wisdom to query.”
- The field of computer science that deals with the automated storage and retrieval of documents is
called information retrieval.
- Requires:
o Algorithms – For manipulating natural language.
o Data Structures – To efficiently store and process data.

WHAT MAKES IR A HARD PROBLEM?

1. Under good circumstances
- Text is unstructured.
- Requires understanding of semantics. For example: restaurant = café, PRC = China, fast
automobiles = fast cars.
- Human language presents distinct problems like ambiguity. For example: bat (mammal or
baseball bat), apple (company or fruit), bit (unit of data or past tense of "bite"), etc.

2. Under hard circumstances
- Web pages change rapidly.
- Many pages lie about their content.
- New pages are not linked to.

3. Multimedia information
- Hard to store (size), represent and compare.

IR SYSTEM

- Searching for pages on the World Wide Web is the most recent killer application.
- IR is concerned firstly with retrieving documents that are relevant to a query.

- Relevance is a subjective judgment and may include:
1. Being on the proper subject.
2. Being timely (recent information).
3. Being authoritative (from a trusted source).
4. Satisfying the goals of the user.

TYPICAL IR
1. Given
- A corpus of textual natural language documents.
- A user query in the form of a textual string.
2. Find
- A ranked set of documents that are relevant to the query.
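
The task stated above can be sketched in a few lines of Python. This is only an illustration: the function name `rank`, the toy corpus, and the scoring rule (the number of times query terms occur in a document) are assumptions chosen for brevity, not something the notes prescribe.

```python
from collections import Counter

def rank(corpus, query):
    """Return (doc_id, score) pairs for matching documents, best first.

    Score = total number of occurrences of query terms in the document
    (an illustrative choice; real systems use richer relevance measures).
    """
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scored.append((doc_id, score))
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

corpus = {
    "d1": "fast cars and fast trains",
    "d2": "a guide to slow cooking",
    "d3": "cars for sale",
}
print(rank(corpus, "fast cars"))  # [('d1', 3), ('d3', 1)]
```

Documents that share no term with the query are left out of the result, so the output is a ranked subset of the corpus rather than the whole corpus.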

IR SYSTEM ARCHITECTURE

IR SYSTEM COMPONENTS
1. Text Operations
- Forms index words (tokens) by stop-word removal and stemming.
2. Indexing
- Constructs an inverted index of word to document pointers.
3. Searching
- Retrieves documents that contain a given query token from the inverted index.
4. Ranking
- Scores all retrieved documents according to a relevance metric.
5. User Interface
- Manages interaction with the user.
- Query input and document output.
- Relevance feedback.
- Visualization of results.
6. Query Operations
- Transforms the query to improve retrieval.
- Query expansion using a thesaurus.
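
The indexing and searching components listed above can be sketched as follows. This is a minimal example that assumes whitespace tokenization and lower-casing only; the function names `build_inverted_index` and `search` and the three sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

def search(index, token):
    """Return the ids of documents that contain the given query token."""
    return index.get(token.lower(), [])

docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "web information",
}
index = build_inverted_index(docs)
print(search(index, "systems"))      # [1, 2]
print(search(index, "information"))  # [1, 3]
print(search(index, "library"))      # []
```

The index is built once and a lookup then costs a single dictionary access, which is why the inverted index is preferred to scanning every document for every query.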

WEB SEARCH AND IR (Application of IR to HTML documents on the World Wide Web)

WEB CHALLENGES OF IR
1. Distributed Data
- Documents spread over millions of different web servers.
2. Volatile Data
- Many documents change or disappear rapidly. For example: dead links.
3. Large Volume
- Billions of separate documents.
4. Unstructured and Redundant Data
- No Uniform Structure.
- Up to 30% of documents are (near) duplicates.
5. Quality of Data
- No editorial control.
- False information.
- Poor quality writing.
6. Heterogeneous Data
- Multiple media types (image, video).
- Multiple languages.
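
The notes do not say how near-duplicate documents are detected. One commonly used approach, shown here only as an illustration, compares documents by the Jaccard similarity of their word shingles. The function names, the shingle size `k=3`, and the sample sentences are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the sleepy dog"
page_c = "information retrieval deals with storage and search"

sim_ab = jaccard(shingles(page_a), shingles(page_b))
sim_ac = jaccard(shingles(page_a), shingles(page_c))
print(round(sim_ab, 2), sim_ac)  # 0.56 0.0
```

A system would treat pairs whose similarity exceeds some threshold as near-duplicates; the threshold is a tuning choice and is not given in the notes.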

AREAS OF AI FOR IR
1. Natural Language Processing
- Focused on syntactic, semantic and pragmatic analysis of natural language text.
- Retrieval should be based on semantics (meaning).
- Methods for determining the sense of an ambiguous word based on context.
- Question answering.
2. Machine Learning
- Focused on the development of computational systems that improve their performance with
experience.
- Automated classification of examples based on learning concepts from labeled training examples.

- For example: supervised learning.
- Automated methods for clustering unlabeled examples into meaningful groups (unsupervised).
- Text categorization (For example: spam filtering).
- Text clustering (clustering of IR query results).
- Text mining.
3. Knowledge Representation
- Expert systems
4. Reasoning Under Uncertainty
- Bayesian networks
5. Cognitive Theory
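
Text categorization by supervised learning, such as the spam filtering mentioned in the list above, can be illustrated with a small multinomial naive Bayes classifier. The notes do not name a specific algorithm; naive Bayes, the class name `NaiveBayes`, add-one smoothing, and the four training messages are all assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each word (add-one smoothed).
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_texts = [
    "win money now",
    "cheap money offer",
    "meeting agenda for monday",
    "project meeting notes",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("cheap money"))     # spam
print(model.predict("monday meeting"))  # ham
```

The classifier learns only from labeled examples, which is what distinguishes it from the clustering methods in the same list, where no labels are available.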

STEPS IN IR PROCESS (RETRIEVAL PROCESS)

1. Indexing (Creating document representation)
- Indexing is the manual or automated process of making statements about a document, lesson,
person, and so on.
- For example: by author, by subject, by text, etc.
- Index can be:
i. Document oriented: - the indexer captures what the document is about.
ii. Request oriented: - the indexer assesses the document's relevance to subjects and other
features of interest to users.
- Automated indexing begins with feature extraction such as extracting all words from a text,
followed by refinements such as eliminating stop words (a, an, the, of), stemming (walking →
walk), counting the most frequent words, mapping the concepts using thesaurus (tube → pipe).
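
The automated indexing steps described above can be sketched as a small pipeline. The stop-word list and the tube/pipe mapping come from the examples in the notes; the suffix-stripping rule in `naive_stem` is a crude stand-in for a real stemmer, and all function names are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of"}   # the stop words listed in the notes
THESAURUS = {"tube": "pipe"}            # concept mapping from the notes' example

def naive_stem(word):
    """Strip a trailing 'ing' or 's'; a crude stand-in for a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def index_terms(text):
    """Extract words, drop stop words, stem, map via thesaurus, and count."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    kept = [w for w in words if w and w not in STOP_WORDS]
    stemmed = [naive_stem(w) for w in kept]
    mapped = [THESAURUS.get(w, w) for w in stemmed]
    return Counter(mapped)

terms = index_terms("Walking the length of a tube. The tubes, the pipes.")
print(terms.most_common(1))  # [('pipe', 3)]
```

Here "tube", "tubes", and "pipes" all end up as the single index term "pipe", which is the effect the stemming and thesaurus steps are meant to achieve.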

2. Query Formulation (Creating query representation)

- Retrieval means using the available evidence to predict the degree to which a document is
relevant or useful for a given user need as described in a free form query description.

- A query can specify the text words or phrases the system should look for.
- The query description is transformed manually or automatically into a formal query
representation, ready to match with the document representation.

3. Matching the Query Representation With Entity Representation

- The match uses the features specified in the query to predict document relevance.
- Exact match (0 or 1).
- Synonym expansion (pipe → tube).
- Hierarchical expansion (pipe → capillary).
- The system ranks the result.
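
The matching step above can be sketched as follows. The pipe/tube and pipe/capillary pairs come from the notes; the dictionaries `SYNONYMS` and `NARROWER`, the function names, and the scoring rule (number of query terms satisfied) are assumptions for illustration.

```python
SYNONYMS = {"pipe": {"tube"}}        # synonym expansion from the notes
NARROWER = {"pipe": {"capillary"}}   # hierarchical expansion from the notes

def expand(term):
    """Return the query term together with its synonyms and narrower terms."""
    return {term} | SYNONYMS.get(term, set()) | NARROWER.get(term, set())

def match(query_terms, documents):
    """Score each document by how many query terms it satisfies, best first.

    A query term counts as 1 if the document contains the term or any of
    its expansions, and as 0 otherwise (exact match per term).
    """
    results = []
    for doc_id, text in documents.items():
        doc_words = set(text.lower().split())
        score = sum(1 for t in query_terms if expand(t) & doc_words)
        if score:
            results.append((doc_id, score))
    return sorted(results, key=lambda r: (-r[1], r[0]))

documents = {
    "d1": "steel pipe fittings",
    "d2": "glass capillary for fluid flow",
    "d3": "copper tube with steel fittings",
    "d4": "garden furniture",
}
print(match(["pipe", "steel"], documents))
# [('d1', 2), ('d3', 2), ('d2', 1)]
```

Without expansion, only d1 would match "pipe"; with it, d3 is found through the synonym and d2 through the narrower term.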

4. Selection
- User examines the results and selects the relevant items.

5. Relevance Feedback & Interactive Retrieval

- The system can assist the user in improving the query by showing a list of features (options)
found in many relevant items.
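
A minimal sketch of this kind of feedback, assuming the "features" are simply words and that the user has already marked some results as relevant. The function name `suggest_terms`, the parameter `top_n`, and the sample documents are hypothetical.

```python
from collections import Counter

def suggest_terms(query, relevant_docs, top_n=3):
    """Suggest terms that occur in many user-selected relevant documents.

    Counts, for each term, the number of relevant documents containing it,
    skips terms already in the query, and returns the top_n most common.
    """
    query_terms = set(query.lower().split())
    doc_freq = Counter()
    for text in relevant_docs:
        doc_freq.update(set(text.lower().split()) - query_terms)
    # Sort by descending document frequency, then alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]

relevant = [
    "jaguar car speed test",
    "jaguar car engine review",
    "new jaguar engine",
]
print(suggest_terms("jaguar", relevant, top_n=2))  # ['car', 'engine']
```

The user can then add one of the suggested terms to the query, for example to separate the car sense of "jaguar" from the animal sense.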
