Unit - I - IR

Uploaded by

Chandrashekar B.H.


Course Code: 18MCA431
Credits (L:T:P): 3:0:0
Total Hours: 39L
CIE Marks: 100
SEE Marks: 100
SEE Duration: 3 Hrs (T)

Course Outcomes: After completing the course, the students will be
able to
CO1: Understand the concept of Information Retrieval, its models and
Search Engine
CO2: Recognize and use various indexing and querying techniques to
store and retrieve documents
CO3: Apply IR principles to extract relevant information and build
retrieval models
CO4: Analyse and evaluate the IR techniques, retrieval models and
search engines
Continuous Internal Evaluation (CIE): Theory (100 Marks)
CIE is executed by way of Quizzes (Q), Tests (T) and Assignments (A).
• A minimum of two quizzes are conducted and each quiz is evaluated for 10 marks, adding up to 20 marks.
• Three tests are conducted for 50 marks each, and the sum of the marks scored in the three tests is reduced to 50 marks.
• A minimum of two assignments are given, with a combination of two components from:
1) solving innovative problem using different platforms
2) seminar/new developments in the related course
• Total CIE is 20 (Q) + 50 (T) + 30 (A) = 100 Marks
Semester End Evaluation (SEE): Theory (100 Marks)
The question paper will have FIVE questions with internal choice from each unit. Each question will carry 20 marks. Students will have to answer one full question from each unit.
Unit I – 07 hours : Introduction to information retrieval ,
architecture of a search engine-Search Engines

Information Retrieval - What Is Information Retrieval?, The Big Issues, Search Engines, Search Engineers
Architecture of a Search Engine - What is architecture?, Basic Building Blocks, Breaking It Down
Unit II – 08 hours: Crawls and Feeds, Processing Text
Crawls and Feeds - Deciding what to search, Crawling the Web, Crawling Documents and Email, Document Feeds, The Conversion Problem, Storing the Documents, Detecting Duplicates
Processing Text - From Words to Terms, Text Statistics, Document Parsing, Document Structure and Markup, Link Analysis, Information Extraction, Internationalization
Unit III – 08 hours: Ranking with Indexes
Overview, Abstract Model of Ranking, Inverted Indexes, Compression, Auxiliary Structures, Index Construction, Query Processing
Unit IV – 08 hours: Queries and Interfaces
Information Needs and Queries, Query Transformation and Refinement, Showing the Results, Cross-Language Search
Unit V – 08 hours: Retrieval Models and Evaluating Search Engines
Overview of Retrieval Models, Probabilistic Models, Ranking Based on Language Models
Why Evaluate?, The Evaluation Corpus, Effectiveness Metrics, Efficiency Metrics
Unit I : Introduction to Information Retrieval
• Search on the Web is a daily activity for many people throughout the world
• Search and communication are the most popular uses of the computer
• Applications involving search are everywhere
• The field of computer science that is most involved with R&D for search is information retrieval (IR)
Unit I : Information Vs Data
Information Retrieval
“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
• General definition that can be applied to many types of information and search applications
• Primary focus of IR since the 50s has been on text and documents
What is a Document?
• Examples:
– web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM sessions, etc.
• Common properties
– Significant text content
– Some structure (e.g., title, author, date for papers; subject, sender, destination for email)
Documents vs. Database Records
• Database records (or tuples in relational databases) are typically made up of well‐defined fields (or attributes)
– e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
• Easy to compare fields with well‐defined semantics to queries in order to find matches
• Text is more difficult
Documents vs. Records
• Example bank database query
– Find records with balance > $50,000 in branches located in Amherst, MA.
– Matches easily found by comparison with field values of records
• Example search engine query
– bank scandals in western mass
– This text must be compared to the text of entire news stories
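The bank database query above can be sketched as a plain comparison against field values. This is a minimal illustration, not a real banking schema: the `Account` class, its field names, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # Hypothetical fields standing in for a bank record
    holder: str
    balance: float
    branch_city: str
    branch_state: str


# Made-up sample records for illustration only
accounts = [
    Account("A. Rao", 72_000.0, "Amherst", "MA"),
    Account("B. Lee", 12_500.0, "Amherst", "MA"),
    Account("C. Diaz", 90_000.0, "Boston", "MA"),
]


def high_balance_in(records, city, state, minimum):
    """Return records whose fields satisfy every condition of the query."""
    return [
        r for r in records
        if r.balance > minimum
        and r.branch_city == city
        and r.branch_state == state
    ]


matches = high_balance_in(accounts, "Amherst", "MA", 50_000)
print([m.holder for m in matches])
```

Each condition is a direct test on one well-defined field, which is why matches are easy to find. The search engine query has no such fields to test against.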
Comparing Text
• Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval
• Exact matching of words is not enough
– Many different ways to write the same thing in a “natural language” like English
– e.g., does a news story containing the text “bank director in Amherst steals funds” match the query?
– Some stories will be better matches than others
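A small sketch of why exact word matching falls short, using the query and the news story text from the slides. The helper `words` is a hypothetical name, and splitting on whitespace is a simplifying assumption.

```python
def words(text):
    """Lowercase the text and split it into a set of distinct words."""
    return set(text.lower().split())


query = "bank scandals in western mass"
story = "bank director in Amherst steals funds"

# Words that appear, exactly, in both the query and the story
shared = words(query) & words(story)
print(sorted(shared))  # ['bank', 'in']
```

Only "bank" and the function word "in" match exactly. A reader sees that "steals funds" relates to "scandals" and that Amherst is in western Massachusetts, but exact matching sees neither.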
Dimensions of IR
• IR is more than just text, and more than just web search
– although these are central
• People doing IR work with different media, different types of search applications, and different tasks
Other Media
• New applications increasingly involve new media
– e.g., video, photos, music, speech
• Like text, content is difficult to describe and compare
– text may be used to represent them (e.g., tags)
• IR approaches to search and evaluation are appropriate
Dimensions of IR
IR Tasks
• Ad‐hoc search
– Find relevant documents for an arbitrary text query
• Filtering
– Identify relevant user profiles for a new document
• Classification
– Identify relevant labels for documents
• Question answering
– Give a specific answer to a question
Big Issues in IR
Contd…
Contd…
Contd…
IR and Search Engines
• A search engine is the practical application of information retrieval techniques to large scale text collections
• Web search engines are the best‐known examples, but there are many others
– Open source search engines are important for research and development, e.g., Lucene, Lemur/Indri, Galago
IR and Search Engines
Issues in Search Engines
• Performance
– Measuring and improving the efficiency of search, e.g., reducing response time, increasing query throughput, increasing indexing speed
– Indexes are data structures designed to improve search efficiency; designing and implementing them are major issues for search engines
• Dynamic data
– The “collection” for most real applications is constantly changing in terms of updates, additions, deletions, e.g., web pages
– Acquiring or “crawling” the documents is a major task; typical measures are coverage and freshness
– Updating the indexes while processing queries is also a design issue
Contd…
• Scalability
– Making everything work with millions of users every day, and many terabytes of documents
– Distributed processing is essential
• Adaptability
– Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applications
Spam
• For Web search, spam in all its forms is one of the major issues
• Affects the efficiency of search engines and, more seriously, the effectiveness of the results
• Many types of spam, e.g., spamdexing or term spam, link spam, “optimization”
• New subfield called adversarial IR, since spammers are “adversaries” with different goals
ANY QUESTIONS???
Search Engine Architecture
• A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
– describes a system at a particular level of abstraction
• The architecture of a search engine is determined by two requirements
– effectiveness (quality of results) and efficiency (response time and throughput)
Indexing Process
Indexing Process
• Text acquisition
– identifies and stores documents for indexing
• Text transformation
– transforms documents into index terms or features
• Index creation
– takes index terms and creates data structures (indexes) to support fast searching
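The three steps above can be sketched end to end on a toy collection. This is a minimal illustration under stated assumptions: the function names `acquire`, `transform`, and `build_index` are hypothetical, the documents are hard-coded rather than crawled, and the index is a simple term-to-document-ids mapping (an inverted index, which Unit III covers).

```python
from collections import defaultdict


def acquire():
    """Text acquisition: return (doc_id, text) pairs (hard-coded here)."""
    return [
        (1, "Bank director in Amherst steals funds"),
        (2, "Western Mass bank opens new branch"),
        (3, "Search engines index web pages"),
    ]


def transform(text):
    """Text transformation: lowercase and split the text into index terms."""
    return text.lower().split()


def build_index(docs):
    """Index creation: map each term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs:
        for term in transform(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_index(acquire())
print(index["bank"])  # [1, 2]
```

Looking up a term is now a single dictionary access instead of a scan over every document, which is the sense in which the index supports fast searching.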
Query Process
Query Process
• User interaction – supports creation and refinement of query, display of results
• Ranking – uses query and indexes to generate ranked list of documents
• Evaluation – monitors and measures effectiveness and efficiency (primarily offline)
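A rough sketch of the ranking step: given a query and an index, produce a ranked list of documents. The scoring rule here (count how many query terms each document contains) is an assumption chosen for brevity, not a retrieval model from the course, and the toy `index` is hypothetical.

```python
from collections import Counter

# Hypothetical toy index: term -> ids of documents containing that term
index = {
    "bank": [1, 2],
    "amherst": [1],
    "western": [2],
    "mass": [2],
    "search": [3],
}


def rank(query, index):
    """Score each document by the number of query terms it contains."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id in index.get(term, []):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = rank("bank scandals in western mass", index)
print(results)  # [(2, 3), (1, 1)]
```

Document 2 matches three query terms and document 1 matches one, so document 2 is ranked first. Document 3 shares no terms with the query and does not appear at all.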
Crawler
• Identifies and acquires documents for search engine
• Many types – web, enterprise, desktop
• Web crawlers follow links to find documents
• Must efficiently find huge numbers of web pages (coverage) and keep them up‐to‐date (freshness)
• Single site crawlers for site search
• Topical or focused crawlers for vertical search
• Document crawlers for enterprise and desktop search
– Follow links and scan directories
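How a web crawler follows links can be sketched without any network access by walking a small in-memory link graph. Everything here is a stand-in: the `LINKS` dictionary, the page names, and the `crawl` function are hypothetical, and a real crawler would fetch pages over HTTP and extract links from them.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to
LINKS = {
    "home": ["news", "about"],
    "news": ["story1", "story2", "home"],
    "about": [],
    "story1": ["news"],
    "story2": [],
    "orphan": ["home"],  # no page links here, so link-following never finds it
}


def crawl(seed, links, limit=100):
    """Breadth-first crawl from a seed page, visiting each page at most once."""
    seen = {seed}
    frontier = deque([seed])
    order = []
    while frontier and len(order) < limit:
        page = frontier.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


crawled = crawl("home", LINKS)
print(crawled)
# Fraction of all pages that were found, a crude stand-in for coverage
print(len(crawled) / len(LINKS))
```

The page named "orphan" is never reached because no crawled page links to it, which is one simple way coverage can fall short.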
Text Acquisition
• Feeds
– Real‐time streams of documents, e.g., web feeds for news, blogs, video, radio, tv
– RSS is a common standard
– RSS “reader” can provide new XML documents to search engine
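A minimal sketch of what an RSS "reader" does, using Python's standard `xml.etree.ElementTree` module. The feed text, the example.com URLs, and the `read_feed` function are made up for illustration; a real reader would download the feed and track which items it has already seen.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed; a real reader would fetch this from a feed URL
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Bank director steals funds</title>
      <link>http://example.com/story1</link>
    </item>
    <item>
      <title>New branch opens</title>
      <link>http://example.com/story2</link>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return (title, link) pairs for every item in the feed."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


items = read_feed(RSS)
for title, link in items:
    print(title, link)
```

Each item names a new document and where to fetch it, so the feed tells the search engine what to acquire without crawling.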
Contd…
• Conversion
– Convert variety of documents into a consistent text plus metadata format, e.g., HTML, XML, Word, PDF, etc. → XML
– Convert text encoding for different languages, using a Unicode encoding such as UTF‐8
• Document data store
– Stores text, metadata, and other related content for documents
– Metadata is information about document such as type and creation date
– Other content includes links, anchor text
– Provides fast access to document contents for search engine components, e.g., result list generation
– Could use relational database system
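The conversion step can be sketched for one input format: an HTML page in a legacy encoding is turned into plain text plus a piece of metadata, with the text re-encoded as UTF-8. This is a simplified illustration using the standard `html.parser` module. The `TextAndTitle` class, the `convert` function, and the output dictionary are hypothetical, and the slides' target format is XML rather than a dictionary.

```python
from html.parser import HTMLParser


class TextAndTitle(HTMLParser):
    """Collect the page title (metadata) and the visible text separately."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def convert(raw_bytes, source_encoding):
    """Decode the page, strip markup, and return text, title, and UTF-8 bytes."""
    parser = TextAndTitle()
    parser.feed(raw_bytes.decode(source_encoding))
    parser.close()
    text = " ".join(parser.chunks)
    return {"title": parser.title, "text": text, "utf8": text.encode("utf-8")}


# A made-up page stored in Latin-1 rather than UTF-8
raw = "<html><head><title>Café news</title></head><body><p>Bank in Zürich</p></body></html>".encode("latin-1")
doc = convert(raw, "latin-1")
print(doc["title"], "|", doc["text"])
```

The source encoding is passed in explicitly here; detecting it automatically is a separate problem that this sketch does not attempt.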
Text Transformation
Contd…
Contd…
• Stopping
– Remove common words, e.g., “and”, “or”, “the”, “in”
– Some impact on efficiency and effectiveness
– Can be a problem for some queries
• Stemming
– Group words derived from a common stem, e.g., “computer”, “computers”, “computing”, “compute”
– Usually effective, but not for all queries
– Benefits vary for different languages
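Stopping and stemming can be sketched in a few lines. The stopword list is just the four examples from the slide, and the suffix-stripping rule is a crude assumption made for illustration; it is not the Porter stemmer or any other published algorithm, and it will mishandle many English words.

```python
# The four example stopwords from the slide; real lists are much longer
STOPWORDS = {"and", "or", "the", "in"}

# Crude, illustrative suffix list, checked longest first
SUFFIXES = ("ing", "ers", "er", "es", "e", "s")


def stop(tokens):
    """Stopping: drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Stemming: strip the first matching suffix if at least 4 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


family = ["computer", "computers", "computing", "compute"]
print({stem(word) for word in family})  # {'comput'}

tokens = "the bank in amherst and the computers".split()
print([stem(t) for t in stop(tokens)])  # ['bank', 'amherst', 'comput']
```

All four words from the slide collapse to the same stem, so a query for any one of them can match documents containing the others.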
Contd…
