Unit - I - IR


Course Code: 18MCA431      CIE Marks: 100
Credits (L:T:P): 3:0:0     SEE Marks: 100
Total Hours: 39L           SEE Duration: 3 Hrs (T)


Course Outcomes: After completing the course, the students will be able to
CO1: Understand the concept of Information Retrieval, its models and Search Engine
CO2: Recognize and use various indexing and querying techniques to store and retrieve documents
CO3: Apply IR principles to extract relevant information and build retrieval models
CO4: Analyse and evaluate the IR techniques, retrieval models and search engines
Continuous Internal Evaluation (CIE): Theory (100 Marks)

CIE is executed by way of Quizzes (Q), Tests (T) and Assignments (A).
• A minimum of two quizzes are conducted and each quiz is evaluated for 10 marks, adding up to 20 marks.
• Three tests are conducted for 50 marks each and the sum of the marks scored from the three tests is reduced to 50 marks.
• A minimum of two assignments are given with a combination of two components among
1) solving innovative problems using different platforms
2) seminar/new developments in the related course
• Total CIE is 20 (Q) + 50 (T) + 30 (A) = 100 Marks
Semester End Evaluation (SEE): Theory (100 Marks)

The question paper will have FIVE questions with internal choice from each unit. Each question will carry 20 marks. Students will have to answer one full question from each unit.
Unit I – 07 hours: Introduction to information retrieval, architecture of a search engine
Search Engines and Information Retrieval - What Is Information Retrieval?, The Big Issues, Search Engines, Search Engineers
Architecture of a Search Engine - What is architecture?, Basic Building Blocks, Breaking It Down

Unit II – 08 hours: Crawls and Feeds, Processing Text
Crawls and Feeds - Deciding what to search, Crawling the Web, Crawling Documents and Email, Document Feeds, The Conversion Problem, Storing the Documents, Detecting Duplicates
Processing Text - From Words to Terms, Text Statistics, Document Parsing, Document Structure and Markup, Link Analysis, Information Extraction, Internationalization

Unit III – 08 hours: Ranking with Indexes
Overview, Abstract Model of Ranking, Inverted indexes, Compression, Auxiliary Structures, Index Construction, Query Processing

Unit IV – 08 hours: Queries and Interfaces
Information Needs and Queries, Query Transformation and Refinement, Showing the Results, Cross-Language Search

Unit V – 08 hours: Retrieval Models and Evaluating Search Engines
Overview of Retrieval Models, Probabilistic Models, Ranking Based on Language Models
Why Evaluate?, The Evaluation Corpus, Effectiveness Metrics, Efficiency Metrics

Unit I : Introduction to Information Retrieval

• Search on the Web is a daily activity for many people throughout the world
• Search and communication are the most popular uses of the computer
• Applications involving search are everywhere
• The field of computer science that is most involved with R&D for search is information retrieval (IR)
Unit I : Information Vs Data
Information Retrieval

“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
• General definition that can be applied to many types of information and search applications
• Primary focus of IR since the 50s has been on text and documents
What is a Document?
• Examples:
– web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM sessions, etc.
• Common properties
– Significant text content
– Some structure (e.g., title, author, date for papers; subject, sender, destination for email)
Documents vs. Database Records
• Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
– e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
• Easy to compare fields with well-defined semantics to queries in order to find matches
• Text is more difficult
Documents vs. Records
• Example bank database query
– Find records with balance > $50,000 in branches located in Amherst, MA.
– Matches easily found by comparison with field values of records
• Example search engine query
– bank scandals in western mass
– This text must be compared to the text of entire news stories
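
The bank database query above can be sketched as a plain comparison against field values. This is only an illustration: the record layout, the field names, and the sample accounts are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Account:
    # Hypothetical record layout; the field names are assumptions.
    name: str
    balance: float
    branch_city: str
    branch_state: str

# Made-up sample records.
accounts = [
    Account("A. Rao", 72000.0, "Amherst", "MA"),
    Account("B. Lee", 12000.0, "Amherst", "MA"),
    Account("C. Diaz", 90000.0, "Boston", "MA"),
]

def large_amherst_accounts(records):
    """Balance > $50,000 in branches located in Amherst, MA."""
    return [
        r for r in records
        if r.balance > 50_000
        and r.branch_city == "Amherst"
        and r.branch_state == "MA"
    ]

matches = large_amherst_accounts(accounts)
```

Every condition is a direct test on a well-defined field, which is why matching is easy here. The text query has no such fields to test against.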
Comparing Text
• Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval
• Exact matching of words is not enough
– Many different ways to write the same thing in a “natural language” like English
– e.g., does a news story containing the text “bank director in Amherst steals funds” match the query?
– Some stories will be better matches than others
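
A small sketch of why exact matching falls short, using the query and the news-story text from the slides. The word-overlap score is a deliberately naive stand-in chosen for this example, not a retrieval model from the course.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_match(query, document):
    """True only if the whole query string occurs verbatim in the document."""
    return query.lower() in document.lower()

def overlap_score(query, document):
    """Fraction of distinct query words that also appear in the document."""
    q = set(tokenize(query))
    d = set(tokenize(document))
    return len(q & d) / len(q) if q else 0.0

query = "bank scandals in western mass"
story = "bank director in Amherst steals funds"

is_exact = exact_match(query, story)   # False: the phrase never appears
score = overlap_score(query, story)    # only "bank" and "in" are shared
```

Even the overlap score misses the point a reader sees at once: Amherst is in western Massachusetts and stealing funds is a scandal. Neither connection is visible at the level of matching words.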
Dimensions of IR
• IR is more than just text, and more than just web search
– although these are central
• People doing IR work with different media, different types of search applications, and different tasks
Other Media
• New applications increasingly involve new media
– e.g., video, photos, music, speech
• Like text, content is difficult to describe and compare
– text may be used to represent them (e.g. tags)
• IR approaches to search and evaluation are appropriate
IR Tasks
• Ad-hoc search
– Find relevant documents for an arbitrary text query
• Filtering
– Identify relevant user profiles for a new document
• Classification
– Identify relevant labels for documents
• Question answering
– Give a specific answer to a question
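
Filtering reverses the direction of ad-hoc search: the profiles are stored and the document is what arrives. A minimal sketch, assuming each profile is just a set of terms and that sharing two terms counts as relevant (both are assumptions made for the example):

```python
def matching_profiles(document, profiles, min_overlap=2):
    """Return the names of profiles that share at least min_overlap terms
    with the incoming document."""
    doc_terms = set(document.lower().split())
    return sorted(
        name for name, terms in profiles.items()
        if len(doc_terms & terms) >= min_overlap
    )

# Hypothetical stored user profiles.
profiles = {
    "finance_desk": {"bank", "funds", "audit"},
    "sports_fan": {"football", "league", "score"},
}

new_document = "Bank director in Amherst steals funds"
interested = matching_profiles(new_document, profiles)
```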
Big Issues in IR
IR and Search Engines
• A search engine is the practical application of information retrieval techniques to large scale text collections
• Web search engines are the best-known examples, but there are many others
– Open source search engines are important for research and development
  • e.g., Lucene, Lemur/Indri, Galago
Issues in Search Engines
• Performance
– Measuring and improving the efficiency of search
  • e.g., reducing response time, increasing query throughput, increasing indexing speed
– Indexes are data structures designed to improve search efficiency
  • designing and implementing them are major issues for search engines
• Dynamic data
– The “collection” for most real applications is constantly changing in terms of updates, additions, deletions
  • e.g., web pages
– Acquiring or “crawling” the documents is a major task
  • Typical measures are coverage and freshness
– Updating the indexes while processing queries is also a design issue
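
Response time and query throughput, the efficiency measures named above, can be sketched with a simple timing loop. The `toy_search` function and the query list are placeholders invented for this example. A real measurement would call an actual search engine and use many more queries.

```python
import time

def measure(search_fn, queries):
    """Run each query once.
    Returns (mean response time in seconds, throughput in queries/second)."""
    if not queries:
        raise ValueError("need at least one query")
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    mean_latency = sum(latencies) / len(latencies)
    throughput = len(queries) / elapsed if elapsed > 0 else float("inf")
    return mean_latency, throughput

def toy_search(query):
    """Stand-in for a real search call (an assumption for this sketch)."""
    return [d for d in ("bank news", "sports news") if query in d]

mean_latency, throughput = measure(toy_search, ["bank", "news", "sports"])
```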
Contd…
• Scalability
– Making everything work with millions of users every day, and many terabytes of documents
– Distributed processing is essential
• Adaptability
– Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applications
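
One common way to distribute the work is to split the documents across several index shards, query every shard, and merge the partial results. The sketch below imitates that on a single machine. The modulo sharding rule, the term-count scoring, and the sample documents are all assumptions for illustration. In a real system the shards would sit on separate machines and be queried in parallel.

```python
import heapq

def build_shards(docs, num_shards):
    """Assign each document to a shard by its integer id (assumed scheme)."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % num_shards][doc_id] = text
    return shards

def search_shard(shard, query_terms):
    """Score each document in one shard by how many query terms it contains."""
    hits = []
    for doc_id, text in shard.items():
        score = len(set(text.lower().split()) & query_terms)
        if score:
            hits.append((score, doc_id))
    return hits

def search_all(shards, query, k=3):
    """Query every shard, then merge the partial result lists into a top-k."""
    terms = set(query.lower().split())
    partial = [hit for shard in shards for hit in search_shard(shard, terms)]
    return [doc_id for _score, doc_id in heapq.nlargest(k, partial)]

docs = {
    0: "bank scandal in amherst",
    1: "football scores",
    2: "bank opens new branch",
    3: "amherst bank director steals funds",
}
shards = build_shards(docs, num_shards=2)
top = search_all(shards, "amherst bank")
```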
Spam
• For Web search, spam in all its forms is one of the major issues
• Affects the efficiency of search engines and, more seriously, the effectiveness of the results
• Many types of spam
– e.g. spamdexing or term spam, link spam, “optimization”
• New subfield called adversarial IR, since spammers are “adversaries” with different goals
ANY QUESTIONS???
Search Engine Architecture
• A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
– describes a system at a particular level of abstraction
• Architecture of a search engine is determined by 2 requirements
– effectiveness (quality of results) and efficiency (response time and throughput)
Indexing Process
• Text acquisition
– identifies and stores documents for indexing
• Text transformation
– transforms documents into index terms or features
• Index creation
– takes index terms and creates data structures (indexes) to support fast searching
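
The three steps above can be sketched end to end with a tiny inverted index. The two documents are made up, and the lowercase-and-split transformation is the simplest possible choice. Treat it as an illustration of the flow, not a production design.

```python
import re
from collections import defaultdict

def acquire():
    """Text acquisition: here just a hard-coded, made-up collection."""
    return {
        1: "Bank director in Amherst steals funds",
        2: "Amherst opens a new river bank trail",
    }

def transform(text):
    """Text transformation: lowercase and split into index terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def create_index(docs):
    """Index creation: map each term to the sorted ids of documents
    that contain it (an inverted index)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in transform(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = create_index(acquire())
```

Looking up `index["bank"]` is a single dictionary access, whereas scanning every document for the word would grow with the size of the collection.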
Query Process
• User interaction
– supports creation and refinement of query, display of results
• Ranking
– uses query and indexes to generate ranked list of documents
• Evaluation
– monitors and measures effectiveness and efficiency (primarily offline)
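
A minimal sketch of the ranking step: take a query, look its terms up in an index, and return documents ordered by score. Scoring by summed term frequency is an assumption made to keep the example short, and the documents are invented.

```python
from collections import Counter, defaultdict

docs = {
    1: "bank director in amherst steals bank funds",
    2: "amherst opens new bank branch",
    3: "football scores from the weekend",
}

# Index: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def rank(query, k=10):
    """Score each document by the summed frequency of the query terms."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

results = rank("amherst bank")
```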
Crawler
• Identifies and acquires documents for search engine
• Many types – web, enterprise, desktop
• Web crawlers follow links to find documents
– Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
– Single site crawlers for site search
– Topical or focused crawlers for vertical search
• Document crawlers for enterprise and desktop search
– Follow links and scan directories
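
The link-following idea can be sketched as a breadth-first traversal. To keep the example runnable without network access, a small in-memory link graph stands in for the Web. The page names and the `fetch_links` callback are hypothetical.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit the seeds, then follow links to unseen pages."""
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:   # never queue the same page twice
                seen.add(link)
                frontier.append(link)
    return order

visited = crawl(["a.example/"], lambda url: LINKS.get(url, []))
coverage = len(visited) / len(LINKS)   # share of known pages that were reached
```

Freshness is not modelled here. A real crawler would also revisit pages to pick up changes.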
Text Acquisition
• Feeds
– Real-time streams of documents
  • e.g., web feeds for news, blogs, video, radio, tv
– RSS is common standard
  • RSS “reader” can provide new XML documents to search engine
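
A rough sketch of what an RSS “reader” does: parse the feed's XML and hand each item on for indexing. The feed content below is made up and embedded as a string. A real reader would download it from a feed URL and poll for new items.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed used only for illustration.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Bank director steals funds</title>
      <link>http://news.example/1</link>
    </item>
    <item>
      <title>New branch opens</title>
      <link>http://news.example/2</link>
    </item>
  </channel>
</rss>"""

def read_items(feed_xml):
    """Return one small record per <item> element in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

items = read_items(FEED)
```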
Contd…
• Conversion
– Convert variety of documents into a consistent text plus metadata format
  • e.g. HTML, XML, Word, PDF, etc. → XML
– Convert text encoding for different languages
  • Using a Unicode encoding such as UTF-8
• Document data store
– Stores text, metadata, and other related content for documents
  • Metadata is information about document such as type and creation date
  • Other content includes links, anchor text
– Provides fast access to document contents for search engine components
  • e.g. result list generation
– Could use relational database system
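
Both conversion steps can be sketched with the standard library: decode the bytes from their source encoding, pull out text plus a piece of metadata, and re-encode as UTF-8. The slide describes converting to XML. This sketch returns a plain dictionary instead to stay short, and it treats the page title as the only metadata, both of which are simplifying assumptions.

```python
from html.parser import HTMLParser

class TextAndTitle(HTMLParser):
    """Collect the <title> as metadata and all other text as body content."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())

def convert(raw_bytes, source_encoding):
    """Decode from the source encoding, extract text and title, emit UTF-8."""
    parser = TextAndTitle()
    parser.feed(raw_bytes.decode(source_encoding))
    parser.close()
    text = " ".join(parser.chunks)
    return {"title": parser.title, "text": text, "utf8": text.encode("utf-8")}

# A made-up page stored in Latin-1 rather than UTF-8.
raw = ("<html><head><title>Market news</title></head>"
       "<body><p>Caf\u00e9 prices rose.</p></body></html>").encode("latin-1")
doc = convert(raw, "latin-1")
```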
Text Transformation
Contd…
• Stopping
– Remove common words
  • e.g., “and”, “or”, “the”, “in”
– Some impact on efficiency and effectiveness
– Can be a problem for some queries
• Stemming
– Group words derived from a common stem
  • e.g., “computer”, “computers”, “computing”, “compute”
– Usually effective, but not for all queries
– Benefits vary for different languages
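
A toy sketch of both steps. The stopword list holds only the four examples from the slide, and `crude_stem` is a handful of suffix rules written for this example. It is not the Porter stemmer or any other published algorithm.

```python
STOPWORDS = {"and", "or", "the", "in"}  # just the examples from the slide

def remove_stopwords(tokens):
    """Stopping: drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(word):
    """Stemming: strip one of a few suffixes, keeping at least 4 letters.
    A toy rule set for illustration only."""
    for suffix in ("ers", "ing", "er", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

tokens = "the bank and the director in amherst".split()
kept = remove_stopwords(tokens)

family = ["computer", "computers", "computing", "compute"]
stems = {crude_stem(w) for w in family}   # all four collapse to one stem
```

The risk for queries is easy to see: with a longer stopword list, a query made up entirely of common words could lose every term.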