
Special Topics in Computer Science

Advanced Topics in Information Retrieval

Chapter 2: Modeling
Alexander Gelbukh www.Gelbukh.com

Previous chapter

User Information Need
o Vague
o Semantic, not formal

Document Relevance
o Order, not retrieve

Huge amount of information
o Efficiency concerns
o Tradeoffs

Art more than science

Modeling

Still science: computation is formal
No good methods to work with (vague) semantics
Thus, simplify to get a (formal) model
Develop (precise) math over this (simple) model

Why math if the model is not precise (simplified)?
o math: model = step 1 = step 2 = ... = result (every step is exact)
o but phenomenon ≈ model, so the result transfers to the phenomenon only approximately (?!)
3

Modeling

Substitute a complex real phenomenon with a simple model, which you can measure and manipulate formally
Keep only important properties (for this application)
Do this with text

Modeling in IR: idea

Tag documents with fields
o As in a (relational) DB: customer = {name, age, address}
o Unlike DB, very many fields: individual words!
o E.g., bag of words: {word1, word2, ...}: {3, 5, 0, 0, 2, ...}

Define a similarity measure between query and such a record
o (Unlike DB) Rank (order), not retrieve (yes/no)
o Justify your model (optional, but nice)

Develop math and algorithms for fast access
o as relational algebra in DB
5
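The bag-of-words idea above ({word1, word2, ...} mapped to counts like {3, 5, 0, 0, 2, ...}) can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function and variable names are hypothetical.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # Map a document to a vector of term counts over a fixed vocabulary:
    # each position holds how many times that vocabulary word appears.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["cat", "mouse", "dog"]
print(bag_of_words("the cat chased the mouse the cat won", vocab))  # [2, 1, 0]
```

Note that, as on the slide, each document becomes a record with very many fields (one per word), unlike the few fixed fields of a relational DB row.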

Taxonomy of IR systems

Aspects of an IR system

IR model
o Boolean, Vector, Probabilistic

Logical view of documents
o Full text, bag of words, ...

User task
o retrieval, browsing

Independent, though some are more compatible

Appropriate models

Characterization of an IR model

D = {dj}, collection of formal representations of docs
o e.g., keyword vectors

Q = {qi}, possible formal representations of user information need (queries)

F, framework for modeling these two: reason for the next

R(qi, dj): Q × D → R, ranking function
o defines ordering

Specific IR models

IR models

Classical
o Boolean
o Vector
o Probabilistic (clear ideas, but some disadvantages)

Refined
o Each one with refinements
o Solve many of the problems of the basic models
o Give good examples of possible developments in the area
o Not investigated well: we can work on this
11

Basic notions

Document: set of index terms
o Mainly nouns
o Maybe all words, then full text logical view

Term weights
o some terms are better than others
o terms less frequent in this doc and more frequent in other docs are less useful

Document → index term vector {w1j, w2j, ..., wtj}
o weights of terms in the doc
o t is the number of terms in all docs
o weights of different terms are independent (simplification)
12

Boolean model

Weights ∈ {0, 1}
o Doc: set of words

Query: Boolean expression
o R(qi, dj) ∈ {0, 1}

Good:
o clear semantics, neat formalism, simple

Bad:
o no ranking (→ data retrieval), retrieves too many or too few
o difficult to translate User Information Need into query
o no term weighting
13
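The 0/1 behavior above can be sketched as follows. This is a minimal illustration (names and the query-as-predicate encoding are hypothetical), showing why the Boolean model retrieves rather than ranks.

```python
def boolean_retrieve(docs, query):
    # docs: {doc_id: set of terms}; query: a predicate over a term set.
    # R(q, d) is 0/1: a document either matches or it does not -- no ordering.
    return [doc_id for doc_id, terms in docs.items() if query(terms)]

docs = {
    "d1": {"mouse", "cat"},
    "d2": {"mouse", "keyboard"},
    "d3": {"cat"},
}

# Query: mouse AND NOT cat
hits = boolean_retrieve(docs, lambda t: "mouse" in t and "cat" not in t)
print(hits)  # ['d2']
```

All matching documents come back as equals: there is no way to say that one satisfies the query better than another.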

Vector model

Weights (non-binary)
Ranking, much better results (for User Info Need)
R(qi, dj) = correlation between query vector and doc vector
E.g., cosine measure (there is a typo in the book):

  sim(dj, q) = (dj · q) / (|dj| × |q|)

14
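The cosine correlation can be computed directly from the two weight vectors. A minimal sketch (hypothetical names, plain Python lists for the vectors):

```python
import math

def cosine(d, q):
    # sim(d, q) = (d . q) / (|d| |q|): the angle-based correlation
    # between the document vector and the query vector.
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

doc = [0.5, 0.8, 0.0]
query = [0.5, 0.8, 0.0]
print(cosine(doc, query))    # identical directions: similarity ~1.0
print(cosine([1, 0], [0, 1]))  # orthogonal vectors: 0.0
```

Because the measure is normalized by the vector lengths, documents of different sizes are comparable, and the scores induce the ranking the slide asks for.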

Projection

Weights

How are the weights wij obtained? Many variants. One way: TF-IDF balance

TF: Term frequency
o How well is the term related to the doc?
o If it appears many times, it is important
o Proportional to the number of times it appears

IDF: Inverse document frequency
o How important is the term to distinguish documents?
o If it appears in many docs, it is not important
o Inversely proportional to the number of docs where it appears

Contradictory. How to balance?


16

TF-IDF ranking

TF: Term frequency

  tf_ij = freq_ij / max_l freq_lj

IDF: Inverse document frequency

  idf_i = log (N / n_i)

Balance: w_ij = tf_ij × idf_i
o Other formulas exist. Art.
17
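The balance above can be computed over a toy collection. A minimal sketch using the classic formulas tf = freq / max_freq_in_doc and idf = log(N / n_i); the function name and data are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    # docs: list of token lists. Returns one {term: weight} dict per doc.
    N = len(docs)
    # n_i: number of docs containing each term (document frequency).
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_f = max(freq.values())  # normalize TF by the doc's most frequent term
        weights.append({t: (f / max_f) * math.log(N / df[t]) for t, f in freq.items()})
    return weights

docs = [["cat", "cat", "mouse"], ["mouse", "keyboard"], ["cat", "dog"]]
w = tf_idf_weights(docs)
# "mouse" appears in 2 of 3 docs -> low idf; "keyboard" in only 1 -> higher idf,
# so in doc 2 "keyboard" outweighs "mouse" even though both occur once.
```

This makes the contradiction on the previous slide concrete: frequency inside the doc pushes a weight up, while spread across the collection pushes it down.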

Advantages of vector model

One of the best known strategies
Improves quality (term weighting)
Allows approximate matching (partial matching)
Gives ranking by similarity (cosine formula)
Simple, fast

But: does not consider term dependencies
o considering them in a bad way hurts quality
o no known good way

No logical expressions (e.g., negation: mouse & NOT cat)
18

Probabilistic model

Assumptions:
o set of relevant docs
o probabilities of docs to be relevant
o After Bayes calculation: probabilities of terms to be important for defining relevant docs

Initial idea: interact with the user
o Generate an initial set
o Ask the user to mark some of them as relevant or not
o Estimate the probabilities of keywords. Repeat

Can be done without the user
o Just re-calculate the probabilities assuming the user's acceptance is the same as the predicted ranking
19
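The Bayes calculation the slide mentions boils down, in the classic binary independence formulation, to a log-odds weight per term. A minimal sketch under that assumption (names and the initial-guess values are illustrative, not from the slides):

```python
import math

def bim_term_weight(p_rel, p_nonrel):
    # Binary independence model term weight: log odds ratio.
    # p_rel    = P(term present | relevant doc)
    # p_nonrel = P(term present | non-relevant doc)
    return math.log((p_rel * (1 - p_nonrel)) / (p_nonrel * (1 - p_rel)))

# Classic initial guess before any (user or simulated) feedback:
# p_rel = 0.5, p_nonrel = n_i / N (the term's document frequency).
N, n_i = 1000, 50
print(bim_term_weight(0.5, n_i / N))  # positive: rare terms are evidence of relevance
```

Iteration then replaces these guesses with estimates from the docs marked (or assumed) relevant, which is exactly the interact-or-simulate loop described above.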

(Dis)advantages of Probabilistic model

Advantage:
o Theoretical adequacy: ranks by probabilities

Disadvantages:
o Need to guess the initial ranking
o Binary weights, ignores frequencies
o Independence assumption (not clear if bad)
o Does not perform well (?)

20

Alternative Set Theoretic models

Fuzzy set model

Takes into account term relationships (thesaurus)
o Bible is related to Church

Fuzzy belonging of a term to a document
o Document containing Bible also contains a little bit of Church, but not entirely

Fuzzy set logic applied to such fuzzy belonging
o logical expressions with AND, OR, and NOT

Provides ranking, not just yes/no
Not investigated well
o Why not investigate it?
21

Alternative Set Theoretic models

Extended Boolean model

Combination of Boolean and Vector

In comparison with the Boolean model, adds distance from the query
o some documents satisfy the query better than others

In comparison with the Vector model, adds the distinction between AND and OR combinations

A parameter (the degree of the norm) adjusts the behavior between Boolean-like and Vector-like
o It can even differ within one query

Not investigated well. Why not investigate it?
22
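The norm-degree parameter above can be sketched with the p-norm formulas usually associated with this model. This is an illustrative sketch, not taken from the slides; function names are hypothetical.

```python
def p_norm_or(weights, p):
    # Extended Boolean OR: ((w1^p + ... + wn^p) / n)^(1/p).
    # p = 1 behaves like the vector model (an average);
    # large p approaches strict Boolean OR (dominated by the max weight).
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1 / p)

def p_norm_and(weights, p):
    # Extended Boolean AND: 1 - (((1-w1)^p + ... + (1-wn)^p) / n)^(1/p).
    n = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / n) ** (1 / p)

# With p = 1, AND and OR collapse to the same average (vector-like);
# with large p they separate toward Boolean min/max behavior.
print(p_norm_or([1.0, 0.0], 1))    # 0.5
print(p_norm_or([1.0, 0.0], 100))  # close to 1.0
print(p_norm_and([1.0, 0.0], 100)) # close to 0.0
```

Since p can be chosen per operator, one query really can mix Boolean-like and vector-like behavior, as the slide notes.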

Alternative Algebraic models

Generalized Vector Space model

Classical independence assumptions:
o All combinations of terms are possible, none are equivalent (= basis in the vector space)
o Pair-wise orthogonal: cos({ki}, {kj}) = 0

This model relaxes the pair-wise orthogonality: cos({ki}, {kj}) ≠ 0
Operates on combinations (co-occurrences) of index terms, not individual terms
More complex, more expensive, not clear if better
Not investigated well. Why not investigate it?
23

Alternative Algebraic models

Latent Semantic Indexing model

Index by larger units: concepts (sets of terms used together)
Retrieve a document that shares concepts with a relevant one (even if it does not contain the query terms)
Group index terms together (map into a lower-dimensional space), so some terms become equivalent
o Not exactly, but this is the idea
o Eliminates unimportant details
o Depends on a parameter (what details are unimportant?)

Not investigated well. Why not investigate it?
24
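The mapping into a lower-dimensional concept space is usually done with a truncated singular value decomposition of the term-document matrix. A minimal sketch with a toy matrix (the data and the choice k = 2 are illustrative, not from the slides):

```python
import numpy as np

# Term-document matrix: rows = terms, columns = docs (toy counts).
A = np.array([
    [1, 1, 0],   # "cat"
    [1, 0, 0],   # "feline"
    [0, 0, 1],   # "car"
], dtype=float)

# Truncated SVD: keep the k strongest "concepts" and drop the rest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # the parameter the slide mentions: how much detail is unimportant?
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A. The idea: terms that
# co-occur ("cat", "feline") end up represented by shared concept
# dimensions, so a query term can match a doc that never contains it.
print(A_k.shape)  # (3, 3)
```

Choosing k is the tuning problem the slide points at: too large keeps the noise, too small merges genuinely different terms.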

Alternative Algebraic models

Neural Network model

NNs are good at matching
Iteratively uses the found documents as auxiliary queries
o Spreading activation
o Terms → docs → terms → docs → ...

Like a built-in thesaurus
First round gives the same result as the Vector model
No evidence whether it is good
Not investigated well. Why not investigate it?
25

Models for browsing

Flat browsing: String
o Just like a flat list on paper
o No context cues provided

Structure guided: Tree
o Hierarchy
o Like a directory tree on a computer

Hypertext (Internet!): Directed graph
o No limitations of sequential writing
o Modeled by a directed graph: links from unit A to unit B
  units: docs, chapters, etc.
o A map (with the traversed path) can be helpful
26

Research issues

How do people judge relevance?
o ranking strategies

How to combine different sources of evidence?

What interfaces can help users understand and formulate their Information Need?
o user interfaces: an open issue

Meta-search engines: combine results from different Web search engines
o Their results almost do not intersect
o How to combine the rankings?
27

Conclusions

Modeling is needed for formal operations
Boolean model is the simplest
Vector model is the best combination of quality and simplicity
o TF-IDF term weighting
o This (or similar) weighting is used in all further models

Many interesting and not well-investigated variations
o possible future work

28
