Chapter 2: Modeling: Advanced Topics in Information Retrieval
Chapter 2: Modeling
Alexander Gelbukh www.Gelbukh.com
Previous chapter
User Information Need
o Vague
o Semantic, not formal
Document Relevance
o Order, not retrieve
Modeling
Still a science: computation is formal
No good methods to work with (vague) semantics
Thus, simplify to get a (formal) model
Develop (precise) math over this (simple) model
Why math if the model is not precise (simplified)?
    phenomenon ≈ model = step 1 = step 2 = ... = result
The math steps over the model are exact; only the first link (phenomenon ≈ model) is approximate, so the result applies to the real phenomenon only as well as the model does (hence the "?!").
Modeling
Substitute a complex real phenomenon with a simple model, which you can measure and manipulate formally
Keep only the important properties (for this application)
Do this with text
Taxonomy of IR systems
Aspects of an IR system
IR model
o Boolean, Vector, Probabilistic
User task
o retrieval, browsing
Appropriate models
Characterization of an IR model
D = {dj}, collection of formal representations of docs
o e.g., keyword vectors
Q = {qi}, possible formal representations of user information needs (queries)
F, framework for modeling documents and queries: the rationale behind the next component
R(qi, dj): Q × D → R, ranking function
o defines ordering
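The quadruple [D, Q, F, R] can be sketched in code. A minimal illustration (names like `rank` and `overlap` are hypothetical, not from the chapter): documents and queries get formal representations, and a ranking function R(q, d) induces the ordering.

```python
# Sketch of the [D, Q, F, R] characterization of an IR model.
# D: formal doc representations, Q: formal query representations,
# R(q, d): a real-valued ranking function that orders docs for a query.

def rank(query, docs, score):
    """Order documents by the ranking function: highest score first."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)

# Toy framework F: docs and queries are keyword sets, R = term overlap.
docs = [{"cat", "dog"}, {"dog", "fish"}, {"bird"}]
query = {"dog", "cat"}
overlap = lambda q, d: len(q & d)
print(rank(query, docs, overlap))  # best match first
```

Any concrete model (Boolean, Vector, Probabilistic) is obtained by choosing the representations and the scoring function.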
Specific IR models
IR models
Classical
o Boolean
o Vector
o Probabilistic
Clear ideas, but some disadvantages
Refined
o Each one with refinements
o Solve many of the problems of the basic models
o Give good examples of possible developments in the area
o Not investigated well
We can work on this
Basic notions
Document: set of index terms
o Mainly nouns
o Maybe all words; then a full-text logical view
Term weights
o Some terms are better than others
o Terms less frequent in this doc and more frequent in other docs are less useful
Boolean model
Weights {0, 1}
o Doc: set of words
Good:
o clear semantics, neat formalism, simple
Bad:
o No ranking (→ data retrieval); retrieves too many or too few docs
o Difficult to translate a User Information Need into a query
No term weighting
Vector model
Weights (non-binary)
Ranking, much better results (for the User Info Need)
R(qi, dj) = correlation between the query vector and the doc vector
E.g., the cosine measure (there is a typo in the book's formula)
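The cosine measure can be written out directly; below is the standard formulation, sim(q, d) = (q · d) / (|q| |d|), as a small sketch (the book's own, typo-affected formula is not reproduced here).

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and doc weight vectors:
    sim(q, d) = (q . d) / (|q| * |d|), in [0, 1] for non-negative weights."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    nq = math.sqrt(sum(w * w for w in q))
    nd = math.sqrt(sum(w * w for w in d))
    return dot / (nq * nd) if nq and nd else 0.0

print(round(cosine([1, 1, 0], [1, 1, 0]), 3))  # 1.0: identical directions
print(round(cosine([1, 0, 0], [0, 1, 0]), 3))  # 0.0: orthogonal vectors
```

Because the measure is normalized by vector length, a long document does not outrank a short one merely by repeating terms.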
Projection
Weights
How are the weights wij obtained? Many variants. One way: the TF-IDF balance
TF: Term frequency
o How well is the term related to the doc?
o If it appears many times, it is important
o Proportional to the number of times it appears
TF-IDF ranking
TF: Term frequency
IDF: Inverse document frequency
o Terms that appear in many docs discriminate poorly; IDF weights them down
Balance: TF × IDF
o Other formulas exist; choosing one is an art
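One common TF-IDF variant can be sketched as follows (an illustration of the balance, not the book's specific formula; here tf is the raw term count and idf = log(N / n_t), with N docs of which n_t contain the term).

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return weights[i][term] = tf * idf for a list of token lists."""
    N = len(docs)
    # Document frequency: in how many docs each term appears.
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)  # raw term frequency within this doc
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["cat", "cat", "dog"], ["dog", "fish"], ["cat", "bird"]]
w = tf_idf_weights(docs)
# "dog" appears in 2 of 3 docs, so it weighs less than the rarer "fish".
```

The resulting weight vectors are exactly what the cosine measure above compares.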
Probabilistic model
Assumptions:
o There exists a set of relevant docs
o Probabilities of docs to be relevant
o After a Bayes calculation: probabilities of terms to be important for defining the relevant docs
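The per-term weight that comes out of the Bayes calculation can be sketched in one standard form (the Robertson/Sparck Jones weight; an assumption here, since the chapter's exact formula is not shown): with p = P(term occurs | doc relevant) and q = P(term occurs | doc non-relevant), the doc score is a sum of these weights over matching query terms.

```python
import math

def term_weight(p, q):
    """Log-odds weight of a term: positive when the term is likelier in
    relevant docs than in non-relevant ones, zero when uninformative."""
    return math.log((p * (1 - q)) / (q * (1 - p)))

print(term_weight(0.5, 0.5))  # 0.0: term says nothing about relevance
print(term_weight(0.8, 0.2) > 0)  # discriminative term gets positive weight
```

Since the relevant set is unknown at first, these probabilities start from initial guesses and are re-estimated iteratively from the retrieved docs.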
Extended Boolean model
In comparison with the Vector model, adds the distinction between AND and OR combinations
There is a parameter (the degree p of the norm) allowing to adjust the behavior between Boolean-like and Vector-like
It can even differ within one query
Not investigated well. Why not investigate it?
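The p-norm idea behind this adjustable behavior can be sketched for two terms with weights in [0, 1] (an illustrative formulation, not the chapter's exact notation): at p = 1 both connectives collapse to an average (Vector-like), while as p grows AND and OR diverge toward strict Boolean behavior.

```python
def or_pnorm(w1, w2, p):
    """p-norm OR of two term weights in [0, 1]."""
    return ((w1 ** p + w2 ** p) / 2) ** (1 / p)

def and_pnorm(w1, w2, p):
    """p-norm AND: dual of OR, computed on the complements."""
    return 1 - (((1 - w1) ** p + (1 - w2) ** p) / 2) ** (1 / p)

# One term present (1), one absent (0): the connectives agree at p = 1
# and pull apart as p grows (OR stays high, AND drops low).
print(or_pnorm(1, 0, 1), and_pnorm(1, 0, 1))    # both 0.5
print(or_pnorm(1, 0, 10), and_pnorm(1, 0, 10))  # OR near 1, AND near 0
```

Because p is just a parameter of the similarity, it can in principle be set per operator, which is what allows different behavior within a single query.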
Generalized Vector model
This model relaxes the pair-wise orthogonality assumption: cos({ki}, {kj}) ≠ 0
Operates on combinations (co-occurrences) of index terms, not individual terms
More complex, more expensive, not clear if better
Not investigated well. Why not investigate it?
Research issues
How do people judge relevance?
o ranking strategies
How to combine different sources of evidence?
What interfaces can help users understand and formulate their Information Need?
o user interfaces: an open issue
Conclusions
Modeling is needed for formal operations
The Boolean model is the simplest
The Vector model is the best combination of quality and simplicity
o TF-IDF term weighting o This (or similar) weighting is used in all further models