
Special Topics in Computer Science

Advanced Topics in Information Retrieval

Chapter 2: Modeling
Alexander Gelbukh www.Gelbukh.com

Previous chapter

User Information Need
o Vague
o Semantic, not formal

Document Relevance
o Order, not retrieve

Huge amount of information
o Efficiency concerns
o Tradeoffs

Art more than science

Modeling

Still science: computation is formal
No good methods to work with (vague) semantics
Thus, simplify to get a (formal) model
Develop (precise) math over this (simple) model

Why math if the model is not precise (simplified)?
o math: model = step 1 = step 2 = ... = result (every step is exact)
o but phenomenon ≈ model, so the result transfers to the phenomenon only approximately (?!)
3

Modeling

Substitute a complex real phenomenon with a simple model, which you can measure and manipulate formally
Keep only important properties (for this application)
Do this with text

Modeling in IR: idea

Tag documents with fields
o As in a (relational) DB: customer = {name, age, address}
o Unlike DB, very many fields: individual words!
o E.g., bag of words: {word1, word2, ...}: {3, 5, 0, 0, 2, ...}

Define a similarity measure between query and such a record
o (Unlike DB) Rank (order), not retrieve (yes/no)
o Justify your model (optional, but nice)

Develop math and algorithms for fast access
o as relational algebra in DB
5
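The bag-of-words idea above ({word1, word2, ...} mapped to counts like {3, 5, 0, 0, 2, ...}) can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function and variable names are hypothetical.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # Map a document to a vector of term counts over a fixed vocabulary:
    # each position holds how many times that vocabulary word appears.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["cat", "mouse", "dog"]
print(bag_of_words("the cat chased the mouse the cat won", vocab))  # [2, 1, 0]
```

Note that, as on the slide, each document becomes a record with very many fields (one per word), unlike the few fixed fields of a relational DB row.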

Taxonomy of IR systems

Aspects of an IR system

IR model
o Boolean, Vector, Probabilistic

Logical view of documents
o Full text, bag of words, ...

User task
o retrieval, browsing

Independent, though some are more compatible

Appropriate models

Characterization of an IR model

D = {dj}, collection of formal representations of docs
o e.g., keyword vectors

Q = {qi}, possible formal representations of user information need (queries)

F, framework for modeling these two: reason for the next

R(qi, dj): Q × D → R, ranking function
o defines ordering

Specific IR models

IR models

Classical
o Boolean
o Vector
o Probabilistic (clear ideas, but some disadvantages)

Refined
o Each one with refinements
o Solve many of the problems of the basic models
o Give good examples of possible developments in the area
o Not investigated well: we can work on this
11

Basic notions

Document: set of index terms
o Mainly nouns
o Maybe all words, then full text logical view

Term weights
o some terms are better than others
o terms less frequent in this doc and more frequent in other docs are less useful

Document → index term vector {w1j, w2j, ..., wtj}
o weights of terms in the doc
o t is the number of terms in all docs
o weights of different terms are independent (simplification)
12

Boolean model

Weights ∈ {0, 1}
o Doc: set of words

Query: Boolean expression
o R(qi, dj) ∈ {0, 1}

Good:
o clear semantics, neat formalism, simple

Bad:
o no ranking (→ data retrieval), retrieves too many or too few
o difficult to translate User Information Need into query
o no term weighting
13
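The 0/1 behavior above can be sketched as follows. This is a minimal illustration (names and the query-as-predicate encoding are hypothetical), showing why the Boolean model retrieves rather than ranks.

```python
def boolean_retrieve(docs, query):
    # docs: {doc_id: set of terms}; query: a predicate over a term set.
    # R(q, d) is 0/1: a document either matches or it does not -- no ordering.
    return [doc_id for doc_id, terms in docs.items() if query(terms)]

docs = {
    "d1": {"mouse", "cat"},
    "d2": {"mouse", "keyboard"},
    "d3": {"cat"},
}

# Query: mouse AND NOT cat
hits = boolean_retrieve(docs, lambda t: "mouse" in t and "cat" not in t)
print(hits)  # ['d2']
```

All matching documents come back as equals: there is no way to say that one satisfies the query better than another.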

Vector model

Weights (non-binary)
Ranking, much better results (for User Info Need)
R(qi, dj) = correlation between query vector and doc vector
E.g., cosine measure (there is a typo in the book):

  sim(dj, q) = (dj · q) / (|dj| × |q|)

14
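The cosine correlation can be computed directly from the two weight vectors. A minimal sketch (hypothetical names, plain Python lists for the vectors):

```python
import math

def cosine(d, q):
    # sim(d, q) = (d . q) / (|d| |q|): the angle-based correlation
    # between the document vector and the query vector.
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

doc = [0.5, 0.8, 0.0]
query = [0.5, 0.8, 0.0]
print(cosine(doc, query))    # identical directions: similarity ~1.0
print(cosine([1, 0], [0, 1]))  # orthogonal vectors: 0.0
```

Because the measure is normalized by the vector lengths, documents of different sizes are comparable, and the scores induce the ranking the slide asks for.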

Projection

Weights

How are the weights wij obtained? Many variants. One way: TF-IDF balance

TF: Term frequency
o How well is the term related to the doc?
o If it appears many times, it is important
o Proportional to the number of times it appears

IDF: Inverse document frequency
o How important is the term to distinguish documents?
o If it appears in many docs, it is not important
o Inversely proportional to the number of docs where it appears

Contradictory. How to balance?


16

TF-IDF ranking

TF: Term frequency

  tf_ij = freq_ij / max_l freq_lj

IDF: Inverse document frequency

  idf_i = log (N / n_i)

Balance: w_ij = tf_ij × idf_i
o Other formulas exist. Art.
17
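The balance above can be computed over a toy collection. A minimal sketch using the classic formulas tf = freq / max_freq_in_doc and idf = log(N / n_i); the function name and data are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    # docs: list of token lists. Returns one {term: weight} dict per doc.
    N = len(docs)
    # n_i: number of docs containing each term (document frequency).
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_f = max(freq.values())  # normalize TF by the doc's most frequent term
        weights.append({t: (f / max_f) * math.log(N / df[t]) for t, f in freq.items()})
    return weights

docs = [["cat", "cat", "mouse"], ["mouse", "keyboard"], ["cat", "dog"]]
w = tf_idf_weights(docs)
# "mouse" appears in 2 of 3 docs -> low idf; "keyboard" in only 1 -> higher idf,
# so in doc 2 "keyboard" outweighs "mouse" even though both occur once.
```

This makes the contradiction on the previous slide concrete: frequency inside the doc pushes a weight up, while spread across the collection pushes it down.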

Advantages of vector model

One of the best known strategies
Improves quality (term weighting)
Allows approximate matching (partial matching)
Gives ranking by similarity (cosine formula)
Simple, fast

But: does not consider term dependencies
o considering them in a bad way hurts quality
o no known good way

No logical expressions (e.g., negation: mouse & NOT cat)
18

Probabilistic model

Assumptions:
o set of relevant docs
o probabilities of docs to be relevant
o After Bayes calculation: probabilities of terms to be important for defining relevant docs

Initial idea: interact with the user
o Generate an initial set
o Ask the user to mark some of them as relevant or not
o Estimate the probabilities of keywords. Repeat

Can be done without the user
o Just re-calculate the probabilities assuming the user's acceptance is the same as the predicted ranking
19
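The Bayes calculation the slide mentions boils down, in the classic binary independence formulation, to a log-odds weight per term. A minimal sketch under that assumption (names and the initial-guess values are illustrative, not from the slides):

```python
import math

def bim_term_weight(p_rel, p_nonrel):
    # Binary independence model term weight: log odds ratio.
    # p_rel    = P(term present | relevant doc)
    # p_nonrel = P(term present | non-relevant doc)
    return math.log((p_rel * (1 - p_nonrel)) / (p_nonrel * (1 - p_rel)))

# Classic initial guess before any (user or simulated) feedback:
# p_rel = 0.5, p_nonrel = n_i / N (the term's document frequency).
N, n_i = 1000, 50
print(bim_term_weight(0.5, n_i / N))  # positive: rare terms are evidence of relevance
```

Iteration then replaces these guesses with estimates from the docs marked (or assumed) relevant, which is exactly the interact-or-simulate loop described above.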

(Dis)advantages of Probabilistic model

Advantage:
o Theoretical adequacy: ranks by probabilities

Disadvantages:
o Need to guess the initial ranking
o Binary weights, ignores frequencies
o Independence assumption (not clear if bad)
o Does not perform well (?)

20

Alternative Set Theoretic models

Fuzzy set model

Takes into account term relationships (thesaurus)
o Bible is related to Church

Fuzzy belonging of a term to a document
o Document containing Bible also contains a little bit of Church, but not entirely

Fuzzy set logic applied to such fuzzy belonging
o logical expressions with AND, OR, and NOT

Provides ranking, not just yes/no
Not investigated well
o Why not investigate it?
21

Alternative Set Theoretic models

Extended Boolean model

Combination of Boolean and Vector

In comparison with the Boolean model, adds distance from the query
o some documents satisfy the query better than others

In comparison with the Vector model, adds the distinction between AND and OR combinations

A parameter (the degree of the norm) adjusts the behavior between Boolean-like and Vector-like
o It can even differ within one query

Not investigated well. Why not investigate it?
22
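The norm-degree parameter above can be sketched with the p-norm formulas usually associated with this model. This is an illustrative sketch, not taken from the slides; function names are hypothetical.

```python
def p_norm_or(weights, p):
    # Extended Boolean OR: ((w1^p + ... + wn^p) / n)^(1/p).
    # p = 1 behaves like the vector model (an average);
    # large p approaches strict Boolean OR (dominated by the max weight).
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1 / p)

def p_norm_and(weights, p):
    # Extended Boolean AND: 1 - (((1-w1)^p + ... + (1-wn)^p) / n)^(1/p).
    n = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / n) ** (1 / p)

# With p = 1, AND and OR collapse to the same average (vector-like);
# with large p they separate toward Boolean min/max behavior.
print(p_norm_or([1.0, 0.0], 1))    # 0.5
print(p_norm_or([1.0, 0.0], 100))  # close to 1.0
print(p_norm_and([1.0, 0.0], 100)) # close to 0.0
```

Since p can be chosen per operator, one query really can mix Boolean-like and vector-like behavior, as the slide notes.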

Alternative Algebraic models

Generalized Vector Space model

Classical independence assumptions:
o All combinations of terms are possible, none are equivalent (= basis in the vector space)
o Pair-wise orthogonal: cos({ki}, {kj}) = 0

This model relaxes the pair-wise orthogonality: cos({ki}, {kj}) ≠ 0
Operates on combinations (co-occurrences) of index terms, not individual terms
More complex, more expensive, not clear if better
Not investigated well. Why not investigate it?
23

Alternative Algebraic models

Latent Semantic Indexing model

Index by larger units: concepts (sets of terms used together)
Retrieve a document that shares concepts with a relevant one (even if it does not contain the query terms)
Group index terms together (map into a lower-dimensional space), so some terms become equivalent
o Not exactly, but this is the idea
o Eliminates unimportant details
o Depends on a parameter (what details are unimportant?)

Not investigated well. Why not investigate it?
24
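The mapping into a lower-dimensional concept space is usually done with a truncated singular value decomposition of the term-document matrix. A minimal sketch with a toy matrix (the data and the choice k = 2 are illustrative, not from the slides):

```python
import numpy as np

# Term-document matrix: rows = terms, columns = docs (toy counts).
A = np.array([
    [1, 1, 0],   # "cat"
    [1, 0, 0],   # "feline"
    [0, 0, 1],   # "car"
], dtype=float)

# Truncated SVD: keep the k strongest "concepts" and drop the rest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # the parameter the slide mentions: how much detail is unimportant?
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A. The idea: terms that
# co-occur ("cat", "feline") end up represented by shared concept
# dimensions, so a query term can match a doc that never contains it.
print(A_k.shape)  # (3, 3)
```

Choosing k is the tuning problem the slide points at: too large keeps the noise, too small merges genuinely different terms.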

Alternative Algebraic models

Neural Network model

NNs are good at matching
Iteratively uses the found documents as auxiliary queries
o Spreading activation
o Terms → docs → terms → docs → ...

Like a built-in thesaurus
First round gives the same result as the Vector model
No evidence whether it is good
Not investigated well. Why not investigate it?
25

Models for browsing

Flat browsing: String
o Just like a flat list on paper
o No context cues provided

Structure guided: Tree
o Hierarchy
o Like a directory tree on a computer

Hypertext (Internet!): Directed graph
o No limitations of sequential writing
o Modeled by a directed graph: links from unit A to unit B
  units: docs, chapters, etc.
o A map (with the traversed path) can be helpful
26

Research issues

How do people judge relevance?
o ranking strategies

How to combine different sources of evidence?

What interfaces can help users understand and formulate their Information Need?
o user interfaces: an open issue

Meta-search engines: combine results from different Web search engines
o Their results almost do not intersect
o How to combine the rankings?
27

Conclusions

Modeling is needed for formal operations
Boolean model is the simplest
Vector model is the best combination of quality and simplicity
o TF-IDF term weighting
o This (or similar) weighting is used in all further models

Many interesting and not well-investigated variations
o possible future work

28
