Lecture 6 Score - Term Weight - Vector Space Model

This document discusses techniques for ranked retrieval and scoring documents, including term frequency, inverse document frequency (IDF), and the vector space model. It explains that ranked retrieval orders documents by relevance to a query rather than just returning matching documents. Term frequency (TF) captures the number of times a term appears in a document, while IDF accounts for how common or rare a term is across documents. TF-IDF weighting combines these by giving higher weight to uncommon terms that appear frequently in a document. Documents and queries can then be represented as vectors in a vector space, where similarity is measured to rank documents by relevance to the query.

Uploaded by

Prateek Sharma

Introduction to Information Retrieval

Topic: Scoring, Term Weighting and the Vector Space Model
▪ Ranked retrieval
▪ Scoring documents
▪ Term frequency
▪ Weighting schemes
▪ Vector space scoring

Ranked retrieval
▪ Thus far, our queries have all been Boolean.
▪ Documents either match or don't.
▪ Good for expert users with a precise understanding of their needs and of the collection.
▪ Also good for applications: applications can easily consume thousands of results.
▪ Not good for the majority of users.
▪ Most users are incapable of writing Boolean queries (or they can, but think it's too much work).
▪ Most users don't want to wade through thousands of results.
▪ This is particularly true of web search.

Problem with Boolean search: Feast or famine

▪ Boolean queries often result in either too few (= 0) or too many (1000s) results.
▪ Query 1: "standard user dlink 650" → 200,000 hits
▪ Query 2: "standard user dlink 650 no card found" → 0 hits
▪ It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few; OR gives too many.
▪ With a ranked list of documents, it does not matter how large the retrieved set is.

Ranked retrieval models


▪ Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query.

▪ Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.


Feast or famine: not a problem in ranked retrieval

▪ When a system produces a ranked result set, large result sets are not an issue.
▪ Indeed, the size of the result set is not an issue.
▪ We just show the top k (≈ 10) results.
▪ We don't overwhelm the user.

Recall (previous lecture): Binary term-document incidence matrix

Each document is represented by a binary vector ∈ {0,1}^|V|.

Term-document count matrices

▪ Consider the number of occurrences of a term t in a document d, denoted tf(t,d):

Bag of words model


▪ The vector representation doesn't consider the ordering of words in a document.

▪ "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.

▪ This is called the bag of words model.
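The order-independence above is easy to see in code. A minimal sketch (plain whitespace tokenization is an assumed simplification):

```python
from collections import Counter

def bag_of_words(text):
    """Count term occurrences, discarding word order."""
    return Counter(text.lower().split())

# The two sentences from the slide produce identical bags.
v1 = bag_of_words("John is quicker than Mary")
v2 = bag_of_words("Mary is quicker than John")
print(v1 == v2)  # True: word order is lost
```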



Term frequency tf
▪ A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score.

▪ In this scheme, each term in a document is assigned a weight depending on the number of occurrences of the term in the document.

▪ The score between a query term t and a document d is based on the weight of t in d.

▪ The simplest approach is to assign the weight to be equal to the number of occurrences of term t in document d.

Term frequency tf
▪ We want to use tf when computing query-document match scores. But how?

▪ Raw term frequency is not what we want:
▪ A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
▪ But not 10 times more relevant.

▪ Relevance does not increase proportionally with term frequency.

Log-frequency weighting
▪ The log frequency weight of term t in d is

  w(t,d) = 1 + log10(tf(t,d))  if tf(t,d) > 0,  and 0 otherwise

▪ 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

▪ Score for a document-query pair: sum over terms t in both q and d:

  score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf(t,d))

▪ The score is 0 if none of the query terms is present in the document.
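The weight and the score just defined can be sketched directly; the function names here are illustrative:

```python
import math

def log_tf_weight(tf):
    """w(t,d) = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def log_tf_score(query_terms, doc_tf):
    """Sum of log-tf weights over query terms present in the document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

# Reproduces the slide's mapping: 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```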

Document frequency (df)

▪ Frequent terms are less informative than rare terms.
▪ Consider a query term that is frequent in the collection (e.g., high, increase, line).
▪ A document containing such a term is more likely to be relevant than a document that doesn't, but the term is not a sure indicator of relevance.
▪ For frequent terms like high, increase, and line we still want positive weights, but lower weights than for rare terms.
▪ We will use document frequency (df) to capture this.

idf weight

▪ df_t is the document frequency of t: the number of documents that contain t.
▪ df_t is an inverse measure of the informativeness of t.
▪ df_t ≤ N
▪ We define the idf (inverse document frequency) of t by

  idf_t = log10(N / df_t)

▪ We use log10(N/df_t) instead of N/df_t to "dampen" the effect of idf.
▪ It will turn out that the base of the log is immaterial.


idf example, suppose N = 1 million

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

There is one idf value for each term t in a collection.
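The idf column follows directly from the formula idf_t = log10(N/df_t); a quick check:

```python
import math

N = 1_000_000  # collection size from the slide

def idf(df_t):
    """idf_t = log10(N / df_t)."""
    return math.log10(N / df_t)

for term, df_t in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                   ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:<10} df={df_t:>9,} idf={idf(df_t):g}")
```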



Effect of idf on ranking

▪ Does idf have an effect on ranking for one-term queries, like "iPhone"?
▪ idf has no effect on ranking for one-term queries.
▪ idf affects the ranking of documents for queries with at least two terms.
▪ For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Collection vs. Document frequency

▪ The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences. Example:

Word        Collection frequency    Document frequency
insurance   10440                   3997
try         10422                   8760

▪ Which word is a better search term (and should get a higher weight)?

tf-idf weighting

▪ The tf-idf weighting scheme assigns to term t a weight in document d given by

  tf-idf(t,d) = tf(t,d) × idf_t        (1)

▪ Highest when t occurs many times within a small number of documents.
▪ Lower when the term occurs fewer times in a document, or occurs in many documents.
▪ Lowest when the term occurs in virtually all documents.

tf-idf weighting

▪ One may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component given by eq. (1).

▪ For dictionary terms that do not occur in a document, this weight is zero.

▪ The score of a document d for a query q is the sum of the tf-idf weights of each term of q in d:

  Score(q,d) = Σ_{t ∈ q} tf-idf(t,d)        (2)
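Equations (1) and (2) in code; a minimal sketch with illustrative function names:

```python
import math

def tf_idf(tf, df, n_docs):
    """Eq. (1): tf-idf(t,d) = tf(t,d) * log10(N / df_t); zero for absent terms."""
    return tf * math.log10(n_docs / df) if tf > 0 else 0.0

def score(query_terms, doc_tf, df, n_docs):
    """Eq. (2): sum of the tf-idf weights of the query terms in the document."""
    return sum(tf_idf(doc_tf.get(t, 0), df[t], n_docs)
               for t in query_terms if t in df)

# Toy check: a term occurring 3 times in a document, in 10 of 1000 documents.
print(tf_idf(3, 10, 1000))  # 3 * log10(100) = 6.0
```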

Binary → count → weight matrix

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Example:

Table 1 gives the idf's of terms with various frequencies in the Reuters collection of 806,791 documents.

Consider the table of term frequencies for three documents, denoted Doc1, Doc2 and Doc3, in Table 2. Compute the tf-idf weights for the terms car, auto, insurance and best, for each document, using the idf values from Table 1.
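Since Tables 1 and 2 are not reproduced here, the sketch below assumes the idf and tf values from the standard textbook version of this Reuters exercise; treat the specific numbers as assumptions:

```python
# Assumed values from the textbook version of this exercise (Reuters-RCV1, N = 806,791).
idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}  # Table 1 (assumed)
tf = {                                                             # Table 2 (assumed)
    "Doc1": {"car": 27, "auto": 3,  "insurance": 0,  "best": 14},
    "Doc2": {"car": 4,  "auto": 33, "insurance": 33, "best": 0},
    "Doc3": {"car": 24, "auto": 0,  "insurance": 29, "best": 17},
}

# tf-idf weight = raw tf times idf (eq. 1), per term and per document.
weights = {doc: {t: counts[t] * idf[t] for t in counts} for doc, counts in tf.items()}
for doc, w in weights.items():
    print(doc, {t: round(v, 2) for t, v in w.items()})
```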

The vector space model for scoring

▪ The representation of a set of documents as vectors in a common vector space is known as the vector space model.

▪ It is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering.

The vector space model for scoring

▪ Let V(d) denote the vector derived from document d, with one component in the vector for each dictionary term.

▪ The set of documents in a collection may then be viewed as a set of vectors in a vector space, in which there is one axis for each term.

The vector space model for scoring

▪ How do we quantify the similarity between two documents in this vector space?

▪ A first attempt might consider the magnitude of the vector difference between two document vectors.

▪ This measure suffers from a drawback: two documents with very similar content can have a significant vector difference simply because one is much longer than the other.

The vector space model for scoring

▪ To compensate for the effect of document length, the standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations V(d1) and V(d2):

  sim(d1,d2) = (V(d1) · V(d2)) / (|V(d1)| |V(d2)|)        (3)

▪ The numerator is the dot product of the vectors V(d1) and V(d2).
▪ The denominator is the product of their Euclidean lengths.

The vector space model for scoring

▪ The dot product x · y of two vectors is defined as

  x · y = Σ_{i=1}^{M} x_i y_i

▪ Let V(d) denote the document vector for d, with M components V_1(d), …, V_M(d). The Euclidean length of d is defined to be

  |V(d)| = √( Σ_{i=1}^{M} V_i(d)² )

▪ The effect of the denominator of Equation (3) is thus to length-normalize the vectors V(d1) and V(d2) to unit vectors v(d1) = V(d1)/|V(d1)| and v(d2) = V(d2)/|V(d2)|.
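Equation (3) in code; the toy check illustrates why length normalization matters:

```python
import math

def cosine_similarity(x, y):
    """Eq. (3): dot product divided by the product of the Euclidean lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    len_x = math.sqrt(sum(a * a for a in x))
    len_y = math.sqrt(sum(b * b for b in y))
    return dot / (len_x * len_y)

# A document and a twice-as-long copy with the same proportions:
# the vector difference is large, but the cosine similarity is 1 (up to rounding).
print(cosine_similarity([1, 2, 0], [2, 4, 0]))
```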

The vector space model for scoring

▪ We can then rewrite equation (3) as

  sim(d1,d2) = v(d1) · v(d2)        (4)

Example:
Consider the documents given in the table below. Apply Euclidean normalization to the tf values from the table, for each of the three documents in the table.

Euclidean normalized tf values for documents



Example
▪ The table below shows the number of occurrences of three terms (affection, jealous and gossip) in each of the following three novels: Jane Austen's Sense and Sensibility (SaS) and Pride and Prejudice (PaP), and Emily Brontë's Wuthering Heights (WH).

Table 1: Term frequencies in the three novels. Table 2: Term vectors for the three novels of Table 1.

▪ Now consider the cosine similarities between pairs of the resulting three-dimensional vectors. A simple computation shows that sim(v(SaS), v(PaP)) is 0.999, whereas sim(v(SaS), v(WH)) is 0.888; thus, the two books authored by Austen (SaS and PaP) are considerably closer to each other than to Brontë's Wuthering Heights. In fact, the similarity between the first two is almost perfect (when restricted to the three terms we consider). Here we have considered tf weights, but we could of course use other term weight functions.
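The stated similarities can be reproduced in a few lines. Because the tables are not reproduced above, the raw tf counts below are taken from the standard textbook version of this example and are an assumption:

```python
import math

# Assumed counts of (affection, jealous, gossip) in each novel.
tf = {
    "SaS": [115, 10, 2],   # Sense and Sensibility
    "PaP": [58, 7, 0],     # Pride and Prejudice
    "WH":  [20, 11, 6],    # Wuthering Heights
}

def unit(v):
    """Length-normalize a vector to a unit vector."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def sim(a, b):
    """Cosine similarity as the dot product of the unit vectors (eq. 4)."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

print(sim(tf["SaS"], tf["PaP"]))
print(sim(tf["SaS"], tf["WH"]))
```

Up to rounding, these match the 0.999 and 0.888 quoted above.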

The vector space model for scoring (contd.)

▪ Thus equation (4) can be viewed as the dot product of the normalized versions of the two document vectors.
▪ This measure is the cosine of the angle θ between the two vectors, as illustrated below.

Cosine similarity illustrated: sim(d1, d2) = cos θ.


What use is the similarity measure sim(d1, d2)?

▪ Given a document d (potentially one of the d_i in the collection), consider searching for the documents in the collection most similar to d.
▪ Such a search is useful in a system where a user may identify a document and seek others like it – a feature available in the results lists of search engines as a "more like this" feature.
▪ We reduce the problem of finding the document(s) most similar to d to that of finding the d_i with the highest dot products (sim values) v(d) · v(d_i).
▪ We could do this by computing the dot products between v(d) and each of v(d_1), …, v(d_N), then picking off the highest resulting sim values.

Queries as vectors

▪ We can also view a query as a vector.
▪ The key idea: assign to each document d a score equal to the dot product

  v(q) · v(d)

▪ That is, we can use the cosine similarity between the query vector and a document vector as a measure of the score of the document for that query.
▪ The resulting scores can then be used to select the top-scoring documents for a query. Thus we have

  score(q,d) = (V(q) · V(d)) / (|V(q)| |V(d)|) = v(q) · v(d)

Example:
Suppose we query an IR system with the query "gold silver truck". The collection consists of three documents (D = 3), shown below.
Query, Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

Similarity Analysis


▪ Next, we compute all dot products (zero products ignored).

▪ Now we calculate the similarity values.

Now we calculate the similarity values

▪ Finally, we sort and rank the documents in descending order according to the similarity values:

Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
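This ranking can be reproduced end to end. The sketch below assumes the scheme this worked example uses: weight = raw tf × idf with idf = log10(D/df_t), and cosine similarity between the query and document vectors:

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# idf from document frequencies over the three documents.
vocab = sorted(set(" ".join(docs.values()).split()))
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
idf = {t: math.log10(len(docs) / df[t]) for t in vocab}

def tf_idf_vector(text):
    """Raw tf times idf, one component per vocabulary term."""
    words = text.split()
    return [words.count(t) * idf[t] for t in vocab]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

q = tf_idf_vector(query)
sims = {name: cosine(q, tf_idf_vector(text)) for name, text in docs.items()}
for name, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(name, round(s, 4))
```

Up to rounding of the intermediate idf values, this reproduces the similarity values and the ranking Doc 2 > Doc 3 > Doc 1 above.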

tf-idf weighting has many variants

Columns headed 'n' are acronyms for weight schemes. The most common options are:

Term frequency:      n (natural)  tf(t,d)    l (logarithm)  1 + log10(tf(t,d))
Document frequency:  n (no)       1          t (idf)        log10(N/df_t)
Normalization:       n (none)     1          c (cosine)     1/|V(d)|

Why is the base of the log in idf immaterial? Changing the base multiplies every weight by the same constant factor, which leaves the relative ordering of documents unchanged.

Weighting may differ in queries vs documents

▪ Many search engines allow for different weightings for queries vs. documents.
▪ SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table.
▪ A very standard weighting scheme is lnc.ltc:
▪ Document: logarithmic tf (l as first character), no idf (a bad idea?) and cosine normalization.
▪ Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization.

tf-idf example: lnc.ltc

Document: car insurance auto insurance
Query: best car insurance

                  Query                                    Document                    Prod
Term        tf-raw  tf-wt  df      idf  wt   n'lize   tf-raw  tf-wt  wt    n'lize
auto        0       0      5000    2.3  0    0        1       1      1     0.52     0
best        1       1      50000   1.3  1.3  0.34     0       0      0     0        0
car         1       1      10000   2.0  2.0  0.52     1       1      1     0.52     0.27
insurance   1       1      1000    3.0  3.0  0.78     2       1.3    1.3   0.68     0.53

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
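The lnc.ltc computation in the table can be sketched as follows. The collection size N = 1,000,000 is an assumption chosen to be consistent with the idf values shown (e.g. log10(N/5000) ≈ 2.3):

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

N = 1_000_000  # assumed collection size, consistent with the idf values in the table
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}

# ltc query: logarithmic tf, times idf, cosine-normalized.
q_wt = {t: log_tf(query_tf.get(t, 0)) * math.log10(N / df[t]) for t in df}
q_len = math.sqrt(sum(w * w for w in q_wt.values()))
q = {t: w / q_len for t, w in q_wt.items()}

# lnc document: logarithmic tf, no idf, cosine-normalized.
d_wt = {t: log_tf(doc_tf.get(t, 0)) for t in df}
d_len = math.sqrt(sum(w * w for w in d_wt.values()))  # the "Doc length" in the table
d = {t: w / d_len for t, w in d_wt.items()}

score = sum(q[t] * d[t] for t in df)
print(round(d_len, 2), round(score, 1))  # ~1.92 and ~0.8, as in the table
```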

Summary – vector space ranking


▪ Represent the query as a weighted tf-idf vector
▪ Represent each document as a weighted tf-idf vector
▪ Compute the cosine similarity score for the query
vector and each document vector
▪ Rank documents with respect to the query by score
▪ Return the top K (e.g., K = 10) to the user
