Lecture 6 Score - Term Weight - Vector Space Model

This document discusses techniques for ranked retrieval and scoring documents, including term frequency, inverse document frequency (IDF), and the vector space model. It explains that ranked retrieval orders documents by relevance to a query rather than just returning matching documents. Term frequency (TF) captures the number of times a term appears in a document, while IDF accounts for how common or rare a term is across documents. TF-IDF weighting combines these by giving higher weight to uncommon terms that appear frequently in a document. Documents and queries can then be represented as vectors in a vector space, where similarity is measured to rank documents by relevance to the query.

Uploaded by

Prateek Sharma

Introduction to Information Retrieval

Topic: Scoring, Term Weighting and the Vector Space Model
▪ Ranked retrieval
▪ Scoring documents
▪ Term frequency
▪ Weighting schemes
▪ Vector space scoring

Ranked retrieval
▪ Thus far, our queries have all been Boolean.
▪ Documents either match or don't.
▪ Good for expert users with a precise understanding of their needs and of the collection.
▪ Also good for applications: applications can easily consume thousands of results.
▪ Not good for the majority of users.
▪ Most users are incapable of writing Boolean queries (or they can, but think it's too much work).
▪ Most users don't want to wade through thousands of results.
▪ This is particularly true of web search.

Problem with Boolean search: Feast or famine

▪ Boolean queries often result in either too few (= 0) or too many (1000s) results.
▪ Query 1: "standard user dlink 650" → 200,000 hits
▪ Query 2: "standard user dlink 650 no card found" → 0 hits
▪ It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few; OR gives too many.
▪ With a ranked list of documents, it does not matter how large the retrieved set is.

Ranked retrieval models


▪ Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query.

▪ Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.


Feast or famine: not a problem in ranked retrieval

▪ When a system produces a ranked result set, large result sets are not an issue.
▪ Indeed, the size of the result set is not an issue.
▪ We just show the top k (≈ 10) results.
▪ We don't overwhelm the user.

Recall (previous lecture): Binary term-document incidence matrix

Each document is represented by a binary vector ∈ {0,1}^|V|.

Term-document count matrices

▪ Consider the number of occurrences of a term t in a document d, denoted tf(t,d):

Bag of words model


▪ The vector representation doesn't consider the ordering of words in a document.

▪ "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.

▪ This is called the bag of words model.
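The order-independence above is easy to see in code. A minimal sketch (plain whitespace tokenization is an assumed simplification):

```python
from collections import Counter

def bag_of_words(text):
    """Count term occurrences, discarding word order."""
    return Counter(text.lower().split())

# The two sentences from the slide produce identical bags.
v1 = bag_of_words("John is quicker than Mary")
v2 = bag_of_words("Mary is quicker than John")
print(v1 == v2)  # True: word order is lost
```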



Term frequency tf
▪ A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score.

▪ In this scheme, each term in a document is assigned a weight depending on the number of occurrences of the term in the document.

▪ The score between a query term t and a document d is based on the weight of t in d.

▪ The simplest approach is to assign the weight to be equal to the number of occurrences of term t in document d.

Term frequency tf
▪ We want to use tf when computing query-document match scores. But how?

▪ Raw term frequency is not what we want:
▪ A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
▪ But not 10 times more relevant.

▪ Relevance does not increase proportionally with term frequency.

Log-frequency weighting
▪ The log frequency weight of term t in d is

  w(t,d) = 1 + log10(tf(t,d))  if tf(t,d) > 0,  and 0 otherwise

▪ 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

▪ Score for a document-query pair: sum over terms t in both q and d:

  score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf(t,d))

▪ The score is 0 if none of the query terms is present in the document.
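The weight and the score just defined can be sketched directly; the function names here are illustrative:

```python
import math

def log_tf_weight(tf):
    """w(t,d) = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def log_tf_score(query_terms, doc_tf):
    """Sum of log-tf weights over query terms present in the document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

# Reproduces the slide's mapping: 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```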

Document frequency (df)

▪ Frequent terms are less informative than rare terms.
▪ Consider a query term that is frequent in the collection (e.g., high, increase, line).
▪ A document containing such a term is more likely to be relevant than a document that doesn't, but the term is not a sure indicator of relevance.
▪ For frequent terms like high, increase, and line we still want positive weights, but lower weights than for rare terms.
▪ We will use document frequency (df) to capture this.

idf weight

▪ df_t is the document frequency of t: the number of documents that contain t.
▪ df_t is an inverse measure of the informativeness of t.
▪ df_t ≤ N
▪ We define the idf (inverse document frequency) of t by

  idf_t = log10(N / df_t)

▪ We use log10(N/df_t) instead of N/df_t to "dampen" the effect of idf.
▪ It will turn out that the base of the log is immaterial.


idf example, suppose N = 1 million

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

There is one idf value for each term t in a collection.
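The idf column follows directly from the formula idf_t = log10(N/df_t); a quick check:

```python
import math

N = 1_000_000  # collection size from the slide

def idf(df_t):
    """idf_t = log10(N / df_t)."""
    return math.log10(N / df_t)

for term, df_t in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                   ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:<10} df={df_t:>9,} idf={idf(df_t):g}")
```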



Effect of idf on ranking

▪ Does idf have an effect on ranking for one-term queries, like "iPhone"?
▪ idf has no effect on ranking for one-term queries.
▪ idf affects the ranking of documents for queries with at least two terms.
▪ For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Collection vs. Document frequency

▪ The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences. Example:

Word        Collection frequency    Document frequency
insurance   10440                   3997
try         10422                   8760

▪ Which word is a better search term (and should get a higher weight)?

tf-idf weighting

▪ The tf-idf weighting scheme assigns to term t a weight in document d given by

  tf-idf(t,d) = tf(t,d) × idf_t        (1)

▪ Highest when t occurs many times within a small number of documents.
▪ Lower when the term occurs fewer times in a document, or occurs in many documents.
▪ Lowest when the term occurs in virtually all documents.

tf-idf weighting

▪ One may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component given by eq. (1).

▪ For dictionary terms that do not occur in a document, this weight is zero.

▪ The score of a document d for a query q is the sum of the tf-idf weights of each term of q in d:

  Score(q,d) = Σ_{t ∈ q} tf-idf(t,d)        (2)
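Equations (1) and (2) in code; a minimal sketch with illustrative function names:

```python
import math

def tf_idf(tf, df, n_docs):
    """Eq. (1): tf-idf(t,d) = tf(t,d) * log10(N / df_t); zero for absent terms."""
    return tf * math.log10(n_docs / df) if tf > 0 else 0.0

def score(query_terms, doc_tf, df, n_docs):
    """Eq. (2): sum of the tf-idf weights of the query terms in the document."""
    return sum(tf_idf(doc_tf.get(t, 0), df[t], n_docs)
               for t in query_terms if t in df)

# Toy check: a term occurring 3 times in a document, in 10 of 1000 documents.
print(tf_idf(3, 10, 1000))  # 3 * log10(100) = 6.0
```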

Binary → count → weight matrix

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Example:

Table 1 gives the idf's of terms with various frequencies in the Reuters collection of 806,791 documents.

Consider the table of term frequencies for three documents, denoted Doc1, Doc2 and Doc3, in Table 2. Compute the tf-idf weights for the terms car, auto, insurance and best, for each document, using the idf values from Table 1.
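Since Tables 1 and 2 are not reproduced here, the sketch below assumes the idf and tf values from the standard textbook version of this Reuters exercise; treat the specific numbers as assumptions:

```python
# Assumed values from the textbook version of this exercise (Reuters-RCV1, N = 806,791).
idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}  # Table 1 (assumed)
tf = {                                                             # Table 2 (assumed)
    "Doc1": {"car": 27, "auto": 3,  "insurance": 0,  "best": 14},
    "Doc2": {"car": 4,  "auto": 33, "insurance": 33, "best": 0},
    "Doc3": {"car": 24, "auto": 0,  "insurance": 29, "best": 17},
}

# tf-idf weight = raw tf times idf (eq. 1), per term and per document.
weights = {doc: {t: counts[t] * idf[t] for t in counts} for doc, counts in tf.items()}
for doc, w in weights.items():
    print(doc, {t: round(v, 2) for t, v in w.items()})
```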

The vector space model for scoring

▪ The representation of a set of documents as vectors in a common vector space is known as the vector space model.

▪ It is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering.

The vector space model for scoring

▪ Let V(d) denote the vector derived from document d, with one component in the vector for each dictionary term.

▪ The set of documents in a collection may then be viewed as a set of vectors in a vector space, in which there is one axis for each term.

The vector space model for scoring

▪ How do we quantify the similarity between two documents in this vector space?

▪ A first attempt might consider the magnitude of the vector difference between two document vectors.

▪ This measure suffers from a drawback: two documents with very similar content can have a significant vector difference simply because one is much longer than the other.

The vector space model for scoring

▪ To compensate for the effect of document length, the standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations V(d1) and V(d2):

  sim(d1,d2) = (V(d1) · V(d2)) / (|V(d1)| |V(d2)|)        (3)

▪ The numerator is the dot product of the vectors V(d1) and V(d2).
▪ The denominator is the product of their Euclidean lengths.

The vector space model for scoring

▪ The dot product x · y of two vectors is defined as

  x · y = Σ_{i=1}^{M} x_i y_i

▪ Let V(d) denote the document vector for d, with M components V_1(d), …, V_M(d). The Euclidean length of d is defined to be

  |V(d)| = √( Σ_{i=1}^{M} V_i(d)² )

▪ The effect of the denominator of Equation (3) is thus to length-normalize the vectors V(d1) and V(d2) to unit vectors v(d1) = V(d1)/|V(d1)| and v(d2) = V(d2)/|V(d2)|.
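Equation (3) in code; the toy check illustrates why length normalization matters:

```python
import math

def cosine_similarity(x, y):
    """Eq. (3): dot product divided by the product of the Euclidean lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    len_x = math.sqrt(sum(a * a for a in x))
    len_y = math.sqrt(sum(b * b for b in y))
    return dot / (len_x * len_y)

# A document and a twice-as-long copy with the same proportions:
# the vector difference is large, but the cosine similarity is 1 (up to rounding).
print(cosine_similarity([1, 2, 0], [2, 4, 0]))
```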

The vector space model for scoring

▪ We can then rewrite equation (3) as

  sim(d1,d2) = v(d1) · v(d2)        (4)

Example:
Consider the documents given in the table below. Apply Euclidean normalization to the tf values from the table, for each of the three documents in the table.

Euclidean normalized tf values for documents



Example
▪ The table below shows the number of occurrences of three terms (affection, jealous and gossip) in each of the following three novels: Jane Austen's Sense and Sensibility (SaS) and Pride and Prejudice (PaP), and Emily Brontë's Wuthering Heights (WH).

Table 1: Term frequencies in the three novels. Table 2: Term vectors for the three novels of Table 1.

▪ Now consider the cosine similarities between pairs of the resulting three-dimensional vectors. A simple computation shows that sim(v(SaS), v(PaP)) is 0.999, whereas sim(v(SaS), v(WH)) is 0.888; thus, the two books authored by Austen (SaS and PaP) are considerably closer to each other than to Brontë's Wuthering Heights. In fact, the similarity between the first two is almost perfect (when restricted to the three terms we consider). Here we have considered tf weights, but we could of course use other term weight functions.
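The stated similarities can be reproduced in a few lines. Because the tables are not reproduced above, the raw tf counts below are taken from the standard textbook version of this example and are an assumption:

```python
import math

# Assumed counts of (affection, jealous, gossip) in each novel.
tf = {
    "SaS": [115, 10, 2],   # Sense and Sensibility
    "PaP": [58, 7, 0],     # Pride and Prejudice
    "WH":  [20, 11, 6],    # Wuthering Heights
}

def unit(v):
    """Length-normalize a vector to a unit vector."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def sim(a, b):
    """Cosine similarity as the dot product of the unit vectors (eq. 4)."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

print(sim(tf["SaS"], tf["PaP"]))
print(sim(tf["SaS"], tf["WH"]))
```

Up to rounding, these match the 0.999 and 0.888 quoted above.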

The vector space model for scoring (contd.)

▪ Thus equation (4) can be viewed as the dot product of the normalized versions of the two document vectors.
▪ This measure is the cosine of the angle θ between the two vectors, as illustrated below.

Cosine similarity illustrated: sim(d1, d2) = cos θ.


What use is the similarity measure sim(d1, d2)?

▪ Given a document d (potentially one of the d_i in the collection), consider searching for the documents in the collection most similar to d.
▪ Such a search is useful in a system where a user may identify a document and seek others like it – a feature available in the results lists of search engines as a "more like this" feature.
▪ We reduce the problem of finding the document(s) most similar to d to that of finding the d_i with the highest dot products (sim values) v(d) · v(d_i).
▪ We could do this by computing the dot products between v(d) and each of v(d_1), …, v(d_N), then picking off the highest resulting sim values.

Queries as vectors

▪ We can also view a query as a vector.
▪ The key idea: assign to each document d a score equal to the dot product

  v(q) · v(d)

▪ That is, we can use the cosine similarity between the query vector and a document vector as a measure of the score of the document for that query.
▪ The resulting scores can then be used to select the top-scoring documents for a query. Thus we have

  score(q,d) = (V(q) · V(d)) / (|V(q)| |V(d)|) = v(q) · v(d)

Example:
Suppose we query an IR system with the query "gold silver truck". The collection consists of three documents (D = 3), shown below.
Query, Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

Similarity Analysis


▪ Next, we compute all dot products (zero products ignored).

▪ Now we calculate the similarity values.

Now we calculate the similarity values

▪ Finally, we sort and rank the documents in descending order according to the similarity values:

Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
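This ranking can be reproduced end to end. The sketch below assumes the scheme this worked example uses: weight = raw tf × idf with idf = log10(D/df_t), and cosine similarity between the query and document vectors:

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# idf from document frequencies over the three documents.
vocab = sorted(set(" ".join(docs.values()).split()))
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
idf = {t: math.log10(len(docs) / df[t]) for t in vocab}

def tf_idf_vector(text):
    """Raw tf times idf, one component per vocabulary term."""
    words = text.split()
    return [words.count(t) * idf[t] for t in vocab]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

q = tf_idf_vector(query)
sims = {name: cosine(q, tf_idf_vector(text)) for name, text in docs.items()}
for name, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(name, round(s, 4))
```

Up to rounding of the intermediate idf values, this reproduces the similarity values and the ranking Doc 2 > Doc 3 > Doc 1 above.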

tf-idf weighting has many variants

Columns headed 'n' are acronyms for weight schemes. The most common options are:

Term frequency:      n (natural)  tf(t,d)    l (logarithm)  1 + log10(tf(t,d))
Document frequency:  n (no)       1          t (idf)        log10(N/df_t)
Normalization:       n (none)     1          c (cosine)     1/|V(d)|

Why is the base of the log in idf immaterial? Changing the base multiplies every weight by the same constant factor, which leaves the relative ordering of documents unchanged.

Weighting may differ in queries vs documents

▪ Many search engines allow for different weightings for queries vs. documents.
▪ SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table.
▪ A very standard weighting scheme is lnc.ltc:
▪ Document: logarithmic tf (l as first character), no idf (a bad idea?) and cosine normalization.
▪ Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization.

tf-idf example: lnc.ltc

Document: car insurance auto insurance
Query: best car insurance

                  Query                                    Document                    Prod
Term        tf-raw  tf-wt  df      idf  wt   n'lize   tf-raw  tf-wt  wt    n'lize
auto        0       0      5000    2.3  0    0        1       1      1     0.52     0
best        1       1      50000   1.3  1.3  0.34     0       0      0     0        0
car         1       1      10000   2.0  2.0  0.52     1       1      1     0.52     0.27
insurance   1       1      1000    3.0  3.0  0.78     2       1.3    1.3   0.68     0.53

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
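The lnc.ltc computation in the table can be sketched as follows. The collection size N = 1,000,000 is an assumption chosen to be consistent with the idf values shown (e.g. log10(N/5000) ≈ 2.3):

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

N = 1_000_000  # assumed collection size, consistent with the idf values in the table
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}

# ltc query: logarithmic tf, times idf, cosine-normalized.
q_wt = {t: log_tf(query_tf.get(t, 0)) * math.log10(N / df[t]) for t in df}
q_len = math.sqrt(sum(w * w for w in q_wt.values()))
q = {t: w / q_len for t, w in q_wt.items()}

# lnc document: logarithmic tf, no idf, cosine-normalized.
d_wt = {t: log_tf(doc_tf.get(t, 0)) for t in df}
d_len = math.sqrt(sum(w * w for w in d_wt.values()))  # the "Doc length" in the table
d = {t: w / d_len for t, w in d_wt.items()}

score = sum(q[t] * d[t] for t in df)
print(round(d_len, 2), round(score, 1))  # ~1.92 and ~0.8, as in the table
```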

Summary – vector space ranking


▪ Represent the query as a weighted tf-idf vector
▪ Represent each document as a weighted tf-idf vector
▪ Compute the cosine similarity score for the query
vector and each document vector
▪ Rank documents with respect to the query by score
▪ Return the top K (e.g., K = 10) to the user
