
In association with and

Chapter 7 aka MRS-Chapter 6


Ranking
Search Engines and Information Retrieval
From Ch. 2: Indexing Process
Ch 3
Ch 4
From Ch. 2: Query Process
● Query Suggest
● Query refinements
● Spell correction
● User clicks
● Mouse tracking
Scoring, term weighting and vector space model
● For boolean queries, a document either matches or does not match a
query
● Resulting number of matching documents can far exceed the number a
human user could possibly sift through
● It is essential to rank-order the documents matching a query
● Search engine computes, for each matching document, a score with
respect to the query at hand
Three main ideas
● Parametric and zone indexes
○ helps in indexing and retrieving documents by metadata
○ gives a simple means for scoring (and thereby ranking) documents in response
to a
query
● Weighting the importance of a term in a document, based on the statistics
of occurrence of the term
● Viewing each document as a vector of such weights, we can compute a
score between a query and each document.
○ known as vector space scoring
Parametric and zone indexes
● Digital documents generally encode, in machine-recognizable form, certain
metadata associated with each document
● Metadata generally includes
○ fields such as the date of creation,
○ the author and possibly the title of the document
Parametric and zone indexes
Consider queries of the form “find documents authored by William
Shakespeare in 1601, containing the phrase alas poor Yorick”
● Query processing may consist of merging postings from the standard inverted
index as well as parametric indexes
● One parametric index for each field
○ Example: date of creation
○ helps select only the documents matching a date specified in the query
Zones vs Fields
● Zone can be an arbitrary, unbounded amount of text
○ document titles and abstracts
● Field may take on a relatively small set of values
○ date of creation
○ language
User’s view of a parametric search
Weighted zone scoring
● Given a Boolean query q and a document d, weighted zone scoring assigns to
the pair (q, d) a score in the interval [0, 1]
● Computes a linear combination of zone scores
○ each zone of the document contributes a Boolean value
Weighted zone scoring
➢ Let ℓ be the number of zones in each document
➢ g1, . . . , gℓ ∈ [0, 1] be zone weights (adding up to 1)
➢ si be the Boolean score denoting a match (or absence of a match) between q and the
ith zone, for 1 ≤ i ≤ ℓ
➔ Therefore, the weighted zone score is g1 s1 + g2 s2 + · · · + gℓ sℓ = ∑i gi si

Weighted zone scoring is sometimes called ranked Boolean retrieval
Weighted zone scoring
● Example problem: Consider the query shakespeare in a collection in which
each document has three zones: author, title and body. The Boolean score
function for a zone takes on the value 1 if the query term shakespeare is
present in the zone, and 0 otherwise. Weighted zone scoring in such a
collection would require three weights g1, g2 and g3, respectively
corresponding to the author, title and body zones. Suppose g1=0.2, g2=0.3 and
g3=0.5 (add up to 1). If the term shakespeare appears in the title and body
zones but not the author zone of a document, what will be the score of this
document?
Weighted zone scoring
● Solution:
Here, g1 = 0.2, g2 = 0.3 and g3 = 0.5
As the term shakespeare does not appear in the author zone, s1 = 0
In contrast, s2 = 1 and s3 = 1, as the term appears in both the title and body zones
Therefore, the score is calculated as follows:
g1 s1 + g2 s2 + g3 s3 = (0.2 × 0) + (0.3 × 1) + (0.5 × 1) = 0.8
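The calculation above can be sketched in a few lines. This is a minimal illustration, not code from the course; the function name and the zone dictionaries are invented, and zone matching is simplified to whitespace tokenization.

```python
# Weighted zone scoring: score = sum_i g_i * s_i, where s_i is the Boolean
# match indicator for zone i and the weights g_i add up to 1.

def weighted_zone_score(query_term, zones, weights):
    """zones: dict zone-name -> text; weights: dict zone-name -> g_i."""
    score = 0.0
    for name, g in weights.items():
        # s_i = 1 if the query term occurs in this zone, 0 otherwise
        s = 1 if query_term in zones.get(name, "").lower().split() else 0
        score += g * s
    return score

weights = {"author": 0.2, "title": 0.3, "body": 0.5}
doc = {"author": "anonymous",
       "title": "the works of shakespeare",
       "body": "shakespeare wrote many plays"}
print(round(weighted_zone_score("shakespeare", doc, weights), 2))  # 0.8
```

The term matches the title (0.3) and body (0.5) zones but not the author zone, reproducing the 0.8 from the worked example.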
See you next time
Image Credit: Adobe Text to Image

Position Paper - 10 Marks
Generative AI and Search
Steps:
● FIRST, write down your hypothesis:
○ Do you believe that LLMs will provide a better search experience than your current
search engine, and would they replace search engines?
● Second, gather data:
○ Use ChatGPT, Claude, Gemini, and Perplexity EXCLUSIVELY as your search engine,
for TWO days each across the four systems
○ Take notes on what worked, what failed, what you like, what you dislike, if you
needed to go back to search, why?
○ Basically make notes on your user experience for every information need you had.
You need to gather data on NUMEROUS information needs.
● Third, write a position paper:
○ Paper should be at least 2,000 words
○ It should follow a good structure, e.g. abstract, introduction, hypothesis, …,
conclusions, references
● Due Date: April 15, 2024
The Vector Space Model
Remember
Gerard Salton
Amit Singhal
Chris Buckley
Cindy Robinson
Mandar Mitra
The Birth
Vector Space Model
● Mathematical framework to reason about text
○ Representing text in a mathematical formulation, a vector
○ Every piece of text can be represented as a vector
○ Weight of x-dimension is the length of x-component
○ Weight of y-dimension is the length of y-component
Image credit:
baeldung.com
Vector Space Model
● Every index-term is a dimension
● Every text can be represented as an n-dimensional weighted vector
● n is the vocabulary size (billions)
● Terms that are missing from a text are given a zero weight
● Most values are ZERO, so every text is a very SPARSE vector
● In implementation in data structures, you never store zeroes
Vector Space Model
● How similar are two pieces of text?
○ The more words they share, the more similar they are
■ Similarity score should go up with number of shared words
○ The more important the shared word, the more similar they are
■ Similarity score should go up with the importance (term-weight) of the shared
word
○ Vector Dot Product (scalar product) has both these properties
Vector Space Model
● Dot product can be represented two ways (learn here why)

Vector Space Model
● Dot Product is beautiful:
○ E.g. document score for a query: Score (q, d) = ∑t wt t,q × wt t,d
○ E.g. similarity between two documents: Sim (d1, d2) = ∑t wt t,d1 × wt t,d2
Where wt t,z is the weight or importance of term t in text z (can be tf × idf, or something else)
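Since most weights are zero, the dot product only needs the terms the two texts share. A minimal sketch over sparse dict vectors (the function name and the toy weights are invented for illustration):

```python
# Dot product over sparse term-weight vectors stored as dicts:
# only terms present in BOTH texts contribute to the score.

def dot(w1, w2):
    # iterate over the smaller dict for efficiency
    if len(w1) > len(w2):
        w1, w2 = w2, w1
    return sum(w * w2[t] for t, w in w1.items() if t in w2)

q = {"information": 1.0, "retrieval": 1.0}
d = {"information": 0.5, "retrieval": 0.3, "systems": 0.9}
print(round(dot(q, d), 2))  # 0.8
```

Note that "systems", absent from the query, contributes nothing, which is exactly the sparse-vector behavior described above.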
Term Weighting
● Simple - Binary
○ Weight = 1 if term is present in text, zero otherwise
○ Score (q, Di) = ∑term-j ∊ q 1 if term-j in Di
○ The term weight of every term present in the query or the document is one, zero otherwise
● A better method
○ Repeated words are more important in a text
○ Weight = number of occurrences (frequency) of a term in a text
○ If tf(term-j, Di) is the term frequency of term-j in Di, which is zero if the term is missing in Di
○ Score (q, Di) = ∑term-j ∊ q tf(term-j, Di)
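The two schemes above can be contrasted in a short sketch (function names and the toy document are invented for illustration):

```python
# Binary vs raw term-frequency scoring of a query against a tokenized document.

from collections import Counter

def binary_score(query_terms, doc_tokens):
    toks = set(doc_tokens)
    return sum(1 for t in query_terms if t in toks)

def tf_score(query_terms, doc_tokens):
    tf = Counter(doc_tokens)          # tf[t] is 0 for missing terms
    return sum(tf[t] for t in query_terms)

doc = "to be or not to be".split()
print(binary_score(["to", "be", "question"], doc))  # 2
print(tf_score(["to", "be", "question"], doc))      # 4
```

The tf score rewards the repeated words "to" and "be", while the binary score only counts whether each query term is present.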
Term Weighting
● But
○ Not all words are created equal
■ Common words are less meaningful
● the, if, a, an, of, …
■ Uncommon words are more meaningful
● frequency, meaningful, retrieval
Term Weighting
● Common words (the, a, an, of, in) are LESS important
○ Probability that a word is present in a document in your collection
■ If a word is present in df documents out of N, then p = df/N
● Where df is known as the document frequency of the word
■ The higher the p, the more common the word, and the less meaningful/important it is
■ The lower the p, the less common the word, and the more meaningful/important it is
■ Importance is inversely proportional to the probability that a word is present in a
random document
■ Use -log(p) as an importance measure: the inverse document frequency
■ idf = -log(p) = -log(df/N) = log(N/df)
● Notice that the idf of a word is query and document independent
● It is collection dependent
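The idf formula above in a few lines; the collection size and document frequencies here are invented numbers for illustration:

```python
# idf = log(N / df): the rarer a word, the higher its weight.

import math

N = 1000                              # documents in the collection (assumed)
df = {"the": 990, "retrieval": 10}    # document frequencies (assumed)
idf = {t: math.log(N / d) for t, d in df.items()}
print(round(idf["the"], 3))        # 0.01
print(round(idf["retrieval"], 3))  # 4.605
```

A near-ubiquitous word like "the" gets an idf close to zero, while a word appearing in 1% of documents gets log(100) ≈ 4.6.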
Term Weighting
● A term's tf⸱idf weight in a text = tf × idf
● Simple tf⸱idf based document score for a query
○ Score (q, Di) = ∑term-j ∊ q tf(term-j, Di) × idf(term-j)
● We have been assuming that queries are short and only have words which occur once.
● Also, we have been assuming that using idf once to downweight a word is enough,
so we will not use idf in weighting the query vector.
● However, all this can be changed for longer queries
○ Score (q, Di) = ∑term-j ∊ q tf(term-j, q) × idf(term-j) × tf(term-j, Di) × idf(term-j)
○ Only experimentation can tell what is a good weighting scheme
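The simple tf⸱idf score can be sketched over a toy two-document collection; the function name and the toy data are invented for illustration:

```python
# Simple tf·idf document score for a short query (query terms occur once).

import math
from collections import Counter

def tfidf_score(query_terms, doc_tokens, df, N):
    tf = Counter(doc_tokens)
    return sum(tf[t] * math.log(N / df[t]) for t in query_terms if t in tf)

docs = {"D1": "information systems and information theory".split(),
        "D2": "information retrieval in practice".split()}
df = {"information": 2, "retrieval": 1}   # document frequencies in this collection
N = 2
print(round(tfidf_score(["information", "retrieval"], docs["D2"], df, N), 3))  # 0.693
```

"information" occurs in both documents, so its idf is log(2/2) = 0 and it contributes nothing; only the rarer "retrieval" drives the score.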
Vector Space Model
● Dot product increases with the Euclidean length of the vector || v ||

Vector Space Model
● Longer documents have longer vectors (Euclidean length)
○ D1: Longer documents have longer vectors.
○ D2: Longer documents have longer vectors. Yes! Longer documents have longer vectors.
○ V1 = {documents:1, have:1, longer:2, vectors:1} → || V1 || = sqrt(1² + 1² + 2² + 1²) = 2.65
○ V2 = {documents:2, have:2, longer:4, vectors:2, yes:1} → || V2 || = sqrt(2² + 2² + 4² + 2² + 1²) = 5.39
● Query: [longer vectors], assume idf = 1 for every word for ease.
○ Score (q, D1) = 3, Score (q, D2) = 6
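The two norms from the example can be checked directly (the helper name is invented; the vectors are the slide's):

```python
# Euclidean (2-norm) length of the two term-frequency vectors above.

import math

V1 = {"documents": 1, "have": 1, "longer": 2, "vectors": 1}
V2 = {"documents": 2, "have": 2, "longer": 4, "vectors": 2, "yes": 1}

def norm(v):
    return math.sqrt(sum(w * w for w in v.values()))

print(round(norm(V1), 2))  # 2.65
print(round(norm(V2), 2))  # 5.39
```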
Vector Space Model
● Convert every vector to UNIT length
○ Divide every term weight by the Euclidean length of the vector (also known as the 2-norm)
○ V′ = V / || V ||, thus every vector is unit length
○ The dot product between two unit vectors becomes the cosine of the angle between the
two vectors
● This is known as Cosine Similarity
Image Credit: Statistics for Machine Learning by Pratap Dangeti
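A minimal cosine-similarity sketch over the same two vectors (the function name is invented; normalization is folded into the division by both norms):

```python
# Cosine similarity: dot product of the two vectors after unit-length scaling.

import math

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2)

V1 = {"documents": 1, "have": 1, "longer": 2, "vectors": 1}
V2 = {"documents": 2, "have": 2, "longer": 4, "vectors": 2, "yes": 1}
print(round(cosine(V1, V1), 3))  # 1.0
print(round(cosine(V1, V2), 3))  # 0.983
```

A vector has similarity 1.0 with itself, and the longer D2 is no longer favored merely for its length.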
Project-3
● Download folder 25 from
● Dataset: 132 documents
○ You have to build tf × idf weighted vectors for every document and compute
pairwise cosine similarity.
○ Necessary steps:
■ Build a dictionary of <DOCNO> entry to document-id (numeric: 1-N)
■ Only use the <TITLE> and the <TEXT> sections for indexing
■ Case normalize the document
■ Tokenize every document such that any sequence of alphanumeric characters and an
underscore forms a token (a-z, 0-9, _); each token is an index term
■ Build a dictionary of token to token-id (numeric: 1-M)
■ Compute token idfs
■ Build a tf × idf cosine normalized document vector
■ Compute pairwise document similarity
■ Sort document pairs from most similar to least similar; output the top FIFTY
<DOCNO> entry pairs and their similarity values.
● Due Date: Apr 25, 2024
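One possible reading of the tokenization step above: lowercase first, then take every maximal run of [a-z0-9_] as a token. This is a sketch, not the official reference solution; the function name is invented.

```python
# Case-normalize, then tokenize: any run of a-z, 0-9, _ is one index term.

import re

def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())

print(tokenize("Longer documents, have LONGER vectors!"))
# ['longer', 'documents', 'have', 'longer', 'vectors']
```

Punctuation acts purely as a separator, and case normalization makes "Longer" and "LONGER" the same index term.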
See you next time
Image Credit: Adobe Text to Image
Vector Space Model
● What is wrong with raw tf? (ignore idf/length)
○ q = [information retrieval]
○ D1 = [247 0] : Score(q, D1) = 247
■ https://en.wikipedia.org/wiki/Information
○ D2 = [118 112] : Score(q, D2) = 230
■ https://en.wikipedia.org/wiki/Information_retrieval
● Raw tf behaves like OR
● Search should be like AND
Term Weighting
● https://dl.acm.org/doi/pdf/10.3115/1075671.1075753
Term Weighting
● Why is 1+ln(tf) so good? (ignore idf/length)
○ q = [information retrieval]
○ D1 = [247 0] : Score(q, D1) = 1+ln(247) = 6.5
■ https://en.wikipedia.org/wiki/Information
○ D2 = [118 112]
■ Score(q, D2) = 1+ln(118)+1+ln(112) = 11.5
■ https://en.wikipedia.org/wiki/Information_retrieval
● Raw tf (or linear tf functions) behave like OR
● 1+ln(tf) behaves more like AND
● Log reduces the contribution any term can have
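The slide's numbers can be reproduced directly (the function name is invented; idf and length normalization are ignored, as stated above):

```python
# 1 + ln(tf) damping: score a document from its per-query-term tf values.

import math

def log_tf_score(tfs):
    # terms with tf = 0 contribute nothing
    return sum(1 + math.log(tf) for tf in tfs if tf > 0)

print(round(log_tf_score([247, 0]), 1))    # D1: 6.5
print(round(log_tf_score([118, 112]), 1))  # D2: 11.5
```

D2, which matches both query terms, now comfortably beats D1's single massively repeated term, which is the AND-like behavior the slide describes.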
Term Weighting
● However, log is still unbounded and can grow quite a bit.
● Here is a better function
○ 3×tf / (2+tf): is 1 at tf = 1, and approaches 3 as tf → ∞
○ (n+1)×tf / (n+tf): is 1 at tf = 1, and approaches n+1 as tf → ∞
○ Any term can have at most n+1 times the influence of a single-occurrence term
● Motivated by BM25
○ We will learn about it in a story
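The bounded function's behavior at its two extremes is easy to check (the function name is invented; n = 2 gives the 3×tf/(2+tf) form above):

```python
# Bounded tf function (n+1)*tf / (n+tf): equals 1 at tf = 1 and
# approaches n+1 as tf grows, capping any single term's influence.

def bounded_tf(tf, n=2):
    return (n + 1) * tf / (n + tf)

print(bounded_tf(1))                # 1.0
print(round(bounded_tf(1000), 2))   # 2.99, approaching the cap of 3
```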
Document Length Normalization
● Cosine similarity is a mathematical concept
● It uses text vectors of unit (Euclidean) length
● Under cosine similarity, a vector (text) has 100% (1.0) similarity with itself
○ Under cosine, the query itself is the most relevant text for the query: cos(0) = 1
○ Really? Yes, really!
Document Length Normalization
● Documents are not just vectors: having extra non-query words is a good thing
● Longer documents often have more useful information, but they can be
needlessly verbose
● How do we appropriately retrieve documents of different lengths?
● Clearly not with cosine!
Document Length Normalization
● Suppose you knew that in your collection
○ There are k relevant documents of length l for query q (human-rating based)
○ If your algorithm returns k documents of length l, but you return n non-relevant
documents, then:
■ Length difference is NOT a factor in your poor-quality retrieval; something else is.
● A new document length normalization can be developed using this insight.
Document Length Normalization
● If P(relevance | length) = P(retrieval | length), then you have removed
length as a variable in your ranking.
● If we have a ranking function, like cosine similarity
○ We know P(retrieval | length) for any rank cutoff (say top 10)
○ If there are L documents of length l in the collection, and you retrieve x (≤ 10)
of them in the top 10
○ Then in the top 10, P(retrieval | l) = x / L
● But we also have relevance judgements (training data)
○ <q, d> pairs are created for recall-precision graphs by humans saying d is
relevant to q
Interpolation
Document Length Normalization
● If P(relevance | length) = P(retrieval | length), then you have removed
length as a variable in your ranking.
● Since we have training data
○ We know P(relevance | length) for any length
○ If there are L documents of length l in the collection, and y of those are
relevant to query q
○ Then P(relevance | l) = y / L
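A toy illustration of the two quantities for one length bucket; all numbers here are invented, not from any real collection:

```python
# For documents of length l: compare P(retrieval | l) from a ranking's top-10
# against P(relevance | l) from human judgments.

L = 40          # documents of length l in the collection (assumed)
x = 6           # of those, how many the ranker placed in its top 10 (assumed)
y = 3           # of those, how many humans judged relevant to q (assumed)

p_retrieval = x / L
p_relevance = y / L
print(p_retrieval, p_relevance)  # 0.15 0.075
```

Here retrieval probability exceeds relevance probability for this length, i.e. the ranker over-retrieves documents of length l; pivoted normalization adjusts the normalization factor to close exactly this kind of gap.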
Pivoted Document Length Normalization
Slide-23
● Cosine: Convert every vector to UNIT length
○ Divide every term weight by the Euclidean length of the vector (also known as the
2-norm, or the cosine normalization factor)
○ wi = f(tf, idf) / || V ||
Image Credit: Statistics for Machine Learning by Pratap Dangeti

Slide-21
● Finding the similarity score between two pieces of text
○ E.g. document score for a query: Score (q, d) = ∑t wt t,q × wt t,d
● To decrease the score of a document, we can decrease wt t,d of every term in it
● To increase the score of a document, we can increase wt t,d of every term in it
● To decrease the probability of retrieval for a document, we should decrease its score
○ We can decrease wt t,d of every term in it
○ This can be achieved by increasing the normalization denominator
● To increase the probability of retrieval for a document, we should increase its score
○ We can increase wt t,d of every term in it
○ This can be achieved by decreasing the normalization denominator
Pivoted Document Length Normalization
Pivoted Document Length Normalization (figure: slope < 1.0)
Relevance Feedback Deep Dive
● In Chapter 6 we read about
○ Query expansion
○ Stemming
○ Synonymy (thesaurus)
○ Relevance Feedback
● Let's dive deeper into relevance feedback
○ Overview from Chapter 6
Relevance Feedback
● Rocchio's Algorithm
○ Designed for the vector space model
○ A good query will maximize the similarity with relevant documents and minimize
the similarity with non-relevant documents
Relevance Feedback
● Rocchio's Algorithm
○ An optimal query will maximize the similarity to relevant documents and minimize
the similarity to non-relevant documents.
● The optimal query is the vector difference between the centroids of the relevant
and the non-relevant documents.
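A minimal Rocchio-style update over dict vectors can be sketched as follows. The alpha/beta/gamma values are conventional defaults, not taken from the slides, and the toy vectors are invented; negative weights are clipped to zero, a common practical choice.

```python
# Rocchio update: new_q = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrel)

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    def centroid(vecs):
        c = {}
        for v in vecs:
            for t, w in v.items():
                c[t] = c.get(t, 0.0) + w / len(vecs)
        return c
    rel_c = centroid(relevant) if relevant else {}
    non_c = centroid(nonrelevant) if nonrelevant else {}
    new_q = {}
    for t in set(query) | set(rel_c) | set(non_c):
        w = (alpha * query.get(t, 0.0)
             + beta * rel_c.get(t, 0.0)
             - gamma * non_c.get(t, 0.0))
        if w > 0:                 # clip negative weights to zero
            new_q[t] = w
    return new_q

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.8, "cat": 0.6}]
non = [{"jaguar": 0.4, "car": 0.9}]
new_q = rocchio(q, rel, non)
print(sorted(new_q))              # ['cat', 'jaguar']
print(round(new_q["jaguar"], 2))  # 1.54
```

The expanded query picks up "cat" from the relevant document, while "car", which appears only in the non-relevant one, is pushed below zero and dropped.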
Relevance Feedback
● Rocchio's Algorithm
Relevance Feedback
● Expands the query by adding new words
○ In academic practice, for efficiency, we would add the 10, 20, or 30 new words with
the highest weights as prescribed by Rocchio's Algorithm
○ Beyond that, the weights (importance) of the new words become so low that adding
low-weight words is inconsequential
○ On the web, relevance feedback is seldom used, as running such long queries is not
possible in real time.
○ Relevance feedback (query expansion) tends to be a recall tool.
Pseudo Relevance Feedback
● One magical trick in academic practice is to ASSUME that the top X documents
retrieved using modern vector space term similarity ARE RELEVANT, and to run
relevance feedback without asking the user for relevance judgements.
○ Retrieve top X documents using modern term weighting
○ Assume they are relevant to the query
○ Use Rocchio's method without any non-relevant documents
○ Add Y terms to the query
○ Run the expanded query again and return the results
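The steps above can be sketched as a small expansion routine. This is an illustration only: the function name, parameters, and toy document vectors are all invented, and the initial retrieval itself is assumed to have already produced the ranked list.

```python
# Pseudo relevance feedback: treat the top-X ranked documents as relevant and
# expand the query with the Y highest-weighted new terms from their centroid.

def prf_expand(query_terms, ranked_doc_vectors, top_x=2, top_y=2):
    # centroid of the assumed-relevant top-X document vectors
    c = {}
    for v in ranked_doc_vectors[:top_x]:
        for t, w in v.items():
            c[t] = c.get(t, 0.0) + w / top_x
    # pick the Y strongest terms not already in the query
    new_terms = sorted((t for t in c if t not in query_terms),
                       key=lambda t: c[t], reverse=True)[:top_y]
    return query_terms + new_terms

ranked = [{"jaguar": 0.9, "cat": 0.7, "habitat": 0.4},
          {"jaguar": 0.8, "cat": 0.5, "jungle": 0.6}]
print(prf_expand(["jaguar"], ranked))  # ['jaguar', 'cat', 'jungle']
```

The expanded query would then be run again; no human judgments were needed, which is exactly the trick described above.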