Chapter 7 aka MRS-Chapter 6
Ranking: Search Engines and Information Retrieval

From Ch. 2: Indexing Process (Ch. 3, Ch. 4)

From Ch. 2: Query Process
● Query suggest
● Query refinements
● Spell correction
● User clicks
● Mouse tracking

Scoring, term weighting and the vector space model
● For Boolean queries, a document either matches or does not match a query
● The number of matching documents can far exceed the number a human user could possibly sift through
● It is therefore essential to rank-order the documents matching a query
● The search engine computes, for each matching document, a score with respect to the query at hand

Three main ideas
● Parametric and zone indexes
  ○ help in indexing and retrieving documents by metadata
  ○ give a simple means for scoring (and thereby ranking) documents in response to a query
● Weighting the importance of a term in a document, based on the statistics of occurrence of the term
● Viewing each document as a vector of such weights, we can compute a score between a query and each document
  ○ known as vector space scoring

Parametric and zone indexes
● Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document
● Metadata generally includes
  ○ fields such as the date of creation,
  ○ the author, and possibly the title of the document

Parametric and zone indexes
Consider queries of the form "find documents authored by William Shakespeare in 1601, containing the phrase alas poor Yorick"
● Query processing may consist of merging postings from standard inverted as well as parametric indexes
● One parametric index for each field
  ○ Example: date of creation
  ○ helps select only the documents matching a date specified in the query

Zones vs. fields
● A zone can be an arbitrary, unbounded amount of text
  ○ document titles and abstracts
● A field may take on a relatively small set of values
  ○ date of creation
  ○ language

User's view of a parametric search

Weighted zone scoring
● Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1]
● It computes a linear combination of zone scores
  ○ each zone of the document contributes a Boolean value

Weighted zone scoring
➢ Let ℓ be the number of zones in each document
➢ Let g1, . . . , gℓ ∈ [0, 1] be the zone weights
➢ Let si be the Boolean score denoting a match (or absence) between q and the ith zone, for 1 ≤ i ≤ ℓ
➔ The weighted zone score is then g1·s1 + g2·s2 + · · · + gℓ·sℓ = Σ(i=1..ℓ) gi·si
● Weighted zone scoring is sometimes called ranked Boolean retrieval

Weighted zone scoring
● Example problem: Consider the query shakespeare in a collection in which each document has three zones: author, title and body. The Boolean score function for a zone takes on the value 1 if the query term shakespeare is present in the zone, and 0 otherwise.
Weighted zone scoring in such a collection would require three weights g1, g2 and g3, respectively corresponding to the author, title and body zones. Suppose g1 = 0.2, g2 = 0.3 and g3 = 0.5 (they add up to 1). If the term shakespeare appears in the title and body zones but not the author zone of a document, what will be the score of this document?

Weighted zone scoring
● Solution: Here, g1 = 0.2, g2 = 0.3 and g3 = 0.5. Since the term shakespeare does not appear in the author zone, s1 = 0. In contrast, s2 = 1 and s3 = 1, as the term appears in both the title and body zones. Therefore, the score is
  g1·s1 + g2·s2 + g3·s3 = (0.2×0) + (0.3×1) + (0.5×1) = 0.8

See you next time
Image Credit: Adobe Text to Image

Position Paper - 10 Marks
Generative AI and Search

Position Paper in Search

Position Paper - 10 Marks
Steps:
● FIRST, write down your hypothesis:
  ○ Do you believe that LLMs will provide a better search experience than your current search engine, and would they replace search engines?
● Second, gather data:
  ○ Use ChatGPT, Claude, Gemini and Perplexity EXCLUSIVELY as your search engine, for TWO days each
  ○ Take notes on what worked, what failed, what you liked, what you disliked, and, if you needed to go back to a search engine, why
  ○ Basically, make notes on your user experience for every information need you had. You need to gather data on NUMEROUS information needs.
● Third, write a position paper:
  ○ The paper should be at least 2,000 words
  ○ It should follow a good structure, e.g.
    abstract, introduction, hypothesis, …, conclusions, references
● Due Date: April 15, 2024

The Vector Space Model

Remember: Gerard Salton, Amit Singhal, Chris Buckley, Cindy Robinson, Mandar Mitra

The Birth of the Vector Space Model
● A mathematical framework to reason about text
  ○ Represent text in a mathematical formulation, a vector
  ○ Every piece of text can be represented as a vector
  ○ The weight of the x-dimension is the length of the x-component
  ○ The weight of the y-dimension is the length of the y-component
Image credit: baeldung.com

Vector Space Model
● Text vector
  ○ Every index term is a dimension
  ○ Every text can be represented as an n-dimensional weighted vector
  ○ n is the vocabulary size (billions)
  ○ Terms that are missing from a text are given a zero weight
  ○ Most values are ZERO, so every text is a very SPARSE vector
    ■ In implementations, the data structures never store zeroes

Vector Space Model
● How similar are two pieces of text?
  ○ The more words they share, the more similar they are
    ■ The similarity score should go up with the number of shared words
  ○ The more important the shared words, the more similar they are
    ■ The similarity score should go up with the importance (term weight) of the shared words
  ○ The vector dot product (scalar product) has both these properties

Vector Space Model
● The dot product can be represented in two ways (learn here why)

Vector Space Model
● The dot product is beautiful:
  ○ E.g. document score for a query: Score(q, d) = Σ(t ∈ q) wt(t,q) × wt(t,d)
  ○ E.g.
    similarity between two documents: Sim(d1, d2) = Σ(t) wt(t,d1) × wt(t,d2)
  ○ where wt(t,z) is the weight or importance of term t in text z (it can be tf × idf, or something else)

Term Weighting

Term Weighting
● Simple - Binary
  ○ Weight = 1 if the term is present in the text, zero otherwise
  ○ Score(q, Di) = Σ(term-j ∈ q) 1 if term-j in Di
  ○ The term weight of every term present in both the query and the document is one, zero otherwise
● A better method
  ○ Repeated words are more important in a text
  ○ Weight = number of occurrences (frequency) of a term in a text
  ○ If tf(term-j, Di) is the term frequency of term-j in Di, which is zero if the term is missing in Di, then
  ○ Score(q, Di) = Σ(term-j ∈ q) tf(term-j, Di)

Term Weighting
● But
  ○ Not all words are created equal
    ■ Common words are less meaningful
      ● the, if, a, an, of, …
    ■ Uncommon words are more meaningful
      ● frequency, meaningful, retrieval

Term Weighting
● Common words (the, a, an, of, in) are LESS important
  ○ Consider the probability that a word is present in a document in your collection
    ■ If a word is present in df documents out of N, then p = df/N
      ● df is known as the document frequency of the word
    ■ The higher the p, the more common the word, and the less meaningful/important it is
    ■ The lower the p, the less common the word, and the more meaningful/important it is
    ■ Importance is inversely proportional to the probability that a word is present in a random document
    ■ Use -log(p) as an importance measure: the inverse document frequency
    ■ idf = -log(p) = -log(df/N) = log(N/df)
      ● Notice that the idf of a word is query and document independent
      ● It is collection dependent

Term Weighting
● A term's tf⸱idf weight in a text = tf × idf
● Simple tf⸱idf based document score for a query
  ○ Score(q, Di) = Σ(term-j ∈ q) tf(term-j, Di) × idf(term-j)
● We have been assuming that queries are short and only have words which occur once.
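As a quick aside, the tf × idf score just defined can be sketched in a few lines of Python. The toy documents and query below are invented for illustration; idf = log(N/df), and the score sums tf × idf over the query terms.

```python
import math

# Toy collection (made up for illustration).
docs = [
    "information retrieval is the science of searching for information",
    "the cat sat on the mat",
    "retrieval of information from large collections",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)  # document frequency
    return math.log(N / df) if df else 0.0           # idf = log(N/df)

def score(query, doc_tokens):
    # Score(q, D) = sum over query terms of tf(term, D) * idf(term)
    return sum(doc_tokens.count(t) * idf(t) for t in query.split())

for i, doc in enumerate(tokenized):
    print(i, round(score("information retrieval", doc), 3))
```

Note how the document with no query terms scores exactly zero, and the stopword-heavy document is not rewarded for its common words.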
● Also, we have been assuming that using idf once to downweight a word is enough, so we will not use idf in weighting the query vector.
● However, all this can be changed for longer queries
  ○ Score(q, Di) = Σ(term-j ∈ q) tf(term-j, q) × idf(term-j) × tf(term-j, Di) × idf(term-j)
  ○ Only experimentation can tell what is a good weighting scheme

Vector Space Model
● The dot product increases with the Euclidean length || v || of the vector

Vector Space Model
● Longer documents have longer vectors (Euclidean length)
  ○ D1: Longer documents have longer vectors.
  ○ D2: Longer documents have longer vectors. Yes! Longer documents have longer vectors.
  ○ V1 = {documents:1, have:1, longer:2, vectors:1} - || V1 || = sqrt(1² + 1² + 2² + 1²) = 2.65
  ○ V2 = {documents:2, have:2, longer:4, vectors:2, yes:1} - || V2 || = sqrt(2² + 2² + 4² + 2² + 1²) = 5.39
  ○ Query: [longer vectors], assume idf = 1 for every word for ease.
  ○ Score(q, D1) = 3, Score(q, D2) = 6

Vector Space Model
● Convert every vector to UNIT length
  ○ Divide every term weight by the Euclidean length of the vector (also known as the 2-norm)
  ○ V' = V / || V ||, thus every vector is unit length
  ○ The dot product between two unit vectors is the cosine of the angle between them
● This is known as Cosine Similarity
Image Credit: Statistics for Machine Learning by Pratap Dangeti

Project-3
● Download folder 25 from Dataset: 132 documents
  ○ You have to build tf × idf weighted vectors for every document and compute pairwise cosine similarity.
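The core of that computation, cosine-normalized vectors and a dot product, can be sketched by replaying the "longer documents" example above. This is a minimal sketch with idf assumed to be 1 for every word, as in the slides; with raw tf the repetitive D2 wins 6 to 3, but after unit-length normalization both score almost the same, with D1 slightly ahead.

```python
import math
from collections import Counter

def unit_vector(text):
    # tf vector, then divide by the Euclidean length (2-norm)
    tf = Counter(text.lower().replace("!", "").replace(".", "").split())
    norm = math.sqrt(sum(v * v for v in tf.values()))
    return {t: v / norm for t, v in tf.items()}

def dot(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

d1 = "Longer documents have longer vectors."
d2 = "Longer documents have longer vectors. Yes! Longer documents have longer vectors."
q = {"longer": 1.0, "vectors": 1.0}  # raw query vector, idf = 1

print(round(dot(q, unit_vector(d1)), 3))  # 3 / sqrt(7)  ≈ 1.134
print(round(dot(q, unit_vector(d2)), 3))  # 6 / sqrt(29) ≈ 1.114
```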
  ○ Necessary steps:
    ■ Build a dictionary of <DOCNO> entry to document-id (numeric: 1-N)
    ■ Only use the <TITLE> and <TEXT> sections for indexing
    ■ Case-normalize the document
    ■ Tokenize every document such that any sequence of alphanumeric characters and underscores forms a token (a-z, 0-9, _); each token is an index term
    ■ Build a dictionary of token to token-id (numeric: 1-M)
    ■ Compute token idfs
    ■ Build a tf × idf cosine-normalized document vector
    ■ Compute pairwise document similarity
    ■ Sort document pairs from most similar to least similar; output the top FIFTY <DOCNO> entry pairs and their similarity values.
● Due Date: Apr 25, 2024

See you next time
Image Credit: Adobe Text to Image

Vector Space Model
● What is wrong with raw tf? (ignore idf/length)
  ○ q = [information retrieval]
  ○ D1 = [247 0] : Score(q, D1) = 247
    ■ https://en.wikipedia.org/wiki/Information
  ○ D2 = [118 112] : Score(q, D2) = 230
    ■ https://en.wikipedia.org/wiki/Information_retrieval
● Raw tf behaves like OR
● Search should behave like AND

Remember: Gerard Salton, Amit Singhal, Chris Buckley, Cindy Robinson, Mandar Mitra

Term Weighting
● https://dl.acm.org/doi/pdf/10.3115/1075671.1075753

Term Weighting
● Why is 1+ln(tf) so good?
(ignore idf/length)
  ○ q = [information retrieval]
  ○ D1 = [247 0] : Score(q, D1) = 1+ln(247) = 6.5
    ■ https://en.wikipedia.org/wiki/Information
  ○ D2 = [118 112] : Score(q, D2) = 1+ln(118) + 1+ln(112) = 11.5
    ■ https://en.wikipedia.org/wiki/Information_retrieval
● Raw tf (or linear tf functions) behaves like OR
● 1+ln(tf) behaves more like AND
● The log reduces the contribution any single term can have

Term Weighting
● However, log is still unbounded and can grow quite a bit.
● Here is a better (bounded) function
  ○ 3·tf / (2+tf): is 1 at tf = 1, and tends to 3 as tf → ∞
  ○ (n+1)·tf / (n+tf): is 1 at tf = 1, and tends to n+1 as tf → ∞
  ○ Any term can have at most n+1 times the influence of a single-occurrence term
● Motivated by BM25
  ○ We will learn about it in a story

Vector Space Model
● Longer documents have longer vectors (Euclidean length)
  ○ D1: Longer documents have longer vectors.
  ○ D2: Longer documents have longer vectors. Yes! Longer documents have longer vectors.
  ○ V1 = {documents:1, have:1, longer:2, vectors:1} - || V1 || = sqrt(1² + 1² + 2² + 1²) = 2.65
  ○ V2 = {documents:2, have:2, longer:4, vectors:2, yes:1} - || V2 || = sqrt(2² + 2² + 4² + 2² + 1²) = 5.39
  ○ Query: [longer vectors], assume idf = 1 for every word for ease.
  ○ Score(q, D1) = 3, Score(q, D2) = 6

Vector Space Model
● Convert every vector to UNIT length
  ○ Divide every term weight by the Euclidean length of the vector (also known as the 2-norm)
  ○ V' = V / || V ||, thus every vector is unit length
  ○ The dot product between two unit vectors is the cosine of the angle between them
● This is known as Cosine Similarity
Image Credit: Statistics for Machine Learning by Pratap Dangeti

Document Length Normalization
● Cosine similarity is a mathematical concept
● It assumes every text vector has unit (Euclidean) length
● Under cosine similarity, a vector (text) has 100% (1.0) similarity with itself
  ○ Under cosine, the query itself is the most relevant text for the query: cos(0) = 1
  ○ Really? Yes, really!

Document Length Normalization
● But documents are not just vectors: having extra non-query words can be a good thing
● Longer documents often have more useful information, but they can also be needlessly verbose
● How do we appropriately retrieve documents of different lengths?
● Clearly not with plain cosine!

Document Length Normalization
● Suppose you knew that in your collection
  ○ there are k relevant documents of length l for query q (based on human ratings)
  ○ if your algorithm returns k documents of length l, but n of them are non-relevant, then:
    ■ the length difference is NOT a factor in your poor-quality retrieval; something else is.
● A new document length normalization can be developed using this insight.

Document Length Normalization
● If P(relevance | length) = P(retrieval | length), then you have removed length as a variable in your ranking.
● If we have a ranking function, like cosine similarity
  ○ We know P(retrieval | length) for any rank cutoff (say top 10)
  ○ If there are L documents of length l in the collection, and you retrieve x (≤ 10) of them in the top 10,
  ○ then, in the top 10, P(retrieval | l) = x / L
● But we also have relevance judgements (training data)
  ○ <q, d> pairs are created for recall-precision graphs by a human saying d is relevant to q

Interpolation

Document Length Normalization
● If P(relevance | length) = P(retrieval | length), then you have removed length as a variable in your ranking.
● Since we have training data
  ○ We know P(relevance | length) for any length
  ○ If there are L documents of length l in the collection, and y of those are relevant to query q,
  ○ then P(relevance | l) = y / L

Remember: Gerard Salton, Amit Singhal, Chris Buckley, Cindy Robinson, Mandar Mitra

Pivoted Document Length Normalization

Slide-23
● Cosine: Convert every vector to UNIT length
  ○ Divide every term weight by the Euclidean length of the vector (also known as the 2-norm, or the cosine normalization factor)
● wi = f(tf, idf) / sqrt(Σj f(tfj, idfj)²), i.e. divided by the cosine normalization factor
Image Credit: Statistics for Machine Learning by Pratap Dangeti

Slide-21
● Finding the similarity score between two pieces of text
  ○ E.g.
    document score for a query: Score(q, d) = Σ(t ∈ q) wt(t,q) × wt(t,d)
  ○ To decrease the score of a document, we can decrease wt(t,d) of every term in it
  ○ To increase the score of a document, we can increase wt(t,d) of every term in it
  ○ To decrease the probability of retrieval for a document, we should decrease its score
    ■ we can decrease wt(t,d) of every term in it
    ■ this can be achieved by increasing the normalization denominator, instead of changing each weight individually
  ○ To increase the probability of retrieval for a document, we should increase its score
    ■ we can increase wt(t,d) of every term in it
    ■ this can be achieved by decreasing the normalization denominator, instead of changing each weight individually

Pivoted Document Length Normalization
Slope < 1.0

Relevance Feedback Deep Dive
● In Chapter 6 we read about
  ○ Query expansion
  ○ Stemming
  ○ Synonymy (thesaurus)
  ○ Relevance feedback
● Let's dive deeper into relevance feedback
  ○ Overview from Chapter 6

Relevance Feedback
● Rocchio's Algorithm
  ○ Designed for the vector space model
  ○ A good query will maximize the similarity with relevant documents and minimize the similarity with non-relevant documents

Relevance Feedback
● Rocchio's Algorithm
  ○ An optimal query will maximize the similarity to relevant documents and minimize the similarity to non-relevant documents.
  ○ The optimal direction is the vector difference between the centroids of the relevant and the non-relevant documents.
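Rocchio's modification can be sketched as follows. This is a minimal sketch: the standard form is q_new = α·q + β·centroid(relevant) − γ·centroid(non-relevant), and the α/β/γ values below are commonly used illustrative defaults, not values fixed by the slides.

```python
def centroid(vectors):
    # Component-wise mean of a list of sparse term->weight vectors.
    terms = {t for v in vectors for t in v}
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    cr = centroid(relevant) if relevant else {}
    cn = centroid(nonrelevant) if nonrelevant else {}
    new_q = {}
    for t in set(query) | set(cr) | set(cn):
        w = alpha * query.get(t, 0.0) + beta * cr.get(t, 0.0) - gamma * cn.get(t, 0.0)
        if w > 0:                 # negative weights are usually clipped to zero
            new_q[t] = w
    return new_q

# Hypothetical example: "jaguar" the animal (relevant) vs. the car (non-relevant).
q = {"jaguar": 1.0}
rel = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "wildlife": 1.0}]
nonrel = [{"jaguar": 1.0, "car": 1.0}]
print(rocchio(q, rel, nonrel))
```

The expanded query pulls in "cat" and "wildlife" from the relevant centroid, while "car" is pushed negative and dropped.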
Relevance Feedback
● Rocchio's Algorithm

Relevance Feedback
● Expands the query by adding new words
  ○ In academic practice, for efficiency, we would add the 10, 20 or 30 new words with the highest weights as prescribed by Rocchio's algorithm
  ○ Beyond that, the weights (importance) of the new words become so low that adding them is inconsequential
  ○ On the web, relevance feedback is seldom used, as running such long queries is not possible in real time.
  ○ Relevance feedback (query expansion) tends to be a recall tool.

Pseudo Relevance Feedback
● One magical trick in academic practice is to ASSUME that the top X documents retrieved using modern vector space term similarity ARE RELEVANT, and to run relevance feedback without asking the user for relevance judgements.
  ○ Retrieve the top X documents using modern term weighting
  ○ Assume they are relevant to the query
  ○ Use Rocchio's method without any non-relevant documents
  ○ Add Y terms to the query
  ○ Run the expanded query again and return the results
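The steps above can be sketched as follows. This is a sketch under stated assumptions: `search` is a stand-in for any ranked-retrieval function returning document vectors, and the α/β weights are illustrative.

```python
def pseudo_relevance_feedback(query, search, X=10, Y=20, alpha=1.0, beta=0.75):
    """query: term -> weight; search(query) returns a ranked list of doc vectors."""
    top_docs = search(query)[:X]                      # step 1: retrieve top X
    if not top_docs:
        return query
    # steps 2-3: assume top X are relevant; Rocchio centroid, no non-relevant set
    terms = {t for d in top_docs for t in d}
    cent = {t: sum(d.get(t, 0.0) for d in top_docs) / len(top_docs) for t in terms}
    # step 4: add only the Y highest-weighted new terms to the query
    new_terms = sorted((t for t in cent if t not in query),
                       key=lambda t: cent[t], reverse=True)[:Y]
    expanded = {t: alpha * w for t, w in query.items()}
    for t in new_terms:
        expanded[t] = beta * cent[t]
    return expanded                                   # step 5: rerun this query

# Toy illustration with a stubbed search function (made up for this sketch):
def search(q):
    return [{"jaguar": 1.0, "cat": 2.0}, {"jaguar": 1.0, "wildlife": 1.0}]

print(pseudo_relevance_feedback({"jaguar": 1.0}, search, X=2, Y=1))
```

With Y=1, only the highest-weighted new term ("cat") is added; the original query term keeps its weight.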