KEN2570-5-Search and IR
Information Retrieval

• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
- These days we frequently think first of web search, but there are many other cases:
  - E-mail search
  - Searching your laptop
  - Corporate knowledge bases
  - Legal information retrieval

The classic search model

User task: Get rid of mice in a politically correct way
  | (Misconception?)
Info need: Info about removing mice without killing them
  | (Misformulation?)
Query: how trap mice alive
  |
Search engine -> Results over the Collection, with query refinement looping back to the query
Challenges & Characteristics

• Dynamically generated content
• New pages get added all the time
- The size of the web (or textual content in general) doubles every few minutes
• Users (usually) revise and revisit their queries
• Queries are not extremely long
- They used to be very short (up to 2 words)
• Probably a large number of typos
• A small number of popular queries
- A long tail of infrequent ones
• Almost no use of advanced query operators

Text Retrieval Is Hard!

• Under/over-specified query
- Ambiguous: "buying CDs" (money or music?)
- Incomplete: what kind of CDs?
- What if "CD" is never mentioned in documents?
• Vague semantics of documents
- Ambiguity: e.g., word-sense, structural
- Incomplete: inferences required
• Hard even for people!
- Only ~80% agreement in human judgments of relevant results
Term-document incidence matrix: 1 if the play contains the word, 0 otherwise.

             Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony       1                      1               0             0        0         1
Brutus       1                      1               0             1        0         0
Caesar       1                      1               0             1        1         1
Calpurnia    0                      1               0             0        0         0
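The incidence matrix above supports Boolean retrieval directly: each term is a 0/1 vector over the plays, and a Boolean query becomes elementwise logic on those vectors. A minimal sketch for the classic query "Brutus AND Caesar AND NOT Calpurnia":

```python
# Boolean retrieval over the term-document incidence matrix above.
# Each term maps to a 0/1 vector over the six plays; Boolean queries
# become elementwise logical operations on those vectors.

PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

INCIDENCE = {
    "brutus":    [1, 1, 0, 1, 0, 0],
    "caesar":    [1, 1, 0, 1, 1, 1],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}

def brutus_and_caesar_not_calpurnia():
    hits = []
    for i, play in enumerate(PLAYS):
        if (INCIDENCE["brutus"][i] and INCIDENCE["caesar"][i]
                and not INCIDENCE["calpurnia"][i]):
            hits.append(play)
    return hits

print(brutus_and_caesar_not_calpurnia())
# → ['Antony and Cleopatra', 'Hamlet']
```

Real systems store the same information as an inverted index (a sorted postings list per term) and intersect postings rather than scanning dense vectors.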
• Generally we want to penalize long documents, but there are two reasons a document can be long:
- A doc is long because it uses more words
- A doc is long because it has more content
• One option is to normalize term frequency by document length:

  tf(t, d) = freq(t, d) / Σ_t' freq(t', d)

• Dividing a vector by its L2 norm makes it a unit (length) vector
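The length-normalized term frequency above is just each term's count divided by the total number of tokens in the document; a minimal sketch:

```python
from collections import Counter

def normalized_tf(doc_tokens):
    """Term frequency divided by document length, so long documents
    are not favored just for repeating words."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())  # = len(doc_tokens)
    return {term: c / total for term, c in counts.items()}

tf = normalized_tf(["to", "be", "or", "not", "to", "be"])
# "to" and "be" each get 2/6, "or" and "not" each get 1/6
```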
• idf has no effect on ranking one-term queries
- idf affects the ranking of documents for queries with at least two terms
- For the query "capricious person", idf weighting makes occurrences of the rare term capricious count for much more in the final ranking than occurrences of the common term person

Example tf-idf weights (partial matrix):

             Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Caesar       8.59                   2.54            0             1.51     0.25      0
Calpurnia    0                      1.54            0             0        0         0
Cleopatra    2.85                   0               0             0        0         0
mercy        1.51                   0               1.9           0.12     5.25      0.88
worser       1.37                   0               0.11          4.15     0.25      1.95
• Cosine similarity of vectors x and y:

  cos(x, y) = ( Σ_{i=1..|V|} x_i · y_i ) / ( sqrt(Σ_{i=1..|V|} x_i²) · sqrt(Σ_{i=1..|V|} y_i²) )

• These are very sparse vectors – most entries are zero
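Because the vectors are sparse, they are usually stored as term→weight maps, and the dot product only iterates over terms that actually occur. A sketch of the cosine formula above on that representation:

```python
import math

def cosine(x, y):
    """Cosine similarity for sparse vectors stored as {term: weight}
    dicts; only terms present in both vectors contribute to the dot
    product, so most of the (zero) entries are never touched."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    if nx == 0.0 or ny == 0.0:
        return 0.0  # convention for empty documents
    return dot / (nx * ny)
```

Identical vectors score 1.0; vectors with no terms in common score 0.0.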
• BM25 ranking function:

  RSV_BM25 = Σ_{i∈q} log(N / df_i) · ( (k1 + 1) · tf_i ) / ( k1 · ((1 − b) + b · (dl / avdl)) + tf_i )

  where dl = document length (|d|) and avdl = average document length in the whole collection

• Example query with BM25: "president lincoln"
- "president" is in 40,000 documents in the collection (df_president = 40000)
- "lincoln" is in 300 documents in the collection (df_lincoln = 300)
- The document length is 90% of the average length (dl/avdl = 0.9)
- Let's assume k1 = 1.2, b = 0.75

  tf_president,d   tf_lincoln,d   Score(q, d)
       15               25          20.66
       15                1          12.74
       15                0           5.00
        1               25          18.2
        0               25          15.66

  The low-df term plays a bigger role.

• What could retrieval further take into account?
- Taking into account the meaning of the words used.
  - Any external info, e.g. WordNet?
  - Word vectors?
- Taking into account the order of words in the query.
- Adapting to the user based on direct or indirect feedback.
- Taking into account the authority of the source.
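A sketch of the BM25 formula above applied to the "president lincoln" example. The collection size N is not stated on the slide, so N = 500,000 is an assumed value; exact scores depend on N and on the idf variant used, but the ranking of the five example documents comes out in the same order as in the table:

```python
import math

def bm25_score(query_terms, tf, df, N, dl_over_avdl, k1=1.2, b=0.75):
    """BM25 as in the formula above: for each query term, idf times a
    saturating tf component with document-length normalization."""
    K = k1 * ((1 - b) + b * dl_over_avdl)
    score = 0.0
    for t in query_terms:
        if tf.get(t, 0) == 0:
            continue  # absent terms contribute nothing
        idf = math.log(N / df[t])
        score += idf * ((k1 + 1) * tf[t]) / (K + tf[t])
    return score

# Slide's example: df_president = 40000, df_lincoln = 300, dl/avdl = 0.9.
# N = 500,000 is an assumption, not given on the slide.
df = {"president": 40000, "lincoln": 300}
rows = [(15, 25), (1, 25), (0, 25), (15, 1), (15, 0)]  # table order, best first
scores = [bm25_score(["president", "lincoln"],
                     {"president": p, "lincoln": l}, df,
                     N=500_000, dl_over_avdl=0.9)
          for p, l in rows]
```

The saturation in the tf component is the key point: raising tf_lincoln from 1 to 25 adds far more than raising tf_president the same way, because the low-df term carries a much larger idf weight.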
Integrating multiple features to determine relevance

• Modern systems – especially on the Web – use a great number of features:
- Arbitrary useful features – not a single unified model
- Log frequency of query word in anchor text?
- Query word in color on page?
- # of images on page?
- # of (out) links on page?
- PageRank of page?
- URL length?
- URL contains "~"?
- Page edit recency?
- Page length?
• Google is using over 200 such features for the rankings and constantly updates the algorithm (SEO, paid advertisements, etc.)

How to combine features to assign a relevance score to a document?

• Given lots of relevant features…
• You can continue to hand-engineer retrieval scores
• Or, you can build a classifier to learn weights for the features
- Requires: labeled training data
- This is the "learning to rank" approach, which has become a hot area in recent years (esp. with deep models)
- We only provide an elementary introduction here
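A hand-engineered score is simply a fixed weighted sum of such features. The feature names and weights below are illustrative inventions, not any search engine's actual values:

```python
# A hand-engineered relevance score: a fixed weighted sum of features.
# The feature names and weights are illustrative, not a real system's.

def hand_engineered_score(features):
    weights = {
        "bm25": 1.0,           # text-match score
        "pagerank": 0.5,       # link-based authority
        "anchor_log_tf": 0.3,  # log frequency of query word in anchor text
        "url_length": -0.01,   # mild penalty for long URLs
    }
    return sum(weights[f] * features.get(f, 0.0) for f in weights)
```

The learning-to-rank alternative replaces these hand-picked weights with weights fit to labeled (query, document, relevant?) data.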
• Train a machine learning model to predict the class r (relevant R vs. non-relevant N) of a document-query pair

[Figure: labeled R/N training examples plotted in a two-dimensional feature space, cosine score (vertical axis, 0 to 0.025) against term proximity ω (horizontal axis, 2 to 5)]
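The two-feature setup in the figure can be sketched as a logistic regression trained from scratch on labeled relevant (1) / non-relevant (0) pairs. The training points below are made up for illustration; the learned model then assigns a relevance probability to any new (cosine score, proximity) pair:

```python
import math

# Learning to rank in its simplest form: logistic regression over two
# features (cosine score, term proximity), trained on labeled
# relevant (1) / non-relevant (0) document-query pairs.
# The training points are invented for illustration.

train = [  # (cosine_score, term_proximity, label)
    (0.020, 5, 1), (0.022, 4, 1), (0.018, 5, 1), (0.015, 4, 1),
    (0.005, 2, 0), (0.008, 3, 0), (0.003, 2, 0), (0.010, 2, 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1 = w2 = b = 0.0
lr = 0.5
for _ in range(2000):  # plain stochastic gradient descent on log-loss
    for x1, x2, y in train:
        g = sigmoid(w1 * x1 + w2 * x2 + b) - y  # prediction error
        w1 -= lr * g * x1
        w2 -= lr * g * x2
        b -= lr * g

def relevance_prob(cosine_score, proximity):
    return sigmoid(w1 * cosine_score + w2 * proximity + b)
```

Production learning-to-rank systems use pairwise or listwise objectives and far richer models, but the core idea is the same: weights come from labeled data, not hand-tuning.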
Privacy in IR

• Personalization is an important topic in information retrieval; after all, we'd like our search results to be relevant to us and our interests.
• Let's google "marguerite". What is the first search result? Would you expect another person - say, someone in the USA - to get the same search result?
- Think of other examples of personalization based on location, search and browsing history, or social media.
• What are potential benefits and risks of getting personalized searches? Is it okay that search engines are using our data to personalize our searches? Or is there a limit to what kind of data should be okay for search engines to use?
• In 2009, the French government signed the "Charter of good practices on the right to be forgotten on social networks and search engines."
- Do you think people should have the right to remove information about themselves from the web (the right to be forgotten)?
- Do you think Google should be required to remove information about an individual upon request?

Recap

• Information Retrieval Challenges
• Retrieval models
- Boolean
- TF-IDF
- BM25
• Evaluation metrics for retrieval tasks
References

• IR Chapters: 1, 2.1-2.2, 6.2-6.4.3, 8.1-8.5
• Bias in IR:
- https://www.theverge.com/2022/5/11/23064883/google-ai-skin-tone-measure-monk-scale-inclusive-search-results
- https://blogs.bing.com/search-quality-insights/february-2018/toward-a-more-intelligent-search-bing-multi-perspective-answers
- https://www.tandfonline.com/doi/abs/10.1080/00913367.1990.10673179
- https://journals.sagepub.com/doi/10.1177/002193479902900303
- https://psycnet-apa-org.stanford.idm.oclc.org/fulltext/2020-42793-001.html
- https://journals.sagepub.com/doi/abs/10.1177/1090198120957949