Information Retrieval: Prof: Ehab Ezzat Hassanein
Course Objectives
● How to do efficient (fast, compact) text indexing
● Retrieval models: Boolean, vector-space, probabilistic, and machine learning models
● Evaluation and IR interface issues
● Document clustering and classification
● Search on the web, including crawling, link-based algorithms, indirect feedback, and metadata
● Trends: AI, ChatGPT, Bard, etc.
Course Plan
Recommended Textbook
Google Stock on 8 February 2023
Google Stock Keeps Falling After Bard Ad Shows
Inaccurate Answer, AI Race Heats Up
Text mining interaction with
other fields
Inter-relationship among different text mining techniques
and their core functionalities
History
Goldberg machine
The Goldberg machine was a mechanical device that searched for a
pattern of dots or letters across catalog entries stored on a roll of
microfilm.
Goldberg machine cont.
● Catalog entries were stored on a roll of film (No. 1 in the figure).
● The query (2) was also on film, showing a negative image of the part of the catalog being searched for; in this case, the 1st and 6th entries on the roll.
● A light source (7) was shone through the catalog roll and the query film and focused onto a photocell (6).
● If an exact match was found, all light to the cell was blocked, causing a relay to move a counter forward (12) and an image of the match to be shown via a half-silvered mirror (3), which reflected the match onto a screen or photographic plate (4 & 5).
The number of websites
● While the exact number of websites changes every second, there are well over 1 billion sites on the World Wide Web (1,197,982,359 according to Netcraft’s January 2021 Web Server Survey).
● January 2020: 1,295,973,827 (189,000,000)
● January 2018: 1,805,260,010 (171,648,771)
● January 2016: 906,616,188 (170,258,872)
● January 2014: 861,379,152 (180,067,270)
● January 2012: 582,716,657 (182,441,983)
● January 2010: 206,741,990 (83,456,669)
● January 2008: 155,583,825 (68,274,154)
Basic Definitions
Information retrieval (IR)
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
This includes not only web search but also:
● Email search
● Searching your laptop
● Corporate knowledge bases
● Legal information retrieval
Data extraction &
Information extraction
Data extraction is the process of retrieving data from various sources.
Companies frequently extract data in order to process it further, migrate
it to a data repository, or analyze it.
Data mining & Web mining
Data mining is the process of analyzing large volumes of data to find
patterns, discover trends, and gain insight into how that data can be used.
Data miners can then use those findings to make decisions or predict an
outcome. Data mining is an interdisciplinary field, blending statistics,
machine learning, and artificial intelligence.
Web mining is the process of using data mining techniques and algorithms
to extract information directly from the Web: from Web documents and
services, Web content, hyperlinks, and server logs.
The goal of Web mining is to look for patterns in Web data by collecting
and analyzing information in order to gain insight into trends, the
industry, and users in general.
Web crawler & web scraper
Unstructured (text) vs.
Structured (database) data
In the mid-nineties
Unstructured (text) vs.
Structured (database) data
Today
Basic Assumptions of
Information Retrieval
● Collection: a set of documents
– Assume it is a static collection for now.
● Goal: retrieve documents with information that is relevant to the user’s information need and that helps the user complete a task
The Classic Search Model
The Classic Search Model
What can go wrong
Information need
● An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need.
Relevance
● A document is relevant if the user perceives it as containing information of value with respect to their personal information need.
The Effectiveness
How good are the retrieved
documents
● Precision: What fraction of the returned results are relevant to the information need?
● Recall: What fraction of the relevant documents in the collection were returned by the system?
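The two definitions above can be made concrete with a small sketch. The function name `precision_recall` and the docIDs below are hypothetical, chosen only for illustration; the sketch assumes relevance judgments are available as a set of docIDs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: docIDs the system returned
    relevant:  docIDs judged relevant to the information need
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```

Returning more documents can only keep recall the same or raise it, but it tends to lower precision, which is why the two measures are reported together.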
Term-document Incidence
Matrix And Inverted Index
AD HOC RETRIEVAL
● Our goal is to develop a system to address the ad hoc retrieval task.
● This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query.
Grepping
● The simplest form of retrieval is a linear scan through the documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs it.
● Grepping through text can be very effective, especially given the speed of modern computers, and it often allows wildcard pattern matching through the use of regular expressions.
● For simple querying of modest collections (Shakespeare’s Collected Works total a bit under one million words of text), you really need nothing more.
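As a rough illustration of grepping, the sketch below does a linear scan over a list of lines with Python's `re` module. It is not the Unix grep implementation; the function name `grep` and the three sample lines are assumptions made for this example.

```python
import re


def grep(pattern, lines):
    """Linear scan: return (line_number, line) for every line matching the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(i, line) for i, line in enumerate(lines, start=1) if regex.search(line)]


# Toy "collection" (sample lines chosen for illustration)
text = [
    "Friends, Romans, countrymen, lend me your ears;",
    "I come to bury Caesar, not to praise him.",
    "The evil that men do lives after them;",
]

# Wildcard-style match: any word starting with "country"
for number, line in grep(r"\bcountry\w*", text):
    print(number, line)
```

Every query rereads the whole text, so the cost grows linearly with the size of the collection.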
Unstructured data in 1620
Shortfalls of Grepping
A linear scan is not enough when we need:
1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total on the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as “within 5 words” or “within the same sentence”.
3. To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.
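To show what a NEAR operator asks for, here is a minimal sketch that checks whether two terms occur within k words of each other in one piece of text. The function name `near`, the default k = 5, and the simple word tokenizer are assumptions for illustration; a real system would answer this from word positions stored in the index rather than by rescanning the text.

```python
import re


def near(text, term_a, term_b, k=5):
    """True if term_a and term_b occur within k words of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, tok in enumerate(tokens) if tok == term_a.lower()]
    pos_b = [i for i, tok in enumerate(tokens) if tok == term_b.lower()]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)


line = "Friends, Romans, countrymen, lend me your ears"
print(near(line, "Romans", "countrymen"))  # True: adjacent words
print(near(line, "friends", "ears", k=5))  # False: 6 words apart
```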
Term-document incidence matrix
BOOLEAN RETRIEVAL
MODEL
● The Boolean retrieval model is a model for information retrieval in which we can pose any query that is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT.
● The model views each document as just a set of words.
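A minimal sketch of Boolean retrieval over a term-document incidence matrix follows. The four toy documents and the helper names (`bool_and`, `bool_or`, `bool_not`) are assumptions for illustration; each term gets a row of 0/1 entries, one per document, and a query is evaluated by combining rows bit by bit.

```python
# Toy collection (hypothetical documents)
docs = {
    "d1": "brutus killed caesar",
    "d2": "caesar and calpurnia and brutus",
    "d3": "the tempest",
    "d4": "brutus and caesar",
}

doc_ids = sorted(docs)
# The Boolean model views each document as just a set of words.
doc_words = {d: set(text.split()) for d, text in docs.items()}
vocabulary = sorted(set().union(*doc_words.values()))

# incidence[term] is the row of the matrix for that term: 1 if the doc contains it.
incidence = {t: [int(t in doc_words[d]) for d in doc_ids] for t in vocabulary}


def bool_and(u, v):
    return [a & b for a, b in zip(u, v)]


def bool_or(u, v):
    return [a | b for a, b in zip(u, v)]


def bool_not(u):
    return [1 - a for a in u]


# Query: brutus AND caesar AND NOT calpurnia
row = bool_and(bool_and(incidence["brutus"], incidence["caesar"]),
               bool_not(incidence["calpurnia"]))
answer = [d for d, bit in zip(doc_ids, row) if bit]
print(answer)  # ['d1', 'd4']
```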
Bigger Collection
● Suppose we have N = 1 million documents.
● Suppose each document is about 1,000 words long (2–3 book pages).
● Assume an average of 6 bytes per word, including spaces and punctuation.
● This gives a document collection of about 6 GB.
● Typically, there might be about M = 500,000 distinct terms in these documents (this corresponds to the number of rows in the matrix).
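The 6 GB figure follows directly from the three assumptions above. A quick check, using decimal gigabytes (1 GB = 10^9 bytes) and variable names chosen for this sketch:

```python
n_docs = 1_000_000      # N
words_per_doc = 1_000
bytes_per_word = 6      # including spaces and punctuation

total_words = n_docs * words_per_doc
collection_bytes = total_words * bytes_per_word

print(f"{total_words:,} words, about {collection_bytes / 10**9:.0f} GB")
```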
Can’t build the Matrix!
● A 500K x 1M matrix has half a trillion 0’s and 1’s.
BUT
● Almost all of the entries are 0’s.
● The matrix has at most 1 billion 1’s.
– We assumed 1M documents of 1,000 words each, so even if every word in a document were a distinct term, there would be at most 1,000M 1’s.
● Such a matrix is extremely sparse. We need a better representation: one that records only the 1’s.
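The sparsity argument can be checked with a few lines of arithmetic. The variable names are chosen for this sketch; the numbers are the ones assumed on the slides.

```python
n_terms = 500_000       # M, rows of the matrix
n_docs = 1_000_000      # N, columns of the matrix
words_per_doc = 1_000

cells = n_terms * n_docs            # total 0/1 entries in the matrix
max_ones = n_docs * words_per_doc   # upper bound: every word in a doc is distinct
density = max_ones / cells          # largest possible fraction of 1 entries

print(f"cells: {cells:.1e}, at most {max_ones:.1e} ones ({density:.1%} of the matrix)")
```

Under these assumptions at least 99.8% of the entries are 0, which is the motivation for storing only the positions of the 1's.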
Inverted Index.
● The key data structure that underlies all modern IR systems.
● It exploits the sparsity of the term-document matrix and allows for very efficient retrieval.
● The name is actually redundant: an index always maps back from terms to the parts of a document where they occur.
● Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval.
Inverted Index.
● For each term t, we must store a list of all the documents that contain t.
– Identify each document by a docID, a document serial number.
– Can we use fixed-size arrays for this?
● Very inefficient: some terms occur in many documents and others in very few.
Inverted Index.
● We need variable-size postings lists.
– On disk, a contiguous run of postings is normal and best.
– In memory, we can use linked lists or variable-length arrays.
– The dictionary is small, so it can be kept in memory, whereas the postings are large and may be stored on disk.
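The sketch below illustrates a dictionary of terms pointing to variable-length postings lists, held in memory as Python lists. The function names, the four toy documents, and the whitespace tokenizer are assumptions for illustration. The `intersect` helper is an addition beyond the slide text: it shows one common way an AND query can be answered by merging two sorted postings lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict of docID (int) -> text. Returns term -> sorted list of docIDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                       # visit docIDs in increasing order
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)                # so each postings list stays sorted
    return dict(index)


def intersect(p1, p2):
    """Merge two sorted postings lists; the result answers term1 AND term2."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Toy collection (hypothetical documents)
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}
index = build_inverted_index(docs)
print(index["new"])                              # [1, 4]
print(index["july"])                             # [2, 3, 4]
print(intersect(index["new"], index["july"]))    # [4]
```

Only the documents that actually contain a term appear in its list, so the lists differ in length (here from one to four docIDs), which a fixed-size array per term would not handle well.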
Inverted index vs. Forward
Index
● A search engine holds a list of documents (pages on web sites); you enter some keywords and get results back.
● A forward index (or just index) is the list of documents and the words that appear in each of them. In the web search example, Google crawls the web, building the list of documents and figuring out which words appear in each page.
● The inverted index is the list of words and the documents in which each of them appears. In the web search example, you provide the list of words (your search query), and Google produces the documents (search result links).
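The relationship between the two indexes can be sketched as a simple inversion: start from document -> words and regroup into word -> documents. The page names, word lists, and the function name `invert` are hypothetical, chosen only for this example.

```python
# Forward index: document -> words that appear in it (hypothetical pages)
forward_index = {
    "page1": ["search", "engine", "index"],
    "page2": ["web", "search"],
    "page3": ["web", "crawler", "index"],
}


def invert(forward):
    """Turn a forward index into an inverted index: word -> sorted list of documents."""
    inverted = {}
    for doc, words in forward.items():
        for word in set(words):                  # count each word once per document
            inverted.setdefault(word, []).append(doc)
    return {word: sorted(doc_list) for word, doc_list in inverted.items()}


inverted_index = invert(forward_index)
print(inverted_index["search"])   # ['page1', 'page2']
print(inverted_index["index"])    # ['page1', 'page3']
```

A keyword lookup in the inverted index is a single dictionary access, whereas answering the same question from the forward index means scanning every document.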
Inverted Index construction
Initial stages of text
processing
● Tokenization
– Cut the character sequence into word tokens
– Deal with cases such as “John’s” and “a state-of-the-art solution”
● Normalization
– Map text and query terms to the same form
– We want USA and U.S.A. to match
● Stemming
– We may wish different forms of a root to match
– authorize and authorization
● Stop words
– We may omit very common words (or not!)
– the, a, to, of
– Consider a query for the song “To be or not to be”!
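The four stages above can be chained as in the rough sketch below. Everything here is a simplification assumed for illustration: the tokenizer only splits on whitespace and trims punctuation, the normalizer only lowercases and drops periods, `crude_stem` is a toy rule that handles the authorize/authorization pair and is not a real stemming algorithm such as Porter's, and the stop list contains only the four words named on the slide.

```python
STOP_WORDS = {"the", "a", "to", "of"}   # the slide's examples, not a full stop list


def tokenize(text):
    """Split on whitespace, then trim punctuation at the edges of each token."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.split())
    return [tok for tok in tokens if tok]


def normalize(token):
    """Lowercase and drop periods so that 'U.S.A' and 'USA' map to the same form."""
    return token.lower().replace(".", "")


def crude_stem(token):
    """Toy suffix stripping (hypothetical rule, covers only this word family)."""
    for suffix in ("ization", "ize"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stop_words=True):
    terms = [crude_stem(normalize(tok)) for tok in tokenize(text)]
    if remove_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    return terms


print(tokenize("John's state-of-the-art solution"))
print(preprocess("The U.S.A. will authorize it"))   # ['usa', 'will', 'author', 'it']
print(preprocess("USA authorization"))              # ['usa', 'author']
print(preprocess("to be or not to be"))             # ['be', 'or', 'not', 'be']
```

The last line shows the stop-word trade-off: even this four-word list removes part of the query, and a longer stop list would leave little to search for, which is one reason to keep stop words in the index.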