Information Retrieval: Prof: Ehab Ezzat Hassanein

The document outlines a course on Information Retrieval, covering topics such as text indexing, retrieval models, evaluation issues, document clustering, and web search techniques. It emphasizes the importance of precision and recall in assessing the effectiveness of IR systems and introduces key concepts like inverted indexing and the differences between web crawlers and scrapers. The course also touches on trends in AI and tools for information extraction and data mining.

Uploaded by

yahia mohamed


Information Retrieval

Prof: Ehab Ezzat Hassanein

1 / 49
Introduction

2 / 49
Course Objectives

●
How to do efficient (fast, compact) text
indexing
●
Retrieval models: Boolean, vector-space,
probabilistic, and machine learning models
●
Evaluation and IR interface issues
●
Document clustering and classification
●
Search on the web, including crawling, link-
based algorithms, indirect feedback,
metadata
●
Trends: AI, ChatGPT, Bard, etc.
3 / 49
Course Plan

4 / 49
Recommended Textbook

Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schütze.
Cambridge University Press, 2008
ISBN-13: 9780521865715

5 / 49
Google Stock in 8-2-2023

6 / 49
7 / 49
Google Stock Keeps Falling After Bard Ad Shows
Inaccurate Answer, AI Race Heats Up

8 / 49
Text mining interaction with
other fields

9 / 49
Inter-relationship among different text mining techniques
and their core functionalities

10 / 49
History

11 / 49
Goldberg machine
The Goldberg machine was a mechanical device that searched for a
pattern of dots or letters across catalog entries stored on a roll of
microfilm.

12 / 49
Goldberg machine cont.

●
Here it can be seen that catalog entries were
stored on a roll of film (No. 1 of the figure).
●
A query (2) was also on film, showing a negative
image of the part of the catalog being searched
for; in this case the 1st and 6th entries on the roll.
●
A light source (7) was shone through the catalog
roll and query film, focused onto a photocell (6).
●
If an exact match was found, all light to the cell was
blocked, causing a relay to move a counter
forward (12) and an image of the match to be
shown via a half-silvered mirror (3), which reflected the
match onto a screen or photographic plate (4 & 5).

13 / 49
The number of websites

●
While the exact number of websites keeps changing every
second, there are well over 1 billion sites on the World Wide
Web (1,197,982,359 according to Netcraft’s January 2021
Web Server Survey).
●
January 2020: 1,295,973,827 (189,000,000)
●
January 2018: 1,805,260,010 (171,648,771)
●
January 2016: 906,616,188 (170,258,872)
●
January 2014: 861,379,152 (180,067,270)
●
January 2012: 582,716,657 (182,441,983)
●
January 2010: 206,741,990 (83,456,669)
●
January 2008: 155,583,825 (68,274,154)

14 / 49
Basic Definitions

15 / 49
Information retrieval (IR)
Information retrieval (IR) is
finding material (usually documents) of an
unstructured nature (usually text) that satisfies an
information need from within large collections
(usually stored on computers).
This includes not only web search but also:
●
Email Search
●
Searching your laptop
●
Corporate knowledge bases
●
Legal Information retrieval
16 / 49
Data extraction &
Information extraction
Data extraction is a process that involves retrieval of
data from various sources. Frequently, companies
extract data in order to process it further, migrate the
data to a data repository or to further analyze it.

Information extraction (IE) is the automated
retrieval of specific information related to a selected
topic from a body or bodies of text. Information
extraction tools make it possible to pull information
from text documents, databases, websites or multiple
sources.

17 / 49
Data mining & Web mining
Data mining is the process of analyzing
large volumes of data to find patterns, discover trends,
and gain insight into how that data can be used. Data
miners can then use those findings to make decisions or
predict an outcome. Data mining is an interdisciplinary
field, blending statistics, machine
learning, and artificial intelligence.
Web mining is the process of using data mining
techniques and algorithms to extract information directly
from the Web: from Web documents and
services, Web content, hyperlinks and server logs.
The goal of Web mining is to look for patterns in Web data
by collecting and analyzing information in order to gain
insight into trends, the industry and users in general.

18 / 49
Web crawler & web scraper

A web crawler, sometimes called a “spider,” is a
standalone bot that systematically scans the Internet
to index and search for content, following
the links on web pages.

A web scraper is a tool that extracts specific
data. Unlike a web crawler, a web scraper searches for
specific information on specific websites or pages.

19 / 49
Unstructured (text) vs.
Structured (database) data
In the mid-nineties

20 / 49
Unstructured (text) vs.
Structured (database) data
Today

21 / 49
Basic Assumptions of
Information Retrieval
●
Collection: a set of documents
Assume it is a static collection for now.

●
Goal: retrieve documents with information that is
relevant to the user’s information need and helps
the user to complete a task

22 / 49
The Classic Search Model

23 / 49
The Classic Search Model
What can go wrong?

24 / 49
Information need
●
An information need is the topic about which
the user desires to know more, and is
differentiated from a query, which is what the
user conveys to the computer in an attempt to
communicate the information need.

25 / 49
Relevance

●
A document is relevant if it is one that the user perceives
as containing information of value with
respect to their personal information need.

26 / 49
The Effectiveness

To assess the effectiveness of an IR system
(i.e., the quality of its search results), a user
will usually want to know two key statistics
about the system’s returned results for a
query: Precision and Recall

27 / 49
How good are the retrieved
documents
●
PRECISION
Precision: What fraction of the returned
results are relevant to the information need?
●
RECALL
Recall: What fraction of the relevant
documents in the collection were returned by
the system?
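
As a minimal sketch of the two definitions above, the function below computes both statistics for a single query. The function name, the document IDs and the relevance judgments are hypothetical and only serve as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the system returns 4 documents, 3 of which are relevant,
# while the collection holds 6 relevant documents in total.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 3, 4, 7, 8, 9})
print(p, r)  # 0.75 0.5
```

Here 3 of the 4 returned documents are relevant (precision 0.75), but only 3 of the 6 relevant documents were found (recall 0.5).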

28 / 49
Term-document Incidence
Matrix And Inverted Index

29 / 49
AD HOC RETRIEVAL
●
Our goal is to develop a system to address the
ad hoc retrieval task.
●
This is the most standard IR task. In it, a system
aims to provide documents from within the
collection that are relevant to an arbitrary user
information need, communicated to the system
by means of a one-off, user-initiated query.

31 / 49
Grepping
●
The simplest form of retrieval is a linear scan through
the documents. This process is commonly referred to as
grepping through text, after the Unix
command grep, which performs this process.
●
Grepping through text can be a very effective
process, especially given the speed of
modern computers, and often allows useful
possibilities for wildcard pattern matching
through the use of regular expressions.
●
For simple querying of modest collections (the
size of Shakespeare’s Collected Works is a bit
under one million words of text in total), you
really need nothing more.
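
A rough sketch of what grepping amounts to, written with Python's standard `re` module rather than the Unix command itself. The `grep` function and the two tiny documents are assumptions made for illustration; every line of every document is scanned for each query.

```python
import re


def grep(pattern, documents):
    """Linear scan: return (doc_name, line_number, line) for every matching line."""
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    for name, text in documents.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                matches.append((name, line_no, line))
    return matches


# Hypothetical two-document collection.
docs = {
    "doc1": "Friends, Romans, countrymen,\nlend me your ears",
    "doc2": "I come to bury Caesar,\nnot to praise him",
}

# A regular expression gives simple wildcard-style matching.
hits = grep(r"\bcountrym[ae]n\b", docs)
print(hits)
```

The cost of each query grows with the total size of the collection, which is acceptable for a collection of about a million words.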
35 / 49
Unstructured data in 1620

36 / 49
Shortfalls of Grepping
Grepping falls short when we need:
1. To process large document collections quickly. The
amount of online data has grown at least as quickly
as the speed of computers, and we would now like to
be able to search collections that total in the order of
billions to trillions of words.
2. To allow more flexible matching operations. For
example, it is impractical to perform the query
Romans NEAR countrymen with grep, where NEAR
might be defined as “within 5 words” or “within the
same sentence”.
3. To allow ranked retrieval: in many cases you want
the best answer to an information need among many
documents that contain certain words.

37 / 49
Term-document incidence
matrix

The query: Brutus AND Caesar AND NOT Calpurnia

We take the vectors for Brutus, Caesar and Calpurnia,
complement the last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100

38 / 49
Answer to the Query:
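
The bitwise computation for Brutus AND Caesar AND NOT Calpurnia can be checked with a short sketch. The Calpurnia vector 010000 is inferred from its complement 101111 given above; the variable names are illustrative, and documents are simply numbered 1 to 6 from the leftmost bit.

```python
# Incidence vectors, one bit per document (leftmost bit = document 1).
N_DOCS = 6
brutus = int("110100", 2)
caesar = int("110111", 2)
calpurnia = int("010000", 2)  # inferred: the complement is stated as 101111

mask = (1 << N_DOCS) - 1            # restricts the complement to 6 bits
not_calpurnia = ~calpurnia & mask   # 101111

answer = brutus & caesar & not_calpurnia
answer_bits = format(answer, f"0{N_DOCS}b")
matching_docs = [i + 1 for i, bit in enumerate(answer_bits) if bit == "1"]
print(answer_bits, matching_docs)  # 100100 [1, 4]
```

The 1 bits in the result mark the first and fourth documents as the answer to the query.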

39 / 49
BOOLEAN RETRIEVAL
MODEL
●
The Boolean retrieval model is a model for
information retrieval in which we can pose
any query which is in the form of a Boolean
expression of terms, that is, in which terms
are combined with the operators AND, OR,
and NOT.
●
The model views each document as just a set
of words.

40 / 49
Bigger Collection

●
Suppose we have N = 1 million documents.
●
Suppose each document is about 1000 words
long (2–3 book pages).
●
Assume an average of 6 bytes per word,
including spaces and punctuation.
●
This gives a document collection about 6 GB in size.
●
Typically, there might be about M = 500,000
distinct terms in these documents (this corresponds
to the number of rows in the matrix).

41 / 49
Can’t build the Matrix!
●
A 500K x 1M matrix has half a trillion 0’s and 1’s.
BUT
●
Almost all of the entries are 0’s.
●
The matrix has at most 1 billion 1’s.
– We assume 1M documents of 1000 words each, so even
if every word in a document were a distinct term, there
would be at most 1000M 1’s.
●
Such a matrix is extremely sparse. We need a better
representation, one that records only the 1’s.
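
The arithmetic behind these estimates can be written out as a short sketch. It only uses the figures assumed on the slides (N, M, 1000 words per document, 6 bytes per word); the variable names are illustrative.

```python
N = 1_000_000          # documents
WORDS_PER_DOC = 1_000  # average document length in words
BYTES_PER_WORD = 6     # including spaces and punctuation
M = 500_000            # distinct terms (rows of the matrix)

collection_bytes = N * WORDS_PER_DOC * BYTES_PER_WORD  # about 6 GB
matrix_cells = M * N                                   # half a trillion entries
max_ones = N * WORDS_PER_DOC                           # upper bound on the number of 1's
min_zero_fraction = 1 - max_ones / matrix_cells        # share of entries that must be 0

print(f"collection size: {collection_bytes / 1e9:.0f} GB")
print(f"matrix cells:    {matrix_cells:,}")
print(f"max 1's:         {max_ones:,}")
print(f"zeros:           at least {min_zero_fraction:.1%}")
```

Under these assumptions at least 99.8% of the matrix entries are 0, which is why storing only the 1's pays off.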
42 / 49
Inverted Index.
●
The key data structure that underlies all modern
IR systems
●
It is a data structure that exploits the sparsity of the
term-document matrix and allows very efficient
retrieval
●
The name is actually redundant: an index always
maps back from terms to the parts of a document
where they occur.
●
Nevertheless, inverted index, or sometimes inverted
file, has become the standard term in information
retrieval.
43 / 49
Inverted Index.

●
For each term t, we must store all the
documents that contain t.
– Identify each document by a docID, a
document serial number
– Can we use fixed-size arrays for this?
●
No: that would be very inefficient

44 / 49
Inverted Index.
●
We need variable-size posting lists
– On disk, a contiguous run of postings is normal and
best.
– In memory, we can use linked lists or variable-length
arrays.
– The dictionary is small, so it can be stored in memory,
whereas the postings are large and may be stored on
disk.
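
A minimal in-memory sketch of this layout, assuming a Python dict as the dictionary and sorted variable-length lists of docIDs as the posting lists. The terms and docIDs are hypothetical. Keeping each list sorted allows an AND query to be answered by a single merge-style pass over two lists.

```python
def intersect(p1, p2):
    """Intersect two posting lists that are sorted by docID."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Dictionary (term -> posting list); only documents containing the term are recorded.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}

print(intersect(index["brutus"], index["caesar"]))     # [1, 2, 4]
print(intersect(index["brutus"], index["calpurnia"]))  # [2, 31]
```

The merge takes time proportional to the combined length of the two lists, not to the number of documents in the collection.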

45 / 49
Inverted index vs. Forward
Index
●
In a search engine you have a list of documents
(pages on web sites), where you enter some
keywords and get results back.
●
A forward index (or just index) is the list of
documents, and which words appear in them. In
the web search example, Google crawls the web,
building the list of documents, figuring out which
words appear in each page.
●
The inverted index is the list of words, and the
documents in which they appear. In the web
search example, you provide the list of words
(your search query), and Google produces the
documents (search result links).

46 / 49
Inverted Index construction
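
One way to sketch the construction step is to invert a forward index (docID -> words) into term -> sorted docIDs. The function name and the two small documents are assumptions for illustration, and the words are taken to be already tokenized and normalized.

```python
from collections import defaultdict


def build_inverted_index(forward_index):
    """Invert {doc_id: [words]} into {term: sorted list of doc_ids}."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)  # a set ignores repeated words in one document
    # Sort the terms (dictionary) and each posting list (by docID).
    return {term: sorted(doc_ids) for term, doc_ids in sorted(inverted.items())}


# Hypothetical forward index with two documents.
forward_index = {
    1: ["i", "did", "enact", "julius", "caesar"],
    2: ["so", "let", "it", "be", "with", "caesar"],
}

inverted_index = build_inverted_index(forward_index)
print(inverted_index["caesar"])  # [1, 2]
print(inverted_index["enact"])   # [1]
```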

47 / 49
Initial stages of text
processing
●
Tokenization
– Cut the character sequence into word tokens
– Deal with cases such as “John’s” and “a state-of-the-art solution”
●
Normalization
– Map text and query terms to the same form
– e.g., we want USA and U.S.A to match
●
Stemming
– We may wish different forms of a root to match
– e.g., authorize and authorization
●
Stop words
– We may omit very common words (or not!)
– e.g., the, a, to, of
– Counterexample: a query for the song “To be or not to be”
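
The four stages can be sketched as a small pipeline. All function names are hypothetical, and the `stem` function is a crude suffix-stripping stand-in, not a real stemmer such as Porter's. The stop list contains only the four example words above.

```python
STOP_WORDS = {"the", "a", "to", "of"}  # the four example stop words


def tokenize(text):
    """Split on whitespace and trim surrounding punctuation from each token."""
    tokens = (raw.strip(".,;:!?\"()") for raw in text.split())
    return [t for t in tokens if t]


def normalize(token):
    """Lowercase, drop a possessive 's, and remove periods (U.S.A -> usa)."""
    token = token.lower()
    if token.endswith(("'s", "’s")):
        token = token[:-2]
    return token.replace(".", "")


def stem(token):
    """Crude suffix stripping for illustration only (assumed rule set)."""
    for suffix in ("ization", "ize"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stop_words=True):
    terms = [stem(normalize(t)) for t in tokenize(text)]
    if remove_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    return terms


print(preprocess("John's U.S.A authorization"))  # ['john', 'usa', 'author']
print(preprocess("authorize the USA"))           # ['author', 'usa']
print(preprocess("to be or not to be"))          # ['be', 'or', 'not', 'be']
```

The last call shows why dropping stop words is a trade-off: the query loses both occurrences of "to", and a larger stop list could remove it entirely.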
49 / 49
