
Lect 01-Introduction (1)

The document discusses the evolution of Information Retrieval (IR) from basic search engines to advanced systems that address user needs through various data sources. It outlines key concepts such as unstructured vs. structured data, Boolean queries, inverted indexes, and ranking mechanisms, emphasizing the complexity and breadth of modern IR. Additionally, it highlights the integration of machine learning and AI in enhancing search results and addresses challenges faced in web IR.


Nowadays IR is much more than building search engines!

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa

Reading Chapter 1
The course
• Timetable
  • Monday 11-13 (L) and Tuesday 9-11 (L1)
• The web page: two parts (last year and current)
• Twitter: @FerraginaTeach
• Student meetings: Monday 14.30-16.30
• The exam
  • One written test with theory questions + exercises (two rounds, with small penalty)
  • Perhaps, a lab test on Lucene/Elasticsearch
Topics to cover or not?
• I/O model. Multi-way mergesort. Sketch of MapReduce.
• Hashing. Compacted trie, front coding, auto-completion.
• Edit distance via dynamic programming (possibly weighted). Overlap measure with k-gram index.
• Posting list compression: gamma, variable bytes (t-nibble), PForDelta and Elias-Fano.
• Compressed storage of documents: LZ-based compression. Storage and transmission of file(s): delta compression (Zdelta), file synchronization (rsync, zsync).
• Rank and Select data structures, Elias-Fano.
• Succinct representation of binary trees and navigation.
• Random walks. Link-based ranking: PageRank, topic-based PageRank, personalized PageRank, CoSimRank. HITS.
What is IR today?

Evolution of Search Engines
• Zero generation -- use metadata added by users (1991-.. Wanderer)
• First generation -- use only on-page, web-text data (1995-1997: AltaVista, Excite, Lycos, etc.)
  • Word frequency and language
• Second generation -- use off-page, web-graph data (1998: Google)
  • Link (or connectivity) analysis
  • Anchor-text (how people refer to a page)
• Third generation -- answer “the need behind the query” (Google, Yahoo, MSN, ASK, ...)
  • Focus on “user need”, rather than on query
  • Integrate multiple data-sources
  • Click-through data
• Fourth and current generation: Information Supply
Searching «substrings»
Searching routes
Searching over geo+labels
Searching over labeled graphs

Cisco foresees 50 billion devices connected by 2020


Paradigm shift

We now have «devices 2.0» that have their own ID, communication capacity, computing and storage, and currently also the ability to interact.
Three main types of data…
• Opportunistic
  • Credit card transactions
  • Phone calls, bills, web clicks, …
• Purposely sensed
  • Pollution, temperature, wind, …
  • Movement, acceleration, …
  • Health sensing, …
• User generated
  • Photo, tweet, post, email, …
  • Query-log on search engines
A universe of possibilities

… limited only by our imagination!

The PhD+ course: how to build a start-up?
Basics

Information Retrieval
Information Retrieval (IR) is finding
material (usually documents) of
unstructured nature (usually text) that
satisfies an information need from
within large collections (usually stored
on computers).

IR vs. databases:
Unstructured vs Structured data
Structured data tends to refer to “tables”

Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000

Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
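As a rough illustration of the structured query above, the table can be held as a list of records and filtered with a predicate. The field names and the `select` helper are assumptions made for this sketch, not part of the slides.

```python
# Rows of the example table; keys mirror the column headers.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def select(rows, predicate):
    """Return the rows for which the predicate holds (hypothetical helper)."""
    return [row for row in rows if predicate(row)]


# Salary < 60000 AND Manager = Smith
result = select(
    employees,
    lambda r: r["salary"] < 60000 and r["manager"] == "Smith",
)
names = [r["employee"] for r in result]
print(names)
```

Both conditions are exact tests on typed columns, which is what makes structured data easy to query; free text offers no such columns.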
Semi-structured data: XML
• In fact almost no data is “unstructured”
  • E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates “semi-structured” search such as
  • Title contains data AND Bullets contain search
• Issues:
  • how do you process “about”?
  • how do you rank results?
Unstructured data
Typically refers to free text, and allows
• Keyword queries including operators
• More sophisticated “concept” queries, e.g.,
  • find all web pages dealing with drug abuse

Classic model for searching text documents
Boolean queries: Exact match
• The Boolean retrieval model lets us ask a query that is a Boolean expression:
  • Boolean queries are queries using AND, OR and NOT to join query terms
  • Views each document as a set of words
  • Is precise: document matches condition or not.
• Perhaps the simplest model to build an IR system on
• Many search systems still use it:
  • Email, library catalog, Mac OS X Spotlight
Implementing the Boolean model

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

1 if play contains word, 0 otherwise

Matrix could be very big

Brutus AND Caesar BUT NOT Calpurnia
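A minimal sketch of how the query can be answered on the incidence matrix: each term row is treated as a 6-bit vector and the rows are combined with bitwise operators. The variable names and the `matching_plays` helper are assumptions for illustration; the bit patterns are copied from the matrix rows.

```python
plays = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# One bit per play, leftmost bit = first column of the matrix.
incidence = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

ALL_PLAYS = (1 << len(plays)) - 1  # mask with one bit set per play


def matching_plays(bits):
    """Translate a bit vector back into play titles."""
    n = len(plays)
    return [plays[k] for k in range(n) if (bits >> (n - 1 - k)) & 1]


# Brutus AND Caesar BUT NOT Calpurnia
not_calpurnia = ~incidence["Calpurnia"] & ALL_PLAYS
answer = incidence["Brutus"] & incidence["Caesar"] & not_calpurnia
print(matching_plays(answer))
```

The complement is masked with `ALL_PLAYS` because Python integers are unbounded, so `~x` alone would set infinitely many high bits.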
Inverted index
• For each term t, we must store a list of all documents that contain t.
  • Identify each by docID, a document serial number
• Can we use fixed-size arrays for this?
  • What about inserting a new docID?

Brutus     1 2 4 11 31 45 173 174
Caesar     1 2 4 5 6 16 57 132
Calpurnia  2 31 54 101
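The structure above can be sketched as a dictionary from terms to growable lists of docIDs. The toy documents and the function name `build_inverted_index` are made up for this example; tokenization is a plain whitespace split, which a real system would replace with proper text analysis.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs containing it.

    docs: dict docID -> text (hypothetical toy collection).
    """
    index = defaultdict(list)
    # Visiting docIDs in increasing order keeps every postings list sorted.
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


docs = {
    1: "Brutus killed Caesar",
    2: "Caesar and Calpurnia",
    4: "Brutus and Caesar",
}
index = build_inverted_index(docs)
print(index["caesar"])
```

Variable-length lists sidestep the fixed-size array question: a new docID larger than all previous ones is simply appended at the end of each affected list.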

AND query
Cleopatra  9 3 45 11 1 46 31 ….
Caesar     57 12 4 9 15 16 2 ….

If n, m are the lengths of the lists, how many comparisons?
n*m ≈ 10^12 cmp ≈ 10^3 sec

This is not an «engineering problem».
You need efficient algorithms!
AND query
Cleopatra  9 3 45 11 1 46 31 ….
Caesar     57 12 4 9 15 16 2 ….

Sorted by docID:
Cleopatra  1 3 9 11 31 45 46 ….
Caesar     2 4 9 12 15 16 57 ….

How many comparisons? n + m ≈ 10^6 cmp ≈ 1 msec

Which are the top-10 results?
Intersecting two postings lists
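A sketch of the linear-time intersection of two sorted postings lists, using the Cleopatra and Caesar lists of the previous slide after sorting. The function name and the comparison counter are additions for illustration only.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted docID lists.

    Returns the common docIDs and the number of loop steps,
    which never exceeds len(p1) + len(p2).
    """
    i = j = 0
    common = []
    steps = 0
    while i < len(p1) and j < len(p2):
        steps += 1
        if p1[i] == p2[j]:
            common.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] is too small to match anything left in p2
        else:
            j += 1
    return common, steps


cleopatra = sorted([9, 3, 45, 11, 1, 46, 31])
caesar = sorted([57, 12, 4, 9, 15, 16, 2])
common, steps = intersect(cleopatra, caesar)
print(common, steps)
```

Each step advances at least one pointer, which is where the n + m bound comes from.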

The Inverted index

Brutus     2 4 6 10 32
the        1 2 3 5 8 13 21 34
Calpurnia  13 16

Two advantages:
• Speed: query requires just a scan
• Space: store smaller integers (gap coding)

Compressed, they occupy 13% of the original text
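Gap coding can be illustrated on the postings list of "the" shown above: instead of the docIDs, store the difference between each docID and its predecessor. The function names are assumptions for this sketch; the actual bit-level encoding of the gaps (gamma, variable bytes, and so on) is a separate step not shown here.

```python
from itertools import accumulate


def to_gaps(postings):
    """Replace each docID by its distance from the previous one."""
    gaps = []
    previous = 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Recover the docIDs as prefix sums of the gaps."""
    return list(accumulate(gaps))


the_postings = [1, 2, 3, 5, 8, 13, 21, 34]
gaps = to_gaps(the_postings)
print(gaps)
```

Frequent terms have dense lists and hence small gaps, which is what makes variable-length integer codes pay off.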

Query optimization
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

Brutus     2 4 8 16 32 64 128
Caesar     1 2 3 5 8 16 21 34
Calpurnia  13 16

Query: Brutus AND Calpurnia AND Caesar
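One standard heuristic for the ordering question is to process the terms by increasing postings-list length, so that intermediate results stay as small as possible. The sketch below applies it to the three lists above; the function names are assumptions for illustration.

```python
def intersect(p1, p2):
    """Linear merge of two sorted docID lists."""
    i = j = 0
    common = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            common.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return common


def and_query(index, terms):
    """AND all terms together, shortest postings list first."""
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:
            break  # nothing can match any more
        result = intersect(result, index[term])
    return result


index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
answer = and_query(index, ["Brutus", "Calpurnia", "Caesar"])
print(answer)
```

Here Calpurnia (2 postings) is intersected with Brutus first, so the final merge with Caesar starts from a one-element list.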

Query optimization
• Can we improve scanning-based intersection?
  • Skips (still scan-based but with shortcuts)
Sec. 2.3
Augment postings with skip pointers (at indexing time)

2 4 8 41 48 64 128     (skip pointers to 41 and 128)
1 2 3 8 11 17 21 31    (skip pointers to 11 and 31)

• Where do we place them?
• What is the space/time trade-off?
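A sketch of intersection with skip pointers, under the assumption that skips are placed at evenly spaced positions about √P apart for a list of length P (a common heuristic; the slide leaves the placement open). Skips are simulated here by index arithmetic rather than stored pointers, and all names are illustrative.

```python
import math


def _advance(postings, i, step, target):
    """Move past postings[i] < target, following skips that do not overshoot."""
    if i % step == 0 and i + step < len(postings) and postings[i + step] <= target:
        # A skip starts at i: keep following skips while they stay <= target.
        while i + step < len(postings) and postings[i + step] <= target:
            i += step
        return i
    return i + 1


def intersect_with_skips(p1, p2):
    """Intersect two sorted lists, with a skip pointer every ~sqrt(len) entries."""
    s1 = max(1, math.isqrt(len(p1)))
    s2 = max(1, math.isqrt(len(p2)))
    i = j = 0
    common = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            common.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = _advance(p1, i, s1, p2[j])
        else:
            j = _advance(p2, j, s2, p1[i])
    return common


first = [2, 4, 8, 41, 48, 64, 128]
second = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(first, second))
```

More skips mean shorter jumps but more comparisons against skip targets and more space; fewer skips mean long jumps that rarely apply.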
Query optimization
• Can we improve scanning-based intersection?
  • Skips (still scan-based but with shortcuts)
  • Recursive merge (splitting by pivots)

Caesar     1 2 3 5 8 16 21 34
Calpurnia  13 16 34

Binary search
Which list do you bisect at every recursive step?
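One possible answer, sketched below under stated assumptions: take the median of the currently shorter range as pivot, binary-search it in the longer range, and recurse on the two sides. This is one common way to realize a recursive merge, not necessarily the variant intended in the lecture; the function names are hypothetical.

```python
from bisect import bisect_left


def _recurse(shorter, slo, shi, longer, llo, lhi, out):
    # Always split the range that is currently shorter.
    if shi - slo > lhi - llo:
        shorter, slo, shi, longer, llo, lhi = longer, llo, lhi, shorter, slo, shi
    if slo >= shi:
        return
    mid = (slo + shi) // 2
    pivot = shorter[mid]
    # Binary search for the pivot inside the longer range.
    pos = bisect_left(longer, pivot, llo, lhi)
    _recurse(shorter, slo, mid, longer, llo, pos, out)
    found = pos < lhi and longer[pos] == pivot
    if found:
        out.append(pivot)
    _recurse(shorter, mid + 1, shi, longer, pos + 1 if found else pos, lhi, out)


def recursive_intersect(p1, p2):
    """Intersect two sorted docID lists by pivot splitting."""
    out = []
    _recurse(p1, 0, len(p1), p2, 0, len(p2), out)
    return out


caesar = [1, 2, 3, 5, 8, 16, 21, 34]
calpurnia = [13, 16, 34]
print(recursive_intersect(caesar, calpurnia))
```

Because the left side is processed before the pivot and the right side after it, the output comes out already sorted.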
Sec. 1.3

Boolean queries:
More general merges
• Exercise: Adapt the merge for:
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar

Can we still run the merge in time O(n + m)?
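For the first case, a sketch of one possible adaptation: the same two-pointer scan, keeping the docIDs of the first list that never meet an equal docID in the second. The function name is an assumption, and the postings are the Brutus and Caesar lists from the query optimization slide.

```python
def and_not(p1, p2):
    """docIDs in p1 that do not occur in p2; both lists sorted."""
    i = j = 0
    out = []
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            out.append(p1[i])  # no equal docID can appear later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1             # excluded by the NOT
            j += 1
        else:
            j += 1
    return out


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 16, 21, 34]
print(and_not(brutus, caesar))
```

This still takes O(n + m) steps. The OR NOT case is different in nature: its answer contains every document that is not in Caesar, so its size depends on the whole collection rather than on n and m alone.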

IR is much more…
• What about phrases?
  • “Stanford University”
• Proximity: Find Gates NEAR Microsoft.
  • Need index to capture term positions in docs.
• Zones in documents: Find documents with (author = Ullman) AND (text contains automata).
• Search for Maradona and find also “el pibe de oro”
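A minimal sketch of an index that captures term positions, and of a two-word phrase query on top of it. The toy documents and the function names are invented for this example, and tokenization is a plain whitespace split.

```python
from collections import defaultdict


def build_positional_index(docs):
    """term -> {docID -> list of positions of the term in that doc}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_query(index, first, second):
    """docIDs where `second` occurs immediately after `first`."""
    docs_first = index.get(first, {})
    docs_second = index.get(second, {})
    hits = []
    for doc_id in sorted(docs_first.keys() & docs_second.keys()):
        following = set(docs_second[doc_id])
        if any(p + 1 in following for p in docs_first[doc_id]):
            hits.append(doc_id)
    return hits


docs = {
    1: "I study at Stanford University",
    2: "the university near Stanford",
    3: "Stanford University and Pisa University",
}
index = build_positional_index(docs)
print(phrase_query(index, "stanford", "university"))
```

A NEAR query follows the same pattern, with the test `p + 1 in following` relaxed to a bounded distance between positions.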
Sec. 6.1

Zone indexes
• A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
  • Title
  • Abstract
  • References …
• Build inverted indexes on fields AND zones to permit querying
  • E.g., “find docs with merchant in the title zone and matching the query gentle rain”
Sec. 6.1

Example zone indexes

Encode zones in dictionary vs. postings.
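The two encodings can be contrasted with a small sketch. In the first, the zone is part of the dictionary key (one postings list per term and zone); in the second, each posting carries the zones in which the term occurs. The toy documents, zone names and function names are assumptions made for illustration.

```python
from collections import defaultdict


def zones_in_dictionary(docs):
    """Keys such as 'merchant.title' map to plain docID lists."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for zone, text in docs[doc_id].items():
            for term in set(text.lower().split()):
                index[f"{term}.{zone}"].append(doc_id)
    return dict(index)


def zones_in_postings(docs):
    """Each term maps to (docID, [zones]) pairs."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for zone, text in docs[doc_id].items():
            for term in set(text.lower().split()):
                index[term].setdefault(doc_id, []).append(zone)
    return {term: sorted(posting.items()) for term, posting in index.items()}


docs = {
    1: {"title": "The Merchant of Venice", "body": "the gentle rain from heaven"},
    2: {"title": "Gentle Rain", "body": "a merchant sold umbrellas"},
}
by_dictionary = zones_in_dictionary(docs)
by_postings = zones_in_postings(docs)
print(by_dictionary["merchant.title"], by_postings["merchant"])
```

The dictionary variant multiplies the number of keys by the number of zones, while the postings variant keeps one entry per term and makes each posting slightly larger.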

Ranking search results
• Boolean queries give inclusion or exclusion of docs.
• But often results are too many and we need to rank results
• Classification, clustering, summarization, text mining, etc…
• A lot of AI and Machine Learning on several kinds of features extracted from page content and the Web for results selection and ranking
Web IR and its challenges
• Unusual and diverse
  • Documents
  • Users
  • Queries
  • Information needs
• Exploit ideas from social networks
  • link analysis, click-streams, knowledge graphs, ...
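Our topics, on an example
[Figure: architecture of a search engine. Components: Web, Crawler, Page archive, Page Analyzer, Indexer, text and auxiliary structure, Query resolver, Ranker, Query. Callout: "Which pages to visit next?" Course topics attached to the components: Hashing, Linear Algebra, Clustering, Classification, Sorting, Dictionaries, Data Compression.]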
I data center
[Procs OSDI 2006]

NoSQL
• HBase, in Java, Apache license, runs on Hadoop
• HyperTable, in C++, GNU license, runs on Hadoop or GlusterFS
• Cassandra, in Java, Apache license 2, modeled on Amazon’s Dynamo

“Smart” algorithms
2007

“This is rocket science but you don't have to be a rocket scientist to use it”
