
Lect 01-Introduction (1)

The document discusses the evolution of Information Retrieval (IR) from basic search engines to advanced systems that address user needs through various data sources. It outlines key concepts such as unstructured vs. structured data, Boolean queries, inverted indexes, and ranking mechanisms, emphasizing the complexity and breadth of modern IR. Additionally, it highlights the integration of machine learning and AI in enhancing search results and addresses challenges faced in web IR.

Nowadays IR is much more than building search engines!
IR

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa

Reading Chapter 1
The course
• Timetable
  • Monday 11-13 (L) and Tuesday 9-11 (L1)
• The web page: two parts (last year and current)
• Twitter: @FerraginaTeach
• Student meetings: Monday 14.30-16.30
• The exam
  • One written test with theory questions + exercises (two rounds, with small penalty)
  • Perhaps, a lab test on Lucene/Elasticsearch
Topics to cover or not?
• I/O-model. Multi-way mergesort. Sketch on MapReduce
• Hashing. Compacted trie, front coding → auto-completion
• Edit distance via Dynamic Programming (possibly weighted)
• Overlap measure with k-gram index.
• Posting list compression: gamma, variable bytes (t-nibble), PForDelta and Elias-Fano.
• Compressed storage of documents: LZ-based compression. Storage and Transmission of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync).
• Rank and Select data structures, Elias-Fano
• Succinct representation of binary trees and navigation.
• Random Walks. Link-based ranking: PageRank, topic-based PageRank, personalized PageRank, CoSim rank. HITS.
What is IR today?

Evolution of Search Engines
• Zero generation -- use metadata added by users (1991-..: Wanderer)
• First generation -- use only on-page, web-text data (1995-1997: AltaVista, Excite, Lycos, etc.)
  • Word frequency and language
• Second generation -- use off-page, web-graph data (1998: Google)
  • Link (or connectivity) analysis
  • Anchor-text (How people refer to a page)
• Third generation -- answer “the need behind the query” (Google, Yahoo, MSN, ASK, ...)
  • Focus on “user need”, rather than on query
  • Integrate multiple data-sources
  • Click-through data
Fourth and current generation → Information Supply
Searching «substrings»
Searching routes
Searching over geo+labels
Searching over labeled graphs

Cisco foresees 50 billion devices connected by 2020



Paradigm shift

We now have «devices 2.0» that have their own ID, communication capacity, computing and storage, and, currently, interaction ability.
Three main types of data…
• Opportunistic
  • Credit card transactions
  • Tel calls, bills, web clicks, …
• Purposely sensed
  • pollution, temperature, wind, …
  • movement, acceleration, …
  • Health sensing, …
• User generated
  • Photo, tweet, post, email, …
  • Query-log on search engines
A universe of possibilities

… limited only by our imagination!

The PhD+ course: how to build a start-up?
Basics

Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

IR vs. databases:
Unstructured vs Structured data
Structured data tends to refer to “tables”

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
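As a small illustration of a structured query, the sketch below evaluates the slide's condition over the three rows of the table. The row layout and the helper name `select` are assumptions made for this example only.

```python
# The slide's table, one dict per row (field names are an assumption).
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def select(rows, max_salary, manager):
    """Return the employees with Salary < max_salary AND Manager = manager."""
    return [
        row["employee"]
        for row in rows
        if row["salary"] < max_salary and row["manager"] == manager
    ]


# Salary < 60000 AND Manager = Smith
print(select(employees, 60000, "Smith"))
```

Both conditions are exact tests on typed fields, which is what distinguishes this from keyword search over free text.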
Semi-structured data: XML
• In fact almost no data is “unstructured”
  • E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates “semi-structured” search such as
  • Title contains data AND Bullets contain search
• Issues:
  • how do you process “about”?
  • how do you rank results?
Unstructured data
Typically refers to free text, and allows
• Keyword queries including operators
• More sophisticated “concept” queries e.g.,
  • find all web pages dealing with drug abuse

Classic model for searching text documents
Boolean queries: Exact match
• The Boolean retrieval model lets you ask a query that is a Boolean expression:
  • Boolean queries use AND, OR and NOT to join query terms
  • Views each document as a set of words
  • Is precise: a document either matches the condition or not.
• Perhaps the simplest model to build an IR system on
• Many search systems still use it:
  • Email, library catalog, Mac OS X Spotlight
Implementing the Boolean model

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                  1             0          0       0        1
Brutus              1                  1             0          1       0        0
Caesar              1                  1             0          1       1        1
Calpurnia           0                  1             0          0       0        0
Cleopatra           1                  0             0          0       0        0
mercy               1                  0             1          1       1        1
worser              1                  0             1          1       1        0

1 if play contains word, 0 otherwise

Matrix could be very big

Query: Brutus AND Caesar BUT NOT Calpurnia
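The query on this slide can be answered with bitwise operations on the rows of the incidence matrix. The following is a minimal sketch that copies the three relevant rows as 6-bit integers (leftmost bit = first play); the names `incidence`, `ALL` and `matching_plays` are made up for the illustration.

```python
plays = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# One row of the incidence matrix per term, leftmost bit = first play.
incidence = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

ALL = (1 << len(plays)) - 1  # mask with one bit per play


def matching_plays(bits):
    """Translate a bit vector back into the list of play titles."""
    n = len(plays)
    return [plays[i] for i in range(n) if (bits >> (n - 1 - i)) & 1]


# Brutus AND Caesar AND NOT Calpurnia
answer = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & ALL)
print(matching_plays(answer))
```

The bit vectors are as long as the collection, so for a large and sparse matrix this representation wastes most of its space on zeros.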
Inverted index
• For each term t, we must store a list of all documents that contain t.
  • Identify each by docID, a document serial number
• Can we use fixed-size arrays for this?
  • What about inserting a new docID ?

Brutus      1 2 4 11 31 45 173 174
Caesar      1 2 4 5 6 16 57 132
Calpurnia   2 31 54 101
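A minimal sketch of how such an index could be built in memory is shown below. It assumes whitespace tokenization and lowercasing, which are simplifications, and the function name `build_inverted_index` is invented for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it.

    docs: dict docID -> text. Tokenization is a plain lowercase split,
    which is an assumption of this sketch.
    """
    index = defaultdict(list)
    # Visiting docIDs in increasing order keeps every postings list sorted,
    # so a new docID is always appended at the end of a growable list.
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


docs = {
    1: "Brutus killed Caesar",
    2: "Caesar married Calpurnia",
    4: "Brutus and Caesar",
}
index = build_inverted_index(docs)
print(index["caesar"])
```

Variable-length lists sidestep the fixed-size array question raised above, at the cost of occasional reallocation when a list grows.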
AND query
Cleopatra 9 3 45 11 1 46 31 ….

Caesar 57 12 4 9 15 16 2 ….

If n,m are the lengths of the lists, how many comparisons ?
n*m ≈ 10^12 cmp ≈ 10^3 sec

This is not an «engineering problem».
You need efficient algorithms!
AND query
Cleopatra 9 3 45 11 1 46 31 ….
Caesar 57 12 4 9 15 16 2 ….

Sorted by docID:
Cleopatra 1 3 9 11 31 45 46 ….
Caesar 2 4 9 12 15 16 57 ….

How many comparisons ? n + m ≈ 10^6 cmp ≈ 1 msec
Which are the top-10 results ?
Intersecting two postings lists
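The scan-based intersection of two docID-sorted lists can be sketched as follows. This is the standard two-pointer merge, written here with the illustrative name `intersect`; it performs at most n + m comparisons.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by increasing docID."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return out


cleopatra = [1, 3, 9, 11, 31, 45, 46]
caesar = [2, 4, 9, 12, 15, 16, 57]
print(intersect(cleopatra, caesar))
```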

The Inverted index

Brutus 2 4 6 10 32

the 1 2 3 5 8 13 21 34

Calpurnia 13 16

Two advantages:
• Speed: query requires just a scan
• Space: store smaller integers (gap coding)

Compressed, they occupy 13% of the original text
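Gap coding stores the difference between consecutive docIDs instead of the docIDs themselves. A minimal sketch, with the helper names `to_gaps` and `from_gaps` chosen for this example:

```python
def to_gaps(postings):
    """Replace each docID by its distance from the previous one."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the docIDs by taking prefix sums of the gaps."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings


the = [1, 2, 3, 5, 8, 13, 21, 34]
print(to_gaps(the))
```

The gaps are never larger than the docIDs they replace, and for frequent terms they are much smaller, which is what makes variable-length integer codes pay off on postings lists.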


Query optimization

• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

Brutus      2 4 8 16 32 64 128
Caesar      1 2 3 5 8 16 21 34
Calpurnia   13 16

Query: Brutus AND Calpurnia AND Caesar
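One common heuristic, sketched below, is to process the terms in order of increasing postings-list length, so that every intermediate result is no longer than the shortest list. The names `intersect` and `and_query` are illustrative, and the sketch assumes the query has at least one term.

```python
def intersect(p1, p2):
    """Two-pointer intersection of docID-sorted lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, terms):
    """AND together the postings of all terms, shortest list first."""
    lists = sorted((index[t] for t in terms), key=len)
    result = lists[0]
    for postings in lists[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result


index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
print(and_query(index, ["Brutus", "Calpurnia", "Caesar"]))
```

Here the two-element Calpurnia list is processed first, so the second intersection works on at most two candidates.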


Query optimization
• Can we improve scanning-based intersection?
  • Skips (yet scan-based but with shortcuts)
Sec. 2.3
Augment postings with skip pointers (at indexing time)

2 4 8 41 48 64 128         (skip pointers to 41 and 128)
1 2 3 8 11 17 21 31        (skip pointers to 11 and 31)

• Where do we place them ?
• Which is the space/time trade-off ?
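The sketch below shows one way to use skips during intersection. Instead of storing explicit pointers at indexing time, it assumes implicit skips at every position that is a multiple of roughly the square root of the list length, a common rule of thumb and not necessarily the placement intended by the slide. The function name is made up.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect sorted lists, jumping ahead via evenly spaced skips."""
    s1 = max(1, math.isqrt(len(p1)))  # skip length for p1 (assumed sqrt rule)
    s2 = max(1, math.isqrt(len(p2)))  # skip length for p2
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if it exists and does not overshoot p2[j].
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return out


print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))
```

More skips mean shorter jumps but more chances to take one; fewer skips mean longer jumps that succeed less often, which is the trade-off the slide asks about.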
Query optimization
• Can we improve scanning-based intersection?
  • Skips (yet scan-based but with shortcuts)
  • Recursive merge (splitting by pivots)

Caesar      1 2 3 5 8 16 21 34
Calpurnia   13 16 34

Binary search
Which list do you bisect at every recursive step ?
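A possible sketch of the recursive merge follows. It makes one specific choice, which is an assumption of this example: the pivot is the median of the shorter list, and it is located in the longer list by binary search. List slicing is used for brevity, although a production version would pass index ranges to avoid copying.

```python
from bisect import bisect_left


def intersect_recursive(a, b):
    """Intersect two sorted lists of distinct docIDs by pivot splitting."""
    if not a or not b:
        return []
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter list
    mid = len(a) // 2
    pivot = a[mid]
    pos = bisect_left(b, pivot)  # binary search for the pivot in the longer list
    found = pos < len(b) and b[pos] == pivot
    left = intersect_recursive(a[:mid], b[:pos])
    right = intersect_recursive(a[mid + 1:], b[pos + 1:] if found else b[pos:])
    return left + ([pivot] if found else []) + right


caesar = [1, 2, 3, 5, 8, 16, 21, 34]
calpurnia = [13, 16, 34]
print(intersect_recursive(caesar, calpurnia))
```

When one list is much shorter than the other, this performs a few binary searches instead of scanning the whole longer list.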
Sec. 1.3

Boolean queries: More general merges
• Exercise: Adapt the merge for :
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar

Can we still run the merge in time O(n + m)?
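For the first variant of the exercise, one possible adaptation of the merge is sketched below (the name `and_not` is invented here). It keeps the docIDs of the first list that do not occur in the second, and each pointer only moves forward, so the cost stays O(n + m).

```python
def and_not(p1, p2):
    """Return the docIDs of p1 that do not appear in p2 (both sorted)."""
    j = 0
    out = []
    for doc_id in p1:
        # Move p2's pointer up to the first docID that is >= doc_id.
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            out.append(doc_id)
    return out


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(and_not(brutus, caesar))
```

The second variant is different in nature: its answer contains every document that lacks Caesar, so the output size depends on the size of the whole collection and not only on n and m.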
IR is much more…
• What about phrases?
  • “Stanford University”
• Proximity: Find Gates NEAR Microsoft.
  • Need index to capture term positions in docs.
• Zones in documents: Find documents with (author = Ullman) AND (text contains automata).
• Search for Maradona and find also “el pibe de oro”
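To make the need for term positions concrete, here is a minimal sketch of a two-word phrase query over a positional index. The index layout (term → docID → list of positions) and the function name `phrase_docs` are assumptions of this example, and the positions in the toy data are invented.

```python
def phrase_docs(pos_index, first, second):
    """Return docIDs in which `second` occurs immediately after `first`."""
    first_postings = pos_index.get(first, {})
    second_postings = pos_index.get(second, {})
    hits = []
    # Only documents containing both terms can contain the phrase.
    for doc_id in sorted(set(first_postings) & set(second_postings)):
        following = set(second_postings[doc_id])
        if any(pos + 1 in following for pos in first_postings[doc_id]):
            hits.append(doc_id)
    return hits


# Toy positional index: term -> {docID: [positions of the term in the doc]}
pos_index = {
    "stanford": {1: [0, 7], 2: [3]},
    "university": {1: [1], 2: [9], 3: [0]},
}
print(phrase_docs(pos_index, "stanford", "university"))
```

A proximity query such as Gates NEAR Microsoft could reuse the same structure by accepting any position difference up to a chosen window instead of exactly 1.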
Sec. 6.1

Zone indexes
• A zone is a region of the doc that can contain an arbitrary amount of text e.g.,
  • Title
  • Abstract
  • References …
• Build inverted indexes on fields AND zones to permit querying
  • E.g., “find docs with merchant in the title zone and matching the query gentle rain”
Sec. 6.1

Example zone indexes

Encode zones in dictionary vs. postings.
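The "zones in postings" option can be sketched as follows: each posting carries the set of zones in which the term occurs, whereas the "zones in dictionary" alternative would keep a separate entry such as a hypothetical `merchant.title` key per term and zone. Function names, zone names and the toy documents are made up for the illustration.

```python
from collections import defaultdict


def build_zone_index(docs):
    """Build term -> {docID: set of zones containing the term}.

    docs: dict docID -> {zone name: text}. Tokenization is a lowercase
    split, an assumption of this sketch.
    """
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for zone, text in docs[doc_id].items():
            for term in set(text.lower().split()):
                index[term].setdefault(doc_id, set()).add(zone)
    return index


def docs_with_term_in_zone(index, term, zone):
    """Return the sorted docIDs where `term` occurs inside `zone`."""
    return sorted(d for d, zones in index.get(term, {}).items() if zone in zones)


docs = {
    1: {"title": "The Merchant of Venice", "body": "gentle rain from heaven"},
    2: {"title": "Hamlet", "body": "a merchant appears in the gentle rain"},
}
zone_index = build_zone_index(docs)
print(docs_with_term_in_zone(zone_index, "merchant", "title"))
```

Keeping zones in the postings gives one dictionary entry per term, while keeping them in the dictionary multiplies the vocabulary by the number of zones.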


Ranking search results
• Boolean queries give inclusion or exclusion of docs.
• But often results are too many and we need to rank results
• Classification, clustering, summarization, text mining, etc…
• A lot of AI and Machine Learning on several kinds of features extracted from page content and the Web for results selection and ranking
Web IR and its challenges
• Unusual and diverse
  • Documents
  • Users
  • Queries
  • Information needs
• Exploit ideas from social networks
  • link analysis, click-streams, knowledge graphs,...
Our topics, on an example
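[Figure: architecture of a search engine, annotated with course topics. Components: Web, Crawler (which pages to visit next?), Page archive, Page Analyzer, Indexer (text and auxiliary structure), Query, Query resolver, Ranker. Topics: Hashing, Linear Algebra, Clustering, Classification, Sorting, Dictionaries, Data Compression.]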
I data center
[Procs OSDI 2006]

NoSQL

• HBase, in Java, Apache license, runs on Hadoop
• Hypertable, in C++, GNU license, runs on Hadoop or GlusterFS
• Cassandra, in Java, Apache license 2, runs on Amazon’s Dynamo


“Smart” algorithms
2007

“This is rocket science but you don't have to be a rocket scientist to use it”
