Lect 01-Introduction (1)
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Reading Chapter 1
The course
Timetable
Monday 11-13 (L) and Tuesday 9-11 (L1)
Twitter: @FerraginaTeach
Student meetings: Monday 14.30-16.30
The exam
One written test with theory questions + exercises
(two rounds, with small penalty)
Possibly a lab test on Lucene/Elasticsearch
Topics to cover or not?
I/O model. Multi-way mergesort. Sketch of MapReduce.
Hashing. Compacted trie, front coding, auto-completion.
Edit distance via Dynamic Programming (possibly weighted). Overlap measure with k-gram index.
Posting list compression: gamma, variable bytes (t-nibble), PForDelta and Elias-Fano.
Evolution of Search Engines
1991-.. Wanderer
Zero generation -- use metadata added by users
Paradigm shift
Purposely sensed
pollution, temperature, wind, …
movement, acceleration,…
Health sensing,…
User generated
Photo, tweet, post, email,…
Query-log on search engines
A universe of possibilities
Information Retrieval
Information Retrieval (IR) is finding
material (usually documents) of
unstructured nature (usually text) that
satisfies an information need from
within large collections (usually stored
on computers).
IR vs. databases:
Unstructured vs Structured data
Structured data tends to refer to “tables”
Columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth
Antony     1 1 0 0 0 1
Brutus     1 1 0 1 0 0
Caesar     1 1 0 1 1 1
Calpurnia  0 1 0 0 0 0
Cleopatra  1 0 0 0 0 0
mercy      1 0 1 1 1 1
worser     1 0 1 1 1 0
Matrix could be very big
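The 0/1 rows above can answer Boolean queries with bitwise operations. The following is a minimal sketch, assuming each row is stored as a bit string in the column order of the slide. The query itself (Brutus AND Caesar AND NOT Calpurnia) and all function names are illustrative choices, not taken from the slide.

```python
# Plays in column order; character i of a row refers to PLAYS[i].
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the matrix as 0/1 strings (values copied from the slide).
INCIDENCE = {
    "Antony":    "110001",
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
    "Cleopatra": "100000",
    "mercy":     "101111",
    "worser":    "101110",
}

# Mask with one bit set per play; needed to keep NOT inside the matrix width.
ALL = (1 << len(PLAYS)) - 1


def row(term):
    """Return the row of `term` as an integer bit mask."""
    return int(INCIDENCE[term], 2)


def plays_of(mask):
    """Decode a bit mask back into the list of play titles."""
    bits = format(mask, "0{}b".format(len(PLAYS)))
    return [play for play, bit in zip(PLAYS, bits) if bit == "1"]


def brutus_and_caesar_not_calpurnia():
    """Illustrative query: Brutus AND Caesar AND NOT Calpurnia."""
    mask = row("Brutus") & row("Caesar") & (ALL & ~row("Calpurnia"))
    return plays_of(mask)


print(brutus_and_caesar_not_calpurnia())
```

This works for a toy collection only: with many terms and documents the matrix is mostly zeros, which is why the following slides store just the docIDs of the 1 entries.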
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
AND query
Unsorted lists:
Cleopatra 9 3 45 11 1 46 31 ….
Cesare 57 12 4 9 15 16 2 ….
Sorted lists:
Cleopatra 1 3 9 11 31 45 46 ….
Cesare 2 4 9 12 15 16 57 ….
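Once both lists are sorted by docID, the AND query reduces to a linear merge. A minimal sketch, assuming the lists fit in memory as Python lists; only the docIDs visible on the slide are used (the elided tails are left out), and the function name is hypothetical.

```python
def intersect(a, b):
    """Intersect two sorted lists of docIDs with one left-to-right scan."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # docID present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # advance the list with the smaller docID
        else:
            j += 1
    return out


# Visible prefixes of the two lists from the slide, sorted before merging.
cleopatra = sorted([9, 3, 45, 11, 1, 46, 31])
cesare = sorted([57, 12, 4, 9, 15, 16, 2])

print(intersect(cleopatra, cesare))  # [9]
```

The scan touches each entry at most once, so the cost is proportional to the sum of the two list lengths.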
The Inverted index
Brutus 2 4 6 10 32
the 1 2 3 5 8 13 21 34
Calpurnia 13 16
Two advantages:
Speed: query requires just a scan
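How such an index can be built is sketched below, under simplifying assumptions: documents are short strings, tokens are obtained by lowercasing and splitting on whitespace, and everything fits in memory. The three toy documents and the function name are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                  # visit docIDs in increasing order
        terms = set(docs[doc_id].lower().split())
        for term in terms:                       # each term once per document
            index[term].append(doc_id)
    return dict(index)


# Hypothetical toy collection, used only to exercise the sketch.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar and Calpurnia",
    3: "Brutus and the senate",
}
index = build_index(docs)

print(index["brutus"])  # [1, 3]
print(index["caesar"])  # [1, 2]
```

Because docIDs are visited in increasing order, every list comes out already sorted, which is what the scan-based query processing relies on.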
Caesar 1 2 3 5 8 16 21 34
Calpurnia 13 16
2 4 8 41 48 64 128 (skip pointers to 41 and 128)
1 2 3 8 11 17 21 31 (skip pointers to 11 and 31)
Caesar 1 2 3 5 8 16 21 34
Calpurnia 13 16 34
Binary search
Which list do you bisect at every recursive step?
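One possible answer is sketched below: at each step take the median of the shorter list, binary search it in the longer list, and recurse on the two halves. This is an illustrative sketch, not necessarily the scheme intended on the slide; the function name is hypothetical, and list slicing is used for brevity even though it copies data.

```python
from bisect import bisect_left


def intersect_bisect(a, b):
    """Intersect sorted lists a and b by recursive median splitting."""
    if not a or not b:
        return []
    if len(a) > len(b):
        a, b = b, a                      # always bisect the shorter list
    mid = len(a) // 2
    pivot = a[mid]
    pos = bisect_left(b, pivot)          # binary search in the longer list
    found = pos < len(b) and b[pos] == pivot
    left = intersect_bisect(a[:mid], b[:pos])
    right = intersect_bisect(a[mid + 1:], b[pos + 1:] if found else b[pos:])
    return left + ([pivot] if found else []) + right


caesar = [1, 2, 3, 5, 8, 16, 21, 34]
calpurnia = [13, 16, 34]

print(intersect_bisect(caesar, calpurnia))  # [16, 34]
```

When one list is much shorter than the other, this performs few binary searches instead of scanning the long list in full.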
Sec. 1.3
Boolean queries:
More general merges
Exercise: Adapt the merge for:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
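A possible way to approach the exercise is sketched below, as one solution among several. AND NOT needs only the two sorted lists, whereas OR NOT also needs the full set of docIDs, passed here as an explicit `universe` list (an assumption of this sketch). Function names are hypothetical.

```python
def and_not(a, b):
    """Return the docIDs of sorted list a that do not occur in sorted list b."""
    i, j, out = 0, 0, []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])   # a[i] cannot appear later in b
            i += 1
        elif a[i] == b[j]:
            i += 1             # present in both lists: drop it
            j += 1
        else:
            j += 1             # b is behind: catch up
    return out


def or_not(a, b, universe):
    """a OR NOT b, where `universe` is the sorted list of all docIDs."""
    not_b = and_not(universe, b)
    return sorted(set(a) | set(not_b))   # union kept simple on purpose


brutus = [2, 4, 6, 10, 32]
caesar = [1, 2, 3, 5, 8, 16, 21, 34]

print(and_not(brutus, caesar))  # [4, 6, 10, 32]
```

Note that the result of OR NOT can be almost as large as the whole collection, so it is far more expensive than AND NOT.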
IR is much more…
What about phrases?
“Stanford University”
Proximity: Find Gates NEAR Microsoft.
Need index to capture term positions in
docs.
Zones in documents: Find documents with
(author = Ullman) AND (text contains
automata).
Search for Maradona and also find “el
pibe de oro”
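Phrase and proximity queries need term positions, as noted above. A minimal sketch of a positional index follows, assuming whitespace tokenization and positions counted in tokens. The toy documents, the function names and the window parameter `k` are hypothetical, and positions are compared pairwise for simplicity rather than merged.

```python
from collections import defaultdict


def build_positional_index(docs):
    """term -> {doc_id: [token positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase(index, t1, t2):
    """DocIDs where t2 occurs immediately after t1."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    return [d for d in sorted(p1.keys() & p2.keys())
            if any(x + 1 in p2[d] for x in p1[d])]


def near(index, t1, t2, k):
    """DocIDs where t1 and t2 occur within k tokens of each other."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    return [d for d in sorted(p1.keys() & p2.keys())
            if any(abs(x - y) <= k for x in p1[d] for y in p2[d])]


# Hypothetical toy documents.
docs = {
    1: "Stanford University is in California",
    2: "the university near Stanford",
    3: "Gates founded Microsoft",
}
index = build_positional_index(docs)

print(phrase(index, "stanford", "university"))  # [1]
print(near(index, "gates", "microsoft", 2))     # [3]
```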
Sec. 6.1
Zone indexes
A zone is a region of the doc that can
contain an arbitrary amount of text e.g.,
Title
Abstract
References …
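Zones can be supported by keying the index on (zone, term) pairs, so that a query such as (author = Ullman) AND (text contains automata) becomes an intersection of two lists. This is a minimal sketch of one possible encoding; the toy documents, the zone names and the function name are hypothetical.

```python
from collections import defaultdict


def build_zone_index(docs):
    """(zone, term) -> sorted docIDs; docs maps doc_id -> {zone: text}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                      # increasing docIDs
        for zone, text in docs[doc_id].items():
            for term in set(text.lower().split()):   # once per (zone, term, doc)
                index[(zone, term)].append(doc_id)
    return index


# Hypothetical toy documents with two zones each.
docs = {
    1: {"author": "Ullman", "text": "automata and formal languages"},
    2: {"author": "Smith", "text": "automata are mentioned here"},
    3: {"author": "Ullman", "text": "database systems"},
}
index = build_zone_index(docs)

# (author = Ullman) AND (text contains automata)
hits = sorted(set(index[("author", "ullman")]) & set(index[("text", "automata")]))
print(hits)  # [1]
```

An alternative encoding keeps one list per term and stores the zone next to each docID; which one is preferable depends on the size of the dictionary versus the lists.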
But often there are too many results and we need to rank them
Classification, clustering, summarization, text
mining, etc…
[Figure: search engine architecture]
Components: Web, Crawler (which pages to visit next?), Page Analyzer, Indexer, Query, Query resolver, Ranker
Data: text, auxiliary structure, data center [Procs OSDI 2006]
Topics: Hashing, Linear Algebra, Clustering, Classification, Sorting, Dictionaries, Data Compression, NoSQL