Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan
Lecture 2: The term vocabulary and postings lists

Recap of the previous lecture
- Basic inverted indexes:
  - Structure: Dictionary and Postings
  - Key step in construction: Sorting
- Boolean query processing
  - Intersection by linear time "merging" lists
  - Simple optimizations
- Overview of course topics
Plan for this lecture
- Elaborate basic indexing
- Preprocessing to form the term vocabulary
  - Documents
  - Tokenization
  - What terms do we put in the index?
- Postings
  - Faster merges: skip lists
  - Positional postings and phrase queries

Recall the basic indexing pipeline

  Documents to be indexed:  Friends, Romans, countrymen.
            |  Tokenizer
  Token stream:             Friends  Romans  Countrymen
            |  Linguistic modules
  Modified tokens:          friend  roman  countryman
            |  Indexer
  Inverted index:           friend     -> 2 -> 4
                            roman      -> 1 -> 2
                            countryman -> 13 -> 16
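The whole pipeline can be sketched in a few lines of Python. This is an illustrative toy: the regex tokenizer and the bare lower-casing "linguistic module" are stand-ins, not the components discussed in the rest of this lecture.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Crude tokenizer: split on non-letters (see the tokenization issues below).
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def normalize(token):
    # Stand-in "linguistic module": just case folding here.
    return token.lower()

def build_index(docs):
    # docs: docID -> text; returns term -> sorted list of docIDs.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

print(build_index({2: "Friends, Romans, countrymen.", 4: "friends"}))
# {'friends': [2, 4], 'romans': [2], 'countrymen': [2]}
```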
Parsing a document
- What format is it in? (pdf/word/excel/html?)
- What language is it in?
- What character set is in use?

Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …

Complications: Format/language
- Documents being indexed can include docs from many different languages
  - A single index may have to contain terms of several languages.
- Sometimes a document or its components can contain multiple languages/formats
  - French email with a German pdf attachment.
- What is a unit document?
  - A file?
  - An email? (Perhaps one of many in an mbox.)
  - An email with 5 attachments?
  - A group of files (PPT or LaTeX as HTML pages)
TOKENS AND TERMS

Tokenization (Sec. 2.2.1)
- Input: "Friends, Romans and Countrymen"
- Output: Tokens
  - Friends
  - Romans
  - Countrymen
- A token is an instance of a sequence of characters
- Each such token is now a candidate for an index entry, after further processing
  - Described below
- But what are valid tokens to emit?
Tokenization (Sec. 2.2.1)
- Issues in tokenization:
  - Finland's capital → Finland? Finlands? Finland's?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
    - state-of-the-art: break up hyphenated sequence.
    - co-education
    - lowercase, lower-case, lower case?
    - It can be effective to get the user to put in possible hyphens
  - San Francisco: one token or two?
    - How do you decide it is one token?

Numbers
- 3/20/91    Mar. 12, 1991    20/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
  - Often have embedded spaces
- Older IR systems may not index numbers
  - But often very useful: think about things like looking up error codes/stacktraces on the web
  - (One answer is using n-grams: Lecture 3)
- Will often index "meta-data" separately
  - Creation date, format, etc.
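The hyphen/apostrophe choices are easy to see in code. Here are two toy tokenizers in Python; both regexes are illustrative assumptions, not a recommended design.

```python
import re

def split_tokens(text):
    # Break on everything non-alphanumeric, including hyphens and apostrophes.
    return re.findall(r"[A-Za-z0-9]+", text)

def keep_hyphens(text):
    # Treat internal hyphens and apostrophes as part of the token.
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

print(split_tokens("Hewlett-Packard's state-of-the-art"))
# ['Hewlett', 'Packard', 's', 'state', 'of', 'the', 'art']
print(keep_hyphens("Hewlett-Packard's state-of-the-art"))
# ["Hewlett-Packard's", 'state-of-the-art']
```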
Tokenization: language issues
- French
  - L'ensemble → one token or two?
    - L ? L' ? Le ?
    - Want l'ensemble to match with un ensemble
      - Until at least 2003, it didn't on Google
        - Internationalization!
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - 'life insurance company employee'
  - German retrieval systems benefit greatly from a compound splitter module
    - Can give a 15% performance boost for German

Tokenization: language issues
- Chinese and Japanese have no spaces between words:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled
  - Dates/amounts in multiple formats:
    フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
    (Katakana, Hiragana, Kanji, Romaji)
- End-user can express query entirely in hiragana!
Tokenization: language issues (Sec. 2.2.1)
- Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex ligatures

    ←  →  ←  →  ← start
  'Algeria achieved its independence in 1962 after 132 years of French occupation.'

- With Unicode, the surface presentation is complex, but the stored form is straightforward

Stop words (Sec. 2.2.2)
- With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for top 30 words
- But the trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space for including stop words in a system is very small
  - Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words
  - You need them for:
    - Phrase queries: "King of Denmark"
    - Various song titles, etc.: "Let it be", "To be or not to be"
    - "Relational" queries: "flights to London"
Normalization to terms
- We need to "normalize" words in indexed text as well as query words into the same form
  - We want to match U.S.A. and USA
- Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
- We most commonly implicitly define equivalence classes of terms by, e.g.,
  - deleting periods to form a term
    - U.S.A., USA → USA
  - deleting hyphens to form a term
    - anti-discriminatory, antidiscriminatory → antidiscriminatory

Normalization: other languages
- Accents: e.g., French résumé vs. resume.
- Umlauts: e.g., German: Tuebingen vs. Tübingen
  - Should be equivalent
- Most important criterion:
  - How are your users likely to write their queries for these words?
- Even in languages that standardly have accents, users often may not type them
  - Often best to normalize to a de-accented term
    - Tuebingen, Tübingen, Tubingen → Tubingen
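A sketch of such equivalence classing in Python. Case folding, period/hyphen deletion, and de-accenting via Unicode NFKD decomposition are one plausible combination of transformations, not the definitive set:

```python
import unicodedata

def normalize(token):
    # Map a token to an (assumed) equivalence-class representative:
    # case-fold, delete periods and hyphens, then strip accents/umlauts
    # by dropping combining marks after NFKD decomposition.
    t = token.lower().replace(".", "").replace("-", "")
    decomposed = unicodedata.normalize("NFKD", t)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

for w in ["U.S.A.", "anti-discriminatory", "Tübingen", "Tuebingen"]:
    print(w, "->", normalize(w))
# U.S.A. -> usa
# anti-discriminatory -> antidiscriminatory
# Tübingen -> tubingen
# Tuebingen -> tuebingen   (note: ue vs. u is NOT unified by this sketch)
```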
Normalization: other languages
- Normalization of things like date forms
  - 7月30日 vs. 7/30
  - Japanese use of kana vs. Chinese characters
- Tokenization and normalization may depend on the language and so is intertwined with language detection
  - e.g., Morgen will ich in MIT … Is this German "mit"?
- Crucial: Need to "normalize" indexed text as well as query terms into the same form

Case folding
- Reduce all letters to lower case
  - exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
  - Often best to lower case everything, since users will use lowercase regardless of 'correct' capitalization…
- Google example:
  - Query C.A.T.
  - #1 result is for "cat" (well, Lolcats) not Caterpillar Inc.
Normalization to terms (Sec. 2.2.3)
- Do we handle synonyms and homonyms?
  - E.g., by hand-constructed equivalence classes
    - car = automobile     color = colour
  - We can rewrite to form equivalence-class terms
    - When the document contains automobile, index it under car-automobile (and vice-versa)
  - Or we can expand a query
    - When the query contains automobile, look under car as well
  - Potentially more powerful, but less efficient

Thesauri and soundex
- An alternative to equivalence classing is to do asymmetric expansion
  - An example of where this may be useful
    - Enter: window     Search: window, windows
    - Enter: windows    Search: Windows, windows, window
    - Enter: Windows    Search: Windows
- What about spelling mistakes?
  - One approach is soundex, which forms equivalence classes of words based on phonetic heuristics
- More in lectures 3 and 9
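For concreteness, a minimal version of classic soundex in Python (simplified: the rule that h/w between same-coded consonants are transparent is ignored here):

```python
def soundex(word):
    # Keep the first letter; code later consonants; drop vowels, h, w, y;
    # collapse adjacent equal codes; pad/truncate to 4 characters.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

print(soundex("Hermann"), soundex("Herman"))  # H655 H655 -- same class
```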
Lemmatization
- Reduce inflectional/variant forms to base form
- E.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
- the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing "proper" reduction to dictionary headword form

Stemming
- Reduce terms to their "roots" before indexing
- "Stemming" suggests crude affix chopping
  - language dependent
  - e.g., automate(s), automatic, automation all reduced to automat.

    for example compressed and          for exampl compress and
    compression are both accepted   →   compress ar both accept
    as equivalent to compress.          as equival to compress
Porter's algorithm
- Commonest algorithm for stemming English
  - Results suggest it's at least as good as other stemming options
- Conventions + 5 phases of reductions
  - phases applied sequentially
  - each phase consists of a set of commands
  - sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

Typical rules in Porter
- sses → ss
- ies → i
- ational → ate
- tional → tion
- Weight-of-word-sensitive rules:
  - (m>1) EMENT →
    - replacement → replac
    - cement → cement
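A toy sketch of applying one such compound command in Python. This is a tiny fragment, not the real five-phase Porter algorithm, and the measure m is approximated very crudely:

```python
import re

def measure(stem):
    # Porter's m counts vowel-consonant sequences in [C](VC)^m[V];
    # crude approximation: count vowel-run/consonant-run transitions.
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))

def apply_command(word, rules):
    # rules: (suffix, replacement, min_m); fire only if measure(stem) > min_m.
    # Per Porter's convention, the longest matching suffix decides, even if
    # its condition then fails (so "cement" is left alone).
    for suffix, repl, min_m in sorted(rules, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return stem + repl if measure(stem) > min_m else word
    return word

rules = [("sses", "ss", -1), ("ies", "i", -1),
         ("ational", "ate", -1), ("tional", "tion", -1), ("ement", "", 1)]
for w in ["caresses", "ponies", "relational", "replacement", "cement"]:
    print(w, "->", apply_command(w, rules))
# caresses -> caress   ponies -> poni   relational -> relate
# replacement -> replac   cement -> cement
```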
Other stemmers (Sec. 2.2.4)
- Other stemmers exist, e.g., Lovins stemmer
  http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  - Single-pass, longest suffix removal (about 250 rules)
- Full morphological analysis – at most modest benefits for retrieval
- Do stemming and other normalizations help?
  - English: very mixed results. Helps recall for some queries but harms precision on others
    - E.g., operative (dentistry) ⇒ oper
  - Definitely useful for Spanish, German, Finnish, …
    - 30% performance gains for Finnish!

Language-specificity (Sec. 2.2.4)
- Many of the above features embody transformations that are
  - Language-specific and
  - Often, application-specific
- These are "plug-in" addenda to the indexing process
- Both open source and commercial plug-ins are available for handling these
Dictionary entries – first cut
  ensemble.french
  時間.japanese

Recall basic merge
- Walk through the two postings simultaneously, in time linear in the total number of postings entries:

    Brutus:  2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128
    Caesar:  1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31
    Result:  2, 8

- If the list lengths are m and n, the merge takes O(m+n) operations.
- Can we do better? Yes (if index isn't changing too fast).
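For reference, the O(m+n) merge as a Python sketch (the skip-pointer improvement follows below):

```python
def intersect(p1, p2):
    # Linear O(m+n) merge of two sorted postings lists.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))
# [2, 8]
```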
Augment postings with skip pointers (at indexing time)

              41                  128
    Brutus:  2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128

             11                31
    Caesar:  1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31

- Why? To skip postings that will not figure in the search results.
- How?
- Where do we place skip pointers?
Query processing with skip pointers (Sec. 2.3)

              41                  128
    Brutus:  2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128

             11                31
    Caesar:  1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31

- Suppose we have stepped through the lists until we match 8 on each and advance. We then compare 41 (top) with 11 (bottom). 11 is smaller, but its skip successor is 31, which is still ≤ 41, so we can jump straight to 31 without examining 17 and 21.

Where do we place skips? (Sec. 2.3)
- Tradeoff:
  - More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
  - Fewer skips → fewer pointer comparisons, but then long skip spans ⇒ few successful skips.
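A sketch of the skipping intersection in Python, in the spirit of IIR Figure 2.10; the √L placement and the dict-of-skip-targets representation are implementation assumptions:

```python
import math

def add_skips(postings):
    # ~sqrt(L) evenly spaced skips; skips[i] is the index reachable from i.
    step = max(1, math.isqrt(len(postings)))
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # Take a skip only if its target doesn't overshoot p2[j].
            i = s1[i] if i in s1 and p1[s1[i]] <= p2[j] else i + 1
        else:
            j = s2[j] if j in s2 and p2[s2[j]] <= p1[i] else j + 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```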
Placing skips
- Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
- This ignores the distribution of query terms.
- Easy if the index is relatively static; harder if L keeps changing because of updates.
- This definitely used to help; with modern hardware it may not (Bahle et al. 2002), unless you're memory-based
  - The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!

PHRASE QUERIES AND POSITIONAL INDEXES
Phrase queries
- Want to be able to answer queries such as "stanford university" – as a phrase
- Thus the sentence "I went to university at Stanford" is not a match.
  - The concept of phrase queries has proven easily understood by users; one of the few "advanced search" ideas that works
  - Many more queries are implicit phrase queries
- For this, it no longer suffices to store only <term : docs> entries

A first attempt: Biword indexes
- Index every consecutive pair of terms in the text as a phrase
- For example the text "Friends, Romans, Countrymen" would generate the biwords
  - friends romans
  - romans countrymen
- Each of these biwords is now a dictionary term
- Two-word phrase query-processing is now immediate.
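A biword indexer is a few lines of Python (the crude normalization inside is a placeholder for the pipeline described earlier):

```python
from collections import defaultdict

def biword_index(docs):
    # Index every consecutive pair of (already normalized) terms as a phrase.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        terms = text.lower().replace(",", "").split()
        for first, second in zip(terms, terms[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index

idx = biword_index({1: "Friends, Romans, Countrymen"})
print(sorted(idx))  # ['friends romans', 'romans countrymen']
```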
Longer phrase queries (Sec. 2.4.1)
- Longer phrases are processed as we did with wild-cards:
- stanford university palo alto can be broken into the Boolean query on biwords:
    stanford university AND university palo AND palo alto
- Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.

  Can have false positives!

Extended biwords (Sec. 2.4.1)
- Parse the indexed text and perform part-of-speech-tagging (POST).
- Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
- Call any string of terms of the form NX*N an extended biword.
  - Each such extended biword is now made a term in the dictionary.
- Example: catcher in the rye
             N     X  X   N
- Query processing: parse it into N's and X's
  - Segment query into enhanced biwords
  - Look up in index: catcher rye
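A toy sketch of extracting extended biwords in Python; the two-entry tag lexicon stands in for a real part-of-speech tagger:

```python
# Toy POS "tagger": a hand-made lexicon standing in for real POS tagging.
TAGS = {"catcher": "N", "rye": "N", "in": "X", "the": "X"}

def extended_biwords(terms):
    # Emit a (noun, noun) pair for every maximal N X* N span.
    pairs, last_noun = [], None
    for term in terms:
        tag = TAGS.get(term)
        if tag == "N":
            if last_noun is not None:
                pairs.append((last_noun, term))
            last_noun = term
        elif tag != "X":
            last_noun = None  # anything else breaks the span
    return pairs

print(extended_biwords("catcher in the rye".split()))  # [('catcher', 'rye')]
```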
Issues for biword indexes
- False positives, as noted before
- Index blowup due to bigger dictionary
  - Infeasible for more than biwords, big even for them
- Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Solution 2: Positional indexes
- In the postings, store, for each term, the position(s) in which tokens of it appear:

    <term, number of docs containing term;
     doc1: position1, position2 … ;
     doc2: position1, position2 … ;
     etc.>
Positional index example

    <be: 993427;
     1: 7, 18, 33, 72, 86, 231;
     2: 3, 149;
     4: 17, 191, 291, 430, 434;
     5: 363, 367, …>

- Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
- For phrase queries, we use a merge algorithm recursively at the document level
- But we now need to deal with more than just equality

Processing a phrase query
- Extract inverted index entries for each distinct term: to, be, or, not.
- Merge their doc:position lists to enumerate all positions with "to be or not to be".
  - to:
      2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
  - be:
      1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
- Same general method for proximity searches
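A sketch of the document-then-position merge in Python; for clarity it checks set membership per candidate start rather than doing the linear position-list merge a real system would use:

```python
def phrase_query(index, terms):
    # index: term -> {docID: sorted list of positions}.
    # Return docID -> positions where the whole phrase starts.
    postings = [index[t] for t in terms]
    common = set(postings[0]).intersection(*(set(p) for p in postings[1:]))
    answer = {}
    for doc in sorted(common):
        position_sets = [set(p[doc]) for p in postings]
        starts = [pos for pos in postings[0][doc]
                  if all(pos + k in position_sets[k] for k in range(len(terms)))]
        if starts:
            answer[doc] = starts
    return answer

index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_query(index, ["to", "be"]))  # {4: [16, 190, 429, 433]}
```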
Proximity queries (Sec. 2.4.2)
- LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  - Again, here, /k means "within k words of".
- Clearly, positional indexes can be used for such queries; biword indexes cannot.
- Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
  - This is a little tricky to do correctly and efficiently
  - See Figure 2.12 of IIR
  - There's likely to be a problem on it!
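As a baseline only (not the efficient linear merge of IIR Figure 2.12, which the exercise asks for), a simple within-k check for one document in Python:

```python
import bisect

def within_k(positions1, positions2, k):
    # True if some occurrence in positions1 is within k words of one in
    # positions2 (both sorted). Binary search makes this O(m log n).
    for p in positions1:
        i = bisect.bisect_left(positions2, p - k)
        if i < len(positions2) and positions2[i] <= p + k:
            return True
    return False

print(within_k([5, 40], [1, 44], 3))  # False
print(within_k([5, 40], [1, 42], 3))  # True (40 and 42 are within 3)
```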
Positional index size (Sec. 2.4.2)
- You can compress position values/offsets: we'll talk about that in lecture 5
- Nevertheless, a positional index expands postings storage substantially
- Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.
Positional index size
- Need an entry for each occurrence, not just once per document
- Index size depends on average document size (Why?)
  - Average web page has <1000 terms
  - SEC filings, books, even some epic poems … easily 100,000 terms
- Consider a term with frequency 0.1%:

    Document size | Postings | Positional postings
    1,000         | 1        | 1
    100,000       | 1        | 100

  (At 0.1% frequency, a 100,000-term document has ~100 occurrences, hence ~100 position entries, but still only one docID posting.)

Rules of thumb
- A positional index is 2–4 times as large as a non-positional index
- Positional index size is 35–50% of the volume of the original text
- Caveat: all of this holds for "English-like" languages
Combination schemes
- These two approaches can be profitably combined
  - For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
    - Even more so for phrases like "The Who"
- Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  - A typical web query mixture was executed in ¼ of the time of using just a positional index
  - It required 26% more space than having a positional index alone

Resources for today's lecture
- IIR 2
- MG 3.6, 4.3; MIR 7.2
- Porter's stemmer: http://www.tartarus.org/~martin/PorterStemmer/
- Skip Lists theory: Pugh (1990)
  - Multilevel skip lists give same O(log n) efficiency as trees
- H.E. Williams, J. Zobel, and D. Bahle. 2004. "Fast Phrase Querying with Combined Indexes", ACM Transactions on Information Systems.
  http://www.seg.rmit.edu.au/research/research.php?author=4
- D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215–221.