06 VectorSpaceModel
Jaime Arguello
INLS 509: Information Retrieval
[email protected]
The Search Task
What is a Retrieval Model?
Basic Information Retrieval Process
[Figure: the basic IR process; labels: information need, document (doc), representation (one for each), comparison, retrieved objects, evaluation]
• The retrieval model is responsible for performing this comparison and retrieving objects that are likely to satisfy the user
Boolean Retrieval Models
• The user describes their information need using boolean
constraints (e.g., AND, OR, and AND NOT)
• Unranked Boolean: retrieves documents that satisfy the
constraints in no particular order
• Ranked Boolean: retrieves documents that satisfy the
constraints and orders them based on the number of ways
they satisfy the constraints
• Also known as ‘exact-match’ retrieval models
Boolean Retrieval Models
• Advantages:
Introduction to Best-Match Retrieval Models
Vector Space Model
What is a Vector Space?
• Formally, a vector space is defined by a set of linearly
independent basis vectors
• The basis vectors correspond to the dimensions or
directions of the vector space
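A minimal sketch of the idea that basis vectors define the dimensions of a space. It assumes the standard basis for a 3-dimensional space (the axis names X, Y, Z follow the slides; the function name `combine` and the example weights are hypothetical). Any vector in the space is a weighted sum of the basis vectors.

```python
# Standard basis for a 3-dimensional space (an illustrative choice of basis).
BASIS = {
    "X": (1.0, 0.0, 0.0),
    "Y": (0.0, 1.0, 0.0),
    "Z": (0.0, 0.0, 1.0),
}


def combine(weights):
    """Scale each basis vector by its weight and sum the results."""
    total = [0.0, 0.0, 0.0]
    for axis, weight in weights.items():
        for i, component in enumerate(BASIS[axis]):
            total[i] += weight * component
    return tuple(total)


# A vector with value 2 along X, 1 along Y, and 3 along Z.
point = combine({"X": 2.0, "Y": 1.0, "Z": 3.0})
print(point)
```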
[Figure: a 2-D vector space with axes X and Y, and a 3-D vector space with axes X, Y, and Z]
What is a Vector?
[Figure: a vector drawn in a 2-D space (axes X, Y) and in a 3-D space (axes X, Y, Z)]
What is a Vector?
[Figure: a vector in 2-D with components x and y, and a vector in 3-D with components x, y, and z]
What is a Vector Space?
[Figure: two 2-D vector spaces with axes X and Y]
Vector Space Representation
Vector Space Representation
with binary weights
[Figure: a 3-D space with axes man, dog, and bite, built up over several slides; each axis is marked at 1]
Vector Space Representation
[Figure: axes labeled man and dog]
Vector Space Similarity
$$\sum_{i=1}^{V} x_i \times y_i$$
The Inner Product
• Multiply corresponding components and then sum those products
$$\sum_{i=1}^{V} x_i \times y_i$$
term       x_i   y_i   x_i × y_i
a          1     1     1
aardvark   0     1     0
abacus     1     1     1
abba       1     0     0
able       0     1     0
::         0     0     0
zoom       0     0     0
inner product => 2
The Inner Product
• What does the inner product (with a binary representation) correspond to?
(same example: inner product => 2)
The Inner Product
• When using 0’s and 1’s, this is just the number of unique terms in common between the query and the document
• In the example, the shared terms are "a" and "abacus", so the inner product is 2
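A short sketch of the inner product on the example vectors, to check that with binary weights the score equals the number of shared unique terms. The vocabulary is truncated to the rows shown in the example (the "::" row is omitted); the function and variable names are illustrative.

```python
def inner_product(x, y):
    """Sum of the products of corresponding components."""
    return sum(xi * yi for xi, yi in zip(x, y))


# Rows from the example table (the elided "::" row is left out).
vocabulary = ["a", "aardvark", "abacus", "abba", "able", "zoom"]
x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 1, 0]

score = inner_product(x, y)

# With 0/1 weights, a product is 1 only when both vectors contain the term.
shared = [t for t, xi, yi in zip(vocabulary, x, y) if xi == 1 and yi == 1]

print(score, shared)
```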
The Inner Product
• What is more relevant to a query?
The Cosine Similarity
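• The numerator is the inner product
$$\frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$$
• In the denominator, $\sqrt{\sum_{i=1}^{V} x_i^2}$ is the length of vector x and $\sqrt{\sum_{i=1}^{V} y_i^2}$ is the length of vector y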
Vector Space Model
cosine similarity example (binary weights)
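$$\frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$$
$$\frac{(1 \times 1) + (0 \times 1) + (1 \times 0)}{\sqrt{1^2 + 0^2 + 1^2} \times \sqrt{1^2 + 1^2 + 0^2}} = 0.5$$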
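The worked example can be reproduced with a small sketch of cosine similarity. Returning 0.0 for an all-zero vector is a convention assumed here, not something the slides specify. The second call uses a hypothetical doubled copy of x to show that vector length does not change the score.

```python
import math


def cosine_similarity(x, y):
    """Inner product divided by the product of the two vector lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    length_x = math.sqrt(sum(xi * xi for xi in x))
    length_y = math.sqrt(sum(yi * yi for yi in y))
    if length_x == 0 or length_y == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (length_x * length_y)


# Vectors from the binary-weight example.
similarity = cosine_similarity([1, 0, 1], [1, 1, 0])

# Doubling every component of x leaves the cosine unchanged.
scaled = cosine_similarity([2, 0, 2], [1, 1, 0])

print(round(similarity, 3), round(scaled, 3))
```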
In Class Exercise
$$\frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$$
Vector Space Representation
a aardvark abacus abba able ... zoom
doc_1 1 0 0 0 0 ... 1
doc_2 0 0 0 0 1 ... 1
:: :: :: :: :: :: ... 0
doc_m 0 0 1 1 0 ... 0
query 0 1 0 0 1 ... 1
• 0’s and 1’s indicate whether the term occurs (at least
once) in the document/query
• Let’s explore a more sophisticated representation
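Before moving on, the binary representation in the table above can be sketched as follows. This assumes lowercasing and whitespace tokenization, which the slides do not specify. The three-term vocabulary echoes the man/dog/bite axes used earlier; the example texts are hypothetical.

```python
def binary_vector(text, vocabulary):
    """1 if the term occurs at least once in the text, otherwise 0."""
    terms = set(text.lower().split())
    return [1 if term in terms else 0 for term in vocabulary]


vocabulary = ["bite", "dog", "man"]

# Hypothetical texts, one vector per row of the table.
doc_1 = binary_vector("man bite dog", vocabulary)
doc_2 = binary_vector("dog bite dog", vocabulary)  # repeats still count once
query = binary_vector("man", vocabulary)

print(doc_1, doc_2, query)
```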
Term-Weighting
what are the most important terms?
• Plot:
Rocky Balboa is a struggling boxer trying to make the big time. Working in a meat factory in Philadelphia for
a pittance, he also earns extra cash as a debt collector. When heavyweight champion Apollo Creed visits
Philadelphia, his managers want to set up an exhibition match between Creed and a struggling boxer,
touting the fight as a chance for a "nobody" to become a "somebody". The match is supposed to be easily
won by Creed, but someone forgot to tell Rocky, who sees this as his only shot at the big time. Rocky Balboa
is a small-time boxer who lives in an apartment in Philadelphia, Pennsylvania, and his career has so far not
gotten off the canvas. Rocky earns a living by collecting debts for a loan shark named Gazzo, but Gazzo
doesn't think Rocky has the viciousness it takes to beat up deadbeats. Rocky still boxes every once in a
while to keep his boxing skills sharp, and his ex-trainer, Mickey, believes he could've made it to the top if he
was willing to work for it. Rocky, goes to a pet store that sells pet supplies, and this is where he meets a
young woman named Adrian, who is extremely shy, with no ability to talk to men. Rocky befriends her.
Adrain later surprised Rocky with a dog from the pet shop that Rocky had befriended. Adrian's brother
Paulie, who works for a meat packing company, is thrilled that someone has become interested in Adrian,
and Adrian spends Thanksgiving with Rocky. Later, they go to Rocky's apartment, where Adrian explains that
she has never been in a man's apartment before. Rocky sets her mind at ease, and they become lovers.
Current world heavyweight boxing champion Apollo Creed comes up with the idea of giving an unknown a
shot at the title. Apollo checks out the Philadelphia boxing scene, and chooses Rocky. Fight promoter
Jergens gets things in gear, and Rocky starts training with Mickey. After a lot of training, Rocky is ready for
the match, and he wants to prove that he can go the distance with Apollo. The 'Italian Stallion', Rocky
Balboa, is an aspiring boxer in downtown Philadelphia. His one chance to make a better life for himself is
through his boxing and Adrian, a girl who works in the local pet store. Through a publicity stunt, Rocky is
set up to fight Apollo Creed, the current heavyweight champion who is already set to win. But Rocky really
needs to triumph, against all the odds...
Term-Frequency
how important is a term?
rank  term          freq.    rank  term          freq.
1     a             22       16    creed         5
2     rocky         19       17    philadelphia  5
3     to            18       18    has           4
4     the           17       19    pet           4
5     is            11       20    boxing        4
6     and           10       21    up            4
7     in            10       22    an            4
8     for           7        23    boxer         4
9     his           7        24    s             3
10    he            6        25    balboa        3
11    adrian        6        26    it            3
12    with          6        27    heavyweight   3
13    who           6        28    champion      3
14    that          5        29    fight         3
15    apollo        5        30    become        3
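A minimal sketch of counting term frequencies. The tokenizer is an assumption: the tables list fragments such as "doesn", "s", and "ve", which is consistent with lowercasing and splitting on non-alphanumeric characters, but the slides do not state the actual tokenizer. The snippet is one sentence taken from the plot summary.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercased alphanumeric token occurs."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


snippet = (
    "Rocky earns a living by collecting debts for a loan shark named Gazzo, "
    "but Gazzo doesn't think Rocky has the viciousness it takes to beat up "
    "deadbeats."
)

tf = term_frequencies(snippet)
print(tf["rocky"], tf["gazzo"], tf["a"], tf["doesn"])
```

Note that a function word such as "a" scores as high as "rocky" here, which is the weakness of raw term frequency that the frequency table illustrates.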
Inverse Document Frequency (IDF)
how important is a term?
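$$idf_t = \log\left(\frac{N}{df_t}\right)$$
where N is the number of documents in the collection and df_t is the number of documents that contain term t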
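A small sketch of the IDF formula. The natural logarithm is an assumption (the slides do not state the base), and the four-document collection is hypothetical. Each document is represented as a set of terms, since IDF only needs to know whether a term occurs.

```python
import math


def inverse_document_frequency(term, documents):
    """log(N / df_t), where df_t counts the documents containing the term."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n / df)  # natural log: an assumed choice of base


# Hypothetical collection; each document is a set of terms.
documents = [
    {"rocky", "is", "the", "boxer"},
    {"the", "dog", "bites"},
    {"the", "man"},
    {"the", "boxer", "wins"},
]

idf_the = inverse_document_frequency("the", documents)
idf_boxer = inverse_document_frequency("boxer", documents)
idf_rocky = inverse_document_frequency("rocky", documents)
print(idf_the, idf_boxer, idf_rocky)
```

A term that occurs in every document gets an IDF of zero, and rarer terms get larger values.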
Inverse Document Frequency (IDF)
how important is a term?
rank  term          idf      rank  term          idf
1     doesn         11.66    16    creed         6.84
2     adrain        10.96    17    paulie        6.82
3     viciousness   9.95     18    packing       6.81
4     deadbeats     9.86     19    boxes         6.75
5     touting       9.64     20    forgot        6.72
6     jergens       9.35     21    ease          6.53
7     gazzo         9.21     22    thanksgiving  6.52
8     pittance      9.05     23    earns         6.51
9     balboa        8.61     24    pennsylvania  6.50
10    heavyweight   7.18     25    promoter      6.43
11    stallion      7.17     26    befriended    6.38
12    canvas        7.10     27    exhibition    6.31
13    ve            6.96     28    collecting    6.23
14    managers      6.88     29    philadelphia  6.19
15    apollo        6.84     30    gear          6.18
TF.IDF
how important is a term?
$$tf_t \times idf_t$$
TF.IDF
$$tf_t \times \log\left(\frac{N}{df_t}\right)$$
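A minimal sketch of TF.IDF weighting that combines the two factors. The collection statistics (`num_documents`, `document_frequency`) and the token list are hypothetical, and the natural logarithm is an assumed choice of base.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, document_frequency, num_documents):
    """Map each term to tf_t * log(N / df_t); terms with unknown df are skipped."""
    tf = Counter(tokens)
    return {
        term: freq * math.log(num_documents / document_frequency[term])
        for term, freq in tf.items()
        if document_frequency.get(term, 0) > 0
    }


# Hypothetical collection statistics.
num_documents = 1000
document_frequency = {"rocky": 10, "boxer": 50, "the": 1000}

tokens = ["rocky", "the", "rocky", "boxer", "the", "the"]
weights = tf_idf_vector(tokens, document_frequency, num_documents)
print(weights)
```

In this sketch "the" is the most frequent token but occurs in every document, so its weight is zero, while the rarer "rocky" gets the largest weight.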
TF.IDF
how important is a term?
rank  term          tf.idf   rank  term          tf.idf
1     rocky         96.72    16    meat          11.76
2     apollo        34.20    17    doesn         11.66
3     creed         34.18    18    adrain        10.96
4     philadelphia  30.95    19    fight         10.02
5     adrian        26.44    20    viciousness   9.95
6     balboa        25.83    21    deadbeats     9.86
7     boxing        22.37    22    touting       9.64
8     boxer         22.19    23    current       9.57
9     heavyweight   21.54    24    jergens       9.35
10    pet           21.17    25    s             9.29
11    gazzo         18.43    26    struggling    9.21
12    champion      15.08    27    training      9.17
13    match         13.96    28    pittance      9.05
14    earns         13.01    29    become        8.96
15    apartment     11.82    30    mickey        8.96
TF, IDF, or TF.IDF?
[Figures: three displays of terms from the plot summary; the term text was not recoverable from extraction]
Queries as TF.IDF Vectors
• TF usually equals 1
$$idf_t = \log\left(\frac{N}{df_t}\right)$$
Queries as TF.IDF Vectors
examples from AOL queries with clicks on IMDB results
term 1 tf.idf term 2 tf.idf term 3 tf.idf
central 4.89 casting 6.05 ny 5.99
wizard 6.04 of 0.18 oz 6.14
sam 2.80 jones 3.15 iii 2.26
film 2.31 technical 6.34 advisors 8.74
edie 7.41 sands 5.88 singer 3.88
high 3.09 fidelity 7.66 quotes 8.11
quotes 8.11 about 1.61 brides 6.71
title 4.71 wave 5.68 pics 10.96
saw 4.87 3 2.43 trailers 7.83
the 0.03 rainmaker 9.09 movie 0.00
nancy 5.50 and 0.09 sluggo 9.46
audrey 6.30 rose 4.52 movie 0.00
mark 2.43 sway 7.53 photo 5.14
piece 4.59 of 0.18 cheese 6.38
date 3.93 movie 0.00 cast 0.00
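A small sketch of building a query vector. Because each query term usually occurs once (TF = 1), the tf.idf weight of a term is simply its idf. The three values below are copied from the "wizard of oz" row of the table, on the assumption that each term occurs once there; the function name and whitespace tokenization are illustrative.

```python
from collections import Counter

# Values from the "wizard of oz" row, treated as idf because tf = 1.
idf = {"wizard": 6.04, "of": 0.18, "oz": 6.14}


def query_vector(query, idf):
    """Weight each query term by tf * idf; terms without an idf are skipped."""
    tf = Counter(query.lower().split())
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


weights = query_vector("wizard of oz", idf)
print(weights)
```

The function word "of" contributes almost nothing to the vector, so matching on it barely affects the similarity score.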
Vector Space Model
cosine similarity example (tf.idf weights)
$$\frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$$
Vector Space Model
cosine similarity example (tf.idf weights)
[Figure: doc_1, query, and doc_2 drawn as vectors in a space with axes dog and bite]
TF.IDF
Independence Assumption
• The basis vectors (X, Y, Z) are linearly independent
because knowing a vector’s value on one dimension
doesn’t say anything about its value along another
dimension
[Figure: basis vectors for 3-dimensional space, with Y = man and Z = dog]
Mutual Information
IMDB Corpus
• If the independence assumption were true, what would these mutual information
values be?
w1 w2 MI w1 w2 MI
francisco san ? dollars million ?
angeles los ? brooke rick ?
prime minister ? teach lesson ?
united states ? canada canadian ?
9 11 ? un ma ?
winning award ? nicole roman ?
brooke taylor ? china chinese ?
con un ? japan japanese ?
un la ? belle roman ?
belle nicole ? border mexican ?
Mutual Information
IMDB Corpus
w1 w2 MI w1 w2 MI
francisco san 6.619 dollars million 5.437
angeles los 6.282 brooke rick 5.405
prime minister 5.976 teach lesson 5.370
united states 5.765 canada canadian 5.338
9 11 5.639 un ma 5.334
winning award 5.597 nicole roman 5.255
brooke taylor 5.518 china chinese 5.231
con un 5.514 japan japanese 5.204
un la 5.512 belle roman 5.202
belle nicole 5.508 border mexican 5.186
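The slides do not give the formula behind these values. As an assumption, the sketch below uses pointwise mutual information, log(P(w1, w2) / (P(w1) × P(w2))), estimated from document counts, with the natural logarithm. All counts are hypothetical. It shows what the score would be if two terms really were independent.

```python
import math


def pointwise_mutual_information(n_both, n_w1, n_w2, n_docs):
    """log of the observed co-occurrence rate over the rate expected
    if the two terms occurred independently of each other."""
    p_both = n_both / n_docs
    p_w1 = n_w1 / n_docs
    p_w2 = n_w2 / n_docs
    return math.log(p_both / (p_w1 * p_w2))


# Hypothetical counts: each term occurs in 100 of 1000 documents.
# Independence predicts co-occurrence in 0.1 * 0.1 * 1000 = 10 documents.
independent = pointwise_mutual_information(10, 100, 100, 1000)

# The terms co-occur far more often than independence predicts.
associated = pointwise_mutual_information(90, 100, 100, 1000)

print(round(independent, 6), round(associated, 3))
```

Under independence the score is approximately zero, so the large positive values in the table indicate that these term pairs are far from independent.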
Independence Assumption
Vector Space Model
• Any text can be represented as a vector in the vector space:
‣ a document
‣ a query
‣ a sentence
‣ a word
‣ an entire encyclopedia
• Rank documents based on their cosine similarity to the query
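A minimal sketch of ranking by cosine similarity, using sparse vectors stored as dictionaries from term to weight. The document names, terms, and weights are hypothetical, and returning 0.0 for an empty vector is an assumed convention.

```python
import math


def cosine(x, y):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(weight * y.get(term, 0.0) for term, weight in x.items())
    length_x = math.sqrt(sum(w * w for w in x.values()))
    length_y = math.sqrt(sum(w * w for w in y.values()))
    if length_x == 0 or length_y == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (length_x * length_y)


def rank(query, documents):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine(query, vector), name) for name, vector in documents.items()]
    return sorted(scored, reverse=True)


# Hypothetical weighted document vectors.
documents = {
    "doc_1": {"dog": 3.0, "bite": 1.0},
    "doc_2": {"dog": 1.0, "bite": 3.0},
    "doc_3": {"man": 2.0},
}
query = {"dog": 1.0}

ranking = [name for _, name in rank(query, documents)]
print(ranking)
```

Here doc_1 points mostly along the dog axis, so it ranks above doc_2, and doc_3 shares no terms with the query and scores zero.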
Vector Space Representation
• A power tool!
Vector Space Representation
• Find documents that are similar to this query
Vector Space Representation
• Find ads that are similar to these results
Vector Space Representation
• Find ads that are similar to this document
Vector Space Representation
• Find queries that are similar to this query
Vector Space Representation
• Topic categorization: automatically assigning a
document to a category
Vector Space Representation
• Find documents (with a known category assignment)
that are similar to this document
Vector Space Representation
• Find documents (with a known category assignment)
that are similar to this document
[Figure: documents with known category assignments; categories: computers, sports, politics]
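One way to turn "find similar labeled documents" into a categorizer is a nearest-neighbor vote. This is a sketch of that general approach; the slides do not say how the similar documents are combined. The labeled vectors, weights, and the choice k = 3 are hypothetical.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(weight * y.get(term, 0.0) for term, weight in x.items())
    length_x = math.sqrt(sum(w * w for w in x.values()))
    length_y = math.sqrt(sum(w * w for w in y.values()))
    if length_x == 0 or length_y == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (length_x * length_y)


def categorize(document, labeled, k=3):
    """Assign the majority category among the k most similar labeled documents."""
    neighbors = sorted(
        labeled, key=lambda item: cosine(document, item[0]), reverse=True
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical documents with known category assignments.
labeled = [
    ({"cpu": 2.0, "software": 1.0}, "computers"),
    ({"software": 2.0, "keyboard": 1.0}, "computers"),
    ({"game": 2.0, "team": 1.0}, "sports"),
    ({"team": 1.0, "score": 2.0}, "sports"),
    ({"election": 2.0, "vote": 1.0}, "politics"),
]

new_document = {"software": 1.0, "cpu": 1.0}
category = categorize(new_document, labeled, k=3)
print(category)
```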
Summary
• Any text can be represented as a vector in the vector space:
‣ a document
‣ a query
‣ a sentence
‣ a word
‣ an entire encyclopedia
• Rank documents based on their cosine similarity to the query