Lecture 8 - Semantic Similarity Vector Semantic - Sem
Word Meaning and Similarity
Word Senses and Word Relations
Dan Jurafsky
Wordform    Lemma
banks       bank
sung        sing
duermes     dormir
Homonymy
Homonyms: words that share a form but have unrelated, distinct meanings:
• bank1: financial institution, bank2: sloping land
• bat1: club for hitting a ball, bat2: nocturnal flying mammal
1. Homographs (bank/bank, bat/bat)
2. Homophones:
   1. write and right
   2. piece and peace
Polysemy
• 1. The bank was constructed in 1875 out of local red brick.
• 2. I withdrew the money from the bank.
• Are those the same sense?
  • Sense 1: “The building belonging to a financial institution”
  • Sense 2: “A financial institution”
• A polysemous word has related meanings
  • Most non-rare words have multiple meanings
Synonyms
• Words that have the same meaning in some or all contexts.
  • filbert / hazelnut
  • couch / sofa
  • big / large
  • automobile / car
  • vomit / throw up
  • water / H2O
• Two lexemes are synonyms if they can be substituted for each other in all situations
  • If so, they have the same propositional meaning
Synonyms
• But there are few (or no) examples of perfect synonymy.
  • Even if many aspects of meaning are identical, substituting one word for the other may still not preserve acceptability, because of differences in politeness, slang, register, genre, etc.
• Examples:
  • water / H2O
  • big / large
  • brave / courageous
Antonyms
• Senses that are opposites with respect to one feature of meaning
• Otherwise, they are very similar!
  dark/light, short/long, fast/slow, rise/fall, hot/cold, up/down, in/out
• More formally, antonyms can:
  • Define a binary opposition or be at opposite ends of a scale
    • long/short, fast/slow
  • Be reversives:
    • rise/fall, up/down
Word Meaning and Similarity
WordNet and other Online Thesauri
• Information Extraction
• Information Retrieval
• Question Answering
• Bioinformatics and Medical Informatics
• Machine Translation
WordNet 3.0
• A hierarchically organized lexical database
• Online thesaurus + aspects of a dictionary
  • Some other languages available or under development
    • (Arabic, Finnish, German, Portuguese…)
• Hemoglobins
  Synset Entry Terms: Eryhem, Ferrous Hemoglobin, Hemoglobin
  Definition: The oxygen-carrying proteins of ERYTHROCYTES. They are found in all vertebrates and some invertebrates. The number of globin subunits in the hemoglobin quaternary structure differs between species. Structures range from monomeric to a variety of multimeric arrangements.
Word Similarity
• Synonymy: a binary relation
  • Two words are either synonymous or not
• Similarity (or distance): a looser metric
  • Two words are more similar if they share more features of meaning
• Similarity is properly a relation between senses
  • The word “bank” is not similar to the word “slope”
  • bank1 is similar to fund3
  • bank2 is similar to slope5
• But we’ll compute similarity over both words and senses
• Path-based similarity:
$$\mathrm{sim}_{\mathrm{path}}(c_1, c_2) = \frac{1}{\mathrm{pathlen}(c_1, c_2)}$$
simpath(nickel, coin) = 1/2 = 0.5
simpath(fund, budget) = 1/2 = 0.5
simpath(nickel, currency) = 1/4 = 0.25
simpath(nickel, money) = 1/6 = 0.17
simpath(coinage, Richter scale) = 1/6 = 0.17
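The path-based scores can be reproduced with a short sketch. Two things here are assumptions rather than content from the slide text: the toy is-a hierarchy in `PARENT` (chosen so that it yields the five values listed) and the convention that pathlen is 1 plus the number of edges on the shortest path between the two nodes.

```python
# Toy is-a hierarchy (child -> parent). This is an assumption chosen to be
# consistent with the example scores; it is not taken from WordNet itself.
PARENT = {
    "nickel": "coin",
    "dime": "coin",
    "coin": "coinage",
    "coinage": "currency",
    "currency": "medium of exchange",
    "money": "medium of exchange",
    "fund": "money",
    "budget": "fund",
    "medium of exchange": "standard",
    "scale": "standard",
    "Richter scale": "scale",
}


def ancestors(concept):
    """Return [concept, parent, grandparent, ...] up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def pathlen(c1, c2):
    """Assumed convention: 1 + number of edges on the shortest path."""
    up1 = ancestors(c1)
    depth_in_up2 = {node: i for i, node in enumerate(ancestors(c2))}
    for edges_up, node in enumerate(up1):
        if node in depth_in_up2:  # first shared ancestor on the way up
            return 1 + edges_up + depth_in_up2[node]
    raise ValueError("concepts do not share an ancestor")


def sim_path(c1, c2):
    return 1.0 / pathlen(c1, c2)


for pair in [("nickel", "coin"), ("fund", "budget"), ("nickel", "currency"),
             ("nickel", "money"), ("coinage", "Richter scale")]:
    print(pair, round(sim_path(*pair), 2))
```

Because the hierarchy is a tree, walking up from both concepts to their first shared ancestor gives the shortest path; a graph with multiple parents per node would need a breadth-first search instead.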
$$P(c) = \frac{\sum_{w \in \mathrm{words}(c)} \mathrm{count}(w)}{N}$$
• Information content:
  IC(c) = -log P(c)
• Most informative subsumer (lowest common subsumer):
  LCS(c1,c2) = the most informative (lowest) node in the hierarchy subsuming both c1 and c2
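A minimal sketch of how P(c), IC(c), and the LCS fit together on a small hierarchy. The hierarchy, the counts, and the reading of words(c) as "every word at or below node c" are all assumptions made for illustration; none of the numbers come from a real corpus.

```python
import math

# Hypothetical is-a hierarchy (child -> parent) and hypothetical word counts.
PARENT = {
    "hill": "natural elevation",
    "natural elevation": "geological-formation",
    "coast": "shore",
    "shore": "geological-formation",
    "geological-formation": "entity",
}
WORD_COUNT = {
    "hill": 5, "coast": 6, "shore": 4, "natural elevation": 1,
    "geological-formation": 2, "entity": 82,
}
N = sum(WORD_COUNT.values())  # total number of counted tokens


def subsumes(ancestor, concept):
    """True if `ancestor` is `concept` itself or lies above it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def prob(c):
    """P(c): share of tokens whose word is subsumed by c (assumed words(c))."""
    return sum(n for w, n in WORD_COUNT.items() if subsumes(c, w)) / N


def ic(c):
    """Information content IC(c) = -log P(c)."""
    return -math.log(prob(c))


def lcs(c1, c2):
    """Lowest node that subsumes both c1 and c2, or None if there is none."""
    above_c1 = set()
    node = c1
    while node is not None:
        above_c1.add(node)
        node = PARENT.get(node)
    node = c2
    while node is not None and node not in above_c1:
        node = PARENT.get(node)
    return node


print(lcs("hill", "coast"), prob("geological-formation"))
```

In this setup the root has probability 1 and therefore zero information content, and IC grows as one moves down toward more specific concepts.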
Using information content for similarity: the Resnik method
Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
• Intuition: similarity between A and B is not just what they have in common
• The more differences between A and B, the less similar they are:
  • Commonality: the more A and B have in common, the more similar they are
  • Difference: the more differences between A and B, the less similar they are
• Commonality: IC(common(A,B))
• Difference: IC(description(A,B)) - IC(common(A,B))
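$$\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2) = \frac{2 \log P(\mathrm{LCS}(c_1, c_2))}{\log P(c_1) + \log P(c_2)}$$
$$\mathrm{sim}_{\mathrm{Lin}}(\text{hill}, \text{coast}) = \frac{2 \log P(\text{geological-formation})}{\log P(\text{hill}) + \log P(\text{coast})} = \frac{2 \ln 0.00176}{\ln 0.0000189 + \ln 0.0000216} = 0.59$$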
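The arithmetic of the hill/coast example can be checked directly. The sketch below takes the three probabilities as given inputs (they are the values quoted in the example); the function name `sim_lin` is only illustrative.

```python
import math


def sim_lin(p_lcs, p_c1, p_c2):
    """Lin similarity from the probabilities of the LCS and of the two concepts."""
    return 2 * math.log(p_lcs) / (math.log(p_c1) + math.log(p_c2))


# P(geological-formation), P(hill), P(coast) as quoted in the example.
value = sim_lin(0.00176, 0.0000189, 0.0000216)
print(round(value, 2))  # 0.59
```

The base of the logarithm does not matter here, since it cancels between numerator and denominator.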
• WordNet::Similarity
  • http://wn-similarity.sourceforge.net/
• Web-based interface:
  • http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi
Evaluating similarity
• Intrinsic evaluation:
  • Correlation between algorithm and human word similarity ratings
• Extrinsic (task-based, end-to-end) evaluation:
  • Malapropism (spelling error) detection
  • WSD
  • Essay grading
  • Taking TOEFL multiple-choice vocabulary tests
    Levied is closest in meaning to:
    imposed, believed, requested, correlated
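Both evaluation styles can be sketched in a few lines. The ratings and similarity scores below are hypothetical numbers invented for illustration, and Spearman rank correlation is one common choice of correlation measure, not necessarily the one used in the lecture.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; this sketch assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def answer_toefl(target, candidates, sim):
    """Pick the candidate that the similarity function scores highest."""
    return max(candidates, key=lambda c: sim(target, c))


# Intrinsic: hypothetical human ratings vs. hypothetical system scores
# for the same four word pairs.
human = [3.9, 3.5, 1.2, 0.4]
system = [0.8, 0.9, 0.3, 0.1]
print(spearman(human, system))

# Extrinsic: hypothetical similarity scores for the TOEFL item.
SCORES = {"imposed": 0.7, "believed": 0.2, "requested": 0.3, "correlated": 0.1}
choice = answer_toefl("levied", list(SCORES), lambda t, c: SCORES[c])
print(choice)
```

Any of the similarity measures in this lecture could be plugged in as `sim`; accuracy over a set of such items then serves as the extrinsic score.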
Word Meaning and Similarity
Word Similarity: Thesaurus Methods

Word Meaning and Similarity
Word Similarity: Distributional Similarity (I)
• Nida example:
  A bottle of tesgüino is on the table
  Everybody likes tesgüino
  Tesgüino makes you drunk
  We make tesgüino out of corn.
• From context words humans can guess tesgüino means
  • an alcoholic beverage like beer
• Intuition for algorithm:
  • Two words are similar if they have similar word contexts.
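One way to make "word contexts" concrete is to count the words that appear within a small window around each target word. The sketch below does this for the four tesgüino sentences; the window size of 2, the lowercasing, and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter, defaultdict

SENTENCES = [
    "a bottle of tesgüino is on the table",
    "everybody likes tesgüino",
    "tesgüino makes you drunk",
    "we make tesgüino out of corn",
]


def context_counts(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


counts = context_counts(SENTENCES)
print(counts["tesgüino"].most_common())
```

Each row of the resulting table is a context vector for one word; two words can then be compared by comparing their rows.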
$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}$$
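As a small illustration of the formula, the joint probability matrix is the count matrix divided by the sum of all its entries. The 2×2 counts below are hypothetical.

```python
def joint_probabilities(f):
    """Turn a W x C count matrix f into p_ij = f_ij / (sum of all f_ij)."""
    total = sum(sum(row) for row in f)
    return [[x / total for x in row] for row in f]


# Hypothetical counts: 2 words (rows) by 2 contexts (columns).
f = [[1, 3],
     [2, 4]]
p = joint_probabilities(f)
print(p)
```

Since every entry is divided by the same total, the entries of `p` sum to 1.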
Weighting PMI
• PMI is biased toward infrequent events
• Various weighting schemes help alleviate this
  • See Turney and Pantel (2010)
• Add-one smoothing can also help
Add-2 smoothed Count(w, context)
              computer   data   pinch   result   sugar
apricot              2      2       3        2       3
pineapple            2      2       3        2       3
digital              4      3       2        3       2
information          3      8       2        6       2

p(w, context) [add-2]                                       p(w)
              computer   data   pinch   result   sugar
apricot           0.03   0.03    0.05     0.03    0.05      0.20
pineapple         0.03   0.03    0.05     0.03    0.05      0.20
digital           0.07   0.05    0.03     0.05    0.03      0.24
information       0.05   0.14    0.03     0.10    0.03      0.36
p(context)        0.19   0.25    0.17     0.22    0.17
PPMI(w, context)
              computer   data   pinch   result   sugar
apricot              -      -    2.25        -    2.25
pineapple            -      -    2.25        -    2.25
digital           1.66   0.00       -     0.00       -
information       0.00   0.57       -     0.47       -

PPMI(w, context) [add-2]
              computer   data   pinch   result   sugar
apricot           0.00   0.00    0.56     0.00    0.56
pineapple         0.00   0.00    0.56     0.00    0.56
digital           0.62   0.00    0.00     0.00    0.00
information       0.00   0.58    0.00     0.37    0.00
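The effect of add-2 smoothing on PPMI can be sketched as follows. The raw counts in `COUNTS` are an assumption: they were chosen to be consistent with the add-2 smoothed counts for the four words and five contexts on the slide (each smoothed count minus 2). PMI is computed with a base-2 logarithm and negative values are clipped to zero; cells with a zero count are returned as 0.0 here, where a table might instead show a dash.

```python
import math

WORDS = ["apricot", "pineapple", "digital", "information"]
CONTEXTS = ["computer", "data", "pinch", "result", "sugar"]
# Assumed raw co-occurrence counts (rows = WORDS, columns = CONTEXTS).
COUNTS = [
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
]


def ppmi(counts, k=0):
    """Positive PMI matrix after adding k to every count (k=0: no smoothing)."""
    f = [[x + k for x in row] for row in counts]
    total = sum(sum(row) for row in f)
    p_word = [sum(row) / total for row in f]            # row marginals
    p_context = [sum(col) / total for col in zip(*f)]   # column marginals
    result = []
    for i, row in enumerate(f):
        out_row = []
        for j, x in enumerate(row):
            if x == 0:
                out_row.append(0.0)  # PMI undefined for zero counts
            else:
                pmi = math.log2((x / total) / (p_word[i] * p_context[j]))
                out_row.append(max(pmi, 0.0))
        result.append(out_row)
    return result


plain = ppmi(COUNTS)
smoothed = ppmi(COUNTS, k=2)
for word, row in zip(WORDS, smoothed):
    print(word, [round(x, 2) for x in row])
```

Comparing `plain` with `smoothed` shows the point of the weighting: the rare apricot/pinch pair drops from about 2.25 to about 0.56, so infrequent events no longer dominate.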
Word Meaning and Similarity
Word Similarity: Distributional Similarity (II)
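$$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$
v_i is the PPMI value for word v in context i.
w_i is the PPMI value for word w in context i.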
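A minimal sketch of the cosine computation over two context vectors. The vectors below are illustrative PPMI-style values, not results from a real corpus, and returning 0.0 for an all-zero vector is a convention chosen here to avoid division by zero.

```python
import math


def cosine(v, w):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # assumed convention for empty context vectors
    return dot / (norm_v * norm_w)


# Illustrative PPMI-style vectors over five contexts.
apricot = [0.0, 0.0, 0.56, 0.0, 0.56]
pineapple = [0.0, 0.0, 0.56, 0.0, 0.56]
digital = [0.62, 0.0, 0.0, 0.0, 0.0]

print(cosine(apricot, pineapple), cosine(apricot, digital))
```

Because PPMI values are never negative, the cosine of two PPMI vectors lies between 0 (no shared contexts) and 1 (same direction).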
Evaluating similarity (the same as for thesaurus-based)
• Intrinsic evaluation:
  • Correlation between algorithm and human word similarity ratings
• Extrinsic (task-based, end-to-end) evaluation:
  • Spelling error detection, WSD, essay grading
  • Taking TOEFL multiple-choice vocabulary tests