CRPITV74 Yang
syntactically conditioned contexts, they in fact make no differentiation between them, which is similar to computing distributional similarity with unordered context. The advantage of using the syntactically constrained context has not yet been fully exploited when yielding statistical semantics from word distributions.

To fully harvest the advantages of computing distributional similarity in the syntactically constrained contexts, we proposed to first categorize contexts in terms of grammatical relations, and then overlap the top n similar words yielded in each context to generate automatic thesauri. This is in contrast to averaging distributional similarity across these contexts, which is commonly adopted in the literature.

3 Syntactically constrained distributional similarity

To automate thesauri, we first employed an English syntactic parser based on Link Grammar to construct a syntactically constrained VSM. The word space consists of four major syntactic dependency sets that are widely adopted in the current research on distributional similarity. Following the reduction of dimensionality on the dependency sets, we created the latent semantic representation of words through which distributional similarity can be measured so that thesaurus items can be retrieved.
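As a rough illustration of the first construction step, the sketch below groups parser output into one sparse co-occurrence matrix per dependency set. The triple format `(relation, word, context)`, the toy triples, the pairing of relation labels with particular grammatical relations, and the helper name `build_matrices` are all assumptions made for illustration; they do not reflect the actual Link Grammar output format, and the dimensionality reduction step is omitted.

```python
from collections import defaultdict


def build_matrices(triples):
    """Group (relation, word, context) triples into one sparse count
    matrix per relation. Each matrix maps word -> {context: count}."""
    matrices = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for relation, word, context in triples:
        matrices[relation][word][context] += 1
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {
        rel: {word: dict(ctx) for word, ctx in rows.items()}
        for rel, rows in matrices.items()
    }


# Hypothetical toy triples; the labels are used only as opaque keys.
triples = [
    ("vOX", "sentence", "serve"),
    ("vOX", "sentence", "serve"),
    ("vOX", "punishment", "serve"),
    ("aNX", "sentence", "long"),
]
matrices = build_matrices(triples)
```

Keeping one matrix per relation, rather than pooling all counts, is what allows similarity to be computed separately in each syntactic context.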
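The four syntactically conditioned matrices, as shown in Table 1, are extremely sparse with nulls in over 95% of

\[
\cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
\]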
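A minimal sketch of how the cosine measure might be applied to sparse word vectors, followed by a top-n ranking for a seed word. The dictionary-based vector representation, the function names `cosine` and `top_n`, the zero-vector convention, and the toy matrix are assumptions for illustration, not the implementation used in the experiments.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two sparse vectors stored as
    {dimension: value} dictionaries."""
    dot = sum(value * y[dim] for dim, value in x.items() if dim in y)
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_x * norm_y)


def top_n(seed, matrix, n):
    """Rank every other word in one matrix by cosine similarity to the seed."""
    scored = [
        (cosine(matrix[seed], vector), word)
        for word, vector in matrix.items()
        if word != seed
    ]
    # Highest similarity first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:n]]


# Hypothetical toy matrix for a single dependency set.
matrix = {
    "sentence": {"serve": 2.0, "pass": 1.0},
    "punishment": {"serve": 1.0, "pass": 1.0},
    "syllable": {"stress": 3.0},
}
ranked = top_n("sentence", matrix, 2)
```

Because only shared dimensions contribute to the dot product, the sparse representation avoids iterating over the many null cells of each matrix.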
Verb  rVX  1,282  24,702  58,617  84,601  81,713  144,545
      VoX  1,260  24,265  57,225  82,750  79,771  141,039
      sVX  1,269  24,354  57,642  83,265  80,681  142,256
      ∑X   1,297  25,283  60,483  87,165  83,415  148,455

Table 2: The word relatedness distribution in the ‘gold-standard’ across each matrix

We select 100 seed nouns and 100 seed verbs with term frequencies of around 10,000 times in BNC. The average frequency of these nouns is about 8,988.9, and 10,364.4 for these verbs. High frequency words are likely to be generic or general terms and the less frequent words may not occur in the semantic sets. The average frequency of the nouns in AnX, aNX, SvX, and vOX is in fact decreased to 3,361.1, 5,629.1, 1,156.7, and 1,692.1, and the verbs in rVX, VoX, and sVX are decreased to 3,014.3, 3,328.9, and 1,971.8, as we only extracted syntactic dependencies from BNC. Overall, the average frequency of the nouns is about 2,959.7 across AnX, aNX, SvX, and vOX, and 3,960.9 for the verbs across rVX, VoX, and sVX.

We first used SimWN and SimRT to compare each seed word to all other words from the dependency sets, namely AnX, aNX, SvX, and vOX for nouns and rVX, VoX, and sVX for verbs, to retrieve its candidate words in the ‘gold standard’. Instead of a normal thesaurus with a full coverage of PoS tags, we only compiled the synonyms of

4.4 A walk-through example

For each seed word, after computing the cosine similarity of the seed with all other words in each dependency matrix, we produced and ranked the top n words as candidates. We then applied the two heuristics, ‘any two’ and ‘all’, on these candidates to form automatic thesauri.

In Table 3 we exemplify the top 20 similar words of sentence and attack yielded by each dependency set and by the two heuristics. Consider the distributionally similar words of sentence and attack in aNX and rVX for example. The words related to the linguistic sense of sentence consist of syllable, words, adjective, etc., in aNX, while the words with the judicial sense make up around half of the 20 words including imprisonment, penalty, and the like. The words such as rape and slaughter from rVX are from the literal sense of attack, together with its metaphorical sense among other words like badmouth, flame, and so on.

The heuristic of ‘any two’ collected the intersection of thesaurus items across these dependency sets. For example, punishment and words are the similar words to sentence, which respectively occurred in aNX and vOX as well as in aNX and AnX; criticise and bomb are the similar words to attack, which respectively occurred in VoX and rVX as well as in VoX and sVX.
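The ‘any two’ heuristic can be read as keeping every candidate that appears in the top-n lists of at least two dependency sets. The sketch below follows that reading; the function name `any_two`, the output ordering (number of supporting sets, then best rank), and the abbreviated toy lists are assumptions for illustration. The ‘all’ heuristic is not sketched because its exact definition is not given in this passage.

```python
from collections import Counter


def any_two(top_lists):
    """Keep words found in the top-n lists of at least two dependency sets.

    top_lists maps a dependency-set label to its ranked candidate list.
    """
    counts = Counter()
    best_rank = {}
    for ranked in top_lists.values():
        # dict.fromkeys drops duplicates while preserving rank order.
        for rank, word in enumerate(dict.fromkeys(ranked)):
            counts[word] += 1
            best_rank[word] = min(rank, best_rank.get(word, rank))
    kept = [word for word, count in counts.items() if count >= 2]
    # Assumed ordering: more supporting sets first, then best rank, then name.
    kept.sort(key=lambda word: (-counts[word], best_rank[word], word))
    return kept


# Abbreviated, illustrative candidate lists for the seed "sentence".
top_lists = {
    "aNX": ["imprisonment", "term", "utterance", "penalty", "words", "punishment"],
    "AnX": ["words", "syllable", "utterance", "clause", "imprisonment"],
    "vOX": ["soubise", "cybele", "punishment"],
}
overlap = any_two(top_lists)
```

With these toy lists, punishment is kept because it occurs in both aNX and vOX, and words because it occurs in both aNX and AnX, mirroring the example in the text.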
         Similar words
aNX      imprisonment term utterance penalty excommunication syllable
         words punishment prison prisoner phrase detention
         hospitalisation fisticuffs banishment verdict Minnesota meaning
         adjective warder
AnX      words syllable utterance clause nictation word swarthiness
         paragraph text homograph discourse imprisonment nonce
         phrase hexagram adjective verb niacin savarin micheas
vOX      soubise cybele sextet cristal raper stint concatenation kohlrabi
         tostada apprenticeship ban contrivance Guadalcanal necropolis
         misanthropy roulade gasworks curacy jejunum punishment
SvX      ratel occurrence cragsman jingoism shiism Oklahoma
         genuineness unimportance language gathering letting grimm
         chaucer accent taxation ultimatum arrogance test verticality
         habituation
any two  imprisonment words utterance word term punishment
         paragraph text phrase jail verb meaning noun poem
         language passage sequence syllable lexicon fine
all      Imprisonment utterance penalty excommunication punishment
         prison prisoner detention hospitalisation banishment Minnesota
         meaning contrariety phoneme consonant counterintelligence
         starvation fine cathedra lifespan

(a) The similar words to sentence (as a noun)

         Similar words
rVX      assault rape criticize arm slaughter abduct mortar accuse defend
         fire avow lash badmouth blaspheme slit singe flame kidnap
         persecute
VoX      Raid criticise bomb realign outwit beleaguer guard raze bombard
         criticize resemble spy pulse misspend reformulate alkalinise
         metastasise placard ruck glory
sVX      ambush invade fraternize palpitate patrol wound pillage bomb
         billet shell fire liberate kidnap raid garrison accuse assault arrest
         slaughter outnumber
any two  assault criticize bomb ambush accuse raid fire rape bombard
         kidnap infiltrate patrol defend storm invade arrest garrison
         torture stab shoot
all      raid bomb assault criticize ambush accuse fire guard bombard
         patrol rape storm infiltrate wound kidnap criticise garrison
         alkalinize torture spy

(b) The similar words to attack (as a verb)

Table 3: A sample of thesaurus items

4.5 Performance evaluation

Instead of simply matching with the ‘gold standard’ thesauri, Lin (1998) proposed to compare his automatic thesaurus with WordNet and Roget on their structures, taking into account the similarity scores and orders of similar words respectively produced from distributional similarity and taxonomic similarity. This approach can account for thesaurus resemblance under the hierarchy of WordNet or Roget, which is an apparent advantage over straight word matching.

Instead of calculating the varied cosine similarity between each target vector yielded from automatic

Analogously for the ranked word list from an automatic thesaurus, the top n similar words with respect to each sense of T in WordNet are produced in the order of hyper/hyponyms and holo/meronyms, after first exhausting synonyms and then antonyms, whereas the top n words in Roget can be subsequently acquired within +/-n (preceding/succeeding) words from T in each of its categories. Through these redefined precision and recall measures, Pn can stand for the coverage of the automatic thesaurus on potentially arbitrary senses or categories of T, and Rp can describe relatedness of the thesaurus on the actual sense or category of T.

5 Results

We took the top n similar words derived from each co-occurrence matrix for ‘any two’ or ‘all’, with n varying from 1 to 1000 in ten steps, roughly doubling each time. The results are shown in Table 4. We individually listed Pn and Rp values with respect to WordNet, Roget, and the union of WordNet and Roget (Total).

                          ‘all’                              ‘any two’
                WordNet       Roget       Total     WordNet       Roget       Total
N              Pn    Rp    Pn    Rp    Pn    Rp    Pn    Rp    Pn    Rp    Pn    Rp
1     noun   22.0  22.0  15.0  15.0  27.0  27.0  24.0  24.0  12.0  12.0  28.0  28.0
      verb   13.0  13.0   7.0   7.0  16.0  16.0  15.0  15.0   8.0   8.0  20.0  20.0
2     noun   31.0  35.2  19.0  23.7  36.0  41.2  34.0  34.0  20.0  20.0  42.0  37.5
      verb   39.0  31.7   9.5  12.0  40.0  34.2  48.5  34.4  11.0  13.3  49.5  38.2
5     noun   42.4  21.1  22.2  29.5  46.8  27.1  56.6  17.1  28.4  24.0  63.2  20.0
      verb   54.2  25.6  20.2  17.1  55.8  26.9  62.6  27.4  23.8  15.0  64.0  28.7
10    noun   43.4  11.8  19.4  18.5  47.5  15.5  56.6  10.4  26.9  17.1  62.3  11.0
      verb   53.3  19.5  18.0  17.5  54.7  19.6  62.3  21.7  20.9  15.9  63.7  21.2
20    noun   37.7   9.5  16.1  13.8  41.6   9.8  50.2   8.7  22.7  16.5  56.0   8.4
      verb   49.3  15.0  13.9  15.0  50.9  14.7  57.5  15.6  16.1  13.8  59.0  15.4
50    noun   29.0   8.0  11.2  11.2  32.3   7.4  41.4   7.2  16.7   9.5  46.4   6.8
      verb   43.8  11.9  10.0  10.9  45.4  11.3  49.5  12.2  11.4   9.9  51.3  11.5
100   noun   22.9   8.4   8.2   9.5  25.7   7.4  33.8   6.6  12.8   6.6  38.4   5.9
      verb   39.7  10.0   7.7   8.4  41.2   9.2  44.1  10.4   8.4   7.5  45.6   9.8
200   noun   18.6   6.9   5.9   7.8  20.9   5.9  26.6   6.2   8.9   6.2  30.2   5.5
      verb   36.0   9.3   5.9   6.5  37.4   8.6  39.6   9.3   6.4   6.2  41.0   8.5
500   noun   13.6   6.4   3.9   6.1  15.4   5.5  18.6   6.0   5.4   5.8  21.0   5.3
      verb   32.6   8.5   4.2   5.7  33.8   7.7  35.1   8.5   4.6   5.3  36.4   7.7
1000  noun   11.0   6.3   2.8   5.5  12.4   5.4  14.1   6.1   3.6   5.5  16.0   5.2
      verb   30.5   8.2   3.4   4.9  31.6   7.3  32.7   8.2   3.6   4.9  33.8   7.3