Vector Semantics
• Captures the meaning of the given text while considering context, the logical structuring of sentences, and grammatical roles
What is Semantics?
The study of meaning: the relation between symbols and their denotata
John told Mary that the train moved out of the station at 3 o'clock
Computational Semantics
• Contextual representation
• A word's contextual representation is an abstract cognitive structure that accumulates from encounters with the word in various linguistic contexts
• Parameters
• Window size
• Window shape - rectangular/triangular/other
• Consider the following passage
Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo,
police chief of Makati, the Philippines' major financial center, in an
escalation of street violence sweeping the capital area. The gunmen
shouted references to the rebel New People's Army. They fled in a
commandeered passenger jeep. The military says communist rebels have
killed up to 65 soldiers and police in the capital region since January.
Word as Context: Window Size: 5
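As a rough sketch of how a window of 5 words on either side is collected, the Python snippet below runs over the passage above; the choice of target word ("rebels"), the crude tokenization, and the helper function name are illustrative assumptions, not part of the original slide.

# Sketch: rectangular context window of +/-5 tokens around a target word.
passage = ("Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo, "
           "police chief of Makati, the Philippines' major financial center, in an "
           "escalation of street violence sweeping the capital area. The gunmen "
           "shouted references to the rebel New People's Army. They fled in a "
           "commandeered passenger jeep. The military says communist rebels have "
           "killed up to 65 soldiers and police in the capital region since January.")

# Crude tokenization: lowercase and strip sentence punctuation.
tokens = passage.lower().replace(",", " ").replace(".", " ").split()

def context_windows(tokens, target, window=5):
    """Collect the +/- `window` tokens around each occurrence of `target`."""
    result = []
    for i, tok in enumerate(tokens):
        if tok == target:
            result.append((tokens[max(0, i - window):i], tokens[i + 1:i + 1 + window]))
    return result

for left, right in context_windows(tokens, "rebels", window=5):
    print(" ".join(left), "[rebels]", " ".join(right))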
Context weighting: documents as context
Bag-of-Words Model
• The bag-of-words (BoW) model is a Natural Language Processing technique for text modelling.
• Machine learning algorithms prefer structured, well-defined, fixed-length inputs.
• Also, at a more granular level, machine learning models work with numerical data rather than textual data.
• Using the BoW technique, we convert a text into an equivalent vector of numbers (see the worked example and sketch below).
Understanding Bag of Words with an example
• Welcome
• To
• Great
• Learning
• ,
• Now
• start
• learning
• is
• a
• good
• practice
• Note that the words ‘Learning’ and ‘learning’ are treated as different tokens here because of the difference in case, and hence both appear in the vocabulary.
• The scoring method used here marks 1 for the presence of each word and 0 for its absence.
Sentence 2 ➝ [ 0,0,0,0,0,0,0,1,1,1,1,1 ]
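A minimal sketch of this presence/absence scoring is shown below; the vocabulary is the 12-token list above, and the exact wording of Sentence 2 is an assumption reconstructed to be consistent with the vector on the slide.

# Sketch: binary bag-of-words scoring (1 = word present, 0 = absent), case-sensitive.
vocabulary = ["Welcome", "To", "Great", "Learning", ",",
              "Now", "start", "learning", "is", "a", "good", "practice"]

sentence_2 = "learning is a good practice"   # assumed wording of Sentence 2
tokens_2 = sentence_2.split()

vector_2 = [1 if word in tokens_2 else 0 for word in vocabulary]
print(vector_2)   # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]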
Example (2), with preprocessing:
• Step 1: Convert the above sentences to lower case, since the case of a word does not hold any information here.
• (The 7-dimensional vector below also implies that punctuation and common stop words such as ‘to’, ‘is’, and ‘a’ are removed, leaving the vocabulary {welcome, great, learning, now, start, good, practice}.)
Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]
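A sketch of the preprocessed version is below; the sentence wordings, the stop-word list {'to', 'is', 'a'}, and the punctuation removal are assumptions chosen to match the 7-dimensional vector above.

# Sketch: BoW after lowercasing plus (assumed) punctuation and stop-word removal.
stop_words = {"to", "is", "a"}   # assumed stop-word list

def preprocess(sentence):
    tokens = sentence.lower().replace(",", " ").split()
    return [t for t in tokens if t not in stop_words]

sentence_1 = "Welcome to Great Learning , Now start learning"   # assumed wording
sentence_2 = "learning is a good practice"                      # assumed wording

# Build the vocabulary in order of first appearance.
vocabulary = []
for sent in (sentence_1, sentence_2):
    for tok in preprocess(sent):
        if tok not in vocabulary:
            vocabulary.append(tok)

print(vocabulary)
# ['welcome', 'great', 'learning', 'now', 'start', 'good', 'practice']
print([1 if w in preprocess(sentence_2) else 0 for w in vocabulary])
# Sentence 2 -> [0, 0, 1, 0, 0, 1, 1]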
Term Frequency-Inverse Document Frequency (TF-IDF)
• It measures how important a term is within a document relative to a collection
of documents (i.e., relative to a corpus). Words within a text document are
transformed into importance numbers by a text vectorization process.
• There are many different text vectorization scoring schemes, with TF-IDF being
one of the most common.
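One common scoring scheme (the log-weighted variant; treat the exact formula as an assumption about which variant these slides use) is:
tf(t, d) = log10( count(t, d) + 1 )
idf(t) = log10( N / df_t ), where N is the number of documents and df_t is the number of documents containing t
tf-idf(t, d) = tf(t, d) * idf(t)
A term that occurs in every document has df_t = N, so idf(t) = log10(1) = 0 and its tf-idf weight is 0 in every document.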
Here are the raw counts in the Shakespeare term-document matrix, and the tf-idf weighted version of the same matrix.
Note that the tf-idf values for the dimension corresponding to the word good have all become 0;
since this word appears in every document, the tf-idf weighting leads it to be ignored.
Similarly, the word fool, which appears in 36 out of the 37 plays, gets a much lower weight.
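A compact sketch of that effect, using the log-weighted scheme above; the counts here are made-up illustrative numbers, not the Shakespeare figures.

import math

# Sketch: tf-idf over a tiny hypothetical term-document count matrix (4 documents).
counts = {
    "good":   [37, 80, 62, 89],   # appears in all 4 documents
    "fool":   [58,  1,  4,  0],   # appears in 3 of the 4 documents
    "battle": [ 1,  0,  7,  0],   # appears in 2 of the 4 documents
}
N = 4   # number of documents

def tf_idf(term, doc):
    tf = math.log10(counts[term][doc] + 1)
    df = sum(1 for c in counts[term] if c > 0)
    return tf * math.log10(N / df)

for term in counts:
    print(term, [round(tf_idf(term, d), 3) for d in range(N)])
# "good" gets weight 0.0 everywhere (idf = log10(4/4) = 0); rarer terms keep weight.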
Pointwise Mutual Information (PMI)
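The standard definitions, which the worked example below uses, are:
PMI(w, c) = log2( P(w, c) / ( P(w) P(c) ) )
PPMI(w, c) = max( PMI(w, c), 0 )
where P(w, c), P(w), and P(c) are estimated from counts in the term-context matrix.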
Computing PPMI on a term-context matrix
• p(w=information, c=data) = 3982/11716 = .3399
• p(w=information) = 7703/11716 = .6575
• p(c=data) = 5673/11716 = .4842
• pmi(information, data) = log2( .3399 / (.6575 * .4842) ) = .0944
Resulting PPMI matrix (negatives replaced by 0)
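A small sketch reproducing the worked pmi(information, data) value above; the helper function is illustrative, and the four counts are the ones quoted on the slide (cell count, row total, column total, grand total).

import math

def ppmi(count_wc, count_w, count_c, total):
    """Positive PMI from a joint count and its marginals in a term-context matrix."""
    p_wc, p_w, p_c = count_wc / total, count_w / total, count_c / total
    return max(math.log2(p_wc / (p_w * p_c)), 0.0)

print(round(ppmi(count_wc=3982, count_w=7703, count_c=5673, total=11716), 3))
# ~0.094, matching the pmi(information, data) value above (positive, so PPMI keeps it)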
Weighting PMI
• PMI is biased toward infrequent events
• Very rare words have very high PMI values
• Two solutions:
• Give rare words slightly higher probabilities
• Use add-one smoothing (which has a similar effect)
Weighting PMI: Giving rare context words slightly higher probability
• Raise the context probabilities to the power α = 0.75:
• P_α(c) = count(c)^0.75 / Σ_c′ count(c′)^0.75, and use PPMI_α(w, c) = max( log2( P(w, c) / ( P(w) P_α(c) ) ), 0 )
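As a quick illustration with made-up counts: if context a occurs 99 times and context b occurs once, then P(a) = .99 and P(b) = .01, but with α = 0.75 we get P_α(b) = .01^.75 / ( .99^.75 + .01^.75 ) ≈ .0316 / 1.024 ≈ .031, roughly three times the unweighted value, so rare contexts no longer produce such inflated PMI scores.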
Computing word similarity: Dot product and cosine
• The dot product tends to be high when the two vectors have large
values in the same dimensions
• Dot product can thus be a useful similarity metric between vectors
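Concretely, for N-dimensional vectors v and w (the standard definition):
dot(v, w) = v · w = Σ_i v_i w_i = v_1 w_1 + v_2 w_2 + ... + v_N w_N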
Problem with raw dot-product
• Dot product favors long vectors
• Dot product is higher if a vector is longer (has higher values in many dimensions)
• Vector length: |v| = sqrt( Σ_i v_i^2 )
• Frequent words (of, the, you) have long vectors (since they occur
many times with other words).
• So dot product overly favors frequent words
Alternative: cosine for computing word similarity
The cosine similarity metric is based on the definition of the dot product between two vectors a and b:
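a · b = |a| |b| cos θ, so
cosine(v, w) = ( v · w ) / ( |v| |w| ) = Σ_i v_i w_i / ( sqrt( Σ_i v_i^2 ) * sqrt( Σ_i w_i^2 ) )
i.e., the cosine is the dot product normalized by the lengths of the two vectors, which removes the bias toward long (frequent-word) vectors.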
Cosine as a similarity metric
• In general, the cosine ranges from -1 (vectors pointing in opposite directions) through 0 (orthogonal vectors) to +1 (vectors pointing in the same direction)
• But since raw frequency values are non-negative, the cosine for term-term matrix vectors ranges from 0 to 1
Cosine examples
              pie    data   computer
cherry        442       8          2
digital         5    1683       1670
information     5    3982       3325
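A sketch computing cosines from the counts in this table (the dimension order and rounding are mine; the printed values are what these numbers give):

import math

# Sketch: cosine similarity over the raw co-occurrence counts above.
# Dimension order: (pie, data, computer).
vectors = {
    "cherry":      [442,    8,    2],
    "digital":     [  5, 1683, 1670],
    "information": [  5, 3982, 3325],
}

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

print(round(cosine(vectors["cherry"], vectors["information"]), 3))   # ~0.018
print(round(cosine(vectors["digital"], vectors["information"]), 3))  # ~0.996

So digital is far closer to information than cherry is, which is the intuition behind the angle visualization on the next slide.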
Visualizing cosines (well, angles)
Word2Vec
• These models are shallow two-layer neural networks having one input layer,
one hidden layer, and one output layer.
• The number of embedding dimensions is a tunable model parameter
• You can give the model a corpus without any label information, and it can create dense word embeddings.
• Intuitively, you can think of the skip-gram model as the opposite of the CBOW model.
• In this architecture, the model takes the current word as input and tries to accurately predict the words before and after it.
• This model essentially tries to learn and predict the context words around the
specified input word.
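A minimal training sketch is below; it assumes the gensim library (4.x parameter names), a made-up toy corpus, and arbitrary hyperparameter values, so it only illustrates the API shape rather than producing useful embeddings.

# Sketch: skip-gram Word2Vec via gensim (assumes gensim 4.x is installed).
from gensim.models import Word2Vec

# Toy tokenized corpus (made up); real embeddings need a large corpus.
corpus = [
    ["suspected", "communist", "rebels", "killed", "police"],
    ["the", "military", "says", "communist", "rebels", "killed", "soldiers"],
    ["the", "gunmen", "fled", "in", "a", "passenger", "jeep"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the dense embeddings
    window=5,         # context window size on each side of the current word
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram (predict context from current word); 0 = CBOW
    epochs=50,
)

print(model.wv["rebels"][:5])           # first few dimensions of one dense word vector
print(model.wv.most_similar("rebels"))  # nearest neighbours in the embedding space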