
Computational Semantics and Semantic Parsing


What Is Semantic Analysis in NLP?
• The goal is to understand the meaning of natural language.
• Due to the vast complexity and subjectivity involved in human language, interpreting it is a complicated task for machines.
• Semantic analysis captures the meaning of a given text while considering context, the logical structuring of sentences, and grammatical roles.
What is Semantics?
The study of meaning: the relation between symbols and their denotata.

John told Mary that the train moved out of the station at 3 o'clock.
Computational Semantics

• The study of how to automate the process of constructing and reasoning with meaning representations of natural language expressions.

• Methods in Computational Semantics generally fall into two categories:
  • Formal Semantics: construction of precise mathematical models of the relations between expressions in a natural language and the world, e.g. John chases a bat → ∃x[bat(x) ∧ chase(john, x)]
  • Distributional Semantics: the study of statistical patterns of human word usage to extract semantics.
Distributional Hypothesis
• Distributional Hypothesis: basic intuition
  • "The meaning of a word is its use in the language." (Wittgenstein, 1953)
  • "You shall know a word by the company it keeps." (Firth, 1957)
  • → Word meaning (whatever it might be) is reflected in linguistic distributions.
  • "Words that occur in the same contexts tend to have similar meanings." (Zellig Harris, 1968)
  • Semantically similar words tend to have similar distributional patterns.
Distributional Semantics: a linguistic perspective

• "If linguistics is to deal with meaning, it can only do so through distributional analysis." (Zellig Harris)

• "If we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. In other words, difference in meaning correlates with difference of distribution." (Zellig Harris, "Distributional Structure")
Distributional Semantics: a cognitive perspective

• Contextual representation
  • A word's contextual representation is an abstract cognitive structure that accumulates from encounters with the word in various linguistic contexts.

• We learn new words based on contextual cues:
  • He filled the wampimuk with the substance, passed it around and we all drank some.
  • We found a little wampimuk sleeping behind the tree.
Distributional Semantic Models (DSMs)

• Computational models that build contextual semantic representations from corpus data
  • DSMs are models for semantic representations
  • The semantic content is represented by a vector
  • Vectors are obtained through the statistical analysis of the linguistic contexts of a word
• Alternative names
  • corpus-based semantics
  • statistical semantics
  • geometrical models of meaning
  • vector semantics
  • word space models
Distributional Semantics: the general intuition

• Distributions are vectors in a multidimensional semantic space, that is, objects with a magnitude and a direction.

• The semantic space has dimensions which correspond to possible contexts, as gathered from a given corpus.
Vector Semantics

Vector Space
Word Space

• Target words: automobile, car, soccer, football
• Term vocabulary: wheel, transport, passenger, tournament, London, goal, match
Constructing Word Spaces
• Informal algorithm for constructing word spaces (a code sketch follows below):
  • Pick the words you are interested in: target words
  • Define a context window: the number of words surrounding the target word
    • The context can in general also be defined in terms of documents, paragraphs or sentences.
  • Count the number of times the target word co-occurs with the context words: co-occurrence matrix
  • Build vectors out of (a function of) these co-occurrence counts
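For illustration (not from the original slides), a minimal Python sketch of this counting procedure; the toy corpus, target words, and window size are assumptions:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, targets, window=2):
    """Count how often each target word co-occurs with context words
    within +/- `window` positions (toy sketch, not an optimized DSM)."""
    counts = {t: defaultdict(int) for t in targets}
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok not in counts:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tok][tokens[j]] += 1
    return counts

corpus = ["the car needs a new wheel",
          "the football match had one goal"]
print(cooccurrence_counts(corpus, targets={"car", "football"}))
```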
Constructing Word Spaces: distributional vectors
Vector Space
Computing similarity
Vector Space Model without distributional similarity
• Words are treated as atomic symbols
• One-hot representation
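As a quick illustration (not part of the original slides), a one-hot encoding sketch over a small assumed vocabulary, showing why such vectors carry no similarity information:

```python
import numpy as np

vocab = ["bank", "money", "river", "loan"]   # assumed toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector that is all zeros except a 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("money"))   # [0. 1. 0. 0.]
# Every pair of distinct one-hot vectors is orthogonal,
# so this representation encodes no similarity between words.
```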
Distributional Similarity Based
Representations
• You know a word by the company it keeps

• These words will represent banking


Building a DSM step-by-step
Many design choices
The parameter space
Documents as context: Word × Document
Words as context: Word × Word
Words as contexts

• Parameters
  • Window size
  • Window shape: rectangular / triangular / other
• Consider the following passage:

Suspected communist rebels on 4 July 1989 killed Col. Herminio Taylo, police chief of Makati, the Philippines' major financial center, in an escalation of street violence sweeping the capital area. The gunmen shouted references to the rebel New People's Army. They fled in a commandeered passenger jeep. The military says communist rebels have killed up to 65 soldiers and police in the capital region since January.

Words as Context: Window Size = 5
Context weighting: documents as context
Bag of Words Model
• A Natural Language Processing technique for text modelling.
• A method of feature extraction from text data.
• A simple and flexible way of extracting features from documents.
Bag of Words Model
• A representation of text that describes the occurrence of words within a document.
• It keeps track of word counts and disregards grammatical details and word order.
• It is only concerned with whether known words occur in the document, not where in the document they occur.
Why is the Bag-of-Words algorithm used?
• One of the biggest problems with text is that it is messy and unstructured, while machine learning algorithms prefer structured, well-defined, fixed-length inputs; the Bag-of-Words technique lets us convert variable-length texts into fixed-length vectors.
• Also, at a more granular level, machine learning models work with numerical rather than textual data, so the bag-of-words (BoW) technique converts a text into its equivalent vector of numbers.
Understanding Bag of Words with an example

• Example (1) without preprocessing:
• Sentence 1: "Welcome to Great Learning, Now start learning"
• Sentence 2: "Learning is a good practice"

Solution: Example 1

Sentence 1 tokens: Welcome, to, Great, Learning, ",", Now, start, learning
Sentence 2 tokens: Learning, is, a, good, practice
• Step 1: Go through all the words in the above text and
make a list of all the words in our model vocabulary.

• Welcome
• To
• Great
• Learning
• ,
• Now
• start
• learning
• is
• a
• good
• practice
• Note that the words 'Learning' and 'learning' are not the same here because of the difference in case, and hence both appear in the list.

• Also note that the comma ',' is included in the list.

• Because the vocabulary has 12 words, we can use a fixed-length document representation of 12, with one position in the vector to score each word.

• The scoring method we use here is to count the occurrences of each word and mark 0 for absence; this is the most commonly used scoring method.


Frequency Table for both sentences

Sentence     Welcome  to  Great  Learning  ,  Now  start  learning  is  a  good  practice
Sentence 1      1      1    1       1      1   1     1       1      0   0    0      0
Sentence 2      0      0    0       0      0   0     0       1      1   1    1      1

Writing the above frequencies in vector form:

Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
Sentence 2 ➝ [ 0,0,0,0,0,0,0,1,1,1,1,1 ]
Example (2) with preprocessing:

• Sentence 1: "Welcome to Great Learning, Now start learning"
• Sentence 2: "Learning is a good practice"

• Step 1: Convert the above sentences to lower case, as the case of a word does not hold any information.

• Step 2: Remove special characters and stopwords from the text. Stopwords are words that do not contain much information about the text, like 'is', 'a', 'the' and many more.
Example (2) with preprocessing:

• After applying the above steps, the sentences are changed to:
  • Sentence 1: "welcome great learning now start learning"
  • Sentence 2: "learning good practice"

• Although the above sentences do not make much sense, the maximum information is contained in these words only.

• Step 3: Go through all the words in the above text and make a list of all the words in our model vocabulary:
  • welcome
  • great
  • learning
  • now
  • start
  • good
  • practice

Now, as the vocabulary has only 7 words, we can use a fixed-length document representation of 7, with one position in the vector to score each word.
Occurrence matrix

Sentence     welcome  great  learning  now  start  good  practice
Sentence 1      1       1       2       1     1      0      0
Sentence 2      0       0       1       0     0      1      1

Writing the above frequencies in vector form:

Sentence 1 ➝ [ 1,1,2,1,1,0,0 ]
Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]
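To make the walkthrough concrete, here is a minimal Python sketch (not from the slides) that reproduces this kind of occurrence matrix; the small stopword set is an assumption chosen to match the example, and the vocabulary comes out in alphabetical rather than slide order:

```python
import re
from collections import Counter

STOPWORDS = {"to", "is", "a"}   # assumed minimal stopword list for this example

def preprocess(sentence):
    """Lowercase, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bow_vectors(sentences):
    """Build a shared vocabulary and per-sentence count vectors."""
    tokenized = [preprocess(s) for s in sentences]
    vocab = sorted(set(t for toks in tokenized for t in toks))
    return vocab, [[Counter(toks)[w] for w in vocab] for toks in tokenized]

vocab, vectors = bow_vectors([
    "Welcome to Great Learning, Now start learning",
    "Learning is a good practice",
])
print(vocab)     # alphabetical vocabulary (ordering differs from the slide's table)
print(vectors)   # e.g. sentence 1 has count 2 for 'learning'
```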
Term Frequency-Inverse Document Frequency (TF-IDF)
• It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus). Words within a text document are transformed into importance numbers by a text vectorization process.

• There are many different text vectorization scoring schemes, with TF-IDF being one of the most common.

• TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with its Inverse Document Frequency (IDF).
Term Frequency
• The TF of a term or word is the number of times the term appears in a document divided by the total number of words in the document.
Inverse Document Frequency
• The IDF of a term reflects the proportion of documents in the corpus that contain the term.

• Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).
TF-IDF
• The TF-IDF of a term is calculated by multiplying its TF and IDF scores.

• The importance of a term is high when it occurs a lot in a given document and rarely in others.

• In short, commonality within a document, measured by TF, is balanced by rarity between documents, measured by IDF.

• The resulting TF-IDF score reflects the importance of a term for a document in the corpus.
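A minimal sketch (not from the slides) of this TF × IDF scheme, using the relative-frequency TF and a log10(N/df) IDF as described above; the toy documents are assumptions:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF per document: tf = count / doc_length, idf = log10(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))                 # document frequency: one count per document
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            term: (count / len(toks)) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "a cat and a dog are pets"]
for w in tfidf(docs):
    print(w)   # 'cat' appears in every document, so its idf is log10(3/3) = 0 and its weight is 0
```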
Example: TF-IDF (News Article Classification Problem)

Consider the Document Frequency

Inverse Document Frequency

TF-IDF
Why Log of IDF
The tf-idf weighted value w(t,d) for word t in document d thus combines term frequency tf(t,d) with idf(t):

  w(t,d) = tf(t,d) × idf(t),   where idf(t) = log10( N / df(t) )

Here are the raw counts in the Shakespeare term-document matrix, and the tf-idf weighted version of the same matrix. Note that the tf-idf values for the dimension corresponding to the word good have now all become 0; since this word appears in every document, the tf-idf algorithm leads it to be ignored. Similarly, the word fool, which appears in 36 out of the 37 plays, has a much lower weight.
Pointwise Mutual Information (PMI)
Computing PPMI on a term-context matrix

• Matrix F with W rows (words) and C columns (contexts)
• f_ij is the number of times word w_i occurs in context c_j
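For reference, a standard statement of the quantities involved (following the usual PPMI definitions, e.g. Jurafsky and Martin; the slide's own formula images are not reproduced here):

```latex
p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad
p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad
p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

\mathrm{PMI}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}, \qquad
\mathrm{PPMI}_{ij} = \max(\mathrm{PMI}_{ij},\, 0)
```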
• p(w=information, c=data) = 3982/11716 = .3399
• p(w=information) = 7703/11716 = .6575
• p(c=data) = 5673/11716 = .4842

• pmi(information, data) = log2( .3399 / (.6575 × .4842) ) = .0944
Resulting PPMI matrix (negatives replaced by 0)

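A small Python sketch (an illustration, not the slides' own code) that turns a raw word-by-context count matrix into a PPMI matrix as defined above; the toy counts are arbitrary:

```python
import numpy as np

def ppmi(counts):
    """Compute PPMI from a word-by-context count matrix (rows: words, cols: contexts)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                       # joint probabilities p_ij
    p_w = p_wc.sum(axis=1, keepdims=True)       # row marginals p_i*
    p_c = p_wc.sum(axis=0, keepdims=True)       # column marginals p_*j
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                # zero counts: treat as 0
    return np.maximum(pmi, 0.0)                 # clip negative PMI values to 0

F = [[10, 0, 3],        # toy counts for word 1 across three contexts
     [ 1, 8, 2]]        # toy counts for word 2
print(ppmi(F).round(2))
```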
Weighting PMI
• PMI is biased toward infrequent events
• Very rare words have very high PMI values
• Two solutions:
• Give rare words slightly higher probabilities
• Use add-one smoothing (which has a similar effect)

Weighting PMI: giving rare context words slightly higher probability
• Raise the context probabilities to the power of α (in practice α = 0.75):

  P_α(c) = count(c)^α / Σ_c' count(c')^α

• This helps because P_α(c) > P(c) for rare c.
• Consider two events with P(a) = .99 and P(b) = .01: with α = .75 we get P_α(a) ≈ .97 and P_α(b) ≈ .03, so the rare event receives a relatively higher probability.
Computing word similarity: dot product and cosine

• The dot product between two vectors is a scalar:

  dot(v, w) = v · w = Σ_i v_i w_i

• The dot product tends to be high when the two vectors have large values in the same dimensions.
• The dot product can thus be a useful similarity metric between vectors.
Problem with the raw dot product
• The dot product favors long vectors.
• The dot product is higher if a vector is longer (has higher values in many dimensions).
• Vector length: |v| = sqrt( Σ_i v_i² )
• Frequent words (of, the, you) have long vectors (since they occur many times with other words).
• So the dot product overly favors frequent words.
Alternative: cosine for computing word similarity

Based on the definition of the dot product between two vectors a and b:

  a · b = |a| |b| cos θ,   so   cos(a, b) = (a · b) / (|a| |b|)
Cosine as a similarity metric

• -1: vectors point in opposite directions
• +1: vectors point in the same direction
• 0: vectors are orthogonal

• But since raw frequency values are non-negative, the cosine for term-term matrix vectors ranges from 0 to 1.
Cosine examples

              pie   data   computer
cherry        442      8          2
digital         5   1683       1670
information     5   3982       3325
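A short Python sketch (not from the slides) computing cosine similarity for the row vectors in this table; the rounded results in the comments follow from the counts above:

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    v, w = np.asarray(v, float), np.asarray(w, float)
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

# Counts over the dimensions (pie, data, computer) from the table above.
cherry      = [442,    8,    2]
digital     = [  5, 1683, 1670]
information = [  5, 3982, 3325]

print(round(cosine(cherry, information), 3))   # ~0.018: low similarity
print(round(cosine(digital, information), 3))  # ~0.996: high similarity
```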
Visualizing cosines
(well, angles)
Word2Vec

• Word embedding is a language modeling technique for mapping words to vectors of real numbers.

• It represents words or phrases in vector space with several dimensions.

• Word embeddings can be generated using various methods like neural networks, co-occurrence matrices, probabilistic models, etc.
Word2Vec
• Word2Vec creates vectors of words that are distributed numerical representations of word features.

• These word features could comprise words that represent the context of the individual words present in our vocabulary.

• Word embeddings eventually help in establishing the association of a word with another word of similar meaning through the created vectors.
Word2Vec

• Word2Vec consists of models for generating word embeddings.

• These models are shallow two-layer neural networks with one input layer, one hidden layer, and one output layer.

• Word2Vec utilizes two architectures:
  • CBOW (Continuous Bag of Words)
  • Skip-gram
Issues in BoW / TF-IDF
• Semantic meaning is not captured
  • Similar words should have similar vectors
• A sparse matrix is generated
• The number of dimensions grows with the vocabulary size

Vectors created in Word2Vec:
• Limited number of dimensions
• Sparsity is reduced (dense vectors are created)
• Semantic meaning is maintained
How does CBOW work?
• Word2Vec is an unsupervised model:
  • You can give it a corpus without any label information and the model can create dense word embeddings.

• Word2Vec internally leverages a supervised classification task to obtain these embeddings from the corpus.
How does CBOW work?
• The CBOW architecture comprises a neural classification model in which we take in context words as input, X, and try to predict our target word, Y.
Architecture of CBOW
Math Behind The Above Calculation
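Since the architecture and math slides are figures, here is a minimal NumPy sketch (a simplified illustration, not the original material) of a single CBOW forward pass: average the context word embeddings, project to vocabulary scores, and apply a softmax. The sizes and random weights are assumptions; training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 7, 4                       # vocabulary size and embedding (hidden) size, chosen arbitrarily
W_in  = rng.normal(size=(V, N))   # input-to-hidden weights (one row per word: its embedding)
W_out = rng.normal(size=(N, V))   # hidden-to-output weights

def cbow_forward(context_ids):
    """One CBOW forward pass: average context embeddings, score every vocabulary word."""
    h = W_in[context_ids].mean(axis=0)          # hidden layer: mean of the context embeddings
    scores = h @ W_out                          # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                      # softmax: P(target word | context)

probs = cbow_forward(context_ids=[0, 2, 3, 5])  # indices of the surrounding context words
print(probs.argmax(), probs.round(3))           # predicted target word id and the distribution
```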
Skip-Gram Model
• The skip-gram model is a simple neural network with one hidden layer trained to predict the probability of a given word being present when an input word is present.

• Intuitively, you can imagine the skip-gram model as the opposite of the CBOW model.

• In this architecture, it takes the current word as input and tries to accurately predict the words before and after this current word.

• This model essentially tries to learn and predict the context words around the specified input word (a usage sketch follows below).
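As a usage illustration (assuming the gensim library, version 4.x; not part of the slides), both architectures can be trained from a tokenized corpus, with the sg flag selecting CBOW (sg=0) or skip-gram (sg=1):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real model needs far more text to learn useful vectors.
sentences = [
    ["the", "bank", "approved", "the", "loan"],
    ["the", "river", "bank", "was", "flooded"],
    ["money", "was", "deposited", "in", "the", "bank"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # CBOW
skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram

print(cbow.wv["bank"].shape)          # dense 50-dimensional vector
print(skip.wv.most_similar("bank"))   # nearest neighbours (unreliable on a toy corpus)
```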
