Word Embedding

This is AI4001

GCR : t37g47w
Problem With Text
A problem with modeling text is that it is messy, and machine learning algorithms prefer well-defined, fixed-length inputs and outputs.
Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers, specifically vectors of numbers.

This is called feature extraction or feature encoding.


A popular and simple method of feature extraction with text data is
called the bag-of-words model of text.
Bag Of Words
A bag-of-words is a representation of text that describes the occurrence of words
within a document. It involves two things:

● A vocabulary of known words.


● A measure of the presence of known words.

It is called a “bag” of words because any information about the order or structure of the words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

We look at the histogram of the words within the text, i.e. we treat each word count as a feature.
Bag Of Words
The intuition is that documents are similar if they have similar content. Further, from the content alone we can learn something about the meaning of the document.

The bag-of-words model can be as simple or as complex as you like. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and in how to score the presence of known words.
Bag Of Words
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,

Design the Vocabulary
Treating each line as a separate “document”, the vocabulary of unique words (ignoring case and punctuation) is:
“it”, “was”, “the”, “best”, “of”, “times”, “worst”, “age”, “wisdom”, “foolishness”: 10 unique words out of 24 in the corpus.

Create Document Vectors
Each document is scored against the vocabulary, for example with a binary presence/absence score, so “It was the best of times” becomes
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
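Below is a minimal sketch of these two steps in plain Python, using binary presence/absence scoring (the tokenizer here is a simplification that just lowercases and strips trailing punctuation):

```python
# Build a vocabulary from the four lines, then score each line against it
# with a binary (present / not present) score.
lines = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def tokenize(text):
    # ignore case and strip simple punctuation
    return [w.strip(",.") for w in text.lower().split()]

# design the vocabulary: every unique word across the corpus
vocab = sorted({w for line in lines for w in tokenize(line)})
print(vocab)

# create one binary document vector per line
for line in lines:
    words = set(tokenize(line))
    print([1 if w in words else 0 for w in vocab])
```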


Bag Of Words
All ordering of the words is nominally discarded.

New documents that overlap with the vocabulary of known words, but which may contain words outside of the vocabulary, can still be encoded: only the occurrence of known words is scored, and unknown words are ignored.
Bag Of Words
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good
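As a sketch, the corresponding count-based vectors can be produced with scikit-learn's CountVectorizer (assumes scikit-learn is installed; get_feature_names_out requires scikit-learn 1.0 or newer):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

vectorizer = CountVectorizer()          # lowercases and tokenizes by default
X = vectorizer.fit_transform(reviews)   # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # the shared vocabulary
print(X.toarray())                         # one count vector per review
                                           # note "is" is counted twice in Review 2
```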
Managing Vocabulary
As the vocabulary size increases, so does the length of the vector representation of each document.

This results in vectors with many zero scores, called sparse vectors or a sparse representation.

As such, there is pressure to decrease the size of the vocabulary when using a bag-of-words model.
Managing Vocabulary
1. Text cleaning (see the sketch below):
● Ignoring case
● Ignoring punctuation
● Ignoring stop words, like “a,” “of,” etc.
● Fixing misspelled words
● Reducing words to their stem

2. Creating a vocabulary of grouped words rather than individual words.
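An illustrative sketch of the cleaning steps (except spelling correction, which typically needs an additional library), assuming NLTK is installed and its stopwords corpus has been downloaded via nltk.download('stopwords'):

```python
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text):
    text = text.lower()                                                 # ignore case
    text = text.translate(str.maketrans("", "", string.punctuation))   # ignore punctuation
    tokens = [w for w in text.split() if w not in stop_words]          # drop stop words
    return [stemmer.stem(w) for w in tokens]                           # reduce to stems

print(clean("It was the age of foolishness!"))  # stop words removed, remaining words stemmed
```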


BOW
A bag-of-bigrams representation is
much more powerful than
bag-of-words, and in many cases
proves very hard to beat.

— Page 75, Neural Network Methods in Natural Language Processing, 2017.
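For example, a bag-of-bigrams can be produced with CountVectorizer's ngram_range parameter (a sketch, assuming scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["It was the best of times", "it was the worst of times"]

bigrams = CountVectorizer(ngram_range=(2, 2))   # count pairs of adjacent words
X = bigrams.fit_transform(docs)

print(bigrams.get_feature_names_out())   # 'best of', 'it was', 'of times', ...
print(X.toarray())
```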
TF IDF
A problem with scoring word frequency is that highly frequent words start to dominate in the document (i.e., they get larger scores), but they may not contain as much “informational content” for the model as rarer but perhaps domain-specific words.

One approach is to rescale the frequency of words by how often they appear across all documents, so that the scores for words like “the”, which are frequent in every document, are penalized.

This approach to scoring is called Term Frequency – Inverse Document Frequency, or TF-IDF for short.
TF IDF
Term Frequency: a scoring of how frequently the word occurs in the current document.

Inverse Document Frequency: a scoring of how rare the word is across documents.

Thus the IDF of a rare term is high, whereas the IDF of a frequent term is likely to be low.
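A small worked sketch of the weighting, using the textbook formulation tf-idf(t, d) = tf(t, d) * log(N / df(t)); note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant, so their exact numbers will differ:

```python
import math
from collections import Counter

docs = [
    "this movie is very scary and long",
    "this movie is not scary and is slow",
    "this movie is spooky and good",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# document frequency: in how many documents does each term occur?
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    tf = Counter(doc)                     # raw term counts in this document
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

# "this", "movie", "is", "and" occur in every document, so their idf is
# log(3/3) = 0 and they are fully penalized; rarer words score higher.
print(tfidf(tokenized[2]))
```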
Disadvantage
Bag-of-words and TF-IDF representations do not capture the semantics (meaning) of words.
Word Representation
To make a machine learn from raw text, we need to transform the data into a vector format. This transformation of raw text into a vector format is known as word representation.
Representing Words By Their context
Distributional semantics: a word’s meaning is given by the words that frequently appear close by.

“You shall know a word by the company it keeps”
(J. R. Firth 1957: 11)
Representing Words By Their context
One of the most successful ideas of modern statistical NLP!

When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).

Use the many contexts of w to build up a representation of w.
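A sketch of collecting the contexts of a word within a fixed-size window (the window size and the toy sentence here are just for illustration):

```python
from collections import Counter

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2   # fixed-size window: two words on each side

def contexts(tokens, target, window):
    ctx = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx.update(tokens[lo:i] + tokens[i + 1:hi])
    return ctx

# every occurrence of "the" contributes the words that appear nearby
print(contexts(sentence, "the", window))
```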


Representing Words By Their context
We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.

Word vectors are also called word embeddings or (neural) word representations.

They are a distributed representation.
Word2Vec
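As a sketch, word vectors can be trained with gensim's Word2Vec implementation (assumes gensim 4.x is installed; the corpus here is a toy list of tokenized sentences, so the resulting vectors are not meaningful):

```python
from gensim.models import Word2Vec

sentences = [
    ["it", "was", "the", "best", "of", "times"],
    ["it", "was", "the", "worst", "of", "times"],
    ["it", "was", "the", "age", "of", "wisdom"],
    ["it", "was", "the", "age", "of", "foolishness"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the dense word vectors
    window=2,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

print(model.wv["wisdom"].shape)          # a 50-dimensional dense vector
print(model.wv.most_similar("wisdom"))   # nearest neighbours by cosine similarity
```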
Skip Gram
CBOW
SkipGram

Unsupervised learning techniques or Semi Supervised

Target word is input while context words are output.

As there is more than one context word to be predicted which


makes this problem difficult.
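A sketch of how (target, context) training pairs are generated, pairing each center word with every word inside a fixed window (the window size is chosen arbitrarily here):

```python
sentence = ["it", "was", "the", "age", "of", "wisdom"]
window = 2

pairs = []
for i, target in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            pairs.append((target, sentence[j]))

# e.g. ('the', 'it'), ('the', 'was'), ('the', 'age'), ('the', 'of'), ...
print(pairs)
```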
Skip Gram VS CBOW
SkipGram
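The step-by-step walkthrough on these slides is figures; as a stand-in, here is a minimal NumPy sketch of one skip-gram forward pass with a full softmax (the toy vocabulary, embedding size, and initialization are illustrative assumptions, not the original slides' values):

```python
import numpy as np

np.random.seed(0)
vocab = ["it", "was", "the", "best", "of", "times"]
V, N = len(vocab), 4                   # vocabulary size, embedding dimension

W_in = np.random.randn(V, N) * 0.1     # input (center-word) embedding matrix
W_out = np.random.randn(N, V) * 0.1    # output (context-word) weight matrix

center = vocab.index("best")           # the target word is the input
h = W_in[center]                       # hidden layer = embedding lookup, shape (N,)
scores = h @ W_out                     # one score per vocabulary word, shape (V,)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

# the same distribution is used to predict every context word; training
# pushes probability mass toward the words actually observed in the window
print({w: round(float(p), 3) for w, p in zip(vocab, probs)})
```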
Backward propagation
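The derivation on these slides is likewise figures; the following self-contained NumPy sketch shows the corresponding gradient step for a full-softmax skip-gram objective (real Word2Vec implementations use negative sampling or hierarchical softmax instead, and all values here are illustrative):

```python
import numpy as np

np.random.seed(0)
vocab = ["it", "was", "the", "best", "of", "times"]
V, N, lr = len(vocab), 4, 0.1

W_in = np.random.randn(V, N) * 0.1
W_out = np.random.randn(N, V) * 0.1

center = vocab.index("best")
context = [vocab.index("the"), vocab.index("of")]   # observed context words

for _ in range(300):
    # forward pass
    h = W_in[center]
    scores = h @ W_out
    probs = np.exp(scores) / np.exp(scores).sum()

    # backward pass: cross-entropy loss summed over the context words,
    # so d(loss)/d(scores) = sum over context of (probs - one_hot(context))
    dscores = probs * len(context)
    for c in context:
        dscores[c] -= 1.0

    dW_out = np.outer(h, dscores)   # gradient for the output weights
    dh = W_out @ dscores            # gradient flowing back into the embedding

    W_out -= lr * dW_out
    W_in[center] -= lr * dh         # only the center word's embedding is updated

h = W_in[center]
probs = np.exp(h @ W_out) / np.exp(h @ W_out).sum()
print({w: round(float(p), 2) for w, p in zip(vocab, probs)})  # mass shifts toward "the" and "of"
```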
GloVe (Global Vectors)
References
https://towardsdatascience.com/text-vectorization-term-frequency-inverse-document-frequency-tfidf-5a3f9604da6d
https://aegis4048.github.io/demystifying_neural_network_in_skip_gram_language_modeling
https://aegis4048.github.io/optimize_computational_efficiency_of_skip-gram_with_negative_sampling
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/slides/cs224n-2021-lecture02-wordvecs2.pdf
