
Natural Language

Processing
LAB 5
Lab Outline:

► Feature Extraction
► Feature extraction techniques
► One Hot Encoding
► Bag of Words (BOW)
► N-gram
► TF-IDF
► Word Vectors (Word2Vec)
► Continuous Bag of Words (CBOW)
► Skip-Gram
Feature Extraction

► Feature
► is the name given to selected or processed data that is prepared to be used as
input to Machine Learning Algorithms. Features can be things like the price
of a house, the RGB value of a pixel or, in our case, the representation of a
word.
► Feature Extraction
► Is the process of transforming the raw text into a numerical representation
that can be processed by computers.
One Hot Encoding

► One Hot Encoding (generates a vector of Boolean values)


► Map each word to a unique id
► The ID vector is filled with 0s except for a 1 at the position associated with
the ID.
► Vector dimension = number of words in the vocabulary
► Example
► Vocab(set of unique words)=[‘dog’, ‘bites’, ‘man’]
► Sent 1: ‘dog bites man’=[[1,0,0],[0,1,0],[0,0,1]]
► Sent 2: ‘man bites dog’=[[0,0,1],[0,1,0],[1,0,0]]
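A minimal Python sketch (not from the slides) that reproduces the example above:

# Map each word to a unique id, then build the one-hot vector for that id.
vocab = ['dog', 'bites', 'man']
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)        # vector dimension = vocabulary size
    vec[word_to_id[word]] = 1     # 1 only at the position of the word's id
    return vec

print([one_hot(w) for w in 'dog bites man'.split()])  # [[1,0,0],[0,1,0],[0,0,1]]
print([one_hot(w) for w in 'man bites dog'.split()])  # [[0,0,1],[0,1,0],[1,0,0]]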
One Hot Encoding

► Drawbacks
► Size of input vector scales with size of vocabulary
► No relationship between words
► The resulting vectors are sparse (most entries are zero)
Bag of Words

► Bag of Words (BOW)


► In this method, each document is treated as a collection, or bag, of all the
words in it.
► It tells us how many times each word occurs in a document.
► Example
► Vocab(set of unique words)=[‘the’, ‘cat’, ‘sat’, ’on’, ‘hat’, ‘dog’, ‘ate’,
‘and’]
► Sent 1: ‘the cat sat on the hat’={2, 1, 1, 1, 1, 0, 0, 0}
► Sent 2: ‘the dog ate the cat and the hat’={3, 1, 0, 0, 1, 1, 1, 1}
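A minimal Python sketch (not from the slides) that reproduces the counts above:

# Count how many times each vocabulary word occurs in a sentence.
vocab = ['the', 'cat', 'sat', 'on', 'hat', 'dog', 'ate', 'and']

def bow(sentence):
    tokens = sentence.lower().split()
    return [tokens.count(w) for w in vocab]

print(bow('the cat sat on the hat'))           # [2, 1, 1, 1, 1, 0, 0, 0]
print(bow('the dog ate the cat and the hat'))  # [3, 1, 0, 0, 1, 1, 1, 1]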
Bag of Words

► Drawbacks
► No semantic relationship between words
► Not designed to model linguistic knowledge
► Sparsity
► Due to the high number of dimensions
► Curse of dimensionality
► When dimensionality increases, the distance between points becomes
less meaningful
Bag of N-grams

► N-gram
► is a contiguous sequence of word tokens from a text document: bi-grams (two
words), tri-grams (three words), and so on.
► It tells us how many times each phrase occurs in a document.
► Example
► Vocab(set of all n-grams in corpus)=[‘the cat’, ‘cat sat’, ‘sat on’, ‘on the’, ‘the hat’, ‘the dog’, ‘dog ate’, ‘ate the’, ‘cat and’, ‘and the’]
► Sent 1: ‘the cat sat on the hat’={1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
► Sent 2: ‘the dog ate the cat and the hat’={1, 0, 0, 0, 1, 1, 1, 1, 1, 1}
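A minimal Python sketch (not from the slides) of a bag of bi-grams over the two sentences:

# Extract contiguous bi-grams from a sentence and count them against the corpus vocab.
def bigrams(sentence):
    tokens = sentence.lower().split()
    return [f'{a} {b}' for a, b in zip(tokens, tokens[1:])]

sent1 = 'the cat sat on the hat'
sent2 = 'the dog ate the cat and the hat'
vocab = list(dict.fromkeys(bigrams(sent1) + bigrams(sent2)))  # ordered set of all bi-grams

print([bigrams(sent1).count(g) for g in vocab])  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print([bigrams(sent2).count(g) for g in vocab])  # [1, 0, 0, 0, 1, 1, 1, 1, 1, 1]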
Bag of N-grams

► Drawbacks
► Very large vocab set
► No notion of syntactic or semantic similarity between words.
Term Frequency-Inverse Document Frequency

► TF-IDF
► Captures importance of a word to a document in a corpus.
► Importance increases proportionally to the number of times a word appears
in the document, but is inversely proportional to the frequency of the word
in the corpus.
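A commonly used formulation (the exact weighting on the original slides may differ) is TF-IDF(t, d) = TF(t, d) × IDF(t), where TF(t, d) is the frequency of term t in document d and IDF(t) = log(N / df(t)), with N the number of documents and df(t) the number of documents containing t. A minimal Python sketch under this assumption:

import math

docs = [
    'the cat sat on the hat'.split(),
    'the dog ate the cat and the hat'.split(),
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)             # term frequency in this document
    df = sum(1 for d in docs if term in d)      # number of documents containing the term
    idf = math.log(len(docs) / df)              # inverse document frequency
    return tf * idf

print(tf_idf('dog', docs[1], docs))  # > 0: 'dog' is distinctive for document 2
print(tf_idf('the', docs[1], docs))  # 0.0: 'the' appears in every document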
Term Frequency-Inverse Document Frequency

► Drawbacks
► Based on the bag-of-words model, so it doesn’t capture word position, word
semantics, or co-occurrence across documents.
► Thus TF-IDF is only useful as a lexical-level feature.
Legacy Techniques Problem

► The previous feature extraction techniques represent words as discrete symbols
► Example
► in web search, if a user searches for “Seattle motel”, we would like to match
documents containing “Seattle hotel”; but:
► motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
► hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
► These two vectors are orthogonal. There is no natural notion of similarity for
one-hot vectors!
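For example, the dot product of the two one-hot vectors above is zero, so their cosine similarity is zero no matter how related the words are:

import numpy as np

motel = np.zeros(15); motel[10] = 1  # one-hot vector for 'motel'
hotel = np.zeros(15); hotel[7] = 1   # one-hot vector for 'hotel'

print(np.dot(motel, hotel))  # 0.0 -> orthogonal vectors, no notion of similarity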
Legacy Techniques Solution

► So, learn to encode similarity in the vectors themselves.


► The target is to represent words by their meaning: a word’s meaning is given by the words
that frequently appear close by.
► There are lexical resources such as the WordNet lexicon, which contains word
synonyms and hypernyms, but it has some problems: 1] it misses new meanings of
words, 2] it cannot compute accurate word similarity, and 3] a word’s synonyms may
differ based on the context.
► From this perspective, distributed representations, also known as word vectors, word
embeddings, or word representations, emerged.
Word Context

► When a word w appears in a text, its context is the set of words that appear
nearby (within a fixed-size window).
► Use the many contexts of w to build up a representation of w.
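A small sketch (an illustrative helper, not from the slides) of collecting the contexts of a word w within a fixed-size window:

# Collect the words within `window` positions on each side of every occurrence of `target`.
def contexts(tokens, target, window=2):
    ctx = []
    for i, tok in enumerate(tokens):
        if tok == target:
            ctx.append(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return ctx

print(contexts('the dog ate the cat and the hat'.split(), 'cat', window=2))
# [['ate', 'the', 'and', 'the']]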
Word Vectors/ Embeddings

► Word embeddings exhibit useful properties:
► if two words are similar, their vectors are close to each other in the representation, and
► analogous word pairs (e.g. opposite pairs, if they exist) have roughly the same vector offset.
► These properties help us find synonyms, analogies, etc.
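As a toy illustration (the 2-D vectors below are made up for this sketch, not real embeddings), similar words end up close together and analogy pairs share roughly the same offset:

import numpy as np

vec = {  # hypothetical 2-D 'embeddings', for illustration only
    'king':  np.array([0.9, 0.8]),
    'queen': np.array([0.9, 0.2]),
    'man':   np.array([0.5, 0.8]),
    'woman': np.array([0.5, 0.2]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec['king'], vec['queen']))                      # related words -> high similarity
print(vec['king'] - vec['man'], vec['queen'] - vec['woman'])  # analogy pairs -> same offset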
Word2Vec

► Word2vec is a framework for learning word vectors: it transforms text words into vectors.
► Word2vec itself is not a deep neural network; rather, it converts text into a numerical
form that deep neural networks can compute with.
► Purpose: to place the vectors of similar words close together in vector space.
► There are two methods presented in Word2Vec model
► Continuous bag of words (CBoW)
► Skip-gram
Continuous bag of words (CBoW)

► Predict the target word given the context words.


► Example: given a sentence and a window size of 2, the model predicts each word from
its surrounding context words (see the sketch below).

► Drawback: it can’t capture rare words well.

► So, the skip-gram algorithm was introduced.
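A minimal sketch (an assumed helper, not from the slides) of the (context -> target) training pairs CBoW uses with a window size of 2:

# For each position, the surrounding words (window 2) form the context; the word itself is the target.
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

for context, target in cbow_pairs('the cat sat on the hat'.split()):
    print(context, '->', target)
# e.g. ['cat', 'sat'] -> the, ['the', 'sat', 'on'] -> cat, ...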
Skip-gram

► Reverses format of CBoW


► Predicts a context given a target word
► The context is specified by the window length
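Continuing the CBoW pair sketch shown earlier, skip-gram simply reverses the direction: each training pair is (target word -> one context word):

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, ctx))
    return pairs

print(skipgram_pairs('the cat sat on the hat'.split())[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]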
Skip-gram

► Advantages
► It can capture rare words
► It captures the similarity of word semantics
► Synonyms like ‘intelligent’ and ‘smart’ would have very similar contexts.
Word2Vec Implementation

► First, install the gensim library: pip install gensim


► The example training input and its output are sketched below.
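A minimal end-to-end sketch, assuming gensim 4.x; the toy corpus and parameter values below are illustrative, not taken from the slides:

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ['dog', 'bites', 'man'],
    ['man', 'bites', 'dog'],
    ['the', 'cat', 'sat', 'on', 'the', 'hat'],
    ['the', 'dog', 'ate', 'the', 'cat', 'and', 'the', 'hat'],
]

# sg=0 selects CBoW, sg=1 selects Skip-gram; window sets the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv['dog'])                        # the learned 50-dimensional vector for 'dog'
print(model.wv.most_similar('dog', topn=3))   # nearest words in the vector space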
Word Vectors Applications

► Sentiment Analysis
► Speech Recognition
► Information Retrieval
► Question Answering
