Lab 5
Lab Outline:
► Feature Extraction
► Feature extraction techniques
► One Hot Encoding
► Bag of Words (BOW)
► N-gram
► TF-IDF
► Word Vectors (Word2Vec)
► Continuous Bag of Words (CBOW)
► Skip-Gram
Feature Extraction
► Feature
► A feature is selected or processed data prepared for use as input to a
machine learning algorithm. Features can be things like the price of a
house, the RGB value of a pixel, or, in our case, the representation of a
word.
► Feature Extraction
► is the process of transforming raw text into a numerical representation
that can be processed by computers.
One Hot Encoding
► Drawbacks
► Size of the input vector scales with the size of the vocabulary
► No notion of relationships between words
► Resulting vectors are sparse (most entries are zero), as the sketch
below illustrates
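A minimal one-hot encoding sketch in plain Python; the toy vocabulary is
illustrative, not from the lab. Note that each vector has exactly one
non-zero entry and its length equals the vocabulary size.

# One-hot encoding over a toy vocabulary (pure Python).
vocab = sorted({"the", "cat", "sat", "on", "hat"})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's vocabulary index.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

for w in ["cat", "hat"]:
    print(w, one_hot(w))
# cat [1, 0, 0, 0, 0]   (sorted vocab: cat, hat, on, sat, the)
# hat [0, 1, 0, 0, 0]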
Bag of Words
► Drawbacks
► No semantic relationship between words
► Not designed to model linguistic knowledge
► Sparsity
► Due to the high number of dimensions
► Curse of dimensionality
► As dimensionality increases, the distance between points becomes
less meaningful
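To make the model concrete, here is a minimal bag-of-words sketch using
scikit-learn's CountVectorizer (assuming scikit-learn is installed); the
two-sentence corpus matches the n-gram example used later.

# Bag-of-words document-term matrix (scikit-learn assumed installed).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the hat",
    "the dog ate the cat and the hat",
]

vectorizer = CountVectorizer()          # default: unigram word counts
X = vectorizer.fit_transform(corpus)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['and' 'ate' 'cat' 'dog' 'hat' 'on' 'sat' 'the']
print(X.toarray())
# [[0 0 1 0 1 1 1 2]
#  [1 1 1 1 1 0 0 3]]

The matrix gains one column per vocabulary word, which is why real corpora
produce the very high-dimensional, sparse vectors noted above.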
Bag of N-grams
► N-gram
is a contiguous sequence of word tokens from a text document: bi-grams
(two words), tri-grams (three words), and so on.
► The bag-of-n-grams model counts how many times each such phrase occurs
in a document.
► Example
► Vocab (set of all bi-grams in the corpus) = ['the cat', 'cat sat',
'sat on', 'on the', 'the hat', 'the dog', 'dog ate', 'ate the',
'cat and', 'and the']
► Sent 1: 'the cat sat on the hat' = {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
► Sent 2: 'the dog ate the cat and the hat' = {1, 0, 0, 0, 1, 1, 1, 1, 1, 1}
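The same example can be reproduced by switching CountVectorizer to
bi-grams (a sketch, assuming scikit-learn; the vectorizer sorts its
vocabulary alphabetically, so the column order differs from the slide).

# Bag of bi-grams over the two example sentences.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the hat",
    "the dog ate the cat and the hat",
]

# ngram_range=(2, 2) keeps bi-grams only; (1, 2) would also mix in unigrams.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(corpus)

print(bigram_vectorizer.get_feature_names_out())
# ['and the' 'ate the' 'cat and' 'cat sat' 'dog ate' 'on the' 'sat on'
#  'the cat' 'the dog' 'the hat']
print(X.toarray())
# [[0 0 0 1 0 1 1 1 0 1]
#  [1 1 1 0 1 0 0 1 1 1]]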
Bag of N-grams
► Drawbacks
► Very large vocab set
► No notion of syntactic or semantic similarity between words.
Term Frequency-Inverse Document Frequency
► TF-IDF
► Captures the importance of a word to a document in a corpus.
► Importance increases proportionally with the number of times a word
appears in the document, but is inversely proportional to the frequency
of the word in the corpus.
Term Frequency-Inverse Document Frequency
► One common formulation:
► TF(t, d) = (number of times term t appears in document d) / (total
number of terms in d)
► IDF(t) = log(N / number of documents containing t), where N is the
total number of documents in the corpus
► TF-IDF(t, d) = TF(t, d) × IDF(t)
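A minimal sketch of these formulas in plain Python. Library
implementations such as scikit-learn's TfidfVectorizer use a smoothed IDF
and normalization, so their numbers differ slightly from this direct form.

# TF-IDF computed directly from the definitions above (pure Python).
import math

corpus = [
    "the cat sat on the hat".split(),
    "the dog ate the cat and the hat".split(),
]
N = len(corpus)

def tf(term, doc):
    # Term frequency: share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log(N / document frequency of `term`).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# 'the' appears in every document, so its IDF (and TF-IDF) is 0;
# 'sat' appears only in document 0, so it scores higher there.
print(tf_idf("the", corpus[0]))  # 0.0
print(tf_idf("sat", corpus[0]))  # (1/6) * log(2) ≈ 0.1155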
► Drawbacks
► Based on the bag-of-words model, so it does not capture position in
text, semantics, or co-occurrence across documents.
► Thus TF-IDF is only useful as a lexical-level feature.
Legacy Techniques Problem
► The techniques above treat words as independent symbols and ignore
context. Word vectors address this with a distributional idea:
► When a word w appears in a text, its context is the set of words that
appear nearby (within a fixed-size window), as sketched below.
► Use the many contexts of w to build up a representation of w.
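A small sketch of what "context within a fixed-size window" means (pure
Python; the window size of 2 is an illustrative choice).

# Collect context words within a fixed-size window around each
# occurrence of a target word.
def contexts(tokens, target, window=2):
    ctxs = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            ctxs.append(left + right)
    return ctxs

tokens = "the dog ate the cat and the hat".split()
print(contexts(tokens, "cat"))
# [['ate', 'the', 'and', 'the']]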
Word Vectors/ Embeddings
► Advantages
► It can capture rare words
► It captures the similarity of word semantics
► Synonyms such as ‘intelligent’ and ‘smart’ have very similar contexts.
Word2Vec Implementation
► Input: one-hot vector(s) for the context words (CBOW) or for the target
word (skip-gram)
► Output: a probability distribution over the vocabulary for the predicted
word(s)
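A minimal training sketch with gensim (assuming gensim 4.x is installed;
the corpus and hyperparameters are toy values). The sg flag selects
between the two architectures in the outline: sg=0 for CBOW, sg=1 for
skip-gram.

# Train a toy Word2Vec model with gensim (assumed installed, 4.x API).
from gensim.models import Word2Vec

sentences = [
    "the cat sat on the hat".split(),
    "the dog ate the cat and the hat".split(),
    "the dog sat on the cat".split(),
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=0,             # 0 = CBOW, 1 = skip-gram
)

print(model.wv["cat"].shape)         # (50,)
print(model.wv.most_similar("cat"))  # nearest words by cosine similarity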
Word Vectors Applications
► Sentiment Analysis
► Speech Recognition
► Information Retrieval
► Question Answering