04 - Text Representation
Natural Language Processing
Contents
• Vector Space Models
• Basic Vectorization Approaches
• One-Hot Encoding
• Bag of Words
• Bag of N-Grams
• TF-IDF
• Distributed Representations
• Word Embedding
• CBOW
• SkipGram
• Visualizing Embeddings
Introduction
• Feature extraction is an important step for any machine learning
problem.
• How do we transform a given text into numerical form so that it can be
fed into NLP and ML algorithms?
Feature engineering
• Feature engineering steps convert the raw data into a
format that can be consumed by a machine.
Numerical Representations of Data
• Speech: a numerical array representing the
amplitude of a sound wave at fixed time intervals
Text representation
• Text representation:
Given a piece of text → find a scheme to represent it mathematically
Vector Space Models - VSM
• VSM: Vector Space Model
• The text data must be converted into some mathematical form.
• Represent text units (characters, phonemes, words, phrases, sentences, paragraphs, and documents) with vectors of numbers.
• Toy corpus: four documents D1, D2, D3, D4
• Vocabulary of this corpus: [dog, bites, man, eats, meat, food]
One-Hot Encoding
• D1 ("dog bites man") is represented as
[[1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]]
dog bites man
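A minimal sketch of one-hot encoding for this toy vocabulary (plain Python; names such as `vocab` and `one_hot` are illustrative, not from the course notebook):

```python
# Toy vocabulary from the slides: each word gets an index.
vocab = ["dog", "bites", "man", "eats", "meat", "food"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional one-hot vector for a known word."""
    vec = [0] * len(vocab)
    vec[word2id[word]] = 1
    return vec

def encode(document):
    """Represent a document as a list of one-hot vectors (one per word)."""
    return [one_hot(w) for w in document.lower().split()]

print(encode("Dog bites man"))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
```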
One-Hot Encoding
https://github.com/practical-nlp/practical-nlp-code/tree/master/Ch3
One-Hot Encoding
• Pros
• Intuitive to understand and straightforward to implement
• Cons
• The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies.
• It does not give a fixed-length representation for text. For most learning algorithms, we need the feature vectors to be of the same length.
• It treats words as atomic units and has no notion of (dis)similarity between words. For example: "run", "ran", and "apple" are all equally distant from one another.
• Out-of-vocabulary (OOV) problem:
• "man eats fruits": the training data didn't include "fruits", so there's no way to represent it in our model.
Basic Vectorization Approaches
• One-Hot Encoding
• Bag of Words
• Bag of N-Grams
• TF-IDF
Bag of Words
Bag of Words - BoW
• Represent the text under consideration as a bag (collection) of words
while ignoring the order and context
• assumes that the text belonging to a given class in the dataset is
characterized by a unique set of words. If two text pieces have nearly the
same words, then they belong to the same bag (class).
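A minimal bag-of-words sketch using scikit-learn's CountVectorizer. The four toy documents are assumed to be "Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food" (consistent with the vocabulary above, though only D1 is shown explicitly on the slides):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumed wording, consistent with the slide vocabulary).
corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

# Bag of words: counts per word, ignoring order and context.
bow = CountVectorizer()
X = bow.fit_transform(corpus)

print(bow.get_feature_names_out())   # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(X.toarray())                   # one count vector per document
```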
TF-IDF
• TF: term frequency
• IDF: inverse document frequency
• → TF-IDF score = TF * IDF
TF (term frequency)
• TF: measures how often a term or word occurs in a given document
• TF of a term t in a document d is defined as:
TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)
IDF (inverse document frequency)
• IDF: measures the importance of the term across a corpus
• weighs down the terms that are very common across a corpus
• weighs up the rare terms
• IDF of a term t is calculated as follows:
IDF(t) = log( (total number of documents in the corpus) / (number of documents containing term t) )
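A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer on the same assumed toy corpus (note that scikit-learn uses a smoothed IDF variant, so the exact weights differ slightly from the formulas above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

# TF-IDF: term frequency weighted by inverse document frequency.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))          # one TF-IDF weighted vector per document
print(tfidf.idf_.round(2))           # learned IDF weight per vocabulary term
```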
TF-IDF
• Distributional hypothesis
• words that occur in similar contexts have similar meanings
• "dog" and "cat" occur in similar contexts, so there must be a strong similarity between the meanings of these two words.
• VSM: if two words often occur in similar contexts, then their corresponding representation vectors must also be close to each other
Distributed Representations
• Distributional representation
• Representation schemes that are obtained from the distribution of words over the contexts in which they appear
• These schemes are based on the distributional hypothesis
• The distributional property is induced from context
• Distributional representation schemes use high-dimensional vectors to represent words
• The dimensionality of these vectors (one dimension per vocabulary word) is equal to the size of the vocabulary of the corpus
• one-hot, bag of words, bag of n-grams, and TF-IDF
Distributed Representations
• Distributed representation
• based on the distributional hypothesis
• distributed representation schemes significantly compress the
dimensionality.
• This results in vectors that are compact (i.e., low dimensional) and
dense (i.e., hardly any zeros)
Distributed Representations
• Embedding
• For the set of words in a corpus, an embedding is a mapping from the vector space of its distributional representation to the (lower-dimensional) vector space of its distributed representation
• Vector semantics
• This refers to the set of NLP methods that aim to learn the word
representations based on distributional properties of words in a large
corpus
Embedding
Word Embeddings
• Distributional similarities between words
• given the word “USA,” distributionally similar words could be other countries (e.g.,
Canada, Germany, India, etc.) or cities in the USA
• given the word “beautiful,” words that share some relationship with this word
(e.g., synonyms, antonyms) could be considered distributionally similar words.
• These are words that are likely to occur in similar contexts
• "Word2vec" is a word representation model based on "distributional similarity"
• Word2Vec
• CBOW
• Skip-gram
CBOW
• CBOW: predict the “center” word from the words in its context.
CBOW
• Run a sliding window of size 2k+1 over the text corpus (here k = 2)
• Each window position marks a set of 2k+1 words
• The center word in the window is the target
• The k words on either side of the center word form the context
→ one data point (X, Y)
• the context is the X
• the target word is the Y
• Shift the window one word to the right over the corpus and repeat the process → get the next data point (see the sketch below)
• Window size: 2k+1 = 5
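A minimal sketch of how such (context, target) training pairs could be generated with a sliding window (plain Python; function and variable names are illustrative):

```python
def cbow_pairs(tokens, k=2):
    """Slide a window of size 2k+1 over the token list.
    X = the 2k context words, Y = the center (target) word."""
    pairs = []
    for i in range(k, len(tokens) - k):
        context = tokens[i - k:i] + tokens[i + 1:i + k + 1]
        target = tokens[i]
        pairs.append((context, target))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
for X, Y in cbow_pairs(tokens, k=2):
    print(X, "->", Y)
# e.g. ['the', 'quick', 'fox', 'jumps'] -> brown
```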
Neural network
CBOW
• word embedding dimension n = 4
• vocabulary size |V| = 6
• window size 2k+1 = 5 (k = 2)
• input layer: each Word_i ∈ ℝ^{|V|×1} is a one-hot vector
• hidden layer: weight matrices p ∈ ℝ^{|V|×n} and q ∈ ℝ^{n×|V|}
• output layer: a vector of probabilities O ∈ ℝ^{|V|×1} specifies the likelihood of each word being the target word for the given context window
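A minimal sketch of this CBOW architecture in PyTorch, included only as an illustration of the two weight matrices p and q (not the course's reference implementation):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Minimal CBOW: average the context word embeddings (matrix p),
    then project to vocabulary-sized scores (matrix q)."""
    def __init__(self, vocab_size=6, embed_dim=4):
        super().__init__()
        self.p = nn.Embedding(vocab_size, embed_dim)   # |V| x n
        self.q = nn.Linear(embed_dim, vocab_size)      # n x |V|

    def forward(self, context_ids):                    # (batch, 2k)
        hidden = self.p(context_ids).mean(dim=1)       # (batch, n)
        return self.q(hidden)                          # (batch, |V|) scores

model = CBOW()
context = torch.tensor([[0, 1, 3, 4]])                 # 2k = 4 context word ids
scores = model(context)
probs = torch.softmax(scores, dim=-1)                  # likelihood of each word being the target
print(probs.shape)                                     # torch.Size([1, 6])
```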
SkipGram
• predict the context words from the center word
SkipGram: Prepare data set
• A vector of probabilities O ∈ ℝ^{|V|×1} in the output layer specifies the likelihood of each word appearing in the context window
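SkipGram training pairs can be derived from the same sliding window by flipping the roles of center and context words; a short illustrative sketch (names are not from the course materials):

```python
def skipgram_pairs(tokens, k=2):
    """For each window, the center word is the input X and each of its
    2k context words becomes a separate target Y."""
    pairs = []
    for i in range(k, len(tokens) - k):
        center = tokens[i]
        for j in range(i - k, i + k + 1):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(tokens)[:4])
# [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```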
PRE-TRAINED WORD EMBEDDINGS
• Some of the most popular pre-trained embeddings:
• Word2vec by Google
• GloVe by Stanford
• Fasttext embeddings by Facebook
• available for various dimensions like d = 25, 50, 100, 200, 300, 600
• most_similar('beautiful'): finds the words that are semantically most similar to "beautiful"
• The last line returns the embedding vector of the word "beautiful"
• Word Embeddings
• Word2Vec
• FastText
• GloVe
• Distributed Representations Beyond Words and Characters
• Doc2vec
• Doc2vec is based on the paragraph vectors framework [21] and is implemented in gensim
• Universal Text Representations
• ELMo
• BERT
Doc2vec
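A minimal gensim Doc2Vec sketch (the toy documents and hyperparameters are illustrative, not the course's settings):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document gets a tag so it receives its own vector.
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

# Train paragraph vectors (Doc2vec) on the tagged documents.
model = Doc2Vec(tagged, vector_size=20, window=2, min_count=1, epochs=100)

print(model.dv[0])                                  # learned vector for document 0
print(model.infer_vector("man eats meat".split()))  # vector for an unseen document
```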
Visualizing Embeddings
Visualizing Embeddings
• https://projector.tensorflow.org/
Visualizing Embeddings
Visualizing MNIST data using t-SNE
Visualization of Wikipedia document vectors
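A sketch of one common way to visualize embeddings in 2-D with t-SNE (scikit-learn and matplotlib; the loaded model and word list are placeholders, not the figures shown on the slides):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE
import gensim.downloader as api

# Placeholder model: any gensim KeyedVectors model works here.
wv = api.load("glove-wiki-gigaword-50")
words = ["dog", "cat", "man", "woman", "king", "queen", "food", "meat"]
X = np.array([wv[w] for w in words])

# Project the 50-dimensional vectors down to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```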
Exercise
• Given a text corpus, the aim is to learn embeddings for every word in the corpus such that the word vector in the embedding space best captures the meaning of the word. To "derive" the meaning of a word, Word2vec uses distributional similarity and the distributional hypothesis. That is, it derives the meaning of a word from its context: the words that appear in its neighborhood in the text. So, if two different words (often) occur in similar contexts, then it's highly likely that their meanings are also similar. Word2vec operationalizes this by projecting the meanings of words into a vector space where words with similar meanings tend to cluster together and words with very different meanings are far from one another.
• Word2vec takes a large corpus of text as input and "learns" to represent the words in a common vector space based on the contexts in which they appear in the corpus.
• For every word w in the corpus, we start with a vector v_w initialized with random values. The Word2vec model refines the values in v_w by predicting v_w given the vectors of the words in its context C. It does this using a two-layer neural network.
PRE-TRAINED WORD EMBEDDINGS
• Word embeddings are trained on a large corpus, such as Wikipedia, news articles, or even the entire web.
• Such embeddings can be thought of as a large collection of key-value pairs, where the keys are the words in the vocabulary and the values are their corresponding word vectors.
• Some of the most popular pre-trained embeddings are Word2vec by Google [8], GloVe by Stanford [9], and fasttext embeddings by Facebook [10], to name a few. Further, they're available for various dimensions like d = 25, 50, 100, 200, 300, 600.
Pre_Trained_Word_Embeddings.ipynb
• Find the words that are semantically most similar to the word "beautiful"
• The last line returns the embedding vector of the word "beautiful"
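A sketch of what such a notebook cell might look like with gensim's downloader API (the exact model name and loading path used in the course notebook may differ):

```python
import gensim.downloader as api

# Load pre-trained Google News Word2vec vectors (d = 300).
w2v = api.load("word2vec-google-news-300")

# Words that are semantically most similar to "beautiful".
print(w2v.most_similar("beautiful"))

# The last line returns the embedding vector of the word "beautiful".
print(w2v["beautiful"])
```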
TRAINING OUR OWN EMBEDDINGS
• Two architectural variants were proposed in the original Word2vec approach:
• Continuous bag of words (CBOW)
• SkipGram
CBOW
• The primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
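A minimal sketch of training our own embeddings with gensim's Word2Vec, where sg=0 selects CBOW and sg=1 selects SkipGram (the toy sentences and hyperparameters are illustrative):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real run would use a much larger corpus.
sentences = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "eats", "meat"],
    ["man", "eats", "food"],
]

# sg=0: CBOW (predict the center word from its context);
# sg=1 would use SkipGram (predict the context from the center word).
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0, epochs=100)

print(model.wv["dog"])                 # learned embedding for "dog"
print(model.wv.most_similar("dog"))    # nearest words in the embedding space
```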