
Text Representation
Natural Language Processing
Contents
• Vector Space Models
• Basic Vectorization Approaches
• One-Hot Encoding
• Bag of Words
• Bag of N-Grams
• TF-IDF
• Distributed Representations
• Word Embedding
• CBOW
• SkipGram
• Visualizing Embeddings
Introduction
• Feature extraction is an important step for any machine learning
problem.
• How do we transform a given text into numerical form so that it can be
fed into NLP and ML algorithms?
Feature engineering
• Feature engineering steps convert the raw data into a
format that can be consumed by a machine.

• Two different approaches are taken in practice for feature engineering:
• (1) a classical NLP and traditional ML pipeline
• (2) a DL pipeline
Classical NLP/ML Pipeline
• Task of sentiment classification on product reviews in e-commerce
(positive or negative)
• Count the number of positive and negative words in each review
Feature engineering
• Classical NLP/ML Pipeline
• Usually handcrafted in the classical ML pipeline
• Statistical measures for understanding if a feature is useful for a task
or not
• The features are heavily inspired by the task at hand as well as domain
knowledge
• The model remains interpretable—it’s possible to quantify exactly
how much each feature is influencing the model prediction

• Model performance and the model development cycle depend on the quality of the handcrafted features
• A noisy or unrelated feature can potentially harm the model’s performance
• Classical NLP/ML Pipeline
DL Pipeline
• The raw data is directly fed to a model
• (after pre-processing)
• The model is capable of “learning” features from the data.
• improve performance
• all these features are learned via model parameters
• the model loses interpretability
• It’s very hard to explain a DL model’s prediction
• Example
• sentiment classification
• identify an email as ham or spam
Text Representation
Natural Language Processing
Numerical Representations of Data
• Image:
• Build a classifier that can distinguish images of cats from images of dogs
• An image is stored as a matrix of pixel values, where cell[i,j] is the intensity of the pixel at row i, column j
Numerical Representations of Data
• Speech: a numerical array representing the
amplitude of a sound wave at fixed time intervals
Text representation
• Text representation:
Given a piece of text → find a scheme to represent it mathematically
Vector Space Models - VSM
• VSM: Vector Space Model
• The text data must be converted into
some mathematical form.
• represent text units (characters,
phonemes, words, phrases,
sentences, paragraphs, and
documents) with vectors of
numbers.

• VSM is a mathematical model that represents text units as vectors
VSM
• A common way to calculate the similarity between two text blobs is cosine similarity
• cos(0°) = 1
• cos(180°) = –1
• the cosine decreases monotonically from 0° to 180°
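To make the cosine computation concrete, here is a minimal NumPy sketch; the two vectors are made-up toy term counts, not taken from any particular corpus.

```python
# Toy illustration of cosine similarity between two term-count vectors.
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); 1 = same direction, -1 = opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([1, 1, 1, 0, 0, 0])   # counts over a 6-word vocabulary (made up)
d2 = np.array([0, 0, 1, 0, 1, 1])

print(cosine_similarity(d1, d2))    # value in [-1, 1]; here about 0.33
```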
Basic Vectorization Approaches
+ One-Hot Encoding
+ Bag of Words
+ Bag of N-Grams
+ TF-IDF
• text corpus
• document
• vocabulary (V)
• |V|-dimensional vectors, e.g., [0,0,0,0,0,0] / [0,1,0,1,0,1] / [1,1,1,1,1,1] (|V| = 6)

• toy corpus: four documents
D1: "Dog bites man", D2: "Man bites dog", D3: "Dog eats meat", D4: "Man eats food"
• vocabulary of this corpus:
[dog, bites, man, eats, meat, food]
One-Hot Encoding

• The vocabulary of this corpus comprises six words:
V = [dog, bites, man, eats, meat, food]
• Each word w is given a unique integer ID w_id with a value between 1 and |V|
• Each word is then represented by a |V|-dimensional binary vector of 0s and 1s
• Vector of w:
• value 1 at index w_id
• value 0 at all other indices
One-Hot Encoding
V = [dog, bites, man, eats, meat, food]
• Map each of the six words to unique IDs:
dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6
• Dog is represented as [1 0 0 0 0 0]
([1-dog, 0-bites, 0-man, 0-meat, 0-food, 0-eats])
• Bites is represented as [0 1 0 0 0 0]

• D1 is represented as
[[1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]]
dog bites man
One-Hot Encoding

https://github.com/practical-nlp/practical-nlp-code/tree/master/Ch3
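A minimal sketch of the scheme just described, assuming the toy vocabulary and the 1-based word IDs from these slides; see the linked notebook for a fuller treatment.

```python
# One-hot encoding for the toy vocabulary; word IDs follow the slides (1-based).
vocab = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1          # 1 at position word_id, 0 everywhere else
    return vec

def encode_document(doc):
    # A document becomes a list of one-hot vectors, one per word (no fixed length).
    return [one_hot(w) for w in doc.lower().split()]

print(encode_document("Dog bites man"))
# [[1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0]]
# An out-of-vocabulary word (e.g., "fruits") would raise a KeyError here,
# which is exactly the OOV limitation discussed below.
```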
One-Hot Encoding
• Pros and cons
• intuitive to understand and straightforward to implement
• The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies.
• It does not give a fixed-length representation for text; most learning algorithms need feature vectors of the same length.
• It treats words as atomic units and has no notion of (dis)similarity
between words.
• For example: the one-hot vectors for run and ran are no more similar to each other than those for run and apple.
• Out Of Vocabulary (OOV) problem:
• "man eats fruits": the training data didn't include "fruits", so there's no way to represent the new sentence in our model
Basic Vectorization Approaches
+ One-Hot Encoding
+ Bag of Words
+ Bag of N-Grams
+ TF-IDF
Bag of Words
Bag of Words - BoW
• Represent the text under consideration as a bag (collection) of words
while ignoring the order and context
• assumes that the text belonging to a given class in the dataset is
characterized by a unique set of words. If two text pieces have nearly the
same words, then they belong to the same bag (class).

• BoW maps each word to a unique integer ID between 1 and |V|.
• Each document → converted into a vector of |V| dimensions, where the value at index i is the number of times the word with ID i occurs in the document
→ simply score each word in V by its occurrence count in the document
Bag of Words - BoW
• word IDs: dog = 1, bites = 2, man = 3, meat = 4 , food = 5, eats = 6
• D1 becomes [1 1 1 0 0 0]
• D4 becomes [0 0 1 0 1 1]

• "Dog and dog are friends." → the word "dog" occurs twice, every other count is 0
• With an alphabetically ordered vocabulary (as CountVectorizer produces in the linked notebook), this is [0 2 0 0 0 0]
https://github.com/practical-nlp/practical-nlp-code/tree/master/Ch3
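A hedged sketch of BoW with scikit-learn's CountVectorizer, assuming the four toy documents listed earlier. CountVectorizer builds its vocabulary in alphabetical order rather than using the manual IDs above, which is why "dog and dog are friends" comes out as [0 2 0 0 0 0].

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

count_vect = CountVectorizer()
bow = count_vect.fit_transform(corpus)        # document-term count matrix

print(count_vect.vocabulary_)                 # word -> column index (alphabetical)
print(bow.toarray())                          # one |V|-dimensional count vector per document
print(count_vect.transform(["dog and dog are friends"]).toarray())  # "dog" counted twice
```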
Bag of Words - BoW
• Advantages of this encoding
• BoW is fairly simple to understand and implement.
• documents having the same words will have their vector
representations closer to each other.
• The vector space resulting from the BoW scheme captures the
semantic similarity of documents. So if two documents have similar
vocabulary, they’ll be closer to each other in the vector space and vice
versa.
• We have a fixed-length encoding for any sentence of arbitrary length.
Bag of Words - BoW
• Disadvantages
• The size of the vector increases with the size of the vocabulary → a common workaround is limiting the vocabulary to the n most frequent words
• It does not capture the similarity between different words that mean
the same thing.
• “I run”, “I ran”, and “I ate”.
• BoW vectors of all three documents will be equally apart.
• Problem of Out Of Vocabulary words
• "bag" of words: word order information is lost in this representation (D1 "Dog bites man" and D2 "Man bites dog" get the same vector)
Basic Vectorization Approaches
+ One-Hot Encoding
+ Bag of Words
+ Bag of N-Grams
+ TF-IDF
• A classic Vietnamese example of why word order matters: the same five words, Sao, bảo, nó, không, đến (roughly "why, tell, he/she, not, come"), can be rearranged into many different, meaningful questions:

• Sao không bảo nó đến?
• Bảo nó sao không đến?
• Đến sao không bảo nó?
• Không bảo sao nó đến?
• Nó bảo sao không đến?
•…
Bag of N-Grams - BoN
• Breaking text into chunks of n contiguous words (or tokens)
• Each chunk is called an n-gram
• Each document → represented by a vector of length |V|, where V is now the set of n-grams in the corpus
• This vector contains the frequency counts of the n-grams present in the document, and zero for the n-grams that are not present
Bigram: 2-gram
• {dog bites, bites man, man bites, bites dog,
dog eats, eats meat, man eats, eats food}

• BoN representation: eight-dimensional vector for each document


• D1 : [1,1,0,0,0,0,0,0], D2 : [0,0,1,1,0,0,0,0]

• BoW scheme is a special case of the BoN scheme, with n=1


• n=1: unigram
• n=2: “bigram model”
• n=3: “trigram model”
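A sketch of the bigram case using CountVectorizer's ngram_range parameter, again assuming the toy corpus; the learned bigram vocabulary matches the eight bigrams listed above (though in alphabetical order).

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams + bigrams.
bigram_vect = CountVectorizer(ngram_range=(2, 2))
bon = bigram_vect.fit_transform(corpus)

print(bigram_vect.vocabulary_)   # 8 bigrams -> column indices
print(bon.toarray()[0])          # D1: two bigrams present ("dog bites", "bites man")
```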
Bag of N-Grams - BoN
• pros and cons of BoN
• It captures some context and word-order information in the form
of n-grams
• resulting vector space is able to capture some semantic similarity
• As n increases, dimensionality (and therefore sparsity) increases rapidly
• OOV problem
Basic Vectorization Approaches
+ One-Hot Encoding
+ Bag of Words
+ Bag of N-Grams
+ TF-IDF
TF-IDF
• One Hot, BoW, BoN: all the words in the text are treated as equally
important—there’s no notion of some words in the document being
more important than others
• TF-IDF: term frequency–inverse document frequency
• quantify the importance of a given word relative to other words in the
document and in the corpus
TF-IDF
• if a word w
• appears many times in a document di
• but does not occur much in the rest of the documents dj in the
corpus
• then the word w
• must be of great importance to the document di

• TF
• IDF
• → TF-IDF score = TF*IDF
TF (term frequency)
• TF: measures how often a term or word occurs in a given document
• TF of a term t in a document d is defined as:
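The formula image did not survive the slide export; the standard definition, consistent with the worked example on a later slide (TF = 1/3 for "dog" in D1), is:

$$\mathrm{TF}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}$$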
IDF (inverse document frequency)
• IDF: measures the importance of the term across a corpus
• weighs down the terms that are very common across a corpus
• weighs up the rare terms
• IDF of a term t is calculated as follows:
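Again the formula image is missing; a definition consistent with the worked example (IDF = log2(4/3) for "dog", which appears in 3 of the 4 documents) is:

$$\mathrm{IDF}(t) = \log_2\!\left(\frac{N}{\text{number of documents containing term } t}\right), \quad N = \text{total number of documents in the corpus}$$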
TF-IDF

• TF-IDF score = TF * IDF



Let’s compute TF-IDF scores for our toy corpus
• The size of our corpus is N=4
• Vocabulary:
• V: [dog, bites, man, eats, meat, food]
• |V| = 6
• D1: "Dog bites man"
• Dog:
• TF = 1/3 ≈ 0.33
• IDF = log2(4/3) ≈ 0.41
• TF-IDF = TF * IDF = (1/3) × log2(4/3) ≈ 0.138
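A small sketch that reproduces this hand calculation with the formulas above (note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of IDF, so their numbers differ slightly):

```python
import math

corpus = {
    "D1": "dog bites man",
    "D2": "man bites dog",
    "D3": "dog eats meat",
    "D4": "man eats food",
}
N = len(corpus)

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)          # occurrences / total terms

def idf(term):
    df = sum(1 for doc in corpus.values() if term in doc.split())
    return math.log2(N / df)                       # log2(N / document frequency)

term, doc = "dog", corpus["D1"]
print(tf(term, doc), idf(term), tf(term, doc) * idf(term))
# ~0.333, ~0.415, ~0.138
```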
Example
Distributed Representations
+ Word Embeddings
Distributed Representations
• Distributional similarity
• the meaning of a word can be understood from the context in which the word
appears
• connotation: meaning is defined by context

• Distributional hypothesis
• words that occur in similar contexts have similar meanings
• “dog” and “cat” occur in similar contexts - there must be a strong similarity
between the meanings of these two words.
• VSM: if two words often occur in similar context, then their corresponding
representation vectors must also be close to each other
Distributed Representations
• Distributional representation
• representation schemes that are obtained based on distribution of words from the
context in which the words appear
• These schemes are based on distributional hypotheses
• The distributional property is induced from context
• distributional representation schemes use high-dimensional vectors to represent words
• The dimensionality of these vectors equals the size of the corpus vocabulary
• Examples: one-hot encoding, bag of words, bag of n-grams, and TF-IDF
Distributed Representations
• Distributed representation
• based on the distributional hypothesis
• distributed representation schemes significantly compress the
dimensionality.
• This results in vectors that are compact (i.e., low dimensional) and
dense (i.e., hardly any zeros)
Distributed Representations
• Embedding
• For the set of words in a corpus, an embedding is a mapping from the vector space of a distributional representation to the (lower-dimensional) vector space of a distributed representation

• Vector semantics
• This refers to the set of NLP methods that aim to learn the word
representations based on distributional properties of words in a large
corpus
Embedding
Word Embeddings
• Distributional similarities between words
• given the word “USA,” distributionally similar words could be other countries (e.g.,
Canada, Germany, India, etc.) or cities in the USA
• given the word “beautiful,” words that share some relationship with this word
(e.g., synonyms, antonyms) could be considered distributionally similar words.
• These are words that are likely to occur in similar contexts
• Word representation model known as “Word2vec,”
based on “distributional similarity”

King – Man + Woman ≈ Queen


Word2vec
Word2vec
• Word2vec:
• captures semantically rich relationships between words
• the learned word representations are low dimensional (typically 50–500 dimensions, much smaller than |V|)
• dense (most values in these vectors are non-zero)
• These representations are also called “embeddings.”

• Word2Vec
• CBOW
• Skip-gram
CBOW
• CBOW: predict the “center” word from the words in its context.
CBOW
• run a sliding window of size 2k+1 over the text corpus (k=2)
• Each position: marks the set of 2k+1 words
• The center word in the window is the target,
• k words on either side of the center word form the context
→ one data point (X, Y)
• context is the X
• the target word is the Y
• shift the window to the right on the corpus by one word and
repeat the process → get the next data point
• Window size: 2k+1 = 5
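A sketch of this sliding-window data preparation on a made-up tokenized sentence (k = 2, so the window size is 2k+1 = 5):

```python
def cbow_pairs(tokens, k=2):
    """Slide a window of size 2k+1; the context words are X, the center word is Y."""
    pairs = []
    for i in range(k, len(tokens) - k):
        context = tokens[i - k:i] + tokens[i + 1:i + k + 1]   # 2k context words
        target = tokens[i]                                    # center word
        pairs.append((context, target))
    return pairs

tokens = "the cat jumped over the dog".split()
for X, Y in cbow_pairs(tokens, k=2):
    print(X, "->", Y)
# ['the', 'cat', 'over', 'the'] -> jumped
# ['cat', 'jumped', 'the', 'dog'] -> over
```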
Neural network
CBOW
• CBOW: predict the “center” word from the words in its context.
CBOW
• word embedding dimension n = 4
• vocabulary size |V| = 6
• window size 5 (k = 2)
• input layer: each context word Word_i ∈ ℝ^(|V| × 1) is a one-hot vector
• hidden layer: weight matrices P ∈ ℝ^(|V| × n) and q ∈ ℝ^(n × |V|)
• output layer: a vector of probabilities O ∈ ℝ^(|V| × 1) specifies the likelihood of each word being the target word in the context window
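A minimal PyTorch sketch of this architecture, assuming P is realized as an embedding matrix of shape |V| × n, the hidden layer averages the context word vectors, and q maps the result back to |V| scores; it is an illustration, not a reference implementation.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size=6, embed_dim=4):
        super().__init__()
        self.P = nn.Embedding(vocab_size, embed_dim)   # |V| x n: one row per word
        self.q = nn.Linear(embed_dim, vocab_size)      # n -> |V| scores

    def forward(self, context_ids):                    # context_ids: (batch, 2k) word indices
        h = self.P(context_ids).mean(dim=1)            # average of the context word vectors
        return self.q(h)                               # logits; softmax/cross-entropy at train time

model = CBOW()
context = torch.tensor([[0, 1, 3, 4]])                 # 2k = 4 context word IDs (toy values)
logits = model(context)                                # shape (1, |V|)
print(torch.softmax(logits, dim=-1))                   # probability of each word being the target
```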
CBOW
• CBOW: predict the “center” word from the words in its context.
SkipGram
• predict the context words from the center word
SkipGram: Prepare data set

• run a sliding window of size 2k+1 over the text corpus
→ each position marks a set of 2k+1 words
• the center word in the window is the X
• the k words on either side of the center word are the Y
→ 2k data points per window position
• shift the window to the right on the corpus by one word
• repeat the process
• Window size = 5
• word embedding
dimension n = 4,
• vocabulary size |V| = 6
• window size 5 (k = 2).

• a vector of probabilities
O ∈ |V| × 1 in the output
layer specifies the
likelihood of each word to
appear in the context
window
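A sketch of the skip-gram data preparation, mirroring the CBOW sketch earlier but with X and Y swapped: the center word is the input and each of the 2k context words becomes a separate label.

```python
def skipgram_pairs(tokens, k=2):
    """For each window position, emit 2k (center, context) training pairs."""
    pairs = []
    for i in range(k, len(tokens) - k):
        center = tokens[i]
        for j in range(i - k, i + k + 1):
            if j != i:
                pairs.append((center, tokens[j]))   # X = center word, Y = one context word
    return pairs

tokens = "the cat jumped over the dog".split()
print(skipgram_pairs(tokens, k=2))
# [('jumped', 'the'), ('jumped', 'cat'), ('jumped', 'over'), ('jumped', 'the'),
#  ('over', 'cat'), ('over', 'jumped'), ('over', 'the'), ('over', 'dog')]
```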
PRE-TRAINED WORD EMBEDDINGS
• Some of the most popular pre-trained embeddings:
• Word2vec by Google
• GloVe by Stanford
• Fasttext embeddings by Facebook

• available for various dimensions like d = 25, 50, 100, 200, 300, 600
• most_similar('beautiful') finds the words most similar to "beautiful"
• the last line returns the embedding vector of the word "beautiful"
• Word Embeddings
• Word2Vec
• FastText
• GloVe
• Distributed Representations Beyond Words and Characters
• Doc2vec
• Doc2vec is based on the paragraph vectors framework [21] and is implemented in gensim
• Universal Text Representations
• ELMo
• BERT
Doc2vec
Visualizing Embeddings
Visualizing Embeddings
• https://projector.tensorflow.org/
Visualizing Embeddings
Visualizing MNIST data using t-SNE
Visualization of Wikipedia document vectors
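A hedged sketch of the kind of 2-D projection behind these visualizations, using scikit-learn's t-SNE and matplotlib; w2v_model is assumed to be a gensim KeyedVectors object like the pre-trained model loaded later in these slides, and the word list is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed: w2v_model is an already-loaded gensim KeyedVectors object.
words = ["king", "queen", "man", "woman", "dog", "cat", "paris", "london"]
vectors = np.array([w2v_model[w] for w in words])

# Project the high-dimensional word vectors down to 2-D.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```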
Exercise
• Given a text corpus, the aim is to learn embeddings for every word in the corpus such that the word vector in the embedding space best captures the meaning of the word. To "derive" the meaning of the word, Word2vec uses distributional similarity and the distributional hypothesis. That is, it derives the meaning of a word from its context: words that appear in its neighborhood in the text. So, if two different words (often) occur in similar contexts, then it's highly likely that their meanings are also similar. Word2vec operationalizes this by projecting the meaning of the words in a vector space where words with similar meanings tend to cluster together, and words with very different meanings are far from one another.
• Word2vec takes a large corpus of text as input and "learns" to represent the words in a common vector space based on the contexts in which they appear in the corpus.
• For every word w in the corpus, we start with a vector v_w initialized with random values. The Word2vec model refines the values in v_w by predicting v_w, given the vectors for the words in the context C. It does this using a two-layer neural network.
PRE-TRAINED WORD EMBEDDINGS
• training word embeddings on a large corpus, such as Wikipedia, news articles, or even the entire web
• Such embeddings can be thought of as a large collection of key-value pairs, where keys are the words in the vocabulary and values are their corresponding word vectors
• Some of the most popular pre-trained embeddings are Word2vec by Google [8], GloVe by Stanford [9], and fastText embeddings by Facebook [10], to name a few. Further, they're available for various dimensions like d = 25, 50, 100, 200, 300, 600.
Pre_Trained_Word_Embeddings.ipynb
• find the words that are semantically most similar to the word "beautiful"
• the last line returns the embedding vector of the word "beautiful"
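A hedged sketch of what the notebook does with gensim; the file name is a placeholder for a local copy of the pre-trained Google News Word2vec vectors (d = 300).

```python
from gensim.models import KeyedVectors

# Placeholder path: replace with your local copy of the pre-trained vectors.
w2v_model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Words semantically most similar to "beautiful"
print(w2v_model.most_similar("beautiful"))

# The embedding vector of the word "beautiful" (a 300-dimensional array)
print(w2v_model["beautiful"])
```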
TRAINING OUR OWN EMBEDDINGS
• two architectural variants were proposed in the original Word2vec approach:
• Continuous bag of words (CBOW)
• SkipGram
CBOW
• the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears

• What is a language model?
• It is a (statistical) model that tries to give a probability distribution over sequences of words. Given a sentence of, say, m words, it assigns a probability Pr(w_1, w_2, ..., w_m) to the whole sentence.
• The objective of a language model is to assign probabilities in such a way that it gives high probability to "good" sentences and low probabilities to "bad" sentences.
• By good, we mean sentences that are semantically and syntactically correct. By bad, we mean sentences that are incorrect, semantically or syntactically or both.
• So, for a sentence like "The cat jumped over the dog," it will try to assign a probability close to 1.0, whereas for a sentence like "jumped over the the cat dog," it tries to assign a probability close to 0.0.
• CBOW tries to learn a language model that predicts the "center" word from the words in its context.
• CBOW uses the context words to predict the target word ("jumped" in the example sentence above)
• we run a sliding window of size 2k+1 over the text corpus
• D-dim word
embeddings
• let V be the
vocabulary of
the text corpus
• Skip-gram model architecture with word embedding dimension n = 4, vocabulary size |V| = 6, and window size 5 (c = 2). The input layer is a one-hot encoding I ∈ ℝ^(|V| × 1) denoting the target word in the context window. In the hidden layer, after multiplying I with the vocabulary matrix P ∈ ℝ^(|V| × n), the resulting vector is h ∈ ℝ^(n × 1). After multiplying h with the output weight matrix q ∈ ℝ^(n × |V|) in the output layer and sending the result vector to a softmax function, a vector of probabilities O ∈ ℝ^(|V| × 1) in the output layer specifies the likelihood of each word to appear in the context window.
• CBOW model architecture with word embedding dimension n = 4, vocabulary size |V| = 6, and window size 5 (c = 2). It consists of three layers: input layer, hidden layer, and output layer. In the input layer, each I_i ∈ ℝ^(|V| × 1) is a one-hot encoding vector of a context word in the context window surrounding the target word; in the hidden layer, each one-hot vector I_i^T is multiplied against the vocabulary matrix P ∈ ℝ^(|V| × n) to select the matrix row that represents the context word; h ∈ ℝ^(n × 1) is the average of the context word vectors. After multiplying h with the output weight matrix q ∈ ℝ^(n × |V|) and sending the result to a softmax function, a vector of probabilities O ∈ ℝ^(|V| × 1) in the output layer specifies the likelihood of each word to be the target word in the context window.
CBOW
• Note that Word2vec "learns" to predict neighboring words from the current word (and vice versa). This means:
• Word2Vec is deep learning, specifically a neural network (NN)
• We have to actually train it, rather than just compute counts as in the previous methods
• Word2Vec represents each word by a vector of some dimension N (for example, 300). It is trained, with the weights optimized, so that words that are "close" to each other (measured by the distance between their two word vectors) are words that often appear together in the same context, or synonyms.

• https://towardsdatascience.com/an-overview-for-text-representations-in-nlp-311253730af1
• https://aws.amazon.com/vi/ec2/instance-types/t2/
• https://aws.amazon.com/vi/ec2/pricing/on-demand/
• https://console.aws.amazon.com/billing/home?region=us-east-1#/
• https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:
• GENSIM: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
• https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial
• https://thorpham.github.io/blog/2018/04/24/word2vec/
• https://viblo.asia/p/xay-dung-mo-hinh-khong-gian-vector-cho-tieng-viet-GrLZDXr2Zk0
• https://github.com/QuangPH1/FramgiaBlog/tree/master/Blog01_Word_embedding
