Text Mining - Vectorization


Text Vectorization

Hina Arora
TextVectorization.ipynb
• Text Vectorization is the process of converting text into a numerical
representation

• Some popular methods to accomplish text vectorization:


o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF
o Word2Vec
o etc
Binary Term Frequency
• Captures the presence (1) or absence (0) of a term in a document
• token_pattern = '(?u)\\b\\w\\w+\\b'

The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is
completely ignored and always treated as a token separator).

• lowercase = True

• stop_words = ‘english’

• max_df (default 1.0):


When building the vocabulary ignore terms that have a document frequency strictly higher
than the given threshold. If float, the parameter represents a proportion of documents, if
integer, the parameter represents absolute counts.

• min_df (default 1):


When building the vocabulary ignore terms that have a document frequency strictly lower
than the given threshold. If float, the parameter represents a proportion of documents, if
integer, the parameter represents absolute counts.

• max_features (default None):


If not None, build a vocabulary that only considers the top max_features ordered by term
frequency across the corpus.

• ngram_range (default (1,1)):


The lower and upper boundary of the range of n-values for different n-grams to be
extracted. All values of n such that min_n <= n <= max_n will be used.
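
The parameters listed above correspond to scikit-learn's CountVectorizer, so binary term
frequency can be sketched as follows (a minimal sketch; the toy corpus and parameter values
are illustrative and not taken from TextVectorization.ipynb):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps in the sun.",
    "A quick brown dog barks.",
]

vectorizer = CountVectorizer(
    binary=True,                        # presence (1) / absence (0)
    token_pattern=r"(?u)\b\w\w+\b",     # default: tokens of 2+ alphanumeric characters
    lowercase=True,
    stop_words="english",
    max_df=1.0,                         # drop terms in more than this fraction of documents
    min_df=1,                           # drop terms in fewer than this many documents
    max_features=None,
    ngram_range=(1, 1),
)

X = vectorizer.fit_transform(corpus)    # sparse document-term matrix of 0s and 1s
print(vectorizer.get_feature_names_out())
print(X.toarray())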
Bag of Words (BoW) Term Frequency
• Captures frequency of term in document
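
A minimal BoW sketch, again assuming scikit-learn: the same CountVectorizer with binary=False
(the default) stores raw term counts rather than presence/absence (toy corpus for illustration):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the dog chased the dog", "the cat slept"]

bow = CountVectorizer()                 # counts, not just presence/absence
X = bow.fit_transform(corpus)
print(bow.get_feature_names_out())      # ['cat' 'chased' 'dog' 'slept' 'the']
print(X.toarray())                      # 'dog' is counted twice in the first document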
(L1) Normalized Term Frequency
• Captures normalized BoW term frequency in document
• TF typically L1-normalized
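
A minimal sketch of L1 normalization, assuming scikit-learn's TfidfVectorizer with IDF turned
off: each document's counts are divided by their sum, so every row sums to 1:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the dog chased the dog", "the cat slept"]

tf_l1 = TfidfVectorizer(use_idf=False, norm="l1")   # L1-normalized term frequencies
X = tf_l1.fit_transform(corpus)
print(X.toarray())                                  # each row sums to 1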
(L2) Normalized TFIDF
• Captures normalized TFIDF of term in document
• TFIDF typically L2-normalized
• Number of documents in corpus: N

• Number of documents in corpus with term t: Nt

• Term Frequency of term t in document d: TF(t, d)


o Bag of Words (BoW) Term Frequency
o The more frequent a term is, the higher the TF
o With sublinear TF: log(TF) + 1

• Inverse Document Frequency of term t in corpus: IDF(t) = log[N/Nt] + 1


o Measures how common a term is among all documents.
o The more common a term is, the lower its IDF.
o With smoothing: IDF(t) = log[(1+N)/(1+ Nt)] + 1

• TFIDF(t, d) = Term Frequency * Inverse Document Frequency = TF(t, d) * IDF(t)


o If a term appears frequently in a document, it's important - give the term a high score.
o If a term appears in many documents, it's not a unique identifier - give the term a low score.

• The TFIDF score is then often L2-normalized (L1 normalization can also be considered)
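
A minimal sketch of the formulas above, assuming scikit-learn's TfidfVectorizer (its defaults
already use the smoothed IDF and L2 normalization; sublinear_tf=True would apply log(TF) + 1):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the dog chased the dog", "the cat slept", "a quick brown dog"]

tfidf = TfidfVectorizer(
    norm="l2",            # L2-normalize each document vector
    smooth_idf=True,      # IDF(t) = log((1+N)/(1+Nt)) + 1
    sublinear_tf=False,   # set True for TF -> log(TF) + 1
)
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(np.round(X.toarray(), 3))
print(np.linalg.norm(X.toarray(), axis=1))   # every row has unit L2 norm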


Word2Vec
• Captures embedded representation of terms

References:
Mikolov et al. (2013), Distributed Representations of Words and Phrases and their Compositionality
Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space
Typical text representations provide localized representations of words:
o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF

ngrams try to capture some level of contextual information, but don't really do a great job.
• Word2Vec provides a distributed (embedded) representation of words

• Start with a one-hot encoded (OHE) representation of all words in the corpus

• Train a neural network (NN) with one hidden layer on a very large corpus. The rows of the
resulting hidden-layer weight matrix are then used as the word vectors.

• One of two methods is typically used to train the NN (see the gensim sketch after the
diagram below):


o Continuous Bag of Words (CBOW): Predict the vector representation of the center/target
word based on a window of context words.
o Skip-Gram (SG): Predict the vector representations of the window of context words based
on the center/target word.
[Slide diagram: a window of context words w(t-2), w(t-1), w(t+1), w(t+2) surrounding the
center/target word w(t)]

"You shall know a word by the company it keeps." *Quote by J. R. Firth
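
A minimal training sketch using gensim's Word2Vec (gensim is an assumed library choice here;
the tiny corpus, window size, and vector size are illustrative only). The sg parameter switches
between the two training methods:

from gensim.models import Word2Vec

sentences = [
    ["you", "shall", "know", "a", "word", "by", "the", "company", "it", "keeps"],
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embedded vectors
    window=2,          # context window: w(t-2) ... w(t+2)
    min_count=1,       # keep every word in this tiny corpus
    sg=0,              # 0 = CBOW, 1 = Skip-Gram
)

vec = model.wv["word"]   # 100-dimensional vector for the token "word"
print(vec.shape)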
Several factors influence the quality of the word vectors, including:

• Amount and quality of the training data.


If you don't have enough data, you may be able to use pre-trained vectors created by others (for
instance, Google has shared a model trained on ~100 billion words from its News data; the
model contains 300-dimensional vectors for 3 million words and phrases). If you do end up using
pre-trained vectors, make sure their training data domain is similar to the data you're working
with (a minimal loading sketch follows this list).

• Size of the embedded vectors


In general, quality increases with higher dimensionality, but marginal gains typically diminish after
a threshold. Typically, the dimensionality of the vectors is set to be between 100 and 1000.

• Training algorithm
Typically, CBOW trains faster and has slightly better accuracy for frequent words. SG works
well with small amounts of training data and does a good job representing rare words or
phrases.
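
A minimal sketch of using pre-trained vectors via gensim's downloader API instead of training
from scratch (an assumed setup; the model name below refers to the Google News model mentioned
above and triggers a large download on first use):

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # 300-dimensional vectors for 3M words/phrases
print(wv["computer"].shape)                 # (300,)
print(wv.most_similar("computer", topn=3))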
Once we have the embedded vectors for each word, we can use them in downstream NLP tasks,
for instance:

• Compute similarity using cosine similarity between word vectors

• Create higher-order representations (sentence/document) using a weighted average of the
word vectors and feed them to a classification task (as sketched below)
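
A minimal downstream sketch, reusing the gensim model trained above: cosine similarity between
two word vectors, and an unweighted average of word vectors as a simple sentence/document
representation (a TFIDF-weighted average is a common refinement):

import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# word-to-word similarity
sim = cosine_similarity(model.wv["quick"], model.wv["lazy"])

# sentence/document vector as the (unweighted) average of its word vectors;
# out-of-vocabulary tokens are simply skipped in this sketch
def sentence_vector(tokens, wv):
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

doc_vec = sentence_vector(["the", "quick", "brown", "fox"], model.wv)
print(sim, doc_vec.shape)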
