Text Mining - Vectorization
Hina Arora
TextVectorization.ipynb
• Text Vectorization is the process of converting text into a numerical
representation
• Typical preprocessing options include lowercasing and stop-word removal
(see the sketch below), for example:
  o lowercase = True
  o stop_words = 'english'
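A minimal sketch of these options in use (assuming scikit-learn's CountVectorizer and a toy corpus; this is not taken from the referenced notebook):

# Hypothetical toy example, assuming scikit-learn; not the notebook's code.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The Cat sat on the mat", "A dog sat on the rug"]

# lowercase=True folds case; stop_words='english' drops common English words
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(docs)          # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # e.g. ['cat' 'dog' 'mat' 'rug' 'sat']
print(X.toarray())                          # term counts per document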
References:
Mikolov et al. (2013), Distributed Representations of Words and Phrases and their Compositionality
Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space
Typical text representations provide localized representations of words (see the sketch after this list):
o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF
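As an illustration (a sketch, not from the original notebook), all four representations above can be produced with scikit-learn's CountVectorizer / TfidfVectorizer on a tiny toy corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["The cat sat on the mat.", "The dog sat on the rug."]

# Binary Term Frequency: 1 if the term appears in the document, else 0
binary_tf = CountVectorizer(lowercase=True, stop_words='english', binary=True)

# Bag of Words (BoW) Term Frequency: raw count of each term per document
bow_tf = CountVectorizer(lowercase=True, stop_words='english')

# (L1) Normalized Term Frequency: counts scaled so each document's row sums to 1
l1_tf = TfidfVectorizer(lowercase=True, stop_words='english', use_idf=False, norm='l1')

# (L2) Normalized TFIDF: tf-idf weights scaled to unit Euclidean length per document
l2_tfidf = TfidfVectorizer(lowercase=True, stop_words='english', norm='l2')

for name, vec in [("binary", binary_tf), ("bow", bow_tf),
                  ("l1_tf", l1_tf), ("l2_tfidf", l2_tfidf)]:
    X = vec.fit_transform(corpus)           # sparse document-term matrix
    print(name, vec.get_feature_names_out(), X.toarray(), sep="\n")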
• Train a neural network (with 1 hidden layer) on a very large corpus of text. The rows of the
resulting hidden-layer weight matrix are then used as the word vectors (a toy sketch follows below).
[Diagram: surrounding context words are used to predict the center/target word]
"You shall know a word by the company it keeps."*
*Quote by J. R. Firth
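To make the idea concrete, here is a toy sketch (assuming PyTorch; the corpus, window size, and dimensions are made up and this is not the slides' or notebook's code): a skip-gram-style network with a single hidden layer is trained to predict a context word from the center/target word, and the rows of the hidden-layer weight matrix are taken as the word vectors.

import torch
import torch.nn as nn

# Toy corpus and vocabulary
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for s in tokens for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# (center, context) training pairs with a context window of 1
pairs = [(idx[s[i]], idx[s[j]])
         for s in tokens
         for i in range(len(s))
         for j in (i - 1, i + 1) if 0 <= j < len(s)]

V, D = len(vocab), 10                   # vocabulary size, embedding dimension
hidden = nn.Embedding(V, D)             # hidden-layer weight matrix (V x D); acts like one-hot input x weights
output = nn.Linear(D, V, bias=False)    # output layer scoring every vocabulary word
opt = torch.optim.SGD(list(hidden.parameters()) + list(output.parameters()), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([o for _, o in pairs])
for _ in range(200):                    # a few passes over the tiny dataset
    opt.zero_grad()
    logits = output(hidden(centers))    # predict the context word from the center word
    loss_fn(logits, contexts).backward()
    opt.step()

# Rows of the hidden-layer weight matrix are the learned word vectors
word_vectors = hidden.weight.detach()   # shape (V, D); row i belongs to vocab[i]
print(vocab)
print(word_vectors[idx["cat"]])

A full word2vec implementation adds tricks such as negative sampling and subsampling of frequent words (see the references above); this sketch only shows the one-hidden-layer architecture.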
Several factors influence the quality of the word vectors, including:
• Training algorithm: Continuous Bag of Words (CBOW) vs. Skip-Gram (SG).
Typically, CBOW trains faster and has slightly better accuracy for frequent words. SG works
well with small amounts of training data, and does a good job of representing rare words and
phrases. (A sketch contrasting the two follows below.)
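As a hedged illustration (assuming the gensim library, which the slides do not mention), both training algorithms are exposed through a single flag of its Word2Vec class:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=0 -> CBOW: predict the center word from its context (faster; frequent words)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram: predict the context from the center word (small corpora; rare words)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                    # (50,) embedding for "cat"
print(skipgram.wv.most_similar("cat", topn=3)) # nearest neighbours by cosine similarity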
Once we have the embedding vector for each word, we can use these vectors for NLP
tasks, for instance: