Text Mining - Vectorization


Text Vectorization

Hina Arora
TextVectorization.ipynb
• Text Vectorization is the process of converting text into a numerical
representation

• Some popular methods to accomplish text vectorization:


o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF
o Word2Vec
o etc
Binary Term Frequency
• Captures the presence (1) or absence (0) of a term in a document
• token_pattern = '(?u)\\b\\w\\w+\\b'

The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is
completely ignored and always treated as a token separator).

• lowercase = True

• stop_words = ‘english’

• max_df (default 1.0):


When building the vocabulary ignore terms that have a document frequency strictly higher
than the given threshold. If float, the parameter represents a proportion of documents, if
integer, the parameter represents absolute counts.

• min_df (default 1):


When building the vocabulary ignore terms that have a document frequency strictly lower
than the given threshold. If float, the parameter represents a proportion of documents, if
integer, the parameter represents absolute counts.

• max_features (default None):


If not None, build a vocabulary that only considers the top max_features ordered by term
frequency across the corpus.

• ngram_range (default (1,1)):


The lower and upper boundary of the range of n-values for different n-grams to be
extracted. All values of n such that min_n <= n <= max_n will be used.
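
The parameters listed above correspond to scikit-learn's CountVectorizer, so binary term
frequency can be sketched as follows (a minimal sketch; the toy corpus and parameter values
are illustrative and not taken from TextVectorization.ipynb):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps in the sun.",
    "A quick brown dog barks.",
]

vectorizer = CountVectorizer(
    binary=True,                        # presence (1) / absence (0)
    token_pattern=r"(?u)\b\w\w+\b",     # default: tokens of 2+ alphanumeric characters
    lowercase=True,
    stop_words="english",
    max_df=1.0,                         # drop terms in more than this fraction of documents
    min_df=1,                           # drop terms in fewer than this many documents
    max_features=None,
    ngram_range=(1, 1),
)

X = vectorizer.fit_transform(corpus)    # sparse document-term matrix of 0s and 1s
print(vectorizer.get_feature_names_out())
print(X.toarray())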
Bag of Words (BoW) Term Frequency
• Captures frequency of term in document
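
A minimal BoW sketch, again assuming scikit-learn: the same CountVectorizer with binary=False
(the default) stores raw term counts rather than presence/absence (toy corpus for illustration):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the dog chased the dog", "the cat slept"]

bow = CountVectorizer()                 # counts, not just presence/absence
X = bow.fit_transform(corpus)
print(bow.get_feature_names_out())      # ['cat' 'chased' 'dog' 'slept' 'the']
print(X.toarray())                      # 'dog' is counted twice in the first document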
(L1) Normalized Term Frequency
• Captures normalized BoW term frequency in document
• TF typically L1-normalized
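
A minimal sketch of L1 normalization, assuming scikit-learn's TfidfVectorizer with IDF turned
off: each document's counts are divided by their sum, so every row sums to 1:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the dog chased the dog", "the cat slept"]

tf_l1 = TfidfVectorizer(use_idf=False, norm="l1")   # L1-normalized term frequencies
X = tf_l1.fit_transform(corpus)
print(X.toarray())                                  # each row sums to 1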
(L2) Normalized TFIDF
• Captures normalized TFIDF of term in document
• TFIDF typically L2-normalized
• Number of documents in corpus: N

• Number of documents in corpus with term t: Nt

• Term Frequency of term t in document d: TF(t, d)


o Bag of Words (BoW) Term Frequency
o The more frequent a term is, the higher the TF
o With sublinear TF: log(TF) + 1

• Inverse Document Frequency of term t in corpus: IDF(t) = log[N/Nt] + 1


o Measures how common a term is among all documents.
o The more common a term is, the lower its IDF.
o With smoothing: IDF(t) = log[(1+N)/(1+ Nt)] + 1

• TFIDF(t, d) = Term Frequency * Inverse Document Frequency = TF(t, d) * IDF(t)


o If a term appears frequently in a document, it's important - give the term a high score.
o If a term appears in many documents, it's not a unique identifier - give the term a low score.

• The TFIDF score is then often L2-normalized (L1 normalization can also be considered)
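
A minimal sketch of the formulas above, assuming scikit-learn's TfidfVectorizer (its defaults
already use the smoothed IDF and L2 normalization; sublinear_tf=True would apply log(TF) + 1):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the dog chased the dog", "the cat slept", "a quick brown dog"]

tfidf = TfidfVectorizer(
    norm="l2",            # L2-normalize each document vector
    smooth_idf=True,      # IDF(t) = log((1+N)/(1+Nt)) + 1
    sublinear_tf=False,   # set True for TF -> log(TF) + 1
)
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(np.round(X.toarray(), 3))
print(np.linalg.norm(X.toarray(), axis=1))   # every row has unit L2 norm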


Word2Vec
• Captures embedded representation of terms

References:
Mikolov et al. (2013), Distributed Representations of Words and Phrases and their Compositionality
Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space
Typical text representations provide localized representations of words:
o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF

ngrams try to capture some level of contextual information, but don't really do a great job.
• Word2Vec provides a distributed (embedded) representation of words

• Start with a one-hot encoded (OHE) representation of all words in the corpus

• Train a neural network (NN) with one hidden layer on a very large corpus. The rows of the
resulting hidden-layer weight matrix are then used as the word vectors.

• One of two methods is typically used to train the NN (see the gensim sketch after the
diagram below):


o Continuous Bag of Words (CBOW): Predict the vector representation of the center/target
word based on a window of context words.
o Skip-Gram (SG): Predict the vector representations of the window of context words based
on the center/target word.
[Slide diagram: a window of context words w(t-2), w(t-1), w(t+1), w(t+2) surrounding the
center/target word w(t)]

"You shall know a word by the company it keeps." *Quote by J. R. Firth
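
A minimal training sketch using gensim's Word2Vec (gensim is an assumed library choice here;
the tiny corpus, window size, and vector size are illustrative only). The sg parameter switches
between the two training methods:

from gensim.models import Word2Vec

sentences = [
    ["you", "shall", "know", "a", "word", "by", "the", "company", "it", "keeps"],
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embedded vectors
    window=2,          # context window: w(t-2) ... w(t+2)
    min_count=1,       # keep every word in this tiny corpus
    sg=0,              # 0 = CBOW, 1 = Skip-Gram
)

vec = model.wv["word"]   # 100-dimensional vector for the token "word"
print(vec.shape)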
Several factors influence the quality of the word vectors, including:

• Amount and quality of the training data.


If you don't have enough data, you may be able to use pre-trained vectors created by others (for
instance, Google has shared a model trained on ~100 billion words from its News data; the
model contains 300-dimensional vectors for 3 million words and phrases). If you do end up using
pre-trained vectors, make sure their training data domain is similar to the data you're working
with (a minimal loading sketch follows this list).

• Size of the embedded vectors


In general, quality increases with higher dimensionality, but marginal gains typically diminish after
a threshold. Typically, the dimensionality of the vectors is set to be between 100 and 1000.

• Training algorithm
Typically, CBOW trains faster and has slightly better accuracy for frequent words. SG works
well with small amounts of training data and does a good job representing rare words or
phrases.
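
A minimal sketch of using pre-trained vectors via gensim's downloader API instead of training
from scratch (an assumed setup; the model name below refers to the Google News model mentioned
above and triggers a large download on first use):

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # 300-dimensional vectors for 3M words/phrases
print(wv["computer"].shape)                 # (300,)
print(wv.most_similar("computer", topn=3))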
Once we have the embedded vectors for each word, we can use them in downstream NLP tasks,
for instance:

• Compute similarity using cosine similarity between word vectors

• Create higher-order representations (sentence/document) using a weighted average of the
word vectors and feed them to a classification task (as sketched below)
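
A minimal downstream sketch, reusing the gensim model trained above: cosine similarity between
two word vectors, and an unweighted average of word vectors as a simple sentence/document
representation (a TFIDF-weighted average is a common refinement):

import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# word-to-word similarity
sim = cosine_similarity(model.wv["quick"], model.wv["lazy"])

# sentence/document vector as the (unweighted) average of its word vectors;
# out-of-vocabulary tokens are simply skipped in this sketch
def sentence_vector(tokens, wv):
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

doc_vec = sentence_vector(["the", "quick", "brown", "fox"], model.wv)
print(sim, doc_vec.shape)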
