Introductory Sheet

The document discusses various text vectorization techniques in Natural Language Processing (NLP), including Bag of Words, TF-IDF, Word Embeddings, Sentence Embeddings, and Transformer-Based Models. Each method is evaluated based on its ability to capture word order, context awareness, and computational cost. It concludes with recommendations for beginners on which methods to start with based on task complexity.

When working with text in Natural Language Processing (NLP), we need to convert words and sentences into numerical representations that computers can understand. This process is called text vectorization or embedding. Below are different ways to convert a text sentence into a numerical vector, ranging from simple to advanced methods.

1. Bag of Words (BoW)

 This is one of the simplest techniques.
 Each word in a sentence is treated as a unique feature.
 A sentence is represented as a vector, where each dimension corresponds to a word in the vocabulary.
 The value in each dimension is the word count (or sometimes just 1 if the word is present).

Example:
Sentences:

1. "I love NLP"


2. "I love machine learning"

Vocabulary = {I, love, NLP, machine, learning}

Sentence I love NLP machine learning


"I love NLP" 11 1 0 0
"I love machine learning" 1 1 0 1 1

🔹 Pros: Simple and interpretable


🔹 Cons: Ignores word order and meaning, leads to high-dimensional vectors
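
A minimal Python sketch of Bag of Words, using scikit-learn's CountVectorizer (scikit-learn is not mentioned above and is assumed to be installed; it is just one common way to do this):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP", "I love machine learning"]

# Build the vocabulary and count each word per sentence.
# The relaxed token_pattern keeps one-letter words like "I",
# which CountVectorizer drops by default.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # learned vocabulary (lowercased, alphabetical)
print(X.toarray())                         # one count vector per sentence, as in the table above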

2. Term Frequency - Inverse Document Frequency (TF-IDF)

 This method assigns a weight to each word based on how important it is in a document compared to all documents.
 Formula: TF-IDF = Term Frequency (TF) × Inverse Document Frequency (IDF)
 Words that appear frequently in one document but rarely in others get higher importance.

Example:

If "NLP" appears 10 times in a document but is rare in the overall dataset, it gets a high TF-
IDF score.

🔹 Pros: Reduces the impact of common words (like "the", "is")


🔹 Cons: Still ignores word order; vectors become high-dimensional and sparse for large vocabularies
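
A minimal sketch of TF-IDF with scikit-learn's TfidfVectorizer (again assuming scikit-learn is installed; the three toy documents are made up purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "NLP helps machines understand text",
    "I love NLP",
    "I love machine learning",
]

# Each word gets a weight = term frequency x inverse document frequency.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Words shared by several documents (like "love") receive lower weights
# than words concentrated in a single document.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))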

3. Word Embeddings (Word2Vec, GloVe, FastText)

 Instead of one-hot vectors, embeddings represent words as dense vectors in a continuous space.
 Word2Vec: Uses neural networks to learn word relationships (Skip-gram and CBOW models).
 GloVe: Learns word representations based on word co-occurrence in a large corpus.
 FastText: Extends Word2Vec by considering subwords, making it better for morphologically rich languages.

Example:

 "king" → [0.45, 0.32, -0.12, ..., 0.56]


 "queen" → [0.42, 0.30, -0.10, ..., 0.58]
 "king - man + woman ≈ queen" (shows semantic relationships)

🔹 Pros: Captures word meaning, relations, and context


🔹 Cons: Requires large amounts of training data
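
A minimal sketch of training Word2Vec with the gensim library (gensim is assumed to be installed; the tiny toy corpus below is only for illustration, since meaningful embeddings require a large corpus):

from gensim.models import Word2Vec

corpus = [
    ["i", "love", "nlp"],
    ["i", "love", "machine", "learning"],
    ["nlp", "uses", "word", "embeddings"],
]

# vector_size is the embedding dimension; sg=1 selects the Skip-gram model.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["nlp"])               # dense vector for the word "nlp"
print(model.wv.most_similar("nlp"))  # nearest words in the learned vector space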

4. Sentence Embeddings (BERT, Sentence-BERT)

 Models like BERT (Bidirectional Encoder Representations from Transformers) process entire sentences and learn contextual relationships.
 Sentence-BERT (SBERT) creates vector representations of whole sentences.
 Unlike word embeddings, these methods consider word order and context.

Example:

Sentence: "The bank approved the loan."

 BERT understands "bank" as a financial institution (not a riverbank).

🔹 Pros: Captures sentence meaning and context


🔹 Cons: Computationally expensive
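
A minimal sketch of sentence embeddings with the sentence-transformers library (assumed to be installed; "all-MiniLM-L6-v2" is one commonly used pretrained model, downloaded on first use):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The bank approved the loan.", "She sat on the river bank."]
embeddings = model.encode(sentences)  # one dense vector per sentence

print(embeddings.shape)  # e.g. (2, 384) for this particular model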

5. Transformer-Based Models (GPT, T5)

 GPT (Generative Pre-trained Transformer) creates sentence representations useful for NLP tasks.
 T5 (Text-to-Text Transfer Transformer) converts sentences into embeddings for different applications.

Example:

If we ask GPT:
"What does the word 'apple' mean?"
It understands whether we are talking about a fruit or the tech company.

🔹 Pros: Extremely powerful, state-of-the-art NLP performance


🔹 Cons: High computational cost, requires GPU resources
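
A minimal sketch of using a transformer model through the Hugging Face transformers library (assumed to be installed; "t5-small" is one small public T5 checkpoint, chosen here only for illustration):

from transformers import pipeline

# T5 frames every task as text-to-text, so the task is stated in the input prompt.
generator = pipeline("text2text-generation", model="t5-small")

print(generator("translate English to French: The bank approved the loan."))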

Conclusion

Method                      Word Order?   Context Awareness?      Computational Cost
Bag of Words (BoW)          ❌            ❌                       Low
TF-IDF                      ❌            ❌                       Medium
Word2Vec/GloVe/FastText     ❌            ✅ (semantic meaning)    Medium
BERT/Sentence-BERT          ✅            ✅                       High
GPT/T5                      ✅            ✅                       Very High

For beginners:

 Start with BoW or TF-IDF for simple tasks.
 Move to Word2Vec or GloVe for better word meanings.
 Use BERT or GPT for advanced NLP applications.

