Introductory Sheet

The document discusses various text vectorization techniques in Natural Language Processing (NLP), including Bag of Words, TF-IDF, Word Embeddings, Sentence Embeddings, and Transformer-Based Models. Each method is evaluated based on its ability to capture word order, context awareness, and computational cost. It concludes with recommendations for beginners on which methods to start with based on task complexity.

When working with text in Natural Language Processing (NLP), we need to convert words and sentences into numerical representations that computers can understand. This process is called text vectorization or embedding. Below are different ways to convert a text sentence into a numerical vector, ranging from simple to advanced methods.

1. Bag of Words (BoW)

 This is one of the simplest techniques.
 Each word in a sentence is treated as a unique feature.
 A sentence is represented as a vector, where each dimension corresponds to a word in the vocabulary.
 The value in each dimension is the word count (or sometimes just 1 if the word is present).

Example:
Sentences:

1. "I love NLP"


2. "I love machine learning"

Vocabulary = {I, love, NLP, machine, learning}

Sentence I love NLP machine learning


"I love NLP" 11 1 0 0
"I love machine learning" 1 1 0 1 1

🔹 Pros: Simple and interpretable


🔹 Cons: Ignores word order and meaning, leads to high-dimensional vectors
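
A minimal Python sketch of Bag of Words, using scikit-learn's CountVectorizer (scikit-learn is not mentioned above and is assumed to be installed; it is just one common way to do this):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP", "I love machine learning"]

# Build the vocabulary and count each word per sentence.
# The relaxed token_pattern keeps one-letter words like "I",
# which CountVectorizer drops by default.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # learned vocabulary (lowercased, alphabetical)
print(X.toarray())                         # one count vector per sentence, as in the table above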

2. Term Frequency - Inverse Document Frequency (TF-IDF)

 This method assigns a weight to each word based on how important it is in a document compared to all documents.
 Formula: TF-IDF = Term Frequency (TF) × Inverse Document Frequency (IDF)
 Words that appear frequently in one document but rarely in others get higher importance.

Example:

If "NLP" appears 10 times in a document but is rare in the overall dataset, it gets a high TF-
IDF score.

🔹 Pros: Reduces the impact of common words (like "the", "is")


🔹 Cons: Still ignores word order; vectors become high-dimensional and sparse for large vocabularies
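
A minimal sketch of TF-IDF with scikit-learn's TfidfVectorizer (again assuming scikit-learn is installed; the three toy documents are made up purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "NLP helps machines understand text",
    "I love NLP",
    "I love machine learning",
]

# Each word gets a weight = term frequency x inverse document frequency.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Words shared by several documents (like "love") receive lower weights
# than words concentrated in a single document.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))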

3. Word Embeddings (Word2Vec, GloVe, FastText)

 Instead of one-hot vectors, embeddings represent words as dense vectors in a continuous space.
 Word2Vec: Uses neural networks to learn word relationships (Skip-gram and CBOW models).
 GloVe: Learns word representations based on word co-occurrence in a large corpus.
 FastText: Extends Word2Vec by considering subwords, making it better for morphologically rich languages.

Example:

 "king" → [0.45, 0.32, -0.12, ..., 0.56]


 "queen" → [0.42, 0.30, -0.10, ..., 0.58]
 "king - man + woman ≈ queen" (shows semantic relationships)

🔹 Pros: Captures word meaning, relations, and context


🔹 Cons: Requires large amounts of training data
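
A minimal sketch of training Word2Vec with the gensim library (gensim is assumed to be installed; the tiny toy corpus below is only for illustration, since meaningful embeddings require a large corpus):

from gensim.models import Word2Vec

corpus = [
    ["i", "love", "nlp"],
    ["i", "love", "machine", "learning"],
    ["nlp", "uses", "word", "embeddings"],
]

# vector_size is the embedding dimension; sg=1 selects the Skip-gram model.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["nlp"])               # dense vector for the word "nlp"
print(model.wv.most_similar("nlp"))  # nearest words in the learned vector space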

4. Sentence Embeddings (BERT, Sentence-BERT)

 Models like BERT (Bidirectional Encoder Representations from Transformers) process entire sentences and learn contextual relationships.
 Sentence-BERT (SBERT) creates vector representations of whole sentences.
 Unlike word embeddings, these methods consider word order and context.

Example:

Sentence: "The bank approved the loan."

 BERT understands "bank" as a financial institution (not a riverbank).

🔹 Pros: Captures sentence meaning and context


🔹 Cons: Computationally expensive
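
A minimal sketch of sentence embeddings with the sentence-transformers library (assumed to be installed; "all-MiniLM-L6-v2" is one commonly used pretrained model, downloaded on first use):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The bank approved the loan.", "She sat on the river bank."]
embeddings = model.encode(sentences)  # one dense vector per sentence

print(embeddings.shape)  # e.g. (2, 384) for this particular model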

5. Transformer-Based Models (GPT, T5)

 GPT (Generative Pre-trained Transformer) creates sentence representations useful for NLP tasks.
 T5 (Text-to-Text Transfer Transformer) converts sentences into embeddings for different applications.

Example:

If we ask GPT:
"What does the word 'apple' mean?"
It understands whether we are talking about a fruit or the tech company.

🔹 Pros: Extremely powerful, state-of-the-art NLP performance


🔹 Cons: High computational cost, requires GPU resources
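
A minimal sketch of using a transformer model through the Hugging Face transformers library (assumed to be installed; "t5-small" is one small public T5 checkpoint, chosen here only for illustration):

from transformers import pipeline

# T5 frames every task as text-to-text, so the task is stated in the input prompt.
generator = pipeline("text2text-generation", model="t5-small")

print(generator("translate English to French: The bank approved the loan."))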

Conclusion

Method                      Word Order?   Context Awareness?      Computational Cost
Bag of Words (BoW)          ❌            ❌                       Low
TF-IDF                      ❌            ❌                       Medium
Word2Vec/GloVe/FastText     ❌            ✅ (semantic meaning)    Medium
BERT/Sentence-BERT          ✅            ✅                       High
GPT/T5                      ✅            ✅                       Very High

For beginners:

 Start with BoW or TF-IDF for simple tasks.
 Move to Word2Vec or GloVe for better word meanings.
 Use BERT or GPT for advanced NLP applications.

