NLP Notebook
This notebook demonstrates core Natural Language Processing techniques combined with
Machine Learning models, including tokenization, feature extraction, and model training —
all with hands-on Python code.
NLP (Natural Language Processing) is how computers understand human languages such as English or Hindi.
Just like we talk to each other, NLP helps us talk to computers using text or speech.
Text Preprocessing
Tokenization
Tokenization is the process of breaking down text into smaller pieces, called tokens, which
can be words, characters, or subwords.
💡 Types of Tokenization
1. Word Tokenization – Splits text into individual words.
2. Character Tokenization – Splits text into individual characters.
3. Subword Tokenization – Splits words into smaller meaningful parts (used in modern
models like BERT or GPT).
🧠 Example
Original Text: ChatGPT is amazing!
1. Word Tokenization:
["ChatGPT", "is", "amazing", "!"]
2. Character Tokenization:
["C", "h", "a", "t", "G", "P", "T", " ", "i", "s", " ", "a", "m", "a",
"z", "i", "n", "g", "!"]
3. Subword Tokenization: Words are split into smaller meaningful chunks, which is especially useful for rare or compound words.
text = "ChatGPT is amazing! It can help you write code, explain concepts, and much
sentences = sent_tokenize(text)
print(sentences)
['ChatGPT is amazing!', 'It can help you write code, explain concepts, and much more.', "Isn't that great?"]
Stemming
Reduces words to their root form by chopping off suffixes.
1. Porter Stemmer
✅ Use Case:
Best for general English text processing tasks like information retrieval, search engines,
or basic NLP pipelines.
📌 Characteristics:
The oldest and most widely used stemmer; rule-based, and less aggressive than the Lancaster stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "flies", "easily", "flying", "played", "happily"]
[stemmer.stem(w) for w in words]
2. Lancaster Stemmer
✅ Use Case:
Suitable when you value speed and want a very aggressive stemming strategy.
Good for use cases where over-stemming is acceptable, such as duplicate detection or
topic clustering.
📌 Characteristics:
Often reduces words too much (over-stemming), which may distort meaning.
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
words = ["running", "flies", "easily", "flying", "played", "happily"]
[stemmer.stem(w) for w in words]
3. Snowball Stemmer
📌 Characteristics:
More advanced and consistent than Porter; also supports several languages besides English.
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["running", "flies", "easily", "flying", "played", "happily"]
[stemmer.stem(w) for w in words]
4. RegexpStemmer
✅ Use Case:
Best for domain-specific tasks where default stemmers don’t work well.
Useful when you want to strip predictable suffixes like “-ing”, “-ed”, “-s”, etc.
📌 Characteristics:
Allows manual control over stemming behavior.
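Since no cell for the RegexpStemmer survives here, a minimal sketch of typical usage (the regular expression below is my own illustration, not the notebook's original pattern):

from nltk.stem import RegexpStemmer

# Strip the suffixes -ing, -ed and a trailing -s (pattern chosen for illustration)
stemmer = RegexpStemmer('ing$|ed$|s$', min=4)

words = ["running", "flies", "easily", "flying", "played", "happily"]
print([stemmer.stem(w) for w in words])
# e.g. 'running' -> 'runn', 'played' -> 'play'; words shorter than min are left untouched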
Lemmatization
Reduces words to their dictionary base form (lemma) using vocabulary and part-of-speech information, rather than just chopping off suffixes; a sketch of the lemmatizer call follows the word list below.
In [10]: words = [
("running", "v"), # verb
("flies", "n"), # noun
("better", "a"), # adjective
("played", "v"),
("children", "n"), # plural noun
("am", "v"), # verb (be form)
]
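The cell that actually lemmatizes these (word, POS) pairs did not survive the export; a minimal sketch using NLTK's WordNetLemmatizer (the variable name lemmatizer is my own):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The second element of each tuple is the WordNet POS tag: n, v, a, r
print([lemmatizer.lemmatize(word, pos) for word, pos in words])
# Typically gives: ['run', 'fly', 'good', 'play', 'child', 'be']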
POS Tagging
Part-of-Speech (POS) tagging labels each token with its grammatical category (a tagging sketch follows the list).
📌 Example Categories:
NN – Noun (e.g., dog, book)
VB – Verb (base form, e.g., run, play)
JJ – Adjective (e.g., happy, blue)
RB – Adverb (e.g., quickly, very)
PRP – Personal pronoun (e.g., he, they)
IN – Preposition (e.g., in, on)
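The cell that builds the `tagged` variable used below is missing; a minimal sketch of how nltk.pos_tag would typically produce it (the example sentence is my own):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize

sentence = "The quick brown fox jumps over the lazy dog"  # illustrative sentence
tagged = nltk.pos_tag(word_tokenize(sentence))
print(tagged)
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ...]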
In [17]: tagged[:10]
Named Entity Recognition (NER)
NER locates named entities in text and classifies them into categories such as PERSON, ORGANIZATION, and GPE (location).
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\visha\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data] C:\Users\visha\AppData\Roaming\nltk_data...
[nltk_data] Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data] C:\Users\visha\AppData\Roaming\nltk_data...
[nltk_data] Package words is already up-to-date!
Out[19]: True
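The cell that creates `named_entities` is not shown; a minimal sketch of NLTK's chunk-based NER (the example sentence is my own):

from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

sentence = "Barack Obama was born in Hawaii and worked in Washington."  # illustrative
named_entities = ne_chunk(pos_tag(word_tokenize(sentence)))
# named_entities is an nltk.Tree; entities appear as subtrees labelled PERSON, GPE, etc.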
In [23]: named_entities
Out[23]:
Text to Vectors
🧠 Why Convert Words to Vectors?
Machine learning models can only understand numbers, not raw text. To train models for
tasks like classification, translation, sentiment analysis, etc., we need to convert words or
documents into fixed-size vectors.
1. One Hot Encoding
🧠 What is One-Hot Encoding?
One-Hot Encoding represents each word in a vocabulary as a binary vector: the position corresponding to that word is 1 and every other position is 0.
🔍 Example:
Assume we have a vocabulary of 5 words:
["I", "love", "NLP", "is", "fun"]
I [1, 0, 0, 0, 0]
love [0, 1, 0, 0, 0]
NLP [0, 0, 1, 0, 0]
is [0, 0, 0, 1, 0]
fun [0, 0, 0, 0, 1]
from sklearn.preprocessing import OneHotEncoder

words = [["I"], ["love"], ["NLP"], ["is"], ["fun"]]
encoder = OneHotEncoder()
onehot = encoder.fit_transform(words)
print(encoder.categories_)
In [25]: print(onehot)
(0, 0) 1.0
(1, 4) 1.0
(2, 1) 1.0
(3, 3) 1.0
(4, 2) 1.0
2. Bag of Words (BoW)
🧠 What is Bag of Words?
Bag of Words represents each document by the counts of the words it contains, ignoring grammar and word order.
✅ Use Case:
Simple text classification tasks (e.g., spam detection, topic classification).
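The cell that creates `vectorizer` and `X` is not shown; a minimal sketch of the usual CountVectorizer setup (the two documents here are my own placeholders, so the output printed below reflects the notebook's original, unshown documents):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["NLP is fun", "I love NLP and I love Python"]  # placeholder documents

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)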
print(vectorizer.get_feature_names_out())
In [27]: print(X.toarray())
[[0 0 0 1 1 0]
[1 1 1 0 1 1]]
🧠 What is an N-gram?
An n-gram is a contiguous sequence of n items (usually words) from a given text or speech.
N-grams help preserve context and word order compared to Bag of Words.
N = 1: unigrams (single words)
N = 2: bigrams (pairs of consecutive words)
N = 3: trigrams (triples of consecutive words)
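A small sketch of bigram extraction using CountVectorizer's ngram_range parameter (the example sentence is my own):

from sklearn.feature_extraction.text import CountVectorizer

bigram_vec = CountVectorizer(ngram_range=(2, 2))  # bigrams only
bigram_vec.fit(["the cat sits on the mat"])
print(bigram_vec.get_feature_names_out())
# ['cat sits' 'on the' 'sits on' 'the cat' 'the mat']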
3. TF-IDF
TF-IDF stands for Term Frequency (TF) multiplied by Inverse Document Frequency (IDF).

$$\mathrm{IDF}(t) = \log\left(\frac{N}{1 + df(t)}\right)$$

Where:
N = total number of documents
df(t) = number of documents containing the term t
TF-IDF(t, d) = TF(t, d) × IDF(t)

So, if "love" appears in 2 of the 3 documents:

$$\mathrm{IDF}(\text{'love'}) = \log\left(\frac{3}{1 + 2}\right) = 0$$
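The cell that fits the TF-IDF vectorizer is likewise not shown; a minimal sketch, with placeholder documents chosen to match the worked example above ("love" appears in 2 of 3 documents):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love NLP", "I love Python", "NLP is fun"]  # placeholder documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
# Note: scikit-learn uses a smoothed IDF variant, so values differ slightly from the hand formula above.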
print(vectorizer.get_feature_names_out())
print(X.toarray())
❌ Limitations
TF-IDF still ignores word order and context, produces large sparse vectors, and cannot capture semantic similarity between words.
📝 Summary
TF measures how often a term appears in a document, IDF down-weights terms that appear in many documents, and their product highlights words that are distinctive to a particular document.
Word Embeddings
Word embeddings are dense vector representations of words in a continuous vector space,
where similar words are mapped to similar vectors.
Word embedding is a way to turn words into numbers so a computer can understand them, but not just any numbers: the numbers capture meaning, so the computer can figure out which words are related and how closely.
"king" and "queen" have similar vectors → so the computer knows they are related.
One-hot vectors are huge & sparse → embeddings are dense & compact.
One-hot vectors cannot generalize across contexts → embeddings capture word usage patterns.
💡 Key Idea
Each word is represented as a vector of real numbers (e.g., 100–300 dimensions), trained so
that words used in similar contexts have similar vectors.
Popular embedding models include Word2Vec, GloVe, and FastText; FastText also includes subword information, which helps with misspellings and rare words.
🧠 What is Word2Vec?
Word2Vec is a method to convert words into vectors so that words appearing in similar contexts end up with similar vectors.
🔁 How It Works
Word2Vec trains on a text corpus and learns word relationships.
1. CBOW (Continuous Bag of Words)
Concept: CBOW predicts the target word using its context words (surrounding words).
Example: In the sentence "The cat sits on the mat", the surrounding words ("The", "cat", "on", "the") are used to predict the centre word "sits".
CBOW tries to learn the representation such that, given surrounding words, it can predict the central word.
2. Skip-Gram
Concept: Skip-Gram does the reverse. It predicts context words from the target word.
Example: Given the same sentence: "The cat sits on the mat"
Skip-Gram tries to learn word representations such that, given a word, it can predict its
context.
CBOW vs Skip-Gram
CBOW is faster to train and works well for frequent words, while Skip-Gram is slower but represents rare words and small corpora better.
Conclusion
Word2Vec turns raw text into dense vectors that capture semantic relationships between words; the choice between CBOW and Skip-Gram depends mainly on corpus size and how important rare words are.
# Sample corpus
from nltk.tokenize import word_tokenize

text = "The cat sits on the mat. The dog plays with the cat."

# Tokenize
tokens = word_tokenize(text.lower())
print("Tokens:", tokens)
Tokens: ['the', 'cat', 'sits', 'on', 'the', 'mat', '.', 'the', 'dog', 'plays', 'with', 'the', 'cat', '.']
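A hedged sketch of training a tiny Word2Vec model on these tokens, assuming gensim is installed (the parameter values are illustrative, not the notebook's):

from gensim.models import Word2Vec

# sg=0 -> CBOW, sg=1 -> Skip-Gram; the corpus is tiny, so results are only illustrative
model = Word2Vec(sentences=[tokens], vector_size=50, window=2, min_count=1, sg=0)

print(model.wv['cat'][:5])           # first 5 dimensions of the 'cat' vector
print(model.wv.most_similar('cat'))  # nearest words by cosine similarity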
Practical Implementation
Spam vs Ham Classification
In [33]: import kagglehub
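The cells that download the dataset and build `df` are not shown; a hedged sketch of the usual pattern (the dataset handle, file name, and column names below are assumptions):

import pandas as pd

# Assumed handle and layout of the SMS Spam Collection dataset on Kaggle
path = kagglehub.dataset_download("uciml/sms-spam-collection-dataset")
df = pd.read_csv(f"{path}/spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "message"]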
In [36]: df.head()
1. Using BoW
In [37]: # Data cleaning and preprocessing
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
In [38]: ps = PorterStemmer()
In [39]: corpus = []
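The body of this cell, the cleaning loop that fills `corpus`, did not survive the export; a sketch of the standard pattern, assuming the text column is named `message`:

stop_words = set(stopwords.words('english'))
for i in range(len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['message'][i])   # keep letters only
    review = review.lower().split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))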
In [40]: corpus[:5]
Out[40]: ['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
'ok lar joke wif u oni',
'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
'u dun say earli hor u c alreadi say',
'nah think goe usf live around though']
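The cells that build the Bag of Words matrix and the label vector are missing; a sketch under the assumption that the label column is `label` with values 'ham'/'spam' and that max_features mirrors the TF-IDF section:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=5000)
X = cv.fit_transform(corpus).toarray()
y = (df['label'] == 'spam').astype(int)   # 1 = spam, 0 = ham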
In [44]: X.shape
In [45]: # cv.vocabulary_
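The training and evaluation cells are also missing; the confusion matrix and accuracy below suggest a standard split-and-fit pipeline, sketched here with Multinomial Naive Bayes (the classifier choice and split parameters are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))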
Confusion Matrix:
[[962 4]
[ 8 141]]
Accuracy: 0.989237668161435
precision recall f1-score support
2. Using TF-IDF
In [51]: # Creating the tfidf model
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = tfidf.fit_transform(corpus).toarray()
X[:5, :10] # Display first 5 rows and first 10 columns
# X.shape
Out[51]: array([[0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
[0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
[0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
[0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
[0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000]])
In [52]: # tfidf.vocabulary_
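To complete the comparison, the same pipeline from the Bag of Words section can be rerun on the TF-IDF features (again assuming a Multinomial Naive Bayes classifier and the label vector y from the sketch above):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = MultinomialNB().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))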