Basenlp

The document provides an overview of various natural language processing (NLP) techniques including tokenization, Bag of Words, Word2Vec, TF-IDF, stemming, lemmatization, and stop words removal, along with their real-world applications. It includes code snippets demonstrating how to implement these techniques using Python libraries such as NLTK and Gensim. The document also presents a sample dataset of product reviews to illustrate the processes.


wakqx3zrl

March 9, 2025

0.1 Importing Required Dependencies


[20]: import nltk
      from nltk.tokenize import word_tokenize
      from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
      import gensim
      import numpy as np
      from nltk.stem import PorterStemmer, WordNetLemmatizer
      from nltk.corpus import stopwords

      # One-time downloads of the NLTK resources used below
      nltk.download('punkt')      # tokenizer models for word_tokenize
      nltk.download('wordnet')    # dictionary used by WordNetLemmatizer
      nltk.download('stopwords')  # English stop word list

0.2 Sample dataset


[5]: reviews = [
         "I love this product amazing quality",
         "Terrible product poor quality",
         "I love the amazing service"
     ]

0.3 Tokenization
Tokenization is the process of breaking text into smaller pieces, such as words or sentences.
Real-world use case: Used in search engines to split queries into words for matching relevant
documents.
[13]: tokenized_reviews = [word_tokenize(review.lower()) for review in reviews]
      for i, tokens in enumerate(tokenized_reviews):
          print(f"Review {i+1}: {tokens}")

Review 1: ['i', 'love', 'this', 'product', 'amazing', 'quality']
Review 2: ['terrible', 'product', 'poor', 'quality']
Review 3: ['i', 'love', 'the', 'amazing', 'service']
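
NLTK can also split text at the sentence level; a minimal sketch using sent_tokenize (the multi-sentence string below is made up for illustration):

[ ]: from nltk.tokenize import sent_tokenize

     # Hypothetical multi-sentence review, split into sentences rather than words
     text = "I love this product. The quality is amazing. Delivery was fast."
     print(sent_tokenize(text))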

0.4 Bag of Words (BoW)


BoW represents text data as a vector of word counts.
Real-world use case: Used in spam detection, sentiment analysis, and document classification.

[14]: vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(reviews)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_matrix.toarray())

Vocabulary: ['amazing' 'love' 'poor' 'product' 'quality' 'service' 'terrible' 'the' 'this']
BoW Matrix:
 [[1 1 0 1 1 0 0 0 1]
 [0 0 1 1 1 0 1 0 0]
 [1 1 0 0 0 1 0 1 0]]
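
Once fitted, the same vocabulary can be reused to encode unseen text with transform; a minimal sketch (the extra review below is invented, and words outside the fitted vocabulary are simply dropped):

[ ]: new_review = ["amazing product but poor service"]
     # transform() reuses the vocabulary learned from `reviews`; "but" is not in it and is ignored
     print(vectorizer.transform(new_review).toarray())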

0.5 Word2Vec
Word2Vec converts words into vector representations based on context.
Real-world use case: Used in chatbots, recommendation systems, and search engines.
[15]: model = gensim.models.Word2Vec(tokenized_reviews, vector_size=10, window=2, min_count=1, sg=1)

      print("Vector for 'love':", model.wv['love'])
      print("Vector for 'quality':", model.wv['quality'])

Vector for 'love': [-0.07511634 -0.00929911  0.09538099 -0.07319422 -0.02333676 -0.01937682
  0.0807754  -0.05930967  0.00045279 -0.0475374 ]
Vector for 'quality': [-0.00536227  0.00236431  0.0510335   0.09009273 -0.0930295  -0.07116809
  0.06458873  0.08972988 -0.05015428 -0.03763372]
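
For a quick sanity check, gensim's most_similar returns the nearest words by cosine similarity; with a three-sentence corpus the rankings are essentially random, so treat this only as an API sketch:

[ ]: # nearest neighbours of 'love' in the toy model (not meaningful at this corpus size)
     print(model.wv.most_similar('love', topn=3))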

0.6 Avg Word2Vec


This approach averages all word vectors in a sentence to get a single vector.
Real-world use case: Used in document similarity, text clustering, and recommendation systems.
[18]: def get_avg_word2vec(tokens, model):
          vectors = [model.wv[word] for word in tokens if word in model.wv]
          return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

      avg_vectors = [get_avg_word2vec(tokens, model) for tokens in tokenized_reviews]

      for i, vec in enumerate(avg_vectors):
          print(f"Review {i+1} Average Vector: {vec[:3]}...")  # showing first 3 dimensions

Review 1 Average Vector: [-0.00220742  0.0134073   0.01929608]...
Review 2 Average Vector: [ 0.02844399  0.04120733  0.03952274]...
Review 3 Average Vector: [-0.03893989  0.01472283 -0.02407057]...
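
The averaged vectors can be compared directly, which is what the document-similarity use case boils down to; a minimal sketch using scikit-learn's cosine_similarity (again, numbers from a toy model are not meaningful):

[ ]: from sklearn.metrics.pairwise import cosine_similarity

     # pairwise cosine similarity between the three review vectors
     sim = cosine_similarity(np.vstack(avg_vectors))
     print(np.round(sim, 2))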

0.7 TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF gives importance to words that appear frequently in a document but not across all documents.
Real-world use case: Used in search engines, document ranking, and keyword extraction.
[19]: tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(reviews)
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())

Vocabulary: ['amazing' 'love' 'poor' 'product' 'quality' 'service' 'terrible' 'the' 'this']
TF-IDF Matrix:
 [[0.41779577 0.41779577 0.         0.41779577 0.41779577 0.
  0.         0.         0.54935123]
 [0.         0.         0.5628291  0.42804604 0.42804604 0.
  0.5628291  0.         0.        ]
 [0.42804604 0.42804604 0.         0.         0.         0.5628291
  0.         0.5628291  0.        ]]
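
For the keyword-extraction use case, the highest-weighted terms of each review can be read straight off the matrix; a small sketch:

[ ]: terms = tfidf_vectorizer.get_feature_names_out()
     for i, row in enumerate(tfidf_matrix.toarray()):
         top = row.argsort()[::-1][:2]  # indices of the two largest TF-IDF weights
         print(f"Review {i+1} top terms:", [terms[j] for j in top])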

0.8 Stemming (Porter Stemmer)


Stemming reduces words to their root form by chopping off suffixes. It doesn’t always produce real
words but is faster than lemmatization.
Use Case: Stemming is used in search engines (reducing words to base form improves matching).

[28]: print('REVIEWS')
      print(reviews)
      print(' ')
      print('Stemmed Output:')

      ps = PorterStemmer()
      stemmed_reviews = [[ps.stem(token) for token in tokens] for tokens in tokenized_reviews]

      for i, stemmed in enumerate(stemmed_reviews):
          print(f"Review {i+1} stemmed: {stemmed}")

REVIEWS
['I love this product amazing quality', 'Terrible product poor quality', 'I love the amazing service']

Stemmed Output:
Review 1 stemmed: ['i', 'love', 'thi', 'product', 'amaz', 'qualiti']
Review 2 stemmed: ['terribl', 'product', 'poor', 'qualiti']
Review 3 stemmed: ['i', 'love', 'the', 'amaz', 'servic']
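
The search-matching benefit comes from different surface forms collapsing to one stem; a minimal sketch with a few invented word forms:

[ ]: # "loved", "loving" and "loves" all reduce to the same stem as "love",
     # so a query containing any of them can match the same documents
     for word in ["love", "loved", "loving", "loves"]:
         print(word, "->", ps.stem(word))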

0.9 Lemmatization
Lemmatization reduces words to their dictionary root form (lemma) using linguistic rules. It
considers the word’s meaning, making it more accurate than stemming.
Use Case: Lemmatization is used in chatbots, spell-checkers, and sentiment analysis.
[27]: print('REVIEWS')
      print(reviews)
      print(' ')
      print('Lemmatized Output:')

      lemmatizer = WordNetLemmatizer()
      lemmatized_reviews = [[lemmatizer.lemmatize(token) for token in tokens] for tokens in tokenized_reviews]

      for i, lemmatized in enumerate(lemmatized_reviews):
          print(f"Review {i+1} lemmatized: {lemmatized}")

REVIEWS
['I love this product amazing quality', 'Terrible product poor quality', 'I love the amazing service']

Lemmatized Output:
Review 1 lemmatized: ['i', 'love', 'this', 'product', 'amazing', 'quality']
Review 2 lemmatized: ['terrible', 'product', 'poor', 'quality']
Review 3 lemmatized: ['i', 'love', 'the', 'amazing', 'service']
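
WordNetLemmatizer treats every token as a noun unless told otherwise, which is why 'love' and 'amazing' pass through unchanged above; supplying a part-of-speech tag changes the result. A small sketch:

[ ]: # pos='v' lemmatizes as a verb, pos='a' as an adjective
     print(lemmatizer.lemmatize("loves", pos='v'))    # -> love
     print(lemmatizer.lemmatize("running", pos='v'))  # -> run
     print(lemmatizer.lemmatize("better", pos='a'))   # -> good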

0.10 Stop Words Removal


Stop words (e.g., “is”, “and”, “the”) are common words that don’t add meaning in NLP tasks. We remove them to reduce noise.
Use Case: Stop word removal is used in text classification, sentiment analysis, and keyword extraction.
[30]: print('REVIEWS')
      print(reviews)
      print(' ')
      print('Stop Words Removal Output:')

      stop_words = set(stopwords.words('english'))
      filtered_reviews = [[token for token in tokens if token not in stop_words]
                          for tokens in tokenized_reviews]
      for i, filtered in enumerate(filtered_reviews):
          print(f"Review {i+1} without stop words: {filtered}")

REVIEWS
['I love this product amazing quality', 'Terrible product poor quality', 'I love the amazing service']

Stop Words Removal Output:
Review 1 without stop words: ['love', 'product', 'amazing', 'quality']
Review 2 without stop words: ['terrible', 'product', 'poor', 'quality']
Review 3 without stop words: ['love', 'amazing', 'service']
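
In practice these steps are chained; a minimal sketch of a combined pipeline (lowercase, tokenize, drop stop words, lemmatize) feeding the cleaned text into TF-IDF, reusing the objects defined above:

[ ]: def preprocess(text):
         tokens = word_tokenize(text.lower())
         tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
         return " ".join(lemmatizer.lemmatize(t) for t in tokens)

     cleaned = [preprocess(r) for r in reviews]
     print(cleaned)
     print(TfidfVectorizer().fit_transform(cleaned).toarray().round(2))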
