Basenlp

The document provides an overview of various natural language processing (NLP) techniques including tokenization, Bag of Words, Word2Vec, TF-IDF, stemming, lemmatization, and stop words removal, along with their real-world applications. It includes code snippets demonstrating how to implement these techniques using Python libraries such as NLTK and Gensim. The document also presents a sample dataset of product reviews to illustrate the processes.


wakqx3zrl

March 9, 2025

0.1 Importing Required Dependencies


[20]: import nltk
      from nltk.tokenize import word_tokenize
      from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
      import gensim
      import numpy as np
      from nltk.stem import PorterStemmer, WordNetLemmatizer
      from nltk.corpus import stopwords

      # One-time downloads of the NLTK resources used below
      nltk.download('punkt')      # tokenizer models for word_tokenize
      nltk.download('wordnet')    # dictionary used by WordNetLemmatizer
      nltk.download('stopwords')  # English stop word list

0.2 Sample dataset


[5]: reviews = [
         "I love this product amazing quality",
         "Terrible product poor quality",
         "I love the amazing service"
     ]

0.3 Tokenization
Tokenization is the process of breaking text into smaller pieces, such as words or sentences.
Real-world use case: Used in search engines to split queries into words for matching relevant
documents.
[13]: tokenized_reviews = [word_tokenize(review.lower()) for review in reviews]
      for i, tokens in enumerate(tokenized_reviews):
          print(f"Review {i+1}: {tokens}")

Review 1: ['i', 'love', 'this', 'product', 'amazing', 'quality']
Review 2: ['terrible', 'product', 'poor', 'quality']
Review 3: ['i', 'love', 'the', 'amazing', 'service']
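
NLTK can also split text at the sentence level; a minimal sketch using sent_tokenize (the multi-sentence string below is made up for illustration):

[ ]: from nltk.tokenize import sent_tokenize

     # Hypothetical multi-sentence review, split into sentences rather than words
     text = "I love this product. The quality is amazing. Delivery was fast."
     print(sent_tokenize(text))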

0.4 Bag of Words (BoW)


BoW represents text data as a vector of word counts.
Real-world use case: Used in spam detection, sentiment analysis, and document classification.

[14]: vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(reviews)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_matrix.toarray())

Vocabulary: ['amazing' 'love' 'poor' 'product' 'quality' 'service' 'terrible' 'the' 'this']
BoW Matrix:
 [[1 1 0 1 1 0 0 0 1]
 [0 0 1 1 1 0 1 0 0]
 [1 1 0 0 0 1 0 1 0]]
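
Once fitted, the same vocabulary can be reused to encode unseen text with transform; a minimal sketch (the extra review below is invented, and words outside the fitted vocabulary are simply dropped):

[ ]: new_review = ["amazing product but poor service"]
     # transform() reuses the vocabulary learned from `reviews`; "but" is not in it and is ignored
     print(vectorizer.transform(new_review).toarray())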

0.5 Word2Vec
Word2Vec converts words into vector representations based on context.
Real-world use case: Used in chatbots, recommendation systems, and search engines.
[15]: model = gensim.models.Word2Vec(tokenized_reviews, vector_size=10, window=2, min_count=1, sg=1)

      print("Vector for 'love':", model.wv['love'])
      print("Vector for 'quality':", model.wv['quality'])

Vector for 'love': [-0.07511634 -0.00929911  0.09538099 -0.07319422 -0.02333676 -0.01937682
  0.0807754  -0.05930967  0.00045279 -0.0475374 ]
Vector for 'quality': [-0.00536227  0.00236431  0.0510335   0.09009273 -0.0930295  -0.07116809
  0.06458873  0.08972988 -0.05015428 -0.03763372]
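
For a quick sanity check, gensim's most_similar returns the nearest words by cosine similarity; with a three-sentence corpus the rankings are essentially random, so treat this only as an API sketch:

[ ]: # nearest neighbours of 'love' in the toy model (not meaningful at this corpus size)
     print(model.wv.most_similar('love', topn=3))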

0.6 Avg Word2Vec


This approach averages all word vectors in a sentence to get a single vector.
Real-world use case: Used in document similarity, text clustering, and recommendation systems.
[18]: def get_avg_word2vec(tokens, model):
          vectors = [model.wv[word] for word in tokens if word in model.wv]
          return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

      avg_vectors = [get_avg_word2vec(tokens, model) for tokens in tokenized_reviews]

      for i, vec in enumerate(avg_vectors):
          print(f"Review {i+1} Average Vector: {vec[:3]}...")  # showing first 3 dimensions

Review 1 Average Vector: [-0.00220742  0.0134073   0.01929608]...
Review 2 Average Vector: [ 0.02844399  0.04120733  0.03952274]...
Review 3 Average Vector: [-0.03893989  0.01472283 -0.02407057]...
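
The averaged vectors can be compared directly, which is what the document-similarity use case boils down to; a minimal sketch using scikit-learn's cosine_similarity (again, numbers from a toy model are not meaningful):

[ ]: from sklearn.metrics.pairwise import cosine_similarity

     # pairwise cosine similarity between the three review vectors
     sim = cosine_similarity(np.vstack(avg_vectors))
     print(np.round(sim, 2))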

0.7 TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF gives importance to words that appear frequently in a document but not across all documents.
Real-world use case: Used in search engines, document ranking, and keyword extraction.
[19]: tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(reviews)
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())

Vocabulary: ['amazing' 'love' 'poor' 'product' 'quality' 'service' 'terrible' 'the' 'this']
TF-IDF Matrix:
 [[0.41779577 0.41779577 0.         0.41779577 0.41779577 0.
  0.         0.         0.54935123]
 [0.         0.         0.5628291  0.42804604 0.42804604 0.
  0.5628291  0.         0.        ]
 [0.42804604 0.42804604 0.         0.         0.         0.5628291
  0.         0.5628291  0.        ]]
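
For the keyword-extraction use case, the highest-weighted terms of each review can be read straight off the matrix; a small sketch:

[ ]: terms = tfidf_vectorizer.get_feature_names_out()
     for i, row in enumerate(tfidf_matrix.toarray()):
         top = row.argsort()[::-1][:2]  # indices of the two largest TF-IDF weights
         print(f"Review {i+1} top terms:", [terms[j] for j in top])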

0.8 Stemming (Porter Stemmer)


Stemming reduces words to their root form by chopping off suffixes. It doesn’t always produce real
words but is faster than lemmatization.
Use Case: Stemming is used in search engines (reducing words to base form improves matching).

[28]: print('REVIEWS')
      print(reviews)
      print(' ')
      print('Stemmed Output:')

      ps = PorterStemmer()
      stemmed_reviews = [[ps.stem(token) for token in tokens] for tokens in tokenized_reviews]

      for i, stemmed in enumerate(stemmed_reviews):
          print(f"Review {i+1} stemmed: {stemmed}")

REVIEWS
['I love this product amazing quality', 'Terrible product poor quality', 'I love the amazing service']

Stemmed Output:
Review 1 stemmed: ['i', 'love', 'thi', 'product', 'amaz', 'qualiti']
Review 2 stemmed: ['terribl', 'product', 'poor', 'qualiti']
Review 3 stemmed: ['i', 'love', 'the', 'amaz', 'servic']
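
The search-matching benefit comes from different surface forms collapsing to one stem; a minimal sketch with a few invented word forms:

[ ]: # "loved", "loving" and "loves" all reduce to the same stem as "love",
     # so a query containing any of them can match the same documents
     for word in ["love", "loved", "loving", "loves"]:
         print(word, "->", ps.stem(word))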

0.9 Lemmatization
Lemmatization reduces words to their dictionary root form (lemma) using linguistic rules. It
considers the word’s meaning, making it more accurate than stemming.
Use Case: Lemmatization is used in chatbots, spell-checkers, and sentiment analysis.
[27]: print('REVIEWS')
      print(reviews)
      print(' ')
      print('Lemmatized Output:')

      lemmatizer = WordNetLemmatizer()
      lemmatized_reviews = [[lemmatizer.lemmatize(token) for token in tokens] for tokens in tokenized_reviews]

      for i, lemmatized in enumerate(lemmatized_reviews):
          print(f"Review {i+1} lemmatized: {lemmatized}")

REVIEWS
['I love this product amazing quality', 'Terrible product poor quality', 'I love the amazing service']

Lemmatized Output:
Review 1 lemmatized: ['i', 'love', 'this', 'product', 'amazing', 'quality']
Review 2 lemmatized: ['terrible', 'product', 'poor', 'quality']
Review 3 lemmatized: ['i', 'love', 'the', 'amazing', 'service']
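
WordNetLemmatizer treats every token as a noun unless told otherwise, which is why 'love' and 'amazing' pass through unchanged above; supplying a part-of-speech tag changes the result. A small sketch:

[ ]: # pos='v' lemmatizes as a verb, pos='a' as an adjective
     print(lemmatizer.lemmatize("loves", pos='v'))    # -> love
     print(lemmatizer.lemmatize("running", pos='v'))  # -> run
     print(lemmatizer.lemmatize("better", pos='a'))   # -> good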

0.10 Stop Words Removal


Stop words (e.g., “is”, “and”, “the”) are common words that don’t add meaning in NLP tasks. We remove them to reduce noise.
Use Case: Stop word removal is used in text classification, sentiment analysis, and keyword extraction.
[30]: print('REVIEWS')
      print(reviews)
      print(' ')
      print('Stop Words Removal Output:')

      stop_words = set(stopwords.words('english'))
      filtered_reviews = [[token for token in tokens if token not in stop_words]
                          for tokens in tokenized_reviews]
      for i, filtered in enumerate(filtered_reviews):
          print(f"Review {i+1} without stop words: {filtered}")

REVIEWS
['I love this product amazing quality', 'Terrible product poor quality', 'I love the amazing service']

Stop Words Removal Output:
Review 1 without stop words: ['love', 'product', 'amazing', 'quality']
Review 2 without stop words: ['terrible', 'product', 'poor', 'quality']
Review 3 without stop words: ['love', 'amazing', 'service']
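
In practice these steps are chained; a minimal sketch of a combined pipeline (lowercase, tokenize, drop stop words, lemmatize) feeding the cleaned text into TF-IDF, reusing the objects defined above:

[ ]: def preprocess(text):
         tokens = word_tokenize(text.lower())
         tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
         return " ".join(lemmatizer.lemmatize(t) for t in tokens)

     cleaned = [preprocess(r) for r in reviews]
     print(cleaned)
     print(TfidfVectorizer().fit_transform(cleaned).toarray().round(2))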
