Basenlp
March 9, 2025
0.3 Tokenization
Tokenization is the process of breaking text into smaller pieces, such as words or sentences.
Real-world use case: Used in search engines to split queries into words for matching relevant
documents.
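Note: the cells below rely on some setup that is not shown in this extract. A minimal sketch of the imports and data they assume (the reviews list is copied from the notebook's own printed output; the nltk.download calls are an assumption about what a fresh environment needs):

[ ]: import gensim
     import nltk
     from nltk.tokenize import word_tokenize
     from nltk.stem import PorterStemmer, WordNetLemmatizer
     from nltk.corpus import stopwords
     from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

     # One-time NLTK resource downloads (assumed; adjust to your environment).
     nltk.download('punkt')
     nltk.download('wordnet')
     nltk.download('stopwords')

     # The three example reviews, taken from the output printed later on.
     reviews = ['I love this product amazing quality',
                'Terrible product poor quality',
                'I love the amazing service']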
[13]: tokenized_reviews = [word_tokenize(review.lower()) for review in reviews]
      for i, tokens in enumerate(tokenized_reviews):
          print(f"Review {i+1}: {tokens}")
0.4 Bag of Words (BoW)
The Bag of Words model represents each document as a vector of raw word counts, ignoring grammar and word order.
Real-world use case: Used in spam filtering and document classification.
[14]: vectorizer = CountVectorizer()
      bow_matrix = vectorizer.fit_transform(reviews)
      print("Vocabulary:", vectorizer.get_feature_names_out())
      print("BoW Matrix:\n", bow_matrix.toarray())
0.5 Word2Vec
Word2Vec learns dense vector representations of words from the contexts in which they appear, so words used in similar contexts end up with similar vectors.
Real-world use case: Used in chatbots, recommendation systems, and search engines.
[15]: model = gensim.models.Word2Vec(tokenized_reviews, vector_size=10, window=2, min_count=1, sg=1)
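Here vector_size=10 gives each word a 10-dimensional vector and sg=1 selects the skip-gram variant. A quick way to inspect what was learned (a hypothetical follow-up cell, not part of the original notebook):

[ ]: # Look up the learned 10-dimensional embedding for one word ...
     print(model.wv['product'])
     # ... and the words whose vectors lie closest to it. On a 3-sentence
     # corpus these neighbours are essentially noise; meaningful
     # similarities need far more training text.
     print(model.wv.most_similar('product', topn=3))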
0.7 TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF gives importance to words that appear frequently in a document but not across all documents.
Real-world use case: Used in search engines, document ranking, and keyword extraction.
[19]: tfidf_vectorizer = TfidfVectorizer()
      tfidf_matrix = tfidf_vectorizer.fit_transform(reviews)
      print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
      print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
0.8 Stemming
Stemming reduces words to a crude root form by stripping suffixes with heuristic rules. The resulting stems are not always dictionary words.
Real-world use case: Used in search engines and information retrieval to match word variants.
[28]: print('REVIEWS')
      print(reviews)
      print(' ')
      print('Stemmed Output:')
      ps = PorterStemmer()
      stemmed_reviews = [[ps.stem(token) for token in tokens] for tokens in tokenized_reviews]
      for i, stemmed in enumerate(stemmed_reviews):
          print(f"Review {i+1} stemmed: {stemmed}")
REVIEWS
['I love this product amazing quality', 'Terrible product poor quality', 'I love the amazing service']
Stemmed Output:
Review 1 stemmed: ['i', 'love', 'thi', 'product', 'amaz', 'qualiti']
Review 2 stemmed: ['terribl', 'product', 'poor', 'qualiti']
Review 3 stemmed: ['i', 'love', 'the', 'amaz', 'servic']
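As the output shows, Porter stems such as 'thi', 'amaz', and 'qualiti' need not be real words; the stemmer works purely by rule, not by dictionary lookup. A couple of standalone examples:

[ ]: from nltk.stem import PorterStemmer
     ps = PorterStemmer()
     print(ps.stem('amazing'))    # 'amaz'    -- '-ing' stripped
     print(ps.stem('quality'))    # 'qualiti' -- trailing 'y' rewritten as 'i'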
0.9 Lemmatization
Lemmatization reduces words to their dictionary root form (lemma) using linguistic rules. It
considers the word’s meaning, making it more accurate than stemming.
Real-world use case: Lemmatization is used in chatbots, spell-checkers, and sentiment analysis.
[27]: print('REVIEWS')
      print(reviews)
      print(' ')
      print('Lemmatized Output:')
      lemmatizer = WordNetLemmatizer()
      lemmatized_reviews = [[lemmatizer.lemmatize(token) for token in tokens] for tokens in tokenized_reviews]
      for i, lemmatized in enumerate(lemmatized_reviews):
          print(f"Review {i+1} lemmatized: {lemmatized}")
REVIEWS
['I love this product amazing quality', 'Terrible product poor quality', 'I love the amazing service']
Lemmatized Output:
Review 1 lemmatized: ['i', 'love', 'this', 'product', 'amazing', 'quality']
Review 2 lemmatized: ['terrible', 'product', 'poor', 'quality']
Review 3 lemmatized: ['i', 'love', 'the', 'amazing', 'service']
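Note that WordNetLemmatizer treats every token as a noun by default, which is why 'amazing' passed through unchanged above; supplying a part-of-speech tag lets the linguistic rules take effect. A small sketch:

[ ]: lemmatizer = WordNetLemmatizer()
     print(lemmatizer.lemmatize('amazing'))           # 'amazing' (noun default)
     print(lemmatizer.lemmatize('amazing', pos='v'))  # 'amaze'
     print(lemmatizer.lemmatize('better', pos='a'))   # 'good'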
0.10 Stop Words Removal
Stop words are very common words (such as 'the', 'is', and 'a') that carry little meaning on their own and are often removed before analysis.
Real-world use case: Used in search engines and text classification to reduce noise and index size.
print('REVIEWS')
print(reviews)
print(' ')
print('Stop Words Removal Output:')
stop_words = set(stopwords.words('english'))
filtered_reviews = [[token for token in tokens if token not in stop_words]
                    for tokens in tokenized_reviews]
for i, filtered in enumerate(filtered_reviews):
    print(f"Review {i+1} without stop words: {filtered}")
REVIEWS
['I love this product amazing quality', 'Terrible product poor quality', 'I love the amazing service']
Stop Words Removal Output:
Review 1 without stop words: ['love', 'product', 'amazing', 'quality']
Review 2 without stop words: ['terrible', 'product', 'poor', 'quality']
Review 3 without stop words: ['love', 'amazing', 'service']
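Putting the pieces together, a typical preprocessing pipeline chains these steps. A minimal sketch combining the notebook's own components (the preprocess helper is illustrative, not part of the original):

[ ]: def preprocess(text):
         # tokenize and lowercase, drop stop words, then lemmatize
         tokens = word_tokenize(text.lower())
         tokens = [t for t in tokens if t not in stop_words]
         return [lemmatizer.lemmatize(t) for t in tokens]

     print(preprocess('I love this product amazing quality'))
     # -> ['love', 'product', 'amazing', 'quality']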