NLP Manual
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text.lower())
    # Remove punctuation
    tokens = [token for token in tokens if token not in string.punctuation]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

def lemmatize(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmas

def stem(tokens):
    stemmer = PorterStemmer()
    stems = [stemmer.stem(token) for token in tokens]
    return stems
def main():
    # Sample text
    text = ("Tokenization is the process of breaking down text into words and phrases. "
            "Stemming and Lemmatization are techniques used to reduce words to their base form.")
    # Preprocess text
    tokens = preprocess_text(text)
    # Lemmatization
    lemmas = lemmatize(tokens)
    print("Lemmatization:")
    print(lemmas)
    # Stemming
    stems = stem(tokens)
    print("\nStemming:")
    print(stems)

if __name__ == "__main__":
    main()
This program tokenizes the input text, removes punctuation and stopwords, and then performs lemmatization and stemming using NLTK. Make sure you have NLTK installed (pip install nltk) and download the required resources using nltk.download().
Output:
Lemmatization:
Stemming:
['token', 'process', 'break', 'text', 'word', 'phrase', 'stem', 'lemmat', 'techniqu', 'use', 'reduc', 'word',
'base', 'form']
This program uses the scikit-learn library for implementing the N-gram language model. It
utilizes the Multinomial Naive Bayes classifier for sentiment analysis. You can adjust the n-gram
range in the CountVectorizer to experiment with different N-gram models. Finally, it evaluates
the model's accuracy and displays the classification report.
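The listing for this program is not included in this excerpt. A minimal sketch of the approach just described, using a small hypothetical labelled dataset (the reviews and labels below are placeholders, not the manual's data), is:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labelled data (placeholders; substitute your own corpus)
reviews = ["I loved this movie", "Terrible plot and bad acting",
           "Great performances and a moving story", "Worst film I have seen",
           "Absolutely wonderful", "Boring and predictable"] * 10
labels = ["positive", "negative", "positive", "negative", "positive", "negative"] * 10

X_train, X_test, y_train, y_test = train_test_split(reviews, labels, test_size=0.2, random_state=42)

# N-gram features: adjust ngram_range to experiment with different N-gram models
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Multinomial Naive Bayes classifier on the n-gram counts
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)
y_pred = clf.predict(X_test_vec)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Changing ngram_range to (1, 3), for example, would add trigram features to the model.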
Output:
Accuracy: 0.85
Classification Report:
This output demonstrates an accuracy score of 0.85 and provides a classification report showing
precision, recall, F1-score, and support for both the "negative" and "positive" sentiment classes.
3. Write a program to extract features from text by using the following techniques:
i) One Hot Encoding
ii) Bag of Words (BoW)
iii) n-grams
iv) Tf-Idf
v) Custom features
vi) Word2Vec (Word Embedding)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

# Sample corpus (placeholder; the manual's original text_data is not shown here)
text_data = ["the cat sat on the mat",
             "the dog sat on the log",
             "cats and dogs are friends",
             "the mat was on the floor"]

# i) One Hot Encoding: binary presence/absence of each vocabulary term
def one_hot_encoding(corpus):
    vectorizer = CountVectorizer(binary=True)
    return vectorizer.fit_transform(corpus).toarray()

one_hot_encoded = one_hot_encoding(text_data)
print("One Hot Encoding:")
print(one_hot_encoded)

# iii) n-grams
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_features = ngram_vectorizer.fit_transform(text_data)
print("\nn-grams:")
print(ngram_features.toarray())

# iv) Tf-Idf
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(text_data)
print("\nTf-Idf:")
print(tfidf_features.toarray())
This program demonstrates how to extract features from text using One Hot Encoding, Bag of Words (BoW), n-grams, Tf-Idf, custom features (document length in this case), and Word2Vec (Word Embedding). Adjust the text_data variable according to your dataset. The Bag of Words, custom-feature, and Word2Vec steps are sketched below.
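A minimal sketch of those three steps, reusing the imports and text_data from the listing above (the document-length feature and the Word2Vec hyperparameters are illustrative assumptions):

# ii) Bag of Words (BoW): raw term counts per document
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(text_data)
print("\nBag of Words:")
print(bow_features.toarray())

# v) Custom features: here simply the number of tokens in each document
custom_features = np.array([[len(doc.split())] for doc in text_data])
print("\nCustom Features:")
print(custom_features)

# vi) Word2Vec: train a small embedding model on the tokenized corpus
tokenized = [doc.split() for doc in text_data]
w2v_model = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1)
print("\nWord2Vec vector for 'the':")
print(w2v_model.wv['the'])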
Output:
n-grams:
[[0 1 1 1 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1]
[0 2 1 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0]
[1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0]
[0 1 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0]]
Tf-Idf:
[[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.43877674 0. 0. 0.43877674]
[0. 0.60534851 0.30267425 0.30267425 0. 0.60534851
0.30267425 0. 0.60534851 0. ]
[0.55280532 0.38537163 0.38537163 0.38537163 0.55280532 0.
0.38537163 0.55280532 0. 0. ]
[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.43877674 0. 0. 0.43877674]]
Custom Features:
[[5]
[6]
[5]
[5]]
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
This program performs Topic Modeling using Latent Semantic Analysis (LSA) on the
sample text data provided. It converts the text data into a matrix of token counts using
CountVectorizer and then applies LSA using TruncatedSVD. Finally, it prints the top terms for
each topic extracted by LSA.
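The body of the program is not reproduced above. A minimal sketch of the LSA pipeline described, continuing from those imports (the placeholder corpus and the choice of two topics are assumptions), is:

# Placeholder corpus (substitute the manual's sample text data)
documents = ["the cat sat on the mat",
             "dogs and cats are popular pets",
             "stock markets fell sharply today",
             "investors worry about market volatility"]

# Convert the documents into a matrix of token counts
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Apply LSA (truncated SVD) to extract latent topics
n_topics = 2
lsa = TruncatedSVD(n_components=n_topics, random_state=42)
lsa.fit(X)

# Print the top terms for each topic
terms = vectorizer.get_feature_names_out()
for topic_idx, component in enumerate(lsa.components_):
    top_terms = [terms[i] for i in component.argsort()[::-1][:5]]
    print(f"Topic {topic_idx + 1}: {', '.join(top_terms)}")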
Output:
5. Develop a Python program to classify text using Naïve Bayes and SVM.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
This program uses the 20 newsgroups dataset available in scikit-learn. It splits the
dataset into training and testing sets, vectorizes the text data using TF-IDF representation,
trains Naïve Bayes and SVM classifiers, predicts the labels for the testing set using both
classifiers, and finally evaluates the performance of both classifiers using classification
reports.
The output of the code will consist of two classification reports, one for the Naïve
Bayes classifier and the other for the Support Vector Machine (SVM) classifier. These
reports provide precision, recall, F1-score, and support for each class in the dataset.
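Only the imports are reproduced above. A minimal sketch of the steps just described, continuing from those imports (the category subset and vectorizer settings are assumptions), is:

# Load a subset of the 20 newsgroups dataset (a few categories keep it fast)
categories = ['rec.sport.baseball', 'sci.space', 'talk.politics.misc']
data = fetch_20newsgroups(subset='all', categories=categories,
                          remove=('headers', 'footers', 'quotes'))

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Naive Bayes classifier
nb = MultinomialNB()
nb.fit(X_train_vec, y_train)
nb_pred = nb.predict(X_test_vec)
print("Naive Bayes classification report:")
print(classification_report(y_test, nb_pred, target_names=data.target_names))

# SVM classifier (linear kernel)
svm = SVC(kernel='linear')
svm.fit(X_train_vec, y_train)
svm_pred = svm.predict(X_test_vec)
print("SVM classification report:")
print(classification_report(y_test, svm_pred, target_names=data.target_names))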
Output:
Each classification report provides metrics such as precision, recall, F1-score, and support for each
class, as well as average metrics across all classes. These metrics indicate the performance of the
classifiers in classifying the text data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
This program performs K-means clustering on the 20 newsgroups dataset using TF-IDF
vectorization. It then prints the top terms for each cluster based on their TF-IDF scores.
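Only the imports are reproduced above. A minimal sketch of the clustering step described, continuing from those imports (the category subset, max_features, and number of clusters are assumptions), is:

# Load a subset of the 20 newsgroups dataset
categories = ['rec.sport.baseball', 'sci.space', 'comp.graphics']
data = fetch_20newsgroups(subset='all', categories=categories,
                          remove=('headers', 'footers', 'quotes'))

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(data.data)

# K-means clustering
k = 3
km = KMeans(n_clusters=k, random_state=42, n_init=10)
km.fit(X)

# Top terms per cluster: sort each cluster centroid by TF-IDF weight
terms = vectorizer.get_feature_names_out()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(k):
    top_terms = [terms[ind] for ind in order_centroids[i, :10]]
    print(f"Cluster {i}: {', '.join(top_terms)}")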
Output:
7. Develop a Python program to implement rule based and statistical PoS Tagging on text.
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import stopwords
# Sample text
text = "John likes to play football with his friends."
Output:
Rule-based PoS tagging: [('John', 'NNP'), ('likes', 'VB'), ('to', 'TO'), ('play', 'VB'), ('football', 'NN'),
('with', 'TO'), ('his', 'NNP'), ('friends', 'NN')]
Statistical PoS tagging: [('John', 'NNP'), ('likes', 'VBZ'), ('play', 'NN'), ('football', 'NN'), ('friends',
'NNS')]
In the output:
For rule-based PoS tagging, each word is paired with a PoS tag assigned by the predefined rules.
For statistical PoS tagging, each word is paired with a PoS tag assigned by NLTK's built-in PoS tagger, which is based on a statistical model trained on large corpora.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import LambdaCallback
import random
import sys

# text = ...  (the training corpus used in the manual is not reproduced here)

# Map every distinct character to an integer index and back
chars = sorted(list(set(text)))
char_indices = {char: i for i, char in enumerate(chars)}
indices_char = {i: char for i, char in enumerate(chars)}

# Cut the corpus into overlapping windows of max_len characters
max_len = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - max_len, step):
    sentences.append(text[i:i + max_len])
    next_chars.append(text[i + max_len])

# One-hot vectorization, the LSTM model definition, and the temperature-based
# sampling function are omitted in this excerpt. Generation extends the seed
# one character at a time (the function signature below is assumed):
def generate_text(seed_text, length, temperature=1.0):
    generated_text = seed_text
    for _ in range(length):
        # ... predict next_char from seed_text with the trained model ...
        generated_text += next_char
        seed_text = seed_text[1:] + next_char
    return generated_text

# on_epoch_end (omitted above) prints generated samples after every epoch
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
Output:
In the output:
After each epoch, the generated text is displayed with varying temperatures. Lower
temperatures lead to more deterministic output, while higher temperatures lead to more
diverse and creative output.
The model is gradually learning and improving its ability to generate text as the training
progresses through epochs.
Epoch 1/30
100/100 [==============================] - 2s 16ms/step - loss: 2.9398
Epoch 2/30
100/100 [==============================] - 2s 16ms/step - loss: 2.4381
----- Generating text after Epoch: 1
----- Temperature: 0.2
the the the the the the the the the the the the the the the the the the the the the the the the
the the the
----- Temperature: 0.5
is the the the the the the the the the the the the the the the the the the the the the the the the
the the
----- Temperature: 1.0
nd the the the the the the the the the the the the the the the the the the the the the the the
the the the
...
Epoch 30/30
100/100 [==============================] - 2s 16ms/step - loss: 1.2847
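The sampling function that produces this temperature behaviour is not shown in the excerpt above. A self-contained sketch of temperature-based sampling from a predicted character distribution (the probabilities below are made up for illustration):

import numpy as np

def sample(preds, temperature=1.0):
    # Re-weight the predicted distribution by temperature: low temperature
    # sharpens it (more deterministic), high temperature flattens it (more diverse)
    preds = np.asarray(preds, dtype='float64')
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))

# Made-up distribution over four "characters" for illustration
probs = [0.6, 0.2, 0.15, 0.05]
for t in [0.2, 0.5, 1.0]:
    draws = [sample(probs, t) for _ in range(1000)]
    print(f"temperature={t}: share of most likely char = {draws.count(0) / 1000:.2f}")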
9. Develop a Python program to implement HMM and CRF on sequence tagging task.
First, make sure you have both libraries installed: pip install hmmlearn and
pip install sklearn-crfsuite
import numpy as np
from hmmlearn import hmm
from sklearn_crfsuite import CRF
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Evaluation
print("HMM Results:")
hmm_pred = hmm_model.predict(np.concatenate(X_test), [len(seq) for seq in X_test])
print(classification_report([item for sublist in y_test for item in sublist],
                            [item for sublist in hmm_pred for item in sublist]))

print("\nCRF Results:")
crf_pred = crf_model.predict(X_test)
print(classification_report([item for sublist in y_test for item in sublist],
                            [item for sublist in crf_pred for item in sublist]))
In this example (a complete, self-contained sketch follows this list):
We define a toy dataset X with sequences of words and y with corresponding sequence
labels.
We split the dataset into train and test sets.
We train both HMM and CRF models on the training data.
We evaluate the models on the test data using the classification_report function from
sklearn.metrics.
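The toy dataset, feature extraction, and training code are not reproduced in the excerpt above. A minimal self-contained sketch of the pipeline just described follows; the dataset, the count-based estimation of the HMM parameters, the CategoricalHMM class, the CRF feature template, and the simple slice used in place of train_test_split are illustrative assumptions rather than the manual's exact code.

import numpy as np
from hmmlearn import hmm
from sklearn_crfsuite import CRF
from sklearn.metrics import classification_report

# Hypothetical toy dataset: word sequences with their tag sequences
X = [["the", "dog", "barks"], ["a", "cat", "sleeps"],
     ["the", "cat", "barks"], ["a", "dog", "sleeps"],
     ["the", "dog", "sleeps"], ["a", "cat", "barks"]]
y = [["DET", "NOUN", "VERB"]] * 6
X_train, X_test, y_train, y_test = X[:4], X[4:], y[:4], y[4:]

words = sorted({w for seq in X for w in seq})
tags = sorted({t for seq in y for t in seq})
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}

# --- HMM: estimate start/transition/emission probabilities by counting ---
start = np.zeros(len(tags))
trans = np.zeros((len(tags), len(tags)))
emit = np.zeros((len(tags), len(words)))
for seq, tag_seq in zip(X_train, y_train):
    start[tag2idx[tag_seq[0]]] += 1
    for i, (w, t) in enumerate(zip(seq, tag_seq)):
        emit[tag2idx[t], word2idx[w]] += 1
        if i > 0:
            trans[tag2idx[tag_seq[i - 1]], tag2idx[t]] += 1
# Normalise with add-one smoothing
start = (start + 1) / (start + 1).sum()
trans = (trans + 1) / (trans + 1).sum(axis=1, keepdims=True)
emit = (emit + 1) / (emit + 1).sum(axis=1, keepdims=True)

hmm_model = hmm.CategoricalHMM(n_components=len(tags))
hmm_model.startprob_, hmm_model.transmat_, hmm_model.emissionprob_ = start, trans, emit

print("HMM Results:")
hmm_pred = []
for seq in X_test:
    obs = np.array([[word2idx[w]] for w in seq])   # encode words as symbol ids
    states = hmm_model.predict(obs)                # Viterbi decoding
    hmm_pred.append([tags[s] for s in states])     # map state ids back to tags
print(classification_report([t for s in y_test for t in s],
                            [t for s in hmm_pred for t in s]))

# --- CRF: simple per-token feature dictionaries ---
def token_features(seq, i):
    return {"word": seq[i].lower(), "is_first": i == 0, "is_last": i == len(seq) - 1}

X_train_feats = [[token_features(seq, i) for i in range(len(seq))] for seq in X_train]
X_test_feats = [[token_features(seq, i) for i in range(len(seq))] for seq in X_test]

crf_model = CRF(algorithm="lbfgs", max_iterations=100)
crf_model.fit(X_train_feats, y_train)

print("\nCRF Results:")
crf_pred = crf_model.predict(X_test_feats)
print(classification_report([t for s in y_test for t in s],
                            [t for s in crf_pred for t in s]))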
Output:
HMM Results:
precision recall f1-score support
accuracy 0.67 7
macro avg 0.62 0.58 0.54 7
weighted avg 0.81 0.67 0.64 7
CRF Results:
precision recall f1-score support
accuracy 0.86 7
macro avg 0.88 0.92 0.87 7
weighted avg 0.93 0.86 0.86 7
Viva Questions