NLP Assignment (917722H031)
S
917722H031
1.NLP Tokenization
PROGRAM:
import nltk
Description:
1. nltk.download()
○ Purpose: Downloads required resources like tokenizer models ('punkt'),
stopwords, and lexical database ('wordnet') for processing.
2. word_tokenize()
○ Purpose: Splits the input sentence into individual words (tokens) for further
processing.
3. .lower()
○ Purpose: Converts every token to lowercase so that words like "NLP" and "nlp" are treated the same.
4. isalpha()
○ Purpose: Keeps only purely alphabetic tokens, filtering out punctuation and numbers.
5. stopwords.words('english')
○ Purpose: Provides a list of common English words (like "is", "the", "and") to be
removed as they carry little meaning.
6. PorterStemmer
○ Purpose: Reduces words to their root form using rule-based stemming (e.g., "evolving" → "evolv").
7. WordNetLemmatizer
○ Purpose: Converts words to their base or dictionary form (e.g., "evolving" → "evolve") using linguistic rules.
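The listing above shows only the import; below is a minimal sketch of the pipeline that steps 1-7 describe, where the example sentence is an assumption:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
sentence = "NLP is evolving and transforming industries"  # example sentence (assumption)
tokens = word_tokenize(sentence)
# Lowercase, keep alphabetic tokens only, and remove English stopwords
words = [w.lower() for w in tokens if w.isalpha()]
words = [w for w in words if w not in stopwords.words('english')]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print("Stemmed:", [stemmer.stem(w) for w in words])
# pos='v' treats each word as a verb so that "evolving" lemmatizes to "evolve"
print("Lemmatized:", [lemmatizer.lemmatize(w, pos='v') for w in words])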
OUTPUT:
2.Bag of Words and TF-IDF:
PROGRAM:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
"NLP is fun",
"NLP is powerful",
"NLP is transforming industries"
]
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print("BoW:", X_bow.toarray())
print("Features:", bow.get_feature_names_out())
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF:", X_tfidf.toarray())
Description:
1. from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
○ Purpose: Imports the Bag-of-Words (CountVectorizer) and TF-IDF (TfidfVectorizer) vectorizers from scikit-learn.
2. corpus
○ Purpose: A list of text documents that serve as the input for vectorization.
3. CountVectorizer()
○ Purpose: Initializes the BoW vectorizer, which counts the frequency of each word
in the corpus.
4. fit_transform(corpus)
○ Purpose: Learns the vocabulary from the corpus and transforms the documents
into a numerical matrix.
5. X_bow.toarray()
○ Purpose: Converts the sparse matrix result of CountVectorizer into a dense array
for easier viewing.
6. bow.get_feature_names_out()
○ Purpose: Retrieves the list of unique words (features) identified in the corpus.
7. TfidfVectorizer()
○ Purpose: Initializes the TF-IDF vectorizer, which weights each word's count by how informative (rare) the word is across the corpus.
8. X_tfidf.toarray()
○ Purpose: Converts the sparse TF-IDF matrix to a dense array to view the TF-IDF
scores.
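For this small corpus the learned vocabulary and BoW counts can be checked by hand; with CountVectorizer's default lowercasing, the features come out in alphabetical order:
# Features: ['fun', 'industries', 'is', 'nlp', 'powerful', 'transforming']
# "NLP is fun"                     -> [1, 0, 1, 1, 0, 0]
# "NLP is powerful"                -> [0, 0, 1, 1, 1, 0]
# "NLP is transforming industries" -> [0, 1, 1, 1, 0, 1]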
Output:
3.Subword Tokenization:
Description:
from transformers import AutoTokenizer
● Purpose: Imports the AutoTokenizer class from the Hugging Face Transformers library, which automatically selects the appropriate tokenizer for a given pre-trained model.
AutoTokenizer.from_pretrained("bert-base-uncased")
● Purpose: Loads the tokenizer associated with the BERT base model (uncased version,
meaning it lowercases all input).
● Automatically downloads and caches the tokenizer if it's not already available.
sentence
● Purpose: The input text to be split into subword tokens.
tokenizer.tokenize(sentence)
● Purpose: Splits the input sentence into subword tokens based on BERT's WordPiece
tokenization.
● Handles complex words and unknown tokens by breaking them into known subword pieces, with continuation pieces prefixed by "##" (e.g., "unbelievable" is split into 'un' plus '##'-prefixed pieces).
tokenizer.convert_tokens_to_ids(tokens)
● Purpose: Converts each subword token into its corresponding numerical ID from BERT’s
vocabulary.
print("Tokens:", tokens)
OUTPUT:
4.BERT Embeddings:
Program:
from transformers import AutoModel
from transformers import AutoTokenizer  # tokenizer recreated here so the snippet runs on its own
import torch
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("NLP is powerful", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print("Embedding shape:", outputs.last_hidden_state.shape)
Description:
from transformers import AutoModel
● Purpose: Imports the pre-trained model loader from Hugging Face's Transformers
library, allowing dynamic selection of model architecture (like BERT).
import torch
● Purpose: Imports PyTorch, which is used to manage tensors and control model
computation (like disabling gradient tracking).
AutoModel.from_pretrained("bert-base-uncased")
● Purpose: Loads the pre-trained BERT base model with lowercase (uncased) inputs.
● Only returns hidden states (not classification heads).
tokenizer("NLP is powerful", return_tensors="pt")
● Purpose: Tokenizes the input sentence and returns it as PyTorch tensors ("pt" stands for PyTorch).
with torch.no_grad():
● Purpose: Disables gradient computation since you’re doing inference (not training). This
saves memory and speeds up computation.
model(**inputs)
● Purpose: Runs a forward pass of BERT on the tokenized input and returns the model outputs.
outputs.last_hidden_state
● Purpose: Contains the embeddings (hidden states) for each token in the input sequence
from the last BERT layer.
outputs.last_hidden_state.shape
● Purpose: Prints the shape of the output tensor, typically (batch_size, sequence_length,
hidden_size), e.g., (1, 5, 768).
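One common next step with these per-token hidden states, sketched here under the assumption that a single sentence vector is wanted, is to mean-pool them over the sequence dimension:
# Average over the sequence dimension: (1, seq_len, 768) -> (1, 768)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print("Sentence embedding shape:", sentence_embedding.shape)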
OUTPUT:
5.TF-IDF Similarity:
PROGRAM:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Example sentences
sentence1 = "I love machine learning"
sentence2 = "Artificial intelligence is fascinating"
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2])
# Cosine similarity
similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
print(f"TF-IDF similarity: {similarity[0][0]:.4f}")
Description:
from sklearn.feature_extraction.text import TfidfVectorizer
● Purpose: Imports the tool for converting text into TF-IDF vectors, which reflect word
importance relative to the document and the corpus.
sentence1, sentence2
● Purpose: The two input sentences you want to compare for semantic similarity.
TfidfVectorizer()
● Purpose: Initializes the vectorizer that transforms the input text into TF-IDF-weighted
vectors.
fit_transform([sentence1, sentence2])
● Purpose: Learns vocabulary and computes the TF-IDF matrix for the input sentences.
tfidf_matrix[0:1], tfidf_matrix[1:2]
● Purpose: Selects the vector for each individual sentence (row slicing) to compute
pairwise similarity.
cosine_similarity()
● Purpose: Calculates how similar the two TF-IDF vectors are based on the cosine of the
angle between them. Returns a value between 0 (no similarity) and 1 (identical).
similarity[0][0]
● Purpose: Extracts the similarity score from the result matrix (since it's a 1x1 array here).
print(f"...")
● Purpose: Displays the final cosine similarity score, formatted to 4 decimal places.
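Because the two example sentences share no vocabulary, their TF-IDF vectors have no overlapping non-zero entries and the similarity works out to 0. The cosine formula itself can be checked by hand on the same matrix (a sketch; numpy is assumed to be available):
import numpy as np
# cos(a, b) = (a . b) / (||a|| * ||b||)
a = tfidf_matrix[0].toarray().ravel()
b = tfidf_matrix[1].toarray().ravel()
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Manual cosine similarity: {manual:.4f}")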
Output:
6.SEMANTIC SIMILARITY:
PROGRAM:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
Description:
1. from sentence_transformers import SentenceTransformer, util
○ Purpose: Imports the SentenceTransformer model loader and the util module, which provides cosine-similarity helpers.
2. SentenceTransformer('all-MiniLM-L6-v2')
○ Purpose: Loads a compact pre-trained sentence-embedding model.
○ Use case: Great for sentence-level semantic tasks like similarity, clustering, etc.
3. sentence1, sentence2
○ Purpose: The two input sentences whose meanings are to be compared.
4. model.encode([...], convert_to_tensor=True)
○ Purpose: Encodes both sentences into dense embedding vectors returned as PyTorch tensors.
5. util.pytorch_cos_sim(embeddings[0], embeddings[1])
○ Purpose: Computes the cosine similarity between the two sentence embeddings.
6. similarity.item()
○ Purpose: Converts the single tensor value (similarity score) to a regular float value for printing.
7. print(f"...")
○ Purpose: Displays the semantic similarity score (a sketch of these steps follows this description).
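The listing above ends after the model is loaded; a minimal sketch of the remaining steps described in items 3-7, where the two example sentences are assumptions:
sentence1 = "I love machine learning"                  # example sentences (assumption)
sentence2 = "Artificial intelligence is fascinating"
# Encode both sentences into dense embeddings (PyTorch tensors)
embeddings = model.encode([sentence1, sentence2], convert_to_tensor=True)
# Cosine similarity between the two sentence embeddings
similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])
print(f"Semantic similarity: {similarity.item():.4f}")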
Output:
7.Topic Modeling (LDA):
Program:
import gensim
from gensim import corpora
from pprint import pprint
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
Description:
import gensim / from gensim import corpora
● Purpose: Imports Gensim, a popular NLP library for topic modeling and vector space modeling. corpora helps in creating the dictionary and BoW representations.
warnings.filterwarnings(...)
● Purpose: Suppresses deprecation warnings so the topic output stays readable.
documents
● Purpose: A list of text documents (your input corpus) to extract topics from.
corpora.Dictionary(texts)
● Purpose: Builds a dictionary that maps every unique token in the corpus to an integer id.
doc2bow(text)
● Purpose: Converts each tokenized document into a bag-of-words list of (token id, count) pairs.
gensim.models.LdaModel(...)
● Purpose: Trains the Latent Dirichlet Allocation (LDA) model on the BoW corpus to discover a chosen number of topics.
lda_model.print_topics(num_words=5)
● Purpose: Retrieves and prints the top 5 words associated with each discovered topic.
pprint(...)
● Purpose: Pretty-prints the discovered topics so they are easier to read.
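The listing above stops after the imports; a minimal sketch of the remaining steps described above, where the example documents and the number of topics are assumptions:
documents = [
    "NLP is fun and powerful",                 # example documents (assumption)
    "Machine learning transforms industries",
    "Deep learning powers modern NLP"
]
texts = [doc.lower().split() for doc in documents]           # simple whitespace tokenization
dictionary = corpora.Dictionary(texts)                       # token -> integer id
corpus_bow = [dictionary.doc2bow(text) for text in texts]    # BoW representation per document
lda_model = gensim.models.LdaModel(corpus_bow, num_topics=2, id2word=dictionary, passes=10)
pprint(lda_model.print_topics(num_words=5))                  # top 5 words per topic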
Output:
8.RNN:
PROGRAM:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
sentences = [
    "i love nlp",
    "i love machine learning",
    "nlp is fun",
    "deep learning is powerful",
    "i enjoy learning"
]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
total_words = len(tokenizer.word_index) + 1
print("Vocabulary Size:", total_words)
# Build n-gram sequences: every prefix of a sentence predicts its next word
input_sequences = []
for line in sentences:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i+1])
# Padding sequences
max_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_len, padding='pre')
# Predictors are all words except the last; the label is the last word (one-hot)
X, y = input_sequences[:, :-1], input_sequences[:, -1]
y = to_categorical(y, num_classes=total_words)
model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_len-1)) # 10-dim embeddings
model.add(SimpleRNN(64)) # You can also try LSTM
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')  # optimizer choice assumed
model.fit(X, y, epochs=200, verbose=0)
def predict_next_word(seed_text, tokenizer, model, max_len):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_len-1, padding='pre')
    predicted_index = int(np.argmax(model.predict(token_list, verbose=0)))
    return tokenizer.index_word.get(predicted_index, "")
# Example usage
seed = "i love"
predicted = predict_next_word(seed, tokenizer, model, max_len)
print(f"'{seed}' -> '{predicted}'")
DESCRIPTION:
sentences
● Purpose: A list of sentences to train a simple language model for predicting the next
word based on a given seed text.
Tokenizer(), tokenizer.fit_on_texts(sentences)
● Purpose: Creates the Keras tokenizer and builds the word-to-index vocabulary from the training sentences.
total_words = len(tokenizer.word_index) + 1
● Purpose: Calculates the total number of unique words in the vocabulary, adding 1 to
account for padding.
tokenizer.texts_to_sequences([line])
● Purpose: Converts each sentence into a sequence of word indices based on the
vocabulary learned by the tokenizer.
input_sequences
● Purpose: Generates n-grams (sequences of words) for training. For each sentence, all possible n-grams are created to predict the next word based on previous words (illustrated after this description).
pad_sequences(input_sequences, maxlen=max_len, padding='pre')
● Purpose: Pads the input sequences to ensure they have the same length by adding zeros at the beginning ('pre' padding).
X, y = input_sequences[:, :-1], input_sequences[:, -1]
● Purpose: Splits each padded sequence into predictors (all words except the last) and the label (the last word).
to_categorical(y, num_classes=total_words)
● Purpose: Converts the labels into categorical format for multi-class classification (one-
hot encoding).
Sequential()
● Purpose: Initializes the Keras Sequential model, which is a linear stack of layers.
Embedding(total_words, 10, input_length=max_len-1)
● Purpose: Adds an embedding layer that converts word indices into dense vectors of fixed size (10 here), representing words in a continuous vector space.
SimpleRNN(64)
● Purpose: Adds a simple recurrent neural network layer with 64 units. This processes
sequences to capture dependencies between words.
Dense(total_words, activation='softmax')
● Purpose: Adds a fully connected layer with softmax activation, which outputs a
probability distribution over all possible next words.
model.fit(X, y, epochs=200, verbose=0)
● Purpose: Trains the model on the input data for 200 epochs, with no verbosity.
predict_next_word(seed, tokenizer, model, max_len)
● Purpose: Defines a function to predict the next word based on a seed text:
○ Uses the trained model to predict the next word, returning the word
corresponding to the predicted index.
np.argmax(predicted_probs)
● Purpose: Retrieves the index of the word with the highest probability as predicted by the
model.
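As a small illustration of the n-gram generation and the predictors/label split (shown at the word level; the actual sequences contain the tokenizer's integer indices):
# "i love nlp" -> n-gram sequences: [i, love], [i, love, nlp]
# after the split:
#   predictors [i]        -> label "love"
#   predictors [i, love]  -> label "nlp"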
Output: