
NLP ASSIGNMENT - 1
Monika. S
917722H031

1. NLP Tokenization

PROGRAM:
import nltk

# Downloads the necessary data for tokenization and text processing
nltk.download('punkt')       # Tokenizer model for breaking text into words/sentences
nltk.download('stopwords')   # Common English stopwords (e.g., "is", "the")
nltk.download('wordnet')     # WordNet lexical database for lemmatization

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Sample input text
text = "NLP is amazing and it's evolving rapidly!"

# Tokenize text into words and convert to lowercase
tokens = word_tokenize(text.lower())

# Remove punctuation or non-alphabetic tokens
tokens = [word for word in tokens if word.isalpha()]

# Remove common stopwords (e.g., "is", "and", "it")
filtered = [word for word in tokens if word not in stopwords.words('english')]

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Apply stemming to each filtered word
print("Stemmed:", [stemmer.stem(w) for w in filtered])

# Apply lemmatization to each filtered word
print("Lemmatized:", [lemmatizer.lemmatize(w) for w in filtered])

Description:

1. nltk.download()
○ Purpose: Downloads required resources like tokenizer models ('punkt'),
stopwords, and lexical database ('wordnet') for processing.

2. word_tokenize()

○ Purpose: Splits the input sentence into individual words (tokens) for further
processing.

3. .lower()

○ Purpose: Converts the entire text to lowercase to ensure uniformity when
comparing or filtering words.

4. isalpha()

○ Purpose: Checks if each token contains only alphabetic characters, removing
punctuation and numbers.

5. stopwords.words('english')

○ Purpose: Provides a list of common English words (like "is", "the", "and") to be
removed as they carry little meaning.

6. PorterStemmer

○ Purpose: Reduces words to their root form using rule-based stemming (e.g.,
"evolving" → "evolv").

7. WordNetLemmatizer

○ Purpose: Converts words to their base or dictionary form (e.g., "evolving" →
"evolve") using the WordNet lexical database. By default it treats every word as a
noun, so verbs are only reduced when a part-of-speech tag is passed (as shown below).
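
A minimal sketch of the part-of-speech behaviour, reusing the lemmatizer object from
the program above:

# Without a POS tag the word is treated as a noun and left as-is
print(lemmatizer.lemmatize("evolving"))           # -> evolving
# pos='v' tells WordNet to lemmatize it as a verb
print(lemmatizer.lemmatize("evolving", pos="v"))  # -> evolve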

OUTPUT:

2. Feature Extraction (BoW and TF-IDF)

PROGRAM:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
"NLP is fun",
"NLP is powerful",
"NLP is transforming industries"
]

bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print("BoW:", X_bow.toarray())
print("Features:", bow.get_feature_names_out())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF:", X_tfidf.toarray())

Description:
1. from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

○ Purpose: Imports two key tools for text feature extraction:

■ CountVectorizer: Converts text to a Bag-of-Words (BoW) model.

■ TfidfVectorizer: Converts text to a TF-IDF (Term Frequency-Inverse
Document Frequency) representation.

2. corpus

○ Purpose: A list of text documents that serve as the input for vectorization.

3. CountVectorizer()

○ Purpose: Initializes the BoW vectorizer, which counts the frequency of each word
in the corpus.
4. fit_transform(corpus)

○ Purpose: Learns the vocabulary from the corpus and transforms the documents
into a numerical matrix.

○ For BoW: Each element represents the count of a word in a document.

○ For TF-IDF: Each element represents the importance of a word in a document
relative to the corpus.

5. X_bow.toarray()

○ Purpose: Converts the sparse matrix result of CountVectorizer into a dense array
for easier viewing.

6. bow.get_feature_names_out()

○ Purpose: Retrieves the list of unique words (features) identified in the corpus.

7. TfidfVectorizer()

○ Purpose: Initializes the TF-IDF vectorizer, which evaluates word importance by
considering frequency and uniqueness across all documents.

8. X_tfidf.toarray()

○ Purpose: Converts the sparse TF-IDF matrix to a dense array to view the TF-IDF
scores.
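
To make the TF-IDF output easier to read, each weight can be paired with its feature name.
A small illustrative sketch reusing the tfidf and X_tfidf objects from the program above
(the pairing loop is an addition, not part of the original assignment code):

# Print the non-zero TF-IDF weights of the first document, word by word
feature_names = tfidf.get_feature_names_out()
first_doc_scores = X_tfidf.toarray()[0]
for word, score in zip(feature_names, first_doc_scores):
    if score > 0:
        print(f"{word}: {score:.3f}")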

Output:

3. Tokenization (Classical vs Modern)

PROGRAM:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "Unbelievable performance by the transformer model!"
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens:", tokens)
print("Token IDs:", token_ids)
DESCRIPTION:
from transformers import AutoTokenizer

● Purpose: Imports the AutoTokenizer class from the Hugging Face Transformers library,
which automatically selects the appropriate tokenizer for a given pre-trained model.

AutoTokenizer.from_pretrained("bert-base-uncased")

● Purpose: Loads the tokenizer associated with the BERT base model (uncased version,
meaning it lowercases all input).

● Automatically downloads and caches the tokenizer if it's not already available.

sentence

● Purpose: The input sentence that will be tokenized.

tokenizer.tokenize(sentence)

● Purpose: Splits the input sentence into subword tokens based on BERT's WordPiece
tokenization.


● Handles complex words and unknown tokens by breaking them into known subword pieces
(e.g., "unbelievable" is split into "un" followed by "##"-prefixed pieces).

tokenizer.convert_tokens_to_ids(tokens)

● Purpose: Converts each subword token into its corresponding numerical ID from BERT’s
vocabulary.

● These IDs are the actual inputs to the BERT model.

print("Tokens:", tokens)

● Purpose: Displays the list of tokens generated by the tokenizer.

print("Token IDs:", token_ids)


● Purpose: Shows the list of token IDs corresponding to each token.
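
Note that tokenize() returns only the subword pieces. Calling the tokenizer directly on the
sentence additionally inserts BERT's special tokens ([CLS] and [SEP]) and builds the model-ready
inputs. A short sketch reusing the tokenizer and sentence from the program above:

# The full encoding adds [CLS] at the start and [SEP] at the end
encoded = tokenizer(sentence)
print("Input IDs:", encoded["input_ids"])
print("Tokens with specials:", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))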

OUTPUT:

4. Contextual Word Embeddings (BERT)

Program:
from transformers import AutoTokenizer, AutoModel
import torch

# Load the matching tokenizer (same as in the previous section) and the model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("NLP is powerful", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print("Embedding shape:", outputs.last_hidden_state.shape)

Description:
from transformers import AutoModel

● Purpose: Imports the pre-trained model loader from Hugging Face's Transformers
library, allowing dynamic selection of model architecture (like BERT).

import torch

● Purpose: Imports PyTorch, which is used to manage tensors and control model
computation (like disabling gradient tracking).

AutoModel.from_pretrained("bert-base-uncased")

● Purpose: Loads the pre-trained BERT base model with lowercase (uncased) inputs.
● Only returns hidden states (not classification heads).

tokenizer("NLP is powerful", return_tensors="pt")

● Purpose: Tokenizes the input sentence and returns it as PyTorch tensors ("pt" stands for
PyTorch).

● Prepares inputs like input_ids and attention_mask for the model.

with torch.no_grad():

● Purpose: Disables gradient computation since you’re doing inference (not training). This
saves memory and speeds up computation.

model(**inputs)

● Purpose: Feeds the tokenized input into the BERT model.

● The **inputs unpacks arguments like input_ids and attention_mask.

outputs.last_hidden_state

● Purpose: Contains the embeddings (hidden states) for each token in the input sequence
from the last BERT layer.

outputs.last_hidden_state.shape

● Purpose: Prints the shape of the output tensor, typically (batch_size, sequence_length,
hidden_size), e.g., (1, 5, 768).
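
Since last_hidden_state holds one vector per token, a common way to obtain a single sentence
vector is to average the token embeddings using the attention mask. This mean-pooling step is
an illustrative addition (not part of the original program), continuing from the inputs and
outputs above:

# Mean-pool the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)              # (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)     # (1, 768)
sentence_embedding = summed / mask.sum(dim=1)
print("Sentence embedding shape:", sentence_embedding.shape)  # torch.Size([1, 768])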

OUTPUT:
5.TF-IDF Similarity:

PROGRAM:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example sentences
sentence1 = "I love machine learning"
sentence2 = "Artificial intelligence is fascinating"

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2])

# Cosine similarity
similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
print(f"TF-IDF similarity: {similarity[0][0]:.4f}")

Description:
from sklearn.feature_extraction.text import TfidfVectorizer

● Purpose: Imports the tool for converting text into TF-IDF vectors, which reflect word
importance relative to the document and the corpus.

from sklearn.metrics.pairwise import cosine_similarity


● Purpose: Imports the function to compute cosine similarity, which measures the angle
between two vectors—used here to find how similar two sentences are.

sentence1, sentence2

● Purpose: The two input sentences you want to compare for semantic similarity.

TfidfVectorizer()

● Purpose: Initializes the vectorizer that transforms the input text into TF-IDF-weighted
vectors.

fit_transform([sentence1, sentence2])

● Purpose: Learns vocabulary and computes the TF-IDF matrix for the input sentences.

tfidf_matrix[0:1], tfidf_matrix[1:2]

● Purpose: Selects the vector for each individual sentence (row slicing) to compute
pairwise similarity.

cosine_similarity()

● Purpose: Calculates how similar the two TF-IDF vectors are based on the cosine of the
angle between them. Returns a value between 0 (no similarity) and 1 (identical).

similarity[0][0]

● Purpose: Extracts the similarity score from the result matrix (since it's a 1x1 array here).

print(f"...")

● Purpose: Displays the final cosine similarity score, formatted to 4 decimal places.
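
For reference, the same score can be computed by hand: cosine similarity is the dot product of
the two vectors divided by the product of their norms. A quick NumPy sketch reusing
tfidf_matrix from the program above (illustrative only; cosine_similarity does exactly this):

import numpy as np

v1 = tfidf_matrix[0].toarray().ravel()
v2 = tfidf_matrix[1].toarray().ravel()
# cos(theta) = (v1 . v2) / (||v1|| * ||v2||)
manual = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"Manual TF-IDF similarity: {manual:.4f}")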

Output:

6.SEMANTIC SIMILARITY:

PROGRAM:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

sentence1 = "I love machine learning."
sentence2 = "I enjoy studying artificial intelligence."

embeddings = model.encode([sentence1, sentence2], convert_to_tensor=True)
similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])

print(f"Semantic similarity: {similarity.item():.4f}")

Description:
1. from sentence_transformers import SentenceTransformer, util

○ Purpose: Imports the Sentence-BERT model and utility functions.

■ SentenceTransformer: Loads pre-trained models for sentence embeddings.

■ util: Provides helper functions like cosine similarity in PyTorch.

2. SentenceTransformer('all-MiniLM-L6-v2')

○ Purpose: Loads a lightweight and fast pre-trained Sentence-BERT model.

○ Use case: Great for sentence-level semantic tasks like similarity, clustering, etc.

3. sentence1, sentence2

○ Purpose: The two input sentences to compare semantically.

4. model.encode([...], convert_to_tensor=True)

○ Purpose: Converts input sentences into dense vector representations
(embeddings).

○ convert_to_tensor=True returns PyTorch tensors for direct use in similarity
computations.

5. util.pytorch_cos_sim(embeddings[0], embeddings[1])

○ Purpose: Computes cosine similarity between the two sentence embeddings
using PyTorch.

6. similarity.item()
○ Purpose: Converts the single tensor value (similarity score) to a regular float
value for printing.

7. print(f"...")

○ Purpose: Displays the computed semantic similarity score, formatted to 4
decimal places.
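
The same utilities also work on more than two sentences at once: encoding a whole list and
computing the pairwise similarity matrix. A small sketch reusing the model object above
(util.cos_sim is the newer name for the same cosine-similarity helper; the extra sentence is
illustrative):

more_sentences = [
    "I love machine learning.",
    "I enjoy studying artificial intelligence.",
    "The weather is nice today."
]
emb = model.encode(more_sentences, convert_to_tensor=True)
# Pairwise cosine similarity matrix of shape (3, 3)
print(util.cos_sim(emb, emb))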

Output:

7.Topic Modeling with LDA (Latent Dirichlet Allocation)

Program:
import gensim
from gensim import corpora
from pprint import pprint
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Step 2: document corpus
documents = [
    "I love watching cricket and football with my friends.",
    "Messi and Ronaldo are amazing football players.",
    "Machine learning and AI are transforming technology.",
    "Python and Java are popular programming languages.",
    "Studying for exams requires focus and good sleep.",
    "Teachers play an important role in shaping our future.",
    "Movies and music help me relax after a long day.",
    "Marvel and DC make great superhero films.",
    "Eating fruits and vegetables keeps you healthy.",
    "Regular exercise improves both mental and physical health."
]

# Step 3: Preprocessing - tokenize and lowercase
texts = [[word.lower() for word in doc.split()] for doc in documents]

# Step 4: Create dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Step 5: Train LDA model
lda_model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,        # Adjust based on the expected number of topics
    random_state=42,
    passes=20,           # More passes = better convergence
    alpha='auto',
    per_word_topics=True
)

# Step 6: Display the topics
print("\nTop words in each topic:\n")
pprint(lda_model.print_topics(num_words=5))

# Step 7: Inference on a new sentence
new_doc = "I enjoy programming in Python and learning AI."
new_bow = dictionary.doc2bow(new_doc.lower().split())
topics = lda_model.get_document_topics(new_bow)

print("\n🔍 Topic distribution for new sentence:")
for topic_num, prob in topics:
    print(f"Topic {topic_num}: {prob:.4f}")
Description:

import gensim and from gensim import corpora

● Purpose: Imports Gensim, a popular NLP library for topic modeling and vector space
modeling. corpora helps in creating the dictionary and BoW representations.

warnings.filterwarnings(...)

● Purpose: Suppresses deprecation warnings to keep the output clean.

documents

● Purpose: A list of text documents (your input corpus) to extract topics from.

texts = [[word.lower() for word in doc.split()] for doc in documents]

● Purpose: Preprocesses each document:

○ Splits into words.

○ Converts to lowercase for consistency.

corpora.Dictionary(texts)

● Purpose: Builds a mapping (dictionary) from words to unique IDs.

doc2bow(text)

● Purpose: Converts each document into a bag-of-words vector:

○ Each document is represented as a list of (word_id, frequency) tuples.

gensim.models.LdaModel(...)

● Purpose: Trains an LDA topic model:

○ corpus: the BoW representation.

○ id2word: dictionary for mapping IDs back to words.

○ num_topics: number of latent topics to discover.


○ passes: number of iterations over the corpus.

○ alpha='auto': automatically tunes the document-topic distribution.

○ per_word_topics=True: tracks word distributions per topic.

lda_model.print_topics(num_words=5)

● Purpose: Retrieves and prints the top 5 words associated with each discovered topic.

● Useful for interpreting what each topic is about.

pprint(...)

● Purpose: Nicely formats the output for readability.
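
To see what doc2bow actually produces, one document's bag-of-words vector can be inspected
against the dictionary. A short illustrative sketch reusing dictionary and texts from the
program above (the exact word IDs depend on the dictionary's internal ordering):

# (word_id, count) pairs for the first document
first_bow = dictionary.doc2bow(texts[0])
print(first_bow)
# Map the IDs back to the words for readability
print([(dictionary[word_id], count) for word_id, count in first_bow])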

Output:

8.RNN:

PROGRAM:
sentences = [
    "i love nlp",
    "i love machine learning",
    "nlp is fun",
    "deep learning is powerful",
    "i enjoy learning"
]

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

total_words = len(tokenizer.word_index) + 1
print("Vocabulary Size:", total_words)

# Generate input sequences (predict next word)
input_sequences = []
for line in sentences:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        ngram_seq = token_list[:i+1]
        input_sequences.append(ngram_seq)

# Padding sequences
max_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_len, padding='pre')

# Features and labels
X, y = input_sequences[:, :-1], input_sequences[:, -1]
y = np.array(y)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.utils import to_categorical

y_cat = to_categorical(y, num_classes=total_words)

model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_len-1))  # 10-dim embeddings
model.add(SimpleRNN(64))  # You can also try LSTM
model.add(Dense(total_words, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(X, y_cat, epochs=200, verbose=0)

def predict_next_word(seed_text, tokenizer, model, max_len):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_len-1, padding='pre')
    predicted_probs = model.predict(token_list, verbose=0)[0]
    predicted_index = np.argmax(predicted_probs)

    for word, index in tokenizer.word_index.items():
        if index == predicted_index:
            return word

# Example usage
seed = "i love"
predicted = predict_next_word(seed, tokenizer, model, max_len)
print(f"'{seed}' → '{predicted}'")

DESCRIPTION:
sentences

● Purpose: A list of sentences to train a simple language model for predicting the next
word based on a given seed text.

Tokenizer() and fit_on_texts(sentences)

● Purpose:

○ Tokenizer(): Creates a tokenizer to process the text.

○ fit_on_texts(sentences): Tokenizes the sentences, assigning a unique integer
index to each word (creates a word index).

total_words = len(tokenizer.word_index) + 1

● Purpose: Calculates the total number of unique words in the vocabulary, adding 1 to
account for padding.

tokenizer.texts_to_sequences([line])

● Purpose: Converts each sentence into a sequence of word indices based on the
vocabulary learned by the tokenizer.
input_sequences

● Purpose: Generates n-grams (sequences of words) for training. For each sentence, all
possible n-grams are created to predict the next word based on previous words.

pad_sequences(input_sequences, maxlen=max_len, padding='pre')

● Purpose: Pads the input sequences to ensure they have the same length by adding
zeros at the beginning ('pre' padding).

X, y = input_sequences[:, :-1], input_sequences[:, -1]

● Purpose:

○ X: Features (all words except the last word of the sequence).

○ y: Labels (the last word in each sequence).

to_categorical(y, num_classes=total_words)

● Purpose: Converts the labels into categorical format for multi-class classification (one-
hot encoding).

Sequential()

● Purpose: Initializes the Keras Sequential model, which is a linear stack of layers.

Embedding(total_words, 10, input_length=max_len-1)

● Purpose: Adds an embedding layer that converts word indices into dense vectors of
fixed size (10 here), representing words in a continuous vector space.

SimpleRNN(64)

● Purpose: Adds a simple recurrent neural network layer with 64 units. This processes
sequences to capture dependencies between words.

Dense(total_words, activation='softmax')

● Purpose: Adds a fully connected layer with softmax activation, which outputs a
probability distribution over all possible next words.

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


● Purpose: Compiles the model with the Adam optimizer and categorical cross-entropy
loss function. Accuracy is tracked during training.

model.fit(X, y_cat, epochs=200, verbose=0)

● Purpose: Trains the model on the input data for 200 epochs, with no verbosity.

predict_next_word(seed_text, tokenizer, model, max_len)

● Purpose: Defines a function to predict the next word based on a seed text:

○ Converts the seed text into a sequence of word indices.

○ Pads the sequence.

○ Uses the trained model to predict the next word, returning the word
corresponding to the predicted index.

np.argmax(predicted_probs)

● Purpose: Retrieves the index of the word with the highest probability as predicted by the
model.
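
As the comment in the program suggests, SimpleRNN can be swapped for an LSTM layer, which
typically captures longer word dependencies better. A hedged sketch of that variant, reusing
the data (X, y_cat, total_words, max_len) prepared above; only the recurrent layer changes:

from tensorflow.keras.layers import LSTM

lstm_model = Sequential()
lstm_model.add(Embedding(total_words, 10, input_length=max_len-1))
lstm_model.add(LSTM(64))  # LSTM cell in place of SimpleRNN
lstm_model.add(Dense(total_words, activation='softmax'))
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model.fit(X, y_cat, epochs=200, verbose=0)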

Output:
