NLP Record

The document outlines various experiments in Natural Language Processing (NLP) using Python, covering tasks such as noise removal, lemmatization, stemming, slang standardization, part of speech tagging, topic modeling, TF-IDF, word embeddings, text classification, and cosine similarity. Each experiment includes code snippets and sample outputs demonstrating the techniques applied to textual data. It serves as a practical guide for implementing NLP concepts using libraries like NLTK, Gensim, and Scikit-learn.

Data Science: Natural Language Processing (SOC)

Experiment – 1

Demonstrate noise removal for textual data and remove regular-expression patterns such as hashtags from the text.

import re
def remove_noise(text):
    # Remove URLs first, before punctuation stripping breaks them apart
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    # Remove hashtags
    text = re.sub(r"#\w+", "", text)
    # Remove standalone numbers
    text = re.sub(r"\b\d+\b", "", text)
    # Remove remaining special characters
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse extra whitespace
    text = re.sub(r"\s+", " ", text)

    return text.strip()

# Example usage
text = "Hello! This is a #sample text with #hashtags and some special characters!! 123 @acet.ac.in"
clean_text = remove_noise(text)
print(clean_text)

Output:

Hello This is a text with and some special characters acetacin
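The same pattern-based approach extends to other noise such as @-mentions. The lines below are an optional sketch (the helper name remove_mentions is illustrative, not part of the recorded experiment) that strips handles before the other cleanup steps run.

import re

def remove_mentions(text):
    # Strip @-mentions the same way hashtags are removed; any leftover
    # double spaces are collapsed by the whitespace step in remove_noise
    return re.sub(r'@\w+', '', text)

print(remove_mentions("Thanks @alice for the #update"))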


Experiment – 2

Perform lemmatization and stemming using the Python library NLTK.

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer=WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))
print(lemmatizer.lemmatize("runs"))

def lemmatize(word):
    lemmatizer = WordNetLemmatizer()
    print("Verb Form: " + lemmatizer.lemmatize(word, pos="v"))
    print("Noun Form: " + lemmatizer.lemmatize(word, pos="n"))
    print("Adverb Form: " + lemmatizer.lemmatize(word, pos="r"))
    print("Adjective Form: " + lemmatizer.lemmatize(word, pos="a"))

lemmatize('skewing')

Output:
running
run

Verb Form: skew


Noun Form: skewing
Adverb Form: skewing
Adjective Form: skewing


import nltk
from nltk.stem import PorterStemmer, LancasterStemmer
porter_stemmer=PorterStemmer()
print(porter_stemmer.stem('running'))
print(porter_stemmer.stem('runs'))
print(porter_stemmer.stem('ran'))

Output:
run
run
ran
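LancasterStemmer is imported above but not exercised. The lines below are a hedged sketch (outputs not recorded here) comparing it with Porter on the same words; Lancaster is generally more aggressive and may truncate stems further.

lancaster_stemmer = LancasterStemmer()
for word in ['running', 'runs', 'ran', 'maximum']:
    # Lancaster often produces shorter stems than Porter
    print(word, '->', lancaster_stemmer.stem(word))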


Experiment – 3

Demonstrate object standardization, such as replacing social media slang in text.

slang_dict = {
"lol": "laughing out loud",
"omg": "oh my god",
"btw": "by the way",
"brb": "be right back",
"idk": "I don't know",
"tbh": "to be honest",
"imho": "in my humble opinion",
"afaik": "as far as I know",
"smh": "shaking my head",
"jk": "just kidding"
}

def standardize_text(text):
    words = text.split()
    standardized_words = []

    for word in words:
        if word.lower() in slang_dict:
            standardized_words.append(slang_dict[word.lower()])
        else:
            standardized_words.append(word)

    return ' '.join(standardized_words)

text = "lol that's so tbh idk why imho they would do that"
standardized_text = standardize_text(text)
print("Standardized text:", standardized_text)

Output:
Standardized text: laughing out loud that's so to be honest I don't know why in my humble opinion
they would do that
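The split-based lookup misses slang with punctuation attached (for example "omg," would not match). A hedged variant using re.sub with a replacement callback, reusing the same slang_dict, handles that case; the function name standardize_text_regex is illustrative.

import re

def standardize_text_regex(text):
    # Replace whole words only; surrounding punctuation is preserved
    return re.sub(r'\b\w+\b',
                  lambda m: slang_dict.get(m.group(0).lower(), m.group(0)),
                  text)

print(standardize_text_regex("omg, idk... btw that was great"))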


Experiment – 4

Perform part-of-speech tagging on textual data.

import nltk
from nltk import word_tokenize, pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

def perform_pos_tagging(text):
    tokens = word_tokenize(text)
    tagged_tokens = pos_tag(tokens)
    return tagged_tokens

# Example usage
text = "I love to explore new places and try different cuisines."
tagged_text = perform_pos_tagging(text)
print(tagged_text)

Output:

[('I', 'PRP'), ('love', 'VBP'), ('to', 'TO'), ('explore', 'VB'), ('new', 'JJ'), ('places', 'NNS'), ('and', 'CC'),
('try', 'VB'), ('different', 'JJ'), ('cuisines', 'NNS'), ('.', '.')]
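These Penn Treebank tags can also drive the lemmatizer from Experiment 2. The sketch below assumes the first letter of the tag is enough to choose the WordNet POS (defaulting to noun); penn_to_wordnet is an illustrative helper, not an NLTK function.

from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map the first letter of a Penn Treebank tag to a WordNet POS code
    return {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}.get(tag[0], 'n')

lemmatizer = WordNetLemmatizer()
for word, tag in tagged_text:
    print(word, '->', lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))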


Experiment – 5

Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.

import gensim
from gensim import corpora

# Sample documents

documents = [
"Machine learning is a subset of artificial intelligence.",
"Python is a popular programming language for data science.",
"Natural language processing is used in many applications such as chatbots.",
"Topic modeling is a technique for extracting topics from text data."
]

# Tokenize and preprocess the documents


tokenized_docs = [doc.lower().split() for doc in documents]

# Create a dictionary from the tokenized documents


dictionary = corpora.Dictionary(tokenized_docs)

# Create a corpus (term-document frequency)


corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Build the LDA model


num_topics = 2 # Number of topics to extract
lda_model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
passes=10)

# Print the extracted topics and their top words


for idx, topic in lda_model.print_topics(num_words=5):
    print(f"Topic {idx + 1}: {topic}")

# Get the topic distribution for a sample document


sample_doc = "Machine learning and data science go hand in hand."
sample_doc_bow = dictionary.doc2bow(sample_doc.lower().split())
sample_doc_topics = lda_model.get_document_topics(sample_doc_bow)
print(f"\nSample Document Topics: {sample_doc_topics}")

Output:

Topic 1: 0.056*"language" + 0.056*"is" + 0.055*"such" + 0.055*"applications" + 0.055*"many"


Topic 2: 0.080*"a" + 0.080*"is" + 0.057*"for" + 0.034*"data." + 0.034*"topics"

Sample Document Topics: [(0, 0.30881903), (1, 0.691181)]
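The topics above are dominated by function words such as "is" and "a" because the documents are only lower-cased and split. A hedged preprocessing sketch (the stop-word set below is a small hand-picked list, not a standard one) usually yields cleaner topics:

import re

stop_words = {"is", "a", "of", "for", "in", "as", "such", "from", "the", "and"}

def preprocess(doc):
    # Lowercase, keep alphabetic tokens only, and drop common stop words
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in stop_words]

tokenized_docs = [preprocess(doc) for doc in documents]
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda_model = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)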


Experiment – 6

Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using Python.

!pip install scikit-learn


import nltk
nltk.download('punkt')

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
"Machine learning is a subset of artificial intelligence.",
"Python is a popular programming language for data science.",
"Natural language processing is used in many applications such as chatbots.",
"Topic modeling is a technique for extracting topics from text data."
]

# Create the TF-IDF vectorizer


vectorizer = TfidfVectorizer()

# Fit the vectorizer to the documents and transform the documents into TF-IDF features
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names (terms)


feature_name = vectorizer.get_feature_names_out()

# Print the TF-IDF matrix


print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())

# Print the TF-IDF values for each term in each document


print("\nTF-IDF Values:")
for doc_index, doc in enumerate(documents):
    print(f"Document {doc_index + 1}:")
    for term_index, term in enumerate(feature_name):
        tfidf_value = tfidf_matrix[doc_index, term_index]
        if tfidf_value > 0:
            print(f"{term}: {tfidf_value:.4f}")

Output:


TF-IDF Values:
Document 1:
artificial: 0.3993
intelligence: 0.3993
is: 0.2084
learning: 0.3993
machine: 0.3993
of: 0.3993
subset: 0.3993
Document 2:
data: 0.3183
for: 0.3183
is: 0.2106
language: 0.3183
popular: 0.4037
programming: 0.4037
python: 0.4037
science: 0.4037
Document 3:
applications: 0.3179
as: 0.3179
chatbots: 0.3179
in: 0.3179
is: 0.1659
language: 0.2507
many: 0.3179


natural: 0.3179
processing: 0.3179
such: 0.3179
used: 0.3179
Document 4:
data: 0.2702
extracting: 0.3427
for: 0.2702
from: 0.3427
is: 0.1788
modeling: 0.3427
technique: 0.3427
text: 0.3427
topic: 0.3427
topics: 0.3427
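The weights above combine term frequency with inverse document frequency; with scikit-learn's defaults (smooth_idf=True) the IDF part is idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each document vector is then L2-normalized. The fitted vectorizer exposes these values through its idf_ attribute, which the short sketch below prints:

# Inspect the learned inverse document frequencies (higher = rarer term)
for term, idf in zip(feature_name, vectorizer.idf_):
    print(f"{term}: {idf:.4f}")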


Experiment – 7

Demonstrate word embeddings using Word2Vec.

!pip install gensim

from gensim.models import Word2Vec

# Step 1: Prepare training data (list of tokenized sentences)


sentences = [
["machine", "learning", "is", "fun"],
["deep", "learning", "uses", "neural", "networks"],
["natural", "language", "processing", "is", "a", "part", "of", "AI"],
["word2vec", "creates", "word", "embeddings"],
["AI", "is", "the", "future"],
]

# Step 2: Train Word2Vec model


model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2, sg=1)

# Step 3: Use the model

# Get embedding vector for a word


word_vector = model.wv["learning"]
print("Vector for 'learning':\n", word_vector)

# Find similar words


print("\nMost similar words to 'AI':")
similar = model.wv.most_similar("AI", topn=3)
for word, score in similar:
    print(f"{word}: {score:.4f}")


Experiment – 8

Implement text classification using the Naive Bayes classifier from the TextBlob library.

!pip install -U scikit-learn


!pip install -U textblob

import nltk
nltk.download('punkt')

from textblob import TextBlob


from textblob.classifiers import NaiveBayesClassifier

# Sample training data


train_data = [
('I love this car.', 'positive'),
('This view is amazing.', 'positive'),
('I feel great!', 'positive'),
('I dislike this product.', 'negative'),
('This place is horrible.', 'negative'),
('I feel sad.', 'negative')
]

# Create the Naive Bayes classifier


classifier = NaiveBayesClassifier(train_data)

# Sample test data


test_data = [
'I like this movie.',
'This food is terrible.',
'I am happy.'
]

# Classify the test data


for text in test_data:
    sentiment = classifier.classify(text)
    print(f'Text: {text}')
    print(f'Sentiment: {sentiment}\n')

Output:

Text: I like this movie.


Sentiment: positive


Text: This food is terrible.


Sentiment: positive

Text: I am happy.
Sentiment: positive
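With only six training sentences the classifier has never seen words like "terrible", which is why the second test sentence is labelled positive. TextBlob's NaiveBayesClassifier also exposes the underlying probability distribution; the sketch below (using the same classifier and test_data as above) prints how confident each decision is.

# Inspect the probability assigned to each label
for text in test_data:
    prob_dist = classifier.prob_classify(text)
    print(text)
    print("  positive:", round(prob_dist.prob("positive"), 3),
          "negative:", round(prob_dist.prob("negative"), 3))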


Experiment – 9

Apply a support vector machine (SVM) for text classification.

!pip install -U scikit-learn


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Sample data
documents = [
("I love natural language processing.", "positive"),
("Machine learning is fascinating.", "positive"),
("Python is widely used in data science.", "positive"),
("I dislike noisy environments.", "negative"),
("This movie is terrible.", "negative"),
("I feel sad today.", "negative")
]

# Split the data into features and labels


texts, labels = zip(*documents)

# Create the TF-IDF vectorizer


vectorizer = TfidfVectorizer()

# Transform the text data into TF-IDF features


features = vectorizer.fit_transform(texts)

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Initialize the SVM classifier


svm_classifier = SVC()

# Train the classifier


svm_classifier.fit(X_train, y_train)

# Make predictions on the test set


y_pred = svm_classifier.predict(X_test)

# Evaluate the performance of the classifier


accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)


print("\nClassification Report:")
print(report)

Output:
Accuracy: 0.0

Classification Report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       0.0
    positive       0.00      0.00      0.00       2.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0
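With six sentences and test_size=0.2, the test set holds only two examples, and the support column shows both of them are positive while the remaining training set is unbalanced, which is why the accuracy is 0.0 here. A hedged variant below keeps the class balance with stratify=labels; the score is still only illustrative on a corpus this small.

# Stratify so both classes appear in the training and test splits
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

svm_classifier = SVC()
svm_classifier.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, svm_classifier.predict(X_test)))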


Experiment – 10

Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.

from sklearn.feature_extraction.text import CountVectorizer


from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
"I love natural language processing.",
"Machine learning is fascinating.",
"Python is widely used in data science."
]

# Initialize the CountVectorizer


vectorizer = CountVectorizer()

# Fit and transform the documents to obtain the term frequency (TF) vectors
tf_vectors = vectorizer.fit_transform(documents).toarray()

# Calculate the cosine similarity between two documents


doc1 = tf_vectors[0]
doc2 = tf_vectors[1]
similarity = cosine_similarity([doc1], [doc2])[0][0]

print(f"Text 1: {documents[0]}")
print(f"Text 2: {documents[1]}")
print(f"Cosine Similarity: {similarity:.4f}")

Output:

Text 1: I love natural language processing.


Text 2: Machine learning is fascinating.
Cosine Similarity: 0.0000
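As a cross-check on what cosine_similarity computes, the sketch below applies the definition directly with numpy: the dot product of the two term-frequency vectors divided by the product of their norms. It is 0.0000 here because the two sentences share no vocabulary after CountVectorizer drops the single-character token "I".

import numpy as np

# cos(a, b) = (a . b) / (||a|| * ||b||)
dot = np.dot(doc1, doc2)
norm_product = np.linalg.norm(doc1) * np.linalg.norm(doc2)
manual_similarity = dot / norm_product if norm_product else 0.0
print(f"Manual Cosine Similarity: {manual_similarity:.4f}")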

