NLP Record

The document outlines various experiments in Natural Language Processing (NLP) using Python, covering tasks such as noise removal, lemmatization, stemming, slang standardization, part of speech tagging, topic modeling, TF-IDF, word embeddings, text classification, and cosine similarity. Each experiment includes code snippets and sample outputs demonstrating the techniques applied to textual data. It serves as a practical guide for implementing NLP concepts using libraries like NLTK, Gensim, and Scikit-learn.

Data Science: Natural Language Processing (SOC)

Experiment – 1

Demonstrate noise removal for textual data and remove regular-expression patterns such as hashtags from the text.

import re
def remove_noise(text):
    # Remove URLs first, before punctuation stripping breaks them apart
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    # Remove hashtags
    text = re.sub(r"#\w+", "", text)
    # Remove standalone numbers
    text = re.sub(r"\b\d+\b", "", text)
    # Remove remaining special characters
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse extra whitespace
    text = re.sub(r"\s+", " ", text)

    return text.strip()

# Example usage
text = "Hello! This is a #sample text with #hashtags and some special characters!! 123 @acet.ac.in"
clean_text = remove_noise(text)
print(clean_text)

Output:

Hello This is a text with and some special characters acetacin
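The same pattern-based approach extends to other noise such as @-mentions. The lines below are an optional sketch (the helper name remove_mentions is illustrative, not part of the recorded experiment) that strips handles before the other cleanup steps run.

import re

def remove_mentions(text):
    # Strip @-mentions the same way hashtags are removed; any leftover
    # double spaces are collapsed by the whitespace step in remove_noise
    return re.sub(r'@\w+', '', text)

print(remove_mentions("Thanks @alice for the #update"))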


Experiment – 2

Perform lemmatization and stemming using the Python library NLTK.

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer=WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))
print(lemmatizer.lemmatize("runs"))

def lemmatize(word):
    lemmatizer = WordNetLemmatizer()
    print("Verb Form: " + lemmatizer.lemmatize(word, pos="v"))
    print("Noun Form: " + lemmatizer.lemmatize(word, pos="n"))
    print("Adverb Form: " + lemmatizer.lemmatize(word, pos="r"))
    print("Adjective Form: " + lemmatizer.lemmatize(word, pos="a"))

lemmatize('skewing')

Output:
running
run

Verb Form: skew


Noun Form: skewing
Adverb Form: skewing
Adjective Form: skewing


import nltk
from nltk.stem import PorterStemmer, LancasterStemmer
porter_stemmer=PorterStemmer()
print(porter_stemmer.stem('running'))
print(porter_stemmer.stem('runs'))
print(porter_stemmer.stem('ran'))

Output:
run
run
ran
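LancasterStemmer is imported above but not exercised. The lines below are a hedged sketch (outputs not recorded here) comparing it with Porter on the same words; Lancaster is generally more aggressive and may truncate stems further.

lancaster_stemmer = LancasterStemmer()
for word in ['running', 'runs', 'ran', 'maximum']:
    # Lancaster often produces shorter stems than Porter
    print(word, '->', lancaster_stemmer.stem(word))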


Experiment – 3

Demonstrate object standardization, such as replacing social media slang in text.

slang_dict = {
"lol": "laughing out loud",
"omg": "oh my god",
"btw": "by the way",
"brb": "be right back",
"idk": "I don't know",
"tbh": "to be honest",
"imho": "in my humble opinion",
"afaik": "as far as I know",
"smh": "shaking my head",
"jk": "just kidding"
}

def standardize_text(text):
    words = text.split()
    standardized_words = []

    for word in words:
        if word.lower() in slang_dict:
            standardized_words.append(slang_dict[word.lower()])
        else:
            standardized_words.append(word)

    return ' '.join(standardized_words)

text = "lol that's so tbh idk why imho they would do that"
standardized_text = standardize_text(text)
print("Standardized text:", standardized_text)

Output:
Standardized text: laughing out loud that's so to be honest I don't know why in my humble opinion
they would do that
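The split-based lookup misses slang with punctuation attached (for example "omg," would not match). A hedged variant using re.sub with a replacement callback, reusing the same slang_dict, handles that case; the function name standardize_text_regex is illustrative.

import re

def standardize_text_regex(text):
    # Replace whole words only; surrounding punctuation is preserved
    return re.sub(r'\b\w+\b',
                  lambda m: slang_dict.get(m.group(0).lower(), m.group(0)),
                  text)

print(standardize_text_regex("omg, idk... btw that was great"))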


Experiment – 4

Perform part-of-speech tagging on textual data.

import nltk
from nltk import word_tokenize, pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

def perform_pos_tagging(text):
    tokens = word_tokenize(text)
    tagged_tokens = pos_tag(tokens)
    return tagged_tokens

# Example usage
text = "I love to explore new places and try different cuisines."
tagged_text = perform_pos_tagging(text)
print(tagged_text)

Output:

[('I', 'PRP'), ('love', 'VBP'), ('to', 'TO'), ('explore', 'VB'), ('new', 'JJ'), ('places', 'NNS'), ('and', 'CC'),
('try', 'VB'), ('different', 'JJ'), ('cuisines', 'NNS'), ('.', '.')]
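These Penn Treebank tags can also drive the lemmatizer from Experiment 2. The sketch below assumes the first letter of the tag is enough to choose the WordNet POS (defaulting to noun); penn_to_wordnet is an illustrative helper, not an NLTK function.

from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map the first letter of a Penn Treebank tag to a WordNet POS code
    return {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}.get(tag[0], 'n')

lemmatizer = WordNetLemmatizer()
for word, tag in tagged_text:
    print(word, '->', lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))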


Experiment – 5

Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.

import gensim
from gensim import corpora

# Sample documents

documents = [
"Machine learning is a subset of artificial intelligence.",
"Python is a popular programming language for data science.",
"Natural language processing is used in many applications such as chatbots.",
"Topic modeling is a technique for extracting topics from text data."
]

# Tokenize and preprocess the documents


tokenized_docs = [doc.lower().split() for doc in documents]

# Create a dictionary from the tokenized documents


dictionary = corpora.Dictionary(tokenized_docs)

# Create a corpus (term-document frequency)


corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Build the LDA model


num_topics = 2 # Number of topics to extract
lda_model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
passes=10)

# Print the extracted topics and their top words


for idx, topic in lda_model.print_topics(num_words=5):
    print(f"Topic {idx + 1}: {topic}")

# Get the topic distribution for a sample document


sample_doc = "Machine learning and data science go hand in hand."
sample_doc_bow = dictionary.doc2bow(sample_doc.lower().split())
sample_doc_topics = lda_model.get_document_topics(sample_doc_bow)
print(f"\nSample Document Topics: {sample_doc_topics}")

Output:

Topic 1: 0.056*"language" + 0.056*"is" + 0.055*"such" + 0.055*"applications" + 0.055*"many"


Topic 2: 0.080*"a" + 0.080*"is" + 0.057*"for" + 0.034*"data." + 0.034*"topics"

Sample Document Topics: [(0, 0.30881903), (1, 0.691181)]
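The topics above are dominated by function words such as "is" and "a" because the documents are only lower-cased and split. A hedged preprocessing sketch (the stop-word set below is a small hand-picked list, not a standard one) usually yields cleaner topics:

import re

stop_words = {"is", "a", "of", "for", "in", "as", "such", "from", "the", "and"}

def preprocess(doc):
    # Lowercase, keep alphabetic tokens only, and drop common stop words
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in stop_words]

tokenized_docs = [preprocess(doc) for doc in documents]
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda_model = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)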


Experiment – 6

Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using Python.

!pip install scikit-learn


import nltk
nltk.download('punkt')

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
"Machine learning is a subset of artificial intelligence.",
"Python is a popular programming language for data science.",
"Natural language processing is used in many applications such as chatbots.",
"Topic modeling is a technique for extracting topics from text data."
]

# Create the TF-IDF vectorizer


vectorizer = TfidfVectorizer()

# Fit the vectorizer to the documents and transform the documents into TF-IDF features
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names (terms)


feature_name = vectorizer.get_feature_names_out()

# Print the TF-IDF matrix


print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())

# Print the TF-IDF values for each term in each document


print("\nTF-IDF Values:")
for doc_index, doc in enumerate(documents):
    print(f"Document {doc_index + 1}:")
    for term_index, term in enumerate(feature_name):
        tfidf_value = tfidf_matrix[doc_index, term_index]
        if tfidf_value > 0:
            print(f"{term}: {tfidf_value:.4f}")

Output:


TF-IDF Values:
Document 1:
artificial: 0.3993
intelligence: 0.3993
is: 0.2084
learning: 0.3993
machine: 0.3993
of: 0.3993
subset: 0.3993
Document 2:
data: 0.3183
for: 0.3183
is: 0.2106
language: 0.3183
popular: 0.4037
programming: 0.4037
python: 0.4037
science: 0.4037
Document 3:
applications: 0.3179
as: 0.3179
chatbots: 0.3179
in: 0.3179
is: 0.1659
language: 0.2507
many: 0.3179


natural: 0.3179
processing: 0.3179
such: 0.3179
used: 0.3179
Document 4:
data: 0.2702
extracting: 0.3427
for: 0.2702
from: 0.3427
is: 0.1788
modeling: 0.3427
technique: 0.3427
text: 0.3427
topic: 0.3427
topics: 0.3427
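The weights above combine term frequency with inverse document frequency; with scikit-learn's defaults (smooth_idf=True) the IDF part is idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each document vector is then L2-normalized. The fitted vectorizer exposes these values through its idf_ attribute, which the short sketch below prints:

# Inspect the learned inverse document frequencies (higher = rarer term)
for term, idf in zip(feature_name, vectorizer.idf_):
    print(f"{term}: {idf:.4f}")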


Experiment – 7

Demonstrate word embeddings using Word2Vec.

!pip install gensim

from gensim.models import Word2Vec

# Step 1: Prepare training data (list of tokenized sentences)


sentences = [
["machine", "learning", "is", "fun"],
["deep", "learning", "uses", "neural", "networks"],
["natural", "language", "processing", "is", "a", "part", "of", "AI"],
["word2vec", "creates", "word", "embeddings"],
["AI", "is", "the", "future"],
]

# Step 2: Train Word2Vec model


model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2, sg=1)

# Step 3: Use the model

# Get embedding vector for a word


word_vector = model.wv["learning"]
print("Vector for 'learning':\n", word_vector)

# Find similar words


print("\nMost similar words to 'AI':")
similar = model.wv.most_similar("AI", topn=3)
for word, score in similar:
    print(f"{word}: {score:.4f}")


Experiment – 8

Implement text classification using the Naive Bayes classifier from the TextBlob library.

!pip install -U scikit-learn


!pip install -U textblob

import nltk
nltk.download('punkt')

from textblob import TextBlob


from textblob.classifiers import NaiveBayesClassifier

# Sample training data


train_data = [
('I love this car.', 'positive'),
('This view is amazing.', 'positive'),
('I feel great!', 'positive'),
('I dislike this product.', 'negative'),
('This place is horrible.', 'negative'),
('I feel sad.', 'negative')
]

# Create the Naive Bayes classifier


classifier = NaiveBayesClassifier(train_data)

# Sample test data


test_data = [
'I like this movie.',
'This food is terrible.',
'I am happy.'
]

# Classify the test data


for text in test_data:
    sentiment = classifier.classify(text)
    print(f'Text: {text}')
    print(f'Sentiment: {sentiment}\n')

Output:

Text: I like this movie.


Sentiment: positive


Text: This food is terrible.


Sentiment: positive

Text: I am happy.
Sentiment: positive
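With only six training sentences the classifier has never seen words like "terrible", which is why the second test sentence is labelled positive. TextBlob's NaiveBayesClassifier also exposes the underlying probability distribution; the sketch below (using the same classifier and test_data as above) prints how confident each decision is.

# Inspect the probability assigned to each label
for text in test_data:
    prob_dist = classifier.prob_classify(text)
    print(text)
    print("  positive:", round(prob_dist.prob("positive"), 3),
          "negative:", round(prob_dist.prob("negative"), 3))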


Experiment – 9

Apply a support vector machine (SVM) for text classification.

!pip install -U scikit-learn


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Sample data
documents = [
("I love natural language processing.", "positive"),
("Machine learning is fascinating.", "positive"),
("Python is widely used in data science.", "positive"),
("I dislike noisy environments.", "negative"),
("This movie is terrible.", "negative"),
("I feel sad today.", "negative")
]

# Split the data into features and labels


texts, labels = zip(*documents)

# Create the TF-IDF vectorizer


vectorizer = TfidfVectorizer()

# Transform the text data into TF-IDF features


features = vectorizer.fit_transform(texts)

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Initialize the SVM classifier


svm_classifier = SVC()

# Train the classifier


svm_classifier.fit(X_train, y_train)

# Make predictions on the test set


y_pred = svm_classifier.predict(X_test)

# Evaluate the performance of the classifier


accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)


print("\nClassification Report:")
print(report)

Output:
Accuracy: 0.0

Classification Report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       0.0
    positive       0.00      0.00      0.00       2.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0
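With six sentences and test_size=0.2, the test set holds only two examples, and the support column shows both of them are positive while the remaining training set is unbalanced, which is why the accuracy is 0.0 here. A hedged variant below keeps the class balance with stratify=labels; the score is still only illustrative on a corpus this small.

# Stratify so both classes appear in the training and test splits
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

svm_classifier = SVC()
svm_classifier.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, svm_classifier.predict(X_test)))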


Experiment – 10

Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.

from sklearn.feature_extraction.text import CountVectorizer


from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
"I love natural language processing.",
"Machine learning is fascinating.",
"Python is widely used in data science."
]

# Initialize the CountVectorizer


vectorizer = CountVectorizer()

# Fit and transform the documents to obtain the term frequency (TF) vectors
tf_vectors = vectorizer.fit_transform(documents).toarray()

# Calculate the cosine similarity between two documents


doc1 = tf_vectors[0]
doc2 = tf_vectors[1]
similarity = cosine_similarity([doc1], [doc2])[0][0]

print(f"Text 1: {documents[0]}")
print(f"Text 2: {documents[1]}")
print(f"Cosine Similarity: {similarity:.4f}")

Output:

Text 1: I love natural language processing.


Text 2: Machine learning is fascinating.
Cosine Similarity: 0.0000
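As a cross-check on what cosine_similarity computes, the sketch below applies the definition directly with numpy: the dot product of the two term-frequency vectors divided by the product of their norms. It is 0.0000 here because the two sentences share no vocabulary after CountVectorizer drops the single-character token "I".

import numpy as np

# cos(a, b) = (a . b) / (||a|| * ||b||)
dot = np.dot(doc1, doc2)
norm_product = np.linalg.norm(doc1) * np.linalg.norm(doc2)
manual_similarity = dot / norm_product if norm_product else 0.0
print(f"Manual Cosine Similarity: {manual_similarity:.4f}")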

