NLP Record
Experiment – 1
Demonstrate noise removal for textual data and remove regular-expression patterns such
as hashtags from the text.
import re
def remove_noise(text):
    # Remove URLs first, before the special-character step strips their punctuation
    text = re.sub(r"http\S+|www\S+", "", text)
    # Remove hashtags
    text = re.sub(r"#\w+", "", text)
    # Remove remaining special characters and digits
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse extra whitespace
    text = re.sub(r"\s+", " ", text)
    return text.strip()
# Example usage
text = "Hello! This is a #sample text with #hashtags and some special characters!! 123 @acet.ac.in"
clean_text = remove_noise(text)
print(clean_text)
Output:
Aditya College of Engineering & Technology
Data Science : Natural Language Processing ( SOC )
Experiment – 2
Demonstrate lemmatization and stemming of words using NLTK.
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))
print(lemmatizer.lemmatize("runs"))
def lemmatize(word):
    print("Verb Form: " + lemmatizer.lemmatize(word, pos="v"))
    print("Noun Form: " + lemmatizer.lemmatize(word, pos="n"))
    print("Adverb Form: " + lemmatizer.lemmatize(word, pos="r"))
    print("Adjective Form: " + lemmatizer.lemmatize(word, pos="a"))
lemmatize('skewing')
Output:
running
run
import nltk
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('running'))
print(porter_stemmer.stem('runs'))
print(porter_stemmer.stem('ran'))
Output:
run
run
ran
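Note that the Porter stemmer leaves "ran" unchanged because stemming is rule-based suffix stripping, not dictionary lookup. NLTK also provides the more aggressive LancasterStemmer; a short comparison sketch (the example words here are illustrative, not part of the record):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster applies more (and harsher) suffix-stripping rules than Porter,
# so it often produces shorter, less readable stems
for word in ['running', 'maximum', 'friendship']:
    print(f"{word}: porter={porter.stem(word)}, lancaster={lancaster.stem(word)}")
```

Lancaster's aggressiveness can over-stem (e.g. it reduces "maximum" where Porter does not), so Porter is usually the safer default.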
Experiment – 3
Demonstrate object standardization, such as replacing social media slang in text.
slang_dict = {
"lol": "laughing out loud",
"omg": "oh my god",
"btw": "by the way",
"brb": "be right back",
"idk": "I don't know",
"tbh": "to be honest",
"imho": "in my humble opinion",
"afaik": "as far as I know",
"smh": "shaking my head",
"jk": "just kidding"
}
def standardize_text(text):
    words = text.split()
    standardized_words = []
    for word in words:
        # Replace the word with its expansion if it is a known slang term
        standardized_words.append(slang_dict.get(word.lower(), word))
    return " ".join(standardized_words)
text = "lol that's so tbh idk why imho they would do that"
standardized_text = standardize_text(text)
print("Standardized text:", standardized_text)
Output:
Standardized text: laughing out loud that's so to be honest I don't know why in my humble opinion
they would do that
Experiment – 4
Perform part-of-speech (POS) tagging on textual data using NLTK.
import nltk
from nltk import word_tokenize, pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
def perform_pos_tagging(text):
    tokens = word_tokenize(text)
    tagged_tokens = pos_tag(tokens)
    return tagged_tokens
# Example usage
text = "I love to explore new places and try different cuisines."
tagged_text = perform_pos_tagging(text)
print(tagged_text)
Output:
[('I', 'PRP'), ('love', 'VBP'), ('to', 'TO'), ('explore', 'VB'), ('new', 'JJ'), ('places', 'NNS'), ('and', 'CC'),
('try', 'VB'), ('different', 'JJ'), ('cuisines', 'NNS'), ('.', '.')]
Experiment – 5
Perform topic modeling on text data using the gensim library.
import gensim
from gensim import corpora
# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Natural language processing is used in many applications such as chatbots.",
    "Topic modeling is a technique for extracting topics from text data."
]
# Tokenize and build a dictionary and bag-of-words corpus
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Train an LDA topic model (the number of topics here is illustrative)
lda_model = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda_model.print_topics():
    print(topic)
Output:
Experiment – 6
Extract TF-IDF features from text documents using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Natural language processing is used in many applications such as chatbots.",
    "Topic modeling is a technique for extracting topics from text data."
]
vectorizer = TfidfVectorizer()
# Fit the vectorizer to the documents and transform the documents into TF-IDF features
tfidf_matrix = vectorizer.fit_transform(documents)
# Print the TF-IDF value of each term per document
feature_names = vectorizer.get_feature_names_out()
print("TF-IDF Values:")
for i, row in enumerate(tfidf_matrix.toarray()):
    print(f"Document {i + 1}:")
    for term, score in zip(feature_names, row):
        if score > 0:
            print(f"  {term}: {score:.4f}")
Output:
TF-IDF Values:
Document 1:
artificial: 0.3993
intelligence: 0.3993
is: 0.2084
learning: 0.3993
machine: 0.3993
of: 0.3993
subset: 0.3993
Document 2:
data: 0.3183
for: 0.3183
is: 0.2106
language: 0.3183
popular: 0.4037
programming: 0.4037
python: 0.4037
science: 0.4037
Document 3:
applications: 0.3179
as: 0.3179
chatbots: 0.3179
in: 0.3179
is: 0.1659
language: 0.2507
many: 0.3179
natural: 0.3179
processing: 0.3179
such: 0.3179
used: 0.3179
Document 4:
data: 0.2702
extracting: 0.3427
for: 0.2702
from: 0.3427
is: 0.1788
modeling: 0.3427
technique: 0.3427
text: 0.3427
topic: 0.3427
topics: 0.3427
Experiment – 7
Experiment – 8
Implement text classification using a Naïve Bayes classifier and the TextBlob library.
import nltk
nltk.download('punkt')
Output:
Text: I am happy.
Sentiment: positive
Experiment – 9
Implement text classification with scikit-learn and evaluate it with accuracy and a classification report.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# Sample data
documents = [
    ("I love natural language processing.", "positive"),
    ("Machine learning is fascinating.", "positive"),
    ("Python is widely used in data science.", "positive"),
    ("I dislike noisy environments.", "negative"),
    ("This movie is terrible.", "negative"),
    ("I feel sad today.", "negative")
]
# Split texts and labels, vectorize, and train a Naive Bayes classifier
# (the split parameters below are an assumed reconstruction; the recorded
# output was produced by the original split)
texts, labels = zip(*documents)
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)
model = MultinomialNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(report)
Output:
Accuracy: 0.0
Classification Report:
precision recall f1-score support
Experiment – 10
Convert text to vectors (using term frequency) and apply cosine similarity to measure the
closeness between two texts.
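Cosine similarity between two vectors a and b is their dot product divided by the product of their magnitudes, a·b / (‖a‖ ‖b‖). Before using the library call, the computation can be sketched directly with NumPy (the vectors here are illustrative, not from the record):

```python
import numpy as np

def cosine(a, b):
    # dot product over the product of the vector magnitudes
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two tiny term-frequency vectors over a shared 4-word vocabulary:
# they agree on 2 of 3 nonzero terms, giving 2 / (sqrt(3) * sqrt(3)) = 2/3
print(cosine([1, 1, 0, 1], [1, 0, 1, 1]))  # 2/3 ≈ 0.6667
```

A value of 1 means identical direction (same term proportions), 0 means no shared terms.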
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample documents
documents = [
    "I love natural language processing.",
    "Machine learning is fascinating.",
    "Python is widely used in data science."
]
# Create a term-frequency vectorizer
vectorizer = CountVectorizer()
# Fit and transform the documents to obtain the term frequency (TF) vectors
tf_vectors = vectorizer.fit_transform(documents).toarray()
# Cosine similarity between the first two documents
similarity = cosine_similarity([tf_vectors[0]], [tf_vectors[1]])[0][0]
print(f"Text 1: {documents[0]}")
print(f"Text 2: {documents[1]}")
print(f"Cosine Similarity: {similarity:.4f}")
Output: