NLP Record
List of Experiments
1. Demonstrate noise removal for any textual data and remove regular expression patterns such as hashtags from textual data.
Noise Removal: Any piece of text which is not relevant to the context of the data and the end output can be considered noise.
For example: language stopwords (commonly used words of a language such as is, am, the, of, in, etc.), URLs or links, social media entities (mentions, hashtags), punctuation and industry-specific words. This step deals with the removal of all types of noisy entities present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities, and iterate the text object
by tokens (or by words), eliminating those tokens which are present in the noise dictionary.
# Sample code to remove noisy words from a text
noise_list = ["is", "a", "this", "..."]

def _remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

print(_remove_noise("this is a book"))
Output: book
Another approach is to use regular expressions when dealing with special patterns of noise.
The following Python code removes a regex pattern from the input text:
import re

def remove_regex_pattern(text, pattern):
    """Removes a regex pattern from the input text.

    Args:
        text: The input text.
        pattern: The regex pattern to remove.

    Returns:
        The input text with the regex pattern removed.
    """
    return re.sub(pattern, "", text)
# Example usage:
text = "welcome to (.*) CIST"
pattern = r"\(.*\)"
# Remove the regex pattern from the text.
text_without_regex_pattern = remove_regex_pattern(text, pattern)
# Print the text without the regex pattern.
print(text_without_regex_pattern)
Output (note the extra space left where the matched pattern was removed):
welcome to  CIST
The following is an example of how to remove a hashtag pattern (such as #tag) from the input text in Python:
import re

def remove_hashtag(text):
    """Removes hashtags from the input text.

    Args:
        text: The input text.

    Returns:
        The text with the hashtags removed.
    """
    pattern = re.compile(r'#\w+')
    return pattern.sub('', text)
# Example usage:
text = 'This is a tweet with #hashtag.'
print(remove_hashtag(text))
Output:
This is a tweet with .
2. Demonstrate stemming and lemmatization of text using the NLTK library.
Stemming: Stemming is a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.
Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

def perform_lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(lemmatized_words)

def perform_stemming(text):
    stemmer = PorterStemmer()
    tokens = word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed_words)
# Example text
sample_text = "The quick brown foxes are jumping over the lazy dogs"
# Lemmatization
lemmatized_text = perform_lemmatization(sample_text)
print("Lemmatized Text:", lemmatized_text)
# Stemming
stemmed_text = perform_stemming(sample_text)
print("Stemmed Text:", stemmed_text)
Output:
Lemmatized Text: The quick brown fox are jumping over the lazy dog
Stemmed Text: the quick brown fox are jump over the lazi dog
Output (for the input "They are playing well"):
Lemmatized Text: They are playing well
Stemmed Text: they are play well
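With its default settings, WordNetLemmatizer treats every token as a noun, which is why words such as "are", "jumping" and "playing" are left unchanged above. The following is a minimal sketch of passing an explicit part-of-speech argument to the same lemmatizer; the expected outputs in the comments assume the standard WordNet data:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# pos='v' tells WordNet to look the word up as a verb instead of a noun.
print(lemmatizer.lemmatize("are", pos="v"))      # be
print(lemmatizer.lemmatize("jumping", pos="v"))  # jump
print(lemmatizer.lemmatize("playing", pos="v"))  # play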
3. Demonstrate object standardization, such as replacing social media slang in a text.
Text data often contains words or phrases which are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.
Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup method to replace social media slang in a text.
Program:
import nltk

# Keys are lower-cased so that the case-insensitive lookup below works.
lookup_dict = {'ec': 'European Commission', 'eu': 'European Union',
               'ecsc': 'European Coal and Steel Community',
               'eec': 'European Economic Community'}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text
_lookup_words("The High Authority was the supranational administrative executive of the new "
              "European Coal and Steel Community ECSC. It took office first on 10 August 1952 in "
              "Luxembourg. In 1958, the Treaties of Rome had established two new communities "
              "alongside the ECSC: the eec and the European Atomic Energy "
              "Community (Euratom). However their executives were called Commissions rather than High Authorities")
Output: the input text is printed with "eec" expanded to "European Economic Community"; tokens with attached punctuation (such as "ECSC." and "ECSC:") are not matched by this simple whitespace-split lookup.
import nltk

lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome', 'luv': 'love', 'pls': 'please'}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text
_lookup_words("RT this is a retweeted tweet by Devi")
_lookup_words(" She is a awsm student")
_lookup_words("pls send the material")
Output:
Retweet this is a retweeted tweet by Devi
She is a awesome student
please send the material
4. Demonstrate part of speech (POS) tagging and its applications.
Part of speech tags are useful in several ways:
A. Word sense disambiguation: Some words have multiple meanings according to their usage, for example "book" in "book my flight" versus "I will read this book". "Book" is used in different contexts, and the part of speech tag for the two cases is different: in the first phrase "book" is used as a verb, while in the second it is used as a noun. (The Lesk algorithm is also used for similar purposes.)
B. Improving word-based features: A learning model could learn different contexts of a word when words are used as features; however, if the part of speech tag is linked with them, the context is preserved, thus making stronger features (a short POS-tagging sketch follows this list). For example:
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1),
(“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
C. Normalization and Lemmatization: POS tags are the basis of lemmatization process for converting a
word to its base form (lemma).
D. Efficient stopword removal: POS tags are also useful in the efficient removal of stopwords.
For example, there are some tags which always define the low-frequency / less important words of a language, such as (IN – "within", "upon", "except"), (CD – "one", "two", "hundred"), (MD – "may", "must", etc.).
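The following is a minimal sketch of POS tagging with NLTK that builds the word_TAG style features shown in item B (the resource names in the download calls may differ slightly between NLTK releases, and the exact tags depend on the pre-trained tagger):
import nltk
from nltk.tokenize import word_tokenize

# One-time downloads; on newer NLTK releases the tagger resource may be named
# 'averaged_perceptron_tagger_eng' and the tokenizer data 'punkt_tab'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "book my flight, I will read this book"
tokens = word_tokenize(text)

# nltk.pos_tag returns (token, Penn Treebank tag) pairs, e.g. ('flight', 'NN')
tagged = nltk.pos_tag(tokens)

# Attach the tag to each word so the verb and noun uses of "book" stay distinct.
pos_features = [f"{word}_{tag}" for word, tag in tagged if word.isalpha()]
print(pos_features)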
5. Implement topic modeling using Latent Dirichlet Allocation (LDA) in python.
Introduction
Imagine walking into a bookstore to buy a book on world economics and not being able to figure out the section of the store that has this book, assuming the bookstore has simply stacked all types of books together. You then realize how important it is to divide the bookstore into different sections based on the type of book.
Topic Modelling is similar to dividing a bookstore based on the content of the books, as it refers to the process of discovering themes in a text corpus and annotating the documents based on the identified topics.
When you need to segment, understand, and summarize a large collection of documents, topic modelling can be useful.
Reference: https://thinkinfi.com/latent-dirichlet-allocation-for-beginners-a-high-level-overview/
Latent Dirichlet Allocation (LDA) is one of the ways to implement Topic Modelling. It is a generative probabilistic model in which each document is assumed to be a mixture of a small number of topics.
How does the LDA algorithm work?
The following steps are carried out in LDA to assign topics to each of the documents:
1) For each document, randomly initialize each word to one of the K topics, where K is the number of topics chosen beforehand.
2) For each document d and each word w, compute:
P(topic t | document d): the proportion of words in document d that are assigned to topic t
P(word w | topic t): the proportion of assignments to topic t, across all documents, that come from word w
3) Reassign topic t' to word w with probability p(t'|d)*p(w|t'), considering all other words and their topic assignments.
The last step is repeated multiple times till we reach a steady state where the topic assignments do not
change further. The proportion of topics for each document is then determined from these topic
assignments.
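For instance, in step 3, if 30% of the words in document d are currently assigned to topic t' (p(t'|d) = 0.3) and word w accounts for 10% of all assignments to topic t' (p(w|t') = 0.1), then w is reassigned to t' with probability proportional to 0.3 × 0.1 = 0.03; the topic with the highest such product is the most likely new assignment for w.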
Let us say that we have 4 documents as the corpus and we wish to carry out topic modelling on them.
LDA modelling helps us in discovering topics in such a corpus and assigning topic mixtures for each of the documents. As an example, the model might indicate that Documents 1 and 2 belong 100% to Topic 1, while Document 3 belongs 100% to Topic 2.
This assignment of topics to documents is carried out by LDA modelling using the steps that we discussed in the previous section. Let us now apply LDA to some text data and analyze the actual outputs in Python.
We have taken the ‘Amazon Fine Food Reviews’ data from Kaggle.
We start by importing the Pandas library to read the CSV and save it in a data frame.
Python Code:
import pandas as pd
import nltk
rev = pd.read_csv(r"Reviews.csv")
print(rev.head())
We are interested in carrying out topic modelling for the ‘Text’ column in this dataset.
Additionally, we would also remove the stop-words before carrying out the LDA. To carry out topic
modelling, we need to convert our text column into a vectorized form and therefore we import the
TfidfVectorizer.
import nltk
from nltk.corpus import stopwords                            # stopword list
from nltk.stem import WordNetLemmatizer                      # lemmatizer
from nltk.tokenize import word_tokenize                      # tokenizer used in clean_text below
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
We will apply lemmatization to the words so that the root words of all derived words are used.
Furthermore, the stop-words are removed and words with lengths greater than 3 are used.
def clean_text(headline):
    le = WordNetLemmatizer()
    word_tokens = word_tokenize(headline)
    tokens = [le.lemmatize(w) for w in word_tokens if w not in stop_words and len(w) > 3]
    cleaned_text = " ".join(tokens)
    return cleaned_text

rev['cleaned_text'] = rev['Text'].apply(clean_text)
Carrying out a TF-IDF vectorization on the text column gives us a document-term matrix on which we can carry out the topic modelling. TF-IDF stands for Term Frequency – Inverse Document Frequency: this vectorization compares the number of times a word appears in a document with the number of documents that contain the word.
vect = TfidfVectorizer(stop_words=stop_words, max_features=1000)
vect_text = vect.fit_transform(rev['cleaned_text'])
The parameters that we have given to the LDA model, as shown below, include the number of topics, the learning method (which is the way the algorithm updates the assignment of topics to the documents), the maximum number of iterations to be carried out, and the random state.
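A minimal sketch of fitting the LDA model, assuming scikit-learn's LatentDirichletAllocation and the vect_text matrix produced above (the parameter values themselves are illustrative):
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative choices: 10 topics, online variational Bayes updates,
# a fixed number of iterations and a fixed random state for reproducibility.
lda_model = LatentDirichletAllocation(n_components=10,
                                      learning_method='online',
                                      max_iter=10,
                                      random_state=42)

# Each row of lda_top gives the topic proportions for one document.
lda_top = lda_model.fit_transform(vect_text)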
We can check the proportion of topics that have been assigned to the first document using the lines of code below:
print("Document 0: ")
for i, topic in enumerate(lda_top[0]):
    print("Topic ", i, ": ", topic*100, "%")
The ten most important words for each topic can then be printed as follows:
vocab = vect.get_feature_names_out()   # on older scikit-learn versions: vect.get_feature_names()
for i, comp in enumerate(lda_model.components_):
    vocab_comp = zip(vocab, comp)
    sorted_words = sorted(vocab_comp, key=lambda x: x[1], reverse=True)[:10]
    print("Topic " + str(i) + ": ")
    for t in sorted_words:
        print(t[0], end=" ")
    print("\n")
In addition to LDA, other algorithms can be leveraged to carry out topic modelling. Latent Semantic Indexing (LSI) and Non-negative Matrix Factorization (NMF) are some of the other algorithms one could try (a short NMF sketch is given below). All these algorithms, like LDA, involve feature extraction from document-term matrices and generating a group of terms that are differentiating from each other, which eventually leads to the creation of topics. These topics can help in assessing the main themes of a corpus of documents.
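A minimal sketch of the NMF alternative, assuming scikit-learn's NMF and reusing the vect_text matrix and vocab list from the LDA example above:
from sklearn.decomposition import NMF

# Illustrative: factorize the TF-IDF matrix into 10 topic components.
nmf_model = NMF(n_components=10, random_state=42)
nmf_top = nmf_model.fit_transform(vect_text)

# Top 10 words per NMF topic, analogous to the LDA loop above.
for i, comp in enumerate(nmf_model.components_):
    top_terms = sorted(zip(vocab, comp), key=lambda x: x[1], reverse=True)[:10]
    print("Topic " + str(i) + ": " + " ".join(t[0] for t in top_terms))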
6. Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using python.
7. Demonstrate word embeddings using word2vec.
8. Implement text classification using the Naïve Bayes classifier and the TextBlob library.
9. Apply support vector machine for text classification.
10. Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.
11. Case study 1: Identify the sentiment of tweets.
In this problem, you are provided with tweet data and have to predict netizens' sentiment towards electronic products.
12. Case study 2: Detect hate speech in tweets.
The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to separate racist or sexist tweets from other tweets.