Experiment : 1
Aim : Perform noise removal on textual data and remove regular-expression patterns such as hashtags from the text.
Program :
import string
import re

# Sample raw text (taken from the output shown below)
input_text = "our college is offering computer courses in {b.tech} with specializations & [m.tech] with specialization science->total courses:?https://www.aec.edu.in is our college website!!! ..."

# Convert the text to lowercase
input_text = input_text.lower()
print(input_text)

# Remove digits
input_text = re.sub(r'\d+', '', input_text)
print(input_text)

# Remove leading and trailing whitespace
input_text = input_text.strip()
print(input_text)

# Remove HTML tags
html_pattern = re.compile('<.*?>')
input_text = html_pattern.sub('', input_text)
print(input_text)

# Remove URLs
url_pattern = re.compile(r'https?://\S+|www\.\S+')
input_text = url_pattern.sub('', input_text)
print(input_text)

# Remove hashtags
hash_pattern = re.compile(r'#[a-z]+')
input_text = hash_pattern.sub('', input_text)
print(input_text)

# Remove punctuation characters
for punc in string.punctuation:
    if punc in input_text:
        input_text = input_text.replace(punc, '')
print(input_text)

# Find the pattern "is our" followed by whitespace, then remove it
x = re.findall(r"is our\s+", input_text)
input_text = re.sub(r"is our\s+", "", input_text)
print(input_text)

# A second sample text: a tweet containing hashtags and stray punctuation
tweet = """If you hold an empty #gaterode #battle up to your ear@@you can hear the sports oo%%"""
print(x)
Output :
our college is offering computer courses in {b.tech} with specializations & [m.tech] with
specialization science->total courses:?https://www.aec.edu.in is our college website!!! ...
our college is offering computer courses in btech with specializations mtech with
specialization sciencetotal courseshttpswwwaeceduin is our college website
our college is offering computer courses in btech with specializations mtech with
specialization sciencetotal courseshttpswwwaeceduin is our college website
##@@%%
Experiment : 2
Aim : Perform stemming of words and sentences using the Porter and Snowball stemmers in NLTK.
Program :
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer
nltk.download('punkt')

words = ["walking", "swimming", "computer", "computing", "language", "natural", "education", "easy", "irrational", "relation"]
Stemmed_ps = PorterStemmer()
Stemmer_ss = SnowballStemmer("english")

# Stem the word list with both stemmers
print("Porter stemmed words :", [Stemmed_ps.stem(w) for w in words])
print("Snowball Stemmed words :", [Stemmer_ss.stem(w) for w in words])

# Stem every token of the sample sentence with the Porter stemmer
# (the spelling mistakes are part of the original sample input)
Sentence = "I was wonder whwn I walk in Indian roads because everybody using computers to understand the language so they forget their mother language it is netural because people are edicted to computer it is irriting me."
token_words = nltk.word_tokenize(Sentence)
Stem_Sentence = [Stemmed_ps.stem(word) for word in token_words]
print("The porter stemmed sentence is :", Stem_Sentence)
Output :
Porter stemmed words : ['walk', 'swim', 'comput', 'comput', 'languag', 'natur', 'educ', 'easi',
'irrat', 'relat']
Snowball Stemmed words : ['walk', 'swim', 'comput', 'comput', 'languag', 'natur', 'educ',
'easi', 'irrat', 'relat']
The porter stemmed sentence is : ['I', 'wa', 'wonder', 'when', 'i', 'walk', 'in', 'indian', 'road',
'because', 'everybodi', 'use', 'compute', 'to', 'understand', 'the', 'language', 'so', 'they',
'forgot', 'their', 'mother', 'language', 'it', 'is', 'nature', 'became', 'peopl', 'are', 'edict', 'to',
'comput', 'it', 'is', 'irrit', 'me']
Experiment : 3
Aim : Demonstrate object standardization, such as replacing social media slang in a text.
Description : NLP is used for chatbots, summarization of articles or texts, language translation, and verbal descriptions. NLP includes steps such as pre-processing, entity extraction, and word frequency measurement. During noise removal, connector words such as "and", "or" and "but" are removed from the text.
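A minimal sketch of this connector-removal step, assuming plain Python and only the three connector words named above (the sample sentence is just for illustration):
import re

# Remove the connector words mentioned above ("and", "or", "but") from a sample sentence
connectors = ["and", "or", "but"]
text = "the course is short and practical but intensive"
pattern = r"\b(" + "|".join(connectors) + r")\b"
cleaned = re.sub(pattern, "", text)
cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse the double spaces left behind
print(cleaned)  # -> "the course is short practical intensive"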
Object Standardization : Text data often contains words or phrases that are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models. Examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed; the code below uses a dictionary lookup method to replace social media slang in a text. Other types of text preprocessing include encoding/decoding noise removal, grammar checking, and spelling correction.
A lookup dictionary can also normalize different forms of words from the same root, such as "I do, I do, I will do". Object standardization is a pre-processing technique that expands abbreviations such as "rt → retweet" and "dm → direct message".
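A minimal sketch of the regular-expression side of this idea; the acronym dictionary below is a hypothetical example, not part of the program that follows:
import re

# Hypothetical acronym dictionary; \b makes sure only whole words are replaced
acronyms = {"fyi": "for your information", "asap": "as soon as possible"}
text = "fyi please reply asap"
for short_form, full_form in acronyms.items():
    text = re.sub(r"\b" + re.escape(short_form) + r"\b", full_form, text)
print(text)  # -> "for your information please reply as soon as possible"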
After preprocessing, entity extraction and entity selection are performed. At this stage, the relevant topics are extracted from the text. One of the techniques used for topic modeling is Latent Dirichlet Allocation (LDA).
Program :
# Slang lookup dictionary (the abbreviations described above)
lookup_dict = {'rt': 'retweet', 'dm': 'direct message'}

def lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        # Replace the word if its lowercase form appears in the slang dictionary
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

print(lookup_dict.keys())
print(lookup_dict.values())
message = "RT this is a retweeted dm message tweet by Shivam Bansal"
print("Message :", message)
print("Standardized :", lookup_words(message))
Output :
Message : RT this is a retweeted dm message tweet by Shivam Bansal
Experiment : 4
Aim : Demonstrate part-of-speech (POS) tagging using NLTK.
Description : One of the more powerful aspects of the NLTK module is the part-of-speech tagging that it can do for you. This means labeling words in a sentence as nouns, adjectives, verbs, etc. Even more impressive, it also labels by tense, and more. Here's a list of the tags, what they mean, and some examples:
Part-of-speech (POS) tagging is just what it sounds like: the process goes through the
words in your corpus and tags them with metadata, indicating whether those words are
nouns, verbs, adjectives, etc.
Alphabetical list of part-of-speech tags used in the Penn Treebank Project (Number, Tag, Description):
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
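As a quick illustration of a few of these tags before the full program, a minimal sketch (it assumes the 'punkt' and 'averaged_perceptron_tagger' NLTK resources are available):
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Each word is paired with its Penn Treebank tag
print(nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog")))
# e.g. ('The', 'DT') is a determiner and ('quick', 'JJ') an adjective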
Program :
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

nltk.download('state_union')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Train a Punkt sentence tokenizer on the 2005 speech, then apply it to the 2006 speech
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        # POS-tag the first five sentences
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
Output :
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S",
'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'),
('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'),
('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'),
('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT',
'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney',
'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members',
'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'),
('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','),
('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'),
('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',',
','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'),
('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried',
'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'),
('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'),
('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'),
('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'),
('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'),
('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'),
('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'),
('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'),
(',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]
Experiment : 5
Aim : Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.
Description : Latent Dirichlet Allocation (LDA) is a topic model that generates topics based on word frequency from a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document set. It is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.
Process of LDA: the implementation uses three packages:
i. nltk, the natural language toolkit, used here for tokenizing and stemming the documents.
ii. stop_words, a Python package containing English stop words.
iii. gensim, a topic modeling package containing our LDA model.
Steps involved
1) Loading data
2) Data cleaning
3) Exploratory analysis
Program :
!pip install stop_words
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')   # English stop word list
p_stemmer = PorterStemmer()

# Sample document set
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

# Clean each document: lowercase, tokenize, remove stop words, stem
texts = []
for i in doc_set:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)

# Build the dictionary and the bag-of-words corpus, then train LDA with 2 topics
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
print(ldamodel.num_terms)
print(ldamodel.num_topics)
print(ldamodel.get_topics())
ldamodel.print_topics()
print(ldamodel.print_topics(num_topics=3, num_words=3))

# Retrain with 4 topics and show 8 words per topic
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)
print(ldamodel.print_topics(num_topics=4, num_words=8))
Output :
32
2
[[0.08554609 0.03569528 0.06107529 0.08554756 0.03660925 0.03570221
0.01230862 0.01231141 0.03600929 0.01231128 0.01230981 0.01231116
0.0123111 0.0366259 0.03662846 0.03662638 0.08556051 0.03662615
0.03662581 0.03646108 0.0366278 0.03662746 0.01228877 0.01228839
0.01228723 0.01228927 0.01228977 0.01228833 0.01228785 0.01228687
0.03661788 0.03661777]
[ 0.01359389 0.06842649 0.01358789 0.01359227 0.01357666 0.06841886
0.04030583 0.04030274 0.0680811 0.0403029 0.04030451 0.04030303
0.04030309 0.01355834 0.01355553 0.01355782 0.01357803 0.01355808
0.01355845 0.04066184 0.01355626 0.01355663 0.04032766 0.04032807
0.04032935 0.0403271 0.04032655 0.04032813 0.04032866 0.04032975
0.01356716 0.01356728]]
[(0, '0.086*"health" + 0.086*"good" + 0.086*"brocolli"'), (1, '0.068*"brother" +
0.068*"mother" + 0.068*"drive"')]
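The description above says that each document is a mixture over a set of topic probabilities; a minimal follow-up sketch that queries this mixture for the first document, reusing the ldamodel and corpus built in the program above:
# Topic mixture of the first document (doc_a): a list of (topic_id, probability) pairs
doc_topics = ldamodel.get_document_topics(corpus[0])
print(doc_topics)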
Experiment : 6
Aim : Demonstrate Term Frequency - Inverse Document Frequency (TF-IDF) using Python.
Program :
import pandas as pd
import numpy as np

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

# Build the vocabulary of the corpus
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))
print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)

columns_set = tuple(words_set)
n_docs = len(corpus)
n_words_set = len(words_set)

# Term frequency: count of the word in a document / total words in that document
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=columns_set)
for i in range(n_docs):
    words = corpus[i].split(' ')
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))

# Inverse document frequency: log10(number of documents / documents containing the word)
print("IDF of: ")
idf = {}
for w in words_set:
    k = 0
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
    idf[w] = np.log10(n_docs / k)
    print(f'{w:>15}: {idf[w]:>10}')

# TF-IDF = TF * IDF
df_tf_idf = df_tf.copy()
for w in words_set:
    for i in range(n_docs):
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]
df_tf_idf
Output :
Experiment : 7
Aim : Create word embeddings by training a word2vec model on a custom corpus using gensim.
Description : gensim is an open-source Python library for natural language processing; it was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek. The gensim library enables us to develop word embeddings by training our own word2vec models on a custom corpus, with either the CBOW or the skip-gram algorithm. After training the word2vec model, obtain the word embeddings from the trained model and finally print the model.
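A minimal sketch of choosing between the two training algorithms mentioned above; in gensim's Word2Vec this is controlled by the sg parameter (sg=0 for CBOW, the default, and sg=1 for skip-gram). The tiny corpus here is only for illustration:
from gensim.models import Word2Vec

toy_corpus = [['natural', 'language', 'processing'],
              ['word', 'embeddings', 'with', 'gensim']]

cbow_model = Word2Vec(toy_corpus, min_count=1, sg=0)       # CBOW
skipgram_model = Word2Vec(toy_corpus, min_count=1, sg=1)   # skip-gram

# Both models map every word in the vocabulary to a dense vector (100 dimensions by default)
print(cbow_model.wv['language'].shape)
print(skipgram_model.wv['language'].shape)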
Program :
from gensim.models import Word2Vec

# Toy training corpus: a list of tokenized sentences
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# Train the model (min_count=1 keeps words that occur only once)
model = Word2Vec(sentences, min_count=1)
print(model)

# Vocabulary and the embedding of one word
words = list(model.wv.key_to_index)
print(words)
print(model.wv['is'])

# Save the model and load it back
model.save('model.bin')
new_model = Word2Vec.load('model.bin')
print(new_model)
Output :
Experiment : 8
Aim : Implement text classification using a Naive Bayes classifier and the TextBlob library.
Description : Text classifiers are systems that classify your texts and divide them into different classes.
TextBlob is a Python library for processing textual data. It provides a consistent API for
diving into common natural language processing (NLP) tasks such as part-of-speech
tagging, noun phrase extraction, sentiment analysis, and more.
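Before the classifier itself, a minimal sketch of TextBlob's built-in sentiment analysis mentioned in the description (polarity ranges from -1 to +1, subjectivity from 0 to 1):
from textblob import TextBlob

# Sentiment of a single sentence using TextBlob's default analyzer
blob = TextBlob("This is an amazing library!")
print(blob.sentiment)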
Program :
!pip install textblob
import nltk
nltk.download('punkt')

# Small labeled training and test sets
train = [
    ('What an amazing weather.', 'pos'),
    ('this is an amazing idea!', 'pos'),
    ('I feel very good about these ideas.', 'pos'),
    ('this is my best performance.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this place', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with all this tension", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my friends is horrible.', 'neg')
]
test = [
    ('the food was great.', 'pos'),
    ('I do not want to live anymore', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Ramesh is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

# Train the Naive Bayes classifier and evaluate it on the test set
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)
print(cl.classify("This is an amazing library!"))
print(cl.accuracy(test))
print(cl.classify("my friends is tension"))
print(cl.accuracy(test))
cl.show_informative_features(4)
Output :
Experiment : 9
Aim : Demonstrate classification using a Support Vector Machine (SVM) with a linear kernel on the Iris dataset.
Program :
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# Load the Iris data and keep only the first two features so the result can be plotted
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# Fit a linear-kernel SVM
svc = svm.SVC(kernel='linear', C=1, gamma='auto').fit(X, y)

# Build a mesh over the feature space to draw the decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()
Output :
Experiment : 10
Aim : Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.
Description : Cosine similarity measures the closeness of two documents by the cosine of the angle between their vector representations. The underlying dot product is also called the scalar product because the dot product of two vectors gives a scalar result.
For example, for A = (5, 0, 2) and B = (2, 5, 0):
A . B = 5*2 + 0*5 + 2*0 = 10 + 0 + 0 = 10
The smaller the angle between two document vectors, the more similar the documents; the cosine of the angle increases as the angle decreases, since cos 0 = 1 and cos 90 = 0.
To calculate the cosine similarity between documents, first convert the documents/sentences/words into feature vectors. Useful methods for feature extraction are i) Bag of Words and ii) TF-IDF.
Bag of Words counts the unique words in the documents and the frequency of each word; scikit-learn's CountVectorizer extracts the Bag of Words features.
The TF-IDF score of a word ranks its importance in a document: tfidf(w) = tf(w) * idf(w), where tf(w) = (number of times the word appears in a document) / (total number of words in the document) and idf(w) = log(number of documents / number of documents that contain the word w). Use scikit-learn's cosine_similarity function to compare the first document, i.e. Document 0, with the other documents in the corpus.
The result gives the cosine similarities of Document 0 compared with the other documents in the corpus. The first element of the array is 1, because Document 0 is compared with itself; the remaining elements, e.g. 0.08619387, 0, 0, are the similarities of Document 0 compared with Documents 1, 2 and 3.
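A minimal worked sketch of the cosine formula using NumPy and the vectors A and B from the dot-product example above:
import numpy as np

A = np.array([5, 0, 2])
B = np.array([2, 5, 0])

# cosine similarity = (A . B) / (|A| * |B|)
print(np.dot(A, B))                                            # 10
print(np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)))  # about 0.345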
Program :
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

count_vect = CountVectorizer()

Document1 = "Aditya Engineering College situated at Surampalem"
Document2 = "Engineering Colleges offer computer science courses in MCA AIML CSE IT departments"
Document3 = "Computer science students have opportunities in IT sector"
Document4 = "IT sector hire students with skills in computer science"
corpus = [Document1, Document2, Document3, Document4]

# Bag of Words counts
X_train_counts = count_vect.fit_transform(corpus)
df1 = pd.DataFrame(X_train_counts.toarray(), columns=count_vect.get_feature_names_out(),
                   index=['Document 0', 'Document 1', 'Document 2', 'Document 3'])

# TF-IDF features
vectorizer = TfidfVectorizer()
trsfm = vectorizer.fit_transform(corpus)
df2 = pd.DataFrame(trsfm.toarray(), columns=vectorizer.get_feature_names_out(),
                   index=['Document 0', 'Document 1', 'Document 2', 'Document 3'])

print(df1)
print(df2)

# Cosine similarity of Document 0 with every document in the corpus
cosine_similarity(trsfm[0:1], trsfm)
Output :