0% found this document useful (0 votes)

13 views28 pages

Experiment: 1

nlp record

Uploaded by

pareshkumar3108

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views28 pages

Experiment: 1

nlp record

Uploaded by

pareshkumar3108

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 28

Experiment : 1

Aim : Determine noise removal for any textual data and remove regular expression
pattern such as hash tag from textual data.

Description : Test cleaning or Test preprocessing or data preprocessing .A text is

unstructed data may be full of inconsistencies and ambiguity .Test preprocessing is a
method in NLP that involves cleaning text data and making it ready for model building .
A raw text (text corpus) button variable text data collected from one or many sources such
as websites,spoken languages and voice recognition system may contain various words
with the wrong spelling,short words,special symbols,emojis etc.

Program :

import string

import re

input_text="\t #sample <HTML><H1>#Greetings our college is offering computer

courses in {B.Tech} with 3 specializations & [M.Tech] with 1 specialization #computer
Science->total courses:?https://www.aec.edu.in is our college website!!! <H1>...\t"

input_text=input_text.lower()

print(input_text)

input_text=re.sub(r'\d+','',input_text)

print(input_text)

input_text=input_text.strip()

print(input_text)
html_pattern=re.compile('<.*?>')

input_text=html_pattern.sub(r'',input_text)

print(input_text)

url_pattern=re.compile(r'https?://s+\www\s+')

input_text=url_pattern.sub(r'',input_text)

print(input_text)

hash_pattern=re.compile(r'#[a-z]+')

input_text=hash_pattern.sub(r'',input_text)

print(input_text)

for punc in string.punctuation:

if punc in input_text:

input_text=input_text.replace(punc,'')

print(input_text)

x=re.findall("is our\s+\s+",input_text)

input_text=re.sub("is our\s+\s+","",input_text)

print(input_text)

tweet="""If you hold an empty #gaterode #battle up to your ear@@you can hear the
sports oo%%"""

x=re.sub('[a-z A-Z 0-9]+','',tweet)

print(x)
Output :

#sample <html><h1>#greetings our college is offering computer courses in {b.tech} with

3 specializations & [m.tech] with 1 specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! <h1>...

#sample <html><h>#greetings our college is offering computer courses in

{b.tech} with specializations & [m.tech] with specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! <h>...

#sample <html><h>#greetings our college is offering computer courses in {b.tech} with

specializations & [m.tech] with specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! <h>...

#sample #greetings our college is offering computer courses in {b.tech} with

specializations & [m.tech] with specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! ...

#sample #greetings our college is offering computer courses in {b.tech} with

specializations & [m.tech] with specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! ...

our college is offering computer courses in {b.tech} with specializations & [m.tech] with
specialization science->total courses:?https://www.aec.edu.in is our college website!!! ...

our college is offering computer courses in btech with specializations mtech with
specialization sciencetotal courseshttpswwwaeceduin is our college website
our college is offering computer courses in btech with specializations mtech with
specialization sciencetotal courseshttpswwwaeceduin is our college website
##@@%%
Experiment : 2

Aim : Perform lemmatization and streaming using python library nltk.

Description : Stemming and Lemmatization in Python NLTK are text normalization

techniques for Natural Language Processing. These techniques are widely used for text
preprocessing. The difference between stemming and lemmatization is that stemming is
faster as it cuts words without knowing the context, while lemmatization is slower as it
knows the context of words before processing.

Stemming is a method of normalization of words in Natural Language Processing. It is a

technique in which a set of words in a sentence are converted into a sequence to shorten
its lookup. In this method, the words having the same meaning but have some variations
according to the context or sentence are normalized.
In another word, there is one root word, but there are many variations of the same
words. For example, the root word is “eat” and it’s variations are “eats, eating, eaten
and like so”. In the same way, with the help of Stemming in Python, we can find the
root word of any variations.

Program :

from nltk.stem import PorterStemmer

from nltk.stem import SnowballStemmer

import nltk

words=["walking","swimming","computer","computing","language","natural","educatio
n","easy","irrational","relation"]
Stemmed_ps=PorterStemmer()

Stemmed_words_ps=[Stemmed_ps.stem(word)for word in words]

print("Porter stemmed words : ",Stemmed_words_ps)

Stemmer_ss=SnowballStemmer("english")

Stemmed_words_ss=[Stemmer_ss.stem(word)for word in words]

print("Snowball Stemmed words : ",Stemmed_words_ss)

Sentence="I was wonder whwn I walk in Indian roads because everybody using
computers to understand the language so they forget their mother language it is netural
because people are edicted to computer it is irriting me."

token_words=nltk.word_tokenize(Sentence)

Stem_Sentence=[]

for word in token_words:

Stem_Sentence.append(Stemmed_ps.Stem(word))

print("The stemmed sentence is : ",Stem_Sentence)

Output :

Porter stemmed words : ['walk', 'swim', 'comput', 'comput', 'languag', 'natur', 'educ', 'easi',
'irrat', 'relat']

Snowball Stemmed words : ['walk', 'swim', 'comput', 'comput', 'languag', 'natur', 'educ',
'easi', 'irrat', 'relat']

The porter stemmed sentence is : ['I', 'wa', 'wonder', 'when', 'i', 'walk', 'in', 'indian', 'road',
'because', 'everybodi', 'use', 'compute', 'to', 'understand', 'the', 'language', 'so', 'they',
'forgot', 'their', 'mother', 'language', 'it', 'is', 'nature', 'became', 'peopl', 'are', 'edict', 'to',
'comput', 'it', 'is', 'irrit', 'me']
Experiment : 3

Aim : Demonstrate object standardization such as replace social media slangs from a
text.

Description : NLP is used for chat bots, summaries of articles or texts, language
translation, and verbal view description. NLP includes steps such as pre-processing,
entity extraction, word frequency measurements. With noise reduction, operations are
performed on connectors such as “and, or, but”.
Object Standardization : Text data often contains words or phrases which are not
present in any standard lexical dictionaries. These pieces are not recognized by search
engines and models. Examples – acronyms, hashtags with attached words, and
colloquial slangs. With the help of regular expressions and manually prepared data
dictionaries, this type of noise can be fixed, the code below uses a dictionary lookup
method to replace social media slangs from a text. Other types of text preprocessing
includes encoding-decoding noise, grammar checker, and spelling correction
The dictionary contains the normalization process of the words from the same root, such
as “I do, I do, I will do” normalization. Object standardization is pre-processing
techniques that can be done on abbreviations such as “rt → retweet, dm → direct
message”.
After preprocessing, entity extraction and entity selection are performed at this stage. At
this stage, the relevant topic is removed from the text. One of the techniques used is Latent
Dirichlet Allocation for Topic Modeling (LDA)
Program :

lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love",

"...": " "}

def lookup_words(input_text):

words = input_text.split()

new_words = []

for word in words:

if word.lower() in lookup_dict:

word = lookup_dict[word.lower()]

new_words.append(word)

new_text = " ".join(new_words)

return new_text

print("Message : RT this is a retweeted dm message tweet by Shivam Bansal")

print("Converted message : ",lookup_words("RT this is a retweeted dm message tweet by

Shivam Bansal"))

print(lookup_dict.keys())

print(lookup_dict.values())

Output :
Message : RT this is a retweeted dm message tweet by Shivam Bansal

Converted message : Retweet this is a retweeted direct message message tweet by

Shivam Bansal

dict_keys(['rt', 'dm', 'awsm', 'luv', '...'])

dict_values(['Retweet', 'direct message', 'awesome', 'love', ' '])

Experiment : 4

Aim : Perform part of speech tagging on any textual data.

Description : One of the more powerful aspects of the NLTK module is the Part of
Speech tagging that it can do for you. This means labeling words in a sentence as nouns,
adjectives, verbs...etc. Even more impressive, it also labels by tense, and more. Here's a
list of the tags, what they mean, and some examples:
Part-of-speech (POS) tagging is just what it sounds like: the process goes through the
words in your corpus and tags them with metadata, indicating whether those words are
nouns, verbs, adjectives, etc.

Alphabetical list of part-of-speech tags used in the Penn Treebank Project: Number Tag
Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative

sentence tokenizer, PunktSentenceTokenizer is capable of unsupervised machine

learning, train it on any body of text that you use.
create training and testing data
Data Set used i) State of the Union address from 2005 ii) State of the Union address from
2006 of President George W. Bush.
Train the Punkt tokenizer
Finish up part of speech tagging script by creating a function that will run through and
tag all of the parts of speech per sentence

Program :
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
nltk.download('state_union')
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
def process_content():
try:
for i in tokenized[:5]:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
print(tagged)
except Exception as e:
print(str(e))
process_content()

Output :
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S",
'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'),
('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'),
('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'),
('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT',
'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney',
'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members',
'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'),
('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','),
('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'),
('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',',
','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'),
('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried',
'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'),
('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'),
('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'),
('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'),
('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'),
('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'),
('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'),
('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'),
(',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]
Experiment : 5

Aim : Implement topic modeling using Latent Dirichlet Allocation (LDA ) in python.

Description : Latent Dirichlet allocation (LDA) is a topic model that generates topics
based on word frequency from a set of documents. LDA is particularly useful for finding
reasonably accurate mixtures of topics within a given document set. It is a generative
probabilistic model that assumes each topic is a mixture over an underlying set of words,
and each document is a mixture of over a set of topic probabilities.
Process of LDA:

Input : M number of documents, N number of words, K number of topics.

The model trains to output:
psi -distribution of words for each topic K
phi - the distribution of topics for each document i
Required Python packages:
i i. NLTK(Natural language toolkit)

ii ii. stop_words Python package containing stop words

iii iii. gensim, a topic modeling package containing our LDA model.

Steps involved
1) Loading data
2) Data cleaning

3) Exploratory analysis

4) Preparing data for LDA analysis

5) LDA model training

6) Analyzing LDA model results

Data Cleaning methods :

Tokenizing: converting a document to its atomic elements.
Stopping: removing meaningless words.
Stemming: merging words that are equivalent in meaning.
Constructing a document-term matrix : The result of cleaning stage is texts, a tokenized,
stopped and stemmed list of words from a single document. we looped through all our
documents and appended each one to texts. So now texts is a list of lists, one list for each of
our original documents. To generate an LDA model, we need to understand how frequently
each term occurs within each document.
Construct a document-term matrix with a package called genism : The Dictionary() function
traverses texts, assigning a unique integer id to each unique token while also collecting
word counts and relevant statistics. dictionary must be converted into a bag-of-words:
Applying the LDA model : corpus is a document-term matrix and now we’re ready to
generate an LDA model: The LdaModel class is described in detail in the gensim
documentation. Parameters used in our example:
Parameters:
num_topics: required. An LDA model requires the user to determine how many topics
should be generated. Our document set is small, so we’re only asking for three topics.
id2word: required. The LdaModel class requires our previous dictionary to map ids to
strings.
passes: optional. The number of laps the model will take through corpus. The greater the
number of passes, the more accurate the model will be. A lot of passes can be slow on a
very large corpus.
Examining the results LDA model is now stored as ldamodel with the print_topic and
print_topics methods

Program :
!pip install stop_words
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood
pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to
drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
texts = []
for i in doc_set:
raw = i.lower()
tokens = tokenizer.tokenize(raw)
stopped_tokens = [i for i in tokens if not i in en_stop]
stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
texts.append(stemmed_tokens)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word =
dictionary, passes=20)
print(ldamodel.num_terms)
print(ldamodel.num_topics)
print(ldamodel.get_topics())
ldamodel.print_topics()
print(ldamodel.print_topics(num_topics=3, num_words=3))
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word =
dictionary, passes=20)
print(ldamodel.print_topics(num_topics=4, num_words=8))

Output :

32
2
[[0.08554609 0.03569528 0.06107529 0.08554756 0.03660925 0.03570221
0.01230862 0.01231141 0.03600929 0.01231128 0.01230981 0.01231116
0.0123111 0.0366259 0.03662846 0.03662638 0.08556051 0.03662615
0.03662581 0.03646108 0.0366278 0.03662746 0.01228877 0.01228839
0.01228723 0.01228927 0.01228977 0.01228833 0.01228785 0.01228687
0.03661788 0.03661777]
[ 0.01359389 0.06842649 0.01358789 0.01359227 0.01357666 0.06841886
0.04030583 0.04030274 0.0680811 0.0403029 0.04030451 0.04030303
0.04030309 0.01355834 0.01355553 0.01355782 0.01357803 0.01355808
0.01355845 0.04066184 0.01355626 0.01355663 0.04032766 0.04032807
0.04032935 0.0403271 0.04032655 0.04032813 0.04032866 0.04032975
0.01356716 0.01356728]]
[(0, '0.086*"health" + 0.086*"good" + 0.086*"brocolli"'), (1, '0.068*"brother" +
0.068*"mother" + 0.068*"drive"')]

[(0, '0.135"health" + 0.052"expert" + 0.052"may" + 0.052"suggest" + 0.052*"caus"

+ 0.052*"tension" + 0.052*"increas" + 0.052*"blood"'), (1, '0.063*"drive" +
0.063*"pressur" + 0.062*"never" + 0.062*"seem" + 0.062*"often" + 0.062*"well" +
0.062*"feel" + 0.062*"better"'), (2, '0.031*"drive" + 0.031*"brocolli" + 0.031*"good" +
0.031*"brother" + 0.031*"mother" + 0.031*"profession" + 0.031*"say" +
0.031*"pressur"'), (3, '0.087*"brocolli" + 0.087*"good" + 0.087*"mother" +
0.087*"brother" + 0.087*"eat" + 0.048*"like" + 0.048*"around" + 0.048*"basebal"')]
Experiment : 6

Aim : Demonstrate Term Frequency – Inverse Document Frequency (TF – IDF) using
python

Description : It is a widely used statistical method in natural language processing and

information retrieval. It measures how important a term is within a document relative to a
collection of documents. Words within a text document are transformed into importance numbers
by a text vectorization process. TF-IDF vectorizes/scores a word by multiplying the word’s Term
Frequency (TF) with the Inverse Document Frequency (IDF).

TF-IDF is useful in many natural language processing applications.

1. Search Engines used to rank the relevance of a document for a query.
2. Text classification, text summarization, and topic modelling.
Example

Program :
import pandas as pd
import numpy as np
corpus = ['data science is one of the most important fields of science','this is one of the
best data science courses','data scientists analyze data' ]
words_set = set()
for doc in corpus:
words = doc.split(' ')
words_set = words_set.union(set(words))
print('Number of words in the corpus:',len(words_set))
print('The words in the corpus: \n', words_set)
columns_set = tuple(words_set)
n_docs = len(corpus)
n_words_set = len(words_set)
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=columns_set)

for i in range(n_docs):
words = corpus[i].split(' ')
for w in words:
df_tf[w][i] = df_tf[w][i] + (1 / len(words))
print("IDF of: ")
idf = {}
for w in words_set:
k=0
for i in range(n_docs):
if w in corpus[i].split():
k += 1
idf[w] = np.log10(n_docs / k)
print(f'{w:>15}: {idf[w]:>10}' )
df_tf_idf = df_tf.copy()
for w in words_set:
for i in range(n_docs):
df_tf_idf[w][i] = df_tf[w][i] * idf[w]
df_tf_idf

Output :
Experiment : 7

Aim : Demonstrate word embeddings using word2vec

Description : Word embedding is one of the most important techniques in natural

language processing(NLP), where words are mapped to vectors of real numbers. Word
embedding is capable of capturing the meaning of a word in a document, semantic and
syntactic similarity, relation with other words. It also has been widely used for
recommender systems and text classification. Word2vec is one of the most popular
technique to learn word embeddings using a two-layer neural network. Its input is a text
corpus and its output is a set of vectors. Word embedding via word2vec can make natural
language computer-readable, then further implementation of mathematical operations on
words can be used to detect their similarities. A well-trained set of word vectors will place
similar words close to each other in that space. For instance, the words women, men, and
human might cluster in one corner, while yellow, red and blue cluster together in another.

gensim is an open source python library for natural language processing and it was
developed and is maintained by the Czech natural language processing researcher
Gensim library will enable us to develop word embeddings by training our own word2vec
models on a custom corpus either with CBOW of skip-grams algorithms.

Train the genism word2vec model

model = Word2Vec(sent, min_count=1,size= 50,workers=3, window =3, sg = 1)

The hyperparameters of this model.

size: The number of dimensions of the embeddings and the default is 100.
window: The maximum distance between a target word and words around the target word.
The default window is 5.
min_count: The minimum count of words to consider when training the model; words
with occurrence less than this count will be ignored. The default for min_count is 5.
workers: The number of partitions during training and the default workers is 3.
sg: The training algorithm, either CBOW(0) or skip gram(1). The default training
algorithm is
CBOW.

After training the word2vec model, obtain the word embedding from the training model.
Finally print the model.
Program :
from gensim.models import Word2Vec
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],['this', 'is', 'the', 'second',
'sentence'],['yet', 'another', 'sentence'],['one', 'more', 'sentence'],['and', 'the', 'final',
'sentence']]
model = Word2Vec(sentences, min_count=1)
print(model)
words = list(model.wv.key_to_index)
print(words)
print(model.wv['is'])
model.save('model.bin')
new_model = Word2Vec.load('model.bin')
print(new_model)

Output :
Experiment : 8

Aim : Implement Text classification using naïve bayes classifier and text blob library.

Description : Text classifier are systems that classify your texts and divide them in
different classes.

TextBlob is a Python library for processing textual data. It provides a consistent API for
diving into common natural language processing (NLP) tasks such as part-of-speech
tagging, noun phrase extraction, sentiment analysis, and more.

Step -1 Installing textblob.

Step-2 Download the data files that textblob uses for its functionality and for nltk.
Step-3 Traine the classifier based on Naive Bayes Classifier.
Step-4 Test the data using classifier to get your text classified.
Step-5 Calculat the accuracy of the classifier.

Program :
!pip install textblob
!pip3 install textblob
import nltk
nltk.download('punkt')
train = [
('Whatanamazing weather.', 'pos'), ('this is an amazing idea!', 'pos'),
('I feelvery good about these ideas.', 'pos'), ('this is my best performance.', 'pos'), ("what
an awesome view", 'pos'),
('I do not like this place', 'neg'), ('Iamtired ofthisstuff.','neg'),
("I can'tdealwith all this tension", 'neg'), ('he is my sworn enemy!', 'neg'),
('myfriendsishorrible.','neg')
]
test =[
('the foodwasgreat.','pos'),
('I do notwant to liveanymore', 'neg'), ("I ain't feeling dandy today.", 'neg'), ("I feel
amazing!", 'pos'),
('Ramesh is a friend of mine.', 'pos'), ("I can't believe I'm doing this.", 'neg')
]
from textblob.classifiers import NaiveBayesClassifier
cl=NaiveBayesClassifier(train)
print(cl.classify("Thisis an amazing library!"))
print(cl.accuracy(test))
print(cl.classify("my friends is tension"))
print(cl.accuracy(test))
cl.show_informative_features(4)

Output :
Experiment : 9

Aim : Apply support vector machine for text classification.

Description : Support Vector Machine” (SVM) is a supervised machine learning algorithm

that can be used for both classification or regression challenges. However, it is mostly used in
classificationproblems. In theSVMalgorithm,weploteach dataitemasapointin n-dimensionalspace
(where n is a number of features you have)with the value of each feature being the value of a
particular coordinate. Then, we perform classification by finding the hyper-plane that
differentiates the two classes very well. Support Vectors are the coordinates of individual
observation. The SVM classifier is a frontier that best segregates the two classes (hyper-plane/
line).

Program :
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
svc = svm.SVC(kernel='linear', C=1,gamma='auto').fit(X, y)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min)/100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()

svc = svm.SVC(kernel='rbf', C=1,gamma='auto').fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)

plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()

Output :
Experiment : 10

Aim : Convert text to vectors (using term frequency) and apply cosine similarity to
provide closeness among two texts.

Description : Cosine similarity is a measure of similarity between two non-zero vectors of an

inner product space that “measures the cosine of the angle between them” Cosine Similarity tends
to determine how similar two words or sentence are, It can be used for Sentiment Analysis, Text
Comparison and being used by lot of popular packages out there like word2vec.

This is also called as Scalar product because the dot product of two vectors gives a scalar result.
For Example,

Vector(A) = [5,0,2] Vector(B) = [2,5,0]

The dot product vector(A) DOT vector(B) = 5*2+0*5+2*0=10+0+0 =10 The documents are
similar lesser the angle between them and Cosine of Angle increase as the value of angle decreases
since Cos 0 =1 and Cos 90 = 0

First step calculate the cosine similarity between the documents. Convert the
documents/Sentences/words in a form of feature vector first. Useful Methods for feature
extraction i) Bag of Words ii) TF-IDF.

Bag of Words counts the unique words in documents and frequency of each of the words. Scikit
learn Countvectorizer extract the Bag of Words Features.

TF-IDF score of a word to rank it’s importance in a document tfidf score of a word w =
tf(w)*idf(w) tf(w) = Number of times the word appears in a document/Total number of words in
the document.

idf(w) = Number of documents/Number of documents that contains word w se Scikit learn Cosine
Similarity function to compare the first document i.e. Document 0 with the other Documents in
Corpus.

Cosine Similarities of the document 0 compared with other documents in the corpus. The first
element in array is 1 which means Document 0 is compared with Document 0 and second element
0.08619387, 0,0 where Document 0 is compared with Document 1,2,3.
Program :
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
count_vect = CountVectorizer()
Document1= "Aditya Engineering College situated at Surampalem"
Document2= "Engineering Colleges offer computer science courses in MCA AIML CSE
IT departments"
Document3= "Computer science students have opprtunities in IT sector"
Document4= "IT sector hire students with skills in computer science"
corpus = [Document1,Document2,Document3,Document4]
X_train_counts = count_vect.fit_transform(corpus)
df1=pd.DataFrame(X_train_counts.toarray(),columns=count_vect.get_feature_names_o
ut(),index=['Document 0','Document 1','Document2','Document3'])
vectorizer = TfidfVectorizer()
trsfm=vectorizer.fit_transform(corpus)
df2=pd.DataFrame(trsfm.toarray(),columns=vectorizer.get_feature_names_out(),index=
['Document 0','Document 1','Document 2','Document 3'])
print(df1)
print(df2)
cosine_similarity(trsfm[0:3], trsfm)
Output :

Web and Social Media Analytics Lab
No ratings yet
Web and Social Media Analytics Lab
34 pages
Industrial Automation Iti
No ratings yet
Industrial Automation Iti
6 pages
NLP Lab Manual (R20)
50% (2)
NLP Lab Manual (R20)
24 pages
Approaching Almost Any NLP
No ratings yet
Approaching Almost Any NLP
118 pages
NLP Lab Manual
No ratings yet
NLP Lab Manual
38 pages
Office 2016 Group Policy and Oct Settings
No ratings yet
Office 2016 Group Policy and Oct Settings
2,024 pages
NLP Manual (1-12) 1
No ratings yet
NLP Manual (1-12) 1
56 pages
Introduction To BOL Programming
No ratings yet
Introduction To BOL Programming
17 pages
NLP-Lab Manual - Ashwini - Kachare
No ratings yet
NLP-Lab Manual - Ashwini - Kachare
41 pages
Ar20 - Aiml 1 To Viii Syllabus - 02.09.2023
No ratings yet
Ar20 - Aiml 1 To Viii Syllabus - 02.09.2023
460 pages
A Physical Design Verification Framework For Superconducting Electronics
No ratings yet
A Physical Design Verification Framework For Superconducting Electronics
198 pages
CSC 102 Lecture 1
No ratings yet
CSC 102 Lecture 1
29 pages
NLP Pipeline
No ratings yet
NLP Pipeline
50 pages
Casestudy HR and Ai
No ratings yet
Casestudy HR and Ai
7 pages
Natural Language Processing
No ratings yet
Natural Language Processing
17 pages
Lecture 8 - Text Analytics NLP
No ratings yet
Lecture 8 - Text Analytics NLP
24 pages
Lab 2
No ratings yet
Lab 2
49 pages
CCS369-Text and Speech Analysis Lab (1-9)
No ratings yet
CCS369-Text and Speech Analysis Lab (1-9)
37 pages
SupremaDM V1.01
0% (1)
SupremaDM V1.01
52 pages
NLP - Exp 1 11
No ratings yet
NLP - Exp 1 11
29 pages
Big-Ip Dns (Previously GTM) : F5 Partner Technical Boot Camp
No ratings yet
Big-Ip Dns (Previously GTM) : F5 Partner Technical Boot Camp
45 pages
4.twitter Extraction and Analytics
No ratings yet
4.twitter Extraction and Analytics
45 pages
Ir Manual
No ratings yet
Ir Manual
53 pages
Wsma Final Manual
No ratings yet
Wsma Final Manual
58 pages
MoA AoA Amended PDF
No ratings yet
MoA AoA Amended PDF
185 pages
NLB Final Lab Manual
No ratings yet
NLB Final Lab Manual
23 pages
NLP Lab - Manual
No ratings yet
NLP Lab - Manual
33 pages
DSBA+Master+Codebook+ +Text+Mining+&+TSF
No ratings yet
DSBA+Master+Codebook+ +Text+Mining+&+TSF
11 pages
LP Vi Manual
No ratings yet
LP Vi Manual
77 pages
Reader Configuration Codes
No ratings yet
Reader Configuration Codes
82 pages
NLP Lecture2 Text Pre Processing
No ratings yet
NLP Lecture2 Text Pre Processing
54 pages
MIPS Assembly - Stack
No ratings yet
MIPS Assembly - Stack
24 pages
Identifing of Fake Profiles Across Online Social Networks by Using Neural Network
No ratings yet
Identifing of Fake Profiles Across Online Social Networks by Using Neural Network
58 pages
NLP Manual (1-12)
No ratings yet
NLP Manual (1-12)
55 pages
3.Nlp Lab Manual
No ratings yet
3.Nlp Lab Manual
18 pages
Lab - Manual - IR - BE AI&DS CL II
No ratings yet
Lab - Manual - IR - BE AI&DS CL II
38 pages
CCS369 - Text and Speech Analysis
No ratings yet
CCS369 - Text and Speech Analysis
31 pages
NLP Manual (1-12)
No ratings yet
NLP Manual (1-12)
54 pages
Natural Language Processing
No ratings yet
Natural Language Processing
25 pages
Rajeev Mishra 20 SCSE1180087
No ratings yet
Rajeev Mishra 20 SCSE1180087
29 pages
Text Preprocessing For NLP
No ratings yet
Text Preprocessing For NLP
15 pages
NLPPractical
No ratings yet
NLPPractical
12 pages
Unit 5 Machine Learning
No ratings yet
Unit 5 Machine Learning
9 pages
Handwritten Telugu Character Recognition Using Machine Learning
No ratings yet
Handwritten Telugu Character Recognition Using Machine Learning
6 pages
Natural Language Processing: Practical 1
No ratings yet
Natural Language Processing: Practical 1
64 pages
NLP Record
No ratings yet
NLP Record
16 pages
Aiml P4
No ratings yet
Aiml P4
12 pages
NLP Lab Manual
No ratings yet
NLP Lab Manual
17 pages
ANLP semVI Labmanual
No ratings yet
ANLP semVI Labmanual
33 pages
NLP Lab Manual
No ratings yet
NLP Lab Manual
15 pages
UBC Summer School in NLP - VSP 2019 Lecture 10
No ratings yet
UBC Summer School in NLP - VSP 2019 Lecture 10
33 pages
Chapter 8 Revision
No ratings yet
Chapter 8 Revision
15 pages
NLP Exp-123
No ratings yet
NLP Exp-123
6 pages
SK NLP Practical (FS)
No ratings yet
SK NLP Practical (FS)
22 pages
Bling
No ratings yet
Bling
7 pages
Unraveling The Power of Natural Language Processing
No ratings yet
Unraveling The Power of Natural Language Processing
11 pages
Text Analysis With NLTK Cheatsheet PDF
No ratings yet
Text Analysis With NLTK Cheatsheet PDF
3 pages
Darkweb Monitoring Report
No ratings yet
Darkweb Monitoring Report
6 pages
NLP Manual
No ratings yet
NLP Manual
15 pages
iOS Unit Testing and UI Testing Tutorial
No ratings yet
iOS Unit Testing and UI Testing Tutorial
24 pages
NLP Record
No ratings yet
NLP Record
15 pages
Lab Manual
No ratings yet
Lab Manual
10 pages
CSDM2-Text Preprocessing For NL Data - 011050
No ratings yet
CSDM2-Text Preprocessing For NL Data - 011050
6 pages
Constraints
No ratings yet
Constraints
5 pages
Lab2 IR
No ratings yet
Lab2 IR
16 pages
Es PPL 11112004
No ratings yet
Es PPL 11112004
33 pages
Text Analysis With NLTK Cheatsheet PDF
No ratings yet
Text Analysis With NLTK Cheatsheet PDF
3 pages
Lab3 IR BIM
No ratings yet
Lab3 IR BIM
14 pages
Ir Lab 2 Ir Learning Outcomes: Pyterrier
No ratings yet
Ir Lab 2 Ir Learning Outcomes: Pyterrier
7 pages
DBMS 3
No ratings yet
DBMS 3
10 pages
Lab1 IR
No ratings yet
Lab1 IR
14 pages
DS-2CD1043G2 Datasheet V5.7.1 20230425
No ratings yet
DS-2CD1043G2 Datasheet V5.7.1 20230425
5 pages
Experiment 3 Manual
No ratings yet
Experiment 3 Manual
7 pages
Track Schedule (ICRTSET 2025)
No ratings yet
Track Schedule (ICRTSET 2025)
3 pages
NLP CT1
No ratings yet
NLP CT1
6 pages
Curriculum Vitae Ms. Asenaca Leleasiga Kubu Wotta
No ratings yet
Curriculum Vitae Ms. Asenaca Leleasiga Kubu Wotta
5 pages
International Indian School, Riyadh
No ratings yet
International Indian School, Riyadh
1 page
Unit 5
No ratings yet
Unit 5
4 pages
VT Secondary Injection Format
No ratings yet
VT Secondary Injection Format
3 pages
Clint-Roy Muvirimi-Mukarakate H1802386 AI Practical Assignment
No ratings yet
Clint-Roy Muvirimi-Mukarakate H1802386 AI Practical Assignment
8 pages
DBMS 6
No ratings yet
DBMS 6
4 pages
NLTK
No ratings yet
NLTK
3 pages
Optima XR200amx: Mobile Digital-Ready Radiographic System
No ratings yet
Optima XR200amx: Mobile Digital-Ready Radiographic System
4 pages
Creating A Consistent Layout in ASP - Net Web Pages (Razor) Sites - The ASP
No ratings yet
Creating A Consistent Layout in ASP - Net Web Pages (Razor) Sites - The ASP
17 pages
How To Proceed With Troubleshooting: Can Communication - Can Communication System
No ratings yet
How To Proceed With Troubleshooting: Can Communication - Can Communication System
3 pages
Proiect 6. Manipulating Arrays
No ratings yet
Proiect 6. Manipulating Arrays
6 pages
Set A: Test Code: JS-X-62-17
No ratings yet
Set A: Test Code: JS-X-62-17
2 pages
Text Analysis With NLTK Cheatsheet
No ratings yet
Text Analysis With NLTK Cheatsheet
3 pages
To For "Passport Mela" With Of: On Their
No ratings yet
To For "Passport Mela" With Of: On Their
2 pages
10122me703 Computer Integrated Manufacturing
No ratings yet
10122me703 Computer Integrated Manufacturing
2 pages
Chapter-1. Introduction To Communication Systems:-: (April-2010) (07) (2.1 & 2.2)
No ratings yet
Chapter-1. Introduction To Communication Systems:-: (April-2010) (07) (2.1 & 2.2)
6 pages
Modeling techniques in programming and algorithms
From Everand
Modeling techniques in programming and algorithms
Dougglas Hurtado Carmona
No ratings yet
Absolute Beginner's Python Programming: The Illustrated Guide to Learning Computer Programming
From Everand
Absolute Beginner's Python Programming: The Illustrated Guide to Learning Computer Programming
Kevin Wilson
1/5 (1)
Coding for beginners The basic syntax and structure of coding
From Everand
Coding for beginners The basic syntax and structure of coding
Diamond Moore
No ratings yet

Experiment: 1

Uploaded by

Experiment: 1

Uploaded by

Experiment : 1

Description : Test cleaning or Test preprocessing or data preprocessing .A text is

input_text="\t #sample <HTML><H1>#Greetings our college is offering computer

for punc in string.punctuation:

x=re.sub('[a-z A-Z 0-9]+','',tweet)

#sample <html><h1>#greetings our college is offering computer courses in {b.tech} with

#sample <html><h>#greetings our college is offering computer courses in

#sample <html><h>#greetings our college is offering computer courses in {b.tech} with

#sample #greetings our college is offering computer courses in {b.tech} with

#sample #greetings our college is offering computer courses in {b.tech} with

Aim : Perform lemmatization and streaming using python library nltk.

Description : Stemming and Lemmatization in Python NLTK are text normalization

Stemming is a method of normalization of words in Natural Language Processing. It is a

from nltk.stem import PorterStemmer

from nltk.stem import SnowballStemmer

Stemmed_words_ps=[Stemmed_ps.stem(word)for word in words]

print("Porter stemmed words : ",Stemmed_words_ps)

Stemmed_words_ss=[Stemmer_ss.stem(word)for word in words]

print("Snowball Stemmed words : ",Stemmed_words_ss)

for word in token_words:

print("The stemmed sentence is : ",Stem_Sentence)

lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love",

for word in words:

new_text = " ".join(new_words)

print("Message : RT this is a retweeted dm message tweet by Shivam Bansal")

print("Converted message : ",lookup_words("RT this is a retweeted dm message tweet by

Converted message : Retweet this is a retweeted direct message message tweet by

dict_keys(['rt', 'dm', 'awsm', 'luv', '...'])

dict_values(['Retweet', 'direct message', 'awesome', 'love', ' '])

Aim : Perform part of speech tagging on any textual data.

sentence tokenizer, PunktSentenceTokenizer is capable of unsupervised machine

Input : M number of documents, N number of words, K number of topics.

ii ii. stop_words Python package containing stop words

4) Preparing data for LDA analysis

5) LDA model training

6) Analyzing LDA model results

Data Cleaning methods :

[(0, '0.135*"health" + 0.052*"expert" + 0.052*"may" + 0.052*"suggest" + 0.052*"caus"

Description : It is a widely used statistical method in natural language processing and

TF-IDF is useful in many natural language processing applications.

Aim : Demonstrate word embeddings using word2vec

Description : Word embedding is one of the most important techniques in natural

Train the genism word2vec model

The hyperparameters of this model.

Step -1 Installing textblob.

Aim : Apply support vector machine for text classification.

Description : Support Vector Machine” (SVM) is a supervised machine learning algorithm

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

svc = svm.SVC(kernel='rbf', C=1,gamma='auto').fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)

Description : Cosine similarity is a measure of similarity between two non-zero vectors of an

Vector(A) = [5,0,2] Vector(B) = [2,5,0]

You might also like

[(0, '0.135"health" + 0.052"expert" + 0.052"may" + 0.052"suggest" + 0.052*"caus"