1-NLP - Lab Manual
Course Objectives
Exercise – 1
1. Tokenize the sentence into words for further analysis (using a Python function).
import re
import string
from nltk.tokenize import word_tokenize

# Read and show the file, then normalize it (lower case, digits and
# punctuation removed; part 2 below walks through these steps)
with open('Input_text.txt') as f:
    raw = f.read()
print("The File contents")
print(raw)
text = re.sub(r'\d+', '', raw.lower()).replace('-', ' ')
pun_rem = text.translate(str.maketrans('', '', string.punctuation))
print("------------------------------------------------")
tokens = word_tokenize(pun_rem)
print("Tokens")
print(tokens)
print("------------------------------------------------")
INPUT FILE:
Input_text.txt
The last decade has seen a substantial surge in the use of finite-state methods in many areas of
natural-language processing. This is a remarkable comeback considering that in the dawn of modern
linguistics, finite-state grammars were dismissed as fundamentally inadequate. Noam Chomsky's seminal
1957 work, Syntactic Structures [3], includes a short chapter devoted to ``finite state Markov
processes'', devices that we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state language. (p. 21)
OUTPUT:
The File contents
The last decade has seen a substantial surge in the use of finite-state methods in many areas
of natural-language processing. This is a remarkable comeback considering that in the dawn of
modern linguistics, finite-state grammars were dismissed as fundamentally inadequate. Noam
Chomsky's seminal 1957 work, Syntactic Structures [3], includes a short chapter devoted to
``finite state Markov processes'', devices that we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state
language. (p. 21)
------------------------------------------------
Tokens
['the', 'last', 'decade', 'has', 'seen', 'a', 'substantial', 'surge', 'in', 'the', 'use', 'of', 'finite', 'state',
'methods', 'in', 'many', 'areas', 'of', 'natural', 'language', 'processing', 'this', 'is', 'a', 'remarkable',
'comeback', 'considering', 'that', 'in', 'the', 'dawn', 'of', 'modern', 'linguistics', 'finite', 'state',
'grammars', 'were', 'dismissed', 'as', 'fundamentally', 'inadequate', 'noam', 'chomskys', 'seminal',
'work', 'syntactic', 'structures', 'includes', 'a', 'short', 'chapter', 'devoted', 'to', 'finite', 'state',
'markov', 'processes', 'devices', 'that', 'we', 'now', 'would', 'call', 'weighted', 'finite', 'state',
'automata', 'in', 'this', 'section', 'chomsky', 'demonstrates', 'in', 'a', 'few', 'paragraphs', 'that',
'english', 'is', 'not', 'a', 'finite', 'state', 'language', 'p']
------------------------------------------------
-----------------------------------------------------------------------------
2. Normalize the text: eliminate unwanted punctuation, convert the entire document to lower case (or
upper case), expand abbreviations, convert numbers into words, and canonicalize.
# Exercise 1
import nltk
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

# Read the input file
with open('Input_text.txt') as f:
    text = f.read()
#----------------------------------------------------
text = text.lower()
print("After converting to Lower cases")
print(text)
print("------------------------------------------------")
#---------------------------------------------------
num_remo = re.sub(r'\d+', '', text)
print("After removal of Numbers")
print(num_remo)
print("------------------------------------------------")
#-------------------------------------------------
# Replace hyphens with spaces, then strip all remaining punctuation
translator = str.maketrans('', '', string.punctuation)
text = num_remo.replace('-', ' ')
pun_rem = text.translate(translator)
print("After Punctuation removal")
print(pun_rem)
print("------------------------------------------------")
tokens = word_tokenize(pun_rem)
print("Tokens")
print(tokens)
print("------------------------------------------------")
#--------------------------------------------------
# Lemmatize twice: first as nouns (the default), then as verbs
lemmatizer = WordNetLemmatizer()
lemmatized_output = [lemmatizer.lemmatize(w) for w in tokens]
lemmatized_output_1 = [lemmatizer.lemmatize(w, 'v') for w in lemmatized_output]
print("After Lemmatizing ")
print(lemmatized_output_1)
print("------------------------------------------------")
#------------------------------------------
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in lemmatized_output_1 if w not in stop_words]
print("After Removing the stop words")
print(filtered_sentence)
print("------------------------------------------------")
text = pos_tag(filtered_sentence)
print("After PoS attachment:", text)
INPUT FILE:
Input_text.txt
The last decade has seen a substantial surge in the use of finite-state methods in many areas of
natural-language processing. This is a remarkable comeback considering that in the dawn of modern
linguistics, finite-state grammars were dismissed as fundamentally inadequate. Noam Chomsky's seminal
1957 work, Syntactic Structures [3], includes a short chapter devoted to ``finite state Markov
processes'', devices that we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state language. (p. 21)
OUTPUT:
After converting to Lower cases
the last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-
language processing. this is a remarkable comeback considering that in the dawn of modern linguistics,
finite-state grammars were dismissed as fundamentally inadequate. noam chomsky's seminal 1957 work,
syntactic structures [3], includes a short chapter devoted to ``finite state markov processes'', devices that
we now would call weighted finite-state automata.
in this section chomsky demonstrates in a few paragraphs that english is not a finite state language. (p. 21)
------------------------------------------------
After removal of Numbers
the last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-
language processing. this is a remarkable comeback considering that in the dawn of modern linguistics,
finite-state grammars were dismissed as fundamentally inadequate. noam chomsky's seminal work,
syntactic structures [], includes a short chapter devoted to ``finite state markov processes'', devices that we
now would call weighted finite-state automata.
in this section chomsky demonstrates in a few paragraphs that english is not a finite state language. (p. )
------------------------------------------------
After Punctuation removal
the last decade has seen a substantial surge in the use of finite state methods in many areas of natural
language processing this is a remarkable comeback considering that in the dawn of modern linguistics finite
state grammars were dismissed as fundamentally inadequate noam chomskys seminal work syntactic
structures includes a short chapter devoted to finite state markov processes devices that we now would
call weighted finite state automata
in this section chomsky demonstrates in a few paragraphs that english is not a finite state language p
------------------------------------------------
Tokens
['the', 'last', 'decade', 'has', 'seen', 'a', 'substantial', 'surge', 'in', 'the', 'use', 'of', 'finite', 'state', 'methods', 'in',
'many', 'areas', 'of', 'natural', 'language', 'processing', 'this', 'is', 'a', 'remarkable', 'comeback', 'considering',
'that', 'in', 'the', 'dawn', 'of', 'modern', 'linguistics', 'finite', 'state', 'grammars', 'were', 'dismissed', 'as',
'fundamentally', 'inadequate', 'noam', 'chomskys', 'seminal', 'work', 'syntactic', 'structures', 'includes', 'a',
'short', 'chapter', 'devoted', 'to', 'finite', 'state', 'markov', 'processes', 'devices', 'that', 'we', 'now', 'would', 'call',
'weighted', 'finite', 'state', 'automata', 'in', 'this', 'section', 'chomsky', 'demonstrates', 'in', 'a', 'few',
'paragraphs', 'that', 'english', 'is', 'not', 'a', 'finite', 'state', 'language', 'p']
------------------------------------------------
After Lemmatizing
['the', 'last', 'decade', 'ha', 'see', 'a', 'substantial', 'surge', 'in', 'the', 'use', 'of', 'finite', 'state', 'method', 'in',
'many', 'area', 'of', 'natural', 'language', 'process', 'this', 'be', 'a', 'remarkable', 'comeback', 'consider', 'that',
'in', 'the', 'dawn', 'of', 'modern', 'linguistics', 'finite', 'state', 'grammar', 'be', 'dismiss', 'a', 'fundamentally',
'inadequate', 'noam', 'chomsky', 'seminal', 'work', 'syntactic', 'structure', 'include', 'a', 'short', 'chapter',
'devote', 'to', 'finite', 'state', 'markov', 'process', 'device', 'that', 'we', 'now', 'would', 'call', 'weight', 'finite', 'state',
'automaton', 'in', 'this', 'section', 'chomsky', 'demonstrate', 'in', 'a', 'few', 'paragraph', 'that', 'english', 'be', 'not',
'a', 'finite', 'state', 'language', 'p']
------------------------------------------------
After Removing the stop words
['last', 'decade', 'ha', 'see', 'substantial', 'surge', 'use', 'finite', 'state', 'method', 'many', 'area', 'natural',
'language', 'process', 'remarkable', 'comeback', 'consider', 'dawn', 'modern', 'linguistics', 'finite', 'state',
'grammar', 'dismiss', 'fundamentally', 'inadequate', 'noam', 'chomsky', 'seminal', 'work', 'syntactic', 'structure',
'include', 'short', 'chapter', 'devote', 'finite', 'state', 'markov', 'process', 'device', 'would', 'call', 'weight', 'finite',
'state', 'automaton', 'section', 'chomsky', 'demonstrate', 'paragraph', 'english', 'finite', 'state', 'language', 'p']
------------------------------------------------
After PoS attachment: [('last', 'JJ'), ('decade', 'NN'), ('ha', 'VBD'), ('see', 'VBP'), ('substantial', 'JJ'), ('surge',
'NN'), ('use', 'NN'), ('finite', 'JJ'), ('state', 'NN'), ('method', 'VBD'), ('many', 'JJ'), ('area', 'NN'), ('natural', 'JJ'),
('language', 'NN'), ('process', 'NN'), ('remarkable', 'JJ'), ('comeback', 'NN'), ('consider', 'NN'), ('dawn', 'NN'),
('modern', 'JJ'), ('linguistics', 'NNS'), ('finite', 'JJ'), ('state', 'NN'), ('grammar', 'NN'), ('dismiss', 'NN'),
('fundamentally', 'RB'), ('inadequate', 'JJ'), ('noam', 'NNS'), ('chomsky', 'VBP'), ('seminal', 'JJ'), ('work', 'NN'),
('syntactic', 'JJ'), ('structure', 'NN'), ('include', 'VBP'), ('short', 'JJ'), ('chapter', 'NN'), ('devote', 'NN'), ('finite',
'JJ'), ('state', 'NN'), ('markov', 'NN'), ('process', 'NN'), ('device', 'NN'), ('would', 'MD'), ('call', 'VB'), ('weight',
'NN'), ('finite', 'NN'), ('state', 'NN'), ('automaton', 'NN'), ('section', 'NN'), ('chomsky', 'NN'), ('demonstrate',
'NN'), ('paragraph', 'NN'), ('english', 'JJ'), ('finite', 'NN'), ('state', 'NN'), ('language', 'NN'), ('p', 'NN')]
-------------------------------------------------------------------------------------------------------------------
Exercise – 2
1. Apply similarity measures using the Jaccard coefficient or the Tanimoto coefficient.
import re
import string
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def text_preprocess(text):
    # Lower-case, strip digits and punctuation, tokenize, lemmatize
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    translator = str.maketrans('', '', string.punctuation)
    text = text.replace('-', ' ')
    text = text.translate(translator)
    tokens = set(word_tokenize(text))
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]       # noun lemmas
    tokens = [lemmatizer.lemmatize(w, 'v') for w in tokens]  # verb lemmas
    return set(tokens)
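The similarity computation itself is not shown in the listing. The coefficients in the output below match NLTK's jaccard_distance, which returns 1 minus |intersection| / |union| (so identical sets score 0.0). A minimal sketch; the actual input sentences are not shown in the manual, so the pair here is an assumption:

from nltk.metrics.distance import jaccard_distance

# Hypothetical input pair chosen to mirror the first output sets below
s1 = text_preprocess("We are learning NLP using Python")
s2 = text_preprocess("We are learning NLP using Python")
print(s1)
print(s2)
print(jaccard_distance(s1, s2))  # 0.0 for identical sets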
OUTPUT:
[nltk_data] Downloading package wordnet to
[nltk_data] C:\Users\Universal\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
0.0
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
{'simple', 'python', 'nlp', 'be', 'under'}
0.625
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
{'be', 'python', 'for', 'nlp', 'people', 'use', 'chatbot', 'in'}
0.6
-----------------------------------------------------------------------------------
import itertools
import numpy as np

# H matrix construction (Smith-Waterman local alignment scoring)
def matrix(a, b, match_score=3, gap_cost=2):
    H = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    for i, j in itertools.product(range(1, H.shape[0]), range(1, H.shape[1])):
        match = H[i - 1, j - 1] + (match_score if a[i - 1] == b[j - 1] else -match_score)
        delete = H[i - 1, j] - gap_cost
        insert = H[i, j - 1] - gap_cost
        H[i, j] = max(match, delete, insert, 0)
    return H
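# The traceback helper called below is not shown in the manual; this is
# a minimal sketch (an assumption, not the manual's original code) that
# walks back from the highest-scoring cell of H to recover the best
# local match within b and its end position.
def traceback(H, b, b_='', old_i=0):
    # Locate the maximum of H, searching from the bottom-right corner
    H_flip = np.flip(np.flip(H, 0), 1)
    i_, j_ = np.unravel_index(H_flip.argmax(), H_flip.shape)
    i, j = np.subtract(H.shape, (i_ + 1, j_ + 1))
    if H[i, j] == 0:  # a zero cell marks the start of the local alignment
        return b_, j
    # Insert a gap marker when the row index jumps by more than one
    b_ = b[j - 1] + '-' + b_ if old_i - i > 1 else b[j - 1] + b_
    return traceback(H[0:i, 0:j], b, b_, i)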
a, b = 'rain', 'shine'
print("Input Strings are", a, "&", b)
H = matrix(a, b)
print(H)
print(traceback(H, b))  # ('in', 2)
OUTPUT:
Input Strings are rain & shine
[[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 3 1 0]
[0 0 0 1 6 4]]
('in', 2)
---------------------------------------------------------------------
Exercise – 3
1. For the given data, find the maximum word frequency and report the most frequently occurring words.
2. Visualize the given text data with appropriate visual techniques.
# Frequency distribution over the preprocessed tokens from Exercise 1
data_analysis = nltk.FreqDist(filtered_sentence)
print("\n\nWord Frequency for the pre processed text")
data_analysis.plot(25, cumulative=False)
# Keep only the words whose frequency is greater than or equal to 2
filter_words = dict([(m, n) for m, n in data_analysis.items() if n >= 2])
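For part 2, a word cloud is another common view of the same frequency data, alongside the FreqDist plot above. A minimal sketch using the third-party wordcloud package (an assumption; the manual only shows the frequency plot):

from wordcloud import WordCloud  # third-party: pip install wordcloud
import matplotlib.pyplot as plt

# Render the frequency distribution computed above as a word cloud
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(data_analysis)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()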
OUTPUT:
area -> 1; automaton -> 1; call -> 1; chapter -> 1; chomsky -> 2; comeback -> 1; consider -> 1; dawn
-> 1; decade -> 1; demonstrate -> 1; device -> 1; devote -> 1; dismiss -> 1; english -> 1; finite -> 5;
fundamentally -> 1; grammar -> 1; ha -> 1; inadequate -> 1; include -> 1; language -> 2; last -> 1;
linguistics -> 1; many -> 1; markov -> 1; method -> 1; modern -> 1; natural -> 1; noam -> 1; p -> 1;
paragraph -> 1; process -> 2; remarkable -> 1; section -> 1; see -> 1; seminal -> 1; short -> 1; state -
> 5; structure -> 1; substantial -> 1; surge -> 1; syntactic -> 1; use -> 1; weight -> 1; work -> 1; would
-> 1;
Word Frequency >= 2
chomsky: 2
finite: 5
language: 2
process: 2
state: 5
Exercise - 4
1. Develop a back-off mechanism for Maximum Likelihood Estimate (MLE)
def text_preprocess(corpus):
    text = corpus.lower()
    print("After converting to Lower cases")
    print(text)
    print("------------------------------------------------")
    #---------------------------------------------------
    num_remo = re.sub(r'\d+', '', text)
    print("After removal of Numbers")
    print(num_remo)
    print("------------------------------------------------")
    #-------------------------------------------------
    translator = str.maketrans('', '', string.punctuation)
    text = num_remo.replace('-', ' ')
    pun_rem = text.translate(translator)
    print("After Punctuation removal")
    print(pun_rem)
    print("------------------------------------------------")
    return pun_rem
from collections import defaultdict

def build_conditional_probabilities(corpus):
    """
    Takes a corpus string (words separated by spaces) and returns a 2D
    dictionary of probabilities P(next|current) of seeing a word "next"
    conditionally on seeing a word "current".
    """
    tokenized_string = corpus.split()
    print(tokenized_string)
    # First pass: map each word to the list of words that follow it
    dictionary = defaultdict(list)
    for current, following in zip(tokenized_string, tokenized_string[1:]):
        dictionary[current].append(following)
    print(dictionary)
    # Second pass: convert each successor list into relative frequencies
    probabilities = defaultdict(dict)
    for current, successors in dictionary.items():
        for nxt in set(successors):
            probabilities[current][nxt] = successors.count(nxt) / len(successors)
    return probabilities
# Lookup helper (the enclosing def line is missing from the manual;
# the function name here is assumed)
def bigram_probability(conditional_probabilities, current, next_candidate):
    if current in conditional_probabilities:
        if next_candidate in conditional_probabilities[current]:
            return conditional_probabilities[current][next_candidate]
    # If the current -> next_candidate pair has not been observed in the
    # corpus, the corresponding dictionary keys will not be defined;
    # we return a probability of 0.0
    return 0.0
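The exercise calls for a back-off mechanism, and the 0.0 branch above is exactly where it plugs in. A minimal stupid-backoff style sketch, assuming a unigram Counter built from the corpus (the helper name and the alpha value are assumptions, not the manual's code):

from collections import Counter

def backoff_probability(conditional_probabilities, unigram_counts,
                        current, next_candidate, alpha=0.4):
    # Use the bigram MLE when the pair was observed; otherwise back off
    # to a discounted unigram MLE of the candidate word
    p = bigram_probability(conditional_probabilities, current, next_candidate)
    if p > 0.0:
        return p
    return alpha * unigram_counts[next_candidate] / sum(unigram_counts.values())

# Usage sketch:
# tokens = text_preprocess(corpus).split()
# unigram_counts = Counter(tokens)
# backoff_probability(probabilities, unigram_counts, 'nlp', 'program')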
OUTPUT:
defaultdict(<class 'list'>, {'nlp': ['program', 'program'], 'program': ['is', 'is'], 'is': ['ruling', 'used'], 'ruling':
['the'], 'the': ['it'], 'it': ['indutry'], 'indutry': ['with'], 'with': ['many'], 'many': ['applications'], 'applications':
['nlp'], 'used': ['for'], 'for': ['voice'], 'voice': ['controling'], 'controling': ['process'], 'process': ['too']})
defaultdict(<class 'list'>, {'nlp': {'program': 1.0}, 'program': {'is': 1.0}, 'is': {'used': 0.5, 'ruling': 0.5},
'ruling': {'the': 1.0}, 'the': {'it': 1.0}, 'it': {'indutry': 1.0}, 'indutry': {'with': 1.0}, 'with': {'many': 1.0}, 'many':
{'applications': 1.0}, 'applications': {'nlp': 1.0}, 'used': {'for': 1.0}, 'for': {'voice': 1.0}, 'voice': {'controling':
1.0}, 'controling': {'process': 1.0}, 'process': {'too': 1.0}})
Bigram Prediction
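The prediction code under this heading is not shown in the manual; a minimal sketch (function name assumed) that returns the most probable successor from the table above:

def predict_next_word(conditional_probabilities, current):
    # Return the most probable successor of `current`, or None if the
    # word never occurred in the corpus
    successors = conditional_probabilities.get(current)
    if not successors:
        return None
    return max(successors, key=successors.get)

# e.g. predict_next_word(probabilities, 'nlp') -> 'program'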
Exercise - 5
1. Perform sentiment analysis, classifying comments using Bayesian analysis.
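No code listing survives for this exercise. The scores in the output below match NLTK's VADER SentimentIntensityAnalyzer (a lexicon/rule-based scorer rather than a Bayesian classifier), so a minimal sketch consistent with that output could be the following; the input comment is an assumption:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# polarity_scores returns neg/neu/pos proportions plus a compound
# score in [-1, 1]
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This movie was really bad and boring"))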
OUTPUT:
{'compound': -0.6808, 'neg': 0.483, 'neu': 0.517, 'pos': 0.0}
Exercise – 6
1. Predict the corporate credit rating of a company or individual using its available reports and press releases.
2. Predict the market capitalization of a company using its available reports and press releases.
Exercise – 7
1. Write a Python program to predict price movements based on news, using the Two Sigma dataset.
2. Write a Python program to cluster companies based on stock-price movement.
Exercise – 8
1. Predict rainfall using linear regression.
2. Create clusters of players based on their strengths in order to build a well-rounded team.
Exercise – 9
1. Write a neural network from scratch in Python to classify digits from MNIST.
Exercise – 10
1. Write a Python program to classify spam mail using logistic regression.
2. Identify the key influencers in social networks.
Exercise – 11
1. Perform sentiment analysis on Twitter data.
Exercise – 12
1. Predict disease outbreaks at the community level in a particular location.
2. Write a program for exploratory data analysis of image data, such as scans and X-rays.
Datasets:
2. Kaggle Datasets
3. data.gov
5. Sports Reference
6. cricsheet.org
7. Quandl
8. Quantopian
9. US Fundamentals Archive
10. MNIST
14. data.gov/health
Course Outcome:
After the successful completion of the course, students will be able to:
1. Cleanse text data of unwanted spaces, punctuation marks, and brackets.
2. Fit models to text data and interpret them.
3. Visualize data effectively for non-technical audiences.