
Elective-III

Natural Language Processing Lab


Subject Code :
Total Contact Hours : 15
Credits : 01
L-T-P : 0-0-2
Prerequisite: Python Programming

Course Objectives

• To understand the pre-processing of text data for further analysis.
• To understand unstructured data and how to analyze it.
• To understand word clouds and the visualization of text data.

Exercise – 1
1. Tokenize the sentence into words for further analysis (using a Python function).
import nltk
import re
import string
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Open the text file to apply the text preprocessing
file = open("input_text.txt", "rt")
text = file.read()
file.close()
print("The File contents")
print(text)

print("------------------------------------------------")

# Lower-case the text and strip numbers, hyphens, and punctuation before
# tokenizing, so that pun_rem matches the token list shown in the output below
clean = text.lower()
clean = re.sub(r'\d+', '', clean)
clean = clean.replace('-', ' ')
pun_rem = clean.translate(str.maketrans('', '', string.punctuation))

tokens = word_tokenize(pun_rem)
print("Tokens")
print(tokens)
print("------------------------------------------------")

INPUT FILE:

Input_text.txt

The last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-
language processing. This is a remarkable comeback considering that in the dawn of modern linguistics,
finite-state grammars were dismissed as fundamentally inadequate. Noam Chomsky's seminal 1957 work,
Syntactic Structures [3], includes a short chapter devoted to ``finite state Markov processes'', devices that
we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state language. (p.
21)

OUTPUT:
The File contents
The last decade has seen a substantial surge in the use of finite-state methods in many areas
of natural-language processing. This is a remarkable comeback considering that in the dawn of
modern linguistics, finite-state grammars were dismissed as fundamentally inadequate. Noam
Chomsky's seminal 1957 work, Syntactic Structures [3], includes a short chapter devoted to
``finite state Markov processes'', devices that we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state
language. (p. 21)

------------------------------------------------
Tokens
['the', 'last', 'decade', 'has', 'seen', 'a', 'substantial', 'surge', 'in', 'the', 'use', 'of', 'finite', 'state',
'methods', 'in', 'many', 'areas', 'of', 'natural', 'language', 'processing', 'this', 'is', 'a', 'remarkable',
'comeback', 'considering', 'that', 'in', 'the', 'dawn', 'of', 'modern', 'linguistics', 'finite', 'state',
'grammars', 'were', 'dismissed', 'as', 'fundamentally', 'inadequate', 'noam', 'chomskys', 'seminal',
'work', 'syntactic', 'structures', 'includes', 'a', 'short', 'chapter', 'devoted', 'to', 'finite', 'state',
'markov', 'processes', 'devices', 'that', 'we', 'now', 'would', 'call', 'weighted', 'finite', 'state',
'automata', 'in', 'this', 'section', 'chomsky', 'demonstrates', 'in', 'a', 'few', 'paragraphs', 'that',
'english', 'is', 'not', 'a', 'finite', 'state', 'language', 'p']
------------------------------------------------
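
Word tokenization can be complemented by sentence tokenization, which the next program references in its commented-out lines. A small sketch applied to the same raw file contents read above:

# Sentence-level tokenization of the raw file contents
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
print("Sentences:", sentences)
print("Number of sentences:", len(sentences))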

-----------------------------------------------------------------------------
2. Normalize the text: eliminate unwanted punctuation, convert the entire document to lower case or upper case, expand abbreviations, convert numbers into words, and apply canonicalization (a sketch of the number-to-words step follows this exercise's output).

# Exercise 1.2

# Import all the required modules for the text preprocessing

import nltk
import re
import string
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

#Open the text file to apply the text preprocessing


file = open("input_text.txt","rt")
text=file.read()
file.close()
print("The File contents")
print(text)

#-------------------------------------------------
#sentences = sent_tokenize(text)
#print("Sentences ",sentences)
#text=sentences
#print("------------------------------------------------")
#-------------------------------------------------

#----------------------------------------------------
text=text.lower()
print("After converting to Lower cases")
print(text)
print("------------------------------------------------")
#---------------------------------------------------
num_remo = re.sub(r'\d+', '', text)
print("After removal of Numbers")
print(num_remo)
print("------------------------------------------------")
#-------------------------------------------------

translator = str.maketrans('', '', string.punctuation)
text = num_remo.replace('-', ' ')
pun_rem=text.translate(translator)
print("After Punctuation removal")
print(pun_rem)
print("------------------------------------------------")

tokens=word_tokenize(pun_rem)
print("Tokens")
print(tokens)
print("------------------------------------------------")

#--------------------------------------------------
# Lemmatize in two passes: first with the default noun PoS, then treating each token as a verb
lemmatizer = WordNetLemmatizer()
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in tokens])
#print("After Lemmatizing on Noun -->",lemmatized_output)
lemmatized_output = word_tokenize(lemmatized_output)
lemmatized_output_1 = [lemmatizer.lemmatize(w, 'v') for w in lemmatized_output]
print("After Lemmatizing ")
print(lemmatized_output_1)
print("------------------------------------------------")

#------------------------------------------
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in lemmatized_output_1 if not w in stop_words]
print("After Removing the stop words")
print(filtered_sentence)
print("------------------------------------------------")
text = pos_tag(filtered_sentence)
print("After PoS attachment:",text)
INPUT FILE:

Input_text.txt

The last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-
language processing. This is a remarkable comeback considering that in the dawn of modern linguistics,
finite-state grammars were dismissed as fundamentally inadequate. Noam Chomsky's seminal 1957 work,
Syntactic Structures [3], includes a short chapter devoted to ``finite state Markov processes'', devices that
we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state language. (p.
21)

OUTPUT:

The File contents


The last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-
language processing. This is a remarkable comeback considering that in the dawn of modern linguistics,
finite-state grammars were dismissed as fundamentally inadequate. Noam Chomsky's seminal 1957 work,
Syntactic Structures [3], includes a short chapter devoted to ``finite state Markov processes'', devices that
we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state language. (p. 21)

After converting to Lower cases
the last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-
language processing. this is a remarkable comeback considering that in the dawn of modern linguistics,
finite-state grammars were dismissed as fundamentally inadequate. noam chomsky's seminal 1957 work,
syntactic structures [3], includes a short chapter devoted to ``finite state markov processes'', devices that
we now would call weighted finite-state automata.
in this section chomsky demonstrates in a few paragraphs that english is not a finite state language. (p. 21)

------------------------------------------------
After removal of Numbers
the last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-
language processing. this is a remarkable comeback considering that in the dawn of modern linguistics,
finite-state grammars were dismissed as fundamentally inadequate. noam chomsky's seminal work,
syntactic structures [], includes a short chapter devoted to ``finite state markov processes'', devices that we
now would call weighted finite-state automata.
in this section chomsky demonstrates in a few paragraphs that english is not a finite state language. (p. )

------------------------------------------------
After Punctuation removal
the last decade has seen a substantial surge in the use of finite state methods in many areas of natural
language processing this is a remarkable comeback considering that in the dawn of modern linguistics finite
state grammars were dismissed as fundamentally inadequate noam chomskys seminal work syntactic
structures includes a short chapter devoted to finite state markov processes devices that we now would
call weighted finite state automata
in this section chomsky demonstrates in a few paragraphs that english is not a finite state language p

------------------------------------------------
Tokens
['the', 'last', 'decade', 'has', 'seen', 'a', 'substantial', 'surge', 'in', 'the', 'use', 'of', 'finite', 'state', 'methods', 'in',
'many', 'areas', 'of', 'natural', 'language', 'processing', 'this', 'is', 'a', 'remarkable', 'comeback', 'considering',
'that', 'in', 'the', 'dawn', 'of', 'modern', 'linguistics', 'finite', 'state', 'grammars', 'were', 'dismissed', 'as',
'fundamentally', 'inadequate', 'noam', 'chomskys', 'seminal', 'work', 'syntactic', 'structures', 'includes', 'a',
'short', 'chapter', 'devoted', 'to', 'finite', 'state', 'markov', 'processes', 'devices', 'that', 'we', 'now', 'would', 'call',
'weighted', 'finite', 'state', 'automata', 'in', 'this', 'section', 'chomsky', 'demonstrates', 'in', 'a', 'few',
'paragraphs', 'that', 'english', 'is', 'not', 'a', 'finite', 'state', 'language', 'p']
------------------------------------------------
After Lemmatizing
['the', 'last', 'decade', 'ha', 'see', 'a', 'substantial', 'surge', 'in', 'the', 'use', 'of', 'finite', 'state', 'method', 'in',
'many', 'area', 'of', 'natural', 'language', 'process', 'this', 'be', 'a', 'remarkable', 'comeback', 'consider', 'that',
'in', 'the', 'dawn', 'of', 'modern', 'linguistics', 'finite', 'state', 'grammar', 'be', 'dismiss', 'a', 'fundamentally',
'inadequate', 'noam', 'chomsky', 'seminal', 'work', 'syntactic', 'structure', 'include', 'a', 'short', 'chapter',
'devote', 'to', 'finite', 'state', 'markov', 'process', 'device', 'that', 'we', 'now', 'would', 'call', 'weight', 'finite', 'state',
'automaton', 'in', 'this', 'section', 'chomsky', 'demonstrate', 'in', 'a', 'few', 'paragraph', 'that', 'english', 'be', 'not',
'a', 'finite', 'state', 'language', 'p']
------------------------------------------------
After Removing the stop words
['last', 'decade', 'ha', 'see', 'substantial', 'surge', 'use', 'finite', 'state', 'method', 'many', 'area', 'natural',
'language', 'process', 'remarkable', 'comeback', 'consider', 'dawn', 'modern', 'linguistics', 'finite', 'state',
'grammar', 'dismiss', 'fundamentally', 'inadequate', 'noam', 'chomsky', 'seminal', 'work', 'syntactic', 'structure',
'include', 'short', 'chapter', 'devote', 'finite', 'state', 'markov', 'process', 'device', 'would', 'call', 'weight', 'finite',
'state', 'automaton', 'section', 'chomsky', 'demonstrate', 'paragraph', 'english', 'finite', 'state', 'language', 'p']
------------------------------------------------
After PoS attachment: [('last', 'JJ'), ('decade', 'NN'), ('ha', 'VBD'), ('see', 'VBP'), ('substantial', 'JJ'), ('surge',
'NN'), ('use', 'NN'), ('finite', 'JJ'), ('state', 'NN'), ('method', 'VBD'), ('many', 'JJ'), ('area', 'NN'), ('natural', 'JJ'),
('language', 'NN'), ('process', 'NN'), ('remarkable', 'JJ'), ('comeback', 'NN'), ('consider', 'NN'), ('dawn', 'NN'),

('modern', 'JJ'), ('linguistics', 'NNS'), ('finite', 'JJ'), ('state', 'NN'), ('grammar', 'NN'), ('dismiss', 'NN'),
('fundamentally', 'RB'), ('inadequate', 'JJ'), ('noam', 'NNS'), ('chomsky', 'VBP'), ('seminal', 'JJ'), ('work', 'NN'),
('syntactic', 'JJ'), ('structure', 'NN'), ('include', 'VBP'), ('short', 'JJ'), ('chapter', 'NN'), ('devote', 'NN'), ('finite',
'JJ'), ('state', 'NN'), ('markov', 'NN'), ('process', 'NN'), ('device', 'NN'), ('would', 'MD'), ('call', 'VB'), ('weight',
'NN'), ('finite', 'NN'), ('state', 'NN'), ('automaton', 'NN'), ('section', 'NN'), ('chomsky', 'NN'), ('demonstrate',
'NN'), ('paragraph', 'NN'), ('english', 'JJ'), ('finite', 'NN'), ('state', 'NN'), ('language', 'NN'), ('p', 'NN')]
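
The task statement also calls for expanding numbers into words, whereas the program above simply removes digits. A minimal sketch of that step, assuming the third-party num2words package (pip install num2words) and an illustrative expand_numbers helper:

# Hedged alternative to digit removal: expand numbers into their word form
# (assumes the third-party num2words package is installed)
import re
from num2words import num2words

def expand_numbers(text):
    # Replace each run of digits with its spelled-out English form
    return re.sub(r'\d+', lambda m: num2words(int(m.group())), text)

print(expand_numbers("Noam Chomsky's seminal 1957 work"))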

-------------------------------------------------------------------------------------------------------------------

Exercise – 2
1. Apply similarity measures using the Jaccard (Tanimoto) coefficient.

#Text similarity measurement without text preprocessing


import nltk
from nltk.metrics import jaccard_distance   # jaccard_distance = 1 - Jaccard coefficient
from nltk.tokenize import word_tokenize

s1="We are learning NLP using Python"


s2="We are learning NLP using Python"
s3="NLP under Python is simple"
s4="People are using NLP in Python for chatbot"
s1=set(word_tokenize(s1))
s2=set(word_tokenize(s2))
s3=set(word_tokenize(s3))
s4=set(word_tokenize(s4))
print(s1)
print(s2)
print(jaccard_distance(s1,s2))
print(s1)
print(s3)
print(jaccard_distance(s1,s3))
print(s1)
print(s4)
print(jaccard_distance(s1,s4))

OUTPUT:

{'NLP', 'Python', 'We', 'using', 'learning', 'are'}


{'NLP', 'Python', 'We', 'using', 'learning', 'are'}
0.0
{'NLP', 'Python', 'We', 'using', 'learning', 'are'}
{'simple', 'NLP', 'Python', 'under', 'is'}
0.7777777777777778
{'NLP', 'Python', 'We', 'using', 'learning', 'are'}
{'People', 'for', 'NLP', 'Python', 'using', 'are', 'chatbot', 'in'}
0.6
#Text similarity measurement with text preprocessing
import nltk
import re
import string
from nltk.metrics import jaccard_distance
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

def text_preprocess(text):
    # Lower-case, strip digits and punctuation, tokenize, then lemmatize (noun pass, then verb pass)
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    translator = str.maketrans('', '', string.punctuation)
    text = text.replace('-', ' ')
    text = text.translate(translator)
    text = set(word_tokenize(text))
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(w) for w in text]
    text = [lemmatizer.lemmatize(w, 'v') for w in text]
    return set(text)

s1=text_preprocess("We are learning NLP using Python")


s2=s1
s3=text_preprocess("NLP under Python is simple")
s4=text_preprocess("People are using NLP in Python for chatbot")
print(s1)
print(s2)
print(jaccard_distance(s1,s2))
print(s1)
print(s3)
print(jaccard_distance(s1,s3))
print(s1)
print(s4)
print(jaccard_distance(s1,s4))

OUTPUT:
[nltk_data] Downloading package wordnet to
[nltk_data] C:\Users\Universal\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
0.0
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
{'simple', 'python', 'nlp', 'be', 'under'}
0.625
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
{'be', 'python', 'for', 'nlp', 'people', 'use', 'chatbot', 'in'}
0.6
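
jaccard_distance above reports a distance (0.0 for identical sets); the Jaccard, or Tanimoto, coefficient named in the task is simply 1 minus that distance and can be computed directly from the token sets built above, as in this small sketch:

def jaccard_coefficient(x, y):
    # Jaccard / Tanimoto coefficient: |intersection| / |union|
    return len(x & y) / len(x | y)

print(jaccard_coefficient(s1, s3))   # 1 - 0.625 = 0.375
print(jaccard_coefficient(s1, s4))   # 1 - 0.6 = 0.4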
-----------------------------------------------------------------------------------

2. Apply similarity measures using the Smith Waterman distance

import itertools
import numpy as np

# H matrix construction
def matrix(a, b, match_score=3, gap_cost=2):
    H = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    for i, j in itertools.product(range(1, H.shape[0]), range(1, H.shape[1])):
        match = H[i - 1, j - 1] + (match_score if a[i - 1] == b[j - 1] else -match_score)
        delete = H[i - 1, j] - gap_cost
        insert = H[i, j - 1] - gap_cost
        H[i, j] = max(match, delete, insert, 0)
    return H

# Matrix traceback based on the similarity
def traceback(H, b, b_='', old_i=0):
    # flip H to get index of the **last** occurrence of H.max() with np.argmax()
    H_flip = np.flip(np.flip(H, 0), 1)
    i_, j_ = np.unravel_index(H_flip.argmax(), H_flip.shape)
    i, j = np.subtract(H.shape, (i_ + 1, j_ + 1))  # (i, j) are the **last** indexes of H.max()
    if H[i, j] == 0:
        return b_, j
    b_ = b[j - 1] + '-' + b_ if old_i - i > 1 else b[j - 1] + b_
    return traceback(H[0:i, 0:j], b, b_, i)

# Smith-Waterman text similarity
def smith_waterman(a, b, match_score=3, gap_cost=2):
    a, b = a.upper(), b.upper()
    H = matrix(a, b, match_score, gap_cost)
    b_, pos = traceback(H, b)
    return pos, pos + len(b_)

# Build and print the scoring matrix for a sample word pair
a, b = 'rain', 'shine'
#a, b = "great", "treat"
#a, b = "grace", "great"
print("Input Strings are ", a, " & ", b)
print(matrix(a, b))

H = matrix(a, b)
print(traceback(H, b))    # ('in', 2) for 'rain' and 'shine'

#a, b = 'GGTTGACTA', 'TGTTACGG'

start, end = smith_waterman(a, b)
print(a[start:end])       # 'in' -- the best local alignment found in a

OUTPUT:
Input Strings are rain & shine
[[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 3 1 0]
[0 0 0 1 6 4]]
('in', 2)
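
The traceback above recovers the best local alignment ('in' for 'rain' and 'shine'); the raw alignment score is simply the maximum cell of the H matrix. A minimal sketch that turns this score into a rough similarity, normalising by the best possible score of the shorter string (an illustrative choice, not part of the original program):

def smith_waterman_similarity(a, b, match_score=3, gap_cost=2):
    # Raw local-alignment score is the maximum cell of the scoring matrix H
    score = matrix(a, b, match_score, gap_cost).max()
    # Normalise by a perfect match of the shorter string (illustrative normalisation)
    best = match_score * min(len(a), len(b))
    return score / best

print(smith_waterman_similarity('rain', 'shine'))    # 6 / 12 = 0.5
print(smith_waterman_similarity('great', 'treat'))   # second sample pair from the commented lines above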

---------------------------------------------------------------------

Exercise – 3
1. For the given data, find the total number of words used and output the most frequently occurring words.
2. Visualize the given text data with appropriate visual techniques (a word-cloud sketch follows the output below).

# Word frequency calculation

import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

# filtered_sentence is the stop-word-filtered token list produced in Exercise 1.2

# Frequency of each distinct word in the given text document
data_analysis = nltk.FreqDist(filtered_sentence)
for word in sorted(data_analysis):
    print(word, '->', data_analysis[word], end='; ')
print("\n\nTotal no of words in the document -->", len(data_analysis))

print("\n\nWord Frequency for the pre processed text")
data_analysis.plot(25, cumulative=False)

# Keep only the words whose frequency is greater than or equal to 2
filter_words = dict([(m, n) for m, n in data_analysis.items() if n >= 2])

print("Word Frequency >= 2")
for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))
data_analysis = nltk.FreqDist(filter_words)
data_analysis.plot(25, cumulative=False)

OUTPUT:

The preprocessed Text (OUTPUT of Ex 1.2) is given as input here.

area -> 1; automaton -> 1; call -> 1; chapter -> 1; chomsky -> 2; comeback -> 1; consider -> 1;
dawn -> 1; decade -> 1; demonstrate -> 1; device -> 1; devote -> 1; dismiss -> 1; english -> 1;
finite -> 5; fundamentally -> 1; grammar -> 1; ha -> 1; inadequate -> 1; include -> 1; language -> 2;
last -> 1; linguistics -> 1; many -> 1; markov -> 1; method -> 1; modern -> 1; natural -> 1; noam -> 1;
p -> 1; paragraph -> 1; process -> 2; remarkable -> 1; section -> 1; see -> 1; seminal -> 1; short -> 1;
state -> 5; structure -> 1; substantial -> 1; surge -> 1; syntactic -> 1; use -> 1; weight -> 1;
work -> 1; would -> 1;

Total no of words in the document --> 46

Word Frequency >= 2
chomsky: 2
finite: 5
language: 2
process: 2
state: 5
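
For the visualization part of the exercise, a word cloud is one appropriate technique (it is also named in the course objectives). A minimal sketch, assuming the third-party wordcloud and matplotlib packages are installed and reusing the filtered_sentence token list from Exercise 1.2:

# Word-cloud visualization of the preprocessed tokens
# (assumes: pip install wordcloud matplotlib)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=800, height=400, background_color='white').generate(' '.join(filtered_sentence))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()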

Exercise - 4
1. Develop a back-off mechanism for the Maximum Likelihood Estimate (MLE).

2. Apply interpolation on the data to mix and match the estimates (a sketch of both steps follows the program output below).

from collections import defaultdict


import nltk
import re
import string
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

def text_preprocess(corpus):
    text = corpus
    text = text.lower()
    print("After converting to Lower cases")
    print(text)
    print("------------------------------------------------")
    #---------------------------------------------------
    num_remo = re.sub(r'\d+', '', text)
    print("After removal of Numbers")
    print(num_remo)
    print("------------------------------------------------")
    #-------------------------------------------------
    translator = str.maketrans('', '', string.punctuation)
    text = num_remo.replace('-', ' ')
    pun_rem = text.translate(translator)
    print("After Punctuation removal")
    print(pun_rem)
    print("------------------------------------------------")
    return pun_rem

def build_conditional_probabilities(corpus):
    """
    The function takes as its input a corpus string (words separated by
    spaces) and returns a 2D dictionary of probabilities P(next|current) of
    seeing a word "next" conditionally on seeing a word "current".
    """

    # First we parse the string to build a double-dimension dictionary that
    # returns the conditional probabilities.

    # We parse the string to build a first dictionary indicating, for each
    # word, which words follow it in the string. Repeated next words are
    # kept, so we use a list and not a set.
    tokenized_string = corpus.split()
    print(tokenized_string)
    previous_word = ""
    dictionnary = defaultdict(list)

    for current_word in tokenized_string:
        if previous_word != "":
            dictionnary[previous_word].append(current_word)
        previous_word = current_word
    print(dictionnary)

    # We now parse the dictionary to compute the probability of each observed
    # next word for each word in the dictionary.
    for key in dictionnary.keys():
        next_words = dictionnary[key]        # e.g. 'the' -> ["cat", "cat", "cat", "dog"]
        unique_words = set(next_words)       # removes duplicates: {"cat", "dog"}
        nb_words = len(next_words)           # 4
        probabilities_given_key = {}
        for unique_word in unique_words:
            probabilities_given_key[unique_word] = float(next_words.count(unique_word)) / nb_words
        dictionnary[key] = probabilities_given_key
        print(probabilities_given_key)

    return dictionnary

def bigram_next_word_predictor(conditional_probabilities, current, next_candidate):
    """
    The function takes as its input a 2D dictionary of probabilities
    P(next|current) of seeing a word "next" conditionally on seeing a word
    "current", the current word being read, and a next candidate word, and
    returns P(next_candidate|current).
    """

    # We look for the probability corresponding to the
    # current -> next_candidate pair
    if current in conditional_probabilities:
        if next_candidate in conditional_probabilities[current]:
            return conditional_probabilities[current][next_candidate]

    # If the current -> next_candidate pair has not been observed in the corpus,
    # the corresponding dictionary keys will not be defined. We return
    # a probability of 0.0
    return 0.0

# An example corpus to try out the function
#corpus = "the cat is red the cat is green the cat is blue the dog is brown"
corpus = "NLP program is ruling the IT indutry with many applications. NLP program is used for voice controling process too"
corpus = text_preprocess(corpus)
print("after preprocessing")
print(corpus)

# We call the conditional probability dictionary builder function
conditional_probabilities = build_conditional_probabilities(corpus)
print(conditional_probabilities)

# Some sample queries to the bigram predictor
#print(bigram_next_word_predictor(conditional_probabilities, "the", "cat"))
#print(bigram_next_word_predictor(conditional_probabilities, "is", "red"))
#print(bigram_next_word_predictor(conditional_probabilities, "", "red"))
#print(bigram_next_word_predictor(conditional_probabilities, "cat", "green"))
print()
print("The Given Corpus is ")
print(corpus)
print()
print("Bigram Prediction")
print()
# Note: the corpus was lower-cased during preprocessing, so upper-case query
# words such as "NLP" are not found and return 0.0 (use "nlp" to get a match)
print("NLP & program -> ", bigram_next_word_predictor(conditional_probabilities, "NLP", "program"))
print("NLP & is -> ", bigram_next_word_predictor(conditional_probabilities, "NLP", "is"))
print("is & used -> ", bigram_next_word_predictor(conditional_probabilities, "is", "used"))

OUTPUT:

After converting to Lower cases


nlp program is ruling the it indutry with many applications. nlp program is used for voice controling
process too
------------------------------------------------
After removal of Numbers
nlp program is ruling the it indutry with many applications. nlp program is used for voice controling
process too
------------------------------------------------
After Punctuation removal
nlp program is ruling the it indutry with many applications nlp program is used for voice controling
process too
------------------------------------------------
after preprocessing
nlp program is ruling the it indutry with many applications nlp program is used for voice controling
process too
['nlp', 'program', 'is', 'ruling', 'the', 'it', 'indutry', 'with', 'many', 'applications', 'nlp', 'program', 'is', 'used',
'for', 'voice', 'controling', 'process', 'too']

defaultdict(<class 'list'>, {'nlp': ['program', 'program'], 'program': ['is', 'is'], 'is': ['ruling', 'used'], 'ruling':
['the'], 'the': ['it'], 'it': ['indutry'], 'indutry': ['with'], 'with': ['many'], 'many': ['applications'], 'applications':
['nlp'], 'used': ['for'], 'for': ['voice'], 'voice': ['controling'], 'controling': ['process'], 'process': ['too']})

defaultdict(<class 'list'>, {'nlp': {'program': 1.0}, 'program': {'is': 1.0}, 'is': {'used': 0.5, 'ruling': 0.5},
'ruling': {'the': 1.0}, 'the': {'it': 1.0}, 'it': {'indutry': 1.0}, 'indutry': {'with': 1.0}, 'with': {'many': 1.0}, 'many':
{'applications': 1.0}, 'applications': {'nlp': 1.0}, 'used': {'for': 1.0}, 'for': {'voice': 1.0}, 'voice': {'controling':
1.0}, 'controling': {'process': 1.0}, 'process': {'too': 1.0}})

The Given Corpus is


nlp program is ruling the it indutry with many applications nlp program is used for voice controling
process too

Bigram Prediction

NLP & program -> 0.0


NLP & is -> 0.0
is & used -> 0.5
---------------------------------------------------------------------------------

Exercise - 5
1. Perform sentiment analysis, classifying comments using a Bayesian analysis (a Naive Bayes sketch follows the VADER output below).

# Sentiment analysis using the VADER lexicon
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
#sia.polarity_scores("Wow, NLTK is really powerful!")
print(sia.polarity_scores("It is too bad to cut the class"))

OUTPUT:
{'compound': -0.6808, 'neg': 0.483, 'neu': 0.517, 'pos': 0.0}
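
The VADER scorer above is lexicon-based; the Bayesian classification asked for in the exercise can be sketched with NLTK's NaiveBayesClassifier. A minimal sketch on a tiny hand-labelled set of comments (the training sentences are invented purely for illustration):

# Naive Bayes sentiment classification on a toy, hand-labelled comment set
import nltk
nltk.download('punkt')

def comment_features(comment):
    # Bag-of-words features: presence of each lower-cased token
    return {word: True for word in nltk.word_tokenize(comment.lower())}

train_data = [
    ("I love this course", "pos"),
    ("The lab sessions are great", "pos"),
    ("It is too bad to cut the class", "neg"),
    ("The lecture was boring and useless", "neg"),
]
featuresets = [(comment_features(text), label) for text, label in train_data]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

print(classifier.classify(comment_features("This class is great")))
classifier.show_most_informative_features(5)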

Exercise – 6
1. Predict Corporate Credit Rating of a company/individual using its available reports and press releases.
2. Predict the market capitalization of a company using its available reports and press releases.
Exercise – 7
1. Write a Python program to predict price movements based on news using the Two Sigma dataset.
2. Write a Python program for clustering companies based on stock price movement.
Exercise – 8
1. Predict rainfall using linear regression.
2. Create clusters of players based on their strengths in order to build a well-rounded team.
Exercise – 9
1. Write a neural network from scratch in Python to classify digits from MNIST.
Exercise – 10
1. Write a Python program to classify spam mail using logistic regression.
2. Identify the key influencers in social networks.
Exercise – 11
1. Perform sentiment analysis on Twitter data.
Exercise – 12
1. Predict disease outbreaks at the community level in a particular location.
2. Write a program for exploratory data analysis using image data, such as scans, X-rays, etc.

Data Source links for required Programming


1. UCI Machine Learning Repository

2. Kaggle Datasets

3. data.gov

4. Sports Statistics Database

5. Sports Reference

6. cricsheet.org

7. Quandl

8. Quantopian

9. US Fundamentals Archive

10. MNIST

11. Twitter API

12. StockTwits API

13. Large Health Data Sets

14. data.gov/health

15. Health Nutrition and Population Statistics

Course Outcome:
After the successful completion of the course, students will be able to:

• Cleanse text data of unwanted spaces, punctuation marks, and brackets.
• Fit models to text data and interpret them.
• Visualize the data in a way that is accessible to non-technical people.

