1-NLP - Lab Manual
Course Objectives
Exercise – 1
1. Tokenize the sentence into words for further analysis (using a Python function).
import re
import string
from nltk.tokenize import word_tokenize

# Read and show the file, then normalize it (lower case, digits and
# punctuation removed; part 2 below walks through these steps)
with open('Input_text.txt') as f:
    raw = f.read()
print("The File contents")
print(raw)
text = re.sub(r'\d+', '', raw.lower()).replace('-', ' ')
pun_rem = text.translate(str.maketrans('', '', string.punctuation))
print("------------------------------------------------")
tokens = word_tokenize(pun_rem)
print("Tokens")
print(tokens)
print("------------------------------------------------")
INPUT FILE:
Input_text.txt
The last decade has seen a substantial surge in the use of finite-state methods in many areas of
natural-language processing. This is a remarkable comeback considering that in the dawn of modern
linguistics, finite-state grammars were dismissed as fundamentally inadequate. Noam Chomsky's seminal
1957 work, Syntactic Structures [3], includes a short chapter devoted to ``finite state Markov
processes'', devices that we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state language. (p. 21)
OUTPUT:
The File contents
The last decade has seen a substantial surge in the use of finite-state methods in many areas
of natural-language processing. This is a remarkable comeback considering that in the dawn of
modern linguistics, finite-state grammars were dismissed as fundamentally inadequate. Noam
Chomsky's seminal 1957 work, Syntactic Structures [3], includes a short chapter devoted to
``finite state Markov processes'', devices that we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state
language. (p. 21)
------------------------------------------------
Tokens
['the', 'last', 'decade', 'has', 'seen', 'a', 'substantial', 'surge', 'in', 'the', 'use', 'of', 'finite', 'state',
'methods', 'in', 'many', 'areas', 'of', 'natural', 'language', 'processing', 'this', 'is', 'a', 'remarkable',
'comeback', 'considering', 'that', 'in', 'the', 'dawn', 'of', 'modern', 'linguistics', 'finite', 'state',
'grammars', 'were', 'dismissed', 'as', 'fundamentally', 'inadequate', 'noam', 'chomskys', 'seminal',
'work', 'syntactic', 'structures', 'includes', 'a', 'short', 'chapter', 'devoted', 'to', 'finite', 'state',
'markov', 'processes', 'devices', 'that', 'we', 'now', 'would', 'call', 'weighted', 'finite', 'state',
'automata', 'in', 'this', 'section', 'chomsky', 'demonstrates', 'in', 'a', 'few', 'paragraphs', 'that',
'english', 'is', 'not', 'a', 'finite', 'state', 'language', 'p']
------------------------------------------------
-----------------------------------------------------------------------------
2. Normalize the text: eliminate unwanted punctuation, convert the entire document to lower case (or
upper case), expand abbreviations, convert numbers into words, and canonicalize.
# Exercise 1
import nltk
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

# Read the input file
with open('Input_text.txt') as f:
    text = f.read()
#----------------------------------------------------
text = text.lower()
print("After converting to Lower cases")
print(text)
print("------------------------------------------------")
#---------------------------------------------------
num_remo = re.sub(r'\d+', '', text)
print("After removal of Numbers")
print(num_remo)
print("------------------------------------------------")
#-------------------------------------------------
# Replace hyphens with spaces, then strip all remaining punctuation
translator = str.maketrans('', '', string.punctuation)
text = num_remo.replace('-', ' ')
pun_rem = text.translate(translator)
print("After Punctuation removal")
print(pun_rem)
print("------------------------------------------------")
tokens = word_tokenize(pun_rem)
print("Tokens")
print(tokens)
print("------------------------------------------------")
#--------------------------------------------------
# Lemmatize twice: first as nouns (the default), then as verbs
lemmatizer = WordNetLemmatizer()
lemmatized_output = [lemmatizer.lemmatize(w) for w in tokens]
lemmatized_output_1 = [lemmatizer.lemmatize(w, 'v') for w in lemmatized_output]
print("After Lemmatizing ")
print(lemmatized_output_1)
print("------------------------------------------------")
#------------------------------------------
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in lemmatized_output_1 if w not in stop_words]
print("After Removing the stop words")
print(filtered_sentence)
print("------------------------------------------------")
text = pos_tag(filtered_sentence)
print("After PoS attachment:", text)
INPUT FILE:
Input_text.txt
The last decade has seen a substantial surge in the use of finite-state methods in many areas of
natural-language processing. This is a remarkable comeback considering that in the dawn of modern
linguistics, finite-state grammars were dismissed as fundamentally inadequate. Noam Chomsky's seminal
1957 work, Syntactic Structures [3], includes a short chapter devoted to ``finite state Markov
processes'', devices that we now would call weighted finite-state automata.
In this section Chomsky demonstrates in a few paragraphs that English is not a finite state language. (p. 21)
OUTPUT:
After converting to Lower cases
the last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-
language processing. this is a remarkable comeback considering that in the dawn of modern linguistics,
finite-state grammars were dismissed as fundamentally inadequate. noam chomsky's seminal 1957 work,
syntactic structures [3], includes a short chapter devoted to ``finite state markov processes'', devices that
we now would call weighted finite-state automata.
in this section chomsky demonstrates in a few paragraphs that english is not a finite state language. (p. 21)
------------------------------------------------
After removal of Numbers
the last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-
language processing. this is a remarkable comeback considering that in the dawn of modern linguistics,
finite-state grammars were dismissed as fundamentally inadequate. noam chomsky's seminal work,
syntactic structures [], includes a short chapter devoted to ``finite state markov processes'', devices that we
now would call weighted finite-state automata.
in this section chomsky demonstrates in a few paragraphs that english is not a finite state language. (p. )
------------------------------------------------
After Punctuation removal
the last decade has seen a substantial surge in the use of finite state methods in many areas of natural
language processing this is a remarkable comeback considering that in the dawn of modern linguistics finite
state grammars were dismissed as fundamentally inadequate noam chomskys seminal work syntactic
structures includes a short chapter devoted to finite state markov processes devices that we now would
call weighted finite state automata
in this section chomsky demonstrates in a few paragraphs that english is not a finite state language p
------------------------------------------------
Tokens
['the', 'last', 'decade', 'has', 'seen', 'a', 'substantial', 'surge', 'in', 'the', 'use', 'of', 'finite', 'state', 'methods', 'in',
'many', 'areas', 'of', 'natural', 'language', 'processing', 'this', 'is', 'a', 'remarkable', 'comeback', 'considering',
'that', 'in', 'the', 'dawn', 'of', 'modern', 'linguistics', 'finite', 'state', 'grammars', 'were', 'dismissed', 'as',
'fundamentally', 'inadequate', 'noam', 'chomskys', 'seminal', 'work', 'syntactic', 'structures', 'includes', 'a',
'short', 'chapter', 'devoted', 'to', 'finite', 'state', 'markov', 'processes', 'devices', 'that', 'we', 'now', 'would', 'call',
'weighted', 'finite', 'state', 'automata', 'in', 'this', 'section', 'chomsky', 'demonstrates', 'in', 'a', 'few',
'paragraphs', 'that', 'english', 'is', 'not', 'a', 'finite', 'state', 'language', 'p']
------------------------------------------------
After Lemmatizing
['the', 'last', 'decade', 'ha', 'see', 'a', 'substantial', 'surge', 'in', 'the', 'use', 'of', 'finite', 'state', 'method', 'in',
'many', 'area', 'of', 'natural', 'language', 'process', 'this', 'be', 'a', 'remarkable', 'comeback', 'consider', 'that',
'in', 'the', 'dawn', 'of', 'modern', 'linguistics', 'finite', 'state', 'grammar', 'be', 'dismiss', 'a', 'fundamentally',
'inadequate', 'noam', 'chomsky', 'seminal', 'work', 'syntactic', 'structure', 'include', 'a', 'short', 'chapter',
'devote', 'to', 'finite', 'state', 'markov', 'process', 'device', 'that', 'we', 'now', 'would', 'call', 'weight', 'finite', 'state',
'automaton', 'in', 'this', 'section', 'chomsky', 'demonstrate', 'in', 'a', 'few', 'paragraph', 'that', 'english', 'be', 'not',
'a', 'finite', 'state', 'language', 'p']
------------------------------------------------
After Removing the stop words
['last', 'decade', 'ha', 'see', 'substantial', 'surge', 'use', 'finite', 'state', 'method', 'many', 'area', 'natural',
'language', 'process', 'remarkable', 'comeback', 'consider', 'dawn', 'modern', 'linguistics', 'finite', 'state',
'grammar', 'dismiss', 'fundamentally', 'inadequate', 'noam', 'chomsky', 'seminal', 'work', 'syntactic', 'structure',
'include', 'short', 'chapter', 'devote', 'finite', 'state', 'markov', 'process', 'device', 'would', 'call', 'weight', 'finite',
'state', 'automaton', 'section', 'chomsky', 'demonstrate', 'paragraph', 'english', 'finite', 'state', 'language', 'p']
------------------------------------------------
After PoS attachment: [('last', 'JJ'), ('decade', 'NN'), ('ha', 'VBD'), ('see', 'VBP'), ('substantial', 'JJ'), ('surge',
'NN'), ('use', 'NN'), ('finite', 'JJ'), ('state', 'NN'), ('method', 'VBD'), ('many', 'JJ'), ('area', 'NN'), ('natural', 'JJ'),
('language', 'NN'), ('process', 'NN'), ('remarkable', 'JJ'), ('comeback', 'NN'), ('consider', 'NN'), ('dawn', 'NN'),
('modern', 'JJ'), ('linguistics', 'NNS'), ('finite', 'JJ'), ('state', 'NN'), ('grammar', 'NN'), ('dismiss', 'NN'),
('fundamentally', 'RB'), ('inadequate', 'JJ'), ('noam', 'NNS'), ('chomsky', 'VBP'), ('seminal', 'JJ'), ('work', 'NN'),
('syntactic', 'JJ'), ('structure', 'NN'), ('include', 'VBP'), ('short', 'JJ'), ('chapter', 'NN'), ('devote', 'NN'), ('finite',
'JJ'), ('state', 'NN'), ('markov', 'NN'), ('process', 'NN'), ('device', 'NN'), ('would', 'MD'), ('call', 'VB'), ('weight',
'NN'), ('finite', 'NN'), ('state', 'NN'), ('automaton', 'NN'), ('section', 'NN'), ('chomsky', 'NN'), ('demonstrate',
'NN'), ('paragraph', 'NN'), ('english', 'JJ'), ('finite', 'NN'), ('state', 'NN'), ('language', 'NN'), ('p', 'NN')]
-------------------------------------------------------------------------------------------------------------------
Exercise – 2
1. Apply similarity measures using the Jaccard coefficient or the Tanimoto coefficient.
import re
import string
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def text_preprocess(text):
    # Lower-case, strip digits and punctuation, tokenize, lemmatize
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    translator = str.maketrans('', '', string.punctuation)
    text = text.replace('-', ' ')
    text = text.translate(translator)
    tokens = set(word_tokenize(text))
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]       # noun lemmas
    tokens = [lemmatizer.lemmatize(w, 'v') for w in tokens]  # verb lemmas
    return set(tokens)
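The similarity computation itself is not shown in the listing. The coefficients in the output below match NLTK's jaccard_distance, which returns 1 minus |intersection| / |union| (so identical sets score 0.0). A minimal sketch; the actual input sentences are not shown in the manual, so the pair here is an assumption:

from nltk.metrics.distance import jaccard_distance

# Hypothetical input pair chosen to mirror the first output sets below
s1 = text_preprocess("We are learning NLP using Python")
s2 = text_preprocess("We are learning NLP using Python")
print(s1)
print(s2)
print(jaccard_distance(s1, s2))  # 0.0 for identical sets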
OUTPUT:
[nltk_data] Downloading package wordnet to
[nltk_data] C:\Users\Universal\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
0.0
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
{'simple', 'python', 'nlp', 'be', 'under'}
0.625
{'we', 'be', 'python', 'nlp', 'learn', 'use'}
{'be', 'python', 'for', 'nlp', 'people', 'use', 'chatbot', 'in'}
0.6
-----------------------------------------------------------------------------------
import itertools
import numpy as np

# H matrix construction (Smith-Waterman local alignment scoring)
def matrix(a, b, match_score=3, gap_cost=2):
    H = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    for i, j in itertools.product(range(1, H.shape[0]), range(1, H.shape[1])):
        match = H[i - 1, j - 1] + (match_score if a[i - 1] == b[j - 1] else -match_score)
        delete = H[i - 1, j] - gap_cost
        insert = H[i, j - 1] - gap_cost
        H[i, j] = max(match, delete, insert, 0)
    return H
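# The traceback helper called below is not shown in the manual; this is
# a minimal sketch (an assumption, not the manual's original code) that
# walks back from the highest-scoring cell of H to recover the best
# local match within b and its end position.
def traceback(H, b, b_='', old_i=0):
    # Locate the maximum of H, searching from the bottom-right corner
    H_flip = np.flip(np.flip(H, 0), 1)
    i_, j_ = np.unravel_index(H_flip.argmax(), H_flip.shape)
    i, j = np.subtract(H.shape, (i_ + 1, j_ + 1))
    if H[i, j] == 0:  # a zero cell marks the start of the local alignment
        return b_, j
    # Insert a gap marker when the row index jumps by more than one
    b_ = b[j - 1] + '-' + b_ if old_i - i > 1 else b[j - 1] + b_
    return traceback(H[0:i, 0:j], b, b_, i)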
a, b = 'rain', 'shine'
print("Input Strings are", a, "&", b)
H = matrix(a, b)
print(H)
print(traceback(H, b))  # ('in', 2)
OUTPUT:
Input Strings are rain & shine
[[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 3 1 0]
[0 0 0 1 6 4]]
('in', 2)
---------------------------------------------------------------------
Exercise – 3
1. For the given data, find the maximum word frequency and report the most frequently occurring words.
2. Visualize the given text data with appropriate visual techniques.
# Frequency distribution over the preprocessed tokens from Exercise 1
data_analysis = nltk.FreqDist(filtered_sentence)
print("\n\nWord Frequency for the pre processed text")
data_analysis.plot(25, cumulative=False)
# Keep only the words whose frequency is greater than or equal to 2
filter_words = dict([(m, n) for m, n in data_analysis.items() if n >= 2])
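For part 2, a word cloud is another common view of the same frequency data, alongside the FreqDist plot above. A minimal sketch using the third-party wordcloud package (an assumption; the manual only shows the frequency plot):

from wordcloud import WordCloud  # third-party: pip install wordcloud
import matplotlib.pyplot as plt

# Render the frequency distribution computed above as a word cloud
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(data_analysis)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()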
OUTPUT:
area -> 1; automaton -> 1; call -> 1; chapter -> 1; chomsky -> 2; comeback -> 1; consider -> 1; dawn
-> 1; decade -> 1; demonstrate -> 1; device -> 1; devote -> 1; dismiss -> 1; english -> 1; finite -> 5;
fundamentally -> 1; grammar -> 1; ha -> 1; inadequate -> 1; include -> 1; language -> 2; last -> 1;
linguistics -> 1; many -> 1; markov -> 1; method -> 1; modern -> 1; natural -> 1; noam -> 1; p -> 1;
paragraph -> 1; process -> 2; remarkable -> 1; section -> 1; see -> 1; seminal -> 1; short -> 1; state -
> 5; structure -> 1; substantial -> 1; surge -> 1; syntactic -> 1; use -> 1; weight -> 1; work -> 1; would
-> 1;
Word Frequency >= 2
chomsky: 2
finite: 5
language: 2
process: 2
state: 5
Exercise - 4
1. Develop a back-off mechanism for Maximum Likelihood Estimate (MLE)
def text_preprocess(corpus):
    text = corpus.lower()
    print("After converting to Lower cases")
    print(text)
    print("------------------------------------------------")
    #---------------------------------------------------
    num_remo = re.sub(r'\d+', '', text)
    print("After removal of Numbers")
    print(num_remo)
    print("------------------------------------------------")
    #-------------------------------------------------
    translator = str.maketrans('', '', string.punctuation)
    text = num_remo.replace('-', ' ')
    pun_rem = text.translate(translator)
    print("After Punctuation removal")
    print(pun_rem)
    print("------------------------------------------------")
    return pun_rem
from collections import defaultdict

def build_conditional_probabilities(corpus):
    """
    Takes a corpus string (words separated by spaces) and returns a 2D
    dictionary of probabilities P(next|current) of seeing a word "next"
    conditionally on seeing a word "current".
    """
    tokenized_string = corpus.split()
    print(tokenized_string)
    # First pass: map each word to the list of words that follow it
    dictionary = defaultdict(list)
    for current, following in zip(tokenized_string, tokenized_string[1:]):
        dictionary[current].append(following)
    print(dictionary)
    # Second pass: convert each successor list into relative frequencies
    probabilities = defaultdict(dict)
    for current, successors in dictionary.items():
        for nxt in set(successors):
            probabilities[current][nxt] = successors.count(nxt) / len(successors)
    return probabilities
# Lookup helper (the enclosing def line is missing from the manual;
# the function name here is assumed)
def bigram_probability(conditional_probabilities, current, next_candidate):
    if current in conditional_probabilities:
        if next_candidate in conditional_probabilities[current]:
            return conditional_probabilities[current][next_candidate]
    # If the current -> next_candidate pair has not been observed in the
    # corpus, the corresponding dictionary keys will not be defined;
    # we return a probability of 0.0
    return 0.0
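The exercise calls for a back-off mechanism, and the 0.0 branch above is exactly where it plugs in. A minimal stupid-backoff style sketch, assuming a unigram Counter built from the corpus (the helper name and the alpha value are assumptions, not the manual's code):

from collections import Counter

def backoff_probability(conditional_probabilities, unigram_counts,
                        current, next_candidate, alpha=0.4):
    # Use the bigram MLE when the pair was observed; otherwise back off
    # to a discounted unigram MLE of the candidate word
    p = bigram_probability(conditional_probabilities, current, next_candidate)
    if p > 0.0:
        return p
    return alpha * unigram_counts[next_candidate] / sum(unigram_counts.values())

# Usage sketch:
# tokens = text_preprocess(corpus).split()
# unigram_counts = Counter(tokens)
# backoff_probability(probabilities, unigram_counts, 'nlp', 'program')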
OUTPUT:
defaultdict(<class 'list'>, {'nlp': ['program', 'program'], 'program': ['is', 'is'], 'is': ['ruling', 'used'], 'ruling':
['the'], 'the': ['it'], 'it': ['indutry'], 'indutry': ['with'], 'with': ['many'], 'many': ['applications'], 'applications':
['nlp'], 'used': ['for'], 'for': ['voice'], 'voice': ['controling'], 'controling': ['process'], 'process': ['too']})
defaultdict(<class 'list'>, {'nlp': {'program': 1.0}, 'program': {'is': 1.0}, 'is': {'used': 0.5, 'ruling': 0.5},
'ruling': {'the': 1.0}, 'the': {'it': 1.0}, 'it': {'indutry': 1.0}, 'indutry': {'with': 1.0}, 'with': {'many': 1.0}, 'many':
{'applications': 1.0}, 'applications': {'nlp': 1.0}, 'used': {'for': 1.0}, 'for': {'voice': 1.0}, 'voice': {'controling':
1.0}, 'controling': {'process': 1.0}, 'process': {'too': 1.0}})
Bigram Prediction
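The prediction code under this heading is not shown in the manual; a minimal sketch (function name assumed) that returns the most probable successor from the table above:

def predict_next_word(conditional_probabilities, current):
    # Return the most probable successor of `current`, or None if the
    # word never occurred in the corpus
    successors = conditional_probabilities.get(current)
    if not successors:
        return None
    return max(successors, key=successors.get)

# e.g. predict_next_word(probabilities, 'nlp') -> 'program'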
Exercise - 5
1. Perform sentiment analysis, classifying comments using Bayesian analysis.
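No code listing survives for this exercise. The scores in the output below match NLTK's VADER SentimentIntensityAnalyzer (a lexicon/rule-based scorer rather than a Bayesian classifier), so a minimal sketch consistent with that output could be the following; the input comment is an assumption:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# polarity_scores returns neg/neu/pos proportions plus a compound
# score in [-1, 1]
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This movie was really bad and boring"))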
OUTPUT:
{'compound': -0.6808, 'neg': 0.483, 'neu': 0.517, 'pos': 0.0}
Exercise – 6
1. Predict the corporate credit rating of a company or individual using its available reports and press releases.
2. Predict the market capitalization of a company using its available reports and press releases.
Exercise – 7
1. Write a Python program to predict price movements based on news, using the Two Sigma dataset.
2. Write a Python program to cluster companies based on stock-price movement.
Exercise – 8
1. Predict rainfall using linear regression.
2. Create clusters of players based on their strengths in order to build a well-rounded team.
Exercise – 9
1. Write a neural network from scratch in Python to classify digits from MNIST.
Exercise – 10
1. Write a Python program to classify spam mail using logistic regression.
2. Identify the key influencers in social networks.
Exercise – 11
1. Perform sentiment analysis on Twitter data.
Exercise – 12
1. Predict disease outbreaks at the community level in a particular location.
2. Write a program for exploratory data analysis of image data, such as scans and X-rays.
Datasets:
2. Kaggle Datasets
3. data.gov
5. Sports Reference
6. cricsheet.org
7. Quandl
8. Quantopian
9. US Fundamentals Archive
10. MNIST
14. data.gov/health
Course Outcome:
After the successful completion of the course, students will be able to:
1. Cleanse text data of unwanted spaces, punctuation marks, and brackets.
2. Fit models to text data and interpret them.
3. Visualize data effectively for non-technical audiences.