
SHUBHAM JADE

MSc IT
31031420010
NLP Practical Journal
Index

No.  Title                                                   Teacher's Sign

1    Generating Root Words
2    Sentence and Word Tokenization
3    Part of Speech Tagging
4    Generating Parse Tree using Chunk Parser
5    Finding Term Frequency and Inverse Document Frequency
6    Removing Stop Words
7    Using probabilistic model to predict the next word
8    Word Similarity
9    Named Entity Recognition
10   Using Synset and Wordnet database


Practical 1
Implement Python code that generates the root words in the given sentences.

Stemming Example 1
from nltk.stem import PorterStemmer
ps = PorterStemmer()
text4 = "I am a Student of Somaiya University.".split()
print(text4)
for w in text4:
    rootWord = ps.stem(w)
    print(rootWord)

Stemming Example 2
words=["Unexpected", "disagreement", "disagree", "agreement",
"quirkiness", "historical", "canonical", "happiness", "unkind",
"dogs", "expected"]
for w in words:
stemPrint=ps.stem(w)
print(w,” -Stem- ”,stemPrint)
Lemmatization Example 1
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text5 = "I am Studying in Part 2."
tokenization = nltk.word_tokenize(text5)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

Lemmatization Example 2
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
words2 = ["Unexpected", "disagreement", "disagree", "agreement",
          "quirkiness", "historical", "canonical", "happiness", "unkind",
          "dogs", "expected", "studies", "cries", "applies"]
for w in words2:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))
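By default, WordNetLemmatizer looks every word up as a noun, which is why verb forms such as "Studying" come back unchanged. A minimal sketch (assuming the wordnet corpus has already been downloaded) that passes an explicit part-of-speech tag:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# pos='v' makes the lemmatizer treat the word as a verb
print(lemmatizer.lemmatize("studying", pos='v'))   # study
print(lemmatizer.lemmatize("cries", pos='v'))      # cry
print(lemmatizer.lemmatize("studying"))            # noun lookup: unchanged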
Practical 2
Implement a Python program that splits a sentence into words and displays both the split words and the word count, using the tokenizer function.

#Word Tokenization using split() function


ExText="I am a Student of Somaiya University."
SplitText=ExText.split()
print(SplitText)
print("The number of words in given sentence are " + str(len(SplitText)))
#Sentence Tokenization using split() function
ExText="I am a Student. My College is Somaiya University."
SplitText=ExText.split('.')
print(SplitText)
print("The number of sentences in given text are " + str(len(SplitText)))
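Note that splitting on '.' leaves an empty string after the final full stop, so the count above is one more than the number of real sentences. A small sketch that drops the empty pieces before counting:

ExText = "I am a Student. My College is Somaiya University."
sentences = [s.strip() for s in ExText.split('.') if s.strip()]
print(sentences)
print("The number of sentences in given text are " + str(len(sentences)))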
#Using Sent Tokenizer and word Tokenizer Modules
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag_sents
from nltk.tokenize import word_tokenize, sent_tokenize

#Assign Example Text


ExText = ('Natural language processing (NLP) refers to the branch of computer '
          'science—and more specifically. The branch of artificial intelligence or '
          'AI—concerned with giving computers the ability to understand text and '
          'spoken words in much the same way human beings can.')
#Sentence Tokenization
text_sentence_tokens = sent_tokenize(ExText)
print(text_sentence_tokens)
#Word Tokenization
text_word_tokens = []
for sentence_token in text_sentence_tokens:
    text_word_tokens.append(word_tokenize(sentence_token))
print(text_word_tokens)
#POS Tag Word Tokens
text_tagged = pos_tag_sents(text_word_tokens)
print (text_tagged)
#Tokenizing a Contraction
word_tokenize("can't")
# Output: ['ca', "n't"]
#TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('Hello World.')
#WordPunctTokenizer
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Can't is a contraction.")
#RegexpTokenizer
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer("[\w']+")
tokenizer.tokenize("Can't is a contraction.")
#RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
tokenizer.tokenize("Can't is a contraction.")
#Tokenizing webtext corpus Text - overheard.txt
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('overheard.txt')
sent_tokenizer = PunktSentenceTokenizer(text)
sents1 = sent_tokenizer.tokenize(text)
#Display only first two sentences of sent1
sents1[0:2]

#sent at index 500


sents1[500]
#Tokenize Encoded Text
with open('/root/nltk_data/corpora/webtext/overheard.txt',
          encoding='ISO-8859-2') as f:
    text = f.read()
sent_tokenizer = PunktSentenceTokenizer(text)
sents = sent_tokenizer.tokenize(text)
sents[0]
Practical 3
Write a Python program to read a paragraph and generate tokens from it using the sentence tokenizer. Also find the part of speech for each word in the generated tokens.

#Assign Example Text


ExText = ('Natural language processing (NLP) refers to the branch of computer '
          'science—and more specifically. The branch of artificial intelligence or '
          'AI—concerned with giving computers the ability to understand text and '
          'spoken words in much the same way human beings can.')
#Sentence Tokenization
text_sentence_tokens = sent_tokenize(ExText)
print(text_sentence_tokens)
#POS Tag the Sentences
SentTokens=nltk.sent_tokenize(ExText)
print(nltk.pos_tag(SentTokens))
#Word Tokenization

text_word_tokens = []
for sentence_token in text_sentence_tokens:
    text_word_tokens.append(word_tokenize(sentence_token))
print(text_word_tokens)
#POS Tag Word Tokens
text_tagged = pos_tag_sents(text_word_tokens)
print (text_tagged)
#Default tagging
from nltk.tag import DefaultTagger
tagger = DefaultTagger('NN')
tagger.tag(['Hello', 'World'])
#Evaluating Accuracy
import nltk
nltk.download('treebank')
from nltk.corpus import treebank
test_sents = treebank.tagged_sents()[3000:]
tagger.evaluate(test_sents)
#Tagging Sentence
tagger.tag_sents([['Hello', 'world', '.'], ['How', 'are', 'you','?']])
#Untagging a tagged sentence
from nltk.tag import untag

untag([('Hello', 'NN'), ('World', 'NN')])


untag([('Hello', 'DD'), ('World', 'DD')])
untag([('Hello', 'NN'), ('World', 'JJ')])
#Regular Expression Tagger
import nltk
nltk.download('brown')
from nltk.corpus import brown
from nltk.tag import RegexpTagger
test_sent = brown.sents(categories='news')[0]
regexp_tagger = RegexpTagger([
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'(The|the|A|a|An|an)$', 'AT'),  # articles
    (r'.*able$', 'JJ'),               # adjectives
    (r'.*ness$', 'NN'),               # nouns formed from adjectives
    (r'.*ly$', 'RB'),                 # adverbs
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # past tense verbs
    (r'.*', 'NN')                     # nouns (default)
])
print(regexp_tagger)
print(regexp_tagger.tag(test_sent))
#Tagging a plain string (split into tokens first) and a token list
sent_str = "asd40 500 running ended"
str1 = ['asd40', '500', 'running', 'ended']
print(regexp_tagger.tag(sent_str.split()))
print(regexp_tagger.tag(str1))
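The DefaultTagger used earlier also works as a backoff for a trained tagger. A minimal sketch (reusing the treebank corpus downloaded above) that trains a UnigramTagger with an 'NN' backoff and checks its accuracy:

from nltk.tag import UnigramTagger, DefaultTagger
from nltk.corpus import treebank

train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]
uni_tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
print(uni_tagger.tag(['Hello', 'world', '.']))
print(uni_tagger.evaluate(test_sents))  # accuracy on held-out sentences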
Practical 4
Draw a parse tree in Python for a given sentence under the required grammar rules, using chunk parsing.

grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)

Ex 2
import nltk
from nltk.parse import RecursiveDescentParser
Prod_rule=nltk.CFG.fromstring("""
S -> NP VP
NP -> N
NP -> Det N
VP -> V NP
VP -> V
N -> 'Person Name' | 'He' | 'She' | 'Boy' | 'Girl' | 'It' | 'cricket' | 'song' | 'book'
V -> 'likes' | 'reads' | 'sings'
""")
sent='He likes cricket'
sent1=sent.split()
sent1
parser = nltk.RecursiveDescentParser(Prod_rule)
parser
for t in parser.parse(sent1):
    print(t)

Ex 3
Simple_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
sent
parser = nltk.ChartParser(Simple_grammar)
for tree in parser.parse(sent):
    print(tree)
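The three examples above use full CFG parsers; since the practical title names chunk parsing, here is a minimal sketch (sentence and chunk rule invented for illustration) that builds a shallow parse tree with nltk.RegexpParser over POS-tagged tokens:

import nltk

sentence = "The little dog saw a cat in the park"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# NP chunk = optional determiner, any number of adjectives, then a noun
chunk_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
chunk_tree = chunk_parser.parse(tagged)
print(chunk_tree)
# chunk_tree.draw()  # opens the parse tree in a window (needs a GUI)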
Practical 5
Write a python code to find the term frequency and inverse document frequency for three
documents. (Consider 3 documents as 3 paragraphs)

Ex 1
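Note: Ex 1 reuses the imports (CountVectorizer, TfidfVectorizer, pandas) and the preprocessed txt1 list that are set up in Ex 2 below.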
# Getting bigrams
vectorizer = CountVectorizer(ngram_range =(2, 2))
X1 = vectorizer.fit_transform(txt1)
features = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
print("\n\nX1 : \n", X1.toarray())
# Applying TFIDF
# You can still get n-grams here

vectorizer = TfidfVectorizer(ngram_range = (2, 2))


X2 = vectorizer.fit_transform(txt1)
scores = (X2.toarray())
print("\n\nScores : \n", scores)
# Getting top ranking features
sums = X2.sum(axis = 0)
data1 = []
for col, term in enumerate(features):
    data1.append((term, sums[0, col]))
ranking = pd.DataFrame(data1, columns = ['term', 'rank'])
words = (ranking.sort_values('rank', ascending = False))
print ("\n\nWords : \n", words.head(7))

Ex 2
# Importing libraries
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd

# Input the file


txt1 = []
with open('C:\\Users\\DELL\\Desktop\\MachineLearning1.txt') as file:
    txt1 = file.readlines()
# Preprocessing
def remove_string_special_characters(s):
    # Remove special characters
    stripped = re.sub(r'[^a-zA-Z\s]', '', s)
    stripped = re.sub('_', '', stripped)
    # Change any white space to one space
    stripped = re.sub(r'\s+', ' ', stripped)
    # Remove start and end white spaces
    stripped = stripped.strip()
    if stripped != '':
        return stripped.lower()
# Stopword removal
stop_words = set(stopwords.words('english'))
your_list = ['skills', 'ability', 'job', 'description']
for i, line in enumerate(txt1):
    txt1[i] = ' '.join([x for x in nltk.word_tokenize(line)
                        if (x not in stop_words) and (x not in your_list)])
# Getting trigrams
vectorizer = CountVectorizer(ngram_range = (3,3))
X1 = vectorizer.fit_transform(txt1)
features = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
print("\n\nFeatures : \n", features)
print("\n\nX1 : \n", X1.toarray())
# Applying TFIDF
vectorizer = TfidfVectorizer(ngram_range = (3,3))
X2 = vectorizer.fit_transform(txt1)
scores = (X2.toarray())
print("\n\nScores : \n", scores)
# Getting top ranking features
sums = X2.sum(axis = 0)
data1 = []
for col, term in enumerate(features):
    data1.append((term, sums[0, col]))
ranking = pd.DataFrame(data1, columns = ['term','rank'])
words = (ranking.sort_values('rank', ascending = False))
print ("\n\nWords head : \n", words.head(7))
Practical 6
Implement Python code to remove stop words and identify the part of speech for a given paragraph.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stopwords.fileids()
stopwords.words('english')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = """This is a sample sentence,
showing off the stop words filtration."""
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
# Option 1: list comprehension (case-insensitive check)
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]
# Option 2: explicit loop (case-sensitive check); this overwrites the list built above
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
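The practical also asks for the part of speech of the remaining words. A short sketch (assuming the averaged_perceptron_tagger resource is available) that tags the filtered tokens:

import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

# POS-tag the tokens that survived stop-word removal
print(pos_tag(filtered_sentence))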
Practical 7
Find the probability of a given sentence, where all the words in the sentence appear in toy_pcfg1 or toy_pcfg2, using Viterbi PCFG parsing.

#Sentence Generation
import itertools
from nltk.grammar import CFG
from nltk.parse import generate
demo_grammar = """
S -> NP VP
NP -> Det N
PP -> P NP
VP -> 'slept' | 'saw' NP | 'walked' PP
Det -> 'the' | 'a'
N -> 'man' | 'park' | 'dog'
P -> 'in' | 'with'
"""
grammar = CFG.fromstring(demo_grammar)
for n, sent in enumerate(generate.generate(grammar, n=10), 1):
    print('%3d. %s' % (n, ' '.join(sent)))

Ex 2
from nltk.grammar import Nonterminal
from nltk.grammar import toy_pcfg2
from nltk.probability import DictionaryProbDist
productions = toy_pcfg2.productions()
# Get all productions with LHS=NP
np_productions = toy_pcfg2.productions(Nonterminal('NP'))
prob_dict = {}
for pr in np_productions:
    prob_dict[pr.rhs()] = pr.prob()
np_probDist = DictionaryProbDist(prob_dict)
# Each call to generate() draws a random sample, e.g.:
print(np_probDist.generate())  # (Det, N)
print(np_probDist.generate())  # (Name,)
print(np_probDist.generate())  # (Name,)
# Exercise: pcfg_generate(grammar) -- return a tree sampled from the language
# described by the PCFG grammar.
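For the parsing part of the practical, a minimal sketch of Viterbi PCFG parsing (the sentence is chosen so that, as in recent NLTK versions, every word appears in toy_pcfg2's lexicon):

import nltk
from nltk.grammar import toy_pcfg2
from nltk.parse import ViterbiParser

sent = "Jack saw the telescope".split()
parser = ViterbiParser(toy_pcfg2)
for tree in parser.parse(sent):
    # Each result is a ProbabilisticTree; prob() gives the parse probability
    print(tree)
    print("Probability:", tree.prob())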
Practical 8
Given two words, calculate the similarity between the words
a. By using path similarity.
b. By using Wu-Palmer Similarity.

#Synsets
from nltk.corpus import wordnet
syn1 = wordnet.synsets('hello')[0]
syn2 = wordnet.synsets('selling')[0]
print ("hello name : ", syn1.name())
print ("selling name : ", syn2.name())
a. By using path similarity.
ref = syn1.hypernyms()[0]
print ("Self comparison : ",
syn1.shortest_path_distance(ref))
print ("Distance of hello from greeting : ",
syn1.shortest_path_distance(syn2))
print ("Distance of greeting from hello : ",
syn2.shortest_path_distance(syn1))
b. By using Wu-Palmer Similarity.
syn1.wup_similarity(syn2)
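shortest_path_distance counts edges in the WordNet hierarchy; the path similarity named in part (a) is the related score in (0, 1] that path_similarity returns. A short sketch on the same two synsets:

# Path similarity = 1 / (shortest_path_distance + 1)
print("Path similarity of hello and selling : ", syn1.path_similarity(syn2))
# Wu-Palmer similarity for comparison
print("Wu-Palmer similarity of hello and selling : ", syn1.wup_similarity(syn2))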
Practical 9
Consider a sentence and do the following.
a. Import the libraries.
b. Then apply word tokenization and Part-Of-Speech tagging to the sentence.
c. Create a chunk parser and test it on the sentence.
d. Identify nationalities or religions or political groups, organization, date and money in the
given sentence.
(Select sentence appropriately)

● Import the libraries.


import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
● Then apply word tokenization and Part-Of-Speech tagging to the sentence.
sent= '''Prime Minister Jacinda Ardern has claimed that New Zealand had
won a big battle over the spread of coronavirus. Her words came as the
country begins to exit from its lockdown.'''
words= word_tokenize(sent)
postags=pos_tag(words)
postags
● Create a chunk parser and test it on the sentence.
nltk.download('maxent_ne_chunker')
nltk.download('words')
ne_tree = nltk.ne_chunk(postags,binary=False)
print(ne_tree)
● Identify nationalities or religions or political groups, organization, date and money in the given sentence. (Select sentence appropriately)
locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]
query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']
print(query)
from nltk.chunk import tree2conlltags
#from pprint import pprint
iob_tagged = tree2conlltags(ne_tree)
print(iob_tagged)
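The locs list above is a hard-coded example of relation querying. To pull the entity types out of the sentence itself, a small sketch that walks the chunk tree from ne_chunk and groups the words under each named-entity label (labels such as PERSON, GPE and ORGANIZATION):

# Collect (label, entity text) pairs from the named-entity chunk tree
entities = []
for subtree in ne_tree:
    if hasattr(subtree, 'label'):
        entity = " ".join(word for word, tag in subtree.leaves())
        entities.append((subtree.label(), entity))
print(entities)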
Practical 10
Write down the syntax for the following:
a. Import wordnet, use the term “hello” to find synsets.
b. Using Synset, find the element in the 0th index, just the word (using lemmas).
c. Name, Definition of that first (0th index) Synset and examples of the word.
d. Discern synonyms and antonyms in synset.
e. Discern Hypernyms and Hyponyms in Synset.

#Working with wordnet and synset


import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
syn = wordnet.synsets('hello')[0]
syn.name()
syn.definition()
2. Using Synset, find the element in the 0th index, just the word (using lemmas).
# Just the word (syn is already the 0th synset, so index its lemmas directly):
print(syn.lemmas()[0].name())
3. Name, Definition of that first (0th index) Synset and examples of the word.
# Examples of the word in use in sentences:
print(syn.examples())
4. Discern synonyms and antonyms in synset.

import nltk
from nltk.corpus import wordnet
synonyms = []
antonyms = []
for syn in wordnet.synsets("good"):
for l in syn.lemmas():
synonyms.append(l.name())
if l.antonyms():
antonyms.append(l.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
5. Discern Hypernyms and Hyponyms in Synset.
# Re-select the first 'hello' synset (the loop above reassigned syn)
syn = wordnet.synsets('hello')[0]
#hypernym of synset
syn.hypernyms()
#Similar synsets
syn.hypernyms()[0].hyponyms()
#Tree path of synset
syn.hypernym_paths()

#POS of synset
syn.pos()
len(wordnet.synsets('great'))
len(wordnet.synsets('great', pos='n'))
len(wordnet.synsets('great', pos='a'))
f. Compare the similarity index of any two words
import nltk
from nltk.corpus import wordnet
# Compare the verbs "run" and "sprint":
w1 = wordnet.synset('run.v.01') # v here denotes the tag verb
w2 = wordnet.synset('sprint.v.01')
print(w1.wup_similarity(w2))
