
text-processing

March 24, 2024

[1]: import nltk

#tokenizing
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer

#stopwords
from nltk.corpus import stopwords

#regexp
import re

# pandas dataframe
import pandas as pd

#import count vectorizer


from sklearn.feature_extraction.text import CountVectorizer

[2]: nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

[2]: True
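Calling nltk.download() with no arguments opens the interactive downloader shown above. If a
non-interactive setup is preferred, the individual resources used later in this notebook can be
fetched directly; a minimal sketch (the resource names are standard NLTK data packages, and
'omw-1.4' is only needed by newer NLTK releases for the WordNet lemmatizer):

[ ]: #download just the data packages used below:
#'book' bundles the example texts, 'punkt' backs word/sentence tokenization,
#'stopwords' and 'wordnet' back the stopword list and the lemmatizer
for resource in ['book', 'punkt', 'stopwords', 'wordnet', 'omw-1.4']:
    nltk.download(resource)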

[3]: #load the data used in the book examples into the Python environment:

from nltk.book import *

*** Introductory Examples for the NLTK Book ***


Loading text1, …, text9 and sent1, …, sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
This command loaded 9 of the text examples available from the corpora package (only
a small selection of them). It assigned the variable names text1 through text9 to
these examples. If you type one of these variable names, you get a short description
of the text.
[4]: text1

[4]: <Text: Moby Dick by Herman Melville 1851>

Note that the first sentence of the book Moby Dick is “Call me Ishmael.” and that
this sentence has already been separated into tokens in the variable sent1.
[5]: #The variables sent1 through sent9 have been set to be a list of tokens of the first sentence of each text.

sent1

[5]: ['Call', 'me', 'Ishmael', '.']

[ ]:

0.1 Counting
[8]: #gives the total number of tokens (words and punctuation) in the text

len(text1)

[8]: 260819

[7]: #to find out how many unique words there are, not counting repetitions,
#we take the set of tokens and sort it
sorted(set(text3))

#Or we can just find the length of that list.
len(sorted(set(text3)))

[7]: 2789

[12]: #Or we can print just the first 30 items in the sorted list of words:
sorted(set(text3))[:30]

[12]: ['!',
"'",

2
'(',
')',
',',
',)',
'.',
'.)',
':',
';',
';)',
'?',
'?)',
'A',
'Abel',
'Abelmizraim',
'Abidah',
'Abide',
'Abimael',
'Abimelech',
'Abr',
'Abrah',
'Abraham',
'Abram',
'Accad',
'Achbor',
'Adah',
'Adam',
'Adbeel',
'Admah']

[13]: #count how many times the word 'Moby' appears in text1
text1.count("Moby")

[13]: 84

[ ]:

0.2 Processing Text


Let's use the Gutenberg corpus.
NLTK includes a small selection of texts from the Project Gutenberg electronic text
archive, which contains some 25,000 free electronic books.
[19]: # You can then view some books obtained from the Gutenberg on-line book project:
nltk.corpus.gutenberg.fileids()

[19]: ['austen-emma.txt',
'austen-persuasion.txt',
'austen-sense.txt',
'bible-kjv.txt',
'blake-poems.txt',
'bryant-stories.txt',
'burgess-busterbrown.txt',
'carroll-alice.txt',
'chesterton-ball.txt',
'chesterton-brown.txt',
'chesterton-thursday.txt',
'edgeworth-parents.txt',
'melville-moby_dick.txt',
'milton-paradise.txt',
'shakespeare-caesar.txt',
'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt',
'whitman-leaves.txt']

[22]: #view the first file
file1 = nltk.corpus.gutenberg.fileids()[0]
file1

[22]: 'austen-emma.txt'

[33]: #We can get the original text, using the raw() function:

emmatext = nltk.corpus.gutenberg.raw(file1)

emmatext[:120] #Since this is quite long, we can view part of it, e.g. the first 120 characters

#len(emmatext) #count of total characters

[33]: '[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nan'

0.3 1. Tokenization
NLTK has several tokenizers available to break the raw text into tokens; we will use one
that separates on white space and also on special characters (punctuation).

0.3.1 Word Tokenization


[32]: emmatokens = nltk.wordpunct_tokenize(emmatext)

len(emmatokens) #total token count

#view the tokenized text
emmatokens[:15]

[32]: ['[',
'Emma',
'by',
'Jane',
'Austen',
'1816',
']',
'VOLUME',
'I',
'CHAPTER',
'I',
'Emma',
'Woodhouse',
',',
'handsome']

[34]: #Example
sentence="I have no money at the moment."
nltk.wordpunct_tokenize(sentence)

[34]: ['I', 'have', 'no', 'money', 'at', 'the', 'moment', '.']

[36]: #using word_tokenize


text = "God is Great! I won a lottery."
print(word_tokenize(text))

['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
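On a simple sentence like this one, wordpunct_tokenize and word_tokenize give the same tokens,
but they differ once contractions are involved; a small sketch (the example sentence is made up,
and the expected outputs in the comments may vary slightly across NLTK versions):

[ ]: contraction = "Don't hesitate to ask questions."

print(nltk.wordpunct_tokenize(contraction))
#splits on every punctuation run, e.g. ['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions', '.']

print(word_tokenize(contraction))
#keeps the contraction as two tokens, e.g. ['Do', "n't", 'hesitate', 'to', 'ask', 'questions', '.']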

[39]: #using the RegexpTokenizer

text = "God is Great! I won a lottery."
tokenizer = RegexpTokenizer(r"[\w']+")

tokenizer.tokenize(text)

[39]: ['God', 'is', 'Great', 'I', 'won', 'a', 'lottery']

0.3.2 Sentence Tokenization


[44]: #by using the nltk library
#(note: this reuses the name text1, which previously held the Moby Dick text)

text1 = "God is Great! I won a lottery."

print(sent_tokenize(text1))

['God is Great!', 'I won a lottery.']

[45]: text2 = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."

text2.split(". ")

[45]: ['Let us understand the difference between sentence & word tokenizer',
'It is going to be a simple example.']
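split(". ") only looks for a literal period followed by a space, so it also breaks on
abbreviations; sent_tokenize, backed by the pre-trained Punkt model, usually keeps those intact.
A small sketch (the example sentence is made up; the outputs in the comments are the typical
result and may vary by NLTK version):

[ ]: abbrev_text = "I went to see Dr. Smith. He was not in."

print(abbrev_text.split(". "))
#['I went to see Dr', 'Smith', 'He was not in.']

print(sent_tokenize(abbrev_text))
#typically: ['I went to see Dr. Smith.', 'He was not in.']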

[ ]:

0.4 2. Stopwords
[19]: #look at the stopword list
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
"you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's",
'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what',
'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is',
'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',
'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',
'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
"couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
"hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't",
'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn',
"shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn',
"wouldn't"]

[49]: sent1 = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had
indeed the vaguest idea where the wood and river in question were."""

# set of stop words


stop_words = set(stopwords.words('english'))

# tokens of words

word_tokens = word_tokenize(sent1)
word_tokens[:10]

[49]: ['He',
'determined',
'to',
'drop',
'his',
'litigation',
'with',
'the',
'monastry',
',']

[50]: #empty list to collect the text with the stop words removed
filtered_sentence = []

# filter out the stop words
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print("\nOriginal Sentence \n")
print(" ".join(word_tokens))

print("\nFiltered Sentence \n")
print(" ".join(filtered_sentence))

Original Sentence

He determined to drop his litigation with the monastry , and relinguish his
claims to the wood-cuting and fishery rihgts at once . He was the more ready to
do this becuase the rights had become much less valuable , and he had indeed the
vaguest idea where the wood and river in question were .

Filtered Sentence

He determined drop litigation monastry , relinguish claims wood-cuting fishery
rihgts . He ready becuase rights become much less valuable , indeed vaguest idea
wood river question .
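Note that the sentence-initial 'He' survives the filtering because the stopword list is all
lowercase and the membership test above is case-sensitive. A common variant (a sketch, not part
of the original cell) lowercases each token before the test:

[ ]: #case-insensitive stopword filtering
filtered_ci = [w for w in word_tokens if w.lower() not in stop_words]
print(" ".join(filtered_ci))
#the capitalized 'He' tokens are now removed as well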

0.5 3. Normalizing Word Formats
0.6 3.1 Lowercase
[51]: #Example
sentence="I have NO moNey at tHE moMent."

sentence.lower()

[51]: 'i have no money at the moment.'

[53]: #for already tokenized text
emmawords = [w.lower() for w in emmatokens]
emmawords[:15]

[53]: ['[',
'emma',
'by',
'jane',
'austen',
'1816',
']',
'volume',
'i',
'chapter',
'i',
'emma',
'woodhouse',
',',
'handsome']

[55]: # We can further view the words by getting the unique words and sorting them:
emmavocab = sorted(set(emmawords))
emmavocab[:10]

[55]: ['!', '!"', '!"--', "!'", "!'--", '!)--', '!--', '!--"', '!--(', '!--`']

[25]: #uppercased
sentence.upper()

#check Table 3.2 for more operations on strings (Chapter 3, Section 3.2 of the NLTK book)

[25]: 'I HAVE NO MONEY AT THE MOMENT.'

[26]: #select a set of words from the tokenized text


shortwords=emmawords[11:111]
shortwords[:10]

[26]: ['emma', 'woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',']

[27]: #get the frequency count for each word
shortdist = FreqDist(shortwords)
shortdist.keys()

for word in shortdist.keys():
    print(word, shortdist[word])

emma 1
woodhouse 1
, 8
handsome 1
clever 1
and 4
rich 1
with 2
a 3
comfortable 1
home 1
happy 1
disposition 1
seemed 1
to 3
unite 1
some 1
of 6
the 4
best 1
blessings 1
existence 1
; 2
had 3
lived 1
nearly 1
twenty 1
- 1
one 1
years 1
in 2
world 1
very 2
little 1
distress 1
or 1
vex 1
her 4
. 2
she 1
was 1
youngest 1
two 1
daughters 1
most 1
affectionate 1
indulgent 1
father 1
consequence 1
sister 1
' 1
s 1
marriage 1
been 1
mistress 1
his 1
house 1
from 1
early 1
period 1
mother 1
died 1
too 1
long 1
ago 1
for 1
have 1
more 1
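FreqDist also provides convenience methods; for example, most_common(n) returns the n most
frequent tokens as (token, count) pairs, which is handier than printing every key. A quick
sketch (the expected pairs follow from the counts above; the order of ties may differ):

[ ]: shortdist.most_common(5)
#e.g. [(',', 8), ('of', 6), ('and', 4), ('the', 4), ('her', 4)]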

0.7 3.2 Stemming


NLTK includes several stemmers; two of them, Porter and Lancaster, are described in
section 3.6 of the NLTK book. To use these stemmers, you first create them:
[58]: porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

[61]: #regular-cased text- porter stemmer


emmaregstem = [porter.stem(t) for t in emmatokens]
emmaregstem[1:10]

[61]: ['emma', 'by', 'jane', 'austen', '1816', ']', 'volum', 'i', 'chapter']

[30]: #lowercased text


emmalowerstem = [porter.stem(t) for t in emmawords]
emmalowerstem[1:10]

[30]: ['emma', 'by', 'jane', 'austen', '1816', ']', 'volum', 'i', 'chapter']

[31]: #regular-cased text - lancaster stemmer


emmaregstem1 = [lancaster.stem(t) for t in emmatokens]
emmaregstem1[1:10]

[31]: ['emm', 'by', 'jan', 'aust', '1816', ']', 'volum', 'i', 'chapt']

[70]: #building our own simple stemmer by making a list of suffixes to take off.

def stem(word):
for suffix in ['ing','ly','ed','ious','ies','ive','es','s']:
if word.endswith(suffix):
return word[:-len(suffix)]
return word

#try the above stemmer with 'friends'


stem('friends')

[70]: 'friend'

[71]: stem('relatives')

[71]: 'relativ'

0.8 3.3 Lemmatizing


NLTK has a lemmatizer that uses the WordNet on-line thesaurus as a dictionary in which
to look up word roots (lemmas).
[74]: wnl = nltk.WordNetLemmatizer()
emmalemma=[wnl.lemmatize(t) for t in emmawords]
emmalemma[1:10]

[74]: ['emma', 'by', 'jane', 'austen', '1816', ']', 'volume', 'i', 'chapter']

[82]: wnl.lemmatize('friends')
wnl.lemmatize('relatives')

[82]: 'relative'
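By default lemmatize() treats every word as a noun; passing a part-of-speech argument changes the
lookup. A small sketch (expected outputs shown as comments):

[ ]: print(wnl.lemmatize('running'))           #'running' - treated as a noun
print(wnl.lemmatize('running', pos='v'))   #'run' - treated as a verb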

0.9 4. Regex:Regular Expressions for Detecting Word Patterns


[83]: emmatext[:100]

[83]: '[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a'

[85]: #use the replace function to replace all the newline characters '\n' with a space ' '
newemmatext = emmatext.replace('\n', ' ')
shorttext = newemmatext[:150]

#redefine the variable shorttext to be the first 150 characters, without newlines
shorttext

[85]: '[Emma by Jane Austen 1816] VOLUME I CHAPTER I Emma Woodhouse, handsome,
clever, and rich, with a comfortable home and happy disposition, seemed to'

[38]: pword = re.compile(r'\w+')


#re.findall will find the substrings that matched anywhere in the string.

re.findall(pword, shorttext)

[38]: ['Emma',
'by',
'Jane',
'Austen',
'1816',
'VOLUME',
'I',
'CHAPTER',
'I',
'Emma',
'Woodhouse',
'handsome',
'clever',
'and',
'rich',
'with',
'a',
'comfortable',
'home',
'and',
'happy',
'disposition',
'seemed',
'to']

[39]: #re.findall will find the substrings that matched anywhere in the specialtext.
specialtext = 'U.S.A. poster-print costs $12.40, with 10% off.'
re.findall(pword, specialtext)

[39]: ['U', 'S', 'A', 'poster', 'print', 'costs', '12', '40', 'with', '10', 'off']

[40]: #to match tokens by matching words can have an internal hyphen.
ptoken = re.compile(r'(\w+(-\w+)*)')
re.findall(ptoken, specialtext)

[40]: [('U', ''),
('S', ''),
('A', ''),
('poster-print', '-print'),
('costs', ''),
('12', ''),
('40', ''),
('with', ''),
('10', ''),
('off', '')]

[41]: #to match abbreviations that might have a “.” inside, like U.S.A.
#We only allow capitalized letters
pabbrev = re.compile(r'(([A-Z]\.)+)')
re.findall(pabbrev, specialtext)

[41]: [('U.S.A.', 'A.')]

[42]: #combine it with the words pattern to match either words or abbreviations
ptoken = re.compile(r'(\w+(-\w+)*|([A-Z]\.)+)')
re.findall(ptoken, specialtext)

[42]: [('U', '', ''),
('S', '', ''),
('A', '', ''),
('poster-print', '-print', ''),
('costs', '', ''),
('12', '', ''),
('40', '', ''),
('with', '', ''),
('10', '', ''),
('off', '', '')]

[43]: #order of the matching patterns really matters if
#an earlier pattern matches part of what you want to match.
ptoken = re.compile(r'(([A-Z]\.)+|\w+(-\w+)*)')
re.findall(ptoken, specialtext)

[43]: [('U.S.A.', 'A.', ''),
('poster-print', '', '-print'),
('costs', '', ''),
('12', '', ''),
('40', '', ''),
('with', '', ''),
('10', '', ''),
('off', '', '')]

[44]: #add an expression to match the currency


ptoken = re.compile(r'(([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?)')
re.findall(ptoken, specialtext)

[44]: [('U.S.A.', 'A.', '', ''),
('poster-print', '', '-print', ''),
('costs', '', '', ''),
('$12.40', '', '', '.40'),
('with', '', '', ''),
('10', '', '', ''),
('off', '', '', '')]

Regular Expression Tokenizer using NLTK Tokenizer


[45]: #We can make a prettier regular expression that is equivalent to this one by
#using Python's triple quotes, which allow a string to go across multiple
#lines without adding a newline character

ptoken = re.compile(r'''([A-Z]\.)+        # abbreviations, e.g. U.S.A.
                     | \w+(-\w+)*         # words with internal hyphens
                     | \$?\d+(\.\d+)?     # currency, like $12.40
                     ''', re.X)

[46]: # abbreviations, e.g. U.S.A.


# words with optional internal hyphens
# currency and percentages, e.g. $12.40, 82%
# ellipsis ex: hmm..., well...
# these are separate tokens; includes ], [

pattern = r''' (?x) [A-Z][a-z]+\.| (?:[A-Z]\.)+|


| \w+(?:-\w+)*
| \$?\d+(?:\.\d+)?%?
| \.\.\.
| [][.,;"'?():-_']'''

[47]: nltk.regexp_tokenize(shorttext[:30], pattern)

[47]: ['',
'[',
'',
'Emma',
'',
'',
'by',
'',
'',
'Jane',
'',
'',
'Austen',
'',
'',
'1816',
'',
']',
'',
'',
'',
'VO',
'']

[48]: nltk.regexp_tokenize(specialtext, pattern)

[48]: ['U.S.A.',
'',
'',
'poster-print',
'',
'',
'costs',
'',
'',
'$12.40',
'',
',',
'',
'',
'with',
'',
'',
'10',
'',
'',
'',
'off',
'',
'.',
'']

https://www.nltk.org/book/ch03.html#tab-re-symbols
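The empty-string tokens in the two outputs above are produced by the pattern itself: under the
(?x) verbose flag, the bare | at the end of the pattern's first line, followed by the | that
begins the \w+ branch, leaves an empty alternative that matches at every position. A sketch of
the same pattern with that stray alternative removed (the expected tokens in the comment are
approximate and may differ slightly across NLTK versions):

[ ]: pattern2 = r'''(?x)          # verbose flag: whitespace and comments are ignored
      [A-Z][a-z]+\.              # capitalized word followed by a period, e.g. Mr.
    | (?:[A-Z]\.)+               # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*               # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?         # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                     # ellipsis
    | [][.,;"'?():-_']           # these are separate tokens; includes ], [
'''

nltk.regexp_tokenize(specialtext, pattern2)
#roughly: ['U.S.A.', 'poster-print', 'costs', '$12.40', ',', 'with', '10', 'off', '.']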

0.10 Document Term Matrix - DTM


[87]: # Let's start with a 'toy' corpus
CORPUS = [
'the sky is blue',
'sky is blue and sky is beautiful',
'the beautiful sky is so blue',
'i love blue cheese'
]

[90]: #assign the count vectorizer to a variable
countvectorizer = CountVectorizer()

DTM = pd.DataFrame(countvectorizer.fit_transform(CORPUS).toarray(),
                   columns=countvectorizer.get_feature_names_out(), index=None)

DTM

[90]:    and  beautiful  blue  cheese  is  love  sky  so  the
      0    0          0     1       0   1     0    1   0    1
      1    1          1     1       0   2     0    2   0    0
      2    0          1     1       0   1     0    1   1    1
      3    0          0     1       1   0     1    0   0    0
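Each row of the DTM corresponds to one document in CORPUS and each column to a vocabulary term,
so row 1 records, for example, two occurrences each of 'is' and 'sky'. Note that the word 'i'
from the last document is missing: CountVectorizer's default token_pattern only keeps tokens of
two or more word characters. The fitted vectorizer also exposes the term-to-column mapping,
which can be inspected directly (a small sketch; the key order of the dictionary may differ):

[ ]: countvectorizer.vocabulary_
#e.g. {'and': 0, 'beautiful': 1, 'blue': 2, 'cheese': 3, 'is': 4, 'love': 5, 'sky': 6, 'so': 7, 'the': 8}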

[ ]:
