
7 TextAnalysis

The document contains Python code demonstrating the use of the TextBlob and NLTK libraries for text processing tasks such as spelling correction, tokenization, filtering stopwords, stemming, and lemmatization. It includes examples of how to analyze text, visualize word frequency, and perform part-of-speech tagging. Additionally, it highlights the differences between stemming and lemmatization in natural language processing.


!pip install textblob


!pip install nltk

from textblob import TextBlob


import nltk

b = TextBlob("I ahve good spelling")  # "ahve" is an intentional misspelling


b.correct()  # returns a new TextBlob with best-guess spelling corrections
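# For per-word correction candidates, TextBlob's Word.spellcheck() returns
# (candidate, confidence) pairs sorted by confidence. A minimal sketch
# (the example word is mine, not from the original):
from textblob import Word
w = Word("ahve")
print(w.spellcheck())        # e.g. [('have', ...), ...]
print(w.spellcheck()[0][0])  # top suggestion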

import nltk
nltk.download('punkt') # will download the Punkt tokenizer models.
b1 = TextBlob("beautifull is bettter level than ugly")
b1.words

b1.sentences

b1.words[3].pluralize()

sen = TextBlob("My name name name is anthony gonsalvis main duniya mein akela hoon")
# the Hindi part means "I am alone in the world"
sen.word_counts["name"]

print(sen.parse())
# parse() returns a shallow parse: each token with its POS and chunk tags

sen[0:19]
# slicing returns a TextBlob substring

b1.upper()

b1.find("ugly")
# index of the character at which "ugly" is first found (-1 if not present)

apple = TextBlob("apples")
banana = TextBlob("banana")
apple > banana  # TextBlobs compare like strings (lexicographically), so this is False

b1.ngrams(n=3)
# An n-gram is a contiguous sequence of n items from a given sample of text or speech.
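# A quick illustration (the sample sentence is mine, not from the original);
# ngrams() returns a list of WordList objects:
tb = TextBlob("the quick brown fox jumps")
print(tb.ngrams(n=2))  # four bigrams: the/quick, quick/brown, brown/fox, fox/jumps
print(tb.ngrams(n=3))  # three trigrams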

import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize
text = """ Goood day it was today in pune, I loved the weather in Pune. Pune is the best city to live in."""
text

tokenized_text = sent_tokenize(text)

print(tokenized_text)
# splits a text into a list of sentences, using an algorithm that considers punctuation and capitalization

from nltk.tokenize import word_tokenize


tokenizer_word = word_tokenize(text)
# splits the text into word and punctuation tokens (not just on whitespace)
print(tokenizer_word)

from nltk.probability import FreqDist


fd = FreqDist(tokenizer_word)
print(fd)

fd.most_common(4)

import matplotlib.pyplot as plt


fd.plot(30, cumulative=False)
# plots the 30 most common tokens in the frequency distribution

nltk.download('stopwords')
# Stopwords are commonly used words (such as "the", "a", "an", "in", "on", etc.).
# Downloading the stopwords corpus gives you a predefined list of stopwords
# that you can use to filter out irrelevant words during text preprocessing.

from nltk.corpus import stopwords


st = set(stopwords.words('english'))
print(st)

filtered_sent = []
for w in tokenizer_word:
    if w not in st:
        filtered_sent.append(w)

print('tokenized sentence : ', tokenizer_word)


print('Filtered sentence :', filtered_sent)
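# The same filtering is often written as a list comprehension; a sketch
# ("filtered_lower" is an illustrative name). Lowercasing each token first
# matches the all-lowercase stopword list, so "The" is filtered as well as "the":
filtered_lower = [w for w in tokenizer_word if w.lower() not in st]
print(filtered_lower)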

from nltk.stem import PorterStemmer


from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
stemmed_words = []
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))

print("Filtered sent :", filtered_sent)


print("Stemmed sentence: ", stemmed_words)
# removing common word endings to reduce words to their base or root form.
# eg running -> run
# runs -> run
# ran -> ran
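# A quick check of the examples above with the same PorterStemmer:
for w in ["running", "runs", "ran"]:
    print(w, "->", ps.stem(w))
# Stemming is rule-based, so the irregular form "ran" is left unchanged.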

nltk.download("wordnet")

from nltk.stem.wordnet import WordNetLemmatizer


lem = WordNetLemmatizer()
from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()
word = "flying"
print("Lemmatizer Word: ", lem.lemmatize(word, "v")) # v means verb here, n->noun,
a->adjective, r->adverb
print("Stemmed Word ", stem.stem(word))

# The WordNet lemmatizer is based on WordNet, a lexical database of the English language.
# It is another way of reducing words to their base form.
# Unlike stemming, lemmatization takes into account the morphological analysis of words,
# ensuring that the resulting lemma is a valid word.
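# A short side-by-side sketch (illustrative words, not from the original):
# the stemmer can emit non-words ("studi"), while the lemmatizer returns
# valid dictionary forms.
print("studies | stem:", stem.stem("studies"), "| lemma:", lem.lemmatize("studies", "v"))
print("better  | stem:", stem.stem("better"), "| lemma:", lem.lemmatize("better", "a"))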

sent = "Albert Einstien was born in Ulm, Germany in 1879."


tokens = nltk.word_tokenize(sent)
print(tokens)

nltk.download('averaged_perceptron_tagger')

nltk.pos_tag(tokens)
# NNP: proper noun
# VBD: verb, past tense
# IN: preposition
# ,: punctuation mark
# CD: cardinal number
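# NLTK can describe any Penn Treebank tag directly; a small sketch that
# needs the 'tagsets' resource to be downloaded first:
nltk.download('tagsets')
nltk.help.upenn_tagset('NNP')  # prints the definition and examples for NNP
nltk.help.upenn_tagset('VBD')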

from collections import Counter


sent = "Texas is the city in america i guess i dont know"
fq = Counter(sent) # for letter for words use sent.split()
fw = Counter(sent.split())
fw
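# FreqDist is in fact a Counter subclass, so most_common works the same way here:
print(fw.most_common(3))  # the three most frequent words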
