ASTW RA03 Practical Manual
1) Implement Sentiment Analysis on dialogues spoken by a movie character
Code:
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# df_avatar is assumed to be loaded beforehand as a DataFrame of dialogue lines
# with (at least) a 'character' column
df_avatar_lines = df_avatar.groupby('character').count()
top_character_names = df_avatar_lines.index.values
df_character_sentiment = df_avatar[df_avatar['character'].isin(top_character_names)]
sid = SentimentIntensityAnalyzer()
df_character_sentiment.reset_index(inplace=True, drop=True)
df_character_sentiment
Output :
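Note: the listing above creates the analyzer but never applies it. A minimal follow-up sketch (assuming the dialogue text sits in a column named 'full_text', which the original listing does not show) scores each line and averages the compound polarity per character:

# score every line with VADER and keep the compound polarity (-1 to +1)
df_character_sentiment['sentiment'] = df_character_sentiment['full_text'].apply(
    lambda line: sid.polarity_scores(line)['compound'])
# average sentiment per character, most positive first
print(df_character_sentiment.groupby('character')['sentiment']
      .mean().sort_values(ascending=False))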
2) Implement Named Entity Recognition (NER) in Python with Spacy
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy

NER = spacy.load("en_core_web_sm")
raw_text = ("The Indian Space Research Organisation (ISRO) is the national space agency of India, "
            "headquartered in Bengaluru. It operates under the Department of Space, which is directly "
            "overseen by the Prime Minister of India, while the Chairman of ISRO acts as executive of "
            "DOS as well.")
text1 = NER(raw_text)
# print every detected entity together with its label
for word in text1.ents:
    print(word.text, word.label_)
Output :
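Note: as an optional, hedged extra (not part of the original listing), spaCy's built-in displacy visualizer can highlight the detected entities inline when the code is run in a notebook:

from spacy import displacy
# render the entities found in text1 with colour-coded labels
displacy.render(text1, style="ent", jupyter=True)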
3) Implement Stemming & Lemmatization
Stemming
import nltk
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
# 'text' is assumed to hold the input sentence to be stemmed
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))
Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
# 'text' is assumed to hold the same input sentence
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))
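Note: since 'text' is never defined in the two snippets above, a self-contained comparison on an assumed sample sentence (illustrative only) makes the difference between the two techniques easier to see:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')

sample = "the children are studying better strategies for their studies"
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for w in nltk.word_tokenize(sample):
    # stemming chops suffixes; lemmatization maps to a dictionary form
    print(w, "->", stemmer.stem(w), "|", lemmatizer.lemmatize(w))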
4) Implement Document-Term Matrix (Bag of Words)
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# df is assumed to be a DataFrame with a 'text' column (the documents)
# and a 'review' column used only as row labels
cv = CountVectorizer(stop_words='english')
cv_matrix = cv.fit_transform(df['text'])
df_dtm = pd.DataFrame(cv_matrix.toarray(),
                      index=df['review'].values,
                      columns=cv.get_feature_names_out())  # get_feature_names() in older scikit-learn
df_dtm
Output :
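Note: 'df' is not defined in the listing above, so a minimal self-contained sketch on a tiny made-up corpus (the reviews below are purely illustrative) shows the same document-term-matrix idea end to end:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toy = pd.DataFrame({'review': ['r1', 'r2', 'r3'],
                    'text': ['the movie was great',
                             'the movie was boring',
                             'great plot and great acting']})
cv = CountVectorizer(stop_words='english')
m = cv.fit_transform(toy['text'])
# rows = documents, columns = vocabulary terms, cells = raw counts
print(pd.DataFrame(m.toarray(), index=toy['review'], columns=cv.get_feature_names_out()))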
5) Implement Term Frequency–Inverse Document Frequency (TF-IDF)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# df is assumed to be the same DataFrame used in the previous exercise
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['text'])
df_dtm = pd.DataFrame(tfidf_matrix.toarray(),
                      index=df['review'].values,
                      columns=tfidf.get_feature_names_out())  # get_feature_names() in older scikit-learn
df_dtm
Output :
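Note: a parallel, self-contained sketch on the same made-up corpus (illustrative only) shows how TfidfVectorizer differs from plain counts:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = ['the movie was great', 'the movie was boring', 'great plot and great acting']
tfidf = TfidfVectorizer(stop_words='english')
m = tfidf.fit_transform(docs)
# terms that appear in more documents get a lower idf, hence smaller weights
print(pd.DataFrame(m.toarray(), columns=tfidf.get_feature_names_out()).round(2))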
6) Implement Stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

sw_nltk = stopwords.words('english')
print(sw_nltk)
print(len(sw_nltk))

text = "When I first met her she was very quiet. She remained quiet during the entire two hour long journey from Stony Brook to New York."
# keep only the words that are not in the NLTK stop word list
words = [word for word in text.split() if word.lower() not in sw_nltk]
new_text = " ".join(words)
print(new_text)
Output :
7) Implement POS Tagging
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
# requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

stop_words = set(stopwords.words('english'))
# 'txt' is assumed to hold the paragraph to be tagged
tokenized = sent_tokenize(txt)
for i in tokenized:
    wordsList = nltk.word_tokenize(i)
    # remove stop words before running the POS tagger
    wordsList = [w for w in wordsList if w not in stop_words]
    tagged = nltk.pos_tag(wordsList)
    print(tagged)
Output :
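Note: as a hedged aside, NLTK can also print what an individual Penn Treebank tag means, which helps when reading the tagged output above:

import nltk
nltk.download('tagsets')
# look up the meaning of a tag, e.g. NN (noun) or VBZ (verb, 3rd person singular present)
nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('VBZ')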
8) Implement Chunking
import nltk

# a pre-tagged sentence: (token, POS-tag) pairs
sentence = [
    ("the", "DT"),
    ("book", "NN"),
    ("has", "VBZ"),
    ("many", "JJ"),
    ("chapters", "NNS")
]
chunker = nltk.RegexpParser(
    r'''
    NP: {<DT><NN.*><.*>*<NN.*>}    # chunk a determiner plus nouns (and anything between) into an NP
        }<VB.*>{                   # chink: remove any verb from inside a chunk
    ''')
output = chunker.parse(sentence)
print(output)
Output :
9) Implement WordNet
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')

synonyms = []
antonyms = []
# walk every synset of an example word and collect its lemma names;
# the word "good" is an assumed example (the original listing does not show the word used)
for synset in wordnet.synsets("good"):
    for l in synset.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
Output :
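Note: a short hedged extension of the same WordNet API: every synset also exposes a definition and usage examples, which makes the synonym list easier to interpret:

from nltk.corpus import wordnet
# inspect the first synset of the same assumed example word
syn = wordnet.synsets("good")[0]
print(syn.name())        # sense identifier, e.g. good.n.01
print(syn.definition())  # gloss for that sense
print(syn.examples())    # example sentences, if any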
10) Implement Word Cloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

class WordCloudGeneration:
    def preprocessing(self, data):
        # remove stop words from every sentence before building the cloud
        stop_words = set(stopwords.words('english'))
        preprocessed_data = []
        for paragraph in data:
            word_tokens = word_tokenize(paragraph)
            preprocessed_data.append(' '.join(w for w in word_tokens if w.lower() not in stop_words))
        return preprocessed_data

    def create_word_cloud(self, final_data):
        # initiate WordCloud object with parameters width, height, maximum font size and background color (illustrative values)
        wordcloud = WordCloud(width=1600, height=800, max_font_size=200, background_color='white')
        # call the generate method of WordCloud class to generate an image
        wordcloud = wordcloud.generate(' '.join(final_data))
        plt.figure(figsize=(12,10))
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.show()

wordcloud_generator = WordCloudGeneration()
input_text = ('These datasets are used for machine-learning research and have been cited in '
              'peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. '
              'Major advances in this field can result from advances in learning algorithms (such as deep learning), '
              'computer hardware, and, less-intuitively, the availability of high-quality training datasets.[1] '
              'High-quality labeled training datasets for supervised and semi-supervised machine learning '
              'algorithms are usually difficult and expensive to produce because of the large amount of time '
              'needed to label the data. Although they do not need to be labeled, high-quality datasets for '
              'unsupervised learning can also be difficult and costly to produce.')
input_text = input_text.split('.')
clean_data = wordcloud_generator.preprocessing(input_text)
wordcloud_generator.create_word_cloud(clean_data)
Output :