Language Engineering - Section
import nltk
# Download the NLTK resources used in these slides
nltk.download("punkt")                        # sentence/word tokenizer models
nltk.download("wordnet")                      # WordNet lexical database
nltk.download("averaged_perceptron_tagger")   # part-of-speech tagger model
Tokenizing
NLTK provides a tokenization module that can split text into sentences or into words.
from nltk.tokenize import sent_tokenize
text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer. The slings and arrows
of outrageous fortune, or to take arms against a sea of troubles. And by opposing end them. To die—to sleep, no
more; and by a sleep to say we end. The heart-ache and the thousand natural shocks"""
tokenized_text = sent_tokenize(text)   # split the text into a list of sentences
print(tokenized_text)
Tokenizing
text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer. The slings and arrows
of outrageous fortune, or to take arms against a sea of troubles. And by opposing end them. To die—to sleep, no
more; and by a sleep to say we end. The heart-ache and the thousand natural shocks"""
from nltk.tokenize import word_tokenize
tokenized_text = word_tokenize(text)   # split the text into a list of word tokens
print(tokenized_text)
Stemming
Stemming reduces a word to a base form (its stem) by stripping affixes with simple rules, so the result is not always a dictionary word.
Stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connection", "connected", "connecting"]
for word in words:
    print(stemmer.stem(word))   # all three stem to "connect"
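A small assumed example (not on the original slides) showing that the rule-based stemmer can produce stems that are not dictionary words, which motivates the lemmatization approach on the following slides:

from nltk.stem import PorterStemmer, WordNetLemmatizer

word = "studies"
print(PorterStemmer().stem(word))            # "studi" - a truncated, non-dictionary stem
print(WordNetLemmatizer().lemmatize(word))   # "study" - an actual dictionary form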
Lemmatization
Lemmatization reduces a word to its dictionary form (its lemma) using a vocabulary and morphological analysis; NLTK's lemmatizer is based on WordNet.
Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
for word in tokenized_text:
    print(lemmatizer.lemmatize(word))   # by default every word is treated as a noun
Lemmatization
With the default settings the lemmatizer treats every word as a noun, so "was" and "running" come back unchanged; passing the correct part of speech gives better results.
Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
for word in tokenized_text:
    print(lemmatizer.lemmatize(word, pos="v"))   # treat each word as a verb: "was" -> "be", "running" -> "run"
Parts of Speech Tagging
Part-of-speech tagging labels each token with its grammatical category (noun, verb, adjective, and so on).
Parts of Speech Tagging
import nltk

text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
print(nltk.pos_tag(tokenized_text))   # list of (token, Penn Treebank tag) pairs
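A hedged variation (not on the original slides): nltk.pos_tag can also map its output to the coarser universal tagset referenced in the NLTK book link at the end of these slides.

import nltk

# May first require: nltk.download("universal_tagset")
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
print(nltk.pos_tag(tokenized_text, tagset="universal"))   # coarse tags such as NOUN, VERB, ADJ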
Parts of Speech Tagging
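One common pattern that combines the two previous topics (a sketch, not from the original slides; penn_to_wordnet is a hypothetical helper name) is to let the POS tags choose the lemmatizer's pos argument.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the corresponding WordNet POS (default: noun)
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
text = "The rabbit was running quickly towards the carrot"
for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
    print(lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))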
WordNet
• Word-sense disambiguation (see the sketch after this list)
• Information retrieval
• Automatic text classification
• Automatic text summarization
• Machine translation
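As a small illustration of the first application above, NLTK ships a simplified Lesk word-sense disambiguator; the sentence and the ambiguous word "bank" below are assumed examples, not from the original slides.

import nltk
from nltk.wsd import lesk

# Pick a WordNet sense for "bank" based on the surrounding words
sentence = "I went to the bank to deposit my money"
tokens = nltk.word_tokenize(sentence)
sense = lesk(tokens, "bank", pos="n")
if sense is not None:
    print(sense.name(), "-", sense.definition())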
Example (Synsets and Lemmas)
Code
The function wordnet.synsets('word') returns all synsets (sets of synonyms) in which the given word appears.
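A sketch of that call for the word 'room', which is the word the output on the next slide describes:

from nltk.corpus import wordnet

# List every synset that contains the word "room", with its name and part of speech
for synset in wordnet.synsets("room"):
    print(synset.name(), synset.pos())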
Output
Four of the returned synsets have the name 'room' and are nouns, while the last one is named 'board' and is a verb. This also suggests that the word 'room' has a total of five meanings, or contexts.
WordNet
from nltk.corpus import wordnet

word = "hungry"
synset = wordnet.synsets(word)[0]   # take the first synset of the word
print("Name: " + synset.name())
print("Description: " + synset.definition())
print("Antonym: " + synset.lemmas()[0].antonyms()[0].name())
print("Examples: " + synset.examples()[0])
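A possible follow-up (not from the original slides): each synset groups several lemmas, so collecting the lemma names of a word's synsets gives a rough list of synonyms.

from nltk.corpus import wordnet

# Collect the lemma names (rough synonyms) from every synset of a word
word = "hungry"
synonyms = set()
for synset in wordnet.synsets(word):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
print(synonyms)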
Try it out yourself
Code:
https://colab.research.google.com/drive/1wLjqqi4aLEY2PWDcpax-_4tCyh946yVQ
Parts of Speech tagger:
https://parts-of-speech.info/
WordNet search:
http://wordnetweb.princeton.edu/perl/webwn
Task #1
Read a PDF file using the PyPDF2 library, extract the text from the first page, tokenize it into sentences and then tag them with the parts-of-speech tagger.
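A minimal sketch of one way to approach Task #1 (not an official solution), assuming PyPDF2 3.x and a hypothetical local file named sample.pdf:

import nltk
from nltk.tokenize import sent_tokenize
from PyPDF2 import PdfReader

# Read the PDF and extract the text of the first page
reader = PdfReader("sample.pdf")   # hypothetical file name
first_page_text = reader.pages[0].extract_text()

# Split the page text into sentences, then POS-tag each sentence
for sentence in sent_tokenize(first_page_text):
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))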
Task #2
Thank you for your attention!
References
https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b
https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258
https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
https://www.nltk.org/book/ch05.html#tab-universal-tagset