
Language Engineering

Prepared by: Abdelrahman M. Safwat

Section (3) – NLTK Basics


Installing NLTK

!pip install nltk

import nltk
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

2
Tokenizing

 NLTK has a module that can tokenize text. You can tokenize text based on sentences or words.
from nltk.tokenize import sent_tokenize

text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer.
The slings and arrows of outrageous fortune, or to take arms against a sea of troubles.
And by opposing end them. To die—to sleep, no more; and by a sleep to say we end. The
heart-ache and the thousand natural shocks"""

tokenized_text = sent_tokenize(text)

print(tokenized_text)
3
Tokenizing (cont.)

from nltk.tokenize import word_tokenize

text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer.
The slings and arrows of outrageous fortune, or to take arms against a sea of troubles.
And by opposing end them. To die—to sleep, no more; and by a sleep to say we end. The
heart-ache and the thousand natural shocks"""

tokenized_text = word_tokenize(text)

print(tokenized_text)

4
Stemming

 If we want to get the root form of a word, we use a stemmer.
 For example, stemming the word “connection,” “connecting,” or “connected” would all result in the word “connect.”

5
Stemming (cont.)

from nltk.stem import PorterStemmer

words = ["connection", "connected", "connecting"]

for word in words:
    print(PorterStemmer().stem(word))

6
Lemmatization

 Stemming can sometimes produce the wrong root word, or a word that doesn’t exist.
 In that case, we can use lemmatization, which is similar to looking up the base form of a word in a dictionary.

7
Lemmatization (cont.)

import nltk
from nltk.stem import WordNetLemmatizer

text = "The rabbit was running quickly towards the carrot"

tokenized_text = nltk.word_tokenize(text)

for word in tokenized_text:
    print(WordNetLemmatizer().lemmatize(word))

8
Lemmatization (cont.)

 In the above example, you’ll see that there is no meaningful change after lemmatization.
 That’s because you need to provide the lemmatization function with Parts of Speech tags. A sketch that combines the tagger with the lemmatizer follows the Parts of Speech tagging example below.

9
Lemmatization (cont.)

import nltk
from nltk.stem import WordNetLemmatizer

text = "The rabbit was running quickly towards the carrot"

tokenized_text = nltk.word_tokenize(text)

for word in tokenized_text:
    print(WordNetLemmatizer().lemmatize(word, pos="v"))

10
Parts of Speech Tagging

 Parts of Speech tagging is the process of tagging a word in a text based on its definition and context.
 Example: Tagging “likes” as a verb.
 Note: To tag words in a text, you need to tokenize it first.

11
Parts of Speech Tagging (cont.)

import nltk

text = "The rabbit was running quickly towards the carrot"


tokenized_text = nltk.word_tokenize(text)

print(nltk.pos_tag(tokenized_text))

12
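Putting the two previous ideas together: nltk.pos_tag returns Penn Treebank tags (e.g. "VBD", "NN"), which can be mapped to the POS values the lemmatizer expects. Below is a minimal sketch; the get_wordnet_pos helper is not part of NLTK, just a small mapping function written here for illustration.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(tag):
    # Map a Penn Treebank tag (e.g. "VBD") to a WordNet POS constant
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # default to noun

text = "The rabbit was running quickly towards the carrot"
lemmatizer = WordNetLemmatizer()
for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
    print(lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)))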
Parts of Speech Tagging (cont.)

Tag    Meaning      Examples
ADJ    Adjective    new, good, high, special, big
ADP    Adposition   on, of, at, with, by, into, under
ADV    Adverb       really, already, still, early, now
CONJ   Conjunction  and, or, but, if, while
DET    Determiner   the, a, some, most, every
NOUN   Noun         year, home, costs, time
NUM    Numeral      twenty-four, fourth, 1991
PRT    Particle     at, on, out, over, per, that, up
PRON   Pronoun      he, their, her, its, my, I, us

13
WordNet

 In lemmatization, we mentioned a process similar to looking up a word in a dictionary. WordNet is what we use to look up words.
 WordNet is similar to a database or a dictionary of links and relationships between words.
14
WordNet (cont.)

WordNet is a large database of English nouns, adjectives, adverbs, and verbs; Python’s Natural Language Toolkit provides an interface to it.

WordNet has been used for a number of purposes in information systems, including:

• Word-sense disambiguation
• Information retrieval
• Automatic text classification
• Automatic text summarization
• Machine translation

15
Example (Synsets and Lemmas)

In WordNet, similar words are grouped into a set known as a Synset.

Every Synset has a name, a part of speech, and a number. The words in a Synset are known as Lemmas.
16
Code

The function wordnet.synsets(word) returns a list containing all the Synsets related to the word passed to it as the argument.

from nltk.corpus import wordnet
synset = wordnet.synsets("room")
print(synset)

17
Output

[Synset('room.n.01'), Synset('room.n.02'), Synset('room.n.03'), Synset('room.n.04'), Synset('board.v.02')]

The first four Synsets have the name 'room' and are nouns, while the last one’s name is 'board' and it is a verb.

This also suggests that the word 'room' has a total of five meanings or contexts.
18
WordNet

from nltk.corpus import wordnet

word = "hungry"
synset = wordnet.synsets(word)[0]

print("Name: " + synset.name())


print("Description: " + synset.definition())
print("Antonym: " + synset.lemmas()[0].antonyms()[0].name())
print("Examples: " + synset.examples()[0])

19
Try it out yourself

 Code:
https://colab.research.google.com/drive/1wLjqqi4aLEY2PWDcpax-_4tCyh946yVQ
 Parts of Speech tagger:
https://parts-of-speech.info/
 WordNet search:
http://wordnetweb.princeton.edu/perl/webwn
20
Task #1

 Read a PDF file using the PyPDF2 library, extract the text from the first page, tokenize it into sentences, and then tag it with the Parts of Speech tagger (see the sketch below).

21
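A minimal sketch for Task #1, assuming a recent PyPDF2 version (which exposes the PdfReader API) and a hypothetical local file named sample.pdf:

import nltk
from PyPDF2 import PdfReader
from nltk.tokenize import sent_tokenize

reader = PdfReader("sample.pdf")                   # hypothetical file name
first_page_text = reader.pages[0].extract_text()   # text of the first page

for sentence in sent_tokenize(first_page_text):    # tokenize into sentences
    tokens = nltk.word_tokenize(sentence)          # the tagger works on word tokens
    print(nltk.pos_tag(tokens))                    # Parts of Speech tags per sentence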
Task #2

 Use stemming to transform a word to its root form (see the sketch below).

22
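A minimal sketch for Task #2, reusing the PorterStemmer from slide 6; the input word is just an assumed example:

from nltk.stem import PorterStemmer

word = "connections"                # example input word (assumed)
print(PorterStemmer().stem(word))   # -> "connect"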
Task #3

 Write code to determine the stems of the words in an input sentence (see the sketch below).

23
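A minimal sketch for Task #3, combining word tokenization with the PorterStemmer; the input sentence is just an assumed example:

import nltk
from nltk.stem import PorterStemmer

sentence = "The connected devices were connecting to the network"   # example input (assumed)
stemmer = PorterStemmer()
for word in nltk.word_tokenize(sentence):
    print(word, "->", stemmer.stem(word))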
Thank you for your attention!

24
References

 https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b
 https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258
 https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
 https://www.nltk.org/book/ch05.html#tab-universal-tagset

25
