
Language Engineering - Section

This document provides an overview of natural language processing basics using the NLTK library in Python. It covers how to install and import NLTK, tokenize text into sentences and words, stem and lemmatize words, part-of-speech tag words in sentences, and describes how WordNet can be used to look up words and their relationships. Examples are given for each NLTK concept along with tasks for readers to try out tokenization, tagging and stemming on their own.

Uploaded by

asmaa soliman

Language Engineering

Prepared by: Abdelrahman M. Safwat

Section (3) – NLTK Basics


Installing NLTK

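# In a notebook cell (drop the leading "!" when running in a terminal)
!pip install nltk

import nltk

# Download the data used in this section
nltk.download("punkt")                       # tokenizer models
nltk.download("wordnet")                     # WordNet, used for lemmatization
nltk.download("averaged_perceptron_tagger")  # Parts of Speech tagger
# Newer NLTK releases may ask for differently named resources (for example
# "punkt_tab"); if a LookupError appears, download the name it reports.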

Tokenizing

• NLTK's tokenize module splits text into tokens. You can split by sentence (sent_tokenize) or by word (word_tokenize).

from nltk.tokenize import sent_tokenize

text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer. The slings and arrows
of outrageous fortune, or to take arms against a sea of troubles. And by opposing end them. To die—to sleep, no
more; and by a sleep to say we end. The heart-ache and the thousand natural shocks"""

tokenized_text = sent_tokenize(text)

print(tokenized_text)
Tokenizing

from nltk.tokenize import word_tokenize

text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer. The slings and arrows
of outrageous fortune, or to take arms against a sea of troubles. And by opposing end them. To die—to sleep, no
more; and by a sleep to say we end. The heart-ache and the thousand natural shocks"""

tokenized_text = word_tokenize(text)

print(tokenized_text)
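
To make the two granularities concrete without any downloads, here is a rough sketch that approximates sentence and word splitting with the standard `re` module. It is only an illustration under simple assumptions (sentences end in `.`, `!` or `?`); NLTK's tokenizers handle abbreviations, quotes and contractions far more carefully, so their output will differ on real text.

```python
import re

text = (
    "To be, or not to be, that is the question. "
    "Whether 'tis nobler in the mind to suffer."
)

# Sentence level: split on whitespace that follows ., ! or ?
# (a crude stand-in for sent_tokenize; it would break on abbreviations like "Dr.")
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word level: runs of word characters, or single punctuation marks
# (a crude stand-in for word_tokenize)
words = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)
print(words)
```

Note that punctuation marks come out as separate tokens at the word level, which is also how word_tokenize treats them.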

Stemming

• To reduce a word to its root form (its stem), we use a stemmer.

• For example, stemming “connection”, “connecting”, or “connected” yields “connect” in each case.

Stemming

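from nltk.stem import PorterStemmer

words = ["connection", "connected", "connecting"]

# Create the stemmer once and reuse it for every word
stemmer = PorterStemmer()

for word in words:
    print(stemmer.stem(word))  # each word is reduced to "connect"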

Lemmatization

• Stemming can sometimes produce the wrong root, or a string that is not a real word.

• In that case, we can use lemmatization, which is similar to looking up the base form of a word in a dictionary.
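
The contrast between the two approaches can be sketched in plain Python. The suffix rules and the small dictionary below are made up for illustration; they are neither the Porter algorithm nor WordNet, but they show why rule-based stripping can yield non-words such as "stud" and "runn" while a dictionary lookup returns real base forms.

```python
def toy_stem(word):
    """Strip one common suffix by rule (illustrative; not the Porter algorithm)."""
    for suffix in ("ing", "ies", "ed", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# A tiny hand-made dictionary of base forms (illustrative; not WordNet).
TOY_LEMMAS = {"studies": "study", "running": "run", "connected": "connect"}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)


for word in ("connected", "studies", "running"):
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The lookup is only as good as its dictionary: a word missing from it is returned unchanged.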

Lemmatization

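import nltk
from nltk.stem import WordNetLemmatizer

text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)

# Create the lemmatizer once; with no pos argument every word is treated as a noun
lemmatizer = WordNetLemmatizer()

for word in tokenized_text:
    print(lemmatizer.lemmatize(word))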

Lemmatization

• In the above example, you’ll see that there is no meaningful change after lemmatization.

• That’s because the lemmatize function treats every word as a noun unless you give it a Part of Speech tag.

Lemmatization

import nltk
from nltk.stem import WordNetLemmatizer

text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)

lemmatizer = WordNetLemmatizer()

for word in tokenized_text:
    # pos="v" treats every word as a verb, so "running" becomes "run"
    print(lemmatizer.lemmatize(word, pos="v"))
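
Passing pos="v" for every word treats nouns and adverbs as verbs too. One common approach is to give each word its own tag by translating the tagger's Penn Treebank tags into WordNet's single-letter codes. The sketch below shows that translation only; `penn_to_wordnet` is a hypothetical helper, and the tagged pairs are written by hand to stand in for tagger output, so treat it as an outline and not a definitive recipe.

```python
# WordNet's single-letter part-of-speech codes, as accepted by
# WordNetLemmatizer.lemmatize(word, pos=...).
WORDNET_ADJ, WORDNET_NOUN, WORDNET_ADV, WORDNET_VERB = "a", "n", "r", "v"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet pos code (hypothetical helper)."""
    if tag.startswith("JJ"):
        return WORDNET_ADJ
    if tag.startswith("VB"):
        return WORDNET_VERB
    if tag.startswith("RB"):
        return WORDNET_ADV
    # Nouns and everything else fall back to the lemmatizer's default.
    return WORDNET_NOUN


# Hand-written (word, tag) pairs standing in for the output of nltk.pos_tag;
# the real tagger's output may differ.
tagged = [("The", "DT"), ("rabbit", "NN"), ("was", "VBD"),
          ("running", "VBG"), ("quickly", "RB")]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)

# With NLTK installed, each pair could then be passed on as:
#     WordNetLemmatizer().lemmatize(word, pos=wordnet_pos)
```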

Parts of Speech Tagging

• Parts of Speech tagging is the process of labeling each word in a text with its part of speech, based on its definition and context.
• Example: Tagging “likes” as a verb.
• Note: To tag words in a text, you need to tokenize it first.

Parts of Speech Tagging

import nltk

text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)

print(nltk.pos_tag(tokenized_text))

Parts of Speech Tagging

Tag    Meaning       Examples
ADJ    Adjective     new, good, high, special, big
ADP    Adposition    on, of, at, with, by, into, under
ADV    Adverb        really, already, still, early, now
CONJ   Conjunction   and, or, but, if, while
DET    Determiner    the, a, some, most, every
NOUN   Noun          year, home, costs, time
NUM    Numeral       twenty-four, fourth, 1991
PRT    Particle      at, on, out, over per, that, up
PRON   Pronoun       he, their, her, its, my, I, us
VERB   Verb          is, say, told, given, playing
.      Punctuation   . , ; !
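
The tags in this table are the coarse universal tagset described in the NLTK book, whereas nltk.pos_tag returns finer Penn Treebank tags such as NN or VBG unless it is called with tagset="universal". The sketch below shows how the two relate, using a small hand-picked subset of the mapping. `PENN_TO_UNIVERSAL` and `to_universal` are illustrative names, the tagged pairs are written by hand, and passing unknown tags through unchanged is a choice made here for simplicity.

```python
# A small, hand-picked subset of the Penn Treebank -> universal tag mapping.
# Illustrative only: the full mapping covers many more tags.
PENN_TO_UNIVERSAL = {
    "JJ": "ADJ",
    "IN": "ADP",
    "RB": "ADV",
    "CC": "CONJ",
    "DT": "DET",
    "NN": "NOUN",
    "NNS": "NOUN",
    "CD": "NUM",
    "RP": "PRT",
    "PRP": "PRON",
    "VBD": "VERB",
    "VBG": "VERB",
    ".": ".",
}


def to_universal(tagged_tokens):
    """Convert (word, Penn tag) pairs to coarse tags (hypothetical helper).

    Tags missing from the subset above are passed through unchanged.
    """
    return [(word, PENN_TO_UNIVERSAL.get(tag, tag)) for word, tag in tagged_tokens]


# Hand-written pairs standing in for nltk.pos_tag output.
tagged = [("The", "DT"), ("rabbit", "NN"), ("was", "VBD"),
          ("running", "VBG"), ("quickly", "RB")]

coarse = to_universal(tagged)
print(coarse)
```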
WordNet

• In lemmatization, we mentioned a process similar to looking up a word in a dictionary. WordNet is the resource used for that lookup.
• WordNet is similar to a database or a dictionary of links and relationships between words.

WordNet

WordNet is available through Python's Natural Language Toolkit. It is a large lexical database of English nouns, adjectives, adverbs and verbs.

WordNet has been used for a number of purposes in information systems, including:

• Word-sense disambiguation
• Information retrieval
• Automatic text classification
• Automatic text summarization
• Machine translation
Example (Synsets and Lemmas)

In WordNet, words with similar meanings are grouped into a set known as a Synset.

Every Synset has a name, a part of speech, and a number. The words in a Synset are known as Lemmas.

code

• The function wordnet.synsets("word") returns a list containing all the Synsets related to the word passed to it as the argument.

from nltk.corpus import wordnet

synsets = wordnet.synsets("room")
print(synsets)

Output

[Synset('room.n.01'), Synset('room.n.02'), Synset('room.n.03'), Synset('room.n.04'), Synset('board.v.02')]

• The first four are named 'room' and are nouns, while the last one is named 'board' and is a verb.

• This also suggests that the word 'room' has a total of five meanings or contexts.
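
The name, part of speech, and number can be read straight out of a Synset's name string. As a minimal sketch, the snippet below splits the five names from the output above and counts them by part of speech. `split_synset_name` is a hypothetical helper written for this illustration, not part of NLTK.

```python
from collections import Counter

# Synset names copied from the output shown above.
names = ["room.n.01", "room.n.02", "room.n.03", "room.n.04", "board.v.02"]


def split_synset_name(name):
    """Split 'room.n.01' into ('room', 'n', 1) (hypothetical helper)."""
    # Split from the right so a lemma that itself contains a dot stays whole.
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, pos, int(number)


parts = [split_synset_name(name) for name in names]
pos_counts = Counter(pos for _, pos, _ in parts)

print(parts)
print(pos_counts)
```

The counts match the reading above: four noun senses and one verb sense.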
WordNet

from nltk.corpus import wordnet

word = "hungry"
synset = wordnet.synsets(word)[0]

print("Name: " + synset.name())
print("Description: " + synset.definition())
print("Antonym: " + synset.lemmas()[0].antonyms()[0].name())
print("Examples: " + synset.examples()[0])

Try it out yourself

• Code:
https://colab.research.google.com/drive/1wLjqqi4aLEY2PWDcpax-_4tCyh946yVQ
• Parts of Speech tagger:
https://parts-of-speech.info/
• WordNet search:
http://wordnetweb.princeton.edu/perl/webwn
Task #1

• Read a PDF file using the PyPDF2 library, extract the text from the first page, tokenize it into sentences, and then tag it with the Parts of Speech tagger.

Task #2

• Using stemming, transform a word to its root form.

Thank you for your attention!

References

• https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b
• https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258
• https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
• https://www.nltk.org/book/ch05.html#tab-universal-tagset
