
OFFICIAL (CLOSED) \ NON-SENSITIVE

Natural Language Processing (NLP)
Mr Hew Ka Kian
[email protected]

What Is Natural Language Processing (NLP)?


• Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that makes human language intelligible to machines.
• NLP combines the power of linguistics and computer science to study the rules and structure of language, and create intelligent systems capable of understanding, analyzing, and extracting meaning from text and speech.


What Is Natural Language Processing (NLP)?


• Natural language processing (NLP) seeks to convert unstructured language data into a structured data format to enable machines to understand speech and text and formulate relevant, contextual responses.
• Its subtopics include natural language understanding and natural language generation.
• Natural language understanding (NLU) focuses on machine comprehension through grammar and context, enabling it to determine the intended meaning.
• Natural language generation (NLG) focuses on text generation, or the construction of text, based on a given dataset.
• NLU and NLG are components of NLP.

Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/

NLP Benefits
• Some of the many benefits of NLP are:
• Perform large-scale analysis. Natural Language Processing
helps machines automatically understand and analyze huge
amounts of unstructured text data, like social media
comments, customer support tickets, online reviews, news
reports, and more.
• Automate processes in real-time. Natural language processing
tools can help machines learn to sort and route information
with little to no human interaction – quickly, efficiently,
accurately, and around the clock.


How Does NLP Work?


In general, NLP techniques include 4 major steps:
• Lexical Analysis: The process of splitting a sentence into words or small units called “tokens”.
• Syntactic Analysis: The process of identifying the relationship between the different words and
phrases within a sentence, and expressing the relationships in a hierarchical structure.
• Semantic Analysis: The process of relating syntactic structures, from the phrases, sentences and
paragraphs to the writing as a whole, to find the intended meanings.
• Output Transformation: The process of generating an
output based on the semantic analysis of the text or
speech.
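• A minimal sketch of the first three steps using spaCy, the library used in the exercises later in this worksheet (the example sentence here is an illustrative assumption, not part of the worksheet):

import spacy

nlp = spacy.load('en_core_web_md')            # model loaded later in this worksheet
doc = nlp("London is the capital of England.")

for token in doc:
    # lexical analysis: token.text; syntactic analysis: token.pos_ and token.dep_;
    # the lemma is one input to the later semantic steps
    print(token.text, token.pos_, token.dep_, token.lemma_)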


Tokenization
• Tokenization is an essential task in natural language processing used to break up a
string of words into units called tokens.
• Sentence tokenization splits sentences within a text, and word tokenization splits
words within a sentence.
• Generally, word tokens are separated by blank spaces, and sentence tokens by full stops (periods).
• However, you can perform high-level tokenization for more complex structures,
like words that often go together, otherwise known as collocations (e.g., New
York).
• An example of how word tokenization simplifies text:
• Customer service couldn’t be better! ->
“customer service” “could” “not” “be” “better”.
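• One possible way to tokenize the example above with spaCy (a sketch; the exact tokens depend on the tokenizer used, e.g. spaCy splits the contraction into "could" and "n't" rather than "not"):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Customer service couldn't be better!")

print([token.text for token in doc])       # word tokens
print([sent.text for sent in doc.sents])   # sentence tokens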


Part-of-speech (PoS) tagging


• A part-of-speech (PoS) tag is the category of the token. Some common PoS tags are verb, adjective, noun, pronoun, conjunction, preposition, and interjection, among others. For a revision of English parts of speech, visit http://www.butte.edu/departments/cas/tipsheets/grammar/parts_of_speech.html .
• For the sentence “London is the capital of England”, the surrounding words are examined to arrive at the conclusion that “London” is the proper noun (a proper noun is a specific, not generic, name for a particular person, place, or thing).
• PoS tagging is useful for identifying relationships between words and, therefore, understanding the meaning of sentences.
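• A short sketch of PoS tagging the sentence above with spaCy (the tag names shown are the ones the spaCy model uses):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("London is the capital of England")

for token in doc:
    print(token.text, token.pos_)   # e.g. "London" is tagged PROPN (proper noun)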

Source: https://medium.com/@ritidass29/the-essential-guide-to-how-nlp-works-4d3bb23faf76

Dependency Parsing
• Dependency grammar refers to the way the words in a sentence are connected. A dependency
parser, therefore, analyzes how ‘head words’ are related and modified by other words.
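• A sketch of dependency parsing with spaCy: token.dep_ gives the dependency label and token.head gives the head word the token is attached to (the sentence is an illustrative assumption):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("London is the capital of England")

for token in doc:
    # each token, its dependency label, and the head word it modifies
    print(token.text, '-->', token.dep_, '-->', token.head.text)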


Lemmatization & Stemming


• When we speak or write, we tend to use inflected forms of a word (words in their different
grammatical forms). To make these words easier for computers to understand, NLP uses
lemmatization and stemming to transform them back to their root form.
• The word as it appears in the dictionary – its root form – is called a lemma. For example, the
terms "is, are, am, were, and been,” are grouped under the lemma ‘be.’
• So, if we apply this lemmatization to “African elephants have four nails on their front feet,” the
result will look something like this:
• African elephants have four nails on their front feet ->
African elephant have 4 nail on their foot
• Lemmatization changes the words of the sentence to their base form (e.g., the word "feet" was changed to "foot").


Lemmatization & Stemming


• When using stemming, the root form of a word is called a stem. Stemming "trims" words, so word stems
may not always be semantically correct.
• For example, stemming the words “develop,” “developed,” “developing,” and “development” would result in
the root form “develop.”
• Lemmatization is dictionary-based and chooses the appropriate lemma based on context, but stemming operates on single words without considering the context.
• For example, in the sentence “This is better”, the word “better” is transformed into the word “good” by a lemmatizer but is unchanged by stemming.
• Even though stemmers can lead to less-accurate results, they are easier to build and perform faster than lemmatizers. But lemmatizers are recommended if you're seeking more precise linguistic rules.
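• A sketch comparing the two approaches; the Porter stemmer from NLTK (p_stemmer) is the stemmer used in Exercise C, and the exact lemma produced for "better" depends on the model and the context it sees:

import spacy
from nltk.stem.porter import PorterStemmer

nlp = spacy.load('en_core_web_md')
p_stemmer = PorterStemmer()

for word in ["develop", "developed", "developing", "development"]:
    print(word, '-->', p_stemmer.stem(word))   # all stem to "develop"

doc = nlp("This is better")
print(doc[2].text, '--> stem:', p_stemmer.stem(doc[2].text),
      '| lemma:', doc[2].lemma_)
# the stem stays "better"; the lemmatizer may map it to "good", as described above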

Stopword Removal
• Stopwords are high-frequency words that add little or no semantic value to a sentence, for
example, which, to, at, for, is, etc. Removing stopwords is an essential step in NLP text processing.
• You can even customize lists of stopwords to include words that you want to ignore.
• Let’s say you want to classify customer service tickets based on their topics. In this example:
“Hello, I’m having trouble logging in with my new password”,
it may be useful to remove stopwords like “hello”, “I”, “am”, “with”, “my”, so you’re left with the
words that help you understand the topic of the ticket: “trouble”, “logging in”, “new”, “password”.
• Hello, I’m having trouble logging in with my new password ->
“trouble” “logging in” “new” “password”
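• A sketch of stopword removal with spaCy; the default stopword list can be customized (e.g. adding "hello" to nlp.Defaults.stop_words), so the exact output depends on that list:

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Hello, I'm having trouble logging in with my new password")

print([token.text for token in doc if not token.is_stop and not token.is_punct])
# roughly: ['Hello', 'trouble', 'logging', 'new', 'password'] with the default list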


Student Activity
Exercise A:
• What is the type of the token 'fund’ in 'Two companies pledge up to
$2 million to fund the Republic Polytechnic (RP) start-ups’?

• ‘fund’ can be a verb or a noun, and it is correctly identified as a verb in this case.


Student Activity
Exercise A: Explain other terms
• Print the explanation for PART and neg
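• One way to print these explanations is spaCy's built-in glossary (a sketch; it assumes the spacy module has already been imported):

print(spacy.explain('PART'))   # particle
print(spacy.explain('neg'))    # negation modifier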


Student Activity
Exercise A:
• Can you guess what 'X', 'd' and 'x' mean?

• 'X' is an uppercase letter
• 'x' is a lowercase letter
• 'd' is a digit
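• These symbols follow the conventions of spaCy's token shape feature; a minimal sketch of printing shapes (the example text is an assumption, not from the exercise):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Apple is buying a U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.shape_)   # e.g. Apple -> Xxxxx, 1 -> d, startup -> xxxx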


Student Activity
Exercise A:
• Which tokens are stopwords?

• is
• n’t


Student Activity
Exercise A: Python is not python?
• Print out the lemma and PoS of the two occurrences of "python".

print("First python lemma is %s and PoS is %s" % (p1[1].lemma_, p1[1].pos_))
print("Second Python lemma is %s and PoS is %s" % (p1[7].lemma_, p1[7].pos_))

• First python lemma is python and PoS is NOUN
• Second Python lemma is Python and PoS is PROPN


Student Activity
Exercise B: How many tokens are there compared to the named entities?
• Print out the number of tokens

print("Tokens:", len(doc7) )

• Print out the number of named entities.

print("Named entities:", len(doc7.ents) )

• Often there are more tokens than named entities



Student Activity
Exercise B: What are the noun chunks?
• Print out the noun chunks and the noun for Singapore tech startups open
up to having staff work from anywhere

for chunk in doc9.noun_chunks:
    print("Noun chunk:", chunk.text, "- noun:", chunk.root.text)

• Only chunks built around a noun are extracted; tokens that are not part of any noun chunk are ignored.

Singapore tech startups open up to having staff work from anywhere



Student Activity
Exercise B: Any other ways to say it?
• You should have tried different ways to say “It’s a warm summer day”.
• Examples of some of them, with their similarity scores, are shown below:
• Similarity: 0.912
doc13 = nlp("A hot summer day")
similarity = doc11.similarity(doc13)
print(similarity)
• Similarity: 0.885
doc13 = nlp("what a nice day")
• Similarity: 0.893
doc13 = nlp("It is a good day")
• A chatbot may treat similarity of 0.885 and above as similar and reply with the
same response to the above sentences. The reply may be “I would stay out”
• If your app is to check plagiarism, you may only consider similarity of 0.95 or
higher for plagiarism
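• A self-contained sketch of one of these comparisons (doc11 and doc13 follow the variable names used in the exercise; the md model is assumed because it ships with word vectors):

import spacy

nlp = spacy.load('en_core_web_md')
doc11 = nlp("It's a warm summer day")
doc13 = nlp("A hot summer day")

print(doc11.similarity(doc13))   # roughly the 0.912 reported above; exact value depends on the model version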


Student Activity
Exercise C: Print the stem of words2 and words3

for word in words2:
    print(word, '\t-->\t', p_stemmer.stem(word))

for word in words3:
    print(word, '\t-->\t', p_stemmer.stem(word))


Problem: Analyze National Pledge


• Remember to import the library and load the model
import spacy
nlp = spacy.load('en_core_web_md')

• Create a Doc object from a file


with open('OurPledge.txt') as f:
doc = nlp(f.read())

• The length of the Doc object gives the number of tokens


len(doc)


Problem: Analyze National Pledge


• To get all the sentences, build a list and then add each sentence
sents = [] # Create a list
for sent in doc.sents: # Append each sentence to the list
sents.append(sent)

• Since the text contains only one sentence, we access only sents[0]


print(sents[0].text)

• Print each token, lined up neatly


for token in sents[0]:
# f'{object_to_print:{minimum_characters}}
print(f'{token.text:{15}} {token.pos_:{5}} {token.dep_:{10}}\
{token.lemma_:{15}}')
