
OFFICIAL (CLOSED) \ NON-SENSITIVE

Natural Language Processing (NLP)
Mr Hew Ka Kian
[email protected]

What Is Natural Language Processing (NLP)?


• Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that makes human language intelligible to machines.
• NLP combines the power of linguistics and computer science to study the rules and structure of language, and create intelligent systems capable of understanding, analyzing, and extracting meaning from text and speech.


What Is Natural Language Processing (NLP)?


• Natural language processing (NLP) seeks to convert unstructured language data into a structured data format to enable machines to understand speech and text and formulate relevant, contextual responses.
• Its subtopics include natural language understanding and natural language generation.
• Natural language understanding (NLU) focuses on machine comprehension through grammar and context, enabling it to determine the intended meaning.
• Natural language generation (NLG) focuses on text generation, or the construction of text, based on a given dataset.
• NLU and NLG are components of NLP.

Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/

NLP Benefits
• Some of the many benefits of NLP are:
• Perform large-scale analysis. Natural Language Processing
helps machines automatically understand and analyze huge
amounts of unstructured text data, like social media
comments, customer support tickets, online reviews, news
reports, and more.
• Automate processes in real-time. Natural language processing
tools can help machines learn to sort and route information
with little to no human interaction – quickly, efficiently,
accurately, and around the clock.


How Does NLP Work?


In general, NLP techniques include 4 major steps:
• Lexical Analysis: The process of splitting a sentence into words or small units called “tokens”.
• Syntactic Analysis: The process of identifying the relationship between the different words and
phrases within a sentence, and expressing the relationships in a hierarchical structure.
• Semantic Analysis: The process of relating syntactic structures, from the phrases, sentences and
paragraphs to the writing as a whole, to find the intended meanings.
• Output Transformation: The process of generating an
output based on the semantic analysis of the text or
speech.
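• A minimal sketch of the first three steps using spaCy, the library used in the exercises later in this worksheet (the example sentence here is an illustrative assumption, not part of the worksheet):

import spacy

nlp = spacy.load('en_core_web_md')            # model loaded later in this worksheet
doc = nlp("London is the capital of England.")

for token in doc:
    # lexical analysis: token.text; syntactic analysis: token.pos_ and token.dep_;
    # the lemma is one input to the later semantic steps
    print(token.text, token.pos_, token.dep_, token.lemma_)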


Tokenization
• Tokenization is an essential task in natural language processing used to break up a
string of words into units called tokens.
• Sentence tokenization splits sentences within a text, and word tokenization splits
words within a sentence.
• Generally, word tokens are separated by blank spaces, and sentence tokens by full stops (periods).
• However, you can perform high-level tokenization for more complex structures,
like words that often go together, otherwise known as collocations (e.g., New
York).
• An example of how word tokenization simplifies text:
• Customer service couldn’t be better! ->
“customer service” “could” “not” “be” “better”.
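• One possible way to tokenize the example above with spaCy (a sketch; the exact tokens depend on the tokenizer used, e.g. spaCy splits the contraction into "could" and "n't" rather than "not"):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Customer service couldn't be better!")

print([token.text for token in doc])       # word tokens
print([sent.text for sent in doc.sents])   # sentence tokens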


Part-of-speech (PoS) tagging


• A part-of-speech (PoS) tag is the category of the token. Some common PoS tags are verb, adjective, noun, pronoun, conjunction, preposition, and interjection, among others. For a revision of English parts of speech, visit http://www.butte.edu/departments/cas/tipsheets/grammar/parts_of_speech.html .
• For the sentence “London is the capital of England”, the surrounding words are examined to arrive at the conclusion that “London” is the proper noun (a proper noun is a specific, not generic, name for a particular person, place, or thing).
• PoS tagging is useful for identifying relationships between words and, therefore, understanding the meaning of sentences.
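• A short sketch of PoS tagging the sentence above with spaCy (the tag names shown are the ones the spaCy model uses):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("London is the capital of England")

for token in doc:
    print(token.text, token.pos_)   # e.g. "London" is tagged PROPN (proper noun)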

Source: https://medium.com/@ritidass29/the-essential-guide-to-how-nlp-works-4d3bb23faf76

Dependency Parsing
• Dependency grammar refers to the way the words in a sentence are connected. A dependency
parser, therefore, analyzes how ‘head words’ are related and modified by other words.
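• A sketch of dependency parsing with spaCy: token.dep_ gives the dependency label and token.head gives the head word the token is attached to (the sentence is an illustrative assumption):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("London is the capital of England")

for token in doc:
    # each token, its dependency label, and the head word it modifies
    print(token.text, '-->', token.dep_, '-->', token.head.text)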


Lemmatization & Stemming


• When we speak or write, we tend to use inflected forms of a word (words in their different
grammatical forms). To make these words easier for computers to understand, NLP uses
lemmatization and stemming to transform them back to their root form.
• The word as it appears in the dictionary – its root form – is called a lemma. For example, the
terms "is, are, am, were, and been,” are grouped under the lemma ‘be.’
• So, if we apply this lemmatization to “African elephants have four nails on their front feet,” the
result will look something like this:
• African elephants have four nails on their front feet ->
African elephant have 4 nail on their foot
• Lemmatization changes the words of the sentence to their base form (e.g., the word "feet" was changed to "foot").


Lemmatization & Stemming


• When using stemming, the root form of a word is called a stem. Stemming "trims" words, so word stems
may not always be semantically correct.
• For example, stemming the words “develop,” “developed,” “developing,” and “development” would result in
the root form “develop.”
• Lemmatization is dictionary-based and chooses the appropriate lemma based on context, but stemming operates on single words without considering the context.
• For example, in the sentence “This is better”, the word “better” is transformed into the word “good” by a lemmatizer but is unchanged by stemming.
• Even though stemmers can lead to less-accurate results, they are easier to build and perform faster than lemmatizers. But lemmatizers are recommended if you're seeking more precise linguistic rules.
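• A sketch comparing the two approaches; the Porter stemmer from NLTK (p_stemmer) is the stemmer used in Exercise C, and the exact lemma produced for "better" depends on the model and the context it sees:

import spacy
from nltk.stem.porter import PorterStemmer

nlp = spacy.load('en_core_web_md')
p_stemmer = PorterStemmer()

for word in ["develop", "developed", "developing", "development"]:
    print(word, '-->', p_stemmer.stem(word))   # all stem to "develop"

doc = nlp("This is better")
print(doc[2].text, '--> stem:', p_stemmer.stem(doc[2].text),
      '| lemma:', doc[2].lemma_)
# the stem stays "better"; the lemmatizer may map it to "good", as described above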

Stopword Removal
• Stopwords are high-frequency words that add little or no semantic value to a sentence, for
example, which, to, at, for, is, etc. Removing stopwords is an essential step in NLP text processing.
• You can even customize lists of stopwords to include words that you want to ignore.
• Let’s say you want to classify customer service tickets based on their topics. In this example:
“Hello, I’m having trouble logging in with my new password”,
it may be useful to remove stopwords like “hello”, “I”, “am”, “with”, “my”, so you’re left with the
words that help you understand the topic of the ticket: “trouble”, “logging in”, “new”, “password”.
• Hello, I’m having trouble logging in with my new password ->
“trouble” “logging in” “new” “password”
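• A sketch of stopword removal with spaCy; the default stopword list can be customized (e.g. adding "hello" to nlp.Defaults.stop_words), so the exact output depends on that list:

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Hello, I'm having trouble logging in with my new password")

print([token.text for token in doc if not token.is_stop and not token.is_punct])
# roughly: ['Hello', 'trouble', 'logging', 'new', 'password'] with the default list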


Student Activity
Exercise A:
• What is the type of the token 'fund’ in 'Two companies pledge up to
$2 million to fund the Republic Polytechnic (RP) start-ups’?

• ‘fund’ can be a verb or a noun, and it is correctly identified as a verb in this case.


Student Activity
Exercise A: Explain other terms
• Print the explanation for PART and neg
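• One way to print these explanations is spaCy's built-in glossary (a sketch; it assumes the spacy module has already been imported):

print(spacy.explain('PART'))   # particle
print(spacy.explain('neg'))    # negation modifier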


Student Activity
Exercise A:
• Can you guess what 'X', 'd' and 'x' mean?

• 'X' is an uppercase letter
• 'x' is a lowercase letter
• 'd' is a digit
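• These symbols follow the conventions of spaCy's token shape feature; a minimal sketch of printing shapes (the example text is an assumption, not from the exercise):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Apple is buying a U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.shape_)   # e.g. Apple -> Xxxxx, 1 -> d, startup -> xxxx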


Student Activity
Exercise A:
• Which tokens are stopwords?

• is
• n’t


Student Activity
Exercise A: Python is not python?
• Print out the lemma and PoS of the two occurrences of "python".

print("First python lemma is %s and PoS is %s" % (p1[1].lemma_, p1[1].pos_))
print("Second Python lemma is %s and PoS is %s" % (p1[7].lemma_, p1[7].pos_))

• First python lemma is python and PoS is NOUN
• Second Python lemma is Python and PoS is PROPN


Student Activity
Exercise B: How many tokens are there compared to the named entities?
• Print out the number of tokens

print("Tokens:", len(doc7) )

• Print out the number of named entities.

print("Named entities:", len(doc7.ents) )

• Often there are more tokens than named entities



Student Activity
Exercise B: What are the noun chunks?
• Print out the noun chunks and the noun for Singapore tech startups open
up to having staff work from anywhere

for chunk in doc9.noun_chunks:
    print("Noun chunk:", chunk.text, "- noun:", chunk.root.text)

• Only chunks built around a noun are extracted; tokens that are not part of any noun chunk are ignored.

Singapore tech startups open up to having staff work from anywhere



Student Activity
Exercise B: Any other ways to say it?
• You should have tried different ways to say “It’s a warm summer day”.
• Examples of some of them, with their similarity scores, are shown below:
• Similarity: 0.912
doc13 = nlp("A hot summer day")
similarity = doc11.similarity(doc13)
print(similarity)
• Similarity: 0.885
doc13 = nlp("what a nice day")
• Similarity: 0.893
doc13 = nlp("It is a good day")
• A chatbot may treat similarity of 0.885 and above as similar and reply with the
same response to the above sentences. The reply may be “I would stay out”
• If your app is to check plagiarism, you may only consider similarity of 0.95 or
higher for plagiarism
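• A self-contained sketch of one of these comparisons (doc11 and doc13 follow the variable names used in the exercise; the md model is assumed because it ships with word vectors):

import spacy

nlp = spacy.load('en_core_web_md')
doc11 = nlp("It's a warm summer day")
doc13 = nlp("A hot summer day")

print(doc11.similarity(doc13))   # roughly the 0.912 reported above; exact value depends on the model version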


Student Activity
Exercise C: Print the stem of words2 and words3

for word in words2:
    print(word, '\t-->\t', p_stemmer.stem(word))

for word in words3:
    print(word, '\t-->\t', p_stemmer.stem(word))


Problem: Analyze National Pledge


• Remember to import the library and load the model
import spacy
nlp = spacy.load('en_core_web_md')

• Create a Doc object from a file


with open('OurPledge.txt') as f:
doc = nlp(f.read())

• The length of the Doc object gives the number of tokens


len(doc)


Problem: Analyze National Pledge


• To get all the sentences, build a list and then add each sentence
sents = [] # Create a list
for sent in doc.sents: # Append each sentence to the list
sents.append(sent)

• Since the text contains only one sentence, we access only sents[0]


print(sents[0].text)

• Print each token, lined up neatly


for token in sents[0]:
# f'{object_to_print:{minimum_characters}}
print(f'{token.text:{15}} {token.pos_:{5}} {token.dep_:{10}}\
{token.lemma_:{15}}')
