Worksheet Notes
Natural Language Processing (NLP)
Mr Hew Ka Kian
[email protected]
OFFICIAL (CLOSED) \ NON-SENSITIVE
Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
NLP Benefits
• Some of the many benefits of NLP are:
• Perform large-scale analysis. Natural Language Processing
helps machines automatically understand and analyze huge
amounts of unstructured text data, like social media
comments, customer support tickets, online reviews, news
reports, and more.
• Automate processes in real time. Natural language processing
tools can help machines learn to sort and route information
with little to no human interaction – quickly, efficiently,
accurately, and around the clock.
Tokenization
• Tokenization is an essential task in natural language processing used to break up a
string of words into units called tokens.
• Sentence tokenization splits sentences within a text, and word tokenization splits
words within a sentence.
• Generally, word tokens are separated by blank spaces, and sentence tokens by
full stops (periods).
• However, you can perform high-level tokenization for more complex structures,
like words that often go together, otherwise known as collocations (e.g., New
York).
• An example of how word tokenization simplifies text:
• Customer service couldn’t be better! ->
“customer service” “could” “not” “be” “better”.
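The splitting rules above can be sketched in plain Python. This is a simplified illustration using only the standard `re` module, not the tokenizer of any particular NLP library. The function names `sentence_tokenize` and `word_tokenize`, the second sentence of the sample text, and the tiny contraction table are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny contraction table for this sketch.
CONTRACTIONS = {"couldn't": ["could", "not"], "isn't": ["is", "not"]}


def sentence_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_tokenize(sentence):
    """Lowercase, split on blank spaces, strip punctuation, expand contractions."""
    tokens = []
    for raw in sentence.lower().split():
        # Normalize curly apostrophes, then drop surrounding punctuation.
        word = raw.replace("’", "'").strip(".,!?\"“”")
        if not word:
            continue
        tokens.extend(CONTRACTIONS.get(word, [word]))
    return tokens


text = "Customer service couldn’t be better! I will come back."
sentences = sentence_tokenize(text)
words = word_tokenize(sentences[0])
print(sentences)
print(words)
```

Note that this sketch keeps "customer" and "service" as two separate tokens. Grouping them into the single collocation "customer service", as in the example above, needs an extra step that is not shown here.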
Source: https://medium.com/@ritidass29/the-essential-guide-to-how-nlp-works-4d3bb23faf76
Dependency Parsing
• Dependency grammar refers to the way the words in a sentence are connected. A dependency
parser, therefore, analyzes how ‘head words’ are related to, and modified by, other words.
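To make the idea of head words and their modifiers concrete, here is a small sketch that stores a dependency parse as plain data. The sentence, the relation labels (`nsubj`, `det`, `amod`, `dobj`) and the helper `modifiers_of` are illustrative assumptions. The parse is annotated by hand and is not the output of any parser.

```python
# Hand-annotated parse of "She bought a new laptop" (illustrative only).
# Each entry is (token, relation, head); the root points to itself.
PARSE = [
    ("She", "nsubj", "bought"),
    ("bought", "ROOT", "bought"),
    ("a", "det", "laptop"),
    ("new", "amod", "laptop"),
    ("laptop", "dobj", "bought"),
]


def modifiers_of(head, parse):
    """Return (token, relation) pairs for the words attached to a head word."""
    return [(tok, rel) for tok, rel, h in parse if h == head and tok != head]


laptop_modifiers = modifiers_of("laptop", PARSE)
verb_modifiers = modifiers_of("bought", PARSE)
print(laptop_modifiers)
print(verb_modifiers)
```

Here "laptop" is the head of "a" and "new", while "bought" is the head of the whole sentence.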
Stopword Removal
• Stopwords are high-frequency words that add little or no semantic value to a sentence, for
example, which, to, at, for, is, etc. Removing stopwords is an essential step in NLP text processing.
• You can even customize lists of stopwords to include words that you want to ignore.
• Let’s say you want to classify customer service tickets based on their topics. In this example:
“Hello, I’m having trouble logging in with my new password”,
it may be useful to remove stopwords like “hello”, “I”, “am”, “with”, “my”, so you’re left with the
words that help you understand the topic of the ticket: “trouble”, “logging in”, “new”, “password”.
• Hello, I’m having trouble logging in with my new password ->
“trouble” “logging in” “new” “password”
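The ticket example can be reproduced with a custom stopword list. This is a minimal sketch in plain Python rather than the stopword list of any NLP library. The set `STOPWORDS` and the function `remove_stopwords` are assumptions for illustration. "having" is added to the list so that the output matches the words kept above, and "in" is deliberately left out so that "logging in" survives.

```python
import re

# Hypothetical custom stopword list for this ticket example.
STOPWORDS = {"hello", "i", "am", "'m", "having", "with", "my"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, tokenize it, and drop any token in the stopword set."""
    normalized = text.lower().replace("’", "'")
    # Split "i'm" into "i" and "'m" so each part can be checked separately.
    tokens = re.findall(r"[a-z]+|'[a-z]+", normalized)
    return [t for t in tokens if t not in stopwords]


ticket = "Hello, I’m having trouble logging in with my new password"
kept = remove_stopwords(ticket)
print(kept)
```

Many general-purpose stopword lists include "in", which is one reason to customize the list for the task at hand.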
Student Activity
Exercise A:
• What is the type of the token 'fund' in 'Two companies pledge up to
$2 million to fund the Republic Polytechnic (RP) start-ups'?
Student Activity
Exercise A: Explain other terms
• Print the explanation for PART and neg
Student Activity
Exercise A:
• Can you guess what 'X', 'd' and 'x' mean?
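Assuming 'X', 'd' and 'x' refer to a token's word shape (the feature spaCy exposes as `Token.shape_`), the mapping can be sketched in a few lines of plain Python. The function `word_shape` is a hypothetical helper written for this note, not a library call. Libraries may also shorten long runs of the same symbol, which this sketch does not do.

```python
def word_shape(token):
    """Map each character to a shape symbol; other characters are kept as-is."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")  # uppercase letter
        elif ch.islower():
            shape.append("x")  # lowercase letter
        elif ch.isdigit():
            shape.append("d")  # digit
        else:
            shape.append(ch)   # punctuation and symbols are unchanged
    return "".join(shape)


for tok in ["Two", "$", "2", "RP", "fund"]:
    print(tok, "->", word_shape(tok))
```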
Student Activity
Exercise A:
• Which tokens are stopwords?
• is
• n’t
Student Activity
Exercise A: Python is not python?
• Print out the number of tokens and named entities.
print("First python lemma is %s and PoS is %s" % (p1[1].lemma_, p1[1].pos_))
print("Second Python lemma is %s and PoS is %s" % (p1[7].lemma_, p1[7].pos_))
Student Activity
Exercise B: How many tokens are there compared to the named entities?
• Print out the number of tokens
print("Tokens:", len(doc7))
Student Activity
Exercise B: What are the noun chunks?
• Print out the noun chunks and the noun for 'Singapore tech startups open
up to having staff work from anywhere'
• Only chunks built around a noun are extracted. Tokens not related to a
noun are ignored.
Student Activity
Exercise B: Any other ways to say it?
• You should have tried different ways to say 'It’s a warm summer day'
• Some examples, with their similarity scores, are below:
• Similarity: 0.912
doc13 = nlp("A hot summer day")
similarity = doc11.similarity(doc13)
print(similarity)
• Similarity: 0.885
doc13 = nlp("what a nice day")
• Similarity: 0.893
doc13 = nlp("It is a good day")
• A chatbot may treat a similarity of 0.885 and above as similar and give the
same response to all of the above sentences. The reply may be “I would stay out”.
• If your app checks for plagiarism, you may want to flag only a similarity of
0.95 or higher.
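The two thresholds can be applied with a simple comparison. The sketch below uses the scores listed above. The constant names and the helpers `cosine_similarity` and `is_match` are assumptions for illustration. Scores of this kind are commonly computed as the cosine similarity between two vectors, so a small pure-Python version is included to show the formula, applied here to toy vectors rather than real word vectors.

```python
import math

CHATBOT_THRESHOLD = 0.885     # treat as "the same question" at or above this
PLAGIARISM_THRESHOLD = 0.95   # stricter cut-off for flagging plagiarism


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_match(score, threshold):
    """Return True when the similarity score reaches the threshold."""
    return score >= threshold


# Similarity scores taken from the examples above.
scores = {
    "A hot summer day": 0.912,
    "what a nice day": 0.885,
    "It is a good day": 0.893,
}

for sentence, score in scores.items():
    print(sentence,
          "| chatbot match:", is_match(score, CHATBOT_THRESHOLD),
          "| plagiarism flag:", is_match(score, PLAGIARISM_THRESHOLD))
```

All three sentences pass the chatbot threshold, but none reaches the stricter plagiarism threshold.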
Student Activity
Exercise C: Print the stem of words2 and words3