Chapter 7.1 - Introducing Natural Language Processing
Agenda
Natural language processing (NLP)
“Alexa, what’s it like outside?”
What is NLP?
NLP challenges
Preprocessing text
Sample Preprocessing
Creating tokens and feature engineering
Example NLP model: Bag of words
Text analysis categories
Capture context
Understanding the context of text is a major challenge for NLP:
• Tagging words with the appropriate part of speech helps to capture context
• NLP libraries provide token functions to help with tagging
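The tagging idea above can be sketched with a toy lookup-based tagger. This is only an illustration: the tiny tag dictionary below is made up, and real libraries (e.g. NLTK, spaCy) use trained statistical taggers instead of lookups.

```python
# Toy part-of-speech tagger: look each token up in a hand-made dictionary.
# Real NLP libraries use trained models; this dictionary is illustrative only.
TAG_DICT = {
    "i": "PRON", "do": "AUX", "n't": "ADV",
    "like": "VERB", "eggs": "NOUN", ".": "PUNCT",
}

def tag_tokens(tokens):
    # Unknown words fall back to NOUN, a common default heuristic.
    return [(t, TAG_DICT.get(t.lower(), "NOUN")) for t in tokens]

print(tag_tokens(["I", "do", "n't", "like", "eggs", "."]))
```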
Introduction to Natural Language Processing (NLP)
Some NLP Terms
[Figure: example ambiguous words “party”, “sense”, “land”]
Text Pre-processing
Tokenization
• Example:
  Sentence: “I don’t like eggs.”
  Tokens: “I”, “do”, “n’t”, “like”, “eggs”, “.”
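The split of “don’t” into “do” + “n’t” follows Penn Treebank conventions (NLTK’s tokenizers behave this way). A minimal regex-based sketch of that behavior, sufficient for this example but far simpler than a real tokenizer:

```python
import re

def simple_tokenize(sentence):
    # Split contractions like "don't" into "do" + "n't" (Penn Treebank style),
    # then separate punctuation from words. A toy sketch, not a full tokenizer.
    sentence = re.sub(r"n't\b", " n't", sentence)
    return re.findall(r"n't|\w+|[^\w\s]", sentence)

print(simple_tokenize("I don't like eggs."))
# -> ['I', 'do', "n't", 'like', 'eggs', '.']
```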
Stop Words Removal
• Stop words: words that appear frequently in text but contribute little to the overall meaning.
• Common stop words: “a”, “the”, “so”, “is”, “it”, “at”, “in”, “this”, “there”, “that”, “my”
• Example: original sentence vs. the same sentence without stop words
Stop Words Removal
• Stop words from the Natural Language Toolkit (NLTK) library:
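A minimal removal sketch using the stop-word list from the slide (NLTK’s full list comes from nltk.corpus.stopwords and requires a one-time download, so a hand-coded set is used here):

```python
# Stop-word list taken from the slide; NLTK's list is much longer.
STOP_WORDS = {"a", "the", "so", "is", "it", "at", "in", "this", "there", "that", "my"}

def remove_stop_words(tokens):
    # Keep only tokens that are not stop words (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["It", "is", "my", "old", "dog"]))
# -> ['old', 'dog']
```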
Stemming
Lemmatization
Stemming vs. Lemmatization
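The contrast can be sketched in a few lines. Both the toy suffix-stripper and the mini lemma dictionary below are made up for illustration; for real use, NLTK provides nltk.stem.PorterStemmer and nltk.stem.WordNetLemmatizer.

```python
# Stemming: crude suffix stripping; may produce non-words ("studies" -> "stud").
def naive_stem(word):
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization: dictionary lookup to a valid base form ("studies" -> "study").
LEMMAS = {"studies": "study", "ran": "run", "better": "good"}

def naive_lemmatize(word):
    return LEMMAS.get(word, word)

print(naive_stem("studies"), naive_lemmatize("studies"))
```

The key difference shows in the output: the stemmer chops characters and can leave a non-word, while the lemmatizer maps to an actual dictionary form.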
Text Processing – Hands-on
MIS_451_Natural_Language_Processing_Text_Process.ipynb
Text Vectorization
Bag of Words (BoW)
Example: “It is a dog.” → vector [1, 0, 1, 1, 1, 0, 0, 0, 0]
(counts over the sorted vocabulary “a”, “cat”, “dog”, “is”, “it”, “my”, “not”, “old”, “wolf”)
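The vector above can be reproduced with a minimal bag-of-words implementation. The slide does not show its full corpus, so the three documents below are an assumption, chosen to be consistent with the idf table that follows:

```python
# Minimal bag-of-words: build a sorted vocabulary, then count term occurrences.
corpus = [
    "it is a dog",            # hypothetical documents; the slide's exact
    "my old dog is my cat",   # corpus is not shown
    "it is not a wolf",
]
vocab = sorted({w for doc in corpus for w in doc.split()})
# vocab: ['a', 'cat', 'dog', 'is', 'it', 'my', 'not', 'old', 'wolf']

def bow_vector(doc, vocab=vocab):
    words = doc.split()
    return [words.count(term) for term in vocab]

print(bow_vector("it is a dog"))
# -> [1, 0, 1, 1, 1, 0, 0, 0, 0]
```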
Term Frequency (TF)
Term frequency (TF): gives higher weight to words that occur frequently within a document.

tf(term, doc) = (number of times the term occurs in the doc) / (total number of terms in the doc)
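The formula above translates directly into code:

```python
def tf(term, doc_tokens):
    # term frequency = occurrences of the term / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

doc = "it is a dog".split()
print(tf("dog", doc))   # -> 0.25 (1 occurrence out of 4 terms)
```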
Inverse Document Frequency (IDF)
Inverse document frequency (IDF): decreases the weights of commonly used words and increases the weights of rare words in the vocabulary.

idf(term) = log( n_documents / (n_documents containing the term + 1) ) + 1

Example (3 documents):

term   idf
a      log(3/3)+1 = 1
cat    log(3/2)+1 = 1.18
dog    log(3/3)+1 = 1
is     log(3/4)+1 = 0.87
it     log(3/3)+1 = 1
my     log(3/2)+1 = 1.18
not    log(3/2)+1 = 1.18
old    log(3/2)+1 = 1.18
wolf   log(3/2)+1 = 1.18

e.g. idf(“cat”) = 1.18
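A sketch of the same computation. Two assumptions: the table's values match a base-10 logarithm, and the three-document corpus below is hypothetical, chosen to be consistent with the document counts the table implies:

```python
import math

# Hypothetical corpus consistent with the idf table's document counts.
corpus = [
    "it is a dog".split(),
    "my old dog is my cat".split(),
    "it is not a wolf".split(),
]

def idf(term, corpus=corpus):
    # idf(term) = log10(n_docs / (n_docs containing the term + 1)) + 1
    n_containing = sum(term in doc for doc in corpus)
    return math.log10(len(corpus) / (n_containing + 1)) + 1

print(round(idf("cat"), 2))   # -> 1.18 ("cat" appears in 1 of 3 documents)
print(round(idf("a"), 2))     # -> 1.0  ("a" appears in 2 of 3 documents)
```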
Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF combines term frequency and inverse document frequency:

tfidf(term, doc) = tf(term, doc) * idf(term)
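Putting the two pieces together (same hypothetical corpus and base-10 log as in the idf sketch above; real pipelines typically use a library such as scikit-learn, whose TfidfVectorizer applies a slightly different smoothing formula):

```python
import math

# Hypothetical corpus consistent with the earlier idf table.
corpus = [
    "it is a dog".split(),
    "my old dog is my cat".split(),
    "it is not a wolf".split(),
]

def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus=corpus):
    n_containing = sum(term in doc for doc in corpus)
    return math.log10(len(corpus) / (n_containing + 1)) + 1

def tfidf(term, doc_tokens, corpus=corpus):
    # tfidf(term, doc) = tf(term, doc) * idf(term)
    return tf(term, doc_tokens) * idf(term)

doc = "my old dog is my cat".split()
print(round(tfidf("my", doc), 3))   # -> 0.392 (tf = 2/6, idf = log10(3/2)+1)
```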
N-gram
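An n-gram is a contiguous sequence of n tokens (n = 1: unigrams, n = 2: bigrams, and so on). A minimal sliding-window sketch:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it is a dog".split()
print(ngrams(tokens, 2))
# -> [('it', 'is'), ('is', 'a'), ('a', 'dog')]
```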
Bag of Words – Hands-on
MIS_451_Natural_Language_Processing_Bag_of_Word.ipynb