Chapter 7.1 - Introducing Natural Language Processing

Chapter 7.1 introduces Natural Language Processing (NLP), focusing on data preprocessing, model training, and applications in business contexts. It covers essential concepts such as the Bag of Words model, tokenization, stemming, and lemmatization, along with challenges and use cases of NLP. The chapter also emphasizes the importance of understanding context and provides hands-on exercises for practical learning.


MIS 451: Machine Learning for

Business
Chapter 7.1: Introducing Natural
Language Processing

1
Agenda

• Fundamental understanding of data preprocessing, commonly used machine learning (ML) algorithms, and model evaluation
• Practical knowledge of natural language processing (NLP)-specific model training and applications
• Introduction to NLP and text processing
• Bag of Words (BoW)
• Be comfortable talking with scientist partners
• Practice: Text_Process and Bag_of_Word

2
Natural language processing (NLP)
“Alexa, what’s it like outside?”

3
What is NLP?

• NLP develops computational algorithms to automatically analyze and represent human language.
• By evaluating the structure of language, machine learning systems can process large sets of words, phrases, and sentences.

4
NLP challenges

• Lack of precision
• Many complex dependencies
• Meaning that is based on context
• Lack of structure

5
Natural language processing use cases

• Search applications
• Market and social research
• Human-machine interfaces
• Chatbots


6
Natural language processing flow

7
Preprocessing text

Common preprocessing steps:
• Remove stop words
• Normalize similar text
• Standardize unrecognized text

Other preprocessing steps:
• Encoding
• Spelling and grammar checks

Multiple libraries and tools are available for preprocessing (for example, NLTK for Python).

Sample Preprocessing

8
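These steps can be sketched in plain Python. This is a minimal illustration only; the stop-word list below is a small hand-picked stand-in, not a library list such as NLTK's:

```python
import re

# Hand-picked stop-word list for illustration; a real pipeline would
# load one from a library such as NLTK.
STOP_WORDS = {"is", "a", "the", "it"}

def preprocess(text):
    text = text.lower()                    # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)  # standardize: strip punctuation and digits
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(preprocess("It is a DOG!"))  # → ['dog']
```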
Creating tokens and feature engineering

• Load data by using tokens
• You can use tokens to convert words into items in a DataFrame
• Develop features by applying a model
• Common models include bag of words and term frequency-inverse document frequency (TF-IDF)

Sample token code:

from nltk.tokenize import word_tokenize

text = "this is some sample text."
print(word_tokenize(text))
# Output: ['this', 'is', 'some', 'sample', 'text', '.']

9
Example NLP model: Bag of words

• Create a vector for each sentence or phrase
• Evaluate words in a sentence based on frequency
• The word frequencies form the vector for each sentence or phrase

Example NLP Model

10
Text analysis categories

11
Capture context
Understanding context for the text is
a major challenge for NLP:
• Tagging words with the appropriate
part of speech helps to capture
context
• NLP libraries provide token functions
to help with tagging

12
Introduction to Natural Language
Processing (NLP)
Some NLP Terms

• Corpus: A large collection of words or phrases, which can come from different sources: documents, web sources, databases.
  ▪ Common Crawl Corpus: web crawl data composed of over 5 billion web pages (541 TB)
  ▪ Reddit Submission Corpus: publicly available Reddit submissions (42 GB)
  ▪ Wikipedia XML Data: a complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML (500 GB)
  ▪ Etc.

14
Some NLP Terms

• Token: Words or phrases extracted from documents
• Feature vector: A numeric array that ML models use for different tasks such as training and prediction
15
Machine Learning
with Text Data
Machine Learning with Text Data

• ML models need well-defined numerical data.

Text data → Text preprocessing (cleaning and formatting) → Vectorization (convert to numbers) → Train ML model using numerical data

• Text preprocessing: stop words removal, stemming, lemmatization
• Vectorization: Bag of Words
• ML model: K-Nearest Neighbors (KNN), neural networks, etc.

17
Text Pre-processing
Tokenization

• Splits a text/document into small parts by white space and punctuation.
• Example:

  Sentence: "I don't like eggs."
  Tokens: "I", "do", "n't", "like", "eggs", "."

• These tokens will be used in the next steps of the pipeline.

19
Stop Words Removal

• Stop words: Words that appear frequently in texts but contribute little to the overall meaning.
• Common stop words: "a", "the", "so", "is", "it", "at", "in", "this", "there", "that", "my"
• Example:

  Original sentence: "There is a tree near the house"
  Without stop words: "tree near house"
20
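The removal step can be sketched in plain Python using the common stop words listed above (real systems would use a library list such as NLTK's):

```python
# Small stop-word set taken from the examples above.
STOP_WORDS = {"a", "the", "so", "is", "it", "at", "in", "this",
              "there", "that", "my"}

def remove_stop_words(sentence):
    # Lowercase for matching; keep the surviving words in order.
    return " ".join(w for w in sentence.lower().split()
                    if w not in STOP_WORDS)

print(remove_stop_words("There is a tree near the house"))  # → tree near house
```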
Stop Words Removal

• Stop words from the Natural Language Toolkit (NLTK)* library:
• Assume we have a text classification problem: is a product review positive or negative?
• Is this a good stop words list for this problem? NO. NLTK's list includes negation words such as "not", and removing them can flip the meaning of a review.
* https://www.nltk.org/

23
Stemming

• A set of rules that slices a string to a substring, which usually refers to a more general meaning.
  ▪ The goal is to remove word affixes (particularly suffixes) such as "s", "es", "ing", "ed", etc.
    o "playing", "played", "plays" → "play"
  ▪ The issue: it doesn't usually work with irregular forms such as irregular verbs: "taught", "brought", etc.

24
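A toy rule-based stemmer makes the idea concrete. Real stemmers such as NLTK's PorterStemmer use a much richer rule set; the suffix list and minimum stem length here are illustrative assumptions:

```python
# Minimal suffix-stripping stemmer: try a few common suffixes in order.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        # Require at least a 3-letter stem so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["playing", "played", "plays"]])  # → ['play', 'play', 'play']
print(stem("taught"))  # irregular form: no rule applies → 'taught'
```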
Lemmatization

• Similar to stemming, but more advanced: it uses a look-up dictionary.
  ▪ Handles more situations and usually works better than stemming.
    o "taught", "teaching", "teaches" → "teach"
    o "am", "is", "are" → "be"
  ▪ For the best results, correct part-of-speech tags should be provided: "adjective", "noun", "verb", etc.

25
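The look-up idea can be sketched with a tiny hand-built dictionary, a stand-in for the WordNet-style lexicons that real lemmatizers consult:

```python
# Toy lemma dictionary covering only the slide's examples; real
# lemmatizers use a full lexicon plus part-of-speech tags.
LEMMAS = {
    "taught": "teach", "teaching": "teach", "teaches": "teach",
    "am": "be", "is": "be", "are": "be",
}

def lemmatize(word):
    return LEMMAS.get(word, word)  # unknown words pass through unchanged

print([lemmatize(w) for w in ["taught", "teaching", "teaches"]])  # → ['teach', 'teach', 'teach']
print([lemmatize(w) for w in ["am", "is", "are"]])                # → ['be', 'be', 'be']
```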
Stemming vs. Lemmatization

As we pointed out, lemmatization is a more complex method and usually works better. For example:
• Original sentence: "the children are playing outside. the weather was better yesterday."
  ▪ Stemming → "the children are play outside. the weather was better yesterday"
  ▪ Lemmatization → "the child be play outside. the weather be good yesterday"

26
Text Processing – Hands-on

• In this exercise, we will go over:
  • Simple text cleaning processes
  • Stop words removal
  • Stemming, lemmatization

MIS_451_Natural_Language_Processing_Text_Process.ipynb

27
Text Vectorization
Bag of Words (BoW)

▪ The Bag of Words method converts text data into numbers.
▪ It does this by:
  • Creating a vocabulary from the words in all documents
  • Calculating the occurrences of words:
    o binary (present or not)
    o word counts
    o frequencies

29
Bag of Words (BoW)

• Simple example using word counts:

                                      a   cat  dog  is   it   my   not  old  wolf
  "It is a dog."                      1   0    1    1    1    0    0    0    0
  "my cat is old"                     0   1    0    1    0    1    0    1    0
  "It is not a dog, it is a wolf."    2   0    1    2    2    0    1    0    1
30
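The word-count table can be reproduced with a few lines of Python. This is a from-scratch sketch; libraries such as scikit-learn's CountVectorizer do the same job in practice:

```python
import re
from collections import Counter

docs = ["It is a dog.", "my cat is old", "It is not a dog, it is a wolf."]

def tokenize(text):
    # Lowercase and keep alphabetic runs only (a simplification).
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every distinct word across all documents, sorted.
vocab = sorted({w for d in docs for w in tokenize(d)})

# One count vector per document, aligned with the vocabulary.
vectors = [[Counter(tokenize(d))[w] for w in vocab] for d in docs]

print(vocab)       # → ['a', 'cat', 'dog', 'is', 'it', 'my', 'not', 'old', 'wolf']
print(vectors[2])  # → [2, 0, 1, 2, 2, 0, 1, 0, 1]
```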
Term Frequency (TF)
Term frequency (TF): Increases the weight for common words in a document.

tf(term, doc) = (number of times the term occurs in the doc) / (total number of terms in the doc)

                                      a     cat   dog   is    it    my    not   old   wolf
  "It is a dog."                      0.25  0     0.25  0.25  0.25  0     0     0     0
  "my cat is old"                     0     0.25  0     0.25  0     0.25  0     0.25  0
  "It is not a dog, it is a wolf."    0.22  0     0.11  0.22  0.22  0     0.11  0     0.11
34
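The formula translates directly to code; the values below match the table's first and third rows:

```python
def tf(term, doc_tokens):
    # Fraction of the document's tokens that are this term.
    return doc_tokens.count(term) / len(doc_tokens)

doc1 = ["it", "is", "a", "dog"]                                  # 4 tokens
doc3 = ["it", "is", "not", "a", "dog", "it", "is", "a", "wolf"]  # 9 tokens

print(tf("dog", doc1))          # → 0.25
print(round(tf("a", doc3), 2))  # → 0.22  (2 / 9)
```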
Inverse Document Frequency (IDF)
Inverse document frequency (IDF): Decreases the weights for commonly used words and increases the weights for rare words in the vocabulary.

idf(term) = log( n_documents / (n_documents containing the term + 1) ) + 1

e.g., idf("cat") = log(3/2) + 1 = 1.18

  term   idf
  a      log(3/3)+1 = 1
  cat    log(3/2)+1 = 1.18
  dog    log(3/3)+1 = 1
  is     log(3/4)+1 = 0.87
  it     log(3/3)+1 = 1
  my     log(3/2)+1 = 1.18
  not    log(3/2)+1 = 1.18
  old    log(3/2)+1 = 1.18
  wolf   log(3/2)+1 = 1.18
35
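Worked in code over the three example sentences (log base 10, as the table's numbers imply):

```python
from math import log10

docs = [
    ["it", "is", "a", "dog"],
    ["my", "cat", "is", "old"],
    ["it", "is", "not", "a", "dog", "it", "is", "a", "wolf"],
]

def idf(term):
    n_containing = sum(1 for d in docs if term in d)
    # The +1 in the denominator avoids division by zero for unseen terms.
    return log10(len(docs) / (n_containing + 1)) + 1

print(round(idf("cat"), 2))  # → 1.18  ("cat" appears in 1 of 3 documents)
print(idf("a"))              # "a" appears in 2 of 3: log(3/3) + 1 = 1.0
```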
Term Freq.-Inverse Doc. Freq (TF-IDF)
Term frequency-inverse document frequency (TF-IDF): Combines term frequency and inverse document frequency.

tfidf(term, doc) = tf(term, doc) × idf(term)

                                      a     cat   dog   is    it    my    not   old   wolf
  "It is a dog."                      0.25  0     0.25  0.22  0.25  0     0     0     0
  "my cat is old"                     0     0.3   0     0.22  0     0.3   0     0.3   0
  "It is not a dog, it is a wolf."    0.22  0     0.11  0.19  0.22  0     0.13  0     0.13
36
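A self-contained TF-IDF sketch (log base 10; the table rounds idf to two decimals before multiplying, so exact products can differ slightly in the last digit):

```python
from math import log10

docs = [
    ["it", "is", "a", "dog"],
    ["my", "cat", "is", "old"],
    ["it", "is", "not", "a", "dog", "it", "is", "a", "wolf"],
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    n_containing = sum(1 for d in docs if term in d)
    return log10(len(docs) / (n_containing + 1)) + 1

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

# "cat" in "my cat is old": tf = 0.25, idf ≈ 1.176, product ≈ 0.29
# (the table shows 0.3 because it rounds idf to 1.18 first).
print(round(tfidf("cat", docs[1]), 2))
```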
N-gram

• An n-gram is a sequence of n tokens from a given sample of text or speech.
• We can include n-grams in our term frequencies too.

  Sentence: "It is not a dog, it is a wolf"
  1-gram (uni-gram): "it", "is", "not", "a", "dog", "it", "is", "a", "wolf"
  2-gram (bi-gram): "it is", "is not", "not a", "a dog", "dog it", "it is", "is a", "a wolf"
37
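Generating n-grams is a simple sliding window over the token list:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["it", "is", "not", "a", "dog", "it", "is", "a", "wolf"]
print(ngrams(tokens, 1))  # uni-grams: the tokens themselves
print(ngrams(tokens, 2))
# → ['it is', 'is not', 'not a', 'a dog', 'dog it', 'it is', 'is a', 'a wolf']
```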
Bag of Words – Hands-on

• In this exercise, we will convert text data into numerical values.
• We will go over:
  • Binary
  • Word counts
  • Term frequencies
  • Term frequency-inverse document frequency

MIS_451_Natural_Language_Processing_Bag_of_Word.ipynb

38
39
