Chapter 7.1 - Introducing Natural Language Processing

Chapter 7.1 introduces Natural Language Processing (NLP), focusing on data preprocessing, model training, and applications in business contexts. It covers essential concepts such as the Bag of Words model, tokenization, stemming, and lemmatization, along with challenges and use cases of NLP. The chapter also emphasizes the importance of understanding context and provides hands-on exercises for practical learning.


MIS 451: Machine Learning for

Business
Chapter 7.1: Introducing Natural
Language Processing

1
Agenda

• Fundamental understanding of data preprocessing, commonly used machine learning (ML) algorithms, and model evaluation
• Practical knowledge of natural language processing (NLP)-specific model training and applications
• Introduction to NLP and text processing
• Bag of Words (BoW)
• Be comfortable talking with scientist partners
• Practice: Text_Process and Bag_of_Word

2
Natural language processing (NLP)
“Alexa, what’s it like outside?”

3
What is NLP?

• NLP develops computational algorithms to automatically analyze and represent human language.
• By evaluating the structure of language, machine learning systems can process large sets of words, phrases, and sentences.

4
NLP challenges

• Lack of precision
• Many complex dependencies
• Meaning that is based on context
• Lack of structure

5
Natural language processing use cases

• Search applications
• Market and social research
• Human-machine interfaces
• Chatbots


6
Natural language processing flow

7
Preprocessing text

Common preprocessing steps:
• Remove stop words
• Normalize similar text
• Standardize unrecognized text

Other preprocessing steps:
• Encoding
• Spelling and grammar checks

Multiple libraries and tools are available for preprocessing (for example, NLTK for Python).

Sample Preprocessing

8
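These steps can be sketched in plain Python. This is a minimal illustration only; the stop-word list below is a small hand-picked stand-in, not a library list such as NLTK's:

```python
import re

# Hand-picked stop-word list for illustration; a real pipeline would
# load one from a library such as NLTK.
STOP_WORDS = {"is", "a", "the", "it"}

def preprocess(text):
    text = text.lower()                    # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)  # standardize: strip punctuation and digits
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(preprocess("It is a DOG!"))  # → ['dog']
```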
Creating tokens and feature engineering

• Load data by using tokens
• You can use tokens to convert words into items in a DataFrame
• Develop features by applying a model
• Common models include bag of words and term frequency-inverse document frequency (TF-IDF)

Sample token code:

from nltk.tokenize import word_tokenize

text = "this is some sample text."
print(word_tokenize(text))
# Output: ['this', 'is', 'some', 'sample', 'text', '.']

9
Example NLP model: Bag of words

• Create a vector for each sentence or phrase
• Evaluate words in a sentence based on frequency
• The word frequencies form the vector for each sentence or phrase

Example NLP Model

10
Text analysis categories

11
Capture context
Understanding context for the text is
a major challenge for NLP:
• Tagging words with the appropriate
part of speech helps to capture
context
• NLP libraries provide token functions
to help with tagging

12
Introduction to Natural Language
Processing (NLP)
Some NLP Terms

• Corpus: A large collection of words or phrases, which can come from different sources: documents, web sources, databases.
  ▪ Common Crawl Corpus: web crawl data composed of over 5 billion web pages (541 TB)
  ▪ Reddit Submission Corpus: publicly available Reddit submissions (42 GB)
  ▪ Wikipedia XML Data: a complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML (500 GB)
  ▪ Etc.

14
Some NLP Terms

• Token: Words or phrases extracted from documents
• Feature vector: A numeric array that ML models use for different tasks such as training and prediction
15
Machine Learning
with Text Data
Machine Learning with Text Data

• ML models need well-defined numerical data.

Text data → Text preprocessing (cleaning and formatting) → Vectorization (convert to numbers) → Train ML model using numerical data

• Text preprocessing: stop words removal, stemming, lemmatization
• Vectorization: Bag of Words
• ML model: K-Nearest Neighbors (KNN), neural networks, etc.

17
Text Pre-processing
Tokenization

• Splits a text/document into small parts by white space and punctuation.
• Example:

  Sentence: "I don't like eggs."
  Tokens: "I", "do", "n't", "like", "eggs", "."

• These tokens will be used in the next steps of the pipeline.

19
Stop Words Removal

• Stop words: Words that appear frequently in texts but contribute little to the overall meaning.
• Common stop words: "a", "the", "so", "is", "it", "at", "in", "this", "there", "that", "my"
• Example:

  Original sentence: "There is a tree near the house"
  Without stop words: "tree near house"
20
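The removal step can be sketched in plain Python using the common stop words listed above (real systems would use a library list such as NLTK's):

```python
# Small stop-word set taken from the examples above.
STOP_WORDS = {"a", "the", "so", "is", "it", "at", "in", "this",
              "there", "that", "my"}

def remove_stop_words(sentence):
    # Lowercase for matching; keep the surviving words in order.
    return " ".join(w for w in sentence.lower().split()
                    if w not in STOP_WORDS)

print(remove_stop_words("There is a tree near the house"))  # → tree near house
```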
Stop Words Removal

• Stop words from the Natural Language Toolkit (NLTK)* library:
• Assume we have a text classification problem: is a product review positive or negative?
• Is this a good stop words list for this problem? NO. NLTK's list includes negation words such as "not", and removing them can flip the meaning of a review.
* https://www.nltk.org/

23
Stemming

• A set of rules that slices a string to a substring, which usually refers to a more general meaning.
  ▪ The goal is to remove word affixes (particularly suffixes) such as "s", "es", "ing", "ed", etc.
    o "playing", "played", "plays" → "play"
  ▪ The issue: it doesn't usually work with irregular forms such as irregular verbs: "taught", "brought", etc.

24
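A toy rule-based stemmer makes the idea concrete. Real stemmers such as NLTK's PorterStemmer use a much richer rule set; the suffix list and minimum stem length here are illustrative assumptions:

```python
# Minimal suffix-stripping stemmer: try a few common suffixes in order.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        # Require at least a 3-letter stem so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["playing", "played", "plays"]])  # → ['play', 'play', 'play']
print(stem("taught"))  # irregular form: no rule applies → 'taught'
```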
Lemmatization

• Similar to stemming, but more advanced: it uses a look-up dictionary.
  ▪ Handles more situations and usually works better than stemming.
    o "taught", "teaching", "teaches" → "teach"
    o "am", "is", "are" → "be"
  ▪ For the best results, correct part-of-speech tags should be provided: "adjective", "noun", "verb", etc.

25
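The look-up idea can be sketched with a tiny hand-built dictionary, a stand-in for the WordNet-style lexicons that real lemmatizers consult:

```python
# Toy lemma dictionary covering only the slide's examples; real
# lemmatizers use a full lexicon plus part-of-speech tags.
LEMMAS = {
    "taught": "teach", "teaching": "teach", "teaches": "teach",
    "am": "be", "is": "be", "are": "be",
}

def lemmatize(word):
    return LEMMAS.get(word, word)  # unknown words pass through unchanged

print([lemmatize(w) for w in ["taught", "teaching", "teaches"]])  # → ['teach', 'teach', 'teach']
print([lemmatize(w) for w in ["am", "is", "are"]])                # → ['be', 'be', 'be']
```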
Stemming vs. Lemmatization

As we pointed out, lemmatization is a more complex method and usually works better. For example:
• Original sentence: "the children are playing outside. the weather was better yesterday."
  ▪ Stemming → "the children are play outside. the weather was better yesterday"
  ▪ Lemmatization → "the child be play outside. the weather be good yesterday"

26
Text Processing – Hands-on

• In this exercise, we will go over:
  • Simple text cleaning processes
  • Stop words removal
  • Stemming, lemmatization

MIS_451_Natural_Language_Processing_Text_Process.ipynb

27
Text Vectorization
Bag of Words (BoW)

▪ The Bag of Words method converts text data into numbers.
▪ It does this by:
  • Creating a vocabulary from the words in all documents
  • Calculating the occurrences of words:
    o binary (present or not)
    o word counts
    o frequencies

29
Bag of Words (BoW)

• Simple example using word counts:

                                      a   cat  dog  is   it   my   not  old  wolf
  "It is a dog."                      1   0    1    1    1    0    0    0    0
  "my cat is old"                     0   1    0    1    0    1    0    1    0
  "It is not a dog, it is a wolf."    2   0    1    2    2    0    1    0    1
30
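The word-count table can be reproduced with a few lines of Python. This is a from-scratch sketch; libraries such as scikit-learn's CountVectorizer do the same job in practice:

```python
import re
from collections import Counter

docs = ["It is a dog.", "my cat is old", "It is not a dog, it is a wolf."]

def tokenize(text):
    # Lowercase and keep alphabetic runs only (a simplification).
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every distinct word across all documents, sorted.
vocab = sorted({w for d in docs for w in tokenize(d)})

# One count vector per document, aligned with the vocabulary.
vectors = [[Counter(tokenize(d))[w] for w in vocab] for d in docs]

print(vocab)       # → ['a', 'cat', 'dog', 'is', 'it', 'my', 'not', 'old', 'wolf']
print(vectors[2])  # → [2, 0, 1, 2, 2, 0, 1, 0, 1]
```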
Term Frequency (TF)
Term frequency (TF): Increases the weight for common words in a document.

tf(term, doc) = (number of times the term occurs in the doc) / (total number of terms in the doc)

                                      a     cat   dog   is    it    my    not   old   wolf
  "It is a dog."                      0.25  0     0.25  0.25  0.25  0     0     0     0
  "my cat is old"                     0     0.25  0     0.25  0     0.25  0     0.25  0
  "It is not a dog, it is a wolf."    0.22  0     0.11  0.22  0.22  0     0.11  0     0.11
34
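The formula translates directly to code; the values below match the table's first and third rows:

```python
def tf(term, doc_tokens):
    # Fraction of the document's tokens that are this term.
    return doc_tokens.count(term) / len(doc_tokens)

doc1 = ["it", "is", "a", "dog"]                                  # 4 tokens
doc3 = ["it", "is", "not", "a", "dog", "it", "is", "a", "wolf"]  # 9 tokens

print(tf("dog", doc1))          # → 0.25
print(round(tf("a", doc3), 2))  # → 0.22  (2 / 9)
```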
Inverse Document Frequency (IDF)
Inverse document frequency (IDF): Decreases the weights for commonly used words and increases the weights for rare words in the vocabulary.

idf(term) = log( n_documents / (n_documents containing the term + 1) ) + 1

e.g., idf("cat") = log(3/2) + 1 = 1.18

  term   idf
  a      log(3/3)+1 = 1
  cat    log(3/2)+1 = 1.18
  dog    log(3/3)+1 = 1
  is     log(3/4)+1 = 0.87
  it     log(3/3)+1 = 1
  my     log(3/2)+1 = 1.18
  not    log(3/2)+1 = 1.18
  old    log(3/2)+1 = 1.18
  wolf   log(3/2)+1 = 1.18
35
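Worked in code over the three example sentences (log base 10, as the table's numbers imply):

```python
from math import log10

docs = [
    ["it", "is", "a", "dog"],
    ["my", "cat", "is", "old"],
    ["it", "is", "not", "a", "dog", "it", "is", "a", "wolf"],
]

def idf(term):
    n_containing = sum(1 for d in docs if term in d)
    # The +1 in the denominator avoids division by zero for unseen terms.
    return log10(len(docs) / (n_containing + 1)) + 1

print(round(idf("cat"), 2))  # → 1.18  ("cat" appears in 1 of 3 documents)
print(idf("a"))              # "a" appears in 2 of 3: log(3/3) + 1 = 1.0
```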
Term Freq.-Inverse Doc. Freq (TF-IDF)
Term frequency-inverse document frequency (TF-IDF): Combines term frequency and inverse document frequency.

tfidf(term, doc) = tf(term, doc) × idf(term)

                                      a     cat   dog   is    it    my    not   old   wolf
  "It is a dog."                      0.25  0     0.25  0.22  0.25  0     0     0     0
  "my cat is old"                     0     0.3   0     0.22  0     0.3   0     0.3   0
  "It is not a dog, it is a wolf."    0.22  0     0.11  0.19  0.22  0     0.13  0     0.13
36
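A self-contained TF-IDF sketch (log base 10; the table rounds idf to two decimals before multiplying, so exact products can differ slightly in the last digit):

```python
from math import log10

docs = [
    ["it", "is", "a", "dog"],
    ["my", "cat", "is", "old"],
    ["it", "is", "not", "a", "dog", "it", "is", "a", "wolf"],
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    n_containing = sum(1 for d in docs if term in d)
    return log10(len(docs) / (n_containing + 1)) + 1

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

# "cat" in "my cat is old": tf = 0.25, idf ≈ 1.176, product ≈ 0.29
# (the table shows 0.3 because it rounds idf to 1.18 first).
print(round(tfidf("cat", docs[1]), 2))
```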
N-gram

• An n-gram is a sequence of n tokens from a given sample of text or speech.
• We can include n-grams in our term frequencies too.

  Sentence: "It is not a dog, it is a wolf"
  1-gram (uni-gram): "it", "is", "not", "a", "dog", "it", "is", "a", "wolf"
  2-gram (bi-gram): "it is", "is not", "not a", "a dog", "dog it", "it is", "is a", "a wolf"
37
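Generating n-grams is a simple sliding window over the token list:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["it", "is", "not", "a", "dog", "it", "is", "a", "wolf"]
print(ngrams(tokens, 1))  # uni-grams: the tokens themselves
print(ngrams(tokens, 2))
# → ['it is', 'is not', 'not a', 'a dog', 'dog it', 'it is', 'is a', 'a wolf']
```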
Bag of Words – Hands-on

• In this exercise, we will convert text data into numerical values.
• We will go over:
  • Binary
  • Word counts
  • Term frequencies
  • Term frequency-inverse document frequency

MIS_451_Natural_Language_Processing_Bag_of_Word.ipynb

38
39
