Lect_05_Preprocessing_text

The document covers the fundamentals of Natural Language Processing (NLP) and Text Analytics, focusing on the importance of text preprocessing techniques such as tokenization, lowercasing, and stemming. It discusses various use cases for text analytics in business, including sentiment analysis and document classification. Additionally, the document introduces regular expressions as a tool for pattern matching in text data.


MSBA 315

ML & Predictive Analytics

Lecture 04 – Data Preprocessing


Wael Khreich
[email protected]
Learning Outcomes
• Understand What NLP and Text Analytics Are
• Discuss Some Text Analytics Use Cases
• Understand the Importance of Text Preprocessing
• Learn Common Text Preprocessing Techniques
• Learn Basics of Regular Expressions
• Apply Text Preprocessing

2
What is Natural Language Processing (NLP)?
• An applied science that combines the power of computer science,
artificial intelligence, and computational linguistics to get computers to
perform useful tasks involving human (written and spoken) languages:
• Human-Machine communication

• Improving human-human communication

• Extracting information from texts

Figures Source: Deloitte analysis


3
What is Text Analytics?
• Text analytics is the process of deriving meaningful insights and
actionable information from unstructured text data

• Text analytics combines techniques from machine learning, natural language processing, information retrieval, and more…

• Text Analytics ≈ Text Mining


→ From Words to Actions

4
Source: http://amzn.to/textmine
Text Analytics Business Use Cases
• Data augmentation: augment customers’ data from text
• Document summarization: consume relevant information faster
• Sentiment analysis: learn what your customers think of your product or service and what the common issues are
• Document classification and categorization: organize text into categories for rapid and easy retrieval of information
5
Machine Learning Pipeline

[Figure: end-to-end machine learning pipeline]
6
ML Algorithms Expect Numbers

Model.fit(X, y)

Example sentences:
1. Machine learning seems cool, but I hate programming.
2. This is a bad investment.
3. …

Feature matrix (features F1 … Fm, label y):
X1:  V1,1  V1,2  …  V1,m  |  L1
X2:  V2,1  V2,2  …  V2,m  |  L2
⋮
Xn:  Vn,1  Vn,2  …  Vn,m  |  Lk

• Each row contains information about one instance
• Each column is a feature that describes a property of the instance

7
Encoding Categorical Data
1. Ordinal Encoding
• Each unique category value is assigned an integer value
• Low = 0, Medium = 1, High = 2
• It is a natural encoding for ordinal variables
• It can cause problems for nominal variables (imposes an arbitrary ordering)
2. One Hot Encoding
• A new binary variable is added for each unique category, where each bit represents a possible category
• Red → [0,0,1], Green → [0,1,0], Blue → [1,0,0]
3. Dummy Variable Encoding
• Removes redundancy from one hot encoding (might hurt some algorithms)
• K categories can be represented by K−1 binary variables
• Red → [0,0], Green → [0,1], Blue → [1,0]
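The three encodings above can be sketched in plain Python. This is a minimal toy (category names and ordering are illustrative); in practice scikit-learn's OrdinalEncoder and OneHotEncoder do this work.

```python
# Toy sketch of the three categorical encodings (stdlib only).
categories = ["Red", "Green", "Blue"]

# 1. Ordinal encoding: each unique category gets an integer.
ordinal = {c: i for i, c in enumerate(categories)}   # Red=0, Green=1, Blue=2

# 2. One hot encoding: one binary indicator per category.
def one_hot(value, cats=categories):
    return [1 if value == c else 0 for c in cats]

# 3. Dummy encoding: drop one indicator (K-1 columns for K categories).
def dummy(value, cats=categories):
    return one_hot(value, cats)[1:]

print(ordinal["Green"])   # 1
print(one_hot("Green"))   # [0, 1, 0]
print(dummy("Green"))     # [1, 0]
```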

8
One Hot Encoding Features

Model.fit(X, y)

Example sentences:
1. I would like to learn programming.
2. I think this is a very very good investment!
3. …

Features (vocabulary words) and labels:
      I  is  a  to  would  …  learn  very  good  …  |  y
X1:   1   0  0   1      1  …      1     0     0  …  |  pos
X2:   1   1  1   0      0  …      0     1     1  …  |  neg

• The feature vector contains an entry for every possible word in the (training) vocabulary
• Compute the one hot encoding feature vector (Xi) for an input sentence by marking the presence (1) or absence (0) of every word of the vocabulary
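A minimal sketch of this presence/absence encoding. The small fixed vocabulary is an assumption for illustration; in practice it is built from the training corpus.

```python
# Toy vocabulary (illustrative only).
vocab = ["i", "is", "a", "to", "would", "learn", "very", "good"]

def one_hot_vector(sentence, vocab=vocab):
    """Mark presence (1) or absence (0) of each vocabulary word."""
    tokens = set(sentence.lower().split())
    return [1 if w in tokens else 0 for w in vocab]

print(one_hot_vector("I would like to learn programming"))
# [1, 0, 0, 1, 1, 1, 0, 0]
```

Words outside the vocabulary (like "programming" here) are simply dropped, which is exactly why the training vocabulary matters.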
Term Frequency (TF) or Bag-of-Words (BOW)

Model.fit(X, y)

Example sentences:
1. I would like to learn programming.
2. I think this is a very very bad investment!
3. …

Features (vocabulary words) and labels:
      I  is  a  to  would  …  learn  very  good  …  |  y
X1:   1   0  0   1      1  …      1     0     0  …  |  pos
X2:   1   1  1   0      0  …      0     2     1  …  |  neg

• The feature vector contains an entry for every possible word in the (training) vocabulary
• Compute the BOW/TF feature vector (Xi) for an input sentence by counting the number of times each word appears
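The same sketch with counts instead of presence bits gives BOW/TF. Again the toy vocabulary is an assumption; scikit-learn's CountVectorizer is the usual tool.

```python
from collections import Counter

# Toy vocabulary (illustrative only).
vocab = ["i", "is", "a", "to", "would", "learn", "very", "bad"]

def bow_vector(sentence, vocab=vocab):
    """Count how many times each vocabulary word occurs."""
    counts = Counter(sentence.lower().replace("!", "").split())
    return [counts[w] for w in vocab]

print(bow_vector("I think this is a very very bad investment!"))
# [1, 1, 1, 0, 0, 0, 2, 1]  -- "very" now counts as 2
```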
Text Preprocessing for Cleaner Features

Model.fit(X, y)

Example sentences:
1. I’d like to learn machine learning, but i must learn programming.
2. I think this is a very veryyy good investment!
3. …

Features (vocabulary words) and labels:
      I  is  a  to  would  …  learn  very  good  …  |  y
X1:   ?   0  0   1      ?  …      ?     0     0  …  |  pos
X2:   1   1  1   0      0  …      0     ?     1  …  |  neg

Preprocessing to consider:
• Stop words
• Frequency-based filtering
• Rare and/or frequent words
• Correcting spelling and grammar
• Removing character repetitions
• Etc.
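A toy cleaning pass illustrating two of the bullets above, stop-word removal and collapsing character repetitions. The stop-word list is a made-up sample, not a standard one.

```python
import re

STOP_WORDS = {"i", "a", "the", "is", "this", "to", "but"}  # toy list

def clean(text):
    text = text.lower()
    # Collapse 3+ repeated characters: "veryyy" -> "very"
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t not in STOP_WORDS]

print(clean("I think this is a veryyy GOOD investment!"))
# ['think', 'very', 'good', 'investment']
```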
11
Text Preprocessing
• In Natural Language Processing (NLP), text preprocessing is the first step in building a model
• Common text preprocessing techniques:
• Sentence segmentation
• Tokenization
• Lower casing
• Stop words removal
• Stemming
• Lemmatization
• Etc.

The following slides are based on Jurafsky and Martin: https://web.stanford.edu/~jurafsky/slp3/

12
Sentence Segmentation
• Sentence segmentation (or sentence tokenization) is the process of splitting text into individual sentences
• Techniques used:
• Search for a period “.”
• Hand-written rules
• Regular expressions
• Train a machine learning classifier to detect the end of a sentence
• “.” is ambiguous
• Abbreviations like Inc. or Dr.
• Numbers like 0.02% or 4.3
• We can use regular expressions to look for specific patterns
• “?” and “!” can be used to easily determine sentence boundaries
13
Sentence Segmentation

But it might not be as easy as you think:


• 'Stop!' she shouted. As long as you didn't defend your Ph.D. thesis,
you can't get a certificate!! F.B.I. is an acronym. FBI is an acronym,
c.i.a. could also be one. $1,000,000.00 is a currency value as well as
1.000.000,00£ for example. These are some measures cm24.54 and
34.3cm.
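A naive regex splitter shows the ambiguity: it handles the easy cases but wrongly breaks after abbreviations. This is a sketch, not a production segmenter.

```python
import re

def naive_split(text):
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

text = "Dr. Smith arrived. He defended his Ph.D. thesis! Congratulations."
print(naive_split(text))
# Wrongly splits "Dr." from "Smith": 4 pieces instead of the correct 3.
```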

14
Word Tokens vs. Types
• How many words?

They lay back on the San Francisco grass and looked at the stars and their

• Type: an element of the vocabulary (a unique word)
• Token: an instance of that type in running text
• How many?
• 15 tokens (or 14, if “San Francisco” is one token)
• 13 types (or 12) (or 11?)
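Counting tokens and types for the sentence above (whitespace tokens; "the" and "and" each occur twice, so types < tokens):

```python
sentence = ("They lay back on the San Francisco grass "
            "and looked at the stars and their")
tokens = sentence.split()
print(len(tokens))        # 15 tokens
print(len(set(tokens)))   # 13 types
```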

15
Tokenization
• The process of splitting the text into smaller units or tokens
• Tokens could be words, numbers, symbols, n-grams, or characters
• N-grams are sequences of n words or characters
• Tokenization does this task by locating word boundaries
• Issues with tokenization
• m.p.h., Ph.D. → ??
• San Francisco → one token or two?
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
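One possible tokenization policy for the hyphen/apostrophe cases above, sketched with a regex (one of many defensible choices, not *the* answer):

```python
import re

text = "Hewlett-Packard's state-of-the-art lab isn't in San Francisco."

# Naive whitespace split keeps punctuation attached to words:
print(text.split()[-1])   # 'Francisco.'

# A regex tokenizer that strips outer punctuation but keeps internal
# hyphens and apostrophes inside a single token:
tokens = re.findall(r"\w+(?:[-']\w+)*", text)
print(tokens)
# ["Hewlett-Packard's", 'state-of-the-art', 'lab', "isn't", 'in', 'San', 'Francisco']
```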

16
Lowercasing
• The simplest technique of text preprocessing
• Consists of lowercasing every single token of the input text.
• It helps in dealing with sparsity issues in the dataset
• Treats Lebanon and lebanon as the same word
• It could also increase ambiguity
• When Apple (the company) is transformed into apple
• Confuses the model
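The sparsity/ambiguity trade-off in two lines (a contrived token list for illustration):

```python
tokens = ["Apple", "released", "an", "apple", "pie", "recipe",
          "in", "Lebanon", "lebanon"]
lowered = [t.lower() for t in tokens]
print(len(set(tokens)))    # 9 distinct types before lowercasing
print(len(set(lowered)))   # 7 after: Lebanon/lebanon merge, but so do Apple/apple
```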

17
Normalization

• How to match such terms?


• U.A.E and UAE
• America, United States, U.S.A, and USA…..
• Normalize the terms
• Define equivalence classes; e.g.,
• Remove periods in terms
• Create a map for all forms of one term
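Both normalization strategies above can be sketched together; the mapping entries are made-up examples, not a standard resource.

```python
# Equivalence classes via a normalization map (illustrative entries).
CANONICAL = {"america": "usa", "united states": "usa"}

def normalize(term):
    t = term.lower().replace(".", "")   # remove periods: U.A.E -> uae
    return CANONICAL.get(t, t)          # map known variants to one form

print(normalize("U.A.E"), normalize("UAE"))       # uae uae
print(normalize("America"), normalize("U.S.A"))   # usa usa
```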

18
Lemmatization and Stemming
• Lemmatization: reduce inflections or variants of a word to its correct dictionary base form (lemma)
• be ← am, are, is
• car ← car, cars, car’s, cars’
• Example: The boy’s cars are different colors
→ the boy car be different color
• Stemming: Reduce terms to their stems (core meaning-bearing units)
• A stem doesn’t have to exist in the dictionary
• automate(s), automatic, automation -> automat
• Example: for example compressed and compression are both accepted as
equivalent to compress.
→ for exampl compress and compress ar both accept as equival to compress
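A toy suffix-stripping stemmer that reproduces the examples above. This is deliberately much simpler than Porter's rule cascade; in practice nltk's PorterStemmer is the usual choice.

```python
# Toy suffix stripper -- NOT the real Porter algorithm, just the idea:
# chop a known suffix when enough of the word remains.
SUFFIXES = ["ion", "ing", "ed", "es", "ic", "s"]

def toy_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

for w in ["automates", "automatic", "automation", "compressed", "compression"]:
    print(toy_stem(w))
# automat automat automat compress compress
```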

19
Porter’s Stemming Algorithm

[Figure: Porter stemmer suffix-stripping rules; “v” denotes a vowel]
20
Regular Expression

• Find patterns of words


• String, String.h, Stdout, stdout.h
• woodchuck, woodchucks, Woodchuck, Woodchucks
• A wide variety of usages
• Extract patterns from HTML documents
• Search files for lines matching a pattern (e.g., grep, find in UNIX)
• Convert comma-separated files (,) to line-separated files (\n)
• Create a specific tokenizer
• Etc.
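Two of these uses in Python's re module: matching all variants of a word with one pattern, and converting comma-separated to line-separated text:

```python
import re

# One pattern covers all four woodchuck variants:
print(re.findall(r"[wW]oodchucks?",
                 "Woodchuck woodchucks woodchuck Woodchucks"))

# Convert a comma-separated line to line-separated values:
print(re.sub(",", "\n", "name,age,city"))
```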

21
Regular Expression - Square Brackets

Patterns        Matches                         Examples
[wW]ood         w or W followed by “ood”        Wood, wood
[1234567890]    any digit                       1, 2, 3, 4, 5, 6, 7, 8, 9, or 0

Ranges:
[A-Z]           an upper-case letter            Big Data 4
[a-z]           a lower-case letter             Big Data 4
[0-9]           a single digit                  Big Data 4

^ means negation only if it is the first character inside the brackets (otherwise it is the literal character “^”):
[^A-Z]          not an upper-case letter        Big Data 4
[^iB]           neither i nor B                 Big Data 4
[a^b]           a, ^, or b                      A pattern a^b
a\^b            the literal pattern a^b         A pattern a^b
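The character-class patterns from the table, run against the table's own example strings:

```python
import re

s = "Big Data 4"
print(re.findall(r"[A-Z]", s))    # ['B', 'D']
print(re.findall(r"[0-9]", s))    # ['4']
print(re.findall(r"[^A-Z]", s))   # every character except 'B' and 'D'
print(re.findall(r"[wW]ood", "Wood and wood"))   # ['Wood', 'wood']
print(bool(re.search(r"a\^b", "A pattern a^b"))) # True: escaped ^ is literal
```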

22
Regular Expression: ?, *, +, .

Patterns      Matches                                                    Examples
colou?r       optional previous char                                     colour, color
o*h!          0 or more of previous char                                 h!, oh!, ooh!, ooooh!
o+h!          1 or more of previous char                                 oh!, ooh!, oooh!, ooooh!
m.n           . means any character                                      man, men, mon, mun, m3n
f.*[0-9]      0 or more of any character between f and a digit           f9, foo1, foo5, fan4, f(*-9, fan home9, …
f.*[0-9]+     0 or more of any character between f and a set of digits   f9, foo11, foo4333, fan home9, …
f.+[0-9]{3}   1 or more of any character between f and 3 digits          f-234, faa234, fbbjjiij_u433, f2234, …
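The quantifier patterns from the table, checked in Python:

```python
import re

print(re.findall(r"colou?r", "color and colour"))  # ['color', 'colour']
print(re.findall(r"o+h!", "oh! ooh! ooooh!"))      # ['oh!', 'ooh!', 'ooooh!']
print(re.findall(r"m.n", "man men m3n moon"))      # ['man', 'men', 'm3n']
print(bool(re.search(r"f.*[0-9]", "fan home9")))   # True
print(bool(re.search(r"f.+[0-9]{3}", "faa234")))   # True
```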

23
Regular Expression: Anchors ^, $, and Pipe |

^ and $: ^ matches the beginning and $ the end of a line

Pattern       Matches                                             Example
^[A-Z]        an upper-case letter at the beginning of a line     Montreal
^[^A-Za-z]    a non-letter at the beginning of a line             1a, “Hello”
\.$           a literal dot at the end of a line                  The sentence.
.$            any character at the end of a line                  The., Band?, Wow!

Pipe |:
data|collection         data or collection                        data in a…, collection of…
a|b|c                   a, b, or c (same as [abc])                a sentence
[Dd]ata|[Cc]ollection   Data, data, Collection, or collection     Data in a…, data in a…
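The anchor and pipe patterns from the table, checked in Python:

```python
import re

print(bool(re.search(r"^[A-Z]", "Montreal")))     # True: capital at start
print(bool(re.search(r"^[^A-Za-z]", "1a")))       # True: non-letter at start
print(bool(re.search(r"\.$", "The sentence.")))   # True: literal dot at end
print(bool(re.search(r"\.$", "Band?")))           # False: ends with ?, not .
print(re.findall(r"[Dd]ata|[Cc]ollection", "Data in a data collection"))
# ['Data', 'data', 'collection']
```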

24
Regular Expression Tools
• UNIX
• grep, sed, tr (translate), awk, etc.
• Websites
• www.regexpal.com, http://www.regexr.com/
• Regex Editors
• Notepad++: http://notepad-plus-plus.org/
• http://www.regular-expressions.info/tools.html
• Programming Languages
• Python, Java, C, C++, JavaScript, SQL, etc.

25
