Lect_05_Preprocessing_text

The document covers the fundamentals of Natural Language Processing (NLP) and Text Analytics, focusing on the importance of text preprocessing techniques such as tokenization, lowercasing, and stemming. It discusses various use cases for text analytics in business, including sentiment analysis and document classification. Additionally, the document introduces regular expressions as a tool for pattern matching in text data.


MSBA 315

ML & Predictive Analytics

Lecture 04 – Data Preprocessing


Wael Khreich
[email protected]
Learning Outcomes
• Understand What NLP and Text Analytics Are
• Discuss Some Text Analytics Use Cases
• Understand the Importance of Text Preprocessing
• Learn Common Text Preprocessing Techniques
• Learn Basics of Regular Expressions
• Apply Text Preprocessing

2
What is Natural Language Processing (NLP)?
• An applied science that combines the power of computer science,
artificial intelligence, and computational linguistics to get computers to
perform useful tasks involving human (written and spoken) languages:
• Human-Machine communication

• Improving human-human communication

• Extracting information from texts

Figures Source: Deloitte analysis


3
What is Text Analytics?
• Text analytics is the process of deriving meaningful insights and
actionable information from unstructured text data

• Text analytics combines techniques from machine learning, natural language processing, information retrieval, and more…

• Text Analytics ≈ Text Mining


→ From Words to Actions

4
Source: http://amzn.to/textmine
Text Analytics Business Use Cases
• Data augmentation: augment customers’ data from text
• Document summarization: consume relevant information faster
• Sentiment analysis: learn what your customers think of your product or service and what the common issues are
• Document classification and categorization: organize text into categories for rapid and easy retrieval of information
5
Machine Learning Pipeline

[Figure: end-to-end machine learning pipeline]
6
ML Algorithms Expect Numbers

Model.fit(X, y)

Example sentences:
1. Machine learning seems cool, but I hate programming.
2. This is a bad investment.
3. …

Feature matrix (features F1 … Fm, label y):
X1:  V1,1  V1,2  …  V1,m  |  L1
X2:  V2,1  V2,2  …  V2,m  |  L2
⋮
Xn:  Vn,1  Vn,2  …  Vn,m  |  Lk

• Each row contains information about one instance
• Each column is a feature that describes a property of the instance

7
Encoding Categorical Data
1. Ordinal Encoding
• Each unique category value is assigned an integer value
• Low = 0, Medium = 1, High = 2
• It is a natural encoding for ordinal variables
• It can cause problems for nominal variables (imposes an arbitrary ordering)
2. One Hot Encoding
• A new binary variable is added for each unique category, where each bit represents a possible category
• Red → [0,0,1], Green → [0,1,0], Blue → [1,0,0]
3. Dummy Variable Encoding
• Removes redundancy from one hot encoding (might hurt some algorithms)
• K categories can be represented by K−1 binary variables
• Red → [0,0], Green → [0,1], Blue → [1,0]
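The three encodings above can be sketched in plain Python. This is a minimal toy (category names and ordering are illustrative); in practice scikit-learn's OrdinalEncoder and OneHotEncoder do this work.

```python
# Toy sketch of the three categorical encodings (stdlib only).
categories = ["Red", "Green", "Blue"]

# 1. Ordinal encoding: each unique category gets an integer.
ordinal = {c: i for i, c in enumerate(categories)}   # Red=0, Green=1, Blue=2

# 2. One hot encoding: one binary indicator per category.
def one_hot(value, cats=categories):
    return [1 if value == c else 0 for c in cats]

# 3. Dummy encoding: drop one indicator (K-1 columns for K categories).
def dummy(value, cats=categories):
    return one_hot(value, cats)[1:]

print(ordinal["Green"])   # 1
print(one_hot("Green"))   # [0, 1, 0]
print(dummy("Green"))     # [1, 0]
```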

8
One Hot Encoding Features

Model.fit(X, y)

Example sentences:
1. I would like to learn programming.
2. I think this is a very very good investment!
3. …

Features (vocabulary words) and labels:
      I  is  a  to  would  …  learn  very  good  …  |  y
X1:   1   0  0   1      1  …      1     0     0  …  |  pos
X2:   1   1  1   0      0  …      0     1     1  …  |  neg

• The feature vector contains an entry for every possible word in the (training) vocabulary
• Compute the one hot encoding feature vector (Xi) for an input sentence by marking the presence (1) or absence (0) of every word of the vocabulary
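A minimal sketch of this presence/absence encoding. The small fixed vocabulary is an assumption for illustration; in practice it is built from the training corpus.

```python
# Toy vocabulary (illustrative only).
vocab = ["i", "is", "a", "to", "would", "learn", "very", "good"]

def one_hot_vector(sentence, vocab=vocab):
    """Mark presence (1) or absence (0) of each vocabulary word."""
    tokens = set(sentence.lower().split())
    return [1 if w in tokens else 0 for w in vocab]

print(one_hot_vector("I would like to learn programming"))
# [1, 0, 0, 1, 1, 1, 0, 0]
```

Words outside the vocabulary (like "programming" here) are simply dropped, which is exactly why the training vocabulary matters.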
Term Frequency (TF) or Bag-of-Words (BOW)

Model.fit(X, y)

Example sentences:
1. I would like to learn programming.
2. I think this is a very very bad investment!
3. …

Features (vocabulary words) and labels:
      I  is  a  to  would  …  learn  very  good  …  |  y
X1:   1   0  0   1      1  …      1     0     0  …  |  pos
X2:   1   1  1   0      0  …      0     2     1  …  |  neg

• The feature vector contains an entry for every possible word in the (training) vocabulary
• Compute the BOW/TF feature vector (Xi) for an input sentence by counting the number of times each word appears
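The same sketch with counts instead of presence bits gives BOW/TF. Again the toy vocabulary is an assumption; scikit-learn's CountVectorizer is the usual tool.

```python
from collections import Counter

# Toy vocabulary (illustrative only).
vocab = ["i", "is", "a", "to", "would", "learn", "very", "bad"]

def bow_vector(sentence, vocab=vocab):
    """Count how many times each vocabulary word occurs."""
    counts = Counter(sentence.lower().replace("!", "").split())
    return [counts[w] for w in vocab]

print(bow_vector("I think this is a very very bad investment!"))
# [1, 1, 1, 0, 0, 0, 2, 1]  -- "very" now counts as 2
```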
Text Preprocessing for Cleaner Features

Model.fit(X, y)

Example sentences:
1. I’d like to learn machine learning, but i must learn programming.
2. I think this is a very veryyy good investment!
3. …

Features (vocabulary words) and labels:
      I  is  a  to  would  …  learn  very  good  …  |  y
X1:   ?   0  0   1      ?  …      ?     0     0  …  |  pos
X2:   1   1  1   0      0  …      0     ?     1  …  |  neg

Preprocessing to consider:
• Stop words
• Frequency-based filtering
• Rare and/or frequent words
• Correcting spelling and grammar
• Removing character repetitions
• Etc.
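A toy cleaning pass illustrating two of the bullets above, stop-word removal and collapsing character repetitions. The stop-word list is a made-up sample, not a standard one.

```python
import re

STOP_WORDS = {"i", "a", "the", "is", "this", "to", "but"}  # toy list

def clean(text):
    text = text.lower()
    # Collapse 3+ repeated characters: "veryyy" -> "very"
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t not in STOP_WORDS]

print(clean("I think this is a veryyy GOOD investment!"))
# ['think', 'very', 'good', 'investment']
```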
11
Text Preprocessing
• In Natural Language Processing (NLP), text preprocessing is the first step in building a model
• Common text preprocessing techniques:
• Sentence segmentation
• Tokenization
• Lower casing
• Stop words removal
• Stemming
• Lemmatization
• Etc.

The following slides are based on Jurafsky and Martin: https://web.stanford.edu/~jurafsky/slp3/

12
Sentence Segmentation
• Sentence segmentation (or sentence tokenization) is the process of splitting text into individual sentences
• Techniques used:
• Search for a period “.”
• Hand-written rules
• Regular expressions
• Train a machine learning classifier to detect the end of a sentence
• “.” is ambiguous
• Abbreviations like Inc. or Dr.
• Numbers like 0.02% or 4.3
• We can use regular expressions to look for specific patterns
• “?” and “!” can be used to easily determine sentence boundaries
13
Sentence Segmentation

But it might not be as easy as you think:


• 'Stop!' she shouted. As long as you didn't defend your Ph.D. thesis,
you can't get a certificate!! F.B.I. is an acronym. FBI is an acronym,
c.i.a. could also be one. $1,000,000.00 is a currency value as well as
1.000.000,00£ for example. These are some measures cm24.54 and
34.3cm.
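A naive regex splitter shows the ambiguity: it handles the easy cases but wrongly breaks after abbreviations. This is a sketch, not a production segmenter.

```python
import re

def naive_split(text):
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

text = "Dr. Smith arrived. He defended his Ph.D. thesis! Congratulations."
print(naive_split(text))
# Wrongly splits "Dr." from "Smith": 4 pieces instead of the correct 3.
```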

14
Word Tokens vs. Types
• How many words?

They lay back on the San Francisco grass and looked at the stars and their

• Type: an element of the vocabulary (a unique word)
• Token: an instance of that type in running text
• How many?
• 15 tokens (or 14, if “San Francisco” is one token)
• 13 types (or 12) (or 11?)
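Counting tokens and types for the sentence above (whitespace tokens; "the" and "and" each occur twice, so types < tokens):

```python
sentence = ("They lay back on the San Francisco grass "
            "and looked at the stars and their")
tokens = sentence.split()
print(len(tokens))        # 15 tokens
print(len(set(tokens)))   # 13 types
```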

15
Tokenization
• The process of splitting the text into smaller units or tokens
• Tokens could be words, numbers, symbols, n-grams, or characters
• N-grams are sequences of n words or characters
• Tokenization does this task by locating word boundaries
• Issues with tokenization
• m.p.h., Ph.D. → ??
• San Francisco → one token or two?
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
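One possible tokenization policy for the hyphen/apostrophe cases above, sketched with a regex (one of many defensible choices, not *the* answer):

```python
import re

text = "Hewlett-Packard's state-of-the-art lab isn't in San Francisco."

# Naive whitespace split keeps punctuation attached to words:
print(text.split()[-1])   # 'Francisco.'

# A regex tokenizer that strips outer punctuation but keeps internal
# hyphens and apostrophes inside a single token:
tokens = re.findall(r"\w+(?:[-']\w+)*", text)
print(tokens)
# ["Hewlett-Packard's", 'state-of-the-art', 'lab', "isn't", 'in', 'San', 'Francisco']
```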

16
Lowercasing
• The simplest technique of text preprocessing
• Consists of lowercasing every single token of the input text.
• It helps in dealing with sparsity issues in the dataset
• Treats Lebanon and lebanon as the same word
• It could also increase ambiguity
• When Apple (the company) is transformed into apple
• Confuses the model
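The sparsity/ambiguity trade-off in two lines (a contrived token list for illustration):

```python
tokens = ["Apple", "released", "an", "apple", "pie", "recipe",
          "in", "Lebanon", "lebanon"]
lowered = [t.lower() for t in tokens]
print(len(set(tokens)))    # 9 distinct types before lowercasing
print(len(set(lowered)))   # 7 after: Lebanon/lebanon merge, but so do Apple/apple
```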

17
Normalization

• How to match such terms?


• U.A.E and UAE
• America, United States, U.S.A, and USA…..
• Normalize the terms
• Define equivalence classes; e.g.,
• Remove periods in terms
• Create a map for all forms of one term
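Both normalization strategies above can be sketched together; the mapping entries are made-up examples, not a standard resource.

```python
# Equivalence classes via a normalization map (illustrative entries).
CANONICAL = {"america": "usa", "united states": "usa"}

def normalize(term):
    t = term.lower().replace(".", "")   # remove periods: U.A.E -> uae
    return CANONICAL.get(t, t)          # map known variants to one form

print(normalize("U.A.E"), normalize("UAE"))       # uae uae
print(normalize("America"), normalize("U.S.A"))   # usa usa
```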

18
Lemmatization and Stemming
• Lemmatization: reduce inflections or variants of a word to its correct dictionary base form (lemma)
• be ← am, are, is
• car ← car, cars, car’s, cars’
• Example: The boy’s cars are different colors
→ the boy car be different color
• Stemming: Reduce terms to their stems (core meaning-bearing units)
• A stem doesn’t have to exist in the dictionary
• automate(s), automatic, automation -> automat
• Example: for example compressed and compression are both accepted as
equivalent to compress.
→ for exampl compress and compress ar both accept as equival to compress
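A toy suffix-stripping stemmer that reproduces the examples above. This is deliberately much simpler than Porter's rule cascade; in practice nltk's PorterStemmer is the usual choice.

```python
# Toy suffix stripper -- NOT the real Porter algorithm, just the idea:
# chop a known suffix when enough of the word remains.
SUFFIXES = ["ion", "ing", "ed", "es", "ic", "s"]

def toy_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

for w in ["automates", "automatic", "automation", "compressed", "compression"]:
    print(toy_stem(w))
# automat automat automat compress compress
```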

19
Porter’s Stemming Algorithm

[Figure: Porter stemmer suffix-stripping rules; “v” denotes a vowel]
20
Regular Expression

• Find patterns of words


• String, String.h, Stdout, stdout.h
• woodchuck, woodchucks, Woodchuck, Woodchucks
• A wide variety of usages
• Extract patterns from HTML documents
• Search files for lines matching a pattern (e.g., grep, find in UNIX)
• Convert comma-separated files (,) to line-separated files (\n)
• Create a specific tokenizer
• Etc.
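Two of these uses in Python's re module: matching all variants of a word with one pattern, and converting comma-separated to line-separated text:

```python
import re

# One pattern covers all four woodchuck variants:
print(re.findall(r"[wW]oodchucks?",
                 "Woodchuck woodchucks woodchuck Woodchucks"))

# Convert a comma-separated line to line-separated values:
print(re.sub(",", "\n", "name,age,city"))
```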

21
Regular Expression - Square Brackets

Patterns        Matches                         Examples
[wW]ood         w or W followed by “ood”        Wood, wood
[1234567890]    any digit                       1, 2, 3, 4, 5, 6, 7, 8, 9, or 0

Ranges:
[A-Z]           an upper-case letter            Big Data 4
[a-z]           a lower-case letter             Big Data 4
[0-9]           a single digit                  Big Data 4

^ means negation only if it is the first character inside the brackets (otherwise it is the literal character “^”):
[^A-Z]          not an upper-case letter        Big Data 4
[^iB]           neither i nor B                 Big Data 4
[a^b]           a, ^, or b                      A pattern a^b
a\^b            the literal pattern a^b         A pattern a^b
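The character-class patterns from the table, run against the table's own example strings:

```python
import re

s = "Big Data 4"
print(re.findall(r"[A-Z]", s))    # ['B', 'D']
print(re.findall(r"[0-9]", s))    # ['4']
print(re.findall(r"[^A-Z]", s))   # every character except 'B' and 'D'
print(re.findall(r"[wW]ood", "Wood and wood"))   # ['Wood', 'wood']
print(bool(re.search(r"a\^b", "A pattern a^b"))) # True: escaped ^ is literal
```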

22
Regular Expression: ?, *, +, .

Patterns      Matches                                                    Examples
colou?r       optional previous char                                     colour, color
o*h!          0 or more of previous char                                 h!, oh!, ooh!, ooooh!
o+h!          1 or more of previous char                                 oh!, ooh!, oooh!, ooooh!
m.n           . means any character                                      man, men, mon, mun, m3n
f.*[0-9]      0 or more of any character between f and a digit           f9, foo1, foo5, fan4, f(*-9, fan home9, …
f.*[0-9]+     0 or more of any character between f and a set of digits   f9, foo11, foo4333, fan home9, …
f.+[0-9]{3}   1 or more of any character between f and 3 digits          f-234, faa234, fbbjjiij_u433, f2234, …
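The quantifier patterns from the table, checked in Python:

```python
import re

print(re.findall(r"colou?r", "color and colour"))  # ['color', 'colour']
print(re.findall(r"o+h!", "oh! ooh! ooooh!"))      # ['oh!', 'ooh!', 'ooooh!']
print(re.findall(r"m.n", "man men m3n moon"))      # ['man', 'men', 'm3n']
print(bool(re.search(r"f.*[0-9]", "fan home9")))   # True
print(bool(re.search(r"f.+[0-9]{3}", "faa234")))   # True
```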

23
Regular Expression: Anchors ^, $, and Pipe |

^ and $: ^ matches the beginning and $ the end of a line

Pattern       Matches                                             Example
^[A-Z]        an upper-case letter at the beginning of a line     Montreal
^[^A-Za-z]    a non-letter at the beginning of a line             1a, “Hello”
\.$           a literal dot at the end of a line                  The sentence.
.$            any character at the end of a line                  The., Band?, Wow!

Pipe |:
data|collection         data or collection                        data in a…, collection of…
a|b|c                   a, b, or c (same as [abc])                a sentence
[Dd]ata|[Cc]ollection   Data, data, Collection, or collection     Data in a…, data in a…
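The anchor and pipe patterns from the table, checked in Python:

```python
import re

print(bool(re.search(r"^[A-Z]", "Montreal")))     # True: capital at start
print(bool(re.search(r"^[^A-Za-z]", "1a")))       # True: non-letter at start
print(bool(re.search(r"\.$", "The sentence.")))   # True: literal dot at end
print(bool(re.search(r"\.$", "Band?")))           # False: ends with ?, not .
print(re.findall(r"[Dd]ata|[Cc]ollection", "Data in a data collection"))
# ['Data', 'data', 'collection']
```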

24
Regular Expression Tools
• UNIX
• grep, sed, tr (translate), awk, etc.
• Websites
• www.regexpal.com, http://www.regexr.com/
• Regex Editors
• Notepad++: http://notepad-plus-plus.org/
• http://www.regular-expressions.info/tools.html
• Programming Languages
• Python, Java, C, C++, JavaScript, SQL, etc.

25
