Lect_05_Preprocessing_text
What is Natural Language Processing (NLP)?
• An applied science that combines the power of computer science,
artificial intelligence, and computational linguistics to get computers to
perform useful tasks involving human (written and spoken) languages:
• Human-Machine communication
Source: http://amzn.to/textmine
Text Analytics Business Use Cases?
• Data augmentation:
• Augment customers’ data
from text
• Document summarization
• Consume relevant
information faster
• Sentiment analysis:
• Know what your customers think of your product or service and what the common issues are
• Document classification and categorization
• Organize text based on
categories for a rapid and easy
retrieval of information
Machine Learning Pipeline
ML Algorithms Expect Numbers
Model.fit(X, y)
Examples:
1. Machine learning seems cool, but I hate programming.
2. This is a bad investment.
3. …
Features (F_1 … F_m) and labels (y):
       F_1     F_2     …   F_m     y
X_1    V_1,1   V_1,2   …   V_1,m   L_1
X_2    V_2,1   V_2,2   …   V_2,m   L_2
⋮      ⋮       ⋮       ⋮   ⋮       ⋮
One Hot Encoding Features
Model.fit(X, y)
Examples:
1. I would like to learn programming.
2. I think this is a very very good investment!
3. …
Features (one column per vocabulary word) and labels (y):
       I   is  a   to  would  …  learn  very  good  …  y
X_1    1   0   0   1   1      …  1      0     0     …  pos
X_2    1   1   1   0   0      …  0      1     1     …  neg
⋮      ⋮   ⋮   ⋮   ⋮   ⋮         ⋮      ⋮     ⋮        ⋮
• The feature vector contains an entry for every possible word in the (training) vocabulary
• Compute the one hot encoding feature vector (X_i) for an input sentence by marking the presence (1) or absence (0) of each vocabulary word in that sentence
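The presence/absence encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`tokenize`, `build_vocabulary`, `one_hot_vector`) and the simple lowercase, letters-only tokenizer are assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(sentences):
    """Collect the sorted set of unique words seen in the training sentences."""
    return sorted({word for s in sentences for word in tokenize(s)})

def one_hot_vector(sentence, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0) in the sentence."""
    present = set(tokenize(sentence))
    return [1 if word in present else 0 for word in vocabulary]

sentences = [
    "I would like to learn programming.",
    "I think this is a very very good investment!",
]
vocabulary = build_vocabulary(sentences)
X = [one_hot_vector(s, vocabulary) for s in sentences]

for word, x1, x2 in zip(vocabulary, X[0], X[1]):
    print(f"{word:12s} {x1} {x2}")
```

Note that "very" occurs twice in the second sentence but is still encoded as 1, because only presence is recorded.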
Term Frequency (TF) or Bag-of-Words (BOW) Features
Model.fit(X, y)
Examples:
1. I would like to learn programming.
2. I think this is a very very good investment!
3. …
Features (one column per vocabulary word) and labels (y):
       I   is  a   to  would  …  learn  very  good  …  y
X_1    1   0   0   1   1      …  1      0     0     …  pos
X_2    1   1   1   0   0      …  0      2     1     …  neg
⋮      ⋮   ⋮   ⋮   ⋮   ⋮         ⋮      ⋮     ⋮        ⋮
• The feature vector contains an entry for every possible word in the (training) vocabulary
• Compute the BOW/TF feature vector (X_i) for an input sentence by counting the number of times each vocabulary word appears in that sentence
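A minimal sketch of the counting step, assuming a simple lowercase, letters-only tokenizer and using the column words shown on the slide as a fixed vocabulary. The names `tokenize` and `bow_vector` are illustrative, not from any library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def bow_vector(sentence, vocabulary):
    """Count how many times each vocabulary word occurs in the sentence."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]

# Vocabulary columns taken from the slide's table header
vocabulary = ["i", "is", "a", "to", "would", "learn", "very", "good"]

x1 = bow_vector("I would like to learn programming.", vocabulary)
x2 = bow_vector("I think this is a very very good investment!", vocabulary)
print(x1)
print(x2)
```

The only difference from the presence/absence encoding is that repeated words, such as "very" here, contribute their full count.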
Text Preprocessing for Cleaner Features
Model.fit(X, y)
Examples:
1. I’d like to learn machine learning, but i must learn programming.
2. I think this is a very veryyy good investment!
3. …
Features and labels (? = entry that is ambiguous without preprocessing):
       I   is  a   to  would  …  learn  very  good  …  y
X_1    ?   0   0   1   ?      …  ?      0     0     …  pos
X_2    1   1   1   0   0      …  0      ?     1     …  neg
⋮      ⋮   ⋮   ⋮   ⋮   ⋮         ⋮      ⋮     ⋮        ⋮
• Stop words
• Frequency-Based Filtering
• Rare and/or frequent words
• Correcting spelling and grammar
• Removing character repetitions
• Etc.
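A few of the cleaning steps listed above can be sketched as follows. This is an illustration under stated assumptions: the stop-word list is a tiny made-up sample, the rule "collapse any character repeated three or more times into one" is a crude heuristic (it would turn "goood" into "god"), and `min_count=2` is an arbitrary threshold for rare-word filtering.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"i", "a", "is", "to", "this", "but", "the"}

def collapse_repeats(token):
    """Collapse a character repeated 3+ times in a row into one occurrence."""
    return re.sub(r"(.)\1{2,}", r"\1", token)

def clean_tokens(text):
    """Lowercase, tokenize, remove character repetitions, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [collapse_repeats(t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

def drop_rare(docs_tokens, min_count=2):
    """Frequency-based filtering: keep tokens seen at least min_count times overall."""
    counts = Counter(t for doc in docs_tokens for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs_tokens]

docs = [
    "I'd like to learn machine learning, but i must learn programming.",
    "I think this is a very veryyy good investment!",
]
cleaned = [clean_tokens(d) for d in docs]
frequent_only = drop_rare(cleaned, min_count=2)
print(cleaned)
print(frequent_only)
```

After cleaning, "veryyy" and "very" fall into the same feature column, which is exactly the kind of entry marked "?" in the table.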
Text Preprocessing
• In Natural Language Processing (NLP), text preprocessing is the first step in building a model
• Common text preprocessing techniques:
• Sentence segmentation
• Tokenization
• Lower casing
• Stop words removal
• Stemming
• Lemmatization
• Etc.
Sentence Segmentation
• Sentence segmentation (or sentence tokenization) is the process of splitting text into individual sentences
• Techniques used:
• Search for a period “.”
• Handwritten rules
• Regular expressions
• Train a machine learning classifier to detect the end of sentences
• “.” is ambiguous
• Abbreviations like Inc. or Dr.
• Numbers like 0.02% or 4.3
• We can use regular expressions to look for specific patterns
• “?” and “!” are far less ambiguous and usually mark a sentence boundary
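The rule-based approach can be sketched with a regular expression plus a small abbreviation list. This is a minimal sketch, assuming that a boundary is ".", "?" or "!" followed by whitespace and that the abbreviation set below (a hypothetical sample) never ends a sentence. Neither assumption holds in general, which is why trained classifiers are also used.

```python
import re

# Illustrative abbreviation list (an assumption; real systems use larger lists)
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "e.g.", "i.e."}

def split_sentences(text):
    """Split text at ., ? or ! followed by whitespace, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.?!]\s+", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Dr." or "Inc." is treated as not ending the sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = ("Dr. Smith joined Acme Inc. in 2010. Growth was 0.02% last year! "
        "Is 4.3 a good score? Yes.")
result = split_sentences(text)
for sentence in result:
    print(sentence)
```

Numbers such as 0.02% and 4.3 are left intact because their period is not followed by whitespace. A sentence that really does end in "Inc." would be merged with the next one, which shows the limits of handwritten rules.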
Word Token vs. Types
• How many words?
They lay back on the San Francisco grass and looked at the stars and their
• Type: element of the vocabulary (unique word)
• Token: An instance of that type in running text
• How many?
• 15 tokens (or 14)
• 13 types (or 12) (or 11?)
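The counts above can be checked with a short sketch, assuming whitespace tokenization. Treating "San Francisco" as a single token (here by joining it with an underscore, an arbitrary choice for the example) gives the alternative counts.

```python
text = "They lay back on the San Francisco grass and looked at the stars and their"

# Whitespace tokenization: every running word is a token
tokens = text.split()
types = set(tokens)

# Alternative: treat the place name "San Francisco" as one token
merged_tokens = text.replace("San Francisco", "San_Francisco").split()
merged_types = set(merged_tokens)

print(len(tokens), len(types))
print(len(merged_tokens), len(merged_types))
```

"the" and "and" each occur twice, which is why there are fewer types than tokens.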
Tokenization
• The process of splitting the text into smaller units or tokens
• Tokens could be words, numbers, symbols, n-grams, or characters
• An n-gram is a sequence of n consecutive words or characters
• Tokenization works by locating word boundaries
• Issues with tokenization
• m.p.h., Ph.D. → ??
• San Francisco → one token or two?
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
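One possible set of answers to the questions above can be sketched with a regular-expression tokenizer. This sketch makes specific, debatable choices: it keeps internal apostrophes and hyphens inside a token and splits every other symbol off as its own token. The function names `word_tokenize` and `ngrams` are illustrative and do not refer to a library API.

```python
import re

def word_tokenize(text):
    """Return word tokens (keeping internal ' and -) and single-symbol tokens."""
    return re.findall(r"\w+(?:[-'’]\w+)*|[^\w\s]", text)

def ngrams(items, n):
    """Return all runs of n consecutive items (words or characters)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tokens = word_tokenize("Finland's capital isn't state-of-the-art, is it?")
word_bigrams = ngrams(tokens, 2)
char_bigrams = ngrams(list("text"), 2)

print(tokens)
print(word_bigrams[:2])
print(char_bigrams)
print(word_tokenize("Ph.D."))  # the abbreviation is broken into pieces
```

With these rules "state-of-the-art" and "Finland's" stay whole, while "Ph.D." is split at each period, illustrating why abbreviations need special handling.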
Lowercasing
• The simplest technique of text preprocessing
• Consists of lowercasing every single token of the input text.
• It helps in dealing with sparsity issues in the dataset
• Lebanon and lebanon are treated as the same word
• It could also increase ambiguity
• Apple (the company) is transformed into apple (the fruit), which can confuse the model
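A small sketch of both effects, using a made-up token list: lowercasing reduces the number of distinct types (less sparsity), but the company "Apple" and the fruit "apple" become indistinguishable.

```python
from collections import Counter

# Made-up token list for illustration
tokens = ["Lebanon", "lebanon", "Apple", "apple", "Is", "is"]

raw_counts = Counter(tokens)
lower_counts = Counter(t.lower() for t in tokens)

print(len(raw_counts), "types before lowercasing")
print(len(lower_counts), "types after lowercasing")
print(lower_counts)
```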
Normalization
Lemmatization and Stemming
• Lemmatization: Reduce inflections or variants of a word to its correct dictionary base form
• am, are, is → be
• car, cars, car’s, cars’ → car
• Example: The boy’s cars are different colors
→ the boy car be different color
• Stemming: Reduce terms to their stems (core meaning-bearing units)
• A stem doesn’t have to exist in the dictionary
• automate(s), automatic, automation → automat
• Example: for example compressed and compression are both accepted as equivalent to compress.
→ for exampl compress and compress ar both accept as equival to compress
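The contrast between the two can be sketched with toy versions of each. Both are assumptions made for illustration: the lemmatizer is a lookup in a hypothetical mini-dictionary (real lemmatizers use a full lexicon and part-of-speech information), and the stemmer is a crude suffix stripper, not Porter's algorithm.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(tokens):
    """Look each lowercased token up in the dictionary; keep it if unknown."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

def crude_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    word = word.lower()
    for suffix in ("ion", "ic", "ed", "es", "e", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # keep words like "compress" intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

lemmas = lemmatize("The boy's cars are different colors".split())
stems = [crude_stem(w) for w in ["automate", "automates", "automatic", "automation"]]

print(lemmas)
print(stems)
print([crude_stem(w) for w in ["compressed", "compression", "compress"]])
```

The lemmatizer returns dictionary words ("be", "car"), while the stemmer returns "automat", which is not a word but still groups the related forms together.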
Porter’s Stemming Algorithm
v: vowel
Regular Expression
Regular Expression - Square Bracket
Pattern        Matches                     Example
[wW]ood        w or W followed by “ood”    Wood, wood
[1234567890]   Any digit                   1, 2, 3, 4, 5, 6, 7, 8, 9, or 0
Range
Pattern        Matches                     Example
[A-Z]          An upper case letter        Big Data 4
[a-z]          A lower case letter         Big Data 4
[0-9]          A single digit              Big Data 4
Negation: ^ means negation only if it is the first character inside the brackets (otherwise it is the literal character “^”)
Pattern        Matches                     Example
[^A-Z]         Not an upper case letter    Big Data 4
[^iB]          Neither i nor B             Big Data 4
[a^b]          a or ^ or b                 A pattern a^b
a\^b           The literal pattern a^b     A pattern a^b
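The bracket patterns in the table can be tried out with Python's `re` module. The sketch below reports the first character each pattern matches in the example string "Big Data 4"; the variable names are arbitrary.

```python
import re

text = "Big Data 4"

# First match of each character-class pattern in the example string
first_match = {
    pattern: re.search(pattern, text).group()
    for pattern in ["[A-Z]", "[a-z]", "[0-9]", "[^A-Z]", "[^iB]"]
}
for pattern, match in first_match.items():
    print(f"{pattern:8s} -> {match!r}")

# [wW]ood matches both capitalizations
wood = re.findall(r"[wW]ood", "Wood and wood")

# ^ is literal when it is not the first character inside the brackets
caret_class = re.findall(r"[a^b]", "A pattern a^b")
caret_escaped = re.findall(r"a\^b", "A pattern a^b")
print(wood, caret_class, caret_escaped)
```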
Regular Expression: ?, *, +, .
Pattern    Matches                                            Examples
f.*[0-9]   0 or more of any character between f and a digit   f9, foo1, foo5, fan4, f(*-9, fan home9, …
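The row above can be verified in Python, along with one small example for each of the other operators in the slide title. Apart from `f.*[0-9]` and its example strings, the patterns and word lists below are illustrative choices, not taken from the slide.

```python
import re

# The slide's pattern: "f", then any characters, then a digit
digit_after_f = re.compile(r"f.*[0-9]")
slide_examples = ["f9", "foo1", "foo5", "fan4", "f(*-9", "fan home9"]
matches_all = all(digit_after_f.fullmatch(s) is not None for s in slide_examples)

# .* is greedy: it extends to the last digit it can reach
greedy = re.search(r"f.*[0-9]", "foo1 bar2").group()

# ? = optional previous character, + = one or more, . = any single character
optional_u = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]
one_or_more = [w for w in ["b", "ba", "baaa"] if re.fullmatch(r"ba+", w)]
any_char = [w for w in ["begin", "begun", "begn"] if re.fullmatch(r"beg.n", w)]

print(matches_all, greedy)
print(optional_u, one_or_more, any_char)
```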
Regular Expression: Anchors ^, $, and Pipe |
^ and $: ^ matches the beginning and $ the end of a line
Pattern    Matches                                            Example
^[A-Z]     An upper case letter at the beginning of a line    Montreal
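A short sketch of the anchors and the pipe in Python. The multi-line sample text and the `cat|dog` pattern are made up for the example. In Python, `^` and `$` match at every line only when the `re.MULTILINE` flag is set; otherwise `^` matches only at the start of the whole string.

```python
import re

lines = "Montreal is cold.\nthe end\nQuebec"

# ^ anchors the match to the beginning of each line
starts_upper = re.findall(r"^[A-Z]", lines, flags=re.MULTILINE)

# $ anchors the match to the end of each line
ends_with_period = re.findall(r"\.$", lines, flags=re.MULTILINE)

# | is disjunction: match either alternative
either = re.findall(r"cat|dog", "cat dog bird")

print(starts_upper, ends_with_period, either)
```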
Regular Expression Tools
• UNIX
• grep, sed, tr (translate), awk, etc.
• Websites
• www.regexpal.com, http://www.regexr.com/
• Regex Editors
• Notepad++: http://notepad-plus-plus.org/
• http://www.regular-expressions.info/tools.html
• Programming Languages
• Python, Java, C, C++, JavaScript, SQL, etc.