
Text Pre-processing

✓ Sentence Segmentation
✓ Word Tokenization
✓ Word Segmentation
✓ Text Normalization

Text Preprocessing

Sentence Segmentation
The task of segmenting running text into sentences.

Word Tokenization
The task of segmenting running text into words.

Word Segmentation (e.g., #WelcometoIndia)
The task of segmenting a combined string into its component words; requires a dictionary to match against (see the sketch after this list).

Normalization
The task of putting words/tokens into a standard format.
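A common dictionary-based approach to word segmentation is greedy maximum matching. Below is a minimal sketch, assuming a toy dictionary (the word list is illustrative, not from the slides):

```python
def max_match(text, dictionary):
    """Greedy maximum matching: repeatedly take the longest
    dictionary word that prefixes the remaining string."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):  # try the longest prefix first
            if text[:end].lower() in dictionary or end == 1:
                tokens.append(text[:end])
                text = text[end:]
                break
    return tokens

# Toy dictionary for the hashtag example
dictionary = {"welcome", "to", "india"}
print(max_match("WelcometoIndia", dictionary))
# ['Welcome', 'to', 'India']
```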

Sentence Segmentation
Boundary detection: identify periods, semicolons, exclamation marks, and question marks, and break the text stream on them.
! and ? are mostly unambiguous, but the period “.” is very ambiguous.

Issues: a “.” that does not mark the end of a sentence (numbers, abbreviations):

U.S. Corporate
Mr. Ram
Christopher S. Manning
e.g.
[email protected]
Sentence Segmentation
Common algorithm:
Tokenize, then use rules or ML to classify a period as either
(a) part of the word or (b) a sentence boundary.

An abbreviation dictionary can help.

Sentence segmentation can then be based on this tokenization.
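A minimal rule-based sketch of this idea, using an abbreviation dictionary to classify periods (the abbreviation list is illustrative, not exhaustive):

```python
import re

# Illustrative abbreviation dictionary (not exhaustive).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "u.s.", "etc."}

def segment_sentences(text):
    """Split on ! and ? always; split on '.' only when the token
    ending there is not a known abbreviation or a single initial."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", text):
        i = m.end()
        words = text[start:i].split()
        token = words[-1].lower() if words else ""
        if m.group() == "." and (token in ABBREVIATIONS
                                 or re.fullmatch(r"[a-z]\.", token)):
            continue  # classify this period as part of the word
        sentences.append(text[start:i].strip())
        start = i
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(segment_sentences("Mr. Ram met Christopher S. Manning. It was brief!"))
# ['Mr. Ram met Christopher S. Manning.', 'It was brief!']
```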
Sentence Segmentation
Sentence boundary detection algorithms:
• Rule based (if-else), for small feature sets
• Regular-expression based
• Statistical classification trees
  • Probability of case and length of the words occurring before or after a boundary
• Machine learning (Naive Bayes, MaxEnt)
  • Part of speech and other feature distributions of the preceding and following words
Tokenization
Divide text into units called tokens (words, numbers, punctuation).

Whitespace alone does not reliably indicate a word break.

Exceptions:
• San Francisco
• the New York-New Haven railroad
• wake up, work out
Tokenization & Punctuation
Dots: dots help indicate sentence boundaries.
It’s useful to keep word-internal dots: the period distinguishes Wash. (the abbreviation) from wash (the verb).

Single apostrophes: contractions (isn’t, didn’t, dog’s)
Useful for meaning extraction (is + n’t, i.e., not)

Hyphenation:
• Keep as a single word: co-operate
• Separate into words: 26-year-old, aluminum-export-ban
Tokenization & Punctuation
Punctuation as a separate token:
Commas are a useful piece of information for parsers (clause/phrase boundaries).

We want to keep punctuation that occurs word-internally; do not separate:
• Ph.D.
• AT&T
• cap’n
• prices ($45.55) and dates (01/02/06)
• URLs (http://www.nitt.edu)
• Twitter hashtags (#nlpACL)
• email addresses ([email protected])
• numbers with commas (5,50,500.50)
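A minimal regular-expression sketch that keeps such word-internal punctuation together (the pattern is illustrative, not exhaustive):

```python
import re

# Order matters: try the most specific patterns first.
pattern = r'''(?x)                 # verbose mode
      https?://\S+                 # URLs
    | \w+(?:\.\w+)+\.?             # abbreviations like Ph.D., U.S.
    | [#@]\w+                      # hashtags and @-mentions
    | \$?\d+(?:[.,/]\d+)*%?        # prices, dates, numbers with commas
    | \w+(?:[-'&]\w+)*             # words, hyphens, AT&T, cap'n
    | [.,;!?()"']                  # punctuation as separate tokens
'''

text = "AT&T charges $45.55 on 01/02/06, see http://www.nitt.edu #nlpACL!"
print(re.findall(pattern, text))
# ['AT&T', 'charges', '$45.55', 'on', '01/02/06', ',', 'see',
#  'http://www.nitt.edu', '#nlpACL', '!']
```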
Tokenization - Clitics
A clitic is a part of a word that can’t stand on its own; it occurs only when attached to another word.
what’re, we’re

A tokenizer can expand clitic contractions that are marked by apostrophes, converting them to what are, we are.
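A tiny sketch of such expansion with a lookup table (the mapping shown is partial and illustrative):

```python
# Partial, illustrative contraction table.
CONTRACTIONS = {"what're": "what are", "we're": "we are",
                "isn't": "is not", "didn't": "did not"}

def expand_clitics(tokens):
    expanded = []
    for tok in tokens:
        # Expand if known, otherwise keep the token unchanged.
        expanded.extend(CONTRACTIONS.get(tok.lower(), tok).split())
    return expanded

print(expand_clitics(["What're", "we're", "doing"]))
# ['what', 'are', 'we', 'are', 'doing']
```

Note that a plain lookup table cannot resolve genuinely ambiguous cases such as dog’s (possessive vs. dog is).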

Tokenization
• Tokenization standard: the Penn Treebank tokenization standard.
• Used for the parsed corpora (treebanks) released by the Linguistic Data Consortium (LDC).
• This standard:
  • Separates out clitics (doesn’t becomes does + n’t)
  • Keeps hyphenated words together
  • Separates out all punctuation
Tokenizer output - Penn Treebank tokenization standard
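As an illustration, NLTK’s TreebankWordTokenizer produces output in this style (a sketch; the example sentence is the standard one from Jurafsky and Martin):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
sentence = '"The San Francisco-based restaurant," they said, "doesn\'t charge $10".'
print(tokenizer.tokenize(sentence))
# ['``', 'The', 'San', 'Francisco-based', 'restaurant', ',', "''", 'they',
#  'said', ',', '``', 'does', "n't", 'charge', '$', '10', "''", '.']
```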

Tokenization – Multiword Expressions
Multiword expressions can be tokenized as a single token:

New York (or) rock ’n’ roll

Requires a multiword expression dictionary.

Tokenization is thus tied up with named entity detection.
How many words in a sentence?
They lay back on the San Francisco grass and looked at the stars and their

Type: types are the number of distinct words in a corpus.
Token: tokens are the total number of running words.

• How many types/tokens in the sentence above?
• 15 tokens (counting San and Francisco separately)
• 13 types (the and and each occur twice)
How many words in a corpus?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary

Heaps' Law (= Herdan's Law): |V| = kN^β, with 0.67 < β < 0.75 and k typically between 10 and 100.
So vocabulary size grows faster than the square root of the number of word tokens.

Corpus                            Tokens = N     Types = |V|
Switchboard phone conversations   2.4 million    20 thousand
Shakespeare                       884,000        31 thousand
COCA                              440 million    2 million
Google N-grams                    1 trillion     13+ million
Tokenization in NLTK
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
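For example, with NLTK’s standard sentence and word tokenizers (a sketch; the model name to download varies slightly across NLTK versions):

```python
import nltk
nltk.download('punkt', quiet=True)  # tokenizer models ('punkt_tab' in newer NLTK)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "They lay back on the grass. It isn't dark yet!"
print(sent_tokenize(text))
# ['They lay back on the grass.', "It isn't dark yet!"]
print(word_tokenize("It isn't dark yet!"))
# ['It', 'is', "n't", 'dark', 'yet', '!']
```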
TOKENIZATION: Byte Pair Encoding
A recent approach to text tokenization:

Use the data to tell us how to tokenize.

Subword tokenization (because tokens can be parts of words as well as whole words).
Subword tokenization
Three common algorithms:
• Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
• Unigram language modeling tokenization (Kudo, 2018)
• WordPiece (Schuster and Nakajima, 2012)

Token learner: takes a raw training corpus and induces a vocabulary (a set of tokens).
Token segmenter: takes a raw test sentence and tokenizes it according to that vocabulary.
Byte Pair Encoding (BPE) token learner
Let the vocabulary be the set of all individual characters
= {A, B, C, D, …, a, b, c, d, …}
Repeat:
• Choose the two symbols most frequently adjacent in the training corpus (say 'A', 'B')
• Add a new merged symbol 'AB' to the vocabulary
• Replace every adjacent 'A' 'B' in the corpus with 'AB'
Until k merges have been done.
BPE token learner algorithm
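A minimal Python sketch of the learner, representing the corpus as symbol sequences with counts (a toy implementation, not an optimized one):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus (sequence -> frequency)."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(corpus, k):
    """Perform k merges; return the merge list (in learned order)."""
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus

# The corpus from the slides: symbol sequence (letters + '_') -> count
corpus = {tuple("low_"): 5, tuple("lowest_"): 2, tuple("newer_"): 6,
          tuple("wider_"): 3, tuple("new_"): 2}
merges, corpus = learn_bpe(corpus, 8)
print(merges[:3])  # [('e', 'r'), ('er', '_'), ('n', 'e')]
```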
Byte Pair Encoding (BPE): Preprocessing the Corpus
Subword algorithms run inside space-separated tokens:

• Add a special end-of-word symbol '_' before each space in the training corpus
• Separate each word into letters

BPE token learner
Sample corpus:

low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new

Add end-of-word tokens and separate into letters, giving this representation (word counts on the left) and initial vocabulary:

5  l o w _
2  l o w e s t _
6  n e w e r _
3  w i d e r _
2  n e w _

Vocabulary: _, d, e, i, l, n, o, r, s, t, w
BPE Token Learner
The first merges on this corpus:
• Merge e r → er
• Merge er _ → er_
• Merge n e → ne
• … and so on, until k merges have been done.
BPE Token Segmenter
On test data, run each merge learned from the training data, in the order learned:

Merge every e r to er, then merge every er _ to er_, etc.

Result:
• Test word "n e w e r _" is tokenized as a full word: "newer_"
• Test word "l o w e r _" becomes two tokens: "low er_"
Properties of BPE tokens
They include frequent words and frequent subwords,
• which are often morphemes like -est or -er.
A morpheme is the smallest meaning-bearing unit of a language.
• unlikeliest: 3 morphemes, un-, likely, and -est
Used in:
• GPT-2, RoBERTa, XLM, FlauBERT
Text Normalization
Typical normalization operations:
• Set all characters to lowercase
• Remove numbers (or convert numbers to textual representations)
• Remove punctuation (generally part of tokenization)
• Strip whitespace (also generally part of tokenization)
• Remove default stop words (general English stop words)
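A minimal sketch of such a pipeline (the stop-word list is truncated for illustration; a real pipeline would use e.g. NLTK's list):

```python
import re
import string

STOP_WORDS = {"the", "a", "an", "and", "is", "are", "to", "of"}  # truncated

def normalize(text):
    text = text.lower()                                  # case folding
    text = re.sub(r'\d+', '', text)                      # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # punctuation
    tokens = text.split()                                # also strips whitespace
    return [t for t in tokens if t not in STOP_WORDS]    # stop-word removal

print(normalize("The 3 stars are out, and the grass is wet!"))
# ['stars', 'out', 'grass', 'wet']
```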

Text Normalization
Tokens need to be normalized:

A single normalized form is chosen for words with multiple forms
• e.g., USA and US
• This is valuable despite the loss of spelling information.
Non-standard words (numbers, acronyms, and dates)
• Map them to a special vocabulary:
  • Any decimal number to a single token 0.0
  • Any acronym to AAA
• The vocabulary becomes smaller
• This improves the accuracy of many language modeling tasks.

Abbreviations: Cust. - Customer
Acronyms: UN - United Nations
Text Normalization - Case folding
Case folding: converting everything to the same case.

Adopted in speech recognition and information retrieval: everything is mapped to lower case.

In sentiment analysis, text classification, information extraction, and machine translation, case is helpful and case folding is generally not done:
US [country] vs. us [pronoun]
Text Normalization

Stemming
• Removes affixes

Lemmatization
• Removes affixes
• The resulting form is a known word in a dictionary
Stemming & Lemmatization
Stemming
• Stemming: reducing derived or inflected words to their stem (base/root form).
• Related words map to the same stem.
• Search engines treat words with the same stem as synonyms (conflation).
Stemming and Lemmatization
How can we know that “organize”, “organizes”, and “organizing” should map to the same word?
• The goal of stemming and lemmatization:
• To reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
am, are, is → be
car, cars, car’s, cars’ → car
• “the boy’s cars are different colors” → “the boy car be different color”
Stemming Algorithms

The Porter Stemmer Algorithm
Word = Stem + Affix(es)
• generalizations = general + ization + s

Porter’s stemmer is a rule-based algorithm
• E.g., ATIONAL → ATE (applied: relational → relate)

Porter’s stemmer is heuristic, in that it is a practical method not guaranteed to be optimal.
The Porter Stemmer
C = a string of one or more consonants
V = a string of one or more vowels

Example: troubles → Tr ou bl e s → C V C V C

All words are of the form (C)(VC)^m (V), i.e., VC repeated m times, where m is called the measure.
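A simplified sketch of computing the measure m (it ignores Porter's special handling of 'y', which counts as a vowel when it follows a consonant):

```python
import re

def measure(stem):
    """Number of VC sequences in the (C)(VC)^m (V) decomposition."""
    cv = ''.join('V' if c in 'aeiou' else 'C' for c in stem.lower())
    cv = re.sub(r'V+', 'V', cv)  # collapse vowel runs
    cv = re.sub(r'C+', 'C', cv)  # collapse consonant runs
    return cv.count('VC')

print(measure("tr"), measure("trouble"), measure("troubles"))  # 0 1 2
```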
The Porter Stemmer: rule format
The rules are of the form:
(condition) S1 → S2
where S1 and S2 are suffixes.

Conditions:
m    The measure of the stem
*S   The stem ends with S
*v*  The stem contains a vowel
*d   The stem ends with a double consonant
*o   The stem ends in CVC
The Porter Stemmer: Step 1
SSES → SS
caresses → caress

IES → I
ponies → poni
ties → ti

SS → SS
caress → caress

S → є
cats → cat
The Porter Stemmer: Step 2a (past tense, progressive)

(m≥1) EED → EE
• Condition verified: agreed → agree
• Condition not verified: feed → feed

(*v*) ED → є
• Condition verified: plastered → plaster
• Condition not verified: bled → bled

(*v*) ING → є
• Condition verified: motoring → motor
• Condition not verified: sing → sing

[*v*: the stem contains a vowel]
The Porter Stemmer: Step 2b (cleanup)
(These rules run only if the second or third rule in 2a applied)

AT → ATE
conflat(ed) → conflate

BL → BLE
troubl(ing) → trouble

(*d & !(*L or *S or *Z)) → single letter
Condition verified: hopp(ing) → hop, tann(ed) → tan
Condition not verified: fall(ing) → fall

(m=1 & *o) → E
• Condition verified: fil(ing) → file
• Condition not verified: fail → fail

[*d: the stem ends with a double consonant; *o: the stem ends in CVC]
The Porter Stemmer: Steps 3 and 4
Step 3: Y Elimination
(*v*) Y → I
Condition verified: happy → happi
Condition not verified: sky → sky

Step 4: Derivational Morphology, I
(m>0) ATIONAL → ATE
relational → relate
(m>0) IZATION → IZE
generalization → generalize
(m>0) BILITI → BLE
sensibiliti → sensible
The Porter Stemmer: Steps 5 & 6
Step 5: Derivational Morphology, II
(m>0) ICATE → IC: triplicate → triplic
(m>0) FUL → є: hopeful → hope
(m>0) NESS → є: goodness → good

Step 6: Derivational Morphology, III
(m>0) ANCE → є: allowance → allow
(m>0) ENT → є: dependent → depend
(m>0) IVE → є: effective → effect
The Porter Stemmer: Step 7 (cleanup)
Step 7a
(m>1) E → є
probate → probat
(m=1 & !*o) NESS → є
goodness → good

Step 7b
(m>1 & *d & *L) → single letter
Condition verified: controll → control
Condition not verified: roll → roll

[*L: the stem ends with L; *d: the stem ends with a double consonant; *o: the stem ends in CVC]
Correct Examples
computers
• Step 1: → computer
• Step 6: → compute
singing
• Step 2a: → sing
controlling
• Step 2a: → controll
• Step 7b: → control
generalizations
• Step 1: → generalization
• Step 4: → generalize
• Step 6: → general
doing
• Step 2a: → do
Problems
elephants → eleph
• Step 1: → elephant
• Step 6: → eleph
Problems with the Porter Stemmer
Words with different meanings are conflated to the same stem:
• general, generous, generation, and generic all conflate to the root gener

Words with similar meanings are not conflated at all:
• recognize and recognition are not conflated
Lancaster Stemmer (iterative)
A table of 120 rules is indexed by the last letter of a suffix.
• On each iteration, find a rule based on the last character of the word.
• Each rule specifies deletion or replacement of an ending.
• Terminate when:
  • No rule is applicable
  • The word starts with a vowel and only two letters remain
  • The word starts with a consonant and only three characters remain
• Otherwise, apply the rule and repeat.
Snowball Stemmer
Supports English and 13 other languages:
• Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, and Swedish.
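All three stemmers are available in NLTK, so they are easy to compare side by side (a sketch; outputs differ per stemmer, with Lancaster usually the most aggressive):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

for w in ["generalizations", "elephants", "recognition", "organizing"]:
    # Print each word with its three stems for comparison.
    print(w, porter.stem(w), lancaster.stem(w), snowball.stem(w))
```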

Stemmers: Pros and Cons
Cons:
• A stemmer operates on a single word, without knowledge of the context
• It cannot discriminate between words which have different meanings (based on POS)
Pros:
• Easier to implement and runs faster
• Its reduced accuracy may not matter for some applications
Lemmatization
• The algorithmic process of determining the lemma for a given word.
• To extract the proper lemma, we look at the morphological analysis of each word.
• The base form together with its part of speech is called a lexeme.
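With NLTK's WordNet lemmatizer, for example, passing the part of speech matters (a sketch; some NLTK versions also need the 'omw-1.4' data package):

```python
import nltk
nltk.download('wordnet', quiet=True)  # WordNet data for the lemmatizer

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))             # car   (default POS is noun)
print(lemmatizer.lemmatize("are", pos="v"))     # be
print(lemmatizer.lemmatize("better", pos="a"))  # good
```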
Lemmatization – How to?
Computational Morphology
Methods:
• Brute force: try every possible segmentation of the word and see which ones match known stems and affixes
• Rule-based (simplistic): have a list of known affixes, and see which ones apply
• Rule-based (more sophisticated): a list of known affixes, plus knowledge about allowable combinations, e.g., -ing can only attach to a verb stem (morphotactic rules)
References
• The Porter Stemmer home page (with the original paper and code): http://www.tartarus.org/~martin/PorterStemmer/
• Jurafsky and Martin, Chapter 3.4.
• Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3):130-137.
