02_Text Preprocessing_part2
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Introduction to NLP
02 Lexical Analysis
03 Syntax Analysis
04 Other Topics in NLP
Lexical Analysis
• Goals of lexical analysis
✓ Convert a sequence of characters into a sequence of tokens, i.e., meaningful
character strings.
▪ In natural language processing, the morpheme is the basic unit
▪ In text mining, the word is commonly used as the basic unit of analysis
• The sentence is very important in NLP, but it is not critical for some text mining tasks
Lexical Analysis 2: Tokenization
• Text is split into basic units called Tokens
✓ word tokens, number tokens, space tokens, …
                     MC            Scan
Space                Not removed   Removed
Punctuation          Removed       Not removed
Numbers              Removed       Not removed
Special characters   Removed       Not removed
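The choices above can be sketched with a minimal regex tokenizer; this is an illustration of the design options, not the behavior of any particular tool, and the function name and flags are assumptions:

```python
import re

def tokenize(text, remove_punct=True, remove_numbers=True):
    """Split text into word tokens; optionally drop punctuation and number tokens."""
    # match runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if remove_punct:
        tokens = [t for t in tokens if not re.fullmatch(r"[^\w\s]", t)]
    if remove_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

print(tokenize("Dr. Smith paid $30 in 2020!"))
# with both flags off, punctuation and number tokens are kept as separate tokens
print(tokenize("Dr. Smith paid $30 in 2020!", remove_punct=False, remove_numbers=False))
```

Whether to remove such tokens depends on the downstream task; for example, numbers may be noise for topic modeling but essential for information extraction.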
Lexical Analysis 2: Tokenization
• Even tokenization can be difficult
✓ Is John’s sick one token or two?
▪ If one → problems in parsing (where is the verb?)
▪ If two → what do we do with John’s house?
Lexical Analysis 3: Morphological Analysis
Witte (2016)
• Stemming
✓ Reduces a word to a stem by stripping suffixes; the result need not be a valid word
• Lemmatization
✓ Reduces a word to its dictionary form (lemma) via morphological analysis
• Stemming vs. Lemmatization
✓ [Comparison table: Word | Stemming | Lemmatization]
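The contrast can be sketched with a toy suffix stripper against a toy lemma dictionary; both the rules and the dictionary below are illustrative assumptions (real systems use e.g. the Porter stemmer and a WordNet-based lemmatizer):

```python
def stem(word):
    """Crude suffix-stripping stemmer: chops common endings, may yield non-words."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary standing in for a full morphological lexicon.
LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

def lemmatize(word):
    """Dictionary lookup returning a valid dictionary form (lemma)."""
    return LEMMAS.get(word, word)

for w in ("studies", "running"):
    print(w, "->", stem(w), "|", lemmatize(w))
# studies -> stud | study   (stem is not a word; lemma is)
# running -> runn | running (lemma falls back when the lexicon has no entry)
```

This illustrates the table's point: stemming is fast but can produce non-words, while lemmatization returns dictionary forms at the cost of needing a lexicon.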
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Witte (2016)
✓ Probabilistic models
▪ Generative sequence models: Find the most probable tag sequence given the sentence
(Hidden Markov Model; HMM)
▪ Discriminative sequence models: Predict whole sequence with a classifier (Conditional
Random Field; CRF)
✓ Tagging Model
▪ fi is a feature
▪ λi is a weight (a large value indicates an informative feature)
▪ Z(C) is a normalization constant ensuring a proper probability distribution
▪ The model makes no independence assumptions about the features
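The equation for the tagging model did not survive extraction; the standard log-linear (maximum entropy) form consistent with the bullets above is:

```latex
P(t \mid C) = \frac{1}{Z(C)} \exp\left( \sum_{i} \lambda_i f_i(C, t) \right),
\qquad
Z(C) = \sum_{t'} \exp\left( \sum_{i} \lambda_i f_i(C, t') \right)
```

Here t is a candidate tag, C is the context, and the sum in Z(C) runs over all possible tags, so the probabilities sum to one.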
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ An example
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Probabilistic Model for POS Tagging
✓ Find the most probable tag sequence given the sentence
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model
✓ Decompose the probability using Bayes’ rule
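The decomposition referred to above, written out in standard form (W = word sequence, T = tag sequence), along with the HMM independence assumptions used on the next slide:

```latex
\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_{T} P(W \mid T)\, P(T)
```

```latex
P(W \mid T) \approx \prod_{i} P(w_i \mid t_i),
\qquad
P(T) \approx \prod_{i} P(t_i \mid t_{i-1})
```

P(W) can be dropped from the argmax because it does not depend on the tag sequence T.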
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model: Hidden Markov Model
✓ POS → POS transition probabilities
http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
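The most probable tag sequence under these transition and emission probabilities is found with the Viterbi algorithm; below is a minimal sketch in which the tag set and all probability tables are made-up toy values, not estimates from any corpus:

```python
TAGS = ["N", "V"]
START = {"N": 0.7, "V": 0.3}                 # P(tag at position 1)
TRANS = {"N": {"N": 0.3, "V": 0.7},          # P(next tag | current tag)
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"fish": 0.6, "swim": 0.4},     # P(word | tag)
        "V": {"fish": 0.2, "swim": 0.8}}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # best[t] = (probability of best path ending in tag t, that path)
    best = {t: (START[t] * EMIT[t][words[0]], [t]) for t in TAGS}
    for w in words[1:]:
        best = {
            t: max(
                ((p * TRANS[prev][t] * EMIT[t][w], path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in TAGS
        }
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["fish", "swim"]))  # ['N', 'V']
```

Keeping only the best path into each tag at each position is what makes decoding linear in sentence length instead of exponential in the number of tag sequences.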
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Collobert et al. (2011)
http://eric-yuan.me/ner_1/
Lexical Analysis 5: Named Entity Recognition
Approaches for NER: Dictionary/Rule-based
• List lookup: systems that recognize only entities stored in their lists
✓ Advantages: simple, fast, language-independent, easy to retarget
✓ Disadvantages: lists are costly to collect and maintain, cannot cover name variants, and
cannot resolve ambiguity
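A dictionary/list-lookup recognizer can be sketched as a longest-match scan over per-type entity lists; the lists and function below are hypothetical examples:

```python
# Hypothetical entity lists; a real system would load large gazetteers.
ENTITY_LISTS = {
    "PERSON": {"john smith", "mary"},
    "ORG": {"korea university", "mit"},
}

def lookup_ner(text):
    """Return (surface form, entity type) pairs found via longest-match lookup."""
    words = text.lower().split()
    found, i = [], 0
    while i < len(words):
        match = None
        # try the longest span first so "korea university" beats a shorter match
        for j in range(len(words), i, -1):
            span = " ".join(words[i:j])
            for etype, names in ENTITY_LISTS.items():
                if span in names:
                    match = (span, etype, j)
                    break
            if match:
                break
        if match:
            found.append((match[0], match[1]))
            i = match[2]
        else:
            i += 1
    return found

print(lookup_ner("John Smith studies at Korea University"))
```

Note how brittle this is: a variant such as "J. Smith" or attached punctuation breaks the lookup, which is exactly the name-variant disadvantage listed above.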
• MITIE
✓ An open-source information extraction tool developed by the MIT NLP lab
✓ Available for English and Spanish
✓ Available for C++, Java, R, and Python
• CRF++
✓ NER based on conditional random fields
✓ Supports multi-language models