
Lecture 2: Text Preprocessing

Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA

01 Introduction to NLP
02 Lexical Analysis
03 Syntax Analysis
04 Other Topics in NLP
Lexical Analysis
• Goals of lexical analysis
✓ Convert a sequence of characters into a sequence of tokens, i.e., meaningful
character strings.
▪ In natural language processing, the morpheme is the basic unit
▪ In text mining, the word is commonly used as the basic unit of analysis

• Process of lexical analysis


✓ Tokenizing
✓ Part-of-Speech (POS) tagging
✓ Additional analysis: named entity recognition (NER), noun phrase recognition,
sentence split, chunking, etc.
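These steps can be run end to end with an off-the-shelf pipeline. A minimal sketch using spaCy (one possible toolkit, not the one assumed in the lecture; it requires the en_core_web_sm model to be installed):

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Tokenizing + POS tagging: one (token, coarse tag, Penn Treebank tag) per token
for token in doc:
    print(token.text, token.pos_, token.tag_)

# Additional analysis: named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., Apple ORG, U.K. GPE, $1 billion MONEY
```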
Lexical Analysis Hirschberg and Manning (2015)

• Examples of Linguistic Structure Analysis


Lexical Analysis 1: Sentence Splitting Witte (2016)

• The sentence is a very important unit in NLP, but it is not critical for some text mining tasks
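A minimal sketch of sentence splitting with NLTK's sent_tokenize (one common tool for this; abbreviations and clock times are what make a naive split on "." fail):

```python
import nltk

nltk.download("punkt", quiet=True)  # pre-trained sentence-splitting model

text = "Dr. Smith went to Washington. He arrived at 10 a.m. It was raining."
for sent in nltk.sent_tokenize(text):
    print(sent)
# A naive split on "." would wrongly break after "Dr." and "a.m."
```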
Lexical Analysis 2: Tokenization
• Text is split into basic units called Tokens
✓ word tokens, number tokens, space tokens, …

Criterion            MC            Scan
Space                Not removed   Removed
Punctuation          Removed       Not removed
Numbers              Removed       Not removed
Special characters   Removed       Not removed
Lexical Analysis 2: Tokenization
• Even tokenization can be difficult
✓ Is "John's" in "John's sick" one token or two?
▪ If one → problems in parsing (where is the verb?)
▪ If two → what do we do with "John's house"?

✓ What to do with hyphens?


▪ database vs. data-base vs. data base

✓ What to do with “C++”, “A/C”, “:-)”, “…”, “ㅋㅋㅋㅋㅋㅋㅋㅋ”?


✓ Some languages do not use whitespace (e.g., Chinese)

• Consistent tokenization is important for all later processing steps.
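To make the clitic problem concrete, here is a minimal sketch contrasting naive whitespace splitting with NLTK's word_tokenize (illustrative only; tokenizers differ in how they treat clitics and hyphens):

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models

sentence = "John's house has a state-of-the-art data-base."

# Naive whitespace tokenization: "John's" stays one token
print(sentence.split())

# NLTK splits the clitic into "John" + "'s"; hyphenated words stay whole
print(nltk.word_tokenize(sentence))
```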


Lexical Analysis 3: Morphological Analysis Witte (2016)

• Morphological Variants: Stemming and Lemmatization


Lexical Analysis 3: Morphological Analysis Witte (2016)

• Stemming
Lexical Analysis 3: Morphological Analysis Witte (2016)

• Lemmatization
Lexical Analysis 3: Morphological Analysis
• Stemming vs. Lemmatization
Word          Stemming    Lemmatization
Love          Lov         Love
Loves         Lov         Love
Loved         Lov         Love
Loving        Lov         Love
Innovation    Innovat     Innovation
Innovations   Innovat     Innovation
Innovate      Innovat     Innovate
Innovates     Innovat     Innovate
Innovative    Innovat     Innovative
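The table above can be reproduced approximately with NLTK (a sketch; exact stems depend on the stemmer chosen — the Porter stemmer, for instance, maps "loving" to "love" rather than "Lov" — and WordNet lemmatization needs the correct part of speech):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (word, WordNet POS) pairs: v = verb, n = noun, a = adjective
for word, pos in [("loves", "v"), ("loved", "v"), ("loving", "v"),
                  ("innovations", "n"), ("innovative", "a")]:
    print(f"{word:12s} stem={stemmer.stem(word):10s} "
          f"lemma={lemmatizer.lemmatize(word, pos=pos)}")
```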


Lexical Analysis 3: Morphological Analysis
• Stemming vs. Lemmatization with a crude example

(figure: the same sample text shown side by side after stemming and after lemmatization)
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Witte (2016)

• Part of speech (POS) tagging


✓ Given a sentence X, predict its part-of-speech sequence Y
▪ Input: a sequence of tokens, which may be ambiguous
▪ Output: the most appropriate tag for each token, chosen by considering its definition
and its context (relationships with adjacent and related words in the phrase, sentence, or paragraph)

✓ A type of “structured” prediction

• Different POS tags for the same token


✓ I love you. → "love" is a verb
✓ All you need is love. → "love" is a noun
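The "love" example can be checked directly with NLTK's off-the-shelf tagger (a sketch; the tags are from the Penn Treebank tagset introduced below):

```python
import nltk

# Resource names may carry "_eng" / "_tab" suffixes in very recent NLTK releases
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

print(nltk.pos_tag(nltk.word_tokenize("I love you.")))
# [('I', 'PRP'), ('love', 'VBP'), ('you', 'PRP'), ('.', '.')]  -> "love" = verb

print(nltk.pos_tag(nltk.word_tokenize("All you need is love.")))
# [..., ('is', 'VBZ'), ('love', 'NN'), ('.', '.')]             -> "love" = noun
```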
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• POS Tagging
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Tagsets: English
Penn Treebank
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Tagsets: Korean
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Witte (2016)

• POS Tagging Algorithms


Lexical Analysis 4: Part-of-Speech (POS) Tagging
• POS Tagging Algorithms
✓ Pointwise prediction: predict each word individually with a classifier (e.g., Maximum
Entropy Model, SVM)

✓ Probabilistic models
▪ Generative sequence models: Find the most probable tag sequence given the sentence
(Hidden Markov Model; HMM)
▪ Discriminative sequence models: Predict whole sequence with a classifier (Conditional
Random Field; CRF)

✓ Neural network-based models


Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ Encode features for tag prediction
▪ Information about the word and its context: suffix, prefix, neighboring-word information
▪ e.g., f_i(w_j, t_j) = 1 if suffix(w_j) = "ing" and t_j = VBG; 0 otherwise

✓ Tagging Model: P(t | c) = (1 / Z(c)) · exp( Σ_i λ_i · f_i(c, t) )

▪ f_i is a feature
▪ λ_i is a weight (a large value implies an informative feature)
▪ Z(c) is a normalization constant ensuring a proper probability distribution
▪ The model makes no independence assumptions about the features
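A minimal sketch of how such a model scores candidate tags for one word in context (the feature names and weights are hand-set and hypothetical; a real tagger learns λ_i from data):

```python
import math

def features(word, prev_tag, tag):
    """Binary features f_i over the context c = (word, prev_tag) and candidate tag."""
    return {
        ("suffix=ing", tag): word.endswith("ing"),
        ("prev_tag=" + prev_tag, tag): True,
        ("capitalized", tag): word[0].isupper(),
    }

# Hypothetical learned weights lambda_i (a large value = an informative feature)
weights = {
    ("suffix=ing", "VBG"): 2.5,
    ("prev_tag=TO", "VB"): 1.8,
    ("capitalized", "NNP"): 2.0,
}

def tag_probs(word, prev_tag, tags):
    scores = {t: math.exp(sum(weights.get(name, 0.0)
                              for name, active in features(word, prev_tag, t).items()
                              if active))
              for t in tags}
    z = sum(scores.values())                  # Z(c): normalization constant
    return {t: s / z for t, s in scores.items()}

print(tag_probs("running", "PRP", ["VBG", "NN"]))  # VBG wins via the "ing" feature
```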
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ An example
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Probabilistic Model for POS Tagging
✓ Find the most probable tag sequence given the sentence
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model
✓ Decompose the probability using Bayes' rule:
▪ argmax_T P(T|W) = argmax_T P(W|T) · P(T) / P(W) = argmax_T P(W|T) · P(T)
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model: Hidden Markov Model
✓ POS → POS transition probabilities

✓ POS → Word emission probabilities
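A minimal sketch of the HMM scoring with toy, hypothetical probabilities: the Markov assumptions factor P(W|T) · P(T) into exactly these transition and emission terms:

```python
# Toy (hypothetical) probabilities, for illustration only
transition = {("<s>", "PRP"): 0.4, ("PRP", "VBP"): 0.5, ("VBP", "PRP"): 0.3}
emission   = {("PRP", "I"): 0.3, ("VBP", "love"): 0.1, ("PRP", "you"): 0.2}

def joint_prob(words, tags):
    """P(T, W) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i) under the HMM assumptions."""
    p, prev = 1.0, "<s>"                       # "<s>" marks the sentence start
    for w, t in zip(words, tags):
        p *= transition.get((prev, t), 1e-6)   # POS -> POS transition
        p *= emission.get((t, w), 1e-6)        # POS -> word emission
        prev = t
    return p

print(joint_prob(["I", "love", "you"], ["PRP", "VBP", "PRP"]))
# Decoding picks the argmax over all tag sequences, usually via the Viterbi algorithm.
```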


Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Discriminative Sequence Model: Conditional Random Field (CRF)
✓ Relaxes the constraint that a tag is generated by the previous tag sequence
✓ Predicts the whole tag sequence at once, rather than sequentially

http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
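A minimal sketch of the CRF's global scoring, to contrast with the HMM above: feature functions may look at the whole observed sentence x, and the sequence is scored as a whole before one global normalization (feature functions and weights here are hypothetical):

```python
# Hypothetical feature functions f_k(y_prev, y_cur, x, i) with weights lambda_k
def f_verb_after_pronoun(y_prev, y_cur, x, i):
    return 1.0 if (y_prev == "PRP" and y_cur == "VBP") else 0.0

def f_ing_suffix(y_prev, y_cur, x, i):
    return 1.0 if (x[i].endswith("ing") and y_cur == "VBG") else 0.0

feature_fns = [(f_verb_after_pronoun, 1.5), (f_ing_suffix, 2.0)]

def crf_score(x, y):
    """Unnormalized log-score of a whole tag sequence y for sentence x."""
    score, prev = 0.0, "<s>"
    for i, tag in enumerate(y):
        score += sum(lam * f(prev, tag, x, i) for f, lam in feature_fns)
        prev = tag
    return score  # P(y|x) = exp(score) / Z(x), where Z(x) sums over all sequences

print(crf_score(["I", "love", "you"], ["PRP", "VBP", "PRP"]))
```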
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Collobert et al. (2011)

• Neural Network-based Models


✓ Window-based vs. sentence-based
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Neural network-based models
✓ Recurrent neural networks have a feedback loop within the hidden layer

✓ Input-Output mapping of RNNs
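A minimal PyTorch sketch of such an RNN tagger (untrained and illustrative only; the vocabulary and tagset sizes are made up):

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    def __init__(self, vocab_size=10_000, tag_count=45, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # recurrence over time
        self.out = nn.Linear(hidden_dim, tag_count)

    def forward(self, token_ids):
        h, _ = self.rnn(self.embed(token_ids))  # one hidden state per token
        return self.out(h)                      # one tag score vector per token

tagger = RNNTagger()
scores = tagger(torch.randint(0, 10_000, (1, 6)))  # a batch of 1 sentence, 6 tokens
print(scores.shape)  # torch.Size([1, 6, 45]) -> a tag distribution per token
```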


Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Neural network-based models: Recurrent neural networks
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Ma and Hovy (2016)

• Hybrid model: LSTM(RNN) + ConvNet + CRF


Lexical Analysis 5: Named Entity Recognition
• Named Entity Recognition: NER
✓ A subtask of information extraction that seeks to locate and classify elements in text
into pre-defined categories such as names of persons, organizations, and locations,
expressions of time, quantities, monetary values, percentages, etc.

http://eric-yuan.me/ner_1/
Lexical Analysis 5: Named Entity Recognition
Approaches for NER: Dictionary/Rule-based

• List lookup: systems that recognize only entities stored in their lists
✓ Advantages: simple, fast, language-independent, easy to retarget
✓ Disadvantages: the lists must be collected and maintained, cannot deal with name
variants, and cannot resolve ambiguity

• Shallow Parsing Approach


✓ Internal evidence – names often have internal structure. These components can be
either stored or guessed.
▪ Location: Cap Word + {Street, Boulevard, Avenue, Crescent, Road}
▪ e.g.: Wall Street
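A minimal sketch combining both ideas: a list lookup plus a "Cap Word + street-suffix" internal-evidence rule (the lists and patterns are hypothetical):

```python
import re

# List lookup: only entities stored in the list are recognized
PERSONS = {"John Smith", "Ada Lovelace"}

# Internal evidence: Cap Word + {Street, Boulevard, Avenue, Crescent, Road}
LOCATION_RULE = re.compile(r"\b([A-Z][a-z]+ (?:Street|Boulevard|Avenue|Crescent|Road))\b")

def rule_based_ner(text):
    entities = [(p, "PERSON") for p in PERSONS if p in text]
    entities += [(m, "LOCATION") for m in LOCATION_RULE.findall(text)]
    return entities

print(rule_based_ner("Ada Lovelace lived near Wall Street."))
# [('Ada Lovelace', 'PERSON'), ('Wall Street', 'LOCATION')]
```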
Lexical Analysis 5: Named Entity Recognition
Approaches for NER: Model-based

• MITIE
✓ An open-source information extraction tool developed by the MIT NLP lab
✓ Available for English and Spanish
✓ Available for C++, Java, R, and Python

• CRF++
✓ NER based on conditional random fields
✓ Supports multi-language models

• Convolutional neural networks


✓ 1-of-M coding, Word2Vec, and n-grams can be used as encoding methods
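A minimal sketch of the encoding step (1-of-M coding shown; Word2Vec or n-gram features would simply replace the one-hot rows):

```python
import numpy as np

vocab = {"john": 0, "lives": 1, "in": 2, "seoul": 3}  # toy vocabulary (M = 4)

def one_hot(word):
    """1-of-M coding: a length-M vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# A window of encoded words is what the convolutional layer slides over
window = np.stack([one_hot(w) for w in ["john", "lives", "in"]])
print(window.shape)  # (3, 4): window size x vocabulary size
```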
BERT for Multi NLP Tasks
• Google Transformer
✓ Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., ... & Polosukhin, I. (2017). Attention is all you need.
In Advances in Neural Information Processing Systems (pp. 5998-6008).

✓ Excellent blog post explaining Transformer


▪ http://jalammar.github.io/illustrated-transformer/
BERT for Multi NLP Tasks
• BERT
✓ Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805.
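A minimal sketch of applying a pre-trained BERT-family model to one of these tasks (NER) via the Hugging Face transformers library; the specific checkpoint name is an assumption, not from the lecture:

```python
from transformers import pipeline

# Assumed checkpoint: a BERT model fine-tuned for NER (dslim/bert-base-NER)
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")  # merge word pieces into entity spans

for ent in ner("Pilsung Kang teaches at Korea University in Seoul."):
    print(ent["word"], ent["entity_group"], round(ent["score"], 3))
```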
