02_Text Preprocessing_part2
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Introduction to NLP
02 Lexical Analysis
03 Syntax Analysis
04 Other Topics in NLP
Lexical Analysis
• Goals of lexical analysis
✓ Convert a sequence of characters into a sequence of tokens, i.e., meaningful
character strings.
▪ In natural language processing, the morpheme is the basic unit
▪ In text mining, the word is commonly used as the basic unit of analysis
• The sentence is very important in NLP, but it is not critical for some text mining tasks
Lexical Analysis 2: Tokenization
• Text is split into basic units called Tokens
✓ word tokens, number tokens, space tokens, …
                     MC            Scan
Space                Not removed   Removed
Punctuation          Removed       Not removed
Numbers              Removed       Not removed
Special characters   Removed       Not removed
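The choices above can be sketched with a minimal regex tokenizer; this is an illustration of the design options, not the behavior of any particular tool, and the function name and flags are assumptions:

```python
import re

def tokenize(text, remove_punct=True, remove_numbers=True):
    """Split text into word tokens; optionally drop punctuation and number tokens."""
    # match runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if remove_punct:
        tokens = [t for t in tokens if not re.fullmatch(r"[^\w\s]", t)]
    if remove_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

print(tokenize("Dr. Smith paid $30 in 2020!"))
# with both flags off, punctuation and number tokens are kept as separate tokens
print(tokenize("Dr. Smith paid $30 in 2020!", remove_punct=False, remove_numbers=False))
```

Whether to remove such tokens depends on the downstream task; for example, numbers may be noise for topic modeling but essential for information extraction.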
Lexical Analysis 2: Tokenization
• Even tokenization can be difficult
✓ Is John’s sick one token or two?
▪ If one → problems in parsing (where is the verb?)
▪ If two → what do we do with John’s house?
Lexical Analysis 3: Morphological Analysis
Witte (2016)
• Stemming
✓ Reduces a word to a stem by stripping suffixes; the result need not be a valid word
• Lemmatization
✓ Reduces a word to its dictionary form (lemma) via morphological analysis
• Stemming vs. Lemmatization
✓ [Comparison table: Word | Stemming | Lemmatization]
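The contrast can be sketched with a toy suffix stripper against a toy lemma dictionary; both the rules and the dictionary below are illustrative assumptions (real systems use e.g. the Porter stemmer and a WordNet-based lemmatizer):

```python
def stem(word):
    """Crude suffix-stripping stemmer: chops common endings, may yield non-words."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary standing in for a full morphological lexicon.
LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

def lemmatize(word):
    """Dictionary lookup returning a valid dictionary form (lemma)."""
    return LEMMAS.get(word, word)

for w in ("studies", "running"):
    print(w, "->", stem(w), "|", lemmatize(w))
# studies -> stud | study   (stem is not a word; lemma is)
# running -> runn | running (lemma falls back when the lexicon has no entry)
```

This illustrates the table's point: stemming is fast but can produce non-words, while lemmatization returns dictionary forms at the cost of needing a lexicon.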
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Witte (2016)
✓ Probabilistic models
▪ Generative sequence models: Find the most probable tag sequence given the sentence
(Hidden Markov Model; HMM)
▪ Discriminative sequence models: Predict whole sequence with a classifier (Conditional
Random Field; CRF)
✓ Tagging Model
▪ fi is a feature
▪ λi is a weight (a large value indicates an informative feature)
▪ Z(C) is a normalization constant ensuring a proper probability distribution
▪ The model makes no independence assumptions about the features
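The equation for the tagging model did not survive extraction; the standard log-linear (maximum entropy) form consistent with the bullets above is:

```latex
P(t \mid C) = \frac{1}{Z(C)} \exp\left( \sum_{i} \lambda_i f_i(C, t) \right),
\qquad
Z(C) = \sum_{t'} \exp\left( \sum_{i} \lambda_i f_i(C, t') \right)
```

Here t is a candidate tag, C is the context, and the sum in Z(C) runs over all possible tags, so the probabilities sum to one.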
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Pointwise Prediction: Maximum Entropy Model
✓ An example
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Probabilistic Model for POS Tagging
✓ Find the most probable tag sequence given the sentence
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model
✓ Decompose the probability using Bayes’ rule
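The decomposition referred to above, written out in standard form (W = word sequence, T = tag sequence), along with the HMM independence assumptions used on the next slide:

```latex
\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_{T} P(W \mid T)\, P(T)
```

```latex
P(W \mid T) \approx \prod_{i} P(w_i \mid t_i),
\qquad
P(T) \approx \prod_{i} P(t_i \mid t_{i-1})
```

P(W) can be dropped from the argmax because it does not depend on the tag sequence T.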
Lexical Analysis 4: Part-of-Speech (POS) Tagging
• Generative Sequence Model: Hidden Markov Model
✓ POS → POS transition probabilities
http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
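The most probable tag sequence under these transition and emission probabilities is found with the Viterbi algorithm; below is a minimal sketch in which the tag set and all probability tables are made-up toy values, not estimates from any corpus:

```python
TAGS = ["N", "V"]
START = {"N": 0.7, "V": 0.3}                 # P(tag at position 1)
TRANS = {"N": {"N": 0.3, "V": 0.7},          # P(next tag | current tag)
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"fish": 0.6, "swim": 0.4},     # P(word | tag)
        "V": {"fish": 0.2, "swim": 0.8}}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # best[t] = (probability of best path ending in tag t, that path)
    best = {t: (START[t] * EMIT[t][words[0]], [t]) for t in TAGS}
    for w in words[1:]:
        best = {
            t: max(
                ((p * TRANS[prev][t] * EMIT[t][w], path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in TAGS
        }
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["fish", "swim"]))  # ['N', 'V']
```

Keeping only the best path into each tag at each position is what makes decoding linear in sentence length instead of exponential in the number of tag sequences.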
Lexical Analysis 4: Part-of-Speech (POS) Tagging
Collobert et al. (2011)
http://eric-yuan.me/ner_1/
Lexical Analysis 5: Named Entity Recognition
Approaches for NER: Dictionary/Rule-based
• List lookup: systems that recognize only entities stored in their lists
✓ Advantages: simple, fast, language-independent, easy to retarget
✓ Disadvantages: lists are costly to collect and maintain, cannot cover name variants, and
cannot resolve ambiguity
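A dictionary/list-lookup recognizer can be sketched as a longest-match scan over per-type entity lists; the lists and function below are hypothetical examples:

```python
# Hypothetical entity lists; a real system would load large gazetteers.
ENTITY_LISTS = {
    "PERSON": {"john smith", "mary"},
    "ORG": {"korea university", "mit"},
}

def lookup_ner(text):
    """Return (surface form, entity type) pairs found via longest-match lookup."""
    words = text.lower().split()
    found, i = [], 0
    while i < len(words):
        match = None
        # try the longest span first so "korea university" beats a shorter match
        for j in range(len(words), i, -1):
            span = " ".join(words[i:j])
            for etype, names in ENTITY_LISTS.items():
                if span in names:
                    match = (span, etype, j)
                    break
            if match:
                break
        if match:
            found.append((match[0], match[1]))
            i = match[2]
        else:
            i += 1
    return found

print(lookup_ner("John Smith studies at Korea University"))
```

Note how brittle this is: a variant such as "J. Smith" or attached punctuation breaks the lookup, which is exactly the name-variant disadvantage listed above.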
• MITIE
✓ An open-source information extraction tool developed by the MIT NLP lab
✓ Available for English and Spanish
✓ Available for C++, Java, R, and Python
• CRF++
✓ NER based on conditional random fields
✓ Supports multi-language models