Lecture 2 NLP
BY
DR. BELAL BADAWY AMIN
Introduction
Structured Data: can be displayed in rows, columns, and a relational database.
Unstructured Data: can't be displayed in rows, columns, and a relational database.
Tokens: the individual occurrences of words in running text.
Types: the distinct words in the vocabulary (each counted once).
Corpora
Words don't appear out of nowhere!
A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Corpora vary along dimensions like
◦ Language: 7097 languages in the world
◦ Variety, like African American Language varieties.
◦ AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
H/E: dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity, socioeconomic status (SES)
Pre-processing text data
Cleaning up the text data is necessary to highlight the attributes that we want our model to pick up on. Cleaning the data typically consists of a number of steps (a minimal sketch follows the list):
1. Remove punctuation
2. Converting text to lowercase
3. Tokenization
4. Remove stop words
5. Lemmatization / stemming
6. Vectorization
7. Feature Engineering
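A minimal sketch of these cleaning steps using only the Python standard library; the stop-word list and the suffix-stripping rules are illustrative stand-ins for what a real pipeline (e.g., NLTK or spaCy) would provide.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of"}

def preprocess(text):
    # 1. Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Convert to lowercase
    text = text.lower()
    # 3. Tokenize (naive space-based splitting)
    tokens = text.split()
    # 4. Remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. Crude stemming: strip a couple of common suffixes
    #    (a stand-in for a real stemmer or lemmatizer)
    tokens = [t[:-3] if t.endswith("ing") else t[:-2] if t.endswith("ed") else t
              for t in tokens]
    return tokens

def vectorize(tokens):
    # 6. Vectorization as a simple bag-of-words count;
    # 7. feature engineering would build further features on top of this.
    return Counter(tokens)

print(vectorize(preprocess("The players are playing football, and the crowd cheered!")))
```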
Tokenization
Space-based tokenization
A very simple way to tokenize
◦ For languages that use space characters between words
◦ Arabic, Cyrillic, Greek, Latin, etc., based writing systems
◦ Segment off a token between instances of spaces
Unix tools for space-based tokenization
◦ The "tr" command
◦ Inspired by Ken Church's UNIX for Poets
◦ Given a text file, output the word tokens and their frequencies
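The classic pipeline is roughly `tr -sc 'A-Za-z' '\n' < input.txt | sort | uniq -c | sort -rn`; below is a rough Python equivalent (the file name input.txt is illustrative).

```python
import re
from collections import Counter

# Read a plain-text corpus (the file name is illustrative).
with open("input.txt", encoding="utf-8") as f:
    text = f.read()

# Like `tr -sc 'A-Za-z' '\n'`: treat maximal runs of letters as tokens.
tokens = re.findall(r"[A-Za-z]+", text)

# Like `sort | uniq -c | sort -rn`: count tokens, most frequent first.
for token, count in Counter(tokens).most_common(20):
    print(count, token)
```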
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
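A hedged sketch of how a rule-based tokenizer can keep such tokens intact rather than stripping punctuation blindly; the patterns below are illustrative and far from a complete tokenizer.

```python
import re

# Alternatives are tried left to right, so more specific patterns come first.
TOKEN_PATTERN = re.compile(r"""
      https?://\S+                     # URLs
    | \#\w+                            # hashtags
    | [\w.+-]+@[\w.-]+\.[A-Za-z]+      # email addresses
    | \$?\d+(?:[.,/:-]\d+)*            # prices and dates like $45.55 or 01/02/06
    | [A-Za-z]+(?:\.[A-Za-z]+)+\.?     # abbreviations like m.p.h., Ph.D.
    | \w+(?:['’]\w+)?                  # words, optionally with a clitic as in we're
    | \S                               # any other single non-space character
""", re.VERBOSE)

print(TOKEN_PATTERN.findall(
    "Email [email protected] about the $45.55 Ph.D. fees on 01/02/06 #nlproc"))
# ['Email', '[email protected]', 'about', 'the', '$45.55',
#  'Ph.D.', 'fees', 'on', '01/02/06', '#nlproc']
```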
Tokenization in languages without spaces
Many languages (like Chinese, Japanese) don't use
spaces to separate words!
姚明进入总决赛
“Yao Ming reaches the finals”
3 words?
姚明 进入 总决赛
YaoMing reaches finals
5 words?
姚 明 进入 总 决赛
Yao Ming reaches overall finals
Word tokenization / segmentation
So in Chinese it's common to just treat each character
(zi) as a token.
• So the segmentation step is very simple
In other languages (like Thai and Japanese), more
complex word segmentation is required.
• The standard algorithms are neural sequence models
trained by supervised machine learning.
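A minimal illustration of the character-as-token approach for Chinese described above:

```python
# Treat each Chinese character (zi) as its own token.
sentence = "姚明进入总决赛"      # "Yao Ming reaches the finals"
tokens = list(sentence)          # iterating a Python string yields characters
print(tokens)                    # ['姚', '明', '进', '入', '总', '决', '赛']
```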
Another option for text tokenization
Instead of
• white-space segmentation
• single-character segmentation
Use the data to tell us how to tokenize.
Subword tokenization is a technique used in natural language processing (NLP) that breaks words down into smaller subwords or pieces.
Ex: “football” might be split into “foot” and “ball”.
Subword tokenization
Three common algorithms:
◦ Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
◦ Unigram language modeling tokenization (Kudo, 2018)
◦ WordPiece (Schuster and Nakajima, 2012)
All have 2 parts:
◦ A token learner that takes a raw training corpus and
induces a vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and
tokenizes it according to that vocabulary
Byte Pair Encoding (BPE) token learner
Let the vocabulary be the set of all individual characters = {A, B, C, D, …, a, b, c, d, …}
Repeat:
◦ Choose the two symbols that are most frequently
adjacent in the training corpus (say 'A', 'B')
◦ Add a new merged symbol 'AB' to the vocabulary
◦ Replace every adjacent 'A' 'B' in the corpus with 'AB'.
Until k merges have been done.
BPE token learner algorithm
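A minimal Python sketch of the learner loop described above; the toy corpus and the number of merges k are illustrative.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every adjacent occurrence of the pair with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, k):
    # Start from individual characters plus the end-of-word symbol '_'.
    vocab = {" ".join(list(w) + ["_"]): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequently adjacent pair
        vocab = merge_pair(best, vocab)    # replace it everywhere in the corpus
        merges.append(best)                # the merged symbol joins the vocabulary
    return merges

# Toy corpus (word -> frequency); the counts are illustrative.
corpus = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
print(learn_bpe(corpus, 8))
```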
Byte Pair Encoding (BPE) Addendum
Most subword algorithms are run inside space-separated tokens.
So we commonly first add a special end-of-word symbol '_' before each space in the training corpus.
Next, separate the text into letters.
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_, etc.
Result:
◦ Test set "n e w e r _" would be tokenized as the full word "newer_"
◦ Test set "l o w e r _" would be two tokens: "low er_"
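A minimal sketch of the matching segmenter: it replays the learned merges greedily, in the order they were learned, on a new test word. The merge list below is illustrative, roughly what the toy corpus in the learner sketch above would produce.

```python
def segment(word, merges):
    """Apply each learned merge, in the order learned (test frequencies play no role)."""
    symbols = list(word) + ["_"]             # characters plus the end-of-word symbol
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge the adjacent pair in place
            else:                            # (recheck the same index after a merge)
                i += 1
    return symbols

# Illustrative merges, roughly what the toy corpus in the learner sketch would yield.
merges = [("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
          ("new", "er_"), ("l", "o"), ("lo", "w")]
print(segment("newer", merges))   # ['newer_']      -- a single full-word token
print(segment("lower", merges))   # ['low', 'er_']  -- two tokens
```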
Properties of BPE tokens
Usually include frequent words
And frequent subwords
• Which are often morphemes like -est or -er
A morpheme is the smallest meaning-bearing unit of a
language
• unlikeliest has 3 morphemes un-, likely, and -est
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Case folding
Applications like IR: reduce all letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail
For sentiment analysis, machine translation (MT), and information extraction
◦ Case is helpful (US versus us is important)
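A small illustration of the trade-off (the query and document strings are made up):

```python
query = "the fed"
doc = "The Fed raised interest rates in the US."

# Case folding, as in IR-style matching: lower-case both sides so "fed" matches "Fed".
print(query.lower() in doc.lower())   # True

# But folding also collapses distinctions such as US vs. us and SAIL vs. sail,
# which sentiment analysis, MT, or information extraction may need to keep.
```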