Introduction to NLP
Ashish Anand
Professor, Dept. of CSE, IIT Guwahati
Associated Faculty, Mehta Family School of Data Science and AI, IIT Guwahati
Day Agenda
• Hands-on Session
• Speaker: Dr Aparajita Dutta
Objective
• What is NLP?
• Daily life NLP Applications
• Generic Formulation
• Getting Started with NLP
• Statistical Language Models
Defining NLP
What do we mean by NLP?
• Target Sentence:
VI: Words, Meaning and Representation
• Similar Words
• Synonyms
• Text Classification
• Single apostrophes
• Contractions such as I’ll, I’m, etc.: should they be treated as one word or two?
• The Penn Treebank splits such contractions.
• Phrases such as dog’s vs. yesterday’s in “The house I rented yesterday’s garden is really big”.
• A word-final single quotation mark (often at the end of a sentence or quoted fragment) is hard to distinguish from the plural possessive, as in “boys’ toys”.
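As a quick illustration of the Penn Treebank convention, here is a minimal Python sketch using NLTK's TreebankWordTokenizer (this assumes the nltk package is installed; the example sentence is made up):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Penn Treebank-style tokenization splits contractions into two tokens
print(tokenizer.tokenize("I'm sure I'll submit the homework."))
# -> ['I', "'m", 'sure', 'I', "'ll", 'submit', 'the', 'homework', '.']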
Defining words: Problems: Spoken Corpora
• This lecture umm is main- mainly divided into two components
Source: SLP-Slides-Chap2
Maximum Matching Word Segmentation Algorithm
• Given a wordlist of Chinese and a string:
1) Start a pointer at the beginning of the string
2) Find the longest word in the dictionary that matches the string starting at the pointer
3) Move the pointer past that word in the string
4) Go to 2
Source: SLP-Slides-Chap2
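A minimal Python sketch of this greedy max-match procedure (the word list and test strings below are illustrative, not from the slides):

def max_match(text, wordlist):
    """Greedy left-to-right longest-match segmentation."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, then progressively shorter ones
        for j in range(len(text), i, -1):
            if text[i:j] in wordlist:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word matches: emit a single character and move on
            words.append(text[i])
            i += 1
    return words

words = {"the", "theta", "table", "bled", "down", "own", "there", "cat", "in", "hat"}
print(max_match("thecatinthehat", words))      # -> ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", words))   # -> ['theta', 'bled', 'own', 'there']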
Max-match segmentation illustration
• Thecatinthehat → the cat in the hat
• Thetabledownthere → intended: the table down there; max-match output: theta bled own there
• Doesn’t generally work in English!
• But works astonishingly well in Chinese
• 莎拉波娃现在居住在美国东南部的佛罗里达。
• 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
• Modern probabilistic segmentation algorithms do even better
Source: SLP-Slides-Chap2
Subword Tokenization: Motivation
• Frequent words should remain single tokens, while rare or unseen words are split into smaller subword units
1. Sennrich et al. Neural Machine Translation of Rare Words with Subword Units. ACL 2016.
2. Schuster and Nakajima. Japanese and Korean Voice Search. ICASSP 2012.
3. Kudo. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018.
4. Kudo and Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. EMNLP 2018 (demo paper).
Byte Pair Encoding
• Idea: iteratively merge the most frequent pair of symbols into a new symbol not present in the data.
BPE for Word Tokenization
• Dictionary (word frequencies; <E> marks the end of a word)
• {'low<E>': 5, 'lower<E>': 2, 'newest<E>': 6, 'widest<E>': 3}
• Initial vocabulary: the individual characters
• {'d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w', '<E>'}
• 1st iteration: {'d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w', '<E>', 'es'} ['e' and 's' occur together 9 times]
• 2nd iteration: {'d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w', '<E>', 'es', 'est'}
• And so on.
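A minimal Python sketch of the BPE merge-learning loop on this toy dictionary (the helper names and the number of merges are illustrative choices):

from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word, replacing the pair with its concatenation."""
    a, b = pair
    merged = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# Words are stored as space-separated symbols, ending with the <E> marker
vocab = {"l o w <E>": 5, "l o w e r <E>": 2, "n e w e s t <E>": 6, "w i d e s t <E>": 3}
merges = []
for _ in range(2):   # learn the first two merges
    best = get_pair_counts(vocab).most_common(1)[0][0]
    merges.append(best)
    vocab = merge_pair(best, vocab)
print(merges)   # -> [('e', 's'), ('es', 't')], i.e. the new symbols 'es' and then 'est'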
BPE Tokenization: Encoding: Test Data Tokenization
• Idea: replay the learned merges on new data, in the order in which they were learned
• Segment each test word into characters
• Apply the first merge rule [in our example, merge 'e' and 's']
• Then the second, and so on…
• Example: newer -> “new” “er_”
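A minimal sketch of the encoding step, replaying the merges learned above on a new word (the function and variable names are illustrative):

def bpe_encode(word, merges):
    """Apply learned merge rules, in order, to a new word."""
    symbols = list(word) + ["<E>"]   # start from characters plus the end-of-word marker
    for a, b in merges:              # replay merges in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(bpe_encode("newest", [("e", "s"), ("es", "t")]))
# -> ['n', 'e', 'w', 'est', '<E>']  (only the merges learned so far get applied)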
Tools to get started with NLP
Source: https://medium.com/microsoftazure/7-amazing-open-source-nlp-tools-to-try-with-notebooks-in-2019-c9eec058d9f1
Defining Language Model
Let's look at some examples
• Predicting next word
• I am planning ……..
• Speech Recognition
• I saw a van vs eyes awe an
Example continued
• Spelling correction
• “Study was conducted by students” vs. “study was conducted be students”
• “Their are two exams for this course” vs. “There are two exams for this course”
• Machine Translation
• I have asked him to do homework
• मैंने उससे पूछा कि होमवर्क करने के लिए [a disfluent candidate translation]
• मैंने उसे होमवर्क करने के लिए कहा [a fluent candidate translation: “I asked him to do the homework”]
In each of these examples, the objective is either
• to find the most probable next word, or
• to assign a probability to an entire sequence of words
• Tool: the chain rule of probability (see below)
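Here, the chain rule is the standard factorization of the joint probability of a word sequence:

P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})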
Estimating the Probability of a Sequence
Estimating P(w1, w2, ..., wn)
• Naive idea: count and divide, i.e., P(w1, w2, ..., wn) ≈ count(w1 w2 ... wn) / (total number of observed sequences)
Estimating P(w1, w2, .., wn)
• Too many possible sentences
• Data sparseness
• Poor generalizability
Markov Assumption
• Approximate the history by the last few words: P(wi | w1, ..., wi-1) ≈ P(wi | wi-N+1, ..., wi-1)
• Bigram case (N = 2): P(wi | w1, ..., wi-1) ≈ P(wi | wi-1)
MLE of N-gram models
• Unigram (simplest model): P(wi) = count(wi) / N, where N is the total number of tokens
• Sparsity issues
• OOV words: can be handled with an <UNK> category
• Words that are present in the corpus, but the relevant n-gram counts are zero
• Such probabilities are underestimated (assigned zero by MLE)
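For the bigram case, the MLE estimate is P(wi | wi-1) = c(wi-1, wi) / c(wi-1). A minimal Python sketch (the corpus and the <UNK> threshold are illustrative assumptions):

from collections import Counter

corpus = "i am planning to study i am planning to travel".split()   # tiny illustrative corpus

# Map rare words (count <= 1) to <UNK> to handle out-of-vocabulary items
word_counts = Counter(corpus)
tokens = [w if word_counts[w] > 1 else "<UNK>" for w in corpus]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """MLE bigram probability: c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("planning", "am"))   # -> 1.0, every 'am' in the corpus is followed by 'planning'
print(p_mle("study", "to"))      # -> 0.0, 'study' was mapped to <UNK>, so this count is zero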
N-gram Model: Issue
• Long-distance dependencies
“The computer which I had just put into the lab on the fifth floor crashed”
Smoothing Techniques
Simplest Approach: Additive Smoothing
• Add-1 (Laplace) smoothing, trigram case:
P(wi | wi-2, wi-1) = [c(wi-2, wi-1, wi) + 1] / [c(wi-2, wi-1) + |V|]
• Generalized version (add-δ):
P(wi | wi-2, wi-1) = [c(wi-2, wi-1, wi) + δ] / [c(wi-2, wi-1) + δ|V|]
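A minimal sketch of the same idea applied to bigrams, with δ as a parameter (the toy counts mirror the earlier MLE sketch and are illustrative):

from collections import Counter

# Toy counts in the same shape as the earlier MLE sketch
unigrams = Counter({"i": 2, "am": 2, "planning": 2, "to": 2, "<UNK>": 2})
bigrams = Counter({("i", "am"): 2, ("am", "planning"): 2, ("planning", "to"): 2,
                   ("to", "<UNK>"): 2, ("<UNK>", "i"): 1})

def p_add_delta(w, prev, delta=1.0):
    """Add-delta smoothed bigram probability: (c(prev, w) + delta) / (c(prev) + delta * |V|)."""
    V = len(unigrams)
    return (bigrams[(prev, w)] + delta) / (unigrams[prev] + delta * V)

print(p_add_delta("i", "to"))          # unseen bigram: 1/7 instead of 0
print(p_add_delta("planning", "am"))   # seen bigram: 3/7, shrunk from the MLE value of 1.0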
Take the help of lower-order models
• Bigram example: c(w1, w2) = 0 = c(w1, w2')
• Add-1 then assigns both bigrams the same probability, even if w2 is far more frequent than w2' overall; lower-order (unigram) estimates can distinguish them
• Discounting models
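One simple way to take the help of lower-order models is linear interpolation of bigram and unigram estimates; this is only one of several options (discounting, back-off), and the counts and lambda weight below are illustrative assumptions:

from collections import Counter

unigrams = Counter({"the": 6, "cat": 3, "sat": 2, "dog": 1})
bigrams = Counter({("the", "cat"): 3, ("cat", "sat"): 2, ("the", "dog"): 1})
N = sum(unigrams.values())   # total number of tokens

def p_interp(w, prev, lam=0.7):
    """Interpolated bigram probability: lam * P_mle(w | prev) + (1 - lam) * P_mle(w)."""
    p_bigram = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[w] / N
    return lam * p_bigram + (1 - lam) * p_unigram

# Both bigrams below have zero counts, but the unigram term now distinguishes them
print(p_interp("cat", "sat"))   # -> ~0.075, 'cat' is a frequent unigram
print(p_interp("dog", "sat"))   # -> ~0.025, 'dog' is a rare unigram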
References
• Jurafsky and Martin, Speech and Language Processing, 3rd Ed. Draft
[Available at https://web.stanford.edu/~jurafsky/slp3/ ]
Thanks!
Question and Comments!
[email protected] https://www.iitg.ac.in/
anand.ashish