
Deep Learning for NLP:

Introduction to NLP

Ashish Anand
Professor, Dept. of CSE, IIT Guwahati
Associated Faculty, Mehta Family School of Data Science and AI, IIT Guwahati
Day Agenda

• Talk 1: Introduction to Natural Language Processing (NLP)


• Speaker: Dr Ashish Anand

• Talk 2: Neural Models for NLP


• Speaker: Dr Ashish Anand

• Hands-on Session
• Speaker: Dr Aparajita Dutta
Objective

• A short and quick introduction to NLP

• Enable the self-motivated learner to get started working with NLP
Outline: Introduction to NLP

• What is NLP?
• Daily life NLP Applications
• Generic Formulation
• Getting Started with NLP
• Statistical Language Models
Defining NLP
What do we mean by NLP?

• Natural Language – Written or Spoken language used by humans.


Example: Assamese, Bengali, Hindi, Sanskrit, English, German, …

• NLP – Computational methods to learn, understand, and generate natural language content

• Multiple distinct fields study human language: Linguistics, Speech Recognition, Computational Linguistics, etc.
Different Levels of NLP
• Word level
• Phonetics and Phonology: study of linguistic sounds
• Morphology: study of the meaningful components of words [example]

• Syntax: structural relationships between words

• Semantics: study of meaning
• Lexical semantics: study of the meanings of individual words
• Compositional semantics: how word meanings combine

• Pragmatics and Discourse: dealing with units larger than a sentence: paragraphs, documents
Daily Life Applications
Application I: Automatic text completion
Application II: Spelling Correction

• Spelling correction: "Study was conducted by students" vs. "study was conducted be students"
Application III: Sentiment Classification

I like this laptop

My new laptop is not good for computationally intensive tasks

Watching a lecture on my new laptop
Application IV: Named Entity Recognition
(NER)
Application V: Machine Translation

• Source Sentence: I have asked him to do homework

• Target Sentence: मैंने उसे होमवर्क करने के लिए कहा [I asked him to do the homework]
VI: Words, Meaning and Representation

• Similar Words

• Synonyms

• Word Sense Disambiguation


• I went to a bank to deposit money
• I went to a bank to see calm water currents
Many more applications …..
Formulation
Generic Formulation
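Putting together the Search and Learning pieces on the following slides, one common way to write this generic formulation, using the scoring function Ψ and parameters θ referred to below (the exact notation is an assumption), is:

$$\hat{y} = \underset{y \in \mathcal{Y}(x)}{\arg\max}\; \Psi(x, y; \theta)$$

where x is the input text, y ranges over the candidate outputs (labels, tag sequences, translations, ...), Ψ scores how well output y fits input x, and θ denotes the model parameters.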
Search

• Computes the argmax of the function Ψ

• Often relies on the machinery of combinatorial optimization, as the outputs are discrete variables

• Ranges from simple search algorithms to dynamic programming and beam search
Learning

• Finding the model parameters θ

• Mostly, again, an optimization problem

• Relies on numerical optimization, as the parameters are often continuous
Basic Formulation

• Word => Vector Representation

• Text Classification

• Sequence Labelling Problems

• Sequence to Sequence Learning Problems


Getting Started with
NLP
Source: Corpus
• Corpus (plural : corpora)
• Special collection of texts collected according to a predefined set of criteria
• May be available pre-processed and linguistically marked up, or in raw format

• Different types of corpora


• Monolingual
• Parallel: bilingual or multilingual [Vary at the alignment level]
• Comparable: bilingual or multilingual
• Learner Corpus
• Diachronic Corpus
Examples of Corpus

Corpus                              Tokens         Types
Switchboard phone conversations     2.4 million    20,000
Shakespeare                         884,000        31,000
Brown                               1 million      38,000
Google N-grams                      1 trillion     13 million

Two ways to talk about words:


1. Tokens: each occurrence of all words is counted
2. Types: number of distinct words
More Examples of Corpora
• Access to multiple corpora through tools like NLTK (see the sketch after this list)
• Building corpora from databases such as PubMed, free text from the web, Wikipedia, social media platforms, etc.
• Task specific
• Shared task challenges: ACE, CoNLL, SemEval, BioASQ, SQuAD, CORD-19

• Caution: One shoe does not fit all.


• Caution: Ethical and Bias Issues
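As a concrete starting point, here is a minimal sketch of accessing a corpus through NLTK and counting tokens vs. types; it assumes the nltk package is installed and that downloading the Brown corpus is permitted.

```python
# Minimal sketch: load the Brown corpus via NLTK and count tokens vs. types.
# Assumes the `nltk` package is installed and corpus download is allowed.
import nltk

nltk.download("brown")                    # one-time download of the corpus
from nltk.corpus import brown

tokens = brown.words()                    # every word occurrence (tokens)
types = set(w.lower() for w in tokens)    # distinct words (types)
print(len(tokens), len(types))
```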
Text Preprocessing
• Removing non-text (e.g. tags, ads)
• Text Normalization
• Segmentation: Word and Sentence Segmentation
• Normalizing Word Formats
• Spelling Variations: Labeled/labelled
• Capitalization: Led/LED
• Lemmatization
• Stemming
• Morphological analysis: dealing with smallest meaning-bearing units
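To illustrate the normalization steps above, a minimal sketch contrasting stemming and lemmatization with NLTK; it assumes nltk and its WordNet data are available, and the word list is purely illustrative.

```python
# Minimal sketch: stemming vs. lemmatization with NLTK.
# Assumes the `nltk` package is installed and WordNet data can be downloaded.
import nltk

nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "labelled", "crashed"]:
    # stemming chops suffixes heuristically; lemmatization maps to a dictionary form
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
```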
Text normalization
Tokenization: Word Segmentation
Definition
• Process of dividing the input text into units, also called tokens, where each unit is a word, a number, or a punctuation mark.
What counts as a word?

I am interested in Natural Language Processing, but I’m not sure of the required prerequisites.
What counts as a word?

• Should I count punctuation as a word?


• Should I treat I’m as one word or break it into three tokens: I, ’, m? [Clitic]
• Should I consider “Natural Language Processing” as one word or three words?
Challenges in defining a word as contiguous alphanumeric characters
• Too restrictive
• Should we consider “$12.20” or “Micro$oft” or “:)” as a word?

• We can expect several variants, especially in forums like Twitter, that may not obey the exact definition but should still be counted as words

• Simple heuristic: Whitespace
• “a space or tab or a new line” between words
• Still leaves several issues to deal with
Some challenges with simple
heuristics
• Periods
• Wash. vs wash
• Abbreviations at the end vs. in the middle – e.g. etc.
• More on this while discussing sentence segmentation

• Single apostrophes
• Contractions such as I’ll, I’m, etc.: should they be treated as one word or two?
• The Penn Treebank splits such contractions.
• Phrases such as dog’s vs. yesterday’s in “The house I rented yesterday’s
garden is really big”.
• Orthographic-word-final single quotation (often comes at the end of
sentence/quoted fragment) and cases like (plural possessive) “boys’ toys”.
Defining words: Problems: Spoken
Corpora
• This lecture umm is main- mainly divided into two components

• Two types of disfluencies


• Fragments: main-
• Fillers/Filled pauses: uh.. Umm..
Some other issues

• Quite a large vocabulary
• Restricting the vocabulary size aggravates the OOV problem
• No implicit notion of similar words
• Each word is given a distinct id
Tokenization in Practice

• Deterministic algorithms based on regular expressions


• Compiled into efficient finite state automata
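A minimal sketch of such a rule-based tokenizer using Python regular expressions; the pattern below is illustrative, not the exact one compiled by any particular toolkit.

```python
# Minimal sketch: a rule-based tokenizer built from a regular expression.
# The pattern is illustrative and intentionally small.
import re

TOKEN_RE = re.compile(r"""
    \$?\d+(?:\.\d+)?      # currency and numbers, e.g. $12.20
  | \w+(?:'\w+)?          # words with an optional internal apostrophe, e.g. I'm
  | [^\w\s]               # any remaining punctuation mark
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("I'm not sure of the required prerequisites, but it costs $12.20."))
```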
Word segmentation in other
languages
• 请将这句话翻译成中文 [Please translate this sentence into Chinese]
• Languages like Chinese, Japanese have no spaces between words
• Japanese is further complicated with multiple alphabets intermingled

• Compound nouns written as a single word


• Lebensversicherungsgesellschaftsangestellter [life insurance company employee]
Word Tokenization in Chinese

• Chinese words are composed of characters


• Characters are generally 1 syllable and 1 morpheme.
• Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
• Maximum Matching (also called Greedy)

Source: SLP-Slides-Chap2
Maximum Matching
Word Segmentation Algorithm
• Given a wordlist of Chinese, and a string.
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that matches the string starting
at pointer
3) Move the pointer over the word in string
4) Go to 2

Source: SLP-Slides-Chap2
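A minimal sketch of the MaxMatch procedure above; the wordlists are toy examples, while real systems use a full dictionary.

```python
# Minimal sketch of MaxMatch (greedy) word segmentation.
def max_match(text: str, wordlist: set[str]) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # try the longest dictionary word starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in wordlist or j == i + 1:  # fall back to a single character
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(max_match("thecatinthehat", {"the", "cat", "in", "hat"}))
# ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", {"the", "theta", "table", "bled", "down", "own", "there"}))
# ['theta', 'bled', 'own', 'there']  -- the greedy failure case shown on the next slide
```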
Max-match segmentation
illustration
• Thecatinthehat → the cat in the hat
• Thetabledownthere → theta bled own there [intended: the table down there]
• Doesn’t generally work in English!
• But works astonishingly well in Chinese
• 莎拉波娃现在居住在美国东南部的佛罗里达。
• 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
• Modern probabilistic segmentation algorithms even better

Source: SLP-Slides-Chap2
Subword Tokenization: Motivation
• Frequent words should be identified as a token

• Rare words should be broken into meaningful subword tokens:

• Unknowingly : “un”, “know”, “ing”, “ly”


• Helps in taking care of OOV, rare and related words

• Reasonable vocabulary size

• To make it language independent


Subword Tokenization: Popular
Methods
• Byte Pair Encoding (BPE)1
• Wordpiece2
• Similar to BPE, except the merging criterion is different
• Unigram3 and Sentencepiece4
• Rely on unigram language model
• Language independent

1. Sennrich et al. 2015. Neural machine translation of rare words with subword units. ACL 2016
2. Schuster and Nakajima. 2012. Japanese and Korean voice search. ICASSP 2012
3. Kudo. 2018. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018
4. Kudo et al. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text
Processing. EMNLP 2018 (demo paper)
Byte Pair Encoding

• Used for data compression in Information theory

• Idea: Iteratively merge the most frequent byte pair into a new byte not present in the data.
BPE for Word Tokenization

• Assumption: the corpus has already been tokenized into words


• Step 1: Count the frequency of each word appearing in the given
corpus.
• Step 2: Append a special token "<E>" to each word, signifying the end of a word.
• Step 3: Break each word into their constituent characters. So a word
"exam" will be converted into a sequence of characters
["e","x","a","m","<E>"].
BPE for Word Tokenization

• Step 4: In each iteration, count the frequency of every consecutive symbol pair and merge the most frequent pair into a single new symbol.

• Step 5: Stop after a fixed number of iterations (i.e. merge operations)


or after obtaining a maximum number of tokens.
BPE Tokenization: Illustration

• Dictionary
• {'low<E>': 5, 'lower<E>': 2, 'newest<E>': 6, 'widest<E>': 3}
• Vocabulary on characters
• {'d','e','i','l','n','o','r','s','t','w','<E>'}
• 1st iteration: {'d','e','i','l','n','o','r','s','t','w','<E>','es'} [e and s occurred together 9 times]
• 2nd iteration: {'d','e','i','l','n','o','r','s','t','w','<E>','es','est'}
• And So on.
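A minimal sketch of learning BPE merges on the toy dictionary above; it is a simplified version in the spirit of Sennrich et al. (2016), and the function names are illustrative.

```python
# Minimal sketch: learning BPE merge rules on a toy dictionary.
from collections import Counter

def get_pair_counts(vocab: dict[tuple, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair: tuple, vocab: dict) -> dict:
    """Replace every occurrence of the chosen pair with the merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# words are pre-split into characters plus the end-of-word marker <E>
vocab = {('l','o','w','<E>'): 5, ('l','o','w','e','r','<E>'): 2,
         ('n','e','w','e','s','t','<E>'): 6, ('w','i','d','e','s','t','<E>'): 3}

for step in range(3):                      # number of merge operations
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print(step + 1, best)                  # 1 ('e','s'), 2 ('es','t'), ...
```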
BPE Tokenization: Encoding: Text
Data Tokenization

• Question: How to tokenize a given sequence of words into


learned tokens?

• Answer
• Idea: Apply the learned merge rules in the order they were learned.
• Segment each test word into characters
• Apply first merge rule [Our example, merge ‘e’ and ‘s’]
• Then second and so on…
• Example: newer -> “new” “er_”
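A minimal sketch of this encoding step, continuing the toy example above: a new word is split into characters plus the end-of-word marker, and the learned merges are applied in order.

```python
# Minimal sketch: encode a new word with a list of learned BPE merges,
# applied in the order they were learned.
def encode(word: str, merges: list[tuple]) -> list[str]:
    symbols = list(word) + ["<E>"]
    for a, b in merges:                       # apply each merge rule in order
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (a, b):
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# first three merges from the toy example above
print(encode("newest", [("e", "s"), ("es", "t"), ("est", "<E>")]))
# ['n', 'e', 'w', 'est<E>']
```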
Tools to get started with NLP

Source: https://medium.com/microsoftazure/7-amazing-open-source-nlp-tools-to-try-with-notebooks-in-2019-c9eec058d9f1
Defining Language Model
Let’s look at some examples
• Predicting next word
• I am planning ……..

• Speech Recognition
• I saw a van vs eyes awe an
Example continued
• Spelling correction
• "Study was conducted by students" vs. "study was conducted be students"
• "Their are two exams for this course" vs. "There are two exams for this course"

• Machine Translation
• I have asked him to do homework
• मैंने उससे पूछा कि होमवर्क करने के लिए [awkward, incomplete translation]
• मैंने उसे होमवर्क करने के लिए कहा [I asked him to do the homework]
In each of the examples, the objective is either

• To find the most probable next word
• To find which sentence is more likely


Language Models (LM)
• Models assigning probabilities to a sequence of words

• P(I saw a van) > P(eyes awe an)

• P(मैंने उससे पूछा कि होमवर्क करने के लिए) < P(मैंने उसे होमवर्क करने के लिए कहा)
Defining LM formally
Estimating Probability of a
sequence
• Our task is to compute
P(I, am, fascinated, with, recent, advances, in, AI)

• Chain Rule
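The chain rule decomposition referred to above takes the standard form:

$$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$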
Estimating P(w1, w2, .., wn)
• Too many possible sentences
• Data sparseness
• Poor generalizability
Markov Assumption
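In its standard form, the Markov assumption approximates the full history by the most recent k words:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})$$

with $k = 1$ giving the bigram model and $k = 2$ the trigram model.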

MLE of N-gram models
• Unigram (Simplest Model)

• Bigram (1st order Markov Model)

• Trigram (2nd order Markov Model)
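The corresponding maximum-likelihood estimates take the standard form, consistent with the count notation used in the smoothing slides later:

$$p_{mle}(w_i) = \frac{c(w_i)}{\sum_{w} c(w)}, \qquad
p_{mle}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}, \qquad
p_{mle}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$$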


Trigram Model in Summary
Problem with MLE

• Works well if the test corpus is very similar to the training corpus, which is generally not the case

• Sparsity issue
• OOV: can be handled with an <UNK> category
• Some words are present in the corpus, but the relevant counts are zero
• Leads to underestimation of such probabilities
N-gram Model: Issue
• Long-distance dependencies

“The computer which I had just put into the lab on the fifth floor
crashed”
Smoothing Techniques
Simplest Approach: Additive
Smoothing
• Add-1 Smoothing

$$p_{add}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) + 1}{c(w_{i-2}, w_{i-1}) + |\mathcal{V}|}$$

• Generalized version (add-$\delta$)

$$p_{add}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) + \delta}{c(w_{i-2}, w_{i-1}) + \delta\,|\mathcal{V}|}$$
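The same idea for bigrams, as a minimal sketch on a toy corpus; the corpus and the δ value are illustrative.

```python
# Minimal sketch: add-delta smoothed bigram estimates on a toy corpus.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_add(w, prev, delta=1.0):
    """Add-delta smoothed estimate of P(w | prev)."""
    return (bigrams[(prev, w)] + delta) / (unigrams[prev] + delta * len(vocab))

print(p_add("cat", "the"))   # seen bigram:   (2 + 1) / (3 + 6)
print(p_add("on", "cat"))    # unseen bigram: (0 + 1) / (2 + 6)
```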
Take the help of lower order models
• Bigram example
• Suppose c(w1, w2) = 0 = c(w1, w2’)

• Then padd (w2 | w1) = padd (w2’ | w1)

• Let’s assume p(w2’) < p(w2)

• We would instead expect padd (w2 | w1) > padd (w2’ | w1)


Take the help of lower order models

• Linear Interpolation Models

• Discounting Models
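For reference, a common form of the linear interpolation model for bigrams, which mixes in the lower-order unigram estimate:

$$p_{interp}(w_i \mid w_{i-1}) = \lambda\, p_{mle}(w_i \mid w_{i-1}) + (1 - \lambda)\, p_{mle}(w_i), \qquad 0 \le \lambda \le 1$$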
References

• Jurafsky and Martin, Speech and Language Processing, 3rd Ed. Draft [Available at https://web.stanford.edu/~jurafsky/slp3/]
Thanks!
Questions and Comments!

[email protected]
https://www.iitg.ac.in/anand.ashish
